“Addressing vision tasks with large foundation models: how far can we go without training?”

Speaker: Dr. Yiming Wang - Deep Visual Learning (DVL) Unit, Fondazione Bruno Kessler (FBK)
  Tuesday, 4 June 2024, 4:30 p.m.

Recent advancements in Vision and Language Models (VLMs) have significantly impacted computer vision research, particularly thanks to their ability to interpret multimodal information within a unified embedding space. Notably, the generalisation capability of VLMs, honed through extensive web-scale pre-training, enables remarkable zero-shot recognition performance. As direct competition in developing such large models is not a viable option for most public institutes due to their limited resources, we have explored new research opportunities following a training-free methodology, leveraging pre-trained models and existing databases that contain rich world knowledge.
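
To illustrate the unified embedding space mentioned above, here is a minimal sketch of zero-shot recognition with a CLIP-style VLM, assuming the Hugging Face `transformers` CLIP API; the image path and candidate labels are placeholder assumptions, not material from the talk:

```python
# Zero-shot recognition with a CLIP-style VLM: the image and candidate
# label prompts are embedded in the same space, and image-text similarity
# serves as a classification score without any task-specific training.
# "example.jpg" and the label set are placeholder assumptions.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("example.jpg")
labels = ["a photo of a dog", "a photo of a cat", "a photo of a car"]

inputs = processor(text=labels, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

# logits_per_image holds scaled image-text cosine similarities.
probs = outputs.logits_per_image.softmax(dim=-1)
print(dict(zip(labels, probs[0].tolist())))
```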

In this talk, I will first present how we exploit VLMs to approach image classification without a pre-defined set of categories (i.e., the vocabulary), a de facto prior in existing zero-shot open-vocabulary recognition. We term this novel task vocabulary-free image classification and propose CaSED, a training-free retrieval-based method that handles the absence of a known vocabulary. I will also demonstrate how such a retrieval-based method can be leveraged to improve recognition in rare domains where visual data or its supervision is limited. Lastly, I will present how VLMs and Large Language Models (LLMs) can be synergised in a training-free manner to advance video understanding, in particular in recognising anomalous patterns in video content.
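
To make the retrieval-based idea concrete, the following is a hedged sketch of a CaSED-style pipeline, a simplification rather than the authors' implementation: the image is matched against a pre-embedded caption database, candidate category names are mined from the retrieved captions, and the candidates are scored against the image, all without training. The toy caption database and the naive word-level candidate extraction are illustrative assumptions.

```python
# Hedged sketch of retrieval-based, vocabulary-free classification
# (simplified CaSED-style pipeline; the toy caption database and the
# word-level candidate extraction are illustrative assumptions).
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# Hypothetical stand-in for a web-scale caption corpus.
caption_db = [
    "a golden retriever playing in the park",
    "a vintage red car parked on the street",
    "a tabby cat sleeping on a sofa",
]
with torch.no_grad():
    db_inputs = processor(text=caption_db, return_tensors="pt", padding=True)
    db_emb = model.get_text_features(**db_inputs)
db_emb = db_emb / db_emb.norm(dim=-1, keepdim=True)

def classify_without_vocabulary(image_path: str, k: int = 3) -> str:
    k = min(k, len(caption_db))
    image = Image.open(image_path)
    with torch.no_grad():
        img_emb = model.get_image_features(**processor(images=image, return_tensors="pt"))
    img_emb = img_emb / img_emb.norm(dim=-1, keepdim=True)

    # 1) Retrieve the k captions closest to the image in the joint space.
    top_idx = (img_emb @ db_emb.T).topk(k).indices[0].tolist()
    # 2) Mine candidate category names from the retrieved captions
    #    (naively, unique words; CaSED applies more careful filtering).
    candidates = sorted({w for i in top_idx for w in caption_db[i].lower().split()})
    # 3) Score candidates against the image, still with no training.
    with torch.no_grad():
        txt_inputs = processor(text=candidates, return_tensors="pt", padding=True)
        txt_emb = model.get_text_features(**txt_inputs)
    txt_emb = txt_emb / txt_emb.norm(dim=-1, keepdim=True)
    return candidates[(img_emb @ txt_emb.T)[0].argmax().item()]

print(classify_without_vocabulary("example.jpg"))  # placeholder image path
```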

Attachment: Flyer, PDF (Italian, 179 KB, published 29/05/24)

Contact
Marco Cristani

Publication date
29 May 2024
