“Addressing vision tasks with large foundation models: how far can we go without training?”

Speaker: Dr. Yiming Wang - Deep Visual Learning (DVL) Unit, Fondazione Bruno Kessler (FBK)
  Tuesday, 4 June 2024, 4:30 p.m.

Recent advancements in Vision and Language Models (VLMs) have significantly impacted computer vision research, particularly thanks to their ability to interpret multimodal information within a unified embedding space. Notably, the generalisation capability of VLMs, honed through extensive web-scale pre-training, enables remarkable zero-shot recognition performance. As direct competition in developing such large models is not a viable option for most public institutes due to their limited resources, we have explored new research opportunities following a training-free methodology, leveraging pre-trained models and existing databases that contain rich world knowledge.
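
To illustrate the unified embedding space mentioned above, here is a minimal sketch of zero-shot recognition with a CLIP-style VLM, assuming the Hugging Face `transformers` CLIP API; the image path and candidate labels are placeholder assumptions, not material from the talk:

```python
# Zero-shot recognition with a CLIP-style VLM: the image and candidate
# label prompts are embedded in the same space, and image-text similarity
# serves as a classification score without any task-specific training.
# "example.jpg" and the label set are placeholder assumptions.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("example.jpg")
labels = ["a photo of a dog", "a photo of a cat", "a photo of a car"]

inputs = processor(text=labels, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

# logits_per_image holds scaled image-text cosine similarities.
probs = outputs.logits_per_image.softmax(dim=-1)
print(dict(zip(labels, probs[0].tolist())))
```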

In this talk, I will first present how we exploit VLMs to approach image classification without a pre-defined set of categories (i.e., the vocabulary), a de facto prior in existing zero-shot open-vocabulary recognition. We term this novel task vocabulary-free image classification and propose CaSED, a training-free retrieval-based method that handles the absence of a known vocabulary. I will also demonstrate how such a retrieval-based method can be leveraged to improve recognition in rare domains where visual data or its supervision is limited. Lastly, I will present how VLMs and Large Language Models (LLMs) can be synergised in a training-free manner to advance video understanding, in particular in recognising anomalous patterns in video content.
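
To make the retrieval-based idea concrete, the following is a hedged sketch of a CaSED-style pipeline, a simplification rather than the authors' implementation: the image is matched against a pre-embedded caption database, candidate category names are mined from the retrieved captions, and the candidates are scored against the image, all without training. The toy caption database and the naive word-level candidate extraction are illustrative assumptions.

```python
# Hedged sketch of retrieval-based, vocabulary-free classification
# (simplified CaSED-style pipeline; the toy caption database and the
# word-level candidate extraction are illustrative assumptions).
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# Hypothetical stand-in for a web-scale caption corpus.
caption_db = [
    "a golden retriever playing in the park",
    "a vintage red car parked on the street",
    "a tabby cat sleeping on a sofa",
]
with torch.no_grad():
    db_inputs = processor(text=caption_db, return_tensors="pt", padding=True)
    db_emb = model.get_text_features(**db_inputs)
db_emb = db_emb / db_emb.norm(dim=-1, keepdim=True)

def classify_without_vocabulary(image_path: str, k: int = 3) -> str:
    k = min(k, len(caption_db))
    image = Image.open(image_path)
    with torch.no_grad():
        img_emb = model.get_image_features(**processor(images=image, return_tensors="pt"))
    img_emb = img_emb / img_emb.norm(dim=-1, keepdim=True)

    # 1) Retrieve the k captions closest to the image in the joint space.
    top_idx = (img_emb @ db_emb.T).topk(k).indices[0].tolist()
    # 2) Mine candidate category names from the retrieved captions
    #    (naively, unique words; CaSED applies more careful filtering).
    candidates = sorted({w for i in top_idx for w in caption_db[i].lower().split()})
    # 3) Score candidates against the image, still with no training.
    with torch.no_grad():
        txt_inputs = processor(text=candidates, return_tensors="pt", padding=True)
        txt_emb = model.get_text_features(**txt_inputs)
    txt_emb = txt_emb / txt_emb.norm(dim=-1, keepdim=True)
    return candidates[(img_emb @ txt_emb.T)[0].argmax().item()]

print(classify_without_vocabulary("example.jpg"))  # placeholder image path
```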

Attachment: Flyer, PDF (Italian, 179 KB, published 29/05/24)

Contact
Marco Cristani

Publication date
29 May 2024
