Recent advances in Vision and Language Models (VLMs) have significantly impacted computer vision research, particularly thanks to their ability to interpret multimodal information within a unified embedding space. Notably, the generalisation capability of VLMs, honed through extensive web-scale pre-training, has enabled remarkable zero-shot recognition performance. As competing directly in the development of such large models is not viable for most public research institutions, given their limited resources, we have explored new research opportunities following a training-free methodology that leverages pre-trained models and existing databases rich in world knowledge.
In this talk, I will first present how we exploit VLMs to approach image classification without a pre-defined set of categories (i.e., the vocabulary), a de facto prior in existing zero-shot open-vocabulary recognition. We term this novel task vocabulary-free image classification and propose CaSED, a training-free retrieval-based method that handles the absence of a known vocabulary. I will also demonstrate how such a retrieval-based method can be leveraged to improve recognition in rare domains where visual data or supervision is scarce. Lastly, I will present how VLMs and Large Language Models (LLMs) can be synergised in a training-free manner to advance video understanding, in particular the recognition of anomalous patterns in video content.
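To make the retrieval-based idea concrete, the sketch below outlines a CaSED-style, training-free pipeline: embed the image with CLIP, retrieve the most similar captions from an external database, and extract candidate category names from them. This is not the authors' code; the model name, the toy in-memory caption list (a stand-in for the web-scale caption database the talk refers to), and the `candidate_categories` helper with its crude word-extraction heuristic are all illustrative assumptions.

```python
# Minimal sketch of a training-free, retrieval-based classifier in the spirit
# of CaSED. Assumes a CLIP model via Hugging Face `transformers` and a toy
# caption list standing in for a large external caption database.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# Toy stand-in for an external caption database (e.g. web-crawled alt-texts).
captions = [
    "a photo of a golden retriever playing in the park",
    "a red sports car parked on the street",
    "a bowl of ramen with egg and pork",
]

def candidate_categories(image: Image.Image, top_k: int = 2) -> list[str]:
    # 1) Embed the image and all captions in CLIP's shared space.
    inputs = processor(text=captions, images=image,
                       return_tensors="pt", padding=True)
    with torch.no_grad():
        out = model(**inputs)
    # 2) Retrieve the captions most similar to the image.
    sims = out.logits_per_image.squeeze(0)      # shape: (num_captions,)
    best = sims.topk(top_k).indices.tolist()
    # 3) Extract candidate category words from the retrieved captions
    #    (a crude heuristic here; CaSED uses a more careful extraction step).
    words = set()
    for i in best:
        words.update(w for w in captions[i].split() if len(w) > 3)
    return sorted(words)

# A final step (omitted) would score each candidate against the image with
# CLIP again and return the best-scoring word as the predicted class name.
```

The key design point this illustrates is that no component is trained: the pre-trained VLM supplies the shared embedding space, and the external database supplies the open-ended vocabulary at inference time.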