Seminars - Department of Engineering for Innovation Medicine (valid from 02.06.2024 to 02.06.2025)
https://www.dimi.univr.it/?ent=seminario&lang=en&rss=0

"Addressing vision tasks with large foundation models: how far can we go without training?"
https://www.dimi.univr.it/?ent=seminario&lang=en&rss=0&id=6328

Speaker: Dr Yiming Wang
Affiliation: Deep Visual Learning (DVL) Unit, Fondazione Bruno Kessler (FBK)
Start date: 2024-06-04
Start time: 16:30
Internal contact: Marco Cristani

Abstract: Recent advancements in Vision and Language Models (VLMs) have significantly impacted computer vision research, particularly thanks to their ability to interpret multimodal information within a unified embedding space. Notably, the generalisation capability of VLMs, honed through extensive web-scale pre-training, has shown remarkable performance in zero-shot recognition. As directly competing in the development of such large models is not viable for most public institutes given their limited resources, we have explored new research opportunities through a training-free methodology that leverages pre-trained models and existing databases rich in world knowledge. In this talk, I will first present how we exploit VLMs to approach image classification without a pre-defined set of categories (i.e., the vocabulary), a de facto prior in existing zero-shot open-vocabulary recognition. We term this novel task vocabulary-free image classification and propose CaSED, a training-free retrieval-based method that handles the absence of a known vocabulary. I will also demonstrate how such a retrieval-based method can be leveraged to improve recognition in rare domains where visual data or its supervision is limited. Lastly, I will present how VLMs and Large Language Models (LLMs) can be synergised in a training-free manner to advance video understanding, in particular recognising anomalous patterns in video content.
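For readers unfamiliar with the approach, the sketch below illustrates the general idea behind a training-free, retrieval-based pipeline in the spirit of CaSED: embed the image with a pre-trained VLM (CLIP here), retrieve semantically close captions from an external database, extract candidate category names from them, and score those candidates against the image. The model choice, the toy caption database, and the naive candidate-extraction heuristic are all illustrative assumptions, not the method's actual implementation.

```python
# A minimal sketch of a CaSED-style, training-free pipeline for
# vocabulary-free image classification. The caption database and the
# candidate-extraction heuristic are illustrative stand-ins only.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# Toy stand-in for a large external caption collection; the real method
# retrieves from web-scale databases rich in world knowledge.
caption_db = [
    "a tabby cat sleeping on a sofa",
    "a golden retriever playing in the park",
    "a red sports car on a mountain road",
]

@torch.no_grad()
def embed_texts(texts):
    inputs = processor(text=texts, return_tensors="pt", padding=True)
    feats = model.get_text_features(**inputs)
    return feats / feats.norm(dim=-1, keepdim=True)

@torch.no_grad()
def embed_image(image):
    inputs = processor(images=image, return_tensors="pt")
    feats = model.get_image_features(**inputs)
    return feats / feats.norm(dim=-1, keepdim=True)

def classify(image):
    img = embed_image(image)          # (1, d) image embedding
    cap = embed_texts(caption_db)     # (n, d) caption embeddings
    # Step 1: retrieve the captions closest to the image in CLIP space.
    sims = (img @ cap.T).squeeze(0)
    top = sims.topk(k=2).indices.tolist()
    # Step 2: extract candidate category names from the retrieved
    # captions (a naive word-length heuristic, purely for illustration).
    candidates = sorted({w for i in top for w in caption_db[i].split()
                         if len(w) > 3})
    # Step 3: score the candidates against the image, again with CLIP,
    # and return the best-matching name as the predicted class.
    cand = embed_texts([f"a photo of a {c}" for c in candidates])
    best = (img @ cand.T).squeeze(0).argmax().item()
    return candidates[best]

print(classify(Image.open("example.jpg")))  # hypothetical input image
```

Note that no step above updates any weights: the pipeline composes a frozen pre-trained VLM with an external database, which is the sense in which the approach is training-free.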