Computer Vision (CV) and Machine Learning (ML) have recently received tremendous attention in both research and industry. Vision is in fact the most ubiquitous sensing modality for autonomous systems in robotics and many other industrial applications. However, modern vision systems are not merely complex perception systems: they should also possess intelligent abilities, i.e., be able to deal with real-world scenarios and understand scenes in the wild. Understanding a scene essentially means figuring out which objects it contains, recognizing people's behaviors and ongoing dynamic events, classifying activities, predicting events, reconstructing the 3D environment, and so on, exploiting the various data sources available, which largely depend on the specific application at hand. Indeed, the demand for solving the scene understanding (SU) problem keeps growing, given the many real-world applications, e.g., autonomous driving and surveillance, to name a few, that can be addressed by efficiently tackling its associated tasks.
Such intelligent capabilities are transversal to many areas and, as such, have a strong impact on several application domains related to robotics, biomedicine, finance, and many others. The remarkable advancement of these research fields was made possible, on the one hand, by the massive amount of visual (and other) data now available to train classifiers for problems that were deemed very hard, if not impossible, to solve just a few years ago; this is in turn due to the increased availability and affordability of sensors. On the other hand, this big data regime has become viable mainly thanks to deep learning methods that, coupled with increasingly effective GPU hardware, make this type of analysis far more manageable than in the past. Moreover, when targeting the deployment of computational systems in the real world, multiple sensory modalities are generally needed to cope effectively and efficiently with the variability of the situations that can arise.
While high performance can be obtained thanks to large-scale labelled datasets, challenging open questions remain for systems that must actually work in the wild: how can computational systems adapt to new environments, scenarios, and tasks, or operate when little or no information is available a priori? In fact, despite the current big data regime, the availability of reliably annotated data is not always guaranteed in the real world, because of the high annotation cost or the inherent difficulty of retrieving data (e.g., complexity of data acquisition, data privacy, ethical issues, etc.). It is therefore especially important nowadays to investigate learning with scarce data, that is, the design of learning models when only (variable amounts of) well-annotated data, data with noisy labels, unlabeled data, imbalanced data, or a mix of these scenarios are available; a minimal example is sketched below.
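As a concrete illustration of one such scenario, scarce labels alongside abundant unlabeled data, the following minimal sketch shows pseudo-labelling, a common semi-supervised recipe in which a classifier trained on the few annotated examples also trains on its own confident predictions for unlabeled samples. The linear model, toy data, and confidence threshold are illustrative placeholders, not a prescription for any specific task.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Toy setup: a linear classifier, a small labelled set, a larger unlabelled set.
torch.manual_seed(0)
num_classes, dim = 3, 16
model = nn.Linear(dim, num_classes)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

x_labelled = torch.randn(30, dim)        # scarce annotated data
y_labelled = torch.randint(0, num_classes, (30,))
x_unlabelled = torch.randn(300, dim)     # abundant unannotated data

confidence_threshold = 0.9               # hypothetical value, tuned per task

for _ in range(20):
    optimizer.zero_grad()
    # Supervised loss on the few labelled examples.
    loss = F.cross_entropy(model(x_labelled), y_labelled)

    # Pseudo-labelling: take the model's own confident predictions
    # on unlabelled data and train on them as if they were true labels.
    with torch.no_grad():
        probs = F.softmax(model(x_unlabelled), dim=1)
        conf, pseudo_y = probs.max(dim=1)
        mask = conf > confidence_threshold
    if mask.any():
        loss = loss + F.cross_entropy(model(x_unlabelled[mask]), pseudo_y[mask])

    loss.backward()
    optimizer.step()
```

The confidence threshold governs the usual trade-off of this family of methods: a high value admits fewer but cleaner pseudo-labels, while a low value risks reinforcing the model's own mistakes.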
In this context, a number of intertwined, long-term scientific topics can be investigated, which pose challenging and still open issues in both theoretical and practical terms. They are:
• Unsupervised and self-supervised learning
• Semi-supervised learning
• Learning in long-tail data distribution scenarios
• Few-shot learning (see the sketch at the end of this section)
• Domain adaptation, domain generalization, and transfer learning
• Multi-modal learning
These topics are not independent domains but rather form a sort of continuum, and they can benefit from one another in solving actual open problems. They constitute crucial aspects of every learning algorithm, and especially of Computer Vision, which aims to inject intelligent capabilities into seeing machines while understanding and explaining the data.
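To make one of the listed topics concrete, the sketch below illustrates a single few-shot learning episode in the style of Prototypical Networks (Snell et al., 2017): each class is represented by a prototype, the mean embedding of its few support examples, and queries are classified by their distance to these prototypes. The embedding network, episode sizes, and random data are toy placeholders.

```python
import torch
import torch.nn.functional as F

def prototypical_episode_loss(embed, support_x, support_y, query_x, query_y, n_way):
    """One N-way few-shot episode: classify queries by distance to
    class prototypes, i.e. the mean embedding of each class's support set."""
    z_support = embed(support_x)         # (n_way * k_shot, d)
    z_query = embed(query_x)             # (n_query, d)

    # Prototype = mean support embedding per class.
    prototypes = torch.stack(
        [z_support[support_y == c].mean(dim=0) for c in range(n_way)]
    )                                    # (n_way, d)

    # Negative squared Euclidean distance serves as the class logits.
    logits = -torch.cdist(z_query, prototypes).pow(2)
    return F.cross_entropy(logits, query_y)

# Toy usage: a 5-way 1-shot episode with a random linear embedding.
torch.manual_seed(0)
embed = torch.nn.Linear(32, 8)
support_x = torch.randn(5, 32)           # one support example per class
support_y = torch.arange(5)
query_x = torch.randn(15, 32)
query_y = torch.randint(0, 5, (15,))
loss = prototypical_episode_loss(embed, support_x, support_y, query_x, query_y, n_way=5)
```

Training over many such randomly sampled episodes encourages the embedding to generalize to novel classes observed with only a handful of examples.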