“Vision language models take interaction to the next level”

How artificial intelligence with new sensory modalities can support humans even more effectively across different tasks and situations

Dr. Voit, AI chatbots have been making waves for over a year now. How important is this as a factor in the Artificial Intelligence and Autonomous Systems business unit?

Michael Voit: ChatGPT and other generative language AIs are based on so-called large language models. Of course, we closely track developments in these models and how they can be used in our application domains. But for us as an institute working in image processing, it’s especially interesting to go one step further and look at vision language models (VLMs), which are AI systems that are not limited to language as a modality: they also process visual input at the same time. So you might think of us as preferring “chatbots with eyes” to purely language-based chatbots. This kind of AI can grasp situations on a more comprehensive basis, much like a human combines various sensory impressions to form a holistic picture. For us, that offers an opportunity to take assistance systems and their interactions with humans to the next level.
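To make the idea of a “chatbot with eyes” concrete, here is a minimal inference sketch using the openly available LLaVA 1.5 model via the Hugging Face transformers library. The model choice, image, and prompt are purely illustrative; the interview does not name any specific VLM used at Fraunhofer IOSB.

```python
# Minimal VLM sketch (illustrative model choice, not necessarily what
# Fraunhofer IOSB uses): the model receives an image and a text
# question together and answers in text.
from PIL import Image
from transformers import AutoProcessor, LlavaForConditionalGeneration

model_id = "llava-hf/llava-1.5-7b-hf"
processor = AutoProcessor.from_pretrained(model_id)
model = LlavaForConditionalGeneration.from_pretrained(model_id)

image = Image.open("scene.jpg")  # any local image
prompt = "USER: <image>\nWhat is happening in this scene? ASSISTANT:"

# Image and text are encoded jointly and answered in a single pass.
inputs = processor(images=image, text=prompt, return_tensors="pt")
output = model.generate(**inputs, max_new_tokens=100)
print(processor.decode(output[0], skip_special_tokens=True))
```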

What are potential applications?

Voit: Our various departments research a wide range of use cases for VLMs. In the e-learning space, we would like to feed images into our semantic search engine for learning content as well in the future. Then you could search along the lines of, “I don’t understand this figure – what learning material could help me?” In the area of military reconnaissance, the goal is to automate image exploitation and the subsequent preparation of reports. And for cars, our goal is to develop voice assistants that also take what they see into account. The assistant could, for instance, tell when a vehicle occupant is struggling with motion sickness and offer tips, instead of waiting to respond until the person says they have an issue.
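The e-learning scenario amounts to image-to-text retrieval. As a sketch of one way such a search could work, a multimodal embedding model like CLIP maps figures and text snippets into a shared vector space, so a figure can serve directly as a query. The model name and toy corpus below are illustrative assumptions, not Fraunhofer IOSB’s actual search engine.

```python
# Sketch of image-as-query semantic search with a CLIP model from the
# sentence-transformers library; the corpus is a toy stand-in for
# indexed learning material.
from PIL import Image
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("clip-ViT-B-32")  # shared image/text space

# Embed the figure the learner doesn't understand ...
query_emb = model.encode(Image.open("confusing_figure.png"))

# ... and the text snippets describing available learning material.
corpus = [
    "Introduction to convolutional filters and feature maps",
    "How gradient descent minimizes a loss function",
    "Basics of relational database normalization",
]
corpus_emb = model.encode(corpus)

# Rank learning material by similarity to the figure.
for hit in util.semantic_search(query_emb, corpus_emb, top_k=2)[0]:
    print(f"{corpus[hit['corpus_id']]}  (score: {hit['score']:.2f})")
```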

What is your approach to obtaining these kinds of VLMs?

Voit: We don’t have the resources to create a big VLM of our own, but existing models can be adapted with relatively little effort and expense via few-shot learning, which requires only a small amount of training data. That’s why we keep track of and test the available models, evaluate them with an eye to our application domains and use cases, and optimize the suitable candidates. Linking existing large models with our own specialized systems is another promising approach. In the automotive segment, for example, we use our Occupant Monitoring System, which is optimized to recognize poses and activities from images, and feed its output into a language model. It is even conceivable that we might add further modalities, supplementing the system with acceleration sensors, for example.
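The linking approach can be pictured as a pipeline in which the perception system’s structured output becomes context for a language model. The Occupant Monitoring System’s real interface is not public, so every field, value, and function name below is a hypothetical stand-in for illustration.

```python
# Hypothetical sketch of coupling a perception module with an LLM:
# structured observations are serialized into the model's prompt.
import json
from dataclasses import dataclass, asdict

@dataclass
class OccupantState:
    """Illustrative stand-in for occupant-monitoring output."""
    seat: str
    activity: str          # e.g. "reading", "looking out of the window"
    posture: str           # e.g. "head down", "leaning sideways"
    gaze_stability: float  # 0.0 = erratic, 1.0 = steady

def build_prompt(states: list[OccupantState]) -> str:
    """Turn perception output into context for an instruction-tuned LLM."""
    observations = json.dumps([asdict(s) for s in states], indent=2)
    return (
        "You are an in-car voice assistant. Based on the observations "
        "below, decide whether an occupant may be getting motion sick "
        "and, if so, proactively offer a helpful tip.\n\n"
        f"Observations:\n{observations}"
    )

prompt = build_prompt([
    OccupantState("front passenger", "reading", "head down", 0.3),
])
print(prompt)  # this string would be sent to the language model
```

In a design like this, adding further modalities, such as the acceleration sensors mentioned above, would simply mean extending the observation structure with more fields.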

 

Dr.-Ing. Michael Voit is the spokesperson for the Artificial Intelligence and Autonomous Systems business unit and head of the Human-AI Interaction (HAI) department.

Digital technologies for productivity, sustainability, and security

The above interview is taken from the 2023/2024 Fraunhofer IOSB progress report.

 

Artificial Intelligence and Autonomous Systems

Learn more about the fields of application and technologies of our business unit Artificial Intelligence and Autonomous Systems.