SIGMM Workshop on Multimodal AI Agents

Author: Alberto Del Bimbo (ACM SIGMM Chair)
Affiliation: University of Firenze (Italy)

The SIGMM Workshop on Multimodal AI Agents was held on October 28th, 2024, at ACMMM24 in Melbourne as an invitation-only event. The initiative was launched by Alberto Del Bimbo, Ramesh Jain, and Alan Smeaton, following a vision of the future in which multimedia expertise converges with the power of large language models, and the belief that there is a great opportunity to position the Multimedia research community at the center of this transformation. The event was structured as three roundtables, inviting some of the most influential figures in the multimedia field to brainstorm on key issues. The goal was to design the future, identifying the multimodal opportunity in the era of powerful large-model systems and preparing an agenda for the coming years for the SIGMM community. We did not want to overlap with current thinking on how multimodality will be incorporated into emerging large models. Instead, the goal was to explore how deep multimodality is essential in building the next stages of AI agents for real-world applications, and how fundamental it is to understanding real-time contexts and guiding the actions of agents. The event received a great response, with over 30 attendees from both academia and industry, representing 13 different countries.

The three roundtables focused on Tech Challenges, Applications, and Industry-University Collaboration. The participants were divided into three groups and assigned to the roundtables according to their profiles and preferences. Rather than preparing specific questions for the roundtables, we outlined key areas of focus for discussion. A brief document, providing a short introduction to each roundtable, summarizing the topic of the debate, and highlighting three major subjects to guide the discussion, was prepared and given to the discussants a few days before the meeting.

In the following, we report a brief synthesis of the discussions at the roundtables, highlighting the principal points of discussion and the proposals that emerged.

Tech Challenges Roundtable

Motivations for the discussion: As large pre-trained models become more prevalent and move towards multimodality, a key issue for their future use arises around the impact of updating and fine-tuning them: how to ensure that improvements in one area do not come at the cost of degradation in others. It is also fundamentally important to understand how deep multimodality is essential for building the next stages of AI agents for real-world applications, as well as for comprehending real-time contexts and guiding the actions of agents on the path towards Artificial General Intelligence.

Some salient sentences, open questions, proposals from the discussion:

  • The interplay between human intelligence and machine intelligence is a fundamental aspect of what should be multimodal. There are not yet sufficiently deep multimodal models, that is, models for information that truly span all, or even a subset of, modalities. We need metrics for this human-machine, human-intelligence machine-intelligence, interaction. We should define a task around how people and machines collaborate productively. We could look at something like dynamic difficulty adjustment, which requires continuous, real-time development or training.
  • Benchmarks are of crucial importance, not just to evaluate one thing against another, but to stretch capabilities. It is not just about passing the benchmark; it is about setting the targets. We should envision a SIGMM-endorsed or SIGMM-sponsored multimodal benchmark, approaching major tech companies to benchmark multimodal activity within and across companies.

Applications Roundtable

Motivations for the discussion: Multimodality is a cornerstone of emerging real-world applications, providing context and situational awareness to systems. Large Multimodal Models are credited with transforming various industries and enabling new applications. Key challenges lie in developing computational approaches for media fusion to construct context and situational understanding, addressing real-time computing costs, and refining model building. It is therefore essential for the SIGMM community to reason about how to build a vibrant community around one or a few key applications.

Some salient sentences, open questions, proposals from the discussion:

There are many application areas where the SIGMM community can provide vital and innovative contributions and should concentrate its applied research. Example application areas, with examples of research, are:

  • Health: there is an absence of open-ended sensory data representing long-term, complex information in the health area. We can think of integrated, federated machine learning, i.e., an integrated, federated data space for data control.
  • Education: we can think of futuristic learning approaches, like completely autonomous learning, namely AI agents that are supportive through observation models, able to adjust the learning level so that some students can finish faster than others and learn through the modalities they prefer. It is also of key importance to consider what the roles of the teacher and of AI are.
  • Productivity: we can think of tools for immersive multimodal experiences, to generate cross-modal content including 3D and podcasting in immersive environments.
  • Entertainment: we should think of how we can improve entertainment through immersive, story-driven experiences.

Industry and University Roundtable

Motivations for the discussion: Research on large AI models is by far dominated by private companies, thanks in part to their access to data and their ability to bear the cost of building and training such models. As a result, academic institutions are being left behind in the AI race. It is therefore urgent to reason about which research directions are viable for universities and to think of new Industry-University collaboration models for multimodal AI research. It is also important to capitalize on the unique advantages of academia: its neutrality and its ability to address long-term social and ethical issues related to technology.

Some salient sentences, open questions, proposals from the discussion:

  • Small and medium enterprises feel that they are left out. These are the companies that come to talk to universities, and this is an opportunity for the SIGMM community to see how we can help. SIGMM could sponsor joint PhD programs, for example addressing small-size multimodal foundation models or intelligent agents, where a company sponsors part of the grant project.
  • SIGMM should promote high-visibility events at ACM Multimedia, like Grand Challenges and Hackathons. As a community, we could sponsor an industry-backed Grand Challenge on multimodal AI and intelligent agents, leveraging industry to contribute more datasets. We could promote a regional-global Hackathon, where hackathons are held and overseen in different regions of the world, and the top teams are then invited to ACM Multimedia to compete in the final.

Based on the discussions at the roundtables, we have identified several concrete actions that could help position the SIGMM research community at the forefront of the multimodal AI transformation:

At the next ACM Multimedia Conference

  • Explicit inclusion of multimodality as a key topic in the next ACM Multimedia call.
  • Multimodal Hackathon on Intelligent Agents (regional-global hackathon).
  • Multimodal Benchmarks (collaborations within and across major tech companies).
  • Multimodal Grand Challenges (in partnership with industry leaders).

At the next ACM SIGMM call for Special projects

  • Special Projects focused on Multimodal AI.

SIGMM is committed to pursuing these initiatives.
