Overview of Open Dataset Sessions and Benchmarking Competitions in 2023-2024 – Part 3 (MediaEval 2023, ImageCLEF 2024)


In this final part of the Overview of Open Dataset Sessions and Benchmarking Competitions, we cover the latest editions of some of the most popular multimedia-centric benchmarking competitions, continuing our reviews from previous years (https://records.sigmm.org/2023/01/19/overview-of-open-dataset-sessions-and-benchmarking-competitions-in-2022-part-3/). This third part of our review focuses on two benchmarking competitions:

  • MediaEval 2023 (https://multimediaeval.github.io/editions/2023/). We present the five benchmarking tasks, which target a wide range of topics, including medical multimedia applications (Medico), multimodal understanding of smells (Musti), multimodal content in news media (NewsImages), social media video memorability (Memorability), and sports action classification (SportsVideo).
  • ImageCLEF 2024 (https://www.imageclef.org/2024). This edition of ImageCLEF targets a wide range of tasks, covering four different medical-focused tasks (medical captions, Visual Question Answering, remote medicine, and GANs in medical scenarios), recommendation systems for editorials, image retrieval and generation, and pictogram generation from textual information.

For an overview of the QoMEX 2023 and QoMEX 2024 conferences, please see the first part of this column (https://records.sigmm.org/2024/09/07/overview-of-open-dataset-sessions-and-benchmarking-competitions-in-2023-2024-part-1-qomex-2023-and-qomex-2024/), while for an overview of the MDRE special sessions at MMM 2023 and MMM 2024, please take a look at the second part of this column (https://records.sigmm.org/2024/11/19/overview-of-open-dataset-sessions-and-benchmarking-competitions-in-2023-2024-part-2-mdre-at-mmm-2023-and-mmm-2024/).

MediaEval 2023

The MediaEval Multimedia Evaluation benchmark (https://multimediaeval.github.io/) offers challenges in artificial intelligence for multimedia data, engaging participants in benchmarking tasks centered around the retrieval, classification, generation, analysis, and exploration of multimodal data. The latest editions of MediaEval also aim to delve deeper into understanding the data, trends, and system performance by proposing a set of Quest for Insight (Q4I) questions and themes for each task. A column signed by the Coordination Committee of the latest MediaEval edition, outlining MediaEval’s history, impressions from the latest edition, and plans for the future, is published in the October 2024 edition of our records (https://records.sigmm.org/2024/11/15/one-benchmarking-cycle-wraps-up-and-the-next-ramps-up-news-from-the-mediaeval-multimedia-benchmark/). MediaEval 2023 (https://multimediaeval.github.io/editions/2023/) was held on 1-2 February 2024, co-located with MMM 2024 in Amsterdam, Netherlands, and the Coordination Committee was composed of Mihai Gabriel Constantin (University Politehnica of Bucharest, Romania), Steven Hicks (SimulaMet, Norway), and Martha Larson (Radboud University, Netherlands) as the main coordinator.

Medical Multimedia Task – Transparent Tracking of Spermatozoa
Paper available at: https://ceur-ws.org/Vol-3658/paper1.pdf
Vajira Thambawita, Andrea Storås, Tuan-Luc Huynh, Hai-Dang Nguyen, Minh-Triet Tran, Trung-Nghia Le, Pål Halvorsen, Michael Riegler, Steven Hicks, Thien-Phuc Tran
SimulaMet, Norway, OsloMet, Norway, University of Science, VNU-HCM, Vietnam, Vietnam National University, Ho Chi Minh City, Vietnam
Dataset available at: https://multimediaeval.github.io/editions/2023/tasks/medico/

The Medico task provides a set of spermatozoa videos, tracked with a set of frame-by-frame bounding box annotations, tasking participants with the prediction of standard sperm quality assessment measurements, specifically the motility (movement) of spermatozoa (living sperm cells).

Musti: Multimodal Understanding of Smells in Texts and Images
Paper available at: https://ceur-ws.org/Vol-3658/paper34.pdf
Ali Hürriyetoğlu, Inna Novalija, Mathias Zinnen, Vincent Christlein, Pasquale Lisena, Stefano Menini, Marieke van Erp, Raphael Troncy
KNAW Humanities Cluster, DHLab, Jožef Stefan Institute, Slovenia, Pattern Recognition Lab, Friedrich-Alexander-Universität Erlangen-Nürnberg, EURECOM, Sophia Antipolis, France, Fondazione Bruno Kessler, Trento, Italy
Dataset available at: https://multimediaeval.github.io/editions/2023/tasks/musti/

Musti is an innovative task, seeking to understand the descriptions and depictions of smells in multilingual texts (English, German, Italian, French, Slovenian) and images from the 17th to the 20th century. Participants must create systems that recognize references to smells in texts and images, connecting these references across different modalities.

NewsImages: Connecting Text and Images
Paper available at: https://ceur-ws.org/Vol-3658/paper4.pdf
Andreas Lommatzsch, Benjamin Kille, Özlem Özgöbek, Mehdi Elahi, Duc Tien Dang Nguyen
Technische Universität Berlin, Berlin, Germany, Norwegian University of Science and Technology, Trondheim, Norway, University of Bergen, Bergen, Norway.
Dataset available at: https://multimediaeval.github.io/editions/2023/tasks/newsimages/

In this edition of the NewsImages task, participants are encouraged to discover patterns and models that describe the relation between the images of news articles, the body of the articles, and their headlines.

Predicting Video Memorability
Paper available at: https://ceur-ws.org/Vol-3658/paper2.pdf
Mihai Gabriel Constantin, Claire-Hélène Demarty, Camilo Fosco, Alba García Seco de Herrera, Sebastian Halder, Graham Healy, Bogdan Ionescu, Ana Matran-Fernandez, Rukiye Savran Kiziltepe, Alan F. Smeaton, Lorin Sweeney
University Politehnica of Bucharest, Romania, InterDigital, France, Massachusetts Institute of Technology Cambridge, USA, University of Essex, UK, Dublin City University, Ireland, Karadeniz Technical University, Turkey
Dataset available at: https://multimediaeval.github.io/editions/2023/tasks/memorability/

The organizers propose a dataset that studies the long-term memorability of social media-like videos, providing participants with an extensive data set of videos with memorability annotations, related information, pre-extracted state-of-the-art visual features, and Electroencephalography (EEG) recordings.

SportsVideo: Fine Grained Action Classification and Position Detection in Table Tennis and Swimming Videos
Paper available at: https://ceur-ws.org/Vol-3658/paper3.pdf
Aymeric Erades, Pierre-Etienne Martin, Romain Vuillemot, Boris Mansencal, Renaud Peteri, Julien Morlier, Stefan Duffner, Jenny Benois-Pineau
Ecole Centrale de Lyon, LIRIS, France, CCP Department, Max Planck Institute for Evolutionary Anthropology, Leipzig, Germany, University of Bordeaux, Labri, France, INSA Lyon, LIRIS, France
Dataset available at: https://multimediaeval.github.io/editions/2023/tasks/sportsvideo/

The organizers developed a set of six sub-tasks covering table tennis and swimming, related to athlete position detection, stroke detection, the classification of motions, field or table registration, sound detection in sports, and scores and result extraction from visual cues.

ImageCLEF 2024

ImageCLEF (https://www.imageclef.org/) is part of the popular CLEF initiative (https://www.clef-initiative.eu/), and states as its main goal the evaluation of technologies for annotation, indexing, classification, and retrieval of multimodal data. The 2024 edition of ImageCLEF (https://www.imageclef.org/2024) was organized between 9-12 September 2024 in Grenoble, France, with an Organization Committee composed of Bogdan Ionescu, Henning Müller, Ana-Maria Drăgulinescu, Ivan Eggel, and Liviu-Daniel Ștefan.

ImageCLEFmedical Caption
Paper available at: https://ceur-ws.org/Vol-3740/paper-132.pdf
Johannes Rückert, Asma Ben Abacha, Alba G. Seco de Herrera, Louise Bloch, Raphael Brüngel, Ahmad Idrissi-Yaghir, Henning Schäfer, Benjamin Bracke, Hendrik Damm, Tabea M. G. Pakull, Cynthia Sabrina Schmidt, Henning Müller, Christoph M. Friedrich
Department of Computer Science, University of Applied Sciences and Arts Dortmund, Dortmund, Germany, Microsoft, Redmond, Washington, USA, University of Essex, UK, UNED, Spain, Institute for Medical Informatics, Biometry and Epidemiology (IMIBE), University Hospital Essen, Germany, Institute for Artificial Intelligence in Medicine (IKIM), University Hospital Essen, Germany, Institute for Transfusion Medicine, University Hospital Essen, Essen, Germany, University of Applied Sciences Western Switzerland (HES-SO), Switzerland, University of Geneva, Switzerland
Dataset available at: https://www.imageclef.org/2024/medical/caption

The medical caption task focuses on evaluating models that detect medical concepts and automatically create captions for medical images, which can be further applied for context-based image and information retrieval purposes.

ImageCLEFmed VQA
Paper available at: https://ceur-ws.org/Vol-3740/paper-131.pdf
Steven Hicks, Andrea Storås, Pål Halvorsen, Michael Riegler, Vajira Thambawita
SimulaMet, Oslo, Norway, OsloMet – Oslo Metropolitan University, Oslo, Norway
Dataset available at: https://www.imageclef.org/2024/medical/vqa

This edition of the medical VQA task focuses on images of the gastrointestinal tract, tasking participants with directing the power of artificial intelligence to generate medical images based on text input, while also looking at optimal prompts for off-the-shelf generative models, thus augmenting the datasets associated with the previous edition of this task.

ImageCLEFmed MEDIQA-MAGIC
Paper available at: https://ceur-ws.org/Vol-3740/paper-133.pdf
Wen-Wai Yim, Asma Ben Abacha, Yujuan Fu, Zhaoyi Sun, Meliha Yetisgen, Fei Xia
Microsoft Health AI, Redmond, USA, University of Washington, Seattle, USA.
Dataset available at: https://www.imageclef.org/2024/medical/mediqa

The MEDIQA task focuses on the problem of Multimodal And Generative TelemedICine (MAGIC) in the area of dermatology. Participants must develop systems that can take queries, text, clinical context, and images as input and generate appropriate medical textual responses to this input in a telemedicine setting.

ImageCLEFmed GANs
Paper available at: https://ceur-ws.org/Vol-3740/paper-130.pdf
Alexandra-Georgiana Andrei, Ahmedkhan Radzhabov, Dzmitry Karpenka, Yuri Prokopchuk, Vassili Kovalev, Bogdan Ionescu, Henning Müller
AI Multimedia Lab, National University of Science and Technology Politehnica Bucharest, Romania, Belarusian Academy of Sciences, Minsk, Belarus, University of Applied Sciences Western Switzerland (HES-SO), Sierre, Switzerland.
Dataset available at: https://www.imageclef.org/2024/medical/gans

This task addresses the challenges of privacy preservation in artificially generated medical images, looking for “fingerprints” of the original real-world training images in a set of artificially generated images, fingerprints that may break patient privacy when exposed in unwanted or unforeseen circumstances.

ImageCLEFrecommending
Alexandru Stan, George Ioannidis, Bogdan Ionescu, Hugo Manguinhas
IN2 Digital Innovations, Germany, Politehnica University of Bucharest, Romania, Europeana Foundation, Netherlands
Dataset available at: https://www.imageclef.org/2024/recommending

This task identifies traditional multimedia search methods as a performance bottleneck and proposes the development of recommendation methods and systems applied to blog posts, editorials, and galleries, targeting data related to cultural heritage organizations and collections.

Image Retrieval for Arguments (part of Touché at CLEF)
Paper available at: https://ceur-ws.org/Vol-3740/paper-322.pdf
Johannes Kiesel, Çağrı Çöltekin, Maximilian Heinrich, Maik Fröbe, Milad Alshomary, Bertrand De Longueville, Tomaž Erjavec, Nicolas Handke, Matyáš Kopp, Nikola Ljubešić, Katja Meden, Nailia Mirzakhmedova, Vaidas Morkevičius, Theresa Reitis-Münstermann, Mario Scharfbillig, Nicolas Stefanovitch, Henning Wachsmuth, Martin Potthast, Benno Stein
Bauhaus-Universität Weimar, University of Tübingen, Friedrich-Schiller-Universität Jena, Leibniz University Hannover, European Commission, Joint Research Centre (JRC), Jožef Stefan Institute, Leipzig University, Charles University, Kaunas University of Technology, Arcadia Sistemi Informativi Territoriali, University of Kassel, hessian.AI, and ScaDS.AI
Dataset available at: https://www.imageclef.org/2024/image-retrieval-for-arguments

The goal for this task is the retrieval of images and data that can increase the persuasiveness of an argument, building upon the datasets of topics developed in previous editions of the Touché task.

ImageCLEF ToPicto
Cécile Macaire, Benjamin Lecouteux, Didier Schwab, Emmanuelle Esperança-Rodier
Université Grenoble Alpes, LIG, France
Dataset available at: https://www.imageclef.org/2023/topicto

Targeting the alleviation of symptoms related to diseases that cause language impairment, the ToPicto task proposes the development of automated systems that translate text or speech into visual pictograms, which can then be used as communication aids and tools.

Challenges in Experiencing Realistic Immersive Telepresence


Immersive imaging technologies offer a transformative way to experience interaction with remote environments, i.e., telepresence. By leveraging advancements in light field imaging, omnidirectional cameras, and head-mounted displays, these systems enable realistic, real-time visual experiences that can revolutionize how we interact with the remote scene in fields such as healthcare, education, remote collaboration, and entertainment. However, the field faces significant technical and experiential challenges, including efficient data capture and compression, real-time rendering, and quality of experience (QoE) assessment. Expanding on the findings of the authors’ recent publication and situating them within a broader theoretical framework, this article provides an integrated overview of immersive telepresence technologies, focusing on their technological foundations, applications, and the challenges that must be addressed to advance this field.

1. Redefining Telepresence Through Immersive Imaging

Telepresence is defined as the “sense of being physically present at a remote location through interaction with the system’s human interface” [Minsky1980]. Such virtual presence is made possible by digital imaging systems and real-time communication of visuals and interaction signals. Immersive imaging systems such as light fields and omnidirectional imaging enhance the visual sense of presence, i.e., of “being there” [IJsselsteijn2000], with photorealistic recreation of the remote scene. This emerging field has seen rapid growth, both in research and development [Valenzise2022], due to advancements in imaging and display technologies, combined with increasing demand for interactive and immersive experiences. Figure 1 provides a side-by-side visualization of a telepresence system that uses traditional cameras and controls and an immersive telepresence system.

Figure 1 – A side-by-side visualization of a traditional telepresence system (left) and an immersive telepresence system (right).

The experience of “presence” consists of three components according to Schubert et al. [Schubert2001], which are renamed in this article to take into account other definitions:

  1. Realness – “Realness” [Schubert2001] or “realism” [Takatalo2008] of the environment (i.e., in this case, the remote scene) relates to the “believability, the fidelity and validity of sensory features within the generated environments, e.g., photorealism.” [Perkis2020].
  2. Immersion – The user’s level of “involvement” [Schubert2001] and “concentration to the virtual environment instead of real world, loss of time” [Takatalo2008]; “the combination of sensory cues with symbolic cues essential for user emplacement and engagement” [Perkis2020].
  3. Spatiality – An attribute of the environment that helps “transport” the user and induce spatial awareness [Schubert2001], which allows “spatial presence” [Takatalo2008] and “the possibility for users to move freely and discover the world offered” [Perkis2020].

Immersion can happen without having realness or spatiality, for example, while we are reading a novel. Telepresence using traditional imaging systems might not be immersive in case of a relatively small display and other distractors present in the visual field. Realistic immersive telepresence necessitates higher degrees of freedom (e.g., 3DoF+ or 6DoF) compared to a telepresence application with a traditional display. In this context, new view synthesis methods and spherical light field representations (cf. Section 3) will be crucial in giving correct depth cues and depth perception – which will increase realness and spatiality tremendously.

The rapid progress of immersive imaging technologies and their adoption can largely be attributed to advancements in processing and display systems, including light field displays and extended reality (XR) headsets. These XR headsets are becoming increasingly affordable while delivering excellent user experiences [Jackson2023], paving the way for the widespread adoption of immersive communication and telepresence applications in the near future. To further accelerate this transition, extensive efforts are being undertaken in both academia and industry.

The visual realism (i.e., realness) in realistic immersive telepresence relies on acquired photos rather than computer-generated imagery (CGI). In healthcare, it enables realistic remote consultations and surgical collaborations [Wisotzky2025]. In education and training, it facilitates immersive, location-independent learning environments [Kachach2021]. Similarly, visual realism can enhance remote collaboration by creating lifelike meeting spaces, while in media and entertainment, it can provide unprecedented realism for live events and performances, offering users a closer connection and a feeling of being present at the remote site.

This article provides a brief overview of the technological foundations, applications, and challenges in immersive telepresence. The novel contribution of this article is setting up the theoretical framework for realistic immersive telepresence informed by prior literature and positioning the findings of the author’s recent publication [Zerman2024] within this broader theoretical framework. It explores how foundational technologies like light field imaging and real-time rendering drive the field forward, while also identifying critical obstacles, such as dataset availability, compression efficiency, and QoE evaluation.

2. Technological Foundations for Immersive Telepresence

A realistic immersive telepresence can be made possible by enabling its main defining factors of realness (e.g., photorealism), immersion, and spatiality. Although these factors can be satisfied with other modalities (e.g., spatial audio), this article focuses on the visual modality and visual recreation of the remote scene.

2.1 Immersive Imaging Modalities

Immersive imaging technologies encompass a wide range of methods aimed at capturing and recreating realistic visual and spatial experiences. These include light fields, omnidirectional images, volumetric videos using either point clouds or 3D meshes, holography, multi-view stereo imaging, neural radiance fields, Gaussian splats, and other extended reality (XR) applications — all of which contribute to recreating highly realistic and interactive representations of scenes and environments.

Light fields (LF) are vector fields of all the light rays passing through a given region in space, describing the intensity and direction of light at every point. This is fully described through the plenoptic function [Adelson1991] as follows: P(x,y,z,θ,ϕ,λ,t), where x, y, and z describe the 3D position of sampling, θ and ϕ are the angular direction, λ is the wavelength of the light ray, and t is time. Traditionally, LFs are represented using the two-plane parametrization [Levoy1996] with 2 spatial dimensions and 2 angular dimensions; however, this parametrization limits the use case of LFs to processing planar visual stimuli. The plenoptic function can be leveraged beyond the two-plane parameterization for a highly detailed view reconstruction or view synthesis. Newer capture scenarios and representations enable increased immersion with LFs [Overbeck2018],[Broxton2020], which can be further advanced in the future.
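To make the two-plane (four-dimensional) parametrization more concrete, the hedged sketch below treats a light field as an array L(u, v, s, t) with two angular and two spatial dimensions, extracts a single sub-aperture view, and performs simple shift-and-sum refocusing. The array shape, the random placeholder data, and the function names are illustrative assumptions, not the representation of any dataset mentioned in this article.

```python
import numpy as np

# Two-plane light field parametrization L(u, v, s, t):
# (u, v) index the angular (viewpoint) plane, (s, t) the spatial (image) plane.
U, V, S, T = 9, 9, 256, 256          # assumed 9x9 viewpoints, 256x256 pixels each
lf = np.random.rand(U, V, S, T)      # placeholder light field data

def sub_aperture(lf, u, v):
    """Return the single view seen from angular position (u, v)."""
    return lf[u, v]

def refocus(lf, slope):
    """Synthetic refocusing by shifting each view proportionally to its
    angular offset from the centre and averaging (shift-and-sum)."""
    U, V, S, T = lf.shape
    cu, cv = (U - 1) / 2, (V - 1) / 2
    out = np.zeros((S, T))
    for u in range(U):
        for v in range(V):
            du = int(round(slope * (u - cu)))
            dv = int(round(slope * (v - cv)))
            out += np.roll(lf[u, v], shift=(du, dv), axis=(0, 1))
    return out / (U * V)

center_view = sub_aperture(lf, 4, 4)
refocused = refocus(lf, slope=1.0)   # larger |slope| focuses on a different depth plane
```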

Omnidirectional image (or video) representation can provide an all-encompassing 360-degree view of a scene from a point in space for immersive visualization [Yagi1999], [Maugey2023]. This is made possible by stitching multiple views together. The created spherical image can be stored using traditional image formats (i.e., 2D planar formats) by projecting the sphere to planar format (e.g., equirectangular projection, cubemap projection, and others); however, processing these special representations without proper consideration for their spherical nature results in errors or biases.
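As a small, hedged illustration of this spherical nature, the sketch below maps a 3D viewing direction to pixel coordinates in an equirectangular image and back. The axis orientation and image origin are assumed conventions; different tools and standards use different ones.

```python
import numpy as np

def direction_to_equirect(d, width, height):
    """Map a unit direction vector (x, y, z) to (column, row) coordinates
    in an equirectangular image of size width x height (assumed conventions)."""
    x, y, z = d / np.linalg.norm(d)
    lon = np.arctan2(x, z)            # longitude in [-pi, pi]
    lat = np.arcsin(y)                # latitude in [-pi/2, pi/2]
    col = (lon / (2 * np.pi) + 0.5) * width
    row = (0.5 - lat / np.pi) * height
    return col, row

def equirect_to_direction(col, row, width, height):
    """Inverse mapping: pixel coordinates back to a unit direction vector."""
    lon = (col / width - 0.5) * 2 * np.pi
    lat = (0.5 - row / height) * np.pi
    return np.array([np.cos(lat) * np.sin(lon),
                     np.sin(lat),
                     np.cos(lat) * np.cos(lon)])

# The forward direction (+z) lands at the image centre under these conventions.
print(direction_to_equirect(np.array([0.0, 0.0, 1.0]), 4096, 2048))  # ~ (2048, 1024)
```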

2.2 Processing Requirements for Realistic Immersive Telepresence

Immersive telepresence relies on capturing, transmitting, and rendering realistic representations of remote environments. “Capturing” can be considered an inherent part of the imaging modalities discussed in the previous section. For transmitting and rendering, there are different requirements to take into account.

Compression is an important step for telepresence, which relies heavily on real-time transmission of the visual data from the remote scene. The importance of compression increases even more for immersive telepresence applications, as immersive imaging modalities capture (and represent) more information and need even more compression compared to telepresence using traditional 2D imaging systems. Compression of LFs [Stepanov2023], omnidirectional images and video [Croci2020], and other forms of immersive video such as MPEG Immersive Video [Boyce2021], volumetric 3D representations with point clouds [Graziosi2020], and textured 3D meshes [Marvie2022] has been a very active research topic within the last decade, which led to the standardization of compression methods for some immersive imaging modalities.

Rendering [Eisert2023], [Maugey2023] is yet another important aspect, especially for LFs [Overbeck2018]. The LF data needs to be rendered correctly for the position of the viewer (i.e., interpolated or extrapolated views need to be rendered) to provide a realistic and immersive experience to the user. Without such view rendering, the final displayed visuals will appear jittery, which makes it harder to sustain the “suspension of disbelief” necessary for an immersive experience. Furthermore, this rendering has to be real-time, as this is a requirement for telepresence. Although technologies such as GPU acceleration and advanced compression algorithms ensure seamless interaction while minimizing latency, ensuring the quality and realness of the rendered remote scene remains an open problem.
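As a hedged sketch of why rendering must track the viewer continuously, the code below implements the simplest possible viewpoint-dependent rendering: linearly blending the two captured views nearest to the viewer along a one-dimensional camera baseline. Real systems interpolate geometry-aware views rather than cross-fading images; the camera positions, image shapes, and data here are placeholder assumptions.

```python
import numpy as np

cam_positions = np.linspace(0.0, 1.0, 9)   # assumed: 9 cameras along a 1 m baseline
views = np.random.rand(9, 480, 640, 3)     # placeholder: one RGB image per camera

def render_for_viewer(viewer_x):
    """Blend the two captured views that bracket the viewer's position."""
    x = np.clip(viewer_x, cam_positions[0], cam_positions[-1])
    i = int(np.searchsorted(cam_positions, x, side="right")) - 1
    i = min(max(i, 0), len(cam_positions) - 2)
    w = (x - cam_positions[i]) / (cam_positions[i + 1] - cam_positions[i])
    return (1.0 - w) * views[i] + w * views[i + 1]

# Called once per frame as the tracked viewer moves; jitter appears if this lags.
frame = render_for_viewer(viewer_x=0.37)
```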

Immersive telepresence systems rely on specialized hardware, including omnidirectional cameras, head-mounted displays, and motion tracking systems. These components must work in harmony to deliver high-quality, immersive experiences. The decreasing prices and increasing availability of such specialized devices make them easier to deploy in industrial settings [Jackson2023] regardless of business size and enable the democratization of immersive imaging applications in a broader sense.

3. Efforts in Creating a Realistic Immersive Telepresence Experience

Creating an immersive telepresence system has been the topic of many scholarly studies. These include frameworks for group-to-group telepresence [Beck2013], capture and delivery frameworks for volumetric 3D models [Fechteler2013], and various other social XR applications [Cortés2024]. Google’s Project Starline can also be mentioned here, as it includes realness and immersion in its delivery of the visuals, creating an immersive experience [Lawrence2024], [Starline2025], although its main functionality is interpersonal video communication. In supporting realness, LFs [Broxton2020] and other types of neural representations [Suhail2022] can create views that support reflections and similar non-Lambertian light-material interactions occurring in the remote scene, whereas reconstructed 3D objects are usually textured under the assumption of Lambertian materials [Zhi2020].

Light field reconstruction [Gond2023] and new view synthesis from a single view [Lin2023] or sparse views [Chibane2021] can be a valid way to approach creating realistic immersive telepresence experiences. Various representations can be used to recreate views that support the movement of the user and the spatial awareness factor of presence in the remote scene. These representations can be Multi-Planar Images (MPI) [Srinivasan2019], Multi-Cylinder Images (MCI) [Waidhofer2022], layered mesh representations [Broxton2020], and neural representations [Chibane2021], [Lin2023], [Gond2023] – which rely on structured or unstructured 2D image captures of the remote scene.
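As a hedged sketch of how a layered representation can be turned into a displayed image, the following code alpha-composites the RGBA layers of a Multi-Planar Image back to front. The layer count, the random placeholder data, and the omission of the per-view homography warping needed for actual novel views are simplifying assumptions.

```python
import numpy as np

def composite_mpi(rgba_layers):
    """rgba_layers: array of shape (D, H, W, 4), ordered from far to near.
    Returns the composited (H, W, 3) image using the 'over' operator."""
    out = np.zeros(rgba_layers.shape[1:3] + (3,))
    for layer in rgba_layers:                  # iterate far to near
        rgb, alpha = layer[..., :3], layer[..., 3:4]
        out = alpha * rgb + (1.0 - alpha) * out
    return out

mpi = np.random.rand(32, 240, 320, 4)          # 32 placeholder fronto-parallel layers
image = composite_mpi(mpi)
```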

Another way of creating a realistic immersive experience can be by combining different imaging modalities – i.e., omnidirectional content and light fields – in the form of spherical light fields (SLFs). SLFs then enable rendering and view synthesis that can generate more realistic and immersive content. There have been various attempts to create SLFs by collecting linear captures vertically [Krolla2014], capturing omnidirectional content from the scene with multiple cameras [Maugey2019], and moving a single camera in a circular trajectory and utilizing deep neural networks to generate an image grid [Lo2023]. Nevertheless, these works either did not yield publicly available datasets or did not have precise localization of the cameras. To address this, the Spherical Light Field Database (SLFDB) was introduced in previous work [Zerman2024], which provides a foundational dataset for testing and developing realistic immersive telepresence applications.

4. Challenges and Limitations

Studies on creating realistic immersive telepresence environments have shown that there are still certain challenges and limitations that need to be addressed to improve QoE and the immersive media experience (IMEx) for these systems. These challenges include dataset availability, compression of structured and unstructured LFs, new view synthesis and rendering, and QoE estimation. Most of these challenges are also discussed in our recent study [Zerman2024].

Figure 2 – A set of captures highlighting the effects of dynamically changing scene: lighting change and its effect on white balance (top) and dynamic capture environment, where people appear and disappear (bottom).

Datasets relevant to realistic immersive telepresence tasks, such as the SLFDB [Zerman2024], are crucial for developing and validating immersive telepresence technologies. However, creating and using such datasets, with precise spatial and angular resolution and very precise positioning of the camera, faces significant hurdles. Traditional camera grid setups are ineffective for capturing spherical light fields due to occlusions. This challenge necessitates static scenes and meticulous camera positioning for a consistent capture of the scene. A dynamic scene brings a risk of inconsistent views within the same light field, as shown in Figure 2, which is non-ideal. These challenges highlight the critical need for innovative approaches to spherical light field dataset generation and sharing, ensuring future advancements in the field. Additionally, variations in lighting present significant challenges when capturing spherical light fields, as they impact the scene’s dynamic range, white balance, and color grading, which creates yet another challenge in database creation. Brightness and color variations, such as sunlight’s yellow tint compared to cloudy daylight, are not easy to correct and often require advanced algorithms for adjustment. Capturing static outdoor scenes remains a challenge for future work, as such scenes still encounter lighting-related issues despite lacking movement.

LF compression is another challenge that requires attention when combining imaging modalities. The JPEG Pleno compression algorithm [ISO2021] is designed for 2-dimensional grid-like structured LFs (e.g., LFs captured by a microlens array or structured camera grids) and does not work for linear or unstructured captures. The situation is the same for many other compression methods, as most of them require some form of structured representation. Considering how well scene regression and other new view synthesis algorithms can adapt to unstructured inputs, one can also see the importance of advancing the compression field for unstructured LFs (e.g., the volume of light captured by cameras in various positions or in-the-wild user captures). Furthermore, such an LF compression method needs to be real-time to support immersive telepresence applications while providing a very good visual QoE that does not impede realism.

Figure 3 – Strong artifacts created at the extremes of view synthesis with a large baseline (i.e., 30 cm), where either the scene is warped (left – 360ViewSynth), or strong ghosting artifacts occur (right – PanoSynthVR).

Current new view synthesis methods are primarily designed to handle small baselines, typically just a few centimeters, and face significant challenges when applied to larger baselines required in telepresence applications. Challenges such as ghosting artifacts and unrealistic distortions (e.g., nonlinear distortions, stretching) occur when interpolating views, particularly for larger baselines, as shown in Figure 3. A recent comparative evaluation of PanoSynthVR and 360ViewSynth [Zerman2024] reveals that while 360ViewSynth marginally outperforms PanoSynthVR on average quality metrics, the scores for both methods remain suboptimal. PanoSynthVR struggles with large baselines, exhibiting prominent layer-like ghosting artifacts due to limitations in its MCI structure. Although 360ViewSynth produces visually better results, closer inspection shows that it distorts object perspectives by stretching them rather than accurately rendering the scene, leading to an unnatural user experience. These findings underscore the limitations of current state-of-the-art view synthesis methods for SLFs and highlight the complexity of addressing larger baselines effectively in view synthesis.
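As a hedged illustration of what an "average quality metric" in such a comparison can look like, the sketch below computes PSNR between a ground-truth view and a synthesized view. It is only a generic full-reference metric; the evaluation in [Zerman2024] may rely on different or sphere-aware measures, so this is an illustrative assumption rather than a description of that study's protocol.

```python
import numpy as np

def psnr(reference, synthesized, peak=1.0):
    """Peak signal-to-noise ratio (dB) between a ground-truth view and a
    synthesized view, both given as float arrays with values in [0, peak]."""
    mse = np.mean((reference.astype(np.float64) - synthesized.astype(np.float64)) ** 2)
    return float("inf") if mse == 0 else 10.0 * np.log10(peak ** 2 / mse)

# Placeholder data standing in for a captured view and its synthesized counterpart.
gt = np.random.rand(512, 1024, 3)
pred = np.clip(gt + 0.05 * np.random.randn(512, 1024, 3), 0.0, 1.0)
print(f"PSNR = {psnr(gt, pred):.2f} dB")
```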

Assessing user satisfaction and immersion in telepresence systems is a multidimensional challenge, requiring assessments in three different strands as described in the IMEx white paper: subjective assessment, behavioral assessment, and assessment via psycho-physiological methods [Perkis2020]. Quantitative metrics can be used for interaction latency and task performance in a user study, and individual preferences and experiences can be collected qualitatively. Certain aspects of user experience, such as visual quality and user engagement, can also be collected as quantitative data during user studies – with user self-reporting. Additionally, behavioral assessment (e.g., user movement, interaction patterns) can be used to identify different use patterns. Here, the limiting factor is mainly the time and resource cost of running such user studies. Therefore, the challenge is to prepare a framework that can model the user experience for realistic immersive telepresence scenarios, which can speed up the assessment strategies.
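As a minimal, hedged example of the quantitative side of such subjective assessments, the sketch below computes a Mean Opinion Score (MOS) and its 95% confidence interval from a handful of fabricated self-reported ratings. It illustrates only one building block of a subjective study, not a complete QoE framework for immersive telepresence.

```python
import numpy as np
from scipy import stats

# Fabricated placeholder ratings on an assumed 5-point ACR scale.
ratings = np.array([4, 5, 3, 4, 4, 5, 2, 4, 3, 5])

mos = ratings.mean()
# Student-t based 95% confidence interval around the MOS.
ci95 = stats.t.ppf(0.975, df=len(ratings) - 1) * ratings.std(ddof=1) / np.sqrt(len(ratings))
print(f"MOS = {mos:.2f} +/- {ci95:.2f}")
```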

Other limitations and aspects to consider include accessibility, privacy issues, and ethics. Regarding accessibility, it is important to ensure that immersive telepresence technologies are affordable and usable by diverse populations. The situation is improving as the cameras and headsets are getting cheaper and easier to use (e.g., faster and stronger on-device processing, removal of headset connection cables, increased ease of use with hand gestures, etc.). Nevertheless, hardware costs, connectivity requirements, and usability barriers must be further addressed to make these systems widely accessible. Regarding privacy and ethics, the realistic nature of immersive telepresence may raise ethical and privacy concerns. Capturing and transmitting live environments may involve sensitive data, necessitating robust privacy safeguards and ethical guidelines to prevent misuse. Also, privacy concerns regarding the headsets that rely on visual cameras for localization and mapping must be addressed.

5. Conclusions and Future Directions

Realistic immersive telepresence systems represent a transformative shift in how people interact with remote environments. By combining advanced imaging, rendering, and interaction technologies, these systems promise to revolutionize industries ranging from healthcare to entertainment. However, significant challenges remain, including data availability, compression, rendering, and QoE assessment. Addressing these obstacles will require collaboration across disciplines and industries.

To address these challenges, future research should focus on creating relevant spherical LF datasets that provide accurate positioning of the camera and address challenges such as dynamic lighting conditions and occlusions. Developing real-time, robust compression methods for unstructured LFs, which maintain visual quality and support immersive applications, is another critical area. Developing advanced view synthesis algorithms capable of handling large baselines without introducing artifacts or distortions, and creating frameworks for user experience and QoE assessment methodologies, are still open research questions.

Further into the future, the remaining challenges can be addressed by using learning-based algorithms for the issues related to the realness and spatiality factors as well as QoE estimation, by increasing the level of interactivity and the feeling of immersion through integrating different senses into the existing systems (e.g., spatial audio, haptics, natural interfaces), and by increasing standardization to create common frameworks that can manage interoperability across different systems. Long-term goals include the integration of realistic immersive displays – such as LF displays or improved holographic displays – and the convergence of telepresence systems with emerging technologies like 5G or 6G networks and edge computing, for which efforts are already underway [Mahmoud2023].

References

  • [Adelson1991] Adelson, E. H., & Bergen, J. R. (1991). The plenoptic function and the elements of early vision (Vol. 2). Cambridge, MA, USA: Vision and Modeling Group, Media Laboratory, Massachusetts Institute of Technology.
  • [Beck2013] Beck, S., Kunert, A., Kulik, A., & Froehlich, B. (2013). Immersive group-to-group telepresence. IEEE transactions on visualization and computer graphics, 19(4), 616-625.
  • [Boyce2021] Boyce, J. M., Doré, R., Dziembowski, A., Fleureau, J., Jung, J., Kroon, B., … & Yu, L. (2021). MPEG immersive video coding standard. Proceedings of the IEEE, 109(9), 1521-1536.
  • [Broxton2020] Broxton, M., Flynn, J., Overbeck, R., Erickson, D., Hedman, P., Duvall, M., … & Debevec, P. (2020). Immersive light field video with a layered mesh representation. ACM Transactions on Graphics (TOG), 39(4), 86-1.
  • [Chibane2021] Chibane, J., Bansal, A., Lazova, V., & Pons-Moll, G. (2021). Stereo radiance fields (SRF): Learning view synthesis for sparse views of novel scenes. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 7911-7920).
  • [Cortés2024] Cortés, C., Pérez, P., & García, N. (2023). Understanding latency and QoE in social XR. IEEE Consumer Electronics Magazine.
  • [Croci2020] Croci, S., Ozcinar, C., Zerman, E., Knorr, S., Cabrera, J., & Smolic, A. (2020). Visual attention-aware quality estimation framework for omnidirectional video using spherical Voronoi diagram. Quality and User Experience, 5, 1-17.
  • [Eisert2023] Eisert, P., Schreer, O., Feldmann, I., Hellge, C., & Hilsmann, A. (2023). Volumetric video– acquisition, interaction, streaming and rendering. In Immersive Video Technologies (pp. 289-326). Academic Press.
  • [Fechteler2013] Fechteler, P., Hilsmann, A., Eisert, P., Broeck, S. V., Stevens, C., Wall, J., … & Zahariadis, T. (2013, June). A framework for realistic 3D tele-immersion. In Proceedings of the 6th International Conference on Computer Vision/Computer Graphics Collaboration Techniques and Applications.
  • [Gond2023] Gond, M., Zerman, E., Knorr, S., & Sjöström, M. (2023, November). LFSphereNet: Real Time Spherical Light Field Reconstruction from a Single Omnidirectional Image. In Proceedings of the 20th ACM SIGGRAPH European Conference on Visual Media Production (pp. 1-10).
  • [Graziosi2020] Graziosi, D., Nakagami, O., Kuma, S., Zaghetto, A., Suzuki, T., & Tabatabai, A. (2020). An overview of ongoing point cloud compression standardization activities: Video-based (V-PCC) and geometry-based (G-PCC). APSIPA Transactions on Signal and Information Processing, 9, e13.
  • [IJsselsteijn2000] IJsselsteijn, W. A., De Ridder, H., Freeman, J., & Avons, S. E. (2000, June). Presence: concept, determinants, and measurement. In Human Vision and Electronic Imaging V (Vol. 3959, pp. 520-529). SPIE.
  • [ISO2021] ISO/IEC 21794-2:2021 (2021) Information technology – Plenoptic image coding system (JPEG Pleno) — Part 2: Light field coding.
  • [Jackson2023] Jackson, A. (2023, September) Meta Quest 3: Can businesses use VR day-to-day?, Technology Magazine. https://technologymagazine.com/digital-transformation/meta-quest-3-can-businesses-use-vr-day-to-day, Accessed: 2024-02-05.
  • [Kachach2021] Kachach, R., Orduna, M., Rodríguez, J., Pérez, P., Villegas, Á., Cabrera, J., & García, N. (2021, July). Immersive telepresence in remote education. In Proceedings of the International Workshop on Immersive Mixed and Virtual Environment Systems (MMVE’21) (pp. 21-24).
  • [Krolla2014] Krolla, B., Diebold, M., Goldlücke, B., & Stricker, D. (2014, September). Spherical Light Fields. In BMVC (No. 67.1–67.12).
  • [Lawrence2024] Lawrence, J., Overbeck, R., Prives, T., Fortes, T., Roth, N., & Newman, B. (2024). Project starline: A high-fidelity telepresence system. In ACM SIGGRAPH 2024 Emerging Technologies (pp. 1-2).
  • [Levoy1996] Levoy, M. & Hanrahan, P. (1996) Light field rendering, in Proceedings of the 23rd Annual Conference on Computer Graphics and Interactive Techniques (pp. 31-42), New York, NY, USA, Association for Computing Machinery.
  • [Lin2023] Lin, K. E., Lin, Y. C., Lai, W. S., Lin, T. Y., Shih, Y. C., & Ramamoorthi, R. (2023). Vision transformer for nerf-based view synthesis from a single input image. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (pp. 806-815).
  • [Lo2023] Lo, I. C., & Chen, H. H. (2023). Acquiring 360° Light Field by a Moving Dual-Fisheye Camera. IEEE Transactions on Image Processing.
  • [Mahmoud2023] Mahmood, A., Abedin, S. F., O’Nils, M., Bergman, M., & Gidlund, M. (2023). Remote-timber: an outlook for teleoperated forestry with first 5g measurements. IEEE Industrial Electronics Magazine, 17(3), 42-53.
  • [Marvie2022] Marvie, J. E., Krivokuća, M., Guede, C., Ricard, J., Mocquard, O., & Tariolle, F. L. (2022, September). Compression of time-varying textured meshes using patch tiling and image-based tracking. In 2022 10th European Workshop on Visual Information Processing (EUVIP) (pp. 1-6). IEEE.
  • [Maugey2019] Maugey, T., Guillo, L., & Cam, C. L. (2019, June). FTV360: A multiview 360° video dataset with calibration parameters. In Proceedings of the 10th ACM Multimedia Systems Conference (pp. 291-295).
  • [Maugey2023] Maugey, T. (2023). Acquisition, representation, and rendering of omnidirectional videos. In Immersive Video Technologies (pp. 27-48). Academic Press.
  • [Minsky1980] Minsky, M. (1980). Telepresence. Omni, pp. 45-51.
  • [Overbeck2018] Overbeck, R. S., Erickson, D., Evangelakos, D., Pharr, M., & Debevec, P. (2018). A system for acquiring, processing, and rendering panoramic light field stills for virtual reality. ACM Transactions on Graphics (TOG), 37(6), 1-15.
  • [Perkis2020] Perkis, A., Timmerer, C., et al. (2020, May) “QUALINET White Paper on Definitions of Immersive Media Experience (IMEx)”, European Network on Quality of Experience in Multimedia Systems and Services, 14th QUALINET meeting (online), Online: https://arxiv.org/abs/2007.07032
  • [Schubert2001] Schubert, T., Friedmann, F., & Regenbrecht, H. (2001). The experience of presence: Factor analytic insights. Presence: Teleoperators & Virtual Environments, 10(3), 266-281.
  • [Srinivasan2019] Srinivasan, P. P., Tucker, R., Barron, J. T., Ramamoorthi, R., Ng, R., & Snavely, N. (2019). Pushing the boundaries of view extrapolation with multiplane images. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 175-184).
  • [Starline2025] Project Starline: Be there from anywhere with our breakthrough communication technology. (n.d.). Online: https://starline.google/. Accessed: 2025-01-14
  • [Stepanov2023] Stepanov, M., Valenzise, G., & Dufaux, F. (2023). Compression of light fields. In Immersive Video Technologies (pp. 201-226). Academic Press.
  • [Suhail2022] Suhail, M., Esteves, C., Sigal, L., & Makadia, A. (2022). Light field neural rendering. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 8269-8279).
  • [Takatalo2008] Takatalo, J., Nyman, G., & Laaksonen, L. (2008). Components of human experience in virtual environments. Computers in Human Behavior, 24(1), 1-15.
  • [Valenzise2022] Valenzise, G., Alain, M., Zerman, E., & Ozcinar, C. (Eds.). (2022). Immersive Video Technologies. Academic Press.
  • [Waidhofer2022] Waidhofer, J., Gadgil, R., Dickson, A., Zollmann, S., & Ventura, J. (2022, October). PanoSynthVR: Toward light-weight 360-degree view synthesis from a single panoramic input. In 2022 IEEE International Symposium on Mixed and Augmented Reality (ISMAR) (pp. 584-592). IEEE.
  • [Wisotzky2025] Wisotzky, E. L., Rosenthal, J. C., Meij, S., van den Dobblesteen, J., Arens, P., Hilsmann, A., … & Schneider, A. (2025). Telepresence for surgical assistance and training using eXtended reality during and after pandemic periods. Journal of telemedicine and telecare, 31(1), 14-28.
  • [Yagi1999] Yagi, Y. (1999). Omnidirectional sensing and its applications. IEICE transactions on information and systems, 82(3), 568-579.
  • [Zerman2024] Zerman, E., Gond, M., Takhtardeshir, S., Olsson, R., & Sjöström, M. (2024, June). A Spherical Light Field Database for Immersive Telecommunication and Telepresence Applications. In 2024 16th International Conference on Quality of Multimedia Experience (QoMEX) (pp. 200-206). IEEE.
  • [Zhi2020] Zhi, T., Lassner, C., Tung, T., Stoll, C., Narasimhan, S. G., & Vo, M. (2020). TexMesh: Reconstructing detailed human texture and geometry from RGB-D video. In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part X 16 (pp. 492-509). Springer International Publishing.

SIGMM Workshop on Multimodal AI Agents

The SIGMM Workshop on Multimodal AI Agents was held on October 28th, 2024, at ACM Multimedia 2024 (ACMMM24) in Melbourne as an invitation-only event. The initiative was launched by Alberto Del Bimbo, Ramesh Jain, and Alan Smeaton, following a vision of the future where multimedia expertise converges with the power of large language models and the belief that there is a great opportunity to position the Multimedia research community at the center of this transformation. The event was structured as three roundtables, inviting some of the most influential figures in the multimedia field to brainstorm on key issues. The goal was to design the future, identifying the multimodal opportunity in the days of powerful large-model systems and preparing an agenda for the coming years for the SIGMM community. We did not want to overlap with the current thinking on how multimodality will be included in the emerging large models. Instead, the goal was to explore how deep multimodality is essential in building the next stages of AI agents for real-world applications and how fundamental it is for understanding real-time contexts and for actions by agents. The event received a great response, with over 30 attendees from both academia and industry, representing 13 different countries.

The three roundtables focused on Tech Challenges, Applications, and Industry-University collaboration. The participants were divided into three groups and assigned to the three roundtables according to their profiles and preferences. For the roundtables, we did not prepare specific questions but rather outlined key areas of focus for discussion. A brief document that provided a short introduction for each roundtable, summarizing the topic of the debate and highlighting three major subjects to guide the discussion, was prepared and given to the discussants a few days before the meeting.

In the following, we report a brief synthesis of the discussions at the roundtables, highlighting the principal arguments and proposals.

Tech challenges Roundtable

Motivations for the discussion: As large pre-trained models become more prevalent and move towards multimodality, a key issue for their usage arises around the impact of their updating and fine-tuning: understanding how to ensure that improvements in one area do not come at the cost of degradation in others. It is also fundamentally important to understand how deep multimodality is essential for building the next stages of AI agents for real-world applications, as well as for comprehending real-time contexts and guiding actions by agents towards Artificial General Intelligence.

Some salient sentences, open questions, proposals from the discussion:

  • The interplay between human intelligence and machine intelligence is a fundamental aspect of what should be multi-modal. There are not yet deep enough multimodal models… models for information that truly span all, or even a subset of, modalities. We need metrics for this human-machine, human-intelligence machine-intelligence, action. We should come up with and define a task around how people collaborate productively. We should look at something like dynamic difficulty adjustment, which requires continuous, real-time development or training.
  • Benchmarks are of crucial importance, not just to evaluate one thing against another thing, but to stretch the capabilities. It is not just about passing the benchmark; it is about setting the targets. We should envision a SIGMM-endorsed or sponsored multimodal benchmark by approaching some big tech companies to benchmark some multimodal activity within and across companies.

Applications Roundtable

Motivations for the discussion: Multimodality is a cornerstone of emerging real-world applications, providing context and situational awareness to systems. Large Multimodal Models are credited with transforming various industries and enabling new applications. Key challenges lie in developing computational approaches for media fusion to construct context and situational understanding, addressing real-time computing costs, and refining model building. It is therefore essential for the SIGMM community to reason about how to build a vibrant community around one or a few key applications.

Some salient sentences, open questions, proposals from the discussion:

There are many areas of application where the SIGMM community can provide vital and innovative contributions and should concentrate its applicative research. Example application areas and research directions are:

  • Health: there is an absence of open-ended sensory data representing long-term complex information in the health area. We can think of integrated, federated machine learning, i.e., an integrated, federated data space for data control.
  • Education: we can think of some futuristic learning approaches, like completely autonomous learning, namely AI agents that are supportive through observation models, able to adjust the learning level so that some learners can finish faster than others and learn depending on the modalities they prefer. It is also of key importance to consider what the roles of the teacher and of the AI are.
  • Productivity: we can think of tools for immersive multi-modal experiences, to generate cross-modal content including 3D and podcasting in immersive environments.
  • Entertainment: we should think of how we can improve entertainment through immersive, story-driven experiences.

Industry and University Roundtable

Motivations for the discussion: Research on large AI models is by far dominated by private companies, thanks in part to their access to data and their capacity to bear the cost of building and training such models. As a result, academic institutions are being left behind in the AI race. It is therefore urgent to reason about which research directions are viable for universities and to think of new Industry-University collaboration models for multimodal AI research. It is also important to capitalize on the unique advantage of academia, namely its neutrality and ability to address long-term social and ethical issues related to technology.

Some salient sentences, open questions, proposals from the discussion:

  • Small and medium enterprises feel that they are left out. These are the ones who came to talk to universities. This is an opportunity for the SIGMM community to see how we can help. SIGMM could sponsor joint PhD programs, for example addressing small-size multimodal foundation models or intelligent agents, where a company sponsors part of the grant project.
  • SIGMM should promote large-visibility events at ACM Multimedia, like Grand Challenges and Hackathons. As a community, we could sponsor a company-wise Grand Challenge on multimodal AI and intelligent agents, leveraging industry to contribute more datasets. We could promote a regional-global Hackathon, where hackathons are held and overseen in different regions of the world, with the top teams then invited to come to ACM Multimedia and compete in the final.

Based on the discussions at the roundtables, we have identified several concrete actions that could help position the SIGMM research community at the forefront of the multimodal AI transformation:

At the next ACM Multimedia Conference

  • Explicit inclusion of multimodality as a key topic in the next ACM Multimedia call.
  • Multimodal Hackathon on Intelligent Agents (regional-global hackathon).
  • Multimodal Benchmarks (collaborations within and across major tech companies).
  • Multimodal Grand Challenges (in partnership with industry leaders).

At the next ACM SIGMM call for Special projects

  • Special Projects focused on Multimodal AI.

SIGMM is committed to pursuing these initiatives.

Diversity and Inclusion in focus at ACM IMX 2024

Summary: ACM IMX 2024 took place in Stockholm, Sweden, from June 12 to 14, continuing its dedication to promoting diversity within the community. Recognising the importance of amplifying varied voices and experiences to advance the field, the conference built on IMX’s prior achievements in diversity and inclusion through a series of initiatives to promote diversity and inclusion (D&I). This column provides a concise overview of the main D&I initiatives, including childcare support, early-career researcher grants, and manuscript accessibility support. It includes participant feedback and short testimonials shared during and after the conference to highlight the value of these initiatives.

To encourage a broad and inclusive pool of organisers, one method employed by the general chairs of ACM IMX’24 to prioritise diversity and inclusion was to team seasoned committee members with new members within the organising committee. This was done to actively foster mentoring opportunities that support continuity and the development of future conference leadership. In addition, IMX’24 invited community members to self-nominate for various chair and organisational roles to make it clear that chair roles were open and available to all who were interested in being part of organising the conference. This call for applications was announced during the closing session of ACM IMX’23 in Nantes, France and, over a two-month period, the committee received 12 applications, from which 5 candidates were selected to serve as chairs in various capacities. This inclusive approach allowed ACM IMX to engage with junior members and volunteers who might not have been reached through traditional recruitment methods, pairing them with experienced team members to ensure that they were able to build their network within the community and their skills in conference organisation and management.

SIGMM support was used to enable the chairs of IMX’24 to introduce several initiatives to ensure that all individuals, regardless of personal circumstances, could participate fully in the conference. These initiatives had openly announced calls to all eligible community members who wished to attend the conference in person in Stockholm but required financial assistance. To ensure a fair and thorough selection, the IMX’24 Diversity and Inclusion Chairs, in collaboration with the General Chairs, reviewed each of the applications to ensure that the widest range of support could be offered with the available funds. Applications were evaluated on a rolling basis to ensure that participants were able to organise their travel and visa arrangements without the added challenges of time pressure.

With this support from SIGMM, Diversity and Inclusion grants for IMX were made available for participants, covering:

  • Travel Support for Non-Students from Marginalised and Underrepresented Groups: This grant provided travel support for researchers who self-identified as marginalised or underrepresented within the ACM IMX community, particularly those from non-WEIRD (Western, Educated, Industrialised, Rich, Democratic) countries who lacked other funding opportunities. Priority was given to early-career researchers (such as post-docs) and those needing financial assistance, to complement the existing SIGCHI and SIGMM student-targeted travel grants.
  • Childcare and Parental Support: This grant offered financial assistance to parents attending ACM IMX’24, subsidising childcare costs to enable broader participation and to cover expenses related to children’s travel, travel for a childcare companion, and on-site or arranged babysitting during the conference.
  • Disability and Carer Support: This grant aimed to support attendees on extended leave from work due to disability, parental responsibilities, or other personal circumstances. Recipients of this award also received complimentary conference registration.
  • Student Travel Awards: SIGMM also provided awards directly to students to support travel expenses, enabling a broader range of participation and complementing the free registration offered to those students volunteering at the conference.

SIGMM’s special initiatives for diversity and inclusion enabled IMX’24 to secure a keynote designed to foster a more inclusive dialogue. Delivered by artist Jake Elwes—a self-described hacker, radical faerie, and researcher—the keynote focused on “queer artificial intelligence” and featured deepfake drag performers. Elwes’ work invited the attendees to reflect on who builds these systems, the intentions behind them, and how they can be reclaimed to envision and create different visions of a technology-enhanced future.

In combination with support from SIGMM, a special workshop focused on engaging with research and researchers from Latin America as a region of interest was made possible through the generous backing of the SIGCHI Development Fund (SDF). This enabled researchers and workshop keynote speakers to participate in the “IMX in Latin America – 2nd International Workshop” and to attend the conference. A core objective was to increase diversity by broadening the IMX community through actively encouraging colleagues from Latin America to attend and contribute. This workshop also published its submissions as part of the ACM IMX’24 workshop proceedings in ICPS.

For the first time at ACM IMX, an external provider (TAPS) was hired to ensure the accessibility of papers prior to publication. Finally, the conference offered a range of venue-focused diversity and inclusion initiatives, including the provision of all-gender bathrooms, pronoun badges, and approachable senior community members to support engagement. A care corner and care tables were thoughtfully set up throughout the conference venue to provide attendees with free hygiene essentials such as masks, refreshers, hand sanitisers, sanitary pads, and tampons. These measures highlighted ACM IMX’24’s commitment to fostering a welcoming and accessible environment for all participants.

Figure 1: Participants’ responses on their perception of diversity and inclusion at IMX, highlighting that it encompasses representation, welcoming environments, active engagement, research focus, and shaping future media experiences.

“During the closing event of IMX2024, we asked our attendees to answer a few questions that could help plan future IMX conferences. We asked everyone to share what future research directions could be included to address D&I at IMX. Some of the suggestions were to include the field of Humanities, to study usability among different demographics, and to understand how people who might not have economic access to technology could benefit from such technology. We also asked everyone to select what, according to them, is D&I at IMX. The options “Everyone feels welcomed”, “Diverse individuals are able to engage and contribute”, and “People from diverse backgrounds get represented and have a voice” received a majority of the votes when compared to “Shape the future of interactive media experiences” and “Research that focuses on diversity and inclusion in media experiences”. When asked to share how included they felt at IMX2024, 92% of the participants shared that they either felt included or very much included [with some leaving the question unanswered]. They also shared how different aspects made them feel included. Some of the highlights were the care corner that was arranged to support the basic needs of the attendees, the social events, interactions at the conference, and the community.” – Sujithra Raviselvam, IMX’24 Diversity and Inclusion Co-Chair.

Figure 2: Participants’ feedback on factors contributing to feelings of inclusion and exclusion at IMX, along with suggestions for future research directions aimed at improving diversity and inclusion. The feedback highlights personal interactions, event organization, and amenities as key to feeling included, while future research suggestions focus on enhancing accessibility, providing economic support, and integrating more diverse perspectives in HCI research.

The best way to understand the impacts of these supports is through the words of those who were enabled to join the conference by receiving it. 

“The grant received for IMX2024 allowed me to attend the conference. Having a young child is challenging as an early researcher, as you must, sometimes, sacrifice your career or family. This grant allowed me to travel without any of these. I could attend the conference without stress or second thoughts, and support my family during the few days of the conference. Thanks to this, I received valuable feedback on my work, followed interesting presentations, and did not miss my family.” – Romain Herault, childcare award recipient.

“I had the opportunity to present our qualitative study focused on understanding the sensitive values of women entrepreneurs in Brazil to support designing multi-model conversational AI financial systems at IMX, followed by interesting discussions about it in the workshop organized by Debora Christina Muchaluat Saade, Mylene Farias and Jesus Favela. The conference was focused on the future of multimodal technologies, with many exciting demos to investigate, to make more accessible, and to challenge assumptions of real life through a multimedia lens. We also had a conference dinner with the theme of the midsummer celebration. I was amazed by its meaning; as far as I understood, the purpose is to celebrate the light, sun, and summer season with family and friends! I loved it! It was also an opportunity to explore the beautiful Stockholm city with new colleagues and meet current collaborators in research.” – Heloisa Caroline de Souza Pereira Candello.

A total of 21 applicants received support through diversity and inclusion grants provided by both SIGMM and the SIGCHI Development Fund (SDF). This assistance enabled full participation in ACM IMX’24 and supported a diverse group, including students, non-students from marginalised backgrounds, early-career researchers, and Latin American researchers, all of whom benefitted from these grants and made up more than 10% of the total conference attendees – truly changing and undoubtedly enhancing the experience of all attendees at the conference. 

Figure 3: The word clouds present two data sets from an IMX survey: the countries respondents identify as home, and the locations they would like IMX to feature in the future. They highlight a diverse range of home countries, including Brazil, Germany, and India, and suggest future IMX locations such as Japan, Brazil, and various cities in the USA, indicating a global interest and the geographical diversity of the IMX community.