Encouraging Scientific Collaborations with ConfFlow 2021


We often find collaborators by chance at a conference or by looking for them specifically through their papers. However, hidden potential social connections between researchers may go unnoticed, because the keywords we use do not always represent the entire space of similar research interests. As a community, Multimedia (MM) is so diverse that it is easy for community members to miss out on very useful expertise and potentially fruitful collaborations. There is a lot of latent knowledge and potential synergy that could be surfaced by offering conference attendees an alternative perspective on their similarities to other attendees. ConfFlow is an online application that offers such an alternative perspective on finding new research connections. It is designed to help researchers find others at conferences with complementary research interests for collaboration. ConfFlow takes a data-driven approach, using a method similar to the Toronto Paper Matching System (TPMS), which is used to identify suitable reviewers for papers, to construct a similarity embedding space in which researchers can find one another.
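
To illustrate the general idea (this is an illustrative sketch, not the actual TPMS or ConfFlow code), a similarity space can be built by representing each author as a tf-idf vector over the words of their recent papers and comparing authors with cosine similarity; the author names and toy word lists below are hypothetical:

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """tf-idf vector (as a sparse dict) for each tokenized document."""
    n = len(docs)
    df = Counter()
    for doc in docs:
        df.update(set(doc))
    idf = {t: math.log(n / df[t]) for t in df}
    return [{t: tf * idf[t] for t, tf in Counter(doc).items()} for doc in docs]

def cosine(u, v):
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

# Toy "research profiles": the words of each author's recent papers.
authors = {
    "alice": "video quality assessment subjective testing".split(),
    "bob":   "video streaming quality assessment adaptive".split(),
    "carol": "conversation analysis wearable social sensing".split(),
}
names = list(authors)
vecs = dict(zip(names, tfidf_vectors(list(authors.values()))))
# alice and bob share discriminative terms, so they end up closer than alice and carol.
sims = {(a, b): cosine(vecs[a], vecs[b]) for a in names for b in names if a < b}
```

In practice the vectors would come from full paper texts and a much richer model, but the resulting pairwise similarities can be embedded into a 2-D map in the same spirit.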

In this report, we discuss the follow-up to the 2020 edition of ConfFlow, which was run at MMSys, MM, and ICMR in 2021. We created a separate edition of ConfFlow for each conference, processing the accepted authors of each: 2642 (MM), 272 (MMSys), and 494 (ICMR).

Both the 2020 and 2021 editions of ConfFlow were funded by the SIGMM special initiatives fund.

New Functionality

In the 2020 edition of ConfFlow, we created an interface allowing authors at the MM 2020 conference to browse the research similarity space with others. Each user needed to claim their Google Scholar account in the application before using it. We implemented a strict privacy-sensitive policy: data of individuals was shown only if they consented to being included in the database; even public data was not shown, as processing and republishing it might be considered a privacy invasion. Unfortunately, because of this strict policy and very little uptake of the application, no user could experience the application in full. In the 2021 edition, we updated the privacy policy to be more permissive, whilst still secure (see the Privacy and Ethical Considerations section below).

From our experiences from the 2020 edition, we identified some bottlenecks that could be improved upon. To that end, we made the following augmentations:

  • Improved frontend design: We did an overhaul of the interface to make it more modern, visually appealing, and user-friendly. The design was also slightly changed to accommodate new functionalities
  • New embedding options: We added two more options for how the similarity space is formed: word2vec (tf-idf-weighted mean of word2vec embeddings, abbreviated w-mowe) and doc2vec (see Figure 1)
Figure 1. Screenshot showing the new embedding functionality (w-mowe and doc2vec)
  • Interactive tutorial for onboarding: We included an interactive tutorial that showcases the full range of functionalities to the users when they first log in (see Figure 2)
  • Direct messaging functionality: We added direct messaging to ConfFlow, allowing direct communication between attendees (see Figure 3)
  • Scaling ConfFlow and making it cheaper to run in the future: There is an economy of scale to only needing to update the ConfFlow database with conference newcomers. We made the following steps to make the process more efficient:
    • Generating a database of verified authors from the lists of SIGMM conference authors listed on the ACM website over the last 6 years.
    • A helper tool for finding the Google Scholar profiles of newcomers more quickly, as they needed to be manually verified for security reasons.
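
The tf-idf-weighted mean embedding (w-mowe) above can be sketched as follows; the word vectors and idf weights here are toy values standing in for a trained word2vec model and corpus statistics, not ConfFlow's actual model:

```python
from collections import Counter

# Toy 3-d word vectors standing in for trained word2vec embeddings.
word_vecs = {
    "video":    [0.9, 0.1, 0.0],
    "quality":  [0.8, 0.2, 0.1],
    "sensor":   [0.1, 0.9, 0.2],
    "wearable": [0.0, 0.8, 0.3],
}
# Hypothetical idf weights; in practice these come from the paper corpus.
idf = {"video": 1.0, "quality": 1.2, "sensor": 1.5, "wearable": 1.4}

def wmowe(tokens):
    """tf-idf-weighted mean of word embeddings (the 'w-mowe' option)."""
    tf = Counter(t for t in tokens if t in word_vecs)
    weights = {t: tf[t] * idf.get(t, 0.0) for t in tf}
    total = sum(weights.values())
    dim = len(next(iter(word_vecs.values())))
    vec = [0.0] * dim
    for t, w in weights.items():
        for i in range(dim):
            vec[i] += w * word_vecs[t][i]
    return [x / total for x in vec] if total else vec

author_a = wmowe("video quality quality".split())   # systems-flavoured profile
author_b = wmowe("wearable sensor sensor".split())  # sensing-flavoured profile
```

Each author's paper text thus collapses to a single dense vector; doc2vec instead learns such a document vector directly.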



Rather than covering only ACM MM as in 2020, ConfFlow was rolled out to three conferences: MMSys 2021 (Istanbul, Turkey) in September, Multimedia 2021 (Chengdu, China) in October, and ICMR 2021 (Taipei, Taiwan) in November. MMSys and MM were organized as hybrid events, whilst ICMR was ultimately held virtually after being rescheduled twice.

We asked the general and program chairs of each conference to provide the author lists of the accepted papers at least one month before the conference started. This was a compromise between obtaining just the actual conference attendees (which would have made social connection easier if the conferences had been in-person only) and getting conference-relevant participants sufficiently ahead of time to disambiguate identities and start the time-consuming computation of the embedding spaces. Given that MMSys and Multimedia were hybrid, waiting for the final registration list would have meant waiting until very close to the conference itself; in any case, the hybrid nature of the conferences made virtual social connection the more viable option, and using the attendee list would have made it harder to pre-announce the application just before the conference started. Given also that the conference organizers were very occupied with handling the many uncertainties of conference organization during the pandemic, we decided that obtaining the author lists was the least risky approach.

Aside from getting the author lists, we also asked the conference organizers for support in disseminating the application to the conference attendees. A separate edition of ConfFlow needed to be generated for each conference. The following strategies were used for disseminating the application via the conference directly and from a personal account:

  • MMSys: Slack channel, Twitter (conference, personal, and SIGMM accounts), WeChat (Weixin), Weibo, Facebook, presentation slides during conference general announcements
  • ACM MM: Twitter (conference and SIGMM accounts), Whova, a presentation slide during the conference banquet
  • ICMR: Twitter (conference, personal, and SIGMM accounts).

Compared to last year, we tried to catch people’s attention with a more comprehensive dissemination strategy, including short, catchy explanatory videos communicating the functionalities of the application. These were embedded in our social media campaigns.

Following on from that, we issued an online survey to gauge how people in the community at large felt about social interaction and, if they had used ConfFlow, what their experience of the app was. It was sent by email shortly after the conference to all those who used the application, with a reminder one week later. Posts were also sent out on Twitter and Facebook to encourage people in the community to fill in the survey even if they had not used ConfFlow. The survey was divided into questions about collaboration in general, the experience of using ConfFlow, and how the application experience could be changed. Further details about the questions are given in the Appendix.

Privacy and Ethical Considerations

The first edition of ConfFlow (2020) had a very restrictive opt-in-only policy. This made the visualization hard to use for interested users, severely hindering the user experience: users unanimously asked to see the other researchers in the community. Therefore, in the 2021 edition, any already publicly available information from a user’s Google Scholar account or the ACM website, and visualizations derived from it, was displayed to everyone. Information that is not publicly available online, such as individual usage behaviour, visualization options, and whether a ConfFlow account is activated, is not shown publicly.

Application Realization

For security reasons, a user cannot use ConfFlow until they have claimed their account. This is needed because each account has preferences related to the ConfFlow interface – settings such as hiding particular researchers or marking researchers as ‘favourites’ – as well as the direct messaging functionality. We used strict security procedures when building ConfFlow, which meant that a user’s identity needed to be verified when claiming an account before their preferences could be retrieved. We do this by associating the author’s name and affiliation with a Google Scholar profile; the user then verifies their identity with respect to that Google Scholar account. In some cases, it is necessary to manually assign an author to a Google Scholar profile because there are too many profiles with the same name; conversely, many author names can sometimes be associated with the same Google Scholar account. To this end, one of the main new functionalities was the creation of a database of all SIGMM community members who had published at the MM conference recently. That way, a name and Google Scholar profile only need to be associated once and can easily be re-used in future editions of ConfFlow. The manual effort involved varied across the three conferences for which ConfFlow was created; we elaborate on this below. An additional helper function was created to allow faster manual verification in cases of ambiguity.
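
A matching step of this kind can be sketched as follows; this is a hypothetical illustration (the field names, weights, and thresholds are our own invention, not ConfFlow's implementation), showing how a candidate profile can be accepted automatically or flagged for manual verification:

```python
from difflib import SequenceMatcher

def _sim(a, b):
    """Simple string similarity in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def score_candidate(author, profile):
    """Weighted string similarity between an author record and a
    candidate Scholar profile (weights are illustrative guesses)."""
    return 0.6 * _sim(author["name"], profile["name"]) + \
           0.4 * _sim(author["affiliation"], profile["affiliation"])

def match_author(author, profiles, accept=0.8, margin=0.1):
    """Return (best_profile, 'ok') for a confident unique match, or
    (best_profile, 'ambiguous') when a human should verify the match."""
    scored = sorted(((score_candidate(author, p), i)
                     for i, p in enumerate(profiles)), reverse=True)
    best_score, best_i = scored[0]
    runner_up = scored[1][0] if len(scored) > 1 else 0.0
    if best_score >= accept and best_score - runner_up >= margin:
        return profiles[best_i], "ok"
    return profiles[best_i], "ambiguous"
```

The `margin` check is what catches the "too many profiles with the same name" case: two near-identical candidates both score highly, so the match is routed to manual review.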

ConfFlow at ACM SIGMM

We describe some statistics for each edition of ConfFlow at the three conferences of SIGMM in 2021: MMSys, MM, and ICMR. We list them in chronological order of when the conference occurred in the calendar year.


ACM MMSys

The author list provided by the General Chairs of MMSys had 272 unique authors. As shown in Figure 4, we were able to identify the Google Scholar accounts of 158 authors. 145 of these accounts were identified automatically using the provided author information: name, affiliation, and e-mail domain. 13 accounts identified by the automatic process were tagged as ambiguous and required manual validation.

Figure 4. Author statistics for ACM MMSys‘21

We created ConfFlow accounts for 145 identified authors. As shown in Figure 5, 18 users claimed their accounts and used ConfFlow during the conference. Further analysis showed that 7 out of 18 users were newcomers to MMSys, i.e., it was their first publication at this conference.

After sending out the survey request to the 18 users after the conference, we obtained 1 survey response from a PhD student. Due to the low response rate, we do not report the responses.

Figure 5. User statistics for ConfFlow-MMSys‘21

The similarity space visualized in ConfFlow is based on the publications of authors in the last two years. Figure 6 shows the distribution of the number of papers MMSys authors published in the last two years; for each identified author, we take all the papers they published in this period to generate the latent representation of their research interests. It was particularly interesting to see how many researchers published 30 or more papers in the last two years. They account for a significant proportion of the authors of the conference, who may be too busy to find new research connections. However, there is also a significant proportion of researchers publishing fewer than 30 papers in the last two years who could find ConfFlow useful.

Figure 6. Histogram of the number of publications in the last 2 years for MMSys’21 authors.

ACM Multimedia

We realized that users without a Google Scholar profile could not use ConfFlow at all, so for the Multimedia edition we added a view-only (guest account) option of ConfFlow and advertised it on social media accordingly. This view-only account also allowed researchers who did not want to claim their account to browse the embedding space. The disadvantage of this approach is that the application does not immediately centre on the user in the embedding space. Given the large number of authors at Multimedia, this made it extremely hard for view-only users to find themselves, which may have made it harder for them to appreciate the utility of the application.

As shown in Figure 7, the author list provided by the General Chairs of Multimedia had 2642 unique authors. We were able to identify the Google Scholar accounts of 1608 authors. 1213 of these accounts were identified automatically using the provided author information: name, affiliation, and e-mail domain. 225 authors were already identified in the previous iterations of ConfFlow for ACM MMSys ‘21 and MM ‘20. We then manually analyzed the remaining 1204 authors that were either tagged as ambiguous matches by the automatic process or returned no matches at all. We were able to identify an additional 170 accounts with the manual search. This highlights how challenging it is to establish an online identity for all authors in order for them to use ConfFlow, even with manual intervention.

Figure 7. Author statistics for MM’21

We created ConfFlow accounts for the identified authors. As shown in Figure 8, 16 users claimed their accounts and used ConfFlow during the conference. Further analysis showed that 9 out of 16 users were newcomers to Multimedia, i.e., it was their first publication at this conference. 5 attendees requested access to the guest account.

Figure 8. User statistics for ConfFlow-MM’21.

Figure 9 shows the distribution of the number of papers Multimedia 2021 authors published in the last two years. It is interesting to see a distribution more skewed towards people with fewer publications compared to the MMSys edition. This would suggest that there are potentially more researchers who would find ConfFlow interesting as a social connection tool. However, both MMSys and Multimedia had very similar numbers of users despite Multimedia being almost 10 times bigger. This may be related to the fact that we were able to be in closer communication with the general chairs of MMSys, who gave us access to more channels of communication (including a slide announcement during the conference opening). Meanwhile, at MM, the initial dissemination via Whova (which was the first line of attack) did not yield any new users at all, and the Multimedia social media feed (Twitter) had very few followers – this could be explained by the fact that Twitter is not used by many of our colleagues in Asia and Multimedia was being run in Chengdu. We do not have statistics on the proportion of remote vs. in-person attendees, which may also have affected usage.

Figure 9. Histogram of the number of publications in the last 2 years for all identified authors of MM’21.


ICMR

The author list provided by the General Chairs of ICMR had 494 unique authors. As shown in Figure 10, we were able to identify the Google Scholar accounts of 286 authors. 162 of these accounts were identified automatically using the provided author information: name, affiliation, and e-mail domain. 67 authors were already identified in the previous iterations of ConfFlow. We then manually analyzed the remaining 265 authors that were either tagged as ambiguous matches by the automatic process or returned no matches at all. We were able to identify an additional 57 accounts with the manual search.

Figure 10. Author statistics for ICMR ‘21

None of the users claimed their ConfFlow account during ICMR’21. Figure 11 shows the distribution of the number of papers ICMR authors published in the last two years. It is interesting that there was no uptake at all, despite ICMR being almost double the size of MMSys and five times smaller than Multimedia.

Figure 11. Histogram of number of publications in the last 2 years for all identified authors of ICMR ‘21.

Discussion and Recommendations

This section describes some key points of reflection on the running of ConfFlow this year. 

One of the main issues relates to the low number of users despite conference participants being aware of the application. The survey on collaboration and experience with ConfFlow did not yield sufficient responses. 

It is interesting to see that in all conferences a significant proportion of the users of ConfFlow were newcomers. Unfortunately, without the statistics from the survey we put out, it is not clear whether this reflects the distribution of the conference attendees in general or whether newcomers are more interested in using ConfFlow due to its promise of helping people to connect socially.

The reasons for the low usage could be multiple. The hybrid and virtual formats of the conferences made it difficult to find time to think about collaborations while preparing to attend a conference or during the conference itself. For virtual participants in particular, the benefit of not travelling is that one can continue with the day-to-day duties of one’s normal job; however, this takes away opportunities for social networking that exist in the in-person setting. In addition, the challenges of running the conferences in a hybrid format may also have led to fatigue for in-person as well as virtual participants. Another possible explanation is that the general Multimedia community sees no obvious intrinsic value in changing the way collaboration is already carried out. The additional barrier of needing to claim an account, required for privacy and ethical reasons, may also have been confusing (it could appear that an account needs to be created, which can be a barrier to usage).

We suspect that the larger number of users at MMSys could be related to the closer access we had to social media channels, e.g. the conference Slack channel, which acted as a centralized reminder for participants of what was going on in the conference. It could also reflect the openness of that community to finding social connections. On the other hand, the Whova app used for MM has a more complex interface with multiple purposes beyond communication, which may have made it harder for attendees to see the ConfFlow announcement embedded among other announcements.

Finally, we also considered that the ConfFlow interface takes time to browse and reflect on. Given that the intrinsic value of the application is not immediately obvious to many (this is our interpretation of the low interest in using it), it could make more sense to have a SIGMM community-wide edition of ConfFlow that is available all year round, allowing the application and its purpose to be disseminated outside of the pre-conference rush; conference-specific editions could then be generated from it. This, however, comes with its own logistical issues: every new identity added to the database would either require the entire embedding to be recomputed, or their latent research-interest representation would need to be projected directly onto the existing embedding, which does not necessarily accurately represent their closeness to others in the existing database. The rate at which new authors are added would also require significant manual attention (and may not be easy to resolve, as shown in the statistics in Table 1). Given also the popularity of the Influence Flowers (http://influencemap.ml/), a previously funded SIGMM initiative, we suspect that a more ego-based strategy may be more effective in encouraging researchers in the community to start engaging with ConfFlow.
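
One simple out-of-sample projection of the kind mentioned here is to place a newcomer at the similarity-weighted average of the 2-D map positions of their nearest neighbours in the latent space. This is a minimal sketch under that assumption (a k-NN barycentric projection, not ConfFlow's actual update mechanism); it avoids recomputing the whole embedding but, as noted above, inherits any distortion already present in the map:

```python
import math

def cos_sim(u, v):
    """Cosine similarity between two dense vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def project_newcomer(new_vec, latent, coords, k=3):
    """Place a newcomer on an existing 2-D map by averaging the map
    coordinates of their k nearest neighbours in latent space,
    weighted by cosine similarity (a simple out-of-sample extension)."""
    top = sorted(((cos_sim(new_vec, latent[a]), a) for a in latent),
                 reverse=True)[:k]
    total = sum(max(s, 0.0) for s, _ in top) or 1.0
    x = sum(max(s, 0.0) * coords[a][0] for s, a in top) / total
    y = sum(max(s, 0.0) * coords[a][1] for s, a in top) / total
    return x, y
```

A newcomer whose latent vector resembles two existing authors will land between their map positions, far from dissimilar authors.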

ConfFlow factors \ Conference                            MMSys   Multimedia   ICMR
#previously identified authors                               0          225    286
#authors with automatically identifiable Google Scholar    158         1213     67
#authors without a Google Scholar match                     13         1204    265
#authors with manually identified Google Scholar            13          170     57
#survey respondents                                          1            0      0

Table 1.  Summary statistics for each of the three conferences.


The ConfFlow 2021 edition generated new functionalities to allow researchers to browse their research interests with respect to others in a fun and novel way. More effort was given this year to improve the advertising of the application and to try and understand the community’s struggles with collaboration. Steps were also taken to make the running of ConfFlow less labour-intensive. 

Our conclusion from the many efforts made for ConfFlow 2021, the surrounding social media presence, and the survey is that, for the SIGMM population at large, encouraging more social connections outside of the normal routes is unfortunately not perceived to have significant value. It seems that, for now, more immediate forms of encouragement, e.g. initiatives during the conference to help newcomers integrate, may be a more effective route to social integration. Another option is a hybrid approach in which ConfFlow is used, for example, to identify groups for going to dinner together during the conference or for sitting at the same table during the conference banquet; however, this would still require sufficient uptake of the application. Given the myriad of motivations community members have for attending conferences, it remains an intriguing and open challenge to encourage more diverse research output from this highly interdisciplinary community.


ConfFlow 2021 was supported in part by the SIGMM Special Initiatives Fund and the Dutch NWO-funded MINGLE project number 639.022.606. We thank users who gave feedback on the application during prototyping and implementation and the General Chairs of ACM MMSys, Multimedia, and ICMR 2021 for their support.


Ekin Gedik and Hayley Hung. 2020. ConfFlow: A Tool to Encourage New Diverse Collaborations. Proceedings of the 28th ACM International Conference on Multimedia. Association for Computing Machinery, New York, NY, USA, 4562–4564. DOI:https://doi.org/10.1145/3394171.3414459


Appendix: List of Survey Questions used for our Google form


  • Context Questions
    • I am attending these conferences in 2021
    • I am publishing in these conferences in 2021
    • Please indicate the job description that best describes you.
  • General Questions about Scientific Collaboration
    • I tend to initiate collaborations with people I already know well.
    • I tend to initiate collaborations with people at the same experience level as me.
    • I am very interested in finding collaborators from a different discipline.
    • I find it very hard to identify relevant collaborators from a different discipline.
    • I find it very hard to initiate interdisciplinary collaborations even when I know who I want to work with.
    • What are the common problems you face when trying to initiate a collaboration?
    • Do these problems influence how or whether you initiate collaborations?
  • Initial contact with ConfFlow:
    • I saw announcements encouraging me to try ConfFlow
    • Did you have problems in getting in to ConfFlow? e.g. the system could not find your Google Scholar account?
    • On how many separate occasions have you used ConfFlow?
  • Motivation for using ConfFlow
    • I did not use ConfFlow because I did not have time.
    • I did not use ConfFlow because I did not find it interesting.
    • I would be interested in trying ConfFlow in the weeks leading up to or following a conference.
    • Despite not using ConfFlow, I could see how it might help advance my research work.
    • We would be very grateful for any comments or feedback on your experience of ConfFlow so we can make it more useful. Please feel free to share any remarks you might have on this topic.
  • Experience using ConfFlow
    • The visualization matched who I would expect to be close to me.
    • The visualization matched who I would expect to be far away from me.
    • ConfFlow helped me to find interesting people that I did not know before.
    • ConfFlow helped me to connect with interesting people that I did not know before.
    • ConfFlow encouraged me to think more deliberately about making connections with researchers in a different discipline.
    • I think that ConfFlow could help to advance my research work.

VQEG Column: VQEG Meeting Dec. 2021 (virtual/online)


Welcome to a new column on the ACM SIGMM Records from the Video Quality Experts Group (VQEG).
The last VQEG plenary meeting took place from 13 to 17 December 2021, organized online by the University of Surrey, UK. Over five days, more than 100 participants (from more than 20 countries across the Americas, Asia, Africa, and Europe) remotely attended the multiple sessions related to the active VQEG projects, which included more than 35 presentations and interesting discussions. This column provides an overview of this VQEG plenary meeting; all the information, minutes, and files (including the presented slides) from the meeting are available online on the VQEG meeting website.

Group picture of the VQEG Meeting 13-17 December 2021

Many of the works presented in this meeting can be relevant for the SIGMM community working on quality assessment. Particularly interesting are the new analyses and methodologies discussed within the Statistical Analysis Methods group, the new metrics and datasets presented within the No-Reference Metrics group, and the progress on the plans of the 5G Key Performance Indicators group and the Immersive Media group. We encourage those readers interested in any of these activities to check the working groups’ websites and subscribe to the corresponding reflectors, to follow them and get involved.

Overview of VQEG Projects

Audiovisual HD (AVHD)

The AVHD group investigates improved subjective and objective methods for analyzing commonly available video systems. In this sense, it has recently completed a joint project between VQEG and ITU SG12 in which 35 candidate objective quality models were submitted and evaluated through extensive validation tests. The result was the ITU-T Recommendation P.1204, which includes three standardized models: a bit-stream model, a reduced reference model, and a hybrid no-reference model. The group is currently considering extensions of this standard, which originally covered H.264, HEVC, and VP9, to include other encoders, such as AV1. Apart from this, two other projects are active under the scope of AVHD: QoE Metrics for Live Video Streaming Applications (Live QoE) and Advanced Subjective Methods (AVHD-SUB).

During the meeting, three presentations related to AVHD activities were given. In the first one, Mikolaj Leszczuk (AGH University) presented their work on secure and reliable delivery of professional live transmissions with low latency, which brought to the floor the constant need for video datasets, such as the VideoSet. In addition, Andy Quested (ITU-R Working Party 6C) led a discussion on how to assess video quality for very high-resolution (e.g., 8K, 16K, or 32K) monitors with interactive applications, which raised the discussion on the key possibility of zooming in to absorb the details of the images without pixelation. Finally, Abhinau Kumar (UT Austin) and Cosmin Stejerean (Meta) presented their work on reducing the complexity of VMAF by using features in the wavelet domain [1].

Quality Assessment for Health applications (QAH)

The QAH group works on the quality assessment of health applications, considering both subjective evaluation and the development of datasets, objective metrics, and task-based approaches. This group was recently launched and, for the moment, has been working on a topical review paper on the objective quality assessment of medical images and videos, which was submitted in December to Medical Image Analysis [2]. Rafael Rodrigues (Universidade da Beira Interior) and Lucie Lévêque (Nantes Université) presented the main details of this work in a presentation scheduled during the QAH session. The presentation also included information about the review paper published by some members of the group on methodologies for the subjective quality assessment of medical images [3] and the efforts in gathering datasets to be listed on the VQEG datasets website. In addition, Lu Zhang (IETR – INSA Rennes) presented her work on model observers for the objective quality assessment of medical images from task-based approaches, considering three tasks: detection, localization, and characterization [4]. It is also worth noting that members of this group are organizing a special session on “Quality Assessment for Medical Imaging” at the IEEE International Conference on Image Processing (ICIP), which will take place in Bordeaux (France) from 16 to 19 October 2022.

Statistical Analysis Methods (SAM)

The SAM group works on improving analysis methods both for the results of subjective experiments and for objective quality models and metrics. Currently, they are working on statistical analysis methods for subjective tests, which are discussed in their monthly meetings.

In this meeting, there were four presentations related to SAM activities. In the first one, Zhi Li and Lukáš Krasula (Netflix) exposed the lessons they learned from the subjective assessment test carried out during the development of their metric Contrast Aware Multiscale Banding Index (CAMBI) [5]. In particular, they found that some subjective tests can have perceptually unbalanced stimuli, which can cause systematic and random errors in the results. They explained the statistical data analyses they used to mitigate these errors, such as the techniques in ITU-T Recommendation P.913 (section 12.6), which can reduce the effects of the random error. The second presentation described the work by Pablo Pérez (Nokia Bell Labs), Lucjan Janowski (AGH University), Narciso Garcia (Universidad Politécnica de Madrid), and Margaret H. Pinson (NTIA/ITS) on a novel subjective assessment methodology with few observers with repetitions (FOWR) [6]. Apart from the description of the methodology, the dataset generated from the experiments is available on the Consumer Digital Video Library (CDVL). They also launched a call for other labs to repeat their experiments, which will help to discover the viability, scope, and limitations of the FOWR method and, if appropriate, to include it in ITU-T Recommendation P.913 for quasi-experimental assessments when it is not possible to have 16 to 24 subjects (e.g., pre-tests, expert assessments, and resource-limited situations); for example, performing the experiment with 4 subjects 4 times each on different days would be similar to a test with 15 subjects. In the third presentation, Irene Viola (CWI) and Lucjan Janowski (AGH University) presented their analyses of the standardized methods for subject removal in subjective tests.
In particular, the methods proposed in Recommendations ITU-R BT.500 and ITU-T P.913 were considered. The analysis showed that the first one (described in Annex 1 of Part 1) is not recommended for Absolute Category Rating (ACR) tests, while the method described in the second recommendation performs well, although further investigation into the correlation threshold used to discard subjects is required. Finally, the last presentation led the discussion on the future activities of the SAM group, where different possibilities were proposed, such as the analysis of confidence intervals for subjective tests, new methods for comparing subjective tests from more than two labs, how to extend these results to better understand the precision of objective metrics, and research on crowdsourcing experiments to make them more reliable and cost-effective. These new activities are discussed in the monthly meetings of the group.

Computer Generated Imagery (CGI)

The CGI group focuses on the quality analysis of computer-generated imagery, with a particular focus on gaming. Currently, the group is working on topics related to ITU work items, such as ITU-T Recommendation P.809 with the development of a questionnaire for interactive cloud gaming quality assessment, ITU-T Recommendation P.CROWDG related to quality assessment of gaming through crowdsourcing, ITU-T Recommendation P.BBQCG with a bit-stream-based quality assessment of cloud gaming services, and a codec comparison for computer-generated content. In addition, a presentation was delivered during the meeting by Nabajeet Barman (Kingston University/Brightcove), who presented the subjective results related to the work presented at the last VQEG meeting on the use of LCEVC for Gaming Video Streaming Applications [7]. For more information on the related activities, do not hesitate to contact the chairs of the group.

No Reference Metrics (NORM)

The NORM group is an open collaborative project for developing no-reference metrics for monitoring visual service quality. Currently, two main topics are being addressed by the group, which are discussed in regular online meetings. The first one is related to the improvement of the SI/TI metrics to resolve ambiguities that have appeared over time, with the objective of providing reference software and updating ITU-T Recommendation P.910. The second one concerns the addition of standard metadata with video-quality-assessment-related information to encoded video streams.

In this meeting, this group was one of the most active in terms of presentations, with 11 presentations on related topics. Firstly, Lukáš Krasula (Netflix) presented their Contrast Aware Multiscale Banding Index (CAMBI) [5], an objective quality metric that addresses banding degradations that are not detected by other metrics, such as VMAF and PSNR (the code is available on GitHub). Mikolaj Leszczuk (AGH University) presented their work on the automatic detection of User-Generated Content (UGC) in the wild. Also, Vignesh Menon and Hadi Amirpour (AAU Klagenfurt) presented their open-source project on the analysis and online prediction of video complexity for streaming applications. Jing Li (Alibaba) presented their work on the perceptual quality assessment of internet videos [8], proposing a new objective metric (STDAM, for the moment used internally) validated on the Youku-V1K dataset. The next presentation, delivered by Margaret Pinson (NTIA/ITS), provided a comprehensive analysis of why no-reference metrics fail, which emphasized the need to train these metrics on several datasets and test them on larger ones. The discussion also pointed out the recommendation for researchers to publish their metrics as open source in order to make them easier to validate and improve. Moreover, Balu Adsumilli and Yilin Wang (YouTube) presented a new no-reference metric for UGC, called YouVQ, based on a transfer-learning approach with pre-training on non-UGC data and re-training on UGC. This metric will be released as open source shortly, and a dataset with videos and subjective scores has also been published.
Also, Margaret Pinson (NTIA/ITS), Mikołaj Leszczuk (AGH University), Lukáš Krasula (Netflix), Nabajeet Barman (Kingston University/Brightcove), Maria Martini (Kingston University), and Jing Li (Alibaba) presented a collection of datasets for no-reference metric research, while Shahid Satti (Opticom GmbH) presented their work on encoding complexity for short video sequences. For his part, Franz Götz-Hahn (Universität Konstanz/Universität Kassel) presented their work on the creation of the KonVid-150k video quality assessment dataset [9], which can be very valuable for training no-reference metrics, and on the development of objective video quality metrics. Finally, regarding the two active topics within the NORM group mentioned above, Ioannis Katsavounidis (Meta) provided a presentation on the advances of the activity related to the inclusion of standard video quality metadata, while Lukáš Krasula (Netflix), Cosmin Stejerean (Meta), and Werner Robitza (AVEQ/TU Ilmenau) presented updates on the improvement of the SI/TI metrics for modern video systems.

Joint Effort Group (JEG) – Hybrid

The JEG group focuses on joint work to develop hybrid perceptual/bitstream metrics and on the creation of a large dataset for training such models using full-reference metrics instead of subjective scores. In this context, a collaborative project with Sky was completed and presented at the last VQEG meeting.

Related activities were presented in this meeting. In particular, Enrico Masala and Lohic Fotio Tiotsop (Politecnico di Torino) presented the updates on the recent activities carried out by the group, and their work on artificial-intelligence observers for video quality evaluation [10].

Implementer’s Guide for Video Quality Metrics (IGVQM)

The IGVQM group, whose activity started at the VQEG meeting in December 2020, works on creating an implementer's guide for video quality metrics. Its current goal is to produce a report on the accuracy of video quality metrics, following a test plan based on collecting datasets, collecting metrics and assessment methods, and carrying out statistical analyses. Ioannis Katsavounidis (Meta) provided an update on the progress, and an open call was issued for the community to contribute to this activity with datasets and metrics.

5G Key Performance Indicators (5GKPI)

The 5GKPI group studies the relationship between the key performance indicators of new communication networks (especially 5G) and the QoE of the video services running on top of them. Currently, the group is working on the definition of relevant use cases, which are discussed in monthly audio calls.

In relation to these activities, there were four presentations during this meeting. Werner Robitza (AVEQ/TU Ilmenau) presented a proposal for a KPI message format for gaming QoE over 5G networks. Also, Pablo Pérez (Nokia Bell Labs) presented their work on a parametric quality model for teleoperated driving [11] and an update on the ITU-T GSTR-5GQoE topic, related to the QoE requirements for real-time multimedia services over 5G networks. Finally, Margaret Pinson (NTIA/ITS) presented an overall description of 5G technology, including how differences in spectrum allocation per country impact the propagation, responsiveness and throughput of 5G devices.

Immersive Media Group (IMG)

The IMG group researches the quality assessment of immersive media. The group recently finished the test plan for the quality assessment of short 360-degree video sequences, which resulted in support for the development of ITU-T Recommendation P.919. Currently, the group is working on further analyses of the data gathered from the subjective tests carried out for that test plan and on the analysis of data for the quality assessment of long 360-degree videos. In addition, members of the group are contributing to ITU-T SG12 on the topic G.CMVTQS, on computational models for QoE/QoS monitoring to assess video telephony services. Finally, the group is also working on the preparation of a test plan for evaluating the QoE of immersive and interactive communication systems, which was presented by Pablo Pérez (Nokia Bell Labs) and Jesús Gutiérrez (Universidad Politécnica de Madrid). Interested readers are encouraged to contact them to join the effort.

During the meeting, there were also four presentations covering related topics. Firstly, Alexander Raake (TU Ilmenau) provided an overview of the projects within the AVT group dealing with the QoE assessment of immersive media. Also, Ashutosh Singla (TU Ilmenau) presented a 360-degree video database with higher-order ambisonics spatial audio. Maria Martini (Kingston University) presented an update on the IEEE standardization activities on Human Factors for Visual Experiences (HFVE), such as the recently submitted draft standard on deep-learning-based quality assessment and the draft standard, to be submitted shortly, on the quality assessment of light field content. Finally, Kjell Brunnström (RISE) presented their work on legibility in virtual reality, also addressing the perception of speech-to-text by deaf and hard-of-hearing users.

Intersector Rapporteur Group on Audiovisual Quality Assessment (IRG-AVQA) and Q19 Interim Meeting

Although there was no official IRG-AVQA meeting on this occasion, there were various presentations related to ITU activities addressing QoE evaluation topics. Chulhee Lee (Yonsei University) presented an overview of ITU-R activities, with a special focus on the quality assessment of HDR content, and, together with Alexander Raake (TU Ilmenau), presented an update on ongoing ITU-T activities.

Other updates

All the sessions of this meeting, and thus the presentations, were recorded and have been uploaded to YouTube. It is also worth noting that the anonymous FTP server will be closed soon; until then, files and presentations can be accessed from older browsers or via an FTP client. All the files, including those corresponding to previous VQEG meetings, will be embedded into the VQEG website over the next months. In addition, the GitHub repository with tools and subjective lab setups is still online and kept up to date. Moreover, during this meeting, it was decided to close the Joint Effort Group (JEG) and the Independent Lab Group (ILG), which can be re-established when needed. Finally, although there were not many activities in this meeting within the Quality Assessment for Computer Vision Applications (QACoViA) and the Psycho-Physiological Quality Assessment (PsyPhyQA) groups, they are still active.

The next VQEG plenary meeting will take place in Rennes (France) from 9 to 13 May 2022 and will again be face-to-face, after four online meetings.


[1] A. K. Venkataramanan, C. Stejerean, A. C. Bovik, “FUNQUE: Fusion of Unified Quality Evaluators”, arXiv:2202.11241, submitted to the IEEE International Conference on Image Processing (ICIP), 2022.
[2] R. Rodrigues, L. Lévêque, J. Gutiérrez, H. Jebbari, M. Outtas, L. Zhang, A. Chetouani, S. Al-Juboori, M. G. Martini, A. M. G. Pinheiro, “Objective Quality Assessment of Medical Images and Videos: Review and Challenges”, submitted to Medical Image Analysis, 2022.
[3] L. Lévêque, M. Outtas, L. Zhang, H. Liu, “Comparative study of the methodologies used for subjective medical image quality assessment”, Physics in Medicine & Biology, vol. 66, no. 15, Jul. 2021.
[4] L. Zhang, C. Cavaro-Ménard, P. Le Callet, “An overview of model observers”, Innovation and Research in Biomedical Engineering, vol. 35, no. 4, pp. 214-224, Sep. 2014.
[5] P. Tandon, M. Afonso, J. Sole, L. Krasula, “CAMBI: Contrast-Aware Multiscale Banding Index”, Picture Coding Symposium (PCS), Jul. 2021.
[6] P. Pérez, L. Janowski, N. García, M. Pinson, “Subjective Assessment Experiments That Recruit Few Observers With Repetitions (FOWR)”, IEEE Transactions on Multimedia (Early Access), Jul. 2021.
[7] N. Barman, S. Schmidt, S. Zadtootaghaj, M. G. Martini, “Evaluation of MPEG-5 Part 2 (LCEVC) for Live Gaming Video Streaming Applications”, Proceedings of the Mile-High Video Conference, Mar. 2022.
[8] J. Xu, J. Li, X. Zhou, W. Zhou, B. Wang, Z. Chen, “Perceptual Quality Assessment of Internet Videos”, Proceedings of the ACM International Conference on Multimedia, Oct. 2021.
[9] F. Götz-Hahn, V. Hosu, H. Lin, D. Saupe, “KonVid-150k: A Dataset for No-Reference Video Quality Assessment of Videos in-the-Wild”, IEEE Access, vol. 9, pp. 72139-72160, May 2021.
[10] L. F. Tiotsop, T. Mizdos, M. Barkowsky, P. Pocta, A. Servetti, E. Masala, “Mimicking Individual Media Quality Perception with Neural Network based Artificial Observers”, ACM Transactions on Multimedia Computing, Communications, and Applications, vol. 18, no. 1, Jan. 2022.
[11] P. Pérez, J. Ruiz, I. Benito, R. López, “A parametric quality model to evaluate the performance of tele-operated driving services over 5G networks”, Multimedia Tools and Applications, Jul. 2021.

What is the trade-off between CO2 emission and video-conferencing QoE?

Users of multimedia services naturally want the highest possible Quality of Experience (QoE) when using those services. This is especially true for video-conferencing and video streaming services, which are nowadays a large part of many users' daily lives, be it work-related Zoom calls or relaxing in front of Netflix. This has implications in terms of the energy consumed to provide those services (think of the cloud services involved, the networks, and the users' own devices), and therefore it also has an impact on the resulting CO₂ emissions. In this column, we look at the potential trade-offs between varying levels of QoE (which for video services is strongly correlated with the bit rates used) and the resulting CO₂ emissions. We also look at other factors that should be taken into account when making decisions based on these calculations, in order to provide a more holistic view of the environmental impact of these types of services and of whether they have a significant impact.

Energy Consumption and CO2 Emissions for Internet Service Delivery

Understanding the footprint of Internet service delivery is a challenging task. On the one hand, the infrastructure and software components involved in the service delivery need to be known. A very fine-grained model requires knowledge of all components along the entire service delivery chain: end-user devices, fixed or mobile access network, core network, data center and Internet service infrastructure. Furthermore, the footprint may need to consider the CO₂ emissions for producing and manufacturing the hardware components as well as the CO₂ emissions during runtime. Life cycle assessment is then necessary to obtain the CO₂ emissions per year for hardware production. However, one may argue that the infrastructure is already there, and therefore the focus can be on the energy consumption and CO₂ emissions during the runtime and delivery of the services. This is also the approach we follow here to provide quantitative numbers on the energy consumption and CO₂ emissions of Internet-based video services. On the other hand, beyond the complexity of understanding and modelling the contributors to energy consumption and CO₂ emissions, quantitative numbers are needed.

To overcome this complexity, the literature typically considers key figures on the overall data traffic and service consumption times, aggregated over users and services over a longer period of time, e.g., one year. In addition, the total energy consumption of mobile operators and data centres is considered. Together with information on, e.g., the number of base station sites, this gives some estimates, e.g., of the average power consumption per site or the average data traffic per base station site [Feh11]. As a result, we obtain measures such as energy per bit (Joule/bit), which determine the energy efficiency of a network segment. In [Yan19], the annual energy consumption of Akamai is converted to power consumption and then divided by the maximum network traffic, which again yields the energy consumption per bit of Akamai's data centers. Knowing the share of energy sources (nonrenewable energy, including coal, natural gas, oil, diesel and petroleum; renewable energy, including solar, geothermal, wind, biomass and hydropower from flowing water) allows relating the energy consumption to the total CO₂ emissions. For example, the total contribution from renewables exceeded 40% in 2021 in Germany and Finland, Norway has about 60%, and Croatia about 36% (statistics from 2020).
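As a back-of-the-envelope illustration of the energy-per-bit calculation described above, the sketch below divides an assumed annual energy consumption by an assumed annual traffic volume; all figures are made-up placeholders, not Akamai's or any operator's actual numbers.

```python
# Toy energy-per-bit estimate: divide a provider's annual energy
# consumption by its total annual traffic, as described in the text.
# All numbers below are hypothetical placeholders.

SECONDS_PER_YEAR = 365 * 24 * 3600

annual_energy_kwh = 2e8        # hypothetical annual consumption (kWh)
avg_traffic_bps = 100e12       # hypothetical sustained traffic (bits/s)

annual_energy_joule = annual_energy_kwh * 3.6e6      # 1 kWh = 3.6 MJ
annual_traffic_bits = avg_traffic_bps * SECONDS_PER_YEAR

energy_per_bit = annual_energy_joule / annual_traffic_bits  # Joule/bit
print(f"{energy_per_bit:.2e} J/bit")
```

Such a per-bit figure can then be multiplied by the traffic volume of an individual service to attribute a share of the total energy to it.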

A detailed model of the total energy consumption of mobile network services and applications is provided in [Yan19]. Their model structure considers important factors from each network segment, from the cloud to the core network, the mobile network, and end-user devices. Furthermore, service-specific energy consumption figures are provided. They found that there are strong differences depending on the service type and the resulting data traffic pattern. However, the key factors are the amount of data traffic and the duration of the service. They also consider different end-to-end network topologies (user-to-data center, user-to-user via data center, user-to-user, and P2P communication). Their model expresses the total energy consumption as the sum of the energy consumption of the different segments:

  • Smartphone: the service-specific energy depends, among other factors, on the CPU usage and the network usage (e.g., 4G) over the duration of use,
  • Base station and access network: data traffic and signalling traffic over the duration of use,
  • Wireline core network: the service-specific energy consumption of a mobile service, taking into account the data traffic volume and the energy per bit,
  • Data center: the energy per bit of the data center multiplied by the data traffic volume of the mobile service.

The Shift Project [TSP19] provides a similar model which is called the “1 Byte Model”. The computation of energy consumption is transparently provided in calculation sheets and discussed by the scientific community. As a result of the discussions [Kam20a,Kam20b], an updated model was released [TSP20] clarifying a simple bit/byte conversion issue. The suggested models in [TSP20, Kam20b] finally lead to comparable numbers in terms of energy consumption and CO₂ emission. As a side remark: Transparency and reproducibility are key for developing such complex models!

The basic idea of the 1 Byte Model for computing energy consumption is to take into account the time t of Internet service usage and the overall data volume v. The time of use directly relates to the energy consumption of the end-user device's display, but also to the allocation of network resources. The data volume transmitted through the network, but also generated or processed for cloud services, additionally drives the energy consumption. The model does not differentiate between Internet services, but different services will result in different traffic volumes over the time of use. Then, for each segment i (device, network, cloud), a linear model E_i(t,v) = a_i * t + b_i * v + c_i quantifies the energy consumption, with the coefficients for each segment provided by [TSP20]. The overall energy consumption is then E_total = E_device + E_network + E_cloud.

The CO₂ emission is then again a linear function of the total energy consumption (over the time of use of a service), which depends on the share of nonrenewable and renewable energies. Again, The Shift Project derives such coefficients for different countries, and we finally obtain CO2 = k_country * E_total.
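The structure of the 1 Byte Model described above can be sketched in a few lines of code. Note that the per-segment coefficients and the carbon-intensity factor below are placeholders for illustration, not the official values from The Shift Project's calculation sheets:

```python
# Sketch of the 1 Byte Model: per-segment linear energy models
# E_i(t, v) = a_i*t + b_i*v + c_i, summed into E_total and scaled
# by a country-specific carbon intensity k_country.
# All coefficients are placeholders, not the values from [TSP20].

COEFFS = {  # segment: (a_i [J/s], b_i [J/byte], c_i [J])
    "device":  (0.5, 1e-8, 0.0),
    "network": (0.2, 5e-8, 0.0),
    "cloud":   (0.1, 3e-8, 0.0),
}

def energy_joule(t_seconds: float, v_bytes: float) -> float:
    """E_total = E_device + E_network + E_cloud."""
    return sum(a * t_seconds + b * v_bytes + c for a, b, c in COEFFS.values())

def co2_grams(t_seconds: float, v_bytes: float, k_country: float = 1e-4) -> float:
    """CO2 = k_country * E_total (k_country in g CO2 per Joule, placeholder)."""
    return k_country * energy_joule(t_seconds, v_bytes)

# Example: one hour of video at 3 Mbps.
t = 3600.0                 # seconds of use
v = 3e6 / 8 * t            # bytes transferred at 3 Mbps
print(f"{energy_joule(t, v):.1f} J, {co2_grams(t, v):.3f} g CO2")
```

The linear structure makes the two levers explicit: energy grows with both the time of use and the data volume, and the country factor only rescales the result.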

The Trade-off between QoE and CO2 Emissions

As a use case, we consider hosting a scientific conference online through a video-conferencing service. Assume there are 200 conference participants attending the video-conferencing sessions. The conference lasts for one week, with 6 hours of online program per day. The video conference software requires the following data rates for streaming the sessions (video including audio and screen sharing):

  • high-quality video: 1.0 Mbps
  • 720p HD video: 1.5 Mbps
  • 1080p HD video: 3 Mbps

However, group video calls require even higher bandwidth. To make such experiences more immersive, for instance by using VR systems for attendance, even higher bit rates may be necessary.

A simple QoE model may map the video bit rate of the current video session to a mean opinion score (MOS). [Lop18] provides a logarithmic regression of the MOS depending on the video bit rate x in Mbps: MOS(x) = m_1 log(x) + m_2.

Then, we can connect the QoE model with the energy consumption and CO₂ emissions model from above in the following way. We assume a user attending the conference for time t. With a video bit rate x, the emerging data traffic is v = x*t. Those input parameters are now used in the 1 Byte Model for a particular device (laptop, smartphone), type of network (wired, wifi, mobile), and country (EU, US, China).

Figure 1 shows the trade-off between the MOS and the energy consumption (left y-axis). The energy consumption is mapped to CO₂ emissions by assuming the corresponding parameter for the EU and that the conference participants are all connected with a laptop. It can be seen that there is a strong increase in energy consumption and CO₂ emissions in order to reach the best possible QoE. A MOS score of 4.75 is reached with a video bit rate of roughly 11 Mbps. However, with 4.5 Mbps, a MOS score of 4 is already reached according to that logarithmic model. This logarithmic behaviour is a typical observation in QoE and is connected to the Weber-Fechner law, see [Rei10]. As a consequence, we may significantly save energy and CO₂ by not providing the maximum QoE, but “only” good quality (i.e., a MOS score of 4). The meaning of the MOS ratings is 5=Excellent, 4=Good, 3=Fair, 2=Poor, 1=Bad.

Figure 1: Trade-off between MOS and energy consumption or CO2 emission.
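To make the logarithmic model concrete, the two operating points quoted above (MOS ≈ 4.75 at roughly 11 Mbps and MOS ≈ 4 at 4.5 Mbps) suffice to recover illustrative coefficients for MOS(x) = m_1 log(x) + m_2. These fitted values are a sketch for consistency checking, not the coefficients reported in [Lop18]:

```python
import math

# Fit MOS(x) = m1*ln(x) + m2 through the two operating points quoted
# in the text: MOS(11 Mbps) ~ 4.75 and MOS(4.5 Mbps) ~ 4.0.
# Illustrative only -- not the coefficients from [Lop18].
x1, mos1 = 11.0, 4.75
x2, mos2 = 4.5, 4.0

m1 = (mos1 - mos2) / (math.log(x1) - math.log(x2))
m2 = mos1 - m1 * math.log(x1)

def mos(bitrate_mbps: float) -> float:
    """Logarithmic bitrate-to-MOS mapping, clipped to the 1..5 scale."""
    return max(1.0, min(5.0, m1 * math.log(bitrate_mbps) + m2))

# Diminishing returns: going from 4.5 to 9 Mbps doubles the traffic
# (and roughly the transmission-related energy) but gains far less
# than the 0.75 MOS separating the two anchor points.
print(round(mos(4.5), 2), round(mos(9.0), 2), round(mos(11.0), 2))
```

The clipping reflects that MOS is bounded by the rating scale, so beyond some bit rate additional traffic buys no perceptual improvement at all.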

Figure 2, therefore, visualizes the gain when delivering the video at lower quality and lower video bit rates; the gain is shown relative to the effort required for MOS 5. To get a better understanding of the meaning of those CO₂ numbers, we express the CO₂ gain in terms of thousands of kilometers driven by car. Since the CO₂ emissions depend on the share of renewable energies, we consider different countries, with the parameters provided in [TSP20]. We see that providing each conference participant a MOS score of 4 instead of 5 results in savings corresponding to driving approximately 40,000 kilometers by car, assuming the renewable energy share of the EU – this is the distance around the Earth! Assuming the energy share of China, this would save more than 90,000 kilometers. Of course, you could also cover those 90,000 kilometers by walking – which would, however, take about 2 years non-stop at a speed of 5 km/h. Note that this large amount of CO₂ emissions is calculated assuming a data rate of 15 Mbps over 5 days (6 hours per day), resulting in about 40.5 TB of data that needs to be transferred to the 200 conference participants.

Figure 2: Relating the CO2 emission in different countries for achieving this MOS to the distance by travelling in a car (in thousands of kilometers).
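The 40.5 TB figure quoted above follows directly from the stated conference parameters, as this short sanity check shows:

```python
# Sanity check of the traffic volume quoted in the text: 200 participants
# streaming at 15 Mbps for 5 days with 6 hours of program per day.
participants = 200
bitrate_bps = 15e6                  # 15 Mbps per participant
seconds = 5 * 6 * 3600              # 5 days x 6 h/day

total_bits = participants * bitrate_bps * seconds
total_terabytes = total_bits / 8 / 1e12   # bits -> bytes -> TB
print(total_terabytes)                    # -> 40.5
```

Halving the bit rate halves this volume, which is exactly the lever the QoE/CO₂ trade-off discussion exploits.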


Raising awareness of the CO₂ emissions caused by Internet service consumption is crucial. Abstract CO₂ emission numbers may be difficult to grasp, but relating them to more familiar quantities helps people understand the impact individuals have. Of course, the numbers provided here only give an impression, since the models are very simple and do not take various facets into account. However, they nicely demonstrate the potential trade-off between the QoE of end-users and sustainability in terms of energy consumption and CO₂ emissions. In fact, [Gna21] conducted qualitative interviews and found that there is a lack of awareness of the environmental impact of digital applications and services, even among digital natives. An underlying issue is that end-users often do not understand how Internet service delivery works, which infrastructure components play a role along the end-to-end service delivery path, etc. Hence, the environmental impact is unclear to many users. Our aim is thus to contribute to overcoming this issue by raising awareness on this matter, starting with simplified models and visualizations.

[Gna21] also found that users indicate a certain willingness to make compromises between their digital habits and the environmental footprint. Given global climate changes and increased environmental awareness among the general population, such a trend in willingness to make compromises may be expected to further increase in the near future. Hence, it may be interesting for service providers to empower users to decide their environmental footprint at the cost of lower (yet still satisfactory) quality. This will also reduce the costs for operators and seems to be a win-win situation if properly implemented in Internet services and user interfaces.

Nevertheless, tremendous efforts are also currently being undertaken by Internet companies to become CO₂ neutral. For example, Netflix claims in [Netflix21] that they plan to achieve net-zero greenhouse gas emissions by the close of 2022. Similarly, economic, societal, and environmental sustainability is seen as a key driver for 6G research and development [Mat21]. However, the time horizons are sometimes longer: e.g., a German provider claims it will reach climate neutrality for in-house emissions by 2025 at the latest and net-zero emissions from production to the customer by 2040 at the latest [DT21]. Hence, given the urgency of the matter, end-users and all stakeholders along the service delivery chain can significantly contribute to speeding up the process of ultimately achieving net-zero greenhouse gas emissions.


  • [TSP19] The Shift Project, “Lean ICT: Towards digital sobriety”, directed by Hugues Ferreboeuf, Tech. Rep., 2019. Available online (last accessed: March 2022)
  • [Yan19] M. Yan, C. A. Chan, A. F. Gygax, J. Yan, L. Campbell, A. Nirmalathas, and C. Leckie, “Modeling the total energy consumption of mobile network services and applications,” Energies, vol. 12, no. 1, p. 184, 2019.
  • [TSP20] Maxime Efoui Hess and Jean-Noël Geist, “Did The Shift Project really overestimate the carbon footprint of online video? Our analysis of the IEA and Carbonbrief articles”, The Shift Project website, June 2020. Available online (last accessed: March 2022)
  • [Kam20a] George Kamiya, “Factcheck: What is the carbon footprint of streaming video on Netflix?”, CarbonBrief website, February 2020. Available online (last accessed: March 2022)
  • [Kam20b] George Kamiya, “The carbon footprint of streaming video: fact-checking the headlines”, IEA website, December 2020. Available online (last accessed: March 2022)
  • [Feh11] A. Fehske, G. Fettweis, J. Malmodin, and G. Biczok, “The global footprint of mobile communications: The ecological and economic perspective,” IEEE Communications Magazine, vol. 49, no. 8, pp. 55-62, 2011.
  • [Lop18] J. P. López, D. Martín, D. Jiménez, and J. M. Menéndez, “Prediction and modeling for no-reference video quality assessment based on machine learning,” 14th International Conference on Signal-Image Technology & Internet-Based Systems (SITIS), IEEE, 2018, pp. 56-63.
  • [Gna21] V. Gnanasekaran, H. T. Fridtun, H. Hatlen, M. M. Langøy, A. Syrstad, S. Subramanian, and K. De Moor, “Digital carbon footprint awareness among digital natives: an exploratory study,” Norsk IKT-konferanse for forskning og utdanning, no. 1, pp. 99-112, November 2021.
  • [Rei10] P. Reichl, S. Egger, R. Schatz, and A. D’Alconzo, “The logarithmic nature of QoE and the role of the Weber-Fechner law in QoE assessment,” IEEE International Conference on Communications, pp. 1-5, May 2010.
  • [Netflix21] Netflix, “Environmental Social Governance 2020”, Sustainability Accounting Standards Board (SASB) Report, March 2021. Available online (last accessed: March 2022)
  • [Mat21] M. Matinmikko-Blue, S. Yrjölä, P. Ahokangas, K. Ojutkangas, and E. Rossi, “6G and the UN SDGs: Where is the Connection?,” Wireless Personal Communications, vol. 121, no. 2, pp. 1339-1360, 2021.
  • [DT21] Hannah Schauff, “Deutsche Telekom tightens its climate targets”, January 2021. Available online (last accessed: March 2022)

JPEG Column: 94th JPEG Meeting

IEC, ISO and ITU issue a call for proposals for joint standardization of image coding based on machine learning

The 94th JPEG meeting was held online from 17 to 21 January 2022. A major milestone was reached at this meeting with the release of the final call for proposals under the JPEG AI project. This standard aims at the joint standardization by IEC, ISO and ITU of the first image coding standard based on machine learning. It will offer a single-stream, compact compressed-domain representation targeting both human visualization, with significant compression efficiency improvement over commonly used image coding standards at equivalent subjective quality, and effective performance for image processing and computer vision tasks.

The JPEG AI call for proposals was issued in parallel with a call for proposals for point cloud coding based on machine learning. The latter will be conducted in parallel with JPEG AI standardization.

The 94th JPEG meeting had the following highlights:

  • JPEG AI Call for Proposals;
  • JPEG Pleno Point Cloud Call for Proposals;
  • JPEG Pleno Light Fields quality assessment;
  • JPEG AIC near perceptual lossless quality assessment;
  • JPEG Systems;
  • JPEG Fake Media draft Call for Proposals;
  • JPEG NFT exploration;
  • JPEG XS;
  • JPEG DNA explorations.

The following provides an overview of the major achievements carried out during the 94th JPEG meeting.


JPEG AI

JPEG AI targets a wide range of applications, such as cloud storage, visual surveillance, autonomous vehicles and devices, image collection storage and management, live monitoring of visual data, and media distribution. The main objective is to design a coding solution that offers significant compression efficiency improvement over coding standards in common use at equivalent subjective quality, together with effective compressed-domain processing for machine-learning-based image processing and computer vision tasks. Other key requirements include hardware/software implementation-friendly encoding and decoding, support for 8-bit and 10-bit depths, efficient coding of images with text and graphics, and progressive decoding.

During the 94th JPEG meeting, several activities toward a JPEG AI learning-based coding standard have occurred, notably the release of the Final Call for Proposals for JPEG AI, consolidated with the definition of the Use Cases and Requirements and the Common Training and Test Conditions to assure a fair and complete evaluation of the future proposals.

The final JPEG AI Call for Proposals marks an important milestone, as it is the first time that contributions are solicited towards a learning-based image coding solution. The registration deadline for JPEG AI proposals is 25 February 2022. There are three main phases for proponents to submit materials: 10 March for the proposed decoder implementation with some fixed coding model; 2 May for the submission of the proposals' bitstreams and decoded images and/or labels for the test datasets; and 18 July for the submission of the source code of the encoder, decoder and training procedure, along with the proposal description. The presentation and discussion of the JPEG AI proposals will take place during the 96th JPEG meeting. JPEG AI is a joint standardization project between IEC, ISO and ITU.

JPEG AI framework

JPEG Pleno Point Cloud Coding

JPEG Pleno is working towards the integration of various modalities of plenoptic content under a single and seamless framework, and efficient and powerful point cloud representation is a key feature of this vision. Point cloud data supports a wide range of applications for human and machine consumption, including the metaverse, autonomous driving, computer-aided manufacturing, entertainment, cultural heritage preservation, scientific research, and advanced sensing and analysis. During the 94th JPEG meeting, the JPEG Committee released a final Call for Proposals on JPEG Pleno Point Cloud Coding. This call addresses learning-based coding technologies for point cloud content and associated attributes, with emphasis on both human visualization and decompressed/reconstructed-domain 3D processing and computer vision, with competitive compression efficiency compared to commonly used point cloud coding standards and with the goal of supporting a royalty-free baseline. The call was released in conjunction with new releases of the JPEG Pleno Point Cloud Use Cases and Requirements and the JPEG Pleno Point Cloud Common Training and Test Conditions. Interested parties are invited to register for this call by the deadline of 31 March 2022.

JPEG Pleno Light Field

Besides defining coding standards, JPEG Pleno is planning the creation of quality assessment standards, i.e., a framework including subjective quality assessment protocols and objective quality assessment measures for lossy decoded data of plenoptic modalities in the context of multiple use cases. The first phase of this effort will address the light field modality and will build on the light field quality assessment tools developed by JPEG in recent years. Future activities will focus on the holographic and point cloud modalities, for both of which coding-related standardization efforts have also been initiated.


During the 94th JPEG Meeting, the first version of the use cases and requirements document was released under the Image Quality Assessment activity. The standardization process was also defined and will be carried out in two phases: in Stage I, a subjective methodology for the assessment of images with visual quality ranging from high quality to near-visually lossless will be standardized, following a collaborative process; subsequently, in Stage II, an objective image quality metric will be standardized by means of a competitive process. A tentative timeline has also been planned, with a call for contributions for subjective quality assessment methodologies to be released in July 2022 and a call for proposals for an objective quality metric planned for July 2023.

JPEG Systems

JPEG Systems produced the FDIS text for JLINK (ISO/IEC 19566-7), which allows the storage of multiple images inside JPEG files and interactive navigation between them. This enables features like virtual museum tours, real estate visits, hotspot zooms into other images, and many others. For JPEG Snack, the Committee produced the DIS text of ISO/IEC 19566-8, which allows storing multiple images for self-running multimedia experiences like animated image sequences and moving image overlays. Both texts have been submitted for their respective ballots. For JUMBF (ISO/IEC 19566-5, JPEG Universal Metadata Box Format), a second edition was initiated, combining the first edition and two amendments. New extensions include support for CBOR (Concise Binary Object Representation) and private content types. In addition, JPEG Systems started work on a technical report, ISO/IEC 19566-9, on JPEG extension mechanisms to facilitate forwards and backwards compatibility. This technical report gives guidelines for the design of future JPEG standards and summarizes existing design mechanisms.
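To illustrate what "concise binary" means in the CBOR support mentioned above, here is a toy encoder for a tiny subset of CBOR (RFC 8949): small unsigned integers, short text strings, and small maps. Real implementations cover many more types and lengths; this is only a sketch of the initial-byte encoding (major type in the high 3 bits, length/value in the low 5 bits).

```python
# Toy encoder for a tiny subset of CBOR (RFC 8949): small unsigned ints,
# short text strings, and small maps. The initial byte packs the major type
# into the high 3 bits and the value/length into the low 5 bits.

def cbor_encode(obj) -> bytes:
    if isinstance(obj, int) and 0 <= obj < 24:
        return bytes([0x00 | obj])               # major type 0: unsigned int
    if isinstance(obj, str) and len(obj.encode("utf-8")) < 24:
        data = obj.encode("utf-8")
        return bytes([0x60 | len(data)]) + data  # major type 3: text string
    if isinstance(obj, dict) and len(obj) < 24:
        out = bytes([0xA0 | len(obj)])           # major type 5: map
        for k, v in obj.items():
            out += cbor_encode(k) + cbor_encode(v)
        return out
    raise ValueError("unsupported in this sketch")

print(cbor_encode({"v": 2}).hex())  # a1617602 -- 4 bytes for a one-entry map
```

The same map serialized as JSON (`{"v": 2}`) takes 8 bytes, which is the kind of saving that motivates CBOR for embedded metadata.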

JPEG Fake Media

At its 94th meeting, the JPEG Committee released a Draft Call for Proposals for JPEG Fake Media, together with the associated Use Cases and Requirements on JPEG Fake Media. These documents are the result of the work performed by the JPEG Fake Media exploration. The scope of JPEG Fake Media is the creation of a standard that can facilitate the secure and reliable annotation of media asset creation and modifications. The standard shall address both use cases in good faith and those with malicious intent. The Committee targets the following timeline for the next steps in the standardization process:

  • April 2022: issue Final Call for Proposals
  • October 2022: evaluation of proposals
  • January 2023: first Working Draft (WD)
  • January 2024: Draft International Standard (DIS)
  • October 2024: International Standard (IS)

The JPEG Committee welcomes feedback on the JPEG Fake Media documents and invites interested experts to join the JPEG Fake Media AhG mailing list to get involved in this standardization activity.


JPEG NFT

The Ad hoc Group (AhG) on NFT resumed its exploratory work on the role of JPEG in the NFT ecosystem during the 94th JPEG meeting. Three use cases and four essential requirements were selected. The use cases include the usage of NFTs for JPEG-based digital art, NFTs for collectable JPEGs, and NFTs for JPEG micro-licensing. The following categories of critical requirements are under consideration: metadata descriptions; metadata embedding and referencing; authentication and integrity; and the format for registering media assets. As a result, the JPEG Committee published an output document titled JPEG NFT Use Cases and Requirements. Additionally, the proceedings of the third JPEG NFT and Fake Media Workshop were published, and arrangements were made to hold another combined workshop between the JPEG NFT and JPEG Fake Media groups.


JPEG XS

At the 94th JPEG meeting, a new revision of the Use Cases and Requirements for JPEG XS document was produced, as version 3.1, to clarify and improve the requirements of a frame buffer. In addition, the JPEG Committee reports that the second editions of Part 1 (Core coding system), Part 2 (Profiles and buffer models), and Part 3 (Transport and container formats) have been approved and are now scheduled for publication as International Standards. Lastly, the DAM text for Amendment 1 of JPEG XS Part 2, which contains the additional High420.12 profile and a new sublevel at 4 bpp, is ready and will be sent to final balloting for approval.


JPEG XL

JPEG XL Part 4 (Reference software) has proceeded to the FDIS stage. Work continued on the second edition of Part 1 (Core coding system), with core experiments defined to investigate the numerical stability of the edge-preserving filter and fixed-point implementations. Both Part 1 (Core coding system) and Part 2 (File format) are now published as International Standards, and preliminary support has been implemented in major web browsers and in image viewing and editing software. Consequently, JPEG XL is now ready for wide-scale adoption.


JPEG DNA

The JPEG Committee has continued its exploration of the coding of images in quaternary representations, which are particularly suitable for DNA storage. The scope of JPEG DNA is the creation of a standard for efficient coding of images that considers biochemical constraints and offers robustness to the noise introduced by the different stages of a storage process based on synthetic DNA polymers. A new version of the JPEG DNA overview document was issued and is now publicly available. It was decided to continue this exploration by validating and extending the JPEG DNA experimentation software so that it can simulate an end-to-end image storage pipeline using DNA, including biochemical noise simulation, for future exploration experiments. During the 94th JPEG meeting, the JPEG DNA committee initiated a new document describing the Common Test Conditions that should be used to evaluate different aspects of image coding for storage on DNA supports. It was also decided to prepare an outreach video explaining DNA coding and to organize the 6th workshop on JPEG DNA, with emphasis on biochemical process noise simulators. Interested parties are invited to join the effort by registering on the mailing list of the JPEG DNA AhG.
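A quaternary representation simply means four symbols per position, matching the four DNA nucleotides. The sketch below maps binary data to a nucleotide stream at 2 bits per symbol; the fixed mapping table is an assumption for illustration only, since real DNA codecs must also respect biochemical constraints such as avoiding long homopolymer runs and balancing GC content.

```python
# Illustrative mapping of binary data to a quaternary nucleotide stream.
# The fixed 2-bits-per-symbol table below is an assumption for illustration;
# real DNA coding schemes must additionally respect biochemical constraints
# (e.g., avoiding long homopolymer runs such as "AAAA...").

SYMBOLS = "ACGT"  # 00 -> A, 01 -> C, 10 -> G, 11 -> T

def bytes_to_dna(data: bytes) -> str:
    out = []
    for byte in data:
        for shift in (6, 4, 2, 0):  # most-significant bit pair first
            out.append(SYMBOLS[(byte >> shift) & 0b11])
    return "".join(out)

print(bytes_to_dna(b"\x1b"))  # 0x1b = 00 01 10 11 -> "ACGT"
```

Each byte becomes exactly four nucleotides, so a compressed image of N bytes maps to a 4N-symbol DNA sequence before constraint coding.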

Final Quote

“JPEG marks a historical milestone with the parallel release of two calls for proposals for learning based coding of images and point clouds,” said Prof. Touradj Ebrahimi, the Convenor of the JPEG Committee.

Upcoming JPEG meetings are planned as follows:

  • No. 95, to be held online, 25-29 April 2022

MPEG Column: 137th MPEG Meeting (virtual/online)

The original blog post can be found at the Bitmovin Techblog and has been modified/updated here to focus on and highlight research aspects.

The 137th MPEG meeting was once again held as an online meeting, and the official press release can be found here and comprises the following items:

  • MPEG Systems Wins Two More Technology & Engineering Emmy® Awards
  • MPEG Audio Coding selects 6DoF Technology for MPEG-I Immersive Audio
  • MPEG Requirements issues Call for Proposals for Encoder and Packager Synchronization
  • MPEG Systems promotes MPEG-I Scene Description to the Final Stage
  • MPEG Systems promotes Smart Contracts for Media to the Final Stage
  • MPEG Systems further enhanced the ISOBMFF Standard
  • MPEG Video Coding completes Conformance and Reference Software for LCEVC
  • MPEG Video Coding issues Committee Draft of Conformance and Reference Software for MPEG Immersive Video
  • JVET produces Second Editions of VVC & VSEI and finalizes VVC Reference Software
  • JVET promotes Tenth Edition of AVC to Final Draft International Standard
  • JVET extends HEVC for High-Capability Applications up to 16K and Beyond
  • MPEG Genomic Coding evaluated Responses on New Advanced Genomics Features and Technologies
  • MPEG White Papers
    • Neural Network Coding (NNC)
    • Low Complexity Enhancement Video Coding (LCEVC)
    • MPEG Immersive Video

In this column, I’d like to focus on the Emmy® Awards, video coding updates (AVC, HEVC, VVC, and beyond), and a brief update about DASH (as usual).

MPEG Systems Wins Two More Technology & Engineering Emmy® Awards

MPEG Systems is pleased to report that MPEG is being recognized this year by the National Academy of Television Arts & Sciences (NATAS) with two Technology & Engineering Emmy® Awards, for (i) “standardization of font technology for custom downloadable fonts and typography for Web and TV devices” and (ii) “standardization of HTTP encapsulated protocols”, respectively.

The first of these Emmys is related to MPEG’s Open Font Format (ISO/IEC 14496-22) and the second to MPEG Dynamic Adaptive Streaming over HTTP (i.e., MPEG DASH, ISO/IEC 23009). The MPEG DASH standard is the only commercially deployed international standard technology for media streaming over HTTP, and it is widely used in many products. MPEG developed the first edition of the DASH standard in 2012 in collaboration with 3GPP and has since produced four more editions amending the core specification with new features and extended functionality. Furthermore, MPEG has developed six other standards as additional “parts” of ISO/IEC 23009, enabling the effective use of the MPEG DASH standard with reference software and conformance testing tools, guidelines, and enhancements for additional deployment scenarios. MPEG DASH has dramatically changed the streaming industry by providing a standard that is widely adopted by various consortia, such as 3GPP, ATSC, DVB, and HbbTV, and across different sectors. The success of this standard is due to its technical excellence, the large industry participation in its development, its responsiveness to market needs, and the collaboration of all industry sectors, all under ISO/IEC JTC 1/SC 29 MPEG Systems’ standard development practices and leadership.

These are MPEG’s fifth and sixth Technology & Engineering Emmy® Awards (after MPEG-1 and MPEG-2 together with JPEG in 1996, Advanced Video Coding (AVC) in 2008, MPEG-2 Transport Stream in 2013, and ISO Base Media File Format in 2021) and MPEG’s seventh and eighth overall Emmy® Awards (including the Primetime Engineering Emmy® Awards for Advanced Video Coding (AVC) High Profile in 2008 and High-Efficiency Video Coding (HEVC) in 2017).

I have been actively contributing to the MPEG DASH standard since its inception. My initial blog post dates back to 2010, and the first edition of MPEG DASH was published in 2012. A more detailed MPEG DASH timeline provides many pointers to the Institute of Information Technology (ITEC) at the Alpen-Adria-Universität Klagenfurt and its DASH activities, which are now continued within the Christian Doppler Laboratory ATHENA. In the end, the MPEG DASH community of contributors to and users of the standards can be very proud of this achievement, coming just 10 years after the first edition was published. Thus, happy 10th birthday, MPEG DASH, and what a nice birthday gift.

Video Coding Updates

In terms of video coding, there have been many updates across various standards’ projects at the 137th MPEG Meeting.

Advanced Video Coding

The 10th edition of Advanced Video Coding (AVC, ISO/IEC 14496-10 | ITU-T H.264) has been promoted to Final Draft International Standard (FDIS), the final stage of the standardization process. Beyond various text improvements, this edition specifies a new SEI message for describing the shutter interval applied during video capture. The shutter interval can vary in video cameras, and conveying this information can be valuable for analysis and post-processing of the decoded video.

High-Efficiency Video Coding

The High-Efficiency Video Coding (HEVC, ISO/IEC 23008-2 | ITU-T H.265) standard has been extended to support high-capability applications. It defines new levels and tiers providing support for very high bit rates and video resolutions up to 16K, as well as defining an unconstrained level. This will enable the usage of HEVC in new application domains, including professional, scientific, and medical video sectors.
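A quick back-of-the-envelope calculation shows why new levels are needed for 16K: level limits constrain quantities such as luma samples per picture and per second, and 16K multiplies these dramatically. The resolutions below are the common UHD picture sizes (an illustration, not the normative level table).

```python
# Back-of-the-envelope luma sample counts illustrating why HEVC needed new
# levels/tiers for 16K. Resolutions are the common UHD picture sizes; this
# is an illustration, not the normative HEVC level table.
resolutions = {
    "4K":  (3840, 2160),
    "8K":  (7680, 4320),
    "16K": (15360, 8640),
}

for name, (w, h) in resolutions.items():
    samples = w * h  # luma samples per picture
    print(f"{name}: {samples:,} luma samples/frame, "
          f"{samples * 60:,} samples/s at 60 fps")
```

A 16K picture carries 16 times the luma samples of a 4K picture, far beyond what the previous top levels were defined for.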

Versatile Video Coding

The second editions of Versatile Video Coding (VVC, ISO/IEC 23090-3 | ITU-T H.266) and Versatile supplemental enhancement information messages for coded video bitstreams (VSEI, ISO/IEC 23002-7 | ITU-T H.274) have reached FDIS status. The new VVC version defines profiles and levels supporting larger bit depths (up to 16 bits), including some low-level coding tool modifications to obtain improved compression efficiency with high bit-depth video at high bit rates. VSEI version 2 adds SEI messages giving additional support for scalability, multi-view, display adaptation, improved stream access, and other use cases. Furthermore, a Committee Draft Amendment (CDAM) for the next amendment of VVC was issued to begin the formal approval process to enable linking VVC with the Green Metadata (ISO/IEC 23001-11) and Video Decoding Interface (ISO/IEC 23090-13) standards and add a new unconstrained level for exceptionally high capability applications such as certain uses in professional, scientific, and medical application scenarios. Finally, the reference software package for VVC (ISO/IEC 23090-16) was also completed with its achievement of FDIS status. Reference software is extremely helpful for developers of VVC devices, helping them in testing their implementations for conformance to the video coding specification.

Beyond VVC

Regarding video coding beyond VVC capabilities, the Enhanced Compression Model (ECM 3.1) shows an improvement of close to 15% for Random Access Main 10 over VTM-11.0 + JVET-V0056 (i.e., the VVC reference software). This is indeed encouraging and, in general, these activities are currently managed within two exploration experiments (EEs): the first on neural network-based (NN) video coding technology (EE1) and the second on enhanced compression beyond VVC capability (EE2). EE1 currently plans to further investigate (i) enhancement filters (loop and post) and (ii) super-resolution (JVET-Y2023). It will further investigate selected NN technologies on top of ECM 4 and the implementation of selected NN technologies in the software library, for platform-independent cross-checking and integerization. Enhanced Compression Model 4 (ECM 4) comprises new elements on MRL for intra prediction, various GPM/affine/MV-coding improvements including TM, adaptive intra MTS, coefficient sign prediction, CCSAO improvements, bug fixes, and encoder improvements (JVET-Y2025). EE2 will investigate intra prediction improvements, inter prediction improvements, improved screen content tools, and improved entropy coding (JVET-Y2024).

Research aspects: video coding performance is usually assessed in terms of compression efficiency and/or encoding runtime (time complexity). Another aspect relates to visual quality, its assessment, and metrics, specifically for neural network-based video coding technologies.
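Compression-efficiency gains like the ~15% quoted above are typically reported as Bjøntegaard Delta rate (BD-rate). The sketch below is a simplified variant: instead of the standard cubic-polynomial fit, it linearly interpolates log10(bitrate) over the overlapping PSNR range of two rate-distortion curves and averages the difference. The numbers are invented for illustration.

```python
import math

# Simplified Bjontegaard-style BD-rate sketch. The standard method fits
# cubic polynomials to the RD curves; here we linearly interpolate
# log10(bitrate) over the overlapping PSNR range and average the difference.
# Curve values are invented for illustration.

def log_rate_at(curve, psnr):
    """Linearly interpolate log10(rate) at a given PSNR on an RD curve."""
    pts = sorted(curve)  # list of (psnr_dB, rate_kbps)
    for (p0, r0), (p1, r1) in zip(pts, pts[1:]):
        if p0 <= psnr <= p1:
            t = (psnr - p0) / (p1 - p0)
            return math.log10(r0) + t * (math.log10(r1) - math.log10(r0))
    raise ValueError("PSNR outside curve range")

def bd_rate(anchor, test, samples=100):
    """Average bitrate change (%) of `test` vs `anchor` at equal quality."""
    lo = max(min(p for p, _ in anchor), min(p for p, _ in test))
    hi = min(max(p for p, _ in anchor), max(p for p, _ in test))
    diffs = []
    for i in range(samples + 1):
        psnr = lo + (hi - lo) * i / samples
        diffs.append(log_rate_at(test, psnr) - log_rate_at(anchor, psnr))
    avg = sum(diffs) / len(diffs)
    return (10 ** avg - 1) * 100  # negative = bitrate savings

anchor = [(32.0, 1000), (35.0, 2000), (38.0, 4000)]  # e.g., a VVC anchor
test   = [(32.0,  850), (35.0, 1700), (38.0, 3400)]  # e.g., an ECM run
print(f"{bd_rate(anchor, test):.1f}% bitrate change")  # -15.0% here
```

With the test curve at 85% of the anchor's bitrate at every quality point, the result is a 15% bitrate saving, which is how gains such as "close to 15% for Random Access Main 10" are expressed.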

The latest MPEG-DASH Update

Finally, I’d like to provide a brief update on MPEG-DASH! At the 137th MPEG meeting, MPEG Systems issued a draft amendment to the core MPEG-DASH specification (i.e., ISO/IEC 23009-1) regarding Extended Dependent Random Access Point (EDRAP) streaming and other extensions, which will be further discussed during the Ad-hoc Group (AhG) period (please join the dash email list for further details/announcements). Furthermore, Defects under Investigation (DuI) and Technologies under Consideration (TuC) are available here.

An updated overview of DASH standards/features can be found in the Figure below.

MPEG-DASH status of January 2021.

Research aspects: in the Christian Doppler Laboratory ATHENA, we aim to research and develop novel paradigms, approaches, (prototype) tools, and evaluation results for the phases (i) multimedia content provisioning (i.e., video coding), (ii) content delivery (i.e., video networking), and (iii) content consumption (i.e., video player incl. ABR and QoE) in the media delivery chain, as well as for (iv) end-to-end aspects, with a focus on, but not limited to, HTTP Adaptive Streaming (HAS).
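The ABR logic mentioned above can be sketched in a few lines: a throughput-based player picks the highest representation whose bitrate fits under a safety margin of the estimated network throughput. The bitrate ladder and margin below are illustrative assumptions, not values from the DASH specification (DASH standardizes the format and signaling, not the adaptation algorithm).

```python
# Minimal sketch of throughput-based adaptation at the heart of a DASH/HAS
# player: pick the highest representation whose bitrate fits under a safety
# margin of the estimated throughput. The ladder and margin are illustrative
# assumptions; DASH itself does not prescribe the adaptation algorithm.

def select_representation(bitrates_kbps, throughput_kbps, margin=0.8):
    """Return the highest sustainable bitrate (kbps); fall back to lowest."""
    budget = throughput_kbps * margin
    sustainable = [b for b in sorted(bitrates_kbps) if b <= budget]
    return sustainable[-1] if sustainable else min(bitrates_kbps)

ladder = [500, 1200, 2500, 5000, 8000]  # a typical bitrate ladder
print(select_representation(ladder, throughput_kbps=4000))  # -> 2500
print(select_representation(ladder, throughput_kbps=300))   # -> 500
```

Real players refine this with buffer-based and hybrid strategies, which is exactly the kind of player-side research mentioned for phase (iii) above.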

The 138th MPEG meeting will be again an online meeting in July 2022. Click here for more information about MPEG meetings and their developments.

Reports from ACM Multimedia Systems 2021


The 12th ACM Multimedia Systems Conference (MMSys’21) took place from September 28th through October 1st, 2021. The MMSys conference is an important forum for researchers in multimedia systems, but, due to the ongoing pandemic, the event was held in a hybrid mode: onsite in Istanbul, Turkey, and online. Organizers and chairs (Özgü Alay, Cheng-Hsin Hsu, and Ali C. Begen) worked very hard to make sure the conference was successful, both for the on-site participants (around 50) and the online participants (with a peak of 330 concurrent viewers). For a short description of the event, take a look at the text written by Ali Begen, one of the general chairs.
To encourage student authors to participate on-site, SIGMM sponsored a group of students with Student Travel Grant Awards. Students who wanted to apply for this travel grant needed to submit an online form before the submission deadline. The selection committee then chose 7 travel grant winners. The selected students received either 1,000 or 2,000 USD to cover their airline tickets as well as accommodation costs for the event. We asked the travel grant winners to share their unique experiences attending MMSys’21. The following are their comments.

Minh Nguyen

It is my honour to receive the SIGMM Student Travel Award, which gave me a golden opportunity to attend the MMSys’21 conference on-site. This conference is the first one I have attended during the Covid pandemic. I attended the whole conference, and I really appreciate the organizing committee, who tried their best to organize this conference in a hybrid mode. It was a very interesting and well-organized conference where many innovative papers were introduced. The venue was a great place with professional staff and comfortable accommodation and meeting rooms. The local Turkish food was delicious. At this conference, I was happy to meet, connect, and discuss with experts working in multimedia systems, which is close to my PhD thesis. I enjoyed the informative and passionate keynotes about cutting-edge technologies and the open discussions. In particular, many novel papers motivated me and gave me ideas for future work in my PhD thesis. The enjoyable social events also gave me a chance to visit Istanbul and experience new things. I look forward to attending future editions of the conference.

Lucas Torrealba A.

I found the conference very interesting. It was my first experience of an in-person conference and it was amazing. The research articles presented seem very relevant to me and the organization did a wonderful job as well. In addition, it seems to be quite a good idea for the future to always leave hybrid ways to participate in the conferences.

Paniz Parastar

MMSys’21 was my first in-person conference, and since it was highly organized, it raised my expectations of future conferences. Overall, many interesting topics were covered, and I will only mention a couple of instances here.
AI/ML are the hot topics of today. I believe it’s enjoyable to see them applied in various aspects of multimedia streaming and other areas, as well as in computer vision. Notably, I liked the papers in the NOSSDAV sessions on the last day of the conference that adapt learning methods to improve the QoE of users. Since I’m working on distinguishing IoT devices and their traffic on the network these days, the video clustering papers, and mainly the paper that distinguishes 360 videos from regular ones based on traffic features (i.e., flow- and packet-level features), were educational for me. Also, comparing subjective and objective quality assessment metrics under various network conditions, as one paper does, may not be a new topic, but it is always interesting to explore.
Plus, one of the most exciting talks for me was ‘Games as a Game Changer’, which was part of the Equality, Diversity, and Inclusion (EDI) Workshop. It changed my perception of games as an entertaining tool that can also help us better understand situations that don’t usually happen in our daily lives.

Ekrem Cetinkaya

MMSys’21 was my first in-person conference experience, and I can gladly say that it was above my expectations. We were welcomed by a fantastic organization, given how difficult the situation was. Everything went so smoothly, from the keynotes to paper presentations to demo sessions, and of course, social events.
Personally, two things were the most impressive for me. First, the keynote by Caitlin Kalinowski (Facebook) was given in person, and she had to fly from the U.S. to Istanbul just for this keynote. Second, the hybrid organization was thought through. There was a team of five whose duty was to make sure the conference was insightful for those who could not make it to Istanbul as well.
Moreover, the social events and the venues were really lovely. I learned that the MMSys community has a long history, and you could feel that, especially in those social events where it was an amicable environment, meaning that it was also easy for me to do some networking. Overall, I can say the MMSys conference was amazing in all aspects without any doubt. I want to thank the SIGMM committee once again for their travel grant, which made this experience possible.

Ivan Bartolec

The ACM MMSys’21 conference held in Istanbul, Turkey, was an excellent opportunity to meet, interact, and discuss ideas with researchers who are working to develop new and engaging multimedia experiences. This was my first MMSys conference, and it was an excellent environment for both learning and networking, with a thoughtfully selected collection of presentations, engaging keynotes (especially the one from a representative of Facebook), and fun social events. I found the sessions based on video or video streaming to be the most interesting and informative for my field of study. The demo sessions concept was also pretty unique, and by being on-site and seeing the demos and asking questions, I learnt a few things about practical implementations that I find incredibly useful. I’m very thankful for the opportunity to present my PhD research as part of the Doctoral symposium and to receive feedback from conference attendees as well as offline comments and ideas via email, which I gladly responded to. It was an absolute pleasure to attend MMSys’21 on-site, courtesy of the Student Travel Grant, and I look forward to visiting future editions of the conference and continuing to interact with the MMSys community.

Jesus Aguilar Armijo

It has been a pleasure to attend MMSys’2021 in person. This would not have been possible without the SIGMM Student travel award.
At the conference, I had the opportunity to attend four keynotes, where I would like to highlight the keynote from Caitlin Kalinowski (Facebook). She presented in person and showed the Virtual Reality devices of her company and future projects with emerging technologies.
I found the different sessions of MMSys truly engaging, as they related to my work on network-assisted video streaming. For example, the NOSSDAV session named “Session #1: Yet Another Streaming Session” contained the paper “Common Media Client Data (CMCD): Initial Findings”, which I found especially interesting as I use some features of this standard in my work. Moreover, the paper entitled “Beyond throughput, the next generation: a 5G dataset with channel and context metrics” (from MMSys’20 but presented at MMSys’21) in the open dataset session was particularly interesting for me, as I used their previous 4G dataset as radio traces for my last paper.
During the conference, I had the opportunity to discuss and exchange ideas with different researchers, which I found valuable and insightful. I would also like to highlight the good organization of the conference and the social events.
Finally, I presented my work in the Doctoral Symposium session, and I received some interesting questions from the audience. It was a great opportunity, and I am grateful to SIGMM, which allowed me to participate in this extraordinary experience.

Towards an updated understanding of immersive multimedia experiences

Bringing theories and measurement techniques up to date

Development of technology for immersive multimedia experiences

Immersive multimedia experiences, as the name suggests, are experiences built around media that can immerse users in an environment through different interactions. Through different technologies and approaches, immersive media emulates a physical world by means of a digital or simulated one, with the goal of creating a sense of immersion. Users are involved in a technologically driven environment where they may actively join and participate in the experiences offered by the generated world [White Paper, 2020]. As hardware and technologies develop further, these immersive experiences keep improving, with an ever more advanced feeling of immersion. This means that immersive multimedia experiences go beyond merely viewing a screen and enable far greater potential. This column aims to present and discuss the need for an up-to-date understanding of immersive media quality. First, the development of the constructs of immersion and presence over time is outlined. Second, influencing factors of immersive media quality are introduced, and related standardisation activities are discussed. Finally, the column concludes by summarising why an updated understanding of immersive media quality is urgent.

Development of theories covering immersion and presence

One of the first definitions of presence was established by Slater and Usoh in 1993, who defined presence as a “sense of presence” in a virtual environment [Slater, 1993]. This is in line with other early definitions of presence and immersion. For example, Biocca defined immersion as a system property. Those definitions focused more on the ability of the system to technically and accurately provide stimuli to users [Biocca, 1995]. As technology was only slowly becoming capable of providing systems that could generate stimuli mimicking the real world, this was naturally the main content of definitions. Quite early on, questionnaires to capture experienced immersion were introduced, such as the Igroup Presence Questionnaire (IPQ) [Schubert, 2001]. The early methods for measuring experiences also mainly focused on how well the representation of the real world was done and perceived. With maturing technology, the focus shifted more towards emotions and cognitive phenomena beyond basic stimulus generation. For example, Baños and colleagues showed that experienced emotion and immersion are related to each other and also influence the sense of presence [Baños, 2004]. Newer definitions focus more on these cognitive aspects; e.g., Nilsson defines three factors that can lead to immersion: (i) technology, (ii) narratives, and (iii) challenges, where only the technology factor is a non-cognitive one [Nilsson, 2016]. In 2018, Slater defined the place illusion as the illusion of being in a place while knowing one is not really there. This focuses on a cognitive construct, the removal of disbelief, but still attributes how the illusion is created mainly to system factors instead of cognitive ones [Slater, 2018]. In recent years, more and more activities have started to define how to measure immersive experiences as an overall construct.

Constructs of interest in relation to immersion and presence

This section discusses constructs and activities that are related to immersion and presence. In the beginning, subtypes of extended reality (XR) and the relation to user experience (UX) as well as quality of experience (QoE) are outlined. Afterwards, recent standardization activities related to immersive multimedia experiences are introduced and discussed.
Moreover, immersive multimedia experiences can be divided along many different factors, but recently the most common distinction concerns interactivity: content can be made for multi-directional viewing, as in 360-degree videos, or presented through interactive extended reality. XR technologies can be divided into mixed reality (MR), augmented reality (AR), augmented virtuality (AV), virtual reality (VR), and everything in between [Milgram, 1995]. Across all those areas, immersive multimedia experiences have found a place on the market and are providing new solutions to challenges in research as well as in industry, with a growing potential for adoption in different areas [Chuah, 2018].

While discussing immersive multimedia experiences, it is important to address user experience and the quality of immersive multimedia experiences, which can be defined following the definition of quality of experience itself [White Paper, 2012]: a measure of the delight or annoyance of a customer’s experiences with a service, where in this case the service is an immersive multimedia experience. Furthermore, in defining QoE, the terms experience and application are also defined and can be applied to immersive multimedia experiences: an experience is an individual’s stream of perception and interpretation of one or multiple events, and an application is software and/or hardware that enables usage and interaction by a user for a given purpose [White Paper 2012].

As already mentioned, immersive media experiences have an impact in many different fields, but one where the impact of immersion and presence is particularly investigated is gaming applications, along with the QoE models and optimizations that go with them. Of specific interest is the framework and standardization of subjective evaluation methods for gaming quality [ITU-T Rec. P.809, 2018]. This standardization provides instructions on how to assess QoE for gaming services under two possible test paradigms, i.e., passive viewing tests and interactive tests. However, even though detailed information about environments, test set-ups, questionnaires, and game selection materials is available, these are still focused on the gaming field and on the concepts of flow and immersion in games themselves.
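At the core of such subjective tests, whichever paradigm is used, ratings are aggregated into a Mean Opinion Score (MOS) with a confidence interval. The sketch below shows this basic analysis; the ratings are invented, and the 1.96 factor is the normal approximation (small panels should use the Student-t quantile instead).

```python
import statistics

# Sketch of the basic analysis behind subjective QoE tests such as those in
# ITU-T P.809: ratings on a 5-point ACR scale are averaged into a Mean
# Opinion Score (MOS) with a confidence interval. The ratings below are
# invented; the 1.96 factor is the normal approximation, and small panels
# should use the Student-t quantile instead.

def mos_with_ci(ratings, z=1.96):
    """Return (MOS, half-width of the confidence interval)."""
    mos = statistics.mean(ratings)
    half_width = z * statistics.stdev(ratings) / len(ratings) ** 0.5
    return mos, half_width

ratings = [4, 5, 3, 4, 4, 5, 3, 4, 4, 5]  # hypothetical ACR votes
mos, ci = mos_with_ci(ratings)
print(f"MOS = {mos:.2f} +/- {ci:.2f}")
```

Overlapping confidence intervals between two conditions are precisely what makes scales unable to differentiate systems, a point the conclusion below returns to.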

Together with gaming, another step in defining and standardizing the infrastructure of audiovisual services in telepresence, immersive environments, and virtual and extended reality has been taken with regard to defining different service scenarios of immersive live experience [ITU-T Rec. H.430.3, 2018], where live sports, entertainment, and telepresence scenarios have been described. This standardization describes several immersive live experience scenarios together with architectural frameworks for delivering such services, but does not cover all possible use cases. When mentioning immersive multimedia experiences, spatial audio, sometimes referred to as “immersive audio”, must be mentioned, as it is one of the key features especially of AR and VR experiences [Agrawal, 2019]: in AR it can provide immersive experiences on its own, while in VR it can enhance the visual information.
In order to correctly assess QoE or UX, one must be aware of all characteristics, such as user, system, content, and context, because their actual state may influence the immersive multimedia experience of the user. That is why all those characteristics are defined as influencing factors (IFs), divided into Human IFs, System IFs, and Context IFs, and are standardized for virtual reality services [ITU-T Rec. G.1035, 2021]. A particularly relevant Human IF is simulator sickness, as it specifically occurs as a result of exposure to immersive XR environments. Simulator sickness, also known as cybersickness or VR/AR sickness, is visually induced motion sickness triggered by visual stimuli and caused by the sensory conflict arising between the vestibular and visual systems. Therefore, to achieve the full potential of immersive multimedia experiences, the unwanted sensation of simulator sickness must be reduced. However, while immersive technology changes frequently and hardware improvements lead to better experiences, constant updating of requirement specifications, design, and development is needed to keep up with best practices.

Conclusion – Towards an updated understanding

Considering the development of theories, definitions, and influencing factors around the constructs of immersion and presence, two different streams can be seen. First, most early theories focus quite strongly on the technical capabilities of systems. Second, cognitive aspects and non-technical influencing factors gain importance in more recent works. Of course, in the 1990s technology was not yet ready to provide a good simulation of the real world; therefore, most activities to improve systems, including measurement techniques, were focused on that goal. In recent years, technology has developed rapidly, and a basic simulation of a virtual environment is now possible even on mobile devices such as the Oculus Quest 2. Although concepts such as immersion and presence carry over from the past, the definitions dealing with them need to capture today's technology as well. Meanwhile, systems have proven to be good real-world simulators that provide users with a feeling of presence and immersion. While standardization activity is already quite strong and industry-driven, research in many disciplines such as telecommunication still mainly relies on old questionnaires. These questionnaires mostly focus on technological/real-world simulation constructs and are thus no longer able to differentiate products and services to an optimal extent. There are some newer attempts to create measurement tools for, e.g., the social aspects of immersive systems [Li, 2019; Toet, 2021]. Measurement scales that aim to capture differences in the ability of systems to create realistic simulations can no longer reliably differentiate between systems, simply because most systems now provide realistic real-world simulations.
To advance research and industrial development in the field of immersive media, we need definitions of constructs and measurement methods that are appropriate for current technology, even if these newer definitions and measurements are not yet widely cited or used. This will lead to improved development and, in the future, better immersive media experiences.

One step towards understanding immersive multimedia experiences is reflected by QoMEX 2022. The 14th International Conference on Quality of Multimedia Experience will be held from September 5th to 7th, 2022 in Lippstadt, Germany. It will bring together leading experts from academia and industry to present and discuss current and future research on multimedia quality, Quality of Experience (QoE), and User Experience (UX). It will contribute to excellence in developing multimedia technology towards user well-being and foster exchange between multidisciplinary communities. Core topics include immersive experiences and technologies, as well as new assessment and evaluation methods; both contribute to bringing theories and measurement techniques up to date. For more details, please visit https://qomex2022.itec.aau.at.


[Agrawal, 2019] Agrawal, S., Simon, A., Bech, S., Bærentsen, K., Forchhammer, S. (2019). “Defining Immersion: Literature Review and Implications for Research on Immersive Audiovisual Experiences.” In Audio Engineering Society Convention 147. Audio Engineering Society.
[Biocca, 1995] Biocca, F., & Delaney, B. (1995). Immersive virtual reality technology. Communication in the age of virtual reality, 15(32), 10-5555.
[Baños, 2004] Baños, R. M., Botella, C., Alcañiz, M., Liaño, V., Guerrero, B., & Rey, B. (2004). Immersion and emotion: their impact on the sense of presence. Cyberpsychology & behavior, 7(6), 734-741.
[Chuah, 2018] Chuah, S. H. W. (2018). Why and who will adopt extended reality technology? Literature review, synthesis, and future research agenda. Literature Review, Synthesis, and Future Research Agenda (December 13, 2018).
[ITU-T Rec. G.1035, 2021] ITU-T Recommendation G.1035 (2021). Influencing factors on quality of experience for virtual reality services, Int. Telecomm. Union, CH-Geneva.
[ITU-T Rec. H.430.3, 2018] ITU-T Recommendation H.430.3 (2018). Service scenario of immersive live experience (ILE), Int. Telecomm. Union, CH-Geneva.
[ITU-T Rec. P.809, 2018] ITU-T Recommendation P.809 (2018). Subjective evaluation methods for gaming quality, Int. Telecomm. Union, CH-Geneva.
[Li, 2019] Li, J., Kong, Y., Röggla, T., De Simone, F., Ananthanarayan, S., De Ridder, H., … & Cesar, P. (2019, May). Measuring and understanding photo sharing experiences in social Virtual Reality. In Proceedings of the 2019 CHI Conference on Human Factors in Computing Systems (pp. 1-14).
[Milgram, 1995] Milgram, P., Takemura, H., Utsumi, A., & Kishino, F. (1995, December). Augmented reality: A class of displays on the reality-virtuality continuum. In Telemanipulator and telepresence technologies (Vol. 2351, pp. 282-292). International Society for Optics and Photonics.
[Nilsson, 2016] Nilsson, N. C., Nordahl, R., & Serafin, S. (2016). Immersion revisited: a review of existing definitions of immersion and their relation to different theories of presence. Human Technology, 12(2).
[Schubert, 2001] Schubert, T., Friedmann, F., & Regenbrecht, H. (2001). The experience of presence: Factor analytic insights. Presence: Teleoperators & Virtual Environments, 10(3), 266-281.
[Slater, 1993] Slater, M., & Usoh, M. (1993). Representations systems, perceptual position, and presence in immersive virtual environments. Presence: Teleoperators & Virtual Environments, 2(3), 221-233.
[Slater, 2018] Slater, M. (2018). Immersion and the illusion of presence in virtual reality. British Journal of Psychology, 109(3), 431-433.
[Toet, 2021] Toet, A., Mioch, T., Gunkel, S. N., Niamut, O., & van Erp, J. B. (2021). Holistic Framework for Quality Assessment of Mediated Social Communication.
[White Paper, 2012] Qualinet White Paper on Definitions of Quality of Experience (2012). European Network on Quality of Experience in Multimedia Systems and Services (COST Action IC 1003), Patrick Le Callet, Sebastian Möller and Andrew Perkis, eds., Lausanne, Switzerland, Version 1.2, March 2013.
[White Paper, 2020] Perkis, A., Timmerer, C., Baraković, S., Husić, J. B., Bech, S., Bosse, S., … & Zadtootaghaj, S. (2020). QUALINET white paper on definitions of immersive media experience (IMEx). arXiv preprint arXiv:2007.07032.

Multidisciplinary Column: An Interview with Odette Scharenborg

Odette, could you tell us a bit about your background, and what the road to your current position was?

Dr Odette Scharenborg, Associate professor and Delft Technology Fellow, SpeechLab/Multimedia Computing Group, Delft University of Technology

In high school, I enjoyed both languages and science subjects such as physics, chemistry and biology. When researching what I wanted to study, I came across “Language, Speech, and Computer Science” at Radboud University, Nijmegen, the Netherlands, which sounded like, and indeed was, an interesting combination of both languages and science. Probably inspired by one of my favourite TV series from when I was younger, Knight Rider, which featured a car you could communicate with through speech, I focused on speech technology from early on.

After obtaining my university degree in 2000, I was offered a PhD position at the same department where I had pursued my studies, on another interdisciplinary topic: computational modelling of human speech processing. My PhD project (2001-2005) combined theories about human speech processing (psycholinguistics) with tools and approaches from automatic speech recognition (itself more or less at the crossroads of electrical engineering and computer science) in order to learn more about how humans process speech and to improve automatic speech recognition (i.e., the conversion of speech into text).

After obtaining my PhD (in 2005), I went to the Speech and Hearing group in the Department of Computer Science at the University of Sheffield, UK, for a visiting post-doc position (funded by a Dutch Science Foundation (NWO) Talent Scholarship). I then returned to Radboud University for a 3-year post-doc position (funded by an NWO Veni personal fellowship) on a new computational modelling of human speech processing project. After this project, having read so much about theories of how humans process speech, I really wanted to know how researchers actually arrived at these theories. So, in the next few years, my research focused on human speech processing: first at the Max Planck Institute for Psycholinguistics, where I was trained as a psycholinguist, and subsequently, funded by an NWO Vidi personal grant, again at Radboud University, where I became Associate Professor.

Towards the end of my Vidi-project (in 2016), I started to miss the computer science component of my earlier research and decided to try to move back into automatic speech recognition. I had an idea, met two amazing speech researchers who loved my idea, and we decided to collaborate. This collaboration (still ongoing) has allowed me to move back into the field of automatic speech recognition that at that time was rapidly changing due to the rise of deep learning.

In 2018, my Vidi project and contract at Radboud University ended, and I became unemployed. I was then headhunted by a company on automatic speech recognition for health applications. However, I felt that I wanted to stay in academia. Luckily for me, shortly after joining the company, Delft University of Technology offered me a Delft Technology Fellowship, and I joined TU Delft in June 2018, where I’ve since then worked as an Associate Professor of Speech Technology.

How important is interdisciplinarity in your research on speech?

As is probably clear from my road so far, I am an interdisciplinary researcher. The field of automatic speech recognition is already interdisciplinary in that it combines electrical engineering and computer science. However, in my research, I use my knowledge about sounds and sound structures (i.e., phonetics, a subfield of linguistics) and am inspired by and use knowledge about how humans process speech (i.e., psycholinguistics). The speech signal can be researched and viewed from different angles: from the perspective of frequencies (physics), of the individual sounds (phonetics), of meaning (semantics), as a means to convey a message or intent, etc. It also contains different types of information: the words of the message, and information about the speaker's identity, age, gender, height, health status, emotional state, and native language, to name only a few.

The focus of my research is on automatic speech recognition. Automatic speech recognisers typically work well for “standard” speakers of a small number of languages. In fact, for only about 2% of all the languages in the world is there enough annotated speech data to build automatic speech recognisers. Moreover, a large portion of society does not speak in a “standard” way: “standard” speakers are native speakers of a language, without a speech impediment, without a strong regional accent, typically highly educated, and between the ages of 18 and 60 years. As you can tell, this excludes a large portion of our society: children, the elderly, people with speaking or voice disorders, deaf people, immigrants, etc. In my work, I focus on making speech technology, and particularly automatic speech recognition, available for everyone, irrespective of how one speaks and the language one speaks. In order to do so, I look at how humans process speech, as they are the best speech recognisers in existence; moreover, they can quickly adapt to idiosyncrasies in a speaker's speech or voice. I also use knowledge about how sounds and the voice differ depending on, for instance, the speaker's age or health status. So, in my research towards inclusive speech technology, I combine computer science with linguistics and psycholinguistics. Interdisciplinarity is thus at the core of my research.

What disciplines do you combine yourself in your own work?

As explained above, in my research I combine multiple research fields, most notably: computer science, different subfields of psycholinguistics (first and second language learning, native and non-native speech processing; the processing of emotions) and linguistics (primarily phonetics and a bit of conversational analysis).

Could you name a grand research challenge in your current field of work?

There are several grand research challenges in my field:

  • I already named one: making speech technology available for everyone, irrespective of how one speaks and what language one speaks. One of the grand challenges for this is to build speech technology for speech that is not only highly variable but for which also only a little amount of data is available (i.e., low resource scenarios).
  • A second grand challenge: when people speak, they often use words or phrases from another language; this is called code-switching. Automatic speech recognisers are typically built for one language, and it is very hard for them to deal with code-switched speech.
  • A third grand challenge: speech is often produced with background noise or background speech present. This deteriorates recognition performance tremendously. Dealing with all the different types of background noise and speech is another grand challenge.

You have been an active champion for diversity and inclusion. Could you tell us a bit more about your activities on these topics?

When I was growing up academically, I did not really have a female role model, and especially not female role models who had children. When I was in my late twenties and early thirties, I found this hard, because I was afraid that having children would negatively impact my chances for the next academic job and my academic career in general. Also, being not only a first-generation PhD but also a first-generation academic, it took me a really long time to realise that there were unwritten rules, to learn what these were, and to learn how to deal with them (I'm not sure I know them all even now 😉). Then, when I became Associate Professor at Radboud University, I found that several students, male and female, regularly came to talk to me about personal and academic issues, that they found my advice useful, and that I found it interesting and motivating to talk to them. I wanted to do more regarding gender equality but didn't know how.

Then in 2016, a group of senior female speech researchers organised the Young Female Researchers in Speech Science and Technology Workshop, in conjunction with Interspeech, the flagship conference of the International Speech Communication Association (ISCA), in order to attract more female students into speech PhD programs. I was invited as a mentor. This workshop was highly successful and is now a yearly workshop held in conjunction with Interspeech; I joined its organisation for 3 years. Then in 2019, having advocated gender equality on the ISCA board, of which I've been a member since 2017, I was asked to form a new committee: the committee for gender equality. Very quickly this committee started to focus on more than gender and to look at other types of diversity: sexual orientation, research areas (ISCA encompasses several speech sub-areas, including phonetics, psycholinguistics, health, automatic speech recognition, speech generation, etc.), and geographical regions. Naturally, we not only wanted to attract people from diverse backgrounds but also to retain them, so we also started to look into inclusion. The first thing our committee did was to create a website where female speech researchers who hold a PhD can list themselves. This website helps workshop and conference organisers find female researchers for organising committees, as panellists, and as keynote or invited speakers. We then went on to organise diversity and inclusion meetings at Interspeech, and for 2 years we have organised a separate ISCA-queer meeting. We held a workshop in Africa (remotely, due to the pandemic) in order to reach local speech researchers there and see where we can collaborate and where we can help them with our resources and expertise. We wrote a code of conduct for session chairs at workshops and conferences so that they know how to balance questions from people from minority and non-minority groups.
These are but a few of our activities.

In 2020 I came up with the idea for a mentoring programme within the IEEE Signal Processing Society (SPS) for students from minority groups, which was well received and funded at $50K annually. This programme, loosely based on the YFRSW format, pairs each student with a mentor from our society who supervises them for a period of 9 months, mentors them, and helps them build a network. Each student receives $4K to visit one of the IEEE SPS conferences or workshops. In the first round, we made awards to 9 students from all over the world.

In addition to these activities, I’ve also been on the board of the Delft Women in Science (DEWIS) at my university and the chair of the Diversity and Inclusion Committee (EDIT) at my faculty at TU Delft. Additionally, I am regularly asked to appear as a female role model in STEM for young girls and in Dutch media.

In getting to your current position, you experienced some personal hardships. In serving as a public role model, you have been open about these. How can we learn from these experiences to make academia a better place?

My CV shows the (many) consecutive positions I’ve had and how almost all are financed by personal grants that I obtained. These personal grants especially tend to attract a lot of praise. What my CV doesn’t show is the story behind it. It doesn’t show the many job applications I sent out, which never led to a position. It doesn’t show that for a period of more than 2 years I did not have a contract, meaning that I did not have any social security, while I was working on a post-doc position. It does not show how I was bullied at my previous university and the damage that did to my self-esteem, something I still struggle with. It does not show that I had to leave behind my 10-month-old daughter for a month and again for 2 weeks because I was expected to be in Germany for a post-doc position, nor does it show the two bouts of (mild) depression I suffered (one directly related to the bullying). I never talked about all of this because as a temporary (and young and female) researcher, you feel extremely vulnerable because you are so dependent on (the goodwill of) other, more senior researchers. If you don’t want to or cannot do a task, if you complain, they will simply find someone else and you are without a job again. On top of that, you often simply are not believed.

When I became a mentor for students and young researchers, I decided to share some of my struggles so that they knew that they were not the only ones who struggled and that I knew what they were going through. I began to receive feedback from these students that they appreciated my honesty and openness, which gave me the courage to be more open about my own issues. However, only after I received my permanent position at TU Delft (in 2019), and after becoming active in diversity and inclusion, did I very slowly dare to speak more openly to my colleagues and senior people about what had happened.

In late 2019, I was asked to talk about what it is like to be a female researcher in speech technology at the IEEE Workshop on Automatic Speech Recognition and Understanding. I thought about the story I wanted to tell, and eventually, I decided to tell my colleagues, including many of my close friends, my story: I started by showing them my CV, which received a lot of appreciative nods. I then told the story of my life, the story that is not shown by my CV, including the hardships. This resulted in many of my male colleagues and friends crying. Of course, this was never my intention. I don't think that my story is that much different from that of the average person from a minority group, and probably there are quite a few men whose stories are worse than mine. What I wanted to say was: CVs might look great, or they might not. It is important not to take CVs or facts or numbers at face value; you don't know what people go through or have done to get where they are. Everyone has a story to tell; but it is, unfortunately, the case that the bad stories far more often happen to women and other people from minority groups.

A third piece of advice: if you go through a hard time, know that you are not alone. In life in general, and in academia in particular, we celebrate successes, but failures and hardships are ignored and often considered a weakness. I strongly believe that by being open about one's hardships, you will feel better yourself and will help others deal with theirs.

Finally, we need to see fellow academics as people and treat them as people. We need to be supportive of one another, especially of our younger colleagues and of those from minority groups. We should be mentors and role models. We should listen to what they are saying and believe what they are saying. Not question what they say, but believe them when they describe something bad that has happened to them and help, because daring to speak up takes an enormous amount of strength and courage. If one dares to speak up, believe that it is true and tell them that you know how courageous they have to be to speak up.

How and in what form do you feel we as academics can be most impactful?

As academics we have many responsibilities: we teach the younger generation, and we investigate and develop new technology and theories. Some of our research has a direct impact on society, some does not yet, and some will maybe never have a direct impact on society. I don't believe that all research needs to have an impact. I do believe that we as academics can be impactful, and that is by explaining science to the general audience. What is science? Why don't scientists have answers to all questions? Why is what you do important? By explaining one's research in layman's terms, science and scientific output become easier for non-scientists to understand. It will help shape public debate. It will lead to scientific results not being dismissed as easily as often happens nowadays. At the same time, and at least as important: by talking to people from the general public, you as an academic will see the world through their eyes and look at the impact of your work in a different way, and I am convinced it will also often help explain why a certain development or technology is not adopted by society at large or by a particular group in society. In short, academics can be most impactful by communicating with the general public, and communication is, and thus should be treated as, a two-directional process.


Dr Odette Scharenborg is an Associate Professor and Delft Technology Fellow at the Multimedia Computing Group at the Delft University of Technology, the Netherlands, and the Vice-President of the International Speech Communication Association (ISCA). Her research focuses on human speech-processing inspired automatic speech processing with the aim to develop inclusive speech technology, i.e., speech technology that works for everyone irrespective of how they speak or the language they speak.

Since 2017, Odette has been on the Board of ISCA, where she also chairs the Diversity committee (since 2019) and was co-chair of the Interspeech Conferences committee and of the Technical Committee (2017-2019). From 2018 to 2021, Odette was a member of the IEEE Speech and Language Processing Technical Committee (subarea Speech Production and Perception). From 2019 to 2021, she was an Associate Editor of IEEE Signal Processing Letters, where she is now a Senior Associate Editor.

Editor Biographies


Dr Cynthia C. S. Liem is an Associate Professor in the Multimedia Computing Group of Delft University of Technology, The Netherlands, and pianist of the Magma Duo. Her research interests focus on making people discover new interests and content which would not trivially be retrieved in music and multimedia collections, assessing questions of validation and validity in data science, and fostering trustworthy and responsible AI applications when human-interpreted data is involved. She initiated and co-coordinated the European research projects PHENICX (2013-2016) and TROMPA (2018-2021), focusing on technological enrichment of digital musical heritage, and participated as technical partner in an ERASMUS+ education innovation project on Big Data for Psychological Assessment. She gained industrial experience at Bell Labs Netherlands, Philips Research and Google. She was a recipient of the Lucent Global Science and Google Anita Borg Europe Memorial scholarships, the Google European Doctoral Fellowship 2010 in Multimedia, a finalist of the New Scientist Science Talent Award 2016 for young scientists committed to public outreach, Researcher-in-Residence 2018 at the National Library of The Netherlands, general chair of the ISMIR 2019 conference, and keynote speaker at the RecSys 2021 conference. Presently, she co-leads the Future Libraries Lab with the National Library of The Netherlands, is track leader of the Trustworthy AI track in the AI for Fintech lab with the ING bank, holds a TU Delft Education Fellowship on Responsible AI teaching, and is a member of the Dutch Young Academy.


Dr Jochen Huber is Professor of Computer Science at Furtwangen University, Germany. Previously, he was a Senior User Experience Researcher with Synaptics and an SUTD-MIT postdoctoral fellow in the Fluid Interfaces Group at MIT Media Lab and the Augmented Human Lab at Singapore University of Technology and Design. He holds a Ph.D. in Computer Science and degrees in both Mathematics (Dipl.-Math.) and Computer Science (Dipl.-Inform.), all from Technische Universität Darmstadt, Germany. Jochen’s work is situated at the intersection of Human-Computer Interaction and Human Augmentation. He designs, implements and studies novel input technology in the areas of mobile, tangible & non-visual interaction, automotive UX and assistive augmentation. He has co-authored over 60 academic publications and regularly serves as program committee member in premier HCI and multimedia conferences. He was program co-chair of ACM TVX 2016 and Augmented Human 2015 and chaired tracks of ACM Multimedia, ACM Creativity and Cognition and the ACM International Conference on Interactive Surfaces and Spaces, as well as numerous workshops at ACM CHI and IUI. Further information can be found on his personal homepage: http://jochenhuber.com

Overview of Open Dataset Sessions and Benchmarking Competitions in 2021

This issue of the Dataset Column offers a review of some of the most important events in 2021 featuring special sessions on open datasets or benchmarking competitions involving multimedia data. While this is not meant to be an exhaustive list of events, we wish to underline the great diversity of subjects and dataset topics currently of interest to the multimedia community. We will present the following events:

  • 13th International Conference on Quality of Multimedia Experience (QoMEX 2021 – https://qomex2021.itec.aau.at/). We summarize six datasets from this conference that address QoE studies on haze conditions (RHVD), tele-education events (EVENT-CLASS), storytelling scenes (MTF), image compression (EPFL), virtual reality effects on gamers (5Gaming), and live stream shopping (LSS-survey).
  • Multimedia Datasets for Repeatable Experimentation at 27th International Conference on Multimedia Modeling (MDRE at MMM 2021 – https://mmm2021.cz/special-session-mdre/). We summarize the five datasets presented during the MDRE, addressing several topics like lifelogging and environmental data (MNR-HCM), cat vocalizations (CatMeows), home activities (HTAD), gastrointestinal procedure tools (Kvasir-Instrument), and keystroke and lifelogging (KeystrokeDynamics).
  • Open Dataset and Software Track at 12th ACM Multimedia Systems Conference (ODS at MMSys ’21) (https://2021.acmmmsys.org/calls.php#ods). We summarize seven datasets presented at the ODS track, targeting several topics like network statistics (Brightcove Streaming Datasets, and PePa Ping), emerging image and video modalities (Full UHD 360-Degree, 4DLFVD, and CWIPC-SXR) and human behavior data (HYPERAKTIV and Target Selection Datasets).
  • Selected datasets at 29th ACM Multimedia Conference (MM ’21) (https://2021.acmmm.org/). For a general report from ACM Multimedia 2021 please see (https://records.sigmm.org/2021/11/23/reports-from-acm-multimedia-2021/). We summarize six datasets presented during the conference, targeting several topics like food logo detection (FoodLogoDet-1500), emotional relationship recognition (ERATO), text-to-face synthesis (CelebAText-HQ), multimodal linking (M3EL), egocentric video analysis (EGO-Deliver), and quality assessment of user-generated videos (PUGCQ).
  • ImageCLEF 2021 (https://www.imageclef.org/2021). We summarize the six datasets launched for the benchmarking tasks, related to several topics like social media profile assessment (ImageCLEFaware), segmentation and labeling of underwater coral images (ImageCLEFcoral), automatic generation of web-pages (ImageCLEFdrawnUI) and medical imaging analysis (ImageCLEF-VQAMed, ImageCLEFmedCaption, and ImageCLEFmedTuberculosis).

Creating annotated datasets is even more difficult in ongoing pandemic times, and we are glad to see that many interesting datasets were published despite this unfortunate situation.

QoMEX 2021

A large number of dataset-related papers were presented at the 13th International Conference on Quality of Multimedia Experience (QoMEX 2021), organized as a fully online event in Montreal, Canada, June 14-17, 2021 (https://qomex2021.itec.aau.at/). The complete QoMEX ’21 proceedings are available in the IEEE Digital Library (https://ieeexplore.ieee.org/xpl/conhome/9465370/proceeding).

The conference did not feature a dedicated dataset session. However, datasets were very important to the conference, with a number of papers introducing new datasets or making use of broadly available ones. As a small example, six selected papers focused primarily on new datasets are listed below. Their contributions address haze, teaching in virtual reality, multiview video, image quality, cybersickness in virtual reality gaming, and shopping patterns.

A Real Haze Video Database for Haze Evaluation
Paper available at: https://ieeexplore.ieee.org/document/9465461
Chu, Y., Luo, G., and Chen, F.
College of Computer Science and Software Engineering, Shenzhen University, Shenzhen, P.R. China.
Dataset available at: https://drive.google.com/file/d/1zY0LwJyNB8u1JTAJU2X7ZkiYXsBX7BF/view?usp=sharing

The RHVD video quality assessment dataset focuses on the study of perceptual degradation caused by heavy haze conditions in real-world outdoor scenes, addressing a large number of possible use cases, including driving assistance and warning systems. The dataset was collected from the Flickr video-sharing platform and post-edited, and 40 annotators participated in the subjective quality assessment experiments.

EVENT-CLASS: Dataset of events in the classroom
Paper available at: https://ieeexplore.ieee.org/document/9465389
Orduna, M., Gutierrez, J., Manzano, C., Ruiz, D., Cabrera, J., Diaz, C., Perez, P., and Garcia, N.
Grupo de Tratamiento de Imágenes, Information Processing & Telecom. Center, Universidad Politécnica de Madrid, Spain; Nokia Bell Labs, Madrid, Spain.
Dataset available at: http://www.gti.ssr.upm.es/data/event-class

The EVENT-CLASS dataset consists of 360-degree videos that contain events and characteristics specific to the context of tele-education, composed of video and audio sequences recorded under varying conditions. The dataset addresses several topics, including quality assessment tests aimed at improving the immersive experience of remote users.

A Multi-View Stereoscopic Video Database With Green Screen (MTF) For Video Transition Quality-of-Experience Assessment
Paper available at: https://ieeexplore.ieee.org/document/9465458
Hobloss, N., Zhang, L., and Cagnazzo, M.
LTCI, Télécom-Paris, Institut Polytechnique de Paris, Paris, France; Univ Rennes, INSA Rennes, CNRS, Rennes, France.
Dataset available at: https://drive.google.com/drive/folders/1MYiD7WssSh6X2y-cf8MALNOMMish4N5j

MTF is a multi-view stereoscopic video dataset containing full-HD videos of real storytelling scenes, targeting QoE assessment for the analysis of visual artefacts that appear during automatically generated point-of-view transitions. The dataset features a large baseline of camera setups and can also be used in other computer vision applications, such as video compression, 3D video content, VR environments, and optical flow estimation.

Performance Evaluation of Objective Image Quality Metrics on Conventional and Learning-Based Compression Artifacts
Paper available at: https://ieeexplore.ieee.org/document/9465445
Testolina, M., Upenik, E., Ascenso, J., Pereira, F., and Ebrahimi, T.
Multimedia Signal Processing Group, École Polytechnique Fédérale de Lausanne (EPFL), Lausanne, Switzerland; Instituto Superior Técnico, Universidade de Lisboa – Instituto de Telecomunicações, Lisbon, Portugal.
Dataset available on request to the authors.

This dataset consists of a collection of compressed images, labelled according to subjective quality scores, targeting the evaluation of 14 objective quality metrics against the perceived human quality baseline.

The Effect of VR Gaming on Discomfort, Cybersickness, and Reaction Time
Paper available at: https://ieeexplore.ieee.org/document/9465470
Vlahovic, S., Suznjevic, M., Pavlin-Bernardic, N., and Skorin-Kapov, L.
Faculty of Electrical Engineering and Computing, University of Zagreb, Zagreb, Croatia; Faculty of Humanities and Social Sciences, University of Zagreb, Zagreb, Croatia.
Dataset available on request to the authors.

The authors present the results of a study conducted on 20 users, measuring the physiological and cognitive aftereffects of exposure to three different VR games with game mechanics centred around natural interactions. This work moves away from cybersickness as the primary measure of VR discomfort and analyzes other concepts, such as device-related discomfort, muscle fatigue and pain, and their correlations with game complexity.

Beyond Shopping: The Motivations and Experience of Live Stream Shopping Viewers
Paper available at: https://ieeexplore.ieee.org/document/9465387
Liu, X. and Kim, S. H.
Adelphi University.
Dataset available on request to the authors.

The authors propose a study of 286 live stream shopping users, where viewer motivations are examined according to the Uses and Gratifications Theory, identifying sixteen motivation constructs organized under four larger categories: entertainment, information, socialization, and experience.

MDRE at MMM 2021

The Multimedia Datasets for Repeatable Experimentation (MDRE) special session is part of the 2021 International Conference on Multimedia Modeling (MMM 2021). The session was organized by Cathal Gurrin (Dublin City University, Ireland), Duc-Tien Dang-Nguyen (University of Bergen, Norway), Björn Þór Jónsson (IT University of Copenhagen, Denmark) and Klaus Schoeffmann (Klagenfurt University, Austria). More details regarding this session can be found at: https://mmm2021.cz/special-session-mdre/

The MDRE’21 special session at MMM’21 is the third MDRE edition, and it represents an opportunity for interested researchers to submit their datasets to this track. The work submitted to MDRE is permanently available at http://mmdatasets.org, where all current and past editions of MDRE are hosted. Authors are asked to provide the dataset itself, along with a paper describing its motivation, design, and usage, a brief summary of the experiments performed on the dataset to date, and a discussion of how it can be useful to the community.

MNR-Air: An Economic and Dynamic Crowdsourcing Mechanism to Collect Personal Lifelog and Surrounding Environment Dataset
Paper available at: https://link.springer.com/chapter/10.1007%2F978-3-030-67835-7_18
Nguyen DH., Nguyen-Tai TL., Nguyen MT., Nguyen TB., Dao MS.
University of Information Technology, Ho Chi Minh City, Vietnam; University of Science, Ho Chi Minh City, Vietnam; Vietnam National University in Ho Chi Minh City, Ho Chi Minh City, Vietnam; National Institute of Information and Communications Technology, Koganei, Japan.
Dataset available on request to the authors.

The paper introduces an economical and dynamic crowdsourcing mechanism for collecting personal lifelog data and associated events. The resulting dataset, MNR-HCM, was collected in Ho Chi Minh City, Vietnam, and contains weather data, air pollution data, GPS data, lifelog images, and citizens’ cognition on a personal scale.

CatMeows: A Publicly-Available Dataset of Cat Vocalizations
Paper available at: https://link.springer.com/chapter/10.1007%2F978-3-030-67835-7_20
Ludovico L.A., Ntalampiras S., Presti G., Cannas S., Battini M., Mattiello S.
Department of Computer Science, University of Milan, Milan, Italy; Department of Veterinary Medicine, University of Milan, Milan, Italy; Department of Agricultural and Environmental Science, University of Milan, Milan, Italy.
Dataset available at: https://zenodo.org/record/4008297

The CatMeows dataset consists of vocalizations produced by 21 cats belonging to two breeds, namely Maine Coon and European Shorthair, emitted in three different contexts: brushing, isolation in an unfamiliar environment, and waiting for food. Recordings were performed with low-cost, readily available devices, thus creating a dataset representative of real-world scenarios.

HTAD: A Home-Tasks Activities Dataset with Wrist-accelerometer and Audio Features
Paper available at: https://link.springer.com/chapter/10.1007%2F978-3-030-67835-7_17
Garcia-Ceja, E., Thambawita, V., Hicks, S.A., Jha, D., Jakobsen, P., Hammer, H.L., Halvorsen, P., Riegler, M.A.
SINTEF Digital, Oslo, Norway; SimulaMet, Oslo, Norway; Oslo Metropolitan University, Oslo, Norway; UIT The Arctic University of Norway, Tromsø, Norway; Haukeland University Hospital, Bergen, Norway.
Dataset available at: https://datasets.simula.no/htad/

The HTAD dataset contains wrist-accelerometer and audio data collected during several normal day-to-day tasks, such as sweeping, brushing teeth, or watching TV. Being able to detect these types of activities is important for the creation of assistive applications and technologies that target elderly care and mental health monitoring.

Kvasir-Instrument: Diagnostic and Therapeutic Tool Segmentation Dataset in Gastrointestinal Endoscopy
Paper available at: https://link.springer.com/chapter/10.1007%2F978-3-030-67835-7_19
Jha, D., Ali, S., Emanuelsen, K., Hicks, S.A., Thambawita, V., Garcia-Ceja, E., Riegler, M.A., de Lange, T., Schmidt, P.T., Johansen, H.D., Johansen, D., Halvorsen, P.
SimulaMet, Oslo, Norway; UIT The Arctic University of Norway, Tromsø, Norway; Simula Research Laboratory, Oslo, Norway; Augere Medical AS, Oslo, Norway; Oslo Metropolitan University, Oslo, Norway; Medical Department, Sahlgrenska University Hospital-Mölndal, Gothenburg, Sweden; Department of Medical Research, Bærum Hospital, Gjettum, Norway; Karolinska University Hospital, Solna, Sweden; Department of Engineering Science, University of Oxford, Oxford, UK; Sintef Digital, Oslo, Norway.
Dataset available at: https://datasets.simula.no/kvasir-instrument/

The Kvasir-Instrument dataset consists of 590 annotated frames that contain gastrointestinal (GI) procedure tools such as snares, balloons, and biopsy forceps, and seeks to improve follow-up and the set of available information regarding the disease and the procedure itself, by providing baseline data for the tracking and analysis of the medical tools.

Keystroke Dynamics as Part of Lifelogging
Paper available at: https://link.springer.com/chapter/10.1007%2F978-3-030-67835-7_16
Smeaton, A.F., Krishnamurthy, N.G., Suryanarayana, A.H.
Insight Centre for Data Analytics, Dublin City University, Dublin, Ireland; School of Computing, Dublin City University, Dublin, Ireland.
Dataset available at: http://doras.dcu.ie/25133/

The authors created a dataset of longitudinal keystroke timing data that spans a period of up to seven months for four human participants. A detailed analysis of the data is performed, by examining the timing information associated with bigrams, or pairs of adjacently-typed alphabetic characters.

ODS at MMSys ’21

The traditional Open Dataset and Software Track (ODS) was a part of the 12th ACM Multimedia Systems Conference (MMSys ’21) organized as a hybrid event in Istanbul, Turkey, September 28 – October 1, 2021 (https://2021.acmmmsys.org/). The complete MMSys ’21: Proceedings of the 12th ACM Multimedia Systems Conference are available in the ACM Digital Library (https://dl.acm.org/doi/proceedings/10.1145/3458305).

The Session on Software, Tools and Datasets was chaired by Saba Ahsan (Nokia Technologies, Finland) and Luca De Cicco (Politecnico di Bari, Italy) on September 29, 2021, at 16:00 (UTC+3, Istanbul local time). The session opened with one-slide-per-minute intros given by the authors and then split into individual virtual booths. Seven of the thirteen presented contributions were dataset papers. A listing of the paper titles, their abstracts, and associated DOIs is included below for your convenience.

Adaptive Streaming Playback Statistics Dataset
Paper available at: https://dl.acm.org/doi/10.1145/3458305.3478444
Teixeira, T, Zhang, B., Reznik, Y.
Brightcove Inc, USA
Dataset available at: https://github.com/brightcove/streaming-dataset

The authors propose a dataset that captures statistics from a number of real-world streaming events, covering different devices (TVs, desktops, mobiles, tablets, etc.) and networks (from 2.5G, 3G, and other early-generation mobile networks to 5G and broadband). The captured data includes network and playback statistics, events, and characteristics of the encoded stream.

PePa Ping Dataset: Comprehensive Contextualization of Periodic Passive Ping in Wireless Networks
Paper available at: https://dl.acm.org/doi/10.1145/3458305.3478456
Madariaga, D., Torrealba, L., Madariaga, J., Bustos-Jimenez, J., Bustos, B.
NIC Chile Research Labs, University of Chile
Dataset available at: https://github.com/niclabs/pepa-ping-mmsys21

The PePa Ping dataset consists of real-world data with a comprehensive contextualization of Internet QoS indicators, such as round-trip time, jitter, and packet loss. The authors developed a methodology for Android devices that obtains the necessary information directly from the Linux kernel, making the indicators an accurate representation of real-world conditions.

Full UHD 360-Degree Video Dataset and Modeling of Rate-Distortion Characteristics and Head Movement Navigation
Paper available at: https://dl.acm.org/doi/10.1145/3458305.3478447
Chakareski, J., Aksu, R., Swaminathan, V., Zink, M.
New Jersey Institute of Technology; University of Alabama; Adobe Research; University of Massachusetts Amherst, USA
Dataset available at: https://zenodo.org/record/5156999#.YQ1XMlNKjUI

The authors create a dataset of 360-degree videos used to analyze the rate-distortion (R-D) characteristics of the videos. The videos are paired with head movement navigation data collected in Virtual Reality (VR), which may be used to analyze how users explore the panoramas around them in VR.

4DLFVD: A 4D Light Field Video Dataset
Paper available at: https://dl.acm.org/doi/10.1145/3458305.3478450
Hu, X., Wang, C., Pan, Y., Liu, Y., Wang, Y., Liu, Y., Zhang, L., Shirmohammadi, S.
University of Ottawa, Canada / Beijing University of Posts and Telecommunication, China
Dataset available at: https://dx.doi.org/10.21227/hz0t-8482

The authors propose a 4D Light Field (LF) video dataset collected via a custom-made camera matrix. The dataset is intended for designing and testing methods for LF video coding, processing, and streaming, providing more viewpoints and/or a higher framerate compared with similar datasets in the current literature.

CWIPC-SXR: Point Cloud dynamic human dataset for Social XR
Paper available at: https://dl.acm.org/doi/10.1145/3458305.3478452
Reimat, I., Alexiou, E., Jansen, J., Viola, I., Subramanyam, S., Cesar, P.
Centrum Wiskunde & Informatica, Netherlands
Dataset available at: https://www.dis.cwi.nl/cwipc-sxr-dataset/

The CWIPC-SXR dataset consists of 45 unique sequences covering several use cases of humans interacting in social extended reality. It is built from dynamic point clouds, which serve as a low-complexity representation in these types of systems.

HYPERAKTIV: An Activity Dataset from Patients with Attention-Deficit/Hyperactivity Disorder (ADHD)
Paper available at: https://dl.acm.org/doi/10.1145/3458305.3478454
Hicks, S. A., Stautland, A., Fasmer, O. B., Forland, W., Hammer, H. L., Halvorsen, P., Mjeldheim, K., Oedegaard, K. J., Osnes, B., Syrstad, V. E.G., Riegler, M. A.
SimulaMet; University of Bergen; Haukeland University Hospital; OsloMet, Norway
Dataset available at: http://datasets.simula.no/hyperaktiv/

The HYPERAKTIV dataset contains general patient information, health and activity data, information about mental state, and heart rate data from patients with Attention-Deficit/Hyperactivity Disorder (ADHD). It includes 51 patients with ADHD and 52 clinical control cases.

Datasets – Moving Target Selection with Delay
Paper available at: https://dl.acm.org/doi/10.1145/3458305.3478455
Liu, S. M., Claypool, M., Cockburn, A., Eg, R., Gutwin, C., Raaen, K.
Worcester Polytechnic Institute, USA; University of Canterbury, New Zealand; Kristiania University College, Norway; University of Saskatchewan, Canada
Dataset available at: https://web.cs.wpi.edu/~claypool/papers/selection-datasets/

The Selection datasets comprise data from four user studies on the effects of delay on video game actions, specifically the selection of a moving target with various pointing devices. They include performance data, such as time to selection, and user demographic data, such as age and gaming experience.

ACM MM 2021

A large number of dataset-related papers have been presented at the 29th ACM International Conference on Multimedia (MM’ 21), organized as a hybrid event in Chengdu, China, October 20 – 24, 2021 (https://2021.acmmm.org/). The complete MM ’21: Proceedings of the 29th ACM International Conference on Multimedia are available in the ACM Digital Library (https://dl.acm.org/doi/proceedings/10.1145/3474085).

There was no dedicated Dataset session among the more than 35 sessions at the MM ’21 symposium. However, the importance of datasets can be illustrated by the following statistics, quantifying how many times the term “dataset” appears among the 542 accepted papers: the term appears in the title of 7 papers, in the keywords of 66 papers, and in the abstracts of 339 papers. As a small example, six selected papers focused primarily on new datasets are listed below. There are contributions focused on social multimedia, emotion recognition, text-to-face synthesis, egocentric video analysis, emerging multimedia applications such as multimodal entity linking, and multimedia art, entertainment, and culture related to the perceived quality of video content.

FoodLogoDet-1500: A Dataset for Large-Scale Food Logo Detection via Multi-Scale Feature Decoupling Network
Paper available at: https://dl.acm.org/doi/10.1145/3474085.3475289
Hou, Q., Min, W., Wang, J., Hou, S., Zheng, Y., Jiang, S.
Shandong Normal University, Jinan, China; Institute of Computing Technology, Chinese Academy of Sciences, Beijing, China
Dataset available at: https://github.com/hq03/FoodLogoDet-1500-Dataset

FoodLogoDet-1500 is a large-scale food logo dataset with 1,500 categories, around 100,000 images, and 150,000 manually annotated food logo objects. This type of dataset is important for self-service applications in shops and supermarkets, and for copyright infringement detection on e-commerce websites.

Pairwise Emotional Relationship Recognition in Drama Videos: Dataset and Benchmark
Paper available at: https://dl.acm.org/doi/10.1145/3474085.3475493
Gao, X., Zhao, Y., Zhang, J., Cai, L.
Alibaba Group, Beijing, China
Dataset available on request to the authors.

The Emotional RelAtionship of inTeractiOn (ERATO) dataset is a large-scale multimodal dataset composed of over 30,000 interaction-centric video clips totalling around 203 hours. The videos are representative for studying the emotional relationship between the two interacting characters in each video clip.

Multi-caption Text-to-Face Synthesis: Dataset and Algorithm
Paper available at: https://dl.acm.org/doi/abs/10.1145/3474085.3475391
Sun, J., Li, Q., Wang, W., Zhao, J., Sun, Z.
Center for Research on Intelligent Perception and Computing, NLPR, CASIA, Beijing, China;
School of Artificial Intelligence, University of Chinese Academy of Sciences (UCAS), Beijing, China; Institute of North Electronic Equipment, Beijing, China
Dataset available on request to the authors.

The authors propose the CelebAText-HQ dataset, which addresses the text-to-face generation problem. Each image in the dataset is manually annotated with 10 captions, allowing proposed methods and algorithms to take multiple captions as input in order to generate highly semantically related face images.

Multimodal Entity Linking: A New Dataset and A Baseline
Paper available at: https://dl.acm.org/doi/10.1145/3474085.3475400
Gan, J., Luo, J., Wang, H., Wang, S., He, W., Huang, Q.
Key Laboratory of Intelligent Information Processing, Institute of Computing Technology, Chinese Academy of Sciences, China; School of Computer Science and Technology, University of Chinese Academy of Sciences, China; Baidu Inc.
Dataset available at: https://jingrug.github.io/research/M3EL

The authors propose M3EL, a large-scale multimodal entity linking dataset containing data associated with 1,100 movies. Reviews and images are collected, and textual and visual mentions are extracted and labelled with entities registered in Wikipedia.

Ego-Deliver: A Large-Scale Dataset for Egocentric Video Analysis
Paper available at: https://dl.acm.org/doi/10.1145/3474085.3475336
Qiu, H., He, P., Liu, S., Shao, W., Zhang, F., Wang, J., He, L., Wang, F.
East China Normal University, Shanghai, China; University of Florida, Florida, FL, United States;
Alibaba Group, Shanghai, China
Dataset available at: https://egodeliver.github.io/EgoDeliver_Dataset/

The authors propose an egocentric video benchmarking dataset consisting of videos recorded by takeaway riders during their daily work. The dataset provides over 5,000 videos with more than 139,000 multi-track annotations and 45 different attributes, representing the first attempt at understanding the takeaway delivery process from an egocentric perspective.

PUGCQ: A Large Scale Dataset for Quality Assessment of Professional User-Generated Content
Paper available at: https://dl.acm.org/doi/10.1145/3474085.3475183
Li, G., Chen, B., Zhu, L., He, Q., Fan, H., Wang, S.
Kingsoft Cloud, Beijing, China; City University of Hong Kong, Hong Kong, Hong Kong
Dataset available at: https://github.com/wlkdb/pugcq_create

The PUGCQ dataset consists of 10,000 professional user-generated videos annotated with a set of perceptual subjective ratings. In particular, during the subjective annotation and testing, human opinions were collected based not only on MOS, but also on attributes that may influence visual quality, such as faces, noise, blur, brightness, and colour.

ImageCLEF 2021

ImageCLEF is a multimedia evaluation campaign, part of the CLEF initiative (http://www.clef-initiative.eu/). The 2021 edition (https://www.imageclef.org/2021) is the 19th edition of this initiative and addresses four main research tasks in several domains: medicine, nature, social media content, and user interface processing. ImageCLEF 2021 was organized by Bogdan Ionescu (University Politehnica of Bucharest, Romania), Henning Müller (University of Applied Sciences Western Switzerland, Sierre, Switzerland), Renaud Péteri (University of La Rochelle, France), Ivan Eggel (University of Applied Sciences Western Switzerland, Sierre, Switzerland) and Mihai Dogariu (University Politehnica of Bucharest, Romania).

ImageCLEFaware
Paper available at: https://arxiv.org/abs/2012.13180
Popescu, A., Deshayes-Chossar, J., Ionescu, B.
CEA LIST, France; University Politehnica of Bucharest, Romania.
Dataset available at: https://www.imageclef.org/2021/aware

This is the first edition of the aware task at ImageCLEF, and it seeks to understand how public social media profiles affect users in four important scenarios: searching or applying for a bank loan, an accommodation, a job as a waitress/waiter, and a job in IT.

ImageCLEFcoral
Paper available at: http://ceur-ws.org/Vol-2936/paper-88.pdf
Chamberlain, J., de Herrera, A. G. S., Campello, A., Clark, A., Oliver, T. A., Moustahfid, H.
University of Essex, UK; NOAA – Pacific Islands Fisheries Science Center, USA; NOAA/ US IOOS, USA; Wellcome Trust, UK.
Dataset available at: https://www.imageclef.org/2021/coral

The ImageCLEFcoral task, currently in its third edition, proposes a dataset and benchmarking task for the automatic segmentation and labelling of underwater images that can be combined to generate 3D models for monitoring coral reefs. The task is composed of two subtasks, namely coral reef image annotation and localisation, and coral reef image pixel-wise parsing.

ImageCLEFdrawnUI
Paper available at: http://ceur-ws.org/Vol-2936/paper-89.pdf
Fichou, D., Berari, R., Tăuteanu, A., Brie, P., Dogariu, M., Ștefan, L.D., Constantin, M.G., Ionescu, B.
teleportHQ, Cluj Napoca, Romania; University Politehnica of Bucharest, Romania.
Dataset available at: https://www.imageclef.org/2021/drawnui

The second edition of ImageCLEFdrawnUI addresses the issue of creating appealing web page interfaces by fostering systems that are capable of automatically generating a web page from a hand-drawn sketch. The task is separated into two subtasks: the wireframe subtask and the screenshot subtask.

ImageCLEFmed VQA
Paper available at: http://ceur-ws.org/Vol-2936/paper-87.pdf
Abacha, A.B., Sarrouti, M., Demner-Fushman, D., Hasan, S.A., Müller, H.
National Library of Medicine, USA; CVS Health, USA; University of Applied Sciences Western Switzerland, Sierre, Switzerland.
Dataset available at: https://www.imageclef.org/2021/medical/vqa

This is the fourth edition of the ImageCLEF Medical Visual Question Answering (VQA-Med) task. This benchmark includes a task on Visual Question Answering (VQA), where participants answer questions based on the visual content of radiology images, and a second task on Visual Question Generation (VQG), which consists of generating relevant questions about radiology images.

ImageCLEFmed Caption
Paper available at: http://ceur-ws.org/Vol-2936/paper-111.pdf
Pelka, O., Abacha, A.B., de Herrera, A.G.S., Jacutprakart, J., Friedrich, C.M., Müller, H.
University of Applied Sciences and Arts Dortmund, Germany; National Library of Medicine, USA; University of Essex, UK; University of Applied Sciences Western Switzerland, Sierre, Switzerland.
Dataset available at: https://www.imageclef.org/2021/medical/caption

This is the fifth edition of the ImageCLEF Medical Concepts and Captioning task. The objective is to extract UMLS-concept annotations and/or captions from the image data that are then compared against the original text captions of the images.

ImageCLEFmed Tuberculosis
Paper available at: http://ceur-ws.org/Vol-2936/paper-90.pdf
Kozlovski, S., Liauchuk, V., Cid, Y.D., Kovalev, V., Müller, H.
Institute for Informatics, Minsk, Belarus; University of Warwick, Coventry, England, UK; University of Applied Sciences Western Switzerland, Sierre, Switzerland.
Dataset available at: https://www.imageclef.org/2021/medical/tuberculosis

Report from ACM Multimedia Systems 2021 by Neha Sharma

Neha Sharma (@NehaSharma) is a PhD student working with Dr Mohamed Hefeeda in the Network and Multimedia Systems Lab at Simon Fraser University. Her research interests are in computer vision and machine learning, with a focus on next-generation multimedia systems and applications. Her current work focuses on designing an inexpensive hyperspectral camera using a hybrid approach, leveraging both hardware and software solutions. She was named Best Social Media Reporter of the conference, an award meant to promote sharing among researchers on social networks. To celebrate this award, here is her more complete report on the conference.

Being a junior researcher in multimedia systems, I must say I feel proud to be part of this amazing community. I joined the ACM Multimedia Systems Conference (MMSys) community in 2020, when I published my first research work. I was excited to attend MMSys ’20 in Istanbul, which unfortunately shifted online due to COVID-19. I presented my first work online and got to know other researchers in the community. This year I was able to publish another work with my team and was selected to present my ideas and research plans at the Doctoral Symposium (thanks to the reviewers). MMSys ’21 gave me hope of having a full conference experience, as we were all hoping our lives would return to normal. However, as the conference date approached, things were still unclear and travel restrictions were still in place. On the bright side, MMSys ’21 went hybrid to provide an opportunity for those who could travel. Only at the very end did I decide to travel and attend MMSys ’21 in person, and I am glad I made that decision. My experience was overwhelmingly rich in terms of learning interesting research findings and making inspiring connections in the community. As the recipient of the “Best Social Media Reporter” award, enjoy the highlights of MMSys ’21 through my lens.

In light of the ongoing global pandemic, ACM MMSys ’21 was held in hybrid mode, onsite in Istanbul, Turkey, and online, on September 28 – October 1, 2021. Ali C. Begen (Ozyegin University and Networked Media, Turkey) opened the conference onsite with a warm welcome. MMSys ’21 became the first hybrid edition of the conference, with participants presenting onsite as well as remotely in real time. Participants joined from 38 different countries, and the organizing team did an amazing job of pulling off this complex event. This year the research track implemented a two-round submission system, and accepted papers included public reviews in the proceedings. This, however, was not the only first: MMSys ’21 held its first Doctoral Symposium, targeting PhD students and aiming to connect them with mentors. In addition, there were postponed celebrations for the 30th anniversary of NOSSDAV and the 25th anniversary of Packet Video.

The conference program was very well scheduled. Each day of the conference started with a keynote, and there were four insightful and inspiring keynotes from researchers working on cutting-edge multimedia technologies. The first day started with a talk titled “AI-Driven Solutions throughout Games’ Lifecycles Leveraging Big Data” by Qiaolin Chen from Tencent IEG Global. Chen discussed how AI and big data are evolving the gaming industry, from intelligent market decisions to data-driven game development. On the second day, Caitlin Kalinowski, who heads the VR Hardware team at Facebook Reality Labs, presented an interesting keynote, “Making Impossible Products: How to Get 0-to-1 Products Right”, sharing insights about Oculus and zero-to-one products. The next day, Chris Bregler (Google) talked about “Synthetic Media: New Opportunities and New Challenges”, discussing recent trends in generative media creation techniques that have opened new possibilities for societally beneficial uses but have also raised concerns about misuse. On the last day, Sriram Sethuraman and Deepthi Nandakumar (Amazon) provided insights on the “Role of ML in the Prediction of Perceptual Video Quality”. The keynotes are available on YouTube to watch on demand.

This year the conference attracted paper submissions from a range of multimedia topics including immersive media, live video, content preparation, cloud-based and mobile media processing and computer vision systems. Apart from the main research track, MMSys ’21 hosted three workshops:

  • NOSSDAV – Network and Operating System Support for Digital Audio and Video
  • MMVE – Immersive Mixed and Virtual Environment Systems
  • GameSys – Game Systems

These workshops provided an opportunity to meet those working in focused areas of multimedia research. This year MMSys hosted the inaugural ACM workshop on Game Systems (GameSys ’21), which attracted research on all aspects of computer/digital games, emphasizing networks, systems, interaction, and applications. Highlights include the work presented by Mark Claypool et al. (Worcester Polytechnic Institute), a user study measuring attribute scaling for cloud-based games.

In addition to the area-focused workshops, MMSys ’21 also hosted two grand challenges.

Another main highlight of the conference was the EDI (Equality, Diversity and Inclusion) workshop. The workshop was tailored towards PhD students, assistant professors, and starting researchers in various research organizations. The event openly discussed core topics such as parenthood, work-family policies, career paths, and EDI aspects at large. Laura Toni, Mea Wang and Ozgu Alay opened the workshop on the third day of the conference. Miriam Redi shared goals for achieving an equitable and inclusive multimedia community. Susanne Boll talked about the “25 in 25” target strategy to increase the participation of women in SIGMM to at least 25% by 2025. Other guest speakers also highlighted strategies for achieving the target diversity and inclusion in MMSys.

Last but not least, the amazing social events. Each day of the conference ended with a well-planned social event, providing a great opportunity for the in-person attendees to meet, discuss, and develop professional and social links throughout the community in a more relaxed setting. We visited historical venues such as Galata Tower and Adile Sultan Palace and enjoyed a Bosphorus boat tour with a live music band. This year MMSys planned its first inter-continental socials: we travelled from the European side to the Asian side of Istanbul (by bus and by boat). As a token of appreciation, in-person participants received Turkish delights and coffee, a set of traditional towels (peştemal), Istanbul-themed puzzles, and a hand-made Kütahya porcelain vase/coffee set as souvenirs. For me, the best part was sitting together and dining with peers, discussing the prospects of my own research and of multimedia systems research in general.

Closing the conference, Ali C. Begen announced the awards. The Best Paper Award was presented to Xiao Zhu et al. for the paper “Livelyzer: Analyzing the First-Mile Ingest Performance of Live Video Streaming”. See the full list of awards here. The conference closed with the announcement of ACM Multimedia Systems 2022, which will take place in Athlone, Ireland. Looking forward to seeing everyone again next year.