Diversity and Inclusion at ACM MMSys 2025

The 16th ACM Multimedia Systems Conference and its associated workshops (MMVE 2025 and NOSSDAV’25) were held from March 31st to April 4th, 2025, in Stellenbosch, South Africa. With the intention of creating a diverse and inclusive community for multimedia systems, several activities were carried out. In this column, we provide a brief overview of the different Diversity and Inclusion activities undertaken before and during the 16th ACM MMSys’25.

Activities Before the Conference

Grants

Thanks to the generous support of the ACM Special Interest Group on Multimedia (SIGMM), we were able to provide several grants:

  • Student Travel Grant: ACM SIGMM offered travel grants to promote the participation and diversity of students at the conference. ACM SIGMM has centralised support for standard student travel for in-person participation; any student member of SIGMM, as well as first authors of accepted papers, were eligible and encouraged to apply. Applications from female and minority students were particularly encouraged.
  • Young African Researcher Travel Awards: Travel grants specifically aimed at supporting young African researchers to attend the ACM MMSys’25 conference and its co-located workshops. These awards sought to foster diversity, promote knowledge exchange, and strengthen the multimedia systems research community across Africa. One of the eligibility criteria was to be affiliated with an African institution or to be an enrolled PhD student at an African higher learning institution.

Diversity in Papers

Prior to the conference, a brief analysis was carried out to understand how diverse and inclusive the submitted papers were. During the review process, paper reviewers indicated whether a paper tackled any aspect of diversity and inclusion by considering the following diversity criteria:

  1. Scope
  2. Approach
  3. Evaluation procedure
  4. Results
  5. Other
  6. This paper does not address any topics of diversity

It was found that the majority of papers did not address any topic of diversity, as shown in the diagram below. With these results in mind, we decided to organise a panel on how to increase diversity and inclusion in future submissions to the conference.

Activities at the Conference: The Diversity Panel

Prompted by the results of the study on diversity in MMSys’25 papers, the conference featured a panel discussion aimed at understanding how diverse and inclusive the topics, methodologies and evaluations in the papers submitted to the conference are. In particular, the topics of discussion were (i) Implementing Diversity and Inclusion in research; (ii) Challenges in implementing Diversity and Inclusion; (iii) Inclusive and Diverse Practices; and (iv) Monitoring implementation progress.

In this context, the Diversity and Inclusion panel discussion aimed to explore how researchers and academia accommodate or work together with their relevant stakeholders or communities during their research activities and during results dissemination, such as at conferences.

To enable the discussion, we invited four panellists with different expertise from both academia and industry. These were:


Professor Vali Lalioti
University of the Arts London (United Kingdom)
Vali Lalioti is a pioneering designer, computer scientist, and innovator. She is Professor of Creative XR and Robotics and Director of Programmes at the Creative Computing Institute (CCI), University of the Arts London (UAL). She played a key role in developing the world’s first Virtual Reality (VR) systems in Germany. Her research focuses on human-robot interaction, robotic movement design, and XR for societal impact, spanning well-being, healthy aging, performance art, and the future of work. She pioneered the BBC’s first Augmented Reality production (2003). As Founder-Director at CCI, she founded the Creative XR and Robotics Research Hub, which led the Institute’s expansion.

Associate Professor Ketan Mayer-Patel
University of North Carolina at Chapel Hill (United States)
Ketan Mayer-Patel is an associate professor in the Department of Computer Science at the University of North Carolina. His research generally focuses on multimedia systems, networking, and multicast applications. Currently, he is investigating model-based video coding, dynamic media coding models, and networking problems associated with multiple independent, but semantically related, media streams.

Dr. Marta Orduna
Nokia XR Lab, Madrid, Spain
Marta Orduna is a Telecommunication Engineer. She received a Bachelor of Engineering in Telecommunication Technologies and Services in 2016 and a Master in Telecommunication Engineering in 2018, both from Universidad Politécnica de Madrid (UPM). In 2023, she received her PhD from UPM with the thesis “Understanding and Assessing Quality of Experience in Immersive Communications”, graduating Cum Laude. That same year, she joined the Nokia Extended Reality Lab team in Spain, where she continues the research line of her PhD in the area of quality of experience in extended reality.

Professor Gregor Schiele
University of Duisburg-Essen, Germany
Gregor Schiele leads the research lab on Intelligent Embedded Systems at the University of Duisburg-Essen in Germany. His goal is to make deep learning algorithms so efficient that they can be executed on every computing device, including tiny embedded sensors and wearable XR devices. He is a big fan of the MMSys community and its constructive discussion culture.

Below, we provide a summary of the main findings on the four presented topics:

(1) Implementing Diversity and Inclusion in research

The panel discussion revealed that all panellists have worked or collaborated successfully with stakeholders outside their workplaces. Diversity and inclusion were mainly implemented via data collection for research work, co-creation, stakeholder workshops or seminars, and research methodologies such as working with communities in participatory action. The discussion highlighted the panellists’ experience with diversity measures and helped raise awareness in the audience of which diversity measures they could apply to their own work.

(2) Challenges in implementing Diversity and Inclusion

The following were mentioned as challenges in implementing diversity and inclusion in research and research dissemination activities:

  1. Financial and time constraints,
  2. Different organisational cultures,
  3. Difficulty finding a common time for collaboration due to different priorities,
  4. Differences in language, organisational priorities and objectives.

(3) Inclusive and Diverse Practices

The panel discussed how to build a diverse and inclusive conference in terms of topics and methodology, with a variety of approaches before, during and after the conference. The following are some of the proposed practices:

  1. Invite the authors of at least three best papers and three best demos from other related conferences to present their work and showcase their demos.
  2. Co-locate at least two conferences or workshops with related or complementary themes.
  3. Focus on relevant related conferences to find a match that could lead to running a common workshop; this builds relationships that can lead to conference co-location and hence diversity and inclusion.
  4. Invite employers of university graduates, as well as equipment vendors or manufacturers, to participate and exhibit their products at conferences.
  5. Provide avenues at conferences for stakeholders to interact with academia, such as roundtable discussions or debates between academia and industry, and keynote presentations from industry/stakeholders.
  6. Run flagship workshops or conferences with switching roles: for example, one year the conference is led by academia, with industry/stakeholders invited in minor roles; the next year the conference is led by industry/stakeholders, with academia invited in minor roles.
  7. Run a conference with tracks on diverse and inclusive themes.
  8. To accommodate policy makers at conferences, the suggestions were as follows:
    1. Invite high-profile government officials, such as ministers or presidents, to open or close a conference, where they would spend a few hours listening to a policy brief aligned with the conference theme or to the major conference resolutions.
    2. Seek an audience with the officials to briefly discuss conference resolutions, or issues raised during the conference that are relevant to their offices.

(4) Monitoring implementation progress

Panellists were asked to discuss how to track and measure progress in implementing diversity and inclusion in future ACM MMSys conferences. Generally, this point appeared difficult, or was not well understood by the panellists, and it received few and short responses. Most of the responses were recommendations to:

  1. First set performance criteria to be used as benchmarks for tracking and measuring implementation progress on diversity and inclusion.
  2. Develop stages of diversity and inclusion maturity, such as an early/infant stage, a medium/growing stage and a premium/mature stage, to guide the monitoring process, together with performance parameters and monitoring tools for the paper evaluation process and for the pre-, during- and post-conference phases.

Concluding Remarks

The Diversity and Inclusion activities carried out at ACM MMSys 2025 served as important steps in nurturing a diverse and inclusive multimedia systems community. The activities comprised travel grants supporting underrepresented and young African researchers, together with a panel discussion at the conference. Although the paper review analysis revealed that diversity topics remain underrepresented in paper submissions, this finding served as a catalyst for a rigorous panel discussion that led to concrete recommendations. Going forward, the multimedia systems community is encouraged to adopt a framework with progress stages and performance parameters to monitor and track the progress of diversity and inclusion in the ACM MMSys conference series.

VQEG Column: VQEG Meeting November 2024

Introduction

The last plenary meeting of the Video Quality Experts Group (VQEG) was held online by the Institute for Telecommunication Sciences (ITS) of the National Telecommunications and Information Administration (NTIA) from November 18th to 22nd, 2024. The meeting was attended by 70 participants from industry and academic institutions in 17 different countries worldwide.

The meeting was dedicated to presenting updates and discussing topics related to the ongoing projects within VQEG. All the related information, minutes, and files from the meeting are available online on the VQEG meeting website, and video recordings of the meeting are available on YouTube.

All the topics mentioned below can be of interest to the SIGMM community working on quality assessment, but special attention can be devoted to the creation of a new group focused on Subjective and objective assessment of GenAI content (SOGAI) and to the recent contribution of the Immersive Media Group (IMG) to the International Telecommunication Union (ITU) towards Rec. ITU-T P.IXC for the evaluation of the Quality of Experience (QoE) of immersive interactive communication systems. Finally, it is worth noting that Ioannis Katsavounidis (Meta, US) joins Kjell Brunnström (RISE, Sweden) as co-chair of VQEG, substituting Margaret Pinson (NTIA/ITS).

Readers of these columns interested in the ongoing projects of VQEG are encouraged to subscribe to their corresponding reflectors to follow the activities going on and to get involved in them.

Group picture of the online meeting

Overview of VQEG Projects

Audiovisual HD (AVHD)

The AVHD group works on developing and validating subjective and objective methods to analyze commonly available video systems. In this meeting, Lucjan Janowski (AGH University of Krakow, Poland) and Margaret Pinson (NTIA/ITS) presented their proposal to fix the wording related to the realism and validity of an experiment, based on experience in the psychology domain that addresses the important concept of describing how far results from a lab experiment can be used outside the laboratory.

In addition, given that there are no current joint activities of the group, the AVHD project will become dormant, with the possibility to be activated when new activities are planned.

Statistical Analysis Methods (SAM)

The SAM group investigates analysis methods both for the results of subjective experiments and for objective quality models and metrics. In addition to a discussion on the future activities of the group, led by its chairs Ioannis Katsavounidis (Meta, US), Zhi Li (Netflix, US), and Lucjan Janowski (AGH University of Krakow, Poland), several presentations were delivered during the meeting.

No Reference Metrics (NORM)

The NORM group is a collaborative effort to develop no-reference metrics for monitoring visual service quality. In this context, Ioannis Katsavounidis (Meta, US) and Margaret Pinson (NTIA/ITS) summarized recent discussions within the group on developing best practices for subjective test methods when analyzing Artificial Intelligence (AI) generated images and videos. This discussion resulted in the creation of a new VQEG project called Subjective and objective assessment of GenAI content (SOGAI) to investigate subjective and objective methods to evaluate the content produced by generative AI approaches.

Emerging Technologies Group (ETG)

The ETG group focuses on various aspects of multimedia that, although they are not necessarily directly related to “video quality”, can indirectly impact the work carried out within VQEG and are not addressed by any of the existing VQEG groups. In particular, this group aims to provide a common platform for people to gather together and discuss new emerging topics, possible collaborations in the form of joint survey papers, funding proposals, etc. During this meeting, Abhijay Ghildyal (Portland State University, US), Saman Zadtootaghaj (Sony Interactive Entertainment, Germany), and Nabajeet Barman (Sony Interactive Entertainment, UK) presented their work on quality assessment of AI generated content and AI enhanced content. In addition, Matthias Wien (RWTH Aachen University, Germany) presented the approach, design and methodology for the evaluation of AI-based Point Cloud Compression in the corresponding Call for Proposals in MPEG. Finally, Abhijay Ghildyal (Portland State University, US) presented his work on how foundation models boost low-level perceptual similarity metrics, investigating the potential of using intermediate features or activations from these models for low-level image quality assessment, and showing that such metrics can outperform existing ones without requiring additional training.

Joint Effort Group (JEG) – Hybrid

The JEG-Hybrid group addresses several areas of Video Quality Assessment (VQA), such as the creation of a large dataset for training such models using full-reference metrics instead of subjective scores. In addition, the group includes the VQEG project Implementer’s Guide for Video Quality Metrics (IGVQM). The chair of this group, Enrico Masala (Politecnico di Torino, Italy), presented updates on the latest ongoing activities, including the plans for experiments within the IGVQM project to get feedback from other VQEG members.

In addition to this, Lohic Fotio Tiotsop (Politecnico di Torino, Italy) delivered two presentations. The first one focused on the prediction of the opinion score distribution via AI-based observers in media quality assessment, while the second one analyzed unexpected scoring behaviors in image quality assessment comparing controlled and crowdsourced subjective tests.

Immersive Media Group (IMG)

The IMG group researches the quality assessment of immersive media technologies. Currently, the main joint activity of the group is the development of a test plan to evaluate the QoE of immersive interactive communication systems, carried out in collaboration with ITU-T through the work item P.IXC. In this meeting, Pablo Pérez (Nokia XR Lab, Spain), Marta Orduna (Nokia XR Lab, Spain), and Jesús Gutiérrez (Universidad Politécnica de Madrid, Spain) presented the status of Rec. ITU-T P.IXC, which the group has been writing based on the joint test plan developed over the last months, and which was submitted to ITU and discussed at its meeting in January 2025.

Also, in relation to this test plan, Lucjan Janowski (AGH University of Krakow, Poland) and Margaret Pinson (NTIA/ITS) presented an overview of ITU recommendations for interactive experiments that can be used in the IMG context.

In relation to other topics addressed by IMG, Emin Zerman (Mid Sweden University, Sweden) delivered two presentations. The first one presented the BASICS dataset, which contains a representative range of nearly 1,500 point clouds assessed by thousands of participants to enable robust quality assessment of 3D scenes. The approach involved a careful selection of diverse source scenes and the application of specific “distortions” to simulate real-world compression impacts, including traditional and learning-based methods. The second presentation described a spherical light field database (SLFDB) for immersive telecommunication and telepresence applications, which comprises 60-view omnidirectional captures across 20 scenes, providing a comprehensive basis for telepresence research.

Quality Assessment for Computer Vision Applications (QACoViA)

The QACoViA group addresses the study of visual quality requirements for computer vision methods, where the final user is an algorithm. In this meeting, Mehr un Nisa (AGH University of Krakow, Poland) presented a comparative performance analysis of deep learning architectures in underwater image classification. In particular, the study assessed the performance of the VGG-16, EfficientNetB0, and SimCLR models in classifying 5,000 underwater images. The results reveal each model’s strengths and weaknesses, providing insights for future improvements in underwater image analysis.

5G Key Performance Indicators (5GKPI)

The 5GKPI group studies the relationship between the key performance indicators of new 5G networks and the QoE of video services running on top of them. In this meeting, Pablo Pérez (Nokia XR Lab, Spain), Francois Blouin (Meta, US), and others presented progress on the 5G-KPI White Paper, sharing some of the ideas on QoS-to-QoE modeling that the group has been working on, in order to get feedback from other VQEG members.

Multimedia Experience and Human Factors (MEHF)

The MEHF group focuses on the human factors influencing audiovisual and multimedia experiences, facilitating a comprehensive understanding of how human factors impact the perceived quality of multimedia content. In this meeting, Dominika Wanat (AGH University of Krakow, Poland) presented MANIANA (Mobile Appliance for Network Interrupting, Analysis & Notorious Annoyance), an IoT device for testing the QoS and QoE of applications under home network conditions. Built on a Raspberry Pi 4 minicomputer and open-source solutions, it allows safe, robust, and universal testing of applications.

Other updates

Apart from this, it is worth noting that, although no progress was presented at this meeting, the Quality Assessment for Health Applications (QAH) group is still active and focused on the quality assessment of health applications. It addresses subjective evaluation, the generation of datasets, the development of objective metrics, and task-based approaches.

In addition, the Computer Generated Imagery (CGI) project became dormant, since its recent activities can be covered by other existing groups such as ETG and SOGAI.

Also, at this meeting Margaret Pinson (NTIA/ITS) stepped down as co-chair of VQEG, and Ioannis Katsavounidis (Meta, US) is the new co-chair together with Kjell Brunnström (RISE, Sweden).

Finally, as already announced on the VQEG website, the next VQEG plenary meeting will be hosted by Meta at its Menlo Park campus, California, in the United States from May 5th to 9th, 2025. For more information, see: https://vqeg.org/meetings-home/vqeg-meeting-information/

JPEG Column: 106th JPEG Meeting

JPEG AI becomes an International Standard

The 106th JPEG meeting was held online from January 6 to 10, 2025. During this meeting, the first image coding standard based on machine learning technology, JPEG AI, was sent for publication as an International Standard. This is a major achievement, as it aligns JPEG with major trends in imaging technologies and provides an efficient standardized solution for image coding, with nearly 30% improvement over the most advanced state-of-the-art solutions. JPEG AI has been developed under the auspices of three major standardization organizations: ISO, IEC and ITU.

The following sections summarize the main highlights of the 106th JPEG meeting.

  • JPEG AI – the first International Standard for end-to-end learning-based image coding
  • JPEG Trust – a framework for establishing trust in digital media
  • JPEG XE – lossless coding of event-based vision
  • JPEG AIC – assessment of the visual quality of high-fidelity images
  • JPEG Pleno – standard framework for representing plenoptic data
  • JPEG Systems – file formats and metadata
  • JPEG DNA – DNA-based storage of digital pictures
  • JPEG XS – end-to-end low latency and low complexity image coding
  • JPEG XL – new image coding system
  • JPEG 2000
  • JPEG RF – exploration on Radiance Fields

JPEG AI

At its 106th meeting, the JPEG Committee approved publication of the text of JPEG AI, the first International Standard for end-to-end learning-based image coding. This achievement marks a significant milestone in the field of digital imaging and compression, offering a new approach for efficient, high-quality image storage and transmission.

The scope of JPEG AI is the creation of a learning-based image coding standard offering a single-stream, compact compressed domain representation, targeting both human visualization with significant compression efficiency improvement over image coding standards in common use at equivalent subjective quality, and effective performance for image processing and computer vision tasks, with the goal of supporting a royalty-free baseline.

The JPEG AI standard leverages deep learning algorithms that learn from vast amounts of image data the best way to compress images, allowing it to adapt to a wide range of content and offering enhanced perceptual visual quality and faster compression capabilities. The key benefits of JPEG AI are:

  1. Superior compression efficiency: JPEG AI offers higher compression efficiency, leading to reduced storage requirements and faster transmission times compared to other state-of-the-art image coding solutions.
  2. Implementation-friendly encoding and decoding: JPEG AI codec supports a wide array of devices with different characteristics, including mobile platforms, through optimized encoding and decoding processes.
  3. Compressed-domain image processing and computer vision tasks: JPEG AI’s architecture enables multi-purpose optimization for both human visualization and machine-driven tasks.

By creating the JPEG AI International Standard, the JPEG Committee has opened the door to more efficient and versatile image compression solutions that will benefit industries ranging from digital media and telecommunications to cloud storage and visual surveillance. This standard provides a framework for image compression in the face of rapidly growing visual data demands, enabling more efficient storage, faster transmission, and higher-quality visual experiences.

As JPEG AI establishes itself as the new benchmark in image compression, its potential to reshape the future of digital imaging is undeniable, promising groundbreaking advancements in efficiency and versatility.

JPEG Trust

The first part of JPEG Trust, the “Core Foundation” (ISO/IEC 21617-1) was approved for publication in late 2024 and is in the process of being published as an International Standard by ISO. The JPEG Trust standard provides a proactive approach to trust management by defining a framework for establishing trust in digital media. The Core Foundation specifies three main pillars: annotating provenance, extracting and evaluating Trust Indicators, and handling privacy and security concerns.

At the 106th JPEG Meeting, the JPEG Committee produced a Committee Draft (CD) for a 2nd edition of the Core Foundation. The 2nd edition further extends and improves the standard with new functionalities, including important specifications for Intellectual Property Rights (IPR) management such as authorship and rights declarations. In addition, this new edition will align the specification with the upcoming ISO 22144 standard, which is a standard for Content Credentials based on the C2PA 2.1 specification.

In parallel with the work on the 2nd edition of the Core Foundation (Part 1), the JPEG Committee continues to work on Part 2 and Part 3, “Trust Profiles Catalogue” and “Media Asset Watermarking”, respectively.

JPEG XE

The JPEG XE initiative is currently awaiting the conclusion of the open Final Call for Proposals on lossless coding of events, which will close on March 31, 2025. This initiative focuses on a new and emerging image modality introduced by event-based visual sensors. JPEG aims to establish a standard that efficiently represents events, facilitating interoperability in sensing, storage, and processing for machine vision and other relevant applications.

To ensure the success of this emerging standard, the JPEG Committee has reached out to other standardization organizations. The JPEG Committee, already a collaborative group under ISO/IEC and ITU-T, is engaged in discussions with ITU-T’s SG21 to develop JPEG XE as a joint standard. This collaboration aligns perfectly with the objectives of both organizations, as SG21 is also dedicated to creating standards around event-based systems.

Additionally, the JPEG Committee continues its discussions and research on lossy coding of events, focusing on future evaluation methods for these technologies. Those interested in the JPEG XE initiative are encouraged to review the public documents available at jpeg.org. Furthermore, the Ad-hoc Group on event-based vision has been re-established to advance work leading up to the 107th JPEG meeting in Brussels. To stay informed about this activity, please join the event-based vision Ad-hoc Group mailing list.

JPEG AIC

Part 3 of JPEG AIC (AIC-3) defines a methodology for subjective assessment of the visual quality of high-fidelity images, and the forthcoming Part 4 of JPEG AIC deals with objective quality metrics, also for high-fidelity images. At this JPEG meeting, the document on Use Cases and Requirements that refers to both AIC-3 and AIC-4 was revised. It defines the scope of both anticipated standards and relates it to the previous specifications for AIC-1 and AIC-2. While AIC-1 covers a broad quality range including low quality, it does not allow fine-grained quality assessment in the high-fidelity range. AIC-2 entails methods that determine a threshold separating visually lossless coded images from lossy ones. The quality range addressed by AIC-3 and AIC-4 is an interval that contains the AIC-2 threshold, reaching from high quality up to the numerically lossless case. The JPEG Committee is preparing the DIS text for AIC-3 and has launched the Second Draft Call for Proposals on Objective Image Quality Assessment (AIC-4), which includes the timeline for this JPEG activity. Proposals are expected at the end of Summer 2025. The first Working Draft for Objective Image Quality Assessment (AIC-4) is planned for April 2026.

JPEG Pleno

The 106th meeting marked a major milestone for the JPEG Pleno Point Cloud activity with the release of the Final Draft International Standard (FDIS) for ISO/IEC DIS 21794-6:2024 Information technology — Plenoptic image coding system (JPEG Pleno) — Part 6: Learning-based point cloud coding. Point cloud data supports a wide range of applications, including computer-aided manufacturing, entertainment, cultural heritage preservation, scientific research, and advanced sensing and analysis. The JPEG Committee considers this learning-based standard to be a powerful and efficient solution for point cloud coding. This standard is applicable to interactive human visualization, with competitive compression efficiency compared to state-of-the-art point cloud coding solutions in common use, and effective performance for 3D processing and machine-related computer vision tasks and has the goal of supporting a royalty-free baseline. This standard specifies a codestream format for storage of point clouds. The standard also provides information on the coding tools and defines extensions to the JPEG Pleno File Format and associated metadata descriptors that are specific to point cloud modalities. With the release of the FDIS at the 106th JPEG meeting, it is expected that the International Standard will be published in July 2025.

The JPEG Pleno Light Field activity discussed the Committee Draft (CD) of the 2nd edition of ISO/IEC 21794-2 (“Plenoptic image coding system (JPEG Pleno) Part 2: Light field coding”) that integrates AMD1 of ISO/IEC 21794-2 (“Profiles and levels for JPEG Pleno Light Field Coding”) and includes the specification of a third coding mode entitled Slanted 4D Transform Mode and its associated profile.

A White Paper on JPEG Pleno Light Field Coding has been released, providing the architecture of the current two JPEG Pleno Part-2 coding modes, as well as the coding architecture of its third coding mode, to be included in the 2nd edition of the standard. The White Paper also presents applications and use cases and briefly describes the JPEG Pleno Model (JPLM). The JPLM provides a reference implementation for the standardized technologies within the JPEG Pleno framework, including the JPEG Pleno Part 2 (ISO/IEC 21794-2). Improvements to JPLM have been implemented and tested, including a user-friendly interface that relies on well-documented JSON configuration files.

During the JPEG meeting week, significant progress was made in the JPEG Pleno Quality Assessment activity, which focuses on developing methodologies for subjective and objective quality assessment of plenoptic modalities. A Working Draft on subjective quality assessment, incorporating insights from extensive experiments conducted by JPEG experts, was discussed.

JPEG Systems

The reference software of JPEG Systems (ISO/IEC 19566-10) is now published as an International Standard and is available as open source on the JPEG website. This first edition implements the JPEG Universal Metadata Box Format (ISO/IEC 19566-5) and provides a reference dataset. An extended version of the reference software with support for additional Parts of JPEG Systems is currently under development. This new edition will add support for JPEG Privacy and Security, JPEG 360, JLINK, and JPEG Snack.

At its 106th meeting, the JPEG Committee also initiated a 3rd edition of the JPEG Universal Metadata Box Format (ISO/IEC 19566-5). This new edition will integrate the latest amendment that allows JUMBF boxes to exist as stand-alone files and adds support for payload compression. In addition, the 3rd edition will add a JUMBF validator and a scheme for JUMBF box retainment while transcoding from one JPEG format to another.

JPEG DNA

JPEG DNA is an initiative aimed at developing a standard capable of representing bi-level, continuous-tone grayscale, continuous-tone color, or multichannel digital samples in a format using nucleotide sequences to support DNA storage. The JPEG DNA Verification Model (VM) was created during the 102nd JPEG meeting based on performance assessments and descriptive analyses of the submitted solutions to a Call for Proposals, issued at the 99th JPEG meeting. Since then, several core experiments have been continuously conducted to validate and enhance this Verification Model. Such efforts led to the creation of the first Working Draft of JPEG DNA during the 103rd JPEG meeting. At the 105th JPEG meeting, the JPEG Committee officially introduced a New Work Item Proposal (NWIP) for JPEG DNA, elevating it to an officially sanctioned ISO/IEC Project. The proposal defined JPEG DNA as a multi-part standard: Part 1: Core Coding System, Part 2: Profiles and Levels, Part 3: Reference Software, Part 4: Conformance.

The JPEG Committee is targeting the International Standard (IS) stage for Part 1 by April 2026.

At its 106th meeting, the JPEG Committee made significant progress toward achieving this goal. Efforts were focused on producing the Committee Draft (CD) for Part 1, a crucial milestone in the standardization process. Additionally, JPEG DNA Part 1 has now been assigned the Project identification ISO/IEC 25508-01.

JPEG XS

The JPEG XS activity focused primarily on finalizing the third editions of JPEG XS Part 4 – Conformance testing and Part 5 – Reference software. Recall that the 3rd editions of Parts 1, 2, and 3 are published and available for purchase. Part 4 is now at FDIS stage and is expected to be approved as an International Standard around April 2025. For Part 5, work on the reference software was completed to implement TDC profile encoding functionality, making it feature complete and fully compliant with the 3rd edition of JPEG XS. As such, Part 5 is ready to be balloted as a DIS. However, work on the reference software will continue to bring further improvements. The reference software and Part 5 will become publicly and freely available, similar to Part 4.

JPEG XL

The second edition of Part 3 (conformance testing) of JPEG XL proceeded to publication as International Standard. Regarding Part 2 (file format), a third edition has been prepared, and it reached the DIS stage. The new edition will include support for embedding gain maps in JPEG XL files.

JPEG 2000

The JPEG Committee has begun work on adding support for the HTTP/3 transport to the JPIP protocol, which allows the interactive browsing of JPEG 2000 images over networks. HTTP/3 is the third major version of the Hypertext Transfer Protocol (HTTP) and allows for significantly lower latency operations compared to earlier versions. A Committee Draft ballot of the 3rd edition of the JPIP specifications (Rec. ITU-T T.808 | ISO/IEC 15444-9) is expected to start shortly, with the project completed sometime in 2026.

Separately, the 3rd edition of Rec. ITU-T T.815 | ISO/IEC 15444-16, which specifies the carriage of JPEG 2000 imagery in the ISOBMFF and HEIF file formats, has been approved for publication. This new edition adds support for more flexible color signaling and JPEG 2000 video tracks.

JPEG RF

At this meeting, the JPEG RF exploration issued the “JPEG Radiance Fields State of the Art and Challenges”, a public document that describes the latest developments in Neural Radiance Fields (NeRF) and 3D Gaussian Splatting (3DGS) technologies and defines a scope for the activity, focusing on the creation of a coding standard. The JPEG Committee is also organizing a workshop on Radiance Fields jointly with MPEG, which will take place on January 31st and feature key experts in the field presenting various aspects of this exciting new emerging technology.

Final Quote

“The newly approved JPEG AI, developed under the auspices of ISO, IEC and ITU, is the first image coding standard based on machine learning and is a breakthrough in image coding providing 30% compression gains over the most advanced solutions in state-of-the-art.” said Prof. Touradj Ebrahimi, the Convenor of the JPEG Committee.

MPEG Column: 150th MPEG Meeting (Virtual/Online)

The 150th MPEG meeting was held online from 31 March to 04 April 2025. The official press release can be found here. This column provides the following highlights:

  • Requirements: MPEG-AI strategy and white paper on MPEG technologies for metaverse
  • JVET: Draft Joint Call for Evidence on video compression with capability beyond Versatile Video Coding (VVC)
  • Video: Gaussian splat coding and video coding for machines
  • Audio: Audio coding for machines
  • 3DGH: 3D Gaussian splat coding

MPEG-AI Strategy

The MPEG-AI strategy envisions a future where AI and neural networks are deeply integrated into multimedia coding and processing, enabling transformative improvements in how digital content is created, compressed, analyzed, and delivered. By positioning AI at the core of multimedia systems, MPEG-AI seeks to enhance both content representation and intelligent analysis. This approach supports applications ranging from adaptive streaming and immersive media to machine-centric use cases like autonomous vehicles and smart cities. AI is employed to optimize coding efficiency, generate intelligent descriptors, and facilitate seamless interaction between content and AI systems. The strategy builds on foundational standards such as ISO/IEC 15938-13 (CDVS), 15938-15 (CDVA), and 15938-17 (Neural Network Coding), which collectively laid the groundwork for integrating AI into multimedia frameworks.

Currently, MPEG is developing a family of standards under the ISO/IEC 23888 series that includes a vision document, machine-oriented video coding, and encoder optimization for AI analysis. Future work focuses on feature coding for machines and AI-based point cloud compression to support high-efficiency 3D and visual data handling. These efforts reflect a paradigm shift from human-centric media consumption to systems that also serve intelligent machine agents. MPEG-AI maintains compatibility with traditional media processing while enabling scalable, secure, and privacy-conscious AI deployments. Through this initiative, MPEG aims to define the future of multimedia as an intelligent, adaptable ecosystem capable of supporting complex, real-time, and immersive digital experiences.

MPEG White Paper on Metaverse Technologies

The MPEG white paper on metaverse technologies (cf. MPEG white papers) outlines the pivotal role of MPEG standards in enabling immersive, interoperable, and high-quality virtual experiences that define the emerging metaverse. It identifies core metaverse parameters – real-time operation, 3D experience, interactivity, persistence, and social engagement – and maps them to MPEG’s longstanding and evolving technical contributions. From early efforts like MPEG-4’s Binary Format for Scenes (BIFS) and Animation Framework eXtension (AFX) to MPEG-V’s sensory integration, and the advanced MPEG-I suite, these standards underpin critical features such as scene representation, dynamic 3D asset compression, immersive audio, avatar animation, and real-time streaming. Key technologies like point cloud compression (V-PCC, G-PCC), immersive video (MIV), and dynamic mesh coding (V-DMC) demonstrate MPEG’s capacity to support realistic, responsive, and adaptive virtual environments. Recent efforts include neural network compression for learned scene representations (e.g., NeRFs), haptic coding formats, and scene description enhancements, all geared toward richer user engagement and broader device interoperability.

The document highlights five major metaverse use cases – virtual environments, immersive entertainment, virtual commerce, remote collaboration, and digital twins – all supported by MPEG innovations. It emphasizes the foundational role of MPEG-I standards (e.g., Parts 12, 14, 29, 39) for synchronizing immersive content, representing avatars, and orchestrating complex 3D scenes across platforms. Future challenges identified include ensuring interoperability across systems, advancing compression methods for AI-assisted scenarios, and embedding security and privacy protections. With decades of multimedia expertise and a future-focused standards roadmap, MPEG positions itself as a key enabler of the metaverse – ensuring that emerging virtual ecosystems are scalable, immersive, and universally accessible.

The MPEG white paper on metaverse technologies highlights several research opportunities, including efficient compression of dynamic 3D content (e.g., point clouds, meshes, neural representations), synchronization of immersive audio and haptics, real-time adaptive streaming, and scene orchestration. It also points to challenges in standardizing interoperable avatar formats, AI-enhanced media representation, and ensuring seamless user experiences across devices. Additional research directions include neural network compression, cross-platform media rendering, and developing perceptual metrics for immersive Quality of Experience (QoE).

Draft Joint Call for Evidence (CfE) on Video Compression beyond Versatile Video Coding (VVC)

The latest JVET AHG report on ECM software development (AHG6), documented as JVET-AL0006, shows promising results. Specifically, in the “Overall” row and “Y” column, there is a 27.06% improvement in coding efficiency compared to VVC, as shown in the figure below.
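For context, coding-efficiency differences of this kind are commonly reported as Bjøntegaard delta rate (BD-rate): the average bitrate difference between two rate-distortion curves at equal quality. A minimal sketch with hypothetical rate points (not the JVET data, and omitting refinements such as piecewise-cubic interpolation used in practice):

```python
import numpy as np

def bd_rate(rate_anchor, psnr_anchor, rate_test, psnr_test):
    """Bjøntegaard delta rate: average bitrate difference (%) between two
    rate-distortion curves over their overlapping quality range."""
    lr_a, lr_t = np.log(rate_anchor), np.log(rate_test)
    # Fit cubic polynomials: log-rate as a function of PSNR.
    p_a = np.polyfit(psnr_anchor, lr_a, 3)
    p_t = np.polyfit(psnr_test, lr_t, 3)
    lo = max(min(psnr_anchor), min(psnr_test))
    hi = min(max(psnr_anchor), max(psnr_test))
    # Integrate both fits over the common PSNR interval and average.
    ia = np.polyval(np.polyint(p_a), hi) - np.polyval(np.polyint(p_a), lo)
    it = np.polyval(np.polyint(p_t), hi) - np.polyval(np.polyint(p_t), lo)
    avg_diff = (it - ia) / (hi - lo)
    return (np.exp(avg_diff) - 1) * 100  # negative => bitrate savings

psnr = [34.0, 36.0, 38.0, 40.0]
rate_anchor = [1000.0, 2000.0, 4000.0, 8000.0]  # kbit/s, hypothetical anchor
rate_test = [730.0, 1460.0, 2920.0, 5840.0]     # hypothetical test codec
print(f"BD-rate: {bd_rate(rate_anchor, psnr, rate_test, psnr):.2f}%")  # -27.00%
```

A negative BD-rate means the test codec needs less bitrate than the anchor for the same objective quality.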

The Draft Joint Call for Evidence (CfE) on video compression beyond VVC (Versatile Video Coding), identified as document JVET-AL2026 | N 355, is being developed to explore new advancements in video compression. The CfE seeks evidence in three main areas: (a) improved compression efficiency and associated trade-offs, (b) encoding under runtime constraints, and (c) enhanced performance in additional functionalities. This initiative aims to evaluate whether new techniques can significantly outperform the current state-of-the-art VVC standard in both compression and practical deployment aspects.

The visual testing will be carried out across seven categories, including various combinations of resolution, dynamic range, and use cases: SDR Random Access UHD/4K, SDR Random Access HD, SDR Low Bitrate HD, HDR Random Access 4K, HDR Random Access Cropped 8K, Gaming Low Bitrate HD, and UGC (User-Generated Content) Random Access HD. Sequences and rate points for testing have already been defined and agreed upon. For a fair comparison, rate-matched anchors using VTM (VVC Test Model) and ECM (Enhanced Compression Model) will be generated, with new configurations to enable reduced run-time evaluations. A dry-run of the visual tests is planned during the upcoming Daejeon meeting, with ECM and VTM as reference anchors, and the CfE welcomes additional submissions. Following this dry-run, the final Call for Evidence is expected to be issued in July, with responses due in October.

The Draft Joint Call for Evidence (CfE) on video compression beyond VVC invites research into next-generation video coding techniques that offer improved compression efficiency, reduced encoding complexity under runtime constraints, and enhanced functionalities such as scalability or perceptual quality. Key research aspects include optimizing the trade-off between bitrate and visual fidelity, developing fast encoding methods suitable for constrained devices, and advancing performance in emerging use cases like HDR, 8K, gaming, and user-generated content.

3D Gaussian Splat Coding

Gaussian splatting is a real-time radiance field rendering method that represents a scene using 3D Gaussians. Each Gaussian has parameters such as position, scale, color, opacity, and orientation, and together they approximate how light interacts with surfaces in a scene. Instead of ray marching (as in NeRF), it renders images by splatting the Gaussians onto a 2D image plane and blending them using a rasterization pipeline, which is GPU-friendly and much faster. Developed by Kerbl et al. (2023), it is capable of real-time rendering (60+ fps) and outperforms previous NeRF-based methods in speed and visual quality. Gaussian splat coding refers to the compression and streaming of 3D Gaussian representations for efficient storage and transmission. It is an active research area and under standardization consideration in MPEG.
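The splat-and-blend principle can be sketched in a few lines. The toy example below composites isotropic 2D Gaussians front to back using the standard "over" operator; this is a deliberate simplification for illustration, whereas the method of Kerbl et al. uses anisotropic 3D Gaussians projected to 2D and a tile-based GPU rasterizer:

```python
import numpy as np

def splat(gaussians, height=8, width=8):
    """Composite isotropic 2D Gaussian splats front to back ("over" operator)."""
    ys, xs = np.mgrid[0:height, 0:width].astype(float)
    image = np.zeros((height, width, 3))
    transmittance = np.ones((height, width))  # light not yet absorbed
    for g in sorted(gaussians, key=lambda g: g["depth"]):  # nearest first
        # Gaussian falloff of opacity around the splat centre.
        d2 = (xs - g["mu"][0]) ** 2 + (ys - g["mu"][1]) ** 2
        alpha = g["opacity"] * np.exp(-0.5 * d2 / g["scale"] ** 2)
        image += (transmittance * alpha)[..., None] * np.array(g["color"])
        transmittance *= 1.0 - alpha
    return image

scene = [
    {"mu": (3, 3), "scale": 1.5, "color": (1, 0, 0), "opacity": 0.8, "depth": 1.0},
    {"mu": (5, 4), "scale": 2.0, "color": (0, 0, 1), "opacity": 0.6, "depth": 2.0},
]
img = splat(scene)
print(img[3, 3])  # pixel at the red splat's centre: mostly red
```

Compressing a scene then amounts to quantizing and entropy coding the per-Gaussian parameter dictionaries above, which is exactly what Gaussian splat coding targets.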

The MPEG Technical Requirements Working Group, together with the MPEG Video Working Group, has started an exploration on Gaussian splat coding, while the MPEG Coding of 3D Graphics and Haptics (3DGH) Working Group addresses 3D Gaussian splat coding. Draft Gaussian splat coding use cases and requirements are available, and various joint exploration experiments (JEEs) are conducted between meetings.

(3D) Gaussian splat coding is actively researched in academia, also in the context of streaming, e.g., in “LapisGS: Layered Progressive 3D Gaussian Splatting for Adaptive Streaming” or “LTS: A DASH Streaming System for Dynamic Multi-Layer 3D Gaussian Splatting Scenes”. The research aspects of 3D Gaussian splat coding and streaming span a wide range of areas across computer graphics, compression, machine learning, and systems for real-time immersive media, in particular how to efficiently represent and transmit Gaussian-based scene representations for real-time rendering. Key areas include compression of Gaussian parameters (position, scale, color, opacity), perceptual and geometry-aware optimizations, and neural compression techniques such as learned latent coding. Streaming challenges involve adaptive, view-dependent delivery, level-of-detail management, and low-latency rendering on edge or mobile devices. Additional research directions include standardizing file formats, integrating with scene graphs, and ensuring interoperability with existing 3D and immersive media frameworks.

MPEG Audio and Video Coding for Machines

The Call for Proposals on Audio Coding for Machines (ACoM), issued by the MPEG audio coding working group, aims to develop a standard for efficiently compressing audio, multi-dimensional signals (e.g., medical data), or extracted features for use in machine-driven applications. The standard targets use cases such as connected vehicles, audio surveillance, diagnostics, health monitoring, and smart cities, where vast data streams must be transmitted, stored, and processed with low latency and high fidelity. The ACoM system is designed in two phases: the first focusing on near-lossless compression of audio and metadata to facilitate training of machine learning models, and the second expanding to lossy compression of features optimized for specific applications. The goal is to support hybrid consumption – by machines and, where needed, humans – while ensuring interoperability, low delay, and efficient use of storage and bandwidth.

The CfP outlines technical requirements, submission guidelines, and evaluation metrics. Participants must provide decoders compatible with Linux/x86 systems, demonstrate performance through objective metrics like compression ratio, encoder/decoder runtime, and memory usage, and undergo a mandatory cross-checking process. Selected proposals will contribute to a reference model and working draft of the standard. Proponents must register by August 1, 2025, with submissions due in September, and evaluation taking place in October. The selection process emphasizes lossless reproduction, metadata fidelity, and significant improvements over a baseline codec, with a path to merge top-performing technologies into a unified solution for standardization.

Research aspects of Audio Coding for Machines (ACoM) include developing efficient compression techniques for audio and multi-dimensional data that preserve key features for machine learning tasks, optimizing encoding for low-latency and resource-constrained environments, and designing hybrid formats suitable for both machine and human consumption. Additional research areas involve creating interoperable feature representations, enhancing metadata handling for context-aware processing, evaluating trade-offs between lossless and lossy compression, and integrating machine-optimized codecs into real-world applications like surveillance, diagnostics, and smart systems.

The MPEG video coding working group approved the committee draft (CD) for ISO/IEC 23888-2 video coding for machines (VCM). VCM aims to encode visual content in a way that maximizes machine task performance, such as computer vision, scene understanding, autonomous driving, smart surveillance, robotics and IoT. Instead of preserving photorealistic quality, VCM seeks to retain features and structures important for machines, possibly at much lower bitrates than traditional video codecs. The CD introduces several new tools and enhancements aimed at improving machine-centric video processing efficiency. These include updates to spatial resampling, such as the signaling of the inner decoded picture size to better support scalable inference. For temporal resampling, the CD enables adaptive resampling ratios and introduces pre- and post-filters within the temporal resampler to maintain task-relevant temporal features. In the filtering domain, it adopts bit depth truncation techniques – integrating bit depth shifting, luma enhancement, and chroma reconstruction – to optimize both signaling efficiency and cross-platform interoperability. Luma enhancement is further refined through an integer-based implementation for luma distribution parameters, while chroma reconstruction is stabilized across different hardware platforms. Additionally, the CD proposes removing the neural network-based in-loop filter (NNLF) to simplify the pipeline. Finally, in terms of bitstream structure, it adopts a flattened structure with new signaling methods to support efficient random access and better coordination with system layers, aligning with the low-latency, high-accuracy needs of machine-driven applications.
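As a rough illustration of the bit-depth-shifting idea mentioned above, the sketch below truncates 10-bit samples to 8 bits before coding and shifts them back after decoding. This is a hypothetical simplification: the actual CD couples such shifting with luma enhancement and chroma reconstruction to recover the discarded precision.

```python
def truncate_bit_depth(samples, src_bits=10, dst_bits=8):
    """Shift samples from src_bits down to dst_bits before encoding."""
    return [s >> (src_bits - dst_bits) for s in samples]

def restore_bit_depth(samples, src_bits=8, dst_bits=10):
    """Shift decoded samples back up; the low-order bits are lost."""
    return [s << (dst_bits - src_bits) for s in samples]

luma_10bit = [0, 512, 1023]
coded = truncate_bit_depth(luma_10bit)  # [0, 128, 255]
print(restore_bit_depth(coded))         # [0, 512, 1020] -- lossy in low bits
```

For machine tasks, the rationale is that inference accuracy is often insensitive to the least significant bits, so discarding them saves rate with little task-performance loss.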

Research in VCM focuses on optimizing video representation for downstream machine tasks, exploring task-driven compression techniques that prioritize inference accuracy over perceptual quality. Key areas include joint video and feature coding, adaptive resampling methods tailored to machine perception, learning-based filter design, and bitstream structuring for efficient decoding and random access. Other important directions involve balancing bitrate and task accuracy, enhancing robustness across platforms, and integrating machine-in-the-loop optimization to co-design codecs with AI inference pipelines.

Concluding Remarks

The 150th MPEG meeting marks significant progress across AI-enhanced media, immersive technologies, and machine-oriented coding. With ongoing work on MPEG-AI, metaverse standards, next-gen video compression, Gaussian splat representation, and machine-friendly audio and video coding, MPEG continues to shape the future of interoperable, intelligent, and adaptive multimedia systems. The research opportunities and standardization efforts outlined in this meeting provide a strong foundation for innovations that support real-time, efficient, and cross-platform media experiences for both human and machine consumption.

The 151st MPEG meeting will be held in Daejeon, Korea, from 30 June to 04 July 2025. More information about MPEG meetings and their developments is available on the MPEG website.

O QoE, Where Art Thou?


Once upon a time, when engineers measured networks in latency and packet loss, the idea of Quality of Experience (QoE) emerged — a myth whispered among researchers who dared to ask not what the system delivers, but what the user perceives. Decades later, QoE has evolved into a sprawling epic, spanning disciplines and domains, from humble MOS scores to immersive virtual realities. But as media experiences become ever more complex — adaptive, interactive, personalized — the question lingers: O QoE, where art thou?

1. Introduction

In this column, we revisit the notion of QoE and its evolution over time. We begin by reviewing early work from the 1990s to 2000s on the definitions of QoE (Section 2), where researchers first recognized the importance of user perception and the relevant QoE influence factors, as well as QoE modeling efforts. As a summary of this literature survey, QoE evolved from abstract notions of perception and satisfaction to a measurable, standardized concept encompassing the emotional, cognitive, and contextual responses of users to a service or application. The trends across time are:

  • 1990s: Early focus on perception and interaction design.
  • Early 2000s: Growing focus on subjectivity, emotion, and context in user experience. QoE separated from QoS, emphasizing emotion, context, and expectation. Seen as key to commercial and user success.
  • Mid-2000s: Integration of technical and perceptual layers; need for metrics and quantification. Push for measurable models combining technical and user perspectives. Recognition of multiple definitions across domains.
  • Late 2000s–2010s: Standardization, recognition of multi-dimensionality, and development of cross-disciplinary definitions. QoE defined around subjective perception and system-wide impact.
  • 2010s: Unified, multidisciplinary understanding established through initiatives like QUALINET; QoE as “delight or annoyance”.

This initial insight laid the foundation for larger initiatives like QUALINET, which helped to shape the field by providing widely accepted QoE definitions. We then examine how these developments have been formalized through standardization activities (Section 3), particularly within the ITU and the QUALINET whitepapers on the definition of QoE and immersive QoE.

Figure 1. Timeline on the notion and definitions of QoE in literature and standardization.

A timeline of the literature survey and the early definitions of QoE as well as the standardization activities is visualized in Figure 1. Finally, we discuss selected open issues in QoE research (Section 4) that continue to challenge both academia and industry.

2. Early Definitions of QoE: 1990s to 2000s

The term Quality of Experience (QoE) emerged in the late 1990s to early 2000s as a response to the limitations of traditional network-centric approaches. Although Quality of Service (QoS) had already been formally defined in ITU-T Recommendation E.800 (1994) [ITU-T E.800] for telephony and established a basis for assessing service quality from both technical and user viewpoints, QoS primarily addresses performance at the network level. QoS is commonly applied within communication networks to describe a system’s ability to meet predefined performance targets, ensuring consistent data transmission through metrics such as bandwidth, latency, jitter, and packet loss [Varela2014].

In contrast, researchers and industry practitioners began to recognize the importance of how users actually perceive the quality of a service in the late 1990s to early 2000s. In this context, a variety of alternative terms were used prior to the standardization and definition of QoE, including User-Perceived Quality, Perceived Quality, End-User Quality, User-Experience Quality, Multimedia Experience Quality, Subjective Quality of Service, and user-level QoS. These early terms reflected a growing awareness of the need to evaluate digital services from the user’s point of view, ultimately leading to the coining and adoption of QoE as a distinct and essential concept in the field of communication systems and multimedia applications.

The term QoE brought attention to the user’s subjective perception, marking a shift toward evaluating service quality from the end-user’s perspective in the mid-2000s. In the following, a brief overview of the first documents using the term “Quality of Experience” or “QoE” is provided to sketch how the term was defined. In particular, research articles were collected from the ACM Digital Library and IEEE Xplore by searching for “Quality of Experience” or “QoE”.

Focus on user perception and interaction design

  • 1990: Harman, G. “The intrinsic quality of experience” claims we are not directly aware of our experiences’ intrinsic properties, but of those of the external objects they represent—like color, shape, texture, motion, and spatial relations.
  • 1996: Austin Henderson. “What’s next?” explains the idea behind the ACM Award about QoE in interaction. “We really want to know what users experience! In short we are interested in the quality of a person’s experience in the interaction. […] factors contribute to the effective experience of interacting with the device.“ However, no QoE definition is proposed.
  • 1996: Lauralee Alben. “Quality of experience: defining the criteria for effective interaction design“ is also related to the ACM interactions design award. “By ‘experience’ we mean all the aspects of how people use an interactive product: the way it feels in their hands, how well they understand how it works, how they feel about it while they’re using it, how well it serves their purposes, and how well it fits into the entire context in which they are using it. If these experiences are successful and engaging, then they are valuable to users and noteworthy to the interaction design awards jury. We call this ‘quality of experience’.”  This early definition of QoE encompasses all aspects of a user’s interaction with a product, including its physical feel, usability, emotional impact, and the overall satisfaction derived from its use.
  • 2000: Alan Turner and Lucy T. Nowell. “Beyond the desktop: diversity and artistry” relate QoE to the need for engaging, media-rich interactions across diverse devices, emphasizing the role of artistry in delivering compelling user experiences. A remarkable statement: “We also believe that the quality of experience will become the key metric of success for software, both commercially and socially.“

Focus on subjectivity, emotion, and context

  • 2000: Marion Buchenau and Jane Fulton Suri. “Experience prototyping.” introduce a prototyping approach that immerses users in simulated interactions to explore and refine QoE, including sensory, emotional, and contextual dimensions beyond usability or function. QoE goes beyond usability or functionality, encompassing emotional and contextual factors.
  • 2000: Anna Bouch, Allan Kuchinsky, and Nina Bhatti. “Quality is in the eye of the beholder: meeting users’ requirements for Internet quality of service.” They show that in Internet commerce, QoE depends on both technical QoS and user expectations and context. “Only through such integration of users’ requirements into systems design will it be possible to achieve the customer satisfaction that leads to the success of any commercial system.”
  • 2001: A public slide set by Touradj Ebrahimi, “Quality of Experience Past, Present and Future Trends” (presented 23 Nov 2012), refers to a definition of QoE as follows: “The degree of fulfillment of an intended experience on a given user – as defined by Touradj Ebrahimi, 2001”.
  • 2002: Heddaya, A. S. “An economically scalable Internet” uses the term “QoE rather than quality of service because QoS is not necessary for QoE, and QoE is sufficient for successful service.”

Focus on measurable models combining technical and user perspectives

  • 1994: Nahrstedt, K., Smith, J., & Steinmetz, R. “Mapping User Level QoS from a Single Parameter” aims at quantifying QoE. “The ‘satisfaction’ concept has been introduced to quantify the QoS provided by the system. The transformations required to both map the cost into satisfaction and then configure the system are then developed.”
  • 2003: Siller, M., & Woods, J. C. “QoS arbitration for improving the QoE in multimedia transmission” propose a QoE-aware framework that adapts QoS to real-time user perception in multimedia networks. They define QoE as “the user’s perceived experience of what is being presented by the Application Layer, where the application layer acts as a user interface front-end that presents the overall result of the individual Quality of Services”.
    They also review related work of the time, taken from white papers that are no longer accessible:
    • “A metric used for measuring the performance of this perceptual layer is Quality of Experience (QoE).”
    • “QoE is referred to as; what a customer experiences and values to complete his tasks quickly and with confidence.”
    • “QoE is considered as all the perception elements of the network and performance relative to expectations of the users/subscribers.“
    • The QoE is defined as “the totality of the Quality of Service mechanisms, provided to ensure smooth transmission of audio and video over IP networks”.
  • 2004: R. Jain. “Quality of Experience” asks the following questions. “But how do we quantitatively define the quality of experience? Can we extend QoS to QoE? What factors should we consider in developing measures for QoE?” He concludes with a remarkable statement. “In a sense, the challenges of QoE are nothing new. People in social sciences and marketing have always developed techniques to quantify people’s preferences and choices. That situation is similar to what goes into QoE.”
  • 2004: The Euro-NGI deliverable D.JRA.6.1.1 “State-of-the-art with regards to user-perceived Quality of Service and quality feedback” (with Fiedler as deliverable lead) reviews QoS from the user’s perspective. The notion of QoE is: “The degree of satisfaction, i.e. the subjective quality, is influenced by the technical, objective quality stemming from the application and the interconnecting network(s). For this reason, subjective quality as perceived by the network has to be linked to objective, measurable quality, which is expressed in application and network performance parameters.”
  • 2007: Hoßfeld, Tobias, Phuoc Tran-Gia, and Markus Fiedler. “Quantification of quality of experience for edge-based applications” provide a quantitative link between technical metrics and QoE. “Quality of Experience (QoE), a subjective measure from the user perspective of the overall value of the provided service or application”.
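One well-known way to make such a quantitative link concrete is an exponential mapping between a QoS disturbance and a MOS-scale QoE value, in the spirit of the IQX hypothesis later formulated by Fiedler, Hoßfeld, and Tran-Gia. The parameter values below are purely illustrative, not taken from any of the cited studies:

```python
import math

def qoe_mos(disturbance, alpha=3.5, beta=0.4, gamma=1.5):
    """Map a QoS disturbance (e.g., packet loss in %) to a 1..5 MOS-like value.
    Exponential form: small disturbances hurt most while quality is still high."""
    return max(1.0, min(5.0, alpha * math.exp(-beta * disturbance) + gamma))

for loss in (0, 1, 5, 20):
    print(f"{loss:>2}% loss -> MOS {qoe_mos(loss):.2f}")
```

The exponential shape captures the empirical observation that the first small degradations cause the steepest QoE drop, with diminishing impact once quality is already poor.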

Diversity of definitions and interdisciplinarity

  • 2007:  Soldani, D., Li, M., & Cuny, R. “QoS and QoE management in UMTS cellular systems” define: “QoE is the term used to describe the perception of end-users on how usable the services are. […] The term ‘QoE’ refers to the perception of the user about the quality of a particular service or networks.” Notably, they already mentioned that “Browsing through the literature, one may find many different definitions for quality of end-user experience (QoE) and quality of service (QoS).”
  • 2009: The International Conference on Quality of Multimedia Experience (QoMEX) notes in its call for papers that “perceived user experience is psychological in nature and changes in different environmental conditions and with different multimedia devices.”

3. Definitions of QoE in Standardization

In standardization, the following definitions were introduced.

  • 2007: ITU-T Rec. G.100/P.10 Amendment 1 (2007) New Appendix I – Definition of Quality of Experience (QoE).  “The overall acceptability of an application or service, as perceived subjectively by the end user. NOTE 1: Quality of experience includes the complete end-to-end system effects (client, terminal, network, services infrastructure, etc.). NOTE 2: Overall acceptability may be influenced by user expectations and context.”
    This definition has been superseded by the Qualinet definition of QoE in 2016. It should be mentioned that acceptance and QoE are different concepts: acceptability refers more narrowly to whether a service or system is deemed “good enough” or usable under certain conditions. Approaches to link QoE and acceptance have been discussed in the literature [Schatz2011, Hossfeld2016].
  • 2008: ITU-T Recommendation E.800 “Definitions of terms related to quality of service” defines: “quality of service experienced/perceived by customer/user (QoSE): a statement expressing the level of quality that customers/users believe they have experienced. NOTE 1: The level of QoS experienced and/or perceived by the customer/user may be expressed by an opinion rating.”
  • 2009: ETSI TR 102 643 V1.0.1 (2009-12) “Human Factors (HF); Quality of Experience (QoE) requirements for real-time communication services” defines QoE as “measure of user performance based on both objective and subjective psychological measures of using an ICT service or product”. It includes two notes on QoE: (1) Considers technical QoS, context, and measures both communication process and outcomes (e.g. effectiveness, satisfaction). (2) Uses objective (e.g. task time, errors) and subjective (e.g. perceived quality, satisfaction) psychological measures, depending on context.
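The “opinion rating” mentioned in ITU-T E.800 is commonly operationalized as a Mean Opinion Score (MOS) on a five-point Absolute Category Rating scale. As a minimal illustration (the ratings below are invented, not taken from any study), a MOS and a simple confidence interval can be computed as follows:

```python
# Minimal sketch: Mean Opinion Score (MOS) and a 95% confidence interval
# from per-user ratings on the 5-point ACR scale (1 = bad .. 5 = excellent).
# The ratings are illustrative only.
import math

ratings = [4, 5, 3, 4, 4, 2, 5, 4, 3, 4]

n = len(ratings)
mos = sum(ratings) / n
# Sample standard deviation of the ratings
sd = math.sqrt(sum((r - mos) ** 2 for r in ratings) / (n - 1))
# 95% confidence interval using the normal approximation (z = 1.96)
ci = 1.96 * sd / math.sqrt(n)

print(f"MOS = {mos:.2f} +/- {ci:.2f}")
```

Note that a bare MOS hides rating diversity among users, which is exactly the criticism raised in [Hossfeld2016] (“QoE beyond the MOS”).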

The diverse and often conflicting definitions of QoE emerging in the 2000s highlighted the need for coordinated efforts and a shared understanding across disciplines. This led to joint initiatives such as QUALINET, which aimed to formalize and unify QoE research within a dedicated network. One of the results is the updated QoE definition, which has since been adopted in standardization.

  • 2016: ITU-T Recommendation P.10/G.100 (2006) Amendment 5 (07/16), New Definitions for Inclusion in Recommendation ITU-T P.10/G.100, International Telecommunication Union, July 2016: “Quality of experience (QoE) is the degree of delight or annoyance of the user of an application or service”.

QUALINET White Paper on Definitions of Quality of Experience

QUALINET is the European Network on Quality of Experience in Multimedia Systems and Services (COST Action IC 1003 from 2010 to 2014, later a network that meets regularly at QoMEX) with the aim “to establish a strong network on Quality of Experience (QoE) with participation from both academia and industry” (https://www.cost.eu/actions/IC1003/). QUALINET was the driving force in further advancing research in the context of QoE, producing three major, well-cited assets (among others): (1) the QUALINET White Paper on Definitions of Quality of Experience [QUALINET2013], (2) the QUALINET databases [QUALINET2019], and (3) the QUALINET White Paper on Definitions of Immersive Media Experience (IMEx) [QUALINET2020].

The white paper on definitions of QoE was the result of a consultation and collaborative writing process within the COST Action IC 1003, involving 38 authors, contributors, and editors from 18 countries. A first draft was discussed and improved at the 2012 QoE Dagstuhl Seminar [Fiedler2012]. The final definition of QoE reads:

“Quality of Experience (QoE) is the degree of delight or annoyance of the user of an application or service. It results from the fulfillment of his or her expectations with respect to the utility and / or enjoyment of the application or service in the light of the user’s personality and current state.”

[QUALINET2013]

The white paper also defines influence factors (human, system, context) and features of QoE (level of direct perception, level of interaction, level of the usage situation, level of service) as well as the relationship between QoS and QoE, plus application areas, which allow “to provide specializations of a generally agreed definition of QoE pertaining to the respective application domain taking into account its requirements formulated by means of influence factors and features of QoE”.

QUALINET White Paper on Definitions of Immersive Media Experience (IMEx)

A follow-up white paper defines the QoE for immersive media as

“the degree of delight or annoyance of the user of an application or service which involves an immersive media experience. It results from the fulfillment of his or her expectations with respect to the utility and/or enjoyment of the application or service in the light of the user’s personality and current state.”

[QUALINET2020]

IMEx is defined as

“a high-fidelity simulation provided and communicated to the user through multiple sensory and semiotic modalities. Users are emplaced in a technology-driven environment with the possibility to actively partake and participate in the information and experiences dispensed by the generated world.”

[QUALINET2020]

Consequently, this white paper provides a “toolbox for definitions of IMEx including its Quality of Experience, application areas, influencing factors, and assessment methods.” [QUALINET2020].

4. Open Issues in QoE Research

We would like to conclude with some open issues regarding Quality of Experience. The upcoming 6G standard presents significant opportunities, such as QoE-aware orchestration of edge computing, cloud rendering, and network slicing [Tondwalkar2024] and native AI in 6G [Ziegler2020], while also considering the trade-off between QoE and CO2 emissions [Hossfeld2023]. As AI-generated content continues to rise, the evaluation of its quality remains in its early stages. The same applies to learning-based codecs, where existing quality assessment methods – both objective and subjective – are reaching their limits, particularly concerning media authenticity, which is becoming a critical issue. In this context, ethics and privacy are paramount, as user data plays a central role in QoE modeling. Future research must focus on privacy-preserving methods for QoE measurement and personalization. Finally, new modalities such as point clouds, light fields, and holograms necessitate the adaptation of existing techniques or the development of new methods. Moreover, multimodal or multisensory QoE, particularly concerning audio-visual-haptic or olfactory integration (previously referred to as mulsemedia), is emerging as an important area that requires tailored QoE assessment methods and metrics. This is also reflected by the upcoming 17th Int. Conf. on Quality of Multimedia Experience (QoMEX’25) under the theme “Thinking of a QoE ®evolution”. In particular, the call for papers requests: “On the edge of QoMEX ‘coming of age’, it is time to rethink the purpose and methods of QoE research: cross-fertilizing with adjacent fields, reaching more diverse populations, or exploring novel techniques and paradigms.” This addresses innovative approaches and novel paradigms in QoE research, technological innovations in the era of big data and AI, but also user-centricity in 6G.
Interdisciplinary links in QoE include diversity, ethics, and accessibility, but also novel interaction techniques and multimedia experiences. Specific applications such as gaming, healthcare, education, immersive technologies, and multisensory perception are in scope.

And so, like any true odyssey, the search for Quality of Experience continues — not as a destination, but as a path we shape with every interaction, every pixel tuned, every user understood. QoE is no longer a myth, but neither is it fully found. It lives at the intersection of perception and precision, where engineers meet psychologists, and systems learn to listen. In a world of immersive media and intelligent networks, perhaps the better question is no longer “O QoE, where art thou?” but rather — “Are we ready to meet it where it truly resides?”

References

  • [Alben1996] Lauralee Alben. 1996. Quality of experience: defining the criteria for effective interaction design. interactions 3, 3 (May/June 1996), 11–15. https://doi.org/10.1145/235008.235010
  • [Bouch2000]: Anna Bouch, Allan Kuchinsky, and Nina Bhatti. 2000. Quality is in  the eye of the beholder: meeting users’ requirements for Internet quality of service. In Proceedings of the SIGCHI conference on Human Factors in Computing Systems (CHI ’00). Association for Computing Machinery, New York, NY, USA, 297–304. https://doi.org/10.1145/332040.332447
  • [Buchenau2000] Marion Buchenau and Jane Fulton Suri. 2000. Experience prototyping. In Proceedings of the 3rd conference on Designing interactive systems: processes, practices, methods, and techniques (DIS ’00). Association for Computing Machinery, New York, NY, USA, 424–433. https://doi.org/10.1145/347642.347802
  • [Ebrahimi2001] Public slide set by Touradj Ebrahimi (2012) “Quality of Experience Past, Present and Future Trends”, presented at Alpen-Adria-Universität Klagenfurt, 23 Nov 2012
  • ETSI TR 102 643 V1.0.1 (2009-12) “Human Factors (HF); Quality of Experience (QoE) requirements for real-time communication services”
  • [EuroNGI2004] Euro-NGI D.JRA.6.1.1: State-of-the-art with regards to user-perceived Quality of Service and quality feedback, Deliverable version 1.0, 31 May 2004, Lead: Markus Fiedler, BTH Karlskrona. https://www.diva-portal.org/smash/get/diva2:837296/FULLTEXT01.pdf (last accessed: 2025/04/22)
  • [Fiedler2012] Markus Fiedler, Sebastian Möller, and Peter Reichl. Quality of Experience: From User Perception to Instrumental Metrics (Dagstuhl Seminar 12181). In Dagstuhl Reports, Volume 2, Issue 5, pp. 1-25, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2012) https://doi.org/10.4230/DagRep.2.5.1
  • [Harman1990] Harman, G. (1990). The intrinsic quality of experience. Philosophical perspectives, 4, 31-52. https://doi.org/10.2307/2214186
  • [Heddaya2002] Heddaya, A. S. (2002). An economically scalable Internet. Computer, 35(9), 93-95. https://doi.org/10.1109/MC.2002.1033035
  • [Henderson1996] Austin Henderson. 1996. What’s next?—growing the notion of quality. Interactions 3, 3 (May/June 1996), 56–59. https://doi.org/10.1145/235008.235019
  • [Hestnes2009] Hestnes, B., Brooks, P., Heiestad, S. (2009). “QoE (Quality of Experience) – measuring QoE for improving the usage of telecommunication services”, Telenor R&I R 21/2009.
  • [Hossfeld2007] Hoßfeld, Tobias, Phuoc Tran-Gia, and Markus Fiedler. “Quantification of quality of experience for edge-based applications.” International Teletraffic Congress. Berlin, Heidelberg: Springer Berlin Heidelberg, 2007. https://doi.org/10.1007/978-3-540-72990-7_34
  • [Hossfeld2016] Hoßfeld, T., Heegaard, P. E., Varela, M., & Möller, S. (2016). QoE beyond the MOS: an in-depth look at QoE via better metrics and their relation to MOS. Quality and User Experience, 1, 1-23. https://doi.org/10.1007/s41233-016-0002-1
  • [Hossfeld2023] Hoßfeld, T., Varela, M., Skorin-Kapov, L., & Heegaard, P. E. (2023). A Greener Experience: Trade-Offs between QoE and CO2 Emissions in Today’s and 6G Networks. IEEE Communications Magazine, 61(9), 178-184. https://doi.org/10.1109/MCOM.006.2200490
  • [ITU-T E.800] “E.800: Terms and definitions related to quality of service and network performance including dependability”. ITU-T Recommendation, August 1994. Updated September 2008 as “Definitions of terms related to quality of service”. Last access: 2025/04/22
  • [ITU-T G.100/P.10 2007] ITU-T Rec. G.100/P.10 Amendment 1 (2007) New Appendix I—Definition of Quality of Experience (QoE). International Telecommunication Union, Geneva.
  • [Nahrstedt1994] Nahrstedt, K., & Smith, J. (1994). “Service Kernel for Multimedia Endpoints”. In R. Steinmetz (Ed.), Multimedia: Advanced Teleservices and High-speed Communication Architectures, Lecture Notes in Computer Science LNCS 868, chapter 1, pp. 8-22, Springer Verlag. https://doi.org/10.1007/3-540-58494-3_2
  • [QUALINET2013] Patrick Le Callet, Sebastian Möller, and Andrew Perkis, eds., Qualinet White Paper on Definitions of Quality of Experience (2012). European Network on Quality of Experience in Multimedia Systems and Services (COST Action IC 1003), Lausanne, Switzerland, Version 1.2, March 2013. Last access: 2025/04/22
  • [QUALINET2019] Karel Fliegel, Lukáš Krasula, and Werner Robitza. 2022. Qualinet databases: central resource for QoE research – history, current status, and plans. SIGMultimedia Rec. 11, 3, Article 5 (September 2019), 1 page. https://doi.org/10.1145/3524460.3524465
  • [QUALINET2020] Perkis, A., Timmerer, C., et al., “QUALINET White Paper on Definitions of Immersive Media Experience (IMEx)”, European Network on Quality of Experience in Multimedia Systems and Services, 14th QUALINET meeting (online), May 25, 2020. https://arxiv.org/abs/2007.07032
  • [Richards1998] Richards, A., Rogers, G., Witana, V., & Antoniades, M. (1998). “Mapping User Level QoS from a Single Parameter”. In Proceedings of the International Conference on Multimedia Networks and Services (MMNS ’98).
  • [Schatz2011] Schatz, R., Egger, S., & Platzer, A. (2011, June). Poor, good enough or even better? Bridging the gap between acceptability and QoE of mobile broadband data services. In 2011 IEEE International Conference on Communications (ICC) (pp. 1-6). IEEE. https://doi.org/10.1109/icc.2011.5963220
  • [Siller2003] Siller, M., & Woods, J. C. (2003, July). QoS arbitration for improving the QoE in multimedia transmission. In International Conference on Visual Information Engineering (VIE 2003). Ideas, Applications, Experience (pp. 238-241). London UK: IEE. https://doi.org/10.1049/cp:20030531
  • [Soldani2006]  Soldani, D., Li, M., & Cuny, R. (Eds.). (2007). QoS and QoE management in UMTS cellular systems. John Wiley & Sons. https://doi.org/10.1002/9780470034057
  • [Tondwalkar2024] Tondwalkar, A., Andres-Maldonado, P., Chandramouli, D., Liebhart, R., Moya, F. S., Kolding, T., & Perez, P. (2024). Provisioning Quality of Experience in 6G Networks. IEEE Access. https://doi.org/10.1109/ACCESS.2024.3455938
  • [Turner2000] Alan Turner and Lucy T. Nowell. 2000. Beyond the desktop: diversity and artistry. In CHI ’00 Extended Abstracts on Human Factors in Computing Systems (CHI EA ’00). Association for Computing Machinery, New York, NY, USA, 35–36. https://doi.org/10.1145/633292.633317
  • [Varela2014] Varela, M., Skorin-Kapov, L., & Ebrahimi, T. (2014). Quality of service versus quality of experience. In Quality of Experience: Advanced Concepts, Applications and Methods (pp. 85-96). Cham: Springer International Publishing. https://doi.org/10.1007/978-3-319-02681-7_6
  • [Ziegler2020] Ziegler, V., Viswanathan, H., Flinck, H., Hoffmann, M., Räisänen, V., & Hätönen, K. (2020). 6G architecture to connect the worlds. IEEE Access, 8, 173508-173520. https://doi.org/10.1109/ACCESS.2020.3025032

CASTLE 2024: A Collaborative Effort to Create a Large Multimodal Multi-perspective Daily Activity Dataset

This report describes the CASTLE 2024 event, a collaborative effort to create a PoV 4K video dataset recorded by a dozen people in parallel over several days. The participating content creators wore a GoPro and a Fitbit for approximately 12 hours each day while engaging in typical daily activities. The event took place in Ballyconneely, Ireland, and lasted for four days. The resulting data is publicly available and can be used for papers, studies, and challenges in the multimedia domain in the coming years. A preprint of the paper presenting the resulting dataset is available on arXiv (https://arxiv.org/abs/2503.17116). 

Introduction

Motivated by the need for a real-world PoV video dataset, a group of co-organizers of the annual VBS and LSC challenges came together to hold an invitational workshop and generate a novel PoV video dataset. In the first week of December 2024, twelve researchers from the multimedia community gathered in a remote house in Ballyconneely, Ireland, with the goal of creating a large multi-view and multimodal lifelogging video dataset. Equipped with a Fitbit on the wrist and a GoPro Hero 13 on the head for about 12 hours a day, and with five fixed cameras capturing the environment, they began a journey of 4K lifelogging. They lived together for four full days and performed typical daily tasks, such as cooking, eating, washing dishes, talking, discussing, reading, and watching TV, as well as playing games (ranging from paper-plane folding and darts to quizzes). While this sounds very enjoyable, the whole event required a lot of effort, discipline, and meticulous planning – in terms of food and, more importantly, data acquisition, data storage, and data synchronization, avoiding the use of any copyrighted material (books, movies, songs, etc.), limiting the use of smartphones and laptops due to privacy concerns, and making the content as diverse as possible. Figure 1 gives an impression of the event and shows different activities by the participants.

Figure 1: Participants at CASTLE 2024, having a light dinner and playing cards.

Organisational Procedure

Months before the event, we began planning the recording equipment, the participants, the activities, and the food.

The first challenge was figuring out a way to make wearing a GoPro camera all day as simple and enjoyable as possible. This was realized by using the camera with an elastic strap for a strong hold, a specifically adapted rubber pad on the back side of the camera, and a USB-C cable to a large 20,000 mAh power bank that every participant carried in their pocket. At the end of each day, the Fitbits, battery packs, and SD cards of every participant were collected, approximately 4 TB of data was copied to an on-site NAS system, the SD cards were cleared, and the batteries fully charged, so that everything was ready for use the next morning.

We ended up with six people from Dublin City University and six international researchers, though only ten of them wore recording equipment. Every participant was asked to prepare at least one breakfast, lunch, or dinner, and all the food and drinks were purchased a few days before the event.

After arrival at the house, every participant signed an agreement that all collected data could be publicly released and used for scientific purposes in the future.

CASTLE 2024 Multimodal Dataset

The dataset (https://castle-dataset.github.io/) that emerged from this collaborative effort contains heart-rate and step logs of 10 people, 4K@50fps video streams from five fixed-mounted cameras, as well as 4K video streams from 10 head-mounted cameras. The recording time is 7-12 hours per device per day, resulting in over 600 hours of video that totals about 8.5 TB of data after processing and more efficient re-encoding. The videos were split into hour-long parts aligned to start on the hour. This was achieved in a multi-stage process, using a machine-readable QR-code clock for initial rough alignment and subsequent audio-signal correlation analysis for fine alignment.

The language spoken in the videos is mainly English with a few parts of (Swiss-)German and Vietnamese. The activities by the participants include:

  • preparing food and drinks
  • eating
  • washing dishes
  • cleaning up
  • discussing
  • hiding items
  • presenting and listening
  • drawing and painting
  • playing games (e.g., chess, darts, guitar, various card games, etc.)
  • reading (out loud)
  • watching TV (open-source videos)
  • having a walk
  • having a car-ride

Use Scenarios of the Dataset

The dataset can be used for content retrieval contests, such as the Lifelog Search Challenge (LSC) and the Video Browser Showdown (VBS), but also for automatic content recognition and annotation challenges, such as the CASTLE Challenge that will happen at ACM Multimedia 2025 (https://castle-dataset.github.io/).  

Further application scenarios include complex scene understanding, 3D reconstruction and localization, audio event prediction, source separation, human-human/machine interaction, and many more.

Challenges of Organizing the Event

As this was the first collaborative event to collect such a multi-view, multimodal dataset, there were also some challenges worth mentioning that may help others who want to organize a similar event in the future.

First of all, the event turned out to be much more costly than originally planned. Reasons include increased living/rental costs, the travel costs for international participants, but also expenses for technical equipment such as batteries, which we originally did not intend to use. Originally we wanted to organize the event in a real castle, but that turned out to be far too expensive without offering a significant gain.

For the participants it was also hard to maintain privacy throughout, since not even quickly responding to emails was possible. When taking a walk or a car ride, we needed to make sure that other people or licence plates were not recorded.

In terms of the data, it should be mentioned that the different recording devices needed to be synchronized. This was achieved by regularly capturing dynamic QR codes showing the master (wall-clock) time and using these positions in all videos as temporal anchors during post-processing.
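The fine-alignment step mentioned above can be illustrated with a toy cross-correlation search: after the QR-clock anchors give a rough offset, the residual offset between two recordings is the lag that maximizes the correlation of their audio tracks. This is a simplified sketch on synthetic samples, not the actual pipeline used at the event:

```python
# Toy sketch of audio-based fine alignment: search, within a small window
# left over from the rough QR-clock alignment, for the lag that maximizes
# the cross-correlation between two audio signals.

def best_lag(a, b, max_lag):
    """Return the lag (in samples) by which b trails a, i.e. the lag
    maximizing sum(a[i] * b[i + lag]) within +/- max_lag samples."""
    best, best_score = 0, float("-inf")
    for lag in range(-max_lag, max_lag + 1):
        score = 0.0
        for i in range(len(a)):
            j = i + lag
            if 0 <= j < len(b):
                score += a[i] * b[j]
        if score > best_score:
            best, best_score = lag, score
    return best

# Synthetic example: b is a copy of a delayed by 3 samples.
a = [0.0, 0.1, 0.9, -0.4, 0.2, 0.0, -0.1, 0.05, 0.0, 0.0]
b = [0.0] * 3 + a[:-3]            # hypothetical "second camera" track
print(best_lag(a, b, max_lag=5))  # -> 3
```

In practice one would correlate short excerpts of downsampled audio energy envelopes (a brute-force loop over full 4K-video audio tracks would be far too slow), but the principle is the same.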

The data volume together with the available transfer speed was also an issue, and copying all the data from all SD cards required many hours during the nights.

Summary

The CASTLE 2024 event brought together twelve multimedia researchers in a remote house in Ireland for an intensive four-day data collection retreat, resulting in a rich multimodal 4K video dataset designed for lifelogging research. Equipped with head-mounted GoPro cameras and Fitbits, ten participants captured synchronized, real-world point-of-view footage while engaging in everyday activities like cooking, playing games, and discussing, with additional environmental video captured from fixed cameras. The team faced significant logistical challenges, including power management, synchronization, privacy concerns, and data storage, but ultimately produced over 600 hours of aligned video content. The dataset – freely available for scientific use – is intended to support future research and competitions focused on content-based video analysis, lifelogging, and human activity understanding.

MPEG Column: 149th MPEG Meeting in Geneva, Switzerland

The 149th MPEG meeting took place in Geneva, Switzerland, from January 20 to 24, 2025. The official press release can be found here. MPEG promoted three standards (among others) to Final Draft International Standard (FDIS), driving innovation in next-generation, immersive audio and video coding, and adaptive streaming:

  • MPEG-I Immersive Audio enables realistic 3D audio with six degrees of freedom (6DoF).
  • MPEG Immersive Video (Second Edition) introduces advanced coding tools for volumetric video.
  • MPEG-DASH (Sixth Edition) enhances low-latency streaming, content steering, and interactive media.

This column focuses on these new standards/editions based on the press release, amended with research aspects relevant for the ACM SIGMM community.

MPEG-I Immersive Audio

At the 149th MPEG meeting, MPEG Audio Coding (WG 6) promoted ISO/IEC 23090-4 MPEG-I immersive audio to Final Draft International Standard (FDIS), marking a major milestone in the development of next-generation audio technology.

MPEG-I immersive audio is a groundbreaking standard designed for the compact and highly realistic representation of spatial sound. Tailored for Metaverse applications, including Virtual, Augmented, and Mixed Reality (VR/AR/MR), it enables seamless real-time rendering of interactive 3D audio with six degrees of freedom (6DoF). Users can not only turn their heads in any direction (pitch/yaw/roll) but also move freely through virtual environments (x/y/z), creating an unparalleled sense of immersion.

True to MPEG’s legacy, this standard is optimized for efficient distribution – even over networks with severe bitrate constraints. Unlike proprietary VR/AR audio solutions, MPEG-I Immersive Audio ensures broad interoperability, long-term stability, and suitability for both streaming and downloadable content. It also natively integrates MPEG-H 3D Audio for high-quality compression.

The standard models a wide range of real-world acoustic effects to enhance realism. It captures detailed sound source properties (e.g., level, point sources, extended sources, directivity characteristics, and Doppler effects) as well as complex environmental interactions (e.g., reflections, reverberation, diffraction, and both total and partial occlusion). Additionally, it supports diverse acoustic environments, including outdoor spaces, multiroom scenes with connecting portals, and areas with dynamic openings such as doors and windows. Its rendering engine balances computational efficiency with high-quality output, making it suitable for a variety of applications.

Further reinforcing its impact, the upcoming ISO/IEC 23090-34 Immersive audio reference software will fully implement MPEG-I immersive audio in a real-time framework. This interactive 6DoF experience will facilitate industry adoption and accelerate innovation in immersive audio. The reference software is expected to reach FDIS status by April 2025.

With MPEG-I immersive audio, MPEG continues to set the standard for the future of interactive and spatial audio, paving the way for more immersive digital experiences.

Research aspects: Research can focus on optimizing the streaming and compression of MPEG-I immersive audio for constrained networks, ensuring efficient delivery without compromising spatial accuracy. Another key area is improving real-time 6DoF audio rendering by balancing computational efficiency and perceptual realism, particularly in modeling complex acoustic effects like occlusions, reflections, and Doppler shifts for interactive VR/AR/MR applications.

MPEG Immersive Video (Second Edition)

At the 149th MPEG meeting, MPEG Video Coding (WG 4) advanced the second edition of ISO/IEC 23090-12 MPEG immersive video (MIV) to Final Draft International Standard (FDIS), marking a significant step forward in immersive video technology.

MIV enables the efficient compression, storage, and distribution of immersive video content, where multiple real or virtual cameras capture a 3D scene. Designed for next-generation applications, the standard supports playback with six degrees of freedom (6DoF), allowing users to not only change their viewing orientation (pitch/yaw/roll) but also move freely within the scene (x/y/z). By leveraging strong hardware support for widely used video formats, MPEG immersive video provides a highly flexible framework for multi-view video plus depth (MVD) and multi-plane image (MPI) video coding, making volumetric video more accessible and efficient.

With the second edition, MPEG continues to expand the capabilities of MPEG immersive video, introducing a range of new technologies to enhance coding efficiency and support more advanced immersive experiences. Key additions include:

  • Geometry coding using luma and chroma planes, improving depth representation
  • Capture device information, enabling better reconstruction of the original scene
  • Patch margins and background views, optimizing scene composition
  • Static background atlases, reducing redundant data for stationary elements
  • Support for decoder-side depth estimation, enhancing depth accuracy
  • Chroma dynamic range modification, improving color fidelity
  • Piecewise linear normalized disparity quantization and linear depth quantization, refining depth precision

The second edition also introduces two new profiles: (1) MIV Simple MPI profile, allowing MPI content playback with a single 2D video decoder, and (2) MIV 2 profile, a superset of existing profiles that incorporates all newly added tools.

With these advancements, MPEG immersive video continues to push the boundaries of immersive media, providing a robust and efficient solution for next-generation video applications.

Research aspects: Possible research may explore advancements in MPEG immersive video to improve compression efficiency and real-time streaming while preserving depth accuracy and spatial quality. Another key area is enhancing 6DoF video rendering by leveraging new coding tools like decoder-side depth estimation and geometry coding, enabling more precise scene reconstruction and seamless user interaction in volumetric video applications.

MPEG-DASH (Sixth Edition)

At the 149th MPEG meeting, MPEG Systems (WG 3) advanced the sixth edition of MPEG-DASH (ISO/IEC 23009-1 Media presentation description and segment formats) by promoting it to the Final Draft International Standard (FDIS), the final stage of standards development. This milestone underscores MPEG’s ongoing commitment to innovation and responsiveness to evolving market needs.

The sixth edition introduces several key enhancements to improve the flexibility and efficiency of MPEG-DASH:

  • Alternative media presentation support, enabling seamless switching between main and alternative streams
  • Content steering signaling across multiple CDNs, optimizing content delivery
  • Enhanced segment sequence addressing, improving low-latency streaming and faster tune-in
  • Compact duration signaling using patterns, reducing MPD overhead
  • Support for Common Media Client Data (CMCD), enabling better client-side analytics
  • Nonlinear playback for interactive storylines, expanding support for next-generation media experiences

With these advancements, MPEG-DASH continues to evolve as a robust and scalable solution for adaptive streaming, ensuring greater efficiency, flexibility, and enhanced user experiences across a wide range of applications.

Research aspects: While advancing MPEG-DASH for more efficient and flexible adaptive streaming has been subject to research for a while, optimizing content delivery across multiple CDNs while minimizing latency and optimizing QoE remains an open issue. Another key area is enhancing interactivity and user experiences by leveraging new features like nonlinear playback for interactive storylines and improved client-side analytics through Common Media Client Data (CMCD).

The 150th MPEG meeting will be held online from March 31 to April 04, 2025. Click here for more information about MPEG meetings and their developments.

JPEG Column: 105th JPEG Meeting in Berlin, Germany

JPEG Trust becomes an International Standard

The 105th JPEG meeting was held in Berlin, Germany, from October 6 to 11, 2024. During this JPEG meeting, JPEG Trust was sent for publication as an International Standard. This is a major achievement in providing standardized tools to effectively fight against the proliferation of fake media and disinformation while restoring confidence in multimedia information.

In addition, the JPEG Committee also sent for publication the JPEG Pleno Holography standard, which is the first standardized solution for holographic content coding. This type of content might be represented by huge amounts of information, and efficient compression is needed to enable reliable and effective applications.

The following sections summarize the main highlights of the 105th JPEG meeting:

105th JPEG Meeting, held in Berlin, Germany.
  • JPEG Trust
  • JPEG Pleno
  • JPEG AI
  • JPEG XE
  • JPEG AIC
  • JPEG DNA
  • JPEG XS
  • JPEG XL


JPEG Trust

In an important milestone, the first part of JPEG Trust, the “Core Foundation” (ISO/IEC IS 21617-1) International Standard, has now been approved by the international ISO committee and is being published. This standard addresses the problem of dis- and misinformation and provides leadership in global interoperable media asset authenticity. JPEG Trust defines a framework for establishing trust in digital media.

Users of social media are challenged to assess the trustworthiness of the media they encounter, and agencies that depend on the authenticity of media assets must be concerned with mistaking fake media for real, with risks of real-world consequences. JPEG Trust provides a proactive approach to trust management. It is built upon and extends the Coalition for Content Provenance and Authenticity (C2PA) engine. The first part defines the JPEG Trust framework and provides building blocks for more elaborate use cases via its three main pillars:

  • Annotating provenance – linking media assets together with their associated provenance annotations in a tamper-evident manner
  • Extracting and evaluating Trust Indicators – specifying how to extract an extensive array of Trust Indicators from any given media asset for evaluation
  • Handling privacy and security concerns – providing protection for sensitive information based on the provision of JPEG Privacy and Security (ISO/IEC 19566-4)

Trust in digital media is context-dependent. JPEG Trust does NOT explicitly define trustworthiness but rather provides a framework and tools for proactively establishing trust in accordance with the trust conditions needed. The JPEG Trust framework outlined in the core foundation enables individuals, organizations, and governing institutions to identify specific conditions for trustworthiness, expressed in Trust Profiles, to evaluate relevant Trust Indicators according to the requirements for their specific usage scenarios. The resulting evaluation can be expressed in a Trust Report to make the information easily accessed and understood by end users.

JPEG Trust has an ambitious schedule of future work, including evolving and extending the core foundation into related topics of media tokenization and media asset watermarking, and assembling a library of common Trust Profile requirements.

JPEG Pleno

The JPEG Pleno Holography activity reached a major milestone with the FDIS of ISO/IEC 21794-5 being accepted and the International Standard being under preparation by ISO. This is a major achievement for this activity and the result of the dedicated work of the JPEG Committee over a number of years. The JPEG Pleno Holography activity continues with the development of a White Paper on JPEG Pleno Holography, to be released at the 106th JPEG meeting, and the planning of a workshop on future standardization in holography, intended to be held in November or December 2024.

The JPEG Pleno Light Field activity focused on the 2nd edition of ISO/IEC 21794-2 (“Plenoptic image coding system (JPEG Pleno) Part 2: Light field coding”) which will integrate AMD1 of ISO/IEC 21794-2 (“Profiles and levels for JPEG Pleno Light Field Coding”) and include the specification of the third coding mode entitled Slanted 4D Transform Mode and the associated profile.

Following the Call for Contributions on Subjective Light Field Quality Assessment, and as a result of the collaborative process, the JPEG Pleno Light Field activity is also preparing standardization activities for subjective and objective quality assessment of light fields. At the 105th JPEG meeting, collaborative subjective results on light field quality assessment were presented and discussed. The results will guide the subjective quality assessment standardization process, which has issued its fourth Working Draft.

The JPEG Pleno Point Cloud activity released a White Paper on JPEG Pleno Learning-based Point Cloud Coding. This document outlines the context, motivation, and scope of the upcoming Part 6 of ISO/IEC 21794, scheduled for publication in early 2025, as well as the basis of the new technology, use cases, performance, and future activities. The activity is now focused on a new exploration study into latent space optimization for the current Verification Model.

JPEG AI

At the 105th meeting, the JPEG AI activity primarily concentrated on advancing Part 2 (Profiling), Part 3 (Reference Software), and Part 4 (Conformance). Part 4 moved forward to the Committee Draft (CD) stage, while Parts 2 and 3 are anticipated to reach the DIS stage at the next meeting. The conformance CD outlines three types of conformance: 1) strict conformance for decoded residuals; 2) soft conformance for decoded feature tensors, allowing minor deviations; and 3) soft conformance for decoded images, ensuring that image quality remains comparable to or better than that offered by the reference model. For decoded images, two types of soft conformance were introduced based on device capabilities. Discussions on Part 2 examined memory requirements for various JPEG AI VM codec configurations. Additionally, three core experiments were established during this meeting, focusing on JPEG AI subjective assessment, integerization, and the study of profiles and levels.

JPEG XE

The JPEG XE activity is currently focused on preparing to handle responses to the open Final Call for Proposals on lossless coding of events. This activity revolves around a new and emerging image modality created by event-based visual sensors. JPEG XE concerns the creation and development of a standard that represents events in an efficient way, allowing interoperability between sensing, storage, and processing, targeting machine vision and other relevant applications. The Final Call for Proposals ends in March 2025 and aims to receive relevant coding tools that will serve as a basis for a JPEG XE standard. The JPEG Committee is also preparing discussions on lossy coding of events and on how to evaluate such lossy coding technologies in the future. The JPEG Committee invites those interested in the JPEG XE activity to consult the public documents available on jpeg.org. The Ad-hoc Group on event-based vision was re-established to continue work towards the 106th JPEG meeting. To stay informed about this activity, please join the event-based vision Ad-hoc Group mailing list.

JPEG AIC

Part 3 of JPEG AIC (AIC-3) advanced to the Committee Draft (CD) stage during the 105th JPEG meeting. AIC-3 defines a methodology for subjective assessment of the visual quality of high-fidelity images. Based on two test protocols—Boosted Triplet Comparisons and Plain Triplet Comparisons—it reconstructs a fine-grained quality scale in JND (Just Noticeable Difference) units. According to the defined work plan, JPEG AIC-3 is expected to advance to the Draft International Standard (DIS) stage by April 2025 and become an International Standard (IS) by October 2026. During this meeting, the JPEG Committee also focused on the upcoming Part 4 of JPEG AIC, which refers to the objective quality assessment of high-fidelity images.

JPEG DNA

JPEG DNA is an initiative aimed at developing a standard capable of representing bi-level, continuous-tone grey-scale, continuous-tone colour, or multichannel digital samples in a format using nucleotide sequences to support DNA storage. The JPEG DNA Verification Model was created during the 102nd JPEG meeting based on the performance assessments and descriptive analyses of the solutions submitted in response to the Call for Proposals published at the 99th JPEG meeting. Several core experiments are being conducted to validate and improve this Verification Model (VM), which led to the creation of the first Working Draft of JPEG DNA during the 103rd JPEG meeting. At the 105th JPEG meeting, the committee created a New Work Item Proposal for JPEG DNA to make it an official ISO work item. The proposal stated that JPEG DNA would be a multi-part standard: Part 1—Core Coding System, Part 2—Profiles and Levels, Part 3—Reference Software, and Part 4—Conformance. The committee aims to reach the IS stage for Part 1 by April 2026.
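For intuition only, the core idea of nucleotide-based representation can be reduced to a naive quaternary code that packs two bits into each of the four nucleotides (A, C, G, T). The actual JPEG DNA codec is considerably more sophisticated, since it must respect biochemical constraints of synthesis and sequencing (e.g., avoiding long homopolymer runs), which this sketch ignores:

```python
# Naive 2-bits-per-nucleotide mapping, for illustration only.
# The JPEG DNA Verification Model uses far more elaborate transcoding.

BITS_TO_NT = {"00": "A", "01": "C", "10": "G", "11": "T"}
NT_TO_BITS = {nt: bits for bits, nt in BITS_TO_NT.items()}

def bytes_to_dna(data: bytes) -> str:
    """Encode each byte as four nucleotides (two bits each)."""
    bits = "".join(f"{byte:08b}" for byte in data)
    return "".join(BITS_TO_NT[bits[i:i + 2]] for i in range(0, len(bits), 2))

def dna_to_bytes(seq: str) -> bytes:
    """Invert the mapping back to the original byte string."""
    bits = "".join(NT_TO_BITS[nt] for nt in seq)
    return bytes(int(bits[i:i + 8], 2) for i in range(0, len(bits), 8))

strand = bytes_to_dna(b"JPEG")
print(strand)                       # CAGGCCAACACCCACT
assert dna_to_bytes(strand) == b"JPEG"
```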

JPEG XS

The third editions of JPEG XS Part 1 – Core coding tools, Part 2 – Profiles and buffer models, and Part 3 – Transport and container formats have now been published and made available by ISO. The JPEG Committee is finalizing the third edition of the remaining two parts of the JPEG XS standards suite, Part 4 – Conformance testing and Part 5 – Reference software. The FDIS of Part 4 was issued for ballot at this meeting. Part 5 is still at the Committee Draft stage, and the DIS is planned for the next JPEG meeting. The reference software has a feature-complete decoder fully compliant with the 3rd edition. Work on the TDC profile encoder is ongoing.

JPEG XL

A third edition of JPEG XL Part 2 (File Format) will be initiated to add an embedding syntax for ISO 21496 gain maps, which can be used to represent a custom local tone mapping and have artistic control over the SDR rendition of an HDR image coded with JPEG XL. Work on hardware and software implementations continues, including a new Rust implementation.

Final Quote

“In its commitment to tackle dis/misinformation and to manage provenance, authorship, and ownership of multimedia information, the JPEG Committee has reached a major milestone by publishing the first ever ISO/IEC endorsed specifications for bringing back trust into multimedia. The committee will continue developing additional enhancements to JPEG Trust. New parts of the standard are under development to define a set of additional tools to further enhance interoperable trust mechanisms in multimedia.” said Prof. Touradj Ebrahimi, the Convenor of the JPEG Committee.

Overview of Open Dataset Sessions and Benchmarking Competitions in 2023-2024 – Part 3 (MediaEval 2023, ImageCLEF 2024)


In this final part of the Overview of Open Dataset Sessions and Benchmarking Competitions, we focus on the latest editions of some of the most popular multimedia-centric benchmarking competitions, continuing our reviews from previous years (https://records.sigmm.org/2023/01/19/overview-of-open-dataset-sessions-and-benchmarking-competitions-in-2022-part-3/). This third part of our review covers two benchmarking competitions:

  • MediaEval 2023 (https://multimediaeval.github.io/editions/2023/). We present the five benchmarking tasks, which target a wide range of topics, including medical multimedia applications (Medico), multimodal understanding of smells (Musti), multimodal content in news media (NewsImages), social media video memorability (Memorability), and sports action classification (SportsVideo).
  • ImageCLEF 2024 (https://www.imageclef.org/2024). This edition of ImageCLEF targets a wide range of tasks, covering four different medical-focused tasks (medical captions, Visual Question Answering, remote medicine, and GANs in medical scenarios), recommendation systems for editorials, image retrieval and generation, and pictogram generation from textual information.

For an overview of the QoMEX 2023 and QoMEX 2024 conferences, please see the first part of this column (https://records.sigmm.org/2024/09/07/overview-of-open-dataset-sessions-and-benchmarking-competitions-in-2023-2024-part-1-qomex-2023-and-qomex-2024/), while for an overview of the MDRE special sessions at MMM 2023 and MMM 2024, please take a look at the second part of this column (https://records.sigmm.org/2024/11/19/overview-of-open-dataset-sessions-and-benchmarking-competitions-in-2023-2024-part-2-mdre-at-mmm-2023-and-mmm-2024/).

MediaEval 2023

The MediaEval Multimedia Evaluation benchmark (https://multimediaeval.github.io/) offers challenges in artificial intelligence for multimedia data, engaging participants in benchmarking tasks centered on the retrieval, classification, generation, analysis, and exploration of multimodal data. The latest editions of MediaEval also aim to delve deeper into understanding the data, trends, and system performance by proposing a set of Quest for Insight (Q4I) questions and themes for each task. A column signed by the Coordination Committee of the latest MediaEval edition, outlining MediaEval’s history, impressions from the latest edition, and plans for the future, was published in the October 2024 edition of our records (https://records.sigmm.org/2024/11/15/one-benchmarking-cycle-wraps-up-and-the-next-ramps-up-news-from-the-mediaeval-multimedia-benchmark/). MediaEval 2023 (https://multimediaeval.github.io/editions/2023/) was held on 1-2 February 2024, co-located with MMM 2024 in Amsterdam, Netherlands, and the Coordination Committee was composed of Mihai Gabriel Constantin (University Politehnica of Bucharest, Romania), Steven Hicks (SimulaMet, Norway), and Martha Larson (Radboud University, Netherlands) as the main coordinator.

Medical Multimedia Task – Transparent Tracking of Spermatozoa
Paper available at: https://ceur-ws.org/Vol-3658/paper1.pdf
Vajira Thambawita, Andrea Storås, Tuan-Luc Huynh, Hai-Dang Nguyen, Minh-Triet Tran, Trung-Nghia Le, Pål Halvorsen, Michael Riegler, Steven Hicks, Thien-Phuc Tran
SimulaMet, Norway, OsloMet, Norway, University of Science, VNU-HCM, Vietnam, Vietnam National University, Ho Chi Minh City, Vietnam
Dataset available at: https://multimediaeval.github.io/editions/2023/tasks/medico/

The Medico task provides a set of spermatozoa videos, tracked with a set of frame-by-frame bounding box annotations, tasking participants with the prediction of standard sperm quality assessment measurements, specifically the motility (movement) of spermatozoa (living sperm cells).

Musti: Multimodal Understanding of Smells in Texts and Images
Paper available at: https://ceur-ws.org/Vol-3658/paper34.pdf
Ali Hürriyetoğlu, Inna Novalija, Mathias Zinnen, Vincent Christlein, Pasquale Lisena, Stefano Menini, Marieke van Erp, Raphael Troncy
KNAW Humanities Cluster, DHLab, Jožef Stefan Institute, Slovenia, Pattern Recognition Lab, Friedrich-Alexander-Universität Erlangen-Nürnberg, EURECOM, Sophia Antipolis, France, Fondazione Bruno Kessler, Trento, Italy
Dataset available at: https://multimediaeval.github.io/editions/2023/tasks/musti/

Musti is an innovative task, seeking to understand the descriptions and depictions of smells in multilingual texts (English, German, Italian, French, Slovenian) and images from the 17th to the 20th century. Participants must create systems that recognize references to smells in texts and images, connecting these references across different modalities.

NewsImages: Connecting Text and Images
Paper available at: https://ceur-ws.org/Vol-3658/paper4.pdf
Andreas Lommatzsch, Benjamin Kille, Özlem Özgöbek, Mehdi Elahi, Duc Tien Dang Nguyen
Technische Universität Berlin, Berlin, Germany, Norwegian University of Science and Technology, Trondheim, Norway, University of Bergen, Bergen, Norway.
Dataset available at: https://multimediaeval.github.io/editions/2023/tasks/newsimages/

In this edition of the NewsImages task, participants are encouraged to discover patterns and models that describe the relation between the images and texts of news articles, i.e., article bodies and their headlines.

Predicting Video Memorability
Paper available at: https://ceur-ws.org/Vol-3658/paper2.pdf
Mihai Gabriel Constantin, Claire-Hélène Demarty, Camilo Fosco, Alba García Seco de Herrera, Sebastian Halder, Graham Healy, Bogdan Ionescu, Ana Matran-Fernandez, Rukiye Savran Kiziltepe, Alan F. Smeaton, Lorin Sweeney
University Politehnica of Bucharest, Romania, InterDigital, France, Massachusetts Institute of Technology Cambridge, USA, University of Essex, UK, Dublin City University, Ireland, Karadeniz Technical University, Turkey
Dataset available at: https://multimediaeval.github.io/editions/2023/tasks/memorability/

The organizers propose a dataset that studies the long-term memorability of social media-like videos, providing participants with an extensive data set of videos with memorability annotations, related information, pre-extracted state-of-the-art visual features, and Electroencephalography (EEG) recordings.

SportsVideo: Fine Grained Action Classification and Position Detection in Table Tennis and Swimming Videos
Paper available at: https://ceur-ws.org/Vol-3658/paper3.pdf
Aymeric Erades, Pierre-Etienne Martin, Romain Vuillemot, Boris Mansencal, Renaud Peteri, Julien Morlier, Stefan Duffner, Jenny Benois-Pineau
Ecole Centrale de Lyon, LIRIS, France, CCP Department, Max Planck Institute for Evolutionary Anthropology, Leipzig, Germany, University of Bordeaux, Labri, France, INSA Lyon, LIRIS, France
Dataset available at: https://multimediaeval.github.io/editions/2023/tasks/sportsvideo/

The organizers developed a set of six sub-tasks covering table tennis and swimming, related to athlete position detection, stroke detection, the classification of motions, field or table registration, sound detection in sports, and scores and result extraction from visual cues.

ImageCLEF 2024

ImageCLEF (https://www.imageclef.org/) is part of the popular CLEF initiative (https://www.clef-initiative.eu/) and states as its main goal the evaluation of technologies for the annotation, indexing, classification, and retrieval of multimodal data. The 2024 edition of ImageCLEF (https://www.imageclef.org/2024) was organized from 9 to 12 September 2024 in Grenoble, France, with an Organization Committee composed of Bogdan Ionescu, Henning Müller, Ana-Maria Drăgulinescu, Ivan Eggel, and Liviu-Daniel Ștefan.

ImageCLEFmedical Caption
Paper available at: https://ceur-ws.org/Vol-3740/paper-132.pdf
Johannes Rückert, Asma Ben Abacha, Alba G. Seco de Herrera, Louise Bloch, Raphael Brüngel, Ahmad Idrissi-Yaghir, Henning Schäfer, Benjamin Bracke, Hendrik Damm, Tabea M. G. Pakull, Cynthia Sabrina Schmidt, Henning Müller, Christoph M. Friedrich
Department of Computer Science, University of Applied Sciences and Arts Dortmund, Dortmund, Germany, Microsoft, Redmond, Washington, USA, University of Essex, UK, UNED, Spain, Institute for Medical Informatics, Biometry and Epidemiology (IMIBE), University Hospital Essen, Germany, Institute for Artificial Intelligence in Medicine (IKIM), University Hospital Essen, Germany, Institute for Transfusion Medicine, University Hospital Essen, Essen, Germany, University of Applied Sciences Western Switzerland (HES-SO), Switzerland, University of Geneva, Switzerland
Dataset available at: https://www.imageclef.org/2024/medical/caption

The medical caption task focuses on evaluating models that detect medical concepts and automatically create captions for medical images, which can be further applied for context-based image and information retrieval purposes.

ImageCLEFmed VQA
Paper available at: https://ceur-ws.org/Vol-3740/paper-131.pdf
Steven Hicks, Andrea Storås, Pål Halvorsen, Michael Riegler, Vajira Thambawita
SimulaMet, Oslo, Norway, OsloMet- Oslo Metropolitan University, Oslo, Norway
Dataset available at: https://www.imageclef.org/2024/medical/vqa

This edition of the medical VQA task focuses on images of the gastrointestinal tract, tasking participants with harnessing artificial intelligence to generate medical images based on text input, while also looking for optimal prompts for off-the-shelf generative models, thus augmenting the datasets associated with the previous edition of this task.

ImageCLEFmed MEDIQA-MAGIC
Paper available at: https://ceur-ws.org/Vol-3740/paper-133.pdf
Wen-Wai Yim, Asma Ben Abacha, Yujuan Fu, Zhaoyi Sun, Meliha Yetisgen, Fei Xia
Microsoft Health AI, Redmond, USA, University of Washington, Seattle, USA.
Dataset available at: https://www.imageclef.org/2024/medical/mediqa

The MEDIQA task focuses on the problem of Multimodal And Generative TelemedICine (MAGIC) in the area of dermatology. Participants must develop systems that can take queries, text, clinical context, and images as input and generate appropriate medical textual responses to this input in a telemedicine setting.

ImageCLEFmed GANs
Paper available at: https://ceur-ws.org/Vol-3740/paper-130.pdf
Alexandra-Georgiana Andrei, Ahmedkhan Radzhabov, Dzmitry Karpenka, Yuri Prokopchuk, Vassili Kovalev, Bogdan Ionescu, Henning Müller
AI Multimedia Lab, National University of Science and Technology Politehnica Bucharest, Romania, Belarusian Academy of Sciences, Minsk, Belarus, University of Applied Sciences Western Switzerland (HES-SO), Sierre, Switzerland.
Dataset available at: https://www.imageclef.org/2024/medical/gans

This task addresses the challenges of privacy preservation in artificially generated medical images, looking for “fingerprints” of the original real-world training images in a set of artificially generated images; such fingerprints may compromise patient privacy when exposed in unwanted or unforeseen circumstances.

ImageCLEFrecommending
Alexandru Stan, George Ioannidis, Bogdan Ionescu, Hugo Manguinhas
IN2 Digital Innovations, Germany, Politehnica University of Bucharest, Romania, Europeana Foundation, Netherlands
Dataset available at: https://www.imageclef.org/2024/recommending

This task identifies traditional multimedia search methods as a performance bottleneck and proposes the development of recommendation methods and systems applied to blog posts, editorials, and galleries, targeting data related to cultural heritage organizations and collections.

Image Retrieval for Arguments (part of Touché at CLEF)
Paper available at: https://ceur-ws.org/Vol-3740/paper-322.pdf
Johannes Kiesel, Çağrı Çöltekin, Maximilian Heinrich, Maik Fröbe, Milad Alshomary, Bertrand De Longueville, Tomaž Erjavec, Nicolas Handke, Matyáš Kopp, Nikola Ljubešić, Katja Meden, Nailia Mirzakhmedova, Vaidas Morkevičius, Theresa Reitis-Münstermann, Mario Scharfbillig, Nicolas Stefanovitch, Henning Wachsmuth, Martin Potthast, Benno Stein
Bauhaus-Universität Weimar, University of Tübingen, Friedrich-Schiller-Universität Jena, Leibniz University Hannover, European Commission, Joint Research Centre (JRC), Jožef Stefan Institute, Leipzig University, Charles University, Kaunas University of Technology, Arcadia Sistemi Informativi Territoriali, University of Kassel, hessian.AI, and ScaDS.AI
Dataset available at: https://www.imageclef.org/2024/image-retrieval-for-arguments

The goal for this task is the retrieval of images and data that can increase the persuasiveness of an argument, building upon the datasets of topics developed in previous editions of the Touché task.

ImageCLEF ToPicto
Cécile Macaire, Benjamin Lecouteux, Didier Schwab, Emmanuelle Esperança-Rodier
Université Grenoble Alpes, LIG, France
Dataset available at: https://www.imageclef.org/2023/topicto

Targeting the alleviation of symptoms related to diseases that cause language impairment, the ToPicto task proposes the development of automated systems that translate text or speech into visual pictograms, which can then be used as communication aids and tools.

Challenges in Experiencing Realistic Immersive Telepresence


Immersive imaging technologies offer a transformative way to experience interacting with remote environments, i.e., telepresence. By leveraging advancements in light field imaging, omnidirectional cameras, and head-mounted displays, these systems enable realistic, real-time visual experiences that can revolutionize how we interact with remote scenes in fields such as healthcare, education, remote collaboration, and entertainment. However, the field faces significant technical and experiential challenges, including efficient data capture and compression, real-time rendering, and quality of experience (QoE) assessment. Expanding on the findings of the authors’ recent publication and situating them within a broader theoretical framework, this article provides an integrated overview of immersive telepresence technologies, focusing on their technological foundations, applications, and the challenges that must be addressed to advance this field.

1. Redefining Telepresence Through Immersive Imaging

Telepresence is defined as the “sense of being physically present at a remote location through interaction with the system’s human interface” [Minsky1980]. Such virtual presence is made possible by digital imaging systems and real-time communication of visuals and interaction signals. Immersive imaging systems such as light fields and omnidirectional imaging enhance the visual sense of presence, i.e., “being there” [IJsselsteijn2000], with photorealistic recreation of the remote scene. This emerging field has seen rapid growth, both in research and development [Valenzise2022], due to advancements in imaging and display technologies, combined with increasing demand for interactive and immersive experiences. Figure 1 provides a visualization comparing a telepresence system that utilizes traditional cameras and controls with an immersive telepresence system.

Figure 1 – A side-by-side visualization of a traditional telepresence system (left) and an immersive telepresence system (right).

The experience of “presence” consists of three components according to Schubert et al. [Schubert2001], which are renamed in this article to take into account other definitions:

  1. Realness – “Realness” [Schubert2001] or “realism” [Takatalo2008] of the environment (i.e., in this case, the remote scene) relates to the “believability, the fidelity and validity of sensory features within the generated environments, e.g., photorealism” [Perkis2020].
  2. Immersion – The user’s level of “involvement” [Schubert2001] and “concentration to the virtual environment instead of real world, loss of time” [Takatalo2008]; “the combination of sensory cues with symbolic cues essential for user emplacement and engagement” [Perkis2020].
  3. Spatiality – An attribute of the environment that helps “transport” the user by inducing spatial awareness [Schubert2001], which allows “spatial presence” [Takatalo2008] and “the possibility for users to move freely and discover the world offered” [Perkis2020].

Immersion can happen without realness or spatiality, for example, while reading a novel. Telepresence using traditional imaging systems might not be immersive if the display is relatively small and other distractors are present in the visual field. Realistic immersive telepresence necessitates higher degrees of freedom (e.g., 3DoF+ or 6DoF) compared to a telepresence application with a traditional display. In this context, new view synthesis methods and spherical light field representations (cf. Section 3) will be crucial in providing correct depth cues and depth perception, which will increase realness and spatiality tremendously.

The rapid progress of immersive imaging technologies and their adoption can largely be attributed to advancements in processing and display systems, including light field displays and extended reality (XR) headsets. These XR headsets are becoming increasingly affordable while delivering excellent user experiences [Jackson2023], paving the way for the widespread adoption of immersive communication and telepresence applications in the near future. To further accelerate this transition, extensive efforts are being undertaken in both academia and industry.

The visual realism (i.e., realness) in realistic immersive telepresence relies on acquired photos rather than computer-generated imagery (CGI). In healthcare, it enables realistic remote consultations and surgical collaborations [Wisotzky2025]. In education and training, it facilitates immersive, location-independent learning environments [Kachach2021]. Similarly, visual realism can enhance remote collaboration by creating lifelike meeting spaces, while in media and entertainment, it can provide unprecedented realism for live events and performances, offering users a closer connection and a feeling of being present at remote sites.

This article provides a brief overview of the technological foundations, applications, and challenges in immersive telepresence. Its novel contribution is setting up a theoretical framework for realistic immersive telepresence informed by prior literature and positioning the findings of the authors’ recent publication [Zerman2024] within this broader framework. It explores how foundational technologies like light field imaging and real-time rendering drive the field forward, while also identifying critical obstacles, such as dataset availability, compression efficiency, and QoE evaluation.

2. Technological Foundations for Immersive Telepresence

A realistic immersive telepresence can be made possible by enabling its main defining factors of realness (e.g., photorealism), immersion, and spatiality. Although these factors can be satisfied with other modalities (e.g., spatial audio), this article focuses on the visual modality and visual recreation of the remote scene.

2.1 Immersive Imaging Modalities

Immersive imaging technologies encompass a wide range of methods aimed at capturing and recreating realistic visual and spatial experiences. These include light fields, omnidirectional images, volumetric videos using either point clouds or 3D meshes, holography, multi-view stereo imaging, neural radiance fields, Gaussian splats, and other extended reality (XR) applications — all of which contribute to recreating highly realistic and interactive representations of scenes and environments.

Light fields (LF) are vector fields of all the light rays passing through a given region in space, describing the intensity and direction of light at every point. This is fully described through the plenoptic function [Adelson1991] as follows: P(x,y,z,θ,ϕ,λ,t), where x, y, and z describe the 3D position of sampling, θ and ϕ are the angular direction, λ is the wavelength of the light ray, and t is time. Traditionally, LFs are represented using the two-plane parametrization [Levoy1996] with 2 spatial dimensions and 2 angular dimensions; however, this parametrization limits the use case of LFs to processing planar visual stimuli. The plenoptic function can be leveraged beyond the two-plane parameterization for a highly detailed view reconstruction or view synthesis. Newer capture scenarios and representations enable increased immersion with LFs [Overbeck2018],[Broxton2020], which can be further advanced in the future.
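As a minimal illustration of the two-plane parametrization, a light field can be treated as a 4D array L(u, v, s, t), where (u, v) indexes the camera (aperture) plane and (s, t) the image plane. The sketch below fills the array with synthetic values; the helper names are ours, not from any light field library:

```python
# Two-plane light field sketch: L(u, v, s, t) with (u, v) on the
# camera plane and (s, t) on the image plane. Values are synthetic.

def make_light_field(n_u, n_v, n_s, n_t):
    """Build a tiny synthetic light field as nested lists."""
    return [[[[(u + v + s + t) % 256
               for t in range(n_t)]
              for s in range(n_s)]
             for v in range(n_v)]
            for u in range(n_u)]

def sub_aperture_view(lf, u, v):
    """The 2D view seen from camera-plane position (u, v)."""
    return lf[u][v]

def angular_average(lf, s, t):
    """Average one image-plane sample (s, t) over all (u, v)
    directions -- the angular integration behind synthetic refocus."""
    samples = [lf[u][v][s][t]
               for u in range(len(lf)) for v in range(len(lf[0]))]
    return sum(samples) / len(samples)

lf = make_light_field(3, 3, 4, 4)
print(sub_aperture_view(lf, 1, 2)[0][0])  # 3
print(angular_average(lf, 0, 0))          # 2.0
```

The fixed two-plane layout is exactly what limits such LFs to planar stimuli; spherical or unstructured captures require richer parametrizations of the plenoptic function.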

Omnidirectional image (or video) representation can provide an all-encompassing 360-degree view of a scene from a point in space for immersive visualization [Yagi1999], [Maugey2023]. This is made possible by stitching multiple views together. The created spherical image can be stored using traditional image formats (i.e., 2D planar formats) by projecting the sphere to a planar format (e.g., equirectangular projection, cubemap projection, and others); however, processing these projected representations without proper consideration of their spherical nature results in errors or biases.
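As an illustration of one common projection, the sketch below maps a 3D viewing direction onto pixel coordinates of an equirectangular image. Axis and image-origin conventions vary between tools, and the function name is ours:

```python
import math

def direction_to_equirect(x, y, z, width, height):
    """Map a 3D viewing direction to (column, row) in a
    width x height equirectangular image. Assumes +z is forward,
    +y is up, and the image origin is the top-left corner."""
    lon = math.atan2(x, z)                                 # longitude in (-pi, pi]
    lat = math.asin(y / math.sqrt(x * x + y * y + z * z))  # latitude in [-pi/2, pi/2]
    col = (lon / (2 * math.pi) + 0.5) * width
    row = (0.5 - lat / math.pi) * height
    return col, row

# Looking straight ahead (+z) lands at the image centre.
print(direction_to_equirect(0, 0, 1, 4096, 2048))  # (2048.0, 1024.0)
```

The sketch also hints at the bias mentioned above: rows near the poles cover far less solid angle than rows near the equator, so pixel-uniform processing over-weights the poles.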

2.2 Processing Requirements for Realistic Immersive Telepresence

Immersive telepresence relies on capturing, transmitting, and rendering realistic representations of remote environments. “Capturing” can be considered an inherent part of the imaging modalities discussed in the previous section. For transmitting and rendering, there are different requirements to take into account.

Compression is an important step for telepresence, which relies heavily on real-time transmission of the visual data from the remote scene. Its importance increases further for immersive telepresence applications, as immersive imaging modalities capture (and represent) more information and therefore need even stronger compression than telepresence using traditional 2D imaging systems. Compression of LFs [Stepanov2023], omnidirectional images and video [Croci2020], and other forms of immersive video, such as MPEG Immersive Video [Boyce2021], volumetric 3D representations using point clouds [Graziosi2020], and textured 3D meshes [Marvie2022], has been a very active research topic over the last decade, leading to the standardization of compression methods for some immersive imaging modalities.

Rendering [Eisert2023], [Maugey2023] is yet another important aspect, especially for LFs [Overbeck2018]. The LF data needs to be rendered correctly for the position of the viewer (i.e., to render interpolated or extrapolated views) to provide a realistic and immersive experience to the user. Without such view rendering, the final displayed visuals will appear jittery, making it harder to sustain the “suspension of disbelief” necessary for an immersive experience. Furthermore, this rendering has to run in real time, as required for telepresence. Although technologies such as GPU acceleration and advanced compression algorithms ensure seamless interaction while minimizing latency, the quality and realness of the remote scene are still open problems.

Immersive telepresence systems rely on specialized hardware, including omnidirectional cameras, head-mounted displays, and motion tracking systems. These components must work in harmony to deliver high-quality, immersive experiences. Falling prices and the increasing availability of such specialized devices make them easier to deploy in industrial settings [Jackson2023], regardless of business size, and enable the democratization of immersive imaging applications in a broader sense.

3. Efforts in Creating a Realistic Immersive Telepresence Experience

Creating an immersive telepresence system has been the topic of many scholarly studies. These include frameworks for group-to-group telepresence [Beck2013], capture and delivery frameworks for volumetric 3D models [Fechteler2013], and various other social XR applications [Cortés2024]. Google’s Project Starline can also be mentioned here, as it includes realness and immersion in its delivery of the visuals, creating an immersive experience [Lawrence2024], [Starline2025], although its main functionality is interpersonal video communication. In supporting realness, LFs [Broxton2020] and other types of neural representations [Suhail2022] can create views that support reflections and similar non-Lambertian light-material interactions occurring in the remote scene, whereas reconstructed 3D objects are usually textured under the assumption of Lambertian materials [Zhi2020].

Light field reconstruction [Gond2023] and new view synthesis from a single view [Lin2023] or sparse views [Chibane2021] are valid ways to approach creating realistic immersive telepresence experiences. Various representations can be used to recreate views that support user movement and the spatial-awareness factor of presence in the remote scene. These representations include Multi-Plane Images (MPIs) [Srinivasan2019], Multi-Cylinder Images (MCIs) [Waidhofer2022], layered mesh representations [Broxton2020], and neural representations [Chibane2021], [Lin2023], [Gond2023], all of which rely on structured or unstructured 2D image captures of the remote scene.
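To illustrate how a layered representation produces a view, an MPI-style renderer composites its RGBA planes back to front with the standard "over" operator. The sketch below is a minimal illustration that assumes the planes have already been warped into the target view; the homography warping that real MPI renderers require is omitted.

```python
import numpy as np

def composite_mpi(layers):
    """Back-to-front 'over' compositing of MPI layers.
    layers: (D, H, W, 4) RGBA planes ordered far to near,
    already warped into the target view (warping omitted here)."""
    out = np.zeros(layers.shape[1:3] + (3,))
    for rgba in layers:
        rgb, alpha = rgba[..., :3], rgba[..., 3:4]
        out = alpha * rgb + (1 - alpha) * out  # near layer occludes far
    return out

# Toy example: an opaque green far plane behind a half-transparent red near plane
far = np.zeros((2, 2, 4));  far[..., 1] = 1.0;  far[..., 3] = 1.0
near = np.zeros((2, 2, 4)); near[..., 0] = 1.0; near[..., 3] = 0.5
image = composite_mpi(np.stack([far, near]))
```

Per-layer alpha is what lets these representations model soft edges and partial occlusions that a single textured mesh cannot.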

Another way of creating a realistic immersive experience is to combine different imaging modalities – i.e., omnidirectional content and light fields – in the form of spherical light fields (SLFs). SLFs then enable rendering and view synthesis that can generate more realistic and immersive content. There have been various attempts to create SLFs by collecting linear captures vertically [Krolla2014], capturing omnidirectional content of the scene with multiple cameras [Maugey2019], and moving a single camera along a circular trajectory and using deep neural networks to generate an image grid [Lo2023]. Nevertheless, these works either did not yield publicly available datasets or lacked precise camera localization. To address this, the Spherical Light Field Database (SLFDB) was introduced in previous work [Zerman2024], providing a foundational dataset for developing and testing realistic immersive telepresence applications.

4. Challenges and Limitations

Studies on creating realistic immersive telepresence environments have shown that certain challenges and limitations still need to be addressed to improve the QoE and IMEx of these systems. These challenges include dataset availability, compression of structured and unstructured LFs, new view synthesis and rendering, and QoE estimation. Most of them are also discussed in our recent study [Zerman2024].

Figure 2 – A set of captures highlighting the effects of dynamically changing scene: lighting change and its effect on white balance (top) and dynamic capture environment, where people appear and disappear (bottom).

Datasets relevant to realistic immersive telepresence tasks, such as the SLFDB [Zerman2024], are crucial for developing and validating immersive telepresence technologies. However, creating and using such datasets, with high spatial and angular resolution and very precise camera positioning, faces significant hurdles. Traditional camera-grid setups are ineffective for capturing spherical light fields due to occlusions. This necessitates static scenes and meticulous camera positioning for a consistent capture of the scene. A dynamic scene risks inconsistent views within the same light field, as shown in Figure 2, which is undesirable. Additionally, variations in lighting present significant challenges when capturing spherical light fields, as they affect the scene's dynamic range, white balance, and color grading, creating yet another difficulty in database creation. Brightness and color variations, such as sunlight's yellow tint compared to cloudy daylight, are not easy to correct and often require advanced algorithms for adjustment. Even static outdoor scenes therefore remain a challenge for future work: despite lacking movement, they still suffer from lighting-related issues. These challenges highlight the critical need for innovative approaches to spherical light field dataset generation and sharing to ensure future advancements in the field.
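As a first-order illustration of the white-balance problem, the classic gray-world correction rescales each color channel so that its mean matches the image's overall mean intensity. This is only a simple baseline, not one of the advanced algorithms the harder cases require, and the function name is hypothetical.

```python
import numpy as np

def gray_world_balance(img):
    """Gray-world white balance: rescale each channel so its mean
    equals the image's overall mean intensity.
    img: float array of shape (H, W, 3) with values in [0, 1]."""
    means = img.reshape(-1, 3).mean(axis=0)   # per-channel means
    gain = means.mean() / means               # per-channel gains
    return np.clip(img * gain, 0.0, 1.0)

# Toy capture with a slight blue deficit (as under yellow-tinted sunlight)
np.random.seed(0)
img = np.random.rand(16, 16, 3) * 0.5
img[..., 2] *= 0.8
balanced = gray_world_balance(img)
```

The gray-world assumption fails exactly in the situations described above, e.g., scenes dominated by one color or with mixed illumination across views, which is why consistent multi-view captures need more than per-image heuristics.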

LF compression is another challenge that requires attention when combining imaging modalities. The JPEG Pleno light field coding standard [ISO2021] is designed for two-dimensional, grid-like structured LFs (e.g., LFs captured with a microlens array or a structured camera grid) and does not work for linear or unstructured captures. The situation is the same for many other compression methods, as most require some form of structured representation. Considering how well scene regression and other new view synthesis algorithms adapt to unstructured inputs, the importance of advancing compression for unstructured LFs (e.g., the volume of light captured by cameras in various positions, or in-the-wild user captures) becomes clear. Furthermore, such an LF compression method needs to run in real time to support immersive telepresence applications while delivering a visual QoE good enough not to impede realism.
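A toy sketch of the inter-view redundancy that structured LF codecs exploit: with a known capture order and small baselines, each view can be predicted from its neighbor and only small residuals need to be coded, whereas unstructured captures offer no such natural prediction order. This is an illustrative example, not an actual codec, and the function names are hypothetical.

```python
import numpy as np

def interview_residuals(views):
    """Code the first view as-is and every later view as the
    difference from its predecessor (toy inter-view prediction)."""
    residuals = [views[0]]
    for prev, cur in zip(views, views[1:]):
        residuals.append(cur - prev)
    return residuals

def reconstruct(residuals):
    """Invert interview_residuals by cumulative summation."""
    views, acc = [], None
    for r in residuals:
        acc = r.copy() if acc is None else acc + r
        views.append(acc)
    return views

# Toy sequence of 4x4 views along a linear capture with small changes
views = [np.full((4, 4), 0.2) + 0.01 * i for i in range(5)]
residuals = interview_residuals(views)
decoded = reconstruct(residuals)
```

For the small-baseline sequence above, every residual after the first is tiny and cheap to entropy-code; shuffle the views (as in an in-the-wild unstructured capture) and this advantage disappears, which is the core difficulty the text points out.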

Figure 3 – Strong artifacts created at the extremes of view synthesis with a large baseline (i.e., 30 cm), where either the scene is warped (left – 360ViewSynth) or strong ghosting artifacts occur (right – PanoSynthVR).

Current new view synthesis methods are primarily designed for small baselines, typically just a few centimeters, and face significant challenges at the larger baselines required in telepresence applications. Ghosting artifacts and unrealistic distortions (e.g., nonlinear distortions, stretching) occur when interpolating views, particularly for larger baselines, as shown in Figure 3. A recent comparative evaluation of PanoSynthVR and 360ViewSynth [Zerman2024] reveals that while 360ViewSynth marginally outperforms PanoSynthVR on average quality metrics, the scores for both methods remain suboptimal. PanoSynthVR struggles with large baselines, exhibiting prominent layer-like ghosting artifacts due to limitations of its MCI structure. Although 360ViewSynth produces visually better results, closer inspection shows that it distorts object perspectives by stretching them rather than accurately rendering the scene, leading to an unnatural user experience. These findings underscore the limitations of current state-of-the-art view synthesis methods for SLFs and the complexity of handling larger baselines effectively.
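Comparative evaluations like the one above typically score a synthesized view against a held-out captured view using full-reference metrics. PSNR is the simplest such metric and is shown below only as an illustration; the cited evaluation is not limited to it.

```python
import numpy as np

def psnr(ref, syn, peak=1.0):
    """Peak signal-to-noise ratio (dB) between a ground-truth view
    and a synthesized view, both float arrays in [0, peak]."""
    mse = np.mean((ref.astype(np.float64) - syn) ** 2)
    if mse == 0:
        return float("inf")  # identical images
    return 10.0 * np.log10(peak ** 2 / mse)
```

One reason the reported scores can mislead is visible in Figure 3: stretching distortions can keep pixel-wise error moderate while still looking unnatural, so full-reference metrics need to be complemented by perceptual assessment.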

Assessing user satisfaction and immersion in telepresence systems is a multidimensional challenge, requiring assessment along the three strands described in the IMEx white paper: subjective assessment, behavioral assessment, and assessment via psycho-physiological methods [Perkis2020]. Quantitative metrics can capture interaction latency and task performance in a user study, while individual preferences and experiences can be collected qualitatively. Certain aspects of user experience, such as visual quality and engagement, can also be collected as quantitative data during user studies through self-reporting. Additionally, behavioral assessment (e.g., user movement, interaction patterns) can be used to identify different usage patterns. The limiting factor here is mainly the time and expense of running such user studies. Therefore, the challenge is to build a framework that models user experience for realistic immersive telepresence scenarios and thereby speeds up assessment.
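For the subjective-assessment strand, self-reported ratings are commonly summarized as a mean opinion score (MOS) with a confidence interval; a minimal sketch, assuming a normal approximation for the interval:

```python
import math

def mos_ci(scores, z=1.96):
    """Mean opinion score and approximate 95% confidence-interval
    half-width from per-subject ratings (normal approximation;
    small panels would normally use a t-distribution instead)."""
    n = len(scores)
    mean = sum(scores) / n
    var = sum((s - mean) ** 2 for s in scores) / (n - 1)  # sample variance
    return mean, z * math.sqrt(var / n)

# Toy panel of five ratings on a 1-5 scale
mos, ci = mos_ci([4, 5, 4, 3, 4])
```

The panel sizes needed to make such intervals tight are exactly the time and expense cost noted above, which motivates predictive QoE models as a faster alternative.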

Other limitations and aspects to consider include accessibility, privacy, and ethics. Regarding accessibility, it is important to ensure that immersive telepresence technologies are affordable and usable by diverse populations. The situation is improving as cameras and headsets become cheaper and easier to use (e.g., faster and more powerful on-device processing, removal of headset cables, hand-gesture interfaces). Nevertheless, hardware costs, connectivity requirements, and usability barriers must be addressed further to make these systems widely accessible. Regarding privacy and ethics, the realistic nature of immersive telepresence may raise concerns: capturing and transmitting live environments may involve sensitive data, necessitating robust privacy safeguards and ethical guidelines to prevent misuse. Privacy concerns around headsets that rely on visual cameras for localization and mapping must also be addressed.

5. Conclusions and Future Directions

Realistic immersive telepresence systems represent a transformative shift in how people interact with remote environments. By combining advanced imaging, rendering, and interaction technologies, these systems promise to revolutionize industries ranging from healthcare to entertainment. However, significant challenges remain, including data availability, compression, rendering, and QoE assessment. Addressing these obstacles will require collaboration across disciplines and industries.

To address these challenges, future research should focus on creating relevant spherical LF datasets that provide accurate camera positioning and address difficulties such as dynamic lighting conditions and occlusions. Developing real-time, robust compression methods for unstructured LFs that maintain visual quality and support immersive applications is another critical area. Developing advanced view synthesis algorithms capable of handling large baselines without introducing artifacts or distortions, and creating frameworks and methodologies for user experience and QoE assessment, remain open research questions.

Further into the future, the remaining challenges related to the realness and spatiality factors, as well as QoE estimation, may be addressed with learning-based algorithms; the level of interactivity and feeling of immersion can be increased by integrating additional senses into existing systems (e.g., spatial audio, haptics, natural interfaces); and further standardization can create common frameworks that ensure interoperability across different systems. Long-term goals include the integration of realistic immersive displays, such as LF displays or improved holographic displays, and the convergence of telepresence systems with emerging technologies like 5G/6G networks and edge computing, for which efforts are already underway [Mahmoud2023].

References

  • [Adelson1991] Adelson, E. H., & Bergen, J. R. (1991). The plenoptic function and the elements of early vision (Vol. 2). Cambridge, MA, USA: Vision and Modeling Group, Media Laboratory, Massachusetts Institute of Technology.
  • [Beck2013] Beck, S., Kunert, A., Kulik, A., & Froehlich, B. (2013). Immersive group-to-group telepresence. IEEE Transactions on Visualization and Computer Graphics, 19(4), 616-625.
  • [Boyce2021] Boyce, J. M., Doré, R., Dziembowski, A., Fleureau, J., Jung, J., Kroon, B., … & Yu, L. (2021). MPEG immersive video coding standard. Proceedings of the IEEE, 109(9), 1521-1536.
  • [Broxton2020] Broxton, M., Flynn, J., Overbeck, R., Erickson, D., Hedman, P., Duvall, M., … & Debevec, P. (2020). Immersive light field video with a layered mesh representation. ACM Transactions on Graphics (TOG), 39(4), 86-1.
  • [Chibane2021] Chibane, J., Bansal, A., Lazova, V., & Pons-Moll, G. (2021). Stereo radiance fields (SRF): Learning view synthesis for sparse views of novel scenes. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 7911-7920).
  • [Cortés2024] Cortés, C., Pérez, P., & García, N. (2023). Understanding latency and QoE in social XR. IEEE Consumer Electronics Magazine.
  • [Croci2020] Croci, S., Ozcinar, C., Zerman, E., Knorr, S., Cabrera, J., & Smolic, A. (2020). Visual attention-aware quality estimation framework for omnidirectional video using spherical Voronoi diagram. Quality and User Experience, 5, 1-17.
  • [Eisert2023] Eisert, P., Schreer, O., Feldmann, I., Hellge, C., & Hilsmann, A. (2023). Volumetric video– acquisition, interaction, streaming and rendering. In Immersive Video Technologies (pp. 289-326). Academic Press.
  • [Fechteler2013] Fechteler, P., Hilsmann, A., Eisert, P., Broeck, S. V., Stevens, C., Wall, J., … & Zahariadis, T. (2013, June). A framework for realistic 3D tele-immersion. In Proceedings of the 6th International Conference on Computer Vision/Computer Graphics Collaboration Techniques and Applications.
  • [Gond2023] Gond, M., Zerman, E., Knorr, S., & Sjöström, M. (2023, November). LFSphereNet: Real Time Spherical Light Field Reconstruction from a Single Omnidirectional Image. In Proceedings of the 20th ACM SIGGRAPH European Conference on Visual Media Production (pp. 1-10).
  • [Graziosi2020] Graziosi, D., Nakagami, O., Kuma, S., Zaghetto, A., Suzuki, T., & Tabatabai, A. (2020). An overview of ongoing point cloud compression standardization activities: Video-based (V-PCC) and geometry-based (G-PCC). APSIPA Transactions on Signal and Information Processing, 9, e13.
  • [IJsselsteijn2000] IJsselsteijn, W. A., De Ridder, H., Freeman, J., & Avons, S. E. (2000, June). Presence: concept, determinants, and measurement. In Human Vision and Electronic Imaging V (Vol. 3959, pp. 520-529). SPIE.
  • [ISO2021] ISO/IEC 21794-2:2021 (2021) Information technology – Plenoptic image coding system (JPEG Pleno) — Part 2: Light field coding.
  • [Jackson2023] Jackson, A. (2023, September) Meta Quest 3: Can businesses use VR day-to-day?, Technology Magazine. https://technologymagazine.com/digital-transformation/meta-quest-3-can-businesses-use-vr-day-to-day, Accessed: 2024-02-05.
  • [Kachach2021] Kachach, R., Orduna, M., Rodríguez, J., Pérez, P., Villegas, Á., Cabrera, J., & García, N. (2021, July). Immersive telepresence in remote education. In Proceedings of the International Workshop on Immersive Mixed and Virtual Environment Systems (MMVE’21) (pp. 21-24).
  • [Krolla2014] Krolla, B., Diebold, M., Goldlücke, B., & Stricker, D. (2014, September). Spherical Light Fields. In BMVC (No. 67.1–67.12).
  • [Lawrence2024] Lawrence, J., Overbeck, R., Prives, T., Fortes, T., Roth, N., & Newman, B. (2024). Project starline: A high-fidelity telepresence system. In ACM SIGGRAPH 2024 Emerging Technologies (pp. 1-2).
  • [Levoy1996] Levoy, M. & Hanrahan, P. (1996) Light field rendering, in Proceedings of the 23rd Annual Conference on Computer Graphics and Interactive Techniques (pp. 31-42), New York, NY, USA, Association for Computing Machinery.
  • [Lin2023] Lin, K. E., Lin, Y. C., Lai, W. S., Lin, T. Y., Shih, Y. C., & Ramamoorthi, R. (2023). Vision transformer for nerf-based view synthesis from a single input image. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (pp. 806-815).
  • [Lo2023] Lo, I. C., & Chen, H. H. (2023). Acquiring 360° Light Field by a Moving Dual-Fisheye Camera. IEEE Transactions on Image Processing.
  • [Mahmoud2023] Mahmood, A., Abedin, S. F., O’Nils, M., Bergman, M., & Gidlund, M. (2023). Remote-Timber: An outlook for teleoperated forestry with first 5G measurements. IEEE Industrial Electronics Magazine, 17(3), 42-53.
  • [Marvie2022] Marvie, J. E., Krivokuća, M., Guede, C., Ricard, J., Mocquard, O., & Tariolle, F. L. (2022, September). Compression of time-varying textured meshes using patch tiling and image-based tracking. In 2022 10th European Workshop on Visual Information Processing (EUVIP) (pp. 1-6). IEEE.
  • [Maugey2019] Maugey, T., Guillo, L., & Cam, C. L. (2019, June). FTV360: A multiview 360° video dataset with calibration parameters. In Proceedings of the 10th ACM Multimedia Systems Conference (pp. 291-295).
  • [Maugey2023] Maugey, T. (2023). Acquisition, representation, and rendering of omnidirectional videos. In Immersive Video Technologies (pp. 27-48). Academic Press.
  • [Minsky1980] Minsky, M. (1980). Telepresence. Omni, pp. 45-51.
  • [Overbeck2018] Overbeck, R. S., Erickson, D., Evangelakos, D., Pharr, M., & Debevec, P. (2018). A system for acquiring, processing, and rendering panoramic light field stills for virtual reality. ACM Transactions on Graphics (TOG), 37(6), 1-15.
  • [Perkis2020] Perkis, A., Timmerer, C., et al. (2020, May) “QUALINET White Paper on Definitions of Immersive Media Experience (IMEx)”, European Network on Quality of Experience in Multimedia Systems and Services, 14th QUALINET meeting (online), Online: https://arxiv.org/abs/2007.07032
  • [Schubert2001] Schubert, T., Friedmann, F., & Regenbrecht, H. (2001). The experience of presence: Factor analytic insights. Presence: Teleoperators & Virtual Environments, 10(3), 266-281.
  • [Srinivasan2019] Srinivasan, P. P., Tucker, R., Barron, J. T., Ramamoorthi, R., Ng, R., & Snavely, N. (2019). Pushing the boundaries of view extrapolation with multiplane images. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 175-184).
  • [Starline2025] Project Starline: Be there from anywhere with our breakthrough communication technology. (n.d.). Online: https://starline.google/. Accessed: 2025-01-14
  • [Stepanov2023] Stepanov, M., Valenzise, G., & Dufaux, F. (2023). Compression of light fields. In Immersive Video Technologies (pp. 201-226). Academic Press.
  • [Suhail2022] Suhail, M., Esteves, C., Sigal, L., & Makadia, A. (2022). Light field neural rendering. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 8269-8279).
  • [Takatalo2008] Takatalo, J., Nyman, G., & Laaksonen, L. (2008). Components of human experience in virtual environments. Computers in Human Behavior, 24(1), 1-15.
  • [Valenzise2022] Valenzise, G., Alain, M., Zerman, E., & Ozcinar, C. (Eds.). (2022). Immersive Video Technologies. Academic Press.
  • [Waidhofer2022] Waidhofer, J., Gadgil, R., Dickson, A., Zollmann, S., & Ventura, J. (2022, October). PanoSynthVR: Toward light-weight 360-degree view synthesis from a single panoramic input. In 2022 IEEE International Symposium on Mixed and Augmented Reality (ISMAR) (pp. 584-592). IEEE.
  • [Wisotzky2025] Wisotzky, E. L., Rosenthal, J. C., Meij, S., van den Dobbelsteen, J., Arens, P., Hilsmann, A., … & Schneider, A. (2025). Telepresence for surgical assistance and training using eXtended reality during and after pandemic periods. Journal of Telemedicine and Telecare, 31(1), 14-28.
  • [Yagi1999] Yagi, Y. (1999). Omnidirectional sensing and its applications. IEICE Transactions on Information and Systems, 82(3), 568-579.
  • [Zerman2024] Zerman, E., Gond, M., Takhtardeshir, S., Olsson, R., & Sjöström, M. (2024, June). A Spherical Light Field Database for Immersive Telecommunication and Telepresence Applications. In 2024 16th International Conference on Quality of Multimedia Experience (QoMEX) (pp. 200-206). IEEE.
  • [Zhi2020] Zhi, T., Lassner, C., Tung, T., Stoll, C., Narasimhan, S. G., & Vo, M. (2020). TexMesh: Reconstructing detailed human texture and geometry from RGB-D video. In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part X 16 (pp. 492-509). Springer International Publishing.