In this and the following Dataset Columns, we present a review of some of the notable events related to open datasets and benchmarking competitions in the field of multimedia in the years 2023 and 2024. This selection highlights the wide range of topics and datasets currently of interest to the community. Some of the events covered in this review include special sessions on open datasets and competitions featuring multimedia data. This year’s review follows similar efforts from the previous year (https://records.sigmm.org/records-issues/acm-sigmm-records-issue-1-2023/), highlighting the ongoing importance of open datasets and benchmarking competitions in advancing research and development in multimedia. This first column focuses on the last two editions of QoMEX, i.e., 2023 and 2024:
At the 2023 edition, these datasets were presented within the Datasets session, chaired by Professor Lea Skorin-Kapov. Given the scope of the conference (i.e., Quality of Multimedia Experience), these four papers present contributions on how adaptive 2D video streaming, holographic video codecs, omnidirectional video/audio environments, and multi-screen video affect user perception.
PNATS-UHD-1-Long: An Open Video Quality Dataset for Long Sequences for HTTP-based Adaptive Streaming QoE Assessment Ramachandra Rao, R. R., Borer, S., Lindero, D., Göring, S. and Raake, A.
A collaboration between Technische Universität Ilmenau (Germany), Ericsson Research (Sweden) and Rohde & Schwarz (Switzerland)
The presented dataset consists of three subjective databases targeting overall quality assessment of typical HTTP-based Adaptive Streaming sessions with degradations such as quality switching, initial loading delay, and stalling events, using audiovisual content ranging between 2 and 5 minutes. In addition, subject bias and consistency in quality assessment of such longer-duration audiovisual contents with multiple degradations are investigated using a subject behaviour model. As part of this paper, the overall test design, subjective test results, sources, encoded audiovisual contents, and a set of analysis plots are made publicly available for further research.
Open access dataset of holographic videos for codec analysis and machine learning applications Gilles, A., Gioia, P., Madali, N., El Rhammad, A., Morin, L.
A collaboration between IRT and INSA Rennes, France
This is reported as the first large-scale dataset containing 18 holographic videos computed with three different resolutions and pixel pitches. By providing the color and depth images corresponding to each hologram frame, the dataset can also be used in additional applications such as the validation of 3D scene geometry retrieval or deep learning-based hologram synthesis methods. Altogether, the dataset comprises 5400 pairs of RGB-D images and holograms, totaling more than 550 GB of data.
Saliency of Omnidirectional Videos with Different Audio Presentations: Analyses and Dataset Singla, A., Robotham, T., Bhattacharya, A., Menz, W., Habets, E. and Raake, A.
A collaboration between the Technische Universität Ilmenau and the International Audio Laboratories of Erlangen, both in Germany.
This dataset uses a between-subjects test design to collect users’ exploration data of 360-degree videos in a free-form viewing scenario using the Varjo XR-3 Head Mounted Display, in the presence of no audio, mono audio, and 4th-order Ambisonics audio. Saliency information was captured as head-saliency in terms of the center of a viewport at 50 Hz. For each item, subjects were asked to describe the scene in a short free-verbalization task. Moreover, cybersickness was assessed using the Simulator Sickness Questionnaire at the beginning and at the end of the test. The data is intended to enable training of visual and audiovisual saliency prediction models for interactive experiences.
A Subjective Dataset for Multi-Screen Video Streaming Applications Barman, N., Reznik Y. and Martini, M. G.
A collaboration between Brightcove (London, UK and Seattle, USA) and Kingston University London, UK.
This paper presents a new, open-source dataset consisting of subjective ratings for various encoded video sequences of different resolutions and bitrates (quality) when viewed on three devices of varying screen sizes: TV, tablet, and mobile. Along with the subjective scores, an evaluation of some of the most well-known and commonly used open-source objective quality metrics is also presented. It is observed that the performance of the metrics varies considerably across device types, with the recently standardized ITU-T P.1204.3 model, on average, outperforming its full-reference counterparts.
At the 2024 edition, these datasets were presented within the Datasets session, chaired by Dr. Mohsen Jenadeleh. Given the scope of the conference (i.e., Quality of Multimedia Experience), these five papers present contributions focused on the impact on user perception of HDR videos (UHD-1, 8K, and AV1), immersive 360° video, and light fields; the light field contribution received the conference’s Best Paper Award.
AVT-VQDB-UHD-1-HDR: An Open Video Quality Dataset for Quality Assessment of UHD-1 HDR Videos Ramachandra Rao, R. R., Herb, B., Helmi-Aurora, T., Ahmed, M. T, Raake, A.
A work from Technische Universität Ilmenau, Germany.
This dataset deals with the assessment of the perceived quality of HDR videos. Firstly, a subjective test with 4K/UHD-1 HDR videos using the ACR-HR (Absolute Category Rating – Hidden Reference) method was conducted. The test consisted of a total of 195 encoded videos from 5 source videos, all with a framerate of 60 fps. In this test, the 4K/UHD-1 HDR stimuli were encoded at four different resolutions, namely 720p, 1080p, 1440p, and 2160p, using bitrates ranging between 0.5 Mbps and 40 Mbps. The results of the subjective test have been analyzed to assess the impact of factors such as resolution, bitrate, video codec, and content on the perceived video quality.
AVT-VQDB-UHD-2-HDR: An open 8K HDR source dataset for video quality research Keller, D., Goebel, T., Sievenkees, V., Prenzel, J., Raake, A.
A work from Technische Universität Ilmenau, Germany.
The AVT-VQDB-UHD-2-HDR dataset consists of 31 8K HDR video sources of 15 s each, created with the goal of accurately representing real-life footage, while taking into account video coding and video quality testing challenges.
The effect of viewing distance and display peak luminance – HDR AV1 video streaming quality dataset Hammou, D., Krasula, L., Bampis, C., Li, Z., Mantiuk, R.
A collaboration between University of Cambridge (UK) and Netflix Inc. (USA).
The HDR-VDC dataset captures the quality degradation of HDR content due to AV1 coding artifacts and resolution reduction. The quality drop was measured at two viewing distances, corresponding to 60 and 120 pixels per visual degree, and two display mean luminance levels, 51 and 5.6 nits. It employs a highly sensitive pairwise comparison protocol with active sampling and comparisons across viewing distances to ensure the most accurate quality measurements possible. It is also the first publicly available dataset that measures the effect of display peak luminance and includes HDR videos encoded with AV1.
A Spherical Light Field Database for Immersive Telecommunication and Telepresence Applications (Best Paper Award) Zerman, E., Gond, M., Takhtardeshir, S., Olsson, R., Sjöström, M.
A work presented from Mid Sweden University, Sundsvall, Sweden.
The Spherical Light Field Database (SLFDB) consists of light fields of 60 views each, captured with an omnidirectional camera in 20 scenes. To show the usefulness of the proposed database, the authors provide two use cases: compression and viewpoint estimation. The initial results validate that the publicly available SLFDB will benefit the scientific community.
AVT-ECoClass-VR: An open-source audiovisual 360° video and immersive CGI multi-talker dataset to evaluate cognitive performance Fremerey, S., Breuer, C., Leist, L., Klatte, M., Fels, J., Raake, A.
A collaboration between Technische Universität Ilmenau, RWTH Aachen University and RPTU Kaiserslautern (Germany).
This dataset includes two audiovisual scenarios (360° video and computer-generated imagery) and two implementations for dataset playback. The 360° video part of the dataset features 200 video and single-channel audio recordings of 20 speakers reading ten stories, and 20 videos of speakers in silence, resulting in a total of 220 video and 200 audio recordings. The dataset also includes one 360° background image of a real primary school classroom scene, targeting young school children for subsequent subjective tests. The second part of the dataset comprises 20 different 3D models of the speakers and a computer-generated classroom scene, along with an immersive audiovisual virtual environment implementation that can be interacted with using an HTC Vive controller.
Service and network providers actively evaluate and derive Quality of Experience (QoE) metrics within their systems, which necessitates suitable monitoring strategies. Objective QoE monitoring involves mapping Quality of Service (QoS) parameters into QoE scores, such as calculating Mean Opinion Scores (MOS) or Good-or-Better (GoB) ratios, by using appropriate mapping functions. Alternatively, individual QoE monitoring directly assesses user experience based on self-reported feedback. We discuss the strengths, weaknesses, opportunities, and threats of both approaches. Based on the collected data from individual or objective QoE monitoring, providers can calculate the QoE metrics across all users in the system, who are subjected to a range of varying QoS conditions. The aggregated QoE across all users in the system for a dedicated time frame is referred to as system QoE. Based on a comprehensive simulation study, the expected system QoE, the system GoB ratio, as well as QoE fairness across all users are computed. Our numerical results explore whether objective and individual QoE monitoring lead to similar conclusions. In our previous work [Hoss2024], we provided a theoretical framework and the mathematical derivation of the corresponding relationships between QoS and system QoE for both monitoring approaches. Here, the focus is on illustrating the key differences of individual and objective QoE monitoring and the consequences in practice.
System QoE: Assessment of QoE of Users in a System
The term “System QoE” refers to the assessment of user experience from a provider’s perspective, focusing on the perceived quality of the users of a particular service. Providers may be different stakeholders along the service delivery chain, for example, network service providers (in particular, Internet service providers) or application service providers. QoE monitoring delivers the necessary information to evaluate the system QoE, which is the basis for appropriate actions to ensure high-quality services and high QoE, e.g., through resource and network management.
Typically, QoE monitoring and management involves evaluating how well the network and services perform by analyzing objective metrics like Quality of Service (QoS) parameters (e.g., latency, jitter, packet loss) and mapping them to QoE metrics, such as Mean Opinion Scores (MOS). In practice, QoE monitoring involves a series of steps that providers need to follow: 1) identify relevant QoE metrics of interest, like MOS or GoB ratio; 2) deploy a monitoring framework to collect and analyze data. We discuss these steps in the following.
The scope of system QoE metrics is to quantify the QoE across all users consuming the service over a dedicated time frame, e.g., one day, one week, or one month. Of interest are the expected QoE of an arbitrary user in the system, the ratio of all users experiencing Good-or-Better (GoB) quality or Poor-or-Worse (PoW) quality, as well as the QoE fairness across all users. The users in the system may achieve different QoS at the network level, e.g., different latency, jitter, or throughput, since resources are shared among the users. The same is also true at the application level with varying application-specific QoS parameters, for instance, video resolution, buffering time, or startup delays for video streaming. The varying QoS conditions then manifest in the system QoE. Fundamental relationships between the system QoE and QoS metrics were derived in [Hoss2020].
Expected system QoE: The expected system QoE is the average QoE rating of an arbitrary user in the system. The fundamental relationship in [Hoss2020] shows that the expected system QoE may be derived by mapping the QoS as experienced by a user to the corresponding MOS value and computing the average MOS over the varying QoS conditions. Thus, a MOS mapping function is required to map the QoS parameters to MOS values.
System GoB and System PoW: The Mean Opinion Score provides an average score but fails to account for the variability among users and the diversity of user ratings: users experiencing the same QoS conditions may rate them differently. Metrics like the percentage of users rating the experience as Good-or-Better or as Poor-or-Worse provide more granular insights. Such metrics help service providers understand not just the average quality, but how quality is distributed across the user base. The fundamental relationship in [Hoss2020] shows that the system GoB and PoW may be derived by mapping the QoS as experienced by a user to the corresponding GoB or PoW value and computing the average over the varying QoS conditions, respectively. Thus, a GoB or PoW mapping function is required.
QoE Fairness: Operators must not only ensure that users are sufficiently satisfied, but also that this is done in a fair manner. However, what is considered fair in the QoS domain may not necessarily translate to fairness in the QoE domain, motivating the use of a dedicated QoE fairness index. [Hoss2018] defines the QoE fairness index as a linear transformation of the standard deviation of MOS values to the range [0;1]. The observed standard deviation is normalized by the maximum standard deviation that is theoretically possible for MOS values in a finite range, typically between 1 (poor quality) and 5 (excellent quality). The difference between 1 (indicating perfect fairness) and the normalized standard deviation of MOS values (indicating the degree of unfairness) yields the fairness index.
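For orientation, the relationships sketched above can be written compactly as follows. This is only a shorthand restatement of the verbal definitions, with Q denoting the (random) QoS condition experienced by an arbitrary user, f the MOS mapping function, g the GoB mapping function, and MOS values on the scale [L, H] = [1, 5]; the exact formulations are given in [Hoss2020] and [Hoss2018].

```latex
E[\text{system QoE}] = \mathbb{E}_{Q}\big[f(Q)\big], \qquad
\text{GoB} = \mathbb{E}_{Q}\big[g(Q)\big], \qquad
F = 1 - \frac{\sigma\big(f(Q)\big)}{\sigma_{\max}}
\quad\text{with}\quad \sigma_{\max} = \frac{H - L}{2} = 2 .
```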
The fundamental relationships allow different implementations of QoE monitoring in practice, which are visualized in Figure 1 and discussed in the following. We differentiate between individual QoE monitoring and objective QoE monitoring and provide a qualitative strengths-weaknesses-opportunities-threats (SWOT) analysis.
Figure 1. QoE monitoring approaches to assess system QoE: individual and objective QoE monitoring.
Individual QoE Monitoring
Individual QoE monitoring refers to the assessment of system QoE by collecting individual ratings, e.g., on a 5-point rating scale, from users through their personal feedback. This approach captures the unique and individual nature of user experiences, accounting for factors like personal preferences and context. It allows optimizing services in a personalized manner, which is regarded as a challenging future research objective, see [Schmitt2017, Zhu2018, Gao2020, Yamazaki2021, Skorin-Kapov2018].
The term “individual QoE” was nicely described in [Zhu2018]: “QoE, by definition, is supposed to be subjective and individual. However, we use the term ‘individual QoE’, since the majority of the literature on QoE has not treated it as such. […] The challenge is that the set of individual factors upon which an individual’s QoE depends is not fixed; rather this (sub)set varies from one context to another, and it is this what justifies even more emphatically the individuality and uniqueness of a user’s experience – hence the term ‘individual QoE’.”
Strengths: Individual QoE monitoring provides valuable insights into how users personally experience a service, capturing the variability and uniqueness of individual perceptions that objective metrics often miss. A key strength is that it gathers direct feedback from a provider’s own users, ensuring a representative sample rather than relying on external or unrepresentative populations. Additionally, it does not require a predefined QoE model, allowing for flexibility in assessing user satisfaction. This approach enables service providers to directly derive various system QoE metrics.
Weaknesses: Individual QoE monitoring is mainly feasible for application service providers and requires additional monitoring efforts beyond the typical QoS tools already in place. Privacy concerns are significant, as collecting sensitive user data can raise issues with data protection and regulatory compliance, such as with GDPR. Additionally, users may use the system primarily as a complaint tool, focusing on reporting negative experiences, which could skew results. Feedback fatigue is another challenge, where users may become less willing to provide ongoing input over time, limiting the validity and reliability of the data collected.
Opportunities: Data from individual QoE monitoring can be utilized to enhance individual user QoE through better resource and service management. From a business perspective, offering a personalized QoE can set providers apart in competitive markets, and the collected data has monetization potential, supporting personalized marketing. Data from individual QoE monitoring also enables deriving objective metrics like MOS or GoB by correlating it with QoS parameters, which can be used to update existing QoE models or to develop new QoE models for novel services. Those insights can drive innovation, leading to new features or services that meet evolving customer needs.
Threats: Individual QoE monitoring accounts for factors outside the provider’s control, such as environmental context (e.g., noisy surroundings [Reichl2015, Jiménez2020]), which may affect user feedback but not reflect actual service performance. Additionally, as mentioned, it may be used as a complaint tool, with users disproportionately reporting negative experiences. There is also the risk of over-engineering solutions by focusing too much on minor individual issues, potentially diverting resources from addressing more significant, system-wide challenges that could have a broader impact on overall service quality.
Objective QoE Monitoring
Objective QoE monitoring involves assessing user experience by translating measurable QoS parameters at the network level, such as latency, jitter, and packet loss, and at the application level, such as video resolution or stalling duration for video streaming, into QoE metrics using predefined models and mapping functions. Unlike individual QoE monitoring, it does not require direct user feedback and instead relies on technically measurable parameters to estimate user satisfaction and various QoE metrics [Hoss2016]. Here, the fundamental relationships between system QoE and QoS [Hoss2020] are utilized. For computing the expected system QoE, a MOS mapping function is required, which maps a dedicated QoS value to a MOS value. For computing the system GoB, a GoB mapping function between QoS and GoB is required. Note that the QoS may be a vector of various QoS parameters, which are the input values for the mapping function.
Recent works [Hoss2022] indicated that industrial user experience index values, as obtained by the Threshold-Based Quality (TBQ) model for QoE monitoring, may be accurate enough to derive system QoE metrics. The TBQ model is a framework that defines application-specific thresholds for QoS parameters to assess and classify the user experience, which may be derived with simple and interpretable machine learning models like decision trees.
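As a rough illustration of the threshold-based idea, the following sketch classifies a web session into coarse experience classes from QoS values; the feature names and threshold values are hypothetical and are not taken from the TBQ model in [Hoss2022].

```python
def tbq_class(page_load_time_s: float, error_rate: float) -> str:
    """Toy threshold-based classifier mapping QoS values to an experience class.
    Thresholds are illustrative only; a real TBQ model derives them per
    application, e.g., from an interpretable decision tree."""
    if page_load_time_s <= 2.0 and error_rate <= 0.01:
        return "good"
    if page_load_time_s <= 5.0 and error_rate <= 0.05:
        return "fair"
    return "poor"

print(tbq_class(1.2, 0.0))   # -> "good"
print(tbq_class(6.5, 0.02))  # -> "poor"
```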
Strengths: Objective QoE monitoring relies solely on QoS monitoring, making it applicable for network providers, even for encrypted data streams, as long as appropriate QoE models are available, see for example [Juluri2015, Orsolic2020, Casas2022]. It can be easily integrated into existing QoS monitoring tools already deployed, reducing the need for additional resources or infrastructure. Moreover, it offers an objective assessment of user experience, ensuring that the same QoS conditions for different users are consistently mapped to the same QoE scores, as required for QoE fairness.
Weaknesses: Objective QoE monitoring requires specific QoE models and mapping functions for each desired QoE metric, which can be complex and resource-intensive to develop. Additionally, it has limited visibility into the full user experience, as it primarily relies on network-level metrics like bandwidth, latency, and jitter, which may not capture all factors influencing user satisfaction. Its effectiveness is also dependent on the accuracy of the monitored QoS metrics; inaccurate or incomplete data, such as from encrypted packets, can lead to misguided decisions and misrepresentation of the actual user experience.
Opportunities: Objective QoE monitoring enables user-centric resource and network management for application and network service providers by tracking QoS metrics, allowing for dynamic adjustments to optimize resource utilization and improve service delivery. The integration of AI and automation with QoS monitoring can increase the efficiency and accuracy of network management from a user-centric perspective. The objective QoE monitoring data can also enhance Service Level Agreements (SLAs) towards Experience Level Agreements (ELAs) as discussed in [Varela2015].
Threats: One risk of objective QoE monitoring is the potential for incorrect traffic flow characterization, where data flows may be misattributed to the wrong applications, leading to inaccurate QoE assessments. Additionally, rapid technological changes can quickly make existing QoS monitoring tools and QoE models outdated, necessitating constant upgrades and investment to keep pace with new technologies. These challenges can undermine the accuracy and effectiveness of objective QoE monitoring, potentially leading to misinformed decisions and increased operational costs.
Numerical Results: Visualizing the Differences
In this section, we explore and visualize the obtained system QoE metrics, which are based on collected data either through i) individual QoE monitoring or ii) objective QoE monitoring. The question arises whether the two monitoring approaches lead to the same results and conclusions for the provider. The obvious approach for computing the system QoE metrics is to use i) the individual ratings collected directly from the users and ii) the MOS scores obtained through mapping the objectively collected QoS parameters. While the discrepancies are derived mathematically in [Hoss2024], this article presents a visual representation of the differences between individual and objective QoE monitoring through a comprehensive simulation study. This simulation approach allows us to quantify the expected system QoE, the system GoB ratio, and the QoE fairness for a multitude of potential system configurations, which we manipulate in the simulation with varying QoS distributions. Furthermore, we demonstrate methods for utilizing data obtained through either individual QoE monitoring or objective QoE monitoring to accurately calculate the system QoE metrics as intended for a provider.
For the numerical results, the web QoE use case in [Hoss2024] is employed. We conduct a comprehensive simulation study in which the QoS settings are varied. To be more precise, the page load times (PLTs) are varied such that the users in the system experience a range of different loading times. For each simulation run, the average PLT and the standard deviation of the PLT across all users in the system are fixed. Each user is then assigned a random PLT sampled from a beta distribution in the range between 0 s and 8 s, parameterized with the specified average and standard deviation.
For a concrete PLT, the corresponding user rating distribution is available and, in our case, follows a shifted binomial distribution whose mean reflects the MOS value for that condition. To be precise, this binomial distribution is a conditional random variable with discrete values on a 5-point scale: the user ratings are conditioned on the actual QoS value. For individual QoE monitoring, the user ratings are sampled from that conditional random variable, while the QoS values are sampled from the beta distribution. For objective QoE monitoring, only the QoS values are used, but in addition, the MOS mapping function provided in [Hoss2024] is used. Thus, each QoS value is mapped to a continuous MOS value within the range of 1 to 5.
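The following sketch illustrates this simulation setup for a single run. It is a minimal reimplementation, not the code behind the figures, and the exponential MOS mapping used here is a hypothetical stand-in for the web QoE mapping function from [Hoss2024], which is not reproduced in this column.

```python
import numpy as np

RATE_MIN, RATE_MAX = 1, 5  # 5-point ACR scale

def mos_mapping(plt_s):
    """Hypothetical MOS mapping for page load time (seconds); illustrative only."""
    return RATE_MIN + (RATE_MAX - RATE_MIN) * np.exp(-0.6 * plt_s)

def sample_plt(n_users, mean, std, lo=0.0, hi=8.0, rng=None):
    """Sample per-user page load times from a beta distribution rescaled to
    [lo, hi] with the requested mean and standard deviation."""
    rng = rng or np.random.default_rng()
    m = (mean - lo) / (hi - lo)          # mean on [0, 1]
    v = (std / (hi - lo)) ** 2           # variance on [0, 1]
    common = m * (1 - m) / v - 1
    a, b = m * common, (1 - m) * common
    return lo + (hi - lo) * rng.beta(a, b, size=n_users)

def sample_ratings(plt, rng=None):
    """Individual ratings: shifted binomial on {1..5} whose mean equals the MOS
    of the user's QoS condition (conditional rating model)."""
    rng = rng or np.random.default_rng()
    p = (mos_mapping(plt) - RATE_MIN) / (RATE_MAX - RATE_MIN)
    return RATE_MIN + rng.binomial(RATE_MAX - RATE_MIN, p)

rng = np.random.default_rng(42)
plt = sample_plt(10_000, mean=2.0, std=1.0, rng=rng)   # one simulation run

# Individual QoE monitoring: average the sampled user ratings.
ratings = sample_ratings(plt, rng=rng)
print("expected system QoE (individual):", ratings.mean())

# Objective QoE monitoring: map each measured QoS value to MOS and average.
print("expected system QoE (objective):", mos_mapping(plt).mean())
```

Averaging the sampled ratings corresponds to individual QoE monitoring, while averaging the mapped MOS values corresponds to objective QoE monitoring; with a sufficiently large user population both averages coincide, which is the effect shown in Figure 2.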
Figure 2 shows the expected system QoE using individual QoE monitoring as well as objective QoE monitoring, depending on the average QoS and on the standard deviation of the QoS, which is indicated by the color. Each point in the figure represents a single simulation run with a fixed average QoS and a fixed standard deviation. It can be seen that both QoE monitoring approaches lead to the same results, which was also formally proven in [Hoss2024]. Note that higher QoS variances also result in a higher expected system QoE, since for the same average QoS there may be some users with larger QoS values, but also some users with lower QoS values. Due to the non-linear mapping between QoS and QoE, this results in higher QoE scores.
Figure 3 shows the system GoB ratio, which can be directly computed with individual QoE monitoring. However, in the case of objective QoE monitoring, we assume that only a MOS mapping function is available. It is tempting to derive the GoB ratio as the ratio of MOS values which are good or better; however, this leads to wrong results, see [Hoss2020]. Nevertheless, the GoB mapping function can be approximated from an existing MOS mapping function, see [Hoss2022, Hoss2017, Perez2023]. Then, the same conclusions are derived through objective QoE monitoring as for individual QoE monitoring.
Figure 4 now considers QoE fairness for both monitoring approaches. It is tempting to use the user rating values from individual QoE monitoring and apply the QoE fairness index. In that case, however, the fairness index captures the variance of the system QoS and additionally the variance due to user rating diversity, as shown in [Hoss2024]. This is not the intended application of the QoE fairness index, which aims to evaluate fairness objectively from a user-centric perspective, such that resource management can be adjusted to provide users with high and fairly distributed quality. Therefore, the QoE fairness index uses MOS values, such that users with the same QoS are assigned the same MOS value. In a system with deterministic QoS conditions, i.e., when the standard deviation of the QoS vanishes, the QoE fairness index is 100%, see the results for objective QoE monitoring. Nevertheless, individual QoE monitoring also allows computing MOS values for similar QoS values and then applying the QoE fairness index. Then, comparable results are obtained as for objective QoE monitoring.
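Continuing the simulation sketch above, the following lines show one possible way to obtain such per-condition MOS values from individual ratings, namely by binning users with similar QoS values; the bin width is an arbitrary choice for illustration, not a prescription from [Hoss2024].

```python
# Estimate a MOS per QoS condition by averaging the individual ratings of users
# that fall into the same (narrow) QoS bin, then apply the fairness index
# F = 1 - sigma / sigma_max, with sigma_max = 2 for a 1..5 scale.
bin_edges = np.linspace(0.0, 8.0, 33)              # 0.25 s wide PLT bins
bin_idx = np.digitize(plt, bin_edges)
bin_mos = {i: ratings[bin_idx == i].mean() for i in np.unique(bin_idx)}
mos_per_user = np.array([bin_mos[i] for i in bin_idx])

fairness_individual = 1.0 - mos_per_user.std() / 2.0
fairness_objective = 1.0 - mos_mapping(plt).std() / 2.0
print("QoE fairness (individual, binned MOS):", fairness_individual)
print("QoE fairness (objective, mapped MOS): ", fairness_objective)
```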
Figure 2. Expected system QoE when using individual and objective QoE monitoring. Both approaches lead to the same expected system QoE.
Figure 3. System GoB ratio: Deriving the ratio of MOS values which are good or better does not work for objective QoE monitoring. But an adjusted GoB computation, by approximating GoB through MOS, leads to the same conclusions as individual QoE monitoring, which simply measures the system GoB.
Figure 4. QoE Fairness: Using the user rating values obtained through individual QoE monitoring additionally includes the user rating diversity, which is not desired in network or resource management. However, individual QoE monitoring also allows computing MOS values for similar QoS values and then applying the QoE fairness index, which leads to comparable insights as objective QoE monitoring.
Conclusions
Individual QoE monitoring and objective QoE monitoring are fundamentally distinct approaches for assessing system QoE from a provider’s perspective. Individual QoE monitoring relies on direct user feedback to capture personalized experiences, while objective QoE monitoring uses QoS metrics and QoE models to estimate QoE metrics. Both methods have strengths and weaknesses, offering opportunities for service optimization and innovation while facing challenges such as over-engineering and the risk of models becoming outdated due to technological advancements, as summarized in our SWOT analysis. However, as the numerical results have shown, both approaches can be used with appropriate modifications and adjustments to derive various system QoE metrics like expected system QoE, system GoB and PoW ratio, as well as QoE fairness. A promising direction for future research is the development of hybrid approaches that combine both methods, allowing providers to benefit from objective monitoring while integrating the personalization of individual feedback. Such hybrid approaches could also be integrated into existing frameworks like the QoS/QoE Monitoring Engine proposal [Siokis2023] or into upcoming 6G networks, which may allow the radio access network (RAN) to autonomously adjust QoS metrics in collaboration with the application to enhance the overall QoE [Bertenyi2024].
[Siokis2023] Siokis, A., Ramantas, K., Margetis, G., Stamou, S., McCloskey, R., Tolan, M., & Verikoukis, C. V. (2023). 5GMediaHUB QoS/QoE monitoring engine. In 2023 IEEE 28th International Workshop on Computer Aided Modeling and Design of Communication Links and Networks (CAMAD) (pp. TBD). IEEE.
ACM SIGMM co-sponsored the second edition of the Spring School on Social XR, organized by the Distributed and Interactive Systems (DIS) group at CWI in Amsterdam. The event took place on March 4th – 8th 2024 and attracted 30 students from different disciplines (technology, social sciences, and humanities). The program included 22 lectures, 6 of them open, by 23 instructors. The event was organized by Irene Viola, Silvia Rossi, Thomas Röggla, and Pablo Cesar from CWI, and Omar Niamut from TNO. It was co-sponsored by the ACM Special Interest Group on Multimedia (ACM SIGMM), which made available student grants and supported international speakers from under-represented countries, and by The Netherlands Institute for Sound and Vision (https://www.beeldengeluid.nl/en).
Students and organisers of the Spring School on Social XR (March 4th – 8th 2024, Amsterdam)
“The future of media communication is immersive, and will empower sectors such as cultural heritage, education, manufacturing, and provide a climate-neutral alternative to travelling in the European Green Deal”. With such a vision in mind, the organizing committee continued for a second edition with a holistic program around the research topic of Social XR. The program included keynotes and workshops, where prominent scientists in the field shared their knowledge with students and triggered meaningful conversations and exchanges.
A poster session at the CWI DIS Spring School 2024.
The program included topics such as the capturing and modelling of realistic avatars and their behavior, coding and transmission techniques for volumetric video content, ethics for the design and development of responsible social XR experiences, novel rendering and interaction paradigms, and human factors and evaluation of experiences. Together, they provided a holistic perspective, helping participants to better understand the area and to initiate a network of collaboration to overcome the limitations of current real-time conferencing systems.
The spring school is part of the semester program organized by the DIS group of CWI. It was initiated in May 2022 with the Symposium on human-centered multimedia systems: a workshop and seminar to celebrate the inaugural lecture, “Human-Centered Multimedia: Making Remote Togetherness Possible” of Prof. Pablo Cesar. Then, it was continued in 2023 with the 1st Spring School on Social XR.
On April 10th, 2024, during the SIGMM Advisory Board meeting, the Strike Team Leaders, Touradj Ebrahimi, Arnold Smeulders, Miriam Redi and Xavier Alameda Pineda (represented by Marco Bertini) reported the results of their activity. They are summarized in the following in the form of recommendations that should be intended as guidelines and behavioral advice for our ongoing and future activity. SIGMM members in charge of SIGMM activities, SIGMM Conference leaders and particularly the organizers of the next ACMMM editions, are invited to adhere to these recommendations for their concerns, implement the items marked as mandatory and report to the SIGMM Advisory Board after the event.
All the SIGMM Strike Teams will remain in charge for two years starting January 1st, 2024 for reviews and updates.
The world is changing rapidly, and technology is driving these changes at an unprecedented pace. In this scenario, multimedia has become ubiquitous, providing new services to users, advanced modalities for information transmission, processing, and management, as well as innovative solutions for digital content understanding and production. The progress of Artificial Intelligence has fueled new opportunities and vitality in the field. New media formats, such as 3D, event data, and other sensory inputs, have become popular. Cutting-edge applications are constantly being developed and introduced.
SIGMM Strike Team on Industry Engagement
Team members: Touradj Ebrahimi (EPFL), Ali Begen (Ozyegin Univ.), Balu Adsumilli (Google), Yong Rui (Lenovo) and ChangSheng Xu (Chinese Academy of Sciences). Coordinator: Touradj Ebrahimi
The team provided recommendations for both the ACMMM organizers and the SIGMM Advisory Board. The recommendations addressed improving the presence of industry at ACMMM and other SIGMM conferences/workshops, launching new in-cooperation initiatives, and establishing stable bi-directional links.
Organization of industry-focused events
Suggested / Mandatory for ACMMM Organizers and SIGMM AB: Create industry-focused promotional materials like pamphlets/brochures for industry participation (sponsorship, exhibit, etc.) in the style of ICASSP 2024 and ICIP 2024
Suggested for ACMMM Organizers: invite keynote speakers from industry, possibly with financial support from SIGMM. Keynote talks should be similar to plenary talks but centered around specific application challenges.
Suggested for ACMMM Organizers: organize Special Sessions and Workshops around specific applications of interest to companies and startups. Sessions should be coordinated by industry with possible support from an experienced and established scholar.
Suggested for ACMMM Organizers: organize Hands-on Sessions led by industry to receive feedback on future products and services.
Suggested for ACMMM Organizers: organize Panel Sessions led by industry and standardization committees on timely topics relevant to industry, e.g., how companies cope with AI.
Suggested for ACMMM Organizers: organize Tutorial sessions given by qualified people from industry and standardization committees at SIGMM-sponsored conferences/workshops
Suggested for ACMMM Organizers: promote contributions mainly from industry in the form of Industry Sessions to present companies and their products and services.
Suggested for ACMMM Organizers and SIGMM AB: promote joint SIGMM / standardization workshops on the latest standards, e.g., JPEG meets SIGMM, MPEG meets SIGMM, AOM meets SIGMM.
Suggested for ACMMM Organizers: organize Job Fairs like job interview speed dating during ACMMM
Initiatives for linkage
Mandatory for SIGMM Organizers and SIGMM AB: Create and maintain a mailing list of industrial targets, taking care of GDPR (Include a question in the registration form of SIGMM-sponsored conferences)
Suggested for SIGMM AB: organize monthly talks by industry leaders, either from large established companies, SMEs, or startups, sharing the technical/scientific challenges they face and their solutions.
Initiatives around reproducible results and benchmarking
Suggested for ACMMM Organizers and SIGMM AB: support the release of databases and studies on performance assessment procedures and metrics, possibly focused on specific applications.
Suggested for ACMMM Organizers: organize Grand Challenges initiated and sponsored by industry.
Strike Team on ACMMM Format
Team Members: Arnold Smeulders (Univ. of Amsterdam), Alan Smeaton (Dublin City University), Tat Seng Chua (National University of Singapore), Ralf Steinmetz (Univ. Darmstadt), Changwen Chen (Hong Kong Polytechnic Univ.), Nicu Sebe (Univ. of Trento), Marcel Worring (Univ. of Amsterdam), Jianfei Cai (Monash Univ.), Cathal Gurrin (Dublin City Univ.). Coordinator: Arnold Smeulders
The team provided recommendations for both ACMMM organizers and SIGMM Advisory Board. The recommendations addressed distinct items related to Conference identity, Conference budget and Conference memory.
1. Intended audience. It is generally felt that ACMMM is under pressure from neighboring conferences growing very big. There is consensus that growing big should not be the purpose of ACMMM: an attendance of 750 – 1,500 was thought to be ideal, including being attractive to industry. Growth should come naturally.
Suggested for ACMMM Organizers and SIGMM AB: Promote distant travel by lowering fees for those who travel far.
Suggested for ACMMM Organizers: Include (a personalized) visa invitation in the call for papers.
2. Community feel, differentiation and interdisciplinarity. Identity is not an actionable concern, but one of the shared common goods is T-shaped individuals interested in neighboring disciplines making an interdisciplinary or multidisciplinary connection. It is desirable to differentiate submitted papers from major close conferences like CVPR. This point is already implemented in the call for papers of ACMMM 2024.
Mandatory for ACMMM Organizers: Ask in the submission how the paper fits in the multimedia community and its scientific tradition as illustrated by citations. Consider this information in the explicit review criteria.
Recommended for ACMMM Organizers: Support the physical presence of participants by rebalancing fees.
Suggested for ACMMM Organizers and SIGMM AB: Organize a session around the SIGMM test of time award, make selection early, funded by SIGMM.
Suggested for ACMMM Organizers: Organize moderated discussion sessions for papers on the same theme.
3. Brave New Ideas. Brave New Ideas fits very well with the intended audience. It is essential that we are able to draw out brave and new ideas from our community for long-term growth and vibrancy. The emphasis in reviewing Brave New Ideas should be on novelty, even if the work is not perfect. Rotate reviewers over a pool of people to prevent lock-in.
Suggested / Mandatory for ACMMM Organizers: Include in the submission a 3-minute pitch video to archive in the ACM digital library.
Suggested / Mandatory for ACMMM Organizers: Select reviewers from a pool of senior people to review novelty.
Suggested for ACMMM Organizers: Start with one session of 4 papers, if successful, add another session later.
4. Application. There should be no exclusive support for one specific application area in the main conference. Instead, application areas should be the focus of special sessions or workshops.
Suggested for ACMMM Organizers: Focus on application-related workshops or special sessions with own reviewing.
5. Presentation. As the core business of ACMMM is inter- and multi-disciplinarity, it is natural to make the presentation for a broader audience part of the selection. ACM should make the short videos accessible as a service to the scientific community or the general public. TED-like videos for a paper fit naturally with ACMMM and with the trend on YouTube of communicating one’s paper. If reviewing the videos is too much work for the organizers, the SIGMM AB should support it financially.
Mandatory for ACMMM Organizers: Include a TED-like 3-minute pitch video as part of the submission, to be archived by the ACM Digital Library as part of the conference proceedings; the video is to be submitted a week after the paper deadline, so there is time to prepare it after the regular paper submission.
6. Promote open-access. For a data-driven and fair comparison promote open access of data to be used in the next conference to compare to.
Suggested for SIGMM AB: Open access for data encouraged.
7. Keynotes. For the intended audience and interdisciplinarity, it is felt essential to have keynotes on the key topics of the moment. Keynotes should not focus on one topic but maintain the diversity of topics within the conference and over the years, to make sure new ideas are inserted into the community.
Suggested for SIGMM AB: directly fund a big-name, marquee keynote speaker, sponsored by SIGMM, for one of the societally urgent topics as evident from the news.
8. Diversity over subdisciplines. Make an extra effort for Arts, GenAI use models, security, HCI, and demos. We need to ensure that, if the submitted papers are of sufficiently high quality, there is at least a session on that sub-topic in the conference. We also need to ensure that the conference is not overwhelmed by a popular topic with easy review criteria and generally much higher review scores.
Suggested for ACMMM Organizers: Promote diversity of all relevant topics in the call for papers and by action in subcommunities by an ambassador. SIGMM will supervise the diversity.
9. Living report. To enhance the institutional memory, maintain a living document with suggestions, passed on from organizer to organizer. The owner of the document is the SIGMM commissioner for conferences.
Mandatory for ACMMM Organizers and SIGMM AB: A short report to the SIGMM commissioner for conferences from the ACMMM chair, including a few recommendations for the next time; handed over to the next conference after the end of the current conference.
SIGMM Strike Team on Harmonization and Spread
Team members: Miriam Redi (Wikimedia Foundation), Silvia Rossi (CWI), Irene Viola (CWI), Mylene Farias (Texas State Univ. and Univ. Brasilia), Ichiro Ide (Nagoya Univ.), Pablo Cesar (CWI and TU Delft). Coordinator: Miriam Redi
The team provided recommendations for both the ACMMM organizers and the SIGMM Advisory Board. The recommendations addressed giving the SIGMM Records and social media a more central role in SIGMM, and integrating the SIGMM Records and social media into the whole process of the ACMMM organization from its initial planning.
1. SIGMM Website. The SIGMM website is outdated and needs a serious overhaul.
Mandatory for SIGMM AB: rebuild the website from scratch, taking inspiration from other SIGs, e.g., by reaching out to people at CHI to understand what can be done. The budget should be provided by SIGMM.
2. SIGMM Social Media Channels. The SIGMM social media accounts (Twitter and LinkedIn) are managed by the Social Media Team at the SIGMM Records.
Suggested for SIGMM AB: continue this organization, expanding the responsibilities of the team to include conferences and other events.
3. Conference Social Media. The social media presence of conferences is managed by the individual conferences. It is not uniform and is disconnected from the SIGMM social media and the Records. The social media presence of the ACMMM flagship conference is weak and needs help. Creating continuity in terms of strategy and processes across conference editions is key.
Mandatory for ACMMM Organizers and SIGMM AB: create a Handbook of conference communications: a set of guidelines about how to create continuity across conference editions in terms of communications, and how to connect the SIGMM Records to the rest of the community.
Suggested for ACMMM Organizers and SIGMM AB: one member of the Social Media team at the SIGMM Records is systematically invited to join the OC of major conferences as publicity co-chair. The steering committee chair of each conference should commit to keeping the organizers of each conference edition informed about this policy, and monitor its implementation throughout the years.
SIGMM Strike Team on Open Review
Team members: Xavier Alameda Pineda (Univ. Grenoble-Alpes), Marco Bertini (Univ. Firenze). Coordinator: Xavier Alameda Pineda
The team continued to support the ACMMM conference organizers in the use of Open Review in the ACMMM reviewing process, helping to implement new functions or improve existing ones and supporting a smooth transfer of best practices. The recommendations addressed distinct items to complete the migration and stabilize the use of Open Review in future ACMMM editions.
1. Technical development and support
Mandatory for the Team: update and publish the scripts; complete the Open Review configuration.
Mandatory for SIGMM AB and ACMMM organizers: create a Committee led by the TPC chairs of the current ACMMM edition on a rotating basis.
2. Communication
Mandatory for the Team: write a small manual for use and include it in the future ACMMM Handbook.
The last plenary meeting of the Video Quality Experts Group (VQEG) was held online, hosted by the University of Konstanz (Germany), from December 18th to 21st, 2023. It offered more than 100 registered participants from 19 different countries the possibility to attend the numerous presentations and discussions about topics related to the ongoing projects within VQEG. All the related information, minutes, and files from the meeting are available online on the VQEG meeting website, and video recordings of the meeting will soon be available on YouTube.
All the topics mentioned below can be of interest for the SIGMM community working on quality assessment, but special attention can be devoted to the current activities on improvements of the statistical analysis of subjective experiments and objective metrics and on the development of a test plan to evaluate the QoE of immersive interactive communication systems in collaboration with ITU.
Readers of these columns interested in the ongoing projects of VQEG are encouraged to subscribe to VQEG’s email reflectors to follow the ongoing activities and to get involved with them.
As already announced on the VQEG website, the next VQEG plenary meeting will be hosted by Universität Klagenfurt in Austria from July 1st to 5th, 2024.
Group picture of the online meeting
Overview of VQEG Projects
Audiovisual HD (AVHD)
The AVHD group works on developing and validating subjective and objective methods to analyze commonly available video systems. During the meeting, there were various sessions in which presentations related to these topics were discussed.
Firstly, Ali Ak (Nantes Université, France), provided an analysis of the relation between acceptance/annoyance and visual quality in a recently collected dataset of several User Generated Content (UGC) videos. Then, Syed Uddin (AGH University of Krakow, Poland) presented a video quality assessment method based on the quantization parameter of MPEG encoders (MPEG-4, MPEG-AVC, and MPEG-HEVC) leveraging VMAF. In addition, Sang Heon Le (LG Electronics, Korea) presented a technique for pre-enhancement for video compression and applicable subjective quality metrics. Another talk was given by Alexander Raake (TU Ilmenau, Germany), who presented AVQBits, a versatile no-reference bitstream-based video quality model (based on the standardized ITU-T P.1204.3 model) that can be applied in several contexts such as video service monitoring, evaluation of video encoding quality, of gaming video QoE, and even of omnidirectional video quality. Also, Jingwen Zhu (Nantes Université, France) and Hadi Amirpour (University of Klagenfurt, Austria) described a study on the evaluation of the effectiveness of different video quality metrics in predicting the Satisfied User Ratio (SUR) in order to enhance the VMAF proxy to better capture content-specific characteristics. Andreas Pastor (Nantes Université, France) presented a method to predict the distortion perceived locally by human eyes in AV1-encoded videos using deep features, which can be easily integrated into video codecs as a pre-processing step before starting encoding.
In relation to standardization efforts, Mathias Wien (RWTH Aachen University, Germany) gave an overview of recent expert viewing tests that have been conducted within MPEG AG5 at the 143rd and 144th MPEG meetings. Also, Kamil Koniuch (AGH University of Krakow, Poland) presented a proposal to update the Survival Game task defined in the ITU-T Recommendation P.1301 on subjective quality evaluation of audio and audiovisual multiparty telemeetings, in order to improve its implementation and its application to recent efforts such as the evaluation of immersive communication systems within the ITU-T P.IXC work item (see the paragraph related to the Immersive Media Group).
Quality Assessment for Health applications (QAH)
The QAH group is focused on the quality assessment of health applications. It addresses subjective evaluation, generation of datasets, development of objective metrics, and task-based approaches. Recently, the group has been working towards an ITU-T recommendation for the assessment of medical contents. On this topic, Meriem Outtas (INSA Rennes, France) led a discussion dealing with the editing of a draft of this recommendation. In addition, Lumi Xia (INSA Rennes, France) presented a study of task-based medical image quality assessment focusing on a use case of adrenal lesions.
Statistical Analysis Methods (SAM)
The SAM group investigates analysis methods both for the results of subjective experiments and for objective quality models and metrics. This was one of the most active groups in this meeting, with several presentations on related topics.
Lukas Krasula (Netflix, USA) introduced e2nest, a web-based platform to conduct media-centric (video, audio, and images) subjective tests. Also, Dietmar Saupe (University of Konstanz, Germany) and Simon Del Pin (NTNU, Norway) showed the results of a study analyzing national differences in image quality assessment, showing significant differences in various areas. Alexander Raake (TU Ilmenau, Germany) presented a study on the remote testing of high-resolution images and videos, using AVrate Voyager, which is a publicly accessible framework for online tests. Finally, Dominik Keller (TU Ilmenau, Germany) presented a recent study exploring the impact of 8K (UHD-2) resolution on HDR video quality, considering different viewing distances. The results showed that the enhanced video quality of 8K HDR over 4K HDR diminishes with increasing viewing distance.
No Reference Metrics (NORM)
The NORM group addresses a collaborative effort to develop no-reference metrics for monitoring visual service quality. At this meeting, Ioannis Katsavounidis (Meta, USA) led a discussion on the current efforts to improve image and video complexity metrics. In addition, Krishna Srikar Durbha (University of Texas at Austin, USA) presented a technique to tackle the problem of bitrate ladder construction based on multiple Visual Information Fidelity (VIF) feature sets extracted from different scales and subbands of a video.
Emerging Technologies Group (ETG)
The ETG group focuses on various aspects of multimedia that, although they are not necessarily directly related to “video quality”, can indirectly impact the work carried out within VQEG and are not addressed by any of the existing VQEG groups. In particular, this group aims to provide a common platform for people to gather together and discuss new emerging topics, possible collaborations in the form of joint survey papers, funding proposals, etc.
In this meeting, Nabajeet Barman and Saman Zadtootaghaj (Sony Interactive Entertainment, Germany) suggested a topic to start discussing within VQEG: Quality Assessment of AI-Generated/Modified Content. The goal is to have subsequent discussions on this topic within the group and to write a position paper or whitepaper.
Immersive Media Group (IMG)
The IMG group is performing research on the quality assessment of immersive media technologies. Currently, the main joint activity of the group is the development of a test plan to evaluate the QoE of immersive interactive communication systems, which is carried out in collaboration with ITU-T through the work item P.IXC. In this meeting, Pablo Pérez (Nokia XR Lab, Spain), Jesús Gutiérrez (Universidad Politécnica de Madrid, Spain), Kamil Koniuch (AGH University of Krakow, Poland), Ashutosh Singla (CWI, The Netherlands) and other researchers involved in the test plan provided an update on its status, focusing on the description of the four interactive tasks to be performed in the test, the considered measures, and the 13 different experiments that will be carried out in the labs involved in the test plan. Also, in relation to this test plan, Felix Immohr (TU Ilmenau, Germany) presented a study on the impact of spatial audio on social presence and user behavior in multi-modal VR communications.
Diagram of the methodology of the joint IMG test plan
Quality Assessment for Computer Vision Applications (QACoViA)
5G Key Performance Indicators (5GKPI)
The 5GKPI group studies the relationship between key performance indicators of new 5G networks and the QoE of video services on top of them. At the meeting, Pablo Pérez (Nokia XR Lab, Spain) led an open discussion on the future activities of the group towards 6G, including a brief presentation of QoS/QoE management in 3GPP and potential opportunities to influence QoE in 6G.
The 146th MPEG meeting was held in Rennes, France from 22-26 April 2024, and the official press release can be found here. It comprises the following highlights:
AI-based Point Cloud Coding*: Call for proposals focusing on AI-driven point cloud encoding for applications such as immersive experiences and autonomous driving.
Object Wave Compression*: Call for interest in object wave compression for enhancing computer holography transmission.
Open Font Format: Committee Draft of the fifth edition, overcoming previous limitations like the 64K glyph encoding constraint.
Scene Description: Ratified second edition, integrating immersive media objects and extending support for various data types.
MPEG Immersive Video (MIV): New features in the second edition, enhancing the compression of immersive video content.
Video Coding Standards: New editions of AVC, HEVC, and Video CICP, incorporating additional SEI messages and extended multiview profiles.
Machine-Optimized Video Compression*: Advancement in optimizing video encoders for machine analysis.
Video-based Dynamic Mesh Coding (V-DMC)*: Committee Draft status for efficiently storing and transmitting dynamic 3D content.
LiDAR Coding*: Enhanced efficiency and responsiveness in LiDAR data processing with the new standard reaching Committee Draft status.
* … covered in this column.
AI-based Point Cloud Coding
MPEG issued a Call for Proposals (CfP) on AI-based point cloud coding technologies as a result of ongoing explorations regarding use cases, requirements, and the capabilities of AI-driven point cloud encoding, particularly for dynamic point clouds.
With recent significant progress in AI-based point cloud compression technologies, MPEG is keen on studying and adopting AI methodologies. MPEG is specifically looking for learning-based codecs capable of handling a broad spectrum of dynamic point clouds, which are crucial for applications ranging from immersive experiences to autonomous driving and navigation. As the field evolves rapidly, MPEG expects to receive multiple innovative proposals. These may include a unified codec, capable of addressing multiple types of point clouds, or specialized codecs tailored to meet specific requirements, contingent upon demonstrating clear advantages. MPEG has therefore publicly called for submissions of AI-based point cloud codecs, aimed at deepening the understanding of the various options available and their respective impacts. Submissions that meet the requirements outlined in the call will be invited to provide source code for further analysis, potentially laying the groundwork for a new standard in AI-based point cloud coding. MPEG welcomes all relevant contributions and looks forward to evaluating the responses.
Research aspects: In-depth analysis of algorithms, techniques, and methodologies, including a comparative study of various AI-driven point cloud compression techniques to identify the most effective approaches. Other aspects include creating or improving learning-based codecs that can handle dynamic point clouds as well as metrics for evaluating the performance of these codecs in terms of compression efficiency, reconstruction quality, computational complexity, and scalability. Finally, the assessment of how improved point cloud compression can enhance user experiences would be worthwhile to consider here also.
Object Wave Compression
A Call for Interest (CfI) in object wave compression has been issued by MPEG. Computer holography, a 3D display technology, utilizes a digital fringe pattern called a computer-generated hologram (CGH) to reconstruct 3D images from input 3D models. Holographic near-eye displays (HNEDs) reduce the need for extensive pixel counts due to their wearable design, positioning the display near the eye. This positions HNEDs as frontrunners for the early commercialization of computer holography, with significant research underway for product development. Innovative approaches facilitate the transmission of object wave data, crucial for CGH calculations, over networks. Object wave transmission offers several advantages, including independent treatment from playback device optics, lower computational complexity, and compatibility with video coding technology. These advancements open doors for diverse applications, ranging from entertainment experiences to real-time two-way spatial transmissions, revolutionizing fields such as remote surgery and virtual collaboration. As MPEG explores object wave compression for computer holography transmission, the Call for Interest seeks contributions to address market needs in this field.
Research aspects: Apart from compression efficiency, lower computation complexity, and compatibility with video coding technology, there is a range of research aspects, including the design, implementation, and evaluation of coding algorithms within the scope of this CfI. The QoE of computer-generated holograms (CGHs) together with holographic near-eye displays (HNEDs) is yet another dimension to be explored.
Machine-Optimized Video Compression
MPEG started working on a technical report regarding the “Optimization of Encoders and Receiving Systems for Machine Analysis of Coded Video Content”. In recent years, the efficacy of machine learning-based algorithms in video content analysis has steadily improved. However, an encoder designed for human consumption does not always produce compressed video conducive to effective machine analysis. This challenge lies not in the compression standard but in optimizing the encoder or receiving system. The forthcoming technical report addresses this gap by showcasing technologies and methods that optimize encoders or receiving systems to enhance machine analysis performance.
Research aspects: Video (and audio) coding for machines has recently been addressed by the MPEG Video and Audio working groups, respectively. The Joint Video Experts Team (JVET) of MPEG and ITU-T SG16 now joins this space with a technical report, but the research aspects remain unchanged, i.e., coding efficiency, metrics, and quality aspects for machine analysis of compressed/coded video content.
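The following sketch illustrates the kind of experiment such work targets: sweep encoder settings (here, generic x264 CRF values via ffmpeg) and score each decoded bitstream with a downstream machine task. The command line is a generic encode, and run_detector() is a placeholder for whatever analysis model is being optimized for; none of this is taken from the MPEG report itself.

```python
import subprocess

def encode(src: str, crf: int) -> str:
    """Encode the source clip at a given CRF and return the output path."""
    out = f"encoded_crf{crf}.mp4"
    subprocess.run(
        ["ffmpeg", "-y", "-i", src, "-c:v", "libx264", "-crf", str(crf), out],
        check=True)
    return out

def run_detector(video_path: str) -> float:
    """Placeholder: return e.g. the mAP of an object detector on decoded frames."""
    return 0.0

if __name__ == "__main__":
    # Trade bitrate against machine-analysis accuracy across encoder settings
    for crf in (23, 28, 33, 38):
        clip = encode("source.mp4", crf)
        print(crf, run_detector(clip))
```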
MPEG-I Immersive Audio
MPEG Audio Coding is entering the “immersive space” with MPEG-I immersive audio and its corresponding reference software. The MPEG-I immersive audio standard sets a new benchmark for compact and lifelike audio representation in virtual and physical spaces, catering to Virtual, Augmented, and Mixed Reality (VR/AR/MR) applications. By enabling high-quality, real-time interactive rendering of audio content with six degrees of freedom (6DoF), users can experience immersion, freely exploring 3D environments while enjoying dynamic audio. Designed in accordance with MPEG’s rigorous standards, MPEG-I immersive audio ensures efficient distribution across bandwidth-constrained networks without compromising on quality. Unlike proprietary frameworks, this standard prioritizes interoperability, stability, and versatility, supporting both streaming and downloadable content while seamlessly integrating with MPEG-H 3D audio compression. MPEG-I’s comprehensive modeling of real-world acoustic effects, including sound source properties and environmental characteristics, guarantees an authentic auditory experience. Moreover, its efficient rendering algorithms balance computational complexity with accuracy, empowering users to finely tune scene characteristics for desired outcomes.
Research aspects: Evaluating QoE of MPEG-I immersive audio-enabled environments as well as the efficient audio distribution across bandwidth-constrained networks without compromising on audio quality are two important research aspects to be addressed by the research community.
Video-based Dynamic Mesh Coding (V-DMC)
Video-based Dynamic Mesh Compression (V-DMC) represents a significant advancement in 3D content compression, catering to the ever-increasing complexity of dynamic meshes used across various applications, including real-time communications, storage, free-viewpoint video, augmented reality (AR), and virtual reality (VR). The standard addresses the challenges associated with dynamic meshes that exhibit time-varying connectivity and attribute maps, which were not sufficiently supported by previous standards. Video-based Dynamic Mesh Compression promises to revolutionize how dynamic 3D content is stored and transmitted, allowing more efficient and realistic interactions with 3D content globally.
Research aspects: V-DMC aims to allow “more efficient and realistic interactions with 3D content”, which are subject to research, i.e., compression efficiency vs. QoE in constrained networked environments.
Low Latency, Low Complexity LiDAR Coding
Low Latency, Low Complexity LiDAR Coding underscores MPEG’s commitment to advancing coding technologies required by modern LiDAR applications across diverse sectors. The new standard addresses critical needs in the processing and compression of LiDAR-acquired point clouds, which are integral to applications ranging from automated driving to smart city management. It provides an optimized solution for scenarios requiring high efficiency in both compression and real-time delivery, responding to the increasingly complex demands of LiDAR data handling. LiDAR technology has become essential for various applications that require detailed environmental scanning, from autonomous vehicles navigating roads to robots mapping indoor spaces. The Low Latency, Low Complexity LiDAR Coding standard will facilitate a new level of efficiency and responsiveness in LiDAR data processing, which is critical for the real-time decision-making capabilities needed in these applications. This standard builds on comprehensive analysis and industry feedback to address specific challenges such as noise reduction, temporal data redundancy, and the need for region-based quality of compression. The standard also emphasizes the importance of low latency coding to support real-time applications, essential for operational safety and efficiency in dynamic environments.
Research aspects: This standard effectively tackles the challenge of balancing high compression efficiency with real-time capabilities, addressing these often conflicting goals. Researchers may carefully consider these aspects and make meaningful contributions.
The 147th MPEG meeting will be held in Sapporo, Japan, from July 15-19, 2024. Click here for more information about MPEG meetings and their developments.
JPEG Trust reaches Draft International Standard stage
The 102nd JPEG meeting was held in San Francisco, California, USA, from 22 to 26 January 2024. At this meeting, JPEG Trust became a Draft International Standard. Moreover, the responses to the Call for Proposals of JPEG NFT were received and analysed. As a consequence, relevant steps were taken towards the definition of standardized tools for certification of the provenance and authenticity of media content, at a time when tools for effective media manipulation are becoming widely available to the general public. The 102nd JPEG meeting concluded with the JPEG Emerging Technologies Workshop, held at Tencent, Palo Alto, on 27 January.
JPEG Emerging Technologies Workshop, organised on 27 January at Tencent, Palo Alto
The following sections summarize the main highlights of the 102nd JPEG meeting:
JPEG Trust reaches Draft International Standard stage;
JPEG AI improves the Verification Model;
JPEG Pleno Learning-based Point Cloud coding releases the Committee Draft;
JPEG Pleno Light Field continues development of Quality assessment tools;
AIC starts working on Objective Quality Assessment models for Near Visually Lossless coding;
JPEG XE prepares Common Test Conditions;
JPEG DNA evaluates its Verification Model;
JPEG XS 3rd edition parts are ready for publication as International standards;
JPEG XL investigates HDR compression performance.
JPEG Trust
At its 102nd meeting the JPEG Committee produced the DIS (Draft International Standard) of JPEG Trust Part 1 “Core Foundation” (21617-1). It is expected that the standard will be published as an International Standard during the summer of 2024. This rapid standardization schedule has been necessary because of the speed at which fake media and misinformation are proliferating, especially with respect to generative AI.
The JPEG Trust Core Foundation specifies a comprehensive framework for individuals, organizations, and governing institutions interested in establishing an environment of trust for the media that they use, and for supporting trust in the media they share online. This framework addresses aspects of provenance, authenticity, integrity, copyright, and identification of assets and stakeholders. To complement Part 1, a proposed new Part 2 “Trust Profiles Catalogue” has been established. This new Part will specify a catalogue of Trust Profiles, targeting common usage scenarios.
During the meeting, the committee also evaluated responses received to the JPEG NFT Final Call for Proposals (CfP). Certain portions of the submissions will be incorporated in the JPEG Trust suite of standards to improve interoperability with respect to media tokenization. As a first step, the committee will focus on standardization of declarations of authorship and ownership.
Finally, the Use Cases and Requirements document for JPEG Trust was updated to incorporate additional requirements in respect of composited media. This document is publicly available on the JPEG website.
A white paper describing the JPEG Trust framework is also available publicly on the JPEG website.
JPEG AI
At the 102nd JPEG meeting, the JPEG AI Verification Model was improved by integrating nearly all the contributions adopted at the 101st JPEG meeting. The major change is a multi-branch JPEG AI decoding architecture with two encoders and three decoders (6 possible compatible combinations) that have been jointly trained, which allows covering a range of encoder and decoder complexity-efficiency tradeoffs. The entropy decoding and latent prediction portion is common to all possible combinations; the differences thus reside in the analysis/synthesis networks. Moreover, the number of models has been reduced to 4, both 4:4:4 and 4:2:0 coding are supported, and JPEG AI can now achieve better rate-distortion performance in some relevant use cases. A new training dataset has also been adopted with difficult/high-contrast/versatile images to reduce the number of artifacts and to achieve better generalization and color reproducibility for a wide range of situations. Other enhancements have also been adopted, namely feature clipping for decoding artifact reduction, an improved variable bit-rate training strategy, and post-synthesis transform filtering speedups.
The resulting performance and complexity characterization shows compression efficiency (BD-rate) gains of 12.5% to 27.9% over the VVC Intra anchor, for relevant encoder and decoder configurations with a wide range of complexity-efficiency tradeoffs (7 to 216 kMAC/px at the decoder side). On the CPU platform, the decoder complexity is 1.6x/3.1x higher compared to VVC Intra (reference implementation) for the simplest/base operating point. At the 102nd meeting, 12 core experiments were established to continue work on different topics, namely the JPEG AI high-level syntax, progressive decoding, training dataset, hierarchical dependent tiling, and spatial random access, to mention the most relevant. Finally, two demonstrations were shown where JPEG AI decoder implementations were run on two smartphone devices, a Huawei Mate 50 Pro and an iPhone 14 Pro.
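For readers unfamiliar with the BD-rate figures quoted above, the sketch below shows the usual Bjøntegaard delta-rate calculation: fit log-rate as a cubic polynomial of quality for the anchor and the test codec, integrate both over the overlapping quality range, and convert the average log-rate difference into a percentage. The rate-distortion points in the usage example are invented values, not JPEG AI results.

```python
import numpy as np

def bd_rate(rate_anchor, psnr_anchor, rate_test, psnr_test):
    """Bjontegaard delta rate (%): average bitrate change of the test codec
    relative to the anchor at equal quality; negative means bitrate savings."""
    lr_a, lr_t = np.log(rate_anchor), np.log(rate_test)
    # cubic fit of log-rate as a function of quality for each codec
    p_a = np.polyfit(psnr_anchor, lr_a, 3)
    p_t = np.polyfit(psnr_test, lr_t, 3)
    lo = max(min(psnr_anchor), min(psnr_test))
    hi = min(max(psnr_anchor), max(psnr_test))
    int_a, int_t = np.polyint(p_a), np.polyint(p_t)
    avg_a = (np.polyval(int_a, hi) - np.polyval(int_a, lo)) / (hi - lo)
    avg_t = (np.polyval(int_t, hi) - np.polyval(int_t, lo)) / (hi - lo)
    return (np.exp(avg_t - avg_a) - 1) * 100

# Invented rate (bpp) and quality (dB) points, four per codec
print(bd_rate([0.25, 0.50, 1.00, 2.00], [32.0, 35.0, 38.0, 41.0],
              [0.20, 0.42, 0.85, 1.70], [32.2, 35.3, 38.2, 41.1]))
```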
JPEG Pleno Learning-based Point Cloud coding
The 102nd JPEG meeting marked an important milestone for JPEG Pleno Point Cloud with the release of its Committee Draft (CD) for ISO/IEC 21794-Part 6 “Learning-based point cloud coding” (21794-6). Part 6 of the JPEG Pleno framework brings an innovative Learning-based Point Cloud Coding technology adding value to existing Parts focused on Light field and Holography coding. It is expected that a Draft International Standard (DIS) of Part 6 will be approved at the 104th JPEG meeting in July 2024 and the International Standard to be published during 2025. The 102nd meeting also marked the release of version 4 of the JPEG Pleno Point Cloud Verification Model updated to be robust to different hardware and software operating environments.
JPEG Pleno Light Field
The JPEG Committee has recently published a light field coding standard, and JPEG Pleno is constantly exploring novel light field coding architectures. The JPEG Committee is also preparing standardization activities – among others – in the domains of objective and subjective quality assessment for light fields, improved light field coding modes, and learning-based light field coding.
As the JPEG Committee seeks continuous improvement of its use case and requirements specifications, it organized a Light Field Industry Workshop. The presentations and video recording of the workshop that took place on November 22nd, 2023 are available on the JPEG website.
JPEG AIC
During the 102nd JPEG meeting, work on Image Quality Assessment continued with a focus on JPEG AIC-3, targeting the standardization of a subjective visual quality assessment methodology for images in the range from high to nearly visually lossless qualities. The activity is currently investigating three different subjective image quality assessment methodologies.
The JPEG Committee also launched the activities on Part 4 of the standard (AIC-4), by initiating work on the Draft Call for Proposals on Objective Image Quality Assessment. The Final Call for Proposals on Objective Image Quality Assessment is planned to be released in July 2024, while the submission of the proposals is planned for October 2024.
JPEG XE
The JPEG Committee continued its activity on JPEG XE and event-based vision. This activity revolves around a new and emerging image modality created by event-based visual sensors. JPEG XE concerns the creation and development of a standard to represent events in an efficient way allowing interoperability between sensing, storage, and processing, targeting machine vision and other relevant applications. The JPEG Committee is preparing a Common Test Conditions document that provides the means to perform an evaluation of candidate technology for the efficient coding of event sequences. The Common Test Conditions provide a definition of a reference format, a dataset, a set of key performance metrics, and an evaluation methodology. In addition, the committee is preparing a Draft Call for Proposals on lossless coding, with the intent to make it public in April 2024. Standardization will first start with lossless coding of event sequences, as this seems to have the highest application urgency in industry. However, the committee acknowledges that lossy coding of event sequences is also a valuable feature, which will be addressed at a later stage. The public Ad-hoc Group on Event-based Vision was re-established to continue the work towards the 103rd JPEG meeting in April 2024. To stay informed about the activities, please join the event-based imaging Ad-hoc Group mailing list.
JPEG DNA
During the 102nd JPEG meeting, the JPEG DNA Verification Model description and software were approved, along with continued efforts to evaluate its rate-distortion characteristics. Notably, during the 102nd meeting, a subjective quality assessment was carried out by expert viewing using a new approach under development in the framework of AIC-3. The robustness of the Verification Model to errors generated in a biochemical process was also analysed using a simple noise simulator. After meticulous analysis of the results, it was decided to create a number of core experiments to improve the Verification Model's rate-distortion performance and its robustness to errors by adding an error correction technique. In parallel, efforts are underway to improve the rate-distortion performance of the JPEG DNA Verification Model by exploring learning-based coding solutions. In addition, further efforts are defined to improve the noise simulator so as to allow assessment of the Verification Model's resilience to noise under more realistic conditions, laying the groundwork for a JPEG DNA standard that is robust to insertion, deletion, and substitution errors.
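As a rough illustration of what such a noise simulator does (this is not the committee's simulator, and the error probabilities are invented), the following snippet applies independent substitution, insertion, and deletion errors to a nucleotide string:

```python
import random

def noisy_channel(seq, p_sub=0.01, p_ins=0.005, p_del=0.005):
    """Apply i.i.d. substitution, insertion and deletion errors to a nucleotide
    string, mimicking (in a very simplified way) a DNA synthesis/sequencing
    channel. Probabilities are per input symbol and purely illustrative."""
    bases = "ACGT"
    out = []
    for b in seq:
        if random.random() < p_del:          # symbol lost
            continue
        if random.random() < p_sub:          # symbol replaced by another base
            b = random.choice([x for x in bases if x != b])
        out.append(b)
        if random.random() < p_ins:          # spurious symbol inserted
            out.append(random.choice(bases))
    return "".join(out)

print(noisy_channel("ACGTACGTACGTACGT"))
```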
JPEG XS
The JPEG Committee is happy to announce that the core parts of JPEG XS 3rd edition are ready for publication as International standards. The Final Draft International Standard for Part 1 of the standard – Core coding tools – was created at the last meeting in November 2023, and is scheduled for publication. DIS ballot results for Part 2 – Profiles and buffer models – and Part 3 – Transport and container formats – of the standard came back, allowing the JPEG Committee to produce and deliver the proposed IS texts to ISO. This means that Part 2 and Part 3 3rd edition are also scheduled for publication.
At this meeting, the JPEG Committee continued the work on Part 4 – Conformance testing, to provide the necessary test streams of the 3rd edition for potential implementors. A Committee Draft for Part 4 was issued. With Parts 1, 2, and 3 now ready, and Part 4 ongoing, the JPEG Committee initiated the 3rd edition of Part 5 – Reference software. A first Working Draft was prepared and work on the reference software will start.
Finally, experimental results were presented on how to use JPEG XS over 5G mobile networks for the transmission of low-latency, high-quality 4K/8K 360-degree views to mobile devices. This use case was added at the previous JPEG meeting. It is expected that the new use case can already be covered by the 3rd edition, meaning that no further updates to the standard would be necessary. However, investigations and experimentation on this subject continue.
JPEG XL
The second edition of JPEG XL Part 3 (Conformance testing) has proceeded to the DIS stage. Work on a hardware implementation continues. Experiments are planned to investigate HDR compression performance of JPEG XL.
“In its efforts to provide standardized solutions to ascertain authenticity and provenance of visual information, the JPEG Committee has released the Draft International Standard of JPEG Trust. JPEG Trust will bring trustworthiness back to imaging with specifications under the governance of the entire international community and stakeholders, as opposed to a small number of companies or countries.” said Prof. Touradj Ebrahimi, the Convenor of the JPEG Committee.
Abstract: Energy efficiency has become a crucial aspect of today’s IT infrastructures, and video (streaming) accounts for over half of today’s Internet traffic. This column highlights open-source tools, datasets, and solutions addressing energy efficiency in video streaming presented at ACM Multimedia Systems 2024 and its co-located workshop ACM Green Multimedia Systems.
Introduction
Across various platforms, users seek the highest Quality of Experience (QoE) in video communication and streaming. Whether it’s a crucial business meeting or a relaxing evening of entertainment, individuals expect seamless, high-quality video experiences. However, meeting this demand comes at a cost: increased energy usage [1],[2]. This energy consumption occurs at every stage of the process, from content provision via cloud services to consumption on end users’ devices [3]. Unless renewable energy sources are used, this heightened energy consumption leads to higher CO2 emissions, posing environmental challenges and emphasizing the need for studies that assess the carbon footprint of video streaming.
Content provision is a critical stage in video streaming, involving encoding videos into various formats, resolutions, and bitrates. Encoding demands computing power and energy, especially in cloud-based systems. Cloud computing has become popular for video encoding due to its scalability [4], adjusting cloud resources to handle changing workloads, and its flexibility [5], allowing operations to be scaled based on demand. However, this convenience comes at a cost. Data centers, the heart of cloud computing, consume a significant portion of global electricity, around 3% [6], and video encoding is one of the biggest energy consumers within these data centers. Therefore, optimizing video encoding for lower energy consumption is crucial for reducing the environmental impact of cloud-based video delivery.
Content consumption [7] involves the device using the network interface card to request and download video segments from the server, decompressing them for playback, and finally rendering the decoded frames on the screen, where the energy consumption depends on the screen technology and brightness settings.
The GAIA project showcased its research on the environmental impact of video streaming at the recent 15th ACM Multimedia Systems Conference (April 15-18, Bari, Italy). We presented our findings in the relevant conference sessions: the Open-Source Software and Datasets session and the Green Multimedia Systems (GMSys) workshop.
Open Source Software
GREEM: An Open-Source Benchmark Tool Measuring the Environmental Footprint of Video Streaming [PDF] [Github] [Poster]
GREEM (Gaia Resource Energy and Emission Monitoring) measures energy usage during video encoding and decoding. GREEM tracks the effects of video processing on hardware performance and provides a suite of analytical scenarios covering the most common video streaming situations, such as sequential and parallel video encoding and decoding. Its main features are listed below; a minimal sketch of such a measurement loop follows the list.
Automates experimentation: It allows users to easily configure and run various encoding scenarios with different parameters to compare results.
In-depth monitoring: The tool traces numerous hardware parameters, specifically monitoring energy consumption and GPU metrics, including core and memory utilization, temperature, and fan speed, providing a complete picture of video processing resource usage.
Visualization: GREEM offers scripts that generate analytic plots, allowing users to visualize and understand their measurement results easily.
Verifiable: GREEM empowers researchers with a tool that has earned the ACM Reproducibility Badge, which allows others to reproduce the experiments and results reported in the paper.
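The sketch below is not GREEM itself, but it conveys the core idea of this kind of tool: sample the GPU power draw (here via NVIDIA's NVML bindings) while an encoding command runs and integrate it into an energy figure. The ffmpeg NVENC command is only an example workload.

```python
import subprocess
import time
import pynvml  # NVIDIA Management Library bindings (pip install nvidia-ml-py)

def measure_gpu_energy(cmd, interval=0.1):
    """Run `cmd` (e.g. an encode job) and integrate GPU power draw over time.
    Returns (energy in joules, wall-clock duration in seconds)."""
    pynvml.nvmlInit()
    handle = pynvml.nvmlDeviceGetHandleByIndex(0)
    proc = subprocess.Popen(cmd)
    energy, start = 0.0, time.time()
    while proc.poll() is None:
        power_w = pynvml.nvmlDeviceGetPowerUsage(handle) / 1000.0  # mW -> W
        energy += power_w * interval      # rectangle-rule integration
        time.sleep(interval)
    pynvml.nvmlShutdown()
    return energy, time.time() - start

# Example workload: a hypothetical NVENC encode of input.mp4
joules, seconds = measure_gpu_energy(
    ["ffmpeg", "-y", "-i", "input.mp4", "-c:v", "h264_nvenc", "out.mp4"])
print(f"{joules:.1f} J over {seconds:.1f} s ({joules / 3.6e6:.6f} kWh)")
```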
Open Source Datasets
VEED: Video Encoding Energy and CO2 Emissions Dataset for AWS EC2 instances [PDF] [Github] [Poster]
As video encoding increasingly shifts to cloud-based services, concerns about the environmental impact of massive data centers arise. The Video Encoding Energy and CO2 Emissions Dataset (VEED) provides the energy consumption and CO2 emissions associated with video encoding on Amazon’s Elastic Compute Cloud (EC2) instances. Additionally, VEED goes beyond energy consumption, also capturing encoding duration and CPU utilization. A short usage sketch follows the list of contributions below.
Contributions:
Findability: A comprehensive metadata description file ensures VEED’s discoverability for researchers.
Accessibility: VEED is open for download on GitHub (https://github.com/cd-athena/VEEDdataset), removing access barriers for researchers. Core findings in the research that leverages the VEED dataset have been independently verified (ACM Reproducibility Badge).
Interoperability: The dataset is provided in a comma-separated value (CSV) format, allowing integration with various analysis applications.
Reusability: Description files empower researchers to understand the data structure and context, facilitating its use in diverse analytical projects.
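Because VEED ships as CSV, exploring it takes only a few lines of pandas. The column names in the sketch below are hypothetical placeholders; the actual schema is documented in the dataset's description files on GitHub.

```python
import pandas as pd

# Load the dataset; file name and column names are illustrative placeholders
df = pd.read_csv("veed.csv")

# Energy and emissions per EC2 instance type (hypothetical columns)
summary = (df.groupby("instance_type")[["energy_kwh", "co2_g"]]
             .agg(["mean", "sum"]))
print(summary)

# Which instance type has the lowest average CO2 per encoded video?
best = df.groupby("instance_type")["co2_g"].mean().idxmin()
print("Lowest average emissions:", best)
```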
COCONUT: Content Consumption Energy Measurement Dataset for Adaptive Video Streaming [PDF] [Github]
COCONUT is a dataset comprising the energy consumption of video streaming across various devices and different HAS (HTTP Adaptive Streaming) players. COCONUT captures user data during MPEG-DASH video segment streaming on laptops, smartphones, and other client devices, measuring energy consumption at different stages of streaming, including segment retrieval through the network interface card, video decoding, and rendering on the device. This paper has been designated the ACM Artifacts Available badge, signifying that the COCONUT dataset is publicly accessible. COCONUT can be accessed at https://athena.itec.aau.at/coconut/.
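One hypothetical way to slice such a dataset is by streaming stage and device class; as with the VEED sketch above, the file and column names below are placeholders rather than the published schema.

```python
import pandas as pd

df = pd.read_csv("coconut.csv")   # placeholder file and column names

# Share of energy spent in each stage (network / decode / render) per device
stage_cols = ["network_j", "decode_j", "render_j"]
per_device = df.groupby("device")[stage_cols].sum()
print(per_device.div(per_device.sum(axis=1), axis=0).round(2))
```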
Second International ACM Green Multimedia Systems Workshop — GMSys 2024
VEEP: Video Encoding Energy and CO2 Emission Prediction [pdf] [slides]
VEEP is a machine learning (ML) scheme that empowers users to predict the energy consumption and CO2 emissions associated with cloud-based video encoding. Its contributions are listed below; a minimal prediction sketch follows the list.
Contributions:
Content-aware energy prediction: VEEP analyzes video content to extract features impacting encoding complexity. These features feed an ML model that accurately predicts the energy consumption required for encoding the video on AWS EC2 instances (high accuracy: an R² score of 0.96).
Real-time carbon footprint: VEEP goes beyond energy; it also factors in real-time carbon intensity data based on the location of the cloud instance. This allows VEEP to calculate the associated CO2 emissions for an encoding task at encoding time.
Resulting impact: by carefully selecting the type and location of cloud instances based on VEEP’s predictions, CO2 emissions can be reduced by a factor of up to 375. This significant reduction demonstrates VEEP’s potential to contribute to greener video encoding.
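The sketch below mimics the VEEP pipeline at toy scale: a regression model maps content features to encoding energy, and the prediction is multiplied by a region-specific carbon intensity to obtain CO2. The features, labels, and model choice are placeholders, not VEEP's actual design.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

# Placeholder per-video complexity features and measured encoding energy (Wh);
# in VEEP these would come from content analysis and real measurements.
rng = np.random.default_rng(0)
X_train = rng.random((500, 4))
y_train = 20 + 80 * X_train[:, 0] + rng.normal(scale=2.0, size=500)

model = GradientBoostingRegressor().fit(X_train, y_train)

def predict_co2(features, carbon_intensity_g_per_kwh):
    """Predicted energy (Wh) times region- and time-specific carbon intensity."""
    energy_wh = float(model.predict(features.reshape(1, -1))[0])
    return energy_wh / 1000.0 * carbon_intensity_g_per_kwh   # grams of CO2

# Example: encoding in a region currently at 300 gCO2/kWh
print(predict_co2(np.array([0.4, 0.7, 0.1, 0.9]), 300.0))
```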
Conclusions
This column provided an overview of the GAIA project’s research on the environmental impact of video streaming, presented at the 15th ACM Multimedia Systems Conference. The GREEM measurement tool empowers developers and researchers to measure the energy consumption and CO2 emissions of video processing. VEED provides valuable insights into energy consumption and CO2 emissions during cloud-based video encoding on AWS EC2 instances. COCONUT sheds light on energy usage during video playback on various devices and with different players, aiding in optimizing client-side video streaming. Furthermore, VEEP, a machine learning framework, takes energy efficiency a step further: it predicts the energy consumption and CO2 emissions associated with cloud-based video encoding, allowing users to select cloud instances that minimize environmental impact. These studies can help researchers, developers, and service providers optimize video streaming for a more sustainable future, and their combined focus on encoding and playback highlights the importance of a holistic approach that considers the entire video streaming lifecycle. While these papers primarily address the environmental impact of video streaming, a strong connection exists between energy efficiency and QoE [8],[9],[10]. Optimizing video processing for lower energy consumption can lead to trade-offs in video quality. Future research could explore techniques for optimizing video processing while ensuring a consistently high QoE for viewers.
References
[1] A. Katsenou, J. Mao, and I. Mavromatis, “Energy-Rate-Quality Tradeoffs of State-of-the-Art Video Codecs.” arXiv, Oct. 02, 2022. Accessed: Oct. 06, 2022. [Online]. Available: http://arxiv.org/abs/2210.00618
[2] H. Amirpour, V. V. Menon, S. Afzal, R. Prodan, and C. Timmerer, “Optimizing video streaming for sustainability and quality: The role of preset selection in per-title encoding,” in 2023 IEEE International Conference on Multimedia and Expo (ICME), IEEE, 2023, pp. 1679–1684. Accessed: May 05, 2024. [Online]. Available: https://ieeexplore.ieee.org/abstract/document/10219577/
[4] A. Atadoga, U. J. Umoga, O. A. Lottu, and E. O. Sodiya, “Evaluating the impact of cloud computing on accounting firms: A review of efficiency, scalability, and data security,” Glob. J. Eng. Technol. Adv., vol. 18, no. 2, pp. 065–075, Feb. 2024, doi: 10.30574/gjeta.2024.18.2.0027.
[5] B. Zeng, Y. Zhou, X. Xu, and D. Cai, “Bi-level planning approach for incorporating the demand-side flexibility of cloud data centers under electricity-carbon markets,” Appl. Energy, vol. 357, p. 122406, Mar. 2024, doi: 10.1016/j.apenergy.2023.122406.
[7] C. Yue, S. Sen, B. Wang, Y. Qin, and F. Qian, “Energy considerations for ABR video streaming to smartphones: Measurements, models and insights,” in Proceedings of the 11th ACM Multimedia Systems Conference, 2020, pp. 153–165, doi: 10.1145/3339825.3391867.
[8] G. Bingöl, A. Floris, S. Porcu, C. Timmerer, and L. Atzori, “Are Quality and Sustainability Reconcilable? A Subjective Study on Video QoE, Luminance and Resolution,” in 2023 15th International Conference on Quality of Multimedia Experience (QoMEX), IEEE, 2023, pp. 19–24. Accessed: May 06, 2024. [Online]. Available: https://ieeexplore.ieee.org/abstract/document/10178513/
[9] G. Bingöl, S. Porcu, A. Floris, and L. Atzori, “An Analysis of the Trade-Off Between Sustainability and Quality of Experience for Video Streaming,” in 2023 IEEE International Conference on Communications Workshops (ICC Workshops), IEEE, 2023, pp. 1600–1605. Accessed: May 06, 2024. [Online]. Available: https://ieeexplore.ieee.org/abstract/document/10283614/
[10] C. Herglotz, W. Robitza, A. Raake, T. Hossfeld, and A. Kaup, “Power Reduction Opportunities on End-User Devices in Quality-Steady Video Streaming.” arXiv, May 24, 2023. doi: 10.48550/arXiv.2305.15117.
In February 2024, the first edition of the Social Robotics, Artificial Intelligence and Multimedia (SoRAIM) Winter School was held; with the support of SIGMM, it attracted more than 50 students and young researchers to learn about, discuss, and experiment first-hand with topics related to social robotics. The event’s success calls for further editions in upcoming years.
Rationale for SoRAIM
SPRING, a collaborative research project funded by the European Commission under Horizon 2020, is coming to an end in May 2024. Its scientific and technological objectives, namely to test a versatile social robotic platform within a hospital and have it perform social activities in a multi-person, dynamic setup, have for the most part been achieved. To empower the next generation of young researchers with the concepts and tools needed to answer tomorrow’s challenges in social robotics, the issue of knowledge and know-how transmission must be tackled. We therefore chose to organise a winter school, free of charge to the participants (thanks to the additional support of SIGMM), so that as many students and young researchers as possible, from various horizons and not only technical fields, could attend.
Contents of the Winter School
The Social Robotics, Artificial Intelligence and Multimedia (SoRAIM) Winter School took place from 19 to 23 February 2024 in Grenoble, France. The first day opened with an introduction to the contents of the school and to the context provided by the SPRING project, followed by a demonstration combining social navigation and dialogue interaction. This sparked the participants’ curiosity and led to a spontaneous Q&A session fed by their contributions, questions and comments.
The school spanned the entire week, with 17 talks given by 8 speakers from the H2020 SPRING project and 9 invited speakers external to the project. The school also included a panel discussion on the topic “Are social robots already out there? Immediate challenges in real-world deployment”, a poster session with 15 contributions, and two hands-on sessions where participants could choose among the following topics: Robot navigation with Reinforcement Learning, ROS4HRI: How to represent and reason about humans with ROS, Building a conversational system with LLMs using prompt engineering, Robot self-localisation based on camera images, and Speaker extraction from microphone recordings. A social activity (a visit of Grenoble’s downtown and the Bastille) was organised on Thursday afternoon, allowing participants to mingle with the speakers and to discover the host town’s history.
One of the highlights of SoRAIM was its panel session, whose topic was “Are social robots already out there? Immediate challenges in real-world deployment”. Although no definitive answers were found, the session stressed that the challenges to deploying actual social robots in our everyday lives (at work, at home) remain numerous. On the technical side, robotic platforms are subject to hardware and software constraints: sensors and actuators are restricted in size, power and performance, since physical space and battery capacity are limited, and large models can only be used if substantial computing resources are permanently available, which is not always the case, since resources must be shared among the various computing modules. On the regulatory and legal side, the rise of AI use is fast and needs to be balanced with ethical views that address our society’s needs, yet the construction of proper laws and norms, and their acknowledgement and understanding by stakeholders, is slow. In this session the panellists surveyed all aspects of the problems at hand and provided an overview of the challenges that future scientists will need to solve in order to take social robots out of the labs and into the world.
Attendance & future perspectives
SoRAIM attracted 57 participants over the whole week. The attendees were diverse, as initially aimed for, with a breakdown of 50% PhD students, 20% young researchers (public sector), 10% engineers and young researchers (private sector), and 20% MSc students. Of particular note, the ratio of women attendees was close to 40%, roughly double the usual in this field. Finally, in terms of geographic spread, the majority of attendees came from other European countries (17 countries in total), with just under 50% coming from France. Following the school, a satisfaction survey was sent to the attendees to better grasp which elements were the most appreciated, in view of the longer-term objective of holding this winter school as a serial event. Given the diverse backgrounds of attendees, opinions on contents such as the hands-on sessions varied, but overall satisfaction was very high, which shows the interest of the next generation of researchers in more opportunities to learn in this field. We are currently reviewing options to hold similar events every year or every two years, depending on available funding.
More information about the SoRAIM winter school is available on the webpage: https://spring-h2020.eu
Sponsors
SoRAIM was sponsored by the H2020 SPRING project, Inria, the University Grenoble Alpes, the Multidisciplinary Institute of Artificial Intelligence and by ACM’s Special Interest Group on Multimedia (SIGMM). Through ACM SIGMM, we received significant funding which allowed us to invite 14 students and young researchers, members of SIGMM, from abroad.
Overall description of the software architecture used in the SPRING project, by Dr. Séverin Lemaignan (https://academia.skadge.org/, PAL Robotics).
Autonomous Robots in the Wild – Adapting From and for Interaction, by Prof. Marc Hanheide (https://www.hanheide.net/, University of Lincoln).
AI and Children’s Rights: Lessons Learnt from the Implementation of the UNICEF Policy Guidance to Social Robots for Children, by Dr. Vasiliky Charisi (https://vickycharisi.wordpress.com/about/, University College London).
Opportunities and Challenges in Putting AI Ethics in Practice: the Role of the EU, by Dr. Mihalis Kritikos (https://www.linkedin.com/in/mihalis-kritikos-43243087/, Ethics and Integrity Sector of the European Commission).
Robust Audio-Visual Perception of Humans, by Prof. Sharon Gannot (https://sharongannot.group/, Bar-Ilan University).
Predictive modelling of turn-taking in human-robot interaction, by Prof. Gabriel Skantze (https://www.kth.se/profile/skantze, KTH, Kungliga Tekniska Högskolan).
Multi-User Spoken Conversations with Robots, by Dr. Daniel Hernandez Garcia (https://dhgarcia.github.io/, Heriot Watt University).
Learning Robot Behaviour, by Dr. Chris Reinke (https://www.scirei.net/, Inria at Univ. Grenoble Alpes).
Human-Interactive Mobile Robots: from Learning to Deployment, by Prof. Xuesu Xiao (https://cs.gmu.edu/~xiao/, George Mason University).
Human-Presence Modeling and Social Navigation of an Assistive Robot Solution for Detection of Falls and Elderly’s Support, by Prof. Antonios Gasteratos (https://robotics.pme.duth.gr/antonis/, Democritus University of Thrace).
Robotic Coaches for Mental Wellbeing: From the Lab to the Real World, by Prof. Hatice Gunes (https://www.cl.cam.ac.uk/~hg410/, University of Cambridge)
JPEG Trust reaches Committee Draft stage at the 101st JPEG meeting
The 101st JPEG meeting was held online, from the 30th of October to the 3rd of November 2023. At this meeting, JPEG Trust became a Committee Draft. In addition, JPEG analyzed the responses to its Call for Proposals for JPEG DNA.
The 101st JPEG meeting had the following highlights:
JPEG Trust reaches Committee Draft;
JPEG AI requests its re-establishment;
JPEG Pleno Learning-based Point Cloud coding establishes a new Verification Model;
JPEG Pleno organizes a Light Field Industry Workshop;
JPEG AIC-3 continues the evaluation of contributions;
JPEG XE produces a first draft of the Common Test Conditions;
JPEG DNA analyses the responses to the Call for Proposals;
JPEG XS proceeds with the development of the 3rd edition;
JPEG XL proceeds with the development of the 2nd edition.
The following sections summarize the main highlights of the 101st JPEG meeting.
JPEG Trust
The 101st meeting marked an important milestone for the JPEG Trust project, with its Committee Draft (CD) for Part 1 “Core Foundation” (21617-1) of the standard approved for consultation. It is expected that a Draft International Standard (DIS) of the Core Foundation will be approved at the 102nd JPEG meeting in January 2024, which will be another important milestone. This rapid schedule is necessitated by the speed at which fake media and misinformation are proliferating, especially in respect of generative AI.
Aligned with JPEG Trust, the NFT Call for Proposals (CfP) has yielded two expressions of interest to date, and submission of proposals is still open till the 15th of January 2024.
Additionally, the Use Cases and Requirements document for JPEG Fake Media (the JPEG Fake Media exploration preceded the initiation of the JPEG Trust international standard) was updated to reflect the change to JPEG Trust as well as incorporate additional use cases that have arisen since the previous JPEG meeting, namely in respect of composited images. This document is publicly available on the JPEG website.
JPEG AI
At the 101st meeting, the JPEG Committee issued a request for re-establishing the JPEG AI (6048-1) project, along with a Committee Draft (CD) of its version 1. A new JPEG AI timeline has also been approved and is now publicly available, where a Draft International Standard (DIS) of the Core Coding Engine of JPEG AI version 1 is foreseen at the 103rd JPEG meeting (April 2024), a rather important milestone for JPEG AI. The JPEG Committee also established that JPEG AI version 2 will address requirements not yet fulfilled (especially regarding machine consumption tasks) as well as significant improvements on requirements already addressed in version 1, e.g. compression efficiency. JPEG AI version 2 will issue the final Call for Proposals in January 2025, and the presentation and evaluation of JPEG AI version 2 proposals will occur in July 2025. During 2023, the JPEG AI Verification Model (VM) has evolved from a complex system (800 kMAC/pxl) to two acceptable complexity-efficiency operation points, providing 11% compression efficiency gains at 20 kMAC/pxl and 25% compression efficiency gains at 200 kMAC/pxl. The decoder for the lower-end operating point has now been implemented on mobile devices and demonstrated during the 100th and 101st JPEG meetings. A presentation with the JPEG AI architecture, networks, and tools is now publicly available. To avoid project delays in the future, the promising input contributions from the 101st meeting will be combined in JPEG AI Core Experiment 6.1 (CE6.1) to study interactions and resolve potential issues during the next meeting cycle. After this integration, a model will be trained and cross-checked to be approved for release (JPEG AI VM5 release candidate) along with the study DIS text. Among the promising technologies included in CE6.1 are high-quality and variable-rate improvements with a smaller number of models (from 5 to 4), and a multi-branch decoder that allows up to three reconstructions with different levels of quality from the same latent representation, using synthesis transform networks of different complexity, along with several post-filter and arithmetic coder simplifications.
JPEG Pleno Learning-based Point Cloud coding
The JPEG Pleno Learning-based Point Cloud coding activity progressed at the 101st meeting with a major investigation into point cloud quality metrics. The JPEG Committee decided to continue this investigation into point cloud quality metrics as well as explore possible advancements to the VM in the areas of parameter tuning and support for residual lossless coding. The JPEG Committee is targeting a release of the Committee Draft of Part 6 of the JPEG Pleno standard relating to Learning-based point cloud coding at the 102nd JPEG meeting in San Francisco, USA in January 2024.
JPEG Pleno Light Field
The JPEG Committee has been creating several standards to provision the dynamic demands of the market, with its royalty-free patent licensing commitments. A light field coding standard has recently been developed, and JPEG Pleno is constantly exploring novel light field coding architectures.

The JPEG Committee is also preparing standardization activities – among others – in the domains of objective and subjective quality assessment for light fields, improved light field coding modes, and learning-based light field coding.

A Light Field Industry Workshop takes place on November 22nd, 2023, aiming at providing a forum for industrial actors to exchange information on their needs and expectations with respect to standardization activities in this domain.
JPEG AIC
During the 101st JPEG meeting, the AIC activity continued its efforts on the evaluation of the contributions received in April 2023 in response to the Call for Contributions on Subjective Image Quality Assessment. Notably, the activity is currently investigating three different subjective image quality assessment methodologies. The results of the newly established Core Experiments will be considered during the design of the AIC-3 standard, which has been carried out in a collaborative way since its beginning.

The AIC activity also initiated the discussion on Part 4 of the standard on Objective Image Quality Metrics (AIC-4) by refining the Use Cases and Requirements document. During the 102nd JPEG meeting in January 2024, the activity is planning to work on the Draft Call for Proposals on Objective Image Quality Metrics.
JPEG XE
The JPEG Committee continued its activity on Event-based Vision. This activity revolves around a new and emerging image modality created by event-based visual sensors. JPEG XE aims at the creation and development of a standard to represent events in an efficient way allowing interoperability between sensing, storage, and processing, targeting machine vision and other relevant applications. For better dissemination and raising external interest, a workshop around Event-based Vision was organized and took place on Oct 24th, 2023. The workshop triggered the attention of various stakeholders in the field of Event-based Vision, who will start contributing to JPEG XE. The workshop proceedings will be made available on jpeg.org. In addition, the JPEG Committee created a minor revision for the Use cases and Requirements as v1.0, adding an extra use case on scientific and engineering measurements. Finally, a first draft of the Common Test Conditions for JPEG XE was produced, along with the first Exploration Experiments to start practical experiments in the coming 3-month period until the next JPEG meeting. The public Ad-hoc Group on Event-based Vision was re-established to continue the work towards the next 102nd JPEG meeting in January of 2024. To stay informed about the activities please join the Event-based Vision Ad-hoc Group mailing list.
JPEG DNA
As a result of the Call for Proposals issued by the JPEG Committee for contributions to the JPEG DNA standard, 5 proposals were submitted under three distinct codecs by three organizations. Two codecs were submitted to both the coding and transcoding categories, and one was submitted to the coding category only. All proposals showed improved compression efficiency when compared to three anchors selected by the JPEG Committee. After a rigorous analysis of the proposals and their cross-checking by independent parties, it was decided to create a first Verification Model (VM) based on V-DNA, the best performing proposal. In addition, a number of core experiments were designed to improve the JPEG DNA VM with elements from the other submitted proposals, by quantifying their added value when integrated in the VM.
JPEG XS
The JPEG Committee continued its work on JPEG XS 3rd edition. The primary goal of the 3rd edition is to deliver the same image quality as the 2nd edition, but with half of the required bandwidth. The Final Draft International Standard for Part 1 of the standard – Core coding tools – was produced at this meeting. With this FDIS version, all technical features are now fixed and completed. Part 2 – Profiles and buffer models – and Part 3 – Transport and container formats – of the standard are still in DIS ballot, and ballot results will only be known by the end of January 2024. The JPEG Committee is now working on Part 4 – Conformance testing – to provide the necessary test streams of the 3rd edition for potential implementors. A first Working Draft for Part 4 was issued. Completion of the JPEG XS 3rd edition is scheduled for April 2024 (Parts 1, 2, and 3), and Parts 4 and 5 will follow shortly after that. Finally, a new Use Cases and Requirements for JPEG XS document was created, containing a new use case on using JPEG XS for the transport of 4K/8K video over 5G mobile networks. It is expected that the new use case can already be covered by the 3rd edition, meaning that no further updates to the standard would be needed. However, more investigations and experimentation will be conducted on this subject.
JPEG XL
The second editions of JPEG XL Part 1 (Core coding system) and Part 2 (File format) have proceeded to the FDIS stage, and the second edition of JPEG XL Part 3 (Conformance testing) has proceeded to the CD stage. These second editions provide clarifications, corrections and editorial improvements that will facilitate independent implementations. At the same time, the development of hardware implementation solutions continues.
Final Quote
“The release of the first Committee Draft of JPEG Trust is a strong signal that the JPEG Committee is reacting with a timely response to demands for solutions that inform users when digital media assets are created or modified, in particular through Generative AI, hence contributing to bringing back trust into media-centric ecosystems.” said Prof. Touradj Ebrahimi, the Convenor of the JPEG Committee.