Rethinking QoE in the Age of AI: From Algorithms to Experience-Based Evaluation

AI evaluation is undergoing a paradigm shift from focusing solely on algorithmic accuracy of AI models to emphasizing experience-based assessment of human interactions with AI systems. Under frameworks like the EU AI Act, evaluation now considers intended purpose, risk, transparency, human oversight, and real-world robustness alongside accuracy. Quality of Experience (QoE) methodologies may offer a structured approach to evaluate how users perceive and experience AI systems in terms of transparency, trust, control and overall satisfaction. This column aims to inspire and share insights across both communities, so that they can advance experience-based evaluation of AI systems together.

1. From algorithms to systems: AI as user experience

Artificial Intelligence (AI) algorithms—mathematical models implemented as lines of code and trained on data to predict, recommend or generate outputs—were, until recently, tools reserved for programmers and researchers. Only those with technical expertise could access, run or adapt them. For decades, progress in AI was equated with improvements in algorithmic performance: higher accuracy, better precision or new benchmark records—often achieved under narrow, controlled conditions that did not reflect the full spectrum of real-world operational environments. These advances, though scientifically impressive, remained largely invisible to society at large.

The turning point came when AI stopped being just code and became an experience accessible to everyone, regardless of their technical background. Once algorithms were embedded into interactive systems—chatbots, voice assistants, recommendation platforms, image generators—AI became ubiquitous, integrated into people’s daily lives. Interfaces transformed technical capability into human experience, making AI not only a purely algorithmic or research-oriented field but also a social, experiential and increasingly public phenomenon [Mlynář et al., 2025].

This shift fundamentally changed what it means to evaluate AI [Bach et al., 2024]. Accuracy-based metrics—such as precision, recall, specificity or F1-score—no longer suffice for systems that mediate human experiences, influence decision-making and shape trust. Evaluation must now extend beyond the model’s internal performance to assess the interaction, context and experience that emerge when humans engage with AI systems in realistic conditions. We must therefore move from evaluating algorithms in isolation to genuinely human-centered approaches to AI and the experiences it enables [see e.g., https://hai.stanford.edu/], evaluating AI systems holistically—considering not only their technical performance but also their experiential, contextual, and social impact [Shneiderman, 2022].

The European Union’s Artificial Intelligence Act [AI Act, 2024] provides a clear illustration of this shift. As the first comprehensive regulatory framework for AI, it recognizes that while algorithmic quality remains essential, what is ultimately regulated is the AI system—its design, use, and intended purpose. Obligations under the Act are tied to that intended purpose, which determines both the risk level and the compliance requirements (see figure below). For instance, the same object detection model can be considered low risk when used to organize personal photo libraries, but high risk when deployed in an autonomous vehicle’s collision-avoidance system.

Figure 1. The European Union’s Artificial Intelligence Act [AI Act, 2024]: risk and obligations depend on an AI system’s intended purpose—permitting low-risk uses while restricting or prohibiting high-risk applications. Examples in the figure are illustrative, not exhaustive. Some uses require prior authorisation under the EU AI Act.

This illustrates a fundamental change: evaluating AI systems today requires understanding how, where and by whom a system is used—not merely how accurate its underlying AI model is. Moreover, evaluation must consider how systems behave and degrade under operational conditions (e.g., adverse weather in traffic monitoring or biased performance across demographic groups in facial analysis), how humans interact with, interpret and rely on them, and what mechanisms of human oversight or intervention exist in practice to ensure accountability and control [Panigutti et al., 2023].

2. Towards a paradigm shift in AI evaluation

The European AI Act marks the first comprehensive attempt to regulate the design, deployment and use of AI systems. Yet its underlying philosophy resonates broadly with the principles endorsed by other high-level international institutions and initiatives—such as the OECD [OECD, 2024], the World Economic Forum [WEF, 2025] and, more recently, the Paris AI Action Summit [CSIS, 2025], where over sixty countries signed a joint commitment to promote responsible, trustworthy and human-centric AI.

Among the many obligations set out in the AI Act for high-risk AI systems, three provisions stand out as emblematic of this paradigm shift: they focus not on algorithmic precision, but on how AI systems are experienced, supervised and operated in the real world.

  • Article 13 – Transparency. AI systems must be designed and developed in a way that is sufficiently transparent to enable users to interpret their output and use it appropriately. Transparency therefore extends beyond disclosure or documentation: it encompasses interaction design and interpretability, ensuring that users—especially non-experts—can meaningfully understand what the system produces, which inputs it is based on, and how to act upon it.
  • Article 14 – Human oversight. High-risk AI systems must allow for effective human supervision so that they can be used as intended and to prevent or minimise risks to health, safety or fundamental rights (e.g., respect for human dignity, privacy, equality and non-discrimination). Oversight involves not only control features or override mechanisms, but also interface designs that help operators recognise when human intervention is necessary—addressing known challenges such as automation bias and over-trust in AI systems [Gaudeul et al., 2024].
  • Article 15 – Accuracy, robustness and cybersecurity. This provision broadens the traditional notion of accuracy, demanding that systems perform reliably under real-world operational conditions and remain secure and resilient to errors, adversarial manipulation or context change. It also calls for mechanisms that support graceful degradation and error recovery, ensuring sustained trust and dependable performance over time.

These provisions, aligned to both the AI Act and the broader international discourse on responsible AI, express a clear transformation in how AI systems should be evaluated. They call for a move beyond in-lab algorithmic performance metrics to include criteria grounded in human experience, operational reliability and social trust. To make these requirements actionable, the European Commission issued a Standardisation Request on Artificial Intelligence (initially published as M/593, 2024 [European Commission, 2024] and subsequently updated following the adoption of the AI Act), mandating the development of harmonised standards to support conformity with the regulation. Yet analyses of existing AI standardisation frameworks suggest that they remain primarily focused on technical robustness and risk management, while offering limited methodological guidance for assessing transparency, human oversight and perceived reliability [Soler et al., 2023].

This gap underscores the need for contributions from the Quality of Experience (QoE) community, whose expertise in assessing perceived quality, usability, trust and the pragmatic, hedonic and increasingly also eudaimonic aspects of users’ experiences could inform both standardisation efforts and AI system design in practice. For example, [Hammer et al., 2018] introduced the “HEP cube”, a 3D model that maps hedonic (H), eudaimonic (E) and pragmatic (P) aspects of QoE and user experience, integrating, for instance, utility (P), joy-of-use (H) and meaningfulness (E) into a multidimensional HEP construct [Egger-Lampl et al., 2019]. In professional contexts, long-term experiential quality depends increasingly on eudaimonic factors such as meaning and the growth of the user’s capabilities. Using augmented reality for the informational phase of procedure assistance as an example, [Hynes et al., 2023] consider pragmatic aspects such as clear, accurately aligned AR instructions that reduce cognitive load and support efficient task execution, while hedonic and eudaimonic aspects involve engaging, intuitive interactions that not only make the experience pleasant but also foster confidence, competence and meaningful professional growth. The study confirmed that AR fulfills users’ pragmatic needs better than paper-based instructions; however, the hypothesis that AR surpasses paper-based instructions in meeting hedonic needs was rejected. [Oppermann et al., 2024] evaluated VR-based forestry safety training and found improved experiential quality and real-world skill transfer compared to traditional instruction. In addition to hedonic and pragmatic UX, eudaimonic experience was assessed by asking participants whether the training would help to “make me a better forestry worker” and “develop my personal potential”.

3. From benchmark performance to operational reality: the case of facial recognition

The example of remote facial recognition (RFR) for public security clearly illustrates how traditional accuracy-based evaluation fails to capture the real challenges of proportionality, operational viability and public trust that define the true quality of experience of AI in use. Under the EU AI Act, the use of real-time remote biometric identification systems in publicly accessible spaces for law enforcement is prohibited, except in narrowly defined circumstances—such as the prevention of terrorist threats, the search for missing persons or the prosecution of crimes—and always subject to prior authorisation by a competent authority. In these cases, the authority must assess whether the deployment of such a system is necessary and proportionate to the intended purpose.

Both the AI Act and the World Economic Forum emphasise this principle of “proportionality” for face recognition systems [AI Act, 2024], [Louradour & Madzou, 2021], yet without providing clear guidance on what “proportionate use” actually means. Deciding whether to deploy RFR therefore requires balancing multiple dimensions—technical performance, societal impact and human oversight—beyond mere accuracy scores [Negri et al., 2024]. Consider, for instance, a competent authority evaluating whether to deploy an RFR system in airports screening 200 million passengers annually, where the estimated prevalence of genuine threats is roughly one in fifty million. Even with a true positive rate (TPR) and true negative rate (TNR) of 99% (equivalent to 99% sensitivity and specificity), the outcome is paradoxical: nearly all real threats would be detected (≈ 4 per year), but around two million innocent passengers would face unnecessary police interventions. Algorithmically, a 99% performance looks excellent. Operationally, it is unmanageable and counterproductive. Handling millions of false alarms would overwhelm security forces, delay operations, and—most importantly—erode public trust, as citizens repeatedly experience unjustified scrutiny and loss of confidence in authorities.
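The arithmetic behind this example can be made explicit. The short Python sketch below reproduces the back-of-the-envelope calculation under the assumptions stated above (200 million passengers per year, one genuine threat per fifty million passengers, and 99% sensitivity and specificity); the figures are illustrative, not operational data.

```python
# Back-of-the-envelope screening arithmetic for the RFR example above.
# All figures are the illustrative assumptions from the text, not real data.

passengers_per_year = 200_000_000
prevalence = 1 / 50_000_000          # one genuine threat per fifty million passengers
sensitivity = 0.99                    # true positive rate (TPR)
specificity = 0.99                    # true negative rate (TNR)

threats = passengers_per_year * prevalence            # ~4 genuine threats per year
innocents = passengers_per_year - threats

true_positives = sensitivity * threats                # threats actually flagged (~3.96)
false_negatives = threats - true_positives            # threats missed
false_positives = (1 - specificity) * innocents       # innocent passengers flagged (~2 million)

# Precision: of all alarms raised, how many concern a genuine threat?
precision = true_positives / (true_positives + false_positives)

print(f"Genuine threats per year:       {threats:.1f}")
print(f"Threats detected:               {true_positives:.2f}")
print(f"Innocent passengers flagged:    {false_positives:,.0f}")
print(f"Precision (alarms that matter): {precision:.6%}")
```

Under these assumptions, only about one alarm in every five hundred thousand concerns a genuine threat, which is why a system that looks excellent algorithmically becomes unmanageable operationally.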

Beyond accuracy, competent authorities must evaluate trade-offs between different operational, social and economic dimensions that holistically define the proportionality and viability of an AI system:

  • Operational feasibility: number of human interventions needed, false alarms to handle and system downtime.
  • Social impact: perceived fairness, legitimacy and transparency of interventions.
  • Economic cost: cost of system deployment, resources spent managing false positives versus genuine detections.
  • Human trust and cognitive load: how repeated interactions with the system affect operator confidence, vigilance and the balance between over-trust and alert fatigue.
  • Consequences of error: the cost of a missed detection versus that of an unjustified intervention.

Hence, accuracy alone cannot guarantee reliability or trustworthiness. Evaluating AI systems requires contextual and human-aware metrics that capture operational trade-offs and social implications. The goal is not only to predict well, but to perform well in the real world. This example reveals a broader truth: trustworthy AI demands evaluation methods that connect technical performance with lived experience—and this is precisely where the QoE community can make a distinctive contribution.

4. Where AI and QoE should meet: new metrics for a new era

The limitations of accuracy-based evaluation, as illustrated by the facial recognition case, point to a broader need for metrics that capture how AI systems perform in real-world, human-centred contexts [Virvou, 2023], [Park et al., 2023].

Over the past decades, the scientific communities focusing on QoE and user experience (UX) research have developed a rigorous toolbox for quantifying subjective experience—how users perceive quality, usability, reliability, control and satisfaction, as well as the pragmatic, hedonic and increasingly also eudaimonic aspects of their experiences, when interacting with complex technological systems. Originally rooted in multimedia, communication networks and human–computer interaction, these methodologies offer a mature foundation for assessing experienced quality in AI systems. QoE-based approaches can help transform general principles such as transparency, human oversight and robustness into measurable experiential dimensions that reflect how users actually understand, trust and operate AI systems in practice.

The following overview presents a set of illustrative examples of QoE-inspired metrics—adapted from long-standing practices in the field—that could be further developed and validated for the evaluation of trustworthy AI.

General AI principles and corresponding QoE-inspired metrics:

Transparency and comprehensibility
  • Perceived transparency score: % of users reporting understanding of system capabilities/limitations, potentially complemented by a measure of the gap between reported and actual understanding.
  • Explanation clarity MOS: Mean Opinion Score (MOS) on the clarity and interpretability of explanations. While traditional QoE assessment results are often reported as a MOS, additional statistical measures related to the distribution of scores in the target population are of interest, such as user diversity, uncertainty of user rating distributions, or the ratio of dissatisfied users [Hoßfeld et al., 2016].
  • Time to comprehension: average time for a non-expert to understand the meaning of a given output produced by the system.
  • Experienced interpretability: extent to which users feel that explanations meaningfully enhance their understanding of the system’s reasoning and limitations [Wehner et al., 2025].

Human oversight
  • Perceived controllability: MOS on ease of intervening or correcting system behavior.
  • Intervention success rate: % of interventions improving outcomes.
  • Trust calibration index: alignment between user confidence and actual system reliability.

Robustness and resilience to errors
  • Perceived reliability over time: longitudinal QoE measure of stability (for example, inspired by work on the longitudinal development of QoE, such as [Guse, 2016] and [Cieplinska, 2023]).
  • Graceful degradation MOS: subjective quality under stress (e.g., noise, adversarial input).
  • Error recovery satisfaction: % of users satisfied with post-failure recovery.

Experience quality (holistic)
  • Overall satisfaction MOS: overall perceived quality of interaction with the AI system and the factors influencing that experience quality (human, system and context factors, as discussed in [Reiter et al., 2014]).
  • Smoothness of use: perceived fluidity, continuity, absence of frustration.
  • Perceived usefulness and usability: e.g., adapted from widely used SUS/UMUX-Lite scales [Lewis et al., 2013].
  • Perceived response alignment: extent to which the system response aligns semantically and contextually with the prompt intent (particularly relevant for generative AI systems).
  • Cognitive load: mental effort perceived during operation (e.g., adapted NASA-TLX [Hart & Staveland, 1988]).
  • Perceived productivity impact: how users perceive the effect of AI system assistance on task efficiency and cognitive effort, reflecting findings from recent large-scale developer studies [Early-2025 AI, AI hampers Productivity].

These examples illustrate how the QoE perspective can complement traditional performance indicators such as accuracy or robustness. They extend evaluation beyond technical correctness to include how people experience, trust and manage AI systems in operational environments. It will be of particular interest to further explore and model the complex relationships between the identified QoE dimensions and the underlying system, context and human influence factors.
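As a small, concrete illustration of the distribution-aware measures listed in the overview above (e.g., reporting the ratio of dissatisfied users alongside a bare MOS, in the spirit of [Hoßfeld et al., 2016]), the following Python sketch summarises a set of ratings collected on a 5-point ACR scale. The function name, the example ratings and the threshold of 3 used to count dissatisfied users are illustrative assumptions, not normative choices.

```python
import statistics
from math import sqrt

def qoe_summary(ratings, dissatisfied_below=3):
    """Summarize opinion scores given on a 5-point ACR scale.

    Returns the MOS, an approximate 95% confidence interval half-width,
    the standard deviation of scores, and the share of users rating
    below `dissatisfied_below` (an illustrative threshold).
    """
    n = len(ratings)
    mos = statistics.mean(ratings)
    sd = statistics.stdev(ratings) if n > 1 else 0.0
    ci95 = 1.96 * sd / sqrt(n) if n > 1 else 0.0
    dissatisfied_ratio = sum(r < dissatisfied_below for r in ratings) / n
    return {
        "MOS": round(mos, 2),
        "CI95": round(ci95, 2),
        "SD": round(sd, 2),
        "% dissatisfied": round(100 * dissatisfied_ratio, 1),
    }

# Hypothetical ratings of an AI assistant's explanation clarity (1..5).
ratings = [5, 4, 4, 3, 2, 5, 4, 1, 3, 4, 5, 2]
print(qoe_summary(ratings))
```

Reporting such distributional indicators next to the MOS makes visible, for instance, whether a seemingly acceptable average hides a sizeable group of dissatisfied users.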

To better illustrate such complex relationships, it is useful to consider how technical and experiential dimensions interact dynamically in use. One particularly relevant example concerns how AI systems communicate confidence or uncertainty, and how this shapes users’ perceived trustworthiness, engagement and overall Quality of Experience.

Figure 2. Positive and negative feedback loop between confidence and QoE of AI systems.

While this is only one example among many possible human–AI interaction dynamics, it illustrates the kind of interrelation, depicted in the figure above, that is not yet fully understood. The calibration of the AI model’s confidence, and the way this confidence or uncertainty is conveyed to users, influences the perceived trustworthiness of the AI system. This in turn affects the users’ own confidence, i.e., the degree to which they trust their ability to understand, interpret and effectively interact with the system. Poor calibration can trigger a negative feedback loop of mistrust and disengagement, while well-calibrated, transparent AI fosters a positive feedback loop that enhances trust, confidence and effective human–AI collaboration. In the negative loop, overconfidence leads to low perceived trustworthiness and a strong QoE decline, while underconfidence results in only moderate perceived trustworthiness and medium QoE, ultimately lowering user engagement. In contrast, a positive feedback loop emerges when confidence is well calibrated and aligns with accuracy, or when uncertainty is expressed transparently, leading to high trust, higher QoE and stronger user engagement. User engagement and QoE are closely interrelated [Reichl et al., 2015], as higher engagement often reflects and reinforces a more positive overall experience.
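One established way to quantify the model-side "AI confidence calibration" in this loop is the Expected Calibration Error (ECE), which measures the average gap between stated confidence and observed accuracy. The sketch below is a minimal ECE implementation under common default assumptions (equal-width confidence bins); the example confidences and outcomes are hypothetical.

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=10):
    """Expected Calibration Error (ECE) with equal-width bins.

    confidences: predicted confidence per decision, in [0, 1]
    correct:     1 if the corresponding decision was right, else 0
    Returns a weighted average of |accuracy - confidence| per bin.
    """
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(bins[:-1], bins[1:]):
        in_bin = (confidences > lo) & (confidences <= hi)
        if in_bin.any():
            gap = abs(correct[in_bin].mean() - confidences[in_bin].mean())
            ece += in_bin.mean() * gap
    return ece

# Hypothetical data: an overconfident system (high confidence, mediocre accuracy).
conf = [0.95, 0.9, 0.92, 0.97, 0.85, 0.9, 0.93, 0.88]
hits = [1,    0,   1,    0,    1,    0,   1,    1]
print(f"ECE = {expected_calibration_error(conf, hits, n_bins=5):.3f}")
```

An experience-side counterpart would be the trust calibration index listed earlier, which compares users' self-reported confidence in the system with its measured reliability.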

Following this and similar examples, the bridge that now needs to be built is between the AI community’s focus on algorithmic performance and the QoE community’s expertise in human experience, bringing together two perspectives that have evolved largely in isolation, but are inherently complementary.

5. Conclusions: QoE as part of the missing link between AI systems and real-world experiences

Bridging the gap between how AI systems perform and how they are experienced is now one of the most pressing challenges in the field. The AI community has achieved extraordinary advances in model accuracy, scalability and efficiency, yet these metrics alone do not fully capture how systems behave in context—how they interact with people, support oversight or sustain trust under real operating conditions. The field of QoE, with its long tradition of measuring perceived quality, different experiential dimensions and usability, offers the conceptual and methodological tools needed to evaluate AI systems as experienced technologies, not merely as computational artefacts.

In this context, QoE of AI systems can be adapted from the original definition of QoE as proposed in [Qualinet, 2013] to read as: “The degree of delight or annoyance of a user resulting from interacting with an AI system. It results from how well the AI system fulfills the user’s expectations regarding usefulness, transparency, trustworthiness, comprehensibility, controllability, and reliability, considering the user’s goals, context, and cognitive state.” 

Collaborative research between these domains can foster new interdisciplinary methodologies, shared benchmarks and evidence-based guidelines for assessing AI systems as they are used in the real world—not just as they perform in the lab or within classical accuracy-centred benchmarks. Building this shared evaluation culture is essential to advance trustworthy, human-centric AI, ensuring that future systems are not only intelligent but also understandable, reliable and aligned with human values.

This need is becoming increasingly urgent as, in many regions such as the EU, the principles of trustworthy AI are evolving from ethical aspirations into formal regulatory requirements, reinforcing the importance of robust, experience-based evaluation frameworks.

References


MPEG Column: 152nd MPEG Meeting

The 152nd MPEG meeting took place in Geneva, Switzerland, from October 7 to October 11, 2025. The official MPEG press release can be found here. This column highlights key points from the meeting, amended with research aspects relevant to the ACM SIGMM community:

  • MPEG Systems received an Emmy® Award for the Common Media Application Format (CMAF). A separate press release regarding this achievement is available here.
  • JVET ratified new editions of VSEI, VVC, and HEVC
  • The fourth edition of Visual Volumetric Video-based Coding (V3C and V-PCC) has been finalized
  • Responses to the call for evidence on video compression with capability beyond VVC successfully evaluated

MPEG Systems received an Emmy® Award for the Common Media Application Format (CMAF)

On September 18, 2025, the National Academy of Television Arts & Sciences (NATAS) announced that the MPEG Systems Working Group (ISO/IEC JTC 1/SC 29/WG 3) had been selected as a recipient of a Technology & Engineering Emmy® Award for standardizing the Common Media Application Format (CMAF). But what is CMAF? CMAF (ISO/IEC 23000-19) is a media format standard designed to simplify and unify video streaming workflows across different delivery protocols and devices. Here’s a structured overview. Before CMAF, streaming services often had to produce multiple container formats: (i) ISO Base Media File Format (ISOBMFF) for MPEG-DASH and (ii) MPEG-2 Transport Stream (TS) for Apple HLS. This duplication resulted in additional encoding, packaging, and storage costs. I wrote a blog post about this some time ago here. CMAF’s main goal is to define a single, standardized segmented media format usable by both HLS and DASH, enabling “encode once, package once, deliver everywhere.”

The core concept of CMAF is that it is based on ISOBMFF, the foundation for MP4. Each CMAF stream consists of a CMAF header, CMAF media segments, and CMAF track files (a logical sequence of segments for one stream, e.g., video or audio). CMAF enables low-latency streaming by allowing progressive segment transfer, adopting chunked transfer encoding via CMAF chunks. CMAF defines interoperable profiles for codecs and presentation types for video, audio, and subtitles. Thanks to its compatibility with and adoption within existing streaming standards, CMAF bridges the gaps between DASH and HLS, creating a unified ecosystem.
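Because CMAF tracks are fragmented ISOBMFF files, their structure can be inspected with a few lines of code. The sketch below walks the top-level box sequence of a CMAF track file and prints it, making the header/segment/chunk layout (e.g., ftyp and moov followed by repeated moof/mdat pairs) visible; the file name example.cmfv is a placeholder, and only the standard 32-bit and 64-bit box size encodings are handled.

```python
import struct

def list_top_level_boxes(path):
    """Print the top-level ISOBMFF boxes of a (CMAF) fragmented MP4 file.

    Each box starts with a 32-bit big-endian size and a 4-character type;
    size == 1 means a 64-bit size follows, size == 0 means 'to end of file'.
    """
    with open(path, "rb") as f:
        while True:
            header = f.read(8)
            if len(header) < 8:
                break
            size, box_type = struct.unpack(">I4s", header)
            header_len = 8
            if size == 1:                       # 64-bit "largesize" follows
                size = struct.unpack(">Q", f.read(8))[0]
                header_len = 16
            print(f"{box_type.decode('ascii', 'replace'):4s}  {size:>10d} bytes")
            if size == 0:                       # box extends to end of file
                break
            f.seek(size - header_len, 1)        # skip payload, move to next box

# Hypothetical file name; a CMAF video track typically shows
# ftyp, moov, then repeated moof/mdat pairs (one pair per CMAF chunk).
list_top_level_boxes("example.cmfv")
```

Listing the moof/mdat cadence in this way is also a quick manual check of the chunking granularity that drives low-latency delivery.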

Research aspects include – but are not limited to – low-latency tuning (segment/chunk size trade-offs, HTTP/3, QUIC), Quality of Experience (QoE) impact of chunk-based adaptation, synchronization of live and interactive CMAF streams, edge-assisted CMAF caching and prediction, and interoperability testing and compliance tools.

JVET ratified new editions of VSEI, VVC, and HEVC

At its 40th meeting, the Joint Video Experts Team (JVET, ISO/IEC JTC 1/SC 29/WG 5) concluded the standardization work on the next editions of three key video coding standards, advancing them to the Final Draft International Standard (FDIS) stage. Corresponding twin-text versions have also been submitted to ITU-T for consent procedures. The finalized standards include:

  • Versatile Supplemental Enhancement Information (VSEI) — ISO/IEC 23002-7 | ITU-T Rec. H.274
  • Versatile Video Coding (VVC) — ISO/IEC 23090-3 | ITU-T Rec. H.266
  • High Efficiency Video Coding (HEVC) — ISO/IEC 23008-2 | ITU-T Rec. H.265

The primary focus of these new editions is the extension and refinement of Supplemental Enhancement Information (SEI) messages, which provide metadata and auxiliary data to support advanced processing, interpretation, and quality management of coded video streams.

The updated VSEI specification introduces both new and refined SEI message types supporting advanced use cases:

  • AI-driven processing: Extensions for neural-network-based post-filtering and film grain synthesis offer standardized signalling for machine learning components in decoding and rendering pipelines.
  • Semantic and multimodal content: New SEI messages describe infrared, X-ray, and other modality indicators, region packing, and object mask encoding, creating interoperability points for multimodal fusion and object-aware compression research.
  • Pipeline optimization: Messages defining processing order and post-processing nesting support research on joint encoder-decoder optimization and edge-cloud coordination in streaming architectures.
  • Authenticity and generative media: A new set of messages supports digital signature embedding and generative-AI-based face encoding, raising questions for the SIGMM community about trust, authenticity, and ethical AI in media pipelines.
  • Metadata and interpretability: New SEIs for text description, image format metadata, and AI usage restriction requests could facilitate research into explainable media, human-AI interaction, and regulatory compliance in multimedia systems.

All VSEI features are fully compatible with the new VVC edition, and most are also supported in HEVC. The new HEVC edition further refines its multi-view profiles, enabling more robust 3D and immersive video use cases.

Research aspects of these new editions can be summarized as follows: (i) They define new standardized interfaces between neural post-processing and conventional video coding, fostering reproducible and interoperable research on learned enhancement models. (ii) They encourage exploration of metadata-driven adaptation and QoE optimization using SEI-based signals in streaming systems. (iii) They open possibilities for cross-layer system research, connecting compression, transport, and AI-based decision layers. (iv) They introduce a formal foundation for authenticity verification, content provenance, and AI-generated media signalling, relevant to current debates on trustworthy multimedia.
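To make point (ii) above more tangible: SEI messages travel in dedicated NAL units of the coded bitstream, so even a crude scan can reveal how much side information a stream carries. The sketch below counts prefix and suffix SEI NAL units in an HEVC Annex-B elementary stream; it assumes HEVC NAL unit syntax (6-bit nal_unit_type in the first byte after the start code, with types 39 and 40 for prefix/suffix SEI), uses a hypothetical file name, and deliberately stops short of parsing the SEI payloads themselves.

```python
def count_sei_nal_units(path):
    """Count prefix/suffix SEI NAL units in an HEVC Annex-B bitstream.

    Assumes HEVC (H.265) NAL unit syntax: the 6-bit nal_unit_type sits in
    bits 1..6 of the first byte after the start code; types 39 and 40 are
    PREFIX_SEI_NUT and SUFFIX_SEI_NUT respectively.
    """
    data = open(path, "rb").read()
    counts = {"prefix_sei": 0, "suffix_sei": 0, "other": 0}
    i = 0
    while i < len(data) - 3:
        # Look for the next 3-byte start code (a 4-byte start code ends the same way).
        if data[i:i + 3] == b"\x00\x00\x01":
            nal_type = (data[i + 3] >> 1) & 0x3F
            if nal_type == 39:
                counts["prefix_sei"] += 1
            elif nal_type == 40:
                counts["suffix_sei"] += 1
            else:
                counts["other"] += 1
            i += 3
        else:
            i += 1
    return counts

# Hypothetical bitstream file; SEI density is one crude proxy for how much
# side information (e.g., post-filter or provenance metadata) a stream carries.
print(count_sei_nal_units("stream.hevc"))
```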

These updates highlight how ongoing MPEG/ITU standardization is evolving toward a more AI-aware, multimodal, and semantically rich media ecosystem, providing fertile ground for experimental and applied research in multimedia systems, coding, and intelligent media delivery.

The fourth edition of Visual Volumetric Video-based Coding (V3C and V-PCC) has been finalized

MPEG Coding of 3D Graphics and Haptics (ISO/IEC JTC 1/SC 29/WG7) has advanced MPEG-I Part 5 – Visual Volumetric Video-based Coding (V3C and V-PCC) to the Final Draft International Standard (FDIS) stage, marking its fourth edition. This revision introduces major updates to the Video-based Coding of Volumetric Content (V3C) framework, particularly enabling support for an additional bitstream instance: V-DMC (Video-based Dynamic Mesh Compression).

Previously, V3C served as the structural foundation for V-PCC (Video-based Point Cloud Compression) and MIV (MPEG Immersive Video). The new edition extends this flexibility by allowing V-DMC integration, reinforcing V3C as a generic, extensible framework for volumetric and 3D video coding. All instances follow a shared principle, i.e., using conventional 2D video codecs (e.g., HEVC, VVC) for projection-based compression, complemented by specialized tools for mapping, geometry, and metadata handling.

While V-PCC remains co-specified within Part 5, MIV (Part 12) and V-DMC (Part 29) are standardized separately. The progression to FDIS confirms the technical maturity and architectural stability of the framework.

This evolution opens new research directions as follows: (i) Unified 3D content representation, enabling comparative evaluation of point cloud, mesh, and view-based methods under one coding architecture. (ii) Efficient use of 2D codecs for 3D media, raising questions on mapping optimization, distortion modeling, and geometry-texture compression. (iii) Dynamic and interactive volumetric streaming, relevant to AR/VR, telepresence, and immersive communication research.

The fourth edition of MPEG-I Part 5 thus positions V3C as a cornerstone for future volumetric, AI-assisted, and immersive video systems, bridging standardization and cutting-edge multimedia research.

Responses to the call for evidence on video compression with capability beyond VVC successfully evaluated

The Joint Video Experts Team (JVET, ISO/IEC JTC 1/SC 29/WG 5) has completed the evaluation of submissions to its Call for Evidence (CfE) on video compression with capability beyond VVC. The CfE investigated coding technologies that may surpass the performance of the current Versatile Video Coding (VVC) standard in compression efficiency, computational complexity, and extended functionality.

A total of five submissions were assessed, complemented by ECM16 reference encodings and VTM anchor sequences with multiple runtime variants. The evaluation addressed both compression capability and encoding runtime, as well as low-latency and error-resilience features. All technologies were derived from VTM, ECM, or NNVC frameworks, featuring modified encoder configurations and coding tools rather than entirely new architectures.
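For the objective side of such comparisons, JVET results are customarily summarised as Bjøntegaard Delta rate (BD-rate): the average bitrate difference between a test codec and an anchor at equal quality, obtained by interpolating their rate-distortion curves. The sketch below is a minimal cubic-polynomial variant of that calculation with made-up rate/PSNR points; it illustrates the metric, not the exact tooling used in the CfE.

```python
import numpy as np

def bd_rate(rate_anchor, psnr_anchor, rate_test, psnr_test):
    """Bjoentegaard Delta rate (%) of 'test' relative to 'anchor'.

    Fits a cubic polynomial to each rate-distortion curve (log10 bitrate
    as a function of PSNR) and integrates the gap over the overlapping
    PSNR range. Negative values mean the test codec saves bitrate.
    """
    lr_a, lr_t = np.log10(rate_anchor), np.log10(rate_test)
    p_a = np.polyfit(psnr_anchor, lr_a, 3)
    p_t = np.polyfit(psnr_test, lr_t, 3)

    lo = max(min(psnr_anchor), min(psnr_test))
    hi = min(max(psnr_anchor), max(psnr_test))

    int_a = np.polyval(np.polyint(p_a), hi) - np.polyval(np.polyint(p_a), lo)
    int_t = np.polyval(np.polyint(p_t), hi) - np.polyval(np.polyint(p_t), lo)

    avg_diff = (int_t - int_a) / (hi - lo)
    return (10 ** avg_diff - 1) * 100

# Illustrative rate (kbps) / PSNR (dB) points, not real CfE data.
anchor_rate = [1000, 1800, 3200, 6000]
anchor_psnr = [34.0, 36.0, 38.0, 40.0]
test_rate   = [850, 1500, 2700, 5100]
test_psnr   = [34.1, 36.1, 38.0, 40.2]
print(f"BD-rate: {bd_rate(anchor_rate, anchor_psnr, test_rate, test_psnr):.1f}%")
```

A negative BD-rate indicates bitrate savings at the same objective quality; subjective comparisons, as in this CfE, additionally rely on confidence intervals over viewer scores.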

Key Findings

  • In the compression capability test, 76 out of 120 test cases showed at least one submission with a non-overlapping confidence interval compared to the VTM anchor. Several methods outperformed ECM16 in visual quality and achieved notable compression gains at lower complexity. Neural-network-based approaches demonstrated clear perceptual improvements, particularly for 8K HDR content, while gains were smaller for gaming scenarios.
  • In the encoding runtime test, significant improvements were observed even under strict complexity constraints: 37 of 60 test points (at both 1× and 0.2× runtime) showed statistically significant benefits over VTM. Some submissions achieved faster encoding than VTM, with only a 35% increase in decoder runtime.

Research Relevance and Outlook

The CfE results illustrate a maturing convergence between model-based and data-driven video coding, raising research questions highly relevant for the ACM SIGMM community:

  • How can learned prediction and filtering networks be integrated into standard codecs while preserving interoperability and runtime control?
  • What methodologies can best evaluate perceptual quality beyond PSNR, especially for HDR and immersive content?
  • How can complexity-quality trade-offs be optimized for diverse hardware and latency requirements?

Building on these outcomes, JVET is preparing a Call for Proposals (CfP) for the next-generation video coding standard, with a draft planned for early 2026 and evaluation through 2027. Upcoming activities include refining test material, adding Reference Picture Resampling (RPR), and forming a new ad hoc group on hardware implementation complexity.

For multimedia researchers, this CfE marks a pivotal step toward AI-assisted, complexity-adaptive, and perceptually optimized compression systems, which are considered a key frontier where codec standardization meets intelligent multimedia research.

The 153rd MPEG meeting will be held online from January 19 to January 23, 2026. Click here for more information about MPEG meetings and their developments.

JPEG Column: 108th JPEG Meeting in Daejeon, Republic of Korea

JPEG XE reaches Committee Draft stage at the 108th JPEG meeting

The 108th JPEG meeting was held in Daejeon, Republic of Korea, from 29 June to 4 July 2025.

During this meeting, the JPEG Committee finalised the Committee Draft of JPEG XE, an upcoming International Standard for lossless coding of visual events, which has been sent for consultation to ISO/IEC JTC1/SC29 national bodies. JPEG XE will be the first International Standard developed for the lossless representation and coding of visual events, and is being developed under the auspices of ISO, IEC, and ITU.

Furthermore, the JPEG Committee was informed that the prestigious Joseph von Fraunhofer Prize 2025 was awarded to three JPEG Committee members, Prof. Siegfried Fößel, Dr. Joachim Keinert and Dr. Thomas Richter, for their contributions to the development of the JPEG XS standard. The JPEG XS standard specifies a compression technology with very low latency, low implementation complexity and very precise bit-rate control. A presentation video can be accessed here.

108th JPEG Meeting in Daejeon, Rep. of Korea.

The following sections summarise the main highlights of the 108th JPEG meeting:

  • JPEG XE Committee Draft sent for consultation
  • JPEG Trust second edition aligns with C2PA
  • JPEG AI Parts 2, 3 and 5 proceed to publication as IS
  • JPEG DNA reaches DIS stage
  • JPEG AIC on Objective Image Quality Assessment
  • JPEG Pleno Learning-based Point Cloud Coding proceeds to publication as IS
  • JPEG XS Part 1 Amendment 1 proceeds to DIS stage
  • JPEG RF explores 3DGS coding and quality evaluation

JPEG XE

At the 108th JPEG Meeting, the Committee Draft of the first International Standard for lossless coding of events was issued and sent to ISO/IEC JTC1/SC29 national bodies for consultation. JPEG XE is being developed under the auspices of ISO/IEC and ITU-T and aims to establish a robust and interoperable format for efficient representation and coding of events in the context of machine vision and related applications. By reaching the Committee Draft stage, the JPEG Committee has attained a very important milestone. The Committee Draft was produced based on the five responses received to a Call for Proposals issued after the 104th JPEG Meeting held in July 2024. Two of these submissions meet the requirements for constrained lossless coding of events and allow the implementation and operation of the coding model with limited resources, power, and complexity. The remaining three responses address the unconstrained coding mode and will be considered in a second phase of standardisation.

JPEG XE is the fruit of a joint effort between ISO/IEC JTC1/SC29/WG1 and ITU-T SG21, which is expected to result in a widely supported standard, improving compatibility and interoperability across applications, products, and services. Additionally, the JPEG Committee is in contact with the MIPI Alliance with the intention of developing a cross-compatible coding mode, allowing MIPI ESP signals to be decoded effectively by JPEG XE decoders.

The JPEG Committee remains committed to the development of a comprehensive and industry-aligned standard that meets the growing demand for event-based vision technologies. The collaborative approach between multiple standardisation organisations underscores a shared vision for a unified, international standard to accelerate innovation and interoperability in this emerging field.

JPEG Trust

JPEG Trust completed its second edition of JPEG Trust Part 1: Core Foundation, which brings JPEG Trust into alignment with the updated C2PA specification 2.1 and integrates aspects of Intellectual Property Rights (IPR). This second edition is now approved as a Draft International Standard for submission to ISO/IEC balloting, with an expected completion timeframe at the end of 2025.

Showcasing the adoption of JPEG Trust technology, JPEG Trust Part 4 – Reference software has now reached the Committee Draft stage.

Work continues on JPEG Trust Part 2: Trust profiles catalogue, a repository of Trust Profile and reporting snippets designed to assist implementers in constructing their Trust Profiles and Trust Reports, as well as JPEG Trust Part 3: Media asset watermarking.

JPEG AI

During the 108th JPEG meeting, JPEG AI Parts 2, 3, and 5 received positive DIS ballot results with only editorial comments, allowing them to proceed to publication as International Standards. These parts extend Part 1 by specifying stream and decoder profiles, reference software with usage documentation, and file format embedding for container formats such as ISOBMFF and HEIF.

The results from two Core Experiments were reviewed. The first evaluated gain map-based HDR coding, comparing it to simulcast methods and HEIC, while the second focused on implementing JPEG AI on smartphones using ONNX. Progressive decoding performance was assessed under channel truncation, and adaptive selection techniques were proposed to mitigate losses. Subjective and objective evaluations confirmed JPEG AI’s strong performance, often surpassing codecs such as VVC Intra, AVIF, JPEG XL, and performing comparably to ECM in informal viewing tests.

Another contribution explored compressed-domain image classification using latent representations, demonstrating competitive accuracy across bitrates. A proposal to limit tile splits in JPEG AI Part 2 was also discussed, and experiments identified Model 2 as the most robust and efficient default model for the levels that permit only one model at the decoder side.

JPEG DNA

During the 108th JPEG meeting, the JPEG Committee produced a study DIS text of JPEG DNA Part 1 (ISO/IEC 25508-1). The purpose of this text is to synchronise the current version of the Verification Model with the changes made to the Committee Draft document, reflecting the comments received from the consultation. The DIS balloting of Part 1 is scheduled to take place after the next JPEG meeting, starting in October 2025.

The JPEG Committee is also planning wet-lab experiments to validate that the current specification of the JPEG DNA satisfies the conditions required for applications using the current state of the art in DNA synthesis and sequencing, such as biochemical constraints, decodability, coverage rate, and the impact of error-correcting code on compression performance.

The goal still remains to reach International Standard (IS) status for Part 1 during 2026.

JPEG AIC

Part 4 of JPEG AIC deals with objective quality metrics for fine-grained assessment of high-fidelity compressed images. As of the 108th JPEG Meeting, the Call for Proposals on Objective Image Quality Assessment (JPEG AIC-4), which was launched in April 2025, has already resulted in four non-mandatory registrations of interest that were reviewed. In this JPEG meeting, the technical details regarding the evaluation of proposed metrics and of the anchor metrics were developed and finalised. The results have been integrated in the document “Common Test Conditions on Objective Image Quality Assessment v2.0”, available on the JPEG website. Moreover, the procedures to generate the evaluation image dataset were defined and will be carried out by JPEG experts. The responses to the Call for Proposals for JPEG AIC-4 are expected in September 2025, together with their application for the evaluation dataset, with the goal of creating a Working Draft of a new standard on objective quality assessment of high-fidelity images by April 2026.

JPEG Pleno

At the 108th JPEG meeting, significant progress was reported in the ongoing JPEG Pleno Quality Assessment activity for light fields. A Call for Proposals (CfP) on objective quality metrics for light fields is currently underway, with submissions to be evaluated using a new evaluation dataset. The JPEG Committee is also preparing the DIS of ISO/IEC 21794-7, which defines a standard for subjective quality assessment methodologies of light fields.

During the 108th JPEG meeting, the 2nd edition of ISO/IEC 21794-2 (“Plenoptic image coding system (JPEG Pleno) Part 2: Light field coding”) advanced to the Draft International Standard (DIS) stage. This 2nd edition includes the specification of a third coding mode entitled Slanted 4D Transform Mode and its associated profile.

The 108th JPEG meeting also saw the successful completion of the Final Draft International Standard balloting and the impending publication of ISO/IEC 21794-6: Learning-based Point Cloud Coding. This is the world’s first international standard on learning-based point cloud coding. The publication of Part 6 of ISO/IEC 21794 is a crucial and notable milestone in the representation of point clouds. The publication of the International Standard is expected to take place during the second half of 2025.

JPEG XS

The JPEG Committee advanced Amendment 1 of JPEG XS Part 1 to the DIS stage; it allows the embedding of sub-frame metadata into JPEG XS streams, as required by augmented and virtual reality applications currently discussed within VESA. Part 5 3rd edition, which is the reference software of JPEG XS, was also approved for publication as an International Standard.

JPEG RF

During the 108th JPEG meeting, the JPEG Radiance Fields exploration advanced its work on discussing the procedures for reliable evaluation of potential proposals in the future, with a particular focus on refining subjective evaluation protocols. A key outcome was the initiation of Exploration Study 5, aimed at investigating how different test camera trajectories influence human perception during subjective quality assessment. The Common Test Conditions (CTC) document was also reviewed, with the subjective testing component remaining provisional pending the outcome of this exploration study. In addition, existing use cases and requirements for JPEG RF were re-examined, setting the stage for the development of revised drafts of both the Use Cases and Requirements document and the CTC. New mandates include conducting Exploration Study 5, revising documents, and expanding stakeholder engagement.

Final Quote

“The release of the Committee Draft of JPEG XE standard for lossless coding of events at the 108th JPEG meeting is an impressive achievement and will accelerate deployment of products and applications relying on visual events.” said Prof. Touradj Ebrahimi, the Convenor of the JPEG Committee.

Students Report from ACM MMsys 2025

The 16th ACM Multimedia Systems Conference (with the associated workshops NOSSDAV 2025 and MMVE 2025) was held from March 31st to April 4th 2025, in Stellenbosch, South Africa. By choosing this location, the steering committee marked a milestone for SIGMM: MMSys became the very first SIGMM conference to take place on the African continent. This perfectly aligns with SIGMM’s ongoing mission to build an inclusive and globally representative multimedia‑systems community.

The MMSys conference brings together researchers in multimedia systems to showcase and exchange their cutting-edge research findings. Once again, there were technical talks spanning various multimedia domains and inspiring keynote presentations.

Recognising the importance of in‑person exchange—especially for early‑career researchers—SIGMM once again funded Student Travel Grants. This support enabled a group of doctoral students to attend the conference, present their work and start building their international peer networks.
In this column, the recipients of the travel grants share their experiences at MMSys 2025.

Guodong Chen – PhD student, Northeastern University, USA 

What an incredible experience attending ACM MMSys 2025 in South Africa! Huge thanks to SIGMM for the travel grant that made this possible. 

It was an honour to present our paper, “TVMC: Time-Varying Mesh Compression Using Volume-Tracked Reference Meshes”, and I’m so happy that it received the Best Reproducible Paper Award! 

MMSys is not that huge, but it’s truly great. It’s exceptionally well-organized, and what impressed me the most was the openness and enthusiasm of the community. Everyone is eager to communicate, exchange ideas, and dive deep into cutting-edge multimedia systems research. I made many new friends and discovered exciting overlaps between my research and the work of other groups. I believe many collaborations are on the way and that, to me, is the true mark of a successful conference. 

Besides the conference, South Africa was amazing, don’t miss the wonderful wines of Stellenbosch and the unforgettable experience of a safari tour. 

Lea Brzica – PhD student, University of Zagreb, Croatia

Attending MMSys’25 in Stellenbosch, South Africa was an unforgettable and inspiring experience. As a new PhD student and early-career researcher, this was not only my first in-person conference but also my first time presenting. I was honoured to share my work, “Analysis of User Experience and Task Performance in a Multi-User Cross-Reality Virtual Object Manipulation Task,” and excited to see genuine interest from other attendees.
Beyond the workshop and technical sessions, I thoroughly enjoyed the keynotes and panel discussions. The poster sessions and demos were great opportunities to explore new ideas and engage with people from all over the world.
One of the most meaningful aspects of the conference was the opportunity to meet fellow PhD students and researchers face-to-face. The coffee breaks and social activities created a welcoming atmosphere that made it easy to form new connections.

I am truly grateful to SIGMM for supporting my participation. The travel grant helped alleviate the financial burden of international travel and made this experience possible. I’m already hoping for the chance to come back and be part of it all over again!

Jérémy Ouellette – PhD student, Concordia University, Canada

My time at MMSys 2025 was an incredibly rewarding experience. It was great meeting so many interesting and passionate people in the field, and the reception was both enthusiastic and exceptionally well organized. I want to sincerely thank SIGMM for the travel grant, as their support made it possible for me to attend and present my work. South Africa was an amazing destination, and the entire experience was both professionally and personally unforgettable. MMSys was also the perfect environment for networking, offering countless opportunities to connect with researchers and industry experts. It was truly exciting to see so much interest in my work and to engage in meaningful conversations with others in the multimedia systems community.

JPEG Column: 107th JPEG Meeting in Brussels, Belgium

JPEG assesses responses to its Call for Proposals on Lossless Coding of Visual Events

The 107th JPEG meeting was held in Brussels, Belgium, from April 12 to 18, 2025. During this meeting, the JPEG Committee assessed the responses to its call for proposals on JPEG XE, an International Standard for lossless coding of visual events. JPEG XE is being developed under the auspices of three major standardisation organisations: ISO, IEC, and ITU. It will be the first codec developed by the JPEG committee targeting lossless representation and coding of visual events.

The JPEG Committee is also working on various standardisation projects, such as JPEG AI, which uses learning technology to achieve high compression, JPEG Trust, which sets standards to combat fake media and misinformation while rebuilding trust in multimedia, and JPEG DNA, which represents digital images using DNA sequences for long-term storage.

The following sections summarise the main highlights of the 107th JPEG meeting:

  • JPEG XE
  • JPEG AI
  • JPEG Trust
  • JPEG AIC
  • JPEG Pleno
  • JPEG DNA
  • JPEG XS
  • JPEG RF

JPEG XE

This initiative focuses on a new imaging modality produced by event-based visual sensors. This effort aims to establish a standard that efficiently represents and codes events, thereby enhancing interoperability in sensing, storage, and processing for machine vision and related applications.

As a response to the JPEG XE Final Call for Proposals on lossless coding of events, the JPEG Committee received five innovative proposals for consideration. Their evaluation indicated that two among them meet the stringent requirements of the constrained case, where resources, power, and complexity are severely limited. The remaining three proposals can cater to the unconstrained case. During the 107th JPEG meeting, the JPEG Committee launched a series of Core Experiments to define a path forward based on the received proposals as a starting point for the development of the JPEG XE standard.

To streamline the standardisation process, the JPEG Committee will proceed with the JPEG XE initiative in three distinct phases. Phase 1 will concentrate on lossless coding for the constrained case, while Phase 2 will address the unconstrained case. Both phases will commence simultaneously, although Phase 1 will follow a faster timeline to enable a timely publication of the first edition of the standard. The JPEG Committee recognises the urgent industry demand for a standardised solution for the constrained case, aiming to produce a Committee Draft by as early as July 2025. The third phase will focus on lossy compression of event sequences. The discussions and preparations will be initiated soon.

In a significant collaborative effort between ISO/IEC JTC 1/SC 29/WG1 and ITU-T SG21, the JPEG Committee will proceed to specify a joint JPEG XE standard. This partnership will ensure that JPEG XE becomes a shared standard under ISO, IEC, and ITU-T, reflecting their mutual commitment to developing standards for event-based systems.

Additionally, the JPEG Committee is actively discussing lossy coding of visual events and exploring future evaluation methods for such advanced technologies. Stakeholders interested in JPEG XE are encouraged to access the public documents available at jpeg.org. Moreover, a joint Ad-hoc Group on event-based vision has been formed between ITU-T Q7/21 and ISO/IEC JTC1 SC29/WG1, paving the way for continued collaboration leading up to the 108th JPEG meeting.

JPEG AI

At the 107th JPEG meeting, JPEG AI discussions focused around conformance (JPEG AI Part 4), which has now advanced to the Draft International Standard (DIS) stage. The specification defines three conformance points — namely, the decoded residual tensor, the decoded latent space tensor (also referred to as feature space), and the decoded image. Strict conformance for the residual tensor is evaluated immediately after entropy decoding, while soft conformance for the latent space tensor is assessed after tensor decoding. The decoded image conformance is measured after converting the image to the output picture format, but before any post-processing filters are applied. Regarding the decoded image, two types have been defined: conformance Type A, which implies low tolerance, and conformance Type B, which allows for moderate tolerance.

During the 107th JPEG meeting, the results of several subjective quality assessment experiments were also presented and discussed, using different methodologies and for different test conditions, from low to very high qualities, including both SDR and HDR images. The results of these evaluations have shown that JPEG AI is highly competitive and, in many cases, outperforms existing state-of-the-art codecs such as VVC Intra, AVIF, and JPEG XL. A demonstration of a JPEG AI encoder running on a Huawei Mate50 Pro smartphone with a Qualcomm Snapdragon 8+ Gen1 chipset was also presented. This implementation supports tiling, high-resolution (4K) images, and the base profile with level 20. Finally, the implementation status of all mandatory and desirable JPEG AI requirements was discussed, assessing whether each requirement had been fully met, partially addressed, or remained unaddressed. This helped to clarify the current maturity of the standard and identify areas for further refinements.

JPEG Trust

Building on the publication of JPEG Trust (ISO/IEC 21617) Part 1 – Core Foundation in January 2025, the JPEG Committee approved a Draft International Standard (DIS) for a 2nd edition of Part 1 – Core Foundation during the 107th JPEG meeting. This Part 1 – Core Foundation 2nd edition incorporates the signalling of identity and intellectual property rights to address three particular challenges:

  • achieving transparency, through the signaling of content provenance
  • identifying content that has been generated either by humans, machines or AI systems, and
  • enabling interoperability, for example, by standardising machine-readable terms of use of intellectual property, especially AI-related rights reservations.

Additionally, the JPEG Committee is currently developing Part 2 – Trust Profiles Catalogue. Part 2 provides a catalogue of trust profile snippets that can be used either on their own or in combination for the purpose of constructing trust profiles, which can then be used for assessing the trustworthiness of media assets in given usage scenarios. The Trust Profiles Catalogue also defines a collection of conformance points, which enables interoperability across usage scenarios through the use of associated trust profiles.

The Committee continues to develop JPEG Trust Part 3 – Media asset watermarking to build out additional requirements for identified use cases, including the emerging need to identify AIGC content.

Finally, during the 107th meeting, the JPEG Committee initiated a Part 4 – Reference software, which will provide reference implementations of JPEG Trust that implementers can refer to when developing trust solutions based on the JPEG Trust framework.

JPEG AIC

The JPEG AIC Part 3 standard (ISO/IEC CD 29170-3) has received a revised title: “Information technology — JPEG AIC Assessment of image coding — Part 3: Subjective quality assessment of high-fidelity images”. At the 107th JPEG meeting, the results of the last Core Experiments for the standard and the comments on the Committee Draft of the standard were addressed. The draft text was thoroughly revised and clarified, and has now advanced to the Draft International Standard (DIS) stage.

Furthermore, Part 4 of JPEG AIC deals with objective quality metrics, also of high-fidelity images, and at the 107th JPEG meeting, the technical details regarding anchor metrics as well as the testing and evaluation of proposed methods were discussed and finalised. The results have been compiled in the document “Common Test Conditions on Objective Image Quality Assessment”, available on the JPEG website. Moreover, the corresponding Final Call for Proposals on Objective Image Quality Assessment (AIC-4) has been issued. Proposals are expected at the end of Summer 2025. The first Working Draft for Objective Image Quality Assessment (AIC-4) is planned for April 2026.

JPEG Pleno

The JPEG Pleno Light Field activity discussed the Disposition of Comments Report (DoCR) for the submitted Committee Draft (CD) of the 2nd edition of ISO/IEC 21794-2 (“Plenoptic image coding system (JPEG Pleno) Part 2: Light field coding”). This 2nd edition integrates AMD1 of ISO/IEC 21794-2 (“Profiles and levels for JPEG Pleno Light Field Coding”) and includes the specification of a third coding mode entitled Slanted 4D Transform Mode and its associated profile. It is expected that at the 108th JPEG meeting this new edition will advance to the Draft International Standard (DIS) stage.

Software tools have been created and tested for inclusion as Common Test Condition tools in a reference software implementation of the standardized technologies within the JPEG Pleno framework, including JPEG Pleno Part 2 (ISO/IEC 21794-2).

In the framework of the ongoing standardisation effort on quality assessment methodologies for light fields, significant progress was achieved during the 107th JPEG meeting. The JPEG Committee finalised the Committee Draft (CD) of the forthcoming standard ISO/IEC 21794-7 entitled JPEG Pleno Quality Assessment – Light Fields, representing an important step toward the establishment of reliable tools for evaluating the perceptual quality of light fields. This CD incorporates recent refinements to the subjective light field assessment framework and integrates insights from the latest core experiments.

The Committee also approved the Final Call for Proposals (CfP) on Objective Metrics for JPEG Pleno Quality Assessment – Light Fields. This initiative invites proposals of novel objective metrics capable of accurately predicting perceived quality of compressed light field content. The detailed submission timeline and required proposal components are outlined in the released final CfP document. To support this process, updated versions of the Use Cases and Requirements (v6.0) and Common Test Conditions (v2.0) related to this CfP were reviewed and made available. Moreover, several task forces have been established to address key proposal elements, including dataset preparation, codec configuration, objective metric evaluation, and the subjective experiments.

At this meeting, ISO/IEC 21794-6 (“Plenoptic image coding system (JPEG Pleno) Part 6: Learning-based point cloud coding”) progressed to the balloting of the Final Draft International Standard (FDIS) stage. Balloting will end on the 12th of June 2025 with the publication of the International Standard expected for August 2025.

The JPEG Committee held a workshop on Future Challenges in Compression of Holograms for XR Applications on April 16th, covering major applications from holographic cameras to holographic displays. A second workshop, on Future Challenges in Compression of Holograms for Metrology Applications, is planned for July.

JPEG DNA

The JPEG Committee continues to develop JPEG DNA, an ambitious initiative to standardize the representation of digital images using DNA sequences for long-term storage. Following a Call for Proposals launched at its 99th JPEG meeting, a Verification Model was established during the 102nd JPEG meeting, then refined through core experiments that led to the first Working Draft at the 103rd JPEG meeting.

New JPEG DNA logo.

At its 105th JPEG meeting, JPEG DNA was officially approved as a new ISO/IEC project (ISO/IEC 25508), structured into four parts: Core Coding System, Profiles and Levels, Reference Software, and Conformance. The Committee Draft (CD) of Part 1 was produced at the 106th JPEG meeting.

During the 107th JPEG meeting, the JPEG Committee reviewed the comments received on the CD of the JPEG DNA standard and prepared a Disposition of Comments Report (DoCR). The goal remains to reach International Standard (IS) status for Part 1 by April 2026.

On this occasion, the official JPEG DNA logo was also unveiled, marking a new milestone in the visibility and identity of the project.

JPEG XS

The development of the third edition of the JPEG XS standard is nearing its final stages, marking significant progress for the standardisation of high-performance video coding. Notably, Part 4, focusing on conformance testing, has been officially accepted by ISO and IEC for publication. Meanwhile, Part 5, which provides reference software, is presently at Draft International Standard (DIS) ballot stage.

In a move that underscores the commitment to accessibility and innovation in media technology, both Part 4 and Part 5 will be made publicly available as free standards. This decision is expected to facilitate widespread adoption and integration of JPEG XS in relevant industries and applications.

Looking to the future, the JPEG Committee is exploring enhancements to the JPEG XS standard, particularly support for a master-proxy stream feature. This feature enables a high-fidelity master video stream to be accompanied by a lower-resolution proxy stream with minimal overhead. Such functionality is crucial for optimising broadcast and content production workflows.

JPEG RF

The JPEG RF activity issued the proceedings of the Joint JPEG/MPEG Workshop on Radiance Fields, which was held on the 31st of January and featured world-renowned speakers discussing Neural Radiance Fields (NeRF) and 3D Gaussian Splatting (3DGS) from the perspectives of academia, industry, and standardisation groups. Video recordings and all related material were made publicly available on the JPEG website. Moreover, an improved version of the JPEG RF State of the Art and Challenges document was proposed, including an updated review of coding techniques for radiance fields as well as newly identified use cases and requirements. The group also defined an exploration study to investigate protocols for subjective and objective quality assessment, which are considered crucial to advance this activity towards a coding standard for radiance fields.

Final Quote

“A cost-effective and interoperable event-based vision ecosystem requires an efficient coding standard. The JPEG Committee embraces this new challenge by initiating a new standardisation project to achieve this objective.” said Prof. Touradj Ebrahimi, the Convenor of the JPEG Committee.

Report from ACM SIG Heritage Workshop

“What does history mean to computer scientists?” – that was the first question that popped up in my mind when I was about to attend the ACM Heritage Workshop in Minneapolis a few months back. And needless to say, the follow-up question was “what does history mean for a multimedia systems researcher?” As a young graduate student, I had the joy of my life when my first research paper on multimedia authoring (a hot topic in those days) was accepted for presentation at the first ACM Multimedia in 1993, and that conference was held alongside SIGGRAPH. Thinking about that, multimedia systems research has about 25 to 30 years of history. But what a flow of topics this area has seen: from authoring to streaming to content-based retrieval to social media and human-centered multimedia, the research area has been as hot as ever. So, is it the history of the research topics, of the researchers, or of both? Then, how about the venues hosting these conferences, the networking events, or the grueling TPC meetings that prepped the conference actions?

Figure 1. Picture from the venue

With only questions and no clear answers, I decided to attend the workshop with an open mind. Most SIGs (Special Interest Groups) in ACM had representation at this workshop, which was organized by the ACM History Committee. I learned that this committee, beyond running the workshop, organizes several efforts to track, record, and preserve computing history across disciplines. This includes identifying distinguished persons (who are retired but made significant contributions to computing), preparing a customized questionnaire for each person, training the interviewer, recording the conversations, and curating, archiving, and providing them for public consumption. Efforts at most SIGs were largely centred on their websites: they talked about how they try to preserve conference materials such as paper proceedings (from the days when only paper proceedings were published), meeting notes, pictures, and videos. For instance, some SIGs described how they tracked down and preserved ACM’s approval letter for the SIG!

It was very interesting – and touching – to see some attendees (senior professors) coming to the workshop with boxes of materials – papers, reports, books, etc. They were either downsizing or clearing out their offices, and did not feel like throwing the material into recycling bins! These materials were given to ACM and the Babbage Institute (at the University of Minnesota, Minneapolis) for possible curation and storage.

Figure 2. Galleries with collected material

ACM History Committee members talked about how they can fund (at a small level) projects that target specific activities for preserving and archiving computing events and materials. The committee agreed that ACM should take more responsibility in providing technical support for web hosting – though, obviously, it remains to be seen whether anything tangible will result.

Over the two days at the workshop, I started getting answers to my questions. History can mean pictures and videos taken at earlier MM conferences, TPC meetings, and SIGMM-sponsored events and retreats. Perhaps the earlier paper proceedings that contain information not found in the corresponding ACM Digital Library versions. Interviews with the different research leaders who built and promoted SIGMM.

It was clear that history means different things to different SIGs, and as a SIGMM community, we will have to arrive at our own interpretation, then collect and preserve accordingly. And that made me understand the most obvious and perhaps most important thing: today’s events become tomorrow’s history! No brainer, right? Preserving today’s SIGMM events will give us a richer, more colorful, and more complete SIGMM history for future generations!

For the curious ones:

ACM Heritage Workshop website is at: https://acmsigheritage.dash.umn.edu

Some of the workshop presentation materials are available at: https://acmsigheritage.dash.umn.edu/uncategorized/class-material-posted/

Socially significant music events

Social media sharing platforms (e.g., YouTube, Flickr, Instagram, and SoundCloud) have revolutionized how users access multimedia content online. Most of these platforms provide a variety of ways for the user to interact with different types of media: images, video, and music. In addition to watching or listening to the media content, users can also engage with it in other ways, e.g., like, share, tag, or comment. Social media sharing platforms have become an important resource for scientific researchers who aim to develop new indexing and retrieval algorithms that can improve users’ access to multimedia content, thereby enhancing the experience these platforms provide.

Historically, the multimedia research community has focused on developing multimedia analysis algorithms that combine visual and text modalities. Less visible is research devoted to algorithms that exploit the audio signal as the main modality. Recently, awareness of the importance of audio has experienced a resurgence. Particularly notable is Google’s release of AudioSet, “A large-scale dataset of manually annotated audio events” [7]. In a similar spirit, we have developed the “Socially Significant Music Event” dataset that supports research on music events [3]. The dataset contains Electronic Dance Music (EDM) tracks with a Creative Commons license that have been collected from SoundCloud. Using this dataset, one can build machine learning algorithms to detect specific events in a given music track.

What are socially significant music events? Within a music track, listeners are able to identify certain acoustic patterns as nameable music events.  We call a music event “socially significant” if it is popular in social media circles, implying that it is readily identifiable and an important part of how listeners experience a certain music track or music genre. For example, listeners might talk about these events in their comments, suggesting that these events are important for the listeners (Figure 1).

Traditional music event detection has only tackled low-level events like music onsets [4] or music auto-tagging [8, 10]. In our dataset, we consider events at a higher level of abstraction than low-level musical onsets. In auto-tagging, descriptive tags are associated with 10-second music segments. These tags generally fall into three categories: musical instruments (guitar, drums, etc.), musical genres (pop, electronic, etc.) and mood-based tags (serene, intense, etc.). These types of tags differ from what we are detecting in this dataset: the events in our dataset have a particular temporal structure, unlike the categories that are the target of auto-tagging. Additionally, we analyze the entire music track and detect the start points of music events, rather than classifying short segments as in auto-tagging.

There are three music events in our Socially Significant Music Event dataset: Drop, Build, and Break. These events can be considered to form the basic set of events used by EDM producers [1, 2]. They have a certain temporal structure internal to themselves, which can be of varying complexity. Their social significance is visible from the large number of timed comments related to these events on SoundCloud (Figures 1 and 2): the three events are popular in social media circles, with listeners often mentioning them in comments. Here, we define these events [2]:

  1. Drop: A point in the EDM track, where the full bassline is re-introduced and generally follows a recognizable build section
  2. Build: A section in the EDM track, where the intensity continuously increases and generally climaxes towards a drop
  3. Break: A section in an EDM track with a significantly thinner texture, usually marked by the removal of the bass drum

Figure 1. Screenshot from SoundCloud showing a list of timed comments left by listeners on a music track [11].


SoundCloud

SoundCloud is an online music sharing platform that allows users to record, upload, promote and share their self-created music. SoundCloud started out as a platform for amateur musicians, but currently many leading music labels are also represented. One of the interesting features of SoundCloud is that it allows “timed comments” on the music tracks. “Timed comments” are comments, left by listeners, associated with a particular time point in the music track. Our “Socially Significant Music Events” dataset is inspired by the potential usefulness of these timed comments as ground truth for training music event detectors. Figure 2 contains an example of a timed comment: “That intense buildup tho” (timestamp 00:46). We could potentially use this as a training label to detect a build, for example. In a similar way, listeners also mention the other events in their timed comments. So, these timed comments can serve as training labels to build machine learning algorithms to detect events.

Figure 2. Screenshot from SoundCloud indicating the useful information present in the timed comments. [11]


SoundCloud also provides a well-documented API [6] with interfaces to many programming languages: Python, Ruby, JavaScript etc. Through this API, one can download the music tracks (if allowed by the uploader), timed comments and also other metadata related to the track. We used this API to collect our dataset. Via the search functionality we searched for tracks uploaded during the year 2014 with a Creative Commons license, which results in a list of tracks with unique identification numbers. We looked at the timed comments of these tracks for the keywords: drop, break and build. We kept the tracks whose timed comments contained a reference to these keywords and discarded the other tracks.
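
The exact collection scripts are not part of this column, but a minimal sketch of the filtering logic might look as follows. It assumes the legacy SoundCloud REST API endpoints /tracks and /tracks/{id}/comments described in [6], a hypothetical CLIENT_ID from a registered app, and parameter names (license, created_at) that should be verified against the API guide before use:

    import requests

    API = "https://api.soundcloud.com"
    CLIENT_ID = "YOUR_CLIENT_ID"          # hypothetical; obtained by registering an app [6]
    KEYWORDS = ("drop", "build", "break")

    def search_cc_tracks(limit=200):
        # Search for Creative Commons tracks uploaded during 2014
        # (parameter names follow the legacy API guide [6]; verify before use).
        params = {
            "client_id": CLIENT_ID,
            "license": "cc-by",
            "created_at[from]": "2014-01-01 00:00:00",
            "created_at[to]": "2014-12-31 23:59:59",
            "limit": limit,
        }
        return requests.get(f"{API}/tracks", params=params).json()

    def timed_comments(track_id):
        # Each timed comment carries a 'timestamp' (in milliseconds) and a 'body'.
        return requests.get(f"{API}/tracks/{track_id}/comments",
                            params={"client_id": CLIENT_ID}).json()

    def mentions_event(comments):
        # Keep a track only if at least one timed comment mentions drop/build/break.
        return any(any(k in (c.get("body") or "").lower() for k in KEYWORDS)
                   for c in comments)

    kept_tracks = [t for t in search_cc_tracks()
                   if mentions_event(timed_comments(t["id"]))]

In practice one would also page through the search results and respect API rate limits; the sketch only illustrates the keyword-based filtering step described above.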

Dataset

The dataset contains 402 music tracks with an average duration of 4.9 minutes. Each track is accompanied by timed comments relating to Drop, Build, and Break, as well as ground-truth labels that mark the true locations of the three events within the track. The labels were created by a team of experts. Unlike many other publicly available music datasets that provide only metadata or short previews of music tracks [9], we provide the entire track for research purposes. Download instructions for the dataset can be found in [3]. All the music tracks in the dataset are distributed under a Creative Commons license. Some statistics of the dataset are provided in Table 1.

Table 1. Statistics of the dataset: Number of events, Number of timed comments

Event name | Total number of events | Events per track | Total number of timed comments | Timed comments per track
Drop       | 435                    | 1.08             | 604                            | 1.50
Build      | 596                    | 1.48             | 609                            | 1.51
Break      | 372                    | 0.92             | 619                            | 1.54

The main purpose of the dataset is to support training of detectors for the three events of interest (Drop, Build, and Break) in a given music track. These three events can be considered a case study to prove that it is possible to detect socially significant musical events, opening the way for future work on an extended inventory of events. Additionally, the dataset can be used to understand the properties of timed comments related to music events. Specifically, timed comments can be used to reduce the need for manually acquired ground truth, which is expensive and difficult to obtain.

Timed comments present an interesting research challenge: temporal noise. The timed comments and the actual events do not always coincide; a comment can be at the same position as, before, or after the actual event. For example, in the music track below (Figure 3), there is a timed comment about a drop at 00:40, while the actual drop occurs only at 01:00. Because of this noisy nature, we cannot use the timed comments alone as ground truth. We need strategies to handle temporal noise in order to use timed comments for training [1].

Figure 3. Screenshot from SoundCloud indicating the noisy nature of timed comments [11].


In addition to music event detection, our “Socially Significant Music Event” dataset opens up other possibilities for research. Timed comments have the potential to improve users’ access to music and to support them in discovering new music. Specifically, timed comments mention aspects of music that are difficult to derive from the signal, and may be useful for calculating the song-to-song similarity needed to improve music recommendation. The fact that the comments are tied to a certain time point is important because it allows us to derive continuous information over time from a music track. Timed comments are also potentially very helpful for supporting listeners in finding specific points of interest within a track, or in deciding whether they want to listen to a track at all, since they allow users to jump in and listen to specific moments without listening to the track end-to-end.

State of the art

The detection of music events requires training classifiers that are able to generalize over the variability in the audio signal patterns corresponding to events. In Figure 4, we see that the build-drop combination has a characteristic pattern in the spectral representation of the music signal. The build is a sweep-like structure and is followed by the drop, which we indicate by a red vertical line. More details about the state-of-the-art features useful for music event detection and the strategies to filter the noisy timed comments can be found in our publication [1].

Figure 4. The spectral representation of the musical segment containing a drop. You can observe the sweeping structure indicating the buildup. The red vertical line is the drop.

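As an illustration of how such a spectral representation can be produced (the exact feature extraction pipeline used in [1] may differ), the following sketch computes a dB-scaled magnitude spectrogram with librosa for a hypothetical local file edm_track.mp3 and marks an assumed drop position with a red vertical line:

    import numpy as np
    import librosa
    import librosa.display
    import matplotlib.pyplot as plt

    # Hypothetical file name; any EDM track containing a build-drop combination will do.
    y, sr = librosa.load("edm_track.mp3", sr=22050)

    # Short-time Fourier transform, converted to a dB-scaled magnitude spectrogram.
    S_db = librosa.amplitude_to_db(np.abs(librosa.stft(y)), ref=np.max)

    plt.figure(figsize=(10, 4))
    librosa.display.specshow(S_db, sr=sr, x_axis="time", y_axis="hz")
    plt.colorbar(format="%+2.0f dB")
    plt.axvline(x=60.0, color="r")   # assumed drop position (e.g., 01:00, as in Figure 3)
    plt.title("Spectrogram of an EDM segment containing a build and a drop")
    plt.tight_layout()
    plt.show()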

The evaluation metric used to measure the performance of a music event detector should be chosen according to the user scenario for that detector. For example, if the music event detector is used for non-linear access (i.e., creating jump-in points along the playbar), it is important that the detected time point of the event falls before, rather than after, the actual event. In this case, we recommend using the “event anticipation distance” (ea_dist) as a metric. The ea_dist is the amount of time by which the predicted event time point precedes the actual event time point, and it represents the time the user would have to wait before hearing the actual event. More details about ea_dist can be found in our paper [1].
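
As a rough illustration of the idea (the precise matching procedure is defined in [1]), the following sketch matches each ground-truth event to the closest preceding prediction and averages the resulting waiting times:

    def event_anticipation_distance(predicted, actual):
        # predicted, actual: event time points in seconds for one track.
        # For each ground-truth event, take the closest prediction that does not
        # come after it and record how long the user would wait for the real event.
        waits = []
        for t_true in actual:
            preceding = [p for p in predicted if p <= t_true]
            if preceding:
                waits.append(t_true - max(preceding))
        return sum(waits) / len(waits) if waits else float("nan")

    # Example analogous to Figure 3: a drop predicted at 00:40 that actually occurs at 01:00.
    print(event_anticipation_distance(predicted=[40.0], actual=[60.0]))  # -> 20.0 seconds

Ground-truth events with no preceding prediction are simply skipped in this sketch; the procedure in [1] may handle such cases, and predictions that fall after the event, differently.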

In [1], we report the implementation of a baseline music event detector that uses only timed comments as training labels. This detector attains an ea_dist of 18 seconds for a drop. We point out that, from the user's point of view, this level of performance could already lead to quite useful jump-in points. Note that the typical length of a build-drop combination is between 15 and 20 seconds: if the user is positioned 18 seconds before the drop, the build will already have started, so the user knows that a drop is coming. Using an optimized combination of timed comments and manually acquired ground-truth labels, we are able to achieve an ea_dist of 6 seconds.

Conclusion

Timed comments, on their own, can be used as training labels to train detectors for socially significant events. A detector trained on timed comments performs reasonably well in applications like non-linear access, where the listener wants to jump through different events in the music track without listening to it in its entirety. We hope that the dataset will encourage researchers to explore the usefulness of timed comments for all media. Additionally, we would like to point out that our work has demonstrated that the impact of temporal noise can be overcome and that the contribution of timed comments to video event detection is worth investigating further.

Contact

Should you have any inquiries or questions about the dataset, do not hesitate to contact us via email at: n.k.yadati@tudelft.nl

References

[1] K. Yadati, M. Larson, C. Liem and A. Hanjalic, “Detecting Socially Significant Music Events using Temporally Noisy Labels,” in IEEE Transactions on Multimedia. 2018. http://ieeexplore.ieee.org/document/8279544/

[2] M. Butler, Unlocking the Groove: Rhythm, Meter, and Musical Design in Electronic Dance Music, ser. Profiles in Popular Music. Indiana University Press, 2006 

[3] http://osf.io/eydxk

[4] http://www.music-ir.org/mirex/wiki/2017:Audio_Onset_Detection

[5] https://developers.soundcloud.com/docs/api/guide

[6] https://developers.soundcloud.com/docs/api/guide

[7] https://research.google.com/audioset/

[8] H. Y. Lo, J. C. Wang, H. M. Wang and S. D. Lin, “Cost-Sensitive Multi-Label Learning for Audio Tag Annotation and Retrieval,” in IEEE Transactions on Multimedia, vol. 13, no. 3, pp. 518-529, June 2011. http://ieeexplore.ieee.org/document/5733421/

[9] http://majorminer.org/info/intro

[10] http://www.music-ir.org/mirex/wiki/2016:Audio_Tag_Classification

[11] https://soundcloud.com/spinninrecords/ummet-ozcan-lose-control-original-mix

Editorial

Dear Member of the SIGMM Community, welcome to the third issue of the SIGMM Records in 2013.

On the verge of ACM Multimedia 2013, we can already present the recipients of SIGMM’s yearly awards: the SIGMM Technical Achievement Award, the SIGMM Best Ph.D. Thesis Award, the TOMCCAP Nicolas D. Georganas Best Paper Award, and the TOMCCAP Best Associate Editor Award.

The TOMCCAP Special Issue on the 20th anniversary of ACM Multimedia is out in October; you can read the announcement and find each of the contributions directly through the TOMCCAP Issue 9(1S) table of contents.

That SIGMM has established a strong foothold in the scientific community can also be seen from the China Computer Federation’s rankings of SIGMM’s venues. Read the article to get even more motivation for submitting your papers to SIGMM’s conferences and journal.

We are also reporting from SLAM, the international workshop on Speech, Language and Audio in Multimedia. Not a SIGMM event, but certainly of interest to many SIGMMers who care about audio technology.

You will also find two PhD thesis summaries and, last but most certainly not least, pointers to the latest issues of TOMCCAP and MMSJ, as well as several job announcements.

We hope that you enjoy this issue of the Records.

The Editors
Stephan Kopf, Viktor Wendel, Lei Zhang, Pradeep Atrey, Christian Timmerer, Pablo Cesar, Mathias Lux, Carsten Griwodz

ACM TOMCCAP Special on 20th Anniversary of ACM Multimedia

ACM Transactions on Multimedia Computing, Communications and Applications

Special Issue: 20th Anniversary of ACM International Conference on Multimedia

A journey ‘Back to the Future’

The ACM Special Interest Group on Multimedia (SIGMM) celebrated the 20th anniversary of the establishment of its premier conference, the ACM International Conference on Multimedia (ACM Multimedia) in 2012. To commemorate this milestone, leading researchers organized and extensively contributed to the 20th anniversary celebration.

From left to right: Malcolm Slaney, Ramesh Jain, Dick Bulterman, Klara Nahrstedt, Larry Rowe and Ralf Steinmetz

The celebratory events started at ACM Multimedia 2012 in Nara, Japan, with the “Coulda, Woulda, Shoulda: 20 Years of Multimedia Opportunities” panel, organized by Klara Nahrstedt (center) and Malcolm Slaney (far left). At this panel, pioneers of the field, Ramesh Jain, Dick Bulterman, Larry Rowe and Ralf Steinmetz, shown from left to right in the image, reflected on innovations and on successful and missed opportunities in the multimedia research area.

This special issue of the ACM Transactions on Multimedia Computing, Communications and Applications (TOMCCAP) is the final event celebrating achievements and opportunities in a variety of multimedia areas. Through peer-reviewed long articles and invited short contributions, readers will get a sense of the past, present and future of multimedia research. The topics range from traditional ones such as video streaming, multimedia synchronization, multimedia authoring, content analysis, and multimedia retrieval to newer ones including music retrieval, geo-tagging context in a worldwide community of photos, multi-modal human-computer interaction and experiential media systems.

Recent years have seen an explosion of research and technologies in multimedia, beyond individual algorithms, protocols and small-scale systems; the scale of multimedia innovation and deployment has grown with unimaginable speed. As the multimedia area is growing fast and penetrating every facet of our society, this special issue fills an important need: to look back at the multimedia research achievements of the past 20 years, celebrate the field’s exciting potential, and explore new goals for the multimedia research community.

Visit dl.acm.org/tomccap to view the special issue in the ACM Digital Library.

TOMCCAP Nicolas D. Georganas Best Paper Award 2013

ACM Transactions on Multimedia Computing, Communications and Applications (TOMCCAP) Nicolas D. Georganas Best Paper Award

The 2013 ACM Transactions on Multimedia Computing, Communications and Applications (TOMCCAP) Nicolas D. Georganas Best Paper Award goes to the paper “Exploring interest correlation for peer-to-peer socialized video sharing” (TOMCCAP vol. 8, Issue 1) by Xu Cheng and Jiangchuan Liu.

The purpose of the award is to recognize the most significant work published in ACM TOMCCAP in a given calendar year. The whole readership of ACM TOMCCAP was invited to nominate articles published in Volume 8 (2012). Based on the nominations, the winner was chosen by the TOMCCAP Editorial Board. The main assessment criteria were quality, novelty, timeliness and clarity of presentation, in addition to relevance to multimedia computing, communications, and applications.

In this paper the authors examine architectures for large-scale video streaming systems exploiting social relations. To achieve this objective, a large study of YouTube traffic was conducted and a cluster analysis performed on the resulting data. Based on the observations made, a new approach for video pre-fetching based on social relations has been developed. This important work bridges the gap between social media and multimedia streaming and hence combines two extremely relevant research topics.

The award honors the founding Editor-in-Chief of TOMCCAP, Nicolas D. Georganas, for his outstanding contributions to the field of multimedia computing and his significant contributions to ACM. He profoundly influenced the research in this field and the multimedia community as a whole.

The Editor-in-Chief, Prof. Dr.-Ing. Ralf Steinmetz, and the Editorial Board of ACM TOMCCAP cordially congratulate the winners. The award will be presented to the authors on October 24th, 2013 at ACM Multimedia 2013 in Barcelona, Spain, and includes travel expenses for the winning authors.

Bio of Awardees:

Xu Cheng is currently a research engineer at BroadbandTV, Vancouver, Canada. He received his Bachelor of Science from Peking University, China, in 2006, his Master of Science from Simon Fraser University, Canada, in 2008, and his PhD from Simon Fraser University in 2012. His research interests include multimedia networks, social networks and overlay networks.

 

Jiangchuan Liu is an Associate Professor in the School of Computing Science, Simon Fraser University, British Columbia, Canada. He received his BEng (cum laude) from Tsinghua University in 1999 and his PhD from HKUST in 2003, both in computer science. He is a co-recipient of the ACM Multimedia 2012 Best Paper Award, the IEEE Globecom 2011 Best Paper Award, the IEEE Communications Society Best Paper Award on Multimedia Communications 2009, as well as the IEEE IWQoS 2008 and IEEE/ACM IWQoS 2012 Best Student Paper Awards. His research interests are in networking and multimedia. He has served on the editorial boards of IEEE Transactions on Multimedia, IEEE Communications Surveys and Tutorials, and IEEE Internet of Things Journal. He will be TPC co-chair for IEEE/ACM IWQoS 2014 in Hong Kong.