MPEG Column: 154th MPEG Meeting

The 154th MPEG meeting took place in Santa Eulària, Spain, from April 27 to May 1, 2026. The official MPEG press release can be found here. This report highlights key outcomes from the meeting, with a focus on research directions relevant to the ACM SIGMM community:

  • Exploration on MPEG Gaussian Splat Coding (GSC)
  • Draft Joint Call for Proposals: Video Compression Beyond VVC
  • Energy-aware Streaming in MPEG-DASH
  • MPEG-AI: Vision and Scenarios for Artificial Intelligence in Multimedia
  • MPEG Roadmap

Exploration on MPEG Gaussian Splat Coding (GSC)

The MPEG WG 2 Technical Requirements group, jointly with WG 4 (Video Coding), WG 5 (JVET: Joint Video Coding Team(s) with ITU-T SG 16), and WG 7 (Coding of 3D Graphics and Haptics), made progress toward standardizing Gaussian Splat Coding (GSC), producing draft requirements and use cases that remain subject to change. Gaussian splatting, first introduced in a landmark 2023 ACM SIGGRAPH paper by Kerbl et al. [Kerbl2023], represents 3D scenes as collections of anisotropic Gaussian primitives carrying geometry (x, y, z positions) and appearance attributes (opacity, scale, rotation, and spherical harmonics coefficients for view-dependent color), enabling photorealistic novel-view synthesis with real-time rendering. Because raw Gaussian splat data can be extremely large and the ecosystem of proprietary formats (.ply, .splat, .spz, etc.) is fragmented, MPEG has identified a clear need for interoperable, efficient compression standards. Two exploration tracks are currently being pursued: I-3DGS, which operates on Gaussian splats in the well-established “INRIA” format as a symmetric encode/decode pipeline, and A-3DGS, which allows alternative learned representations and training-integrated approaches.
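To make "extremely large" concrete, the sketch below counts the uncompressed per-splat payload for the attributes listed above. It assumes degree-3 spherical harmonics (16 coefficients per color channel), a unit quaternion for rotation, float32 storage, and no other per-splat fields. Real files, including the INRIA .ply layout, may store additional or differently packed fields, so treat the numbers as an illustration rather than a format specification.

```python
from dataclasses import dataclass

FLOAT32_BYTES = 4


@dataclass(frozen=True)
class SplatLayout:
    """Assumed uncompressed layout of one Gaussian primitive (illustrative only)."""
    sh_degree: int = 3  # assumption: degree-3 SH for view-dependent color

    def sh_coeffs_per_channel(self) -> int:
        # Degree-L real spherical harmonics have (L + 1)^2 basis functions.
        return (self.sh_degree + 1) ** 2

    def floats_per_splat(self) -> int:
        position = 3   # x, y, z
        scale = 3      # anisotropic extent along the three local axes
        rotation = 4   # unit quaternion
        opacity = 1
        color = 3 * self.sh_coeffs_per_channel()  # R, G, B SH coefficients
        return position + scale + rotation + opacity + color

    def raw_bytes(self, num_splats: int) -> int:
        return num_splats * self.floats_per_splat() * FLOAT32_BYTES
```

Under these assumptions each splat takes 59 floats (236 bytes), so a scene with one million splats needs about 236 MB before any compression, which is why compact, interoperable coding matters.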

The draft requirements, still evolving, currently cover representation, coding, and system aspects across both tracks, with an additional lightweight profile targeting resource-constrained devices such as mobile phones (Snapdragon 8 Gen 3/Elite) and HMDs (Snapdragon XR2 Gen 2, e.g., Meta Quest 3). Among the coding requirements under consideration are lossy and lossless compression with variable bitrate, spatial and temporal random access, progressive and scalable decoding (quality, Level of Detail (LoD), attribute subsets), and error resilience. Notably, the lightweight profile currently proposes hard complexity constraints (real-time encode/decode on 2024/2025 mobile hardware, a 2 GB runtime memory cap, and at most four concurrent video decoder sessions), reflecting MPEG’s intent to enable a fast-deployment path for interoperable interchange and storage of static Gaussian splat assets. Alongside the requirements, a draft set of 27 use cases has been identified, spanning consumer XR (telepresence, gaming, social media, retail), professional media (movie production, sports broadcasting, immersive journalism), industrial applications (digital twins, Building Information Modeling (BIM), structure inspection, disaster assessment), and emerging hybrid representations such as Gaussian splats attached to deformable meshes for avatar animation and rigging. Several of these use cases are motivating draft requirements around primitive ordering preservation and stable identifier signaling for external metadata associations, though the details of these provisions may still change.
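The two numeric limits in the draft lightweight profile can be expressed as a simple admission check that a player might run before decoding an asset. This is a minimal sketch: the `DecoderPlan` fields and the overhead term are hypothetical, interpreting "2 GB" as 2 GiB is an assumption, and the real-time encode/decode constraint is not modeled because it depends on the target hardware.

```python
from dataclasses import dataclass
from typing import List

GIB = 1024 ** 3

# Draft lightweight-profile limits as reported; "2 GB" read as 2 GiB (assumption).
MAX_RUNTIME_MEMORY_BYTES = 2 * GIB
MAX_VIDEO_DECODER_SESSIONS = 4


@dataclass
class DecoderPlan:
    """Hypothetical description of how a client intends to decode one asset."""
    num_splats: int
    bytes_per_splat: int          # decoded in-memory size per splat
    video_decoder_sessions: int   # concurrent hardware video decoders used
    overhead_bytes: int = 0       # hypothetical renderer/buffer overhead

    def memory_bytes(self) -> int:
        return self.num_splats * self.bytes_per_splat + self.overhead_bytes


def lightweight_profile_violations(plan: DecoderPlan) -> List[str]:
    """List the draft memory and decoder-session limits that a plan would exceed."""
    problems: List[str] = []
    if plan.memory_bytes() > MAX_RUNTIME_MEMORY_BYTES:
        problems.append("runtime memory exceeds the 2 GB cap")
    if plan.video_decoder_sessions > MAX_VIDEO_DECODER_SESSIONS:
        problems.append("more than four concurrent video decoder sessions")
    return problems
```

For example, with a hypothetical 236 bytes per decoded splat, one million splats on three decoder sessions passes, while ten million splats on five sessions violates both limits.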

Research aspects: Even at this early draft stage, the direction of MPEG’s GSC work opens a rich set of research opportunities. On the compression side, the dual-track structure raises open questions around rate-distortion-complexity optimization for both geometry-based and video-codec-based pipelines, including temporally coherent coding of dynamic (tracked and non-tracked) Gaussian sequences and attribute-group-aware progressive coding. The QoE angle is equally pressing: no widely accepted perceptual quality metric yet exists for 6DoF Gaussian splat rendering, and the community can contribute splat-artifact-aware metrics, view-consistency measures, and subjective evaluation methodologies. The envisioned lightweight profile points to a need for co-design of decoders and real-time renderers targeting mobile GPU architectures, offering opportunities in GPU-friendly bitstream layouts and LoD-driven streaming. From a systems and networking perspective, the spatial and temporal random-access provisions, combined with the breadth of use cases demanding adaptive streaming to diverse devices (HMDs, phones, TVs, browsers), map naturally onto adaptive bitrate research, ROI- and view-dependent segment delivery, and loss-resilient transmission of splat parameters. Finally, the emerging use cases around hybrid mesh-Gaussian avatars, scene editing, and semantic metadata associations introduce new multimedia content management and interactive media challenges that go well beyond traditional video streaming and are squarely within the scope of ACM SIGMM’s research community.
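Spatial random access and ROI-dependent delivery both need some way to locate the splats belonging to a region. One simple, commonly used option is a uniform grid of cubic tiles; the sketch below assumes that partitioning (it is not a GSC design decision) and keeps splat indices in their original order within each tile, which is one way to respect the draft interest in primitive ordering preservation.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Point = Tuple[float, float, float]
TileKey = Tuple[int, int, int]


def tile_key(p: Point, tile_size: float) -> TileKey:
    """Map a splat center to the integer index of the cubic tile containing it."""
    return (int(p[0] // tile_size), int(p[1] // tile_size), int(p[2] // tile_size))


def build_tiles(centers: List[Point], tile_size: float) -> Dict[TileKey, List[int]]:
    """Group splat indices by tile; indices stay in their original order per tile."""
    tiles: Dict[TileKey, List[int]] = defaultdict(list)
    for idx, center in enumerate(centers):
        tiles[tile_key(center, tile_size)].append(idx)
    return dict(tiles)


def tiles_for_region(tiles: Dict[TileKey, List[int]], lo: Point, hi: Point,
                     tile_size: float) -> List[TileKey]:
    """Return the occupied tiles overlapping an axis-aligned region of interest."""
    k_lo, k_hi = tile_key(lo, tile_size), tile_key(hi, tile_size)
    return sorted(k for k in tiles
                  if all(k_lo[i] <= k[i] <= k_hi[i] for i in range(3)))
```

A streaming client could then request only the segments for the returned tiles; view-frustum or LoD criteria would replace the axis-aligned box in a real system.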

Draft Joint Call for Proposals: Video Compression Beyond VVC

MPEG’s Joint Video Experts Team (JVET) — operating jointly under ITU-T SG21 and ISO/IEC JTC 1/SC 29 — advanced a draft Joint Call for Proposals (CfP) for a new generation of video compression technology with capabilities that would substantially exceed those of the current Versatile Video Coding (VVC) standard (Rec. ITU-T H.266 | ISO/IEC 23090-3). The final CfP is planned for July 2026, with proposal submissions evaluated at a JVET meeting in January 2027 and a tentative target of a completed standard by October 2029. The overarching goal is to solicit compression technology that significantly improves upon VVC’s Main 10 Profile in terms of rate-distortion performance, encoder/decoder implementability, applicability to diverse content types, and additional features such as low latency, error robustness, and scalability, while explicitly recognizing that practical fast encoding is increasingly important across a growing range of applications.
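Objective rate-distortion gains over an anchor such as VVC are conventionally summarized by the Bjøntegaard delta rate (BD-rate): the average bitrate difference at equal quality. The sketch below is an illustration, not the official JVET evaluation tool. It assumes exactly four rate/PSNR points per codec, in which case the classic cubic fit of log-rate against PSNR is an exact interpolation, and it integrates that cubic over the overlapping PSNR range with Simpson's rule, which is exact for polynomials up to degree three.

```python
import math
from typing import Sequence


def _lagrange(xs: Sequence[float], ys: Sequence[float], x: float) -> float:
    """Evaluate the interpolating polynomial through (xs, ys) at x."""
    total = 0.0
    for i, (xi, yi) in enumerate(zip(xs, ys)):
        term = yi
        for j, xj in enumerate(xs):
            if j != i:
                term *= (x - xj) / (xi - xj)
        total += term
    return total


def _integrate_cubic(xs, ys, a: float, b: float) -> float:
    """Simpson's rule; exact here because the interpolant is a cubic."""
    mid = 0.5 * (a + b)
    return (b - a) / 6.0 * (_lagrange(xs, ys, a) + 4.0 * _lagrange(xs, ys, mid)
                            + _lagrange(xs, ys, b))


def bd_rate(anchor_rates, anchor_psnr, test_rates, test_psnr) -> float:
    """BD-rate in percent (negative means the test codec saves bitrate)."""
    if not (len(anchor_rates) == len(anchor_psnr) == len(test_rates) == len(test_psnr) == 4):
        raise ValueError("this sketch expects exactly four RD points per curve")
    log_anchor = [math.log(r) for r in anchor_rates]
    log_test = [math.log(r) for r in test_rates]
    lo = max(min(anchor_psnr), min(test_psnr))
    hi = min(max(anchor_psnr), max(test_psnr))
    if hi <= lo:
        raise ValueError("RD curves do not overlap in quality")
    avg_log_diff = (_integrate_cubic(test_psnr, log_test, lo, hi)
                    - _integrate_cubic(anchor_psnr, log_anchor, lo, hi)) / (hi - lo)
    return (math.exp(avg_log_diff) - 1.0) * 100.0
```

As a sanity check, a test codec that needs 30% less bitrate at every PSNR point yields a BD-rate of -30%. Note that the CfP's formal evaluation is subjective (DCR), so BD-rate is only one complementary view.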

The draft CfP defines four test cases. The primary test case targets improved compression without runtime constraints, spanning several content categories: SDR random-access at UHD/4K and HD resolutions, SDR low-delay HD (targeting conversational and gaming applications), HDR content under both PQ and HLG transfer functions at UHD, gaming low-delay HD, and user-generated content. Three additional test cases impose encoder runtime constraints relative to the VVC Test Model (VTM) reference encoder, enabling JVET to characterize the compression-versus-speed trade-off across submissions. Formal subjective evaluation will follow the degradation category rating (DCR) methodology per ITU-R BT.500. Importantly, the CfP explicitly addresses neural and learned components: proponents must disclose what training data was used and are prohibited from using any test sequence as training material, and source code (incl. training scripts or parameter derivation procedures) must be made available for accepted technologies entering the core experiments process. The draft notes that specific test sequences and target bitrates may still change before the final CfP is issued.

Research aspects: The runtime-constrained test cases create a natural framework for studying the compression-complexity Pareto frontier for both classical and learned codecs. The inclusion of user-generated content and gaming video as distinct categories invites research into content-adaptive coding tools and perceptual quality metrics tailored to these sources, as does the HDR coverage with its use of weighted PSNR alongside MS-SSIM. The explicit allowance for neural and learned components, with mandatory training data disclosure and source code requirements, signals that JVET anticipates hybrid and end-to-end learned codecs as serious contenders, making codec-agnostic adaptive streaming, QoE modeling for learned video codecs, and large-scale perceptual quality benchmarking timely topics for the ACM SIGMM community.
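Characterizing a compression-complexity Pareto frontier amounts to discarding every operating point that is both slower and less efficient than some other point. The sketch below assumes each submission or encoder preset is summarized by an encoder runtime ratio relative to an anchor (for example VTM = 1.0) and a BD-rate in percent; both field names are hypothetical.

```python
from typing import Iterable, List, NamedTuple


class OperatingPoint(NamedTuple):
    name: str
    runtime_ratio: float  # encoder runtime relative to an anchor (assumed: VTM = 1.0)
    bd_rate: float        # BD-rate in percent vs. the anchor; lower is better


def pareto_front(points: Iterable[OperatingPoint]) -> List[OperatingPoint]:
    """Keep points not dominated in (runtime, BD-rate); both are minimized."""
    ordered = sorted(points, key=lambda p: (p.runtime_ratio, p.bd_rate))
    front: List[OperatingPoint] = []
    best_bd_rate = float("inf")
    for p in ordered:
        # A point survives only if it beats every faster (or equally fast) point.
        if p.bd_rate < best_bd_rate:
            front.append(p)
            best_bd_rate = p.bd_rate
    return front
```

Sorting by runtime first makes the scan linear after the sort; with points D (0.5, -5%), A (1.0, -10%), B (2.0, -20%), and C (2.5, -15%), C is dropped because B is both faster and more efficient.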

Energy-aware Streaming in MPEG-DASH

MPEG’s WG 3 (Systems/DASH) is developing a framework for integrating energy-related information into adaptive streaming workflows, currently documented as a Technology under Consideration (TuC) in the DASH specification. The proposed framework treats energy as a first-class design metric alongside QoE, latency, and throughput, and defines an end-to-end approach for assigning, aggregating, and propagating energy consumption data across the entire media delivery chain — from production and encoding through CDN distribution to the client. A key design principle is extensibility: rather than hardcoding specific metrics, the framework proposes a common registry of energy-related metrics (such as energy indices or carbon indices) identified via URNs or 4CC codes, inspired by existing registries like MP4RA and DASH-IF. Energy information may be carried through a variety of existing DASH mechanisms, including MPD descriptors at multiple granularity levels (Adaptation Set, Representation, Segment, Service Location), CMCD/CMSD extensions, metadata tracks, SAND messages, and event streams. A dedicated Energy descriptor in the MPD is proposed, analogous to existing Accessibility descriptors, to expose energy information to clients and applications for representation selection, user exposure, and reporting to back-end servers.
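The descriptor-based signaling can be pictured with a small parser. Only the MPD namespace below is the real DASH one; the `Energy` element name, the `urn:example:energy-index` scheme URI, and the numeric `value` semantics are hypothetical placeholders, since the TuC does not yet fix them. The sketch also assumes that a Representation-level descriptor overrides one set on its Adaptation Set.

```python
import xml.etree.ElementTree as ET
from typing import Dict, Optional

MPD_NS = "{urn:mpeg:dash:schema:mpd:2011}"     # real DASH MPD namespace
ENERGY_SCHEME = "urn:example:energy-index"      # hypothetical scheme URI


def _read_energy(elem: ET.Element) -> Optional[float]:
    """Return the value of a direct-child Energy descriptor with our scheme, if any."""
    for desc in elem.findall(MPD_NS + "Energy"):  # hypothetical element name
        if desc.get("schemeIdUri") == ENERGY_SCHEME:
            return float(desc.get("value"))
    return None


def energy_by_representation(mpd_xml: str) -> Dict[str, Optional[float]]:
    """Map Representation ids to an energy value, inheriting from the Adaptation Set."""
    root = ET.fromstring(mpd_xml)
    result: Dict[str, Optional[float]] = {}
    for aset in root.iter(MPD_NS + "AdaptationSet"):
        aset_energy = _read_energy(aset)
        for rep in aset.findall(MPD_NS + "Representation"):
            rep_energy = _read_energy(rep)
            result[rep.get("id")] = rep_energy if rep_energy is not None else aset_energy
    return result


SAMPLE_MPD = """<MPD xmlns="urn:mpeg:dash:schema:mpd:2011"><Period>
<AdaptationSet mimeType="video/mp4">
  <Energy schemeIdUri="urn:example:energy-index" value="3.0"/>
  <Representation id="v1" bandwidth="1000000">
    <Energy schemeIdUri="urn:example:energy-index" value="1.5"/>
  </Representation>
  <Representation id="v2" bandwidth="3000000"/>
</AdaptationSet></Period></MPD>"""
```

Here `v1` carries its own value (1.5) and `v2` inherits the Adaptation Set value (3.0); a client could feed such values into representation selection or reporting.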

Concept of Energy-aware Streaming in MPEG-DASH.

The April 2026 update reported significant progress on two related fronts. A 5G-MAG workshop co-organized with 3GPP SA4 and Greening of Streaming (March 2026) highlighted growing industry consensus around practical energy measurement, surfacing findings such as the dominant role of device eco-mode settings and content brightness over codec or resolution choices in determining end-device energy consumption, and the challenge of reproducible cloud-based energy measurement. In parallel, 3GPP’s Rel-20 study on media energy consumption exposure (FS_Energy_Ph2_MED) reached 80% completion and is expected to conclude in June 2026, with normative work to follow. Notably, 3GPP’s current draft conclusions focus on generic architectural enablers, specifically a new Energy Information Application Function, while explicitly deferring media-layer and client-driven energy optimization to external bodies such as MPEG, SVTA, and DVB. This positions MPEG-DASH’s manifest-based energy signaling work as the natural venue for maturing the streaming-level mechanisms that 3GPP may later reference.

Research aspects: This work opens several timely directions. Energy-aware ABR algorithm design, i.e., jointly optimizing QoE and energy across representation selection, CDN choice, and client device settings, is a natural extension of the existing adaptive streaming research agenda. The proposed metrics registry and MPD-level signaling create opportunities for dataset construction and benchmarking, building on emerging open datasets such as COCONUT [Tashtarian2024] and VEED [Linder2024]. The finding that device-side factors (eco-mode, display brightness) dominate energy consumption over codec and bitrate choices challenges some common assumptions and calls for more holistic QoE-energy modeling. Finally, the cross-SDO coordination between MPEG, 3GPP, IETF (GREEN working group), and Greening of Streaming presents opportunities for the ACM SIGMM community to contribute to the design of interoperable, standardized energy reporting APIs for streaming services.
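A minimal form of energy-aware ABR adds an energy penalty to a conventional utility. The sketch below is one hypothetical formulation, not an MPEG-specified algorithm: among representations that fit a safety-scaled throughput estimate, it maximizes log-bitrate minus a weight times a per-representation energy index, falling back to the lowest bitrate when nothing fits. The energy index, weight, and safety factor are all assumptions.

```python
import math
from typing import List, NamedTuple


class Rep(NamedTuple):
    rep_id: str
    bitrate_kbps: float
    energy_index: float  # hypothetical signaled energy score; lower is greener


def select_representation(reps: List[Rep], throughput_kbps: float,
                          energy_weight: float, safety: float = 0.8) -> Rep:
    """Pick the representation maximizing log(bitrate) - weight * energy."""
    budget = throughput_kbps * safety
    feasible = [r for r in reps if r.bitrate_kbps <= budget]
    if not feasible:
        # Nothing fits the throughput budget: fall back to the cheapest rendition.
        return min(reps, key=lambda r: r.bitrate_kbps)
    return max(feasible,
               key=lambda r: math.log(r.bitrate_kbps) - energy_weight * r.energy_index)
```

With energy_weight = 0 this reduces to a throughput-based rule that always takes the highest feasible bitrate; raising the weight trades visual quality for lower energy, which is exactly the QoE-energy tension the research agenda points to.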

MPEG-AI: Vision and Scenarios for Artificial Intelligence in Multimedia

The first edition of ISO/IEC TR 23888-1 serves as the foundational vision document for the MPEG-AI series (ISO/IEC 23888). The document maps out how AI and neural network technologies interact with multimedia standardization along two complementary axes: (i) AI as a multimedia coding tool (e.g., AI-based video compression, 3D point cloud coding) and (ii) multimedia as input for AI consumption (e.g., video coding optimized for machine vision tasks).

Under this umbrella, the document surveys six technical areas. In AI-based video coding, neural network components are explored as hybrid additions to VVC-style codecs, covering in-loop filters, intra prediction, super-resolution via reference picture resampling, and content-adaptive postfilters transmitted via SEI messages using the Neural Network Coding standard (NNC, ISO/IEC 15938-17). In AI-based 3D graphics coding, the focus is on dynamic point clouds for immersive (XR, gaming) and machine-oriented (autonomous navigation, BIM) applications, where sparsity and geometric irregularity pose unique challenges beyond those faced by image/video AI codecs. AI model compression (NNC) addresses the bandwidth-efficient deployment and incremental updating of neural network weights to devices, with use cases ranging from adaptive streaming ABR models to federated learning and postfilter delivery. Video coding for machines (VCM) targets compression optimized for downstream AI tasks such as object detection, tracking, and content moderation, with applications in surveillance, intelligent transportation, smart cities, and industrial inspection. Feature coding for machines (FCM) extends this to split-inference architectures where intermediate feature maps, rather than reconstructed video, are compressed and transmitted between edge devices and servers. Finally, distributed AI media description addresses the interoperable representation and API-level exchange of AI inference results (e.g., bounding boxes, segmentation masks) between networked media analyzers, as specified in the MPEG-IoMT suite.
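The core idea behind compressing and incrementally updating neural network weights can be sketched with uniform scalar quantization plus a sparse update of quantization levels. This is a deliberately simplified illustration: the actual NNC toolset (for example its quantization options and entropy coding) is considerably more elaborate, and the function names here are hypothetical.

```python
from typing import Dict, List


def quantize_weights(weights: List[float], step: float) -> List[int]:
    """Uniform scalar quantization: map each weight to an integer level."""
    if step <= 0:
        raise ValueError("step must be positive")
    return [round(w / step) for w in weights]


def dequantize(levels: List[int], step: float) -> List[float]:
    """Reconstruct weights; per-weight error is at most step / 2."""
    return [q * step for q in levels]


def incremental_update(old_levels: List[int], new_levels: List[int]) -> Dict[int, int]:
    """Sparse update: only positions whose level changed, with the level delta."""
    return {i: n - o for i, (o, n) in enumerate(zip(old_levels, new_levels)) if n != o}


def apply_update(levels: List[int], update: Dict[int, int]) -> List[int]:
    """Apply a sparse level delta on the receiving device."""
    out = list(levels)
    for i, delta in update.items():
        out[i] += delta
    return out
```

Sending integer level deltas instead of full float weights is what makes frequent updates (for example, of a postfilter or an ABR model) cheap when only a few weights change between versions.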

ISO/IEC TR 23888-1: AI as a multimedia coding tool and multimedia as input for AI consumption.

Research aspects: The hybrid codec paradigm raises open questions around joint optimization of traditional and learned tools and complexity-aware training for mobile targets. The VCM and FCM tracks call for new task-oriented quality metrics capturing machine-task performance as a function of bitrate, an area where the multimedia and computer vision communities can collaborate. The split-inference and feature coding scenarios introduce latency-constrained compression problems for edge-to-cloud pipelines, which naturally connect to adaptive streaming and IoT research. Finally, the reproducibility and bit-exactness challenges highlighted in the document — hardware-dependent inference, non-deterministic training, and the absence of standardized evaluation environments — present an opportunity for the community to develop shared benchmarking infrastructure for learned multimedia codecs.
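The bit-exactness problem has a very small root cause: floating-point addition is not associative, so the same reduction executed in a different order (as happens across GPUs, thread counts, or kernels) can produce different bits. The sketch below shows this and one common remedy, accumulating in fixed-point integers, whose addition is order-independent. The 16 fractional bits are an arbitrary illustrative choice.

```python
from typing import List


def float_sum(values: List[float]) -> float:
    """Sequential floating-point accumulation; the result depends on order."""
    total = 0.0
    for v in values:
        total += v
    return total


def fixed_point_sum(values: List[float], frac_bits: int = 16) -> int:
    """Quantize to fixed point first; integer addition is exactly associative."""
    scale = 1 << frac_bits
    return sum(round(v * scale) for v in values)


forward = float_sum([0.1, 0.2, 0.3])    # (0.1 + 0.2) + 0.3
backward = float_sum([0.3, 0.2, 0.1])   # (0.3 + 0.2) + 0.1
```

Here `forward` and `backward` differ in the last bit, whereas the fixed-point sums agree for any ordering; integerized inference trades some precision for that reproducibility.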

MPEG Roadmap

MPEG released an updated roadmap at its 154th meeting, reflecting the current status and near-term trajectory of its standardization activities across three broad pillars. Under Media Coding, work nearing completion includes MPEG Immersive Video v.2, Feature Coding for Machines, Solid Point Cloud Coding, and Dynamic Mesh Compression, while longer-horizon efforts cover AI Graphics Compression, Video Coding for Machines, Lenslet video coding, and — directly relevant to this report — both Video-based and Geometry-based Gaussian Splat Coding tracks. Under Systems and Tools, near-term deliverables include DASH v.7, Green metadata v.4, and Carriage of Haptics Data, with CMAF v.4 and File Format (ISOBMFF) v.10 on a slightly longer timeline. The Beyond Media pillar continues to advance genomic data search and biomedical waveform coding (BWC), alongside media authenticity and provenance indication — underscoring MPEG’s expanding scope well beyond traditional audiovisual applications.

MPEG Roadmap as of April 2026.

Research aspects: The roadmap highlights several intersecting research opportunities. The convergence of volumetric and neural representations (i.e., point clouds, dynamic meshes, Gaussian splats, and lenslet video; all progressing in parallel) raises open questions around unified rate-distortion frameworks and cross-format QoE evaluation for 6DoF experiences. The simultaneous progression of Video Coding for Machines and Feature Coding for Machines alongside traditional human-centric codecs calls for research into adaptive pipelines that can serve both human and machine consumers from a shared bitstream. The Green metadata track connects directly to the energy-aware streaming work discussed above, underscoring the need for end-to-end energy modeling that spans codec choice, packaging, delivery, and consumption. Finally, the Beyond Media thread (particularly genomic data and biomedical waveforms) signals an expanding definition of “multimedia” that the ACM SIGMM community may wish to engage with as compression, retrieval, and QoE methods developed for audiovisual content find applicability in life sciences.

Concluding Remarks

The 154th MPEG meeting in Santa Eulària reflects a standards body in active transition, broadening its scope from traditional audiovisual compression toward a richer landscape that encompasses neural scene representations, AI-native codecs, energy-aware delivery, and even biomedical data. The Gaussian Splat Coding exploration, the next-generation video compression Call for Proposals, the MPEG-AI vision document, and the energy-aware streaming framework each address distinct but interconnected challenges: how to represent, compress, deliver, and consume increasingly complex and diverse media efficiently and sustainably. For the ACM SIGMM community, this meeting offers both a map of where industry standardization is heading and a set of open research problems (spanning perceptual quality assessment, learned compression, edge inference, green streaming, and immersive media delivery) where academic contributions can meaningfully shape the next generation of multimedia standards.

The 155th MPEG meeting will be held in Geneva, Switzerland, from July 13 to 17, 2026. Click here for more information about MPEG meetings and ongoing developments.

References

  • [Kerbl2023] Bernhard Kerbl, Georgios Kopanas, Thomas Leimkuehler, and George Drettakis. 2023. 3D Gaussian Splatting for Real-Time Radiance Field Rendering. ACM Trans. Graph. 42, 4, Article 139 (August 2023), 14 pages. https://doi.org/10.1145/3592433
  • [Tashtarian2024] Farzad Tashtarian, Daniele Lorenzi, Hadi Amirpour, Samira Afzal, and Christian Timmerer. 2024. COCONUT: Content Consumption Energy Measurement Dataset for Adaptive Video Streaming. In Proceedings of the 15th ACM Multimedia Systems Conference (MMSys ’24). Association for Computing Machinery, New York, NY, USA, 346–352. https://doi.org/10.1145/3625468.3652179
  • [Linder2024] Sandro Linder, Samira Afzal, Christian Bauer, Hadi Amirpour, Radu Prodan, and Christian Timmerer. 2024. VEED: Video Encoding Energy and CO2 Emissions Dataset for AWS EC2 instances. In Proceedings of the 15th ACM Multimedia Systems Conference (MMSys ’24). Association for Computing Machinery, New York, NY, USA, 332–338. https://doi.org/10.1145/3625468.3652178

JPEG Column: 110th JPEG Meeting in Sydney, Australia

JPEG Trust Media Asset Watermarking reaches Committee Draft stage at the 110th JPEG meeting

The 110th JPEG meeting was held in Sydney, Australia, from 11 to 16 January 2026.

This meeting was marked by several major achievements. JPEG Trust Part 3: Media Asset Watermarking, which will extend the JPEG Trust Core Foundation with watermarking-based signalling capabilities for content authenticity, provenance, integrity, intellectual property rights, and labelling, reached the Committee Draft stage. Furthermore, the first event-based codec, JPEG XE, reached the Draft International Standard stage.

In addition, the JPEG Committee celebrated the 25th birthday of the successful JPEG 2000 standard with a social event where members who had served the Committee shared their experience during the development of this important family of standards.

The following sections summarise the main highlights of the 110th JPEG meeting:

  • JPEG Trust Part 3: Media Asset Watermarking to provide watermarking support for media asset authenticity.
  • JPEG XE Part 1: core coding system is under DIS ballot.
  • JPEG AIC prepares large-scale subjective experiment.
  • JPEG 2000 defines a set of hardware-focused profiles for professional video streaming.
  • JPEG XS Part 2: a new amendment defines additional levels and sublevels, and a new frame buffer level.
  • JPEG RF activity approves new Use Cases and Requirements.
  • JPEG AI focuses on implementation aspects and on extending its applicability across devices and use cases.
  • JPEG DNA completes wet-lab experiments, including DNA synthesis/sequencing.
  • JPEG Pleno Light Field Quality Assessment examines the performance of the proposed metrics.
  • JPEG 2000 25th Anniversary Celebrations.

The former convenor of the JPEG Committee, Daniel Lee, addressing JPEG 2000 development during the JPEG 2000 25th Anniversary Celebration.

JPEG Trust

Current technologies, especially the rise of generative AI, make synthetic creation and modification of media assets easy for general users. Media artefacts such as synthetic images and video increase the risks of online piracy, cyber security fraud, copyright breach, advertising misrepresentation and the spread of mis- and disinformation.

The JPEG Trust International Standard (ISO/IEC 21617-1) provides a framework for establishing trust in media assets, and has now been extended to include Part 3: Media Asset Watermarking (ISO/IEC 21617-3), to provide watermarking support for media asset authenticity.

This new part of the JPEG Trust framework provides a mechanism to empower businesses, governments and institutions to support critical use cases from labelling AI-generated media assets to Digital Rights Management and source tracing. This is in addition to its many applications in helping secure media asset authenticity.

In a major milestone achieved during the 110th JPEG meeting in Sydney, Part 3: Media Asset Watermarking reached the Committee Draft stage. It is expected that this standard will have a significant positive impact globally, as it directly responds to the urgent calls for watermarking functionality by governments around the world in response to the proliferation of AI-generated content online.

JPEG XE

JPEG XE is a joint effort between ITU-T SG21 and ISO/IEC JTC1/SC29/WG1 and will become the first specification for the coding of events to be endorsed by the major international standardization bodies ITU-T, ISO, and IEC. It aims to establish a robust and interoperable format for efficient representation and coding of events in the context of machine vision and related applications. To expand the reach of JPEG XE, the JPEG Committee has closely coordinated its activities with the MIPI Alliance, with the intention of developing a cross-compatible coding mode that allows MIPI ESP signals to be decoded effectively by JPEG XE decoders.
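For readers new to event data: an event sensor emits an asynchronous stream of (x, y, timestamp, polarity) tuples instead of frames. The sketch below shows one generic way to serialize such a stream compactly, by delta-coding timestamps and writing small integers as variable-length (LEB128-style) bytes. It is only an illustration of the data and of why dedicated coding pays off; it is not the JPEG XE coding scheme, and all names are hypothetical.

```python
from typing import List, NamedTuple, Tuple


class Event(NamedTuple):
    x: int
    y: int
    t_us: int      # timestamp in microseconds
    polarity: int  # 1 = brightness increase, 0 = decrease


def _put_varint(out: bytearray, value: int) -> None:
    """Unsigned LEB128: 7 payload bits per byte, high bit marks continuation."""
    if value < 0:
        raise ValueError("varint values must be non-negative")
    while True:
        byte = value & 0x7F
        value >>= 7
        if value:
            out.append(byte | 0x80)
        else:
            out.append(byte)
            return


def _get_varint(data: bytes, pos: int) -> Tuple[int, int]:
    value, shift = 0, 0
    while True:
        byte = data[pos]
        pos += 1
        value |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return value, pos
        shift += 7


def encode_events(events: List[Event]) -> bytes:
    """Serialize timestamp-ordered events; timestamps are stored as deltas."""
    out = bytearray()
    prev_t = 0
    for e in events:
        if e.t_us < prev_t:
            raise ValueError("events must be sorted by timestamp")
        _put_varint(out, e.t_us - prev_t)
        _put_varint(out, e.x)
        _put_varint(out, e.y)
        out.append(e.polarity & 1)
        prev_t = e.t_us
    return bytes(out)


def decode_events(data: bytes) -> List[Event]:
    events: List[Event] = []
    pos, t = 0, 0
    while pos < len(data):
        dt, pos = _get_varint(data, pos)
        x, pos = _get_varint(data, pos)
        y, pos = _get_varint(data, pos)
        polarity = data[pos]
        pos += 1
        t += dt
        events.append(Event(x, y, t, polarity))
    return events
```

Because consecutive events are typically microseconds apart, the timestamp deltas usually fit in one byte, whereas absolute timestamps would grow without bound.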

Currently, JPEG XE Part 1, which defines the core coding system, is under DIS ballot, and the JPEG Committee is awaiting the results. In the meantime, work started on Parts 2 and 3, which will define the profiles and levels and the reference software, respectively. For both parts, a Committee Draft (CD) was produced and submitted for consultation. The profiles and levels in Part 2 will provide strict definitions to ensure safe and correct interoperability between vendor-specific implementations of the standard. The software for Part 3 will serve as a proof-of-concept implementation of a JPEG XE encoder and decoder. The plan is to make the software free and open source to give the community easy access to the JPEG XE technology.

Finally, work on Part 4 was also initiated to provide official and well-defined conformance tests. This will help vendors to verify interoperability and conformance to the standard.

The JPEG Committee remains committed to the development of a comprehensive and industry-aligned standard that meets the growing demand for event-based vision technologies. The collaborative approach between multiple standardisation organisations underscores a shared vision for a unified, international standard to accelerate innovation and interoperability in this emerging field. The JPEG XE public and joint AHG (ITU-T SG21 and ISO/IEC JTC1 SC29 WG1) was reestablished to continue the work. If you are interested, please consider joining the joint AHG.

JPEG AIC

The JPEG AIC-3 standard, which specifies a methodology for fine-grained subjective image quality assessment in the range from good quality up to mathematically lossless, is ready to be published as International Standard ISO/IEC 29170-3 in February this year. An implementation of the corresponding data analysis has been provided in MATLAB and will be ported to Python. For the current JPEG AIC-4 effort and evaluation of the responses to the call for Objective Image Quality Assessment, an image dataset for the large-scale subjective experiment was finalized, consisting of 18,000 compressed images for 70 source images and 17 codecs, including several learning-based methods. The crowdsourcing experiment is expected to take several weeks.

JPEG 2000

The JPEG Committee has initiated the development of a new standard to collect the growing number of profiles for its flexible JPEG 2000 image codec. As part of this activity, which is expected to be completed within the next 18 months, an initial set of hardware-focused profiles for professional video streaming is being codified. These profiles use the unique capabilities of the High-Throughput JPEG 2000 block coder, specified in Rec. ITU-T T.814 | ISO/IEC 15444-15, to reduce the hardware resources needed to handle modern high-frame-rate and high-resolution images.

JPEG XS

JPEG XS, the image and video compression format for transmitting visually lossless, high-quality pictures with minimal latency and low resource consumption, is a fundamental game-changer for real-time video transmission in live, professional, and broadcast applications. In this context, the JPEG Committee created an AMD1 for JPEG XS Part 2 to define some additional levels and sublevels, as well as a new frame buffer level. These additions each address specific requirements that came from the respective industry sectors that rely on JPEG XS. This new AMD1 for Part 2 was issued for DIS balloting. In the meantime, the ballot results for AMD1 for JPEG XS Part 1 were processed, and an FDIS ballot was initiated. Both AMDs are expected to be published before the end of this year.

JPEG RF

At the 110th JPEG meeting, JPEG RF made significant progress against its mandates, formally approving the Use Cases and Requirements for JPEG Radiance Fields v1.0 and requesting its public release on the JPEG website. Substantial technical discussions advanced the evaluation and assessment pipeline for radiance fields, covering both coding-only and joint instantiation and coding approaches. The Working Group also approved Exploration Study 7, including the study on pair-wise comparison assessment methodologies for radiance fields. In addition, next steps were agreed for outreach activities to engage additional stakeholders.
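Pair-wise comparison methods ask observers which of two stimuli looks better and then convert the win counts into a quality scale. A classical way to do this is Thurstone Case V scaling, sketched below; it is shown only to illustrate the principle and is not necessarily the methodology under study in Exploration Study 7. The clipping of win probabilities to [0.01, 0.99], which avoids infinite z-scores for unanimous comparisons, is an assumption of this sketch.

```python
from statistics import NormalDist
from typing import List


def thurstone_case_v(wins: List[List[int]], clip: float = 0.01) -> List[float]:
    """Scale stimuli from a win matrix: wins[i][j] = times i was preferred over j."""
    n = len(wins)
    std_normal = NormalDist()
    scores: List[float] = []
    for i in range(n):
        z_values = []
        for j in range(n):
            if i == j:
                z_values.append(0.0)
                continue
            total = wins[i][j] + wins[j][i]
            p = 0.5 if total == 0 else wins[i][j] / total
            p = min(max(p, clip), 1.0 - clip)        # avoid infinite z-scores
            z_values.append(std_normal.inv_cdf(p))   # preference in z units
        scores.append(sum(z_values) / n)             # Case V: average z per stimulus
    return scores
```

The resulting scores are in units of the assumed common perceptual noise and are centered around zero, so only their differences are meaningful.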

JPEG AI

During the 110th JPEG meeting, JPEG AI was focused on implementation aspects and on extending its applicability across devices and use cases. First, the Use Cases and Requirements document was updated, introducing a new video streaming and storage use case that positions JPEG AI as a deterministic still-image coding engine that can be integrated into video coding pipelines.

A new core experiment addresses the bit-exact reference frame reconstruction requirement. Moreover, other core experiments were defined to analyze power consumption on heterogeneous CPU–GPU/FPGA platforms and to retrain JPEG AI in the RGB domain for fair comparison with other codecs. Looking ahead, JPEG AI plans to develop mobile-ready encoder and decoder implementations, investigate error-resilience properties, and continue benchmarking JPEG AI against state-of-the-art learnt image codecs using solid and robust test conditions.

JPEG DNA

The wet-lab experiments, including DNA synthesis/sequencing, designed at the 109th JPEG meeting were completed, and the synthesized results have been delivered to the JPEG Committee as DNA molecules. As a next step, independent parties are sequencing the molecules separately, and the sequencing results are expected to be available by the next JPEG meeting, at which JPEG DNA (ISO/IEC 25508-1) will reach the DIS stage.
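Coding data into DNA means mapping bits onto the four nucleotides while respecting biochemical constraints; long runs of the same base (homopolymers) are a typical source of synthesis and sequencing errors. The sketch below illustrates one textbook technique, a rotating ternary code in which each base-3 digit selects one of the three nucleotides that differ from the previous one, so no base ever repeats. It is a generic illustration, not the JPEG DNA codec, and the starting context "A" is an arbitrary assumption.

```python
from typing import List

NUCLEOTIDES = "ACGT"
TRITS_PER_BYTE = 6  # 3**6 = 729 >= 256, so six base-3 digits hold one byte


def bytes_to_trits(data: bytes) -> List[int]:
    trits: List[int] = []
    for byte in data:
        value = byte
        digits = []
        for _ in range(TRITS_PER_BYTE):
            digits.append(value % 3)
            value //= 3
        trits.extend(reversed(digits))  # most significant trit first
    return trits


def trits_to_bytes(trits: List[int]) -> bytes:
    out = bytearray()
    for i in range(0, len(trits), TRITS_PER_BYTE):
        value = 0
        for t in trits[i:i + TRITS_PER_BYTE]:
            value = value * 3 + t
        out.append(value)
    return bytes(out)


def encode_dna(data: bytes, prev: str = "A") -> str:
    """Each trit picks one of the three bases that differ from the previous base."""
    seq = []
    for t in bytes_to_trits(data):
        choices = [n for n in NUCLEOTIDES if n != prev]
        prev = choices[t]
        seq.append(prev)
    return "".join(seq)


def decode_dna(seq: str, prev: str = "A") -> bytes:
    trits = []
    for n in seq:
        choices = [c for c in NUCLEOTIDES if c != prev]
        trits.append(choices.index(n))
        prev = n
    return trits_to_bytes(trits)
```

The price of avoiding homopolymers is density: each nucleotide carries log2(3) ≈ 1.58 bits instead of the unconstrained 2 bits.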

JPEG Pleno

During the 110th JPEG meeting, the JPEG Committee reviewed the outcomes of the subjective quality assessment conducted on the evaluation dataset with the aim to examine the performance of the proposals submitted in response to the Call for Proposals on objective metrics for JPEG Pleno Light Field Quality Assessment. The performance of submitted metrics was analysed across scenes with diverse spatial and angular resolutions and for both coding-only and joint coding and view-synthesis artefacts, highlighting differences in behaviour across distortion categories. Learning-based proposals were recognized as a promising direction, particularly when cross-validated on the evaluation dataset, while also raising considerations related to training, data dependency, and reproducibility. The evaluation phase was formally closed, with agreement to retain a set of well-established full-reference metrics as reference anchors and to pursue a combined technical direction integrating end-to-end and hybrid learning-based approaches. Finally, responsibilities across task forces were consolidated, and next steps were defined to continue the objective quality assessment work towards a first version of a working draft.
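Analysing the performance of objective metrics against subjective scores usually starts with two correlation coefficients: the Pearson linear correlation (PLCC), which measures prediction accuracy, and the Spearman rank-order correlation (SROCC), which measures monotonicity. The sketch below computes both from scratch, using average ranks for ties. It omits the nonlinear (for example logistic) mapping often fitted before PLCC, and it does not claim to reproduce the exact evaluation procedure used for the JPEG Pleno call.

```python
import math
from typing import List, Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson linear correlation coefficient (PLCC)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    if sx == 0 or sy == 0:
        raise ValueError("correlation is undefined for constant inputs")
    return cov / (sx * sy)


def average_ranks(values: Sequence[float]) -> List[float]:
    """1-based ranks; tied values receive the mean of the ranks they span."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    return ranks


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman rank-order correlation (SROCC): Pearson on the ranks."""
    return pearson(average_ranks(x), average_ranks(y))
```

A metric that orders stimuli perfectly but nonlinearly (for example, MOS growing as the cube of the metric) reaches SROCC = 1 while its PLCC stays below 1, which is why both figures are normally reported.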

Highlights of JPEG 2000 25th Anniversary Celebrations, Sydney, 14 January 2026

The 110th JPEG meeting in Sydney offered a fitting occasion to mark the 25th anniversary of JPEG 2000 standardization. Opening the celebration, Prof. Touradj Ebrahimi, JPEG convenor, noted that it was in Sydney during the 12th JPEG meeting in 1997 that JPEG 2000 proposals were evaluated, culminating in the publication of the standard in December 2000.

The program featured a video message from Prof. Michael Marcellin, a key contributor to several core technologies adopted by JPEG 2000 and chair of the subsequent software verification model effort. He highlighted the successful deployment of JPEG 2000 for digital distribution of motion pictures and the essential standards work involved in defining the digital cinema profiles that enabled this adoption.

Prof. David Taubman, whose long-standing leadership and technical contributions continue to shape JPEG 2000 development, delivered a presentation highlighting the coding tools that underpin the format’s highly scalable and accessible codestreams. He also outlined recent progress in High Throughput JPEG 2000 (HTJ2K), including high-performance implementations, full floating-point lossless compression for OpenEXR, and FPGA-based realizations delivering high-speed, low-latency coding.

Messages from Prof. Majid Rabbani and Dr. Daniel Lee—both instrumental in guiding the JPEG 2000 standardisation process—paid tribute to the dedication, expertise, and collaborative spirit of the many JPEG members who contributed to the standard’s success. Daniel, who served as JPEG convenor during the JPEG 2000 standardisation period, further underscored JPEG’s essential role as a collaborative international forum for developing standards with global reach.

The celebration concluded with an address by Dr. Pierre Anthony Lemieux, co-chair of the JPEG 2000 activity, who highlighted the format’s enduring flexibility as a key factor in its longevity. He noted that this flexibility allows end users to expand the capabilities of their workflows without the burden of switching to a different codec. Dr. Lemieux also emphasised the importance of ongoing maintenance activities, which allow JPEG 2000 to evolve to meet the shifting needs of its users, including current work on defining HTJ2K profiles and levels. He finished by stressing the importance of open source tools and libraries in driving adoption.

A sustained commitment to meeting industry needs and continued maintenance of the standard remains central to the ongoing and future success of JPEG 2000.

Final Quote

“Reaching Committee Draft for JPEG Trust Part 3: Media Asset Watermarking is a pivotal step toward restoring confidence in digital media at a moment when generative AI makes convincing manipulation accessible to anyone. This milestone equips industries and public institutions with interoperable, standards-based watermarking to support authenticity, provenance, integrity, rights signalling, and clear labelling, helping to curb mis- and disinformation, strengthen digital rights management, and enable reliable source tracing at a global scale,” said Prof. Touradj Ebrahimi, the Convenor of the JPEG Committee.

Music Meets Science at ACM Multimedia 2025

Multimedia research is often framed in terms of algorithms, datasets, and systems, but at its heart lies content that is deeply human. Few forms of content illustrate this better than classical music. Long before music becomes data to be recorded, generated, searched, or retrieved, it is imagined by composers and brought to life by performers. At ACM Multimedia 2025 in Dublin, this human origin of multimedia took centre stage in a unique social event that bridged classical music and multimedia content analysis. This was the fourth event supported by ACM SIGMM within the Music Meets Science program, following editions at CBMI 2022, CBMI 2023, and CBMI 2024.

Music Meets Science explores musical spaces across centuries and styles, from the dynamic Folia of Vivaldi and Handel’s Passacaglia to works by Schubert and contemporary composers from around the world. The goal is to bring a wide range of music, performed by some of the original content creators, our classical musicians, to the multimedia research community that explores and mines this content. It brings fundamental cultural values to young multimedia researchers, opening their minds to classical and contemporary music that oscillates with the rhythm of centuries.

The concert took place on 29 October, starting at 8:00 PM, during the Welcome Reception of ACM Multimedia 2025. It was attended by over 1,000 delegates of all ages, from doctoral students to senior researchers. The programme featured music by Irish composer Garth Knox and a new composition by Finnish composer Jarno Vanhanen, written especially for ACM Multimedia. The performance was delivered by internationally acclaimed French musicians of the new generation, François Pineau-Benois (violin) and Olivier Marin (viola) (see Figure 1). Together, they invited the audience to experience music not only as sound, but as rich multimedia content shaped by structure, expression, interpretation, and context.

Figure 1. François Pineau-Benois (violin) and Olivier Marin (viola) performing Jarno Vanhanen’s “Aurora Borealis” duet.

By embedding live performance within a major multimedia conference, Music Meets Science highlights the importance of integrating creative arts into the research ecosystem. As multimedia research continues to advance, from content understanding to generation, events like this remind us that artistic practice is not just an application domain, but a source of inspiration. Strengthening the dialogue between creative arts and multimedia research can deepen our understanding of content, context, and meaning, and enrich the future directions of the field.

The MediaEval Benchmark Looks Back at a Successful Fifteenth Edition, and Forward to its Sweet Sixteen

Introduction

The Benchmarking Initiative for Multimedia Evaluation (MediaEval) organizes interesting and engaging tasks related to multimedia data. MediaEval is proud to be supported by SIGMM. Tasks involve analyzing and exploring multimedia collections, as well as accessing the information that they contain. MediaEval emphasizes challenges that have a human or social aspect in order to support our goal of making multimedia a positive force in society. Participants in MediaEval are encouraged to submit effective, but also creative, solutions to MediaEval tasks: we carry out quantitative evaluation of the submissions, but also go beyond the scores in order to obtain insight into the tasks, data, and metrics.

Participation in MediaEval is open to any team that wishes to sign up. Registration has just opened, and information is available on the MediaEval 2026 website (https://multimediaeval.github.io/editions/2026). The workshop will take place in Amsterdam, Netherlands, and online, coordinated with ACM ICMR (https://icmr2026.org).

In this column, we present a short report on MediaEval 2025, which culminated with the annual workshop in Dublin, Ireland, between CBMI (https://www.cbmi2025.org) and ACM Multimedia (https://acmmm2025.org). Then, we provide an outlook on MediaEval 2026, which will be the sixteenth edition of MediaEval.

A Keynote on Metascience

The workshop kicked off with a keynote on metascience for machine learning. The metascience initiative (https://metascienceforml.github.io) strives to promote discussion and development of the scientific underpinnings of machine learning. It looks at the way in which machine learning is done and examines the full range of relevant aspects, from methodologies to mindsets. The keynote was delivered by Jan van Gemert, head of the Computer Vision Lab (https://www.tudelft.nl/ewi/over-de-faculteit/afdelingen/intelligent-systems/pattern-recognition-bioinformatics/computer-vision-lab) at Delft University of Technology. He discussed the “growing pains” of the field of deep learning and the importance of the scientific method for keeping the field on course. He invited the audience to consider the power of benchmarks for hypothesis-driven science in machine learning and deep learning.

Tasks at MediaEval 2025

The MediaEval 2025 tasks reflect the benchmark’s continued emphasis on human-centered and socially relevant multimedia challenges, spanning healthcare, media, memory, and responsible use of generative AI.

Several tasks this year focused on the human aspects of multimodal analysis, combining visual, textual, and physiological signals. The Medico Task challenges participants to build visual question answering models for the interpretation of gastrointestinal images, aiming to support clinical decision-making through interpretable multimodal explanations. The Memorability Task focuses on modeling long-term memory for short movie excerpts and commercial videos, requiring participants to predict how memorable a video is, whether viewers are familiar with it, and, in some cases, to leverage EEG signals alongside visual features. Multimodal understanding is further explored in the MultiSumm Task, where participants are provided with collections of multimodal web content describing food sharing initiatives in different cities and are asked to generate summaries that satisfy specific informational criteria, with evaluation exploring both traditional and emerging LLM-based assessment approaches.

The remaining two tasks emphasize the societal impact of multimedia technology in real-world settings. In the NewsImages Task, participants worked with large collections of international news articles and images, either retrieving suitable thumbnail images or generating thumbnails for articles. The Synthetic Images Task addressed the growing prevalence of AI-generated content online, asking participants to detect synthetic or manipulated images and localize manipulated areas. The task used data created by state-of-the-art generative models as well as images collected from real-world online settings. We gratefully acknowledge the support of AI-CODE (https://aicode-project.eu), a European project focused on topics related to these two tasks.

MediaEval in Motion

MediaEval is especially proud of participants who return over the years, improving their approaches and contributing insights. We would like to highlight two previous participants who became so interested and involved in MediaEval tasks that they decided to join the task organization team and help organize the tasks. Iván Martín-Fernández, PhD student at Universidad Politécnica de Madrid, became a task organizer for the Memorability task and Lucien Heitz, PhD Student, University of Zurich, became a task organizer for NewsImages. 

One aspect of the MediaEval Benchmark I value most is its effort to go beyond metric-chasing and embark on a “quest for insights,” as the organizers put it, to help us better understand the tasks and encourage creative, innovative solutions. This spirit motivated me to participate in the 2023 Memorability Task in Amsterdam. The experience was so enriching that I wanted to become more involved in the community. In 2025, I was invited to join the Memorability Task organizing team, which gave me the chance to contribute to and help foster this innovative research effort. Thanks to SIGMM’s sponsorship, I was able to attend the event in Dublin, which further enhanced the experience. Working alongside Martha and Gabi as a student volunteer is always a pleasure. As my PhD studies come to an end, I’m proud to say that MediaEval has been a core part of my research, and I’m sure it will remain so in the immediate future. See you in Amsterdam in June!

Iván Martín-Fernández, PhD Student, GTHAU – Universidad Politécnica de Madrid

I ‘graduated’ from being a participant in the previous NewsImages challenge to now taking over the organization duties of the 2025 iteration of the task. It was an incredible journey and learning experience. Big thank you to the main MediaEval organizers for their tireless support and input for shaping this new task that combines image retrieval and generation. The recent benchmark event presented an amazing platform to share and discuss our research. We got so many great submissions from teams around the globe. I was truly overwhelmed by the feedback. Getting involved with the organization of a challenge task is something I can highly recommend to all participants. It allows you to take on an active role and bring new ideas to the table on what problems to tackle next.

Lucien Heitz, PhD Student, University of Zurich

MediaEval continues its tradition of awarding a “MediaEval Distinctive Mention” to teams that dive deeply into the data, the algorithms, and the evaluation procedure. Going above and beyond in this way makes important contributions to our understanding of the task and how to make meaningful progress. Moving the state of the art forward requires improving the scores on a pre-defined benchmark task such as the tasks offered by MediaEval. However, MediaEval Distinctive Mentions underline the importance of research that does not necessarily improve scores on a given task, but rather makes an overall contribution to knowledge.

We were happy to serve as student volunteers at MediaEval 2025. In addition, we participated as a team in the NewsImages task, contributing to two subtasks, and were honored to receive a Distinctive Mention.

Xiaomeng had previously participated in the same task at MediaEval 2023. Compared to the 2023 edition, she observed notable evolution in both the data and task design. These changes reflect the organizers’ careful consideration of recent advances in modeling techniques as well as the practical applicability of the datasets, which proved to be highly inspiring. 

Bram participated in MediaEval for the first time and found the discussions with colleagues about the challenges particularly rewarding. The NewsImages retrieval subtask also taught him how to handle larger datasets.

We tried to incorporate deeper reflections on our results into our presentation. Specifically, we showed how certain types of articles are particularly suited for image generation and identified the news categories where retrieval was most effective. 

Xiaomeng Wang and Bram Bakker, PhD Students, Data Science – Radboud University

The people whose work is highlighted in this section are grateful to have received support from SIGMM in order to be able to attend the MediaEval workshop in person. 

Outlook to MediaEval 2026

The 2025 workshop concluded with participants collaborating with the task organizers to start to develop “benchmark biographies”, which are living documents that describe benchmarking tasks. Combining elements from data sheets and model cards, benchmark biographies document motivation, history, datasets, evaluation protocols, and baseline results to support transparency, reproducibility, and reuse by the broader research community. We plan to continue work on these benchmark biographies as we move toward MediaEval 2026. 

Further, in the 2026 edition, we will again offer the tasks that were held in 2025 to provide an opportunity for teams who were not able to participate in 2025. We especially encourage “Quest for Insight” papers that examine characteristics of the data and the task definitions, the strengths and weaknesses of particular types of approaches, observations about the evaluation procedure, and the implications of the task.
We look forward to seeing you in Amsterdam for MediaEval and also ACM ICMR. Don’t forget to check out the MediaEval website (https://multimediaeval.github.io) and register your team if you are interested in participating in 2026.

MPEG Column: 153rd MPEG Meeting

The 153rd MPEG meeting took place online from January 19-23, 2026. The official MPEG press release can be found here. This report highlights key outcomes from the meeting, with a focus on research directions relevant to the ACM SIGMM community:

  • MPEG Roadmap
  • Exploration on MPEG Gaussian Splat Coding (GSC)
  • MPEG Immersive Video 2nd edition (new white paper)

MPEG Roadmap

MPEG released an updated roadmap showing continued convergence of immersive and “beyond video” media with deployment-ready systems work. Near-term priorities include 6DoF experiences (MPEG Immersive Video v2 and 6DoF audio), volumetric representations (dynamic meshes, solid point clouds, LiDAR, and emerging Gaussian splat coding), and “coding for machines,” which treats visual and audio signals as inputs to downstream analytics rather than only for human consumption.

Research aspects: The most promising research opportunities sit at the intersections: renderer and device-aware rate-distortion-complexity optimization for volumetric content; adaptive streaming and packaging evolution (e.g., MPEG-DASH / CMAF) for interactive 6DoF services under tight latency constraints; and cross-cutting themes such as media authenticity and provenance, green and energy metadata, and exploration threads on neural-network-based compression and compression of neural networks that foreshadow AI-native multimedia pipelines.

MPEG Gaussian Splat Coding (GSC)

Gaussian Splat Coding (GSC) is MPEG’s effort to standardize how 3D Gaussian Splatting (3DGS) content is encoded, decoded, and evaluated so it can be exchanged and rendered consistently across platforms. Such content represents scenes as sparse “Gaussian splats” that carry geometry plus rich attributes (scale and rotation, opacity, and spherical-harmonics appearance for view-dependent rendering). The main motivation is interoperability for immersive media pipelines: enabling reproducible results, shared benchmarks, and comparable rate-distortion-complexity trade-offs for use cases ranging from telepresence and immersive replay to mobile XR and digital twins, while retaining the visual strengths that made 3DGS attractive compared to heavier neural scene representations.
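To give a sense of why uncompressed splat data grows so quickly, the following minimal sketch counts the float32 values stored per splat. It assumes the attribute layout commonly written by the original 3DGS training code: a 3D position, optional placeholder normals, one opacity value, three scale values, a four-component rotation quaternion, and (degree + 1)² spherical-harmonics coefficients per RGB channel. The function names and the `include_normals` flag are illustrative assumptions, not part of any MPEG specification.

```python
FLOAT32_BYTES = 4

def splat_float_count(sh_degree: int = 3, include_normals: bool = False) -> int:
    """Number of float32 values stored per Gaussian splat (illustrative layout)."""
    position = 3                            # x, y, z
    normals = 3 if include_normals else 0   # placeholder normals, not used for rendering
    opacity = 1
    scale = 3                               # per-axis scale
    rotation = 4                            # quaternion
    sh = 3 * (sh_degree + 1) ** 2           # SH coefficients for R, G, B
    return position + normals + opacity + scale + rotation + sh

def raw_size_bytes(num_splats: int, sh_degree: int = 3, include_normals: bool = False) -> int:
    """Uncompressed size of a splat set under the layout above."""
    return num_splats * splat_float_count(sh_degree, include_normals) * FLOAT32_BYTES

if __name__ == "__main__":
    print(splat_float_count())              # floats per splat at SH degree 3
    print(raw_size_bytes(1_000_000) / 1e6)  # megabytes for one million splats
```

Under these assumptions a splat needs 59 floats (236 bytes) at SH degree 3, so one million splats occupy about 236 MB before any compression, and 48 of the 59 values are appearance coefficients. This is one reason attribute coding, not only geometry coding, matters for GSC.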

The work remains in an exploration phase, coordinated across ISO/IEC JTC 1/SC 29 groups WG 4 (MPEG Video Coding) and WG 7 (MPEG Coding for 3D Graphics and Haptics) through Joint Exploration Experiments covering datasets and anchors, new coding tools, software (renderer and metrics), and Common Test Conditions (CTC). A notable systems thread is “lightweight GSC” for resource-constrained devices (single-frame, low-latency tracks using geometry-based and video-based pipelines with explicit time and memory targets), alongside an “early deployment” path via amendments to existing MPEG point-cloud codecs so that they carry Gaussian-splat parameters more natively. In parallel, MPEG is testing whether splat-specific tools can outperform straightforward mappings in quality, bitrate, and compute for real-time and streaming-centric scenarios.
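Both geometry-based and video-based pipelines ultimately operate on integer samples, so float-valued splat attributes have to be quantized somewhere along the way. The sketch below shows plain uniform min-max quantization of one attribute to N-bit unsigned integers and back. It is a generic illustration under the assumption that the per-attribute minimum and maximum are sent as side information; it is not the quantization scheme of any MPEG codec or exploration experiment, and the function names are hypothetical.

```python
from typing import List, Tuple

def quantize(values: List[float], bits: int) -> Tuple[List[int], float, float]:
    """Uniformly map floats in [min, max] to integers in [0, 2**bits - 1]."""
    lo, hi = min(values), max(values)
    levels = (1 << bits) - 1
    span = hi - lo
    if span == 0:  # constant attribute: every value maps to code 0
        return [0] * len(values), lo, hi
    codes = [round((v - lo) / span * levels) for v in values]
    return codes, lo, hi

def dequantize(codes: List[int], bits: int, lo: float, hi: float) -> List[float]:
    """Reconstruct floats from codes using the transmitted [lo, hi] range."""
    levels = (1 << bits) - 1
    if hi == lo:
        return [lo] * len(codes)
    return [lo + c / levels * (hi - lo) for c in codes]

if __name__ == "__main__":
    opacities = [0.01, 0.25, 0.5, 0.99]
    codes, lo, hi = quantize(opacities, bits=10)  # 10-bit samples, as in 10-bit video
    recon = dequantize(codes, 10, lo, hi)
    print(codes, max(abs(a - b) for a, b in zip(opacities, recon)))
```

The reconstruction error is bounded by half a quantization step, (max − min) / (2(2^N − 1)), which is why the choice of bit depth per attribute (positions usually need more bits than, say, higher-order SH coefficients) is a central rate-distortion lever.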

Research aspects: Relevant SIGMM directions include splat-aware compression tools and rate-distortion-complexity optimization (including tracked vs. non-tracked temporal prediction); QoE evaluation for 6DoF navigation (metrics for view and temporal consistency and splat-specific artifacts); decoder and renderer co-design for real-time and mobile lightweight profiles (progressive and LOD-friendly layouts, GPU-friendly decode); and networked delivery problems such as adaptive streaming, ROI and view-dependent transmission, and loss resilience for splat parameters. Additional opportunities include interoperability work on reproducible benchmarking, conformance testing, and practical packaging and signaling for deployment.
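Rate-distortion-complexity optimization is often expressed as a Lagrangian cost J = D + λR + μC, where D is distortion, R is rate, C is a complexity measure, and λ and μ weight rate and complexity against quality. The following minimal sketch picks an encoder operating point by minimizing that cost. The operating-point names, the choice of decode-plus-render time as the complexity measure, and all numbers are made up for illustration; they are not taken from MPEG test conditions.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class OperatingPoint:
    name: str
    distortion: float  # e.g., MSE of rendered test views (lower is better)
    rate_kbps: float   # bitrate of the encoded splat stream
    decode_ms: float   # decode + render time per frame on the target device

def rdc_cost(p: OperatingPoint, lam: float, mu: float) -> float:
    """Lagrangian rate-distortion-complexity cost J = D + lam * R + mu * C."""
    return p.distortion + lam * p.rate_kbps + mu * p.decode_ms

def select(points: List[OperatingPoint], lam: float, mu: float) -> OperatingPoint:
    """Return the operating point with the lowest RDC cost."""
    return min(points, key=lambda p: rdc_cost(p, lam, mu))

# Made-up numbers, for illustration only.
CANDIDATES = [
    OperatingPoint("high-quality", distortion=9.0, rate_kbps=8000, decode_ms=30),
    OperatingPoint("balanced", distortion=14.0, rate_kbps=4000, decode_ms=15),
    OperatingPoint("lightweight", distortion=22.0, rate_kbps=2000, decode_ms=8),
]

if __name__ == "__main__":
    for mu in (0.0, 0.5, 2.0):  # increasing weight on decode/render complexity
        print(mu, select(CANDIDATES, lam=0.001, mu=mu).name)
```

With these numbers, ignoring complexity (μ = 0) selects the high-quality point, a moderate weight (μ = 0.5) selects the balanced point, and a strong weight (μ = 2.0), as might suit a mobile lightweight profile, selects the lightweight point. The same framework extends naturally to device-specific complexity models.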

MPEG Immersive Video 2nd edition (white paper)

The second edition of MPEG Immersive Video defines an interoperable bitstream and decoding process for efficient 6DoF immersive scene playback, supporting translational and rotational movement with motion parallax to reduce discomfort often associated with pure 3DoF viewing. The second edition primarily extends functionality (without changing the high-level bitstream structure), adding capabilities such as capture-device information, additional projection types, and support for Simple Multi-Plane Image (MPI), alongside tools that better support geometry and attribute handling and depth-related processing.

Architecturally, MIV ingests multiple (unordered) camera views with geometry (depth and occupancy) and attributes (e.g., texture), then reduces inter-view redundancy by extracting patches and packing them into 2D “atlases” that are compressed using conventional video codecs. MIV-specific metadata signals how to reconstruct views from the atlases. The standard is built as an extension of the common Visual Volumetric Video-based Coding (V3C) bitstream framework shared with V-PCC, with profiles that preserve backward compatibility while introducing a new profile for added second-edition functionality and a tailored profile for full-plane MPI delivery.
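Atlas packing, the step that places rectangular patches extracted from several views into a fixed-size 2D frame, is essentially a bin-packing problem. The sketch below uses a simple shelf heuristic (sort patches by height, fill rows left to right, open a new row when one is full) to show the idea. It is a generic illustration, not the packing algorithm of the MIV reference software, and the patch identifiers, sizes, and function name are hypothetical.

```python
from typing import Dict, List, Optional, Tuple

Patch = Tuple[str, int, int]  # (patch id, width, height) in pixels

def shelf_pack(patches: List[Patch], atlas_w: int,
               atlas_h: int) -> Optional[Dict[str, Tuple[int, int]]]:
    """Place patches row by row; return {id: (x, y)} or None if they do not fit."""
    placements: Dict[str, Tuple[int, int]] = {}
    # Tallest patches first, so each shelf's height is set by its first patch.
    order = sorted(patches, key=lambda p: p[2], reverse=True)
    x = y = shelf_h = 0
    for pid, w, h in order:
        if w > atlas_w:
            return None
        if x + w > atlas_w:  # current shelf is full: open a new one below it
            y += shelf_h
            x = shelf_h = 0
        if y + h > atlas_h:  # no vertical space left in the atlas
            return None
        placements[pid] = (x, y)
        x += w
        shelf_h = max(shelf_h, h)
    return placements

if __name__ == "__main__":
    patches = [("p0", 512, 256), ("p1", 256, 256), ("p2", 512, 128), ("p3", 256, 128)]
    print(shelf_pack(patches, atlas_w=1024, atlas_h=512))
```

Here the two 256-pixel-tall patches share the first shelf and the two shorter ones share the second, leaving the right-hand strip of the atlas empty. Real packers work harder to reduce such waste, since unused atlas area still costs bits and decoder throughput.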

Research aspects: Key SIGMM topics include systems-efficient 6DoF delivery (better view and patch selection and atlas packing under latency and bandwidth constraints); rate-distortion-complexity-QoE optimization that accounts for decode and render cost (especially on HMD and mobile) and motion-parallax comfort; adaptive delivery strategies (representation ladders, viewport and pose-driven bit allocation, robust packetization and error resilience for atlas video plus metadata); renderer-aware metrics and subjective protocols for multi-view temporal consistency; and deployment-oriented work such as profile and level tuning, codec-group choices (HEVC / VVC), conformance testing, and exploiting second-edition features (capture device info, depth tools, Simple MPI) for more reliable reconstruction and improved user experience.

Concluding Remarks

The meeting outcomes highlight a clear shift toward immersive and AI-enabled media systems where compression, rendering, delivery, and evaluation must be co-designed. These developments offer timely opportunities for the ACM SIGMM community to contribute reproducible benchmarks, perceptual metrics, and end-to-end streaming and systems research that can directly influence emerging standards and deployments.

The 154th MPEG meeting will be held in Santa Eulària, Spain, from April 27 to May 1, 2026. Click here for more information about MPEG meetings and ongoing developments.

ACM SIGMM Multimodal Reasoning Workshop

The ACM SIGMM Multimodal Reasoning Workshop was held on 8–9 November 2025 at the Indian Institute of Technology Patna (IIT Patna) in hybrid mode. Organised by Dr. Sriparna Saha, faculty member of the Department of Computer Science and Engineering, IIT Patna, and supported by ACM SIGMM, the two-day event brought together researchers, students, and practitioners to discuss the foundations, methods, and applications of multimodal reasoning and generative intelligence. The workshop registered 108 participants and featured invited talks, tutorials, and hands-on sessions by national and international experts. Sessions covered topics ranging from trustworthy AI, LLM fine-tuning, and temporal and multimodal reasoning to knowledge-grounded visual question answering and healthcare applications.

Inauguration:

 During the inauguration, Dr. Sriparna Saha welcomed participants and acknowledged the presence and support of Prof. Jimson Mathew (Dean of Student Affairs, IIT Patna), Prof. Rajiv Ratn Shah (IIIT Delhi) and all speakers. The organising committee expressed gratitude to ACM SIGMM for its financial support, which made the workshop possible. Felicitations were exchanged, and the inauguration concluded with words of encouragement for active participation and interdisciplinary collaboration.

Session summaries and highlights:

Day 1:

 The first day began with an inaugural session followed by a series of engaging talks and tutorials. Prof. Rajiv Ratn Shah (Associate Professor, IIIT Delhi, India) delivered the opening talk on “Tackling Multimodal Challenges with AI: From User Behavior to Content Generation,” highlighting the role of multimodal data in understanding user behaviour, enabling AI-driven content generation, and building region-specific applications such as voice conversion and video dubbing for Indic languages. Prof. Sriparna Saha (Associate Professor, Department of Computer Science and Engineering, IIT Patna, India) then presented “Harnessing Generative Intelligence for Healthcare: Models, Methods, and Evaluations,” discussing safe, domain-specific AI systems for healthcare, multimodal summarisation for low-resource languages, and evaluation frameworks like M3Retrieve and multilingual trust benchmarks. The afternoon sessions featured hands-on tutorials and technical talks. Ms. Swagata Mukherjee (Research Scholar, IIT Patna, India) conducted a tutorial on “Advanced Prompting Techniques for Large Language Models,” covering zero/few-shot prompting, chain-of-thought reasoning, and iterative refinement strategies. Mr. Rohan Kirti (Research Scholar, IIT Patna, India) led a tutorial on “Exploring Multimodal Reasoning: Text & Image Embeddings, Augmentation, and VQA,” demonstrating text–image fusion using models such as CLIP, ViLT, and PaliGemma. Prof. José G. Moreno (Associate Professor, IRIT, France) presented “Visual Question Answering about Named Entities with Knowledge-Based Explanation (ViQuAE),” introducing a benchmark for explainable, knowledge-grounded VQA. The day concluded with Prof. Chirag Agarwal (Assistant Professor, University of Virginia, USA) delivering a talk on “Trustworthy AI in the Era of Frontier Models,” which emphasised fairness, safety, and alignment in multimodal and LLM systems.

Day 2:

 The second day continued with high-level technical sessions and tutorials. Prof. Ranjeet Ranjan Jha (Assistant Professor, Department of Mathematics, IIT Patna, India) opened with a talk on “Bridging Deep Learning and Multimodal Reasoning: Generative AI in Real-World Contexts,” tracing the evolution of deep learning into multimodal generative models and discussing ethical and computational challenges in deployment. Prof. Adam Jatowt (Professor, Department of Computer Science, University of Innsbruck, Austria) followed with a presentation on “Analyzing and Improving Temporal Reasoning Capabilities of Large Language Models,” showcasing models and benchmarks such as BiTimeBERT, TempRetriever, and ComplexTempQA, while proposing methods to enhance time-sensitive reasoning. The final technical session featured a tutorial on “LLM Fine-Tuning” by Mr. Syed Ibrahim Ahmad (Research Scholar, IIT Patna, India), which covered PEFT approaches, QLoRA, quantization, and optimization techniques for fine-tuning large models efficiently.

Valedictory session:

 The valedictory session marked the formal close of the workshop. Dr. Sriparna Saha thanked speakers, participants and the organising team for active engagement across technical talks and tutorials. Participants shared positive feedback on the depth and practicality of sessions. Certificates were distributed to attendees. Final remarks encouraged continued research, collaboration and dissemination of resources. Dr. Saha reiterated gratitude to ACM SIGMM for financial support.

Outcomes, observations and suggested actions:

  • Multimodal reasoning remains an interdisciplinary challenge that benefits from close collaboration between multimedia, NLP, and application domain experts.
  • Trustworthiness, safety, and evaluation (benchmarks and metrics) are critical for moving multimodal models from demonstration to practice, especially in healthcare and other high-stakes domains.
  • Practical methods for model adaptation (PEFT, quantization) make large models accessible for research groups with limited compute.
  • Datasets and retrieval resources that combine multimodal inputs with external knowledge (as in ViQuAE) are valuable for advancing explainable VQA and grounded reasoning.
  • The community should prioritise regional and language-diverse resources (Indic languages, code-mixed data) to ensure equitable benefits from multimodal AI.
  • SIGMM and ACM venues can play a role in fostering collaborations via special projects, regional hackathons, grand challenges, and multimodal benchmark initiatives.
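The point about model adaptation and limited compute can be made concrete with some back-of-the-envelope arithmetic. The sketch below assumes LoRA, one common PEFT method, which freezes a d × k weight matrix and trains a low-rank update BA, with B of size d × r and A of size r × k. It also compares the memory needed for model weights at 16-bit and 4-bit precision. The matrix size, rank, and 7-billion-parameter model size are hypothetical values, not figures presented at the workshop, and the memory estimate deliberately ignores activations, optimizer state, and quantization constants.

```python
def lora_trainable_params(d: int, k: int, r: int) -> int:
    """Trainable parameters when a frozen d x k weight W is adapted as W + B @ A,
    with B of shape d x r and A of shape r x k."""
    return r * (d + k)

def weight_memory_gb(num_params: float, bits_per_weight: int) -> float:
    """Memory for the weights alone, in GB (10**9 bytes); excludes activations,
    optimizer state, and any per-block quantization constants."""
    return num_params * bits_per_weight / 8 / 1e9

if __name__ == "__main__":
    d = k = 4096                      # hypothetical projection matrix size
    full = d * k                      # parameters updated by full fine-tuning
    lora = lora_trainable_params(d, k, r=8)
    print(full, lora, f"{100 * lora / full:.2f}%")
    print(weight_memory_gb(7e9, 16), weight_memory_gb(7e9, 4))
```

For a 4096 × 4096 matrix, full fine-tuning updates 16,777,216 parameters, whereas a rank-8 LoRA update trains 65,536, about 0.39% of them. Storing the weights of a 7-billion-parameter model takes about 14 GB at 16 bits versus about 3.5 GB at 4 bits. Together these savings are what bring fine-tuning within reach of a single modest GPU in approaches such as QLoRA.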

Outreach & social media:

 The workshop generated significant visibility on LinkedIn and other professional networks. Photos and session highlights were widely shared by participants and organisers, acknowledging ACM SIGMM support and the quality of the technical programme.

Acknowledgements: The organising committee thanks all speakers, attendees, and student volunteers, and ACM SIGMM for the financial and logistical support that enabled the workshop.

Reports from ACM Multimedia 2025

URL: https://acmmm2025.org

Date: Oct 27 – Oct 31, 2025
Place: Dublin, Ireland
General Chairs: Cathal Gurrin (ADAPT Centre & DCU), Klaus Schoeffmann (Klagenfurt University), Min Zhang (Tsinghua University)

Introduction

The ACM Multimedia Conference 2025, held in Dublin, Ireland from October 27 to October 31, 2025, continued its tradition as a premier international forum for researchers, practitioners, and industry experts in the field of multimedia. This year’s conference marked an exciting return to Europe, bringing the community together in a city renowned for its rich cultural heritage, innovation-driven ecosystem, and welcoming atmosphere. ACM MM 2025 provided a dynamic platform for presenting state-of-the-art research, discussing emerging trends, and fostering collaboration across diverse areas of multimedia computing.

Hosted in Dublin—a vibrant hub for both technology and academia—the conference delivered a seamless and engaging experience for all attendees. As part of its ongoing mission to support and encourage the next generation of multimedia researchers, SIGMM awarded Student Travel Grants to assist students facing financial constraints. Each recipient received up to 1,000 USD to help offset travel and accommodation expenses. Applicants completed an online form, and the selection committee evaluated candidates based on academic excellence, research potential, and demonstrated financial need.

To shed light on the experiences of these outstanding young scholars, we interviewed several travel grant recipients about their participation in ACM MM 2025 and the conference’s influence on their academic and professional development. Their reflections are shared below.

Wang Zihao – Zhejiang University

This was not my first time attending ACM Multimedia—I also participated in ACMMM 2022—but coming back in 2025 has been just as fantastic. There were many memorable moments, but two of them stood out the most for me. The first was the beautiful violin performance during the Volunteer Dinner, which created such a warm and elegant atmosphere. The second was the Irish drumming performance at the conference banquet on October 30th. It was incredibly energetic and truly unforgettable. These moments reminded me how special it is to be part of this community, where academic exchange and cultural experiences blend together so naturally.

I am truly grateful for the SIGMM Student Travel Grant. The financial support made it possible for me to attend the conference in person, and I really appreciate the effort that SIGMM puts into supporting students. One of the most valuable aspects of this trip was meeting researchers from all over the world who work in areas similar to mine—especially those focusing on music, audio, and multimodality. Having deep, face-to-face conversations with them was inspiring and has given me many new ideas to explore in my future research.

As for suggestions, I honestly think the SIGMM Student Travel Grant program is already doing an amazing job in supporting young scholars like us. My only small hope is for a smooth reimbursement process.

Overall, I feel incredibly fortunate to be here again, reconnecting with the ACM MM community and learning so much from everyone. I’m thankful for this opportunity and excited to continue growing in this field.

Huang Feng-Kai – National Taiwan University

Attending ACM Multimedia 2025 in Dublin was my first time joining the conference, and it has been an unforgettable experience. Everything was so well-organized, and I truly enjoyed every moment. The welcome reception was especially memorable—the food was delicious, the atmosphere was lively, and it was inspiring to see so many renowned researchers and professors chatting enthusiastically. It really felt like the perfect start to my ACM MM journey.

I am deeply grateful to SIGMM for the Student Travel Grant. As a student traveling all the way from Taiwan, attending a conference in Europe is a major financial challenge. The grant covered my accommodation, meals, and flights, which made it possible for me to participate without worrying too much about the cost. Being here has really broadened my horizons. I was able to learn about so many fascinating research topics and meet many kind, talented researchers who generously shared their thoughts with me. These conversations gave me a lot of inspiration for my own work.

I also had the chance to serve as a volunteer, which became my first experience working with an international team. Collaborating with people from different cultural and academic backgrounds helped me improve my communication skills and made the conference even more meaningful.

I truly believe the SIGMM Student Travel Grant is an amazing program that enables students from all over the world to join this vibrant community, exchange ideas, and form new collaborations. My only wish is that SIGMM will continue offering this opportunity in the future. This grant brings so much energy to young researchers like me and plays an important role in supporting the next generation of the multimedia community. I am sincerely thankful for everything this experience has given me, and I look forward to returning to ACM Multimedia in the coming years.

Wang Hao – Peking University

Attending ACM Multimedia 2025 was my very first time participating in the conference, and the experience was truly amazing. The moment that impressed me the most was having the chance to present my own paper. As a non-native English speaker, giving an academic talk on an international stage was both challenging and rewarding. I felt nervous at first, but I’m really proud of how I managed to deliver my presentation. It was a big milestone for me.

I’m incredibly grateful to SIGMM for the Student Travel Grant, which made it possible for me to attend an international conference for the first time. Without this support, I wouldn’t have been able to experience such a meaningful academic event. Throughout the conference, I met so many new friends, attended inspiring talks, and gained fresh perspectives on multimedia research. These experiences have broadened my view of the field and will definitely influence the direction of my future work.

I’m thankful for this opportunity and truly appreciate how welcoming and encouraging the ACM MM community is. This conference has given me motivation and confidence to continue growing as a researcher.

Yu Liu (University of Electronic Science and Technology of China)

This is my first time attending ACM Multimedia. I am a PhD student at the University of Electronic Science and Technology of China (UESTC), currently spending a year at the University of Auckland, New Zealand, as part of a joint PhD program. It took 25 hours to travel from Auckland to Dublin, but the journey was completely worth it. The conference has been vibrant and intellectually engaging. I had the honor of being the first speaker in my session, and it was incredibly fulfilling to see the audience show genuine interest and appreciation for our work. Outside the sessions, I thoroughly enjoyed immersing myself in Irish culture—tasting the smooth, rich Guinness, watching lively tap dancing, and listening to traditional Irish music. Overall, it has been an inspiring and truly memorable experience.

The SIGMM Student Travel Grant played a vital role in making my attendance possible. In recent years, UESTC has discontinued funding for PhD students’ conference travel, transferring the financial responsibility entirely to individual research groups. Receiving this grant was crucial, allowing me to attend ACM MM 2025 without placing additional strain on my research team’s limited budget. Attending the conference in person provided an invaluable opportunity to present my research, exchange ideas face-to-face with international scholars, and receive constructive feedback from leading experts. These experiences fostered meaningful academic connections and opened doors for potential long-term collaborations that online participation simply cannot replace.

My biggest takeaway from ACM MM 2025 is the inspiration I gained from being part of such a diverse and passionate research community, which has motivated me to continue advancing in the field of responsible AI. I also really enjoyed the volunteer “Thank You” dinner—it was a wonderful experience. At the same time, I noticed that it is not always easy for students to approach professors they do not know personally. In the future, including short icebreaker or networking activities could help start conversations more naturally, making the conference experience even more valuable for students like me.

Li Deng (LUT University)

This is my first time attending ACM Multimedia, and my experience has been exceptionally positive. I was particularly impressed by the workshops relevant to my research area, as the discussions provided valuable insights that are already influencing my ongoing work. I was also struck by the abundance and quality of the social events and networking opportunities, which made it easy to connect with senior researchers and fellow students from diverse backgrounds.

Receiving the SIGMM Student Travel Grant significantly reduced the financial burden of travel and accommodation, allowing me to attend the conference in person without major financial stress. The opportunity to present my work and engage in discussions with leading researchers has greatly supported my academic development. I received direct feedback and established connections that may lead to future collaborations. My biggest takeaway from ACM MM 2025 is a deeper understanding of the rapid development and impact of multimodal large language models.

Looking ahead, I suggest that the SIGMM Student Travel Grant program collaborate with the main conference to organize sessions such as a “Career Forum” for grant recipients and other student volunteers, providing additional guidance and support for early-career researchers.

JPEG Column: 109th JPEG Meeting in Nuremberg, Germany

JPEG XS developers awarded the Engineering, Science and Technology Emmy®.

The 109th JPEG meeting was held in Nuremberg, Germany, from 12 to 17 October 2025.

This JPEG meeting began with the excellent news that JPEG XS developers Fraunhofer IIS and intoPIX were awarded the Engineering, Science and Technology Emmy® for their contributions to the development of the JPEG XS standard.

Furthermore, the 109th JPEG meeting was marked by several major achievements: JPEG Trust Part 2 on Trust Profiles and Reports, which complements Part 1 with several profiles for various usage scenarios, reached Committee Draft stage; JPEG AIC Part 3 was finalised for publication by ISO; JPEG XE Part 1 reached DIS stage; and the Calls for Proposals on objective quality assessment, JPEG AIC-4 and JPEG Pleno Quality Assessment of Light Field, received several responses.

Photo: Fraunhofer IIS and intoPIX representatives with the awarded Engineering, Science and Technology Emmy®.

The following sections summarise the main highlights of the 109th JPEG meeting:
  • JPEG Trust Part 2 on Trust Profiles and Reports reaches Committee Draft stage.
  • JPEG AIC-4 receives responses to the Call for Proposals on Objective Image Quality Assessment.
  • JPEG XE Part 1, the core coding system, reaches DIS stage.
  • JPEG XS Part 1 AMD 1 reaches DIS stage.
  • JPEG AI Part 2 (Profiling), Part 3 (Reference Software), and Part 5 (File Format) approved as International Standards.
  • JPEG DNA designed the wet-lab experiments, including DNA synthesis/sequencing.
  • JPEG Pleno receives responses to the Call for Proposals on Objective Metrics for Light Field Quality Assessment.
  • JPEG RF establishes frameworks for coding and quality assessment of radiance fields.
  • JPEG XL initiates embedding of JPEG XL in ISOBMFF/HEIF.

JPEG Trust

At the 109th JPEG Meeting, the JPEG Committee reached a key milestone with the completion of the Committee Draft (CD) for JPEG Trust Part 2 – Trust Profiles and Reports (ISO/IEC 21617-2). Building on the framework established in Part 1 (Core Foundation), this new specification further refines Trust Profiles and Trust Reports and provides several example profiles and reusable profile snippets for adoption in diverse usage scenarios.

Compared to earlier drafts, the new Trust Profiles specification introduces templates and dynamic metadata blocks, offering enhanced flexibility while maintaining full backwards compatibility for existing profiles. This flexibility is also reflected in the updated Trust Reports, which can now be more easily tailored to specific usage scenarios. This new specification sets the stage for user communities to build their own Trust Profiles and customise them to their specific needs.

In addition to the CD on Part 2, the committee also produced a CD of Part 4 – Reference Software. This specification provides a reference implementation and reference dataset of the Core Foundation. The reference software will be extended with additional implementations in the future.

Finally, the committee also advanced Part 3 – Media Asset Watermarking. The Terms and Definitions and Use Cases and Requirements documents are now publicly available on the JPEG website. The development of Part 3 is progressing on schedule, with the Committee Draft stage targeted for January 2026.

JPEG AIC

The JPEG AIC-3 standard, which specifies a methodology for fine-grained subjective image quality assessment in the range from good quality up to mathematically lossless, was finalised at the 109th JPEG meeting and will be published as International Standard ISO/IEC 29170-3.

In response to the JPEG AIC-4 Call for Proposals on Objective Image Quality Assessment, four proposals were received and presented. A large-scale subjective experiment has been prepared in order to evaluate the proposals.

JPEG XE

JPEG XE is a joint effort between ITU-T SG21 and ISO/IEC JTC1/SC29/WG1 and will become the first specification for the coding of events to be internationally endorsed by the major standardization bodies ITU-T, ISO, and IEC. It aims to establish a robust and interoperable format for the efficient representation and coding of events in the context of machine vision and related applications. To expand the reach of JPEG XE, the JPEG Committee has closely coordinated its activities with the MIPI Alliance with the intention of developing a cross-compatible coding mode, allowing MIPI ESP signals to be decoded effectively by JPEG XE decoders.

At the 109th JPEG Meeting, the DIS of JPEG XE Part 1, the core coding system, was prepared. This part specifies the low-complexity and low-latency lossless coding technology that will be the foundation of JPEG XE. Reaching DIS stage is a major milestone and freezes the core coding technology for the first edition of JPEG XE. The JPEG Committee plans to further improve the coding performance and to provide additional lossless and lossy coding modes, scheduled to be developed in 2026. While the DIS of Part 1 is under ballot for approval as an International Standard, the JPEG Committee initiated the work on Part 2 of JPEG XE to define the profiles and levels. A DIS of Part 2 is planned to be ready for ballot in January 2026.
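To make "lossless coding of events" a little more concrete, here is a minimal Python sketch. It is not the JPEG XE coding technology, whose details are not described in this report. It assumes a hypothetical event record (pixel x, y, a timestamp in microseconds, and a 0/1 polarity) and shows one generic low-complexity lossless idea: storing time-ordered timestamps as small deltas in variable-length (LEB128-style) integers. The record and function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    """Hypothetical event record: pixel column/row, timestamp (µs), polarity (0 or 1)."""
    x: int
    y: int
    t: int
    p: int


def _put_varint(value: int, out: bytearray) -> None:
    """Append an unsigned integer as LEB128: 7 bits per byte, high bit means 'more follows'."""
    if value < 0:
        raise ValueError("varint values must be non-negative")
    while True:
        byte = value & 0x7F
        value >>= 7
        if value:
            out.append(byte | 0x80)
        else:
            out.append(byte)
            return


def _get_varint(data: bytes, pos: int) -> tuple[int, int]:
    """Read one LEB128 integer starting at pos; return (value, next_pos)."""
    value, shift = 0, 0
    while True:
        byte = data[pos]
        pos += 1
        value |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return value, pos
        shift += 7


def encode_events(events: list[Event]) -> bytes:
    """Losslessly pack time-ordered events, storing each timestamp as a delta."""
    out = bytearray()
    _put_varint(len(events), out)
    prev_t = 0
    for e in events:
        if e.t < prev_t:
            raise ValueError("events must be sorted by timestamp")
        if e.p not in (0, 1):
            raise ValueError("polarity must be 0 or 1")
        _put_varint(e.t - prev_t, out)  # deltas stay small for dense event streams
        _put_varint(e.x, out)
        _put_varint(e.y, out)
        out.append(e.p)
        prev_t = e.t
    return bytes(out)


def decode_events(data: bytes) -> list[Event]:
    """Invert encode_events, rebuilding absolute timestamps from the deltas."""
    count, pos = _get_varint(data, 0)
    events, t = [], 0
    for _ in range(count):
        dt, pos = _get_varint(data, pos)
        x, pos = _get_varint(data, pos)
        y, pos = _get_varint(data, pos)
        p = data[pos]
        pos += 1
        t += dt
        events.append(Event(x, y, t, p))
    return events
```

Round-tripping `[Event(10, 20, 1000, 1), Event(11, 20, 1003, 0), Event(640, 480, 1010, 1)]` through `encode_events` and `decode_events` returns the same list in 16 bytes. A real event codec would also exploit spatial correlation between events; this sketch only shows the lossless, low-latency flavour of the problem.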

With JPEG XE Part 1 under ballot and Part 2 in the pipeline, the JPEG Committee remains committed to the development of a comprehensive and industry-aligned standard that meets the growing demand for event-based vision technologies. The collaborative approach between multiple standardisation organisations underscores a shared vision for a unified, international standard to accelerate innovation and interoperability in this emerging field.

JPEG XS

The JPEG Committee is extremely proud to announce that the two companies behind the development of JPEG XS, intoPIX and Fraunhofer IIS, were awarded an Emmy® for Engineering, Science, and Technology for their role in the development of the JPEG XS standard. The awards ceremony was held on October 14th, 2025, at the Television Academy’s Saban Media Center in North Hollywood, California. This award recognizes JPEG XS as a state-of-the-art image compression format that transmits high-quality images with minimal latency and low resource consumption, at visually near-lossless image quality. It affirms that JPEG XS is a fundamental game changer for real-time video transmission in live, professional video, and broadcast applications, and that it is being heavily adopted by the industry.

Nevertheless, the work to further improve JPEG XS continues. In this context, the DIS of AMD 1 of JPEG XS Part 1 is currently under ballot at ISO and is expected to be ready by January 2026. This amendment enables the embedding of sub-frame metadata into JPEG XS, as required by augmented and virtual reality applications currently being discussed within VESA. The JPEG Committee also initiated the steps to start an amendment for Part 2 (Profiles and buffer models) that will define additional sublevels needed to support on-the-fly proxy-level extraction (i.e., lower-resolution streams from a master stream) without recompression. The amendment is planned to go to DIS ballot at the 110th JPEG meeting in Sydney, Australia.
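The idea of extracting a lower-resolution proxy without recompression can be illustrated with a wavelet decomposition. The Python sketch below uses a one-level 2-D Haar transform purely as a toy; it is not the wavelet filter, sublevel definition, or codestream layout of JPEG XS, and the function names are hypothetical. Keeping only the low-pass (LL) band yields a half-resolution image, while all four bands reconstruct the original exactly, so a proxy can be obtained by selecting data rather than re-encoding it.

```python
def haar_analysis(img):
    """One 2-D Haar level on an even-sized image (list of rows).

    Returns (ll, lh, hl, hh); ll is the half-resolution block average.
    """
    h, w = len(img), len(img[0])
    if h % 2 or w % 2:
        raise ValueError("image dimensions must be even")
    ll, lh, hl, hh = ([[0.0] * (w // 2) for _ in range(h // 2)] for _ in range(4))
    for i in range(0, h, 2):
        for j in range(0, w, 2):
            a, b = img[i][j], img[i][j + 1]
            c, d = img[i + 1][j], img[i + 1][j + 1]
            ll[i // 2][j // 2] = (a + b + c + d) / 4
            lh[i // 2][j // 2] = (a - b + c - d) / 4  # left-minus-right difference
            hl[i // 2][j // 2] = (a + b - c - d) / 4  # top-minus-bottom difference
            hh[i // 2][j // 2] = (a - b - c + d) / 4  # diagonal difference
    return ll, lh, hl, hh


def haar_synthesis(ll, lh, hl, hh):
    """Exact inverse of haar_analysis: rebuild the full-resolution image."""
    h, w = 2 * len(ll), 2 * len(ll[0])
    img = [[0.0] * w for _ in range(h)]
    for i in range(len(ll)):
        for j in range(len(ll[0])):
            s, x, y, z = ll[i][j], lh[i][j], hl[i][j], hh[i][j]
            img[2 * i][2 * j] = s + x + y + z
            img[2 * i][2 * j + 1] = s - x + y - z
            img[2 * i + 1][2 * j] = s + x - y - z
            img[2 * i + 1][2 * j + 1] = s - x - y + z
    return img
```

A decoder that only needs the proxy can stop at `ll`; one that needs full resolution also reads the three detail bands.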

JPEG AI

During the 109th JPEG meeting, the JPEG AI project achieved major milestones, with Part 2 (Profiling), Part 3 (Reference Software), and Part 5 (File Format) approved as International Standards. Meanwhile, Part 4 (Conformance) is proceeding to publication after a positive ballot. The Core Experiments confirmed that JPEG AI outperforms state-of-the-art codecs in compression efficiency and demonstrated a decoder implementation based on the SADL library.

JPEG DNA

During the 109th JPEG meeting, the JPEG Committee designed the wet-lab experiments, including DNA synthesis/sequencing, with results expected by January 2026. The primary objective of the wet-lab experiments is to validate the technical specifications outlined in the current DIS study text of ISO/IEC 25508-1 under realistic DNA media storage procedures. Additional efforts are underway in a new Core Experiment studying the performance of a codec-dependent unequal error correction technique, which is expected to lead to the future publication of JPEG DNA Part 2 – Profiles and levels.

JPEG Pleno

JPEG Pleno marked a pivotal step toward the forthcoming ISO/IEC 21794-7 standard, Light Field Quality Assessment. The new Part 7 was officially approved for inclusion in the ISO/IEC work programme, confirming international support for standardizing light field quality assessment methodologies. Moreover, in response to the Call for Proposals on Objective Metrics for Light Field Quality Assessment, three proposals were received and presented. In preparation for the evaluation of the proposals submitted in response to the CfP, an evaluation dataset was released and discussed during the meeting. The next milestone is the execution of a Subjective Quality Assessment on the evaluation dataset to evaluate the proposed objective metrics by the 110th JPEG meeting in Sydney. To this end, the methodological design and preparation of the subjective test were discussed and finalized, marking an important step toward developing the standardization framework for objective light field quality assessment.

The JPEG Pleno Workshop on Emerging Coding Technologies for Plenoptic Modalities was conducted at the 109th meeting with presentations from Touradj Ebrahimi (JPEG Convenor), Peter Schelkens (JPEG Plenoptic Coding and Quality Sub-Group Chair), Aljosa Smolic (Hochschule Luzern), Søren Otto Forchhammer (Danmarks Tekniske Universitet), Giuseppe Valenzise (Université Paris-Saclay), Amr Rizk (Leibniz Universität Hannover), Michael Rudolph (Leibniz Universität Hannover), and Irene Viola (Centrum Wiskunde & Informatica).

JPEG RF

At the 109th JPEG Meeting, the exploration activity on JPEG Radiance Fields (JPEG RF) continued its progress toward establishing frameworks for coding and quality assessment of radiance fields. The group updated the drafts of the Use Cases and Requirements and the Common Test Conditions, and discussed the outcomes of an Exploration Study that examined the impact of camera trajectory design on human perception during subjective quality assessment. These discussions refined methodological guidelines for trajectory generation and the subjective assessment procedures. Building on this progress, Exploration Study 6 was launched to benchmark the complete assessment framework through a subjective experiment using the developed protocols. Outreach activities were also planned to engage additional stakeholders and support further development ahead of the 110th JPEG Meeting in Sydney, Australia.

JPEG XL

At the 109th JPEG meeting, work started on embedding JPEG XL in ISOBMFF/HEIF. This will be specified in a new edition of ISO/IEC 18181-2, which has been initiated.

Final Quote

“During the 109th JPEG Meeting, the JPEG Committee reached several important milestones. In particular, JPEG Trust continues its development with the addition of new Parts towards the creation of a reliable and effective standard that restores authenticity and provenance of the multimedia information,” said Prof. Touradj Ebrahimi, the Convenor of the JPEG Committee.

VQEG Column: Finalization of Recommendation Series P.1204, a Multi-Model Video Quality Evaluation Standard – The New Standards P.1204.1 and P.1204.2

Abstract

This column introduces the now completed ITU-T P.1204 video quality model standards for assessing sequences up to UHD/4K resolution. Initially developed over two years by ITU-T Study Group 12 (Question Q14/12) and VQEG, the work used a large dataset of 26 subjective tests (13 for training, 13 for validation), each involving at least 24 participants rating sequences on the 5-point ACR scale. The tests covered diverse encoding settings, bitrates, resolutions, and framerates for H.264/AVC, H.265/HEVC, and VP9 codecs. The resulting 5,000-sequence dataset forms the largest lab-based source for model development to date. Initially standardized were P.1204.3, a no-reference bitstream-based model with full bitstream access, P.1204.4, a pixel-based, reduced-/full-reference model, and P.1204.5, a no-reference hybrid model. The current record focuses on the latest additions to the series, namely P.1204.1, a parametric, metadata-based model using only information about which codec was used, plus bitrate, framerate and resolution, and P.1204.2, which in addition uses frame-size and frame-type information to include video-content aspects into the predictions.
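Each subjective test turns the participants' 5-point ACR votes into a Mean Opinion Score (MOS) per sequence. A minimal Python sketch of that aggregation with a confidence interval follows; the normal-approximation interval is an assumption chosen for simplicity, not necessarily the exact statistical procedure used in the P.NATS Phase 2 tests, and the function name is hypothetical.

```python
import math
import statistics


def mos_with_ci(ratings, z=1.96):
    """Return (MOS, CI half-width) for one sequence's 5-point ACR ratings.

    Uses a normal approximation (z = 1.96 for ~95 %); a Student-t quantile
    is more exact for small panels (about 2.07 for 24 ratings).
    """
    if len(ratings) < 2:
        raise ValueError("need at least two ratings")
    if any(r not in (1, 2, 3, 4, 5) for r in ratings):
        raise ValueError("ACR ratings must be integers from 1 to 5")
    mos = statistics.fmean(ratings)
    half_width = z * statistics.stdev(ratings) / math.sqrt(len(ratings))
    return mos, half_width
```

For 24 ratings made up of twelve 4s, six 5s, and six 3s, this gives a MOS of 4.0 with a half-width of about 0.29.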

Introduction

Video quality under specific encoding settings is central to applications such as VoD, live streaming, and audiovisual communication. In HTTP-based adaptive streaming (HAS) services, bitrate ladders define video representations across resolutions and bitrates, balancing screen resolution and network capacity. Video quality, a key contributor to users’ Quality of Experience (QoE), can vary with bandwidth fluctuations, buffer delays, or playback stalls. 
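A bitrate ladder can be thought of as a list of (resolution, bitrate) rungs from which a HAS client picks one per segment. The Python sketch below uses a hypothetical four-rung ladder and a simple throughput rule with an assumed 0.8 safety factor; real players use more elaborate adaptation logic (buffer level, throughput history), and the ladder values are illustrative only.

```python
from typing import NamedTuple


class Rung(NamedTuple):
    height: int          # vertical resolution in lines, e.g. 1080
    bitrate_kbps: int


# Hypothetical ladder for illustration; real ladders are tuned per service or per title.
LADDER = [Rung(360, 800), Rung(720, 2500), Rung(1080, 5000), Rung(2160, 15000)]


def select_rung(throughput_kbps, ladder=LADDER, safety=0.8):
    """Highest-bitrate rung fitting within safety * throughput, else the lowest rung."""
    budget = throughput_kbps * safety
    fitting = [r for r in ladder if r.bitrate_kbps <= budget]
    if fitting:
        return max(fitting, key=lambda r: r.bitrate_kbps)
    return min(ladder, key=lambda r: r.bitrate_kbps)
```

With 4 Mbit/s of measured throughput, the budget is 3.2 Mbit/s and the 720p rung is chosen; if throughput drops below the lowest rung, the client still falls back to 360p rather than selecting nothing.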

While such quality fluctuations and broader QoE aspects are discussed elsewhere, this record focuses on short-term video quality as modeled by ITU-T P.1204 for HAS-type content. These models assess segments of around 10s under reliable transport (e.g., TCP, QUIC), covering resolution, framerate, and encoding effects, but excluding pixel-level impairments from packet loss under unreliable transport.

Because video quality is perceptual, subjective tests, whether laboratory-based or crowdsourced, remain essential, especially at high resolutions such as 4K UHD under controlled viewing conditions (a viewing distance of 1.5H or 1.6H, i.e., 1.5 or 1.6 times the picture height). Yet, studies show limited perceptual gain between HD and 4K, depending on source content, underlining the need for representative test materials. Given the high cost of such tests, objective (instrumental) models are required for scalable, automated assessment supporting applications like bitrate ladder design and service monitoring.
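The 1.5H and 1.6H viewing distances are multiples of the picture height H. The short Python calculation below converts them into metres for a display of a given diagonal, assuming a 16:9 aspect ratio; the 55-inch example screen is an arbitrary choice, and the function name is hypothetical.

```python
import math


def viewing_distance_m(diagonal_inch, multiple_of_height=1.5, aspect=(16, 9)):
    """Viewing distance in metres as a multiple of picture height H."""
    w, h = aspect
    height_inch = diagonal_inch * h / math.hypot(w, h)  # H from diagonal and aspect ratio
    return multiple_of_height * height_inch * 0.0254    # 1 inch = 0.0254 m
```

For a 55-inch 16:9 screen, H is about 0.685 m, so 1.5H is roughly 1.03 m and 1.6H roughly 1.10 m.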

Four main model classes exist: metadata-based, bitstream-based, pixel-based, and hybrid. Metadata-based models use codec parameters (e.g., resolution, bitrate) and are lightweight; bitstream-based models analyze encoded streams without decoding, as in ITU-T P.1203 and P.1204.3 [1][2][3][7]. Pixel-based models compare decoded frames and include Full Reference and Reduced Reference models (e.g., P.1204.4, and also PSNR [9], SSIM [10], VMAF [11][12]), as well as No Reference variants. Finally, hybrid models combine pixel and bitstream or metadata inputs, exemplified by the ITU-T P.1204.5 standard. These three standards, P.1204.3, P.1204.4, and P.1204.5, formed the initial P.1204 Recommendation series finalized in 2020.
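As an example of the simplest full-reference pixel-based measure mentioned above, PSNR compares a decoded frame with its reference through the mean squared error. A minimal Python sketch, assuming 8-bit samples (peak value 255) passed as flat sequences; for video, PSNR is commonly computed per frame (often on luma) and then averaged, which this sketch leaves to the caller.

```python
import math


def psnr(reference, distorted, max_value=255):
    """PSNR in dB between two equally sized frames given as flat sample sequences."""
    if len(reference) != len(distorted) or not reference:
        raise ValueError("frames must be non-empty and of equal size")
    mse = sum((r - d) ** 2 for r, d in zip(reference, distorted)) / len(reference)
    if mse == 0:
        return math.inf  # identical frames
    return 10 * math.log10(max_value ** 2 / mse)
```

An MSE of 1 on 8-bit content corresponds to about 48.1 dB. Models such as P.1204.4 go well beyond this by modelling perceptual effects that PSNR ignores.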

ITU-T P.1204 series completed with P.1204.1 and P.1204.2

The respective standardization project under the Work Item name P.NATS Phase 2 (read: Peanuts) was a unique video quality model development competition conducted in collaboration between ITU-T Study Group 12 (SG12) and the Video Quality Experts Group (VQEG). The target use cases covered resolutions up to UHD/4K, with presentation on UHD/4K PC/TV or Mobile/Tablet (MO/TA) devices. For the first time, bitstream-based, pixel-based, and hybrid models were jointly developed, trained, and validated, using a large common subjective dataset comprising 26 tests, each with at least 24 participants (see, e.g., [1] for details). The P.NATS Phase 2 work built on the earlier “P.NATS Phase 1” project, which resulted in the ITU-T Rec. P.1203 standards series (P.1203, P.1203.1, P.1203.2, P.1203.3). In the P.NATS Phase 2 project, video quality models in five different categories were evaluated, and different candidates were found to be eligible for recommendation as standards. The three initially standardized models, out of the five categories, were the aforementioned P.1204.3, P.1204.4, and P.1204.5. However, due to a lack of consensus among the winning proponents, no models were recommended as standards for the categories “bitstream Mode 0”, with access to high-level metadata only (such as the video codec, resolution, framerate, and bitrate used), and “bitstream Mode 1”, with further access to frame-size information that can be used for content-complexity estimation.

For the latest model additions of P.1204.1 and P.1204.2, subsets of the databases initially used in the P.NATS Phase 2 project were employed for model training. Two different datasets belonging to the two contexts PC/TV and MO/TA were used for training the models. AVT-PNATS-UHD-1 is the dataset for the PC/TV use case and ERCS-PNATS-UHD-1 the dataset used for the MO/TA use case. 

AVT-PNATS-UHD-1 [7] consists of four different subjective tests conducted by TU Ilmenau as part of the P.NATS Phase 2 competition. The target resolution of these tests was 3840 x 2160 pixels. ERCS-PNATS-UHD-1 [1] is a dataset targeting the MO/TA use case. It consists of one subjective test conducted by Ericsson as part of the P.NATS Phase 2 competition. The target resolution of this dataset was 2560 x 1440 pixels.

For model performance evaluation, beyond AVT-PNATS-UHD-1, further externally available video-quality test databases were used, as outlined in the following.

AVT-VQDB-UHD-1: This is a publicly available dataset consisting of four different subjective tests, all with a full-factorial design. In total, 17 different SRCs with a duration of 7-10 s were used across the four tests. All sources had a resolution of 3840×2160 pixels and a framerate of 60 fps. For the HRC design, bitrates were selected as fixed (i.e., non-adaptive) values per PVS between 200 kbps and 40000 kbps, resolutions between 360p and 2160p, and framerates between 15 fps and 60 fps. In all tests, a 2-pass encoding approach was used, with the medium preset for H.264 and H.265 and the VP9 speed parameter set to the default value “0”. A total of 104 participants took part in the four tests.

GVS: This dataset consists of 24 SRCs extracted from 12 different games. The SRCs have a resolution of 1920×1080 pixels, a framerate of 30 fps, and a duration of 30 s. The HRC design included three different resolutions, namely 480p, 720p, and 1080p. 90 PVSs resulting from 15 bitrate-resolution pairs were used for subjective evaluation. A total of 25 participants rated all 90 PVSs.

KUGVD: Six of the 24 SRCs from GVS were used to develop KUGVD. The same bitrate-resolution pairs as in GVS were used to define the HRCs. In total, 90 PVSs were used in the subjective evaluation, and 17 participants took part in the test.

CGVDS: This dataset consists of SRCs captured at 60 fps from 15 different games. For the HRC design, three resolutions, namely 480p, 720p, and 1080p, at three different framerates of 20, 30, and 60 fps were considered. To ensure that the SRCs from all games could be assessed by test subjects, the overall test was split into five subjective tests, with a minimum of 72 PVSs rated in each test. In total, over 100 participants took part across the five tests, with a minimum of 20 participants per test.

Twitch: The Twitch dataset consists of 36 different games, with six games for each of six pre-defined genres. The dataset consists of streams downloaded directly from Twitch. A total of 351 video sequences of approximately 50 s duration across all representations were downloaded, of which 90 were selected for subjective evaluation. Only the first 30 s of the 90 chosen PVSs were used for subjective testing. Six different resolutions between 160p and 1080p at framerates of 30 and 60 fps were used. 29 participants rated all 90 PVSs.

BBQCG: This is the training dataset developed as part of the P.BBQCG work item. This dataset consists of nine subjective test databases. Three out of these nine test databases consisted of processed video sequences (PVSs) up to 1080p/120fps and the remaining had PVSs up to 4K/60fps. Three codecs, namely, H.264, H.265, and AV1 were used to encode the videos. Overall 900 different PVSs were created from 12 sources (SRCs) by encoding the SRCs with different encoding settings.

AVT-VQDB-UHD-1-VD: This dataset consists of 16 source contents encoded using a CRF-based encoding approach. Overall, 192 PVSs were generated by encoding all 16 sources in four resolutions, namely 360p, 720p, 1080p, and 2160p, with three CRF values (22, 30, 38) each. A total of 40 subjects participated in the study.
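The AVT-VQDB-UHD-1-VD design is full-factorial: every source is combined with every resolution and every CRF value. The Python sketch below enumerates that grid from the levels listed above and confirms the count of 192 PVSs; the source labels SRC01 to SRC16 are hypothetical placeholders.

```python
from itertools import product

SOURCES = [f"SRC{i:02d}" for i in range(1, 17)]  # hypothetical labels for the 16 sources
RESOLUTIONS = ["360p", "720p", "1080p", "2160p"]
CRF_VALUES = [22, 30, 38]

# One processed video sequence (PVS) per (source, resolution, CRF) combination.
pvs_list = [
    {"src": src, "resolution": res, "crf": crf}
    for src, res, crf in product(SOURCES, RESOLUTIONS, CRF_VALUES)
]
print(len(pvs_list))  # 192
```

The same pattern, with bitrate and framerate levels added as further factors, describes the full-factorial tests of AVT-VQDB-UHD-1.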

ITU-T P.1204.1 and P.1204.2 model prediction performance

The performance figures of the two new models, P.1204.1 and P.1204.2, on the different datasets are given in Table 1 (P.1204.1) and Table 2 (P.1204.2) below.

Table 1: Performance of P.1204.1 (Mode 0) on the evaluation datasets in terms of Root Mean Square Error (RMSE; the measure used as the winning criterion in the ITU-T/VQEG modelling competition), Pearson Correlation Coefficient (PCC), Spearman Rank Correlation Coefficient (SRCC), and Kendall’s tau.
Dataset              RMSE                     PCC    SRCC   Kendall
AVT-VQDB-UHD-1       0.499                    0.890  0.877  0.684
KUGVD                0.840                    0.590  0.570  0.410
GVS                  0.690                    0.670  0.650  0.490
CGVDS                0.470                    0.780  0.750  0.560
Twitch               0.430                    0.920  0.890  0.710
BBQCG                0.598 (7-point scale)    0.841  0.843  0.647
AVT-VQDB-UHD-1-VD    0.650                    0.814  0.813  0.617

Table 2: Performance of P.1204.2 (Mode 1) on the evaluation datasets in terms of Root Mean Square Error (RMSE; the measure used as the winning criterion in the ITU-T/VQEG modelling competition), Pearson Correlation Coefficient (PCC), Spearman Rank Correlation Coefficient (SRCC), and Kendall’s tau.
Dataset              RMSE                     PCC    SRCC   Kendall
AVT-VQDB-UHD-1       0.476                    0.901  0.900  0.730
KUGVD                0.500                    0.870  0.860  0.690
GVS                  0.420                    0.890  0.870  0.710
CGVDS                0.360                    0.900  0.880  0.690
Twitch               0.370                    0.940  0.930  0.770
BBQCG                0.737 (7-point scale)    0.745  0.746  0.547
AVT-VQDB-UHD-1-VD    0.598                    0.845  0.845  0.654

For all databases except BBQCG and KUGVD, the Mode 0 model P.1204.1 performs solidly, as shown in Table 1. With the information about frame types and sizes available to the Mode 1 model P.1204.2, performance improves for all datasets except BBQCG, in most cases considerably, as shown in Table 2. For performance results of the three previously standardized models, P.1204.3, P.1204.4, and P.1204.5, the reader is referred to [1] and the individual standards [4][5][6]. For the P.1204.3 model, complementary performance information is presented in, e.g., [2][7]. For P.1204.4, additional performance information is available in [8], including results for AV1, AVS2, and VVC.

The following plots provide an illustration of how the new P.1204.1 Mode 0 model may be used. Here, bitrate-ladder-type graphs are presented, with the predicted Mean Opinion Score on a 5-point scale plotted over log bitrate.


Figure: P.1204.1 (Mode 0) predicted MOS (5-point scale) over log bitrate, with one panel per codec: H.264, H.265, and VP9.

Conclusions and Outlook

The P.1204 standard series now comprises the complete initially planned set of models, namely:

  • ITU-T P.1204.1: Bitstream Mode 0, i.e., metadata-based model with access to information about video codec, resolution, framerate and bitrate used.
  • ITU-T P.1204.2: Bitstream Mode 1, i.e., metadata-based model with access to information about video codec, resolution, framerate and bitrate used, plus information about video frame types and sizes.
  • ITU-T P.1204.3: Bitstream Mode 3 [1][2][3][7].
  • ITU-T P.1204.4: Pixel-based reduced- and full-reference [1][5][8].
  • ITU-T P.1204.5: Hybrid no-reference Mode 0 [1][6].
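The difference between the Mode 0 and Mode 1 inputs listed above can be expressed as data structures. The Python sketch below is hypothetical: the class and field names are not taken from P.1204.1 or P.1204.2, and the I-frame byte share is only an illustrative example of a content-complexity cue that frame sizes and types make available, not a feature defined by the Recommendations.

```python
from dataclasses import dataclass, field


@dataclass
class Mode0Input:
    """Hypothetical container for the high-level metadata a Mode 0 model sees."""
    codec: str           # e.g. "h264", "h265", "vp9"
    width: int
    height: int
    framerate: float     # frames per second
    bitrate_kbps: float


@dataclass
class Mode1Input(Mode0Input):
    """Mode 0 metadata plus per-frame (type, size in bytes) information."""
    frames: list[tuple[str, int]] = field(default_factory=list)

    def measured_bitrate_kbps(self) -> float:
        """Average bitrate derived from the frame sizes and the framerate."""
        if not self.frames:
            raise ValueError("no frames")
        duration_s = len(self.frames) / self.framerate
        return 8 * sum(size for _, size in self.frames) / duration_s / 1000

    def i_frame_byte_share(self) -> float:
        """Share of bytes spent on I-frames (an illustrative complexity cue)."""
        total = sum(size for _, size in self.frames)
        if total == 0:
            raise ValueError("no frame data")
        return sum(size for kind, size in self.frames if kind == "I") / total
```

For one second of 25 fps video with a 50 kB I-frame followed by 24 P-frames of 10 kB each, the frame sizes imply 2320 kbit/s, with about 17% of the bytes spent on the I-frame. Two clips with the same bitrate but very different splits of this kind hint at different content complexity, which a Mode 0 model cannot see.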

Extensions of some of these models beyond the initial scope of codecs (H.264/AVC, H.265/HEVC, VP9) have been added over the last few years: P.1204.5 has been extended, and P.1204.4 has been evaluated, to also cover the AV1 video codec. Work in ITU-T SG12 (Q14/12) is ongoing to extend P.1204.1, P.1204.2, and P.1204.3 to newer codecs such as AV1, and all five models are planned to be extended to cover VVC as well. Note that for P.1204.3, P.1204.4, and P.1204.5, long-term quality integration modules that generate per-session scores for streaming sessions of up to 5 minutes are also described in appendices of the respective Recommendations. For P.1204.1 and P.1204.2, this extension has yet to be completed. Initial evaluations of similar Mode 0 and Mode 1 models that use the P.1204.3-type long-term integration can be found in [7].
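The step from per-segment scores to a per-session score can be illustrated with the simplest possible baseline, a duration-weighted mean. This Python sketch is explicitly not the long-term integration described in the appendices of P.1204.3, P.1204.4, and P.1204.5, whose details are not reproduced here; it only shows the shape of the computation, with the 5-minute limit mirroring the session length mentioned above. The function name is hypothetical.

```python
def session_score(segment_mos, segment_durations_s, max_session_s=300.0):
    """Duration-weighted mean of per-segment scores (naive baseline, not P.1204's method)."""
    if not segment_mos or len(segment_mos) != len(segment_durations_s):
        raise ValueError("need one duration per segment score")
    total_s = sum(segment_durations_s)
    if total_s <= 0 or total_s > max_session_s:
        raise ValueError("session length must be in (0, max_session_s]")
    return sum(q * d for q, d in zip(segment_mos, segment_durations_s)) / total_s
```

A session with 10 s at MOS 4.0 followed by 30 s at MOS 3.0 scores 3.25 under this baseline. Standardized integration modules typically go further, for example by accounting for quality variation over the session, which a plain mean ignores.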

References

[1] Raake, A., Borer, S., Satti, S.M., Gustafsson, J., Rao, R.R.R., Medagli, S., List, P., Göring, S., Lindero, D., Robitza, W. and Heikkilä, G., 2020. Multi-model standard for bitstream-, pixel-based and hybrid video quality assessment of UHD/4K: ITU-T P.1204. IEEE Access, 8, pp.193020-193049.
[2] Rao, R.R.R., Göring, S., List, P., Robitza, W., Feiten, B., Wüstenhagen, U. and Raake, A., 2020, May. Bitstream-based model standard for 4K/UHD: ITU-T P.1204.3 – Model details, evaluation, analysis and open source implementation. In 2020 Twelfth International Conference on Quality of Multimedia Experience (QoMEX) (pp. 1-6).
[3] ITU-T Rec. P.1204, 2025. Video quality assessment of streaming services over reliable transport for resolutions up to 4K. International Telecommunication Union (ITU-T), Geneva, Switzerland.
[4] ITU-T Rec. P.1204.3, 2020. Video quality assessment of streaming services over reliable transport for resolutions up to 4K with access to full bitstream information. International Telecommunication Union (ITU-T), Geneva, Switzerland.
[5] ITU-T Rec. P.1204.4, 2022. Video quality assessment of streaming services over reliable transport for resolutions up to 4K with access to full and reduced reference pixel information. International Telecommunication Union (ITU-T), Geneva, Switzerland.
[6] ITU-T Rec. P.1204.5, 2023. Video quality assessment of streaming services over reliable transport for resolutions up to 4K with access to transport and received pixel information. International Telecommunication Union (ITU-T), Geneva, Switzerland.
[7] Rao, R.R.R., Göring, S. and Raake, A., 2022. AVQBits – Adaptive video quality model based on bitstream information for various video applications. IEEE Access, 10, pp.80321-80351.
[8] Borer, S., 2022, September. Performance of ITU-T P.1204.4 on Video Encoded with AV1, AVS2, VVC. In 2022 14th International Conference on Quality of Multimedia Experience (QoMEX) (pp. 1-4).
[9] Winkler, S. and Mohandas, P., 2008. The evolution of video quality measurement: From PSNR to hybrid metrics. IEEE Transactions on Broadcasting, 54(3), pp.660-668.
[10] Wang, Z., Lu, L. and Bovik, A.C., 2004. Video quality assessment based on structural distortion measurement. Signal Processing: Image Communication, 19(2), pp.121-132.
[11] Li, Z., Aaron, A., Katsavounidis, I., Moorthy, A., and Manohara, M., 2016. Toward A Practical Perceptual Video Quality Metric, Netflix TechBlog.
[12] Li, Z., Swanson, K., Bampis, C., Krasula, L., and Aaron, A., 2020. Toward a Better Quality Metric for the Video Community, Netflix TechBlog.

MPEG Column: 152nd MPEG Meeting

The 152nd MPEG meeting took place in Geneva, Switzerland, from October 7 to October 11, 2025. The official MPEG press release can be found here. This column highlights key points from the meeting, amended with research aspects relevant to the ACM SIGMM community:

  • MPEG Systems received an Emmy® Award for the Common Media Application Format (CMAF). A separate press release regarding this achievement is available here.
  • JVET ratified new editions of VSEI, VVC, and HEVC
  • The fourth edition of Visual Volumetric Video-based Coding (V3C and V-PCC) has been finalized
  • Responses to the call for evidence on video compression with capability beyond VVC successfully evaluated

MPEG Systems received an Emmy® Award for the Common Media Application Format (CMAF)

On September 18, 2025, the National Academy of Television Arts & Sciences (NATAS) announced that the MPEG Systems Working Group (ISO/IEC JTC 1/SC 29/WG 3) had been selected as a recipient of a Technology & Engineering Emmy® Award for standardizing the Common Media Application Format (CMAF). But what is CMAF? CMAF (ISO/IEC 23000-19) is a media format standard designed to simplify and unify video streaming workflows across different delivery protocols and devices. Before CMAF, streaming services often had to produce multiple container formats, e.g., ISO Base Media File Format (ISOBMFF) segments for MPEG-DASH and MPEG-2 Transport Stream (TS) segments for Apple HLS. This duplication resulted in additional encoding, packaging, and storage costs. I wrote a blog post about this some time ago here. CMAF’s main goal is to define a single, standardized segmented media format usable by both HLS and DASH, enabling “encode once, package once, deliver everywhere.”
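
To illustrate “encode once, package once, deliver everywhere,” the following minimal Python sketch emits an HLS media playlist and a DASH SegmentTemplate element that both reference the same CMAF header and segment files. The file names (init.mp4, seg_N.m4s), the 4-second segment duration, and the segment count are hypothetical; a real packager would also produce a complete MPD and an HLS multivariant playlist.

```python
# Minimal sketch: one set of CMAF files, two manifests.
# File names, segment duration, and segment count are hypothetical.
SEGMENT_DURATION_S = 4
NUM_SEGMENTS = 3
INIT_FILE = "init.mp4"              # CMAF header (initialization data)
MEDIA_TEMPLATE = "seg_$Number$.m4s"  # shared naming scheme for CMAF segments


def segment_uri(number: int) -> str:
    """Expand the shared naming template for one segment number."""
    return MEDIA_TEMPLATE.replace("$Number$", str(number))


def hls_media_playlist() -> str:
    """HLS media playlist referencing the CMAF header and segments."""
    lines = [
        "#EXTM3U",
        "#EXT-X-VERSION:7",
        f"#EXT-X-TARGETDURATION:{SEGMENT_DURATION_S}",
        "#EXT-X-MEDIA-SEQUENCE:1",
        f'#EXT-X-MAP:URI="{INIT_FILE}"',  # initialization section = CMAF header
    ]
    for number in range(1, NUM_SEGMENTS + 1):
        lines.append(f"#EXTINF:{SEGMENT_DURATION_S:.3f},")
        lines.append(segment_uri(number))
    lines.append("#EXT-X-ENDLIST")
    return "\n".join(lines)


def dash_segment_template() -> str:
    """DASH SegmentTemplate element addressing the very same files."""
    return (
        f'<SegmentTemplate initialization="{INIT_FILE}" media="{MEDIA_TEMPLATE}" '
        f'startNumber="1" timescale="1" duration="{SEGMENT_DURATION_S}"/>'
    )


if __name__ == "__main__":
    print(hls_media_playlist())
    print(dash_segment_template())
```

Both manifests resolve to identical URIs, so the packager stores each CMAF segment only once regardless of which protocol the client uses.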

The core concept of CMAF is that it is based on ISOBMFF, the foundation for MP4. A CMAF track (the media of one component, e.g., video or audio) consists of a CMAF header followed by CMAF segments; each segment is made up of one or more CMAF fragments, which in turn can be split into smaller CMAF chunks, and a track stored as a single file is called a CMAF track file. CMAF enables low-latency streaming because chunks can be transferred progressively (e.g., using HTTP chunked transfer encoding) before the enclosing segment is complete. CMAF defines interoperable profiles for codecs and presentation types for video, audio, and subtitles. Thanks to its compatibility with and adoption within existing streaming standards, CMAF bridges the gap between DASH and HLS, creating a unified ecosystem.
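
At the byte level, these CMAF structures map onto ISOBMFF boxes: a CMAF header is an ftyp box followed by a moov box, each CMAF chunk is a moof box followed by an mdat box, and a segment may additionally begin with an styp box. The Python sketch below, a simplified illustration rather than a conformance checker, walks the top-level boxes of a byte buffer and labels them in CMAF terms; the helper names are hypothetical, and nested boxes and brand checks are ignored.

```python
import struct
from typing import Iterator, List, Tuple


def iter_boxes(data: bytes) -> Iterator[Tuple[str, int, int]]:
    """Yield (type, payload_start, box_end) for each top-level ISOBMFF box."""
    offset = 0
    while offset + 8 <= len(data):
        size, raw_type = struct.unpack_from(">I4s", data, offset)
        header_len = 8
        if size == 1:  # 64-bit "largesize" follows the type field
            (size,) = struct.unpack_from(">Q", data, offset + 8)
            header_len = 16
        elif size == 0:  # box extends to the end of the buffer
            size = len(data) - offset
        if size < header_len or offset + size > len(data):
            raise ValueError(f"malformed box at offset {offset}")
        yield raw_type.decode("latin-1"), offset + header_len, offset + size
        offset += size


def classify(data: bytes) -> List[str]:
    """Label top-level box sequences with (simplified) CMAF terms."""
    types = [box_type for box_type, _, _ in iter_boxes(data)]
    labels: List[str] = []
    i = 0
    while i < len(types):
        pair = types[i:i + 2]
        if pair == ["ftyp", "moov"]:
            labels.append("CMAF header")
            i += 2
        elif pair == ["moof", "mdat"]:
            labels.append("CMAF chunk")
            i += 2
        else:
            labels.append("segment start (styp)" if types[i] == "styp" else f"other ({types[i]})")
            i += 1
    return labels


def make_box(box_type: str, payload: bytes = b"") -> bytes:
    """Build a minimal box with a 32-bit size, for illustration only."""
    return struct.pack(">I4s", 8 + len(payload), box_type.encode("ascii")) + payload


if __name__ == "__main__":
    header = make_box("ftyp", b"cmfc") + make_box("moov")
    chunk = make_box("moof") + make_box("mdat", bytes(16))
    print(classify(header + make_box("styp") + chunk + chunk))
    # ['CMAF header', 'segment start (styp)', 'CMAF chunk', 'CMAF chunk']
```

Because every chunk carries its own moof box, a client can start decoding a chunk as soon as its moof and mdat have arrived, which is what makes progressive transfer possible.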

Research aspects include – but are not limited to – low-latency tuning (segment/chunk size trade-offs, HTTP/3, QUIC), Quality of Experience (QoE) impact of chunk-based adaptation, synchronization of live and interactive CMAF streams, edge-assisted CMAF caching and prediction, and interoperability testing and compliance tools.
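
The segment/chunk size trade-off can be made concrete with back-of-the-envelope arithmetic. In the deliberately simplified model below (an assumption that ignores encoding, network, and player buffering delays), new media can be published only once a complete unit exists: a whole segment without chunking, or a single chunk with CMAF chunks. Smaller chunks lower that delay but add one moof box per chunk; the 200-byte moof size and the function names are hypothetical.

```python
from typing import Optional

# Simplified model (assumption): encoding, network, and player buffering
# delays are ignored; only the packaging granularity is considered.


def publish_delay_s(segment_s: float, chunk_s: Optional[float] = None) -> float:
    """Media duration that must exist before a new unit can be published."""
    if chunk_s is None:  # no chunking: the whole segment must be complete
        return segment_s
    if not 0 < chunk_s <= segment_s:
        raise ValueError("chunk duration must be in (0, segment duration]")
    return chunk_s  # with CMAF chunks: only the first chunk must be complete


def chunks_per_segment(segment_s: float, chunk_s: float) -> int:
    """Number of moof/mdat pairs per segment (chunk must divide segment)."""
    count = round(segment_s / chunk_s)
    if count < 1 or abs(count * chunk_s - segment_s) > 1e-9:
        raise ValueError("chunk duration should evenly divide the segment duration")
    return count


def moof_overhead_bps(chunk_s: float, moof_bytes: int) -> float:
    """Extra bitrate for one moof box per chunk; moof_bytes is an assumed size."""
    return moof_bytes * 8 / chunk_s


if __name__ == "__main__":
    segment = 4.0
    for chunk in (0.5, 0.1):
        print(
            f"chunk={chunk}s: publish after {publish_delay_s(segment, chunk)}s "
            f"(vs {publish_delay_s(segment)}s unchunked), "
            f"{chunks_per_segment(segment, chunk)} chunks/segment, "
            f"~{moof_overhead_bps(chunk, 200):.0f} bit/s moof overhead"
        )
```

Shrinking chunks from 0.5 s to 0.1 s cuts the packaging delay fivefold but multiplies the per-chunk box overhead and the number of chunk boundaries the player must handle, which is one reason chunk sizing remains a tuning question.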

JVET ratified new editions of VSEI, VVC, and HEVC

At its 40th meeting, the Joint Video Experts Team (JVET, ISO/IEC JTC 1/SC 29/WG 5) concluded the standardization work on the next editions of three key video coding standards, advancing them to the Final Draft International Standard (FDIS) stage. Corresponding twin-text versions have also been submitted to ITU-T for consent procedures. The finalized standards include:

  • Versatile Supplemental Enhancement Information (VSEI) — ISO/IEC 23002-7 | ITU-T Rec. H.274
  • Versatile Video Coding (VVC) — ISO/IEC 23090-3 | ITU-T Rec. H.266
  • High Efficiency Video Coding (HEVC) — ISO/IEC 23008-2 | ITU-T Rec. H.265

The primary focus of these new editions is the extension and refinement of Supplemental Enhancement Information (SEI) messages, which provide metadata and auxiliary data to support advanced processing, interpretation, and quality management of coded video streams.

The updated VSEI specification introduces both new and refined SEI message types supporting advanced use cases:

  • AI-driven processing: Extensions for neural-network-based post-filtering and film grain synthesis offer standardized signalling for machine learning components in decoding and rendering pipelines.
  • Semantic and multimodal content: New SEI messages describe infrared, X-ray, and other modality indicators, region packing, and object mask encoding, creating interoperability points for multimodal fusion and object-aware compression research.
  • Pipeline optimization: Messages defining processing order and post-processing nesting support research on joint encoder-decoder optimization and edge-cloud coordination in streaming architectures.
  • Authenticity and generative media: A new set of messages supports digital signature embedding and generative-AI-based face encoding, raising questions for the SIGMM community about trust, authenticity, and ethical AI in media pipelines.
  • Metadata and interpretability: New SEIs for text description, image format metadata, and AI usage restriction requests could facilitate research into explainable media, human-AI interaction, and regulatory compliance in multimedia systems.
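
All of these messages share the same container syntax. In HEVC and VVC, each sei_message() begins with a payload type and a payload size, each coded as a run of 0xFF bytes (each adding 255) terminated by a byte smaller than 0xFF. The Python sketch below parses only that header layer. It assumes the input is an SEI RBSP whose NAL unit header and emulation-prevention bytes have already been removed and which ends with a byte-aligned 0x80 trailing byte; it does not interpret any payload, and the payload type values in the example are arbitrary rather than real message assignments.

```python
from typing import List, Tuple


def read_ff_coded(buf: bytes, pos: int) -> Tuple[int, int]:
    """Read a value coded as 0xFF bytes (each adding 255) plus a final byte < 0xFF."""
    value = 0
    while True:
        if pos >= len(buf):
            raise ValueError("truncated SEI message header")
        byte = buf[pos]
        pos += 1
        value += byte
        if byte != 0xFF:
            return value, pos


def parse_sei_rbsp(rbsp: bytes) -> List[Tuple[int, bytes]]:
    """Split an SEI RBSP into (payloadType, payload bytes) pairs.

    Assumes the NAL unit header and emulation-prevention bytes were already
    removed and that the RBSP ends with a byte-aligned 0x80 trailing byte.
    """
    body = rbsp[:-1] if rbsp.endswith(b"\x80") else rbsp
    messages = []
    pos = 0
    while pos < len(body):
        payload_type, pos = read_ff_coded(body, pos)
        payload_size, pos = read_ff_coded(body, pos)
        if pos + payload_size > len(body):
            raise ValueError("SEI payload exceeds RBSP length")
        messages.append((payload_type, body[pos:pos + payload_size]))
        pos += payload_size
    return messages


def encode_ff_coded(value: int) -> bytes:
    """Inverse of read_ff_coded, used here to build example input."""
    return b"\xff" * (value // 255) + bytes([value % 255])


if __name__ == "__main__":
    # Arbitrary payload type values for illustration (not real assignments).
    rbsp = (encode_ff_coded(5) + encode_ff_coded(3) + b"abc"
            + encode_ff_coded(300) + encode_ff_coded(2) + b"xy" + b"\x80")
    print(parse_sei_rbsp(rbsp))  # [(5, b'abc'), (300, b'xy')]
```

Because a decoder can skip any payload by its size without understanding it, new SEI types can be added to an edition without breaking existing decoders, which is what makes SEI a convenient vehicle for the AI-related and provenance metadata listed above.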

All VSEI features are fully compatible with the new VVC edition, and most are also supported in HEVC. The new HEVC edition further refines its multi-view profiles, enabling more robust 3D and immersive video use cases.

Research aspects of these new editions can be summarized as follows. They (i) define new standardized interfaces between neural post-processing and conventional video coding, fostering reproducible and interoperable research on learned enhancement models; (ii) encourage exploration of metadata-driven adaptation and QoE optimization using SEI-based signals in streaming systems; (iii) open possibilities for cross-layer system research, connecting compression, transport, and AI-based decision layers; and (iv) introduce a formal foundation for authenticity verification, content provenance, and AI-generated media signalling, relevant to current debates on trustworthy multimedia.

These updates highlight how ongoing MPEG/ITU standardization is evolving toward a more AI-aware, multimodal, and semantically rich media ecosystem, providing fertile ground for experimental and applied research in multimedia systems, coding, and intelligent media delivery.

The fourth edition of Visual Volumetric Video-based Coding (V3C and V-PCC) has been finalized

MPEG Coding of 3D Graphics and Haptics (ISO/IEC JTC 1/SC 29/WG 7) has advanced MPEG-I Part 5 – Visual Volumetric Video-based Coding (V3C and V-PCC) to the Final Draft International Standard (FDIS) stage, marking its fourth edition. This revision introduces major updates to the V3C framework, in particular adding support for an additional bitstream instance: V-DMC (Video-based Dynamic Mesh Compression).

Previously, V3C served as the structural foundation for V-PCC (Video-based Point Cloud Compression) and MIV (MPEG Immersive Video). The new edition extends this flexibility by allowing V-DMC integration, reinforcing V3C as a generic, extensible framework for volumetric and 3D video coding. All instances follow a shared principle, i.e., using conventional 2D video codecs (e.g., HEVC, VVC) for projection-based compression, complemented by specialized tools for mapping, geometry, and metadata handling.
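
The shared principle of mapping 3D content onto 2D images can be illustrated with a toy example. The Python sketch below orthographically projects integer points onto a single plane, producing a geometry (depth) image that keeps the nearest point per pixel and a binary occupancy map, both of which could then be handed to a 2D video encoder. This is a deliberate simplification: V-PCC segments the point cloud into patches projected onto several planes, packs them into an atlas, and adds further layers and attribute (texture) images, none of which is modeled here; the function names are hypothetical.

```python
from typing import List, Optional, Sequence, Tuple

Point = Tuple[int, int, int]


def project_to_plane(
    points: Sequence[Point], width: int, height: int
) -> Tuple[List[List[int]], List[List[int]]]:
    """Project (x, y, z) points along z onto a width x height image.

    Keeps the nearest (smallest z) point per pixel as the geometry (depth)
    value and marks occupied pixels with 1 in the occupancy map.
    """
    nearest: List[List[Optional[int]]] = [[None] * width for _ in range(height)]
    for x, y, z in points:
        if 0 <= x < width and 0 <= y < height:
            current = nearest[y][x]
            if current is None or z < current:
                nearest[y][x] = z
    geometry = [[0 if d is None else d for d in row] for row in nearest]
    occupancy = [[0 if d is None else 1 for d in row] for row in nearest]
    return geometry, occupancy


def reconstruct(geometry: List[List[int]], occupancy: List[List[int]]) -> List[Point]:
    """Recover points from occupied pixels; occluded points are lost."""
    return sorted(
        (x, y, geometry[y][x])
        for y, row in enumerate(occupancy)
        for x, occupied in enumerate(row)
        if occupied
    )


if __name__ == "__main__":
    cloud = [(0, 0, 5), (0, 0, 7), (1, 0, 2), (2, 1, 9)]
    geometry, occupancy = project_to_plane(cloud, width=3, height=2)
    print(geometry)                          # [[5, 2, 0], [0, 0, 9]]
    print(occupancy)                         # [[1, 1, 0], [0, 0, 1]]
    print(reconstruct(geometry, occupancy))  # [(0, 0, 5), (1, 0, 2), (2, 1, 9)]
```

The point (0, 0, 7) is hidden behind (0, 0, 5) and disappears after reconstruction; recovering such occluded samples is exactly what multiple projection directions and layers are meant to address, and it is one source of the mapping and distortion-modeling questions mentioned below.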

While V-PCC remains co-specified within Part 5, MIV (Part 12) and V-DMC (Part 29) are standardized separately. The progression to FDIS confirms the technical maturity and architectural stability of the framework.

This evolution opens new research directions as follows: (i) Unified 3D content representation, enabling comparative evaluation of point cloud, mesh, and view-based methods under one coding architecture. (ii) Efficient use of 2D codecs for 3D media, raising questions on mapping optimization, distortion modeling, and geometry-texture compression. (iii) Dynamic and interactive volumetric streaming, relevant to AR/VR, telepresence, and immersive communication research.

The fourth edition of MPEG-I Part 5 thus positions V3C as a cornerstone for future volumetric, AI-assisted, and immersive video systems, bridging standardization and cutting-edge multimedia research.

Responses to the call for evidence on video compression with capability beyond VVC successfully evaluated

The Joint Video Experts Team (JVET, ISO/IEC JTC 1/SC 29/WG 5) has completed the evaluation of submissions to its Call for Evidence (CfE) on video compression with capability beyond VVC. The CfE investigated coding technologies that may surpass the performance of the current Versatile Video Coding (VVC) standard in compression efficiency, computational complexity, and extended functionality.

A total of five submissions were assessed, complemented by ECM16 reference encodings and VTM anchor sequences with multiple runtime variants. The evaluation addressed both compression capability and encoding runtime, as well as low-latency and error-resilience features. All technologies were derived from VTM, ECM, or NNVC frameworks, featuring modified encoder configurations and coding tools rather than entirely new architectures.

Key Findings

  • In the compression capability test, 76 out of 120 test cases showed at least one submission with a non-overlapping confidence interval compared to the VTM anchor. Several methods outperformed ECM16 in visual quality and achieved notable compression gains at lower complexity. Neural-network-based approaches demonstrated clear perceptual improvements, particularly for 8K HDR content, while gains were smaller for gaming scenarios.
  • In the encoding runtime test, significant improvements were observed even under strict complexity constraints: 37 of 60 test points (covering both the 1× and 0.2× runtime conditions) showed statistically significant benefits over VTM. Some submissions achieved faster encoding than VTM, with only a 35% increase in decoder runtime.
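
The “non-overlapping confidence interval” criterion can be sketched as follows: compute a 95% confidence interval of the mean opinion score (MOS) for the anchor and for a submission, and count the submission as significantly better only if the lower bound of its interval lies above the upper bound of the anchor’s interval. The normal approximation (z = 1.96), the data layout, the function names, and the scores are assumptions for illustration; the exact statistical procedure used in the JVET evaluation may differ (e.g., a t-distribution for small viewer panels).

```python
import math
from statistics import mean, stdev
from typing import Dict, Sequence, Tuple


def mos_ci(scores: Sequence[float], z: float = 1.96) -> Tuple[float, float]:
    """Normal-approximation confidence interval of the mean opinion score."""
    m = mean(scores)
    half_width = z * stdev(scores) / math.sqrt(len(scores))
    return m - half_width, m + half_width


def significantly_better(test: Sequence[float], anchor: Sequence[float], z: float = 1.96) -> bool:
    """True if the test interval lies entirely above the anchor interval."""
    test_low, _ = mos_ci(test, z)
    _, anchor_high = mos_ci(anchor, z)
    return test_low > anchor_high


def count_cases_with_gain(
    cases: Dict[str, Tuple[Sequence[float], Dict[str, Sequence[float]]]]
) -> int:
    """Count test cases in which at least one submission beats the anchor."""
    return sum(
        1
        for anchor, submissions in cases.values()
        if any(significantly_better(s, anchor) for s in submissions.values())
    )


if __name__ == "__main__":
    anchor = [4, 5] * 4   # hypothetical MOS values
    better = [8, 9] * 4
    close = [4, 6] * 4
    print(significantly_better(better, anchor))  # True
    print(significantly_better(close, anchor))   # False
```

Non-overlapping intervals are a conservative test: two intervals can overlap slightly while the difference of means is still significant, so counts such as “76 out of 120” should be read as lower bounds on detectable differences under this criterion.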

Research Relevance and Outlook

The CfE results illustrate a maturing convergence between model-based and data-driven video coding, raising research questions highly relevant for the ACM SIGMM community:

  • How can learned prediction and filtering networks be integrated into standard codecs while preserving interoperability and runtime control?
  • What methodologies can best evaluate perceptual quality beyond PSNR, especially for HDR and immersive content?
  • How can complexity-quality trade-offs be optimized for diverse hardware and latency requirements?

Building on these outcomes, JVET is preparing a Call for Proposals (CfP) for the next-generation video coding standard, with a draft planned for early 2026 and evaluation through 2027. Upcoming activities include refining test material, adding Reference Picture Resampling (RPR), and forming a new ad hoc group on hardware implementation complexity.

For multimedia researchers, this CfE marks a pivotal step toward AI-assisted, complexity-adaptive, and perceptually optimized compression systems, which are considered a key frontier where codec standardization meets intelligent multimedia research.

The 153rd MPEG meeting will be held online from January 19 to January 23, 2026. Click here for more information about MPEG meetings and their developments.