Event Report – Page 3 – ACM SIGMM Records

VQEG Column: VQEG Meeting November 2024

By Jesús Gutiérrez | May 30, 2025 - 12:14 |June 4, 2025 0225, 0225, Event Report, Feature, Standards

Introduction

The last plenary meeting of the Video Quality Experts Group (VQEG) was hosted online by the Institute for Telecommunication Sciences (ITS) of the National Telecommunications and Information Administration (NTIA) from November 18th to 22nd, 2024. The meeting was attended by 70 participants from industry and academic institutions in 17 countries.

The meeting was dedicated to presenting updates and discussing topics related to the ongoing projects within VQEG. All the related information, minutes, and files from the meeting are available online on the VQEG meeting website, and video recordings of the meeting are available on YouTube.

All the topics mentioned below can be of interest for the SIGMM community working on quality assessment, but special attention can be devoted to the creation of a new group focused on Subjective and objective assessment of GenAI content (SOGAI) and to the recent contribution of the Immersive Media Group (IMG) to the International Telecommunication Union (ITU) towards the Rec. ITU-T P.IXC for the evaluation of Quality of Experience (QoE) of immersive interactive communication systems. Finally, it is worth noting that Ioannis Katsavounidis (Meta, US) joins Kjell Brunnström (RISE, Sweden) as co-chair of VQEG, replacing Margaret Pinson (NTIA/ITS).

Readers of these columns interested in the ongoing projects of VQEG are encouraged to subscribe to their corresponding reflectors to follow the activities going on and to get involved in them.

Overview of VQEG Projects

Audiovisual HD (AVHD)

The AVHD group works on developing and validating subjective and objective methods to analyze commonly available video systems. In this meeting, Lucjan Janowski (AGH University of Krakow, Poland) and Margaret Pinson (NTIA/ITS) presented their proposal to fix wording related to experiment realism and validity. The proposal draws on experience from the psychology domain, which addresses the important question of how far results from a lab experiment can be applied outside the laboratory.

In addition, given that the group has no current joint activities, the AVHD project will become dormant, with the possibility of being reactivated when new activities are planned.

Statistical Analysis Methods (SAM)

The SAM group investigates analysis methods both for the results of subjective experiments and for objective quality models and metrics. In addition to a discussion on the future activities of the group led by its chairs Ioannis Katsavounidis (Meta, US), Zhi Li (Netflix, US), and Lucjan Janowski (AGH University of Krakow, Poland), the following presentations were delivered during the meeting:

Dietmar Saupe (University of Konstanz, Germany) delivered two presentations. The first one focused on maximum entropy and quantized metric models for absolute category ratings, based on the investigation of families of multinomial probability distributions parameterized by mean and variance that are used to fit the empirical rating distributions. To validate the proposed models, a comparison of the performance of these models and the state-of-the-art (given by the generalized score distribution) was done on two large datasets (KonIQ-10k and VQEG HDTV). The second presentation proposed a fine-grained subjective visual quality assessment method for high-fidelity compressed images, which is based on the current activities of the JPEG standardization project Advanced Image Coding (AIC). In addition to the assessment method, a dataset of high-quality compressed images and their corresponding crowdsourced visual quality ratings was presented.
Kjell Brunnström (RISE, Sweden) presented an experiment for collecting data to evaluate cloud gaming quality, based on a passive video quality experiment and a bootstrapped analysis of the results. This experiment is part of the subjective test campaign (involving labs from different parts of the world) carried out by the ITU within the project Parametric Bitstream-Based Quality assessment of Cloud Gaming services (P.BBQCG), which focuses on developing objective quality models.
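
As a rough illustration of what a bootstrapped analysis of subjective ratings can look like, the sketch below estimates a percentile confidence interval for a mean opinion score (MOS) by resampling the ratings with replacement. It is not the analysis used in the P.BBQCG campaign: the function name `bootstrap_mos_ci`, its parameters, and the example ratings are all hypothetical and chosen only for illustration.

```python
import random
import statistics


def bootstrap_mos_ci(ratings, n_resamples=2000, alpha=0.05, seed=0):
    """Return (mos, lower, upper) using a percentile bootstrap.

    Hypothetical helper for illustration only; the ratings are assumed to be
    scores on a 1-5 absolute category rating scale for one test condition.
    """
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    n = len(ratings)
    # Resample the ratings with replacement and record each resample's mean.
    means = sorted(
        statistics.fmean(rng.choices(ratings, k=n)) for _ in range(n_resamples)
    )
    # Take the alpha/2 and 1 - alpha/2 percentiles of the resampled means.
    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[int((1 - alpha / 2) * n_resamples) - 1]
    return statistics.fmean(ratings), lower, upper


# Made-up ratings from 20 hypothetical observers for a single condition.
example_ratings = [4, 5, 3, 4, 4, 2, 5, 3, 4, 4, 3, 5, 4, 4, 3, 2, 4, 5, 3, 4]
mos, lower, upper = bootstrap_mos_ci(example_ratings)
print(f"MOS = {mos:.2f}, 95% CI = [{lower:.2f}, {upper:.2f}]")
```

The percentile bootstrap is only one of several bootstrap variants. It is used here because it makes no distributional assumption about the ratings, which is convenient for bounded, discrete rating scales.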

No Reference Metrics (NORM)

The NORM group is a collaborative effort to develop no-reference metrics for monitoring visual service quality. In this context, Ioannis Katsavounidis (Meta, US) and Margaret Pinson (NTIA/ITS) summarized recent discussions within the group on developing best practices for subjective test methods when analyzing Artificial Intelligence (AI) generated images and videos. This discussion resulted in the creation of a new VQEG project called Subjective and objective assessment of GenAI content (SOGAI) to investigate subjective and objective methods to evaluate the content produced by generative AI approaches.

Emerging Technologies Group (ETG)

The ETG group focuses on various aspects of multimedia that, although they are not necessarily directly related to “video quality”, can indirectly impact the work carried out within VQEG and are not addressed by any of the existing VQEG groups. In particular, this group aims to provide a common platform for people to gather together and discuss new emerging topics, possible collaborations in the form of joint survey papers, funding proposals, etc. During this meeting, Abhijay Ghildyal (Portland State University, US), Saman Zadtootaghaj (Sony Interactive Entertainment, Germany), and Nabajeet Barman (Sony Interactive Entertainment, UK) presented their work on quality assessment of AI generated content and AI enhanced content. In addition, Matthias Wien (RWTH Aachen University, Germany) presented the approach, design and methodology for the evaluation of AI-based Point Cloud Compression in the corresponding Call for Proposals in MPEG. Finally, Abhijay Ghildyal (Portland State University, US) presented his work on how foundation models boost low-level perceptual similarity metrics, investigating the potential of using intermediate features or activations from these models for low-level image quality assessment, and showing that such metrics can outperform existing ones without requiring additional training.

Joint Effort Group (JEG) – Hybrid

The JEG group addresses several areas of Video Quality Assessment (VQA), such as the creation of a large dataset for training such models using full-reference metrics instead of subjective metrics. In addition, the group includes the VQEG project Implementer’s Guide for Video Quality Metrics (IGVQM). The chair of this group, Enrico Masala (Politecnico di Torino, Italy), presented updates on the latest activities, including the plans for experiments within the IGVQM project, to get feedback from other VQEG members.

In addition to this, Lohic Fotio Tiotsop (Politecnico di Torino, Italy) delivered two presentations. The first one focused on the prediction of the opinion score distribution via AI-based observers in media quality assessment, while the second one analyzed unexpected scoring behaviors in image quality assessment comparing controlled and crowdsourced subjective tests.

Immersive Media Group (IMG)

The IMG group researches the quality assessment of immersive media technologies. Currently, the main joint activity of the group is the development of a test plan to evaluate the QoE of immersive interactive communication systems, which is carried out in collaboration with ITU-T through the work item P.IXC. In this meeting, Pablo Pérez (Nokia XR Lab, Spain), Marta Orduna (Nokia XR Lab, Spain), and Jesús Gutiérrez (Universidad Politécnica de Madrid, Spain) presented the status of the Rec. ITU-T P.IXC, which the group was writing based on the joint test plan developed in recent months and which was submitted to ITU and discussed in its meeting in January 2025.

Also, in relation to this test plan, Lucjan Janowski (AGH University of Krakow, Poland) and Margaret Pinson (NTIA/ITS) presented an overview of ITU recommendations for interactive experiments that can be used in the IMG context.

In relation to other topics addressed by IMG, Emin Zerman (Mid Sweden University, Sweden) delivered two presentations. The first one presented the BASICS dataset, which contains a representative range of nearly 1500 point clouds assessed by thousands of participants to enable robust quality assessments for 3D scenes. The approach involved a careful selection of diverse source scenes and the application of specific “distortions” to simulate real-world compression impacts, including traditional and learning-based methods. The second presentation described a spherical light field database (SLFDB) for immersive telecommunication and telepresence applications, which comprises 60-view omnidirectional captures across 20 scenes, providing a comprehensive basis for telepresence research.

Quality Assessment for Computer Vision Applications (QACoViA)

The QACoViA group studies the visual quality requirements for computer vision methods, where the final user is an algorithm. In this meeting, Mehr un Nisa (AGH University of Krakow, Poland) presented a comparative performance analysis of deep learning architectures in underwater image classification. In particular, the study assessed the performance of the VGG-16, EfficientNetB0, and SimCLR models in classifying 5,000 underwater images. The results reveal each model’s strengths and weaknesses, providing insights for future improvements in underwater image analysis.

5G Key Performance Indicators (5GKPI)

The 5GKPI group studies the relationship between key performance indicators of new 5G networks and the QoE of video services running on top of them. In this meeting, Pablo Pérez (Nokia XR Lab, Spain), Francois Blouin (Meta, US), and others presented the progress on the 5G-KPI White Paper, sharing some of the ideas on QoS-to-QoE modeling that the group has been working on to get feedback from other VQEG members.

Multimedia Experience and Human Factors (MEHF)

The MEHF group focuses on the human factors influencing audiovisual and multimedia experiences, facilitating a comprehensive understanding of how human factors impact the perceived quality of multimedia content. In this meeting, Dominika Wanat (AGH University of Krakow, Poland) presented MANIANA (Mobile Appliance for Network Interrupting, Analysis & Notorious Annoyance), an IoT device for testing QoS and QoE applications in home network conditions. It is built on a Raspberry Pi 4 minicomputer and open source solutions, and supports safe, robust, and universal testing applications.

Other updates

Apart from this, it is worth noting that, although no progress was presented in this meeting, the Quality Assessment for Health Applications (QAH) group is still active and focused on the quality assessment of health applications. It addresses subjective evaluation, generation of datasets, development of objective metrics, and task-based approaches.

In addition, the Computer Generated Imagery (CGI) project became dormant, since its recent activities can be covered by other existing groups such as ETG and SOGAI.

Also, in this meeting Margaret Pinson (NTIA/ITS) stepped down as co-chair of VQEG and Ioannis Katsavounidis (Meta, US) is the new co-chair together with Kjell Brunnström (RISE, Sweden).

Finally, as already announced on the VQEG website, the next VQEG plenary meeting will be hosted by Meta at its Menlo Park campus, California, in the United States from May 5th to 9th, 2025. For more information see: https://vqeg.org/meetings-home/vqeg-meeting-information/

JPEG Column: 106th JPEG Meeting

By Antonio Pinheiro | May 28, 2025 - 23:53 |June 4, 2025 0225, Event Report, Feature, Standards

JPEG AI becomes an International Standard

The 106th JPEG meeting was held online from January 6 to 10, 2025. During this meeting, the first image coding standard based on machine learning technology, JPEG AI, was sent for publication as an International Standard. This is a major achievement as it leverages JPEG with major trends in imaging technologies and provides an efficient standardized solution for image coding, with nearly 30% improvement over the most advanced solutions in the state-of-the-art. JPEG AI has been developed under the auspices of three major standardization organizations: ISO, IEC and ITU.

The following sections summarize the main highlights of the 106th JPEG meeting.

JPEG AI – the first International Standard for end-to-end learning-based image coding
JPEG Trust – a framework for establishing trust in digital media
JPEG XE – lossless coding of event-based vision
JPEG AIC – assessment of the visual quality of high-fidelity images
JPEG Pleno – standard framework for representing plenoptic data
JPEG Systems – file formats and metadata
JPEG DNA – DNA-based storage of digital pictures
JPEG XS – end-to-end low latency and low complexity image coding
JPEG XL – new image coding system
JPEG 2000
JPEG RF – exploration on Radiance Fields

JPEG AI

At its 106th meeting, the JPEG Committee approved publication of the text of JPEG AI, the first International Standard for end-to-end learning-based image coding. This achievement marks a significant milestone in the field of digital imaging and compression, offering a new approach for efficient, high-quality image storage and transmission.

The scope of JPEG AI is the creation of a learning-based image coding standard offering a single-stream, compact compressed domain representation, targeting both human visualization with significant compression efficiency improvement over image coding standards in common use at equivalent subjective quality, and effective performance for image processing and computer vision tasks, with the goal of supporting a royalty-free baseline.

The JPEG AI standard leverages deep learning algorithms that learn from vast amounts of image data the best way to compress images, allowing it to adapt to a wide range of content and offering enhanced perceptual visual quality and faster compression capabilities. The key benefits of JPEG AI are:

Superior compression efficiency: JPEG AI offers higher compression efficiency, leading to reduced storage requirements and faster transmission times compared to other state-of-the-art image coding solutions.
Implementation-friendly encoding and decoding: JPEG AI codec supports a wide array of devices with different characteristics, including mobile platforms, through optimized encoding and decoding processes.
Compressed-domain image processing and computer vision tasks: JPEG AI’s architecture enables multi-purpose optimization for both human visualization and machine-driven tasks.

By creating the JPEG AI International Standard, the JPEG Committee has opened the door to more efficient and versatile image compression solutions that will benefit industries ranging from digital media and telecommunications to cloud storage and visual surveillance. This standard provides a framework for image compression in the face of rapidly growing visual data demands, enabling more efficient storage, faster transmission, and higher-quality visual experiences.

As JPEG AI establishes itself as the new benchmark in image compression, its potential to reshape the future of digital imaging is undeniable, promising groundbreaking advancements in efficiency and versatility.

JPEG Trust

The first part of JPEG Trust, the “Core Foundation” (ISO/IEC 21617-1) was approved for publication in late 2024 and is in the process of being published as an International Standard by ISO. The JPEG Trust standard provides a proactive approach to trust management by defining a framework for establishing trust in digital media. The Core Foundation specifies three main pillars: annotating provenance, extracting and evaluating Trust Indicators, and handling privacy and security concerns.

At the 106th JPEG Meeting, the JPEG Committee produced a Committee Draft (CD) for a 2nd edition of the Core Foundation. The 2nd edition further extends and improves the standard with new functionalities, including important specifications for Intellectual Property Rights (IPR) management such as authorship and rights declarations. In addition, this new edition will align the specification with the upcoming ISO 22144 standard, which is a standard for Content Credentials based on the C2PA 2.1 specification.

In parallel with the work on the 2nd edition of the Core Foundation (Part 1), the JPEG Committee continues to work on Part 2 and Part 3, “Trust Profiles Catalogue” and “Media Asset Watermarking”, respectively.

JPEG XE

The JPEG XE initiative is currently awaiting the conclusion of the open Final Call for Proposals on lossless coding of events, which will close on March 31, 2025. This initiative focuses on a new and emerging image modality introduced by event-based visual sensors. JPEG aims to establish a standard that efficiently represents events, facilitating interoperability in sensing, storage, and processing for machine vision and other relevant applications.

To ensure the success of this emerging standard, the JPEG Committee has reached out to other standardization organizations. The JPEG Committee, already a collaborative group under ISO/IEC and ITU-T, is engaged in discussions with ITU-T’s SG21 to develop JPEG XE as a joint standard. This collaboration aligns perfectly with the objectives of both organizations, as SG21 is also dedicated to creating standards around event-based systems.

Additionally, the JPEG Committee continues its discussions and research on lossy coding of events, focusing on future evaluation methods for these technologies. Those interested in the JPEG XE initiative are encouraged to review the public documents available at jpeg.org. Furthermore, the Ad-hoc Group on event-based vision has been re-established to advance work leading up to the 107th JPEG meeting in Brussels. To stay informed about this activity, please join the event-based vision Ad-hoc Group mailing list.

JPEG AIC

Part 3 of JPEG AIC (AIC-3) defines a methodology for subjective assessment of the visual quality of high-fidelity images, and the forthcoming Part 4 of JPEG AIC deals with objective quality metrics, also for high-fidelity images. In this JPEG meeting, the document on Use Cases and Requirements, which refers to both AIC-3 and AIC-4, was revised. It defines the scope of both anticipated standards and relates it to the previous specifications for AIC-1 and AIC-2. While AIC-1 covers a broad quality range including low quality, it does not allow fine-grained quality assessment in the high-fidelity range. AIC-2 entails methods that determine a threshold separating visually lossless coded images from lossy ones. The quality range addressed by AIC-3 and AIC-4 is an interval that contains the AIC-2 threshold, reaching from high quality up to the numerically lossless case. The JPEG Committee is preparing the DIS text for AIC-3 and has launched the Second Draft Call for Proposals on Objective Image Quality Assessment (AIC-4), which includes the timeline for this JPEG activity. Proposals are expected at the end of Summer 2025. The first Working Draft for Objective Image Quality Assessment (AIC-4) is planned for April 2026.

JPEG Pleno

The 106th meeting marked a major milestone for the JPEG Pleno Point Cloud activity with the release of the Final Draft International Standard (FDIS) for ISO/IEC DIS 21794-6:2024 Information technology — Plenoptic image coding system (JPEG Pleno) — Part 6: Learning-based point cloud coding. Point cloud data supports a wide range of applications, including computer-aided manufacturing, entertainment, cultural heritage preservation, scientific research, and advanced sensing and analysis. The JPEG Committee considers this learning-based standard to be a powerful and efficient solution for point cloud coding. This standard is applicable to interactive human visualization, with competitive compression efficiency compared to state-of-the-art point cloud coding solutions in common use, and effective performance for 3D processing and machine-related computer vision tasks and has the goal of supporting a royalty-free baseline. This standard specifies a codestream format for storage of point clouds. The standard also provides information on the coding tools and defines extensions to the JPEG Pleno File Format and associated metadata descriptors that are specific to point cloud modalities. With the release of the FDIS at the 106th JPEG meeting, it is expected that the International Standard will be published in July 2025.

The JPEG Pleno Light Field activity discussed the Committee Draft (CD) of the 2nd edition of ISO/IEC 21794-2 (“Plenoptic image coding system (JPEG Pleno) Part 2: Light field coding”) that integrates AMD1 of ISO/IEC 21794-2 (“Profiles and levels for JPEG Pleno Light Field Coding”) and includes the specification of a third coding mode entitled Slanted 4D Transform Mode and its associated profile.

A White Paper on JPEG Pleno Light Field Coding has been released, providing the architecture of the current two JPEG Pleno Part-2 coding modes, as well as the coding architecture of its third coding mode, to be included in the 2nd edition of the standard. The White Paper also presents applications and use cases and briefly describes the JPEG Pleno Model (JPLM). The JPLM provides a reference implementation for the standardized technologies within the JPEG Pleno framework, including the JPEG Pleno Part 2 (ISO/IEC 21794-2). Improvements to JPLM have been implemented and tested, including a user-friendly interface that relies on well-documented JSON configuration files.

During the JPEG meeting week, significant progress was made in the JPEG Pleno Quality Assessment activity, which focuses on developing methodologies for subjective and objective quality assessment of plenoptic modalities. A Working Draft on subjective quality assessment, incorporating insights from extensive experiments conducted by JPEG experts, was discussed.

JPEG Systems

The reference software of JPEG Systems (ISO/IEC 19566-10) is now published as an International Standard and is available as open source on the JPEG website. This first edition implements the JPEG Universal Metadata Box Format (ISO/IEC 19566-5) and provides a reference dataset. An extended version of the reference software with support for additional Parts of JPEG Systems is currently under development. This new edition will add support for JPEG Privacy and Security, JPEG 360, JLINK, and JPEG Snack.

At its 106th meeting, the JPEG Committee also initiated a 3rd edition of the JPEG Universal Metadata Box Format (ISO/IEC 19566-5). This new edition will integrate the latest amendment that allows JUMBF boxes to exist as stand-alone files and adds support for payload compression. In addition, the 3rd edition will add a JUMBF validator and a scheme for JUMBF box retainment while transcoding from one JPEG format to another.

JPEG DNA

JPEG DNA is an initiative aimed at developing a standard capable of representing bi-level, continuous-tone grayscale, continuous-tone color, or multichannel digital samples in a format using nucleotide sequences to support DNA storage. The JPEG DNA Verification Model (VM) was created during the 102nd JPEG meeting based on performance assessments and descriptive analyses of the submitted solutions to a Call for Proposals, issued at the 99th JPEG meeting. Since then, several core experiments have been continuously conducted to validate and enhance this Verification Model. Such efforts led to the creation of the first Working Draft of JPEG DNA during the 103rd JPEG meeting. At the 105th JPEG meeting, the JPEG Committee officially introduced a New Work Item Proposal (NWIP) for JPEG DNA, elevating it to an officially sanctioned ISO/IEC Project. The proposal defined JPEG DNA as a multi-part standard: Part 1: Core Coding System, Part 2: Profiles and Levels, Part 3: Reference Software, Part 4: Conformance.

The JPEG Committee is targeting the International Standard (IS) stage for Part 1 by April 2026.

At its 106th meeting, the JPEG Committee made significant progress toward achieving this goal. Efforts were focused on producing the Committee Draft (CD) for Part 1, a crucial milestone in the standardization process. Additionally, JPEG DNA Part 1 has now been assigned the Project identification ISO/IEC 25508-01.

JPEG XS

The JPEG XS activity focussed primarily on finalization of the third editions of JPEG XS Part 4 – Conformance testing, and Part 5 – Reference software. Recall that the 3rd editions of Parts 1, 2, and 3 are published and available for purchase. Part 4 is now at FDIS stage and is expected to be approved as International Standard around April of 2025. For Part 5, work on the reference software was completed to implement TDC profile encoding functionality, making it feature complete and fully compliant with the 3rd edition of JPEG XS. As such, Part 5 is ready to be balloted as a DIS. However, work on the reference software will continue to bring further improvements. The reference software and Part 5 will become publicly and freely available, similar to Part 4.

JPEG XL

The second edition of Part 3 (conformance testing) of JPEG XL proceeded to publication as International Standard. Regarding Part 2 (file format), a third edition has been prepared, and it reached the DIS stage. The new edition will include support for embedding gain maps in JPEG XL files.

JPEG 2000

The JPEG Committee has begun work on adding support for the HTTP/3 transport to the JPIP protocol, which allows the interactive browsing of JPEG 2000 images over networks. HTTP/3 is the third major version of the Hypertext Transfer Protocol (HTTP) and allows for significantly lower latency operations compared to earlier versions. A Committee Draft ballot of the 3rd edition of the JPIP specifications (Rec. ITU-T T.808 | ISO/IEC 15444-9) is expected to start shortly, with the project expected to be completed sometime in 2026.

Separately, the 3rd edition of Rec. ITU-T T.815 | ISO/IEC 15444-16, which specifies the carriage of JPEG 2000 imagery in the ISOBMFF and HEIF file formats, has been approved for publication. This new edition adds support for more flexible color signaling and JPEG 2000 video tracks.

JPEG RF

The JPEG RF exploration issued at this meeting the “JPEG Radiance Fields State of the Art and Challenges”, a public document that describes the latest developments on Neural Radiance Fields (NeRF) and 3D Gaussian Splatting (3DGS) technologies and defines a scope for the activity focusing on the creation of a coding standard. The JPEG Committee is also organizing a workshop on Radiance Fields jointly with MPEG, which will take place on January 31st, featuring key experts in the field presenting various aspects of this exciting new emerging technology.

Final Quote

“The newly approved JPEG AI, developed under the auspices of ISO, IEC and ITU, is the first image coding standard based on machine learning and is a breakthrough in image coding providing 30% compression gains over the most advanced solutions in state-of-the-art.” said Prof. Touradj Ebrahimi, the Convenor of the JPEG Committee.

MPEG Column: 150th MPEG Meeting (Virtual/Online)

By Christian Timmerer | May 7, 2025 - 07:15 |June 4, 2025 0225, 0225, Event Report, Feature, Standards

The 150th MPEG meeting was held online from 31 March to 04 April 2025. The official press release can be found here. This column provides the following highlights:

Requirements: MPEG-AI strategy and white paper on MPEG technologies for metaverse
JVET: Draft Joint Call for Evidence on video compression with capability beyond Versatile Video Coding (VVC)
Video: Gaussian splat coding and video coding for machines
Audio: Audio coding for machines
3DGH: 3D Gaussian splat coding

MPEG-AI Strategy

The MPEG-AI strategy envisions a future where AI and neural networks are deeply integrated into multimedia coding and processing, enabling transformative improvements in how digital content is created, compressed, analyzed, and delivered. By positioning AI at the core of multimedia systems, MPEG-AI seeks to enhance both content representation and intelligent analysis. This approach supports applications ranging from adaptive streaming and immersive media to machine-centric use cases like autonomous vehicles and smart cities. AI is employed to optimize coding efficiency, generate intelligent descriptors, and facilitate seamless interaction between content and AI systems. The strategy builds on foundational standards such as ISO/IEC 15938-13 (CDVS), 15938-15 (CDVA), and 15938-17 (Neural Network Coding), which collectively laid the groundwork for integrating AI into multimedia frameworks.

Currently, MPEG is developing a family of standards under the ISO/IEC 23888 series that includes a vision document, machine-oriented video coding, and encoder optimization for AI analysis. Future work focuses on feature coding for machines and AI-based point cloud compression to support high-efficiency 3D and visual data handling. These efforts reflect a paradigm shift from human-centric media consumption to systems that also serve intelligent machine agents. MPEG-AI maintains compatibility with traditional media processing while enabling scalable, secure, and privacy-conscious AI deployments. Through this initiative, MPEG aims to define the future of multimedia as an intelligent, adaptable ecosystem capable of supporting complex, real-time, and immersive digital experiences.

MPEG White Paper on Metaverse Technologies

The MPEG white paper on metaverse technologies (cf. MPEG white papers) outlines the pivotal role of MPEG standards in enabling immersive, interoperable, and high-quality virtual experiences that define the emerging metaverse. It identifies core metaverse parameters – real-time operation, 3D experience, interactivity, persistence, and social engagement – and maps them to MPEG’s longstanding and evolving technical contributions. From early efforts like MPEG-4’s Binary Format for Scenes (BIFS) and Animation Framework eXtension (AFX) to MPEG-V’s sensory integration, and the advanced MPEG-I suite, these standards underpin critical features such as scene representation, dynamic 3D asset compression, immersive audio, avatar animation, and real-time streaming. Key technologies like point cloud compression (V-PCC, G-PCC), immersive video (MIV), and dynamic mesh coding (V-DMC) demonstrate MPEG’s capacity to support realistic, responsive, and adaptive virtual environments. Recent efforts include neural network compression for learned scene representations (e.g., NeRFs), haptic coding formats, and scene description enhancements, all geared toward richer user engagement and broader device interoperability.

The document highlights five major metaverse use cases – virtual environments, immersive entertainment, virtual commerce, remote collaboration, and digital twins – all supported by MPEG innovations. It emphasizes the foundational role of MPEG-I standards (e.g., Parts 12, 14, 29, 39) for synchronizing immersive content, representing avatars, and orchestrating complex 3D scenes across platforms. Future challenges identified include ensuring interoperability across systems, advancing compression methods for AI-assisted scenarios, and embedding security and privacy protections. With decades of multimedia expertise and a future-focused standards roadmap, MPEG positions itself as a key enabler of the metaverse – ensuring that emerging virtual ecosystems are scalable, immersive, and universally accessible.

The MPEG white paper on metaverse technologies highlights several research opportunities, including efficient compression of dynamic 3D content (e.g., point clouds, meshes, neural representations), synchronization of immersive audio and haptics, real-time adaptive streaming, and scene orchestration. It also points to challenges in standardizing interoperable avatar formats, AI-enhanced media representation, and ensuring seamless user experiences across devices. Additional research directions include neural network compression, cross-platform media rendering, and developing perceptual metrics for immersive Quality of Experience (QoE).

Draft Joint Call for Evidence (CfE) on Video Compression beyond Versatile Video Coding (VVC)

The latest JVET AHG report on ECM software development (AHG6), documented as JVET-AL0006, shows promising results. Specifically, in the “Overall” row and “Y” column, there is a 27.06% improvement in coding efficiency compared to VVC, as shown in the figure below.

The Draft Joint Call for Evidence (CfE) on video compression beyond VVC (Versatile Video Coding), identified as document JVET-AL2026 | N 355, is being developed to explore new advancements in video compression. The CfE seeks evidence in three main areas: (a) improved compression efficiency and associated trade-offs, (b) encoding under runtime constraints, and (c) enhanced performance in additional functionalities. This initiative aims to evaluate whether new techniques can significantly outperform the current state-of-the-art VVC standard in both compression and practical deployment aspects.

The visual testing will be carried out across seven categories, including various combinations of resolution, dynamic range, and use cases: SDR Random Access UHD/4K, SDR Random Access HD, SDR Low Bitrate HD, HDR Random Access 4K, HDR Random Access Cropped 8K, Gaming Low Bitrate HD, and UGC (User-Generated Content) Random Access HD. Sequences and rate points for testing have already been defined and agreed upon. For a fair comparison, rate-matched anchors using VTM (VVC Test Model) and ECM (Enhanced Compression Model) will be generated, with new configurations to enable reduced run-time evaluations. A dry-run of the visual tests is planned during the upcoming Daejeon meeting, with ECM and VTM as reference anchors, and the CfE welcomes additional submissions. Following this dry-run, the final Call for Evidence is expected to be issued in July, with responses due in October.

The Draft Joint Call for Evidence (CfE) on video compression beyond VVC invites research into next-generation video coding techniques that offer improved compression efficiency, reduced encoding complexity under runtime constraints, and enhanced functionalities such as scalability or perceptual quality. Key research aspects include optimizing the trade-off between bitrate and visual fidelity, developing fast encoding methods suitable for constrained devices, and advancing performance in emerging use cases like HDR, 8K, gaming, and user-generated content.

3D Gaussian Splat Coding

Gaussian splatting is a real-time radiance field rendering method that represents a scene using 3D Gaussians. Each Gaussian has parameters such as position, scale, color, opacity, and orientation, and together they approximate how light interacts with surfaces in a scene. Instead of ray marching (as in NeRF), it renders images by splatting the Gaussians onto a 2D image plane and blending them using a rasterization pipeline, which is GPU-friendly and much faster. Developed by Kerbl et al. (2023), it is capable of real-time rendering (60+ fps) and outperforms previous NeRF-based methods in speed and visual quality. Gaussian splat coding refers to the compression and streaming of 3D Gaussian representations for efficient storage and transmission. It is an active research area and is under consideration for standardization in MPEG.
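To make the blending step more concrete, the following is a minimal sketch of front-to-back alpha compositing at a single pixel. It is an illustration only, not the reference renderer: the function name `composite_front_to_back`, the tuple layout, and the early-exit threshold are assumptions made for this example, and each splat's alpha is assumed to already include its projected 2D Gaussian falloff at the pixel.

```python
from typing import List, Tuple

Color = Tuple[float, float, float]
# Hypothetical per-pixel splat record: (depth, rgb, alpha).
Splat = Tuple[float, Color, float]


def composite_front_to_back(splats: List[Splat]) -> Color:
    """Blend the splats covering one pixel, nearest first.

    alpha is assumed to be the splat opacity already multiplied by the
    value of its projected 2D Gaussian at this pixel.
    """
    out = [0.0, 0.0, 0.0]
    transmittance = 1.0  # fraction of light not yet absorbed by nearer splats
    for _, rgb, alpha in sorted(splats, key=lambda s: s[0]):
        weight = alpha * transmittance
        for c in range(3):
            out[c] += weight * rgb[c]
        transmittance *= 1.0 - alpha
        if transmittance < 1e-4:  # assumed threshold: pixel is nearly opaque
            break
    return (out[0], out[1], out[2])


# A half-transparent red splat in front of an opaque blue one.
pixel = composite_front_to_back([
    (2.0, (0.0, 0.0, 1.0), 1.0),
    (1.0, (1.0, 0.0, 0.0), 0.5),
])
print(pixel)
```

Because each pixel only needs a depth-sorted list and a running product, this kind of loop maps well onto a rasterization pipeline, which is the property the text refers to as GPU-friendly.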

The MPEG technical requirements working group, together with the MPEG video working group, started an exploration on Gaussian splat coding, while the MPEG coding of 3D graphics and haptics (3DGH) working group addresses 3D Gaussian splat coding. Draft Gaussian splat coding use cases and requirements are available, and various joint exploration experiments (JEEs) are conducted between meetings.

(3D) Gaussian splat coding is actively researched in academia, also in the context of streaming, e.g., in “LapisGS: Layered Progressive 3D Gaussian Splatting for Adaptive Streaming” or “LTS: A DASH Streaming System for Dynamic Multi-Layer 3D Gaussian Splatting Scenes”. The research aspects of 3D Gaussian splat coding and streaming span a wide range of areas across computer graphics, compression, machine learning, and systems for real-time immersive media. In particular, they concern efficiently representing and transmitting Gaussian-based neural scene representations for real-time rendering. Key areas include compression of Gaussian parameters (position, scale, color, opacity), perceptual and geometry-aware optimizations, and neural compression techniques such as learned latent coding. Streaming challenges involve adaptive, view-dependent delivery, level-of-detail management, and low-latency rendering on edge or mobile devices. Additional research directions include standardizing file formats, integrating with scene graphs, and ensuring interoperability with existing 3D and immersive media frameworks.

MPEG Audio and Video Coding for Machines

The Call for Proposals on Audio Coding for Machines (ACoM), issued by the MPEG audio coding working group, aims to develop a standard for efficiently compressing audio, multi-dimensional signals (e.g., medical data), or extracted features for use in machine-driven applications. The standard targets use cases such as connected vehicles, audio surveillance, diagnostics, health monitoring, and smart cities, where vast data streams must be transmitted, stored, and processed with low latency and high fidelity. The ACoM system is designed in two phases: the first focusing on near-lossless compression of audio and metadata to facilitate training of machine learning models, and the second expanding to lossy compression of features optimized for specific applications. The goal is to support hybrid consumption – by machines and, where needed, humans – while ensuring interoperability, low delay, and efficient use of storage and bandwidth.

The CfP outlines technical requirements, submission guidelines, and evaluation metrics. Participants must provide decoders compatible with Linux/x86 systems, demonstrate performance through objective metrics like compression ratio, encoder/decoder runtime, and memory usage, and undergo a mandatory cross-checking process. Selected proposals will contribute to a reference model and working draft of the standard. Proponents must register by August 1, 2025, with submissions due in September, and evaluation taking place in October. The selection process emphasizes lossless reproduction, metadata fidelity, and significant improvements over a baseline codec, with a path to merge top-performing technologies into a unified solution for standardization.

Research aspects of Audio Coding for Machines (ACoM) include developing efficient compression techniques for audio and multi-dimensional data that preserve key features for machine learning tasks, optimizing encoding for low-latency and resource-constrained environments, and designing hybrid formats suitable for both machine and human consumption. Additional research areas involve creating interoperable feature representations, enhancing metadata handling for context-aware processing, evaluating trade-offs between lossless and lossy compression, and integrating machine-optimized codecs into real-world applications like surveillance, diagnostics, and smart systems.

The MPEG video coding working group approved the committee draft (CD) for ISO/IEC 23888-2 video coding for machines (VCM). VCM aims to encode visual content in a way that maximizes machine task performance, such as computer vision, scene understanding, autonomous driving, smart surveillance, robotics and IoT. Instead of preserving photorealistic quality, VCM seeks to retain features and structures important for machines, possibly at much lower bitrates than traditional video codecs. The CD introduces several new tools and enhancements aimed at improving machine-centric video processing efficiency. These include updates to spatial resampling, such as the signaling of the inner decoded picture size to better support scalable inference. For temporal resampling, the CD enables adaptive resampling ratios and introduces pre- and post-filters within the temporal resampler to maintain task-relevant temporal features. In the filtering domain, it adopts bit depth truncation techniques – integrating bit depth shifting, luma enhancement, and chroma reconstruction – to optimize both signaling efficiency and cross-platform interoperability. Luma enhancement is further refined through an integer-based implementation for luma distribution parameters, while chroma reconstruction is stabilized across different hardware platforms. Additionally, the CD proposes removing the neural network-based in-loop filter (NNLF) to simplify the pipeline. Finally, in terms of bitstream structure, it adopts a flattened structure with new signaling methods to support efficient random access and better coordination with system layers, aligning with the low-latency, high-accuracy needs of machine-driven applications.
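As a rough illustration of what bit depth shifting means in general, the sketch below drops least-significant bits before coding and restores the original scale afterwards, using integer arithmetic only. This is a generic example and not the procedure specified in the VCM committee draft: the function names, the rounding rule, and the clamping are assumptions made for this sketch.

```python
def truncate_bit_depth(sample: int, in_bits: int, shift: int) -> int:
    """Reduce a sample from in_bits to (in_bits - shift) bits.

    Assumed behaviour for this sketch: round to nearest, then clamp so
    the result still fits in the reduced bit depth.
    """
    if shift == 0:
        return sample
    rounded = (sample + (1 << (shift - 1))) >> shift
    max_out = (1 << (in_bits - shift)) - 1
    return min(rounded, max_out)


def restore_bit_depth(sample: int, shift: int) -> int:
    """Scale a truncated sample back to the original range."""
    return sample << shift


# 10-bit samples reduced to 8 bits and restored.
for value in (0, 512, 1023):
    low = truncate_bit_depth(value, in_bits=10, shift=2)
    print(value, low, restore_bit_depth(low, shift=2))
```

Restricting the operations to integer shifts and additions means every platform produces identical results, which is the general motivation behind integer-based implementations such as the one the text mentions for luma enhancement.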

Research in VCM focuses on optimizing video representation for downstream machine tasks, exploring task-driven compression techniques that prioritize inference accuracy over perceptual quality. Key areas include joint video and feature coding, adaptive resampling methods tailored to machine perception, learning-based filter design, and bitstream structuring for efficient decoding and random access. Other important directions involve balancing bitrate and task accuracy, enhancing robustness across platforms, and integrating machine-in-the-loop optimization to co-design codecs with AI inference pipelines.

Concluding Remarks

The 150th MPEG meeting marks significant progress across AI-enhanced media, immersive technologies, and machine-oriented coding. With ongoing work on MPEG-AI, metaverse standards, next-gen video compression, Gaussian splat representation, and machine-friendly audio and video coding, MPEG continues to shape the future of interoperable, intelligent, and adaptive multimedia systems. The research opportunities and standardization efforts outlined in this meeting provide a strong foundation for innovations that support real-time, efficient, and cross-platform media experiences for both human and machine consumption.

The 151st MPEG meeting will be held in Daejeon, Korea, from 30 June to 04 July 2025. Click here for more information about MPEG meetings and their developments.

CASTLE 2024: A Collaborative Effort to Create a Large Multimodal Multi-perspective Daily Activity Dataset

By Silvia | May 1, 2025 - 12:17 |June 4, 2025 0225, 0225, Event Report, Feature


This report describes the CASTLE 2024 event, a collaborative effort to create a PoV 4K video dataset recorded by a dozen people in parallel over several days. The participating content creators wore a GoPro and a Fitbit for approximately 12 hours each day while engaging in typical daily activities. The event took place in Ballyconneely, Ireland, and lasted for four days. The resulting data is publicly available and can be used for papers, studies, and challenges in the multimedia domain in the coming years. A preprint of the paper presenting the resulting dataset is available on arXiv (https://arxiv.org/abs/2503.17116).

Introduction

Motivated by the need for a real-world PoV video dataset, a group of co-organizers of the annual VBS and LSC challenges came together to hold an invitational workshop and generate a novel PoV video dataset. In the first week of December 2024, twelve researchers from the multimedia community gathered in a remote house in Ballyconneely, Ireland, with the goal of creating a large multi-view and multimodal lifelogging video dataset. Wearing a Fitbit on their wrists and a GoPro Hero 13 on their heads for about 12 hours a day, and with five fixed cameras capturing the environment, they began a journey of 4K lifelogging. They lived together for four full days and performed typical everyday tasks, such as cooking, eating, washing dishes, talking, discussing, reading, watching TV, and playing games (ranging from paper plane folding and darts to quizzes). While this sounds very enjoyable, the whole event required a lot of effort, discipline, and meticulous planning – in terms of food and, more importantly, data acquisition, data storage, data synchronization, avoiding the use of any copyrighted material (books, movies, songs, etc.), limiting the use of smartphones and laptops for privacy reasons, and making the content as diverse as possible. Figure 1 gives an impression of the event and shows different activities by the participants.

**Figure 1:** Participants at CASTLE 2024, having a light dinner and playing cards.

Organisational Procedure

Months before the event, we began planning the recording equipment, the participants, the activities, and the food.

The first challenge was figuring out a way to make wearing a GoPro camera all day as simple and enjoyable as possible. This was achieved by using the camera with an elastic strap for a strong hold, a specifically adapted rubber pad on the back of the camera, and a USB-C cable to a large 20,000 mAh power bank that every participant carried in their pocket. At the end of each day, the Fitbits, the battery packs, and the SD cards of every participant were collected, approximately 4 TB of data was copied to an on-site NAS system, the SD cards were cleared, and the batteries were fully charged, so that everything was usable again the next morning.

We ended up with six people from Dublin City University and six international researchers, but only 10 people wore recording equipment. Every participant was asked to prepare at least one breakfast, lunch, or dinner, and all the food and drinks were purchased a few days before the event.

After arrival at the house, every participant had to sign an agreement that all collected data can be publicly released and used for scientific purposes in the future.

CASTLE 2024 Multimodal Dataset

The dataset (https://castle-dataset.github.io/) that emerged from this collaborative effort contains heart rate and step logs of 10 people, 4K@50fps video streams from five fixed cameras, and 4K video streams from 10 head-mounted cameras. The recording time per day is 7-12 hours per device, resulting in over 600 hours of video totaling about 8.5 TB of data after processing and more efficient re-encoding. The videos were split into one-hour parts that all start on the hour. This alignment was achieved in a multi-stage process: a machine-readable QR code-based clock provided the initial rough alignment, and a subsequent correlation analysis of the audio signals provided the fine alignment.
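The fine-alignment step can be illustrated with a small cross-correlation sketch. The actual CASTLE processing pipeline is not described here, so the function name `best_lag`, the brute-force search, and the toy signals are assumptions made purely for illustration; a real implementation would work on long audio tracks and would likely use an FFT-based correlation.

```python
from typing import Sequence


def best_lag(reference: Sequence[float], other: Sequence[float], max_lag: int) -> int:
    """Return the lag (in samples) at which `other` best matches `reference`.

    A positive result means `other` is delayed relative to `reference`,
    i.e. other[n + lag] lines up with reference[n].
    """
    best, best_score = 0, float("-inf")
    for lag in range(-max_lag, max_lag + 1):
        score = 0.0
        for n, r in enumerate(reference):
            m = n + lag
            if 0 <= m < len(other):
                score += r * other[m]
        if score > best_score:
            best, best_score = lag, score
    return best


# Toy example: the same short audio event, recorded 3 samples later
# by a second device.
ref = [0, 0, 1, 3, 1, 0, 0, 0, 0, 0]
late = [0, 0, 0, 0, 0, 1, 3, 1, 0, 0]
lag = best_lag(ref, late, max_lag=5)
sample_rate = 48_000  # assumed sample rate for the example
print(lag, lag / sample_rate)
```

The rough alignment limits how far apart two recordings can be, so `max_lag` can stay small and the search only has to resolve the remaining sub-second offset.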

The language spoken in the videos is mainly English with a few parts of (Swiss-)German and Vietnamese. The activities by the participants include:

preparing food and drinks
eating
washing dishes
cleaning up
discussing
hiding items
presenting and listening
drawing and painting
playing games (e.g., chess, darts, guitar, various card games, etc.)
reading (out loud)
watching TV (open source videos)
going for a walk
going for a car ride

Use Scenarios of the Dataset

The dataset can be used for content retrieval contests, such as the Lifelog Search Challenge (LSC) and the Video Browser Showdown (VBS), but also for automatic content recognition and annotation challenges, such as the CASTLE Challenge that will happen at ACM Multimedia 2025 (https://castle-dataset.github.io/).

Further application scenarios include complex scene understanding, 3D reconstruction and localization, audio event prediction, source separation, human-human/machine interaction, and many more.

Challenges of Organizing the Event

As this was the first collaborative event to collect such a multi-view multimodal dataset, there were also some challenges that are worth mentioning and may help others who want to organize a similar event in the future.

First of all, the event turned out to be much more costly than originally planned. Reasons for this include the increased living and rental costs, the travel costs for international participants, and expenses for technical equipment such as batteries, which we originally did not intend to use. Originally we wanted to organize the event in a real castle, but this turned out to be far too expensive, without a significant gain.

It was also hard for the participants to maintain privacy over all four days, since even quickly responding to emails was not possible. During walks and car rides, we needed to make sure that other people and license plates were not recorded.

In terms of the data, it should be mentioned that the different recording devices needed to be synchronized. This was achieved via regular capturing of dynamic QR codes showing the master time (or wall clock time), and using these positions in all videos as temporal anchors during post-processing.
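The anchor idea can be sketched as follows, assuming each decoded QR code yields a pair of video timestamp and wall-clock time, both in seconds. The function names and the use of a median are assumptions for this illustration and do not describe the actual post-processing code.

```python
import math
from statistics import median
from typing import List, Tuple


def estimate_offset(anchors: List[Tuple[float, float]]) -> float:
    """Estimate the wall-clock time at which a recording started.

    anchors holds (video_time_s, wall_clock_s) pairs, one per decoded
    QR code. The median tolerates a few misread or late frames.
    """
    return median(wall - video for video, wall in anchors)


def hour_cut_points(offset: float, duration_s: float) -> List[float]:
    """Video timestamps at which the wall clock crosses a full hour."""
    first_hour = math.ceil(offset / 3600.0) * 3600.0
    points = []
    t = first_hour - offset
    while t < duration_s:
        points.append(t)
        t += 3600.0
    return points


# Recording that started at 08:55:00 (32100 s after midnight);
# the third anchor is off by one second.
anchors = [(10.0, 32110.0), (70.0, 32170.0), (130.0, 32231.0)]
offset = estimate_offset(anchors)
print(offset, hour_cut_points(offset, duration_s=8000.0))
```

With one offset per device, every recording can be expressed on the same wall-clock timeline and cut at the same hour boundaries.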

The data volume, together with the available transfer speed, was also an issue: copying all the data from the SD cards took many hours each night.

Summary

The CASTLE 2024 event brought together twelve multimedia researchers in a remote house in Ireland for an intensive four-day data collection retreat, resulting in a rich multimodal 4K video dataset designed for lifelogging research. Equipped with head-mounted GoPro cameras and Fitbits, ten participants captured synchronized, real-world point-of-view footage while engaging in everyday activities like cooking, playing games, and discussing, with additional environmental video captured from fixed cameras. The team faced significant logistical challenges, including power management, synchronization, privacy concerns, and data storage, but ultimately produced over 600 hours of aligned video content. The dataset – freely available for scientific use – is intended to support future research and competitions focused on content-based video analysis, lifelogging, and human activity understanding.

JPEG Column: 105th JPEG Meeting in Berlin, Germany

By Antonio Pinheiro | April 4, 2025 - 15:56 |April 23, 2025 0125, Event Report, Feature, Standards


JPEG Trust becomes an International Standard

The 105th JPEG meeting was held in Berlin, Germany, from October 6 to 11, 2024. During this JPEG meeting, JPEG Trust was sent for publication as an International Standard. This is a major achievement in providing standardized tools to effectively fight against the proliferation of fake media and disinformation while restoring confidence in multimedia information.

In addition, the JPEG Committee also sent for publication the JPEG Pleno Holography standard, which is the first standardized solution for holographic content coding. This type of content might be represented by huge amounts of information, and efficient compression is needed to enable reliable and effective applications.

The following sections summarize the main highlights of the 105th JPEG meeting:

105th JPEG Meeting, held in Berlin, Germany.

JPEG Trust
JPEG Pleno
JPEG AI
JPEG XE
JPEG AIC
JPEG DNA
JPEG XS
JPEG XL

JPEG Trust

In an important milestone, the first part of JPEG Trust, the “Core Foundation” (ISO/IEC IS 21617-1) International Standard, has now been approved by the international ISO committee and is being published. This standard addresses the problem of dis- and misinformation and provides leadership in global interoperable media asset authenticity. JPEG Trust defines a framework for establishing trust in digital media.

Users of social media are challenged to assess the trustworthiness of the media they encounter, and agencies that depend on the authenticity of media assets must be concerned with mistaking fake media for real, with risks of real-world consequences. JPEG Trust provides a proactive approach to trust management. It is built upon and extends the Coalition for Content Provenance and Authenticity (C2PA) engine. The first part defines the JPEG Trust framework and provides building blocks for more elaborate use cases via its three main pillars:

Annotating provenance – linking media assets together with their associated provenance annotations in a tamper-evident manner
Extracting and evaluating Trust Indicators – specifying how to extract an extensive array of Trust Indicators from any given media asset for evaluation
Handling privacy and security concerns – providing protection for sensitive information based on the provision of JPEG Privacy and Security (ISO/IEC 19566-4)

Trust in digital media is context-dependent. JPEG Trust does NOT explicitly define trustworthiness but rather provides a framework and tools for proactively establishing trust in accordance with the trust conditions needed. The JPEG Trust framework outlined in the core foundation enables individuals, organizations, and governing institutions to identify specific conditions for trustworthiness, expressed in Trust Profiles, to evaluate relevant Trust Indicators according to the requirements for their specific usage scenarios. The resulting evaluation can be expressed in a Trust Report to make the information easily accessed and understood by end users.

JPEG Trust has an ambitious schedule of future work, including evolving and extending the core foundation into related topics of media tokenization and media asset watermarking, and assembling a library of common Trust Profile requirements.

JPEG Pleno

The JPEG Pleno Holography activity reached a major milestone with the FDIS of ISO/IEC 21794-5 being accepted and the International Standard being under preparation by ISO. This is a major achievement for this activity and is the result of the dedicated work of the JPEG Committee over a number of years. The JPEG Pleno Holography activity continues with the development of a White Paper on JPEG Pleno Holography to be released at the 106th JPEG meeting and planning for a workshop for future standardization on holography intended to be conducted in November or December 2024.

The JPEG Pleno Light Field activity focused on the 2nd edition of ISO/IEC 21794-2 (“Plenoptic image coding system (JPEG Pleno) Part 2: Light field coding”) which will integrate AMD1 of ISO/IEC 21794-2 (“Profiles and levels for JPEG Pleno Light Field Coding”) and include the specification of the third coding mode entitled Slanted 4D Transform Mode and the associated profile.

Following the Call for Contributions on Subjective Light Field Quality Assessment and as a result of the collaborative process, the JPEG Pleno Light Field is also preparing standardization activities for subjective and objective quality assessment of light fields. At the 105th JPEG meeting, collaborative subjective results on light field quality assessments were presented and discussed. The results will guide the subjective quality assessment standardization process, which has issued its fourth Working Draft.

The JPEG Pleno Point Cloud activity released a White Paper on JPEG Pleno Learning-based Point Cloud Coding. This document outlines the context, motivation, and scope of the upcoming Part 6 of ISO/IEC 21794 scheduled for publication in early 2025, as well as giving the basis of the new technology, use cases, performance, and future activities. This activity focuses on a new exploration study into the latent space optimization for the current Verification Model.

JPEG AI

At the 105th meeting, the JPEG AI activity concentrated primarily on advancing Part 2 (Profiling), Part 3 (Reference Software), and Part 4 (Conformance). Part 4 moved forward to the Committee Draft (CD) stage, while Parts 2 and 3 are anticipated to reach DIS at the next meeting. The conformance CD outlines different types of conformance: 1) strict conformance for decoded residuals; 2) soft conformance for decoded feature tensors, allowing minor deviations; and 3) soft conformance for decoded images, ensuring that image quality remains comparable to or better than the quality offered by the reference model. For decoded images, two types of soft conformance were introduced based on device capabilities. Discussions on Part 2 examined memory requirements for various JPEG AI VM codec configurations. Additionally, three core experiments were established during this meeting, focusing on JPEG AI subjective assessment, integerization, and the study of profiles and levels.

JPEG XE

The JPEG XE activity is currently focused on preparing to handle responses to the open Final Call for Proposals on lossless coding of events. This activity revolves around a new and emerging image modality created by event-based visual sensors. JPEG XE is about the creation and development of a standard to represent events in an efficient way, allowing interoperability between sensing, storage, and processing, and targeting machine vision and other relevant applications. The Final Call for Proposals ends in March 2025 and aims to receive relevant coding tools that will serve as a basis for a JPEG XE standard. The JPEG Committee is also preparing discussions on lossy coding of events and on how to evaluate such lossy coding technologies in the future. The JPEG Committee invites those interested in the JPEG XE activity to consider the public documents available on jpeg.org. The Ad-hoc Group on event-based vision was re-established to continue work towards the 106th JPEG meeting. To stay informed about this activity, please join the event-based vision Ad-hoc Group mailing list.

JPEG AIC

Part 3 of JPEG AIC (AIC-3) advanced to the Committee Draft (CD) stage during the 105th JPEG meeting. AIC-3 defines a methodology for subjective assessment of the visual quality of high-fidelity images. Based on two test protocols—Boosted Triplet Comparisons and Plain Triplet Comparisons—it reconstructs a fine-grained quality scale in JND (Just Noticeable Difference) units. According to the defined work plan, JPEG AIC-3 is expected to advance to the Draft International Standard (DIS) stage by April 2025 and become an International Standard (IS) by October 2026. During this meeting, the JPEG Committee also focused on the upcoming Part 4 of JPEG AIC, which refers to the objective quality assessment of high-fidelity images.

JPEG DNA

JPEG DNA is an initiative aimed at developing a standard capable of representing bi-level, continuous-tone grey-scale, continuous-tone colour, or multichannel digital samples in a format using nucleotide sequences to support DNA storage. The JPEG DNA Verification Model was created during the 102nd JPEG meeting based on the performance assessments and descriptive analyses of the submitted solutions to the Call for Proposals, published at the 99th JPEG meeting. Several core experiments are continuously conducted to validate and improve this Verification Model (VM), leading to the creation of the first Working Draft of JPEG DNA during the 103rd JPEG meeting. At the 105th JPEG meeting, the committee created a New Work Item Proposal for JPEG DNA to make it an official ISO work item. The proposal stated that JPEG DNA would be a multi-part standard: Part 1—Core Coding System, Part 2—Profiles and Levels, Part 3—Reference Software, and Part 4—Conformance. The committee aims to reach the IS stage for Part 1 by April 2026.

JPEG XS

The third editions of JPEG XS, Part 1 – Core coding tools, Part 2 – Profiles and buffer models, and Part 3 – Transport and container formats, have now been published and made available on ISO. The JPEG Committee is finalizing the third edition of the remaining two parts of the JPEG XS standards suite, Part 4 – Conformance testing and Part 5 – Reference software. The FDIS of Part 4 was issued for ballot at this meeting. Part 5 is still at the Committee Draft stage, and the DIS is planned for the next JPEG meeting. The reference software has a feature-complete decoder fully compliant with the 3rd edition. Work on the TDC profile encoder is ongoing.

JPEG XL

A third edition of JPEG XL Part 2 (File Format) will be initiated to add an embedding syntax for ISO 21496 gain maps, which can be used to represent a custom local tone mapping and have artistic control over the SDR rendition of an HDR image coded with JPEG XL. Work on hardware and software implementations continues, including a new Rust implementation.

Final Quote

“In its commitment to tackle dis/misinformation and to manage provenance, authorship, and ownership of multimedia information, the JPEG Committee has reached a major milestone by publishing the first ever ISO/IEC endorsed specifications for bringing back trust into multimedia. The committee will continue developing additional enhancements to JPEG Trust. New parts of the standard are under development to define a set of additional tools to further enhance interoperable trust mechanisms in multimedia.” said Prof. Touradj Ebrahimi, the Convenor of the JPEG Committee.

SIGMM Workshop on Multimodal AI Agents

By Silvia | March 12, 2025 - 13:49 |April 23, 2025 0125, 0125, Event Report, Feature


The SIGMM Workshop on Multimodal AI Agents was held on October 28th, 2024, at ACMMM24 in Melbourne as an invitation-only event. The initiative was launched by Alberto Del Bimbo, Ramesh Jain, and Alan Smeaton following a vision of the future where multimedia expertise converges with the power of large language models and the belief that there is a great opportunity to position the Multimedia research community at the center of this transformation. The event was structured as three roundtables, inviting some of the most influential figures in the multimedia field to brainstorm on key issues. The goal was to design the future, identifying the multimodal opportunity in the days of powerful large-model systems and preparing an agenda for the coming years for the SIGMM community. We did not want to overlap with the current thinking on how multimodality will be included in the emerging large models. Instead, the focus was on how deep multimodality is essential for building the next stages of AI agents for real-world applications and how fundamental it is for understanding real-time contexts and for actions by agents. The event received a great response, with over 30 attendees from both academia and industry, representing 13 different countries.

The three roundtables focused on Tech Challenges, Applications, and Industry-University collaboration. The participants were divided into three groups and assigned to the three roundtables according to their profiles and preferences. For the roundtables, we did not prepare specific questions but rather outlined key areas of focus for discussion. A brief document was prepared and given to the discussants a few days before the meeting. It provided a short introduction for each roundtable, summarizing the topic of the debate and highlighting three major subjects to guide the discussion.

In the following we report a brief synthesis of the discussions at the roundtables, highlighting the principal arguments of discussion and proposals.

Tech challenges Roundtable

Motivations for the discussion: As large pre-trained models become more prevalent and move towards multimodality, looking at the future, a key issue for their usage arises around the impact of their updating and fine-tuning, understanding how to ensure that improvements in one area don’t come at the cost of degradation in others. It is also fundamentally important to understand how deep multimodality is essential for building next stages of AI agents for real world applications, as well as for comprehending real-time contexts and guiding actions by agents towards Artificial General Intelligence.

Some salient sentences, open questions, proposals from the discussion:

The interplay between human intelligence and machine intelligence is a fundamental aspect of what should be multi-modal. There are not yet deep enough multimodal models…. models for information that truly span all, or even a subset of modalities. We need metrics for this human-machine, human-intelligence machine-intelligence, action. We should come up with and define a task around how people collaborate productively. We should look at something like dynamic difficulty adjustment, that requires continuous, real-time development or training.
Benchmarks are of crucial importance, not just to evaluate one thing against another thing, but to stretch the capabilities. It is not just about passing the benchmark; it is about setting the targets. We should envision a SIGMM-endorsed or sponsored multimodal benchmark by approaching some big tech companies to benchmark some multimodal activity within and across companies.

Applications Roundtable

Motivations for the discussion: Multimodality is a cornerstone of emerging real-world applications, providing context and situational awareness to systems. Large Multimodal Models are credited with transforming various industries and enabling new applications. Key challenges lie in developing computational approaches for media fusion to construct context and situational understanding, addressing real-time computing costs, and refining model building. It is therefore essential for the SIGMM community to reason on how to build a vibrant community around one or a few key applications.

Some salient sentences, open questions, proposals from the discussion:

There are many application areas where the SIGMM community can provide vital and innovative contributions and should concentrate its applied research. Example application areas and examples of research are:

Health: there is an absence of open-ended sensory data representing long-term complex information in the health area. We can think of integrated, federated machine learning, i.e. an integrated, federated data space for data control.
Education: we can think of some futuristic learning approach, like completely autonomous learning. Namely, AI agents that will be supportive through observation models, able to adjust the learning level so that some can finish faster than others and learn depending on the modalities they like to receive. It is also of key importance to consider what the role of the teacher and the role of AI are.
Productivity: we can think of tools for immersive multi-modal experiences, to generate cross-modal content including 3D and podcasting in immersive environments.
Entertainment: we should think of how we can improve entertainment through immersive, story-driven experiences.

Industry and University Roundtable

Motivations for the discussion: Research on large AI models is by far dominated by private companies, thanks in part to their access to the data and the cost of building and training such models. As a result, academic institutions are being left behind in the AI race. It is therefore urgent to reason about which research directions are viable for universities and think of new Industry-University collaboration models for multimodal AI research. It is also important to capitalize on the unique advantage of academia, namely its neutrality and ability to address long-term social and ethical issues related to technology.

Some salient sentences, open questions, proposals from the discussion:

Small and medium enterprises feel that they are left out. These are the ones who came to talk to universities. This is an opportunity for the SIGMM community to see how we can help. SIGMM could sponsor joint PhD programs, for example addressing small-size multimodal foundation models or intelligent agents, where a company sponsors part of the grant project.
SIGMM should promote high-visibility events at ACM Multimedia like Grand Challenges and Hackathons. As a community we could sponsor a company-wise Grand Challenge on multimodal AI and intelligent agents, leveraging industry to contribute more data sets. We could promote a regional-global Hackathon where Hackathons are held and overseen in different regions of the world, and the top teams are then invited to come to ACM Multimedia and compete there.

Based on the discussions at the roundtables, we have identified several concrete actions that could help position the SIGMM research community at the forefront of the multimodal AI transformation:

At the next ACM Multimedia Conference

Explicit inclusion of multimodality as a key topic in the next ACM Multimedia call.
Multimodal Hackathon on Intelligent Agents (regional-global hackathon).
Multimodal Benchmarks (collaborations within and across major tech companies).
Multimodal Grand Challenges (in partnership with industry leaders).

At the next ACM SIGMM call for Special Projects

Special Projects focused on Multimodal AI.

SIGMM is committed to pursuing these initiatives.

Diversity and Inclusion in focus at ACM IMX 2024

By Silvia | February 28, 2025 - 15:45 |April 23, 2025 0125, 0125, Event Report, Feature

Summary: ACM IMX 2024 took place in Stockholm, Sweden, from June 12 to 14, continuing its dedication to promoting diversity within the community. Recognising the importance of amplifying varied voices and experiences to advance the field, the conference built on IMX’s prior achievements through a series of initiatives to promote diversity and inclusion (D&I). This column provides a concise overview of the main D&I initiatives, including childcare support, early-career researcher grants, and manuscript accessibility support. It includes participant feedback and short testimonials shared during and after the conference to highlight the value of these initiatives.

To encourage a broad and inclusive pool of organisers, the general chairs of ACM IMX’24 teamed seasoned committee members with new members within the organising committee, actively fostering mentoring opportunities that support continuity and the development of future conference leadership. In addition, IMX’24 invited community members to self-nominate for various chair and organisational roles, making it clear that these roles were open to all who were interested in helping organise the conference. This call for applications was announced during the closing session of ACM IMX’23 in Nantes, France and, over a two-month period, the committee received 12 applications, from which 5 candidates were selected to serve as chairs in various capacities. This inclusive approach allowed ACM IMX to engage with junior members and volunteers who might not have been reached through traditional recruitment methods, pairing them with experienced team members so that they could build their network within the community and their skills in conference organisation and management.

SIGMM support was used to enable the chairs of IMX’24 to introduce several initiatives to ensure that all individuals, regardless of personal circumstances, could participate fully in the conference. These initiatives had openly announced calls to all eligible community members who wished to attend the conference in person in Stockholm but required financial assistance. To ensure a fair and thorough selection, the IMX’24 Diversity and Inclusion Chairs, in collaboration with the General Chairs, reviewed each of the applications to ensure that the widest range of support could be offered with the available funds. Applications were evaluated on a rolling basis to ensure that participants were able to organise their travel and visa arrangements without the added challenges of time pressure.

With this support from SIGMM, Diversity and Inclusion grants for IMX were made available for participants, covering:

Travel Support for Non-Students from Marginalised and Underrepresented Groups: This grant provided travel support for researchers who self-identified as marginalised or underrepresented within the ACM IMX community, particularly those from non-WEIRD (Western, Educated, Industrialised, Rich, Democratic) countries who lacked other funding opportunities. Priority was given to early-career researchers (such as post-docs) and those needing financial assistance, to complement existing SIGCHI and SIGMM student-targeted travel grants.
Childcare and Parental Support: This grant offered financial assistance to parents attending ACM IMX’24, subsidising childcare costs to enable broader participation and to cover expenses related to children’s travel, travel for a childcare companion, and on-site or arranged babysitting during the conference.
Disability and Carer Support: This grant aimed to support attendees on extended leave from work due to disability, parental responsibilities, or other personal circumstances. Recipients of this award also received a complimentary conference registration.
Student Travel Awards: SIGMM also provided awards directly to students to support travel expenses, enabling a broader range of participation and complementing the free registration offered to students volunteering at the conference.

SIGMM’s special initiatives for diversity and inclusion enabled IMX’24 to secure a keynote designed to foster a more inclusive dialogue. Delivered by artist Jake Elwes (a self-described hacker, radical faerie, and researcher), the keynote focused on “queer artificial intelligence” and featured deepfake drag performers. Elwes’ work invited the attendees to reflect on who builds these systems, the intentions behind them, and how they can be reclaimed to envision and create different visions of a technology-enhanced future.

In combination with support from SIGMM, a special workshop focused on engaging with research and researchers from Latin America as a region of interest was made possible through the generous backing of the SIGCHI Development Fund (SDF). This enabled researchers and workshop keynote speakers to participate in the “IMX in Latin America – 2nd International Workshop” and to attend the conference. A core objective was to increase diversity by broadening the IMX community through actively encouraging colleagues from Latin America to attend and contribute. This workshop also published its submissions as part of the ACM IMX’24 workshop proceedings in ICPS.

For the first time at ACM IMX, an external provider (TAPS) was hired to ensure accessibility of papers prior to publication. Finally, the conference offered a range of venue-focused diversity and inclusion initiatives, including the provision of all-gender bathrooms, pronoun badges, and approachable senior community members to support engagement. A care corner and tables were thoughtfully set up throughout the conference to provide attendees with free hygiene essentials such as masks, refreshers, hand sanitisers, sanitary pads and tampons. These measures highlighted ACM IMX’24’s commitment to fostering a welcoming and accessible environment for all participants.

Figure 1: Participants’ responses on their perception of diversity and inclusion at IMX, highlighting that it encompasses representation, welcoming environments, active engagement, research focus, and shaping future media experiences.

“During the closing event of IMX2024, we asked our attendees to answer a few questions that could help plan future IMX conferences. We asked everyone to share what future research directions could be included to address D&I at IMX. Some of the suggestions were to include the field of Humanities, to study usability among different demographics, and to understand how people who might not have economic access to technology could benefit from such technology. We also asked everyone to select what, according to them, is D&I at IMX. The options ‘Everyone feels welcomed’, ‘Diverse individuals are able to engage and contribute’ and ‘People from diverse backgrounds get represented and have a voice’ received a majority of the votes when compared to ‘Shape the future of interactive media experiences’ and ‘Research that focuses on diversity and inclusion in media experiences’. When asked to share how included they felt at IMX2024, 92% of the participants shared that they felt either included or very much included [with some leaving the question unanswered]. They also shared how different aspects made them feel included. Some of the highlights were the care corner that was arranged to support the basic needs of the attendees, the social events, interactions at the conference, and the community.” – Sujithra Raviselvam, IMX’24 Diversity and Inclusion Co-Chair.

Figure 2: Participants’ feedback on factors contributing to feelings of inclusion and exclusion at IMX, along with suggestions for future research directions aimed at improving diversity and inclusion. The feedback highlights personal interactions, event organization, and amenities as key to feeling included, while future research suggestions focus on enhancing accessibility, providing economic support, and integrating more diverse perspectives in HCI research.

The best way to understand the impact of this support is through the words of those it enabled to join the conference.

“The grant received for IMX2024 allowed me to attend the conference. Having a young child is challenging as an early researcher, as you must, sometimes, sacrifice your career or family. This grant allowed me to travel without any of these. I could attend the conference without stress or second thoughts, and support my family during the few days of the conference. Thanks to this, I received valuable feedback on my work, followed interesting presentations, and did not miss my family.” – Romain Herault, childcare award recipient.

“I had the opportunity to present our qualitative study focused on understanding the sensitive values of women entrepreneurs in Brazil to support designing multimodal conversational AI financial systems at IMX, followed by interesting discussions about it in the workshop organized by Debora Christina Muchaluat Saade, Mylene Farias and Jesus Favela. The conference was focused on the future of multimodal technologies, with many exciting demos to investigate, to make more accessible, and to challenge assumptions of real life through a multimedia lens. We also had a conference dinner with the theme of the midsummer celebration. I was amazed by its meaning; as far as I understood, the purpose is to celebrate the light, sun, and summer season with family and friends! I loved it! It was also an opportunity to explore the beautiful Stockholm city with new colleagues and meet current collaborators in research.” – Heloisa Caroline de Souza Pereira Candello.

A total of 21 applicants received support through diversity and inclusion grants provided by both SIGMM and the SIGCHI Development Fund (SDF). This assistance enabled full participation in ACM IMX’24 and supported a diverse group, including students, non-students from marginalised backgrounds, early-career researchers, and Latin American researchers, all of whom benefitted from these grants and made up more than 10% of the total conference attendees – truly changing and undoubtedly enhancing the experience of all attendees at the conference.

Figure 3: The word clouds present two data sets from an IMX survey: the countries respondents identify as home, and the locations they would like IMX to feature in the future. They highlight a diverse range of home countries, including Brazil, Germany, and India, and suggest future IMX locations such as Japan, Brazil, and various cities in the USA, indicating a global interest and the geographical diversity of the IMX community.

Reports from ACM Multimedia 2024

By chenjingjing | February 18, 2025 - 15:42 |April 23, 2025 0125, Conference Report, Event Report, Feature, Social Media Posts

Introduction

The ACM Multimedia Conference 2024, held in Melbourne, Australia from October 28 to November 1, 2024, was a major event that brought together leading researchers, practitioners, and industry professionals in the field of multimedia. This year’s conference marked a significant milestone as it was the first time since the end of the COVID-19 pandemic that the event returned to the Asia-Pacific region and resumed as a fully in-person gathering. The event offered a dynamic platform for presenting cutting-edge research, exploring new trends, and fostering collaborations across academia and industry.

Held in Melbourne, a city known for its vibrant culture and technological advancements, the conference was well-organized, ensuring a seamless experience for all participants. As part of its ongoing commitment to supporting the next generation of multimedia researchers, SIGMM awarded Student Travel Grants to 24 students. Each recipient received up to 1,000 USD to cover their travel and accommodation expenses. These grants were intended to help students who showed academic promise but faced financial barriers, allowing them to fully engage with the conference and its events. To apply, students were required to submit an online form, and the selection committee chose the recipients based on academic excellence and demonstrated financial need.

To give a voice to the travel grant recipients, we interviewed several of them to hear about their experiences and the impact the conference had on their academic and professional development. Below are some of their reflections.

Zhedong Zhang – Hangzhou Dianzi University

ACM Multimedia 2024 in Melbourne was my first international academic conference, and I am incredibly grateful to SIGMM for providing the travel grant. It was a great honor to present my paper, “From Speaker to Dubber: Movie Dubbing with Prosody and Duration Consistency Learning”, and to receive the Best Paper Award. As a PhD student, this recognition means a lot to me and encourages me to keep pushing forward with my research.

Beyond the academic presentations, I had the chance to meet many brilliant researchers and fellow PhD students. I made connections with scholars working on similar topics and exchanged ideas that will help improve my work. The networking events and social gatherings were also highlights, as they allowed me to build friendships with colleagues from different parts of the world. I am truly grateful to SIGMM for making this experience possible and for the chance to be part of such a vibrant and inspiring academic community. I look forward to continuing my research and contributing to this exciting field.

Wu Tao – Zhejiang University

I’m incredibly grateful to the SIGMM team for awarding me this student travel grant – it really helped me a lot. I got to learn about so many fascinating papers at the conference and meet some brilliant professors and students. I even see some potential for future collaborations. I also had the chance to meet some big names in the field, like Tat-Seng Chua, who I’ve admired for a while. Meeting him, chatting, and even taking a photo with him felt like a once-in-a-lifetime opportunity, and I’m so thankful for it.

As for my own paper, I was both surprised and thrilled to see it actually got quite a bit of attention. At the welcome reception on the second day – before the poster session even began and before I’d even put up my poster – I noticed a few students already looking it up on their laptops. During the poster session, which was supposed to be two hours but probably stretched to three, I had a steady stream of people coming by to check out my work and ask questions. Some people even approached me earlier that morning. It was incredibly motivating to feel that kind of recognition and interest in what I’m working on. Thank you once again for this generous support! I look forward to attending the conference again.

Jianjun Qiao – Southwest Jiaotong University

Attending ACM Multimedia 2024 in Melbourne was an incredible opportunity that greatly enriched my academic journey. This was my first time participating in an in-person conference, and I’m so grateful for the experience. The keynotes were fascinating, especially the talk on Multimodal LLMs, which has significantly influenced my current research. I also enjoyed the poster sessions, where I could present my own work and engage in meaningful discussions with researchers from diverse backgrounds. The networking opportunities were invaluable, and I made several connections that I believe will lead to fruitful collaborations. I would like to extend my sincere thanks to SIGMM for the travel grant, which made my attendance possible. It was truly an unforgettable experience.

Changli Wu – Xiamen University

ACM Multimedia 2024 was an unforgettable experience that exceeded all my expectations. As a PhD student, this was my first time presenting my research on 3D Referring Expression Segmentation at such a prestigious conference. The discussions I had with other attendees were invaluable, and I received constructive feedback that will undoubtedly improve my work. The diversity of the sessions was a highlight for me, as I was exposed to a variety of multimedia topics that I hadn’t considered before. The conference also provided a unique opportunity to interact with industry leaders, and I am now considering how to apply my research in real-world settings.

VQEG Column: VQEG Meeting July 2024

By Jesús Gutiérrez | December 16, 2024 - 16:13 |December 16, 2024 0424, Event Report, Feature, Standards

Introduction

The University of Klagenfurt (Austria) hosted a plenary meeting of the Video Quality Experts Group (VQEG) from July 1 to 5, 2024. More than 110 participants from 20 different countries attended this meeting in person or remotely.

The first three days of the meeting were dedicated to presentations and discussions about topics related to the ongoing projects within VQEG, while during the last two days an ITU-T Study Group 12 Question 19 (SG12/Q19) interim meeting took place. All the related information, minutes, and files from the meeting are available online on the VQEG meeting website, and video recordings of the meeting are available on YouTube.

All the topics mentioned below can be of interest for the SIGMM community working on quality assessment, but special attention can be devoted to the workshop on quality assessment towards 6G held within the 5GKPI group, and to the dedicated meeting of the IMG group hosted by the Distributed and Interactive Systems Group (DIS) of the CWI in September 2024 to work on the ITU-T P.IXC recommendation. In addition, during those days there was a co-located ITU-T SG12 Q19 interim meeting.

Readers of these columns interested in the ongoing projects of VQEG are encouraged to subscribe to their corresponding reflectors to follow the activities going on and to get involved in them.

Another plenary meeting of VQEG took place from the 18th to the 22nd of November 2024 and will be reported in a following issue of the ACM SIGMM Records.

VQEG plenary meeting at University of Klagenfurt (Austria), from July 01-05, 2024

Overview of VQEG Projects

Audiovisual HD (AVHD)

The AVHD group works on developing and validating subjective and objective methods to analyze commonly available video systems. During the meeting, there were 8 presentations covering very diverse topics within this project, such as open-source efforts, quality models, and subjective assessment methodologies:

Jonas Birmé (Eyevinn Technology, Sweden) presented their work on lowering the barrier for using open source and contributing to a sustainable business. In this sense, he presented Open Source Cloud, which offers open source as a service, removing the need for users to maintain their own infrastructure in applications such as video encoding and quality assurance.
Hadi Amirpour (University of Klagenfurt, Austria) and Jingwen Zhu (Nantes Université, France) presented their joint work that explored the use of Just Noticeable Differences (JND) to select bitrate-resolution pairs for constructing a bitrate ladder with respect to the Satisfied User Ratio (SUR).
Rafał Mantiuk (University of Cambridge, UK) talked about a family of metrics that directly model low-level human vision by incorporating the models of contrast sensitivity, contrast masking, and colour vision, which can bring many advantages, such as explainability, robustness to unseen distortion, etc. Those metrics include HDR-VDP-3, Foveated Video VDP, and Colour Video VDP, all publicly available as open-source projects.
Dounia Hammou (University of Cambridge, UK) presented a study on the effect of viewing distance and display luminance on the visibility of HDR video streaming distortions, including a new video quality dataset, HDR-VDC, which captures the quality degradation of HDR content due to AV1 coding artifacts and the resolution reduction.
Tomasz Konaszynski (AGH University of Krakow, Poland) talked about the impact of the structure and order of the stimuli presented to the viewers during subjective quality tests.
Dominik Keller (Technische Universität Ilmenau, Germany) presented an open 8K HDR source dataset for video quality research (AVT-VQDB-UHD-2-HDR).
Syed Uddin (AGH University of Krakow, Poland) presented his analysis of how effectively low-latency algorithms in dash.js enhance the user experience.
Avrajyoti Dutta (AGH University of Krakow, Poland) presented a study that investigates the evaluation of subjective video quality utilizing short video clips on a crowd-sourcing platform.

Quality Assessment for Health applications (QAH)

The QAH group is focused on the quality assessment of health applications. It addresses subjective evaluation, generation of datasets, development of objective metrics, and task-based approaches. Joshua Maraval and Meriem Outtas (INSA Rennes, France) presented a dual-rig approach for capturing multi-view video and spatialized audio for medical training applications, including a dataset for quality assessment purposes.

Statistical Analysis Methods (SAM)

The SAM group investigates analysis methods both for the results of subjective experiments and for objective quality models and metrics. The following presentations were delivered during the meeting:

Rafał Mantiuk (University of Cambridge, UK) presented lessons learned, covering the main strengths and caveats, from extensive experience performing pairwise comparison experiments, which includes the publication of datasets, software tools, and methods.
Mohsen Jenadeleh (University of Konstanz, Germany) presented an experiment and an annotated dataset on image quality evaluation with triplet comparisons, in the particular case of multi-dimensional scaling. Also, he presented a study on the effects of immediate feedback on crowdworkers’ performance in subjective image quality assessment tasks using paired comparisons.
Simon H. Del Pin (Norwegian University of Science and Technology, Norway) and Dietmar Saupe (University of Konstanz, Germany) presented their study on national differences in image quality assessment using discrete rating based on 3 large-scale datasets.
Andréas Pastor (Nantes Université, France) proposed a new framework for perceptually-optimized encoding using the “libaom” implementation of the AV1 codec, which aims to improve perceptual quality and compression efficiency.
Hadi Amirpour (University of Klagenfurt, Austria) and Jingwen Zhu (Nantes Université, France) presented their joint work on analyzing the uncertainty of Satisfied User Ratios (SUR) and studying how different video quality metrics perform to estimate SUR.

No Reference Metrics (NORM)

The NORM group is a collaborative effort to develop no-reference metrics for monitoring visual service quality. In this sense, the following topics were covered:

Yixu Chen (Amazon, US) presented their development of a metric tailored for video compression and scaling, which can extrapolate to different dynamic ranges, is suitable for real-time video quality metrics delivery in the bitstream, and can achieve better correlation than VMAF and P.1204.3.
Filip Korus (AGH University of Krakow, Poland) talked about the detection of hard-to-compress video sequences (e.g., video content generated during e-sports events) based on objective quality metrics, and proposed a machine-learning model to assess compression difficulty.
Hadi Amirpour (University of Klagenfurt, Austria) provided a summary of activities in video complexity analysis, ranging from VCA to DeepVCA and describing a Grand Challenge on Video Complexity.
Pierre Lebreton (Capacités & Nantes Université, France) presented a new dataset that allows studying the differences among existing UGC video datasets, in terms of characteristics, covered range of quality, and the implication of these quality ranges on training and validation performance of quality prediction models.
Zhengzhong Tu (Texas A&M University, US) introduced a comprehensive video quality evaluator (COVER) designed to evaluate video quality holistically, from a technical, aesthetic, and semantic perspective. It is based on leveraging three parallel branches: a Swin Transformer backbone to predict technical quality, a ConvNet employed to derive aesthetic quality, and a CLIP image encoder to obtain semantic quality.

Emerging Technologies Group (ETG)

Abhijay Ghildyal (Portland State University, US) talked about the current status, gaps, shortcomings, and opportunities that the new paradigms of AI-generated image and video content bring.
Kjell Brunnström (RISE, Sweden) presented his work on augmented reality head-up displays and digital rear view mirrors in cars, analyzing different factors, such as height of cameras, size of the field of view, etc.
Mohammad Ghasempour (University of Klagenfurt, Austria) presented their approach for energy-aware video streaming, based on the appropriate selection of spatial and temporal resolution of videos.
Mathias Wien (RWTH Aachen University, Germany) provided updates on testing activities on video quality assessment within MPEG, including the creation of the CVQM database (Coded Video for study of Quality Metrics) covering both conventional and neural network-based video coding schemes.
Henrique Souza Rossi (Luleå University of Technology, Sweden) presented his study on subjective QoE assessment for VR cloud-based gaming, focused on a first-person shooter game.

Joint Effort Group (JEG) – Hybrid

The JEG-Hybrid group addresses several areas of Video Quality Assessment (VQA), such as the creation of a large dataset for training such models using full-reference metrics instead of subjective metrics. In addition, the group includes the VQEG project Implementer’s Guide for Video Quality Metrics (IGVQM). The chair of this group, Enrico Masala (Politecnico di Torino, Italy), presented updates on the latest activities, including the status of the IGVQM project and a new image dataset, which will be partially subjectively annotated, to train DNN models to predict a single user’s subjective quality perception. In addition to this:

Lohic Fotio Tiotsop (Politecnico di Torino, Italy) presented various advances on modeling subject scoring behaviors, such as a new approach to estimate the subjective quality from noisy subjective ratings and a novel subject scoring model that allows several peculiarities to be highlighted. He also presented the development of a DNN-based model to predict individual subjective quality of images with multiple distortions, which included the creation of a dataset comprising two million samples with synthetic labels derived from human annotation.
Maria Martini (Kingston University London, UK) followed up from a presentation delivered in a previous VQEG meeting, highlighting the relationship between PSNR and SSIM for DCT-based compressed images and video, including comparisons with other approximations of the relationships between the two.

Immersive Media Group (IMG)

The IMG group researches the quality assessment of immersive media technologies. Currently, the main joint activity of the group is the development of a test plan to evaluate the QoE of immersive interactive communication systems, which is carried out in collaboration with ITU-T through the work item P.IXC. In this meeting, Pablo Pérez (Nokia XR Lab, Spain) and Jesús Gutiérrez (Universidad Politécnica de Madrid, Spain) provided an update on the progress of the test plan, reviewing the status of the subjective tests that were being performed at the 13 involved labs. Also in relation to this test plan:

Jesús Gutiérrez and Miguel Die (Universidad Politécnica de Madrid, Spain) presented preliminary results from the subjective tests carried out to study the impact of display technology on remote communication with the real-time FVV Live system.
Felix Immohr (Technische Universität Ilmenau, Germany) presented the results of a study to assess the effect of spatial audio on audiovisual plausibility and presence perception in a three-user interactive communication scenario.

In relation to other topics addressed by IMG:

Kamran Javidi and Maria Martini (Kingston University London, UK) presented two light field datasets: 1) a display-specific turntable-based dataset for subjective quality assessment (KULF-TT53), and 2) a video dataset of scenes with moving objects captured with a plenoptic video camera.
Stephan Fremerey (Technische Universität Ilmenau, Germany) presented an open-source dataset to evaluate cognitive performance including source audiovisual 360° video and immersive CGI multi-talker content.

In addition, a specific meeting of the group was held at the Distributed and Interactive Systems Group (DIS) of CWI in Amsterdam (Netherlands) from the 2nd to the 4th of September to progress on the joint test plan for evaluating immersive communication systems. A total of 26 international experts from eight countries (Netherlands, Spain, Italy, UK, Sweden, Germany, US, and Poland) participated, with 7 attending online. In particular, the meeting featured presentations on the status of tests run by 13 participating labs, leading to insightful discussions and progress towards the ITU-T P.IXC recommendation.

IMG meeting at CWI (2-4 September, 2024, Netherlands)

Quality Assessment for Computer Vision Applications (QACoViA)

The QACoViA group addresses the study of the visual quality requirements for computer vision methods, where the final user is an algorithm. In this meeting, Mikołaj Leszczuk (AGH University of Krakow, Poland) presented a study introducing a novel evaluation framework designed to accurately predict the impact of different quality factors on recognition algorithms, by focusing on machine vision rather than human perceptual quality metrics.

5G Key Performance Indicators (5GKPI)

The 5GKPI group studies the relationship between key performance indicators of new 5G networks and the QoE of video services on top of them. In this meeting, a workshop was organized by Pablo Pérez (Nokia XR Lab, Spain) and Kjell Brunnström (RISE, Sweden) on “Future directions of 5GKPI: Towards 6G”.

The workshop covered a diverse set of topics: QoS and QoE management in 5G/6G networks by Michelle Zorzi (University of Padova, Italy); parametric QoE models and QoE management by Tobias Hoßfeld (University of Würzburg, Germany) and Pablo Pérez (Nokia XR Lab, Spain); current status of standardization and industry by Kjell Brunnström (RISE, Sweden) and Gunilla Berndtsson (Ericsson); content and applications provider perspectives on QoE management by François Blouin (Meta, US); and communications service provider perspectives by Theo Karagioules and Emir Halepovic (AT&T, US). In addition, a panel was moderated by Narciso García (Universidad Politécnica de Madrid, Spain), with Christian Timmerer (University of Klagenfurt, Austria), Enrico Masala (Politecnico di Torino, Italy) and François Blouin (Meta, US) as speakers.

Human Factors for Visual Experiences (HFVE)

The HFVE group covers human factors related to audiovisual experiences and upholds the liaison relation between VQEG and the IEEE standardization group P3333.1. In this meeting, there were two presentations related to these topics:

Mikołaj Leszczuk and Kamil Koniuch (AGH University of Krakow, Poland) gave a two-part presentation on image quality assessment: 1) an overview of the TUFIQoE project (Towards Better Understanding of Factors Influencing the QoE by More Ecologically-Valid Evaluation Standards) with a focus on challenges related to ecological validity; and 2) the ‘Psychological Image Quality’ experiment, highlighting the influence of emotional content on multimedia quality perception.

Ali Ak (Capacités & Nantes Université, France) presented a video quality dataset with iPhone HDR videos and AV1 encoding (Nantes-MobileHDRVQA) and a study on the potential use of crowdsourcing platforms for acceptability and annoyance experiments.

MPEG Column: 148th MPEG Meeting in Kemer, Türkiye

By Christian Timmerer | December 11, 2024 - 13:29 |December 11, 2024 0424, Event Report, Feature, Standards


The 148th MPEG meeting took place in Kemer, Türkiye, from November 4 to 8, 2024. The official press release can be found here and includes the following highlights:

Point Cloud Coding: AI-based point cloud coding & enhanced G-PCC
MPEG Systems: New Part of MPEG DASH for redundant encoding and packaging, reference software and conformance of ISOBMFF, and a new structural CMAF brand profile
Video Coding: New part of MPEG-AI and 2nd edition of conformance and reference software for MPEG Immersive Video (MIV)
MPEG completes subjective quality testing for film grain synthesis using the Film Grain Characteristics SEI message

148th MPEG Meeting, Kemer, Türkiye, November 4-8, 2024.

Point Cloud Coding

At the 148th MPEG meeting, MPEG Coding of 3D Graphics and Haptics (WG 7) launched a new AI-based Point Cloud Coding standardization project. MPEG WG 7 reviewed six responses to a Call for Proposals (CfP) issued in April 2024 targeting the full range of point cloud formats, from dense point clouds used in immersive applications to sparse point clouds generated by Light Detection and Ranging (LiDAR) sensors in autonomous driving. With bit depths ranging from 10 to 18 bits, the CfP called for solutions that could meet the precision requirements of these varied use cases.

Among the six reviewed proposals, the leading proposal distinguished itself with a hybrid coding strategy that integrates end-to-end learning-based geometry coding and traditional attribute coding. This proposal demonstrated exceptional adaptability, capable of efficiently encoding both dense point clouds for immersive experiences and sparse point clouds from LiDAR sensors. With its unified design, the system supports inter-prediction coding using a shared model with intra-coding, applicable across various bitrate requirements without retraining. Furthermore, the proposal offers flexible configurations for both lossy and lossless geometry coding.

Performance assessments highlighted the leading proposal’s effectiveness, with significant bitrate reductions compared to traditional codecs: a 47% reduction for dense, dynamic sequences in immersive applications and a 35% reduction for sparse dynamic sequences in LiDAR data. For combined geometry and attribute coding, it achieved a 40% bitrate reduction across both dense and sparse dynamic sequences, while subjective evaluations confirmed its superior visual quality over baseline codecs.

The leading proposal has been selected as the initial test model, which can be seen as a baseline implementation for future improvements and developments. Additionally, MPEG issued a working draft and common test conditions.

Research aspects: The initial test model, like the test models of other codecs, is typically available as open source. This enables both academia and industry to contribute to refining various elements of the upcoming AI-based Point Cloud Coding standard. Of particular interest is how training data and processes are incorporated into the standardization project and their impact on the final standard.

Another point cloud-related project is called Enhanced G-PCC, which introduces several advanced features to improve the compression and transmission of 3D point clouds. Notable enhancements include inter-frame coding, refined octree coding techniques, Trisoup surface coding for smoother geometry representation, and dynamic Optimal Binarization with Update On-the-fly (OBUF) modules. These updates provide higher compression efficiency while managing computational complexity and memory usage, making them particularly advantageous for real-time processing and high visual fidelity applications, such as LiDAR data for autonomous driving and dense point clouds for immersive media.

By adding this new part to MPEG-I, MPEG addresses the industry’s growing demand for scalable, versatile 3D compression technology capable of handling both dense and sparse point clouds. Enhanced G-PCC provides a robust framework that meets the diverse needs of both current and emerging applications in 3D graphics and multimedia, solidifying its role as a vital component of modern multimedia systems.

MPEG Systems Updates

At its 148th meeting, MPEG Systems (WG 3) worked on the following aspects, among others:

New Part of MPEG DASH for redundant encoding and packaging
Reference software and conformance of ISOBMFF
A new structural CMAF brand profile

The second edition of ISO/IEC 14496-32 (ISOBMFF) introduces updated reference software and conformance guidelines, and the new CMAF brand profile supports Multi-View High Efficiency Video Coding (MV-HEVC), which is compatible with devices like Apple Vision Pro and Meta Quest 3.

The new part of MPEG DASH, ISO/IEC 23009-9, addresses redundant encoding and packaging for segmented live media (REAP). The standard is designed for scenarios where redundant encoding and packaging are essential, such as 24/7 live media production and distribution in cloud-based workflows. It specifies formats for interchangeable live media ingest and stream announcements, as well as formats for generating interchangeable media presentation descriptions. Additionally, it provides failover support and mechanisms for reintegrating distributed components in the workflow, whether they involve file-based content, live inputs, or a combination of both.
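The idea behind interchangeable segments and failover can be illustrated with a minimal sketch. It is not the format or the mechanism specified in ISO/IEC 23009-9. It only assumes that redundant packagers derive segment indices from a shared clock, so that a segment with a given index from one instance can stand in for the same index from another. All names, the segment duration, and the priority-based selection rule are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional

# Hypothetical fixed segment duration shared by all redundant instances.
SEGMENT_DURATION_S = 2


def segment_index(epoch_seconds: int, duration: int = SEGMENT_DURATION_S) -> int:
    """Map a shared-clock timestamp to a segment index.

    Independent instances using the same rule agree on segment
    boundaries, which is what makes their outputs substitutable
    in this sketch.
    """
    return epoch_seconds // duration


@dataclass(frozen=True)
class Segment:
    index: int
    source: str  # name of the packager instance that produced it


def select_segments(
    outputs: Dict[str, Dict[int, Segment]],
    priority: List[str],
    indices: range,
) -> List[Optional[Segment]]:
    """Build one timeline from redundant outputs.

    For each index, take the segment from the first instance in
    `priority` that has it; None marks a gap no instance covered.
    """
    timeline: List[Optional[Segment]] = []
    for i in indices:
        chosen = None
        for name in priority:
            candidate = outputs.get(name, {}).get(i)
            if candidate is not None:
                chosen = candidate
                break
        timeline.append(chosen)
    return timeline
```

With a primary instance that missed one segment and a backup that produced all of them, select_segments fills the gap from the backup and returns to the primary for the following index, which is the kind of failover and reintegration behaviour the paragraph above describes in general terms.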

Research aspects: With the FDIS of MPEG DASH REAP available, the following topics offer potential for both academic and industry-driven research aligned with the standard’s objectives (in no particular order or priority):

Optimization of redundant encoding and packaging: Investigate methods to minimize resource usage (e.g., computational power, storage, and bandwidth) in redundant encoding and packaging workflows. Explore trade-offs between redundancy levels and quality of service (QoS) in segmented live media scenarios.
Interoperability of live media Ingest formats: Evaluate the interoperability of the standard’s formats with existing live media workflows and tools. Develop techniques for seamless integration with legacy systems and emerging cloud-based media workflows.
Failover mechanisms for cloud-based workflows: Study the reliability and latency of failover mechanisms in distributed live media workflows. Propose enhancements to the reintegration of failed components to maintain uninterrupted service.
Standardized stream announcements and descriptions: Analyze the efficiency and scalability of stream announcement formats in large-scale live streaming scenarios. Research methods for dynamically updating media presentation descriptions during live events.
Hybrid workflow support: Investigate the challenges and opportunities in combining file-based and live input workflows within the standard. Explore strategies for adaptive workflow transitions between live and on-demand content.
Cloud-based workflow scalability: Examine the scalability of the REAP standard in high-demand scenarios, such as global live event streaming. Study the impact of cloud-based distributed workflows on latency and synchronization.
Security and resilience: Research security challenges related to redundant encoding and packaging in cloud environments. Develop techniques to enhance the resilience of workflows against cyberattacks or system failures.
Performance metrics and quality assessment: Define performance metrics for evaluating the effectiveness of REAP in live media workflows. Explore objective and subjective quality assessment methods for media streams delivered using this standard.

The current/updated status of MPEG-DASH is shown in the figure below.

Video Coding Updates

In terms of video coding, two noteworthy updates are described here:

Part 3 of MPEG-AI, ISO/IEC 23888-3 – Optimization of encoders and receiving systems for machine analysis of coded video content, reached Committee Draft Technical Report (CDTR) status.
Second edition of conformance and reference software for MPEG Immersive Video (MIV). This draft includes verified and validated conformance bitstreams and encoding and decoding reference software based on version 22 of the Test model for MPEG immersive video (TMIV). The test model, objective metrics, and some other tools are publicly available at https://gitlab.com/mpeg-i-visual.

Part 3 of MPEG-AI, ISO/IEC 23888-3: This new technical report on “optimization of encoders and receiving systems for machine analysis of coded video content” is based on software experiments conducted by JVET, focusing on optimizing non-normative elements such as preprocessing, encoder settings, and postprocessing. The research explored scenarios where video signals, decoded from bitstreams compliant with the latest video compression standard, ISO/IEC 23090-3 – Versatile Video Coding (VVC), are intended for input into machine vision systems rather than for human viewing. Compared to the JVET VVC reference software encoder, which was originally optimized for human consumption, significant bit rate reductions were achieved when machine vision task precision was used as the performance criterion.

The report will include an annex with example software implementations of these non-normative algorithmic elements, applicable to VVC or other video compression standards. Additionally, it will explore the potential use of existing supplemental enhancement information messages from ISO/IEC 23002-7 – Versatile supplemental enhancement information messages for coded video bitstreams – for embedding metadata useful in these contexts.

Research aspects: (1) Focus on optimizing video encoding for machine vision tasks by refining preprocessing, encoder settings, and postprocessing to improve bit rate efficiency and task precision, compared to traditional approaches for human viewing. (2) Examine the use of metadata, specifically SEI messages from ISO/IEC 23002-7, to enhance machine analysis of compressed video, improving adaptability, performance, and interoperability.

Subjective Quality Testing for Film Grain Synthesis

At the 148th MPEG meeting, the MPEG Joint Video Experts Team (JVET) with ITU-T SG 16 (WG 5 / JVET) and MPEG Visual Quality Assessment (AG 5) conducted a formal expert viewing experiment to assess the impact of film grain synthesis on the subjective quality of video content. This evaluation specifically focused on film grain synthesis controlled by the Film Grain Characteristics (FGC) supplemental enhancement information (SEI) message. The study aimed to demonstrate the capability of film grain synthesis to mask compression artifacts introduced by the underlying video coding schemes.

For the evaluation, FGC SEI messages were adapted to a diverse set of video sequences, including scans of original film material, digital camera noise, and synthetic film grain artificially applied to digitally captured video. The subjective performance of video reconstructed from VVC and HEVC bitstreams was compared with and without film grain synthesis. The results highlighted the effectiveness of film grain synthesis, showing a significant improvement in subjective quality and enabling bitrate savings of up to a factor of 10 for certain test points.
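Bitrate savings are quoted in this column both as percentages and as factors, so a small conversion sketch may help when comparing them. The function names are hypothetical, and the sketch only restates the arithmetic: a saving by a factor f means the new bitrate is 1/f of the original.

```python
def reduction_percent_from_factor(factor: float) -> float:
    """Percentage bitrate reduction for a saving 'by a factor of `factor`'."""
    if factor < 1:
        raise ValueError("a saving factor must be at least 1")
    return 100.0 * (factor - 1.0) / factor


def factor_from_reduction_percent(percent: float) -> float:
    """Saving factor that corresponds to a percentage bitrate reduction."""
    if not 0 <= percent < 100:
        raise ValueError("percent must be in the range [0, 100)")
    return 100.0 / (100.0 - percent)
```

Under this reading, a saving of a factor of 10 corresponds to a 90% bitrate reduction, while a 47% reduction corresponds to a factor of roughly 1.9.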

This study opens several avenues for further research:

Optimization of film grain synthesis techniques: Investigating how different grain synthesis methods affect the perceptual quality of video across a broader range of content and compression levels.
Compression artifact mitigation: Exploring the interaction between film grain synthesis and specific types of compression artifacts, with a focus on improving masking efficiency.
Adaptation of FGC SEI messages: Developing advanced algorithms for tailoring FGC SEI messages to dynamically adapt to diverse video characteristics, including real-time encoding scenarios.
Bitrate savings analysis: Examining the trade-offs between bitrate savings and subjective quality across various coding standards and network conditions.

The 149th MPEG meeting will be held in Geneva, Switzerland from January 20 to 24, 2025. Click here for more information about MPEG meetings and their developments.