Reports from ACM Multimedia 2024

Introduction

The ACM Multimedia Conference 2024, held in Melbourne, Australia from October 28 to November 1, 2024, was a major event that brought together leading researchers, practitioners, and industry professionals in the field of multimedia. This year’s conference marked a significant milestone as it was the first time since the end of the COVID-19 pandemic that the event returned to the Asia-Pacific region and resumed as a fully in-person gathering. The event offered a dynamic platform for presenting cutting-edge research, exploring new trends, and fostering collaborations across academia and industry.

Held in Melbourne, a city known for its vibrant culture and technological advancements, the conference was well-organized, ensuring a seamless experience for all participants. As part of its ongoing commitment to supporting the next generation of multimedia researchers, SIGMM awarded Student Travel Grants to 24 students. Each recipient received up to 1,000 USD to cover their travel and accommodation expenses. These grants were intended to help students who showed academic promise but faced financial barriers, allowing them to fully engage with the conference and its events. To apply, students were required to submit an online form, and the selection committee chose the recipients based on academic excellence and demonstrated financial need.

To give a voice to the travel grant recipients, we interviewed several of them to hear about their experiences and the impact the conference had on their academic and professional development. Below are some of their reflections.

Zhedong Zhang – Hangzhou Dianzi University

ACM Multimedia 2024 in Melbourne was my first international academic conference, and I am incredibly grateful to SIGMM for providing the travel grant. It was a great honor to present my paper, “From Speaker to Dubber: Movie Dubbing with Prosody and Duration Consistency Learning”, and to receive the Best Paper Award. As a PhD student, this recognition means a lot to me and encourages me to keep pushing forward with my research. 

Beyond the academic presentations, I had the chance to meet many brilliant researchers and fellow PhD students. I made connections with scholars working on similar topics and exchanged ideas that will help improve my work. The networking events and social gatherings were also highlights, as they allowed me to build friendships with colleagues from different parts of the world. I am truly grateful to SIGMM for making this experience possible and for the chance to be part of such a vibrant and inspiring academic community. I look forward to continuing my research and contributing to this exciting field.

Wu Tao – Zhejiang University

I’m incredibly grateful to the SIGMM team for awarding me this student travel grant – it really helped me a lot. I got to learn about so many fascinating papers at the conference and meet some brilliant professors and students. I even see some potential for future collaborations. I also had the chance to meet some big names in the field, like Tat-Seng Chua, who I’ve admired for a while. Meeting him, chatting, and even taking a photo with him felt like a once-in-a-lifetime opportunity, and I’m so thankful for it.

As for my own paper, I was both surprised and thrilled to see it actually got quite a bit of attention. At the welcome reception on the second day – before the poster session even began and before I’d even put up my poster – I noticed a few students already looking it up on their laptops. During the poster session, which was supposed to be two hours but probably stretched to three, I had a steady stream of people coming by to check out my work and ask questions. Some people even approached me earlier that morning. It was incredibly motivating to feel that kind of recognition and interest in what I’m working on. Thank you once again for this generous support! I look forward to attending the conference again.

Jianjun Qiao – Southwest Jiaotong University

Attending ACM Multimedia 2024 in Melbourne was an incredible opportunity that greatly enriched my academic journey. This was my first time participating in an in-person conference, and I’m so grateful for the experience. The keynotes were fascinating, especially the talk on the Multimodal LLMs, which has significantly influenced my current research. I also enjoyed the poster sessions, where I could present my own work and engage in meaningful discussions with researchers from diverse backgrounds. The networking opportunities were invaluable, and I made several connections that I believe will lead to fruitful collaborations. I would like to extend my sincere thanks to SIGMM for the travel grant, which made my attendance possible. It was truly an unforgettable experience.

Changli Wu – Xiamen University

ACM Multimedia 2024 was an unforgettable experience that exceeded all my expectations. As a PhD student, this was my first time presenting my research on 3D Referring Expression Segmentation at such a prestigious conference. The discussions I had with other attendees were invaluable, and I received constructive feedback that will undoubtedly improve my work. The diversity of the sessions was a highlight for me, as I was exposed to a variety of multimedia topics that I hadn’t considered before. The conference also provided a unique opportunity to interact with industry leaders, and I am now considering how to apply my research in real-world settings. 

VQEG Column: VQEG Meeting July 2024

Introduction

From July 1 to 5, 2024, the University of Klagenfurt (Austria) hosted a plenary meeting of the Video Quality Experts Group (VQEG). More than 110 participants from 20 different countries attended the meeting, either in person or remotely.

The first three days of the meeting were dedicated to presentations and discussions about topics related to the ongoing projects within VQEG, while the last two days hosted an ITU-T Study Group 12 Question 19 (SG12/Q19) interim meeting. All the related information, minutes, and files from the meeting are available online on the VQEG meeting website, and video recordings of the meeting are available on YouTube.

All the topics mentioned below can be of interest to the SIGMM community working on quality assessment. Special attention can be devoted to the workshop on quality assessment towards 6G held within the 5GKPI group, and to the dedicated meeting of the IMG group hosted by the Distributed and Interactive Systems (DIS) group of CWI in September 2024 to work on the ITU-T P.IXC recommendation.

Readers of these columns interested in the ongoing projects of VQEG are encouraged to subscribe to their corresponding reflectors to follow the activities going on and to get involved in them.

Another plenary meeting of VQEG took place from the 18th to the 22nd of November 2024 and will be reported in a subsequent issue of the ACM SIGMM Records.

VQEG plenary meeting at University of Klagenfurt (Austria), from July 01-05, 2024

Overview of VQEG Projects

Audiovisual HD (AVHD)

The AVHD group works on developing and validating subjective and objective methods to analyze commonly available video systems. During the meeting, there were 8 presentations covering very diverse topics within this project, such as open-source efforts, quality models, and subjective assessment methodologies.

Quality Assessment for Health applications (QAH)

The QAH group is focused on the quality assessment of health applications. It addresses subjective evaluation, generation of datasets, development of objective metrics, and task-based approaches. Joshua Maraval and Meriem Outtas (INSA Rennes, France) presented a dual-rig approach for capturing multi-view video and spatialized audio for medical training applications, including a dataset for quality assessment purposes.

Statistical Analysis Methods (SAM)

The SAM group investigates analysis methods both for the results of subjective experiments and for objective quality models and metrics. Several presentations on these topics were delivered during the meeting.

No Reference Metrics (NORM)

The NORM group is a collaborative effort to develop no-reference metrics for monitoring visual service quality. In this context, the following topics were covered:

  • Yixu Chen (Amazon, US) presented their development of a metric tailored for video compression and scaling, which can extrapolate to different dynamic ranges, is suitable for real-time video quality metrics delivery in the bitstream, and can achieve better correlation than VMAF and P.1204.3.
  • Filip Korus (AGH University of Krakow, Poland) talked about the detection of hard-to-compress video sequences (e.g., video content generated during e-sports events) based on objective quality metrics, and proposed a machine-learning model to assess compression difficulty.
  • Hadi Amirpour (University of Klagenfurt, Austria) provided a summary of activities in video complexity analysis, covering from VCA to DeepVCA and describing a Grand Challenge on Video Complexity.
  • Pierre Lebreton (Capacités & Nantes Université, France) presented a new dataset that allows studying the differences among existing UGC video datasets, in terms of characteristics, covered range of quality, and the implication of these quality ranges on training and validation performance of quality prediction models.
  • Zhengzhong Tu (Texas A&M University, US) introduced a comprehensive video quality evaluator (COVER) designed to evaluate video quality holistically, from a technical, aesthetic, and semantic perspective. It is based on leveraging three parallel branches: a Swin Transformer backbone to predict technical quality, a ConvNet employed to derive aesthetic quality, and a CLIP image encoder to obtain semantic quality.

Emerging Technologies Group (ETG)

The ETG group focuses on various aspects of multimedia that, although they are not necessarily directly related to “video quality”, can indirectly impact the work carried out within VQEG and are not addressed by any of the existing VQEG groups. In particular, this group aims to provide a common platform for people to gather together and discuss new emerging topics, possible collaborations in the form of joint survey papers, funding proposals, etc. During this meeting, the following presentations were delivered:

Joint Effort Group (JEG) – Hybrid

The JEG-Hybrid group addresses several areas of Video Quality Assessment (VQA), such as the creation of a large dataset for training objective quality models using full-reference metrics instead of subjective scores. In addition, the group hosts the VQEG project Implementer’s Guide for Video Quality Metrics (IGVQM). The chair of this group, Enrico Masala (Politecnico di Torino, Italy), presented updates on the latest activities, including the status of the IGVQM project and a new image dataset, to be partially annotated subjectively, for training DNN models to predict individual users’ subjective quality perception.

Immersive Media Group (IMG)

The IMG group researches the quality assessment of immersive media technologies. Currently, the main joint activity of the group is the development of a test plan to evaluate the QoE of immersive interactive communication systems, which is carried out in collaboration with ITU-T through the work item P.IXC. In this meeting, Pablo Pérez (Nokia XR Lab, Spain) and Jesús Gutiérrez (Universidad Politécnica de Madrid, Spain) provided an update on the progress of the test plan, reviewing the status of the subjective tests being performed at the 13 involved labs.


In addition, a specific meeting of the group was held at the Distributed and Interactive Systems Group (DIS) of CWI in Amsterdam (Netherlands) from the 2nd to the 4th of September to progress on the joint test plan for evaluating immersive communication systems. A total of 26 international experts from eight countries (Netherlands, Spain, Italy, UK, Sweden, Germany, US, and Poland) participated, with 7 attending online. In particular, the meeting featured presentations on the status of the tests run by the 13 participating labs, leading to insightful discussions and progress towards the ITU-T P.IXC recommendation.

IMG meeting at CWI (2-4 September, 2024, Netherlands)

Quality Assessment for Computer Vision Applications (QACoViA)

The QACoViA group addresses the study of visual quality requirements for computer vision methods, where the final user is an algorithm. In this meeting, Mikołaj Leszczuk (AGH University of Krakow, Poland) presented a study introducing a novel evaluation framework designed to accurately predict the impact of different quality factors on recognition algorithms, focusing on machine vision rather than human perceptual quality metrics.

5G Key Performance Indicators (5GKPI)

The 5GKPI group studies the relationship between key performance indicators of new 5G networks and the QoE of video services running on top of them. In this meeting, a workshop on “Future directions of 5GKPI: Towards 6G” was organized by Pablo Pérez (Nokia XR Lab, Spain) and Kjell Brunnström (RISE, Sweden).

The workshop covered a diverse set of topics: QoS and QoE management in 5G/6G networks by Michele Zorzi (University of Padova, Italy); parametric QoE models and QoE management by Tobias Hoßfeld (University of Würzburg, Germany) and Pablo Pérez (Nokia XR Lab, Spain); the current status of standardization and industry by Kjell Brunnström (RISE, Sweden) and Gunilla Berndtsson (Ericsson); content and application provider perspectives on QoE management by François Blouin (Meta, US); and communications service provider perspectives by Theo Karagioules and Emir Halepovic (AT&T, US). In addition, a panel was held, moderated by Narciso García (Universidad Politécnica de Madrid, Spain), with Christian Timmerer (University of Klagenfurt, Austria), Enrico Masala (Politecnico di Torino, Italy) and François Blouin (Meta, US) as speakers.

Human Factors for Visual Experiences (HFVE)

The HFVE group covers human factors related to audiovisual experiences and maintains the liaison relationship between VQEG and the IEEE standardization group P3333.1. In this meeting, presentations related to these topics were delivered, including:

  • Mikołaj Leszczuk and Kamil Koniuch (AGH University of Krakow, Poland) presented a two-part insight into the realm of image quality assessment: 1) it provided an overview of the TUFIQoE project (Towards Better Understanding of Factors Influencing the QoE by More Ecologically-Valid Evaluation Standards) with a focus on challenges related to ecological validity; and 2) it delved into the ‘Psychological Image Quality’ experiment, highlighting the influence of emotional content on multimedia quality perception.

MPEG Column: 148th MPEG Meeting in Kemer, Türkiye

The 148th MPEG meeting took place in Kemer, Türkiye, from November 4 to 8, 2024. The official press release can be found here and includes the following highlights:

  • Point Cloud Coding: AI-based point cloud coding & enhanced G-PCC
  • MPEG Systems: New Part of MPEG DASH for redundant encoding and packaging, reference software and conformance of ISOBMFF, and a new structural CMAF brand profile
  • Video Coding: New part of MPEG-AI and 2nd edition of conformance and reference software for MPEG Immersive Video (MIV)
  • MPEG completes subjective quality testing for film grain synthesis using the Film Grain Characteristics SEI message
148th MPEG Meeting, Kemer, Türkiye, November 4-8, 2024.

Point Cloud Coding

At the 148th MPEG meeting, MPEG Coding of 3D Graphics and Haptics (WG 7) launched a new AI-based Point Cloud Coding standardization project. MPEG WG 7 reviewed six responses to a Call for Proposals (CfP) issued in April 2024 targeting the full range of point cloud formats, from dense point clouds used in immersive applications to sparse point clouds generated by Light Detection and Ranging (LiDAR) sensors in autonomous driving. With bit depths ranging from 10 to 18 bits, the CfP called for solutions that could meet the precision requirements of these varied use cases.

Among the six reviewed proposals, the leading proposal distinguished itself with a hybrid coding strategy that integrates end-to-end learning-based geometry coding and traditional attribute coding. This proposal demonstrated exceptional adaptability, capable of efficiently encoding both dense point clouds for immersive experiences and sparse point clouds from LiDAR sensors. With its unified design, the system supports inter-prediction coding using a shared model with intra-coding, applicable across various bitrate requirements without retraining. Furthermore, the proposal offers flexible configurations for both lossy and lossless geometry coding.

Performance assessments highlighted the leading proposal’s effectiveness, with significant bitrate reductions compared to traditional codecs: a 47% reduction for dense, dynamic sequences in immersive applications and a 35% reduction for sparse dynamic sequences in LiDAR data. For combined geometry and attribute coding, it achieved a 40% bitrate reduction across both dense and sparse dynamic sequences, while subjective evaluations confirmed its superior visual quality over baseline codecs.

The leading proposal has been selected as the initial test model, which can be seen as a baseline implementation for future improvements and developments. Additionally, MPEG issued a working draft and common test conditions.

Research aspects: The initial test model, like the test models of other codecs, is typically available as open source. This enables both academia and industry to contribute to refining various elements of the upcoming AI-based Point Cloud Coding standard. Of particular interest is how training data and processes are incorporated into the standardization project and their impact on the final standard.

Another point cloud-related project is called Enhanced G-PCC, which introduces several advanced features to improve the compression and transmission of 3D point clouds. Notable enhancements include inter-frame coding, refined octree coding techniques, Trisoup surface coding for smoother geometry representation, and dynamic Optimal Binarization with Update On-the-fly (OBUF) modules. These updates provide higher compression efficiency while managing computational complexity and memory usage, making them particularly advantageous for real-time processing and high visual fidelity applications, such as LiDAR data for autonomous driving and dense point clouds for immersive media.
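As a rough illustration of the octree representation that G-PCC-style geometry coding builds on, the following Python sketch recursively splits a point cloud into octants and emits one occupancy byte per internal node. This is a toy example of the general data structure only, not the G-PCC or Enhanced G-PCC algorithm, and all function and variable names are our own.

```python
import numpy as np

def octree_occupancy(points, origin, size, depth):
    """Recursively partition `points` (N x 3 array) inside the cube
    [origin, origin + size)^3 and emit one 8-bit occupancy code per
    internal node (one bit per non-empty child octant). Toy illustration only."""
    codes = []
    if depth == 0 or len(points) == 0:
        return codes
    half = size / 2.0
    centre = origin + half
    # Octant index of each point (3 bits: x, y, z half-space tests).
    octant = ((points >= centre).astype(int) * np.array([1, 2, 4])).sum(axis=1)
    occupancy = 0
    children = []
    for i in range(8):
        child_pts = points[octant == i]
        if len(child_pts):
            occupancy |= 1 << i
            child_origin = origin + half * np.array([i & 1, (i >> 1) & 1, (i >> 2) & 1])
            children.append((child_pts, child_origin))
    codes.append(occupancy)
    for child_pts, child_origin in children:
        codes.extend(octree_occupancy(child_pts, child_origin, half, depth - 1))
    return codes

# Example: 1000 random points in a unit cube, 6 octree levels.
rng = np.random.default_rng(0)
pts = rng.random((1000, 3))
codes = octree_occupancy(pts, origin=np.zeros(3), size=1.0, depth=6)
print(f"{len(codes)} occupancy bytes for {len(pts)} points")
```

A real codec would entropy-code these occupancy bytes with context modelling (this is where features such as OBUF come in), which is omitted here.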

By adding this new part to MPEG-I, MPEG addresses the industry’s growing demand for scalable, versatile 3D compression technology capable of handling both dense and sparse point clouds. Enhanced G-PCC provides a robust framework that meets the diverse needs of both current and emerging applications in 3D graphics and multimedia, solidifying its role as a vital component of modern multimedia systems.

MPEG Systems Updates

At its 148th meeting, MPEG Systems (WG 3) worked on the following aspects, among others:

  • New Part of MPEG DASH for redundant encoding and packaging
  • Reference software and conformance of ISOBMFF
  • A new structural CMAF brand profile

The second edition of ISO/IEC 14496-32 (ISOBMFF) introduces updated reference software and conformance guidelines, and the new CMAF brand profile supports Multi-View High Efficiency Video Coding (MV-HEVC), which is compatible with devices like Apple Vision Pro and Meta Quest 3.

The new part of MPEG DASH, ISO/IEC 23009-9, addresses redundant encoding and packaging for segmented live media (REAP). The standard is designed for scenarios where redundant encoding and packaging are essential, such as 24/7 live media production and distribution in cloud-based workflows. It specifies formats for interchangeable live media ingest and stream announcements, as well as formats for generating interchangeable media presentation descriptions. Additionally, it provides failover support and mechanisms for reintegrating distributed components in the workflow, whether they involve file-based content, live inputs, or a combination of both.
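As a purely illustrative sketch of the redundancy idea behind REAP — two independent encoding/packaging chains producing interchangeable segments on a shared timeline, with a downstream component deduplicating them and surviving the failure of either chain — consider the following Python toy model. The class and field names are invented for illustration and do not reflect the normative REAP formats.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Segment:
    period: str          # live period identifier
    number: int          # segment number on the shared timeline
    source: str          # which redundant chain produced it
    payload: bytes

class Origin:
    """Toy origin server: accepts segments from redundant packagers and keeps
    the first copy received for each (period, number) slot, so one chain can
    drop out without interrupting the presentation."""
    def __init__(self):
        self.timeline = {}

    def ingest(self, seg: Segment):
        key = (seg.period, seg.number)
        # Interchangeable segments: duplicates from the other chain are ignored.
        self.timeline.setdefault(key, seg)

    def playlist(self, period: str):
        numbers = sorted(n for (p, n) in self.timeline if p == period)
        return [self.timeline[(period, n)] for n in numbers]

origin = Origin()
# Chain A and chain B both push segment 1; chain A then fails before segment 2.
origin.ingest(Segment("p1", 1, "chain-A", b"..."))
origin.ingest(Segment("p1", 1, "chain-B", b"..."))   # deduplicated
origin.ingest(Segment("p1", 2, "chain-B", b"..."))   # failover: only B delivers
print([(s.number, s.source) for s in origin.playlist("p1")])
# -> [(1, 'chain-A'), (2, 'chain-B')]
```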

Research aspects: With the FDIS of MPEG DASH REAP available, the following topics offer potential for both academic and industry-driven research aligned with the standard’s objectives (in no particular order or priority):

  • Optimization of redundant encoding and packaging: Investigate methods to minimize resource usage (e.g., computational power, storage, and bandwidth) in redundant encoding and packaging workflows. Explore trade-offs between redundancy levels and quality of service (QoS) in segmented live media scenarios.
  • Interoperability of live media Ingest formats: Evaluate the interoperability of the standard’s formats with existing live media workflows and tools. Develop techniques for seamless integration with legacy systems and emerging cloud-based media workflows.
  • Failover mechanisms for cloud-based workflows: Study the reliability and latency of failover mechanisms in distributed live media workflows. Propose enhancements to the reintegration of failed components to maintain uninterrupted service.
  • Standardized stream announcements and descriptions: Analyze the efficiency and scalability of stream announcement formats in large-scale live streaming scenarios. Research methods for dynamically updating media presentation descriptions during live events.
  • Hybrid workflow support: Investigate the challenges and opportunities in combining file-based and live input workflows within the standard. Explore strategies for adaptive workflow transitions between live and on-demand content.
  • Cloud-based workflow scalability: Examine the scalability of the REAP standard in high-demand scenarios, such as global live event streaming. Study the impact of cloud-based distributed workflows on latency and synchronization.
  • Security and resilience: Research security challenges related to redundant encoding and packaging in cloud environments. Develop techniques to enhance the resilience of workflows against cyberattacks or system failures.
  • Performance metrics and quality assessment: Define performance metrics for evaluating the effectiveness of REAP in live media workflows. Explore objective and subjective quality assessment methods for media streams delivered using this standard.

The current/updated status of MPEG-DASH is shown in the figure below.

MPEG-DASH status, November 2024.

Video Coding Updates

In terms of video coding, two noteworthy updates are described here:

  • Part 3 of MPEG-AI, ISO/IEC 23888-3 – Optimization of encoders and receiving systems for machine analysis of coded video content, reached Committee Draft Technical Report (CDTR) status
  • Second edition of conformance and reference software for MPEG Immersive Video (MIV). This draft includes verified and validated conformance bitstreams and encoding and decoding reference software based on version 22 of the Test model for MPEG immersive video (TMIV). The test model, objective metrics, and some other tools are publicly available at https://gitlab.com/mpeg-i-visual.

Part 3 of MPEG-AI, ISO/IEC 23888-3: This new technical report on “optimization of encoders and receiving systems for machine analysis of coded video content” is based on software experiments conducted by JVET, focusing on optimizing non-normative elements such as preprocessing, encoder settings, and postprocessing. The research explored scenarios where video signals, decoded from bitstreams compliant with the latest video compression standard, ISO/IEC 23090-3 – Versatile Video Coding (VVC), are intended for input into machine vision systems rather than for human viewing. Compared to the JVET VVC reference software encoder, which was originally optimized for human consumption, significant bit rate reductions were achieved when machine vision task precision was used as the performance criterion.
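To make the shift of optimization target concrete, the sketch below selects an encoder operating point by machine-task accuracy per bitrate instead of a perceptual criterion. The codec and detector here are simulated stand-ins, not the JVET reference software or a real vision model; it is a minimal sketch of the idea only.

```python
# Minimal, self-contained sketch: choose the encoding operating point by
# machine-task accuracy per bitrate rather than by a perceptual metric.
# The "codec" and "detector" below are simulated stand-ins.

def encode_decode(qp):
    """Simulated codec: higher QP -> lower bitrate (kbps)."""
    return 8000 * 0.93 ** qp

def detector_accuracy(qp):
    """Simulated machine-vision accuracy: degrades slowly, then sharply, with QP."""
    return max(0.0, 0.92 - 0.002 * qp - 0.0008 * max(0, qp - 35) ** 2)

def cheapest_qp_for_task(qps, min_accuracy):
    """Keep QPs whose decoded output still reaches the required task accuracy,
    then return the one with the lowest (simulated) bitrate."""
    feasible = [(encode_decode(qp), qp) for qp in qps if detector_accuracy(qp) >= min_accuracy]
    return min(feasible) if feasible else None

bitrate, qp = cheapest_qp_for_task(range(20, 51), min_accuracy=0.85)
print(f"QP {qp} at ~{bitrate:.0f} kbps meets the task-accuracy target")
```

With a machine-task score as the constraint, much coarser quantization is typically acceptable than when targeting human viewing, which is where the reported bit rate reductions come from.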

The report will include an annex with example software implementations of these non-normative algorithmic elements, applicable to VVC or other video compression standards. Additionally, it will explore the potential use of existing supplemental enhancement information messages from ISO/IEC 23002-7 – Versatile supplemental enhancement information messages for coded video bitstreams – for embedding metadata useful in these contexts.

Research aspects: (1) Focus on optimizing video encoding for machine vision tasks by refining preprocessing, encoder settings, and postprocessing to improve bit rate efficiency and task precision, compared to traditional approaches for human viewing. (2) Examine the use of metadata, specifically SEI messages from ISO/IEC 23002-7, to enhance machine analysis of compressed video, improving adaptability, performance, and interoperability.

Subjective Quality Testing for Film Grain Synthesis

At the 148th MPEG meeting, the Joint Video Experts Team of MPEG and ITU-T SG 16 (WG 5 / JVET) and MPEG Visual Quality Assessment (AG 5) conducted a formal expert viewing experiment to assess the impact of film grain synthesis on the subjective quality of video content. This evaluation specifically focused on film grain synthesis controlled by the Film Grain Characteristics (FGC) supplemental enhancement information (SEI) message. The study aimed to demonstrate the capability of film grain synthesis to mask compression artifacts introduced by the underlying video coding schemes.

For the evaluation, FGC SEI messages were adapted to a diverse set of video sequences, including scans of original film material, digital camera noise, and synthetic film grain artificially applied to digitally captured video. The subjective performance of video reconstructed from VVC and HEVC bitstreams was compared with and without film grain synthesis. The results highlighted the effectiveness of film grain synthesis, showing a significant improvement in subjective quality and enabling bitrate savings of up to a factor of 10 for certain test points.
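The savings come from transmitting a compact parametric description of the grain and re-synthesizing it at the decoder, instead of spending bits on coding the grain pixel by pixel. The toy Python sketch below illustrates that general idea only; it is not the normative FGC SEI grain model, and its parameters are invented for illustration.

```python
import numpy as np

def synthesize_grain(decoded, strength=8.0, intensity_gain=0.5, seed=7):
    """Toy film grain synthesis: add zero-mean pseudo-random grain whose
    amplitude depends on local luma, generated from a seed and a couple of
    parameters. Illustrative only; not the FGC SEI grain model."""
    rng = np.random.default_rng(seed)
    h, w = decoded.shape
    grain = rng.standard_normal((h, w))
    # Scale grain by an intensity-dependent factor (brighter areas get more grain here).
    scale = strength * (1.0 + intensity_gain * decoded / 255.0)
    out = decoded + scale * grain
    return np.clip(out, 0, 255).astype(np.uint8)

# A smooth synthetic "decoded" luma frame (the grain was removed before encoding).
decoded = np.tile(np.linspace(40, 200, 256), (256, 1))
grainy = synthesize_grain(decoded)
print(grainy.shape, grainy.dtype)
```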

This study opens several avenues for further research:

  • Optimization of film grain synthesis techniques: Investigating how different grain synthesis methods affect the perceptual quality of video across a broader range of content and compression levels.
  • Compression artifact mitigation: Exploring the interaction between film grain synthesis and specific types of compression artifacts, with a focus on improving masking efficiency.
  • Adaptation of FGC SEI messages: Developing advanced algorithms for tailoring FGC SEI messages to dynamically adapt to diverse video characteristics, including real-time encoding scenarios.
  • Bitrate savings analysis: Examining the trade-offs between bitrate savings and subjective quality across various coding standards and network conditions.

The 149th MPEG meeting will be held in Geneva, Switzerland from January 20-24, 2025. Click here for more information about MPEG meetings and their developments.

JPEG Column: 104th JPEG Meeting in Sapporo, Japan

JPEG XE issues Call for Proposals on event-based vision representation

The 104th JPEG meeting was held in Sapporo, Japan from July 15 to 19, 2024. During this JPEG meeting, a Call for Proposals on event-based vision representation was launched for the creation of the first standardised representation of this type of data. This CfP addresses lossless coding, and aims to provide the first standard representation for event-based data that ensures interoperability between systems and devices.

Furthermore, the JPEG Committee pursued its work in various standardisation activities, particularly the development of new learning-based coding technologies and JPEG Trust.

The following summarises the main highlights of the 104th JPEG meeting.

Event based vision reconstruction (from IEEE Spectrum, Feb. 2020).
  • JPEG XE
  • JPEG Trust
  • JPEG AI
  • JPEG Pleno Learning-based Point Cloud coding
  • JPEG Pleno Light Field
  • JPEG AIC
  • JPEG Systems
  • JPEG DNA
  • JPEG XS
  • JPEG XL

JPEG XE

The JPEG Committee continued its activity on JPEG XE and event-based vision. This activity revolves around a new and emerging image modality created by event-based visual sensors. JPEG XE is about the creation and development of a standard to represent events in an efficient way allowing interoperability between sensing, storage, and processing, targeting machine vision and other relevant applications. The JPEG Committee completed the Common Test Conditions (CTC) v2.0 document that provides the means to perform an evaluation of candidate technologies for efficient coding of events. The Common Test Conditions document also defines a canonical raw event format, a reference dataset, a set of key performance metrics and an evaluation methodology.

The JPEG Committee furthermore issued a Final Call for Proposals (CfP) on lossless coding for event-based data. This call marks an important milestone in the standardization process and the JPEG Committee is eager to receive proposals. The deadline for submission of proposals is set to March 31st of 2025. Standardization will start with lossless coding of events as this has the most imminent application urgency in industry. However, the JPEG Committee acknowledges that lossy coding of events is also a valuable feature, which will be addressed at a later stage.
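To give a feel for the kind of redundancy a lossless event coder can exploit, the toy Python sketch below compares a fixed-size packing of (timestamp, x, y, polarity) events with a naive delta-coded repacking of the timestamps. The event layout here is invented for illustration and is not the JPEG XE canonical raw event format.

```python
import struct

# Toy event stream: (timestamp_us, x, y, polarity). This layout is invented
# for illustration; it is not the JPEG XE canonical raw event format.
events = [(1000, 12, 40, 1), (1004, 13, 40, 1), (1007, 13, 41, 0), (1021, 14, 41, 1)]

def pack_raw(events):
    """Fixed-size packing: 8-byte timestamp + 2+2 bytes coordinates + 1 byte polarity."""
    return b"".join(struct.pack("<QHHB", t, x, y, p) for (t, x, y, p) in events)

def pack_delta(events):
    """Naive lossless repacking that exploits temporal redundancy: store the
    first timestamp once, then per-event timestamp deltas (assumed < 256 us)."""
    first_t = events[0][0]
    out = [struct.pack("<Q", first_t)]
    prev = first_t
    for t, x, y, p in events:
        out.append(struct.pack("<BHHB", t - prev, x, y, p))
        prev = t
    return b"".join(out)

raw, delta = pack_raw(events), pack_delta(events)
assert len(delta) < len(raw)
print(f"raw: {len(raw)} bytes, delta-packed: {len(delta)} bytes")
```

A real proposal would of course go much further (entropy coding, spatial prediction, event reordering), but the dense temporal and spatial correlation shown here is what such tools exploit.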

Accompanying these two new public documents, a revised Use Cases and Requirements v2.0 document was also released to provide a formal definition for lossless coding of events that is used in the CTC and the CfP.

All documents are publicly available on jpeg.org. The Ad-hoc Group on event-based vision was re-established to continue work towards the 105th JPEG meeting. To stay informed about this activity please join the event-based vision Ad-hoc Group mailing list.

JPEG Trust

JPEG Trust provides a comprehensive framework for individuals, organizations, and governing institutions interested in establishing an environment of trust for the media that they use, and supports trust in the media they share. At the 104th meeting, the JPEG Committee produced an updated version of the Use Cases and Requirements for JPEG Trust (v3.0). This document integrates additional use cases and requirements related to authorship, ownership, and rights declaration. The JPEG Committee also requested a new Part to JPEG Trust, entitled “Media asset watermarking”. This new Part will define the use of watermarking as one of the available components of the JPEG Trust framework to support usage scenarios for content authenticity, provenance, integrity, labeling, and binding between JPEG Trust metadata and corresponding media assets. This work will focus on various types of watermarking, including explicit or visible watermarking, invisible watermarking, and implicit watermarking of the media assets with relevant metadata.

JPEG AI

At the 104th meeting, the JPEG Committee reviewed recent integration efforts, following the adoption of the changes in the past meeting and the creation of a new version of the JPEG AI verification model. This version reflects the JPEG AI DIS text and was thoroughly evaluated for performance and functionalities, including bitrate matching, 4:2:0 coding, region adaptive quantization maps, and other key features. JPEG AI supports a multi-branch coding architecture with two encoders and three decoders, allowing for six compatible combinations that have been jointly trained. The compression efficiency improvements range from 12% to 27% over the VVC Intra coding anchor, with decoding complexities between 8 and 215 kMAC/px.
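To put these per-pixel decoder complexities into perspective, here is a back-of-the-envelope calculation (ours, not part of the standard) of the total operations implied for a full-HD image:

```python
# Rough arithmetic: total multiply-accumulate operations implied by the
# reported per-pixel decoder complexities for a 1920x1080 image.
pixels = 1920 * 1080
for kmac_per_px in (8, 215):
    total_gmac = pixels * kmac_per_px * 1e3 / 1e9
    print(f"{kmac_per_px} kMAC/px -> ~{total_gmac:.0f} GMAC per 1080p image")
# -> roughly 17 GMAC at the low end and about 446 GMAC at the high end
```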

The meeting also focused on Part 2: Profiles and Levels, which is moving to Committee Draft consultation. Two main concepts have been established: 1) the stream profile, defining a specific subset of the code stream syntax along with permissible parameter values, and 2) the decoder profile, specifying a subset of the full JPEG AI decoder toolset required to obtain the decoded image. Additionally, Part 3: Reference Software and Part 5: File Format will also proceed to Committee Draft consultation. Part 4 is significant as it sets the conformance points for JPEG AI compliance, and some preliminary experiments have been conducted in this area.

JPEG Pleno Learning-based Point Cloud coding

Learning-based solutions are the state of the art for several computer vision tasks, such as those requiring high-level understanding of image semantics, e.g., image classification, face recognition and object segmentation, but also 3D processing tasks, e.g., visual enhancement and super-resolution. Learning-based point cloud coding solutions have demonstrated the ability to achieve competitive compression efficiency compared to available conventional point cloud coding solutions at equivalent subjective quality. At the 104th meeting, the JPEG Committee initiated balloting for the Draft International Standard (DIS) of ISO/IEC 21794 Information technology — Plenoptic image coding system (JPEG Pleno) — Part 6: Learning-based point cloud coding. This activity is on track for the publication of an International Standard in January 2025. The 104th meeting also began an exploration into advanced point cloud coding functionality, in particular the potential for progressive decoding of point clouds.

JPEG Pleno Light Field

The JPEG Pleno Light Field effort has an ongoing standardization activity concerning a novel light field coding architecture that delivers a single coding mode to efficiently code light fields spanning from narrow to wide baselines. This novel coding mode is depth information agnostic resulting in significant improvement in compression efficiency. The first version of the Working Draft of the JPEG Pleno Part 2: Light Field Coding second edition (ISO/IEC 21794-2 2ED), including this novel coding mode, was issued during the 104th JPEG meeting in Sapporo, Japan.

The JPEG Pleno Model (JPLM) provides reference implementations for the standardized technologies within the JPEG Pleno framework, including JPEG Pleno Part 2 (ISO/IEC 21794-2). Improvements to the JPLM have been implemented and tested, including the design of a more user-friendly platform.

The JPEG Pleno Light Field effort is also preparing standardization activities in the domains of objective and subjective quality assessment for light fields, aiming to address other plenoptic modalities in the future. During the 104th JPEG meeting in Sapporo, Japan, the collaborative subjective experiments exploring various aspects of subjective light field quality assessment were presented and discussed. The outcomes of these experiments will guide decisions during the subjective quality assessment standardization process, which has issued its third Working Draft. A new version of a specialized tool for subjective quality evaluation, which supports these experiments, has also been released.

JPEG AIC

At its 104th meeting, the JPEG Committee reviewed results from previous Core Experiments that collected subjective data for fine-grained quality assessments of compressed images ranging from high to near-lossless visual quality. These crowdsourcing experiments used triplet comparisons with and without boosted distortions, as well as double stimulus ratings on a visual analog scale. Analysis revealed that boosting increased the precision of reconstructed scale values by nearly a factor of two. Consequently, the JPEG Committee has decided to use triplet comparisons in the upcoming AIC-3.
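For readers unfamiliar with how scale values are reconstructed from comparison data, the sketch below shows a minimal Thurstone-style reconstruction from a pairwise win-count matrix (using NumPy and SciPy). AIC-3 itself relies on boosted triplet comparisons and a more elaborate maximum-likelihood procedure; this is only meant to illustrate the general comparison-to-scale idea, and the counts are made up.

```python
import numpy as np
from scipy.stats import norm

# Minimal sketch: reconstruct quality scale values from comparison data
# (Thurstone Case V on a pairwise win-count matrix). Illustrative only.
wins = np.array([  # wins[i, j] = times stimulus i was judged better than j
    [0, 14, 18, 19],
    [6,  0, 15, 17],
    [2,  5,  0, 13],
    [1,  3,  7,  0],
])
trials = wins + wins.T
p = np.clip(wins / np.where(trials == 0, 1, trials), 0.01, 0.99)  # avoid infinite z-scores
z = norm.ppf(p)                     # preference probability -> standard-normal distance
np.fill_diagonal(z, 0.0)
scale = z.mean(axis=1)              # Case V: average the z-scores per stimulus
scale -= scale.min()                # anchor the lowest-quality stimulus at 0
print(np.round(scale, 2))
```

Boosting the distortions makes the individual comparisons more decisive, which is why it improves the precision of the reconstructed scale for a given number of judgments.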

The JPEG Committee also discussed JPEG AIC Part 4, which focuses on objective image quality assessments for compressed images in the high to near-lossless quality range. This includes developing methods to evaluate the performance of such objective image quality metrics. A draft call for contributions is planned for January 2025.

JPEG Systems

At the 104th meeting, Part 10 of JPEG Systems (ISO/IEC 19566-10), the JPEG Systems Reference Software, reached the IS stage. This first version of the reference software provides a reference implementation and reference dataset for the JPEG Universal Metadata Box Format (JUMBF, ISO/IEC 19566-5). Meanwhile, work is in progress to extend the reference software implementations to additional Parts, including JPEG Privacy and Security and JPEG 360.

JPEG DNA

JPEG DNA is an initiative aimed at developing a standard capable of representing bi-level, continuous-tone grey-scale, continuous-tone colour, or multichannel digital samples in a format using nucleotide sequences to support DNA storage. A Call for Proposals was published at the 99th JPEG meeting. Based on the performance assessments and descriptive analyses of the submitted solutions, the JPEG DNA Verification Model was created during the 102nd JPEG meeting. Several core experiments were conducted to validate this Verification Model, leading to the creation of the first Working Draft of JPEG DNA during the 103rd JPEG meeting.

The next phase of this work involves newly defined core experiments to enhance the rate-distortion performance of the Verification Model and its robustness to insertion, deletion, and substitution errors. Additionally, core experiments to test robustness against substitution and indel noise are being conducted. A core experiment was also performed to integrate JPEG AI into the JPEG DNA VM, and quality comparisons have been carried out. A study on the visual quality assessment of JPEG AI as an alternative to JPEG XL in the VM will also be carried out.

In parallel, efforts are underway to improve the noise simulator developed at the 102nd JPEG meeting, enabling a more realistic assessment of the Verification Model’s resilience to noise. There is also ongoing exploration of the performance of different clustering and consensus algorithms to further enhance the VM’s capabilities.

JPEG XS

The core parts of JPEG XS 3rd edition were prepared for immediate publication as International Standards. This means that Part 1 of the standard – Core coding tools, Part 2 – Profiles and buffer models, and Part 3 – Transport and container formats, will be available before the end of 2024. Part 4 – Conformance testing is currently still under DIS ballot and it will be finalized in October 2024. At the 104th meeting, the JPEG Committee continued the work on Part 5 – Reference software. This part is currently at Committee Draft stage and the DIS is planned for October 2024. The reference software has a feature-complete decoder that is fully compliant with the 3rd edition. Work on the encoder is ongoing.

Finally, additional experimental results were presented on how JPEG XS can be used over 5G mobile networks for wireless transmission of low-latency and high quality 6K/8K 360 degree views with mobile devices and VR headsets. This work will be continued.

JPEG XL

Objective metrics results for HDR images were investigated (using among others the ColorVideoVDP metric), indicating very promising compression performance of JPEG XL compared to other codecs like AVIF and JPEG 2000. Both the libjxl reference software encoder and a simulated candidate hardware encoder were tested. Subjective experiments for HDR images are planned.

The second editions of JPEG XL Part 1 (Core coding system) and Part 2 (File format) are now ready for publication. The second edition of JPEG XL Part 3 (Conformance testing) has moved to the FDIS stage.

Final Quote

“The JPEG Committee has reached a new milestone by releasing a new Call for Proposals to code events. This call is aimed at creating the first International Standard to efficiently represent events, enabling interoperability between devices and systems that rely on event sensing.” said Prof. Touradj Ebrahimi, the Convenor of the JPEG Committee.

One benchmarking cycle wraps up, and the next ramps up: News from the MediaEval Multimedia Benchmark

Introduction

MediaEval, the Multimedia Evaluation Benchmark, has offered a yearly set of multimedia challenges since 2010. MediaEval supports the development of algorithms and technologies for analyzing, exploring and accessing information in multimedia data. MediaEval aims to help make multimedia technology a force for good in society and for this reason focuses on tasks with a human or social aspect. Benchmarking contributes in two ways to advancing multimedia research. First, by offering standardized definitions of tasks and evaluation data sets, it makes it possible to fairly compare algorithms and, in this way, track progress. If we can understand which types of algorithms perform better, we can more easily find ways (and the motivation) to improve them. Second, benchmarking helps to direct the attention of the research community, for example, towards new tasks that are based on real-world needs, or towards known problems for which more research is necessary to have a solution that is good enough for a real world application scenario.

The 2023 MediaEval benchmarking season culminated with the yearly workshop, which was held in conjunction with MMM 2024 (https://www.mmm2024.org) in Amsterdam, Netherlands. It was a hybrid workshop, which also welcomed online participants. The workshop kicked off with a joint keynote with MMM 2024, given by Yiannis Kompatsiaris (Information Technologies Institute, CERTH) on Visual and Multimodal Disinformation Detection. The talk covered the implications of multimodal disinformation online and the challenges that must be faced in order to detect it. The workshop also featured an invited speaker, Adriënne Mendrik, CEO & Co-founder of Eyra, which supports benchmarks with the online Next platform. She talked about benchmark challenge design for science and how the Next platform is currently being used in the social sciences.

More information about the workshop can be found at https://multimediaeval.github.io/editions/2023/ and the proceedings were published at https://ceur-ws.org/Vol-3658/. In the rest of this article, we provide an overview of the highlights of the workshop as well as an outlook to the next edition of MediaEval in 2025.

Tasks at MediaEval

The MediaEval 2023 workshop featured five tasks that focused on human and social aspects of multimedia analysis.

Three of the tasks required participants to combine or cross modalities, or even consider new modalities. The Musti: Multimodal Understanding of Smells in Texts and Images task challenged participants to detect and classify smell references in multilingual texts and images from the 17th to the 20th century. They needed to identify whether a text and image evoked the same smell source, detect specific smell sources, and apply zero-shot learning for untrained languages. The other two of these tasks emphasized the social aspects of multimedia. In the NewsImages: Connecting Text and Images task, participants worked with a dataset of news articles and images, predicting which image accompanied a news article. This task aimed to explore cases in which the link between a text and an image goes beyond the text being a literal description of what is pictured in the image. The Predicting Video Memorability task required participants to predict how likely videos were to be remembered, both short- and long-term, and to use EEG data to predict whether specific individuals would remember a given video, combining visual features and neurological signals.

Two of the tasks focused on pushing video analysis forward, to be useful in supporting experts in carrying out their jobs. The SportsVideo: Fine-Grained Action Classification and Position Detection task strove to develop technology that will support coaches. To address this task, participants analyzed videos of table tennis and swimming competitions, detecting athlete positions, identifying strokes, classifying actions, and recognizing game events such as scores and sounds. The Transparent Tracking of Spermatozoa task strove to develop technology that will support medical professionals. Task participants were asked to track sperm cells in video recordings to evaluate male reproductive health. This involved localizing and tracking individual cells in real time, predicting their motility, and using bounding box data to assess sperm quality. The task emphasized both accuracy and processing efficiency, with subtasks involving graph data structures for motility prediction.

Impressions of Student Participants

MediaEval is grateful to SIGMM for providing funding for three students who attended the MediaEval Workshop and greatly helped us with the organization of this edition: Iván Martín-Fernández and Sergio Esteban-Romero from the Speech Technology and Machine Learning Group (GTHAU) – Universidad Politécnica de Madrid, and Xiaomeng Wang from Radboud University. Below, the students provide their comments and impressions of the workshop.

“As a new PhD student, I greatly valued my experience attending MediaEval 2023. I participated as the main author and presented work from our group, GTHAU – Universidad Politécnica de Madrid, on the Predicting Video Memorability Challenge. The opportunity to meet renowned colleagues and senior researchers, and learn from their experiences, provided valuable insight into what academia looks like from the inside.

MediaEval offers a range of multimedia-related tasks, which may sometimes seem under the radar but are crucial in developing real-world applications. Moreover, the conference distinguishes itself by pushing the boundaries, going beyond just presenting results to foster a deeper understanding of the challenges being addressed. This makes it a truly enriching experience for both newcomers and seasoned professionals alike. 

Having volunteered and contributed to organizational tasks, I also gained first-hand insight into the inner workings of an academic conference, a facet I found particularly rewarding. Overall, MediaEval 2023 proved to be an exceptional blend of scientific rigor, collaborative spirit, and practical insights, making it an event I would highly recommend for anyone in the multimedia community.”

Iván Martín-Fernández, PhD Student, GTHAU – Universidad Politécnica de Madrid

“Attending MediaEval was an invaluable experience that allowed me to connect with a wide range of researchers and engage in discussions about the latest advancements in Artificial Intelligence. Presenting my work on the Multimedia Understanding of Smells in Text and Images (MUSTI) challenge was particularly insightful, as the feedback I received sparked ideas for future research. Additionally, volunteering and assisting with organizational tasks gave me a behind-the-scenes perspective on the significant effort required to coordinate an event like MediaEval. Overall, this experience was highly enriching, and I look forward to participating and collaborating in future editions of the workshop.”

Sergio Esteban-Romero, PhD Student, GTHAU – Universidad Politécnica de Madrid

“I was glad to be a student volunteer at MediaEval 2024. Collaborating with other volunteers, we organized submission files and prepared the facilities. Everyone was exceptionally kind and supportive.
In addition to volunteering, I also participated in the workshop as a paper author. I submitted a paper to the NewsImage task and delivered my first oral presentation. The atmosphere was highly academic, fostering insightful discussions. And I received valuable suggestions to improve my paper.  I truly appreciate this experience, both as a volunteer and as a participant.”

Xiaomeng Wang, PhD Student, Data Science – Radboud University

Outlook to MediaEval 2025 

We are happy to announce that in 2025 MediaEval will be hosted in Dublin, Ireland, co-located with CBMI 2025. The Call for Task Proposals is now open, and details regarding submitting proposals can be found here: https://multimediaeval.github.io/2024/09/24/call.html. The final deadline for submitting your task proposals is Wed. 22nd January 2025. We will publish the list of tasks offered in March and registration for participation in MediaEval 2025 will open in April 2025.

For this edition of MediaEval we will again emphasize our “Quest for Insight”: we push beyond improving evaluation scores to achieving deeper understanding about the challenges, including data and the strengths and weaknesses of particular types of approaches, with the larger aim of understanding and explaining the concepts that the tasks revolve around, promoting reproducible research, and fostering a positive impact on society. We look forward to welcoming you to participate in the new benchmarking year.

Report from CBMI 2024

The 21st International Conference on Content-based Multimedia Indexing (CBMI) was hosted by Reykjavik University in cooperation with ACM, SIGMM, VTT and IEEE. The three-day event took place on September 20-22 in Reykjavik, Iceland. Like the year before, it was an exclusively in-person event. Despite the remote location, an active volcano, and the in-person attendance requirement, we are pleased to report perfect attendance of presenting authors. CBMI was started in France and still has strong European roots. Looking at the nationalities of the submitting authors, we can see 17 unique nationalities: 14 countries in Europe, 2 in Asia and 1 in North America.

Conference highlights

Figure 1: First keynote speaker being introduced.

Key elements of a successful conference are the keynote sessions. The first and opening keynote, titled “What does it mean to ‘work as intended’?”, was presented by Dr. Cynthia C. S. Liem on day 1. In this talk, Cynthia raised important questions on how complex it can be to define, measure and evaluate human-focused systems. Using real-world examples, she demonstrated how recently developed systems that passed the traditional evaluation metrics still failed when deployed in the real world. Her talk was an important reminder that certain weaknesses in human-focused systems are only revealed when exposed to reality.

Figure 2: Keynote speaker Ingibjörg Jónsdóttir (left) and closing keynote speaker Hannes Högni Vilhjálmsson (right).

Traditionally there are only two keynotes at CBMI, the first on day 1 and the second on day 2. However, our planned second keynote speaker could not attend until the last day, and thus a third “surprise” keynote was organized on day 2 with the title “Remote Sensing of Natural Hazards”. The speaker was Dr. Ingibjörg Jónsdóttir, an associate professor of geology at the University of Iceland. She gave a very interesting talk about the unique geology of Iceland, the threats posed by natural hazards, and her work using remote sensing to monitor both sea ice and volcanoes. The talk was well received by attendees, as it gave insight into the host country and the volcanic eruption that ended just a week before the start of the conference (the 7th in the past 2 years on the Reykjanes Peninsula). The subject is also highly relevant to the community, as the analysis and prediction are based on multimodal data.

The planned second keynote took place in the last session on day 3 and was given by Dr. Hannes Högni Vilhjálmsson, professor at Reykjavik University. The talk, titled “Being Multimodal: What Building Virtual Humans has Taught us about Multimodality”, gave the audience a deep dive into lessons learnt from his 20+ years of experience developing intelligent virtual agents with face-to-face communication skills. “I will review our attempts to capture, understand and analyze the multi-modal nature of human communication, and how we have built and evaluated systems that engage in and support such communication.” is a direct quote from the abstract of his talk.

CBMI is a relatively small, but growing, conference that is built on a strong legacy and has a highly motivated community behind it. The special sessions have long played an important role at CBMI and this year there were 8 special sessions accepted.

  • AIMHDA: Advances in AI-Driven Medical and Health Data Analysis
  • CB4AMAS: Content-based Indexing for audio and music: from analysis to synthesis
  • ExMA: Explainability in Multimedia Analysis
  • IVR4B: Interactive Video Retrieval for Beginners
  • MAS4DT: Multimedia analysis and simulations for Digital Twins in the construction domain
  • MmIXR: Multimedia Indexing for XR
  • MIDRA: Multimodal Insights for Disaster Risk Management and Applications
  • UHBER: Multimodal Data Analysis for Understanding of Human Behaviour, Emotions and their Reasons
Figure 3: UHBER special session chair Dr. E. Vildjunaite with a conference participant.

The number of papers per session ranged from 2 to 8. The larger sessions (CB4AMAS, MmIXR and UHBER) used a discussion panel format that created a more inclusive atmosphere and, at times, sparked lively discussions. 

Figure 4: Images from the poster session and the IVR4B competition.

Especially popular with attendees was the competition that took place in the Interactive Video Retrieval for Beginners (IVR4B) session. This session was hosted right after the poster session in the wide open space of Reykjavik University’s foyer. 

Awards

The selection committee was unanimous in deciding that the contribution of Lorenzo Bianchi, Fabio Carrara, Nicola Messina and Fabrizio Falchi, titled “Is CLIP the main roadblock for fine-grained open-world perception?”, was the best paper award winner. With the generous support of ACM SIGMM, the authors were awarded 500 Euros. As the best paper was also a student paper, it was decided to also give the runner-up a 300 Euro award. The runner-up was the contribution of Recep Oguz Araz, Dmitry Bogdanov, Pablo Alonso-Jimenez and Frederic Font, titled “Evaluation of Deep Audio Representations for Semantic Sound Similarity”.

The best demonstration was awarded to Joshua David Springer, Gylfi Thor Gudmundsson and Marcel Kyas for “Lowering Barriers to Entry for Fully-Integrated Custom Payloads on a DJI Matrice”. 

The top two systems in the IVR4B competition were also recognized: first place went to Nick Pantelidis, Maria Pegia, Damianos Galanopoulos, et al. for “VERGE: Simplifying Video Search for Novice”; and second place went to Giuseppe Amato, Paolo Bolettieri, Fabio Carrara, et al. for “VISIONE 5.0: toward evaluation with novice users”.

Social events

The first day of the conference was quite eventful, as before the poster and IVR4B sessions, Francois Pineau-Benois and Raphael Moraly of the Odyssée Quartet performed selected classical works in the “Music-meets-Science” cultural event. The goal of the latter is to bring live classical music to the multimedia research community. The musicians played a concert and then held a discussion with researchers, specifically those involved in music analysis and retrieval. Such exchanges between content creators and researchers in content analysis, indexing and retrieval have been a distinctive feature of CBMI since 2018.
This event would not have been possible without the generous support of ACM SIGMM.

The second day was no less entertaining, as before the banquet attendees took a virtual flight over Iceland’s beautiful landscape via the services of FlyOver Iceland.
The next edition, CBMI 2025, will be held in Dublin, organized by DCU.

The 2nd Edition of International Summer School on Extended Reality Technology and Experience (XRTX)

The ACM Special Interest Group on Multimedia (ACM SIGMM) co-sponsored the second edition of the International Summer School on Extended Reality Technology and Experience (XRTX), which took place from July 8-11, 2024 in Madrid (Spain), hosted by Universidad Carlos III de Madrid. As in the first edition in 2023, Universidad Politécnica de Madrid, Nokia, Universidad de Zaragoza and Universidad Rey Juan Carlos also participated in the organization. The school attracted 29 participants from different disciplines (e.g., engineering, computer science, psychology) and 7 different countries.

Students and organizers of the Summer School XRTX (July 8-11, 2024, Madrid)

The support from ACM SIGMM made it possible to bring top researchers in the field of XR to deliver keynotes on different topics, covering both technological factors (e.g., volumetric video, XR for healthcare) and user-centric factors (e.g., user experience evaluation, multisensory experiences): Pablo César (CWI, Netherlands), Anthony Steed (UCL, UK), Manuela Chessa (University of Genoa, Italy), Qi Sun (New York University, US), Diego Gutiérrez (Universidad de Zaragoza, Spain), and Marianna Obrist (UCL, UK). In addition, an industry session was included, led by Gloria Touchard (Nokia, Spain).

Keynote sessions

In addition to these 7 keynotes, the program included a creative design session led by Elena Márquez-Segura (Universidad Carlos III, Spain), a tutorial on Ubiq given by Sebastian Friston (UCL, UK), and 2 practical sessions led by Telmo Zarraonandia (Universidad Carlos III, Spain) and Sandra Malpica (Universidad de Zaragoza, Spain) to get hands-on experience working with Unity for VR development.

Design and practical sessions

Moreover, poster and demo sessions were distributed throughout the duration of the school, in which the participants could showcase their work.

Poster and demo sessions

The summer school was also sponsored by Nokia and the Computer Science and Engineering Department of Universidad Carlos III, which made it possible to offer grants supporting a number of students with registration and travel costs.

Finally, in addition to science, there was time for fun with social activities, like practicing real and VR archery and visiting an immersive exhibition about Pompeii.

Practicing real and VR archery
Immersive exhibition about Pompeii

The list of talks was:

  • “Towards volumetric video conferencing” by Pablo Cesar.
  • “Strategies for Designing and Evaluating eXtended Reality Experiences” by Anthony Steed.
  • “XR for healthcare: Immersive and interactive technologies for serious games and exergames”, by Manuela Chessa.
  • “Toward Human-Centered XR: Bridging Cognition and Computation”, by Qi Sun.
  • “Improving the user’s experience in VR” by Diego Gutiérrez.
  • “Why XR is important for Nokia” by Gloria Touchard.
  • “The Role of Multisensory Experiences in Extended Reality: Unlocking the Power of Smell” by Marianna Obrist.

JPEG Column: 103rd JPEG Meeting

JPEG AI reaches Draft International Standard stage

The 103rd JPEG meeting was held online from April 8 to 12, 2024. During this meeting, the first learning-based standard, JPEG AI, reached the Draft International Standard (DIS) stage and was sent for balloting after a very successful development stage that led to compression performance improvements above 25% over its best-performing anchor, VVC. This high performance, combined with implementations already running on current mobile phones and the possibility of using the latent representation directly in image processing applications, opens new opportunities and will certainly launch a new era of compression technology.

The following are the main highlights of the 103rd JPEG meeting:

  • JPEG AI reaches Draft International Standard;
  • JPEG Trust integrates JPEG NFT;
  • JPEG Pleno Learning based Point Cloud coding releases a Draft International Standard;
  • JPEG Pleno Light Field works in a new compression model;
  • JPEG AIC analyses different subjective evaluation models for near visually lossless quality evaluation;
  • JPEG XE prepares a call for proposals on event-based coding;
  • JPEG DNA proceeds with the development of a standard for image compression using nucleotide sequences for supporting DNA storage;
  • JPEG XS 3rd edition;
  • JPEG XL analyses HDR coding.

The following sections summarise the main highlights of the 103rd JPEG meeting.

JPEG AI reaches Draft International Standard

At its 103rd meeting, the JPEG Committee produced the Draft International Standard (DIS) of the JPEG AI Part 1 Core Coding Engine, which is expected to be published as an International Standard in October 2024. JPEG AI offers a coding solution for standard reconstruction with significant improvements in compression efficiency over previous image coding standards at equivalent subjective quality. The JPEG AI coding design allows for encoding and decoding that are friendly to hardware and software implementation in terms of memory and computational complexity, efficient coding of images with text and graphics, support for 8- and 10-bit depth, region-of-interest coding, and progressive coding. To cover multiple encoder and decoder complexity-efficiency tradeoffs, JPEG AI supports a multi-branch coding architecture with two encoders and three decoders (6 possible compatible combinations) that have been jointly trained. Compression efficiency (BD-rate) gains of 12.5% to 27.9% over the VVC Intra coding anchor, for relevant encoder and decoder configurations, can be achieved with a wide range of complexity tradeoffs (7 to 216 kMAC/px at the decoder side).
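
For readers less familiar with how such percentage gains are reported, the following is a minimal, generic sketch of the Bjøntegaard delta-rate (BD-rate) computation over two rate-distortion curves. It is not the official JPEG AI evaluation software, and the rate-distortion points used below are purely hypothetical.

```python
# Generic BD-rate sketch: average bitrate difference (%) of a test codec
# versus an anchor at equal quality, assuming (bitrate, PSNR) pairs.
import numpy as np

def bd_rate(rate_anchor, psnr_anchor, rate_test, psnr_test):
    log_ra, log_rt = np.log(rate_anchor), np.log(rate_test)
    # Fit cubic polynomials of log-rate as a function of quality (PSNR).
    p_a = np.polyfit(psnr_anchor, log_ra, 3)
    p_t = np.polyfit(psnr_test, log_rt, 3)
    lo = max(min(psnr_anchor), min(psnr_test))
    hi = min(max(psnr_anchor), max(psnr_test))
    # Integrate both fits over the overlapping quality range.
    int_a = np.polyval(np.polyint(p_a), hi) - np.polyval(np.polyint(p_a), lo)
    int_t = np.polyval(np.polyint(p_t), hi) - np.polyval(np.polyint(p_t), lo)
    avg_diff = (int_t - int_a) / (hi - lo)
    return (np.exp(avg_diff) - 1) * 100  # negative = average bitrate savings

# Hypothetical RD points (bitrate in bpp, PSNR in dB), for illustration only.
print(bd_rate([0.25, 0.5, 1.0, 2.0], [32.0, 35.0, 38.0, 41.0],
              [0.20, 0.4, 0.8, 1.6], [32.2, 35.3, 38.2, 41.1]))
```

A negative BD-rate value indicates average bitrate savings of the test codec over the anchor at the same quality.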

The work regarding JPEG AI profiles and levels (part 2), reference software (part 3) and conformance (part 4) has started, and a request for sub-division was issued at this meeting to establish a new part on the file format (part 5). Most of the work focused on the JPEG AI high-level syntax and on the improvement of several normative and non-normative tools, such as hyper-decoder activations, the training dataset, progressive decoding, the training methodology and enhancement filters. There are now two smartphone implementations of JPEG AI available. A JPEG AI demo was shown running on a Huawei Mate50 Pro with a Qualcomm Snapdragon 8+ Gen1, featuring high-resolution (4K) image decoding, tiling, full base operating point support and arbitrary image resolution decoding.

JPEG Trust

At the 103rd meeting, the JPEG Committee produced an updated version of the Use Cases and Requirements for JPEG Trust (v2.0). This document integrates the use cases and requirements of the JPEG NFT exploration with the use cases and requirements of JPEG Trust. In addition, a new document with Terms and Definitions for JPEG Trust (v1.0) was published which incorporates all terms and concepts as they are used in the context of the JPEG Trust activities. Finally, an updated version of the JPEG Trust White Paper v1.1 has been released. These documents are publicly available on the JPEG Trust/Documentation page.

JPEG Pleno Learning-based Point Cloud coding

The JPEG Committee continued its activity on Learning-based Point Cloud Coding under the JPEG Pleno family of standards. During the 103rd JPEG meeting, comments on the Committee Draft of ISO/IEC 21794 Part 6: “Learning-based point cloud coding” were received, and the activity is on track for the release of a Draft International Standard for balloting at the 104th JPEG meeting in Sapporo, Japan in July 2024. A new version of the Verification Model (Version 4.1) was released during the 103rd JPEG meeting, containing an updated entropy coding module. In addition, version 2.1 of the Common Training and Test Conditions was released as a public document.

JPEG Pleno Light Field

The JPEG Pleno Light Field activity progressed at this meeting with a number of technical submissions for improvements to the JPEG Pleno Model (JPLM). The JPLM provides reference implementations for the standardized technologies within the JPEG Pleno framework. The JPEG Pleno Light Field activity has an ongoing standardization effort concerning a novel light field coding architecture that delivers a single coding mode to efficiently code all types of light fields. This novel coding mode does not need any depth information, resulting in a significant improvement in compression efficiency.

The JPEG Pleno Light Field activity is also preparing standardization efforts in the domains of objective and subjective quality assessment for light fields, aiming to address other plenoptic modalities in the future. During the meeting, important decisions were made regarding the execution of multiple collaborative subjective experiments exploring various aspects of subjective light field quality assessment. Additionally, a specialized tool for subjective quality evaluation has been developed to support these experiments. The outcomes of these experiments will guide decisions during the subjective quality assessment standardization process. They will also be utilized in evaluating proposals for the upcoming objective quality assessment standardization activities.

JPEG AIC

During the 103rd JPEG meeting, the work on visual image quality assessment continued with a focus on JPEG AIC-3, targeting a standard for a subjective quality assessment methodology for images in the range from high to nearly visually lossless quality. The activity is currently investigating three kinds of subjective image quality assessment methodologies, notably the Boosted Triplet Comparison (BTC), the In-place Double Stimulus Quality Scale (IDSQS), and the In-place Plain Triplet Comparison (IPTC), as well as a unified framework capable of merging the results of two of them.

The JPEG Committee has also worked on the preparation of Part 4 of the standard (JPEG AIC-4) by initiating work on the Draft Call for Proposals on Objective Image Quality Assessment. The Final Call for Proposals on Objective Image Quality Assessment is planned for release in January 2025, while the submission of proposals is planned for April 2025.

JPEG XE

The JPEG Committee continued its activity on JPEG XE and event-based vision. This activity revolves around a new and emerging image modality created by event-based visual sensors. JPEG XE concerns the creation and development of a standard to represent events in an efficient way, allowing interoperability between sensing, storage, and processing, and targeting machine vision and other relevant applications. The JPEG Committee finished the Common Test Conditions v1.0 document, which provides the means to perform an evaluation of candidate technologies for efficient coding of event sequences. The Common Test Conditions define a canonical raw event format, a reference dataset, a set of key performance metrics and an evaluation methodology. In addition, the JPEG Committee also finalized the Draft Call for Proposals on lossless coding for event-based data. This call will be finalized at the next JPEG meeting in July 2024. Both the Common Test Conditions v1.0 and the Draft Call for Proposals are publicly available on jpeg.org. Standardization will start with lossless coding of event sequences, as this has the most urgent application needs in industry. However, the JPEG Committee acknowledges that lossy coding of event sequences is also a valuable feature, which will be addressed at a later stage. The Ad-hoc Group on Event-based Vision was re-established to continue the work towards the 104th JPEG meeting. To stay informed about the activities, please join the event-based imaging Ad-hoc Group mailing list.
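
To make the nature of this modality more concrete, the snippet below sketches a raw event stream as a sequence of (x, y, timestamp, polarity) tuples with a naive fixed-size packing. The canonical raw event format in the JPEG XE Common Test Conditions is defined by the committee, so this is only an assumed illustration of the kind of data a lossless event coder has to handle.

```python
# Toy sketch of an event stream from an event-based visual sensor.
from dataclasses import dataclass
import struct

@dataclass
class Event:
    x: int          # pixel column
    y: int          # pixel row
    t: int          # timestamp in microseconds
    polarity: int   # 1 = brightness increase, 0 = brightness decrease

def pack_events(events):
    """Naive fixed-size packing: 2+2+8+1 bytes per event, no compression."""
    return b"".join(struct.pack("<HHQB", e.x, e.y, e.t, e.polarity) for e in events)

events = [Event(10, 20, 1_000, 1), Event(11, 20, 1_250, 0), Event(10, 21, 1_600, 1)]
raw = pack_events(events)
# A lossless event codec would exploit the strong temporal and spatial
# correlation between consecutive events (e.g., by coding timestamp deltas)
# to go well below this 13-byte-per-event baseline.
print(len(raw), "bytes for", len(events), "events")
```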

JPEG DNA

JPEG DNA is an exploration aiming at developing a standard that provides technical solutions capable of representing bi-level, continuous-tone grey-scale, continuous-tone colour, or multichannel digital samples in a format representing nucleotide sequences for supporting DNA storage. A Call for Proposals was published at the 99th JPEG meeting, and based on performance assessment and a descriptive analysis of the submitted solutions, the JPEG DNA Verification Model was created during the 102nd JPEG meeting. A number of core experiments were conducted to validate the Verification Model and, notably, the first Working Draft of JPEG DNA was produced during the 103rd JPEG meeting. Work towards the creation of the specification will start with newly defined core experiments to improve the rate-distortion performance of the Verification Model and its robustness to insertion, deletion, and substitution errors. In parallel, efforts are underway to improve the noise simulator produced at the 102nd JPEG meeting, to allow the assessment of the resilience of the Verification Model to noise under more realistic conditions, and to explore learning-based coding solutions.
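
To illustrate the basic idea of representing digital data as nucleotide sequences, the toy sketch below maps bytes onto a quaternary {A, C, G, T} alphabet at two bits per base. The actual JPEG DNA Verification Model uses constrained codes designed for DNA synthesis and sequencing (e.g., avoiding long homopolymer runs) and targets robustness to insertion, deletion, and substitution errors, none of which this snippet attempts.

```python
# Toy binary-to-nucleotide mapping: 2 bits per base, 4 bases per byte.
BASES = "ACGT"

def bytes_to_dna(data: bytes) -> str:
    out = []
    for byte in data:
        for shift in (6, 4, 2, 0):           # four 2-bit symbols per byte
            out.append(BASES[(byte >> shift) & 0b11])
    return "".join(out)

def dna_to_bytes(seq: str) -> bytes:
    out = bytearray()
    for i in range(0, len(seq), 4):
        byte = 0
        for base in seq[i:i + 4]:
            byte = (byte << 2) | BASES.index(base)
        out.append(byte)
    return bytes(out)

payload = b"JPEG"
strand = bytes_to_dna(payload)
assert dna_to_bytes(strand) == payload
print(strand)  # 16 nucleotides for 4 bytes
```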

JPEG XS

The JPEG Committee is happy to announce that the core parts of JPEG XS 3rd edition are ready for publication as International Standards. The Final Draft International Standard for Part 1 of the standard – Core coding tools – is ready, and Part 2 – Profiles and buffer models – and Part 3 – Transport and container formats – are both being prepared by ISO for immediate publication. At this meeting, the JPEG Committee continued the work on Part 4 – Conformance testing, to provide the necessary test streams and test protocols to implementers of the 3rd edition. Consultation of the Committee Draft for Part 4 took place and a DIS version was issued. The development of the reference software, contained in Part 5, continued and the reference decoder is now feature-complete and fully compliant with the 3rd edition. A Committee Draft for Part 5 was issued at this meeting. Development of a fully compliant reference encoder is scheduled to be completed by July.

Finally, new experimental results were presented on how to use JPEG XS over 5G mobile networks for the wireless transmission of low-latency, high-quality 4K/8K 360-degree views with mobile devices and VR headsets. More experiments will be conducted, but first results show that JPEG XS is capable of providing an immersive and excellent quality of experience in VR use cases, mainly thanks to its native low-latency and low-complexity properties.

JPEG XL

The performance of JPEG XL on HDR images was investigated and the experiments will continue. Work on a hardware implementation continues, and further improvements are made to the libjxl reference software. The second editions of Parts 1 and 2 are in the final stages of the ISO process and will be published soon.

Final Quote

“The JPEG AI Draft International Standard is yet another important milestone in an age where AI is rapidly replacing previous technologies. With this achievement, the JPEG Committee has demonstrated its ability to reinvent itself and adapt to new technological paradigms, offering standardized solutions based on the latest state-of-the-art technologies,” said Prof. Touradj Ebrahimi, the Convenor of the JPEG Committee.

MPEG Column: 147th MPEG Meeting in Sapporo, Japan


The 147th MPEG meeting was held in Sapporo, Japan from 15-19 July 2024, and the official press release can be found here. It comprises the following highlights:

  • ISO Base Media File Format*: The 8th edition was promoted to Final Draft International Standard, supporting seamless media presentation for DASH and CMAF.
  • Syntactic Description Language: Finalized as an independent standard for MPEG-4 syntax.
  • Low-Overhead Image File Format*: First milestone achieved for small image handling improvements.
  • Neural Network Compression*: Second edition for conformance and reference software promoted.
  • Internet of Media Things (IoMT): Progress made on reference software for distributed media tasks.

* … covered in this column and expanded with possible research aspects.

8th edition of ISO Base Media File Format

The ever-growing expansion of the application area of the ISO/IEC 14496-12 ISO base media file format (ISOBMFF) has continuously brought new technologies into the standard. During the last couple of years, MPEG Systems (WG 3) has received new technologies on ISOBMFF for more seamless support of ISO/IEC 23009 Dynamic Adaptive Streaming over HTTP (DASH) and ISO/IEC 23000-19 Common Media Application Format (CMAF), leading to the development of the 8th edition of ISO/IEC 14496-12.

The new edition of the standard includes new technologies to explicitly indicate the set of tracks representing various versions of the same media presentation, enabling seamless switching and continuous presentation. Such technologies will enable more efficient processing of ISOBMFF-formatted files for DASH manifests or CMAF fragments.

Research aspects: The central research aspect of the 8th edition of ISOBMFF, which “will enable more efficient processing,” will undoubtedly be its evaluation compared to the state-of-the-art. Standards typically define a format, but how to use it is left open to implementers. Therefore, the implementation is a crucial aspect and will allow for a comparison of performance. One such implementation of ISOBMFF is GPAC, which most likely will be among the first to implement these new features.
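
As a rough illustration of the kind of data such processing operates on, the sketch below walks the top-level box structure of an ISOBMFF file using the standard 8-byte box header (4-byte size plus 4-byte four-character type). It is only a toy parser, unrelated to the specific 8th-edition features, which would require interpreting the relevant boxes in much more detail.

```python
# Minimal sketch of listing top-level ISOBMFF boxes (e.g., in an MP4/CMAF file).
import struct

def list_top_level_boxes(path):
    with open(path, "rb") as f:
        while True:
            header = f.read(8)
            if len(header) < 8:
                break
            size, box_type = struct.unpack(">I4s", header)
            offset = 8
            if size == 1:          # 64-bit "largesize" follows the box type
                size = struct.unpack(">Q", f.read(8))[0]
                offset = 16
            elif size == 0:        # box extends to the end of the file
                yield box_type.decode("ascii", "replace"), None
                break
            yield box_type.decode("ascii", "replace"), size
            f.seek(size - offset, 1)   # skip the box payload

# For a fragmented CMAF file one would typically see ftyp, moov,
# followed by repeated moof/mdat pairs:
# for name, size in list_top_level_boxes("segment.mp4"):
#     print(name, size)
```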

Low-Overhead Image File Format

The ISO/IEC 23008-12 image format specification defines generic structures for storing image items and sequences based on the ISO/IEC 14496-12 ISO base media file format (ISOBMFF). As it allows the use of various high-performance video compression standards for a single image or a series of images, it was quickly adopted by the market. However, it has proven challenging to use for very small images such as icons or emojis. While the initial design of the standard was versatile and useful for a wide range of applications, the size of the headers becomes an overhead for applications with tiny images. Thus, Amendment 3 of ISO/IEC 23008-12, the low-overhead image file format, aims to address this use case by adding a new compact box for storing metadata instead of the ‘Meta’ box, to lower the overhead.

Research aspects: The issue regarding the header sizes of ISOBMFF for small files or low bitrates (in the case of video streaming) has been known for some time. Therefore, amendments in this direction are appreciated, while further performance evaluations are needed to confirm the design choices made at this initial step of standardization.

Neural Network Compression

An increasing number of artificial intelligence applications based on artificial neural networks, such as edge-based multimedia content processing, content-adaptive video post-processing filters, or federated training, need to exchange updates of neural networks (e.g., after training on additional data or fine-tuning to specific content). For this purpose, MPEG developed a second edition of the standard for coding of neural networks for multimedia content description and analysis (NNC, ISO/IEC 15938-17, published in 2024), adding syntax for differential coding of neural network parameters as well as new coding tools. For several architectures, trained models can be compressed to 10-20% of their original size, and even to below 3%, without performance loss. Higher compression rates are possible at moderate performance degradation. In a distributed training scenario, a model update after a training iteration can be represented at 1% or less of the base model size on average without sacrificing the classification performance of the neural network.
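
As a conceptual illustration of differential coding of neural network updates, the sketch below computes a parameter delta against a base model and applies coarse uniform quantization. The actual NNC coding tools are far more elaborate (including optimized quantization and entropy coding), so this is only meant to show why model updates are highly compressible; all values and the step size are assumptions for illustration.

```python
# Toy differential coding of a model update: delta + uniform quantization.
import numpy as np

def encode_update(base, updated, step=1e-2):
    """Quantize the parameter delta with a uniform step size."""
    delta = updated - base
    return np.round(delta / step).astype(np.int32)

def decode_update(base, q_delta, step=1e-2):
    """Reconstruct updated parameters from the base model and the coded delta."""
    return base + q_delta.astype(np.float32) * step

# Hypothetical layer of 1M parameters before and after one training round.
rng = np.random.default_rng(0)
base = rng.standard_normal(1_000_000).astype(np.float32)
updated = base + 1e-2 * rng.standard_normal(1_000_000).astype(np.float32)

q = encode_update(base, updated)
# The quantized deltas concentrate around zero, which is exactly what an
# entropy coder exploits to represent the update at a small fraction of
# the base model size.
print("fraction of zero deltas:", 1 - np.count_nonzero(q) / q.size)
print("max reconstruction error:", np.abs(decode_update(base, q) - updated).max())
```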

In order to facilitate the implementation of the standard, the accompanying standard ISO/IEC 15938-18 has been updated to cover the second edition of ISO/IEC 15938-17. This standard provides reference software for encoding and decoding NNC bitstreams, as well as a set of conformance guidelines and reference bitstreams for testing decoder implementations. The software covers the functionalities of both editions of the standard and can be configured to test different combinations of the coding tools specified by the standard.

Research aspects: The reference software for NNC, together with the reference software for audio/video codecs, is a vital tool for building complex multimedia systems and for their (baseline) evaluation with respect to compression efficiency only (not speed). This is because reference software is usually designed for functionality (i.e., compression in this case) rather than performance.

The 148th MPEG meeting will be held in Kemer, Türkiye, from November 04-08, 2024. Click here for more information about MPEG meetings and their developments.

The 2nd Edition of the Spring School on Social XR organized by CWI

ACM SIGMM co-sponsored the second edition of the Spring School on Social XR, organized by the Distributed and Interactive Systems group (DIS) at CWI in Amsterdam. The event took place on March 4th – 8th 2024 and attracted 30 students from different disciplines (technology, social sciences, and humanities). The program included 22 lectures, 6 of them open, by 23 instructors. The event was organized by Irene Viola, Silvia Rossi, Thomas Röggla, and Pablo Cesar from CWI, and Omar Niamut from TNO. It was co-sponsored by the ACM Special Interest Group on Multimedia (ACM SIGMM), which made student grants available and supported international speakers from under-represented countries, and by The Netherlands Institute for Sound and Vision (https://www.beeldengeluid.nl/en).

Students and organisers of the Spring School on Social XR (March 4th – 8th 2024, Amsterdam)

“The future of media communication is immersive, and will empower sectors such as cultural heritage, education, manufacturing, and provide a climate-neutral alternative to travelling in the European Green Deal”. With this vision in mind, the organizing committee continued with a second edition built around a holistic program on the research topic of Social XR. The program included keynotes and workshops, in which prominent scientists in the field shared their knowledge with students and triggered meaningful conversations and exchanges.

A poster session at the CWI DIS Spring School 2024.

The program included topics such as the capturing and modelling of realistic avatars and their behavior, coding and transmission techniques for volumetric video content, ethics for the design and development of responsible social XR experiences, novel rendering and interaction paradigms, and human factors and the evaluation of experiences. Together, they provided a holistic perspective, helping participants to better understand the area and to initiate a network of collaboration to overcome the limitations of current real-time conferencing systems.

The spring school is part of the semester program organized by the DIS group of CWI. The series was initiated in May 2022 with the Symposium on human-centered multimedia systems: a workshop and seminar celebrating the inaugural lecture of Prof. Pablo Cesar, “Human-Centered Multimedia: Making Remote Togetherness Possible”. It then continued in 2023 with the 1st Spring School on Social XR.

The list of talks was:

  • “Volumetric Content Creation for Immersive XR Experiences” by Aljosa Smolic
  • “Social Signal Processing as a Method for Modelling Behaviour in SocialXR” by Julie Williamson
  • “Towards a Virtual Reality” by Elmar Eisemann
  • “Meeting Yourself and Others in Virtual Reality” by Mel Slater
  • “Social Presence in VR – A Media Psychology Perspective” by Tilo Hartmann
  • “Ubiquitous Mixed Reality: Designing Mixed Reality Technology to Fit into the Fabric of our Daily Lives” by Jan Gugenheimer
  • “Building Military Family Cohesion through Social XR: A 8-Week Field Study” by Sun Joo (Grace) Ahn
  • “Navigating the Ethical Landscape of XR: Building a Necessary Framework” by Eleni Mangina
  • “360° Multi-Sensory Experience Authoring” by Debora Christina Muchaluat Saade
  • “QoE Assessment of XR” by Patrick le Callet
  • “Bringing Soul to Digital” by Natasja Paulssen
  • “Evaluating QoE for Social XR – Audio, Visual, Audiovisual and Communication Aspects” by Alexander Raake
  • “Immersive Technologies Through the Lens of Public Values” by Mariëtte van Huijstee
  • “Designing Innovative Future XR Meeting Spaces” by Katherine Isbister
  • “Evaluation Methods for Social XR Experiences” by Mark Billinghurst
  • “Recent Advances in 3D Videocommunication” by Oliver Schreer
  • “Virtual Humans in Social XR” by Zerrin Yumak
  • “The Power of Graphs Learning in Immersive Communications” by Laura Toni
  • “Boundless Creativity: Bridging Sectors for Social Impact” by Benjamin de Wit
  • “Social XR in 5G and Beyond: Use Cases, Requirements, and Standardization Activities” by Lea Skorin-Kapov
  • “An Overview on Standardization for Social XR”  by Pablo Perez and Jesús Gutiérrez
  • “Funding: The Path to Research Independence” by Sergio Cabrero