VQEG Column: Finalization of Recommendation Series P.1204, a Multi-Model Video Quality Evaluation Standard – The New Standards P.1204.1 and P.1204.2

Abstract

This column introduces the now completed ITU-T P.1204 series of video quality model standards for assessing sequences up to UHD/4K resolution. Initially developed over two years by ITU-T Study Group 12 (Question Q14/12) and VQEG, the work used a large dataset of 26 subjective tests (13 for training, 13 for validation), each involving at least 24 participants rating sequences on the 5-point ACR scale. The tests covered diverse encoding settings, bitrates, resolutions, and framerates for the H.264/AVC, H.265/HEVC, and VP9 codecs. The resulting 5,000-sequence dataset forms the largest lab-based source for model development to date. Initially standardized were P.1204.3, a no-reference bitstream-based model with full bitstream access; P.1204.4, a pixel-based, reduced-/full-reference model; and P.1204.5, a no-reference hybrid model. The current record focuses on the latest additions to the series, namely P.1204.1, a parametric, metadata-based model using only information about which codec was used, plus bitrate, framerate and resolution, and P.1204.2, which in addition uses frame-size and frame-type information to incorporate video-content aspects into the predictions.

Introduction

Video quality under specific encoding settings is central to applications such as VoD, live streaming, and audiovisual communication. In HTTP-based adaptive streaming (HAS) services, bitrate ladders define video representations across resolutions and bitrates, balancing screen resolution and network capacity. Video quality, a key contributor to users’ Quality of Experience (QoE), can vary with bandwidth fluctuations, buffer delays, or playback stalls. 

While such quality fluctuations and broader QoE aspects are discussed elsewhere, this record focuses on short-term video quality as modeled by ITU-T P.1204 for HAS-type content. These models assess segments of around 10s under reliable transport (e.g., TCP, QUIC), covering resolution, framerate, and encoding effects, but excluding pixel-level impairments from packet loss under unreliable transport.

Because video quality is perceptual, subjective tests, laboratory or crowdsourced, remain essential, especially at high resolutions such as 4K UHD under controlled viewing conditions (1.5H or 1.6H viewing distance). Yet, studies show limited perceptual gain between HD and 4K, depending on source content, underlining the need for representative test materials. Given the high cost of such tests, objective (instrumental) models are required for scalable, automated assessment supporting applications like bitrate ladder design and service monitoring.

Four main model classes exist: metadata-based, bitstream-based, pixel-based, and hybrid. Metadata-based models use codec parameters (e.g., resolution, bitrate) and are lightweight; bitstream-based models analyze encoded streams without decoding, as in ITU-T P.1203 and P.1204.3 [1][2][3][7]. Pixel-based models compare decoded frames and include Full Reference and Reduced Reference models (e.g., P.1204.4, and also PSNR [9], SSIM [10], VMAF [11][12]), as well as No Reference variants. Finally, hybrid models combine pixel and bitstream or metadata inputs, exemplified by the ITU-T P.1204.5 standard. These three standards, P.1204.3, P.1204.4, and P.1204.5, formed the initial P.1204 Recommendation series finalized in 2020.
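As a minimal illustration of the pixel-based, full-reference class, the sketch below computes PSNR between a reference and a decoded frame; it is only the simplest representative of this class and is unrelated to the actual P.1204.4 algorithm.

    import numpy as np

    def psnr(reference, decoded, max_val=255.0):
        """Full-reference PSNR between two frames of identical shape (e.g., 8-bit luma planes)."""
        ref = np.asarray(reference, dtype=np.float64)
        dec = np.asarray(decoded, dtype=np.float64)
        mse = np.mean((ref - dec) ** 2)
        if mse == 0.0:
            return float("inf")  # identical frames
        return 10.0 * np.log10(max_val ** 2 / mse)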

ITU-T P.1204 series completed with P.1204.1 and P.1204.2

The respective standardization project, under the Work Item name P.NATS Phase 2 (read: Peanuts), was a unique video quality model development competition conducted in collaboration between ITU-T Study Group 12 (SG12) and the Video Quality Experts Group (VQEG). The target use cases covered resolutions up to UHD/4K, presented on UHD/4K PC/TV screens or on Mobile/Tablet (MO/TA) devices. For the first time, bitstream-, pixel-based, and hybrid models were jointly developed, trained, and validated, using a large common subjective dataset comprising 26 tests, each with at least 24 participants (see, e.g., [1] for details). The P.NATS Phase 2 work built on the earlier “P.NATS Phase 1” project, which resulted in the ITU-T Rec. P.1203 standards series (P.1203, P.1203.1, P.1203.2, P.1203.3). In the P.NATS Phase 2 project, video quality models in five different categories were evaluated, and different candidates were found to be eligible to be recommended as standards. The three models initially standardized out of the five categories were the aforementioned P.1204.3, P.1204.4 and P.1204.5. However, due to a lack of consensus between the winning proponents, no models were initially recommended as standards for the category “bitstream Mode 0”, with access to high-level metadata only, such as the video codec, resolution, framerate and bitrate used, and the category “bitstream Mode 1”, with further access to frame-size information that can be used for content-complexity estimation.
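To make the difference between the two input modes concrete, the following sketch lists the information available to a Mode 0 and a Mode 1 model as simple data structures; the field names are illustrative and not taken from the Recommendation text.

    from dataclasses import dataclass, field
    from typing import List

    @dataclass
    class Mode0Input:
        # High-level metadata only, as used by P.1204.1 (bitstream Mode 0)
        codec: str            # e.g., "h264", "hevc", "vp9"
        bitrate_kbps: float
        framerate: float
        width: int
        height: int

    @dataclass
    class Mode1Input(Mode0Input):
        # P.1204.2 (bitstream Mode 1) additionally sees per-frame sizes and types,
        # which act as a lightweight proxy for content complexity.
        frame_sizes_bytes: List[int] = field(default_factory=list)
        frame_types: List[str] = field(default_factory=list)   # e.g., "I", "P", "B"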

For the latest model additions of P.1204.1 and P.1204.2, subsets of the databases initially used in the P.NATS Phase 2 project were employed for model training. Two different datasets belonging to the two contexts PC/TV and MO/TA were used for training the models. AVT-PNATS-UHD-1 is the dataset for the PC/TV use case and ERCS-PNATS-UHD-1 the dataset used for the MO/TA use case. 

AVT-PNATS-UHD-1 [7] consists of four different subjective tests conducted by TU Ilmenau as part of the P.NATS Phase 2 competition. The target resolution of this dataset was 3840 x 2160 pixels. ERCS-PNATS-UHD-1 [1] is a dataset targeting the MO/TA use case. It consists of one subjective test conducted by Ericsson as part of the P.NATS Phase 2 competition. The target resolution of this dataset was 2560 x 1440 pixels.

For model performance evaluation, beyond AVT-PNATS-UHD-1, further externally available video-quality test databases were used, as outlined in the following.

AVT-VQDB-UHD-1: This is a publicly available dataset that consists of four different subjective tests, all of which had a full-factorial design. In total, 17 different SRCs with a duration of 7-10 s were used across the four tests. All sources had a resolution of 3840×2160 pixels and a framerate of 60 fps. For the HRC design, the bitrate was set to fixed (i.e., non-adaptive) values per PVS between 200 kbps and 40,000 kbps, the resolution varied between 360p and 2160p, and the framerate between 15 fps and 60 fps. In all tests, a 2-pass encoding approach was used to encode the videos, with the medium preset for H.264 and H.265, and the speed parameter for VP9 set to the default value “0”. A total of 104 participants took part in the four tests.
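As an illustration of the 2-pass encoding approach described above, the following sketch drives a two-pass H.264 encode with ffmpeg from Python; the file names and the chosen bitrate/resolution point are hypothetical, and these are not necessarily the exact commands used by the dataset authors.

    import subprocess

    SRC = "source_2160p60.mp4"                      # hypothetical source file
    TARGET = {"bitrate": "4000k", "width": 1920, "height": 1080, "fps": 60}

    common = [
        "ffmpeg", "-y", "-i", SRC,
        "-c:v", "libx264", "-preset", "medium",      # medium preset, as in the dataset description
        "-b:v", TARGET["bitrate"],
        "-vf", f"scale={TARGET['width']}:{TARGET['height']}",
        "-r", str(TARGET["fps"]),
        "-an",
    ]
    # First pass only collects rate-control statistics; the output is discarded.
    subprocess.run(common + ["-pass", "1", "-f", "null", "/dev/null"], check=True)
    # Second pass performs the actual bitrate-controlled encode.
    subprocess.run(common + ["-pass", "2", "hrc_1080p60_4000kbps.mp4"], check=True)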

GVS: This dataset consists of 24 SRCs that have been extracted from 12 different games. The SRCs have a resolution of 1920×1080 pixels, a framerate of 30 fps, and a duration of 30 s. The HRC design included three different resolutions, namely 480p, 720p, and 1080p. 90 PVSs resulting from 15 bitrate-resolution pairs were used for subjective evaluation. A total of 25 participants rated all 90 PVSs.

KUGVD: Six SRCs out of the 24 SRCs from the GVS dataset were used to develop KUGVD. The same bitrate-resolution pairs from GVS were included to define the HRCs. In total, 90 PVSs were used in the subjective evaluation, and 17 participants took part in the test.

CGVDS:  This dataset consists of SRCs captured at 60fps from 15 different games. For designing the HRCs, three resolutions, namely, 480p, 720p and 1080p at three different framerates of 20, 30, and 60fps were considered. To ensure that the SRCs from all the games could be assessed by test subjects, the overall test was split into 5 different subjective tests, with a minimum of 72 PVSs being rated in each of the tests. A total of over 100 participants took part over the five different tests, with a minimum of 20 participants per test.

Twitch: The Twitch Dataset consists of 36 different games, with 6 games each representing one out of 6 pre-defined genres. The dataset consists of streams directly downloaded from Twitch. A total of 351 video sequences of approximately 50s duration across all representations were downloaded. 90 video sequences out of these 351 video sequences were selected for subjective evaluation. Only the first 30s of the chosen 90 PVSs were considered for subjective testing. Six different resolutions between 160p and 1080p at framerates of 30 and 60fps were used. 29 participants rated all the 90 PVSs.

BBQCG: This is the training dataset developed as part of the P.BBQCG work item. It consists of nine subjective test databases. Three of these nine test databases consisted of processed video sequences (PVSs) up to 1080p/120fps, and the remaining six had PVSs up to 4K/60fps. Three codecs, namely H.264, H.265, and AV1, were used to encode the videos. Overall, 900 different PVSs were created from 12 sources (SRCs) by encoding the SRCs with different encoding settings.

AVT-VQDB-UHD-1-VD: This dataset consists of 16 source contents encoded using a CRF-based encoding approach. Overall, 192 PVSs were generated by encoding all 16 sources in four resolutions, namely 360p, 720p, 1080p, and 2160p, with three CRF values (22, 30, 38) each. A total of 40 subjects participated in the study.

ITU-T P.1204.1 and P.1204.2 model prediction performance

The performance figures of the two new models, P.1204.1 and P.1204.2, on the different datasets are given in Table 1 (P.1204.1) and Table 2 (P.1204.2) below.

Table 1: Performance of P.1204.1 (Mode 0) on the evaluation datasets, in terms of Root Mean Square Error (RMSE, the measure used as the winning criterion in the ITU-T/VQEG modelling competition), Pearson Correlation Coefficient (PCC), Spearman Rank Correlation Coefficient (SRCC), and Kendall’s tau.
Dataset              RMSE                        PCC    SRCC   Kendall
AVT-VQDB-UHD-1       0.499                       0.890  0.877  0.684
KUGVD                0.840                       0.590  0.570  0.410
GVS                  0.690                       0.670  0.650  0.490
CGVDS                0.470                       0.780  0.750  0.560
Twitch               0.430                       0.920  0.890  0.710
BBQCG                0.598 (on a 7-point scale)  0.841  0.843  0.647
AVT-VQDB-UHD-1-VD    0.650                       0.814  0.813  0.617
Table 2: Performance of P.1204.2 (Mode 1) on the evaluation datasets, in terms of Root Mean Square Error (RMSE, the measure used as the winning criterion in the ITU-T/VQEG modelling competition), Pearson Correlation Coefficient (PCC), Spearman Rank Correlation Coefficient (SRCC), and Kendall’s tau.
Dataset              RMSE                        PCC    SRCC   Kendall
AVT-VQDB-UHD-1       0.476                       0.901  0.900  0.730
KUGVD                0.500                       0.870  0.860  0.690
GVS                  0.420                       0.890  0.870  0.710
CGVDS                0.360                       0.900  0.880  0.690
Twitch               0.370                       0.940  0.930  0.770
BBQCG                0.737 (on a 7-point scale)  0.745  0.746  0.547
AVT-VQDB-UHD-1-VD    0.598                       0.845  0.845  0.654

For all databases except BBQCG and KUGVD, the Mode 0 model P.1204.1 performs solidly, as shown in Table 1. With the information about frame types and sizes available to the Mode 1 model P.1204.2, performance improves considerably, as shown in Table 2. For performance results of the three previously standardized models, P.1204.3, P.1204.4 and P.1204.5, the reader is referred to [1] and the individual standards [4][5][6]. For the P.1204.3 model, complementary performance information is presented in, e.g., [2][7]. For P.1204.4, additional performance information is available in [8], including results for AV1, AVS2, and VVC.
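The figures of merit used in Tables 1 and 2 can be computed from paired predicted and subjective MOS values as sketched below (plain per-dataset metrics on the raw predictions; any additional mapping or significance analysis used in the official ITU-T/VQEG evaluation is not included).

    import numpy as np
    from scipy import stats

    def evaluate(predicted_mos, subjective_mos):
        """Return RMSE, PCC, SRCC and Kendall's tau for paired MOS vectors."""
        pred = np.asarray(predicted_mos, dtype=float)
        mos = np.asarray(subjective_mos, dtype=float)
        return {
            "RMSE": float(np.sqrt(np.mean((pred - mos) ** 2))),
            "PCC": stats.pearsonr(pred, mos)[0],
            "SRCC": stats.spearmanr(pred, mos)[0],
            "Kendall": stats.kendalltau(pred, mos)[0],
        }

    # Example with dummy numbers:
    # evaluate([3.1, 4.2, 2.0], [3.3, 4.0, 2.4])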

The following plots provide an illustration of how the new P.1204.1 Mode 0 model may be used. Here, bitrate-ladder-type graphs are presented, with the predicted Mean Opinion Score on a 5-point scale plotted over log bitrate.


Figures: Predicted MOS (5-point scale) over log bitrate for the codecs H.264, H.265, and VP9.
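A minimal sketch of how such bitrate-ladder plots can be generated is shown below; toy_mos() is a deliberately simplified placeholder curve, not the P.1204.1 model itself, and the half-saturation bitrates per resolution are assumed values for illustration only.

    import numpy as np
    import matplotlib.pyplot as plt

    def toy_mos(bitrate_kbps, half_sat_kbps):
        """Placeholder saturating quality curve over log bitrate -- NOT the P.1204.1 model."""
        x = np.log10(bitrate_kbps)
        h = np.log10(half_sat_kbps)
        return 1.0 + 4.0 / (1.0 + np.exp(-3.0 * (x - h)))

    bitrates = np.logspace(np.log10(200), np.log10(40000), 100)   # 200 kbps .. 40 Mbps
    ladder = {"1080p": 3000, "1440p": 6000, "2160p": 12000}       # assumed half-saturation bitrates
    for label, half_sat in ladder.items():
        plt.semilogx(bitrates, toy_mos(bitrates, half_sat), label=label)
    plt.xlabel("Bitrate (kbps, log scale)")
    plt.ylabel("Predicted MOS (1-5)")
    plt.legend()
    plt.show()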

Conclusions and Outlook

The P.1204 standard series now comprises the complete initially planned set of models, namely:

  • ITU-T P.1204.1: Bitstream Mode 0, i.e., metadata-based model with access to information about video codec, resolution, framerate and bitrate used.
  • ITU-T P.1204.2: Bitstream Mode 1, i.e., metadata-based model with access to information about video codec, resolution, framerate and bitrate used, plus information about video frame types and sizes.
  • ITU-T P.1204.3: Bitstream Mode 3 [1][2][3][7].
  • ITU-T P.1204.4: Pixel-based reduced- and full-reference [1][5][8].
  • ITU-T P.1204.5: Hybrid no-reference Mode 0 [1][6].

Extensions of some of these models beyond the initial scope of codecs (H.264/AVC, H.265/HEVC, VP9) have been included over the last few years. Here, P.1204.5 has been extended and P.1204.4 evaluated to also cover the AV1 video codec. Work in ITU-T SG12 (Q14/12) is ongoing to extend P.1204.1, P.1204.2 and P.1204.3 to newer codecs such as AV1, and all five models are planned to be extended to also cover VVC. It is noted that for P.1204.3, P.1204.4 and P.1204.5, long-term quality integration modules that generate per-session scores for streaming sessions of up to 5 min have also been described in Appendices of the respective Recommendations. For P.1204.1 and P.1204.2, this extension still has to be completed. Initial evaluations for similar Mode 0 and Mode 1 models that use the P.1204.3-type long-term integration can be found in [7].

References

[1] Raake, A., Borer, S., Satti, S.M., Gustafsson, J., Rao, R.R.R., Medagli, S., List, P., Göring, S., Lindero, D., Robitza, W. and Heikkilä, G., 2020. Multi-model standard for bitstream-, pixel-based and hybrid video quality assessment of UHD/4K: ITU-T P.1204. IEEE Access, 8, pp.193020-193049.
[2] Rao, R.R.R., Göring, S., List, P., Robitza, W., Feiten, B., Wüstenhagen, U. and Raake, A., 2020, May. Bitstream-based model standard for 4K/UHD: ITU-T P.1204.3 – Model details, evaluation, analysis and open source implementation. In 2020 Twelfth International Conference on Quality of Multimedia Experience (QoMEX) (pp. 1-6).
[3] ITU-T Rec. P.1204, 2025. Video quality assessment of streaming services over reliable transport for resolutions up to 4K. International Telecommunication Union (ITU-T), Geneva, Switzerland.
[4] ITU-T Rec. P.1204.3, 2020. Video quality assessment of streaming services over reliable transport for resolutions up to 4K with access to full bitstream information. International Telecommunication Union (ITU-T), Geneva, Switzerland.
[5] ITU-T Rec. P.1204.4, 2022. Video quality assessment of streaming services over reliable transport for resolutions up to 4K with access to full and reduced reference pixel information. International Telecommunication Union (ITU-T), Geneva, Switzerland.
[6] ITU-T Rec. P.1204.5, 2023. Video quality assessment of streaming services over reliable transport for resolutions up to 4K with access to transport and received pixel information. International Telecommunication Union (ITU-T), Geneva, Switzerland.
[7] Rao, R.R.R., Göring, S. and Raake, A., 2022. AVQBits – Adaptive video quality model based on bitstream information for various video applications. IEEE Access, 10, pp.80321-80351.
[8] Borer, S., 2022, September. Performance of ITU-T P.1204.4 on Video Encoded with AV1, AVS2, VVC. In 2022 14th International Conference on Quality of Multimedia Experience (QoMEX) (pp. 1-4).
[9] Winkler, S. and Mohandas, P., 2008. The evolution of video quality measurement: From PSNR to hybrid metrics. IEEE Transactions on Broadcasting, 54(3), pp.660-668.
[10] Wang, Z., Lu, L. and Bovik, A.C., 2004. Video quality assessment based on structural distortion measurement. Signal Processing: Image Communication, 19(2), pp.121-132.
[11] Li, Z., Aaron, A., Katsavounidis, I., Moorthy, A., and Manohara, M., 2016. Toward A Practical Perceptual Video Quality Metric, Netflix TechBlog.
[12] Li, Z., Swanson, K., Bampis, C., Krasula, L., and Aaron, A., 2020. Toward a Better Quality Metric for the Video Community, Netflix TechBlog.

MPEG Column: 152nd MPEG Meeting

The 152nd MPEG meeting took place in Geneva, Switzerland, from October 7 to October 11, 2025. The official MPEG press release can be found here. This column highlights key points from the meeting, amended with research aspects relevant to the ACM SIGMM community:

  • MPEG Systems received an Emmy® Award for the Common Media Application Format (CMAF). A separate press release regarding this achievement is available here.
  • JVET ratified new editions of VSEI, VVC, and HEVC
  • The fourth edition of Visual Volumetric Video-based Coding (V3C and V-PCC) has been finalized
  • Responses to the call for evidence on video compression with capability beyond VVC successfully evaluated

MPEG Systems received an Emmy® Award for the Common Media Application Format (CMAF)

On September 18, 2025, the National Academy of Television Arts & Sciences (NATAS) announced that the MPEG Systems Working Group (ISO/IEC JTC 1/SC 29/WG 3) had been selected as a recipient of a Technology & Engineering Emmy® Award for standardizing the Common Media Application Format (CMAF). But what is CMAF? CMAF (ISO/IEC 23000-19) is a media format standard designed to simplify and unify video streaming workflows across different delivery protocols and devices. Here’s a structured overview. Before CMAF, streaming services often had to produce multiple container formats, i.e., the ISO Base Media File Format (ISOBMFF) for MPEG-DASH and the MPEG-2 Transport Stream (TS) for Apple HLS. This duplication resulted in additional encoding, packaging, and storage costs. I wrote a blog post about this some time ago here. CMAF’s main goal is to define a single, standardized segmented media format usable by both HLS and DASH, enabling “encode once, package once, deliver everywhere.”

The core concept of CMAF is that it is based on ISOBMFF, the foundation for MP4. Each CMAF stream consists of a CMAF header, CMAF media segments, and CMAF track files (a logical sequence of segments for one stream, e.g., video or audio). CMAF enables low-latency streaming by allowing progressive segment transfer, adopting chunked transfer encoding via CMAF chunks. CMAF defines interoperable profiles for codecs and presentation types for video, audio, and subtitles. Thanks to its compatibility with and adoption within existing streaming standards, CMAF bridges the gaps between DASH and HLS, creating a unified ecosystem.
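Since CMAF headers and segments are plain ISOBMFF structures, their layout can be inspected with a few lines of code. The following minimal sketch walks the top-level boxes of a CMAF/fMP4 file; the file name is hypothetical, and the box types mentioned in the comments are only what one would typically expect to see.

    import struct

    def iter_boxes(f):
        """Yield (box_type, size) for the top-level ISOBMFF boxes of a CMAF/fMP4 file."""
        while True:
            start = f.tell()
            header = f.read(8)
            if len(header) < 8:
                return
            size, box_type = struct.unpack(">I4s", header)
            if size == 1:                               # 64-bit "largesize" follows the type field
                size = struct.unpack(">Q", f.read(8))[0]
            elif size == 0:                             # box extends to the end of the file
                f.seek(0, 2)
                yield box_type.decode("ascii", "replace"), f.tell() - start
                return
            yield box_type.decode("ascii", "replace"), size
            f.seek(start + size)                        # skip the payload to the next box

    # Usage (hypothetical file name):
    # with open("video_segment.cmfv", "rb") as f:
    #     for name, size in iter_boxes(f):
    #         print(name, size)   # e.g., styp/moof/mdat for a segment, ftyp/moov for a CMAF header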

Research aspects include – but are not limited to – low-latency tuning (segment/chunk size trade-offs, HTTP/3, QUIC), Quality of Experience (QoE) impact of chunk-based adaptation, synchronization of live and interactive CMAF streams, edge-assisted CMAF caching and prediction, and interoperability testing and compliance tools.

JVET ratified new editions of VSEI, VVC, and HEVC

At its 40th meeting, the Joint Video Experts Team (JVET, ISO/IEC JTC 1/SC 29/WG 5) concluded the standardization work on the next editions of three key video coding standards, advancing them to the Final Draft International Standard (FDIS) stage. Corresponding twin-text versions have also been submitted to ITU-T for consent procedures. The finalized standards include:

  • Versatile Supplemental Enhancement Information (VSEI) — ISO/IEC 23002-7 | ITU-T Rec. H.274
  • Versatile Video Coding (VVC) — ISO/IEC 23090-3 | ITU-T Rec. H.266
  • High Efficiency Video Coding (HEVC) — ISO/IEC 23008-2 | ITU-T Rec. H.265

The primary focus of these new editions is the extension and refinement of Supplemental Enhancement Information (SEI) messages, which provide metadata and auxiliary data to support advanced processing, interpretation, and quality management of coded video streams.

The updated VSEI specification introduces both new and refined SEI message types supporting advanced use cases:

  • AI-driven processing: Extensions for neural-network-based post-filtering and film grain synthesis offer standardized signalling for machine learning components in decoding and rendering pipelines.
  • Semantic and multimodal content: New SEI messages describe infrared, X-ray, and other modality indicators, region packing, and object mask encoding, creating interoperability points for multimodal fusion and object-aware compression research.
  • Pipeline optimization: Messages defining processing order and post-processing nesting support research on joint encoder-decoder optimization and edge-cloud coordination in streaming architectures.
  • Authenticity and generative media: A new set of messages supports digital signature embedding and generative-AI-based face encoding, raising questions for the SIGMM community about trust, authenticity, and ethical AI in media pipelines.
  • Metadata and interpretability: New SEIs for text description, image format metadata, and AI usage restriction requests could facilitate research into explainable media, human-AI interaction, and regulatory compliance in multimedia systems.

All VSEI features are fully compatible with the new VVC edition, and most are also supported in HEVC. The new HEVC edition further refines its multi-view profiles, enabling more robust 3D and immersive video use cases.

Research aspects of these new standards’ editions can be summarized as follows: (i) Define new standardized interfaces between neural post-processing and conventional video coding, fostering reproducible and interoperable research on learned enhancement models. (ii) Encourage exploration of metadata-driven adaptation and QoE optimization using SEI-based signals in streaming systems. (iii) Open possibilities for cross-layer system research, connecting compression, transport, and AI-based decision layers. (iv) Introduce a formal foundation for authenticity verification, content provenance, and AI-generated media signalling, relevant to current debates on trustworthy multimedia.

These updates highlight how ongoing MPEG/ITU standardization is evolving toward a more AI-aware, multimodal, and semantically rich media ecosystem, providing fertile ground for experimental and applied research in multimedia systems, coding, and intelligent media delivery.

The fourth edition of Visual Volumetric Video-based Coding (V3C and V-PCC) has been finalized

MPEG Coding of 3D Graphics and Haptics (ISO/IEC JTC 1/SC 29/WG7) has advanced MPEG-I Part 5 – Visual Volumetric Video-based Coding (V3C and V-PCC) to the Final Draft International Standard (FDIS) stage, marking its fourth edition. This revision introduces major updates to the Video-based Coding of Volumetric Content (V3C) framework, particularly enabling support for an additional bitstream instance: V-DMC (Video-based Dynamic Mesh Compression).

Previously, V3C served as the structural foundation for V-PCC (Video-based Point Cloud Compression) and MIV (MPEG Immersive Video). The new edition extends this flexibility by allowing V-DMC integration, reinforcing V3C as a generic, extensible framework for volumetric and 3D video coding. All instances follow a shared principle, i.e., using conventional 2D video codecs (e.g., HEVC, VVC) for projection-based compression, complemented by specialized tools for mapping, geometry, and metadata handling.

While V-PCC remains co-specified within Part 5, MIV (Part 12) and V-DMC (Part 29) are standardized separately. The progression to FDIS confirms the technical maturity and architectural stability of the framework.

This evolution opens new research directions as follows: (i) Unified 3D content representation, enabling comparative evaluation of point cloud, mesh, and view-based methods under one coding architecture. (ii) Efficient use of 2D codecs for 3D media, raising questions on mapping optimization, distortion modeling, and geometry-texture compression. (iii) Dynamic and interactive volumetric streaming, relevant to AR/VR, telepresence, and immersive communication research.

The fourth edition of MPEG-I Part 5 thus positions V3C as a cornerstone for future volumetric, AI-assisted, and immersive video systems, bridging standardization and cutting-edge multimedia research.

Responses to the call for evidence on video compression with capability beyond VVC successfully evaluated

The Joint Video Experts Team (JVET, ISO/IEC JTC 1/SC 29/WG 5) has completed the evaluation of submissions to its Call for Evidence (CfE) on video compression with capability beyond VVC. The CfE investigated coding technologies that may surpass the performance of the current Versatile Video Coding (VVC) standard in compression efficiency, computational complexity, and extended functionality.

A total of five submissions were assessed, complemented by ECM16 reference encodings and VTM anchor sequences with multiple runtime variants. The evaluation addressed both compression capability and encoding runtime, as well as low-latency and error-resilience features. All technologies were derived from VTM, ECM, or NNVC frameworks, featuring modified encoder configurations and coding tools rather than entirely new architectures.

Key Findings

  • In the compression capability test, 76 out of 120 test cases showed at least one submission with a non-overlapping confidence interval compared to the VTM anchor. Several methods outperformed ECM16 in visual quality and achieved notable compression gains at lower complexity. Neural-network-based approaches demonstrated clear perceptual improvements, particularly for 8K HDR content, while gains were smaller for gaming scenarios.
  • In the encoding runtime test, significant improvements were observed even under strict complexity constraints: 37 of 60 test points (at both 1× and 0.2× runtime) showed statistically significant benefits over VTM. Some submissions achieved faster encoding than VTM, with only a 35% increase in decoder runtime.

Research Relevance and Outlook

The CfE results illustrate a maturing convergence between model-based and data-driven video coding, raising research questions highly relevant for the ACM SIGMM community:

  • How can learned prediction and filtering networks be integrated into standard codecs while preserving interoperability and runtime control?
  • What methodologies can best evaluate perceptual quality beyond PSNR, especially for HDR and immersive content?
  • How can complexity-quality trade-offs be optimized for diverse hardware and latency requirements?

Building on these outcomes, JVET is preparing a Call for Proposals (CfP) for the next-generation video coding standard, with a draft planned for early 2026 and evaluation through 2027. Upcoming activities include refining test material, adding Reference Picture Resampling (RPR), and forming a new ad hoc group on hardware implementation complexity.

For multimedia researchers, this CfE marks a pivotal step toward AI-assisted, complexity-adaptive, and perceptually optimized compression systems, which are considered a key frontier where codec standardization meets intelligent multimedia research.

The 153rd MPEG meeting will be held online from January 19 to January 23, 2026. Click here for more information about MPEG meetings and their developments.

It is All About the Experience… My Highlights of QoMEX 2025 in Madrid

Since my first QoMEX (international conference on Quality of Multimedia Experience) in 2015 (Costa Navarino, Greece), I have considered it my conference and the attendees, my research family. It has thus become my special yearly event to connect with familiar faces and meet the next generation of researchers in the field. This edition of QoMEX has brought together an outstanding program with very interesting keynotes, technical papers and demos (see https://qomex2025.itec.aau.at/ to check the full program). Moreover, it has been especially important for me both on a professional and a personal level. I would like to summarize my subjective Experience in 4 highlights:

Figure 1. Me explaining the working principles of eating 12 grapes at midnight for New Year, while walking through Puerta del Sol.

“Introducing my home city to my research family”

Madrid is my home “town”. A couple of times during the conference, one attendee or another asked me where I was from in Spain, and I proudly answered, “I am from here”. I spent the first 23 years of my life in Madrid before moving abroad for my professional career. Thus, while I am no longer literally a local, I can be considered as such. During the conference, I had the opportunity to share my view and love of Madrid with my work family. For me, this meant introducing my research family to my early life in Madrid.

“Paying tribute to Narciso García”

Figure 2. Narciso García posing with his (former) PhD students Marta Orduna, Pablo Pérez, Jesús Gutierrez and Carlos Cortés.

QoMEX 2025 also provided the opportunity to pay a well-deserved tribute to one of the two general chairs, Narciso García, on his retirement. Narciso has had an incredible impact, and not only on the Quality of Experience community. Plenty of researchers (including myself), in the community and beyond it, consider him a mentor and even their “spiritual guide”. Talking with Pablo Pérez (the other general co-chair) during the conference, he described Narciso as having a solution for every issue, independent of its size, complexity, or topic. Thank you, Narciso, for the insightful research discussions, the resourcefulness, the (history) chats, and simply for always being there for all of us.

“Mentoring the next generation of researchers”

At the final session of the conference, something very unexpected (and, in my opinion, very unusual) occurred. Attending the awards session is always exciting. On the one hand, you are 99.9% sure that you will not get any award. On the other hand, you always wonder “what if?”. This was definitely a “what if?” year for me. First, the Best Student Paper Award went to our work with my starting PhD student Gijs Fiten, a very interesting piece on locomotion in Virtual Reality. This was also his first conference, which made it even more special (both for him and for me). Before we had recovered from this first commotion, the Best Paper Award was announced. It went to Sam Van Damme, my former (and first) PhD student, for a collaborative work with CWI (Centrum Wiskunde & Informatica) in Amsterdam about shared mental models. Details of both papers can be found in the appendix.

Seeing students that I mentored (and supervised) grow and achieve important goals in their research careers was more gratifying than winning any award myself.

“QoE researchers can easily walk in others’ shoes”

Figure 3. Reflecting our thoughts and feelings on the decoration of the bag.

As the cherry on the cake that QoMEX 2025 was, I had the wonderful opportunity to organize, together with Marta Orduna (Nokia, Spain) and María Nava (Fundación Juan XXIII, Spain), a diversity and inclusion workshop at the Fundación Juan XXIII (https://qomex2025.itec.aau.at/workshop/ws-walking-in-their-shoes/). It took place on Friday, the 3rd of October. The Fundación (https://www.fundacionjuanxxiii.org/) is an organization that has been working for more than 55 years to promote the social and labor inclusion of people in situations of psychosocial vulnerability. With the help of their workers and users, we set up a workshop in which our researchers had to switch roles, becoming participants in a “hands-on” experience guided by people with different abilities. The activity consisted of manufacturing paper bags with the help and guidance of the experts of the Paper Lovers project (https://www.fundacionjuanxxiii.org/nuestros-proyectos).

Figure 4. Santi (on the left) is teaching Matteo (on the right) to manufacture a paper bag.

There was some initial insecurity and fear of the language barrier with our Spanish teachers. However, this passed quickly, and our QoE researchers adapted to the role of students and started manufacturing bags as if they had been doing it for the last 5 years. After the experience, our experts rated the quality of the bags with the typical paper-review grading (accept, major revision, minor revision, and reject). Finally, after lunch, with the expert guidance of Elena Marquez Segura (Universidad Carlos III), we reflected on the morning session and decorated our bags to express what we had learned about researching from an inclusive perspective. All in all, it was an experience outside the usual constraints that our research imposes and a very fitting ending to a wonderful week.

Special Thanks to Gijs Fiten (KU Leuven, Belgium), Sam Van Damme (Ghent University, Belgium), Marta Orduna (Nokia XR Lab, Spain), Martín Varela (Metosin, Finland), Karan Mitra (Luleå University of Technology, Sweden), Markus Fiedler (BTH, Sweden) and of course the organizing committee of QoMEX’25 led by Pablo Pérez (Nokia XR Lab, Spain) and Narciso García (ETSIT-UPM, Spain)


Appendix. Details of the Best papers Awards at QoMEX 2025

Best Student Paper Award

Redirected Walking for Multi-User eXtended Reality Experiences with Confined Physical Spaces
G. Fiten,  J. Chatterjee, K. Vanhaeren, M. Martens and M. Torres Vega
17th International Conference on Quality of Multimedia Experience (QoMEX), Madrid, Spain, 2025.

Figure 5. Gijs Fiten receiving the Best Student Paper Award.

EXtended Reality (XR) applications allow the user to explore nearly infinite virtual worlds in a truly immersive way. However, wandering through these Virtual Environments (VEs) while physically walking in reality is heavily constrained by the size of the Physical Environment (PE). Therefore, in recent years, different techniques have been devised to improve locomotion in XR. One of these is Redirected Walking (RDW), which aims to find a balance between immersion and PE requirements by steering users away from the boundaries of the PE while allowing for arbitrary motion in the VE. However, current RDW methods still require large PEs to avoid obstacles and other users. Moreover, they introduce unnatural alterations in the natural path of the user, which can trigger perception anomalies such as cybersickness or breaks of presence. These circumstances limit their usage in real-life scenarios. This paper introduces a novel RDW algorithm focused on allowing multiple users to explore an infinite VE in a confined space (6×6 m2). To evaluate it, we designed a multi-user Virtual Reality (VR) maze game and benchmarked it against the state-of-the-art. A subjective study (20 participants) was conducted, in which objective metrics, e.g., the path and the speed of the user, were combined with a subjective perception analysis in terms of cybersickness levels. Our results show that our method reduces the appearance of cybersickness in 80% of participants compared to the state-of-the-art. These findings show the applicability of RDW to multi-user VR in constrained environments.

Best Paper Award

From Individual QoE to Shared Mental Models: A Novel Evaluation Paradigm for Collaborative XR
S. Van Damme, J. Jansen, S. Rossi and P. Cesar
17th International Conference on Quality of Multimedia Experience (QoMEX), Madrid, Spain, 2025.

Figure 6. Sam Van Damme receiving the Best Paper Award.

Extended Reality (XR) systems are rapidly shifting from isolated, single-user applications towards collaborative and social multi-user experiences. To evaluate the quality and effectiveness of such interactions, it is therefore required to move beyond traditional individual metrics such as Quality of Experience (QoE) or Sense of Presence (SoP). Instead, group-level dynamics such as effective communication and coordination need to be encompassed to assess the shared understanding of goals and procedures. In psychology, this is referred to as a Shared Mental Model (SMM). The strength and congruence of such an SMM are known to be key for effective team collaboration and performance. In an immersive XR setting, though, novel Influence Factors (IFs) emerge that are not considered in a setting of physical co-location. Evaluations of the impact of these novel factors on SMM formation in XR, however, are close to non-existent. Therefore, this work proposes SMMs as a novel evaluation tool for collaborative and social XR experiences. To better understand how to explore this construct, we ran a prototypical experiment based on ITU recommendations in which the influence of asymmetric end-to-end latency was evaluated through a collaborative, two-user block-building task. The results show that, also in an XR context, strong SMM formation can take place even when collaborators have fundamentally different responsibilities and behavior. Moreover, the study confirms previous findings by showing, in an XR context, that a team’s SMM strength is positively associated with its performance.

JPEG Column: 108th JPEG Meeting in Daejeon, Republic of Korea

JPEG XE reaches Committee Draft stage at the 108th JPEG meeting

The 108th JPEG meeting was held in Daejeon, Republic of Korea, from 29 June to 4 July 2025.

During this meeting, the JPEG Committee finalised the Committee Draft of JPEG XE, an upcoming International Standard for the lossless coding of visual events, which has been sent for consultation to the ISO/IEC JTC1/SC29 national bodies. JPEG XE will be the first International Standard developed for the lossless representation and coding of visual events, and it is being developed under the auspices of ISO, IEC, and ITU.

Furthermore, the JPEG Committee was informed that the prestigious Joseph von Fraunhofer Prize 2025 was awarded to three JPEG Committee members Prof. Siegfried Fößel, Dr. Joachim Keinert and Dr. Thomas Richter, for their contributions to the development of the JPEG XS standard. The JPEG XS standard specifies a compression technology with very low latency at a low implementation complexity and with a very precise bit-rate control. A presentation video can be accessed here.

108th JPEG Meeting in Daejeon, Rep. of Korea.

The following sections summarise the main highlights of the 108th JPEG meeting:

  • JPEG XE Committee Draft sent for consultation
  • JPEG Trust second edition aligns with C2PA
  • JPEG AI parts 2, 3 and 4 proceed for publication as IS
  • JPEG DNA reaches DIS stage
  • JPEG AIC on Objective Image Quality Assessment
  • JPEG Pleno Learning-based Point Cloud Coding proceeds for publication as IS
  • JPEG XS Part 1 Amendment 1 proceeds to DIS stage
  • JPEG RF explores 3DGS coding and quality evaluation

JPEG XE

At the 108th JPEG Meeting, the Committee Draft of the first International Standard for lossless coding of events was issued and sent for consultation to the ISO/IEC JTC1/SC29 national bodies. JPEG XE is being developed under the auspices of ISO/IEC and ITU-T and aims to establish a robust and interoperable format for the efficient representation and coding of events in the context of machine vision and related applications. By reaching the Committee Draft stage, the JPEG Committee has attained a very important milestone. The Committee Draft was produced based on the five responses received to a Call for Proposals issued after the 104th JPEG Meeting held in July 2024. Two of the submissions meet the requirements for the constrained lossless coding of events and allow the implementation and operation of the coding model with limited resources, power, and complexity. The remaining three responses address the unconstrained coding mode and will be considered in a second phase of standardisation.

JPEG XE is the fruit of a joint effort between ISO/IEC JTC1/SC29/WG1 and ITU-T SG21, which is hoped to result in a widely supported JPEG XE standard, improving the potential compatibility and interoperability across applications, products, and services. Additionally, the JPEG Committee is in contact with the MIPI Alliance with the intention of developing a cross-compatible coding mode, allowing MIPI ESP signals to be decoded effectively by JPEG XE decoders.

The JPEG Committee remains committed to the development of a comprehensive and industry-aligned standard that meets the growing demand for event-based vision technologies. The collaborative approach between multiple standardisation organisations underscores a shared vision for a unified, international standard to accelerate innovation and interoperability in this emerging field.

JPEG Trust

The JPEG Committee completed the second edition of JPEG Trust Part 1: Core Foundation, which brings JPEG Trust into alignment with the updated C2PA specification 2.1 and integrates aspects of Intellectual Property Rights (IPR). This second edition is now approved as a Draft International Standard for submission to ISO/IEC balloting, with completion expected at the end of 2025.

Showcasing the adoption of JPEG Trust technology, JPEG Trust Part 4 – Reference software has now reached the Committee Draft stage.

Work continues on JPEG Trust Part 2: Trust profiles catalogue, a repository of Trust Profile and reporting snippets designed to assist implementers in constructing their Trust Profiles and Trust Reports, as well as JPEG Trust Part 3: Media asset watermarking.

JPEG AI

During the 108th JPEG meeting, JPEG AI Parts 2, 3, and 5 received positive DIS ballot results with only editorial comments, allowing them to proceed to publication as International Standards. These parts extend Part 1 by specifying stream and decoder profiles, reference software with usage documentation, and file format embedding for container formats such as ISOBMFF and HEIF.

The results from two Core Experiments were reviewed. The first evaluated gain map-based HDR coding, comparing it to simulcast methods and HEIC, while the second focused on implementing JPEG AI on smartphones using ONNX. Progressive decoding performance was assessed under channel truncation, and adaptive selection techniques were proposed to mitigate losses. Subjective and objective evaluations confirmed JPEG AI’s strong performance, often surpassing codecs such as VVC Intra, AVIF, JPEG XL, and performing comparably to ECM in informal viewing tests.

Another contribution explored compressed-domain image classification using latent representations, demonstrating competitive accuracy across bitrates. A proposal to limit tile splits in JPEG AI Part 2 was also discussed, and experiments identified Model 2 as the most robust and efficient default model for the levels with only one model at the decoder side.

JPEG DNA

During the 108th JPEG meeting, the JPEG Committee produced a study DIS text of JPEG DNA Part 1 (ISO/IEC 25508-1). The purpose of this text is to synchronise the current version of the Verification Model with the changes made to the Committee Draft document, reflecting the comments received from the consultation. The DIS balloting of Part 1 is scheduled to take place after the next JPEG meeting, starting in October 2025.

The JPEG Committee is also planning wet-lab experiments to validate that the current specification of the JPEG DNA satisfies the conditions required for applications using the current state of the art in DNA synthesis and sequencing, such as biochemical constraints, decodability, coverage rate, and the impact of error-correcting code on compression performance.

The goal still remains to reach International Standard (IS) status for Part 1 during 2026.

JPEG AIC

Part 4 of JPEG AIC deals with objective quality metrics for fine-grained assessment of high-fidelity compressed images. As of the 108th JPEG Meeting, the Call for Proposals on Objective Image Quality Assessment (JPEG AIC-4), which was launched in April 2025, has already resulted in four non-mandatory registrations of interest that were reviewed. In this JPEG meeting, the technical details regarding the evaluation of proposed metrics and of the anchor metrics were developed and finalised. The results have been integrated in the document “Common Test Conditions on Objective Image Quality Assessment v2.0”, available on the JPEG website. Moreover, the procedures to generate the evaluation image dataset were defined and will be carried out by JPEG experts. The responses to the Call for Proposals for JPEG AIC-4 are expected in September 2025, together with their application for the evaluation dataset, with the goal of creating a Working Draft of a new standard on objective quality assessment of high-fidelity images by April 2026.

JPEG Pleno

At the 108th JPEG meeting, significant progress was reported in the ongoing JPEG Pleno Quality Assessment activity for light fields. A Call for Proposals (CfP) on objective quality metrics for light fields is currently underway, with submissions to be evaluated using a new evaluation dataset. The JPEG Committee is also preparing the DIS of ISO/IEC 21794-7, which defines a standard for subjective quality assessment methodologies for light fields.

During the 108th JPEG meeting, the 2nd edition of ISO/IEC 21794-2 (“Plenoptic image coding system (JPEG Pleno) Part 2: Light field coding”) advanced to the Draft International Standard (DIS) stage. This 2nd edition includes the specification of a third coding mode entitled Slanted 4D Transform Mode and its associated profile.

The 108th JPEG meeting also saw the successful completion of the Final Draft International Standard balloting and the impending publication of ISO/IEC 21794-6: Learning-based Point Cloud Coding. This is the world’s first international standard on learning-based point cloud coding. The publication of Part 6 of ISO/IEC 21794 is a crucial and notable milestone in the representation of point clouds. The publication of the International Standard is expected to take place during the second half of 2025.

JPEG XS

The JPEG Committee advanced AMD 1 of JPEG XS Part 1 to the DIS stage; it allows the embedding of sub-frame metadata in JPEG XS as required by the augmented and virtual reality applications currently discussed within VESA. Part 5 3rd edition, which is the reference software of JPEG XS, was also approved for publication as an International Standard.

JPEG RF

During the 108th JPEG meeting, the JPEG Radiance Fields exploration advanced its work on discussing the procedures for reliable evaluation of potential proposals in the future, with a particular focus on refining subjective evaluation protocols. A key outcome was the initiation of Exploration Study 5, aimed at investigating how different test camera trajectories influence human perception during subjective quality assessment. The Common Test Conditions (CTC) document was also reviewed, with the subjective testing component remaining provisional pending the outcome of this exploration study. In addition, existing use cases and requirements for JPEG RF were re-examined, setting the stage for the development of revised drafts of both the Use Cases and Requirements document and the CTC. New mandates include conducting Exploration Study 5, revising documents, and expanding stakeholder engagement.

Final Quote

“The release of the Committee Draft of JPEG XE standard for lossless coding of events at the 108th JPEG meeting is an impressive achievement and will accelerate deployment of products and applications relying on visual events.” said Prof. Touradj Ebrahimi, the Convenor of the JPEG Committee.

Overview of Open Dataset Sessions and Benchmarking Competitions in 2023-2025 – Part 4 (ACM MMSys 2023, 2024, 2025)

Editors: Maria Torres Vega (KU Leuven, Belgium), Karel Fliegel (Czech Technical University in Prague, Czech Republic), Mihai Gabriel Constantin (University Politehnica of Bucharest, Romania),  

In this Dataset Column, we continue the tradition of the previous three columns by reviewing some of the notable events related to open datasets and benchmarking competitions in the field of multimedia in the years 2023, 2024 and 2025. This selection highlights the wide range of topics and datasets currently of interest to the community. Some of the events covered in this review include special sessions on open datasets and competitions featuring multimedia data. This review follows similar efforts from the previous editions.

This fourth column focuses on the last three editions of ACM Multimedia Systems (MMSys), i.e., 2023, 2024, and 2025:

ACM MMSys 2023

10 dataset papers were presented at the 14th ACM Multimedia Systems Conference (ACM MMSys’23), organized in Vancouver, Canada, June 7-10, 2023 (https://2023.acmmmsys.org/). The complete ACM MMSys’23 Proceedings are available in the ACM Digital Library (https://dl.acm.org/doi/proceedings/10.1145/3587819).

  1. Rhys Cox, S., et al., VOLVQAD: An MPEG V-PCC Volumetric Video Quality Assessment Dataset (paper available at: https://dl.acm.org/doi/10.1145/3587819.3592543; dataset available at: https://github.com/nus-vv-streams/volvqad-dataset).
    This is a volumetric video quality assessment dataset consisting of 7,680 ratings on 376 video sequences from 120 participants. The sequences are encoded with MPEG V-PCC using 4 different avatar models and 16 quality variations, and then rendered into test videos for quality assessment using 2 different background colors and 16 different quality switching patterns. 
  2. Prakash, N., et al., TotalDefMeme: A Multi-Attribute Meme dataset on Total Defence in Singapore (paper available at: https://dl.acm.org/doi/10.1145/3587819.3592545; dataset available at: https://gitlab.com/bottle_shop/meme/TotalDefMemes). TotalDefMeme is a large-scale multi-modal and multi-attribute meme dataset that captures public sentiments toward Singapore’s Total Defence policy. Besides supporting social informatics and public policy analysis of the Total Defence policy, TotalDefMeme can also support many downstream multi-modal machine learning tasks, such as aspect-based stance classification and multi-modal meme clustering.
  3. Sun, Y., et al., A Dynamic 3D Point Cloud Dataset for Immersive Applications (paper available at: https://dl.acm.org/doi/10.1145/3587819.3592546; dataset available on request to the authors). This dataset consists of synthetically generated objects with pre-determined motion patterns. It contains nine objects in three categories (shape, avatar, and textile) with different animation patterns.
  4. Raca, D., et al., 360 Video DASH Dataset (paper available at: https://dl.acm.org/doi/10.1145/3587819.3592548; dataset available at: https://github.com/darijo/360-Video-DASH-Dataset). This study introduces a software tool that offers a straightforward encoding platform to simplify the encoding of DASH VR videos. In addition, it includes a dataset composed of 9 VR videos encoded with seven tiling configurations, four segment durations, and four different bitrates.
  5. Hu, K., et al., FSVVD: A Dataset of Full Scene Volumetric Video ( paper available at: https://dl.acm.org/doi/10.1145/3587819.3592551, dataset available at: https://cuhksz-inml.github.io/full_scene_volumetric_video_dataset/). This dataset focuses on the current most widely used data format, point cloud, and for the first time, releases a full-scene volumetric video dataset that includes multiple people and their daily activities interacting with the external environments.
  6. Wu, Y., et al.,  A Dataset of Food Intake Activities Using Sensors with Heterogeneous Privacy Sensitivity Levels (paper available at: https://dl.acm.org/doi/10.1145/3587819.3592553; dataset available on request to the authors). This dataset compiles fine-grained food intake activities using sensors of heterogeneous privacy sensitivity levels, namely a mmWave radar, an RGB camera, and a depth camera. Solutions to recognize food intake activities can be developed using this dataset, which may provide a more comprehensive picture of the accuracy and privacy trade-offs involved with heterogeneous sensors.
  7. Soares da Costa, T., et al., A Dataset for User Visual Behaviour with Multi-View Video Content (paper available at: https://dl.acm.org/doi/10.1145/3587819.3592556; dataset available on request to the authors). This dataset, collected with a large-scale testbed, compiles head-movement tracking data obtained from 45 participants using an Intel RealSense F200 camera, with 7 video playlists, each being viewed a minimum of 17 times.
  8. Wei, Y., et al., A 6DoF VR Dataset of 3D Virtual World for Privacy-Preserving Approach and Utility-Privacy Tradeoff (paper available at: https://dl.acm.org/doi/10.1145/3587819.3592557; dataset available on request to the authors). This is a 6 degrees-of-freedom (6DoF) VR dataset of 3D virtual worlds for the investigation of privacy-preserving approaches and the utility-privacy tradeoff.
  9. Mohammed, A. et al., IDCIA: Immunocytochemistry Dataset for Cellular Image Analysis (paper available at: https://dl.acm.org/doi/10.1145/3587819.3592558; dataset available at: https://figshare.com/articles/dataset/Dataset/21970604). This dataset is a new annotated microscopic cellular image dataset to improve the effectiveness of machine learning methods for cellular image analysis. It includes microscopic images of cells, and for each image, the cell count and the location of individual cells. The data were collected as part of an ongoing study investigating the potential of electrical stimulation to modulate stem cell differentiation and possible applications for neural repair. 
  10. Al Shoura, T., et al., SEPE Dataset: 8K Video Sequences and Images for Analysis and Development (paper available at: https://dl.acm.org/doi/10.1145/3587819.3592560; dataset available at: https://github.com/talshoura/SEPE-8K-Dataset). The SEPE 8K dataset is made of 40 different 8K (8192 x 4320) video sequences and 40 different 8K (8192 x 5464) images. The proposed dataset is, as far as we know, the first to publish true 8K natural sequences; it is thus important for the next level of multimedia applications such as video quality assessment, super-resolution, video coding, video compression, and many more.

 ACM MMSys 2024

14 dataset papers were presented at the 15th ACM Multimedia Systems Conference (ACM MMSys’24), organized in Bari, Italy, April 15-18, 2024 (https://2024.acmmmsys.org/). The complete ACM MMSys’24 Proceedings are available in the ACM Digital Library (https://dl.acm.org/doi/proceedings/10.1145/3625468).

  1. Malon, T., et al., Ceasefire Hierarchical Weapon Dataset (paper available at: https://dl.acm.org/doi/10.1145/3625468.3653434; dataset available on request to the authors). The Ceasefire Hierarchical Weapon Dataset, an RGB image dataset of firearms tailored for fine-grained image classification, contains 260 classes ranging from 25 to hundreds of images per class, with a total of 40,789 images. In addition, a 4-level hierarchy (family, group, type, model) is provided and validated by forensic experts.
  2. Kassab, E.J., et al., TACDEC: Dataset for Automatic Tackle Detection in Soccer Game Videos (paper available at: https://dl.acm.org/doi/10.1145/3625468.3652166; dataset available on request to the authors). TACDEC is a dataset of tackle events in soccer game videos. By leveraging video data from the Norwegian Eliteserien league across multiple seasons, we annotated 425 videos with 4 types of tackle events, categorized into “tackle-live”, “tackle-replay”, “tackle-live-incomplete”, and “tackle-replay-incomplete”, yielding a total of 836 event annotations. 
  3. Zhao, J., Pan, J., LENS: A LEO Satellite Network Measurement Dataset (paper available at: https://dl.acm.org/doi/10.1145/3625468.3652170; dataset available at: https://github.com/clarkzjw/LENS). LENS is a LEO satellite network measurement dataset, collected from 13 Starlink dishes, associated with 7 Point-of-Presence (PoP) locations across 3 continents. The dataset currently consists of network latency traces from Starlink dishes with different hardware revisions, various service subscriptions and distinct sky obstruction ratios.
  4. Chen, B., et al., vRetention: A User Viewing Dataset for Popular Video Streaming Services (paper available at: https://dl.acm.org/doi/10.1145/3625468.3652175; dataset available at: https://github.com/flowtele/vRetention). This dataset collects 229178 audience retention curves from YouTube and Bilibili, offering a thorough view of viewer engagement and diverse watching styles. Our analysis reveals notable behavioral differences across countries, categories, and platforms.
  5. Xu , Y.,  et al., Panonut360: A Head and Eye Tracking Dataset for Panoramic Video (paper available at: https://dl.acm.org/doi/10.1145/3625468.3652176; dataset available at: https://dianvrlab.github.io/Panonut360/). This dataset presents head and eye trackings involving 50 users (25 males and 25 females) watching 15 panoramic videos (mostly in 4K). The dataset provides details on the viewport and gaze attention locations of users.
  6. Linder, S.,  et al., VEED: Video Encoding Energy and CO2 Emissions Dataset for AWS EC2 instances (paper available at: https://dl.acm.org/doi/10.1145/3625468.3652178; dataset available at: https://github.com/cd-athena/VEED-dataset). VEED is a FAIR Video Encoding Energy and CO2 Emissions Dataset for Amazon Web Services (AWS) EC2 instances. The dataset also contains the duration, CPU utilization, and cost of the encoding. 
  7. Tashtarian, F., et al., COCONUT: Content Consumption Energy Measurement Dataset for Adaptive Video Streaming (paper available at: https://dl.acm.org/doi/10.1145/3625468.3652179; dataset available at: https://athena.itec.aau.at/coconut/). The COCONUT dataset provides a COntent COnsumption eNergy measUrement daTaset for adaptive video streaming collected through a digital multimeter on various types of client devices, such as laptops and smartphones, streaming MPEG-DASH segments.
  8. Sarkhoosh, M. H., et al., The SoccerSum Dataset for Automated Detection, Segmentation, and Tracking of Objects on the Soccer Pitch (paper available at: https://dl.acm.org/doi/10.1145/3625468.3652180; dataset available at: https://zenodo.org/records/10612084). SoccerSum is a novel dataset aimed at enhancing object detection and segmentation in video frames depicting the soccer pitch, using footage from the Norwegian Eliteserien league across 2021-2023. It also includes the segmentation of key pitch areas such as the penalty and goal boxes for the same frame sequences. It comprises 750 frames annotated with 10 classes for advanced analysis. 
  9. Li, G., et al., A Driver Activity Dataset with Multiple RGB-D Cameras and mmWave Radars (paper available at: https://dl.acm.org/doi/10.1145/3625468.3652181; dataset available at: https://www.kaggle.com/datasets/guanhualee/driver-activity-dataset). This work introduces a novel dataset for fine-grained driver activities, utilizing diverse sensors such as mmWave radars, RGB, and depth cameras, each of which includes three camera angles: body, face, and hands. 
  10. Nguyen, M., et al., ComPEQ – MR: Compressed Point Cloud Dataset with Eye-tracking and Quality Assessment in Mixed Reality (paper available at: https://dl.acm.org/doi/10.1145/3625468.3652182; dataset available at: https://ftp.itec.aau.at/datasets/ComPEQ-MR/). This dataset comprises four compressed dynamic point clouds processed by Moving Picture Experts Group (MPEG) reference tools (i.e., VPCC and GPCC), each with 12 distortion levels. We also conducted subjective tests to assess the quality of the compressed point clouds with different levels of distortion. Additionally, eye-tracking data for visual saliency is included in this dataset, which is necessary to predict where people look when watching 3D videos in MR experiences. We collected opinion scores and eye-tracking data from 41 participants, resulting in 2132 responses and 164 visual attention maps in total. 
  11. Barone, N., et al., APEIRON: a Multimodal Drone Dataset Bridging Perception and Network Data in Outdoor Environments (paper available at: https://dl.acm.org/doi/10.1145/3625468.3652186; dataset available at: https://c3lab.github.io/Apeiron/). APEIRON is a rich multimodal aerial dataset that simultaneously collects perception data from a stereo camera and an event-based camera sensor, along with measurements of wireless network links obtained using an LTE module. The assembled dataset consists of both perception and network data, making it suitable for typical perception or communication applications, as well as cross-disciplinary applications that require both types of data.
  12. Baldoni, S., et al., Questset: A VR Dataset for Network and Quality of Experience Studies (paper available at: https://dl.acm.org/doi/10.1145/3625468.3652187; dataset available at: https://researchdata.cab.unipd.it/1179/). Questset contains over 40 hours of VR traces from 70 users playing commercially available video games, and includes both traffic data for network optimization, and movement and user experience data for cybersickness analysis. Therefore, Questset represents an enabler to jointly address the main VR challenges in the near future.
  13. Jabal, A. et al., StreetLens: An In-Vehicle Video Dataset for Public Facility Monitoring in Urban Streets (paper available at: https://dl.acm.org/doi/10.1145/3625468.3652188; dataset available on request to the authors). StreetLens is a new dataset of videos capturing urban streets with plentiful annotations for vision-based public facility monitoring. It includes four-and-a-half hours of videos recorded by smartphone cameras placed in moving vehicles in the suburbs of three different cities. 
  14. Brescia, W., et al., MilliNoise: a Millimeter-wave Radar Sparse Point Cloud Dataset in Indoor Scenarios (paper available at: https://dl.acm.org/doi/10.1145/3625468.3652189; dataset available at: https://github.com/c3lab/MilliNoise). MilliNoise is a point cloud dataset captured in indoor scenarios through a mmWave radar sensor installed on a wheeled mobile robot. Each of the 12M points in the MilliNoise dataset is accurately labeled as true/noise point by leveraging known information of the scenes and a motion capture system to obtain the ground truth position of the moving robot. Along with the dataset, we provide researchers with the tools to visualize the data and prepare it for statistical and machine learning analysis.

ACM MMSys 2025

Eight dataset papers were presented at the 16th ACM Multimedia Systems Conference (ACM MMSys’25), held in Stellenbosch, South Africa, March 31st to April 4th, 2025 (https://2025.acmmmsys.org/). The complete ACM MMSys’25 Proceedings are available in the ACM Digital Library (https://dl.acm.org/doi/proceedings/10.1145/3712676).

  1. Lechelek, L. et al., eCHFD: extended Ceasefire Hierarchical Firearm Dataset (paper available at: https://dl.acm.org/doi/10.1145/3712676.3718333; dataset available on request to the authors). This is the extended Ceasefire Hierarchical Firearm Dataset (eCHFD), a large image dataset of firearms consisting of over 93,000 images in 505 classes. It was constructed from more than 240 videos filmed at the Toulouse Forensics Laboratory (France) and further enriched with images from the existing CHFD dataset and additional downloaded images.
  2. Sarkhoosh, M. H. et al., HockeyAI: A Multi-Class Ice Hockey Dataset for Object Detection (paper available at: https://dl.acm.org/doi/10.1145/3712676.3718335; dataset available at: https://huggingface.co/SimulaMet-HOST/HockeyAI). HockeyAI is a novel open-source dataset specifically designed for multi-class object detection in ice hockey. It includes 2,101 high-resolution frames extracted from professional games in the Swedish Hockey League (SHL), annotated in the You Only Look Once (YOLO) format.
  3. Nguyen, M. et al., OLED-EQ: A Dataset for Assessing Video Quality and Energy Consumption in OLED TVs Across Varying Brightness Levels (paper available at: https://dl.acm.org/doi/abs/10.1145/3712676.3718337; dataset available at: https://github.com/minhkstn/OLED-EQ). The dataset comprises the energy data of four OLED TVs with different screen sizes and manufacturers while playing 176 videos covering a range of dark and bright content. As a result, 704 data traces of energy consumption were collected. It also includes subjective annotations (28 participants, resulting in 2,240 responses in total) of the quality of videos displayed on OLED TVs when brightness is reduced.
  4. Sarkhoosh, M. H. et al., HockeyRink: A Dataset for Precise Ice Hockey Rink Keypoint Mapping and Analytics (paper available at: https://dl.acm.org/doi/10.1145/3712676.3718338; dataset available at: https://huggingface.co/SimulaMet-HOST/HockeyRink). HockeyRink is a novel dataset comprising 56 meticulously annotated keypoints corresponding to significant landmarks on a standard hockey rink, including face-off dots, goalposts, and blue lines. 
  5. Sarkhoosh, M. H. et al., HockeyOrient: A Dataset for Ice Hockey Player Orientation Classification (paper available at: https://dl.acm.org/doi/10.1145/3712676.3718342; dataset available at: https://huggingface.co/datasets/SimulaMet-HOST/HockeyOrient). HockeyOrient is a novel dataset for classifying the orientation of ice hockey players based on their poses. The dataset comprises 9,700 manually annotated frames, selected randomly and non-sequentially, taken from Swedish Hockey League (SHL) games during the 2023 and 2024 seasons.
  6. Li, J. et al., PCVD: A Dataset of Point Cloud Video for Dynamic Human Interaction (paper available at: https://dl.acm.org/doi/10.1145/3712676.3718343; dataset available at: https://github.com/acmmmsys/2025-PCVD-A-Dataset-of-Point-Cloud-Video-for-Dynamic-Human-Interaction). PCVD is a point cloud video dataset captured with synchronized Azure Kinect cameras, designed to support tasks like denoising, segmentation, and motion recognition in single- and multi-person scenes. It provides high-quality depth and color data from diverse real-world scenes with human actions.
  7. Bhattacharya, A. et al., AMIS: An Audiovisual Dataset for Multimodal XR Research (paper available at: https://dl.acm.org/doi/10.1145/3712676.3718344; dataset available at: https://github.com/Telecommunication-Telemedia-Assessment/AMIS). The Audiovisual Multimodal Interaction Suite (AMIS) is an open-source dataset and accompanying Unity-based demo implementation designed to aid research on immersive media communication and social XR environments. It features synchronized audiovisual recordings of three actors performing monologues and participating in dyadic conversations across four modalities: talking-head videos, full-body videos, volumetric avatars, and personalized animated avatars. 
  8. Ouellette, J. et al., MazeLab: A Large-Scale Dynamic Volumetric Point Cloud Video Dataset With User Behavior Traces (paper available at: https://dl.acm.org/doi/10.1145/3712676.3718345; dataset available on request to the authors). MazeLab is a dynamic volumetric video dataset comprising a feature-rich point cloud representation of a large maze environment. It captures navigation traces from 15 participants interacting with 15 distinct maze variants, categorized into seven classes designed to elicit specific behavioral characteristics such as navigation patterns, attention hotspots, and interaction dynamics.

Summer School on Multimodal Foundation Models and Generative AI: Second Edition

Organizer: Prof. Mohamed Daoudi, Institut Mines-Télécom Nord Europe (IMT Nord Europe), France.

Co-Organizers: Prof. Ahmed Tamtaoui, Institut National des Postes et Télécommunications (INPT), Morocco; Prof. Mohamed Khalil (MoroccoAI), Morocco; Jamal Benhamou (Soft Center), Morocco.

In September 2025, the second edition of the Summer School on Multimodal Foundation Models and Generative AI was held. With the support of SIGMM, it attracted more than 60 students and young researchers to learn, discuss, and experiment first-hand with topics related to Generative AI. The event’s success calls for further editions in upcoming years.

The 2nd edition of the Summer School dedicated to Generative AI and Multimodal Foundation Models was held from September 8 to 12, 2025, in Rabat, Morocco. Over five days, 60 students, researchers, and professionals—selected from more than 1,300 applications—took part in an intensive program combining theoretical courses, hands-on workshops, keynote lectures, evening mentorship sessions, and a hackathon. Following the first edition in 2024, led by INPT, IMT Nord Europe, and the Soft Centre, this new edition was organized in partnership with MoroccoAI, an initiative led by AI experts in Morocco and abroad to promote the growth of AI across the country. This Summer School welcomed students and early-career researchers from Morocco, Germany, France, Italy, Tunisia, and other African countries, further strengthening its international reach. More than 50% of the participants were women.

We chose to offer a summer school with low registration fees (thanks to the additional support of SIGMM), so that as many students and young researchers from diverse backgrounds as possible could attend.

Invited speakers

The Summer School presented a distinguished lineup of speakers who bridge the gap between academic research and industry innovation. The carefully selected experts combined theoretical expertise with practical insights, offering participants a comprehensive understanding of AI’s current landscape and future directions.

  • Pioneering the Future of AI in E-Commerce: Foundation Models and Generative AI at Amazon, Dr. Amin Mantrach, Applied Science Manager, Amazon, Luxembourg
  • Geometric Deep Learning for Non-Rigid Shapes: From Theory to Practice, Dr. Emery Pierson, Researcher LIX, Ecole Polytechnique, France
  • Towards Detailed Understanding of the Visual World in Generative AI Era, Dr. Fahad Shahbaz Khan, MBZUAI, Abu Dhabi, United Arab Emirates
  • Design Thinking for Human-Centred AI Development, Dr. Houda Chakiri, Al Akhawayn University, Morocco
  • Leveraging AI for Sustainable Marine Ecosystems, Dr. Jihad Zahir, Cadi Ayyad University, Morocco
  • 4D Human Generation: past, present and the future, Prof. Mohamed Daoudi, IMT Nord Europe, France
  • Training LLMs: Optimize and Scale Your Training, Nouamane Tazi, ML Research Engineer, Hugging Face, France
  • Generating Synthetic Face and Body Models, Prof. Stefano Berretti, Department of Information Engineering of University of Firenze, Italy
  • From Generative to Agentic: The Next Era of Computing, Dr. Kaoutar El Maghraoui, Principal Research Scientist and Manager, IBM T.J. Watson Research Center, USA
  • From documents to structure: Hands-on exploration of agentic document extraction, Prof. Omar Souissi, INPT, Morocco
  • Efficient Speech Generative Modeling with Little Tokenization, Dr. Tatiana Likhomanenko, Staff Research Scientist, Apple, USA
  • From VAE to Diffusion: probabilistic learning with audio-visual data, Dr. Xavier Alameda-Pineda, Research Director, INRIA, France

Program

The program explored the theoretical and practical aspects of Multimodal Foundation Models and Generative AI, covering large-scale pre-trained models, multimodality (text, image, audio, etc.), and their applications across sectors. Evening sessions were dedicated to intensive mentorship, where participants worked on real-world projects under expert guidance.

On Wednesday, participants visited the Technopark in Casablanca as part of the Morocco Accelerator Program, where they had the opportunity to engage with innovative AI startups and learn about their projects. More information about the summer school on Multimodal Foundation Models and Generative AI is available on the webpage https://ai-summer-school.inpt.ac.ma/.

The final day showcased the teams’ talent with 15 project presentations from the hackathon, followed by the closing ceremony and award announcements:

  • First Prize – MyIris: A real-time multimodal navigation system for visually impaired individuals, integrating voice control and video analysis for safe guidance.
  • Second Prize – LALLACare: An innovative platform offering a low-cost alternative to mammography for early breast cancer screening, combining thermal imaging with Google’s MedGemma model for fast, accurate, and explainable assessments.
  • Special Jury Prize – Moun9idoun (AiDex): A multimodal AI solution dedicated to emergency management and community safety.

Acknowledgments: The organizers extend their sincere thanks to the INPT staff (reception, cafeteria, and accommodation), in particular Madame Leila Karakchou, for their invaluable support in facilitating the successful organization of this Summer School, to all the dedicated volunteers from the MoroccoAI association, and to Fatih Hamza from the Soft Center for his work in developing the event website.

VQEG Column: VQEG Meeting May 2025

Introduction

From May 5th to 9th, 2025, Meta hosted the plenary meeting of the Video Quality Experts Group (VQEG) at its headquarters in Menlo Park (CA, United States). Around 150 participants registered for the meeting, coming from industry and academic institutions from 26 different countries worldwide.

The meeting was dedicated to presenting updates and discussing topics related to the ongoing projects within VQEG. All the related information, minutes, and files from the meeting are available online on the VQEG meeting website, and video recordings of the meeting are available on YouTube.

All the topics mentioned below can be of interest to the SIGMM community working on quality assessment, but special attention can be devoted to the first activities of the group on Subjective and objective assessment of GenAI content (SOGAI) and to the advances on the contribution of the Immersive Media Group (IMG) to the International Telecommunication Union (ITU) towards the Rec. ITU-T P.IXC for the evaluation of Quality of Experience (QoE) of immersive interactive communication systems.

Readers of these columns who are interested in VQEG’s ongoing projects are encouraged to subscribe to the corresponding mailing lists to stay informed and get involved.

Group picture of the meeting

Overview of VQEG Projects

Immersive Media Group (IMG)

The IMG group researches the quality assessment of immersive media technologies. Currently, the main joint activity of the group is the development of a test plan to evaluate the QoE of immersive interactive communication systems, which is carried out in collaboration with ITU-T through the work item P.IXC. In this meeting, Pablo Pérez (Nokia XR Lab, Spain), Marta Orduna (Nokia XR Lab, Spain), and Jesús Gutiérrez (Universidad Politécnica de Madrid, Spain) presented the status of this recommendation and the next steps to be addressed towards a new contribution to ITU-T at its next meeting in September 2025. Also, in this meeting, it was decided that Marta Orduna will replace Pablo Pérez as vice-chair of IMG. In addition, several presentations related to IMG topics were delivered.

Statistical Analysis Methods (SAM)

The SAM group investigates analysis methods both for the results of subjective experiments and for objective quality models and metrics. Several presentations related to these topics were delivered during the meeting.

Joint Effort Group (JEG) – Hybrid

The JEG-Hybrid group addresses several areas of Video Quality Assessment (VQA), such as the creation of a large dataset for training quality models using full-reference metrics instead of subjective scores. The chair of this group, Enrico Masala (Politecnico di Torino, Italy), presented the updates on the latest activities of the group, including the current results of the Implementer’s Guide for Video Quality Metrics (IGVQM) project. In addition, further presentations related to these activities were delivered.

Emerging Technologies Group (ETG)

The ETG group focuses on various aspects of multimedia that, although they are not necessarily directly related to “video quality”, can indirectly impact the work carried out within VQEG and are not addressed by any of the existing VQEG groups. In particular, this group aims to provide a common platform for people to gather together and discuss new emerging topics, possible collaborations in the form of joint survey papers, funding proposals, etc. In this sense, the following topics were presented and discussed in the meeting:

  • Avinab Saha (UT Austin, United States) presented FaceExpressions-70k, a dataset of perceived expression differences that contains 70,500 subjective expression comparisons rated by over 1,000 study participants recruited via crowdsourcing.
  • Mathias Wien (RWTH Aachen University, Germany) reported on recent developments in MPEG AG 5 and JVET for preparations towards a Call for Evidence (CfE) on video compression with capability beyond VVC.
  • Effrosyni Doutsi (Foundation for Research and Technology – Hellas, Greece) presented her research on novel evaluation frameworks for spike-based compression mechanisms.
  • David Ronca (Meta Platforms Inc., United States) presented the Video Codec Acid Test (VCAT), a benchmarking tool for hardware and software decoders on Android devices.

Subjective and objective assessment of GenAI content (SOGAI)

The SOGAI group seeks to standardize both subjective testing methodologies and objective metrics for assessing the quality of GenAI-generated content. In this first meeting of the group since its foundation, the following topics were presented and discussed:

  • Ryan Lei and Qi Cai (Meta Platforms Inc., United States) presented their work on learning from subjective evaluation of Super Resolution (SR) in production use cases at scale, which included extensive benchmarking tests and subjective evaluation with external crowdsourcing vendors.
  • Ioannis Katsavounidis, Qi Cai, Elias Kokkinis, Shankar Regunathan (Meta Platforms Inc., United States) presented their work on learning from synergistic subjective/objective evaluation of auto dubbing in production use cases.
  • Kamil Koniuch (AGH University of Krakow, Poland) presented his research on a cognitive perspective on Absolute Category Rating (ACR) scale tests.
  • Patrick Le Callet (Nantes Université, France) presented his work, in collaboration with researchers from SJTU (China), on perceptual quality assessment of AI-generated omnidirectional images, including the annotated dataset called AIGCOIQA2024.

Multimedia Experience and Human Factors (MEHF)

The MEHF group focuses on the human factors influencing audiovisual and multimedia experiences, facilitating a comprehensive understanding of how human factors impact the perceived quality of multimedia content. Several presentations on these topics were given during the meeting.

5G Key Performance Indicators (5GKPI)

The 5GKPI group studies the relationship between key performance indicators of new 5G networks and the QoE of video services running on top of them. In this meeting, Pablo Pérez (Nokia XR Lab, Spain) and the rest of the team presented a first draft of the VQEG Whitepaper on QoE management in telecommunication networks, which shares insights and recommendations on actionable controls and performance metrics that Content Application Providers (CAPs) and Network Service Providers (NSPs) can use to infer, measure, and manage QoE.

In addition, Pablo Pérez (Nokia XR Lab, Spain), Marta Orduna (Nokia XR Lab, Spain), and Kamil Koniuch (AGH University of Krakow, Poland) presented design guidelines and a proposal for a simple but practical QoE model for communication networks, with a focus on 5G/6G compatibility.

Quality Assessment for Health Applications (QAH)

The QAH group is focused on the quality assessment of health applications. It addresses subjective evaluation, generation of datasets, development of objective metrics, and task-based approaches. In this meeting, Lumi Xia (INSA Rennes, France) presented her research on task-based medical image quality assessment by means of a numerical observer.

Other updates

Apart from this, Ajit Ninan (Meta Platforms Inc., United States) delivered a keynote on rethinking visual quality for perceptual display; a panel moderated by Narciso García (Universidad Politécnica de Madrid, Spain) with Christos Bampis (Netflix, United States), Denise Noyes (Meta Platforms Inc., United States), and Yilin Wang (Google, United States) addressed what remains to be done in optimizing video quality for adaptive streaming applications; and there was a co-located ITU-T Q19 interim meeting. In addition, although no updates were presented in this meeting, the groups on No Reference Metrics (NORM) and on Quality Assessment for Computer Vision Applications (QACoViA) are still active.

Finally, as already announced in the VQEG website, the next VQEG plenary meeting will be online or hybrid online/in-person, probably in November or December 2025.

Students Report from ACM MMsys 2025

The 16th ACM Multimedia Systems Conference (with the associated workshops NOSSDAV 2025 and MMVE 2025) was held from March 31st to April 4th, 2025, in Stellenbosch, South Africa. By choosing this location, the steering committee marked a milestone for SIGMM: MMSys became the very first SIGMM conference to take place on the African continent. This perfectly aligns with SIGMM’s ongoing mission to build an inclusive and globally representative multimedia‑systems community.

The MMSys conference brings together researchers in multimedia systems to showcase and exchange their cutting-edge research findings. Once again, there were technical talks spanning various multimedia domains and inspiring keynote presentations.

Recognising the importance of in‑person exchange—especially for early‑career researchers—SIGMM once again funded Student Travel Grants. This support enabled a group of doctoral students to attend the conference, present their work and start building their international peer networks.
In this column, the recipients of the travel grants share their experiences at MMSys 2025.

Guodong Chen – PhD student, Northeastern University, USA 

What an incredible experience attending ACM MMSys 2025 in South Africa! Huge thanks to SIGMM for the travel grant that made this possible. 

It was an honour to present our paper, “TVMC: Time-Varying Mesh Compression Using Volume-Tracked Reference Meshes”, and I’m so happy that it received the Best Reproducible Paper Award! 

MMSys is not that huge, but it’s truly great. It’s exceptionally well-organized, and what impressed me the most was the openness and enthusiasm of the community. Everyone is eager to communicate, exchange ideas, and dive deep into cutting-edge multimedia systems research. I made many new friends and discovered exciting overlaps between my research and the work of other groups. I believe many collaborations are on the way and that, to me, is the true mark of a successful conference. 

Besides the conference, South Africa was amazing; don’t miss the wonderful wines of Stellenbosch and the unforgettable experience of a safari tour.

Lea Brzica – PhD student, University of Zagreb, Croatia

Attending MMSys’25 in Stellenbosch, South Africa was an unforgettable and inspiring experience. As a new PhD student and early-career researcher, this was not only my first in-person conference but also my first time presenting. I was honoured to share my work, “Analysis of User Experience and Task Performance in a Multi-User Cross-Reality Virtual Object Manipulation Task,” and excited to see genuine interest from other attendees.
Beyond the workshop and technical sessions, I thoroughly enjoyed the keynotes and panel discussions. The poster sessions and demos were great opportunities to explore new ideas and engage with people from all over the world.
One of the most meaningful aspects of the conference was the opportunity to meet fellow PhD students and researchers face-to-face. The coffee breaks and social activities created a welcoming atmosphere that made it easy to form new connections.

I am truly grateful to SIGMM for supporting my participation. The travel grant helped alleviate the financial burden of international travel and made this experience possible. I’m already hoping for the chance to come back and be part of it all over again!

Jérémy Ouellette – PhD student, Concordia University, Canada

My time at MMSys 2025 was an incredibly rewarding experience. It was great meeting so many interesting and passionate people in the field, and the reception was both enthusiastic and exceptionally well organized. I want to sincerely thank SIGMM for the travel grant, as their support made it possible for me to attend and present my work. South Africa was an amazing destination, and the entire experience was both professionally and personally unforgettable. MMSys was also the perfect environment for networking, offering countless opportunities to connect with researchers and industry experts. It was truly exciting to see so much interest in my work and to engage in meaningful conversations with others in the multimedia systems community.

The 3rd Edition of Spring School on Social XR organised by CWI

The 3rd edition of the Spring School on Social XR, organised by the Distributed and Interactive Systems (DIS) group at CWI in Amsterdam, took place from 7 to 10 April 2025. The event attracted 30 students from different disciplines (technology, social sciences, and humanities) and from countries around the world (across Europe, but also Canada and the USA). The event was organized by Silvia Rossi, Irene Viola, Thomas Röggla, and Pablo Cesar from CWI, and Omar Niamut from TNO. Also this year, it was co-sponsored by ACM SIGMM, thanks to the funding for Special Initiatives, and for the first time it was recognised as an ACM Europe Council Seasonal School.

Students and organisers of the 3rd Spring School on Social XR

Across 9 lectures (4 of them open to the public) and three hands‑on workshops led by 14 international instructors, participants had the opportunity for cross-domain interactions on Social XR. Sessions ranged from photorealistic avatar capture and behaviour modelling, through AI‑driven volumetric‑video production, low‑latency streaming and novel rendering techniques, to rigorous QoE evaluation frameworks and open immersive‑media datasets. A new thematic topic this year tackled the privacy, security and UX challenges that arise when immersive systems move from lab prototypes to real‑world communication platforms. Together, the sessions provided a holistic perspective, helping participants to better understand the area and to initiate a network of collaboration to overcome the limitations of current real-time conferencing systems. A unique feature of the school is its Open Days, where selected keynotes are made publicly accessible both in person and via live streaming, ensuring broader engagement with the XR research community. In addition to theoretical and hands-on sessions, the school supports networking and discussions through dedicated events, including a poster presentation where participants can receive feedback from peers and experts in the field of Social XR.

Students during a boat trip.

The list of talks was:

  • “The Multiple Dimensions of Social in Social XR” by Sun Joo (Grace) Ahn (University of Georgia, USA) 
  • “Shaping VR Experiences: Designing Applications and Experiences for Quality of Experience Assessment” by Marco Carli (Universitá degli Studi Roma TRE, Italy) 
  • “Making a Virtual Reality” by Elmar Eisemann (TU Delft, The Netherlands) 
  • “Robotic Avatar Mediated Social Interaction” by Jan van Erp (TNO & University of Twente, The Netherlands) 
  • “Novel Opportunities and Emerging Risks of Social Virtual Reality Spaces for Online Interactions” by Guo Freeman (Clemson University, USA) 
  • “Privacy, Security and UX Challenges in (Social) XR: an Overview” by Katrien de Moor (NTNU, Norway)
  •  “AI-based Volumetric Content Creation for Immersive XR Experiences and Production Workflows” by Aljosa Smolic (Hochschule Luzern, Switzerland)
  • “Changing Habits, One Experience at a Time” by Funda Yildirim (University of Twente, The Netherlands) 
  • “Challenge-Driven Quality Evaluation and Dataset Development for Immersive Visual Experiences” by Emin Zerman (Mid Sweden University, Sweden) 

The list of workshops was:

  • “Cooperative Development of Social XR Evaluation Methods” by Jesús Gutiérrez (Universidad Politecnica de Madrid, Spain) and Pablo Pérez (Nokia XR Labs, Spain) 
  • “From Principle to Practice: Public Values in Action” by Mariëtte van Huijstee (Rathenau Institute, The Netherlands) and Paulien Dresscher (PublicSpaces, The Netherlands) 
  • “Interoperability: What is a Visual Positioning System and Why an Open Source One and Interoperability Between These Systems Need to be Established” by Alina Kadlubsky (Open AR Cloud Europe, Germany)

JPEG Column: 107th JPEG Meeting in Brussels, Belgium

JPEG assesses responses to its Call for Proposals on Lossless Coding of Visual Events

The 107th JPEG meeting was held in Brussels, Belgium, from April 12 to 18, 2025. During this meeting, the JPEG Committee assessed the responses to its Call for Proposals on JPEG XE, an International Standard for lossless coding of visual events. JPEG XE is being developed under the auspices of three major standardisation organisations: ISO, IEC, and ITU. It will be the first codec developed by the JPEG Committee targeting lossless representation and coding of visual events.

The JPEG Committee is also working on various standardisation projects, such as JPEG AI, which uses learning technology to achieve high compression, JPEG Trust, which sets standards to combat fake media and misinformation while rebuilding trust in multimedia, and JPEG DNA, which represents digital images using DNA sequences for long-term storage.

The following sections summarise the main highlights of the 107th JPEG meeting:

  • JPEG XE
  • JPEG AI
  • JPEG Trust
  • JPEG AIC
  • JPEG Pleno
  • JPEG DNA
  • JPEG XS
  • JPEG RF

JPEG XE

This initiative focuses on a new imaging modality produced by event-based visual sensors. This effort aims to establish a standard that efficiently represents and codes events, thereby enhancing interoperability in sensing, storage, and processing for machine vision and related applications.

As a response to the JPEG XE Final Call for Proposals on lossless coding of events, the JPEG Committee received five innovative proposals for consideration. Their evaluation indicated that two among them meet the stringent requirements of the constrained case, where resources, power, and complexity are severely limited. The remaining three proposals can cater to the unconstrained case. During the 107th JPEG meeting, the JPEG Committee launched a series of Core Experiments to define a path forward based on the received proposals as a starting point for the development of the JPEG XE standard.

To streamline the standardisation process, the JPEG Committee will proceed with the JPEG XE initiative in three distinct phases. Phase 1 will concentrate on lossless coding for the constrained case, while Phase 2 will address the unconstrained case. Both phases will commence simultaneously, although Phase 1 will follow a faster timeline to enable a timely publication of the first edition of the standard. The JPEG Committee recognises the urgent industry demand for a standardised solution for the constrained case, aiming to produce a Committee Draft by as early as July 2025. The third phase will focus on lossy compression of event sequences. The discussions and preparations will be initiated soon.

In a significant collaborative effort between ISO/IEC JTC 1/SC 29/WG1 and ITU-T SG21, the JPEG Committee will proceed to specify a joint JPEG XE standard. This partnership will ensure that JPEG XE becomes a shared standard under ISO, IEC, and ITU-T, reflecting their mutual commitment to developing standards for event-based systems.

Additionally, the JPEG Committee is actively discussing lossy coding of visual events and exploring future evaluation methods for such advanced technologies. Stakeholders interested in JPEG XE are encouraged to access the public documents available at jpeg.org. Moreover, a joint Ad-hoc Group on event-based vision has been formed between ITU-T Q7/21 and ISO/IEC JTC1 SC29/WG1, paving the way for continued collaboration leading up to the 108th JPEG meeting.

JPEG AI

At the 107th JPEG meeting, JPEG AI discussions focused on conformance (JPEG AI Part 4), which has now advanced to the Draft International Standard (DIS) stage. The specification defines three conformance points: the decoded residual tensor, the decoded latent space tensor (also referred to as the feature space), and the decoded image. Strict conformance for the residual tensor is evaluated immediately after entropy decoding, while soft conformance for the latent space tensor is assessed after tensor decoding. The decoded image conformance is measured after converting the image to the output picture format, but before any post-processing filters are applied. Regarding the decoded image, two types have been defined: conformance Type A, which implies low tolerance, and conformance Type B, which allows for moderate tolerance.
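To make the ordering of these checks concrete, the minimal Python sketch below walks through the three conformance points in decoder order. It is an illustration only: the function name, data layout, and tolerance values are assumptions made for the example and are not taken from the JPEG AI specification, which defines the actual conformance criteria and thresholds.

```python
import numpy as np

def check_conformance_points(decoded, reference,
                             latent_tol=1e-3,       # assumed "soft" tolerance, illustrative only
                             image_tol_type_a=1.0,  # assumed low tolerance (Type A), illustrative only
                             image_tol_type_b=4.0): # assumed moderate tolerance (Type B), illustrative only
    """Hypothetical comparison of decoder outputs against reference data.

    `decoded` and `reference` are dicts of NumPy arrays under the keys
    'residual', 'latent', and 'image', mirroring the three conformance
    points described above. All tolerance values are placeholders.
    """
    report = {}

    # 1) Residual tensor: strict conformance, checked right after entropy decoding.
    report["residual_strict"] = np.array_equal(decoded["residual"], reference["residual"])

    # 2) Latent-space (feature) tensor: soft conformance, checked after tensor decoding.
    latent_err = np.max(np.abs(decoded["latent"].astype(np.float64) -
                               reference["latent"].astype(np.float64)))
    report["latent_soft"] = bool(latent_err <= latent_tol)

    # 3) Decoded image: measured after conversion to the output picture format,
    #    before any post-processing filters; Type A is stricter than Type B.
    image_err = np.max(np.abs(decoded["image"].astype(np.float64) -
                              reference["image"].astype(np.float64)))
    report["image_type_a"] = bool(image_err <= image_tol_type_a)
    report["image_type_b"] = bool(image_err <= image_tol_type_b)

    return report
```

The point of the sketch is simply where each comparison sits in the pipeline (after entropy decoding, after tensor decoding, after output-format conversion), not how the standard quantifies the tolerances.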

During the 107th JPEG meeting, the results of several subjective quality assessment experiments were also presented and discussed, using different methodologies and different test conditions, from low to very high qualities, including both SDR and HDR images. The results of these evaluations have shown that JPEG AI is highly competitive and, in many cases, outperforms existing state-of-the-art codecs such as VVC Intra, AVIF, and JPEG XL. A demonstration of a JPEG AI encoder running on a Huawei Mate50 Pro smartphone with a Qualcomm Snapdragon 8+ Gen1 chipset was also presented. This implementation provides tiling, high-resolution (4K) support, and a base profile with level 20. Finally, the implementation status of all mandatory and desirable JPEG AI requirements was discussed, assessing whether each requirement had been fully met, partially addressed, or remained unaddressed. This helped to clarify the current maturity of the standard and to identify areas for further refinement.

JPEG Trust

Building on the publication of JPEG Trust (ISO/IEC 21617) Part 1 – Core Foundation in January 2025, the JPEG Committee approved a Draft International Standard (DIS) for a 2nd edition of Part 1 – Core Foundation during the 107th JPEG meeting. This Part 1 – Core Foundation 2nd edition incorporates the signalling of identity and intellectual property rights to address three particular challenges:

  • achieving transparency, through the signaling of content provenance
  • identifying content that has been generated either by humans, machines or AI systems, and
  • enabling interoperability, for example, by standardising machine-readable terms of use of intellectual property, especially AI-related rights reservations.

Additionally, the JPEG Committee is currently developing Part 2 – Trust Profiles Catalogue. Part 2 provides a catalogue of trust profile snippets that can be used either on their own or in combination for the purpose of constructing trust profiles, which can then be used for assessing the trustworthiness of media assets in given usage scenarios. The Trust Profiles Catalogue also defines a collection of conformance points, which enables interoperability across usage scenarios through the use of associated trust profiles.

The Committee continues to develop JPEG Trust Part 3 – Media asset watermarking to build out additional requirements for identified use cases, including the emerging need to identify AI-generated content (AIGC).

Finally, during the 107th meeting, the JPEG Committee initiated Part 4 – Reference software, which will provide reference implementations of JPEG Trust that implementers can refer to when developing trust solutions based on the JPEG Trust framework.

JPEG AIC

The JPEG AIC Part 3 standard (ISO/IEC CD 29170-3) has received a revised title: “Information technology — JPEG AIC Assessment of image coding — Part 3: Subjective quality assessment of high-fidelity images”. At the 107th JPEG meeting, the results of the last Core Experiments for the standard and the comments on its Committee Draft were addressed. The draft text was thoroughly revised and clarified, and has now advanced to the Draft International Standard (DIS) stage.

Furthermore, Part 4 of JPEG AIC deals with objective quality metrics, also for high-fidelity images. At the 107th JPEG meeting, the technical details regarding anchor metrics, as well as the testing and evaluation of the proposed methods, were discussed and finalised. The results have been compiled in the document “Common Test Conditions on Objective Image Quality Assessment”, available on the JPEG website. Moreover, the corresponding Final Call for Proposals on Objective Image Quality Assessment (AIC-4) has been issued. Proposals are expected at the end of Summer 2025. The first Working Draft for Objective Image Quality Assessment (AIC-4) is planned for April 2026.

JPEG Pleno

The JPEG Pleno Light Field activity discussed the Disposition of Comments Report (DoCR) for the submitted Committee Draft (CD) of the 2nd edition of ISO/IEC 21794-2 (“Plenoptic image coding system (JPEG Pleno) Part 2: Light field coding”). This 2nd edition integrates AMD1 of ISO/IEC 21794-2 (“Profiles and levels for JPEG Pleno Light Field Coding”) and includes the specification of a third coding mode entitled Slanted 4D Transform Mode and its associated profile. It is expected that at the 108th JPEG meeting this new edition will advance to the Draft International Standard (DIS) stage.

Software tools have been created and tested to be added as Common Test Condition tools to the reference software implementation of the standardised technologies within the JPEG Pleno framework, including JPEG Pleno Part 2 (ISO/IEC 21794-2).

In the framework of the ongoing standardisation effort on quality assessment methodologies for light fields, significant progress was achieved during the 107th JPEG meeting. The JPEG Committee finalised the Committee Draft (CD) of the forthcoming standard ISO/IEC 21794-7 entitled JPEG Pleno Quality Assessment – Light Fields, representing an important step toward the establishment of reliable tools for evaluating the perceptual quality of light fields. This CD incorporates recent refinements to the subjective light field assessment framework and integrates insights from the latest core experiments.

The Committee also approved the Final Call for Proposals (CfP) on Objective Metrics for JPEG Pleno Quality Assessment – Light Fields. This initiative invites proposals for novel objective metrics capable of accurately predicting the perceived quality of compressed light field content. The detailed submission timeline and required proposal components are outlined in the released final CfP document. To support this process, updated versions of the Use Cases and Requirements (v6.0) and Common Test Conditions (v2.0) related to this CfP were reviewed and made available. Moreover, several task forces have been established to address key proposal elements, including dataset preparation, codec configuration, objective metric evaluation, and subjective experiments.

At this meeting, ISO/IEC 21794-6 (“Plenoptic image coding system (JPEG Pleno) Part 6: Learning-based point cloud coding”) progressed to the balloting of the Final Draft International Standard (FDIS) stage. Balloting will end on the 12th of June 2025, with the publication of the International Standard expected in August 2025.

The JPEG Committee held a workshop on Future Challenges in Compression of Holograms for XR Applications on April 16th, covering major applications from holographic cameras to holographic displays. A second workshop, on Future Challenges in Compression of Holograms for Metrology Applications, is planned for July.

JPEG DNA

The JPEG Committee continues to develop JPEG DNA, an ambitious initiative to standardise the representation of digital images using DNA sequences for long-term storage. Following a Call for Proposals launched at the 99th JPEG meeting, a Verification Model was established during the 102nd JPEG meeting and then refined through core experiments that led to the first Working Draft at the 103rd JPEG meeting.

New JPEG DNA logo.

At its 105th JPEG meeting, JPEG DNA was officially approved as a new ISO/IEC project (ISO/IEC 25508), structured into four parts: Core Coding System, Profiles and Levels, Reference Software, and Conformance. The Committee Draft (CD) of Part 1 was produced at the 106th JPEG meeting.

During the 107th JPEG meeting, the JPEG Committee reviewed the comments received on the CD of the JPEG DNA standard and prepared a Disposition of Comments Report (DoCR). The goal remains to reach International Standard (IS) status for Part 1 by April 2026.

On this occasion, the official JPEG DNA logo was also unveiled, marking a new milestone in the visibility and identity of the project.

JPEG XS

The development of the third edition of the JPEG XS standard is nearing its final stages, marking significant progress for the standardisation of high-performance video coding. Notably, Part 4, focusing on conformance testing, has been officially accepted by ISO and IEC for publication. Meanwhile, Part 5, which provides the reference software, is presently at the Draft International Standard (DIS) ballot stage.

In a move that underscores the commitment to accessibility and innovation in media technology, both Part 4 and Part 5 will be made publicly available as free standards. This decision is expected to facilitate widespread adoption and integration of JPEG XS in relevant industries and applications.

Looking to the future, the JPEG Committee is exploring enhancements to the JPEG XS standard, particularly in supporting a master-proxy stream feature. This feature enables a high-fidelity master video stream to be accompanied by a lower-resolution proxy stream, ensuring minimal overhead. Such functionalities are crucial in optimising broadcast and content production workflows.

JPEG RF

The JPEG RF activity issued the proceedings of the Joint JPEG/MPEG Workshop on Radiance Fields, which was held on the 31st of January and featured world-renowned speakers discussing Neural Radiance Fields (NeRF) and 3D Gaussian Splatting (3DGS) from the perspectives of academia, industry, and standardisation groups. Video recordings and all related material were made publicly available on the JPEG website. Moreover, an improved version of the JPEG RF State of the Art and Challenges document was proposed, including an updated review of coding techniques for radiance fields as well as newly identified use cases and requirements. The group also defined an exploration study to investigate protocols for subjective and objective quality assessment, which are considered crucial to advance this activity towards a coding standard for radiance fields.

Final Quote

“A cost-effective and interoperable event-based vision ecosystem requires an efficient coding standard. The JPEG Committee embraces this new challenge by initiating a new standardisation project to achieve this objective,” said Prof. Touradj Ebrahimi, the Convenor of the JPEG Committee.