Overview of Open Dataset Sessions and Benchmarking Competitions in 2023-2025 – Part 5 (ACM Multimedia 2023, 2024 and 2025)

In this Dataset Column, we continue the tradition of previous installments by reviewing notable developments in open datasets and benchmarking competitions in multimedia from 2023 to 2025. The selected events reflect the breadth of topics, challenges, and datasets currently shaping the multimedia research community. They include special sessions, grand challenges, competition tracks, and evaluation campaigns involving multimedia data. This Dataset Column extends a series of overviews previously published in ACM SIGMM Records.

This fifth column focuses on the last three editions of the ACM International Conference on Multimedia (ACM MM), one of the flagship conferences in the field, which has long served as a major venue for presenting multimedia benchmarks, open datasets, and community-driven evaluation campaigns.

  • MM ’23: The 31st ACM International Conference on Multimedia (Ottawa, Canada, October 29 – November 3, 2023)
  • MM ’24: The 32nd ACM International Conference on Multimedia (Melbourne, Australia, October 28 – November 1, 2024)
  • MM ’25: The 33rd ACM International Conference on Multimedia (Dublin, Ireland, October 27 – 31, 2025)

ACM Multimedia 2022 was reviewed in the Overview of Open Dataset Sessions and Benchmarking Competitions in 2022 – Part 2 (MDRE at MMM 2022, ACM MM 2022), and ACM Multimedia 2024 was reviewed in a general report.


The growing prominence of data-centric research in the multimedia community is also illustrated by the increasing frequency of the term “dataset” in ACM MM proceedings. In paper titles, the term appeared in 9 papers at MM ’22, rising to 22 at MM ’23, 28 at MM ’24, and 106 at MM ’25. In author keywords, the corresponding numbers were 35, 37, 47, and 104, while in abstracts they increased from 438 to 558, 687, and 869, respectively. Notably, among the MM ’25 papers containing the term “dataset”, many were also marked as Artifacts Available (47 in titles, 35 in keywords, and 68 in abstracts), indicating that related research artifacts had been made publicly accessible. Although this is only an approximate indicator, the trend suggests a growing emphasis on datasets, reproducibility, and open research practices within ACM MM.

Across MM ’23, MM ’24, and MM ’25, the term “dataset” appears in 156 paper titles, 188 author keyword lists, and 2,114 abstracts. Based on these three editions, we present a curated selection of 39 publicly accessible datasets – 10 from MM ’23, 10 from MM ’24, and 19 from MM ’25 – selected for their relevance to multimedia research, diversity of application domains, and potential for reuse by the community.
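The three-edition totals follow directly from the per-edition counts reported in this column. As a minimal sketch (the dictionary layout and variable names are illustrative and not part of any ACM tooling), the sums can be checked as follows:

```python
# Per-edition counts of the term "dataset", as reported in this column.
counts = {
    "titles":    {"MM23": 22,  "MM24": 28,  "MM25": 106},
    "keywords":  {"MM23": 37,  "MM24": 47,  "MM25": 104},
    "abstracts": {"MM23": 558, "MM24": 687, "MM25": 869},
}

# Sum each field over the three editions covered by this column.
totals = {field: sum(per_edition.values()) for field, per_edition in counts.items()}

for field, total in totals.items():
    print(f"{field}: {total}")
```

The same structure can be extended with the MM ’22 counts (9, 35, and 438) if a four-edition comparison is of interest.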


ACM MM 2023

Numerous dataset-related papers were presented at the 31st ACM International Conference on Multimedia (MM ’23), organized in Ottawa, Canada, October 29 – November 3, 2023. The complete MM ’23: Proceedings of the 31st ACM International Conference on Multimedia are available in the ACM Digital Library (https://dl.acm.org/doi/proceedings/10.1145/3581783).

There was no dedicated dataset session among roughly 33 sessions at the MM ’23 conference. As a small example, ten selected papers focused primarily on new datasets with publicly available data are listed below. Looking across the ACM MM ’23 papers containing “dataset” in the keyword field, the dominant focus is on creating benchmark resources for emerging multimedia tasks rather than merely applying existing datasets. The contributions span a broad range of multimedia domains, including image and video understanding, multimedia quality assessment, multimodal reasoning, user interaction analysis, emotional and social signal processing, immersive media, multimedia security, retrieval, and cross-modal learning. A clear trend is the coupling of dataset construction with benchmark protocols and baseline methods, reflecting the multimedia community’s increasing emphasis on reproducibility, comparative evaluation, and open research resources. From this broader set, the ten examples below were selected based primarily on scientific impact, current relevance, public availability, and representativeness across the core multimedia research themes traditionally associated with ACM MM.

Towards Explainable In-the-Wild Video Quality Assessment: A Database and a Language-Prompted Approach
Haoning Wu, Erli Zhang, Liang Liao, Chaofeng Chen, Jingwen Hou, Annan Wang, Wenxiu Sun, Qiong Yan, Weisi Lin
Paper available at: https://doi.org/10.1145/3581783.3611737
Dataset available at: https://github.com/VQAssessment/MaxVQA
This work introduces the Maxwell database, a major benchmark for explainable video quality assessment containing 4,543 in-the-wild videos and over two million subjective quality annotations across 13 dimensions. The dataset significantly advances multimedia quality assessment by enabling interpretable analysis of perceptual video quality beyond traditional scalar scoring.

Understanding User Behavior in Volumetric Video Watching: Dataset, Analysis and Prediction
Kaiyuan Hu, Haowen Yang, Yili Jin, Junhua Liu, Yongting Chen, Miao Zhang, Fangxin Wang
Paper available at: https://doi.org/10.1145/3581783.3613810
Dataset available at: https://cuhksz-inml.github.io/user-behavior-in-vv-watching/
This contribution presents one of the first public datasets for volumetric video interaction analysis, including gaze, viewport, and motion behavior. The dataset is highly relevant for immersive multimedia delivery, adaptive streaming, and Quality of Experience optimization in emerging interactive media environments.

TikTalk: A Video-Based Dialogue Dataset for Multi-Modal Chitchat in Real World
Hongpeng Lin, Ludan Ruan, Wenke Xia, Peiyu Liu, Jingyuan Wen, Yixin Xu, Di Hu, Ruihua Song, Wayne Xin Zhao, Qin Jin, Zhiwu Lu
Paper available at: https://doi.org/10.1145/3581783.3612425
Dataset available at: https://ruc-aimind.github.io/projects/TikTalk/
TikTalk provides a large-scale benchmark for video-grounded multimodal dialogue, containing 38,000 videos and 367,000 real-world user conversations. Its scale and realism make it particularly relevant for conversational multimedia AI and multimodal human-computer interaction.

MultiMediate ’23: Engagement Estimation and Bodily Behaviour Recognition in Social Interactions
Philipp Müller, Michal Balazia, Tobias Baur, Michael Dietz, Alexander Heimerl, Dominik Schiller, Mohammed Guermal, Dominike Thomas, François Brémond, Jan Alexandersson, Elisabeth André, Andreas Bulling
Paper available at: https://doi.org/10.1145/3581783.3613851
Dataset available at: https://multimediate-challenge.org/Dataset/
This contribution extends benchmark resources for engagement estimation and bodily behavior recognition in social interactions. The dataset is particularly relevant for multimedia analysis of human communication, behavioral understanding, and socially intelligent interactive systems.

Light-VQA: A Multi-Dimensional Quality Assessment Model for Low-Light Video Enhancement
Yunlong Dong, Xiaohong Liu, Yixuan Gao, Xunchu Zhou, Tao Tan, Guangtao Zhai
Paper available at: https://doi.org/10.1145/3581783.3611923
Dataset available at: https://github.com/wenzhouyidu/Light-VQA
This paper introduces LLVE-QA, a benchmark dataset specifically designed for evaluating perceptual quality in low-light video enhancement. As video enhancement becomes increasingly important in real-world multimedia applications, this dataset fills an important gap between enhancement algorithms and user-centered perceptual evaluation.

SemanticRT: A Large-Scale Dataset and Method for Robust Semantic Segmentation in Multispectral Images
Wei Ji, Jingjing Li, Cheng Bian, Zhicheng Zhang, Li Cheng
Paper available at: https://doi.org/10.1145/3581783.3611738
Dataset available at: https://github.com/jiwei0921/SemanticRT
SemanticRT introduces a large RGB-thermal image benchmark for robust semantic segmentation under adverse environmental conditions. With over 11,000 annotated multispectral image pairs, it provides an important resource for multimodal scene understanding and intelligent visual perception.

MORE: A Multimodal Object-Entity Relation Extraction Dataset with a Benchmark Evaluation
Liang He, Hongke Wang, Yongchang Cao, Zhen Wu, Jianbing Zhang, Xinyu Dai
Paper available at: https://doi.org/10.1145/3581783.3612209
Dataset available at: https://github.com/NJUNLP/MORE
MORE establishes a benchmark for multimodal relation extraction using jointly visual and textual evidence. The dataset addresses a growing need for structured reasoning across multimedia modalities and represents a strong contribution to multimodal understanding.

Ground-to-Aerial Person Search: Benchmark Dataset and Approach
Shizhou Zhang, Qingchun Yang, De Cheng, Yinghui Xing, Guoqiang Liang, Peng Wang, Yanning Zhang
Paper available at: https://doi.org/10.1145/3581783.3612105
Dataset available at: https://github.com/yqc123456/HKD_for_person_search
This paper introduces G2APS, a benchmark for cross-platform person search between UAV and ground surveillance imagery. The dataset is highly relevant for multimedia retrieval, intelligent surveillance, and cross-view visual matching applications.

MEDIC: A Multimodal Empathy Dataset in Counseling
Zhouan Zhu, Chenguang Li, Jicai Pan, Xin Li, Yufei Xiao, Yanan Chang, Feiyi Zheng, Shangfei Wang
Paper available at: https://doi.org/10.1145/3581783.3612346
Dataset available at: https://ustc-ac.github.io/datasets/medic/
MEDIC provides a multimodal benchmark for empathy analysis in face-to-face counseling interactions. The dataset expands multimedia affective computing toward emotionally intelligent systems and richer human-centered interaction modeling.

CCMB: A Large-scale Chinese Cross-modal Benchmark
Chunyu Xie, Heng Cai, Jincheng Li, Fanjing Kong, Xiaoyu Wu, Jianfei Song, Henrique Morimitsu, Lin Yao, Dexin Wang, Xiangzheng Zhang, Dawei Leng, Baochang Zhang, Xiangyang Ji, Yafeng Deng
Paper available at: https://doi.org/10.1145/3581783.3611877
Dataset available at: https://github.com/yuxie11/R2D2
CCMB contributes one of the largest publicly available multimodal vision-language benchmarks, supporting large-scale pretraining and downstream evaluation. Its scale and broad applicability make it a significant resource for multimodal multimedia learning.


ACM MM 2024

Numerous dataset-related papers were presented at the 32nd ACM International Conference on Multimedia (MM ’24), organized in Melbourne, Australia, October 28 – November 1, 2024. The complete MM ’24: Proceedings of the 32nd ACM International Conference on Multimedia are available in the ACM Digital Library (https://dl.acm.org/doi/proceedings/10.1145/3664647).

There were three specifically dedicated Dataset sessions among roughly 42 sessions at the MM ’24 conference: “Multimodal Datasets, Models & Analytics” (6 papers), “Datasets & Algorithms for Multimedia Analysis” (6 papers), and “Audio-visual Datasets and Applications” (5 papers).

Ten selected papers focused primarily on new datasets or dataset-driven benchmarks are listed below. Looking across the ACM MM ’24 dataset-session papers, the dominant focus shifts strongly toward multimodal foundation models, audiovisual understanding, media authenticity, video safety, and human-centered multimedia applications. Several contributions address emerging risks and opportunities created by generative AI, including deepfake detection, multimedia forgery, safety-aware video generation, and AI-generated image quality. Other works focus on video-centered understanding tasks, such as audio-visual event localization, hateful video detection, video dialogue, and multimodal stance detection. Compared with earlier dataset columns, MM ’24 reflects a clear trend toward datasets designed not only for recognition or retrieval, but also for reasoning, explanation, generation, safety, and robust real-world deployment.

AV-Deepfake1M: A Large-Scale LLM-Driven Audio-Visual Deepfake Dataset
Zhixi Cai, Shreya Ghosh, Aman Pankaj Adatia, Munawar Hayat, Abhinav Dhall, Tom Gedeon, Kalin Stefanov
Paper available at: https://doi.org/10.1145/3664647.3680795
Dataset available at: https://github.com/ControlNet/AV-Deepfake1M
AV-Deepfake1M provides more than one million videos with content-driven visual, audio, and audiovisual manipulations, supporting both detection and temporal localization of deepfake segments. Its scale and audiovisual nature make it highly relevant for multimedia forensics and trustworthy media analysis.

Identity-Driven Multimedia Forgery Detection via Reference Assistance
Junhao Xu, Jingjing Chen, Xue Song, Feng Han, Haijun Shan, Yu-Gang Jiang
Paper available at: https://doi.org/10.1145/3664647.3680622
Dataset available at: https://github.com/xyyandxyy/IDForge
IDForge introduces an identity-driven multimedia forgery dataset with video shots involving visual, audio, and textual manipulations, together with real reference data for celebrity identities. The paper is especially relevant because it reflects realistic identity-based forgery scenarios and connects dataset design with reference-assisted detection.

GPT4Video: A Unified Multimodal Large Language Model for Instruction-Followed Understanding and Safety-Aware Generation
Zhanyu Wang, Longyue Wang, Zhen Zhao, Minghao Wu, Chenyang Lyu, Huayang Li, Deng Cai, Luping Zhou, Shuming Shi, Zhaopeng Tu
Paper available at: https://doi.org/10.1145/3664647.3681464
GPT4Video contributes dataset resources for video instruction-following, benchmarking, and safety-aware video understanding and generation. It is relevant because it combines video comprehension, generation, and safeguarding, reflecting the growing role of dataset construction in multimodal large language models.

MultiHateClip: A Multilingual Benchmark Dataset for Hateful Video Detection on YouTube and Bilibili
Han Wang, Tan Rui Yang, Usman Naseem, Roy Ka-Wei Lee
Paper available at: https://doi.org/10.1145/3664647.3681521
MultiHateClip focuses on hateful video detection in multilingual and cross-cultural settings. It contains videos from YouTube and Bilibili annotated for hateful, offensive, and normal content, emphasizing the importance of visual, audio, language, and cultural signals in harmful multimedia analysis.

Open-Vocabulary Audio-Visual Semantic Segmentation
Ruohao Guo, Liao Qu, Dantong Niu, Yanyu Qi, Wenzhen Yue, Ji Shi, Bowei Xing, Xianghua Ying
Paper available at: https://doi.org/10.1145/3664647.3681586
Dataset available at: https://github.com/ruohaoguo/ovavss
This work introduces open-vocabulary audio-visual semantic segmentation and builds AVSBench-OV from AVSBench-semantic. It is highly relevant for open-world video understanding because it combines audio cues, visual segmentation, and zero-shot category recognition.

OpenAVE: Moving towards Open Set Audio-Visual Event Localization
Jiale Yu, Baopeng Zhang, Zhu Teng, Jianping Fan
Paper available at: https://doi.org/10.1145/3664647.3681232
Dataset available at: https://github.com/yujialele/OpenAVE
OpenAVE extends audio-visual event localization beyond closed-set recognition. It is important for real-world audiovisual video understanding, where systems must distinguish known events, unknown events, and background segments.

G-Refine: A General Quality Refiner for Text-to-Image Generation
Chunyi Li, Haoning Wu, Hongkun Hao, Zicheng Zhang, Tengchuan Kou, Chaofeng Chen, Lei Bai, Xiaohong Liu, Weisi Lin, Guangtao Zhai
Paper available at: https://doi.org/10.1145/3664647.3681152
Dataset available at: https://github.com/Q-Future/Q-Refine
G-Refine addresses quality refinement for AI-generated images, focusing on both perceptual quality and text-image alignment. It is a strong representative of growing interest in generative media quality, image quality assessment, and practical improvement of text-to-image outputs.

NovaChart: A Large-scale Dataset towards Chart Understanding and Generation of Multimodal Large Language Models
Linmei Hu, Duokang Wang, Yiming Pan, Jifan Yu, Yingxia Shao, Chong Feng, Liqiang Nie
Paper available at: https://doi.org/10.1145/3664647.3680790
Dataset available at: https://github.com/Elucidator-V/NovaChart
NovaChart provides 47,000 chart images and 856,000 chart-related instructions across multiple chart types and tasks. Although not video-centered, it is a strong image-and-language benchmark for visual reasoning, chart understanding, and chart generation.

Multimodal Multi-turn Conversation Stance Detection: A Challenge Dataset and Effective Model
Fuqiang Niu, Zebang Cheng, Xianghua Fu, Xiaojiang Peng, Genan Dai, Yin Chen, Hu Huang, Bowen Zhang
Paper available at: https://doi.org/10.1145/3664647.3681416
Dataset available at: https://github.com/nfq729/MmMtCSD
This paper introduces MmMtCSD, a dataset for multimodal multi-turn conversational stance detection. It is relevant for social multimedia analysis because it models realistic online discussions involving both text and images rather than isolated image-text pairs.

CT2C-QA: Multimodal Question Answering over Chinese Text, Table and Chart
Bowen Zhao, Tianhao Cheng, Yuejie Zhang, Ying Cheng, Rui Feng, Xiaobo Zhang
Paper available at: https://doi.org/10.1145/3664647.3681053
CT2C-QA introduces a multimodal question answering dataset over Chinese text, tables, and charts. It is relevant for evaluating whether multimodal systems can reason across heterogeneous information sources, including visual and structured data.


ACM MM 2025

Numerous dataset-related papers were presented at the 33rd ACM International Conference on Multimedia (MM ’25), organized in Dublin, Ireland, October 27 – 31, 2025. The complete MM ’25: Proceedings of the 33rd ACM International Conference on Multimedia are available in the ACM Digital Library (https://dl.acm.org/doi/proceedings/10.1145/3746027).

There was a specifically dedicated dataset session among roughly 26 sessions at the MM ’25 conference. This dataset track attracted 263 submissions, of which 123 were accepted. Considering the entire MM ’25 Proceedings, the term “dataset” appears in the title of 106 papers (28 in MM ’24), the keywords of 104 papers (47 in MM ’24), and the abstracts of 869 papers (687 in MM ’24). This substantial year-over-year increase highlights the growing centrality of datasets and benchmark creation in multimedia research, increasingly positioning dataset construction itself as a major scientific contribution rather than merely supporting experimental evaluation.
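To put these figures in proportion, the acceptance rate of the dataset track and the relative year-over-year changes can be derived from the numbers quoted above. The following is a small sketch; the variable names are illustrative, and the relative-change formula (new − old) / old is an assumed, conventional definition rather than one stated by the conference.

```python
# Dataset track figures quoted for MM '25.
submissions, accepted = 263, 123
acceptance_rate = accepted / submissions

# Occurrences of "dataset" as (MM '24, MM '25) pairs.
mentions = {
    "titles": (28, 106),
    "keywords": (47, 104),
    "abstracts": (687, 869),
}

# Relative change from MM '24 to MM '25, assumed as (new - old) / old.
yoy = {field: (new - old) / old for field, (old, new) in mentions.items()}

print(f"acceptance rate: {acceptance_rate:.1%}")
for field, change in yoy.items():
    print(f"{field}: {change:+.0%}")
```

Under these definitions, slightly less than half of the submissions were accepted, and the increase is far steeper in titles than in abstracts.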

The ACM MM ’25 dataset papers show a shift toward datasets as primary research contributions. The dominant themes include high-quality video resources for compression, streaming, enhancement, video quality assessment, and QoE; immersive and 3D media datasets for VR, spatial video, 3D Gaussian Splatting, point clouds, and volumetric applications; multimodal and vision-language datasets for reasoning, event grounding, image/video generation, and instruction following; and safety-oriented datasets addressing deepfakes, harmful videos, social engineering, and media authenticity. A second major cluster concerns domain-specific multimedia datasets in medicine, robotics, agriculture, food computing, biometrics, wildlife monitoring, and urban or engineering environments. Overall, the MM ’25 dataset papers reflect a clear expansion from traditional image/video recognition benchmarks toward open resources for generative media evaluation, trustworthy AI, embodied perception, subjective quality assessment, and real-world multimodal decision-making.

Among the ACM MM ’25 papers that explicitly mention datasets and provide publicly accessible artifacts, the following is a curated selection of 19 representative examples chosen for their relevance to the multimedia community, diversity of topics, and availability of reusable public datasets.

Screen Content Video Dataset and Benchmark
Nickolay Safonov, Mikhail Rakhmanov, Dmitriy S. Vatolin
Paper available at: https://doi.org/10.1145/3746027.3758306
Dataset available at: https://videoprocessing.github.io/screen-content-dataset
This dataset focuses on screen-content video scenarios such as screen sharing, desktop streaming, and video conferencing, providing a large benchmark with subjective quality annotations for distorted content. The work is especially relevant for multimedia quality assessment, video compression research, and perceptual QoE modeling in increasingly important screen-based communication environments.

Nature-1k: The Raw Beauty of Nature in 4K at 60FPS
Mohammad Ghasempour, Hadi Amirpour, Christian Timmerer
Paper available at: https://doi.org/10.1145/3746027.3758258
Dataset available at: https://cd-athena.github.io/Nature-1k
Nature-1k provides a large-scale collection of professionally captured 4K 60 fps natural video content designed for modern video processing research. Its scale and quality make it highly relevant for video compression, streaming optimization, super-resolution, enhancement, frame interpolation, and generative video applications.

VIDEA-8K-60FPS Dataset: 8K 60FPS Video Sequences for Analysis and Development
Tariq Al Shoura, Ali Mollaahmadi Dehaghi, Reza Razavi, Mohammad Moshirpour
Paper available at: https://doi.org/10.1145/3746027.3758278
Dataset available at: https://github.com/talshoura/VIDEA-8K-60FPS-Dataset
VIDEA-8K-60FPS addresses the shortage of publicly available ultra-high-resolution video benchmarks by providing native 8K HDR sequences captured at 60 fps. The dataset is particularly valuable for next-generation video coding, scalable streaming, UHD content analysis, and benchmarking computationally intensive multimedia methods.

LEHA-CVQAD: Dataset To Enable Generalized Video Quality Assessment of Compression Artifacts
Aleksandr Gushchin, Maksim Smirnov, Dmitriy S. Vatolin, Anastasia Antsiferova
Paper available at: https://doi.org/10.1145/3746027.3758303
Dataset available at: https://aleksandrgushchin.github.io/lcvqad/
LEHA-CVQAD is a large-scale benchmark specifically designed for studying perceptual degradation caused by video compression artifacts. Its subjective quality annotations and codec diversity make it particularly useful for developing video quality metrics and improving practical codec parameter optimization.

HVEval: Towards Unified Evaluation of Human-Centric Video Generation and Understanding
Sijing Wu, Yunhao Li, Huiyu Duan, Yanwei Jiang, Yucheng Zhu, Guangtao Zhai
Paper available at: https://doi.org/10.1145/3746027.3758299
Dataset available at: https://huggingface.co/datasets/wsj-sjtu/HVEval
HVEval introduces a benchmark for evaluating human-centric video generation and understanding, a rapidly growing topic in generative multimedia research. It combines perceptual quality judgments with semantic evaluation tasks, making it highly relevant for benchmarking generative video systems and multimodal understanding models.

BrokenVideos: A Benchmark Dataset for Fine-Grained Artifact Localization in AI-Generated Videos
Jiahao Lin, Weixuan Peng, Bojia Zi, Yifeng Gao, Xianbiao Qi, Xingjun Ma, Yu-Gang Jiang
Paper available at: https://doi.org/10.1145/3746027.3758305
Dataset available at: https://broken-video-detection-datetsets.github.io/Broken-Video-Detection-Datasets.github.io/
BrokenVideos addresses a critical challenge in AI-generated media by providing fine-grained annotations for artifact localization in synthetic videos. The dataset is highly relevant for trustworthy generative AI, media forensics, and automated video quality assurance.

AEGIS: Authenticity Evaluation Benchmark for AI-Generated Video Sequences
Jieyu Li, Xin Zhang, Joey Tianyi Zhou
Paper available at: https://doi.org/10.1145/3746027.3758295
Dataset available at: https://huggingface.co/datasets/Clarifiedfish/AEGIS
AEGIS provides a large benchmark for evaluating authenticity detection in increasingly realistic AI-generated videos. It is particularly relevant for multimedia security, deepfake detection, and the broader challenge of trustworthy synthetic media verification.

SVD: Spatial Video Dataset
MohammadHossein Izadimehr, Milad Ghanbari, Guodong Chen, Wei Zhou, Xiaoshuai Hao, Mallesham Dasari, Christian Timmerer, Hadi Amirpour
Paper available at: https://doi.org/10.1145/3746027.3758246
Dataset available at: https://cd-athena.github.io/SVD/
SVD introduces a public benchmark for consumer-captured stereoscopic spatial video, reflecting the increasing adoption of immersive video technologies. It is especially relevant for research in spatial video compression, immersive streaming, QoE evaluation, and depth-aware media analysis.

EyeNavGS: A 6-DoF Navigation Dataset and Record-n-Replay Software for Real-World 3DGS Scenes in VR
Zihao Ding, Cheng-Tse Lee, Mufeng Zhu, Tao Guan, Yuan-Chun Sun, Cheng-Hsin Hsu, Yao Liu
Paper available at: https://doi.org/10.1145/3746027.3758265
Dataset available at: https://symmru.github.io/EyeNavGS/
EyeNavGS provides immersive navigation traces, gaze data, and interaction recordings in virtual reality environments built on 3D Gaussian Splatting scenes. The dataset is particularly important for adaptive rendering, viewport prediction, foveated streaming, and immersive interaction research.

UVG-CWI-DQPC: Dual-Quality Point Cloud Dataset for Volumetric Video Applications
Guillaume Gautier, Xuemei Zhou, Thong Nguyen, Jack Jansen, Louis Fréneau, Marko Viitanen, Uyen Phan, Jani Käpylä, Irene Viola, Alexandre Mercat, Pablo Cesar, Jarno Vanne
Paper available at: https://doi.org/10.1145/3746027.3758263
Dataset available at: https://ultravideo.fi/UVG-CWI-DQPC/
This dataset provides paired high-end and consumer-grade point cloud captures for volumetric video research. It is highly relevant for point cloud compression, enhancement, quality assessment, and immersive multimedia benchmarking.

The CASTLE 2024 Dataset: Advancing the Art of Multimodal Understanding
Luca Rossetto, Werner Bailer, Duc-Tien Dang-Nguyen, Graham Healy, Björn Þór Jónsson, Onanong Kongmeesub, Hoang-Bao Le, Stevan Rudinac, Klaus Schöffmann, Florian Spiess, Allie Tran, Minh-Triet Tran, Quang-Linh Tran, Cathal Gurrin
Paper available at: https://doi.org/10.1145/3746027.3758199
Dataset available at: https://castle-dataset.github.io/
CASTLE is a rich multimodal dataset combining egocentric and exocentric video, audio, and sensor streams captured in realistic environments. It is highly relevant for multimodal understanding, retrieval, lifelogging, embodied perception, and context-aware AI research.

OpenEvents V1: Large-Scale Benchmark Dataset for Multimodal Event Grounding
Hieu Nguyen, Phuc-Tan Nguyen, Thien-Phuc Tran, Minh-Quang Nguyen, Tam V. Nguyen, Minh-Triet Tran, Trung-Nghia Le
Paper available at: https://doi.org/10.1145/3746027.3758264
Dataset available at: https://ltnghia.github.io/eventa/openevents-v1
OpenEvents V1 provides a large benchmark for multimodal event understanding through aligned images, text, and news content. It is especially relevant for event retrieval, multimodal reasoning, news analysis, and contextual multimedia understanding.

StreamingCoT: A Dataset for Temporal Dynamics and Multimodal Chain-of-Thought Reasoning in Streaming VideoQA
Yuhang Hu, Zhenyu Yang, Shihan Wang, Shengsheng Qian, Bin Wen, Fan Yang, Tingting Gao, Changsheng Xu
Paper available at: https://doi.org/10.1145/3746027.3758311
Dataset available at: https://github.com/Fleeting-hyh/StreamingCoT
StreamingCoT focuses on temporal reasoning in evolving video streams and introduces explicit multimodal reasoning annotations. The dataset is particularly relevant for streaming video understanding, video question answering, and multimodal reasoning research.

SynthVLM: Towards High-Quality and Efficient Synthesis of Image-Caption Datasets for Vision-Language Models
Zheng Liu, Hao Liang, Bozhou Li, Wentao Xiong, Chong Chen, Conghui He, Wentao Zhang, Bin Cui
Paper available at: https://doi.org/10.1145/3746027.3758222
Dataset available at: https://github.com/starriver030515/SynthVLM
SynthVLM introduces a synthetic image-caption dataset aimed at efficient vision-language model training. The work is particularly relevant because it addresses scalable multimodal data generation and benchmarking for next-generation multimodal foundation models.

UniSVG: A Unified Dataset for Vector Graphic Understanding and Generation with Multimodal Large Language Models
Jinke Li, Jiarui Yu, Chenxing Wei, Hande Dong, Qiang Lin, Liangjing Yang, Zhicai Wang, Yanbin Hao
Paper available at: https://doi.org/10.1145/3746027.3758269
Dataset available at: https://ryanlijinke.github.io/
UniSVG expands multimedia benchmarking into vector graphics, a modality often neglected in conventional multimedia datasets. It is highly relevant for multimodal reasoning, structured content understanding, and AI-driven graphic generation.

RoboAfford: A Dataset and Benchmark for Enhancing Object and Spatial Affordance Learning in Robot Manipulation
Yingbo Tang, Lingfeng Zhang, Shuyi Zhang, Yinuo Zhao, Xiaoshuai Hao
Paper available at: https://doi.org/10.1145/3746027.3758209
Dataset available at: https://roboafford-dataset.github.io/
RoboAfford bridges multimedia perception and embodied intelligence through object and spatial affordance annotations for robotic manipulation. The dataset is particularly relevant for scene understanding, multimodal perception, and interaction-aware robotics research.

GynSurg: A Comprehensive Gynecology Laparoscopic Surgery Dataset
Sahar Nasirihaghighi, Negin Ghamsarian, Leonie Peschek, Matteo Munari, Heinrich Husslein, Raphael Sznitman, Klaus Schoeffmann
Paper available at: https://doi.org/10.1145/3746027.3758267
Dataset available at: https://ftp.itec.aau.at/datasets/GynSurge/
GynSurg provides richly annotated surgical video data for gynecological laparoscopic procedures. It is highly relevant for medical multimedia analysis, workflow understanding, semantic segmentation, and AI-assisted surgical support systems.

DFBench: Benchmarking Deepfake Image Detection Capability of Large Multimodal Models
Jiarui Wang, Huiyu Duan, Juntong Wang, Jia Ziheng, Woo Yi Yang, Xiaorong Zhu, Yu Zhao, Jiaying Qian, Yuke Xing, Guangtao Zhai, Xiongkuo Min
Paper available at: https://doi.org/10.1145/3746027.3758204
Dataset available at: https://github.com/IntMeGroup/DFBench
DFBench introduces a large benchmark for evaluating deepfake detection with modern multimodal models. It is especially relevant for multimedia forensics, trustworthy AI, adversarial robustness, and authenticity verification research.

Multiverse Through Deepfakes: The MultiFakeVerse Dataset of Person-Centric Visual and Conceptual Manipulations
Parul Gupta, Shreya Ghosh, Tom Gedeon, Thanh-Toan Do, Abhinav Dhall
Paper available at: https://doi.org/10.1145/3746027.3758283
Dataset available at: https://github.com/Parul-Gupta/MultiFakeVerse
MultiFakeVerse extends deepfake benchmarking beyond simple identity swaps toward semantically meaningful person-centric manipulations. The dataset is particularly important for studying higher-level multimedia misinformation, contextual authenticity analysis, and robust deepfake detection.


The progression across ACM MM 2023-2025 clearly illustrates the evolution of datasets from supporting experimental resources toward primary research outputs in their own right. Beyond traditional benchmarks for recognition and retrieval, recent datasets increasingly target generative media evaluation, trustworthy AI, immersive environments, multimodal reasoning, and domain-specific real-world applications. This trajectory reflects the multimedia community’s growing commitment to reproducibility, open science, and shared evaluation resources that enable sustainable scientific progress.

MPEG Column: 154th MPEG Meeting

The 154th MPEG meeting took place in Santa Eulària, Spain, from April 27 to May 1, 2026. The official MPEG press release can be found here. This report highlights key outcomes from the meeting, with a focus on research directions relevant to the ACM SIGMM community:

  • Exploration on MPEG Gaussian Splat Coding (GSC)
  • Draft Joint Call for Proposals: Video Compression Beyond VVC
  • Energy-aware Streaming in MPEG-DASH
  • MPEG-AI: Vision and Scenarios for Artificial Intelligence in Multimedia
  • MPEG Roadmap

Exploration on MPEG Gaussian Splat Coding (GSC)

The MPEG WG 2 Technical Requirements group, jointly with WG 4 (Video Coding), WG 5 (JVET: Joint Video Coding Team(s) with ITU-T SG 16), and WG 7 (Coding of 3D Graphics and Haptics), made progress toward standardizing Gaussian Splat Coding (GSC) by advancing draft requirements and use cases, both of which remain subject to change. Gaussian splatting, first introduced in a landmark 2023 ACM SIGGRAPH paper by Kerbl et al. [Kerbl, 2023], represents 3D scenes as collections of anisotropic Gaussian primitives carrying geometry (x, y, z positions) and appearance attributes (opacity, scale, rotation, and spherical harmonics coefficients for view-dependent color), enabling photorealistic novel-view synthesis with real-time rendering. Because raw Gaussian splat data can be extremely large and the ecosystem of proprietary formats (.ply, .splat, .spz, etc.) is fragmented, MPEG has identified a clear need for interoperable, efficient compression standards. Two exploration tracks are currently being pursued: I-3DGS, which operates on Gaussian splats in the well-established “INRIA” format as a symmetric encode/decode pipeline, and A-3DGS, which allows alternative learned representations and training-integrated approaches.
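To give a sense of the raw data volume mentioned above, the following minimal sketch estimates the uncompressed size of a splat scene from the attributes listed in the paragraph. It assumes 32-bit floats and degree-3 spherical harmonics (16 coefficients per color channel); both are assumptions made for illustration, and the class and function names are hypothetical rather than part of any MPEG draft.

```python
from dataclasses import dataclass

BYTES_PER_FLOAT = 4  # assumption: every attribute is stored as a 32-bit float


@dataclass(frozen=True)
class SplatLayout:
    """Per-primitive attribute counts (hypothetical helper, not an MPEG format)."""
    position: int = 3    # x, y, z
    opacity: int = 1
    scale: int = 3       # anisotropic extent per axis
    rotation: int = 4    # unit quaternion
    sh_degree: int = 3   # assumption: degree-3 spherical harmonics

    @property
    def sh_coefficients(self) -> int:
        # (degree + 1)^2 basis functions per channel, times 3 color channels
        return 3 * (self.sh_degree + 1) ** 2

    @property
    def floats_per_splat(self) -> int:
        return (self.position + self.opacity + self.scale
                + self.rotation + self.sh_coefficients)


def raw_size_bytes(num_splats: int, layout: SplatLayout = SplatLayout()) -> int:
    """Uncompressed size in bytes of a scene with `num_splats` primitives."""
    return num_splats * layout.floats_per_splat * BYTES_PER_FLOAT


print(SplatLayout().floats_per_splat)    # 59
print(raw_size_bytes(1_000_000) / 1e6)   # 236.0 (megabytes for one million splats)
```

Under these assumptions, the spherical harmonics coefficients account for 48 of the 59 values per primitive, which hints at why appearance attributes are an obvious target for compression.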

The draft requirements, still evolving, currently cover representation, coding, and system aspects across both tracks, with an additional lightweight profile targeting resource-constrained devices such as mobile phones (Snapdragon 8 Gen 3/Elite) and HMDs (Snapdragon XR2 Gen 2, e.g., Meta Quest 3). Among the coding requirements under consideration are lossy and lossless compression with variable bitrate, spatial and temporal random access, progressive and scalable decoding (quality, Level of Detail (LoD), attribute subsets), and error resilience. Notably, the lightweight profile currently proposes hard complexity constraints (i.e., real-time encode/decode on 2024/2025 mobile hardware, a 2 GB runtime memory cap, and at most four concurrent video decoder sessions), reflecting MPEG’s intent to enable a fast-deployment path for interoperable interchange and storage of static Gaussian splat assets. Alongside the requirements, a draft set of 27 use cases has been identified, spanning consumer XR (telepresence, gaming, social media, retail), professional media (movie production, sports broadcasting, immersive journalism), industrial applications (digital twins, Building Information Modeling (BIM), structure inspection, disaster assessment), and emerging hybrid representations such as Gaussian splats attached to deformable meshes for avatar animation and rigging. Several of these use cases are motivating draft requirements around primitive ordering preservation and stable identifier signaling for external metadata associations, though the details of these provisions may still change.

Research aspects: Even at this early draft stage, the direction of MPEG’s GSC work opens a rich set of research opportunities. On the compression side, the dual-track structure raises open questions around rate-distortion-complexity optimization for both geometry-based and video-codec-based pipelines, including temporally coherent coding of dynamic (tracked and non-tracked) Gaussian sequences and attribute-group-aware progressive coding. The QoE angle is equally pressing: no widely accepted perceptual quality metric yet exists for 6DoF Gaussian splat rendering, and the community can contribute splat-artifact-aware metrics, view-consistency measures, and subjective evaluation methodologies. The envisioned lightweight profile points to a need for co-design of decoders and real-time renderers targeting mobile GPU architectures, offering opportunities in GPU-friendly bitstream layouts and LoD-driven streaming. From a systems and networking perspective, the spatial and temporal random-access provisions, combined with the breadth of use cases demanding adaptive streaming to diverse devices (HMDs, phones, TVs, browsers), map naturally onto adaptive bitrate research, ROI- and view-dependent segment delivery, and loss-resilient transmission of splat parameters. Finally, the emerging use cases around hybrid mesh-Gaussian avatars, scene editing, and semantic metadata associations introduce new multimedia content management and interactive media challenges that go well beyond traditional video streaming and are squarely within the scope of ACM SIGMM’s research community.

Draft Joint Call for Proposals: Video Compression Beyond VVC

MPEG’s Joint Video Experts Team (JVET) — operating jointly under ITU-T SG21 and ISO/IEC JTC 1/SC 29 — advanced a draft Joint Call for Proposals (CfP) for a new generation of video compression technology with capabilities that would substantially exceed those of the current Versatile Video Coding (VVC) standard (Rec. ITU-T H.266 | ISO/IEC 23090-3). The final CfP is planned for July 2026, with proposal submissions evaluated at a JVET meeting in January 2027 and a tentative target of a completed standard by October 2029. The overarching goal is to solicit compression technology that significantly improves upon VVC’s Main 10 Profile in terms of rate-distortion performance, encoder/decoder implementability, applicability to diverse content types, and additional features such as low latency, error robustness, and scalability, while explicitly recognizing that practical fast encoding is increasingly important across a growing range of applications.

The draft CfP defines four test cases. The primary test case targets improved compression without runtime constraints, spanning several content categories: SDR random-access at UHD/4K and HD resolutions, SDR low-delay HD (targeting conversational and gaming applications), HDR content under both PQ and HLG transfer functions at UHD, gaming low-delay HD, and user-generated content. Three additional test cases impose encoder runtime constraints relative to the VVC Test Model (VTM) reference encoder, enabling JVET to characterize the compression-versus-speed trade-off across submissions. Formal subjective evaluation will follow the degradation category rating (DCR) methodology per ITU-R BT.500. Importantly, the CfP explicitly addresses neural and learned components: proponents must disclose what training data was used and are prohibited from using any test sequence as training material, and source code (incl. training scripts or parameter derivation procedures) must be made available for accepted technologies entering the core experiments process. The draft notes that specific test sequences and target bitrates may still change before the final CfP is issued.

Research aspects: The runtime-constrained test cases create a natural framework for studying the compression-complexity Pareto frontier for both classical and learned codecs. The inclusion of user-generated content and gaming video as distinct categories invites research into content-adaptive coding tools and perceptual quality metrics tailored to these sources, as does the HDR coverage with its use of weighted PSNR alongside MS-SSIM. The explicit allowance for neural and learned components, with mandatory training data disclosure and source code requirements, signals that JVET anticipates hybrid and end-to-end learned codecs as serious contenders, making codec-agnostic adaptive streaming, QoE modeling for learned video codecs, and large-scale perceptual quality benchmarking timely topics for the ACM SIGMM community.
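The compression-complexity Pareto frontier mentioned above can be sketched in a few lines. The example assumes each submission is summarized by two numbers: encoder runtime relative to an anchor and a BD-rate style bitrate change in percent. This summary, the `Submission` type, and all sample values are hypothetical and are not taken from the draft CfP.

```python
from typing import List, NamedTuple


class Submission(NamedTuple):
    name: str
    runtime_ratio: float  # encoder runtime relative to the anchor (lower is better)
    bd_rate: float        # bitrate change vs. the anchor in percent (more negative is better)


def pareto_front(subs: List[Submission]) -> List[Submission]:
    """Return the submissions that no other submission beats on both axes."""
    front = []
    best_bd = float("inf")
    # Sweep from fastest to slowest; keep a point only if it compresses
    # strictly better than every faster point seen so far.
    for s in sorted(subs, key=lambda s: (s.runtime_ratio, s.bd_rate)):
        if s.bd_rate < best_bd:
            front.append(s)
            best_bd = s.bd_rate
    return front


# Hypothetical values for illustration only.
subs = [
    Submission("A", 0.5, -5.0),
    Submission("B", 1.0, -12.0),
    Submission("C", 1.0, -8.0),
    Submission("D", 3.0, -11.0),
    Submission("E", 8.0, -20.0),
]
print([s.name for s in pareto_front(subs)])  # ['A', 'B', 'E']
```

Here C and D drop out because B is at least as fast and compresses better than both, while E stays on the frontier despite its long runtime.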

Energy-aware Streaming in MPEG-DASH

MPEG’s WG 3 (Systems/DASH) is developing a framework for integrating energy-related information into adaptive streaming workflows, currently documented as a Technology under Consideration (TuC) in the DASH specification. The proposed framework treats energy as a first-class design metric alongside QoE, latency, and throughput, and defines an end-to-end approach for assigning, aggregating, and propagating energy consumption data across the entire media delivery chain — from production and encoding through CDN distribution to the client. A key design principle is extensibility: rather than hardcoding specific metrics, the framework proposes a common registry of energy-related metrics (such as energy indices or carbon indices) identified via URNs or 4CC codes, inspired by existing registries like MP4RA and DASH-IF. Energy information may be carried through a variety of existing DASH mechanisms, including MPD descriptors at multiple granularity levels (Adaptation Set, Representation, Segment, Service Location), CMCD/CMSD extensions, metadata tracks, SAND messages, and event streams. A dedicated Energy descriptor in the MPD is proposed, analogous to existing Accessibility descriptors, to expose energy information to clients and applications for representation selection, user exposure, and reporting to back-end servers.

Concept of Energy-aware Streaming in MPEG-DASH.
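As a rough illustration of manifest-level energy signaling, the sketch below reads a per-Representation energy value from an MPD on the client side. Because the syntax of the proposed Energy descriptor is not given here, the example stands in a generic `SupplementalProperty` descriptor with a made-up scheme URN (`urn:example:energy-index`); the URN, the values, and the function name are all assumptions.

```python
import xml.etree.ElementTree as ET

NS = {"mpd": "urn:mpeg:dash:schema:mpd:2011"}
ENERGY_SCHEME = "urn:example:energy-index"  # hypothetical URN, not a registered identifier

# Hypothetical manifest fragment; energy values are unitless placeholders.
MPD_SNIPPET = """<MPD xmlns="urn:mpeg:dash:schema:mpd:2011">
  <Period>
    <AdaptationSet mimeType="video/mp4">
      <Representation id="v1080" bandwidth="6000000">
        <SupplementalProperty schemeIdUri="urn:example:energy-index" value="0.9"/>
      </Representation>
      <Representation id="v720" bandwidth="3000000">
        <SupplementalProperty schemeIdUri="urn:example:energy-index" value="0.6"/>
      </Representation>
    </AdaptationSet>
  </Period>
</MPD>"""


def energy_by_representation(mpd_xml: str) -> dict:
    """Map Representation id -> energy value for descriptors using ENERGY_SCHEME."""
    root = ET.fromstring(mpd_xml)
    result = {}
    for rep in root.iterfind(".//mpd:Representation", NS):
        for prop in rep.iterfind("mpd:SupplementalProperty", NS):
            if prop.get("schemeIdUri") == ENERGY_SCHEME:
                result[rep.get("id")] = float(prop.get("value"))
    return result


print(energy_by_representation(MPD_SNIPPET))  # {'v1080': 0.9, 'v720': 0.6}
```

A client could feed such values into representation selection, display them to the user, or report them to a back-end server, which are the three uses named above.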

The April 2026 update reported significant progress on two related fronts. A 5G-MAG workshop co-organized with 3GPP SA4 and Greening of Streaming (March 2026) highlighted growing industry consensus around practical energy measurement, surfacing findings such as the dominant role of device eco-mode settings and content brightness over codec or resolution choices in determining end-device energy consumption, and the challenge of reproducible cloud-based energy measurement. In parallel, 3GPP’s Rel-20 study on media energy consumption exposure (FS_Energy_Ph2_MED) reached 80% completion and is expected to conclude in June 2026, with normative work to follow. Notably, 3GPP’s current draft conclusions focus on generic architectural enablers, specifically a new Energy Information Application Function, while explicitly deferring media-layer and client-driven energy optimization to external bodies such as MPEG, SVTA, and DVB. This positions MPEG-DASH’s manifest-based energy signaling work as the natural venue for maturing the streaming-level mechanisms that 3GPP may later reference.

Research aspects: This work opens several timely directions. Energy-aware ABR algorithm design, i.e., jointly optimizing QoE and energy across representation selection, CDN choice, and client device settings, is a natural extension of the existing adaptive streaming research agenda. The proposed metrics registry and MPD-level signaling create opportunities for dataset construction and benchmarking, building on emerging open datasets such as COCONUT [Tashtarian, 2024] and VEED [Linder, 2024]. The finding that device-side factors (eco-mode, display brightness) dominate energy consumption over codec and bitrate choices challenges some common assumptions and calls for more holistic QoE-energy modeling. Finally, the cross-SDO coordination between MPEG, 3GPP, IETF (GREEN working group), and Greening of Streaming presents opportunities for the ACM SIGMM community to contribute to the design of interoperable, standardized energy reporting APIs for streaming services.
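One simple way to picture energy-aware ABR is as a weighted trade-off between quality and energy among the representations that fit the available throughput. The sketch below is a toy formulation under that assumption, not an algorithm from the MPEG work; the quality scores, energy indices, and weight are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Representation:
    rep_id: str
    bitrate_kbps: int
    quality: float  # assumed quality score on a 0-100 scale
    energy: float   # assumed relative energy index (unitless)


def select_representation(reps: List[Representation],
                          throughput_kbps: float,
                          energy_weight: float) -> Optional[Representation]:
    """Pick the feasible representation maximizing quality - weight * energy."""
    if not reps:
        return None
    feasible = [r for r in reps if r.bitrate_kbps <= throughput_kbps]
    if not feasible:
        # Nothing fits: fall back to the lowest bitrate to limit stalling.
        return min(reps, key=lambda r: r.bitrate_kbps)
    return max(feasible, key=lambda r: r.quality - energy_weight * r.energy)


# Hypothetical bitrate ladder for illustration only.
ladder = [
    Representation("360p", 1000, 60.0, 1.0),
    Representation("720p", 3000, 80.0, 1.6),
    Representation("1080p", 6000, 90.0, 2.5),
    Representation("2160p", 16000, 95.0, 5.0),
]
print(select_representation(ladder, 8000, 0.0).rep_id)   # 1080p
print(select_representation(ladder, 8000, 15.0).rep_id)  # 720p
```

With a weight of zero the selection reduces to conventional quality-maximizing ABR; raising the weight shifts the choice toward lower-energy representations. A fuller model would also have to account for the device-side factors noted above, which this toy utility ignores.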

MPEG-AI: Vision and Scenarios for Artificial Intelligence in Multimedia

The first edition of ISO/IEC TR 23888-1 serves as the foundational vision document for the MPEG-AI series (ISO/IEC 23888). The document maps out how AI and neural network technologies interact with multimedia standardization along two complementary axes: (i) AI as a multimedia coding tool (e.g., AI-based video compression, 3D point cloud coding) and (ii) multimedia as input for AI consumption (e.g., video coding optimized for machine vision tasks). Under this umbrella, the document surveys six technical areas. In AI-based video coding, neural network components are explored as hybrid additions to VVC-style codecs, covering in-loop filters, intra prediction, super-resolution via reference picture resampling, and content-adaptive postfilters transmitted via SEI messages using the Neural Network Coding standard (NNC, ISO/IEC 15938-17). In AI-based 3D graphics coding, the focus is on dynamic point clouds for immersive (XR, gaming) and machine-oriented (autonomous navigation, BIM) applications, where sparsity and geometric irregularity pose unique challenges beyond those faced by image/video AI codecs. AI model compression (NNC) addresses the bandwidth-efficient deployment and incremental updating of neural network weights to devices, with use cases ranging from adaptive streaming ABR models to federated learning and postfilter delivery. Video coding for machines (VCM) targets compression optimized for downstream AI tasks such as object detection, tracking, and content moderation, with applications in surveillance, intelligent transportation, smart cities, and industrial inspection. Feature coding for machines (FCM) extends this to split-inference architectures where intermediate feature maps, rather than reconstructed video, are compressed and transmitted between edge devices and servers. Finally, distributed AI media description addresses the interoperable representation and API-level exchange of AI inference results (e.g., bounding boxes, segmentation masks) between networked media analyzers, as specified in the MPEG-IoMT suite.

ISO/IEC TR 23888-1: AI as a multimedia coding tool and multimedia as input for AI consumption.

Research aspects: The hybrid codec paradigm raises open questions around joint optimization of traditional and learned tools and complexity-aware training for mobile targets. The VCM and FCM tracks call for new task-oriented quality metrics capturing machine-task performance as a function of bitrate, an area where the multimedia and computer vision communities can collaborate. The split-inference and feature coding scenarios introduce latency-constrained compression problems for edge-to-cloud pipelines, which naturally connect to adaptive streaming and IoT research. Finally, the reproducibility and bit-exactness challenges highlighted in the document — hardware-dependent inference, non-deterministic training, and the absence of standardized evaluation environments — present an opportunity for the community to develop shared benchmarking infrastructure for learned multimedia codecs.

MPEG Roadmap

MPEG released an updated roadmap at its 154th meeting, reflecting the current status and near-term trajectory of its standardization activities across three broad pillars. Under Media Coding, work nearing completion includes MPEG Immersive Video v.2, Feature Coding for Machines, Solid Point Cloud Coding, and Dynamic Mesh Compression, while longer-horizon efforts cover AI Graphics Compression, Video Coding for Machines, Lenslet video coding, and — directly relevant to this report — both Video-based and Geometry-based Gaussian Splat Coding tracks. Under Systems and Tools, near-term deliverables include DASH v.7, Green metadata v.4, and Carriage of Haptics Data, with CMAF v.4 and File Format (ISOBMFF) v.10 on a slightly longer timeline. The Beyond Media pillar continues to advance genomic data search and biomedical waveform coding (BWC), alongside media authenticity and provenance indication — underscoring MPEG’s expanding scope well beyond traditional audiovisual applications.

MPEG Roadmap as of April 2026.

Research aspects: The roadmap highlights several intersecting research opportunities. The convergence of volumetric and neural representations (i.e., point clouds, dynamic meshes, Gaussian splats, and lenslet video, all progressing in parallel) raises open questions around unified rate-distortion frameworks and cross-format QoE evaluation for 6DoF experiences. The simultaneous progression of Video Coding for Machines and Feature Coding for Machines alongside traditional human-centric codecs calls for research into adaptive pipelines that can serve both human and machine consumers from a shared bitstream. The Green metadata track connects directly to the energy-aware streaming work discussed above, underscoring the need for end-to-end energy modeling that spans codec choice, packaging, delivery, and consumption. Finally, the Beyond Media thread (particularly genomic data and biomedical waveforms) signals an expanding definition of “multimedia” that the ACM SIGMM community may wish to engage with as compression, retrieval, and QoE methods developed for audiovisual content find applicability in life sciences.

Concluding Remarks

The 154th MPEG meeting in Santa Eulària reflects a standards body in active transition, broadening its scope from traditional audiovisual compression toward a richer landscape that encompasses neural scene representations, AI-native codecs, energy-aware delivery, and even biomedical data. The Gaussian Splat Coding exploration, the next-generation video compression Call for Proposals, the MPEG-AI vision document, and the energy-aware streaming framework each address distinct but interconnected challenges: how to represent, compress, deliver, and consume increasingly complex and diverse media efficiently and sustainably. For the ACM SIGMM community, this meeting offers both a map of where industry standardization is heading and a set of open research problems (spanning perceptual quality assessment, learned compression, edge inference, green streaming, and immersive media delivery) where academic contributions can meaningfully shape the next generation of multimedia standards.

The 155th MPEG meeting will be held in Geneva, Switzerland, from July 13 to 17, 2026. Further information about MPEG meetings and ongoing developments is available on the MPEG website.

References

  • [Kerbl, 2023] Bernhard Kerbl, Georgios Kopanas, Thomas Leimkuehler, and George Drettakis. 2023. 3D Gaussian Splatting for Real-Time Radiance Field Rendering. ACM Trans. Graph. 42, 4, Article 139 (August 2023), 14 pages. https://doi.org/10.1145/3592433
  • [Tashtarian, 2024] Farzad Tashtarian, Daniele Lorenzi, Hadi Amirpour, Samira Afzal, and Christian Timmerer. 2024. COCONUT: Content Consumption Energy Measurement Dataset for Adaptive Video Streaming. In Proceedings of the 15th ACM Multimedia Systems Conference (MMSys ’24). Association for Computing Machinery, New York, NY, USA, 346–352. https://doi.org/10.1145/3625468.3652179
  • [Linder, 2024] Sandro Linder, Samira Afzal, Christian Bauer, Hadi Amirpour, Radu Prodan, and Christian Timmerer. 2024. VEED: Video Encoding Energy and CO2 Emissions Dataset for AWS EC2 instances. In Proceedings of the 15th ACM Multimedia Systems Conference (MMSys ’24). Association for Computing Machinery, New York, NY, USA, 332–338. https://doi.org/10.1145/3625468.3652178

Bridging the QoE Blind Spot Between Applications and Networks

To the viewer, a stalled video is a single interruption; operationally, it fragments into different pieces of evidence. The streaming application records a bitrate drop and a rebuffering event, the network operator’s dashboard shows a congested access segment and a burst of packet loss, and the user experience sits between those two partial accounts.

This gap in visibility is the practical problem behind VQEG’s March 2026 white paper Quality of Experience-Aware Management for Collaboration Between Network and Application Providers, published as technical report VQEG_TR_2026_001 [VQEG, 2026]. Its guiding question is direct but difficult: if Content and Application Providers (CAPs) and Communication Service Providers (CSPs) both shape end-user Quality of Experience (QoE), what would they need to share in order to manage it together?

The Case for a Common QoS/QoE Vision

VQEG, the Video Quality Experts Group, is an international forum focused on video quality and QoE measurement. The VQEG 5G-KPI working group studies the relationship between network key performance indicators (KPIs), initially in 5G and extensible to other networks, and the QoE of video services running on top of them. At a July 2024 workshop in Klagenfurt, Austria, the group turned that broad mission into a more concrete agenda around a familiar problem in multimedia delivery: applications and networks are deeply entangled, yet the communities operating them often describe performance in different measurements.

CAPs can typically observe the application layer: startup time, player state, bitrate switches, rebuffering, device behavior, error logs, and user-facing quality. However, they have no visibility into the network topology or performance, and they need to devote significant effort to developing throughput estimation algorithms and control techniques to adapt to varying network conditions. CSPs, in turn, can observe the network layer: throughput, latency, jitter, packet loss, routing, congestion, and radio or access-network conditions. However, they have no visibility into the kind of traffic they transport, especially when it is encrypted, and therefore have difficulty properly dimensioning the network or providing focused support for customer problems. Figure 1 illustrates this split visibility across the delivery chain and the shared view that QoE-aware management tries to build between CAPs and CSPs.

Figure 1. CAPs and CSPs see different parts of the same delivery chain. CAPs observe application- and user-side quality signals; CSPs observe network-side conditions. QoE-aware management needs a shared view that connects these perspectives without assuming that either side sees the whole system alone.

The same divergence appears in the terminology. In multimedia research, QoE is usually anchored in the user’s experience, following the widely used Qualinet definition of QoE [Qualinet, 2013] as “the degree of delight or annoyance of the user of an application or service,” shaped by the user’s personality, state, expectations, context, and the system itself. This definition keeps the human being in the picture: a model may estimate QoE, and a subjective test may measure it more directly, but the object of interest remains experienced quality.

In networking practice, the starting point is usually Quality of Service (QoS). Metrics such as bandwidth, latency, jitter, and packet loss are measurable, operational, and essential, although they tell only part of the story. A network policy that reduces bitrate may ease congestion and improve latency while the viewer sees a picture that has become visibly worse; a cloud-gaming session may report decent throughput while the player feels the input delay immediately. QoS counters become much more useful when they can be connected to application-level Key Quality Indicators (KQIs), modeled QoE, user-reported QoE, and system QoE [Hoßfeld, 2019].

From the Klagenfurt discussions emerged four needs: a shared vocabulary for QoS and QoE-related concepts; a practical way to model QoS-QoE relationships; requirements for exchanging information between CAPs and CSPs; and early validation through concrete use cases. Before any mechanism could be specified, the group first had to make the language precise enough for both sides to use.

Writing the White Paper

For the organizations that would actually have to operate QoE management, the resulting report frames the problem in practical terms. Which metrics can CAPs and CSPs agree on? Which ones can they expose to one another? How should those measurements be interpreted? When can they trigger action during a session, and when are they more useful afterward for analytics, troubleshooting, or network dimensioning?

The focus on CAPs and CSPs is deliberate because these stakeholders control different parts of the delivery chain. CAPs provide the application or content experience directly to users, while CSPs provide the communication network on which that experience depends. Between them sit devices, access networks, transport networks, content delivery networks, adaptation algorithms, service policies, and business constraints. A video freeze, a cloud-gaming delay, or a broken conference call may involve several of those layers at once, which gives a shared operating language practical value.

Because the contributors come from CAPs, CSPs, equipment vendors, universities, research institutes, and independent QoE experts, the report also reflects a range of operational and research perspectives. The mix includes, among others, Nokia, Meta, YouTube, RISE, Telefónica, AT&T, Ericsson, TikTok, Audible, AVEQ, RWTH Aachen, TU Ilmenau, the University of Padova, the University of Würzburg, Blekinge Institute of Technology, AGH, and Universidad Politécnica de Madrid. That breadth is important: different communities brought different habits, constraints, and preferred measurements into the same conversation.

Published as VQEG_TR_2026_001 in March 2026, the report reviews definitions, QoE models, and relevant standards; organizes the relationships among QoS, KPIs, KQIs, and QoE; discusses the CAP-CSP collaboration challenge; and proposes a conceptual framework for exchanging QoS- and QoE-related information. Its examples include short-form video, long-form video, cloud gaming, and video conferencing. Although the scope is broad, the center of gravity remains practical: give the ecosystem a common foundation before arguing about protocols or product-specific tools.

Some Highlights

One of the report’s central contributions is its layered vocabulary. Borrowing the intuition of a networking stack, while still keeping user experience above packet-level quantities, it separates network KPIs such as throughput, latency, and loss [3GPP, 2024] from application KQIs [3GPP QMC, 2025] such as startup delay, rebuffering, media quality, interaction delay, and session stability. Above those layers sits QoE: the user’s perceived quality, whether reported directly or inferred through a model. Terms such as user-reported QoE, modeled QoE, and system QoE receive explicit treatment because collaboration breaks down quickly when a “quality score” means one thing in the player logs and another thing in the network dashboard.

The layered vocabulary also changes the diagnostic problem. A CAP may know that a session suffered repeated stalls and bitrate drops, while the CSP may know that the access link was congested at the same time. If those observations can be correlated, both sides can move from guessing to diagnosis: application issue, network congestion, device limitation, or some interaction among them. Better diagnosis can help a CSP decide where network action is warranted, help a CAP adapt more intelligently, and help both sides avoid optimizations that improve a local metric while leaving the user unhappy.
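The correlation step described above can be sketched as a simple interval-overlap check. The example assumes that both sides timestamp events on a shared clock and that a temporal overlap is only a hint toward a cause, not proof of one; the labels and function names are hypothetical and do not come from the report.

```python
from typing import List, Tuple

Interval = Tuple[float, float]  # (start, end) in seconds on a shared clock (assumption)


def overlaps(a: Interval, b: Interval) -> bool:
    """True if the two half-open intervals [start, end) intersect."""
    return a[0] < b[1] and b[0] < a[1]


def classify_stalls(stalls: List[Interval], congestion: List[Interval]) -> List[str]:
    """Label each CAP-side stall by whether it coincides with CSP-side congestion."""
    return [
        "network-correlated" if any(overlaps(s, c) for c in congestion)
        else "needs-further-diagnosis"
        for s in stalls
    ]


# Hypothetical observations: two stalls seen by the CAP, one congestion
# episode seen by the CSP.
stalls = [(10.0, 12.0), (40.0, 41.0)]
congestion = [(9.0, 15.0)]
print(classify_stalls(stalls, congestion))
# ['network-correlated', 'needs-further-diagnosis']
```

The second stall has no matching network event, so it would point the investigation toward the application, the device, or a part of the path that the CSP does not monitor.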

The shared state table is the report’s most concrete proposal: a logical view where CAPs and CSPs exchange selected metrics, at appropriate time scales and levels of aggregation, so each side can understand enough of the other’s state to act. For video streaming, such a table might include application-side information such as startup delay, rebuffering, selected representation, or estimated visual quality, alongside network-side information such as congestion indicators or available capacity. For interactive services, the useful signals may shift toward latency, jitter, and responsiveness.
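As a data-structure sketch, a shared state table could be modeled as per-session lists of metric entries, each tagged with its source, timestamp, and aggregation window. This layout is an assumption made for illustration: the field names, metric names, and the `SharedStateTable` class are hypothetical and are not defined by the report.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class MetricEntry:
    source: str       # "CAP" or "CSP"
    name: str         # hypothetical metric name, e.g. "rebuffering_ratio"
    value: float
    timestamp: float  # seconds on a shared clock (assumption)
    window_s: float   # aggregation window the value summarizes


@dataclass
class SharedStateTable:
    entries: Dict[str, List[MetricEntry]] = field(default_factory=dict)

    def publish(self, session_id: str, entry: MetricEntry) -> None:
        if entry.source not in ("CAP", "CSP"):
            raise ValueError(f"unknown source: {entry.source}")
        self.entries.setdefault(session_id, []).append(entry)

    def latest(self, session_id: str) -> Dict[Tuple[str, str], MetricEntry]:
        """Most recent entry per (source, metric name) for one session."""
        view: Dict[Tuple[str, str], MetricEntry] = {}
        for e in self.entries.get(session_id, []):
            key = (e.source, e.name)
            if key not in view or e.timestamp >= view[key].timestamp:
                view[key] = e
        return view


table = SharedStateTable()
table.publish("s1", MetricEntry("CAP", "rebuffering_ratio", 0.02, 10.0, 10.0))
table.publish("s1", MetricEntry("CAP", "rebuffering_ratio", 0.08, 20.0, 10.0))
table.publish("s1", MetricEntry("CSP", "congestion_level", 0.7, 20.0, 5.0))
print(table.latest("s1")[("CAP", "rebuffering_ratio")].value)  # 0.08
```

Storing the aggregation window alongside each value reflects the point that time scale and level of aggregation are part of what the two sides would need to agree on.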

Flexibility is built into the proposal because different services need different metrics, and the useful time scale for a live video call differs from the time scale for post-session analytics. The difficult questions are the operational ones: who can measure a signal reliably, who can act on it, how granular the exchange should be, and how privacy or business constraints shape what can be exposed. Metric sharing remains voluntary and opt-in, with mechanisms such as temporary or pseudonymized session identifiers when granular traffic correlation is needed.
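One possible way to realize temporary, pseudonymized session identifiers is a keyed hash over the internal session ID and a rotating epoch counter. The sketch below uses HMAC-SHA-256 from the Python standard library; the choice of HMAC, the epoch scheme, and the truncation length are assumptions for illustration, not mechanisms specified in the report.

```python
import hashlib
import hmac


def pseudonymous_session_id(secret: bytes, session_id: str, epoch: int) -> str:
    """Derive a short-lived identifier that hides the internal session ID.

    `secret` is a key known only to the parties allowed to correlate traffic
    (assumption); `epoch` is a counter that changes whenever identifiers
    should rotate, for example once per hour.
    """
    msg = f"{epoch}:{session_id}".encode("utf-8")
    digest = hmac.new(secret, msg, hashlib.sha256).hexdigest()
    return digest[:16]  # truncated for readability (illustrative choice)


key = b"demo-key-not-for-production"
a = pseudonymous_session_id(key, "session-42", epoch=1)
b = pseudonymous_session_id(key, "session-42", epoch=2)
print(a != b)  # True: the same session gets a different identifier each epoch
```

Within one epoch, both sides can match records by the derived identifier; once the epoch changes, older identifiers can no longer be linked to new traffic without the key.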

What is next?

Seen in this light, the report functions as a foundation for implementation work. It offers a vocabulary, a framework, and a set of open tasks that still need to be made operational. A sensible next step would be to choose one use case and make the model concrete by selecting the relevant metrics, deciding who measures them, defining how they map to QoE, and testing whether the information would actually help CAPs and CSPs make better decisions.

A proof of concept, controlled testbed, or simulation, which is currently under discussion in VQEG, could then address the focused feasibility questions. Can the selected metrics be measured reliably? How fast must they be shared? Does per-session information add enough value to justify the complexity? Which side can take action, and what action is safe?

Deployment would also require careful treatment of the operational details surrounding the framework itself. CAPs and CSPs need ways to identify and correlate traffic flows. QoS monitoring must cover enough of the path to find problems where they occur and keep teams from merely shifting blame from one segment to another. Privacy, commercial sensitivity, and regulation will shape what can be shared. The technical framework will have to live inside those constraints.

For the idea to travel beyond a VQEG report, it will also need a standardization path. Parts of the framework may fit naturally in ITU-T Study Group 12, while protocol and system aspects may belong in IETF, 3GPP, MPEG, or related Standards Developing Organizations (SDOs). If that work succeeds, QoE can move from post-session evaluation toward a shared operating language for services and networks.

References

JPEG Column: 110th JPEG Meeting in Sydney, Australia

JPEG Trust Media Asset Watermarking reaches Committee Draft stage at the 110th JPEG meeting

The 110th JPEG meeting was held in Sydney, Australia, from 11 to 16 January 2026.

This meeting was marked by several major achievements. JPEG Trust Part 3, Media Asset Watermarking, reached the Committee Draft stage; it will extend the JPEG Trust Core Foundation with signalling capabilities for content authenticity, provenance, integrity, intellectual property rights, and labelling using watermarking. Furthermore, the first event-based codec, JPEG XE, reached the Draft International Standard stage.

In addition, the JPEG Committee celebrated the 25th birthday of the successful JPEG 2000 standard with a social event where members who had served the Committee shared their experience during the development of this important family of standards.

The following sections summarise the main highlights of the 110th JPEG meeting:

  • JPEG Trust Part 3: Media Asset Watermarking to provide watermarking support for media asset authenticity.
  • JPEG XE Part 1: core coding system is under DIS ballot.
  • JPEG AIC prepares large-scale subjective experiment.
  • JPEG 2000 defines a set of hardware-focused profiles for professional video streaming.
  • JPEG XS Part 2 new amendment defines additional levels and sublevels, and a new frame buffer level.
  • JPEG RF activity approves new Use Cases and Requirements.
  • JPEG AI focuses on implementation aspects and on extending its applicability across devices and use cases.
  • JPEG DNA completes wet-lab experiments, including DNA synthesis/sequencing.
  • JPEG Pleno Light Field Quality Assessment examines the performance of the proposed metrics.
  • JPEG 2000 25th Anniversary Celebrations.

The former convenor of the JPEG Committee, Daniel Lee, addressing JPEG 2000 development during the JPEG 2000 25th Anniversary Celebration.

JPEG Trust

Current technologies, especially the rise of generative AI, make synthetic creation and modification of media assets easy for general users. Media artefacts such as synthetic images and video increase the risks of online piracy, cyber security fraud, copyright breach, advertising misrepresentation and the spread of mis- and disinformation.

The JPEG Trust International Standard (ISO/IEC 21617-1) provides a framework for establishing trust in media assets, and has now been extended to include Part 3: Media Asset Watermarking (ISO/IEC 21617-3), to provide watermarking support for media asset authenticity.

This new part of the JPEG Trust framework provides a mechanism to empower businesses, governments and institutions to support critical use cases from labelling AI-generated media assets to Digital Rights Management and source tracing. This is in addition to its many applications in helping secure media asset authenticity.

In a major milestone achieved during the 110th JPEG meeting in Sydney, Part 3: Media Asset Watermarking reached the Committee Draft stage. It is expected that this standard will have a significant positive impact globally, as it directly responds to the urgent calls for watermarking functionality by governments around the world in response to the proliferation of AI-generated content online.

JPEG XE

JPEG XE is a joint effort between ITU-T SG21 and ISO/IEC JTC1/SC29/WG1 and will become the first specification for the coding of events endorsed by the major international standardization bodies ITU-T, ISO, and IEC. It aims to establish a robust and interoperable format for efficient representation and coding of events in the context of machine vision and related applications. To expand the reach of JPEG XE, the JPEG Committee has closely coordinated its activities with the MIPI Alliance with the intention of developing a cross-compatible coding mode, allowing MIPI ESP signals to be decoded effectively by JPEG XE decoders.

Currently, JPEG XE Part 1, which defines the core coding system, is under DIS ballot and the JPEG Committee is awaiting the results. In the meantime, work started on Parts 2 and 3, which will define the Profiles and levels, and the Reference software, respectively. For both parts, a Committee Draft (CD) was created and their consultation was requested. The Profiles and levels in Part 2 will provide strict definitions to allow safe and correct interoperability between vendor-specific implementations of the standard. The software for Part 3 will serve as a proof-of-concept implementation of a JPEG XE encoder and decoder. The plan is to make the software free and open source to give the community easy access to the JPEG XE technology.

Finally, work on Part 4 was also initiated to provide official and well-defined conformance tests. This will help vendors to verify interoperability and conformance to the standard.

The JPEG Committee remains committed to the development of a comprehensive and industry-aligned standard that meets the growing demand for event-based vision technologies. The collaborative approach between multiple standardisation organisations underscores a shared vision for a unified, international standard to accelerate innovation and interoperability in this emerging field. The JPEG XE public and joint AHG (ITU-T SG21 and ISO/IEC JTC1 SC29 WG1) was reestablished to continue the work. If you are interested, please consider joining the joint AHG.

JPEG AIC

The JPEG AIC-3 standard, which specifies a methodology for fine-grained subjective image quality assessment in the range from good quality up to mathematically lossless, is ready to be published as International Standard ISO/IEC 29170-3 in February this year. An implementation of the corresponding data analysis has been provided in MATLAB and will be ported to Python. For the current JPEG AIC-4 effort and evaluation of the responses to the call for Objective Image Quality Assessment, an image dataset for the large-scale subjective experiment was finalized, consisting of 18,000 compressed images for 70 source images and 17 codecs, including several learning-based methods. The crowdsourcing experiment is expected to take several weeks.
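To give a rough sense of the scale of the figures above, the following Python sketch computes how many compressed versions the dataset would contain per source image and per source/codec pair on average. It assumes, purely for illustration, that the 18,000 images are spread evenly over all 70 × 17 source/codec combinations; the report does not describe the actual experimental design, and all variable names are hypothetical.

```python
# Illustrative arithmetic only. The even spread across all source/codec
# pairs is an assumption; the meeting report does not state the design.
compressed_images = 18_000
source_images = 70
codecs = 17

pairs = source_images * codecs                        # source/codec combinations
avg_per_pair = compressed_images / pairs              # average versions per pair
avg_per_source = compressed_images / source_images    # average versions per source

print(f"{pairs} source/codec pairs")
print(f"about {avg_per_pair:.1f} compressed versions per pair")
print(f"about {avg_per_source:.0f} compressed versions per source image")
```

Under this assumption each source/codec pair would contribute roughly 15 compressed versions, which is in line with a fine-grained quality scale, but the real distribution may well be uneven.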

JPEG 2000

The JPEG Committee has initiated the development of a new standard to collect the growing number of profiles for its flexible JPEG 2000 image codec. As part of the activity, which is expected to be completed within the next 18 months, an initial set of hardware-focused profiles for professional video streaming is being codified. These profiles use the unique capabilities of the High-Throughput JPEG 2000 block coder, specified in Rec. ITU-T T.814 | ISO/IEC 15444-15, to shrink the hardware resources needed to handle modern high-frame-rate and high-resolution images.

JPEG XS

JPEG XS, the image and video compression format for transmitting visually lossless, high-quality pictures with minimal latency and low resource consumption, is a fundamental game-changer for real-time video transmission in live, professional, and broadcast applications. In this context, the JPEG Committee created an AMD1 for JPEG XS Part 2 to define some additional levels and sublevels, as well as a new frame buffer level. These additions each address specific requirements that came from the respective industry sectors that rely on JPEG XS. This new AMD1 for Part 2 was issued for DIS balloting. In the meantime, the ballot results for AMD1 for JPEG XS Part 1 were processed, and an FDIS ballot was initiated. Both AMDs are expected to be published before the end of this year.

JPEG RF

At the 110th JPEG meeting, JPEG RF made significant progress against its mandates, formally approving the Use Cases and Requirements for JPEG Radiance Fields v1.0 and requesting its public release on the JPEG website. Substantial technical discussions advanced the evaluation and assessment pipeline for radiance fields, covering both coding-only and joint instantiation and coding approaches. The Working Group also approved Exploration Study 7, including the study on pair-wise comparison assessment methodologies for radiance fields. In addition, next steps were agreed for outreach activities to engage additional stakeholders.

JPEG AI

During the 110th JPEG meeting, JPEG AI was focused on implementation aspects and on extending its applicability across devices and use cases. First, the Use Cases and Requirements document was updated, introducing a new video streaming and storage use case that positions JPEG AI as a deterministic still-image coding engine that can be integrated into video coding pipelines.

A new core experiment addresses the bit-exact reference frame reconstruction requirement. Moreover, other core experiments were defined to analyze power consumption on heterogeneous CPU–GPU/FPGA platforms and to retrain JPEG AI in the RGB domain for fair comparison with other codecs. Looking ahead, JPEG AI plans to develop mobile-ready encoder and decoder implementations, investigate error-resilience properties, and continue benchmarking JPEG AI against state-of-the-art learnt image codecs using solid and robust test conditions.

JPEG DNA

The wet-lab experiments, including DNA synthesis/sequencing, designed at the 109th JPEG meeting were completed, and the synthesized results have been delivered to the JPEG Committee as DNA molecules. As a next step, independent parties are carrying out sequencing separately, and the sequenced results are expected to be available by the next JPEG meeting, when JPEG DNA (ISO/IEC 25508-1) is expected to reach the DIS stage.

JPEG Pleno

During the 110th JPEG meeting, the JPEG Committee reviewed the outcomes of the subjective quality assessment conducted on the evaluation dataset with the aim to examine the performance of the proposals submitted in response to the Call for Proposals on objective metrics for JPEG Pleno Light Field Quality Assessment. The performance of submitted metrics was analysed across scenes with diverse spatial and angular resolutions and for both coding-only and joint coding and view-synthesis artefacts, highlighting differences in behaviour across distortion categories. Learning-based proposals were recognized as a promising direction, particularly when cross-validated on the evaluation dataset, while also raising considerations related to training, data dependency, and reproducibility. The evaluation phase was formally closed, with agreement to retain a set of well-established full-reference metrics as reference anchors and to pursue a combined technical direction integrating end-to-end and hybrid learning-based approaches. Finally, responsibilities across task forces were consolidated, and next steps were defined to continue the objective quality assessment work towards a first version of a working draft.

Highlights of JPEG 2000 25th Anniversary Celebrations, Sydney, 14 January 2026

The 110th JPEG meeting in Sydney offered a fitting occasion to mark the 25th anniversary of JPEG 2000 standardization. Opening the celebration, Prof. Touradj Ebrahimi, JPEG convenor, noted that it was in Sydney during the 12th JPEG meeting in 1997 that JPEG 2000 proposals were evaluated, culminating in the publication of the standard in December 2000.

The program featured a video message from Prof. Michael Marcellin, a key contributor to several core technologies adopted by JPEG 2000 and chair of the subsequent software verification model effort. He highlighted the successful deployment of JPEG 2000 for digital distribution of motion pictures and the essential standards work involved in defining the digital cinema profiles that enabled this adoption.

Prof. David Taubman, whose long-standing leadership and technical contributions continue to shape JPEG 2000 development, delivered a presentation highlighting the coding tools that underpin the format’s highly scalable and accessible codestreams. He also outlined recent progress in High Throughput JPEG 2000 (HTJ2K), including high-performance implementations, full-float lossless compression for OpenEXR, and FPGA-based realizations delivering high-speed, low-latency coding.

Messages from Prof. Majid Rabbani and Dr. Daniel Lee—both instrumental in guiding the JPEG 2000 standardisation process—paid tribute to the dedication, expertise, and collaborative spirit of the many JPEG members who contributed to the standard’s success. Daniel, who served as JPEG convenor during the JPEG 2000 standardisation period, further underscored JPEG’s essential role as a collaborative international forum for developing standards with global reach.

The celebration concluded with an address by Dr. Pierre Anthony Lemieux, co-chair of the JPEG 2000 activity, who highlighted the format’s enduring flexibility as a key factor in its longevity. He noted that this flexibility allows end users to expand the capabilities of their workflows without the burden of switching to a different codec. Dr. Lemieux also emphasised the importance of ongoing maintenance activities, which allow JPEG 2000 to evolve to meet the shifting needs of its users, including current work on defining HTJ2K profiles and levels. He finished by stressing the importance of open source tools and libraries in driving adoption.

A sustained commitment to meeting industry needs and continued maintenance of the standard remains central to the ongoing and future success of JPEG 2000.

Final Quote

“Reaching Committee Draft for JPEG Trust Part 3: Media Asset Watermarking is a pivotal step toward restoring confidence in digital media at a moment when generative AI makes convincing manipulation accessible to anyone. This milestone equips industries and public institutions with interoperable, standards-based watermarking to support authenticity, provenance, integrity, rights signalling, and clear labelling, helping to curb mis- and disinformation, strengthen digital rights management, and enable reliable source tracing at a global scale,” said Prof. Touradj Ebrahimi, the Convenor of the JPEG Committee.

ACM SIGMM European Chapter

The ACM SIGMM European Chapter (https://sites.google.com/view/sigmm-eu-chapter/) aims to serve as a vital regional hub, bridging the gap between the global multimedia community and the diverse research landscape within Europe. Its primary aim is to foster a more connected and inclusive environment by supporting the consolidation of local multimedia groups and their collaborations. The scope of the chapter is inherently interdisciplinary, seeking to strengthen exchanges between multimedia researchers and experts in domains such as Healthcare, Geospatial Science, Robotics, and Social Media. By coordinating European-wide events, thematic summer schools, and collaborative workshops, the chapter aims to promote technical innovation and facilitate a robust dialogue between academic research and industrial application across the continent in the multimedia field.

The official kick-off of the ACM SIGMM European Chapter took place on October 28, 2025, during the 33rd ACM International Conference on Multimedia (ACM MM) in Dublin, Ireland. Organized by prominent community members including Xavier Alameda-Pineda (https://xavirema.eu/), Elisa Ricci (https://eliricci.eu/), and Pablo Cesar (https://www.pablocesar.me/), who co-founded the Chapter, the meeting marked a significant milestone in establishing a structured regional presence for multimedia researchers across Europe. The event featured three keynote talks, setting a tone of innovation and interdisciplinary collaboration: “From Individual Immersion to Shared Experiences in Social XR” by Silvia Rossi (Centrum Wiskunde & Informatica), “Lifelogging: from POV images to multi-perspective videos” by Alice Tran (Dublin City University), and “Bridging Multimodal Representation Learning and Generation through Masked Modeling” by Samir Sadok (Inria @ Univ. Grenoble Alpes). See the pictures of these keynotes below. The primary mission discussed during the meeting was to bridge the gap between global SIGMM activities and the specific needs of the European landscape, focusing on networking, supporting local multimedia groups, and fostering stronger ties between academia and industry.

Figure 1: Pictures of the three keynote talks of the ACM SIGMM European Chapter kick-off, from left to right: Samir Sadok, Alice Tran, and Silvia Rossi.

What is an ACM Chapter?

An ACM chapter is a local or regional group within the ACM that brings people together around computing topics to network, share knowledge, and collaborate. Our Chapter covers Europe and is linked to the Special Interest Group on Multimedia (SIGMM), so it focuses on multimedia research in Europe. It is run by volunteers and acts as a local hub where members organize events, discussions, and collaborations, making the global ACM community more interactive and accessible for a particular topic and region.

Current chairs 

During the kick-off meeting the brand new chapter bylaws were also approved and the new chapter officers were elected. They are:

  • Valérie Gouet-Brunet (Chair) is a Research Director at the French mapping agency (IGN) and is affiliated with University Gustave Eiffel within the LaSTIG laboratory in France. With a distinguished career in computer science, her research focuses on image retrieval, multimedia information retrieval, computer vision, and the structuring of large-scale iconographic heritage. She has a founding role in the SUMAC workshop series at ACM Multimedia. Her leadership in the chapter aims to promote inclusivity and support the growth of multimedia research initiatives throughout the diverse European regions.
  • Irene Viola (Vice-Chair) is a senior researcher at Centrum Wiskunde & Informatica (CWI) in the Netherlands, specifically within the Distributed and Interactive Systems (DIS) group. Her expertise lies at the intersection of multimedia compression, transmission, and the evaluation of Quality of Experience (QoE) for immersive systems. Dr. Viola’s work is particularly focused on volumetric video and social XR, exploring how users interact within virtual environments and how to optimize the delivery of these complex media types. 
  • Xavier Alameda-Pineda (Treasurer) is a Director of Research at Inria Grenoble in France, where he leads research in machine learning for computer vision and multi-modal fusion. His work often bridges the gap between multimedia perception and robotics, particularly in analyzing human behavior and scene understanding. Dr. Alameda-Pineda brings extensive experience in organizing high-level scientific events and managing international research collaborations to the chapter’s executive committee.

Current activities

The chapter endorses several activities in the European landscape. The SoRAIM Winter School (Social Robotics, Artificial Intelligence, and Multimedia, https://project.inria.fr/soraim/) stands as a flagship educational initiative of the chapter. The school has evolved into a recurring forum supported by ACM SIGMM and Inria, dedicated to the next generation of researchers in robotics and human-centric multi-modal AI. Its most recent edition, held in February 2026 in the French Alps (Autrans), epitomizes the school’s multidisciplinary spirit. By bridging the gap between signal processing, robotics, and social sciences, SoRAIM provides PhD students with a unique blend of high-level lectures and hands-on experience. This approach ensures that participants are not only technically proficient in multi-modal fusion and multi-person dialogue management but are also deeply attuned to the ethical and psychological dimensions of deploying AI and robotics in real-world social environments.

A second prominent activity endorsed by the chapter is the Spring School on Social XR, ACM Europe School (https://www.dis.cwi.nl/spring-school/), which pivots toward the cutting edge of immersive technologies and interpersonal communication. Hosted at CWI in Amsterdam, the school explores the technical and psychological dimensions of Social Extended Reality (XR). Participants engage with topics ranging from point cloud compression and low-latency transmission to the evaluation of user experience in shared virtual spaces. This program is particularly vital for the chapter’s mission, as it bridges the gap between traditional multimedia signal processing and the emerging field of social interactive systems, emphasizing hands-on tutorials and collaborative project work.

In addition to these educational activities, the chapter is currently laying the groundwork for a dedicated Symposium co-located with the ACM International Conference on Multimedia Retrieval (ICMR). Although still in the early stages of scheduling and organization, this symposium is envisioned as a strategic forum for European researchers to present “work-in-progress” and foster networking ahead of the larger global conferences. By aligning with ICMR, the chapter aims to provide a localized platform for the European multimedia retrieval community to synchronize their efforts, discuss regional funding opportunities, and strengthen the ties between academic labs and European industrial partners.

How to contribute to the Chapter?

There are several concrete ways for people to contribute, collaborate, and shape activities:

  • Sharing opportunities through the mailing list is one of the core lifelines of the community. Keeping it active makes a big difference.
  • Proposing events & activities that can be partially funded by ACM SIGMM
    • Workshops, tutorials, or special sessions at conferences
    • Local meetups or regional symposiums
  • Supporting early-career researchers
    • Summer schools, PhD training events, doctoral symposiums
    • Job opportunities, internships, and mobility programs

Music Meets Science at ACM Multimedia 2025

Multimedia research is framed through algorithms, datasets, and systems, but at its heart lies content that is deeply human. Few forms of content illustrate this better than classical music. Long before music becomes data to be recorded, generated, searched, or retrieved, it is imagined by composers and brought to life by performers. At ACM Multimedia 2025 in Dublin, this human origin of multimedia took centre stage in a unique social event that bridged classical music and multimedia content analysis. This was the fourth event supported by ACM SIGMM within the Music Meets Science program, following those at CBMI 2022, CBMI 2023, and CBMI 2024.

Music Meets Science explores musical spaces across centuries and styles, from the dynamic Folia of Vivaldi and Handel’s Passacaglia to works by Schubert and contemporary composers from around the world. The goal here is to bring a wide range of music performed by some of the original content creators, our classical musicians, to the multimedia research community, who explore and mine this content. It brings fundamental cultural values to the young researchers in Multimedia, opening their minds to classical and contemporary music which oscillates with the rhythm of centuries.

The concert took place on 29 October, starting at 8:00 PM, during the Welcome Reception of ACM Multimedia 2025. It was attended by over 1,000 delegates of all ages from doctoral students to senior researchers. The programme featured music by Irish composer Garth Knox and a new composition by Finnish composer Jarno Vanhanen, written especially for ACM Multimedia. The performance was delivered by internationally acclaimed French musicians of the new generation: François Pineau-Benois (violin) and Olivier Marin (viola), see Figure 1. Together, they invited the audience to experience music not only as sound, but as rich multimedia content shaped by structure, expression, interpretation, and context.

Figure 1. François Pineau-Benois (violin) and Olivier Marin (viola) performing Jarno Vanhanen’s “Aurora Borealis” duet.

By embedding live performance within a major multimedia conference, Music Meets Science highlights the importance of integrating creative arts into the research ecosystem. As multimedia research continues to advance, from content understanding to generation, events like this remind us that artistic practice is not just an application domain, but a source of inspiration. Strengthening the dialogue between creative arts and multimedia research can deepen our understanding of content, context, and meaning, and enrich the future directions of the field.

Welcome message from the SIGMM Executives

Dear colleagues and friends,

We would like to begin by sincerely thanking the SIGMM community for the trust you have placed in us. We are honored to serve as Chairs alongside a talented and dedicated team. A special thanks goes to the previous Executive Committee for their outstanding work during challenging times and for laying down a solid foundation for the future.

We are at an exciting juncture. Multimedia is no longer just a field; it is the connective tissue of modern life. From intelligent communication and immersive experiences to AI-generated content and digital twins, multimedia systems are shaping how we learn, work, and connect. Our community is uniquely positioned to lead in this space.

Over the next two years, we want to focus on presence — not only in terms of emerging technologies, but also in how SIGMM can be present for researchers around the world. Together, we will:

  • Champion young researchers and amplify their voices.
  • Promote open science by supporting the sharing of code, data, and reproducible research.
  • Increase industry engagement by creating meaningful bridges between academia and application.
  • Strengthen our global presence through active local chapters and outreach.
  • Ensure SIGMM remains a space that is inclusive, diverse, and equitable, a community where everyone feels welcome and empowered.
  • Position SIGMM conferences and journals as the leading venues for applied AI in multimodal systems, showcasing how sensing, understanding, generation, and interaction converge to solve real-world challenges.

Let us continue to work together and to make an impact, inspired by the richness of multimedia research and united by a shared commitment to excellence and openness.

Abdulmotaleb, Elisa and Silvia


Abdulmotaleb El Saddik is an award-winning technologist and Distinguished Professor whose leadership in Embodied AI, Digital Twins, and Mixed Reality bridges innovation, mentorship, and human impact.

Elisa Ricci is a Professor at University of Trento and Senior researcher at Fondazione Bruno Kessler in Italy. Her research interests include computer vision and multimedia analysis.

Silvia Rossi is a senior scientist at Centrum Wiskunde & Informatica (CWI) in The Netherlands. Her research interests are at the intersection of multimedia systems, artificial intelligence, and user behaviour modelling for immersive and interactive systems.

The MediaEval Benchmark Looks Back at a Successful Fifteenth Edition, and Forward to its Sweet Sixteen

Introduction

The Benchmarking Initiative for Multimedia Evaluation (MediaEval) organizes interesting and engaging tasks related to multimedia data. MediaEval is proud to be supported by SIGMM. Tasks involve analyzing and exploring multimedia collections, as well as accessing the information that they contain. MediaEval emphasizes challenges that have a human or social aspect in order to support our goal of making multimedia a positive force in society. Participants in MediaEval are encouraged to submit effective, but also creative solutions to MediaEval tasks: We carry out quantitative evaluation of the submissions, but also go beyond the scores in order to obtain insight into the tasks, data, and metrics.

Participation in MediaEval is open to any team that wishes to sign up. Registration has just opened, and information is available on the MediaEval 2026 website (https://multimediaeval.github.io/editions/2026). The workshop will take place in Amsterdam, Netherlands, and online, coordinated with ACM ICMR (https://icmr2026.org).

In this column, we present a short report on MediaEval 2025, which culminated with the annual workshop in Dublin, Ireland between CBMI (https://www.cbmi2025.org) and ACM Multimedia (https://acmmm2025.org). Then, we provide an outlook to MediaEval 2026, which will be the sixteenth edition of MediaEval.

A Keynote on Metascience

The workshop kicked off with a keynote on metascience for machine learning. The metascience initiative (https://metascienceforml.github.io) strives to promote discussion and development of the scientific underpinnings of machine learning. It looks at the way in which machine learning is done and examines the full range of relevant aspects, from methodologies to mindsets. The keynote was delivered by Jan van Gemert, head of the Computer Vision Lab (https://www.tudelft.nl/ewi/over-de-faculteit/afdelingen/intelligent-systems/pattern-recognition-bioinformatics/computer-vision-lab) at Delft University of Technology. He discussed the “growing pains” of the field of deep learning and the importance of the scientific method for keeping the field on course. He invited the audience to consider the question of the power of benchmarks for hypothesis-driven science in machine learning and deep learning.

Tasks at MediaEval 2025

The MediaEval 2025 tasks reflect the benchmark’s continued emphasis on human-centered and socially relevant multimedia challenges, spanning healthcare, media, memory, and responsible use of generative AI.

Several tasks this year focused on the human aspects of multimodal analysis, combining visual, textual, and physiological signals. The Medico Task challenges participants to build visual question answering models for the interpretation of gastrointestinal images, aiming to support clinical decision-making through interpretable multimodal explanations. The Memorability Task focuses on modeling long-term memory for short movie excerpts and commercial videos, requiring participants to predict how memorable a video is, whether viewers are familiar with it, and, in some cases, to leverage EEG signals alongside visual features. Multimodal understanding is further explored in the MultiSumm Task, where participants are provided with collections of multimodal web content describing food sharing initiatives in different cities and are asked to generate summaries that satisfy specific informational criteria, with evaluation exploring both traditional and emerging LLM-based assessment approaches.

The remaining two tasks emphasize the societal impact of multimedia technology in real-world settings. In the NewsImages Task, participants worked with large collections of international news articles and images, either retrieving suitable thumbnail images or generating thumbnails for articles. The Synthetic Images Task addressed the growing prevalence of AI-generated content online, asking participants to detect synthetic or manipulated images and localize manipulated areas. The task used data created by state-of-the-art generative models as well as images collected from real-world online settings. We gratefully acknowledge the support of AI-CODE (https://aicode-project.eu), a European project focused on topics related to these two tasks.

MediaEval in Motion

MediaEval is especially proud of participants who return over the years, improving their approaches and contributing insights. We would like to highlight two previous participants who became so interested and involved in MediaEval tasks that they decided to join the task organization team and help organize the tasks. Iván Martín-Fernández, PhD student at Universidad Politécnica de Madrid, became a task organizer for the Memorability task and Lucien Heitz, PhD Student, University of Zurich, became a task organizer for NewsImages. 

One aspect of the MediaEval Benchmark I value most is its effort to go beyond metric-chasing and embark on a “quest for insights,” as the organizers put it, to help us better understand the tasks and encourage creative, innovative solutions. This spirit motivated me to participate in the 2023 Memorability Task in Amsterdam. The experience was so enriching that I wanted to become more involved in the community. In 2025, I was invited to join the Memorability Task organizing team, which gave me the chance to contribute to and help foster this innovative research effort. Thanks to SIGMM’s sponsorship, I was able to attend the event in Dublin, which further enhanced the experience. Working alongside Martha and Gabi as a student volunteer is always a pleasure. As my PhD studies come to an end, I’m proud to say that MediaEval has been a core part of my research, and I’m sure it will remain so in the immediate future. See you in Amsterdam in June!

Iván Martín-Fernández, PhD Student, GTHAU – Universidad Politécnica de Madrid

I ‘graduated’ from being a participant in the previous NewsImages challenge to now taking over the organization duties of the 2025 iteration of the task. It was an incredible journey and learning experience. Big thank you to the main MediaEval organizers for their tireless support and input for shaping this new task that combines image retrieval and generation. The recent benchmark event presented an amazing platform to share and discuss our research. We got so many great submissions from teams around the globe. I was truly overwhelmed by the feedback. Getting involved with the organization of a challenge task is something I can highly recommend to all participants. It allows you to take on an active role and bring new ideas to the table on what problems to tackle next.

Lucien Heitz, PhD Student, University of Zurich

MediaEval continues its tradition of awarding a “MediaEval Distinctive Mention” to teams that dive deeply into the data, the algorithms, and the evaluation procedure. Going above and beyond in this way makes important contributions to our understanding of the task and how to make meaningful progress. Moving the state of the art forward requires improving the scores on a pre-defined benchmark task such as the tasks offered by MediaEval. However, MediaEval Distinctive Mentions underline the importance of research that does not necessarily improve scores on a given task, but rather makes an overall contribution to knowledge.

We were happy to serve as student volunteers at MediaEval 2025. In addition, we participated as a team in the NewsImages task, contributing to two subtasks, and were honored to receive a Distinctive Mention.

Xiaomeng had previously participated in the same task at MediaEval 2023. Compared to the 2023 edition, she observed notable evolution in both the data and task design. These changes reflect the organizers’ careful consideration of recent advances in modeling techniques as well as the practical applicability of the datasets, which proved to be highly inspiring. 

Bram participated in MediaEval for the first time and found the discussions with colleagues about the challenges particularly rewarding. The NewsImages retrieval subtask also taught him how to deal with larger datasets.

We tried to incorporate deeper reflections on our results into our presentation. Specifically, we showed how certain types of articles are particularly suited for image generation and identified the news categories where retrieval was most effective. 

Xiaomeng Wang and Bram Bakker, PhD Students, Data Science – Radboud University

The people whose work is highlighted in this section are grateful to have received support from SIGMM in order to be able to attend the MediaEval workshop in person. 

Outlook to MediaEval 2026

The 2025 workshop concluded with participants collaborating with the task organizers to start to develop “benchmark biographies”, which are living documents that describe benchmarking tasks. Combining elements from data sheets and model cards, benchmark biographies document motivation, history, datasets, evaluation protocols, and baseline results to support transparency, reproducibility, and reuse by the broader research community. We plan to continue work on these benchmark biographies as we move toward MediaEval 2026. 

Further, in the 2026 edition, we will again offer the tasks that were held in 2025, providing an opportunity for teams that were not able to participate in 2025. We especially encourage “Quest for Insight” papers that examine characteristics of the data and the task definitions, the strengths and weaknesses of particular types of approaches, observations about the evaluation procedure, and the implications of the task.
We look forward to seeing you in Amsterdam for MediaEval and also ACM ICMR. Don’t forget to check out the MediaEval website (https://multimediaeval.github.io) and register your team if you are interested in participating in 2026.

MPEG Column: 153rd MPEG Meeting

The 153rd MPEG meeting took place online from January 19-23, 2026. The official MPEG press release can be found here. This report highlights key outcomes from the meeting, with a focus on research directions relevant to the ACM SIGMM community:

  • MPEG Roadmap
  • Exploration on MPEG Gaussian Splat Coding (GSC)
  • MPEG Immersive Video 2nd edition (new white paper)

MPEG Roadmap

MPEG released an updated roadmap showing continued convergence of immersive and “beyond video” media with deployment-ready systems work. Near-term priorities include 6DoF experiences (MPEG Immersive Video v2 and 6DoF audio), volumetric representations (dynamic meshes, solid point clouds, LiDAR, and emerging Gaussian splat coding), and “coding for machines,” which treats visual and audio signals as inputs to downstream analytics rather than only for human consumption.

Research aspects: The most promising research opportunities sit at the intersections: renderer and device-aware rate-distortion-complexity optimization for volumetric content; adaptive streaming and packaging evolution (e.g., MPEG-DASH / CMAF) for interactive 6DoF services under tight latency constraints; and cross-cutting themes such as media authenticity and provenance, green and energy metadata, and exploration threads on neural-network-based compression and compression of neural networks that foreshadow AI-native multimedia pipelines.

MPEG Gaussian Splat Coding (GSC)

Gaussian Splat Coding (GSC) is MPEG’s effort to standardize how 3D Gaussian Splatting (3DGS) content is encoded, decoded, and evaluated so that it can be exchanged and rendered consistently across platforms. Such content represents a scene as sparse “Gaussian splats” with geometry plus rich attributes (scale and rotation, opacity, and spherical-harmonics appearance for view-dependent rendering). The main motivation is interoperability for immersive media pipelines: enabling reproducible results, shared benchmarks, and comparable rate-distortion-complexity trade-offs for use cases spanning telepresence and immersive replay to mobile XR and digital twins, while retaining the visual strengths that made 3DGS attractive compared to heavier neural scene representations.
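
To make the attribute list above concrete, the following Python sketch models one uncompressed splat and estimates the raw size of a scene. It is an illustration only, not the GSC bitstream or any MPEG data format: the class name GaussianSplat, the helper names, and the defaults (degree-3 spherical harmonics on three colour channels, 32-bit floats, as in common 3DGS implementations) are assumptions.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


def sh_coeff_count(degree: int, channels: int = 3) -> int:
    """Spherical-harmonics coefficients up to `degree`, summed over colour channels."""
    return (degree + 1) ** 2 * channels


def scalars_per_splat(sh_degree: int = 3) -> int:
    """Position (3) + scale (3) + rotation quaternion (4) + opacity (1) + SH."""
    return 3 + 3 + 4 + 1 + sh_coeff_count(sh_degree)


@dataclass
class GaussianSplat:
    """Hypothetical uncompressed splat record (not an MPEG-defined structure)."""
    position: Tuple[float, float, float]
    scale: Tuple[float, float, float]
    rotation: Tuple[float, float, float, float]  # quaternion
    opacity: float
    sh: List[float] = field(default_factory=list)  # view-dependent appearance

    def num_scalars(self) -> int:
        return 3 + 3 + 4 + 1 + len(self.sh)


def raw_size_bytes(num_splats: int, sh_degree: int = 3,
                   bytes_per_scalar: int = 4) -> int:
    """Raw (uncompressed) scene size under the assumed layout."""
    return num_splats * scalars_per_splat(sh_degree) * bytes_per_scalar


if __name__ == "__main__":
    print(scalars_per_splat(3))       # scalars stored per splat
    print(raw_size_bytes(1_000_000))  # bytes for one million splats
```

Under these assumptions each splat carries 59 scalars, so one million splats occupy 236,000,000 bytes before any compression, most of it in the spherical-harmonics appearance terms.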

The work remains in an exploration phase, coordinated across ISO/IEC JTC 1/SC 29 groups WG 4 (MPEG Video Coding) and WG 7 (MPEG Coding for 3D Graphics and Haptics) through Joint Exploration Experiments covering datasets and anchors, new coding tools, software (renderer and metrics), and Common Test Conditions (CTC). A notable systems thread is “lightweight GSC” for resource-constrained devices (single-frame, low-latency tracks using geometry-based and video-based pipelines with explicit time and memory targets), alongside an “early deployment” path via amendments to existing MPEG point-cloud codecs to more natively carry Gaussian-splat parameters. In parallel, MPEG is testing whether splat-specific tools can outperform straightforward mappings in quality, bitrate, and compute for real-time and streaming-centric scenarios.

Research aspects: Relevant SIGMM directions include splat-aware compression tools and rate-distortion-complexity optimization (including tracked vs. non-tracked temporal prediction); QoE evaluation for 6DoF navigation (metrics for view and temporal consistency and splat-specific artifacts); decoder and renderer co-design for real-time and mobile lightweight profiles (progressive and LOD-friendly layouts, GPU-friendly decode); and networked delivery problems such as adaptive streaming, ROI and view-dependent transmission, and loss resilience for splat parameters. Additional opportunities include interoperability work on reproducible benchmarking, conformance testing, and practical packaging and signaling for deployment.

MPEG Immersive Video 2nd edition (white paper)

The second edition of MPEG Immersive Video defines an interoperable bitstream and decoding process for efficient 6DoF immersive scene playback, supporting translational and rotational movement with motion parallax to reduce discomfort often associated with pure 3DoF viewing. The second edition primarily extends functionality (without changing the high-level bitstream structure), adding capabilities such as capture-device information, additional projection types, and support for Simple Multi-Plane Image (MPI), alongside tools that better support geometry and attribute handling and depth-related processing.

Architecturally, MIV ingests multiple (unordered) camera views with geometry (depth and occupancy) and attributes (e.g., texture), then reduces inter-view redundancy by extracting patches and packing them into 2D “atlases” that are compressed using conventional video codecs. MIV-specific metadata signals how to reconstruct views from the atlases. The standard is built as an extension of the common Visual Volumetric Video-based Coding (V3C) bitstream framework shared with V-PCC, with profiles that preserve backward compatibility while introducing a new profile for added second-edition functionality and a tailored profile for full-plane MPI delivery.
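
The atlas-packing step described above can be illustrated with a generic rectangle packer. The Python sketch below uses simple shelf packing and is only an assumption-laden illustration: it is not the packing method of any MIV encoder, and the function name pack_patches, the patch tuple layout, and the atlas dimensions are hypothetical.

```python
from typing import Dict, List, Tuple

Patch = Tuple[str, int, int]  # (patch_id, width, height) in pixels


def pack_patches(patches: List[Patch], atlas_w: int,
                 atlas_h: int) -> Dict[str, Tuple[int, int]]:
    """Place patches left to right on horizontal shelves, tallest first.

    Returns the top-left (x, y) position of each patch inside the atlas.
    Raises ValueError if the patches do not fit.
    """
    placements: Dict[str, Tuple[int, int]] = {}
    x = y = shelf_h = 0
    for pid, w, h in sorted(patches, key=lambda p: p[2], reverse=True):
        if w > atlas_w or h > atlas_h:
            raise ValueError(f"patch {pid} is larger than the atlas")
        if x + w > atlas_w:      # current shelf is full: open a new one below
            y += shelf_h
            x = 0
            shelf_h = 0
        if y + h > atlas_h:
            raise ValueError("atlas is full")
        placements[pid] = (x, y)
        x += w
        shelf_h = max(shelf_h, h)
    return placements


if __name__ == "__main__":
    demo = [("a", 4, 4), ("b", 4, 2), ("c", 4, 3)]
    print(pack_patches(demo, atlas_w=8, atlas_h=8))
```

In a real pipeline, the resulting positions would be the kind of information that metadata has to convey so that a decoder can map atlas regions back to source views; a production packer would additionally handle rotation, padding, and temporal stability.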

Research aspects: Key SIGMM topics include systems-efficient 6DoF delivery (better view and patch selection and atlas packing under latency and bandwidth constraints); rate-distortion-complexity-QoE optimization that accounts for decode and render cost (especially on HMD and mobile) and motion-parallax comfort; adaptive delivery strategies (representation ladders, viewport and pose-driven bit allocation, robust packetization and error resilience for atlas video plus metadata); renderer-aware metrics and subjective protocols for multi-view temporal consistency; and deployment-oriented work such as profile and level tuning, codec-group choices (HEVC / VVC), conformance testing, and exploiting second-edition features (capture device info, depth tools, Simple MPI) for more reliable reconstruction and improved user experience.

Concluding Remarks

The meeting outcomes highlight a clear shift toward immersive and AI-enabled media systems where compression, rendering, delivery, and evaluation must be co-designed. These developments offer timely opportunities for the ACM SIGMM community to contribute reproducible benchmarks, perceptual metrics, and end-to-end streaming and systems research that can directly influence emerging standards and deployments.

The 154th MPEG meeting will be held in Santa Eulària, Spain, from April 27 to May 1, 2026. Click here for more information about MPEG meetings and ongoing developments.

Quality of Multimedia Experience Meets Machine Intelligence

1. Why QoE meets Machine Intelligence Now

Multimedia systems are evolving towards AI-driven, adaptive services, leading to a natural convergence of QoE and machine intelligence. In this context, machine intelligence can empower QoE through learning-based, context-aware, and semantic-driven modelling and optimization. At the same time, QoE can guide machine intelligence by providing a human-centred objective for AI system design and evaluation; see also [11]. Looking beyond human perception, toward agent-centric and hybrid QoE, future multimedia systems increasingly require unified experience objectives that support human-AI co-experience. QoMEX’26 in Cardiff stands as a major milestone highlighting the convergence of Quality of Multimedia Experience with Machine Intelligence. This column reflects on this evolution and outlines the key challenges ahead.

Multimedia systems have shifted from “best-effort delivery” toward intelligent, adaptive services that operate under highly diverse network conditions, device capabilities, and user contexts. In this landscape, Quality of Experience (QoE) has become a central concept, focusing on user satisfaction rather than purely signal-level fidelity [1, 2, 3].

QoE has traditionally been human-centric, reflecting perceived quality, enjoyment, comfort, and acceptance of multimedia services [2]. Meanwhile, machine intelligence, from deep learning and reinforcement learning to multimodal foundation models, has rapidly become the dominant paradigm for perception, generation, and decision-making. The intersection of these trends is timely and inevitable: QoE provides the human-centred goal, while machine intelligence provides scalable tools to model and optimize experience in complex real-world environments. Figure 1 summarizes this bidirectional relationship between QoE and machine intelligence, from multimodal inputs to human-centric, agent-centric, and hybrid QoE objectives.

Figure 1. A conceptual framework where machine intelligence enables QoE prediction and QoE-aware optimization, while QoE evolves from a human-centric notion toward agent-centric and hybrid objectives in intelligent multimedia systems.

2. How machine intelligence can empower QoE

(1) Learning QoE models beyond handcrafted rules

Classic QoE models often rely on handcrafted features and simplified assumptions linking system parameters (bitrate, delay, resolution) to perceived quality. Machine learning offers a flexible alternative: it can learn complex nonlinear mappings from content, network conditions, and user interaction signals to QoE outcomes. Deep models further enable learning from high-dimensional inputs such as raw video frames, audio signals, and multimodal logs, supporting richer QoE prediction in streaming, immersive media, short-form video, gaming, and interactive communication. In this context, advances in perceptual quality assessment (e.g., full-reference and no-reference IQA/VQA) also provide useful foundations for QoE-related modelling [5, 8, 9].

(2) QoE-aware control and optimization

Machine intelligence is not only about prediction; it can also enable QoE-driven decision-making. Instead of optimizing network metrics alone, systems can adapt encoding, bitrate selection, buffering strategies, or rendering policies to maximize predicted QoE. This direction has been extensively studied in adaptive streaming, where QoE-driven strategies are used to balance bitrate quality and playback stability [4]. Reinforcement learning is particularly promising here: QoE can serve as a reward signal, and agents can learn robust policies under uncertainty (e.g., bandwidth fluctuations, user engagement changes) [6, 7].
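
One common way to turn QoE into an objective or reward in the adaptive-streaming literature (for example in [6, 7]) is a linear combination of per-chunk quality, rebuffering time, and quality switches. The Python sketch below illustrates that idea; the function name session_qoe, the use of bitrate in Mbps as the quality term, and the default penalty weights are illustrative assumptions rather than values taken from this column.

```python
from typing import Sequence


def session_qoe(bitrates_mbps: Sequence[float],
                rebuffer_s: Sequence[float],
                rebuffer_penalty: float = 4.0,    # assumed weight per stalled second
                smoothness_penalty: float = 1.0   # assumed weight per Mbps switched
                ) -> float:
    """Linear QoE score for a streaming session.

    quality  : sum of per-chunk bitrates (higher is better)
    stalls   : total rebuffering time, weighted by rebuffer_penalty
    switches : sum of absolute bitrate changes between consecutive chunks
    """
    if len(bitrates_mbps) != len(rebuffer_s):
        raise ValueError("need one rebuffering value per chunk")
    quality = sum(bitrates_mbps)
    stalls = rebuffer_penalty * sum(rebuffer_s)
    switches = smoothness_penalty * sum(
        abs(b - a) for a, b in zip(bitrates_mbps, bitrates_mbps[1:])
    )
    return quality - stalls - switches


if __name__ == "__main__":
    # Three chunks: one upward switch and a half-second stall on chunk 2.
    print(session_qoe([1.0, 2.0, 2.0], [0.0, 0.5, 0.0]))
```

In a reinforcement-learning setting, the per-chunk contribution to this score could serve as the step reward, and changing the two weights is one simple way to express different user sensitivities to stalls versus quality switches.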

(3) Personalization and context-awareness

QoE is inherently subjective and context-dependent. Machine intelligence can support personalization by incorporating user preferences and context signals such as device type, mobility, ambient environment, and usage patterns. For example, some users are more sensitive to rebuffering events, while others prioritize sharpness and resolution. Context-aware learning enables systems to move beyond “one-size-fits-all” adaptation.

(4) Semantic Intelligence

Machine intelligence can empower QoE by shifting quality assessment from perceptual fidelity toward semantic quality, that is, how well the meaning and task-relevant information of multimodal content are preserved for both machines and humans. As multimedia data is increasingly consumed by AI systems in applications like autonomous systems and AI-generated content pipelines, traditional perceptual metrics fail to reflect performance and experience because they ignore semantic consistency. Semantic-aware evaluation may enable both task-oriented and task-agnostic assessment. By integrating semantic quality assessment, AI can guide compression, transmission, and system design in ways that better align technical performance with downstream task success and user experience.

3. How QoE can guide machine intelligence

The relationship between QoE and machine intelligence is bidirectional: QoE can also shape how multimedia AI systems are designed, trained, and evaluated.

(1) QoE as a human-centric objective function

Many multimedia AI pipelines optimize proxy metrics such as accuracy, PSNR/SSIM, or task performance. However, these do not always align with perceived quality or user satisfaction. QoE provides a principled framework to define what “better” means from the user’s perspective and encourages evaluation beyond technical fidelity [2, 10].

(2) Aligning generative intelligence with user satisfaction

With the rise of generative AI for multimedia enhancement and creation, QoE becomes even more critical. High-quality generation is not only about realism but also about temporal consistency, comfort, trust, and acceptance in real usage conditions. Integrating QoE considerations can help steer generative models toward outcomes that users actually prefer.

Emerging Challenge “QoE of interactive AI systems”

AI evaluation is shifting from pure model accuracy toward experience-based assessment of how humans interact with AI, aligned with frameworks like the EU AI Act. Quality of Experience (QoE) and UX research provide established methods to measure subjective aspects such as trust, transparency, human oversight of AI systems, robustness, and satisfaction. Applying QoE methodologies can translate high-level AI principles into measurable experiential dimensions reflecting real-world user understanding and use. This requires new metrics that reflect how users actually understand, trust, and operate AI systems in practice. For more details, see [11].

4. Beyond human-centric QoE: toward agent-centric and hybrid QoE

While QoE has historically focused on human perception, emerging multimedia systems increasingly serve autonomous agents such as robots, drones, and intelligent vehicles. In these scenarios, multimedia is not only consumed by humans but also by machines. This motivates an extended view of QoE, agent-centric QoE, where “experience” can be interpreted as the utility of multimedia inputs for decision-making and task execution.

Agent-centric QoE can be characterized through indicators such as perception reliability, uncertainty reduction, latency sensitivity, safety margins, energy efficiency, and task success rate. Importantly, many future applications involve human–AI co-experience, for example, in teleoperation, remote driving, robot-assisted inspection, and collaborative XR. In such systems, overall quality depends on both human satisfaction and machine performance. As shown in Figure 1, this motivates unified QoE objectives that jointly optimize human satisfaction and agent utility.

5. Key challenges

Despite its promise, QoE-meets-AI research faces several open challenges:

  • Subjective data cost and scarcity: QoE ground truth often requires user studies and careful experimental design [2, 3].
  • Generalization: QoE models may struggle across unseen content types, devices, or cultural contexts.
  • Bias and fairness: QoE datasets may underrepresent certain user groups or contexts, leading to skewed optimization.
  • Explainability and trust: Black-box QoE predictions can be difficult to interpret and validate in engineering pipelines.
  • Privacy: Personalization requires user data, raising responsible data usage concerns.
  • Ethical aspects: Beyond established research ethics procedures, QoE research must increasingly address the broader ethical implications of AI-driven experience optimization, such as fairness, transparency, wellbeing, privacy, and environmental impact, which are essential for truly human-centred technology.

6. Outlook and takeaways

The convergence of Quality of Experience and machine intelligence represents a major opportunity for the multimedia community. Machine intelligence offers scalable tools to predict and optimize QoE in complex environments, while QoE provides a human-centred lens to guide AI system design toward real user value. Looking forward, QoE may evolve from a purely human-centric notion to a hybrid experience shared by humans and intelligent agents, enabling multimedia systems that are not only technically advanced, but also aligned with what humans and autonomous agents truly need.

Looking ahead to the continued evolution of the QoMEX conference series, QoMEX’26 in Cardiff represents a key milestone where Quality of Multimedia Experience directly converges with Machine Intelligence. As AI increasingly shapes how multimedia is created, transmitted, and consumed, the conference invites the community to rethink both the goals and methods of QoE research – using AI to enhance user experience, while drawing on QoE insights to build more human-aware, trustworthy, and adaptive intelligent systems. This vision is reflected in special sessions:
SS1: “Semantic Quality Assessment for Multi-Modal Intelligent Systems”, which extends quality evaluation beyond perceptual fidelity toward meaning and task relevance. The session aims to lay the foundations of multimodal semantic quality assessment, enable semantic-driven compression and transmission, and connect semantic quality evaluation with AI understanding.
SS2: “Beyond Quality: Integrating Ethical Dimensions in QoE Research”, which emphasizes fairness, transparency, wellbeing, privacy, and environmental impact as essential for truly human-centred technology. This session calls for ethically reflexive, value-sensitive QoE frameworks that incorporate social impact, collective QoE, and inclusive research practices alongside traditional UX measures.

Together, these themes signal a continued broadening of the QoE scope, reaffirming QoMEX as a forum that evolves with emerging technologies while advancing inclusive, responsible, and future-oriented quality research. The 18th International Conference on Quality of Multimedia Experience (QoMEX’26) will take place in Cardiff, United Kingdom, from June 29 to July 3, 2026. Please find more information on the website of QoMEX’26: https://qomex2026.itec.aau.at/

References

[1] ITU-T Rec. P.10/G.100 (2006), Vocabulary for performance and quality of service.

[2] Möller, S., & Raake, A. (2014), Quality of Experience: Advanced Concepts, Applications and Methods. Springer.

[3] De Moor, K., et al. (2010), Proposed framework for evaluating quality of experience in a mobile, testbed-oriented living lab setting. Mobile Networks and Applications.

[4] Seufert, M., Egger, S., Slanina, M., Zinner, T., Hoßfeld, T., & Tran-Gia, P. (2015), A survey on quality of experience of HTTP adaptive streaming. IEEE Communications Surveys & Tutorials.

[5] Bampis, C. G., Li, Z., Moorthy, A. K., Katsavounidis, I., Aaron, A., & Bovik, A. C. (2018), Study of temporal effects on subjective video quality of experience. IEEE Transactions on Image Processing.

[6] Yin, X., Jindal, A., Sekar, V., & Sinopoli, B. (2015), A control-theoretic approach for dynamic adaptive video streaming over HTTP. ACM SIGCOMM.

[7] Mao, H., Netravali, R., & Alizadeh, M. (2017), Neural adaptive video streaming with Pensieve. ACM SIGCOMM.

[8] Wang, Z., & Bovik, A. C. (2006), Modern image quality assessment. Springer.

[9] Mittal, A., Moorthy, A. K., & Bovik, A. C. (2013), No-reference image quality assessment in the spatial domain. IEEE Transactions on Image Processing.

[10] Hoßfeld, T., Schatz, R., & Egger, S. (2011), SOS: The MOS is not enough! QoMEX.

[11] Hupont, I., De Moor, K., Skorin-Kapov, L., Varela, M., & Hoßfeld, T. (2025), Rethinking QoE in the Age of AI: From Algorithms to Experience-Based Evaluation. ACM SIGMultimedia Records.