Overview of Open Dataset Sessions and Benchmarking Competitions in 2023-2025 – Part 5 (ACM Multimedia 2023, 2024 and 2025)

In this Dataset Column, we continue the tradition of previous installments by reviewing notable developments in open datasets and benchmarking competitions in multimedia from 2023 to 2025. The selected events reflect the breadth of topics, challenges, and datasets currently shaping the multimedia research community. They include special sessions, grand challenges, competition tracks, and evaluation campaigns involving multimedia data. This Dataset Column extends a series of overviews previously published in ACM SIGMM Records.

This fifth column focuses on the last three editions of the ACM International Conference on Multimedia (ACM MM), one of the flagship conferences in the field, which has long served as a major venue for presenting multimedia benchmarks, open datasets, and community-driven evaluation campaigns.

  • MM ’23: The 31st ACM International Conference on Multimedia (Ottawa, Canada, October 29 – November 3, 2023)
  • MM ’24: The 32nd ACM International Conference on Multimedia (Melbourne, Australia, October 28 – November 1, 2024)
  • MM ’25: The 33rd ACM International Conference on Multimedia (Dublin, Ireland, October 27 – 31, 2025)

ACM Multimedia 2022 was reviewed in the Overview of Open Dataset Sessions and Benchmarking Competitions in 2022 – Part 2 (MDRE at MMM 2022, ACM MM 2022), and ACM Multimedia 2024 was reviewed in a general report.


The growing prominence of data-centric research in the multimedia community is also illustrated by the increasing frequency of the term “dataset” in ACM MM proceedings. In paper titles, the term appeared in 9 papers at MM ’22, rising to 22 at MM ’23, 28 at MM ’24, and 106 at MM ’25. In author keywords, the corresponding numbers were 35, 37, 47, and 104, while in abstracts they increased from 438 to 558, 687, and 869, respectively. Notably, among the MM ’25 papers containing the term “dataset”, many were also marked as Artifacts Available (47 in titles, 35 in keywords, and 68 in abstracts), indicating that related research artifacts had been made publicly accessible. Although this is only an approximate indicator, the trend suggests a growing emphasis on datasets, reproducibility, and open research practices within ACM MM.

Across MM ’23, MM ’24, and MM ’25, the term “dataset” appears in 156 paper titles, 188 author keyword lists, and 2,114 abstracts. Based on these three editions, we present a curated selection of 39 publicly accessible datasets (10 from MM ’23, 10 from MM ’24, and 19 from MM ’25), selected for their relevance to multimedia research, diversity of application domains, and potential for reuse by the community.
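As a quick arithmetic check, the three-edition totals quoted above can be reproduced from the per-edition counts given earlier in this column. The following is a minimal Python sketch: the dictionary and function names are illustrative assumptions, and the numbers are copied from the text rather than re-queried from the ACM Digital Library.

```python
# Per-edition counts of papers mentioning "dataset", as quoted in this column.
# All names below are illustrative; they do not come from any ACM tool or API.
counts = {
    "titles":    {"MM '22": 9,   "MM '23": 22,  "MM '24": 28,  "MM '25": 106},
    "keywords":  {"MM '22": 35,  "MM '23": 37,  "MM '24": 47,  "MM '25": 104},
    "abstracts": {"MM '22": 438, "MM '23": 558, "MM '24": 687, "MM '25": 869},
}

# Only the three editions covered by this column enter the totals.
EDITIONS = ("MM '23", "MM '24", "MM '25")


def total_for(field):
    """Sum the counts of one field over the three covered editions."""
    return sum(counts[field][edition] for edition in EDITIONS)


totals = {field: total_for(field) for field in counts}

# Number of datasets selected per edition in this column.
selected = {"MM '23": 10, "MM '24": 10, "MM '25": 19}

print(totals)
print(sum(selected.values()))
```

The sums agree with the figures stated in the text: 156 titles, 188 keyword lists, 2,114 abstracts, and 39 selected datasets.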


ACM MM 2023

Numerous dataset-related papers were presented at the 31st ACM International Conference on Multimedia (MM ’23), organized in Ottawa, Canada, October 29 – November 3, 2023. The complete MM ’23: Proceedings of the 31st ACM International Conference on Multimedia are available in the ACM Digital Library (https://dl.acm.org/doi/proceedings/10.1145/3581783).

There was no dedicated dataset session among the roughly 33 sessions at the MM ’23 conference. As a small sample, ten selected papers focused primarily on new datasets with publicly available data are listed below. Looking across the ACM MM ’23 papers containing “dataset” in the keyword field, the dominant focus is on creating benchmark resources for emerging multimedia tasks rather than merely applying existing datasets. The contributions span a broad range of multimedia domains, including image and video understanding, multimedia quality assessment, multimodal reasoning, user interaction analysis, emotional and social signal processing, immersive media, multimedia security, retrieval, and cross-modal learning. A clear trend is the coupling of dataset construction with benchmark protocols and baseline methods, reflecting the multimedia community’s increasing emphasis on reproducibility, comparative evaluation, and open research resources. From this broader set, the ten examples below were selected based primarily on scientific impact, current relevance, public availability, and representativeness across the core multimedia research themes traditionally associated with ACM MM.

Towards Explainable In-the-Wild Video Quality Assessment: A Database and a Language-Prompted Approach
Haoning Wu, Erli Zhang, Liang Liao, Chaofeng Chen, Jingwen Hou, Annan Wang, Wenxiu Sun, Qiong Yan, Weisi Lin
Paper available at: https://doi.org/10.1145/3581783.3611737
Dataset available at: https://github.com/VQAssessment/MaxVQA
This work introduces the Maxwell database, a major benchmark for explainable video quality assessment containing 4,543 in-the-wild videos and over two million subjective quality annotations across 13 dimensions. The dataset significantly advances multimedia quality assessment by enabling interpretable analysis of perceptual video quality beyond traditional scalar scoring.

Understanding User Behavior in Volumetric Video Watching: Dataset, Analysis and Prediction
Kaiyuan Hu, Haowen Yang, Yili Jin, Junhua Liu, Yongting Chen, Miao Zhang, Fangxin Wang
Paper available at: https://doi.org/10.1145/3581783.3613810
Dataset available at: https://cuhksz-inml.github.io/user-behavior-in-vv-watching/
This contribution presents one of the first public datasets for volumetric video interaction analysis, including gaze, viewport, and motion behavior. The dataset is highly relevant for immersive multimedia delivery, adaptive streaming, and Quality of Experience optimization in emerging interactive media environments.

TikTalk: A Video-Based Dialogue Dataset for Multi-Modal Chitchat in Real World
Hongpeng Lin, Ludan Ruan, Wenke Xia, Peiyu Liu, Jingyuan Wen, Yixin Xu, Di Hu, Ruihua Song, Wayne Xin Zhao, Qin Jin, Zhiwu Lu
Paper available at: https://doi.org/10.1145/3581783.3612425
Dataset available at: https://ruc-aimind.github.io/projects/TikTalk/
TikTalk provides a large-scale benchmark for video-grounded multimodal dialogue, containing 38,000 videos and 367,000 real-world user conversations. Its scale and realism make it particularly relevant for conversational multimedia AI and multimodal human-computer interaction.

MultiMediate ’23: Engagement Estimation and Bodily Behaviour Recognition in Social Interactions
Philipp Müller, Michal Balazia, Tobias Baur, Michael Dietz, Alexander Heimerl, Dominik Schiller, Mohammed Guermal, Dominike Thomas, François Brémond, Jan Alexandersson, Elisabeth André, Andreas Bulling
Paper available at: https://doi.org/10.1145/3581783.3613851
Dataset available at: https://multimediate-challenge.org/Dataset/
This contribution extends benchmark resources for engagement estimation and bodily behavior recognition in social interactions. The dataset is particularly relevant for multimedia analysis of human communication, behavioral understanding, and socially intelligent interactive systems.

Light-VQA: A Multi-Dimensional Quality Assessment Model for Low-Light Video Enhancement
Yunlong Dong, Xiaohong Liu, Yixuan Gao, Xunchu Zhou, Tao Tan, Guangtao Zhai
Paper available at: https://doi.org/10.1145/3581783.3611923
Dataset available at: https://github.com/wenzhouyidu/Light-VQA
This paper introduces LLVE-QA, a benchmark dataset specifically designed for evaluating perceptual quality in low-light video enhancement. As video enhancement becomes increasingly important in real-world multimedia applications, this dataset fills an important gap between enhancement algorithms and user-centered perceptual evaluation.

SemanticRT: A Large-Scale Dataset and Method for Robust Semantic Segmentation in Multispectral Images
Wei Ji, Jingjing Li, Cheng Bian, Zhicheng Zhang, Li Cheng
Paper available at: https://doi.org/10.1145/3581783.3611738
Dataset available at: https://github.com/jiwei0921/SemanticRT
SemanticRT introduces a large RGB-thermal image benchmark for robust semantic segmentation under adverse environmental conditions. With over 11,000 annotated multispectral image pairs, it provides an important resource for multimodal scene understanding and intelligent visual perception.

MORE: A Multimodal Object-Entity Relation Extraction Dataset with a Benchmark Evaluation
Liang He, Hongke Wang, Yongchang Cao, Zhen Wu, Jianbing Zhang, Xinyu Dai
Paper available at: https://doi.org/10.1145/3581783.3612209
Dataset available at: https://github.com/NJUNLP/MORE
MORE establishes a benchmark for multimodal relation extraction that draws jointly on visual and textual evidence. The dataset addresses a growing need for structured reasoning across multimedia modalities and represents a strong contribution to multimodal understanding.

Ground-to-Aerial Person Search: Benchmark Dataset and Approach
Shizhou Zhang, Qingchun Yang, De Cheng, Yinghui Xing, Guoqiang Liang, Peng Wang, Yanning Zhang
Paper available at: https://doi.org/10.1145/3581783.3612105
Dataset available at: https://github.com/yqc123456/HKD_for_person_search
This paper introduces G2APS, a benchmark for cross-platform person search between UAV and ground surveillance imagery. The dataset is highly relevant for multimedia retrieval, intelligent surveillance, and cross-view visual matching applications.

MEDIC: A Multimodal Empathy Dataset in Counseling
Zhouan Zhu, Chenguang Li, Jicai Pan, Xin Li, Yufei Xiao, Yanan Chang, Feiyi Zheng, Shangfei Wang
Paper available at: https://doi.org/10.1145/3581783.3612346
Dataset available at: https://ustc-ac.github.io/datasets/medic/
MEDIC provides a multimodal benchmark for empathy analysis in face-to-face counseling interactions. The dataset expands multimedia affective computing toward emotionally intelligent systems and richer human-centered interaction modeling.

CCMB: A Large-scale Chinese Cross-modal Benchmark
Chunyu Xie, Heng Cai, Jincheng Li, Fanjing Kong, Xiaoyu Wu, Jianfei Song, Henrique Morimitsu, Lin Yao, Dexin Wang, Xiangzheng Zhang, Dawei Leng, Baochang Zhang, Xiangyang Ji, Yafeng Deng
Paper available at: https://doi.org/10.1145/3581783.3611877
Dataset available at: https://github.com/yuxie11/R2D2
CCMB contributes one of the largest publicly available multimodal vision-language benchmarks, supporting large-scale pretraining and downstream evaluation. Its scale and broad applicability make it a significant resource for multimodal multimedia learning.


ACM MM 2024

Numerous dataset-related papers were presented at the 32nd ACM International Conference on Multimedia (MM ’24), organized in Melbourne, Australia, October 28 – November 1, 2024. The complete MM ’24: Proceedings of the 32nd ACM International Conference on Multimedia are available in the ACM Digital Library (https://dl.acm.org/doi/proceedings/10.1145/3664647).

There were three specifically dedicated Dataset sessions among roughly 42 sessions at the MM ’24 conference: “Multimodal Datasets, Models & Analytics” (6 papers), “Datasets & Algorithms for Multimedia Analysis” (6 papers), and “Audio-visual Datasets and Applications” (5 papers).

Ten selected papers focused primarily on new datasets or dataset-driven benchmarks are listed below. Looking across the ACM MM ’24 dataset-session papers provided here, the dominant focus shifts strongly toward multimodal foundation models, audiovisual understanding, media authenticity, video safety, and human-centered multimedia applications. Several contributions address emerging risks and opportunities created by generative AI, including deepfake detection, multimedia forgery, safety-aware video generation, and AI-generated image quality. Other works focus on video-centered understanding tasks, such as audio-visual event localization, hateful video detection, video dialogue, and multimodal stance detection. Compared with earlier dataset columns, MM ’24 reflects a clear trend toward datasets designed not only for recognition or retrieval, but also for reasoning, explanation, generation, safety, and robust real-world deployment.

AV-Deepfake1M: A Large-Scale LLM-Driven Audio-Visual Deepfake Dataset
Zhixi Cai, Shreya Ghosh, Aman Pankaj Adatia, Munawar Hayat, Abhinav Dhall, Tom Gedeon, Kalin Stefanov
Paper available at: https://doi.org/10.1145/3664647.3680795
Dataset available at: https://github.com/ControlNet/AV-Deepfake1M
AV-Deepfake1M provides more than one million videos with content-driven visual, audio, and audiovisual manipulations, supporting both detection and temporal localization of deepfake segments. Its scale and audiovisual nature make it highly relevant for multimedia forensics and trustworthy media analysis.

Identity-Driven Multimedia Forgery Detection via Reference Assistance
Junhao Xu, Jingjing Chen, Xue Song, Feng Han, Haijun Shan, Yu-Gang Jiang
Paper available at: https://doi.org/10.1145/3664647.3680622
Dataset available at: https://github.com/xyyandxyy/IDForge
IDForge introduces an identity-driven multimedia forgery dataset with video shots involving visual, audio, and textual manipulations, together with real reference data for celebrity identities. The paper is especially relevant because it reflects realistic identity-based forgery scenarios and connects dataset design with reference-assisted detection.

GPT4Video: A Unified Multimodal Large Language Model for Instruction-Followed Understanding and Safety-Aware Generation
Zhanyu Wang, Longyue Wang, Zhen Zhao, Minghao Wu, Chenyang Lyu, Huayang Li, Deng Cai, Luping Zhou, Shuming Shi, Zhaopeng Tu
Paper available at: https://doi.org/10.1145/3664647.3681464
GPT4Video contributes dataset resources for video instruction-following, benchmarking, and safety-aware video understanding and generation. It is relevant because it combines video comprehension, generation, and safeguarding, reflecting the growing role of dataset construction in multimodal large language models.

MultiHateClip: A Multilingual Benchmark Dataset for Hateful Video Detection on YouTube and Bilibili
Han Wang, Tan Rui Yang, Usman Naseem, Roy Ka-Wei Lee
Paper available at: https://doi.org/10.1145/3664647.3681521
MultiHateClip focuses on hateful video detection in multilingual and cross-cultural settings. It contains videos from YouTube and Bilibili annotated for hateful, offensive, and normal content, emphasizing the importance of visual, audio, language, and cultural signals in harmful multimedia analysis.

Open-Vocabulary Audio-Visual Semantic Segmentation
Ruohao Guo, Liao Qu, Dantong Niu, Yanyu Qi, Wenzhen Yue, Ji Shi, Bowei Xing, Xianghua Ying
Paper available at: https://doi.org/10.1145/3664647.3681586
Dataset available at: https://github.com/ruohaoguo/ovavss
This work introduces open-vocabulary audio-visual semantic segmentation and builds AVSBench-OV from AVSBench-semantic. It is highly relevant for open-world video understanding because it combines audio cues, visual segmentation, and zero-shot category recognition.

OpenAVE: Moving towards Open Set Audio-Visual Event Localization
Jiale Yu, Baopeng Zhang, Zhu Teng, Jianping Fan
Paper available at: https://doi.org/10.1145/3664647.3681232
Dataset available at: https://github.com/yujialele/OpenAVE
OpenAVE extends audio-visual event localization beyond closed-set recognition. It is important for real-world audiovisual video understanding, where systems must distinguish known events, unknown events, and background segments.

G-Refine: A General Quality Refiner for Text-to-Image Generation
Chunyi Li, Haoning Wu, Hongkun Hao, Zicheng Zhang, Tengchuan Kou, Chaofeng Chen, Lei Bai, Xiaohong Liu, Weisi Lin, Guangtao Zhai
Paper available at: https://doi.org/10.1145/3664647.3681152
Dataset available at: https://github.com/Q-Future/Q-Refine
G-Refine addresses quality refinement for AI-generated images, focusing on both perceptual quality and text-image alignment. It is a strong representative of growing interest in generative media quality, image quality assessment, and practical improvement of text-to-image outputs.

NovaChart: A Large-scale Dataset towards Chart Understanding and Generation of Multimodal Large Language Models
Linmei Hu, Duokang Wang, Yiming Pan, Jifan Yu, Yingxia Shao, Chong Feng, Liqiang Nie
Paper available at: https://doi.org/10.1145/3664647.3680790
Dataset available at: https://github.com/Elucidator-V/NovaChart
NovaChart provides 47,000 chart images and 856,000 chart-related instructions across multiple chart types and tasks. Although not video-centered, it is a strong image-and-language benchmark for visual reasoning, chart understanding, and chart generation.

Multimodal Multi-turn Conversation Stance Detection: A Challenge Dataset and Effective Model
Fuqiang Niu, Zebang Cheng, Xianghua Fu, Xiaojiang Peng, Genan Dai, Yin Chen, Hu Huang, Bowen Zhang
Paper available at: https://doi.org/10.1145/3664647.3681416
Dataset available at: https://github.com/nfq729/MmMtCSD
This paper introduces MmMtCSD, a dataset for multimodal multi-turn conversational stance detection. It is relevant for social multimedia analysis because it models realistic online discussions involving both text and images rather than isolated image-text pairs.

CT2C-QA: Multimodal Question Answering over Chinese Text, Table and Chart
Bowen Zhao, Tianhao Cheng, Yuejie Zhang, Ying Cheng, Rui Feng, Xiaobo Zhang
Paper available at: https://doi.org/10.1145/3664647.3681053
CT2C-QA introduces a multimodal question answering dataset over Chinese text, tables, and charts. It is relevant for evaluating whether multimodal systems can reason across heterogeneous information sources, including visual and structured data.


ACM MM 2025

Numerous dataset-related papers were presented at the 33rd ACM International Conference on Multimedia (MM ’25), organized in Dublin, Ireland, October 27 – 31, 2025. The complete MM ’25: Proceedings of the 33rd ACM International Conference on Multimedia are available in the ACM Digital Library (https://dl.acm.org/doi/proceedings/10.1145/3746027).

There was a specifically dedicated dataset session among roughly 26 sessions at the MM ’25 conference. This dataset track attracted 263 submissions, of which 123 were accepted. Considering the entire MM ’25 Proceedings, the term “dataset” appears in the title of 106 papers (28 in MM ’24), the keywords of 104 papers (47 in MM ’24), and the abstracts of 869 papers (687 in MM ’24). This substantial year-over-year increase highlights the growing centrality of datasets and benchmark creation in multimedia research, increasingly positioning dataset construction itself as a major scientific contribution rather than merely supporting experimental evaluation.
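To put the figures in the preceding paragraph into proportion, the short Python sketch below derives the acceptance rate of the dataset track and the MM ’24 to MM ’25 growth factors for each field. The variable names are illustrative assumptions, and the inputs are only the numbers quoted in the text.

```python
# Figures quoted in the paragraph above; all names are illustrative.
submitted, accepted = 263, 123
acceptance_rate = accepted / submitted

# Papers mentioning "dataset" in each field, per edition.
mm24 = {"titles": 28, "keywords": 47, "abstracts": 687}
mm25 = {"titles": 106, "keywords": 104, "abstracts": 869}

# Growth factor = MM '25 count divided by MM '24 count.
growth = {field: mm25[field] / mm24[field] for field in mm24}

print(f"acceptance rate: {acceptance_rate:.1%}")
for field, factor in growth.items():
    print(f"{field}: x{factor:.2f}")
```

Under these numbers, slightly under half of the dataset-track submissions were accepted, and the increase is strongest in titles (close to fourfold), followed by keywords (roughly double) and abstracts (about a quarter more). These are raw counts: they are not normalized by the total number of papers in each edition, so they are best read as a rough indicator.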

The ACM MM ’25 dataset papers show a shift toward datasets as primary research contributions. The dominant themes include high-quality video resources for compression, streaming, enhancement, video quality assessment, and QoE; immersive and 3D media datasets for VR, spatial video, 3D Gaussian Splatting, point clouds, and volumetric applications; multimodal and vision-language datasets for reasoning, event grounding, image/video generation, and instruction following; and safety-oriented datasets addressing deepfakes, harmful videos, social engineering, and media authenticity. A second major cluster concerns domain-specific multimedia datasets in medicine, robotics, agriculture, food computing, biometrics, wildlife monitoring, and urban or engineering environments. Overall, the MM ’25 dataset papers reflect a clear expansion from traditional image/video recognition benchmarks toward open resources for generative media evaluation, trustworthy AI, embodied perception, subjective quality assessment, and real-world multimodal decision-making.

Among the ACM MM ’25 papers that explicitly mention datasets and provide publicly accessible artifacts, the following is a curated selection of 19 representative examples chosen for their relevance to the multimedia community, diversity of topics, and availability of reusable public datasets.

Screen Content Video Dataset and Benchmark
Nickolay Safonov, Mikhail Rakhmanov, Dmitriy S. Vatolin
Paper available at: https://doi.org/10.1145/3746027.3758306
Dataset available at: https://videoprocessing.github.io/screen-content-dataset
This dataset focuses on screen-content video scenarios such as screen sharing, desktop streaming, and video conferencing, providing a large benchmark with subjective quality annotations for distorted content. The work is especially relevant for multimedia quality assessment, video compression research, and perceptual QoE modeling in increasingly important screen-based communication environments.

Nature-1k: The Raw Beauty of Nature in 4K at 60FPS
Mohammad Ghasempour, Hadi Amirpour, Christian Timmerer
Paper available at: https://doi.org/10.1145/3746027.3758258
Dataset available at: https://cd-athena.github.io/Nature-1k
Nature-1k provides a large-scale collection of professionally captured 4K 60 fps natural video content designed for modern video processing research. Its scale and quality make it highly relevant for video compression, streaming optimization, super-resolution, enhancement, frame interpolation, and generative video applications.

VIDEA-8K-60FPS Dataset: 8K 60FPS Video Sequences for Analysis and Development
Tariq Al Shoura, Ali Mollaahmadi Dehaghi, Reza Razavi, Mohammad Moshirpour
Paper available at: https://doi.org/10.1145/3746027.3758278
Dataset available at: https://github.com/talshoura/VIDEA-8K-60FPS-Dataset
VIDEA-8K-60FPS addresses the shortage of publicly available ultra-high-resolution video benchmarks by providing native 8K HDR sequences captured at 60 fps. The dataset is particularly valuable for next-generation video coding, scalable streaming, UHD content analysis, and benchmarking computationally intensive multimedia methods.

LEHA-CVQAD: Dataset To Enable Generalized Video Quality Assessment of Compression Artifacts
Aleksandr Gushchin, Maksim Smirnov, Dmitriy S. Vatolin, Anastasia Antsiferova
Paper available at: https://doi.org/10.1145/3746027.3758303
Dataset available at: https://aleksandrgushchin.github.io/lcvqad/
LEHA-CVQAD is a large-scale benchmark specifically designed for studying perceptual degradation caused by video compression artifacts. Its subjective quality annotations and codec diversity make it particularly useful for developing video quality metrics and improving practical codec parameter optimization.

HVEval: Towards Unified Evaluation of Human-Centric Video Generation and Understanding
Sijing Wu, Yunhao Li, Huiyu Duan, Yanwei Jiang, Yucheng Zhu, Guangtao Zhai
Paper available at: https://doi.org/10.1145/3746027.3758299
Dataset available at: https://huggingface.co/datasets/wsj-sjtu/HVEval
HVEval introduces a benchmark for evaluating human-centric video generation and understanding, a rapidly growing topic in generative multimedia research. It combines perceptual quality judgments with semantic evaluation tasks, making it highly relevant for benchmarking generative video systems and multimodal understanding models.

BrokenVideos: A Benchmark Dataset for Fine-Grained Artifact Localization in AI-Generated Videos
Jiahao Lin, Weixuan Peng, Bojia Zi, Yifeng Gao, Xianbiao Qi, Xingjun Ma, Yu-Gang Jiang
Paper available at: https://doi.org/10.1145/3746027.3758305
Dataset available at: https://broken-video-detection-datetsets.github.io/Broken-Video-Detection-Datasets.github.io/
BrokenVideos addresses a critical challenge in AI-generated media by providing fine-grained annotations for artifact localization in synthetic videos. The dataset is highly relevant for trustworthy generative AI, media forensics, and automated video quality assurance.

AEGIS: Authenticity Evaluation Benchmark for AI-Generated Video Sequences
Jieyu Li, Xin Zhang, Joey Tianyi Zhou
Paper available at: https://doi.org/10.1145/3746027.3758295
Dataset available at: https://huggingface.co/datasets/Clarifiedfish/AEGIS
AEGIS provides a large benchmark for evaluating authenticity detection in increasingly realistic AI-generated videos. It is particularly relevant for multimedia security, deepfake detection, and the broader challenge of trustworthy synthetic media verification.

SVD: Spatial Video Dataset
MohammadHossein Izadimehr, Milad Ghanbari, Guodong Chen, Wei Zhou, Xiaoshuai Hao, Mallesham Dasari, Christian Timmerer, Hadi Amirpour
Paper available at: https://doi.org/10.1145/3746027.3758246
Dataset available at: https://cd-athena.github.io/SVD/
SVD introduces a public benchmark for consumer-captured stereoscopic spatial video, reflecting the increasing adoption of immersive video technologies. It is especially relevant for research in spatial video compression, immersive streaming, QoE evaluation, and depth-aware media analysis.

EyeNavGS: A 6-DoF Navigation Dataset and Record-n-Replay Software for Real-World 3DGS Scenes in VR
Zihao Ding, Cheng-Tse Lee, Mufeng Zhu, Tao Guan, Yuan-Chun Sun, Cheng-Hsin Hsu, Yao Liu
Paper available at: https://doi.org/10.1145/3746027.3758265
Dataset available at: https://symmru.github.io/EyeNavGS/
EyeNavGS provides immersive navigation traces, gaze data, and interaction recordings in virtual reality environments built on 3D Gaussian Splatting scenes. The dataset is particularly important for adaptive rendering, viewport prediction, foveated streaming, and immersive interaction research.

UVG-CWI-DQPC: Dual-Quality Point Cloud Dataset for Volumetric Video Applications
Guillaume Gautier, Xuemei Zhou, Thong Nguyen, Jack Jansen, Louis Fréneau, Marko Viitanen, Uyen Phan, Jani Käpylä, Irene Viola, Alexandre Mercat, Pablo Cesar, Jarno Vanne
Paper available at: https://doi.org/10.1145/3746027.3758263
Dataset available at: https://ultravideo.fi/UVG-CWI-DQPC/
This dataset provides paired high-end and consumer-grade point cloud captures for volumetric video research. It is highly relevant for point cloud compression, enhancement, quality assessment, and immersive multimedia benchmarking.

The CASTLE 2024 Dataset: Advancing the Art of Multimodal Understanding
Luca Rossetto, Werner Bailer, Duc-Tien Dang-Nguyen, Graham Healy, Björn Þór Jónsson, Onanong Kongmeesub, Hoang-Bao Le, Stevan Rudinac, Klaus Schöffmann, Florian Spiess, Allie Tran, Minh-Triet Tran, Quang-Linh Tran, Cathal Gurrin
Paper available at: https://doi.org/10.1145/3746027.3758199
Dataset available at: https://castle-dataset.github.io/
CASTLE is a rich multimodal dataset combining egocentric and exocentric video, audio, and sensor streams captured in realistic environments. It is highly relevant for multimodal understanding, retrieval, lifelogging, embodied perception, and context-aware AI research.

OpenEvents V1: Large-Scale Benchmark Dataset for Multimodal Event Grounding
Hieu Nguyen, Phuc-Tan Nguyen, Thien-Phuc Tran, Minh-Quang Nguyen, Tam V. Nguyen, Minh-Triet Tran, Trung-Nghia Le
Paper available at: https://doi.org/10.1145/3746027.3758264
Dataset available at: https://ltnghia.github.io/eventa/openevents-v1
OpenEvents V1 provides a large benchmark for multimodal event understanding through aligned images, text, and news content. It is especially relevant for event retrieval, multimodal reasoning, news analysis, and contextual multimedia understanding.

StreamingCoT: A Dataset for Temporal Dynamics and Multimodal Chain-of-Thought Reasoning in Streaming VideoQA
Yuhang Hu, Zhenyu Yang, Shihan Wang, Shengsheng Qian, Bin Wen, Fan Yang, Tingting Gao, Changsheng Xu
Paper available at: https://doi.org/10.1145/3746027.3758311
Dataset available at: https://github.com/Fleeting-hyh/StreamingCoT
StreamingCoT focuses on temporal reasoning in evolving video streams and introduces explicit multimodal reasoning annotations. The dataset is particularly relevant for streaming video understanding, video question answering, and multimodal reasoning research.

SynthVLM: Towards High-Quality and Efficient Synthesis of Image-Caption Datasets for Vision-Language Models
Zheng Liu, Hao Liang, Bozhou Li, Wentao Xiong, Chong Chen, Conghui He, Wentao Zhang, Bin Cui
Paper available at: https://doi.org/10.1145/3746027.3758222
Dataset available at: https://github.com/starriver030515/SynthVLM
SynthVLM introduces a synthetic image-caption dataset aimed at efficient vision-language model training. The work is particularly relevant because it addresses scalable multimodal data generation and benchmarking for next-generation multimodal foundation models.

UniSVG: A Unified Dataset for Vector Graphic Understanding and Generation with Multimodal Large Language Models
Jinke Li, Jiarui Yu, Chenxing Wei, Hande Dong, Qiang Lin, Liangjing Yang, Zhicai Wang, Yanbin Hao
Paper available at: https://doi.org/10.1145/3746027.3758269
Dataset available at: https://ryanlijinke.github.io/
UniSVG expands multimedia benchmarking into vector graphics, a modality often neglected in conventional multimedia datasets. It is highly relevant for multimodal reasoning, structured content understanding, and AI-driven graphic generation.

RoboAfford: A Dataset and Benchmark for Enhancing Object and Spatial Affordance Learning in Robot Manipulation
Yingbo Tang, Lingfeng Zhang, Shuyi Zhang, Yinuo Zhao, Xiaoshuai Hao
Paper available at: https://doi.org/10.1145/3746027.3758209
Dataset available at: https://roboafford-dataset.github.io/
RoboAfford bridges multimedia perception and embodied intelligence through object and spatial affordance annotations for robotic manipulation. The dataset is particularly relevant for scene understanding, multimodal perception, and interaction-aware robotics research.

GynSurg: A Comprehensive Gynecology Laparoscopic Surgery Dataset
Sahar Nasirihaghighi, Negin Ghamsarian, Leonie Peschek, Matteo Munari, Heinrich Husslein, Raphael Sznitman, Klaus Schoeffmann
Paper available at: https://doi.org/10.1145/3746027.3758267
Dataset available at: https://ftp.itec.aau.at/datasets/GynSurge/
GynSurg provides richly annotated surgical video data for gynecological laparoscopic procedures. It is highly relevant for medical multimedia analysis, workflow understanding, semantic segmentation, and AI-assisted surgical support systems.

DFBench: Benchmarking Deepfake Image Detection Capability of Large Multimodal Models
Jiarui Wang, Huiyu Duan, Juntong Wang, Jia Ziheng, Woo Yi Yang, Xiaorong Zhu, Yu Zhao, Jiaying Qian, Yuke Xing, Guangtao Zhai, Xiongkuo Min
Paper available at: https://doi.org/10.1145/3746027.3758204
Dataset available at: https://github.com/IntMeGroup/DFBench
DFBench introduces a large benchmark for evaluating deepfake detection with modern multimodal models. It is especially relevant for multimedia forensics, trustworthy AI, adversarial robustness, and authenticity verification research.

Multiverse Through Deepfakes: The MultiFakeVerse Dataset of Person-Centric Visual and Conceptual Manipulations
Parul Gupta, Shreya Ghosh, Tom Gedeon, Thanh-Toan Do, Abhinav Dhall
Paper available at: https://doi.org/10.1145/3746027.3758283
Dataset available at: https://github.com/Parul-Gupta/MultiFakeVerse
MultiFakeVerse extends deepfake benchmarking beyond simple identity swaps toward semantically meaningful person-centric manipulations. The dataset is particularly important for studying higher-level multimedia misinformation, contextual authenticity analysis, and robust deepfake detection.


The progression across ACM MM 2023-2025 clearly illustrates the evolution of datasets from supporting experimental resources toward primary research outputs in their own right. Beyond traditional benchmarks for recognition and retrieval, recent datasets increasingly target generative media evaluation, trustworthy AI, immersive environments, multimodal reasoning, and domain-specific real-world applications. This trajectory reflects the multimedia community’s growing commitment to reproducibility, open science, and shared evaluation resources that enable sustainable scientific progress.

Overview of Open Dataset Sessions and Benchmarking Competitions in 2023-2024 – Part 2 (MDRE at MMM 2023 and MMM 2024)

As already started in the previous Datasets column, we are reviewing some of the most notable events related to open datasets and benchmarking competitions in the field of multimedia in the years 2023 and 2024. This selection highlights the wide range of topics and datasets currently of interest to the community. Some of the events covered in this review include special sessions on open datasets and competitions featuring multimedia data. This year’s review follows similar efforts from the previous year (https://records.sigmm.org/records-issues/acm-sigmm-records-issue-1-2023/), highlighting the ongoing importance of open datasets and benchmarking competitions in advancing research and development in multimedia. This second part of the column focuses on the last two editions of MDRE at MMM 2023 and MMM 2024:

  • Multimedia Datasets for Repeatable Experimentation at 29th International Conference on Multimedia Modeling (MDRE at MMM 2023). We summarize the seven datasets presented during the MDRE in 2023, namely NCKU-VTF (thermal-to-visible face recognition benchmark), Link-Rot (web dataset decay and reproducibility study), People@Places and ToDY (scene classification for media production), ScopeSense (lifelogging dataset for health analysis), OceanFish (high-resolution fish species recognition), GIGO (urban garbage classification and demographics), and Marine Video Kit (underwater video retrieval and analysis).
  • Multimedia Datasets for Repeatable Experimentation at 30th International Conference on Multimedia Modeling (MDRE at MMM 2024 – https://mmm2024.org/). We summarize the eight datasets presented during the MDRE in 2024, namely RESET (video similarity annotations for embeddings), DocCT (content-aware document image classification), Rach3 (multimodal data for piano rehearsal analysis), WikiMuTe (semantic music descriptions from Wikipedia), PDTW150K (large-scale patent drawing retrieval dataset), Lifelog QA (question answering for lifelog retrieval), Laparoscopic Events (event recognition in surgery videos), and GreenScreen (social media dataset for greenwashing detection).

For the overview of datasets related to QoMEX 2023 and QoMEX 2024, please check the first part (https://records.sigmm.org/2024/09/07/overview-of-open-dataset-sessions-and-benchmarking-competitions-in-2023-2024-part-1-qomex-2023-and-qomex-2024/).

MDRE at MMM 2023

The Multimedia Datasets for Repeatable Experimentation (MDRE) special session was part of the 2023 International Conference on Multimedia Modeling (MMM 2023), held in Bergen, Norway, January 9-12, 2023. The MDRE’23 special session at MMM’23 was the fifth MDRE session. The session was organized by Cathal Gurrin (Dublin City University, Ireland), Duc-Tien Dang-Nguyen (University of Bergen, Norway), Adam Jatowt (University of Innsbruck, Austria), Liting Zhou (Dublin City University, Ireland) and Graham Healy (Dublin City University, Ireland).

The NCKU-VTF Dataset and a Multi-scale Thermal-to-Visible Face Synthesis System
Tsung-Han Ho, Chen-Yin Yu, Tsai-Yen Ko & Wei-Ta Chu
National Cheng Kung University, Tainan, Taiwan

Paper available at: https://doi.org/10.1007/978-3-031-27077-2_36
Dataset available at: http://mmcv.csie.ncku.edu.tw/~wtchu/projects/NCKU-VTF/index.html

The dataset, named VTF, comprises paired thermal-visible face images of primarily Asian subjects under diverse visual conditions, introducing challenges for thermal face recognition models. It serves as a benchmark for evaluating model robustness while also revealing racial bias issues in current systems. By addressing both technical and fairness aspects, VTF promotes advancements in developing more accurate and inclusive thermal-to-visible face recognition methods.

Link-Rot in Web-Sourced Multimedia Datasets
Viktor Lakic, Luca Rossetto & Abraham Bernstein
Department of Informatics, University of Zurich, Zurich, Switzerland

Paper available at: https://doi.org/10.1007/978-3-031-27077-2_37
Dataset available at: Combination of 24 different Web-sourced datasets described in the paper

The study examines 24 Web-sourced datasets comprising over 270 million URLs and reveals that more than 20% of the content has become unavailable due to link-rot. This decay poses significant challenges to the reproducibility of research relying on such datasets. The work highlights the need for strategies to mitigate content loss and maintain data integrity for future studies.
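
As a rough illustration of how a link-rot rate like the one above can be computed, the following minimal sketch counts the share of unavailable URLs in a manifest. It is not the authors' methodology: the function name `link_rot_rate`, the sample URLs, and the rule that anything other than a final 2xx response counts as unavailable are all assumptions made for this example.

```python
def link_rot_rate(statuses):
    """Return the fraction of URLs counted as unavailable.

    `statuses` maps each URL to the final HTTP status code observed
    (after following redirects), or None when the request failed
    outright (for example a DNS error or a timeout).
    Assumption of this sketch: only 2xx responses count as available.
    """
    if not statuses:
        return 0.0
    dead = sum(
        1
        for code in statuses.values()
        if code is None or not (200 <= code < 300)
    )
    return dead / len(statuses)


# Hypothetical results of probing four URLs from a dataset manifest.
sample = {
    "https://example.org/a.jpg": 200,
    "https://example.org/b.jpg": 404,
    "https://example.org/c.jpg": None,
    "https://example.org/d.jpg": 200,
}

rate = link_rot_rate(sample)
print(f"{rate:.0%} of URLs unavailable")
```

In practice the status codes would come from probing each URL, and a stricter check (for instance comparing content hashes) may be needed, since a URL can still resolve while serving different content.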

People@Places and ToDY: Two Datasets for Scene Classification in Media Production and Archiving
Werner Bailer & Hannes Fassold
Joanneum Research, Graz, Austria

Paper available at: https://doi.org/10.1007/978-3-031-27077-2_38
Dataset available at: https://github.com/wbailer/PeopleAtPlaces

The dataset supports annotation tasks in visual media production and archiving, focusing on scene bustle (from populated to unpopulated), cinematographic shot types, time of day, and season. The People@Places dataset augments Places365 with bustle and shot-type annotations, while the ToDY (time of day/year) dataset enhances SkyFinder. Both datasets come with a toolchain for automatic annotations, manually verified for accuracy. Baseline results using the EfficientNet-B3 model, pretrained on Places365, are provided for benchmarking.

ScopeSense: An 8.5-Month Sport, Nutrition, and Lifestyle Lifelogging Dataset
Michael A. Riegler, Vajira Thambawita, Ayan Chatterjee, Thu Nguyen, Steven A. Hicks, Vibeke Telle-Hansen, Svein Arne Pettersen, Dag Johansen, Ramesh Jain & Pål Halvorsen
SimulaMet, Oslo, Norway; Oslo Metropolitan University, Oslo, Norway; UiT The Arctic University of Norway, Tromsø, Norway; University of California Irvine, CA, USA

Paper available at: https://doi.org/10.1007/978-3-031-27077-2_39
Dataset available at: https://datasets.simula.no/scopesense

The dataset, ScopeSense, offers comprehensive sport, nutrition, and lifestyle logs collected over eight and a half months from two individuals. It includes extensive sensor data alongside nutrition, training, and well-being information, structured to facilitate detailed, data-driven research on healthy lifestyles. This dataset aims to support modeling for personalized guidance, addressing challenges in unstructured data and enhancing the precision of lifestyle recommendations. ScopeSense is fully accessible to researchers, serving as a foundation for methods to expand this data-driven approach to larger populations.

Fast Accurate Fish Recognition with Deep Learning Based on a Domain-Specific Large-Scale Fish Dataset
Yuan Lin, Zhaoqi Chu, Jari Korhonen, Jiayi Xu, Xiangrong Liu, Juan Liu, Min Liu, Lvping Fang, Weidi Yang, Debasish Ghose & Junyong You
School of Economics, Innovation, and Technology, Kristiania University College, Oslo, Norway; School of Aerospace Engineering, Xiamen University, Xiamen, China; School of Natural and Computing Sciences, University of Aberdeen, Aberdeen, UK; School of Information Science and Technology, Xiamen University, Xiamen, China; School of Ocean and Earth, Xiamen University, Xiamen, China; Norwegian Research Centre (NORCE), Bergen, Norway

Paper available at: https://doi.org/10.1007/978-3-031-27077-2_40
Dataset available at: Upon request from the authors

The dataset, OceanFish, addresses key challenges in fish species recognition by providing high-resolution images of marine species from the East China Sea, covering 63,622 images across 136 fine-grained fish species. This large-scale, diverse dataset overcomes limitations found in prior fish datasets, such as low resolution and limited annotations. OceanFish includes a fish recognition testbed with deep learning models, achieving high precision and speed in species detection. This dataset can be expanded with additional species and annotations, offering a valuable benchmark for advancing marine biodiversity research and automated fish recognition.

GIGO, Garbage In, Garbage Out: An Urban Garbage Classification Dataset
Maarten Sukel, Stevan Rudinac & Marcel Worring
University of Amsterdam, Amsterdam, The Netherlands

Paper available at: https://doi.org/10.1007/978-3-031-27077-2_41
Dataset available at: https://doi.org/10.21942/uva.20750044

The dataset, GIGO: Garbage in, Garbage out, offers 25,000 images for multimodal urban waste classification, captured across a large area of Amsterdam. It supports sustainable urban waste collection by providing fine-grained classifications of diverse garbage types, differing in size, origin, and material. Unique to GIGO are additional geographic and demographic data, enabling multimodal analysis that incorporates neighborhood and building statistics. The dataset includes state-of-the-art baselines, serving as a benchmark for algorithm development in urban waste management and multimodal classification.

Marine Video Kit: A New Marine Video Dataset for Content-Based Analysis and Retrieval
Quang-Trung Truong, Tuan-Anh Vu, Tan-Sang Ha, Jakub Lokoč, Yue-Him Wong, Ajay Joneja & Sai-Kit Yeung
Hong Kong University of Science and Technology, Clear Water Bay, Hong Kong; FMP, Charles University, Prague, Czech Republic; Shenzhen University, Shenzhen, China

Paper available at: https://doi.org/10.1007/978-3-031-27077-2_42
Dataset available at: https://hkust-vgd.github.io/marinevideokit

The dataset, Marine Video Kit, focuses on single-shot underwater videos captured by moving cameras, providing a challenging benchmark for video retrieval and computer vision tasks. Designed to address the limitations of general-purpose models in domain-specific contexts, the dataset includes meta-data, low-level feature analysis, and semantic annotations of keyframes. Used in the Video Browser Showdown 2023, Marine Video Kit highlights challenges in underwater video analysis and is publicly accessible, supporting advancements in model robustness for specialized video retrieval applications.

MDRE at MMM 2024

The Multimedia Datasets for Repeatable Experimentation (MDRE) special session was part of the 2024 International Conference on Multimedia Modeling (MMM 2024), held in Amsterdam, The Netherlands, January 29 – February 2, 2024. The MDRE’24 special session at MMM’24 was the sixth MDRE session. The session was organized by Klaus Schöffmann (Klagenfurt University, Austria), Björn Þór Jónsson (Reykjavik University, Iceland), Cathal Gurrin (Dublin City University, Ireland), Duc-Tien Dang-Nguyen (University of Bergen, Norway), and Liting Zhou (Dublin City University, Ireland). Details regarding this session can be found at: https://mmm2024.org/specialpaper.html#s1.

RESET: Relational Similarity Extension for V3C1 Video Dataset
Patrik Veselý & Ladislav Peška
Faculty of Mathematics and Physics, Charles University, Prague, Czechia

Paper available at: https://doi.org/10.1007/978-3-031-56435-2_1
Dataset available at: https://osf.io/ruh5k

The dataset, RESET: RElational Similarity Evaluation dataseT, offers over 17,000 similarity annotations for video keyframe triples drawn from the V3C1 video collection. RESET includes both close and distant similarity triplets in general and specific sub-domains (wedding and diving), with multiple user re-annotations and similarity scores from 30 pre-trained models. This dataset supports the evaluation and fine-tuning of visual embedding models, aligning them more closely with human-perceived similarity, and enhances content-based information retrieval for more accurate, user-aligned results.
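
To make the evaluation idea concrete, the sketch below measures how often an embedding model agrees with human triplet judgments. It is a simplified illustration and not the RESET file format or the authors' protocol: it assumes each annotation names an anchor keyframe, two candidate keyframes, and the candidate the annotator judged more similar to the anchor. The identifiers `k1` to `k3`, the two-dimensional vectors, and the function names are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def triplet_agreement(embeddings, triplets):
    """Fraction of triplets where the embedding picks the human-chosen candidate.

    Each triplet is (anchor, candidate_a, candidate_b, human_choice),
    where human_choice is the candidate judged closer to the anchor.
    """
    hits = 0
    for anchor, cand_a, cand_b, human_choice in triplets:
        sim_a = cosine(embeddings[anchor], embeddings[cand_a])
        sim_b = cosine(embeddings[anchor], embeddings[cand_b])
        model_choice = cand_a if sim_a >= sim_b else cand_b
        if model_choice == human_choice:
            hits += 1
    return hits / len(triplets)


# Toy embeddings for three hypothetical keyframes.
embeddings = {
    "k1": [1.0, 0.0],
    "k2": [0.9, 0.1],
    "k3": [0.0, 1.0],
}

# Two hypothetical annotations; the model agrees with the first only.
triplets = [
    ("k1", "k2", "k3", "k2"),
    ("k3", "k1", "k2", "k1"),
]

print(triplet_agreement(embeddings, triplets))
```

A higher agreement score suggests that distances in the embedding space track human-perceived similarity more closely, which is the property such annotations are meant to test.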

A New Benchmark and OCR-Free Method for Document Image Topic Classification
Zhen Wang, Peide Zhu, Fuyang Yu & Manabu Okumura
Tokyo Institute of Technology, Tokyo, Japan; Delft University of Technology, Delft, Netherlands; Beihang University, Beijing, China

Paper available at: https://doi.org/10.1007/978-3-031-56435-2_2
Dataset available at: https://github.com/zhenwangrs/DocCT

The dataset, DocCT, is a content-aware document image classification dataset designed to handle complex document images that integrate text and illustrations across diverse topics. Unlike prior datasets focusing mainly on format, DocCT requires fine-grained content understanding for accurate classification. Alongside DocCT, the self-supervised model DocMAE is introduced, showing that document image semantics can be understood effectively without OCR. DocMAE surpasses previous vision models and some OCR-based models in understanding document content purely from pixel data, marking a significant advance in document image analysis.

The Rach3 Dataset: Towards Data-Driven Analysis of Piano Performance Rehearsal
Carlos Eduardo Cancino-Chacón & Ivan Pilkov
Institute of Computational Perception, Johannes Kepler University Linz, Linz, Austria

Paper available at: https://doi.org/10.1007/978-3-031-56435-2_3
Dataset available at: https://dataset.rach3project.com/

The dataset, named Rach3, captures the rehearsal processes of pianists as they learn new repertoire, providing a multimodal resource with video, audio, and MIDI data. Designed for AI and machine learning applications, Rach3 enables analysis of long-term practice sessions, focusing on how advanced students and professional musicians interpret and refine their performances. This dataset offers valuable insights into music learning and expression, addressing an understudied area in music performance research.

WikiMuTe: A Web-Sourced Dataset of Semantic Descriptions for Music Audio
Benno Weck, Holger Kirchhoff, Peter Grosche & Xavier Serra
Huawei Technologies, Munich Research Center, Munich, Germany; Universitat Pompeu Fabra, Music Technology Group, Barcelona, Spain

Paper available at: https://doi.org/10.1007/978-3-031-56435-2_4
Dataset available at: https://github.com/Bomme/wikimute

The dataset, WikiMuTe, is an open, multi-modal resource designed for Music Information Retrieval (MIR), offering detailed semantic descriptions of music sourced from Wikipedia. It includes both long and short-form text on aspects like genre, style, mood, instrumentation, and tempo. Using a custom text-mining pipeline, WikiMuTe provides data to train models that jointly learn text and audio representations, achieving strong results in tasks such as tag-based music retrieval and auto-tagging. This dataset supports MIR advancements by providing accessible, rich semantic data for matching text and music.

PDTW150K: A Dataset for Patent Drawing Retrieval
Chan-Ming Hsu, Tse-Hung Lin, Yu-Hsien Chen & Chih-Yi Chiu
Department of Computer Science and Information Engineering, National Chiayi University, Chiayi, Taiwan

Paper available at: https://doi.org/10.1007/978-3-031-56435-2_5
Dataset available at: https://github.com/ncyuMARSLab/PDTW150K

The dataset, PDTW150K, is a large-scale resource for patent drawing retrieval, featuring over 150,000 patents with text metadata and more than 850,000 patent drawings. It includes bounding box annotations for drawing views, which support the construction of object detection models. PDTW150K enables diverse applications, such as image retrieval, cross-modal retrieval, and object detection. This dataset is publicly available, offering a valuable tool for advancing research in patent analysis and retrieval tasks.

Interactive Question Answering for Multimodal Lifelog Retrieval
Ly-Duyen Tran, Liting Zhou, Binh Nguyen & Cathal Gurrin
Dublin City University, Dublin, Ireland; AISIA Research Lab, Ho Chi Minh, Vietnam; Ho Chi Minh University of Science, Vietnam National University, Hanoi, Vietnam

Paper available at: https://doi.org/10.1007/978-3-031-56435-2_6
Dataset available at: Upon request from the authors

The dataset supports Question Answering (QA) tasks in lifelog retrieval, advancing the field toward open-domain QA capabilities. Integrated into a multimodal lifelog retrieval system, it allows users to ask lifelog-specific questions and receive suggested answers based on multimodal data. A test collection is provided to assess system effectiveness and user satisfaction, demonstrating enhanced performance over conventional lifelog systems, especially for novice users. This dataset paves the way for more intuitive and effective lifelog interaction.

Event Recognition in Laparoscopic Gynecology Videos with Hybrid Transformers
Sahar Nasirihaghighi, Negin Ghamsarian, Heinrich Husslein & Klaus Schoeffmann
Institute of Information Technology (ITEC), Klagenfurt University, Klagenfurt, Austria; Center for AI in Medicine, University of Bern, Bern, Switzerland; Department of Gynecology and Gynecological Oncology, Medical University Vienna, Vienna, Austria

Paper available at: https://doi.org/10.1007/978-3-031-56435-2_7
Dataset available at: https://ftp.itec.aau.at/datasets/LapGyn6-Events/

The dataset is tailored for event recognition in laparoscopic gynecology surgery videos, including annotations for critical intra-operative and post-operative events. Designed for applications in surgical training and complication prediction, it facilitates precise event recognition. The dataset supports a hybrid Transformer-based architecture that leverages inter-frame dependencies, improving accuracy amid challenges like occlusion and motion blur. Additionally, a custom frame sampling strategy addresses variations in surgical scenes and skill levels, achieving high temporal resolution. This methodology outperforms conventional CNN-RNN architectures, advancing laparoscopic video analysis.

GreenScreen: A Multimodal Dataset for Detecting Corporate Greenwashing in the Wild
Ujjwal Sharma, Stevan Rudinac, Joris Demmers, Willemijn van Dolen & Marcel Worring
University of Amsterdam, Amsterdam, The Netherlands

Paper available at: https://doi.org/10.1007/978-3-031-56435-2_8
Dataset available at: https://uva-hva.gitlab.host/u.sharma/greenscreen

The dataset focuses on detecting greenwashing in social media by combining large-scale text and image collections from Fortune-1000 company Twitter accounts with environmental risk scores on specific issues like emissions and resource usage. This dataset addresses the challenge of identifying subtle, abstract greenwashing signals requiring contextual interpretation. It includes a baseline method leveraging advanced content encoding to analyze connections between social media content and greenwashing tendencies. This resource enables the multimedia retrieval community to advance greenwashing detection, promoting transparency in corporate sustainability claims.

Overview of Open Dataset Sessions and Benchmarking Competitions in 2022 – Part 2 (MDRE at MMM 2022, ACM MM 2022)


In this Dataset Column, we present a review of some of the notable events related to open datasets and benchmarking competitions in the field of multimedia. This year’s selection highlights the wide range of topics and datasets currently of interest to the community. The events covered in this review include special sessions on open datasets and competitions featuring multimedia data. While this list is not exhaustive and contains an overview of about 40 datasets, it is meant to showcase the diversity of subjects and datasets explored in the field. This year’s review follows similar efforts from the previous year (https://records.sigmm.org/2022/01/12/overview-of-open-dataset-sessions-and-benchmarking-competitions-in-2021/), highlighting the ongoing importance of open datasets and benchmarking competitions in advancing research and development in multimedia. The column is divided into three parts; in this one, we focus on MDRE at MMM 2022 and ACM MM 2022:

  • Multimedia Datasets for Repeatable Experimentation at 28th International Conference on Multimedia Modeling (MDRE at MMM 2022 – https://mmm2022.org/ssp.html#mdre). We summarize the three contributions presented during the MDRE, covering a task category space for user-centric video search competitions, a benchmark (GPR1200) for evaluating deep neural networks on general-purpose image retrieval, and a dataset (LLQA) for evaluating Question Answering (QA) systems on lifelog data.
  • Selected datasets at the 30th ACM Multimedia Conference (MM ’22 – https://2022.acmmm.org/). For a general report from ACM Multimedia 2022 please see (https://records.sigmm.org/2022/12/07/report-from-acm-multimedia-2022-by-nitish-nagesh/). We summarize nine datasets presented during the conference, covering multimodal intent recognition (MIntRec), audio-visual question answering (AVQA), large-scale mmWave radar data for 3D body reconstruction (mmBody), multimodal sticker emotion recognition (SER30K), video-sentence pairs for vision-language pre-training (ACTION), head and gaze behavior for 360-degree videos, saliency in augmented reality (SARD), spotting the differences between pairs of similar images through dialog (DialDiff), and visual grounding in large-scale remote sensing images (RSVG).

For the overview of datasets related to QoMEX 2022 and ODS at MMSys ’22 please check the first part (https://records.sigmm.org/?p=12292), while ImageCLEF 2022 and MediaEval 2022 are addressed in the third part (http://records.sigmm.org/?p=12362).

MDRE at MMM 2022

The Multimedia Datasets for Repeatable Experimentation (MDRE) special session is part of the 2022 International Conference on Multimedia Modeling (MMM 2022), supporting both online and onsite presentation, Phu Quoc, Vietnam, June 6-10, 2022. The session was organized by Cathal Gurrin (Dublin City University, Ireland), Duc-Tien Dang-Nguyen (University of Bergen, Norway), Björn Þór Jónsson (IT University of Copenhagen, Denmark), Adam Jatowt (University of Innsbruck, Austria), Liting Zhou (Dublin City University, Ireland) and Graham Healy (Dublin City University, Ireland). Details regarding this session can be found at: https://mmm2022.org/ssp.html#mdre

The MDRE’22 special session at MMM’22 is the fourth MDRE session, and it represents an opportunity for interested researchers to submit their datasets to this track. The work submitted to MDRE is permanently available at http://mmdatasets.org, where all the current and past editions of MDRE are hosted. Authors are asked to provide the dataset itself along with a paper describing its motivation, design, and usage, a brief summary of the experiments performed to date on the dataset, and a discussion of how it can be useful to the community.

A Task Category Space for User-Centric Comparative Multimedia Search Evaluations
Paper available at: https://doi.org/10.1007/978-3-030-98358-1_16
Lokoč, J., Bailer, W., Barthel, K.U., Gurrin, C., Heller, S., Jónsson, B., Peška, L., Rossetto, L., Schoeffmann, K., Vadicamo, L., Vrochidis, S., Wu, J.
Charles University, Prague, Czech Republic; JOANNEUM RESEARCH, Graz, Austria; HTW Berlin, Berlin, Germany; Dublin City University, Dublin, Ireland; University of Basel, Basel, Switzerland; IT University of Copenhagen, Copenhagen, Denmark; University of Zurich, Zurich, Switzerland; Klagenfurt University, Klagenfurt, Austria; ISTI CNR, Pisa, Italy; Centre for Research and Technology Hellas, Thessaloniki, Greece; City University of Hong Kong, Hong Kong.
Dataset available at: On request

The authors have analyzed the spectrum of possible task categories and propose a list of individual axes that define a large space of possible task categories. Using this concept of category space, new user-centric video search competitions can be designed to benchmark video search systems from different perspectives. They further analyze the three task categories considered at the Video Browser Showdown and discuss possible (but sometimes challenging) shifts within the task category space.

GPR1200: A Benchmark for General-Purpose Content-Based Image Retrieval
Paper available at: https://doi.org/10.1007/978-3-030-98358-1_17
Schall, K., Barthel, K.U., Hezel, N., Jung, K.
Visual Computing Group, HTW Berlin, University of Applied Sciences, Germany.
Dataset available at: http://visual-computing.com/project/GPR1200

In this study, the authors developed a new dataset called GPR1200 to evaluate the performance of deep neural networks for general-purpose content-based image retrieval (CBIR). They found that large-scale pretraining significantly improves retrieval performance and that further improvement can be achieved through fine-tuning. GPR1200 is presented as an easy-to-use and accessible but challenging benchmark dataset with a broad range of image categories.

LLQA – Lifelog Question Answering Dataset
Paper available at: https://doi.org/10.1007/978-3-030-98358-1_18
Tran, L.-D., Ho, T.C., Pham, L.A., Nguyen, B., Gurrin, C., Zhou, L.
Dublin City University, Dublin, Ireland; Vietnam National University, Ho Chi Minh University of Science, Ho Chi Minh City, Viet Nam; AISIA Research Lab, Ho Chi Minh City, Viet Nam.
Dataset available at: https://github.com/allie-tran/LLQA

This study presents the Lifelog Question Answering Dataset (LLQA), a new dataset for evaluating the performance of Question Answering (QA) systems on lifelog data. The dataset includes over 15,000 multiple-choice questions drawn from an augmented 85-day lifelog collection, and is intended to serve as a benchmark for future research in this area. The results of the study showed that QA on lifelog data is a challenging task that requires further exploration.

ACM MM 2022

Numerous dataset-related papers were presented at the 30th ACM International Conference on Multimedia (MM ’22), held in Lisbon, Portugal, October 10 – 14, 2022 (https://2022.acmmm.org/). The complete MM ’22: Proceedings of the 30th ACM International Conference on Multimedia are available in the ACM Digital Library (https://dl.acm.org/doi/proceedings/10.1145/3503161).

There was not a specifically dedicated Dataset session among roughly 35 sessions at the MM ’22 symposium. However, the importance of datasets can be illustrated in the following statistics, quantifying how often the term “dataset” appears in MM ’22 Proceedings. The term appears in the title of 9 papers (7 last year), the keywords of 35 papers (66 last year), and the abstracts of 438 papers (339 last year). As a small example, nine selected papers focused primarily on new datasets with publicly available data are listed below. There are contributions focused on various multimedia applications, e.g., understanding multimedia content, multimodal fusion and embeddings, media interpretation, vision and language, engaging users with multimedia, emotional and social signals, interactions and Quality of Experience, and multimedia search and recommendation.

MIntRec: A New Dataset for Multimodal Intent Recognition
Paper available at: https://doi.org/10.1145/3503161.3547906
Zhang, H., Xu, H., Wang, X., Zhou, Q., Zhao, S., Teng, J.
Tsinghua University, Beijing, China.
Dataset available at: https://github.com/thuiar/MIntRec

MIntRec is a dataset for multimodal intent recognition with 2,224 samples collected from the TV series Superstore, covering text, video, and audio modalities and annotated with twenty intent categories and speaker bounding boxes. Baseline models are built by adapting multimodal fusion methods and show significant improvement over the text-only modality. MIntRec is useful for studying relationships between modalities and improving intent recognition.

AVQA: A Dataset for Audio-Visual Question Answering on Videos
Paper available at: https://doi.org/10.1145/3503161.3548291
Yang, P., Wang, X., Duan, X., Chen, H., Hou, R., Jin, C., Zhu, W.
Tsinghua University, Shenzhen, China; Communication University of China, Beijing, China.
Dataset available at: https://mn.cs.tsinghua.edu.cn/avqa

Audio-visual question-answering dataset (AVQA) is introduced for videos in real-life scenarios. It includes 57,015 videos and 57,335 question-answer pairs that rely on clues from both audio and visual modalities. A Hierarchical Audio-Visual Fusing module is proposed to model correlations among audio, visual, and text modalities. AVQA can be used to test models with a deeper understanding of multimodal information on audio-visual question answering in real-life scenarios.

mmBody Benchmark: 3D Body Reconstruction Dataset and Analysis for Millimeter Wave Radar
Paper available at: https://doi.org/10.1145/3503161.3548262
Chen, A., Wang, X., Zhu, S., Li, Y., Chen, J., Ye, Q.
Zhejiang University, Hangzhou, China.
Dataset available at: On request

A large-scale mmWave radar dataset with synchronized and calibrated point clouds and RGB(D) images is presented, along with an automatic 3D body annotation system. State-of-the-art methods are trained and tested on the dataset, showing that mmWave radar can achieve better 3D body reconstruction accuracy than an RGB camera but worse than a depth camera. The dataset and results provide insights into improving mmWave radar reconstruction and combining signals from different sensors.

SER30K: A Large-Scale Dataset for Sticker Emotion Recognition
Paper available at: https://doi.org/10.1145/3503161.3548407
Liu, S., Zhang, X., Yan, J.
Nankai University, Tianjin, China.
Dataset available at: https://github.com/nku-shengzheliu/SER30K

A new multimodal sticker emotion recognition dataset called SER30K with 1,887 sticker themes and 30,739 images is introduced for understanding emotions in stickers. A proposed method called LORA, using a vision transformer and local re-attention module, effectively extracts visual and language features for emotion recognition on SER30K and other datasets.

Auto-captions on GIF: A Large-scale Video-sentence Dataset for Vision-language Pre-training
Paper available at: https://doi.org/10.1145/3503161.3551581
Pan, Y., Li, Y., Luo, J., Xu, J., Yao, T., Mei, T.
JD Explore Academy, Beijing, China.
Dataset available at: http://www.auto-video-captions.top/2022/dataset

A new large-scale pre-training dataset, Auto-captions on GIF (ACTION), is presented for generic video understanding. It contains video-sentence pairs extracted and filtered from web pages and can be used for pre-training and downstream tasks such as video captioning and sentence localization. Comparisons with existing video-sentence datasets are made.

Where Are You Looking?: A Large-Scale Dataset of Head and Gaze Behavior for 360-Degree Videos and a Pilot Study
Paper available at: https://doi.org/10.1145/3503161.3548200
Jin, Y., Liu, J., Wang, F., Cui, S.
The Chinese University of Hong Kong, Shenzhen, Shenzhen, China.
Dataset available at: https://cuhksz-inml.github.io/head_gaze_dataset/

A dataset of users’ head and gaze behaviors in 360° videos is presented, containing rich dimensions, large scale, strong diversity, and high frequency. A quantitative taxonomy for 360° videos is also proposed, containing three objective technical metrics. Results of a pilot study on users’ behaviors and a case of application in tile-based 360° video streaming show the usefulness of the dataset for improving the performance of existing works.

Saliency in Augmented Reality
Paper available at: https://doi.org/10.1145/3503161.3547955
Duan, H., Shen, W., Min, X., Tu, D., Li, J., Zhai, G.
Shanghai Jiao Tong University, Shanghai, China; Alibaba Group, Hangzhou, China.
Dataset available at: https://github.com/DuanHuiyu/ARSaliency

A dataset, Saliency in AR Dataset (SARD), containing 450 background, 450 AR, and 1,350 superimposed images with three mixing levels, is constructed to study the interaction between background scenes and AR contents, and the saliency prediction problem in AR. An eye-tracking experiment is conducted among 60 subjects to collect data.

Visual Dialog for Spotting the Differences between Pairs of Similar Images
Paper available at: https://doi.org/10.1145/3503161.3548170
Zheng, D., Meng, F., Si, Q., Fan, H., Xu, Z., Zhou, J., Feng, F., Wang, X.
Beijing University of Posts and Telecommunications, Beijing, China; WeChat AI, Tencent Inc, Beijing, China; Institute of Information Engineering, Chinese Academy of Sciences, Beijing, China; University of Trento, Trento, Italy.
Dataset available at: https://github.com/zd11024/Spot_Difference

A new visual dialog task called Dial-the-Diff is proposed, in which two interlocutors access two similar images and try to spot the difference between them through conversation in natural language. A large-scale multi-modal dataset called DialDiff, containing 87k Virtual Reality images and 78k dialogs, is built for the task. Benchmark models are also proposed and evaluated to bring new challenges to dialog strategy and object categorization.

Visual Grounding in Remote Sensing Images
Paper available at: https://doi.org/10.1145/3503161.3548316
Sun, Y., Feng, S., Li, X., Ye, Y., Kang, J., Huang, X.
Harbin Institute of Technology, Shenzhen, Shenzhen, China; Soochow University, Suzhou, China.
Dataset available at: https://sunyuxi.github.io/publication/GeoVG

A new problem of visual grounding in large-scale remote sensing images has been presented, in which the task is to locate particular objects in an image by a natural language expression. A new dataset, called RSVG, has been collected and a new method, GeoVG, has been designed to address the challenges of existing methods in dealing with remote sensing images.

Overview of Open Dataset Sessions and Benchmarking Competitions in 2022 – Part 3 (ImageCLEF 2022, MediaEval 2022)

In this Dataset Column, we present a review of some of the notable events related to open datasets and benchmarking competitions in the field of multimedia. This year’s selection highlights the wide range of topics and datasets currently of interest to the community. The events covered in this review include special sessions on open datasets and competitions featuring multimedia data. While this list is not exhaustive and contains an overview of about 40 datasets, it is meant to showcase the diversity of subjects and datasets explored in the field. This year’s review follows similar efforts from the previous year (https://records.sigmm.org/2022/01/12/overview-of-open-dataset-sessions-and-benchmarking-competitions-in-2021/), highlighting the ongoing importance of open datasets and benchmarking competitions in advancing research and development in multimedia. The column is divided into three parts; in this one, we focus on ImageCLEF 2022 and MediaEval 2022:

  • ImageCLEF 2022 (https://www.imageclef.org/2022). We summarize the 5 datasets launched for the benchmarking tasks, related to several topics like social media profile assessment (ImageCLEFaware), segmentation and labeling of underwater coral images (ImageCLEFcoral), late fusion ensembling systems for multimedia data (ImageCLEFfusion) and medical imaging analysis (ImageCLEFmedical Caption, and ImageCLEFmedical Tuberculosis).
  • MediaEval 2022 (https://multimediaeval.github.io/editions/2022/). We summarize the 11 datasets launched for the benchmarking tasks, which target a wide range of multimedia topics such as the analysis of flood-related media (DisasterMM), game analytics (Emotional Mario), news item processing (FakeNews, NewsImages), multimodal understanding of smells (MUSTI), medical imaging (Medico), fishing vessel analysis (NjordVid), media memorability (Memorability), sports data analysis (Sport Task, SwimTrack), and urban pollution analysis (Urban Air).

For the overview of datasets related to QoMEX 2022 and ODS at MMSys ’22 please check the first part (https://records.sigmm.org/?p=12292), while MDRE at MMM 2022 and ACM MM 2022 are addressed in the second part (http://records.sigmm.org/?p=12360).

ImageCLEF 2022

ImageCLEF is a multimedia evaluation campaign, part of the CLEF initiative (http://www.clef-initiative.eu/). The 2022 edition (https://www.imageclef.org/2022) is the 19th edition of this initiative and addresses four main research tasks in several domains, such as medicine, nature, social media content and user interface processing. ImageCLEF 2022 is organized by Bogdan Ionescu (University Politehnica of Bucharest, Romania), Henning Müller (University of Applied Sciences Western Switzerland, Sierre, Switzerland), Renaud Péteri (University of La Rochelle, France), Ivan Eggel (University of Applied Sciences Western Switzerland, Sierre, Switzerland) and Mihai Dogariu (University Politehnica of Bucharest, Romania).

ImageCLEFaware
Paper available at: https://ceur-ws.org/Vol-3180/paper-98.pdf
Popescu, A., Deshayes-Chossar, J., Schindler, H., Ionescu, B.
CEA LIST, France; University Politehnica of Bucharest, Romania.
Dataset available at: https://www.imageclef.org/2022/aware

This is the second edition of the aware task at ImageCLEF. It seeks to understand how public social media profiles affect users in certain important scenarios, in which a user searches or applies for a bank loan, accommodation, a job as a waitress/waiter, or a job in IT.

ImageCLEFcoral
Paper available at: https://ceur-ws.org/Vol-3180/paper-97.pdf
Chamberlain, J., de Herrera, A.G.S., Campello, A., Clark, A.
University of Essex, United Kingdom; Wellcome Trust, United Kingdom.
Dataset available at: https://www.imageclef.org/2022/coral

This fourth edition of the coral task addresses the problem of segmenting and labeling a set of underwater images used in the monitoring of coral reefs. The task proposes two subtasks, namely an annotation and localization subtask and a pixel-wise parsing subtask.

ImageCLEFfusion
Paper available at: https://ceur-ws.org/Vol-3180/paper-99.pdf
Ştefan, L-D., Constantin, M.G., Dogariu, M., Ionescu, B.
University Politehnica of Bucharest, Romania.
Dataset available at: https://www.imageclef.org/2022/fusion

This is the first edition of the fusion task, and it proposes two scenarios adapted for the use of late fusion or ensembling systems. The first is a regression scenario, using data associated with the prediction of media interestingness, and the second is a retrieval scenario, using data associated with search result diversification.
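As a rough illustration of what late fusion means in the regression scenario, the sketch below combines the scores of several already-trained systems with a weighted average. This is only one simple fusion scheme; the function name `late_fusion`, the weights, and the scores are hypothetical and are not taken from the task data or from any participant's method.

```python
from typing import List, Sequence


def late_fusion(predictions: Sequence[Sequence[float]],
                weights: Sequence[float]) -> List[float]:
    """Fuse per-system scores with a weighted average (one simple late-fusion scheme)."""
    if len(predictions) != len(weights):
        raise ValueError("one weight per system is required")
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    n_items = len(predictions[0])
    if any(len(p) != n_items for p in predictions):
        raise ValueError("all systems must score the same items")
    # For every item, average the systems' scores using the given weights.
    return [
        sum(w * p[i] for w, p in zip(weights, predictions)) / total
        for i in range(n_items)
    ]


# Hypothetical interestingness scores from three systems for four media items.
system_scores = [
    [0.2, 0.8, 0.5, 0.1],
    [0.4, 0.6, 0.5, 0.3],
    [0.3, 0.7, 0.2, 0.2],
]
# The third system is trusted twice as much as the other two (an arbitrary choice).
fused = late_fusion(system_scores, weights=[1.0, 1.0, 2.0])
```

In practice the weights would be tuned on held-out data, and the retrieval scenario would need a rank-based fusion rule rather than a plain score average.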

ImageCLEFmedical Tuberculosis
Paper available at: https://ceur-ws.org/Vol-3180/paper-96.pdf
Kozlovski, S., Dicente Cid, Y., Kovalev, V., Müller, H.
United Institute of Informatics Problems, Belarus; Roche Diagnostics, Spain; University of Applied Sciences Western Switzerland, Switzerland; University of Geneva, Switzerland.
Dataset available at: https://www.imageclef.org/2022/medical/tuberculosis

This task is now in its sixth edition and has been upgraded to a detection problem. Two subtasks are now included: the detection of lung cavern regions in lung CT images associated with lung tuberculosis, and the prediction of 4 binary features of caverns suggested by experienced radiologists.

ImageCLEFmedical Caption
Paper available at: https://ceur-ws.org/Vol-3180/paper-95.pdf
Rückert, J., Ben Abacha, A., de Herrera, A.G.S., Bloch, L., Brüngel, R., Idrissi-Yaghir, A., Schäfer, H., Müller, H., Friedrich, C.M.
University of Applied Sciences and Arts Dortmund, Germany; Microsoft, USA; University of Essex, UK; University Hospital Essen, Germany; University of Applied Sciences Western Switzerland, Switzerland; University of Geneva, Switzerland.
Dataset available at: https://www.imageclef.org/2022/medical/caption

The sixth edition of this task consists of two subtasks. In the first subtask, participants must detect relevant medical concepts in a large corpus of medical images, while in the second subtask, coherent captions must be generated for the entirety of the context of medical images, targeting the interplay of many visible concepts.

MediaEval 2022

The MediaEval Multimedia Evaluation benchmark (https://multimediaeval.github.io/) offers challenges in artificial intelligence for multimedia data. This is the 13th edition of MediaEval (https://multimediaeval.github.io/editions/2022/), and 11 tasks were proposed for this edition, targeting a large number of challenges by creating algorithms for retrieval, analysis, and exploration. For this edition, a “Quest for Insight” is pursued, where organizers are encouraged to propose interesting and insightful questions about the concepts that will be explored, and participants are encouraged to push beyond merely improving evaluation scores and to work toward a deeper understanding of the challenges.

DisasterMM: Multimedia Analysis of Disaster-Related Social Media Data
Preprint available at: https://2022.multimediaeval.com/paper5337.pdf
Andreadis, S., Bozas, A., Gialampoukidis, I., Mavropoulos, T., Moumtzidou, A., Vrochidis, S., Kompatsiaris, I., Fiorin, R., Lombardo, F., Norbiato, D., Ferri, M.
Information Technologies Institute – Centre of Research and Technology Hellas, Greece; Eastern Alps River Basin District, Italy.
Dataset available at: https://multimediaeval.github.io/editions/2022/tasks/disastermm/

The DisasterMM task proposes the analysis of social media data extracted from Twitter, targeting the analysis of natural or man-made disaster posts. For this year, the organizers focused on the analysis of flooding events and proposed two subtasks: relevance classification of posts and location extraction from texts.

Emotional Mario: A Game Analytics Challenge
Preprint or paper not published yet.
Lux, M., Alshaer, M., Riegler, M., Halvorsen, P., Thambawita, V., Hicks, S., Dang-Nguyen, D.-T.
Alpen-Adria-Universität Klagenfurt, Austria; SimulaMet, Norway; University of Bergen, Norway.
Dataset available at: https://multimediaeval.github.io/editions/2022/tasks/emotionalmario/

Emotional Mario focuses on the Super Mario Bros videogame, analyzing player data that consists of game input, demographics, biomedical data, and video of the players’ faces. Two subtasks are proposed: event detection, seeking to identify gaming events of significant importance based on facial videos and biometric data, and gameplay summarization, seeking to select the best moments of gameplay.

FakeNews Detection
Preprint available at: https://2022.multimediaeval.com/paper116.pdf
Pogorelov, K., Schroeder, D.T., Brenner, S., Maulana, A., Langguth, J.
Simula Research Laboratory, Norway; University of Bergen, Norway; Stuttgart Media University, Germany.
Dataset available at: https://multimediaeval.github.io/editions/2022/tasks/fakenews/

The FakeNews Detection task proposes several ways of analyzing fake news and the way it spreads, using COVID-19 related conspiracy theories. The competition proposes three subtasks: the first targets conspiracy detection in text-based data, the second asks participants to analyze graphs of conspiracy posters, while the last one combines the first two, aiming at detection on both text and graph data.

MUSTI – Multimodal Understanding of Smells in Texts and Images
Preprint available at: https://2022.multimediaeval.com/paper9634.pdf
Hürriyetoğlu, A., Paccosi, T., Menini, S., Zinnen, M., Lisena, P., Akdemir, K., Troncy, R., van Erp, M.
KNAW Humanities Cluster DHLab, Netherlands; Fondazione Bruno Kessler, Italy; Friedrich-Alexander-Universität, Germany; EURECOM, France.
Dataset available at: https://multimediaeval.github.io/editions/2022/tasks/musti/

MUSTI is one of the few benchmarks that seek to analyze the underrepresented modality of smell. The organizers seek to further the understanding of descriptions of smell in texts and images, and propose two subtasks: the first aims at classification of smells based on language and image models, predicting whether texts or images evoke the same smell source or not, while the second asks participants to identify the common smell sources.

Medical Multimedia Task: Transparent Tracking of Spermatozoa
Preprint available at: https://2022.multimediaeval.com/paper5501.pdf
Thambawita, V., Hicks, S., Storås, A.M., Andersen, J.M., Witczak, O., Haugen, T.B., Hammer, H., Nguyen, T., Halvorsen, P., Riegler, M.A.
SimulaMet, Norway; OsloMet, Norway; The Arctic University of Norway, Norway.
Dataset available at: https://multimediaeval.github.io/editions/2022/tasks/medico/

The Medico Medical Multimedia Task tackles the challenge of tracking sperm cells in video recordings, while analyzing the specific characteristics of these cells. Four subtasks are proposed: a sperm-cell real-time tracking task in videos, a prediction of cell motility task, a catch and highlight task seeking to identify sperm cell speed, and an explainability task.

NewsImages
Preprint available at: https://2022.multimediaeval.com/paper8446.pdf
Kille, B., Lommatzsch, A., Özgöbek, Ö., Elahi, M., Dang-Nguyen, D.-T.
Norwegian University of Science and Technology, Norway; Berlin Institute of Technology, Germany; University of Bergen, Norway; Kristiania University College, Norway.
Dataset available at: https://multimediaeval.github.io/editions/2022/tasks/newsimages/

The goal of the NewsImages task is to further the understanding of the relationship between textual and image content in news articles. Participants are tasked with re-linking and re-matching textual news articles with the corresponding images, based on data gathered from social media, news portals and RSS feeds.

NjordVid: Fishing Trawler Video Analytics Task
Preprint available at: https://2022.multimediaeval.com/paper5854.pdf
Nordmo, T.A.S., Ovesen, A.B., Johansen, H.D., Johansen, D., Riegler, M.A.
The Arctic University of Norway, Norway; SimulaMet, Norway.
Dataset available at: https://multimediaeval.github.io/editions/2022/tasks/njord/

The NjordVid task provides data from fishing vessel recordings, with the aim of supporting sustainable fishing practices. Two subtasks are proposed: detection of events on the boat, such as movement of people or catching fish, and preserving the privacy of on-board personnel.

Predicting Video Memorability
Preprint available at: https://2022.multimediaeval.com/paper2265.pdf
Sweeney, L., Constantin, M.G., Demarty, C.-H., Fosco, C., de Herrera, A.G.S., Halder, S., Healy, G., Ionescu, B., Matran-Fernandez, A., Smeaton, A.F., Sultana, M.
Dublin City University, Ireland; University Politehnica of Bucharest, Romania; InterDigital, France; Massachusetts Institute of Technology Cambridge, USA; University of Essex, UK.
Dataset available at: https://multimediaeval.github.io/editions/2022/tasks/memorability/

The Video Memorability task asks participants to predict how memorable a video sequence is, targeting short-term memorability. Three subtasks are proposed for this edition: a general video-based prediction task where participants are asked to predict the memorability score of a video, a generalization task where training and testing are performed on different sources of data, and an EEG-based task where annotator EEG scans are provided.

Sport Task: Fine Grained Action Detection and Classification of Table Tennis Strokes from Videos
Preprint available at: https://2022.multimediaeval.com/paper4766.pdf
Martin, P.-E., Calandre, J., Mansencal, B., Benois-Pineau, J., Péteri, R., Mascarilla, L., Morlier, J.
Max Planck Institute for Evolutionary Anthropology, Germany; La Rochelle University, France; Univ. Bordeaux, France.
Dataset available at: https://multimediaeval.github.io/editions/2022/tasks/sportsvideo/

The Sport Task aims at action detection and classification in videos recorded at table tennis events. Low inter-class variability makes this task harder than other traditional action classification benchmarks. Two subtasks are proposed: a classification task where participants are asked to label table tennis videos according to the strokes the players make, and a detection task where participants must detect whether a stroke was made.

SwimTrack: Swimmers and Stroke Rate Detection in Elite Race Videos
Preprint available at: https://2022.multimediaeval.com/paper6876.pdf
Jacquelin, N., Jaunet, T., Vuillemot, R., Duffner, S.
École Centrale de Lyon, France; INSA-Lyon, France.
Dataset available at: https://multimediaeval.github.io/editions/2022/tasks/swimtrack/

The SwimTrack task comprises five subtasks related to the analysis of competition-level swimming videos, and provides multimodal video, image and audio data. The five subtasks are as follows: a position detection task associating swimmers with the numbers of swimming lanes, a stroke rate detection task, a camera registration task where participants must apply homography projection methods to create a top-view of the pool, a character recognition on scoreboards task, and a sound detection task associated with buzzer sounds.
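To make the camera registration subtask more concrete, the sketch below shows the core operation of a homography projection: mapping image pixels to top-view pool coordinates with a 3×3 matrix in homogeneous coordinates. The matrix values and the function name `apply_homography` are hypothetical; a real matrix would have to be estimated from pool landmarks, which this sketch does not cover.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, float]


def apply_homography(h: Sequence[Sequence[float]],
                     points: Sequence[Point]) -> List[Point]:
    """Map image points to the target plane with a 3x3 homography matrix."""
    mapped = []
    for x, y in points:
        # Multiply the matrix by the homogeneous point (x, y, 1).
        u = h[0][0] * x + h[0][1] * y + h[0][2]
        v = h[1][0] * x + h[1][1] * y + h[1][2]
        w = h[2][0] * x + h[2][1] * y + h[2][2]
        if abs(w) < 1e-12:
            raise ValueError("point maps to infinity under this homography")
        # Divide by w to return to Cartesian coordinates.
        mapped.append((u / w, v / w))
    return mapped


# Hypothetical matrix with a perspective term in the last row.
H = [
    [0.1, 0.0, 0.0],
    [0.0, 0.1, 0.0],
    [0.001, 0.0, 1.0],
]
# Two hypothetical pixel positions mapped to pool coordinates in metres.
pool_points = apply_homography(H, [(0.0, 0.0), (1000.0, 500.0)])
```

With this made-up matrix, the second pixel has w = 2 and therefore lands at roughly (50, 25); the division by w is what distinguishes a perspective mapping from a purely affine one.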

Urban Air: Urban Life and Air Pollution
Preprint available at: https://2022.multimediaeval.com/paper586.pdf
Dao, M.-S., Dang, T.-H., Nguyen-Tai, T.-L., Nguyen, T.-B., Dang-Nguyen, D.-T.
National Institute of Information and Communications Technology, Japan; Dalat University, Vietnam; LOC GOLD Technology MTV Ltd. Co, Vietnam; University of Science, Vietnam National University in HCM City, Vietnam; University of Bergen, Norway.
Dataset available at: https://multimediaeval.github.io/editions/2022/tasks/urbanair/

The Urban Air task provides multimodal data that allows the analysis of air pollution and pollution patterns in urban environments. The organizers created two subtasks for this competition: a multimodal/crossmodal air quality index prediction task using station and/or CCTV data, and a periodic traffic pollution pattern discovery task.

Overview of Open Dataset Sessions and Benchmarking Competitions in 2022 – Part 1 (QoMEX 2022, ODS at MMSys ’22)

In this Dataset Column, we present a review of some of the notable events related to open datasets and benchmarking competitions in the field of multimedia. This year’s selection highlights the wide range of topics and datasets currently of interest to the community. The events covered in this review include special sessions on open datasets and competitions featuring multimedia data. While this list is not exhaustive and contains an overview of about 40 datasets, it is meant to showcase the diversity of subjects and datasets explored in the field. This year’s review follows similar efforts from the previous year (https://records.sigmm.org/2022/01/12/overview-of-open-dataset-sessions-and-benchmarking-competitions-in-2021/), highlighting the ongoing importance of open datasets and benchmarking competitions in advancing research and development in multimedia. The column is divided into three parts; in this one, we focus on QoMEX 2022 and ODS at MMSys ’22:

  • 14th International Conference on Quality of Multimedia Experience (QoMEX 2022 – https://qomex2022.itec.aau.at/). We summarize three datasets presented at this conference, which address QoE studies on audiovisual 360° video, storytelling for quality perception, and energy consumption and streaming video QoE.
  • Open Dataset and Software Track at 13th ACM Multimedia Systems Conference (ODS at MMSys ’22 – https://mmsys2022.ie/). We summarize nine datasets presented at the ODS track, targeting several topics, including surveillance videos from a fishing vessel (Njord), multi-codec 8K UHD videos (8K MPEG-DASH dataset), light-field (LF) synthetic immersive large-volume plenoptic dataset (SILVR), a dataset of online news items and the related task of rematching (NewsImages), video sequences, characterized by various complexity categories (VCD), QoE dataset of realistic video clips for real networks, dataset of 360° videos with subjective emotional ratings (PEM360), free-viewpoint video dataset, and cloud gaming dataset (CGD).

For the overview of datasets related to MDRE at MMM 2022 and ACM MM 2022 please check the second part (http://records.sigmm.org/?p=12360), while ImageCLEF 2022 and MediaEval 2022 are addressed in the third part (http://records.sigmm.org/?p=12362).

QoMEX 2022

Three dataset papers were presented at the International Conference on Quality of Multimedia Experience (QoMEX 2022), organized in Lippstadt, Germany, September 5 – 7, 2022 (https://qomex2022.itec.aau.at/). The complete QoMEX ’22 Proceedings are available in the IEEE Digital Library (https://ieeexplore.ieee.org/xpl/conhome/9900491/proceeding).

These datasets were presented within the Databases session, chaired by Professor Oliver Hohlfeld. The three papers focus on audiovisual 360° videos, storytelling for quality perception, and the modelling of energy consumption and streaming video QoE.

Audiovisual Database with 360° Video and Higher-Order Ambisonics Audio for Perception, Cognition, Behavior and QoE Evaluation Research
Paper available at: https://ieeexplore.ieee.org/document/9900893
Robotham, T., Singla, A., Rummukainen, O., Raake, A. and Habets, E.
International Audio Laboratories Erlangen, a joint institution of the Friedrich-Alexander-Universität Erlangen-Nürnberg (FAU) and Fraunhofer Institute for Integrated Circuits (IIS), Germany; TU Ilmenau, Germany.
Dataset available at: https://qoevave.github.io/database/

This publicly available database provides audiovisual 360° content with higher-order Ambisonics audio. It consists of twelve scenes capturing real-life nature and urban environments with a video resolution of 7680×3840 at 60 frames per second and with 4th-order Ambisonics audio. These 360° video sequences, with an average duration of 60 seconds, represent real-life settings for systematically evaluating various dimensions of uni-/multi-modal perception, cognition, behavior, and QoE. It provides high-quality reference material with a balanced focus on auditory and visual sensory information.

The Storytime Dataset: Simulated Videotelephony Clips for Quality Perception Research
Paper available at: https://ieeexplore.ieee.org/document/9900888
Spang, R. P., Voigt-Antons, J. N. and Möller, S.
Technische Universität Berlin, Berlin, Germany; Hamm-Lippstadt University of Applied Sciences, Lippstadt, Germany.
Dataset available at: https://osf.io/cyb8w/

This is a dataset of simulated videotelephony clips to act as stimuli in quality perception research. It consists of four different stories in the German language that are told through ten consecutive parts, each about 10 seconds long. Each of these parts is available in four different quality levels, ranging from perfect to stalling. All clips (FullHD, H.264 / AAC) are actual recordings from end-user video-conference software to ensure ecological validity and realism of quality degradation. Apart from a detailed description of the methodological approach, the authors contribute the entire stimuli dataset containing 160 videos and all rating scores for each file.
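The total of 160 videos follows from the design described above (4 stories × 10 parts × 4 quality levels). The short sketch below enumerates that grid; the identifier pattern is a placeholder chosen for illustration and does not reflect the dataset's actual file names.

```python
from itertools import product

STORIES = 4          # stories told in German
PARTS = 10           # consecutive parts per story
QUALITY_LEVELS = 4   # from perfect to stalling

# Placeholder identifiers; the dataset's real file names may differ.
clips = [
    f"story{s}_part{p:02d}_q{q}"
    for s, p, q in product(range(1, STORIES + 1),
                           range(1, PARTS + 1),
                           range(1, QUALITY_LEVELS + 1))
]

# Each part is about 10 seconds long, so one story runs for roughly 100 seconds.
approx_story_seconds = PARTS * 10
```

Here `len(clips)` is 160, matching the number of videos reported for the dataset.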

Modelling of Energy Consumption and Streaming Video QoE using a Crowdsourcing Dataset
Paper available at: https://ieeexplore.ieee.org/document/9900886
Herglotz, C., Robitza, W., Kränzler, M., Kaup, A. and Raake, A.
Friedrich-Alexander-Universität, Erlangen, Germany; Audiovisual Technology Group, TU Ilmenau, Germany; AVEQ GmbH, Vienna, Austria.
Dataset available at: On request

This paper performs a first analysis of end-user power efficiency and Quality of Experience of a video streaming service. A crowdsourced dataset comprising 447,000 streaming events from YouTube is used to estimate both the power consumption and perceived quality. The power consumption is modelled based on previous work, which extends toward predicting the power usage of different devices and codecs. The user-perceived QoE is estimated using a standardized model.

ODS at MMSys ’22

The traditional Open Dataset and Software Track (ODS) was a part of the 13th ACM Multimedia Systems Conference (MMSys ’22) organized in Athlone, Ireland, June 14 – 17, 2022 (https://mmsys2022.ie/). The complete MMSys ’22: Proceedings of the 13th ACM Multimedia Systems Conference are available in the ACM Digital Library (https://dl.acm.org/doi/proceedings/10.1145/3524273).

The Open Dataset and Software Chairs for MMSys ’22 were Roberto Azevedo (Disney Research, Switzerland), Saba Ahsan (Nokia Technologies, Finland), and Yao Liu (Rutgers University, USA). The ODS session with 14 papers opened with pitches on Wednesday, June 15, followed by a poster session. Nine of the fourteen contributions were dataset papers. The paper titles, dataset summaries, and associated DOIs are listed below.

Njord: a fishing trawler dataset
Paper available at: https://doi.org/10.1145/3524273.3532886
Nordmo, T.-A.S., Ovesen, A.B., Juliussen, B.A., Hicks, S.A., Thambawita, V., Johansen, H.D., Halvorsen, P., Riegler, M.A., Johansen, D.
UiT the Arctic University of Norway, Norway; SimulaMet, Norway; Oslo Metropolitan University, Norway.
Dataset available at: https://doi.org/10.5281/zenodo.6284673

This paper presents Njord, a dataset of surveillance videos from a commercial fishing vessel. The dataset aims to demonstrate the potential for using data from fishing vessels to detect accidents and report fish catches automatically. The authors also provide a baseline analysis of the dataset and discuss possible research questions that it could help answer.

Multi-codec ultra high definition 8K MPEG-DASH dataset
Paper available at: https://doi.org/10.1145/3524273.3532889
Taraghi, B., Amirpour, H., Timmerer, C.
Christian Doppler Laboratory Athena, Institute of Information Technology (ITEC), Alpen-Adria-Universität Klagenfurt, Klagenfurt, Austria.
Dataset available at: http://ftp.itec.aau.at/datasets/mmsys22/

This paper presents a dataset of multimedia assets encoded with various video codecs, including AVC, HEVC, AV1, and VVC, and packaged using the MPEG-DASH format. The dataset includes resolutions up to 8K and has a maximum media duration of 322 seconds, with segment lengths of 4 and 8 seconds. It is intended to facilitate research and development of video encoding technology for streaming services.
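As a small worked example of what the stated segment lengths imply, the sketch below counts the segments needed to cover the longest asset. It assumes that segments are cut back to back and that the final segment may be shorter than the nominal length; the packaging details of the actual dataset are not described here, so the counts are illustrative only.

```python
import math

MAX_DURATION_S = 322          # maximum media duration stated for the dataset
SEGMENT_LENGTHS_S = (4, 8)    # segment lengths stated for the dataset


def segment_count(duration_s: float, segment_s: float) -> int:
    """Number of segments needed to cover the duration; the last may be shorter."""
    return math.ceil(duration_s / segment_s)


# Segment count for the longest asset at each segment length.
counts = {seg: segment_count(MAX_DURATION_S, seg) for seg in SEGMENT_LENGTHS_S}
```

Under these assumptions, the longest asset needs 81 segments at 4 seconds and 41 segments at 8 seconds per representation.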

SILVR: a synthetic immersive large-volume plenoptic dataset
Paper available at: https://doi.org/10.1145/3524273.3532890
Courteaux, M., Artois, J., De Pauw, S., Lambert, P., Van Wallendael, G.
Ghent University – Imec, Oost-Vlaanderen, Zwijnaarde, Belgium.
Dataset available at: https://idlabmedia.github.io/large-lightfields-dataset/

SILVR (synthetic immersive large-volume plenoptic dataset) is a light-field (LF) image dataset allowing for six-degrees-of-freedom navigation in larger volumes while maintaining full panoramic field of view. It includes three virtual scenes with 642-2226 views, rendered with 180° fish-eye lenses and featuring color images and depth maps. The dataset also includes multiview rendering software and a lens-reprojection tool. SILVR can be used to evaluate LF coding and rendering techniques.

NewsImages: addressing the depiction gap with an online news dataset for text-image rematching
Paper available at: https://doi.org/10.1145/3524273.3532891
Lommatzsch, A., Kille, B., Özgöbek, O., Zhou, Y., Tešić, J., Bartolomeu, C., Semedo, D., Pivovarova, L., Liang, M., Larson, M.
DAI-Labor, TU-Berlin, Berlin, Germany; NTNU, Trondheim, Norway; Texas State University, San Marcos, TX, United States; Universidade Nova de Lisboa, Lisbon, Portugal.
Dataset available at: https://multimediaeval.github.io/editions/2021/tasks/newsimages/

NewsImages is a dataset of online news items and the related task of news images rematching, which aims to study the “depiction gap” between the content of an image and the text that accompanies it. The dataset is useful for studying connections between image and text and addressing the depiction gap, including sparse data, diversity of content, and the importance of background knowledge.

VCD: Video Complexity Dataset
Paper available at: https://doi.org/10.1145/3524273.3532892
Amirpour, H., Menon, V.V., Afzal, S., Ghanbari, M., Timmerer, C.
Christian Doppler Laboratory Athena, Institute of Information Technology (ITEC), Alpen-Adria-Universität Klagenfurt, Klagenfurt, Austria; School of Computer Science and Electronic Engineering, University of Essex, Colchester, United Kingdom.
Dataset available at: https://ftp.itec.aau.at/datasets/video-complexity/

The Video Complexity Dataset (VCD) is a collection of 500 Ultra High Definition (UHD) resolution video sequences, characterized by spatial and temporal complexities, rate-distortion complexity, and encoding complexity with the x264 AVC/H.264 and x265 HEVC/H.265 video encoders. It is suitable for video coding applications such as video streaming, two-pass encoding, per-title encoding, and scene-cut detection. These sequences are provided at 24 frames per second (fps) and stored online in losslessly encoded 8-bit 4:2:0 format.

Realistic video sequences for subjective QoE analysis
Paper available at: https://doi.org/10.1145/3524273.3532894
Hodzic, K., Cosovic, M., Mrdovic, S., Quinlan, J.J., Raca, D.
Faculty of Electrical Engineering, University of Sarajevo, Bosnia and Herzegovina; School of Computer Science & Information Technology, University College Cork, Ireland.
Dataset available at: https://shorturl.at/dtISV

The DashReStreamer framework is designed to recreate adaptively streamed video in real networks to evaluate user Quality of Experience (QoE). The authors have also created a dataset of 234 realistic video clips, based on video logs collected from real mobile and wireless networks, including video logs and network bandwidth profiles. This dataset and framework will help researchers understand the impact of video QoE dynamics on multimedia streaming.

PEM360: a dataset of 360° videos with continuous physiological measurements, subjective emotional ratings and motion traces
Paper available at: https://doi.org/10.1145/3524273.3532895
Guimard, Q., Robert, F., Bauce, C., Ducreux, A., Sassatelli, L., Wu, H.-Y., Winckler, M., Gros, A.
Université Côte d’Azur, Inria, CNRS, I3S, Sophia-Antipolis, France.
Dataset available at: https://gitlab.com/PEM360/PEM360/

PEM360 is a dataset of user head movements and gaze recordings in 360° videos, along with self-reported emotional ratings and continuous physiological measurement data. It aims to understand the connection between user attention, emotions, and immersive content, and includes software tools and joint instantaneous visualization of user attention and emotion, called “emotional maps.” The entire data and code are available in a reproducible framework.

A New Free Viewpoint Video Dataset and DIBR Benchmark
Paper available at: https://doi.org/10.1145/3524273.3532897
Guo, S., Zhou, K., Hu, J., Wang, J., Xu, J., Song, L.
Institute of Image Communication and Network Engineering, Shanghai Jiao Tong University, Shanghai, China.
Dataset available at: https://github.com/sjtu-medialab/Free-Viewpoint-RGB-D-Video-Dataset

A new dynamic RGB-D video dataset for free-viewpoint video (FVV) research is presented, including 13 groups of dynamic scenes and one group of static scenes, each with 12 HD video sequences and 12 corresponding depth video sequences. In addition, an FVV synthesis benchmark based on depth image-based rendering (DIBR) is introduced to aid the validation of data-driven methods. The dataset and benchmark aim to advance FVV synthesis with improved robustness and performance.

CGD: a cloud gaming dataset with gameplay video and network recordings
Paper available at: https://doi.org/10.1145/3524273.3532898
Slivar, I., Bacic, K., Orsolic, I., Skorin-Kapov, L., Suznjevic, M.
University of Zagreb, Faculty of Electrical Engineering and Computing, Zagreb, Croatia.
Dataset available at: https://muexlab.fer.hr/muexlab/research/datasets

The cloud gaming dataset (CGD) contains 600 game streaming sessions from 10 games of different genres, with various encoding parameters (bitrate, resolution, and frame rate) to evaluate the impact of these parameters on Quality of Experience (QoE). The dataset includes gameplay video recordings, network traffic traces, user input logs, and streaming performance logs, and can be used to understand relationships between network and application layer data for cloud gaming QoE and QoE-aware network management mechanisms.

Dataset Column: Overview, Scope and Call for Contributions

Overview and Scope

The Dataset Column (https://records.sigmm.org/open-science/datasets/) of ACM SIGMM Records provides timely updates on the developments in the domain of publicly available multimedia datasets as enabling tools for reproducible research in numerous related areas. It is intended as a platform for further dissemination of useful information on multimedia datasets and studies of datasets covering various domains, published in peer-reviewed journals, conference proceedings, dissertations, or as results of applied research in industry.

The aim of the Dataset Column is therefore not to substitute already established platforms for disseminating multimedia datasets, e.g., Qualinet Databases (https://qualinet.github.io/databases/) [2] or the Multimedia Evaluation Benchmark (https://multimediaeval.github.io/), but to promote such platforms and particularly interesting datasets and benchmarking challenges associated with them. Registration for the Multimedia Evaluation Benchmark, MediaEval 2021, is now open (https://multimediaeval.github.io). This year’s MediaEval features a wide variety of tasks and datasets tackling a large number of domains, including video privacy, social media data analysis and understanding, news items analysis, medicine and wellbeing, affective and subjective content analysis, and game and sports associated media.

The Column will also continue to report on contributions presented within Dataset Tracks at relevant conferences, e.g., ACM Multimedia (MM), ACM Multimedia Systems (MMSys), the International Conference on Quality of Multimedia Experience (QoMEX), and the International Conference on Multimedia Modeling (MMM).

Dataset Column in the SIGMM Records

Previously published Dataset Columns are listed below in chronological order.

Call for Contributions

Those who have created a dataset, benchmarking initiative, or study of datasets relevant to the multimedia community, even if it was previously published elsewhere, are very welcome to submit their contribution to the ACM SIGMM Records Dataset Column. Examples include the datasets accepted to the open dataset and software track of the ACM MMSys 2021 conference or the datasets presented at the QoMEX 2021 conference. To report your contribution, please contact one of the editors responsible for the respective area: Mihai Gabriel Constantin (mihai.constantin84@upb.ro), Karel Fliegel (fliegek@fel.cvut.cz), or Maria Torres Vega (maria.torresvega@ugent.be).

Column Editors

Since September 2021, the Dataset Column has been edited by Mihai Gabriel Constantin, Karel Fliegel, and Maria Torres Vega. The current editors appreciate the work of the previous team, Martha Larson, Bart Thomee, and all other contributors, and will continue to develop this dissemination platform.

The general scope of the Dataset Column is reviewed above, with the more specific areas of the editors listed below:

  • Mihai Gabriel Constantin will be responsible for the datasets related to multimedia analysis, understanding, retrieval and exploration,
  • Karel Fliegel for the datasets with subjective annotations related to Quality of Experience (QoE) [1] research,
  • Maria Torres Vega for the datasets related to immersive multimedia systems, networked QoE and cognitive network management.

Mihai Gabriel Constantin is a researcher at the AI Multimedia Lab, University Politehnica of Bucharest, Romania, and received his PhD from the Faculty of Electronics, Telecommunications, and Information Technology at the same university, with the topic “Automatic Analysis of the Visual Impact of Multimedia Data”. He has authored over 25 scientific papers in international conferences and high-impact journals, with an emphasis on deep ensembles and on predicting the subjective impact of multimedia items on human viewers. He has participated as a researcher in more than 10 research projects, and is a program committee member and reviewer for several workshops, conferences, and journals. He is also an active member of the multimedia processing community: he is part of the MediaEval benchmarking initiative organization team, has led or co-organized several MediaEval tasks, including Predicting Media Memorability [3] and Recommending Movies Using Content [4], and has published several papers that analyze the data, annotations, participant features, methods, and observed best practices for MediaEval tasks and datasets [5]. More details can be found on his webpage: https://gconstantin.aimultimedialab.ro/.

Karel Fliegel received his M.Sc. (Ing.) in 2004 (electrical engineering and audiovisual technology) and his Ph.D. in 2011 (research on modeling of visual perception of image impairment features), both from the Czech Technical University in Prague, Faculty of Electrical Engineering (CTU FEE), Czech Republic. He is an assistant professor at the Multimedia Technology Group of CTU FEE. His research interests include multimedia technology, image processing, image and video compression, subjective and objective image quality assessment, Quality of Experience, HVS modeling, and imaging photonics. He has been a member of research teams within various projects, especially in the area of visual information processing. He has participated in COST ICT Actions IC1003 Qualinet and IC1105 3D-ConTourNet, where he was responsible for the development of Qualinet Databases [2] (https://qualinet.github.io/databases/), which are especially relevant to QoE research.

Maria Torres Vega is an FWO (Research Foundation Flanders) Senior Postdoctoral Fellow in the multimedia delivery cluster of the IDLab group at Ghent University (UGent), currently working on the perception of immersive multimedia applications. She received her M.Sc. degree in Telecommunication Engineering from the Polytechnic University of Madrid, Spain, in 2009. Between 2009 and 2013, she worked as a software and test engineer in Germany with a focus on Embedded Systems and Signal Processing. In October 2013, she returned to academia and started her PhD at the Eindhoven University of Technology (Eindhoven, The Netherlands), where she researched the impact of beam-steered optical wireless networks on users’ perception of services. This work earned her a PhD in Electrical Engineering in September 2017. In her years in academia (since October 2013), she has authored more than 40 publications and received three best paper awards. Furthermore, she serves as a reviewer for numerous journals and conferences. In 2020, she served as general chair of the 4th Quality of Experience Management workshop, as tutorial chair of the 2020 Network Softwarization conference (NetSoft), and as demo chair of the Quality of Multimedia Experience conference (QoMEX 2020). In 2021, she served as Technical Program Committee (TPC) chair of the 2021 Quality of Multimedia Experience conference (QoMEX 2021).

References