Overview of Open Dataset Sessions and Benchmarking Competitions in 2023-2024 – Part 2 (MDRE at MMM 2023 and MMM 2024)

Continuing the review begun in the previous Datasets column, we cover some of the most notable events related to open datasets and benchmarking competitions in the field of multimedia in 2023 and 2024. This selection highlights the wide range of topics and datasets currently of interest to the community. Some of the events covered in this review include special sessions on open datasets and competitions featuring multimedia data. This year’s review follows similar efforts from the previous year (https://records.sigmm.org/records-issues/acm-sigmm-records-issue-1-2023/), highlighting the ongoing importance of open datasets and benchmarking competitions in advancing research and development in multimedia. This second part of the column focuses on the two most recent editions of MDRE, at MMM 2023 and MMM 2024:

  • Multimedia Datasets for Repeatable Experimentation at the 29th International Conference on Multimedia Modeling (MDRE at MMM 2023). We summarize the seven datasets presented during MDRE in 2023, namely NCKU-VTF (thermal-to-visible face recognition benchmark), Link-Rot (web dataset decay and reproducibility study), People@Places and ToDY (scene classification for media production), ScopeSense (lifelogging dataset for health analysis), OceanFish (high-resolution fish species recognition), GIGO (urban garbage classification and demographics), and Marine Video Kit (underwater video retrieval and analysis).
  • Multimedia Datasets for Repeatable Experimentation at the 30th International Conference on Multimedia Modeling (MDRE at MMM 2024 – https://mmm2024.org/). We summarize the eight datasets presented during MDRE in 2024, namely RESET (video similarity annotations for embeddings), DocCT (content-aware document image classification), Rach3 (multimodal data for piano rehearsal analysis), WikiMuTe (semantic music descriptions from Wikipedia), PDTW150K (large-scale patent drawing retrieval dataset), Lifelog QA (question answering for lifelog retrieval), Laparoscopic Events (event recognition in surgery videos), and GreenScreen (social media dataset for greenwashing detection).

For the overview of datasets related to QoMEX 2023 and QoMEX 2024, please check the first part (https://records.sigmm.org/2024/09/07/overview-of-open-dataset-sessions-and-benchmarking-competitions-in-2023-2024-part-1-qomex-2023-and-qomex-2024/).

MDRE at MMM 2023

The Multimedia Datasets for Repeatable Experimentation (MDRE) special session was part of the 2023 International Conference on Multimedia Modeling (MMM 2023), held in Bergen, Norway, January 9-12, 2023. The MDRE’23 special session at MMM’23 was the fifth MDRE session. It was organized by Cathal Gurrin (Dublin City University, Ireland), Duc-Tien Dang-Nguyen (University of Bergen, Norway), Adam Jatowt (University of Innsbruck, Austria), Liting Zhou (Dublin City University, Ireland) and Graham Healy (Dublin City University, Ireland).

The NCKU-VTF Dataset and a Multi-scale Thermal-to-Visible Face Synthesis System
Tsung-Han Ho, Chen-Yin Yu, Tsai-Yen Ko & Wei-Ta Chu
National Cheng Kung University, Tainan, Taiwan

Paper available at: https://doi.org/10.1007/978-3-031-27077-2_36
Dataset available at: http://mmcv.csie.ncku.edu.tw/~wtchu/projects/NCKU-VTF/index.html

The dataset, named VTF, comprises paired thermal-visible face images of primarily Asian subjects under diverse visual conditions, introducing challenges for thermal face recognition models. It serves as a benchmark for evaluating model robustness while also revealing racial bias issues in current systems. By addressing both technical and fairness aspects, VTF promotes advancements in developing more accurate and inclusive thermal-to-visible face recognition methods.

Link-Rot in Web-Sourced Multimedia Datasets
Viktor Lakic, Luca Rossetto & Abraham Bernstein
Department of Informatics, University of Zurich, Zurich, Switzerland

Paper available at: https://doi.org/10.1007/978-3-031-27077-2_37
Dataset available at: Combination of 24 different Web-sourced datasets described in the paper

The accompanying study examines 24 Web-sourced datasets comprising over 270 million URLs and reveals that more than 20% of their content has become unavailable due to link-rot. This decay poses significant challenges to the reproducibility of research relying on such datasets. The authors highlight the need for strategies to mitigate content loss and maintain data integrity for future studies.
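
As a rough illustration of how such decay can be measured, the sketch below probes a list of URLs and reports the fraction that no longer resolve. This is not the authors’ methodology; the input file name and the HEAD-request heuristic are assumptions for demonstration.

```python
# Minimal sketch: estimate link-rot in a list of dataset URLs.
# Assumes a plain-text file with one URL per line (hypothetical name).
import requests

def is_alive(url: str, timeout: float = 10.0) -> bool:
    """Treat 2xx/3xx responses as available; errors and 4xx/5xx as rotten."""
    try:
        r = requests.head(url, timeout=timeout, allow_redirects=True)
        return r.status_code < 400
    except requests.RequestException:
        return False

with open("dataset_urls.txt") as f:  # hypothetical input file
    urls = [line.strip() for line in f if line.strip()]

dead = sum(not is_alive(u) for u in urls)
print(f"link-rot: {dead}/{len(urls)} URLs unavailable "
      f"({100 * dead / len(urls):.1f}%)")
```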

People@Places and ToDY: Two Datasets for Scene Classification in Media Production and Archiving
Werner Bailer & Hannes Fassold
Joanneum Research, Graz, Austria

Paper available at: https://doi.org/10.1007/978-3-031-27077-2_38
Dataset available at: https://github.com/wbailer/PeopleAtPlaces

The dataset supports annotation tasks in visual media production and archiving, focusing on scene bustle (from populated to unpopulated), cinematographic shot types, time of day, and season. The People@Places dataset augments Places365 with bustle and shot-type annotations, while the ToDY (time of day/year) dataset enhances SkyFinder. Both datasets come with a toolchain for automatic annotations, manually verified for accuracy. Baseline results using the EfficientNet-B3 model, pretrained on Places365, are provided for benchmarking.
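
A minimal sketch of such a baseline is shown below, fine-tuning an EfficientNet-B3 for bustle classification with the timm library. Note the assumptions: timm ships ImageNet weights rather than the Places365 pretraining used in the paper (a Places365 checkpoint would need to be loaded separately), and the number of bustle classes is a placeholder.

```python
# Minimal sketch: fine-tune an EfficientNet-B3 backbone for bustle
# classification, in the spirit of the paper's baseline.
import timm
import torch

NUM_BUSTLE_CLASSES = 5  # hypothetical number of bustle levels

model = timm.create_model("efficientnet_b3", pretrained=True,
                          num_classes=NUM_BUSTLE_CLASSES)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
criterion = torch.nn.CrossEntropyLoss()

images = torch.randn(8, 3, 300, 300)  # stand-in batch; B3's default input is 300x300
labels = torch.randint(0, NUM_BUSTLE_CLASSES, (8,))

logits = model(images)          # one training step on the stand-in batch
loss = criterion(logits, labels)
loss.backward()
optimizer.step()
```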

ScopeSense: An 8.5-Month Sport, Nutrition, and Lifestyle Lifelogging Dataset
Michael A. Riegler, Vajira Thambawita, Ayan Chatterjee, Thu Nguyen, Steven A. Hicks, Vibeke Telle-Hansen, Svein Arne Pettersen, Dag Johansen, Ramesh Jain & Pål Halvorsen
SimulaMet, Oslo, Norway; Oslo Metropolitan University, Oslo, Norway; UiT The Arctic University of Norway, Tromsø, Norway; University of California Irvine, CA, USA

Paper available at: https://doi.org/10.1007/978-3-031-27077-2_39
Dataset available at: https://datasets.simula.no/scopesense

The dataset, ScopeSense, offers comprehensive sport, nutrition, and lifestyle logs collected over eight and a half months from two individuals. It includes extensive sensor data alongside nutrition, training, and well-being information, structured to facilitate detailed, data-driven research on healthy lifestyles. This dataset aims to support modeling for personalized guidance, addressing challenges in unstructured data and enhancing the precision of lifestyle recommendations. ScopeSense is fully accessible to researchers, serving as a foundation for methods to expand this data-driven approach to larger populations.

Fast Accurate Fish Recognition with Deep Learning Based on a Domain-Specific Large-Scale Fish Dataset
Yuan Lin, Zhaoqi Chu, Jari Korhonen, Jiayi Xu, Xiangrong Liu, Juan Liu, Min Liu, Lvping Fang, Weidi Yang, Debasish Ghose & Junyong You
School of Economics, Innovation, and Technology, Kristiania University College, Oslo, Norway; School of Aerospace Engineering, Xiamen University, Xiamen, China; School of Natural and Computing Sciences, University of Aberdeen, Aberdeen, UK; School of Information Science and Technology, Xiamen University, Xiamen, China; School of Ocean and Earth, Xiamen University, Xiamen, China; Norwegian Research Centre (NORCE), Bergen, Norway

Paper available at: https://doi.org/10.1007/978-3-031-27077-2_40
Dataset available at: Upon request from the authors

The dataset, OceanFish, addresses key challenges in fish species recognition by providing high-resolution images of marine species from the East China Sea, covering 63,622 images across 136 fine-grained fish species. This large-scale, diverse dataset overcomes limitations found in prior fish datasets, such as low resolution and limited annotations. OceanFish includes a fish recognition testbed with deep learning models, achieving high precision and speed in species detection. This dataset can be expanded with additional species and annotations, offering a valuable benchmark for advancing marine biodiversity research and automated fish recognition.

GIGO, Garbage In, Garbage Out: An Urban Garbage Classification Dataset
Maarten Sukel, Stevan Rudinac & Marcel Worring
University of Amsterdam, Amsterdam, The Netherlands

Paper available at: https://doi.org/10.1007/978-3-031-27077-2_41
Dataset available at: https://doi.org/10.21942/uva.20750044

The dataset, GIGO: Garbage in, Garbage out, offers 25,000 images for multimodal urban waste classification, captured across a large area of Amsterdam. It supports sustainable urban waste collection by providing fine-grained classifications of diverse garbage types, differing in size, origin, and material. Unique to GIGO are additional geographic and demographic data, enabling multimodal analysis that incorporates neighborhood and building statistics. The dataset includes state-of-the-art baselines, serving as a benchmark for algorithm development in urban waste management and multimodal classification.

Marine Video Kit: A New Marine Video Dataset for Content-Based Analysis and Retrieval
Quang-Trung Truong, Tuan-Anh Vu, Tan-Sang Ha, Jakub Lokoč, Yue-Him Wong, Ajay Joneja & Sai-Kit Yeung
Hong Kong University of Science and Technology, Clear Water Bay, Hong Kong; FMP, Charles University, Prague, Czech Republic; Shenzhen University, Shenzhen, China

Paper available at: https://doi.org/10.1007/978-3-031-27077-2_42
Dataset available at: https://hkust-vgd.github.io/marinevideokit

The dataset, Marine Video Kit, focuses on single-shot underwater videos captured by moving cameras, providing a challenging benchmark for video retrieval and computer vision tasks. Designed to address the limitations of general-purpose models in domain-specific contexts, the dataset includes meta-data, low-level feature analysis, and semantic annotations of keyframes. Used in the Video Browser Showdown 2023, Marine Video Kit highlights challenges in underwater video analysis and is publicly accessible, supporting advancements in model robustness for specialized video retrieval applications.

MDRE at MMM 2024

The Multimedia Datasets for Repeatable Experimentation (MDRE) special session was part of the 2024 International Conference on Multimedia Modeling (MMM 2024), held in Amsterdam, The Netherlands, January 29 – February 2, 2024. The MDRE’24 special session at MMM’24 was the sixth MDRE session. It was organized by Klaus Schöffmann (Klagenfurt University, Austria), Björn Þór Jónsson (Reykjavik University, Iceland), Cathal Gurrin (Dublin City University, Ireland), Duc-Tien Dang-Nguyen (University of Bergen, Norway), and Liting Zhou (Dublin City University, Ireland). Details regarding this session can be found at: https://mmm2024.org/specialpaper.html#s1.

RESET: Relational Similarity Extension for V3C1 Video Dataset
Patrik Veselý & Ladislav Peška
Faculty of Mathematics and Physics, Charles University, Prague, Czechia

Paper available at: https://doi.org/10.1007/978-3-031-56435-2_1
Dataset available at: https://osf.io/ruh5k

The dataset, RESET: RElational Similarity Evaluation dataseT, offers over 17,000 similarity annotations for video keyframe triples drawn from the V3C1 video collection. RESET includes both close and distant similarity triplets in general and specific sub-domains (wedding and diving), with multiple user re-annotations and similarity scores from 30 pre-trained models. This dataset supports the evaluation and fine-tuning of visual embedding models, aligning them more closely with human-perceived similarity, and enhances content-based information retrieval for more accurate, user-aligned results.
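
A minimal sketch of how such triplet annotations can drive fine-tuning is shown below, using a generic ResNet-50 backbone and a triplet margin loss; the backbone, margin value, and batch layout are illustrative choices, not the authors’ setup.

```python
# Minimal sketch: fine-tune an embedding model with RESET-style triplets
# (anchor keyframe, similar keyframe, dissimilar keyframe).
import torch
import torch.nn as nn
import torchvision.models as models

backbone = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)
backbone.fc = nn.Identity()                  # expose 2048-d pooled features
criterion = nn.TripletMarginLoss(margin=0.2)
optimizer = torch.optim.Adam(backbone.parameters(), lr=1e-5)

# Stand-in batch of keyframe triples.
anchor = torch.randn(4, 3, 224, 224)
positive = torch.randn(4, 3, 224, 224)   # annotated as similar to anchor
negative = torch.randn(4, 3, 224, 224)   # annotated as dissimilar

loss = criterion(backbone(anchor), backbone(positive), backbone(negative))
loss.backward()
optimizer.step()
```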

A New Benchmark and OCR-Free Method for Document Image Topic Classification
Zhen Wang, Peide Zhu, Fuyang Yu & Manabu Okumura
Tokyo Institute of Technology, Tokyo, Japan; Delft University of Technology, Delft, Netherlands; Beihang University, Beijing, China

Paper available at: https://doi.org/10.1007/978-3-031-56435-2_2
Dataset available at: https://github.com/zhenwangrs/DocCT

The dataset, DocCT, is a content-aware document image classification dataset designed to handle complex document images that integrate text and illustrations across diverse topics. Unlike prior datasets focusing mainly on format, DocCT requires fine-grained content understanding for accurate classification. Alongside DocCT, the self-supervised model DocMAE is introduced, showing that document image semantics can be understood effectively without OCR. DocMAE surpasses previous vision models and some OCR-based models in understanding document content purely from pixel data, marking a significant advance in document image analysis.

The Rach3 Dataset: Towards Data-Driven Analysis of Piano Performance Rehearsal
Carlos Eduardo Cancino-Chacón & Ivan Pilkov
Institute of Computational Perception, Johannes Kepler University Linz, Linz, Austria

Paper available at: https://doi.org/10.1007/978-3-031-56435-2_3
Dataset available at: https://dataset.rach3project.com/

The dataset, named Rach3, captures the rehearsal processes of pianists as they learn new repertoire, providing a multimodal resource with video, audio, and MIDI data. Designed for AI and machine learning applications, Rach3 enables analysis of long-term practice sessions, focusing on how advanced students and professional musicians interpret and refine their performances. This dataset offers valuable insights into music learning and expression, addressing an understudied area in music performance research.
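
For a sense of how the MIDI stream can be consumed, the sketch below reads note events with the pretty_midi library; the file name is a placeholder, and the dataset’s actual file layout may differ.

```python
# Minimal sketch: list the first note events of a rehearsal MIDI recording.
import pretty_midi

midi = pretty_midi.PrettyMIDI("rehearsal_session.mid")  # stand-in path
for instrument in midi.instruments:
    for note in instrument.notes[:5]:
        print(f"pitch={note.pitch} velocity={note.velocity} "
              f"onset={note.start:.3f}s offset={note.end:.3f}s")
```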

WikiMuTe: A Web-Sourced Dataset of Semantic Descriptions for Music Audio
Benno Weck, Holger Kirchhoff, Peter Grosche & Xavier Serra
Huawei Technologies, Munich Research Center, Munich, Germany; Universitat Pompeu Fabra, Music Technology Group, Barcelona, Spain

Paper available at: https://doi.org/10.1007/978-3-031-56435-2_4
Dataset available at: https://github.com/Bomme/wikimute

The dataset, WikiMuTe, is an open, multi-modal resource designed for Music Information Retrieval (MIR), offering detailed semantic descriptions of music sourced from Wikipedia. It includes both long and short-form text on aspects like genre, style, mood, instrumentation, and tempo. Using a custom text-mining pipeline, WikiMuTe provides data to train models that jointly learn text and audio representations, achieving strong results in tasks such as tag-based music retrieval and auto-tagging. This dataset supports MIR advancements by providing accessible, rich semantic data for matching text and music.
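
A minimal sketch of tag-based retrieval in a joint text-audio embedding space is shown below; the embeddings are random stand-ins for the outputs of text and audio encoders trained on WikiMuTe-style pairs, not the authors’ models.

```python
# Minimal sketch: rank music tracks by cosine similarity to a text query
# in a shared embedding space.
import torch
import torch.nn.functional as F

def rank_tracks(query_emb: torch.Tensor, track_embs: torch.Tensor) -> torch.Tensor:
    """Return track indices sorted by cosine similarity to the text query."""
    sims = F.cosine_similarity(query_emb.unsqueeze(0), track_embs, dim=1)
    return sims.argsort(descending=True)

query_emb = torch.randn(512)         # e.g. embedding of "upbeat jazz with piano"
track_embs = torch.randn(1000, 512)  # embeddings of 1000 audio tracks
print(rank_tracks(query_emb, track_embs)[:10])  # top-10 track indices
```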

PDTW150K: A Dataset for Patent Drawing Retrieval
Chan-Ming Hsu, Tse-Hung Lin, Yu-Hsien Chen & Chih-Yi Chiu
Department of Computer Science and Information Engineering, National Chiayi University, Chiayi, Taiwan

Paper available at: https://doi.org/10.1007/978-3-031-56435-2_5
Dataset available at: https://github.com/ncyuMARSLab/PDTW150K

The dataset, PDTW150K, is a large-scale resource for patent drawing retrieval, featuring over 150,000 patents with text metadata and more than 850,000 patent drawings. It includes bounding box annotations of drawing views, supporting the construction of object detection models. PDTW150K enables diverse applications, such as image retrieval, cross-modal retrieval, and object detection. The dataset is publicly available, offering a valuable tool for advancing research in patent analysis and retrieval tasks.
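
A minimal sketch of how such view annotations might be visualized is shown below; the JSON field names are hypothetical, so the actual PDTW150K annotation schema should be consulted.

```python
# Minimal sketch: overlay drawing-view bounding boxes on a patent drawing.
import json
from PIL import Image, ImageDraw

image = Image.open("patent_drawing.png").convert("RGB")  # stand-in file
with open("annotations.json") as f:                      # stand-in file
    boxes = json.load(f)["views"]  # hypothetical schema:
                                   # [{"x1":..,"y1":..,"x2":..,"y2":..}, ...]

draw = ImageDraw.Draw(image)
for b in boxes:
    draw.rectangle([b["x1"], b["y1"], b["x2"], b["y2"]], outline="red", width=3)
image.save("patent_drawing_views.png")
```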

Interactive Question Answering for Multimodal Lifelog Retrieval
Ly-Duyen Tran, Liting Zhou, Binh Nguyen & Cathal Gurrin
Dublin City University, Dublin, Ireland; AISIA Research Lab, Ho Chi Minh City, Vietnam; University of Science, Vietnam National University, Ho Chi Minh City, Vietnam

Paper available at: https://doi.org/10.1007/978-3-031-56435-2_6
Dataset available at: Upon request from the authors

The dataset supports Question Answering (QA) tasks in lifelog retrieval, advancing the field toward open-domain QA capabilities. Integrated into a multimodal lifelog retrieval system, it allows users to ask lifelog-specific questions and receive suggested answers based on multimodal data. A test collection is provided to assess system effectiveness and user satisfaction, demonstrating enhanced performance over conventional lifelog systems, especially for novice users. This dataset paves the way for more intuitive and effective lifelog interaction.

Event Recognition in Laparoscopic Gynecology Videos with Hybrid Transformers
Sahar Nasirihaghighi, Negin Ghamsarian, Heinrich Husslein & Klaus Schoeffmann
Institute of Information Technology (ITEC), Klagenfurt University, Klagenfurt, Austria; Center for AI in Medicine, University of Bern, Bern, Switzerland; Department of Gynecology and Gynecological Oncology, Medical University Vienna, Vienna, Austria

Paper available at: https://doi.org/10.1007/978-3-031-56435-2_7
Dataset available at: https://ftp.itec.aau.at/datasets/LapGyn6-Events/

The dataset is tailored for event recognition in laparoscopic gynecology surgery videos, including annotations for critical intra-operative and post-operative events. Designed for applications in surgical training and complication prediction, it facilitates precise event recognition. The dataset supports a hybrid Transformer-based architecture that leverages inter-frame dependencies, improving accuracy amid challenges like occlusion and motion blur. Additionally, a custom frame sampling strategy addresses variations in surgical scenes and skill levels, achieving high temporal resolution. This methodology outperforms conventional CNN-RNN architectures, advancing laparoscopic video analysis.
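
As a baseline stand-in for the paper’s custom sampling strategy, the sketch below picks a fixed-length clip of uniformly spaced frames from a video of arbitrary length, the usual starting point for clip-level event recognition.

```python
# Minimal sketch: uniform temporal sampling for clip-level event recognition.
import numpy as np

def sample_frame_indices(num_frames: int, clip_len: int = 16) -> np.ndarray:
    """Pick `clip_len` frame indices spread evenly across `num_frames`."""
    return np.linspace(0, num_frames - 1, clip_len).round().astype(int)

print(sample_frame_indices(300))  # e.g. [  0  20  40 ... 299]
```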

GreenScreen: A Multimodal Dataset for Detecting Corporate Greenwashing in the Wild
Ujjwal Sharma, Stevan Rudinac, Joris Demmers, Willemijn van Dolen & Marcel Worring
University of Amsterdam, Amsterdam, The Netherlands

Paper available at: https://doi.org/10.1007/978-3-031-56435-2_8
Dataset available at: https://uva-hva.gitlab.host/u.sharma/greenscreen

The dataset focuses on detecting greenwashing in social media by combining large-scale text and image collections from Fortune-1000 company Twitter accounts with environmental risk scores on specific issues like emissions and resource usage. This dataset addresses the challenge of identifying subtle, abstract greenwashing signals requiring contextual interpretation. It includes a baseline method leveraging advanced content encoding to analyze connections between social media content and greenwashing tendencies. This resource enables the multimedia retrieval community to advance greenwashing detection, promoting transparency in corporate sustainability claims.

Overview of Open Dataset Sessions and Benchmarking Competitions in 2022 – Part 2 (MDRE at MMM 2022, ACM MM 2022)


In this Dataset Column, we present a review of some of the notable events related to open datasets and benchmarking competitions in the field of multimedia. This year’s selection highlights the wide range of topics and datasets currently of interest to the community. Some of the events covered in this review include special sessions on open datasets and competitions featuring multimedia data. While this list is not exhaustive and contains an overview of about 40 datasets, it is meant to showcase the diversity of subjects and datasets explored in the field. This year’s review follows similar efforts from the previous year (https://records.sigmm.org/2022/01/12/overview-of-open-dataset-sessions-and-benchmarking-competitions-in-2021/), highlighting the ongoing importance of open datasets and benchmarking competitions in advancing research and development in multimedia. The column is divided into three parts; in this one, we focus on MDRE at MMM 2022 and ACM MM 2022:

  • Multimedia Datasets for Repeatable Experimentation at the 28th International Conference on Multimedia Modeling (MDRE at MMM 2022 – https://mmm2022.org/ssp.html#mdre). We summarize the three datasets presented during MDRE, addressing topics such as user-centric video search competitions, a dataset (GPR1200) for evaluating the performance of deep neural networks for general-purpose image retrieval, and a dataset for evaluating the performance of Question Answering (QA) systems on lifelog data (LLQA).
  • Selected datasets at the 30th ACM Multimedia Conference (MM ’22 – https://2022.acmmm.org/). For a general report from ACM Multimedia 2022 please see (https://records.sigmm.org/2022/12/07/report-from-acm-multimedia-2022-by-nitish-nagesh/). We summarize nine datasets presented during the conference, targeting several topics like dataset for multimodal intent recognition (MintRec), audio-visual question answering dataset (AVQA), large-scale radar dataset (mmWave), multimodal sticker emotion recognition dataset (SER30K), video-sentence dataset for vision-language pre-training (ACTION), dataset of head and gaze behavior for 360-degree videos, saliency in augmented reality dataset (SARD), multi-modal dataset spotting the differences between pairs of similar images (DialDiff), and large-scale remote sensing images dataset (RSVG).

For the overview of datasets related to QoMEX 2022 and ODS at MMSys ’22 please check the first part (https://records.sigmm.org/?p=12292), while ImageCLEF 2022 and MediaEval 2022 are addressed in the third part (http://records.sigmm.org/?p=12362).

MDRE at MMM 2022

The Multimedia Datasets for Repeatable Experimentation (MDRE) special session was part of the 2022 International Conference on Multimedia Modeling (MMM 2022), held both online and onsite in Phu Quoc, Vietnam, June 6-10, 2022. The session was organized by Cathal Gurrin (Dublin City University, Ireland), Duc-Tien Dang-Nguyen (University of Bergen, Norway), Björn Þór Jónsson (IT University of Copenhagen, Denmark), Adam Jatowt (University of Innsbruck, Austria), Liting Zhou (Dublin City University, Ireland) and Graham Healy (Dublin City University, Ireland). Details regarding this session can be found at: https://mmm2022.org/ssp.html#mdre

The MDRE’22 special session at MMM’22 was the fourth MDRE session, and it represented an opportunity for interested researchers to submit their datasets to this track. The work submitted to MDRE is permanently available at http://mmdatasets.org, where all the current and past editions of MDRE are hosted. Authors are asked to provide a paper describing the dataset’s motivation, design, and usage, a brief summary of the experiments performed on it to date, and a discussion of how it can be useful to the community, along with the dataset itself.

A Task Category Space for User-Centric Comparative Multimedia Search Evaluations
Paper available at: https://doi.org/10.1007/978-3-030-98358-1_16
Lokoč, J., Bailer, W., Barthel, K.U., Gurrin, C., Heller, S., Jónsson, B., Peška, L., Rossetto, L., Schoeffmann, K., Vadicamo, L., Vrochidis, S., Wu, J.
Charles University, Prague, Czech Republic; JOANNEUM RESEARCH, Graz, Austria; HTW Berlin, Berlin, Germany; Dublin City University, Dublin, Ireland; University of Basel, Basel, Switzerland; IT University of Copenhagen, Copenhagen, Denmark; University of Zurich, Zurich, Switzerland; Klagenfurt University, Klagenfurt, Austria; ISTI CNR, Pisa, Italy; Centre for Research and Technology Hellas, Thessaloniki, Greece; City University of Hong Kong, Hong Kong.
Dataset available at: On request

The authors have analyzed the spectrum of possible task categories and propose a list of individual axes that define a large space of possible task categories. Using this concept of category space, new user-centric video search competitions can be designed to benchmark video search systems from different perspectives. They further analyze the three task categories considered at the Video Browser Showdown and discuss possible (but sometimes challenging) shifts within the task category space.

GPR1200: A Benchmark for General-Purpose Content-Based Image Retrieval
Paper available at: https://doi.org/10.1007/978-3-030-98358-1_17
Schall, K., Barthel, K.U., Hezel, N., Jung, K.
Visual Computing Group, HTW Berlin, University of Applied Sciences, Germany.
Dataset available at: http://visual-computing.com/project/GPR1200

In this study, the authors developed a new dataset called GPR1200 to evaluate the performance of deep neural networks for general-purpose content-based image retrieval (CBIR). They found that large-scale pretraining significantly improves retrieval performance and that further improvement can be achieved through fine-tuning. GPR1200 is presented as an easy-to-use, accessible, yet challenging benchmark dataset with a broad range of image categories.

LLQA – Lifelog Question Answering Dataset
Paper available at: https://doi.org/10.1007/978-3-030-98358-1_18
Tran, L.-D., Ho, T.C., Pham, L.A., Nguyen, B., Gurrin, C., Zhou, L.
Dublin City University, Dublin, Ireland; Vietnam National University, Ho Chi Minh University of Science, Ho Chi Minh City, Viet Nam; AISIA Research Lab, Ho Chi Minh City, Viet Nam.
Dataset available at: https://github.com/allie-tran/LLQA

This study presents the Lifelog Question Answering Dataset (LLQA), a new dataset for evaluating the performance of Question Answering (QA) systems on lifelog data. The dataset includes over 15,000 multiple-choice questions over an augmented 85-day lifelog collection and is intended to serve as a benchmark for future research in this area. The results of the study show that QA on lifelog data is a challenging task that requires further exploration.

ACM MM 2022

Numerous dataset-related papers were presented at the 30th ACM International Conference on Multimedia (MM ’22), organized in Lisbon, Portugal, October 10 – 14, 2022 (https://2022.acmmm.org/). The complete MM ’22: Proceedings of the 30th ACM International Conference on Multimedia is available in the ACM Digital Library (https://dl.acm.org/doi/proceedings/10.1145/3503161).

There was no specifically dedicated dataset session among the roughly 35 sessions at MM ’22. However, the importance of datasets is illustrated by the following statistics, quantifying how often the term “dataset” appears in the MM ’22 Proceedings: it appears in the titles of 9 papers (7 last year), the keywords of 35 papers (66 last year), and the abstracts of 438 papers (339 last year). As a small sample, nine selected papers focused primarily on new datasets with publicly available data are listed below. They cover various multimedia applications, e.g., understanding multimedia content, multimodal fusion and embeddings, media interpretation, vision and language, engaging users with multimedia, emotional and social signals, interactions and Quality of Experience, and multimedia search and recommendation.

MIntRec: A New Dataset for Multimodal Intent Recognition
Paper available at: https://doi.org/10.1145/3503161.3547906
Zhang, H., Xu, H., Wang, X., Zhou, Q., Zhao, S., Teng, J.
Tsinghua University, Beijing, China.
Dataset available at: https://github.com/thuiar/MIntRec

MIntRec is a dataset for multimodal intent recognition with 2,224 samples based on the data collected from the TV series Superstore, in text, video, and audio modalities, annotated with twenty intent categories and speaker bounding boxes. Baseline models are built by adapting multimodal fusion methods and show significant improvement over text-only modality. MIntRec is useful for studying relationships between modalities and improving intent recognition.

AVQA: A Dataset for Audio-Visual Question Answering on Videos
Paper available at: https://doi.org/10.1145/3503161.3548291
Yang, P., Wang, X., Duan, X., Chen, H., Hou, R., Jin, C., Zhu, W.
Tsinghua University, Shenzhen, China; Communication University of China, Beijing, China.
Dataset available at: https://mn.cs.tsinghua.edu.cn/avqa

Audio-visual question-answering dataset (AVQA) is introduced for videos in real-life scenarios. It includes 57,015 videos and 57,335 question-answer pairs that rely on clues from both audio and visual modalities. A Hierarchical Audio-Visual Fusing module is proposed to model correlations among audio, visual, and text modalities. AVQA can be used to test models with a deeper understanding of multimodal information on audio-visual question answering in real-life scenarios.

mmBody Benchmark: 3D Body Reconstruction Dataset and Analysis for Millimeter Wave Radar
Paper available at: https://doi.org/10.1145/3503161.3548262
Chen, A., Wang, X., Zhu, S., Li, Y., Chen, J., Ye, Q.
Zhejiang University, Hangzhou, China.
Dataset available at: On request

A large-scale mmWave radar dataset with synchronized and calibrated point clouds and RGB(D) images is presented, along with an automatic 3D body annotation system. State-of-the-art methods are trained and tested on the dataset, showing that mmWave radar achieves better 3D body reconstruction accuracy than an RGB camera but worse than a depth camera. The dataset and results provide insights into improving mmWave radar reconstruction and combining signals from different sensors.

SER30K: A Large-Scale Dataset for Sticker Emotion Recognition
Paper available at: https://doi.org/10.1145/3503161.3548407
Liu, S., Zhang, X., Yan, J.
Nankai University, Tianjin, China.
Dataset available at: https://github.com/nku-shengzheliu/SER30K

A new multimodal sticker emotion recognition dataset called SER30K with 1,887 sticker themes and 30,739 images is introduced for understanding emotions in stickers. A proposed method called LORA, using a vision transformer and local re-attention module, effectively extracts visual and language features for emotion recognition on SER30K and other datasets.

Auto-captions on GIF: A Large-scale Video-sentence Dataset for Vision-language Pre-training
Paper available at: https://doi.org/10.1145/3503161.3551581
Pan, Y., Li, Y., Luo, J., Xu, J., Yao, T., Mei, T.
JD Explore Academy, Beijing, China.
Dataset available at: http://www.auto-video-captions.top/2022/dataset

A new large-scale pre-training dataset, Auto-captions on GIF (ACTION), is presented for generic video understanding. It contains video-sentence pairs extracted and filtered from web pages and can be used for pre-training and downstream tasks such as video captioning and sentence localization. Comparisons with existing video-sentence datasets are made.

Where Are You Looking?: A Large-Scale Dataset of Head and Gaze Behavior for 360-Degree Videos and a Pilot Study
Paper available at: https://doi.org/10.1145/3503161.3548200
Jin, Y., Liu, J., Wang, F., Cui, S.
The Chinese University of Hong Kong, Shenzhen, Shenzhen, China.
Dataset available at: https://cuhksz-inml.github.io/head_gaze_dataset/

A dataset of users’ head and gaze behaviors in 360° videos is presented, offering rich dimensions, large scale, strong diversity, and high sampling frequency. A quantitative taxonomy for 360° videos is also proposed, containing three objective technical metrics. Results of a pilot study on users’ behaviors and a case of application in tile-based 360° video streaming show the usefulness of the dataset for improving the performance of existing works.
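
To illustrate how such head and gaze traces feed tile-based streaming, the sketch below maps a (yaw, pitch) viewing direction onto a tile under an equirectangular tiling; the 8×4 grid is an illustrative choice, not the paper’s configuration.

```python
# Minimal sketch: map a viewing direction (degrees) to an equirectangular tile.
def gaze_to_tile(yaw: float, pitch: float, cols: int = 8, rows: int = 4):
    """yaw in [-180, 180), pitch in [-90, 90]; returns (col, row)."""
    u = (yaw + 180.0) / 360.0   # horizontal position in [0, 1)
    v = (90.0 - pitch) / 180.0  # vertical position, 0 = zenith
    col = min(int(u * cols), cols - 1)
    row = min(int(v * rows), rows - 1)
    return col, row

print(gaze_to_tile(yaw=30.0, pitch=-10.0))  # -> (4, 2)
```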

Saliency in Augmented Reality
Paper available at: https://doi.org/10.1145/3503161.3547955
Duan, H., Shen, W., Min, X., Tu, D., Li, J., Zhai, G.
Shanghai Jiao Tong University, Shanghai, China; Alibaba Group, Hangzhou, China.
Dataset available at: https://github.com/DuanHuiyu/ARSaliency

A dataset, Saliency in AR Dataset (SARD), containing 450 background, 450 AR, and 1350 superimposed images with three mixing levels, is constructed to study the interaction between background scenes and AR contents, and the saliency prediction problem in AR. An eye-tracking experiment is conducted among 60 subjects to collect data.

Visual Dialog for Spotting the Differences between Pairs of Similar Images
Paper available at: https://doi.org/10.1145/3503161.3548170
Zheng, D., Meng, F., Si, Q., Fan, H., Xu, Z., Zhou, J., Feng, F., Wang, X.
Beijing University of Posts and Telecommunications, Beijing, China; WeChat AI, Tencent Inc, Beijing, China; Institute of Information Engineering, Chinese Academy of Sciences, Beijing, China; University of Trento, Trento, Italy.
Dataset available at: https://github.com/zd11024/Spot_Difference

A new visual dialog task called Dial-the-Diff is proposed, in which two interlocutors access two similar images and try to spot the difference between them through conversation in natural language. A large-scale multi-modal dataset called DialDiff, containing 87k Virtual Reality images and 78k dialogs, is built for the task. Benchmark models are also proposed and evaluated to bring new challenges to dialog strategy and object categorization.

Visual Grounding in Remote Sensing Images
Paper available at: https://doi.org/10.1145/3503161.3548316
Sun, Y., Feng, S., Li, X., Ye, Y., Kang, J., Huang, X.
Harbin Institute of Technology, Shenzhen, Shenzhen, China; Soochow University, Suzhou, China.
Dataset available at: https://sunyuxi.github.io/publication/GeoVG

A new problem of visual grounding in large-scale remote sensing images has been presented, in which the task is to locate particular objects in an image by a natural language expression. A new dataset, called RSVG, has been collected and a new method, GeoVG, has been designed to address the challenges of existing methods in dealing with remote sensing images.

Overview of Open Dataset Sessions and Benchmarking Competitions in 2022 – Part 3 (ImageCLEF 2022, MediaEval 2022)

In this Dataset Column, we present a review of some of the notable events related to open datasets and benchmarking competitions in the field of multimedia. This year’s selection highlights the wide range of topics and datasets currently of interest to the community. Some of the events covered in this review include special sessions on open datasets and competitions featuring multimedia data. While this list is not exhaustive and contains an overview of about 40 datasets, it is meant to showcase the diversity of subjects and datasets explored in the field. This year’s review follows similar efforts from the previous year (https://records.sigmm.org/2022/01/12/overview-of-open-dataset-sessions-and-benchmarking-competitions-in-2021/), highlighting the ongoing importance of open datasets and benchmarking competitions in advancing research and development in multimedia. The column is divided into three parts; in this one, we focus on ImageCLEF 2022 and MediaEval 2022:

  • ImageCLEF 2022 (https://www.imageclef.org/2022). We summarize the 5 datasets launched for the benchmarking tasks, related to several topics like social media profile assessment (ImageCLEFaware), segmentation and labeling of underwater coral images (ImageCLEFcoral), late fusion ensembling systems for multimedia data (ImageCLEFfusion) and medical imaging analysis (ImageCLEFmedical Caption, and ImageCLEFmedical Tuberculosis).
  • MediaEval 2022 (https://multimediaeval.github.io/editions/2022/). We summarize the 11 datasets launched for the benchmarking tasks, that target a wide range of multimedia topics like the analysis of flood related media (DisasterMM), game analytics (Emotional Mario), news item processing (FakeNews, NewsImages), multimodal understanding of smells (MUSTI), medical imaging (Medico), fishing vessel analysis (NjordVid), media memorability (Memorability), sports data analysis (Sport Task, SwimTrack), and urban pollution analysis (Urban Air).

For the overview of datasets related to QoMEX 2022 and ODS at MMSys ’22 please check the first part (https://records.sigmm.org/?p=12292), while MDRE at MMM 2022 and ACM MM 2022 are addressed in the second part (http://records.sigmm.org/?p=12360).

ImageCLEF 2022

ImageCLEF is a multimedia evaluation campaign, part of the CLEF initiative (http://www.clef-initiative.eu/). The 2022 edition (https://www.imageclef.org/2022) is the 19th edition of this initiative and addresses four main research tasks in several domains: medicine, nature, social media content, and user interface processing. ImageCLEF 2022 was organized by Bogdan Ionescu (University Politehnica of Bucharest, Romania), Henning Müller (University of Applied Sciences Western Switzerland, Sierre, Switzerland), Renaud Péteri (University of La Rochelle, France), Ivan Eggel (University of Applied Sciences Western Switzerland, Sierre, Switzerland) and Mihai Dogariu (University Politehnica of Bucharest, Romania).

ImageCLEFaware
Paper available at: https://ceur-ws.org/Vol-3180/paper-98.pdf
Popescu, A., Deshayes-Chossar, J., Schindler, H., Ionescu, B.
CEA LIST, France; University Politehnica of Bucharest, Romania.
Dataset available at: https://www.imageclef.org/2022/aware

This is the second edition of the aware task at ImageCLEF, and it seeks to understand in what ways public social media profiles affect users in certain important scenarios, namely searching or applying for a bank loan, an accommodation, a job as a waiter/waitress, and a job in IT.

ImageCLEFcoral
Paper available at: https://ceur-ws.org/Vol-3180/paper-97.pdf
Chamberlain, J., de Herrera, A.G.S., Campello, A., Clark, A.
University of Essex, United Kingdom; Wellcome Trust, United Kingdom.
Dataset available at: https://www.imageclef.org/2022/coral

This fourth edition of the coral task addresses the problem of segmenting and labeling a set of underwater images used in the monitoring of coral reefs. The task proposes two subtasks, namely an annotation and localization subtask and a pixel-wise parsing subtask.

ImageCLEFfusion
Paper available at: https://ceur-ws.org/Vol-3180/paper-99.pdf
Ştefan, L-D., Constantin, M.G., Dogariu, M., Ionescu, B.
University Politehnica of Bucharest, Romania.
Dataset available at: https://www.imageclef.org/2022/fusion

This represents the first edition of the fusion task, and it proposes several scenarios adapted for the use of late fusion or ensembling systems. The two scenarios correspond to a regression approach, using data associated with the prediction of media interestingness, and a retrieval scenario, using data associated with search result diversification.

ImageCLEFmedical Tuberculosis
Paper available at: https://ceur-ws.org/Vol-3180/paper-96.pdf
Kozlovski, S., Dicente Cid, Y., Kovalev, V., Müller, H.
United Institute of Informatics Problems, Belarus; Roche Diagnostics, Spain; University of Applied Sciences Western Switzerland, Switzerland; University of Geneva, Switzerland.
Dataset available at: https://www.imageclef.org/2022/medical/tuberculosis

This task is now at its sixth edition and has been upgraded to a detection problem. Two subtasks are included: the detection of lung cavern regions in lung CT images associated with lung tuberculosis, and the prediction of four binary features of caverns suggested by experienced radiologists.

ImageCLEFmedical Caption
Paper available at: https://ceur-ws.org/Vol-3180/paper-95.pdf
Rückert, J., Ben Abacha, A., de Herrera, A.G.S., Bloch, L., Brüngel, R., Idrissi-Yaghir, A., Schäfer, H., Müller, H., Friedrich, C.M.
University of Applied Sciences and Arts Dortmund, Germany; Microsoft, USA; University of Essex, UK; University Hospital Essen, Germany; University of Applied Sciences Western Switzerland, Switzerland; University of Geneva, Switzerland.
Dataset available at: https://www.imageclef.org/2022/medical/caption

The sixth edition of this task consists of two subtasks. In the first, participants must detect relevant medical concepts in a large corpus of medical images; in the second, coherent captions must be generated for medical images, targeting the interplay of the many visible concepts.

MediaEval 2022

The MediaEval Multimedia Evaluation benchmark (https://multimediaeval.github.io/) offers challenges in artificial intelligence for multimedia data. This is the 13th edition of MediaEval (https://multimediaeval.github.io/editions/2022/), and 11 tasks were proposed, targeting a large number of challenges through algorithms for retrieval, analysis, and exploration. This edition pursues a “Quest for Insight”: organizers are encouraged to propose interesting and insightful questions about the concepts to be explored, and participants are encouraged to push beyond merely improving evaluation scores and to also work toward a deeper understanding of the challenges.

DisasterMM: Multimedia Analysis of Disaster-Related Social Media Data
Preprint available at: https://2022.multimediaeval.com/paper5337.pdf
Andreadis, S., Bozas, A., Gialampoukidis, I., Mavropoulos, T., Moumtzidou, A., Vrochidis, S., Kompatsiaris, I., Fiorin, R., Lombardo, F., Norbiato, D., Ferri, M.
Information Technologies Institute – Centre of Research and Technology Hellas, Greece; Eastern Alps River Basin District, Italy.
Dataset available at: https://multimediaeval.github.io/editions/2022/tasks/disastermm/

The DisasterMM task proposes the analysis of social media data extracted from Twitter, targeting the analysis of natural or man-made disaster posts. For this year, the organizers focused on the analysis of flooding events and proposed two subtasks: relevance classification of posts and location extraction from texts.

Emotional Mario: A Game Analytics Challenge
Preprint or paper not published yet.
Lux, M., Alshaer, M., Riegler, M., Halvorsen, P., Thambawita, V., Hicks, S., Dang-Nguyen, D.-T.
Alpen-Adria-Universität Klagenfurt, Austria; SimulaMet, Norway; University of Bergen, Norway.
Dataset available at: https://multimediaeval.github.io/editions/2022/tasks/emotionalmario/

Emotional Mario focuses on the Super Mario Bros videogame, analyzing gamer data consisting of game input, demographics, biomedical data, and video of the players’ faces. Two subtasks are proposed: event detection, which seeks to identify gaming events of significant importance based on facial videos and biometric data, and gameplay summarization, which seeks to select the best moments of gameplay.

FakeNews Detection
Preprint available at: https://2022.multimediaeval.com/paper116.pdf
Pogorelov, K., Schroeder, D.T., Brenner, S., Maulana, A., Langguth, J.
Simula Research Laboratory, Norway; University of Bergen, Norway; Stuttgart Media University, Germany.
Dataset available at: https://multimediaeval.github.io/editions/2022/tasks/fakenews/

The FakeNews Detection task proposes several types of methods of analyzing fake news and the way they spread, using COVID-19 related conspiracy theories. The competition proposes three tasks: the first subtask targets conspiracy detection in text-based data, the second asks participants to analyze graphs of conspiracy posters, while the last one combines the first two, aiming at detection on both text and graph data.

MUSTI – Multimodal Understanding of Smells in Texts and Images
Preprint available at: https://2022.multimediaeval.com/paper9634.pdf
Hürriyetoğlu, A., Paccosi, T., Menini, S., Zinnen, M., Lisena, P., Akdemir, K., Troncy, R., van Erp, M.
KNAW Humanities Cluster DHLab, Netherlands; Fondazione Bruno Kessler, Italy; Friedrich-Alexander-Universität, Germany; EURECOM, France.
Dataset available at: https://multimediaeval.github.io/editions/2022/tasks/musti/

MUSTI is one of the few benchmarks that analyze the underrepresented modality of smell. The organizers seek to further the understanding of descriptions of smell in texts and images and propose two subtasks: the first aims at classifying smells based on language and image models, predicting whether texts or images evoke the same smell source or not; the second asks participants to identify the common smell sources.

Medical Multimedia Task: Transparent Tracking of Spermatozoa
Preprint available at: https://2022.multimediaeval.com/paper5501.pdf
Thambawita, V., Hicks, S., Storås, A.M, Andersen, J.M., Witczak, O., Haugen, T.B., Hammer, H., Nguyen, T., Halvorsen, P., Riegler, M.A.
SimulaMet, Norway; OsloMet, Norway; The Arctic University of Norway, Norway.
Dataset available at: https://multimediaeval.github.io/editions/2022/tasks/medico/

The Medico Medical Multimedia Task tackles the challenge of tracking sperm cells in video recordings, while analyzing the specific characteristics of these cells. Four subtasks are proposed: a sperm-cell real-time tracking task in videos, a prediction of cell motility task, a catch and highlight task seeking to identify sperm cell speed, and an explainability task.

NewsImages
Preprint available at: https://2022.multimediaeval.com/paper8446.pdf
Kille, B., Lommatzsch, A., Özgöbek, Ö., Elahi, M., Dang-Nguyen, D.-T.
Norwegian University of Science and Technology, Norway; Berlin Institute of Technology, Germany; University of Bergen, Norway; Kristiania University College, Norway.
Dataset available at: https://multimediaeval.github.io/editions/2022/tasks/newsimages/

The goal of the NewsImages task is to further the understanding of the relationship between textual and image content in news articles. Participants are tasked with re-linking and re-matching textual news articles with the corresponding images, based on data gathered from social media, news portals and RSS feeds.

NjordVid: Fishing Trawler Video Analytics Task
Preprint available at: https://2022.multimediaeval.com/paper5854.pdf
Nordmo, T.A.S., Ovesen, A.B., Johansen, H.D., Johansen, D., Riegler, M.A.
The Arctic University of Norway, Norway; SimulaMet, Norway.
Dataset available at: https://multimediaeval.github.io/editions/2022/tasks/njord/

The NjordVid task provides data from fishing vessel recordings, a contribution toward maintaining sustainable fishing practices. Two different subtasks are proposed: detecting events on the boat, such as movement of people or catching of fish, and protecting the privacy of on-board personnel.

Predicting Video Memorability
Preprint available at: https://2022.multimediaeval.com/paper2265.pdf
Sweeney, L., Constantin, M.G., Demarty, C.-H., Fosco, C., de Herrera, A.G.S., Halder, S., Healy, G., Ionescu, B., Matran-Fernandez, A., Smeaton, A.F., Sultana, M.
Dublin City University, Ireland; University Politehnica of Bucharest, Romania; InterDigital, France; Massachusetts Institute of Technology Cambridge, USA; University of Essex, UK.
Dataset available at: https://multimediaeval.github.io/editions/2022/tasks/memorability/

The Video Memorability task asks participants to predict how memorable a video sequence is, targeting short-term memorability. Three subtasks are proposed for this edition: a general video-based prediction task where participants are asked to predict the memorability score of a video, a generalization task where training and testing are performed on different sources of data, and an EEG-based task where annotator EEG scans are provided.

Sport Task: Fine Grained Action Detection and Classification of Table Tennis Strokes from Videos
Preprint available at: https://2022.multimediaeval.com/paper4766.pdf
Martin, P.-E., Calandre, J., Mansencal, B., Benois-Pineau, J., Péteri, R., Mascarilla, L., Morlier, J.
Max Planck Institute for Evolutionary Anthropology, Germany; La Rochelle University, France; Univ. Bordeaux, France.
Dataset available at: https://multimediaeval.github.io/editions/2022/tasks/sportsvideo/

The Sport Task aims at action detection and classification in videos recorded at table tennis events. Low inter-class variability makes this task harder than other traditional action classification benchmarks. Two subtasks are proposed: a classification task where participants are asked to label table tennis videos according to the strokes the players make, and a detection task where participants must detect whether a stroke was made.

SwimTrack: Swimmers and Stroke Rate Detection in Elite Race Videos
Preprint available at: https://2022.multimediaeval.com/paper6876.pdf
Jacquelin, N., Jaunet, T., Vuillemot, R., Duffner, S.
École Centrale de Lyon, France; INSA-Lyon, France.
Dataset available at: https://multimediaeval.github.io/editions/2022/tasks/swimtrack/

SwimTrack comprises five subtasks related to the analysis of competition-level swimming videos and provides multimodal video, image, and audio data. The five subtasks are as follows: a position detection task associating swimmers with swimming lane numbers, a stroke rate detection task, a camera registration task where participants must apply homography projection methods to create a top view of the pool, a character recognition task on scoreboards, and a sound detection task targeting buzzer sounds.

Urban Air: Urban Life and Air Pollution
Preprint available at: https://2022.multimediaeval.com/paper586.pdf
Dao, M.-S., Dang, T.-H., Nguyen-Tai, T.-L., Nguyen, T.-B., Dang-Nguyen, D.-T.
National Institute of Information and Communications Technology, Japan; Dalat University, Vietnam; LOC GOLD Technology MTV Ltd. Co, Vietnam; University of Science, Vietnam National University in HCM City, Vietnam; Bergen University, Norway.
Dataset available at: https://multimediaeval.github.io/editions/2022/tasks/urbanair/

The Urban Air task provides multimodal data that allows the analysis of air pollution and pollution patterns in urban environments. The organizers created two subtasks for this competition: a multimodal/crossmodal air quality index prediction task using station and/or CCTV data, and a periodic traffic pollution pattern discovery task.

Overview of Open Dataset Sessions and Benchmarking Competitions in 2022 – Part 1 (QoMEX 2022, ODS at MMSys ’22)

In this Dataset Column, we present a review of some of the notable events related to open datasets and benchmarking competitions in the field of multimedia. This year’s selection highlights the wide range of topics and datasets currently of interest to the community. Some of the events covered in this review include special sessions on open datasets and competitions featuring multimedia data. While this list is not exhaustive and contains an overview of about 40 datasets, it is meant to showcase the diversity of subjects and datasets explored in the field. This year’s review follows similar efforts from the previous year (https://records.sigmm.org/2022/01/12/overview-of-open-dataset-sessions-and-benchmarking-competitions-in-2021/), highlighting the ongoing importance of open datasets and benchmarking competitions in advancing research and development in multimedia. The column is divided into three parts; in this one, we focus on QoMEX 2022 and ODS at MMSys ’22:

  • 14th International Conference on Quality of Multimedia Experience (QoMEX 2022 – https://qomex2022.itec.aau.at/). We summarize three datasets included in this conference, addressing QoE studies on audiovisual 360° video, storytelling for quality perception, and energy consumption and QoE in video streaming.
  • Open Dataset and Software Track at the 13th ACM Multimedia Systems Conference (ODS at MMSys ’22 – https://mmsys2022.ie/). We summarize nine datasets presented at the ODS track, targeting several topics, including surveillance videos from a fishing vessel (Njord), multi-codec 8K UHD videos (8K MPEG-DASH dataset), light-field (LF) synthetic immersive large-volume plenoptic dataset (SILVR), a dataset of online news items and the related task of rematching (NewsImages), video sequences characterized by various complexity categories (VCD), a QoE dataset of realistic video clips for real networks, a dataset of 360° videos with subjective emotional ratings (PEM360), a free-viewpoint video dataset, and a cloud gaming dataset (CGD).

For the overview of datasets related to MDRE at MMM 2022 and ACM MM 2022 please check the second part (http://records.sigmm.org/?p=12360), while ImageCLEF 2022 and MediaEval 2022 are addressed in the third part (http://records.sigmm.org/?p=12362).

QoMEX 2022

Three dataset papers were presented at the International Conference on Quality of Multimedia Experience (QoMEX 2022), organized in Lippstadt, Germany, September 5 – 7, 2022 (https://qomex2022.itec.aau.at/). The complete QoMEX ’22 Proceedings are available in the IEEE Digital Library (https://ieeexplore.ieee.org/xpl/conhome/9900491/proceeding).

These datasets were presented within the Databases session, chaired by Professor Oliver Hohlfeld. The three papers present contributions focused on audiovisual 360-degree videos, storytelling for quality perception, and modelling of energy consumption and streaming video QoE.

Audiovisual Database with 360° Video and Higher-Order Ambisonics Audio for Perception, Cognition, Behavior and QoE Evaluation Research
Paper available at: https://ieeexplore.ieee.org/document/9900893
Robotham, T., Singla, A., Rummukainen, O., Raake, A. and Habets, E.
International Audio Laboratories Erlangen, a joint institution of the Friedrich-Alexander-Universität Erlangen-Nürnberg (FAU) and Fraunhofer Institute for Integrated Circuits (IIS), Germany; TU Ilmenau, Germany.
Dataset available at: https://qoevave.github.io/database/

This publicly available database provides audiovisual 360° content with high-order Ambisonics audio. It consists of twelve scenes capturing real-life nature and urban environments with a video resolution of 7680×3840 at 60 frames-per-second and with 4th-order Ambisonics audio. These 360° video sequences, with an average duration of 60 seconds, represent real-life settings for systematically evaluating various dimensions of uni-/multi-modal perception, cognition, behavior, and QoE. It provides high-quality reference material with a balanced focus on auditory and visual sensory information.

The Storytime Dataset: Simulated Videotelephony Clips for Quality Perception Research
Paper available at: https://ieeexplore.ieee.org/document/9900888
Spang, R. P., Voigt-Antons, J. N. and Möller, S.
Technische Universität Berlin, Berlin, Germany; Hamm-Lippstadt University of Applied Sciences, Lippstadt, Germany.
Dataset available at: https://osf.io/cyb8w/

This is a dataset of simulated videotelephony clips to act as stimuli in quality perception research. It consists of four different stories in the German language that are told through ten consecutive parts, each about 10 seconds long. Each of these parts is available in four different quality levels, ranging from perfect to stalling. All clips (FullHD, H.264 / AAC) are actual recordings from end-user video-conference software to ensure ecological validity and realism of quality degradation. Apart from a detailed description of the methodological approach, the authors contribute the entire stimuli dataset, containing 160 videos and all rating scores for each file.

Modelling of Energy Consumption and Streaming Video QoE using a Crowdsourcing Dataset
Paper available at: https://ieeexplore.ieee.org/document/9900886
Herglotz, C., Robitza, W., Kränzler, M., Kaup, A. and Raake, A.
Friedrich-Alexander-Universität, Erlangen, Germany; Audiovisual Technology Group, TU Ilmenau, Germany; AVEQ GmbH, Vienna, Austria.
Dataset available at: On request

This paper performs a first analysis of end-user power efficiency and Quality of Experience of a video streaming service. A crowdsourced dataset comprising 447,000 streaming events from YouTube is used to estimate both the power consumption and the perceived quality. The power consumption is modelled based on previous work, extended toward predicting the power usage of different devices and codecs. The user-perceived QoE is estimated using a standardized model.
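
The sketch below shows the general shape of such a power model as a linear function of stream parameters; the coefficients are invented for illustration and do not reproduce the paper’s fitted model.

```python
# Illustrative sketch only: a simple linear streaming-power model in the
# spirit of device power modelling for video streaming.
def streaming_power_watts(bitrate_mbps: float, fps: float, pixels: float,
                          p_base: float = 2.0) -> float:
    # hypothetical per-feature coefficients (W per unit), made up for demo
    return p_base + 0.05 * bitrate_mbps + 0.01 * fps + 4e-8 * pixels

# e.g. a 1080p, 30 fps stream at 5 Mbit/s
print(f"{streaming_power_watts(5.0, 30.0, 1920 * 1080):.2f} W")
```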

ODS at MMSys ’22

The traditional Open Dataset and Software Track (ODS) was a part of the 13th ACM Multimedia Systems Conference (MMSys ’22) organized in Athlone, Ireland, June 14 – 17, 2022 (https://mmsys2022.ie/). The complete MMSys ’22: Proceedings of the 13th ACM Multimedia Systems Conference are available in the ACM Digital Library (https://dl.acm.org/doi/proceedings/10.1145/3524273).

The Open Dataset and Software Chairs for MMSys ’22 were Roberto Azevedo (Disney Research, Switzerland), Saba Ahsan (Nokia Technologies, Finland), and Yao Liu (Rutgers University, USA). The ODS session, with 14 papers, opened with pitches on Wednesday, June 15, followed by a poster session. Nine of the fourteen contributions were dataset papers. A listing of the paper titles, dataset summaries, and associated DOIs is included below.

Njord: a fishing trawler dataset
Paper available at: https://doi.org/10.1145/3524273.3532886
Nordmo, T.-A.S., Ovesen, A.B., Juliussen, B.A., Hicks, S.A., Thambawita, V., Johansen, H.D., Halvorsen, P., Riegler, M.A., Johansen, D.
UiT the Arctic University of Norway, Norway; SimulaMet, Norway; Oslo Metropolitan University, Norway.
Dataset available at: https://doi.org/10.5281/zenodo.6284673

This paper presents Njord, a dataset of surveillance videos from a commercial fishing vessel. The dataset aims to demonstrate the potential for using data from fishing vessels to detect accidents and report fish catches automatically. The authors also provide a baseline analysis of the dataset and discuss possible research questions that it could help answer.

Multi-codec ultra high definition 8K MPEG-DASH dataset
Paper available at: https://doi.org/10.1145/3524273.3532889
Taraghi, B., Amirpour, H., Timmerer, C.
Christian Doppler Laboratory Athena, Institute of Information Technology (ITEC), Alpen-Adria-Universität Klagenfurt, Klagenfurt, Austria.
Dataset available at: http://ftp.itec.aau.at/datasets/mmsys22/

This paper presents a dataset of multimedia assets encoded with various video codecs, including AVC, HEVC, AV1, and VVC, and packaged using the MPEG-DASH format. The dataset includes resolutions up to 8K and has a maximum media duration of 322 seconds, with segment lengths of 4 and 8 seconds. It is intended to facilitate research and development of video encoding technology for streaming services.
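
As a hint of how the dataset can be consumed programmatically, the sketch below lists the representations advertised in a DASH MPD manifest. The manifest path is a placeholder, not an actual file name from the dataset:

```python
# Minimal sketch: list the representations advertised in a DASH MPD manifest.
# Replace MPD_URL with an actual .mpd file from the dataset mirror; the path
# below is a placeholder.
import urllib.request
import xml.etree.ElementTree as ET

MPD_URL = "http://ftp.itec.aau.at/datasets/mmsys22/PLACEHOLDER/manifest.mpd"
DASH = "{urn:mpeg:dash:schema:mpd:2011}"  # MPD XML namespace

with urllib.request.urlopen(MPD_URL) as resp:
    root = ET.fromstring(resp.read())

for rep in root.iter(DASH + "Representation"):
    print(rep.get("id"), rep.get("codecs"),
          f"{rep.get('width')}x{rep.get('height')}", rep.get("bandwidth"))
```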

SILVR: a synthetic immersive large-volume plenoptic dataset
Paper available at: https://doi.org/10.1145/3524273.3532890
Courteaux, M., Artois, J., De Pauw, S., Lambert, P., Van Wallendael, G.
Ghent University – Imec, Oost-Vlaanderen, Zwijnaarde, Belgium.
Dataset available at: https://idlabmedia.github.io/large-lightfields-dataset/

SILVR (synthetic immersive large-volume plenoptic dataset) is a light-field (LF) image dataset allowing six-degrees-of-freedom navigation in larger volumes while maintaining a full panoramic field of view. It includes three virtual scenes with 642–2226 views each, rendered with 180° fish-eye lenses and featuring color images and depth maps. The dataset also includes multiview rendering software and a lens-reprojection tool. SILVR can be used to evaluate LF coding and rendering techniques.

NewsImages: addressing the depiction gap with an online news dataset for text-image rematching
Paper available at: https://doi.org/10.1145/3524273.3532891
Lommatzsch, A., Kille, B., Özgöbek, O., Zhou, Y., Tešić, J., Bartolomeu, C., Semedo, D., Pivovarova, L., Liang, M., Larson, M.
DAI-Labor, TU-Berlin, Berlin, Germany; NTNU, Trondheim, Norway; Texas State University, San Marcos, TX, United States; Universidade Nova de Lisboa, Lisbon, Portugal.
Dataset available at: https://multimediaeval.github.io/editions/2021/tasks/newsimages/

NewsImages is a dataset of online news items accompanied by the related task of news-image rematching, which aims to study the “depiction gap” between the content of an image and the text that accompanies it. The dataset is useful for studying connections between image and text and for addressing challenges of the depiction gap, such as sparse data, diversity of content, and the importance of background knowledge.
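
A common baseline for such rematching, though not necessarily the approach taken in the paper, is to embed texts and images into a shared space and rank candidates by cosine similarity; a minimal sketch over precomputed embeddings:

```python
# Illustrative rematching baseline (not from the paper): rank candidate images
# for a news text by cosine similarity of precomputed embeddings.
import numpy as np

def rank_images(text_emb: np.ndarray, image_embs: np.ndarray) -> np.ndarray:
    """text_emb: (d,); image_embs: (n, d). Returns image indices, best first."""
    t = text_emb / np.linalg.norm(text_emb)
    im = image_embs / np.linalg.norm(image_embs, axis=1, keepdims=True)
    return np.argsort(-(im @ t))  # descending cosine similarity
```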

VCD: Video Complexity Dataset
Paper available at: https://doi.org/10.1145/3524273.3532892
Amirpour, H., Menon, V.V., Afzal, S., Ghanbari, M., Timmerer, C.
Christian Doppler Laboratory Athena, Institute of Information Technology (ITEC), Alpen-Adria-Universität Klagenfurt, Klagenfurt, Austria; School of Computer Science and Electronic Engineering, University of Essex, Colchester, United Kingdom.
Dataset available at: https://ftp.itec.aau.at/datasets/video-complexity/

The Video Complexity Dataset (VCD) is a collection of 500 Ultra High Definition (UHD) video sequences, characterized in terms of spatial and temporal complexity, rate-distortion complexity, and encoding complexity with the x264 (AVC/H.264) and x265 (HEVC/H.265) video encoders. It is suitable for video coding applications such as video streaming, two-pass encoding, per-title encoding, and scene-cut detection. The sequences are provided at 24 frames per second (fps) and stored online in a losslessly encoded 8-bit 4:2:0 format.
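
A standard way to quantify spatial and temporal complexity, although not necessarily the exact features used for VCD, is the SI/TI measure pair from ITU-T Rec. P.910; a minimal sketch, assuming the video has already been decoded into grayscale numpy frames:

```python
# Minimal SI/TI sketch (ITU-T Rec. P.910): SI is the maximum per-frame
# standard deviation of the Sobel-filtered luma; TI is the maximum standard
# deviation of consecutive frame differences. `frames` is assumed to be an
# iterable of 2-D float numpy arrays (grayscale frames).
import numpy as np
from scipy import ndimage

def si_ti(frames):
    si, ti, prev = [], [], None
    for f in frames:
        sobel = np.hypot(ndimage.sobel(f, axis=0), ndimage.sobel(f, axis=1))
        si.append(sobel.std())
        if prev is not None:
            ti.append((f - prev).std())
        prev = f
    return max(si), (max(ti) if ti else 0.0)
```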

Realistic video sequences for subjective QoE analysis
Paper available at: https://doi.org/10.1145/3524273.3532894
Hodzic, K., Cosovic, M., Mrdovic, S., Quinlan, J.J., Raca, D.
Faculty of Electrical Engineering, University of Sarajevo, Bosnia and Herzegovina; School of Computer Science & Information Technology, University College Cork, Ireland.
Dataset available at: https://shorturl.at/dtISV

The DashReStreamer framework is designed to recreate adaptively streamed video in real networks in order to evaluate user Quality of Experience (QoE). The authors have also created a dataset of 234 realistic video clips based on logs collected from real mobile and wireless networks, including the video logs and network bandwidth profiles. The dataset and framework will help researchers understand the impact of video QoE dynamics on multimedia streaming.
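
The framework itself handles the re-streaming, but the included bandwidth profiles can also be replayed directly on a Linux testbed, for example by reshaping an interface with tc. A rough sketch, where the interface name and profile values are placeholders and root privileges are required:

```python
# Rough sketch (not part of DashReStreamer): replay a (time, kbit/s) bandwidth
# profile on a Linux interface using tc's token bucket filter. Requires root;
# clean up afterwards with: tc qdisc del dev <iface> root
import subprocess
import time

IFACE = "eth0"                                # placeholder interface name
profile = [(0, 4000), (10, 1500), (20, 800)]  # placeholder (seconds, kbit/s)

def set_rate(verb: str, kbps: int) -> None:
    subprocess.run(["tc", "qdisc", verb, "dev", IFACE, "root", "tbf",
                    "rate", f"{kbps}kbit", "burst", "32kbit",
                    "latency", "400ms"], check=True)

set_rate("add", profile[0][1])
start = time.monotonic()
for t, kbps in profile[1:]:
    time.sleep(max(0.0, t - (time.monotonic() - start)))
    set_rate("change", kbps)
```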

PEM360: a dataset of 360° videos with continuous physiological measurements, subjective emotional ratings and motion traces
Paper available at: https://doi.org/10.1145/3524273.3532895
Guimard, Q., Robert, F., Bauce, C., Ducreux, A., Sassatelli, L., Wu, H.-Y., Winckler, M., Gros, A.
Université Côte d’Azur, Inria, CNRS, I3S, Sophia-Antipolis, France.
Dataset available at: https://gitlab.com/PEM360/PEM360/

PEM360 is a dataset of user head movements and gaze recordings in 360° videos, along with self-reported emotional ratings and continuous physiological measurements. It aims at understanding the connection between user attention, emotions, and immersive content, and includes software tools for joint, instantaneous visualization of user attention and emotion, called “emotional maps”. The entire dataset and code are available in a reproducible framework.
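
As an illustration of the “emotional map” idea, though not the authors' released tooling, continuous emotion ratings can be accumulated over head-orientation samples on an equirectangular grid:

```python
# Illustrative sketch: mean self-reported valence per viewing direction,
# binned on an equirectangular yaw/pitch grid. The arrays would come from
# the PEM360 traces; random data is used here only as a stand-in.
import numpy as np

n = 10_000
yaw = np.random.uniform(-180, 180, n)    # stand-in for head yaw (degrees)
pitch = np.random.uniform(-90, 90, n)    # stand-in for head pitch (degrees)
valence = np.random.uniform(-1, 1, n)    # stand-in for continuous ratings

bins = (72, 36)  # 5-degree cells
sums, _, _ = np.histogram2d(yaw, pitch, bins=bins, weights=valence)
counts, _, _ = np.histogram2d(yaw, pitch, bins=bins)
emotional_map = np.divide(sums, counts, out=np.zeros_like(sums),
                          where=counts > 0)  # mean valence per cell
```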

A New Free Viewpoint Video Dataset and DIBR Benchmark
Paper available at: https://doi.org/10.1145/3524273.3532897
Guo, S., Zhou, K., Hu, J., Wang, J., Xu, J., Song, L.
Institute of Image Communication and Network Engineering, Shanghai Jiao Tong University, Shanghai, China.
Dataset available at: https://github.com/sjtu-medialab/Free-Viewpoint-RGB-D-Video-Dataset

A new dynamic RGB-D video dataset for free-viewpoint video (FVV) research is presented, including 13 groups of dynamic scenes and one group of static scenes, each with 12 HD video sequences and 12 corresponding depth video sequences. In addition, an FVV synthesis benchmark based on depth image-based rendering (DIBR) is introduced to aid the validation of data-driven methods. The dataset and benchmark aim to advance FVV synthesis with improved robustness and performance.
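
The core step that a DIBR benchmark exercises is forward-warping a reference RGB-D view into a virtual viewpoint. A minimal sketch, with z-buffering and hole filling omitted for brevity, and all names and shapes assumed rather than taken from the benchmark:

```python
# Minimal DIBR forward-warping sketch: project a reference RGB-D view into a
# virtual camera given intrinsics K and relative pose (R, t). Occlusion
# handling (z-buffering) and hole filling are omitted for brevity.
import numpy as np

def warp_to_virtual_view(color, depth, K, R, t):
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    pix = np.stack([u, v, np.ones_like(u)], axis=-1).reshape(-1, 3).T  # 3 x N
    pts = np.linalg.inv(K) @ pix * depth.reshape(1, -1)   # back-project to 3D
    proj = K @ (R @ pts + t.reshape(3, 1))                # into virtual camera
    uv = np.rint(proj[:2] / proj[2]).astype(int)          # pixel coordinates
    ok = (proj[2] > 0) & (uv[0] >= 0) & (uv[0] < w) & (uv[1] >= 0) & (uv[1] < h)
    out = np.zeros_like(color)
    out[uv[1, ok], uv[0, ok]] = color.reshape(-1, color.shape[-1])[ok]
    return out
```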

CGD: a cloud gaming dataset with gameplay video and network recordings
Paper available at: https://doi.org/10.1145/3524273.3532898
Slivar, I., Bacic, K., Orsolic, I., Skorin-Kapov, L., Suznjevic, M.
University of Zagreb, Faculty of Electrical Engineering and Computing, Zagreb, Croatia.
Dataset available at: https://muexlab.fer.hr/muexlab/research/datasets

The cloud gaming dataset (CGD) contains 600 game-streaming sessions from 10 games of different genres, recorded with various encoding parameters (bitrate, resolution, and frame rate) to evaluate the impact of these parameters on Quality of Experience (QoE). The dataset includes gameplay video recordings, network traffic traces, user input logs, and streaming performance logs, and can be used to study the relationships between network- and application-layer data, supporting research on cloud gaming QoE and QoE-aware network management mechanisms.

Dataset Column: Overview, Scope and Call for Contributions

Overview and Scope

The Dataset Column (https://records.sigmm.org/open-science/datasets/) of ACM SIGMM Records provides timely updates on the developments in the domain of publicly available multimedia datasets as enabling tools for reproducible research in numerous related areas. It is intended as a platform for further dissemination of useful information on multimedia datasets and studies of datasets covering various domains, published in peer-reviewed journals, conference proceedings, dissertations, or as results of applied research in industry.

The aim of the Dataset Column is therefore not to substitute already established platforms for disseminating multimedia datasets, e.g., Qualinet Databases (https://qualinet.github.io/databases/) [2] or the Multimedia Evaluation Benchmark (https://multimediaeval.github.io/), but to promote such platforms, along with particularly interesting datasets and the benchmarking challenges associated with them. Registration for the Multimedia Evaluation Benchmark, MediaEval 2021, is now open (https://multimediaeval.github.io). This year’s MediaEval features a wide variety of tasks and datasets tackling a large number of domains, including video privacy, social media data analysis and understanding, news item analysis, medicine and wellbeing, affective and subjective content analysis, and media associated with games and sports.

The Column will also continue reporting on contributions presented within dataset tracks at relevant conferences, e.g., ACM Multimedia (MM), ACM Multimedia Systems (MMSys), the International Conference on Quality of Multimedia Experience (QoMEX), and the International Conference on Multimedia Modeling (MMM).

Dataset Column in the SIGMM Records

Previously published Dataset Columns are listed below in chronological order.

Call for Contributions

Those who have created, and possibly already published elsewhere, a dataset, benchmarking initiative, or study of datasets relevant to the multimedia community are very welcome to submit their contribution to the ACM SIGMM Records Dataset Column. Examples include the datasets accepted to the open dataset and software track of the ACM MMSys 2021 conference or the datasets presented at the QoMEX 2021 conference. Please contact one of the editors responsible for the respective area, Mihai Gabriel Constantin (mihai.constantin84@upb.ro), Karel Fliegel (fliegek@fel.cvut.cz), or Maria Torres Vega (maria.torresvega@ugent.be), to report your contribution.

Column Editors

Since September 2021, the Dataset Column has been edited by Mihai Gabriel Constantin, Karel Fliegel, and Maria Torres Vega. The current editors appreciate the work of the previous team, Martha Larson, Bart Thomee, and all other contributors, and will continue and further develop this dissemination platform.

The general scope of the Dataset Column is reviewed above, with the more specific areas of the editors listed below:

  • Mihai Gabriel Constantin will be responsible for datasets related to multimedia analysis, understanding, retrieval, and exploration;
  • Karel Fliegel for datasets with subjective annotations related to Quality of Experience (QoE) [1] research;
  • Maria Torres Vega for datasets related to immersive multimedia systems, networked QoE, and cognitive network management.

Mihai Gabriel Constantin is a researcher at the AI Multimedia Lab, University Politehnica of Bucharest, Romania, and received his PhD from the Faculty of Electronics, Telecommunications, and Information Technology at the same university, on the topic “Automatic Analysis of the Visual Impact of Multimedia Data”. He has authored over 25 scientific papers in international conferences and high-impact journals, with an emphasis on predicting the subjective impact of multimedia items on human viewers and on deep ensembles. He has participated as a researcher in more than 10 research projects and serves on program committees and as a reviewer for several workshops, conferences, and journals. He is also an active member of the multimedia processing community: he is part of the MediaEval benchmarking initiative organization team, has led or co-organized several MediaEval tasks, including Predicting Media Memorability [3] and Recommending Movies Using Content [4], and has published several papers that analyze the data, annotations, participant features, methods, and observed best practices for MediaEval tasks and datasets [5]. More details can be found on his webpage: https://gconstantin.aimultimedialab.ro/.

Karel Fliegel received his M.Sc. (Ing.) in 2004 (electrical engineering and audiovisual technology) and his Ph.D. in 2011 (research on modeling the visual perception of image impairment features), both from the Czech Technical University in Prague, Faculty of Electrical Engineering (CTU FEE), Czech Republic. He is an assistant professor in the Multimedia Technology Group of CTU FEE. His research interests include multimedia technology, image processing, image and video compression, subjective and objective image quality assessment, Quality of Experience, HVS modeling, and imaging photonics. He has been a member of research teams within various projects, especially in the area of visual information processing. He has participated in the COST ICT Actions IC1003 Qualinet and IC1105 3D-ConTourNet, where he was responsible for the development of the Qualinet Databases [2] (https://qualinet.github.io/databases/), relevant especially to QoE research.

Maria Torres Vega is an FWO (Research Foundation Flanders) Senior Postdoctoral fellow in the multimedia delivery cluster of the IDLab group at Ghent University (UGent), currently working on the perception of immersive multimedia applications. She received her M.Sc. degree in Telecommunication Engineering from the Polytechnic University of Madrid, Spain, in 2009. Between 2009 and 2013 she worked as a software and test engineer in Germany, with a focus on embedded systems and signal processing. In October 2013 she returned to academia and started her PhD at the Eindhoven University of Technology (Eindhoven, The Netherlands), where she researched the impact of beam-steered optical wireless networks on users’ perception of services, earning her PhD in Electrical Engineering in September 2017. Since October 2013 she has authored more than 40 publications, including three best paper awards, and she serves as a reviewer for numerous journals and conferences. In 2020 she served as general chair of the 4th Quality of Experience Management workshop, as tutorial chair of the 2020 Network Softwarization conference (NetSoft), and as demo chair of the Quality of Multimedia Experience conference (QoMEX 2020). In 2021 she served as Technical Program Committee (TPC) chair of the 2021 Quality of Multimedia Experience conference (QoMEX 2021).

References