 
						
As already started in the previous Datasets column, we are reviewing some of the most notable events related to open datasets and benchmarking competitions in the field of multimedia in the years 2023 and 2024. This selection highlights the wide range of topics and datasets currently of interest to the community. Some of the events covered in this review include special sessions on open datasets and competitions featuring multimedia data. This year’s review follows similar efforts from the previous year (https://records.sigmm.org/records-issues/acm-sigmm-records-issue-1-2023/), highlighting the ongoing importance of open datasets and benchmarking competitions in advancing research and development in multimedia. This second part of the column focuses on the last two editions of MDRE at MMM 2023 and MMM 2024:
- Multimedia Datasets for Repeatable Experimentation at 29th International Conference on Multimedia Modeling (MDRE at MMM 2023). We summarize the seven datasets presented during the MDRE in 2023, namely NCKU-VTF (thermal-to-visible face recognition benchmark), Link-Rot (web dataset decay and reproducibility study), People@Places and ToDY (scene classification for media production), ScopeSense (lifelogging dataset for health analysis), OceanFish (high-resolution fish species recognition), GIGO (urban garbage classification and demographics), and Marine Video Kit (underwater video retrieval and analysis).
- Multimedia Datasets for Repeatable Experimentation at 30th International Conference on Multimedia Modeling (MDRE at MMM 2024 – https://mmm2024.org/). We summarize the eight datasets presented during the MDRE in 2024, namely RESET (video similarity annotations for embeddings), DocCT (content-aware document image classification), Rach3 (multimodal data for piano rehearsal analysis), WikiMuTe (semantic music descriptions from Wikipedia), PDTW150K (large-scale patent drawing retrieval dataset), Lifelog QA (question answering for lifelog retrieval), Laparoscopic Events (event recognition in surgery videos), and GreenScreen (social media dataset for greenwashing detection).
For the overview of datasets related to QoMEX 2023 and QoMEX 2024, please check the first part (https://records.sigmm.org/2024/09/07/overview-of-open-dataset-sessions-and-benchmarking-competitions-in-2023-2024-part-1-qomex-2023-and-qomex-2024/).
MDRE at MMM 2023
The Multimedia Datasets for Repeatable Experimentation (MDRE) special session is part of the 2023 International Conference on Multimedia Modeling (MMM 2023), Bergen, Norway, January 9-12, 2023. The MDRE’23 special session at MMM’23, is the fifth MDRE session. The session was organized by Cathal Gurrin (Dublin City University, Ireland), Duc-Tien Dang-Nguyen (University of Bergen, Norway), Adam Jatowt (University of Innsbruck, Austria), Liting Zhou (Dublin City University, Ireland) and Graham Healy (Dublin City University, Ireland).
The NCKU-VTF Dataset and a Multi-scale Thermal-to-Visible Face Synthesis System
Tsung-Han Ho, Chen-Yin Yu, Tsai-Yen Ko & Wei-Ta Chu
National Cheng Kung University, Tainan, Taiwan
Paper available at: https://doi.org/10.1007/978-3-031-27077-2_36
Dataset available at: http://mmcv.csie.ncku.edu.tw/~wtchu/projects/NCKU-VTF/index.html
The dataset, named VTF, comprises paired thermal-visible face images of primarily Asian subjects under diverse visual conditions, introducing challenges for thermal face recognition models. It serves as a benchmark for evaluating model robustness while also revealing racial bias issues in current systems. By addressing both technical and fairness aspects, VTF promotes advancements in developing more accurate and inclusive thermal-to-visible face recognition methods.
Link-Rot in Web-Sourced Multimedia Datasets
Viktor Lakic, Luca Rossetto & Abraham Bernstein
Department of Informatics, University of Zurich, Zurich, Switzerland
Paper available at: https://doi.org/10.1007/978-3-031-27077-2_37
Dataset available at: Combination of 24 different Web-sourced datasets described in the paper
The dataset examines 24 Web-sourced datasets comprising over 270 million URLs and reveals that more than 20% of the content has become unavailable due to link-rot. This decay poses significant challenges to the reproducibility of research relying on such datasets. Addressing this issue, the dataset highlights the need for strategies to mitigate content loss and maintain data integrity for future studies.
People@Places and ToDY: Two Datasets for Scene Classification in Media Production and Archiving
Werner Bailer & Hannes Fassold
Joanneum Research, Graz, Austria
Paper available at: https://doi.org/10.1007/978-3-031-27077-2_38
Dataset available at: https://github.com/wbailer/PeopleAtPlaces
The dataset supports annotation tasks in visual media production and archiving, focusing on scene bustle (from populated to unpopulated), cinematographic shot types, time of day, and season. The People@Places dataset augments Places365 with bustle and shot-type annotations, while the ToDY (time of day/year) dataset enhances SkyFinder. Both datasets come with a toolchain for automatic annotations, manually verified for accuracy. Baseline results using the EfficientNet-B3 model, pretrained on Places365, are provided for benchmarking.
ScopeSense: An 8.5-Month Sport, Nutrition, and Lifestyle Lifelogging Dataset
Michael A. Riegler, Vajira Thambawita, Ayan Chatterjee, Thu Nguyen, Steven A. Hicks, Vibeke Telle-Hansen, Svein Arne Pettersen, Dag Johansen, Ramesh Jain & Pål Halvorsen
SimulaMet, Oslo, Norway; Oslo Metropolitan University, Oslo, Norway; UIT The Artic University of Norway, Tromsø, Norway; University of California Irvine, CA, USA
Paper available at: https://doi.org/10.1007/978-3-031-27077-2_39
Dataset available at: https://datasets.simula.no/scopesense
The dataset, ScopeSense, offers comprehensive sport, nutrition, and lifestyle logs collected over eight and a half months from two individuals. It includes extensive sensor data alongside nutrition, training, and well-being information, structured to facilitate detailed, data-driven research on healthy lifestyles. This dataset aims to support modeling for personalized guidance, addressing challenges in unstructured data and enhancing the precision of lifestyle recommendations. ScopeSense is fully accessible to researchers, serving as a foundation for methods to expand this data-driven approach to larger populations.
Fast Accurate Fish Recognition with Deep Learning Based on a Domain-Specific Large-Scale Fish Dataset
Yuan Lin, Zhaoqi Chu, Jari Korhonen, Jiayi Xu, Xiangrong Liu, Juan Liu, Min Liu, Lvping Fang, Weidi Yang, Debasish Ghose & Junyong You
School of Economics, Innovation, and Technology, Kristiania University College, Oslo, Norway; School of Aerospace Engineering, Xiamen University, Xiamen, China; School of Natural and Computing Sciences, University of Aberdeen, Aberdeen, UK; School of Information Science and Technology, Xiamen University, Xiamen, China; School of Ocean and Earth, Xiamen University, Xiamen, China; Norwegian Research Centre (NORCE), Bergen, Norway
Paper available at: https://doi.org/10.1007/978-3-031-27077-2_40
Dataset available at: Upon request from the authors
The dataset, OceanFish, addresses key challenges in fish species recognition by providing high-resolution images of marine species from the East China Sea, covering 63,622 images across 136 fine-grained fish species. This large-scale, diverse dataset overcomes limitations found in prior fish datasets, such as low resolution and limited annotations. OceanFish includes a fish recognition testbed with deep learning models, achieving high precision and speed in species detection. This dataset can be expanded with additional species and annotations, offering a valuable benchmark for advancing marine biodiversity research and automated fish recognition.
GIGO, Garbage In, Garbage Out: An Urban Garbage Classification Dataset
Maarten Sukel, Stevan Rudinac & Marcel Worring
University of Amsterdam, Amsterdam, The Netherlands
Paper available at: https://doi.org/10.1007/978-3-031-27077-2_41
Dataset available at: https://doi.org/10.21942/uva.20750044
The dataset, GIGO: Garbage in, Garbage out, offers 25,000 images for multimodal urban waste classification, captured across a large area of Amsterdam. It supports sustainable urban waste collection by providing fine-grained classifications of diverse garbage types, differing in size, origin, and material. Unique to GIGO are additional geographic and demographic data, enabling multimodal analysis that incorporates neighborhood and building statistics. The dataset includes state-of-the-art baselines, serving as a benchmark for algorithm development in urban waste management and multimodal classification.
Marine Video Kit: A New Marine Video Dataset for Content-Based Analysis and Retrieval
Quang-Trung Truong, Tuan-Anh Vu, Tan-Sang Ha, Jakub Lokoč, Yue-Him Wong, Ajay Joneja & Sai-Kit Yeung
Hong Kong University of Science and Technology, Clear Water Bay, Hong Kong; FMP, Charles
University, Prague, Czech Republic; Shenzhen University, Shenzhen, China
Paper available at: https://doi.org/10.1007/978-3-031-27077-2_42
Dataset available at: https://hkust-vgd.github.io/marinevideokit
The dataset, Marine Video Kit, focuses on single-shot underwater videos captured by moving cameras, providing a challenging benchmark for video retrieval and computer vision tasks. Designed to address the limitations of general-purpose models in domain-specific contexts, the dataset includes meta-data, low-level feature analysis, and semantic annotations of keyframes. Used in the Video Browser Showdown 2023, Marine Video Kit highlights challenges in underwater video analysis and is publicly accessible, supporting advancements in model robustness for specialized video retrieval applications.
MDRE at MMM 2024
The Multimedia Datasets for Repeatable Experimentation (MDRE) special session is part of the 2024 International Conference on Multimedia Modeling (MMM 2024), Amsterdam, The Netherlands, January 29 – February 2, 2024. The MDRE’24 special session at MMM’24, is the sixth MDRE session. The session was organized by Klaus Schöffmann (Klagenfurt University, Austria), Björn Þór Jónsson (Reykjavik University, Iceland), Cathal Gurrin (Dublin City University, Ireland), Duc-Tien Dang-Nguyen (University of Bergen, Norway), and Liting Zhou (Dublin City University, Ireland). Details regarding this session can be found at: https://mmm2024.org/specialpaper.html#s1.
RESET: Relational Similarity Extension for V3C1 Video Dataset
Patrik Veselý & Ladislav Peška
Faculty of Mathematics and Physics, Charles University, Prague, Czechia
Paper available at: https://doi.org/10.1007/978-3-031-56435-2_1
Dataset available at: https://osf.io/ruh5k
The dataset, RESET: RElational Similarity Evaluation dataseT, offers over 17,000 similarity annotations for video keyframe triples drawn from the V3C1 video collection. RESET includes both close and distant similarity triplets in general and specific sub-domains (wedding and diving), with multiple user re-annotations and similarity scores from 30 pre-trained models. This dataset supports the evaluation and fine-tuning of visual embedding models, aligning them more closely with human-perceived similarity, and enhances content-based information retrieval for more accurate, user-aligned results.
A New Benchmark and OCR-Free Method for Document Image Topic Classification
Zhen Wang, Peide Zhu, Fuyang Yu & Manabu Okumura
Tokyo Institute of Technology, Tokyo, Japan; Delft University of Technology, Delft, Netherlands; Beihang University, Beijing, China
Paper available at: https://doi.org/10.1007/978-3-031-56435-2_2
Dataset available at: https://github.com/zhenwangrs/DocCT
The dataset, DocCT, is a content-aware document image classification dataset designed to handle complex document images that integrate text and illustrations across diverse topics. Unlike prior datasets focusing mainly on format, DocCT requires fine-grained content understanding for accurate classification. Alongside DocCT, the self-supervised model DocMAE is introduced, showing that document image semantics can be understood effectively without OCR. DocMAE surpasses previous vision models and some OCR-based models in understanding document content purely from pixel data, marking a significant advance in document image analysis.
The Rach3 Dataset: Towards Data-Driven Analysis of Piano Performance Rehearsal
Carlos Eduardo Cancino-Chacón & Ivan Pilkov
Institute of Computational Perception, Johannes Kepler University Linz, Linz, Austria
Paper available at: https://doi.org/10.1007/978-3-031-56435-2_3
Dataset available at: https://dataset.rach3project.com/
The dataset, named Rach3, captures the rehearsal processes of pianists as they learn new repertoire, providing a multimodal resource with video, audio, and MIDI data. Designed for AI and machine learning applications, Rach3 enables analysis of long-term practice sessions, focusing on how advanced students and professional musicians interpret and refine their performances. This dataset offers valuable insights into music learning and expression, addressing an understudied area in music performance research.
WikiMuTe: A Web-Sourced Dataset of Semantic Descriptions for Music Audio
Benno Weck, Holger Kirchhoff, Peter Grosche & Xavier Serra
Huawei Technologies, Munich Research Center, Munich, Germany; Universitat Pompeu Fabra, Music Technology Group, Barcelona, Spain
Paper available at: https://doi.org/10.1007/978-3-031-56435-2_4
Dataset available at: https://github.com/Bomme/wikimute
The dataset, WikiMuTe, is an open, multi-modal resource designed for Music Information Retrieval (MIR), offering detailed semantic descriptions of music sourced from Wikipedia. It includes both long and short-form text on aspects like genre, style, mood, instrumentation, and tempo. Using a custom text-mining pipeline, WikiMuTe provides data to train models that jointly learn text and audio representations, achieving strong results in tasks such as tag-based music retrieval and auto-tagging. This dataset supports MIR advancements by providing accessible, rich semantic data for matching text and music.
PDTW150K: A Dataset for Patent Drawing Retrieval
Chan-Ming Hsu, Tse-Hung Lin, Yu-Hsien Chen & Chih-Yi Chiu
Department of Computer Science and Information Engineering, National Chiayi University, Chiayi, Taiwan
Paper available at: https://doi.org/10.1007/978-3-031-56435-2_5
Dataset available at: https://github.com/ncyuMARSLab/PDTW150K
The dataset, PDTW150K, is a large-scale resource for patent drawing retrieval, featuring over 150,000 patents with text metadata and more than 850,000 patent drawings. It includes bounding box annotations for drawing views and supporting object detection model construction. PDTW150K enables diverse applications, such as image retrieval, cross-modal retrieval, and object detection. This dataset is publicly available, offering a valuable tool for advancing research in patent analysis and retrieval tasks.
Interactive Question Answering for Multimodal Lifelog Retrieval
Ly-Duyen Tran, Liting Zhou, Binh Nguyen & Cathal Gurrin
Dublin City University, Dublin, Ireland; AISIA Research Lab, Ho Chi Minh, Vietnam; Ho Chi Minh University of Science, Vietnam National University, Hanoi, Vietnam
Paper available at: https://doi.org/10.1007/978-3-031-56435-2_6
Dataset available at: Upon request from the authors
The dataset supports Question Answering (QA) tasks in lifelog retrieval, advancing the field toward open-domain QA capabilities. Integrated into a multimodal lifelog retrieval system, it allows users to ask lifelog-specific questions and receive suggested answers based on multimodal data. A test collection is provided to assess system effectiveness and user satisfaction, demonstrating enhanced performance over conventional lifelog systems, especially for novice users. This dataset paves the way for more intuitive and effective lifelog interaction.
Event Recognition in Laparoscopic Gynecology Videos with Hybrid Transformers
Sahar Nasirihaghighi, Negin Ghamsarian, Heinrich Husslein & Klaus Schoeffmann
Institute of Information Technology (ITEC), Klagenfurt University, Klagenfurt, Austria; Center for AI in Medicine, University of Bern, Bern, Switzerland; Department of Gynecology and Gynecological Oncology, Medical University Vienna, Vienna, Austria
Paper available at: https://doi.org/10.1007/978-3-031-56435-2_7
Dataset available at: https://ftp.itec.aau.at/datasets/LapGyn6-Events/
The dataset is tailored for event recognition in laparoscopic gynecology surgery videos, including annotations for critical intra-operative and post-operative events. Designed for applications in surgical training and complication prediction, it facilitates precise event recognition. The dataset supports a hybrid Transformer-based architecture that leverages inter-frame dependencies, improving accuracy amid challenges like occlusion and motion blur. Additionally, a custom frame sampling strategy addresses variations in surgical scenes and skill levels, achieving high temporal resolution. This methodology outperforms conventional CNN-RNN architectures, advancing laparoscopic video analysis.
GreenScreen: A Multimodal Dataset for Detecting Corporate Greenwashing in the Wild
Ujjwal Sharma, Stevan Rudinac, Joris Demmers, Willemijn van Dolen & Marcel Worring
University of Amsterdam, Amsterdam, The Netherlands
Paper available at: https://doi.org/10.1007/978-3-031-56435-2_8
Dataset available at: https://uva-hva.gitlab.host/u.sharma/greenscreen
The dataset focuses on detecting greenwashing in social media by combining large-scale text and image collections from Fortune-1000 company Twitter accounts with environmental risk scores on specific issues like emissions and resource usage. This dataset addresses the challenge of identifying subtle, abstract greenwashing signals requiring contextual interpretation. It includes a baseline method leveraging advanced content encoding to analyze connections between social media content and greenwashing tendencies. This resource enables the multimedia retrieval community to advance greenwashing detection, promoting transparency in corporate sustainability claims.
 
	
	 
			 
			