Overview of Open Dataset Sessions and Benchmarking Competitions in 2023-2025 – Part 4 (ACM MMSys 2023, 2024, 2025)

By maria | November 20, 2025 - 13:35 |November 20, 2025 0425, 0425, Datasets Column, Event Report, Feature

Editors: Maria Torres Vega (KU Leuven, Belgium), Karel Fliegel (Czech Technical University in Prague, Czech Republic), Mihai Gabriel Constantin (University Politehnica of Bucharest, Romania),

In this Dataset Columns, we continue the tradition of the previous three columns by reviewing some of the notable events related to open datasets and benchmarking competitions in the field of multimedia in the years 2023, 2024 and 2025 from this column. This selection highlights the wide range of topics and datasets currently of interest to the community. Some of the events covered in this review include special sessions on open datasets and competitions featuring multimedia data. This review follows similar efforts from the previous editions:

This fourth column focuses on the last three editions of ACM Multimedia Systems (MSys), i.e., 2023, 2024, and 2025:

The 14th ACM Multimedia Systems Conference (ACM MMSys’23 https://2023.acmmmsys.org/).
The 15th ACM Multimedia Systems Conference (ACM MMSys’24 https://2024.acmmmsys.org/).
The 16th ACM Multimedia Systems Conference (ACM MMSys’25 https://2025.acmmmsys.org/).

ACM MMSys 2023

10 dataset papers were presented at the 14th ACM Multimedia Systems Conference (ACM MMSys’23), organized in Vancouver, Canada, June 7-10, 2023 (https://2023.acmmmsys.org/). The complete ACM MMSys’23 Proceedings are available in the ACM Digital Library (https://dl.acm.org/doi/proceedings/10.1145/3587819).

Rhys Cox, S., et al., VOLVQAD: An MPEG V-PCC Volumetric Video Quality Assessment Dataset (paper available at: https://dl.acm.org/doi/10.1145/3587819.3592543; dataset available at: https://github.com/nus-vv-streams/volvqad-dataset).
This is a volumetric video quality assessment dataset consisting of 7,680 ratings on 376 video sequences from 120 participants. The sequences are encoded with MPEG V-PCC using 4 different avatar models and 16 quality variations, and then rendered into test videos for quality assessment using 2 different background colors and 16 different quality switching patterns.
Prakash, N., et al., TotalDefMeme: A Multi-Attribute Meme dataset on Total Defence in Singapore (paper available at: https://dl.acm.org/doi/10.1145/3587819.3592545; dataset available at: https://gitlab.com/bottle_shop/meme/TotalDefMemes). Total Defence is a large-scale multi-modal and multi-attribute meme dataset that captures public sentiments toward Singapore’s Total Defence policy. Besides supporting social informatics and public policy analysis of the Total Defence policy, TotalDefMeme can also support many downstream multi-modal machine learning tasks, such as aspect-based stance classification and multi-modal meme clustering.
Sun, Y., et al., A Dynamic 3D Point Cloud Dataset for Immersive Applications (paper available at: https://dl.acm.org/doi/10.1145/3587819.3592546; dataset available on request to the authors). This dataset consists of synthetically generated objects with pre-determined motion patterns. It contains nine objects in three categories (shape, avatar, and textile) with different animation patterns.
Raca, D., et al., 360 Video DASH Dataset (paper available at: https://dl.acm.org/doi/10.1145/3587819.3592548; dataset available at: https://github.com/darijo/360-Video-DASH-Dataset). This study introduces a SW tool that offers straight-forward encoding platforms to simplify the encoding of DASH VR videos. In addition, it includes a dataset composed of 9 VR videos encoded with seven tiling configurations, four segment durations, four different bitrates.
Hu, K., et al., FSVVD: A Dataset of Full Scene Volumetric Video ( paper available at: https://dl.acm.org/doi/10.1145/3587819.3592551, dataset available at: https://cuhksz-inml.github.io/full_scene_volumetric_video_dataset/). This dataset focuses on the current most widely used data format, point cloud, and for the first time, releases a full-scene volumetric video dataset that includes multiple people and their daily activities interacting with the external environments.
Wu, Y., et al., A Dataset of Food Intake Activities Using Sensors with Heterogeneous Privacy Sensitivity Levels (paper available at: https://dl.acm.org/doi/10.1145/3587819.3592553; dataset available on request to the authors). This dataset compiles fine-grained food intake activities using sensors of heterogeneous privacy sensitivity levels, namely a mmWave radar, an RGB camera, and a depth camera. Solutions to recognize food intake activities can be developed using this dataset, which may provide a more comprehensive picture of the accuracy and privacy trade-offs involved with heterogeneous sensors.
Soares da Costa, T., et al., A Dataset for User Visual Behaviour with Multi-View Video Content (paper available at: https://dl.acm.org/doi/10.1145/3587819.3592556; dataset available on request to the authors). This dataset, collected with a large-scale testbed, compiles tracking data from head movements was obtained from 45 participants using an Intel Realsense F200 camera, with 7 video playlists, each being viewed a minimum of 17 times.
Wei, Y., et al., A 6DoF VR Dataset of 3D virtualWorld for Privacy-Preserving Approach and Utility-Privacy Tradeoff (paper available at: https://dl.acm.org/doi/10.1145/3587819.3592557; dataset available on request to the authors). This dataset collects a 6 degree-of-freedom VR dataset of 3D virtual worlds for the investigation of privacy-preserving approaches and utility-privacy tradeoff.
Mohammed, A. et al., IDCIA: Immunocytochemistry Dataset for Cellular Image Analysis (paper available at: https://dl.acm.org/doi/10.1145/3587819.3592558; dataset available at: https://figshare.com/articles/dataset/Dataset/21970604). This dataset is a new annotated microscopic cellular image dataset to improve the effectiveness of machine learning methods for cellular image analysis. It includes microscopic images of cells, and for each image, the cell count and the location of individual cells. The data were collected as part of an ongoing study investigating the potential of electrical stimulation to modulate stem cell differentiation and possible applications for neural repair.
Al Shoura, T., et al., SEPE Dataset: 8K Video Sequences and Images for Analysis and Development (paper available at: https://dl.acm.org/doi/10.1145/3587819.3592560; dataset available at: https://github.com/talshoura/SEPE-8K-Dataset). The SEPE 8K dataset (Software Engineering Practice and Education) is made of 40 different 8K (8192 x 4320) video sequences and 40 variant 8K (8192 x 5464) images. The proposed dataset is – as far as we know – the first to publish true 8K natural sequences; thus, it is important for the next level of applications dealing with multimedia such as video quality assessment, super-resolution, video coding, video compression, and many more.

ACM MMSys 2024

14 dataset papers were presented at the 15th ACM Multimedia Systems Conference (ACM MMSys’24), organized in Bari, Italy, April 15-18, 2024 (https://2024.acmmmsys.org/). The complete ACM MMSys’24 Proceedings are available in the ACM Digital Library (https://dl.acm.org/doi/proceedings/10.1145/3625468).

Malon, T., et al., Ceasefire Hierarchical Weapon Dataset (paper available at: https://dl.acm.org/doi/10.1145/3625468.3653434; dataset available on request to the authors). The Ceasefire Hierarchical Weapon Dataset, an RGB image dataset of firearms tailored for fine-grained image classi- fication, contains 260 classes ranging from 25 to hundreds of images per class, with a total of 40,789 images. In addition, a 4-level hierarchy (family, group, type, model) is provided and validated by forensic experts.
Kassab, E.J., et al., TACDEC: Dataset for Automatic Tackle Detection in Soccer Game Videos (paper available at: https://dl.acm.org/doi/10.1145/3625468.3652166; dataset available on request to the authors). TACDEC is a dataset of tackle events in soccer game videos. By leveraging video data from the Norwegian Eliteserien league across multiple seasons, we annotated 425 videos with 4 types of tackle events, categorized into “tackle-live”, “tackle-replay”, “tackle-live-incomplete”, and “tackle-replay-incomplete”, yielding a total of 836 event annotations.
Zhao, J., Pan, J., LENS: A LEO Satellite Network Measurement Dataset (paper available at: https://dl.acm.org/doi/10.1145/3625468.3652170; dataset available at: https://github.com/clarkzjw/LENS). LENS is a LEO satellite network measurement dataset, collected from 13 Starlink dishes, associated with 7 Point-of-Presence (PoP) locations across 3 continents. The dataset currently consists of network latency traces from Starlink dishes with different hardware revisions, various service subscriptions and distinct sky obstruction ratios.
Chen, B., et al., vRetention: A User Viewing Dataset for Popular Video Streaming Services (paper available at: https://dl.acm.org/doi/10.1145/3625468.3652175; dataset available at: https://github.com/flowtele/vRetention). This dataset collects 229178 audience retention curves from YouTube and Bilibili, offering a thorough view of viewer engagement and diverse watching styles. Our analysis reveals notable behavioral differences across countries, categories, and platforms.
Xu , Y., et al., Panonut360: A Head and Eye Tracking Dataset for Panoramic Video (paper available at: https://dl.acm.org/doi/10.1145/3625468.3652176; dataset available at: https://dianvrlab.github.io/Panonut360/). This dataset presents head and eye trackings involving 50 users (25 males and 25 females) watching 15 panoramic videos (mostly in 4K). The dataset provides details on the viewport and gaze attention locations of users.
Linder, S., et al., VEED: Video Encoding Energy and CO2 Emissions Dataset for AWS EC2 instances (paper available at: https://dl.acm.org/doi/10.1145/3625468.3652178; dataset available at: https://github.com/cd-athena/VEED-dataset). VEED is a FAIR Video Encoding Energy and CO2 Emissions Dataset for Amazon Web Services (AWS) EC2 instances. The dataset also contains the duration, CPU utilization, and cost of the encoding.
Tashtarian, F., et al., COCONUT: Content Consumption Energy Measurement Dataset for Adaptive Video Streaming (paper available at: https://dl.acm.org/doi/10.1145/3625468.3652179; dataset available at: https://athena.itec.aau.at/coconut/). The COCONUT dataset provides a COntent COnsumption eNergy measUrement daTaset for adaptive video streaming collected through a digital multimeter on various types of client devices, such as laptop and smartphone, streaming MPEG-DASH segments.
Sarkhoosh, M. H., et al., The SoccerSum Dataset for Automated Detection, Segmentation, and Tracking of Objects on the Soccer Pitch (paper available at: https://dl.acm.org/doi/10.1145/3625468.3652180; dataset available at: https://zenodo.org/records/10612084). SoccerSum is a novel dataset aimed at enhancing object detection and segmentation in video frames depicting the soccer pitch, using footage from the Norwegian Eliteserien league across 2021-2023. It also includes the segmentation of key pitch areas such as the penalty and goal boxes for the same frame sequences. It comprises 750 frames annotated with 10 classes for advanced analysis.
Li, G., et al., A Driver Activity Dataset with Multiple RGB-D Cameras and mmWave Radars (paper available at: https://dl.acm.org/doi/10.1145/3625468.3652181; dataset available at: https://www.kaggle.com/datasets/guanhualee/driver-activity-dataset). This work introduces a novel dataset for fine-grained driver activities, utilizing diverse sensors such as mmWave radars, RGB, and depth cameras, each of which includes three camera angles: body, face, and hands.
Nguyen, M., et al., ComPEQ – MR: Compressed Point Cloud Dataset with Eye-tracking and Quality Assessment in Mixed Reality (paper available at: https://dl.acm.org/doi/10.1145/3625468.3652182; dataset available at: https://ftp.itec.aau.at/datasets/ComPEQ-MR/). This dataset comprises four compressed dynamic point clouds processed by Moving Picture Experts Group (MPEG) reference tools (i.e., VPCC and GPCC), each with 12 distortion levels. We also conducted subjective tests to assess the quality of the compressed point clouds with different levels of distortion. Additionally, eye-tracking data for visual saliency is included in this dataset, which is necessary to predict where people look when watching 3D videos in MR experiences. We collected opinion scores and eye-tracking data from 41 participants, resulting in 2132 responses and 164 visual attention maps in total.
Barone, N., et al., APEIRON: a Multimodal Drone Dataset Bridging Perception and Network Data in Outdoor Environments (paper available at: https://dl.acm.org/doi/10.1145/3625468.3652186; dataset available at: https://c3lab.github.io/Apeiron/). APEIRON is a rich multimodal aerial dataset that simultaneously collects perception data from a stereo camera and an event based camera sensor, along with measurements of wireless network links obtained using an LTE module. The assembled dataset consists of both perception and network data, making it suitable for typical perception or communication applications, as well as cross-disciplinary applications that require both types of data.
Baldoni, S., et al., Questset: A VR Dataset for Network and Quality of Experience Studies (paper available at: https://dl.acm.org/doi/10.1145/3625468.3652187; dataset available at: https://researchdata.cab.unipd.it/1179/). Questset contains over 40 hours of VR traces from 70 users playing commercially available video games, and includes both traffic data for network optimization, and movement and user experience data for cybersickness analysis. Therefore, Questset represents an enabler to jointly address the main VR challenges in the near future.
Jabal, A. et al., StreetLens: An In-Vehicle Video Dataset for Public Facility Monitoring in Urban Streets (paper available at: https://dl.acm.org/doi/10.1145/3625468.3652188; dataset available on request to the authors). StreetLens is a new dataset of videos capturing urban streets with plentiful annotations for vision-based public facility monitoring. It includes four-and-a-half hours of videos recorded by smartphone cameras placed in moving vehicles in the suburbs of three different cities.
Brescia, W., et al., MilliNoise: a Millimeter-wave Radar Sparse Point Cloud Dataset in Indoor Scenarios (paper available at: https://dl.acm.org/doi/10.1145/3625468.3652189; dataset available at: https://github.com/c3lab/MilliNoise). MilliNoise is a point cloud dataset captured in indoor scenarios through a mmWave radar sensor installed on a wheeled mobile robot. Each of the 12M points in the MilliNoise dataset is accurately labeled as true/noise point by leveraging known information of the scenes and a motion capture system to obtain the ground truth position of the moving robot. Along with the dataset, we provide researchers with the tools to visualize the data and prepare it for statistical and machine learning analysis.

ACM MMSys 2025

8 dataset papers were presented at the 16th ACM Multimedia Systems Conference (ACM MMSys’25), organized in Stellenbosch, South Africa, March 30th to April 4th, 2025 (https://2025.acmmmsys.org/). The complete ACM MMSys’25 Proceedings are available in the ACM Digital Library (https://dl.acm.org/doi/proceedings/10.1145/3712676).

Lechelek, L. et al., eCHFD: extended Ceasefire Hierarchical Firearm Dataset (paper available at: https://dl.acm.org/doi/10.1145/3712676.3718333; dataset available on request to the authors). This is the extended Ceasefire Hierarchical Firearm Dataset (eCHFD), a large image dataset of firearms consisting of over 93,000 images in 505 classes. It was constructed from more than 240 videos filmed at the Toulouse Forensics Laboratory (France) and further enriched with images from the existing CHFD dataset and additional downloaded images.
Sarkhoosh, M. H. et al., HockeyAI: A Multi-Class Ice Hockey Dataset for Object Detection (paper available at: https://dl.acm.org/doi/10.1145/3712676.3718335; dataset available at: https://huggingface.co/SimulaMet-HOST/HockeyAI). HockeyAI is a novel open source dataset specifically designed for multi-class object detection in ice hockey. It includes 2,101 high resolution frames extracted from professional games in the Swedish Hockey League (SHL), annotated in the You Look Only Once (YOLO) format.
Nguyen, M. et al., OLED-EQ: A Dataset for Assessing Video Quality and Energy Consumption in OLED TVs Across Varying Brightness Levels (paper available at: https://dl.acm.org/doi/abs/10.1145/3712676.3718337; dataset available at: https://github.com/minhkstn/OLED-EQ). The dataset comprises the energy data of four OLED TVs with different screen sizes and manufacturers in playing 176 videos in a range of dark and bright content. As a result, 704 data traces of energy consumption are collected. It also includes subjective annotations (28 participants, resulting in 2240 responses in total) of the quality of videos displayed in OLED TVs when they are reduced in brightness.
Sarkhoosh, M. H. et al., HockeyRink: A Dataset for Precise Ice Hockey Rink Keypoint Mapping and Analytics (paper available at: https://dl.acm.org/doi/10.1145/3712676.3718338; dataset available at: https://huggingface.co/SimulaMet-HOST/HockeyRink). HockeyRink is a novel dataset comprising 56 meticulously annotated keypoints corresponding to significant landmarks on a standard hockey rink, including face-off dots, goalposts, and blue lines.
Sarkhoosh, M. H. et al., HockeyOrient: A Dataset for Ice Hockey Player Orientation Classification (paper available at: https://dl.acm.org/doi/10.1145/3712676.3718342; dataset available at: https://huggingface.co/datasets/SimulaMet-HOST/HockeyOrient ). HockeyOrient is a novel dataset for classifying the orientation of ice hockey players based on their poses. The dataset comprises 9,700 manually annotated frames, selected randomly and non-sequentially, taken from Swedish Hockey League (SHL) games during the 2023 and 2024 seasons.
Li, J. et al., PCVD: A Dataset of Point Cloud Video for Dynamic Human Interaction (paper available at: https://dl.acm.org/doi/10.1145/3712676.3718343; dataset available at: https://github.com/acmmmsys/2025-PCVD-A-Dataset-of-Point-Cloud-Video-for-Dynamic-Human-Interaction). This is a point cloud video dataset PCVD captured with synchronized Azure Kinect cameras, designed to support tasks like denoising, segmentation, and motion recognition in single and multi-person scenes. It provides high-quality depth and color data from diverse real-world scenes with human actions.
Bhattacharya, A. et al., AMIS: An Audiovisual Dataset for Multimodal XR Research (paper available at: https://dl.acm.org/doi/10.1145/3712676.3718344; dataset available at: https://github.com/Telecommunication-Telemedia-Assessment/AMIS). The Audiovisual Multimodal Interaction Suite (AMIS) is an open-source dataset and accompanying Unity-based demo implementation designed to aid research on immersive media communication and social XR environments. It features synchronized audiovisual recordings of three actors performing monologues and participating in dyadic conversations across four modalities: talking-head videos, full-body videos, volumetric avatars, and personalized animated avatars.
Ouellette, J. et al., MazeLab: A Large-Scale Dynamic Volumetric Point Cloud Video Dataset With User Behavior Traces (paper available at: https://dl.acm.org/doi/10.1145/3712676.3718345; dataset available on request to the authors). MazeLab is a dynamic volumetric video dataset comprising a feature-rich point cloud representation of a large maze environment. It captures navigation traces from 15 participants interacting with 15 distinct maze variants, categorized into seven classes designed to elicit specific behavioral characteristics such as navigation patterns, attention hotspots, and interaction dynamic.

Overview of Open Dataset Sessions and Benchmarking Competitions in 2023-2024 – Part 3 (MediaEval 2023, ImageCLEF 2024)

By gabi | March 20, 2025 - 14:46 |April 23, 2025 0125, 0125, Datasets Column, Feature

Leave a comment

In this final part of the Overview of Open Dataset Sessions and Benchmarking Competitions we are focusing on the latest editions of some of the most popular multimedia-centric benchmarking competitions, continuing our reviews from previous years (https://records.sigmm.org/2023/01/19/overview-of-open-dataset-sessions-and-benchmarking-competitions-in-2022-part-3/). This third part of our review focuses on two benchmarking competitions:

MediaEval 2023 (https://multimediaeval.github.io/editions/2023/). We present the five benchmarking tasks, which target a wide range of topics, including medical multimedia applications (Medico), multimodal understanding of smells (Musti), multimodal content in news media (NewsImages), social media video memorability (Memorability), and sports action classification (SportsVideo).
ImageCLEF 2024 (https://www.imageclef.org/2024). This edition of ImageCLEF targets a wide range of tasks, covering four different medical-focused tasks (medical captions, Visual Question Answering, remote medicine, and GANs in medical scenarios), recommendation systems for editorials, image retrieval and generation, and pictogram generation from textual information.

For an overview of the QoMEX 2023 and QoMEX 2024 conferences, please see the first part of this column (https://records.sigmm.org/2024/09/07/overview-of-open-dataset-sessions-and-benchmarking-competitions-in-2023-2024-part-1-qomex-2023-and-qomex-2024/), while an overview of MDRE special sessions at MMM2023 and 2024 please take a look at the second part of this column (https://records.sigmm.org/2024/11/19/overview-of-open-dataset-sessions-and-benchmarking-competitions-in-2023-2024-part-2-mdre-at-mmm-2023-and-mmm-2024/).

MediaEval 2023

The MediaEval Multimedia Evaluation benchmark (https://multimediaeval.github.io/) benchmark offers challenges in artificial intelligence for multimedia data, tasking participants in benchmarking tasks centered around retrieval, classification, generation, analysis, and exploration of multimodal data. The latest editions of MediaEval also wish to delve deeper into understanding the data, trends, and system performance, by proposing a set of Quest for Insight (Q4I) questions and themes for each task. A column signed by the Coordination Committee of the latest MediaEval edition, outlying MediaEval’s history, impressions from the latest edition, and plans for the future is published in the October 2024 edition of our records (https://records.sigmm.org/2024/11/15/one-benchmarking-cycle-wraps-up-and-the-next-ramps-up-news-from-the-mediaeval-multimedia-benchmark/). MediaEval 2023 (https://multimediaeval.github.io/editions/2023/) was held between 1-2 February 2024, Collocated with MMM 2024 in Amsterdam, Netherlands, and the Coordination Committee was composed of Mihai Gabriel Constantin (University Politehnica of Bucharest, Romania), Steven Hicks, (SimulaMet, Norway), and Martha Larson (Radboud University, Netherlands) as the main coordinator.

Medical Multimedia Task – Transparent Tracking of Spermatozoa
Paper available at: https://ceur-ws.org/Vol-3658/paper1.pdf
Vajira Thambawita, Andrea Storås, Tuan-Luc Huynh, Hai-Dang Nguyen, Minh-Triet Tran, Trung-Nghia Le, Pål Halvorsen, Michael Riegler, Steven Hicks, Thien-Phuc Tran
SimulaMet, Norway, OsloMet, Norway, University of Science, VNU-HCM, Vietnam, Vietnam National University, Ho Chi Minh City, Vietnam
Dataset available at: https://multimediaeval.github.io/editions/2023/tasks/medico/

The Medico task provides a set of spermatozoa videos, tracked with a set of frame-by-frame bounding box annotations, tasking participants with the prediction of standard sperm quality assessment measurements, specifically the motility (movement) of spermatozoa (living sperm cells).

Musti: Multimodal Understanding of Smells in Texts and Images
Paper available at: https://ceur-ws.org/Vol-3658/paper34.pdf
Ali Hürriyetoğlu, Inna Novalija, Mathias Zinnen, Vincent Christlein, Pasquale Lisena, Stefano Menini, Marieke van Erp, Raphael Troncy
KNAW Humanities Cluster, DHLab, Jožef Stefan Institute, Slovenia, Pattern Recognition Lab, Friedrich-Alexander-Universität Erlangen-Nürnberg, EURECOM, Sophia Antipolis, France, Fondazione Bruno Kessler, Trento, Italy
Dataset available at: https://multimediaeval.github.io/editions/2023/tasks/musti/

Musti is an innovative task, seeking to understand the descriptions and depictions of smells in multilingual texts (English, German, Italian, French, Slovenian) and images from the 17th to the 20th century. Participants must create systems that recognize references to smells in texts and images, connecting these references across different modalities.

NewsImages: Connecting Text and Images
Paper available at: https://ceur-ws.org/Vol-3658/paper4.pdf
Andreas Lommatzsch, Benjamin Kille, Özlem Özgöbek, Mehdi Elahi, Duc Tien Dang Nguyen
Technische Universität Berlin, Berlin, Germany, Norwegian University of Science and Technology, Trondheim, Norway, University of Bergen, Bergen, Norway.
Dataset available at: https://multimediaeval.github.io/editions/2023/tasks/newsimages/

In this edition of the NewsImages task participants are encouraged to discover patterns and models that describe the relation between images and texts of news articles, body of articles, and their headlines.

Predicting Video Memorability
Paper available at: https://ceur-ws.org/Vol-3658/paper2.pdf
Mihai Gabriel Constantin, Claire-Hélène Demarty, Camilo Fosco, Alba García Seco de Herrera, Sebastian Halder, Graham Healy, Bogdan Ionescu, Ana Matran-Fernandez, Rukiye Savran Kiziltepe, Alan F. Smeaton, Lorin Sweeney
University Politehnica of Bucharest, Romania, InterDigital, France, Massachusetts Institute of Technology Cambridge, USA, University of Essex, UK, Dublin City University, Ireland, Karadeniz Technical University, Turkey
Dataset available at: https://multimediaeval.github.io/editions/2023/tasks/memorability/

The organizers propose a dataset that studies the long-term memorability of social media-like videos, providing participants with an extensive data set of videos with memorability annotations, related information, pre-extracted state-of-the-art visual features, and Electroencephalography (EEG) recordings.

SportsVideo: Fine Grained Action Classification and Position Detection in Table Tennis and Swimming Videos
Paper available at: https://ceur-ws.org/Vol-3658/paper3.pdf
Aymeric Erades, Pierre-Etienne Martin, Romain Vuillemot, Boris Mansencal, Renaud Peteri, Julien Morlier, Stefan Duffner, Jenny Benois-Pineau
Ecole Centrale de Lyon, LIRIS, France, CCP Department, Max Planck Institute for Evolutionary Anthropology, Leipzig, Germany, University of Bordeaux, Labri, France, INSA Lyon, LIRIS, France
Dataset available at: https://multimediaeval.github.io/editions/2023/tasks/sportsvideo/

The organizers developed a set of six sub-tasks covering table tennis and swimming, related to athlete position detection, stroke detection, the classification of motions, field or table registration, sound detection in sports, and scores and result extraction from visual cues.

ImageCLEF 2024

ImageCLEF (https://www.imageclef.org/) is part of the popular CLEF initiative (https://www.clef-initiative.eu/), and states as its main goal the evaluation of technologies for annotation, indexing, classification, and retrieval of multimodal data. The 2024 edition of ImageCLEF (https://www.imageclef.org/2024) was organized between the 9-12 September 2024, in Grenoble, France, with an Organization Committee composed of Bogdan Ionescu, Henning Müller, Ana-Maria Drăgulinescu, Ivan Eggel, and Liviu-Daniel Ștefan.

ImageCLEFmedical Caption
Paper available at: https://ceur-ws.org/Vol-3740/paper-132.pdf
Johannes Rückert, Asma Ben Abacha, Alba G. Seco de Herrera, Louise Bloch, Raphael Brüngel, Ahmad Idrissi-Yaghir, Henning Schäfer, Benjamin Bracke, Hendrik Damm, Tabea M. G. Pakull, Cynthia Sabrina Schmidt, Henning Müller, Christoph M. Friedrich
Department of Computer Science, University of Applied Sciences and Arts Dortmund, Dortmund, Germany, Microsoft, Redmond, Washington, USA, University of Essex, UK, UNED, Spain, Institute for Medical Informatics, Biometry and Epidemiology (IMIBE), University Hospital Essen, Germany, Institute for Artificial Intelligence in Medicine (IKIM), University Hospital Essen, Germany, Institute for Transfusion Medicine, University Hospital Essen, Essen, Germany, University of Applied Sciences Western Switzerland (HES-SO), Switzerland, University of Geneva, Switzerland,
Dataset available at: https://www.imageclef.org/2024/medical/caption

The medical caption task focuses on evaluating models that detect medical concepts and automatically create captions for medical images, which can be further applied for context-based image and information retrieval purposes.

ImageCLEFmed VQA
Paper available at: https://ceur-ws.org/Vol-3740/paper-131.pdf
Steven Hicks, Andrea Storås, Pål Halvorsen, Michael Riegler, Vajira Thambawita
SimulaMet, Oslo, Norway, OsloMet- Oslo Metropolitan University, Oslo, Norway
Dataset available at: https://www.imageclef.org/2024/medical/vqa

This edition of the medical VQA task focuses on images of the gastrointestinal tract, tasking participants with directing the power of artificial intelligence to generate medical images based on text input, while also looking at optimal prompts for off-the-shelf generative models, thus augmenting the datasets associated with the previous edition of this task.

ImageCLEFmed MEDIQA-MAGIC
Paper available at: https://ceur-ws.org/Vol-3740/paper-133.pdf
Wen-Wai Yim, Asma Ben Abacha, Yujuan Fu, Zhaoyi Sun, Meliha Yetisgen, Fei Xia
Microsoft Health AI, Redmond, USA, University of Washington, Seattle, USA.
Dataset available at: https://www.imageclef.org/2024/medical/mediqa

The MEDIQA task focuses on the problem of Multimodal And Generative TelemedICine (MAGIC) in the area of dermatology. Participants must develop systems that can take queries, text, clinical context, and images as input and generate appropriate medical textual responses to this input in a telemedicine setting.

ImageCLEFmed GANs
Paper available at: https://ceur-ws.org/Vol-3740/paper-130.pdf
Alexandra-Georgiana Andrei, Ahmedkhan Radzhabov, Dzmitry Karpenka, Yuri Prokopchuk, Vassili Kovalev, Bogdan Ionescu, Henning Müller
AI Multimedia Lab, National University of Science and Technology Politehnica Bucharest, Romania, Belarusian Academy of Sciences, Minsk, Belarus, University of Applied Sciences Western Switzerland (HES-SO), Sierre, Switzerland.
Dataset available at: https://www.imageclef.org/2024/medical/gans

This task addresses the challenges of privacy preservation in artificially generated medical images, looking for “fingerprints” of the original real-world training images in a set of artificially generated images, fingerprints that may break patient privacy when exposed in unwanted or unforeseen circumstances.

ImageCLEFrecommending
Alexandru Stan, George Ioannidis, Bogdan Ionescu, Hugo Manguinhas
IN2 Digital Innovations, Germany, Politehnica University of Bucharest, Romania, Europeana Foundation, Netherlands,
Dataset available at: https://www.imageclef.org/2024/recommending

This task identifies traditional multimedia search methods as a performance bottleneck and proposes the development of recommendation methods and systems applied to blog posts, editorials, and galleries, targeting data related to cultural heritage organizations and collections.

Image Retrieval for Arguments (part of Touché at CLEF)
Paper available at: https://ceur-ws.org/Vol-3740/paper-322.pdf
Johannes Kiesel, Çağrı Çöltekin, Maximilian Heinrich, Maik Fröbe, Milad Alshomary, Bertrand De Longueville, Tomaž Erjavec, Nicolas Handke, Matyáš Kopp, Nikola Ljubešić, Katja Meden, Nailia Mirzakhmedova, Vaidas Morkevičius, Theresa Reitis-Münstermann, Mario Scharfbillig, Nicolas Stefanovitch, Henning Wachsmuth, Martin Potthast, Benno Stein
Bauhaus-Universität Weimar, University of Tübingen, Friedrich-Schiller-Universität Jena, Leibniz University Hannover, European Commission, Joint Research Centre (JRC), Jožef Stefan Institute, Leipzig University, Charles University, Kaunas University of Technology, Arcadia Sistemi Informativi Territoriali, University of Kassel, hessian.AI, and ScaDS.AI
Dataset available at: https://www.imageclef.org/2024/image-retrieval-for-arguments

The goal for this task is the retrieval of images and data that can increase the persuasiveness of an argument, building upon the datasets of topics developed in previous editions of the Touché task.

ImageCLEF ToPicto
Cécile Macaire, Benjamin Lecouteux, Didier Schwab, Emmanuelle Esperança-Rodier
Université Grenoble Alpes, LIG, France
Dataset available at: https://www.imageclef.org/2023/topicto

Targeting the alleviation of symptoms related to diseases that create language impairment problems, the ToPicto task proposes the development of automated systems that translate text or speech to visual pictograms, that can then be used as communication aids and tools.

Overview of Open Dataset Sessions and Benchmarking Competitions in 2023-2024 – Part 2 (MDRE at MMM 2023 and MMM 2024)

By karel | November 19, 2024 - 11:31 |November 25, 2024 0424, Datasets Column, Feature

Leave a comment

As already started in the previous Datasets column, we are reviewing some of the most notable events related to open datasets and benchmarking competitions in the field of multimedia in the years 2023 and 2024. This selection highlights the wide range of topics and datasets currently of interest to the community. Some of the events covered in this review include special sessions on open datasets and competitions featuring multimedia data. This year’s review follows similar efforts from the previous year (https://records.sigmm.org/records-issues/acm-sigmm-records-issue-1-2023/), highlighting the ongoing importance of open datasets and benchmarking competitions in advancing research and development in multimedia. This second part of the column focuses on the last two editions of MDRE at MMM 2023 and MMM 2024:

Multimedia Datasets for Repeatable Experimentation at 29th International Conference on Multimedia Modeling (MDRE at MMM 2023). We summarize the seven datasets presented during the MDRE in 2023, namely NCKU-VTF (thermal-to-visible face recognition benchmark), Link-Rot (web dataset decay and reproducibility study), People@Places and ToDY (scene classification for media production), ScopeSense (lifelogging dataset for health analysis), OceanFish (high-resolution fish species recognition), GIGO (urban garbage classification and demographics), and Marine Video Kit (underwater video retrieval and analysis).
Multimedia Datasets for Repeatable Experimentation at 30th International Conference on Multimedia Modeling (MDRE at MMM 2024 – https://mmm2024.org/). We summarize the eight datasets presented during the MDRE in 2024, namely RESET (video similarity annotations for embeddings), DocCT (content-aware document image classification), Rach3 (multimodal data for piano rehearsal analysis), WikiMuTe (semantic music descriptions from Wikipedia), PDTW150K (large-scale patent drawing retrieval dataset), Lifelog QA (question answering for lifelog retrieval), Laparoscopic Events (event recognition in surgery videos), and GreenScreen (social media dataset for greenwashing detection).

For the overview of datasets related to QoMEX 2023 and QoMEX 2024, please check the first part (https://records.sigmm.org/2024/09/07/overview-of-open-dataset-sessions-and-benchmarking-competitions-in-2023-2024-part-1-qomex-2023-and-qomex-2024/).

MDRE at MMM 2023

The Multimedia Datasets for Repeatable Experimentation (MDRE) special session is part of the 2023 International Conference on Multimedia Modeling (MMM 2023), Bergen, Norway, January 9-12, 2023. The MDRE’23 special session at MMM’23, is the fifth MDRE session. The session was organized by Cathal Gurrin (Dublin City University, Ireland), Duc-Tien Dang-Nguyen (University of Bergen, Norway), Adam Jatowt (University of Innsbruck, Austria), Liting Zhou (Dublin City University, Ireland) and Graham Healy (Dublin City University, Ireland).

The NCKU-VTF Dataset and a Multi-scale Thermal-to-Visible Face Synthesis System
Tsung-Han Ho, Chen-Yin Yu, Tsai-Yen Ko & Wei-Ta Chu
National Cheng Kung University, Tainan, Taiwan

Paper available at: https://doi.org/10.1007/978-3-031-27077-2_36
Dataset available at: http://mmcv.csie.ncku.edu.tw/~wtchu/projects/NCKU-VTF/index.html

The dataset, named VTF, comprises paired thermal-visible face images of primarily Asian subjects under diverse visual conditions, introducing challenges for thermal face recognition models. It serves as a benchmark for evaluating model robustness while also revealing racial bias issues in current systems. By addressing both technical and fairness aspects, VTF promotes advancements in developing more accurate and inclusive thermal-to-visible face recognition methods.

Link-Rot in Web-Sourced Multimedia Datasets
Viktor Lakic, Luca Rossetto & Abraham Bernstein
Department of Informatics, University of Zurich, Zurich, Switzerland

Paper available at: https://doi.org/10.1007/978-3-031-27077-2_37
Dataset available at: Combination of 24 different Web-sourced datasets described in the paper

The dataset examines 24 Web-sourced datasets comprising over 270 million URLs and reveals that more than 20% of the content has become unavailable due to link-rot. This decay poses significant challenges to the reproducibility of research relying on such datasets. Addressing this issue, the dataset highlights the need for strategies to mitigate content loss and maintain data integrity for future studies.

People@Places and ToDY: Two Datasets for Scene Classification in Media Production and Archiving
Werner Bailer & Hannes Fassold
Joanneum Research, Graz, Austria

Paper available at: https://doi.org/10.1007/978-3-031-27077-2_38
Dataset available at: https://github.com/wbailer/PeopleAtPlaces

The dataset supports annotation tasks in visual media production and archiving, focusing on scene bustle (from populated to unpopulated), cinematographic shot types, time of day, and season. The People@Places dataset augments Places365 with bustle and shot-type annotations, while the ToDY (time of day/year) dataset enhances SkyFinder. Both datasets come with a toolchain for automatic annotations, manually verified for accuracy. Baseline results using the EfficientNet-B3 model, pretrained on Places365, are provided for benchmarking.

ScopeSense: An 8.5-Month Sport, Nutrition, and Lifestyle Lifelogging Dataset
Michael A. Riegler, Vajira Thambawita, Ayan Chatterjee, Thu Nguyen, Steven A. Hicks, Vibeke Telle-Hansen, Svein Arne Pettersen, Dag Johansen, Ramesh Jain & Pål Halvorsen
SimulaMet, Oslo, Norway; Oslo Metropolitan University, Oslo, Norway; UIT The Artic University of Norway, Tromsø, Norway; University of California Irvine, CA, USA

Paper available at: https://doi.org/10.1007/978-3-031-27077-2_39
Dataset available at: https://datasets.simula.no/scopesense

The dataset, ScopeSense, offers comprehensive sport, nutrition, and lifestyle logs collected over eight and a half months from two individuals. It includes extensive sensor data alongside nutrition, training, and well-being information, structured to facilitate detailed, data-driven research on healthy lifestyles. This dataset aims to support modeling for personalized guidance, addressing challenges in unstructured data and enhancing the precision of lifestyle recommendations. ScopeSense is fully accessible to researchers, serving as a foundation for methods to expand this data-driven approach to larger populations.

Fast Accurate Fish Recognition with Deep Learning Based on a Domain-Specific Large-Scale Fish Dataset
Yuan Lin, Zhaoqi Chu, Jari Korhonen, Jiayi Xu, Xiangrong Liu, Juan Liu, Min Liu, Lvping Fang, Weidi Yang, Debasish Ghose & Junyong You
School of Economics, Innovation, and Technology, Kristiania University College, Oslo, Norway; School of Aerospace Engineering, Xiamen University, Xiamen, China; School of Natural and Computing Sciences, University of Aberdeen, Aberdeen, UK; School of Information Science and Technology, Xiamen University, Xiamen, China; School of Ocean and Earth, Xiamen University, Xiamen, China; Norwegian Research Centre (NORCE), Bergen, Norway

Paper available at: https://doi.org/10.1007/978-3-031-27077-2_40
Dataset available at: Upon request from the authors

The dataset, OceanFish, addresses key challenges in fish species recognition by providing high-resolution images of marine species from the East China Sea, covering 63,622 images across 136 fine-grained fish species. This large-scale, diverse dataset overcomes limitations found in prior fish datasets, such as low resolution and limited annotations. OceanFish includes a fish recognition testbed with deep learning models, achieving high precision and speed in species detection. This dataset can be expanded with additional species and annotations, offering a valuable benchmark for advancing marine biodiversity research and automated fish recognition.

GIGO, Garbage In, Garbage Out: An Urban Garbage Classification Dataset
Maarten Sukel, Stevan Rudinac & Marcel Worring
University of Amsterdam, Amsterdam, The Netherlands

Paper available at: https://doi.org/10.1007/978-3-031-27077-2_41
Dataset available at: https://doi.org/10.21942/uva.20750044

The dataset, GIGO: Garbage in, Garbage out, offers 25,000 images for multimodal urban waste classification, captured across a large area of Amsterdam. It supports sustainable urban waste collection by providing fine-grained classifications of diverse garbage types, differing in size, origin, and material. Unique to GIGO are additional geographic and demographic data, enabling multimodal analysis that incorporates neighborhood and building statistics. The dataset includes state-of-the-art baselines, serving as a benchmark for algorithm development in urban waste management and multimodal classification.

Marine Video Kit: A New Marine Video Dataset for Content-Based Analysis and Retrieval
Quang-Trung Truong, Tuan-Anh Vu, Tan-Sang Ha, Jakub Lokoč, Yue-Him Wong, Ajay Joneja & Sai-Kit Yeung
Hong Kong University of Science and Technology, Clear Water Bay, Hong Kong; FMP, Charles
University, Prague, Czech Republic; Shenzhen University, Shenzhen, China

Paper available at: https://doi.org/10.1007/978-3-031-27077-2_42
Dataset available at: https://hkust-vgd.github.io/marinevideokit

The dataset, Marine Video Kit, focuses on single-shot underwater videos captured by moving cameras, providing a challenging benchmark for video retrieval and computer vision tasks. Designed to address the limitations of general-purpose models in domain-specific contexts, the dataset includes meta-data, low-level feature analysis, and semantic annotations of keyframes. Used in the Video Browser Showdown 2023, Marine Video Kit highlights challenges in underwater video analysis and is publicly accessible, supporting advancements in model robustness for specialized video retrieval applications.

MDRE at MMM 2024

The Multimedia Datasets for Repeatable Experimentation (MDRE) special session is part of the 2024 International Conference on Multimedia Modeling (MMM 2024), Amsterdam, The Netherlands, January 29 – February 2, 2024. The MDRE’24 special session at MMM’24, is the sixth MDRE session. The session was organized by Klaus Schöffmann (Klagenfurt University, Austria), Björn Þór Jónsson (Reykjavik University, Iceland), Cathal Gurrin (Dublin City University, Ireland), Duc-Tien Dang-Nguyen (University of Bergen, Norway), and Liting Zhou (Dublin City University, Ireland). Details regarding this session can be found at: https://mmm2024.org/specialpaper.html#s1.

RESET: Relational Similarity Extension for V3C1 Video Dataset
Patrik Veselý & Ladislav Peška
Faculty of Mathematics and Physics, Charles University, Prague, Czechia

Paper available at: https://doi.org/10.1007/978-3-031-56435-2_1
Dataset available at: https://osf.io/ruh5k

The dataset, RESET: RElational Similarity Evaluation dataseT, offers over 17,000 similarity annotations for video keyframe triples drawn from the V3C1 video collection. RESET includes both close and distant similarity triplets in general and specific sub-domains (wedding and diving), with multiple user re-annotations and similarity scores from 30 pre-trained models. This dataset supports the evaluation and fine-tuning of visual embedding models, aligning them more closely with human-perceived similarity, and enhances content-based information retrieval for more accurate, user-aligned results.

A New Benchmark and OCR-Free Method for Document Image Topic Classification
Zhen Wang, Peide Zhu, Fuyang Yu & Manabu Okumura
Tokyo Institute of Technology, Tokyo, Japan; Delft University of Technology, Delft, Netherlands; Beihang University, Beijing, China

Paper available at: https://doi.org/10.1007/978-3-031-56435-2_2
Dataset available at: https://github.com/zhenwangrs/DocCT

The dataset, DocCT, is a content-aware document image classification dataset designed to handle complex document images that integrate text and illustrations across diverse topics. Unlike prior datasets focusing mainly on format, DocCT requires fine-grained content understanding for accurate classification. Alongside DocCT, the self-supervised model DocMAE is introduced, showing that document image semantics can be understood effectively without OCR. DocMAE surpasses previous vision models and some OCR-based models in understanding document content purely from pixel data, marking a significant advance in document image analysis.

The Rach3 Dataset: Towards Data-Driven Analysis of Piano Performance Rehearsal
Carlos Eduardo Cancino-Chacón & Ivan Pilkov
Institute of Computational Perception, Johannes Kepler University Linz, Linz, Austria

Paper available at: https://doi.org/10.1007/978-3-031-56435-2_3
Dataset available at: https://dataset.rach3project.com/

The dataset, named Rach3, captures the rehearsal processes of pianists as they learn new repertoire, providing a multimodal resource with video, audio, and MIDI data. Designed for AI and machine learning applications, Rach3 enables analysis of long-term practice sessions, focusing on how advanced students and professional musicians interpret and refine their performances. This dataset offers valuable insights into music learning and expression, addressing an understudied area in music performance research.

WikiMuTe: A Web-Sourced Dataset of Semantic Descriptions for Music Audio
Benno Weck, Holger Kirchhoff, Peter Grosche & Xavier Serra
Huawei Technologies, Munich Research Center, Munich, Germany; Universitat Pompeu Fabra, Music Technology Group, Barcelona, Spain

Paper available at: https://doi.org/10.1007/978-3-031-56435-2_4
Dataset available at: https://github.com/Bomme/wikimute

The dataset, WikiMuTe, is an open, multi-modal resource designed for Music Information Retrieval (MIR), offering detailed semantic descriptions of music sourced from Wikipedia. It includes both long and short-form text on aspects like genre, style, mood, instrumentation, and tempo. Using a custom text-mining pipeline, WikiMuTe provides data to train models that jointly learn text and audio representations, achieving strong results in tasks such as tag-based music retrieval and auto-tagging. This dataset supports MIR advancements by providing accessible, rich semantic data for matching text and music.

PDTW150K: A Dataset for Patent Drawing Retrieval
Chan-Ming Hsu, Tse-Hung Lin, Yu-Hsien Chen & Chih-Yi Chiu
Department of Computer Science and Information Engineering, National Chiayi University, Chiayi, Taiwan

Paper available at: https://doi.org/10.1007/978-3-031-56435-2_5
Dataset available at: https://github.com/ncyuMARSLab/PDTW150K

The dataset, PDTW150K, is a large-scale resource for patent drawing retrieval, featuring over 150,000 patents with text metadata and more than 850,000 patent drawings. It includes bounding box annotations for drawing views and supporting object detection model construction. PDTW150K enables diverse applications, such as image retrieval, cross-modal retrieval, and object detection. This dataset is publicly available, offering a valuable tool for advancing research in patent analysis and retrieval tasks.

Interactive Question Answering for Multimodal Lifelog Retrieval
Ly-Duyen Tran, Liting Zhou, Binh Nguyen & Cathal Gurrin
Dublin City University, Dublin, Ireland; AISIA Research Lab, Ho Chi Minh, Vietnam; Ho Chi Minh University of Science, Vietnam National University, Hanoi, Vietnam

Paper available at: https://doi.org/10.1007/978-3-031-56435-2_6
Dataset available at: Upon request from the authors

The dataset supports Question Answering (QA) tasks in lifelog retrieval, advancing the field toward open-domain QA capabilities. Integrated into a multimodal lifelog retrieval system, it allows users to ask lifelog-specific questions and receive suggested answers based on multimodal data. A test collection is provided to assess system effectiveness and user satisfaction, demonstrating enhanced performance over conventional lifelog systems, especially for novice users. This dataset paves the way for more intuitive and effective lifelog interaction.

Event Recognition in Laparoscopic Gynecology Videos with Hybrid Transformers
Sahar Nasirihaghighi, Negin Ghamsarian, Heinrich Husslein & Klaus Schoeffmann
Institute of Information Technology (ITEC), Klagenfurt University, Klagenfurt, Austria; Center for AI in Medicine, University of Bern, Bern, Switzerland; Department of Gynecology and Gynecological Oncology, Medical University Vienna, Vienna, Austria

Paper available at: https://doi.org/10.1007/978-3-031-56435-2_7
Dataset available at: https://ftp.itec.aau.at/datasets/LapGyn6-Events/

The dataset is tailored for event recognition in laparoscopic gynecology surgery videos, including annotations for critical intra-operative and post-operative events. Designed for applications in surgical training and complication prediction, it facilitates precise event recognition. The dataset supports a hybrid Transformer-based architecture that leverages inter-frame dependencies, improving accuracy amid challenges like occlusion and motion blur. Additionally, a custom frame sampling strategy addresses variations in surgical scenes and skill levels, achieving high temporal resolution. This methodology outperforms conventional CNN-RNN architectures, advancing laparoscopic video analysis.

GreenScreen: A Multimodal Dataset for Detecting Corporate Greenwashing in the Wild
Ujjwal Sharma, Stevan Rudinac, Joris Demmers, Willemijn van Dolen & Marcel Worring
University of Amsterdam, Amsterdam, The Netherlands

Paper available at: https://doi.org/10.1007/978-3-031-56435-2_8
Dataset available at: https://uva-hva.gitlab.host/u.sharma/greenscreen

The dataset focuses on detecting greenwashing in social media by combining large-scale text and image collections from Fortune-1000 company Twitter accounts with environmental risk scores on specific issues like emissions and resource usage. This dataset addresses the challenge of identifying subtle, abstract greenwashing signals requiring contextual interpretation. It includes a baseline method leveraging advanced content encoding to analyze connections between social media content and greenwashing tendencies. This resource enables the multimedia retrieval community to advance greenwashing detection, promoting transparency in corporate sustainability claims.

Overview of Open Dataset Sessions and Benchmarking Competitions in 2023-2024 – Part 1 (QoMEX 2023 and QoMEX 2024)

By maria | September 7, 2024 - 16:34 |October 15, 2024 0324, Datasets Column, Feature

Leave a comment

In this and the following Dataset Columns, we present a review of some of the notable events related to open datasets and benchmarking competitions in the field of multimedia in the years 2023 and 2024. This selection highlights the wide range of topics and datasets currently of interest to the community. Some of the events covered in this review include special sessions on open datasets and competitions featuring multimedia data. This year’s review follows similar efforts from the previous year (https://records.sigmm.org/records-issues/acm-sigmm-records-issue-1-2023/), highlighting the ongoing importance of open datasets and benchmarking competitions in advancing research and development in multimedia. This first column focuses on the last two editions of QoMEX, i.e., 2023 and 2024:

15th International Conference on Quality of Multimedia Experience (QoMEX 2023 – https://qomex2023.itec.aau.at/).
16th International Conference on Quality of Multimedia Experience (QoMEX 2024 – https://qomex2024.itec.aau.at/ ).

QoMEX 2023

4 dataset full papers were presented at the 15th International Conference on Quality of Multimedia Experience (QoMEX 2023), organized in Ghent, Belgium, June 19 – 21, 2023 (https://qomex2023.itec.aau.at/). The complete QoMEX ’23 Proceedings is available in the IEEE Xplore Digital Library (https://ieeexplore.ieee.org/xpl/conhome/10178424/proceeding).

These datasets were presented within the Datasets session, chaired by Professor Lea Skorin-Kapov. Given the scope of the conference (i.e., Quality of Multimedia Experience), these four papers present contributions focused on the impact on user perception of adaptive 2D video streaming, holographic video codecs, omnidirectional video/audio environments and multi-screen video.

PNATS-UHD-1-Long: An Open Video Quality Dataset for Long Sequences for HTTP-based Adaptive Streaming QoE Assessment
Ramachandra Rao, R. R., Borer, S., Lindero, D., Göring, S. and Raake, A.

Paper available at: https://ieeexplore.ieee.org/document/10178493
Dataset available at: https://github.com/Telecommunication-Telemedia-Assessment/PNATS-UHD-1-Long

A collaboration work of Technische Universität Ilmenau (Germany), Ericsson Research (Sweden) and Rohde&Schwarz (Switzerland)

The presented dataset consists of 3 subjective databases targeting overall quality assessment of a typical HTTP-based Adaptive Streaming session consisting of degradations such as quality switching, initial loading delay, and stalling events using audiovisual content ranging between 2 and 5 minutes. In addition to this, subject bias and consistency in quality assessment of such longer-duration audiovisual contents with multiple degradations are investigated using a subject behaviour model. As part of this paper, the overall test design, subjective test results, sources, encoded audiovisual contents, and a set of analysis plots are made publicly available for further research.

Open access dataset of holographic videos for codec analysis and machine learning applications
Gilles, A., Gioia, P., Madali, N., El Rhammad, A., Morin, L.

Paper available at: https://ieeexplore.ieee.org/document/10178637
Dataset available at: https://hologram-repository.labs.b-com.com/#/holographic-videos

A collaboration work between IRT and INSA, Rennes, France

This is reported as the first large-scale dataset containing 18 holographic videos computed with three different resolutions and pixel pitches. By providing the color and depth images corresponding to each hologram frame, our dataset can be used in additional applications such as the validation of 3D scene geometry retrieval or deep learning-based hologram synthesis methods. Altogether, our dataset comprises 5400 pairs of RGB-D images and holograms, totaling more than 550 GB of data.

Saliency of Omnidirectional Videos with Different Audio Presentations: Analyses and Dataset
Singla, A., Robotham, T., Bhattacharya, A., Menz, W., Habets, E. and Raake, A.

Paper available at: https://ieeexplore.ieee.org/abstract/document/10178588
Dataset available at: https://qoevave.github.io/database/docs/Saliency

A collaboration between the Technische Universität Ilmenau and the International Audio Laboratories of Erlangen, both in Germany.

This dataset uses a between-subjects test design to collect users’ exploration data of 360-degree videos in a free-form viewing scenario using the Varjo XR-3 Head Mounted Display, in the presence of no, mono, and 4th-order Ambisonics audio. Saliency information was captured as head-saliency in terms of the center of a viewport at 50 Hz. For each item, subjects were asked to describe the scene with a short free-verbalization task. Moreover, cybersickness was assessed using the simulator sickness questionnaire at the beginning and at the end of the test. The data is sought to enable training of visual and audiovisual saliency prediction models for interactive experiences.

A Subjective Dataset for Multi-Screen Video Streaming Applications
Barman, N., Reznik Y. and Martini, M. G.

Paper available at: https://ieeexplore.ieee.org/document/10178645
Dataset available at: https://github.com/NabajeetBarman/Multiscreen-Dataset

A collaboration between Brightcove (London, UK and Seattle, USA) and Kingston University Londong, UK.

This paper presents a new, open-source dataset consisting of subjective ratings for various encoded video sequences of different resolutions and bitrates (quality) when viewed on three devices of varying screen sizes: TV, Tablet, and Mobile. Along with the subjective scores, an evaluation of some of the most famous and commonly used open-source objective quality metrics is also presented. It is observed that the performance of the metrics varies a lot across different device types, with the recently standardized ITU-T P.1204.3 Model, on average, outperforming their full-reference counterparts.

QoMEX’24

5 dataset full papers were presented at the 16th International Conference on Quality of Multimedia Experience (QoMEX 2024), organized in Karshamn, Sweden, June 18 – 20, 2024 (https://qomex2024.itec.aau.at/). The complete QoMEX ’24 Proceedings is available in the IEEE Xplore Digital Library (https://ieeexplore.ieee.org/xpl/conhome/10597667/proceeding ).

These datasets were presented within the Datasets session, chaired by Dr. Mohsen Jenadeleh. Given the scope of the conference (i.e., Quality of Multimedia Experience), these five papers present contributions focused on the impact on user perception of HDR videos (UHD-1, 8K, and AV1), immersive 360° video and light fields. This last contribution was awarded the best paper award of the conference.

AVT-VQDB-UHD-1-HDR: An Open Video Quality Dataset for Quality Assessment of UHD-1 HDR Videos
Ramachandra Rao, R. R., Herb, B., Helmi-Aurora, T., Ahmed, M. T, Raake, A.

Paper available at: https://ieeexplore.ieee.org/document/10598284
Dataset available at: https://github.com/Telecommunication-Telemedia-Assessment/AVT-VQDB-UHD-1-HDR

A work from Technische Universität Ilmenau, Germany.

This dataset deals with the assessment of the perceived quality of HDR videos. Firstly, a subjective test with 4K/UHD1 HDR videos using the ACR-HR (Absolute Category Rating – Hidden Reference) method was conducted. The tests consisted of a total of 195 encoded videos from 5 source videos which all had a framerate of 60 fps. In this test, the 4K/UHD-1 HDR stimuli were encoded at four different resolutions, namely, 720p, 1080p, 1440p, and 2160p using bitrates ranging between 0.5 Mbps and 40 Mbps. The results of the subjective test have been analyzed to assess the impact of factors such as resolution, bitrate, video codec, and content on the perceived video quality.

AVT-VQDB-UHD-2-HDR: An open 8K HDR source dataset for video quality research
Keller, D., Goebel, T., Sievenkees, V., Prenzel, J., Raake, A.

Paper available at: https://ieeexplore.ieee.org/document/10598268
Dataset available at: https://github.com/Telecommunication-Telemedia-Assessment/AVT-VQDB-UHD-2-HDR

A work from Techniche Universität Ilmenau, Germany.

The AVT-VQDB-UHD-2-HDR dataset consists of 31 8K HDR video sources of 15s created with the goal of accurately representing real-life footage, while taking into account video coding and video quality testing challenges.

The effect of viewing distance and display peak luminance – HDR AV1 video streaming quality dataset
Hammou, D., Krasula, L., Bampis, C., Li, Z., Mantiuk, R.,

Paper available at: https://ieeexplore.ieee.org/document/10598289
Dataset available at: https://doi.org/10.17863/CAM.107964

A collaboration between University of Cambridge (UK) and Netflix Inc. (USA).

The HDR-VDC dataset captures the quality degradation of HDR content due to AV1 coding artifacts and the resolution reduction. The quality drop was measured at two viewing distances, corresponding to 60 and 120 pixels per visual degree, and two display mean luminance levels, 51 and 5.6 nits. It employs a highly sensitive pairwise comparison protocol with active sampling and comparisons across viewing distances to ensure possibly accurate quality measurements. It also provides the first publicly available dataset that measures the effect of display peak luminance and includes HDR videos encoded with AV1.

A Spherical Light Field Database for Immersive Telecommunication and Telepresence Applications (Best Paper Award)
Zerman, E., Gond, M., Takhtardeshir, S., Olsson, R., Sjöström, M.

Paper available at: https://ieeexplore.ieee.org/document/10598264
Dataset available at: https://zenodo.org/records/13342006

A work presented from Mid Sweden University, Sundsvall, Sweden.

The Spherical Light Field Database (SLFDB) consists of a light field of 60 views captured with an omnidirectional camera in 20 scenes. To show the usefulness of the proposed database, we provide two use cases: compression and viewpoint estimation. The initial results validate that the publicly available SLFDB will benefit the scientific community.

AVT-ECoClass-VR: An open-source audiovisual 360° video and immersive CGI multi-talker dataset to evaluate cognitive performance
Fremerey, S., Breuer, C., Leist, L., Klatte, M., Fels, J., Raake, A.

Paper available at: https://ieeexplore.ieee.org/document/10598262
Dataset available at: https://github.com/Telecommunication-Telemedia-Assessment/AVT-ECoClass-VR

A collaboration work between Technische Universität Ilmenau, RWTH Aache University and RPTU Kaiserslautern (Germany).

This dataset includes two audiovisual scenarios (360◦ video and computer-generated imagery) and two implementations for dataset playback. The 360◦ video part of the dataset features 200 video and single-channel audio recordings of 20 speakers reading ten stories, and 20 videos of speakers in silence, resulting in a total of 220 video and 200 audio recordings. The dataset also includes one 360◦ background image of a real primary school classroom scene, targeting young school children for subsequent subjective tests. The second part of the dataset comprises 20 different 3D models of the speakers and a computer-generated classroom scene, along with an immersive audiovisual virtual environment implementation that can be interacted with using an HTC Vive controller.

Overview of Benchmarking Platforms and Software for Multimedia Applications

By gabi | November 15, 2023 - 12:56 |December 5, 2023 0423, Datasets Column, Feature

Leave a comment

In a time where Artificial Intelligence (AI) continues to push the boundaries of what was previously thought possible, the demand for benchmarking platforms that allow to fairly assess and evaluate AI models has become paramount. These platforms serve as connecting hubs between data scientists, machine learning specialists, industry partners, and other interested parties. They mostly function under the Evaluation-as-a-Service (EaaS) paradigm [1], the idea that participants that do a certain benchmarking task should be able to test the output of their systems in similar conditions, by being provided with a common definition of the targeted concepts, datasets and data splits, metrics, and evaluation tools. These common elements are provided through online platforms that can even offer Application Programming Interfaces (APIs) or container-level integration of the participants’ AI models. This column provides an insight into these platforms, looking at their main characteristics, use cases, and particularities. In the second part of the column we will also look into some of the main benchmarking platforms that are geared towards handling multimedia-centric benchmarks and datasets, relevant to SIGMM.

Defining Characteristics of EaaS platforms

Benchmarking competitions and initiatives, and EaaS platforms attempt to tackle a number of keypoints in the development of AI algorithms and models, namely:

Creating a fair and impartial evaluation environment, by standardizing the datasets and evaluation metrics used by all participants to an evaluation competition. In doing so, EaaS platforms play a pivotal role in promoting transparency and comparability in AI models and approaches.
Enhancing reproducibility by giving the option to run the AI models on dedicated servers provided and managed by competition organizers. This increases the trust and bolsters the integrity of the results produced by competition participants, as the organizers are able to closely monitor the testing process for each individual AI model.
Fostering, as a natural consequence, a higher degree of data privacy, as participants could be given access only to training data, while testing data is kept private and is only accessed via APIs on the dedicated servers, reducing the risk of data exposure.
Creating a common repository for the sharing the data and details of a benchmarking task, building a history not only of the results of the benchmarking tasks throughout the years, but also of the evolution of the types of approaches and models used by participants. Other interesting features, like the existence of forums and discussion threads on competitions, allow new participants to quickly search for problems they encounter and hopefully have a quicker resolution of their issues.

Given these common goals, benchmarking platforms usually integrate a set of common features and user-level functionalities that are summed up in this section and grouped into three categories: task organization and scheduling, scoring and reproducibility, and communication and dissemination.

Task organization and scheduling. The platforms allow the creation, modification and maintenance of benchmarking tasks, either through a graphical user interface (GUI) or by using task bundles (most commonly using JSON, XML, Python or custom scripting languages). Competition organizers can define their task, and define sub-tasks that may explore different facets of the targeted data. Scheduling is another important feature in benchmarking competition creation, as some parts of the data may be kept private until a certain moment in time, and allow the competition organizers to hide the results of other teams until a certain point in time. We consider the last point an important one, as participants may feel discouraged from continuing their participation if their initial results are not high enough compared with other participants. Another noteworthy feature is the run quantity management that allows organizers to specify a maximum number of allowed runs per participant during the benchmarking task. This limitation discourages participants from attempting to solve the given tasks with brute force approaches, where they implement a large number of models and model variations. As a result, participants are incentivized to delve deeper into the data, critically analyzing why certain methods succeed and others fall short.

Scoring and reproducibility. EaaS platforms generally deploy two paradigms, sometimes side-by-side, with regards to AI model testing and results generation [1, 2]: the Data-to-Algorithm (D2A) approach, and the Algorithm-to-Data (A2D) approach. The former refers to competitions where participants must download the testing set, run the prediction systems on their own machines, and provide the predictions to the organizers, usually in CSV format for the multimedia domain. In this setup, the ground truth data for the testing set is kept private, and after the organizers receive the prediction result files, they communicate the performance to the participants, or the results are automatically computed by the platform by organizer-provided scripts, once the files are uploaded to it. The A2D approach on the other hand is more complex, may incur additional financial costs, and may be more time consuming for both organizers and task participants, but increases the trustworthiness and reproducibility of the task and AI models themselves. In this setup, organizers provide cloud-based computing resources via Virtual Machines (VMs) and containers, and a common processing pipeline or API that competitors must integrate in their source code. The participants develop the wrappers that integrate their AI models accordingly, and upload the model to the EaaS platforms directly. The AI models are then executed according to the common pipeline and results are automatically provided to the participants, while also allowing for the testing data to be kept completely private. Traditionally, in order to achieve this, EaaS platforms offer the possibility of integration with cloud computing platforms like Amazon AWS, Microsoft Azure, or Google Cloud, and offer Docker integration for the creation of containers where the code can be hosted.

Communication and dissemination. EaaS platforms allow the interaction between competition organizers and participants, either through emails, automatic notifications, or forums where interested parties can exchange ideas, ask questions, offer help, signal potential problems in the data or scripts associated with the tasks.

Popular multimedia EaaS platforms

This section presents some of the most popular benchmarking platforms aimed at the multimedia domain. We will present some key features and associated popular multimedia datasets for the following platforms: Kaggle, AIcrowd, Codabench, Drivendata, and EvalAI.

Kaggle represents perhaps the top-most popular benchmarking platform at this moment, and goes beyond the scope of providing datasets and benchmarking competitions, also hosting AI models, courses, and source code repositories. Competition organizers can design the tasks under either of the D2A or A2D paradigms, giving participants the possibility of integrating their AI models in Jupyter Notebooks for reproducibility. The platform also gives the option of alloting CPU and GPU cloud-based resources for A2D competitions. The Kaggle repository offers code for a large number of additional competition management tools and communication APIs. Among an impressive number of datasets and competitions, Kaggle currently hosts competitions that use the MNIST original data [3], as well as other MNIST-like datasets like Fashion-MNIST [4], as well as datasets on varied subjects ranging from sentiment analysis in social media [5] to medical image processing [6].

AIcrowd is an open source EaaS platform for open benchmarking challenges that puts an accent on connections and collaborative work between data science and machine learning experts. This platform offers the source code for command line interface (CLI) and API clients that can interact with AIcrowd servers. ImageCLEF, between 2018 and 2022 [7 – 11], is one of the most popular multimedia benchmarking initiatives hosted on AICrowd, featuring diverse multimedia topics such as lifelogging, medical image processing, image processing for environment health prediction, the analysis of social media dangers with regards to image sharing, and ensemble learning for multimedia data.

Codabench, launched in August 2023, and its precursor CodaLab, are two open source benchmarking platforms that provide a large number of options, including A2D and D2A approaches, as well as “inverted benchmarks”, where organizers provide the reference algorithms and participants contribute with the datasets. Among the current running challenges on this platform standouts are the two Quality-of-Service-oriented challenges on audio-video synchronization error detection and error measurement challenges that are part of the 3rd Workshop on Image/Video/Audio Quality in Computer Vision and Generative AI at the Winter Conference on Applications of Computer Vision – WACV2024.

Drivendata targets the intersection of data science and social impact. This platform hosts competitions that integrate the social aspect of their domain of interest directly in their mission and definition, while also hosting a number of open-source projects and competition-winning AI models. Given its accent on social impact, this platform hosts a number of benchmarking challenges that target social issues like the detection of hateful memes [12] and image-based nature conservation efforts.

EvalAI is another open source platform that is able to create A2D and D2A competition environments, while also integrating optimization steps that allow for evaluation code to run faster on multi-core cloud infrastructure. The EvalAI platform holds many diverse multimedia-centric competitions, including image segmentation tasks based on LVIS [13] and a wide range of sport tasks [14].

Future directions, developments and other tools

While the tools and platforms described in the previous section represent just a portion of the number of EaaS platform currently online in the research community, we would also like to mention some projects that are currently in the development stage or that can be considered additional tools for benchmarking initiatives:

The AI4Media benchmarking platform, is a benchmarking platform that is currently in the prototype and development stage. Among its most interesting features and ideas promoted by the platform developers is the creation of complexity metrics that would help competition organizers understand the computational efficiency and resource requirements for the submitted systems.
The BenchmarkSTT started as a specialized benchmarking platform for speech-to-text, but is now evolving in different directions, including facial recognition in videos.
The PapersWithCode platform, while not a benchmarking platform per se, is useful as a repository that collects the results AI model on datasets throughout the years, and groups different datasets studying the same concepts under the same umbrella (i.e., Image Classification, Object Detection, Medical Image Segmentation, etc.), while also providing links to scientific papers, github implementations of the models, and links to the datasets. This may represent a good starting point for young researchers that are trying to understand the history and state-of-the-art for certain domains and applications.

Conclusions

Benchmarking platforms represent a key component of benchmarking, pushing for fairness and trustworthiness in AI model comparison, while also providing tools that may foster reproducibility in AI. We are happy to see that many of the platforms discussed in this article are open source, or have open source components, thus allowing interested scientists to create their own custom implementations of these platforms, and to adapt them when necessary to their particular fields.

Acknowledgements

The work presented in this column is supported under the H2020 AI4Media “A European Excellence Centre for Media, Society and Democracy” project, contract #951911.

References

[1] Hanbury, A., Müller, H., Balog, K., Brodt, T., Cormack, G. V., Eggel, I., Gollub, T., Hopfgartner, F., Kalpathy-Cramer, J., Kando, N., Krithara, A., Lin, J., Mercer, S. & Potthast, M. (2015). Evaluation-as-a-service: Overview and outlook. arXiv preprint arXiv:1512.07454.
[2] Hanbury, A., Müller, H., Langs, G., Weber, M. A., Menze, B. H., & Fernandez, T. S. (2012). Bringing the algorithms to the data: cloud–based benchmarking for medical image analysis. In Information Access Evaluation. Multilinguality, Multimodality, and Visual Analytics: Third International Conference of the CLEF Initiative, CLEF 2012, Rome, Italy, September 17-20, 2012. Proceedings 3 (pp. 24-29). Springer Berlin Heidelberg.
[3] LeCun, Y., Bottou, L., Bengio, Y., & Haffner, P. (1998). Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11), 2278-2324.
[4] Xiao, H., Rasul, K., & Vollgraf, R. (2017). Fashion-mnist: a novel image dataset for benchmarking machine learning algorithms. arXiv preprint arXiv:1708.07747.
[5] Niu, T., Zhu, S., Pang, L., & El Saddik, A. (2016). Sentiment analysis on multi-view social data. In MultiMedia Modeling: 22nd International Conference, MMM 2016, Miami, FL, USA, January 4-6, 2016, Proceedings, Part II 22 (pp. 15-27). Springer International Publishing.
[6] Thambawita, V., Hicks, S. A., Storås, A. M., Nguyen, T., Andersen, J. M., Witczak, O., … & Riegler, M. A. (2023). VISEM-Tracking, a human spermatozoa tracking dataset. Scientific Data, 10(1), 1-8.
[7] Ionescu, B., Müller, H., Villegas, M., García Seco de Herrera, A., Eickhoff, C., Andrearczyk, V., … & Gurrin, C. (2018). Overview of ImageCLEF 2018: Challenges, datasets and evaluation. In Experimental IR Meets Multilinguality, Multimodality, and Interaction: 9th International Conference of the CLEF Association, CLEF 2018, Avignon, France, September 10-14, 2018, Proceedings 9 (pp. 309-334). Springer International Publishing.
[8] Ionescu, B., Müller, H., Péteri, R., Dang-Nguyen, D. T., Piras, L., Riegler, M., … & Karampidis, K. (2019). ImageCLEF 2019: Multimedia retrieval in lifelogging, medical, nature, and security applications. In Advances in Information Retrieval: 41st European Conference on IR Research, ECIR 2019, Cologne, Germany, April 14–18, 2019, Proceedings, Part II 41 (pp. 301-308). Springer International Publishing.
[9] Ionescu, B., Müller, H., Péteri, R., Dang-Nguyen, D. T., Zhou, L., Piras, L., … & Constantin, M. G. (2020). ImageCLEF 2020: Multimedia retrieval in lifelogging, medical, nature, and internet applications. In Advances in Information Retrieval: 42nd European Conference on IR Research, ECIR 2020, Lisbon, Portugal, April 14–17, 2020, Proceedings, Part II 42 (pp. 533-541). Springer International Publishing.
[10] Ionescu, B., Müller, H., Péteri, R., Abacha, A. B., Demner-Fushman, D., Hasan, S. A., … & Popescu, A. (2021). The 2021 ImageCLEF Benchmark: Multimedia retrieval in medical, nature, internet and social media applications. In Advances in Information Retrieval: 43rd European Conference on IR Research, ECIR 2021, Virtual Event, March 28–April 1, 2021, Proceedings, Part II 43 (pp. 616-623). Springer International Publishing.
[11] de Herrera, A. G. S., Ionescu, B., Müller, H., Péteri, R., Abacha, A. B., Friedrich, C. M., … & Dogariu, M. (2022, April). Imageclef 2022: multimedia retrieval in medical, nature, fusion, and internet applications. In European Conference on Information Retrieval (pp. 382-389). Cham: Springer International Publishing.
[12] Kiela, D., Firooz, H., Mohan, A., Goswami, V., Singh, A., Fitzpatrick, C. A., … & Parikh, D. (2021, August). The hateful memes challenge: Competition report. In NeurIPS 2020 Competition and Demonstration Track (pp. 344-360). PMLR.
[13] Gupta, A., Dollar, P., & Girshick, R. (2019). Lvis: A dataset for large vocabulary instance segmentation. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (pp. 5356-5364).
[14] Giancola, S., Cioppa, A., Deliège, A., Magera, F., Somers, V., Kang, L., … & Li, Z. (2022, October). SoccerNet 2022 challenges results. In Proceedings of the 5th International ACM Workshop on Multimedia Content Analysis in Sports (pp. 75-86).

Overview of Open Dataset Sessions and Benchmarking Competitions in 2022 – Part 2 (MDRE at MMM 2022, ACM MM 2022)

By karel | January 20, 2023 - 14:45 |March 1, 2023 0123, Datasets Column, Feature

Leave a comment

In this Dataset Column, we present a review of some of the notable events related to open datasets and benchmarking competitions in the field of multimedia. This year’s selection highlights the wide range of topics and datasets currently of interest to the community. Some of the events covered in this review include special sessions on open datasets and competitions featuring multimedia data. While this list is not exhaustive and contains an overview of about 40 datasets, it is meant to showcase the diversity of subjects and datasets explored in the field. This year’s review follows similar efforts from the previous year (https://records.sigmm.org/2022/01/12/overview-of-open-dataset-sessions-and-benchmarking-competitions-in-2021/), highlighting the ongoing importance of open datasets and benchmarking competitions in advancing research and development in multimedia. The column is divided into three parts, in this one we focus on MDRE at MMM 2022 and ACM MM 2022:

Multimedia Datasets for Repeatable Experimentation at 28th International Conference on Multimedia Modeling (MDRE at MMM 2022 – https://mmm2022.org/ssp.html#mdre). We summarize the three datasets presented during the MDRE, addressing several topics like user-centric video search competition, dataset (GPR1200) to evaluate the performance of deep neural networks for general image retrieval, and dataset for evaluating the performance of Question Answering (QA) systems on lifelog data (LLQA).
Selected datasets at the 30th ACM Multimedia Conference (MM ’22 – https://2022.acmmm.org/). For a general report from ACM Multimedia 2022 please see (https://records.sigmm.org/2022/12/07/report-from-acm-multimedia-2022-by-nitish-nagesh/). We summarize nine datasets presented during the conference, targeting several topics like dataset for multimodal intent recognition (MintRec), audio-visual question answering dataset (AVQA), large-scale radar dataset (mmWave), multimodal sticker emotion recognition dataset (SER30K), video-sentence dataset for vision-language pre-training (ACTION), dataset of head and gaze behavior for 360-degree videos, saliency in augmented reality dataset (SARD), multi-modal dataset spotting the differences between pairs of similar images (DialDiff), and large-scale remote sensing images dataset (RSVG).

For the overview of datasets related to QoMEX 2022 and ODS at MMSys ’22 please check the first part (https://records.sigmm.org/?p=12292), while ImageCLEF 2022 and MediaEval 2022 are addressed in the third part (http://records.sigmm.org/?p=12362).

MDRE at MMM 2022

The Multimedia Datasets for Repeatable Experimentation (MDRE) special session is part of the 2022 International Conference on Multimedia Modeling (MMM 2022), supporting both online and onsite presentation, Phu Quoc, Vietnam, June 6-10, 2022. The session was organized by Cathal Gurrin (Dublin City University, Ireland), Duc-Tien Dang-Nguyen (University of Bergen, Norway), Björn Þór Jónsson (IT University of Copenhagen, Denmark), Adam Jatowt (University of Innsbruck, Austria), Liting Zhou (Dublin City University, Ireland) and Graham Healy (Dublin City University, Ireland). Details regarding this session can be found at: https://mmm2022.org/ssp.html#mdre.

The MDRE’22 special session at MMM’22, is the fourth MDRE session, and it represents an opportunity for interested researchers to submit their datasets to this track. The work submitted to MDRE is permanently available at http://mmdatasets.org, where all the current and past editions of MDRE are hosted. Authors are asked to provide a paper describing its motivation, design, and usage, a brief summary of the experiments performed to date on the dataset, and a discussion of how it can be useful to the community, along with the dataset in itself.

A Task Category Space for User-Centric Comparative Multimedia Search Evaluations
Paper available at: https://doi.org/10.1007/978-3-030-98358-1_16
Lokoč, J., Bailer, W., Barthel, K.U., Gurrin, C., Heller, S., Jónsson, B., Peška, L., Rossetto, L., Schoeffmann, K., Vadicamo, L., Vrochidis, S., Wu, J.
Charles University, Prague, Czech Republic; JOANNEUM RESEARCH, Graz, Austria; HTW Berlin, Berlin, Germany; Dublin City University, Dublin, Ireland; University of Basel, Basel, Switzerland; IT University of Copenhagen, Copenhagen, Denmark; University of Zurich, Zurich, Switzerland; Klagenfurt University, Klagenfurt, Austria; ISTI CNR, Pisa, Italy; Centre for Research and Technology Hellas, Thessaloniki, Greece; City University of Hong Kong, Hong Kong.
Dataset available at: On request

The authors have analyzed the spectrum of possible task categories and propose a list of individual axes that define a large space of possible task categories. Using this concept of category space, new user-centric video search competitions can be designed to benchmark video search systems from different perspectives. They further analyze the three task categories considered at the Video Browser Showdown and discuss possible (but sometimes challenging) shifts within the task category space.

GPR1200: A Benchmark for General-Purpose Content-Based Image Retrieval
Paper available at: https://doi.org/10.1007/978-3-030-98358-1_17
Schall, K., Barthel, K.U., Hezel, N., Jung, K.
Visual Computing Group, HTW Berlin, University of Applied Sciences, Germany.
Dataset available at: http://visual-computing.com/project/GPR1200

In this study, the authors have developed a new dataset called GPR1200 to evaluate the performance of deep neural networks for general image retrieval (CBIR). They found that large-scale pretraining significantly improves retrieval performance and that further improvement can be achieved through fine-tuning. GPR1200 is presented as an easy-to-use and accessible but challenging benchmark dataset with a broad range of image categories.

LLQA – Lifelog Question Answering Dataset
Paper available at: https://doi.org/10.1007/978-3-030-98358-1_18
Tran, L.-D., Ho, T.C., Pham, L.A., Nguyen, B., Gurrin, C., Zhou, L.
Dublin City University, Dublin, Ireland; Vietnam National University, Ho Chi Minh University of Science, Ho Chi Minh City, Viet Nam; AISIA Research Lab, Ho Chi Minh City, Viet Nam.
Dataset available at: https://github.com/allie-tran/LLQA

This study presents Lifelog Question Answering Dataset (LLQA), a new dataset for evaluating the performance of Question Answering (QA) systems on lifelog data. The dataset includes over 15,000 multiple-choice questions as an augmented 85-day lifelog collection, and is intended to serve as a benchmark for future research in this area. The results of the study showed that QA on lifelog data is a challenging task that requires further exploration.

ACM MM 2022

Numerous dataset-related papers have been presented at the 30th ACM International Conference on Multimedia (MM’ 22), organized in Lisbon, Portugal, October 10 – 14, 2022 (https://2022.acmmm.org/). The complete MM ’22: Proceedings of the 30th ACM International Conference on Multimedia are available in the ACM Digital Library (https://dl.acm.org/doi/proceedings/10.1145/3503161).

There was not a specifically dedicated Dataset session among roughly 35 sessions at the MM ’22 symposium. However, the importance of datasets can be illustrated in the following statistics, quantifying how often the term “dataset” appears in MM ’22 Proceedings. The term appears in the title of 9 papers (7 last year), the keywords of 35 papers (66 last year), and the abstracts of 438 papers (339 last year). As a small example, nine selected papers focused primarily on new datasets with publicly available data are listed below. There are contributions focused on various multimedia applications, e.g., understanding multimedia content, multimodal fusion and embeddings, media interpretation, vision and language, engaging users with multimedia, emotional and social signals, interactions and Quality of Experience, and multimedia search and recommendation.

MIntRec: A New Dataset for Multimodal Intent Recognition
Paper available at: https://doi.org/10.1145/3503161.3547906
Zhang, H., Xu, H., Wang, X., Zhou, Q., Zhao, S., Teng, J.
Tsinghua University, Beijing, China.
Dataset available at: https://github.com/thuiar/MIntRec

MIntRec is a dataset for multimodal intent recognition with 2,224 samples based on the data collected from the TV series Superstore, in text, video, and audio modalities, annotated with twenty intent categories and speaker bounding boxes. Baseline models are built by adapting multimodal fusion methods and show significant improvement over text-only modality. MIntRec is useful for studying relationships between modalities and improving intent recognition.

AVQA: A Dataset for Audio-Visual Question Answering on Videos
Paper available at: https://doi.org/10.1145/3503161.3548291
Yang, P., Wang, X., Duan, X., Chen, H., Hou, R., Jin, C., Zhu, W.
Tsinghua University, Shenzhen, China; Communication University of China, Beijing, China.
Dataset available at: https://mn.cs.tsinghua.edu.cn/avqa

Audio-visual question-answering dataset (AVQA) is introduced for videos in real-life scenarios. It includes 57,015 videos and 57,335 question-answer pairs that rely on clues from both audio and visual modalities. A Hierarchical Audio-Visual Fusing module is proposed to model correlations among audio, visual, and text modalities. AVQA can be used to test models with a deeper understanding of multimodal information on audio-visual question answering in real-life scenarios.

mmBody Benchmark: 3D Body Reconstruction Dataset and Analysis for Millimeter Wave Radar
Paper available at: https://doi.org/10.1145/3503161.3548262
Chen, A., Wang, X., Zhu, S., Li, Y., Chen, J., Ye, Q.
Zhejiang University, Hangzhou, China.
Dataset available at: On request

A large-scale mmWave radar dataset with synchronized and calibrated point clouds and RGB(D) images is presented, along with an automatic 3D body annotation system. State-of-the-art methods are trained and tested on the dataset, showing the mmWave radar can achieve better 3D body reconstruction accuracy than RGB camera but worse than depth camera. The dataset and results provide insights into improving mmWave radar reconstruction and combining signals from different sensors.

SER30K: A Large-Scale Dataset for Sticker Emotion Recognition
Paper available at: https://doi.org/10.1145/3503161.3548407
Liu, S., Zhang, X., Yan, J.
Nankai University, Tianjin, China.
Dataset available at: https://github.com/nku-shengzheliu/SER30K

A new multimodal sticker emotion recognition dataset called SER30K with 1,887 sticker themes and 30,739 images is introduced for understanding emotions in stickers. A proposed method called LORA, using a vision transformer and local re-attention module, effectively extracts visual and language features for emotion recognition on SER30K and other datasets.

Auto-captions on GIF: A Large-scale Video-sentence Dataset for Vision-language Pre-training
Paper available at: https://doi.org/10.1145/3503161.3551581
Pan, Y., Li, Y., Luo, J., Xu, J., Yao, T., Mei, T.
JD Explore Academy, Beijing, China.
Dataset available at: http://www.auto-video-captions.top/2022/dataset

A new large-scale pre-training dataset, Auto-captions on GIF (ACTION), is presented for generic video understanding. It contains video-sentence pairs extracted and filtered from web pages and can be used for pre-training and downstream tasks such as video captioning and sentence localization. Comparisons with existing video-sentence datasets are made.

Where Are You Looking?: A Large-Scale Dataset of Head and Gaze Behavior for 360-Degree Videos and a Pilot Study
Paper available at: https://doi.org/10.1145/3503161.3548200
Jin, Y., Liu, J., Wang, F., Cui, S.
The Chinese University of Hong Kong, Shenzhen, Shenzhen, China.
Dataset available at: https://cuhksz-inml.github.io/head_gaze_dataset/

A dataset of users’ head and gaze behaviors in 360° videos is presented, containing rich dimensions, large scale, strong diversity, and high frequency. A quantitative taxonomy for 360° videos is also proposed, containing three objective technical metrics. Results of a pilot study on users’ behaviors and a case of application in tile-based 360° video streaming show the usefulness of the dataset for improving the performance of existing works.

Saliency in Augmented Reality
Paper available at: https://doi.org/10.1145/3503161.3547955
Duan, H., Shen, W., Min, X., Tu, D., Li, J., Zhai, G.
Shanghai Jiao Tong University, Shanghai, China; Alibaba Group, Hangzhou, China.
Dataset available at: https://github.com/DuanHuiyu/ARSaliency

A dataset, Saliency in AR Dataset (SARD), containing 450 background, 450 AR, and 1350 superimposed images with three mixing levels, is constructed to study the interaction between background scenes and AR contents, and the saliency prediction problem in AR. An eye-tracking experiment is conducted among 60 subjects to collect data.

Visual Dialog for Spotting the Differences between Pairs of Similar Images
Paper available at: https://doi.org/10.1145/3503161.3548170
Zheng, D., Meng, F., Si, Q., Fan, H., Xu, Z., Zhou, J., Feng, F., Wang, X.
Beijing University of Posts and Telecommunications, Beijing, China; WeChat AI, Tencent Inc, Beijing, China; Institute of Information Engineering, Chinese Academy of Sciences, Beijing, China; University of Trento, Trento, Italy.
Dataset available at: https://github.com/zd11024/Spot_Difference

A new visual dialog task called Dial-the-Diff is proposed, in which two interlocutors access two similar images and try to spot the difference between them through conversation in natural language. A large-scale multi-modal dataset called DialDiff, containing 87k Virtual Reality images and 78k dialogs, is built for the task. Benchmark models are also proposed and evaluated to bring new challenges to dialog strategy and object categorization.

Visual Grounding in Remote Sensing Images
Paper available at: https://doi.org/10.1145/3503161.3548316
Sun, Y., Feng, S., Li, X., Ye, Y., Kang, J., Huang, X.
Harbin Institute of Technology, Shenzhen, Shenzhen, China; Soochow University, Suzhou, China.
Dataset available at: https://sunyuxi.github.io/publication/GeoVG

A new problem of visual grounding in large-scale remote sensing images has been presented, in which the task is to locate particular objects in an image by a natural language expression. A new dataset, called RSVG, has been collected and a new method, GeoVG, has been designed to address the challenges of existing methods in dealing with remote sensing images.

Overview of Open Dataset Sessions and Benchmarking Competitions in 2022 – Part 3 (ImageCLEF 2022, MediaEval 2022)

By karel | January 19, 2023 - 14:45 |March 1, 2023 0123, Datasets Column, Feature

Leave a comment

ImageCLEF 2022 (https://www.imageclef.org/2022). We summarize the 5 datasets launched for the benchmarking tasks, related to several topics like social media profile assessment (ImageCLEFaware), segmentation and labeling of underwater coral images (ImageCLEFcoral), late fusion ensembling systems for multimedia data (ImageCLEFfusion) and medical imaging analysis (ImageCLEFmedical Caption, and ImageCLEFmedical Tuberculosis).
MediaEval 2022 (https://multimediaeval.github.io/editions/2022/). We summarize the 11 datasets launched for the benchmarking tasks, that target a wide range of multimedia topics like the analysis of flood related media (DisasterMM), game analytics (Emotional Mario), news item processing (FakeNews, NewsImages), multimodal understanding of smells (MUSTI), medical imaging (Medico), fishing vessel analysis (NjordVid), media memorability (Memorability), sports data analysis (Sport Task, SwimTrack), and urban pollution analysis (Urban Air).

For the overview of datasets related to QoMEX 2022 and ODS at MMSys ’22 please check the first part (https://records.sigmm.org/?p=12292), while MDRE at MMM 2022 and ACM MM 2022 are addressed in the second part (http://records.sigmm.org/?p=12360).

ImageCLEF 2022

ImageCLEF is a multimedia evaluation campaign, part of the clef initiative (http://www.clef-initiative.eu/). The 2022 edition (https://www.imageclef.org/2022) is the 19th edition of this initiative and addresses four main research tasks in several domains like: medicine, nature, social media content and user interface processing. ImageCLEF 2021 is organized by Bogdan Ionescu (University Politehnica of Bucharest, Romania), Henning Müller (University of Applied Sciences Western Switzerland, Sierre, Switzerland), Renaud Péteri (University of La Rochelle, France), Ivan Eggel (University of Applied Sciences Western Switzerland, Sierre, Switzerland) and Mihai Dogariu (University Politehnica of Bucharest, Romania).

ImageCLEFaware
Paper available at: https://ceur-ws.org/Vol-3180/paper-98.pdf
Popescu, A., Deshayes-Chossar, J., Schindler, H., Ionescu, B.
CEA LIST, France; University Politehnica of Bucharest, Romania.
Dataset available at: https://www.imageclef.org/2022/aware

This represents the second edition of the aware task at ImageCLEF, and it seeks to understand in what way do public social media profiles affect users in certain important scenarios, representing a search or application for: a bank loan, an accommodation, a job as waitress/waiter, and a job in IT.

ImageCLEFcoral
Paper available at: https://ceur-ws.org/Vol-3180/paper-97.pdf
Chamberlain, J., de Herrera, A.G.S., Campello, A., Clark, A..
University of Essex, United Kingdom; Wellcome Trust, United Kingdom.
Dataset available at: https://www.imageclef.org/2022/coral

This fourth edition of the coral task addresses the problem of segmenting and labeling a set of underwater images used in the monitoring of coral reefs. The task proposes two subtasks, namely an annotation and localization subtask and a pixel-wise parsing subtask.

ImageCLEFfusion
Paper available at: https://ceur-ws.org/Vol-3180/paper-99.pdf
Ştefan, L-D., Constantin, M.G., Dogariu, M., Ionescu, B.
University Politehnica of Bucharest, Romania.
Dataset available at: https://www.imageclef.org/2022/fusion

This represents the first edition of the fusion task, and it proposes several scenarios adapted for the use of late fusion or ensembling systems. The two scenarios correspond to a regression approach, using data associated with the prediction of media interestingness, and a retrieval scenario, using data associated with search result diversification.

ImageCLEFmedical Tuberculosis
Paper available at: https://ceur-ws.org/Vol-3180/paper-96.pdf
Kozlovski, S., Dicente Cid, Y., Kovalev, V., Müller, H.
United Institute of Informatics Problems, Belarus; Roche Diagnostics, Spain; University of Applied Sciences Western Switzerland, Switzerland; University of Geneva, Switzerland.
Dataset available at: https://www.imageclef.org/2022/medical/tuberculosis

This task is now at its sixth edition, and is being upgraded to a detection problem. Furthermore, two tasks are now included: the detection of lung cavern regions in lung CT images associated with lung tuberculosis and the prediction of 4 binary features of caverns suggested by experienced radiologists.

ImageCLEFmedical Caption
Paper available at: https://ceur-ws.org/Vol-3180/paper-95.pdf
Rückert, J., Ben Abacha, A., de Herrera, A.G.S., Bloch, L., Brüngel, R., Idrissi-Yaghir, A., Schäfer, H., Müller, H., Friedrich, C.M.
University of Applied Sciences and Arts Dortmund, Germany; Microsoft, USA; University of Essex, UK; University Hospital Essen, Germany; University of Applied Sciences Western Switzerland, Switzerland; University of Geneva, Switzerland.
Dataset available at: https://www.imageclef.org/2022/medical/caption

The sixth edition of this task consists of two tasks. In the first task participants must detect relevant medical concepts in a large corpus of medical images, while in the second task coherent captions must be generated for the entirety of the context of medical images, targeting the interplay of many visible concepts.

MediaEval 2022

The MediaEval Multimedia Evaluation benchmark (https://multimediaeval.github.io/) offers challenges in artificial intelligence for multimedia data. This is the 13th edition of MediaEval (https://multimediaeval.github.io/editions/2022/) and 11 tasks were proposed for this edition, targeting a large number of challenges by creating algorithms for retrieval, analysis, and exploration. For this edition, a “Quest for Insight” is pursued, where organizers are encouraged to propose interesting and insightful questions about the concepts that will be explored, and participants are encouraged to push beyond only striving to improve evaluation scores and to also working to achieve deeper understanding about the challenges.

DisasterMM: Multimedia Analysis of Disaster-Related Social Media Data
Preprint available at: https://2022.multimediaeval.com/paper5337.pdf
Andreadis, S., Bozas, A., Gialampoukidis, I., Mavropoulos, T., Moumtzidou, A., Vrochidis, S., Kompatsiaris, I., Fiorin, R., Lombardo, F., Norbiato, D., Ferri, M.
Information Technologies Institute – Centre of Research and Technology Hellas, Greece; Eastern Alps River Basin District, Italy.
Dataset available at: https://multimediaeval.github.io/editions/2022/tasks/disastermm/

The DisasterMM task proposes the analysis of social media data extracted from Twitter, targeting the analysis of natural or man-made disaster posts. For this year, the organizers focused on the analysis of flooding events and proposed two subtasks: relevance classification of posts and location extraction from texts.

Emotional Mario: A Game Analytics Challenge
Preprint or paper not published yet.
Lux, M., Alshaer, M., Riegler, M., Halvorsen, P., Thambawita, V., Hicks, S., Dang-Nguyen, D.-T.,
Alpen-Adria-Universität Klagenfurt, Austria; SimulaMet, Norway; University of Bergen, Norway.
Dataset available at: https://multimediaeval.github.io/editions/2022/tasks/emotionalmario/

Emotional Mario focuses on the Super Mario Bros videogame, analyzing the data associated with gamers that consists of game input, demographics, biomedical data, and video associated with players’ faces. Two subtasks are proposed: event detection, seeking to identify gaming events of a significant importance based on facial videos and biometric data, and gameplay summarization, seeking to select the best moments of gameplay.

FakeNews Detection
Preprint available at: https://2022.multimediaeval.com/paper116.pdf
Pogorelov, K., Schroeder, D.T., Brenner, S., Maulana, A., Langguth, J.
Simula Research Laboratory, Norway; University of Bergen, Norway; Stuttgart Media University, Germany.
Dataset available at: https://multimediaeval.github.io/editions/2022/tasks/fakenews/

The FakeNews Detection task proposes several types of methods of analyzing fake news and the way they spread, using COVID-19 related conspiracy theories. The competition proposes three tasks: the first subtask targets conspiracy detection in text-based data, the second asks participants to analyze graphs of conspiracy posters, while the last one combines the first two, aiming at detection on both text and graph data.

MUSTI – Multimodal Understanding of Smells in Texts and Images
Preprint available at: https://2022.multimediaeval.com/paper9634.pdf
Hürriyetoğlu, A., Paccosi, T., Menini, S., Zinnen, M., Lisena, P., Akdemir, K., Troncy, R., van Erp, M.
KNAW Humanities Cluster DHLab, Netherlands; Fondazione Bruno Kessler, Italy; Friedrich-Alexander-Universität, Germany; EURECOM, France.
Dataset available at: https://multimediaeval.github.io/editions/2022/tasks/musti/

MUSTI is one of the few benchmarks that seek to analyze the underrepresented modality of smell. The organizers seek to further the understanding of descriptions of smell in texts and images, and propose two subtasks: the first one aims at classification of smells based on language and image models, predicting whether texts or images evoke the same smell source or not; while the second subtask targets the participants with identifying what are the common smell sources.

Medical Multimedia Task: Transparent Tracking of Spermatozoa
Preprint available at: https://2022.multimediaeval.com/paper5501.pdf
Thambawita, V., Hicks, S., Storås, A.M, Andersen, J.M., Witczak, O., Haugen, T.B., Hammer, H., Nguyen, T., Halvorsen, P., Riegler, M.A.
SimulaMet, Norway; OsloMet, Norway; The Arctic University of Norway, Norway.
Dataset available at: https://multimediaeval.github.io/editions/2022/tasks/medico/

The Medico Medical Multimedia Task tackles the challenge of tracking sperm cells in video recordings, while analyzing the specific characteristics of these cells. Four subtasks are proposed: a sperm-cell real-time tracking task in videos, a prediction of cell motility task, a catch and highlight task seeking to identify sperm cell speed, and an explainability task.

NewsImages
Preprint available at: https://2022.multimediaeval.com/paper8446.pdf
Kille, B., Lommatzsch, A., Özgöbek, Ö., Elahi, M., Dang-Nguyen, D.-T.
Norwegian University of Science and Technology, Norway; Berlin Institute of Technology, Germany; University of Bergen, Norway; Kristiania University College, Norway.
Dataset available at: https://multimediaeval.github.io/editions/2022/tasks/newsimages/

The goal of the NewsImages task is to further the understanding of the relationship between textual and image content in news articles. Participants are tasked with re-linking and re-matching textual news articles with the corresponding images, based on data gathered from social media, news portals and RSS feeds.

NjordVid: Fishing Trawler Video Analytics Task
Preprint available at: https://2022.multimediaeval.com/paper5854.pdf
Nordmo, T.A.S., Ovesen, A.B., Johansen, H.D., Johansen, D., Riegler, M.A.
The Arctic University of Norway, Norway; SimulaMet, Norway.
Dataset available at: https://multimediaeval.github.io/editions/2022/tasks/njord/

The NjordVid task proposes data associated with fishing vessel recordings, representing a solution to maintaining sustainable fishing practices. Two different tasks are proposed: detection of events on the boat, like movement of people, catching fish, etc, and privacy of on-board personnel.

Predicting Video Memorability
Preprint available at: https://2022.multimediaeval.com/paper2265.pdf
Sweeney, L., Constantin, M.G., Demarty, C.-H., Fosco, C., de Herrera, A.G.S., Halder, S., Healy, G., Ionescu, B., Matran-Fernandez, A., Smeaton, A.F., Sultana, M.
Dublin City University, Ireland; University Politehnica of Bucharest, Romania; InterDigital, France; Massachusetts Institute of Technology Cambridge, USA; University of Essex, UK.
Dataset available at: https://multimediaeval.github.io/editions/2022/tasks/memorability/

The Video Memorability task asks participants to predict how memorable a video sequence is, targeting short-term memorability. Three subtasks are proposed for this edition: a general video-based prediction task where participants are asked to predict the memorability score of a video, a generalization task where training and testing are performed on different sources of data, and an EEG-based task where annotator EEG scans are provided.

Sport Task: Fine Grained Action Detection and Classification of Table Tennis Strokes from Videos
Preprint available at: https://2022.multimediaeval.com/paper4766.pdf
Martin, P.-E., Calandre, J., Mansencal, B., Benois-Pineau, J., Péteri, R., Mascarilla, L., Morlier, J.
Max Planck Institute for Evolutionary Anthropology, Germany; La Rochelle University, France; Univ. Bordeaux, France.
Dataset available at: https://multimediaeval.github.io/editions/2022/tasks/sportsvideo/

The Sport Task aims at action detection and classification in videos recorded at table tennis events. Low inter-class variability makes this task harder than other traditional action classification benchmarks. Two subtasks are proposed: a classification task where participants are asked to label table tennis videos according to the strokes the players make, and a detection task where participants must detect whether a stroke was made.

SwimTrack: Swimmers and Stroke Rate Detection in Elite Race Videos
Preprint available at: https://2022.multimediaeval.com/paper6876.pdf
Jacquelin, N., Jaunet, T., Vuillemot, R., Duffner, S.
École Centrale de Lyon, France; INSA-Lyon, France.
Dataset available at: https://multimediaeval.github.io/editions/2022/tasks/swimtrack/

The SwimTrack comprises 5 different multimedia tracks related to the analysis of competition-level swimming videos, and provides multimodal video, image and audio data. The five subtasks are as follows: a position detection task associating swimmers with the numbers of swimming lanes, a stroke rate detection task, a camera registration task where participants must apply homography projection methods to create a top-view of the pool, a character recognition on scoreboards task, and a sound detection task associated with buzzer sounds.

Urban Air: Urban Life and Air Pollution
Preprint available at: https://2022.multimediaeval.com/paper586.pdf
Dao, M.-S., Dang, T.-H., Nguyen-Tai, T.-L., Nguyen, T.-B., Dang-Nguyen, D.-T.
National Institute of Information and Communications Technology, Japan; Dalat University, Vietnam; LOC GOLD Technology MTV Ltd. Co, Vietnam; University of Science, Vietnam National University in HCM City, Vietnam; Bergen University, Norway.
Dataset available at: https://multimediaeval.github.io/editions/2022/tasks/urbanair/

The Urban Air task provides multimodal data that allows the analysis of air pollution and pollution patterns in urban environments. The organizers created two subtasks for this competition: a multimodal/crossmodal air quality index prediction task using station and/or CCTV data, and a periodic traffic pollution pattern discovery task.

Overview of Open Dataset Sessions and Benchmarking Competitions in 2022 – Part 1 (QoMEX 2022, ODS at MMSys ’22)

By karel | January 18, 2023 - 14:44 |March 1, 2023 0123, Datasets Column, Feature

Leave a comment

14th International Conference on Quality of Multimedia Experience (QoMEX 2022 – https://qomex2022.itec.aau.at/). We summarize three datasets included in this conference, that address QoE studies on audiovisual 360° video, storytelling for quality perception and energy consumption while streaming video QoE.
Open Dataset and Software Track at 13th ACM Multimedia Systems Conference (ODS at MMSys ’22 – https://mmsys2022.ie/). We summarize nine datasets presented at the ODS track, targeting several topics, including surveillance videos from a fishing vessel (Njord), multi-codec 8K UHD videos (8K MPEG-DASH dataset), light-field (LF) synthetic immersive large-volume plenoptic dataset (SILVR), a dataset of online news items and the related task of rematching (NewsImages), video sequences, characterized by various complexity categories (VCD), QoE dataset of realistic video clips for real networks, dataset of 360° videos with subjective emotional ratings (PEM360), free-viewpoint video dataset, and cloud gaming dataset (CGD).

For the overview of datasets related to MDRE at MMM 2022 and ACM MM 2022 please check the second part (http://records.sigmm.org/?p=12360), while ImageCLEF 2022 and MediaEval 2022 are addressed in the third part (http://records.sigmm.org/?p=12362).

QoMEX 2022

Three dataset papers were presented at the International Conference on Quality of Multimedia Experience (QoMEX 2022), organized in Lippstadt, Germany, September 5 – 7, 2022 (https://qomex2022.itec.aau.at/). The complete QoMEX ’22 Proceeding is available in the IEEE Digital Library (https://ieeexplore.ieee.org/xpl/conhome/9900491/proceeding).

These datasets were presented within the Databases session, chaired by Professor Oliver Hohlfeld. These three papers present contributions focused on audiovisual 360-degree videos, storytelling for quality perception and modelling of energy consumption and streaming of video QoE.

Audiovisual Database with 360° Video and Higher-Order Ambisonics Audio for Perception, Cognition, Behavior and QoE Evaluation Research
Paper available at: https://ieeexplore.ieee.org/document/9900893
Robotham, T., Singla, A., Rummukainen, O., Raake, A. and Habets, E.
International Audio Laboratories Erlangen, A joint institution of the Friedrich-Alexander-Universitat Erlangen-Nurnberg (FAU) and Fraunhofer Institute for Integrated Circuits (IIS), Germany; TU Ilmenau, Germany.
Dataset available at: https://qoevave.github.io/database/

This publicly available database provides audiovisual 360° content with high-order Ambisonics audio. It consists of twelve scenes capturing real-life nature and urban environments with a video resolution of 7680×3840 at 60 frames-per-second and with 4th-order Ambisonics audio. These 360° video sequences, with an average duration of 60 seconds, represent real-life settings for systematically evaluating various dimensions of uni-/multi-modal perception, cognition, behavior, and QoE. It provides high-quality reference material with a balanced focus on auditory and visual sensory information.

The Storytime Dataset: Simulated Videotelephony Clips for Quality Perception Research
Paper available at: https://ieeexplore.ieee.org/document/9900888
Spang, R. P., Voigt-Antons, J. N. and Möller, S.
Technische Universität Berlin, Berlin, Germany; Hamm-Lippstadt University of Applied Sciences, Lippstadt, Germany.
Dataset available at: https://osf.io/cyb8w/

This is a dataset of simulated videotelephony clips to act as stimuli in quality perception research. It consists of four different stories in the German language that are told through ten consecutive parts, each about 10 seconds long. Each of these parts is available in four different quality levels, ranging from perfect to stalling. All clips (FullHD, H.264 / AAC) are actual recordings from end-user video-conference software to ensure ecological validity and realism of quality degradation. Apart from a detailed description of the methodological approach, we contribute the entire stimuli dataset containing 160 videos and all rating scores for each file.

Modelling of Energy Consumption and Streaming Video QoE using a Crowdsourcing Dataset
Paper available at: https://ieeexplore.ieee.org/document/9900886
Herglotz, C, Robitza, W., Kränzler, M., Kaup, A. and Raake, A.
Friedrich-Alexander-Universität, Erlangen, Germany; Audiovisual Technology Group, TU Ilmenau, Germany; AVEQ GmbH, Vienna, Austria.
Dataset available at: On request

This paper performs a first analysis of end-user power efficiency and Quality of Experience of a video streaming service. A crowdsourced dataset comprising 447,000 streaming events from YouTube is used to estimate both the power consumption and perceived quality. The power consumption is modelled based on previous work, which extends toward predicting the power usage of different devices and codecs. The user-perceived QoE is estimated using a standardized model.

ODS at MMSys ’22

The traditional Open Dataset and Software Track (ODS) was a part of the 13th ACM Multimedia Systems Conference (MMSys ’22) organized in Athlone, Ireland, June 14 – 17, 2022 (https://mmsys2022.ie/). The complete MMSys ’22: Proceedings of the 13th ACM Multimedia Systems Conference are available in the ACM Digital Library (https://dl.acm.org/doi/proceedings/10.1145/3524273).

The Open Dataset and Software Chairs for MMSys ’22 were Roberto Azevedo (Disney Research, Switzerland), Saba Ahsan (Nokia Technologies, Finland), and Yao Liu (Rutgers University, USA). The ODS session with 14 papers has been initiated with pitches on Wednesday, June 15, followed by a poster session. There have been nine dataset papers presented out of fourteen contributions. A listing of the paper titles, dataset summaries, and associated DOIs is included below for your convenience.

Njord: a fishing trawler dataset
Paper available at: https://doi.org/10.1145/3524273.3532886
Nordmo, T.-A.S., Ovesen, A.B., Juliussen, B.A., Hicks, S.A., Thambawita, V., Johansen, H.D., Halvorsen, P., Riegler, M.A., Johansen, D.
UiT the Arctic University of Norway, Norway; SimulaMet, Norway; Oslo Metropolitan University, Norway.
Dataset available at: https://doi.org/10.5281/zenodo.6284673

This paper presents Njord, a dataset of surveillance videos from a commercial fishing vessel. The dataset aims to demonstrate the potential for using data from fishing vessels to detect accidents and report fish catches automatically. The authors also provide a baseline analysis of the dataset and discuss possible research questions that it could help answer.

Multi-codec ultra high definition 8K MPEG-DASH dataset
Paper available at: https://doi.org/10.1145/3524273.3532889
Taraghi, B., Amirpour, H., Timmerer, C.
Christian Doppler Laboratory Athena, Institute of Information Technology (ITEC), Alpen-Adria-Universität Klagenfurt, Klagenfurt, Austria.
Dataset available at: http://ftp.itec.aau.at/datasets/mmsys22/

This paper presents a dataset of multimedia assets encoded with various video codecs, including AVC, HEVC, AV1, and VVC, and packaged using the MPEG-DASH format. The dataset includes resolutions up to 8K and has a maximum media duration of 322 seconds, with segment lengths of 4 and 8 seconds. It is intended to facilitate research and development of video encoding technology for streaming services.

SILVR: a synthetic immersive large-volume plenoptic dataset
Paper available at: https://doi.org/10.1145/3524273.3532890
Courteaux, M., Artois, J., De Pauw, S., Lambert, P., Van Wallendael, G.
Ghent University – Imec, Oost-Vlaanderen, Zwijnaarde, Belgium.
Dataset available at: https://idlabmedia.github.io/large-lightfields-dataset/

SILVR (synthetic immersive large-volume plenoptic dataset) is a light-field (LF) image dataset allowing for six-degrees-of-freedom navigation in larger volumes while maintaining full panoramic field of view. It includes three virtual scenes with 642-2226 views, rendered with 180° fish-eye lenses and featuring color images and depth maps. The dataset also includes multiview rendering software and a lens-reprojection tool. SILVR can be used to evaluate LF coding and rendering techniques.

NewsImages: addressing the depiction gap with an online news dataset for text-image rematching
Paper available at: https://doi.org/10.1145/3524273.3532891
Lommatzsch, A., Kille, B., Özgöbek, O., Zhou, Y., Tešić, J., Bartolomeu, C., Semedo, D., Pivovarova, L., Liang, M., Larson, M.
DAI-Labor, TU-Berlin, Berlin, Germany; NTNU, Trondheim, Norway; Texas State University, San Marcos, TX, United States; Universidade Nova de Lisboa, Lisbon, Portugal.
Dataset available at: https://multimediaeval.github.io/editions/2021/tasks/newsimages/

NewsImages is a dataset of online news items and the related task of news images rematching, which aims to study the “depiction gap” between the content of an image and the text that accompanies it. The dataset is useful for studying connections between image and text and addressing the depiction gap, including sparse data, diversity of content, and the importance of background knowledge.

VCD: Video Complexity Dataset
Paper available at: https://doi.org/10.1145/3524273.3532892
Amirpour, H., Menon, V.V., Afzal, S., Ghanbari, M., Timmerer, C.
Christian Doppler Laboratory Athena, Institute of Information Technology (ITEC), Alpen-Adria-Universität Klagenfurt, Klagenfurt, Austria; School of Computer Science and Electronic Engineering, University of Essex, Colchester, United Kingdom.
Dataset available at: https://ftp.itec.aau.at/datasets/video-complexity/

The Video Complexity Dataset (VCD) is a collection of 500 Ultra High Definition (UHD) resolution video sequences, characterized by spatial and temporal complexities, rate-distortion complexity, and encoding complexity with the x264 AVC/H.264 and x265 HEVC/H.265 video encoders. It is suitable for video coding applications such as video streaming, two-pass encoding, per-title encoding, and scene-cut detection. These sequences are provided at 24 frames per second (fps) and stored online in losslessly encoded 8-bit 4:2:0 format.

Realistic video sequences for subjective QoE analysis
Paper available at: https://doi.org/10.1145/3524273.3532894
Hodzic, K., Cosovic, M., Mrdovic, S., Quinlan, J.J., Raca, D.
Faculty of Electrical Engineering, University of Sarajevo, Bosnia and Herzegovina; School of Computer Science & Information Technology, University College Cork, Ireland.
Dataset available at: https://shorturl.at/dtISV

The DashReStreamer framework is designed to recreate adaptively streamed video in real networks to evaluate user Quality of Experience (QoE). The authors have also created a dataset of 234 realistic video clips, based on video logs collected from real mobile and wireless networks, including video logs and network bandwidth profiles. This dataset and framework will help researchers understand the impact of video QoE dynamics on multimedia streaming.

PEM360: a dataset of 360° videos with continuous physiological measurements, subjective emotional ratings and motion traces
Paper available at: https://doi.org/10.1145/3524273.3532895
Guimard, Q., Robert, F., Bauce, C., Ducreux, A., Sassatelli, L., Wu, H.-Y., Winckler, M., Gros, A.
Université Côte d’Azur, Inria, CNRS, I3S, Sophia-Antipolis, France.
Dataset available at: https://gitlab.com/PEM360/PEM360/

PEM360 is a dataset of user head movements and gaze recordings in 360° videos, along with self-reported emotional ratings and continuous physiological measurement data. It aims to understand the connection between user attention, emotions, and immersive content, and includes software tools and joint instantaneous visualization of user attention and emotion, called “emotional maps.” The entire data and code are available in a reproducible framework.

A New Free Viewpoint Video Dataset and DIBR Benchmark
Paper available at: https://doi.org/10.1145/3524273.3532897
Guo, S., Zhou, K., Hu, J., Wang, J., Xu, J., Song, L.
Institute of Image Communication and Network Engineering, Shanghai Jiao Tong University, Shanghai, China.
Dataset available at: https://github.com/sjtu-medialab/Free-Viewpoint-RGB-D-Video-Dataset

A new dynamic RGB-D video dataset for FVV research is presented, including 13 groups of dynamic scenes and one group of static scenes, each with 12 HD video sequences and 12 corresponding depth video sequences. Also, the FVV synthesis benchmark is introduced based on depth image-based rendering to aid data-driven method validation. The dataset and benchmark aim to advance FVV synthesis with improved robustness and performance.

CGD: a cloud gaming dataset with gameplay video and network recordings
Paper available at: https://doi.org/10.1145/3524273.3532898
Slivar, I., Bacic, K., Orsolic, I., Skorin-Kapov, L., Suznjevic, M.
University of Zagreb, Faculty of Electrical Engineering and Computing, Zagreb, Croatia.
Dataset available at: https://muexlab.fer.hr/muexlab/research/datasets

The cloud gaming (CGD) dataset contains 600 game streaming sessions from 10 games of different genres, with various encoding parameters (bitrate, resolution, and frame rate) to evaluate the impact of these parameters on Quality of Experience (QoE). The dataset includes gameplay video recordings, network traffic traces, user input logs, and streaming performance logs, and can be used to understand relationships between network and application layer data for cloud gaming QoE and QoE-aware network management mechanisms.

Two Interviews with renown Datasets Researchers

By maria | July 20, 2022 - 11:54 |August 30, 2022 0222, Datasets Column, Feature

Leave a comment

This issue of the Dataset Column provides two interviews with the researchers responsible for novel datasets of recent years. In particular, we first interview Nacho Reimat (https://www.cwi.nl/people/nacho-reimat), the scientific programmer responsible for the CWIPC-SXR, one of the first datasets on dynamic, interactive volumetric media. Second, we interview Pierre-Etienne Martin (https://www.eva.mpg.de/comparative-cultural-psychology/staff/pierre-etienne-martin/), responsible for contributions to datasets in the area of sports and culture.

The two interviewees were asked about their contribution to the dataset research, their interests, challenges, and the future. We would like to thank both Nacho and Pierre-Etienne for their agreement to contribute to our column.

Nacho Reimat, Scientific Programmer at the Distributed and Interactive Systems group at the CWI, Amsterdam, The Netherlands

Short bio: Ignacio Reimat is currently an R&D Engineer at Centrum Wiskunde & Informatica (CWI) in Amsterdam. He received the B.S. degree in Audiovisual Systems Engineering of Telecommunications at Universitat Politecnica de Catalunya in 2016 and the M.S degree in Innovation and Research in Informatics – Computer Graphics and Virtual Reality at Universitat Politecnica de Catalunya in 2020. His current research interests are 3D graphics, volumetric capturing, 3d reconstruction, point clouds, social Virtual Reality and real-time communications.

Could you provide a small summary of your contribution to the dataset research?

We have released the CWI Point Cloud Social XR Dataset [1], a dynamic point cloud dataset that depicts humans interacting in social XR settings. In particular, using commodity hardware we captured audio-visual data (RGB + Depth + Infrared + synchronized Audio) for a total of 45 unique sequences of people performing scripted actions [2]. The screenplays for the human actors were devised so as to simulate a variety of common use cases in social XR, namely, (i) Education and training, (ii) Healthcare, (iii) communication and social interaction, and (iv) Performance and sports. Moreover, diversity in gender, age, ethnicities, materials, textures and colours were additionally considered. As part of our release, we provide annotated raw material, resulting point cloud sequences, and an auxiliary software toolbox to acquire, process, encode, and visualize data, suitable for real-time applications.

Sample frames from the point cloud sequences released with the CWIPC-SXR dataset.

Why did you get interested in datasets research?

Real-time, immersive telecommunication systems are quickly becoming a reality, thanks to the advances in the acquisition, transmission, and rendering technologies. Point clouds in particular serve as a promising representation in these types of systems, offering photorealistic rendering capabilities with low complexity. Further development of transmission, coding, and quality evaluation algorithms, though, is currently hindered by the lack of publicly available datasets that represent realistic scenarios of remote communication between people in real-time. So we are trying to fill this gap.

What is the most challenging aspect of datasets research?

In our case, because point clouds are a relatively new format, the most challenging part has been developing the technology to generate them. Our dataset is generated from several cameras, which need to be calibrated and synchronized in order to merge the views successfully. Apart from that, if you are releasing a large dataset, you also need to deal with other challenges like data hosting and maintenance, but even more important, find the way to distribute the data in a way that is suitable for different target users. Because we are not releasing just point clouds but also the raw data, there may be people interested in the raw videos, or in particular point clouds, and they do not want to download the full 1.6TB of data. And going even further, because of the novelty of the point cloud format, there is also a lack of tools to re-capture, playback or modify this type of data. That’s why, together with the dataset, we also released our point cloud auxiliary toolbox of software utilities built on top of the Point Cloud Library, which allows for alignment and processing of point clouds, as well as real-time capturing, encoding, transmission, and rendering.

How do you see the future of datasets research?

Open datasets are an essential part of science since they allow for comparison and reproducibility. The major problem is that creating datasets is difficult and expensive, requiring a big investment from research groups. In order to ensure that relevant datasets keep on being created, we need a push including: scientific venues for the publication and discussion of datasets (like the dataset track at the Multimedia Systems conference, which started more than a decade ago), investment from funding agencies and organizations identifying the datasets that the community will need in the future, and collaboration between labs to share the effort.

What are your future plans for your research?

We are very happy with the first version of the dataset since it provides a good starting point and was a source of learning. Still, there is room for improvements, so now that we have a full capturing system (together with the auxiliary tools), we would like to extend the dataset and refine the tools. The community still needs more datasets of volumetric video to further advance the research on alignment, post-processing, compression, delivery, and rendering. Apart from the dataset, the Distributed and Interactive Systems (https://www.dis.cwi.nl) group from CWI is working on volumetric video conferencing, developing a Social VR pipeline for enabling users to more naturally communicate and interact. Recently, we deployed a solution for visiting museums remotely together with friends and family members (https://youtu.be/zzB7B6EAU9c), and next October we will start two EU-funded projects on this topic.

Pierre-Etienne Martin, Postdoctoral Researcher & Tech Development Coordinator, Max Planck Institute for Evolutionary Anthropology, Department of Comparative Cultural Psychology, Leipzig, Germany

Short Bio: Pierre-Etienne Martin is currently a Postdoctoral researcher at the Max Planck Institute. He received his M.S. degree in 2017 from the University of Bordeaux, the Pázmány Péter Catholic University and the Autonomous University of Madrid via the Image Processing and Computer vision Erasmus Master program. He obtained his PhD, labelled European, from the University of Bordeaux in 2020, supervised by Jenny Benois-Pineau and Renaud Péteri, on the topic of video detection and classification by means of Convolutional Neural Networks. His current research interests include among others Artificial Intelligence, Machine Learning and Computer Vision.

Could you provide a small summary of your contribution to the dataset research?

In 2017, I started my PhD thesis which focuses on movement analysis in sports. The aim of this research project, so-called CRIPS (ComputeR vIsion for Sports Performance – see ), is to improve the training experience of the athletes. Our team decided to focus on Table Tennis, and it is with the collaboration of the Sports Faculty of the University of Bordeaux, STAPS, that our first contribution came to be: the TTStroke-21 dataset [3]. This dataset gathers recordings of table tennis games at high resolution and 120 frames per second. The players and annotators are both from the STAPS. The annotation platform was designed by students from the LaBRI – University of Bordeaux, and the MIA from the University of la Rochelle. Coordination for recording the videos and doing the annotation was performed by my supervisors and myself.

In 2019, and until now, the TTStroke-21 is used to propose the Sports Task at the Multimedia Evaluation benchmark – MediaEval [4]. The goal is to segment and classify table tennis strokes from videos.

Since 2021, I have joined the MPI EVA institute and I now focus on elaborating datasets for the Comparative Cultural Psychology department (CCP). The data we are working on focuses on great apes and children. We aim at segmenting, identifying and tracking.

Why did you get interested in datasets research?

Datasets research is the field where the application of computer vision tools is possible. In order to widen the range of applications, datasets with qualitative ground truth need to be offered by the scientific community. Only then, models can be developed to solve the problem raised by the dataset and finally be offered to the community. This has been the goal of the interdisciplinary CRISP project, through the collaboration of the sport and computer science community, for improving athlete performance.

It is also the aim of collaborative projects, such as MMLAB [5], which gathers many models and implementations trained on various datasets, in order to ease reproducibility, performance comparison and inference for applications.

What is the most challenging aspect of datasets research?

From my experience, when organizing the Sport task at the MediaEval workshop, the most challenging aspect of datasets research is to be able to provide qualitative data: from acquisition to annotation; and tools to process them: use, demonstration and evaluation. That is why, on the side of our task, we also provide a baseline which covers most of these aspects.

How do you see the future of datasets research?

I hope datasets research will transcend in order to have a general scheme for annotation and evaluation of datasets. I hope the different datasets could be used together for training multi-task models, and give the opportunity to share knowledge and features proper to each type of dataset. Finally, quantity has been a major criterion for dataset research, but quality should be more considered in order to improve state-of-the-art performance while keeping a sustainable way to conduct research.

What are your future plans for your research?

Within the CCP department at MPI, I hope to be able to build different types of datasets to put to best use what has been implemented in the computer vision field to psychology.

Relevant references:

CWIPC-SXR dataset: https://www.dis.cwi.nl/cwipc-sxr-dataset/
I. Reimat, et al., “CWIPC-SXR: Point Cloud dynamic human dataset for Social XR. In Proceedings of the 12th ACM Multimedia Systems Conference (MMSys ’21). Association for Computing Machinery, New York, NY, USA, 300–306. https://doi.org/10.1145/3458305.3478452
TTStroke-21: https://link.springer.com/article/10.1007/s11042-020-08917-3
Media-Eval: http://www.multimediaeval.org/
Open-MMLab: https://openmmlab.com/

Overview of Open Dataset Sessions and Benchmarking Competitions in 2021.

By gabi | January 12, 2022 - 10:21 |February 3, 2022 0421, Datasets Column, Feature

Leave a comment

This issue of the Dataset Column proposes a review of some of the most important events in 2021 related to special sessions on open datasets or benchmarking competitions associated with multimedia data. While this is not meant to represent an exhaustive list of events, we wish to underline the great diversity of subjects and dataset topics currently of interest to the multimedia community. We will present the following events:

13th International Conference on Quality of Multimedia Experience (QoMEX 2021 – https://qomex2021.itec.aau.at/). We summarize six datasets included in this conference, that address QoE studies on haze conditions (RHVD), tele-education events (EVENT-CLASS), storytelling scenes (MTF), image compression (EPFL), virtual reality effects on gamers (5Gaming), and live stream shopping (LSS-survey).
Multimedia Datasets for Repeatable Experimentation at 27th International Conference on Multimedia Modeling (MDRE at MMM 2021 – https://mmm2021.cz/special-session-mdre/). We summarize the five datasets presented during the MDRE, addressing several topics like lifelogging and environmental data (MNR-HCM), cat vocalizations (CatMeows), home activities (HTAD), gastrointestinal procedure tools (Kvasir-Instrument), and keystroke and lifelogging (KeystrokeDynamics).
Open Dataset and Software Track at 12th ACM Multimedia Systems Conference (ODS at MMSys ’21) (https://2021.acmmmsys.org/calls.php#ods). We summarize seven datasets presented at the ODS track, targeting several topics like network statistics (Brightcove Streaming Datasets, and PePa Ping), emerging image and video modalities (Full UHD 360-Degree, 4DLFVD, and CWIPC-SXR) and human behavior data (HYPERAKTIV and Target Selection Datasets).
Selected datasets at 29th ACM Multimedia Conference (MM ’21) (https://2021.acmmm.org/). For a general report from ACM Multimedia 2021 please see (https://records.sigmm.org/2021/11/23/reports-from-acm-multimedia-2021/). We summarize six datasets presented during the conference, targeting several topics like food logo detection (FoodLogoDet-1500), emotional relationship recognition (ERATO), text-to-face synthesis (CelebAText-HQ), multimodal linking (M3EL), egocentric video analysis (EGO-Deliver), and quality assessment of user-generated videos (PUGCQ).
ImageCLEF 2021 (https://www.imageclef.org/2021). We summarize the six datasets launched for the benchmarking tasks, related to several topics like social media profile assessment (ImageCLEFaware), segmentation and labeling of underwater coral images (ImageCLEFcoral), automatic generation of web-pages (ImageCLEFdrawnUI) and medical imaging analysis (ImageCLEF-VQAMed, ImageCLEFmedCaption, and ImageCLEFmedTuberculosis).

Creating annotated datasets is even more difficult in ongoing pandemic times, and we are glad to see that many interesting datasets were published despite this unfortunate situation.

QoMEX 2021

A large number of dataset-related papers have been presented at the International Conference on Quality of Multimedia Experience (QoMEX 2021), organized as a fully online event in Montreal, Canada, June 14 -17, 2021 (https://qomex2021.itec.aau.at/). The complete QoMEX ’21 Proceedings is available in the IEEE Digital Library (https://ieeexplore.ieee.org/xpl/conhome/9465370/proceeding).

In the conference, there was not a specifically dedicated Dataset session. However, datasets were very important to the conference with a number of papers showing new datasets or making use of broadly available ones. As a small example, six selected papers focused primarily on new datasets are listed below. They are contributions focused on haze, teaching in Virtual Reality, multiview video, image quality, cybersickness for Virtual Reality gaming and shopping patterns.

A Real Haze Video Database for Haze Evaluation
Paper available at: https://ieeexplore.ieee.org/document/9465461
Chu, Y., Luo, G., and Chen, F.
College of Computer Science and Software Engineering, Shenzhen University, Shenzhen, P.R.China.
Dataset available at: https://drive.google.com/file/d/1zY0LwJyNB8u1JTAJU2X7ZkiYXsBX7BF/view?usp=sharing

The RHVD video quality assessment dataset focuses on the study of perceptual degradation caused by heavy haze conditions in real-world outdoor scenes, addressing a large number of possible use case scenarios, including driving assistance and warning systems. The dataset is collected from Flickr video sharing platform and post-edited, while 40 annotators were used for creating the subjective quality assessment experiments.

EVENT-CLASS: Dataset of events in the classroom
Paper available at: https://ieeexplore.ieee.org/document/9465389
Orduna, M., Gutierrez, J., Manzano, C., Ruiz, D., Cabrera, J., Diaz, C., Perez, P., and Garcia, N.
Grupo de Tratamiento de Imágenes, Information Processing & Telecom. Center, Universidad Politécnica de Madrid, Spain; Nokia Bell Labs, Madrid, Spain.
Dataset available at: http://www.gti.ssr.upm.es/data/event-class

The EVENT-CLASS dataset consists of 360-degree videos that contain events and characteristics specific to the context of tele-education, composed of video and audio sequences taken in varying conditions. The dataset addresses several topics, including quality assessment tests with the aim of improving the immersive experience of remote users.

A Multi-View Stereoscopic Video Database With Green Screen (MTF) For Video Transition Quality-of-Experience Assessment
Paper available at: https://ieeexplore.ieee.org/document/9465458
Hobloss, N., Zhang, L., and Cagnazzo, M.
LTCI, Télécom-Paris, Institut Polytechnique de Paris, Paris, France; Univ Rennes, INSA Rennes, CNRS, Rennes, France.
Dataset available at: https://drive.google.com/drive/folders/1MYiD7WssSh6X2y-cf8MALNOMMish4N5j

MFT is a multi-view stereoscopic video dataset, containing full-HD videos of real storytelling scenes, targeting QoE assessment for the analysis of visual artefacts that appear during an automatically generated point of view transitions. The dataset features a large baseline of camera setups and can also be used in other computer vision applications, like video compression, 3D video content, VR environments and optical flow estimation.

Performance Evaluation of Objective Image Quality Metrics on Conventional and Learning-Based Compression Artifacts
Paper available at: https://ieeexplore.ieee.org/document/9465445
Testolina, M., Upenik, E., Ascenso, J., Pereira, F., and Ebrahimi, T.
Multimedia Signal Processing Group, École Polytechnique Fédérale de Lausanne (EPFL), Lausanne, Switzerland; Instituto Superior Técnico, Universidade de Lisboa – Instituto de Telecomunicações, Lisbon, Portugal.
Dataset available on request to the authors.

This dataset consists of a collection of compressed images, labelled according to subjective quality scores, targeting the evaluation of 14 objective quality metrics against the perceived human quality baseline.

The Effect of VR Gaming on Discomfort, Cybersickness, and Reaction Time
Paper available at: https://ieeexplore.ieee.org/document/9465470
Vlahovic, S., Suznjevic, M., Pavlin-Bernardic, N., and Skorin-Kapov, L.
Faculty of Electrical Engineering and Computing, University of Zagreb, Zagreb, Croatia; Faculty of Humanities and Social Sciences, University of Zagreb, Zagreb, Croatia.
Dataset available on request to the authors.

The authors present the results of a study conducted on 20 human users, that measures the physiological and cognitive aftereffects of exposure to three different VR games with game mechanics centered around natural interactions. This work moves away from cybersickness as a primary measure of VR discomfort and wishes to analyze other concepts like device-related discomfort, muscle fatigue and pain and correlations with game complexity

Beyond Shopping: The Motivations and Experience of Live Stream Shopping Viewers
Paper available at: https://ieeexplore.ieee.org/document/9465387
Liu, X. and Kim, S. H.
Adelphi University.
Dataset available on request to the authors.

The authors propose a study of 286 live stream shopping users, where viewer motivations are examined according to the Uses and Gratifications Theory, seeking to identify motivations broken down into sixteen constructs organized under four larger constructs: entertainment, information, socialization, and experience.

MDRE at MMM 2021

The Multimedia Datasets for Repeatable Experimentation (MDRE) special session is part of the 2021 International Conference on Multimedia Modeling (MMM 2021). The session was organized by Cathal Gurrin (Dublin City University, Ireland), Duc-Tien Dang-Nguyen (University of Bergen, Norway), Björn Þór Jónsson (IT University of Copenhagen, Denmark) and Klaus Schoeffmann (Klagenfurt University, Austria). More details regarding this session can be found at: https://mmm2021.cz/special-session-mdre/.

The MDRE’21 special session at MMM’21 is the third MDRE edition, and it represents an opportunity for interested researchers to submit their datasets to this track. The work submitted to MDRE is permanently available at http://mmdatasets.org, where all the current and past editions of MDRE are hosted. Authors are asked to provide a paper describing its motivation, design, and usage, a brief summary of the experiments performed to date on the dataset, as well as discussing the way it can be useful to the community, along with the dataset in itself.

MNR-Air: An Economic and Dynamic Crowdsourcing Mechanism to Collect Personal Lifelog and Surrounding Environment Dataset.
Paper available at: https://link.springer.com/chapter/10.1007%2F978-3-030-67835-7_18
Nguyen DH., Nguyen-Tai TL., Nguyen MT., Nguyen TB., Dao MS.
University of Information Technology, Ho Chi Minh City, Vietnam; University of Science, Ho Chi Minh City, Vietnam; Vietnam National University in Ho Chi Minh City, Ho Chi Minh City, Vietnam; National Institute of Information and Communications Technology, Koganei, Japan.
Dataset available on request to the authors.

The paper introduces an economical and dynamic crowdsourcing mechanism that can be used to collect personal lifelog associated events. The resulting dataset, MNR-HCM, represents data collected in Ho Chi Minh City, Vietnam, containing weather data, air pollution data, GPS data, lifelog images, and citizens’ cognition on a personal scale.

CatMeows: A Publicly-Available Dataset of Cat Vocalizations
Paper available at: https://link.springer.com/chapter/10.1007%2F978-3-030-67835-7_20
Ludovico L.A., Ntalampiras S., Presti G., Cannas S., Battini M., Mattiello S.
Department of Computer Science, University of Milan, Milan, Italy; Department of Veterinary Medicine, University of Milan, Milan, Italy; Department of Agricultural and Environmental Science, University of Milan, Milan, Italy.
Dataset available at: https://zenodo.org/record/4008297

The CatMewos dataset consists of vocalizations produced by 21 cats belonging to two breeds, namely Main Coon and European Shorthair, that are emitted in three different contexts: brushing, isolation in an unfamiliar environment, and waiting for food. Recordings are performed with low-cost and easily available devices, thus creating a representative dataset for real-world scenarios.

HTAD: A Home-Tasks Activities Dataset with Wrist-accelerometer and Audio Features
Paper available at: https://link.springer.com/chapter/10.1007%2F978-3-030-67835-7_17
Garcia-Ceja, E., Thambawita, V., Hicks, S.A., Jha, D., Jakobsen, P., Hammer, H.L., Halvorsen, P., Riegler, M.A.
SINTEF Digital, Oslo, Norway; SimulaMet, Oslo, Norway; Oslo Metropolitan University, Oslo, Norway; UIT The Arctic University of Norway, Tromsø, Norway; Haukeland University Hospital, Bergen, Norway.
Dataset available at: https://datasets.simula.no/htad/

The HTAD dataset contains wrist-accelerometer and audio data collected during several normal day-to-day tasks, such as sweeping, brushing teeth, or watching TV. Being able to detect these types of activities is important for the creation of assistive applications and technologies that target elderly care and mental health monitoring.

Kvasir-Instrument: Diagnostic and Therapeutic Tool Segmentation Dataset in Gastrointestinal Endoscopy
Paper available at: https://link.springer.com/chapter/10.1007%2F978-3-030-67835-7_19
Jha, D., Ali, S., Emanuelsen, K., Hicks, S.A., Thambawita, V., Garcia-Ceja, E., Riegler, M.A., de Lange, T., Schmidt, P.T., Johansen, H.D., Johansen, D., Halvorsen, P.
SimulaMet, Oslo, Norway; UIT The Arctic University of Norway, Tromsø, Norway; Simula Research Laboratory, Oslo, Norway; Augere Medical AS, Oslo, Norway; Oslo Metropolitan University, Oslo, Norway; Medical Department, Sahlgrenska University Hospital-Mölndal, Gothenburg, Sweden; Department of Medical Research, Bærum Hospital, Gjettum, Norway; Karolinska University Hospital, Solna, Sweden; Department of Engineering Science, University of Oxford, Oxford, UK; Sintef Digital, Oslo, Norway.
Dataset available at: https://datasets.simula.no/kvasir-instrument/

The Kvasir-Instrument dataset consists of 590 annotated frames that contain gastrointestinal (GI) procedure tools such as snares, balloons, and biopsy forceps, and seeks to improve follow-up and the set of available information regarding the disease and the procedure itself, by providing baseline data for the tracking and analysis of the medical tools.

Keystroke Dynamics as Part of Lifelogging
Paper available at: https://link.springer.com/chapter/10.1007%2F978-3-030-67835-7_16
Smeaton, A.F., Krishnamurthy, N.G., Suryanarayana, A.H.
Insight Centre for Data Analytics, Dublin City University, Dublin, Ireland; School of Computing, Dublin City University, Dublin, Ireland.
Dataset available at: http://doras.dcu.ie/25133/

The authors created a dataset of longitudinal keystroke timing data that spans a period of up to seven months for four human participants. A detailed analysis of the data is performed, by examining the timing information associated with bigrams, or pairs of adjacently-typed alphabetic characters.

ODS at MMSys ’21

The traditional Open Dataset and Software Track (ODS) was a part of the 12th ACM Multimedia Systems Conference (MMSys ’21) organized as a hybrid event in Istanbul, Turkey, September 28 – October 1, 2021 (https://2021.acmmmsys.org/). The complete MMSys ’21: Proceedings of the 12th ACM Multimedia Systems Conference are available in the ACM Digital Library (https://dl.acm.org/doi/proceedings/10.1145/3458305).

The Session on Software, Tools and Datasets was chaired by Saba Ahsan (Nokia Technologies, Finland) and Luca De Cicco (Politecnico di Bari, Italy) on September 29, 2021, at 16:00 (UTC+3, Istanbul local time). The session has been initiated with 1-slide/minute intros given by the authors and then divided into individual virtual booths. There have been seven dataset papers presented out of thirteen contributions. Listing of the paper titles and their abstracts and associated DOIs is included below for your convenience.

Adaptive Streaming Playback Statistics Dataset
Paper available at: https://dl.acm.org/doi/10.1145/3458305.3478444
Teixeira, T, Zhang, B., Reznik, Y.
Brightcove Inc, USA
Dataset available at: https://github.com/brightcove/streaming-dataset

The authors propose a dataset that captures statistics from a number of real-world streaming events, utilizing different devices (TVs, desktops, mobiles, tablets, etc.) and networks (from 2.5G, 3G, and other early generation mobile networks to 5G and broadband). The captured data includes network and playback statistics, events and characteristics of the encoded stream.

PePa Ping Dataset: Comprehensive Contextualization of Periodic Passive Ping in Wireless Networks
Paper available at: https://dl.acm.org/doi/10.1145/3458305.3478456
Madariaga, D., Torrealba, L., Madariaga, J., Bustos-Jimenez, J., Bustos, B.
NIC Chile Research Labs, University of Chile
Dataset available at: https://github.com/niclabs/pepa-ping-mmsys21

The PePa Ping dataset consists of real-world data with a comprehensive contextualization of Internet QoS indicators, like Round-trip time, jitter and packet loss. A methodology is developed for Android devices, that obtains the necessary information, while the indicators are directly provided to the Linux kernel, therefore being an accurate representation of real-world data.

Full UHD 360-Degree Video Dataset and Modeling of Rate-Distortion Characteristics and Head Movement Navigation
Paper available at: https://dl.acm.org/doi/10.1145/3458305.3478447
Chakareski, J., Aksu, R., Swaminathan, V., Zink, M.
New Jersey Institute of Technology; University of Alabama; Adobe Research; University of Massachusetts Amherst, USA
Dataset available at: https://zenodo.org/record/5156999#.YQ1XMlNKjUI

The authors create a dataset of 360-degree videos that are used in analyzing the rate-distortion (R-D) characteristics of videos. These videos correspond to head movement navigation data in Virtual Reality (VR) and they may be used for analyzing how users explore panoramas around them in VR.

4DLFVD: A 4D Light Field Video Dataset
Paper available at: https://dl.acm.org/doi/10.1145/3458305.3478450
Hu, X., Wang, C.,Pan, Y., Liu, Y., Wang, Y., Liu, Y., Zhang, L., Shirmohammadi, S.
University of Ottawa, Canada / Beijing University of Posts and Telecommunication, China
Dataset available at: https://dx.doi.org/10.21227/hz0t-8482

The authors propose a 4D Light Field (LF) video dataset that is collected via a custom-made camera matrix. The dataset is to be used for designing and testing methods for LF video coding, processing and streaming, providing more viewpoints and/or higher framerate compared with similar datasets from the current literature.

CWIPC-SXR: Point Cloud dynamic human dataset for Social XR
Paper available at: https://dl.acm.org/doi/10.1145/3458305.3478452
Reimat, I., Alexiou, E., Jansen, J., Viola, I., Subramanyam, S., Cesar, P.
Centrum Wiskunde & Informatica, Netherlands
Dataset available at: https://www.dis.cwi.nl/cwipc-sxr-dataset/

The CWIPC-SXR dataset is composed of 45 unique sequences that correspond to several use cases for humans interacting in social extended reality. The dataset is composed of dynamic point clouds, that serve as a low complexity representation in these types of systems.

HYPERAKTIV: An Activity Dataset from Patients with Attention-Deficit/Hyperactivity Disorder (ADHD)
Paper available at: https://dl.acm.org/doi/10.1145/3458305.3478454
Hicks, S. A., Stautland, A., Fasmer, O. B., Forland, W., Hammer, H. L., Halvorsen, P., Mjeldheim, K., Oedegaard, K. J., Osnes, B., Syrstad, V. E.G., Riegler, M. A.
SimulaMet; University of Bergen; Haukeland University Hospital; OsloMet, Norway
Dataset available at: http://datasets.simula.no/hyperaktiv/

The HYPERAKTIV dataset contains general patient information, health, activity, information about the mental state, and heart rate data from patients with Attention-Deficit/Hyperactivity Disorder (ADHD). Included here are 51 patients with ADHD and 52 clinical control cases.

Datasets – Moving Target Selection with Delay
Paper available at: https://dl.acm.org/doi/10.1145/3458305.3478455
Liu, S. M., Claypool, M., Cockburn, A., Eg, R., Gutwin, C., Raaen, K.
Worcester Polytechnic Institute, USA; University of Canterbury, New Zealand; Kristiania University College, Norway; University of Saskatchewan, Canada
Dataset available at: https://web.cs.wpi.edu/~claypool/papers/selection-datasets/

The Selection datasets are composed of datasets created during four user studies on the effects of delay on video game actions and selections of a moving target with a various number of pointing devices. The datasets include performance data, like time to the selection, and demographic data for the users like age and gaming experience.

ACM MM 2021

A large number of dataset-related papers have been presented at the 29th ACM International Conference on Multimedia (MM’ 21), organized as a hybrid event in Chengdu, China, October 20 – 24, 2021 (https://2021.acmmm.org/). The complete MM ’21: Proceedings of the 29th ACM International Conference on Multimedia are available in the ACM Digital Library (https://dl.acm.org/doi/proceedings/10.1145/3474085).

There was not a specifically dedicated Dataset session among more than 35 sessions at the MM ’21 symposium. However, the importance of datasets can be illustrated in the following statistics, quantifying how many times the term “dataset” appears among 542 accepted papers. The term appears in the title of 7 papers, the keywords of 66 papers, and the abstracts of 339 papers. As a small example, six selected papers focused primarily on new datasets are listed below. There are contributions focused on social multimedia, emotional recognition, text-to-face synthesis, egocentric video analysis, emerging multimedia applications, such as multimodal entity linking, and multimedia art, entertainment, and culture related to perceived quality of video content.

FoodLogoDet-1500: A Dataset for Large-Scale Food Logo Detection via Multi-Scale Feature Decoupling Network
Paper available at: https://dl.acm.org/doi/10.1145/3474085.3475289
Hou, Q., Min, W., Wang, J., Hou, S., Zheng, Y., Jiang, S.
Shandong Normal University, Jinan, China; Institute of Computing Technology, Chinese Academy of Sciences, Beijing, China
Dataset available at: https://github.com/hq03/FoodLogoDet-1500-Dataset

The FoodLogoDet-1500 is a large-scale food logo dataset that has 1,500 categories, around 100,000 images and 150,000 manually annotated food logo objects. This type of dataset is important in self-service applications in shops and supermarkets, and copyright infringement detection for e-commerce websites.

Pairwise Emotional Relationship Recognition in Drama Videos: Dataset and Benchmark
Paper available at: https://dl.acm.org/doi/10.1145/3474085.3475493
Gao, X., Zhao, Y., Zhang, J., Cai, L.
Alibaba Group, Beijing, China
Dataset available on request to the authors.

The Emotional RelAtionship of inTeractiOn (ERATO) dataset is a large-scale multimodal dataset composed of over 30,000 interaction-centric video clips lasting around 203 hours. The videos are representative for studying the emotional relationships between the two interactive characters in the video clip.

Multi-caption Text-to-Face Synthesis: Dataset and Algorithm
Paper available at: https://dl.acm.org/doi/abs/10.1145/3474085.3475391
Sun, J., Li, Q., Wang, W., Zhao, J., Sun, Z.
Center for Research on Intelligent Perception and Computing, NLPR, CASIA, Beijing, China;
School of Artificial Intelligence, University of Chinese Academy of Sciences (UCAS), Beijing, China; Institute of North Electronic Equipment, Beijing, China
Dataset available on request to the authors.

The authors propose the CelebAText-HQ dataset, which addresses the text-to-face generation problem. Each image in the dataset is manually annotated with 10 captions, allowing proposed methods and algorithms to take multiple captions as input in order to generate highly semantically related face images.

Multimodal Entity Linking: A New Dataset and A Baseline
Paper available at: https://dl.acm.org/doi/10.1145/3474085.3475400
Gan, J., Luo, J., Wang, H., Wang, S., He, W., Huang, Q.
Key Laboratory of Intelligent Information Processing, Institute of Computing Technology, Chinese Academy of Sciences, China; School of Computer Science and Technology, University of Chinese Academy of Sciences, China; Baidu Inc.
Dataset available at: https://jingrug.github.io/research/M3EL

The authors propose the M3EL large-scale multimodal entity linking dataset, containing data associated with 1,100 movies. Reviews and images are collected, and textual and visual mentions are extracted and labelled with entities registered from Wikipedia.

Ego-Deliver: A Large-Scale Dataset for Egocentric Video Analysis
Paper available at: https://dl.acm.org/doi/10.1145/3474085.3475336
Qiu, H., He, P., Liu, S., Shao, W., Zhang, F., Wang, J., He, L., Wang, F.
East China Normal University, Shanghai, China; University of Florida, Florida, FL, United States;
Alibaba Group, Shanghai, China
Dataset available at: https://egodeliver.github.io/EgoDeliver_Dataset/

The authors propose an egocentric video benchmarking dataset, consisting of videos recorded by takeaway riders doing their daily work. The dataset provides over 5,000 videos with more than 139,000 multi-track annotations and 45 different attributes, representing the first attempt in understanding the delivery takeaway process from an egocentric perspective.

PUGCQ: A Large Scale Dataset for Quality Assessment of Professional User-Generated Content
Paper available at: https://dl.acm.org/doi/10.1145/3474085.3475183
Li, G., Chen, B., Zhu, L., He, Q., Fan, H., Wang, S.
Kingsoft Cloud, Beijing, China; City University of Hong Kong, Hong Kong, Hong Kong
Dataset available at: https://github.com/wlkdb/pugcq_create

The PUGCQ dataset consists of 10,000 professional user-generated videos, annotated with a set of perceptual subjective ratings. In particular, during the subjective annotation and testing, human opinions are collected based upon not only MOS, but also attributes that may influence visual quality such as faces, noise, blur, brightness, and colour.

ImageCLEF 2021

ImageCLEF is a multimedia evaluation campaign, part of the clef initiative (http://www.clef-initiative.eu/). The 2021 edition (https://www.imageclef.org/2021) is the 19th edition of this initiative and addresses four main research tasks in several domains like: medicine, nature, social media content and user interface processing. ImageCLEF 2021 is organized by Bogdan Ionescu (University Politehnica of Bucharest, Romania), Henning Müller (University of Applied Sciences Western Switzerland, Sierre, Switzerland), Renaud Péteri (University of La Rochelle, France), Ivan Eggel (University of Applied Sciences Western Switzerland, Sierre, Switzerland) and Mihai Dogariu (University Politehnica of Bucharest, Romania).

ImageCLEFaware
Paper available at: https://arxiv.org/abs/2012.13180
Popescu, A., Deshayes-Chossar, J., Ionescu, B.
CEA LIST, France; University Politehnica of Bucharest, Romania.
Dataset available at: https://www.imageclef.org/2021/aware

This represents the first edition of the aware task at ImageCLEF, and it seeks to understand in what way do public social media profiles affect users in certain important scenarios, representing a search or application for a bank loan, an accommodation, a job as waitress/waiter, and a job in IT.

ImageCLEFcoral
Paper available at: http://ceur-ws.org/Vol-2936/paper-88.pdf
Chamberlain, J., de Herrera, A. G. S., Campello, A., Clark, A., Oliver, T. A., Moustahfid, H.
University of Essex, UK; NOAA – Pacific Islands Fisheries Science Center, USA; NOAA/ US IOOS, USA; Wellcome Trust, UK.
Dataset available at: https://www.imageclef.org/2021/coral

The ImageCLEFcoral task, currently at its third edition, proposes a dataset and benchmarking task for the automatic segmentation and labelling of underwater images that can be combined for generating 3D models for monitoring coral reefs. The task itself is composed of two subtasks, namely the coral reef image annotation and localisation and the coral reef image pixel-wise parsing.

ImageCLEFdrawnUI
Paper available at: http://ceur-ws.org/Vol-2936/paper-89.pdf
Fichou, D., Berari, R., Tăuteanu, A., Brie, P., Dogariu, M., Ștefan, L.D., Constantin, M.G., Ionescu, B.
teleportHQ, Cluj Napoca, Romania; University Politehnica of Bucharest, Romania.
Dataset available at: https://www.imageclef.org/2021/drawnui

The second edition ImageCLEFdrawnUI addresses the issue of creating appealing web page interfaces by fostering systems that are capable of automatically generating a web page from a hand-drawn sketch. The task is separated into two subtasks, the wireframe subtask and the screenshots task.

ImageCLEF-VQAMed
Paper available at: http://ceur-ws.org/Vol-2936/paper-87.pdf
Abacha, A.B., Sarrouti, M., Demner-Fushman, D., Hasan, S.A., Müller, H.
National Library of Medicine, USA; CVS Health, USA; University of Applied Sciences Western Switzerland, Sierre, Switzerland.
Dataset available at: https://www.imageclef.org/2021/medical/vqa

This represents the fourth edition of the ImageCLEF Medical Visual Question Answering (VQAMed) task. This benchmark includes a task on Visual Question Answering (VQA), where participants are tasked with answering questions from the visual content of radiology images, and a second task on Visual Question Generation (VQG), consisting of generating relevant questions about radiology images.

ImageCLEFmed Caption
Paper available at: http://ceur-ws.org/Vol-2936/paper-111.pdf
Pelka, O., Abacha, A.B., de Herrera, A.G.S., Jacutprakart, J., Friedrich, C.M., Müller, H.
University of Applied Sciences and Arts Dortmund, Germany; National Library of Medicine, USA; University of Essex, UK; University of Applied Sciences Western Switzerland, Sierre, Switzerland.
Dataset available at: https://www.imageclef.org/2021/medical/caption

This is the fifth edition of the ImageCLEF Medical Concepts and Captioning task. The objective is to extract UMLS-concept annotations and/or captions from the image data that are then compared against the original text captions of the images.

ImageCLEFmed Tuberculosis
Paper available at: http://ceur-ws.org/Vol-2936/paper-90.pdf
Kozlovski, S., Liauchuk, V., Cid, Y.D., Kovalev, V., Müller, H.
Institute for Informatics, Minsk, Belarus; University of Warwick, Coventry, England, UK; University of Applied Sciences Western Switzerland, Sierre, Switzerland.
Dataset available at: https://www.imageclef.org/2021/medical/tuberculosis