Overview of Open Dataset Sessions and Benchmarking Competitions in 2022 – Part 2 (MDRE at MMM 2022, ACM MM 2022)

Editors: Karel Fliegel (Czech Technical University in Prague, Czech Republic), Mihai Gabriel Constantin (University Politehnica of Bucharest, Romania), Maria Torres Vega (Ghent University, Belgium)

In this Dataset Column, we present a review of some of the notable events related to open datasets and benchmarking competitions in the field of multimedia. This year’s selection highlights the wide range of topics and datasets currently of interest to the community. Some of the events covered in this review include special sessions on open datasets and competitions featuring multimedia data. While this list is not exhaustive and contains an overview of about 40 datasets, it is meant to showcase the diversity of subjects and datasets explored in the field. This year’s review follows similar efforts from the previous year (https://records.sigmm.org/2022/01/12/overview-of-open-dataset-sessions-and-benchmarking-competitions-in-2021/), highlighting the ongoing importance of open datasets and benchmarking competitions in advancing research and development in multimedia. The column is divided into three parts, in this one we focus on MDRE at MMM 2022 and ACM MM 2022:

  • Multimedia Datasets for Repeatable Experimentation at 28th International Conference on Multimedia Modeling (MDRE at MMM 2022 – https://mmm2022.org/ssp.html#mdre). We summarize the three datasets presented during the MDRE, addressing several topics like user-centric video search competition, dataset (GPR1200) to evaluate the performance of deep neural networks for general image retrieval, and dataset for evaluating the performance of Question Answering (QA) systems on lifelog data (LLQA).
  • Selected datasets at the 30th ACM Multimedia Conference (MM ’22 – https://2022.acmmm.org/). For a general report from ACM Multimedia 2022 please see (https://records.sigmm.org/2022/12/07/report-from-acm-multimedia-2022-by-nitish-nagesh/). We summarize nine datasets presented during the conference, targeting several topics like dataset for multimodal intent recognition (MintRec), audio-visual question answering dataset (AVQA), large-scale radar dataset (mmWave), multimodal sticker emotion recognition dataset (SER30K), video-sentence dataset for vision-language pre-training (ACTION), dataset of head and gaze behavior for 360-degree videos, saliency in augmented reality dataset (SARD), multi-modal dataset spotting the differences between pairs of similar images (DialDiff), and large-scale remote sensing images dataset (RSVG).

For the overview of datasets related to QoMEX 2022 and ODS at MMSys ’22 please check the first part (https://records.sigmm.org/?p=12292), while ImageCLEF 2022 and MediaEval 2022 are addressed in the third part (http://records.sigmm.org/?p=12362).

MDRE at MMM 2022

The Multimedia Datasets for Repeatable Experimentation (MDRE) special session is part of the 2022 International Conference on Multimedia Modeling (MMM 2022), supporting both online and onsite presentation, Phu Quoc, Vietnam, June 6-10, 2022. The session was organized by Cathal Gurrin (Dublin City University, Ireland), Duc-Tien Dang-Nguyen (University of Bergen, Norway), Björn Þór Jónsson (IT University of Copenhagen, Denmark), Adam Jatowt (University of Innsbruck, Austria), Liting Zhou (Dublin City University, Ireland) and Graham Healy (Dublin City University, Ireland). Details regarding this session can be found at: https://mmm2022.org/ssp.html#mdre

The MDRE’22 special session at MMM’22, is the fourth MDRE session, and it represents an opportunity for interested researchers to submit their datasets to this track. The work submitted to MDRE is permanently available at http://mmdatasets.org, where all the current and past editions of MDRE are hosted. Authors are asked to provide a paper describing its motivation, design, and usage, a brief summary of the experiments performed to date on the dataset, and a discussion of how it can be useful to the community, along with the dataset in itself.

A Task Category Space for User-Centric Comparative Multimedia Search Evaluations
Paper available at: https://doi.org/10.1007/978-3-030-98358-1_16
Lokoč, J., Bailer, W., Barthel, K.U., Gurrin, C., Heller, S., Jónsson, B., Peška, L., Rossetto, L., Schoeffmann, K., Vadicamo, L., Vrochidis, S., Wu, J.
Charles University, Prague, Czech Republic; JOANNEUM RESEARCH, Graz, Austria; HTW Berlin, Berlin, Germany; Dublin City University, Dublin, Ireland; University of Basel, Basel, Switzerland; IT University of Copenhagen, Copenhagen, Denmark; University of Zurich, Zurich, Switzerland; Klagenfurt University, Klagenfurt, Austria; ISTI CNR, Pisa, Italy; Centre for Research and Technology Hellas, Thessaloniki, Greece; City University of Hong Kong, Hong Kong.
Dataset available at: On request

The authors have analyzed the spectrum of possible task categories and propose a list of individual axes that define a large space of possible task categories. Using this concept of category space, new user-centric video search competitions can be designed to benchmark video search systems from different perspectives. They further analyze the three task categories considered at the Video Browser Showdown and discuss possible (but sometimes challenging) shifts within the task category space.

GPR1200: A Benchmark for General-Purpose Content-Based Image Retrieval
Paper available at: https://doi.org/10.1007/978-3-030-98358-1_17
Schall, K., Barthel, K.U., Hezel, N., Jung, K.
Visual Computing Group, HTW Berlin, University of Applied Sciences, Germany.
Dataset available at: http://visual-computing.com/project/GPR1200

In this study, the authors have developed a new dataset called GPR1200 to evaluate the performance of deep neural networks for general image retrieval (CBIR). They found that large-scale pretraining significantly improves retrieval performance and that further improvement can be achieved through fine-tuning. GPR1200 is presented as an easy-to-use and accessible but challenging benchmark dataset with a broad range of image categories.

LLQA – Lifelog Question Answering Dataset
Paper available at: https://doi.org/10.1007/978-3-030-98358-1_18
Tran, L.-D., Ho, T.C., Pham, L.A., Nguyen, B., Gurrin, C., Zhou, L.
Dublin City University, Dublin, Ireland; Vietnam National University, Ho Chi Minh University of Science, Ho Chi Minh City, Viet Nam; AISIA Research Lab, Ho Chi Minh City, Viet Nam.
Dataset available at: https://github.com/allie-tran/LLQA

This study presents Lifelog Question Answering Dataset (LLQA), a new dataset for evaluating the performance of Question Answering (QA) systems on lifelog data. The dataset includes over 15,000 multiple-choice questions as an augmented 85-day lifelog collection, and is intended to serve as a benchmark for future research in this area. The results of the study showed that QA on lifelog data is a challenging task that requires further exploration.

ACM MM 2022

Numerous dataset-related papers have been presented at the 30th ACM International Conference on Multimedia (MM’ 22), organized in Lisbon, Portugal, October 10 – 14, 2022 (https://2022.acmmm.org/). The complete MM ’22: Proceedings of the 30th ACM International Conference on Multimedia are available in the ACM Digital Library (https://dl.acm.org/doi/proceedings/10.1145/3503161).

There was not a specifically dedicated Dataset session among roughly 35 sessions at the MM ’22 symposium. However, the importance of datasets can be illustrated in the following statistics, quantifying how often the term “dataset” appears in MM ’22 Proceedings. The term appears in the title of 9 papers (7 last year), the keywords of 35 papers (66 last year), and the abstracts of 438 papers (339 last year). As a small example, nine selected papers focused primarily on new datasets with publicly available data are listed below. There are contributions focused on various multimedia applications, e.g., understanding multimedia content, multimodal fusion and embeddings, media interpretation, vision and language, engaging users with multimedia, emotional and social signals, interactions and Quality of Experience, and multimedia search and recommendation.

MIntRec: A New Dataset for Multimodal Intent Recognition
Paper available at: https://doi.org/10.1145/3503161.3547906
Zhang, H., Xu, H., Wang, X., Zhou, Q., Zhao, S., Teng, J.
Tsinghua University, Beijing, China.
Dataset available at: https://github.com/thuiar/MIntRec

MIntRec is a dataset for multimodal intent recognition with 2,224 samples based on the data collected from the TV series Superstore, in text, video, and audio modalities, annotated with twenty intent categories and speaker bounding boxes. Baseline models are built by adapting multimodal fusion methods and show significant improvement over text-only modality. MIntRec is useful for studying relationships between modalities and improving intent recognition.

AVQA: A Dataset for Audio-Visual Question Answering on Videos
Paper available at: https://doi.org/10.1145/3503161.3548291
Yang, P., Wang, X., Duan, X., Chen, H., Hou, R., Jin, C., Zhu, W.
Tsinghua University, Shenzhen, China; Communication University of China, Beijing, China.
Dataset available at: https://mn.cs.tsinghua.edu.cn/avqa

Audio-visual question-answering dataset (AVQA) is introduced for videos in real-life scenarios. It includes 57,015 videos and 57,335 question-answer pairs that rely on clues from both audio and visual modalities. A Hierarchical Audio-Visual Fusing module is proposed to model correlations among audio, visual, and text modalities. AVQA can be used to test models with a deeper understanding of multimodal information on audio-visual question answering in real-life scenarios.

mmBody Benchmark: 3D Body Reconstruction Dataset and Analysis for Millimeter Wave Radar
Paper available at: https://doi.org/10.1145/3503161.3548262
Chen, A., Wang, X., Zhu, S., Li, Y., Chen, J., Ye, Q.
Zhejiang University, Hangzhou, China.
Dataset available at: On request

A large-scale mmWave radar dataset with synchronized and calibrated point clouds and RGB(D) images is presented, along with an automatic 3D body annotation system. State-of-the-art methods are trained and tested on the dataset, showing the mmWave radar can achieve better 3D body reconstruction accuracy than RGB camera but worse than depth camera. The dataset and results provide insights into improving mmWave radar reconstruction and combining signals from different sensors.

SER30K: A Large-Scale Dataset for Sticker Emotion Recognition
Paper available at: https://doi.org/10.1145/3503161.3548407
Liu, S., Zhang, X., Yan, J.
Nankai University, Tianjin, China.
Dataset available at: https://github.com/nku-shengzheliu/SER30K

A new multimodal sticker emotion recognition dataset called SER30K with 1,887 sticker themes and 30,739 images is introduced for understanding emotions in stickers. A proposed method called LORA, using a vision transformer and local re-attention module, effectively extracts visual and language features for emotion recognition on SER30K and other datasets.

Auto-captions on GIF: A Large-scale Video-sentence Dataset for Vision-language Pre-training
Paper available at: https://doi.org/10.1145/3503161.3551581
Pan, Y., Li, Y., Luo, J., Xu, J., Yao, T., Mei, T.
JD Explore Academy, Beijing, China.
Dataset available at: http://www.auto-video-captions.top/2022/dataset

A new large-scale pre-training dataset, Auto-captions on GIF (ACTION), is presented for generic video understanding. It contains video-sentence pairs extracted and filtered from web pages and can be used for pre-training and downstream tasks such as video captioning and sentence localization. Comparisons with existing video-sentence datasets are made.

Where Are You Looking?: A Large-Scale Dataset of Head and Gaze Behavior for 360-Degree Videos and a Pilot Study
Paper available at: https://doi.org/10.1145/3503161.3548200
Jin, Y., Liu, J., Wang, F., Cui, S.
The Chinese University of Hong Kong, Shenzhen, Shenzhen, China.
Dataset available at: https://cuhksz-inml.github.io/head_gaze_dataset/

A dataset of users’ head and gaze behaviors in 360° videos is presented, containing rich dimensions, large scale, strong diversity, and high frequency. A quantitative taxonomy for 360° videos is also proposed, containing three objective technical metrics. Results of a pilot study on users’ behaviors and a case of application in tile-based 360° video streaming show the usefulness of the dataset for improving the performance of existing works.

Saliency in Augmented Reality
Paper available at: https://doi.org/10.1145/3503161.3547955
Duan, H., Shen, W., Min, X., Tu, D., Li, J., Zhai, G.
Shanghai Jiao Tong University, Shanghai, China; Alibaba Group, Hangzhou, China.
Dataset available at: https://github.com/DuanHuiyu/ARSaliency

A dataset, Saliency in AR Dataset (SARD), containing 450 background, 450 AR, and 1350 superimposed images with three mixing levels, is constructed to study the interaction between background scenes and AR contents, and the saliency prediction problem in AR. An eye-tracking experiment is conducted among 60 subjects to collect data.

Visual Dialog for Spotting the Differences between Pairs of Similar Images
Paper available at: https://doi.org/10.1145/3503161.3548170
Zheng, D., Meng, F., Si, Q., Fan, H., Xu, Z., Zhou, J., Feng, F., Wang, X.
Beijing University of Posts and Telecommunications, Beijing, China; WeChat AI, Tencent Inc, Beijing, China; Institute of Information Engineering, Chinese Academy of Sciences, Beijing, China; University of Trento, Trento, Italy.
Dataset available at: https://github.com/zd11024/Spot_Difference

A new visual dialog task called Dial-the-Diff is proposed, in which two interlocutors access two similar images and try to spot the difference between them through conversation in natural language. A large-scale multi-modal dataset called DialDiff, containing 87k Virtual Reality images and 78k dialogs, is built for the task. Benchmark models are also proposed and evaluated to bring new challenges to dialog strategy and object categorization.

Visual Grounding in Remote Sensing Images
Paper available at: https://doi.org/10.1145/3503161.3548316
Sun, Y., Feng, S., Li, X., Ye, Y., Kang, J., Huang, X.
Harbin Institute of Technology, Shenzhen, Shenzhen, China; Soochow University, Suzhou, China.
Dataset available at: https://sunyuxi.github.io/publication/GeoVG

A new problem of visual grounding in large-scale remote sensing images has been presented, in which the task is to locate particular objects in an image by a natural language expression. A new dataset, called RSVG, has been collected and a new method, GeoVG, has been designed to address the challenges of existing methods in dealing with remote sensing images.

Bookmark the permalink.