Overview of Benchmarking Platforms and Software for Multimedia Applications

At a time when Artificial Intelligence (AI) continues to push the boundaries of what was previously thought possible, the demand for benchmarking platforms that allow AI models to be assessed and evaluated fairly has become paramount. These platforms serve as connecting hubs between data scientists, machine learning specialists, industry partners, and other interested parties. They mostly function under the Evaluation-as-a-Service (EaaS) paradigm [1], the idea that participants in a given benchmarking task should be able to test the output of their systems under similar conditions, by being provided with a common definition of the targeted concepts, datasets and data splits, metrics, and evaluation tools. These common elements are provided through online platforms that can even offer Application Programming Interfaces (APIs) or container-level integration of the participants’ AI models. This column provides an insight into these platforms, looking at their main characteristics, use cases, and particularities. In the second part of the column we will also look into some of the main benchmarking platforms that are geared towards handling multimedia-centric benchmarks and datasets, relevant to SIGMM.

Defining Characteristics of EaaS platforms

Benchmarking competitions and initiatives, as well as EaaS platforms, attempt to tackle a number of key points in the development of AI algorithms and models, namely:

  • Creating a fair and impartial evaluation environment, by standardizing the datasets and evaluation metrics used by all participants in an evaluation competition. In doing so, EaaS platforms play a pivotal role in promoting transparency and comparability in AI models and approaches.
  • Enhancing reproducibility by giving the option to run the AI models on dedicated servers provided and managed by competition organizers. This increases the trust and bolsters the integrity of the results produced by competition participants, as the organizers are able to closely monitor the testing process for each individual AI model. 
  • Fostering, as a natural consequence, a higher degree of data privacy, as participants could be given access only to training data, while testing data is kept private and is only accessed via APIs on the dedicated servers, reducing the risk of data exposure.
  • Creating a common repository for sharing the data and details of a benchmarking task, building a history not only of the results of the benchmarking tasks throughout the years, but also of the evolution of the types of approaches and models used by participants. Other interesting features, like the existence of forums and discussion threads on competitions, allow new participants to quickly search for problems they encounter and hopefully resolve their issues more quickly.

Given these common goals, benchmarking platforms usually integrate a set of common features and user-level functionalities that are summed up in this section and grouped into three categories: task organization and scheduling, scoring and reproducibility, and communication and dissemination.

Task organization and scheduling. The platforms allow the creation, modification and maintenance of benchmarking tasks, either through a graphical user interface (GUI) or by using task bundles (most commonly using JSON, XML, Python or custom scripting languages). Competition organizers can define their task, and define sub-tasks that may explore different facets of the targeted data. Scheduling is another important feature in benchmarking competition creation: some parts of the data may be kept private until a certain moment in time, and competition organizers can hide the results of other teams until a chosen date. We consider the last point an important one, as participants may feel discouraged from continuing their participation if their initial results are not high enough compared with those of other participants. Another noteworthy feature is run quantity management, which allows organizers to specify a maximum number of allowed runs per participant during the benchmarking task. This limitation discourages participants from attempting to solve the given tasks with brute-force approaches, where they implement a large number of models and model variations. As a result, participants are incentivized to delve deeper into the data, critically analyzing why certain methods succeed and others fall short.
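As a rough illustration of the scheduling and run-limit features described above, the following Python sketch models a hypothetical task configuration. The `TaskConfig` class and its field names (`max_runs_per_participant`, `leaderboard_visible_from`) are assumptions made for this example and do not correspond to the bundle format of any particular platform.

```python
from dataclasses import dataclass, field
from datetime import datetime


@dataclass
class TaskConfig:
    """Hypothetical task settings; all field names are illustrative only."""
    name: str
    max_runs_per_participant: int
    leaderboard_visible_from: datetime
    runs: dict = field(default_factory=dict)  # participant -> runs submitted so far

    def register_run(self, participant: str) -> bool:
        """Record a run and return True, or return False if the limit is reached."""
        used = self.runs.get(participant, 0)
        if used >= self.max_runs_per_participant:
            return False
        self.runs[participant] = used + 1
        return True

    def leaderboard_visible(self, now: datetime) -> bool:
        """Other teams' results stay hidden until the configured date."""
        return now >= self.leaderboard_visible_from


task = TaskConfig(
    name="example-task",
    max_runs_per_participant=2,
    leaderboard_visible_from=datetime(2024, 3, 1),
)
accepted = [task.register_run("team_a") for _ in range(3)]
print(accepted)  # [True, True, False]
print(task.leaderboard_visible(datetime(2024, 2, 1)))  # False
```

In this sketch the third run from the same team is rejected, and the leaderboard remains hidden before the configured date; a real platform would additionally persist this state and authenticate participants.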

Scoring and reproducibility. EaaS platforms generally deploy two paradigms, sometimes side-by-side, with regard to AI model testing and results generation [1, 2]: the Data-to-Algorithm (D2A) approach, and the Algorithm-to-Data (A2D) approach. The former refers to competitions where participants must download the testing set, run the prediction systems on their own machines, and provide the predictions to the organizers, usually in CSV format for the multimedia domain. In this setup, the ground truth data for the testing set is kept private, and after the organizers receive the prediction result files, they communicate the performance to the participants, or the results are automatically computed by the platform using organizer-provided scripts once the files are uploaded to it. The A2D approach, on the other hand, is more complex, may incur additional financial costs, and may be more time-consuming for both organizers and task participants, but increases the trustworthiness and reproducibility of the task and of the AI models themselves. In this setup, organizers provide cloud-based computing resources via Virtual Machines (VMs) and containers, and a common processing pipeline or API that competitors must integrate in their source code. The participants develop the wrappers that integrate their AI models accordingly, and upload the model to the EaaS platform directly. The AI models are then executed according to the common pipeline and results are automatically provided to the participants, while also allowing for the testing data to be kept completely private. Traditionally, in order to achieve this, EaaS platforms offer the possibility of integration with cloud computing platforms like Amazon AWS, Microsoft Azure, or Google Cloud, and offer Docker integration for the creation of containers where the code can be hosted.
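To make the D2A workflow more concrete, the sketch below shows what a minimal organizer-side scoring script might look like. The CSV layout (`item_id,label`), the accuracy metric, and the function names are assumptions chosen for illustration; actual competitions define their own file formats and metrics.

```python
import csv
import io


def load_labels(csv_text: str) -> dict:
    """Parse 'item_id,label' rows (with a header line) into a dict."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return {row["item_id"]: row["label"] for row in reader}


def accuracy(predictions: dict, ground_truth: dict) -> float:
    """Fraction of ground-truth items predicted correctly.

    Items missing from the predictions count as errors, so a partial
    submission cannot inflate its score.
    """
    if not ground_truth:
        raise ValueError("ground truth is empty")
    correct = sum(
        1 for item_id, label in ground_truth.items()
        if predictions.get(item_id) == label
    )
    return correct / len(ground_truth)


# Private ground truth, held by the organizers only (toy data).
GROUND_TRUTH_CSV = "item_id,label\nimg1,cat\nimg2,dog\nimg3,cat\nimg4,bird\n"
# Run file uploaded by a participant (toy data; img4 is missing).
SUBMISSION_CSV = "item_id,label\nimg1,cat\nimg2,cat\nimg3,cat\n"

score = accuracy(load_labels(SUBMISSION_CSV), load_labels(GROUND_TRUTH_CSV))
print(f"accuracy = {score:.2f}")  # accuracy = 0.50
```

Because only the score leaves the organizers' side, the ground truth stays private while every participant is evaluated with exactly the same code.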

Communication and dissemination. EaaS platforms allow the interaction between competition organizers and participants, either through emails, automatic notifications, or forums where interested parties can exchange ideas, ask questions, offer help, or signal potential problems in the data or scripts associated with the tasks.

Popular multimedia EaaS platforms

This section presents some of the most popular benchmarking platforms aimed at the multimedia domain. We will present some key features and associated popular multimedia datasets for the following platforms: Kaggle, AIcrowd, Codabench, DrivenData, and EvalAI.

Kaggle is perhaps the most popular benchmarking platform at this moment, and goes beyond the scope of providing datasets and benchmarking competitions, also hosting AI models, courses, and source code repositories. Competition organizers can design the tasks under either the D2A or the A2D paradigm, giving participants the possibility of integrating their AI models in Jupyter Notebooks for reproducibility. The platform also gives the option of allotting CPU and GPU cloud-based resources for A2D competitions. The Kaggle repository offers code for a large number of additional competition management tools and communication APIs. Among an impressive number of datasets and competitions, Kaggle currently hosts competitions that use the original MNIST data [3] and MNIST-like datasets such as Fashion-MNIST [4], as well as datasets on varied subjects ranging from sentiment analysis in social media [5] to medical image processing [6].

AIcrowd is an open source EaaS platform for open benchmarking challenges that emphasizes connections and collaborative work between data science and machine learning experts. This platform offers the source code for command line interface (CLI) and API clients that can interact with AIcrowd servers. ImageCLEF, hosted on AIcrowd between 2018 and 2022 [7–11], is one of the most popular multimedia benchmarking initiatives on the platform, featuring diverse multimedia topics such as lifelogging, medical image processing, image processing for environment health prediction, the analysis of social media dangers with regard to image sharing, and ensemble learning for multimedia data.

Codabench, launched in August 2023, and its precursor CodaLab, are two open source benchmarking platforms that provide a large number of options, including A2D and D2A approaches, as well as “inverted benchmarks”, where organizers provide the reference algorithms and participants contribute the datasets. Standouts among the challenges currently running on this platform are the two Quality-of-Service-oriented challenges on audio-video synchronization error detection and error measurement, which are part of the 3rd Workshop on Image/Video/Audio Quality in Computer Vision and Generative AI at the Winter Conference on Applications of Computer Vision (WACV 2024).

DrivenData targets the intersection of data science and social impact. This platform hosts competitions that integrate the social aspect of their domain of interest directly in their mission and definition, while also hosting a number of open-source projects and competition-winning AI models. Given its emphasis on social impact, this platform hosts a number of benchmarking challenges that target social issues like the detection of hateful memes [12] and image-based nature conservation efforts.

EvalAI is another open source platform that is able to create A2D and D2A competition environments, while also integrating optimization steps that allow evaluation code to run faster on multi-core cloud infrastructure. The EvalAI platform hosts many diverse multimedia-centric competitions, including image segmentation tasks based on LVIS [13] and a wide range of sports tasks [14].

Future directions, developments and other tools

While the tools and platforms described in the previous section represent just a portion of the EaaS platforms currently online in the research community, we would also like to mention some projects that are currently in the development stage or that can be considered additional tools for benchmarking initiatives:

  • The AI4Media benchmarking platform is currently in the prototype and development stage. Among the most interesting features and ideas promoted by the platform developers is the creation of complexity metrics that would help competition organizers understand the computational efficiency and resource requirements of the submitted systems.
  • The BenchmarkSTT started as a specialized benchmarking platform for speech-to-text, but is now evolving in different directions, including facial recognition in videos.
  • The PapersWithCode platform, while not a benchmarking platform per se, is useful as a repository that collects the results of AI models on datasets throughout the years, and groups different datasets studying the same concepts under the same umbrella (e.g., Image Classification, Object Detection, Medical Image Segmentation, etc.), while also providing links to scientific papers, GitHub implementations of the models, and links to the datasets. This may represent a good starting point for young researchers who are trying to understand the history and state of the art for certain domains and applications.


Benchmarking platforms represent a key component of benchmarking, pushing for fairness and trustworthiness in AI model comparison, while also providing tools that may foster reproducibility in AI. We are happy to see that many of the platforms discussed in this article are open source, or have open source components, thus allowing interested scientists to create their own custom implementations of these platforms, and to adapt them when necessary to their particular fields.


The work presented in this column is supported under the H2020 AI4Media “A European Excellence Centre for Media, Society and Democracy” project, contract #951911.


[1] Hanbury, A., Müller, H., Balog, K., Brodt, T., Cormack, G. V., Eggel, I., Gollub, T., Hopfgartner, F., Kalpathy-Cramer, J., Kando, N., Krithara, A., Lin, J., Mercer, S. & Potthast, M. (2015). Evaluation-as-a-service: Overview and outlook. arXiv preprint arXiv:1512.07454.
[2] Hanbury, A., Müller, H., Langs, G., Weber, M. A., Menze, B. H., & Fernandez, T. S. (2012). Bringing the algorithms to the data: cloud–based benchmarking for medical image analysis. In Information Access Evaluation. Multilinguality, Multimodality, and Visual Analytics: Third International Conference of the CLEF Initiative, CLEF 2012, Rome, Italy, September 17-20, 2012. Proceedings 3 (pp. 24-29). Springer Berlin Heidelberg.
[3] LeCun, Y., Bottou, L., Bengio, Y., & Haffner, P. (1998). Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11), 2278-2324.
[4] Xiao, H., Rasul, K., & Vollgraf, R. (2017). Fashion-MNIST: a novel image dataset for benchmarking machine learning algorithms. arXiv preprint arXiv:1708.07747.
[5] Niu, T., Zhu, S., Pang, L., & El Saddik, A. (2016). Sentiment analysis on multi-view social data. In MultiMedia Modeling: 22nd International Conference, MMM 2016, Miami, FL, USA, January 4-6, 2016, Proceedings, Part II 22 (pp. 15-27). Springer International Publishing.
[6] Thambawita, V., Hicks, S. A., Storås, A. M., Nguyen, T., Andersen, J. M., Witczak, O., … & Riegler, M. A. (2023). VISEM-Tracking, a human spermatozoa tracking dataset. Scientific Data, 10(1), 1-8.
[7] Ionescu, B., Müller, H., Villegas, M., García Seco de Herrera, A., Eickhoff, C., Andrearczyk, V., … & Gurrin, C. (2018). Overview of ImageCLEF 2018: Challenges, datasets and evaluation. In Experimental IR Meets Multilinguality, Multimodality, and Interaction: 9th International Conference of the CLEF Association, CLEF 2018, Avignon, France, September 10-14, 2018, Proceedings 9 (pp. 309-334). Springer International Publishing.
[8] Ionescu, B., Müller, H., Péteri, R., Dang-Nguyen, D. T., Piras, L., Riegler, M., … & Karampidis, K. (2019). ImageCLEF 2019: Multimedia retrieval in lifelogging, medical, nature, and security applications. In Advances in Information Retrieval: 41st European Conference on IR Research, ECIR 2019, Cologne, Germany, April 14–18, 2019, Proceedings, Part II 41 (pp. 301-308). Springer International Publishing.
[9] Ionescu, B., Müller, H., Péteri, R., Dang-Nguyen, D. T., Zhou, L., Piras, L., … & Constantin, M. G. (2020). ImageCLEF 2020: Multimedia retrieval in lifelogging, medical, nature, and internet applications. In Advances in Information Retrieval: 42nd European Conference on IR Research, ECIR 2020, Lisbon, Portugal, April 14–17, 2020, Proceedings, Part II 42 (pp. 533-541). Springer International Publishing.
[10] Ionescu, B., Müller, H., Péteri, R., Abacha, A. B., Demner-Fushman, D., Hasan, S. A., … & Popescu, A. (2021). The 2021 ImageCLEF Benchmark: Multimedia retrieval in medical, nature, internet and social media applications. In Advances in Information Retrieval: 43rd European Conference on IR Research, ECIR 2021, Virtual Event, March 28–April 1, 2021, Proceedings, Part II 43 (pp. 616-623). Springer International Publishing.
[11] de Herrera, A. G. S., Ionescu, B., Müller, H., Péteri, R., Abacha, A. B., Friedrich, C. M., … & Dogariu, M. (2022, April). ImageCLEF 2022: Multimedia retrieval in medical, nature, fusion, and internet applications. In European Conference on Information Retrieval (pp. 382-389). Cham: Springer International Publishing.
[12] Kiela, D., Firooz, H., Mohan, A., Goswami, V., Singh, A., Fitzpatrick, C. A., … & Parikh, D. (2021, August). The hateful memes challenge: Competition report. In NeurIPS 2020 Competition and Demonstration Track (pp. 344-360). PMLR.
[13] Gupta, A., Dollar, P., & Girshick, R. (2019). LVIS: A dataset for large vocabulary instance segmentation. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (pp. 5356-5364).
[14] Giancola, S., Cioppa, A., Deliège, A., Magera, F., Somers, V., Kang, L., … & Li, Z. (2022, October). SoccerNet 2022 challenges results. In Proceedings of the 5th International ACM Workshop on Multimedia Content Analysis in Sports (pp. 75-86).

Overview of Open Dataset Sessions and Benchmarking Competitions in 2022 – Part 2 (MDRE at MMM 2022, ACM MM 2022)

In this Dataset Column, we present a review of some of the notable events related to open datasets and benchmarking competitions in the field of multimedia. This year’s selection highlights the wide range of topics and datasets currently of interest to the community. Some of the events covered in this review include special sessions on open datasets and competitions featuring multimedia data. While this list is not exhaustive and contains an overview of about 40 datasets, it is meant to showcase the diversity of subjects and datasets explored in the field. This year’s review follows similar efforts from the previous year (https://records.sigmm.org/2022/01/12/overview-of-open-dataset-sessions-and-benchmarking-competitions-in-2021/), highlighting the ongoing importance of open datasets and benchmarking competitions in advancing research and development in multimedia. The column is divided into three parts; in this one we focus on MDRE at MMM 2022 and ACM MM 2022:

  • Multimedia Datasets for Repeatable Experimentation at the 28th International Conference on Multimedia Modeling (MDRE at MMM 2022 – https://mmm2022.org/ssp.html#mdre). We summarize the three datasets presented during the MDRE session, addressing topics such as user-centric video search competitions, a dataset (GPR1200) for evaluating the performance of deep neural networks in general image retrieval, and a dataset for evaluating the performance of Question Answering (QA) systems on lifelog data (LLQA).
  • Selected datasets at the 30th ACM Multimedia Conference (MM ’22 – https://2022.acmmm.org/). For a general report from ACM Multimedia 2022 please see (https://records.sigmm.org/2022/12/07/report-from-acm-multimedia-2022-by-nitish-nagesh/). We summarize nine datasets presented during the conference, covering a dataset for multimodal intent recognition (MIntRec), an audio-visual question answering dataset (AVQA), a large-scale mmWave radar dataset (mmBody), a multimodal sticker emotion recognition dataset (SER30K), a video-sentence dataset for vision-language pre-training (ACTION), a dataset of head and gaze behavior for 360-degree videos, a saliency in augmented reality dataset (SARD), a multi-modal dataset for spotting the differences between pairs of similar images (DialDiff), and a large-scale remote sensing image dataset (RSVG).

For the overview of datasets related to QoMEX 2022 and ODS at MMSys ’22 please check the first part (https://records.sigmm.org/?p=12292), while ImageCLEF 2022 and MediaEval 2022 are addressed in the third part (http://records.sigmm.org/?p=12362).

MDRE at MMM 2022

The Multimedia Datasets for Repeatable Experimentation (MDRE) special session was part of the 2022 International Conference on Multimedia Modeling (MMM 2022), held in Phu Quoc, Vietnam, on June 6-10, 2022, with support for both online and onsite presentations. The session was organized by Cathal Gurrin (Dublin City University, Ireland), Duc-Tien Dang-Nguyen (University of Bergen, Norway), Björn Þór Jónsson (IT University of Copenhagen, Denmark), Adam Jatowt (University of Innsbruck, Austria), Liting Zhou (Dublin City University, Ireland) and Graham Healy (Dublin City University, Ireland). Details regarding this session can be found at: https://mmm2022.org/ssp.html#mdre

The MDRE’22 special session at MMM’22 is the fourth MDRE session, and it represents an opportunity for interested researchers to submit their datasets to this track. The work submitted to MDRE is permanently available at http://mmdatasets.org, where all the current and past editions of MDRE are hosted. Authors are asked to provide a paper describing the dataset’s motivation, design, and usage, a brief summary of the experiments performed to date on the dataset, and a discussion of how it can be useful to the community, along with the dataset itself.

A Task Category Space for User-Centric Comparative Multimedia Search Evaluations
Paper available at: https://doi.org/10.1007/978-3-030-98358-1_16
Lokoč, J., Bailer, W., Barthel, K.U., Gurrin, C., Heller, S., Jónsson, B., Peška, L., Rossetto, L., Schoeffmann, K., Vadicamo, L., Vrochidis, S., Wu, J.
Charles University, Prague, Czech Republic; JOANNEUM RESEARCH, Graz, Austria; HTW Berlin, Berlin, Germany; Dublin City University, Dublin, Ireland; University of Basel, Basel, Switzerland; IT University of Copenhagen, Copenhagen, Denmark; University of Zurich, Zurich, Switzerland; Klagenfurt University, Klagenfurt, Austria; ISTI CNR, Pisa, Italy; Centre for Research and Technology Hellas, Thessaloniki, Greece; City University of Hong Kong, Hong Kong.
Dataset available at: On request

The authors have analyzed the spectrum of possible task categories and propose a list of individual axes that define a large space of possible task categories. Using this concept of category space, new user-centric video search competitions can be designed to benchmark video search systems from different perspectives. They further analyze the three task categories considered at the Video Browser Showdown and discuss possible (but sometimes challenging) shifts within the task category space.

GPR1200: A Benchmark for General-Purpose Content-Based Image Retrieval
Paper available at: https://doi.org/10.1007/978-3-030-98358-1_17
Schall, K., Barthel, K.U., Hezel, N., Jung, K.
Visual Computing Group, HTW Berlin, University of Applied Sciences, Germany.
Dataset available at: http://visual-computing.com/project/GPR1200

In this study, the authors have developed a new dataset called GPR1200 to evaluate the performance of deep neural networks for general-purpose content-based image retrieval (CBIR). They found that large-scale pretraining significantly improves retrieval performance and that further improvement can be achieved through fine-tuning. GPR1200 is presented as an easy-to-use and accessible but challenging benchmark dataset with a broad range of image categories.

LLQA – Lifelog Question Answering Dataset
Paper available at: https://doi.org/10.1007/978-3-030-98358-1_18
Tran, L.-D., Ho, T.C., Pham, L.A., Nguyen, B., Gurrin, C., Zhou, L.
Dublin City University, Dublin, Ireland; Vietnam National University, Ho Chi Minh University of Science, Ho Chi Minh City, Viet Nam; AISIA Research Lab, Ho Chi Minh City, Viet Nam.
Dataset available at: https://github.com/allie-tran/LLQA

This study presents the Lifelog Question Answering Dataset (LLQA), a new dataset for evaluating the performance of Question Answering (QA) systems on lifelog data. The dataset pairs over 15,000 multiple-choice questions with an augmented 85-day lifelog collection, and is intended to serve as a benchmark for future research in this area. The results of the study showed that QA on lifelog data is a challenging task that requires further exploration.

ACM MM 2022

Numerous dataset-related papers were presented at the 30th ACM International Conference on Multimedia (MM ’22), organized in Lisbon, Portugal, October 10 – 14, 2022 (https://2022.acmmm.org/). The complete MM ’22: Proceedings of the 30th ACM International Conference on Multimedia are available in the ACM Digital Library (https://dl.acm.org/doi/proceedings/10.1145/3503161).

There was no session specifically dedicated to datasets among the roughly 35 sessions at the MM ’22 symposium. However, the importance of datasets is illustrated by the following statistics, quantifying how often the term “dataset” appears in the MM ’22 Proceedings. The term appears in the title of 9 papers (7 last year), the keywords of 35 papers (66 last year), and the abstracts of 438 papers (339 last year). As a small example, nine selected papers focused primarily on new datasets with publicly available data are listed below. There are contributions focused on various multimedia applications, e.g., understanding multimedia content, multimodal fusion and embeddings, media interpretation, vision and language, engaging users with multimedia, emotional and social signals, interactions and Quality of Experience, and multimedia search and recommendation.
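Statistics of this kind can be gathered with a few lines of code once paper metadata is available. The sketch below is only one possible way of doing so, using case-insensitive substring matching over toy records; the record layout and the `count_term` helper are assumptions for illustration, and the column does not state how the figures above were actually obtained.

```python
def count_term(papers, term="dataset"):
    """Count how many papers mention `term` in each metadata field.

    Matching is a case-insensitive substring test, so plural forms
    such as "datasets" are counted as well.
    """
    counts = {"title": 0, "keywords": 0, "abstract": 0}
    for paper in papers:
        for field_name in counts:
            if term in paper.get(field_name, "").lower():
                counts[field_name] += 1
    return counts


# Toy records standing in for proceedings metadata.
papers = [
    {"title": "A New Dataset for X",
     "keywords": "benchmark; dataset",
     "abstract": "We present a dataset for X."},
    {"title": "Fast Retrieval",
     "keywords": "hashing",
     "abstract": "Evaluated on two public datasets."},
    {"title": "Video Coding",
     "keywords": "codec",
     "abstract": "A new codec."},
]

counts = count_term(papers)
print(counts)  # {'title': 1, 'keywords': 1, 'abstract': 2}
```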

MIntRec: A New Dataset for Multimodal Intent Recognition
Paper available at: https://doi.org/10.1145/3503161.3547906
Zhang, H., Xu, H., Wang, X., Zhou, Q., Zhao, S., Teng, J.
Tsinghua University, Beijing, China.
Dataset available at: https://github.com/thuiar/MIntRec

MIntRec is a dataset for multimodal intent recognition with 2,224 samples based on data collected from the TV series Superstore, in text, video, and audio modalities, annotated with twenty intent categories and speaker bounding boxes. Baseline models are built by adapting multimodal fusion methods and show significant improvement over the text-only modality. MIntRec is useful for studying relationships between modalities and improving intent recognition.

AVQA: A Dataset for Audio-Visual Question Answering on Videos
Paper available at: https://doi.org/10.1145/3503161.3548291
Yang, P., Wang, X., Duan, X., Chen, H., Hou, R., Jin, C., Zhu, W.
Tsinghua University, Shenzhen, China; Communication University of China, Beijing, China.
Dataset available at: https://mn.cs.tsinghua.edu.cn/avqa

An audio-visual question-answering dataset (AVQA) is introduced for videos in real-life scenarios. It includes 57,015 videos and 57,335 question-answer pairs that rely on clues from both audio and visual modalities. A Hierarchical Audio-Visual Fusing module is proposed to model correlations among audio, visual, and text modalities. AVQA can be used to test models with a deeper understanding of multimodal information on audio-visual question answering in real-life scenarios.

mmBody Benchmark: 3D Body Reconstruction Dataset and Analysis for Millimeter Wave Radar
Paper available at: https://doi.org/10.1145/3503161.3548262
Chen, A., Wang, X., Zhu, S., Li, Y., Chen, J., Ye, Q.
Zhejiang University, Hangzhou, China.
Dataset available at: On request

A large-scale mmWave radar dataset with synchronized and calibrated point clouds and RGB(D) images is presented, along with an automatic 3D body annotation system. State-of-the-art methods are trained and tested on the dataset, showing that mmWave radar can achieve better 3D body reconstruction accuracy than an RGB camera but worse than a depth camera. The dataset and results provide insights into improving mmWave radar reconstruction and combining signals from different sensors.

SER30K: A Large-Scale Dataset for Sticker Emotion Recognition
Paper available at: https://doi.org/10.1145/3503161.3548407
Liu, S., Zhang, X., Yan, J.
Nankai University, Tianjin, China.
Dataset available at: https://github.com/nku-shengzheliu/SER30K

A new multimodal sticker emotion recognition dataset called SER30K with 1,887 sticker themes and 30,739 images is introduced for understanding emotions in stickers. A proposed method called LORA, using a vision transformer and local re-attention module, effectively extracts visual and language features for emotion recognition on SER30K and other datasets.

Auto-captions on GIF: A Large-scale Video-sentence Dataset for Vision-language Pre-training
Paper available at: https://doi.org/10.1145/3503161.3551581
Pan, Y., Li, Y., Luo, J., Xu, J., Yao, T., Mei, T.
JD Explore Academy, Beijing, China.
Dataset available at: http://www.auto-video-captions.top/2022/dataset

A new large-scale pre-training dataset, Auto-captions on GIF (ACTION), is presented for generic video understanding. It contains video-sentence pairs extracted and filtered from web pages and can be used for pre-training and downstream tasks such as video captioning and sentence localization. Comparisons with existing video-sentence datasets are made.

Where Are You Looking?: A Large-Scale Dataset of Head and Gaze Behavior for 360-Degree Videos and a Pilot Study
Paper available at: https://doi.org/10.1145/3503161.3548200
Jin, Y., Liu, J., Wang, F., Cui, S.
The Chinese University of Hong Kong, Shenzhen, Shenzhen, China.
Dataset available at: https://cuhksz-inml.github.io/head_gaze_dataset/

A dataset of users’ head and gaze behaviors in 360° videos is presented, containing rich dimensions, large scale, strong diversity, and high frequency. A quantitative taxonomy for 360° videos is also proposed, containing three objective technical metrics. Results of a pilot study on users’ behaviors and a case of application in tile-based 360° video streaming show the usefulness of the dataset for improving the performance of existing works.

Saliency in Augmented Reality
Paper available at: https://doi.org/10.1145/3503161.3547955
Duan, H., Shen, W., Min, X., Tu, D., Li, J., Zhai, G.
Shanghai Jiao Tong University, Shanghai, China; Alibaba Group, Hangzhou, China.
Dataset available at: https://github.com/DuanHuiyu/ARSaliency

A dataset, Saliency in AR Dataset (SARD), containing 450 background, 450 AR, and 1350 superimposed images with three mixing levels, is constructed to study the interaction between background scenes and AR contents, and the saliency prediction problem in AR. An eye-tracking experiment is conducted among 60 subjects to collect data.

Visual Dialog for Spotting the Differences between Pairs of Similar Images
Paper available at: https://doi.org/10.1145/3503161.3548170
Zheng, D., Meng, F., Si, Q., Fan, H., Xu, Z., Zhou, J., Feng, F., Wang, X.
Beijing University of Posts and Telecommunications, Beijing, China; WeChat AI, Tencent Inc, Beijing, China; Institute of Information Engineering, Chinese Academy of Sciences, Beijing, China; University of Trento, Trento, Italy.
Dataset available at: https://github.com/zd11024/Spot_Difference

A new visual dialog task called Dial-the-Diff is proposed, in which two interlocutors access two similar images and try to spot the difference between them through conversation in natural language. A large-scale multi-modal dataset called DialDiff, containing 87k Virtual Reality images and 78k dialogs, is built for the task. Benchmark models are also proposed and evaluated to bring new challenges to dialog strategy and object categorization.

Visual Grounding in Remote Sensing Images
Paper available at: https://doi.org/10.1145/3503161.3548316
Sun, Y., Feng, S., Li, X., Ye, Y., Kang, J., Huang, X.
Harbin Institute of Technology, Shenzhen, Shenzhen, China; Soochow University, Suzhou, China.
Dataset available at: https://sunyuxi.github.io/publication/GeoVG

A new problem of visual grounding in large-scale remote sensing images has been presented, in which the task is to locate particular objects in an image by a natural language expression. A new dataset, called RSVG, has been collected and a new method, GeoVG, has been designed to address the challenges of existing methods in dealing with remote sensing images.

Overview of Open Dataset Sessions and Benchmarking Competitions in 2022 – Part 3 (ImageCLEF 2022, MediaEval 2022)

In this Dataset Column, we present a review of some of the notable events related to open datasets and benchmarking competitions in the field of multimedia. This year’s selection highlights the wide range of topics and datasets currently of interest to the community. Some of the events covered in this review include special sessions on open datasets and competitions featuring multimedia data. While this list is not exhaustive and contains an overview of about 40 datasets, it is meant to showcase the diversity of subjects and datasets explored in the field. This year’s review follows similar efforts from the previous year (https://records.sigmm.org/2022/01/12/overview-of-open-dataset-sessions-and-benchmarking-competitions-in-2021/), highlighting the ongoing importance of open datasets and benchmarking competitions in advancing research and development in multimedia. The column is divided into three parts; in this one we focus on ImageCLEF 2022 and MediaEval 2022:

  • ImageCLEF 2022 (https://www.imageclef.org/2022). We summarize the 5 datasets launched for the benchmarking tasks, related to several topics like social media profile assessment (ImageCLEFaware), segmentation and labeling of underwater coral images (ImageCLEFcoral), late fusion ensembling systems for multimedia data (ImageCLEFfusion) and medical imaging analysis (ImageCLEFmedical Caption, and ImageCLEFmedical Tuberculosis).
  • MediaEval 2022 (https://multimediaeval.github.io/editions/2022/). We summarize the 11 datasets launched for the benchmarking tasks, that target a wide range of multimedia topics like the analysis of flood related media (DisasterMM), game analytics (Emotional Mario), news item processing (FakeNews, NewsImages), multimodal understanding of smells (MUSTI), medical imaging (Medico), fishing vessel analysis (NjordVid), media memorability (Memorability), sports data analysis (Sport Task, SwimTrack), and urban pollution analysis (Urban Air).

For the overview of datasets related to QoMEX 2022 and ODS at MMSys ’22 please check the first part (https://records.sigmm.org/?p=12292), while MDRE at MMM 2022 and ACM MM 2022 are addressed in the second part (http://records.sigmm.org/?p=12360).

ImageCLEF 2022

ImageCLEF is a multimedia evaluation campaign, part of the CLEF initiative (http://www.clef-initiative.eu/). The 2022 edition (https://www.imageclef.org/2022) is the 19th edition of this initiative and addresses four main research tasks in several domains, such as medicine, nature, social media content and user interface processing. ImageCLEF 2022 is organized by Bogdan Ionescu (University Politehnica of Bucharest, Romania), Henning Müller (University of Applied Sciences Western Switzerland, Sierre, Switzerland), Renaud Péteri (University of La Rochelle, France), Ivan Eggel (University of Applied Sciences Western Switzerland, Sierre, Switzerland) and Mihai Dogariu (University Politehnica of Bucharest, Romania).

ImageCLEFaware
Paper available at: https://ceur-ws.org/Vol-3180/paper-98.pdf
Popescu, A., Deshayes-Chossar, J., Schindler, H., Ionescu, B.
CEA LIST, France; University Politehnica of Bucharest, Romania.
Dataset available at: https://www.imageclef.org/2022/aware

This is the second edition of the aware task at ImageCLEF. It seeks to understand how public social media profiles affect users in certain important scenarios, namely a search or application for: a bank loan, an accommodation, a job as waitress/waiter, and a job in IT.

ImageCLEFcoral
Paper available at: https://ceur-ws.org/Vol-3180/paper-97.pdf
Chamberlain, J., de Herrera, A.G.S., Campello, A., Clark, A.
University of Essex, United Kingdom; Wellcome Trust, United Kingdom.
Dataset available at: https://www.imageclef.org/2022/coral

This fourth edition of the coral task addresses the problem of segmenting and labeling a set of underwater images used in the monitoring of coral reefs. The task proposes two subtasks, namely an annotation and localization subtask and a pixel-wise parsing subtask.

ImageCLEFfusion
Paper available at: https://ceur-ws.org/Vol-3180/paper-99.pdf
Ştefan, L-D., Constantin, M.G., Dogariu, M., Ionescu, B.
University Politehnica of Bucharest, Romania.
Dataset available at: https://www.imageclef.org/2022/fusion

This is the first edition of the fusion task, and it proposes two scenarios adapted for the use of late fusion or ensembling systems. The two scenarios correspond to a regression approach, using data associated with the prediction of media interestingness, and a retrieval scenario, using data associated with search result diversification.
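To make the idea of late fusion concrete for the regression scenario, the sketch below combines the output scores of several independent systems with a weighted average. This is only an illustration under stated assumptions: the function name `late_fusion`, the example scores, and the choice of a weighted mean are hypothetical and are not taken from the task definition or from any participant's method.

```python
def late_fusion(predictions, weights=None):
    """Fuse per-system scores into one score per sample (weighted mean).

    predictions: list of lists, one inner list of scores per system.
    weights: optional list with one non-negative weight per system;
             uniform weights are assumed when omitted.
    """
    n_systems = len(predictions)
    if weights is None:
        weights = [1.0] * n_systems
    total = sum(weights)
    n_samples = len(predictions[0])
    # Normalizing by the weight sum keeps fused scores in the input range.
    return [
        sum(w * system[i] for w, system in zip(weights, predictions)) / total
        for i in range(n_samples)
    ]


# Hypothetical interestingness scores from three systems for four media items.
system_scores = [
    [0.2, 0.8, 0.5, 0.1],
    [0.4, 0.6, 0.5, 0.3],
    [0.3, 0.7, 0.8, 0.2],
]
fused = late_fusion(system_scores)
```

With uniform weights, each fused score is simply the mean of the three system scores for that item; passing `weights` lets a stronger system count more.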

ImageCLEFmedical Tuberculosis
Paper available at: https://ceur-ws.org/Vol-3180/paper-96.pdf
Kozlovski, S., Dicente Cid, Y., Kovalev, V., Müller, H.
United Institute of Informatics Problems, Belarus; Roche Diagnostics, Spain; University of Applied Sciences Western Switzerland, Switzerland; University of Geneva, Switzerland.
Dataset available at: https://www.imageclef.org/2022/medical/tuberculosis

This task is now in its sixth edition, and is being upgraded to a detection problem. Furthermore, two subtasks are now included: the detection of lung cavern regions in lung CT images associated with lung tuberculosis and the prediction of 4 binary features of caverns suggested by experienced radiologists.

ImageCLEFmedical Caption
Paper available at: https://ceur-ws.org/Vol-3180/paper-95.pdf
Rückert, J., Ben Abacha, A., de Herrera, A.G.S., Bloch, L., Brüngel, R., Idrissi-Yaghir, A., Schäfer, H., Müller, H., Friedrich, C.M.
University of Applied Sciences and Arts Dortmund, Germany; Microsoft, USA; University of Essex, UK; University Hospital Essen, Germany; University of Applied Sciences Western Switzerland, Switzerland; University of Geneva, Switzerland.
Dataset available at: https://www.imageclef.org/2022/medical/caption

The sixth edition of this task consists of two subtasks. In the first subtask participants must detect relevant medical concepts in a large corpus of medical images, while in the second subtask coherent captions must be generated for the entire context of the medical images, targeting the interplay of many visible concepts.

MediaEval 2022

The MediaEval Multimedia Evaluation benchmark (https://multimediaeval.github.io/) offers challenges in artificial intelligence for multimedia data. This is the 13th edition of MediaEval (https://multimediaeval.github.io/editions/2022/) and 11 tasks were proposed for this edition, targeting a large number of challenges by creating algorithms for retrieval, analysis, and exploration. For this edition, a “Quest for Insight” is pursued, where organizers are encouraged to propose interesting and insightful questions about the concepts that will be explored, and participants are encouraged to push beyond only striving to improve evaluation scores and to also work toward a deeper understanding of the challenges.

DisasterMM: Multimedia Analysis of Disaster-Related Social Media Data
Preprint available at: https://2022.multimediaeval.com/paper5337.pdf
Andreadis, S., Bozas, A., Gialampoukidis, I., Mavropoulos, T., Moumtzidou, A., Vrochidis, S., Kompatsiaris, I., Fiorin, R., Lombardo, F., Norbiato, D., Ferri, M.
Information Technologies Institute – Centre of Research and Technology Hellas, Greece; Eastern Alps River Basin District, Italy.
Dataset available at: https://multimediaeval.github.io/editions/2022/tasks/disastermm/

The DisasterMM task proposes the analysis of social media data extracted from Twitter, targeting the analysis of natural or man-made disaster posts. For this year, the organizers focused on the analysis of flooding events and proposed two subtasks: relevance classification of posts and location extraction from texts.

Emotional Mario: A Game Analytics Challenge
Preprint or paper not published yet.
Lux, M., Alshaer, M., Riegler, M., Halvorsen, P., Thambawita, V., Hicks, S., Dang-Nguyen, D.-T.
Alpen-Adria-Universität Klagenfurt, Austria; SimulaMet, Norway; University of Bergen, Norway.
Dataset available at: https://multimediaeval.github.io/editions/2022/tasks/emotionalmario/

Emotional Mario focuses on the Super Mario Bros videogame, analyzing gamer data that consists of game input, demographics, biomedical data, and video of the players’ faces. Two subtasks are proposed: event detection, seeking to identify gaming events of significant importance based on facial videos and biometric data, and gameplay summarization, seeking to select the best moments of gameplay.

FakeNews Detection
Preprint available at: https://2022.multimediaeval.com/paper116.pdf
Pogorelov, K., Schroeder, D.T., Brenner, S., Maulana, A., Langguth, J.
Simula Research Laboratory, Norway; University of Bergen, Norway; Stuttgart Media University, Germany.
Dataset available at: https://multimediaeval.github.io/editions/2022/tasks/fakenews/

The FakeNews Detection task proposes several ways of analyzing fake news and the way it spreads, using COVID-19 related conspiracy theories. The competition proposes three subtasks: the first targets conspiracy detection in text-based data, the second asks participants to analyze graphs of conspiracy posters, while the last one combines the first two, aiming at detection on both text and graph data.

MUSTI – Multimodal Understanding of Smells in Texts and Images
Preprint available at: https://2022.multimediaeval.com/paper9634.pdf
Hürriyetoğlu, A., Paccosi, T., Menini, S., Zinnen, M., Lisena, P., Akdemir, K., Troncy, R., van Erp, M.
KNAW Humanities Cluster DHLab, Netherlands; Fondazione Bruno Kessler, Italy; Friedrich-Alexander-Universität, Germany; EURECOM, France.
Dataset available at: https://multimediaeval.github.io/editions/2022/tasks/musti/

MUSTI is one of the few benchmarks that seek to analyze the underrepresented modality of smell. The organizers seek to further the understanding of descriptions of smell in texts and images, and propose two subtasks: the first aims at classification of smells based on language and image models, predicting whether texts or images evoke the same smell source or not, while the second asks participants to identify the common smell sources.

Medical Multimedia Task: Transparent Tracking of Spermatozoa
Preprint available at: https://2022.multimediaeval.com/paper5501.pdf
Thambawita, V., Hicks, S., Storås, A.M., Andersen, J.M., Witczak, O., Haugen, T.B., Hammer, H., Nguyen, T., Halvorsen, P., Riegler, M.A.
SimulaMet, Norway; OsloMet, Norway; The Arctic University of Norway, Norway.
Dataset available at: https://multimediaeval.github.io/editions/2022/tasks/medico/

The Medico Medical Multimedia Task tackles the challenge of tracking sperm cells in video recordings, while analyzing the specific characteristics of these cells. Four subtasks are proposed: a sperm-cell real-time tracking task in videos, a prediction of cell motility task, a catch and highlight task seeking to identify sperm cell speed, and an explainability task.

NewsImages
Preprint available at: https://2022.multimediaeval.com/paper8446.pdf
Kille, B., Lommatzsch, A., Özgöbek, Ö., Elahi, M., Dang-Nguyen, D.-T.
Norwegian University of Science and Technology, Norway; Berlin Institute of Technology, Germany; University of Bergen, Norway; Kristiania University College, Norway.
Dataset available at: https://multimediaeval.github.io/editions/2022/tasks/newsimages/

The goal of the NewsImages task is to further the understanding of the relationship between textual and image content in news articles. Participants are tasked with re-linking and re-matching textual news articles with the corresponding images, based on data gathered from social media, news portals and RSS feeds.

NjordVid: Fishing Trawler Video Analytics Task
Preprint available at: https://2022.multimediaeval.com/paper5854.pdf
Nordmo, T.A.S., Ovesen, A.B., Johansen, H.D., Johansen, D., Riegler, M.A.
The Arctic University of Norway, Norway; SimulaMet, Norway.
Dataset available at: https://multimediaeval.github.io/editions/2022/tasks/njord/

The NjordVid task provides data from fishing vessel recordings, intended as a step toward maintaining sustainable fishing practices. Two subtasks are proposed: detection of events on the boat, such as movement of people or catching fish, and preserving the privacy of on-board personnel.

Predicting Video Memorability
Preprint available at: https://2022.multimediaeval.com/paper2265.pdf
Sweeney, L., Constantin, M.G., Demarty, C.-H., Fosco, C., de Herrera, A.G.S., Halder, S., Healy, G., Ionescu, B., Matran-Fernandez, A., Smeaton, A.F., Sultana, M.
Dublin City University, Ireland; University Politehnica of Bucharest, Romania; InterDigital, France; Massachusetts Institute of Technology Cambridge, USA; University of Essex, UK.
Dataset available at: https://multimediaeval.github.io/editions/2022/tasks/memorability/

The Video Memorability task asks participants to predict how memorable a video sequence is, targeting short-term memorability. Three subtasks are proposed for this edition: a general video-based prediction task where participants are asked to predict the memorability score of a video, a generalization task where training and testing are performed on different sources of data, and an EEG-based task where annotator EEG scans are provided.

Sport Task: Fine Grained Action Detection and Classification of Table Tennis Strokes from Videos
Preprint available at: https://2022.multimediaeval.com/paper4766.pdf
Martin, P.-E., Calandre, J., Mansencal, B., Benois-Pineau, J., Péteri, R., Mascarilla, L., Morlier, J.
Max Planck Institute for Evolutionary Anthropology, Germany; La Rochelle University, France; Univ. Bordeaux, France.
Dataset available at: https://multimediaeval.github.io/editions/2022/tasks/sportsvideo/

The Sport Task aims at action detection and classification in videos recorded at table tennis events. Low inter-class variability makes this task harder than other traditional action classification benchmarks. Two subtasks are proposed: a classification task where participants are asked to label table tennis videos according to the strokes the players make, and a detection task where participants must detect whether a stroke was made.

SwimTrack: Swimmers and Stroke Rate Detection in Elite Race Videos
Preprint available at: https://2022.multimediaeval.com/paper6876.pdf
Jacquelin, N., Jaunet, T., Vuillemot, R., Duffner, S.
École Centrale de Lyon, France; INSA-Lyon, France.
Dataset available at: https://multimediaeval.github.io/editions/2022/tasks/swimtrack/

The SwimTrack task comprises five different subtasks related to the analysis of competition-level swimming videos, and provides multimodal video, image and audio data. The five subtasks are as follows: a position detection task associating swimmers with the numbers of swimming lanes, a stroke rate detection task, a camera registration task where participants must apply homography projection methods to create a top-view of the pool, a character recognition on scoreboards task, and a sound detection task associated with buzzer sounds.
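The camera registration subtask relies on homography projection. As a rough illustration of what that involves, the sketch below maps an image point to top-view coordinates using a 3×3 homography matrix. The matrix values and the function name `apply_homography` are hypothetical; in practice the matrix would be estimated from point correspondences between the camera image and the pool plane, which is not shown here.

```python
def apply_homography(H, point):
    """Map a 2D point through a 3x3 homography H (list of three rows)."""
    x, y = point
    # Multiply H by the homogeneous point (x, y, 1).
    xh = H[0][0] * x + H[0][1] * y + H[0][2]
    yh = H[1][0] * x + H[1][1] * y + H[1][2]
    w = H[2][0] * x + H[2][1] * y + H[2][2]
    if w == 0:
        raise ValueError("point maps to infinity under this homography")
    # Divide by the homogeneous coordinate to return to 2D.
    return (xh / w, yh / w)


# Hypothetical homography with a small perspective term in the last row.
H_example = [
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.001, 1.0],
]
top_view_point = apply_homography(H_example, (100.0, 1000.0))
```

The division by `w` is what distinguishes a homography from an affine transform: points farther up the image (larger `y` in this toy matrix) are scaled down more strongly, which is how perspective foreshortening is undone or introduced.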

Urban Air: Urban Life and Air Pollution
Preprint available at: https://2022.multimediaeval.com/paper586.pdf
Dao, M.-S., Dang, T.-H., Nguyen-Tai, T.-L., Nguyen, T.-B., Dang-Nguyen, D.-T.
National Institute of Information and Communications Technology, Japan; Dalat University, Vietnam; LOC GOLD Technology MTV Ltd. Co, Vietnam; University of Science, Vietnam National University in HCM City, Vietnam; Bergen University, Norway.
Dataset available at: https://multimediaeval.github.io/editions/2022/tasks/urbanair/

The Urban Air task provides multimodal data that allows the analysis of air pollution and pollution patterns in urban environments. The organizers created two subtasks for this competition: a multimodal/crossmodal air quality index prediction task using station and/or CCTV data, and a periodic traffic pollution pattern discovery task.

Overview of Open Dataset Sessions and Benchmarking Competitions in 2022 – Part 1 (QoMEX 2022, ODS at MMSys ’22)

In this Dataset Column, we present a review of some of the notable events related to open datasets and benchmarking competitions in the field of multimedia. This year’s selection highlights the wide range of topics and datasets currently of interest to the community. Some of the events covered in this review include special sessions on open datasets and competitions featuring multimedia data. While this list is not exhaustive and contains an overview of about 40 datasets, it is meant to showcase the diversity of subjects and datasets explored in the field. This year’s review follows similar efforts from the previous year (https://records.sigmm.org/2022/01/12/overview-of-open-dataset-sessions-and-benchmarking-competitions-in-2021/), highlighting the ongoing importance of open datasets and benchmarking competitions in advancing research and development in multimedia. The column is divided into three parts; this one focuses on QoMEX 2022 and ODS at MMSys ’22:

  • 14th International Conference on Quality of Multimedia Experience (QoMEX 2022 – https://qomex2022.itec.aau.at/). We summarize three datasets included in this conference, which address QoE studies on audiovisual 360° video, storytelling for quality perception, and the energy consumption and QoE of video streaming.
  • Open Dataset and Software Track at 13th ACM Multimedia Systems Conference (ODS at MMSys ’22 – https://mmsys2022.ie/). We summarize nine datasets presented at the ODS track, targeting several topics, including surveillance videos from a fishing vessel (Njord), multi-codec 8K UHD videos (8K MPEG-DASH dataset), light-field (LF) synthetic immersive large-volume plenoptic dataset (SILVR), a dataset of online news items and the related task of rematching (NewsImages), video sequences, characterized by various complexity categories (VCD), QoE dataset of realistic video clips for real networks, dataset of 360° videos with subjective emotional ratings (PEM360), free-viewpoint video dataset, and cloud gaming dataset (CGD).

For the overview of datasets related to MDRE at MMM 2022 and ACM MM 2022 please check the second part (http://records.sigmm.org/?p=12360), while ImageCLEF 2022 and MediaEval 2022 are addressed in the third part (http://records.sigmm.org/?p=12362).

QoMEX 2022

Three dataset papers were presented at the International Conference on Quality of Multimedia Experience (QoMEX 2022), organized in Lippstadt, Germany, September 5 – 7, 2022 (https://qomex2022.itec.aau.at/). The complete QoMEX ’22 Proceedings are available in the IEEE Digital Library (https://ieeexplore.ieee.org/xpl/conhome/9900491/proceeding).

These datasets were presented within the Databases session, chaired by Professor Oliver Hohlfeld. These three papers present contributions focused on audiovisual 360-degree videos, storytelling for quality perception and modelling of energy consumption and streaming of video QoE.

Audiovisual Database with 360° Video and Higher-Order Ambisonics Audio for Perception, Cognition, Behavior and QoE Evaluation Research
Paper available at: https://ieeexplore.ieee.org/document/9900893
Robotham, T., Singla, A., Rummukainen, O., Raake, A. and Habets, E.
International Audio Laboratories Erlangen, a joint institution of the Friedrich-Alexander-Universität Erlangen-Nürnberg (FAU) and Fraunhofer Institute for Integrated Circuits (IIS), Germany; TU Ilmenau, Germany.
Dataset available at: https://qoevave.github.io/database/

This publicly available database provides audiovisual 360° content with higher-order Ambisonics audio. It consists of twelve scenes capturing real-life nature and urban environments with a video resolution of 7680×3840 at 60 frames-per-second and with 4th-order Ambisonics audio. These 360° video sequences, with an average duration of 60 seconds, represent real-life settings for systematically evaluating various dimensions of uni-/multi-modal perception, cognition, behavior, and QoE. It provides high-quality reference material with a balanced focus on auditory and visual sensory information.

The Storytime Dataset: Simulated Videotelephony Clips for Quality Perception Research
Paper available at: https://ieeexplore.ieee.org/document/9900888
Spang, R. P., Voigt-Antons, J. N. and Möller, S.
Technische Universität Berlin, Berlin, Germany; Hamm-Lippstadt University of Applied Sciences, Lippstadt, Germany.
Dataset available at: https://osf.io/cyb8w/

This is a dataset of simulated videotelephony clips to act as stimuli in quality perception research. It consists of four different stories in the German language that are told through ten consecutive parts, each about 10 seconds long. Each of these parts is available in four different quality levels, ranging from perfect to stalling. All clips (FullHD, H.264 / AAC) are actual recordings from end-user video-conference software to ensure ecological validity and realism of quality degradation. Apart from a detailed description of the methodological approach, the authors contribute the entire stimuli dataset containing 160 videos and all rating scores for each file.

Modelling of Energy Consumption and Streaming Video QoE using a Crowdsourcing Dataset
Paper available at: https://ieeexplore.ieee.org/document/9900886
Herglotz, C., Robitza, W., Kränzler, M., Kaup, A. and Raake, A.
Friedrich-Alexander-Universität, Erlangen, Germany; Audiovisual Technology Group, TU Ilmenau, Germany; AVEQ GmbH, Vienna, Austria.
Dataset available at: On request

This paper performs a first analysis of end-user power efficiency and Quality of Experience of a video streaming service. A crowdsourced dataset comprising 447,000 streaming events from YouTube is used to estimate both the power consumption and perceived quality. The power consumption is modelled based on previous work, which extends toward predicting the power usage of different devices and codecs. The user-perceived QoE is estimated using a standardized model.

ODS at MMSys ’22

The traditional Open Dataset and Software Track (ODS) was a part of the 13th ACM Multimedia Systems Conference (MMSys ’22) organized in Athlone, Ireland, June 14 – 17, 2022 (https://mmsys2022.ie/). The complete MMSys ’22: Proceedings of the 13th ACM Multimedia Systems Conference are available in the ACM Digital Library (https://dl.acm.org/doi/proceedings/10.1145/3524273).

The Open Dataset and Software Chairs for MMSys ’22 were Roberto Azevedo (Disney Research, Switzerland), Saba Ahsan (Nokia Technologies, Finland), and Yao Liu (Rutgers University, USA). The ODS session, with 14 papers, opened with pitches on Wednesday, June 15, followed by a poster session. Nine of the fourteen contributions were dataset papers. A listing of the paper titles, dataset summaries, and associated DOIs is included below for your convenience.

Njord: a fishing trawler dataset
Paper available at: https://doi.org/10.1145/3524273.3532886
Nordmo, T.-A.S., Ovesen, A.B., Juliussen, B.A., Hicks, S.A., Thambawita, V., Johansen, H.D., Halvorsen, P., Riegler, M.A., Johansen, D.
UiT the Arctic University of Norway, Norway; SimulaMet, Norway; Oslo Metropolitan University, Norway.
Dataset available at: https://doi.org/10.5281/zenodo.6284673

This paper presents Njord, a dataset of surveillance videos from a commercial fishing vessel. The dataset aims to demonstrate the potential for using data from fishing vessels to detect accidents and report fish catches automatically. The authors also provide a baseline analysis of the dataset and discuss possible research questions that it could help answer.

Multi-codec ultra high definition 8K MPEG-DASH dataset
Paper available at: https://doi.org/10.1145/3524273.3532889
Taraghi, B., Amirpour, H., Timmerer, C.
Christian Doppler Laboratory Athena, Institute of Information Technology (ITEC), Alpen-Adria-Universität Klagenfurt, Klagenfurt, Austria.
Dataset available at: http://ftp.itec.aau.at/datasets/mmsys22/

This paper presents a dataset of multimedia assets encoded with various video codecs, including AVC, HEVC, AV1, and VVC, and packaged using the MPEG-DASH format. The dataset includes resolutions up to 8K and has a maximum media duration of 322 seconds, with segment lengths of 4 and 8 seconds. It is intended to facilitate research and development of video encoding technology for streaming services.
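As a small worked example of how segment length relates to the number of segments a client must request, the sketch below computes the segment count for the maximum media duration mentioned above. It assumes fixed-length segments with a possibly shorter final segment; the function name `segment_count` is hypothetical, and the actual packaging of the dataset may differ.

```python
import math


def segment_count(duration_s, segment_length_s):
    """Number of fixed-length segments needed to cover a media duration."""
    # The last segment may be shorter than segment_length_s, hence the ceiling.
    return math.ceil(duration_s / segment_length_s)


# Maximum media duration in the dataset (322 s) with both segment lengths.
count_4s = segment_count(322, 4)
count_8s = segment_count(322, 8)
```

Under these assumptions, halving the segment length roughly doubles the number of requests, which is the usual trade-off between adaptation granularity and request overhead in adaptive streaming.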

SILVR: a synthetic immersive large-volume plenoptic dataset
Paper available at: https://doi.org/10.1145/3524273.3532890
Courteaux, M., Artois, J., De Pauw, S., Lambert, P., Van Wallendael, G.
Ghent University – Imec, Oost-Vlaanderen, Zwijnaarde, Belgium.
Dataset available at: https://idlabmedia.github.io/large-lightfields-dataset/

SILVR (synthetic immersive large-volume plenoptic dataset) is a light-field (LF) image dataset allowing for six-degrees-of-freedom navigation in larger volumes while maintaining full panoramic field of view. It includes three virtual scenes with 642-2226 views, rendered with 180° fish-eye lenses and featuring color images and depth maps. The dataset also includes multiview rendering software and a lens-reprojection tool. SILVR can be used to evaluate LF coding and rendering techniques.

NewsImages: addressing the depiction gap with an online news dataset for text-image rematching
Paper available at: https://doi.org/10.1145/3524273.3532891
Lommatzsch, A., Kille, B., Özgöbek, O., Zhou, Y., Tešić, J., Bartolomeu, C., Semedo, D., Pivovarova, L., Liang, M., Larson, M.
DAI-Labor, TU-Berlin, Berlin, Germany; NTNU, Trondheim, Norway; Texas State University, San Marcos, TX, United States; Universidade Nova de Lisboa, Lisbon, Portugal.
Dataset available at: https://multimediaeval.github.io/editions/2021/tasks/newsimages/

NewsImages is a dataset of online news items and the related task of news images rematching, which aims to study the “depiction gap” between the content of an image and the text that accompanies it. The dataset is useful for studying connections between image and text and addressing the depiction gap, including sparse data, diversity of content, and the importance of background knowledge.

VCD: Video Complexity Dataset
Paper available at: https://doi.org/10.1145/3524273.3532892
Amirpour, H., Menon, V.V., Afzal, S., Ghanbari, M., Timmerer, C.
Christian Doppler Laboratory Athena, Institute of Information Technology (ITEC), Alpen-Adria-Universität Klagenfurt, Klagenfurt, Austria; School of Computer Science and Electronic Engineering, University of Essex, Colchester, United Kingdom.
Dataset available at: https://ftp.itec.aau.at/datasets/video-complexity/

The Video Complexity Dataset (VCD) is a collection of 500 Ultra High Definition (UHD) resolution video sequences, characterized by spatial and temporal complexities, rate-distortion complexity, and encoding complexity with the x264 AVC/H.264 and x265 HEVC/H.265 video encoders. It is suitable for video coding applications such as video streaming, two-pass encoding, per-title encoding, and scene-cut detection. These sequences are provided at 24 frames per second (fps) and stored online in losslessly encoded 8-bit 4:2:0 format.

Realistic video sequences for subjective QoE analysis
Paper available at: https://doi.org/10.1145/3524273.3532894
Hodzic, K., Cosovic, M., Mrdovic, S., Quinlan, J.J., Raca, D.
Faculty of Electrical Engineering, University of Sarajevo, Bosnia and Herzegovina; School of Computer Science & Information Technology, University College Cork, Ireland.
Dataset available at: https://shorturl.at/dtISV

The DashReStreamer framework is designed to recreate adaptively streamed video in real networks to evaluate user Quality of Experience (QoE). The authors have also created a dataset of 234 realistic video clips, based on video logs collected from real mobile and wireless networks, including video logs and network bandwidth profiles. This dataset and framework will help researchers understand the impact of video QoE dynamics on multimedia streaming.

PEM360: a dataset of 360° videos with continuous physiological measurements, subjective emotional ratings and motion traces
Paper available at: https://doi.org/10.1145/3524273.3532895
Guimard, Q., Robert, F., Bauce, C., Ducreux, A., Sassatelli, L., Wu, H.-Y., Winckler, M., Gros, A.
Université Côte d’Azur, Inria, CNRS, I3S, Sophia-Antipolis, France.
Dataset available at: https://gitlab.com/PEM360/PEM360/

PEM360 is a dataset of user head movements and gaze recordings in 360° videos, along with self-reported emotional ratings and continuous physiological measurement data. It aims to understand the connection between user attention, emotions, and immersive content, and includes software tools and joint instantaneous visualization of user attention and emotion, called “emotional maps.” The entire data and code are available in a reproducible framework.

A New Free Viewpoint Video Dataset and DIBR Benchmark
Paper available at: https://doi.org/10.1145/3524273.3532897
Guo, S., Zhou, K., Hu, J., Wang, J., Xu, J., Song, L.
Institute of Image Communication and Network Engineering, Shanghai Jiao Tong University, Shanghai, China.
Dataset available at: https://github.com/sjtu-medialab/Free-Viewpoint-RGB-D-Video-Dataset

A new dynamic RGB-D video dataset for free-viewpoint video (FVV) research is presented, including 13 groups of dynamic scenes and one group of static scenes, each with 12 HD video sequences and 12 corresponding depth video sequences. In addition, an FVV synthesis benchmark based on depth image-based rendering (DIBR) is introduced to aid the validation of data-driven methods. The dataset and benchmark aim to advance FVV synthesis with improved robustness and performance.

CGD: a cloud gaming dataset with gameplay video and network recordings
Paper available at: https://doi.org/10.1145/3524273.3532898
Slivar, I., Bacic, K., Orsolic, I., Skorin-Kapov, L., Suznjevic, M.
University of Zagreb, Faculty of Electrical Engineering and Computing, Zagreb, Croatia.
Dataset available at: https://muexlab.fer.hr/muexlab/research/datasets

The cloud gaming dataset (CGD) contains 600 game streaming sessions from 10 games of different genres, with various encoding parameters (bitrate, resolution, and frame rate) to evaluate the impact of these parameters on Quality of Experience (QoE). The dataset includes gameplay video recordings, network traffic traces, user input logs, and streaming performance logs, and can be used to understand relationships between network and application layer data for cloud gaming QoE and QoE-aware network management mechanisms.

Two Interviews with Renowned Dataset Researchers

This issue of the Dataset Column provides two interviews with the researchers responsible for novel datasets of recent years. In particular, we first interview Nacho Reimat (https://www.cwi.nl/people/nacho-reimat), the scientific programmer responsible for the CWIPC-SXR, one of the first datasets on dynamic, interactive volumetric media. Second, we interview Pierre-Etienne Martin (https://www.eva.mpg.de/comparative-cultural-psychology/staff/pierre-etienne-martin/), responsible for contributions to datasets in the area of sports and culture.  

The two interviewees were asked about their contribution to dataset research, their interests, challenges, and the future. We would like to thank both Nacho and Pierre-Etienne for agreeing to contribute to our column.

Nacho Reimat, Scientific Programmer at the Distributed and Interactive Systems group at the CWI, Amsterdam, The Netherlands

Short bio: Ignacio Reimat is currently an R&D Engineer at Centrum Wiskunde & Informatica (CWI) in Amsterdam. He received the B.S. degree in Audiovisual Systems Engineering of Telecommunications at Universitat Politecnica de Catalunya in 2016 and the M.S. degree in Innovation and Research in Informatics – Computer Graphics and Virtual Reality at Universitat Politecnica de Catalunya in 2020. His current research interests are 3D graphics, volumetric capturing, 3D reconstruction, point clouds, social Virtual Reality and real-time communications.

Could you provide a small summary of your contribution to the dataset research?

We have released the CWI Point Cloud Social XR Dataset [1], a dynamic point cloud dataset that depicts humans interacting in social XR settings. In particular, using commodity hardware we captured audio-visual data (RGB + Depth + Infrared + synchronized Audio) for a total of 45 unique sequences of people performing scripted actions [2]. The screenplays for the human actors were devised so as to simulate a variety of common use cases in social XR, namely, (i) Education and training, (ii) Healthcare, (iii) Communication and social interaction, and (iv) Performance and sports. Moreover, diversity in gender, age, ethnicities, materials, textures and colours was additionally considered. As part of our release, we provide annotated raw material, resulting point cloud sequences, and an auxiliary software toolbox to acquire, process, encode, and visualize data, suitable for real-time applications.

Sample frames from the point cloud sequences released with the CWIPC-SXR dataset.

Why did you get interested in datasets research?

Real-time, immersive telecommunication systems are quickly becoming a reality, thanks to the advances in the acquisition, transmission, and rendering technologies. Point clouds in particular serve as a promising representation in these types of systems, offering photorealistic rendering capabilities with low complexity. Further development of transmission, coding, and quality evaluation algorithms, though, is currently hindered by the lack of publicly available datasets that represent realistic scenarios of remote communication between people in real-time. So we are trying to fill this gap. 

What is the most challenging aspect of datasets research?

In our case, because point clouds are a relatively new format, the most challenging part has been developing the technology to generate them. Our dataset is generated from several cameras, which need to be calibrated and synchronized in order to merge the views successfully. Apart from that, if you are releasing a large dataset, you also need to deal with other challenges like data hosting and maintenance and, even more importantly, find a way to distribute the data that suits different target users. Because we are not releasing just point clouds but also the raw data, there may be people interested only in the raw videos, or in particular point clouds, who do not want to download the full 1.6TB of data. Going even further, because of the novelty of the point cloud format, there is also a lack of tools to re-capture, play back or modify this type of data. That’s why, together with the dataset, we also released our point cloud auxiliary toolbox of software utilities built on top of the Point Cloud Library, which allows for alignment and processing of point clouds, as well as real-time capturing, encoding, transmission, and rendering.

How do you see the future of datasets research?

Open datasets are an essential part of science since they allow for comparison and reproducibility. The major problem is that creating datasets is difficult and expensive, requiring a big investment from research groups. In order to ensure that relevant datasets keep on being created, we need a push including: scientific venues for the publication and discussion of datasets (like the dataset track at the Multimedia Systems conference, which started more than a decade ago), investment from funding agencies and organizations identifying the datasets that the community will need in the future, and collaboration between labs to share the effort.

What are your future plans for your research?

We are very happy with the first version of the dataset since it provides a good starting point and was a source of learning. Still, there is room for improvements, so now that we have a full capturing system (together with the auxiliary tools), we would like to extend the dataset and refine the tools. The community still needs more datasets of volumetric video to further advance the research on alignment, post-processing, compression, delivery, and rendering. Apart from the dataset, the Distributed and Interactive Systems (https://www.dis.cwi.nl) group from CWI is working on volumetric video conferencing, developing a Social VR pipeline for enabling users to more naturally communicate and interact. Recently, we deployed a solution for visiting museums remotely together with friends and family members (https://youtu.be/zzB7B6EAU9c), and next October we will start two EU-funded projects on this topic.   

Pierre-Etienne Martin, Postdoctoral Researcher & Tech Development Coordinator, Max Planck Institute for Evolutionary Anthropology, Department of Comparative Cultural Psychology, Leipzig, Germany

Short Bio: Pierre-Etienne Martin is currently a Postdoctoral researcher at the Max Planck Institute. He received his M.S. degree in 2017 from the University of Bordeaux, the Pázmány Péter Catholic University and the Autonomous University of Madrid via the Image Processing and Computer Vision Erasmus Master program. He obtained his PhD, labelled European, from the University of Bordeaux in 2020, supervised by Jenny Benois-Pineau and Renaud Péteri, on the topic of video detection and classification by means of Convolutional Neural Networks. His current research interests include, among others, Artificial Intelligence, Machine Learning and Computer Vision.

Could you provide a small summary of your contribution to the dataset research?

In 2017, I started my PhD thesis, which focuses on movement analysis in sports. The aim of this research project, called CRISP (ComputeR vIsion for Sports Performance – see ), is to improve the training experience of the athletes. Our team decided to focus on Table Tennis, and it is with the collaboration of the Sports Faculty of the University of Bordeaux, STAPS, that our first contribution came to be: the TTStroke-21 dataset [3]. This dataset gathers recordings of table tennis games at high resolution and 120 frames per second. The players and annotators are both from STAPS. The annotation platform was designed by students from the LaBRI – University of Bordeaux, and the MIA from the University of la Rochelle. Coordination for recording the videos and doing the annotation was performed by my supervisors and myself.

Since 2019, TTStroke-21 has been used to run the Sports Task at the Multimedia Evaluation benchmark – MediaEval [4]. The goal is to segment and classify table tennis strokes from videos.

TTStroke-21 sample images

In 2021, I joined the MPI EVA institute, and I now focus on building datasets for the Comparative Cultural Psychology department (CCP). The data we are working on focuses on great apes and children, which we aim to segment, identify and track.

Why did you get interested in datasets research?

Datasets research is the field where the application of computer vision tools is possible. In order to widen the range of applications, datasets with high-quality ground truth need to be offered by the scientific community. Only then can models be developed to solve the problem raised by the dataset and finally be offered to the community. This has been the goal of the interdisciplinary CRISP project, through the collaboration of the sport and computer science communities, for improving athlete performance.

It is also the aim of collaborative projects, such as MMLAB [5], which gathers many models and implementations trained on various datasets, in order to ease reproducibility, performance comparison and inference for applications.

What is the most challenging aspect of datasets research?

From my experience organizing the Sports Task at the MediaEval workshop, the most challenging aspect of datasets research is providing high-quality data, from acquisition to annotation, along with tools to process them, covering use, demonstration and evaluation. That is why, alongside our task, we also provide a baseline which covers most of these aspects.

How do you see the future of datasets research?

I hope datasets research will evolve toward a general scheme for the annotation and evaluation of datasets. I hope different datasets can be used together for training multi-task models, giving the opportunity to share knowledge and features specific to each type of dataset. Finally, quantity has been a major criterion in datasets research, but quality should be given more consideration in order to improve state-of-the-art performance while keeping research sustainable.

What are your future plans for your research?

Within the CCP department at MPI, I hope to be able to build different types of datasets so that what has been implemented in the computer vision field can be put to best use in psychology.

Relevant references:

  1. CWIPC-SXR dataset: https://www.dis.cwi.nl/cwipc-sxr-dataset/
  2. I. Reimat, et al., “CWIPC-SXR: Point Cloud dynamic human dataset for Social XR,” in Proceedings of the 12th ACM Multimedia Systems Conference (MMSys ’21). Association for Computing Machinery, New York, NY, USA, 300–306. https://doi.org/10.1145/3458305.3478452
  3. TTStroke-21: https://link.springer.com/article/10.1007/s11042-020-08917-3
  4. MediaEval: http://www.multimediaeval.org/
  5. OpenMMLab: https://openmmlab.com/

Overview of Open Dataset Sessions and Benchmarking Competitions in 2021.

This issue of the Dataset Column proposes a review of some of the most important events in 2021 related to special sessions on open datasets or benchmarking competitions associated with multimedia data. While this is not meant to represent an exhaustive list of events, we wish to underline the great diversity of subjects and dataset topics currently of interest to the multimedia community. We will present the following events:

  • 13th International Conference on Quality of Multimedia Experience (QoMEX 2021 – https://qomex2021.itec.aau.at/). We summarize six datasets included in this conference, that address QoE studies on haze conditions (RHVD), tele-education events (EVENT-CLASS), storytelling scenes (MTF), image compression (EPFL), virtual reality effects on gamers (5Gaming), and live stream shopping (LSS-survey).
  • Multimedia Datasets for Repeatable Experimentation at 27th International Conference on Multimedia Modeling (MDRE at MMM 2021 – https://mmm2021.cz/special-session-mdre/). We summarize the five datasets presented during the MDRE, addressing several topics like lifelogging and environmental data (MNR-HCM), cat vocalizations (CatMeows), home activities (HTAD), gastrointestinal procedure tools (Kvasir-Instrument), and keystroke and lifelogging (KeystrokeDynamics).
  • Open Dataset and Software Track at 12th ACM Multimedia Systems Conference (ODS at MMSys ’21) (https://2021.acmmmsys.org/calls.php#ods). We summarize seven datasets presented at the ODS track, targeting several topics like network statistics (Brightcove Streaming Datasets, and PePa Ping), emerging image and video modalities (Full UHD 360-Degree, 4DLFVD, and CWIPC-SXR) and human behavior data (HYPERAKTIV and Target Selection Datasets).
  • Selected datasets at 29th ACM Multimedia Conference (MM ’21) (https://2021.acmmm.org/). For a general report from ACM Multimedia 2021 please see (https://records.sigmm.org/2021/11/23/reports-from-acm-multimedia-2021/). We summarize six datasets presented during the conference, targeting several topics like food logo detection (FoodLogoDet-1500), emotional relationship recognition (ERATO), text-to-face synthesis (CelebAText-HQ), multimodal linking (M3EL), egocentric video analysis (EGO-Deliver), and quality assessment of user-generated videos (PUGCQ).
  • ImageCLEF 2021 (https://www.imageclef.org/2021). We summarize the six datasets launched for the benchmarking tasks, related to several topics like social media profile assessment (ImageCLEFaware), segmentation and labeling of underwater coral images (ImageCLEFcoral), automatic generation of web-pages (ImageCLEFdrawnUI) and medical imaging analysis (ImageCLEF-VQAMed, ImageCLEFmedCaption, and ImageCLEFmedTuberculosis).

Creating annotated datasets is even more difficult in ongoing pandemic times, and we are glad to see that many interesting datasets were published despite this unfortunate situation.

QoMEX 2021

A large number of dataset-related papers were presented at the International Conference on Quality of Multimedia Experience (QoMEX 2021), organized as a fully online event in Montreal, Canada, June 14–17, 2021 (https://qomex2021.itec.aau.at/). The complete QoMEX ’21 Proceedings are available in the IEEE Digital Library (https://ieeexplore.ieee.org/xpl/conhome/9465370/proceeding).

The conference did not have a dedicated dataset session. However, datasets were very important to the conference, with a number of papers presenting new datasets or making use of broadly available ones. As a small sample, six selected papers focused primarily on new datasets are listed below. They are contributions focused on haze, teaching in Virtual Reality, multiview video, image quality, cybersickness for Virtual Reality gaming and shopping patterns.

A Real Haze Video Database for Haze Evaluation
Paper available at: https://ieeexplore.ieee.org/document/9465461
Chu, Y., Luo, G., and Chen, F.
College of Computer Science and Software Engineering, Shenzhen University, Shenzhen, P.R.China.
Dataset available at: https://drive.google.com/file/d/1zY0LwJyNB8u1JTAJU2X7ZkiYXsBX7BF/view?usp=sharing

The RHVD video quality assessment dataset focuses on the study of perceptual degradation caused by heavy haze conditions in real-world outdoor scenes, addressing a large number of possible use case scenarios, including driving assistance and warning systems. The dataset was collected from the Flickr video sharing platform and post-edited, and 40 annotators took part in the subjective quality assessment experiments.

EVENT-CLASS: Dataset of events in the classroom
Paper available at: https://ieeexplore.ieee.org/document/9465389
Orduna, M., Gutierrez, J., Manzano, C., Ruiz, D., Cabrera, J., Diaz, C., Perez, P., and Garcia, N.
Grupo de Tratamiento de Imágenes, Information Processing & Telecom. Center, Universidad Politécnica de Madrid, Spain; Nokia Bell Labs, Madrid, Spain.
Dataset available at: http://www.gti.ssr.upm.es/data/event-class

The EVENT-CLASS dataset consists of 360-degree videos that contain events and characteristics specific to the context of tele-education, composed of video and audio sequences taken in varying conditions. The dataset addresses several topics, including quality assessment tests with the aim of improving the immersive experience of remote users.

A Multi-View Stereoscopic Video Database With Green Screen (MTF) For Video Transition Quality-of-Experience Assessment
Paper available at: https://ieeexplore.ieee.org/document/9465458
Hobloss, N., Zhang, L., and Cagnazzo, M.
LTCI, Télécom-Paris, Institut Polytechnique de Paris, Paris, France; Univ Rennes, INSA Rennes, CNRS, Rennes, France.
Dataset available at: https://drive.google.com/drive/folders/1MYiD7WssSh6X2y-cf8MALNOMMish4N5j

MTF is a multi-view stereoscopic video dataset, containing full-HD videos of real storytelling scenes, targeting QoE assessment for the analysis of visual artefacts that appear during automatically generated point-of-view transitions. The dataset features a large baseline of camera setups and can also be used in other computer vision applications, like video compression, 3D video content, VR environments and optical flow estimation.

Performance Evaluation of Objective Image Quality Metrics on Conventional and Learning-Based Compression Artifacts
Paper available at: https://ieeexplore.ieee.org/document/9465445
Testolina, M., Upenik, E., Ascenso, J., Pereira, F., and Ebrahimi, T.
Multimedia Signal Processing Group, École Polytechnique Fédérale de Lausanne (EPFL), Lausanne, Switzerland; Instituto Superior Técnico, Universidade de Lisboa – Instituto de Telecomunicações, Lisbon, Portugal.
Dataset available on request to the authors.

This dataset consists of a collection of compressed images, labelled according to subjective quality scores, targeting the evaluation of 14 objective quality metrics against the perceived human quality baseline.

The Effect of VR Gaming on Discomfort, Cybersickness, and Reaction Time
Paper available at: https://ieeexplore.ieee.org/document/9465470
Vlahovic, S., Suznjevic, M., Pavlin-Bernardic, N., and Skorin-Kapov, L.
Faculty of Electrical Engineering and Computing, University of Zagreb, Zagreb, Croatia; Faculty of Humanities and Social Sciences, University of Zagreb, Zagreb, Croatia.
Dataset available on request to the authors.

The authors present the results of a study conducted on 20 human users that measures the physiological and cognitive aftereffects of exposure to three different VR games with game mechanics centered around natural interactions. This work moves away from cybersickness as a primary measure of VR discomfort and analyzes other concepts like device-related discomfort, muscle fatigue and pain, and their correlations with game complexity.

Beyond Shopping: The Motivations and Experience of Live Stream Shopping Viewers
Paper available at: https://ieeexplore.ieee.org/document/9465387
Liu, X. and Kim, S. H.
Adelphi University.
Dataset available on request to the authors.

The authors propose a study of 286 live stream shopping users, where viewer motivations are examined according to the Uses and Gratifications Theory, seeking to identify motivations broken down into sixteen constructs organized under four larger constructs: entertainment, information, socialization, and experience.

MDRE at MMM 2021

The Multimedia Datasets for Repeatable Experimentation (MDRE) special session is part of the 2021 International Conference on Multimedia Modeling (MMM 2021). The session was organized by Cathal Gurrin (Dublin City University, Ireland), Duc-Tien Dang-Nguyen (University of Bergen, Norway), Björn Þór Jónsson (IT University of Copenhagen, Denmark) and Klaus Schoeffmann (Klagenfurt University, Austria). More details regarding this session can be found at: https://mmm2021.cz/special-session-mdre/

The MDRE’21 special session at MMM’21 is the third MDRE edition, and it represents an opportunity for interested researchers to submit their datasets to this track. The work submitted to MDRE is permanently available at http://mmdatasets.org, where all the current and past editions of MDRE are hosted. Authors are asked to provide the dataset itself, along with a paper describing its motivation, design, and usage, a brief summary of the experiments performed to date on the dataset, and a discussion of how it can be useful to the community.

MNR-Air: An Economic and Dynamic Crowdsourcing Mechanism to Collect Personal Lifelog and Surrounding Environment Dataset.
Paper available at: https://link.springer.com/chapter/10.1007%2F978-3-030-67835-7_18
Nguyen DH., Nguyen-Tai TL., Nguyen MT., Nguyen TB., Dao MS.
University of Information Technology, Ho Chi Minh City, Vietnam; University of Science, Ho Chi Minh City, Vietnam; Vietnam National University in Ho Chi Minh City, Ho Chi Minh City, Vietnam; National Institute of Information and Communications Technology, Koganei, Japan.
Dataset available on request to the authors.

The paper introduces an economical and dynamic crowdsourcing mechanism that can be used to collect personal lifelog associated events. The resulting dataset, MNR-HCM, represents data collected in Ho Chi Minh City, Vietnam, containing weather data, air pollution data, GPS data, lifelog images, and citizens’ cognition on a personal scale.

CatMeows: A Publicly-Available Dataset of Cat Vocalizations
Paper available at: https://link.springer.com/chapter/10.1007%2F978-3-030-67835-7_20
Ludovico L.A., Ntalampiras S., Presti G., Cannas S., Battini M., Mattiello S.
Department of Computer Science, University of Milan, Milan, Italy; Department of Veterinary Medicine, University of Milan, Milan, Italy; Department of Agricultural and Environmental Science, University of Milan, Milan, Italy.
Dataset available at: https://zenodo.org/record/4008297

The CatMeows dataset consists of vocalizations produced by 21 cats belonging to two breeds, namely Maine Coon and European Shorthair, that are emitted in three different contexts: brushing, isolation in an unfamiliar environment, and waiting for food. Recordings are performed with low-cost and easily available devices, thus creating a representative dataset for real-world scenarios.

HTAD: A Home-Tasks Activities Dataset with Wrist-accelerometer and Audio Features
Paper available at: https://link.springer.com/chapter/10.1007%2F978-3-030-67835-7_17
Garcia-Ceja, E., Thambawita, V., Hicks, S.A., Jha, D., Jakobsen, P., Hammer, H.L., Halvorsen, P., Riegler, M.A.
SINTEF Digital, Oslo, Norway; SimulaMet, Oslo, Norway; Oslo Metropolitan University, Oslo, Norway; UIT The Arctic University of Norway, Tromsø, Norway; Haukeland University Hospital, Bergen, Norway.
Dataset available at: https://datasets.simula.no/htad/

The HTAD dataset contains wrist-accelerometer and audio data collected during several normal day-to-day tasks, such as sweeping, brushing teeth, or watching TV. Being able to detect these types of activities is important for the creation of assistive applications and technologies that target elderly care and mental health monitoring.

Kvasir-Instrument: Diagnostic and Therapeutic Tool Segmentation Dataset in Gastrointestinal Endoscopy
Paper available at: https://link.springer.com/chapter/10.1007%2F978-3-030-67835-7_19
Jha, D., Ali, S., Emanuelsen, K., Hicks, S.A., Thambawita, V., Garcia-Ceja, E., Riegler, M.A., de Lange, T., Schmidt, P.T., Johansen, H.D., Johansen, D., Halvorsen, P.
SimulaMet, Oslo, Norway; UIT The Arctic University of Norway, Tromsø, Norway; Simula Research Laboratory, Oslo, Norway; Augere Medical AS, Oslo, Norway; Oslo Metropolitan University, Oslo, Norway; Medical Department, Sahlgrenska University Hospital-Mölndal, Gothenburg, Sweden; Department of Medical Research, Bærum Hospital, Gjettum, Norway; Karolinska University Hospital, Solna, Sweden; Department of Engineering Science, University of Oxford, Oxford, UK; Sintef Digital, Oslo, Norway.
Dataset available at: https://datasets.simula.no/kvasir-instrument/

The Kvasir-Instrument dataset consists of 590 annotated frames that contain gastrointestinal (GI) procedure tools such as snares, balloons, and biopsy forceps, and seeks to improve follow-up and the set of available information regarding the disease and the procedure itself, by providing baseline data for the tracking and analysis of the medical tools.

Keystroke Dynamics as Part of Lifelogging
Paper available at: https://link.springer.com/chapter/10.1007%2F978-3-030-67835-7_16
Smeaton, A.F., Krishnamurthy, N.G., Suryanarayana, A.H.
Insight Centre for Data Analytics, Dublin City University, Dublin, Ireland; School of Computing, Dublin City University, Dublin, Ireland.
Dataset available at: http://doras.dcu.ie/25133/

The authors created a dataset of longitudinal keystroke timing data that spans a period of up to seven months for four human participants. A detailed analysis of the data is performed, by examining the timing information associated with bigrams, or pairs of adjacently-typed alphabetic characters.
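
As an illustration of what a bigram timing analysis involves, the following is a minimal sketch that groups inter-key latencies by bigram. The event format, a list of (timestamp in milliseconds, character) pairs, and the function names are assumptions made for this example and do not reflect the actual schema or code of the dataset.

```python
from collections import defaultdict
from statistics import mean


def bigram_latencies(events):
    """Group inter-key latencies (ms) by bigram.

    `events` is a hypothetical list of (timestamp_ms, char) tuples in
    typing order; only pairs of adjacent alphabetic characters count.
    """
    latencies = defaultdict(list)
    for (t0, c0), (t1, c1) in zip(events, events[1:]):
        if c0.isalpha() and c1.isalpha():
            latencies[(c0 + c1).lower()].append(t1 - t0)
    return dict(latencies)


def mean_bigram_latency(events):
    """Average latency per bigram, in milliseconds."""
    return {bg: mean(vals) for bg, vals in bigram_latencies(events).items()}


# Made-up keystrokes for the text "the th"; the space breaks the bigram chain.
sample = [(0, "t"), (120, "h"), (210, "e"), (400, " "), (520, "t"), (660, "h")]
timings = mean_bigram_latency(sample)
```

Here `timings` maps each observed bigram to its average latency, so the two occurrences of "th" are averaged together while pairs involving the space are skipped.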

ODS at MMSys ’21

The traditional Open Dataset and Software Track (ODS) was a part of the 12th ACM Multimedia Systems Conference (MMSys ’21) organized as a hybrid event in Istanbul, Turkey, September 28 – October 1, 2021 (https://2021.acmmmsys.org/). The complete MMSys ’21: Proceedings of the 12th ACM Multimedia Systems Conference are available in the ACM Digital Library (https://dl.acm.org/doi/proceedings/10.1145/3458305).

The Session on Software, Tools and Datasets was chaired by Saba Ahsan (Nokia Technologies, Finland) and Luca De Cicco (Politecnico di Bari, Italy) on September 29, 2021, at 16:00 (UTC+3, Istanbul local time). The session opened with 1-slide/minute intros given by the authors and was then divided into individual virtual booths. Seven of the thirteen contributions were dataset papers. A listing of the paper titles, their abstracts, and associated DOIs is included below for your convenience.

Adaptive Streaming Playback Statistics Dataset
Paper available at: https://dl.acm.org/doi/10.1145/3458305.3478444
Teixeira, T., Zhang, B., Reznik, Y.
Brightcove Inc, USA
Dataset available at: https://github.com/brightcove/streaming-dataset

The authors propose a dataset that captures statistics from a number of real-world streaming events, utilizing different devices (TVs, desktops, mobiles, tablets, etc.) and networks (from 2.5G, 3G, and other early generation mobile networks to 5G and broadband). The captured data includes network and playback statistics, events and characteristics of the encoded stream.

PePa Ping Dataset: Comprehensive Contextualization of Periodic Passive Ping in Wireless Networks
Paper available at: https://dl.acm.org/doi/10.1145/3458305.3478456
Madariaga, D., Torrealba, L., Madariaga, J., Bustos-Jimenez, J., Bustos, B.
NIC Chile Research Labs, University of Chile
Dataset available at: https://github.com/niclabs/pepa-ping-mmsys21

The PePa Ping dataset consists of real-world data with a comprehensive contextualization of Internet QoS indicators, like round-trip time, jitter and packet loss. A methodology is developed for Android devices that obtains the necessary information directly from the Linux kernel, and the resulting indicators are therefore an accurate representation of real-world conditions.
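
To make the three QoS indicators concrete, here is a generic sketch of how they can be summarized from a series of round-trip time samples. It is not the PePa Ping methodology: the input format (a list of RTTs in milliseconds, with `None` marking a measurement that got no reply), the function name, and the choice of jitter as the mean absolute difference between consecutive RTTs are all assumptions made for illustration.

```python
from statistics import mean


def qos_summary(rtts_ms):
    """Summarize RTT samples; None marks a measurement with no reply.

    Jitter is computed here as the mean absolute difference between
    consecutive received RTTs, one simple definition among several.
    """
    received = [r for r in rtts_ms if r is not None]
    lost = len(rtts_ms) - len(received)
    diffs = [abs(b - a) for a, b in zip(received, received[1:])]
    return {
        "mean_rtt_ms": mean(received) if received else None,
        "jitter_ms": mean(diffs) if diffs else 0.0,
        "packet_loss": lost / len(rtts_ms) if rtts_ms else 0.0,
    }


# Made-up samples: five measurements, one of them lost.
sample = [20.0, 24.0, None, 22.0, 30.0]
summary = qos_summary(sample)
```

Other jitter definitions exist, such as the smoothed interarrival estimate used by RTP, so the numbers produced by this sketch are only comparable with values computed the same way.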

Full UHD 360-Degree Video Dataset and Modeling of Rate-Distortion Characteristics and Head Movement Navigation
Paper available at: https://dl.acm.org/doi/10.1145/3458305.3478447
Chakareski, J., Aksu, R., Swaminathan, V., Zink, M.
New Jersey Institute of Technology; University of Alabama; Adobe Research; University of Massachusetts Amherst, USA
Dataset available at: https://zenodo.org/record/5156999#.YQ1XMlNKjUI

The authors create a dataset of 360-degree videos that are used in analyzing the rate-distortion (R-D) characteristics of videos. These videos correspond to head movement navigation data in Virtual Reality (VR) and they may be used for analyzing how users explore panoramas around them in VR.

4DLFVD: A 4D Light Field Video Dataset
Paper available at: https://dl.acm.org/doi/10.1145/3458305.3478450
Hu, X., Wang, C., Pan, Y., Liu, Y., Wang, Y., Liu, Y., Zhang, L., Shirmohammadi, S.
University of Ottawa, Canada; Beijing University of Posts and Telecommunications, China
Dataset available at: https://dx.doi.org/10.21227/hz0t-8482

The authors propose a 4D Light Field (LF) video dataset that is collected via a custom-made camera matrix. The dataset is to be used for designing and testing methods for LF video coding, processing and streaming, providing more viewpoints and/or higher framerate compared with similar datasets from the current literature.

CWIPC-SXR: Point Cloud dynamic human dataset for Social XR
Paper available at: https://dl.acm.org/doi/10.1145/3458305.3478452
Reimat, I., Alexiou, E., Jansen, J., Viola, I., Subramanyam, S., Cesar, P.
Centrum Wiskunde & Informatica, Netherlands
Dataset available at: https://www.dis.cwi.nl/cwipc-sxr-dataset/

The CWIPC-SXR dataset is composed of 45 unique sequences that correspond to several use cases for humans interacting in social extended reality. The dataset is composed of dynamic point clouds, that serve as a low complexity representation in these types of systems.

HYPERAKTIV: An Activity Dataset from Patients with Attention-Deficit/Hyperactivity Disorder (ADHD)
Paper available at: https://dl.acm.org/doi/10.1145/3458305.3478454
Hicks, S. A., Stautland, A., Fasmer, O. B., Forland, W., Hammer, H. L., Halvorsen, P., Mjeldheim, K., Oedegaard, K. J., Osnes, B., Syrstad, V. E.G., Riegler, M. A.
SimulaMet; University of Bergen; Haukeland University Hospital; OsloMet, Norway
Dataset available at: http://datasets.simula.no/hyperaktiv/

The HYPERAKTIV dataset contains general patient information, health, activity, information about the mental state, and heart rate data from patients with Attention-Deficit/Hyperactivity Disorder (ADHD). Included here are 51 patients with ADHD and 52 clinical control cases.

Datasets – Moving Target Selection with Delay
Paper available at: https://dl.acm.org/doi/10.1145/3458305.3478455
Liu, S. M., Claypool, M., Cockburn, A., Eg, R., Gutwin, C., Raaen, K.
Worcester Polytechnic Institute, USA; University of Canterbury, New Zealand; Kristiania University College, Norway; University of Saskatchewan, Canada
Dataset available at: https://web.cs.wpi.edu/~claypool/papers/selection-datasets/

The Selection datasets were created during four user studies on the effects of delay on video game actions and on the selection of a moving target with various pointing devices. The datasets include performance data, like time to selection, and demographic data for the users, like age and gaming experience.

ACM MM 2021

A large number of dataset-related papers were presented at the 29th ACM International Conference on Multimedia (MM ’21), organized as a hybrid event in Chengdu, China, October 20 – 24, 2021 (https://2021.acmmm.org/). The complete MM ’21: Proceedings of the 29th ACM International Conference on Multimedia are available in the ACM Digital Library (https://dl.acm.org/doi/proceedings/10.1145/3474085).

There was not a specifically dedicated Dataset session among more than 35 sessions at the MM ’21 symposium. However, the importance of datasets can be illustrated in the following statistics, quantifying how many times the term “dataset” appears among 542 accepted papers. The term appears in the title of 7 papers, the keywords of 66 papers, and the abstracts of 339 papers. As a small example, six selected papers focused primarily on new datasets are listed below. There are contributions focused on social multimedia, emotional recognition, text-to-face synthesis, egocentric video analysis, emerging multimedia applications, such as multimodal entity linking, and multimedia art, entertainment, and culture related to perceived quality of video content.

FoodLogoDet-1500: A Dataset for Large-Scale Food Logo Detection via Multi-Scale Feature Decoupling Network
Paper available at: https://dl.acm.org/doi/10.1145/3474085.3475289
Hou, Q., Min, W., Wang, J., Hou, S., Zheng, Y., Jiang, S.
Shandong Normal University, Jinan, China; Institute of Computing Technology, Chinese Academy of Sciences, Beijing, China
Dataset available at: https://github.com/hq03/FoodLogoDet-1500-Dataset

The FoodLogoDet-1500 is a large-scale food logo dataset that has 1,500 categories, around 100,000 images and 150,000 manually annotated food logo objects. This type of dataset is important in self-service applications in shops and supermarkets, and copyright infringement detection for e-commerce websites.

Pairwise Emotional Relationship Recognition in Drama Videos: Dataset and Benchmark
Paper available at: https://dl.acm.org/doi/10.1145/3474085.3475493
Gao, X., Zhao, Y., Zhang, J., Cai, L.
Alibaba Group, Beijing, China
Dataset available on request to the authors.

The Emotional RelAtionship of inTeractiOn (ERATO) dataset is a large-scale multimodal dataset composed of over 30,000 interaction-centric video clips lasting around 203 hours. The videos are representative for studying the emotional relationships between the two interactive characters in the video clip.

Multi-caption Text-to-Face Synthesis: Dataset and Algorithm
Paper available at: https://dl.acm.org/doi/abs/10.1145/3474085.3475391
Sun, J., Li, Q., Wang, W., Zhao, J., Sun, Z.
Center for Research on Intelligent Perception and Computing, NLPR, CASIA, Beijing, China; School of Artificial Intelligence, University of Chinese Academy of Sciences (UCAS), Beijing, China; Institute of North Electronic Equipment, Beijing, China
Dataset available on request to the authors.

The authors propose the CelebAText-HQ dataset, which addresses the text-to-face generation problem. Each image in the dataset is manually annotated with 10 captions, allowing proposed methods and algorithms to take multiple captions as input in order to generate highly semantically related face images.

Multimodal Entity Linking: A New Dataset and A Baseline
Paper available at: https://dl.acm.org/doi/10.1145/3474085.3475400
Gan, J., Luo, J., Wang, H., Wang, S., He, W., Huang, Q.
Key Laboratory of Intelligent Information Processing, Institute of Computing Technology, Chinese Academy of Sciences, China; School of Computer Science and Technology, University of Chinese Academy of Sciences, China; Baidu Inc.
Dataset available at: https://jingrug.github.io/research/M3EL

The authors propose the M3EL large-scale multimodal entity linking dataset, containing data associated with 1,100 movies. Reviews and images are collected, and textual and visual mentions are extracted and labelled with entities registered from Wikipedia.

Ego-Deliver: A Large-Scale Dataset for Egocentric Video Analysis
Paper available at: https://dl.acm.org/doi/10.1145/3474085.3475336
Qiu, H., He, P., Liu, S., Shao, W., Zhang, F., Wang, J., He, L., Wang, F.
East China Normal University, Shanghai, China; University of Florida, Florida, FL, United States;
Alibaba Group, Shanghai, China
Dataset available at: https://egodeliver.github.io/EgoDeliver_Dataset/

The authors propose an egocentric video benchmarking dataset, consisting of videos recorded by takeaway riders doing their daily work. The dataset provides over 5,000 videos with more than 139,000 multi-track annotations and 45 different attributes, representing the first attempt at understanding the takeaway delivery process from an egocentric perspective.

PUGCQ: A Large Scale Dataset for Quality Assessment of Professional User-Generated Content
Paper available at: https://dl.acm.org/doi/10.1145/3474085.3475183
Li, G., Chen, B., Zhu, L., He, Q., Fan, H., Wang, S.
Kingsoft Cloud, Beijing, China; City University of Hong Kong, Hong Kong, Hong Kong
Dataset available at: https://github.com/wlkdb/pugcq_create

The PUGCQ dataset consists of 10,000 professional user-generated videos, annotated with a set of perceptual subjective ratings. In particular, during the subjective annotation and testing, human opinions are collected based upon not only MOS, but also attributes that may influence visual quality such as faces, noise, blur, brightness, and colour.

ImageCLEF 2021

ImageCLEF is a multimedia evaluation campaign, part of the CLEF initiative (http://www.clef-initiative.eu/). The 2021 edition (https://www.imageclef.org/2021) is the 19th edition of this initiative and addresses four main research tasks in several domains, including medicine, nature, social media content and user interface processing. ImageCLEF 2021 is organized by Bogdan Ionescu (University Politehnica of Bucharest, Romania), Henning Müller (University of Applied Sciences Western Switzerland, Sierre, Switzerland), Renaud Péteri (University of La Rochelle, France), Ivan Eggel (University of Applied Sciences Western Switzerland, Sierre, Switzerland) and Mihai Dogariu (University Politehnica of Bucharest, Romania).

Paper available at: https://arxiv.org/abs/2012.13180
Popescu, A., Deshayes-Chossar, J., Ionescu, B.
CEA LIST, France; University Politehnica of Bucharest, Romania.
Dataset available at: https://www.imageclef.org/2021/aware

This is the first edition of the aware task at ImageCLEF. It seeks to understand how public social media profiles affect users in certain important scenarios, namely searching or applying for a bank loan, an accommodation, a job as a waitress/waiter, and a job in IT.

Paper available at: http://ceur-ws.org/Vol-2936/paper-88.pdf
Chamberlain, J., de Herrera, A. G. S., Campello, A., Clark, A., Oliver, T. A., Moustahfid, H.
University of Essex, UK; NOAA – Pacific Islands Fisheries Science Center, USA; NOAA/ US IOOS, USA; Wellcome Trust, UK.
Dataset available at: https://www.imageclef.org/2021/coral

The ImageCLEFcoral task, currently in its third edition, proposes a dataset and benchmarking task for the automatic segmentation and labelling of underwater images that can be combined to generate 3D models for monitoring coral reefs. The task itself is composed of two subtasks, namely the coral reef image annotation and localisation subtask and the coral reef image pixel-wise parsing subtask.

Paper available at: http://ceur-ws.org/Vol-2936/paper-89.pdf
Fichou, D., Berari, R., Tăuteanu, A., Brie, P., Dogariu, M., Ștefan, L.D., Constantin, M.G., Ionescu, B.
teleportHQ, Cluj Napoca, Romania; University Politehnica of Bucharest, Romania.
Dataset available at: https://www.imageclef.org/2021/drawnui

The second edition of ImageCLEFdrawnUI addresses the issue of creating appealing web page interfaces by fostering systems that are capable of automatically generating a web page from a hand-drawn sketch. The task is separated into two subtasks, the wireframe subtask and the screenshots subtask.

Paper available at: http://ceur-ws.org/Vol-2936/paper-87.pdf
Abacha, A.B., Sarrouti, M., Demner-Fushman, D., Hasan, S.A., Müller, H.
National Library of Medicine, USA; CVS Health, USA; University of Applied Sciences Western Switzerland, Sierre, Switzerland.
Dataset available at: https://www.imageclef.org/2021/medical/vqa

This represents the fourth edition of the ImageCLEF Medical Visual Question Answering (VQAMed) task. This benchmark includes a task on Visual Question Answering (VQA), where participants are tasked with answering questions from the visual content of radiology images, and a second task on Visual Question Generation (VQG), consisting of generating relevant questions about radiology images.

ImageCLEFmed Caption
Paper available at: http://ceur-ws.org/Vol-2936/paper-111.pdf
Pelka, O., Abacha, A.B., de Herrera, A.G.S., Jacutprakart, J., Friedrich, C.M., Müller, H.
University of Applied Sciences and Arts Dortmund, Germany; National Library of Medicine, USA; University of Essex, UK; University of Applied Sciences Western Switzerland, Sierre, Switzerland.
Dataset available at: https://www.imageclef.org/2021/medical/caption

This is the fifth edition of the ImageCLEF Medical Concepts and Captioning task. The objective is to extract UMLS-concept annotations and/or captions from the image data that are then compared against the original text captions of the images.

ImageCLEFmed Tuberculosis
Paper available at: http://ceur-ws.org/Vol-2936/paper-90.pdf
Kozlovski, S., Liauchuk, V., Cid, Y.D., Kovalev, V., Müller, H.
Institute for Informatics, Minsk, Belarus; University of Warwick, Coventry, England, UK; University of Applied Sciences Western Switzerland, Sierre, Switzerland.
Dataset available at: https://www.imageclef.org/2021/medical/tuberculosis

Dataset Column: Overview, Scope and Call for Contributions

Overview and Scope

The Dataset Column (https://records.sigmm.org/open-science/datasets/) of ACM SIGMM Records provides timely updates on the developments in the domain of publicly available multimedia datasets as enabling tools for reproducible research in numerous related areas. It is intended as a platform for further dissemination of useful information on multimedia datasets and studies of datasets covering various domains, published in peer-reviewed journals, conference proceedings, dissertations, or as results of applied research in industry.

The aim of the Dataset Column is therefore not to substitute already established platforms for disseminating multimedia datasets, e.g., Qualinet Databases (https://qualinet.github.io/databases/) [2] or the Multimedia Evaluation Benchmark (https://multimediaeval.github.io/), but to promote such platforms and particularly interesting datasets and benchmarking challenges associated with them. Registration for the Multimedia Evaluation Benchmark, MediaEval 2021, is now open (https://multimediaeval.github.io). This year’s MediaEval features a wide variety of tasks and datasets tackling a large number of domains, including video privacy, social media data analysis and understanding, news items analysis, medicine and wellbeing, affective and subjective content analysis, and game and sports associated media.

The Column will also continue reporting on contributions presented within Dataset Tracks at relevant conferences, e.g., ACM Multimedia (MM), ACM Multimedia Systems (MMSys), International Conference on Quality of Multimedia Experience (QoMEX), and International Conference on Multimedia Modeling (MMM).

Dataset Column in the SIGMM Records

Previously published Dataset Columns are listed below in chronological order.

Call for Contributions

Those who have created a dataset, a benchmarking initiative, or a study of datasets relevant to the multimedia community, even if it has already been published elsewhere, are very welcome to submit their contribution to the ACM SIGMM Records Dataset Column. Examples include the datasets accepted to the open dataset and software track of the ACM MMSys 2021 conference or the datasets presented at the QoMEX 2021 conference. To report your contribution, please contact the editor responsible for the respective area: Mihai Gabriel Constantin (mihai.constantin84@upb.ro), Karel Fliegel (fliegek@fel.cvut.cz), or Maria Torres Vega (maria.torresvega@ugent.be).

Column Editors

Since September 2021, the Dataset Column has been edited by Mihai Gabriel Constantin, Karel Fliegel, and Maria Torres Vega. The current editors appreciate the work of the previous team, Martha Larson, Bart Thomee and all other contributors, and will continue and further develop this dissemination platform.

The general scope of the Dataset Column is reviewed above, with the more specific areas of the editors listed below:

  • Mihai Gabriel Constantin will be responsible for the datasets related to multimedia analysis, understanding, retrieval and exploration,
  • Karel Fliegel for the datasets with subjective annotations related to Quality of Experience (QoE) [1] research,
  • Maria Torres Vega for the datasets related to immersive multimedia systems, networked QoE and cognitive network management.

Mihai Gabriel Constantin is a researcher at the AI Multimedia Lab, University Politehnica of Bucharest, Romania, and got his PhD at the Faculty of Electronics, Telecommunications, and Information Technology at the same university, with the topic “Automatic Analysis of the Visual Impact of Multimedia Data”. He has authored over 25 scientific papers in international conferences and high impact journals, with an emphasis on the prediction of the subjective impact of multimedia items on human viewers and deep ensembles. He participated as researcher in more than 10 research projects, and is a member of program committees and reviewer for several workshops, conferences and journals. He is also an active member of the multimedia processing community, being part of the MediaEval benchmarking initiative organization team, and leading or co-organizing several tasks during MediaEval that include Predicting Media Memorability [3] and Recommending Movies Using Content [4], as well as publishing several papers that analyze the data, annotations, participant features, methods, and observed best practices for MediaEval tasks and datasets [5]. More details can be found on his webpage: https://gconstantin.aimultimedialab.ro/.

Karel Fliegel received his M.Sc. (Ing.) in 2004 (electrical engineering and audiovisual technology) and his Ph.D. in 2011 (research on modeling of visual perception of image impairment features), both from the Czech Technical University in Prague, Faculty of Electrical Engineering (CTU FEE), Czech Republic. He is an assistant professor at the Multimedia Technology Group of CTU FEE. His research interests include multimedia technology, image processing, image and video compression, subjective and objective image quality assessment, Quality of Experience, HVS modeling, and imaging photonics. He has been a member of research teams within various projects, especially in the area of visual information processing. He has participated in COST ICT Actions IC1003 Qualinet and IC1105 3D-ConTourNet, where he was responsible for the development of Qualinet Databases [2] (https://qualinet.github.io/databases/), relevant especially to QoE research.

Maria Torres Vega is an FWO (Research Foundation Flanders) Senior Postdoctoral Fellow at the multimedia delivery cluster of the IDLab group of Ghent University (UGent), currently working on the perception of immersive multimedia applications. She received her M.Sc. degree in Telecommunication Engineering from the Polytechnic University of Madrid, Spain, in 2009. Between 2009 and 2013 she worked as a software and test engineer in Germany with a focus on Embedded Systems and Signal Processing. In October 2013, she decided to go back to academia and started her PhD at the Eindhoven University of Technology (Eindhoven, The Netherlands), where she researched the impact of beam-steered optical wireless networks on the users’ perception of services. This work earned her a PhD in Electrical Engineering in September 2017. In her years in academia (since October 2013), she has authored more than 40 publications, including three best paper awards. Furthermore, she serves as a reviewer for a large number of journals and conferences. In 2020 she served as general chair of the 4th Quality of Experience Management workshop, as tutorial chair of the 2020 Network Softwarization conference (NetSoft), and as demo chair of the Quality of Multimedia Experience conference (QoMEX 2020). In 2021, she served as Technical Program Committee (TPC) chair of the 2021 Quality of Multimedia Experience conference (QoMEX 2021).


Report from the MMM 2020 Special Session on Multimedia Datasets for Repeatable Experimentation (MDRE 2020)


Information retrieval and multimedia content access have a long history of comparative evaluation, and many of the advances in the area over the past decade can be attributed to the availability of open datasets that support comparative and repeatable experimentation. Hence, sharing data and code to allow other researchers to replicate research results is needed in the multimedia modeling field, as it helps to improve the performance of systems and the reproducibility of published papers.

This report summarizes the special session on Multimedia Datasets for Repeatable Experimentation (MDRE 2020), which was organized at the 26th International Conference on MultiMedia Modeling (MMM 2020), held in January 2020 in Daejeon, South Korea.

The intent of these special sessions is to be a venue for releasing datasets to the multimedia community and discussing dataset related issues. The presentation mode in 2020 was to have short presentations (approximately 8 minutes), followed by a panel discussion moderated by Aaron Duane. In the following we summarize the special session, including its talks, questions, and discussions.


GLENDA: Gynecologic Laparoscopy Endometriosis Dataset

The session began with a presentation on ‘GLENDA: Gynecologic Laparoscopy Endometriosis Dataset’ [1], given by Andreas Leibetseder from the University of Klagenfurt. The researchers worked with experts on gynecologic laparoscopy, a type of minimally invasive surgery (MIS) in which a live video feed of the patient’s abdomen is used to monitor the insertion and handling of various instruments during medical treatments. This kind of surgical intervention not only facilitates a great variety of treatments; the possibility of recording the video streams is also essential for numerous post-surgical activities, such as treatment planning, case documentation and education. Manually analyzing these surgical recordings, as is done in current practice, is usually tedious and time-consuming. To improve upon this situation, more sophisticated computer vision and machine learning approaches are actively being developed. Since most of these approaches rely heavily on sample data that, especially in the medical field, is only sparsely available, the researchers published the Gynecologic Laparoscopy ENdometriosis DAtaset (GLENDA), an image dataset containing region-based annotations of a common medical condition called endometriosis.

Endometriosis is a disorder involving the dislocation of uterine-like tissue. Andreas explained that this dataset is the first of its kind and was created in collaboration with leading medical experts in the field. GLENDA contains over 25K images, about half of which are pathological, i.e., showing endometriosis, and the other half non-pathological, i.e., containing no visible endometriosis. The accompanying paper thoroughly described the data collection process, the dataset’s properties and structure, while also discussing its limitations. The authors plan on continuously extending GLENDA, including the addition of other relevant categories and ultimately lesion severities. Furthermore, they are in the process of collecting specific “endometriosis suspicion” class annotations in all categories for capturing a common situation where at times it proves difficult, even for endometriosis specialists, to classify the anomaly without further inspection. The difficulty in classification may be due to several reasons, such as visible video artifacts. Including such challenging examples in the dataset may greatly improve the quality of endometriosis classifiers.

Kvasir-SEG: A Segmented Polyp Dataset

The second presentation was given by Debesh Jha from the Simula Research Laboratory, who introduced the work entitled ‘Kvasir-SEG: A Segmented Polyp Dataset’ [2]. Debesh explained that pixel-wise image segmentation is a highly demanding task in medical image analysis. As with the aforementioned GLENDA dataset, annotated medical images with corresponding segmentation masks are difficult to find in practice. The Kvasir-SEG dataset is an open-access corpus of gastrointestinal polyp images and corresponding segmentation masks, which has been manually annotated and verified by an experienced gastroenterologist. The researchers demonstrated the use of their dataset with both a traditional segmentation approach and a modern deep learning-based CNN approach. In addition to presenting the Kvasir-SEG dataset, Debesh also discussed the fuzzy c-means (FCM) clustering algorithm and the ResUNet-based approach for automatic polyp segmentation they presented in their paper. The results show that the ResUNet model was superior to FCM clustering.

The researchers released the Kvasir-SEG dataset as an open-source dataset to the multimedia and medical research communities, in the hope that it can help evaluate and compare existing and future computer vision methods. By adding segmentation masks to the Kvasir dataset, which until now only consisted of framewise annotations, the authors have enabled multimedia and computer vision researchers to contribute to the field of polyp segmentation and automatic analysis of colonoscopy videos. This could boost the performance of other computer vision methods and may be an important step towards building clinically acceptable CAI methods for improved patient care.
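Comparing segmentation methods on a dataset of this kind requires a score for how well a predicted mask overlaps the annotated mask. The report does not say which measures were used, so the following is only a minimal sketch using two common choices, the Dice coefficient and Intersection over Union (IoU). The function name, the mask representation (nested lists of 0/1) and the toy masks are assumptions made for illustration.

```python
from typing import Sequence, Tuple

# Assumed representation: 2-D binary mask, 1 = polyp pixel, 0 = background.
Mask = Sequence[Sequence[int]]


def overlap_scores(pred: Mask, truth: Mask) -> Tuple[float, float]:
    """Return (Dice, IoU) for two binary masks of identical shape."""
    if len(pred) != len(truth) or any(len(p) != len(t) for p, t in zip(pred, truth)):
        raise ValueError("masks must have the same shape")
    # Pixels marked as polyp in both masks.
    inter = sum(p & t for pr, tr in zip(pred, truth) for p, t in zip(pr, tr))
    pred_area = sum(sum(row) for row in pred)
    truth_area = sum(sum(row) for row in truth)
    union = pred_area + truth_area - inter
    if union == 0:
        # Both masks are empty; this sketch treats that as a perfect match.
        return 1.0, 1.0
    dice = 2 * inter / (pred_area + truth_area)
    iou = inter / union
    return dice, iou


# Toy 3x3 example (hypothetical data, not taken from Kvasir-SEG).
truth = [[0, 1, 1],
         [0, 1, 1],
         [0, 0, 0]]
pred = [[0, 1, 1],
        [0, 0, 1],
        [0, 0, 1]]
dice, iou = overlap_scores(pred, truth)
print(f"Dice = {dice:.2f}, IoU = {iou:.2f}")  # Dice = 0.75, IoU = 0.60
```

Averaging such per-image scores over a held-out split is one way to rank methods; both scores range from 0 (no overlap) to 1 (identical masks).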

Rethinking the Test Collection Methodology for Personal Self-Tracking Data

The third presentation was given by Cathal Gurrin from Dublin City University and was titled ‘Rethinking the Test Collection Methodology for Personal Self-Tracking Data’ [3]. Cathal argued that, although vast volumes of personal data are being gathered daily by individuals, the MMM community has not really been tackling the challenge of developing novel retrieval algorithms for this data, due to the challenges of getting access to the data in the first place. While initial efforts have taken place on a small scale, it is their conjecture that a new evaluation paradigm is required in order to make progress in analysing, modeling and retrieving from personal data archives. In their position paper, the researchers proposed a new model of Evaluation-as-a-Service that re-imagines the test collection methodology for personal multimedia data in order to address the many challenges of releasing test collections of personal multimedia data. 

After providing a detailed overview of prior research on the creation and use of self-tracking data for research, the authors identified issues that emerge when creating test collections of self-tracking data as commonly used by shared evaluation campaigns. This includes in particular the challenge of finding self-trackers willing to share their data, legal constraints that require expensive data preparation and cleaning before a potential release to the public, as well as ethical considerations. The Evaluation-as-a-Service model is a novel evaluation paradigm meant to address these challenges by enabling collaborative research on personal self-tracking data. The model relies on the idea of a central data infrastructure that guarantees full protection of the data, while at the same time allowing algorithms to operate on this protected data. Cathal highlighted the importance of data banks in this scenario. Finally, he briefly outlined technical aspects that would allow setting up a shared evaluation campaign on self-tracking data.
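The core idea of this model, algorithms travelling to the data rather than data being released, can be sketched in a few lines. The class below is a hypothetical illustration and not the design proposed in the paper: the names `DataBank` and `evaluate`, the accuracy measure and the toy records are all assumptions. A real deployment would also need to run submissions in an isolated environment, since a plain function call like the one here cannot by itself stop a submitted algorithm from leaking what it sees.

```python
from typing import Callable, Dict, List


class DataBank:
    """Hypothetical data bank: holds private records and never returns them."""

    def __init__(self, records: List[str], labels: List[int]) -> None:
        if not records or len(records) != len(labels):
            raise ValueError("need a non-empty dataset with one label per record")
        self._records = records  # protected personal data
        self._labels = labels    # ground truth, also kept private

    def evaluate(self, algorithm: Callable[[str], int]) -> Dict[str, float]:
        """Run a submitted algorithm on the protected data; return aggregates only."""
        correct = sum(
            1
            for record, label in zip(self._records, self._labels)
            if algorithm(record) == label
        )
        total = len(self._labels)
        return {"n": float(total), "accuracy": correct / total}


# Toy self-tracking records (invented for illustration): label 1 = active entry.
bank = DataBank(
    records=["walked 9000 steps", "slept 4 hours", "ran 5 km", "sat all day"],
    labels=[1, 0, 1, 0],
)


def participant_algorithm(record: str) -> int:
    """A participant's submission: a trivial keyword-based classifier."""
    return 1 if ("walked" in record or "ran" in record) else 0


report = bank.evaluate(participant_algorithm)
print(report)
```

The participant only ever receives the aggregate report, which is what allows a shared evaluation campaign to take place without the test collection itself being distributed.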

Experiences and Insights from the Collection of a Novel Multimedia EEG Dataset

The final presentation of the session was also provided by Cathal Gurrin from Dublin City University in which he introduced the topic ‘Experiences and Insights from the Collection of a Novel Multimedia EEG Dataset’ [4]. This work described how there is a growing interest in utilising novel signal sources such as EEG (Electroencephalography) in multimedia research. When using such signals, subtle limitations are often not readily apparent without significant domain expertise. Multimedia research outputs incorporating EEG signals can fail to be replicated when only minor modifications have been made to an experiment or seemingly unimportant (or unstated) details are changed. Cathal claimed that this can lead to over-optimistic or over-pessimistic viewpoints on the potential real-world utility of these signals in multimedia research activities.

In their paper, the researchers described the EEG/MM dataset and presented a summary of distilled experiences and knowledge gained during the preparation (and utilisation) of the dataset that supported a collaborative neural-image labelling benchmarking task. They stated that the goal of this task was to collaboratively identify machine learning approaches that would support the use of EEG signals in areas such as image labelling and multimedia modeling or retrieval. The researchers stressed that this research is relevant for the multimedia community as it suggests a template experimental paradigm (along with datasets and a baseline system) upon which researchers can explore multimedia image labelling using a brain-computer interface. In addition, the paper provided insights and experience of commonly encountered issues (and useful signals) when conducting research that utilises EEG in multimedia contexts. Finally, this work provided insight on how an EEG dataset can be used to support a collaborative neural-image labelling benchmarking task.


After the presentations, Aaron Duane moderated a panel discussion in which all presenters participated, as well as Björn Þór Jónsson who joined the panel as one of the special session chairs.

The panel began with a question about how the research community should address data anonymity in large multimedia datasets and how, even if the dataset is isolated and anonymised, data analysis techniques can be utilised to reverse this process either partially or completely. The panel agreed this was an important question and acknowledged that there is no simple answer. Cathal Gurrin stated that there is less of a restrictive onus on the datasets used for such research because the owners of the dataset often provide it with full knowledge of how it will be used.

As a follow-up, the questioner asked the panel about GDPR compliance in this context and the fact that uploaders could potentially change their minds about allowing their datasets to be used in research several years after they were released. The panel acknowledged that this remains an open concern and raised an additional one, namely the malicious uploading of data without the consent of the owner. One solution proposed by the panel was the introduction of an additional layer of security in the form of a human curator who could review the security and privacy concerns of a dataset during its generation, as is the case with some datasets of personal data currently being released to the community.

The discussion continued with much interest still directed toward effective privacy in datasets, especially when dealing with personal data, such as those generated by lifeloggers. One audience member recalled a story where a personal dataset was publicly released and people were able to garner personal information about individuals who were not the original uploader of the dataset and who did not consent to their face or personal information being publicly released. Cathal and Björn acknowledged that this remains an issue but drew attention to advanced censoring techniques, such as automatic face blurring, which are rapidly maturing in the domain. Furthermore, they claimed that the proposed model of Evaluation-as-a-Service discussed in Cathal’s earlier presentation could help to further alleviate some of these concerns.

Steering the conversation away from exclusively dealing with data privacy concerns, Aaron directed a question at Debesh and Andreas regarding the challenges and limitations associated with working directly with medical professionals to generate their datasets related to medical disorders. Debesh stated that there were numerous challenges, such as the medical professionals being unfamiliar with the tools used to generate the datasets, and that in many cases the opinions of multiple medical professionals were required, as they would often disagree. This generated significant technical and administrative overhead for the researchers, which slowed progress considerably. Andreas stated that he and his colleagues faced the same issues and highlighted the importance of effective communication between the medical experts and the technical researchers.

Towards the end of the discussion, the panel discussed how to encourage the release of more large-scale multimedia datasets for experimentation and what challenges are currently associated with that. The panel responded that the process remains difficult but that having special sessions such as this one is very helpful. The recognition of papers associated with multimedia datasets is becoming increasingly apparent, with many exceptional papers earning hundreds of citations within the community. The panel also stated that we should be mindful of the nature of each dataset, as releasing the same type of dataset again and again is not beneficial and has the potential to do more harm than good.


The MDRE special session, in its second incarnation at MMM 2020, was organised to facilitate the publication of high-quality datasets, and for community discussions on the methodology of dataset creation. The creation of reliable and shareable research artifacts, such as datasets with reliable ground truths, usually represents tremendous effort; effort that is rarely valued by publication venues, funding agencies or research institutions. In turn, this leads many researchers to focus on short-term research goals, with an emphasis on improving results on existing and often outdated datasets by small margins, rather than boldly venturing where no researchers have gone before. Overall, we believe that more emphasis on reliable and reproducible results would serve our community well, and the MDRE special session is a small effort towards that goal.


The session was organized by the authors of the report, in collaboration with Duc-Tien Dang-Nguyen (Dublin City University), who could not attend MMM. The panel format of the special session made the discussions much more engaging than that of a traditional special session. We would like to thank the presenters, and their co-authors for their excellent contributions, as well as the members of the audience who contributed greatly to the session.


  • [1] Leibetseder A., Kletz S., Schoeffmann K., Keckstein S., and Keckstein J. “GLENDA: Gynecologic Laparoscopy Endometriosis Dataset.” In: Cheng WH. et al. (eds) MultiMedia Modeling. MMM 2020. Lecture Notes in Computer Science, vol. 11962, 2020. Springer, Cham. https://doi.org/10.1007/978-3-030-37734-2_36.
  • [2] Jha D., Smedsrud P.H., Riegler M.A., Halvorsen P., De Lange T., Johansen D., and Johansen H.D. “Kvasir-SEG: A Segmented Polyp Dataset.” In: Cheng WH. et al. (eds) MultiMedia Modeling. MMM 2020. Lecture Notes in Computer Science, vol. 11962, 2020. Springer, Cham. https://doi.org/10.1007/978-3-030-37734-2_37.
  • [3] Hopfgartner F., Gurrin C., and Joho H. “Rethinking the Test Collection Methodology for Personal Self-tracking Data.” In: Cheng WH. et al. (eds) MultiMedia Modeling. MMM 2020. Lecture Notes in Computer Science, vol. 11962, 2020. Springer, Cham. https://doi.org/10.1007/978-3-030-37734-2_38.
  • [4] Healy G., Wang Z., Ward T., Smeaton A., and Gurrin C. “Experiences and Insights from the Collection of a Novel Multimedia EEG Dataset.” In: Cheng WH. et al. (eds) MultiMedia Modeling. MMM 2020. Lecture Notes in Computer Science, vol. 11962, 2020. Springer, Cham. https://doi.org/10.1007/978-3-030-37734-2_39.

MediaEval Multimedia Evaluation Benchmark: Tenth Anniversary and Counting

MediaEval Multimedia Challenges

MediaEval is a benchmarking initiative that offers challenges in multimedia retrieval, analysis and exploration. The tasks offered by MediaEval concentrate specifically on the human and social aspects of multimedia. They encourage researchers to bring together multiple modalities (visual, text, audio) and to think in terms of systems that serve users. Our larger aim is to promote reproducible research that makes multimedia a positive force for society. In order to provide an impression of the topical scope of MediaEval, we describe a few examples of typical tasks.

Historically, MediaEval tasks have often involved social media analysis. One of the first tasks offered by MediaEval, called the “Placing” Task, focused on the geo-location of social multimedia. This task ran from 2010-2016 and studied the challenge of automatically predicting the location at which an image has been taken. Over the years, the task investigated the benefits of combining text and image features, and also explored the challenges involved with geo-location prediction of video.

MediaEval “Placing” Task (2010-2016)

The “Placing” Task gave rise to two daughter tasks, which are focused on the societal impact of technology that can automatically predict the geo-location of multimedia shared online. One is Flood-related Multimedia, which challenges researchers to extract information related to flooding disasters from social media posts (combining text and images). The other is Pixel Privacy, which allows researchers to explore ways in which adversarial images can be used to protect sensitive information from being automatically extracted from images shared online.

The MediaEval Pixel Privacy Task (currently ongoing) had its own “trailer” in 2019

MediaEval has also offered a number of tasks that focus on how media content is received by users. The interest of MediaEval in the emotional impact of music is currently continued by the Emotion and Theme Recognition in Music Task. Also, the Predicting Media Memorability Task explores the aspects of video that are memorable to users.

The MediaEval Predicting Media Memorability Task (currently ongoing)

Recently, MediaEval has widened its focus to include multimedia analysis in systems. The Sports Video Annotation Task works towards improving sports training systems and the Medico Task focuses on multimedia analysis for more effective and efficient medical diagnosis.

Recent years have seen the rise of the use of sensor data in MediaEval. The No-audio Multimodal Speech Detection Task uses a unique data set captured by people wearing sensors and having conversations in a social setting. In addition to the sensor data, the movement of the speakers is captured by an overhead camera. The challenge is to detect the moments at which the people are speaking without making use of audio recordings.

Frames from overhead camera video of the MediaEval No-audio Multimodal Speech Detection Task (currently ongoing)

The Insight for Wellbeing Task uses a data set of lifelog images, sensor data and tags captured by people walking through a city wearing sensors and using smartphones. The challenge is to relate the data that is captured to the local pollution conditions.

MediaEval 10th Anniversary Workshop

Each year, MediaEval holds a workshop that brings researchers together to share their findings, discuss, and plan next year’s tasks. The 2019 workshop marked the 10th anniversary of MediaEval, which became an independent benchmark in 2010. The MediaEval 2019 Workshop was hosted by EURECOM in Sophia Antipolis, France. The workshop took place 27-29 October 2019, right after ACM Multimedia 2019, in Nice, France.

MediaEval 2019 Workshop at EURECOM, Sophia Antipolis, France (Photo credit: Mathias Lux)

The MediaEval 2019 Workshop is grateful to SIGMM for their support. This support helped ten students, working across a variety of tasks, to attend the workshop, and also made it possible to record all of the workshop talks. We also gratefully acknowledge the Multimedia Computing Group at Delft University of Technology and EURECOM.

Links to MediaEval 2019 tasks, videos and slides are available on the MediaEval 2019 homepage http://multimediaeval.org/mediaeval2019/. The link to the 2019 proceedings can be found there as well. 

Presenting results of a MediaEval task (Photo credit: Vajira Thambawita)

MediaEval has compiled a bibliography of papers that have been published using MediaEval data sets. This list includes not only MediaEval workshop papers, but also papers published at other workshops, conferences, and in journals. In total, around 750 papers have been written that use MediaEval data, and this number continues to grow. Check out the bibliography at https://multimediaeval.github.io/bib.

The Medieval in MediaEval

A long-standing tradition in MediaEval is to incorporate some aspect of medieval history into the social event of the workshop. This tradition is a wordplay on our name (“mediaeval” is an older spelling of “medieval”). Through the years the medieval connection has served to provide a local context for the workshop and has strengthened the bond among participants. At the MediaEval 2019 Workshop, we offered the chance to take a nature walk to the medieval town of Biot.

A journey of discovery at the MediaEval 2019 workshop (Photo credit: Vajira Thambawita)

The walking participants and the participants taking the bus convened on the “Place des Arcades” in the medieval town of Biot, where we enjoyed a dinner together under historic arches.

The MediaEval 2019 workshop gathers in Place des Arcades in Biot, near EURECOM (Photo credit: Vajira Thambawita)

MediaEval 2020

MediaEval has just announced the task line-up for 2020. Registration will open in July 2020 and the runs will be due at the end of October 2020. The workshop will be held in December, with dates to be announced.

This year, the MediaEval workshop will be fully online. Since MediaEval 2017 in Dublin, MediaEval has offered the possibility of remote workshop participation. Holding the workshop online this year is a natural extension of this trend, and we hope that researchers around the globe will take advantage of the opportunity to participate.

We are happy to introduce the new website: https://multimediaeval.github.io/. More information will be posted there as the season moves forward.

The day-to-day operations of MediaEval are handled by the MediaEval logistics committee, which grows stronger with each passing year. The authors of this article are logistics committee members from 2019. 

Dataset Column: ToCaDa Dataset with Multi-Viewpoint Synchronized Videos

This column describes the release of the Toulouse Campus Surveillance Dataset (ToCaDa). It consists of 25 synchronized videos (with audio) of two scenes recorded from different viewpoints of the campus. An extensive manual annotation comprises all moving objects and their corresponding bounding boxes, as well as audio events. The annotation was performed in order to i) enhance audiovisual objects that can be visible, audible or both, according to each recording location, and ii) uniquely identify all objects in each of the two scenes. All videos have been «anonymized». The dataset is available for download here.


The increasing number of recording devices, such as smartphones, has led to an exponential production of audiovisual documents. These documents may correspond to the same scene, for instance an outdoor event filmed from different points of view. Such multi-view scenes contain a lot of information and provide new opportunities for answering high-level automatic queries.

In essence, these documents are multimodal, and their audio and video streams contain different levels of information. For example, the source of a sound may either be visible or not according to the different points of view. This information can be used separately or jointly to achieve different tasks, such as synchronising documents or following the displacement of a person. The analysis of these multi-view field recordings further allows understanding of complex scenarios. The automation of these tasks faces a need for data, as well as a need for the formalisation of multi-source retrieval and multimodal queries. As also stated by Lefter et al., “problems with automatically processing multimodal data start already from the annotation level” [1]. The complexity of the interactions between modalities forced the authors to produce three different types of annotations: audio, video, and multimodal.

In surveillance applications, humans and vehicles are the most important elements commonly studied. Consequently, detecting and matching a person or a car that appears in several videos is a key problem. Although many algorithms have been introduced, a major related problem remains how to precisely evaluate and compare these algorithms against a common ground truth. Datasets are therefore required for evaluating multi-view methods.

During the last decade, public datasets have become more and more available, helping with the evaluation and comparison of algorithms, and in doing so, contributing to improvements in human and vehicle detection and tracking. However, most of the datasets focus on a specific task and do not support the evaluation of approaches that mix multiple sources of information. Only a few datasets provide synchronized videos with overlapping fields of view, and these rarely provide more than 4 different views, even though more and more approaches could benefit from having additional views available. Moreover, soundtracks are almost never provided despite being a rich source of information, as voices and motor noises can help to recognize a person or a car, respectively.

Notable multi-view datasets are the following.

  • The 3D People Surveillance Dataset (3DPeS) [2] comprises 8 cameras with disjoint views and 200 different people. Each person appears, on average, in 2 views. More than 600 video sequences are available. Thus, it is well-suited for people re-identification. Camera parameters are provided, as well as a coarse 3D reconstruction of the surveilled environment.
  • The Video Image Retrieval and Analysis Tool (VIRAT) [3] dataset provides a large amount of surveillance videos with a high pixel resolution. In this dataset, 16 scenes were recorded for hours although in the end only 25 hours with significant activities were kept. Moreover, only two pairs of videos present overlapping fields of view. Moving objects were annotated by workers with bounding boxes, as well as some buildings or areas. Three types of events were also annotated, namely (i) single person events, (ii) person and vehicle events, and (iii) person and facility events, leading to 23 classes of events. Most actions were performed by people with minimal scripted actions, resulting in realistic scenarios with frequent incidental movers and occlusions.
  • A purely action-oriented dataset is the Multicamera Human Action Video (MuHAVi) [4] dataset, in which 14 actors perform 17 different action classes (such as “kick”, “punch”, “gunshot collapse”) while 8 cameras capture the indoor scene. Likewise, Human3.6M [5] contains videos where 11 actors perform 15 different classes of actions while being filmed by 4 digital cameras; its specificity lies in the fact that 1 time-of-flight sensor and 10 motion cameras were also used to estimate and provide the 3D pose of the actors on each frame. Both background subtraction and bounding boxes are provided at each frame. In total, more than 3.6M frames are available. In these two datasets, actions are performed in unrealistic conditions, as the actors follow a script consisting of actions that are performed one after the other.

In the table below a comparison is shown between the aforementioned datasets, which are contrasted with the new ToCaDa dataset we recently introduced and describe in more detail below.

| Properties | 3DPeS [2] | VIRAT [3] | MuHAVi [4] | Human3.6M [5] | ToCaDa [6] |
|---|---|---|---|---|---|
| # Cameras | 8 static | 16 static | 8 static | 4 static | 25 static |
| # Microphones | 0 | 0 | 0 | 0 | 25+2 |
| Overlapping FOV | Very partially | 2+2 | 8 | 4 | 17 |
| Disjoint FOV | 8 | 12 | 0 | 0 | 4 |
| Synchronized | No | No | Partially | Yes | Yes |
| Pixel resolution | 704 x 576 | 1920 x 1080 | 720 x 576 | 1000 x 1000 | Mostly 1920 x 1080 |
| # Visual objects | 200 | Hundreds | 14 | 11 | 30 |
| # Action types | 0 | 23 | 17 | 15 | 0 |
| # Bounding boxes | 0 | ≈ 1 object/second | 0 | ≈ 1 object/frame | ≈ 1 object/second |
| In/outdoor | Outdoor | Outdoor | Indoor | Indoor | Outdoor |
| With scenario | No | No | Yes | Yes | Yes |
| Realistic | Yes | Yes | No | No | Yes |

ToCaDa Dataset

As a large multi-view, multimodal, and realistic video collection does not yet exist, we took the initiative to produce such a dataset. The ToCaDa dataset [6] comprises 25 synchronized videos (including soundtrack) of the same scene recorded from multiple viewpoints. The dataset follows two detailed scenarios consisting of comings and goings of people, cars and motorbikes, with both overlapping and non-overlapping fields of view (see Figures 1-2). This dataset aims at paving the way for multidisciplinary approaches and applications such as 4D-scene reconstruction, object re-identification/tracking and multi-source metadata modeling and querying.

Figure 1: The campus contains 25 cameras, of which 8 are spread out across the area and 17 are located within the red rectangle (see Figure 2).
Figure 2: The main building where 17 cameras with overlapping fields of view are concentrated.

About 20 actors were asked to follow two realistic scenarios by performing scripted actions, like driving a car, walking, entering or leaving a building, or holding an item in hand while being filmed. In addition to ordinary actions, some suspicious behaviors are present. More precisely:

  • In the first scenario, a suspect car (C) with two men inside (D the driver and P the passenger) arrives and parks in front of the main building (within sight of the cameras with overlapping views). P gets out of the car C and enters the building. Two minutes later, P leaves the building holding a package and gets into C. C leaves the car park (see Figure 3) and drives away from the university campus (passing in front of some of the cameras with disjoint fields of view). Other vehicles and persons regularly move through the different camera views without any suspicious behavior.
  • In the second scenario, a suspect car (C) with two men inside (D the driver and P the passenger) arrives and parks badly along the road. P gets out of the car and enters the building. Meanwhile, a woman W knocks on the car window to ask the driver D to park correctly, but he drives off immediately. A few minutes later, P leaves the building with a package and seems confused as the car is missing. He then runs away. In the end, in one of the disjoint-view cameras, we can see him waiting until C picks him up.
Figure 3: A subset of all the synchronized videos for a particular frame of the first scenario. First row: cameras located in front of the building. Second and third rows: cameras that face the car park. A car is circled in red to highlight the largely overlapping fields of view.

The 25 camera holders we enlisted used their own mobile devices to record the scene, leading to a large variety of resolutions, image quality, frame rates and video durations. Three foghorns were blown in order to coordinate this heterogeneous setup:

  • The first one stands for a warning 20 seconds before the start, to give enough time to start shooting.
  • The second one is the actual starting time, used to temporally synchronize the videos.
  • The third one indicates the ending time.

All the videos were collected and were manually synchronized using the second and the third foghorn blows as starting and ending times. Indeed, the second one can be heard at the beginning of every video.
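The role of the second foghorn as a shared time reference can be illustrated with a short sketch. It assumes, hypothetically, that the time at which the second foghorn is heard has been located in each raw recording; the camera names and values below are invented for illustration and are not taken from the dataset.

```python
from typing import Dict


def to_common_time(local_time_s: float, start_horn_s: float) -> float:
    """Map a time in one raw recording onto the shared timeline,
    where 0 corresponds to the second (starting) foghorn."""
    return local_time_s - start_horn_s


def align_event(event_time_s: float, source_cam: str, target_cam: str,
                start_horn_s: Dict[str, float]) -> float:
    """Convert an event time from one recording's clock to another's."""
    common = to_common_time(event_time_s, start_horn_s[source_cam])
    return common + start_horn_s[target_cam]


# Hypothetical positions (in seconds) of the second foghorn in two raw recordings.
start_horn_s = {"cam01": 21.5, "cam02": 19.0}

# An event seen 30.0 s into cam01's recording, expressed on cam02's clock.
t_cam02 = align_event(30.0, "cam01", "cam02", start_horn_s)
```

Once the recordings are trimmed so that the second foghorn marks the beginning of every video, these offsets become zero and timestamps can be compared directly across views.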


A special annotation procedure was set up to handle the audiovisual content of this multi-view data [7]. The audio and video parts of each document were first annotated separately, after which the two modalities were fused.

The ground truth annotations are stored in JSON files. Each file corresponds to a video and shares the same name but not the same extension: the annotations for <video_name>.mp4 are stored in <video_name>.json. Both visual and audio annotations are stored together in the same file.
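As a minimal sketch of this naming convention, the annotation file for a video can be located by swapping the extension. The helper names and the example path are hypothetical, and nothing is assumed here about the internal layout of the JSON content.

```python
import json
from pathlib import Path


def annotation_path(video_path: str) -> Path:
    """Return the path of the annotation file that accompanies a video,
    i.e. the same name with a .json extension."""
    return Path(video_path).with_suffix(".json")


def load_annotations(video_path: str):
    """Load the ground truth annotations stored next to the given video."""
    with annotation_path(video_path).open(encoding="utf-8") as f:
        return json.load(f)


# Hypothetical file name, used only to show the mapping.
print(annotation_path("scenario1/cam10.mp4").name)
```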

By annotating, our goal is to detect the visual objects and the salient sound events and, when possible, to associate them. Thus, we have grouped them into the generic term audio-visual object. This way, the appearance of a vehicle and its motor sound will constitute a single coherent audio-visual object and is associated with the same ID. An object that can be seen but cannot be heard is also an audio-visual object but with only a visual component, and similarly for an object that can only be heard. An example is given in Listing 1.

Listing 1: JSON file structure of the visual component of an object in a video, visible from 13.8s to 18.2s and from 29.72s to 32.28s and associated with id 11.
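To illustrate how such a visual component might be used, the sketch below represents it as a Python dictionary and checks whether the object is visible at a given time. The key names ("id", "category", "time_bounds") and the category value are assumptions made for illustration and may differ from the actual annotation files; only the ID and the two visibility intervals come from the caption above.

```python
# Hypothetical in-memory form of a visual component (key names are assumptions).
visual_component = {
    "id": 11,
    "category": "vehicle",  # assumed value, for illustration only
    "time_bounds": [[13.8, 18.2], [29.72, 32.28]],  # seconds, from the caption
}


def is_visible(component: dict, t: float) -> bool:
    """True if time t (in seconds) falls inside one of the visibility intervals."""
    return any(start <= t <= end for start, end in component["time_bounds"])


def visible_duration(component: dict) -> float:
    """Total time (in seconds) during which the object is visible."""
    return sum(end - start for start, end in component["time_bounds"])


print(is_visible(visual_component, 15.0))  # inside the first interval
print(is_visible(visual_component, 20.0))  # between the two intervals
```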

To help with the annotation process, we developed a program for navigating through the frames of the synchronized videos and for identifying audio-visual objects by drawing bounding boxes in particular frames and/or specifying starting and ending times of salient sounds. Bounding boxes were drawn around every moving object, together with a flag indicating whether the object was fully visible or occluded, its category (human or vehicle), visual details (for example clothes types or colors), and the timestamps of its appearances and disappearances. Audio events were also annotated with a category and two timestamps.

Regarding bounding boxes, the coordinates of top-left and bottom-right corners of the bounding boxes are given. Bounding boxes were drawn such that the object is fully contained within the box and as tight as possible. For this purpose, our annotation tool allows the user to draw an initial approximate bounding box and then to adjust its boundaries at a pixel-level.

As drawing one bounding box for each object on every frame requires a huge amount of time, we have drawn bounding boxes on a subset of frames, so that the intermediate bounding boxes of an object can be linearly interpolated using its previous and next drawn bounding boxes. On average, we have drawn one bounding box per second for humans and two for vehicles due to their speed variation. For objects with irregular speed or trajectory, we have drawn more bounding boxes.
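The linear interpolation between two drawn bounding boxes can be sketched as follows. Boxes are represented here as (x1, y1, x2, y2) tuples holding the top-left and bottom-right corners; this tuple layout, the function name and the example values are assumptions made for illustration.

```python
from typing import Tuple

# (x1, y1, x2, y2): top-left and bottom-right corners, in pixels.
Box = Tuple[float, float, float, float]


def interpolate_box(t: float, t0: float, box0: Box, t1: float, box1: Box) -> Box:
    """Linearly interpolate a bounding box at time t between two drawn boxes.

    box0 was drawn at time t0 and box1 at time t1, with t0 < t1 and t0 <= t <= t1.
    """
    if t1 <= t0:
        raise ValueError("t1 must be strictly greater than t0")
    if not (t0 <= t <= t1):
        raise ValueError("t must lie between t0 and t1")
    alpha = (t - t0) / (t1 - t0)  # 0 at the first drawn box, 1 at the second
    return tuple(a + alpha * (b - a) for a, b in zip(box0, box1))


# Two boxes drawn one second apart (hypothetical values); query the midpoint.
mid_box = interpolate_box(1.5, 1.0, (100, 200, 150, 300), 2.0, (120, 200, 170, 320))
```

This approximation is only as good as the assumption of constant motion between two drawn boxes, which is why objects with irregular speed or trajectory need boxes drawn more often.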

Regarding the audio component of an audio-visual object, namely the salient sound events, an audio category (voice, motor sound) is given in addition to its ID, as well as a list of details and time bounds (see Listing 2).

Listing 2: JSON file structure of an audio event in a given video. As it is associated with id 11, it corresponds to the same audio-visual object as the one in Listing 1.

Finally, we linked the audio to the video objects, by giving the same ID to the audio object in case of causal identification, which means that the acoustic source of the audio event is the object (a car or a person for instance) that was annotated. This step was particularly crucial and could not be automated, as considerable expertise is required to identify the sound sources. For example, in the video sequence illustrated in Figure 4, a motor sound is audible and seems to come from the car whereas it actually comes from a motorbike behind the camera.

Figure 4: At this time of the video sequence of camera 10, a motor sound is heard and seems to come from the car while it actually comes from a motorbike behind the camera.

In case of an object presenting different sound categories (a car with door slams, music and motor sound for example), one object is created for each category and the same ID is given.
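The way a shared ID ties visual and audio components together can be sketched by grouping annotation entries by ID. The flat list of entries and the key names ("id", "kind", "category") are hypothetical and chosen only for illustration; the actual files may be organised differently.

```python
from collections import defaultdict


def group_by_object_id(components):
    """Group annotation entries into audio-visual objects keyed by their ID."""
    objects = defaultdict(lambda: {"visual": [], "audio": []})
    for comp in components:
        objects[comp["id"]][comp["kind"]].append(comp)
    return dict(objects)


# Hypothetical entries: a car that is seen and heard (two sound categories),
# and a motorbike that is heard but never seen.
components = [
    {"id": 11, "kind": "visual", "category": "vehicle"},
    {"id": 11, "kind": "audio", "category": "motor sound"},
    {"id": 11, "kind": "audio", "category": "door slam"},
    {"id": 12, "kind": "audio", "category": "motor sound"},
]

objects = group_by_object_id(components)
heard_only = [oid for oid, parts in objects.items() if not parts["visual"]]
```

In this form, an object with only audio entries corresponds to a source that is audible but not visible from that camera, as in the motorbike example of Figure 4.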

Ethical and Legal

According to European legislation, it is forbidden to make publicly available images in which people might be recognized or license plates are visible. As people and license plates are visible in our videos, to conform to the General Data Protection Regulation (GDPR) we decided to:

  • Ask actors to sign an authorization for publishing their image, and
  • Apply post-processing to the videos to blur the faces of other people and any license plates.


We have introduced a new dataset composed of two sets of 25 synchronized videos of the same scene with 17 overlapping views and 8 disjoint views. Videos are provided with their associated soundtracks. We have annotated the videos by manually drawing bounding boxes on moving objects. We have also manually annotated audio events. Our dataset offers simultaneously a large number of both overlapping and disjoint synchronized views and a realistic environment. It also provides audio tracks with sound events, high pixel resolution and ground truth annotations.

The originality and the richness of this dataset come from the wide diversity of topics it covers and the presence of scripted and non-scripted actions and events. Therefore, our dataset is well suited for numerous pattern recognition applications related to, but not restricted to, the domain of surveillance. We describe below some multidisciplinary applications that could be evaluated using this dataset:

3D and 4D reconstruction: The multiple cameras sharing overlapping fields of view, along with some provided photographs of the scene, make it possible to perform a 3D reconstruction of the static parts of the scene and to retrieve the intrinsic parameters and poses of the cameras using a Structure-from-Motion algorithm. Beyond a 3D reconstruction, the temporal synchronization of the videos could make it possible to render dynamic parts of the scene as well and to obtain a 4D reconstruction.

Object recognition and consistent labeling: Evaluation of algorithms for human and vehicle detection and consistent labeling across multiple views can be performed using the annotated bounding boxes and IDs. To this end, overlapping views provide a 3D environment that could help to infer the label of an object in a video knowing its position and label in another video.
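One common way to compare a detected box with an annotated one is the intersection-over-union (IoU) score. The source does not prescribe a particular metric, so the sketch below is only an illustration of how the annotated boxes could be used, with boxes given as (x1, y1, x2, y2) corner tuples (an assumed layout).

```python
from typing import Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2), top-left and bottom-right


def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes (0 if they do not overlap)."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


# Hypothetical annotated and detected boxes overlapping by half their width.
score = iou((0, 0, 10, 10), (5, 0, 15, 10))
```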

Sound event recognition: The audio events recorded from different locations and manually annotated provide opportunities to evaluate the relevance of consistent acoustic models by, for example, launching the identification and indexing of a specific sound event. Looking for a particular sound by similarity is also feasible.

Metadata modeling and querying: The multiple layers of information in this dataset, both low-level (audio/video signal) and high-level (semantic data available in the ground truth files), enable the handling of information at different resolutions of space and time, making it possible to perform queries on heterogeneous information.


[1] I. Lefter, L.J.M. Rothkrantz, G. Burghouts, Z. Yang, P. Wiggers. “Addressing multimodality in overt aggression detection”, in Proceedings of the International Conference on Text, Speech and Dialogue, 2011, pp. 25-32.
[2] D. Baltieri, R. Vezzani, R. Cucchiara. “3DPeS: 3D people dataset for surveillance and forensics”, in Proceedings of the 2011 joint ACM workshop on Human Gesture and Behavior Understanding, 2011, pp. 59-64.
[3] S. Oh, A. Hoogs, A. Perera, N. Cuntoor, C. Chen, J.T. Lee, S. Mukherjee, J.K. Aggarwal, H. Lee, L. Davis, E. Swears, X. Wang, Q. Ji, K. Reddy, M. Shah, C. Vondrick, H. Pirsiavash, D. Ramanan, J. Yuen, A. Torralba, B. Song, A. Fong, A. Roy-Chowdhury, M. Desai. “A large-scale benchmark dataset for event recognition in surveillance video”, in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2011, pp. 3153-3160.
[4] S. Singh, S.A. Velastin, H. Ragheb. “MuHAVi: A multicamera human action video dataset for the evaluation of action recognition methods”, in Proceedings of the 7th IEEE International Conference on Advanced Video and Signal Based Surveillance, 2010, pp. 48-55.
[5] C. Ionescu, D. Papava, V. Olaru, C. Sminchisescu. “Human3.6M: Large scale datasets and predictive methods for 3d human sensing in natural environments”, IEEE Transactions on Pattern Analysis and Machine Intelligence, 36(7), 2013, pp. 1325-1339.
[6] T. Malon, G. Roman-Jimenez, P. Guyot, S. Chambon, V. Charvillat, A. Crouzil, A. Péninou, J. Pinquier, F. Sèdes, C. Sénac. “Toulouse campus surveillance dataset: scenarios, soundtracks, synchronized videos with overlapping and disjoint views”, in Proceedings of the 9th ACM Multimedia Systems Conference, 2018, pp. 393-398.
[7] P. Guyot, T. Malon, G. Roman-Jimenez, S. Chambon, V. Charvillat, A. Crouzil, A. Péninou, J. Pinquier, F. Sèdes, C. Sénac. “Audiovisual annotation procedure for multi-view field recordings”, in Proceedings of the International Conference on Multimedia Modeling, 2019, pp. 399-410.