Overview of Benchmarking Platforms and Software for Multimedia Applications

At a time when Artificial Intelligence (AI) continues to push the boundaries of what was previously thought possible, the demand for benchmarking platforms that allow AI models to be assessed and evaluated fairly has become paramount. These platforms serve as connecting hubs between data scientists, machine learning specialists, industry partners, and other interested parties. They mostly function under the Evaluation-as-a-Service (EaaS) paradigm [1]: participants in a given benchmarking task should be able to test the output of their systems under similar conditions, by being provided with a common definition of the targeted concepts, datasets and data splits, metrics, and evaluation tools. These common elements are provided through online platforms that can even offer Application Programming Interfaces (APIs) or container-level integration of the participants’ AI models. This column provides an insight into these platforms, looking at their main characteristics, use cases, and particularities. In the second part of the column we also look at some of the main benchmarking platforms geared towards handling multimedia-centric benchmarks and datasets relevant to SIGMM.

Defining Characteristics of EaaS platforms

Benchmarking competitions and initiatives, as well as EaaS platforms, attempt to tackle a number of key points in the development of AI algorithms and models, namely:

  • Creating a fair and impartial evaluation environment, by standardizing the datasets and evaluation metrics used by all participants in an evaluation competition. In doing so, EaaS platforms play a pivotal role in promoting transparency and comparability in AI models and approaches.
  • Enhancing reproducibility by giving the option to run the AI models on dedicated servers provided and managed by competition organizers. This increases the trust and bolsters the integrity of the results produced by competition participants, as the organizers are able to closely monitor the testing process for each individual AI model. 
  • Fostering, as a natural consequence, a higher degree of data privacy, as participants could be given access only to training data, while testing data is kept private and is only accessed via APIs on the dedicated servers, reducing the risk of data exposure.
  • Creating a common repository for sharing the data and details of a benchmarking task, building a history not only of the results of the benchmarking tasks throughout the years, but also of the evolution of the types of approaches and models used by participants. Other useful features, like forums and discussion threads on competitions, allow new participants to quickly search for problems they encounter and resolve their issues faster.

Given these common goals, benchmarking platforms usually integrate a set of common features and user-level functionalities that are summed up in this section and grouped into three categories: task organization and scheduling, scoring and reproducibility, and communication and dissemination.

Task organization and scheduling. The platforms allow the creation, modification and maintenance of benchmarking tasks, either through a graphical user interface (GUI) or by using task bundles (most commonly using JSON, XML, Python or custom scripting languages). Competition organizers can define their task, and define sub-tasks that may explore different facets of the targeted data. Scheduling is another important feature in the creation of benchmarking competitions: some parts of the data may be kept private until a certain moment in time, and competition organizers can hide the results of other teams until a certain point in time. We consider the last point an important one, as participants may feel discouraged from continuing their participation if their initial results are not high enough compared with those of other participants. Another noteworthy feature is run quantity management, which allows organizers to specify a maximum number of allowed runs per participant during the benchmarking task. This limitation discourages participants from attempting to solve the given tasks with brute force approaches, where they implement a large number of models and model variations. As a result, participants are incentivized to delve deeper into the data, critically analyzing why certain methods succeed and others fall short.
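The scheduling and run quantity features described above can be pictured with a small sketch. It does not reproduce the bundle format of any particular platform: the class TaskConfig, its field names, and the dates are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field
from datetime import datetime


@dataclass
class TaskConfig:
    """Hypothetical task bundle; all field names are illustrative assumptions."""
    name: str
    test_data_release: datetime     # test data stays private before this moment
    leaderboard_reveal: datetime    # other teams' results stay hidden before this moment
    max_runs_per_team: int          # run quantity management
    runs_used: dict = field(default_factory=dict)

    def test_data_visible(self, now: datetime) -> bool:
        return now >= self.test_data_release

    def leaderboard_visible(self, now: datetime) -> bool:
        return now >= self.leaderboard_reveal

    def register_run(self, team: str) -> bool:
        """Record a run and return True, or return False once the quota is spent."""
        used = self.runs_used.get(team, 0)
        if used >= self.max_runs_per_team:
            return False
        self.runs_used[team] = used + 1
        return True


# Example values are made up for the sketch.
task = TaskConfig(
    name="demo-task",
    test_data_release=datetime(2024, 3, 1),
    leaderboard_reveal=datetime(2024, 5, 1),
    max_runs_per_team=2,
)
```

With this configuration, a team can register two runs and the third call to register_run is refused, while the leaderboard stays hidden until the reveal date even though the test data is already available.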

Scoring and reproducibility. EaaS platforms generally deploy two paradigms, sometimes side-by-side, with regard to AI model testing and results generation [1, 2]: the Data-to-Algorithm (D2A) approach, and the Algorithm-to-Data (A2D) approach. The former refers to competitions where participants must download the testing set, run the prediction systems on their own machines, and provide the predictions to the organizers, usually in CSV format for the multimedia domain. In this setup, the ground truth data for the testing set is kept private. After the organizers receive the prediction result files, they communicate the performance to the participants, or the platform computes the results automatically using organizer-provided scripts once the files are uploaded. The A2D approach, on the other hand, is more complex, may incur additional financial costs, and may be more time-consuming for both organizers and task participants, but increases the trustworthiness and reproducibility of the task and of the AI models themselves. In this setup, organizers provide cloud-based computing resources via Virtual Machines (VMs) and containers, and a common processing pipeline or API that competitors must integrate in their source code. The participants develop the wrappers that integrate their AI models accordingly, and upload the models to the EaaS platform directly. The AI models are then executed according to the common pipeline and results are automatically provided to the participants, while also allowing the testing data to be kept completely private. Traditionally, in order to achieve this, EaaS platforms offer the possibility of integration with cloud computing platforms like Amazon AWS, Microsoft Azure, or Google Cloud, and offer Docker integration for the creation of containers where the code can be hosted.
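As a rough illustration of the D2A flow, the sketch below shows what an organizer-provided scoring script might look like. The CSV layout (an "id" and a "label" column), the function names, and the choice of accuracy as the metric are all assumptions made for this example; real tasks define their own formats and metrics.

```python
import csv
import io


def load_labels(csv_text: str) -> dict:
    """Parse 'id,label' rows (with a header line) into a dictionary."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return {row["id"]: row["label"] for row in reader}


def score_run(predictions_csv: str, ground_truth_csv: str) -> float:
    """Return accuracy over the ground-truth items; missing predictions count as errors."""
    truth = load_labels(ground_truth_csv)
    preds = load_labels(predictions_csv)
    if not truth:
        raise ValueError("ground truth is empty")
    correct = sum(1 for item_id, label in truth.items() if preds.get(item_id) == label)
    return correct / len(truth)


# The ground truth never leaves the organizers; participants only upload predictions.
ground_truth = "id,label\nimg1,cat\nimg2,dog\nimg3,cat\nimg4,bird\n"
submission = "id,label\nimg1,cat\nimg2,cat\nimg3,cat\n"
accuracy = score_run(submission, ground_truth)
```

In this toy submission two of the four test items are labelled correctly (one is wrong and one is missing), so the participant would be told an accuracy of 0.5 without ever seeing the private labels.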

Communication and dissemination. EaaS platforms allow the interaction between competition organizers and participants, either through emails, automatic notifications, or forums where interested parties can exchange ideas, ask questions, offer help, and signal potential problems in the data or scripts associated with the tasks.

Popular multimedia EaaS platforms

This section presents some of the most popular benchmarking platforms aimed at the multimedia domain. We will present some key features and associated popular multimedia datasets for the following platforms: Kaggle, AIcrowd, Codabench, DrivenData, and EvalAI.

Kaggle is perhaps the most popular benchmarking platform at this moment, and goes beyond the scope of providing datasets and benchmarking competitions, also hosting AI models, courses, and source code repositories. Competition organizers can design the tasks under either the D2A or the A2D paradigm, giving participants the possibility of integrating their AI models in Jupyter Notebooks for reproducibility. The platform also gives the option of allotting CPU and GPU cloud-based resources for A2D competitions. The Kaggle repository offers code for a large number of additional competition management tools and communication APIs. Among an impressive number of datasets and competitions, Kaggle currently hosts competitions that use the original MNIST data [3] and other MNIST-like datasets such as Fashion-MNIST [4], as well as datasets on varied subjects ranging from sentiment analysis in social media [5] to medical image processing [6].

AIcrowd is an open source EaaS platform for open benchmarking challenges that emphasizes connections and collaborative work between data science and machine learning experts. This platform offers the source code for command line interface (CLI) and API clients that can interact with AIcrowd servers. ImageCLEF, between 2018 and 2022 [7-11], is one of the most popular multimedia benchmarking initiatives hosted on AIcrowd, featuring diverse multimedia topics such as lifelogging, medical image processing, image processing for environment health prediction, the analysis of social media dangers with regard to image sharing, and ensemble learning for multimedia data.

Codabench, launched in August 2023, and its precursor CodaLab are two open source benchmarking platforms that provide a large number of options, including A2D and D2A approaches, as well as “inverted benchmarks”, where organizers provide the reference algorithms and participants contribute the datasets. Standouts among the challenges currently running on this platform are the two Quality-of-Service-oriented challenges on audio-video synchronization error detection and error measurement, which are part of the 3rd Workshop on Image/Video/Audio Quality in Computer Vision and Generative AI at the Winter Conference on Applications of Computer Vision (WACV2024).

DrivenData targets the intersection of data science and social impact. This platform hosts competitions that integrate the social aspect of their domain of interest directly in their mission and definition, while also hosting a number of open-source projects and competition-winning AI models. Given its emphasis on social impact, this platform hosts a number of benchmarking challenges that target social issues like the detection of hateful memes [12] and image-based nature conservation efforts.

EvalAI is another open source platform that is able to create A2D and D2A competition environments, while also integrating optimization steps that allow for evaluation code to run faster on multi-core cloud infrastructure. The EvalAI platform holds many diverse multimedia-centric competitions, including image segmentation tasks based on LVIS [13] and a wide range of sport tasks [14].

Future directions, developments and other tools

While the tools and platforms described in the previous section represent just a portion of the EaaS platforms currently online in the research community, we would also like to mention some projects that are currently in the development stage or that can be considered additional tools for benchmarking initiatives:

  • The AI4Media benchmarking platform is currently in the prototype and development stage. Among the most interesting features and ideas promoted by the platform developers is the creation of complexity metrics that would help competition organizers understand the computational efficiency and resource requirements of the submitted systems.
  • BenchmarkSTT started as a specialized benchmarking platform for speech-to-text, but is now evolving in different directions, including facial recognition in videos.
  • The PapersWithCode platform, while not a benchmarking platform per se, is useful as a repository that collects the results of AI models on datasets throughout the years, and groups different datasets studying the same concepts under the same umbrella (e.g., Image Classification, Object Detection, Medical Image Segmentation), while also providing links to scientific papers, GitHub implementations of the models, and links to the datasets. This may represent a good starting point for young researchers who are trying to understand the history and state-of-the-art for certain domains and applications.

Conclusions

Benchmarking platforms represent a key component of benchmarking, pushing for fairness and trustworthiness in AI model comparison, while also providing tools that may foster reproducibility in AI. We are happy to see that many of the platforms discussed in this article are open source, or have open source components, thus allowing interested scientists to create their own custom implementations of these platforms, and to adapt them when necessary to their particular fields.

Acknowledgements

The work presented in this column is supported under the H2020 AI4Media “A European Excellence Centre for Media, Society and Democracy” project, contract #951911.

References

[1] Hanbury, A., Müller, H., Balog, K., Brodt, T., Cormack, G. V., Eggel, I., Gollub, T., Hopfgartner, F., Kalpathy-Cramer, J., Kando, N., Krithara, A., Lin, J., Mercer, S. & Potthast, M. (2015). Evaluation-as-a-service: Overview and outlook. arXiv preprint arXiv:1512.07454.
[2] Hanbury, A., Müller, H., Langs, G., Weber, M. A., Menze, B. H., & Fernandez, T. S. (2012). Bringing the algorithms to the data: cloud–based benchmarking for medical image analysis. In Information Access Evaluation. Multilinguality, Multimodality, and Visual Analytics: Third International Conference of the CLEF Initiative, CLEF 2012, Rome, Italy, September 17-20, 2012. Proceedings 3 (pp. 24-29). Springer Berlin Heidelberg.
[3] LeCun, Y., Bottou, L., Bengio, Y., & Haffner, P. (1998). Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11), 2278-2324.
[4] Xiao, H., Rasul, K., & Vollgraf, R. (2017). Fashion-mnist: a novel image dataset for benchmarking machine learning algorithms. arXiv preprint arXiv:1708.07747.
[5] Niu, T., Zhu, S., Pang, L., & El Saddik, A. (2016). Sentiment analysis on multi-view social data. In MultiMedia Modeling: 22nd International Conference, MMM 2016, Miami, FL, USA, January 4-6, 2016, Proceedings, Part II 22 (pp. 15-27). Springer International Publishing.
[6] Thambawita, V., Hicks, S. A., Storås, A. M., Nguyen, T., Andersen, J. M., Witczak, O., … & Riegler, M. A. (2023). VISEM-Tracking, a human spermatozoa tracking dataset. Scientific Data, 10(1), 1-8.
[7] Ionescu, B., Müller, H., Villegas, M., García Seco de Herrera, A., Eickhoff, C., Andrearczyk, V., … & Gurrin, C. (2018). Overview of ImageCLEF 2018: Challenges, datasets and evaluation. In Experimental IR Meets Multilinguality, Multimodality, and Interaction: 9th International Conference of the CLEF Association, CLEF 2018, Avignon, France, September 10-14, 2018, Proceedings 9 (pp. 309-334). Springer International Publishing.
[8] Ionescu, B., Müller, H., Péteri, R., Dang-Nguyen, D. T., Piras, L., Riegler, M., … & Karampidis, K. (2019). ImageCLEF 2019: Multimedia retrieval in lifelogging, medical, nature, and security applications. In Advances in Information Retrieval: 41st European Conference on IR Research, ECIR 2019, Cologne, Germany, April 14–18, 2019, Proceedings, Part II 41 (pp. 301-308). Springer International Publishing.
[9] Ionescu, B., Müller, H., Péteri, R., Dang-Nguyen, D. T., Zhou, L., Piras, L., … & Constantin, M. G. (2020). ImageCLEF 2020: Multimedia retrieval in lifelogging, medical, nature, and internet applications. In Advances in Information Retrieval: 42nd European Conference on IR Research, ECIR 2020, Lisbon, Portugal, April 14–17, 2020, Proceedings, Part II 42 (pp. 533-541). Springer International Publishing.
[10] Ionescu, B., Müller, H., Péteri, R., Abacha, A. B., Demner-Fushman, D., Hasan, S. A., … & Popescu, A. (2021). The 2021 ImageCLEF Benchmark: Multimedia retrieval in medical, nature, internet and social media applications. In Advances in Information Retrieval: 43rd European Conference on IR Research, ECIR 2021, Virtual Event, March 28–April 1, 2021, Proceedings, Part II 43 (pp. 616-623). Springer International Publishing.
[11] de Herrera, A. G. S., Ionescu, B., Müller, H., Péteri, R., Abacha, A. B., Friedrich, C. M., … & Dogariu, M. (2022, April). Imageclef 2022: multimedia retrieval in medical, nature, fusion, and internet applications. In European Conference on Information Retrieval (pp. 382-389). Cham: Springer International Publishing.
[12] Kiela, D., Firooz, H., Mohan, A., Goswami, V., Singh, A., Fitzpatrick, C. A., … & Parikh, D. (2021, August). The hateful memes challenge: Competition report. In NeurIPS 2020 Competition and Demonstration Track (pp. 344-360). PMLR.
[13] Gupta, A., Dollar, P., & Girshick, R. (2019). Lvis: A dataset for large vocabulary instance segmentation. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (pp. 5356-5364).
[14] Giancola, S., Cioppa, A., Deliège, A., Magera, F., Somers, V., Kang, L., … & Li, Z. (2022, October). SoccerNet 2022 challenges results. In Proceedings of the 5th International ACM Workshop on Multimedia Content Analysis in Sports (pp. 75-86).

Overview of Open Dataset Sessions and Benchmarking Competitions in 2021


This issue of the Dataset Column proposes a review of some of the most important events in 2021 related to special sessions on open datasets or benchmarking competitions associated with multimedia data. While this is not meant to represent an exhaustive list of events, we wish to underline the great diversity of subjects and dataset topics currently of interest to the multimedia community. We will present the following events:

  • 13th International Conference on Quality of Multimedia Experience (QoMEX 2021 – https://qomex2021.itec.aau.at/). We summarize six datasets presented at this conference that address QoE studies on haze conditions (RHVD), tele-education events (EVENT-CLASS), storytelling scenes (MTF), image compression (EPFL), virtual reality effects on gamers (5Gaming), and live stream shopping (LSS-survey).
  • Multimedia Datasets for Repeatable Experimentation at 27th International Conference on Multimedia Modeling (MDRE at MMM 2021 – https://mmm2021.cz/special-session-mdre/). We summarize the five datasets presented during the MDRE, addressing several topics like lifelogging and environmental data (MNR-HCM), cat vocalizations (CatMeows), home activities (HTAD), gastrointestinal procedure tools (Kvasir-Instrument), and keystroke and lifelogging (KeystrokeDynamics).
  • Open Dataset and Software Track at 12th ACM Multimedia Systems Conference (ODS at MMSys ’21) (https://2021.acmmmsys.org/calls.php#ods). We summarize seven datasets presented at the ODS track, targeting several topics like network statistics (Brightcove Streaming Datasets, and PePa Ping), emerging image and video modalities (Full UHD 360-Degree, 4DLFVD, and CWIPC-SXR) and human behavior data (HYPERAKTIV and Target Selection Datasets).
  • Selected datasets at 29th ACM Multimedia Conference (MM ’21) (https://2021.acmmm.org/). For a general report from ACM Multimedia 2021 please see (https://records.sigmm.org/2021/11/23/reports-from-acm-multimedia-2021/). We summarize six datasets presented during the conference, targeting several topics like food logo detection (FoodLogoDet-1500), emotional relationship recognition (ERATO), text-to-face synthesis (CelebAText-HQ), multimodal linking (M3EL), egocentric video analysis (EGO-Deliver), and quality assessment of user-generated videos (PUGCQ).
  • ImageCLEF 2021 (https://www.imageclef.org/2021). We summarize the six datasets launched for the benchmarking tasks, related to several topics like social media profile assessment (ImageCLEFaware), segmentation and labeling of underwater coral images (ImageCLEFcoral), automatic generation of web-pages (ImageCLEFdrawnUI) and medical imaging analysis (ImageCLEF-VQAMed, ImageCLEFmedCaption, and ImageCLEFmedTuberculosis).

Creating annotated datasets is even more difficult in ongoing pandemic times, and we are glad to see that many interesting datasets were published despite this unfortunate situation.

QoMEX 2021

A large number of dataset-related papers were presented at the International Conference on Quality of Multimedia Experience (QoMEX 2021), organized as a fully online event in Montreal, Canada, June 14-17, 2021 (https://qomex2021.itec.aau.at/). The complete QoMEX ’21 Proceedings are available in the IEEE Digital Library (https://ieeexplore.ieee.org/xpl/conhome/9465370/proceeding).

The conference did not have a dedicated Dataset session. However, datasets were very important to the conference, with a number of papers presenting new datasets or making use of broadly available ones. As a small sample, six selected papers focused primarily on new datasets are listed below. These contributions focus on haze, teaching in Virtual Reality, multiview video, image quality, cybersickness in Virtual Reality gaming, and shopping patterns.

A Real Haze Video Database for Haze Evaluation
Paper available at: https://ieeexplore.ieee.org/document/9465461
Chu, Y., Luo, G., and Chen, F.
College of Computer Science and Software Engineering, Shenzhen University, Shenzhen, P.R.China.
Dataset available at: https://drive.google.com/file/d/1zY0LwJyNB8u1JTAJU2X7ZkiYXsBX7BF/view?usp=sharing

The RHVD video quality assessment dataset focuses on the study of perceptual degradation caused by heavy haze conditions in real-world outdoor scenes, addressing a large number of possible use case scenarios, including driving assistance and warning systems. The dataset is collected from the Flickr video sharing platform and post-edited, and 40 annotators took part in the subjective quality assessment experiments.

EVENT-CLASS: Dataset of events in the classroom
Paper available at: https://ieeexplore.ieee.org/document/9465389
Orduna, M., Gutierrez, J., Manzano, C., Ruiz, D., Cabrera, J., Diaz, C., Perez, P., and Garcia, N.
Grupo de Tratamiento de Imágenes, Information Processing & Telecom. Center, Universidad Politécnica de Madrid, Spain; Nokia Bell Labs, Madrid, Spain.
Dataset available at: http://www.gti.ssr.upm.es/data/event-class

The EVENT-CLASS dataset consists of 360-degree videos that contain events and characteristics specific to the context of tele-education, composed of video and audio sequences taken in varying conditions. The dataset addresses several topics, including quality assessment tests with the aim of improving the immersive experience of remote users.

A Multi-View Stereoscopic Video Database With Green Screen (MTF) For Video Transition Quality-of-Experience Assessment
Paper available at: https://ieeexplore.ieee.org/document/9465458
Hobloss, N., Zhang, L., and Cagnazzo, M.
LTCI, Télécom-Paris, Institut Polytechnique de Paris, Paris, France; Univ Rennes, INSA Rennes, CNRS, Rennes, France.
Dataset available at: https://drive.google.com/drive/folders/1MYiD7WssSh6X2y-cf8MALNOMMish4N5j

MTF is a multi-view stereoscopic video dataset, containing full-HD videos of real storytelling scenes, targeting QoE assessment for the analysis of visual artefacts that appear during automatically generated point-of-view transitions. The dataset features a large baseline of camera setups and can also be used in other computer vision applications, like video compression, 3D video content, VR environments and optical flow estimation.

Performance Evaluation of Objective Image Quality Metrics on Conventional and Learning-Based Compression Artifacts
Paper available at: https://ieeexplore.ieee.org/document/9465445
Testolina, M., Upenik, E., Ascenso, J., Pereira, F., and Ebrahimi, T.
Multimedia Signal Processing Group, École Polytechnique Fédérale de Lausanne (EPFL), Lausanne, Switzerland; Instituto Superior Técnico, Universidade de Lisboa – Instituto de Telecomunicações, Lisbon, Portugal.
Dataset available on request to the authors.

This dataset consists of a collection of compressed images, labelled according to subjective quality scores, targeting the evaluation of 14 objective quality metrics against the perceived human quality baseline.
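One common way to compare an objective metric against subjective scores is to compute a correlation coefficient between the two. The sketch below computes the Pearson correlation in plain Python; the summary above does not state which statistics the authors used, and the score values here are made up purely for illustration.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson linear correlation between two equally long score lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two lists of equal length with at least two items")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant scores")
    return cov / sqrt(var_x * var_y)


# Hypothetical subjective scores and objective metric values for five images.
subjective_scores = [1.0, 2.0, 3.0, 4.0, 5.0]
metric_values = [10.0, 20.0, 30.0, 40.0, 50.0]
r = pearson(subjective_scores, metric_values)
```

A value of r close to 1 would indicate that the metric tracks the perceived quality linearly, while a value close to 0 would indicate that it does not.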

The Effect of VR Gaming on Discomfort, Cybersickness, and Reaction Time
Paper available at: https://ieeexplore.ieee.org/document/9465470
Vlahovic, S., Suznjevic, M., Pavlin-Bernardic, N., and Skorin-Kapov, L.
Faculty of Electrical Engineering and Computing, University of Zagreb, Zagreb, Croatia; Faculty of Humanities and Social Sciences, University of Zagreb, Zagreb, Croatia.
Dataset available on request to the authors.

The authors present the results of a study conducted on 20 human users that measures the physiological and cognitive aftereffects of exposure to three different VR games with game mechanics centered around natural interactions. This work moves away from cybersickness as a primary measure of VR discomfort and analyzes other concepts like device-related discomfort, muscle fatigue and pain, as well as their correlations with game complexity.

Beyond Shopping: The Motivations and Experience of Live Stream Shopping Viewers
Paper available at: https://ieeexplore.ieee.org/document/9465387
Liu, X. and Kim, S. H.
Adelphi University.
Dataset available on request to the authors.

The authors propose a study of 286 live stream shopping users, where viewer motivations are examined according to the Uses and Gratifications Theory, seeking to identify motivations broken down into sixteen constructs organized under four larger constructs: entertainment, information, socialization, and experience.

MDRE at MMM 2021

The Multimedia Datasets for Repeatable Experimentation (MDRE) special session is part of the 2021 International Conference on Multimedia Modeling (MMM 2021). The session was organized by Cathal Gurrin (Dublin City University, Ireland), Duc-Tien Dang-Nguyen (University of Bergen, Norway), Björn Þór Jónsson (IT University of Copenhagen, Denmark) and Klaus Schoeffmann (Klagenfurt University, Austria). More details regarding this session can be found at: https://mmm2021.cz/special-session-mdre/

The MDRE’21 special session at MMM’21 is the third MDRE edition, and it represents an opportunity for interested researchers to submit their datasets to this track. The work submitted to MDRE is permanently available at http://mmdatasets.org, where all the current and past editions of MDRE are hosted. Along with the dataset itself, authors are asked to provide a paper describing the dataset’s motivation, design, and usage, a brief summary of the experiments performed on it to date, and a discussion of how it can be useful to the community.

MNR-Air: An Economic and Dynamic Crowdsourcing Mechanism to Collect Personal Lifelog and Surrounding Environment Dataset.
Paper available at: https://link.springer.com/chapter/10.1007%2F978-3-030-67835-7_18
Nguyen DH., Nguyen-Tai TL., Nguyen MT., Nguyen TB., Dao MS.
University of Information Technology, Ho Chi Minh City, Vietnam; University of Science, Ho Chi Minh City, Vietnam; Vietnam National University in Ho Chi Minh City, Ho Chi Minh City, Vietnam; National Institute of Information and Communications Technology, Koganei, Japan.
Dataset available on request to the authors.

The paper introduces an economical and dynamic crowdsourcing mechanism that can be used to collect personal lifelog associated events. The resulting dataset, MNR-HCM, represents data collected in Ho Chi Minh City, Vietnam, containing weather data, air pollution data, GPS data, lifelog images, and citizens’ cognition on a personal scale.

CatMeows: A Publicly-Available Dataset of Cat Vocalizations
Paper available at: https://link.springer.com/chapter/10.1007%2F978-3-030-67835-7_20
Ludovico L.A., Ntalampiras S., Presti G., Cannas S., Battini M., Mattiello S.
Department of Computer Science, University of Milan, Milan, Italy; Department of Veterinary Medicine, University of Milan, Milan, Italy; Department of Agricultural and Environmental Science, University of Milan, Milan, Italy.
Dataset available at: https://zenodo.org/record/4008297

The CatMeows dataset consists of vocalizations produced by 21 cats belonging to two breeds, namely Maine Coon and European Shorthair, that are emitted in three different contexts: brushing, isolation in an unfamiliar environment, and waiting for food. Recordings are performed with low-cost and easily available devices, thus creating a representative dataset for real-world scenarios.

HTAD: A Home-Tasks Activities Dataset with Wrist-accelerometer and Audio Features
Paper available at: https://link.springer.com/chapter/10.1007%2F978-3-030-67835-7_17
Garcia-Ceja, E., Thambawita, V., Hicks, S.A., Jha, D., Jakobsen, P., Hammer, H.L., Halvorsen, P., Riegler, M.A.
SINTEF Digital, Oslo, Norway; SimulaMet, Oslo, Norway; Oslo Metropolitan University, Oslo, Norway; UIT The Arctic University of Norway, Tromsø, Norway; Haukeland University Hospital, Bergen, Norway.
Dataset available at: https://datasets.simula.no/htad/

The HTAD dataset contains wrist-accelerometer and audio data collected during several normal day-to-day tasks, such as sweeping, brushing teeth, or watching TV. Being able to detect these types of activities is important for the creation of assistive applications and technologies that target elderly care and mental health monitoring.

Kvasir-Instrument: Diagnostic and Therapeutic Tool Segmentation Dataset in Gastrointestinal Endoscopy
Paper available at: https://link.springer.com/chapter/10.1007%2F978-3-030-67835-7_19
Jha, D., Ali, S., Emanuelsen, K., Hicks, S.A., Thambawita, V., Garcia-Ceja, E., Riegler, M.A., de Lange, T., Schmidt, P.T., Johansen, H.D., Johansen, D., Halvorsen, P.
SimulaMet, Oslo, Norway; UIT The Arctic University of Norway, Tromsø, Norway; Simula Research Laboratory, Oslo, Norway; Augere Medical AS, Oslo, Norway; Oslo Metropolitan University, Oslo, Norway; Medical Department, Sahlgrenska University Hospital-Mölndal, Gothenburg, Sweden; Department of Medical Research, Bærum Hospital, Gjettum, Norway; Karolinska University Hospital, Solna, Sweden; Department of Engineering Science, University of Oxford, Oxford, UK; Sintef Digital, Oslo, Norway.
Dataset available at: https://datasets.simula.no/kvasir-instrument/

The Kvasir-Instrument dataset consists of 590 annotated frames that contain gastrointestinal (GI) procedure tools such as snares, balloons, and biopsy forceps, and seeks to improve follow-up and the set of available information regarding the disease and the procedure itself, by providing baseline data for the tracking and analysis of the medical tools.

Keystroke Dynamics as Part of Lifelogging
Paper available at: https://link.springer.com/chapter/10.1007%2F978-3-030-67835-7_16
Smeaton, A.F., Krishnamurthy, N.G., Suryanarayana, A.H.
Insight Centre for Data Analytics, Dublin City University, Dublin, Ireland; School of Computing, Dublin City University, Dublin, Ireland.
Dataset available at: http://doras.dcu.ie/25133/

The authors created a dataset of longitudinal keystroke timing data that spans a period of up to seven months for four human participants. A detailed analysis of the data is performed, by examining the timing information associated with bigrams, or pairs of adjacently-typed alphabetic characters.
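To make the notion of bigram timing concrete, the following sketch computes the mean time gap for each pair of adjacently typed alphabetic characters. The input format (a list of character and timestamp pairs, in milliseconds) is an assumption made for this example and is not taken from the dataset itself.

```python
from collections import defaultdict
from statistics import mean


def bigram_latencies(events):
    """Map each alphabetic bigram to the mean time gap (ms) between its two key presses."""
    gaps = defaultdict(list)
    for (c1, t1), (c2, t2) in zip(events, events[1:]):
        # Only pairs of adjacent alphabetic characters form a bigram.
        if c1.isalpha() and c2.isalpha():
            gaps[(c1 + c2).lower()].append(t2 - t1)
    return {bigram: mean(values) for bigram, values in gaps.items()}


# Hypothetical key presses: (character, timestamp in milliseconds).
events = [("t", 0), ("h", 120), ("e", 210), (" ", 400), ("t", 520), ("h", 620)]
timings = bigram_latencies(events)
```

Here the bigram "th" is typed twice, with gaps of 120 ms and 100 ms, so its mean latency is 110 ms, while pairs that involve the space character are skipped.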

ODS at MMSys ’21

The traditional Open Dataset and Software Track (ODS) was a part of the 12th ACM Multimedia Systems Conference (MMSys ’21) organized as a hybrid event in Istanbul, Turkey, September 28 – October 1, 2021 (https://2021.acmmmsys.org/). The complete MMSys ’21: Proceedings of the 12th ACM Multimedia Systems Conference are available in the ACM Digital Library (https://dl.acm.org/doi/proceedings/10.1145/3458305).

The Session on Software, Tools and Datasets was chaired by Saba Ahsan (Nokia Technologies, Finland) and Luca De Cicco (Politecnico di Bari, Italy) on September 29, 2021, at 16:00 (UTC+3, Istanbul local time). The session opened with 1-slide/minute intros given by the authors and was then divided into individual virtual booths. Seven of the thirteen contributions were dataset papers. The paper titles, their abstracts, and the associated DOIs are listed below for your convenience.

Adaptive Streaming Playback Statistics Dataset
Paper available at: https://dl.acm.org/doi/10.1145/3458305.3478444
Teixeira, T., Zhang, B., Reznik, Y.
Brightcove Inc, USA
Dataset available at: https://github.com/brightcove/streaming-dataset

The authors propose a dataset that captures statistics from a number of real-world streaming events, utilizing different devices (TVs, desktops, mobiles, tablets, etc.) and networks (from 2.5G, 3G, and other early generation mobile networks to 5G and broadband). The captured data includes network and playback statistics, events and characteristics of the encoded stream.

PePa Ping Dataset: Comprehensive Contextualization of Periodic Passive Ping in Wireless Networks
Paper available at: https://dl.acm.org/doi/10.1145/3458305.3478456
Madariaga, D., Torrealba, L., Madariaga, J., Bustos-Jimenez, J., Bustos, B.
NIC Chile Research Labs, University of Chile
Dataset available at: https://github.com/niclabs/pepa-ping-mmsys21

The PePa Ping dataset consists of real-world data with a comprehensive contextualization of Internet QoS indicators, such as round-trip time, jitter, and packet loss. The authors developed a methodology for Android devices that obtains the information needed for these indicators directly from the Linux kernel, making the data an accurate representation of real-world conditions.

Full UHD 360-Degree Video Dataset and Modeling of Rate-Distortion Characteristics and Head Movement Navigation
Paper available at: https://dl.acm.org/doi/10.1145/3458305.3478447
Chakareski, J., Aksu, R., Swaminathan, V., Zink, M.
New Jersey Institute of Technology; University of Alabama; Adobe Research; University of Massachusetts Amherst, USA
Dataset available at: https://zenodo.org/record/5156999#.YQ1XMlNKjUI

The authors create a dataset of 360-degree videos that are used in analyzing the rate-distortion (R-D) characteristics of videos. These videos correspond to head movement navigation data in Virtual Reality (VR) and they may be used for analyzing how users explore panoramas around them in VR.

4DLFVD: A 4D Light Field Video Dataset
Paper available at: https://dl.acm.org/doi/10.1145/3458305.3478450
Hu, X., Wang, C., Pan, Y., Liu, Y., Wang, Y., Liu, Y., Zhang, L., Shirmohammadi, S.
University of Ottawa, Canada; Beijing University of Posts and Telecommunications, China
Dataset available at: https://dx.doi.org/10.21227/hz0t-8482

The authors propose a 4D Light Field (LF) video dataset that is collected via a custom-made camera matrix. The dataset is to be used for designing and testing methods for LF video coding, processing and streaming, providing more viewpoints and/or higher framerate compared with similar datasets from the current literature.

CWIPC-SXR: Point Cloud dynamic human dataset for Social XR
Paper available at: https://dl.acm.org/doi/10.1145/3458305.3478452
Reimat, I., Alexiou, E., Jansen, J., Viola, I., Subramanyam, S., Cesar, P.
Centrum Wiskunde & Informatica, Netherlands
Dataset available at: https://www.dis.cwi.nl/cwipc-sxr-dataset/

The CWIPC-SXR dataset is composed of 45 unique sequences that correspond to several use cases for humans interacting in social extended reality. The sequences are dynamic point clouds, which serve as a low-complexity representation in these types of systems.

HYPERAKTIV: An Activity Dataset from Patients with Attention-Deficit/Hyperactivity Disorder (ADHD)
Paper available at: https://dl.acm.org/doi/10.1145/3458305.3478454
Hicks, S. A., Stautland, A., Fasmer, O. B., Forland, W., Hammer, H. L., Halvorsen, P., Mjeldheim, K., Oedegaard, K. J., Osnes, B., Syrstad, V. E.G., Riegler, M. A.
SimulaMet; University of Bergen; Haukeland University Hospital; OsloMet, Norway
Dataset available at: http://datasets.simula.no/hyperaktiv/

The HYPERAKTIV dataset contains general patient information, along with health, activity, mental state, and heart rate data from patients with Attention-Deficit/Hyperactivity Disorder (ADHD). The dataset includes 51 patients with ADHD and 52 clinical control cases.

Datasets – Moving Target Selection with Delay
Paper available at: https://dl.acm.org/doi/10.1145/3458305.3478455
Liu, S. M., Claypool, M., Cockburn, A., Eg, R., Gutwin, C., Raaen, K.
Worcester Polytechnic Institute, USA; University of Canterbury, New Zealand; Kristiania University College, Norway; University of Saskatchewan, Canada
Dataset available at: https://web.cs.wpi.edu/~claypool/papers/selection-datasets/

The Selection datasets were created during four user studies on the effects of delay on video game actions and on the selection of a moving target with various pointing devices. The datasets include performance data, such as time to selection, and demographic data for the users, such as age and gaming experience.

ACM MM 2021

A large number of dataset-related papers were presented at the 29th ACM International Conference on Multimedia (MM ’21), organized as a hybrid event in Chengdu, China, October 20 – 24, 2021 (https://2021.acmmm.org/). The complete MM ’21: Proceedings of the 29th ACM International Conference on Multimedia are available in the ACM Digital Library (https://dl.acm.org/doi/proceedings/10.1145/3474085).

There was no dedicated Dataset session among the more than 35 sessions at the MM ’21 symposium. However, the importance of datasets is illustrated by how often the term “dataset” appears among the 542 accepted papers: in the title of 7 papers, the keywords of 66 papers, and the abstracts of 339 papers. As a small example, six selected papers focused primarily on new datasets are listed below. They cover social multimedia, emotional recognition, text-to-face synthesis, egocentric video analysis, emerging multimedia applications such as multimodal entity linking, and multimedia art, entertainment, and culture related to the perceived quality of video content.

FoodLogoDet-1500: A Dataset for Large-Scale Food Logo Detection via Multi-Scale Feature Decoupling Network
Paper available at: https://dl.acm.org/doi/10.1145/3474085.3475289
Hou, Q., Min, W., Wang, J., Hou, S., Zheng, Y., Jiang, S.
Shandong Normal University, Jinan, China; Institute of Computing Technology, Chinese Academy of Sciences, Beijing, China
Dataset available at: https://github.com/hq03/FoodLogoDet-1500-Dataset

The FoodLogoDet-1500 is a large-scale food logo dataset that has 1,500 categories, around 100,000 images and 150,000 manually annotated food logo objects. This type of dataset is important in self-service applications in shops and supermarkets, and copyright infringement detection for e-commerce websites.

Pairwise Emotional Relationship Recognition in Drama Videos: Dataset and Benchmark
Paper available at: https://dl.acm.org/doi/10.1145/3474085.3475493
Gao, X., Zhao, Y., Zhang, J., Cai, L.
Alibaba Group, Beijing, China
Dataset available on request to the authors.

The Emotional RelAtionship of inTeractiOn (ERATO) dataset is a large-scale multimodal dataset composed of over 30,000 interaction-centric video clips lasting around 203 hours. The videos are suited to studying the emotional relationship between the two interacting characters in each clip.

Multi-caption Text-to-Face Synthesis: Dataset and Algorithm
Paper available at: https://dl.acm.org/doi/abs/10.1145/3474085.3475391
Sun, J., Li, Q., Wang, W., Zhao, J., Sun, Z.
Center for Research on Intelligent Perception and Computing, NLPR, CASIA, Beijing, China; School of Artificial Intelligence, University of Chinese Academy of Sciences (UCAS), Beijing, China; Institute of North Electronic Equipment, Beijing, China
Dataset available on request to the authors.

The authors propose the CelebAText-HQ dataset, which addresses the text-to-face generation problem. Each image in the dataset is manually annotated with 10 captions, allowing proposed methods and algorithms to take multiple captions as input in order to generate highly semantically related face images.

Multimodal Entity Linking: A New Dataset and A Baseline
Paper available at: https://dl.acm.org/doi/10.1145/3474085.3475400
Gan, J., Luo, J., Wang, H., Wang, S., He, W., Huang, Q.
Key Laboratory of Intelligent Information Processing, Institute of Computing Technology, Chinese Academy of Sciences, China; School of Computer Science and Technology, University of Chinese Academy of Sciences, China; Baidu Inc.
Dataset available at: https://jingrug.github.io/research/M3EL

The authors propose the M3EL large-scale multimodal entity linking dataset, containing data associated with 1,100 movies. Reviews and images are collected, and textual and visual mentions are extracted and labelled with entities registered from Wikipedia.

Ego-Deliver: A Large-Scale Dataset for Egocentric Video Analysis
Paper available at: https://dl.acm.org/doi/10.1145/3474085.3475336
Qiu, H., He, P., Liu, S., Shao, W., Zhang, F., Wang, J., He, L., Wang, F.
East China Normal University, Shanghai, China; University of Florida, Florida, FL, United States; Alibaba Group, Shanghai, China
Dataset available at: https://egodeliver.github.io/EgoDeliver_Dataset/

The authors propose an egocentric video benchmarking dataset, consisting of videos recorded by takeaway riders doing their daily work. The dataset provides over 5,000 videos with more than 139,000 multi-track annotations and 45 different attributes, representing the first attempt at understanding the takeaway delivery process from an egocentric perspective.

PUGCQ: A Large Scale Dataset for Quality Assessment of Professional User-Generated Content
Paper available at: https://dl.acm.org/doi/10.1145/3474085.3475183
Li, G., Chen, B., Zhu, L., He, Q., Fan, H., Wang, S.
Kingsoft Cloud, Beijing, China; City University of Hong Kong, Hong Kong, Hong Kong
Dataset available at: https://github.com/wlkdb/pugcq_create

The PUGCQ dataset consists of 10,000 professional user-generated videos, annotated with a set of perceptual subjective ratings. In particular, during the subjective annotation and testing, human opinions are collected based upon not only MOS, but also attributes that may influence visual quality such as faces, noise, blur, brightness, and colour.
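
For readers unfamiliar with MOS (mean opinion score), it is the arithmetic mean of the individual subjective ratings given to an item. The sketch below illustrates that computation only; the video identifiers, the rating values, and the 1-to-5 scale are made up for the example and are not taken from PUGCQ.

```python
from statistics import mean


def mean_opinion_score(ratings):
    """Return the arithmetic mean of a list of subjective ratings."""
    if not ratings:
        raise ValueError("at least one rating is required")
    return mean(ratings)


def mos_per_video(ratings_by_video):
    """Map each (hypothetical) video id to its MOS."""
    return {vid: mean_opinion_score(r) for vid, r in ratings_by_video.items()}


# Made-up ratings on an assumed 1-5 scale
sample_ratings = {
    "video_a": [4, 5, 4, 3],
    "video_b": [2, 3, 2, 1],
}
print(mos_per_video(sample_ratings))
```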

ImageCLEF 2021

ImageCLEF is a multimedia evaluation campaign, part of the CLEF Initiative (http://www.clef-initiative.eu/). The 2021 edition (https://www.imageclef.org/2021) is the 19th edition of this initiative and addresses four main research tasks in several domains, such as medicine, nature, social media content, and user interface processing. ImageCLEF 2021 is organized by Bogdan Ionescu (University Politehnica of Bucharest, Romania), Henning Müller (University of Applied Sciences Western Switzerland, Sierre, Switzerland), Renaud Péteri (University of La Rochelle, France), Ivan Eggel (University of Applied Sciences Western Switzerland, Sierre, Switzerland) and Mihai Dogariu (University Politehnica of Bucharest, Romania).

ImageCLEFaware
Paper available at: https://arxiv.org/abs/2012.13180
Popescu, A., Deshayes-Chossar, J., Ionescu, B.
CEA LIST, France; University Politehnica of Bucharest, Romania.
Dataset available at: https://www.imageclef.org/2021/aware

This is the first edition of the aware task at ImageCLEF. It seeks to understand how public social media profiles affect users in certain important scenarios: searching or applying for a bank loan, an accommodation, a job as a waitress/waiter, and a job in IT.

ImageCLEFcoral
Paper available at: http://ceur-ws.org/Vol-2936/paper-88.pdf
Chamberlain, J., de Herrera, A. G. S., Campello, A., Clark, A., Oliver, T. A., Moustahfid, H.
University of Essex, UK; NOAA – Pacific Islands Fisheries Science Center, USA; NOAA/ US IOOS, USA; Wellcome Trust, UK.
Dataset available at: https://www.imageclef.org/2021/coral

The ImageCLEFcoral task, currently in its third edition, proposes a dataset and benchmarking task for the automatic segmentation and labelling of underwater images that can be combined for generating 3D models for monitoring coral reefs. The task itself is composed of two subtasks, namely coral reef image annotation and localisation, and coral reef image pixel-wise parsing.

ImageCLEFdrawnUI
Paper available at: http://ceur-ws.org/Vol-2936/paper-89.pdf
Fichou, D., Berari, R., Tăuteanu, A., Brie, P., Dogariu, M., Ștefan, L.D., Constantin, M.G., Ionescu, B.
teleportHQ, Cluj Napoca, Romania; University Politehnica of Bucharest, Romania.
Dataset available at: https://www.imageclef.org/2021/drawnui

The second edition of ImageCLEFdrawnUI addresses the issue of creating appealing web page interfaces by fostering systems that are capable of automatically generating a web page from a hand-drawn sketch. The task is separated into two subtasks, the wireframe subtask and the screenshots subtask.

ImageCLEF-VQAMed
Paper available at: http://ceur-ws.org/Vol-2936/paper-87.pdf
Abacha, A.B., Sarrouti, M., Demner-Fushman, D., Hasan, S.A., Müller, H.
National Library of Medicine, USA; CVS Health, USA; University of Applied Sciences Western Switzerland, Sierre, Switzerland.
Dataset available at: https://www.imageclef.org/2021/medical/vqa

This represents the fourth edition of the ImageCLEF Medical Visual Question Answering (VQAMed) task. This benchmark includes a task on Visual Question Answering (VQA), where participants are tasked with answering questions from the visual content of radiology images, and a second task on Visual Question Generation (VQG), consisting of generating relevant questions about radiology images.

ImageCLEFmed Caption
Paper available at: http://ceur-ws.org/Vol-2936/paper-111.pdf
Pelka, O., Abacha, A.B., de Herrera, A.G.S., Jacutprakart, J., Friedrich, C.M., Müller, H.
University of Applied Sciences and Arts Dortmund, Germany; National Library of Medicine, USA; University of Essex, UK; University of Applied Sciences Western Switzerland, Sierre, Switzerland.
Dataset available at: https://www.imageclef.org/2021/medical/caption

This is the fifth edition of the ImageCLEF Medical Concepts and Captioning task. The objective is to extract UMLS-concept annotations and/or captions from the image data that are then compared against the original text captions of the images.

ImageCLEFmed Tuberculosis
Paper available at: http://ceur-ws.org/Vol-2936/paper-90.pdf
Kozlovski, S., Liauchuk, V., Cid, Y.D., Kovalev, V., Müller, H.
Institute for Informatics, Minsk, Belarus; University of Warwick, Coventry, England, UK; University of Applied Sciences Western Switzerland, Sierre, Switzerland.
Dataset available at: https://www.imageclef.org/2021/medical/tuberculosis