MediaEval Multimedia Challenges
MediaEval is a benchmarking initiative that offers challenges in multimedia retrieval, analysis and exploration. The tasks offered by MediaEval concentrate specifically on the human and social aspects of multimedia. They encourage researchers to bring together multiple modalities (visual, text, audio) and to think in terms of systems that serve users. Our larger aim is to promote reproducible research that makes multimedia a positive force for society. In order to provide an impression of the topical scope of MediaEval, we describe a few examples of typical tasks.
Historically, MediaEval tasks have often involved social media analysis. One of the first tasks offered by MediaEval, called the “Placing” Task, focused on the geo-location of social multimedia. This task ran from 2010-2016 and studied the challenge of automatically predicting the location at which an image has been taken. Over the years, the task investigated the benefits of combining text and image features, and also explored the challenges involved with geo-location prediction of video.
The “Placing” Task gave rise to two daughter tasks, which are focused on the societal impact of technology that can automatically predict the geo-location of multimedia shared online. One is Flood-related Multimedia, which challenges researchers to extract information related to flooding disasters from social media posts (combining text and images). The other is Pixel Privacy, which allows researchers to explore ways in which adversarial images can be used to protect sensitive information from being automatically extracted from images shared online.
MediaEval has also offered a number of tasks that focus on how media content is received by users. The interest of MediaEval in the emotional impact of music is currently continued by the Emotion and Theme Recognition in Music Task. Also, the Predicting Media Memorability Task explores the aspects of video that are memorable to users.
Recently, MediaEval has widened its focus to include multimedia analysis in systems. The Sports Video Annotation Task works towards improving sports training systems and the Medico Task focuses on multimedia analysis for more effective and efficient medical diagnosis.
Recent years have seen the rise of the use of sensor data in MediaEval. The No-audio Multimodal Speech Detection Task uses a unique data set captured by people wearing sensors and having conversations in a social setting. In addition to the sensor data, the movement of the speakers is captured by an overhead camera. The challenge is to detect the moments at which the people are speaking without making use of audio recordings.
The Insight for Wellbeing Task uses a data set of lifelog images, sensor data and tags captured by people walking through a city wearing sensors and using smartphones. The challenge is to relate the data that is captured to the local pollution conditions.
MediaEval 10th Anniversary Workshop
Each year, MediaEval holds a workshop that brings researchers together to share their findings, discuss, and plan next year’s tasks. The 2019 workshop marked the 10th anniversary of MediaEval, which became an independent benchmark in 2010. The MediaEval 2019 Workshop was hosted by EURECOM in Sophia Antipolis, France. The workshop took place 27-29 October 2020, right after ACM Multimedia 2019, in Nice, France.
The MediaEval 2019 Workshop is grateful to SIGMM for their support. This support contributed to helping ten students to attend the workshop, across a variety of tasks and also made it possible to record all of the workshop talks. We also gratefully acknowledge the Multimedia Computing Group at Delft University of Technology and EURECOM.
Links to MediaEval 2019 tasks, videos and slides are available on the MediaEval 2019 homepage http://multimediaeval.org/mediaeval2019/. The link to the 2019 proceedings can be found there as well.
MediaEval has compiled a bibliography of papers that have been published using MediaEval data sets. This list includes not only MediaEval workshop papers, but also papers published at other workshops, conferences, and in journals. In total, around 750 papers have been written that use MediaEval data, and this number continues to grow. Check out the bibliography at https://multimediaeval.github.io/bib.
The Medieval in MediaEval
A long-standing tradition in MediaEval is to incorporate some aspect of medieval history into the social event of the workshop. This tradition is a wordplay on our name (“mediaeval” is an older spelling of “medieval”). Through the years the medieval connection has served to provide a local context for the workshop and has strengthened the bond among participants. At the MediaEval 2019 Workshop, we offered the chance to take a nature walk to the medieval town of Biot.
The walking participants and the participants taking the bus convened on the “Place des Arcades” in the medieval town of Biot, where we enjoyed a dinner together under historic arches.
MediaEval has just announced the task line-up for 2020. Registration will open in July 2020 and the runs will be due at the end of October 2020. The workshop will be held in December, with dates to be announced.
This year, the MediaEval workshop will be fully online. Since the MediaEval 2017 in Dublin, MediaEval has offered the possibility for remote workshop participation. Holding the workshop online this year is a natural extension of this trend, and we hope that researchers around the globe will take advantage of the opportunity to participate.
We are happy to introduce the new website: https://multimediaeval.github.io/. More information will be posted there as the season moves forward.
The day-to-day operations of MediaEval are handled by the MediaEval logistics committee, which grows stronger with each passing year. The authors of this article are logistics committee members from 2019.
This column describes the release of the Toulouse Campus Surveillance Dataset (ToCaDa). It consists of 25 synchronized videos (with audio) of two scenes recorded from different viewpoints of the campus. An extensive manual annotation comprises all moving objects and their corresponding bounding boxes, as well as audio events. The annotation was performed in order to i) enhance audiovisual objects that can be visible, audible or both, according to each recording location, and ii) uniquely identify all objects in each of the two scenes. All videos have been «anonymized». The dataset is available for download here.
The increasing number of recording devices, such as smartphones, has led to an exponential production of audiovisual documents. These documents may correspond to the same scene, for instance an outdoor event filmed from different points of view. Such multi-view scenes contain a lot of information and provide new opportunities for answering high-level automatic queries.
In essence, these documents are multimodal, and their audio and video streams contain different levels of information. For example, the source of a sound may either be visible or not according to the different points of view. This information can be used separately or jointly to achieve different tasks, such as synchronising documents or following the displacement of a person. The analysis of these multi-view field recordings further allows understanding of complex scenarios. The automation of these tasks faces a need for data, as well as a need for the formalisation of multi-source retrieval and multimodal queries. As also stated by Lefter et al., “problems with automatically processing multimodal data start already from the annotation level” . The complexity of the interactions between modalities forced the authors to produce three different types of annotations: audio, video, and multimodal.
In surveillance applications, humans and vehicles are the most important common elements studied. In consequence, detecting and matching a person or a car that appears in several videos is a key problem. Although many algorithms have been introduced, a major relative problem still is how to precisely evaluate and to compare these algorithms in reference to a common ground truth. Datasets are required for evaluating multi-view based methods.
During the last decade, public datasets have become more and more available, helping with the evaluation and comparison of algorithms, and in doing so, contributing to improvements in human and vehicle detection and tracking. However, most of the datasets focus on a specific task and do not support the evaluation of approaches that mix multiple sources of information. Only few datasets provide synchronized videos with overlapping fields of view. Yet, these rarely provide more than 4 different views even though more and more approaches could benefit from having additional views available. Moreover, soundtracks are almost never provided despite being a rich source of information, as voices and motor noises can help to recognize, respectively, a person or a car.
Notable multi-view datasets are the following.
- The 3D People Surveillance Dataset (3DPeS)  comprises 8 cameras with disjoint views and 200 different people. Each person appears, on average, in 2 views. More than 600 video sequences are available. Thus, it is well-suited for people re-identification. Cameras parameters are provided, as well as a coarse 3D reconstruction of the surveilled environment.
- The Video Image Retrieval and Analysis Tool (VIRAT)  dataset provides a large amount of surveillance videos with a high pixel resolution. In this dataset, 16 scenes were recorded for hours although in the end only 25 hours with significant activities were kept. Moreover, only two pairs of videos present overlapping fields of view. Moving objects were annotated by workers with bounding boxes, as well as some buildings or areas. Three types of events were also annotated, namely (i) single person events, (ii) person and vehicle events, and (iii) person and facility events, leading to 23 classes of events. Most actions were performed by people with minimal scripted actions, resulting in realistic scenarios with frequent incidental movers and occlusions.
- Purely action-oriented datasets can be found in the Multicamera Human Action Video (MuHAVi)  dataset, in which 14 actors perform 17 different action classes (such as “kick”, “punch”, “gunshot collapse”) while 8 cameras capture the indoor scene. Likewise, Human3.6M  contains videos where 11 actors perform 15 different classes of actions while being filmed by 4 digital cameras; its specificity lies in the fact that 1 time-of-flight sensor and 10 motion cameras were also used to estimate and to provide the 3DT pose of the actors on each frame. Both background subtraction and bounding boxes are provided at each frame. In total, more than 3.6M frames are available. In these two datasets, actions are performed in unrealistic conditions as the actors follow a script consisting of actions that are performed one after the other.
In the table below a comparison is shown between the aforementioned datasets, which are contrasted with the new ToCaDa dataset we recently introduced and describe in more detail below.
|Properties||3DPeS ||VIRAT ||MuHAVi ||Human3.6M ||ToCaDa |
|# Cameras||8 static||16 static||8 static||4 static||25 static|
|Overlapping FOV||Very partially||2+2||8||4||17|
|Pixel resolution||704 x 576||1920 x 1080||720 x 576||1000 x 1000||Mostly 1920 x 1080|
|# Visual objects||200||Hundreds||14||11||30|
|# Action types||0||23||17||15||0|
|# Bounding boxes||0||≈ 1 object/second||0||≈ 1 object/frame||≈ 1 object/second|
As a large multi-view, multimodal, and realistic video collection does not yet exist, we therefore took the initiative to produce such a dataset. The ToCaDa dataset  comprises 25 synchronized videos (including soundtrack) of the same scene recorded from multiple viewpoints. The dataset follows two detailed scenarios consisting of comings and goings of people, cars and motorbikes, with both overlapping and non-overlapping fields of view (see Figures 1-2). This dataset aims at paving the way for multidisciplinary approaches and applications such as 4D-scene reconstruction, object re-identification/tracking and multi-source metadata modeling and querying.
About 20 actors were asked to follow two realistic scenarios by performing scripted actions, like driving a car, walking, entering or leaving a building, or holding an item in hand while being filmed. In addition to ordinary actions, some suspicious behaviors are present. More precisely:
- In the first scenario, a suspect car (C) with two men inside (D the driver and P the passenger) arrives and parks in front of the main building (within the sights of the cameras with overlapping views). P gets out of the car C and enters the building. Two minutes later, P leaves the building holding a package and gets in C. C leaves the parking (see Figure 3) and gets away from the university campus (passing in front of some of the disjoint fields of view cameras). Other vehicles and persons regularly move in different cameras with no suspicious behavior.
- In the second scenario, a suspect car (C) with two men inside (D the driver and P the passenger) arrives and parks badly along the road. P gets out of the car and enters the building. Meanwhile, a women W knocks on the car window to ask the driver D to park correctly, but he drives off immediately. A few minutes later, P leaves the building with a package and seems confused as the car is missing. He then runs away. In the end, in one of the disjoint-view cameras, we can see him waiting until C picks him up.
The 25 camera holders we enlisted used their own mobile devices to record the scene, leading to a large variety of resolutions, image quality, frame rates and video duration. Three foghorns were blown in order to coordinate this heterogeneous disposal:
- The first one stands for a warning 20 seconds before the start, to give enough time to start shooting.
- The second one is the actual starting time, used to temporally synchronize the videos.
- The third one indicates the ending time.
All the videos were collected and were manually synchronized using the second and the third foghorn blows as starting and ending times. Indeed, the second one can be heard at the beginning of every video.
A special annotation procedure was set to handle the audiovisual content of this multi-view data . Audio and video parts of each document were first separately annotated, after which a fusion of these modalities was realized.
The ground truth annotations are stored in json files. Each file corresponds to a video and shares the same title but not the same extension, namely <video_name>.mp4 annotations are stored in <video_name>.json. Both visual and audio annotations are stored together in the same file.
By annotating, our goal is to detect the visual objects and the salient sound events and, when possible, to associate them. Thus, we have grouped them into the generic term audio-visual object. This way, the appearance of a vehicle and its motor sound will constitute a single coherent audio-visual object and is associated with the same ID. An object that can be seen but cannot be heard is also an audio-visual object but with only a visual component, and similarly for an object that can only be heard. An example is given in Listing 1.
To help with the annotation process, we developed a program for navigating through the frames of the synchronized videos and for identifying audio-visual objects by drawing bounding boxes in particular frames and/or specifying starting and ending times of salient sound. Bounding boxes were drawn around every moving object with a flag indicating whether the object was fully visible or occluded, specifying its category (human or vehicle), providing visual details (for example clothes types or colors), and timestamps of its apparitions and disappearances. Audio events were also annotated by a category and two timestamps.
Regarding bounding boxes, the coordinates of top-left and bottom-right corners of the bounding boxes are given. Bounding boxes were drawn such that the object is fully contained within the box and as tight as possible. For this purpose, our annotation tool allows the user to draw an initial approximate bounding box and then to adjust its boundaries at a pixel-level.
As drawing one bounding box for each object on every frame requires a huge amount of time, we have drawn bounding boxes on a subset of frames, so that the intermediate bounding boxes of an object can be linearly interpolated using its previous and next drawn bounding boxes. On average, we have drawn one bounding box per second for humans and two for vehicles due to their speed variation. For objects with irregular speed or trajectory, we have drawn more bounding boxes.
Regarding the audio component of an audio-visual object, namely the salient sound events, an audio category (voice, motor sound) is given in addition to its ID, as well as a list of details and time bounds (see Listing 2).
Finally, we linked the audio to the video objects, by giving the same ID to the audio object in case of causal identification, which means that the acoustic source of the audio event is the object (a car or a person for instance) that was annotated. This step was particularly crucial, and could not be automatized, as a complex expertise is required to identify the sound sources. For example, in the video sequence illustrated in Figure 4, a motor sound is audible and seems to come from the car whereas it actually comes from a motorbike behind the camera.
In case of an object presenting different sound categories (a car with door slams, music and motor sound for example), one object is created for each category and the same ID is given.
Ethical and Legal
According to the European legislation, it is forbidden to make images publicly available of people who might be recognized or of license plates. As people and license plates are visible in our videos, to conform to the General Data Protection Regulation (GDPR) we decided to:
- Ask actors to sign an authorization for publishing their image, and
- Apply post treatment on videos to blur faces of other people and any license plates.
We have introduced a new dataset composed of two sets of 25 synchronized videos of the same scene with 17 overlapping views and 8 disjoint views. Videos are provided with their associated soundtracks. We have annotated the videos by manually drawing bounding boxes on moving objects. We have also manually annotated audio events. Our dataset offers simultaneously a large number of both overlapping and disjoint synchronized views and a realistic environment. It also provides audio tracks with sound events, high pixel resolution and ground truth annotations.
The originality and the richness of this dataset come from the wide diversity of topics it covers and the presence of scripted and non-scripted actions and events. Therefore, our dataset is well suited for numerous pattern recognition applications related to, but not restricted to, the domain of surveillance. We describe below, some multidisciplinary applications that could be evaluated using this dataset:
3D and 4D reconstruction: The multiple cameras sharing overlapping fields of view along with some provided photographs of the scene allow performing a 3D reconstruction of the static parts of the scene and to retrieve intrinsic parameters and poses of the cameras using a Structure-from-Motion algorithm. Beyond a 3D reconstruction, the temporal synchronization of the videos could enable to render dynamic parts of the scene as well and to obtain a 4D reconstruction.
Object recognition and consistent labeling: Evaluation of algorithms for human and vehicle detection and consistent labeling across multiple views can be performed using the annotated bounding boxes and IDs. To this end, overlapping views provide a 3D environment that could help to infer the label of an object in a video knowing its position and label in another video.
Sound event recognition: The audio events recorded from different locations and manually annotated provide opportunities to evaluate the relevance of consistent acoustic models by, for example, launching the identification and indexing of a specific sound event. Looking for a particular sound by similarity is also feasible.
Metadata modeling and querying: The multiple layers of information of this dataset, both low-level (audio/video signal) and high-level (semantic data available in the ground truth files) enable handling of information at different resolutions of space and time, allowing to perform queries on heterogeneous information.
 I. Lefter, L.J.M. Rothkrantz, G. Burghouts, Z. Yang, P. Wiggers. “Addressing multimodality in overt aggression detection”, in Proceedings of the International Conference on Text, Speech and Dialogue, 2011, pp. 25-32.
 D. Baltieri, R. Vezzani, R. Cucchiara. “3DPeS: 3D people dataset for surveillance and forensics”, in Proceedings of the 2011 joint ACM workshop on Human Gesture and Behavior Understanding, 2011, pp. 59-64.
 S. Oh, A. Hoogs, A. Perera, N. Cuntoor, C. Chen, J.T. Lee, S. Mukherjee, J.K. Aggarwal, H. Lee, L. Davis, E. Swears, X. Wang, Q. Ji, K. Reddy, M. Shah, C. Vondrick, H. Pirsiavash, D. Ramanan, J. Yuen, A. Torralba, B. Song, A. Fong, A. Roy-Chowdhury, M. Desai. “A large-scale benchmark dataset for event recognition in surveillance video”, in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2011, pp. 3153-3160.
 S. Singh, S.A. Velastin, H. Ragheb. “MuHAVi: A multicamera human action video dataset for the evaluation of action recognition methods”, in Proceedings of the 7th IEEE International Conference on Advanced Video and Signal Based Surveillance, 2010, pp. 48-55.
 C. Ionescu, D. Papava, V. Olaru, C. Sminchisescu. “Human3.6M: Large scale datasets and predictive methods for 3d human sensing in natural environments”, IEEE transactions on Pattern Analysis and Machine Intelligence, 36(7), 2013, pp. 1325-1339.
 T. Malon, G. Roman-Jimenez, P. Guyot, S. Chambon, V. Charvillat, A. Crouzil, A. Péninou, J. Pinquier, F. Sèdes, C. Sénac. “Toulouse campus surveillance dataset: scenarios, soundtracks, synchronized videos with overlapping and disjoint views”, in Proceedings of the 9th ACM Multimedia Systems Conference. 2018, pp. 393-398.
 P. Guyot, T. Malon, G. Roman-Jimenez, S. Chambon, V. Charvillat, A. Crouzil, A. Péninou, J. Pinquier, F. Sèdes, C. Sénac. “Audiovisual annotation procedure for multi-view field recordings”, in Proceedings of the International Conference on Multimedia Modeling, 2019, pp. 399-410.
Information retrieval and multimedia content access have a long history of comparative evaluation, and many of the advances in the area over the past decade can be attributed to the availability of open datasets that support comparative and repeatable experimentation. Sharing data and code to allow other researchers to replicate research results is needed in the multimedia modeling field, as it helps to improve the performance of systems and the reproducibility of published papers.
This report summarizes the special session on Multimedia Datasets for Repeatable Experimentation (MDRE 2019), which was organized at the 25th International Conference on MultiMedia Modeling (MMM 2019), which was held in January 2019 in Thessaloniki, Greece.
The intent of these special sessions is to be a venue for releasing datasets to the multimedia community and discussing dataset related issues. The presentation mode in 2019 was to have short presentations (8 minutes) with some questions, and an additional panel discussion after all the presentations, which was moderated by Björn Þór Jónsson. In the following we summarize the special session, including its talks, questions, and discussions.
The special session presenters: Luca Rossetto, Cathal Gurrin and Minh-Son Dao.
A Test Collection for Interactive Lifelog Retrieval
The session started with a presentation about A Test Collection for Interactive Lifelog Retrieval , given by Cathal Gurrin from Dublin City University (Ireland). In their work, the authors introduced a new test collection for interactive lifelog retrieval, which consists of multi-modal data from 27 days, comprising nearly 42 thousand images and other personal data (health and activity data; more specifically, heart rate, galvanic skin response, calorie burn, steps, blood pressure, blood glucose levels, human activity, and diet log). The authors argued that, although other lifelog datasets already exist, their dataset is unique in terms of the multi-modal character, and has a reasonable and easily manageable size of 27 consecutive days. Hence, it can also be used for interactive search and provides newcomers with an easy entry into the field. The published dataset has already been used for the Lifelog Search Challenge (LSC)  in 2018, which is an annual competition run at the ACM International Conference on Multimedia Retrieval (ICMR).
The discussion about this work started with a question about the plans for the dataset and whether it should be extended over the years, e.g. to increase the challenge of participating in the LSC. However, the problem with public lifelog datasets is the fact that there is a conflict between releasing more content and safeguarding privacy. There is a strong need to anonymize the contained images (e.g. blurring faces and license plates), where the rules and requirements of the EU GDPR regulations make this especially important. However, anonymizing content unfortunately is a very slow process. An alternative to removing and/or masking actual content from the dataset for privacy reasons would be to create artificial datasets (e.g. containing public images or only faces from people who consent to publish), but this would likely also be a non-trivial task. One interesting aspect could be the use of Generative Adversarial Networks (GANs) for the anonymization of faces, for instance by replacing all faces appearing in the content with generated faces learned from a small group of people who gave their consent. Another way to preemptively mitigate the privacy issues could be to wear conspicuous ‘lifelogging stickers’ during recording to make people aware of the presence of the camera, which would give them the possibility to object to being filmed or to avoid being captured altogether.
SEPHLA: Challenges and Opportunities Within Environment-Personal Health Archives
The second presentation was given by Minh-Son Dao from the National Institute of Information and Communications Technology (NICT) in Japan about SEPHLA: Challenges and Opportunities Within Environment-Personal Health Archives . This is a dataset that aims at combining the conditions of the environment with health-related aspects (e.g., pollution or weather data with cardio-respiratory or psychophysiological data). The creation of the dataset was motivated by the fact that people in larger cities in Japan very often do not want to go out (e.g., for some sports activities), because they are very concerned about pollution, i.e., health conditions. So it would be beneficial to have a map of the city with assigned pollution ratings, or a system that allows to perform related queries. Their dataset contains sensor data collected on routes by a few dozen volunteer people over seven days in Fukuoka, Japan. More particularly, they collected data about the location, O3, NO2, PM2.5 (particulates), temperature, and humidity in combination with heart rate, motion behavior (from 3-axis accelerometer), relaxation level, and other personal perception data from questionnaires.
This dataset has also been used for multimedia benchmark challenges, such as the Lifelogging for Wellbeing task at MediaEval. In order to define the ground truth, volunteers were presented with specific use cases and annotation rules, and were asked to collaboratively annotate the dataset. The collected data (the feelings of participants at different locations) was also visualized using an interactive map. Although the dataset may have some inconsistent annotations, it is easy to filter them out since labels of corresponding annotators and annotator groups are contained in the dataset as well.
V3C – a Research Video Collection
The third presentation was given by Luca Rossetto from the University of Basel (Switzerland) about V3C – a Research Video Collection . This is a large-scale dataset for multimedia retrieval, consisting of nearly 30,000 videos with an overall duration of about 3,800 hours. Although many other video datasets are available already (e.g., IACC.3 , or YFCC100M ), the V3C dataset is unique in the aspects of timeliness (more recent content than many other datasets and therefore more representative content for current ‘videos in the wild’) and diversity (represents many different genres or use cases), while also having no copyright restrictions (all contained videos were labelled with a Creative Commons license by their uploaders). The videos have been collected from the video sharing platform Vimeo (hence the name ‘Vimeo Creative Commons Collection’ or V3C in short) and represent video data currently used on video sharing platforms. The dataset comes together with a master shot-boundary detection ground truth, as well as keyframes and additional metadata. It is partitioned into three major parts (V3C1, V3C2, and V3C3) to make it more manageable, and it will be used by the TRECVID and the Video Browser Showdown (VBS) evaluation campaigns for several years. Although the dataset was not specifically built for retrieval, it is suitable for any use case that requires a larger video dataset.
The shot-boundary detection used to provide the master-shot reference for the V3C dataset was implemented using Cineast, which is an open source software available for download. It divides every frame into a 3×3 grid and computes color histograms for all 9 areas, which are then concatenated into a ‘regional color histogram’ feature vector that is compared between all adjacent frames. This seems to work very well for hard cuts and gradual transitions, although for grayscale content (and flashlights etc.) it is not very stable. The additional metadata provided with the dataset includes information about resolution, frame rate, uploading user and the upload date, as well as any semantic information provided by the uploader (title, description, tags, etc.).
Athens Urban Soundscape (ATHUS): A Dataset for Urban Soundscape Quality Recognition
Originally a fourth presentation was scheduled about Athens Urban Soundscape (ATHUS): A Dataset for Urban Soundscape Quality Recognition , but unfortunately no author was on site to give the presentation. This dataset contains audio samples with a duration of 30 seconds (as well as extracted features and ground truth) from a metropolitan city (Athens, Greece), that have been recorded during a period of about four years by 10 different persons with the aim to provide a collection about city sounds. The metadata includes geospatial coordinates, timestamp, rating, and tagging of the sound by the recording person. The authors demonstrated in a baseline evaluation that their dataset allows to predict the soundscape quality in the city with about 42% accuracy.
After the presentations, Björn Þór Jónsson moderated a panel discussion in which all presenters participated.
The panel started with a discussion on the size of datasets, whether the only way to make challenges more difficult is to keep increasing the dataset, or whether there are alternatives to this. Although this heavily depends on the research question one would like to solve, it was generally agreed that there is a definite need for evaluation with large datasets, because for small datasets some problems are trivial. Moreover, too small datasets often introduce some kind of content bias, so that they do not fully reflect the practical situation.
For now, it seems there is no real alternative to using larger datasets although it is clear that this will introduce additional challenges/hurdles for data management and data processing. All presenters (and the audience too) agreed that introducing larger datasets will also necessitate the need for closer collaboration with other research communities―with fields like data science, data management/engineering, and distributed and high-performance computing―in order to manage the higher data load.
However, even though we need larger datasets, we might not be ready yet to really go for true large-scale. For example, the V3C dataset is still far away from a true web-scale video search dataset; it originally was intended to be even bigger, but there were concerns from the TRECVID and VBS communities about the manageability. Datasets that are too large would set the entrance barrier for newcomers so high that an evaluation benchmark may not attract enough participants―a problem that could possibly disappear in a few years (as hardware becomes cheaper and faster/larger), but still needs to be addressed from an organizational viewpoint.
There were notes from the audience that instead of focusing on size alone, we should also consider the problem we want to solve. It appears many researchers use datasets for use cases for which they were not designed and are not suited to. Instead of blindly going for larger size, datasets could be kept small and simple for solving essential research questions, for example by truly optimizing them to the problem to solve; different evaluations would then use different datasets. However, this would lead to a considerable dataset fragmentation and necessitate the need for combining several datasets for broader/larger evaluation tasks, which has been shown to be quite challenging in the past. For example, there are already a lot of health datasets available, and it would be interesting to take benefit from them, but the workload for the integration into competitions is often too high in practice.
Another issue that should be addressed more intensively by the research community is to figure out the situation for personal datasets that are compliant with GDPR regulations, since currently nobody really knows how to deal with this.
The session was organized by the authors of the report, in collaboration with Duc-Tien Dang-Nguyen (Dublin City University), Michael Riegler (Center for Digitalisation and Engineering & University of Oslo), and Luca Piras (University of Cagliari). The panel format of the special session made the discussions much more lively and interactive than that of a traditional technical session. We would like to thank the presenters and their co-authors for their excellent contributions, as well as the members of the audience who contributed greatly to the session.
 Gurrin, C., Schoeffmann, K., Joho, H., Munzer, B., Albatal, R., Hopfgartner, F., … & Dang-Nguyen, D. T. (2019, January). A test collection for interactive lifelog retrieval. In International Conference on Multimedia Modeling (pp. 312-324). Springer, Cham.
 Sato, T., Dao, M. S., Kuribayashi, K., & Zettsu, K. (2019, January). SEPHLA: Challenges and Opportunities Within Environment-Personal Health Archives. In International Conference on Multimedia Modeling (pp. 325-337). Springer, Cham.
 Rossetto, L., Schuldt, H., Awad, G., & Butt, A. A. (2019, January). V3C–A Research Video Collection. In International Conference on Multimedia Modeling (pp. 349-360). Springer, Cham.
 Giannakopoulos, T., Orfanidi, M., & Perantonis, S. (2019, January). Athens Urban Soundscape (ATHUS): A Dataset for Urban Soundscape Quality Recognition. In International Conference on Multimedia Modeling (pp. 338-348). Springer, Cham.
 Dang-Nguyen, D. T., Schoeffmann, K., & Hurst, W. (2018, June). LSE2018 Panel-Challenges of Lifelog Search and Access. In Proceedings of the 2018 ACM Workshop on The Lifelog Search Challenge (pp. 1-2). ACM.
 Awad, G., Butt, A., Curtis, K., Lee, Y., Fiscus, J., Godil, A., … & Kraaij, W. (2018, November). Trecvid 2018: Benchmarking video activity detection, video captioning and matching, video storytelling linking and video search.
 Lokoč, J., Kovalčík, G., Münzer, B., Schöffmann, K., Bailer, W., Gasser, R., … & Barthel, K. U. (2019). Interactive search or sequential browsing? a detailed analysis of the video browser showdown 2018. ACM Transactions on Multimedia Computing, Communications, and Applications (TOMM), 15(1), 29.
 Kalkowski, S., Schulze, C., Dengel, A., & Borth, D. (2015, October). Real-time analysis and visualization of the YFCC100M dataset. In Proceedings of the 2015 Workshop on Community-Organized Multimodal Mining: Opportunities for Novel Solutions(pp. 25-30). ACM.
Online disinformation is a problem that has been attracting increased interest by researchers worldwide as the breadth and magnitude of its impact is progressively manifested and documented in a number of studies (Boididou et al., 2014; Zhou & Zafarani, 2018; Zubiaga et al., 2018). This emerging area of research is inherently multidisciplinary and there have been numerous treatments of the subject, each having a distinct perspective or theme, ranging from the predominant perspectives of media, journalism and communications (Wardle & Derakhshan, 2017) and political science (Allcott & Gentzkow, 2017) to those of network science (Lazer et al., 2018), natural language processing (Rubin et al., 2015) and signal processing, including media forensics (Zampoglou et al., 2017). Given the multimodal nature of the problem, it is no surprise that the multimedia community has taken a strong interest in the field.
From a multimedia perspective, two research problems have attracted the bulk of researchers’ attention: a) detection of content tampering and content fabrication, and b) detection of content misuse for disinformation. The first was traditionally studied within the field of media forensics (Rocha et al, 2011), but has recently been under the spotlight as a result of the rise of deepfake videos (Güera & Delp, 2018), i.e. a special class of generative models that are capable of synthesizing highly convincing media content from scratch or based on some authentic seed content. The second problem has focused on the problem of multimedia misuse or misappropriation, i.e. the use of media content out of its original context with the goal of spreading misinformation or false narratives (Tandoc et al., 2018).
Developing automated approaches to detect media-based disinformation is relying to a great extent on the availability of relevant datasets, both for training supervised learning models and for evaluating their effectiveness. Yet, developing and releasing such datasets is a challenge in itself for a number of reasons:
- Identifying, curating, understanding, and annotating cases of media-based misinformation is a very effort-intensive task. More often than not, the annotation process requires careful and extensive reading of pertinent news coverage from a variety of sources similar to the journalistic practice of verification (Brandtzaeg et al., 2016).
- Media-based disinformation is largely manifested in social media platforms and relevant datasets are therefore hard to collect and distribute due to the temporary nature of social media content and the numerous technical restrictions and challenges involved in collecting content (mostly due to limitations or complete lack of appropriate support by the respective APIs), as well as the legal and ethical issues in releasing social media-based datasets (due to the need to comply with the respective Terms of Service and any applicable data protection law).
In this column, we present two multimedia datasets that could be of value to researchers who study media-based disinformation and develop automated approaches to tackle the problem. The first, called Fake Video Corpus (Papadopoulou et al., 2019) is a manually curated collection of 200 debunked and 180 verified videos, along with relevant annotations, accompanied by a set of 5,193 near-duplicate instances of them that were posted on popular social media platforms. The second, called FIVR-200K (Kordopatis-Zilos et al., 2019), is an automatically collected dataset of 225,960 videos, a list of 100 video queries and manually verified annotations regarding the relation (if any) of the dataset videos to each of the queries (i.e. near-duplicate, complementary scene, same incident).
For each of the two datasets, we present the design and creation process, focusing on issues and questions regarding the relevance of the collected content, the technical means of collection, and the process of annotation, which had the dual goal of ensuring high accuracy and keeping the manual annotation cost manageable. Given that each dataset is accompanied by a detailed journal article, in this column we only limit our description to high-level information, emphasizing the utility and creation process in each case, rather than on detailed statistics, which are disclosed in the respective papers.
Following the presentation of the two datasets, we then proceed to a critical discussion, highlighting their limitations and some caveats, and delineating future steps towards high quality dataset creation for the field of multimedia-based misinformation.
The complexity and challenge of the multimedia verification problem has led to the creation of numerous datasets and benchmarking efforts, each designed specifically for a particular task within this area. We can broadly classify these efforts in three areas: a) multimedia forensics, b) multimedia retrieval, and c) multimedia post classification. Datasets that are focused on the text modality, e.g. Fake News Challenge, Clickbait Challenge, Hyperpartisan News Detection, RumourEval (Derczynski et al 2017), etc. are beyond the scope of this post and are hence not included in this discussion.
Multimedia forensics: Generating high-quality multimedia forensics datasets has always been a challenge, since creating convincing forgeries is normally a manual task requiring a fair amount of skill, and as a result such datasets have generally been few and limited in scale. With respect to image splicing, our own survey (Zampoglou et al, 2017) listed a number of datasets that had been made available by this point, including our own Wild Web tampered image dataset, which consists of real-world forgeries that have been collected from the Web, including multiple near-duplicates, making it a large and particularly challenging collection. Recently, the Realistic Tampering Dataset (Korus et al,2017) was proposed, offering a large number of convincing forgeries for evaluation. On the other hand, copy-move image forgeries pose a different problem that requires specially designed datasets. Three such commonly used datasets are those produced by MICC (Amerini et al, 2011), the Image Manipulation Dataset by (Christlein et al, 2012), and CoMoFoD (Tralic et al, 2013). These datasets are still actively used in research.
With respect to video tampering, there has been relative scarcity in high-quality large-scale datasets, which is understandable given the difficulty of creating convincing forgeries. The recently proposed Multimedia Forensics Challenge datasets include some large-scale sets of tampered images and videos for the evaluation of forensics algorithms. Finally, there has recently been increased interest towards the automatic detection of forgeries made with the assistance of particular software, and specifically face-swapping software. As the quality of produced face-swaps is constantly improving, detecting face-swaps is an important emerging verification task. The FaceForensics++ dataset (Rössler et al, 2019) is a very-large scale dataset containing face-swapped videos (and untampered face videos) from a number of different algorithms, aimed for the evaluation of face-swap detection algorithms.
Multimedia retrieval: Several cases of multimedia verification can be considered to be an instance of a near-duplicate retrieval task, in which the query video (video to be verified) is run against a database of past cases/videos to check whether it has already appeared before. The most popular and publicly-available dataset for near-duplicate video retrieval is arguably the CC_WEB_VIDEO dataset (Wu et al., 2007). This consists of 12,790 user-generated videos collected from popular video sharing websites (YouTube, Google Video, and Yahoo! Video). It is organized in 24 query sets, for each of which the most popular video was selected to serve as query, and the rest of the videos were manually annotated based on their duplicity to the query. Another relevant dataset is VCDB (Jiang et al., 2014), which was compiled and annotated as a benchmark for the partial video copy detection problem and is composed of videos from popular video platforms (YouTube and Metacafe). VCDB contains two subsets of videos: a) the core, which consists of 28 discrete sets of videos with a total of 528 videos with over 9,000 pairs of manually annotated partial copies, and b) the distractors, which consists of 100,000 videos with the purpose to make the video copy detection problem more challenging.
Multimedia post classification: A benchmark task under the name “Verifying Multimedia Use” (Boididou et al., 2015; Boididou et al., 2016) was organized and took place in the context of MediaEval 2015 and 2016 respectively. The task made a dataset available of 15,629 tweets containing images and videos, each of which made a false or factual claim with respect to the shared image/video. The released tweets were posted in the context of breaking news events (e.g. Hurricane Sandy, Boston Marathon bombings) or hoaxes.
Video Verification Datasets
The Fake Video Corpus (FVC)
The Fake Video Corpus (Papadopoulou et al., 2018) is a collection of 380 user-generated videos and 5,193 near-duplicate versions of them, all collected from three online video platforms: YouTube, Facebook, and Twitter. The videos are annotated either as “verified” (“real”) or as “debunked” (“fake”) depending on whether the information they convey is accurate or misleading. Verified videos are typically user-generated takes of newsworthy events, while debunked videos include various types of misinformation, including staged content posing as UGC, real content taken out of context, or modified/tampered content (see Figure 1 for examples). The near-duplicates of each video are arranged in temporally ordered “cascades”, and each near-duplicate video is annotated with respect to its relation to the first video of the cascade (e.g. whether it is reinforcing or debunking the original claim). The FVC is the first, to our knowledge, large-scale dataset of debunked and verified user-generated videos (UGVs). The dataset contains different kinds of metadata for its videos, including channel (user) information, video information, and community reactions (number of likes, shares and comments) at the time of their inclusion.
The initial set of 380 videos were collected and annotated using various sources including the Context Aggregation and Analysis (CAA) service developed within the InVID project and fact-checking sites such as Snopes. To build the dataset, all videos submitted to the CAA service between November 2017 and January 2018 were collected in an initial pool of approximately 1600 videos, which were then manually inspected and filtered. The remaining videos were annotated as “verified” or “debunked” using established third party sources (news articles or blog posts), leading to the final pool of 180 verified and 200 fake unique videos. Then, keyword-based search was run on the three platforms, and near-duplicate video detection was used to identify the video duplicates within the returned results. More specifically, for each of the 380 videos, its title was reformulated in a more general form, and translated into four major languages: Russian, Arabic, French, and German. The original title, the general form and the translations were submitted as queries to YouTube, Facebook, and Twitter. Then, the near-duplicate retrieval algorithm of Kordopatis-Zilos etal (2017) was used on the resulting pool, and the results were manually inspected to remove erroneous matches.
The purpose of the dataset is twofold: i) to be used for the analysis of the dissemination patterns of real and fake user-generated videos (by analyzing the traits of the near-duplicate video cascades), and ii) to serve as a benchmark for the evaluation of automated video verification methods. The relatively large size of the dataset is important for both of these tasks. With respect to the study of dissemination patterns, the dataset provides the opportunity to study the dissemination of the same or similar content by analyzing associations between videos not provided by the original platform APIs, combined with the wealth of associated metadata. In parallel, having a collection of 5,573 annotated “verified” or “debunked” videos- even if many are near-duplicate versions of the 380 cases – can be used for the evaluation (or even training) of verification systems, either based on visual content or the associated video metadata.
The Fine-grained Incident Video Retrieval Dataset (FIVR-200K)
The FIVR-200K dataset (Kordopatis-Zilos et al., 2019) consists of 225,960 videos associated with 4,687 Wikipedia events and 100 selected video queries (see Figure 2 for examples). It has been designed to simulate the problem of Fine-grained Incident Video Retrieval (FIVR). The objective of this problem is: given a query video, retrieve all associated videos considering several types of associations with respect to an incident of interest. FIVR contains several retrieval tasks as special cases under a single framework. In particular, we consider three types of association between videos: a) Duplicate Scene Videos (DSV), which share at least one scene (originating from the same camera) regardless of any applied transformation, b) Complementary Scene Videos (CSV), which contain part of the same spatiotemporal segment, but captured from different viewpoints, and c) Incident Scene Videos (ISV), which capture the same incident, i.e. they are spatially and temporally close, but have no overlap.
For the collection of the dataset, we first crawled Wikipedia’s Current Event page to collect a large number of major news events that occurred between 2013 and 2017 (five years). Each news event is accompanied with a topic, headline, text, date, and hyperlinks. To collect videos of the same category, we retained only news events with topic “Armed conflicts and attacks” or “Disasters and accidents”. This ultimately led to a total of 4,687 events after filtering. To gather videos around these events and build a large collection with numerous video pairs that are associated through the relations of interest (DSV, CSV and ISV), we queried the public YouTube API with the event headlines. To ensure that the collected videos capture the corresponding event, we retained only the videos published within a timespan of one week from the event date. This process resulted in the collection of 225,960 videos.
Next, we proceeded with the selection of query videos. We set up an automated filtering and ranking process that implemented the following criteria: a) query videos should be relatively short and ideally focus on a single scene, b) queries should have many near-duplicates or same-incident videos within the dataset that are published by many different uploaders, c) among a set of near-duplicate/same-instance videos, the one that was uploaded first should be selected as query. This selection process was implemented based on a graph-based clustering approach and resulted in the selection of 635 query videos, of which we used the top 100 (ranked by corresponding cluster size) as the final query set.
For the annotation of similarity relations among videos, we followed a multi-step process, in which we presented annotators with the results of a similarity-based video retrieval system and asked them to indicate the type of relation through a drop-down list of the following labels: a) Near-Duplicate (ND), a special case where the whole video is near-duplicate to the query video, b) Duplicate Scene (DS), where only some scenes in the candidate video are near-duplicates of scenes in the query video, c) Complementary Scenes (CS), d) Incident Scene (IS), and e) Distractors (DI), i.e. irrelevant videos.
To make sure that annotators were presented with as many potentially relevant videos as possible, we used visual-only, text-only and hybrid similarity in turn. As a result, each annotator reviewed video candidates that had very high similarity with the query video in terms either of their visual content, or text metadata (title and description) or the combination of similarities. Once an initial set of annotations were produced by two independent annotators, the annotators went twice again through the annotations two ensure consistency and accuracy.
FIVR-200K was designed to serve as a benchmark that poses real-world challenges for the problem of reverse video search. Given a query video to be verified, the analyst would want to know whether the same or a very similar version of it has already been published. In that way, the user would be able to easily debunk cases of out-of-context video use (i.e. misappropriation) and on the other hand, if several videos are found that depict the same scene from different viewpoints at approximately the same time, then they could be considered to corroborate the video of interest.
Discussion: Limitations and Caveats
We are confident that the two video verification datasets presented in this column can be valuable resources for researchers interested in the problem of media-based disinformation and could serve both as training sets and as benchmarks for automated video verification methods. Yet, both of them suffer from certain limitations and care should be taken when using them to draw conclusions.
A first potential issue has to do with the video selection bias arising from the particular way that each of the two datasets was created. The videos of the Fake Video Corpus were selected in a mixed manner trying to include a number of cases that were known to the dataset creators and their collaborators, and was also enriched by a pool of test videos that were submitted for analysis to a publicly available video verification service. As a result, it is likely to be more focused on viral and popular videos. Also, videos were included, for which debunking or corroborating information was found online, which introduces yet another source of bias, potentially towards cases that were more newsworthy or clear cut. In the case of the FIVR-200K dataset, videos were intentionally collected to be between two categories of newsworthy events with the goal of ending up with a relatively homogeneous collection, which would be challenging in terms of content-based retrieval. This means that certain types of content, such as political events, sports and entertainment, are very limited or not present at all in the dataset.
A question that is related to the selection bias of the above datasets pertains to their relevance for multimedia verification and for real-world applications. In particular, it is not clear whether the video cases offered by the Fake Video Corpus are representative of actual verification tasks that journalists and news editors face in their daily work. Another important question is whether these datasets offer a realistic challenge to automatic multimedia analysis approaches. In the case of FIVR-200K, it was clearly demonstrated (Kordopatis-Zilos et al., 2019) that the dataset is a much harder benchmark for near-duplicate detection methods compared to previous datasets such as CC_WEB_VIDEO and VCDB. Even so, we cannot safely conclude that a method, which performs very well in FIVR-200K, would perform equally well in a dataset of much larger scale (e.g. millions or even billions of videos).
Another issue that affects the access to these datasets and the reproducibility of experimental results relates to the ephemeral nature of online video content. A considerable (and increasing) part of these video collections is taken down (either by their own creators or from the video platform), which makes it impossible for researchers to gain access to the exact video set that was originally collected. To give a better sense of the problem, 21% of the Fake Video Corpus and 11% of the FIVR-200K videos were not available online on September 2019. This issue, which affects all datasets that are based on online multimedia content, raises the more general question of whether there are steps that can be taken by online platforms such as YouTube, Facebook and Twitter that could facilitate the reproducibility of social media research without violating copyright legislation or the platforms’ terms of service.
The ephemeral nature of online content is not the only factor that renders the value of multimedia datasets very sensitive to the passing of time. Especially in the case of online disinformation, there seems to be an arms’ race, where new machine learning methods constantly get better in detecting misleading or tampered content, but at the same time new types of misinformation emerge, which are increasingly AI-assisted. This is particularly profound in the case of deepfakes, where the main research paradigm is based on the concept of competition between a generator (adversary) and a detector (Goodfellow et al., 2014).
Last but not least, one may always be concerned about the potential ethical issues arising when publicly releasing such datasets. In our case, reasonable concerns for privacy risks, which are always relevant when dealing with social media content, are addressed by complying with the relevant Terms of Service of the source platforms and by making sure that any annotation (label) assigned to the dataset videos is accurate. Additional ethical issues pertain to the potential “dual use” of the dataset, i.e. their use by adversaries to craft better tools and techniques to make misinformation campaigns more effective. A recent pertinent case was OpenAI’s delayed release of their very powerful GPT-2 model, which sparked numerous discussions and criticism, and making clear that there is no commonly accepted practice for ensuring reproducibility of research results (and empowering future research) and at the same time making sure that risks of misuse are eliminated.
Given the challenges of creating and releasing a large-scale dataset for multimedia verification, the main conclusions from our efforts towards this direction so far are the following:
- The field of multimedia verification is in constant motion and therefore the concept of a static dataset may not be sufficient to capture the real-world nuances and latest challenges of the problem. Instead new benchmarking models, e.g. in the form of open data challenges, and resources, e.g. constantly updated repository of “fake” multimedia, appear to be more effective for empowering future research in the area.
- The role of social media and multimedia sharing platforms (incl. YouTube, Facebook, Twitter, etc.) seems to be crucial in enabling effective collaboration between academia and industry towards addressing the real-world consequences of online misinformation. While there have been recent developments towards this direction, including the announcements by both Facebook and Alphabet’s Jigsaw of new deepfake datasets, there is also doubt and scepticism about the degree of openness and transparency that such platforms are ready to offer, given the conflicts of interest that are inherent in the underlying business model.
- Building a dataset that is fit for a highly diverse and representative set of verification cases appears to be a task that would require a community effort instead of effort from a single organisation or group. This would not only help towards distributing the massive dataset creation cost and effort to multiple stakeholders, but also towards ensuring less selection bias, richer and more accurate annotation and more solid governance.
Allcott, H., Gentzkow, M., “Social media and fake news in the 2016 election”, Journal of economic perspectives, 31(2), pp. 211–36, 2017.
Amerini, I, Ballan, L., Caldelli, R., Del Bimbo, A., Serra, G., “A SIFT-based forensic method for copy-move attack detection and transformation recovery”, IEEE Transactions on Information Forensics and Security, 6(3), pp. 1099–1110,2011.
Boididou, C., Papadopoulos, S., Kompatsiaris, Y., Schifferes, S., Newman, N., “Challenges of computational verification in social multimedia”, In Proceedings of the 23rd ACM International Conference on World Wide Web, pp. 743–748,2014.
Boididou, C., Andreadou, K., Papadopoulos, S., Dang-Nguyen, D.T., Boato, G., Riegler, M., Kompatsiaris, Y., “Verifying multimedia use at MediaEval 2015”. In Proceedings of MediaEval 2015, 2015.
Boididou C., Papadopoulos S., Dang-Nguyen D., Boato G., Riegler M., Middleton S.E., Petlund A., Kompatsiaris Y., “Verifying multimedia use at MediaEval 2016”. In Proceedings of MediaEval 2016, 2016.
Brandtzaeg, P.B., Lüders, M., Spangenberg, J., Rath-Wiggins, L., Følstad, A., “Emerging journalistic verification practices concerning social media”. Journalism Practice, 10(3), pp. 323–342, 2016.
Christlein V., Riess C., Jordan J., Riess C., Angelopoulou, E., “An evaluation of popular copy-move forgery detection approaches”. IEEE Transactions on Information Forensics & Security, 7(6), pp. 1841–1854, 2012.
Derczynski, L., Bontcheva, K., Liakata, M., Procter, R., Hoi, G.W.S., Zubiaga, A., “Semeval-2017 Task 8: Rumoureval: determining rumour veracity and support for rumours”, Proceedings of the 11th International Workshop on Semantic Evaluation,pp. 69-76, 2017.
Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Bengio, Y., “Generative adversarial nets”. In Advances in Neural Information Processing Systems, pp. 2672–2680, 2014.
Guan, H., Kozak, M., Robertson, E., Lee, Y., Yates, A.N., Delgado, A., Zhou, D., Kheyrkhah, T., Smith, J., Fiscus, J., “MFC datasets: Large-scale benchmark datasets for media forensic challenge evaluation”, In Proceedings of the 2019 IEEEWinter Applications of Computer Vision Workshops, pp. 63–72, 2019.
Güera, D., Delp, E.J., “Deepfake video detection using recurrent neural networks”, In Proceedings of the 15th IEEE International Conference on Advanced Video and Signal Based Surveillance, pp. 1–6, 2018.
Jiang, Y. G., Jiang, Y., Wang, J., “VCDB: A large-scale database for partial copy detection in videos”. In Proceedings of the European Conference on Computer Vision, pp. 357–371, 2014.
Kiesel, J., Mestre, M., Shukla, R., Vincent, E., Adineh, P., Corney, D., Stein, B. Potthast, M., “Semeval-2019 Task 4: Hyperpartisan news detection”. In Proceedings of the 13th International Workshop on Semantic Evaluation, pp. 829–839,2019.
Kordopatis-Zilos, G., Papadopoulos, S., Patras, I., Kompatsiaris, I., “FIVR: Fine-grained incident video retrieval”. IEEE Transactions on Multimedia, 21(10), pp. 2638–2652, 2019.
Korus, P., Huang, J., “Multi-scale analysis strategies in PRNU-based tampering localization”, IEEE Transactions on Information Forensics & Security, 21(4), pp. 809–824, 2017.
Lazer, D.M., Baum, M.A., Benkler, Y., Berinsky, A.J., Greenhill, K.M., Menczer, F., Schudson, M., “The science of fake news”, Science, 359(6380), pp. 1094–1096, 2018.
Papadopoulou, O., Zampoglou, M., Papadopoulos, S., Kompatsiaris, I., “A corpus of debunked and verified user-generated videos”. Online Information Review, 43(1), pp. 72–88, 2019.
Rocha, A., Scheirer, W., Boult, T., Goldenstein, S., “Vision of the unseen: Current trends and challenges in digital image and video forensics”, ACM Computing Surveys, 43(4), art. 26, 2011.
Rössler, A. Cozzolino, D., Verdoliva, L., Riess, C., Thies, J., Nießner, M. “Faceforensics++: Learning to detect manipulated facial images”, In Proceedings of the IEEE International Conference on Computer Vision, 2019.
Rubin, V.L., Chen, Y., Conroy, N.J., “Deception detection for news: Three types of fakes”, In Proceedings of the 78th ASIS&T Annual Meeting: Information Science with Impact: Research in and for the Community, art. 83, 2015.
Tandoc Jr, E.C., Lim, Z.W., Ling, R. “Defining “fake news”: A typology of scholarly definitions”, Digital journalism, 6(2), pp. 137–153, 2018.
Tralic, D., Zupancic I., Grgic S., Grgic M., “CoMoFoD – New database for copy-move forgery detection”. In Proceedings of the 55th International Symposium on Electronics in Marine, pp. 49–54, 2013.
Wardle, C., Derakhshan, H., “Information disorder: Toward an interdisciplinary framework for research and policy making”, Council of Europe Report, 27, 2017.
Wu, X., Hauptmann, A.G., Ngo, C.-W., “Practical elimination of near-duplicates from web video search”, In Proceedings of the 15th ACM International Conference on Multimedia, pp. 218–227, 2007.
Zampoglou, M., Papadopoulos, S., Kompatsiaris, Y., “Detecting image splicing in the wild (web)”, In Proceedings of the 2015 IEEE International Conference on Multimedia & Expo Workshops, 2015.
Zampoglou, M., Papadopoulos, S., Kompatsiaris, Y., “Large-scale evaluation of splicing localization algorithms for web images”, Multimedia Tools and Applications, 76(4), pp. 4801–4834, 2017.
Zhou, X., Zafarani, R., “Fake news: A survey of research, detection methods, and opportunities”. arXiv preprint arXiv:1812.00315, 2018.
Zubiaga, A., Aker, A., Bontcheva, K., Liakata, M., Procter, R., “Detection and resolution of rumours in social media: A survey”, ACM Computing Surveys, 51(2), art. 32, 2018.
Appendix A: Examples of videos in the Fake Video Corpus.
Appendix B: Examples of videos in the Fine-grained Incident Video Retrieval dataset.
Query video from the Boston Marathon bombing in April 15, 2013.
Query video from the the Las Vegas shooting in October 1, 2017.
In order to download the video dataset as well as its provided analysis data, please follow the instructions described here:
Standardized datasets are of vital importance in multimedia research, as they form the basis for reproducible experiments and evaluations. In the area of video retrieval, widely used datasets such as the IACC , which has formed the basis for the TRECVID Ad-Hoc Video Search Task and other retrieval-related challenges, have started to show their age. For example, IACC is no longer representative of video content as it is found in the wild . This is illustrated by the figures below, showing the distribution of video age and duration across various datasets in comparison with a sample drawn from Vimeo and Youtube.
Its recently released spiritual successor, the Vimeo Creative Commons Collection (V3C) , aims to remedy this discrepancy by offering a collection of freely reusable content sourced from the video hosting platform Vimeo (https://vimeo.com). The figures below show the age and duration distributions of the Vimeo sample from  in comparison with the properties of the V3C.
The V3C is comprised of three shards, consisting of 1000h, 1200h and 1500h of video content respectively. It consists not only of the original videos themselves, but also comes with video shot-boundary annotations, as well as representative key-frames and thumbnail images for every such video shot. In addition, all the technical and semantic video metadata that was available on Vimeo is provided as well. The V3C has already been used in the 2019 edition of the Video Browser Showdown  and will also be used for the TRECVID AVS Tasks (https://www-nlpir.nist.gov/projects/tv2019/) starting 2019 with a plan for future usage in the coming several years. This video provides an overview of the type of content found within the dataset
Dataset & Collections
The three shards of V3C (V3C1, V3C2, and V3C3) contain Creative Commons videos sourced from video hosting platform Vimeo. For this reason, the elements of the dataset may be freely used and publicly shared. The following table presents the composition of the dataset and the characteristics of its shards, as well as the information on the dataset as a whole.
|File Size (videos)||1.3TB||1.6TB||1.8TB||4.8TB|
|File Size (total)||2.4TB||3.0TB||3.3TB||8.7TB|
|Number of Videos||7’475||9’760||11’215||28’450|
|Mean Video Duration||8 minutes,
|Number of Segments||1’082’659||1’425’454||1’635’580||4’143’693|
Similarly to IACC, V3C contains a master shot reference, which segments every video into non-overlapping shots based on the visual content of the videos. For every single shot, a representative keyframe is included, as well as the thumbnail version of that keyframe. Furthermore, for each video, identified by a unique ID, a metadata file is available that contains both technical as well as semantic information, such as the categories. Vimeo categorizes every video into categories and subcategories. Some of the categories were determined to be non-relevant for visual based multimedia retrieval and analytical tasks, and were dropped during the sourcing process of V3C. For simplicity reasons, subcategories were generalized into their parent categories and are, for this reason, not included. The remaining Vimeo categories are:
- Arts & Design
- Cameras & Techniques
- Reporting & Journals
Ground Truth and Analysis Data
As described above, the ground truth of the dataset consists of (deliberately over-segmented) shot boundaries as well as keyframes. Additionally, for the first shard of the V3C, the V3C1, we have already performed several analyses of the video content and metadata in order to provide an overview of the dataset .
In particular, we have analyzed specific content characteristics of the dataset, such as:
- Bitrate distribution of the videos
- Resolution distribution of the videos
- Duration of shots
- Dominant color of the keyframes
- Similarity of the keyframes in terms of color layout, edge histogram, and deep features (weights extracted from the last fully-connected layer of GoogLeNet).
- Confidence range distribution of the best class for shots detected by NasNet (using the best result out of the 1000 ImageNet classes)
- Number of different classes for a video detected by NasNet (using the best result out of the 1000 ImageNet classes)
- Number of shots/keyframes for a specific content class
- Number of shots/keyframes for a specific number of detected faces
This additional analysis data is available via GitHub, so that other researchers can take advantage of it. For example, one could use a specific subset of the dataset (only shots with blue keyframes, only videos with a specific bitrate or resolution, etc.) for performing further evaluations (e.g., for multimedia streaming, video coding, but also for image and video retrieval, of course). Additionally, due the public dataset and the analysis data, one could easily create an image and video retrieval system and use it either for participation in competitions like the Video Browser Showdown , or for submitting other evaluation runs (TRECVID Ad-hoc Video Search Task).
In the broad field of multimedia retrieval and analytics, one of the key components of research is having useful and appropriate datasets in place to evaluate multimedia systems’ performance and benchmark their quality. The usage of standard and open datasets enables researchers to reproduce analytical experiments based on these datasets and thus validate their results. In this context, the V3C dataset proves to be very diverse in several useful aspects (upload time, visual concepts, resolutions, colors, etc.). Also it has no dominating characteristics and provides a low self-similarity (i.e., few near duplicates) .
Further, the richness of V3C in terms of content diversity and content attributes enables benchmarking multimedia systems in close-to-reality test environments. In contrast to other video datasets (cf. YouTube-8M  and IACC ), V3C also provides a vast number of different video encodings and bitrates per second, so that it enables research focusing on video retrieval and analytical tasks regarding those attributes. The large number of different video resolutions (and to a lesser extent frame-rates) makes this dataset interesting for video transport and storage applications such as the development of novel encoding schemes, streaming mechanisms or error-correction techniques. Finally, in contrast to many current datasets, V3C also provides support for creating queries for evaluation competitions, such as VBS and TRECVID .
 Fabian Berns, Luca Rossetto, Klaus Schoeffmann, Christian Beecks, and George Awad. 2019. V3C1 Dataset: An Evaluation of Content Characteristics. In Proceedings of the 2019 on International Conference on Multimedia Retrieval (ICMR ’19). ACM, New York, NY, USA, 334-338.
 Jakub Lokoč, Gregor Kovalčík, Bernd Münzer, Klaus Schöffmann, Werner Bailer, Ralph Gasser, Stefanos Vrochidis, Phuong Anh Nguyen, Sitapa Rujikietgumjorn, and Kai Uwe Barthel. 2019. Interactive Search or Sequential Browsing? A Detailed Analysis of the Video Browser Showdown 2018. ACM Trans. Multimedia Comput. Commun. Appl. 15, 1, Article 29 (February 2019), 18 pages.
 Rossetto, L., Schuldt, H., Awad, G., & Butt, A. A. (2019). V3C–A Research Video Collection. In International Conference on Multimedia Modeling (pp. 349-360). Springer, Cham.
 Abu-El-Haija, S., Kothari, N., Lee, J., Natsev, P., Toderici, G., Varadarajan, B., & Vijayanarasimhan, S. (2016). Youtube-8m: A large-scale video classification benchmark. arXiv preprint arXiv:1609.08675.
 Paul Over, George Awad, Alan F. Smeaton, Colum Foley, and James Lanagan. 2009. Creating a web-scale video collection for research. In Proceedings of the 1st workshop on Web-scale multimedia corpus (WSMC ’09). ACM, New York, NY, USA, 25-32.
 Smeaton, A. F., Over, P., and Kraaij, W. 2006. Evaluation campaigns and TRECVid. In Proceedings of the 8th ACM International Workshop on Multimedia Information Retrieval (Santa Barbara, California, USA, October 26 – 27, 2006). MIR ’06. ACM Press, New York, NY, 321-330.
 Luca Rossetto & Heiko Schuldt (2017). Web video in numbers-an analysis of web-video metadata. arXiv preprint arXiv:1707.01340.
The first months of the new calendar year, multimedia researchers traditionally are hard at work on their ACM Multimedia submissions. (This year the submission deadline is 1 April.) Questions of reproducibility, including those of data set availability and release, are at the forefront of everyone’s mind. In this edition of SIGMM Records, the editors of the “Data Sets and Benchmarks” column have teamed up with two intersecting groups, the Reproducibility Chairs and the General Chairs of ACM Multimedia 2019, to bring you a column about reproducibility in multimedia research and the connection between reproducible research and publicly available data sets. The column highlights the activities of SIGMM towards implementing ACM paper badging. ACM MMSys has pushed our community forward on reproducibility and pioneered the use of ACM badging . We are proud that in 2019 the newly established Reproducibility track will introduce badging at ACM Multimedia.
Complete information on Reproducibility at ACM Multimedia is available at: https://project.inria.fr/acmmmreproducibility/
The importance of reproducibility
Researchers intuitively understand the importance of reproducibility. Too often, however, it is explained superficially, with statements such as, “If you don’t pay attention to reproducibility, your paper will be rejected”. The essence of the matter lies deeper: reproducibility is important because of its role in making scientific progress possible.
What is this role exactly? The reason that we do research is to contribute to the totality of knowledge at the disposal of humankind. If we think of this knowledge as a building, i.e. a sort of edifice, the role of reproducibility is to provide the strength and stability that makes it possible to build continually upwards. Without reproducibility, there would simply be no way of creating new knowledge.
ACM provides a helpful characterization of reproducibility: “An experimental result is not fully established unless it can be independently reproduced” . In short, a result that is obtainable only once is not actually a result.
Reproducibility and scientific rigor are often mentioned in the same breath. Rigorous research provides systematic and sufficient evidence for its contributions. For example, in an experimental paper, the experiments must be properly designed and the conclusions of the paper must be directly supported by the experimental findings. Rigor involves careful analysis, interpretation, and reporting of the research results. Attention to reproducibility can be considered a part of rigor.
When we commit ourselves to reproducible research, we also commit ourselves to making sure that the research community has what it needs to reproduce our work. This means releasing the data that we use, and also releasing implementations of our algorithms. Devoting time and effort to reproducible research is an important way in which we support Open Science, the movement to make research resources and research results openly accessible to society.
Repeatability vs. Replicability vs. Reproducibility
We frequently use the word “reproducibility” in an informal way that includes three individual concepts, which actually have distinct formal uses: “repeatability”, “replicability” and “reproducibility”. Again, we can turn to ACM for definitions . All three concepts express the idea that research results must be invariant with respect to changes in the conditions under which they were obtained.
Specifically, “repeatability” means that the same research team can achieve the same result using the same setup and resources. “Replicability” means that that team can pass the setup and resources to a different research team, and that that team can also achieve the same result. “Reproducibility” (here, used in the formal sense) means that a different team can achieve the same result using a different setup and different resources. Note the connection to scientific rigor: obtaining the same result multiple times via a process that lacks rigor is meaningless.
When we write a research paper paying attention to reproducibility, it means that we are confident we would obtain the same results again within our own research team, that the paper includes a detailed description of how we achieved the result (and is accompanied by code or other resources), and that we are convinced that other researchers would reach the same conclusions using a comparable, but not identical, set up and resources.
Reproducibility at ACM Multimedia 2019
ACM Multimedia 2019 promotes reproducibility in two ways: First, as usual, reproducibility is one of the review criteria considered by the reviewers (https://www.acmmm.org/2019/reviewer-guidelines/). It is critical that authors describe their approach clearly and completely, and do not omit any details of their implementation or evaluation. Authors should release their data and also provide experimental results on publicly available data. Finally, increasingly, we are seeing authors who include a link to their code or other resources associated with the paper. Releasing resources should be considered a best practice.
The second way that ACM Multimedia 2019 promotes reproducibility is the new Reproducibility Track. Full information is available on the ACM Multimedia Reproducibility website . The purpose of the track is to ensure that authors receive recognition for the effort they have dedicated to making their research reproducible, and also to assign ACM badges to their papers. Next, we summarize the concept of ACM badges, then we will return to discuss the Reproducibility Track in more detail.
ACM Paper badging
Here, we provide a short summary of the information on badging available on the ACM website at . ACM introduced a system of badges in order to help push forward the processes by which papers are reviewed. The goal is to move the attention given to reproducibility to a new level, beyond the level achieved during traditional reviews. Badges seek to motivate authors to use practices leading to better replicability, with the idea that replicability will in turn lead to reproducibility.
In order to understand the badge system, it is helpful to know that ACM badges are divided into two categories. “Artifacts Evaluated” and “Results Evaluated”. ACM defines artifacts as digital objects that are created for the purpose of, or as a result of, carrying out research. Artifacts include implementation code as well as scripts used to run experiments, analyze results, or generate plots. Critically, they also include the data sets that were used in the experiment. The different “Artifacts Evaluated” badges reflect the level of care that authors put into making the artifacts available including how far do they go beyond the minimal functionality necessary and how well are the artifacts are documented.
There are two “Results Evaluated” badges. The “Results Replicated” badge, which results from a replicability review, and a “Results Reproduced” badge, which results from a full reproducibility review, in which the referees have succeeded in reproducing the results of the paper with only the descriptions of the authors, and without any of the authors’ artifacts. ACM Multimedia adopts the ACM idea that replicability leads to full reproducibility, and for this reason choses to focus in its first year on the “Results replicated” badge. Next we turn to a discussion of the ACM Multimedia 2019 Reproducibility Track and how it implements the “Results Replicated” badge.
Badging ACM MM 2019
Authors of main-conference papers appearing at ACM Multimedia 2018 or 2017 are eligible to make a submission to the Reproducibility Track of ACM Multimedia 2019. The submission has two components: An archive containing the resources needed to replicate the paper, and a short companion paper that contains a description of the experiments that were carried out in the original paper and implemented in the archive. The submissions undergo a formal reproducibility review, and submissions that pass receive a “Results Replicated” badge, which is added to the original paper in the ACM Digital Library. The companion paper appears in the proceedings of ACM Multimedia 2019 (also with a badge) and is presented at the conference as a poster.
ACM defines the badges, but the choice of which badges to award, and how to implement the review process that leads to the badge, is left to the individual conferences. The consequence is that the design and implementation of the ACM Multimedia Reproducibility Track requires a number of important decisions as well as careful implementation.
A key consideration when designing the ACM Multimedia Reproducibility Track was the work of the reproducibility reviewers. These reviewers carry out tasks that go beyond those of main-conference reviewers, since they must use the authors’ artifacts to replicate their results. The track is designed such that the reproducibility reviewers are deeply involved in the process. Because the companion paper is submitted a year after the original paper, reproducibility reviewers have plenty of time to dive into the code and work together with the authors. During this intensive process, the reviewers extend the originally submitted companion paper with a description of the review process and become authors on the final version of the companion paper.
The ACM Multimedia Reproducibility Track is expected to run similarly in years beyond 2019. The experience gained in 2019 will allow future years to tweak the process in small ways if it proves necessary, and also to expand to other ACM badges.
The visibility of badged papers is important for ACM Multimedia. Visibility incentivizes the authors who submit work to the conference to apply best practices in reproducibility. Practically, the visibility of badges also allows researchers to quickly identify work that they can build on. If a paper presenting new research results has a badge, researchers can immediately understand that this paper would be straightforward to use as a baseline, or that they can build confidently on the paper results without encountering ambiguities, technical issues, or other time-consuming frustrations.
The link between reproducibility and multimedia data sets
The link between Reproducibility and Multimedia Data Sets has been pointed out before, for example, in the theme chosen by the ACM Multimedia 2016 MMCommons workshop, “Datasets, Evaluation, and Reproducibility” . One of the goals of this workshop was to discuss how data challenges and benchmarking tasks can catalyze the reproducibility of algorithms and methods.
Researchers who dedicate time and effort to creating and publishing data sets are making a valuable contribution to research. In order to compare the effectiveness of two algorithms, all other aspects of the evaluation must be controlled, including the data set that is used. Making data sets publicly available supports the systematic comparison of algorithms that is necessary to demonstrate that new algorithms are capable of outperforming the state of the art.
Considering the definitions of “replicability” and “reproducibility” introduced above, additional observations can be made about the importance of multimedia data sets. Creating and publishing data sets supports replicability. In order to replicate a research result, the same resources as used in the original experiments, including the data set, must be available to research teams beyond the one who originally carried out the research.
Creating and publishing data sets also supports reproducibility (in the formal sense of the word defined above). In order to reproduce research results, however, it is necessary that there is more than one data set available that is suitable for carrying out evaluation of a particular approach or algorithm. Critically, the definition of reproducibility involves using different resources than were used in the original work. As the multimedia community continues to move from replication to reproduction, it is essential that a large number of data sets are created and published, in order to ensure that multiple data sets are available to assess the reproducibility of research results.
Thank you to people whose hard work is making reproducibility at ACM Multimedia happen: This includes the 2019 TPC Chairs, main-conference ACs and reviewers, as well as the Reproducibility reviewers. If you would like to volunteer to be a reproducibility committee member in this or future years, please contact the Reproducibility Chairs at MM19-Repro@sigmm.org
 Simon, Gwendal. Reproducibility in ACM MMSys Conference. Blogpost, 9 May 2017 http://peerdal.blogspot.com/2017/05/reproducibility-in-acm-mmsys-conference.html Accessed 9 March 2019.
 ACM, Artifact Review and Badging, Reviewed April 2018, https://www.acm.org/publications/policies/artifact-review-badging Accessed 9 March 2019.
 ACM MM Reproducibility: Information on Reproducibility at ACM Multimedia https://project.inria.fr/acmmmreproducibility/ Accessed 9 March 2019.
 Bart Thomee, Damian Borth, and Julia Bernd. 2016. Multimedia COMMONS Workshop 2016 (MMCommons’16): Datasets, Evaluation, and Reproducibility. In Proceedings of the 24th ACM international conference on Multimedia (MM ’16). ACM, New York, NY, USA, 1485-1486.
Affective video content analysis aims at the automatic recognition of emotions elicited by videos. It has a large number of applications, including mood based personalized content recommendation , video indexing , and efficient movie visualization and browsing . Beyond the analysis of existing video material, affective computing techniques can also be used to generate new content, e.g., movie summarization , personalized soundtrack recommendation to make user-generated videos more attractive . Affective techniques can furthermore be used to enhance the user engagement with advertising content by optimizing the way ads are inserted inside videos .
While major progress has been achieved in computer vision for visual object detection, high-level concept recognition, and scene understanding, a natural further step is the modeling and recognition of affective concepts. This has recently received increasing interest from research communities, e.g., computer vision and machine learning, with an overall goal of endowing computers with human-like perception capabilities.
Efficient training and benchmarking of computational models, however, require a large and diverse collection of data annotated with ground truth, which is often difficult to collect, and particularly in the field of affective computing. To address this issue we created the LIRIS-ACCEDE dataset. In contrast to most existing datasets that contain few video resources and have limited accessibility due to copyright constraints, LIRIS-ACCEDE consists of videos with a large content diversity annotated along emotional dimensions. The annotations are made according to the expected emotion of a video, which is the emotion that the majority of the audience feels in response to the same content. All videos are shared under Creative Commons licenses and can thus be freely distributed without copyright issues. The dataset (the videos, annotations, features and protocols) are publicly available, and it is currently composed of a total of six collections.
Dataset & Collections
The LIRIS-ACCEDE dataset is composed of movies and excerpts from movies under Creative Commons licenses that enable the dataset to be publicly shared. The set contains 160 professionally made and amateur movies, with different movie genres such as horror, comedy, drama, action and so on. Languages are mainly English, with a small set of Italian, Spanish, French and others subtitled in English. The set has been used to create the six collections that are part of the dataset. The two collections that were originally proposed are the Discrete LIRIS-ACCEDE collection, which contains short excerpts of movies, and the Continuous LIRIS-ACCEDE collection, which comprises long movies. Moreover, since 2015, the set has been used for tasks related to affect/emotion at the MediaEval Benchmarking Initiative for Multimedia Evaluation , where each year it was enriched with new data, features and annotations. Thus, the dataset also includes the four additional collections dedicated to these tasks.
The movies are available together with emotional annotations. When dealing with emotional video content analysis, the goal is to automatically recognize emotions elicited by videos. In this context, three types of emotions can be considered: intended, induced and expected emotions. The intended emotion is the emotion that the film maker wants to induce in the viewers. The induced emotion is the emotion that a viewer feels in response to the movie. The expected emotion is the emotion that the majority of the audience feels in response to the same content. While the induced emotion is subjective and context dependent, the expected emotion can be considered objective, as it reflects the more-or-less unanimous response of a general audience to a given stimulus. Thus, the LIRIS-ACCEDE dataset focuses on the expected emotion. The representation of emotions we are considering is the dimensional one, based on valence and arousal. Valence is defined on a continuous scale from most negative to most positive emotions, while arousal is defined continuously from calmest to most active emotions . Moreover, violence annotations were provided in the MediaEval 2015 Affective Impact of Movies collection, while fear annotations were provided in the MediaEval 2016 and 2017 Emotional Impact of Movies collections.
|Discrete LIRIS-ACCEDE collection||A total of 160 films from various genres split into 9,800 short clips with valence and arousal annotations. More details below.|
|Continuous LIRIS-ACCEDE collection||A total of 30 films with valence and arousal annotations per second. More details below.|
|MediaEval 2015 Affective Impact of Movies collection||A subset of the films with labels for the presence of violence, as well as for the felt valence and arousal. More details below.|
|MediaEval 2016 Emotional Impact of Movies collection||A subset of the films with score annotations for the expected valence and arousal. More details below.|
|MediaEval 2017 Emotional Impact of Movies collection||A subset of the films with valence and arousal values and a label for the presence of fear for each 10 second segment, as well as precomputed features. More details below.|
|MediaEval 2018 Emotional Impact of Movies collection||A subset of the films with valence and arousal values for each second, begin-end times of scenes containing fear, as well as precomputed features. More details below.|
The ground truth for the Discrete LIRIS-ACCEDE collection consists of the ranking of all video clips along both valence and arousal dimensions. These rankings were obtained thanks to a pairwise video clips comparison protocol that has been designed to be used through crowdsourcing (with CrowdFlower service). Thus, for each pair of video clips presented to raters, they had to select the one which conveyed most strongly the given emotion in terms of valence or arousal. The high inter-annotator agreement that was achieved reflects that annotations were fully consistent, despite the large diversity of our raters’ cultural backgrounds. Affective ratings (scores) were also collected for a subset of the 9,800 movies in order to cross-validate the crowdsourced annotations. The affective ratings also made learning of Gaussian Processes for Regression possible, to model the noisiness from measurements and map the whole ranked LIRIS-ACCEDE dataset into the 2D valence-arousal affective space. More details can be found in .
To collect the ground truth for the continuous and MediaEval 2016, 2017 and 2018 collections, which consisted of valence and arousal scores for every movie second, French annotators had to continuously indicate their level of valence and arousal while watching the movies using a modified version of the GTrace annotation tool  and a joystick. Each annotator continuously annotated one subset of the movies considering the induced valence, and another subset considering the induced arousal. Thus, each movie was continuously annotated by three to five different annotators. Then, the continuous valence and arousal annotations from the annotators were down-sampled by averaging the annotations over windows of 10 seconds with a shift of 1 second overlap (i.e., yielding 1 value per second) in order to remove any noise due to unintended movements of the joystick. Finally, the post-processed continuous annotations were averaged in order to create a continuous mean signal of the valence and arousal self-assessments, ranging from -1 (most negative for valence, most passive for arousal) to +1 (most positive for valence, most active for arousal). The details of this process are given in .
The ground truth for violence annotation, used in the MediaEval 2015 Affective Impact of Movies collection, was collected as follows. First, all the videos were annotated separately by two groups of annotators from two different countries. For each group, regular annotators labeled all the videos, which were then reviewed by master annotators. Regular annotators were graduate students (typically single with no children) and master annotators were senior researchers. Within each group, each video received 2 different annotations, which were then merged by the master annotators into the final annotation for the group. Finally, the achieved annotations from the two groups were merged and reviewed once more by the task organizers. The details can be found in .
The ground truth for fear annotations, used in the MediaEval 2017 and 2018 Emotional Impact of Movies collections, was generated using a tool specifically designed for the classification of audio-visual media allowing to perform annotation while watching the movie (at the same time). The annotations have been realized by two well-experienced team members of NICAM , both of them trained in classification of media. Each movie was annotated by one annotator reporting the start and stop times of each sequence in the movie expected to induce fear.
Through its six collections, the LIRIS-ACCEDE dataset constitutes a dataset of choice for affective video content analysis. It is one of the largest dataset for this purpose, and is regularly enriched with new data, features and annotations. In particular, it is used for the Emotional Impact of Movies tasks at MediaEval Benchmarking Initiative for Multimedia Evaluation. As all the movies are under Creative Commons licenses, the whole dataset can be freely shared and used by the research community, and is available at http://liris-accede.ec-lyon.fr.
Discrete LIRIS-ACCEDE collection 
In total 160 films and short films with different genres were used and were segmented into 9,800 video clips. The total time of all 160 films is 73 hours 41 minutes and 7 seconds, and a video clip was extracted on average every 27s. The 9,800 segmented video clips last between 8 and 12 seconds and are representative enough to conduct experiments. Indeed, the length of extracted segments is large enough to get consistent excerpts allowing the viewer to feel emotions, while being small enough to make the viewer feel only one emotion per excerpt.
The content of the movie was also considered to create homogeneous, consistent and meaningful excerpts that were not meant to disturb the viewers. A robust shot and fade in/out detection was implemented to make sure that each extracted video clip started and ended with a shot or a fade. Furthermore, the order of excerpts within a film was kept, allowing the study of temporal transitions of emotions.
Several movie genres are represented in this collection of movies, such as horror, comedy, drama, action, and so on. Languages are mainly English with a small set of Italian, Spanish, French and others subtitled in English. For this collection the 9,800 video clips are ranked according to valence, from the clip inducing the most negative emotion to the most positive, and to arousal, from the clip inducing the calmest emotion to the most active emotion. Besides the ranks, the emotional scores (valence and arousal) are also provided for each clip.
Continuous LIRIS-ACCEDE collection 
The movie clips for the Discrete collection were annotated globally, for which a single value of arousal and valence was used to represent a whole 8 to 12-second video clip. In order to allow deeper investigations into the temporal dependencies of emotions (since a felt emotion may influence the emotions felt in the future), longer movies were considered in this collection. To this end, a selection of 30 movies from the set of 160 was made such that their genre, content, language and duration were diverse enough to be representative of the original Discrete LIRIS-ACCEDE dataset. The selected videos are between 117 and 4,566 seconds long (mean = 884.2s ± 766.7s SD). The total length of the 30 selected movies is 7 hours, 22 minutes and 5 seconds. The emotional annotations consist of a score of expected valence and arousal for each second of each movie.
MediaEval 2015 Affective Impact of Movies collection 
This collection has been used as the development and test sets for the MediaEval 2015 Affective Impact of Movies Task. The overall use case scenario of the task was to design a video search system that used automatic tools to help users find videos that fitted their particular mood, age or preferences. To address this, two subtasks were proposed:
• Induced affect detection: the emotional impact of a video or movie can be a strong indicator for search or recommendation;
• Violence detection: detecting violent content is an important aspect of filtering video content based on age.
The 9,800 video clips from the Discrete LIRIS-ACCEDE section were used as development set, and an additional 1100 movie clips were proposed for the test set. For each of the 10,900 video clips, the annotations consist of: a binary value to indicate the presence of violence, the class of the excerpt for felt arousal (calm-neutral-active), and the class for felt valence (negative-neutral-positive).
MediaEval 2016 Emotional Impact of Movies collection 
The MediaEval 2016 Emotional Impact of Movies task required participants to deploy multimedia features to automatically predict the emotional impact of movies, in terms of valence and arousal. Two subtasks were proposed:
• Global emotion prediction: given a short video clip (around 10 seconds), participants’ systems were expected to predict a score of induced valence (negative-positive) and induced arousal (calm-excited) for the whole clip;
• Continuous emotion prediction: as an emotion felt during a scene may be influenced by the emotions felt during the previous scene(s), the purpose here was to consider longer videos, and to predict the valence and arousal continuously along the video. Thus, a score of induced valence and arousal were to be provided for each 1s-segment of each video.
The development set was composed of the Discrete LIRIS-ACCEDE part for the first subtask, and the Continuous LIRIS-ACCEDE part for the second subtask. In addition to the development set, a test set was also provided to assess participants’ methods performance. A total of 49 new movies under Creative Commons licenses were added. With the same protocol as the one used for the development set, 1,200 additional short video clips were extracted for the first subtask (between 8 and 12 seconds), while 10 long movies (from 25 minutes to 1 hour and 35 minutes) were selected for the second subtask (for a total duration of 11.48 hours). Thus, the annotations consist of a score of expected valence and arousal for each movie clip used for the first subtask, and a score of expected valence and arousal for each second of the movies for the second subtask.
MediaEval 2017 Emotional Impact of Movies collection 
This collection was used for the MediaEval 2017 Emotional Impact of Movies task. Here, only long movies were considered, and the emotion was considered in terms of valence, arousal and fear. The following two subtasks were proposed for which the emotional impact had to be predicted for consecutive 10-second segments, which slid over the whole movie with a shift of 5 seconds:
• Valence/Arousal prediction: participants’ systems were supposed to predict a score of expected valence and arousal for each consecutive 10-second segment;
• Fear prediction: the purpose here was to predict whether each consecutive 10-second segments was likely to induce fear or not. The targeted use case was the prediction of frightening scenes to help systems protecting children from potentially harmful video content. This subtask is complementary to the valence/arousal prediction task in the sense that the mapping of discrete emotions into the 2D valence/arousal space is often overlapped (for instance, fear, disgust and anger are overlapped since they are characterized with very negative valence and high arousal).
The Continuous LIRIS-ACCEDE collection was used as the development test for both subtasks. The test set consisted of a selection of new 14 new movies under Creative Commons licenses other than the selection of the 160 original movies. They are between 210 and 6,260 seconds long. The total length of the 14 selected movies is 7 hours, 57 minutes and 13 seconds. In addition to the video data, general purpose audio and visual content features were also provided, including Deep features, Fuzzy Color and Texture Histogram, Gabor features. The annotations consist of a valence value, an arousal value and a binary value for each 10-second segment to indicate if the segment was supposed to induce fear or not.
MediaEval 2018 Emotional Impact of Movies collection 
The MediaEval 2018 Emotional Impact of Movies task is similar to the one of 2017. However, in this case, more data was provided and a prediction of the emotional impact needed to be made for every second in movies rather than for 10-second segments as before. The two subtasks were:
• Valence and Arousal prediction: participants’ systems had to predict a score of expected valence and arousal continuously (every second) for each movie;
• Fear detection: the purpose here was to predict beginning and ending times of sequences inducing fear in movies. The targeted use case was the detection of frightening scenes to help systems protecting children from potentially harmful video content.
The development set for both subtasks consisted of the movies from the Continuous LIRIS-ACCEDE collection, as well as from the test set of the MediaEval 2017 Emotional Impact of Movies collection, i.e. 44 movies for a total duration of 15 hours and 20 minutes. The test set consisted of 12 other movies selected from the set of 160 movies, for a total duration of 8 hours and 56 minutes. Like the 2017 collection, in addition to the video data, general purpose audio and visual content features were also provided. The annotations consist of valence and arousal values for each second of the movies (for the first subtasks) as well as the beginning and ending times of each sequence in movies inducing fear (for the second subtask).
This work was supported in part by the French research agency ANR through the VideoSense Project under the Grant 2009 CORD 026 02 and through the Visen project within the ERA-NET CHIST-ERA framework under the grant ANR-12-CHRI-0002-04.
Should you have any inquiries or questions about the dataset, do not hesitate to contact us by email at: emmanuel dot dellandrea at ec-lyon dot fr.
 L. Canini, S. Benini, and R. Leonardi, “Affective recommendation of movies based on selected connotative features”, in IEEE Transactions on Circuits and Systems for Video Technology, 23(4), 636–647, 2013.
 S. Zhang, Q. Huang, S. Jiang, W. Gao, and Q. Tian. 2010, “Affective visualization and retrieval for music video”, in IEEE Transactions on Multimedia 12(6), 510–522, 2010.
 S.Zhao, H.Yao, X.Sun, X.Jiang, and P. Xu., “Flexible presentation of videos based on affective content analysis”, in Advances in Multimedia Modeling, 2013.
 H. Katti, K. Yadati, M. Kankanhalli, and C. Tat-Seng, “Affective video summarization and story board generation using pupillary dilation and eye gaze”, in IEEE International Symposium on Multimedia (ISM), 2011.
 R.R. Shah,Y. Yu, and R. Zimmermann, “Advisor: Personalized video soundtrack recommendation by late fusion with heuristic rankings”, in ACM International Conference on Multimedia, 2014.
 K. Yadati, H. Katti, and M. Kankanhalli, “Cavva: Computational affective video-in-video advertising”, in IEEE Transactions on Multimedia 16(1), 15–23, 2014.
 A. Hanjalic, “Extracting moods from pictures and sounds: Towards truly personalized TV”, in IEEE Signal Processing Magazine, 2006.
 J.A. Russell, “Core affect and the psychological construction of emotion”, in Psychological Review, 2003.
 Y. Baveye, E. Dellandrea, C. Chamaret, and L. Chen, “LIRIS-ACCEDE: A Video Database for Affective Content Analysis,” in IEEE Transactions on Affective Computing, 2015.
 Y. Baveye, E. Dellandrea, C. Chamaret, and L. Chen, “Deep Learning vs. Kernel Methods: Performance for Emotion Prediction in Videos,” in 2015 Humaine Association Conference on Affective Computing and Intelligent Interaction (ACII), 2015.
 M. Sjöberg, Y. Baveye, H. Wang, V. L. Quang, B. Ionescu, E. Dellandréa, M. Schedl, C.-H. Demarty, and L. Chen, “The mediaeval 2015 affective impact of movies task,” in MediaEval 2015 Workshop, 2015.
 E. Dellandrea, L. Chen, Y. Baveye, M. Sjoberg and C. Chamaret, “The MediaEval 2016 Emotional Impact of Movies Task”, in Working Notes Proceedings of the MediaEval 2016 Workshop, Hilversum, The Netherlands, October 20-21, 2016.
 E. Dellandrea, M. Huigsloot, L. Chen, Y. Baveye and M. Sjoberg, “The MediaEval 2017 Emotional Impact of Movies Task”, in Working Notes Proceedings of the MediaEval 2017 Workshop, Dublin, Ireland, September 13-15, 2017.
 E. Dellandréa, M. Huigsloot, L. Chen, Y. Baveye, Z. Xiao and M. Sjöberg, “The MediaEval 2018 Emotional Impact of Movies Task”, Working Notes Proceedings of the MediaEval 2018 Workshop, Sophia Antipolis, France, October 29-31, 2018.
 R. Cowie, M. Sawey, C. Doherty, J. Jaimovich, C. Fyans, and P. Stapleton, “Gtrace: General trace program compatible with emotionML”, in Humaine Association Conference on Affective Computing and Intelligent Interaction (ACII), 2013.
This column discusses the efforts of ACM SIGMM towards sharing and reproducibility. Apart from the specific sessions dedicated to open source and datasets, ACM Multimedia Systems started to provide official ACM badges for articles that make artifacts available since last year. This year, it has marked a record with 45% of the articles acquiring such a badge.
Without data it is impossible to put theories to the test. Moreover, without running code it is tedious at best to (re)produce and evaluate any results. Yet collecting data and writing code can be a road full of pitfalls, ranging from datasets containing copyrighted materials to algorithms containing bugs. The ideal datasets and software packages are those that are open and transparent for the world to look at, inspect, and use without or with limited restrictions. Such “artifacts” make it possible to establish public consensus on their correctness or otherwise to start a dialogue on how to fix any identified problems.
In our interconnected world, storing and sharing information has never been easier. Despite the temptation for researchers to keep datasets and software to themselves, a growing number are willing to share their resources with others. To further promote this sharing behavior, conferences, workshops, publishers, non-profit and even for-profit companies are increasingly recognizing and supporting these efforts. For example, the ACM Multimedia conference has hosted an open source software competition since 2004, and the ACM Multimedia Systems conference has included an open datasets and software track since 2011 . The ACM Digital Library now also hands out badges to public artifacts that have been made available and optionally reviewed and verified by members of the community. At the same time, organizations such as Zenodo and Amazon host open datasets for free. Sharing ultimately pays off: the citation statistics for ACM Multimedia Systems conferences over the past five years, for example, show that half of the 20 most cited papers shared data and code although they have represented a small fraction of the published papers so far.
Good practices are increasingly adopted. In this year’s edition of the ACM Multimedia Systems conference, 69 works (papers, demos, datasets, software) were accepted, out of which 31 (45%) were awarded an ACM badge. This is a large increase compared to last year, when out of 42 works only a total of 13 (31%) received one. This greatly expands one of the core objectives of both the conference and SIGMM towards open science. At this moment, the ACM Digital Library does not separately index which papers received a badge, making it challenging to find all papers who have one. It further appears not many other ACM conferences are aware of the badges yet; for example, while ACM Multimedia accepted 16 open source papers in 2016 and 6 papers in 2017, none applied for a badge. This year at ACM Multimedia Systems only “artifacts available” badges have been awarded. For next year our intention is to ensure all dataset and software submissions receive the “artifacts evaluated” badge. This would require several committed community members to spend time working with the authors to get the artifacts running on all major platforms with corresponding detailed documentation.
The accepted artifacts this year are diverse in nature: several submissions focus on releasing artifacts related to quality of experience of (mobile/wireless) streaming video, while others center on making datasets and tools related to images, videos, speech, sensors, and events available; in addition, there are a number of contributions in the medical domain. It is great to see such a range of interests in our community!
Social media sharing platforms (e.g., YouTube, Flickr, Instagram, and SoundCloud) have revolutionized how users access multimedia content online. Most of these platforms provide a variety of ways for the user to interact with the different types of media: images, video, music. In addition to watching or listening to the media content, users can also engage with content in different ways, e.g., like, share, tag, or comment. Social media sharing platforms have become an important resource for scientific researchers, who aim to develop new indexing and retrieval algorithms that can improve users’ access to multimedia content. As a result, enhancing the experience provided by social media sharing platforms.
Historically, the multimedia research community has focused on developing multimedia analysis algorithms that combine visual and text modalities. Less highly visible is research devoted to algorithms that exploit an audio signal as the main modality. Recently, awareness for the importance of audio has experienced a resurgence. Particularly notable is Google’s release of the AudioSet, “A large-scale dataset of manually annotated audio events” . In a similar spirit, we have developed the “Socially Significant Music Event“ dataset that supports research on music events . The dataset contains Electronic Dance Music (EDM) tracks with a Creative Commons license that have been collected from SoundCloud. Using this dataset, one can build machine learning algorithms to detect specific events in a given music track.
What are socially significant music events? Within a music track, listeners are able to identify certain acoustic patterns as nameable music events. We call a music event “socially significant” if it is popular in social media circles, implying that it is readily identifiable and an important part of how listeners experience a certain music track or music genre. For example, listeners might talk about these events in their comments, suggesting that these events are important for the listeners (Figure 1).
Traditional music event detection has only tackled low-level events like music onsets  or music auto-tagging [8, 10]. In our dataset, we consider events that are at a higher abstraction level than the low-level musical onsets. In auto-tagging, descriptive tags are associated with 10-second music segments. These tags generally fall into three categories: musical instruments (guitar, drums, etc.), musical genres (pop, electronic, etc.) and mood based tags (serene, intense, etc.). The types of tags are different than what we are detecting as part of this dataset. The events in our dataset have a particular temporal structure unlike the categories that are the target of auto-tagging. Additionally, we analyze the entire music track and detect start points of music events rather than short segments like auto-tagging.
There are three music events in our Socially Significant Music Event dataset: Drop, Build, and Break. These events can be considered to form the basic set of events used by the EDM producers [1, 2]. They have a certain temporal structure internal to themselves, which can be of varying complexity. Their social significance is visible from the presence of large number of timed comments related to these events on SoundCloud (Figure 1,2). The three events are popular in the social media circles with listeners often mentioning them in comments. Here, we define these events :
- Drop: A point in the EDM track, where the full bassline is re-introduced and generally follows a recognizable build section
- Build: A section in the EDM track, where the intensity continuously increases and generally climaxes towards a drop
- Break: A section in an EDM track with a significantly thinner texture, usually marked by the removal of the bass drum
SoundCloud is an online music sharing platform that allows users to record, upload, promote and share their self-created music. SoundCloud started out as a platform for amateur musicians, but currently many leading music labels are also represented. One of the interesting features of SoundCloud is that it allows “timed comments” on the music tracks. “Timed comments” are comments, left by listeners, associated with a particular time point in the music track. Our “Socially Significant Music Events” dataset is inspired by the potential usefulness of these timed comments as ground truth for training music event detectors. Figure 2 contains an example of a timed comment: “That intense buildup tho” (timestamp 00:46). We could potentially use this as a training label to detect a build, for example. In a similar way, listeners also mention the other events in their timed comments. So, these timed comments can serve as training labels to build machine learning algorithms to detect events.
The dataset contains 402 music tracks with an average duration of 4.9 minutes. Each track is accompanied by timed comments relating to Drop, Build, and Break. It is also accompanied by ground truth labels that mark the true locations of the three events within the tracks. The labels were created by a team of experts. Unlike many other publicly available music datasets that provide only metadata or short previews of music tracks , we provide the entire track for research purposes. The download instructions for the dataset can be found here: . All the music tracks in the dataset are distributed under the Creative Commons license. Some statistics of the dataset are provided in Table 1.
Table 1. Statistics of the dataset: Number of events, Number of timed comments
|Event Name||Total number of events||Number of events per track||Total number of timed comments||Number of timed comments per track|
The main purpose of the dataset is to support training of detectors for the three events of interest (Drop, Build, and Break) in a given music track. These three events can be considered a case study to prove that it is possible to detect socially significant musical events, opening the way for future work on an extended inventory of events. Additionally, the dataset can be used to understand the properties of timed comments related to music events. Specifically, timed comments can be used to reduce the need for manually acquired ground truth, which is expensive and difficult to obtain.
Timed comments present an interesting research challenge: temporal noise. The timed comments and the actual events do not always coincide. The comments could be at the same position, before, or after the actual event. For example, in the below music track (Figure 3), there is a timed comment about a drop at 00:40, while the actual drop occurs only at 01:00. Because of this noisy nature, we cannot use the timed comments alone as ground truth. We need strategies to handle temporal noise in order to use timed comments for training .
In addition to music event detection, our “Socially Significant Music Event” dataset opens up other possibilities for research. Timed comments have the potential to improve users’ access to music and to support them in discovering new music. Specifically, timed comments mention aspects of music that are difficult to derive from the signal, and may be useful to calculate song-to-song similarity needed to improve music recommendation. The fact that the comments are related to a certain time point is important because it allows us to derive continuous information over time from a music track. Timed comments are potentially very helpful for supporting listeners in finding specific points of interest within a track, or deciding whether they want to listen to a track, since they allow users to jump-in and listen to specific moments, without listening to the track end-to-end.
State of the art
The detection of music events requires training classifiers that are able to generalize over the variability in the audio signal patterns corresponding to events. In Figure 4, we see that the build-drop combination has a characteristic pattern in the spectral representation of the music signal. The build is a sweep-like structure and is followed by the drop, which we indicate by a red vertical line. More details about the state-of-the-art features useful for music event detection and the strategies to filter the noisy timed comments can be found in our publication .
The evaluation metric used to measure the performance of a music event detector should be chosen according to the user scenario for that detector. For example, if the music event detector is used for non-linear access (i.e., creating jump-in points along the playbar) it is important that the detected time point of the event falls before, rather than after, the actual event. In this case, we recommend using the “event anticipation distance” (ea_dist) as a metric. The ea_dist is amount of time that the predicted event time point precedes an actual event time point and represents the time the user would have to wait to listen to the actual event. More details about ea_dist can be found in our paper .
In , we report the implementation of a baseline music event detector that uses only timed comments as training labels. This detector attains an ea_dist of 18 seconds for a drop. We point out that from the user point of view, this level of performance could already lead to quite useful jump-in points. Note that the typical length of a build-drop combination is between 15-20 seconds. If the user is positioned 18 seconds before the drop, the build would have already started and the user knows that a drop is coming. Using an optimized combination of timed comments and manually acquired ground truth labels we are able to achieve an ea_dist of 6 seconds.
Timed comments, on their own, can be used as training labels to train detectors for socially significant events. A detector trained on timed comments performs reasonably well in applications like non-linear access, where the listener wants to jump through different events in the music track without listening to it in its entirety. We hope that the dataset will encourage researchers to explore the usefulness of timed comments for all media. Additionally, we would like to point out that our work has demonstrated that the impact of temporal noise can be overcome and that the contribution of timed comments to video event detection is worth investigating further.
Should you have any inquiries or questions about the dataset, do not hesitate to contact us via email at: firstname.lastname@example.org
 K. Yadati, M. Larson, C. Liem and A. Hanjalic, “Detecting Socially Significant Music Events using Temporally Noisy Labels,” in IEEE Transactions on Multimedia. 2018. http://ieeexplore.ieee.org/document/8279544/
 H. Y. Lo, J. C. Wang, H. M. Wang and S. D. Lin, “Cost-Sensitive Multi-Label Learning for Audio Tag Annotation and Retrieval,” in IEEE Transactions on Multimedia, vol. 13, no. 3, pp. 518-529, June 2011. http://ieeexplore.ieee.org/document/5733421/