Dataset Column: Report from the MMM 2019 Special Session on Multimedia Datasets for Repeatable Experimentation (MDRE 2019)

Special Session

Information retrieval and multimedia content access have a long history of comparative evaluation, and many of the advances in the area over the past decade can be attributed to the availability of open datasets that support comparative and repeatable experimentation. Sharing data and code so that other researchers can replicate results is essential in the multimedia modeling field, as it both drives improvements in system performance and makes published results reproducible.

This report summarizes the special session on Multimedia Datasets for Repeatable Experimentation (MDRE 2019), organized at the 25th International Conference on MultiMedia Modeling (MMM 2019), held in January 2019 in Thessaloniki, Greece.

The intent of these special sessions is to serve as a venue for releasing datasets to the multimedia community and for discussing dataset-related issues. In 2019, the session consisted of short presentations (8 minutes each), followed by brief questions and a concluding panel discussion moderated by Björn Þór Jónsson. In the following we summarize the special session, including its talks, questions, and discussions.

The special session presenters were Luca Rossetto, Cathal Gurrin, and Minh-Son Dao.

Presentations

A Test Collection for Interactive Lifelog Retrieval

The session started with a presentation about A Test Collection for Interactive Lifelog Retrieval [1], given by Cathal Gurrin from Dublin City University (Ireland). In their work, the authors introduced a new test collection for interactive lifelog retrieval, which consists of multi-modal data from 27 days, comprising nearly 42 thousand images and other personal data (health and activity data; more specifically, heart rate, galvanic skin response, calorie burn, steps, blood pressure, blood glucose levels, human activity, and diet log). The authors argued that, although other lifelog datasets already exist, their dataset is unique in terms of the multi-modal character, and has a reasonable and easily manageable size of 27 consecutive days. Hence, it can also be used for interactive search and provides newcomers with an easy entry into the field. The published dataset has already been used for the Lifelog Search Challenge (LSC) [5] in 2018, which is an annual competition run at the ACM International Conference on Multimedia Retrieval (ICMR).

The discussion about this work started with a question about the plans for the dataset and whether it should be extended over the years, e.g. to increase the challenge of participating in the LSC. However, the problem with public lifelog datasets is the fact that there is a conflict between releasing more content and safeguarding privacy. There is a strong need to anonymize the contained images (e.g. blurring faces and license plates), where the rules and requirements of the EU GDPR regulations make this especially important. However, anonymizing content unfortunately is a very slow process. An alternative to removing and/or masking actual content from the dataset for privacy reasons would be to create artificial datasets (e.g. containing public images or only faces from people who consent to publish), but this would likely also be a non-trivial task. One interesting aspect could be the use of Generative Adversarial Networks (GANs) for the anonymization of faces, for instance by replacing all faces appearing in the content with generated faces learned from a small group of people who gave their consent. Another way to preemptively mitigate the privacy issues could be to wear conspicuous ‘lifelogging stickers’ during recording to make people aware of the presence of the camera, which would give them the possibility to object to being filmed or to avoid being captured altogether.

SEPHLA: Challenges and Opportunities Within Environment-Personal Health Archives

The second presentation was given by Minh-Son Dao from the National Institute of Information and Communications Technology (NICT) in Japan about SEPHLA: Challenges and Opportunities Within Environment-Personal Health Archives [2]. This is a dataset that aims at combining environmental conditions with health-related aspects (e.g., pollution or weather data with cardio-respiratory or psychophysiological data). The creation of the dataset was motivated by the fact that people in larger cities in Japan often do not want to go out (e.g., for sports activities) because they are concerned about pollution and its effects on their health. It would therefore be beneficial to have a map of the city with assigned pollution ratings, or a system that supports such queries. Their dataset contains sensor data collected along routes by a few dozen volunteers over seven days in Fukuoka, Japan. More particularly, they collected data about location, O3, NO2, PM2.5 (particulates), temperature, and humidity, in combination with heart rate, motion behavior (from a 3-axis accelerometer), relaxation level, and other personal perception data from questionnaires.

This dataset has also been used for multimedia benchmark challenges, such as the Lifelogging for Wellbeing task at MediaEval. In order to define the ground truth, volunteers were presented with specific use cases and annotation rules, and were asked to collaboratively annotate the dataset. The collected data (the feelings of participants at different locations) was also visualized using an interactive map. Although the dataset may have some inconsistent annotations, it is easy to filter them out since labels of corresponding annotators and annotator groups are contained in the dataset as well.

V3C – a Research Video Collection

The third presentation was given by Luca Rossetto from the University of Basel (Switzerland) about V3C – a Research Video Collection [3]. This is a large-scale dataset for multimedia retrieval, consisting of nearly 30,000 videos with an overall duration of about 3,800 hours. Although many other video datasets are available already (e.g., IACC.3 [6], or YFCC100M [8]), the V3C dataset is unique in the aspects of timeliness (more recent content than many other datasets and therefore more representative content for current ‘videos in the wild’) and diversity (represents many different genres or use cases), while also having no copyright restrictions (all contained videos were labelled with a Creative Commons license by their uploaders). The videos have been collected from the video sharing platform Vimeo (hence the name ‘Vimeo Creative Commons Collection’ or V3C in short) and represent video data currently used on video sharing platforms. The dataset comes together with a master shot-boundary detection ground truth, as well as keyframes and additional metadata. It is partitioned into three major parts (V3C1, V3C2, and V3C3) to make it more manageable, and it will be used by the TRECVID and the Video Browser Showdown (VBS) evaluation campaigns for several years. Although the dataset was not specifically built for retrieval, it is suitable for any use case that requires a larger video dataset.

The shot-boundary detection used to provide the master-shot reference for the V3C dataset was implemented using Cineast, an open-source software system available for download. It divides every frame into a 3×3 grid and computes color histograms for all 9 cells, which are then concatenated into a ‘regional color histogram’ feature vector that is compared between adjacent frames. This seems to work very well for hard cuts and gradual transitions, although it is not very stable for grayscale content (and flashlights etc.). The additional metadata provided with the dataset includes information about resolution, frame rate, uploading user and upload date, as well as any semantic information provided by the uploader (title, description, tags, etc.).
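
As an illustration of the described idea, the following Python sketch (not taken from Cineast; the grid size, histogram bin count, and distance threshold are arbitrary assumptions) concatenates per-cell color histograms of a 3×3 grid and reports a shot boundary whenever the distance between the feature vectors of adjacent frames exceeds a threshold.

```python
# Minimal sketch of a regional-color-histogram shot-boundary detector.
import cv2
import numpy as np

def regional_color_histogram(frame, grid=3, bins=8):
    """Concatenate per-cell color histograms of a grid x grid layout into one vector."""
    h, w, _ = frame.shape
    features = []
    for gy in range(grid):
        for gx in range(grid):
            cell = frame[gy * h // grid:(gy + 1) * h // grid,
                         gx * w // grid:(gx + 1) * w // grid]
            hist = cv2.calcHist([cell], [0, 1, 2], None,
                                [bins, bins, bins],
                                [0, 256, 0, 256, 0, 256]).flatten()
            features.append(hist / (hist.sum() + 1e-8))  # normalize each cell histogram
    return np.concatenate(features)

def detect_shot_boundaries(video_path, threshold=0.5):
    """Return frame indices whose feature distance to the previous frame exceeds the threshold."""
    cap = cv2.VideoCapture(video_path)
    boundaries, prev, idx = [], None, 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        feat = regional_color_histogram(frame)
        if prev is not None and np.linalg.norm(feat - prev) > threshold:
            boundaries.append(idx)
        prev, idx = feat, idx + 1
    cap.release()
    return boundaries
```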

Athens Urban Soundscape (ATHUS): A Dataset for Urban Soundscape Quality Recognition

Originally a fourth presentation was scheduled about Athens Urban Soundscape (ATHUS): A Dataset for Urban Soundscape Quality Recognition [4], but unfortunately no author was on site to give it. This dataset contains 30-second audio samples (as well as extracted features and ground truth) recorded in a metropolitan city (Athens, Greece) over a period of about four years by 10 different persons, with the aim of providing a collection of city sounds. The metadata includes geospatial coordinates, a timestamp, and the rating and tags assigned to the sound by the recording person. In a baseline evaluation, the authors demonstrated that their dataset allows the soundscape quality in the city to be predicted with about 42% accuracy.

Discussion

After the presentations, Björn Þór Jónsson moderated a panel discussion in which all presenters participated.

The panel started with a discussion on the size of datasets: is the only way to make challenges more difficult to keep increasing the dataset size, or are there alternatives? Although this heavily depends on the research question one would like to address, it was generally agreed that there is a definite need for evaluation with large datasets, because on small datasets some problems become trivial. Moreover, overly small datasets often introduce content bias, so that they do not fully reflect the practical situation.

For now, it seems there is no real alternative to using larger datasets, although it is clear that this will introduce additional challenges for data management and data processing. All presenters (and the audience too) agreed that introducing larger datasets will also necessitate closer collaboration with other research communities, such as data science, data management/engineering, and distributed and high-performance computing, in order to manage the higher data load.

However, even though we need larger datasets, we might not be ready yet for truly large-scale data. For example, the V3C dataset is still far from a true web-scale video search dataset; it was originally intended to be even bigger, but there were concerns from the TRECVID and VBS communities about its manageability. Datasets that are too large would set the entry barrier for newcomers so high that an evaluation benchmark may not attract enough participants. This problem could possibly disappear in a few years (as hardware becomes cheaper and faster/larger), but it still needs to be addressed from an organizational viewpoint.

There were notes from the audience that instead of focusing on size alone, we should also consider the problem we want to solve. Many researchers appear to use datasets for use cases for which they were not designed and to which they are not suited. Instead of blindly going for larger size, datasets could be kept small and simple for answering essential research questions, for example by truly optimizing them for the problem to be solved; different evaluations would then use different datasets. However, this would lead to considerable dataset fragmentation and would require combining several datasets for broader evaluation tasks, which has proven quite challenging in the past. For example, there are already many health datasets available and it would be interesting to benefit from them, but the workload of integrating them into competitions is often too high in practice.

Another issue that should be addressed more intensively by the research community is how to create and share personal datasets in a way that is compliant with the GDPR, since currently there is no established practice for dealing with this.

Acknowledgments

The session was organized by the authors of the report, in collaboration with Duc-Tien Dang-Nguyen (Dublin City University), Michael Riegler (Center for Digitalisation and Engineering & University of Oslo), and Luca Piras (University of Cagliari). The panel format of the special session made the discussions much more lively and interactive than that of a traditional technical session. We would like to thank the presenters and their co-authors for their excellent contributions, as well as the members of the audience who contributed greatly to the session.

References

[1] Gurrin, C., Schoeffmann, K., Joho, H., Munzer, B., Albatal, R., Hopfgartner, F., … & Dang-Nguyen, D. T. (2019, January). A test collection for interactive lifelog retrieval. In International Conference on Multimedia Modeling (pp. 312-324). Springer, Cham.
[2] Sato, T., Dao, M. S., Kuribayashi, K., & Zettsu, K. (2019, January). SEPHLA: Challenges and Opportunities Within Environment-Personal Health Archives. In International Conference on Multimedia Modeling (pp. 325-337). Springer, Cham.
[3] Rossetto, L., Schuldt, H., Awad, G., & Butt, A. A. (2019, January). V3C–A Research Video Collection. In International Conference on Multimedia Modeling (pp. 349-360). Springer, Cham.
[4] Giannakopoulos, T., Orfanidi, M., & Perantonis, S. (2019, January). Athens Urban Soundscape (ATHUS): A Dataset for Urban Soundscape Quality Recognition. In International Conference on Multimedia Modeling (pp. 338-348). Springer, Cham.
[5] Dang-Nguyen, D. T., Schoeffmann, K., & Hurst, W. (2018, June). LSE2018 Panel-Challenges of Lifelog Search and Access. In Proceedings of the 2018 ACM Workshop on The Lifelog Search Challenge (pp. 1-2). ACM.
[6] Awad, G., Butt, A., Curtis, K., Lee, Y., Fiscus, J., Godil, A., … & Kraaij, W. (2018, November). Trecvid 2018: Benchmarking video activity detection, video captioning and matching, video storytelling linking and video search.
[7] Lokoč, J., Kovalčík, G., Münzer, B., Schöffmann, K., Bailer, W., Gasser, R., … & Barthel, K. U. (2019). Interactive search or sequential browsing? a detailed analysis of the video browser showdown 2018. ACM Transactions on Multimedia Computing, Communications, and Applications (TOMM), 15(1), 29.
[8] Kalkowski, S., Schulze, C., Dengel, A., & Borth, D. (2015, October). Real-time analysis and visualization of the YFCC100M dataset. In Proceedings of the 2015 Workshop on Community-Organized Multimodal Mining: Opportunities for Novel Solutions(pp. 25-30). ACM.

Dataset Column: Datasets for Online Multimedia Verification

Introduction

Online disinformation is a problem that has been attracting increased interest by researchers worldwide as the breadth and magnitude of its impact is progressively manifested and documented in a number of studies (Boididou et al., 2014; Zhou & Zafarani, 2018; Zubiaga et al., 2018). This emerging area of research is inherently multidisciplinary and there have been numerous treatments of the subject, each having a distinct perspective or theme, ranging from the predominant perspectives of media, journalism and communications (Wardle & Derakhshan, 2017) and political science (Allcott & Gentzkow, 2017) to those of network science (Lazer et al., 2018), natural language processing (Rubin et al., 2015) and signal processing, including media forensics (Zampoglou et al., 2017). Given the multimodal nature of the problem, it is no surprise that the multimedia community has taken a strong interest in the field.

From a multimedia perspective, two research problems have attracted the bulk of researchers’ attention: a) detection of content tampering and content fabrication, and b) detection of content misuse for disinformation. The first was traditionally studied within the field of media forensics (Rocha et al., 2011), but has recently been under the spotlight as a result of the rise of deepfake videos (Güera & Delp, 2018), i.e. videos produced by a special class of generative models that are capable of synthesizing highly convincing media content from scratch or based on some authentic seed content. The second concerns multimedia misuse or misappropriation, i.e. the use of media content out of its original context with the goal of spreading misinformation or false narratives (Tandoc et al., 2018).

Developing automated approaches to detect media-based disinformation relies to a great extent on the availability of relevant datasets, both for training supervised learning models and for evaluating their effectiveness. Yet, developing and releasing such datasets is a challenge in itself for a number of reasons:

  1. Identifying, curating, understanding, and annotating cases of media-based misinformation is a very effort-intensive task. More often than not, the annotation process requires careful and extensive reading of pertinent news coverage from a variety of sources similar to the journalistic practice of verification (Brandtzaeg et al., 2016).
  2. Media-based disinformation is largely manifested in social media platforms and relevant datasets are therefore hard to collect and distribute due to the temporary nature of social media content and the numerous technical restrictions and challenges involved in collecting content (mostly due to limitations or complete lack of appropriate support by the respective APIs), as well as the legal and ethical issues in releasing social media-based datasets (due to the need to comply with the respective Terms of Service and any applicable data protection law).

In this column, we present two multimedia datasets that could be of value to researchers who study media-based disinformation and develop automated approaches to tackle the problem. The first, called Fake Video Corpus (Papadopoulou et al., 2019), is a manually curated collection of 200 debunked and 180 verified videos, along with relevant annotations, accompanied by a set of 5,193 near-duplicate instances of them that were posted on popular social media platforms. The second, called FIVR-200K (Kordopatis-Zilos et al., 2019), is an automatically collected dataset of 225,960 videos, a list of 100 video queries, and manually verified annotations regarding the relation (if any) of the dataset videos to each of the queries (i.e. near-duplicate, complementary scene, same incident).

For each of the two datasets, we present the design and creation process, focusing on issues and questions regarding the relevance of the collected content, the technical means of collection, and the process of annotation, which had the dual goal of ensuring high accuracy and keeping the manual annotation cost manageable. Given that each dataset is accompanied by a detailed journal article, in this column we limit our description to high-level information, emphasizing the utility and creation process in each case rather than detailed statistics, which are disclosed in the respective papers.

Following the presentation of the two datasets, we then proceed to a critical discussion, highlighting their limitations and some caveats, and delineating future steps towards high quality dataset creation for the field of multimedia-based misinformation.

Related Datasets

The complexity and challenge of the multimedia verification problem has led to the creation of numerous datasets and benchmarking efforts, each designed for a particular task within this area. We can broadly classify these efforts into three areas: a) multimedia forensics, b) multimedia retrieval, and c) multimedia post classification. Datasets that focus on the text modality, e.g. Fake News Challenge, Clickbait Challenge, Hyperpartisan News Detection, and RumourEval (Derczynski et al., 2017), are beyond the scope of this post and are hence not included in this discussion.

Multimedia forensics: Generating high-quality multimedia forensics datasets has always been a challenge, since creating convincing forgeries is normally a manual task requiring a fair amount of skill; as a result, such datasets have generally been few and limited in scale. With respect to image splicing, our own survey (Zampoglou et al., 2017) listed a number of datasets that had been made available by that point, including our own Wild Web tampered image dataset, which consists of real-world forgeries collected from the Web, including multiple near-duplicates, making it a large and particularly challenging collection. More recently, the Realistic Tampering Dataset (Korus et al., 2017) was proposed, offering a large number of convincing forgeries for evaluation. On the other hand, copy-move image forgeries pose a different problem that requires specially designed datasets. Three commonly used datasets are those produced by MICC (Amerini et al., 2011), the Image Manipulation Dataset (Christlein et al., 2012), and CoMoFoD (Tralic et al., 2013). These datasets are still actively used in research.

With respect to video tampering, there has been a relative scarcity of high-quality, large-scale datasets, which is understandable given the difficulty of creating convincing forgeries. The recently proposed Multimedia Forensics Challenge datasets (Guan et al., 2019) include some large-scale sets of tampered images and videos for the evaluation of forensics algorithms. Finally, there has recently been increased interest in the automatic detection of forgeries made with the assistance of particular software, specifically face-swapping software. As the quality of produced face swaps is constantly improving, detecting them is an important emerging verification task. The FaceForensics++ dataset (Rössler et al., 2019) is a very large-scale dataset containing face-swapped videos (and untampered face videos) generated by a number of different algorithms, aimed at the evaluation of face-swap detection algorithms.

Multimedia retrieval: Several cases of multimedia verification can be considered instances of a near-duplicate retrieval task, in which the query video (the video to be verified) is run against a database of past cases/videos to check whether it has appeared before. The most popular publicly available dataset for near-duplicate video retrieval is arguably the CC_WEB_VIDEO dataset (Wu et al., 2007). It consists of 12,790 user-generated videos collected from popular video sharing websites (YouTube, Google Video, and Yahoo! Video). It is organized in 24 query sets; for each, the most popular video was selected to serve as the query, and the rest of the videos were manually annotated with respect to whether they are duplicates of the query. Another relevant dataset is VCDB (Jiang et al., 2014), which was compiled and annotated as a benchmark for the partial video copy detection problem and is composed of videos from popular video platforms (YouTube and Metacafe). VCDB contains two subsets: a) the core, which consists of 28 discrete sets of videos with a total of 528 videos and over 9,000 pairs of manually annotated partial copies, and b) the distractors, which consists of 100,000 videos whose purpose is to make the video copy detection problem more challenging.

Multimedia post classification: A benchmark task under the name “Verifying Multimedia Use” (Boididou et al., 2015; Boididou et al., 2016) was organized in the context of MediaEval 2015 and 2016. The task made available a dataset of 15,629 tweets containing images and videos, each of which made a false or factual claim with respect to the shared image/video. The released tweets were posted in the context of breaking news events (e.g. Hurricane Sandy, the Boston Marathon bombings) or hoaxes.

Video Verification Datasets

The Fake Video Corpus (FVC)

The Fake Video Corpus (Papadopoulou et al., 2019) is a collection of 380 user-generated videos and 5,193 near-duplicate versions of them, all collected from three online video platforms: YouTube, Facebook, and Twitter. The videos are annotated either as “verified” (“real”) or as “debunked” (“fake”), depending on whether the information they convey is accurate or misleading. Verified videos are typically user-generated takes of newsworthy events, while debunked videos include various types of misinformation, including staged content posing as UGC, real content taken out of context, or modified/tampered content (see Figure 1 for examples). The near-duplicates of each video are arranged in temporally ordered “cascades”, and each near-duplicate video is annotated with respect to its relation to the first video of the cascade (e.g. whether it reinforces or debunks the original claim). The FVC is, to our knowledge, the first large-scale dataset of debunked and verified user-generated videos (UGVs). The dataset contains different kinds of metadata for its videos, including channel (user) information, video information, and community reactions (number of likes, shares, and comments) at the time of their inclusion.

  
  
Figure 1. A selection of real (top row) and fake (bottom row) videos from the Fake Video Corpus. Click image to jump to larger version, description, and link to YouTube video.

The initial set of 380 videos was collected and annotated using various sources, including the Context Aggregation and Analysis (CAA) service developed within the InVID project and fact-checking sites such as Snopes. To build the dataset, all videos submitted to the CAA service between November 2017 and January 2018 were collected in an initial pool of approximately 1,600 videos, which were then manually inspected and filtered. The remaining videos were annotated as “verified” or “debunked” using established third-party sources (news articles or blog posts), leading to the final pool of 180 verified and 200 fake unique videos. Then, keyword-based search was run on the three platforms, and near-duplicate video detection was used to identify the video duplicates within the returned results. More specifically, for each of the 380 videos, its title was reformulated in a more general form and translated into four major languages: Russian, Arabic, French, and German. The original title, the general form, and the translations were submitted as queries to YouTube, Facebook, and Twitter. Then, the near-duplicate retrieval algorithm of Kordopatis-Zilos et al. (2017) was used on the resulting pool, and the results were manually inspected to remove erroneous matches.

The purpose of the dataset is twofold: i) to support the analysis of the dissemination patterns of real and fake user-generated videos (by analyzing the traits of the near-duplicate video cascades), and ii) to serve as a benchmark for the evaluation of automated video verification methods. The relatively large size of the dataset is important for both of these tasks. With respect to the study of dissemination patterns, the dataset provides the opportunity to study the spread of the same or similar content by analyzing associations between videos not provided by the original platform APIs, combined with the wealth of associated metadata. In parallel, a collection of 5,573 videos annotated as “verified” or “debunked” (even if many are near-duplicate versions of the 380 cases) can be used for the evaluation (or even training) of verification systems, based either on the visual content or on the associated video metadata.

The Fine-grained Incident Video Retrieval Dataset (FIVR-200K)

The FIVR-200K dataset (Kordopatis-Zilos et al., 2019) consists of 225,960 videos associated with 4,687 Wikipedia events and 100 selected video queries (see Figure 2 for examples). It has been designed to simulate the problem of Fine-grained Incident Video Retrieval (FIVR). The objective of this problem is: given a query video, retrieve all associated videos considering several types of associations with respect to an incident of interest. FIVR contains several retrieval tasks as special cases under a single framework. In particular, we consider three types of association between videos: a) Duplicate Scene Videos (DSV), which share at least one scene (originating from the same camera) regardless of any applied transformation, b) Complementary Scene Videos (CSV), which contain part of the same spatiotemporal segment, but captured from different viewpoints, and c) Incident Scene Videos (ISV), which capture the same incident, i.e. they are spatially and temporally close, but have no overlap.

For the collection of the dataset, we first crawled Wikipedia’s Current Events page to collect a large number of major news events that occurred between 2013 and 2017 (five years). Each news event is accompanied by a topic, headline, text, date, and hyperlinks. To collect videos of the same category, we retained only news events with the topic “Armed conflicts and attacks” or “Disasters and accidents”, which ultimately led to a total of 4,687 events after filtering. To gather videos around these events and build a large collection with numerous video pairs that are associated through the relations of interest (DSV, CSV and ISV), we queried the public YouTube API with the event headlines. To ensure that the collected videos capture the corresponding event, we retained only the videos published within a timespan of one week from the event date. This process resulted in the collection of 225,960 videos.
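
The following Python sketch illustrates this collection logic under simplifying assumptions: it uses the YouTube Data API v3 via google-api-python-client, a placeholder API key, and a hypothetical list of pre-parsed Wikipedia events; the actual crawler used for FIVR-200K may differ in its details.

```python
# Hedged sketch: keep only events with the two selected topics and query YouTube
# with each event headline, restricting results to videos published within one
# week of the event date.
from datetime import timedelta
from googleapiclient.discovery import build

API_KEY = "YOUR_YOUTUBE_DATA_API_KEY"  # placeholder
TOPICS = {"Armed conflicts and attacks", "Disasters and accidents"}

def collect_event_videos(events):
    """events: iterable of dicts with 'topic', 'headline' and 'date' (datetime)."""
    youtube = build("youtube", "v3", developerKey=API_KEY)
    video_ids = set()
    for event in events:
        if event["topic"] not in TOPICS:
            continue  # keep only the two selected news topics
        start = event["date"].strftime("%Y-%m-%dT00:00:00Z")
        end = (event["date"] + timedelta(days=7)).strftime("%Y-%m-%dT00:00:00Z")
        response = youtube.search().list(
            part="id",
            q=event["headline"],
            type="video",
            publishedAfter=start,   # within one week of the event date
            publishedBefore=end,
            maxResults=50,
        ).execute()
        for item in response.get("items", []):
            video_ids.add(item["id"]["videoId"])
    return video_ids
```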

  
Figure 2. A selection of query videos from the Fine-grained Incident Video Retrieval dataset. Click image to jump to larger version, link to YouTube video, and several associated videos.

Next, we proceeded with the selection of query videos. We set up an automated filtering and ranking process that implemented the following criteria: a) query videos should be relatively short and ideally focus on a single scene, b) queries should have many near-duplicate or same-incident videos within the dataset that are published by many different uploaders, and c) among a set of near-duplicate/same-incident videos, the one that was uploaded first should be selected as the query. This selection process was implemented using a graph-based clustering approach and resulted in 635 candidate query videos, of which we used the top 100 (ranked by corresponding cluster size) as the final query set.

For the annotation of similarity relations among videos, we followed a multi-step process, in which we presented annotators with the results of a similarity-based video retrieval system and asked them to indicate the type of relation through a drop-down list of the following labels: a) Near-Duplicate (ND), a special case where the whole video is near-duplicate to the query video, b) Duplicate Scene (DS), where only some scenes in the candidate video are near-duplicates of scenes in the query video, c) Complementary Scenes (CS), d) Incident Scene (IS), and e) Distractors (DI), i.e. irrelevant videos.

To make sure that annotators were presented with as many potentially relevant videos as possible, we used visual-only, text-only, and hybrid similarity in turn. As a result, each annotator reviewed video candidates that had very high similarity to the query video in terms of their visual content, their text metadata (title and description), or a combination of the two. Once an initial set of annotations was produced by two independent annotators, the annotators went through the annotations twice more to ensure consistency and accuracy.
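
As a rough illustration (not the exact pipeline used for FIVR-200K), the following Python sketch shows how candidate videos could be ranked by visual-only, text-only, or hybrid similarity; the pre-computed visual features and the equal weighting of the two similarities are assumptions made for the example.

```python
# Generic illustration of ranking candidates by a weighted visual/text similarity.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def rank_candidates(query_visual, cand_visual, query_text, cand_texts, w_visual=0.5):
    """Return candidate indices sorted from most to least similar to the query."""
    # Visual similarity: cosine between pre-computed feature vectors (assumed given).
    vis_sim = cosine_similarity(query_visual.reshape(1, -1), cand_visual)[0]

    # Text similarity: TF-IDF cosine between title+description strings.
    tfidf = TfidfVectorizer().fit([query_text] + list(cand_texts))
    vectors = tfidf.transform([query_text] + list(cand_texts))
    txt_sim = cosine_similarity(vectors[0], vectors[1:])[0]

    # Hybrid score: weighted combination of the two similarities.
    hybrid = w_visual * vis_sim + (1.0 - w_visual) * txt_sim
    return np.argsort(-hybrid)  # best candidates first
```

Setting `w_visual` to 1.0 or 0.0 yields the visual-only and text-only rankings, respectively.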

FIVR-200K was designed to serve as a benchmark that poses real-world challenges for the problem of reverse video search. Given a query video to be verified, an analyst would want to know whether the same or a very similar version of it has already been published. In that way, the user could easily debunk cases of out-of-context video use (i.e. misappropriation). On the other hand, if several videos are found that depict the same scene from different viewpoints at approximately the same time, they could be considered to corroborate the video of interest.

Discussion: Limitations and Caveats

We are confident that the two video verification datasets presented in this column can be valuable resources for researchers interested in the problem of media-based disinformation and could serve both as training sets and as benchmarks for automated video verification methods. Yet, both of them suffer from certain limitations and care should be taken when using them to draw conclusions. 

A first potential issue has to do with the video selection bias arising from the particular way each of the two datasets was created. The videos of the Fake Video Corpus were selected in a mixed manner: a number of cases were known to the dataset creators and their collaborators, and the set was enriched with test videos that had been submitted for analysis to a publicly available video verification service. As a result, the dataset is likely to be skewed towards viral and popular videos. In addition, only videos for which debunking or corroborating information was found online were included, which introduces yet another source of bias, potentially towards cases that were more newsworthy or clear-cut. In the case of the FIVR-200K dataset, videos were intentionally collected from two categories of newsworthy events with the goal of ending up with a relatively homogeneous collection, which would be challenging in terms of content-based retrieval. This means that certain types of content, such as political events, sports, and entertainment, are very limited or not present at all in the dataset.

A question related to the selection bias of the above datasets pertains to their relevance for multimedia verification and for real-world applications. In particular, it is not clear whether the video cases offered by the Fake Video Corpus are representative of actual verification tasks that journalists and news editors face in their daily work. Another important question is whether these datasets offer a realistic challenge to automatic multimedia analysis approaches. In the case of FIVR-200K, it was clearly demonstrated (Kordopatis-Zilos et al., 2019) that the dataset is a much harder benchmark for near-duplicate detection methods than previous datasets such as CC_WEB_VIDEO and VCDB. Even so, we cannot safely conclude that a method that performs very well on FIVR-200K would perform equally well on a dataset of much larger scale (e.g. millions or even billions of videos).

Another issue, which affects access to these datasets and the reproducibility of experimental results, relates to the ephemeral nature of online video content. A considerable (and increasing) part of these video collections has been taken down (either by the creators or by the video platform), which makes it impossible for researchers to gain access to the exact video set that was originally collected. To give a better sense of the problem, 21% of the Fake Video Corpus and 11% of the FIVR-200K videos were no longer available online as of September 2019. This issue, which affects all datasets based on online multimedia content, raises the more general question of whether there are steps that online platforms such as YouTube, Facebook, and Twitter could take to facilitate the reproducibility of social media research without violating copyright legislation or the platforms’ terms of service.

The ephemeral nature of online content is not the only factor that makes the value of multimedia datasets sensitive to the passing of time. Especially in the case of online disinformation, there seems to be an arms race in which new machine learning methods constantly get better at detecting misleading or tampered content, while at the same time new, increasingly AI-assisted types of misinformation emerge. This is particularly pronounced in the case of deepfakes, where the main research paradigm is based on the concept of competition between a generator (adversary) and a detector (Goodfellow et al., 2014).

Last but not least, one may always be concerned about the potential ethical issues arising when publicly releasing such datasets. In our case, reasonable concerns for privacy risks, which are always relevant when dealing with social media content, are addressed by complying with the relevant Terms of Service of the source platforms and by making sure that any annotation (label) assigned to the dataset videos is accurate. Additional ethical issues pertain to the potential “dual use” of such datasets, i.e. their use by adversaries to craft better tools and techniques that make misinformation campaigns more effective. A recent pertinent case was OpenAI’s delayed release of their very powerful GPT-2 model, which sparked numerous discussions and criticism and made clear that there is no commonly accepted practice for ensuring the reproducibility of research results (and empowering future research) while at the same time making sure that risks of misuse are minimized.

Future work

Given the challenges of creating and releasing a large-scale dataset for multimedia verification, the main conclusions from our efforts towards this direction so far are the following:

  • The field of multimedia verification is in constant motion, and therefore the concept of a static dataset may not be sufficient to capture the real-world nuances and latest challenges of the problem. Instead, new benchmarking models (e.g. in the form of open data challenges) and resources (e.g. a constantly updated repository of “fake” multimedia) appear to be more effective for empowering future research in the area.
  • The role of social media and multimedia sharing platforms (incl. YouTube, Facebook, Twitter, etc.) seems to be crucial in enabling effective collaboration between academia and industry towards addressing the real-world consequences of online misinformation. While there have been recent developments towards this direction, including the announcements by both Facebook and Alphabet’s Jigsaw of new deepfake datasets, there is also doubt and scepticism about the degree of openness and transparency that such platforms are ready to offer, given the conflicts of interest that are inherent in the underlying business model. 
  • Building a dataset that is fit for a highly diverse and representative set of verification cases appears to be a task that would require a community effort instead of effort from a single organisation or group. This would not only help towards distributing the massive dataset creation cost and effort to multiple stakeholders, but also towards ensuring less selection bias, richer and more accurate annotation and more solid governance.

References

Allcott, H., Gentzkow, M., “Social media and fake news in the 2016 election”, Journal of economic perspectives, 31(2), pp. 211–36, 2017.
Amerini, I., Ballan, L., Caldelli, R., Del Bimbo, A., Serra, G., “A SIFT-based forensic method for copy-move attack detection and transformation recovery”, IEEE Transactions on Information Forensics and Security, 6(3), pp. 1099–1110, 2011.
Boididou, C., Papadopoulos, S., Kompatsiaris, Y., Schifferes, S., Newman, N., “Challenges of computational verification in social multimedia”, In Proceedings of the 23rd ACM International Conference on World Wide Web, pp. 743–748, 2014.
Boididou, C., Andreadou, K., Papadopoulos, S., Dang-Nguyen, D.T., Boato, G., Riegler, M., Kompatsiaris, Y., “Verifying multimedia use at MediaEval 2015”. In Proceedings of MediaEval 2015, 2015.
Boididou C., Papadopoulos S., Dang-Nguyen D., Boato G., Riegler M., Middleton S.E., Petlund A., Kompatsiaris Y., “Verifying multimedia use at MediaEval 2016”. In Proceedings of MediaEval 2016, 2016.
Brandtzaeg, P.B., Lüders, M., Spangenberg, J., Rath-Wiggins, L., Følstad, A., “Emerging journalistic verification practices concerning social media”. Journalism Practice, 10(3), pp. 323–342, 2016.
Christlein V., Riess C., Jordan J., Riess C., Angelopoulou, E., “An evaluation of popular copy-move forgery detection approaches”. IEEE Transactions on Information Forensics & Security, 7(6), pp. 1841–1854, 2012.
Derczynski, L., Bontcheva, K., Liakata, M., Procter, R., Hoi, G.W.S., Zubiaga, A., “Semeval-2017 Task 8: Rumoureval: determining rumour veracity and support for rumours”, Proceedings of the 11th International Workshop on Semantic Evaluation, pp. 69–76, 2017.
Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Bengio, Y., “Generative adversarial nets”. In Advances in Neural Information Processing Systems, pp. 2672–2680, 2014.
Guan, H., Kozak, M., Robertson, E., Lee, Y., Yates, A.N., Delgado, A., Zhou, D., Kheyrkhah, T., Smith, J., Fiscus, J., “MFC datasets: Large-scale benchmark datasets for media forensic challenge evaluation”, In Proceedings of the 2019 IEEE Winter Applications of Computer Vision Workshops, pp. 63–72, 2019.
Güera, D., Delp, E.J., “Deepfake video detection using recurrent neural networks”, In Proceedings of the 15th IEEE International Conference on Advanced Video and Signal Based Surveillance, pp. 1–6, 2018.
Jiang, Y. G., Jiang, Y., Wang, J., “VCDB: A large-scale database for partial copy detection in videos”. In Proceedings of the European Conference on Computer Vision, pp. 357–371, 2014.
Kiesel, J., Mestre, M., Shukla, R., Vincent, E., Adineh, P., Corney, D., Stein, B., Potthast, M., “Semeval-2019 Task 4: Hyperpartisan news detection”. In Proceedings of the 13th International Workshop on Semantic Evaluation, pp. 829–839, 2019.
Kordopatis-Zilos, G., Papadopoulos, S., Patras, I., Kompatsiaris, I., “FIVR: Fine-grained incident video retrieval”. IEEE Transactions on Multimedia, 21(10), pp. 2638–2652, 2019.
Korus, P., Huang, J., “Multi-scale analysis strategies in PRNU-based tampering localization”, IEEE Transactions on Information Forensics & Security, 21(4), pp. 809–824, 2017.
Lazer, D.M., Baum, M.A., Benkler, Y., Berinsky, A.J., Greenhill, K.M., Menczer, F., Schudson, M., “The science of fake news”, Science, 359(6380), pp. 1094–1096, 2018.
Papadopoulou, O., Zampoglou, M., Papadopoulos, S., Kompatsiaris, I., “A corpus of debunked and verified user-generated videos”. Online Information Review, 43(1), pp. 72–88, 2019.
Rocha, A., Scheirer, W., Boult, T., Goldenstein, S., “Vision of the unseen: Current trends and challenges in digital image and video forensics”, ACM Computing Surveys, 43(4), art. 26, 2011.
Rössler, A. Cozzolino, D., Verdoliva, L., Riess, C., Thies, J., Nießner, M. “Faceforensics++: Learning to detect manipulated facial images”, In Proceedings of the IEEE International Conference on Computer Vision, 2019.
Rubin, V.L., Chen, Y., Conroy, N.J., “Deception detection for news: Three types of fakes”, In Proceedings of the 78th ASIS&T Annual Meeting: Information Science with Impact: Research in and for the Community, art. 83, 2015.
Tandoc Jr, E.C., Lim, Z.W., Ling, R. “Defining “fake news”: A typology of scholarly definitions”, Digital journalism, 6(2), pp. 137–153, 2018.
Tralic, D., Zupancic I., Grgic S., Grgic M., “CoMoFoD – New database for copy-move forgery detection”. In Proceedings of the 55th International Symposium on Electronics in Marine, pp. 49–54, 2013.
Wardle, C., Derakhshan, H., “Information disorder: Toward an interdisciplinary framework for research and policy making”, Council of Europe Report, 27, 2017.
Wu, X., Hauptmann, A.G., Ngo, C.-W., “Practical elimination of near-duplicates from web video search”, In Proceedings of the 15th ACM International Conference on Multimedia, pp. 218–227, 2007.
Zampoglou, M., Papadopoulos, S., Kompatsiaris, Y., “Detecting image splicing in the wild (web)”, In Proceedings of the 2015 IEEE International Conference on Multimedia & Expo Workshops, 2015.
Zampoglou, M., Papadopoulos, S., Kompatsiaris, Y., “Large-scale evaluation of splicing localization algorithms for web images”, Multimedia Tools and Applications, 76(4), pp. 4801–4834, 2017.
Zhou, X., Zafarani, R., “Fake news: A survey of research, detection methods, and opportunities”. arXiv preprint arXiv:1812.00315, 2018.
Zubiaga, A., Aker, A., Bontcheva, K., Liakata, M., Procter, R., “Detection and resolution of rumours in social media: A survey”, ACM Computing Surveys, 51(2), art. 32, 2018.

Appendix A: Examples of videos in the Fake Video Corpus.

Real videos


US Airways Flight 1549 ditched in the Hudson River.


A group of musicians playing in an Istanbul park while bombs explode outside the stadium behind them.


A giant alligator crossing a Florida golf course.

Fake videos


“Syrian boy rescuing a girl amid gunfire” – Staged (fabricated content): The video was filmed by Norwegian Lars Klevberg in Malta.


“Golden Eagle Snatches Kid” – Tampered: The video was created by a team of students in Montreal as part of their course on visual effects.


“Pope Francis slaps Donald Trump’s hand for touching him” – Satire/parody: The video was digitally manipulated, and was made for the late-night television show Jimmy Kimmel Live.

Appendix B: Examples of videos in the Fine-grained Incident Video Retrieval dataset.

Example 1


Query video from the American Airlines Flight 383 fire at Chicago O’Hare International Airport on October 28, 2016.


Duplicate scene video.


Complementary scene video.


Incident scene video.

Example 2


Query video from the Boston Marathon bombing on April 15, 2013.


Duplicate scene video.


Complementary scene video.


Incident scene video.

Example 3


Query video from the Las Vegas shooting on October 1, 2017.


Duplicate scene video.


Complementary scene video.


Incident scene video.

The V3C1 Dataset: Advancing the State of the Art in Video Retrieval

Download

In order to download the video dataset as well as its provided analysis data, please follow the instructions described here:

https://github.com/klschoef/V3C1Analysis/blob/master/README.md

Introduction

Standardized datasets are of vital importance in multimedia research, as they form the basis for reproducible experiments and evaluations. In the area of video retrieval, widely used datasets such as the IACC [5], which has formed the basis for the TRECVID Ad-Hoc Video Search Task and other retrieval-related challenges, have started to show their age. For example, IACC is no longer representative of video content as it is found in the wild [7]. This is illustrated by the figures below, showing the distribution of video age and duration across various datasets in comparison with a sample drawn from Vimeo and YouTube.

[Figures: distribution of video age and duration across various datasets compared with a sample drawn from Vimeo and YouTube]

Its recently released spiritual successor, the Vimeo Creative Commons Collection (V3C) [3], aims to remedy this discrepancy by offering a collection of freely reusable content sourced from the video hosting platform Vimeo (https://vimeo.com). The figures below show the age and duration distributions of the Vimeo sample from [7] in comparison with the properties of the V3C.

[Figures: age and duration distributions of the Vimeo sample from [7] compared with the V3C]

The V3C comprises three shards containing roughly 1,000, 1,300, and 1,500 hours of video content, respectively. It consists not only of the original videos themselves, but also comes with video shot-boundary annotations, as well as representative keyframes and thumbnail images for every such video shot. In addition, all the technical and semantic video metadata that was available on Vimeo is provided as well. The V3C has already been used in the 2019 edition of the Video Browser Showdown [2] and will also be used for the TRECVID AVS tasks (https://www-nlpir.nist.gov/projects/tv2019/) starting in 2019, with plans for continued use over the coming years. An overview video illustrating the type of content found within the dataset is also available.

Dataset & Collections

The three shards of V3C (V3C1, V3C2, and V3C3) contain Creative Commons videos sourced from the video hosting platform Vimeo. For this reason, the elements of the dataset may be freely used and publicly shared. The following table presents the composition of the dataset and the characteristics of its shards, as well as information on the dataset as a whole.

Partition                  V3C1                     V3C2                     V3C3                     Total
File Size (videos)         1.3 TB                   1.6 TB                   1.8 TB                   4.8 TB
File Size (total)          2.4 TB                   3.0 TB                   3.3 TB                   8.7 TB
Number of Videos           7,475                    9,760                    11,215                   28,450
Combined Video Duration    1,000 h 23 min 50 s      1,300 h 52 min 48 s      1,500 h 8 min 57 s       3,801 h 25 min 35 s
Mean Video Duration        8 min 2 s                7 min 59 s               8 min 1 s                8 min 1 s
Number of Segments         1,082,659                1,425,454                1,635,580                4,143,693

Similarly to IACC, V3C contains a master shot reference, which segments every video into non-overlapping shots based on the visual content of the videos. For every shot, a representative keyframe is included, as well as a thumbnail version of that keyframe. Furthermore, for each video, identified by a unique ID, a metadata file is available that contains both technical and semantic information, such as its categories. Vimeo assigns every video to categories and subcategories. Some of the categories were determined to be non-relevant for visual multimedia retrieval and analysis tasks and were dropped during the sourcing process of V3C. For simplicity, subcategories were generalized into their parent categories and are therefore not included. The remaining Vimeo categories are:

  • Arts & Design
  • Cameras & Techniques
  • Comedy
  • Fashion
  • Food
  • Instructionals
  • Music
  • Narrative
  • Reporting & Journals

Ground Truth and Analysis Data

As described above, the ground truth of the dataset consists of (deliberately over-segmented) shot boundaries as well as keyframes. Additionally, for the first shard of the V3C, the V3C1, we have already performed several analyses of the video content and metadata in order to provide an overview of the dataset [1].

In particular, we have analyzed specific content characteristics of the dataset, such as:

  • Bitrate distribution of the videos
  • Resolution distribution of the videos
  • Duration of shots
  • Dominant color of the keyframes
  • Similarity of the keyframes in terms of color layout, edge histogram, and deep features (weights extracted from the last fully-connected layer of GoogLeNet).
  • Confidence range distribution of the best class for shots detected by NasNet (using the best result out of the 1000 ImageNet classes) 
  • Number of different classes for a video detected by NasNet (using the best result out of the 1000 ImageNet classes)
  • Number of shots/keyframes for a specific content class
  • Number of shots/keyframes for a specific number of detected faces

This additional analysis data is available via GitHub, so that other researchers can take advantage of it. For example, one could use a specific subset of the dataset (only shots with blue keyframes, only videos with a specific bitrate or resolution, etc.) for further evaluations (e.g., for multimedia streaming and video coding, but of course also for image and video retrieval). Additionally, thanks to the public dataset and the accompanying analysis data, one could easily create an image and video retrieval system and use it either for participation in competitions like the Video Browser Showdown [2] or for submitting evaluation runs to the TRECVID Ad-hoc Video Search task.
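
As a hypothetical example of such filtering, the following Python sketch builds a sub-collection of shots from the analysis data; the file and column names ("dominant_color", "video_id", "resolution_height") are assumptions made for illustration, so the actual files in the V3C1Analysis repository should be consulted for the real layout.

```python
# Hedged sketch of selecting a dataset subset from the provided analysis data.
import pandas as pd

def select_shots(color_csv, metadata_csv, wanted_color="blue", min_height=1080):
    shots = pd.read_csv(color_csv)      # per-shot dominant-color analysis (assumed layout)
    videos = pd.read_csv(metadata_csv)  # per-video technical metadata (assumed layout)

    # Keep only shots whose dominant keyframe color matches the wanted color.
    shots = shots[shots["dominant_color"] == wanted_color]

    # Keep only shots from videos with at least the requested resolution.
    hd_videos = videos[videos["resolution_height"] >= min_height]["video_id"]
    return shots[shots["video_id"].isin(hd_videos)]
```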

Conclusion

In the broad field of multimedia retrieval and analytics, one of the key components of research is having useful and appropriate datasets in place to evaluate multimedia systems’ performance and benchmark their quality. The use of standard and open datasets enables researchers to reproduce analytical experiments based on these datasets and thus validate their results. In this context, the V3C dataset proves to be very diverse in several useful aspects (upload time, visual concepts, resolutions, colors, etc.). It also has no dominating characteristics and exhibits low self-similarity (i.e., few near-duplicates) [3].

Further, the richness of V3C in terms of content diversity and content attributes enables benchmarking multimedia systems in close-to-reality test environments. In contrast to other video datasets (cf. YouTube-8M [4] and IACC [5]), V3C also provides a wide variety of video encodings and bitrates, enabling research on video retrieval and analysis tasks that depend on those attributes. The large number of different video resolutions (and, to a lesser extent, frame rates) makes this dataset interesting for video transport and storage applications, such as the development of novel encoding schemes, streaming mechanisms, or error-correction techniques. Finally, in contrast to many current datasets, V3C also provides support for creating queries for evaluation campaigns such as VBS and TRECVID [6].

References

[1] Fabian Berns, Luca Rossetto, Klaus Schoeffmann, Christian Beecks, and George Awad. 2019. V3C1 Dataset: An Evaluation of Content Characteristics. In Proceedings of the 2019 on International Conference on Multimedia Retrieval (ICMR ’19). ACM, New York, NY, USA, 334-338.

[2] Jakub Lokoč, Gregor Kovalčík, Bernd Münzer, Klaus Schöffmann, Werner Bailer, Ralph Gasser, Stefanos Vrochidis, Phuong Anh Nguyen, Sitapa Rujikietgumjorn, and Kai Uwe Barthel. 2019. Interactive Search or Sequential Browsing? A Detailed Analysis of the Video Browser Showdown 2018. ACM Trans. Multimedia Comput. Commun. Appl. 15, 1, Article 29 (February 2019), 18 pages.

[3] Rossetto, L., Schuldt, H., Awad, G., & Butt, A. A. (2019). V3C–A Research Video Collection. In International Conference on Multimedia Modeling (pp. 349-360). Springer, Cham.

[4] Abu-El-Haija, S., Kothari, N., Lee, J., Natsev, P., Toderici, G., Varadarajan, B., & Vijayanarasimhan, S. (2016). Youtube-8m: A large-scale video classification benchmark. arXiv preprint arXiv:1609.08675.

[5] Paul Over, George Awad, Alan F. Smeaton, Colum Foley, and James Lanagan. 2009. Creating a web-scale video collection for research. In Proceedings of the 1st workshop on Web-scale multimedia corpus (WSMC ’09). ACM, New York, NY, USA, 25-32. 

[6] Smeaton, A. F., Over, P., and Kraaij, W. 2006. Evaluation campaigns and TRECVid. In Proceedings of the 8th ACM International Workshop on Multimedia Information Retrieval (Santa Barbara, California, USA, October 26 – 27, 2006). MIR ’06. ACM Press, New York, NY, 321-330.

[7] Luca Rossetto & Heiko Schuldt (2017). Web video in numbers-an analysis of web-video metadata. arXiv preprint arXiv:1707.01340.

ACM Multimedia 2019 and Reproducibility in Multimedia Research

In the first months of the new calendar year, multimedia researchers are traditionally hard at work on their ACM Multimedia submissions. (This year the submission deadline is 1 April.) Questions of reproducibility, including those of data set availability and release, are at the forefront of everyone’s mind. In this edition of SIGMM Records, the editors of the “Data Sets and Benchmarks” column have teamed up with two intersecting groups, the Reproducibility Chairs and the General Chairs of ACM Multimedia 2019, to bring you a column about reproducibility in multimedia research and the connection between reproducible research and publicly available data sets. The column highlights the activities of SIGMM towards implementing ACM paper badging. ACM MMSys has pushed our community forward on reproducibility and pioneered the use of ACM badging [1]. We are proud that in 2019 the newly established Reproducibility track will introduce badging at ACM Multimedia.

Complete information on Reproducibility at ACM Multimedia is available at:  https://project.inria.fr/acmmmreproducibility/

The importance of reproducibility

Researchers intuitively understand the importance of reproducibility. Too often, however, it is explained superficially, with statements such as, “If you don’t pay attention to reproducibility, your paper will be rejected”. The essence of the matter lies deeper: reproducibility is important because of its role in making scientific progress possible.

What is this role exactly? The reason that we do research is to contribute to the totality of knowledge at the disposal of humankind. If we think of this knowledge as a building, i.e. a sort of edifice, the role of reproducibility is to provide the strength and stability that makes it possible to build continually upwards. Without reproducibility, there would simply be no way of creating new knowledge.

ACM provides a helpful characterization of reproducibility: “An experimental result is not fully established unless it can be independently reproduced” [2]. In short, a result that is obtainable only once is not actually a result.

Reproducibility and scientific rigor are often mentioned in the same breath. Rigorous research provides systematic and sufficient evidence for its contributions. For example, in an experimental paper, the experiments must be properly designed and the conclusions of the paper must be directly supported by the experimental findings. Rigor involves careful analysis, interpretation, and reporting of the research results. Attention to reproducibility can be considered a part of rigor.

When we commit ourselves to reproducible research, we also commit ourselves to making sure that the research community has what it needs to reproduce our work. This means releasing the data that we use, and also releasing implementations of our algorithms. Devoting time and effort to reproducible research is an important way in which we support Open Science, the movement to make research resources and research results openly accessible to society.

Repeatability vs. Replicability vs. Reproducibility

We frequently use the word “reproducibility” in an informal way that includes three individual concepts, which actually have distinct formal uses: “repeatability”, “replicability” and “reproducibility”. Again, we can turn to ACM for definitions [2]. All three concepts express the idea that research results must be invariant with respect to changes in the conditions under which they were obtained.

Specifically, “repeatability” means that the same research team can achieve the same result using the same setup and resources. “Replicability” means that the original team can pass the setup and resources to a different research team, and that this second team can also achieve the same result. “Reproducibility” (here, used in the formal sense) means that a different team can achieve the same result using a different setup and different resources. Note the connection to scientific rigor: obtaining the same result multiple times via a process that lacks rigor is meaningless.

When we write a research paper paying attention to reproducibility, it means that we are confident we would obtain the same results again within our own research team, that the paper includes a detailed description of how we achieved the result (and is accompanied by code or other resources), and that we are convinced that other researchers would reach the same conclusions using a comparable, but not identical, setup and resources.

Reproducibility at ACM Multimedia 2019

ACM Multimedia 2019 promotes reproducibility in two ways: First, as usual, reproducibility is one of the review criteria considered by the reviewers (https://www.acmmm.org/2019/reviewer-guidelines/). It is critical that authors describe their approach clearly and completely, and do not omit any details of their implementation or evaluation. Authors should release their data and also provide experimental results on publicly available data. Finally, increasingly, we are seeing authors who include a link to their code or other resources associated with the paper. Releasing resources should be considered a best practice.

The second way that ACM Multimedia 2019 promotes reproducibility is the new Reproducibility Track. Full information is available on the ACM Multimedia Reproducibility website [3]. The purpose of the track is to ensure that authors receive recognition for the effort they have dedicated to making their research reproducible, and also to assign ACM badges to their papers. Next, we summarize the concept of ACM badges, then we will return to discuss the Reproducibility Track in more detail.

ACM Paper badging

Here, we provide a short summary of the information on badging available on the ACM website at [2]. ACM introduced a system of badges in order to help push forward the processes by which papers are reviewed. The goal is to move the attention given to reproducibility to a new level, beyond the level achieved during traditional reviews. Badges seek to motivate authors to use practices leading to better replicability, with the idea that replicability will in turn lead to reproducibility.

In order to understand the badge system, it is helpful to know that ACM badges are divided into two categories: “Artifacts Evaluated” and “Results Evaluated”. ACM defines artifacts as digital objects that are created for the purpose of, or as a result of, carrying out research. Artifacts include implementation code as well as scripts used to run experiments, analyze results, or generate plots. Critically, they also include the data sets that were used in the experiment. The different “Artifacts Evaluated” badges reflect the level of care that authors put into making the artifacts available, including how far they go beyond the minimal functionality necessary and how well the artifacts are documented.

There are two “Results Evaluated” badges: the “Results Replicated” badge, which results from a replicability review, and the “Results Reproduced” badge, which results from a full reproducibility review, in which the referees have succeeded in reproducing the results of the paper using only the descriptions provided by the authors, and without any of the authors’ artifacts. ACM Multimedia adopts the ACM idea that replicability leads to full reproducibility, and for this reason chooses to focus in its first year on the “Results Replicated” badge. Next we turn to a discussion of the ACM Multimedia 2019 Reproducibility Track and how it implements the “Results Replicated” badge.

Badging ACM MM 2019

Authors of main-conference papers appearing at ACM Multimedia 2018 or 2017 are eligible to make a submission to the Reproducibility Track of ACM Multimedia 2019. The submission has two components: An archive containing the resources needed to replicate the paper, and a short companion paper that contains a description of the experiments that were carried out in the original paper and implemented in the archive. The submissions undergo a formal reproducibility review, and submissions that pass receive a “Results Replicated” badge, which is added to the original paper in the ACM Digital Library. The companion paper appears in the proceedings of ACM Multimedia 2019 (also with a badge) and is presented at the conference as a poster.

ACM defines the badges, but the choice of which badges to award, and how to implement the review process that leads to the badge, is left to the individual conferences. The consequence is that the design and implementation of the ACM Multimedia Reproducibility Track requires a number of important decisions as well as careful implementation.

A key consideration when designing the ACM Multimedia Reproducibility Track was the work of the reproducibility reviewers. These reviewers carry out tasks that go beyond those of main-conference reviewers, since they must use the authors’ artifacts to replicate their results. The track is designed such that the reproducibility reviewers are deeply involved in the process. Because the companion paper is submitted a year after the original paper, reproducibility reviewers have plenty of time to dive into the code and work together with the authors. During this intensive process, the reviewers extend the originally submitted companion paper with a description of the review process and become authors on the final version of the companion paper.

The ACM Multimedia Reproducibility Track is expected to run similarly in years beyond 2019. The experience gained in 2019 will allow future years to tweak the process in small ways if it proves necessary, and also to expand to other ACM badges.

The visibility of badged papers is important for ACM Multimedia. Visibility incentivizes the authors who submit work to the conference to apply best practices in reproducibility. Practically, the visibility of badges also allows researchers to quickly identify work that they can build on. If a paper presenting new research results has a badge, researchers can immediately understand that this paper would be straightforward to use as a baseline, or that they can build confidently on the paper results without encountering ambiguities, technical issues, or other time-consuming frustrations.

The link between reproducibility and multimedia data sets

The link between Reproducibility and Multimedia Data Sets has been pointed out before, for example, in the theme chosen by the ACM Multimedia 2016 MMCommons workshop, “Datasets, Evaluation, and Reproducibility” [4]. One of the goals of this workshop was to discuss how data challenges and benchmarking tasks can catalyze the reproducibility of algorithms and methods.

Researchers who dedicate time and effort to creating and publishing data sets are making a valuable contribution to research. In order to compare the effectiveness of two algorithms, all other aspects of the evaluation must be controlled, including the data set that is used. Making data sets publicly available supports the systematic comparison of algorithms that is necessary to demonstrate that new algorithms are capable of outperforming the state of the art.

Considering the definitions of “replicability” and “reproducibility” introduced above, additional observations can be made about the importance of multimedia data sets. Creating and publishing data sets supports replicability. In order to replicate a research result, the same resources as used in the original experiments, including the data set, must be available to research teams beyond the one who originally carried out the research.

Creating and publishing data sets also supports reproducibility (in the formal sense of the word defined above). In order to reproduce research results, however, it is necessary that there is more than one data set available that is suitable for carrying out evaluation of a particular approach or algorithm. Critically, the definition of reproducibility involves using different resources than were used in the original work. As the multimedia community continues to move from replication to reproduction, it is essential that a large number of data sets are created and published, in order to ensure that multiple data sets are available to assess the reproducibility of research results.

Acknowledgements

Thank you to the people whose hard work is making reproducibility at ACM Multimedia happen: this includes the 2019 TPC Chairs, main-conference ACs and reviewers, as well as the Reproducibility reviewers. If you would like to volunteer to be a reproducibility committee member in this or future years, please contact the Reproducibility Chairs at MM19-Repro@sigmm.org.

[1] Simon, Gwendal. Reproducibility in ACM MMSys Conference. Blogpost, 9 May 2017 http://peerdal.blogspot.com/2017/05/reproducibility-in-acm-mmsys-conference.html Accessed 9 March 2019.

[2] ACM, Artifact Review and Badging, Reviewed April 2018,  https://www.acm.org/publications/policies/artifact-review-badging Accessed 9 March 2019.

[3] ACM MM Reproducibility: Information on Reproducibility at ACM Multimedia https://project.inria.fr/acmmmreproducibility/ Accessed 9 March 2019.

[4] Bart Thomee, Damian Borth, and Julia Bernd. 2016. Multimedia COMMONS Workshop 2016 (MMCommons’16): Datasets, Evaluation, and Reproducibility. In Proceedings of the 24th ACM international conference on Multimedia (MM ’16). ACM, New York, NY, USA, 1485-1486.

Predicting the Emotional Impact of Movies

Affective video content analysis aims at the automatic recognition of emotions elicited by videos. It has a large number of applications, including mood-based personalized content recommendation [1], video indexing [2], and efficient movie visualization and browsing [3]. Beyond the analysis of existing video material, affective computing techniques can also be used to generate new content, e.g., movie summarization [4] or personalized soundtrack recommendation to make user-generated videos more attractive [5]. Affective techniques can furthermore be used to enhance user engagement with advertising content by optimizing the way ads are inserted inside videos [6].

While major progress has been achieved in computer vision for visual object detection, high-level concept recognition, and scene understanding, a natural further step is the modeling and recognition of affective concepts. This has recently received increasing interest from research communities, e.g., computer vision and machine learning, with an overall goal of endowing computers with human-like perception capabilities.

Efficient training and benchmarking of computational models, however, require a large and diverse collection of data annotated with ground truth, which is often difficult to collect, particularly in the field of affective computing. To address this issue we created the LIRIS-ACCEDE dataset. In contrast to most existing datasets that contain few video resources and have limited accessibility due to copyright constraints, LIRIS-ACCEDE consists of videos with a large content diversity annotated along emotional dimensions. The annotations are made according to the expected emotion of a video, which is the emotion that the majority of the audience feels in response to the same content. All videos are shared under Creative Commons licenses and can thus be freely distributed without copyright issues. The dataset (videos, annotations, features and protocols) is publicly available and is currently composed of a total of six collections.


Credits and license information: (a) Cloudland, LateNite Films, shared under CC BY 3.0 Unported license at http://vimeo.com/17105083, (b) Origami, ESMA MOVIES, shared under CC BY 3.0 Unported license at http://vimeo.com/52560308, (c) Payload, Stu Willis, shared under CC BY 3.0 Unported license at http://vimeo.com/50509389, (d) The room of Franz Kafka, Fred. L’Epee, shared under CC BY-NC-SA 3.0 Unported license at http://vimeo.com/14482569, (e) Spaceman, Jono Schaferkotter & Before North, shared under CC BY-NC 3.0 Unported License license at http://vodo.net/spaceman.

Dataset & Collections

The LIRIS-ACCEDE dataset is composed of movies and excerpts from movies under Creative Commons licenses that enable the dataset to be publicly shared. The set contains 160 professionally made and amateur movies, covering different genres such as horror, comedy, drama and action. Languages are mainly English, with a small set of Italian, Spanish, French and other movies subtitled in English. The set has been used to create the six collections that are part of the dataset. The two collections that were originally proposed are the Discrete LIRIS-ACCEDE collection, which contains short excerpts of movies, and the Continuous LIRIS-ACCEDE collection, which comprises long movies. Moreover, since 2015, the set has been used for tasks related to affect/emotion at the MediaEval Benchmarking Initiative for Multimedia Evaluation [7], where each year it was enriched with new data, features and annotations. Thus, the dataset also includes the four additional collections dedicated to these tasks.

The movies are available together with emotional annotations. When dealing with emotional video content analysis, the goal is to automatically recognize emotions elicited by videos. In this context, three types of emotions can be considered: intended, induced and expected emotions [8]. The intended emotion is the emotion that the film maker wants to induce in the viewers. The induced emotion is the emotion that a viewer feels in response to the movie. The expected emotion is the emotion that the majority of the audience feels in response to the same content. While the induced emotion is subjective and context dependent, the expected emotion can be considered objective, as it reflects the more-or-less unanimous response of a general audience to a given stimulus [8]. Thus, the LIRIS-ACCEDE dataset focuses on the expected emotion. The representation of emotions we consider is the dimensional one, based on valence and arousal. Valence is defined on a continuous scale from most negative to most positive emotions, while arousal is defined continuously from calmest to most active emotions [9]. Moreover, violence annotations were provided in the MediaEval 2015 Affective Impact of Movies collection, while fear annotations were provided in the MediaEval 2017 and 2018 Emotional Impact of Movies collections.

  • Discrete LIRIS-ACCEDE collection: a total of 160 films from various genres, split into 9,800 short clips with valence and arousal annotations. More details below.
  • Continuous LIRIS-ACCEDE collection: a total of 30 films with valence and arousal annotations per second. More details below.
  • MediaEval 2015 Affective Impact of Movies collection: a subset of the films with labels for the presence of violence, as well as for the felt valence and arousal. More details below.
  • MediaEval 2016 Emotional Impact of Movies collection: a subset of the films with score annotations for the expected valence and arousal. More details below.
  • MediaEval 2017 Emotional Impact of Movies collection: a subset of the films with valence and arousal values and a label for the presence of fear for each 10-second segment, as well as precomputed features. More details below.
  • MediaEval 2018 Emotional Impact of Movies collection: a subset of the films with valence and arousal values for each second, begin-end times of scenes containing fear, as well as precomputed features. More details below.

Ground Truth

The ground truth for the Discrete LIRIS-ACCEDE collection consists of the ranking of all video clips along both the valence and arousal dimensions. These rankings were obtained through a pairwise video clip comparison protocol designed for crowdsourcing (using the CrowdFlower service). For each pair of video clips presented to raters, they had to select the one that conveyed the given emotion, in terms of valence or arousal, most strongly. The high inter-annotator agreement that was achieved indicates that the annotations were highly consistent, despite the large diversity of our raters’ cultural backgrounds. Affective ratings (scores) were also collected for a subset of the 9,800 clips in order to cross-validate the crowdsourced annotations. These affective ratings also made it possible to learn a Gaussian Process for Regression, modeling the noisiness of the measurements and mapping the whole ranked LIRIS-ACCEDE dataset into the 2D valence-arousal affective space. More details can be found in [10].
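To make the last step more concrete, the following is a minimal sketch (not the authors’ actual pipeline, which is described in [10]) of how Gaussian Process Regression can map crowdsourced ranks to affective scores; the ranks and ratings below are hypothetical placeholders.

```python
# Minimal sketch: mapping crowdsourced ranks to affective scores with Gaussian
# Process Regression, in the spirit of the protocol described above.
# The data below are hypothetical placeholders, not the actual LIRIS-ACCEDE values.
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel

# Ranks (1..9800) of the subset of clips for which absolute ratings were collected,
# together with their mean crowdsourced valence ratings.
rated_ranks = np.array([[12], [530], [2200], [4800], [7300], [9650]], dtype=float)
rated_valence = np.array([-0.85, -0.40, -0.10, 0.15, 0.55, 0.90])

# WhiteKernel models the measurement noise of the ratings; RBF models the smooth
# trend from rank to score.
kernel = RBF(length_scale=2000.0) + WhiteKernel(noise_level=0.05)
gpr = GaussianProcessRegressor(kernel=kernel, normalize_y=True)
gpr.fit(rated_ranks, rated_valence)

# Predict a valence score (with uncertainty) for every rank in the full collection.
all_ranks = np.arange(1, 9801, dtype=float).reshape(-1, 1)
valence_scores, valence_std = gpr.predict(all_ranks, return_std=True)
```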

To collect the ground truth for the Continuous and MediaEval 2016, 2017 and 2018 collections, which consists of valence and arousal scores for every movie second, French annotators had to continuously indicate their level of valence and arousal while watching the movies, using a modified version of the GTrace annotation tool [16] and a joystick. Each annotator continuously annotated one subset of the movies considering the induced valence, and another subset considering the induced arousal. Thus, each movie was continuously annotated by three to five different annotators. The continuous valence and arousal annotations were then down-sampled by averaging them over windows of 10 seconds with a shift of 1 second (i.e., yielding one value per second) in order to remove noise due to unintended movements of the joystick. Finally, the post-processed continuous annotations were averaged across annotators in order to create a continuous mean signal of the valence and arousal self-assessments, ranging from -1 (most negative for valence, most passive for arousal) to +1 (most positive for valence, most active for arousal). The details of this process are given in [11].
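The windowed averaging described above can be sketched as follows; this is an illustrative re-implementation that assumes each annotator’s raw trace has already been resampled to one sample per second, not the authors’ original post-processing code.

```python
# Illustrative sketch of the post-processing described above: average each
# annotator's per-second trace over 10-second windows with a 1-second shift,
# then average across annotators. Assumes traces are already sampled at 1 Hz
# and scaled to [-1, 1].
import numpy as np

def smooth_trace(trace_1hz, window=10):
    """Average a 1 Hz annotation trace over sliding windows (one value per second)."""
    trace = np.asarray(trace_1hz, dtype=float)
    return np.array([trace[max(0, t - window + 1): t + 1].mean()
                     for t in range(len(trace))])

def mean_signal(annotator_traces, window=10):
    """Smooth each annotator's trace, then average across annotators."""
    smoothed = np.vstack([smooth_trace(t, window) for t in annotator_traces])
    return smoothed.mean(axis=0)

# Example: three annotators, one (hypothetical) valence value per second.
traces = [np.random.uniform(-1, 1, 300) for _ in range(3)]
valence_per_second = mean_signal(traces)   # one value in [-1, 1] per movie second
```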

The ground truth for the violence annotation, used in the MediaEval 2015 Affective Impact of Movies collection, was collected as follows. First, all the videos were annotated separately by two groups of annotators from two different countries. In each group, regular annotators labeled all the videos, which were then reviewed by master annotators. Regular annotators were graduate students (typically single with no children) and master annotators were senior researchers. Within each group, each video received two different annotations, which were then merged by the master annotators into the final annotation for the group. Finally, the annotations from the two groups were merged and reviewed once more by the task organizers. The details can be found in [12].

The ground truth for the fear annotations, used in the MediaEval 2017 and 2018 Emotional Impact of Movies collections, was generated using a tool specifically designed for the classification of audio-visual media, which allows annotation to be performed while watching the movie. The annotations were made by two experienced team members of NICAM [17], both trained in the classification of media. Each movie was annotated by one annotator, who reported the start and stop times of each sequence in the movie expected to induce fear.

Conclusion

Through its six collections, the LIRIS-ACCEDE dataset constitutes a dataset of choice for affective video content analysis. It is one of the largest datasets for this purpose, and is regularly enriched with new data, features and annotations. In particular, it is used for the Emotional Impact of Movies tasks at the MediaEval Benchmarking Initiative for Multimedia Evaluation. As all the movies are under Creative Commons licenses, the whole dataset can be freely shared and used by the research community, and is available at http://liris-accede.ec-lyon.fr.

Discrete LIRIS-ACCEDE collection [10]
In total, 160 films and short films of different genres were used and segmented into 9,800 video clips. The total running time of all 160 films is 73 hours, 41 minutes and 7 seconds, and a video clip was extracted on average every 27 seconds. The 9,800 segmented video clips last between 8 and 12 seconds, a length that is long enough to yield consistent excerpts allowing the viewer to feel an emotion, while being short enough that the viewer feels only one emotion per excerpt.

The content of the movie was also considered to create homogeneous, consistent and meaningful excerpts that were not meant to disturb the viewers. A robust shot and fade in/out detection was implemented to make sure that each extracted video clip started and ended with a shot or a fade. Furthermore, the order of excerpts within a film was kept, allowing the study of temporal transitions of emotions.

Several movie genres are represented in this collection of movies, such as horror, comedy, drama, action, and so on. Languages are mainly English with a small set of Italian, Spanish, French and others subtitled in English. For this collection the 9,800 video clips are ranked according to valence, from the clip inducing the most negative emotion to the most positive, and to arousal, from the clip inducing the calmest emotion to the most active emotion. Besides the ranks, the emotional scores (valence and arousal) are also provided for each clip.

Continuous LIRIS-ACCEDE collection [11]
The movie clips of the Discrete collection were annotated globally, i.e., a single value of valence and arousal was used to represent a whole 8 to 12-second video clip. In order to allow deeper investigations into the temporal dependencies of emotions (since a felt emotion may influence the emotions felt afterwards), longer movies were considered in this collection. To this end, a selection of 30 movies from the set of 160 was made such that their genre, content, language and duration were diverse enough to be representative of the original Discrete LIRIS-ACCEDE dataset. The selected videos are between 117 and 4,566 seconds long (mean = 884.2s ± 766.7s SD). The total length of the 30 selected movies is 7 hours, 22 minutes and 5 seconds. The emotional annotations consist of a score of expected valence and arousal for each second of each movie.

MediaEval 2015 Affective Impact of Movies collection [12]
This collection has been used as the development and test sets for the MediaEval 2015 Affective Impact of Movies Task. The overall use case scenario of the task was to design a video search system that used automatic tools to help users find videos that fitted their particular mood, age or preferences. To address this, two subtasks were proposed:

  • Induced affect detection: the emotional impact of a video or movie can be a strong indicator for search or recommendation;
  • Violence detection: detecting violent content is an important aspect of filtering video content based on age.

The 9,800 video clips from the Discrete LIRIS-ACCEDE collection were used as the development set, and an additional 1,100 movie clips were provided as the test set. For each of the 10,900 video clips, the annotations consist of: a binary value indicating the presence of violence, the class of the excerpt for felt arousal (calm-neutral-active), and the class for felt valence (negative-neutral-positive).

MediaEval 2016 Emotional Impact of Movies collection [13]
The MediaEval 2016 Emotional Impact of Movies task required participants to deploy multimedia features to automatically predict the emotional impact of movies, in terms of valence and arousal. Two subtasks were proposed:

  • Global emotion prediction: given a short video clip (around 10 seconds), participants’ systems were expected to predict a score of induced valence (negative-positive) and induced arousal (calm-excited) for the whole clip;
  • Continuous emotion prediction: as an emotion felt during a scene may be influenced by the emotions felt during the previous scene(s), the purpose here was to consider longer videos and to predict the valence and arousal continuously along the video. Thus, a score of induced valence and arousal was to be provided for each 1-second segment of each video.

The development set was composed of the Discrete LIRIS-ACCEDE part for the first subtask, and the Continuous LIRIS-ACCEDE part for the second subtask. In addition to the development set, a test set was also provided to assess the performance of participants’ methods. A total of 49 new movies under Creative Commons licenses were added. Following the same protocol as for the development set, 1,200 additional short video clips (between 8 and 12 seconds) were extracted for the first subtask, while 10 long movies (from 25 minutes to 1 hour and 35 minutes, for a total duration of 11.48 hours) were selected for the second subtask. Thus, the annotations consist of a score of expected valence and arousal for each movie clip used in the first subtask, and a score of expected valence and arousal for each second of the movies in the second subtask.

MediaEval 2017 Emotional Impact of Movies collection [14]
This collection was used for the MediaEval 2017 Emotional Impact of Movies task. Here, only long movies were considered, and emotion was considered in terms of valence, arousal and fear. Two subtasks were proposed, for which the emotional impact had to be predicted for consecutive 10-second segments sliding over the whole movie with a shift of 5 seconds (a minimal sketch of this segmentation follows the list below):

  • Valence/Arousal prediction: participants’ systems were supposed to predict a score of expected valence and arousal for each consecutive 10-second segment;
  • Fear prediction: the purpose here was to predict whether each consecutive 10-second segment was likely to induce fear or not. The targeted use case was the prediction of frightening scenes to help systems protect children from potentially harmful video content. This subtask is complementary to the valence/arousal prediction task in the sense that discrete emotions often overlap when mapped into the 2D valence/arousal space (for instance, fear, disgust and anger all combine very negative valence with high arousal).
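As referenced above, the segment grid shared by both subtasks (10-second windows sliding with a 5-second shift) can be generated as in the following minimal sketch; the movie duration is a placeholder.

```python
# Sketch: consecutive 10-second segments sliding over a movie with a 5-second shift,
# as used by the MediaEval 2017 subtasks. The duration value is a placeholder.
def sliding_segments(duration_s, window=10, shift=5):
    """Yield (start, end) times in seconds of every segment to be predicted."""
    start = 0
    while start + window <= duration_s:
        yield (start, start + window)
        start += shift

segments = list(sliding_segments(duration_s=3600))   # e.g., a one-hour movie
# -> [(0, 10), (5, 15), (10, 20), ...]
```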

The Continuous LIRIS-ACCEDE collection was used as the development set for both subtasks. The test set consisted of 14 new movies under Creative Commons licenses, selected outside the 160 original movies. They are between 210 and 6,260 seconds long, and the total length of the 14 selected movies is 7 hours, 57 minutes and 13 seconds. In addition to the video data, general-purpose audio and visual content features were also provided, including deep features, Fuzzy Color and Texture Histograms, and Gabor features. The annotations consist of a valence value, an arousal value and a binary value for each 10-second segment, the latter indicating whether the segment was expected to induce fear or not.

MediaEval 2018 Emotional Impact of Movies collection [15]
The MediaEval 2018 Emotional Impact of Movies task is similar to the 2017 task. However, in this case, more data was provided and the emotional impact had to be predicted for every second of the movies rather than for 10-second segments as before. The two subtasks were:

  • Valence and Arousal prediction: participants’ systems had to predict a score of expected valence and arousal continuously (every second) for each movie;
  • Fear detection: the purpose here was to predict beginning and ending times of sequences inducing fear in movies. The targeted use case was the detection of frightening scenes to help systems protecting children from potentially harmful video content.

The development set for both subtasks consisted of the movies from the Continuous LIRIS-ACCEDE collection, as well as the test set of the MediaEval 2017 Emotional Impact of Movies collection, i.e. 44 movies for a total duration of 15 hours and 20 minutes. The test set consisted of 12 other movies selected from the set of 160 movies, for a total duration of 8 hours and 56 minutes. As for the 2017 collection, general-purpose audio and visual content features were provided in addition to the video data. The annotations consist of valence and arousal values for each second of the movies (for the first subtask) as well as the beginning and ending times of each sequence inducing fear (for the second subtask).
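Since the 2018 fear subtask asks for begin-end times rather than per-second labels, a prediction pipeline typically has to convert per-second binary decisions into intervals. The following is a minimal sketch of that conversion; the per-second predictions are hypothetical.

```python
# Sketch: turning per-second binary fear predictions into (begin, end) sequences,
# matching the interval-style annotations of the 2018 fear subtask.
# The prediction vector below is a hypothetical example.
def fear_intervals(per_second_labels):
    """Convert a list of 0/1 labels (one per second) into (begin, end) times."""
    intervals, start = [], None
    for t, label in enumerate(per_second_labels):
        if label and start is None:
            start = t                      # a fear sequence begins
        elif not label and start is not None:
            intervals.append((start, t))   # it ends just before second t
            start = None
    if start is not None:
        intervals.append((start, len(per_second_labels)))
    return intervals

print(fear_intervals([0, 0, 1, 1, 1, 0, 0, 1, 1, 0]))   # [(2, 5), (7, 9)]
```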

Acknowledgments

This work was supported in part by the French research agency ANR through the VideoSense Project under the Grant 2009 CORD 026 02 and through the Visen project within the ERA-NET CHIST-ERA framework under the grant ANR-12-CHRI-0002-04.

Contact

Should you have any inquiries or questions about the dataset, do not hesitate to contact us by email at: emmanuel dot dellandrea at ec-lyon dot fr.

References

[1] L. Canini, S. Benini, and R. Leonardi, “Affective recommendation of movies based on selected connotative features”, in IEEE Transactions on Circuits and Systems for Video Technology, 23(4), 636–647, 2013.
[2] S. Zhang, Q. Huang, S. Jiang, W. Gao, and Q. Tian. 2010, “Affective visualization and retrieval for music video”, in IEEE Transactions on Multimedia 12(6), 510–522, 2010.
[3] S. Zhao, H. Yao, X. Sun, X. Jiang, and P. Xu, “Flexible presentation of videos based on affective content analysis”, in Advances in Multimedia Modeling, 2013.
[4] H. Katti, K. Yadati, M. Kankanhalli, and C. Tat-Seng, “Affective video summarization and story board generation using pupillary dilation and eye gaze”, in IEEE International Symposium on Multimedia (ISM), 2011.
[5] R. R. Shah, Y. Yu, and R. Zimmermann, “Advisor: Personalized video soundtrack recommendation by late fusion with heuristic rankings”, in ACM International Conference on Multimedia, 2014.
[6] K. Yadati, H. Katti, and M. Kankanhalli, “Cavva: Computational affective video-in-video advertising”, in IEEE Transactions on Multimedia 16(1), 15–23, 2014.
[7] http://www.multimediaeval.org/
[8] A. Hanjalic, “Extracting moods from pictures and sounds: Towards truly personalized TV”, in IEEE Signal Processing Magazine, 2006.
[9] J.A. Russell, “Core affect and the psychological construction of emotion”, in Psychological Review, 2003.
[10] Y. Baveye, E. Dellandrea, C. Chamaret, and L. Chen, “LIRIS-ACCEDE: A Video Database for Affective Content Analysis,” in IEEE Transactions on Affective Computing, 2015.
[11] Y. Baveye, E. Dellandrea, C. Chamaret, and L. Chen, “Deep Learning vs. Kernel Methods: Performance for Emotion Prediction in Videos,” in 2015 Humaine Association Conference on Affective Computing and Intelligent Interaction (ACII), 2015.
[12] M. Sjöberg, Y. Baveye, H. Wang, V. L. Quang, B. Ionescu, E. Dellandréa, M. Schedl, C.-H. Demarty, and L. Chen, “The mediaeval 2015 affective impact of movies task,” in MediaEval 2015 Workshop, 2015.
[13] E. Dellandrea, L. Chen, Y. Baveye, M. Sjoberg and C. Chamaret, “The MediaEval 2016 Emotional Impact of Movies Task”, in Working Notes Proceedings of the MediaEval 2016 Workshop, Hilversum, The Netherlands, October 20-21, 2016.
[14] E. Dellandrea, M. Huigsloot, L. Chen, Y. Baveye and M. Sjoberg, “The MediaEval 2017 Emotional Impact of Movies Task”, in Working Notes Proceedings of the MediaEval 2017 Workshop, Dublin, Ireland, September 13-15, 2017.
[15] E. Dellandréa, M. Huigsloot, L. Chen, Y. Baveye, Z. Xiao and M. Sjöberg, “The MediaEval 2018 Emotional Impact of Movies Task”, Working Notes Proceedings of the MediaEval 2018 Workshop, Sophia Antipolis, France, October 29-31, 2018.
[16] R. Cowie, M. Sawey, C. Doherty, J. Jaimovich, C. Fyans, and P. Stapleton, “Gtrace: General trace program compatible with emotionML”, in Humaine Association Conference on Affective Computing and Intelligent Interaction (ACII), 2013.
[17] http://www.kijkwijzer.nl/nicam.

Sharing and Reproducibility in ACM SIGMM


This column discusses the efforts of ACM SIGMM towards sharing and reproducibility. Apart from the specific sessions dedicated to open source software and datasets, ACM Multimedia Systems started providing official ACM badges for articles that make their artifacts available last year. This year marked a record, with 45% of the articles acquiring such a badge.


Without data it is impossible to put theories to the test. Moreover, without running code it is tedious at best to (re)produce and evaluate any results. Yet collecting data and writing code can be a road full of pitfalls, ranging from datasets containing copyrighted materials to algorithms containing bugs. The ideal datasets and software packages are those that are open and transparent for the world to look at, inspect, and use without or with limited restrictions. Such “artifacts” make it possible to establish public consensus on their correctness or otherwise to start a dialogue on how to fix any identified problems.

In our interconnected world, storing and sharing information has never been easier. Despite the temptation for researchers to keep datasets and software to themselves, a growing number are willing to share their resources with others. To further promote this sharing behavior, conferences, workshops, publishers, non-profit and even for-profit companies are increasingly recognizing and supporting these efforts. For example, the ACM Multimedia conference has hosted an open source software competition since 2004, and the ACM Multimedia Systems conference has included an open datasets and software track since 2011. The ACM Digital Library now also hands out badges to public artifacts that have been made available and optionally reviewed and verified by members of the community. At the same time, organizations such as Zenodo and Amazon host open datasets for free. Sharing ultimately pays off: the citation statistics for ACM Multimedia Systems conferences over the past five years, for example, show that half of the 20 most cited papers shared data and code, although such papers have represented only a small fraction of the published papers so far.


Good practices are increasingly adopted. In this year’s edition of the ACM Multimedia Systems conference, 69 works (papers, demos, datasets, software) were accepted, out of which 31 (45%) were awarded an ACM badge. This is a large increase compared to last year, when only 13 out of 42 works (31%) received one. This greatly advances one of the core objectives of both the conference and SIGMM towards open science. At this moment, the ACM Digital Library does not separately index which papers received a badge, making it challenging to find all papers that have one. It further appears that not many other ACM conferences are aware of the badges yet; for example, while ACM Multimedia accepted 16 open source papers in 2016 and 6 papers in 2017, none applied for a badge. This year at ACM Multimedia Systems only “artifacts available” badges have been awarded. For next year our intention is to ensure that all dataset and software submissions receive the “artifacts evaluated” badge. This would require several committed community members to spend time working with the authors to get the artifacts running on all major platforms with corresponding detailed documentation.

The accepted artifacts this year are diverse in nature: several submissions focus on releasing artifacts related to quality of experience of (mobile/wireless) streaming video, while others center on making datasets and tools related to images, videos, speech, sensors, and events available; in addition, there are a number of contributions in the medical domain. It is great to see such a range of interests in our community!

Socially significant music events

Social media sharing platforms (e.g., YouTube, Flickr, Instagram, and SoundCloud) have revolutionized how users access multimedia content online. Most of these platforms provide a variety of ways for the user to interact with the different types of media: images, video, music. In addition to watching or listening to the media content, users can also engage with it in different ways, e.g., like, share, tag, or comment. Social media sharing platforms have become an important resource for scientific researchers, who aim to develop new indexing and retrieval algorithms that can improve users’ access to multimedia content and, as a result, enhance the experience provided by social media sharing platforms.

Historically, the multimedia research community has focused on developing multimedia analysis algorithms that combine visual and text modalities. Less highly visible is research devoted to algorithms that exploit an audio signal as the main modality. Recently, awareness for the importance of audio has experienced a resurgence. Particularly notable is Google’s release of the AudioSet, “A large-scale dataset of manually annotated audio events” [7]. In a similar spirit, we have developed the “Socially Significant Music Event“ dataset that supports research on music events [3]. The dataset contains Electronic Dance Music (EDM) tracks with a Creative Commons license that have been collected from SoundCloud. Using this dataset, one can build machine learning algorithms to detect specific events in a given music track.

What are socially significant music events? Within a music track, listeners are able to identify certain acoustic patterns as nameable music events.  We call a music event “socially significant” if it is popular in social media circles, implying that it is readily identifiable and an important part of how listeners experience a certain music track or music genre. For example, listeners might talk about these events in their comments, suggesting that these events are important for the listeners (Figure 1).

Traditional music event detection has only tackled low-level events like music onsets [4] or music auto-tagging [8, 10]. In our dataset, we consider events that are at a higher abstraction level than low-level musical onsets. In auto-tagging, descriptive tags are associated with 10-second music segments. These tags generally fall into three categories: musical instruments (guitar, drums, etc.), musical genres (pop, electronic, etc.) and mood-based tags (serene, intense, etc.). These types of tags are different from the events we detect in this dataset. The events in our dataset have a particular temporal structure, unlike the categories that are the target of auto-tagging. Additionally, we analyze the entire music track and detect the start points of music events, rather than labeling short segments as in auto-tagging.

There are three music events in our Socially Significant Music Event dataset: Drop, Build, and Break. These events can be considered to form the basic set of events used by EDM producers [1, 2]. They have a certain temporal structure internal to themselves, which can be of varying complexity. Their social significance is visible from the large number of timed comments related to these events on SoundCloud (Figures 1 and 2). The three events are popular in social media circles, with listeners often mentioning them in comments. Here, we define these events [2]:

  1. Drop: A point in the EDM track, where the full bassline is re-introduced and generally follows a recognizable build section
  2. Build: A section in the EDM track, where the intensity continuously increases and generally climaxes towards a drop
  3. Break: A section in an EDM track with a significantly thinner texture, usually marked by the removal of the bass drum
Figure 1. Screenshot from SoundCloud showing a list of timed comments left by listeners on a music track [11].


SoundCloud

SoundCloud is an online music sharing platform that allows users to record, upload, promote and share their self-created music. SoundCloud started out as a platform for amateur musicians, but currently many leading music labels are also represented. One of the interesting features of SoundCloud is that it allows “timed comments” on the music tracks. “Timed comments” are comments, left by listeners, associated with a particular time point in the music track. Our “Socially Significant Music Events” dataset is inspired by the potential usefulness of these timed comments as ground truth for training music event detectors. Figure 2 contains an example of a timed comment: “That intense buildup tho” (timestamp 00:46). We could potentially use this as a training label to detect a build, for example. In a similar way, listeners also mention the other events in their timed comments. So, these timed comments can serve as training labels to build machine learning algorithms to detect events.

Figure 2. Screenshot from SoundCloud indicating the useful information present in the timed comments. [11]


SoundCloud also provides a well-documented API [6] with interfaces to many programming languages: Python, Ruby, JavaScript, etc. Through this API, one can download the music tracks (if allowed by the uploader), the timed comments, and also other metadata related to the track. We used this API to collect our dataset. Via the search functionality we searched for tracks uploaded during the year 2014 with a Creative Commons license, which resulted in a list of tracks with unique identification numbers. We then looked at the timed comments of these tracks for the keywords drop, break and build, kept the tracks whose timed comments contained a reference to these keywords, and discarded the other tracks.
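As a rough illustration of this collection procedure, the sketch below queries the classic public SoundCloud HTTP API. The endpoint paths, parameter names (q, license, created_at, client_id) and comment fields (timestamp, body) are assumptions based on that documented API and may have changed since, so treat this as a sketch rather than working code.

```python
# Hedged sketch of the collection procedure described above, assuming the classic
# SoundCloud HTTP API (endpoints, parameter names and response fields are
# assumptions here, not verified against the current API).
import requests

API = "https://api.soundcloud.com"
CLIENT_ID = "YOUR_CLIENT_ID"                    # placeholder credential
KEYWORDS = ("drop", "build", "break")

def search_cc_tracks(query, limit=50):
    """Search for Creative Commons tracks uploaded in 2014 (assumed parameters)."""
    params = {"q": query, "license": "cc-by", "limit": limit,
              "created_at[from]": "2014-01-01", "created_at[to]": "2014-12-31",
              "client_id": CLIENT_ID}
    return requests.get(f"{API}/tracks", params=params).json()

def timed_comments(track_id):
    """Fetch timed comments for a track; 'timestamp' is assumed to be in ms."""
    return requests.get(f"{API}/tracks/{track_id}/comments",
                        params={"client_id": CLIENT_ID}).json()

def keep_track(track):
    """Keep a track only if some timed comment mentions one of the event keywords."""
    comments = timed_comments(track["id"])
    return any(any(k in (c.get("body") or "").lower() for k in KEYWORDS)
               for c in comments if c.get("timestamp") is not None)

candidates = search_cc_tracks("electronic dance music")
dataset_tracks = [t for t in candidates if keep_track(t)]
```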

Dataset

The dataset contains 402 music tracks with an average duration of 4.9 minutes. Each track is accompanied by timed comments relating to Drop, Build, and Break. It is also accompanied by ground truth labels that mark the true locations of the three events within the tracks. The labels were created by a team of experts. Unlike many other publicly available music datasets that provide only metadata or short previews of music tracks [9], we provide the entire track for research purposes. The download instructions for the dataset can be found here: [3]. All the music tracks in the dataset are distributed under Creative Commons licenses. Some statistics of the dataset are provided in Table 1.

Table 1. Statistics of the dataset: Number of events, Number of timed comments

Event name | Total number of events | Events per track | Total number of timed comments | Timed comments per track
Drop       | 435                    | 1.08             | 604                            | 1.50
Build      | 596                    | 1.48             | 609                            | 1.51
Break      | 372                    | 0.92             | 619                            | 1.54

The main purpose of the dataset is to support training of detectors for the three events of interest (Drop, Build, and Break) in a given music track. These three events can be considered a case study to prove that it is possible to detect socially significant musical events, opening the way for future work on an extended inventory of events. Additionally, the dataset can be used to understand the properties of timed comments related to music events. Specifically, timed comments can be used to reduce the need for manually acquired ground truth, which is expensive and difficult to obtain.

Timed comments present an interesting research challenge: temporal noise. The timed comments and the actual events do not always coincide; a comment can appear at the same position as the event, before it, or after it. For example, in the music track shown below (Figure 3), there is a timed comment about a drop at 00:40, while the actual drop occurs only at 01:00. Because of this noisy nature, we cannot use the timed comments alone as ground truth. We need strategies to handle the temporal noise in order to use timed comments for training [1].
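One simple way to make this temporal noise concrete is to measure, for each timed comment mentioning an event, its offset to the nearest ground-truth occurrence of that event. The sketch below does exactly that; the comment and event times are hypothetical.

```python
# Sketch: quantifying the temporal noise of timed comments by measuring the offset
# (in seconds) between each comment and the nearest ground-truth event of the same
# type. The example times are hypothetical.
def comment_offsets(comment_times, event_times):
    """For each comment time, return the signed offset to the nearest true event
    (negative = comment precedes the event, positive = comment follows it)."""
    return [min((c - e for e in event_times), key=abs) for c in comment_times]

drop_comments = [40, 125, 300]      # seconds at which listeners mention a drop
drop_events = [60, 120, 290]        # ground-truth drop times from the annotations
print(comment_offsets(drop_comments, drop_events))   # [-20, 5, 10]
```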

Figure 3. Screenshot from SoundCloud indicating the noisy nature of timed comments [11].


In addition to music event detection, our “Socially Significant Music Event” dataset opens up other possibilities for research. Timed comments have the potential to improve users’ access to music and to support them in discovering new music. Specifically, timed comments mention aspects of music that are difficult to derive from the signal, and may be useful to calculate song-to-song similarity needed to improve music recommendation. The fact that the comments are related to a certain time point is important because it allows us to derive continuous information over time from a music track. Timed comments are potentially very helpful for supporting listeners in finding specific points of interest within a track, or deciding whether they want to listen to a track, since they allow users to jump-in and listen to specific moments, without listening to the track end-to-end.

State of the art

The detection of music events requires training classifiers that are able to generalize over the variability in the audio signal patterns corresponding to events. In Figure 4, we see that the build-drop combination has a characteristic pattern in the spectral representation of the music signal. The build is a sweep-like structure and is followed by the drop, which we indicate by a red vertical line. More details about the state-of-the-art features useful for music event detection and the strategies to filter the noisy timed comments can be found in our publication [1].

Figure 4. The spectral representation of the musical segment containing a drop. You can observe the sweeping structure indicating the buildup. The red vertical line is the drop.


The evaluation metric used to measure the performance of a music event detector should be chosen according to the user scenario for that detector. For example, if the music event detector is used for non-linear access (i.e., creating jump-in points along the playbar), it is important that the detected time point of the event falls before, rather than after, the actual event. In this case, we recommend using the “event anticipation distance” (ea_dist) as a metric. The ea_dist is the amount of time by which the predicted event time point precedes an actual event time point, and represents the time the user would have to wait to listen to the actual event. More details about ea_dist can be found in our paper [1].
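A minimal interpretation of this metric, for a single predicted jump-in point matched to the next actual event, is sketched below; the full evaluation protocol (including how predictions that fall after an event are handled) is defined in [1], so this is only an illustration.

```python
# Sketch of the event anticipation distance (ea_dist) for one prediction:
# the time by which the predicted jump-in point precedes the next actual event,
# i.e., how long the user would wait before hearing the event. Handling of
# predictions that fall after every event follows [1] and is not modeled here.
def ea_dist(predicted_time, actual_event_times):
    following = [t for t in actual_event_times if t >= predicted_time]
    if not following:
        return None                    # no later event to anticipate; see [1]
    return min(following) - predicted_time

print(ea_dist(42, [60, 180, 300]))     # 18 seconds of anticipation
```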

In [1], we report the implementation of a baseline music event detector that uses only timed comments as training labels. This detector attains an ea_dist of 18 seconds for a drop. We point out that, from the user’s point of view, this level of performance could already lead to quite useful jump-in points, since the typical length of a build-drop combination is between 15 and 20 seconds. If the user is positioned 18 seconds before the drop, the build has already started and the user knows that a drop is coming. Using an optimized combination of timed comments and manually acquired ground truth labels, we are able to achieve an ea_dist of 6 seconds.

Conclusion

Timed comments, on their own, can be used as training labels to train detectors for socially significant events. A detector trained on timed comments performs reasonably well in applications like non-linear access, where the listener wants to jump through different events in the music track without listening to it in its entirety. We hope that the dataset will encourage researchers to explore the usefulness of timed comments for all media. Additionally, we would like to point out that our work has demonstrated that the impact of temporal noise can be overcome and that the contribution of timed comments to video event detection is worth investigating further.

Contact

Should you have any inquiries or questions about the dataset, do not hesitate to contact us via email at: n.k.yadati@tudelft.nl

References

[1] K. Yadati, M. Larson, C. Liem and A. Hanjalic, “Detecting Socially Significant Music Events using Temporally Noisy Labels,” in IEEE Transactions on Multimedia. 2018. http://ieeexplore.ieee.org/document/8279544/

[2] M. Butler, Unlocking the Groove: Rhythm, Meter, and Musical Design in Electronic Dance Music, ser. Profiles in Popular Music. Indiana University Press, 2006 

[3] http://osf.io/eydxk

[4] http://www.music-ir.org/mirex/wiki/2017:Audio_Onset_Detection

[5] https://developers.soundcloud.com/docs/api/guide

[6] https://developers.soundcloud.com/docs/api/guide

[7] https://research.google.com/audioset/

[8] H. Y. Lo, J. C. Wang, H. M. Wang and S. D. Lin, “Cost-Sensitive Multi-Label Learning for Audio Tag Annotation and Retrieval,” in IEEE Transactions on Multimedia, vol. 13, no. 3, pp. 518-529, June 2011. http://ieeexplore.ieee.org/document/5733421/

[9] http://majorminer.org/info/intro

[10] http://www.music-ir.org/mirex/wiki/2016:Audio_Tag_Classification

[11] https://soundcloud.com/spinninrecords/ummet-ozcan-lose-control-original-mix

Diversity and Credibility for Social Images and Image Retrieval

Social media has established itself as an inextricable component of today’s society. Images make up a large proportion of the items shared on social media [1]. The popularity of social image sharing has contributed to the popularity of the Retrieving Diverse Social Images task at the MediaEval Benchmarking Initiative for Multimedia Evaluation [2]. Since its introduction in 2013, the task has attracted large participation and has published a set of datasets of outstanding value to the multimedia research community.

The task, and the datasets it has released, target a novel facet of multimedia retrieval, namely the search result diversification of social images. The task is defined as follows: Given a large number of images, retrieved by a social media image search engine, find those that are not only relevant to the query, but also provide a diverse view of the topic/topics behind the query (see an example in Figure 1). The features and methods needed to address the task successfully are complex and span different research areas (image processing, text processing, machine learning). For this reason, when creating the collections used in the Retrieving Diverse Social Images Tasks, we also created a set of baseline features. The features are released with the datasets. In this way, task participants who have expertise in one particular research area may focus on that area and still participate in the full evaluation.

Figure 1: Example of retrieval and diversification results for query “Pingxi Sky Lantern Festival” (results are truncated to the first 14 images for better visualization): (top images) Flickr initial retrieval results; (bottom images) diversification achieved with the approach from the TUW team (best approach at MediaEval 2015).


The collections

Before describing the individual collections, it needs to be noted that all data consist of redistributable Creative Commons Flickr and Wikipedia content and are freely available for download (follow the instructions here [3]). Although the task also ran in 2017, we focus in the following on the datasets already released, namely: Div400, Div150Cred, Div150Multi and Div150Adhoc (corresponding to the 2013-2016 evaluation campaigns). Each of the four datasets available so far covers different aspects of the diversification challenge, either from the perspective of the task/use-case addressed, or from the data that can be used to address the task. Table 1 gives an overview of the four datasets, which we describe in more detail over the next four subsections. Each of the datasets is divided into a development set and a test set. Although the division into development and test data is arbitrary, for comparability of results and full reproducibility, users of the collections are advised to maintain this separation when performing their experiments.

Table 1: Dataset statistics (devset – development data, testset – testing data, credibilityset – data for estimating user tagging credibility, single (s) – single-topic queries, multi (m) – multi-topic queries, ++ – enhanced/updated content, POI – location point of interest, events – events and states associated with locations, general – general purpose ad-hoc topics).

Div400

In 2013, the task started with a narrowly defined use-case scenario, where a tourist, upon deciding to visit a particular location, reads the corresponding Wikipedia page and desires to see a diverse set of images from that location. Queries here might be “Big Ben in London” or “Palazzo delle Albere in Italy”. For each such query, we know the GPS coordinates, the name, and the Wikipedia page, including an example image of the destination. As a search pool, we consider the top 150 photos obtained from Flickr using the name as a search query. These photos come with some metadata (photo ID, title, description, tags, geotagging information, date when the photo was taken, owner’s name, number of times the photo has been displayed, URL in Flickr, license type, number of comments on the photo) [4].

In addition to providing the raw data, the collection also contains visual and text features of the data, such that researchers who are only interested in one of the two, can use the other without investing additional time in generating a baseline set of features.

As visual descriptors, for each of the images in the collection, we provide:

  • Global color naming histogram
  • Global histogram of oriented gradients
  • Global color moments on HSV
  • Global Locally Binary Patterns on gray scale
  • Global Color Structure Descriptor
  • Global statistics on gray level Run Length Matrix (Short Run Emphasis, Long Run Emphasis, Gray-Level Non-uniformity, Run Length Non-uniformity, Run Percentage, Low Gray-Level Run Emphasis, High Gray-Level Run Emphasis, Short Run Low Gray-Level Emphasis, Short Run High Gray-Level Emphasis, Long Run Low Gray-Level Emphasis, Long Run High Gray-Level Emphasis)
  • Local spatial pyramid representations (3×3) of each of the previous descriptors

As textual descriptors we provide the classic term frequency (TF(t, d) – the number of occurrences of term t in document d) and document frequency (DF(t) – the number of documents containing term t). Note that the datasets are not limited to a single notion of document. The most direct definition of a “document” is an image, which can be either retrieved or not retrieved. However, it is easily conceivable that the relative frequency of a term in the set of images corresponding to one topic, or in the set of images corresponding to one user, might also be of interest when ranking the importance of a result for a query. Therefore, the collection also contains statistics that take a document to be a topic, as well as a user. All these are provided both as CSV files and as Lucene index files. The former can be used as part of a custom weighting scheme, while the latter can be deployed directly in a Lucene/Solr search engine to obtain results based on the text without further effort.
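The different notions of “document” (image, topic, user) mentioned above can be made concrete with a small sketch like the following, where the tag lists are hypothetical.

```python
# Sketch: computing term frequency and document frequency while varying what counts
# as a "document" (a single image, or all images of a topic). The records are
# hypothetical examples, not actual collection data.
from collections import Counter

# Each record: (topic, user, list of text terms attached to one image).
images = [
    ("big_ben_london", "user_a", ["big", "ben", "london", "clock"]),
    ("big_ben_london", "user_b", ["london", "night", "clock"]),
    ("palazzo_albere", "user_a", ["palazzo", "albere", "trento"]),
]

def tf_df(documents):
    """documents: list of term lists. Returns (per-document TF counters, DF counter)."""
    tfs = [Counter(terms) for terms in documents]
    df = Counter(term for tf in tfs for term in tf)   # documents containing each term
    return tfs, df

# Document = image
_, df_image = tf_df([terms for _, _, terms in images])
# Document = topic (concatenate all terms of a topic's images)
by_topic = {}
for topic, _, terms in images:
    by_topic.setdefault(topic, []).extend(terms)
_, df_topic = tf_df(list(by_topic.values()))

print(df_image["london"], df_topic["london"])   # 2 images vs. 1 topic contain "london"
```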

Div150Cred

The tourism use case also underlies Div150Cred, but a component addressing the concept of user tagging credibility is added. The idea here is that not all users tag their photos in a manner that is useful for retrieval and, for this reason, it makes sense to consider, in addition to the visual and text descriptors also used in Div400, another feature set – user credibility features. Each of the 153 topics (30 in the development set and 123 in the test set) therefore comes, in addition to the visual and text features of each image, with a value indicating the credibility of the user. This value is estimated automatically based on a set of features, so in addition to the retrieval development and test sets, Div150Cred also contains a credibility set, which we used to generate the credibility of each user and which can be used by any interested researcher to build better credibility estimators.

The credibility set contains images for approximately 300 locations from 685 users (a total of 3.6 million images). For each user there is a manually assigned credibility score as well as an automatically estimated one, based on the following features:

  • Visual score – learned predictor of a user’s consistent and relevant tagging behavior
  • Face proportion
  • Tag specificity
  • Location similarity
  • Photo count
  • Unique tags
  • Upload frequency
  • Bulk proportion

For each of these features, the underlying intuition and the actual calculation are detailed in the collection report [5].
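
To illustrate how such a credibility set can be used, the sketch below fits a simple ridge regressor that maps the automatically extracted per-user features to the manually assigned credibility scores. The feature matrix here is randomly generated toy data standing in for the real files, and the model choice is an assumption for the example; the estimator actually used for the released credibility values is described in the collection report [5].

    import numpy as np
    from sklearn.linear_model import Ridge
    from sklearn.model_selection import cross_val_score

    # Hypothetical per-user feature matrix with one column per credibility feature
    # (visual score, face proportion, tag specificity, location similarity,
    #  photo count, unique tags, upload frequency, bulk proportion).
    rng = np.random.default_rng(0)
    X = rng.random((685, 8))       # 685 users, 8 features (toy values)
    y_manual = rng.random(685)     # manually assigned credibility scores (toy values)

    model = Ridge(alpha=1.0)
    scores = cross_val_score(model, X, y_manual, cv=5, scoring="r2")
    print("cross-validated R^2:", scores.mean())

    # A fitted model can then produce a credibility estimate for unseen users.
    model.fit(X, y_manual)
    new_user_features = rng.random((1, 8))
    print("estimated credibility:", model.predict(new_user_features)[0])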

Div150Multi

Div150Multi adds another twist to the task of the search engine and its tourism use case. Now, the topics are not simply points of interest, but rather a combination of a main concept and a qualifier, namely multi-topic queries about location-specific events, location aspects or general activities (e.g., “Oktoberfest in Munich”, “Bucharest in winter”). In terms of features, however, the collection builds on those used in Div400 and Div150Cred, while expanding the pool of resources researchers have at their disposal. In terms of credibility, in addition to the eight features listed above, we now also have:

  • Mean Photo Views
  • Mean Title Word Counts
  • Mean Tags per Photo
  • Mean Image Tag Clarity

Again, the collection report [6] is the reference material for the intuition and the formulas behind these features.

A new set of descriptors, based on convolutional neural networks (CNNs), has also been made available:

  • CNN generic: a descriptor based on the reference convolutional neural network (CNN) model provided along with the Caffe framework [7]. This model is trained on the 1,000 ImageNet classes used during the ImageNet challenge. The descriptors are extracted from the last fully connected layer of the network (named fc7).
  • CNN adapted: these features were also computed using the Caffe framework, with the reference model architecture, but using images of 1,000 landmarks instead of the ImageNet classes. We collected approximately 1,200 Web images for each landmark and fed them directly to Caffe for training [8]. As with CNN generic, the descriptors were extracted from the last fully connected layer of the network (i.e., fc7); a brief feature-extraction sketch in the same spirit follows this list.
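
The collection was built with Caffe, but the idea of an fc7-style descriptor can be illustrated with any AlexNet-like network. The following sketch uses torchvision's pretrained AlexNet, a close relative of the Caffe reference model, and reads out the activations of its second fully connected layer; it is an illustrative substitute, not the pipeline used to generate the released descriptors.

    import torch
    import torchvision.models as models
    import torchvision.transforms as T
    from PIL import Image

    # AlexNet pretrained on ImageNet; its second fully connected layer plays the
    # role of the fc7 layer of the Caffe reference model.
    model = models.alexnet(weights=models.AlexNet_Weights.IMAGENET1K_V1)
    model.eval()

    features = {}
    def hook(module, inputs, output):
        features["fc7"] = output.detach()

    # classifier[4] is the second 4096-unit Linear layer in torchvision's AlexNet.
    model.classifier[4].register_forward_hook(hook)

    preprocess = T.Compose([
        T.Resize(256),
        T.CenterCrop(224),
        T.ToTensor(),
        T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
    ])

    img = preprocess(Image.open("photo.jpg").convert("RGB")).unsqueeze(0)
    with torch.no_grad():
        model(img)

    descriptor = features["fc7"].squeeze(0)   # 4096-dimensional fc7-style descriptor
    print(descriptor.shape)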

Div150AdHoc

For this dataset, the definition of relevance was expanded from previous years with the introduction of even more challenging multi-topic queries that are unrelated to points of interest. These queries address the diversification problem for a general ad-hoc image retrieval system, where general-purpose multi-topic queries are used for retrieving the images (e.g., “animals at Zoo”, “flying planes on blue sky”, “hotel corridor”). The Div150AdHoc collection includes most of the previously described credibility descriptors, but drops the face proportion and location similarity features, as they are no longer relevant for the new retrieval scenario. The visual score descriptor was also updated to keep up with the latest advances in CNN descriptors: when training individual visual models, the Overfeat visual descriptor is replaced by the representation produced by the last fully connected layer of the network described in [9]. Full details are available in the collection report [10].

Ground-truth and state-of-the-art

Each of the above collections comes with an associated ground truth created by human assessors. As the focus is on both relevance and diversity, the ground truth and the metrics reflect this: Precision at cutoff (primarily P@20) is used for relevance, and Cluster Recall at cutoff (primarily CR@20), i.e., the fraction of a topic's ground-truth clusters that are represented among the top-ranked results, is used for diversity.
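
A minimal sketch of these two metrics is given below. It assumes a ranked list of photo IDs per topic, a set of relevant IDs, and a mapping from relevant IDs to ground-truth cluster labels, which is a simplification of the official evaluation tooling.

    def precision_at(ranked_ids, relevant_ids, cutoff=20):
        """P@cutoff: fraction of the top results that are relevant."""
        top = ranked_ids[:cutoff]
        return sum(1 for pid in top if pid in relevant_ids) / float(cutoff)

    def cluster_recall_at(ranked_ids, cluster_of, num_clusters, cutoff=20):
        """CR@cutoff: fraction of ground-truth clusters represented in the top results.

        cluster_of maps a relevant photo ID to its cluster label;
        num_clusters is the total number of clusters annotated for the topic.
        """
        covered = {cluster_of[pid] for pid in ranked_ids[:cutoff] if pid in cluster_of}
        return len(covered) / float(num_clusters)

    # Toy usage (IDs and clusters are made up):
    ranked = ["a", "b", "c", "d"]
    relevant = {"a", "c", "d"}
    clusters = {"a": 0, "c": 1, "d": 1}
    print(precision_at(ranked, relevant, cutoff=4))          # 3 of 4 relevant -> 0.75
    print(cluster_recall_at(ranked, clusters, 3, cutoff=4))  # 2 of 3 clusters -> ~0.67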

Figure 2 shows an overview of the results obtained by participants in the evaluation campaigns over the period 2013-2016, and serves as a baseline for future experiments on these collections. Results presented here are on the test set alone. The reader may find more information about the methods in the MediaEval proceedings, which are listed on the Retrieving Diverse Social Images yearly task pages on the MediaEval website (http://multimediaeval.org/).

Figure 2. Evolution of the diversification performance for the different datasets, in terms of precision (P) and cluster recall (CR) at different cut-off values. Boxplots: the box covers the interquartile range (IQR), i.e., the middle 50% of the values; the line within the box is the median; the whiskers extend to 1.5*IQR; points marked (+) outside the whiskers are outliers. The Flickr baseline represents the initial Flickr retrieval result for the corresponding dataset.

Conclusions

The Retrieving Diverse Social Images task datasets, as their name indicates, address the problem of retrieving images while taking into account both the need to diversify the results presented to the user and the potential lack of credibility in users' tagging behavior. They are built on top of already state-of-the-art retrieval technology (i.e., the Flickr retrieval system), which makes it possible to focus on the challenge of image diversification. Moreover, the datasets are not limited to images, but also include rich social information. The credibility component, represented by the credibility subsets of the last three collections, is unique to this set of benchmark datasets.

Acknowledgments

The Retrieving Diverse Social Images task datasets were made possible by the effort of a large team of people over an extended period of time. The contributions of the authors were essential. Further, we would like to acknowledge the many team members who contributed to annotating the images and making the MediaEval task possible; please see the yearly Retrieving Diverse Social Images task pages on the MediaEval website (http://multimediaeval.org/).

Contact

Should you have any inquiries or questions about the datasets, don’t hesitate to contact us via email at: bionescu at imag dot pub dot ro.

References

[1] http://contentmarketinginstitute.com/2015/11/visual-content-strategy/ (last visited 2017-11-29).

[2] http://www.multimediaeval.org/

[3] http://www.campus.pub.ro/lab7/bionescu/publications.html#datasets

[4] http://campus.pub.ro/lab7/bionescu/Div400.html

[5] http://campus.pub.ro/lab7/bionescu/Div150Cred.html

[6] http://campus.pub.ro/lab7/bionescu/Div150Multi.html

[7] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, and T. Darrell, “Caffe: Convolutional architecture for fast feature embedding” in ACM International Conference on Multimedia, 2014, pp. 675–678.

[8] E. Spyromitros-Xioufis, S. Papadopoulos, A. L. Ginsca, A. Popescu, Y. Kompatsiaris, and I. Vlahavas, “Improving diversity in image search via supervised relevance scoring” in ACM International Conference on Multimedia Retrieval, 2015, pp. 323–330.

[9] K. Chatfield, K. Simonyan, A. Vedaldi, and A. Zisserman, “Return of the devil in the details: Delving deep into convolutional nets” arXiv preprint arXiv:1405.3531, 2014.

[10] http://campus.pub.ro/lab7/bionescu/Div150Adhoc.html