Can the Multimedia Research Community, via Quality of Experience, contribute to a better Quality of Life?

Can the multimedia community contribute to a better Quality of Life? Delivering a higher-resolution and distortion-free media stream so you can enjoy the latest movie on Netflix or YouTube may provide instantaneous satisfaction, but does it make your life better in the long term? Whilst the QoMEX conference series has traditionally considered the former, in more recent years, and with a view to QoMEX 2020, research works that consider the latter are also welcome. In this context, rather than looking at what we do, reflecting on how we do it could offer opportunities for sustained rather than instantaneous impact in fields such as health, inclusive of assistive technologies (AT), and digital heritage, among many others.

In this article, we ask if the concepts from the Quality of Experience (QoE) [1] framework model can be applied, adapted and reimagined to inform and develop tools and systems that enhance our Quality of Life. The World Health Organisation (WHO) definition of health states that “[h]ealth is a state of complete physical, mental and social well-being and not merely the absence of disease or infirmity” [2]. This definition is well aligned with the familiar yet ill-defined term, Quality of Life (QoL). Whilst QoL still requires work towards a concrete definition, the definition of QoE has been developed through work by the QUALINET EU COST Network [3]. A white paper [1] resulted from this effort; using multimedia quality as a use case, it describes the human, context, service and system factors that influence the quality of experience for multimedia systems.

Fig. 1: (a) Quality of Experience and (b) Quality of Life (reproduced from [4]).

The QoE formation process has been mapped to a conceptual model that allows systems and services to be evaluated and improved, and such models have been developed and used to predict QoE. Adapting and applying these methods to health-related QoL would allow predictive models of QoL to be developed.

In this context, the best paper award winner at QoMEX in 2017 [4] proposed such a mapping for QoL in stroke prevention, care and rehabilitation (Fig. 1) and examined practical challenges for modeling and applications. The process of identifying and categorizing factors and features was illustrated using stroke patient treatment as an example use case, and this work has continued through the European Union Horizon 2020 research project PRECISE4Q [5]. For medical practitioners, a QoL framework can assist in the development of decision support systems, patient monitoring, and imaging systems.
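To make the idea of a predictive QoL model more concrete, here is a minimal sketch of how grouped influence factors (human, context, system) could be combined into a single score, mirroring the structure of the QoE formation process. The factor names, weights, and linear form are invented for illustration; they are not the model of [4] or PRECISE4Q.

```python
# Sketch: combine grouped influence factors into a QoL estimate.
# All features, values, and weights below are illustrative assumptions.

HUMAN = {"mobility": 0.6, "mood": 0.8}       # hypothetical patient factors (0..1)
CONTEXT = {"social_support": 0.7}            # hypothetical context factors (0..1)
SYSTEM = {"therapy_adherence": 0.9}          # hypothetical care/system factors (0..1)

WEIGHTS = {"mobility": 0.3, "mood": 0.3, "social_support": 0.2, "therapy_adherence": 0.2}

def predict_qol(*factor_groups: dict) -> float:
    """Weighted linear combination of influence factors into a 0..1 QoL score."""
    features = {k: v for group in factor_groups for k, v in group.items()}
    return sum(WEIGHTS[name] * value for name, value in features.items())

print(f"Predicted QoL score: {predict_qol(HUMAN, CONTEXT, SYSTEM):.2f}")  # 0.74
```

In practice, such a model would be trained on longitudinal patient data rather than relying on hand-set weights; the sketch only shows how the factor and feature categorization maps onto a predictive pipeline.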

At more of a “systems” level in e-health applications, the WHO defines assistive devices and technologies as “those whose primary purpose is to maintain or improve an individual’s functioning and independence to facilitate participation and to enhance overall well-being” [6]. A proposed application of immersive technologies as an assistive technology (AT) training solution applied QoE as a mechanism to evaluate the usability and utility of the system [7]. The assessment of immersive AT used a number of physiological signals: EEG, galvanic skin response/electrodermal activity (GSR/EDA), body surface temperature, accelerometry, heart rate (HR) and blood volume pulse (BVP). These allow objective analysis while the individual operates the wheelchair simulator. Performing such evaluations in an ecologically valid manner is a challenging task. However, the QoE framework provides a concrete mechanism for considering the human, context and system factors that influence the usability and utility of such a training simulator. In particular, implicit and objective metrics can complement qualitative approaches to evaluation.
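To illustrate how such implicit signals might be reduced to objective features, the sketch below computes simple summaries from short synthetic traces. The sampling rates, threshold, and feature set are assumptions for illustration and do not reproduce the analysis of [7].

```python
import numpy as np

FS_EDA = 4  # Hz; a typical rate for wrist-worn EDA sensors (assumption)

def eda_peak_count(eda: np.ndarray, threshold: float = 0.05) -> int:
    """Count upward threshold crossings in the EDA derivative (a crude arousal proxy)."""
    diffs = np.diff(eda)
    return int(np.sum((diffs[:-1] <= threshold) & (diffs[1:] > threshold)))

def summarize(hr: np.ndarray, eda: np.ndarray, temp: np.ndarray) -> dict:
    """Reduce raw traces to features that could complement questionnaire ratings."""
    return {
        "mean_hr_bpm": float(np.mean(hr)),
        "hr_std": float(np.std(hr)),
        "eda_peaks": eda_peak_count(eda),
        "temp_drift_c": float(temp[-1] - temp[0]),
    }

# Synthetic 60-second traces standing in for real sensor streams.
rng = np.random.default_rng(0)
hr = 70 + 5 * rng.standard_normal(60)                      # 1 Hz heart rate
eda = np.cumsum(0.02 * rng.standard_normal(60 * FS_EDA))   # 4 Hz skin conductance
temp = 33.0 + 0.001 * np.arange(60)                        # 1 Hz skin temperature

print(summarize(hr, eda, temp))
```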

In the same vein, another work presented at QoMEX 2017 [8] employed Augmented Reality (AR) and Virtual Reality (VR) as a clinical aid for the diagnosis of speech and language difficulties, specifically aphasia (see Fig. 2). It is estimated that speech or language difficulties affect more than 12% of people internationally [9]. Individuals who suffer from a stroke or traumatic brain injury (TBI) often experience symptoms of aphasia as a result of damage to the left frontal lobe. Anomic aphasia [10] is a mild form of aphasia in which patients experience word retrieval problems and semantic memory difficulties. Opportunities exist to digitalize well-accepted clinical approaches that can be augmented through QoE-based objective and implicit metrics. Understanding the user via advanced processing techniques is an area in dire need of further research, with significant opportunities to understand the user at the cognitive, interaction and performance levels, moving far beyond the binary pass/fail of traditional approaches.

Fig. 2: Prototype System Framework (Reproduced from [8]). I. Physiological wearable sensors used to capture data: (a) Neurosky MindWave® device; (b) Empatica E4® wristband. II. Representation of user interaction with the wheelchair simulator. III. The compatible displays: (a) common screen; (b) Oculus Rift® HMD device; (c) HTC Vive® HMD device.

Moving beyond health, the QoE concept can also be extended to other areas such as digital heritage. Organizations such as broadcasters and national archives that collect media recordings are digitizing their material because the analog storage media degrade over time. Archivists, restoration experts, content creators, and consumers are all stakeholders, but they have different perspectives when it comes to their expectations and needs. Hence their QoE for archive material can be very different, as discussed at QoMEX 2019 [11]. Viewing archive quality through a QoE lens aids in understanding the issues and priorities of these stakeholders. Applying the QoE framework to explore the different stakeholders, and the influencing factors that affect their QoE perceptions over time, allows different kinds of QoE models to be developed and used across the stages of the archived material lifecycle, from digitization through restoration to consumption.

The QoE framework’s simple yet comprehensive conceptual model for the quality formation process has had a major impact on multimedia quality. The examples presented here highlight how it can be used as a blueprint in other domains and to reconcile different perspectives and attitudes to quality. With an eye on the next and future editions of QoMEX, will we see other use cases and applications of QoE to domains and concepts beyond multimedia quality evaluations? The QoMEX conference series has evolved and adapted based on emerging application domains, industry engagement, and approaches to quality evaluations. It is clear that the scope of QoE research has broadened significantly over the last 11 years. Please take a look at [12] for details on the conference topics and special sessions that the organizing team for QoMEX 2020 in Athlone, Ireland hope will broaden the range of use cases that apply QoE towards QoL and other application domains in a spirit of inclusivity and diversity.

References:

[1] P. Le Callet, S. Möller, and A. Perkis, Eds., “Qualinet White Paper on Definitions of Quality of Experience,” European Network on Quality of Experience in Multimedia Systems and Services (COST Action IC 1003), Lausanne, Switzerland, Version 1.2, March 2013.

[2] World Health Organization, “Preamble to the Constitution of the World Health Organization,” 1946. [Online]. Available: http://apps.who.int/gb/bd/PDF/bd47/EN/constitution-en.pdf. [Accessed: 21-Jan-2020].

[3] QUALINET. [Online]. Available: https://www.qualinet.eu. [Accessed: 21-Jan-2020].

[4] A. Hines and J. D. Kelleher, “A framework for post-stroke quality of life prediction using structured prediction,” 9th International Conference on Quality of Multimedia Experience, QoMEX 2017, Erfurt, Germany, June 2017.

[5] European Union Horizon 2020 research project PRECISE4Q. [Online]. Available: https://precise4q.eu/. [Accessed: 21-Jan-2020].

[6] World Health Organization, “Assistive devices and technologies,” 2017. [Online]. Available: http://www.who.int/disabilities/technology/en/. [Accessed: 21-Jan-2020].

[7] D. Pereira Salgado, F. Roque Martins, T. Braga Rodrigues, C. Keighrey, R. Flynn, E. L. Martins Naves, and N. Murray, “A QoE assessment method based on EDA, heart rate and EEG of a virtual reality assistive technology system”, In Proceedings of the 9th ACM Multimedia Systems Conference (Demo Paper), pp. 517-520, 2018.

[8] C. Keighrey, R. Flynn, S. Murray, and N. Murray, “A QoE Evaluation of Immersive Augmented and Virtual Reality Speech & Language Assessment Applications”, 9th International Conference on Quality of Multimedia Experience, QoMEX 2017, Erfurt, Germany, June 2017.

[9] American Speech-Language-Hearing Association, “Scope of Practice in Speech-Language Pathology,” 2016. [Online]. Available: http://www.asha.org/uploadedFiles/SP2016-00343.pdf. [Accessed: 21-Jan-2020].

[10] J. Reilly, “Semantic Memory and Language Processing in Aphasia and Dementia,” Seminars in Speech and Language, vol. 29, no. 1, pp. 3-4, 2008.

[11] A. Ragano, E. Benetos, and A. Hines, “Adapting the Quality of Experience Framework for Audio Archive Evaluation,” 11th International Conference on Quality of Multimedia Experience, QoMEX 2019, Berlin, Germany, 2019.

[12] QoMEX 2020, Athlone, Ireland. [Online]. Available: https://www.qomex2020.ie. [Accessed: 21-Jan-2020].

MPEG Column: 128th MPEG Meeting in Geneva, Switzerland

The original blog post can be found at the Bitmovin Techblog and has been modified/updated here to focus on and highlight research aspects.

The 128th MPEG meeting concluded on October 11, 2019, in Geneva, Switzerland, with the following topics:

  • Low Complexity Enhancement Video Coding (LCEVC) Promoted to Committee Draft
  • 2nd Edition of Omnidirectional Media Format (OMAF) has reached the first milestone
  • Genomic Information Representation – Part 4 Reference Software and Part 5 Conformance Promoted to Draft International Standard

The corresponding press release of the 128th MPEG meeting can be found here: https://mpeg.chiariglione.org/meetings/128. In this report we will focus on video coding aspects (i.e., LCEVC) and immersive media applications (i.e., OMAF). At the end, we will provide an update related to adaptive streaming (i.e., DASH and CMAF).

Low Complexity Enhancement Video Coding

Low Complexity Enhancement Video Coding (LCEVC) has been promoted to committee draft (CD), which is the first milestone in the ISO/IEC standardization process. LCEVC is part two of MPEG-5, or ISO/IEC 23094-2 if you prefer the always easy-to-remember ISO codes. We already introduced MPEG-5 in previous posts; LCEVC is a standardized video coding solution that leverages other video codecs in a manner that improves video compression efficiency while maintaining or lowering the overall encoding and decoding complexity.

The LCEVC standard uses a lightweight video codec to add up to two layers of encoded residuals. These layers correct artefacts produced by the base video codec and add detail and sharpness to the final output video.
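Conceptually, the decoder-side reconstruction can be pictured as in the sketch below: decode the base, correct it with a first residual layer, upsample, then add a second residual layer at full resolution. The nearest-neighbour upsampler and toy values are stand-ins; the actual transforms, upscalers, and signalling are defined by the LCEVC specification itself.

```python
import numpy as np

def upsample2x(frame: np.ndarray) -> np.ndarray:
    """Nearest-neighbour 2x upscaling as a stand-in for the normative upscaler."""
    return frame.repeat(2, axis=0).repeat(2, axis=1)

def reconstruct(base: np.ndarray, residual_l1: np.ndarray, residual_l2: np.ndarray) -> np.ndarray:
    corrected = base + residual_l1                   # layer 1: correct base-codec artefacts
    detailed = upsample2x(corrected) + residual_l2   # layer 2: add detail at full resolution
    return np.clip(detailed, 0, 255)

# Toy 2x2 base frame and matching residual layers (arbitrary values).
base = np.full((2, 2), 100.0)
r1 = np.array([[1.0, -2.0], [0.5, 0.0]])   # base-resolution corrections
r2 = np.zeros((4, 4))                      # full-resolution detail (none here)
print(reconstruct(base, r1, r2))
```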

This standard targets software and hardware codecs with extra processing capabilities, e.g., mobile devices, set-top boxes (STBs), and personal computer-based decoders. Additional benefits are a reduction in implementation complexity or a corresponding expansion in spatial resolution.

LCEVC builds on existing codecs, which allows for backwards compatibility with existing deployments. Supporting LCEVC enables “softwareized” video coding, allowing for release and deployment options known from software-based solutions, which are well understood by software companies. It thus opens new opportunities for improving and optimizing video-based services and applications.

Research aspects: in video coding, research efforts are mainly related to coding efficiency and complexity (as usual). However, as MPEG-5 basically adds a software layer on top of what is typically implemented in hardware, all kinds of aspects related to software engineering could become an active area of research.

Omnidirectional Media Format

The scope of the Omnidirectional Media Format (OMAF) covers 360° video, images, audio and associated timed text; the standard specifies (i) a coordinate system, (ii) projection and rectangular region-wise packing methods, (iii) storage of omnidirectional media and the associated metadata using ISOBMFF, (iv) encapsulation, signaling and streaming of omnidirectional media in DASH and MMT, and (v) media profiles and presentation profiles.

At this meeting, the second edition of OMAF (ISO/IEC 23090-2) has been promoted to committee draft (CD), which includes:

  • support of improved overlay of graphics or textual data on top of video,
  • efficient signaling of videos structured in multiple sub-parts,
  • enabling more than one viewpoint, and
  • new profiles supporting dynamic bitstream generation according to the viewport.

As with the first edition, OMAF includes encapsulation and signaling in ISOBMFF as well as streaming of omnidirectional media (DASH and MMT). It is expected to reach its final milestone by the end of 2020.

360° video is certainly a vital use case towards a fully immersive media experience. Devices to capture and consume such content are becoming increasingly available and will probably contribute to the dissemination of this type of content. However, it is also understood that the complexity increases significantly, specifically with respect to large-scale, scalable deployments due to increased content volume/complexity, timing constraints (latency), and quality of experience issues.

Research aspects: understanding the increased complexity of 360° video, or immersive media in general, is certainly an important aspect to be addressed towards enabling applications and services in this domain. We may even start thinking that 360° video actually works (e.g., it is possible to capture it, upload it to YouTube, and consume it on many devices), but the devil is in the detail: this complexity must be handled efficiently to enable a seamless and high quality of experience.

DASH and CMAF

The 4th edition of DASH (ISO/IEC 23009-1) will be published soon, and MPEG is currently working towards a first amendment covering (i) CMAF support and (ii) an event processing model. An overview of all DASH standards is depicted in the figure below, notably part one of MPEG-DASH, “Media presentation description and segment formats”.

Figure: Status overview of the MPEG-DASH standards family.
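For readers less familiar with part one, the media presentation description (MPD) is an XML manifest from which a client picks the representation to fetch next. The minimal MPD and parser below are a simplified, hypothetical illustration of that structure, not a conformant example.

```python
import xml.etree.ElementTree as ET

# A stripped-down MPD with one adaptation set and two representations.
MPD = """<?xml version="1.0"?>
<MPD xmlns="urn:mpeg:dash:schema:mpd:2011" type="static">
  <Period>
    <AdaptationSet mimeType="video/mp4">
      <Representation id="low" bandwidth="1000000" width="1280" height="720"/>
      <Representation id="high" bandwidth="5000000" width="1920" height="1080"/>
    </AdaptationSet>
  </Period>
</MPD>"""

NS = {"dash": "urn:mpeg:dash:schema:mpd:2011"}
root = ET.fromstring(MPD)
for rep in root.findall(".//dash:Representation", NS):
    # An adaptive client would choose among these based on measured throughput.
    print(rep.get("id"), rep.get("bandwidth"), f'{rep.get("width")}x{rep.get("height")}')
```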

The 2nd edition of the CMAF standard (ISO/IEC 23000-19) will become available very soon, and MPEG is currently reviewing additional tools in the so-called technologies under consideration document as well as conducting various explorations. A working draft for additional media profiles is also under preparation.

Research aspects: with CMAF, low-latency support is added to DASH-like applications and services. However, the implementation specifics are not defined in the standard and are subject to competition. Interestingly, the Bitmovin video developer reports from both 2018 and 2019 highlight the need for low-latency solutions in this domain.

At the ACM Multimedia Conference 2019 in Nice, France, I gave a tutorial entitled “A Journey towards Fully Immersive Media Access”, which includes updates related to DASH and CMAF. The slides are available here.

Outlook 2020

Finally, let me try to give an outlook for 2020, not so much content-wise, but in terms of events planned for 2020 that are highly relevant for this column:

  • MPEG129, Jan 13-17, 2020, Brussels, Belgium
  • DCC 2020, Mar 24-27, 2020, Snowbird, UT, USA
  • MPEG130, Apr 20-24, 2020, Alpbach, Austria
  • NAB 2020, Apr 18-22, 2020, Las Vegas, NV, USA
  • ICASSP 2020, May 4-8, 2020, Barcelona, Spain
  • QoMEX 2020, May 26-28, 2020, Athlone, Ireland
  • MMSys 2020, Jun 8-11, 2020, Istanbul, Turkey
  • IMX 2020, June 17-19, 2020, Barcelona, Spain
  • MPEG131, Jun 29 – Jul 3, 2020, Geneva, Switzerland
  • NetSoft, QoE Mgmt Workshop, Jun 29 – Jul 3, 2020, Ghent, Belgium
  • ICME 2020, Jul 6-10, 2020, London, UK
  • ATHENA summer school, Jul 13-17, 2020, Klagenfurt, Austria
  • … and many more!

JPEG Column: 85th JPEG Meeting in San Jose, California, U.S.A.

The 85th JPEG meeting was held in San Jose, CA, USA.

The meeting was distinguished by the Prime Time Engineering Emmy Award from the Academy of Television Arts & Sciences (ATAS) for the longevity of the first JPEG standard. Furthermore, a very successful workshop on JPEG emerging technologies was held at Microsoft premises in Silicon Valley, with broad participation from several companies working in imaging technologies. The workshop ended with the celebration of two JPEG committee experts, Thomas Richter and Ogawa Shigetaka, recognized with ISO outstanding contribution awards for the key roles they played in the development of the JPEG XT standard.

The 85th JPEG meeting continued laying the groundwork for the continuous development of JPEG standards and exploration studies. Notable were the developments on the new image coding standard JPEG XL, the low-latency, low-complexity standard JPEG XS, and the release of the JPEG Systems interoperable 360 image standard, together with exploration studies on image compression using machine learning and on the use of blockchain and distributed ledger technologies for media applications.

The 85th JPEG meeting had the following highlights:

  • Prime Time Engineering Emmy award,
  • JPEG Emerging Technologies Workshop,
  • JPEG XL progresses towards a final specification,
  • JPEG AI evaluates machine learning based coding solutions,
  • JPEG exploration on Media Blockchain,
  • JPEG Systems interoperable 360 image standards released,
  • JPEG XS announces significant improvements of Bayer image sensor data compression.

JPEG Emerging Technologies Workshop.

Prime Time Engineering Emmy

The JPEG committee is honored to be the recipient of a prestigious Prime Time Engineering Emmy Award, granted in 2019 by the US Academy of Television Arts & Sciences at the 71st Engineering Emmy Awards ceremony on the 23rd of October 2019 in Los Angeles, CA, USA. The first JPEG standard is known as a popular format in digital photography, used by hundreds of millions of users everywhere, in a wide range of applications including the world wide web, social media, photographic apparatus and smart cameras. The first part of the standard was published in 1992, and the standard has since grown to seven parts, with the latest, defining the reference software, published in 2019. This is a unique example of longevity in the fast-moving information technologies, and the Emmy award acknowledges this longevity and continuing influence over nearly three decades.

This is a well-deserved recognition not only for the Joint Photographic Experts Group committee members who started this standard under the auspices of ITU, ISO and IEC, but also for all experts in the JPEG committee who have continued to extend and maintain it, hence guaranteeing such longevity.

JPEG convenor Touradj Ebrahimi during the Emmy acceptance speech.

According to Prof. Touradj Ebrahimi, Convenor of JPEG standardization committee, the longevity of JPEG is based on three very important factors: “The credibility by being developed under the auspices of three important standardization bodies, namely ITU, ISO and IEC, development by explicitly taking into account end users, and the choice of being royalty free”. Furthermore,  “JPEG defined not only a great technology but also it was a committee that first defined how standardization should take place in order to become successful”.

JPEG Emerging Technologies Workshop

At the 85th JPEG meeting in San Jose, CA, USA, JPEG organized the “JPEG Emerging Technologies Workshop” on the 5th of November 2019 to inform industry and academia active in the wider field of multimedia, and in particular in imaging, about current JPEG Committee standardization activities and exploration studies. Leading JPEG experts shared highlights about some of the emerging JPEG technologies that could shape the future of imaging and multimedia, with the following program:

  • Welcome and Introduction (Touradj Ebrahimi);
  • JPEG XS – Lightweight compression; Transparent quality. (Antonin Descampe);
  • JPEG Pleno (Peter Schelkens);
  • JPEG XL – Next-generation Image Compression (Jan Wassenberg and Jon Sneyers);
  • High-Throughput JPEG 2000 – Big improvement to JPEG 2000 (Pierre-Anthony Lemieux);
  • JPEG Systems – The framework for future and legacy standards (Andy Kuzma);
  • JPEG Privacy and Security and Exploration on Media Blockchain Standardization Needs (Frederik Temmermans);
  • JPEG AI: Learning to Compress (João Ascenso)

This very successful workshop ended with a panel moderated by Fernando Pereira, where different relevant media technology issues were discussed with vibrant participation from the attendees.

Proceedings of the JPEG Emerging Technologies Workshop are available for download via the following link: https://jpeg.org/items/20191108_jpeg_emerging_technologies_workshop_proceedings.html

JPEG XL

The JPEG XL Image Coding System (ISO/IEC 18181) continues its progression towards a final specification. The Committee Draft of JPEG XL is being refined based on feedback received from experts from ISO/IEC national bodies. Experiments indicate that the two main JPEG XL modes compare favorably with specialized responsive and lossless modes, enabling a simpler specification.

The JPEG committee has approved open-sourcing the JPEG XL software. JPEG XL will advance to the Draft International Standard stage in January 2020.

JPEG AI

JPEG AI carried out rigorous subjective and objective evaluations of a number of promising learning-based image coding solutions from the state of the art, which show the potential of these codecs for different rate-quality trade-offs in comparison to widely used anchors. Moreover, a wide set of objective metrics was evaluated for several types of image coding solutions.

JPEG exploration on Media Blockchain

Fake news, copyright violations, media forensics, privacy and security are emerging challenges in digital media. JPEG has determined that blockchain and distributed ledger technologies (DLT) have great potential as a technology component to address these challenges in transparent and trustable media transactions. However, blockchain and DLT need to be integrated closely with a widely adopted standard to ensure broad interoperability of protected images. Therefore, the JPEG committee has organized several workshops to engage with the industry and help to identify use cases and requirements that will drive the standardization process. During the San Jose meeting, the committee drafted a first version of the use cases and requirements document. On the 21st of January 2020, during its 86th JPEG Meeting to be held in Sydney, Australia, JPEG plans to organize an interactive discussion session with stakeholders. Practical and registration information is available on the JPEG website. To keep informed and to get involved in this activity, interested parties are invited to register to the ad hoc group’s mailing list. (http://jpeg-blockchain-list.jpeg.org).

JPEG Systems interoperable 360 image standards released.

The ISO/IEC 19566-5 JUMBF and ISO/IEC 19566-6 JPEG 360 standards were published in July 2019. These two standards work together to define the basics for interoperability and lay the groundwork for future capabilities enabling richer interactions with still images, as functionality is added to JUMBF (Part 5), Privacy & Security (Part 4), JPEG 360 (Part 6), and JLINK (Part 7).

JPEG XS announces significant improvements of Bayer image sensor data compression.

JPEG XS aims at standardization of a visually lossless, low-latency and lightweight compression scheme that can be used as a mezzanine codec in various markets. Work was done at this meeting to enable JPEG XS for use in Bayer image sensor compression. Among the targeted use cases for Bayer image sensor compression are video transport over professional video links, real-time video storage in and outside of cameras, and data compression onboard autonomous cars. The JPEG Committee also announces the final publication of JPEG XS Part 3 “Transport and Container Formats” as an International Standard. This part enables storage of JPEG XS images in various formats. In addition, an effort to specify an RTP payload for JPEG XS is in its final stages, which will enable transport of JPEG XS in the SMPTE ST 2110 framework.


About JPEG

The Joint Photographic Experts Group (JPEG) is a Working Group of ISO/IEC, the International Organisation for Standardization / International Electrotechnical Commission, (ISO/IEC JTC 1/SC 29/WG 1) and of the International Telecommunication Union (ITU-T SG16), responsible for the popular JPEG, JPEG 2000, JPEG XR, JPSearch, JPEG XT and more recently, the JPEG XS, JPEG Systems, JPEG Pleno and JPEG XL families of imaging standards.

The JPEG Committee nominally meets four times a year, in different world locations. The 84th JPEG Meeting was held on 13-19 July 2019, in Brussels, Belgium. The next 86th JPEG Meeting will be held on 18-24 January 2020, in Sydney, Australia.

More information about JPEG and its work is available at www.jpeg.org or by contacting Antonio Pinheiro or Frederik Temmermans (pr@jpeg.org) of the JPEG Communication Subgroup.

If you would like to stay posted on JPEG activities, please subscribe to the jpeg-news mailing list on http://jpeg-news-list.jpeg.org.  

Future JPEG meetings are planned as follows:

  • No 86, Sydney, Australia, January 18 to 24, 2020
  • No 87, Erlangen, Germany, April 25 to 30, 2020

Report from ACM SIG Heritage Workshop

“What does history mean to computer scientists?” – that was the first question that popped up in my mind when I was to attend the ACM Heritage Workshop in Minneapolis a few months back. And needless to say, the follow-up question was “what does history mean for a multimedia systems researcher?” As a young graduate student, I had the joy of my life when my first research paper on multimedia authoring (a hot topic those days) was accepted for presentation at the first ACM Multimedia in 1993, and that conference was held alongside SIGGRAPH. Thinking about that, multimedia systems researchers have about 25 to 30 years of history. But what a flow of topics this area has seen: from authoring to streaming to content-based retrieval to social media and human-centered multimedia, the research area has been as hot as ever. So, is it the history of research topics or the researchers or both? Then, how about the venues hosting these conferences, the networking events, or the grueling TPC meetings that prepped the conference actions?

Figure 1. Picture from the venue

With only questions and no clear answers, I decided to attend the workshop with an open mind. Most SIGs (Special Interest Groups) in ACM had representation at this workshop. The workshop itself was organized by the ACM History Committee. I understood that this committee, apart from the workshop, organizes several efforts to track, record, and preserve computing efforts across disciplines. This includes identifying distinguished persons (who are retired but made significant contributions to computing), coming up with a customized questionnaire for each person, training the interviewer, recording the conversations, curating them, archiving them, and providing them for public consumption. Efforts at most SIGs were mostly based on their websites. They talked about how they try to preserve conference materials such as paper proceedings (from when only paper proceedings were published), meeting notes, pictures, and videos. For instance, some SIGs talked about how they tracked and preserved ACM’s approval letter for the SIG!

It was very interesting – and touching – to see some attendees (senior Professors) coming to the workshop with boxes of materials – papers, reports, books, etc. They were either downsizing or clearing out their offices, and did not feel like throwing the material in recycling bins! These materials were given to ACM and the Babbage Institute (at the University of Minnesota, Minneapolis) for possible curation and storage.

Figure 2. Galleries with collected material

ACM History Committee members talked about how they can fund (at a small level) projects that target specific activities for preserving and archiving computing events and materials. The committee agreed that ACM should take more responsibility in providing technical support for web hosting – though, obviously, it is not certain whether anything tangible will result.

Over the two days at the workshop, I was getting answers to my questions: history can mean pictures and videos taken at earlier MM conferences, TPC meetings, and SIGMM-sponsored events and retreats. Perhaps the earlier paper proceedings that have some additional information beyond what is found in the corresponding ACM Digital Library version. Interviews with different research leaders that built and promoted SIGMM.

It was clear that history meant different things to different SIGs, and as the SIGMM community, we would have to arrive at our own interpretation, then collect and preserve accordingly. And that made me understand the most obvious and perhaps most important thing: today’s events become tomorrow’s history! No-brainer, right? Preserving today’s SIGMM events will give future generations a richer, more colorful, and more complete SIGMM history!

For the curious ones:

The ACM SIG Heritage Workshop website is at: https://acmsigheritage.dash.umn.edu

Some of the workshop presentation materials are available at: https://acmsigheritage.dash.umn.edu/uncategorized/class-material-posted/

Reports from ACM Multimedia 2019

Introduction

The annual ACM Multimedia Conference was held in Nice, France, during October 21st to 25th, 2019. Being the 27th in the series, it attracted approximately 800 participants from all over the world. Among them were the student volunteers who supported the smooth organization of the conference. In this article, I would like to introduce the reports and comments provided by each of them.

Figure. Student volunteers at ACM Multimedia 2019

Reports from student volunteers

Hui Chen (Tsinghua University, China)

It was such an honor for me to be granted the student travel funding. During my stay in Nice, as a Ph.D. researcher, I read a lot of nice academic works which inspired me a lot. And I had wonderful conversations with authors from all over the world. Meanwhile, as a session volunteer, I was glad to help speakers and the audience during sessions. Their nice works and warm smiles impressed me a lot. What I valued most is the friendship with the other volunteers. We often discussed the attractive places and the delicious food in Nice, and cared for each other along the journey. I am deeply thankful for this wonderful experience in Nice. Some advice: (1) I think the beret was not necessary for the volunteers. The majority of us seemed to dislike it, because I did not see many volunteers wearing them. (2) Notifications about room changes for sessions should be made clear early. (3) The importance of being punctual could be emphasized in the ice-breaker meeting. (4) Reminders of volunteered sessions could be shown in the Whova app.

Shizhe Chen (Renmin University of China, China)

It was a great pleasure to attend ACM Multimedia this year. I have attended MM twice, and the organization is getting better and better. One big change was the deployment of the Whova app, which really improved our experience at MM. On the one hand, it made connections among different attendees and organizers more convenient and efficient. On the other hand, it was nice to share photos of the conference in the app. The volunteers were very devoted to serving the conference and uploaded many good pictures. The conference banquet in Nice was also much improved. I really enjoyed the local food and magic shows. Even though there were so many people that night, the organization was very orderly and made everyone satisfied. I also liked some of the modern multimedia art pieces exhibited at the conference, which were wonderful. The conference session I enjoyed most was the Multimedia Grand Challenge, which provided a great opportunity for us academics to get involved in real-life problems from industry. It would have been better if there were more offline opportunities to communicate with industry people at the conference. In summary, thanks for all the efforts the organizers have put into the conference. I am also proud to have been able to contribute a little as a volunteer this time.

Yang Chen (University of Science and Technology of China, China)

This was my first time attending an international conference, and my first time serving as a session volunteer during a conference. It was also my first time abroad. So I felt a little nervous before going abroad for the conference. Fortunately, everything went smoothly in the end. The MM conference has been held for many years, so the organizers’ experience is rich, and the scale is also large. The MM conference provided a lot of convenience for the participants. All conference schedules could be found at the venue, so attendees could easily find the sessions that they needed to participate in or were interested in. In addition, this year, the MM conference had many local characteristics of Nice, France. All attendees were given the famous local soap of Nice. The French food provided at the venue was also very delicious. All in all, it was a very impressive MM conference experience.

Amanda Duarte (Universitat Politècnica de Catalunya, Spain)

ACM Multimedia 2019 was a different and great experience for me. This was the first time that I attended this conference, and it was very different from what I am used to finding at a big conference. For the past four years I have been going to conferences more focused on Computer Vision and Machine Learning, which nowadays have a large number of attendees, accepted papers, and parallel sessions, and all the stress of being in a large venue and needing to find the sessions that interest you across large rooms full of people.
ACM Multimedia, on the other hand, was held in a smaller venue with fewer attendees but still with a very large number of high-quality researchers. Thus, I had the chance to talk more with great researchers in the areas that interest me, who were also interested in my work. In addition to my great experience during the conference in general, I had a great experience participating in the Doctoral Symposium. This event gave me the opportunity to present my work to great researchers who work on topics related to my doctoral thesis and were able to give me great feedback and suggestions on how to improve my research.

Francesco Gelli (National University of Singapore, Singapore)

Although I am still a student, this edition of ACM Multimedia has been my third. As in previous years, I met with the now more familiar community and split my time between attending sessions, walking around the posters, and rehearsing my presentation. My observation is that this year there was a major focus on applications rather than on technical aspects. For example, the Best Paper session included works on zooming audio together with video, a multi-modal dialogue system, and privacy. The Brave New Ideas session, in which I presented, saw some more unusual and daring applications, such as the automatic creation of a sequence of images to match a short story. I had a great time presenting my paper on ranking images by subjective attributes, and I did my best to engage the audience with multiple questions. I learned from the senior organizers that their goal is to push the multimedia community towards applications such as wellness and human-machine interaction, which naturally involve multimedia data. It was also inspiring to see so many engaged volunteers all dressed in blue running around with that very traditional beret. I am definitely looking forward to attending the next edition.

Trung-Hiếu Hoàng (University of Science, Vietnam National University Ho Chi Minh City, Vietnam)

I am excited to share my experience at ACMMM 2019, as a person who received the student travel grant. Living in Vietnam, I could hardly believe that I had such a great opportunity to travel thousands of kilometers and attend one of the top conferences in the world. On the first day, I met a lot of friends who received the same travel grant as me. We hung out together sharing different stories and experiences; all of us were enthusiastic and couldn’t wait to become a part of the volunteer team and contribute to the success of this year’s conference. During the last two years, I have had a strong interest in medical image processing. In detail, my research focuses on abnormality detection in endoscopic images. Attending ACMMM 2019 gave me a wonderful chance to present my work and discuss it with experts in this field. I enjoyed the Healthcare Multimedia workshop, where I met the organizers of the BioMedia Grand Challenge track. I loved talking with them and discussing the future and their interests. In conclusion, I am so glad that the student grant brought me to Europe for the first time, opened up my mind, and showed me wonderful things that I had never seen before.

Chia-Wei Hsieh (National Chiao Tung University, Taiwan)

I attended ACM Multimedia 2019 in Nice, France, and listened to new AI approaches by experts and scholars from various countries. At this conference, I got the chance to learn about the latest research results from world-renowned universities and research institutions, and about the latest developments in industry. These advanced tools broadened my view and made me realize the shortcomings that can be improved in our future research. Furthermore, I appreciated serving as a volunteer at the conference. This pushed me to interact with people, and I have made many good friends from all over the world. Everything about attending MM’19 was really good, but a fly in the ointment was that attendance on the last two days was pretty low. With some special incentives for people to stay, there could be more academic exchanges at the conference.

Michael Kerr (RMIT University, Australia)

I came to the conference this year hoping to learn about some very specific research being presented in my own field of employment, video surveillance. My expectations around these presentations were well met, but I also took away new insights into other areas that were previously not of great interest to me, mainly because I had not explored their application to my own field.
I particularly enjoyed the Tutorials on Multimedia Forensics and was interested to see the work done in areas that have developed in recent years. I was very engaged by the application of CNNs to solve forensic challenges and quickly found that the application of these systems was a major theme of the entire conference. So, whilst I enjoyed many of the practical applications such as the Tutorials, the System Demonstrations, and the Open Source Software Competition, I also learnt a great deal about the growth of CNN technologies within the multimedia discipline as a whole. This has had a positive effect by helping to develop my own research plans and, in particular, enabling the identification of new applications that may be of interest to those working in multimedia as well as in my specific field of interest.

Saurabh Kumar (Indian Institute of Technology Bombay, India)

I had an enjoyable experience at ACM Multimedia and learned a lot, as this was my first big international conference. The papers covered diverse applications, and it was great talking to the speakers after the talks and at the posters. This allowed me to meet many amazing people from various backgrounds and talk about the exciting research they are doing. It was easy to approach anyone at the conference for casual or technical discussions. These days conferences are recorded, and the recordings and proceedings are put up online, but that is just the tip of the iceberg. Attending a conference is a much broader experience, and I got an opportunity to experience this thanks to this travel grant. I made friends from many countries, thanks to the friendly atmosphere, and learned how my research fits in. I would like to highlight that being a volunteer was the primary reason all of this was possible. As a volunteer, it was so much easier to talk to people, and it was great helping them around. I would love to come and help out again anytime. The conference was just perfect, and I will remember my experience as a volunteer, which made it way more fun, and especially the people I interacted with. I am certainly submitting to the next MM and coming back again with more exciting research and to meet this fantastic community. Also, visiting Nice was a delight; it is a magnificent city, and the food was delicious.

Yadan Luo (University of Queensland, Australia)

It was a great experience attending ACM Multimedia 2019 in Nice this October, where I met many brilliant people working in the same field. The Invited Talks offered impressive ideas, inspiring visions of the future, and excellent coverage of many areas, like preserving audiovisual archives and data protection law. The most impressive part of the conference was the Art Exhibition, which showed the great power of installation art and interactive multimedia. Moreover, this great meeting brought me many precious opportunities to meet other researchers working in subfields like video streaming, domain adaptation, and image generation. Chatting with them helped me quickly pick up plenty of new knowledge and opened a door to other research directions. In conclusion, I would like to sincerely express my thanks to the people who prepared the conference; I benefited a lot from this fantastic event.

Kwanyong Park (Korea Advanced Institute of Science and Technology, Korea)

ACM Multimedia 2019 was particularly special to me in terms of my own development. Honestly speaking, the paper I presented at ACM Multimedia 2019 is my first international research accomplishment, so I really lacked experience and skills in presenting my work and communicating with other researchers. But after ACM Multimedia 2019, I have confidence that I can do better and better. The combination of Oral and Poster sessions was really impressive and effective for absorbing a lot of information in a short time. Every paper had at least a 2-minute oral presentation, from which I could catch the core concept. Based on that, I could easily decide whether a paper was closely related to my interests or not. I agree that this kind of configuration is really efficient. Through the conference, I saw which topics the students, who have a mostly academic perspective, are focusing on. Although this is a great stimulus to me, I think the practical perspective from various companies is also important to broaden one’s horizons. However, research from companies was relatively hard to find at ACM Multimedia 2019. I think that having some interactive booths from companies would be helpful.

K. R. Prajwal (International Institute of Information Technology, India)

ACM Multimedia was not only my first top-tier conference, but my first conference as well. I was pleased to see a lot of interesting and impactful papers from people of various backgrounds and universities. I particularly liked the conference venue, as it was spacious and comfortable, encouraging healthy discussion. I personally feel the food and meals could have been better curated. For example, I’m a vegetarian. I understand I have few items to eat, but the vegetarian items were not clearly labeled. This could be rectified in future editions of the conference. I also believe that most of the presentation rooms were well prepared and organized for the presentations. During my oral presentation, however, I had an issue playing a demo video. This issue occurred because the conference organizers were not fully prepared to play a video during the presentation. That is rather odd, I felt, given this is a top-tier multimedia conference, which means it will have lots of audio and visual content. But, other than that, I had a very pleasant and fruitful time at the conference. I was able to connect and socialize with eminent researchers at ACM Multimedia, and I hope to attend the next edition as well.

Estêvão Bissoli Saleme (Federal University of Espírito Santo, Brazil)

ACM Multimedia 2019 in Nice was such a unique experience. I volunteered for six sessions and attended a couple more, including the Best Paper session, which I liked the most. Not only because it brought original ideas, but also because I had the opportunity to witness an innovative presentation of the paper “Multimodal Dialog System: Generating Responses via Adaptive Decoders,” in which the speakers kept up a dialog between them to give their talk. Besides that, I enjoyed the poster presentation hall, where we could mingle with other participants, get to know other people’s work better, and interact with them. One presentation that impressed me was entitled “Editing Text in the Wild.” In this work, the researchers proposed a method to replace any text in a picture while keeping the background intact. The outcome looked like a real figure. Just impressive! Technically, I was more interested in Quality of Experience and Interaction, but I thought the subjects of the papers in this session were spread out, which hindered interaction with the other presenters; it lacked a bit of work related to QoE itself. Finally, another aspect that deserves praise was the organization. Whova helped hugely, and we could post photos and interact with other people there. Moreover, Martha, Laurent, and Benoit were omnipresent and tireless. They were on fire and worked very well to deliver such a great conference!

David Semedo (Universidade NOVA de Lisboa, Portugal)

My experience at ACM MM 2019 was very positive. I presented two full papers: one as a full oral and one as a short presentation. As such, the whole event was quite intense for me but also very personally enriching. I could do a lot of networking, with both students and senior researchers (the ConfLab contributed in this regard). As I am in my last Ph.D. year, I could talk with several researchers, from whom I got valuable advice on how to take the next steps towards pursuing a career in research. At the poster sessions, I had the opportunity to discuss my work in detail with several people, from whom I received constructive feedback. While I liked the fact that posters stayed up during the whole conference, some were hard to find or a bit hidden (e.g., the ones facing the wall). The conference program covered a wide range of topics in multimedia. This allowed me to understand which techniques are being used for different tasks, and to identify common technical aspects across these different tasks. It not only helped me stay updated on state-of-the-art approaches, but also helped me define potential future research directions.

Junbo Wang (Institute of Automation, Chinese Academy of Sciences, China)

From 21-25 October 2019, I attended the ACM Multimedia 2019 conference in Nice, France. This conference is a premier international conference in the area of multimedia within the field of computer science, and I am very proud to have attended this professional conference thanks to the ACM student travel grant. At this conference, I met many famous researchers in the area of multimedia, such as Tao Mei, Tat-Seng Chua, and Changsheng Xu. During the Poster and Oral sessions, I discussed many academic problems with these researchers, which really gave me new vision and insight. In addition to the many academic talks, I also enjoyed a lot of French food, such as macarons and foie gras. As a session volunteer, I was also very happy to help the attendees in some session talks. The interesting and professional talks inspired me and guided my interest towards many different research areas. Moreover, the conference was held at the Nice Acropolis Convention Center in Nice, which is a beautiful and peaceful city. The fresh air and pleasant sea breeze gave us a good mood every day and made for an unforgettable experience in this city. Overall, I think this conference was very successful in reaching its fundamental objective: free communication. However, I also noticed that there were far fewer sponsors this year than last year, which will hopefully improve next year.

Xin Wang (Donghua University, China)

In my experience, MM’19 was very impressive and easy to follow. The arrangement of the conference was very reasonable, and especially the Whova app helped me a lot whenever I wanted to figure out what was going on during the conference. One exception, which I found in the first two days: some workshops had different room numbers in the session volunteer schedule (a Google sheet) than in the app. That confused me for a while, but luckily Martha told us to use the app as the reference. I really loved the Demo session, and I think there must be people who felt the same as me. I met and talked with many researchers from all over the world, from institutions such as NUS, DCU, Nagoya University, Shandong University, National Chiao Tung University, etc. I still keep in contact with some of them and exchange research ideas. Besides, the weather in Nice was very comfortable, and the food during the conference was rich and delicious. All of these reasons make me look forward to next year’s MM conference.

Yitian Yuan (Tsinghua University, China)

It was very enjoyable to attend the ACM MM 2019 conference. As a volunteer, I could meet peers from other countries and schools and communicate with them, which is of great benefit to my scientific research. I think the agenda of this ACM MM conference was compact and reasonably arranged, but the following points could still be improved: (1) The entrance of the main conference hall was dimly lit and the signs were not obvious, so volunteers were needed to guide attendees; otherwise it was difficult for participants to find the place. (2) I wish the stage at the Banquet had had a bigger screen, so that everyone could see the names of the winners and the prize information. Finally, I wish ACM MM ever greater success and more international influence.

Zhengyu Zhao (Radboud University, The Netherlands)

This was my second time attending ACM Multimedia, after my first time in Korea in 2018. Overall, I felt the conference this year was a very successful edition, reflected in the perfect location, delicious food, well-designed program, and especially the efforts of the volunteers. But still, I have some suggestions for further improvement. Specifically, from the experience of the poster presentation of my reproducibility paper, I realized that most people actually know nothing about this new reproducibility track. This meant most of my time was spent explaining the general background of the track, leaving less time for my own research. I was happy to explain and get more people involved in this track, but it would be better if the organization team could give this track more exposure beforehand. From my experience serving as one of the poster session chairs, I learned that many people do not use the official communication app Whova, so instructions and important announcements could not reach all the participants in a timely manner. In my opinion, more offline solutions (e.g., a big screen on the spot) would help.

Summary

In general, the student volunteers seem to have enjoyed the event to the fullest, but some of them have offered constructive suggestions that organizers of, and participants in, future editions of the conference could take into account to provide even better experiences!

All in all, I think we can see from the submitted reports that giving young researchers, who may one day become leaders in our community, the chance to experience top-level research and to mix with a wide range of researchers at a top-level conference will surely benefit us all in the future.

An interview with Associate Professor Hugo L. Hammer

Hugo as a Ph.D. student, at the beginning of his research career.

Describe your journey into research from your youth up to the present. What foundational lessons did you learn from this journey? Why were you initially attracted to multimedia?

From an early age, I had the ability to focus and work individually and loved to develop new systems for all sorts of things, which probably was quite annoying for those around me. It turns out that this ability to focus, together with being curious and developing new systems, is what drives my research today. When I started as a student in mathematics and statistics at the Norwegian University of Science and Technology (NTNU), I didn’t think of research as an alternative and was determined to find a job in industry. Throughout my studies, I learned how little mathematics and statistics I had actually learned, which is why I decided to become a Ph.D. student. I expected to find a job in industry after the Ph.D. period but ended up loving research, and that is why I am where I am today.

As a statistician, I have worked a lot with spatial and spatio-temporal data, such as geophysical observations. Such observations have striking similarities to multimedia content, such as images and videos. I have become very interested in the machine learning methods used to process and make decisions from multimedia content, and in the potential for applying such methods in other domains, such as geophysics. I also love working as a statistician within this field. A crucial part of my research is trying to combine methods from machine learning and statistics in new and exciting ways.

Tell us more about your vision and objectives behind your current roles? What do you hope to accomplish, and how will you bring this about? 

In my current position as an associate professor, I do both teaching and research. Teaching and research challenge me in different ways. I continuously try to develop and improve my teaching. I especially focus on how to do high quality, yet resource-efficient, teaching. I have, for example, worked a lot on how to activate students and improve learning when being a single teacher for hundreds of students.

Can you profile your current research, its challenges, opportunities, and implications?

My current research can roughly be divided into three directions. The first direction concerns methods for real-time information processing and decision making, for example from sensory information or video streams. The second direction is about developing new machine learning models and methods, as mentioned above, by taking advantage of my background in statistics. The third direction is more applied use of machine learning methods on real-life multimedia data, in particular medical data. Directions two and three go hand in hand. Having a background in statistics and working more and more with multimedia data is more of an opportunity than a challenge.

How would you describe your top innovative achievements in terms of the problems you were trying to solve, your solutions, and the impact it has today and into the future?

I am proud of the research we have done on real-time information processing and decision making. The methods we developed are simple but still demonstrate state-of-the-art performance. In 2020, we plan to develop software packages to make the methods readily available and hopefully useful for many. We saw the potential of using machine learning, and in particular deep learning, on geophysical data and problems quite early, and we are now able to operate at the forefront of this research. I’m also proud of our externally funded research projects and, for sure, our rejected research proposals.

Over your distinguished career, what are the top lessons you want to share with the audience?

Here is a lesson from my personal experience. I think it is easy to depend on, or have too much respect for, other researchers early in one's career. Research is of course all about collaboration, but still, it was useful for me early on to create a small research project where I did every step of the process myself (shaping ideas, collecting data, running simulations, writing, finding suitable publishing channels, handling revisions, etc.). It was hard work, but for sure, it made me a better and more independent researcher.

What is the best joke you know?

Daddy, what are clouds made of?

Linux servers, mostly.

If you were conducting this interview, what questions would you ask, and then what would be your answers?

One suggestion: What do you like to do in your spare time?

Research, right? 🙂 Working every day at an office, I try to find time for physical activity in my spare time. I love to run, bike, or go skiing in Nordmarka (a forest near Oslo, Norway) or in the mountains on the weekends.

A recent photo of Hugo.

Bio: Hugo L. Hammer is an associate professor in statistics at Oslo Metropolitan University. His main research interests are computational statistics, probabilistic forecasting, real-time analytics, and machine learning.

An interview with Professor Roger Zimmermann

Roger at the start of his career.

Please describe your journey into research from your youth up to the present. What foundational lessons did you learn from this journey? Why were you initially attracted to multimedia?

I have had an interest in technology from early on, though my path to becoming an academic has not been very direct. In high school, I really enjoyed tinkering with electronics, taking radios apart, and learning about digital circuits. My goal was to work in this field, and after high school, I did an apprenticeship with Brown, Boveri & Cie. (BBC), which sometime later became Asea Brown Boveri (ABB). The apprentices were assigned to different company locations, and I was lucky enough to be sent to BBC's Forschungszentrum (Research Center). The labs, the researchers, and the cutting-edge equipment and projects there left a deep impression on me. Beyond electronics, I really liked microprocessors and computers, and how they could be flexibly programmed with software. I decided that I wanted to pursue further studies, and I subsequently enrolled in the Höhere Technische Lehranstalt (HTL) Brugg-Windisch in their Informatik program (the HTL program has since changed, and the building where I studied is now part of the Windisch campus of the Fachhochschule Nordwestschweiz). Fresh with my HTL degree in hand, I started to work for an engineering company, and over the next years, I got the chance to work on some fascinating projects. After five years, I got an itch to study for a Master's degree and ended up in California. One of the professors (who became my advisor) encouraged me to go for a Ph.D., and I took him up on his offer to support me. His group worked at the intersection of databases and multimedia. It really fascinated me, and we ended up building one of the early streaming media servers. What I still find fascinating about multimedia today is how it brings together many fundamental computer science areas such as networking, graphics, operating system support, and signal processing. I also like that multimedia is used by people to express their creativity, humanity, and artistic aspirations; it is not only about technology.

My personal lesson, looking back, is that sometimes you may not know where your journey will take you, but you should make sure you enjoy and learn from the path that gets you there.

Tell us more about your vision and objectives behind your current roles. What do you hope to accomplish, and how will you bring this about?

I currently work broadly in two areas, namely streaming media systems and data analytics. At this point, much of my enjoyment comes from working with my research group and with international colleagues from around the world. On the technical side, it is fun when somebody actually uses what we develop. On the human side of things, it is great to see my students and former students doing well in various parts of the globe.

Can you profile your current research, its challenges, opportunities, and implications?

In my research group, I have two main themes, and those are media systems and multimedia data analytics. In the first cluster, we look at media streaming on the Internet. The main technology in use today is Dynamic Adaptive Streaming over HTTP, also called DASH. Some interesting challenges lie in enabling very low latency for live streaming, which is of interest to many large Internet companies. Going forward, I see 5G networks as an interesting challenge. Most people are excited about the very high bandwidth that 5G can offer (in the best case), but I believe one of the major challenges will be the very high variability of 5G networks when a device is moving. On the multimedia, and especially spatial, data analytics side, I am part of a new lab between NUS and the ridesharing company Grab. A tremendous amount of data is generated (e.g., GPS trajectories) that allows novel data-driven applications such as generating accurate road maps in regions where this information is not readily available, or inferring semantic attributes of roads (e.g., no right turn allowed). The fusion of multiple data types such as trajectories, images, and maps will allow for some exciting new applications.

How would you describe your top innovative achievements in terms of the problems you were trying to solve, your solutions, and the impact it has today and into the future?

One of the areas where my group made innovative contributions was georeferenced mobile video: combining videos with their geo-spatial properties led to a lot of interesting developments. We started with this at about the same time the first iPhone came out, and the idea of utilizing all the sensors in a phone in combination with its video was really novel. Nowadays, sensor fusion is common and used in many machine-learning applications, and I am sure there will be even greater breakthroughs in the future. Another area where I have been working for decades is media streaming, and this whole industry has changed from proprietary networks to the Internet. Many people have worked in this area, but I believe that our own contributions have helped to transform this field.

Over your distinguished career, what are the top lessons you want to share with the audience?

My path to becoming an academic has not been as direct as for some other people. But one of the key things I have enjoyed along the way has been working with many outstandingly talented and bright people from all around the world. I hope that humanity will keep working together based on facts and science to solve some of the big challenges that are coming our way.

If you were conducting this interview, what questions would you ask, and then what would be your answers?

One issue that concerns me is the apparent trend of not trusting facts anymore. So a possible question could be: Do you see a danger when people easily distribute and believe in “alternate facts”?

My answer would be: I definitely see this as a considerable concern for the future. While there may be some technical solutions to combat fake news, etc., it is also increasingly important that people are well educated and think critically, especially in a world where fake information may look very persuasive.


What is the best joke you know?

I like many of the weird, but strangely funny comments on life and baseball from Yogi Berra. He was born Lawrence Peter Berra and was a US baseball legend. Two examples:

“When you come to a fork in the road, take it.”

“You should always go to other people’s funerals. Otherwise, they won’t come to yours.”


A current image of Roger.

Short bio:

Roger Zimmermann is an Associate Professor at the School of Computing at the National University of Singapore (NUS). He is also Deputy Director with the Smart Systems Institute (SSI) at NUS. From 2010 to 2016 he co-directed the Centre of Social Media Innovations for Communities (COSMIC), a research institute funded by the National Research Foundation (NRF) of Singapore. Prior to joining NUS he held the positions of Research Area Director with the Integrated Media Systems Center (IMSC) and Research Assistant Professor at the University of Southern California (USC). He earned his M.S. and Ph.D. degrees from the Viterbi School of Engineering at the University of Southern California.

Multidisciplinary Column: Conferences as Career and Community Catalysts

A little over 10 years ago, I chose to pursue a PhD. This meant I chose a professional life in which research publications and their uptake would be seen as major evidence of achievement. For those working in computer science, the major dissemination platforms for such publications are conferences.

Given my dual background in music and computer science, it was logical that my main interests were in topics connecting these two worlds. As a consequence, I hoped to become part of the Music Information Retrieval community. The International Society for Music Information Retrieval (ISMIR) therefore seemed the professional community to target, and the annual ISMIR conference the most logical place to present my work.

In terms of its education and research, my department at TU Delft had track records and agendas in visual and social multimedia content analysis, but not particularly in music. Considering methodology and philosophy, I did think a lot of the work at the department was compatible with what I was trying to do in music. Furthermore, as I was still training in a selective major at the conservatoire, I was not in a good position to geographically move to another institute with a more established Music Information Retrieval track record. So I inquired whether I could stay in Delft to pursue my PhD.

The answer was somewhat complicated. There was no funding for a PhD position in Music Information Retrieval, and there were no strategic plans to change that. At the same time, the people who had supervised me as a student (in particular, my thesis supervisor Alan Hanjalic) saw promise in me and wanted to keep working with me. Ultimately, I got a one-year contract in which my main task was to try to acquire funding and international community backing to pursue a Music Information Retrieval PhD in a multimedia group.

At the start of that year, I got to attend my first ISMIR conference, where I presented a paper based on my master's thesis. In a previous column for the SIGMM Records, I already discussed my experiences at that moment: how debuting alone at a conference was intimidating, but how I was lucky that senior members of the community proactively made sure I got introduced to other attendees. Frans Wiering, the senior member who looked after me in particular at that moment, was general chair of the upcoming ISMIR, which would take place in Utrecht, in my home country. Frans was quick to invite me to serve as a student volunteer, which was very good news for me. As my year would be filled with grant-writing, I did not yet have a sufficiently stable infrastructure around me to truly do research, so submitting to the next ISMIR was out of reach. But this way, I could still attend the conference, and I would even have an excuse to keep mingling with all the attendees, as we volunteers would be the first people to answer participants' questions regarding logistics.

Getting funding turned out to be a true challenge. In 2009, digital music consumption was not yet as large as it is today, and many potential data-providing partners were reluctant to collaborate. Of course, it also did not help my cause that I was still a complete nobody. Finally, when working on music, one faces an interesting paradox. On the one hand, many people, regardless of their backgrounds, identify with music, to the point that they personally and deeply care about it. As such, working on music makes for a good conversation starter, in which people are always happy to share their personal experiences. On the other hand, this makes music a commonplace topic, which risks it being shoved aside as 'less serious'. Even though, technically, the problems we work on are framed in very similar ways as in neighboring domains such as vision (and the research challenges are at least as hard, if not harder, due to subjective human factors being an integral part of the problem), common criticisms we receive are that music is fun but does not save lives, and does not deal with areas of major economic impact, nor of easily measurable societal impact. So while we never have any problems legitimizing our work in public outreach, in grant-writing we always need to give extra justification for why our work is more than a fun hobby and is sufficiently relevant to deserve serious funding.

After several collaboration rejections, and after the one proposal I did manage to set up was rejected despite good review scores, I was very lucky that at the very end of my grant-writing year, I managed to secure PhD funding through a Google Doctoral Fellowship (now PhD Fellowship). For this, I needed a research mentor, although my Google contacts weren't sure who would be appropriate for this role, as they were not aware of anyone working on music at the company at that stage.

Several weeks later, I was volunteering at ISMIR in Utrecht. That was where I found out that Douglas Eck had just moved from academia to industry to work on music research at Google. And that was how I got my research mentor, with several extremely useful internship experiences at the company as a consequence.

When Emilia Gómez, the 2018-2019 president of the ISMIR society, invited me to become general co-chair of ISMIR's 20th anniversary edition with her, and to host the event in Delft, this was my chance to give back. Now I had general chair powers, and as the society was quite open to discussing innovations, I could try to realize the conference of my dreams.

As described in my previous column, the inclusive spirit of ISMIR has always been strong, including mentoring programs spearheaded by our Women in MIR movement, an explicit focus on multidisciplinarity over exclusivity, and on being medium-sized but single-track. For the past two years, all accepted papers have been presented in a 4-minute presentation and a poster, so that all works get equal visibility. This year, we chose not to have themed sessions but to randomize the paper order, so that authors on related topics would not be presenting their posters at the same time. As a side effect, this also nudged attendees towards learning about everything that got accepted, beyond the topics of their specializations. This is something I have always seen the ISMIR community being enthusiastic about, while I have had very different experiences at (more prestigious) larger-sized conferences. In many cases, their larger size led to many parallel tracks with fragmented audiences, while any plenary program elements were so massive that it was hard to engage with anyone you did not happen to know already, or did not incidentally happen to stand or sit next to.

We made sure we offered more than paper presentations. For the keynotes, we invited speakers from neighboring fields and disciplines, and encouraged them to give some critical perspectives on our field. We engaged with a local school in an outreach program. Before the conference, we held workshops, including the Women in MIR prototyping workshop, so people would already get to know one another; we had a dedicated Newcomer Initiatives chair to make sure no one felt lost, and the socials were set up so that people could really mingle. With many people in music also happening to be active musicians, we offered both formal and informal options to jam together, so that week, several cafes in Delft hosted more live music than they normally would.

But while I was preparing for this conference, one of my strongest experiences was being haunted by memories of the past: being able to join this community (and an academic career at all) had been a really close call, and it was catalyzed by my having been able to join the conferences and meet supportive seniors while I was still an early-stage student without a full research embedding.

So one of the ISMIR 2019 achievements I am most proud of is that we extended our financial support programs, enabled by the ISMIR board and sponsorship funds. Beyond the existing grants for student authors and female participants, we added a third 'community grant' category, meant for individuals who would like to attend ISMIR but who had not been in a position to actively contribute to the conference at this stage. Reading through the motivation letters for this grant made me realize that my experience was not such a freak case, and that colleagues have been facing similar challenges.

I am deeply grateful that these grants enabled us to bring more people to ISMIR: young professionals in between positions; students in other disciplines seeking to collaborate more closely on music topics; students who found themselves the sole people in their labs working on music, as the labs faced other strategic priorities; but also seniors who used to be members of our field, but who had gradually been drifting out after entering a vicious circle of not getting music projects funded, then having to do more teaching on other topics, and then taking hits on their research output and profile. It was a wonderful experience seeing all of them actively mingling with the community, and hearing how being at ISMIR had indeed been personally impactful for them.

For my student volunteers, I especially targeted local and national students who were not yet at the PhD level, so that they could experience our academic atmosphere. Here as well, I saw the positive impact of the ISMIR spirit; several of these students (of whom I am not even the thesis supervisor…) made friends with international colleagues, and are even trying to collaborate with them on music information research in their free time today.

Hopefully, this story can help inspire colleagues who are seeking to make their conference cultures more inclusive and impactful. With this, I do want to add a warning that endeavors like this do not come for free, but demand considerable extra work and advocacy. Much of our proposed innovation initially faced pushback in some form, as it was not how things were normally done, and it required financial and human resources that would not normally be accounted for. But I am very grateful that we followed through, and extremely proud of what we achieved in the end. My great thanks go to the ISMIR society, my fellow ISMIR 2019 organizers, and our sponsors for their trust and support.

All ISMIR 2019 presentations have been recorded, and are available through this link. The accepted (open access) papers with supplementary material are available via this page. Photos of the socials are available here.


About the Column

The Multidisciplinary Column is edited by Cynthia C. S. Liem and Jochen Huber. Every other edition, we feature an interview with a researcher performing multidisciplinary work, or a column of our own hand. For this edition, we feature a column by Cynthia C. S. Liem.

Dr. Cynthia C. S. Liem is an Assistant Professor in the Multimedia Computing Group of Delft University of Technology, The Netherlands, and pianist of the Magma Duo. Her research interests include search and recommendation for music and multimedia, with special interest in making people discover new interests, as well as questions of interpretability and validity. She initiated, co-coordinated, and participated in various (inter)national collaborative research projects on the accessibility of content that would not trivially be retrieved, both in the music/cultural heritage world and in social science applications, e.g., collaborating with organizational psychologists. Beyond her academic activities, Cynthia gained industrial experience at Bell Labs Netherlands, Philips Research, and Google. She was a recipient of the Lucent Global Science and Google Anita Borg Europe Memorial scholarships, the Google European Doctoral Fellowship 2010 in Multimedia, and a finalist of the New Scientist Science Talent Award 2016 for young scientists committed to public outreach. In 2018, she was Researcher-in-Residence at the National Library of The Netherlands, and in 2019, she served as general co-chair of the ISMIR conference.

Dr. Jochen Huber is a Senior User Experience Researcher at Synaptics. Previously, he was an SUTD-MIT postdoctoral fellow in the Fluid Interfaces Group at MIT Media Lab and the Augmented Human Lab at Singapore University of Technology and Design. He holds a Ph.D. in Computer Science and degrees in both Mathematics (Dipl.-Math.) and Computer Science (Dipl.-Inform.), all from Technische Universität Darmstadt, Germany. Jochen's work is situated at the intersection of Human-Computer Interaction and Human Augmentation. He designs, implements and studies novel input technology in the areas of mobile, tangible & non-visual interaction, automotive UX and assistive augmentation. He has co-authored over 60 academic publications and regularly serves as program committee member in premier HCI and multimedia conferences. He was program co-chair of ACM TVX 2016 and Augmented Human 2015 and chaired tracks of ACM Multimedia, ACM Creativity and Cognition and the ACM International Conference on Interactive Surfaces and Spaces, as well as numerous workshops at ACM CHI and IUI. Further information can be found on his personal homepage: http://jochenhuber.com

Dataset Column: Report from the MMM 2019 Special Session on Multimedia Datasets for Repeatable Experimentation (MDRE 2019)

Special Session

Information retrieval and multimedia content access have a long history of comparative evaluation, and many of the advances in the area over the past decade can be attributed to the availability of open datasets that support comparative and repeatable experimentation. Sharing data and code to allow other researchers to replicate research results is needed in the multimedia modeling field, as it helps to improve the performance of systems and the reproducibility of published papers.

This report summarizes the special session on Multimedia Datasets for Repeatable Experimentation (MDRE 2019), organized at the 25th International Conference on MultiMedia Modeling (MMM 2019), held in January 2019 in Thessaloniki, Greece.

The intent of these special sessions is to serve as a venue for releasing datasets to the multimedia community and discussing dataset-related issues. The presentation mode in 2019 consisted of short presentations (8 minutes) followed by questions, with an additional panel discussion after all the presentations, moderated by Björn Þór Jónsson. In the following, we summarize the special session, including its talks, questions, and discussions.

The special session presenters: Luca Rossetto, Cathal Gurrin and Minh-Son Dao.

Presentations

A Test Collection for Interactive Lifelog Retrieval

The session started with a presentation about A Test Collection for Interactive Lifelog Retrieval [1], given by Cathal Gurrin from Dublin City University (Ireland). In their work, the authors introduced a new test collection for interactive lifelog retrieval, which consists of multi-modal data from 27 days, comprising nearly 42 thousand images and other personal data (health and activity data; more specifically, heart rate, galvanic skin response, calorie burn, steps, blood pressure, blood glucose levels, human activity, and diet log). The authors argued that, although other lifelog datasets already exist, theirs is unique in its multi-modal character and has a reasonable, easily manageable size of 27 consecutive days. Hence, it can also be used for interactive search and provides newcomers with an easy entry into the field. The published dataset has already been used for the Lifelog Search Challenge (LSC) [5] in 2018, an annual competition run at the ACM International Conference on Multimedia Retrieval (ICMR).

The discussion about this work started with a question about the plans for the dataset and whether it would be extended over the years, e.g., to increase the challenge of participating in the LSC. However, the problem with public lifelog datasets is that there is a conflict between releasing more content and safeguarding privacy. There is a strong need to anonymize the contained images (e.g., blurring faces and license plates), and the rules and requirements of the EU GDPR make this especially important. However, anonymizing content is unfortunately a very slow process. An alternative to removing and/or masking actual content from the dataset for privacy reasons would be to create artificial datasets (e.g., containing public images or only faces of people who consent to publication), but this would likely also be a non-trivial task. One interesting option could be the use of Generative Adversarial Networks (GANs) for the anonymization of faces, for instance by replacing all faces appearing in the content with generated faces learned from a small group of people who gave their consent. Another way to preemptively mitigate the privacy issues could be to wear conspicuous 'lifelogging stickers' during recording to make people aware of the presence of the camera, which would give them the possibility to object to being filmed or to avoid being captured altogether.

SEPHLA: Challenges and Opportunities Within Environment-Personal Health Archives

The second presentation was given by Minh-Son Dao from the National Institute of Information and Communications Technology (NICT) in Japan about SEPHLA: Challenges and Opportunities Within Environment-Personal Health Archives [2]. This dataset aims at combining the conditions of the environment with health-related aspects (e.g., pollution or weather data with cardio-respiratory or psychophysiological data). The creation of the dataset was motivated by the fact that people in larger cities in Japan often do not want to go out (e.g., for sports activities) because they are very concerned about pollution and its health effects. It would therefore be beneficial to have a map of the city with assigned pollution ratings, or a system that allows performing related queries. The dataset contains sensor data collected along routes by a few dozen volunteers over seven days in Fukuoka, Japan. More particularly, they collected data about location, O3, NO2, PM2.5 (particulates), temperature, and humidity, in combination with heart rate, motion behavior (from a 3-axis accelerometer), relaxation level, and other personal perception data from questionnaires.
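
To make the idea of fusing environmental and physiological streams concrete, the sketch below aligns two such time series in Python with pandas. The file names and column names (environment.csv, physiology.csv, pm25, heart_rate) are purely illustrative and do not reflect the actual SEPHLA schema; the bucket edges are likewise invented.

```python
# Sketch: aligning environmental and physiological sensor streams on time.
import pandas as pd

# Environmental readings: timestamp plus pollutant levels (columns invented).
env = pd.read_csv("environment.csv", parse_dates=["timestamp"])
# Physiological readings: timestamp plus heart rate (columns invented).
phys = pd.read_csv("physiology.csv", parse_dates=["timestamp"])

# merge_asof requires both frames to be sorted on the join key.
env = env.sort_values("timestamp")
phys = phys.sort_values("timestamp")

# Attach to each physiological sample the nearest environmental reading
# taken at most 60 seconds earlier.
merged = pd.merge_asof(phys, env, on="timestamp",
                       direction="backward", tolerance=pd.Timedelta("60s"))

# Example analysis: mean heart rate per PM2.5 bucket (bucket edges invented).
merged["pm25_bucket"] = pd.cut(merged["pm25"], bins=[0, 12, 35, 55, 150])
print(merged.groupby("pm25_bucket")["heart_rate"].mean())
```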

This dataset has also been used for multimedia benchmark challenges, such as the Lifelogging for Wellbeing task at MediaEval. In order to define the ground truth, volunteers were presented with specific use cases and annotation rules and were asked to collaboratively annotate the dataset. The collected data (the feelings of participants at different locations) were also visualized using an interactive map. Although the dataset may contain some inconsistent annotations, it is easy to filter them out, since the labels of the corresponding annotators and annotator groups are included in the dataset as well.

V3C – a Research Video Collection

The third presentation was given by Luca Rossetto from the University of Basel (Switzerland) about V3C – a Research Video Collection [3]. This is a large-scale dataset for multimedia retrieval, consisting of nearly 30,000 videos with an overall duration of about 3,800 hours. Although many other video datasets are already available (e.g., IACC.3 [6] or YFCC100M [8]), the V3C dataset is unique in its timeliness (more recent, and therefore more representative, content for current 'videos in the wild') and diversity (it represents many different genres and use cases), while also having no copyright restrictions (all contained videos were labelled with a Creative Commons license by their uploaders). The videos were collected from the video sharing platform Vimeo (hence the name 'Vimeo Creative Commons Collection', or V3C in short) and represent video data currently used on video sharing platforms. The dataset comes with a master shot-boundary detection ground truth, as well as keyframes and additional metadata. It is partitioned into three major parts (V3C1, V3C2, and V3C3) to make it more manageable, and it will be used by the TRECVID and the Video Browser Showdown (VBS) evaluation campaigns for several years. Although the dataset was not specifically built for retrieval, it is suitable for any use case that requires a larger video dataset.

The shot-boundary detection used to provide the master-shot reference for the V3C dataset was implemented in Cineast, an open-source software package available for download. It divides every frame into a 3×3 grid and computes color histograms for all 9 areas, which are then concatenated into a 'regional color histogram' feature vector that is compared between adjacent frames. This works very well for hard cuts and gradual transitions, although it is not very stable for grayscale content (and flashlights, etc.). The additional metadata provided with the dataset includes information about resolution, frame rate, uploading user, and upload date, as well as any semantic information provided by the uploader (title, description, tags, etc.).
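
To make the regional color histogram idea concrete, here is a minimal sketch of hard-cut detection in Python with OpenCV. This is not the actual Cineast implementation (which also handles gradual transitions); the histogram resolution and the decision threshold are arbitrary assumptions.

```python
# Sketch: 3x3 regional color histograms compared between adjacent frames.
import cv2
import numpy as np

def regional_histogram(frame, bins=8):
    """Concatenated per-cell color histograms over a 3x3 grid."""
    h, w = frame.shape[:2]
    feats = []
    for i in range(3):
        for j in range(3):
            cell = frame[i * h // 3:(i + 1) * h // 3,
                         j * w // 3:(j + 1) * w // 3]
            hist = cv2.calcHist([cell], [0, 1, 2], None,
                                [bins] * 3, [0, 256] * 3)
            feats.append(cv2.normalize(hist, hist).flatten())
    return np.concatenate(feats)

def detect_hard_cuts(video_path, threshold=0.5):
    """Indices of frames whose histogram distance to the previous frame
    exceeds the threshold (an arbitrary, content-dependent choice)."""
    cap = cv2.VideoCapture(video_path)
    cuts, prev, idx = [], None, 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        feat = regional_histogram(frame)
        if prev is not None and np.linalg.norm(feat - prev) > threshold:
            cuts.append(idx)
        prev, idx = feat, idx + 1
    cap.release()
    return cuts
```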

Athens Urban Soundscape (ATHUS): A Dataset for Urban Soundscape Quality Recognition

Originally, a fourth presentation was scheduled about Athens Urban Soundscape (ATHUS): A Dataset for Urban Soundscape Quality Recognition [4], but unfortunately no author was on site to give it. This dataset contains 30-second audio samples (as well as extracted features and ground truth) from a metropolitan city (Athens, Greece), recorded over a period of about four years by 10 different persons with the aim of providing a collection of city sounds. The metadata includes geospatial coordinates, a timestamp, a rating, and a tagging of the sound by the recording person. The authors demonstrated in a baseline evaluation that their dataset allows predicting the soundscape quality in the city with about 42% accuracy.

Discussion

After the presentations, Björn Þór Jónsson moderated a panel discussion in which all presenters participated.

The panel started with a discussion on the size of datasets: whether the only way to make challenges more difficult is to keep increasing the dataset, or whether there are alternatives. Although this heavily depends on the research question one would like to solve, it was generally agreed that there is a definite need for evaluation with large datasets, because on small datasets some problems become trivial. Moreover, datasets that are too small often introduce some kind of content bias, so that they do not fully reflect the practical situation.

For now, it seems there is no real alternative to using larger datasets, although it is clear that this will introduce additional challenges for data management and data processing. All presenters (and the audience too) agreed that introducing larger datasets will also necessitate closer collaboration with other research communities, such as data science, data management/engineering, and distributed and high-performance computing, in order to manage the higher data load.

However, even though we need larger datasets, we might not be ready yet to go for a truly large scale. For example, the V3C dataset is still far away from a true web-scale video search dataset; it was originally intended to be even bigger, but there were concerns from the TRECVID and VBS communities about its manageability. Datasets that are too large would set the entrance barrier for newcomers so high that an evaluation benchmark might not attract enough participants, a problem that could possibly disappear in a few years (as hardware becomes cheaper and faster/larger) but still needs to be addressed from an organizational viewpoint.

There were notes from the audience that instead of focusing on size alone, we should also consider the problem we want to solve. It appears many researchers use datasets for use cases for which they were not designed and to which they are not suited. Instead of blindly going for larger size, datasets could be kept small and simple for solving essential research questions, for example by truly optimizing them for the problem to be solved; different evaluations would then use different datasets. However, this would lead to considerable dataset fragmentation and would necessitate combining several datasets for broader/larger evaluation tasks, which has been shown to be quite challenging in the past. For example, a lot of health datasets are already available, and it would be interesting to benefit from them, but the workload for integrating them into competitions is often too high in practice.

Another issue that should be addressed more intensively by the research community is figuring out how to handle personal datasets in a way that is compliant with GDPR regulations, since currently nobody really knows how to deal with this.

Acknowledgments

The session was organized by the authors of the report, in collaboration with Duc-Tien Dang-Nguyen (Dublin City University), Michael Riegler (Center for Digitalisation and Engineering & University of Oslo), and Luca Piras (University of Cagliari). The panel format of the special session made the discussions much more lively and interactive than that of a traditional technical session. We would like to thank the presenters and their co-authors for their excellent contributions, as well as the members of the audience who contributed greatly to the session.

References

[1] Gurrin, C., Schoeffmann, K., Joho, H., Münzer, B., Albatal, R., Hopfgartner, F., … & Dang-Nguyen, D. T. (2019, January). A test collection for interactive lifelog retrieval. In International Conference on Multimedia Modeling (pp. 312-324). Springer, Cham.
[2] Sato, T., Dao, M. S., Kuribayashi, K., & Zettsu, K. (2019, January). SEPHLA: Challenges and opportunities within environment-personal health archives. In International Conference on Multimedia Modeling (pp. 325-337). Springer, Cham.
[3] Rossetto, L., Schuldt, H., Awad, G., & Butt, A. A. (2019, January). V3C – A research video collection. In International Conference on Multimedia Modeling (pp. 349-360). Springer, Cham.
[4] Giannakopoulos, T., Orfanidi, M., & Perantonis, S. (2019, January). Athens Urban Soundscape (ATHUS): A dataset for urban soundscape quality recognition. In International Conference on Multimedia Modeling (pp. 338-348). Springer, Cham.
[5] Dang-Nguyen, D. T., Schoeffmann, K., & Hurst, W. (2018, June). LSE2018 Panel - Challenges of lifelog search and access. In Proceedings of the 2018 ACM Workshop on The Lifelog Search Challenge (pp. 1-2). ACM.
[6] Awad, G., Butt, A., Curtis, K., Lee, Y., Fiscus, J., Godil, A., … & Kraaij, W. (2018, November). TRECVID 2018: Benchmarking video activity detection, video captioning and matching, video storytelling linking and video search.
[7] Lokoč, J., Kovalčík, G., Münzer, B., Schöffmann, K., Bailer, W., Gasser, R., … & Barthel, K. U. (2019). Interactive search or sequential browsing? A detailed analysis of the Video Browser Showdown 2018. ACM Transactions on Multimedia Computing, Communications, and Applications (TOMM), 15(1), 29.
[8] Kalkowski, S., Schulze, C., Dengel, A., & Borth, D. (2015, October). Real-time analysis and visualization of the YFCC100M dataset. In Proceedings of the 2015 Workshop on Community-Organized Multimodal Mining: Opportunities for Novel Solutions (pp. 25-30). ACM.

Dataset Column: Datasets for Online Multimedia Verification

Introduction

Online disinformation is a problem that has been attracting increased interest from researchers worldwide, as the breadth and magnitude of its impact are progressively manifested and documented in a number of studies (Boididou et al., 2014; Zhou & Zafarani, 2018; Zubiaga et al., 2018). This emerging area of research is inherently multidisciplinary, and there have been numerous treatments of the subject, each with a distinct perspective or theme, ranging from the predominant perspectives of media, journalism and communications (Wardle & Derakhshan, 2017) and political science (Allcott & Gentzkow, 2017) to those of network science (Lazer et al., 2018), natural language processing (Rubin et al., 2015) and signal processing, including media forensics (Zampoglou et al., 2017). Given the multimodal nature of the problem, it is no surprise that the multimedia community has taken a strong interest in the field.

From a multimedia perspective, two research problems have attracted the bulk of researchers' attention: a) detection of content tampering and content fabrication, and b) detection of content misuse for disinformation. The first was traditionally studied within the field of media forensics (Rocha et al., 2011), but has recently been under the spotlight as a result of the rise of deepfake videos (Güera & Delp, 2018), i.e., videos produced by a special class of generative models that are capable of synthesizing highly convincing media content from scratch or based on some authentic seed content. The second problem concerns multimedia misuse or misappropriation, i.e., the use of media content out of its original context with the goal of spreading misinformation or false narratives (Tandoc et al., 2018).

Developing automated approaches to detect media-based disinformation relies to a great extent on the availability of relevant datasets, both for training supervised learning models and for evaluating their effectiveness. Yet, developing and releasing such datasets is a challenge in itself for a number of reasons:

  1. Identifying, curating, understanding, and annotating cases of media-based misinformation is a very effort-intensive task. More often than not, the annotation process requires careful and extensive reading of pertinent news coverage from a variety of sources, similar to the journalistic practice of verification (Brandtzaeg et al., 2016).
  2. Media-based disinformation is largely manifested on social media platforms, and relevant datasets are therefore hard to collect and distribute, due to the temporary nature of social media content and the numerous technical restrictions and challenges involved in collecting it (mostly owing to limitations or a complete lack of appropriate API support), as well as the legal and ethical issues in releasing social media-based datasets (the need to comply with the respective Terms of Service and any applicable data protection law).

In this column, we present two multimedia datasets that could be of value to researchers who study media-based disinformation and develop automated approaches to tackle the problem. The first, called the Fake Video Corpus (Papadopoulou et al., 2019), is a manually curated collection of 200 debunked and 180 verified videos, along with relevant annotations, accompanied by a set of 5,193 near-duplicate instances of them that were posted on popular social media platforms. The second, called FIVR-200K (Kordopatis-Zilos et al., 2019), is an automatically collected dataset of 225,960 videos, a list of 100 video queries, and manually verified annotations regarding the relation (if any) of the dataset videos to each of the queries (i.e., near-duplicate, complementary scene, same incident).

For each of the two datasets, we present the design and creation process, focusing on issues and questions regarding the relevance of the collected content, the technical means of collection, and the process of annotation, which had the dual goal of ensuring high accuracy and keeping the manual annotation cost manageable. Given that each dataset is accompanied by a detailed journal article, in this column we limit our description to high-level information, emphasizing the utility and creation process in each case rather than detailed statistics, which are disclosed in the respective papers.

Following the presentation of the two datasets, we proceed to a critical discussion, highlighting their limitations and some caveats, and delineating future steps towards high-quality dataset creation for the field of multimedia-based misinformation.

Related Datasets

The complexity and challenge of the multimedia verification problem has led to the creation of numerous datasets and benchmarking efforts, each designed for a particular task within this area. We can broadly classify these efforts into three areas: a) multimedia forensics, b) multimedia retrieval, and c) multimedia post classification. Datasets focused on the text modality, e.g., the Fake News Challenge, the Clickbait Challenge, Hyperpartisan News Detection, and RumourEval (Derczynski et al., 2017), are beyond the scope of this post and are hence not included in this discussion.

Multimedia forensics: Generating high-quality multimedia forensics datasets has always been a challenge, since creating convincing forgeries is normally a manual task requiring a fair amount of skill; as a result, such datasets have generally been few and limited in scale. With respect to image splicing, our own survey (Zampoglou et al., 2017) listed a number of datasets that had been made available by that point, including our own Wild Web tampered image dataset, which consists of real-world forgeries collected from the Web, including multiple near-duplicates, making it a large and particularly challenging collection. Recently, the Realistic Tampering Dataset (Korus et al., 2017) was proposed, offering a large number of convincing forgeries for evaluation. On the other hand, copy-move image forgeries pose a different problem that requires specially designed datasets. Three such commonly used datasets are those produced by MICC (Amerini et al., 2011), the Image Manipulation Dataset (Christlein et al., 2012), and CoMoFoD (Tralic et al., 2013). These datasets are still actively used in research.

With respect to video tampering, there has been a relative scarcity of high-quality, large-scale datasets, which is understandable given the difficulty of creating convincing forgeries. The recently proposed Multimedia Forensics Challenge datasets include some large-scale sets of tampered images and videos for the evaluation of forensics algorithms. Finally, there has recently been increased interest in the automatic detection of forgeries made with the assistance of particular software, specifically face-swapping software. As the quality of produced face-swaps is constantly improving, detecting face-swaps is an important emerging verification task. The FaceForensics++ dataset (Rössler et al., 2019) is a very large-scale dataset containing face-swapped videos (and untampered face videos) from a number of different algorithms, aimed at the evaluation of face-swap detection algorithms.

Multimedia retrieval: Several cases of multimedia verification can be considered an instance of a near-duplicate retrieval task, in which the query video (the video to be verified) is run against a database of past cases/videos to check whether it has already appeared before. The most popular publicly available dataset for near-duplicate video retrieval is arguably the CC_WEB_VIDEO dataset (Wu et al., 2007). It consists of 12,790 user-generated videos collected from popular video sharing websites (YouTube, Google Video, and Yahoo! Video). It is organized in 24 query sets; for each, the most popular video was selected to serve as the query, and the remaining videos were manually annotated with respect to their near-duplicate relation to the query. Another relevant dataset is VCDB (Jiang et al., 2014), which was compiled and annotated as a benchmark for the partial video copy detection problem and is composed of videos from popular video platforms (YouTube and Metacafe). VCDB contains two subsets: a) the core, which consists of 28 discrete sets of videos with a total of 528 videos and over 9,000 pairs of manually annotated partial copies, and b) the distractors, which consists of 100,000 videos whose purpose is to make the video copy detection problem more challenging.

Multimedia post classification: A benchmark task under the name “Verifying Multimedia Use” (Boididou et al., 2015; Boididou et al., 2016) was organized and took place in the context of MediaEval 2015 and 2016, respectively. The task made available a dataset of 15,629 tweets containing images and videos, each of which made a false or factual claim with respect to the shared image/video. The released tweets were posted in the context of breaking news events (e.g., Hurricane Sandy, the Boston Marathon bombings) or hoaxes.

Video Verification Datasets

The Fake Video Corpus (FVC)

The Fake Video Corpus (Papadopoulou et al., 2019) is a collection of 380 user-generated videos and 5,193 near-duplicate versions of them, all collected from three online video platforms: YouTube, Facebook, and Twitter. The videos are annotated either as “verified” (“real”) or as “debunked” (“fake”), depending on whether the information they convey is accurate or misleading. Verified videos are typically user-generated takes on newsworthy events, while debunked videos include various types of misinformation, including staged content posing as UGC, real content taken out of context, or modified/tampered content (see Figure 1 for examples). The near-duplicates of each video are arranged in temporally ordered “cascades”, and each near-duplicate video is annotated with respect to its relation to the first video of the cascade (e.g., whether it reinforces or debunks the original claim). The FVC is the first, to our knowledge, large-scale dataset of debunked and verified user-generated videos (UGVs). The dataset contains different kinds of metadata for its videos, including channel (user) information, video information, and community reactions (number of likes, shares and comments) at the time of their inclusion.

Figure 1. A selection of real (top row) and fake (bottom row) videos from the Fake Video Corpus.

The initial set of 380 videos was collected and annotated using various sources, including the Context Aggregation and Analysis (CAA) service developed within the InVID project and fact-checking sites such as Snopes. To build the dataset, all videos submitted to the CAA service between November 2017 and January 2018 were collected in an initial pool of approximately 1,600 videos, which were then manually inspected and filtered. The remaining videos were annotated as “verified” or “debunked” using established third-party sources (news articles or blog posts), leading to the final pool of 180 verified and 200 fake unique videos. Then, keyword-based search was run on the three platforms, and near-duplicate video detection was used to identify the video duplicates within the returned results. More specifically, for each of the 380 videos, its title was reformulated in a more general form and translated into four major languages: Russian, Arabic, French, and German. The original title, the general form, and the translations were submitted as queries to YouTube, Facebook, and Twitter. Then, the near-duplicate retrieval algorithm of Kordopatis-Zilos et al. (2017) was applied to the resulting pool, and the results were manually inspected to remove erroneous matches.
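
The collection step can be summarized in the following Python outline. All helper functions (generalize, translate, search) are hypothetical stand-ins supplied by the caller, not actual APIs used by the authors; the near-duplicate filtering and manual inspection steps are only indicated in comments.

```python
# Outline of the near-duplicate collection step described above.
LANGUAGES = ["ru", "ar", "fr", "de"]   # Russian, Arabic, French, German
PLATFORMS = ["youtube", "facebook", "twitter"]

def collect_candidates(title, generalize, translate, search):
    """Gather candidate near-duplicates for one seed video.

    `generalize`, `translate`, and `search` are hypothetical callables
    standing in for the title reformulation, machine translation, and
    platform search steps described above."""
    base = {title, generalize(title)}
    translations = {translate(q, lang) for q in base for lang in LANGUAGES}
    queries = base | translations

    candidates = []
    for platform in PLATFORMS:
        for query in queries:
            candidates.extend(search(platform, query))
    # In the paper, this pool is then filtered with the near-duplicate
    # retrieval algorithm of Kordopatis-Zilos et al. (2017) and manually
    # inspected to remove erroneous matches.
    return candidates
```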

The purpose of the dataset is twofold: i) to be used for the analysis of the dissemination patterns of real and fake user-generated videos (by analyzing the traits of the near-duplicate video cascades), and ii) to serve as a benchmark for the evaluation of automated video verification methods. The relatively large size of the dataset is important for both of these tasks. With respect to the study of dissemination patterns, the dataset provides the opportunity to study the dissemination of the same or similar content by analyzing associations between videos not provided by the original platform APIs, combined with the wealth of associated metadata. In parallel, having a collection of 5,573 videos annotated as “verified” or “debunked”, even if many are near-duplicate versions of the 380 cases, can be used for the evaluation (or even training) of verification systems, based either on visual content or on the associated video metadata.

The Fine-grained Incident Video Retrieval Dataset (FIVR-200K)

The FIVR-200K dataset (Kordopatis-Zilos et al., 2019) consists of 225,960 videos associated with 4,687 Wikipedia events and 100 selected video queries (see Figure 2 for examples). It has been designed to simulate the problem of Fine-grained Incident Video Retrieval (FIVR): given a query video, retrieve all associated videos, considering several types of association with respect to an incident of interest. FIVR contains several retrieval tasks as special cases under a single framework. In particular, we consider three types of association between videos: a) Duplicate Scene Videos (DSV), which share at least one scene (originating from the same camera) regardless of any applied transformation; b) Complementary Scene Videos (CSV), which contain parts of the same spatiotemporal segment but captured from different viewpoints; and c) Incident Scene Videos (ISV), which capture the same incident, i.e., they are spatially and temporally close, but have no overlap.

For the collection of the dataset, we first crawled Wikipedia's Current Events page to collect a large number of major news events that occurred between 2013 and 2017 (five years). Each news event is accompanied by a topic, headline, text, date, and hyperlinks. To collect videos of the same category, we retained only news events with the topic “Armed conflicts and attacks” or “Disasters and accidents”, which ultimately led to a total of 4,687 events after filtering. To gather videos around these events and build a large collection with numerous video pairs associated through the relations of interest (DSV, CSV, and ISV), we queried the public YouTube API with the event headlines. To ensure that the collected videos capture the corresponding event, we retained only the videos published within a timespan of one week from the event date. This process resulted in the collection of 225,960 videos.
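
A minimal sketch of the date-bounded collection step is shown below, using the public YouTube Data API v3 search endpoint. The API key is a placeholder, the example headline is invented, and paging, quota handling, and error recovery are omitted for brevity.

```python
# Sketch: headline query restricted to one week after the event date.
from datetime import datetime, timedelta
import requests

API_KEY = "YOUR_API_KEY"  # placeholder: supply a real key
SEARCH_URL = "https://www.googleapis.com/youtube/v3/search"

def videos_for_event(headline, event_date, days=7):
    """Return IDs of videos matching the headline, published within
    `days` of the event date (RFC 3339 timestamps, as the API expects)."""
    start = datetime.fromisoformat(event_date)
    params = {
        "part": "snippet",
        "q": headline,
        "type": "video",
        "publishedAfter": start.isoformat() + "Z",
        "publishedBefore": (start + timedelta(days=days)).isoformat() + "Z",
        "maxResults": 50,
        "key": API_KEY,
    }
    response = requests.get(SEARCH_URL, params=params, timeout=30)
    response.raise_for_status()
    return [item["id"]["videoId"] for item in response.json()["items"]]

# Example (invented headline): videos_for_event("Cyclone hits town", "2016-03-01")
```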

Figure 2. A selection of query videos from the Fine-grained Incident Video Retrieval dataset.

Next, we proceeded with the selection of query videos. We set up an automated filtering and ranking process that implemented the following criteria: a) query videos should be relatively short and ideally focus on a single scene; b) queries should have many near-duplicate or same-incident videos within the dataset, published by many different uploaders; and c) among a set of near-duplicate/same-incident videos, the one uploaded first should be selected as the query. This selection process was implemented with a graph-based clustering approach and resulted in the selection of 635 query videos, of which we used the top 100 (ranked by corresponding cluster size) as the final query set.
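
The following is a minimal sketch of such a graph-based selection, assuming the near-duplicate/same-incident pairs have already been computed: videos are clustered as connected components, the earliest upload in each cluster becomes a query candidate, and candidates are ranked by cluster size. The actual FIVR-200K process may differ in detail.

```python
# Sketch: graph-based query selection over precomputed related pairs.
import networkx as nx

def select_queries(related_pairs, upload_dates, top_k=100):
    """related_pairs: iterable of (video_id, video_id) tuples for videos
    judged near-duplicates or same-incident; upload_dates: dict mapping
    video_id to its upload datetime."""
    graph = nx.Graph()
    graph.add_edges_from(related_pairs)

    candidates = []
    for component in nx.connected_components(graph):
        # Criterion (c): the earliest upload in a cluster becomes the query.
        earliest = min(component, key=lambda vid: upload_dates[vid])
        candidates.append((earliest, len(component)))

    # Criterion (b): prefer queries with many related videos in the dataset.
    candidates.sort(key=lambda item: item[1], reverse=True)
    return [vid for vid, _ in candidates[:top_k]]
```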

For the annotation of similarity relations among videos, we followed a multi-step process in which we presented annotators with the results of a similarity-based video retrieval system and asked them to indicate the type of relation through a drop-down list with the following labels: a) Near-Duplicate (ND), a special case where the whole video is a near-duplicate of the query video; b) Duplicate Scene (DS), where only some scenes in the candidate video are near-duplicates of scenes in the query video; c) Complementary Scenes (CS); d) Incident Scene (IS); and e) Distractor (DI), i.e., an irrelevant video.

To make sure that annotators were presented with as many potentially relevant videos as possible, we used visual-only, text-only, and hybrid similarity in turn. As a result, each annotator reviewed candidate videos that had very high similarity to the query video in terms of their visual content, their text metadata (title and description), or a combination of the two. Once an initial set of annotations was produced by two independent annotators, the annotators went through the annotations twice more to ensure consistency and accuracy.
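
As an illustration of the three retrieval views, the sketch below ranks candidates by visual-only, text-only, and hybrid similarity, assuming precomputed L2-normalized feature matrices (so a dot product yields cosine similarity). The fusion weight is an arbitrary assumption, not the value used for FIVR-200K.

```python
# Sketch: three candidate rankings (visual, text, hybrid) for one query.
import numpy as np

def candidate_rankings(query_idx, visual, textual, weight=0.5, top_k=100):
    """visual, textual: L2-normalized feature matrices (one row per video),
    so a dot product with the query row yields cosine similarities."""
    vis_sim = visual @ visual[query_idx]
    txt_sim = textual @ textual[query_idx]
    hybrid = weight * vis_sim + (1 - weight) * txt_sim

    rankings = {}
    for name, scores in (("visual", vis_sim), ("text", txt_sim),
                         ("hybrid", hybrid)):
        order = np.argsort(scores)[::-1]          # highest similarity first
        rankings[name] = [i for i in order if i != query_idx][:top_k]
    return rankings
```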

FIVR-200K was designed to serve as a benchmark that poses real-world challenges for the problem of reverse video search. Given a query video to be verified, an analyst would want to know whether the same or a very similar version of it has already been published. In that way, the user would be able to easily debunk cases of out-of-context video use (i.e., misappropriation); on the other hand, if several videos are found that depict the same scene from different viewpoints at approximately the same time, they could be considered to corroborate the video of interest.

Discussion: Limitations and Caveats

We are confident that the two video verification datasets presented in this column can be valuable resources for researchers interested in the problem of media-based disinformation, and that they could serve both as training sets and as benchmarks for automated video verification methods. Yet, both suffer from certain limitations, and care should be taken when using them to draw conclusions.

A first potential issue has to do with the video selection bias arising from the particular way each of the two datasets was created. The videos of the Fake Video Corpus were selected in a mixed manner, trying to include a number of cases that were known to the dataset creators and their collaborators, and the set was also enriched by a pool of test videos that were submitted for analysis to a publicly available video verification service. As a result, the dataset is likely to be skewed towards viral and popular videos. Also, only videos for which debunking or corroborating information was found online were included, which introduces yet another source of bias, potentially towards cases that were more newsworthy or clear-cut. In the case of the FIVR-200K dataset, videos were intentionally collected from two categories of newsworthy events with the goal of ending up with a relatively homogeneous collection that would be challenging in terms of content-based retrieval. This means that certain types of content, such as political events, sports, and entertainment, are very limited or not present at all in the dataset.

A question related to the selection bias of the above datasets pertains to their relevance for multimedia verification and for real-world applications. In particular, it is not clear whether the video cases offered by the Fake Video Corpus are representative of the actual verification tasks that journalists and news editors face in their daily work. Another important question is whether these datasets offer a realistic challenge to automatic multimedia analysis approaches. In the case of FIVR-200K, it was clearly demonstrated (Kordopatis-Zilos et al., 2019) that the dataset is a much harder benchmark for near-duplicate detection methods compared to previous datasets such as CC_WEB_VIDEO and VCDB. Even so, we cannot safely conclude that a method that performs very well on FIVR-200K would perform equally well on a dataset of much larger scale (e.g., millions or even billions of videos).

Another issue, which affects access to these datasets and the reproducibility of experimental results, relates to the ephemeral nature of online video content. A considerable (and increasing) part of these video collections is taken down (either by their creators or by the video platform), which makes it impossible for researchers to gain access to the exact video set that was originally collected. To give a better sense of the problem, 21% of the Fake Video Corpus and 11% of the FIVR-200K videos were no longer available online as of September 2019. This issue, which affects all datasets based on online multimedia content, raises the more general question of whether there are steps that online platforms such as YouTube, Facebook, and Twitter could take to facilitate the reproducibility of social media research without violating copyright legislation or the platforms' terms of service.

The ephemeral nature of online content is not the only factor that renders the value of multimedia datasets sensitive to the passing of time. Especially in the case of online disinformation, there appears to be an arms race: new machine learning methods constantly get better at detecting misleading or tampered content, while at the same time new types of misinformation emerge that are increasingly AI-assisted. This is particularly pronounced in the case of deepfakes, where the main research paradigm is based on the concept of competition between a generator (adversary) and a detector (Goodfellow et al., 2014).
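For reference, Goodfellow et al. (2014) formalise this competition as a minimax game between a generator G and a discriminator (detector) D:

    \min_G \max_D \; \mathbb{E}_{x \sim p_{\text{data}}}[\log D(x)] + \mathbb{E}_{z \sim p_z}[\log(1 - D(G(z)))]

where G maps noise z to synthetic samples and D estimates the probability that a sample is real. Detection research effectively plays the role of D while generation methods keep improving G, which is why any fixed dataset of fakes captures only a snapshot of the contest.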

Last but not least, one may always be concerned about the potential ethical issues arising when publicly releasing such datasets. In our case, reasonable concerns for privacy risks, which are always relevant when dealing with social media content, are addressed by complying with the relevant Terms of Service of the source platforms and by making sure that any annotation (label) assigned to the dataset videos is accurate. Additional ethical issues pertain to the potential “dual use” of the datasets, i.e. their use by adversaries to craft better tools and techniques that make misinformation campaigns more effective. A recent pertinent case was OpenAI’s delayed release of their very powerful GPT-2 model, which sparked numerous discussions and criticism, and made clear that there is no commonly accepted practice for ensuring the reproducibility of research results (and empowering future research) while making sure that risks of misuse are eliminated.

Future work

Given the challenges of creating and releasing large-scale datasets for multimedia verification, the main conclusions from our efforts in this direction so far are the following:

  • The field of multimedia verification is in constant motion, and therefore the concept of a static dataset may not be sufficient to capture the real-world nuances and latest challenges of the problem. Instead, new benchmarking models (e.g. in the form of open data challenges) and resources (e.g. a constantly updated repository of “fake” multimedia) appear to be more effective for empowering future research in the area.
  • The role of social media and multimedia sharing platforms (including YouTube, Facebook and Twitter) seems to be crucial in enabling effective collaboration between academia and industry towards addressing the real-world consequences of online misinformation. While there have been recent developments in this direction, including the announcements by both Facebook and Alphabet’s Jigsaw of new deepfake datasets, there is also doubt and scepticism about the degree of openness and transparency that such platforms are ready to offer, given the conflicts of interest inherent in their underlying business model.
  • Building a dataset that is fit for a highly diverse and representative set of verification cases appears to be a task that requires a community effort rather than the effort of a single organisation or group. This would not only help distribute the massive cost and effort of dataset creation across multiple stakeholders, but would also ensure less selection bias, richer and more accurate annotation, and more solid governance.

References

Allcott, H., Gentzkow, M., “Social media and fake news in the 2016 election”, Journal of Economic Perspectives, 31(2), pp. 211–236, 2017.
Amerini, I., Ballan, L., Caldelli, R., Del Bimbo, A., Serra, G., “A SIFT-based forensic method for copy-move attack detection and transformation recovery”, IEEE Transactions on Information Forensics and Security, 6(3), pp. 1099–1110, 2011.
Boididou, C., Papadopoulos, S., Kompatsiaris, Y., Schifferes, S., Newman, N., “Challenges of computational verification in social multimedia”, In Proceedings of the 23rd ACM International Conference on World Wide Web, pp. 743–748, 2014.
Boididou, C., Andreadou, K., Papadopoulos, S., Dang-Nguyen, D.T., Boato, G., Riegler, M., Kompatsiaris, Y., “Verifying multimedia use at MediaEval 2015”. In Proceedings of MediaEval 2015, 2015.
Boididou C., Papadopoulos S., Dang-Nguyen D., Boato G., Riegler M., Middleton S.E., Petlund A., Kompatsiaris Y., “Verifying multimedia use at MediaEval 2016”. In Proceedings of MediaEval 2016, 2016.
Brandtzaeg, P.B., Lüders, M., Spangenberg, J., Rath-Wiggins, L., Følstad, A., “Emerging journalistic verification practices concerning social media”. Journalism Practice, 10(3), pp. 323–342, 2016.
Christlein, V., Riess, C., Jordan, J., Riess, C., Angelopoulou, E., “An evaluation of popular copy-move forgery detection approaches”, IEEE Transactions on Information Forensics and Security, 7(6), pp. 1841–1854, 2012.
Derczynski, L., Bontcheva, K., Liakata, M., Procter, R., Hoi, G.W.S., Zubiaga, A., “SemEval-2017 Task 8: RumourEval: Determining rumour veracity and support for rumours”, In Proceedings of the 11th International Workshop on Semantic Evaluation, pp. 69–76, 2017.
Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Bengio, Y., “Generative adversarial nets”. In Advances in Neural Information Processing Systems, pp. 2672–2680, 2014.
Guan, H., Kozak, M., Robertson, E., Lee, Y., Yates, A.N., Delgado, A., Zhou, D., Kheyrkhah, T., Smith, J., Fiscus, J., “MFC datasets: Large-scale benchmark datasets for media forensic challenge evaluation”, In Proceedings of the 2019 IEEE Winter Applications of Computer Vision Workshops, pp. 63–72, 2019.
Güera, D., Delp, E.J., “Deepfake video detection using recurrent neural networks”, In Proceedings of the 15th IEEE International Conference on Advanced Video and Signal Based Surveillance, pp. 1–6, 2018.
Jiang, Y. G., Jiang, Y., Wang, J., “VCDB: A large-scale database for partial copy detection in videos”. In Proceedings of the European Conference on Computer Vision, pp. 357–371, 2014.
Kiesel, J., Mestre, M., Shukla, R., Vincent, E., Adineh, P., Corney, D., Stein, B., Potthast, M., “SemEval-2019 Task 4: Hyperpartisan news detection”, In Proceedings of the 13th International Workshop on Semantic Evaluation, pp. 829–839, 2019.
Kordopatis-Zilos, G., Papadopoulos, S., Patras, I., Kompatsiaris, I., “FIVR: Fine-grained incident video retrieval”. IEEE Transactions on Multimedia, 21(10), pp. 2638–2652, 2019.
Korus, P., Huang, J., “Multi-scale analysis strategies in PRNU-based tampering localization”, IEEE Transactions on Information Forensics and Security, 12(4), pp. 809–824, 2017.
Lazer, D.M., Baum, M.A., Benkler, Y., Berinsky, A.J., Greenhill, K.M., Menczer, F., Schudson, M., “The science of fake news”, Science, 359(6380), pp. 1094–1096, 2018.
Papadopoulou, O., Zampoglou, M., Papadopoulos, S., Kompatsiaris, I., “A corpus of debunked and verified user-generated videos”. Online Information Review, 43(1), pp. 72–88, 2019.
Rocha, A., Scheirer, W., Boult, T., Goldenstein, S., “Vision of the unseen: Current trends and challenges in digital image and video forensics”, ACM Computing Surveys, 43(4), art. 26, 2011.
Rössler, A., Cozzolino, D., Verdoliva, L., Riess, C., Thies, J., Nießner, M., “FaceForensics++: Learning to detect manipulated facial images”, In Proceedings of the IEEE International Conference on Computer Vision, 2019.
Rubin, V.L., Chen, Y., Conroy, N.J., “Deception detection for news: Three types of fakes”, In Proceedings of the 78th ASIS&T Annual Meeting: Information Science with Impact: Research in and for the Community, art. 83, 2015.
Tandoc Jr., E.C., Lim, Z.W., Ling, R., “Defining ‘fake news’: A typology of scholarly definitions”, Digital Journalism, 6(2), pp. 137–153, 2018.
Tralic, D., Zupancic I., Grgic S., Grgic M., “CoMoFoD – New database for copy-move forgery detection”. In Proceedings of the 55th International Symposium on Electronics in Marine, pp. 49–54, 2013.
Wardle, C., Derakhshan, H., “Information disorder: Toward an interdisciplinary framework for research and policy making”, Council of Europe Report, 27, 2017.
Wu, X., Hauptmann, A.G., Ngo, C.-W., “Practical elimination of near-duplicates from web video search”, In Proceedings of the 15th ACM International Conference on Multimedia, pp. 218–227, 2007.
Zampoglou, M., Papadopoulos, S., Kompatsiaris, Y., “Detecting image splicing in the wild (web)”, In Proceedings of the 2015 IEEE International Conference on Multimedia & Expo Workshops, 2015.
Zampoglou, M., Papadopoulos, S., Kompatsiaris, Y., “Large-scale evaluation of splicing localization algorithms for web images”, Multimedia Tools and Applications, 76(4), pp. 4801–4834, 2017.
Zhou, X., Zafarani, R., “Fake news: A survey of research, detection methods, and opportunities”. arXiv preprint arXiv:1812.00315, 2018.
Zubiaga, A., Aker, A., Bontcheva, K., Liakata, M., Procter, R., “Detection and resolution of rumours in social media: A survey”, ACM Computing Surveys, 51(2), art. 32, 2018.

Appendix A: Examples of videos in the Fake Video Corpus.

Real videos

US Airways Flight 1549 ditched in the Hudson River.

A group of musicians playing in an Istanbul park while bombs explode outside the stadium behind them.

A giant alligator crossing a Florida golf course.

Fake videos

“Syrian boy rescuing a girl amid gunfire” – Staged (fabricated content): The video was filmed by the Norwegian director Lars Klevberg in Malta.

“Golden Eagle Snatches Kid” – Tampered: The video was created by a team of students in Montreal as part of their course on visual effects.

“Pope Francis slaps Donald Trump’s hand for touching him” – Satire/parody: The video was digitally manipulated, and was made for the late-night television show Jimmy Kimmel Live.

Appendix B: Examples of videos in the Fine-grained Incident Video Retrieval dataset.

Example 1

Query video from the American Airlines Flight 383 fire at Chicago O’Hare International Airport on October 28, 2016.

Duplicate scene video.

Complementary scene video.

Incident scene video.

Example 2

Query video from the Boston Marathon bombing on April 15, 2013.

Duplicate scene video.

Complementary scene video.

Incident scene video.

Example 3

Query video from the Las Vegas shooting on October 1, 2017.

Duplicate scene video.

Complementary scene video.

Incident scene video.