The V3C1 Dataset: Advancing the State of the Art in Video Retrieval


In order to download the video dataset as well as its provided analysis data, please follow the instructions described here:


Standardized datasets are of vital importance in multimedia research, as they form the basis for reproducible experiments and evaluations. In the area of video retrieval, widely used datasets such as the IACC [5], which has formed the basis for the TRECVID Ad-Hoc Video Search Task and other retrieval-related challenges, have started to show their age. For example, IACC is no longer representative of video content as it is found in the wild [7]. This is illustrated by the figures below, showing the distribution of video age and duration across various datasets in comparison with a sample drawn from Vimeo and YouTube.




Its recently released spiritual successor, the Vimeo Creative Commons Collection (V3C) [3], aims to remedy this discrepancy by offering a collection of freely reusable content sourced from the video hosting platform Vimeo. The figures below show the age and duration distributions of the Vimeo sample from [7] in comparison with the properties of the V3C.


The V3C comprises three shards of roughly 1000h, 1300h and 1500h of video content, respectively. In addition to the original videos, it comes with video shot-boundary annotations, as well as representative keyframes and thumbnail images for every video shot. All the technical and semantic video metadata that was available on Vimeo is provided as well. The V3C has already been used in the 2019 edition of the Video Browser Showdown [2] and will also be used for the TRECVID AVS tasks starting in 2019, with a plan for continued use over the coming several years. This video provides an overview of the type of content found within the dataset.

Dataset & Collections

The three shards of V3C (V3C1, V3C2, and V3C3) contain Creative Commons videos sourced from video hosting platform Vimeo. For this reason, the elements of the dataset may be freely used and publicly shared. The following table presents the composition of the dataset and the characteristics of its shards, as well as the information on the dataset as a whole.

Partition             V3C1                  V3C2                  V3C3                  Total
File Size (videos)    1.3TB                 1.6TB                 1.8TB                 4.8TB
File Size (total)     2.4TB                 3.0TB                 3.3TB                 8.7TB
Number of Videos      7’475                 9’760                 11’215                28’450
Video Duration        1’000 h 23 min 50 s   1’300 h 52 min 48 s   1’500 h 8 min 57 s    3’801 h 25 min 35 s
Mean Video Duration   8 min 2 s             7 min 59 s            8 min 1 s             8 min 1 s
Number of Segments    1’082’659             1’425’454             1’635’580             4’143’693
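The totals in the table can be cross-checked with a few lines of arithmetic. The sketch below re-enters the per-shard figures from the table by hand and sums them; the variable and function names are illustrative and not part of any V3C tooling.

```python
# Per-shard figures copied from the table above: (videos, hours, minutes, seconds).
SHARDS = {
    "V3C1": (7_475, 1_000, 23, 50),
    "V3C2": (9_760, 1_300, 52, 48),
    "V3C3": (11_215, 1_500, 8, 57),
}


def to_seconds(hours, minutes, seconds):
    """Convert an h/m/s triple to a number of seconds."""
    return hours * 3600 + minutes * 60 + seconds


def to_hms(total_seconds):
    """Convert a number of seconds back to an (hours, minutes, seconds) triple."""
    hours, rest = divmod(total_seconds, 3600)
    minutes, seconds = divmod(rest, 60)
    return hours, minutes, seconds


total_videos = sum(videos for videos, *_ in SHARDS.values())
total_seconds = sum(to_seconds(h, m, s) for _, h, m, s in SHARDS.values())

print("videos:", total_videos)
print("duration (h, m, s):", to_hms(total_seconds))
print("mean duration in seconds:", total_seconds / total_videos)
```

The sums reproduce the Total column (28’450 videos; 3’801 hours, 25 minutes, 35 seconds), and the overall mean comes out at roughly 481 seconds, i.e. about 8 minutes per video.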

Similarly to IACC, V3C contains a master shot reference, which segments every video into non-overlapping shots based on the visual content of the videos. For every shot, a representative keyframe is included, as well as a thumbnail version of that keyframe. Furthermore, for each video, identified by a unique ID, a metadata file is available that contains both technical and semantic information, such as the categories. Vimeo assigns every video to categories and subcategories. Some of the categories were deemed irrelevant for visual multimedia retrieval and analytical tasks and were dropped during the sourcing process of V3C. For simplicity, subcategories were generalized into their parent categories and are therefore not included. The remaining Vimeo categories are:

  • Arts & Design
  • Cameras & Techniques
  • Comedy
  • Fashion
  • Food
  • Instructionals
  • Music
  • Narrative
  • Reporting & Journals

Ground Truth and Analysis Data

As described above, the ground truth of the dataset consists of (deliberately over-segmented) shot boundaries as well as keyframes. Additionally, for the first shard of the V3C, the V3C1, we have already performed several analyses of the video content and metadata in order to provide an overview of the dataset [1].

In particular, we have analyzed specific content characteristics of the dataset, such as:

  • Bitrate distribution of the videos
  • Resolution distribution of the videos
  • Duration of shots
  • Dominant color of the keyframes
  • Similarity of the keyframes in terms of color layout, edge histogram, and deep features (weights extracted from the last fully-connected layer of GoogLeNet).
  • Confidence range distribution of the best class for shots detected by NasNet (using the best result out of the 1000 ImageNet classes) 
  • Number of different classes for a video detected by NasNet (using the best result out of the 1000 ImageNet classes)
  • Number of shots/keyframes for a specific content class
  • Number of shots/keyframes for a specific number of detected faces

This additional analysis data is available via GitHub, so that other researchers can take advantage of it. For example, one could use a specific subset of the dataset (only shots with blue keyframes, only videos with a specific bitrate or resolution, etc.) for performing further evaluations (e.g., for multimedia streaming and video coding, but also for image and video retrieval). Additionally, thanks to the public dataset and the analysis data, one could easily create an image and video retrieval system and use it either for participation in competitions like the Video Browser Showdown [2], or for submitting other evaluation runs (e.g., to the TRECVID Ad-hoc Video Search Task).
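As a rough illustration of such subset selection, the sketch below filters a handful of made-up shot records. The record layout (`ShotInfo` and its fields) and the sample values are assumptions for this example only; the actual files in the GitHub repository may be organized quite differently.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ShotInfo:
    """Hypothetical per-shot analysis record (field names are assumptions)."""
    video_id: str
    shot_no: int
    dominant_color: str
    width: int
    height: int
    bitrate_kbps: int


# Made-up sample records, only to exercise the filter.
SHOTS = [
    ShotInfo("00001", 1, "blue", 1920, 1080, 4200),
    ShotInfo("00001", 2, "green", 1920, 1080, 4200),
    ShotInfo("00002", 1, "blue", 1280, 720, 1800),
    ShotInfo("00003", 1, "blue", 640, 360, 600),
]


def select_shots(shots, color=None, min_height=0, max_bitrate_kbps=None):
    """Return the shots matching every criterion that was given."""
    selected = []
    for shot in shots:
        if color is not None and shot.dominant_color != color:
            continue
        if shot.height < min_height:
            continue
        if max_bitrate_kbps is not None and shot.bitrate_kbps > max_bitrate_kbps:
            continue
        selected.append(shot)
    return selected


# Shots with a blue keyframe and at least 720 lines of vertical resolution.
blue_hd = select_shots(SHOTS, color="blue", min_height=720)
print([(s.video_id, s.shot_no) for s in blue_hd])
```

Keeping each criterion optional makes it easy to combine color, resolution and bitrate constraints when assembling an evaluation subset.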


In the broad field of multimedia retrieval and analytics, one of the key components of research is having useful and appropriate datasets in place to evaluate multimedia systems’ performance and benchmark their quality. The usage of standard and open datasets enables researchers to reproduce analytical experiments based on these datasets and thus validate their results. In this context, the V3C dataset proves to be very diverse in several useful aspects (upload time, visual concepts, resolutions, colors, etc.). It also has no dominating characteristics and exhibits low self-similarity (i.e., few near duplicates) [3].

Further, the richness of V3C in terms of content diversity and content attributes enables benchmarking multimedia systems in close-to-reality test environments. In contrast to other video datasets (cf. YouTube-8M [4] and IACC [5]), V3C also provides a vast number of different video encodings and bitrates, so that it enables research focusing on video retrieval and analytical tasks regarding those attributes. The large number of different video resolutions (and to a lesser extent frame rates) makes this dataset interesting for video transport and storage applications such as the development of novel encoding schemes, streaming mechanisms or error-correction techniques. Finally, in contrast to many current datasets, V3C also provides support for creating queries for evaluation competitions, such as VBS and TRECVID [6].


[1] Fabian Berns, Luca Rossetto, Klaus Schoeffmann, Christian Beecks, and George Awad. 2019. V3C1 Dataset: An Evaluation of Content Characteristics. In Proceedings of the 2019 on International Conference on Multimedia Retrieval (ICMR ’19). ACM, New York, NY, USA, 334-338.

[2] Jakub Lokoč, Gregor Kovalčík, Bernd Münzer, Klaus Schöffmann, Werner Bailer, Ralph Gasser, Stefanos Vrochidis, Phuong Anh Nguyen, Sitapa Rujikietgumjorn, and Kai Uwe Barthel. 2019. Interactive Search or Sequential Browsing? A Detailed Analysis of the Video Browser Showdown 2018. ACM Trans. Multimedia Comput. Commun. Appl. 15, 1, Article 29 (February 2019), 18 pages.

[3] Rossetto, L., Schuldt, H., Awad, G., & Butt, A. A. (2019). V3C–A Research Video Collection. In International Conference on Multimedia Modeling (pp. 349-360). Springer, Cham.

[4] Abu-El-Haija, S., Kothari, N., Lee, J., Natsev, P., Toderici, G., Varadarajan, B., & Vijayanarasimhan, S. (2016). Youtube-8m: A large-scale video classification benchmark. arXiv preprint arXiv:1609.08675.

[5] Paul Over, George Awad, Alan F. Smeaton, Colum Foley, and James Lanagan. 2009. Creating a web-scale video collection for research. In Proceedings of the 1st workshop on Web-scale multimedia corpus (WSMC ’09). ACM, New York, NY, USA, 25-32. 

[6] Smeaton, A. F., Over, P., and Kraaij, W. 2006. Evaluation campaigns and TRECVid. In Proceedings of the 8th ACM International Workshop on Multimedia Information Retrieval (Santa Barbara, California, USA, October 26 – 27, 2006). MIR ’06. ACM Press, New York, NY, 321-330.

[7] Luca Rossetto & Heiko Schuldt (2017). Web video in numbers-an analysis of web-video metadata. arXiv preprint arXiv:1707.01340.

JPEG Column: 83rd JPEG Meeting in Geneva, Switzerland

The 83rd JPEG meeting was held in Geneva, Switzerland.

The meeting was very dense due to the multiple activities taking place. Beyond the standardization activities, such as the new JPEG XL, JPEG Pleno, JPEG XS, HTJ2K and JPEG Systems, the 83rd JPEG meeting included the report and discussion of a new exploration study on the use of learning-based methods applied to image coding, and two successful workshops: one on digital holography applications and systems, and the 3rd workshop on media blockchain technology.

The new exploration study on the use of learning-based methods applied to image coding was initiated at the previous 82nd JPEG meeting in Lisbon, Portugal. The initial approach provided very promising results and might establish a new alternative for future image representations.

The workshop on digital holography applications and systems revealed the state of the art in industry applications and current technical solutions. It covered applications such as holographic microscopy, tomography, printing and display. Moreover, insights were provided on state-of-the-art holographic coding technologies and quality assessment procedures. The workshop allowed a very fruitful exchange of ideas between the different invited parties and JPEG experts.

The 3rd workshop in a series organized around media blockchain technology had several talks where academia and industry shared their views on this emerging solution. The workshop ended with a panel where multiple questions were further elaborated by different panelists, providing the ground for a better understanding of the possible role of blockchain in media technology in the near future.

Two new logos, for JPEG Pleno and JPEG XL, were approved and released during the Geneva meeting.


The two new logos, for JPEG Pleno and JPEG XL

The 83rd JPEG meeting had the following highlights:

  • New exploration studies of JPEG AI
  • The new Image Coding System JPEG XL
  • JPEG Pleno
  • HTJ2K
  • JPEG Media Blockchain Technology
  • JPEG Systems – Privacy, Security & IPR, JPSearch and JPEG in HEIF

The following is a short summary of the most relevant achievements of the 83rd meeting in Geneva, Switzerland.



JPEG AI

The JPEG Committee is pleased to announce that it has started exploration studies on the use of learning-based solutions for its standards.

In the last few years, several efficient learning-based image coding solutions have been proposed, mainly with improved neural network models. These advances exploit the availability of large image datasets and special hardware, such as highly parallelizable graphics processing units (GPUs). Recognizing that this area has received many contributions recently and is considered critical for the future of a rich multimedia ecosystem, JPEG has created the JPEG AI ad hoc group (AhG) to study promising learning-based image codecs with a precise and well-defined quality evaluation methodology.

At this meeting, a taxonomy was proposed and available solutions from the literature were organized into different dimensions. In addition, a list of promising learning-based image compression implementations and potential datasets to be used in the future was gathered.


JPEG XL

The JPEG Committee continues to develop the JPEG XL Image Coding System, a standard for image coding that offers substantially better compression efficiency than relevant alternative image formats, along with features desirable for web distribution and efficient compression of high quality images.

Software for the JPEG XL verification model has been implemented. A series of experiments showed promising results for lossy, lossless and progressive coding. In particular, photos can be stored with significant savings in size compared to equivalent-quality JPEG files. Additionally, existing JPEG files can also be considerably reduced in size (for faster download) while retaining the ability to later reproduce the exact JPEG file. Moreover, lossless storage of images is possible with major savings in size compared to PNG. Further refinements to the software and experiments (including enhancement of existing JPEG files, and animations) will follow.

JPEG Pleno

The JPEG Committee has three activities in JPEG Pleno: Light Field, Point Cloud, and Holographic image coding. A generic box-based syntax has been defined that allows for signaling of these modalities, independently or composing a plenoptic scene represented by different modalities. The JPEG Pleno system also includes a reference grid system that supports the positioning of the respective modalities. The generic file format and reference grid system are defined in Part 1 of the standard, which is currently under development. Part 2 of the standard covers light field coding and supports two encoding mechanisms. The launch of specifications for point cloud and holographic content is under study by the JPEG committee.


JPEG XS

The JPEG Committee is pleased to announce the creation of an Amendment to the JPEG XS Core Coding System defining the use of the codec for raw image sensor data. The JPEG XS project aims at the standardization of a visually lossless, low-latency and lightweight compression scheme that can be used as a mezzanine codec in various markets. Among the targeted use cases for raw image sensor compression, one can cite video transport over professional video links (SDI, IP, Ethernet), real-time video storage in and outside of cameras, memory buffers, machine vision systems, and data compression onboard autonomous cars. One of the most important benefits of the JPEG XS codec is an end-to-end latency ranging from less than one line to a few lines of the image.


HTJ2K

The JPEG Committee is pleased to announce a significant milestone, with ISO/IEC 15444-15 High-Throughput JPEG 2000 (HTJ2K) submitted to ISO for immediate publication as an International Standard. HTJ2K opens the door to higher encoding and decoding throughput for applications where JPEG 2000 is used today.

The HTJ2K algorithm has demonstrated an average tenfold increase in encoding and decoding throughput compared to the algorithm currently defined by JPEG 2000 Part 1. This increase in throughput results in an average coding efficiency loss of 10% or less in comparison to the most efficient modes of the block coding algorithm in JPEG 2000 Part 1 and enables mathematically lossless transcoding to and from JPEG 2000 Part 1 codestreams.

JPEG Media Blockchain Technology

In order to clearly identify the impact of blockchain and distributed ledger technologies on JPEG standards, the committee has organized several workshops to interact with stakeholders in the domain. The programs and proceedings of these workshops are accessible on the JPEG website:

  1. 1st JPEG Workshop on Media Blockchain Proceedings, ISO/IEC JTC1/SC29/WG1, Vancouver, Canada, October 16th, 2018
  2. 2nd JPEG Workshop on Media Blockchain Proceedings, ISO/IEC JTC1/SC29/WG1, Lisbon, Portugal, January 22nd, 2019
  3. 3rd JPEG Workshop on Media Blockchain Proceedings, ISO/IEC JTC1/SC29/WG1, Geneva, Switzerland, March 20th, 2019

A 4th workshop is planned during the 84th JPEG meeting, to be held in Brussels, Belgium, on July 16th, 2019. The JPEG Committee invites experts to participate in this upcoming workshop.

JPEG Systems – Privacy, Security & IPR, JPSearch, and JPEG-in-HEIF

At the 83rd meeting, JPEG Systems made significant progress towards improving users’ privacy with the DIS text completion of ISO/IEC 19566-4 “Privacy, Security, and IPR Features”, which will be released for ballot. JPEG Systems continued to progress on image search and retrieval with the FDIS text release of JPSearch ISO/IEC 24800 Part 2, 2nd edition. Finally, support for JPEG 2000, JPEG XR, and JPEG XS images encapsulated in ISO/IEC 15444-12 is progressing towards IS stage; this enables these JPEG images to be encapsulated in ISO base media file formats, such as ISO/IEC 23008-12 High efficiency file format (HEIF).

Final Quote

“Intelligent codecs might redesign the future of media compression. JPEG can accelerate this trend by producing the first AI-based image coding standard,” said Prof. Touradj Ebrahimi, the Convenor of the JPEG Committee.

About JPEG

The Joint Photographic Experts Group (JPEG) is a Working Group of ISO/IEC, the International Organisation for Standardization / International Electrotechnical Commission, (ISO/IEC JTC 1/SC 29/WG 1) and of the International Telecommunication Union (ITU-T SG16), responsible for the popular JPEG, JPEG 2000, JPEG XR, JPSearch, JPEG XT and more recently, the JPEG XS, JPEG Systems, JPEG Pleno and JPEG XL families of imaging standards.

The JPEG Committee nominally meets four times a year, in different world locations. The 82nd JPEG Meeting was held on 19-25 January 2019, in Lisbon, Portugal. The next 84th JPEG Meeting will be held on 13-19 July 2019, in Brussels, Belgium.

More information about JPEG and its work is available on the JPEG website or by contacting Antonio Pinheiro or Frederik Temmermans of the JPEG Communication Subgroup.

If you would like to stay posted on JPEG activities, please subscribe to the jpeg-news mailing list.

Future JPEG meetings are planned as follows:

  • No 84, Brussels, Belgium, July 13 to 19, 2019
  • No 85, San Jose, California, U.S.A., November 2 to 8, 2019
  • No 86, Sydney, Australia, January 18 to 24, 2020

MPEG Column: 126th MPEG Meeting in Geneva, Switzerland

The original blog post can be found at the Bitmovin Techblog and has been modified/updated here to focus on and highlight research aspects.

The 126th MPEG meeting concluded on March 29, 2019 in Geneva, Switzerland with the following topics:

  • Three Degrees of Freedom Plus (3DoF+) – MPEG evaluates responses to the Call for Proposal and starts a new project on Metadata for Immersive Video
  • Neural Network Compression for Multimedia Applications – MPEG evaluates responses to the Call for Proposal and kicks off its technical work
  • Low Complexity Enhancement Video Coding – MPEG evaluates responses to the Call for Proposal and selects a Test Model for further development
  • Point Cloud Compression – MPEG promotes its Geometry-based Point Cloud Compression (G-PCC) technology to the Committee Draft (CD) stage
  • MPEG Media Transport (MMT) – MPEG approves 3rd Edition of Final Draft International Standard
  • MPEG-G – MPEG-G standards reach Draft International Standard for Application Program Interfaces (APIs) and Metadata technologies

The corresponding press release of the 126th MPEG meeting can be found here:

Three Degrees of Freedom Plus (3DoF+)

MPEG evaluates responses to the Call for Proposal and starts a new project on Metadata for Immersive Video

MPEG’s support for 360-degree video (also referred to as omnidirectional video) is achieved using the Omnidirectional Media Format (OMAF) and Supplemental Enhancement Information (SEI) messages for High Efficiency Video Coding (HEVC). It essentially uses the tiling feature of HEVC to implement 3DoF applications and services, e.g., users consuming 360-degree content using a head-mounted display (HMD). However, rendering flat 360-degree video may generate visual discomfort when objects close to the viewer are rendered. The interactive parallax feature of Three Degrees of Freedom Plus (3DoF+) will provide viewers with visual content that more closely mimics natural vision, but within a limited range of viewer motion.

At its 126th meeting, MPEG received five responses to the Call for Proposals (CfP) on 3DoF+ Visual. Subjective evaluations showed that adding the interactive motion parallax to 360-degree video will be possible. Based on the subjective and objective evaluation, a new project was launched, which will be named Metadata for Immersive Video. A first version of a Working Draft (WD) and corresponding Test Model (TM) were designed to combine technical aspects from multiple responses to the call. The current schedule for the project anticipates Final Draft International Standard (FDIS) in July 2020.

Research aspects: Subjective evaluations in the context of 3DoF+ but also immersive media services in general are actively researched within the multimedia research community (e.g., ACM SIGMM/SIGCHI, QoMEX) resulting in a plethora of research papers. One apparent open issue is the gap between scientific/fundamental research and standards developing organizations (SDOs) and industry fora which often address the same problem space but sometimes adopt different methodologies, approaches, tools, etc. However, MPEG (and also other SDOs) often organize public workshops and there will be one during the next meeting, specifically on July 10, 2019 in Gothenburg, Sweden which will be about “Coding Technologies for Immersive Audio/Visual Experiences”. Further details are available here.

Neural Network Compression for Multimedia Applications

MPEG evaluates responses to the Call for Proposal and kicks off its technical work

Artificial neural networks have been adopted for a broad range of tasks in multimedia analysis and processing, such as visual and acoustic classification, extraction of multimedia descriptors or image and video coding. The trained neural networks for these applications contain a large number of parameters (i.e., weights), resulting in a considerable size. Thus, transferring them to a number of clients using them in applications (e.g., mobile phones, smart cameras) requires compressed representation of neural networks.

At its 126th meeting, MPEG analyzed nine technologies submitted by industry leaders as responses to the Call for Proposals (CfP) for Neural Network Compression. These technologies address compressing neural network parameters in order to reduce their size for transmission and improve the efficiency of using them, while not or only moderately reducing their performance in specific multimedia applications.

After a formal evaluation of submissions, MPEG identified three main technology components in the compression pipeline, which will be further studied in the development of the standard. A key conclusion is that with the proposed technologies, a compression to 10% or less of the original size can be achieved with no or negligible performance loss, where this performance is measured as classification accuracy in image and audio classification, matching rate in visual descriptor matching, and PSNR reduction in image coding. Some of these technologies also result in the reduction of the computational complexity of using the neural network or can benefit from specific capabilities of the target hardware (e.g., support for fixed point operations).
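The CfP responses themselves are not described here, so the following is only a generic sketch of how parameter size can be traded against precision, not one of the evaluated technologies. It combines two widely used ideas, magnitude pruning and uniform 8-bit quantization, on random toy weights, and estimates the stored size under a naive storage model (one 32-bit index plus one 8-bit code per surviving weight). All names, the toy data and the storage model are assumptions.

```python
import random


def prune(weights, keep_fraction):
    """Keep the largest-magnitude weights; return {index: value} for survivors."""
    keep = max(1, int(len(weights) * keep_fraction))
    ranked = sorted(range(len(weights)), key=lambda i: abs(weights[i]), reverse=True)
    return {i: weights[i] for i in sorted(ranked[:keep])}


def quantize(sparse, bits=8):
    """Uniformly quantize values to signed integer codes; return (codes, scale)."""
    max_abs = max(abs(v) for v in sparse.values()) or 1.0
    levels = 2 ** (bits - 1) - 1          # 127 for 8 bits
    scale = max_abs / levels
    codes = {i: round(v / scale) for i, v in sparse.items()}
    return codes, scale


def dequantize(codes, scale, length):
    """Rebuild a dense weight list; pruned positions become 0.0."""
    dense = [0.0] * length
    for i, code in codes.items():
        dense[i] = code * scale
    return dense


random.seed(0)
weights = [random.gauss(0.0, 1.0) for _ in range(10_000)]  # toy "layer"

codes, scale = quantize(prune(weights, keep_fraction=0.05))
restored = dequantize(codes, scale, len(weights))

original_bits = 32 * len(weights)          # assumed float32 per weight
compressed_bits = (32 + 8) * len(codes)    # naive: 32-bit index + 8-bit code
print(f"stored size: {compressed_bits / original_bits:.2%} of the original")
```

With 5% of the weights kept, this storage model yields 6.25% of the original size. Whether a real network keeps its accuracy under such aggressive pruning depends entirely on the model and task; the sketch only illustrates the size accounting.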

Research aspects: This topic has been addressed already in previous articles here and here. An interesting observation after this meeting is that apparently the compression efficiency is remarkable, specifically as the performance loss is negligible for specific application domains. However, results are based on certain applications and, thus, general conclusions regarding the compression of neural networks as well as how to evaluate its performance are still subject to future work. Nevertheless, MPEG is certainly leading this activity which could become more and more important as more applications and services rely on AI-based techniques.

Low Complexity Enhancement Video Coding

MPEG evaluates responses to the Call for Proposal and selects a Test Model for further development

MPEG started a new work item referred to as Low Complexity Enhancement Video Coding (LCEVC), which will be added as part 2 of the MPEG-5 suite of codecs. The new standard is aimed at bridging the gap between two successive generations of codecs by providing a codec-agile extension to existing video codecs that improves coding efficiency and can be readily deployed via software upgrade and with sustainable power consumption.

The target is to achieve:

  • coding efficiency close to High Efficiency Video Coding (HEVC) Main 10 by leveraging Advanced Video Coding (AVC) Main Profile and
  • coding efficiency close to upcoming next generation video codecs by leveraging HEVC Main 10.

This coding efficiency should be achieved while maintaining overall encoding and decoding complexity lower than that of the leveraged codecs (i.e., AVC and HEVC, respectively) when used in isolation at full resolution. This target has been met, and one of the responses to the CfP will serve as starting point and test model for the standard. The new standard is expected to become part of the MPEG-5 suite of codecs and its development is expected to be completed in 2020.
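The text does not spell out how the enhancement works. One plausible reading, assumed here, is that the base codec runs at reduced resolution and a lightweight enhancement layer carries the residual needed to restore full resolution. The toy sketch below shows that idea on a 1-D signal; `base_codec` is a stand-in (coarse quantization), not AVC or HEVC, and none of this is taken from the LCEVC test model.

```python
def downsample(signal):
    """Halve the resolution by averaging neighbouring sample pairs."""
    return [(signal[i] + signal[i + 1]) / 2 for i in range(0, len(signal) - 1, 2)]


def upsample(signal):
    """Double the resolution by repeating each sample."""
    return [s for s in signal for _ in range(2)]


def base_codec(signal, step=8):
    """Stand-in for a lossy base codec: coarse uniform quantization."""
    return [round(s / step) * step for s in signal]


original = [10, 12, 40, 44, 90, 94, 30, 26]

# Base layer: the existing codec only sees the half-resolution signal.
base = base_codec(downsample(original))

# Enhancement layer: residual between the original and the upsampled base.
prediction = upsample(base)
residual = [o - p for o, p in zip(original, prediction)]

# Decoder side: upsampled base plus residual restores the full-resolution signal.
reconstructed = [p + r for p, r in zip(prediction, residual)]
print(residual)
print(reconstructed == original)
```

In this toy the residual is stored losslessly, so reconstruction is exact; a practical scheme would also compress the residual.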

Research aspects: In addition to VVC and EVC, LCEVC is now the third video coding project within MPEG basically addressing requirements and needs going beyond HEVC. As usual, research mainly focuses on compression efficiency but a general trend in video coding is probably observable that favors software-based solutions rather than pure hardware coding tools. As such, complexity — both at encoder and decoder — is becoming important as well as power efficiency which are additional factors to be taken into account. Other issues are related to business aspects which are typically discussed elsewhere, e.g., here.

Point Cloud Compression

MPEG promotes its Geometry-based Point Cloud Compression (G-PCC) technology to the Committee Draft (CD) stage

MPEG’s Geometry-based Point Cloud Compression (G-PCC) standard addresses lossless and lossy coding of time-varying 3D point clouds with associated attributes such as color and material properties. This technology is appropriate especially for sparse point clouds.

MPEG’s Video-based Point Cloud Compression (V-PCC) addresses the same problem but for dense point clouds, by projecting the (typically dense) 3D point clouds onto planes, and then processing the resulting sequences of 2D images with video compression techniques.
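The projection step can be pictured with a toy example. The sketch below orthographically projects a small point set onto the XY plane and keeps, per pixel, the nearest depth, yielding a 2D depth image of the kind a video codec could then compress. It is a simplified illustration only (a single plane, no patches, attributes or occupancy information), and the function name and grid size are assumptions.

```python
def project_to_depth_map(points, width, height):
    """Project integer (x, y, z) points onto the XY plane.

    Each pixel keeps the smallest z (the point nearest to the plane);
    pixels that receive no point stay None.
    """
    depth = [[None] * width for _ in range(height)]
    for x, y, z in points:
        if 0 <= x < width and 0 <= y < height:
            current = depth[y][x]
            if current is None or z < current:
                depth[y][x] = z
    return depth


# Toy point cloud on a 4x3 grid; two points share pixel (1, 1).
cloud = [(0, 0, 5), (1, 1, 7), (1, 1, 3), (3, 2, 9)]
depth_map = project_to_depth_map(cloud, width=4, height=3)

for row in depth_map:
    print(row)
```

Points hidden behind a nearer point on the same pixel are lost in a single projection, which is one reason a real scheme needs more than one plane.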

G-PCC provides a generalized approach, which directly codes the 3D geometry to exploit any redundancy found in the point cloud itself and is complementary to V-PCC and particularly useful for sparse point clouds representing large environments.

Point clouds are typically represented by extremely large amounts of data, which is a significant barrier for mass market applications. However, the relative ease of capturing and rendering spatial information compared to other volumetric video representations makes point clouds increasingly popular for presenting immersive volumetric data. The current implementation of a lossless, intra-frame G-PCC encoder provides a compression ratio of up to 10:1, and lossy coding at acceptable quality reaches ratios of up to 35:1.

Research aspects: After V-PCC, MPEG has now promoted G-PCC to CD but, in principle, the same research aspects are relevant as discussed here. Thus, coding efficiency is the number one performance metric, but coding complexity and power consumption also need to be considered to enable industry adoption. Systems technologies and adaptive streaming are actively researched within the multimedia research community, specifically ACM MM and ACM MMSys.

MPEG Media Transport (MMT)

MPEG approves 3rd Edition of Final Draft International Standard

MMT 3rd edition will introduce two aspects:

  • enhancements for mobile environments and
  • support of Content Delivery Networks (CDNs).

The support for multipath delivery will enable delivery of services over more than one network connection concurrently, which is specifically useful for mobile devices that can support more than one connection at a time.

Additionally, support for intelligent network entities involved in media services (i.e., Media Aware Network Entity (MANE)) will make MMT-based services adapt to changes of the mobile network faster and better. Recognizing that support for load balancing is an important feature of CDN-based content delivery, messages for DNS management, media resource update, and media request are being added in this edition.

Ongoing developments within MMT will add support for the usage of MMT over QUIC (Quick UDP Internet Connections) and support of FCAST in the context of MMT.

Research aspects: Multimedia delivery/transport is still an important issue, specifically as multimedia data on the internet is increasing much faster than network bandwidth. In particular, the multimedia research community (i.e., ACM MM and ACM MMSys) is looking into novel approaches and tools utilizing existing/emerging protocols/techniques like HTTP/2, HTTP/3 (QUIC), WebRTC, and Information-Centric Networking (ICN). One question, however, remains: what is the next big thing in multimedia delivery/transport? We are currently in a phase where tools like HTTP adaptive streaming (HAS) have reached maturity, and the multimedia research community is eager to work on new topics in this domain.

Report from ACM MM 2018 – by Ana García del Molino

Seoul, what a beautiful place to host the premier conference on multimedia! Living in never-ending summer Singapore, I fell in love with the autumn colours of this city. The 26th edition of the ACM International Conference on Multimedia was held on October 22-26 of 2018 at the Lotte Hotel in Seoul, South Korea. It packed a full program including a very diverse range of workshops and tutorials, oral and poster presentations, art exhibits, interactive demos, competitions, industrial booths, and plenty of networking opportunities.

For me, this edition was a special one. About to graduate, with my thesis half written, I was presenting two papers. So of course, I was both nervous and excited. I had to fly to Seoul a few days ahead just to prepare myself! I was so motivated, I somehow managed to get myself a Best Social Media Reporter Award (who would have said… Me! A reporter!).

So, enough with the intro. Let’s get to the juice. What happened in Seoul between the 22nd and 26th of October 2018?

The first and last day of the conference were dedicated to Workshops and Tutorials. Those were a mix between Deep Learning themed and social applications of multimedia. The sessions included tutorials like “Interactive Video Search: Where is the User in the Age of Deep Learning?” that discussed the importance of the user in the collection of datasets, evaluation, and also interactive search, as opposed to using deep learning to solve challenges with big labelled datasets. In “Deep Learning Interpretation” Jitao Sang presented the main multimedia problems that can’t be addressed using deep learning. On the other hand, new and important trends related to social media (analysis of information diffusion and contagion, user activities and networking, prediction of real-world events, etc) were discussed in the tutorial “Social and Political Event Analysis using Rich Media”. The workshops were mainly user-centred, with special interest in affective computing and emotion analysis and use for multimedia (EE-USAD, ASMMC – MMAC 2018, AVEC 2018).

The conference kick-started with a wonderful keynote by Marianna Obrist. With “Don’t just Look – Smell, Taste, and Feel the Interaction” she showed us how to bring art into 4D by using technology, driving us through a full sensory experience that let us see, hear, and almost touch and smell. Ernest Edmonds also delved into how to mix art and multimedia in “What has art got to do with it?” but this time the other way around: what can multimedia research learn from the artists? Three industry speakers completed the keynote program. Xian-Sheng Hua from Alibaba Group shared their efforts towards visual intelligence in “Challenges and Practices of Large-Scale Visual Intelligence in the Real-World”. Gary Geunbae Lee shared Samsung’s AI user experience strategy in “Living with Artificial Intelligence Technology in Connected Devices around Us.” And Bowen Zhou presented the brand-new concept of Retail as a Service in “Transforming Retailing Experiences with Artificial Intelligence”.

This year’s program included 209 full papers, from a total of 757 submissions. 64 papers were allocated 15-minute oral presentations, while the others got a 90-second spotlight slot in the fast-forward sessions. The poster sessions and the oral sessions ran at the same time. While this was an inconvenience for poster presenters, who had to either leave their poster to attend the oral sessions or miss them, the coffee breaks took place at the same location as the posters, so that was a win-win: chit-chat while having cookies and fruit? I’m in! In terms of content, half of the submissions were to only two areas: Multimedia and Vision and Deep Learning for Multimedia. But who am I to judge, when I had two of those myself! Many members of the community noted that the conference is becoming more and more about deep learning, and less multimodal. To compensate, the workshops, tutorials and demos were mostly pure multimedia.

The challenges, competitions, art exhibits and demos happened in the afternoons, so at times it was hard to choose where to head to. So many interesting things happening all around the place! The art exhibit had some really cool interactive art installations, such as “Cellular Music”, which created music from visual motion. Among the demos, I found the following particularly interesting: AniDance, an LSTM-based algorithm that made 3D models dance to the given music; SoniControl, an ultrasonic firewall for NFC protection; MusicMapp, a platform to augment how we experience music; and The Influence Map project, which explores who has influenced each scientist and whom they influenced most throughout their career.

Regarding diversity, I feel there is still a long way to go. Being in Asia, it makes sense that almost half of the attendees came from China. However, the submission numbers speak for themselves: less than 20% of submissions came from outside Asia, with just one submission from Africa (that’s 0.13%!). Diversity is not only about gender, folks! I feel that more effort is needed to facilitate the integration of more collectives in the multimedia community. One step at a time.

The next edition will take place at the NICE ACROPOLIS Convention Center in Nice, France from 21-25 October 2019. The ACM reproducibility badge system will be implemented for the first time at this 27th edition, so we may be seeing many more open-sourced projects. I am so looking forward to this!

On System QoE: Merging the system and the QoE perspectives

With Quality of Experience (QoE) research having made significant advances over the years, increased attention is being put on exploiting this knowledge from a service/network provider perspective in the context of the user-centric evaluation of systems. Current research investigates the impact of system/service mechanisms, their implementation or configurations on the service performance and how it affects the corresponding QoE of its users. Prominent examples address adaptive video streaming services, as well as enabling technologies for QoE-aware service management and monitoring, such as SDN/NFV and machine learning. This is also reflected in the latest edition of conferences such as the ACM Multimedia Systems Conference (MMSys ’19); some selected exemplary papers are listed below.

  • “ERUDITE: a Deep Neural Network for Optimal Tuning of Adaptive Video Streaming Controllers” by De Cicco, L., Cilli, G., & Mascolo, S.
  • “An SDN-Based Device-Aware Live Video Service For Inter-Domain Adaptive Bitrate Streaming” by Khalid, A., Zahran, H., & Sreenan, C.J.
  • “Quality-aware Strategies for Optimizing ABR Video Streaming QoE and Reducing Data Usage” by Qin, Y., Hao, S., Pattipati, K., Qian, F., Sen, S., Wang, B., & Yue, C.
  • “Evaluation of Shared Resource Allocation using SAND for Adaptive Bitrate Streaming” by Pham, S., Heeren, P., Silhavy, D., & Arbanowski, S.
  • “Requet: Real-Time QoE Detection for Encrypted YouTube Traffic” by Gutterman, C., Guo, K., Arora, S., Wang, X., Wu, L., Katz-Bassett, E., & Zussman, G.

For the evaluation of systems, proper QoE models are of utmost importance, as they provide a mapping of various parameters to QoE. One of the main research challenges faced by the QoE community is deriving QoE models for various applications and services, whereby ratings collected from subjective user studies are used to model the relationship between tested influence factors and QoE. Below is a selection of papers dealing with this topic from QoMEX 2019, the main scientific venue for the QoE community.

  • “Subjective Assessment of Adaptive Media Playout for Video Streaming” by Pérez, P., García, N., & Villegas, A.
  • “Assessing Texture Dimensions and Video Quality in Motion Pictures using Sensory Evaluation Techniques” by Keller, D., Seybold, T., Skowronek, J., & Raake, A.
  • “Tile-based Streaming of 8K Omnidirectional Video: Subjective and Objective QoE Evaluation” by Schatz, R., Zabrovskiy, A., & Timmerer, C.
  • “SUR-Net: Predicting the Satisfied User Ratio Curve for Image Compression with Deep Learning” by Fan, C., Lin, H., Hosu, V., Zhang, Y., Jiang, Q., Hamzaoui, R., & Saupe, D.
  • “Analysis and Prediction of Video QoE in Wireless Cellular Networks using Machine Learning” by Minovski, D., Åhlund, C., Mitra, K., & Johansson, P.

System-centric QoE

When considering the whole service, the question arises of how to properly evaluate QoE in a systems context, i.e., how to quantify system-centric QoE. The paper [1] provides fundamental relationships for deriving system-centric QoE, which are the basis for this article.

In the QoE community, subjective user studies are conducted to derive relationships between influence factors and QoE. Typically, the results of these studies are presented in terms of Mean Opinion Scores (MOS). However, these MOS results mask user diversity, which leads to specific distributions of user scores for particular test conditions. In a systems context, QoE can be better represented as a random variable Q|t for a fixed test condition. Such models are commonly exploited by service/network providers to derive various QoE metrics [2] in their system, such as expected QoE, or the percentage of users rating above a certain threshold (Good-or-Better ratio GoB).

Across the whole service, users will experience different performance, measured by, e.g., response times or throughput, which depends on the system’s (and services’) configuration and implementation. In turn, this leads to users experiencing different quality levels. As an example, we consider the response time of a system that offers a certain web service, such as access to a static web site. In such a case, the system’s performance can be represented by a random variable R for the response time. In the systems community, research aims at deriving such distributions of the performance, R.

The user-centric evaluation of the system combines the system’s perspective and the QoE perspective, as illustrated in the figure below. We consider service/network providers interested in deriving various QoE metrics in their system, given (a) the system’s performance, and (b) QoE models available from user studies. The main question we need to answer is how to combine (a) user rating distributions obtained from subjective studies and (b) system performance condition distributions, so as to obtain the actual observed QoE distribution in the system. Moreover, how can various QoE metrics of interest in the system be derived?

System centric QoE - Merging the system and the QoE perspectives

Model of System-centric QoE

A service provider is interested in the QoE distribution Q in the system, which includes the following stochastic components: 1) system performance condition, t (i.e., response time in our example), and 2) user diversity, Q|t. This system-centric QoE distribution allows us to derive various QoE metrics, such as expected QoE or expected GoB in the system.

Some basic mathematical transformations allow us to derive the expected system-centric QoE E[Q], as shown below. As a result, we show that the expected system QoE is equal to the expected Mean Opinion Score (MOS) in the system! Hence, for deriving system QoE, it is necessary to measure the response time distribution R and to have a proper QoS-to-MOS mapping function f(t) obtained from subjective studies. From the subjective studies, we obtain the MOS mapping function for a response time t, f(t)=E[Q|t]. The system QoE then follows as E[Q]=E[f(R)]=E[M], where M=f(R) is the MOS in the system. Note: the distribution of the MOS M in the system only allows us to derive the expected MOS, i.e., the expected system-centric QoE.

Expected system QoE E[Q] in the system is equal to the expected MOS
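
The steps behind this result can be sketched as follows. This is a reconstruction under stated assumptions rather than the derivation from [1]: it applies the law of total expectation and writes F_R for the distribution function of the response time R, a symbol introduced here only for illustration. The mapping f(t)=E[Q|t] is the MOS mapping function defined above.

```latex
\begin{align}
E[Q] &= \int E[Q \mid R = t] \, dF_R(t) && \text{(law of total expectation)} \\
     &= \int f(t) \, dF_R(t)            && \text{(since } f(t) = E[Q \mid t] \text{)} \\
     &= E[f(R)] = E[M]                  && \text{(with } M = f(R) \text{)}
\end{align}
```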

Let us consider another system-centric QoE metric, such as the GoB ratio. On a typical 5-point Absolute Category Rating (ACR) scale (1: bad quality, 5: excellent quality), the system-centric GoB is defined as GoB[Q]=P(Q>=4). We find that it is not possible to use a MOS mapping function f and the MOS distribution M=f(R) to derive GoB[Q] in the system! Instead, it is necessary to use the corresponding QoS-to-GoB mapping function g. This mapping function g can also be derived from the same subjective studies as the MOS mapping function, and maps the response time (tested in the subjective experiment) to the ratio of users rating “good or better” QoE, i.e., g(t)=P(Q|t>=4). We may thus derive in a similar way: GoB[Q]=E[g(R)]. In the system, the GoB ratio is the expected value of the response times R mapped to g(R). Similar observations lead to analogous results for other QoE metrics, such as quantiles or variances (see [1]).
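
Written out in the same way as for the expected QoE, the GoB relationship can be sketched as below. As before, F_R denotes the distribution function of the response time R and is notation assumed here for illustration only.

```latex
\begin{align}
\mathrm{GoB}[Q] &= P(Q \ge 4) = \int P(Q \ge 4 \mid R = t) \, dF_R(t) \\
                &= \int g(t) \, dF_R(t) = E[g(R)]
\end{align}
```

In contrast, P(M>=4)=P(f(R)>=4) only measures how often the mean rating of a condition reaches 4, which in general is a different quantity from GoB[Q].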


The reported fundamental relationships provide an important link between the QoE community and the systems community. If researchers conducting subjective user studies provide different QoS-to-QoE mapping functions for QoE metrics of interest (e.g., MOS or GoB), this is enough to derive corresponding QoE metrics from a system’s perspective. This holds for any QoS (e.g., response time) distribution in the system, as long as the corresponding QoS values are captured in the reported QoE models. As a result, we encourage QoE researchers to report not only MOS mappings, but the entire rating distributions from conducted subjective studies. As an alternative, researchers may report QoE metrics and corresponding mapping functions beyond just those relying on MOS!

We draw the attention of the systems community to the fact that the actual QoE distribution in a system is not (necessarily) equal to the MOS distribution in the system (see [1] for numerical examples). Just applying MOS mapping functions and then using the observed MOS distribution to derive other QoE metrics like GoB is not adequate. The current systems literature, however, indicates a clear lack of common understanding of the implications of using MOS distributions rather than actual QoE distributions.
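
The difference between the two distributions can be illustrated with a small sketch. All numbers below are made up for illustration and are not taken from [1] or from any subjective study: a hypothetical system has only two response times, each with its own hypothetical rating distribution on the 5-point ACR scale. The function and variable names are likewise assumptions of this sketch.

```python
# Hypothetical example: every number here is invented for illustration.

# P(R = t): response-time distribution in the system (seconds -> probability)
response_time_pmf = {0.5: 0.6, 4.0: 0.4}

# P(Q = q | t): rating distribution on the 5-point ACR scale per test condition
rating_pmf = {
    0.5: {1: 0.00, 2: 0.05, 3: 0.15, 4: 0.40, 5: 0.40},
    4.0: {1: 0.10, 2: 0.30, 3: 0.40, 4: 0.15, 5: 0.05},
}


def mos(t):
    """QoS-to-MOS mapping f(t) = E[Q|t]."""
    return sum(q * p for q, p in rating_pmf[t].items())


def gob(t):
    """QoS-to-GoB mapping g(t) = P(Q|t >= 4)."""
    return sum(p for q, p in rating_pmf[t].items() if q >= 4)


def expectation(mapping):
    """E[mapping(R)] over the response-time distribution."""
    return sum(p * mapping(t) for t, p in response_time_pmf.items())


def system_rating_pmf():
    """Actual QoE distribution Q: mixture of Q|t weighted by P(R = t)."""
    pmf = {q: 0.0 for q in range(1, 6)}
    for t, p_t in response_time_pmf.items():
        for q, p_q in rating_pmf[t].items():
            pmf[q] += p_t * p_q
    return pmf


q_pmf = system_rating_pmf()
expected_qoe = sum(q * p for q, p in q_pmf.items())      # E[Q]
expected_mos = expectation(mos)                          # E[f(R)] = E[M]
system_gob = sum(p for q, p in q_pmf.items() if q >= 4)  # GoB[Q] = P(Q >= 4)
gob_via_g = expectation(gob)                             # E[g(R)]

# Misuse: treat the MOS distribution M = f(R) as if it were the QoE distribution
gob_from_mos = sum(p for t, p in response_time_pmf.items() if mos(t) >= 4)

print(f"E[Q] = {expected_qoe:.2f}, E[M] = {expected_mos:.2f}")
print(f"GoB[Q] = {system_gob:.2f}, E[g(R)] = {gob_via_g:.2f}")
print(f"P(M >= 4) = {gob_from_mos:.2f}")
```

With these made-up numbers, the expected QoE and the expected MOS agree (3.59), and the GoB ratio computed from the actual QoE distribution agrees with E[g(R)] (0.56), whereas thresholding the MOS distribution yields 0.60.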


[1] Hoßfeld, T., Heegaard, P. E., Skorin-Kapov, L., & Varela, M. (2019). Fundamental Relationships for Deriving QoE in Systems. 2019 Eleventh International Conference on Quality of Multimedia Experience (QoMEX). IEEE.

[2] Hoßfeld, T., Heegaard, P. E., Varela, M., & Möller, S. (2016). QoE beyond the MOS: an in-depth look at QoE via better metrics and their relation to MOS. Quality and User Experience, 1(1), 2.


  • Tobias Hoßfeld (University of Würzburg, Germany) is heading the chair of communication networks.
  • Poul E. Heegaard (NTNU – Norwegian University of Science and Technology) is heading the Networking Research Group.
  • Lea Skorin-Kapov (University of Zagreb, Faculty of Electrical Engineering and Computing, Croatia) is heading the Multimedia Quality of Experience Research Lab.
  • Martin Varela is working in an analytics team, focusing on understanding and monitoring QoE for WebRTC services.

Multidisciplinary Column: An Interview with Max Mühlhäuser



Could you tell us a bit about your background, and what the road to your current position was?

Well, this road is marked by wonderful people who inspired me and sparked my interest in the research fields I pursued. In addition, it is marked by two of my major deficiencies: I cannot stop investigating the role of my research in the larger context of systems and disciplines, and I have the strong desire to see “inventions” by researchers make their way into practice, i.e., turn into “innovations”. The first of these deficiencies led to the unusually broad research interests of my lab and myself, and the second one made me spend a substantial part of my career conceptualizing and leading technology transfer organizations, for the most part industry-funded ones.

More precisely, I started to cooperate with Digital Equipment Corp. (DEC) already during my Diploma thesis. DEC was then the second-largest computer manufacturer and the spearhead of the efforts to build affordable “computers for every engineering group”. My boss, the late Professor Krüger, gave me a lot of freedom, so I was able to turn the research cooperation into the first funded European research project of DEC and later into their first research center in Europe, conceived as a campus-based organization that worked very closely with academia. I am proud to say that I was allowed to conceptualize this academia-industry cooperation and that it was later copied – often with my help and consultancy – many times across the globe, by several companies and governments. I acted as the founding director of the first such center, but at that time I was already determined to follow the academic career path. At the age of 32, I was appointed professor at the University of Kaiserslautern. Over the years, I was offered positions at prestigious universities in Canada, France, and the Netherlands, and I accepted positions in Austria and Germany (Karlsruhe, Darmstadt). My sabbaticals led me to Australia, France and Canada, and for the most part to California (San Diego and four times Palo Alto). In retrospect, it was exciting to start at a new academic position every couple of years in the beginning, but it was also exciting to “finally settle” in Darmstadt and to build the strengths and connections there that were necessary to drive even larger cooperative projects than before.

The Telecooperation Lab embraces many different disciplines. Celebrating its 20th birthday next year, how did these disciplines evolve over the years?

It started with my excitement for distributed systems, based on solid knowledge about computer networks. At the time (the early 1980s), little more than point-to-point communication between file transfer or e-mail agents existed, and neither client-server nor multi-party systems were common. My early interest in this field concerned software engineering for distributed systems, ranging from design and specification support via programming and simulation to debugging and testing. Soon, multimedia became feasible due to advancements in computer hardware – and in peripherals: think of the late laser disk, a clumsy predecessor of today’s DVDs and BDs. Multimedia grabbed my immediate attention since numerous problems arose from the desire to enable it in a distributed manner. Almost at the same time, e-learning became my favorite application field since I saw the great potential of distributed multimedia for this domain, given the challenges of global education and of the knowledge society. I believe that technology has come a long way with respect to e-learning, but we are still far from mastering the challenges of technology-supported education and knowledge work.

Soon came the time when computers left the desk and became ubiquitous. From my experience in multimedia and e-learning, it was obvious to me that human-computer interaction would be a key to the success of ubiquitous computing. Simply extrapolating the keyboard-mouse-monitor based interaction paradigm to a future where tens, hundreds, or thousands of computers would surround an individual – what a nightmare! This threat of a dystopia made us work on implicit and tangible interaction, hybrid cyber-physical knowledge work, novel mobile and workspace interaction, augmented and virtual reality, and custom 3D printed interaction – HCI became our “new multimedia”.

Regarding application domains, our research in supporting the knowledge society evolved towards supporting ‘smart environments and spaces’, a natural consequence of the evolution of our core research towards networked ubiquitous computers. My continued interest in turning inventions into innovations made us work on urgent problems of industry – mainly revolving around business processes – and on computers that expect the unexpected: emergencies and disasters. Both these domains were a nice fit since they could benefit from appropriate smart spaces. Looking at smart spaces of ever larger scale, we naturally hit the challenge of supporting smart cities and critical infrastructures.

Finally, a bit more than ten years ago, our ubiquitous computing research made us encounter and realize the “ubiquity” of related cybersecurity threats at large, in particular threats to privacy, as well as the challenges of appropriate trustworthiness estimation and of detecting networked attacks. These cybersecurity research activities were, like those in HCI, natural consequences of my aforementioned deficiency: my desire to take a holistic look at systems – in my case, ubiquitous computing systems.

Lastly, the fact that we adapt, apply and sometimes advance machine learning concepts in our research is nothing but a natural consequence of the utility of those concepts for our purposes.

How would you describe the interrelationship between those disciplines? Do these benefit from cross-fertilization effects and if so, how?

In my answer to your last question, I unintentionally used the word “natural” several times. This shows already that research on ubiquitous computing and smart spaces with a holistic slant almost inevitably leads you to look at the different aspects we investigate. These aspects just happen to concern different research disciplines in computer science. The starting point is the fact that ubiquitous computing devices are much less general-purpose computers than dedicated components. Networking and distributed systems support are therefore a prerequisite for orchestrating these dedicated skills, forming what can be called a truly smart space. Such spaces are usually meant to assist humans, so that multimedia – conveying “humane” information representations – and HCI – for interacting with many cooperating dedicated components – are indispensable. Next, how can a smart space assist a human if it is subject to cyber-vulnerabilities? Instead, it has to enforce its users’ concerns with respect to privacy, trust, and intended behavior. Finally, true smartness is by nature bound to adopting and adapting best-of-breed AI techniques.

You also asked for cross-fertilizing effects. Let me share just three of the many examples in this respect. (i) Our AI-related work cross-fertilized our cyberattack defense. (ii) On the other hand, the AI work introduced new challenges in distributed and networked systems, driving our research on edge computing forward. (iii) New requirements are added to this edge computing research by HCI since we want to support collaborative AR applications at large, i.e., city-wide, scale.

Moreover, cross-fertilizing goes beyond the research fields of computer science that we integrate in my own lab. As you know, I was and am heading highly interdisciplinary doctoral schools, formerly on e-learning, and now on privacy and trust for mobile users. When you work with researchers from sociology, law, economics, and psychology on topics like privacy-protecting smartphones, you first consider these topics as pertaining to computer science. Soon, you realize that the other disciplines dealt with issues like privacy and trust long before computers existed. Not only can you learn a lot from the deep and concise findings brought forth by these disciplines over decades or centuries, you can also quickly establish a very fruitful cooperation with researchers from these disciplines who address the new challenges of mobile and ubiquitous computing from their perspective. I am convinced that the unique role of Xerox PARC in the history of computer science, with so many of the most fundamental innovations originating there, is mainly a consequence of their highly interdisciplinary approaches, combining the “science of computers” with the “sciences concerned with humans”.

Please tell us about the main challenges you faced when uniting such diverse topics under the Telecooperation Lab’s multi-disciplinary umbrella?

The major challenge lies in a balancing act for each PhD thesis and researcher. On one hand, the work must be strictly anchored in a narrow academic field; as a young researcher, you are lucky if you can make a bit of a name for yourself in a single narrow community – which is a prerequisite for any further academic career steps for many reasons. Trying to get rooted in more than one community during a PhD would be what I call academic suicide. The second side of the balancing act, for us, is the challenge to keep that narrow and focused PhD well connected to the multi-area context of my lab – and for the members of the doctoral schools, even connected to the respective multi-disciplinary context. While this second side is not a prerequisite for a PhD, it is an inexhaustible source of both new challenges for, and new approaches to, the respective narrow PhD fields. In fact, reaching out to other fields while mastering your own field costs some additional time; in my experience, however, this additional time can easily be saved in the search for original scientific contributions that will earn you a PhD. The reason is that the cross-fertilizing from a multi-area or even multi-disciplinary setting will lead you to original contributions much faster, due to a fresh look at both challenges and approaches.

When it comes to postdoctoral researchers, things are a bit different since they are already rooted in a field, which means that they can reach out a bit further to other areas and disciplines, thereby creating a unique little research domain in which they can make a name for themselves for their further career. My aim for my postdocs is to help them attain a status where, when I mention their name in a pertinent academic circle, my colleagues would say “oh, I know, that’s the guy who is working on XYZ”, with XYZ being a concise subdomain of research which that postdoc was instrumental in shaping.

The Telecooperation Lab is part of CRISP, the National Research Center for Applied Cybersecurity in Germany, which embraces many disciplines as well. Can you give us some insights into multidisciplinarity in such an environment?

Let me start by explaining that we started the first large cybersecurity research center in Darmstadt more than ten years ago; CRISP in its current form as a national center has only just come into existence. By the way, CRISP will have to be renamed again for legal reasons (sigh!). Therefore, let me address our cybersecurity research in general. This research involves a very broad spectrum of disciplines, from physicists who address quantum-related aspects to psychologists who investigate usable security and mental models. The most fruitful cooperations always concern areas that establish a “mutual benefits and challenges” relationship with the computer science side of cybersecurity. Two examples that come to my mind are law and economics. Computer science solutions to security and privacy always have limits. For instance, cryptographic solutions are always linked to trust at their boundaries (cf. trusted certificate authorities, trusted implementations of theoretically “proven-secure” protocols, trust in the absence of insider threats etc.). At such boundaries, law must punish what technology cannot guarantee, otherwise the systems remain insecure. In the reverse direction, new technical possibilities and solutions must be reflected in law. A prominent example is the power of AI: privacy law, such as the European Union’s GDPR, holds data processing organizations liable if they process personally identifiable information, PII for short. If data is not considered to be PII, it can be released. Now what if, three years later, a novel AI algorithm can link that data to some background data and infer PII from it? Privacy law needs a considerable update due to these new technical possibilities. I could talk about these mutual benefits and challenges on and on, but let me just quickly mention one more example from economics: if technology comes up with new privacy-preserving schemes, then these schemes may open up new opportunities for privacy-respecting services.
In order for such services to succeed in the market, we need to learn about possible corresponding business models. This kind of economics research may lead to new challenges for technical approaches, and so on. Such “cycles of innovation” across different disciplines are among the most exciting facets of interdisciplinary research.

Could you name a grand challenge of multidisciplinary research in the Multimedia community?

Oh, I think I have quite a definite opinion on this one! We clearly live in the era of the fusion of bits and atoms – and this metaphor is of course just one way to characterize what is going on. Firstly, in the cyber-physical society that we are currently creating, the digital components are becoming the “brains” of complex real-world systems such as the transport system, energy grids, industrial production etc. This development already creates significant challenges concerning our future society, but beyond this trend and directly related to multimedia, there is an even more striking development: we increasingly feed the human senses by means of digitally created or processed signals – and hence, basically by means of multimedia. TV and telephone, social media and Web-based information, Skype conversations and meetings, you-name-it: our perception of objects, spaces, and of our conversation partners – in other words: of the physical world – is conveyed, augmented, altered, and filtered by means of computers and computer networks. Now, you will ask what I consider the challenge in this development, which has been going on for decades. Consider that this field “jumps forward” in our days due to AI and other advancements: it is the challenge for interdisciplinary multimedia research to properly conserve the distinction between “real” and “imaginary” in all cases where we would or should conserve it. To cite a field that is only marginally concerned here, let me mention games: in games, it is – mostly – desired to blur the distinction between the real and the virtual. However, if you think of fake news or of highly persuasive social media governmental election campaigns, you get an idea of what I mean. The challenge here is highly multidisciplinary: for instance, many computer science areas have to come together already in order to check where in the media processing chain we can intervene in order to keep a handle on the real-versus-virtual distinction.
Way beyond that, we need many disciplines to work hand-in-hand in order to figure out what we want and how we can achieve it. We have to recognize that many long-existing trends are on the verge of jumping forward to an unprecedented level of perfection. We must figure out what society needs and wants. It is reckless to leave this development to economic or even malicious forces or to tech nerds who invent their own ethics. The examples are endless; let me cite a few in addition to those mentioned above, which highlighted fake news and manipulative election campaigns.

Machine learning experts may call me paranoid, hinting at the fact that the detection of manipulated photos or deep fake videos is still a much simpler machine learning task than creating them. While this is true, I fear that it may change in the future. Moreover, alluding to the multidisciplinary challenges mentioned, let me remind you that we currently don’t have processes in place that would sufficiently check content for authenticity in a systematic way.

As another example, humans are told they are “valued customers”, but they have long been considered consumers at best. More recently, they have been downgraded to mass objects in which purchase desires are first created, then directed – by sophisticated algorithms and with ever more convincing multimedia content. Meanwhile in the background, pricing discrimination is rising to new levels of sophistication. In a different field, questionable political powers are more and more capable of destabilizing democracies from a safe seat across the Internet, using curated and increasingly machine-created influential media.

As a next big wave, we are witnessing a giants’ race among global IT players for the crown in the augmented and virtual reality markets. What is still a niche area may become widespread technology tomorrow – reckon that the first successful smartphone was introduced only little more than a decade ago and that meanwhile the majority of the world’s population use smartphones to access the Internet. A similar success story may lie ahead for AR/VR: at the latest when a generation grows up wearing AR contact lenses, noise-cancelling earplugs and haptics-augmented clothes, reality will not be threatened by fake information any more, but digitally created, imaginary content will be reality, rendering the question “what is real?” obsolete. Of course, the list of technologies and application domains mentioned here is far from exhaustive.

The problem is that all these trends appear to be evolutionary rather than disruptive, as they really are. Marketing already influenced customers centuries ago, fake news has existed even longer, and the movie industry has always had a leading role in imaginary technology, from chroma keying to the most advanced animation techniques. Therefore, the new and upcoming AI-powered multimedia technology is not (yet) recognized as disruptive and hence as a considerable threat to the fundamental rules of our society. This is a key reason why I consider this field a grand interdisciplinary research challenge. We definitely need far more than technology solutions. At the outset, we need to come to grips with appropriate ethical and socio-political norms. To what extent do we want to keep and protect the governing rules of society and humankind? Which changes do we want, which ones not? What does all that mean in terms of governing rules for AI-powered multimedia, for the merging of the real and the virtual? Apart from basic research, we need a participatory approach that involves society in general and the rising generations in particular. Since we cannot expect these fundamental societal processes to lead to a final decision, we have to advance the other research challenges in parallel. For instance, we need a better understanding of social implications and of psychological factors related to the merging of the real and the virtual. Technology-related research must be intertwined with these efforts; as to the technology fields concerned, multimedia research must go hand-in-hand with others like AI, cybersecurity, privacy, etc. – the selection depends on the particular questions addressed. This research must be further intertwined with human-related fields such as law: laws must again regulate what technology can’t solve, and reflect what technology can achieve for good or evil.
In all this, I did not yet mention further related issues such as biometric access control: as we try to make access control more user-friendly, we rely on biometric data, most of which are variants of multimedia, namely speech, face or iris photos, gait and others. The difference between real and virtual remains important here and we can expect enormous malicious efforts to blur it. You see, there is really a multidisciplinary grand challenge for multimedia.

How and in what form do you feel we as academics can be most impactful?

During the first half of my career, computer science was still in that wonderful gold diggers’ era: if you had a good idea and just decent skills to convey it to your academic peers, you could count on that idea to be heard, valued, and – if it was socially and economically viable – realized. Since then, we have moved to a state in which good research results are not even half the story. Many seemingly marginal factors drive innovation today. No wonder we have reached a point at which many industry players think that innovation should be driven by the company’s product groups in a closed loop with customers, or by startups that can be acquired if successful, or – for the small part that requires long-term research – by a few top research institutions. I am confident that this opinion will be replaced by a new craze among CEOs in a few years. Meanwhile, academics should do their homework in three ways. (a) They should look for the kernel of truth in the current anti-academic trend and improve academic research accordingly. (b) They should orient their research towards the unique strengths of academia, like the possibility to carry out true interdisciplinary research at universities. (c) They should tune their role, their words and deeds to those much-increased societal responsibilities highlighted above.

Academics from computer science trigger confusion and reshaping of our society to a bigger and bigger extent; it is time for them to live up to their responsibility.


Prof. Dr. Max Mühlhäuser is head of the Telecooperation Lab at Technische Universität Darmstadt, Informatics Dept. His Lab conducts research on smart ubiquitous computing environments for the ‘pervasive Future Internet’ in three research fields: middleware and large network infrastructures, novel multimodal interaction concepts, and human protection in ubiquitous computing (privacy, trust, & civil security). He heads or co-supervises various multilateral projects, e.g., on the Internet-of-Services, smart products, ad-hoc and sensor networks, and civil security; these projects are funded by the National Funding Agency DFG, the EU, German ministries, and industry. Max is heading the doctoral school Privacy and Trust for Mobile Users and serves as deputy speaker of the collaborative research center MAKI on the Future Internet. Max has also led several university wide programs that fostered E-Learning research and application. In his career, Max put a particular emphasis on technology transfer, e.g., as the founder and mentor of several campus-based industrial research centers.

Max has over 30 years of experience in research and teaching in areas related to Ubiquitous Computing (UC), Networks, Distributed Multimedia Systems, E-Learning, and Privacy&Trust. He held permanent or visiting professorships at the Universities of Kaiserslautern, Karlsruhe, Linz, Darmstadt, Montréal, Sophia Antipolis (Eurecom), and San Diego (UCSD). In 1993, he founded the TeCO institute in Karlsruhe, Germany, which became one of the pace-makers for Ubiquitous Computing research in Europe. Max regularly publishes in Ubiquitous and Distributed Computing, HCI, Multimedia, E-Learning, and Privacy&Trust conferences and journals and authored or co-authored more than 400 publications. He was and is active in numerous conference program committees, as organizer of several annual conferences, and as member of editorial boards or guest editor for journals like Pervasive Computing, ACM Multimedia, Pervasive and Mobile Computing, Web Engineering, and Distance Learning Technology.

Editor Biographies

Dr. Cynthia C. S. Liem is an Assistant Professor in the Multimedia Computing Group of Delft University of Technology, The Netherlands, and pianist of the Magma Duo. She initiated and co-coordinated the European research project PHENICX (2013-2016), focusing on technological enrichment of symphonic concert recordings with partners such as the Royal Concertgebouw Orchestra. Her research interests consider music and multimedia search and recommendation, and increasingly shift towards making people discover new interests and content which would not trivially be retrieved. Beyond her academic activities, Cynthia gained industrial experience at Bell Labs Netherlands, Philips Research and Google. She was a recipient of the Lucent Global Science and Google Anita Borg Europe Memorial scholarships, the Google European Doctoral Fellowship 2010 in Multimedia, and a finalist of the New Scientist Science Talent Award 2016 for young scientists committed to public outreach.




Dr. Jochen Huber is a Senior User Experience Researcher at Synaptics. Previously, he was an SUTD-MIT postdoctoral fellow in the Fluid Interfaces Group at MIT Media Lab and the Augmented Human Lab at Singapore University of Technology and Design. He holds a Ph.D. in Computer Science and degrees in both Mathematics (Dipl.-Math.) and Computer Science (Dipl.-Inform.), all from Technische Universität Darmstadt, Germany. Jochen’s work is situated at the intersection of Human-Computer Interaction and Human Augmentation. He designs, implements and studies novel input technology in the areas of mobile, tangible & non-visual interaction, automotive UX and assistive augmentation. He has co-authored over 60 academic publications and regularly serves as program committee member in premier HCI and multimedia conferences. He was program co-chair of ACM TVX 2016 and Augmented Human 2015 and chaired tracks of ACM Multimedia, ACM Creativity and Cognition and ACM International Conference on Interface Surfaces and Spaces, as well as numerous workshops at ACM CHI and IUI. Further information can be found on his personal homepage:

An interview with Professor Pål Halvorsen

Describe your journey into research from your youth up to the present. What foundational lessons did you learn from this journey? Why were you initially attracted to multimedia?

I remember when I was about 14 years old and had an 8th grade project where we were to identify what we wanted to do in the future and the road to get there. I had just recently discovered the world of computers and so reported several ways to become a computer scientist. After following the identified path to the University of Oslo and graduating with a Bachelor's degree in computer science, my way into research was more by chance, or maybe even by accident. At that time, I spent a lot of time on sports and was not sure what to do for my master's thesis. However, I was lucky. I found an interesting topic in the area of system support for multimedia, mainly video. I guess my supervisors liked the work because they later offered me a PhD position (thanks!) where they brought me deeper into the world of multimedia systems research.

My supervisors then helped me to get an associate professor position at the university (thanks again!). I got to know more colleagues, all inspiring me to continue research in the area of multimedia. After a couple of years performing research as a continuation of my PhD and teaching system related courses, I got an opportunity to join Simula Research Laboratory together with Carsten Griwodz. A bit later, we started our own small research group at Simula, and it is still a great place to be.

I think it is safe to say my path has been to a large degree influenced by some of the great people that I have met. You cannot do everything yourself, and I have been blessed with a lot of very good colleagues and friends. As a PhD student, I was told that after a year I should know more about my topic than my supervisors. It sounded impossible, but after having supervised a number of students myself, I believe it is true! Another friend and colleague also said that he had learned everything he knew from his students. Again, very correct – my students (and colleagues) have taught me a lot (thanks!). Thus, my main take-home message is to find an area that interests you and nice people to work with! You can accomplish a lot as a good team!

Regarding my research interests, I initially found an interest in how efficient a computer system could be. I became fascinated by delivery of continuous media early on, and “system support for multimedia” quickly became my area. After years of reporting an X% improvement of component Y, an interest in the complete end-to-end system arose. I have long had a wish to build complete systems. So today, our research group aims to improve not only individual components but also the entire pipeline in a holistic system – especially in the areas of sports and medicine – where we can see the effects of the systems we deploy.

Pål Halvorsen at the beginning of his career as a computer scientist

Tell us more about your vision and objectives behind your current roles? What do you hope to accomplish and how will you bring this about?

Currently, I have several roles. My main position is with SimulaMet, a research center established by Simula Research Laboratory and Oslo Metropolitan University (OsloMet). I also recently moved my main university affiliation to OsloMet while still having a small adjunct professor position at University of Oslo. Both my research and teaching activities are related to my previously stated interests, and the combination of universities and research center is a perfect match for me, enabling a good mix of students and seniors.

I hope to be able to deliver results back into real systems, so that our results are not only published and then forgotten in a dark drawer somewhere. In this respect, we have contact with several real life “problem owners”, mainly in sports and medicine. To bring our results beyond research prototypes, we have also spun off both a sport and a medical company, achieving the vision of having real impact. The fact that we now run our systems for the two top soccer leagues in both Norway and Sweden is an example of our aims being fulfilled. Hopefully, we can soon say similar things in the medical scenario – that medical experts are assisted using our research-based systems!  

Can you profile your current research, its challenges, opportunities, and implications?

Having the end-to-end view, it is hard to make a short answer. We are trying to optimize both single components and the entire pipeline of components. Thus, we are doing a lot of different things. Our challenges are not only related to a specific requirement or a component, but also its integration into a system as a whole. We also address a number of real world applications. As you can see, the variety in our research is large.

However, there are also large opportunities in that the systems are researched and developed with real requirements and wishes in mind. Thus, if we succeed, there is a chance that we might actually have some impact. For example, in sports, we have three deployed systems in use.

How would you describe your top innovative achievements in terms of the problems you were trying to solve, your solutions, and the impact it has today and into the future?

Together with colleagues at Simula, University of Oslo and University of Tromsø, we have been lucky to find some interesting and usable solutions. For example, at the system level, we have solutions (code) included in the Linux kernel, and at the application level, we have running (prototype) systems in both sport and medicine: efficient complete systems that provide functionality beyond existing systems.

Pål Halvorsen in his office in 2019

Over your distinguished career, what are your top lessons you want to share with the audience?

Well, first, I do not think you can call it “distinguished”. This is your description.

The most important thing for me is to have some fun. You must like what you do, and you must find people you enjoy working with. There are a lot of interesting challenges out there. You must just find yours.

What is the best joke you know?

Hehe, I am so bad at jokes. Every ten years, I might have a catchy comment, but I hardly ever tell jokes.

If you were conducting this interview, what questions would you ask, and what would be your answers?

Haha, I am not a man of many words, so I would probably just stick to the set of questions I was given and hope it would soon be finished 😉

So, maybe this one last question:

Q: Anything to add?

A: No. (Both short since I have to do both Q and A)


Professor Pål Halvorsen: 

Pål Halvorsen is a chief research scientist at SimulaMet, a professor at OsloMet University, an adjunct professor at University of Oslo, Norway, and the CEO of ForzaSys AS. He received his doctoral degree (Dr.Scient.) in 2001. His research focuses mainly on complete distributed multimedia systems including operating systems, processing, storage and retrieval, communication and distribution from a performance and efficiency point of view. He is a member of the IEEE and ACM. More information can be found at

Pia Helén Smedsrud: 

Pia Helén Smedsrud is a PhD student at Simula Research Laboratory in Oslo, Norway. She has a medical degree from UiO (University of Oslo), and worked as a medical doctor before starting as a research trainee in the field of computer science at Simula. She also has a background in journalism. Her research interests include medical multimedia, clinical implementation and machine learning. Currently, she is doing her PhD at the intersection of informatics and medicine, on machine learning in endoscopy.

Opinion Column: Evolution of Topics in the Multimedia Community

For this edition of the SIGMM Opinion Column, we asked members of the Multimedia community to share their impressions about the shift of scientific topics in the community over the years, namely the evolution of “traditional” and “emerging” Multimedia topics. 

This subject has emerged in several conversations over the 2-year history of the SIGMM Opinion Column. We report here a summary of recent and older discussions that took place over different channels – our Facebook group, the SIGMM Linkedin group, and in-person conversations between the column editors and MM researchers – with hopes, fears and opinions around this problem. We want to thank all participants for their valuable contributions to these discussions.

Historical Perspective of Topics in ACM MM

This year, ACM Multimedia turns 27. Today, MM is a large premier conference with hundreds of paper submissions every year and 12 different thematic areas spanning the wide spectrum of multimedia topics. But back at the beginning of MM’s history, the scale of the topic range was very different.

In the first editions of the conference, a general call for papers encouraged submissions about “technology, tools and techniques for the construction and delivery of high quality, innovative multimedia systems and interfaces”. Already in its 3rd edition, MM featured an Arts and Multimedia program. Starting from 2004, the conference offered three tracks for paper submissions: Content (Multimedia analysis, processing, and retrieval), Systems (Multimedia networking and system support), and Applications (Multimedia tools, end-systems, and applications), plus a “Brave New Topics” track for work-in-progress submissions. Later on, a Human-Centered Multimedia track was added. In 2011, after a conference review, the ACM MM program went beyond the notion of “tracks”, and the concept of areas was introduced to allow the community to “solicit papers from a wide range of timely multimedia-related topics” (see the ACMM11 website). By 2014, the number of areas had grown to 14, including, among others, Music, Speech and Audio Processing in Multimedia, and Social Media and Collective Online Presence. After a retreat in 2014, starting from 2015, areas have been grouped into larger “Themes”, the core thematic areas of ACM Multimedia. Since that retreat, no major changes have been introduced in the thematic structure of the conference.

Dynamics of Evolution of Emerging Topics

Emerging topics and less mature works are generally welcome at conferences’ workshops. In our discussions, most members of the community agree that “you’ll see great work there, and very fruitful discussions due to the common focus on the workshop theme”. When emerging topics become more popular, they can be promoted to conference areas, as happened for the “music, speech and audio” theme.

It was observed in our community conversations that, while this upgrade to the main conference is great for visibility, being a separate, relatively novel area could lead to isolation: the workload for reviewers specialized in emerging topics could become too high, given that they are also assigned to works in other areas; and the flat acceptance rate across all conference themes could mean that even accepting 2 submissions from an emerging topic area would give an ‘unreasonably’ high acceptance rate, thus leading to many good papers (even with 3 accepts) having to be rejected. Participants in our forums noticed that these dynamics have somehow “counteracted the ‘Multimedia’ and multidisciplinary nature of the field”, that they prevent conferences from growing, and that they eventually hurt emerging topics. One solution proposed to balance this effect is to “maintain a solid specialized reviewer pool (where needed managed by someone from the field), which however would be distributed over relevant MM areas”, rather than forming a new area.

It was also noted that some emerging topics in their early stage would most likely not have an appropriate workshop. Therefore, it is important for the main conference to have places to accept such early works, which makes tracks such as the short paper track or the brave new idea track absolutely crucial for the development of novel topics.

The Near-Future of Multimedia

On multiple occasions, MM community members shared their thoughts about how they would like to see the Multimedia community evolve around new topics.

There are a few topics that emerged in the past and that the community would like to see continue growing. These include interactive Multimedia applications, as well as music-related Multimedia technology, Multimedia in cooking spaces, and arts and Multimedia. It was also pointed out that, although very important for Multimedia applications, topics around compression technology are often given low weight in Multimedia spaces, and that MM should encourage submissions in the domain of machine learning concepts applied to compression.

There are also a few areas that are emerging across different sub-communities in computer science, and that, according to our community members, we should encourage to grow within the Multimedia field as well. These include works in digital health exploring the power of Multimedia for health care and monitoring, research around applications of Multimedia for good, understanding how the technologies we develop can help have a real impact on society, and discussions around the ethics and responsibility of Multimedia technologies, encouraging fair, transparent, inclusive and accountable Multimedia tools.

The Future of Multimedia

The future of MM according to the participants of the discussion goes beyond the forms we know today, as new technologies could significantly broaden and shake the current applicative paradigm of Multimedia. 

The upcoming 5G technology will enable a plethora of applications that are now extremely limited by the lack of bandwidth. These could range from mobile virtual reality to interconnection with objects and, of course, smart cities. To extract meaningful information to be presented to the user, various and highly diverse data streams will need to be treated consistently, and Multimedia researchers will develop methods, applications, systems and models to understand how to properly develop and impact this field. Likewise, this technology will push the limits of what is currently possible in terms of content demand and interaction with connected objects. We will see technologies for hyper-personalization, dynamic user interaction and real-time video personalization. These technologies will be enabled by the study of how film grammar and storytelling work for novel content types like AR, VR, panoramic and 360° video, by research around novel immersive media experiences, and by the design of new media formats, with novel consumption paradigms.

Multimedia has a bright future, with new, exciting emerging topics to be discussed and encouraged. Perhaps it is time for a new retreat or for a conference review?

First Combined ACM SIGMM Strategic Workshop and Summer School in Stellenbosch, South Africa

The first combined ACM SIGMM Strategic Workshop and Summer School will be held in Stellenbosch, South Africa, at the beginning of July 2020.


First ACM Multimedia Strategic Workshop

The first Multimedia Strategic Workshop follows the successful series of workshops in areas such as information retrieval. The field of multimedia has continued to evolve and develop: collections of images, sounds and videos have become larger, computers have become more powerful, broadband and mobile Internet are widely supported, complex interactive searches can be done on personal computers or mobile devices, and so on. In addition, as large business enterprises find new ways to leverage the data they collect from users, the gap between the types of research conducted in industry and academia has widened, creating tensions over “repeatability” and “public data” in publications. These changes in environment and attitude mean that the time has come for the field to reassess its assumptions, goals, objectives and methodologies. The goal is to bring together researchers in the field to discuss long-term challenges and opportunities within the field.

The participants of the Multimedia Strategic Workshop will be active researchers in the field of Multimedia. The strategic workshop will give these researchers the opportunity to explore long-term issues in the multimedia field, to recognise the challenges on the horizon, to reach consensus on key issues and to describe them in the resulting report that will be made available to the multimedia research community. The report will stimulate debate, provide research directions to both researchers and graduate students, and also provide funding agencies with data that can be used to coordinate the support for research.

The workshop will be held at the Wallenberg Research Centre at the Stellenbosch Institute for Advanced Study (STIAS). STIAS provides venues and state-of-the-art equipment for up to 300 conference guests at a time, as well as breakaway rooms.

The First ACM Multimedia Summer School on Multimedia

The motivation of the proposed summer school is to build on the success of the Deep Learning Indaba, but to focus on the application of machine learning to the field of Multimedia. We want delegates to be exposed to current research challenges in Multimedia. A secondary goal is to establish and grow the community of African researchers in the field of Multimedia, and to stimulate scientific research and collaboration between African researchers and the international community. The exact topics covered during the summer school will be decided later together with the instructors but will reflect the current research trends in Multimedia.

The Strategic Workshop will be followed by the Summer School on Multimedia. Having the first summer school co-located with the Strategic Workshop will help to recruit the best possible instructors for the summer school. 

The Multimedia Summer School on Multimedia will be held at the Faculty of Engineering at Stellenbosch University, which is one of South Africa’s major producers of top-quality engineers. The faculty was established in 1944 and is housed in a large complex of buildings with modern facilities, including lecture halls and electronic classrooms.

Stellenbosch is a university town in South Africa’s Western Cape province. It’s surrounded by the vineyards of the Cape Winelands and the mountainous nature reserves of Jonkershoek and Simonsberg. The town’s oak-shaded streets are lined with cafes, boutiques and art galleries. Cape Dutch architecture gives a sense of South Africa’s Dutch colonial history, as do the Village Museum’s period houses and gardens.

For more information about both events, please refer to the events’ web site or contact the organizers: