Towards Interactive QoE Assessment of Robotic Telepresence

Telepresence robots (TPRs) are remote-controlled, wheeled devices with an internet connection. A TPR can “teleport” you to a remote location, letting you drive around and interact with people. A TPR user can feel present in the remote location by controlling the robot's position, movements, actions, voice and video. A TPR thus facilitates human-to-human interaction wherever and whenever you want. The human operator sends commands to the TPR by pressing buttons or keys on a keyboard, or by using a mouse or joystick.

A Robotic Telepresence Environment

In recent years, people from different environments and backgrounds have started to adopt TPRs for private and business purposes such as attending a class, roaming around the office and visiting patients. Due to the COVID-19 pandemic, adoption in healthcare has increased in order to facilitate social distancing and staff safety [Ackerman 2020, Tavakoli et al. 2020].

Robotic Telepresence Sample Use Cases

Despite this increase in adoption, a research gap remains from a QoE perspective, as TPRs offer interaction beyond the well-understood QoE issues of traditional static audio-visual conferencing. TPRs, as remote-controlled vehicles, provide users with some form of physical presence at the remote location. Furthermore, for the people interacting with the TPR at the remote location, the robot is a physical representation, or proxy agent, of its remote operator. The operator can physically interact with the remote location, for example by driving over an object or pushing it forward. These aspects of teleoperation and navigation add a further dimension in terms of functionality, complexity and experience.

Navigating a TPR may pose challenges to end users and influence the perceived quality of the system. For instance, when a TPR operator is driving the robot, he/she expects an instantaneous reaction from the robot. An increased delay in sending commands to the robot may thus negatively impact robot mobility and the user’s satisfaction, even if the audio-visual communication functionality itself is not affected.

In a recent paper published at QoMEX 2020 [Jahromi et al. 2020], we addressed this gap in research by means of a subjective QoE experiment that focused on the QoE aspects of live TPR teleoperation over the internet. We were interested in understanding how network QoS-related factors influence the operator’s QoE when using a TPR in an office context.

TPR QoE User Study and Experimental Findings

In our study, we investigated the QoE of TPR navigation along three research questions: 1) impact of network factors including bandwidth, delay and packet loss on the TPR navigation QoE, 2) discrimination between navigation QoE and video QoE, 3) impact of task on TPR QoE sensitivity.

The QoE study participants were situated in a laboratory setting in Dublin, Ireland, where they navigated a Beam Plus TPR via keyboard input on a desktop computer. The TPR was placed in a real office setting of California Telecom in California, USA. Bandwidth, delay and packet loss rate were manipulated on the operator’s PC.

A User Participating in the Robotic Telepresence QoE Study
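
For readers who want to set up a similar test bed: the paper does not detail the exact impairment tooling used, so the following is merely a minimal sketch of how such bandwidth, delay and packet loss conditions could be emulated on a Linux host with iproute2's tc (netem and tbf). The interface name and the impairment values are hypothetical examples, not the study's actual configuration.

```python
import subprocess

IFACE = "eth0"  # hypothetical network interface on the operator's side

def run(cmd: str) -> None:
    """Run a tc command and fail loudly if it is rejected."""
    subprocess.run(cmd.split(), check=True)

def apply_impairment(delay_ms: int = 100, loss_pct: float = 1.0, rate_kbit: int = 900) -> None:
    """Emulate delay, random packet loss and a bandwidth cap (example values only)."""
    # Delay and random loss via netem as the root qdisc.
    run(f"tc qdisc add dev {IFACE} root handle 1: netem delay {delay_ms}ms loss {loss_pct}%")
    # Bandwidth limitation via a token bucket filter chained below netem.
    run(f"tc qdisc add dev {IFACE} parent 1:1 handle 10: tbf rate {rate_kbit}kbit burst 32kbit latency 400ms")

def clear_impairment() -> None:
    """Remove all emulated impairments from the interface."""
    run(f"tc qdisc del dev {IFACE} root")

if __name__ == "__main__":
    apply_impairment()      # one hypothetical test condition
    # ... run the TPR navigation task under this condition ...
    clear_impairment()
```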

A total of 23 subjects participated in our QoE lab study (8 female, 15 male), with an average test duration of 30 minutes per participant. Following the screening procedure of ITU-R Recommendation BT.500, three participants were detected as outliers and excluded from subsequent analysis. A post-test survey showed that none of the participants reported task boredom as a factor. In fact, many reported that they enjoyed the experience! 

The influence of network factors on navigation QoE

All three network influence factors exhibited a significant impact on navigation QoE, but in different ways. Above a threshold of 0.9 Mbps, bandwidth showed no influence on navigation QoE, while a packet loss rate of 1% already had a noticeable impact. A mixed-model ANOVA confirms that the impact of the different network factors on navigation quality ratings is statistically significant (see [Jahromi et al. 2020] for details). From the figure below, one can see that the navigation QoE MOS levels, as well as their sensitivity to the network impairment level, depend on the actual impairment type.

The bar plots illustrate the influence of network QoS factors on the navigation quality (left) and the video quality (right).
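
As an aside for readers who want to run this kind of analysis on their own data: the snippet below is a simplified repeated-measures ANOVA sketch using statsmodels, not the paper's actual code, and the long-format column names and ratings are invented placeholders (the full mixed-model analysis in the paper also accounts for between-subject factors).

```python
import pandas as pd
from statsmodels.stats.anova import AnovaRM

# Hypothetical long-format ratings: one navigation-quality rating per
# participant and per packet-loss level (balanced design required by AnovaRM).
ratings = pd.DataFrame({
    "participant": [1, 1, 1, 2, 2, 2, 3, 3, 3],
    "packet_loss": ["0%", "1%", "2%"] * 3,
    "nav_quality": [4.5, 3.5, 2.5, 4.0, 3.0, 2.0, 5.0, 3.5, 2.0],
})

# Within-subject ANOVA: does the packet-loss level affect navigation ratings?
result = AnovaRM(data=ratings, depvar="nav_quality",
                 subject="participant", within=["packet_loss"]).fit()
print(result)
```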

Discrimination between navigation QoE and video QoE

Our study results show that the subjects were capable of discriminating between video quality and navigation quality, treating them as separate concepts when assessing their experience. Based on the ANOVA analysis [Jahromi et al. 2020], the impact of bandwidth and packet loss on TPR video quality ratings was statistically significant. However, for delay this was not the case (in contrast to navigation quality). A comparison of the navigation quality and video quality subplots shows that the changes in MOS across impairment levels diverge between the two in terms of amplitude. To quantify this divergence, we performed a Spearman Rank-Order Correlation Coefficient (SROCC) analysis, revealing only a weak correlation between video and navigation quality (SROCC = 0.47).
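
To illustrate the last step, the SROCC between condition-level navigation MOS and video MOS values can be computed with SciPy as sketched below; the MOS values shown are made-up placeholders, not the study's data.

```python
from scipy.stats import spearmanr

# Hypothetical per-condition MOS values (one pair per test condition).
navigation_mos = [4.3, 3.9, 3.1, 2.4, 4.1, 3.0, 2.2, 3.8]
video_mos      = [4.0, 4.1, 3.8, 3.5, 3.2, 3.9, 3.6, 2.9]

# Spearman rank-order correlation between navigation and video quality.
rho, p_value = spearmanr(navigation_mos, video_mos)
print(f"SROCC = {rho:.2f} (p = {p_value:.3f})")
```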

Impact of task on TPR QoE sensitivity

Our study showed that the type of TPR task had a stronger impact on navigation QoE than on streaming video QoE. Statistical analysis reveals that the actual task at hand significantly affects sensitivity to QoE impairments, depending on the network impairment type. For example, the interaction between bandwidth and task is statistically significant for navigation QoE, which means that changes in bandwidth were rated differently depending on the task type. On the other hand, this was not the case for delay and packet loss. Regarding video quality, we do not see a significant impact of the task on QoE sensitivity to network impairments, except for a borderline case for the packet loss rate.

Conclusion: Towards a TPR QoE Research Agenda

This study yielded three key findings. First, users can differentiate between the visual and navigation aspects of TPR operation. Second, all three network factors have a significant impact on TPR navigation QoE. Third, the sensitivity of visual and navigation QoE to specific impairments strongly depends on the actual task at hand. We also found an initial training phase to be essential in order to ensure that participants are familiar with the system and to avoid bias caused by novelty effects. We observed that participants were highly engaged when navigating the TPR, as was also reflected in the positive feedback received during the debriefing interviews. We believe that our study methodology and design, including the task types, worked very well and can serve as a solid basis for future TPR QoE studies. 

We also see the necessity of developing a more generic, empirically validated TPR experience framework that allows for systematic assessment and modelling of QoE and UX in the context of TPR usage. Beyond integrating concepts and constructs that have already been developed in other related domains such as (multi-party) telepresence, XR, gaming, embodiment and human-robot interaction, the development of such a framework must take into account the unique properties that distinguish the TPR experience from other technologies:

  • Asymmetric conditions
    The factors influencing QoE for TPR users are not only bidirectional, they also differ on the two sides of the TPR, i.e., the experience is asymmetric. Considering the differences between the local and the remote location, a TPR setup features a noticeable number of asymmetric conditions as regards the number of users, content, context, and even stimuli: while the robot is typically controlled by a single operator, the remote location may host a number of users (asymmetry in the number of users). An asymmetry also exists in the number of stimuli. For instance, the remote users perceive the physical movement and presence of the operator through the actual movement of the TPR. The experience of encountering a TPR rolling into an office is a hybrid kind of intrusion, somewhere between a robot and a physical person. However, from the operator’s perspective, the experience is a rather virtual one, as he/she becomes conscious of the physical impact at the remote location only by means of technically mediated feedback.
  • Social Dimensions
    According to [Haans et al. 2012], the experience of telepresence is defined as “a consequence of the way in which we are embodied, and that the capability to feel as if one is actually there in a technologically mediated or simulated environment is a natural consequence of the same ability that allows us to adjust to, for example, a slippery surface or the weight of a hammer”.
    The experience of being present in a TPR-mediated context goes beyond AR and VR: it is a blended physical reality. The sense of ownership of a wheeled TPR, established through the mobility and remote navigation of a “physical” object, allows users to feel as if they were physically present in the remote environment (i.e., the TPR acts as a physical avatar). This allows TPR users to get involved in social activities, such as accompanying people and participating in discussions while navigating, sharing the same visual scenes, visiting a place, and joining social discussions, parties and celebrations. In healthcare, a doctor can use a TPR to visit patients as well as to dispense and administer medication remotely.
  • TPR Mobility and Physical Environment
    Mobility is a key dimension of telepresence frameworks [Rae et al. 2015]. TPR mobility and navigation features introduce new interactions between the operators and the physical environment. The environmental aspect becomes an integral part of the interaction experience [Hammer et al. 2018].
    During TPR usage, the navigation path and the number of obstacles that a remote user may face can influence the user’s experience. The ease or complexity of navigation can shift the operator’s focus and attention from one influence factor to another (e.g., from video quality to navigation quality). Paloski et al. (2008) found that cognitive impairment resulting from fatigue can influence user performance in robot operation [Paloski et al. 2008]. This raises the question of how driving and interacting through a TPR impacts the user’s cognitive load and fatigue compared to physical presence.
    The mobility aspects of TPRs can also influence the perception of the spatial configuration of the physical environment and allow the TPR user to manipulate and interact with the environment in spatial terms [Narbutt et al. 2017]. For example, the ambient noise of the environment can be perceived at different levels: the TPR operator can move the robot closer to the source of the noise or keep a distance from it. This can enhance his/her feeling of being present [Rae et al. 2015].

The above distinctive characteristics of a TPR-mediated context illustrate the complexity and the broad range of aspects that can significantly influence the TPR user experience. Considering these features and factors provides a useful basis for the development of a comprehensive TPR experience framework.

References

  • [Tavakoli et al. 2020] M. Tavakoli, J. Carriere, and A. Torabi (2020). Robotics for COVID-19: How Can Robots Help Health Care in the Fight Against Coronavirus.
  • [Ackerman 2020] E. Ackerman (2020). Telepresence Robots Are Helping Take Pressure Off Hospital Staff. IEEE Spectrum, April 2020.
  • [Jahromi et al. 2020] H. Z. Jahromi, I. Bartolec, E. Gamboa, A. Hines, and R. Schatz (2020). You Drive Me Crazy! Interactive QoE Assessment for Telepresence Robot Control. In 12th International Conference on Quality of Multimedia Experience (QoMEX 2020), Athlone, Ireland.
  • [Hammer et al. 2018] F. Hammer, S. Egger-Lampl, and S. Möller (2018). Quality-of-user-experience: a position paper. Quality and User Experience, vol. 3, no. 1. doi: 10.1007/s41233-018-0022-0.
  • [Haans et al. 2012] A. Haans and W. A. IJsselsteijn (2012). Embodiment and telepresence: Toward a comprehensive theoretical framework. Interacting with Computers, 24(4), 211-218.
  • [Rae et al. 2015] I. Rae, G. Venolia, J. C. Tang, and D. Molnar (2015). A framework for understanding and designing telepresence. In Proceedings of the 18th ACM Conference on Computer Supported Cooperative Work & Social Computing (pp. 1552-1566).
  • [Narbutt et al. 2017] M. Narbutt, S. O’Leary, A. Allen, J. Skoglund, and A. Hines (2017). Streaming VR for immersion: Quality aspects of compressed spatial audio. In 2017 23rd International Conference on Virtual System & Multimedia (VSMM) (pp. 1-6). IEEE.
  • [Paloski et al. 2008] W. H. Paloski, C. M. Oman, J. J. Bloomberg, M. F. Reschke, S. J. Wood, D. L. Harm, … and L. S. Stone (2008). Risk of sensory-motor performance failures affecting vehicle control during space missions: a review of the evidence. Journal of Gravitational Physiology, 15(2), 1-29.

MPEG Column: 131st MPEG Meeting (virtual/online)

The original blog post can be found at the Bitmovin Techblog and has been modified/updated here to focus on and highlight research aspects.

The 131st MPEG meeting concluded on July 3, 2020, again held online, but with a press release comprising an impressive list of news items, led by “MPEG Announces VVC – the Versatile Video Coding Standard”. Right in the middle of the restructuring process of SC 29 (i.e., MPEG’s parent body within ISO), MPEG successfully ratified — jointly with ITU-T’s VCEG within JVET — its next-generation video codec, among other interesting results of the 131st MPEG meeting:

Standards progressing to final approval ballot (FDIS)

  • MPEG Announces VVC – the Versatile Video Coding Standard
  • Point Cloud Compression – MPEG promotes a Video-based Point Cloud Compression Technology to the FDIS stage
  • MPEG-H 3D Audio – MPEG promotes Baseline Profile for 3D Audio to the final stage

Call for Proposals

  • Call for Proposals on Technologies for MPEG-21 Contracts to Smart Contracts Conversion
  • MPEG issues a Call for Proposals on extension and improvements to ISO/IEC 23092 standard series

Standards progressing to the first milestone of the ISO standard development process

  • Widening support for storage and delivery of MPEG-5 EVC
  • Multi-Image Application Format adds support of HDR
  • Carriage of Geometry-based Point Cloud Data progresses to Committee Draft
  • MPEG Immersive Video (MIV) progresses to Committee Draft
  • Neural Network Compression for Multimedia Applications – MPEG progresses to Committee Draft
  • MPEG issues Committee Draft of Conformance and Reference Software for Essential Video Coding (EVC)

The corresponding press release of the 131st MPEG meeting can be found here: https://mpeg-standards.com/meetings/mpeg-131/. This report focuses on video coding, featuring VVC, as well as point cloud compression (PCC) and systems aspects (i.e., file format, DASH).

MPEG Announces VVC – the Versatile Video Coding Standard

MPEG is pleased to announce the completion of the new Versatile Video Coding (VVC) standard at its 131st meeting. The document has been progressed to its final approval ballot as ISO/IEC 23090-3 and will also be known as H.266 in the ITU-T.

VVC Architecture (from IEEE ICME 2020 tutorial of Mathias Wien and Benjamin Bross)

VVC is the latest in a series of very successful standards for video coding that have been jointly developed with ITU-T, and it is the direct successor to the well-known and widely used High Efficiency Video Coding (HEVC) and Advanced Video Coding (AVC) standards (see architecture in the figure above). VVC provides a major benefit in compression over HEVC. Plans are underway to conduct a verification test with formal subjective testing to confirm that VVC achieves an estimated 50% bit rate reduction versus HEVC for equal subjective video quality. Test results have already demonstrated that VVC typically provides about a 40% bit rate reduction for 4K/UHD video sequences in tests using objective metrics (i.e., PSNR, VMAF, MS-SSIM). Application areas especially targeted for the use of VVC include:

  • ultra-high definition 4K and 8K video,
  • video with a high dynamic range and wide colour gamut, and
  • video for immersive media applications such as 360° omnidirectional video.

Furthermore, VVC is designed for a wide variety of types of video, such as camera-captured, computer-generated, and mixed content for screen sharing, adaptive streaming, game streaming, video with scrolling text, etc. Conventional standard-definition and high-definition video content are also supported with similar gains in compression. In addition to improving coding efficiency, VVC also provides highly flexible syntax supporting such use cases as (i) subpicture bitstream extraction, (ii) bitstream merging, (iii) temporal sub-layering, and (iv) layered coding scalability.

The current performance of VVC compared to HEVC-HM is shown in the figure below, which confirms the statement above but also highlights the increased complexity. Please note that VTM9 (the VVC test model reference software) is not optimized for speed but for functionality (i.e., compression efficiency).

Performance of VVC, VTM9 vs. HM (taken from https://bit.ly/mpeg131).
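
Bit rate reductions of this kind are commonly summarized as Bjøntegaard delta rate (BD-rate) values computed from the rate-distortion points of the two codecs. The snippet below is a minimal sketch of the classic cubic-fit BD-rate computation; the rate/PSNR numbers are invented placeholders, not actual VVC or HEVC measurements.

```python
import numpy as np

def bd_rate(rates_ref, psnr_ref, rates_test, psnr_test):
    """Average bitrate difference (%) of a test codec vs. a reference codec,
    using the classic Bjontegaard cubic fit of log-rate over PSNR."""
    lr_ref, lr_test = np.log10(rates_ref), np.log10(rates_test)
    # Fit log-rate as a cubic polynomial of PSNR for both codecs.
    p_ref, p_test = np.polyfit(psnr_ref, lr_ref, 3), np.polyfit(psnr_test, lr_test, 3)
    # Integrate both fits over the overlapping PSNR interval.
    lo, hi = max(min(psnr_ref), min(psnr_test)), min(max(psnr_ref), max(psnr_test))
    int_ref = np.polyval(np.polyint(p_ref), hi) - np.polyval(np.polyint(p_ref), lo)
    int_test = np.polyval(np.polyint(p_test), hi) - np.polyval(np.polyint(p_test), lo)
    # Average log-rate difference converted to a percentage rate difference.
    avg_diff = (int_test - int_ref) / (hi - lo)
    return (10 ** avg_diff - 1) * 100

# Hypothetical RD points (kbit/s, PSNR in dB) for a reference and a test codec.
ref_rates,  ref_psnr  = [1000, 2000, 4000, 8000], [34.0, 36.5, 39.0, 41.5]
test_rates, test_psnr = [600, 1200, 2400, 4800], [34.1, 36.6, 39.1, 41.6]

print(f"BD-rate: {bd_rate(ref_rates, ref_psnr, test_rates, test_psnr):.1f} %")
```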

MPEG also announces completion of ISO/IEC 23002-7 “Versatile supplemental enhancement information for coded video bitstreams” (VSEI), developed jointly with ITU-T as Rec. ITU-T H.274. The new VSEI standard specifies the syntax and semantics of video usability information (VUI) parameters and supplemental enhancement information (SEI) messages for use with coded video bitstreams. VSEI is especially intended for use with VVC, although it is drafted to be generic and flexible so that it may also be used with other types of coded video bitstreams. Once specified in VSEI, different video coding standards and systems-environment specifications can re-use the same SEI messages without the need for defining special-purpose data customized to the specific usage context.

At the same time, the Media Coding Industry Forum (MC-IF) announced the start of VVC patent pool fostering, with an initial meeting on September 1, 2020. The aim of this meeting is to identify tasks and to propose a schedule for VVC pool fostering, with the goal of selecting a pool facilitator/administrator by the end of 2020. MC-IF itself is not facilitating or administering a patent pool.

At the time of writing this blog post, it is probably too early to make an assessment of whether VVC will share the fate of HEVC or AVC (w.r.t. patent pooling). AVC is still the most widely used video codec but with AVC, HEVC, EVC, VVC, LCEVC, AV1, (AV2), and probably also AVS3 — did I miss anything? — the competition and pressure are certainly increasing.

Research aspects: from a research perspective, reducing time complexity (for a variety of use cases) while keeping quality and bitrate at acceptable levels is probably the most relevant aspect. Improving individual building blocks of VVC by using artificial neural networks (ANNs) is another area of interest, and end-to-end video coding with ANNs will probably pave the road towards the next generation of video codec(s). Utilizing VVC and its features for HTTP adaptive streaming (HAS) is probably most interesting for me, but maybe also for others…

MPEG promotes a Video-based Point Cloud Compression Technology to the FDIS stage

At its 131st meeting, MPEG promoted its Video-based Point Cloud Compression (V-PCC) standard to the Final Draft International Standard (FDIS) stage. V-PCC addresses lossless and lossy coding of 3D point clouds with associated attributes such as colors and reflectance. Point clouds are typically represented by extremely large amounts of data, which is a significant barrier for mass-market applications. However, the relative ease to capture and render spatial information as point clouds compared to other volumetric video representations makes point clouds increasingly popular to present immersive volumetric data. With the current V-PCC encoder implementation providing compression in the range of 100:1 to 300:1, a dynamic point cloud of one million points could be encoded at 8 Mbit/s with good perceptual quality. Real-time decoding and rendering of V-PCC bitstreams have also been demonstrated on current mobile hardware. The V-PCC standard leverages video compression technologies and the video ecosystem in general (hardware acceleration, transmission services, and infrastructure) while enabling new kinds of applications. The V-PCC standard contains several profiles that leverage existing AVC and HEVC implementations, which may make them suitable to run on existing and emerging platforms. The standard is also extensible to upcoming video specifications such as Versatile Video Coding (VVC) and Essential Video Coding (EVC).
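
To put these compression ratios into perspective, here is a back-of-the-envelope estimate of the raw data rate of a dynamic point cloud and what the quoted ratios imply; the per-point precision, frame rate and point count are illustrative assumptions, not figures from the standard.

```python
# Rough raw-bitrate estimate for a dynamic point cloud (illustrative numbers only).
points_per_frame = 1_000_000
fps = 30
bits_per_point = 3 * 16 + 3 * 8          # x/y/z at 16 bit + R/G/B at 8 bit = 72 bit

raw_bps = points_per_frame * fps * bits_per_point
print(f"Raw rate: {raw_bps / 1e6:.0f} Mbit/s")        # ~2160 Mbit/s

for ratio in (100, 300):
    print(f"Compressed at {ratio}:1 -> {raw_bps / ratio / 1e6:.1f} Mbit/s")
# A ratio of roughly 270:1 brings such a cloud down to about 8 Mbit/s,
# which matches the order of magnitude quoted above.
```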

The V-PCC standard is based on Visual Volumetric Video-based Coding (V3C), which is expected to be re-used by other MPEG-I volumetric codecs under development. MPEG is also developing a standard for the carriage of V-PCC and V3C data (ISO/IEC 23090-10) which has been promoted to DIS status at the 130th MPEG meeting.

By providing high-level immersiveness at currently available bandwidths, the V-PCC standard is expected to enable several types of applications and services such as six Degrees of Freedom (6 DoF) immersive media, virtual reality (VR) / augmented reality (AR), immersive real-time communication and cultural heritage.

Research aspects: as V-PCC is video-based, similar research aspects apply as for video codecs, such as improving efficiency for both encoding and rendering as well as reducing time complexity. During the development of V-PCC, mainly HEVC (and AVC) have been used, but it is definitely interesting to also use VVC for PCC. Finally, dynamic adaptive streaming of V-PCC data is still in its infancy, despite some articles published here and there.

MPEG Systems related News

Finally, I’d like to share news related to MPEG systems and the carriage of video data as depicted in the figure below. In particular, the carriage of VVC (and also EVC) has now been enabled in MPEG-2 Systems (specifically within the transport stream) and in the various file formats (specifically within the NAL file format). The latter is also used in CMAF and DASH, which makes VVC (and also EVC) ready for HTTP adaptive streaming (HAS).

Carriage of Video in MPEG Systems Standards (taken from https://bit.ly/mpeg131).

What about DASH and CMAF?

CMAF maintains a so-called “technologies under consideration” document which contains — among other things — a proposed VVC CMAF profile. Additionally, there are two exploration activities related to CMAF, i.e., (i) multi-stream support and (ii) storage, archiving, and content management for CMAF files.

DASH works on potential improvement for the first amendment to ISO/IEC 23009-1 4th edition related to CMAF support, events processing model, and other extensions. Additionally, there’s a working draft for a second amendment to ISO/IEC 23009-1 4th edition enabling bandwidth change signalling track and other enhancements. Furthermore, ISO/IEC 23009-8 (Session-based DASH operations) has been advanced to Draft International Standard (see also my last report).

An overview of the current status of MPEG-DASH can be found in the figure below.

The next meeting will again be an online meeting, in October 2020.

Finally, MPEG organized a Webinar presenting results from the 131st MPEG meeting. The slides and video recordings are available here: https://bit.ly/mpeg131.

Click here for more information about MPEG meetings and their developments.

MediaEval Multimedia Evaluation Benchmark: Tenth Anniversary and Counting

MediaEval Multimedia Challenges

MediaEval is a benchmarking initiative that offers challenges in multimedia retrieval, analysis and exploration. The tasks offered by MediaEval concentrate specifically on the human and social aspects of multimedia. They encourage researchers to bring together multiple modalities (visual, text, audio) and to think in terms of systems that serve users. Our larger aim is to promote reproducible research that makes multimedia a positive force for society. In order to provide an impression of the topical scope of MediaEval, we describe a few examples of typical tasks.

Historically, MediaEval tasks have often involved social media analysis. One of the first tasks offered by MediaEval, called the “Placing” Task, focused on the geo-location of social multimedia. This task ran from 2010-2016 and studied the challenge of automatically predicting the location at which an image has been taken. Over the years, the task investigated the benefits of combining text and image features, and also explored the challenges involved with geo-location prediction of video.

MediaEval “Placing” Task (2010-2016)

The “Placing” Task gave rise to two daughter tasks, which are focused on the societal impact of technology that can automatically predict the geo-location of multimedia shared online. One is Flood-related Multimedia, which challenges researchers to extract information related to flooding disasters from social media posts (combining text and images). The other is Pixel Privacy, which allows researchers to explore ways in which adversarial images can be used to protect sensitive information from being automatically extracted from images shared online.

The MediaEval Pixel Privacy Task (currently ongoing) had its own “trailer” in 2019

MediaEval has also offered a number of tasks that focus on how media content is received by users. The interest of MediaEval in the emotional impact of music is currently continued by the Emotion and Theme Recognition in Music Task. Also, the Predicting Media Memorability Task explores the aspects of video that are memorable to users.

The MediaEval Predicting Media Memorability Task (currently ongoing)

Recently, MediaEval has widened its focus to include multimedia analysis in systems. The Sports Video Annotation Task works towards improving sports training systems and the Medico Task focuses on multimedia analysis for more effective and efficient medical diagnosis.

Recent years have seen the rise of the use of sensor data in MediaEval. The No-audio Multimodal Speech Detection Task uses a unique data set captured by people wearing sensors and having conversations in a social setting. In addition to the sensor data, the movement of the speakers is captured by an overhead camera. The challenge is to detect the moments at which the people are speaking without making use of audio recordings.

Frames from overhead camera video of the
MediaEval No-audio Multimodal Speech Detection Task (currently ongoing)

The Insight for Wellbeing Task uses a data set of lifelog images, sensor data and tags captured by people walking through a city wearing sensors and using smartphones. The challenge is to relate the data that is captured to the local pollution conditions.

MediaEval 10th Anniversary Workshop

Each year, MediaEval holds a workshop that brings researchers together to share their findings, discuss, and plan next year’s tasks. The 2019 workshop marked the 10th anniversary of MediaEval, which became an independent benchmark in 2010. The MediaEval 2019 Workshop was hosted by EURECOM in Sophia Antipolis, France. The workshop took place 27-29 October 2019, right after ACM Multimedia 2019 in Nice, France.

MediaEval 2019 Workshop at EURECOM, Sophia Antipolis, France (Photo credit: Mathias Lux)

The MediaEval 2019 Workshop is grateful to SIGMM for its support. This support helped ten students, working across a variety of tasks, to attend the workshop, and also made it possible to record all of the workshop talks. We also gratefully acknowledge the Multimedia Computing Group at Delft University of Technology and EURECOM.

Links to MediaEval 2019 tasks, videos and slides are available on the MediaEval 2019 homepage http://multimediaeval.org/mediaeval2019/. The link to the 2019 proceedings can be found there as well. 

Presenting results of a MediaEval task
(Photo credit: Vajira Thambawita)

MediaEval has compiled a bibliography of papers that have been published using MediaEval data sets. This list includes not only MediaEval workshop papers, but also papers published at other workshops, conferences, and in journals. In total, around 750 papers have been written that use MediaEval data, and this number continues to grow. Check out the bibliography at https://multimediaeval.github.io/bib.

The Medieval in MediaEval

A long-standing tradition in MediaEval is to incorporate some aspect of medieval history into the social event of the workshop. This tradition is a wordplay on our name (“mediaeval” is an older spelling of “medieval”). Through the years the medieval connection has served to provide a local context for the workshop and has strengthened the bond among participants. At the MediaEval 2019 Workshop, we offered the chance to take a nature walk to the medieval town of Biot.

A journey of discovery at the MediaEval 2019 workshop (Photo credit: Vajira Thambawita)

The walking participants and the participants taking the bus convened on the “Place des Arcades” in the medieval town of Biot, where we enjoyed a dinner together under historic arches.

The MediaEval 2019 workshop gathers in
Place des Arcades in Biot, near EURECOM
(Photo credit: Vajira Thambawita)

MediaEval 2020

MediaEval has just announced the task line-up for 2020. Registration will open in July 2020 and the runs will be due at the end of October 2020. The workshop will be held in December, with dates to be announced.

This year, the MediaEval workshop will be fully online. Since the MediaEval 2017 workshop in Dublin, MediaEval has offered the possibility of remote workshop participation. Holding the workshop online this year is a natural extension of this trend, and we hope that researchers around the globe will take advantage of the opportunity to participate.

We are happy to introduce the new website: https://multimediaeval.github.io/. More information will be posted there as the season moves forward.

The day-to-day operations of MediaEval are handled by the MediaEval logistics committee, which grows stronger with each passing year. The authors of this article are logistics committee members from 2019. 

Standards Column: VQEG

Welcome to the first column on the ACM SIGMM Records from the Video Quality Experts Group (VQEG).
VQEG is an international and independent organisation of technical experts in perceptual video quality assessment from industry, academia, and government organisations.
This column briefly introduces the mission and main activities of VQEG, establishing a starting point of a series of columns that will provide regular updates of the advances within the current ongoing projects, as well as reports of the VQEG meetings. 
The editors of these columns are Jesús Gutiérrez (upper photo, jesus.gutierrez@upm.es), co-chair of the Immersive Media Group of VQEG, and Kjell Brunnström (lower photo, kjell.brunnstrom@ri.se), general co-chair of VQEG. Feel free to contact them for any further questions, comments or information, and also check the VQEG website: www.vqeg.org.

Introduction

The Video Quality Experts Group (VQEG) was born from a need to bring together experts in subjective video quality assessment and objective quality measurement. The first VQEG meeting, held in Turin in 1997, was attended by a small group of experts drawn from ITU-T and ITU-R Study Groups. VQEG was first grounded in basic subjective methodology and objective tool development/verification for video quality assessment, so that the industry could move forward with standardization and implementation. At the beginning, it focused on measuring perceived video quality, since the distribution paths for video and audio were limited and known.

In the more than 20 years since the formation of VQEG, the ecosystem has changed dramatically, and so must the work. Multimedia is now pervasive on all devices and across all methods of distribution, from broadcast to cellular data networks. This shift has led the expertise within VQEG to move from the visual (no-audio) quality of video to Quality of Experience (QoE).

The march forward of technologies means that VQEG needs to react and be on the leading edge of developing, defining and deploying methods and tools that help address these new technologies and move the industry forward. This also means that we need to embrace both qualitative and quantitative ways of defining these new spaces and terms. Taking a holistic approach to QoE will enable VQEG to move forward faster, with unprecedented collaboration and execution.

VQEG is open to all interested from industry, academia, government organizations and Standard-Developing Organizations (SDOs). There are no fees involved, no membership applications and no invitations are needed to participate in VQEG activities. Subscription to the main VQEG email list (ituvidq@its.bldrdoc.gov) constitutes membership in VQEG.

VQEG conducts work via discussions over email reflectors, regularly scheduled conference calls and, in general, two face-to-face meetings per year. There are currently more than 500 people registered across 11 email reflectors, including a main reflector for general announcements relevant to the entire group, and different project reflectors dedicated to technical discussions of specific projects. A LinkedIn group exists as well.

Objectives

The main objectives of VQEG are: 

  • To provide a forum, via email lists and face-to-face meetings, for video quality assessment experts to exchange information and work together on common goals. 
  • To formulate test plans that clearly and specifically define the procedures for performing subjective assessment tests and objective model validations.
  • To produce open source databases of multimedia material and test results, as well as software tools. 
  • To conduct subjective studies of multimedia and immersive technologies and provide a place for collaborative model development to take place.

Projects

Currently, several working groups are active within VQEG, classified under four main topics:

  1. Subjective Methods: Based on collaborative efforts to improve subjective video quality test methods.
    • Audiovisual HD (AVHD), project “Advanced Subjective Methods” (AVHD-SUB): This group investigates improved audiovisual subjective quality testing methods. This effort may lead to a revision of ITU-T Rec. P.911. As examples of its activities, the group has investigated alternative experiment designs for subjective tests, to validate subjective testing of long video sequences that are only viewed once by each subject. In addition, it conducted a joint investigation into the impact of the environment on mean opinion scores (MOS).
    • Psycho-Physiological Quality Assessment (PsyPhyQA): The aim of this project is to establish novel psychophysiology-based techniques and methodologies for video quality assessment and real-time interaction of humans with advanced video communication environments. Specifically, some of the aspects that the project is looking at include: video quality assessment based on human psychophysiology (including eye gaze, EEG, EKG, EMG, GSR, etc.), computational video quality models based on psychophysiological measurements, signal processing and machine learning techniques for psychophysiology-based video quality assessment, experimental design and methodologies for psychophysiological assessment, and correlates of psychophysics and psychophysiology. PsyPhyQA has published a dataset and test plan for a common framework for the evaluation of psychophysiological visual quality assessment.
    • Statistical Analysis Methods (SAM): This group addresses problems related to how to better analyze and improve the quality of data coming from subjective experiments, and how to consider uncertainty in the development of objective media quality predictors/models. Its main goals are: to improve the methods used to draw conclusions from subjective experiments, to understand the process of expressing opinion in a subjective experiment, to improve subjective experiment design to facilitate analysis and applications, to improve the analysis of objective model performance, and to revisit standardised methods for assessing objective model performance (a minimal example of the kind of basic subjective-data analysis involved is sketched after this list). 
  2. Objective Metrics: Working towards developing and validating objective video quality metrics.
    • Audiovisual HD (AVHD), project “AVHD-AS / P.NATS phase 2”: This is a joint project of VQEG and ITU-T Study Group 12 Question 14. The main goal is to develop a multitude of objective models, varying in terms of complexity, type of input and use cases, for the assessment of video quality in HTTP/TCP/IP-based adaptive bitrate streaming services (e.g., YouTube, Vimeo, Amazon Video, Netflix, etc.). For these services, the quality experienced by the end user is affected by video coding degradations and by delivery degradations due to initial buffering, re-buffering and media adaptations caused by changes in bitrate, resolution, and frame rate.
    • Computer Generated Imagery (CGI): This group focuses on computer-generated content for both image and video material. Its main goals are: creating a large database of computer-generated content, analyzing the content (feature extraction before and after rendering), analyzing the performance of objective quality metrics, evaluating/developing existing/new quality metrics/models for CGI material, and studying rendering adaptation techniques (depending on the network constraints). This activity is in line with the ITU-T work item P.BBQCG (Parametric Bitstream-based Quality Assessment of Cloud Gaming Services). 
    • No Reference Metrics (NORM): This group is an open collaboration for developing no-reference metrics and methods for monitoring use-case-specific visual service quality. The NORM group is a complementary, industry-driven alternative for measuring visual quality automatically by using perceptual indicators. Its main activities are to maintain a list of real-world use cases for visual quality monitoring, a list of potential algorithms and methods for no-reference MOS and/or key indicators (visual artifact detection) for each use case, a list of methods (including datasets) to train and validate the algorithms for each use case, and a list of methods to provide root-cause indication for each use case. In addition, the group encourages open discussions and knowledge sharing on all aspects related to no-reference metric research and development. 
    • Joint Effort Group (JEG) – Hybrid: This group is an open collaboration working together to develop a robust Hybrid Perceptual/Bit-Stream model. It has developed and made available routines to create and capture bit-stream data and parse bit-streams into HMIX files. Efforts are underway into developing subjectively rated video quality datasets with bit-stream data that can be used by all JEG researchers. The goal is to produce one model that combines metrics developed separately by a variety of researchers. 
    • Quality Assessment for Computer Vision Applications (QACoViA): The goal of this group is to study the visual quality requirements of computer vision methods, focusing especially on: testing methodologies and frameworks to identify the limits of computer vision methods with respect to the visual quality of the ingest; minimum quality requirements and objective visual quality measures to estimate whether a visual content is within the operating region of computer vision; and delivering implementable algorithms as proofs of concept of new objective video quality assessment methods for recognition tasks.
  3. Industry and Applications: Focused on seeking improved understanding of new video technologies and applications.
    • 5G Key Performance Indicators (5GKPI): Studies the relationship between the Key Performance Indicators (KPI) of new communication networks (namely 5G, but extensible to others) and the QoE of the video services on top of them. With this aim, this group addresses: the definition of relevant use cases (e.g., video for industrial applications, or mobility scenarios), the study of global QoE aspects for video in mobility and industrial scenarios, the identification of the relevant network KPIs (e.g., bitrate, latency, etc.) and application-level video KPIs (e.g., picture quality, A/V sync, etc.), and the generation of open datasets for algorithm testing and training.
    • IMG (Immersive Media Group): This group carries out research on the quality assessment of immersive media, with the main goals of generating datasets of immersive media content, validating subjective test methods, and performing baseline quality assessment of immersive systems, providing guidelines for QoE evaluation. The technologies covered by this group include: 360-degree content, virtual/augmented/mixed reality, stereoscopic 3D content, Free Viewpoint Video, multiview technologies, light field content, etc.
  4. Support and Outreach: Responsible for the support for VQEG’s activities.
    • eLetter: The goal of the VQEG eLetter is to provide up-to-date technical advances on video quality related topics. Each issue of the VQEG eLetter features a collection of papers authored by well-known researchers. These papers are contributed by invited authors or authors responding to a call for papers, and they can be: technical papers, summaries/reviews of other publications, best practice anthologies, reprints of difficult-to-obtain articles, and responses to other articles. VQEG wants the eLetter to be interactive in nature.
    • Human Factors for Visual Experiences (HFVE): The objective of this group is to uphold the liaison relation between VQEG and the IEEE standardization group P3333.1. Examples of the activities going on within this group are the standard for the (deep-learning-based) assessment, based on human factors, of visual experiences with virtual/augmented/mixed reality, as well as the standards on human factors for the quality assessment of light field imaging (IEEE P3333.1.4) and on the quality assessment of high dynamic range technologies. 
    • Independent Lab Group (ILG): The ILG act as independent arbitrators, whose generous contributions make possible the VQEG validation tests. Their goal is to ensure that all VQEG validation testing is unbiased and done to high quality standards. 
    • Joint Effort Group (JEG): This is an activity within VQEG that promotes collaborative efforts aimed at: validating metrics through both subjective dataset completion and metric design, extending subjective datasets in order to better identify the limitations of quality metrics, improving subjective methodologies to address new scenarios and use cases that involve QoE issues, and increasing the knowledge of both subjective and objective video quality assessment.
    • Joint Qualinet-VQEG team on Immersive Media: The objectives of this joint team from Qualinet and VQEG are: to uphold the liaison relation between both bodies, to inform both QUALINET and VQEG on the activities in respective organizations (especially on the topic of immersive media), to promote collaborations on other topics (i.e., form new joint teams), and to uphold the liaison relation with ITU-T SG12, in particular on topics around interactive, augmented and virtual reality QoE.
    • Tools and Subjective Labs Setup: The objective of this project is to provide the video quality research community with a wide variety of software tools and guidance in order to facilitate research. Tools are available in the following categories: quality analysis (software to run quality analyses), encoding (video encoding tools), streaming (streaming and extracting information from video streams), subjective test software (tools for running and analyzing subjective tests), and helper tools (miscellaneous helper tools).
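
As mentioned in the SAM item above, much of this statistical work builds on basic quantities such as per-condition MOS values and their confidence intervals. The snippet below is a minimal, purely illustrative sketch (invented ratings, t-distribution 95% interval), not a VQEG-endorsed procedure.

```python
import numpy as np
from scipy import stats

# Hypothetical raw ratings (ACR 1-5 scale) for a single test condition.
ratings = np.array([4, 5, 3, 4, 4, 2, 5, 4, 3, 4])

mos = ratings.mean()
# 95% confidence interval half-width using the t-distribution (small samples).
ci = stats.t.ppf(0.975, df=len(ratings) - 1) * stats.sem(ratings)

print(f"MOS = {mos:.2f} +/- {ci:.2f} (95% CI)")
```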

In addition, the Intersector Rapporteur Group on Audiovisual Quality Assessment (IRG-AVQA) studies topics related to video and audiovisual quality assessment (both subjective and objective) among ITU-R Study Group 6 and ITU-T Study Group 12. VQEG colocates meetings with the IRG-AVQA to encourage a wider range of experts to contribute to Recommendations. 

For more details and previous closed projects please check: https://www.its.bldrdoc.gov/vqeg/projects-home.aspx

Major achievements

VQEG activities are documented in reports and submitted to relevant ITU Study Groups (e.g., ITU-T SG9, ITU-T SG12, ITU-R WP6C), and other SDOs as appropriate. Several VQEG studies have resulted in ITU Recommendations.

| VQEG Project | Description | ITU Recommendations |
|---|---|---|
| Full Reference Television (FRTV) Phase I | Examined the performance of FR and NR models on standard definition video. The test materials used in this test plan and the subjective test data are freely available to researchers. | ITU-T J.143 (2000), ITU-T J.144 (2001), ITU-T J.149 (2004) |
| Full Reference Television (FRTV) Phase II | Examined the performance of FR and NR models on standard definition video, using the DSCQS methodology. | ITU-T J.144 (2004), ITU-R BT.1683 (2004) |
| Multimedia (MM) Phase I | Examined the performance of FR, RR and NR models for VGA, CIF and QCIF video (no audio). | ITU-T J.148 (2003), ITU-T P.910 (2008), ITU-T J.246 (2008), ITU-T J.247 (2008), ITU-T J.340 (2010), ITU-R BT.1683 (2004) |
| Reduced Reference / No Reference Television (RRNR-TV) | Examined the performance of RR and NR models on standard definition video. | ITU-T J.244 (2008), ITU-T J.249 (2010), ITU-R BT.1885 (2011) |
| High Definition Television (HDTV) | Examined the performance of FR, RR and NR models for HDTV. Some of the video sequences used in this test are publicly available in the Consumer Digital Video Library. | ITU-T J.341 (2011), ITU-T J.342 (2011) |
| QART | Studied the subjective quality evaluation of video used for recognition tasks and task-based multimedia applications. | ITU-T P.912 (2008) |
| Hybrid Perceptual Bitstream | Examined the performance of hybrid models for VGA/WVGA and HDTV. | ITU-T J.343 (2014), ITU-T J.343.1-6 (2014) |
| 3DTV | Investigated how to assess 3DTV subjective video quality, covering methodologies, display requirements and evaluation of visual discomfort and fatigue. | ITU-T P.914 (2016), ITU-T P.915 (2016), ITU-T P.916 (2016) |
| Audiovisual HD (AVHD) | On one side, addressed the subjective evaluation of audio-video quality metrics. On the other side, developed model standards for video quality assessment of streaming services over reliable transport for resolutions up to 4K/UHD, in collaboration with ITU-T SG12. | ITU-T P.913 (2014), ITU-T P.1204 (2020), ITU-T P.1204.3 (2020), ITU-T P.1204.4 (2020), ITU-T P.1204.5 (2020) |

The contribution to current ITU standardization efforts is still ongoing. For example, updated texts have been contributed by VQEG on statistical analysis in ITU-T Rec. P.1401, and on subjective quality assessment of 360-degree video in ITU-T P.360-VR. 

Apart from this, VQEG supports research on QoE by providing tools and datasets for the research community. For instance, it is worth noting the wide variety of software tools and guidance provided by the VQEG Tools and Subjective Labs Setup project via GitHub. Another example is the VQEG Image Quality Evaluation Tool (VIQET), an objective no-reference photo quality evaluation tool. Finally, several datasets have been published, which can be found on the websites of the corresponding projects, in the Consumer Digital Video Library, or in other repositories.

General articles about the work of VQEG, especially covering its earlier activities, can be found in [1, 2].

References

[1] Q. Huynh-Thu, A. Webster, K. Brunnström, and M. Pinson, “VQEG: Shaping Standards on Video Quality”, in 1st International Conference on Advanced Imaging, Tokyo, Japan, 2015.
[2] K. Brunnström, D. Hands, F. Speranza, and A. Webster, “VQEG Validation and ITU Standardisation of Objective Perceptual Video Quality Metrics”, IEEE Signal Processing Magazine, vol. 26, no. 3, pp. 96-101, May 2009.

JPEG Column: 87th JPEG Meeting

The 87th JPEG meeting, initially planned to be held in Erlangen, Germany, was held online from 25 to 30 April 2020 because of the Covid-19 outbreak. JPEG experts participated in a number of online meetings, attempting to make them as effective as possible while accommodating participation from different time zones, ranging from Australia to California, U.S.A.

JPEG decided to proceed with a Second Call for Evidence on JPEG Pleno Point Cloud Coding and continued its preparations for the contributions to the previously issued Call for Evidence on Learning-based Image Coding Technologies (JPEG AI).

The 87th JPEG meeting had the following highlights:

  • JPEG Pleno Point Cloud Coding issues a Call for Evidence on coding solutions supporting scalability and random access of decoded point clouds.
  • JPEG AI defines evaluation methodologies of the Call for Evidence on machine learning based image coding solutions.
  • JPEG XL defines its file format, which is compatible with existing formats. 
  • JPEG exploration on Media Blockchain releases use cases and requirements.
  • JPEG Systems releases a first version of JPEG Snack use cases and requirements.
  • JPEG XS announces significant improvement of the quality of raw-Bayer image sensor data compression.

JPEG Pleno Point Cloud

JPEG Pleno is working towards the integration of various modalities of plenoptic content under a single and seamless framework. Efficient and powerful point cloud representation is a key feature within this vision. Point cloud data supports a wide range of applications including computer-aided manufacturing, entertainment, cultural heritage preservation, scientific research and advanced sensing and analysis. During the 87th JPEG meeting, the JPEG Committee released a Second Call for Evidence on JPEG Pleno Point Cloud Coding that focuses specifically on point cloud coding solutions supporting scalability and random access of decoded point clouds. The Second Call for Evidence on JPEG Pleno Point Cloud Coding has a revised timeline reflecting changes in the activity due to the 2020 COVID-19 Pandemic. A Final Call for Evidence on JPEG Pleno Point Cloud Coding is planned to be released in July 2020.

JPEG AI

The main focus of JPEG AI was on the promotion and definition of the submission and evaluation methodologies of the Call for Evidence (in coordination with the IEEE MMSP 2020 Challenge) that was issued as an outcome of the 86th JPEG meeting in Sydney, Australia.

JPEG XL

The file format has been defined for the JPEG XL (ISO/IEC 18181-1) codestream, metadata and extensions. The file format enables compatibility with ISOBMFF, JUMBF, XMP, Exif and other existing standards. Standardization has now reached the Committee Draft stage and the DIS ballot is ongoing. A white paper about JPEG XL’s features and tools was approved at this meeting and is available on the jpeg.org website.

JPEG exploration on Media Blockchain – Call for feedback on use cases and requirements

JPEG has determined that blockchain and distributed ledger technologies (DLT) have great potential as a technology component to address many privacy- and security-related challenges in digital media applications. These include digital rights management, privacy and security, integrity verification, and authenticity, issues that impact society in several ways, including the loss of income in the creative sector due to piracy, the spread of fake news, and evidence tampering for fraud purposes.

JPEG is exploring standardization needs related to media blockchain to ensure seamless interoperability and integration of blockchain technology with widely accepted media standards. In this context, the JPEG Committee announces a call for feedback from interested stakeholders on the first public release of the use cases and requirements document.

JPEG Systems initiates standardisation of JPEG Snack

Media “snacking”, the consumption of multimedia in short bursts (less than 15 minutes), has become globally popular. JPEG recognizes the need to standardize how snack images are constructed in order to ensure interoperability. A first version of the JPEG Snack use cases and requirements is now complete and publicly available on the JPEG website, inviting feedback from stakeholders.

JPEG made progress on a fundamental capability of the JPEG file structure with enhancements to the JPEG Universal Metadata Box Format (JUMBF) to support embedding common file types; the DIS text for JUMBF Amendment 1 is ready for ballot. Likewise, the JPEG 360 Amendment 1 DIS text is ready for ballot; this amendment supports stereoscopic 360-degree images and accelerated rendering for regions of interest, and removes the XMP signature block from the metadata description.

JPEG XS – The JPEG Committee is pleased to announce a significant improvement in the quality of its upcoming Bayer compression.

Over the past year, an improvement of around 2 dB has been observed for the new coding tools currently being developed for image sensor compression within JPEG XS. This visually lossless, low-latency and lightweight compression scheme can be used as a mezzanine codec in various markets, such as real-time video storage inside and outside of cameras and data compression onboard autonomous cars. Mathematically lossless capability is also being investigated, and encapsulation within MXF or SMPTE ST 2110-22 is currently being finalized.

Final Quote

“JPEG is committed to the development of new standards that provide state of the art imaging solutions to the largest spectrum of stakeholders. During the 87th meeting, held online because of the Covid-19 pandemic, JPEG progressed well with its current activities and even launched new ones. Although some timelines had to be revisited, overall, no disruptions of the workplan have occurred.” said Prof. Touradj Ebrahimi, the Convenor of the JPEG Committee.

About JPEG

The Joint Photographic Experts Group (JPEG) is a Working Group of ISO/IEC, the International Organisation for Standardization / International Electrotechnical Commission, (ISO/IEC JTC 1/SC 29/WG 1) and of the International Telecommunication Union (ITU-T SG16), responsible for the popular JPEG, JPEG 2000, JPEG XR, JPSearch, JPEG XT and more recently, the JPEG XS, JPEG Systems, JPEG Pleno and JPEG XL families of imaging standards.

More information about JPEG and its work is available at jpeg.org or by contacting Antonio Pinheiro or Frederik Temmermans (pr@jpeg.org) of the JPEG Communication Subgroup.

If you would like to stay posted on JPEG activities, please subscribe to the jpeg-news mailing list on http://jpeg-news-list.jpeg.org.  

Future JPEG meetings are planned as follows:

  • No 88, initially planned in Geneva, Switzerland, July 4 to 10, 2020, will be held online from July 7 to 10, 2020.

MPEG Column: 130th MPEG Meeting (virtual/online)

The original blog post can be found at the Bitmovin Techblog and has been modified/updated here to focus on and highlight research aspects.

The 130th MPEG meeting concluded on April 24, 2020, in Alpbach, Austria … well, not exactly, unfortunately. The 130th MPEG meeting concluded on April 24, 2020, but not in Alpbach, Austria.

I attended the 130th MPEG meeting remotely.

Because of the Covid-19 pandemic, the 130th MPEG meeting was converted from a physical meeting to a fully online meeting, the first in MPEG’s 30+ years of history. Approximately 600 experts from 19 time zones attended, working in tens of Zoom meeting sessions supported by an online calendar and by collaborative tools that involved MPEG experts in both online and offline sessions. For example, input contributions had to be registered and uploaded ahead of the meeting to allow for efficient scheduling of two-hour meeting slots, which were distributed from early morning to late night in order to accommodate experts working in different time zones. These input contributions were then mapped to GitLab issues for offline discussion, while the actual meeting slots were primarily used for organizing the meeting, resolving conflicts, and making decisions, including approving output documents. Although the productivity of the online meeting could not reach the level of regular face-to-face meetings, the results posted in the press release show that MPEG experts managed the challenge quite well; specifically:

  • MPEG ratifies MPEG-5 Essential Video Coding (EVC) standard;
  • MPEG issues the Final Draft International Standards for parts 1, 2, 4, and 5 of MPEG-G 2nd edition;
  • MPEG expands the coverage of ISO Base Media File Format (ISOBMFF) family of standards;
  • A new standard for large scale client-specific streaming with MPEG-DASH;

Other important activities at the 130th MPEG meeting included (i) the carriage of visual volumetric video-based coding data, (ii) Network-Based Media Processing (NBMP) function templates, (iii) the conversion from MPEG-21 contracts to smart contracts, (iv) deep neural network-based video coding, (v) Low Complexity Enhancement Video Coding (LCEVC) reaching DIS stage, and (vi) a new level of the MPEG-4 Audio ALS Simple Profile for high-resolution audio, among others.

The corresponding press release of the 130th MPEG meeting can be found here: https://mpeg.chiariglione.org/meetings/130. This report focused on video coding (EVC) and systems aspects (file format, DASH).

MPEG ratifies MPEG-5 Essential Video Coding Standard

At its 130th meeting, MPEG announced the completion of the new ISO/IEC 23094-1 standard, which is referred to as MPEG-5 Essential Video Coding (EVC) and has been promoted to Final Draft International Standard (FDIS) status. There is a constant demand for more efficient video coding technologies (e.g., due to the increased usage of video on the internet), but coding efficiency is not the only factor determining the industry’s choice of video coding technology for products and services. The EVC standard offers improved compression efficiency compared to existing video coding standards and is based on the statements of all contributors to the standard, who have committed to announcing their license terms for the MPEG-5 EVC standard no later than two years after the FDIS publication date.

MPEG-5 EVC defines two profiles: a “Baseline profile” and a “Main profile”. The Baseline profile contains only technologies that are older than 20 years or otherwise freely available for use in the standard. The Main profile adds a small number of additional tools, each of which can be either cleanly disabled or switched to the corresponding baseline tool on an individual basis.

It will be interesting to see how the EVC profiles (baseline and main) will find their path into products and services, given the number of codecs already in use (e.g., AVC, HEVC, VP9, AV1) and those still under development but close to ratification (e.g., VVC, LCEVC). In total, we may thus end up with about seven video coding formats that need to be considered for future video products and services. In other words, the multi-codec scenario I envisioned some time ago is becoming reality, raising some interesting challenges to be addressed in the future.

Research aspects: as for all video coding standards, the most important research aspect is certainly coding efficiency. For EVC, it might also be interesting to study the usability of the built-in tool switching mechanism in a practical setup. Furthermore, regarding the multi-codec issue, the ratification of EVC adds another facet to the already existing video coding standards in use and/or under development.

MPEG expands the Coverage of ISO Base Media File Format (ISOBMFF) Family of Standards

At the 130th WG11 (MPEG) meeting, the ISOBMFF family of standards has been significantly amended with new tools and functionalities. The standards in question are as follows:

  • ISO/IEC 14496-12: ISO Base Media File Format;
  • ISO/IEC 14496-15: Carriage of network abstraction layer (NAL) unit structured video in the ISO base media file format;
  • ISO/IEC 23008-12: Image File Format; and
  • ISO/IEC 23001-16: Derived visual tracks in the ISO base media file format.

In particular, three new amendments to the ISOBMFF family have reached their final milestone, i.e., Final Draft Amendment (FDAM):

  1. Amendment 4 to ISO/IEC 14496-12 (ISO Base Media File Format) allows the use of a more compact version of metadata for movie fragments;
  2. Amendment 1 to ISO/IEC 14496-15 (Carriage of network abstraction layer (NAL) unit structured video in the ISO base media file format) adds support for HEVC slice segment data tracks and additional extractor types for HEVC, such as track reference and track groups; and
  3. Amendment 2 to ISO/IEC 23008-12 (Image File Format) adds support for more advanced features related to the storage of short image sequences such as burst and bracketing shots.

At the same time, new amendments have reached their first milestone, i.e., Committee Draft Amendment (CDAM):

  1. Amendment 2 to ISO/IEC 14496-15 (Carriage of network abstraction layer (NAL) unit structured video in the ISO base media file format) extends its scope to newly developed video coding standards such as Essential Video Coding (EVC) and Versatile Video Coding (VVC); and
  2. the first edition of ISO/IEC 23001-16 (Derived visual tracks in the ISO base media file format) allows a new type of visual track whose content can be dynamically generated at the time of presentation by applying some operations to the content in other tracks, such as crossfading over two tracks.

Both are expected to reach their final milestone in mid-2021.

Finally, the final text for the ISO/IEC 14496-12 6th edition Final Draft International Standard (FDIS) is now ready for ballot after converting MP4RA to the Maintenance Agency. WG11 (MPEG) notes that Apple Inc. has been appointed as the Maintenance Agency, and MPEG appreciates its valuable efforts over the many years during which it has already been acting as the official registration authority for the ISOBMFF family of standards, i.e., MP4RA (https://mp4ra.org/). The 6th edition of ISO/IEC 14496-12 is expected to be published by ISO by the end of this year.

Research aspects: the ISOBMFF family of standards basically offers certain tools and functionalities to satisfy the given use case requirements. The task of the multimedia systems research community could be to scientifically validate these tools and functionalities with respect to the use cases and maybe even beyond, e.g., try to adopt these tools and functionalities for novel applications and services.

A New Standard for Large Scale Client-specific Streaming with DASH

Historically, in ISO/IEC 23009 (Dynamic Adaptive Streaming over HTTP; DASH), every client has used the same Media Presentation Description (MPD), as this best serves the scalability of the service (e.g., caching efficiency in content delivery networks). However, there have been increasing requests from the industry to enable customized manifests for more personalized services. Consequently, MPEG has studied a solution to this problem without sacrificing scalability, and it reached the first milestone of its standardization at the 130th MPEG meeting.

ISO/IEC 23009-8 adds a mechanism to the Media Presentation Description (MPD) to refer to another document, called the Session-based Description (SBD), which carries per-session information. The DASH client can use this information (i.e., variables and their values) provided in the SBD to derive the URLs for its HTTP GET requests. This standard is expected to reach its final milestone in mid-2021.
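
Conceptually, the SBD mechanism boils down to per-session variable substitution when deriving segment URLs. The sketch below illustrates only this idea in Python; the actual XML syntax of the MPD and SBD, the variable names, and the resolution rules are specified in ISO/IEC 23009-8, and the identifiers used here (sbd_vars, url_template, segment_url) are hypothetical.

```python
# Minimal, hypothetical sketch of session-based URL derivation (not the
# normative ISO/IEC 23009-8 syntax): the client obtains key/value pairs from a
# Session-based Description and substitutes them into a segment URL template
# taken from the shared MPD.
from string import Template

# Values that would normally be parsed from the SBD document referenced by the MPD.
sbd_vars = {"sessionid": "abc123", "quality_cap": "720p"}

# A segment URL template as it might appear in the (shared) MPD.
url_template = Template(
    "https://cdn.example.com/video/$quality_cap/segment_$number.m4s?sid=$sessionid"
)

def segment_url(number: int) -> str:
    """Derive the per-session HTTP GET URL for a given segment number."""
    return url_template.substitute(number=number, **sbd_vars)

if __name__ == "__main__":
    for n in range(1, 4):
        print(segment_url(n))
```

In this picture, the MPD stays identical for all clients (and thus cacheable), while the small SBD carries whatever differs per session.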

Research aspects: SBD’s goal is to enable personalization while maintaining scalability, which calls for a tradeoff, i.e., deciding which kind of information to put into the MPD and what should be conveyed within the SBD. This tradeoff per se could already be considered a research question that will hopefully be addressed in the near future.

An overview of the current status of MPEG-DASH can be found in the figure below.

The next MPEG meeting will take place from June 29th to July 3rd and will again be an online meeting. I am looking forward to a productive AhG period and an online meeting later this year. I am sure that MPEG will further improve its online meeting capabilities and can certainly become a role model for other groups within ISO/IEC and probably also beyond.

Definitions of Crowdsourced Network and QoE Measurements

1 Introduction and Definitions

Crowdsourcing is a well-established concept in the scientific community. The term was used, for instance, by Jeff Howe and Mark Robinson in 2005 to describe how businesses were using the Internet to outsource work to the crowd [2], but the practice can be dated back to 1849 (weather prediction in the US). Crowdsourcing has enabled a huge number of new engineering approaches and commercial applications. To better define crowdsourcing in the context of network measurements, a seminar was held in Würzburg, Germany, on 25-26 September 2019 on the topic “Crowdsourced Network and QoE Measurements”. It notably showed the need for a white paper with the goal of providing a scientific discussion of the terms “crowdsourced network measurements” and “crowdsourced QoE measurements”, describing relevant use cases for such crowdsourced data, and identifying the underlying challenges.

The outcome of the seminar is the white paper [1], which is – to our knowledge – the first document covering the topic of crowdsourced network and QoE measurements. This document serves as a basis for differentiation and a consistent view from different perspectives on crowdsourced network measurements, with the goal of providing a commonly accepted definition in the community. The scope is focused on the context of mobile and fixed network operators, but also on measurements of different layers (network, application, user layer). In addition, the white paper shows the value of crowdsourcing for selected use cases, e.g., to improve QoE, or address regulatory issues. Finally, the major challenges and issues for researchers and practitioners are highlighted.

This article now summarizes the current state of the art in crowdsourcing research and lays down the foundation for the definition of crowdsourcing in the context of network and QoE measurements as provided in [1]. One important effort is first to properly define the various elements of crowdsourcing.

1.1 Crowdsourcing

The word crowdsourcing itself is a blend of “crowd” and the traditional “outsourcing” work-commissioning model. Since the publication of [2], the research community has struggled to find a definition of the term crowdsourcing [3,4,5] that fits the wide variety of its applications and new developments. For example, in ITU-T P.912, crowdsourcing has been defined as:

Crowdsourcing consists of obtaining the needed service by a large group of people, most probably an on-line community.

The above definition was written mainly with the collection of subjective feedback from users in mind. For the purposes of the white paper, which focuses on network measurements, this definition needs to be refined. In the following, the term crowdsourcing is defined as follows:

Crowdsourcing is an action by an initiator who outsources tasks to a crowd of participants to achieve a certain goal.

The following terms are further defined to clarify the above definition:

A crowdsourcing action is part of a campaign that includes processes such as campaign design and methodology definition, data capturing and storage, and data analysis.

The initiator of a crowdsourcing action can be a company, an agency (e.g., a regulator), a research institute or an individual.

Crowdsourcing participants (also “workers” or “users”) work on the tasks set up by the initiator. They are third parties with respect to the initiator, and they must be human.

The goal of a crowdsourcing action is its main purpose from the initiator’s perspective.

The goals of a crowdsourcing action can be manifold and may include, for example:

  • Gathering subjective feedback from users about an application (e.g., ranks expressing the experience of users when using an application)
  • Leveraging existing capacities (e.g., storage, computing, etc.)  offered by companies or individual users to perform some tasks
  • Leveraging cognitive efforts of humans for problem-solving in a scientific context.

In general, an initiator adopts a crowdsourcing approach to remedy a lack of resources (e.g., running a large-scale computation by using the resources of a large number of users to overcome its own limitations) or to broaden a test basis much further than classical opinion polls. Crowdsourcing thus covers a wide range of actions with various degrees of involvement by the participants.

In crowdsourcing, there are various methods of identifying, selecting, receiving, and rewarding users contributing to a crowdsourcing initiative and related services. Individuals or organizations obtain goods and/or services in many different ways from a large, relatively open and often rapidly evolving group of crowdsourcing participants (also called users). How the goods or information obtained through crowdsourcing are used to achieve a cumulative result can also depend on the type of task, the collected goods or information, and the final goal of the crowdsourcing task.

1.2 Roles and Actors

Given the above definitions, the actors involved in a crowdsourcing action are the initiator and the participants. The role of the initiator is to design and initiate the crowdsourcing action, distribute the required resources to the participants (e.g., a piece of software or the task instructions), assign tasks to the participants or start an open call to a larger group, and finally collect, process and evaluate the results of the crowdsourcing action.

The role of the participants depends on their degree of contribution or involvement. In general, it can be described as follows: at a minimum, they offer their resources to the initiator, e.g., time, ideas, or computation resources. At higher levels of contribution, participants might run or perform the tasks assigned by the initiator and (optionally) report the results back to the initiator.

Finally, the relationships between the initiator and the participants are governed by policies specifying the contextual aspects of the crowdsourcing action such as security and confidentiality, and any interest or business aspects specifying how the participants are remunerated, rewarded or incentivized for their participation in the crowdsourcing action.

2 Crowdsourcing in the Context of Network Measurements

The above model considers crowdsourcing at large. In this section, we analyse crowdsourcing for network measurements, which creates crowd data. This exemplifies the broader definitions introduced above: the scope is more restricted, but it comes with strong contextual aspects such as security and confidentiality rules.

2.1 Definition: Crowdsourced Network Measurements

Crowdsourcing enables a distributed and scalable approach to performing network measurements. It can reach a large number of end-users all over the world, clearly surpassing the traditional measurement campaigns launched by network operators or regulatory agencies, which can reach only a limited sample of users. Primarily, crowd data may be used for evaluating QoS, that is, network performance measurements. Crowdsourcing may, however, also be relevant for evaluating QoE, as it may involve asking users about their experience, depending on the type of campaign.

With regard to the previous section and the special aspects of network measurements, crowdsourced network measurements and crowd data are defined as follows, based on the general definition of crowdsourcing introduced above:

Crowdsourced network measurements are actions by an initiator who outsources tasks to a crowd of participants to achieve the goal of gathering network measurement-related data.

Crowd data is the data that is generated in the context of crowdsourced network measurement actions.

The format of the crowd data is specified by the initiator and depends on the type of crowdsourcing action. For instance, crowd data can be the results of large scale computation experiments, analytics, measurement data, etc. In addition, the semantic interpretation of crowd data is under the responsibility of the initiator. The participants cannot interpret the crowd data, which must be thoroughly processed by the initiator to reach the objective of the crowdsourcing action.

We consider in this paper the contribution of human participants only. Distributed measurement actions solely made by robots, IoT devices or automated probes are excluded. Additionally, we require that participants consent to contribute to the crowdsourcing action. This consent might, however, vary from actively fulfilling dedicated task instructions provided by the initiator to merely accepting terms of services that include the option of analysing usage artefacts generated while interacting with a service.

It follows that, in the present document, measurements via crowdsourcing (namely, crowd data) are assumed to be performed by human participants who are aware that they are participating in a crowdsourcing campaign. Having stated this clearly, more details need to be provided about the slightly adapted roles of the actors and their relationships in a crowdsourcing initiative in the context of network measurements.

2.2 Active and Passive Measurements

For a better classification of crowdsourced network measurements, it is important to differentiate between active and passive measurements. Similar to the current working definition within the ITU-T Study Group 12 work item “E.CrowdESFB” (Crowdsourcing Approach for the assessment of end-to-end QoS in Fixed Broadband and Mobile Networks), the following definitions are made:

Active measurements create artificial traffic to generate crowd data.

Passive measurements do not create artificial traffic, but measure crowd data that is generated by the participant.

For example, a typical case of an active measurement is a speed test that generates artificial traffic towards a test server in order to estimate bandwidth or QoS. A passive measurement, in contrast, may be realized by fetching cellular information from a mobile device, which has been collected without generating additional traffic.
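
To make the distinction concrete, the following toy sketch in Python performs an active measurement: it creates artificial traffic by timing TCP connection set-ups to a test server and reports a crude latency estimate. The server name and sample count are placeholders, and a real crowdsourcing campaign would use dedicated measurement servers and a proper consent flow; a passive counterpart would instead read data that the device collects anyway (e.g., cellular signal information).

```python
# Toy active measurement: generate artificial traffic (TCP handshakes) against a
# test server and record the connection set-up time as a rough latency sample.
# The endpoint is a placeholder, not a real measurement backend.
import socket
import time

TEST_SERVER = ("example.org", 443)  # hypothetical test server
NUM_SAMPLES = 5

def tcp_connect_time_ms(host_port, timeout=3.0):
    """Return the TCP connection set-up time in milliseconds."""
    start = time.perf_counter()
    with socket.create_connection(host_port, timeout=timeout):
        pass  # the handshake itself is the artificial traffic
    return (time.perf_counter() - start) * 1000.0

if __name__ == "__main__":
    samples = sorted(tcp_connect_time_ms(TEST_SERVER) for _ in range(NUM_SAMPLES))
    print(f"median connect time: {samples[len(samples) // 2]:.1f} ms")
```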

2.3 Roles of the Actors

Participants have to commit to participating in the crowdsourcing measurements. The level of contribution can vary depending on the corresponding effort or level of engagement. The simplest action is to subscribe to or install a specific application, which collects data through measurements as part of its functioning – often in the background and not as part of the core functionality provided to the user. A more complex, task-driven engagement requires a greater cognitive effort, such as providing subjective feedback on the performance or quality of certain Internet services. Hence, one must differentiate between participant-initiated measurements and automated measurements:

Participant-initiated measurements require the participant to initiate the measurement. The measurement data are typically provided to the participant.

Automated measurements can be performed without the need for the participant to initiate them. They are typically performed in the background.

A participant can thus be a user or a worker. The distinction depends on the main focus of the person doing the contribution and his/her engagement:

A crowdsourcing user is providing crowd data as the side effect of another activity, in the context of passive, automated measurements.

A crowdsourcing worker is providing crowd data as a consequence of his/her engagement when performing specific tasks, in the context of active, participant-initiated measurements.

The term “users” should, therefore, be used when the crowdsourced activity is not the main focus of engagement, but comes as a side effect of another activity – for example, when using a web browsing application which collects measurements in the background, which is a passive, automated measurement.

“Workers” are involved when the crowdsourced activity is the main driver of engagement, for example, when the worker is paid to perform specific tasks and is performing an active, participant-initiated measurement. Note that in some cases, workers can also be incentivized to provide passive measurement data (e.g. with applications collecting data in the background if not actively used).

In general, workers are paid on the basis of clear guidelines for their specific crowdsourcing activity, whereas users provide their contribution on the basis of a more ambiguous, indirect engagement, such as via the utilization of a particular service provided by the beneficiary of the crowdsourcing results, or a third-party crowd provider. Regardless of the participants’ level of engagement, the data resulting from the crowdsourcing measurement action is reported back to the initiator.

The initiator of the crowdsourcing measurement action often has to design the measurement campaign, recruit the participants (selectively or openly), provide them with the necessary means to run the action (e.g., the required backend infrastructure and software tools), collect, process and analyse the information, and possibly publish the results.

2.4 Dimensions of Crowdsourced Network Measurements

In light of the previous section, there are multiple dimensions to consider for crowdsourcing in the context of network measurements. A preliminary list of dimensions includes:

  • Level of subjectivity (subjective vs. objective measurements) in the crowd data
  • Level of engagement of the participant (participant-initiated or background), their cognitive effort, and their awareness (consciousness) of the measurement
  • Level of traffic generation (active vs. passive)
  • Type and level of incentives (attractiveness/appeal, paid or unpaid)

Besides these key dimensions, there are other features which are relevant for characterizing a crowdsourced network measurement activity. These include scale, cost, and value; the type of data collected; and the goal or intention, i.e., the intention of the user (based on incentives) versus the intention of the crowdsourcing initiator with respect to the resulting output.

Figure 1: Dimensions for network measurements crowdsourcing definition, and relevant characterization features (examples with two types of measurement actions)

In Figure 1, we illustrate some dimensions of network measurements based on crowdsourcing. Only the subjectivity, engagement and incentive dimensions are displayed, on an arbitrary scale. The objective of this figure is to show that an initiator has a wide range of possible combinations for a crowdsourcing action. The success of a measurement action with regard to an objective (number of participants, relevance of the results, etc.) is multifactorial. As an example, action 1 may represent QoE measurements obtained from a limited number of participants, while action 2 represents network measurements involving a large number of participants.

3 Summary

The attendees of the Würzburg seminar on “Crowdsourced Network and QoE Measurements” have produced a white paper, which defines terms in the context of crowdsourcing for network and QoE measurements, lists relevant use cases from the perspective of different stakeholders, and discusses the challenges associated with designing crowdsourcing campaigns and with analyzing and interpreting the data. The goal of the white paper is to provide definitions to be commonly accepted by the community and to summarize the most important use cases and challenges from industrial and academic perspectives.

References

[1] White Paper on Crowdsourced Network and QoE Measurements – Definitions, Use Cases and Challenges (2020). Tobias Hoßfeld and Stefan Wunderer, eds., Würzburg, Germany, March 2020. doi: 10.25972/OPUS-20232.

[2] Howe, J. (2006). The rise of crowdsourcing. Wired magazine, 14(6), 1-4.

[3] Estellés-Arolas, E., & González-Ladrón-De-Guevara, F. (2012). Towards an integrated crowdsourcing definition. Journal of Information science, 38(2), 189-200.

[4] Kietzmann, J. H. (2017). Crowdsourcing: A revised definition and introduction to new research. Business Horizons, 60(2), 151-153.

[5] ITU-T P.912, “Subjective video quality assessment methods for recognition tasks”, 08/2016.

[6] ITU-T P.808 (ex P.CROWD), “Subjective evaluation of speech quality with a crowdsourcing approach”, 06/2018

An interview with Associate Professor Duc-Tien Dang-Nguyen

Tien at the beginning of his research career.

Describe your journey into research from your youth up to the present. What foundational lessons did you learn from this journey? Why were you initially attracted to multimedia?

Looking back at the early days of my life, my love for science started quite young. I loved solving puzzles and small recreational mathematical problems. Actually, I still do. It may also be because my mother “seeded” me every night with stories about great scientists like Thomas Edison or Marie Curie. I admired them a lot and often dreamed of being like them. I also love to play video games. I played them a lot, and I think that I am quite good, especially at games like The Legend of Zelda and the Castlevania series. I also love to travel, and perhaps that is why I have had a nomad’s journey over the last ten years, starting from Vietnam to Japan, Italy, Ireland, and now Norway. While living in Vietnam, I would often travel to the countryside on my motorbike. Solving puzzles, playing video games, collecting things, and travelling; these tiny things play an essential role in making me who I am today.

Now back to the story. I come from Vietnam, where it is very normal for my generation to grow up through endless competitions. My first challenge was a math competition when I was eight. I then became a math student and took part in many competitions like the current MIT Mystery Hunt. When I was 12, a friend of my father gave me his old PC as a present. It was a 486 (we called it that since it had an Intel 486 core), and it changed my life. I played with it endlessly. I learned Pascal by myself, and in the last year of secondary school (K-9), I proudly ranked first in both Math and Informatics in the regional contests. Thanks to that, I entered one of the best high schools in Vietnam. I joined the Informatics class, and as you might already guess, we were dealing with programming challenges every day. We learned mainly algorithms and data structures, discrete mathematics, and computational complexity through solving challenging problems from the International Olympiad in Informatics. It is quite similar to Topcoder now. It was tough and very competitive, but it was exciting to me since it was like solving hard puzzles.

Moving on to my bachelor’s, I took an honors program in Computer Science, which was one of the best Computer Science programs in Vietnam. In the third year of my bachelor’s, in an Image Processing course, I did a project on image annotation. It was pure K-means for image segmentation based on pixel color values, followed by a k-NN on a pre-trained set of images. It sounds pretty basic now, but this was in 2001, and “I did it my way”, so it was a fantastic achievement! It was from this project that I became a multimedia researcher.

After my bachelor’s, I continued researching computer vision and image retrieval in my master’s. In my first year as a Ph.D. student, I was working on a multimedia retrieval project, but just three months before the qualifying exam (you need to present your research proposal to continue your Ph.D.), I changed my research topic to Image Forensics, thanks to the course of the same name. I found everything I love in this new research field. It is like solving a puzzle, collecting evidence, and playing a game simultaneously. So, I became an image forensics researcher.

Some people say, “Choose a job you love, and you will never have to work a day in your life”; perhaps they are missing the last part, “because no one will hire you”. Yes, it’s just a joke, but it can also be quite true in many circumstances. It was hard to find a job that needed image forensics when I finished my Ph.D. However, since I knew image processing, computer vision, and machine learning, it was not that hard for me to find a postdoc in those fields. I was then doing both multimedia forensics and multimedia retrieval. This “evolvement” introduced me to a new field, lifelogging, a research direction that tries to discover insights from personal data archives. At first, it was just an “okay” field to me, but later, after digging more into it, I found many interesting challenges that need to be solved. And that is the very long story of how I reached the starting point of my research.

Can you profile your current research, its challenges, opportunities, and implications? Tell us more about your vision and objectives behind your current roles.

I am currently an associate professor at the University of Bergen, where I mainly focus on image forensics and lifelogging. Multimedia forensics is about discovering the history of modifications to multimedia content such as videos, images, audio, etc. Mainly, I work with images and have dabbled a bit in video forensics. Audio is nice too, but I mostly enjoy working with the visual side of multimedia. People tend to think about multimedia forensics as a tool to check whether an image or a video is real or fake. However, we also try to look at the specifics of the media in question. Some potential questions for an image could, for example, be: where was it first posted? What type of camera was it taken with? These are questions that help identify the reliability of the image in question and give more information than just fake or real. I also believe that we should take a further step by considering the context of use (how, where, and when) of the multimedia content. The expectation of truthfulness is radically different if the image is hanging in an art gallery than if it is being used as evidence in a court case.

As previously mentioned, I also work with lifelogging. This work is still in its early stages. We have not proposed any novel approaches yet. Instead, we are building a community by organizing research activities such as workshops and benchmarking initiatives. We believe that by holding such events, we are preparing a solid user base for the next phase, when people are more familiar with such technologies: the phase of personal data analytics. We have witnessed great applications of AI during the last decade. Since AI needs data, and people need more personalized solutions, I believe that very soon we will be doing lifelogging in our everyday life. Let’s wait and see whether my prediction comes true.

How would you describe your top innovative achievements in terms of the problems you were trying to solve, your solutions, and the impact it has today and into the future?

In multimedia forensics, I am quite happy that I was among the first to propose an approach for discriminating between computer-generated and natural human faces. People are now well aware of “Deepfakes”, and many great people are working on this problem. However, when I presented my first study in 2011, many people, including computer graphics researchers, laughed when I told them that they would soon not be able to distinguish computer-generated faces from real ones. In image forensics, we try to reveal all traces of the image acquisition history, and since digital images are based on pixels, they are susceptible to changes. For example, many traces of modification become incredibly hard to find if the image is resized. Most of my approaches are thus physically or geometrically based, which makes them more robust against changes as well as more reliable in terms of decision explanation.

Over your distinguished career, what are your top lessons you want to share with the audience?

I believe that I am still at the start of my career, and perhaps the first and most important lesson I have learned is about “causes and effects”, or what Steve Jobs described as “connecting the dots”. There are dots in our life where it is very hard to understand or predict how everything is connected, but eventually, when looking back, the connections reveal themselves. Just follow whatever you think is good for you and try very hard to make it a good “dot”. Everyone wants to work on something they love, but finding what we love in our current work is even more important.

What is the best joke you know?

Most of the jokes I love are in Vietnamese, and unless you are Vietnamese, you can’t get them. I tried to think of some “Western” jokes that share some commonalities with Vietnamese humor and culture, and that would have to be a political joke. I believe that you can find a similar version with the KGB or the Stasi. This one was very famous, and surprisingly, it is very well suited to my current research on lifelogging 🙂

“Why do Stasi officers make such good taxi drivers? — You get in the car and they already know your name and where you live.”

A recent photo of Tien.

Bio: Duc-Tien Dang-Nguyen is an associate professor at the University of Bergen. His main research interests are multimedia forensics, lifelogging, and machine learning.

Dataset Column: ToCaDa Dataset with Multi-Viewpoint Synchronized Videos

This column describes the release of the Toulouse Campus Surveillance Dataset (ToCaDa). It consists of 25 synchronized videos (with audio) of two scenes recorded from different viewpoints of the campus. An extensive manual annotation comprises all moving objects and their corresponding bounding boxes, as well as audio events. The annotation was performed in order to i) enhance audiovisual objects that can be visible, audible or both, according to each recording location, and ii) uniquely identify all objects in each of the two scenes. All videos have been “anonymized”. The dataset is available for download here.

Introduction

The increasing number of recording devices, such as smartphones, has led to an exponential production of audiovisual documents. These documents may correspond to the same scene, for instance an outdoor event filmed from different points of view. Such multi-view scenes contain a lot of information and provide new opportunities for answering high-level automatic queries.

In essence, these documents are multimodal, and their audio and video streams contain different levels of information. For example, the source of a sound may either be visible or not according to the different points of view. This information can be used separately or jointly to achieve different tasks, such as synchronising documents or following the displacement of a person. The analysis of these multi-view field recordings further allows understanding of complex scenarios. The automation of these tasks faces a need for data, as well as a need for the formalisation of multi-source retrieval and multimodal queries. As also stated by Lefter et al., “problems with automatically processing multimodal data start already from the annotation level” [1]. The complexity of the interactions between modalities forced the authors to produce three different types of annotations: audio, video, and multimodal.

In surveillance applications, humans and vehicles are the most commonly studied elements. Consequently, detecting and matching a person or a car that appears in several videos is a key problem. Although many algorithms have been introduced, a major remaining problem is how to precisely evaluate and compare these algorithms against a common ground truth. Datasets are required for evaluating multi-view based methods.

During the last decade, public datasets have become more and more available, helping with the evaluation and comparison of algorithms and, in doing so, contributing to improvements in human and vehicle detection and tracking. However, most datasets focus on a specific task and do not support the evaluation of approaches that mix multiple sources of information. Only a few datasets provide synchronized videos with overlapping fields of view, and these rarely provide more than 4 different views, even though more and more approaches could benefit from having additional views available. Moreover, soundtracks are almost never provided despite being a rich source of information, as voices and motor noises can help to recognize, respectively, a person or a car.

Notable multi-view datasets are the following.

  • The 3D People Surveillance Dataset (3DPeS) [2] comprises 8 cameras with disjoint views and 200 different people. Each person appears, on average, in 2 views. More than 600 video sequences are available. Thus, it is well-suited for people re-identification. Camera parameters are provided, as well as a coarse 3D reconstruction of the surveilled environment.
  • The Video Image Retrieval and Analysis Tool (VIRAT) [3] dataset provides a large amount of surveillance videos with a high pixel resolution. In this dataset, 16 scenes were recorded for hours although in the end only 25 hours with significant activities were kept. Moreover, only two pairs of videos present overlapping fields of view. Moving objects were annotated by workers with bounding boxes, as well as some buildings or areas. Three types of events were also annotated, namely (i) single person events, (ii) person and vehicle events, and (iii) person and facility events, leading to 23 classes of events. Most actions were performed by people with minimal scripted actions, resulting in realistic scenarios with frequent incidental movers and occlusions.
  • Purely action-oriented datasets can be found in the Multicamera Human Action Video (MuHAVi) [4] dataset, in which 14 actors perform 17 different action classes (such as “kick”, “punch”, “gunshot collapse”) while 8 cameras capture the indoor scene. Likewise, Human3.6M [5] contains videos where 11 actors perform 15 different classes of actions while being filmed by 4 digital cameras; its specificity lies in the fact that 1 time-of-flight sensor and 10 motion capture cameras were also used to estimate and provide the 3D pose of the actors in each frame. Both background subtraction and bounding boxes are provided for each frame. In total, more than 3.6M frames are available. In these two datasets, actions are performed in unrealistic conditions, as the actors follow a script consisting of actions performed one after the other.

In the table below a comparison is shown between the aforementioned datasets, which are contrasted with the new ToCaDa dataset we recently introduced and describe in more detail below.

| Properties | 3DPeS [2] | VIRAT [3] | MuHAVi [4] | Human3.6M [5] | ToCaDa [6] |
|---|---|---|---|---|---|
| # Cameras | 8 static | 16 static | 8 static | 4 static | 25 static |
| # Microphones | 0 | 0 | 0 | 0 | 25+2 |
| Overlapping FOV | Very partially | 2+2 | 8 | 4 | 17 |
| Disjoint FOV | 8 | 12 | 0 | 0 | 4 |
| Synchronized | No | No | Partially | Yes | Yes |
| Pixel resolution | 704 x 576 | 1920 x 1080 | 720 x 576 | 1000 x 1000 | Mostly 1920 x 1080 |
| # Visual objects | 200 | Hundreds | 14 | 11 | 30 |
| # Action types | 0 | 23 | 17 | 15 | 0 |
| # Bounding boxes | 0 | ≈ 1 object/second | 0 | ≈ 1 object/frame | ≈ 1 object/second |
| In/outdoor | Outdoor | Outdoor | Indoor | Indoor | Outdoor |
| With scenario | No | No | Yes | Yes | Yes |
| Realistic | Yes | Yes | No | No | Yes |

ToCaDa Dataset

As a large multi-view, multimodal, and realistic video collection does not yet exist, we therefore took the initiative to produce such a dataset. The ToCaDa dataset [6] comprises 25 synchronized videos (including soundtrack) of the same scene recorded from multiple viewpoints. The dataset follows two detailed scenarios consisting of comings and goings of people, cars and motorbikes, with both overlapping and non-overlapping fields of view (see Figures 1-2). This dataset aims at paving the way for multidisciplinary approaches and applications such as 4D-scene reconstruction, object re-identification/tracking and multi-source metadata modeling and querying.

Figure 1: The campus contains 25 cameras, of which 8 are spread out across the area and 17 are located within the red rectangle (see Figure 2).
Figure 2: The main building where 17 cameras with overlapping fields of view are concentrated.

About 20 actors were asked to follow two realistic scenarios by performing scripted actions, like driving a car, walking, entering or leaving a building, or holding an item in hand while being filmed. In addition to ordinary actions, some suspicious behaviors are present. More precisely:

  • In the first scenario, a suspect car (C) with two men inside (D the driver and P the passenger) arrives and parks in front of the main building (within the sights of the cameras with overlapping views). P gets out of the car C and enters the building. Two minutes later, P leaves the building holding a package and gets in C. C leaves the parking lot (see Figure 3) and gets away from the university campus (passing in front of some of the cameras with disjoint fields of view). Other vehicles and persons regularly move through the views of different cameras with no suspicious behavior.
  • In the second scenario, a suspect car (C) with two men inside (D the driver and P the passenger) arrives and parks badly along the road. P gets out of the car and enters the building. Meanwhile, a woman W knocks on the car window to ask the driver D to park correctly, but he drives off immediately. A few minutes later, P leaves the building with a package and seems confused as the car is missing. He then runs away. In the end, in one of the disjoint-view cameras, we can see him waiting until C picks him up.
Figure 3: A subset of all the synchronized videos for a particular frame of the first scenario. First row: cameras located in front of the building. Second and third rows: cameras that face the car park. A car is circled in red to highlight the largely overlapping fields of view.

The 25 camera holders we enlisted used their own mobile devices to record the scene, leading to a large variety of resolutions, image qualities, frame rates and video durations. Three foghorn blasts were used to coordinate this heterogeneous setup:

  • The first one served as a warning 20 seconds before the start, to give everyone enough time to start shooting.
  • The second one is the actual starting time, used to temporally synchronize the videos.
  • The third one indicates the ending time.

All the videos were collected and manually synchronized, using the second and the third foghorn blasts as starting and ending times. Indeed, the second one can be heard at the beginning of every video.
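
Although the alignment was done manually, the underlying idea can also be checked programmatically: cross-correlating the audio tracks of two videos and taking the lag that maximizes the correlation yields the offset at which the common foghorn blast (and the rest of the soundtrack) lines up. The sketch below assumes the audio has already been extracted to mono arrays at a common sampling rate; it is an illustration of the principle, not part of the dataset's toolchain.

```python
# Estimate the time offset between two recordings of the same scene by
# cross-correlating their mono audio tracks (same sampling rate assumed),
# e.g., around the foghorn blast used as the synchronization signal.
import numpy as np

def estimate_offset_seconds(audio_a, audio_b, sample_rate):
    """Return the lag (in seconds) by which audio_b must be shifted to align with audio_a."""
    a = (audio_a - np.mean(audio_a)) / (np.std(audio_a) + 1e-9)
    b = (audio_b - np.mean(audio_b)) / (np.std(audio_b) + 1e-9)
    corr = np.correlate(a, b, mode="full")   # full cross-correlation
    lag = np.argmax(corr) - (len(b) - 1)     # lag in samples
    return lag / sample_rate

# Tiny synthetic check: a signal and a copy of it delayed by 0.5 s at 1 kHz.
rate = 1000
sig = np.random.default_rng(0).standard_normal(5 * rate)
delayed = np.concatenate([np.zeros(rate // 2), sig])[: len(sig)]
print(estimate_offset_seconds(sig, delayed, rate))  # ≈ -0.5 (shift 'delayed' 0.5 s earlier)
```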

Annotations

A special annotation procedure was set up to handle the audiovisual content of this multi-view data [7]. The audio and video parts of each document were first annotated separately, after which a fusion of these modalities was performed.

The ground truth annotations are stored in JSON files. Each file corresponds to a video and shares the same name but not the same extension; that is, the annotations for <video_name>.mp4 are stored in <video_name>.json. Both visual and audio annotations are stored together in the same file.

By annotating, our goal is to detect the visual objects and the salient sound events and, when possible, to associate them. Thus, we have grouped them under the generic term audio-visual object. This way, the appearance of a vehicle and its motor sound constitute a single coherent audio-visual object and are associated with the same ID. An object that can be seen but not heard is also an audio-visual object, but with only a visual component, and similarly for an object that can only be heard. An example is given in Listing 1.

Listing 1: Json file structure of the visual component of an object in a video, visible from 13.8s to 18.2s and from 29.72s to 32.28s and associated with id 11.
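
The listing itself is not reproduced in this column, so the following hypothetical Python sketch only suggests what such an entry could look like when loaded; the field names (object_id, visibility_intervals, bounding_boxes, and so on) and the coordinate values are illustrative and do not necessarily match the dataset's actual JSON schema.

```python
# Hypothetical illustration (not the dataset's actual schema) of the visual
# component of an audio-visual object, following the caption of Listing 1:
# object id 11, visible from 13.8 s to 18.2 s and from 29.72 s to 32.28 s.
visual_component = {
    "object_id": 11,
    "category": "vehicle",                                     # human or vehicle
    "details": ["white car"],                                  # free-text visual details
    "visibility_intervals": [[13.8, 18.2], [29.72, 32.28]],    # in seconds
    "bounding_boxes": [
        # one keyframe box per annotated frame: time, top-left and bottom-right corners
        {"time": 13.8, "top_left": [512, 300], "bottom_right": [640, 380]},
        {"time": 14.8, "top_left": [540, 305], "bottom_right": [668, 385]},
    ],
}
```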

To help with the annotation process, we developed a program for navigating through the frames of the synchronized videos and for identifying audio-visual objects by drawing bounding boxes in particular frames and/or specifying the starting and ending times of salient sounds. Bounding boxes were drawn around every moving object, with a flag indicating whether the object was fully visible or occluded, its category (human or vehicle), visual details (for example clothing types or colors), and the timestamps of its appearances and disappearances. Audio events were also annotated with a category and two timestamps.

Regarding bounding boxes, the coordinates of the top-left and bottom-right corners are given. Bounding boxes were drawn such that the object is fully contained within the box and the box is as tight as possible. For this purpose, our annotation tool allows the user to draw an initial approximate bounding box and then to adjust its boundaries at the pixel level.

As drawing one bounding box for each object in every frame would require a huge amount of time, we have drawn bounding boxes only on a subset of frames, so that the intermediate bounding boxes of an object can be linearly interpolated from its previous and next drawn bounding boxes. On average, we have drawn one bounding box per second for humans and two per second for vehicles, due to their speed variations. For objects with irregular speed or trajectory, we have drawn more bounding boxes.
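
Given keyframe boxes such as those in the hypothetical sketch above, the intermediate boxes can be recovered by linearly interpolating the corner coordinates between the previous and next annotated boxes. A minimal Python sketch of this interpolation (using the same illustrative box format) could look as follows:

```python
def interpolate_box(prev_box, next_box, t):
    """Linearly interpolate a bounding box at time t between two keyframe boxes.

    Each box is a dict with 'time', 'top_left' [x, y] and 'bottom_right' [x, y],
    as in the hypothetical sketch above; t must lie between the two keyframe times.
    """
    alpha = (t - prev_box["time"]) / (next_box["time"] - prev_box["time"])
    lerp = lambda p, q: [round(p[i] + alpha * (q[i] - p[i])) for i in range(2)]
    return {
        "time": t,
        "top_left": lerp(prev_box["top_left"], next_box["top_left"]),
        "bottom_right": lerp(prev_box["bottom_right"], next_box["bottom_right"]),
    }

# Example: the box halfway between the two keyframes drawn at 13.8 s and 14.8 s.
box_a = {"time": 13.8, "top_left": [512, 300], "bottom_right": [640, 380]}
box_b = {"time": 14.8, "top_left": [540, 305], "bottom_right": [668, 385]}
print(interpolate_box(box_a, box_b, 14.3))
```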

Regarding the audio component of an audio-visual object, namely the salient sound events, an audio category (voice, motor sound) is given in addition to its ID, as well as a list of details and time bounds (see Listing 2).

Listing 2: Json file structure of an audio event in a given video. As it is associated with id 11, it corresponds to the same audio-visual object as the one in Listing 1.
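
As with Listing 1, the listing is not reproduced here; a hypothetical sketch of the audio component attached to the same audio-visual object (with the same caveat that the field names are illustrative, not the dataset's actual schema) might look as follows:

```python
# Hypothetical illustration (not the dataset's actual schema) of an audio event
# associated with the same audio-visual object as above (id 11).
audio_component = {
    "object_id": 11,
    "audio_category": "motor sound",    # e.g., voice or motor sound
    "details": ["engine starting"],     # free-text audio details
    "time_bounds": [[13.5, 19.0]],      # seconds during which the sound is audible
}
```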

Finally, we linked the audio objects to the video objects by giving the same ID to the audio object in the case of causal identification, meaning that the acoustic source of the audio event is the annotated object (a car or a person, for instance). This step was particularly crucial and could not be automated, as complex expertise is required to identify the sound sources. For example, in the video sequence illustrated in Figure 4, a motor sound is audible and seems to come from the car, whereas it actually comes from a motorbike behind the camera.

Figure 4: At this time of the video sequence of camera 10, a motor sound is heard and seems to come from the car while it actually comes from a motorbike behind the camera.

If an object produces several sound categories (a car with door slams, music and motor sound, for example), one audio object is created for each category and the same ID is given to all of them.

Ethical and Legal

According to European legislation, it is forbidden to make publicly available images of people who might be recognized or of license plates. As people and license plates are visible in our videos, in order to conform to the General Data Protection Regulation (GDPR) we decided to:

  • Ask actors to sign an authorization for publishing their image, and
  • Apply post-processing to the videos to blur the faces of other people and any license plates (a rough sketch of this kind of blurring is given after this list).
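
As an illustration of the kind of post-processing meant by blurring, the sketch below uses OpenCV's stock face detector to blur detected faces in a single frame. The dataset's actual anonymization pipeline is not described in this column, so this is a generic example rather than the procedure that was applied.

```python
# Illustrative sketch only: Gaussian-blur detected faces in a video frame with
# OpenCV. License plates would require a dedicated detector, handled similarly.
import cv2

face_cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml"
)

def blur_faces(frame):
    """Return a copy of the frame with all detected faces Gaussian-blurred."""
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    out = frame.copy()
    for (x, y, w, h) in face_cascade.detectMultiScale(gray, 1.1, 5):
        out[y:y + h, x:x + w] = cv2.GaussianBlur(out[y:y + h, x:x + w], (51, 51), 0)
    return out
```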

Conclusion

We have introduced a new dataset composed of two sets of 25 synchronized videos of the same scene with 17 overlapping views and 8 disjoint views. Videos are provided with their associated soundtracks. We have annotated the videos by manually drawing bounding boxes on moving objects. We have also manually annotated audio events. Our dataset offers simultaneously a large number of both overlapping and disjoint synchronized views and a realistic environment. It also provides audio tracks with sound events, high pixel resolution and ground truth annotations.

The originality and richness of this dataset come from the wide diversity of topics it covers and from the presence of both scripted and non-scripted actions and events. Therefore, our dataset is well suited for numerous pattern recognition applications related to, but not restricted to, the domain of surveillance. Below, we describe some multidisciplinary applications that could be evaluated using this dataset:

3D and 4D reconstruction: The multiple cameras sharing overlapping fields of view, along with some provided photographs of the scene, allow performing a 3D reconstruction of the static parts of the scene and retrieving the intrinsic parameters and poses of the cameras using a Structure-from-Motion algorithm. Beyond a 3D reconstruction, the temporal synchronization of the videos could also enable rendering the dynamic parts of the scene and obtaining a 4D reconstruction.

Object recognition and consistent labeling: Evaluation of algorithms for human and vehicle detection and consistent labeling across multiple views can be performed using the annotated bounding boxes and IDs. To this end, overlapping views provide a 3D environment that could help to infer the label of an object in a video knowing its position and label in another video.

Sound event recognition: The audio events recorded from different locations and manually annotated provide opportunities to evaluate the relevance of consistent acoustic models by, for example, launching the identification and indexing of a specific sound event. Looking for a particular sound by similarity is also feasible.

Metadata modeling and querying: The multiple layers of information in this dataset, both low-level (audio/video signal) and high-level (semantic data available in the ground truth files), enable handling information at different resolutions of space and time, allowing queries over heterogeneous information.

References

[1] I. Lefter, L.J.M. Rothkrantz, G. Burghouts, Z. Yang, P. Wiggers. “Addressing multimodality in overt aggression detection”, in Proceedings of the International Conference on Text, Speech and Dialogue, 2011, pp. 25-32.
[2] D. Baltieri, R. Vezzani, R. Cucchiara. “3DPeS: 3D people dataset for surveillance and forensics”, in Proceedings of the 2011 joint ACM workshop on Human Gesture and Behavior Understanding, 2011, pp. 59-64.
[3] S. Oh, A. Hoogs, A. Perera, N. Cuntoor, C. Chen, J.T. Lee, S. Mukherjee, J.K. Aggarwal, H. Lee, L. Davis, E. Swears, X. Wang, Q. Ji, K. Reddy, M. Shah, C. Vondrick, H. Pirsiavash, D. Ramanan, J. Yuen, A. Torralba, B. Song, A. Fong, A. Roy-Chowdhury, M. Desai. “A large-scale benchmark dataset for event recognition in surveillance video”, in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2011, pp. 3153-3160.
[4] S. Singh, S.A. Velastin, H. Ragheb. “MuHAVi: A multicamera human action video dataset for the evaluation of action recognition methods”, in Proceedings of the 7th IEEE International Conference on Advanced Video and Signal Based Surveillance, 2010, pp. 48-55.
[5] C. Ionescu, D. Papava, V. Olaru, C. Sminchisescu. “Human3.6M: Large scale datasets and predictive methods for 3d human sensing in natural environments”, IEEE transactions on Pattern Analysis and Machine Intelligence, 36(7), 2013, pp. 1325-1339.
[6] T. Malon, G. Roman-Jimenez, P. Guyot, S. Chambon, V. Charvillat, A. Crouzil, A. Péninou, J. Pinquier, F. Sèdes, C. Sénac. “Toulouse campus surveillance dataset: scenarios, soundtracks, synchronized videos with overlapping and disjoint views”, in Proceedings of the 9th ACM Multimedia Systems Conference. 2018, pp. 393-398.
[7] P. Guyot, T. Malon, G. Roman-Jimenez, S. Chambon, V. Charvillat, A. Crouzil, A. Péninou, J. Pinquier, F. Sèdes, C. Sénac. “Audiovisual annotation procedure for multi-view field recordings”, in Proceedings of the International Conference on Multimedia Modeling, 2019, pp. 399-410.