Dataset Column: ToCaDa Dataset with Multi-Viewpoint Synchronized Videos

This column describes the release of the Toulouse Campus Surveillance Dataset (ToCaDa). It consists of 25 synchronized videos (with audio) of two scenes recorded from different viewpoints of the campus. An extensive manual annotation comprises all moving objects and their corresponding bounding boxes, as well as audio events. The annotation was performed in order to i) enhance audiovisual objects that can be visible, audible or both, according to each recording location, and ii) uniquely identify all objects in each of the two scenes. All videos have been «anonymized». The dataset is available for download here.


The increasing number of recording devices, such as smartphones, has led to an exponential production of audiovisual documents. These documents may correspond to the same scene, for instance an outdoor event filmed from different points of view. Such multi-view scenes contain a lot of information and provide new opportunities for answering high-level automatic queries.

In essence, these documents are multimodal, and their audio and video streams contain different levels of information. For example, the source of a sound may either be visible or not according to the different points of view. This information can be used separately or jointly to achieve different tasks, such as synchronising documents or following the displacement of a person. The analysis of these multi-view field recordings further allows understanding of complex scenarios. The automation of these tasks faces a need for data, as well as a need for the formalisation of multi-source retrieval and multimodal queries. As also stated by Lefter et al., “problems with automatically processing multimodal data start already from the annotation level” [1]. The complexity of the interactions between modalities forced the authors to produce three different types of annotations: audio, video, and multimodal.

In surveillance applications, humans and vehicles are the most commonly studied elements. Consequently, detecting and matching a person or a car that appears in several videos is a key problem. Although many algorithms have been introduced, a major related problem remains: how to precisely evaluate and compare these algorithms against a common ground truth. Datasets are required for evaluating multi-view based methods.

During the last decade, public datasets have become more and more available, helping with the evaluation and comparison of algorithms, and in doing so, contributing to improvements in human and vehicle detection and tracking. However, most of the datasets focus on a specific task and do not support the evaluation of approaches that mix multiple sources of information. Only a few datasets provide synchronized videos with overlapping fields of view. Yet, these rarely provide more than 4 different views even though more and more approaches could benefit from having additional views available. Moreover, soundtracks are almost never provided despite being a rich source of information, as voices and motor noises can help to recognize, respectively, a person or a car.

Notable multi-view datasets are the following.

  • The 3D People Surveillance Dataset (3DPeS) [2] comprises 8 cameras with disjoint views and 200 different people. Each person appears, on average, in 2 views. More than 600 video sequences are available. Thus, it is well-suited for people re-identification. Camera parameters are provided, as well as a coarse 3D reconstruction of the surveilled environment.
  • The Video Image Retrieval and Analysis Tool (VIRAT) [3] dataset provides a large amount of surveillance videos with a high pixel resolution. In this dataset, 16 scenes were recorded for hours although in the end only 25 hours with significant activities were kept. Moreover, only two pairs of videos present overlapping fields of view. Moving objects were annotated by workers with bounding boxes, as well as some buildings or areas. Three types of events were also annotated, namely (i) single person events, (ii) person and vehicle events, and (iii) person and facility events, leading to 23 classes of events. Most actions were performed by people with minimal scripted actions, resulting in realistic scenarios with frequent incidental movers and occlusions.
  • Purely action-oriented data can be found in the Multicamera Human Action Video (MuHAVi) [4] dataset, in which 14 actors perform 17 different action classes (such as “kick”, “punch”, “gunshot collapse”) while 8 cameras capture the indoor scene. Likewise, Human3.6M [5] contains videos where 11 actors perform 15 different classes of actions while being filmed by 4 digital cameras; its specificity lies in the fact that 1 time-of-flight sensor and 10 motion cameras were also used to estimate and provide the 3D pose of the actors in each frame. Both background subtraction and bounding boxes are provided for each frame. In total, more than 3.6M frames are available. In these two datasets, actions are performed in unrealistic conditions as the actors follow a script consisting of actions that are performed one after the other.

The table below compares the aforementioned datasets with the new ToCaDa dataset, which we recently introduced and describe in more detail below.

Properties       | 3DPeS [2]      | VIRAT [3]         | MuHAVi [4] | Human3.6M [5]    | ToCaDa [6]
-----------------+----------------+-------------------+------------+------------------+-------------------
# Cameras        | 8 static       | 16 static         | 8 static   | 4 static         | 25 static
# Microphones    | 0              | 0                 | 0          | 0                | 25+2
Overlapping FOV  | Very partially | 2+2               | 8          | 4                | 17
Disjoint FOV     | 8              | 12                | 0          | 0                | 4
Synchronized     | No             | No                | Partially  | Yes              | Yes
Pixel resolution | 704 x 576      | 1920 x 1080       | 720 x 576  | 1000 x 1000      | Mostly 1920 x 1080
# Visual objects | 200            | Hundreds          | 14         | 11               | 30
# Action types   | 0              | 23                | 17         | 15               | 0
# Bounding boxes | 0              | ≈ 1 object/second | 0          | ≈ 1 object/frame | ≈ 1 object/second
In/outdoor       | Outdoor        | Outdoor           | Indoor     | Indoor           | Outdoor
With scenario    | No             | No                | Yes        | Yes              | Yes
Realistic        | Yes            | Yes               | No         | No               | Yes

ToCaDa Dataset

As a large multi-view, multimodal, and realistic video collection does not yet exist, we took the initiative to produce such a dataset. The ToCaDa dataset [6] comprises 25 synchronized videos (including soundtrack) of the same scene recorded from multiple viewpoints. The dataset follows two detailed scenarios consisting of comings and goings of people, cars and motorbikes, with both overlapping and non-overlapping fields of view (see Figures 1-2). This dataset aims at paving the way for multidisciplinary approaches and applications such as 4D-scene reconstruction, object re-identification/tracking and multi-source metadata modeling and querying.

Figure 1: The campus contains 25 cameras, of which 8 are spread out across the area and 17 are located within the red rectangle (see Figure 2).
Figure 2: The main building where 17 cameras with overlapping fields of view are concentrated.

About 20 actors were asked to follow two realistic scenarios by performing scripted actions, like driving a car, walking, entering or leaving a building, or holding an item in hand while being filmed. In addition to ordinary actions, some suspicious behaviors are present. More precisely:

  • In the first scenario, a suspect car (C) with two men inside (D the driver and P the passenger) arrives and parks in front of the main building (within the sights of the cameras with overlapping views). P gets out of the car C and enters the building. Two minutes later, P leaves the building holding a package and gets in C. C leaves the car park (see Figure 3) and gets away from the university campus (passing in front of some of the cameras with disjoint fields of view). Other vehicles and persons regularly move through the different camera views with no suspicious behavior.
  • In the second scenario, a suspect car (C) with two men inside (D the driver and P the passenger) arrives and parks badly along the road. P gets out of the car and enters the building. Meanwhile, a woman W knocks on the car window to ask the driver D to park correctly, but he drives off immediately. A few minutes later, P leaves the building with a package and seems confused as the car is missing. He then runs away. In the end, in one of the disjoint-view cameras, we can see him waiting until C picks him up.
Figure 3: A subset of all the synchronized videos for a particular frame of the first scenario. First row: cameras located in front of the building. Second and third rows: cameras that face the car park. A car is circled in red to highlight the largely overlapping fields of view.

The 25 camera holders we enlisted used their own mobile devices to record the scene, leading to a large variety of resolutions, image quality, frame rates and video durations. Three foghorns were blown in order to coordinate this heterogeneous setup:

  • The first one stands for a warning 20 seconds before the start, to give enough time to start shooting.
  • The second one is the actual starting time, used to temporally synchronize the videos.
  • The third one indicates the ending time.

All the videos were collected and were manually synchronized using the second and the third foghorn blows as starting and ending times. Indeed, the second one can be heard at the beginning of every video.
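The synchronization step described above can be sketched as follows. This is a minimal illustration and not the authors' tool: it assumes the local timestamps of the second and third foghorn blows have already been located in each recording, and the camera names and timestamp values are made up for the example. Local video time is mapped onto a common clock whose origin is the second foghorn blow.

```python
from dataclasses import dataclass


@dataclass
class FoghornMarks:
    """Local timestamps (in seconds) of the foghorn blows in one recording."""
    start: float  # second foghorn: origin of the common clock
    end: float    # third foghorn: end of the scenario


def to_common_time(local_t: float, marks: FoghornMarks) -> float:
    """Convert a local timestamp to seconds elapsed since the second foghorn."""
    return local_t - marks.start


def trimmed_duration(marks: FoghornMarks) -> float:
    """Duration of the synchronized excerpt between the two blows."""
    return marks.end - marks.start


# Hypothetical marks for two recordings of the same scenario.
marks_by_camera = {
    "cam_a": FoghornMarks(start=21.5, end=321.5),
    "cam_b": FoghornMarks(start=4.0, end=304.0),
}

# The same instant, 10 s after the second foghorn, read on both local clocks.
common_a = to_common_time(31.5, marks_by_camera["cam_a"])
common_b = to_common_time(14.0, marks_by_camera["cam_b"])
```

Once every recording is expressed on the common clock, a frame in one video can be paired with the frames recorded at the same instant in the others, regardless of when each camera holder started shooting.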


A special annotation procedure was set to handle the audiovisual content of this multi-view data [7]. Audio and video parts of each document were first separately annotated, after which a fusion of these modalities was realized.

The ground truth annotations are stored in JSON files. Each file corresponds to one video and shares its name but not its extension: the annotations of <video_name>.mp4 are stored in <video_name>.json. Both visual and audio annotations are stored together in the same file.
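As a small sketch of this naming convention, the annotation file of a video can be located by swapping the extension. The helper names and the example file name are hypothetical, and the loader deliberately makes no assumption about the JSON schema.

```python
import json
from pathlib import Path


def annotation_path_for(video_path: str) -> Path:
    """Return the JSON annotation path that shares the video's name."""
    return Path(video_path).with_suffix(".json")


def load_annotations(video_path: str):
    """Load the annotations of a video; the schema is not assumed here."""
    with annotation_path_for(video_path).open(encoding="utf-8") as handle:
        return json.load(handle)
```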

By annotating, our goal is to detect the visual objects and the salient sound events and, when possible, to associate them. Thus, we have grouped them into the generic term audio-visual object. This way, the appearance of a vehicle and its motor sound will constitute a single coherent audio-visual object and is associated with the same ID. An object that can be seen but cannot be heard is also an audio-visual object but with only a visual component, and similarly for an object that can only be heard. An example is given in Listing 1.

Listing 1: JSON file structure of the visual component of an object in a video, visible from 13.8s to 18.2s and from 29.72s to 32.28s and associated with id 11.

To help with the annotation process, we developed a program for navigating through the frames of the synchronized videos and for identifying audio-visual objects by drawing bounding boxes in particular frames and/or specifying the starting and ending times of salient sounds. Bounding boxes were drawn around every moving object, together with a flag indicating whether the object was fully visible or occluded, its category (human or vehicle), visual details (for example clothes types or colors), and the timestamps of its appearances and disappearances. Audio events were also annotated with a category and two timestamps.

Regarding bounding boxes, the coordinates of top-left and bottom-right corners of the bounding boxes are given. Bounding boxes were drawn such that the object is fully contained within the box and as tight as possible. For this purpose, our annotation tool allows the user to draw an initial approximate bounding box and then to adjust its boundaries at a pixel-level.

As drawing one bounding box for each object on every frame requires a huge amount of time, we have drawn bounding boxes on a subset of frames, so that the intermediate bounding boxes of an object can be linearly interpolated using its previous and next drawn bounding boxes. On average, we have drawn one bounding box per second for humans and two for vehicles due to their speed variation. For objects with irregular speed or trajectory, we have drawn more bounding boxes.
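The interpolation between two drawn bounding boxes can be sketched as below. This is a minimal illustration under assumptions: boxes are represented as (x_min, y_min, x_max, y_max) tuples for the top-left and bottom-right corners, timestamps are in seconds, and the function name is hypothetical rather than part of any released tool.

```python
from typing import Tuple

Box = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max)


def interpolate_box(t: float, t0: float, box0: Box, t1: float, box1: Box) -> Box:
    """Linearly interpolate a box at time t between two drawn boxes."""
    if t0 >= t1:
        raise ValueError("the previous box must be earlier than the next box")
    if not t0 <= t <= t1:
        raise ValueError("t must lie between the two drawn boxes")
    alpha = (t - t0) / (t1 - t0)  # 0 at the previous box, 1 at the next one
    return tuple(a + alpha * (b - a) for a, b in zip(box0, box1))


# Halfway between two boxes drawn one second apart.
middle = interpolate_box(1.5, 1.0, (100, 200, 150, 300), 2.0, (120, 200, 170, 320))
```

Linear interpolation is only accurate when the object moves at a roughly constant speed between the two drawn boxes, which is why more boxes are needed for objects with irregular speed or trajectory.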

Regarding the audio component of an audio-visual object, namely the salient sound events, an audio category (voice, motor sound) is given in addition to its ID, as well as a list of details and time bounds (see Listing 2).

Listing 2: JSON file structure of an audio event in a given video. As it is associated with id 11, it corresponds to the same audio-visual object as the one in Listing 1.

Finally, we linked the audio to the video objects by giving the same ID to the audio object in case of causal identification, which means that the acoustic source of the audio event is the object (a car or a person for instance) that was annotated. This step was particularly crucial and could not be automated, as considerable expertise is required to identify the sound sources. For example, in the video sequence illustrated in Figure 4, a motor sound is audible and seems to come from the car whereas it actually comes from a motorbike behind the camera.

Figure 4: At this time of the video sequence of camera 10, a motor sound is heard and seems to come from the car while it actually comes from a motorbike behind the camera.

In case of an object presenting different sound categories (a car with door slams, music and motor sound for example), one object is created for each category and the same ID is given.
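The role of the shared ID can be illustrated with a short sketch that groups components into audio-visual objects. The flat records and their field names ("id", "modality", "category") are assumptions made for this example only; they are not taken from the dataset's JSON files.

```python
from collections import defaultdict

# Hypothetical flat records; the field names are illustrative only and are
# not the dataset's actual JSON schema.
components = [
    {"id": 11, "modality": "visual", "category": "vehicle"},
    {"id": 11, "modality": "audio", "category": "motor sound"},
    {"id": 11, "modality": "audio", "category": "door slam"},
    {"id": 12, "modality": "visual", "category": "human"},
    {"id": 13, "modality": "audio", "category": "voice"},
]


def group_by_object(records):
    """Gather the visual and audio components that share the same ID."""
    objects = defaultdict(lambda: {"visual": [], "audio": []})
    for record in records:
        objects[record["id"]][record["modality"]].append(record["category"])
    return dict(objects)


def perceptibility(obj):
    """Tell whether an audio-visual object is seen, heard, or both."""
    if obj["visual"] and obj["audio"]:
        return "visible and audible"
    return "visible only" if obj["visual"] else "audible only"


objects = group_by_object(components)
summary = {object_id: perceptibility(obj) for object_id, obj in objects.items()}
```

In this toy example, object 11 carries one visual component and two audio components of different categories, mirroring the case of a car with both a motor sound and door slams.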

Ethical and Legal

According to European legislation, it is forbidden to make publicly available images of people who might be recognized or of license plates. As people and license plates are visible in our videos, to conform to the General Data Protection Regulation (GDPR) we decided to:

  • Ask actors to sign an authorization for publishing their image, and
  • Apply post-processing to the videos to blur the faces of other people and any license plates.


We have introduced a new dataset composed of two sets of 25 synchronized videos of the same scene with 17 overlapping views and 8 disjoint views. Videos are provided with their associated soundtracks. We have annotated the videos by manually drawing bounding boxes on moving objects. We have also manually annotated audio events. Our dataset offers simultaneously a large number of both overlapping and disjoint synchronized views and a realistic environment. It also provides audio tracks with sound events, high pixel resolution and ground truth annotations.

The originality and the richness of this dataset come from the wide diversity of topics it covers and the presence of scripted and non-scripted actions and events. Therefore, our dataset is well suited for numerous pattern recognition applications related to, but not restricted to, the domain of surveillance. We describe below some multidisciplinary applications that could be evaluated using this dataset:

3D and 4D reconstruction: The multiple cameras sharing overlapping fields of view, along with some provided photographs of the scene, allow a 3D reconstruction of the static parts of the scene and the retrieval of the intrinsic parameters and poses of the cameras using a Structure-from-Motion algorithm. Beyond a 3D reconstruction, the temporal synchronization of the videos could make it possible to render the dynamic parts of the scene as well and thus to obtain a 4D reconstruction.

Object recognition and consistent labeling: Evaluation of algorithms for human and vehicle detection and consistent labeling across multiple views can be performed using the annotated bounding boxes and IDs. To this end, overlapping views provide a 3D environment that could help to infer the label of an object in a video knowing its position and label in another video.
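As an illustration of how detections could be scored against the annotated bounding boxes, the sketch below computes the intersection over union (IoU) of two boxes. IoU is a common choice for this kind of evaluation but it is an assumption here, since the column does not prescribe a metric; boxes are assumed to be (x_min, y_min, x_max, y_max) tuples.

```python
def iou(box_a, box_b):
    """Intersection over union of two (x_min, y_min, x_max, y_max) boxes."""
    # Corners of the intersection rectangle (may be empty).
    ix_min = max(box_a[0], box_b[0])
    iy_min = max(box_a[1], box_b[1])
    ix_max = min(box_a[2], box_b[2])
    iy_max = min(box_a[3], box_b[3])

    inter_w = max(0.0, ix_max - ix_min)
    inter_h = max(0.0, iy_max - iy_min)
    inter = inter_w * inter_h

    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


# A detection shifted by half its width with respect to the ground truth.
score = iou((0, 0, 10, 10), (5, 0, 15, 10))
```

A detection is then typically counted as correct when its IoU with an annotated box exceeds a chosen threshold.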

Sound event recognition: The audio events recorded from different locations and manually annotated provide opportunities to evaluate the relevance of consistent acoustic models by, for example, launching the identification and indexing of a specific sound event. Looking for a particular sound by similarity is also feasible.

Metadata modeling and querying: The multiple layers of information in this dataset, both low-level (audio/video signal) and high-level (semantic data available in the ground truth files), enable the handling of information at different resolutions of space and time and make it possible to query heterogeneous information.
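A simple example of such a query is sketched below: which cameras show a given object at a given instant of the common timeline? The index structure and camera names are hypothetical; only the two visibility intervals of object 11 in the first camera are taken from the caption of Listing 1.

```python
# Hypothetical index: camera name -> object ID -> list of (start, end)
# visibility intervals in seconds on the common, synchronized timeline.
visibility = {
    "camera_a": {11: [(13.8, 18.2), (29.72, 32.28)]},
    "camera_b": {11: [(16.0, 25.0)], 12: [(0.0, 40.0)]},
}


def cameras_seeing(object_id, t, index):
    """Return the sorted names of the cameras in which the object is visible at t."""
    return sorted(
        camera
        for camera, objects in index.items()
        if any(start <= t <= end for start, end in objects.get(object_id, []))
    )


both_views = cameras_seeing(11, 17.0, visibility)
single_view = cameras_seeing(11, 30.0, visibility)
```

Because all videos share the same time origin and object IDs are unique within a scenario, this kind of cross-camera query reduces to interval lookups.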


[1] I. Lefter, L.J.M. Rothkrantz, G. Burghouts, Z. Yang, P. Wiggers. “Addressing multimodality in overt aggression detection”, in Proceedings of the International Conference on Text, Speech and Dialogue, 2011, pp. 25-32.
[2] D. Baltieri, R. Vezzani, R. Cucchiara. “3DPeS: 3D people dataset for surveillance and forensics”, in Proceedings of the 2011 joint ACM workshop on Human Gesture and Behavior Understanding, 2011, pp. 59-64.
[3] S. Oh, A. Hoogs, A. Perera, N. Cuntoor, C. Chen, J.T. Lee, S. Mukherjee, J.K. Aggarwal, H. Lee, L. Davis, E. Swears, X. Wang, Q. Ji, K. Reddy, M. Shah, C. Vondrick, H. Pirsiavash, D. Ramanan, J. Yuen, A. Torralba, B. Song, A. Fong, A. Roy-Chowdhury, M. Desai. “A large-scale benchmark dataset for event recognition in surveillance video”, in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2011, pp. 3153-3160.
[4] S. Singh, S.A. Velastin, H. Ragheb. “MuHAVi: A multicamera human action video dataset for the evaluation of action recognition methods”, in Proceedings of the 7th IEEE International Conference on Advanced Video and Signal Based Surveillance, 2010, pp. 48-55.
[5] C. Ionescu, D. Papava, V. Olaru, C. Sminchisescu. “Human3.6M: Large scale datasets and predictive methods for 3d human sensing in natural environments”, IEEE Transactions on Pattern Analysis and Machine Intelligence, 36(7), 2013, pp. 1325-1339.
[6] T. Malon, G. Roman-Jimenez, P. Guyot, S. Chambon, V. Charvillat, A. Crouzil, A. Péninou, J. Pinquier, F. Sèdes, C. Sénac. “Toulouse campus surveillance dataset: scenarios, soundtracks, synchronized videos with overlapping and disjoint views”, in Proceedings of the 9th ACM Multimedia Systems Conference, 2018, pp. 393-398.
[7] P. Guyot, T. Malon, G. Roman-Jimenez, S. Chambon, V. Charvillat, A. Crouzil, A. Péninou, J. Pinquier, F. Sèdes, C. Sénac. “Audiovisual annotation procedure for multi-view field recordings”, in Proceedings of the International Conference on Multimedia Modeling, 2019, pp. 399-410.

MPEG Column: 129th MPEG Meeting in Brussels, Belgium

The original blog post can be found at the Bitmovin Techblog and has been modified/updated here to focus on and highlight research aspects.

The 129th MPEG meeting concluded on January 17, 2020 in Brussels, Belgium with the following topics:

  • Coded representation of immersive media – WG11 promotes Network-Based Media Processing (NBMP) to the final stage
  • Coded representation of immersive media – Publication of the Technical Report on Architectures for Immersive Media
  • Genomic information representation – WG11 receives answers to the joint call for proposals on genomic annotations in conjunction with ISO TC 276/WG 5
  • Open font format – WG11 promotes Amendment of Open Font Format to the final stage
  • High efficiency coding and media delivery in heterogeneous environments – WG11 progresses Baseline Profile for MPEG-H 3D Audio
  • Multimedia content description interface – Conformance and Reference Software for Compact Descriptors for Video Analysis promoted to the final stage

Additional Important Activities at the 129th WG 11 (MPEG) meeting

The 129th WG 11 (MPEG) meeting was attended by more than 500 experts from 25 countries working on important activities including (i) a scene description for MPEG media, (ii) the integration of Video-based Point Cloud Compression (V-PCC) and Immersive Video (MIV), (iii) Video Coding for Machines (VCM), and (iv) a draft call for proposals for MPEG-I Audio among others.

The corresponding press release of the 129th MPEG meeting can be found here. This report focuses on network-based media processing (NBMP), architectures for immersive media, compact descriptors for video analysis (CDVA), and an update on adaptive streaming formats (i.e., DASH and CMAF).

MPEG picture at Friday plenary; © Rob Koenen (Tiledmedia).

Coded representation of immersive media – WG11 promotes Network-Based Media Processing (NBMP) to the final stage

At its 129th meeting, MPEG promoted ISO/IEC 23090-8, Network-Based Media Processing (NBMP), to Final Draft International Standard (FDIS). The FDIS stage is the final vote before a document is officially adopted as an International Standard (IS). During the FDIS vote, publications and national bodies are only allowed to place a Yes/No vote and are no longer able to make any technical changes. However, project editors are able to fix typos and make other necessary editorial improvements.

What is NBMP? The NBMP standard defines a framework that allows content and service providers to describe, deploy, and control media processing for their content in the cloud by using libraries of pre-built 3rd party functions. The framework includes an abstraction layer to be deployed on top of existing commercial cloud platforms and is designed to integrate with 5G core and edge computing. The NBMP workflow manager is another essential part of the framework: it enables the composition of multiple media processing tasks that process incoming media and metadata from a media source and produce processed media streams and metadata that are ready for distribution to media sinks.
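The idea of composing media processing tasks between a source and a sink can be pictured with the toy sketch below. It is purely conceptual: it does not use the NBMP APIs or description formats, which are not reproduced in this column, and every name in it (the task functions, `run_workflow`, the dictionary keys) is hypothetical.

```python
from typing import Callable, Dict, List

Media = Dict[str, object]        # toy stand-in for media plus its metadata
Task = Callable[[Media], Media]  # one media processing function


def transcode(media: Media) -> Media:
    """Toy task: pretend to re-encode the media with another codec."""
    return {**media, "codec": "hevc"}


def add_watermark(media: Media) -> Media:
    """Toy task: pretend to overlay a watermark."""
    return {**media, "watermarked": True}


def run_workflow(source: Media, tasks: List[Task]) -> Media:
    """Chain the tasks so that each one consumes the previous output."""
    result = source
    for task in tasks:
        result = task(result)
    return result


# Media from the source flows through the tasks and is handed to the sink.
output = run_workflow({"name": "clip", "codec": "avc"}, [transcode, add_watermark])
```

In an actual deployment the tasks would run as separate functions in the cloud or at the edge, and the workflow manager would take care of instantiating and connecting them.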

Why NBMP? With the increasing complexity and sophistication of media services and the incurred media processing, offloading complex media processing operations to the cloud/network is becoming critically important in order to keep receiver hardware simple and power consumption low.

Research aspects: NBMP reminds me a bit of what has been done in the past in MPEG-21, specifically Digital Item Adaptation (DIA) and Digital Item Processing (DIP). The main difference is that MPEG now targets APIs rather than pure metadata formats, which is a step in the right direction as APIs can be implemented and used right away. NBMP will be particularly interesting in the context of new networking approaches including, but not limited to, software-defined networking (SDN), information-centric networking (ICN), mobile edge computing (MEC), fog computing, and related aspects in the context of 5G.

Coded representation of immersive media – Publication of the Technical Report on Architectures for Immersive Media

At its 129th meeting, WG11 (MPEG) published an updated version of its technical report on architectures for immersive media. This technical report, which is the first part of the ISO/IEC 23090 (MPEG-I) suite of standards, introduces the different phases of MPEG-I standardization and gives an overview of the parts of the MPEG-I suite. It also documents use cases and defines architectural views on the compression and coded representation of elements of immersive experiences. Furthermore, it describes the coded representation of immersive media and the delivery of a full, individualized immersive media experience. MPEG-I enables scalable and efficient individual delivery as well as mass distribution while adjusting to the rendering capabilities of consumption devices. Finally, this technical report breaks down the elements that contribute to a fully immersive media experience and assigns quality requirements as well as quality and design objectives for those elements.

Research aspects: This technical report provides a kind of reference architecture for immersive media, which may help identify research areas and research questions to be addressed in this context.

Multimedia content description interface – Conformance and Reference Software for Compact Descriptors for Video Analysis promoted to the final stage

Managing and organizing the quickly increasing volume of video content is a challenge for many industry sectors, such as media and entertainment or surveillance. One example task is scalable instance search, i.e., finding content containing a specific object instance or location in a very large video database. This requires video descriptors that can be efficiently extracted, stored, and matched. Standardization enables extracting interoperable descriptors on different devices and using software from different providers so that only the compact descriptors instead of the much larger source videos can be exchanged for matching or querying. ISO/IEC 15938-15:2019, the MPEG Compact Descriptors for Video Analysis (CDVA) standard, defines such descriptors. CDVA includes highly efficient descriptor components using features resulting from a Deep Neural Network (DNN) and uses predictive coding over video segments. The standard is being adopted by the industry. At its 129th meeting, WG11 (MPEG) finalized the conformance guidelines and reference software. The software provides the functionality to extract, match, and index CDVA descriptors. For easy deployment, the reference software is also provided as Docker containers.

Research aspects: The availability of reference software helps to conduct reproducible research (i.e., reference software is typically publicly available for free) and the Docker container even further contributes to this aspect.


The 4th edition of DASH has already been published and is available as ISO/IEC 23009-1:2019. Similar to previous iterations, MPEG’s goal was to make the newest edition of DASH publicly available for free, with the goal of industry-wide adoption and adaptation. During the most recent MPEG meeting, we worked towards implementing the first amendment, which will include additional (i) CMAF support and (ii) event processing models with minor updates; these amendments are currently in draft and will be finalized at the 130th MPEG meeting in Alpbach, Austria. An overview of all DASH standards and updates is depicted in the figure below:

ISO/IEC 23009-8 or “session-based DASH operations” is the newest variation of MPEG-DASH. The goal of this part of DASH is to allow customization during certain times of a DASH session while maintaining the underlying media presentation description (MPD) for all other sessions. Thus, MPDs should be cacheable within content distribution networks (CDNs) while additional information should be customizable on a per session basis within a newly added session-based description (SBD). It is understood that the SBD should have an efficient representation to avoid file size issues and it should not duplicate information typically found in the MPD.
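The separation between a cacheable MPD and per-session information can be pictured with the sketch below, in which a segment URL taken from a shared MPD is extended with session-specific query parameters on the client side. This is only an illustration of the principle: the actual SBD syntax and processing model are defined in ISO/IEC 23009-8 and are not reproduced here, and the function name, URL and parameter are hypothetical.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def customize_url(segment_url: str, session_params: dict) -> str:
    """Append per-session key/value pairs to a URL taken from a shared MPD."""
    parts = urlsplit(segment_url)
    # Keep any query parameters already present, then add the session ones.
    query = parse_qsl(parts.query) + list(session_params.items())
    return urlunsplit(parts._replace(query=urlencode(query)))


# The MPD, and hence this URL, is identical for all sessions and cacheable.
shared_url = "https://cdn.example.com/video/seg-1.m4s"

# Only the small per-session description differs from one client to another.
session_url = customize_url(shared_url, {"sessionId": "abc123"})
```

Keeping the session data outside the MPD means the MPD itself never changes between clients, which is what allows CDNs to cache it.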

The 2nd edition of the CMAF standard (ISO/IEC 23000-19) will be available soon (currently under FDIS ballot) and MPEG is currently reviewing additional tools in the so-called ‘technologies under considerations’ document. Therefore, amendments were drafted for additional HEVC media profiles and exploration activities on the storage and archiving of CMAF contents.

The next meeting will bring MPEG back to Austria (for the 4th time) and will be hosted in Alpbach, Tyrol. For more information about the upcoming 130th MPEG meeting click here.


JPEG Column: 86th JPEG Meeting in Sydney, Australia

The 86th JPEG meeting was held in Sydney, Australia.

Among the different activities that took place, the JPEG Committee issued a Call for Evidence on learning-based image coding solutions. This call results from the success of the exploration studies recently carried out by the JPEG Committee, and honours the pioneering work of JPEG in issuing the first image coding standard more than 25 years ago.

In addition, a First Call for Evidence on Point Cloud Coding was issued in the framework of JPEG Pleno. Furthermore, an updated version of the JPEG Pleno reference software and a JPEG XL open source implementation have been released, while JPEG XS continues the development of raw-Bayer image sensor compression.

JPEG Plenary at the 86th meeting.

The 86th JPEG meeting had the following highlights:

  • JPEG AI issues a call for evidence on machine learning based image coding solutions
  • JPEG Pleno issues call for evidence on Point Cloud coding
  • JPEG XL verification tests reveal competitive performance with commonly used image coding solutions
  • JPEG Systems submitted final texts for Privacy & Security
  • JPEG XS announces new coding tools optimised for compression of raw-Bayer image sensor data


The JPEG Committee launched a learning-based image coding activity, also referred to as JPEG AI, more than a year ago. This activity aims to find evidence for image coding technologies that offer substantially better compression efficiency than conventional approaches while relying on models that exploit a large image database.

A Call for Evidence (CfE) has been issued as an outcome of the 86th JPEG meeting in Sydney, Australia, as a first formal step to consider standardisation of such approaches in image compression. The CfE is organised in coordination with the IEEE MMSP 2020 Grand Challenge on Learning-based Image Coding and will use the same content, evaluation methodologies and deadlines.

JPEG Pleno

JPEG Pleno is working toward the integration of various modalities of plenoptic content under a single framework and in a seamless manner. Efficient and powerful point cloud representation is a key feature within this vision. Point cloud data supports a wide range of applications including computer-aided manufacturing, entertainment, cultural heritage preservation, scientific research and advanced sensing and analysis. During the 86th JPEG Meeting, the JPEG Committee released a First Call for Evidence on JPEG Pleno Point Cloud Coding to be integrated in the JPEG Pleno framework. This Call for Evidence focuses specifically on point cloud coding solutions that support scalability and random access of decoded point clouds.

Furthermore, a Reference Software implementation of the JPEG Pleno file format (Part 1) and light field coding technology (Part 2) has been made publicly available as open source on the JPEG Gitlab repository. The JPEG Pleno Reference Software is planned to become an International Standard as Part 4 of JPEG Pleno by the end of 2020.


The JPEG XL Image Coding System (ISO/IEC 18181) has produced an open source reference implementation available on the JPEG Gitlab repository. The software is available under Apache 2, which includes a royalty-free patent grant. Speed tests indicate that the multithreaded encoder and decoder outperform libjpeg-turbo.

Independent subjective and objective evaluation experiments have indicated competitive performance with commonly used image coding solutions while offering new functionalities such as lossless transcoding from legacy JPEG format to JPEG XL. The standardisation process has reached the Draft International Standard stage.

JPEG exploration into Media Blockchain

Fake news, copyright violations, media forensics, privacy and security are emerging challenges in digital media. JPEG has determined that blockchain and distributed ledger technologies (DLT) have great potential as a technology component to address these challenges in transparent and trustable media transactions. However, blockchain and DLT need to be integrated efficiently with a widely adopted standard to ensure broad interoperability of protected images. Therefore, the JPEG committee has organised several workshops to engage with the industry and help to identify use cases and requirements that will drive the standardisation process.

During its Sydney meeting, the committee organised an Open Discussion Session on Media Blockchain and invited local stakeholders to take part in an interactive discussion. The discussion focused on media blockchain and related application areas including media and document provenance, smart contracts, governance, legal understanding and privacy. The presentations of this session are available on the JPEG website. To keep informed and to get involved in this activity, interested parties are invited to register on the ad hoc group’s mailing list.

JPEG Systems

JPEG Systems & Integration submitted final texts for ISO/IEC 19566-4 (Privacy & Security), ISO/IEC 24800-2 (JPSearch), and ISO/IEC 15444-16 2nd edition (JPEG 2000-in-HEIF) for publication.  Amendments to add new capabilities for JUMBF and JPEG 360 reached Committee Draft stage and will be reviewed and balloted by national bodies.

The JPEG Privacy & Security release is timely as consumers are increasingly aware of and concerned about the need to protect privacy in imaging applications.  JPEG 2000-in-HEIF enables embedding JPEG 2000 images in the HEIF file format.  The updated JUMBF provides a more generic means to embed images and other media within JPEG files to enable richer image experiences.  The updated JPEG 360 adds stereoscopic 360 images and a method to accelerate the rendering of a region-of-interest within an image in order to reduce the latency experienced by users.  For JLINK, which elaborates the relationships of the embedded media within the file, JPEG Systems & Integration created updated use cases to refine the requirements and continued technical discussions on implementation.


The JPEG committee is pleased to announce the specification of new coding tools optimised for compression of raw-Bayer image sensor data. The JPEG XS project aims at the standardisation of a visually lossless, low-latency and lightweight compression scheme that can be used as a mezzanine codec in various markets. Video transport over professional video links, real-time video storage in and outside of cameras, and data compression onboard autonomous cars are among the targeted use cases for raw-Bayer image sensor compression. An amendment of the Core Coding System and new profiles targeting raw-Bayer image applications are in progress and are expected to be published by the end of 2020.

Final Quote

“The efforts to find new and improved solutions in image compression have led JPEG to explore new opportunities relying on machine learning for coding. After rigorous analysis in form of explorations during the last 12 months, JPEG believes that it is time to formally initiate a standardisation process, and consequently, has issued a call for evidence for image compression based on machine learning.” said Prof. Touradj Ebrahimi, the Convenor of the JPEG Committee.

86th JPEG meeting social event in Sydney, Australia.

About JPEG

The Joint Photographic Experts Group (JPEG) is a Working Group of ISO/IEC, the International Organisation for Standardization / International Electrotechnical Commission, (ISO/IEC JTC 1/SC 29/WG 1) and of the International Telecommunication Union (ITU-T SG16), responsible for the popular JPEG, JPEG 2000, JPEG XR, JPSearch, JPEG XT and more recently, the JPEG XS, JPEG Systems, JPEG Pleno and JPEG XL families of imaging standards.

More information about JPEG and its work is available on the JPEG website or by contacting Antonio Pinheiro or Frederik Temmermans of the JPEG Communication Subgroup. If you would like to stay posted on JPEG activities, please subscribe to the jpeg-news mailing list.

Future JPEG meetings are planned as follows:

  • No 87, Erlangen, Germany, April 25 to 30, 2020 (Cancelled because of Covid-19 outbreak; Replaced by online meetings.)
  • No 88, Geneva, Switzerland, July 4 to 10, 2020

Collaborative QoE Management using SDN

The Software-Defined Networking (SDN) paradigm offers flexibility and programmability in the deployment and management of network services by separating the Control plane from the Data plane. Being based on network abstractions and virtualization techniques, SDN simplifies the implementation of traffic engineering techniques as well as the communication among different service providers, including Internet Service Providers (ISPs) and Over The Top (OTT) providers. For these reasons, SDN architectures have been widely used in recent years for the QoE-aware management of multimedia services.

The paper [1] presents Timber, an open source SDN-based emulation platform that provides the research community with a tool for experimenting with new QoE management approaches and algorithms, which may also rely on information exchange between ISP and OTT [2].  We believe that the exchange of information between the OTT and the ISP is extremely important because:

  1. QoE models depend on different influence factors, i.e., network, application, system and context factors [3];
  2. ISP and OTT hold different information, i.e., network state and application Key Quality Indicators (KQIs), respectively;
  3. End-to-end encryption of the OTT services makes it difficult for ISP to have access to application KQIs to perform QoE-aware network management.

In the following we briefly describe Timber and the impact of collaborative QoE management.

Timber architecture

Figure 1 represents the reference architecture, which is composed of four planes. The Service Management Plane is a cloud space owned by the OTT provider, which includes: a QoE Monitoring module to estimate the user’s QoE on the basis of service parameters acquired at the client side; a DB where QoE measurements are stored and can be shared with third parties; a Content Distribution service to deliver multimedia contents. Through the RESTful APIs, the OTTs give access to part of the information stored in the DB to the ISP, on the basis of appropriate agreements.

The Network Data Plane, the Network Control Plane, and the Network Management Plane are those in the hands of the ISP. The Network Data Plane includes all the SDN-enabled data forwarding network devices; the Network Control Plane consists of the SDN controller, which manages the network devices through Southbound APIs; and the Network Management Plane is the application layer of the SDN architecture, controlled by the ISP to perform network-wide control operations, which communicates with the OTT via RESTful APIs. The SDN application includes a QoS Monitoring module to monitor the performance of the network, a Management Policy module to take into account Service Level Agreements (SLAs), and a Control Actions module that decides on the network control actions to be implemented by the SDN controller to optimize the network resources and improve the service’s quality.

Timber implements this architecture on top of the Mininet SDN emulator and the Ryu SDN controller, which provides the major functionalities of the traffic engineering abstractions. According to the depicted scenario, the OTT has the potential to monitor the level of QoE for the provided services as it has access to the needed application and network level KQIs. On the other hand, the ISP has the potential to control the network level quality by changing the allocated resources. This scenario is implemented in Timber, which allows for setting the emulated network and application configuration needed to test QoE-aware service management algorithms.

Specifically, the OTT performs QoE monitoring of the delivered service by acquiring service information from the client side, based on passive measurements of service-related KQIs obtained through probes installed in the users’ devices. Based on these measurements, specific QoE models can be used to predict the user experience. The QoE measurements of active clients’ sessions are also stored in the OTT DB, which can be accessed by the ISP through the aforementioned RESTful APIs. The ISP’s SDN application periodically checks the OTT-reported QoE and, when QoE degradation is observed, implements network-wide policies by communicating with the SDN controller through the Northbound APIs. Accordingly, the SDN controller performs network management operations, such as link aggregation, addition of new flows, and network slicing, by controlling the network devices through Southbound APIs.
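The monitoring loop described above can be illustrated with a minimal Python sketch. It is not Timber's code: the class name `QoEPoller`, the MOS threshold of 3.5, and the two callables are hypothetical stand-ins. In a real deployment, `fetch_mos` would presumably wrap a request to the OTT's RESTful API, `apply_slicing` would call the controller's Northbound API, and `poll_once` would be invoked once per sampling interval.

```python
from dataclasses import dataclass
from typing import Callable

MOS_THRESHOLD = 3.5  # hypothetical trigger level, not taken from the paper


@dataclass
class QoEPoller:
    """ISP-side check of the OTT-reported QoE (illustrative only)."""
    fetch_mos: Callable[[], float]      # stand-in for the OTT RESTful API
    apply_slicing: Callable[[], None]   # stand-in for the Northbound API call
    threshold: float = MOS_THRESHOLD
    slicing_active: bool = False

    def poll_once(self) -> bool:
        """Read the latest MOS; trigger slicing once if it is degraded."""
        mos = self.fetch_mos()
        if mos < self.threshold and not self.slicing_active:
            self.apply_slicing()
            self.slicing_active = True
            return True
        return False


# Simulated sequence of OTT reports, one per sampling interval.
reports = iter([4.3, 4.1, 3.2, 3.0])
actions = []
poller = QoEPoller(fetch_mos=lambda: next(reports),
                   apply_slicing=lambda: actions.append("slice"))
triggered = [poller.poll_once() for _ in range(4)]
print(triggered, actions)
```

Keeping the two callables injectable is a deliberate choice in this sketch: the decision logic can be exercised without a running controller or database.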

QoE management based on information exchange: video service use-case

The previously described scenario, which is implemented by Timber, depicts a collaboration between the ISP and the OTT, where the latter provides QoE-related data and the former takes care of controlling the resources allocated to the deployed services. Ahmad et al. [4] make use of Timber to conduct experiments aimed at investigating the impact of the frequency of information exchange between an OTT providing a video streaming service and the ISP on the end-user QoE.

Figure 2 shows the experiment topology. Mininet in Timber is used to create the network topology, which in this case involves streaming video sequences from the media server to User1 (U1) while web traffic is also transmitted on the same network towards User2 (U2). U1 and U2 are two virtual hosts that share the same access network and act as clients. U1 runs the client-side video player, and the Apache server provides both web and HAS (HTTP Adaptive Streaming) video services.

In the considered collaboration scenario, QoE-related KQIs are extracted at the client side and sent to the MongoDB database (managed by the OTT), as depicted by the red dashed arrows. This information is then retrieved by the SDN controller of the ISP at frequency f (see green dashed arrow). The aim is to provide different network-level resources to video streaming and normal web traffic when QoE degradation is observed for the video service. These control actions on the network are needed because TCP-based web traffic sessions of 4 Mbps start randomly towards U2 during the HD video streaming sessions, causing time-varying network bottlenecks on the S1−S2 link. In these cases, the SDN controller implements virtual network slicing at the S1 and S2 OVS switches, which provides a minimum guaranteed throughput of 2.5 Mbps and 1 Mbps to video streaming and web traffic, respectively. The SDN controller application uses flow matching criteria to assign flows to the virtual slices. The objective of these emulations is to show the impact of f on the resulting QoE.
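As a rough illustration of assigning flows to slices, the sketch below maps a flow to one of the two slices by destination address. The matching criterion and the IP addresses assigned to U1 and U2 are assumptions made for this example; the text only states that flow matching criteria are used, not which fields are matched. The guaranteed rates are the ones quoted above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Slice:
    name: str
    min_rate_mbps: float  # minimum guaranteed throughput


VIDEO_SLICE = Slice("video", 2.5)  # rate quoted in the text
WEB_SLICE = Slice("web", 1.0)      # rate quoted in the text

# Hypothetical match rule: destination IP of the receiving host.
# The addresses for U1 and U2 are illustrative, not from the paper.
SLICE_BY_DST_IP = {
    "10.0.0.1": VIDEO_SLICE,  # U1, video client
    "10.0.0.2": WEB_SLICE,    # U2, web client
}


def classify(dst_ip: str) -> Optional[Slice]:
    """Return the slice for a flow, or None if no rule matches."""
    return SLICE_BY_DST_IP.get(dst_ip)


total_guaranteed = sum(s.min_rate_mbps for s in SLICE_BY_DST_IP.values())
print(classify("10.0.0.1"), classify("10.0.0.9"), total_guaranteed)
```

Flows that match no rule return `None` here; how unmatched traffic is treated in the actual experiments is not specified in the text.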

The 60-second Big Buck Bunny video sequence at 1280 × 720 was streamed from the server to U1 using 5 different sampling intervals T for information exchange between OTT and ISP, i.e., 2s, 4s, 8s, 16s, and 32s. The information exchanged in this case was the average stalling duration and the number of stalling events measured by the probe at the client video player. Accordingly, the QoE for the video streaming service was measured in terms of predicted MOS using the QoE model defined in [5] for HTTP video streaming, as follows:
MOSp = α exp( -β(L)N ) + γ
where L and N are the average stalling duration and the number of stalling events, respectively, whereas α=3.5, γ=1.5, and β(L)=0.15L+0.19.

Figure 3.a shows the average predicted MOS when information is exchanged at different sampling intervals (the inverse of f). The greatest MOSp is 4.34, obtained for T=2s and T=4s. An exponential decay in MOSp is observed as the frequency of information exchange decreases. The lowest MOSp is 3.07, obtained for T=32s. This result shows that a higher frequency of information exchange leads to lower latency in the controller response to QoE degradation. The reason is that, for higher T, the buffer at the client player keeps starving for longer, resulting in longer stalling durations until the SDN controller is triggered to provide the guaranteed network resources that support the video streaming service.

Figure 3.b Initial loading time, average stalling duration and latency in controller response to quality degradation for different sampling intervals.

Figure 3.b shows the video initial loading time, the average stalling duration and the latency in controller response to quality degradation for the different sampling intervals. The latency in controller response to QoE degradation increases linearly as the frequency of information exchange decreases, while the stalling duration grows exponentially as the frequency decreases. The initial loading time does not appear to be significantly affected by the sampling interval.
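One simple way to see why the response latency scales linearly with the sampling interval is a toy polling model. This is an idealisation, not the measurement procedure used in the experiments: it assumes the ISP polls at fixed instants 0, T, 2T, and so on, that a degradation is equally likely to start at any millisecond, and that processing and network delays are negligible.

```python
def detection_delay_ms(event_ms: int, interval_ms: int) -> int:
    """Time from the start of a degradation to the next poll.

    Polls are assumed to happen at 0, T, 2T, ... (T = interval_ms).
    """
    return (-event_ms) % interval_ms


def mean_detection_delay_ms(interval_ms: int, horizon_ms: int = 64_000) -> float:
    """Average delay over every millisecond start time in [0, horizon_ms)."""
    total = sum(detection_delay_ms(t, interval_ms) for t in range(horizon_ms))
    return total / horizon_ms


for interval_s in (2, 4, 8, 16, 32):
    mean_s = mean_detection_delay_ms(interval_s * 1000) / 1000
    print(f"T = {interval_s:2d} s -> mean detection delay of about {mean_s:.2f} s")
```

Under these assumptions the mean delay before the controller even sees a degradation is roughly T/2, and the worst case approaches T, so doubling the sampling interval doubles the expected detection delay.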


Experiments were conducted in an SDN emulation environment to investigate the impact of the frequency of information exchange between OTT and ISP when a collaborative network management approach is considered. The QoE for a video streaming service was measured by considering 5 different sampling intervals for information exchange between OTT and ISP, i.e., 2s, 4s, 8s, 16s, and 32s. The information exchanged was the average video stalling duration and the number of stalling events.

The experimental results showed that a higher frequency of information exchange results in greater delivered QoE, but a sampling interval lower than 4s (frequency > ¼ Hz) may not further improve the delivered QoE. Clearly, this threshold depends on the variability of the network conditions. Further studies are needed to understand how frequently the ISP and OTT should share data to obtain observable QoE benefits under varying network conditions and deployed services.


[1] A. Ahmad, A. Floris and L. Atzori, “Timber: An SDN based emulation platform for QoE Management Experimental Research,” 2018 Tenth International Conference on Quality of Multimedia Experience (QoMEX), Cagliari, 2018, pp. 1-6.


[3] P. Le Callet, S. Möller, A. Perkis et al., “Qualinet White Paper on Definitions of Quality of Experience (2012),” in European Network on Quality of Experience in Multimedia Systems and Services (COST Action IC 1003), Lausanne, Switzerland, Version 1.2, March 2013.

[4] A. Ahmad, A. Floris and L. Atzori, “Towards Information-centric Collaborative QoE Management using SDN,” 2019 IEEE Wireless Communications and Networking Conference (WCNC), Marrakesh, Morocco, 2019, pp. 1-6.

[5] T. Hoßfeld, C. Moldovan, and C. Schwartz, “To each according to his needs: Dimensioning video buffer for specific user profiles and behavior,” in IFIP/IEEE Int. Symposium on Integrated Network Management (IM), 2015. IEEE, 2015, pp. 1249–1254.

SIGMM Test of Time Paper Award, SIGMM Funding for Special Initiatives 2020 and SIGMM-sponsored Conference Fee Reduction

In this note I provide an update on some recent SIGMM funding initiatives we are putting in place in 2020. These come about based on feedback from you, our members, on what you believe to be important and what you would like your SIGMM Executive Committee to work on. The specific topics covered here are the new SIGMM Test of Time Paper Award, the projects to be funded as a result of our call for funding applications for special initiatives (some of which provide further support for student travel), and the SIGMM sponsorship of conference fee reduction.

The SIGMM Test of Time Paper Award

A new award has just been formally approved by the ACM Awards Committee called the SIGMM Test of Time Paper Award with details available here.  To have an award formally approved by ACM the proposal has to be approved by a SIG Executive Committee, then approved by ACM headquarters, then approved by the ACM SIG Governing Board and then approved by the ACM Awards Committee. This ensures that ACM-approved awards are highly prestigious and rigorous in the way they select their winners.

SIGMM has been operational for 26 years and in that time has sponsored or co-sponsored more than 100 conferences and workshops, which have collectively published more than 15,574 individual papers.  5,742 of those papers were published 10 or more years ago and the SIGMM Executive believes it is time to recognise the most significant and impactful from among those 5,742 papers.

The new award will be presented every year, starting this year, to the authors of a paper published either 10, 11 or 12 years previously at a SIGMM sponsored or co-sponsored conference. Thus the 2020 award will be for papers presented at a 2008, 2009 or 2010 SIGMM conference or workshop and will recognise the paper that has had the most impact and influence on the field of Multimedia in terms of research, development, products or ideas.  The paper may include theoretical advances, techniques and/or software tools that have been widely used, and/or innovative applications that have had impact on multimedia computing.

The award-winning paper will be selected by a 5-person selection committee consisting of 2 members of the organising committee for the MULTIMEDIA Conference in that year plus 3 established and respected members of our community who have no conflict of interest with the nominated papers.  The nominated papers are those top-ranked based on citation count from the ACM Digital Library, though the selection committee can add others if they wish.

Faced with the issue of recognising papers published prior to the 10, 11 or 12 year window of consideration, in this inaugural year when we announce the inaugural winner from 2008/2009/2010 we will also announce a set of up to 14 papers published at SIGMM conferences prior to 2008 as “honourable mentions” which could have been considered as strong candidates in their year of publication, if there had been an award for that year.  The first SIGMM MULTIMEDIA Conference was held in 1993 but was not sponsored by SIGMM as SIGMM was formed only in 1994, and so these up to 14 honourable mentions will cover the years 1994 to 2007 inclusive.

Selecting these papers from among all these candidates will be a challenging task for the selection committee and we wish them well in their deliberations and look forward to the award announcements at the MULTIMEDIA Conference in Seattle later this year.

SIGMM Funding for Special Initiatives 2020

For the last three years in a row, the SIGMM Executive Committee has issued an invitation for applications for funding for new initiatives, which are submitted by SIGMM members. The assessment criteria for these initiatives were that they focus on one or more of the following:

– building on SIGMM’s excellence and strengths;

– nurturing new talent in the SIGMM community;

– addressing weakness(es) in the SIGMM community and in SIGMM activities

In late 2019 we issued our third call for funding and we received our strongest yet response from the SIGMM community. Submissions were evaluated and assessed by the SIGMM Executive and discussed at an Executive Committee meeting and in this short note I outline the funding awards which were made.

Before looking at the awards, it is worth reminding the reader that, starting this year, SIGMM is centralising its support for student travel to our SIGMM-supported events, namely ICMR (in Dublin), MMSys (in Istanbul), IMX (in Barcelona), IH&MMSec (in Denver), MULTIMEDIA (in Seattle) and MM Asia (in Singapore).  As part of this scheme, any student member of SIGMM is eligible to apply; however, students who are the first author of an accepted paper are particularly encouraged to do so. The value of the award will depend on the travel distance, with up to US$2000 for long-haul travel and up to US$1000 for short-haul travel, which are defined based on the location of the conference.  Details of this scheme and the link for submitting applications have already started to appear on the websites of some of these conferences.

With the SIGMM scheme supporting travel for student authors as a priority, some of these conferences applied for, and have been approved for, further funding to support other conference attendees. The IMX Conference in Barcelona in June 2020 was awarded travel support for under-represented minorities, while the MMSys conference in Istanbul in June 2020 was awarded travel support for non-student minorities. In both cases the conferences themselves will administer the selection and awarding of the funding. Student travel support was also awarded to the African Winter School in Multimedia, in Stellenbosch, South Africa in July 2020, an event which SIGMM also sponsors.

A number of other events which are not sponsored by SIGMM but which are closely related to our area also applied for funding to support student travel and the following have also been awarded funding for supporting student travel:

– the Adaptive Streaming Summer School, in Klagenfurt, Austria, July;

– the Content-Based Multimedia Indexing (CBMI) Conference, in Lille, France, September;

– the International Conference on Quality of Multimedia Experience (QoMEX), in Athlone, Ireland, May, for female and under-represented minority students;

– the MediaEval Benchmarking Initiative for Multimedia Evaluation, workshop, late 2020.

All this funding, both the centralised and the special awards above, will help many students to travel to events in multimedia during 2020. In addition to travel support, SIGMM will fund a number of events at some of our conferences.  These include a women and diversity lunch at CBMI in Lille, a diversity lunch and childcare support at the Information Hiding and Multimedia Security Workshop (IH&MMSec) in Denver, childcare support and a diversity and inclusion panel discussion at IMX, a multimedia evaluation methodology workshop at the MediaEval workshop, and childcare support and an N2Women meeting at MMSys.

We are also delighted to announce that SIGMM will support some other activities besides travel and events. One of these is the cost of software development and presentation for Conflow at ACM Multimedia in Seattle. Conflow, like its predecessor ConfLab, is a unique initiative from Hayley Hung and colleagues at TU Delft which encourages people with similar or complementary research interests to find each other at conferences and ultimately helps them to connect with potential collaborators. It does this by instrumenting a physical space at an event with environmental sensors and distributing wearable sensors to participants who sign up and agree to have data about their interactions with others captured, anonymised and used as a dataset for analysis. A pilot version called ConfLab ran at ACM Multimedia in Nice in 2019 with several dozen participants and was built around the notion of meeting the conference Chairs; this will be extended in 2020.

The final element of the SIGMM funding awarded recently was to the ICMR conference in Dublin in June, which will be the testbed for calculating a conference’s carbon footprint. ACM already has some initiatives in this area based on estimating the CO2e cost of air travel of conference attendees to/from the venue, and there are software tools to help with this. The SIGMM funding will include this plus estimating the CO2e costs of local transport, food, accommodation, and more, and it will also raise awareness of individual carbon footprints among delegates. This will be done for ICMR in a way that allows the calculation process to be made available for other events.

SIGMM Sponsorship of Conference Fee Reduction

The third initiative which SIGMM is starting to sponsor in 2020 is a reduction in the registration fees for SIGMM-sponsored conferences, namely ICMR, MMSys, IMX, IH&MMSec, MULTIMEDIA and MM Asia. This has been a particular bugbear for many of us, so it is good to be able to do something about it.

Starting in 2020, SIGMM will sponsor US$100 toward conference registration fees for SIGMM members only, for the early-bird conference registrations. This will apply to students and non-students, and to ACM members and non-members. It means the conference registration choice may look a bit complicated, but basically if you are an ACM member you get a certain reduction, if you are a SIGMM member you also get a reduction (from SIGMM), and if you are a student then you also get a reduction.  The reduction in the conference fee for being a SIGMM member ($100) is far more than the cost of joining SIGMM (which is $20, or $15 for a student), so it makes sense to join SIGMM and get the conference fee reduction. Your SIGMM membership is also important to us.

The SIGMM Executive Committee believe this fee sponsorship is an appropriate way of giving back to the SIGMM community. We have not yet made a decision on sponsoring conference fee reductions beyond 2020; we will see how it works out in 2020 before deciding.

I’d also like to add one final note about attending our conferences and workshops.  We have a commitment to addressing diversity in our 25 in 25 strategy and we also have an “access all areas” policy for our conferences. This means that a single registration fee allows access to all events and activities at our conferences … lunches, refreshments, dinners, etc., all bundled into one fee.  We also support those with special needs such as accessibility or dietary requirements; when these are brought to our attention, typically when an attendee registers, we can put in place whatever support mechanisms are needed to maximise that attendee’s conference experience.  Our events strive to be harassment-free and pleasant conference experiences for all participants. We do not tolerate harassment of conference attendees, and that means all our attendees, speakers and organizers are bound by ACM’s Policy Against Harassment. Participants are asked to confirm their commitment to upholding the policy when registering.

Finally, thank you for your support of SIGMM and our events. If there is one thing you can do to help us to help you, it is joining SIGMM, not just for the reduced conference registration fee but to show your support for what we do.  With a fixed rate of $20 or $15 for a student you’ll find details on the SIGMM Membership tab at