VQEG Column: VQEG Meeting December 2022


This column provides an overview of the last Video Quality Experts Group (VQEG) plenary meeting, which took place from 12 to 16 December 2022. Around 100 participants from 21 different countries registered for the meeting, which was organized online by Brightcove (United Kingdom). During the five days, there were more than 40 presentations and discussions among researchers working on topics related to the projects ongoing within VQEG. All the related information, minutes, and files from the meeting are available online on the VQEG meeting website, and video recordings of the meeting are available on YouTube.

Many of the works presented in this meeting can be relevant for the SIGMM community working on quality assessment. Particularly interesting can be the proposals to update and merge ITU-T recommendations P.913, P.911, and P.910, the kick-off of the test plan to evaluate the QoE of immersive interactive communication systems, and the creation of a new group on emerging technologies that will start working on AI-based technologies and greening of streaming and related trends.

We encourage readers interested in any of the activities going on in the working groups to check their websites and subscribe to the corresponding reflectors, to follow them and get involved.

Group picture of the VQEG Meeting 12-16 December 2022 (online).

Overview of VQEG Projects

Audiovisual HD (AVHD)

The AVHD group investigates improved subjective and objective methods for analysing commonly available video systems. Currently, there are two projects ongoing under this group: Quality of Experience (QoE) Metrics for Live Video Streaming Applications (Live QoE) and Advanced Subjective Methods (AVHD-SUB).

In this meeting, there were three presentations related to topics covered by this group. In the first one, Maria Martini (Kingston University, UK) presented her work on converting video quality assessment metrics. In particular, the work addressed the relationship between SSIM and PSNR for DCT-based compressed images and video, exploiting the content-related factor [1]. The second presentation was given by Urvashi Pal (Akamai, Australia) and dealt with video codec profiling with video quality assessment complexities and resolutions. Finally, Jingwen Zhu (Nantes Université, France) presented her work on the benefit of parameter-driven approaches for the modelling and the prediction of a Satisfied User Ratio for compressed videos [2].
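
As a reminder of one of the two metrics involved, the following is a minimal sketch of the standard PSNR definition for 8-bit images. It does not reproduce the SSIM-to-PSNR conversion of [1]; the function name `psnr` and the default peak value of 255 are assumptions made for illustration.

```python
import numpy as np


def psnr(reference, distorted, peak=255.0):
    """Peak signal-to-noise ratio in dB between two equally sized images.

    `peak` is the maximum possible sample value (255 for 8-bit content,
    assumed here for illustration).
    """
    ref = np.asarray(reference, dtype=np.float64)
    dist = np.asarray(distorted, dtype=np.float64)
    if ref.shape != dist.shape:
        raise ValueError("images must have the same shape")
    mse = np.mean((ref - dist) ** 2)  # mean squared error over all samples
    if mse == 0:
        return float("inf")  # identical images
    return 10.0 * np.log10(peak ** 2 / mse)
```

PSNR depends only on the mean squared error, whereas SSIM also reacts to local image structure, which is why a content-related factor is needed to relate the two.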

Quality Assessment for Health applications (QAH)

The QAH group works on the quality assessment of health applications, considering both subjective evaluation and the development of datasets, objective metrics, and task-based approaches. Currently there is an open discussion on new topics to address within the group, such as the application of visual attention models and studies to health applications. Also, an opportunity to conduct medical perception research was announced, which was proposed by Elizabeth Krupinski and will take place at the European Congress of Radiology (Vienna, Austria, Mar. 2023).

In addition, four research works were presented at the meeting. Firstly, Julie Fournier (INSA Rennes, France) presented new insights on affinity therapy for people with ASD, based on an eye-tracking study on images. The second presentation was delivered by Lumi Xia (INSA Rennes, France) and dealt with the evaluation of the usability of deep learning-based denoising models for low-dose CT simulation. Also, Mohamed Amine Kerkouri (University of Orleans, France) presented his work on deep-based quality assessment of medical images through domain adaptation. Finally, Jorge Caviedes (ASU, USA) delivered a talk on cognition-inspired diagnostic image quality models, emphasising the need to distinguish among interpretability (e.g., the medical professional is confident in making a diagnosis), adequacy (e.g., the capture technique shows the right area for assessment), and visual quality (e.g., MOS) in the quality assessment of medical content.

Statistical Analysis Methods (SAM)

The SAM group works on improving analysis methods both for the results of subjective experiments and for objective quality models and metrics. The group is currently working on updating and merging the ITU-T recommendations P.913, P.911, and P.910. The suggestion is to make P.910 and P.911 obsolete and make P.913 the only recommendation from ITU-T on subjective video quality assessment. The group worked on the liaison and the document to be sent to ITU-T SG12, which will be available in the meeting files.

In addition, Mohsen Jenadeleh (University of Konstanz, Germany) presented his work on collective just noticeable difference assessment for compressed video with Flicker Test and QUEST+.

Computer Generated Imagery (CGI)

The CGI group is devoted to analysing and evaluating computer-generated content, with a focus on gaming in particular. The group is currently working in collaboration with ITU-T SG12 on the work item P.BBQCG on Parametric bitstream-based Quality Assessment of Cloud Gaming Services. In this sense, Saman Zadtootaghaj (Sony Interactive Entertainment, Germany) provided an update on the ongoing activities. In addition, they are working on two new work items: G.OMMOG on Opinion Model for Mobile Online Gaming applications and P.CROWDG on Subjective Evaluation of Gaming Quality with a Crowdsourcing Approach. Also, the group is working on identifying other topics and interests in CGI beyond gaming content.

No Reference Metrics (NORM)

The NORM group is an open collaborative project for developing no-reference metrics for monitoring visual service quality. Currently, the group is working on three topics: the development of no-reference metrics, the clarification of the computation of the Spatial and Temporal Indexes (SI and TI, defined in the ITU-T Recommendation P.910), and the development of a standard for video quality metadata. 
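
To make the second topic concrete, the following is a minimal sketch of how SI and TI are commonly computed from a sequence of luminance frames: SI as the maximum over time of the spatial standard deviation of the Sobel-filtered frame, and TI as the maximum over time of the spatial standard deviation of the frame difference. Details such as border handling, the value range of the input, and the use of the population standard deviation are assumptions here; these are the kind of ambiguities the clarification effort addresses. The function names are hypothetical.

```python
import numpy as np


def sobel_magnitude(frame):
    """Gradient magnitude from 3x3 Sobel kernels on the valid region (no padding)."""
    f = np.asarray(frame, dtype=np.float64)
    # Horizontal gradient: right column minus left column, centre row weighted by 2.
    gx = (f[:-2, 2:] + 2 * f[1:-1, 2:] + f[2:, 2:]
          - f[:-2, :-2] - 2 * f[1:-1, :-2] - f[2:, :-2])
    # Vertical gradient: bottom row minus top row, centre column weighted by 2.
    gy = (f[2:, :-2] + 2 * f[2:, 1:-1] + f[2:, 2:]
          - f[:-2, :-2] - 2 * f[:-2, 1:-1] - f[:-2, 2:])
    return np.hypot(gx, gy)


def si_ti(frames):
    """Return (SI, TI) for a list of 2-D luminance frames.

    Assumption: population standard deviation (ddof=0) and maximum over time.
    """
    frames = [np.asarray(f, dtype=np.float64) for f in frames]
    si = max(float(np.std(sobel_magnitude(f))) for f in frames)
    diffs = [float(np.std(b - a)) for a, b in zip(frames, frames[1:])]
    ti = max(diffs) if diffs else 0.0  # TI is undefined for a single frame; 0 by convention here
    return si, ti
```

A static, flat sequence yields SI = 0 and TI = 0, while detailed or fast-moving content pushes the respective index up.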

In relation to the first topic, Margaret Pinson (NTIA/ITS, US) talked about why no-reference metrics for image and video quality lack accuracy and reproducibility [3] and presented new datasets containing camera noise and compression artifacts for the development of no-reference metrics by the group. In addition, Oliver Wiedeman (University of Konstanz, Germany) presented his work on cross-resolution image quality assessment.

Regarding the computation of complexity indices, Maria Martini (Kingston University, UK) presented a study comparing 12 metrics (and possible combinations) for assessing video content complexity. Vignesh V. Menon (University of Klagenfurt, Austria) presented a summary of live per-title encoding approaches using video complexity features. Ioannis Katsavounidis and Cosmin Stejerean (Meta, US) presented their work on using motion search to order videos by coding complexity, also making available the software in open source. In addition, they led a discussion on supplementing classic SI and TI with improved complexity metrics (VCA, motion search, etc.).

Finally, related to the third topic, Ioannis Katsavounidis (Meta, US) provided an update on the status of the project. Given that the idea is already mature enough, a contribution will be made to MPEG to consider the insertion of metadata of video metrics into the encoded video streams. In addition, a liaison with AOMedia will be established that may go beyond this particular topic and include best practices on subjective testing, IMG topics, etc.

Joint Effort Group (JEG) – Hybrid

The JEG group was focused on a joint work to develop hybrid perceptual/bitstream metrics and gradually evolved over time to include several areas of Video Quality Assessment (VQA), such as the creation of a large dataset for training such models using full-reference metrics instead of subjective metrics. Currently, the group is working on research problems rather than algorithms and models with immediate applicability. In addition, the group has launched a new website, which includes a list of activities of interest, freely available publications, and other resources. 

Two examples of research problems addressed by the group were shown by the two presentations given by Lohic Fotio Tiotsop (Politecnico di Torino, Italy). The topic of the first presentation was related to the training of artificial intelligence observers for a wide range of applications, while the second presentation provided guidelines to train, validate, and publish DNN-based objective measures.

5G Key Performance Indicators (5GKPI)

The 5GKPI group studies the relationship between key performance indicators of new 5G networks and QoE of video services on top of them. In this meeting, Pablo Pérez (Nokia XR Lab, Spain) presented an overview of activities related to QoE and XR within 3GPP.

Immersive Media Group (IMG)

The IMG group is focused on the research on quality assessment of immersive media. The main joint activity going on within the group is the development of a test plan to evaluate the QoE of immersive interactive communication systems. After the discussions that took place in previous meetings and audio calls, a tentative schedule has been proposed to start the execution of the test plan in the following months. In this sense, a new work item will be proposed in the next ITU-T SG12 meeting to establish a collaboration between VQEG-IMG and ITU on this topic.

In addition to this, a variety of topics related to immersive media technologies were covered in the works presented during the meeting. For example, Yaosi Hu (Wuhan University, China) presented her work on video quality assessment based on quality aggregation networks. In relation to light field imaging, Maria Martini (Kingston University, UK) outlined the main problems that current light field quality assessment datasets face and presented a new dataset. Also, there were three talks by researchers from CWI (Netherlands) dealing with point cloud QoE assessment: Silvia Rossi presented a behavioral analysis in a 6-DoF VR system, taking into account the influence of content, quality and user disposition [4]; Shishir Subramanyam presented his work related to the subjective QoE evaluation of user-centered adaptive streaming of dynamic point clouds [5]; and Irene Viola presented a point cloud objective quality assessment using PCA-based descriptors (PointPCA). Another presentation related to point cloud quality assessment was delivered by Marouane Tliba (Université d’Orleans, France), who presented an efficient deep-based graph objective metric.

In addition, Shirin Rafiei (RISE, Sweden) gave a talk on UX and QoE aspects of remote control operations using a laboratory platform, Marta Orduna (Universidad Politécnica de Madrid, Spain) presented her work on comparing ACR, SSDQE, and SSCQE in long duration 360-degree videos, whose results will be used to submit a proposal to extend ITU-T Rec. P.919 for long sequences, and Ali Ak (Nantes Université, France) presented his work on just noticeable differences for HDR/SDR image/video quality.

Quality Assessment for Computer Vision Applications (QACoViA)

The goal of the QACoViA group is to study the visual quality requirements for computer vision methods, where the “final observer” is an algorithm. Four presentations were delivered in this meeting addressing diverse related topics. In the first one, Mikołaj Leszczuk (AGH University, Poland) presented a method for assessing objective video quality for automatic license plate recognition tasks [6]. Also, Femi Adeyemi-Ejeye (University of Surrey, UK) presented his work related to the assessment of rail 8K-UHD CCTV facing video for the investigation of collisions. The third presentation dealt with the application of facial expression recognition and was delivered by Lucie Lévêque (Nantes Université, France), who compared the robustness of humans and deep neural networks on this task [7]. Finally, Alban Marie (INSA Rennes, France) presented a study on video coding for machines through a large-scale evaluation of the robustness of DNNs to compression artefacts for semantic segmentation [8].

Other updates

In relation to the Human Factors for Visual Experiences (HFVE) group, Maria Martini (Kingston University, UK) provided a summary of the status of IEEE recommended practice for the quality assessment of light field imaging. Also, Kjell Brunnström (RISE, Sweden) presented a study related to the perceptual quality of video on simulated low temperatures in LCD vehicle displays.

In addition, a new group called Emerging Technologies Group (ETG) was created in this meeting. Its main objective is to address various aspects of multimedia that do not fall under the scope of any of the existing VQEG groups. The topics addressed are not necessarily directly related to “video quality” but can indirectly impact the work addressed as part of VQEG. In particular, two major topics of interest have been identified so far: AI-based technologies and greening of streaming and related trends. Beyond these, the group aims to provide a common platform for people to gather and discuss new emerging topics and possible collaborations in the form of joint survey papers/whitepapers, funding proposals, etc.

Moreover, it was agreed during the meeting to make the Psycho-Physiological Quality Assessment (PsyPhyQA) group dormant until interest resumes in this effort. Also, it was proposed to move the Implementer’s Guide for Video Quality Metrics (IGVQM) project into the JEG-Hybrid, since their activities are currently closely related. This will be discussed in future group meetings and the final decisions will be announced. Finally, as a reminder, the VQEG GitHub with tools and subjective labs setup is still online and kept updated.

The next VQEG plenary meeting will take place in May 2023 and the location will be announced soon on the VQEG website.


[1] Maria G. Martini, “On the relationship between SSIM and PSNR for DCT-based compressed images and video: SSIM as content-aware PSNR”, TechRxiv. Preprint. https://doi.org/10.36227/techrxiv.21725390.v1, 2022.
[2] J. Zhu, P. Le Callet, A. Perrin, S. Sethuraman, K. Rahul, “On The Benefit of Parameter-Driven Approaches for the Modeling and the Prediction of Satisfied User Ratio for Compressed Video”, IEEE International Conference on Image Processing (ICIP), Oct. 2022.
[3] Margaret H. Pinson, “Why No Reference Metrics for Image and Video Quality Lack Accuracy and Reproducibility”, Frontiers in Signal Processing, Jul. 2022.
[4] S. Rossi, I. Viola, P. Cesar, “Behavioural Analysis in a 6-DoF VR System: Influence of Content, Quality and User Disposition”, Proceedings of the 1st Workshop on Interactive eXtended Reality, Oct. 2022.
[5] S. Subramanyam, I. Viola, J. Jansen, E. Alexiou, A. Hanjalic, P. Cesar, “Subjective QoE Evaluation of User-Centered Adaptive Streaming of Dynamic Point Clouds”, International Conference on Quality of Multimedia Experience (QoMEX), Sep. 2022.
[6] M. Leszczuk, L. Janowski, J. Nawała, and A. Boev, “Method for Assessing Objective Video Quality for Automatic License Plate Recognition Tasks”, Communications in Computer and Information Science, Oct. 2022.
[7] L. Lévêque, F. Villoteau, E. V. B. Sampaio, M. Perreira Da Silva, and P. Le Callet, “Comparing the Robustness of Humans and Deep Neural Networks on Facial Expression Recognition”, Electronics, 11(23), Dec. 2022.
[8] A. Marie, K. Desnos, L. Morin, and Lu Zhang, “Video Coding for Machines: Large-Scale Evaluation of Deep Neural Networks Robustness to Compression Artifacts for Semantic Segmentation”, IEEE International Workshop on Multimedia Signal Processing (MMSP), Sep. 2022.

MPEG Column: 140th MPEG Meeting in Mainz, Germany

After several years of online meetings, the 140th MPEG meeting was held as a face-to-face meeting in Mainz, Germany. The official press release can be found here and comprises the following items:

  • MPEG evaluates the Call for Proposals on Video Coding for Machines
  • MPEG evaluates Call for Evidence on Video Coding for Machines Feature Coding
  • MPEG reaches the First Milestone for Haptics Coding
  • MPEG completes a New Standard for Video Decoding Interface for Immersive Media
  • MPEG completes Development of Conformance and Reference Software for Compression of Neural Networks
  • MPEG White Papers: (i) MPEG-H 3D Audio, (ii) MPEG-I Scene Description

Video Coding for Machines

Video coding is the process of compression and decompression of digital video content with the primary purpose of consumption by humans (e.g., watching a movie or video telephony). Recently, however, large volumes of video data are increasingly analyzed without human intervention, leading to a new paradigm referred to as Video Coding for Machines (VCM), which targets both (i) conventional video coding and (ii) feature coding (see here for further details).

At the 140th MPEG meeting, MPEG Technical Requirements (WG 2) evaluated the responses to the Call for Proposals (CfP) for technologies and solutions enabling efficient video coding for machine vision tasks. A total of 17 responses to this CfP were received, with responses providing various technologies such as (i) learning-based video codecs, (ii) block-based video codecs, (iii) hybrid solutions combining (i) and (ii), and (iv) novel video coding architectures. Several proposals use a region of interest-based approach, where different areas of the frames are coded in varying qualities.

The responses to the CfP reported an improvement in compression efficiency of up to 57% on object tracking, up to 45% on instance segmentation, and up to 39% on object detection, respectively, in terms of bit rate reduction for equivalent task performance. Notably, all requirements defined by WG 2 were addressed by various proposals.
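
Bit rate reduction at equivalent task performance is often summarised with a Bjøntegaard-style delta rate. The following is a minimal sketch of that calculation, assuming four rate/score points per codec and a cubic fit of log-rate against the task score (for example mAP). It is not the evaluation procedure defined in the CfP; the function name and the sample values in the usage note are made up for illustration.

```python
import numpy as np


def bd_rate(anchor_rates, anchor_scores, test_rates, test_scores):
    """Average bit rate difference in percent of `test` relative to `anchor`
    at equal score. Negative values mean the test codec needs less bit rate.
    """
    log_a = np.log10(np.asarray(anchor_rates, dtype=np.float64))
    log_t = np.log10(np.asarray(test_rates, dtype=np.float64))
    score_a = np.asarray(anchor_scores, dtype=np.float64)
    score_t = np.asarray(test_scores, dtype=np.float64)

    # Cubic fit of log-rate as a function of the score, one curve per codec.
    poly_a = np.polyfit(score_a, log_a, 3)
    poly_t = np.polyfit(score_t, log_t, 3)

    # Integrate both curves over the overlapping score interval.
    low = max(score_a.min(), score_t.min())
    high = min(score_a.max(), score_t.max())
    if high <= low:
        raise ValueError("score ranges do not overlap")
    int_a = np.polyval(np.polyint(poly_a), high) - np.polyval(np.polyint(poly_a), low)
    int_t = np.polyval(np.polyint(poly_t), high) - np.polyval(np.polyint(poly_t), low)

    # Mean log-rate difference, converted back to a percentage.
    avg_log_diff = (int_t - int_a) / (high - low)
    return float((10.0 ** avg_log_diff - 1.0) * 100.0)
```

For example, with hypothetical anchor points at 1000, 2000, 4000 and 8000 kbit/s and a test codec that reaches the same scores at half those rates, `bd_rate` returns approximately -50, i.e., a 50% bit rate reduction.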

Furthermore, MPEG Technical Requirements (WG 2) evaluated the responses to the Call for Evidence (CfE) for technologies and solutions enabling efficient feature coding for machine vision tasks. A total of eight responses to this CfE were received, of which six responses were considered valid based on the conditions described in the call:

  • For the tested video dataset, increases in compression efficiency of up to 87% compared to the video anchor and over 90% compared to the feature anchor were reported.
  • For the tested image dataset, the compression efficiency can be increased by over 90% compared to both image and feature anchors.

Research aspects: the main research area is still the same as described in my last column, i.e., compression efficiency (incl. probably runtime, sometimes called complexity) and Quality of Experience (QoE). Additional research aspects are related to the actual task for which video coding for machines is used (e.g., segmentation, object detection, as mentioned above).

Video Decoding Interface for Immersive Media

One of the most distinctive features of immersive media compared to 2D media is that only a tiny portion of the content is presented to the user. Such a portion is interactively selected at the time of consumption. For example, a user may not see the same point cloud object’s front and back sides simultaneously. Thus, for efficiency reasons and depending on the users’ viewpoint, only the front or back sides need to be delivered, decoded, and presented. Similarly, parts of the scene behind the observer may not need to be accessed.

At the 140th MPEG meeting, MPEG Systems (WG 3) reached the final milestone of the Video Decoding Interface for Immersive Media (VDI) standard (ISO/IEC 23090-13) by promoting the text to Final Draft International Standard (FDIS). The standard defines the basic framework and specific implementation of this framework for various video coding standards, including support for application programming interface (API) standards that are widely used in practice, e.g., Vulkan by Khronos.

The VDI standard allows for dynamic adaptation of video bitstreams to provide the decoded output pictures so that the number of actual video decoders can be smaller than the number of elementary video streams to be decoded. In other cases, virtual instances of video decoders can be associated with the portions of elementary streams required to be decoded. With this standard, the resource requirements of a platform running multiple virtual video decoder instances can be further optimized by considering the specific decoded video regions to be presented to the users rather than considering only the number of video elementary streams in use. The first edition of the VDI standard includes support for the following video coding standards: High Efficiency Video Coding (HEVC), Versatile Video Coding (VVC), and Essential Video Coding (EVC).

Research aspect: VDI is also a promising standard to enable the implementation of viewport adaptive tile-based 360-degree video streaming, but its performance still needs to be assessed in various scenarios. However, requesting and decoding individual tiles within a 360-degree video streaming application is a prerequisite for enabling efficiency in such cases, and VDI provides the basis for its implementation.
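
The tile selection step behind viewport-adaptive streaming can be sketched as follows. This is not part of the VDI specification or any of its interfaces: it assumes an equirectangular frame split into a regular grid of tiles and approximates the viewport as a rectangle in yaw/pitch, which ignores projection distortion near the poles. All names and parameters are hypothetical.

```python
def visible_tiles(yaw_deg, pitch_deg, hfov_deg, vfov_deg, cols, rows):
    """Return (row, col) indices of equirectangular tiles overlapping the viewport.

    Yaw is in [-180, 180) with 0 at the frame centre; pitch is in [-90, 90]
    with +90 at the top. Row 0 is the top row, column 0 the leftmost column.
    """
    tile_w = 360.0 / cols
    tile_h = 180.0 / rows
    top = min(90.0, pitch_deg + vfov_deg / 2.0)
    bottom = max(-90.0, pitch_deg - vfov_deg / 2.0)

    selected = []
    for r in range(rows):
        tile_top = 90.0 - r * tile_h
        tile_bottom = tile_top - tile_h
        if tile_bottom >= top or tile_top <= bottom:
            continue  # no vertical overlap with the viewport
        for c in range(cols):
            tile_centre = -180.0 + (c + 0.5) * tile_w
            # Shortest angular distance, so the viewport may wrap around +/-180.
            delta = abs((tile_centre - yaw_deg + 180.0) % 360.0 - 180.0)
            if delta < (hfov_deg + tile_w) / 2.0:
                selected.append((r, c))
    return selected
```

With a 4x2 grid and a 90x90 degree viewport looking straight ahead, only the four central tiles are selected, so half of the tiles never need to be requested or decoded.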


MPEG-DASH

Finally, I’d like to provide a quick update regarding MPEG-DASH, which seems to be in maintenance mode. As mentioned in my last blog post, the output documents comprise amendments, Defects under Investigation (DuI), and Technologies under Consideration (TuC), as well as a new working draft called Redundant encoding and packaging for segmented live media (REAP), which eventually will become ISO/IEC 23009-9. The scope of REAP is to define media formats for redundant encoding and packaging of live segmented media, media ingest, and asset storage. The current working draft can be downloaded here.

Research aspects: REAP defines a distributed system and, thus, all research aspects related to such systems apply here, e.g., performance and scalability, just to name a few.

The 141st MPEG meeting will be held online from January 16-20, 2023. Click here for more information about MPEG meetings and their developments.

JPEG Column: 97th JPEG Meeting

JPEG initiates specification on fake media based on responses to its call for proposals

The 97th JPEG meeting was held online from 24 to 28 October 2022. JPEG received responses to the Call for Proposals (CfP) on JPEG Fake Media, the first multimedia international standard designed to facilitate the secure and reliable annotation of media assets creation and modifications. In total six responses were received addressing different requirements in the scope of this standardization initiative. Moreover, relevant advances were made on the standardization of learning-based coding, notably the learning-based coding of images, JPEG AI, and JPEG Pleno point cloud coding. Furthermore, the explorations on quality assessment of images, JPEG AIC, and of JPEG Pleno light field had relevant advances with the definition of their Calls for Contributions and Common Test Conditions.

Also relevant, the 98th JPEG meeting will be held in Sydney, Australia, representing a return to physical meetings after the long COVID-19 pandemic. The last physical meeting was held in January 2020 in the same location.

The 97th JPEG meeting had the following highlights:

  • JPEG Fake Media responses to the Call for Proposals analysed,
  • JPEG AI Verification Model,
  • JPEG Pleno Learning-based Point Cloud coding Verification Model,
  • JPEG Pleno Light Field issues a Call for Contributions on Subjective Light Field Quality Assessment,
  • JPEG AIC issues a Call for Contributions on Subjective Image Quality Assessment,
  • JPEG DNA releases a draft of Common Test Conditions,
  • JPEG XS prepares third edition of core coding system, and profiles and buffer models,
  • JPEG 2000 conformance is under development.

Fig. 1: Fake Media application scenarios: Good faith vs Malicious intent.

The following summarises the major achievements of the 97th JPEG meeting.

JPEG Fake Media

In April 2022, the JPEG Committee released a Final Call for Proposals on JPEG Fake Media. The scope of JPEG Fake Media is the creation of a standard that can facilitate the secure and reliable annotation of media assets creation and modifications. The standard shall address use cases that are in good faith as well as those with malicious intent. During the 97th meeting in October 2022, the following six responses to the call were presented:

  1. Adobe/C2PA: C2PA Specification
  2. Huawei: Provenance and Right Management for Digital Contents in JPEG Fake Media
  3. Sony Group Corporation: Methods to keep track provenance of media asset and signing data
  4. Vrije Universiteit Brussel/imec: Media revision history tracking via asset decomposition and serialization
  5. UPC: MIPAMS Provenance module
  6. Newcastle University: Response to JPEG Fake Media standardization call

In the coming months, these proposals will be thoroughly evaluated following a process that is open, transparent, fair and unbiased and allows deep technical discussions to assess which proposals best address identified requirements. Based on the conclusions of these discussions, a new standard will be produced to address fake media and provide solutions for transparency related to media authenticity. The standard will combine the best elements of the six proposals.

To stay informed about the activities please join the JPEG Fake Media & NFT AHG mailing list and regularly check the JPEG website for the latest information.


JPEG AI

JPEG AI (ISO/IEC 6048) aims at the development of a learning-based image coding standard offering a single-stream, compact compressed domain representation, targeting both human visualization with significant compression efficiency improvement over state-of-the-art image coding standards at similar subjective quality, and improved performance for image processing and computer vision tasks. The evaluation of the Call for Proposals responses had already confirmed the industry interest, and the subjective tests presented at the 96th JPEG meeting showed results that significantly outperform conventional image compression solutions.

The JPEG AI Verification Model (VM) has been issued as the outcome of this meeting and follows the integration effort of several neural networks and tools. Several characteristics make the JPEG AI VM unique, such as the decoupling of the entropy decoding from the sample reconstruction and the exploitation of the spatial correlation between latents using a prediction and a fusion network as well as a massively parallelized auto-regressive network. The performance evaluation has shown significant RD performance improvements (BD-rate gains of as much as 32.2% over H.266/VVC) with competitive decoding complexity. Other functionalities such as rate adaptation and device interoperability have also been addressed with the use of gain units and the quantization of the weights in the entropy decoding module. Moreover, the adoption process for architectural changes and for new or improved coding tools in the JPEG AI VM was approved. A set of core experiments has been defined to improve the JPEG AI VM, targeting higher coding efficiency and lower encoding and decoding complexity. The core experiments represent a set of promising technologies, such as learning-based GAN training, simplification of the analysis/synthesis transform, adaptive entropy coding alphabet, and even encoder-only tools and procedures for training speed-up.

JPEG Pleno Learning-based Point Cloud coding

The JPEG Pleno Point Cloud activity progressed at this meeting with the successful validation of the Verification Model under Consideration (VMuC). The VMuC was confirmed as the Verification Model (VM) to form the core of the future standard: ISO/IEC 21794 Part 6 JPEG Pleno: Learning-based Point Cloud Coding. The JPEG Committee has commenced work on the Working Draft of the standard, with initial text reviewed at this meeting. Prior to the 98th JPEG Meeting, JPEG experts will investigate possible advancements to the VM in the area of auto-regressive entropy encoding and sparse tensor convolution, as well as sourcing additional point clouds for the JPEG Pleno Point Cloud test set.

JPEG Pleno Light Field

During the 97th meeting, the JPEG Committee released the “JPEG Pleno Final Call for Contributions on Subjective Light Field Quality Assessment”, to collect new procedures and best practices regarding light field subjective quality evaluation methodologies to assess artifacts induced by coding algorithms. All contributions, including test procedures, datasets, and any additional information, will be considered to develop the standard by consensus among the JPEG experts following a collaborative process approach. The deadline for submission of contributions is April 1, 2023.

The JPEG Committee organized its 1st workshop on light field quality assessment to discuss challenges and current solutions for subjective light field quality assessment, explore relevant use cases and requirements, and provide a forum for researchers to discuss the latest findings in this area. The JPEG Committee also promoted its 2nd workshop on learning-based light field coding to exchange experiences and to present technological advances in learning-based coding solutions for light field data. The proceedings and video footage of both workshops are now accessible on the JPEG website.


JPEG AIC

At the 97th JPEG Meeting, a new JPEG AIC Final Call for Contributions on Subjective Image Quality Assessment was issued. The JPEG Committee is working on the continuation of the previous standardization efforts (AIC-1 and AIC-2) and aims at developing a new standard, known as AIC-3. The new standard will focus on the methodologies for quality assessment of images in a range that goes from high quality to near-visually lossless quality, which is not covered by the previous AIC standards.

The Call for Contributions on Subjective Image Quality Assessment is asking for contributions to the standardization process that will be collaborative from the very beginning. In this context, all received contributions will be considered for the development of the standard by consensus among the JPEG experts.

The JPEG Committee will be releasing a new JPEG AIC-3 Dataset on the 15th of December 2022, and the deadline for submitting contributions to the call is set to the 1st of April 2023, 23:59 UTC. The contributors will be presenting their contributions at the 99th JPEG Meeting in April 2023.

The Call for Contributions on Subjective Image Quality Assessment addresses the development of a suitable subjective evaluation methodology standard. A second stage will address the objective perceptual visual quality evaluation models that perform well and have a good discriminative power in the high quality to near-visually lossless quality range.


JPEG DNA

The JPEG Committee has continued its exploration of the coding of images in quaternary representations, as they are particularly suitable for DNA storage applications. The scope of JPEG DNA is the creation of a standard for efficient coding of images that considers biochemical constraints and offers robustness to noise introduced by the different stages of a storage process based on synthetic DNA polymers. During the 97th JPEG meeting, the JPEG DNA Benchmark Codec and the JPEG DNA Common Test Conditions were updated to allow for additional concrete experiments to take place prior to issuing a draft call for proposals at the next meeting. This will also allow further validation and extension of the JPEG DNA benchmark codec to simulate an end-to-end image storage pipeline using DNA and, in particular, include biochemical noise simulation, which is an essential element in practical implementations.
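
To illustrate what a quaternary representation means in practice, the following is a minimal sketch that maps each byte to four nucleotides (two bits per nucleotide) and measures the longest run of identical nucleotides. Long homopolymer runs are one commonly cited biochemical constraint in DNA storage. This is not the JPEG DNA Benchmark Codec; the bit-to-nucleotide mapping and all function names are assumptions for illustration, and a practical codec would choose its code so that constraint-violating strands are avoided.

```python
NUCLEOTIDES = "ACGT"  # assumed mapping: 00 -> A, 01 -> C, 10 -> G, 11 -> T


def bytes_to_dna(data):
    """Map each byte to four nucleotides, two bits per nucleotide, MSB first."""
    return "".join(
        NUCLEOTIDES[(byte >> shift) & 0b11]
        for byte in data
        for shift in (6, 4, 2, 0)
    )


def dna_to_bytes(strand):
    """Inverse of bytes_to_dna; the strand length must be a multiple of four."""
    if len(strand) % 4:
        raise ValueError("strand length must be a multiple of 4")
    out = bytearray()
    for i in range(0, len(strand), 4):
        value = 0
        for base in strand[i:i + 4]:
            value = (value << 2) | NUCLEOTIDES.index(base)
        out.append(value)
    return bytes(out)


def longest_homopolymer(strand):
    """Length of the longest run of identical consecutive nucleotides."""
    longest = run = 0
    previous = None
    for base in strand:
        run = run + 1 if base == previous else 1
        previous = base
        longest = max(longest, run)
    return longest
```

A naive mapping like this one turns a zero byte into "AAAA", which shows why a constraint-aware code is needed rather than a direct bit-to-nucleotide translation.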


JPEG XS

The 2nd edition of JPEG XS is now fully completed and published. The JPEG Committee continues its work on the 3rd edition of JPEG XS, starting with Part 1 (Core coding system) and Part 2 (Profiles and buffer models). These editions will address new use cases and requirements for JPEG XS by defining additional coding tools to further improve the coding efficiency, while keeping the low-latency and low-complexity core aspects of JPEG XS. The primary goal of the 3rd edition is to deliver the same image quality as the 2nd edition, but with half of the required bandwidth. During the 97th JPEG meeting, a new Working Draft of Part 1 and a first Working Draft of Part 2 were created. To support the work, a new Core Experiment was also issued to further test the proposed technology. Finally, an update to the JPEG XS White Paper has been published.

JPEG 2000

A new edition of Rec. ITU-T T.803 | ISO/IEC 15444-4 (JPEG 2000 conformance) is under development.

This new edition proposes to relax the maximum allowable errors so that well-designed 16-bit fixed-point implementations pass all compliance tests; adds two test codestreams to facilitate testing of inverse wavelet and component decorrelating transform accuracy; and adds several codestreams and files conforming to Rec. ITU-T T.801 | ISO/IEC 15444-2 to facilitate the implementation of decoders and file format readers.

Codestreams and test files can be found on the JPEG GitLab repository at: https://gitlab.com/wg1/htj2k-codestreams/-/merge_requests/14

Final Quote

“Motivated by the consumers’ concerns of manipulated contents, the JPEG Committee has taken concrete steps to define a new standard that provides interoperable solutions for a secure and reliable annotation of media assets creation and modifications,” said Prof. Touradj Ebrahimi, the Convenor of the JPEG Committee.

Upcoming JPEG meetings are planned as follows:

  • No. 98 will be held in Sydney, Australia, from 14 to 20 January 2023

Overview of Open Dataset Sessions and Benchmarking Competitions in 2022 – Part 2 (MDRE at MMM 2022, ACM MM 2022)

In this Dataset Column, we present a review of some of the notable events related to open datasets and benchmarking competitions in the field of multimedia. This year’s selection highlights the wide range of topics and datasets currently of interest to the community. Some of the events covered in this review include special sessions on open datasets and competitions featuring multimedia data. While this list is not exhaustive and contains an overview of about 40 datasets, it is meant to showcase the diversity of subjects and datasets explored in the field. This year’s review follows similar efforts from the previous year (https://records.sigmm.org/2022/01/12/overview-of-open-dataset-sessions-and-benchmarking-competitions-in-2021/), highlighting the ongoing importance of open datasets and benchmarking competitions in advancing research and development in multimedia. The column is divided into three parts; in this one, we focus on MDRE at MMM 2022 and ACM MM 2022:

  • Multimedia Datasets for Repeatable Experimentation at 28th International Conference on Multimedia Modeling (MDRE at MMM 2022 – https://mmm2022.org/ssp.html#mdre). We summarize the three datasets presented during the MDRE session, which address a task category space for user-centric video search competitions, a dataset (GPR1200) to evaluate the performance of deep neural networks for general image retrieval, and a dataset for evaluating the performance of Question Answering (QA) systems on lifelog data (LLQA).
  • Selected datasets at the 30th ACM Multimedia Conference (MM ’22 – https://2022.acmmm.org/). For a general report from ACM Multimedia 2022 please see (https://records.sigmm.org/2022/12/07/report-from-acm-multimedia-2022-by-nitish-nagesh/). We summarize nine datasets presented during the conference, targeting several topics: a dataset for multimodal intent recognition (MIntRec), an audio-visual question answering dataset (AVQA), a large-scale mmWave radar dataset (mmBody), a multimodal sticker emotion recognition dataset (SER30K), a video-sentence dataset for vision-language pre-training (ACTION), a dataset of head and gaze behavior for 360-degree videos, a saliency in augmented reality dataset (SARD), a multi-modal dataset for spotting the differences between pairs of similar images (DialDiff), and a large-scale remote sensing images dataset (RSVG).

For the overview of datasets related to QoMEX 2022 and ODS at MMSys ’22 please check the first part (https://records.sigmm.org/?p=12292), while ImageCLEF 2022 and MediaEval 2022 are addressed in the third part (http://records.sigmm.org/?p=12362).

MDRE at MMM 2022

The Multimedia Datasets for Repeatable Experimentation (MDRE) special session was part of the 2022 International Conference on Multimedia Modeling (MMM 2022), held in Phu Quoc, Vietnam, June 6-10, 2022, with both online and onsite presentations. The session was organized by Cathal Gurrin (Dublin City University, Ireland), Duc-Tien Dang-Nguyen (University of Bergen, Norway), Björn Þór Jónsson (IT University of Copenhagen, Denmark), Adam Jatowt (University of Innsbruck, Austria), Liting Zhou (Dublin City University, Ireland) and Graham Healy (Dublin City University, Ireland). Details regarding this session can be found at: https://mmm2022.org/ssp.html#mdre

The MDRE’22 special session at MMM’22 is the fourth MDRE session, and it represents an opportunity for interested researchers to submit their datasets to this track. The work submitted to MDRE is permanently available at http://mmdatasets.org, where all the current and past editions of MDRE are hosted. Along with the dataset itself, authors are asked to provide a paper describing the dataset's motivation, design, and usage, a brief summary of the experiments performed to date on the dataset, and a discussion of how it can be useful to the community.

A Task Category Space for User-Centric Comparative Multimedia Search Evaluations
Paper available at: https://doi.org/10.1007/978-3-030-98358-1_16
Lokoč, J., Bailer, W., Barthel, K.U., Gurrin, C., Heller, S., Jónsson, B., Peška, L., Rossetto, L., Schoeffmann, K., Vadicamo, L., Vrochidis, S., Wu, J.
Charles University, Prague, Czech Republic; JOANNEUM RESEARCH, Graz, Austria; HTW Berlin, Berlin, Germany; Dublin City University, Dublin, Ireland; University of Basel, Basel, Switzerland; IT University of Copenhagen, Copenhagen, Denmark; University of Zurich, Zurich, Switzerland; Klagenfurt University, Klagenfurt, Austria; ISTI CNR, Pisa, Italy; Centre for Research and Technology Hellas, Thessaloniki, Greece; City University of Hong Kong, Hong Kong.
Dataset available at: On request

The authors have analyzed the spectrum of possible task categories and propose a list of individual axes that define a large space of possible task categories. Using this concept of category space, new user-centric video search competitions can be designed to benchmark video search systems from different perspectives. They further analyze the three task categories considered at the Video Browser Showdown and discuss possible (but sometimes challenging) shifts within the task category space.

GPR1200: A Benchmark for General-Purpose Content-Based Image Retrieval
Paper available at: https://doi.org/10.1007/978-3-030-98358-1_17
Schall, K., Barthel, K.U., Hezel, N., Jung, K.
Visual Computing Group, HTW Berlin, University of Applied Sciences, Germany.
Dataset available at: http://visual-computing.com/project/GPR1200

In this study, the authors have developed a new dataset called GPR1200 to evaluate the performance of deep neural networks for general-purpose content-based image retrieval (CBIR). They found that large-scale pretraining significantly improves retrieval performance and that further improvement can be achieved through fine-tuning. GPR1200 is presented as an easy-to-use and accessible but challenging benchmark dataset with a broad range of image categories.

LLQA – Lifelog Question Answering Dataset
Paper available at: https://doi.org/10.1007/978-3-030-98358-1_18
Tran, L.-D., Ho, T.C., Pham, L.A., Nguyen, B., Gurrin, C., Zhou, L.
Dublin City University, Dublin, Ireland; Vietnam National University, Ho Chi Minh University of Science, Ho Chi Minh City, Viet Nam; AISIA Research Lab, Ho Chi Minh City, Viet Nam.
Dataset available at: https://github.com/allie-tran/LLQA

This study presents the Lifelog Question Answering Dataset (LLQA), a new dataset for evaluating the performance of Question Answering (QA) systems on lifelog data. The dataset comprises an augmented 85-day lifelog collection with over 15,000 multiple-choice questions, and is intended to serve as a benchmark for future research in this area. The results of the study showed that QA on lifelog data is a challenging task that requires further exploration.

ACM MM 2022

Numerous dataset-related papers were presented at the 30th ACM International Conference on Multimedia (MM ’22), held in Lisbon, Portugal, October 10 – 14, 2022 (https://2022.acmmm.org/). The complete MM ’22: Proceedings of the 30th ACM International Conference on Multimedia are available in the ACM Digital Library (https://dl.acm.org/doi/proceedings/10.1145/3503161).

There was no dedicated Dataset session among the roughly 35 sessions at the MM ’22 symposium. However, the importance of datasets is illustrated by the following statistics, which quantify how often the term “dataset” appears in the MM ’22 Proceedings. The term appears in the title of 9 papers (7 last year), the keywords of 35 papers (66 last year), and the abstracts of 438 papers (339 last year). As a small sample, nine selected papers focused primarily on new datasets with publicly available data are listed below. The contributions cover various multimedia applications, e.g., understanding multimedia content, multimodal fusion and embeddings, media interpretation, vision and language, engaging users with multimedia, emotional and social signals, interactions and Quality of Experience, and multimedia search and recommendation.
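As an illustration of how such term statistics could be gathered, the following is a minimal sketch, not the procedure actually used for this column. It assumes a hypothetical list of paper records with `title`, `keywords`, and `abstract` fields (the records below are made up) and counts a paper once per field in which the term occurs, using case-insensitive substring matching.

```python
def count_term(papers, term="dataset"):
    """Count papers whose title, keywords, or abstract contain the term.

    Each paper contributes at most one count per field. Matching is a
    case-insensitive substring test, so "datasets" also matches.
    """
    term = term.lower()
    counts = {"title": 0, "keywords": 0, "abstract": 0}
    for paper in papers:
        if term in paper.get("title", "").lower():
            counts["title"] += 1
        if any(term in kw.lower() for kw in paper.get("keywords", [])):
            counts["keywords"] += 1
        if term in paper.get("abstract", "").lower():
            counts["abstract"] += 1
    return counts


# Hypothetical records, for illustration only (not taken from the proceedings).
papers = [
    {"title": "A New Dataset for X", "keywords": ["datasets", "intent"],
     "abstract": "We present a dataset of annotated clips."},
    {"title": "Fast Video Coding", "keywords": ["codec"],
     "abstract": "Evaluated on a public dataset."},
    {"title": "Graph Learning", "keywords": ["graphs"],
     "abstract": "A new method."},
]

counts = count_term(papers)
print(counts)
```

Substring matching is a deliberate simplification here; a stricter count would tokenize the text and match whole words.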

MIntRec: A New Dataset for Multimodal Intent Recognition
Paper available at: https://doi.org/10.1145/3503161.3547906
Zhang, H., Xu, H., Wang, X., Zhou, Q., Zhao, S., Teng, J.
Tsinghua University, Beijing, China.
Dataset available at: https://github.com/thuiar/MIntRec

MIntRec is a dataset for multimodal intent recognition with 2,224 samples based on the data collected from the TV series Superstore, in text, video, and audio modalities, annotated with twenty intent categories and speaker bounding boxes. Baseline models are built by adapting multimodal fusion methods and show significant improvement over text-only modality. MIntRec is useful for studying relationships between modalities and improving intent recognition.

AVQA: A Dataset for Audio-Visual Question Answering on Videos
Paper available at: https://doi.org/10.1145/3503161.3548291
Yang, P., Wang, X., Duan, X., Chen, H., Hou, R., Jin, C., Zhu, W.
Tsinghua University, Shenzhen, China; Communication University of China, Beijing, China.
Dataset available at: https://mn.cs.tsinghua.edu.cn/avqa

An audio-visual question-answering dataset (AVQA) is introduced for videos in real-life scenarios. It includes 57,015 videos and 57,335 question-answer pairs that rely on clues from both audio and visual modalities. A Hierarchical Audio-Visual Fusing module is proposed to model correlations among audio, visual, and text modalities. AVQA can be used to test models with a deeper understanding of multimodal information on audio-visual question answering in real-life scenarios.

mmBody Benchmark: 3D Body Reconstruction Dataset and Analysis for Millimeter Wave Radar
Paper available at: https://doi.org/10.1145/3503161.3548262
Chen, A., Wang, X., Zhu, S., Li, Y., Chen, J., Ye, Q.
Zhejiang University, Hangzhou, China.
Dataset available at: On request

A large-scale mmWave radar dataset with synchronized and calibrated point clouds and RGB(D) images is presented, along with an automatic 3D body annotation system. State-of-the-art methods are trained and tested on the dataset, showing that mmWave radar can achieve better 3D body reconstruction accuracy than an RGB camera but worse than a depth camera. The dataset and results provide insights into improving mmWave radar reconstruction and combining signals from different sensors.

SER30K: A Large-Scale Dataset for Sticker Emotion Recognition
Paper available at: https://doi.org/10.1145/3503161.3548407
Liu, S., Zhang, X., Yan, J.
Nankai University, Tianjin, China.
Dataset available at: https://github.com/nku-shengzheliu/SER30K

A new multimodal sticker emotion recognition dataset called SER30K with 1,887 sticker themes and 30,739 images is introduced for understanding emotions in stickers. A proposed method called LORA, using a vision transformer and local re-attention module, effectively extracts visual and language features for emotion recognition on SER30K and other datasets.

Auto-captions on GIF: A Large-scale Video-sentence Dataset for Vision-language Pre-training
Paper available at: https://doi.org/10.1145/3503161.3551581
Pan, Y., Li, Y., Luo, J., Xu, J., Yao, T., Mei, T.
JD Explore Academy, Beijing, China.
Dataset available at: http://www.auto-video-captions.top/2022/dataset

A new large-scale pre-training dataset, Auto-captions on GIF (ACTION), is presented for generic video understanding. It contains video-sentence pairs extracted and filtered from web pages and can be used for pre-training and downstream tasks such as video captioning and sentence localization. Comparisons with existing video-sentence datasets are made.

Where Are You Looking?: A Large-Scale Dataset of Head and Gaze Behavior for 360-Degree Videos and a Pilot Study
Paper available at: https://doi.org/10.1145/3503161.3548200
Jin, Y., Liu, J., Wang, F., Cui, S.
The Chinese University of Hong Kong, Shenzhen, Shenzhen, China.
Dataset available at: https://cuhksz-inml.github.io/head_gaze_dataset/

A dataset of users’ head and gaze behaviors in 360° videos is presented, characterized by rich dimensions, large scale, strong diversity, and high frequency. A quantitative taxonomy for 360° videos is also proposed, containing three objective technical metrics. Results of a pilot study on users’ behaviors and an application case in tile-based 360° video streaming show the usefulness of the dataset for improving the performance of existing works.

Saliency in Augmented Reality
Paper available at: https://doi.org/10.1145/3503161.3547955
Duan, H., Shen, W., Min, X., Tu, D., Li, J., Zhai, G.
Shanghai Jiao Tong University, Shanghai, China; Alibaba Group, Hangzhou, China.
Dataset available at: https://github.com/DuanHuiyu/ARSaliency

A dataset, Saliency in AR Dataset (SARD), containing 450 background, 450 AR, and 1,350 superimposed images with three mixing levels, is constructed to study the interaction between background scenes and AR contents, and the saliency prediction problem in AR. An eye-tracking experiment is conducted with 60 subjects to collect data.

Visual Dialog for Spotting the Differences between Pairs of Similar Images
Paper available at: https://doi.org/10.1145/3503161.3548170
Zheng, D., Meng, F., Si, Q., Fan, H., Xu, Z., Zhou, J., Feng, F., Wang, X.
Beijing University of Posts and Telecommunications, Beijing, China; WeChat AI, Tencent Inc, Beijing, China; Institute of Information Engineering, Chinese Academy of Sciences, Beijing, China; University of Trento, Trento, Italy.
Dataset available at: https://github.com/zd11024/Spot_Difference

A new visual dialog task called Dial-the-Diff is proposed, in which two interlocutors access two similar images and try to spot the difference between them through conversation in natural language. A large-scale multi-modal dataset called DialDiff, containing 87k Virtual Reality images and 78k dialogs, is built for the task. Benchmark models are also proposed and evaluated to bring new challenges to dialog strategy and object categorization.

Visual Grounding in Remote Sensing Images
Paper available at: https://doi.org/10.1145/3503161.3548316
Sun, Y., Feng, S., Li, X., Ye, Y., Kang, J., Huang, X.
Harbin Institute of Technology, Shenzhen, Shenzhen, China; Soochow University, Suzhou, China.
Dataset available at: https://sunyuxi.github.io/publication/GeoVG

A new problem of visual grounding in large-scale remote sensing images has been presented, in which the task is to locate particular objects in an image by a natural language expression. A new dataset, called RSVG, has been collected and a new method, GeoVG, has been designed to address the challenges of existing methods in dealing with remote sensing images.

Overview of Open Dataset Sessions and Benchmarking Competitions in 2022 – Part 3 (ImageCLEF 2022, MediaEval 2022)

In this Dataset Column, we present a review of some of the notable events related to open datasets and benchmarking competitions in the field of multimedia. This year’s selection highlights the wide range of topics and datasets currently of interest to the community. Some of the events covered in this review include special sessions on open datasets and competitions featuring multimedia data. While this list is not exhaustive and contains an overview of about 40 datasets, it is meant to showcase the diversity of subjects and datasets explored in the field. This year’s review follows similar efforts from the previous year (https://records.sigmm.org/2022/01/12/overview-of-open-dataset-sessions-and-benchmarking-competitions-in-2021/), highlighting the ongoing importance of open datasets and benchmarking competitions in advancing research and development in multimedia. The column is divided into three parts; in this one, we focus on ImageCLEF 2022 and MediaEval 2022:

  • ImageCLEF 2022 (https://www.imageclef.org/2022). We summarize the 5 datasets launched for the benchmarking tasks, related to several topics like social media profile assessment (ImageCLEFaware), segmentation and labeling of underwater coral images (ImageCLEFcoral), late fusion ensembling systems for multimedia data (ImageCLEFfusion) and medical imaging analysis (ImageCLEFmedical Caption, and ImageCLEFmedical Tuberculosis).
  • MediaEval 2022 (https://multimediaeval.github.io/editions/2022/). We summarize the 11 datasets launched for the benchmarking tasks, which target a wide range of multimedia topics like the analysis of flood-related media (DisasterMM), game analytics (Emotional Mario), news item processing (FakeNews, NewsImages), multimodal understanding of smells (MUSTI), medical imaging (Medico), fishing vessel analysis (NjordVid), media memorability (Memorability), sports data analysis (Sport Task, SwimTrack), and urban pollution analysis (Urban Air).

For the overview of datasets related to QoMEX 2022 and ODS at MMSys ’22 please check the first part (https://records.sigmm.org/?p=12292), while MDRE at MMM 2022 and ACM MM 2022 are addressed in the second part (http://records.sigmm.org/?p=12360).

ImageCLEF 2022

ImageCLEF is a multimedia evaluation campaign, part of the CLEF Initiative (http://www.clef-initiative.eu/). The 2022 edition (https://www.imageclef.org/2022) is the 19th edition of this initiative and addresses four main research tasks in several domains, such as medicine, nature, social media content and user interface processing. ImageCLEF 2022 was organized by Bogdan Ionescu (University Politehnica of Bucharest, Romania), Henning Müller (University of Applied Sciences Western Switzerland, Sierre, Switzerland), Renaud Péteri (University of La Rochelle, France), Ivan Eggel (University of Applied Sciences Western Switzerland, Sierre, Switzerland) and Mihai Dogariu (University Politehnica of Bucharest, Romania).

ImageCLEFaware
Paper available at: https://ceur-ws.org/Vol-3180/paper-98.pdf
Popescu, A., Deshayes-Chossar, J., Schindler, H., Ionescu, B.
CEA LIST, France; University Politehnica of Bucharest, Romania.
Dataset available at: https://www.imageclef.org/2022/aware

This is the second edition of the aware task at ImageCLEF, and it seeks to understand how public social media profiles affect users in certain important scenarios, namely a search or application for a bank loan, an accommodation, a job as a waitress/waiter, and a job in IT.

ImageCLEFcoral
Paper available at: https://ceur-ws.org/Vol-3180/paper-97.pdf
Chamberlain, J., de Herrera, A.G.S., Campello, A., Clark, A.
University of Essex, United Kingdom; Wellcome Trust, United Kingdom.
Dataset available at: https://www.imageclef.org/2022/coral

This fourth edition of the coral task addresses the problem of segmenting and labeling a set of underwater images used in the monitoring of coral reefs. The task proposes two subtasks, namely an annotation and localization subtask and a pixel-wise parsing subtask.

ImageCLEFfusion
Paper available at: https://ceur-ws.org/Vol-3180/paper-99.pdf
Ştefan, L-D., Constantin, M.G., Dogariu, M., Ionescu, B.
University Politehnica of Bucharest, Romania.
Dataset available at: https://www.imageclef.org/2022/fusion

This is the first edition of the fusion task, and it proposes two scenarios adapted for the use of late fusion or ensembling systems: a regression scenario, using data associated with the prediction of media interestingness, and a retrieval scenario, using data associated with search result diversification.
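For readers unfamiliar with late fusion, the general idea in a regression setting can be sketched as follows. This is a generic weighted-average illustration, not the method or data format of the ImageCLEFfusion task; the function name, the scores, and the weights are all hypothetical.

```python
def late_fusion(predictions, weights=None):
    """Fuse the outputs of several systems by (weighted) averaging.

    predictions: one list of scores per system, all of equal length,
                 where position i refers to the same media item.
    weights:     optional per-system weights; equal weights by default.
    """
    if weights is None:
        weights = [1.0] * len(predictions)
    total = sum(weights)
    n_items = len(predictions[0])
    return [
        sum(w * scores[i] for w, scores in zip(weights, predictions)) / total
        for i in range(n_items)
    ]


# Hypothetical interestingness scores from three systems for three items.
system_scores = [
    [0.2, 0.8, 0.5],
    [0.4, 0.6, 0.5],
    [0.3, 0.7, 0.8],
]

fused = late_fusion(system_scores)                             # equal weights
fused_weighted = late_fusion(system_scores, weights=[2, 1, 1]) # trust system 1 more
print(fused, fused_weighted)
```

Averaging is only the simplest choice; the fusion operates on the systems' outputs rather than on their input features, which is what distinguishes late fusion from early fusion.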

ImageCLEFmedical Tuberculosis
Paper available at: https://ceur-ws.org/Vol-3180/paper-96.pdf
Kozlovski, S., Dicente Cid, Y., Kovalev, V., Müller, H.
United Institute of Informatics Problems, Belarus; Roche Diagnostics, Spain; University of Applied Sciences Western Switzerland, Switzerland; University of Geneva, Switzerland.
Dataset available at: https://www.imageclef.org/2022/medical/tuberculosis

This task is now in its sixth edition, and is being upgraded to a detection problem. Furthermore, two subtasks are now included: the detection of lung cavern regions in lung CT images associated with lung tuberculosis and the prediction of 4 binary features of caverns suggested by experienced radiologists.

ImageCLEFmedical Caption
Paper available at: https://ceur-ws.org/Vol-3180/paper-95.pdf
Rückert, J., Ben Abacha, A., de Herrera, A.G.S., Bloch, L., Brüngel, R., Idrissi-Yaghir, A., Schäfer, H., Müller, H., Friedrich, C.M.
University of Applied Sciences and Arts Dortmund, Germany; Microsoft, USA; University of Essex, UK; University Hospital Essen, Germany; University of Applied Sciences Western Switzerland, Switzerland; University of Geneva, Switzerland.
Dataset available at: https://www.imageclef.org/2022/medical/caption

The sixth edition of this task consists of two subtasks. In the first subtask, participants must detect relevant medical concepts in a large corpus of medical images, while in the second subtask, coherent captions must be generated for the entirety of the context of medical images, targeting the interplay of many visible concepts.

MediaEval 2022

The MediaEval Multimedia Evaluation benchmark (https://multimediaeval.github.io/) offers challenges in artificial intelligence for multimedia data. This is the 13th edition of MediaEval (https://multimediaeval.github.io/editions/2022/) and 11 tasks were proposed for this edition, targeting a large number of challenges by creating algorithms for retrieval, analysis, and exploration. For this edition, a “Quest for Insight” is pursued, where organizers are encouraged to propose interesting and insightful questions about the concepts that will be explored, and participants are encouraged to push beyond only striving to improve evaluation scores and to also work toward a deeper understanding of the challenges.

DisasterMM: Multimedia Analysis of Disaster-Related Social Media Data
Preprint available at: https://2022.multimediaeval.com/paper5337.pdf
Andreadis, S., Bozas, A., Gialampoukidis, I., Mavropoulos, T., Moumtzidou, A., Vrochidis, S., Kompatsiaris, I., Fiorin, R., Lombardo, F., Norbiato, D., Ferri, M.
Information Technologies Institute – Centre of Research and Technology Hellas, Greece; Eastern Alps River Basin District, Italy.
Dataset available at: https://multimediaeval.github.io/editions/2022/tasks/disastermm/

The DisasterMM task proposes the analysis of social media data extracted from Twitter, targeting the analysis of natural or man-made disaster posts. For this year, the organizers focused on the analysis of flooding events and proposed two subtasks: relevance classification of posts and location extraction from texts.

Emotional Mario: A Game Analytics Challenge
Preprint or paper not published yet.
Lux, M., Alshaer, M., Riegler, M., Halvorsen, P., Thambawita, V., Hicks, S., Dang-Nguyen, D.-T.
Alpen-Adria-Universität Klagenfurt, Austria; SimulaMet, Norway; University of Bergen, Norway.
Dataset available at: https://multimediaeval.github.io/editions/2022/tasks/emotionalmario/

Emotional Mario focuses on the Super Mario Bros videogame, analyzing the data associated with gamers, which consists of game input, demographics, biomedical data, and video associated with players’ faces. Two subtasks are proposed: event detection, seeking to identify gaming events of significant importance based on facial videos and biometric data, and gameplay summarization, seeking to select the best moments of gameplay.

FakeNews Detection
Preprint available at: https://2022.multimediaeval.com/paper116.pdf
Pogorelov, K., Schroeder, D.T., Brenner, S., Maulana, A., Langguth, J.
Simula Research Laboratory, Norway; University of Bergen, Norway; Stuttgart Media University, Germany.
Dataset available at: https://multimediaeval.github.io/editions/2022/tasks/fakenews/

The FakeNews Detection task proposes several ways of analyzing fake news and the way it spreads, using COVID-19 related conspiracy theories. The competition proposes three subtasks: the first targets conspiracy detection in text-based data, the second asks participants to analyze graphs of conspiracy posters, while the last one combines the first two, aiming at detection on both text and graph data.

MUSTI – Multimodal Understanding of Smells in Texts and Images
Preprint available at: https://2022.multimediaeval.com/paper9634.pdf
Hürriyetoğlu, A., Paccosi, T., Menini, S., Zinnen, M., Lisena, P., Akdemir, K., Troncy, R., van Erp, M.
KNAW Humanities Cluster DHLab, Netherlands; Fondazione Bruno Kessler, Italy; Friedrich-Alexander-Universität, Germany; EURECOM, France.
Dataset available at: https://multimediaeval.github.io/editions/2022/tasks/musti/

MUSTI is one of the few benchmarks that seek to analyze the underrepresented modality of smell. The organizers seek to further the understanding of descriptions of smell in texts and images, and propose two subtasks: the first one aims at classification of smells based on language and image models, predicting whether texts or images evoke the same smell source or not, while the second subtask asks participants to identify the common smell sources.

Medical Multimedia Task: Transparent Tracking of Spermatozoa
Preprint available at: https://2022.multimediaeval.com/paper5501.pdf
Thambawita, V., Hicks, S., Storås, A.M., Andersen, J.M., Witczak, O., Haugen, T.B., Hammer, H., Nguyen, T., Halvorsen, P., Riegler, M.A.
SimulaMet, Norway; OsloMet, Norway; The Arctic University of Norway, Norway.
Dataset available at: https://multimediaeval.github.io/editions/2022/tasks/medico/

The Medico Medical Multimedia Task tackles the challenge of tracking sperm cells in video recordings, while analyzing the specific characteristics of these cells. Four subtasks are proposed: a sperm-cell real-time tracking task in videos, a prediction of cell motility task, a catch and highlight task seeking to identify sperm cell speed, and an explainability task.

NewsImages
Preprint available at: https://2022.multimediaeval.com/paper8446.pdf
Kille, B., Lommatzsch, A., Özgöbek, Ö., Elahi, M., Dang-Nguyen, D.-T.
Norwegian University of Science and Technology, Norway; Berlin Institute of Technology, Germany; University of Bergen, Norway; Kristiania University College, Norway.
Dataset available at: https://multimediaeval.github.io/editions/2022/tasks/newsimages/

The goal of the NewsImages task is to further the understanding of the relationship between textual and image content in news articles. Participants are tasked with re-linking and re-matching textual news articles with the corresponding images, based on data gathered from social media, news portals and RSS feeds.

NjordVid: Fishing Trawler Video Analytics Task
Preprint available at: https://2022.multimediaeval.com/paper5854.pdf
Nordmo, T.A.S., Ovesen, A.B., Johansen, H.D., Johansen, D., Riegler, M.A.
The Arctic University of Norway, Norway; SimulaMet, Norway.
Dataset available at: https://multimediaeval.github.io/editions/2022/tasks/njord/

The NjordVid task proposes data associated with fishing vessel recordings, as a contribution toward maintaining sustainable fishing practices. Two different subtasks are proposed: detection of events on the boat, like movement of people, catching fish, etc., and privacy of on-board personnel.

Predicting Video Memorability
Preprint available at: https://2022.multimediaeval.com/paper2265.pdf
Sweeney, L., Constantin, M.G., Demarty, C.-H., Fosco, C., de Herrera, A.G.S., Halder, S., Healy, G., Ionescu, B., Matran-Fernandez, A., Smeaton, A.F., Sultana, M.
Dublin City University, Ireland; University Politehnica of Bucharest, Romania; InterDigital, France; Massachusetts Institute of Technology Cambridge, USA; University of Essex, UK.
Dataset available at: https://multimediaeval.github.io/editions/2022/tasks/memorability/

The Video Memorability task asks participants to predict how memorable a video sequence is, targeting short-term memorability. Three subtasks are proposed for this edition: a general video-based prediction task where participants are asked to predict the memorability score of a video, a generalization task where training and testing are performed on different sources of data, and an EEG-based task where annotator EEG scans are provided.

Sport Task: Fine Grained Action Detection and Classification of Table Tennis Strokes from Videos
Preprint available at: https://2022.multimediaeval.com/paper4766.pdf
Martin, P.-E., Calandre, J., Mansencal, B., Benois-Pineau, J., Péteri, R., Mascarilla, L., Morlier, J.
Max Planck Institute for Evolutionary Anthropology, Germany; La Rochelle University, France; Univ. Bordeaux, France.
Dataset available at: https://multimediaeval.github.io/editions/2022/tasks/sportsvideo/

The Sport Task aims at action detection and classification in videos recorded at table tennis events. Low inter-class variability makes this task harder than other traditional action classification benchmarks. Two subtasks are proposed: a classification task where participants are asked to label table tennis videos according to the strokes the players make, and a detection task where participants must detect whether a stroke was made.

SwimTrack: Swimmers and Stroke Rate Detection in Elite Race Videos
Preprint available at: https://2022.multimediaeval.com/paper6876.pdf
Jacquelin, N., Jaunet, T., Vuillemot, R., Duffner, S.
École Centrale de Lyon, France; INSA-Lyon, France.
Dataset available at: https://multimediaeval.github.io/editions/2022/tasks/swimtrack/

The SwimTrack task comprises five different multimedia subtasks related to the analysis of competition-level swimming videos, and provides multimodal video, image and audio data. The five subtasks are as follows: a position detection task associating swimmers with the numbers of swimming lanes, a stroke rate detection task, a camera registration task where participants must apply homography projection methods to create a top-view of the pool, a character recognition on scoreboards task, and a sound detection task associated with buzzer sounds.

Urban Air: Urban Life and Air Pollution
Preprint available at: https://2022.multimediaeval.com/paper586.pdf
Dao, M.-S., Dang, T.-H., Nguyen-Tai, T.-L., Nguyen, T.-B., Dang-Nguyen, D.-T.
National Institute of Information and Communications Technology, Japan; Dalat University, Vietnam; LOC GOLD Technology MTV Ltd. Co, Vietnam; University of Science, Vietnam National University in HCM City, Vietnam; University of Bergen, Norway.
Dataset available at: https://multimediaeval.github.io/editions/2022/tasks/urbanair/

The Urban Air task provides multimodal data that allows the analysis of air pollution and pollution patterns in urban environments. The organizers created two subtasks for this competition: a multimodal/crossmodal air quality index prediction task using station and/or CCTV data, and a periodic traffic pollution pattern discovery task.

Overview of Open Dataset Sessions and Benchmarking Competitions in 2022 – Part 1 (QoMEX 2022, ODS at MMSys ’22)

In this Dataset Column, we present a review of some of the notable events related to open datasets and benchmarking competitions in the field of multimedia. This year’s selection highlights the wide range of topics and datasets currently of interest to the community. Some of the events covered in this review include special sessions on open datasets and competitions featuring multimedia data. While this list is not exhaustive and contains an overview of about 40 datasets, it is meant to showcase the diversity of subjects and datasets explored in the field. This year’s review follows similar efforts from the previous year (https://records.sigmm.org/2022/01/12/overview-of-open-dataset-sessions-and-benchmarking-competitions-in-2021/), highlighting the ongoing importance of open datasets and benchmarking competitions in advancing research and development in multimedia. The column is divided into three parts; in this one, we focus on QoMEX 2022 and ODS at MMSys ’22:

  • 14th International Conference on Quality of Multimedia Experience (QoMEX 2022 – https://qomex2022.itec.aau.at/). We summarize three datasets presented at this conference, which address QoE studies on audiovisual 360° video, storytelling for quality perception, and the modelling of energy consumption and streaming video QoE.
  • Open Dataset and Software Track at 13th ACM Multimedia Systems Conference (ODS at MMSys ’22 – https://mmsys2022.ie/). We summarize nine datasets presented at the ODS track, targeting several topics, including surveillance videos from a fishing vessel (Njord), multi-codec 8K UHD videos (8K MPEG-DASH dataset), light-field (LF) synthetic immersive large-volume plenoptic dataset (SILVR), a dataset of online news items and the related task of rematching (NewsImages), video sequences, characterized by various complexity categories (VCD), QoE dataset of realistic video clips for real networks, dataset of 360° videos with subjective emotional ratings (PEM360), free-viewpoint video dataset, and cloud gaming dataset (CGD).

For the overview of datasets related to MDRE at MMM 2022 and ACM MM 2022 please check the second part (http://records.sigmm.org/?p=12360), while ImageCLEF 2022 and MediaEval 2022 are addressed in the third part (http://records.sigmm.org/?p=12362).

QoMEX 2022

Three dataset papers were presented at the International Conference on Quality of Multimedia Experience (QoMEX 2022), organized in Lippstadt, Germany, September 5 – 7, 2022 (https://qomex2022.itec.aau.at/). The complete QoMEX ’22 proceedings are available in the IEEE Digital Library (https://ieeexplore.ieee.org/xpl/conhome/9900491/proceeding).

These datasets were presented within the Databases session, chaired by Professor Oliver Hohlfeld. The three papers focus on audiovisual 360-degree videos, storytelling for quality perception, and the modelling of energy consumption and streaming video QoE.

Audiovisual Database with 360° Video and Higher-Order Ambisonics Audio for Perception, Cognition, Behavior and QoE Evaluation Research
Paper available at: https://ieeexplore.ieee.org/document/9900893
Robotham, T., Singla, A., Rummukainen, O., Raake, A. and Habets, E.
International Audio Laboratories Erlangen, a joint institution of the Friedrich-Alexander-Universität Erlangen-Nürnberg (FAU) and Fraunhofer Institute for Integrated Circuits (IIS), Germany; TU Ilmenau, Germany.
Dataset available at: https://qoevave.github.io/database/

This publicly available database provides audiovisual 360° content with high-order Ambisonics audio. It consists of twelve scenes capturing real-life nature and urban environments with a video resolution of 7680×3840 at 60 frames-per-second and with 4th-order Ambisonics audio. These 360° video sequences, with an average duration of 60 seconds, represent real-life settings for systematically evaluating various dimensions of uni-/multi-modal perception, cognition, behavior, and QoE. It provides high-quality reference material with a balanced focus on auditory and visual sensory information.

The Storytime Dataset: Simulated Videotelephony Clips for Quality Perception Research
Paper available at: https://ieeexplore.ieee.org/document/9900888
Spang, R. P., Voigt-Antons, J. N. and Möller, S.
Technische Universität Berlin, Berlin, Germany; Hamm-Lippstadt University of Applied Sciences, Lippstadt, Germany.
Dataset available at: https://osf.io/cyb8w/

This is a dataset of simulated videotelephony clips to act as stimuli in quality perception research. It consists of four different stories in the German language, each told through ten consecutive parts of about 10 seconds. Each of these parts is available in four different quality levels, ranging from perfect to stalling. All clips (FullHD, H.264 / AAC) are actual recordings from end-user video-conference software to ensure ecological validity and realism of quality degradation. Apart from a detailed description of the methodological approach, the authors contribute the entire stimuli dataset containing 160 videos and all rating scores for each file.
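The stated total of 160 videos follows from the structure of the dataset (stories × parts × quality levels). A minimal Python sketch of that count is given below; the integer indices are placeholders, since the actual file naming scheme and quality-level labels are not described here.

```python
from itertools import product

N_STORIES = 4        # four stories in German
N_PARTS = 10         # ten consecutive parts per story
N_QUALITY_LEVELS = 4  # from perfect to stalling

# Each clip is identified here by a (story, part, quality) index triple.
# These indices are placeholders, not the dataset's real identifiers.
clips = list(
    product(
        range(1, N_STORIES + 1),
        range(1, N_PARTS + 1),
        range(1, N_QUALITY_LEVELS + 1),
    )
)

total_clips = len(clips)                   # 4 * 10 * 4
clips_per_story = N_PARTS * N_QUALITY_LEVELS
```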

Modelling of Energy Consumption and Streaming Video QoE using a Crowdsourcing Dataset
Paper available at: https://ieeexplore.ieee.org/document/9900886
Herglotz, C, Robitza, W., Kränzler, M., Kaup, A. and Raake, A.
Friedrich-Alexander-Universität, Erlangen, Germany; Audiovisual Technology Group, TU Ilmenau, Germany; AVEQ GmbH, Vienna, Austria.
Dataset available at: On request

This paper performs a first analysis of end-user power efficiency and Quality of Experience of a video streaming service. A crowdsourced dataset comprising 447,000 streaming events from YouTube is used to estimate both the power consumption and perceived quality. The power consumption is modelled based on previous work, which extends toward predicting the power usage of different devices and codecs. The user-perceived QoE is estimated using a standardized model.

ODS at MMSys ’22

The traditional Open Dataset and Software Track (ODS) was a part of the 13th ACM Multimedia Systems Conference (MMSys ’22) organized in Athlone, Ireland, June 14 – 17, 2022 (https://mmsys2022.ie/). The complete MMSys ’22: Proceedings of the 13th ACM Multimedia Systems Conference are available in the ACM Digital Library (https://dl.acm.org/doi/proceedings/10.1145/3524273).

The Open Dataset and Software Chairs for MMSys ’22 were Roberto Azevedo (Disney Research, Switzerland), Saba Ahsan (Nokia Technologies, Finland), and Yao Liu (Rutgers University, USA). The ODS session, with 14 papers, opened with pitches on Wednesday, June 15, followed by a poster session. Nine of the fourteen contributions were dataset papers. A listing of the paper titles, dataset summaries, and associated DOIs is included below for your convenience.

Njord: a fishing trawler dataset
Paper available at: https://doi.org/10.1145/3524273.3532886
Nordmo, T.-A.S., Ovesen, A.B., Juliussen, B.A., Hicks, S.A., Thambawita, V., Johansen, H.D., Halvorsen, P., Riegler, M.A., Johansen, D.
UiT the Arctic University of Norway, Norway; SimulaMet, Norway; Oslo Metropolitan University, Norway.
Dataset available at: https://doi.org/10.5281/zenodo.6284673

This paper presents Njord, a dataset of surveillance videos from a commercial fishing vessel. The dataset aims to demonstrate the potential for using data from fishing vessels to detect accidents and report fish catches automatically. The authors also provide a baseline analysis of the dataset and discuss possible research questions that it could help answer.

Multi-codec ultra high definition 8K MPEG-DASH dataset
Paper available at: https://doi.org/10.1145/3524273.3532889
Taraghi, B., Amirpour, H., Timmerer, C.
Christian Doppler Laboratory Athena, Institute of Information Technology (ITEC), Alpen-Adria-Universität Klagenfurt, Klagenfurt, Austria.
Dataset available at: http://ftp.itec.aau.at/datasets/mmsys22/

This paper presents a dataset of multimedia assets encoded with various video codecs, including AVC, HEVC, AV1, and VVC, and packaged using the MPEG-DASH format. The dataset includes resolutions up to 8K and has a maximum media duration of 322 seconds, with segment lengths of 4 and 8 seconds. It is intended to facilitate research and development of video encoding technology for streaming services.
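To give a feel for what the two segment lengths imply, the Python sketch below counts the segments needed for the longest asset (322 seconds). It assumes that segments are cut at fixed intervals and that the final segment simply holds the remainder; the paper's actual packaging may differ.

```python
import math


def segment_count(duration_s: float, segment_length_s: float) -> int:
    """Number of fixed-length segments needed to cover the media duration."""
    if duration_s <= 0 or segment_length_s <= 0:
        raise ValueError("duration and segment length must be positive")
    return math.ceil(duration_s / segment_length_s)


def last_segment_duration(duration_s: float, segment_length_s: float) -> float:
    """Duration of the final (possibly shorter) segment."""
    full_segments = segment_count(duration_s, segment_length_s) - 1
    return duration_s - full_segments * segment_length_s


MAX_DURATION_S = 322  # maximum media duration reported for the dataset

segments_4s = segment_count(MAX_DURATION_S, 4)
segments_8s = segment_count(MAX_DURATION_S, 8)
```

Under these assumptions, shorter segments roughly double the number of requests a player issues for the same content, which is one reason segment length is a common experimental variable in adaptive streaming studies.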

SILVR: a synthetic immersive large-volume plenoptic dataset
Paper available at: https://doi.org/10.1145/3524273.3532890
Courteaux, M., Artois, J., De Pauw, S., Lambert, P., Van Wallendael, G.
Ghent University – Imec, Oost-Vlaanderen, Zwijnaarde, Belgium.
Dataset available at: https://idlabmedia.github.io/large-lightfields-dataset/

SILVR (synthetic immersive large-volume plenoptic dataset) is a light-field (LF) image dataset allowing for six-degrees-of-freedom navigation in larger volumes while maintaining full panoramic field of view. It includes three virtual scenes with 642-2226 views, rendered with 180° fish-eye lenses and featuring color images and depth maps. The dataset also includes multiview rendering software and a lens-reprojection tool. SILVR can be used to evaluate LF coding and rendering techniques.

NewsImages: addressing the depiction gap with an online news dataset for text-image rematching
Paper available at: https://doi.org/10.1145/3524273.3532891
Lommatzsch, A., Kille, B., Özgöbek, O., Zhou, Y., Tešić, J., Bartolomeu, C., Semedo, D., Pivovarova, L., Liang, M., Larson, M.
DAI-Labor, TU-Berlin, Berlin, Germany; NTNU, Trondheim, Norway; Texas State University, San Marcos, TX, United States; Universidade Nova de Lisboa, Lisbon, Portugal.
Dataset available at: https://multimediaeval.github.io/editions/2021/tasks/newsimages/

NewsImages is a dataset of online news items and the related task of news images rematching, which aims to study the “depiction gap” between the content of an image and the text that accompanies it. The dataset is useful for studying connections between image and text and addressing the depiction gap, including sparse data, diversity of content, and the importance of background knowledge.

VCD: Video Complexity Dataset
Paper available at: https://doi.org/10.1145/3524273.3532892
Amirpour, H., Menon, V.V., Afzal, S., Ghanbari, M., Timmerer, C.
Christian Doppler Laboratory Athena, Institute of Information Technology (ITEC), Alpen-Adria-Universität Klagenfurt, Klagenfurt, Austria; School of Computer Science and Electronic Engineering, University of Essex, Colchester, United Kingdom.
Dataset available at: https://ftp.itec.aau.at/datasets/video-complexity/

The Video Complexity Dataset (VCD) is a collection of 500 Ultra High Definition (UHD) resolution video sequences, characterized by spatial and temporal complexities, rate-distortion complexity, and encoding complexity with the x264 AVC/H.264 and x265 HEVC/H.265 video encoders. It is suitable for video coding applications such as video streaming, two-pass encoding, per-title encoding, and scene-cut detection. These sequences are provided at 24 frames per second (fps) and stored online in losslessly encoded 8-bit 4:2:0 format.

Realistic video sequences for subjective QoE analysis
Paper available at: https://doi.org/10.1145/3524273.3532894
Hodzic, K., Cosovic, M., Mrdovic, S., Quinlan, J.J., Raca, D.
Faculty of Electrical Engineering, University of Sarajevo, Bosnia and Herzegovina; School of Computer Science & Information Technology, University College Cork, Ireland.
Dataset available at: https://shorturl.at/dtISV

The DashReStreamer framework is designed to recreate adaptively streamed video in real networks to evaluate user Quality of Experience (QoE). The authors have also created a dataset of 234 realistic video clips, based on video logs collected from real mobile and wireless networks, including video logs and network bandwidth profiles. This dataset and framework will help researchers understand the impact of video QoE dynamics on multimedia streaming.

PEM360: a dataset of 360° videos with continuous physiological measurements, subjective emotional ratings and motion traces
Paper available at: https://doi.org/10.1145/3524273.3532895
Guimard, Q., Robert, F., Bauce, C., Ducreux, A., Sassatelli, L., Wu, H.-Y., Winckler, M., Gros, A.
Université Côte d’Azur, Inria, CNRS, I3S, Sophia-Antipolis, France.
Dataset available at: https://gitlab.com/PEM360/PEM360/

PEM360 is a dataset of user head movements and gaze recordings in 360° videos, along with self-reported emotional ratings and continuous physiological measurement data. It aims to understand the connection between user attention, emotions, and immersive content, and includes software tools and joint instantaneous visualization of user attention and emotion, called “emotional maps.” The entire data and code are available in a reproducible framework.

A New Free Viewpoint Video Dataset and DIBR Benchmark
Paper available at: https://doi.org/10.1145/3524273.3532897
Guo, S., Zhou, K., Hu, J., Wang, J., Xu, J., Song, L.
Institute of Image Communication and Network Engineering, Shanghai Jiao Tong University, Shanghai, China.
Dataset available at: https://github.com/sjtu-medialab/Free-Viewpoint-RGB-D-Video-Dataset

A new dynamic RGB-D video dataset for FVV research is presented, including 13 groups of dynamic scenes and one group of static scenes, each with 12 HD video sequences and 12 corresponding depth video sequences. Also, the FVV synthesis benchmark is introduced based on depth image-based rendering to aid data-driven method validation. The dataset and benchmark aim to advance FVV synthesis with improved robustness and performance.

CGD: a cloud gaming dataset with gameplay video and network recordings
Paper available at: https://doi.org/10.1145/3524273.3532898
Slivar, I., Bacic, K., Orsolic, I., Skorin-Kapov, L., Suznjevic, M.
University of Zagreb, Faculty of Electrical Engineering and Computing, Zagreb, Croatia.
Dataset available at: https://muexlab.fer.hr/muexlab/research/datasets

The cloud gaming (CGD) dataset contains 600 game streaming sessions from 10 games of different genres, with various encoding parameters (bitrate, resolution, and frame rate) to evaluate the impact of these parameters on Quality of Experience (QoE). The dataset includes gameplay video recordings, network traffic traces, user input logs, and streaming performance logs, and can be used to understand relationships between network and application layer data for cloud gaming QoE and QoE-aware network management mechanisms.

Green Video Streaming: Challenges and Opportunities


In line with the 2021 report of the Intergovernmental Panel on Climate Change (IPCC) and Sustainable Development Goal (SDG) 13 “climate action”, urgent action is needed to tackle climate change and reduce global greenhouse gas (GHG) emissions in the next few years [1]. This urgency also applies to the energy consumption of digital technologies. Internet data traffic is responsible for more than half of digital technology’s global impact, accounting for 55% of its annual energy consumption. The Shift Project forecast [2] shows data traffic increasing by 25% per year, associated with 9% more energy consumption per year, with digital technology reaching 8% of all GHG emissions in 2025.

Video flows represented 80% of global data flows in 2018, and this video data volume is increasing by 80% annually [2].  This exponential increase in the use of streaming video is due to (i) improvements in Internet connections and service offerings [3], (ii) the rapid development of video entertainment (e.g., video games and cloud gaming services), (iii) the deployment of Ultra High-Definition (UHD, 4K, 8K), Virtual Reality (VR), and Augmented Reality (AR), and (iv) an increasing number of video surveillance and IoT applications [4]. Interestingly, video processing and streaming generate 306 million tons of CO2, which is 20% of digital technology’s total GHG emissions and nearly 1% of worldwide GHG emissions [2].
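To make the annual growth rates quoted above more tangible, the following Python sketch converts a yearly rate into a multi-year growth factor and a doubling time. It assumes the rate stays constant over the period, which is a simplification and not a claim made by the cited reports.

```python
import math


def growth_factor(annual_rate: float, years: float) -> float:
    """Compound growth factor after `years` at a constant annual rate."""
    return (1.0 + annual_rate) ** years


def doubling_time_years(annual_rate: float) -> float:
    """Years needed to double at a constant annual growth rate."""
    if annual_rate <= 0:
        raise ValueError("annual_rate must be positive")
    return math.log(2.0) / math.log(1.0 + annual_rate)


# Rates taken from the text: data traffic, energy consumption, video data volume.
RATES = {"data_traffic": 0.25, "energy": 0.09, "video_volume": 0.80}

for name, rate in RATES.items():
    print(
        f"{name}: x{growth_factor(rate, 5):.2f} after 5 years, "
        f"doubles in {doubling_time_years(rate):.1f} years"
    )
```

Under the constant-rate assumption, a 9% yearly increase doubles energy consumption in roughly eight years, while an 80% yearly increase doubles video data volume in a little over one year.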

While research has shown that the carbon footprint of video streaming has been decreasing in recent years [5], there is still a strong need to invest in research and development of efficient next-generation computing and communication technologies for video processing. This carbon footprint reduction is due to technology efficiency trends in cloud computing (e.g., renewable power), emerging modern mobile networks (e.g., growth in Internet speed), and end-user devices (e.g., users prefer less energy-intensive mobile and tablet devices over larger PCs and laptops). However, since the demand for video streaming is growing dramatically, it raises the risk of increased energy consumption.

Investigating energy efficiency during video streaming is essential to developing sustainable video technologies. The processes from video encoding to decoding and displaying the video on the end user’s screen require electricity, which results in CO2 emissions. Consequently, the key question becomes: “How can we improve energy efficiency for video streaming systems while maintaining an acceptable Quality of Experience (QoE)?”.

Challenges and Opportunities 

In this section, we will outline challenges and opportunities to tackle the associated emissions for video streaming of (i) data centers, (ii) networks, and (iii) end-user devices [5] – presented in Figure 1.

Figure 1. Challenges and opportunities to tackle emissions for video streaming.

Data centers are responsible for the video encoding process and storage of the video content. The growing volume of video data traffic passes through data centers and drives their workloads, whose total power consumption is estimated to exceed 1,000 TWh by 2025 [6]. Data centers are the most prioritized target of regulatory initiatives. National and regional policies have been established in response to the growing number of data centers and the concern over their energy consumption [7].

  • Suitable cloud services: Select energy-optimized and sustainable cloud services to help reduce CO2 emissions. Recently, IT service providers have started innovating in energy-efficient hardware by designing highly efficient Tensor Processing Units, high-performance servers, and machine-learning approaches to optimize cooling automatically to reduce the energy consumption in their data centers [8]. In addition to advances in hardware designs, it is also essential to consider the software’s potential for improvements in energy efficiency [9].
  • Low-carbon cloud regions: IT service providers offer cloud computing platforms in multiple regions delivered through a global network of data centers. Various power plants (e.g., fuel, natural gas, coal, wind, sun, and water) supply electricity to run these data centers, generating different amounts of greenhouse gases. Therefore, it is essential to consider how much carbon is emitted by the power plants that generate electricity to run cloud services in the selected region for cloud computing. Thus, a cloud region needs to be assessed by its entire carbon footprint, including its source of energy production.
  • Efficient and fast transcoders (and encoders): Another essential factor to be considered is using efficient transcoders/encoders that can transcode/encode the video content faster and with less energy consumption but still at an acceptable quality for the end-user [10][11][12].
  • Optimizing the video encoding parameters: There is a huge potential in optimizing the overall energy consumption of video streaming by optimizing the video encoding parameters to reduce the bitrates of encoded videos without affecting quality, including choosing a more power-efficient codec, resolution, frame rate, and bitrate among other parameters.
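The low-carbon region argument above can be expressed as a simple calculation: the emissions of a workload are its energy use multiplied by the carbon intensity of the electricity in the chosen region. The Python sketch below illustrates this. The region names and intensity values are invented placeholders, not data from any cloud provider.

```python
from typing import Dict


def emissions_g(energy_kwh: float, intensity_g_per_kwh: float) -> float:
    """CO2 emissions in grams for a workload of a given energy use."""
    return energy_kwh * intensity_g_per_kwh


def lowest_carbon_region(intensities: Dict[str, float]) -> str:
    """Return the region whose electricity has the lowest carbon intensity."""
    if not intensities:
        raise ValueError("no regions given")
    return min(intensities, key=intensities.get)


# Hypothetical regions and grid carbon intensities in gCO2 per kWh (assumptions).
REGION_INTENSITY_G_PER_KWH = {
    "region-hydro": 30.0,
    "region-gas": 450.0,
    "region-coal": 800.0,
}

ENCODING_JOB_KWH = 2.0  # hypothetical energy use of one transcoding job

best_region = lowest_carbon_region(REGION_INTENSITY_G_PER_KWH)
job_emissions = {
    region: emissions_g(ENCODING_JOB_KWH, intensity)
    for region, intensity in REGION_INTENSITY_G_PER_KWH.items()
}
```

A real selection would also weigh latency, cost, and data-residency constraints, and carbon intensity typically varies over time; this sketch treats it as a fixed number per region.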

The next component within the video streaming process is video delivery within heterogeneous networks. Two essential energy consumption factors for video delivery are the network technology used and the amount of data to be transferred.

  • Energy-efficient network technology for video streaming: The network technology used to transmit data from the data center to the end-users determines energy performance, since the networks’ GHG emissions vary widely [5]. A fiber-optic network is the most climate-friendly transmission technology, with only 2 grams of CO2 per hour of HD video streaming, while a copper cable (VDSL) generates twice as much (i.e., 4 grams of CO2 per hour). UMTS data transmission (3G) produces 90 grams of CO2 per hour, reduced to 5 grams of CO2 per hour when using 5G [13]. Research therefore indicates that expanding fiber-optic networks and 5G transmission technology is promising for climate change mitigation [5].
  • Lower data transmission: Transmitting less data reduces energy consumption. Therefore, the amount of video data needs to be reduced without compromising video quality [2]. The video data per hour for various resolutions and qualities ranges from 30 MB/hr for very low resolutions to 7 GB/hr for UHD resolutions. A higher data volume requires more transmission energy. Another possibility is the reduction of unnecessary video usage, for example, by avoiding autoplay and embedded videos. Such video content aims to maximize the quantity of content consumed. Broadcasting platforms also play a central role in how viewers consume content and, thus, the impact on the environment [2].
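The per-hour figures quoted above can be compared directly. The Python sketch below uses only the values stated in the text (grams of CO2 per hour of HD streaming, and MB per hour for the lowest and highest resolutions). It assumes 1 GB = 1000 MB and that emissions scale linearly with viewing time.

```python
# Grams of CO2 per hour of HD video streaming, as quoted in the text [13].
CO2_G_PER_HOUR_HD = {
    "fiber": 2.0,
    "vdsl": 4.0,
    "umts_3g": 90.0,
    "5g": 5.0,
}


def streaming_emissions_g(technology: str, hours: float) -> float:
    """CO2 in grams for `hours` of HD streaming (linear scaling assumed)."""
    return CO2_G_PER_HOUR_HD[technology] * hours


def ratio_to_fiber(technology: str) -> float:
    """How many times more CO2 a technology emits than fiber per hour."""
    return CO2_G_PER_HOUR_HD[technology] / CO2_G_PER_HOUR_HD["fiber"]


# Data volume per hour, from the text; 1 GB = 1000 MB is an assumption.
LOW_RES_MB_PER_HOUR = 30.0
UHD_MB_PER_HOUR = 7.0 * 1000.0
uhd_to_low_res_ratio = UHD_MB_PER_HOUR / LOW_RES_MB_PER_HOUR

for tech in CO2_G_PER_HOUR_HD:
    print(f"{tech}: {streaming_emissions_g(tech, 10):.0f} g CO2 per 10 h, "
          f"{ratio_to_fiber(tech):.1f}x fiber")
```

With these figures, an hour of HD streaming over 3G emits 45 times as much CO2 as over fiber, and an hour of UHD video carries over 200 times as much data as an hour at the lowest quoted resolution.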

The last component of the video streaming process is video usage at the end-user device, including decoding and displaying the video content on the end-user devices like personal computers, laptops, tablets, phones, or television sets.

  • End-user devices: Research works [3][14] show that the end-user devices and decoding hardware account for the greatest portion of energy consumption and CO2 emission in video streaming. Thus, most reduction strategies lie within the energy efficiency of the end-user devices, for instance, by improving screen display technologies or shifting from desktops to more energy-efficient laptops, tablets, and smartphones.
  • Streaming parameters: Like end-user QoE, the energy consumption of the video decoding process depends on the video streaming parameters. Thus, it is important to intelligently select video streaming parameters to optimize the QoE and power efficiency of the end-user device. Moreover, different underlying video encoding parameters also impact the energy usage of video decoding.
  • End-user device environment: A wide variety of browsers (including legacy versions), codecs, and operating systems besides the hardware (e.g., CPU, display) determine the final power consumption.

In this column, we argue that these challenges and opportunities for green video streaming can help to gain insights that further drive the adoption of novel, more sustainable usage patterns to reduce the overall energy consumption of video streaming without sacrificing the end user’s QoE.

End-to-end video streaming: While we have highlighted the main factors of each video streaming component that impact energy consumption to create a generic power consumption model, we need to study and holistically analyze video streaming and its impact on all components. Implementing a dedicated system for optimizing energy consumption may introduce additional processing on top of regular service operations if not done efficiently. For instance, overall traffic will be reduced when using the most recent video codec (e.g., VVC) compared to AVC (the most deployed video codec to date), but its encoding and decoding complexity will be increased and, thus, require more energy.

Optimizing the video streaming parameters: There is a huge potential in optimizing the overall energy consumption for video service providers by optimizing the video streaming parameters, including choosing a more power-efficient codec implementation, resolution, frame rate, and bitrate, among other parameters.
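The end-to-end trade-off described above (a newer codec lowers network traffic but raises encoding and decoding effort) can be sketched as a small selection problem: among candidate configurations that meet a minimum quality, pick the one with the lowest total energy. All names, quality scores, and energy values in the Python sketch below are invented for illustration and are not measurements.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Candidate:
    name: str
    quality: float     # hypothetical QoE score on a 1-5 scale
    encode_wh: float   # hypothetical energy share for encoding
    network_wh: float  # hypothetical energy for delivery
    decode_wh: float   # hypothetical energy for decoding and display

    @property
    def total_wh(self) -> float:
        return self.encode_wh + self.network_wh + self.decode_wh


def pick_greenest(candidates: List[Candidate], min_quality: float) -> Optional[Candidate]:
    """Lowest total energy among candidates that meet the quality threshold."""
    eligible = [c for c in candidates if c.quality >= min_quality]
    if not eligible:
        return None
    return min(eligible, key=lambda c: c.total_wh)


# Invented example values (assumptions), chosen so that the newer codec saves
# network energy but costs more to encode and decode.
CANDIDATES = [
    Candidate("avc_1080p", quality=4.2, encode_wh=1.0, network_wh=6.0, decode_wh=2.0),
    Candidate("vvc_1080p", quality=4.2, encode_wh=5.0, network_wh=3.0, decode_wh=3.5),
    Candidate("avc_720p", quality=3.8, encode_wh=0.6, network_wh=3.5, decode_wh=1.5),
    Candidate("avc_480p", quality=3.0, encode_wh=0.4, network_wh=2.0, decode_wh=1.2),
]

choice = pick_greenest(CANDIDATES, min_quality=3.5)
```

Whether a newer codec wins in practice depends on how often each encode is reused across viewers, the network path, and the decoding hardware, so the outcome in this toy example should not be read as a general result.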

GAIA: Intelligent Climate-Friendly Video Platform 

Recently, we started the “GAIA” project to research the aspects mentioned before. In particular, the GAIA project researches and develops a climate-friendly adaptive video streaming platform that provides (i) complete energy awareness and accountability, including energy consumption and GHG emissions along the entire delivery chain, from content creation and server-side encoding to video transmission and client-side rendering; and (ii) reduced energy consumption and GHG emissions through advanced analytics and optimizations on all phases of the video delivery chain.

Figure 2. GAIA high-level approach for the intelligent climate-friendly video platform.

As shown in Figure 2, the research considered in GAIA comprises benchmarking, energy-aware and machine learning-based modeling, optimization algorithms, monitoring, and auto-tuning.

  • Energy-aware benchmarking involves a functional requirement analysis of the leading project objectives, measurement of the energy for transcoding video tasks on various heterogeneous cloud and edge resources, video delivery, and video decoding on end-user devices. 
  • Energy-aware modelling and prediction use the benchmarking results and the data collected from real deployments to build regression and machine learning models. The models predict the energy consumed by heterogeneous cloud and edge resources, possibly distributed across various clouds and delivery networks. We further provide energy models for video distribution on different channels and consider the relation between bitrate, codec, and video quality.
  • Energy-aware optimization and scheduling researches and develops appropriate generic algorithms according to the requirements for real-time delivery (including encoding and transmission) of video processing tasks (i.e., transcoding) deployed on heterogeneous cloud and edge infrastructures. 
  • Energy-aware monitoring and auto-tuning perform dynamic real-time energy monitoring of the different video delivery chains for improved data collection, benchmarking, modelling and optimization. 
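As a minimal illustration of the regression step mentioned in the list above, the Python sketch below fits a straight line to energy measurements as a function of bitrate using ordinary least squares. This is not GAIA's actual model; the measurement values are synthetic and the single-feature linear form is an assumption made for brevity.

```python
from typing import Sequence, Tuple


def fit_line(xs: Sequence[float], ys: Sequence[float]) -> Tuple[float, float]:
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    if n < 2 or n != len(ys):
        raise ValueError("need at least two paired samples")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("all x values are identical")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def predict(slope: float, intercept: float, x: float) -> float:
    """Predicted energy for a given bitrate."""
    return slope * x + intercept


# Synthetic benchmark samples (assumptions): bitrate in Mbps, energy in Wh.
BITRATES_MBPS = [1.0, 2.0, 4.0, 8.0]
ENERGY_WH = [1.5, 2.0, 3.0, 5.0]

slope, intercept = fit_line(BITRATES_MBPS, ENERGY_WH)
estimate_6mbps = predict(slope, intercept, 6.0)
```

A realistic model would include further features such as codec, resolution, frame rate, and hardware type, and would be validated against held-out measurements.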

GMSys 2023: First International ACM Green Multimedia Systems Workshop

Finally, we would like to use this opportunity to highlight and promote the first International ACM Green Multimedia Systems Workshop (GMSys’23). The GMSys’23 takes place in Vancouver, Canada, in June 2023 co-located with ACM Multimedia Systems 2023. We expect a series of at least three consecutive workshops since this topic may critically impact the innovation and development of climate-effective approaches. This workshop strongly focuses on recent developments and challenges for energy reduction in multimedia systems and the innovations, concepts, and energy-efficient solutions from video generation to processing, delivery, and consumption. Please see the Call for Papers for further details.

Final Remarks 

In both the GAIA project and ACM GMSys workshop, there are various actions and initiatives to put energy efficiency-related topics for video streaming on the center stage of research and development. In this column, we highlighted major video streaming components concerning their possible challenges and opportunities enabling energy-efficient, sustainable video streaming, sometimes also referred to as green video streaming. Having a thorough understanding of the key issues and gaining meaningful insights are essential for successful research.


[1] IPCC, 2021: Climate Change 2021: The Physical Science Basis. Contribution of Working Group I to the Sixth Assessment Report of the Intergovernmental Panel on Climate Change [Masson-Delmotte, V., P. Zhai, A. Pirani, S.L. Connors, C. Péan, S. Berger, N. Caud, Y. Chen, L. Goldfarb, M.I. Gomis, M. Huang, K. Leitzell, E. Lonnoy, J.B.R. Matthews, T.K. Maycock, T. Waterfield, O. Yelekçi, R. Yu, and B. Zhou (eds.)]. Cambridge University Press, Cambridge, United Kingdom and New York, NY, USA, In press, doi:10.1017/9781009157896.
[2] M. Efoui-Hess, Climate Crisis: the unsustainable use of online video – The practical case for digital sobriety, Technical Report, The Shift Project, July, 2019.
[3] IEA (2020), The carbon footprint of streaming video: fact-checking the headlines, IEA, Paris https://www.iea.org/commentaries/the-carbon-footprint-of-streaming-video-fact-checking-the-headlines.
[4] Cisco Annual Internet Report (2018–2023) White Paper, 2018 (updated 2020), https://www.cisco.com/c/en/us/solutions/collateral/executive-perspectives/annual-internet-report/white-paper-c11-741490.html.
[5] C. Fletcher, et al., Carbon impact of video streaming, Technical Report, 2021, https://s22.q4cdn.com/959853165/files/doc_events/2021/Carbon-impact-of-video-streaming.pdf.
[6] Huawei Releases Top 10 Trends of Data Center Facility in 2025, 2020, https://www.huawei.com/en/news/2020/2/huawei-top10-trends-datacenter-facility-2025.
[7] COMMISSION REGULATION (EC) No 642/2009, Official Journal of the European Union, 2009, https://eur-lex.europa.eu/LexUriServ/LexUriServ.do?uri=OJ:L:2009:191:0042:0052:EN:PDF.
[8] U. Hölzle, Data centers are more energy efficient than ever, Technical Report, 2020, https://blog.google/outreach-initiatives/sustainability/data-centers-energy-efficient/.
[9] Charles E. Leiserson, Neil C. Thompson, Joel S. Emer, Bradley C. Kuszmaul, Butler W. Lampson, Daniel Sanchez, and Tao B. Schardl. 2020. There’s plenty of room at the Top: What will drive computer performance after Moore’s law? Science 368, 6495 (2020), eaam9744. DOI:https://doi.org/10.1126/science.aam9744
[10] M. G. Koziri, P. K. Papadopoulos, N. Tziritas, T. Loukopoulos, S. U. Khan and A. Y. Zomaya, “Efficient Cloud Provisioning for Video Transcoding: Review, Open Challenges and Future Opportunities,” in IEEE Internet Computing, vol. 22, no. 5, pp. 46-55, Sep./Oct. 2018, doi: 10.1109/MIC.2017.3301630.
[11] J. -F. Franche and S. Coulombe, “Fast H.264 to HEVC transcoder based on post-order traversal of quadtree structure,” 2015 IEEE International Conference on Image Processing (ICIP), Quebec City, QC, Canada, 2015, pp. 477-481, doi: 10.1109/ICIP.2015.7350844.
[12] E. de la Torre, R. Rodriguez-Sanchez and J. L. Martínez, “Fast video transcoding from HEVC to VP9,” in IEEE Transactions on Consumer Electronics, vol. 61, no. 3, pp. 336-343, Aug. 2015, doi: 10.1109/TCE.2015.7298293.
[13] Federal Ministry for the Environment, Nature Conservation and Nuclear Safety, Video streaming: data transmission technology crucial for climate footprint, No. 144/20, 2020, https://www.bmuv.de/en/pressrelease/video-streaming-data-transmission-technology-crucial-for-climate-footprint/
[14] Malmodin, Jens, and Dag Lundén. 2018. “The Energy and Carbon Footprint of the Global ICT and E&M Sectors 2010–2015” Sustainability 10, no. 9: 3027. https://doi.org/10.3390/su10093027

Students Report from ACM Multimedia 2022

ACM Multimedia 2022 was held in a hybrid format in Lisbon, Portugal from October 10-14, 2022.

For many participants, this was their first in-person attendance in three years, as the strict travel restrictions associated with Covid-19 in 2020 and 2021 had made it difficult for anyone outside the host and neighbouring countries to attend on site.

In Portugal, the Covid-19 restrictions were almost lifted, and the city was bustling with tourists. Participants took care to avoid infection while enjoying Lisbon’s local wine “Vinho Verde” and cod dishes with their colleagues and engaging in lively discussions about multimedia research.

For many students, this was their first time presenting at an international conference, and it was a wonderful experience.

To encourage student authors to participate on-site, SIGMM has sponsored a group of students with Student Travel Grant Awards. Students who wanted to apply for this travel grant needed to submit an online form before the submission deadline. The selected students received either 1,000 or 2,000 USD to cover their airline tickets as well as accommodation costs for this event. Of the recipients, 25 were able to attend the conference. We asked them to share their unique experience attending ACM Multimedia 2022. In this article, we share their reports of the event.

Xiangming Gu, PhD student, National University of Singapore, Singapore

It is a great honour to receive a SIGMM Student Grant. ACM Multimedia 2022 is my first time attending an academic conference physically. During the conference, I presented my oral paper “MM-ALT: Multimodal Automatic Lyric Transcription”, which was also selected as “Top Rated Papers”. Besides the presentation, I also met a lot of people who shared similar research interests. It was very inspiring to learn from others’ papers and discuss them with the authors directly. Moreover, I was also a volunteer for ACM Multimedia 2022 and attended the session of the 5th International ACM Workshop on Multimedia Content Analysis in Sports. During the session, I learnt how to organize a workshop, which was a great exercise for me. Now, after I come back to Singapore, I still miss the conference. I wish I can get my paper accepted next year and attend the conference again.

Avinash Madasu, Computer Science Master’s student at the University of North Carolina Chapel Hill, USA.

It is my absolute honour to receive the student travel grant for attending the ACM Multimedia 2022 conference. This is the first time I have attended a top AI conference in person. I enjoyed the conference a lot and was sad that it ended so quickly. During the conference days, I was able to attend a lot of oral sessions, keynote talks and poster sessions. I was able to interact with fellow researchers from both academia and industry. I learnt a lot about exciting research going on in my area of interest as well as other areas. It was a refreshing new experience, and I hope to bring it to my research. I presented a poster and felt happy when fellow researchers appreciated my work. Apart from the technical side, I was able to forge a lot of new friendships, which I will truly cherish for my whole life.

Moreno La Quatra, PhD student, Politecnico di Torino

The ACM Multimedia 2022 conference was an amazing experience. After a few years of remote conferences, it was a pleasure to be able to attend the conference in person. I got the opportunity to meet many researchers of different seniorities and backgrounds, and I learned a lot from them. The poster sessions were one of the highlights of the conference. They were a very valuable opportunity to present interesting ideas and explore the details of other researchers’ work. I found the keynotes, presentations, and workshops to be very inspiring and engaging as well. Throughout them, I learned about specific topics and interacted with friendly, passionate researchers from around the world. I would like to thank the ACM Multimedia 2022 organization for the opportunity to attend the conference in Lisbon, all the other volunteers for their friendly and helpful attitude, and the SIGMM Student Travel Grant committee for the financial support.

Sheng-Ming Tang, Master’s student, National Tsing Hua University, Hsinchu, Taiwan

My name is Sheng-Ming Tang from National Tsing Hua University, Hsinchu, Taiwan. It is a great honour for me to receive the student travel grant. First, I want to thank the committee for organizing this fantastic event. As ACM MM 2022 was my first in-person experience presenting at a conference, I felt a little nervous at first. However, I started to feel comfortable at the conference through interacting with the amazing researchers and volunteers. It was great not only to present in front of the public but also to participate in the events. I met a lot of people who solved problems with different and creative approaches, learned brand-new mindsets from the keynote sessions, and gained abundant feedback from the audience, which will boost my research. I thank the committee again for giving me this great opportunity to present and share my work in person. I enjoyed the event a lot.

Tai-Chen Tsai, Graduate student, National Tsing Hua University, Taiwan

First, I would like to thank ACM for providing a student travel grant that allowed me to attend the conference. This was my first time presenting my work at a conference. I took part in the interactive art session. I was worried that the setup would be complicated abroad. However, as soon as I arrived at the site, volunteers assisted me with the installation. The conference provided complete hardware resources, allowing me to have a smooth and excellent exhibition experience. Also, I took the opportunity to meet many interesting researchers from different countries. The work “Emotional Machines” in the interactive art exhibition surprised me. The system collects and combines what participants are saying and their current emotions. A model transforms the data into 360-degree image content in a VR environment, so that everyone’s information forms a small universe there. The idea is creative.
Additionally, I was able to chat and discuss projects with published researchers while volunteering at workshops. They shared their lifestyle and work experiences as researchers in European countries, and we discussed what makes a study interesting and what does not. This was the best reward for me.

Bhalaji Nagarajan, PhD Student, Universitat de Barcelona, Spain

ACM Multimedia was the first big conference I was able to attend in person after two years of completely virtual participation. I presented my work both as oral and poster presentations at the Workshop on Multimedia-Assisted Dietary Management (MADiMa). It gave me an excellent opportunity to present my work and to get valuable input from reputed pioneers regarding its future scope. It gave me a new perspective and helped in expanding my technical skill set. This was also my first volunteering experience on such a massive scale. It was a great learning experience to see how conferences of such a large scale are managed.
I am very happy that I attended the conference in person. I was able to meet new people and reputed pioneers in the field, learn new things and, of course, make some new friends. A big thank you for the SIGMM Travel Grant that allowed me to attend the conference in person in Lisbon.

Kiruthika Kannan, MS by Research, International Institute of Information Technology, Hyderabad, India. 

My paper on “DrawMon: A Distributed System for Detection of Atypical Sketch Content in Concurrent Pictionary Games” was accepted at the 30th ACM International Conference on Multimedia. It was my first international conference, and I felt honoured to be able to present my research in front of experienced researchers. The conference also exhibited diverse research projects addressing fascinating scientific and technological problems. The poster sessions and talks at the conference improved my knowledge of the research trends in multimedia. In addition to this, I was able to interact with fellow researchers from diverse cultures. It was interesting to hear about their experiences and learn about their work at their institutions. As a volunteer at the conference, I witnessed the hard work of the behind-the-scenes organizers and the volunteer team to run the events smoothly. I am grateful to the SIGMM Student Travel Grant for supporting my attendance at the ACMMM 22 conference.

Garima Sharma, PhD Student, Department of Human-Centred Computing, Monash University

It was a pleasure to receive a SIGMM travel grant and to attend the ACM Multimedia 2022 conference in person. ACM Multimedia is one of the top conferences in my research area and it was my first in-person conference during my PhD. I had a great experience interacting with numerous researchers and fellow PhD students. Along with all the interesting keynotes, I attended as many oral sessions as possible. Some of these sessions were aligned with my research work and some were outside of my work. This gave me a new research perspective at different levels. Also, working with organisers in a few sessions gave me a whole new experience in managing these events. Overall, I got many insightful comments, suggestions and feedback which motivated me with some interesting directions in my research work. I would like to thank the organisers for making this year’s ACM Multimedia a wonderful experience for every attendee.

Alon Harell, PhD student at the Multimedia Lab at Simon Fraser University

I had the pleasure to receive the SIGMM Student Travel Grant and to attend and volunteer at ACM Multimedia 22 in Lisbon, Portugal. The work I submitted to the conference was done outside of my regular PhD research, and thus without this grant, I would not have been able to participate. The workshop at which I presented, ACM MM Sport 22, was incredibly eye-opening with many fantastic papers, great presentations, and above all great people with whom I was able to exchange ideas, form bonds, and perhaps even create future collaborations. The main conference, which coincides more closely with my main research on image and video coding for machines, was just as good. With fascinating talks, some in person and some virtual, I was exposed to many new ideas (or perhaps just new to me) and learned a great deal. During my PhD mentor lunch, I was also able to benefit from the generosity and experience of Prof. Chong Wah Ngo from Singapore Management University, who shared with me his thoughts on pursuing a career in academia. Overall, ACM Multimedia 22 was an especially unique experience because it was the first in-person conference I was able to attend since the beginning of the COVID-19 pandemic, and being back face-to-face with fellow researchers was a great pleasure.

Lorenzo Vaiani, Ph.D. student (1st year), Politecnico di Torino, Italy

ACM MM 2022 was my first in-person conference. Being able to present my work and discuss it with other participants in person was an incredible experience. I enjoyed every activity, from presentations and posters to workshops and demos. I received excellent feedback and new inspiration to continue my research. The best part was definitely strengthening the bonds with friends I already knew and making more with the amazing people I met there. I learned a lot from all of them. Volunteer activities helped a lot in making these kinds of connections. Thanks to the organizers for this fantastic opportunity and the SIGMM Student Travel Grant committee for the financial support. This edition of ACM MM was just the first for me, but I hope for many more in the future.

Xiaoyu Lin, third-year PhD student at Inria Grenoble, France

It was a great honour to attend ACM MM 2022 in Lisbon. It was a great experience. I met lots of nice professors and researchers. Discussing with them gave me lots of inspiration on both research directions and career development. I presented my work during the doctoral symposium and got plenty of useful feedback that can help me to improve our work. During the “Ask Me Anything” lunch, I had the chance to talk with several senior researchers. They gave me some kind and very useful advice on how to do research. Besides, I also served as a volunteer for a workshop, which helped me to meet other volunteers and make some new friends. Thanks to all the chairs and organizers who worked hard to make ACM MM 2022 such a wonderful conference. It was really an impressive experience!

Zhixin Ma, PhD student, Singapore Management University, Singapore

I would like to thank the ACM Multimedia Committee for providing me with the student travel grant so that I could attend the conference in person. ACM Multimedia is the top conference worldwide in the multimedia field. It gave me an opportunity to present my work and communicate with researchers working on the topic of multimedia search.
Besides, the excellent keynotes and the passionate panel talk painted a good picture of future research in the multimedia field. Overall, I must say that ACM MM22 was amazing and well organized. I again thank the ACM MM committee for the student travel grant, which made my attendance possible.

Report from ACM Multimedia 2022 by Nitish Nagesh

Nitish Nagesh (@nitish_nagesh) is a Ph.D. student in the Computer Science department at the University of California, Irvine, USA. He was named Best Social Media Reporter of the ACM Multimedia 2022 conference. To celebrate this award, Nitish Nagesh reports on his wonderful experience at ACM Multimedia 2022 as follows.

I was excited when our paper “World Food Atlas for Food Navigation” was accepted to the Multimedia Assisted Dietary Management Workshop (MADiMa). That it was held in conjunction with ACM Multimedia 2022, the premier multimedia conference, was the icing on the cake. That it took place in Lisbon, Portugal, was the cherry on top. It is said that a picture is worth a thousand words, so it is fitting to describe a multimedia conference experience through pictures.

Prof. Ramesh Jain organized an informal meetup at the Choupana Caffe based on the advice of Joao Magalhaes, general chair of ACMMM 2022. It was great to meet researchers working on food computing, including Prof. Yoko Yamakata, Prof. Agnieszka, and Maija Kale. It was also great to have the company of students and professors from Singapore Management University, including Prof. Chong Wah, along with Prof. Phoebe Chen. Since this was the first in-person conference for many folks, we had great conversations over waffles, pear salad and watermelon mint juice!

The MADiMa workshop and the Cooking and Eating Activities (CEA) workshop had stellar keynote speakers and presentations about topics ranging from adherence to a Mediterranean diet to mental health estimation through food images.

The workshop was at the Lisbon Congress Center. It was a treat to watch the sun shine brightly on the congress center in the morning and, only a few minutes away near the Tagus Estuary, the mellow sunset lend an orange hue to the red bridge overlooking the train tracks below.

After a great set of presentations, the MADiMa and CEA workshops were drawn to a close with a group picture of one large family of people who love food and want to help people enjoy food while maintaining their health goals. A huge shout-out to the workshop chairs Prof. Stavroula Mougiakakou, Prof. Keiji Yanai, Prof. Dario Allegra and Prof. Yoko Yamakata. (I tried my best to include a photo where everyone looks good!)

All work and no play makes us dull people! And all research with no food makes us hungry people! We had a post-workshop dinner at an authentic Portuguese restaurant. The food was great and it was a delightful evening because of the surprise treat from the professors! 

Prof. Jain’s Ph.D. talk was inspiring as he shared his personal journey that led him to focus on healthcare. He urged students in the multimedia community to pursue multimodal healthcare research as he shared his insights on building a personal health navigator.

I had signed up to be a mentee for the Ph.D. school Ask Me Anything (AMA) session. We asked Prof. Ming Dong questions about his time at graduate school, balancing teaching and research responsibilities, tips on maximizing research output and strategies to cope with rejections. He was candid in his responses and emphasized the need to focus on incremental progress while striving to do impactful research. I must thank Prof. Wei Tsang and other organizers for their leadership in organizing a first-of-its-kind session.

In between running around oral sessions, poster presentations, keynote talks, networking, grabbing lunch, and enjoying Portuguese tarts, we managed to have fun while volunteering. Huge credit to the students and staff (the Rafaels, the Diogos, the Davids, the Gustavos, the Pedros) from Nova University for doing the heavy lifting to ensure a smooth online, hybrid and in-person experience!

It was a pleasure to watch Prof. Alan Smeaton deliver an inspiring speech about the journey of information retrieval and multimedia. The community congratulates you once again on the Technical Achievement Award – more power to you, Alan!

The highlight of the conference was the grand banquet at Centro Cultural de Belém. There could not have been a better climax to the gala event than the Fado music. One aspect of Fado music symbolizes longing: the spouse sings a melancholy song when her partner sets sail on a long voyage. It is accompanied by the unique 12-string guitar and is sung very close to the audience to heighten the intimacy. I could fully relate to the artists’ melody and rhythms since I had been longing to see my family and friends back home, whom I had not visited for the past three years due to the pandemic. Another tune described the beauty of Lisbon in superlatives, including the sun shining brighter there than in any other part of the world. There was a happy ending to the tune when the artists recreated the moment of joy after the war was over and everyone was merry again. It rekindled hope and breathed new life into our cluttered worlds. For once, I was truly present in the moment!