Encouraging Scientific Collaborations with ConfFlow 2021


We often find collaborators by chance at a conference or by looking for them specifically through their papers. However, hidden potential social connections between researchers may go unnoticed, because the keywords we use do not always represent the entire space of similar research interests. As a community, Multimedia (MM) is so diverse that it is easy for community members to miss out on very useful expertise and potentially fruitful collaborations. There is a lot of latent knowledge and potential synergy that could be surfaced by offering conference attendees an alternative perspective on their similarities to other attendees. ConfFlow is an online application that offers such an alternative perspective on finding new research connections. It is designed to help researchers find others at conferences with complementary research interests for collaboration. ConfFlow takes a data-driven approach, using a method similar to the Toronto Paper Matching System (TPMS), which is used to identify suitable reviewers for papers, to construct a similarity embedding space in which researchers can find one another.
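
To illustrate the general idea (this is an illustrative sketch, not the actual TPMS or ConfFlow code), a similarity space can be built by representing each author as a tf-idf vector over the words of their recent papers and comparing authors with cosine similarity; the author names and toy word lists below are hypothetical:

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """tf-idf vector (as a sparse dict) for each tokenized document."""
    n = len(docs)
    df = Counter()
    for doc in docs:
        df.update(set(doc))
    idf = {t: math.log(n / df[t]) for t in df}
    return [{t: tf * idf[t] for t, tf in Counter(doc).items()} for doc in docs]

def cosine(u, v):
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

# Toy "research profiles": the words of each author's recent papers.
authors = {
    "alice": "video quality assessment subjective testing".split(),
    "bob":   "video streaming quality assessment adaptive".split(),
    "carol": "conversation analysis wearable social sensing".split(),
}
names = list(authors)
vecs = dict(zip(names, tfidf_vectors(list(authors.values()))))
# alice and bob share discriminative terms, so they end up closer than alice and carol.
sims = {(a, b): cosine(vecs[a], vecs[b]) for a in names for b in names if a < b}
```

In practice the vectors would come from full paper texts and a much richer model, but the resulting pairwise similarities can be embedded into a 2-D map in the same spirit.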

In this report, we discuss the follow-up to the 2020 edition of ConfFlow, which was run at MMSys, MM, and ICMR in 2021. We created a separate edition of ConfFlow for each conference, processing the accepted authors of each: 2642 (MM), 272 (MMSys), and 494 (ICMR).

Both the 2020 and 2021 editions of ConfFlow were funded by the SIGMM special initiatives fund.

New Functionality

In the 2020 edition of ConfFlow, we created an interface allowing authors at the MM 2020 conference to browse the research similarity space with others. Each user needed to claim their Google Scholar account in the application before using it. We implemented a strict privacy-sensitive policy: data of individuals was shown only if they consented to being included in the database; even public data was not shown, as processing and republishing it might be considered a privacy invasion. Unfortunately, because of this strict policy and very little uptake of the application, no user could experience the application in full. In the 2021 edition, we updated the privacy policy to be more permissive, whilst still secure (see the Privacy and Ethical Considerations section below).

From our experiences from the 2020 edition, we identified some bottlenecks that could be improved upon. To that end, we made the following augmentations:

  • Improved frontend design: We did an overhaul of the interface to make it more modern, visually appealing, and user-friendly. The design was also slightly changed to accommodate new functionalities
  • New embedding options: We added two more options for how the similarity space is formed: word2vec (tf-idf-weighted mean of word2vec embeddings, abbreviated w-mowe) and doc2vec (see Figure 1)
Figure 1. Screenshot showing the new embedding functionality (w-mowe and doc2vec)
  • Interactive tutorial for onboarding: We included an interactive tutorial that showcases the full range of functionalities to the users when they first log in (see Figure 2)
  • Direct messaging functionality: We added direct messaging to ConfFlow, allowing direct communication between attendees (see Figure 3)
  • Scaling ConfFlow and making it cheaper to run in the future: There is an economy of scale to only needing to update the ConfFlow database with conference newcomers. We made the following steps to make the process more efficient:
    • Generating a database of verified authors from the lists of SIGMM conference authors listed on the ACM website over the last 6 years.
    • A helper tool for finding the Google Scholar profiles of newcomers more quickly, as they needed to be manually verified for security reasons.
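
The tf-idf-weighted mean embedding (w-mowe) above can be sketched as follows; the word vectors and idf weights here are toy values standing in for a trained word2vec model and corpus statistics, not ConfFlow's actual model:

```python
from collections import Counter

# Toy 3-d word vectors standing in for trained word2vec embeddings.
word_vecs = {
    "video":    [0.9, 0.1, 0.0],
    "quality":  [0.8, 0.2, 0.1],
    "sensor":   [0.1, 0.9, 0.2],
    "wearable": [0.0, 0.8, 0.3],
}
# Hypothetical idf weights; in practice these come from the paper corpus.
idf = {"video": 1.0, "quality": 1.2, "sensor": 1.5, "wearable": 1.4}

def wmowe(tokens):
    """tf-idf-weighted mean of word embeddings (the 'w-mowe' option)."""
    tf = Counter(t for t in tokens if t in word_vecs)
    weights = {t: tf[t] * idf.get(t, 0.0) for t in tf}
    total = sum(weights.values())
    dim = len(next(iter(word_vecs.values())))
    vec = [0.0] * dim
    for t, w in weights.items():
        for i in range(dim):
            vec[i] += w * word_vecs[t][i]
    return [x / total for x in vec] if total else vec

author_a = wmowe("video quality quality".split())   # systems-flavoured profile
author_b = wmowe("wearable sensor sensor".split())  # sensing-flavoured profile
```

Each author's paper text thus collapses to a single dense vector; doc2vec instead learns such a document vector directly.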



Rather than covering only ACM MM as in 2020, ConfFlow was rolled out to three conferences: MMSys 2021 (Istanbul, Turkey) in September, Multimedia 2021 (Chengdu, China) in October, and ICMR 2021 (Taipei, Taiwan) in November. MMSys and MM were organized as hybrid events, whilst ICMR was ultimately held virtually after being rescheduled twice.

We asked the general and program chairs of each conference to provide the author lists of the accepted papers at least one month before the conference started. This was a compromise between obtaining just the actual conference attendees (which would have made social connection easier if the conferences had been in-person only) and getting conference-relevant participants sufficiently ahead of time to disambiguate identities and start the time-consuming computation of the embedding spaces. Given that MMSys and Multimedia were hybrid, waiting for the final registration list would have meant waiting until very close to the conference itself; in any case, the hybrid nature of the conferences made virtual social connection the more viable option, and using the attendee list would have made it harder to pre-announce the application just before the conference started. Given also that the conference organizers were very occupied with handling the many uncertainties of conference organization during the pandemic, we decided that obtaining the author lists was the least risky approach.

Aside from getting the author lists, we also asked the conference organizers for support in disseminating the application to the conference attendees. A separate edition of ConfFlow needed to be generated for each conference. The following strategies were used for disseminating the application via the conference directly and from a personal account:

  • MMSys: Slack channel, Twitter (conference, personal, and SIGMM accounts), WeChat (Weixin), Weibo, Facebook, presentation slides during conference general announcements
  • ACM MM: Twitter (conference and SIGMM accounts), Whova, a presentation slide during the conference banquet
  • ICMR: Twitter (conference, personal, and SIGMM accounts).

Compared to last year, we tried to catch people’s attention with a more comprehensive dissemination strategy, including short, catchy explanatory videos communicating the functionalities of the application. These were embedded in our social media campaigns.

Following on from that, we issued an online survey to gauge how people in the community at large felt about social interaction and, if they had used ConfFlow, what their experience of the app was. It was sent by email shortly after the conference to all those who used the application, with a reminder one week later. Posts were also sent out on Twitter and Facebook to encourage people in the community to fill in the survey even if they had not used ConfFlow. The survey was divided into questions about collaboration in general, the experience of using ConfFlow, and how the application experience could be changed. Further details about the questions are given in the Appendix.

Privacy and Ethical Considerations

The first edition of ConfFlow (2020) had a very restrictive opt-in-only policy. This made the visualization hard to use for interested users, severely hindering the user experience: users unanimously asked to see the other researchers in the community. Therefore, in the 2021 edition, any already publicly available information from a user’s Google Scholar account or the ACM website, and visualizations derived from it, was displayed to everyone. Information that is not publicly available online, such as individual usage behaviour, visualization options, and whether a ConfFlow account is activated, is not shown publicly.

Application Realization

For security reasons, a user cannot use ConfFlow until they have claimed their account. This is needed because each account has preferences related to the ConfFlow interface – settings such as hiding particular researchers or marking researchers as ‘favourites’ – as well as the direct messaging functionality. We used strict security procedures when building ConfFlow, which meant that a user’s identity needed to be verified when claiming an account before their preferences could be retrieved. We do this by associating the author’s name and affiliation with a Google Scholar profile; the user then verifies their identity with respect to that Google Scholar account. In some cases, it is necessary to manually assign an author to a Google Scholar profile because there are too many profiles with the same name; conversely, many author names can sometimes be associated with the same Google Scholar account. To this end, one of the main new functionalities was the creation of a database of all SIGMM community members who had published at the MM conference recently. That way, a name and Google Scholar profile only need to be associated once and can easily be re-used in future editions of ConfFlow. The manual effort involved varied across the three conferences for which ConfFlow was created; we elaborate on this below. An additional helper function was created to allow faster manual verification in cases of ambiguity.
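
A matching step of this kind can be sketched as follows; this is a hypothetical illustration (the field names, weights, and thresholds are our own invention, not ConfFlow's implementation), showing how a candidate profile can be accepted automatically or flagged for manual verification:

```python
from difflib import SequenceMatcher

def _sim(a, b):
    """Simple string similarity in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def score_candidate(author, profile):
    """Weighted string similarity between an author record and a
    candidate Scholar profile (weights are illustrative guesses)."""
    return 0.6 * _sim(author["name"], profile["name"]) + \
           0.4 * _sim(author["affiliation"], profile["affiliation"])

def match_author(author, profiles, accept=0.8, margin=0.1):
    """Return (best_profile, 'ok') for a confident unique match, or
    (best_profile, 'ambiguous') when a human should verify the match."""
    scored = sorted(((score_candidate(author, p), i)
                     for i, p in enumerate(profiles)), reverse=True)
    best_score, best_i = scored[0]
    runner_up = scored[1][0] if len(scored) > 1 else 0.0
    if best_score >= accept and best_score - runner_up >= margin:
        return profiles[best_i], "ok"
    return profiles[best_i], "ambiguous"
```

The `margin` check is what catches the "too many profiles with the same name" case: two near-identical candidates both score highly, so the match is routed to manual review.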

ConfFlow at ACM SIGMM

We describe some statistics for each edition of ConfFlow at the three conferences of SIGMM in 2021: MMSys, MM, and ICMR. We list them in chronological order of when the conference occurred in the calendar year.


ACM MMSys

The author list provided by the General Chairs of MMSys had 272 unique authors. As shown in Figure 4, we were able to identify the Google Scholar accounts of 158 authors. 145 of these accounts were identified automatically using the provided author information: name, affiliation, and e-mail domain. 13 accounts identified by the automatic process were tagged as ambiguous and required manual validation.

Figure 4. Author statistics for ACM MMSys‘21

We created ConfFlow accounts for 145 identified authors. As shown in Figure 5, 18 users claimed their accounts and used ConfFlow during the conference. Further analysis showed that 7 out of 18 users were newcomers to MMSys, i.e., it was their first publication at this conference.

After sending out the survey request to the 18 users after the conference, we obtained 1 survey response from a PhD student. Due to the low response rate, we do not report the responses.

Figure 5. User statistics for ConfFlow-MMSys‘21

The similarity space visualized in ConfFlow is based on the publications of authors in the last two years. Figure 6 shows the distribution of the number of papers MMSys authors published in the last two years; for each identified author, we take all the papers they published in this period to generate the latent representation of their research interests. It was particularly interesting to see how many researchers published 30 or more papers in the last two years. They account for a significant proportion of the authors of the conference, who may be too busy to find new research connections. However, there is also a significant proportion of researchers publishing fewer than 30 papers in the last two years who could find ConfFlow useful.

Figure 6. Histogram of the number of publications in the last 2 years for MMSys’21 authors.

ACM Multimedia

We realized that users without a Google Scholar profile could not use ConfFlow at all, so for the Multimedia edition we added a view-only (guest account) option of ConfFlow and advertised it on social media accordingly. This view-only account also allowed researchers who did not want to claim their account to browse the embedding space. The disadvantage of this approach is that the application does not immediately centre on the user in the embedding space. Given the large number of authors at Multimedia, this made it extremely hard for view-only users to find themselves, which may have made it harder for them to appreciate the utility of the application.

As shown in Figure 7, the author list provided by the General Chairs of Multimedia had 2642 unique authors. We were able to identify the Google Scholar accounts of 1608 authors. 1213 of these accounts were identified automatically using the provided author information: name, affiliation, and e-mail domain. 225 authors were already identified in the previous iterations of ConfFlow for ACM MMSys ‘21 and MM ‘20. We then manually analyzed the remaining 1204 authors that were either tagged as ambiguous matches by the automatic process or returned no matches at all. We were able to identify an additional 170 accounts with the manual search. This highlights how challenging it is to establish an online identity for all authors in order for them to use ConfFlow, even with manual intervention.

Figure 7. Author statistics for MM’21

We created ConfFlow accounts for the identified authors. As shown in Figure 8, 16 users claimed their accounts and used ConfFlow during the conference. Further analysis showed that 9 out of 16 users were newcomers to Multimedia, i.e., it was their first publication at this conference. 5 attendees requested access to the guest account.

Figure 8. User statistics for ConfFlow-MM’21.

Figure 9 shows the distribution of the number of papers Multimedia 2021 authors published in the last two years. It is interesting to see a distribution more skewed towards people with fewer publications compared to the MMSys edition. This would suggest that there are potentially more researchers who would find ConfFlow interesting as a social connection tool. However, both MMSys and Multimedia had very similar numbers of users despite Multimedia being almost 10 times bigger. This may be related to the fact that we were able to be in closer communication with the general chairs of MMSys, who gave us access to more channels of communication (including a slide announcement during the conference opening). Meanwhile, at MM, the initial dissemination via Whova (which was the first line of attack) did not yield any new users at all, and the Multimedia social media feed (Twitter) had very few followers – this could be explained by the fact that Twitter is not used by many of our colleagues in Asia and Multimedia was being run in Chengdu. We do not have statistics on the proportion of remote vs. in-person attendees, which may also have affected usage.

Figure 9. Histogram of the number of publications in the last 2 years for all identified authors of MM’21.


ICMR

The author list provided by the General Chairs of ICMR had 494 unique authors. As shown in Figure 10, we were able to identify the Google Scholar accounts of 286 authors. 162 of these accounts were identified automatically using the provided author information: name, affiliation, and e-mail domain. 67 authors were already identified in the previous iterations of ConfFlow. We then manually analyzed the remaining 265 authors that were either tagged as ambiguous matches by the automatic process or returned no matches at all. We were able to identify an additional 57 accounts with the manual search.

Figure 10. Author statistics for ICMR ‘21

None of the users claimed their ConfFlow account during ICMR’21. Figure 11 shows the distribution of the number of papers ICMR authors published in the last two years. It is interesting that there was no uptake at all, despite ICMR being almost double the size of MMSys and five times smaller than Multimedia.

Figure 11. Histogram of number of publications in the last 2 years for all identified authors of ICMR ‘21.

Discussion and Recommendations

This section describes some key points of reflection on the running of ConfFlow this year. 

One of the main issues relates to the low number of users despite conference participants being aware of the application. The survey on collaboration and experience with ConfFlow did not yield sufficient responses. 

It is interesting to see that in all conferences a significant proportion of the users of ConfFlow were newcomers. Unfortunately, without the statistics from the survey we put out, it is not clear whether this reflects the distribution of the conference attendees in general or whether newcomers are more interested in using ConfFlow due to its promise of helping people to connect socially.

The reasons for the low usage could be multiple. The hybrid and virtual formats of the conferences made it difficult to find time to think about collaborations while preparing to attend a conference or during the conference itself. For virtual participants in particular, the benefit of not travelling is that one can continue with the day-to-day duties of one’s normal job; however, this takes away opportunities for social networking that exist in the in-person setting. In addition, the challenges of running the conferences in a hybrid format may also have led to fatigue for in-person as well as virtual participants. Another possible explanation is that the general Multimedia community sees no obvious intrinsic value in changing the way collaboration is already carried out. The additional barrier of needing to claim an account, required for privacy and ethical reasons, may also have been confusing (it could appear that an account needs to be created, which can be a barrier to usage).

We suspect that the larger number of users at MMSys could be related to the closer access we had to social media channels, e.g. the conference Slack channel, which acted as a centralized reminder for participants of what was going on in the conference. It could also reflect the openness of that community to finding social connections. On the other hand, the Whova app used for MM has a more complex interface with multiple purposes beyond communication, which may have made it harder for attendees to see the ConfFlow announcement embedded among other announcements.

Finally, we also considered that the ConfFlow interface takes time to browse and reflect on. Given that the intrinsic value of the application is not immediately obvious to many (this is our interpretation of the low interest in using it), it could make more sense to have a SIGMM community-wide edition of ConfFlow that is available all year round, allowing the application and its purpose to be disseminated outside of the pre-conference rush; conference-specific editions could then be generated from it. This, however, comes with its own logistical issues: every new identity added to the database would either require the entire embedding to be recomputed, or their latent research-interest representation would need to be projected directly onto the existing embedding, which does not necessarily accurately represent their closeness to others in the existing database. The rate at which new authors are added would also require significant manual attention (and may not be easy to resolve, as shown in the statistics in Table 1). Given also the popularity of the Influence Flowers (http://influencemap.ml/), a previously funded SIGMM initiative, we suspect that a more ego-based strategy may be more effective in encouraging researchers in the community to start engaging with ConfFlow.
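
One simple out-of-sample projection of the kind mentioned here is to place a newcomer at the similarity-weighted average of the 2-D map positions of their nearest neighbours in the latent space. This is a minimal sketch under that assumption (a k-NN barycentric projection, not ConfFlow's actual update mechanism); it avoids recomputing the whole embedding but, as noted above, inherits any distortion already present in the map:

```python
import math

def cos_sim(u, v):
    """Cosine similarity between two dense vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def project_newcomer(new_vec, latent, coords, k=3):
    """Place a newcomer on an existing 2-D map by averaging the map
    coordinates of their k nearest neighbours in latent space,
    weighted by cosine similarity (a simple out-of-sample extension)."""
    top = sorted(((cos_sim(new_vec, latent[a]), a) for a in latent),
                 reverse=True)[:k]
    total = sum(max(s, 0.0) for s, _ in top) or 1.0
    x = sum(max(s, 0.0) * coords[a][0] for s, a in top) / total
    y = sum(max(s, 0.0) * coords[a][1] for s, a in top) / total
    return x, y
```

A newcomer whose latent vector resembles two existing authors will land between their map positions, far from dissimilar authors.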

ConfFlow factors \ Conference                            MMSys   Multimedia   ICMR
#previously identified authors                               0          225    286
#authors with automatically identifiable Google Scholar    158         1213     67
#authors without a Google Scholar match                     13         1204    265
#authors with manually identified Google Scholar            13          170     57
#survey respondents                                          1            0      0

Table 1.  Summary statistics for each of the three conferences.


The ConfFlow 2021 edition generated new functionalities to allow researchers to browse their research interests with respect to others in a fun and novel way. More effort was given this year to improve the advertising of the application and to try and understand the community’s struggles with collaboration. Steps were also taken to make the running of ConfFlow less labour-intensive. 

Our conclusion from the many efforts made for ConfFlow 2021, the surrounding social media presence, and the survey is that, for the SIGMM population at large, encouraging more social connections outside of the normal routes is unfortunately not perceived to have significant value. It seems that, for now, more immediate forms of encouragement, e.g. initiatives during the conference to help newcomers integrate, may be a more effective route to social integration. Another option is a hybrid approach in which ConfFlow is used, for example, to identify groups for going to dinner together during the conference or for sitting at the same table during the conference banquet; however, this would still require sufficient uptake of the application. Given the myriad of motivations community members have for attending conferences, it remains an intriguing and open challenge to encourage more diverse research output from this highly interdisciplinary community.


ConfFlow 2021 was supported in part by the SIGMM Special Initiatives Fund and the Dutch NWO-funded MINGLE project number 639.022.606. We thank users who gave feedback on the application during prototyping and implementation and the General Chairs of ACM MMSys, Multimedia, and ICMR 2021 for their support.


Ekin Gedik and Hayley Hung. 2020. ConfFlow: A Tool to Encourage New Diverse Collaborations. Proceedings of the 28th ACM International Conference on Multimedia. Association for Computing Machinery, New York, NY, USA, 4562–4564. DOI:https://doi.org/10.1145/3394171.3414459


Appendix: List of Survey Questions used for our Google form


  • Context Questions
    • I am attending these conferences in 2021
    • I am publishing in these conferences in 2021
    • Please indicate the job description that best describes you.
  • General Questions about Scientific Collaboration
    • I tend to initiate collaborations with people I already know well.
    • I tend to initiate collaborations with people at the same experience level as me.
    • I am very interested in finding collaborators from a different discipline.
    • I find it very hard to identify relevant collaborators from a different discipline.
    • I find it very hard to initiate interdisciplinary collaborations even when I know who I want to work with.
    • What are the common problems you face when trying to initiate a collaboration?
    • Do these problems influence how or whether you initiate collaborations?
  • Initial contact with ConfFlow:
    • I saw announcements encouraging me to try ConfFlow
    • Did you have problems in getting in to ConfFlow? e.g. the system could not find your Google Scholar account?
    • On how many separate occasions have you used ConfFlow?
  • Motivation for using ConfFlow
    • I did not use ConfFlow because I did not have time.
    • I did not use ConfFlow because I did not find it interesting.
    • I would be interested in trying ConfFlow in the weeks leading up to or following a conference.
    • Despite not using ConfFlow, I could see how it might help advance my research work.
    • We would be very grateful for any comments or feedback on your experience of ConfFlow so we can make it more useful. Please feel free to share any remarks you might have on this topic.
  • Experience using ConfFlow
    • The visualization matched who I would expect to be close to me.
    • The visualization matched who I would expect to be far away from me.
    • ConfFlow helped me to find interesting people that I did not know before.
    • ConfFlow helped me to connect with interesting people that I did not know before.
    • ConfFlow encouraged me to think more deliberately about making connections with researchers in a different discipline.
    • I think that ConfFlow could help to advance my research work.

VQEG Column: VQEG Meeting Dec. 2021 (virtual/online)


Welcome to a new column on the ACM SIGMM Records from the Video Quality Experts Group (VQEG).
The last VQEG plenary meeting took place from 13 to 17 December 2021, organized online by the University of Surrey, UK. Over five days, more than 100 participants (from more than 20 countries across the Americas, Asia, Africa, and Europe) remotely attended the multiple sessions related to the active VQEG projects, which included more than 35 presentations and interesting discussions. This column provides an overview of this VQEG plenary meeting; all the information, minutes, and files (including the presented slides) from the meeting are available online on the VQEG meeting website.

Group picture of the VQEG Meeting 13-17 December 2021

Many of the works presented in this meeting can be relevant for the SIGMM community working on quality assessment. Particularly interesting are the new analyses and methodologies discussed within the Statistical Analysis Methods group, the new metrics and datasets presented within the No-Reference Metrics group, and the progress on the plans of the 5G Key Performance Indicators group and the Immersive Media group. We encourage those readers interested in any of these activities to check the working groups’ websites and subscribe to the corresponding reflectors, to follow them and get involved.

Overview of VQEG Projects

Audiovisual HD (AVHD)

The AVHD group investigates improved subjective and objective methods for analyzing commonly available video systems. In this sense, it has recently completed a joint project between VQEG and ITU SG12 in which 35 candidate objective quality models were submitted and evaluated through extensive validation tests. The result was the ITU-T Recommendation P.1204, which includes three standardized models: a bit-stream model, a reduced reference model, and a hybrid no-reference model. The group is currently considering extensions of this standard, which originally covered H.264, HEVC, and VP9, to include other encoders, such as AV1. Apart from this, two other projects are active under the scope of AVHD: QoE Metrics for Live Video Streaming Applications (Live QoE) and Advanced Subjective Methods (AVHD-SUB).

During the meeting, three presentations related to AVHD activities were given. In the first one, Mikolaj Leszczuk (AGH University) presented their work on secure and reliable delivery of professional live transmissions with low latency, which brought to the floor the constant need for video datasets, such as the VideoSet. In addition, Andy Quested (ITU-R Working Party 6C) led a discussion on how to assess video quality for very high-resolution (e.g., 8K, 16K, or 32K) monitors with interactive applications, which raised the discussion on the key possibility of zooming in to absorb the details of the images without pixelation. Finally, Abhinau Kumar (UT Austin) and Cosmin Stejerean (Meta) presented their work on reducing the complexity of VMAF by using features in the wavelet domain [1].

Quality Assessment for Health applications (QAH)

The QAH group works on the quality assessment of health applications, considering both subjective evaluation and the development of datasets, objective metrics, and task-based approaches. This group was recently launched and, for the moment, has been working on a topical review paper on the objective quality assessment of medical images and videos, which was submitted in December to Medical Image Analysis [2]. Rafael Rodrigues (Universidade da Beira Interior) and Lucie Lévêque (Nantes Université) presented the main details of this work in a presentation scheduled during the QAH session. The presentation also included information about the review paper published by some members of the group on methodologies for the subjective quality assessment of medical images [3] and the efforts in gathering datasets to be listed on the VQEG datasets website. In addition, Lu Zhang (IETR – INSA Rennes) presented her work on model observers for the objective quality assessment of medical images from task-based approaches, considering three tasks: detection, localization, and characterization [4]. It is also worth noting that members of this group are organizing a special session on “Quality Assessment for Medical Imaging” at the IEEE International Conference on Image Processing (ICIP), which will take place in Bordeaux (France) from 16 to 19 October 2022.

Statistical Analysis Methods (SAM)

The SAM group works on improving analysis methods both for the results of subjective experiments and for objective quality models and metrics. Currently, they are working on statistical analysis methods for subjective tests, which are discussed in their monthly meetings.

In this meeting, there were four presentations related to SAM activities. In the first one, Zhi Li and Lukáš Krasula (Netflix) exposed the lessons they learned from the subjective assessment test carried out during the development of their metric Contrast Aware Multiscale Banding Index (CAMBI) [5]. In particular, they found that some subjective tests can have perceptually unbalanced stimuli, which can cause systematic and random errors in the results. They explained the statistical data analyses they used to mitigate these errors, such as the techniques in ITU-T Recommendation P.913 (section 12.6), which can reduce the effects of the random error. The second presentation described the work by Pablo Pérez (Nokia Bell Labs), Lucjan Janowski (AGH University), Narciso Garcia (Universidad Politécnica de Madrid), and Margaret H. Pinson (NTIA/ITS) on a novel subjective assessment methodology with few observers with repetitions (FOWR) [6]. Apart from the description of the methodology, the dataset generated from the experiments is available on the Consumer Digital Video Library (CDVL). They also launched a call for other labs to repeat their experiments, which will help to discover the viability, scope, and limitations of the FOWR method and, if appropriate, to include it in ITU-T Recommendation P.913 for quasi-experimental assessments when it is not possible to have 16 to 24 subjects (e.g., pre-tests, expert assessments, and resource-limited situations); for example, performing the experiment with 4 subjects 4 times each on different days would be similar to a test with 15 subjects. In the third presentation, Irene Viola (CWI) and Lucjan Janowski (AGH University) presented their analyses of the standardized methods for subject removal in subjective tests.
In particular, the methods proposed in Recommendations ITU-R BT.500 and ITU-T P.913 were considered. The analysis showed that the first one (described in Annex 1 of Part 1) is not recommended for Absolute Category Rating (ACR) tests, while the method described in the second recommendation performs well, although further investigation into the correlation threshold used to discard subjects is required. Finally, the last presentation led the discussion on the future activities of the SAM group, where different possibilities were proposed, such as the analysis of confidence intervals for subjective tests, new methods for comparing subjective tests from more than two labs, how to extend these results to better understand the precision of objective metrics, and research on crowdsourcing experiments to make them more reliable and cost-effective. These new activities are discussed in the monthly meetings of the group.

Computer Generated Imagery (CGI)

The CGI group focuses on the quality analysis of computer-generated imagery, with a particular focus on gaming. Currently, the group is working on topics related to ITU work items, such as ITU-T Recommendation P.809 with the development of a questionnaire for interactive cloud gaming quality assessment, ITU-T Recommendation P.CROWDG related to quality assessment of gaming through crowdsourcing, ITU-T Recommendation P.BBQCG with a bit-stream-based quality assessment of cloud gaming services, and a codec comparison for computer-generated content. In addition, a presentation was delivered during the meeting by Nabajeet Barman (Kingston University/Brightcove), who presented the subjective results related to the work presented at the last VQEG meeting on the use of LCEVC for Gaming Video Streaming Applications [7]. For more information on the related activities, do not hesitate to contact the chairs of the group.

No Reference Metrics (NORM)

The NORM group is an open collaborative project for developing no-reference metrics for monitoring visual service quality. Currently, two main topics are being addressed by the group, which are discussed in regular online meetings. The first one is related to the improvement of the SI/TI metrics to resolve ambiguities that have appeared over time, with the objective of providing reference software and updating ITU-T Recommendation P.910. The second one concerns the addition of standard metadata with video-quality-assessment-related information to encoded video streams.

In this meeting, this group was one of the most active in terms of presentations, with 11 presentations on related topics. Firstly, Lukáš Krasula (Netflix) presented their Contrast Aware Multiscale Banding Index (CAMBI) [5], an objective quality metric that addresses banding degradations that are not detected by other metrics, such as VMAF and PSNR (the code is available on GitHub). Mikolaj Leszczuk (AGH University) presented their work on the automatic detection of User-Generated Content (UGC) in the wild. Also, Vignesh Menon and Hadi Amirpour (AAU Klagenfurt) presented their open-source project on the analysis and online prediction of video complexity for streaming applications. Jing Li (Alibaba) presented their work on the perceptual quality assessment of internet videos [8], proposing a new objective metric (STDAM, for the moment used internally) validated on the Youku-V1K dataset. The next presentation, delivered by Margaret Pinson (NTIA/ITS), provided a comprehensive analysis of why no-reference metrics fail, which emphasized the need to train these metrics on several datasets and test them on larger ones. The discussion also pointed out the recommendation for researchers to publish their metrics as open source in order to make them easier to validate and improve. Moreover, Balu Adsumilli and Yilin Wang (YouTube) presented a new no-reference metric for UGC, called YouVQ, based on a transfer-learning approach with pre-training on non-UGC data and re-training on UGC. This metric will be released as open source shortly, and a dataset with videos and subjective scores has also been published.
Also, Margaret Pinson (NTIA/ITS), Mikołaj Leszczuk (AGH University), Lukáš Krasula (Netflix), Nabajeet Barman (Kingston University/Brightcove), Maria Martini (Kingston University), and Jing Li (Alibaba) presented a collection of datasets for no-reference metric research, while Shahid Satti (Opticom GmbH) presented their work on encoding complexity for short video sequences. For his part, Franz Götz-Hahn (Universität Konstanz/Universität Kassel) presented their work on the creation of the KonVid-150k video quality assessment dataset [9], which can be very valuable for training no-reference metrics, and on the development of objective video quality metrics. Finally, regarding the two active topics within the NORM group mentioned above, Ioannis Katsavounidis (Meta) provided a presentation on the advances of the activity related to the inclusion of standard video quality metadata, while Lukáš Krasula (Netflix), Cosmin Stejerean (Meta), and Werner Robitza (AVEQ/TU Ilmenau) presented updates on the improvement of the SI/TI metrics for modern video systems.

Joint Effort Group (JEG) – Hybrid

The JEG group focuses on joint work to develop hybrid perceptual/bitstream metrics and on the creation of a large dataset for training such models using full-reference metrics instead of subjective scores. In this context, a collaborative project with Sky was completed and presented at the last VQEG meeting.

Related activities were presented in this meeting. In particular, Enrico Masala and Lohic Fotio Tiotsop (Politecnico di Torino) presented the updates on the recent activities carried out by the group, and their work on artificial-intelligence observers for video quality evaluation [10].

Implementer’s Guide for Video Quality Metrics (IGVQM)

The IGVQM group, whose activity started at the VQEG meeting in December 2020, works on creating an implementer's guide for video quality metrics. Its current goal is to produce a report on the accuracy of video quality metrics, following a test plan based on collecting datasets, collecting metrics and assessment methods, and carrying out statistical analyses. Ioannis Katsavounidis (Meta) provided an update on the progress, and an open call was issued for the community to contribute to this activity with datasets and metrics.

5G Key Performance Indicators (5GKPI)

The 5GKPI group studies the relationship between the key performance indicators of new communication networks (especially 5G) and the QoE of the video services running on top of them. Currently, the group is working on the definition of relevant use cases, which are discussed in monthly audio calls.

In relation to these activities, there were four presentations during this meeting. Werner Robitza (AVEQ/TU Ilmenau) presented a proposal for a KPI message format for gaming QoE over 5G networks. Also, Pablo Pérez (Nokia Bell Labs) presented their work on a parametric quality model for teleoperated driving [11] and an update on the ITU-T GSTR-5GQoE topic, related to the QoE requirements for real-time multimedia services over 5G networks. Finally, Margaret Pinson (NTIA/ITS) presented an overall description of 5G technology, including how differences in spectrum allocation per country impact the propagation, responsiveness and throughput of 5G devices.

Immersive Media Group (IMG)

The IMG group researches the quality assessment of immersive media. The group recently finished the test plan for the quality assessment of short 360-degree video sequences, which resulted in support for the development of ITU-T Recommendation P.919. Currently, the group is working on further analyses of the data gathered from the subjective tests carried out for that test plan and on the analysis of data for the quality assessment of long 360-degree videos. In addition, members of the group are contributing to ITU-T SG12 on the topic G.CMVTQS, on computational models for QoE/QoS monitoring to assess video telephony services. Finally, the group is also working on the preparation of a test plan for evaluating the QoE of immersive and interactive communication systems, which was presented by Pablo Pérez (Nokia Bell Labs) and Jesús Gutiérrez (Universidad Politécnica de Madrid). Interested readers are encouraged to contact them to join the effort.

During the meeting, there were also four presentations covering related topics. Firstly, Alexander Raake (TU Ilmenau) provided an overview of the projects within the AVT group dealing with the QoE assessment of immersive media. Also, Ashutosh Singla (TU Ilmenau) presented a 360-degree video database with higher-order ambisonics spatial audio. Maria Martini (Kingston University) presented an update on the IEEE standardization activities on Human Factors for Visual Experiences (HFVE), such as the recently submitted draft standard on deep-learning-based quality assessment and the draft standard, to be submitted shortly, on the quality assessment of light field content. Finally, Kjell Brunnström (RISE) presented their work on legibility in virtual reality, also addressing the perception of speech-to-text by deaf and hard-of-hearing users.

Intersector Rapporteur Group on Audiovisual Quality Assessment (IRG-AVQA) and Q19 Interim Meeting

Although there was no official IRG-AVQA meeting on this occasion, there were various presentations related to ITU activities addressing QoE evaluation topics. Chulhee Lee (Yonsei University) presented an overview of ITU-R activities, with a special focus on the quality assessment of HDR content, and, together with Alexander Raake (TU Ilmenau), presented an update on ongoing ITU-T activities.

Other updates

All the sessions of this meeting, and thus the presentations, were recorded and have been uploaded to YouTube. It is also worth noting that the anonymous FTP server will be closed soon; until then, files and presentations can be accessed from older browsers or via an FTP client. All the files, including those corresponding to previous VQEG meetings, will be embedded into the VQEG website over the next months. In addition, the GitHub repository with tools and subjective lab setups is still online and kept up to date. Moreover, during this meeting, it was decided to close the Joint Effort Group (JEG) and the Independent Lab Group (ILG), which can be re-established when needed. Finally, although there were not many activities in this meeting within the Quality Assessment for Computer Vision Applications (QACoViA) and the Psycho-Physiological Quality Assessment (PsyPhyQA) groups, they are still active.

The next VQEG plenary meeting will take place in Rennes (France) from 9 to 13 May 2022 and will again be face-to-face, after four online meetings.


[1] A. K. Venkataramanan, C. Stejerean, A. C. Bovik, “FUNQUE: Fusion of Unified Quality Evaluators”, arXiv:2202.11241, submitted to the IEEE International Conference on Image Processing (ICIP), 2022.
[2] R. Rodrigues, L. Lévêque, J. Gutiérrez, H. Jebbari, M. Outtas, L. Zhang, A. Chetouani, S. Al-Juboori, M. G. Martini, A. M. G. Pinheiro, “Objective Quality Assessment of Medical Images and Videos: Review and Challenges”, submitted to Medical Image Analysis, 2022.
[3] L. Lévêque, M. Outtas, L. Zhang, H. Liu, “Comparative study of the methodologies used for subjective medical image quality assessment”, Physics in Medicine & Biology, vol. 66, no. 15, Jul. 2021.
[4] L. Zhang, C. Cavaro-Ménard, P. Le Callet, “An overview of model observers”, Innovation and Research in Biomedical Engineering, vol. 35, no. 4, pp. 214-224, Sep. 2014.
[5] P. Tandon, M. Afonso, J. Sole, L. Krasula, “CAMBI: Contrast-Aware Multiscale Banding Index”, Picture Coding Symposium (PCS), Jul. 2021.
[6] P. Pérez, L. Janowski, N. García, M. Pinson, “Subjective Assessment Experiments That Recruit Few Observers With Repetitions (FOWR)”, IEEE Transactions on Multimedia (Early Access), Jul. 2021.
[7] N. Barman, S. Schmidt, S. Zadtootaghaj, M. G. Martini, “Evaluation of MPEG-5 Part 2 (LCEVC) for Live Gaming Video Streaming Applications”, Proceedings of the Mile-High Video Conference, Mar. 2022.
[8] J. Xu, J. Li, X. Zhou, W. Zhou, B. Wang, Z. Chen, “Perceptual Quality Assessment of Internet Videos”, Proceedings of the ACM International Conference on Multimedia, Oct. 2021.
[9] F. Götz-Hahn, V. Hosu, H. Lin, D. Saupe, “KonVid-150k: A Dataset for No-Reference Video Quality Assessment of Videos in-the-Wild”, IEEE Access, vol. 9, pp. 72139-72160, May 2021.
[10] L. F. Tiotsop, T. Mizdos, M. Barkowsky, P. Pocta, A. Servetti, E. Masala, “Mimicking Individual Media Quality Perception with Neural Network based Artificial Observers”, ACM Transactions on Multimedia Computing, Communications, and Applications, vol. 18, no. 1, Jan. 2022.
[11] P. Pérez, J. Ruiz, I. Benito, R. López, “A parametric quality model to evaluate the performance of tele-operated driving services over 5G networks”, Multimedia Tools and Applications, Jul. 2021.

What is the trade-off between CO2 emission and video-conferencing QoE?

Users of multimedia services naturally want the highest possible Quality of Experience (QoE) when using those services. This is especially true for video-conferencing and video streaming services, which are nowadays a large part of many users' daily lives, be it work-related Zoom calls or relaxing in front of Netflix. This has implications in terms of the energy consumed to provide those services (think of the cloud services involved, the networks, and the users' own devices), and therefore it also has an impact on the resulting CO₂ emissions. In this column, we look at the potential trade-offs between varying levels of QoE (which for video services is strongly correlated with the bit rates used) and the resulting CO₂ emissions. We also look at other factors that should be taken into account when making decisions based on these calculations, in order to provide a more holistic view of the environmental impact of these types of services and of whether they have a significant impact.

Energy Consumption and CO2 Emissions for Internet Service Delivery

Understanding the footprint of Internet service delivery is a challenging task. On the one hand, the infrastructure and software components involved in the service delivery need to be known. A very fine-grained model requires knowledge of all components along the entire service delivery chain: end-user devices, fixed or mobile access network, core network, data center and Internet service infrastructure. Furthermore, the footprint may need to consider the CO₂ emissions for producing and manufacturing the hardware components as well as the CO₂ emissions during runtime. Life cycle assessment is then necessary to obtain the CO₂ emissions per year for hardware production. However, one may argue that the infrastructure is already there, and therefore the focus can be on the energy consumption and CO₂ emissions during the runtime and delivery of the services. This is also the approach we follow here to provide quantitative numbers on the energy consumption and CO₂ emissions of Internet-based video services. On the other hand, beyond the complexity of understanding and modelling the contributors to energy consumption and CO₂ emissions, quantitative numbers are needed.

To overcome this complexity, the literature typically considers key figures on the overall data traffic and service consumption times, aggregated over users and services over a longer period of time, e.g., one year. In addition, the total energy consumption of mobile operators and data centres is considered. Together with information on, e.g., the number of base station sites, this gives some estimates, e.g., of the average power consumption per site or the average data traffic per base station site [Feh11]. As a result, we obtain measures such as energy per bit (Joule/bit), which determine the energy efficiency of a network segment. In [Yan19], the annual energy consumption of Akamai is converted to power consumption and then divided by the maximum network traffic, which again yields the energy consumption per bit of Akamai's data centers. Knowing the share of energy sources (nonrenewable energy, including coal, natural gas, oil, diesel and petroleum; renewable energy, including solar, geothermal, wind, biomass and hydropower from flowing water) allows relating the energy consumption to the total CO₂ emissions. For example, the total contribution from renewables exceeded 40% in 2021 in Germany and Finland, Norway has about 60%, and Croatia about 36% (statistics from 2020).
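As a back-of-the-envelope illustration of the energy-per-bit calculation described above, the sketch below divides an assumed annual energy consumption by an assumed annual traffic volume; all figures are made-up placeholders, not Akamai's or any operator's actual numbers.

```python
# Toy energy-per-bit estimate: divide a provider's annual energy
# consumption by its total annual traffic, as described in the text.
# All numbers below are hypothetical placeholders.

SECONDS_PER_YEAR = 365 * 24 * 3600

annual_energy_kwh = 2e8        # hypothetical annual consumption (kWh)
avg_traffic_bps = 100e12       # hypothetical sustained traffic (bits/s)

annual_energy_joule = annual_energy_kwh * 3.6e6      # 1 kWh = 3.6 MJ
annual_traffic_bits = avg_traffic_bps * SECONDS_PER_YEAR

energy_per_bit = annual_energy_joule / annual_traffic_bits  # Joule/bit
print(f"{energy_per_bit:.2e} J/bit")
```

Such a per-bit figure can then be multiplied by the traffic volume of an individual service to attribute a share of the total energy to it.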

A detailed model of the total energy consumption of mobile network services and applications is provided in [Yan19]. Their model structure considers important factors from each network segment, from the cloud to the core network, the mobile network, and end-user devices. Furthermore, service-specific energy consumption figures are provided. They found that there are strong differences depending on the service type and the resulting data traffic pattern. However, the key factors are the amount of data traffic and the duration of the service. They also consider different end-to-end network topologies (user-to-data center, user-to-user via data center, user-to-user, and P2P communication). Their model expresses the total energy consumption as the sum of the energy consumption of the different segments:

  • Smartphone: the service-specific energy depends, among other factors, on the CPU usage and the network usage (e.g., 4G) over the duration of use,
  • Base station and access network: data traffic and signalling traffic over the duration of use,
  • Wireline core network: the service-specific energy consumption of a mobile service, taking into account the data traffic volume and the energy per bit,
  • Data center: the energy per bit of the data center multiplied by the data traffic volume of the mobile service.

The Shift Project [TSP19] provides a similar model which is called the “1 Byte Model”. The computation of energy consumption is transparently provided in calculation sheets and discussed by the scientific community. As a result of the discussions [Kam20a,Kam20b], an updated model was released [TSP20] clarifying a simple bit/byte conversion issue. The suggested models in [TSP20, Kam20b] finally lead to comparable numbers in terms of energy consumption and CO₂ emission. As a side remark: Transparency and reproducibility are key for developing such complex models!

The basic idea of the 1 Byte Model for computing energy consumption is to take into account the time t of Internet service usage and the overall data volume v. The time of use directly relates to the energy consumption of the end-user device's display, but also to the allocation of network resources. The data volume transmitted through the network, but also generated or processed for cloud services, additionally drives the energy consumption. The model does not differentiate between Internet services, but different services will result in different traffic volumes over the time of use. Then, for each segment i (device, network, cloud), a linear model E_i(t,v) = a_i * t + b_i * v + c_i quantifies the energy consumption, with the coefficients for each segment provided by [TSP20]. The overall energy consumption is then E_total = E_device + E_network + E_cloud.

The CO₂ emission is then again a linear function of the total energy consumption (over the time of use of a service), which depends on the share of nonrenewable and renewable energies. Again, The Shift Project derives such coefficients for different countries, and we finally obtain CO2 = k_country * E_total.
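The structure of the 1 Byte Model described above can be sketched in a few lines of code. Note that the per-segment coefficients and the carbon-intensity factor below are placeholders for illustration, not the official values from The Shift Project's calculation sheets:

```python
# Sketch of the 1 Byte Model: per-segment linear energy models
# E_i(t, v) = a_i*t + b_i*v + c_i, summed into E_total and scaled
# by a country-specific carbon intensity k_country.
# All coefficients are placeholders, not the values from [TSP20].

COEFFS = {  # segment: (a_i [J/s], b_i [J/byte], c_i [J])
    "device":  (0.5, 1e-8, 0.0),
    "network": (0.2, 5e-8, 0.0),
    "cloud":   (0.1, 3e-8, 0.0),
}

def energy_joule(t_seconds: float, v_bytes: float) -> float:
    """E_total = E_device + E_network + E_cloud."""
    return sum(a * t_seconds + b * v_bytes + c for a, b, c in COEFFS.values())

def co2_grams(t_seconds: float, v_bytes: float, k_country: float = 1e-4) -> float:
    """CO2 = k_country * E_total (k_country in g CO2 per Joule, placeholder)."""
    return k_country * energy_joule(t_seconds, v_bytes)

# Example: one hour of video at 3 Mbps.
t = 3600.0                 # seconds of use
v = 3e6 / 8 * t            # bytes transferred at 3 Mbps
print(f"{energy_joule(t, v):.1f} J, {co2_grams(t, v):.3f} g CO2")
```

The linear structure makes the two levers explicit: energy grows with both the time of use and the data volume, and the country factor only rescales the result.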

The Trade-off between QoE and CO2 Emissions

As a use case, we consider hosting a scientific conference online through a video-conferencing service. Assume there are 200 conference participants attending the video-conferencing sessions. The conference lasts for one week, with 6 hours of online program per day. The video conference software requires the following data rates for streaming the sessions (video including audio and screen sharing):

  • high-quality video: 1.0 Mbps
  • 720p HD video: 1.5 Mbps
  • 1080p HD video: 3 Mbps

However, group video calls require even higher bandwidth. To make such experiences more immersive, for instance by using VR systems for attendance, even higher bit rates may be necessary.

A simple QoE model may map the video bit rate of the current video session to a mean opinion score (MOS). [Lop18] provides a logarithmic regression of the MOS depending on the video bit rate x in Mbps: MOS(x) = m_1 log(x) + m_2.

Then, we can connect the QoE model with the energy consumption and CO₂ emissions model from above in the following way. We assume a user attending the conference for time t. With a video bit rate x, the emerging data traffic is v = x*t. Those input parameters are now used in the 1 Byte Model for a particular device (laptop, smartphone), type of network (wired, wifi, mobile), and country (EU, US, China).

Figure 1 shows the trade-off between the MOS and the energy consumption (left y-axis). The energy consumption is mapped to CO₂ emissions by assuming the corresponding parameter for the EU and that the conference participants are all connected with a laptop. It can be seen that there is a strong increase in energy consumption and CO₂ emissions in order to reach the best possible QoE. A MOS score of 4.75 is reached with a video bit rate of roughly 11 Mbps. However, with 4.5 Mbps, a MOS score of 4 is already reached according to that logarithmic model. This logarithmic behaviour is a typical observation in QoE and is connected to the Weber-Fechner law, see [Rei10]. As a consequence, we may significantly save energy and CO₂ by not providing the maximum QoE, but “only” good quality (i.e., a MOS score of 4). The meaning of the MOS ratings is 5=Excellent, 4=Good, 3=Fair, 2=Poor, 1=Bad.

Figure 1: Trade-off between MOS and energy consumption or CO2 emission.
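To make the logarithmic model concrete, the two operating points quoted above (MOS ≈ 4.75 at roughly 11 Mbps and MOS ≈ 4 at 4.5 Mbps) suffice to recover illustrative coefficients for MOS(x) = m_1 log(x) + m_2. These fitted values are a sketch for consistency checking, not the coefficients reported in [Lop18]:

```python
import math

# Fit MOS(x) = m1*ln(x) + m2 through the two operating points quoted
# in the text: MOS(11 Mbps) ~ 4.75 and MOS(4.5 Mbps) ~ 4.0.
# Illustrative only -- not the coefficients from [Lop18].
x1, mos1 = 11.0, 4.75
x2, mos2 = 4.5, 4.0

m1 = (mos1 - mos2) / (math.log(x1) - math.log(x2))
m2 = mos1 - m1 * math.log(x1)

def mos(bitrate_mbps: float) -> float:
    """Logarithmic bitrate-to-MOS mapping, clipped to the 1..5 scale."""
    return max(1.0, min(5.0, m1 * math.log(bitrate_mbps) + m2))

# Diminishing returns: going from 4.5 to 9 Mbps doubles the traffic
# (and roughly the transmission-related energy) but gains far less
# than the 0.75 MOS separating the two anchor points.
print(round(mos(4.5), 2), round(mos(9.0), 2), round(mos(11.0), 2))
```

The clipping reflects that MOS is bounded by the rating scale, so beyond some bit rate additional traffic buys no perceptual improvement at all.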

Figure 2, therefore, visualizes the gain when delivering the video at lower quality and lower video bit rates; the gain is shown relative to the effort required for MOS 5. To get a better understanding of the meaning of those CO₂ numbers, we express the CO₂ gain in terms of thousands of kilometers driven by car. Since the CO₂ emissions depend on the share of renewable energies, we consider different countries, with the parameters provided in [TSP20]. We see that providing each conference participant a MOS score of 4 instead of 5 results in savings corresponding to driving approximately 40,000 kilometers by car, assuming the renewable energy share of the EU – this is the distance around the Earth! Assuming the energy share of China, this would save more than 90,000 kilometers. Of course, you could also cover those 90,000 kilometers by walking – which would, however, take about 2 years non-stop at a speed of 5 km/h. Note that this large amount of CO₂ emissions is calculated assuming a data rate of 15 Mbps over 5 days (6 hours per day), resulting in about 40.5 TB of data that needs to be transferred to the 200 conference participants.

Figure 2: Relating the CO2 emission in different countries for achieving this MOS to the distance by travelling in a car (in thousands of kilometers).
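The 40.5 TB figure quoted above follows directly from the stated conference parameters, as this short sanity check shows:

```python
# Sanity check of the traffic volume quoted in the text: 200 participants
# streaming at 15 Mbps for 5 days with 6 hours of program per day.
participants = 200
bitrate_bps = 15e6                  # 15 Mbps per participant
seconds = 5 * 6 * 3600              # 5 days x 6 h/day

total_bits = participants * bitrate_bps * seconds
total_terabytes = total_bits / 8 / 1e12   # bits -> bytes -> TB
print(total_terabytes)                    # -> 40.5
```

Halving the bit rate halves this volume, which is exactly the lever the QoE/CO₂ trade-off discussion exploits.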


Raising awareness of the CO₂ emissions caused by Internet service consumption is crucial. Abstract CO₂ emission numbers may be difficult to grasp, but relating them to more familiar quantities helps people understand the impact individuals have. Of course, the numbers provided here only give an impression, since the models are very simple and do not take various facets into account. However, they nicely demonstrate the potential trade-off between the QoE of end-users and sustainability in terms of energy consumption and CO₂ emissions. In fact, [Gna21] conducted qualitative interviews and found that there is a lack of awareness of the environmental impact of digital applications and services, even among digital natives. An underlying issue is that end-users often do not understand how Internet service delivery works, which infrastructure components play a role along the end-to-end service delivery path, etc. Hence, the environmental impact is unclear to many users. Our aim is thus to contribute to overcoming this issue by raising awareness on this matter, starting with simplified models and visualizations.

[Gna21] also found that users indicate a certain willingness to make compromises between their digital habits and the environmental footprint. Given global climate changes and increased environmental awareness among the general population, such a trend in willingness to make compromises may be expected to further increase in the near future. Hence, it may be interesting for service providers to empower users to decide their environmental footprint at the cost of lower (yet still satisfactory) quality. This will also reduce the costs for operators and seems to be a win-win situation if properly implemented in Internet services and user interfaces.

Nevertheless, tremendous efforts are also currently being undertaken by Internet companies to become CO₂ neutral. For example, Netflix claims in [Netflix21] that they plan to achieve net-zero greenhouse gas emissions by the close of 2022. Similarly, economic, societal, and environmental sustainability is seen as a key driver for 6G research and development [Mat21]. However, the time horizons are sometimes longer: e.g., a German provider claims it will reach climate neutrality for in-house emissions by 2025 at the latest and net-zero emissions from production to the customer by 2040 at the latest [DT21]. Hence, given the urgency of the matter, end-users and all stakeholders along the service delivery chain can significantly contribute to speeding up the process of ultimately achieving net-zero greenhouse gas emissions.


  • [TSP19] The Shift Project, “Lean ICT: Towards digital sobriety”, directed by Hugues Ferreboeuf, Tech. Rep., 2019. Available online (last accessed: March 2022)
  • [Yan19] M. Yan, C. A. Chan, A. F. Gygax, J. Yan, L. Campbell, A. Nirmalathas, and C. Leckie, “Modeling the total energy consumption of mobile network services and applications,” Energies, vol. 12, no. 1, p. 184, 2019.
  • [TSP20] Maxime Efoui Hess and Jean-Noël Geist, “Did The Shift Project really overestimate the carbon footprint of online video? Our analysis of the IEA and Carbonbrief articles”, The Shift Project website, June 2020. Available online (last accessed: March 2022)
  • [Kam20a] George Kamiya, “Factcheck: What is the carbon footprint of streaming video on Netflix?”, CarbonBrief website, February 2020. Available online (last accessed: March 2022)
  • [Kam20b] George Kamiya, “The carbon footprint of streaming video: fact-checking the headlines”, IEA website, December 2020. Available online (last accessed: March 2022)
  • [Feh11] A. Fehske, G. Fettweis, J. Malmodin, and G. Biczok, “The global footprint of mobile communications: The ecological and economic perspective,” IEEE Communications Magazine, vol. 49, no. 8, pp. 55-62, 2011.
  • [Lop18] J. P. López, D. Martín, D. Jiménez, and J. M. Menéndez, “Prediction and modeling for no-reference video quality assessment based on machine learning,” 14th International Conference on Signal-Image Technology & Internet-Based Systems (SITIS), IEEE, 2018, pp. 56-63.
  • [Gna21] V. Gnanasekaran, H. T. Fridtun, H. Hatlen, M. M. Langøy, A. Syrstad, S. Subramanian, and K. De Moor, “Digital carbon footprint awareness among digital natives: an exploratory study,” Norsk IKT-konferanse for forskning og utdanning, no. 1, pp. 99-112, November 2021.
  • [Rei10] P. Reichl, S. Egger, R. Schatz, and A. D’Alconzo, “The logarithmic nature of QoE and the role of the Weber-Fechner law in QoE assessment,” IEEE International Conference on Communications, pp. 1-5, May 2010.
  • [Netflix21] Netflix, “Environmental Social Governance 2020”, Sustainability Accounting Standards Board (SASB) Report, March 2021. Available online (last accessed: March 2022)
  • [Mat21] M. Matinmikko-Blue, S. Yrjölä, P. Ahokangas, K. Ojutkangas, and E. Rossi, “6G and the UN SDGs: Where is the Connection?,” Wireless Personal Communications, vol. 121, no. 2, pp. 1339-1360, 2021.
  • [DT21] Hannah Schauff, “Deutsche Telekom tightens its climate targets”, January 2021. Available online (last accessed: March 2022)

JPEG Column: 94th JPEG Meeting

IEC, ISO and ITU issue a call for proposals for joint standardization of image coding based on machine learning

The 94th JPEG meeting was held online from 17 to 21 January 2022. A major milestone was reached at this meeting with the release of the final call for proposals under the JPEG AI project. This standard aims at the joint standardization by IEC, ISO and ITU of the first image coding standard based on machine learning. It will offer a single-stream, compact compressed-domain representation targeting both human visualization, with significant compression efficiency improvement over commonly used image coding standards at equivalent subjective quality, and effective performance for image processing and computer vision tasks.

The JPEG AI call for proposals was issued in parallel with a call for proposals for point cloud coding based on machine learning. The latter will be conducted in parallel with JPEG AI standardization.

The 94th JPEG meeting had the following highlights:

  • JPEG AI Call for Proposals;
  • JPEG Pleno Point Cloud Call for Proposals;
  • JPEG Pleno Light Fields quality assessment;
  • JPEG AIC near perceptual lossless quality assessment;
  • JPEG Systems;
  • JPEG Fake Media draft Call for Proposals;
  • JPEG NFT exploration;
  • JPEG XS;
  • JPEG DNA explorations.

The following provides an overview of the major achievements carried out during the 94th JPEG meeting.


JPEG AI

JPEG AI targets a wide range of applications, such as cloud storage, visual surveillance, autonomous vehicles and devices, image collection storage and management, live monitoring of visual data, and media distribution. The main objective is to design a coding solution that offers significant compression efficiency improvement over coding standards in common use at equivalent subjective quality, together with effective compressed-domain processing for machine-learning-based image processing and computer vision tasks. Other key requirements include hardware/software implementation-friendly encoding and decoding, support for 8-bit and 10-bit depths, efficient coding of images with text and graphics, and progressive decoding.

During the 94th JPEG meeting, several activities toward a JPEG AI learning-based coding standard have occurred, notably the release of the Final Call for Proposals for JPEG AI, consolidated with the definition of the Use Cases and Requirements and the Common Training and Test Conditions to assure a fair and complete evaluation of the future proposals.

The final JPEG AI Call for Proposals marks an important milestone, as it is the first time that contributions are solicited towards a learning-based image coding solution. The registration deadline for JPEG AI proposals is 25 February 2022. There are three main phases for proponents to submit materials: 10 March for the proposed decoder implementation with some fixed coding model; 2 May for the submission of the proposals' bitstreams and decoded images and/or labels for the test datasets; and 18 July for the submission of the source code of the encoder, decoder and training procedure, along with the proposal description. The presentation and discussion of the JPEG AI proposals will take place during the 96th JPEG meeting. JPEG AI is a joint standardization project between IEC, ISO and ITU.

JPEG AI framework

JPEG Pleno Point Cloud Coding

JPEG Pleno is working towards the integration of various modalities of plenoptic content under a single and seamless framework, and efficient and powerful point cloud representation is a key feature of this vision. Point cloud data supports a wide range of applications for human and machine consumption, including the metaverse, autonomous driving, computer-aided manufacturing, entertainment, cultural heritage preservation, scientific research, and advanced sensing and analysis. During the 94th JPEG meeting, the JPEG Committee released a final Call for Proposals on JPEG Pleno Point Cloud Coding. This call addresses learning-based coding technologies for point cloud content and associated attributes, with emphasis on both human visualization and decompressed/reconstructed-domain 3D processing and computer vision, with competitive compression efficiency compared to commonly used point cloud coding standards and with the goal of supporting a royalty-free baseline. The call was released in conjunction with new releases of the JPEG Pleno Point Cloud Use Cases and Requirements and the JPEG Pleno Point Cloud Common Training and Test Conditions. Interested parties are invited to register for this call by the deadline of 31 March 2022.

JPEG Pleno Light Field

Besides defining coding standards, JPEG Pleno is planning the creation of quality assessment standards, i.e., a framework including subjective quality assessment protocols and objective quality assessment measures for lossy decoded data of plenoptic modalities in the context of multiple use cases. The first phase of this effort will address the light field modality and will build on the light field quality assessment tools developed by JPEG in recent years. Future activities will focus on the holographic and point cloud modalities, for both of which coding-related standardization efforts have also been initiated.


During the 94th JPEG Meeting, the first version of the use cases and requirements document was released under the Image Quality Assessment activity. The standardization process was also defined and will be carried out in two phases: in Stage I, a subjective methodology for the assessment of images with visual quality ranging from high quality to near-visually lossless will be standardized, following a collaborative process; subsequently, in Stage II, an objective image quality metric will be standardized by means of a competitive process. A tentative timeline has also been planned, with a call for contributions for subjective quality assessment methodologies to be released in July 2022 and a call for proposals for an objective quality metric planned for July 2023.

JPEG Systems

JPEG Systems produced the FDIS text for JLINK (ISO/IEC 19566-7), which allows the storage of multiple images inside JPEG files and interactive navigation between them. This enables features like virtual museum tours, real estate visits, hotspot zooms into other images, and many others. For JPEG Snack, the Committee produced the DIS text of ISO/IEC 19566-8, which allows storing multiple images for self-running multimedia experiences like animated image sequences and moving image overlays. Both texts have been submitted for their respective ballots. For JUMBF (ISO/IEC 19566-5, JPEG Universal Metadata Box Format), a second edition was initiated, combining the first edition and two amendments. New extensions include support for CBOR (Concise Binary Object Representation) and private content types. In addition, JPEG Systems started work on a technical report, ISO/IEC 19566-9, on JPEG extension mechanisms to facilitate forwards and backwards compatibility. This technical report gives guidelines for the design of future JPEG standards and summarizes existing design mechanisms.
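To illustrate what "concise binary" means in the CBOR support mentioned above, here is a toy encoder for a tiny subset of CBOR (RFC 8949): small unsigned integers, short text strings, and small maps. Real implementations cover many more types and lengths; this is only a sketch of the initial-byte encoding (major type in the high 3 bits, length/value in the low 5 bits).

```python
# Toy encoder for a tiny subset of CBOR (RFC 8949): small unsigned ints,
# short text strings, and small maps. The initial byte packs the major type
# into the high 3 bits and the value/length into the low 5 bits.

def cbor_encode(obj) -> bytes:
    if isinstance(obj, int) and 0 <= obj < 24:
        return bytes([0x00 | obj])               # major type 0: unsigned int
    if isinstance(obj, str) and len(obj.encode("utf-8")) < 24:
        data = obj.encode("utf-8")
        return bytes([0x60 | len(data)]) + data  # major type 3: text string
    if isinstance(obj, dict) and len(obj) < 24:
        out = bytes([0xA0 | len(obj)])           # major type 5: map
        for k, v in obj.items():
            out += cbor_encode(k) + cbor_encode(v)
        return out
    raise ValueError("unsupported in this sketch")

print(cbor_encode({"v": 2}).hex())  # a1617602 -- 4 bytes for a one-entry map
```

The same map serialized as JSON (`{"v": 2}`) takes 8 bytes, which is the kind of saving that motivates CBOR for embedded metadata.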

JPEG Fake Media

At its 94th meeting, the JPEG Committee released a Draft Call for Proposals for JPEG Fake Media, together with the associated Use Cases and Requirements on JPEG Fake Media. These documents are the result of the work performed by the JPEG Fake Media exploration. The scope of JPEG Fake Media is the creation of a standard that can facilitate the secure and reliable annotation of media asset creation and modifications. The standard shall address both use cases in good faith and those with malicious intent. The Committee targets the following timeline for the next steps in the standardization process:

  • April 2022: issue Final Call for Proposals
  • October 2022: evaluation of proposals
  • January 2023: first Working Draft (WD)
  • January 2024: Draft International Standard (DIS)
  • October 2024: International Standard (IS)

The JPEG Committee welcomes feedback on the JPEG Fake Media documents and invites interested experts to join the JPEG Fake Media AhG mailing list to get involved in this standardization activity.


JPEG NFT

The Ad hoc Group (AhG) on NFT resumed its exploratory work on the role of JPEG in the NFT ecosystem during the 94th JPEG meeting. Three use cases and four essential requirements were selected. The use cases include the usage of NFTs for JPEG-based digital art, NFTs for collectable JPEGs, and NFTs for JPEG micro-licensing. The following categories of critical requirements are under consideration: metadata descriptions; metadata embedding and referencing; authentication and integrity; and the format for registering media assets. As a result, the JPEG Committee published an output document titled JPEG NFT Use Cases and Requirements. Additionally, the proceedings of the third JPEG NFT and Fake Media Workshop were published, and arrangements were made to hold another combined workshop between the JPEG NFT and JPEG Fake Media groups.


JPEG XS

At the 94th JPEG meeting, a new revision of the Use Cases and Requirements for JPEG XS document was produced, as version 3.1, to clarify and improve the requirements of a frame buffer. In addition, the JPEG Committee reports that the second editions of Part 1 (Core coding system), Part 2 (Profiles and buffer models), and Part 3 (Transport and container formats) have been approved and are now scheduled for publication as International Standards. Lastly, the DAM text for Amendment 1 of JPEG XS Part 2, which contains the additional High420.12 profile and a new sublevel at 4 bpp, is ready and will be sent to final balloting for approval.


JPEG XL

JPEG XL Part 4 (Reference software) has proceeded to the FDIS stage. Work continued on the second edition of Part 1 (Core coding system), with core experiments defined to investigate the numerical stability of the edge-preserving filter and fixed-point implementations. Both Part 1 (Core coding system) and Part 2 (File format) are now published as International Standards, and preliminary support has been implemented in major web browsers and in image viewing and editing software. Consequently, JPEG XL is now ready for wide-scale adoption.


JPEG DNA

The JPEG Committee has continued its exploration of the coding of images in quaternary representations, which are particularly suitable for DNA storage. The scope of JPEG DNA is the creation of a standard for efficient coding of images that considers biochemical constraints and offers robustness to the noise introduced by the different stages of a storage process based on synthetic DNA polymers. A new version of the JPEG DNA overview document was issued and is now publicly available. It was decided to continue this exploration by validating and extending the JPEG DNA experimentation software so that it can simulate an end-to-end image storage pipeline using DNA, including biochemical noise simulation, for future exploration experiments. During the 94th JPEG meeting, the JPEG DNA committee initiated a new document describing the Common Test Conditions that should be used to evaluate different aspects of image coding for storage on DNA supports. It was also decided to prepare an outreach video explaining DNA coding and to organize the 6th workshop on JPEG DNA, with emphasis on biochemical process noise simulators. Interested parties are invited to join the effort by registering on the mailing list of the JPEG DNA AhG.
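A quaternary representation simply means four symbols per position, matching the four DNA nucleotides. The sketch below maps binary data to a nucleotide stream at 2 bits per symbol; the fixed mapping table is an assumption for illustration only, since real DNA codecs must also respect biochemical constraints such as avoiding long homopolymer runs and balancing GC content.

```python
# Illustrative mapping of binary data to a quaternary nucleotide stream.
# The fixed 2-bits-per-symbol table below is an assumption for illustration;
# real DNA coding schemes must additionally respect biochemical constraints
# (e.g., avoiding long homopolymer runs such as "AAAA...").

SYMBOLS = "ACGT"  # 00 -> A, 01 -> C, 10 -> G, 11 -> T

def bytes_to_dna(data: bytes) -> str:
    out = []
    for byte in data:
        for shift in (6, 4, 2, 0):  # most-significant bit pair first
            out.append(SYMBOLS[(byte >> shift) & 0b11])
    return "".join(out)

print(bytes_to_dna(b"\x1b"))  # 0x1b = 00 01 10 11 -> "ACGT"
```

Each byte becomes exactly four nucleotides, so a compressed image of N bytes maps to a 4N-symbol DNA sequence before constraint coding.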

Final Quote

“JPEG marks a historical milestone with the parallel release of two calls for proposals for learning based coding of images and point clouds,” said Prof. Touradj Ebrahimi, the Convenor of the JPEG Committee.

Upcoming JPEG meetings are planned as follows:

  • No. 95, to be held online, 25-29 April 2022

MPEG Column: 137th MPEG Meeting (virtual/online)

The original blog post can be found at the Bitmovin Techblog and has been modified/updated here to focus on and highlight research aspects.

The 137th MPEG meeting was once again held as an online meeting, and the official press release can be found here and comprises the following items:

  • MPEG Systems Wins Two More Technology & Engineering Emmy® Awards
  • MPEG Audio Coding selects 6DoF Technology for MPEG-I Immersive Audio
  • MPEG Requirements issues Call for Proposals for Encoder and Packager Synchronization
  • MPEG Systems promotes MPEG-I Scene Description to the Final Stage
  • MPEG Systems promotes Smart Contracts for Media to the Final Stage
  • MPEG Systems further enhanced the ISOBMFF Standard
  • MPEG Video Coding completes Conformance and Reference Software for LCEVC
  • MPEG Video Coding issues Committee Draft of Conformance and Reference Software for MPEG Immersive Video
  • JVET produces Second Editions of VVC & VSEI and finalizes VVC Reference Software
  • JVET promotes Tenth Edition of AVC to Final Draft International Standard
  • JVET extends HEVC for High-Capability Applications up to 16K and Beyond
  • MPEG Genomic Coding evaluated Responses on New Advanced Genomics Features and Technologies
  • MPEG White Papers
    • Neural Network Coding (NNC)
    • Low Complexity Enhancement Video Coding (LCEVC)
    • MPEG Immersive Video

In this column, I’d like to focus on the Emmy® Awards, video coding updates (AVC, HEVC, VVC, and beyond), and a brief update about DASH (as usual).

MPEG Systems Wins Two More Technology & Engineering Emmy® Awards

MPEG Systems is pleased to report that MPEG is being recognized this year by the National Academy of Television Arts & Sciences (NATAS) with two Technology & Engineering Emmy® Awards, for (i) “standardization of font technology for custom downloadable fonts and typography for Web and TV devices” and (ii) “standardization of HTTP encapsulated protocols”, respectively.

The first of these Emmys is related to MPEG’s Open Font Format (ISO/IEC 14496-22) and the second to MPEG Dynamic Adaptive Streaming over HTTP (i.e., MPEG DASH, ISO/IEC 23009). The MPEG DASH standard is the only commercially deployed international standard technology for media streaming over HTTP, and it is widely used in many products. MPEG developed the first edition of the DASH standard in 2012 in collaboration with 3GPP and has since produced four more editions amending the core specification with new features and extended functionality. Furthermore, MPEG has developed six other standards as additional “parts” of ISO/IEC 23009, enabling the effective use of the MPEG DASH standard with reference software and conformance testing tools, guidelines, and enhancements for additional deployment scenarios. MPEG DASH has dramatically changed the streaming industry by providing a standard that is widely adopted by various consortia, such as 3GPP, ATSC, DVB, and HbbTV, and across different sectors. The success of this standard is due to its technical excellence, the large industry participation in its development, its responsiveness to market needs, and the collaboration of all industry sectors, all under ISO/IEC JTC 1/SC 29 MPEG Systems’ standard development practices and leadership.

These are MPEG’s fifth and sixth Technology & Engineering Emmy® Awards (after MPEG-1 and MPEG-2 together with JPEG in 1996, Advanced Video Coding (AVC) in 2008, MPEG-2 Transport Stream in 2013, and ISO Base Media File Format in 2021) and MPEG’s seventh and eighth overall Emmy® Awards (including the Primetime Engineering Emmy® Awards for Advanced Video Coding (AVC) High Profile in 2008 and High-Efficiency Video Coding (HEVC) in 2017).

I have been actively contributing to the MPEG DASH standard since its inception. My initial blog post dates back to 2010, and the first edition of MPEG DASH was published in 2012. A more detailed MPEG DASH timeline provides many pointers to the Institute of Information Technology (ITEC) at the Alpen-Adria-Universität Klagenfurt and its DASH activities, which are now continued within the Christian Doppler Laboratory ATHENA. In the end, the MPEG DASH community of contributors to and users of the standards can be very proud of this achievement, coming just 10 years after the first edition was published. Thus, happy 10th birthday, MPEG DASH, and what a nice birthday gift.

Video Coding Updates

In terms of video coding, there have been many updates across various standards’ projects at the 137th MPEG Meeting.

Advanced Video Coding

The 10th edition of Advanced Video Coding (AVC, ISO/IEC 14496-10 | ITU-T H.264) has been promoted to Final Draft International Standard (FDIS), the final stage of the standardization process. Beyond various text improvements, this edition specifies a new SEI message for describing the shutter interval applied during video capture. The shutter interval can vary in video cameras, and conveying this information can be valuable for analysis and post-processing of the decoded video.

High-Efficiency Video Coding

The High-Efficiency Video Coding (HEVC, ISO/IEC 23008-2 | ITU-T H.265) standard has been extended to support high-capability applications. It defines new levels and tiers providing support for very high bit rates and video resolutions up to 16K, as well as defining an unconstrained level. This will enable the usage of HEVC in new application domains, including professional, scientific, and medical video sectors.
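A quick back-of-the-envelope calculation shows why new levels are needed for 16K: level limits constrain quantities such as luma samples per picture and per second, and 16K multiplies these dramatically. The resolutions below are the common UHD picture sizes (an illustration, not the normative level table).

```python
# Back-of-the-envelope luma sample counts illustrating why HEVC needed new
# levels/tiers for 16K. Resolutions are the common UHD picture sizes; this
# is an illustration, not the normative HEVC level table.
resolutions = {
    "4K":  (3840, 2160),
    "8K":  (7680, 4320),
    "16K": (15360, 8640),
}

for name, (w, h) in resolutions.items():
    samples = w * h  # luma samples per picture
    print(f"{name}: {samples:,} luma samples/frame, "
          f"{samples * 60:,} samples/s at 60 fps")
```

A 16K picture carries 16 times the luma samples of a 4K picture, far beyond what the previous top levels were defined for.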

Versatile Video Coding

The second editions of Versatile Video Coding (VVC, ISO/IEC 23090-3 | ITU-T H.266) and Versatile supplemental enhancement information messages for coded video bitstreams (VSEI, ISO/IEC 23002-7 | ITU-T H.274) have reached FDIS status. The new VVC version defines profiles and levels supporting larger bit depths (up to 16 bits), including some low-level coding tool modifications to obtain improved compression efficiency with high bit-depth video at high bit rates. VSEI version 2 adds SEI messages giving additional support for scalability, multi-view, display adaptation, improved stream access, and other use cases. Furthermore, a Committee Draft Amendment (CDAM) for the next amendment of VVC was issued to begin the formal approval process to enable linking VVC with the Green Metadata (ISO/IEC 23001-11) and Video Decoding Interface (ISO/IEC 23090-13) standards and add a new unconstrained level for exceptionally high capability applications such as certain uses in professional, scientific, and medical application scenarios. Finally, the reference software package for VVC (ISO/IEC 23090-16) was also completed with its achievement of FDIS status. Reference software is extremely helpful for developers of VVC devices, helping them in testing their implementations for conformance to the video coding specification.

Beyond VVC

Regarding video coding beyond VVC capabilities, the Enhanced Compression Model (ECM 3.1) shows an improvement of close to 15% for Random Access Main 10 over VTM-11.0 + JVET-V0056 (i.e., the VVC reference software). This is indeed encouraging and, in general, these activities are currently managed within two exploration experiments (EEs): the first on neural network-based (NN) video coding technology (EE1) and the second on enhanced compression beyond VVC capability (EE2). EE1 currently plans to further investigate (i) enhancement filters (loop and post) and (ii) super-resolution (JVET-Y2023). It will further investigate selected NN technologies on top of ECM 4 and the implementation of selected NN technologies in the software library, for platform-independent cross-checking and integerization. Enhanced Compression Model 4 (ECM 4) comprises new elements on MRL for intra prediction, various GPM/affine/MV-coding improvements including TM, adaptive intra MTS, coefficient sign prediction, CCSAO improvements, bug fixes, and encoder improvements (JVET-Y2025). EE2 will investigate intra prediction improvements, inter prediction improvements, improved screen content tools, and improved entropy coding (JVET-Y2024).

Research aspects: video coding performance is usually assessed in terms of compression efficiency and/or encoding runtime (time complexity). Another aspect relates to visual quality, its assessment, and metrics, specifically for neural network-based video coding technologies.
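Compression-efficiency gains like the ~15% quoted above are typically reported as Bjøntegaard Delta rate (BD-rate). The sketch below is a simplified variant: instead of the standard cubic-polynomial fit, it linearly interpolates log10(bitrate) over the overlapping PSNR range of two rate-distortion curves and averages the difference. The numbers are invented for illustration.

```python
import math

# Simplified Bjontegaard-style BD-rate sketch. The standard method fits
# cubic polynomials to the RD curves; here we linearly interpolate
# log10(bitrate) over the overlapping PSNR range and average the difference.
# Curve values are invented for illustration.

def log_rate_at(curve, psnr):
    """Linearly interpolate log10(rate) at a given PSNR on an RD curve."""
    pts = sorted(curve)  # list of (psnr_dB, rate_kbps)
    for (p0, r0), (p1, r1) in zip(pts, pts[1:]):
        if p0 <= psnr <= p1:
            t = (psnr - p0) / (p1 - p0)
            return math.log10(r0) + t * (math.log10(r1) - math.log10(r0))
    raise ValueError("PSNR outside curve range")

def bd_rate(anchor, test, samples=100):
    """Average bitrate change (%) of `test` vs `anchor` at equal quality."""
    lo = max(min(p for p, _ in anchor), min(p for p, _ in test))
    hi = min(max(p for p, _ in anchor), max(p for p, _ in test))
    diffs = []
    for i in range(samples + 1):
        psnr = lo + (hi - lo) * i / samples
        diffs.append(log_rate_at(test, psnr) - log_rate_at(anchor, psnr))
    avg = sum(diffs) / len(diffs)
    return (10 ** avg - 1) * 100  # negative = bitrate savings

anchor = [(32.0, 1000), (35.0, 2000), (38.0, 4000)]  # e.g., a VVC anchor
test   = [(32.0,  850), (35.0, 1700), (38.0, 3400)]  # e.g., an ECM run
print(f"{bd_rate(anchor, test):.1f}% bitrate change")  # -15.0% here
```

With the test curve at 85% of the anchor's bitrate at every quality point, the result is a 15% bitrate saving, which is how gains such as "close to 15% for Random Access Main 10" are expressed.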

The latest MPEG-DASH Update

Finally, I’d like to provide a brief update on MPEG-DASH! At the 137th MPEG meeting, MPEG Systems issued a draft amendment to the core MPEG-DASH specification (i.e., ISO/IEC 23009-1) regarding Extended Dependent Random Access Point (EDRAP) streaming and other extensions, which will be further discussed during the Ad-hoc Group (AhG) period (please join the dash email list for further details/announcements). Furthermore, Defects under Investigation (DuI) and Technologies under Consideration (TuC) are available here.

An updated overview of DASH standards/features can be found in the Figure below.

MPEG-DASH status of January 2021.

Research aspects: in the Christian Doppler Laboratory ATHENA, we aim to research and develop novel paradigms, approaches, (prototype) tools, and evaluation results for the phases (i) multimedia content provisioning (i.e., video coding), (ii) content delivery (i.e., video networking), and (iii) content consumption (i.e., video player incl. ABR and QoE) in the media delivery chain, as well as for (iv) end-to-end aspects, with a focus on, but not limited to, HTTP Adaptive Streaming (HAS).
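The ABR logic mentioned above can be sketched in a few lines: a throughput-based player picks the highest representation whose bitrate fits under a safety margin of the estimated network throughput. The bitrate ladder and margin below are illustrative assumptions, not values from the DASH specification (DASH standardizes the format and signaling, not the adaptation algorithm).

```python
# Minimal sketch of throughput-based adaptation at the heart of a DASH/HAS
# player: pick the highest representation whose bitrate fits under a safety
# margin of the estimated throughput. The ladder and margin are illustrative
# assumptions; DASH itself does not prescribe the adaptation algorithm.

def select_representation(bitrates_kbps, throughput_kbps, margin=0.8):
    """Return the highest sustainable bitrate (kbps); fall back to lowest."""
    budget = throughput_kbps * margin
    sustainable = [b for b in sorted(bitrates_kbps) if b <= budget]
    return sustainable[-1] if sustainable else min(bitrates_kbps)

ladder = [500, 1200, 2500, 5000, 8000]  # a typical bitrate ladder
print(select_representation(ladder, throughput_kbps=4000))  # -> 2500
print(select_representation(ladder, throughput_kbps=300))   # -> 500
```

Real players refine this with buffer-based and hybrid strategies, which is exactly the kind of player-side research mentioned for phase (iii) above.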

The 138th MPEG meeting will be again an online meeting in July 2022. Click here for more information about MPEG meetings and their developments.

Reports from ACM Multimedia Systems 2021


The 12th ACM Multimedia Systems Conference (MMSys’21) took place from September 28th through October 1st, 2021. The MMSys conference is an important forum for researchers in multimedia systems, but, due to the ongoing pandemic, the event was held in a hybrid mode: onsite in Istanbul, Turkey, and online. Organizers and chairs (Özgü Alay, Cheng-Hsin Hsu, and Ali C. Begen) worked very hard to make sure the conference was successful, both for the on-site participants (around 50) and the online participants (with a peak of 330 concurrent viewers). For a short description of the event, take a look at the text written by Ali Begen, one of the general chairs.
To encourage student authors to participate on-site, SIGMM sponsored a group of students with Student Travel Grant Awards. Students who wanted to apply for this travel grant needed to submit an online form before the submission deadline. The selection committee then chose 7 travel grant winners. The selected students received either 1,000 or 2,000 USD to cover their airline tickets as well as accommodation costs for the event. We asked the travel grant winners to share their unique experiences attending MMSys’21. The following are their comments.

Minh Nguyen

It is my honour to receive the SIGMM Student Travel Award, which gave me a golden opportunity to attend the MMSys’21 conference on-site. This conference is the first one I have attended during the Covid pandemic. I attended the whole conference, and I really appreciate the organizing committee, who tried their best to organize this conference in a hybrid mode. It was a very interesting and well-organized conference where many innovative papers were introduced. The venue was a great place with professional staff and comfortable accommodation and meeting rooms. The local Turkish food was delicious. At this conference, I was happy to meet, connect, and discuss with experts working in multimedia systems, which is close to my PhD thesis. I enjoyed the informative and passionate keynotes about cutting-edge technologies and the open discussions. In particular, many novel papers motivated me and gave me ideas for future work in my PhD thesis. The enjoyable social events also gave me a chance to visit Istanbul and experience new things. I look forward to attending future editions of the conference.

Lucas Torrealba A.

I found the conference very interesting. It was my first experience of an in-person conference and it was amazing. The research articles presented seem very relevant to me and the organization did a wonderful job as well. In addition, it seems to be quite a good idea for the future to always leave hybrid ways to participate in the conferences.

Paniz Parastar

MMSys’21 was my first in-person conference, and since it was highly organized, it raised my expectations of future conferences. Overall, many interesting topics were covered, and I will only mention a couple of instances here.
AI/ML are the hot topics of today. I believe it’s enjoyable to see them applied in various aspects of multimedia streaming and other areas, as well as in computer vision. Notably, I liked the papers in the NOSSDAV sessions on the last day of the conference that adapt learning methods to improve the QoE of users. Since I’m working on distinguishing IoT devices and their traffic on the network these days, the video clustering papers, and mainly the paper that distinguishes 360 videos from regular ones based on traffic features (i.e., flow- and packet-level features), were educational for me. Also, comparing subjective and objective quality assessment metrics under various network conditions, as one paper does, may not be a new topic, but it is always interesting to explore.
Plus, one of the most exciting talks for me was ‘Games as a Game Changer’, which was part of the Equality, Diversity, and Inclusion (EDI) Workshop. It changed my perception of games as an entertaining tool that can also help us better understand situations that don’t usually happen in our daily lives.

Ekrem Cetinkaya

MMSys’21 was my first in-person conference experience, and I can gladly say that it was above my expectations. We were welcomed by a fantastic organization, given how difficult the situation was. Everything went so smoothly, from the keynotes to paper presentations to demo sessions, and of course, social events.
Personally, two things were the most impressive for me. First, the keynote by Caitlin Kalinowski (Facebook) was given in person, and she had to fly from the U.S. to Istanbul just for this keynote. Second, the hybrid organization was thought through. There was a team of five whose duty was to make sure the conference was insightful for those who could not make it to Istanbul as well.
Moreover, the social events and the venues were really lovely. I learned that the MMSys community has a long history, and you could feel that, especially in those social events where it was an amicable environment, meaning that it was also easy for me to do some networking. Overall, I can say the MMSys conference was amazing in all aspects without any doubt. I want to thank the SIGMM committee once again for their travel grant, which made this experience possible.

Ivan Bartolec

The ACM MMSys’21 conference held in Istanbul, Turkey, was an excellent opportunity to meet, interact, and discuss ideas with researchers who are working to develop new and engaging multimedia experiences. This was my first MMSys conference, and it was an excellent environment for both learning and networking, with a thoughtfully selected collection of presentations, engaging keynotes (especially the one from a representative of Facebook), and fun social events. I found the sessions based on video or video streaming to be the most interesting and informative for my field of study. The demo sessions concept was also pretty unique, and by being on-site and seeing the demos and asking questions, I learnt a few things about practical implementations that I find incredibly useful. I’m very thankful for the opportunity to present my PhD research as part of the Doctoral symposium and to receive feedback from conference attendees as well as offline comments and ideas via email, which I gladly responded to. It was an absolute pleasure to attend MMSys’21 on-site, courtesy of the Student Travel Grant, and I look forward to visiting future editions of the conference and continuing to interact with the MMSys community.

Jesus Aguilar Armijo

It has been a pleasure to attend MMSys’2021 in person. This would not have been possible without the SIGMM Student travel award.
At the conference, I had the opportunity to attend four keynotes, where I would like to highlight the keynote from Caitlin Kalinowski (Facebook). She presented in person and showed the Virtual Reality devices of her company and future projects with emerging technologies.
I found the different sessions of MMSys truly engaging, as they related to my work on network-assisted video streaming. For example, the NOSSDAV session named “Session #1: Yet Another Streaming Session” contained the paper “Common Media Client Data (CMCD): Initial Findings”, which I found especially interesting as I use some features of this standard in my work. Moreover, the paper entitled “Beyond throughput, the next generation: a 5G dataset with channel and context metrics” (from MMSys’20 but presented at MMSys’21) in the open dataset session was particularly interesting for me, as I used their previous 4G dataset as radio traces for my last paper.
During the conference, I had the opportunity to discuss and exchange ideas with different researchers, which I found valuable and insightful. I would also like to highlight the good organization of the conference and the social events.
Finally, I presented my work in the Doctoral Symposium session, and I received some interesting questions from the audience. It was a great opportunity, and I am grateful to SIGMM, which allowed me to participate in this extraordinary experience.

Towards an updated understanding of immersive multimedia experiences

Bringing theories and measurement techniques up to date

Development of technology for immersive multimedia experiences

Immersive multimedia experiences, as the name suggests, are experiences built around media that can immerse users in an environment through different interactions. Through different technologies and approaches, immersive media emulates a physical world by means of a digital or simulated one, with the goal of creating a sense of immersion. Users are involved in a technologically driven environment where they may actively join and participate in the experiences offered by the generated world [White Paper, 2020]. As hardware and technologies develop further, these immersive experiences keep improving, with an ever more advanced feeling of immersion. This means that immersive multimedia experiences go beyond merely viewing a screen and enable far greater potential. This column aims to present and discuss the need for an up-to-date understanding of immersive media quality. First, the development of the constructs of immersion and presence over time is outlined. Second, influencing factors of immersive media quality are introduced, and related standardisation activities are discussed. Finally, the column concludes by summarising why an updated understanding of immersive media quality is urgent.

Development of theories covering immersion and presence

One of the first definitions of presence was established by Slater and Usoh in 1993, who defined presence as a “sense of presence” in a virtual environment [Slater, 1993]. This is in line with other early definitions of presence and immersion. For example, Biocca defined immersion as a system property. Those definitions focused more on the ability of the system to technically and accurately provide stimuli to users [Biocca, 1995]. As technology was only slowly becoming capable of providing systems that could generate stimuli mimicking the real world, this was naturally the main content of definitions. Quite early on, questionnaires to capture experienced immersion were introduced, such as the Igroup Presence Questionnaire (IPQ) [Schubert, 2001]. The early methods for measuring experiences also mainly focused on how well the representation of the real world was done and perceived. With maturing technology, the focus shifted more towards emotions and cognitive phenomena beyond basic stimulus generation. For example, Baños and colleagues showed that experienced emotion and immersion are related to each other and also influence the sense of presence [Baños, 2004]. Newer definitions focus more on these cognitive aspects; e.g., Nilsson defines three factors that can lead to immersion: (i) technology, (ii) narratives, and (iii) challenges, where only the technology factor is a non-cognitive one [Nilsson, 2016]. In 2018, Slater defined the place illusion as the illusion of being in a place while knowing one is not really there. This focuses on a cognitive construct, the removal of disbelief, but still attributes how the illusion is created mainly to system factors instead of cognitive ones [Slater, 2018]. In recent years, more and more activities have started to define how to measure immersive experiences as an overall construct.

Constructs of interest in relation to immersion and presence

This section discusses constructs and activities that are related to immersion and presence. In the beginning, subtypes of extended reality (XR) and the relation to user experience (UX) as well as quality of experience (QoE) are outlined. Afterwards, recent standardization activities related to immersive multimedia experiences are introduced and discussed.
Moreover, immersive multimedia experiences can be divided along many different factors, but recently the most common distinction concerns interactivity: content can be made for multi-directional viewing, as in 360-degree videos, or presented through interactive extended reality. XR technologies can be divided into mixed reality (MR), augmented reality (AR), augmented virtuality (AV), virtual reality (VR), and everything in between [Milgram, 1995]. Across all those areas, immersive multimedia experiences have found a place on the market and are providing new solutions to challenges in research as well as in industry, with a growing potential for adoption in different areas [Chuah, 2018].

While discussing immersive multimedia experiences, it is important to address user experience and the quality of immersive multimedia experiences, which can be defined following the definition of quality of experience itself [White Paper, 2012]: a measure of the delight or annoyance of a customer’s experiences with a service, where in this case the service is an immersive multimedia experience. Furthermore, in defining QoE, the terms experience and application are also defined and can be applied to immersive multimedia experiences: an experience is an individual’s stream of perception and interpretation of one or multiple events, and an application is software and/or hardware that enables usage and interaction by a user for a given purpose [White Paper 2012].

As already mentioned, immersive media experiences have an impact in many different fields, but one where the impact of immersion and presence is particularly investigated is gaming applications, along with the QoE models and optimizations that go with them. Of specific interest is the framework and standardization of subjective evaluation methods for gaming quality [ITU-T Rec. P.809, 2018]. This standardization provides instructions on how to assess QoE for gaming services under two possible test paradigms, i.e., passive viewing tests and interactive tests. However, even though detailed information about environments, test set-ups, questionnaires, and game selection materials is available, these are still focused on the gaming field and on the concepts of flow and immersion in games themselves.
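At the core of such subjective tests, whichever paradigm is used, ratings are aggregated into a Mean Opinion Score (MOS) with a confidence interval. The sketch below shows this basic analysis; the ratings are invented, and the 1.96 factor is the normal approximation (small panels should use the Student-t quantile instead).

```python
import statistics

# Sketch of the basic analysis behind subjective QoE tests such as those in
# ITU-T P.809: ratings on a 5-point ACR scale are averaged into a Mean
# Opinion Score (MOS) with a confidence interval. The ratings below are
# invented; the 1.96 factor is the normal approximation, and small panels
# should use the Student-t quantile instead.

def mos_with_ci(ratings, z=1.96):
    """Return (MOS, half-width of the confidence interval)."""
    mos = statistics.mean(ratings)
    half_width = z * statistics.stdev(ratings) / len(ratings) ** 0.5
    return mos, half_width

ratings = [4, 5, 3, 4, 4, 5, 3, 4, 4, 5]  # hypothetical ACR votes
mos, ci = mos_with_ci(ratings)
print(f"MOS = {mos:.2f} +/- {ci:.2f}")
```

Overlapping confidence intervals between two conditions are precisely what makes scales unable to differentiate systems, a point the conclusion below returns to.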

Together with gaming, another step in defining and standardizing the infrastructure of audiovisual services in telepresence, immersive environments, and virtual and extended reality has been taken with regard to defining different service scenarios of immersive live experience [ITU-T Rec. H.430.3, 2018], where live sports, entertainment, and telepresence scenarios have been described. This standardization describes several immersive live experience scenarios together with architectural frameworks for delivering such services, but does not cover all possible use cases. When mentioning immersive multimedia experiences, spatial audio, sometimes referred to as “immersive audio”, must be mentioned, as it is one of the key features especially of AR and VR experiences [Agrawal, 2019]: in AR it can provide immersive experiences on its own, while in VR it can enhance the visual information.
In order to correctly assess QoE or UX, one must be aware of all characteristics, such as user, system, content, and context, because their actual state may influence the immersive multimedia experience of the user. That is why all those characteristics are defined as influencing factors (IFs), divided into Human IFs, System IFs, and Context IFs, and are standardized for virtual reality services [ITU-T Rec. G.1035, 2021]. A particularly relevant Human IF is simulator sickness, as it specifically occurs as a result of exposure to immersive XR environments. Simulator sickness, also known as cybersickness or VR/AR sickness, is visually induced motion sickness triggered by visual stimuli and caused by the sensory conflict arising between the vestibular and visual systems. Therefore, to achieve the full potential of immersive multimedia experiences, the unwanted sensation of simulator sickness must be reduced. However, while immersive technology changes frequently and hardware improvements lead to better experiences, constant updating of requirement specifications, design, and development is needed to keep up with best practices.

Conclusion – Towards an updated understanding

Considering the development of theories, definitions, and influencing factors around the constructs of immersion and presence, two different streams can be seen. First, most early theories focus quite strongly on the technical capabilities of systems. Second, cognitive aspects and non-technical influencing factors gain importance in more recent works. Of course, in the 1990s technology was not yet ready to provide a good simulation of the real world; therefore, most activities to improve systems, including measurement techniques, were focused on that goal. In recent years, technology has developed rapidly, and a basic simulation of a virtual environment is now possible even on mobile devices such as the Oculus Quest 2. Although concepts such as immersion and presence carry over from the past, the definitions dealing with them need to capture today's technology as well. Meanwhile, systems have proven to be good real-world simulators that provide users with a feeling of presence and immersion. While standardization activity is already quite strong and industry-driven, research in many disciplines such as telecommunication still mainly relies on old questionnaires. These questionnaires mostly focus on technological/real-world simulation constructs and are thus no longer able to differentiate products and services to an optimal extent. There are some newer attempts to create measurement tools for, e.g., the social aspects of immersive systems [Li, 2019; Toet, 2021]. Measurement scales that aim to capture differences in the ability of systems to create realistic simulations can no longer reliably differentiate between systems, simply because most systems now provide realistic real-world simulations.
To advance research and industrial development in the field of immersive media, we need definitions of constructs and measurement methods that are appropriate for current technology, even if these newer definitions and measurements are not yet widely cited or used. This will lead to improved development and, in the future, better immersive media experiences.

One step towards understanding immersive multimedia experiences is reflected by QoMEX 2022. The 14th International Conference on Quality of Multimedia Experience will be held from September 5th to 7th, 2022 in Lippstadt, Germany. It will bring together leading experts from academia and industry to present and discuss current and future research on multimedia quality, Quality of Experience (QoE), and User Experience (UX). It will contribute to excellence in developing multimedia technology towards user well-being and foster exchange between multidisciplinary communities. Core topics include immersive experiences and technologies, as well as new assessment and evaluation methods; both contribute to bringing theories and measurement techniques up to date. For more details, please visit https://qomex2022.itec.aau.at.


[Agrawal, 2019] Agrawal, S., Simon, A., Bech, S., Bærentsen, K., Forchhammer, S. (2019). “Defining Immersion: Literature Review and Implications for Research on Immersive Audiovisual Experiences.” In Audio Engineering Society Convention 147. Audio Engineering Society.
[Biocca, 1995] Biocca, F., & Delaney, B. (1995). Immersive virtual reality technology. Communication in the age of virtual reality, 15(32), 10-5555.
[Baños, 2004] Baños, R. M., Botella, C., Alcañiz, M., Liaño, V., Guerrero, B., & Rey, B. (2004). Immersion and emotion: their impact on the sense of presence. Cyberpsychology & behavior, 7(6), 734-741.
[Chuah, 2018] Chuah, S. H. W. (2018). Why and who will adopt extended reality technology? Literature review, synthesis, and future research agenda. Literature Review, Synthesis, and Future Research Agenda (December 13, 2018).
[ITU-T Rec. G.1035, 2021] ITU-T Recommendation G.1035 (2021). Influencing factors on quality of experience for virtual reality services, Int. Telecomm. Union, CH-Geneva.
[ITU-T Rec. H.430.3, 2018] ITU-T Recommendation H.430.3 (2018). Service scenario of immersive live experience (ILE), Int. Telecomm. Union, CH-Geneva.
[ITU-T Rec. P.809, 2018] ITU-T Recommendation P.809 (2018). Subjective evaluation methods for gaming quality, Int. Telecomm. Union, CH-Geneva.
[Li, 2019] Li, J., Kong, Y., Röggla, T., De Simone, F., Ananthanarayan, S., De Ridder, H., … & Cesar, P. (2019, May). Measuring and understanding photo sharing experiences in social Virtual Reality. In Proceedings of the 2019 CHI Conference on Human Factors in Computing Systems (pp. 1-14).
[Milgram, 1995] Milgram, P., Takemura, H., Utsumi, A., & Kishino, F. (1995, December). Augmented reality: A class of displays on the reality-virtuality continuum. In Telemanipulator and telepresence technologies (Vol. 2351, pp. 282-292). International Society for Optics and Photonics.
[Nilsson, 2016] Nilsson, N. C., Nordahl, R., & Serafin, S. (2016). Immersion revisited: a review of existing definitions of immersion and their relation to different theories of presence. Human Technology, 12(2).
[Schubert, 2001] Schubert, T., Friedmann, F., & Regenbrecht, H. (2001). The experience of presence: Factor analytic insights. Presence: Teleoperators & Virtual Environments, 10(3), 266-281.
[Slater, 1993] Slater, M., & Usoh, M. (1993). Representations systems, perceptual position, and presence in immersive virtual environments. Presence: Teleoperators & Virtual Environments, 2(3), 221-233.
[Slater, 2018] Slater, M. (2018). Immersion and the illusion of presence in virtual reality. British Journal of Psychology, 109(3), 431-433.
[Toet, 2021] Toet, A., Mioch, T., Gunkel, S. N., Niamut, O., & van Erp, J. B. (2021). Holistic Framework for Quality Assessment of Mediated Social Communication.
[White Paper, 2012] Qualinet White Paper on Definitions of Quality of Experience (2012). European Network on Quality of Experience in Multimedia Systems and Services (COST Action IC 1003), Patrick Le Callet, Sebastian Möller and Andrew Perkis, eds., Lausanne, Switzerland, Version 1.2, March 2013.
[White Paper, 2020] Perkis, A., Timmerer, C., Baraković, S., Husić, J. B., Bech, S., Bosse, S., … & Zadtootaghaj, S. (2020). QUALINET white paper on definitions of immersive media experience (IMEx). arXiv preprint arXiv:2007.07032.

Multidisciplinary Column: An Interview with Odette Scharenborg

Odette, could you tell us a bit about your background, and what the road to your current position was?

Dr Odette Scharenborg, Associate professor and Delft Technology Fellow, SpeechLab/Multimedia Computing Group, Delft University of Technology

In high school, I enjoyed both languages and science subjects such as physics, chemistry and biology. When researching what I wanted to study, I came across “Language, Speech, and Computer Science” at Radboud University, Nijmegen, the Netherlands, which sounded like, and indeed was, an interesting combination of both languages and science. Probably inspired by one of my favourite TV series from when I was younger, Knight Rider, which featured a car you could communicate with through speech, I focused on speech technology from early on.

After obtaining my university degree in 2000, I was offered a PhD position at the same department where I had pursued my studies, on another interdisciplinary topic: computational modelling of human speech processing. My PhD project (2001-2005) combined theories about human speech processing (psycholinguistics) with tools and approaches from automatic speech recognition (itself more or less at the crossroads of electrical engineering and computer science) in order to learn more about how humans process speech and to improve automatic speech recognition (i.e., the conversion of speech into text).

After obtaining my PhD (in 2005), I went to the Speech and Hearing group in the Department of Computer Science at the University of Sheffield, UK, for a visiting post-doc position (funded by a Dutch Science Foundation (NWO) Talent Scholarship). I then returned to Radboud University for a 3-year post-doc position (funded by an NWO Veni personal fellowship) on a new computational modelling of human speech processing project. After this project, having read so much about theories of how humans process speech, I really wanted to know how researchers actually arrived at these theories. So, in the next few years, my research focused on human speech processing: first at the Max Planck Institute for Psycholinguistics, where I was trained as a psycholinguist, and subsequently, funded by an NWO Vidi personal grant, again at Radboud University, where I became Associate Professor.

Towards the end of my Vidi-project (in 2016), I started to miss the computer science component of my earlier research and decided to try to move back into automatic speech recognition. I had an idea, met two amazing speech researchers who loved my idea, and we decided to collaborate. This collaboration (still ongoing) has allowed me to move back into the field of automatic speech recognition that at that time was rapidly changing due to the rise of deep learning.

In 2018, my Vidi project and contract at Radboud University ended, and I became unemployed. I was then headhunted by a company on automatic speech recognition for health applications. However, I felt that I wanted to stay in academia. Luckily for me, shortly after joining the company, Delft University of Technology offered me a Delft Technology Fellowship, and I joined TU Delft in June 2018, where I’ve since then worked as an Associate Professor of Speech Technology.

How important is interdisciplinarity in your research on speech?

As is probably clear from my road so far, I am an interdisciplinary researcher. The field of automatic speech recognition is already interdisciplinary in that it combines electrical engineering and computer science. However, in my research, I use my knowledge about sounds and sound structures (i.e., phonetics, a subfield of linguistics) and am inspired by and use knowledge about how humans process speech (i.e., psycholinguistics). The speech signal can be researched and viewed from different angles: from the perspective of frequencies (physics), of the individual sounds (phonetics), of meaning (semantics), as a means to convey a message or intent, etc. It also contains different types of information: the words of the message, and information about the speaker's identity, age, gender, height, health status, emotional state, and native language, to name only a few.

The focus of my research is on automatic speech recognition. Automatic speech recognisers typically work well for “standard” speakers of a small number of languages. In fact, for only about 2% of all the languages in the world is there enough annotated speech data to build automatic speech recognisers. Moreover, a large portion of society does not speak in a “standard” way: “standard” speakers are native speakers of a language, without a speech impediment, without a strong regional accent, typically highly educated, and between the ages of 18 and 60 years. As you can tell, this excludes a large portion of our society: children, the elderly, people with speaking or voice disorders, deaf people, immigrants, etc. In my work, I focus on making speech technology, and particularly automatic speech recognition, available for everyone, irrespective of how one speaks and the language one speaks. In order to do so, I look at how humans process speech, as they are the best speech recognisers in existence; moreover, they can quickly adapt to idiosyncrasies in a speaker's speech or voice. I also use knowledge about how sounds and the voice differ depending on, for instance, the speaker's age or health status. So, in my research towards inclusive speech technology, I combine computer science with linguistics and psycholinguistics. Interdisciplinarity is thus at the core of my research.

What disciplines do you combine yourself in your own work?

As explained above, in my research I combine multiple research fields, most notably: computer science, different subfields of psycholinguistics (first and second language learning, native and non-native speech processing; the processing of emotions) and linguistics (primarily phonetics and a bit of conversational analysis).

Could you name a grand research challenge in your current field of work?

There are several grand research challenges in my field:

  • I already named one: making speech technology available for everyone, irrespective of how one speaks and what language one speaks. One of the grand challenges for this is to build speech technology for speech that is not only highly variable but for which also only a little amount of data is available (i.e., low resource scenarios).
  • A second grand challenge: when people speak, they often use words or phrases from another language; this is called code-switching. Automatic speech recognisers are typically built for one language, and it is very hard for them to deal with code-switched speech.
  • A third grand challenge: speech is often produced with background noise or background speech present. This deteriorates recognition performance tremendously. Dealing with all the different types of background noise and speech is another grand challenge.

You have been an active champion for diversity and inclusion. Could you tell us a bit more about your activities on these topics?

When I was growing up academically, I did not really have a female role model, and especially not female role models who had children. When I was in my late twenties and early thirties, I found this hard, because I was afraid that having children would negatively impact my chances for the next academic job and my academic career in general. Also, being not only a first-generation PhD but also a first-generation academic, it took me a really long time to realise that there were unwritten rules, to learn what these were, and to learn how to deal with them (I'm not sure I know them all even now 😉). Then, when I became Associate Professor at Radboud University, I found that several students, male and female, regularly came to talk to me about personal and academic issues, that they found my advice useful, and that I found it interesting and motivating to talk to them. I wanted to do more regarding gender equality but didn't know how.

Then in 2016, a group of senior female speech researchers organised the Young Female Researchers in Speech Science and Technology Workshop, in conjunction with Interspeech, the flagship conference of the International Speech Communication Association (ISCA), in order to attract more female students into speech PhD programs. I was invited as a mentor. This workshop was highly successful and is now a yearly workshop held in conjunction with Interspeech; I joined its organisation for 3 years. Then in 2019, having advocated gender equality on the ISCA board, of which I've been a member since 2017, I was asked to form a new committee: the committee for gender equality. Very quickly this committee started to focus on more than gender and to look at other types of diversity: sexual orientation, research areas (ISCA encompasses several speech sub-areas, including phonetics, psycholinguistics, health, automatic speech recognition, speech generation, etc.), and geographical regions. Naturally, we not only wanted to attract people from diverse backgrounds but also to retain them, so we also started to look into inclusion. The first thing our committee did was to create a website where female speech researchers who hold a PhD can list themselves. This website helps workshop and conference organisers find female researchers for organising committees, as panellists, and as keynote or invited speakers. We then went on to organise diversity and inclusion meetings at Interspeech, and for 2 years we have organised a separate ISCA-queer meeting. We held a workshop in Africa (remotely, due to the pandemic) in order to reach local speech researchers there and see where we can collaborate and where we can help them with our resources and expertise. We wrote a code of conduct for session chairs at workshops and conferences so that they know how to balance questions from people from minority and non-minority groups.
These are but a few of our activities.

In 2020 I came up with the idea for a mentoring programme within the IEEE Signal Processing Society (SPS) for students from minority groups, which was well received and funded at $50K annually. This programme, loosely based on the YFRSW format, pairs each student with a mentor from our society who supervises them for a period of 9 months, mentors them, and helps them build a network. Each student receives $4K to visit one of the IEEE SPS conferences or workshops. In the first round, we made awards to 9 students from all over the world.

In addition to these activities, I’ve also been on the board of the Delft Women in Science (DEWIS) at my university and the chair of the Diversity and Inclusion Committee (EDIT) at my faculty at TU Delft. Additionally, I am regularly asked to appear as a female role model in STEM for young girls and in Dutch media.

In getting to your current position, you experienced some personal hardships. In serving as a public role model, you have been open about these. How can we learn from these experiences to make academia a better place?

My CV shows the (many) consecutive positions I’ve had and how almost all are financed by personal grants that I obtained. These personal grants especially tend to attract a lot of praise. What my CV doesn’t show is the story behind it. It doesn’t show the many job applications I sent out, which never led to a position. It doesn’t show that for a period of more than 2 years I did not have a contract, meaning that I did not have any social security, while I was working on a post-doc position. It does not show how I was bullied at my previous university and the damage that did to my self-esteem, something I still struggle with. It does not show that I had to leave behind my 10-month-old daughter for a month and again for 2 weeks because I was expected to be in Germany for a post-doc position, nor does it show the two bouts of (mild) depression I suffered (one directly related to the bullying). I never talked about all of this because as a temporary (and young and female) researcher, you feel extremely vulnerable because you are so dependent on (the goodwill of) other, more senior researchers. If you don’t want to or cannot do a task, if you complain, they will simply find someone else and you are without a job again. On top of that, you often simply are not believed.

When I became a mentor for students and young researchers, I decided to share some of my struggles so that they knew that they were not the only ones who struggled and that I knew what they were going through. I began to receive feedback from these students that they appreciated my honesty and openness, which gave me the courage to be more open about my own issues. However, only after I received my permanent position at TU Delft (in 2019), and after becoming active in diversity and inclusion, did I very slowly dare to speak more openly to my colleagues and senior people about what had happened.

In late 2019, I was asked to talk about what it is like to be a female researcher in speech technology at the IEEE Workshop on Automatic Speech Recognition and Understanding. I thought about the story I wanted to tell, and eventually, I decided to tell my colleagues, including many of my close friends, my story: I started by showing them my CV, which received a lot of appreciative nods. I then told the story of my life, the story that is not shown by my CV, including the hardships. This resulted in many of my male colleagues and friends crying. Of course, this was never my intention. I don't think that my story is that much different from that of the average person from a minority group, and probably there are quite a few men whose stories are worse than mine. What I wanted to say was: CVs might look great, or they might not. It is important not to take CVs or facts or numbers at face value; you don't know what people go through or have done to get where they are. Everyone has a story to tell; but it is, unfortunately, the case that the bad stories far more often happen to women and other people from minority groups.

A third piece of advice: if you go through a hard time, know that you are not alone. In life in general, and in academia in particular, we celebrate successes, but failures and hardships are ignored and often considered a weakness. I strongly believe that by being open about one's hardships, you will feel better yourself and will help others deal with theirs.

Finally, we need to see fellow academics as people and treat them as people. We need to be supportive of one another, especially of our younger colleagues and of those from minority groups. We should be mentors and role models. We should listen to what they are saying and believe what they are saying. Not question what they say, but believe them when they describe something bad that has happened to them and help, because daring to speak up takes an enormous amount of strength and courage. If one dares to speak up, believe that it is true and tell them that you know how courageous they have to be to speak up.

How and in what form do you feel we as academics can be most impactful?

As academics we have many responsibilities: we teach the younger generation, and we investigate and develop new technology and theories. Some of our research has a direct impact on society, some does not yet, and some will maybe never have a direct impact on society. I don't believe that all research needs to have an impact. I do believe that we as academics can be impactful, and that is by explaining science to the general audience. What is science? Why don't scientists have answers to all questions? Why is what you do important? By explaining one's research in layman's terms, science and scientific output become easier for non-scientists to understand. It will help shape public debate. It will lead to scientific results not being dismissed as easily as often happens nowadays. At the same time, and at least as important: by talking to people from the general public, you as an academic will see the world through their eyes and look at the impact of your work in a different way, and I am convinced it will also often help explain why a certain development or technology is not adopted by society at large or by a particular group in society. In short, academics can be most impactful by communicating with the general public, and communication is, and thus should be treated as, a two-directional process.


Dr Odette Scharenborg is an Associate Professor and Delft Technology Fellow at the Multimedia Computing Group at the Delft University of Technology, the Netherlands, and the Vice-President of the International Speech Communication Association (ISCA). Her research focuses on human speech-processing inspired automatic speech processing with the aim to develop inclusive speech technology, i.e., speech technology that works for everyone irrespective of how they speak or the language they speak.

Since 2017, Odette has been on the Board of ISCA, where she also chairs the Diversity committee (since 2019) and was co-chair of the Interspeech Conferences committee and of the Technical Committee (2017-2019). From 2018 to 2021, Odette was a member of the IEEE Speech and Language Processing Technical Committee (subarea Speech Production and Perception). From 2019 to 2021, she was an Associate Editor of IEEE Signal Processing Letters, where she is now a Senior Associate Editor.

Editor Biographies


Dr Cynthia C. S. Liem is an Associate Professor in the Multimedia Computing Group of Delft University of Technology, The Netherlands, and pianist of the Magma Duo. Her research interests focus on making people discover new interests and content which would not trivially be retrieved in music and multimedia collections, assessing questions of validation and validity in data science, and fostering trustworthy and responsible AI applications when human-interpreted data is involved. She initiated and co-coordinated the European research projects PHENICX (2013-2016) and TROMPA (2018-2021), focusing on technological enrichment of digital musical heritage, and participated as technical partner in an ERASMUS+ education innovation project on Big Data for Psychological Assessment. She gained industrial experience at Bell Labs Netherlands, Philips Research and Google. She was a recipient of the Lucent Global Science and Google Anita Borg Europe Memorial scholarships, the Google European Doctoral Fellowship 2010 in Multimedia, a finalist of the New Scientist Science Talent Award 2016 for young scientists committed to public outreach, Researcher-in-Residence 2018 at the National Library of The Netherlands, general chair of the ISMIR 2019 conference, and keynote speaker at the RecSys 2021 conference. Presently, she co-leads the Future Libraries Lab with the National Library of The Netherlands, is track leader of the Trustworthy AI track in the AI for Fintech lab with the ING bank, holds a TU Delft Education Fellowship on Responsible AI teaching, and is a member of the Dutch Young Academy.


Dr Jochen Huber is Professor of Computer Science at Furtwangen University, Germany. Previously, he was a Senior User Experience Researcher with Synaptics and an SUTD-MIT postdoctoral fellow in the Fluid Interfaces Group at MIT Media Lab and the Augmented Human Lab at Singapore University of Technology and Design. He holds a Ph.D. in Computer Science and degrees in both Mathematics (Dipl.-Math.) and Computer Science (Dipl.-Inform.), all from Technische Universität Darmstadt, Germany. Jochen’s work is situated at the intersection of Human-Computer Interaction and Human Augmentation. He designs, implements and studies novel input technology in the areas of mobile, tangible & non-visual interaction, automotive UX and assistive augmentation. He has co-authored over 60 academic publications and regularly serves as program committee member in premier HCI and multimedia conferences. He was program co-chair of ACM TVX 2016 and Augmented Human 2015 and chaired tracks of ACM Multimedia, ACM Creativity and Cognition and the ACM International Conference on Interactive Surfaces and Spaces, as well as numerous workshops at ACM CHI and IUI. Further information can be found on his personal homepage: http://jochenhuber.com

Overview of Open Dataset Sessions and Benchmarking Competitions in 2021

This issue of the Dataset Column offers a review of some of the most important events in 2021 featuring special sessions on open datasets or benchmarking competitions involving multimedia data. While this is not meant to be an exhaustive list of events, we wish to underline the great diversity of subjects and dataset topics currently of interest to the multimedia community. We will present the following events:

  • 13th International Conference on Quality of Multimedia Experience (QoMEX 2021 – https://qomex2021.itec.aau.at/). We summarize six datasets from this conference that address QoE studies on haze conditions (RHVD), tele-education events (EVENT-CLASS), storytelling scenes (MTF), image compression (EPFL), virtual reality effects on gamers (5Gaming), and live stream shopping (LSS-survey).
  • Multimedia Datasets for Repeatable Experimentation at 27th International Conference on Multimedia Modeling (MDRE at MMM 2021 – https://mmm2021.cz/special-session-mdre/). We summarize the five datasets presented during the MDRE, addressing several topics like lifelogging and environmental data (MNR-HCM), cat vocalizations (CatMeows), home activities (HTAD), gastrointestinal procedure tools (Kvasir-Instrument), and keystroke and lifelogging (KeystrokeDynamics).
  • Open Dataset and Software Track at 12th ACM Multimedia Systems Conference (ODS at MMSys ’21) (https://2021.acmmmsys.org/calls.php#ods). We summarize seven datasets presented at the ODS track, targeting several topics like network statistics (Brightcove Streaming Datasets, and PePa Ping), emerging image and video modalities (Full UHD 360-Degree, 4DLFVD, and CWIPC-SXR) and human behavior data (HYPERAKTIV and Target Selection Datasets).
  • Selected datasets at 29th ACM Multimedia Conference (MM ’21) (https://2021.acmmm.org/). For a general report from ACM Multimedia 2021 please see (https://records.sigmm.org/2021/11/23/reports-from-acm-multimedia-2021/). We summarize six datasets presented during the conference, targeting several topics like food logo detection (FoodLogoDet-1500), emotional relationship recognition (ERATO), text-to-face synthesis (CelebAText-HQ), multimodal linking (M3EL), egocentric video analysis (EGO-Deliver), and quality assessment of user-generated videos (PUGCQ).
  • ImageCLEF 2021 (https://www.imageclef.org/2021). We summarize the six datasets launched for the benchmarking tasks, related to several topics like social media profile assessment (ImageCLEFaware), segmentation and labeling of underwater coral images (ImageCLEFcoral), automatic generation of web-pages (ImageCLEFdrawnUI) and medical imaging analysis (ImageCLEF-VQAMed, ImageCLEFmedCaption, and ImageCLEFmedTuberculosis).

Creating annotated datasets is even more difficult in ongoing pandemic times, and we are glad to see that many interesting datasets were published despite this unfortunate situation.

QoMEX 2021

A large number of dataset-related papers were presented at the 13th International Conference on Quality of Multimedia Experience (QoMEX 2021), organized as a fully online event in Montreal, Canada, June 14-17, 2021 (https://qomex2021.itec.aau.at/). The complete QoMEX ’21 proceedings are available in the IEEE Digital Library (https://ieeexplore.ieee.org/xpl/conhome/9465370/proceeding).

The conference did not feature a dedicated dataset session. However, datasets were very important to the conference, with a number of papers introducing new datasets or making use of broadly available ones. As a small example, six selected papers focused primarily on new datasets are listed below. Their contributions address haze, teaching in virtual reality, multiview video, image quality, cybersickness in virtual reality gaming, and shopping patterns.

A Real Haze Video Database for Haze Evaluation
Paper available at: https://ieeexplore.ieee.org/document/9465461
Chu, Y., Luo, G., and Chen, F.
College of Computer Science and Software Engineering, Shenzhen University, Shenzhen, P.R. China.
Dataset available at: https://drive.google.com/file/d/1zY0LwJyNB8u1JTAJU2X7ZkiYXsBX7BF/view?usp=sharing

The RHVD video quality assessment dataset focuses on the study of perceptual degradation caused by heavy haze conditions in real-world outdoor scenes, addressing a large number of possible use cases, including driving assistance and warning systems. The dataset was collected from the Flickr video-sharing platform and post-edited, and 40 annotators participated in the subjective quality assessment experiments.

EVENT-CLASS: Dataset of events in the classroom
Paper available at: https://ieeexplore.ieee.org/document/9465389
Orduna, M., Gutierrez, J., Manzano, C., Ruiz, D., Cabrera, J., Diaz, C., Perez, P., and Garcia, N.
Grupo de Tratamiento de Imágenes, Information Processing & Telecom. Center, Universidad Politécnica de Madrid, Spain; Nokia Bell Labs, Madrid, Spain.
Dataset available at: http://www.gti.ssr.upm.es/data/event-class

The EVENT-CLASS dataset consists of 360-degree videos that contain events and characteristics specific to the context of tele-education, composed of video and audio sequences recorded under varying conditions. The dataset addresses several topics, including quality assessment tests aimed at improving the immersive experience of remote users.

A Multi-View Stereoscopic Video Database With Green Screen (MTF) For Video Transition Quality-of-Experience Assessment
Paper available at: https://ieeexplore.ieee.org/document/9465458
Hobloss, N., Zhang, L., and Cagnazzo, M.
LTCI, Télécom-Paris, Institut Polytechnique de Paris, Paris, France; Univ Rennes, INSA Rennes, CNRS, Rennes, France.
Dataset available at: https://drive.google.com/drive/folders/1MYiD7WssSh6X2y-cf8MALNOMMish4N5j

MTF is a multi-view stereoscopic video dataset containing full-HD videos of real storytelling scenes, targeting QoE assessment for the analysis of visual artefacts that appear during automatically generated point-of-view transitions. The dataset features a large baseline of camera setups and can also be used in other computer vision applications, such as video compression, 3D video content, VR environments, and optical flow estimation.

Performance Evaluation of Objective Image Quality Metrics on Conventional and Learning-Based Compression Artifacts
Paper available at: https://ieeexplore.ieee.org/document/9465445
Testolina, M., Upenik, E., Ascenso, J., Pereira, F., and Ebrahimi, T.
Multimedia Signal Processing Group, École Polytechnique Fédérale de Lausanne (EPFL), Lausanne, Switzerland; Instituto Superior Técnico, Universidade de Lisboa – Instituto de Telecomunicações, Lisbon, Portugal.
Dataset available on request to the authors.

This dataset consists of a collection of compressed images, labelled according to subjective quality scores, targeting the evaluation of 14 objective quality metrics against the perceived human quality baseline.

The Effect of VR Gaming on Discomfort, Cybersickness, and Reaction Time
Paper available at: https://ieeexplore.ieee.org/document/9465470
Vlahovic, S., Suznjevic, M., Pavlin-Bernardic, N., and Skorin-Kapov, L.
Faculty of Electrical Engineering and Computing, University of Zagreb, Zagreb, Croatia; Faculty of Humanities and Social Sciences, University of Zagreb, Zagreb, Croatia.
Dataset available on request to the authors.

The authors present the results of a study conducted on 20 users, measuring the physiological and cognitive aftereffects of exposure to three different VR games with game mechanics centred around natural interactions. This work moves away from cybersickness as the primary measure of VR discomfort and analyzes other concepts, such as device-related discomfort, muscle fatigue and pain, and their correlations with game complexity.

Beyond Shopping: The Motivations and Experience of Live Stream Shopping Viewers
Paper available at: https://ieeexplore.ieee.org/document/9465387
Liu, X. and Kim, S. H.
Adelphi University.
Dataset available on request to the authors.

The authors propose a study of 286 live stream shopping users, where viewer motivations are examined according to the Uses and Gratifications Theory, identifying sixteen motivation constructs organized under four larger categories: entertainment, information, socialization, and experience.

MDRE at MMM 2021

The Multimedia Datasets for Repeatable Experimentation (MDRE) special session is part of the 2021 International Conference on Multimedia Modeling (MMM 2021). The session was organized by Cathal Gurrin (Dublin City University, Ireland), Duc-Tien Dang-Nguyen (University of Bergen, Norway), Björn Þór Jónsson (IT University of Copenhagen, Denmark) and Klaus Schoeffmann (Klagenfurt University, Austria). More details regarding this session can be found at: https://mmm2021.cz/special-session-mdre/

The MDRE’21 special session at MMM’21 is the third MDRE edition, and it represents an opportunity for interested researchers to submit their datasets to this track. The work submitted to MDRE is permanently available at http://mmdatasets.org, where all current and past editions of MDRE are hosted. Authors are asked to provide the dataset itself, along with a paper describing its motivation, design, and usage, a brief summary of the experiments performed on the dataset to date, and a discussion of how it can be useful to the community.

MNR-Air: An Economic and Dynamic Crowdsourcing Mechanism to Collect Personal Lifelog and Surrounding Environment Dataset
Paper available at: https://link.springer.com/chapter/10.1007%2F978-3-030-67835-7_18
Nguyen DH., Nguyen-Tai TL., Nguyen MT., Nguyen TB., Dao MS.
University of Information Technology, Ho Chi Minh City, Vietnam; University of Science, Ho Chi Minh City, Vietnam; Vietnam National University in Ho Chi Minh City, Ho Chi Minh City, Vietnam; National Institute of Information and Communications Technology, Koganei, Japan.
Dataset available on request to the authors.

The paper introduces an economical and dynamic crowdsourcing mechanism for collecting personal lifelog data and associated events. The resulting dataset, MNR-HCM, was collected in Ho Chi Minh City, Vietnam, and contains weather data, air pollution data, GPS data, lifelog images, and citizens’ cognition on a personal scale.

CatMeows: A Publicly-Available Dataset of Cat Vocalizations
Paper available at: https://link.springer.com/chapter/10.1007%2F978-3-030-67835-7_20
Ludovico L.A., Ntalampiras S., Presti G., Cannas S., Battini M., Mattiello S.
Department of Computer Science, University of Milan, Milan, Italy; Department of Veterinary Medicine, University of Milan, Milan, Italy; Department of Agricultural and Environmental Science, University of Milan, Milan, Italy.
Dataset available at: https://zenodo.org/record/4008297

The CatMeows dataset consists of vocalizations produced by 21 cats belonging to two breeds, namely Maine Coon and European Shorthair, emitted in three different contexts: brushing, isolation in an unfamiliar environment, and waiting for food. Recordings were performed with low-cost, readily available devices, thus creating a dataset representative of real-world scenarios.

HTAD: A Home-Tasks Activities Dataset with Wrist-accelerometer and Audio Features
Paper available at: https://link.springer.com/chapter/10.1007%2F978-3-030-67835-7_17
Garcia-Ceja, E., Thambawita, V., Hicks, S.A., Jha, D., Jakobsen, P., Hammer, H.L., Halvorsen, P., Riegler, M.A.
SINTEF Digital, Oslo, Norway; SimulaMet, Oslo, Norway; Oslo Metropolitan University, Oslo, Norway; UIT The Arctic University of Norway, Tromsø, Norway; Haukeland University Hospital, Bergen, Norway.
Dataset available at: https://datasets.simula.no/htad/

The HTAD dataset contains wrist-accelerometer and audio data collected during several normal day-to-day tasks, such as sweeping, brushing teeth, or watching TV. Being able to detect these types of activities is important for the creation of assistive applications and technologies that target elderly care and mental health monitoring.

Kvasir-Instrument: Diagnostic and Therapeutic Tool Segmentation Dataset in Gastrointestinal Endoscopy
Paper available at: https://link.springer.com/chapter/10.1007%2F978-3-030-67835-7_19
Jha, D., Ali, S., Emanuelsen, K., Hicks, S.A., Thambawita, V., Garcia-Ceja, E., Riegler, M.A., de Lange, T., Schmidt, P.T., Johansen, H.D., Johansen, D., Halvorsen, P.
SimulaMet, Oslo, Norway; UIT The Arctic University of Norway, Tromsø, Norway; Simula Research Laboratory, Oslo, Norway; Augere Medical AS, Oslo, Norway; Oslo Metropolitan University, Oslo, Norway; Medical Department, Sahlgrenska University Hospital-Mölndal, Gothenburg, Sweden; Department of Medical Research, Bærum Hospital, Gjettum, Norway; Karolinska University Hospital, Solna, Sweden; Department of Engineering Science, University of Oxford, Oxford, UK; Sintef Digital, Oslo, Norway.
Dataset available at: https://datasets.simula.no/kvasir-instrument/

The Kvasir-Instrument dataset consists of 590 annotated frames that contain gastrointestinal (GI) procedure tools such as snares, balloons, and biopsy forceps, and seeks to improve follow-up and the set of available information regarding the disease and the procedure itself, by providing baseline data for the tracking and analysis of the medical tools.

Keystroke Dynamics as Part of Lifelogging
Paper available at: https://link.springer.com/chapter/10.1007%2F978-3-030-67835-7_16
Smeaton, A.F., Krishnamurthy, N.G., Suryanarayana, A.H.
Insight Centre for Data Analytics, Dublin City University, Dublin, Ireland; School of Computing, Dublin City University, Dublin, Ireland.
Dataset available at: http://doras.dcu.ie/25133/

The authors created a dataset of longitudinal keystroke timing data that spans a period of up to seven months for four human participants. A detailed analysis of the data is performed, by examining the timing information associated with bigrams, or pairs of adjacently-typed alphabetic characters.

ODS at MMSys ’21

The traditional Open Dataset and Software Track (ODS) was a part of the 12th ACM Multimedia Systems Conference (MMSys ’21) organized as a hybrid event in Istanbul, Turkey, September 28 – October 1, 2021 (https://2021.acmmmsys.org/). The complete MMSys ’21: Proceedings of the 12th ACM Multimedia Systems Conference are available in the ACM Digital Library (https://dl.acm.org/doi/proceedings/10.1145/3458305).

The Session on Software, Tools and Datasets was chaired by Saba Ahsan (Nokia Technologies, Finland) and Luca De Cicco (Politecnico di Bari, Italy) on September 29, 2021, at 16:00 (UTC+3, Istanbul local time). The session opened with one-slide-per-minute intros given by the authors and then split into individual virtual booths. Seven of the thirteen presented contributions were dataset papers. A listing of the paper titles, their abstracts, and associated DOIs is included below for your convenience.

Adaptive Streaming Playback Statistics Dataset
Paper available at: https://dl.acm.org/doi/10.1145/3458305.3478444
Teixeira, T, Zhang, B., Reznik, Y.
Brightcove Inc, USA
Dataset available at: https://github.com/brightcove/streaming-dataset

The authors propose a dataset that captures statistics from a number of real-world streaming events, covering different devices (TVs, desktops, mobiles, tablets, etc.) and networks (from 2.5G, 3G, and other early-generation mobile networks to 5G and broadband). The captured data includes network and playback statistics, events, and characteristics of the encoded stream.

PePa Ping Dataset: Comprehensive Contextualization of Periodic Passive Ping in Wireless Networks
Paper available at: https://dl.acm.org/doi/10.1145/3458305.3478456
Madariaga, D., Torrealba, L., Madariaga, J., Bustos-Jimenez, J., Bustos, B.
NIC Chile Research Labs, University of Chile
Dataset available at: https://github.com/niclabs/pepa-ping-mmsys21

The PePa Ping dataset consists of real-world data with a comprehensive contextualization of Internet QoS indicators, such as round-trip time, jitter, and packet loss. The authors developed a methodology for Android devices that obtains the necessary information directly from the Linux kernel, making the indicators an accurate representation of real-world conditions.

Full UHD 360-Degree Video Dataset and Modeling of Rate-Distortion Characteristics and Head Movement Navigation
Paper available at: https://dl.acm.org/doi/10.1145/3458305.3478447
Chakareski, J., Aksu, R., Swaminathan, V., Zink, M.
New Jersey Institute of Technology; University of Alabama; Adobe Research; University of Massachusetts Amherst, USA
Dataset available at: https://zenodo.org/record/5156999#.YQ1XMlNKjUI

The authors create a dataset of 360-degree videos used to analyze the rate-distortion (R-D) characteristics of the videos. The videos are paired with head movement navigation data collected in Virtual Reality (VR), which may be used to analyze how users explore the panoramas around them in VR.

4DLFVD: A 4D Light Field Video Dataset
Paper available at: https://dl.acm.org/doi/10.1145/3458305.3478450
Hu, X., Wang, C., Pan, Y., Liu, Y., Wang, Y., Liu, Y., Zhang, L., Shirmohammadi, S.
University of Ottawa, Canada / Beijing University of Posts and Telecommunication, China
Dataset available at: https://dx.doi.org/10.21227/hz0t-8482

The authors propose a 4D Light Field (LF) video dataset collected via a custom-made camera matrix. The dataset is intended for designing and testing methods for LF video coding, processing, and streaming, providing more viewpoints and/or a higher framerate compared with similar datasets in the current literature.

CWIPC-SXR: Point Cloud dynamic human dataset for Social XR
Paper available at: https://dl.acm.org/doi/10.1145/3458305.3478452
Reimat, I., Alexiou, E., Jansen, J., Viola, I., Subramanyam, S., Cesar, P.
Centrum Wiskunde & Informatica, Netherlands
Dataset available at: https://www.dis.cwi.nl/cwipc-sxr-dataset/

The CWIPC-SXR dataset consists of 45 unique sequences covering several use cases of humans interacting in social extended reality. It is built from dynamic point clouds, which serve as a low-complexity representation in these types of systems.

HYPERAKTIV: An Activity Dataset from Patients with Attention-Deficit/Hyperactivity Disorder (ADHD)
Paper available at: https://dl.acm.org/doi/10.1145/3458305.3478454
Hicks, S. A., Stautland, A., Fasmer, O. B., Forland, W., Hammer, H. L., Halvorsen, P., Mjeldheim, K., Oedegaard, K. J., Osnes, B., Syrstad, V. E.G., Riegler, M. A.
SimulaMet; University of Bergen; Haukeland University Hospital; OsloMet, Norway
Dataset available at: http://datasets.simula.no/hyperaktiv/

The HYPERAKTIV dataset contains general patient information, health and activity data, information about mental state, and heart rate data from patients with Attention-Deficit/Hyperactivity Disorder (ADHD). It includes 51 patients with ADHD and 52 clinical control cases.

Datasets – Moving Target Selection with Delay
Paper available at: https://dl.acm.org/doi/10.1145/3458305.3478455
Liu, S. M., Claypool, M., Cockburn, A., Eg, R., Gutwin, C., Raaen, K.
Worcester Polytechnic Institute, USA; University of Canterbury, New Zealand; Kristiania University College, Norway; University of Saskatchewan, Canada
Dataset available at: https://web.cs.wpi.edu/~claypool/papers/selection-datasets/

The Selection datasets comprise data from four user studies on the effects of delay on video game actions, specifically the selection of a moving target with various pointing devices. They include performance data, such as time to selection, and user demographic data, such as age and gaming experience.

ACM MM 2021

A large number of dataset-related papers have been presented at the 29th ACM International Conference on Multimedia (MM’ 21), organized as a hybrid event in Chengdu, China, October 20 – 24, 2021 (https://2021.acmmm.org/). The complete MM ’21: Proceedings of the 29th ACM International Conference on Multimedia are available in the ACM Digital Library (https://dl.acm.org/doi/proceedings/10.1145/3474085).

There was no dedicated Dataset session among the more than 35 sessions at the MM ’21 symposium. However, the importance of datasets can be illustrated by the following statistics, quantifying how many times the term “dataset” appears among the 542 accepted papers: the term appears in the title of 7 papers, in the keywords of 66 papers, and in the abstracts of 339 papers. As a small example, six selected papers focused primarily on new datasets are listed below. There are contributions focused on social multimedia, emotion recognition, text-to-face synthesis, egocentric video analysis, emerging multimedia applications such as multimodal entity linking, and multimedia art, entertainment, and culture related to the perceived quality of video content.

FoodLogoDet-1500: A Dataset for Large-Scale Food Logo Detection via Multi-Scale Feature Decoupling Network
Paper available at: https://dl.acm.org/doi/10.1145/3474085.3475289
Hou, Q., Min, W., Wang, J., Hou, S., Zheng, Y., Jiang, S.
Shandong Normal University, Jinan, China; Institute of Computing Technology, Chinese Academy of Sciences, Beijing, China
Dataset available at: https://github.com/hq03/FoodLogoDet-1500-Dataset

FoodLogoDet-1500 is a large-scale food logo dataset with 1,500 categories, around 100,000 images, and 150,000 manually annotated food logo objects. This type of dataset is important for self-service applications in shops and supermarkets, and for copyright infringement detection on e-commerce websites.

Pairwise Emotional Relationship Recognition in Drama Videos: Dataset and Benchmark
Paper available at: https://dl.acm.org/doi/10.1145/3474085.3475493
Gao, X., Zhao, Y., Zhang, J., Cai, L.
Alibaba Group, Beijing, China
Dataset available on request to the authors.

The Emotional RelAtionship of inTeractiOn (ERATO) dataset is a large-scale multimodal dataset composed of over 30,000 interaction-centric video clips totalling around 203 hours. The videos are representative for studying the emotional relationship between the two interacting characters in each video clip.

Multi-caption Text-to-Face Synthesis: Dataset and Algorithm
Paper available at: https://dl.acm.org/doi/abs/10.1145/3474085.3475391
Sun, J., Li, Q., Wang, W., Zhao, J., Sun, Z.
Center for Research on Intelligent Perception and Computing, NLPR, CASIA, Beijing, China;
School of Artificial Intelligence, University of Chinese Academy of Sciences (UCAS), Beijing, China; Institute of North Electronic Equipment, Beijing, China
Dataset available on request to the authors.

The authors propose the CelebAText-HQ dataset, which addresses the text-to-face generation problem. Each image in the dataset is manually annotated with 10 captions, allowing proposed methods and algorithms to take multiple captions as input in order to generate highly semantically related face images.

Multimodal Entity Linking: A New Dataset and A Baseline
Paper available at: https://dl.acm.org/doi/10.1145/3474085.3475400
Gan, J., Luo, J., Wang, H., Wang, S., He, W., Huang, Q.
Key Laboratory of Intelligent Information Processing, Institute of Computing Technology, Chinese Academy of Sciences, China; School of Computer Science and Technology, University of Chinese Academy of Sciences, China; Baidu Inc.
Dataset available at: https://jingrug.github.io/research/M3EL

The authors propose M3EL, a large-scale multimodal entity linking dataset containing data associated with 1,100 movies. Reviews and images are collected, and textual and visual mentions are extracted and labelled with entities registered in Wikipedia.

Ego-Deliver: A Large-Scale Dataset for Egocentric Video Analysis
Paper available at: https://dl.acm.org/doi/10.1145/3474085.3475336
Qiu, H., He, P., Liu, S., Shao, W., Zhang, F., Wang, J., He, L., Wang, F.
East China Normal University, Shanghai, China; University of Florida, Florida, FL, United States;
Alibaba Group, Shanghai, China
Dataset available at: https://egodeliver.github.io/EgoDeliver_Dataset/

The authors propose an egocentric video benchmarking dataset consisting of videos recorded by takeaway riders during their daily work. The dataset provides over 5,000 videos with more than 139,000 multi-track annotations and 45 different attributes, representing the first attempt at understanding the takeaway delivery process from an egocentric perspective.

PUGCQ: A Large Scale Dataset for Quality Assessment of Professional User-Generated Content
Paper available at: https://dl.acm.org/doi/10.1145/3474085.3475183
Li, G., Chen, B., Zhu, L., He, Q., Fan, H., Wang, S.
Kingsoft Cloud, Beijing, China; City University of Hong Kong, Hong Kong, Hong Kong
Dataset available at: https://github.com/wlkdb/pugcq_create

The PUGCQ dataset consists of 10,000 professional user-generated videos annotated with a set of perceptual subjective ratings. In particular, during the subjective annotation and testing, human opinions were collected based not only on MOS, but also on attributes that may influence visual quality, such as faces, noise, blur, brightness, and colour.

ImageCLEF 2021

ImageCLEF is a multimedia evaluation campaign, part of the CLEF initiative (http://www.clef-initiative.eu/). The 2021 edition (https://www.imageclef.org/2021) is the 19th edition of this initiative and addresses four main research tasks in several domains: medicine, nature, social media content, and user interface processing. ImageCLEF 2021 was organized by Bogdan Ionescu (University Politehnica of Bucharest, Romania), Henning Müller (University of Applied Sciences Western Switzerland, Sierre, Switzerland), Renaud Péteri (University of La Rochelle, France), Ivan Eggel (University of Applied Sciences Western Switzerland, Sierre, Switzerland) and Mihai Dogariu (University Politehnica of Bucharest, Romania).

ImageCLEFaware
Paper available at: https://arxiv.org/abs/2012.13180
Popescu, A., Deshayes-Chossar, J., Ionescu, B.
CEA LIST, France; University Politehnica of Bucharest, Romania.
Dataset available at: https://www.imageclef.org/2021/aware

This is the first edition of the aware task at ImageCLEF, and it seeks to understand how public social media profiles affect users in four important scenarios: searching or applying for a bank loan, an accommodation, a job as a waitress/waiter, and a job in IT.

ImageCLEFcoral
Paper available at: http://ceur-ws.org/Vol-2936/paper-88.pdf
Chamberlain, J., de Herrera, A. G. S., Campello, A., Clark, A., Oliver, T. A., Moustahfid, H.
University of Essex, UK; NOAA – Pacific Islands Fisheries Science Center, USA; NOAA/ US IOOS, USA; Wellcome Trust, UK.
Dataset available at: https://www.imageclef.org/2021/coral

The ImageCLEFcoral task, currently in its third edition, proposes a dataset and benchmarking task for the automatic segmentation and labelling of underwater images that can be combined to generate 3D models for monitoring coral reefs. The task is composed of two subtasks, namely coral reef image annotation and localisation, and coral reef image pixel-wise parsing.

ImageCLEFdrawnUI
Paper available at: http://ceur-ws.org/Vol-2936/paper-89.pdf
Fichou, D., Berari, R., Tăuteanu, A., Brie, P., Dogariu, M., Ștefan, L.D., Constantin, M.G., Ionescu, B.
teleportHQ, Cluj Napoca, Romania; University Politehnica of Bucharest, Romania.
Dataset available at: https://www.imageclef.org/2021/drawnui

The second edition of ImageCLEFdrawnUI addresses the issue of creating appealing web page interfaces by fostering systems that are capable of automatically generating a web page from a hand-drawn sketch. The task is separated into two subtasks: the wireframe subtask and the screenshot subtask.

ImageCLEFmed VQA
Paper available at: http://ceur-ws.org/Vol-2936/paper-87.pdf
Abacha, A.B., Sarrouti, M., Demner-Fushman, D., Hasan, S.A., Müller, H.
National Library of Medicine, USA; CVS Health, USA; University of Applied Sciences Western Switzerland, Sierre, Switzerland.
Dataset available at: https://www.imageclef.org/2021/medical/vqa

This is the fourth edition of the ImageCLEF Medical Visual Question Answering (VQA-Med) task. This benchmark includes a task on Visual Question Answering (VQA), where participants answer questions based on the visual content of radiology images, and a second task on Visual Question Generation (VQG), which consists of generating relevant questions about radiology images.

ImageCLEFmed Caption
Paper available at: http://ceur-ws.org/Vol-2936/paper-111.pdf
Pelka, O., Abacha, A.B., de Herrera, A.G.S., Jacutprakart, J., Friedrich, C.M., Müller, H.
University of Applied Sciences and Arts Dortmund, Germany; National Library of Medicine, USA; University of Essex, UK; University of Applied Sciences Western Switzerland, Sierre, Switzerland.
Dataset available at: https://www.imageclef.org/2021/medical/caption

This is the fifth edition of the ImageCLEF Medical Concepts and Captioning task. The objective is to extract UMLS-concept annotations and/or captions from the image data that are then compared against the original text captions of the images.

ImageCLEFmed Tuberculosis
Paper available at: http://ceur-ws.org/Vol-2936/paper-90.pdf
Kozlovski, S., Liauchuk, V., Cid, Y.D., Kovalev, V., Müller, H.
Institute for Informatics, Minsk, Belarus; University of Warwick, Coventry, England, UK; University of Applied Sciences Western Switzerland, Sierre, Switzerland.
Dataset available at: https://www.imageclef.org/2021/medical/tuberculosis

Report from ACM Multimedia Systems 2021 by Neha Sharma

Neha Sharma (@NehaSharma) is a PhD student working with Dr Mohamed Hefeeda in the Network and Multimedia Systems Lab at Simon Fraser University. Her research interests are in computer vision and machine learning, with a focus on next-generation multimedia systems and applications. Her current work focuses on designing an inexpensive hyperspectral camera using a hybrid approach, leveraging both hardware and software solutions. She was named Best Social Media Reporter of the conference, an award meant to promote sharing among researchers on social networks. To celebrate this award, here is her more complete report on the conference.

Being a junior researcher in multimedia systems, I must say I feel proud to be part of this amazing community. I joined the ACM Multimedia Systems Conference (MMSys) community in 2020, when I published my first research work. I was excited to attend MMSys ’20 in Istanbul, which unfortunately shifted online due to COVID-19. I presented my first work online and got to know other researchers in the community. This year I was able to publish another work with my team and was selected to present my ideas and research plans at the Doctoral Symposium (thanks to the reviewers). MMSys ’21 gave me hope of having a full conference experience, as we were all hoping our lives would return to normal. However, as the conference date approached, things were still unclear and travel restrictions were still in place. On the bright side, MMSys ’21 went hybrid to provide an opportunity for those who could travel. Only at the very end did I decide to travel and attend MMSys ’21 in person, and I am glad I made that decision. My experience was overwhelmingly rich in terms of learning interesting research findings and making inspiring connections in the community. As the recipient of the “Best Social Media Reporter” award, enjoy the highlights of MMSys ’21 through my lens.

In light of the ongoing global pandemic, ACM MMSys ’21 was held in hybrid mode, onsite in Istanbul, Turkey, and online, on September 28 – October 1, 2021. Ali C. Begen (Ozyegin University and Networked Media, Turkey) opened the conference onsite with a warm welcome. MMSys ’21 became the first hybrid edition of the conference, with participants presenting onsite as well as remotely in real time. Participants joined from 38 different countries, and the organizing team did an amazing job of pulling off this complex event. This year the research track implemented a two-round submission system, and accepted papers included public reviews in the proceedings. This, however, was not the only first: MMSys ’21 held its first Doctoral Symposium, targeting PhD students and aiming to connect them with mentors. In addition, there were postponed celebrations for the 30th anniversary of NOSSDAV and the 25th anniversary of Packet Video.

The conference program was very well scheduled. Each day of the conference started with a keynote, and there were four insightful and inspiring keynotes from researchers working on cutting-edge multimedia technologies. The first day started with a talk titled “AI-Driven Solutions throughout Games’ Lifecycles Leveraging Big Data” by Qiaolin Chen from Tencent IEG Global. Chen discussed how AI and big data are evolving the gaming industry, from intelligent market decisions to data-driven game development. On the second day, Caitlin Kalinowski, who heads the VR Hardware team at Facebook Reality Labs, presented an interesting keynote, “Making Impossible Products: How to Get 0-to-1 Products Right”, sharing insights about Oculus and zero-to-one products. The next day, Chris Bregler (Google) talked about “Synthetic Media: New Opportunities and New Challenges”, discussing recent trends in generative media creation techniques that have opened new possibilities for societally beneficial uses but have also raised concerns about misuse. On the last day, Sriram Sethuraman and Deepthi Nandakumar (Amazon) provided insights on the “Role of ML in the Prediction of Perceptual Video Quality”. The keynotes are available on YouTube to watch on demand.

This year the conference attracted paper submissions from a range of multimedia topics including immersive media, live video, content preparation, cloud-based and mobile media processing and computer vision systems. Apart from the main research track, MMSys ’21 hosted three workshops:

  • NOSSDAV – Network and Operating System Support for Digital Audio and Video
  • MMVE – Immersive Mixed and Virtual Environment Systems
  • GameSys – Game Systems

These workshops provided an opportunity to meet those working in focused areas of multimedia research. This year MMSys hosted the inaugural ACM workshop on Game Systems (GameSys ’21), which attracted research on all aspects of computer/digital games, emphasizing networks, systems, interaction, and applications. Highlights include the work presented by Mark Claypool et al. (Worcester Polytechnic Institute), a user study measuring attribute scaling for cloud-based games.

In addition to the area-focused workshops, MMSys ’21 also hosted two grand challenges.

Another main highlight of the conference was the EDI (Equality, Diversity and Inclusion) workshop. The workshop was tailored towards PhD students, assistant professors, and starting researchers in various research organizations. The event openly discussed core topics such as parenthood, work-family policies, career paths, and EDI aspects at large. Laura Toni, Mea Wang and Ozgu Alay opened the workshop on the third day of the conference. Miriam Redi shared goals for achieving an equitable and inclusive multimedia community. Susanne Boll talked about the “25 in 25” target strategy to increase the participation of women in SIGMM to at least 25% by 2025. Other guest speakers also highlighted strategies for achieving the target diversity and inclusion in MMSys.

Last but not least, the amazing social events. Each day of the conference ended with a well-planned social event, providing a great opportunity for the in-person attendees to meet, discuss, and develop professional and social links throughout the community in a more relaxed setting. We visited historical venues such as Galata Tower and Adile Sultan Palace and enjoyed a Bosphorus boat tour with a live music band. This year MMSys planned its first inter-continental socials: we travelled from the European side to the Asian side of Istanbul (by bus and by boat). As a token of appreciation, in-person participants received Turkish delights and coffee, a set of traditional towels (peştemal), Istanbul-themed puzzles, and a hand-made Kütahya porcelain vase/coffee set as souvenirs. For me, the best part was sitting together and dining with peers, discussing the prospects of my own research and of multimedia systems research in general.

Closing the conference, Ali C. Begen announced the awards. The Best Paper Award was presented to Xiao Zhu et al. for the paper “Livelyzer: Analyzing the First-Mile Ingest Performance of Live Video Streaming”. See the full list of awards here. The conference closed with the announcement of ACM Multimedia Systems 2022, which will take place in Athlone, Ireland. Looking forward to seeing everyone again next year.