Reports from ACM Multimedia Systems 2021


The 12th ACM Multimedia Systems Conference (MMSys’21) took place from September 28th through October 1st, 2021. The MMSys conference is an important forum for researchers in multimedia systems. Due to the ongoing pandemic, the event was held in a hybrid mode: on-site in Istanbul, Turkey, and online. The organizers and chairs (Özgü Alay, Cheng-Hsin Hsu, and Ali C. Begen) worked very hard to make sure the conference was successful, both for the on-site participants (around 50) and the online participants (with a peak of 330 concurrent viewers). For a short description of the event, take a look at the text written by Ali Begen, one of the general chairs.
To encourage student authors to participate on-site, SIGMM sponsored a group of students with Student Travel Grant Awards. Students who wanted to apply for this travel grant needed to submit an online form before the submission deadline. The selection committee then chose 7 travel grant winners. The selected students received either 1,000 or 2,000 USD to cover their airline tickets as well as accommodation costs for this event. We asked the travel grant winners to share their unique experiences attending MMSys’21. The following are their comments.

Minh Nguyen

It is my honour to receive the SIGMM Student Travel Award, which gave me a golden opportunity to attend the MMSys’2021 conference on-site. This conference is the first one I have attended during the Covid pandemic. I attended the whole conference, and I really appreciate the organizing committee, who tried their best to organize this conference in a hybrid mode. It was a very interesting and well-organized conference where many innovative papers were introduced. The venue of the conference was a great place with professional staff and comfortable accommodation and meeting rooms. I also enjoyed the local Turkish food, which was delicious. At this conference, I was happy to meet, connect, and discuss with experts working in multimedia systems, which is close to the topic of my PhD thesis. I was interested in the informative and passionate keynotes about cutting-edge technologies and their open discussions. In particular, many novel papers motivated me and gave me some ideas for future work in my PhD thesis. Also, the enjoyable social events gave me a chance to visit Istanbul and experience new things. I look forward to attending future editions of the conference.

Lucas Torrealba A.

I found the conference very interesting. It was my first experience of an in-person conference and it was amazing. The research articles presented seem very relevant to me and the organization did a wonderful job as well. In addition, it seems to be quite a good idea for the future to always leave hybrid ways to participate in the conferences.

Paniz Parastar

MMSys 2021 was my first in-person conference, and since it was so well organized, it raised my expectations of future conferences. Overall, many interesting topics were covered, and I mention only a couple of instances here.
AI/ML are the hot topics of today. I believe it’s enjoyable to see them applied in various aspects of multimedia streaming and other areas, as well as in computer vision. Notably, I liked the papers in the NOSSDAV sessions on the last day of the conference that adapted learning methods to improve the QoE of users. Since I’m working on distinguishing IoT devices and their traffic on the network these days, the video clustering papers, and mainly the paper that classifies 360 videos from regular ones based on traffic features (i.e., flow- and packet-level features), were educational to me. Also, comparing subjective and objective quality assessment metrics under various network conditions, as they do in the paper, may not be a new topic, but it is always interesting to explore.
Plus, one of the most exciting talks for me was ‘Games as a Game Changer’, which was part of the Equality, Diversity, and Inclusion (EDI) Workshop. It changed my perception of games as an entertaining tool that also can help us better understand situations that don’t usually happen in our daily lives.

Ekrem Cetinkaya

MMSys’21 was my first in-person conference experience, and I can gladly say that it was above my expectations. We were welcomed by a fantastic organization, given how difficult the situation was. Everything went so smoothly, from the keynotes to paper presentations to demo sessions, and of course, social events.
Personally, two things were the most impressive for me. First, the keynote by Caitlin Kalinowski (Facebook) was given in person, and she had to fly from the U.S. to Istanbul just for this keynote. Second, the hybrid organization was thought through. There was a team of five whose duty was to make sure the conference was insightful for those who could not make it to Istanbul as well.
Moreover, the social events and the venues were really lovely. I learned that the MMSys community has a long history, and you could feel that, especially in those social events where it was an amicable environment, meaning that it was also easy for me to do some networking. Overall, I can say the MMSys conference was amazing in all aspects without any doubt. I want to thank the SIGMM committee once again for their travel grant, which made this experience possible.

Ivan Bartolec

The ACM MMSys’21 conference held in Istanbul, Turkey, was an excellent opportunity to meet, interact, and discuss ideas with researchers who are working to develop new and engaging multimedia experiences. This was my first MMSys conference, and it was an excellent environment for both learning and networking, with a thoughtfully selected collection of presentations, engaging keynotes (especially the one from a representative of Facebook), and fun social events. I found the sessions based on video or video streaming to be the most interesting and informative for my field of study. The demo sessions concept was also pretty unique, and by being on-site and seeing the demos and asking questions, I learnt a few things about practical implementations that I find incredibly useful. I’m very thankful for the opportunity to present my PhD research as part of the Doctoral symposium and to receive feedback from conference attendees as well as offline comments and ideas via email, which I gladly responded to. It was an absolute pleasure to attend MMSys’21 on-site, courtesy of the Student Travel Grant, and I look forward to visiting future editions of the conference and continuing to interact with the MMSys community.

Jesus Aguilar Armijo

It has been a pleasure to attend MMSys’2021 in person. This would not have been possible without the SIGMM Student travel award.
At the conference, I had the opportunity to attend four keynotes, of which I would like to highlight the keynote from Caitlin Kalinowski (Facebook). She presented in person and showed the Virtual Reality devices of her company and future projects with emerging technologies.
I found the different sessions of MMSys truly engaging, as they were related to my work in network-assisted video streaming. For example, the NOSSDAV session named “Session #1: Yet Another Streaming Session” contained the paper “Common Media Client Data (CMCD): Initial Findings”, which I found especially interesting as I use some features of this standard in my work. Moreover, the paper entitled “Beyond throughput, the next generation: a 5G dataset with channel and context metrics” (from MMSys’20 but presented at MMSys’21) in the open dataset session was particularly interesting for me, as I used their previous 4G dataset as radio traces for my last paper.
During the conference, I had the opportunity to discuss and exchange ideas with different researchers, which I found valuable and insightful. I would also like to highlight the good organization of the conference and the social events.
Finally, I presented my work in the Doctoral Symposium session, and I received some interesting questions from the audience. It was a great opportunity, and I am grateful to SIGMM, which allowed me to participate in this extraordinary experience.

Towards an updated understanding of immersive multimedia experiences

Bringing theories and measurement techniques up to date

Development of technology for immersive multimedia experiences

Immersive multimedia experiences, as the name suggests, are experiences focusing on media that can immerse users, through different interactions, in an experience of an environment. Through different technologies and approaches, immersive media emulates a physical world by means of a digital or simulated world, with the goal of creating a sense of immersion. Users are involved in a technologically driven environment where they may actively join and participate in the experiences offered by the generated world [White Paper, 2020]. As hardware and technologies develop further, these immersive experiences are improving and providing a stronger feeling of immersion. This means that immersive multimedia experiences go beyond merely viewing a screen and offer greater potential. This column aims to present and discuss the need for an up-to-date understanding of immersive media quality. First, the development of the constructs of immersion and presence over time will be outlined. Second, influencing factors of immersive media quality will be introduced, and related standardisation activities will be discussed. Finally, the column concludes by summarising why an updated understanding of immersive media quality is urgent.

Development of theories covering immersion and presence

One of the first definitions of presence was established by Slater and Usoh as early as 1993; they defined presence as a “sense of presence” in a virtual environment [Slater, 1993]. This is in line with other early definitions of presence and immersion. For example, Biocca defined immersion as a system property. Those definitions focused more on the ability of the system to provide technically accurate stimuli to users [Biocca, 1995]. As technology was only slowly becoming capable of providing systems that can generate stimulation mimicking the real world, this was of course the main content of definitions. Questionnaires to capture the experienced immersion were introduced quite early on, such as the Igroup Presence Questionnaire (IPQ) [Schubert, 2001]. The early methods for measuring experiences also focused mainly on how well the real world was represented and perceived. With maturing technology, the focus shifted more towards emotions and cognitive phenomena beyond basic stimulus generation. For example, Baños and colleagues showed that experienced emotion and immersion are related to each other and also influence the sense of presence [Baños, 2004]. Newer definitions focus more on these cognitive aspects; e.g., Nilsson defines three factors that can lead to immersion: (i) technology, (ii) narratives, and (iii) challenges, where only the technology factor is a non-cognitive one [Nilsson, 2016]. In 2018, Slater defined the place illusion as the illusion of being in a place while knowing one is not really there. This focuses on a cognitive construct, the removal of disbelief, but still attributes how the illusion is created mainly to system factors rather than cognitive ones [Slater, 2018]. In recent years, more and more activities have been started to define how to measure immersive experiences as an overall construct.

Constructs of interest in relation to immersion and presence

This section discusses constructs and activities that are related to immersion and presence. First, subtypes of extended reality (XR) and the relation to user experience (UX) as well as quality of experience (QoE) are outlined. Afterwards, recent standardization activities related to immersive multimedia experiences are introduced and discussed.
Immersive multimedia experiences can be divided by many different factors, but recently the most common distinction concerns interactivity: content can be made for multi-directional viewing, as in 360-degree videos, or presented through interactive extended reality. XR technologies can be divided into mixed reality (MR), augmented reality (AR), augmented virtuality (AV), virtual reality (VR), and everything in between [Milgram, 1995]. In all those areas, immersive multimedia experiences have found a place on the market and are providing new solutions to challenges in research as well as in industry, with a growing potential for adoption in different areas [Chuah, 2018].

While discussing immersive multimedia experiences, it is important to address user experience and the quality of immersive multimedia experiences. The latter can be defined following the definition of quality of experience itself [White Paper, 2012] as a measure of the delight or annoyance of a customer’s experiences with a service, where, in this case, the service is an immersive multimedia experience. Furthermore, in defining QoE, the terms experience and application are also defined and can be applied to immersive multimedia experiences: an experience is an individual’s stream of perception and interpretation of one or multiple events, and an application is software and/or hardware that enables usage and interaction by a user for a given purpose [White Paper, 2012].

As already mentioned, immersive media experiences have an impact in many different fields, but one field where the impact of immersion and presence is particularly investigated is gaming, along with the QoE models and optimizations that go with it. Of specific interest is the framework and standardization of subjective evaluation methods for gaming quality [ITU-T Rec. P.809, 2018]. This recommendation provides instructions on how to assess QoE for gaming services using two possible test paradigms, i.e., passive viewing tests and interactive tests. However, even though detailed information about the environments, test set-ups, questionnaires, and game selection materials is available, it is still focused on the gaming field and on the concepts of flow and immersion in games themselves.

Together with gaming, another step in defining and standardizing the infrastructure of audiovisual services in telepresence, immersive environments, and virtual and extended reality has been taken by defining different service scenarios of immersive live experience [ITU-T Rec. H.430.3, 2018], where live sports, entertainment, and telepresence scenarios are described. This standardization describes several immersive live experience scenarios together with architectural frameworks for delivering such services, but it does not cover all possible use cases. When discussing immersive multimedia experiences, spatial audio, sometimes referred to as “immersive audio”, must also be mentioned, as it is one of the key features of AR and VR experiences in particular [Agrawal, 2019]: in AR experiences it can provide immersion on its own, and in VR it can enhance the visual information.
In order to correctly assess QoE or UX, one must be aware of all characteristics of the user, system, content, and context, because their actual state may influence the user’s immersive multimedia experience. That is why these characteristics are defined as influencing factors (IF), which can be divided into Human IF, System IF, and Context IF and are also standardized for virtual reality services [ITU-T Rec. G.1035, 2021]. A Human IF that receives particular attention is simulator sickness, as it occurs specifically as a result of exposure to immersive XR environments. Simulator sickness is also known as cybersickness or VR/AR sickness; it is a visually induced motion sickness triggered by visual stimuli and caused by the sensory conflict between the vestibular and visual systems. Therefore, to achieve the full potential of immersive multimedia experiences, the unwanted sensation of simulator sickness must be reduced. However, with the frequent change of immersive technology, hardware improvements lead to better experiences, but requirement specifications, design, and development must be constantly updated as well to keep up with best practices.

Conclusion – Towards an updated understanding

Considering the development of theories, definitions, and influencing factors around the constructs of immersion and presence, one can see two different streams. First, most early theories focus quite strongly on the technical ability of systems. Second, cognitive aspects and non-technical influencing factors gain importance in newer works. Of course, in the 1990s, technology was not yet ready to provide a good simulation of the real world. Therefore, most activities to improve systems, including measurement techniques, focused on that goal. In the last few years, technology has developed quickly, and the basic simulation of a virtual environment is now possible even on mobile devices such as the Oculus Quest 2. Although concepts such as immersion and presence remain applicable, definitions dealing with those concepts also need to capture today’s technology. Meanwhile, systems have proven to be good real-world simulators and to provide users with a feeling of presence and immersion. While standardization activity is already quite strong and also industry-driven, research in many disciplines, such as telecommunication, still mainly uses old questionnaires. These questionnaires are mostly focused on technological/real-world simulation constructs and are thus no longer able to differentiate products and services to an optimal extent. There are some newer attempts to create measurement tools for, e.g., social aspects of immersive systems [Li, 2019; Toet, 2021]. Measurement scales that aim to capture differences in the ability of systems to create realistic simulations cannot reliably differentiate between systems, because most systems now provide realistic real-world simulations.
To enhance research and industrial development in the field of immersive media, we need definitions of constructs and measurement methods that are appropriate for current technology, even if the newer measures and definitions are not yet cited or used as often. That will lead to improved development and, in the future, better immersive media experiences.

One step towards understanding immersive multimedia experiences is reflected by QoMEX 2022. The 14th International Conference on Quality of Multimedia Experience will be held from September 5th to 7th, 2022 in Lippstadt, Germany. It will bring together leading experts from academia and industry to present and discuss current and future research on multimedia quality, Quality of Experience (QoE), and User Experience (UX). It will contribute to excellence in developing multimedia technology towards user well-being and foster the exchange between multidisciplinary communities. One core topic is immersive experiences and technologies as well as new assessment and evaluation methods, and both topics contribute to bringing theories and measurement techniques up to date. For more details, please visit


[Agrawal, 2019] Agrawal, S., Simon, A., Bech, S., Bærentsen, K., Forchhammer, S. (2019). “Defining Immersion: Literature Review and Implications for Research on Immersive Audiovisual Experiences.” In Audio Engineering Society Convention 147. Audio Engineering Society.
[Biocca, 1995] Biocca, F., & Delaney, B. (1995). Immersive virtual reality technology. Communication in the age of virtual reality, 15(32), 10-5555.
[Baños, 2004] Baños, R. M., Botella, C., Alcañiz, M., Liaño, V., Guerrero, B., & Rey, B. (2004). Immersion and emotion: their impact on the sense of presence. Cyberpsychology & behavior, 7(6), 734-741.
[Chuah, 2018] Chuah, S. H. W. (2018). Why and who will adopt extended reality technology? Literature review, synthesis, and future research agenda (December 13, 2018).
[ITU-T Rec. G.1035, 2021] ITU-T Recommendation G.1035 (2021). Influencing factors on quality of experience for virtual reality services, Int. Telecomm. Union, CH-Geneva.
[ITU-T Rec. H.430.3, 2018] ITU-T Recommendation H.430.3 (2018). Service scenario of immersive live experience (ILE), Int. Telecomm. Union, CH-Geneva.
[ITU-T Rec. P.809, 2018] ITU-T Recommendation P.809 (2018). Subjective evaluation methods for gaming quality, Int. Telecomm. Union, CH-Geneva.
[Li, 2019] Li, J., Kong, Y., Röggla, T., De Simone, F., Ananthanarayan, S., De Ridder, H., … & Cesar, P. (2019, May). Measuring and understanding photo sharing experiences in social Virtual Reality. In Proceedings of the 2019 CHI Conference on Human Factors in Computing Systems (pp. 1-14).
[Milgram, 1995] Milgram, P., Takemura, H., Utsumi, A., & Kishino, F. (1995, December). Augmented reality: A class of displays on the reality-virtuality continuum. In Telemanipulator and telepresence technologies (Vol. 2351, pp. 282-292). International Society for Optics and Photonics.
[Nilsson, 2016] Nilsson, N. C., Nordahl, R., & Serafin, S. (2016). Immersion revisited: a review of existing definitions of immersion and their relation to different theories of presence. Human Technology, 12(2).
[Schubert, 2001] Schubert, T., Friedmann, F., & Regenbrecht, H. (2001). The experience of presence: Factor analytic insights. Presence: Teleoperators & Virtual Environments, 10(3), 266-281.
[Slater, 1993] Slater, M., & Usoh, M. (1993). Representations systems, perceptual position, and presence in immersive virtual environments. Presence: Teleoperators & Virtual Environments, 2(3), 221-233.
[Toet, 2021] Toet, A., Mioch, T., Gunkel, S. N., Niamut, O., & van Erp, J. B. (2021). Holistic Framework for Quality Assessment of Mediated Social Communication.
[Slater, 2018] Slater, M. (2018). Immersion and the illusion of presence in virtual reality. British Journal of Psychology, 109(3), 431-433.
[White Paper, 2012] Qualinet White Paper on Definitions of Quality of Experience (2012). European Network on Quality of Experience in Multimedia Systems and Services (COST Action IC 1003), Patrick Le Callet, Sebastian Möller and Andrew Perkis, eds., Lausanne, Switzerland, Version 1.2, March 2013.
[White Paper, 2020] Perkis, A., Timmerer, C., Baraković, S., Husić, J. B., Bech, S., Bosse, S., … & Zadtootaghaj, S. (2020). QUALINET white paper on definitions of immersive media experience (IMEx). arXiv preprint arXiv:2007.07032.

Multidisciplinary Column: An Interview with Odette Scharenborg

Odette, could you tell us a bit about your background, and what the road to your current position was?

Dr Odette Scharenborg, Associate professor and Delft Technology Fellow, SpeechLab/Multimedia Computing Group, Delft University of Technology

In high school, I enjoyed both languages and science topics such as physics, chemistry and biology. When researching what I wanted to study, I came across “Language, Speech, and Computer Science” at Radboud University, Nijmegen, the Netherlands, which sounded like, and indeed was, an interesting combination of both languages and science topics. Probably inspired by one of my favourite TV series when I was younger, Knight Rider, which featured a car with which you could communicate through speech, I focused on speech technology from early on.

After obtaining my university degree in 2000, I was offered a PhD position in the same department where I had pursued my studies, on another interdisciplinary topic: computational modelling of human speech processing. My PhD project (2001-2005) combined theories about human speech processing (psycholinguistics) and tools and approaches from automatic speech recognition (which itself is more or less at the cross-roads of electrical engineering and computer science) in order to learn more about how humans process speech and improve automatic speech recognition (i.e., the conversion of speech into text).

After obtaining my PhD (in 2005), I went to the Speech and Hearing group in the Department of Computer Science at the University of Sheffield, UK, for a visiting post-doc position (funded by a Dutch Science Foundation (NWO) Talent Scholarship). I then returned to Radboud University for a 3-year post-doc position (funded by an NWO Veni personal fellowship) on a new project on computational modelling of human speech processing. After this project, having read so much about the theories of how humans process speech, I really wanted to know how researchers actually arrived at these theories. So, in the next few years, my research focused on human speech processing, first at the Max Planck Institute for Psycholinguistics, where I was trained as a psycholinguist, and subsequently, funded by an NWO Vidi personal grant, again at Radboud University, where I became Associate Professor.

Towards the end of my Vidi-project (in 2016), I started to miss the computer science component of my earlier research and decided to try to move back into automatic speech recognition. I had an idea, met two amazing speech researchers who loved my idea, and we decided to collaborate. This collaboration (still ongoing) has allowed me to move back into the field of automatic speech recognition that at that time was rapidly changing due to the rise of deep learning.

In 2018, my Vidi project and contract at Radboud University ended, and I became unemployed. I was then headhunted by a company working on automatic speech recognition for health applications. However, I felt that I wanted to stay in academia. Luckily for me, shortly after I joined the company, Delft University of Technology offered me a Delft Technology Fellowship, and I joined TU Delft in June 2018, where I have worked since then as an Associate Professor of Speech Technology.

How important is interdisciplinarity in your research on speech?

As is probably clear from my road so far, I am an interdisciplinary researcher. The field of automatic speech recognition is already interdisciplinary in that it combines electrical engineering and computer science. However, in my research, I use my knowledge about sounds and sound structures (i.e., phonetics, a subfield of linguistics) and am inspired by and use knowledge about how humans process speech (i.e., psycholinguistics). The speech signal is a signal that can be researched and viewed from different angles: from the perspective of frequencies (physics), the perspective of the individual sounds (phonetics), meaning (semantics), as a means to convey a message or intent, etc. It also contains different types of information: the words of the message and information about the speaker’s identity, age, gender, height, health status, emotional status, and native language, to name only a few.

The focus of my research is on automatic speech recognition. Automatic speech recognisers typically work well for “standard” speakers of a small number of languages. In fact, for only about 2% of all the languages in the world, there is enough annotated speech data to build automatic speech recognisers. Moreover, a large portion of society does not speak in a “standard” way: “standard” speakers are native speakers of a language, without a speech impediment, without a strong regional accent, typically highly educated, and between the ages of 18 and 60 years. As you can tell, this excludes a large portion of our society: children, elderly, people with speaking or voice disorders, deaf people, immigrants, etc. In my work, I focus on making speech technology, and particularly automatic speech recognition, available for everyone, irrespective of how one speaks and the language one speaks. In order to do so, I look at how humans process speech as they are the best speech recognisers that exist; moreover, they can quickly adapt to idiosyncrasies in a speaker’s speech or voice. Moreover, I use knowledge about how sounds and the voice sound differently depending on, for instance, the speaker’s age or health status. So, in my research towards inclusive speech technology, I combine computer science with linguistics and psycholinguistics. Interdisciplinarity is thus at the core of my research.

What disciplines do you combine yourself in your own work?

As explained above, in my research I combine multiple research fields, most notably: computer science, different subfields of psycholinguistics (first and second language learning, native and non-native speech processing; the processing of emotions) and linguistics (primarily phonetics and a bit of conversational analysis).

Could you name a grand research challenge in your current field of work?

There are several grand research challenges in my field:

  • I already named one: making speech technology available for everyone, irrespective of how one speaks and what language one speaks. One of the grand challenges here is to build speech technology for speech that is not only highly variable but for which only a small amount of data is available (i.e., low-resource scenarios).
  • A second grand challenge: when people speak, they often use words or phrases from another language; this is called code-switching. Automatic speech recognisers are typically built for one language, so it is very hard for them to deal with code-switched speech.
  • A third grand challenge: speech is often produced with background noise or background speech present. This deteriorates recognition performance tremendously. Dealing with all the different types of background noise and speech is another grand challenge.

You have been an active champion for diversity and inclusion. Could you tell us a bit more about your activities on these topics?

When I was growing academically, I did not really have a female role model, and especially not female role models who had children. When I was in my late twenties/early thirties, I found this hard because I was afraid that having children would negatively impact my chances for the next academic job and my academic career in general. Also, being not only a first-generation PhD but also a first-generation academic, it took me a really long time to realise that there were unwritten rules, and to learn what these were and how to deal with them (not sure I now know all 😉). Then, when I became Associate Professor at Radboud University, I found that several students, male and female, regularly came to talk to me about personal and academic issues, that they found my advice useful, and that I found it interesting and motivating to talk to them. I wanted to do more regarding gender equality but didn’t know how.

Then in 2016, a group of senior female speech researchers together organised the Young Female Researchers in Speech Science and Technology Workshop, in conjunction with Interspeech, the flagship conference of the International Speech Communication Association (ISCA), in order to attract more female students into a speech PhD program. I was invited as a mentor. This workshop was highly successful and is now a yearly workshop held in conjunction with Interspeech. I joined the organisation of this workshop for 3 years. Then in 2019, having advocated gender equality in the ISCA board, of which I’ve been a member since 2017, I was asked to form a new committee: the committee for gender equality. Very quickly this committee started to focus on more than gender and to look at other types of diversity, such as sexual orientation, research areas (ISCA encompasses several speech sub-areas, including phonetics, psycholinguistics, health, automatic speech recognition, speech generation, etc.), and geographical regions. Naturally, we not only wanted to attract people from diverse backgrounds but also wanted to retain them, so we also started to look into inclusion. The first thing our committee did was to create a website where female speech researchers who hold a PhD can list themselves. This website is used to help workshop/conference organisers find female researchers for the organising committee, as panellists and keynote/invited speakers, etc. We then went on to organise diversity and inclusion meetings at Interspeech, and for 2 years we have organised a separate ISCA-queer meeting. We have held a workshop in Africa (remotely due to the pandemic) in order to reach local speech researchers there and see where we can collaborate and where we can help them with our resources and expertise. We wrote a code of conduct for session chairs at workshops/conferences so that they know how to balance questions from people from minority groups and non-minority groups. These are just a few of our activities.

In 2020 I came up with the idea for a mentoring programme within the IEEE Signal Processing Society (SPS), for students from minority groups, which was well received and was funded with $50K annually. This programme, loosely based on the YFRSW-format, provides students with a mentor from our society who will supervise them for a period of 9 months, and who will mentor them and help them build a network. Each student receives $4K to visit one of the IEEE (SPS) conferences/workshops. In the first round, we awarded 9 students from all over the world.

In addition to these activities, I’ve also been on the board of the Delft Women in Science (DEWIS) at my university and the chair of the Diversity and Inclusion Committee (EDIT) at my faculty at TU Delft. Additionally, I am regularly asked to appear as a female role model in STEM for young girls and in Dutch media.

In getting to your current position, you experienced some personal hardships. In serving as a public role model, you have been open about these. How can we learn from these experiences to make academia a better place?

My CV shows the (many) consecutive positions I’ve had and how almost all are financed by personal grants that I obtained. These personal grants especially tend to attract a lot of praise. What my CV doesn’t show is the story behind it. It doesn’t show the many job applications I sent out, which never led to a position. It doesn’t show that for a period of more than 2 years I did not have a contract, meaning that I did not have any social security, while I was working on a post-doc position. It does not show how I was bullied at my previous university and the damage that did to my self-esteem, something I still struggle with. It does not show that I had to leave behind my 10-month-old daughter for a month and again for 2 weeks because I was expected to be in Germany for a post-doc position, nor does it show the two bouts of (mild) depression I suffered (one directly related to the bullying). I never talked about all of this because as a temporary (and young and female) researcher, you feel extremely vulnerable because you are so dependent on (the goodwill of) other, more senior researchers. If you don’t want to or cannot do a task, if you complain, they will simply find someone else and you are without a job again. On top of that, you often simply are not believed.

When I became a mentor for students and young researchers, I decided to share some of my struggles so that they knew that they were not the only ones who struggled and that I knew what they were going through. I began to receive feedback from these students that they appreciated my honesty and openness, which gave me the courage to be more open about my own issues. However, only after I received my permanent position at TU Delft (in 2019), and after becoming active in diversity and inclusion, did I very slowly dare to speak more openly to my colleagues and senior people about what had happened.

In late 2019, I was asked to talk about what it is like to be a female researcher in speech technology at the IEEE Workshop on Automatic Speech Recognition and Understanding. I thought about the story I wanted to tell, and eventually, I decided to tell my colleagues, including many of my close friends, my story: I started by showing them my CV, which received a lot of appreciative nods. I then told the story of my life, the story that is not shown by my CV, including the hardships. This resulted in many of my male colleagues and friends crying. Of course, this was never my intention. I don’t think that my story is that much different from that of the average person from a minority group, and probably there are quite a few men whose stories are worse than mine. What I wanted to say was: CVs might look great, or they might not. It is important not to take CVs or facts or numbers at face value; you don’t know what people go through or have done to get where they are. Everyone has a story to tell, but it is, unfortunately, the case that the bad stories far more often happen to women and other people from minority groups.

A third message or piece of advice is that if you go through a hard time, know that you are not alone. In life in general, and in academia particularly, we celebrate successes, but failures and hardships are ignored and are often considered a weakness. I strongly believe that by being open about your hardships, you will feel better yourself and will help others deal with their hardships.

Finally, we need to see fellow academics as people and treat them as people. We need to be supportive of one another, especially of our younger colleagues and of those from minority groups. We should be mentors and role models. We should listen to what they are saying and believe what they are saying. Not question what they say, but believe them when they describe something bad that has happened to them and help, because daring to speak up takes an enormous amount of strength and courage. If one dares to speak up, believe that it is true and tell them that you know how courageous they have to be to speak up.

How and in what form do you feel we as academics can be most impactful?

As academics we have many responsibilities: we teach the younger generation, we investigate and develop new technology and theories. Some of our research has a direct impact on society, some research does not yet, some research will maybe never have a direct impact on society. I don’t believe that all research needs to have an impact. I do believe that we as academics can be impactful, and that is by explaining science to the general audience. What is science? Why don’t scientists have answers to all questions? Why is what you do important? By explaining one’s research in layman’s terms, science and scientific output will become easier to understand for non-scientists. It will help shape public debate. It will lead to scientific results not being as easily dismissed as nowadays often happens. At the same time, and at least as important: by talking to people from the general public, you as an academic will see the world through their eyes, look at the impact of your work in a different way, and I am convinced it will also often lead to the explanation of why a certain development or technology is not adopted by society at large or by a particular group in society. In short, academics can be most impactful by communicating with the general public, and communication is and thus should be a two-directional process.


Dr Odette Scharenborg is an Associate Professor and Delft Technology Fellow at the Multimedia Computing Group at the Delft University of Technology, the Netherlands, and the Vice-President of the International Speech Communication Association (ISCA). Her research focuses on human speech-processing inspired automatic speech processing with the aim to develop inclusive speech technology, i.e., speech technology that works for everyone irrespective of how they speak or the language they speak.

Since 2017, Odette has been on the Board of ISCA, where she is also the chair of the Diversity committee (since 2019) and was co-chair of the Interspeech Conferences committee and of the Technical Committee (2017-2019). From 2018-2021, Odette was a member of the IEEE Speech and Language Processing Technical Committee (subarea Speech Production and Perception). From 2019-2021, she was an Associate Editor of IEEE Signal Processing Letters, where she now is a Senior Associate Editor.

Editor Biographies


Dr Cynthia C. S. Liem is an Associate Professor in the Multimedia Computing Group of Delft University of Technology, The Netherlands, and pianist of the Magma Duo. Her research interests focus on making people discover new interests and content which would not trivially be retrieved in music and multimedia collections, assessing questions of validation and validity in data science, and fostering trustworthy and responsible AI applications when human-interpreted data is involved. She initiated and co-coordinated the European research projects PHENICX (2013-2016) and TROMPA (2018-2021), focusing on technological enrichment of digital musical heritage, and participated as technical partner in an ERASMUS+ education innovation project on Big Data for Psychological Assessment. She gained industrial experience at Bell Labs Netherlands, Philips Research and Google. She was a recipient of the Lucent Global Science and Google Anita Borg Europe Memorial scholarships, the Google European Doctoral Fellowship 2010 in Multimedia, a finalist of the New Scientist Science Talent Award 2016 for young scientists committed to public outreach, Researcher-in-Residence 2018 at the National Library of The Netherlands, general chair of the ISMIR 2019 conference, and keynote speaker at the RecSys 2021 conference. Presently, she co-leads the Future Libraries Lab with the National Library of The Netherlands, is track leader of the Trustworthy AI track in the AI for Fintech lab with the ING bank, holds a TU Delft Education Fellowship on Responsible AI teaching, and is a member of the Dutch Young Academy.


Dr Jochen Huber is Professor of Computer Science at Furtwangen University, Germany. Previously, he was a Senior User Experience Researcher with Synaptics and an SUTD-MIT postdoctoral fellow in the Fluid Interfaces Group at MIT Media Lab and the Augmented Human Lab at Singapore University of Technology and Design. He holds a Ph.D. in Computer Science and degrees in both Mathematics (Dipl.-Math.) and Computer Science (Dipl.-Inform.), all from Technische Universität Darmstadt, Germany. Jochen’s work is situated at the intersection of Human-Computer Interaction and Human Augmentation. He designs, implements and studies novel input technology in the areas of mobile, tangible & non-visual interaction, automotive UX and assistive augmentation. He has co-authored over 60 academic publications and regularly serves as program committee member in premier HCI and multimedia conferences. He was program co-chair of ACM TVX 2016 and Augmented Human 2015 and chaired tracks of ACM Multimedia, ACM Creativity and Cognition and ACM International Conference on Interface Surfaces and Spaces, as well as numerous workshops at ACM CHI and IUI. Further information can be found on his personal homepage:

Overview of Open Dataset Sessions and Benchmarking Competitions in 2021.

This issue of the Dataset Column proposes a review of some of the most important events in 2021 related to special sessions on open datasets or benchmarking competitions associated with multimedia data. While this is not meant to represent an exhaustive list of events, we wish to underline the great diversity of subjects and dataset topics currently of interest to the multimedia community. We will present the following events:

  • 13th International Conference on Quality of Multimedia Experience (QoMEX 2021). We summarize six datasets included in this conference that address QoE studies on haze conditions (RHVD), tele-education events (EVENT-CLASS), storytelling scenes (MTF), image compression (EPFL), virtual reality effects on gamers (5Gaming), and live stream shopping (LSS-survey).
  • Multimedia Datasets for Repeatable Experimentation at 27th International Conference on Multimedia Modeling (MDRE at MMM 2021). We summarize the five datasets presented during the MDRE, addressing several topics like lifelogging and environmental data (MNR-HCM), cat vocalizations (CatMeows), home activities (HTAD), gastrointestinal procedure tools (Kvasir-Instrument), and keystroke and lifelogging (KeystrokeDynamics).
  • Open Dataset and Software Track at 12th ACM Multimedia Systems Conference (ODS at MMSys ’21). We summarize seven datasets presented at the ODS track, targeting several topics like network statistics (Brightcove Streaming Datasets and PePa Ping), emerging image and video modalities (Full UHD 360-Degree, 4DLFVD, and CWIPC-SXR) and human behavior data (HYPERAKTIV and Target Selection Datasets).
  • Selected datasets at 29th ACM Multimedia Conference (MM ’21). We summarize six datasets presented during the conference, targeting several topics like food logo detection (FoodLogoDet-1500), emotional relationship recognition (ERATO), text-to-face synthesis (CelebAText-HQ), multimodal linking (M3EL), egocentric video analysis (EGO-Deliver), and quality assessment of user-generated videos (PUGCQ).
  • ImageCLEF 2021. We summarize the six datasets launched for the benchmarking tasks, related to several topics like social media profile assessment (ImageCLEFaware), segmentation and labeling of underwater coral images (ImageCLEFcoral), automatic generation of web-pages (ImageCLEFdrawnUI) and medical imaging analysis (ImageCLEF-VQAMed, ImageCLEFmedCaption, and ImageCLEFmedTuberculosis).

Creating annotated datasets is even more difficult in ongoing pandemic times, and we are glad to see that many interesting datasets were published despite this unfortunate situation.

QoMEX 2021

A large number of dataset-related papers were presented at the International Conference on Quality of Multimedia Experience (QoMEX 2021), organized as a fully online event in Montreal, Canada, June 14-17, 2021. The complete QoMEX ’21 Proceedings are available in the IEEE Digital Library.

The conference had no dedicated dataset session. However, datasets were very important to the conference, with a number of papers presenting new datasets or making use of broadly available ones. As a small sample, six selected papers focused primarily on new datasets are listed below. These contributions focus on haze, teaching in Virtual Reality, multiview video, image quality, cybersickness in Virtual Reality gaming, and shopping patterns.

A Real Haze Video Database for Haze Evaluation
Paper available at:
Chu, Y., Luo, G., and Chen, F.
College of Computer Science and Software Engineering, Shenzhen University, Shenzhen, P.R.China.
Dataset available at:

The RHVD video quality assessment dataset focuses on the study of perceptual degradation caused by heavy haze conditions in real-world outdoor scenes, addressing a large number of possible use case scenarios, including driving assistance and warning systems. The dataset was collected from the Flickr video sharing platform and post-edited, and 40 annotators took part in the subjective quality assessment experiments.

EVENT-CLASS: Dataset of events in the classroom
Paper available at:
Orduna, M., Gutierrez, J., Manzano, C., Ruiz, D., Cabrera, J., Diaz, C., Perez, P., and Garcia, N.
Grupo de Tratamiento de Imágenes, Information Processing & Telecom. Center, Universidad Politécnica de Madrid, Spain; Nokia Bell Labs, Madrid, Spain.
Dataset available at:

The EVENT-CLASS dataset consists of 360-degree videos that contain events and characteristics specific to the context of tele-education, composed of video and audio sequences taken in varying conditions. The dataset addresses several topics, including quality assessment tests with the aim of improving the immersive experience of remote users.

A Multi-View Stereoscopic Video Database With Green Screen (MTF) For Video Transition Quality-of-Experience Assessment
Paper available at:
Hobloss, N., Zhang, L., and Cagnazzo, M.
LTCI, Télécom-Paris, Institut Polytechnique de Paris, Paris, France; Univ Rennes, INSA Rennes, CNRS, Rennes, France.
Dataset available at:

MTF is a multi-view stereoscopic video dataset, containing full-HD videos of real storytelling scenes, targeting QoE assessment for the analysis of visual artefacts that appear during automatically generated point-of-view transitions. The dataset features a large baseline of camera setups and can also be used in other computer vision applications, like video compression, 3D video content, VR environments and optical flow estimation.

Performance Evaluation of Objective Image Quality Metrics on Conventional and Learning-Based Compression Artifacts
Paper available at:
Testolina, M., Upenik, E., Ascenso, J., Pereira, F., and Ebrahimi, T.
Multimedia Signal Processing Group, École Polytechnique Fédérale de Lausanne (EPFL), Lausanne, Switzerland; Instituto Superior Técnico, Universidade de Lisboa – Instituto de Telecomunicações, Lisbon, Portugal.
Dataset available on request to the authors.

This dataset consists of a collection of compressed images, labelled according to subjective quality scores, targeting the evaluation of 14 objective quality metrics against the perceived human quality baseline.
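
As an illustration only: one common way to compare an objective metric against subjective scores is to correlate the two. The sketch below computes the Pearson correlation in plain Python. The evaluation protocol actually used in the paper is not described here, and the scores in the example are hypothetical values, not data from the dataset.

```python
from math import sqrt


def pearson(x, y):
    """Pearson linear correlation between two equally long score lists."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two lists of equal length with at least 2 items")
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


# Hypothetical values: subjective scores and one objective metric's output
# for the same five compressed images.
subjective = [1.8, 2.5, 3.1, 4.0, 4.6]
objective = [22.0, 27.5, 30.1, 35.2, 38.9]
print(round(pearson(subjective, objective), 3))
```

A value close to 1 would suggest that the metric tracks the subjective scores linearly. Rank-based measures such as Spearman correlation are often reported alongside it.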

The Effect of VR Gaming on Discomfort, Cybersickness, and Reaction Time
Paper available at:
Vlahovic, S., Suznjevic, M., Pavlin-Bernardic, N., and Skorin-Kapov, L.
Faculty of Electrical Engineering and Computing, University of Zagreb, Zagreb, Croatia; Faculty of Humanities and Social Sciences, University of Zagreb, Zagreb, Croatia.
Dataset available on request to the authors.

The authors present the results of a study conducted on 20 human users that measures the physiological and cognitive aftereffects of exposure to three different VR games with game mechanics centered around natural interactions. This work moves away from cybersickness as a primary measure of VR discomfort and analyzes other concepts like device-related discomfort, muscle fatigue and pain, and their correlations with game complexity.

Beyond Shopping: The Motivations and Experience of Live Stream Shopping Viewers
Paper available at:
Liu, X. and Kim, S. H.
Adelphi University.
Dataset available on request to the authors.

The authors propose a study of 286 live stream shopping users, where viewer motivations are examined according to the Uses and Gratifications Theory, seeking to identify motivations broken down into sixteen constructs organized under four larger constructs: entertainment, information, socialization, and experience.

MDRE at MMM 2021

The Multimedia Datasets for Repeatable Experimentation (MDRE) special session is part of the 2021 International Conference on Multimedia Modeling (MMM 2021). The session was organized by Cathal Gurrin (Dublin City University, Ireland), Duc-Tien Dang-Nguyen (University of Bergen, Norway), Björn Þór Jónsson (IT University of Copenhagen, Denmark) and Klaus Schoeffmann (Klagenfurt University, Austria). More details regarding this session can be found at:

The MDRE’21 special session at MMM’21 is the third MDRE edition, and it represents an opportunity for interested researchers to submit their datasets to this track. The work submitted to MDRE is permanently available at, where all the current and past editions of MDRE are hosted. Authors are asked to provide, along with the dataset itself, a paper describing the dataset’s motivation, design, and usage, a brief summary of the experiments performed to date on the dataset, and a discussion of how it can be useful to the community.

MNR-Air: An Economic and Dynamic Crowdsourcing Mechanism to Collect Personal Lifelog and Surrounding Environment Dataset.
Paper available at:
Nguyen DH., Nguyen-Tai TL., Nguyen MT., Nguyen TB., Dao MS.
University of Information Technology, Ho Chi Minh City, Vietnam; University of Science, Ho Chi Minh City, Vietnam; Vietnam National University in Ho Chi Minh City, Ho Chi Minh City, Vietnam; National Institute of Information and Communications Technology, Koganei, Japan.
Dataset available on request to the authors.

The paper introduces an economical and dynamic crowdsourcing mechanism that can be used to collect personal lifelog associated events. The resulting dataset, MNR-HCM, represents data collected in Ho Chi Minh City, Vietnam, containing weather data, air pollution data, GPS data, lifelog images, and citizens’ cognition on a personal scale.

CatMeows: A Publicly-Available Dataset of Cat Vocalizations
Paper available at:
Ludovico L.A., Ntalampiras S., Presti G., Cannas S., Battini M., Mattiello S.
Department of Computer Science, University of Milan, Milan, Italy; Department of Veterinary Medicine, University of Milan, Milan, Italy; Department of Agricultural and Environmental Science, University of Milan, Milan, Italy.
Dataset available at:

The CatMeows dataset consists of vocalizations produced by 21 cats belonging to two breeds, namely Maine Coon and European Shorthair, that are emitted in three different contexts: brushing, isolation in an unfamiliar environment, and waiting for food. Recordings are performed with low-cost and easily available devices, thus creating a representative dataset for real-world scenarios.

HTAD: A Home-Tasks Activities Dataset with Wrist-accelerometer and Audio Features
Paper available at:
Garcia-Ceja, E., Thambawita, V., Hicks, S.A., Jha, D., Jakobsen, P., Hammer, H.L., Halvorsen, P., Riegler, M.A.
SINTEF Digital, Oslo, Norway; SimulaMet, Oslo, Norway; Oslo Metropolitan University, Oslo, Norway; UIT The Arctic University of Norway, Tromsø, Norway; Haukeland University Hospital, Bergen, Norway.
Dataset available at:

The HTAD dataset contains wrist-accelerometer and audio data collected during several normal day-to-day tasks, such as sweeping, brushing teeth, or watching TV. Being able to detect these types of activities is important for the creation of assistive applications and technologies that target elderly care and mental health monitoring.

Kvasir-Instrument: Diagnostic and Therapeutic Tool Segmentation Dataset in Gastrointestinal Endoscopy
Paper available at:
Jha, D., Ali, S., Emanuelsen, K., Hicks, S.A., Thambawita, V., Garcia-Ceja, E., Riegler, M.A., de Lange, T., Schmidt, P.T., Johansen, H.D., Johansen, D., Halvorsen, P.
SimulaMet, Oslo, Norway; UIT The Arctic University of Norway, Tromsø, Norway; Simula Research Laboratory, Oslo, Norway; Augere Medical AS, Oslo, Norway; Oslo Metropolitan University, Oslo, Norway; Medical Department, Sahlgrenska University Hospital-Mölndal, Gothenburg, Sweden; Department of Medical Research, Bærum Hospital, Gjettum, Norway; Karolinska University Hospital, Solna, Sweden; Department of Engineering Science, University of Oxford, Oxford, UK; Sintef Digital, Oslo, Norway.
Dataset available at:

The Kvasir-Instrument dataset consists of 590 annotated frames that contain gastrointestinal (GI) procedure tools such as snares, balloons, and biopsy forceps, and seeks to improve follow-up and the set of available information regarding the disease and the procedure itself, by providing baseline data for the tracking and analysis of the medical tools.

Keystroke Dynamics as Part of Lifelogging
Paper available at:
Smeaton, A.F., Krishnamurthy, N.G., Suryanarayana, A.H.
Insight Centre for Data Analytics, Dublin City University, Dublin, Ireland; School of Computing, Dublin City University, Dublin, Ireland.
Dataset available at:

The authors created a dataset of longitudinal keystroke timing data that spans a period of up to seven months for four human participants. A detailed analysis of the data is performed, by examining the timing information associated with bigrams, or pairs of adjacently-typed alphabetic characters.
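
To make the bigram timing idea concrete, here is a minimal sketch, not the authors' code. It assumes a hypothetical log format of (character, timestamp in milliseconds) pairs and collects the latency between consecutive alphabetic key presses for each bigram.

```python
from collections import defaultdict
from statistics import mean


def bigram_latencies(keystrokes):
    """Group inter-key latencies (ms) by bigram.

    `keystrokes` is a hypothetical log: a list of (char, timestamp_ms)
    tuples in typing order. Only pairs of adjacent alphabetic characters
    are kept, matching the definition of a bigram used above.
    """
    latencies = defaultdict(list)
    for (c1, t1), (c2, t2) in zip(keystrokes, keystrokes[1:]):
        if c1.isalpha() and c2.isalpha():
            latencies[(c1 + c2).lower()].append(t2 - t1)
    return dict(latencies)


def mean_latency_per_bigram(keystrokes):
    """Average latency (ms) for each bigram seen in the log."""
    return {bg: mean(ts) for bg, ts in bigram_latencies(keystrokes).items()}


# Illustrative data only, not taken from the dataset.
sample = [("t", 0), ("h", 120), ("e", 210), (" ", 400), ("t", 520), ("h", 660)]
print(mean_latency_per_bigram(sample))
```

Pairs that include a space or other non-alphabetic key are skipped, so the pause between words does not count towards any bigram.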

ODS at MMSys ’21

The traditional Open Dataset and Software Track (ODS) was a part of the 12th ACM Multimedia Systems Conference (MMSys ’21), organized as a hybrid event in Istanbul, Turkey, September 28 – October 1, 2021. The complete MMSys ’21: Proceedings of the 12th ACM Multimedia Systems Conference are available in the ACM Digital Library.

The Session on Software, Tools and Datasets was chaired by Saba Ahsan (Nokia Technologies, Finland) and Luca De Cicco (Politecnico di Bari, Italy) on September 29, 2021, at 16:00 (UTC+3, Istanbul local time). The session opened with 1-slide/minute intros given by the authors and then split into individual virtual booths. Seven of the thirteen contributions were dataset papers. A listing of the paper titles, their abstracts, and associated DOIs is included below for your convenience.

Adaptive Streaming Playback Statistics Dataset
Paper available at:
Teixeira, T., Zhang, B., Reznik, Y.
Brightcove Inc, USA
Dataset available at:

The authors propose a dataset that captures statistics from a number of real-world streaming events, utilizing different devices (TVs, desktops, mobiles, tablets, etc.) and networks (from 2.5G, 3G, and other early generation mobile networks to 5G and broadband). The captured data includes network and playback statistics, events and characteristics of the encoded stream.

PePa Ping Dataset: Comprehensive Contextualization of Periodic Passive Ping in Wireless Networks
Paper available at:
Madariaga, D., Torrealba, L., Madariaga, J., Bustos-Jimenez, J., Bustos, B.
NIC Chile Research Labs, University of Chile
Dataset available at:

The PePa Ping dataset consists of real-world data with a comprehensive contextualization of Internet QoS indicators, like round-trip time, jitter and packet loss. A methodology is developed for Android devices that obtains the necessary information, while the indicators are directly provided to the Linux kernel, therefore being an accurate representation of real-world data.

Full UHD 360-Degree Video Dataset and Modeling of Rate-Distortion Characteristics and Head Movement Navigation
Paper available at:
Chakareski, J., Aksu, R., Swaminathan, V., Zink, M.
New Jersey Institute of Technology; University of Alabama; Adobe Research; University of Massachusetts Amherst, USA
Dataset available at:

The authors create a dataset of 360-degree videos that are used in analyzing the rate-distortion (R-D) characteristics of videos. These videos correspond to head movement navigation data in Virtual Reality (VR) and they may be used for analyzing how users explore panoramas around them in VR.

4DLFVD: A 4D Light Field Video Dataset
Paper available at:
Hu, X., Wang, C., Pan, Y., Liu, Y., Wang, Y., Liu, Y., Zhang, L., Shirmohammadi, S.
University of Ottawa, Canada / Beijing University of Posts and Telecommunication, China
Dataset available at:

The authors propose a 4D Light Field (LF) video dataset that is collected via a custom-made camera matrix. The dataset is to be used for designing and testing methods for LF video coding, processing and streaming, providing more viewpoints and/or higher framerate compared with similar datasets from the current literature.

CWIPC-SXR: Point Cloud dynamic human dataset for Social XR
Paper available at:
Reimat, I., Alexiou, E., Jansen, J., Viola, I., Subramanyam, S., Cesar, P.
Centrum Wiskunde & Informatica, Netherlands
Dataset available at:

The CWIPC-SXR dataset is composed of 45 unique sequences that correspond to several use cases for humans interacting in social extended reality. The dataset is composed of dynamic point clouds that serve as a low-complexity representation in these types of systems.

HYPERAKTIV: An Activity Dataset from Patients with Attention-Deficit/Hyperactivity Disorder (ADHD)
Paper available at:
Hicks, S. A., Stautland, A., Fasmer, O. B., Forland, W., Hammer, H. L., Halvorsen, P., Mjeldheim, K., Oedegaard, K. J., Osnes, B., Syrstad, V. E.G., Riegler, M. A.
SimulaMet; University of Bergen; Haukeland University Hospital; OsloMet, Norway
Dataset available at:

The HYPERAKTIV dataset contains general patient information, health, activity, information about the mental state, and heart rate data from patients with Attention-Deficit/Hyperactivity Disorder (ADHD). Included here are 51 patients with ADHD and 52 clinical control cases.

Datasets – Moving Target Selection with Delay
Paper available at:
Liu, S. M., Claypool, M., Cockburn, A., Eg, R., Gutwin, C., Raaen, K.
Worcester Polytechnic Institute, USA; University of Canterbury, New Zealand; Kristiania University College, Norway; University of Saskatchewan, Canada
Dataset available at:

The Selection datasets were created during four user studies on the effects of delay on video game actions and on the selection of a moving target with various pointing devices. The datasets include performance data, such as time to selection, and demographic data for the users, such as age and gaming experience.

ACM MM 2021

A large number of dataset-related papers were presented at the 29th ACM International Conference on Multimedia (MM ’21), organized as a hybrid event in Chengdu, China, October 20 – 24, 2021. The complete MM ’21: Proceedings of the 29th ACM International Conference on Multimedia are available in the ACM Digital Library.

There was no dedicated dataset session among the more than 35 sessions at the MM ’21 symposium. However, the importance of datasets is illustrated by how often the term “dataset” appears among the 542 accepted papers: it appears in the title of 7 papers, the keywords of 66 papers, and the abstracts of 339 papers. As a small sample, six selected papers focused primarily on new datasets are listed below. There are contributions focused on social multimedia, emotional recognition, text-to-face synthesis, egocentric video analysis, emerging multimedia applications, such as multimodal entity linking, and multimedia art, entertainment, and culture related to perceived quality of video content.
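
The counts above can be put into proportion with a few lines of Python. This is a minimal sketch: the `count_term` helper assumes a hypothetical list of paper records with `title`, `keywords`, and `abstract` text fields, which is not a format provided by the proceedings, while the `share` helper works directly on the numbers reported in the text.

```python
def count_term(papers, term="dataset"):
    """Count papers mentioning `term` in each metadata field (case-insensitive)."""
    fields = ("title", "keywords", "abstract")
    counts = {f: 0 for f in fields}
    for paper in papers:
        for f in fields:
            if term in paper.get(f, "").lower():
                counts[f] += 1
    return counts


def share(count, total):
    """Percentage of `total` papers, rounded to one decimal place."""
    return round(100 * count / total, 1)


# Counts reported in the text for the 542 accepted papers.
reported = {"title": 7, "keywords": 66, "abstract": 339}
for field, n in reported.items():
    print(f"{field}: {n}/542 = {share(n, 542)}%")
```

In other words, roughly 1.3% of titles, 12.2% of keyword lists, and 62.5% of abstracts mention the term.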

FoodLogoDet-1500: A Dataset for Large-Scale Food Logo Detection via Multi-Scale Feature Decoupling Network
Paper available at:
Hou, Q., Min, W., Wang, J., Hou, S., Zheng, Y., Jiang, S.
Shandong Normal University, Jinan, China; Institute of Computing Technology, Chinese Academy of Sciences, Beijing, China
Dataset available at:

The FoodLogoDet-1500 is a large-scale food logo dataset that has 1,500 categories, around 100,000 images and 150,000 manually annotated food logo objects. This type of dataset is important in self-service applications in shops and supermarkets, and copyright infringement detection for e-commerce websites.

Pairwise Emotional Relationship Recognition in Drama Videos: Dataset and Benchmark
Paper available at:
Gao, X., Zhao, Y., Zhang, J., Cai, L.
Alibaba Group, Beijing, China
Dataset available on request to the authors.

The Emotional RelAtionship of inTeractiOn (ERATO) dataset is a large-scale multimodal dataset composed of over 30,000 interaction-centric video clips lasting around 203 hours. The videos are representative for studying the emotional relationships between the two interactive characters in the video clip.

Multi-caption Text-to-Face Synthesis: Dataset and Algorithm
Paper available at:
Sun, J., Li, Q., Wang, W., Zhao, J., Sun, Z.
Center for Research on Intelligent Perception and Computing, NLPR, CASIA, Beijing, China; School of Artificial Intelligence, University of Chinese Academy of Sciences (UCAS), Beijing, China; Institute of North Electronic Equipment, Beijing, China
Dataset available on request to the authors.

The authors propose the CelebAText-HQ dataset, which addresses the text-to-face generation problem. Each image in the dataset is manually annotated with 10 captions, allowing proposed methods and algorithms to take multiple captions as input in order to generate highly semantically related face images.

Multimodal Entity Linking: A New Dataset and A Baseline
Paper available at:
Gan, J., Luo, J., Wang, H., Wang, S., He, W., Huang, Q.
Key Laboratory of Intelligent Information Processing, Institute of Computing Technology, Chinese Academy of Sciences, China; School of Computer Science and Technology, University of Chinese Academy of Sciences, China; Baidu Inc.
Dataset available at:

The authors propose the M3EL large-scale multimodal entity linking dataset, containing data associated with 1,100 movies. Reviews and images are collected, and textual and visual mentions are extracted and labelled with entities registered from Wikipedia.

Ego-Deliver: A Large-Scale Dataset for Egocentric Video Analysis
Paper available at:
Qiu, H., He, P., Liu, S., Shao, W., Zhang, F., Wang, J., He, L., Wang, F.
East China Normal University, Shanghai, China; University of Florida, Florida, FL, United States; Alibaba Group, Shanghai, China
Dataset available at:

The authors propose an egocentric video benchmarking dataset, consisting of videos recorded by takeaway riders doing their daily work. The dataset provides over 5,000 videos with more than 139,000 multi-track annotations and 45 different attributes, representing the first attempt at understanding the takeaway delivery process from an egocentric perspective.

PUGCQ: A Large Scale Dataset for Quality Assessment of Professional User-Generated Content
Paper available at:
Li, G., Chen, B., Zhu, L., He, Q., Fan, H., Wang, S.
Kingsoft Cloud, Beijing, China; City University of Hong Kong, Hong Kong, Hong Kong
Dataset available at:

The PUGCQ dataset consists of 10,000 professional user-generated videos, annotated with a set of perceptual subjective ratings. In particular, during the subjective annotation and testing, human opinions are collected based upon not only MOS, but also attributes that may influence visual quality such as faces, noise, blur, brightness, and colour.

ImageCLEF 2021

ImageCLEF is a multimedia evaluation campaign, part of the CLEF initiative. The 2021 edition is the 19th edition of this initiative and addresses four main research tasks in several domains, such as medicine, nature, social media content, and user interface processing. ImageCLEF 2021 is organized by Bogdan Ionescu (University Politehnica of Bucharest, Romania), Henning Müller (University of Applied Sciences Western Switzerland, Sierre, Switzerland), Renaud Péteri (University of La Rochelle, France), Ivan Eggel (University of Applied Sciences Western Switzerland, Sierre, Switzerland) and Mihai Dogariu (University Politehnica of Bucharest, Romania).

ImageCLEFaware
Paper available at:
Popescu, A., Deshayes-Chossar, J., Ionescu, B.
CEA LIST, France; University Politehnica of Bucharest, Romania.
Dataset available at:

This is the first edition of the aware task at ImageCLEF. It seeks to understand how public social media profiles affect users in certain important scenarios: searching or applying for a bank loan, accommodation, a job as a waitress/waiter, and a job in IT.

ImageCLEFcoral
Paper available at:
Chamberlain, J., de Herrera, A. G. S., Campello, A., Clark, A., Oliver, T. A., Moustahfid, H.
University of Essex, UK; NOAA – Pacific Islands Fisheries Science Center, USA; NOAA/ US IOOS, USA; Wellcome Trust, UK.
Dataset available at:

The ImageCLEFcoral task, currently in its third edition, proposes a dataset and benchmarking task for the automatic segmentation and labelling of underwater images that can be combined to generate 3D models for monitoring coral reefs. The task is composed of two subtasks, namely coral reef image annotation and localisation, and coral reef image pixel-wise parsing.

ImageCLEFdrawnUI
Paper available at:
Fichou, D., Berari, R., Tăuteanu, A., Brie, P., Dogariu, M., Ștefan, L.D., Constantin, M.G., Ionescu, B.
teleportHQ, Cluj Napoca, Romania; University Politehnica of Bucharest, Romania.
Dataset available at:

The second edition of ImageCLEFdrawnUI addresses the issue of creating appealing web page interfaces by fostering systems that are capable of automatically generating a web page from a hand-drawn sketch. The task is separated into two subtasks, the wireframe subtask and the screenshots subtask.

ImageCLEF-VQAMed
Paper available at:
Abacha, A.B., Sarrouti, M., Demner-Fushman, D., Hasan, S.A., Müller, H.
National Library of Medicine, USA; CVS Health, USA; University of Applied Sciences Western Switzerland, Sierre, Switzerland.
Dataset available at:

This represents the fourth edition of the ImageCLEF Medical Visual Question Answering (VQAMed) task. This benchmark includes a task on Visual Question Answering (VQA), where participants are tasked with answering questions from the visual content of radiology images, and a second task on Visual Question Generation (VQG), consisting of generating relevant questions about radiology images.

ImageCLEFmed Caption
Paper available at:
Pelka, O., Abacha, A.B., de Herrera, A.G.S., Jacutprakart, J., Friedrich, C.M., Müller, H.
University of Applied Sciences and Arts Dortmund, Germany; National Library of Medicine, USA; University of Essex, UK; University of Applied Sciences Western Switzerland, Sierre, Switzerland.
Dataset available at:

This is the fifth edition of the ImageCLEF Medical Concepts and Captioning task. The objective is to extract UMLS-concept annotations and/or captions from the image data that are then compared against the original text captions of the images.

ImageCLEFmed Tuberculosis
Paper available at:
Kozlovski, S., Liauchuk, V., Cid, Y.D., Kovalev, V., Müller, H.
Institute for Informatics, Minsk, Belarus; University of Warwick, Coventry, England, UK; University of Applied Sciences Western Switzerland, Sierre, Switzerland.
Dataset available at:

Report from ACM Multimedia Systems 2021 by Neha Sharma

Neha Sharma (@NehaSharma) is a PhD student working with Dr Mohamed Hefeeda in the Network and Multimedia Systems Lab at Simon Fraser University. Her research interests are in computer vision and machine learning, with a focus on next-generation multimedia systems and applications. Her current work focuses on designing an inexpensive hyperspectral camera using a hybrid approach that leverages both hardware and software solutions. She was named Best Social Media Reporter of the conference, an award intended to promote sharing among researchers on social networks. To celebrate this award, here is a more complete report on the conference.

Being a junior researcher in multimedia systems, I must say I feel proud to be part of this amazing community. I became part of the ACM Multimedia Systems Conference (MMSys) last year, in 2020, when I published my first research work. I was excited to attend MMSys ’20 in Istanbul, which unfortunately shifted online due to COVID-19. I presented my first work online and got to learn about other researchers in the community. This year I was able to publish another work with my team and was selected to present my ideas and research plans in the Doctoral Symposium (thanks to the reviewers). MMSys ’21 gave me hope of having a full conference experience, as we were all hoping to get our lives back to normal. However, as the conference date approached, things were still not clear and travel restrictions were still in place. On a good note, MMSys ’21 became hybrid to provide an opportunity for people who could travel. It was only at the very end that I decided to travel and attend MMSys ’21 in person, and I am glad I made that decision. My experience was overwhelmingly rich in terms of learning about interesting research findings and making inspiring connections in the community. As the recipient of the “Best Social Media Reporter” award, I invite you to enjoy the highlights of MMSys ’21 through my lens.

In light of the ongoing global pandemic, ACM MMSys ’21 was held in hybrid mode, onsite in Istanbul, Turkey, and online, from September 28 to October 1, 2021. Ali C. Begen (Ozyegin University and Networked Media, Turkey) opened the conference onsite with a warm welcome. MMSys ’21 became the first-ever hybrid conference where participants presented onsite as well as remotely in real time. Participants joined from 38 different countries. The organizing team did an amazing job in pulling off this complex event. This year the research track implemented a two-round submission system, and accepted papers included public reviews in the proceedings. This, however, was not the only first: MMSys ’21 had its first Doctoral Symposium, targeting PhD students and aiming to help them find mentors. In addition, there were postponed celebrations for the 30th anniversary of NOSSDAV and the 25th anniversary of Packet Video.

The conference program was very well scheduled. Each day of the conference started with a keynote. There were four insightful and inspiring keynotes from researchers working on cutting-edge multimedia technologies. The first day started with a talk titled “AI-Driven Solutions throughout Games’ Lifecycles Leveraging Big Data” by Qiaolin Chen from Tencent IEG Global. Chen discussed how AI and big data are changing the gaming industry, from intelligent market decisions to data-driven game development. On the second day, Caitlin Kalinowski, who heads the VR Hardware team at Facebook Reality Labs, presented an interesting keynote, “Making Impossible Products: How to Get 0-to-1 Products Right”. She shared insights about Oculus and zero-to-one products. The next day, Chris Bregler (Google) talked about “Synthetic Media: New Opportunities and New Challenges”. He discussed recent trends in generative media creation techniques that have opened new possibilities for societally beneficial uses but have also raised concerns about misuse. On the last day, Sriram Sethuraman and Deepthi Nandakumar (Amazon) provided insights about the “Role of ML in the Prediction of Perceptual Video Quality”. The keynotes are available on YouTube to watch on demand.

This year the conference attracted paper submissions from a range of multimedia topics including immersive media, live video, content preparation, cloud-based and mobile media processing and computer vision systems. Apart from the main research track, MMSys ’21 hosted three workshops:

  • NOSSDAV – Network and Operating System Support for Digital Audio and Video
  • MMVE – Immersive Mixed and Virtual Environment Systems
  • GameSys – Game Systems

These workshops provided an opportunity to meet those who are working in focused areas of multimedia research. This year MMSys hosted the inaugural ACM workshop on Game Systems (GameSys ’21). This workshop attracted research on all aspects of computer/digital games, emphasizing networks, systems, interaction, and applications. Highlights include the work presented by Mark Claypool et al. (Worcester Polytechnic Institute), a user study measuring attribute scaling for cloud-based games.

In addition to the area-focused workshops, MMSys ’21 also conducted two grand challenges:

Another main highlight of the conference was the EDI (Equality, Diversity and Inclusion) workshop. The workshop was tailored towards PhD students, assistant professors and starting researchers in various research organizations. The event openly discussed core topics such as parenthood, work-family policies, career paths and EDI aspects at large. Laura Toni, Mea Wang and Ozgu Alay opened the workshop on the third day of the conference. Miriam Redi shared goals for achieving an equitable and inclusive multimedia community. Susanne Boll talked about the target strategy “25 in 25” to increase the participation of women in SIGMM to at least 25% by 2025. Other guest speakers also highlighted strategies for achieving the diversity and inclusion targets in MMSys.

Last but not least, the social events were amazing. Each day of the conference ended with a well-planned social event, giving the in-person attendees a great opportunity to meet, discuss, and develop professional and social links throughout the community in a more relaxed setting. We visited historical venues such as Galata Tower and Adile Sultan Palace and enjoyed a Bosphorus boat tour with a live music band. This year MMSys planned its first inter-continental socials: we travelled from the European side to the Asian side of Istanbul (by bus and by boat). As a token of appreciation, in-person participants received Turkish delights and coffee, a set of traditional towels (peştemal), Istanbul-themed puzzles and a hand-made Kütahya Porcelain vase/coffee set as souvenirs. For me, the best part was sitting together and dining with peers, discussing the prospects of our own research or of multimedia systems research in general.

Ali C. Begen opened the closing session with the announcement of the awards. The Best Paper Award was presented to Xiao Zhu et al. for the paper “Livelyzer: Analyzing the First-Mile Ingest Performance of Live Video Streaming”. See the full list of awards here. The conference closed with the announcement of ACM Multimedia Systems 2022, which will take place in Athlone, Ireland. Looking forward to seeing everyone again next year.

JPEG Column: 93rd JPEG Meeting

JPEG Committee launches a Call for Proposals on Learning based Point Cloud Coding

The 93rd JPEG meeting was held online from 18 to 22 October 2021. The JPEG Committee continued its work on the development of new standardised solutions for the representation of visual information. Notably, the JPEG Committee has decided to release a new call for proposals on point cloud coding based on machine learning technologies that targets both compression efficiency and effective performance for 3D processing as well as machine and computer vision tasks. This activity will be conducted in parallel with JPEG AI standardization. Furthermore, it was also decided to pursue the development of a new standard in the context of the exploration on JPEG Fake News activity.

JPEG coding framework based on machine learning. The latent representation generated by the AI-based coding mechanism can be used for human visualisation, data processing and computer vision tasks.

Considering the response to the Call for Proposals on JPEG Pleno Holography, a first standard for compression of digital holograms has entered its collaborative phase. The response to the call for proposals identified a reliable coding solution for this type of visual information that overcomes the limitations of the state-of-the-art coding solutions for holographic data compression.

The 93rd JPEG meeting had the following highlights:

  • JPEG Pleno Point Cloud Coding draft of the Call for Proposals;
  • JPEG Pleno Holography;
  • JPEG AI drafts of the Call for Proposals and Common Training and Test Conditions;
  • JPEG Fake Media defines the standardisation timeline;
  • JPEG NFT collects use cases;
  • JPEG AIC explores standardisation of near-visually lossless quality models;
  • JPEG XS new profiles and sub-levels;
  • JPEG XL explores fixed point implementations;
  • JPEG DNA considers image quaternary representations suitable for DNA storage.

The following provides an overview of the major achievements of the 93rd JPEG meeting.

JPEG Pleno Point Cloud Coding

JPEG Pleno is working towards the integration of various modalities of plenoptic content under a single and seamless framework. Efficient and powerful point cloud representation is a key feature within this vision. Point cloud data supports a wide range of applications for human and machine consumption including autonomous driving, computer-aided manufacturing, entertainment, cultural heritage preservation, scientific research and advanced sensing and analysis. During the 93rd JPEG meeting, the JPEG Committee released a Draft Call for Proposals on JPEG Pleno Point Cloud Coding. This call addresses learning-based coding technologies for point cloud content and associated attributes with emphasis on both human visualization and decompressed/reconstructed domain 3D processing and computer vision with competitive compression efficiency compared to point cloud coding standards in common use, with the goal of supporting a royalty-free baseline. A Final Call for Proposals on JPEG Pleno Point Cloud Coding is planned to be released in January 2022.

JPEG Pleno Holography

At its 93rd JPEG meeting, the committee reviewed the response to the Call for Proposals on JPEG Pleno Holography, which is the first standardization effort aspiring to a versatile solution for efficient compression of holograms for a wide range of applications such as holographic microscopy, tomography, interferometry, printing and display, and their associated hologram types. The coding technology selected provides excellent rate-distortion performance for lossy coding, in addition to supporting lossless coding and random access via a space-frequency segmentation approach. The selected technology will serve as a baseline for the standard specification to be developed. This final specification is planned to be published as an international standard in early 2024.


JPEG AI

The scope of JPEG AI is the creation of a learning-based image coding standard offering a single-stream, compact compressed domain representation, targeting both human visualization, with significant compression efficiency improvement over image coding standards in common use at equivalent subjective quality, and effective performance for image processing and computer vision tasks.

During the 93rd JPEG meeting, the JPEG AI project activities focused on the analysis of the results of the exploration studies as well as refinements and improvements to the common training and test conditions, especially the performance assessment of the image classification and super-resolution tasks. A related topic that received much attention was device interoperability, which was thoroughly analyzed and discussed. Also, the JPEG AI Third Draft Call for Proposals is now available with improvements on evaluation conditions and proposal composition and requirements. A final Call for Proposals is expected to be issued at the 94th meeting (17-21 January 2022), with a first Working Draft expected by October 2022.

JPEG Fake Media

The scope of the JPEG Fake Media exploration is to assess standardization needs to facilitate secure and reliable annotation of media asset creation and modifications in good-faith usage scenarios as well as in those with malicious intent. At the 93rd meeting, the JPEG Committee released an updated version of the “JPEG Fake Media Context, Use Cases and Requirements” document. The new version includes an extended set of definitions and a new section related to threat vectors. In addition, the requirements have been substantially enhanced, in particular those related to media asset authenticity and integrity. Given the progress of the exploration, an initial timeline for the standardization process was proposed:

  • April 2022: Issue call for proposals
  • October 2022: Submission of proposals
  • January 2023: Start standardization process
  • January 2024: Draft International Standard (DIS)
  • October 2024: International Standard (IS)

The JPEG Committee welcomes feedback on the working document and invites interested experts to join the JPEG Fake Media AhG mailing list to get involved in this standardization activity.


JPEG NFT

Non-Fungible Tokens (NFTs) have recently attracted substantial interest. Numerous digital assets associated with NFTs are encoded in existing JPEG formats or can be represented in current and future JPEG-developed representations. Additionally, several trust and security concerns have been raised about NFTs and the underlying digital assets. The JPEG Committee has established the JPEG NFT exploration initiative to better understand user requirements for media formats. JPEG NFT’s mission is to provide effective specifications that enable various applications that rely on NFTs applied to media assets. The standard shall be secure, trustworthy, and environmentally friendly, enabling an interoperable ecosystem based on NFTs within or across applications. The group seeks to engage stakeholders from various backgrounds, including technical, legal, creative, and end-user communities, to develop use cases and requirements. On October 12th, 2021, a second JPEG NFT Workshop was organized in this context. The presentations and video footage from the workshop are now available on the JPEG website. In January 2022, a third workshop will focus on commonalities with the JPEG Fake Media exploration. JPEG encourages interested parties to visit its website frequently for the most up-to-date information and to subscribe to the JPEG NFT Ad Hoc Group’s (AhG) mailing list to participate in this effort.


JPEG AIC

During the 93rd JPEG Meeting, work was initiated on the first draft of a document on use cases and requirements regarding Assessment of Image Coding. The scope of AIC activities was defined to target standards or best practices with respect to subjective and objective image quality assessment methodologies that target a range from high quality to near-visually lossless quality. This is a range of visual qualities where artefacts are not noticeable by an average non-expert viewer without presenting an original reference image but are detectable by a flicker test.


JPEG XS

The JPEG Committee created an updated document “Use Cases and Requirements for JPEG XS V3.0”. It describes new use cases and refines the requirements to allow improving the coding efficiency and to provide additional functionality w.r.t. HDR content, random access and more. In addition, the JPEG XS second editions of Part 1 (Core coding system), Part 2 (Profiles and buffer models), and Part 3 (Transport and container formats) went to the final ballot before ISO publication stage. In the meantime, the Committee continued working on the second editions of Part 4 (Conformance Testing) and Part 5 (Reference Software), which are now ready as Draft International Standards. In addition, the decision was made to create an amendment to Part 2 that will add a High420.12 profile and a new sublevel at 4 bpp, to swiftly address market demands.


JPEG XL

Part 3 (Conformance testing) has proceeded to DIS stage. Core experiments were discussed to investigate hardware coding, in particular fixed-point implementations, and will be continued. Work on a second edition of Part 1 (Core coding system) was initiated. With preliminary support in major web browsers, image viewing and editing software, JPEG XL is ready for wide-scale adoption.


JPEG DNA

The JPEG Committee has continued its exploration of the coding of images in quaternary representations, which are particularly suitable for DNA storage. An important step forward in this activity is the implementation of experimentation software to simulate the coding/decoding of images in quaternary code. A thorough explanation of the package has been created, and a wiki for documentation and a link to the code can be found here. A successful fifth workshop on JPEG DNA was held prior to the 93rd JPEG meeting, and a new version of the JPEG DNA overview document was issued and is now publicly available. It was decided to continue this exploration by validating and extending the JPEG DNA experimentation software to simulate an end-to-end image storage pipeline using DNA for future exploration experiments, as well as improving the JPEG DNA overview document. Interested parties are invited to consider joining the effort by registering to the mailing list of JPEG DNA.
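
For readers unfamiliar with the idea of a quaternary representation, the following is a minimal illustrative sketch in Python. It is not the JPEG DNA experimentation software and does not reflect any coding choices made by the JPEG Committee. It assumes a naive mapping of two bits per nucleotide with the symbol order A, C, G, T; the function names are hypothetical. Practical DNA storage codes typically add further constraints, for example avoiding long runs of the same nucleotide, which this sketch ignores.

```python
# Illustrative only: a naive 2-bits-per-nucleotide mapping.
# This is NOT the JPEG DNA codec; names and symbol order are assumptions.
NUCLEOTIDES = "ACGT"  # assumed order: A=0, C=1, G=2, T=3


def bytes_to_quaternary(data: bytes) -> str:
    """Encode each byte as four base-4 digits, most significant pair first."""
    symbols = []
    for byte in data:
        for shift in (6, 4, 2, 0):
            symbols.append(NUCLEOTIDES[(byte >> shift) & 0b11])
    return "".join(symbols)


def quaternary_to_bytes(strand: str) -> bytes:
    """Decode a strand produced by bytes_to_quaternary back into bytes."""
    if len(strand) % 4 != 0:
        raise ValueError("strand length must be a multiple of 4")
    out = bytearray()
    for i in range(0, len(strand), 4):
        value = 0
        for symbol in strand[i:i + 4]:
            # str.index raises ValueError for symbols outside A/C/G/T
            value = (value << 2) | NUCLEOTIDES.index(symbol)
        out.append(value)
    return bytes(out)


if __name__ == "__main__":
    strand = bytes_to_quaternary(b"JPEG")
    print(strand)
    print(quaternary_to_bytes(strand))
```

Each byte becomes four symbols, so a compressed image bitstream of n bytes would map to a strand of 4n nucleotides under this simplified scheme.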

Final Quote

“Aware of the importance of timely standards in AI-powered imaging applications, the JPEG Committee is moving forward with two concurrent calls for proposals addressing both image and point cloud coding based on machine learning”, said Prof. Touradj Ebrahimi, the Convenor of the JPEG Committee.

Upcoming JPEG meetings are planned as follows:

No 94, to be held online during 17-21 January 2022.

Reports from ACM Multimedia 2021


Due to COVID-19, the annual ACM Multimedia Conference was held in a hybrid mode this year – onsite in Chengdu, China, and online. The organizers made meticulous preparations for this conference, and in total more than 1,000 researchers from all over the world participated.

AI companies such as Huawei and ByteDance were also on site, trying to attract researchers. It is worth mentioning that, in order to prevent the spread of COVID-19, staff and volunteers made a great effort, such as checking body temperatures and providing free masks for attendees.

To encourage student authors to fully engage with the event, SIGMM sponsored 39 students with Student Travel Grant Awards this year. Students who wanted to apply for this travel grant needed to submit an online form before the submission deadline, and the selection committee then chose the travel grant winners according to the selection criteria. The selected students received up to 1,000 USD to cover their airline tickets as well as accommodation costs for this event. We interviewed some of the travel grant winners and asked them to share their experience of attending the conference. The following are their comments.

Students interviewed at ACM Multimedia 2021

Shaoxiang Chen (Fudan University)

It was such a great pleasure to receive the student travel grant and attend the ACM MM 2021 conference in Chengdu. The organizers have devoted a significant amount of effort to ensure the attendees have a nice experience, and in fact, we did. The prepared check-in gifts including masks, an umbrella, and small notebooks were considerate. The onsite covid-19 test was convenient for us to travel back. The keynote talks were closely related to the popular topics in the multimedia community, and I have learned a lot about deep learning and multimodal pre-training. As for the doctoral symposium, I have met excellent PhD students from all over the world and received helpful suggestions from the mentors during my own presentation. Finally, the wonderful performances at the dinner banquet made the entire conference experience even more perfect.

Yuqian Fu (Fudan University)

This is the second time that I have attended ACM Multimedia onsite. The first time was in Nice, France, in October 2019, which was also a very nice trip. Another thing that I want to share is that I had a long paper accepted by ACM Multimedia in 2020. The conference was supposed to be held in Seattle, USA. However, due to COVID-19, we had to attend the conference online, which was a great pity. Therefore, I was really happy to participate in this year’s conference in Chengdu. During the conference, I had the opportunity to talk with other researchers face-to-face, and I also actively presented my work to them. I learned a lot in the past few days and had a good experience. Finally, I would like to thank SIGMM for the travel grant, the organizers for all the efforts they made to ensure the smooth running of the conference, and the volunteers for their kind help.

Zheng Wang (Fudan University)

It was a wonderful experience for me at ACM Multimedia 2021 in Chengdu this October. After the COVID-19 outbreaks of the past two years, we were so lucky to be together again. Many thanks to the local organizers for their tremendous efforts to hold the conference onsite. At the poster sessions, I was able to present my paper on video moment retrieval to attendees and discuss my idea with them. I could also stop by others’ work, and understanding their work gave me a direct view of what is going on in the multimedia community. I enjoyed the poster sessions since they helped me know the research trends better. One issue was that the hall for the poster session was relatively crowded, and some walls had two posters arranged one above the other, making communication a bit inconvenient. In the keynote sessions, I was able to see diverse research areas gathered under the same topic, which let me see a problem from different aspects. As I am in my last PhD year, I talked with several researchers from universities and companies, and I got valuable advice on what I should prepare for pursuing a career in research or business. Thanks to the local organizers for arranging trips to see cute pandas, which made visiting Chengdu a delightful and unforgettable memory.

Yang Jiao (Fudan University)

It was a great honour to attend ACM Multimedia in Chengdu this year. This year’s ACM Multimedia was a special conference, for it was the first top conference held onsite since COVID-19. It was the first time that I attended this conference and I enjoyed the academic atmosphere there. I met a lot of friends with similar research interests, as well as famous teachers who shared their research experiences. What excited me most was the best paper session, where a great number of outstanding works investigated interesting frontier tasks in the multimedia community, such as generating music according to visual motion, estimating postures based on one’s speech tune, etc. Moreover, the dinner banquet surprised me a lot. Besides the regular host introduction and dining time, the organizers also elaborately prepared wonderful shows as well as a lucky draw. I, fortunately, won the third prize. In summary, thanks for all the efforts of the organizers and the excellent talks given by outstanding researchers at this year’s Multimedia. It was a really impressive experience for me!

Yechao Zhang (Huazhong University of Science and Technology (HUST) )

It was such an honour for me to receive the student travel grant. Frankly, I am merely a second-year grad student at HUST, and it was the first time for me to attend any academic conference. The acceptance from ACM Multimedia 2021 was a major inspiration for me, and it motivated me to apply for a PhD program so that I could keep contributing to academic research in the area of multimedia in the future. During the conference, I very much enjoyed my time visiting Chengdu. Apart from the amazing food adventure, I had the most beneficial conversations with researchers from all over the world. All these wonderful experiences would not have been possible if it weren’t for the travel grant from SIGMM. Many thanks for the recognition and support from SIGMM. I sincerely hope ACM Multimedia will gain more international influence.

Jingru Gan (University of Chinese Academy of Sciences)

The ACM Multimedia held this year is an extraordinary conference in terms of the organization and attending experience. I am most impressed by the refined arrangement of hybrid oral sessions which accommodates onsite and online presenters from everywhere on earth. The great importance of this meeting is that it intensifies the bond of researchers from pages of papers to face-to-face meetings. To get a chance of knowing how others go through months of trial and error before achieving a satisfactory result is inspiring, which encourages me to completely dedicate myself to my future work.

Yanqiao Zhu (University of Chinese Academy of Sciences)

Although this was not my first time attending international conferences, my experience at ACM Multimedia 2021 was still very exciting and unforgettable, especially after a long-time travel block due to COVID-19. This year, the diverse program not only makes me feel more connected with the multimedia research community but really broadens my vision. During the conference, I presented my paper on multimedia recommendation, met with many prestigious scholars from both academia and industry, and exchanged many interesting ideas. I believe most of the discussions will spur sparks for future research directions. I also participated in social networking programs, during which I made a lot of friends in related research areas. Overall, it was a great honour for me to receive the SIGMM travel grant that supports me attending ACM Multimedia 2021 physically. I would like to sincerely thank all organizers for their effort in making this year’s ACM Multimedia a great success.

Yudong Wang (University of Electronic Science and Technology of China)

As an undergraduate who received the student travel grant, this was my first time attending an international conference. Owing to 2019-nCoV, almost all of the onsite attendees were Chinese and the room for the posters was a little crowded, but fortunately, people were orderly. At the conference, I stood by my poster and shared my work with researchers in the same field. Apart from that, I talked with some people who work on recommendation algorithms. They helped me get to know other AI applications and brand new methods to realize intelligence. I listened to oral presentations from different parts of the world and learned a lot about other fields of multimedia. The most impressive thing was the banquet. Although we came from different schools, the atmosphere among strangers at the table was harmonious. We talked about our daily life at our schools and enjoyed the performances on the stage. By the way, the gifts prepared for the attendees were a pleasant surprise. If I have any regrets, they must be that I was not a volunteer helping others and that I failed to win anything in the lucky draw. In summary, thanks to the committee, I had a great experience at ACM Multimedia 2021.

Peidong Liu (Tsinghua University)

I was pleased to attend the ACM MM 2021 conference onsite in Chengdu, China. Due to the coronavirus pandemic, the conference adopted a hybrid format, i.e. both onsite and online, so that as many people as possible could take part in the academic exchange. Notably, this was the first international conference I had attended onsite in the last few years, and I found it more convenient to exchange ideas onsite than online. There are several points worth mentioning. First, the conference used an app called Whova throughout the event, in which we could fill in our research interests and affiliations to communicate more conveniently with other researchers. Besides that, the volunteers patiently helped us with the check-in process and gave us a nice experience at the conference. Finally, thanks to the support from the conference community, I gained the opportunity to communicate onsite with researchers from all around the globe.

Haoyu Zhang (Shandong University)

This was my first time attending an international conference, and I was very happy to participate in person in Chengdu, Sichuan, China. Attending in person offers a feeling that cannot be experienced online. The volunteers at the conference were very enthusiastic and answered my questions about attending. The ACM Multimedia organizers were very thoughtful: they prepared many exquisite gifts for each participant and provided dinners featuring local specialties. The food was so delicious that I wanted to linger. During the daily sessions, I attended the talks and browsed the posters that interested me, and had detailed exchanges with the authors, which not only broadened my horizons but also inspired my thinking. In short, I was very honoured to be able to attend this ACM Multimedia conference, and it was a very impressive experience. Finally, I wish ACM Multimedia continued success.


Overall, almost everyone rated their experience of participating in this conference highly. The comments also show that the travel grant was a real help to the students. To summarize, the conference was held successfully and left a very good impression on the participants.