Gender Diversity in SIGMM: We’ll Just Leave This Here As Well


1. Introduction and Background

SIGMM is the Association for Computing Machinery’s (ACM) Special Interest Group (SIG) in Multimedia, one of 36 SIGs in the ACM family.  ACM itself was founded in 1947 and is the world’s largest educational and scientific society for computing, uniting computing educators, researchers and professionals. With almost 100,000 members worldwide, ACM is a strong force in the computing world and is dedicated to advancing the art, science, engineering, and application of information technology.

SIGMM has been operating for nearly 30 years and sponsors 5, soon to be 6, major international conferences each year, as well as dozens of workshops and an ACM Transactions journal. SIGMM presents several Excellence and Achievement Awards each year, including awards for Technical Achievement, Rising Star, Outstanding PhD Thesis, TOMM Best Paper, and Best TOMM Associate Editor. SIGMM funds student travel scholarships to almost all our conferences, with nearly 50 such travel grants at the flagship MULTIMEDIA conference in Seoul, Korea, in 2018. SIGMM has two active chapters, one in the San Francisco Bay Area and one in China. It has a very active online presence, with social media reporters at our conferences, a regular SIGMM Records newsletter, and a weekly news digest. At our flagship conference, SIGMM sponsors women and diversity lunches, doctoral symposia, and a newcomers’ welcome breakfast. SIGMM also funds special initiatives based on suggestions and proposals from the community, as well as a newly-launched conference ambassador program that reaches out to other ACM SIGs for collaborations across our conferences.

It is generally accepted that SIGMM has a diversity and inclusion problem which exists at all levels, but we have now recognized this and have started to take action. In September 2017 ACM SIGARCH produced the first of a series of articles on gender diversity in the field of Computer Architecture. SIGARCH members examined the representation of women at SIGARCH conferences over the previous 2 years and produced the first of a set of reports entitled “Gender Diversity in Computer Architecture: We’re Just Going to Leave This Here”.


This report generated much online debate and commentary, including at the ACM SIG Governing Board (SGB) meetings in 2017 and in 2018.

At a SIGMM Executive Committee meeting in Mountain View, California in October 2017, SIGMM agreed to replicate the SIGARCH study to examine and measure the (lack of) gender diversity at SIGMM-sponsored conferences.  We issued a call offering funding support to do this, but there were no takers, so I did it myself, from within my own research lab.

2. Baselines for Performance Comparison

Before jumping into the numbers it is worth establishing a baseline to measure against. As an industry-wide figure, 17-24% of Computer Science undergraduates at US R1 institutions are female, as are 17% of those in technical roles at large high-tech companies that report diversity. I also looked at the female representation within some of the other ACM SIGs. While we must accept that inclusiveness and diversity are not just about gender but also about race, ethnicity, nationality, and even institution, we don’t have data on these other aspects, so I focus on gender diversity alone.

So how does SIGMM compare to other SIGs? Let’s look at SIG memberships using data provided by ACM.

The best (most balanced, or least imbalanced) SIGs are CSE (Computer Science Education) with 25% female and Computer Human Interaction (CHI), also with 25% female among those declaring a gender, though CHI is probably better because it has a greater percentage of undeclared gender and thus a lower proportion of males. The worst (most imbalanced, or least balanced) SIGs are PLAN (Programming Languages) with 4% female and OPS (Operating Systems) with 5% female.


The figures for SIGMM show 9% female membership, with 17% unknown or not declaring, which means that among the declared members the figure is just below 11%. Among the other SIGs this makes us closest to AI (Artificial Intelligence) and IR (Information Retrieval), though SIGIR has a larger number of members with gender undeclared.


Measuring this against overall ACM membership, we find that ACM members are 68% male, 12% female and 20% undeclared. This makes SIGMM quite mid-table compared to other SIGs, but we are all doing badly and we all have an imbalance. Interestingly, the MULTIMEDIA Conference in 2018 in Seoul, Korea had 81% male, 18% female and 1% other/undeclared attendees, slightly better than our membership ratio but still not good.

3. Gender Balance at SIGMM Conferences

We [1] carried out a desk study for the 3 major SIGMM conferences, namely MULTIMEDIA with an average attendance of almost 800, the International Conference on Multimedia Retrieval (ICMR) with 230 attendees at the last conference, and Multimedia Systems (MMSys) with about 130 attendees. For each of the last 5 years we trawled through the conference websites, extracting the names and affiliations of the organizing committees, the technical program committees and the invited keynote speakers.  We did likewise for the SIGMM award winners. This required us to determine gender for over 2,700 people, although there were duplicates, as the same people can recur on program committees for multiple years and over multiple conferences. Some names were easy, like “John” and “Susanne”, but these were few, so for the others we searched for them on the web. If we were still searching after 5 minutes, we gave up. [2]

[1] This work was carried out by Agata Wolski, a Summer intern student, and me, during Summer 2018.

[2] The data gathered from this activity is available on request from

The figures for each of these annual conferences, covering a 5-year period for MULTIMEDIA, a 4-year period for ICMR and a 3-year period for MMSys, are shown in the following sequence of charts, first as percentages and then as raw numbers, for each conference.







So what do the figures mean in comparison to each other and to our baseline?

The results tell us the following:

  • Almost all the percentages for female participation in the organisation of SIGMM conferences are above the SIGMM membership figure of 9% (closer to 11% when discounting those SIGMM members with gender unassigned), yet we know the number of female SIGMM members is already much smaller than the 17% female representation in technology companies and the almost 18% female ACM members when discounting unassigned genders.
  • Even if we were to use the 17% to 18% figures as our baseline, female participation in SIGMM conference organisation is below that baseline, meaning our female SIGMM members are not appearing in organisational and committee roles at the rate our membership numbers would indicate they should.
  • While each of our conferences falls below these pro rata figures, none of the three is particularly worse than the others.

4. Initiatives Elsewhere to Redress Gender Imbalance

I then examined some of the actions carried out elsewhere that SIGMM could implement, starting with the other ACM SIGs.  There I found that some of them do the following:

  • Women and diversity events at conferences (breakfasts or lunches, like SIGMM does)
  • Women-only networking pre-conference meals at conferences
  • Women-only technical programme events like N2Women
  • Formation of mentoring groups (using Slack) for informal mentoring
  • Highlighting the roles and achievements of women on social media and in newsletters
  • Childcare and companion travel grants for conference attendance

I then looked more broadly at other initiatives and found the following:

  • gender quotas
  • accelerator programs like Athena Swan
  • female-only events like workshops
  • reports like this which act as spotlights

When we put these all together there are three recurring themes which appear across various initiatives:

  1. Networking .. encouraging us to be part of a smaller group within a larger group. This reflects a natural human tendency to be tribal: we like to belong to groups, starting with our family, but also the people we have lunch with, go to yoga classes with, or go on holiday with. We each have multiple, sometimes non-overlapping, groups or tribes that we like to be part of, and one such group is the network of minority/women members that forms as a result of some of these activities.
  2. Peer-to-peer buddying .. again, there is a natural human tendency for older siblings (sisters) to help younger ones, from when we are very young and right throughout life.  The buddying activity reflects this and gives a form of satisfaction to the older or more senior buddy, as well as practical benefit to the younger or more junior buddy.
  3. Role models .. several initiatives try to promote role models as the kinds of people we can aspire to be.  More often than not, it is the very successful people and the high flyers who are put forward as role models, whereas in practice not everyone wants to be a high flyer.  For many people success means something different, something less lofty and aspirational, and when we see high-flying successful people promoted as role models our reaction can be the opposite: we reject them because we don’t want to be in their league, and as a result we can feel deflated and regard ourselves as under-achievers, defeating the purpose of having role models in the first place.

5. SIGMM Women’s / Diversity Lunch at MULTIMEDIA 2018

At the ACM MULTIMEDIA Conference in Seoul, Korea in October 2018 SIGMM once again organised a women’s / diversity lunch and about 60 people attended, mostly women.


At the event I gave a high-level overview of the statistics presented earlier in this report, and then, in order to gather feedback from the audience, we held a moderated discussion using PadLet. PadLet is an online bulletin board for displaying information (text, images or links) that an audience can contribute anonymously. Attendees at the lunch scanned a QR code with their smartphones, which opened a browser and allowed them to post comments on the big screen in response to the topic being discussed.

The first topic discussed was “What brings you to the MULTIMEDIA Conference?”

  • The answers (anonymous comments) posted included: many are here because they are presenting papers or posters, many want to network and share ideas, to help build a community of like-minded researchers, and some are attending in order to meet old friends .. the usual reasons for attending a conference.

For the second topic we asked “What excites you about multimedia as a topic, and how did you get into the area?”

  • The answers included the interaction between computer vision and language, the novel applications around multimodality, the multidisciplinary nature and the practical nature of the subject, and the diversity of topics and the people attending.

The third topic was “What is more/less important for you … networking, role models or peer buddies?”

  • From the answers to this, networking was almost universally identified as the most important, followed by interacting with peers.

Finally we asked “Do you know of an initiative that works, or that you would like to see at SIGMM event(s)?”

  • A variety of suggestions were put forward, including holding hackathons, funding undergraduate students from local schools to attend the conference, a women-only ACM award, ring-fenced funding for supporting women, and training for reviewing; a lot of people also wanted mentoring and mentor matching.

6. SIGMM Initiatives

So what will we do in SIGMM?

  • We will continue to encourage networking at SIGMM sponsored conferences. We will fund lunches like the ones at the MULTIMEDIA Conference. We also started a newcomers breakfast at the MULTIMEDIA Conference in 2018 and we will continue with this.
  • We will ensure that all our conference delegates can attend all conference events at all SIGMM conferences without extra fees. This was a SIGMM policy identified in a review of SIGMM conferences some years ago, but it has slipped.
  • We will not force, but will facilitate, peer-to-peer buddying through the networking events at our conferences, and through this we will indirectly help you identify your own role models.
  • We will appoint a diversity coordinator to oversee the women / diversity activities across our SIGMM events and this appointee will be a full member of the SIGMM Executive Committee.
  • We will offer every member of our SIGMM community attending our sponsored conferences the opportunity, as part of their conference registration, to indicate their availability and interest in taking on an organisational role in SIGMM activities, including conference organisation and/or reviewing. This will give us a reserve of people whose expertise and services we can draw on, and we can do so in a way which promotes diversity.

These may appear small-scale and relatively minor, because we are not getting to the roots of what causes the bias and we are not inducing change to counter those causes. However, these are positive steps in the right direction, and we will now have gender and other bias issues permanently on our radar.

Report from the SIGMM Emerging Leaders Symposium 2018

The idea of a symposium to bring together the bright new talent within the SIGMM community and to hear their views on topics within the area and on the future of Multimedia was first mooted in 2014 by Shih-Fu Chang, then SIGMM Chair. That led to the “Rising Stars Symposium” at the MULTIMEDIA Conference in 2015, where 12 invited speakers presented their work as a satellite event to the main conference. After each presentation a respondent, typically an experienced member of the SIGMM community, gave a response or personal interpretation of the presentation. The format worked well and was very thought-provoking, though some people felt that a shorter event, more integrated into the conference, might work better.

For the next year, 2016, the event was run a second time with 6 invited speakers and was indeed more integrated into the main conference. The event skipped a year in 2017, but was brought back for the MULTIMEDIA Conference in 2018. This time, rather than invite speakers, we decided to hold an open call with nominations, making selection for the symposium a competitive process. We also decided to rename the event from Rising Stars Symposium to the “SIGMM Emerging Leaders Symposium”, to avoid confusion with the “SIGMM Rising Star Award”, which is completely different and is awarded annually.

In July 2018 we issued a call for applications to the “Third SIGMM Emerging Leaders Symposium, 2018”, to be held at the annual MULTIMEDIA Conference in Seoul, Korea, in October 2018. Applications were received and evaluated by a panel consisting of the following people, whom we thank for volunteering and for their support:

Werner Bailer, Joanneum Research
Guillaume Gravier, IRISA
Frank Hopfgartner, Sheffield University
Hayley Hung, Delft University, (a previous awardee)
Marta Mrak, BBC

Based on the assessment panel’s recommendations, 4 speakers were selected for the Symposium, namely:

Hanwang Zhang, Nanyang Technological University, Singapore
Michael Riegler, Simula, Norway
Jia Jia, Tsinghua University, China
Liqiang Nie, Shandong University, China

The Symposium took place on the last day of the main conference and was chaired by Gerald Friedland, SIGMM Conference Director.


Towards X Visual Reasoning

By Hanwang Zhang (Nanyang Technological University, Singapore)

For decades we have been interested in detecting objects and classifying them into a fixed vocabulary. With the maturity of these “low-level” vision solutions, we hunger for a “higher-level” representation of visual data, so as to extract visual knowledge rather than merely bags of visual entities, allowing machines to reason about human-level decision-making. In particular, we wish for “X” reasoning, where X means eXplainable and eXplicit. In this talk, I first reviewed a brief history of symbolism and connectionism, which have alternately promoted the development of AI over the past decades. In particular, though deep neural networks (the prevailing incarnation of connectionism) have shown impressive super-human performance in various tasks, they still lag behind us in high-level reasoning. Therefore, I proposed a marriage between symbolism and connectionism that takes the complementary advantages of both: the proposed X visual reasoning. Second, I introduced the two building blocks of X visual reasoning: visual knowledge acquisition by scene graph detection, and X neural modules applied on that knowledge for reasoning. For scene graph detection, I introduced our recent progress on reinforcement learning of scene dynamics, which helps generate coherent scene graphs that respect visual context. For X neural modules, I discussed our most recent work on module design, algorithms, and applications in various visual reasoning tasks such as visual Q&A, natural language grounding, and image captioning. Finally, I envisioned some future directions for X visual reasoning, such as using meta-learning and deep reinforcement learning for more dynamic and efficient X neural module composition.

Professor Ramesh Jain mentioned that truly X reasoning should consider the potential for human-computer interaction to change or divert a current reasoning path. This is crucial because human intelligence can reasonably respond to interruptions and incoming evidence.

We can position X visual reasoning within the recent trend of neural-symbolic unification, which is gradually becoming our consensus route towards general AI. The “neural” is good at representation learning and model training, and the “symbolic” is good at knowledge reasoning and model explanation. One should bear in mind that future multimedia systems should take the complementary advantages of both.

BioMedia – The Important Role of Multimedia Research for Healthcare

by Michael Riegler (SimulaMet & University of Oslo, Norway)

With the recent rise of machine learning, analysis of medical data has become a hot topic. Nevertheless, the analysis is still often restricted to particular types of images coming from radiology or CT scans, even though vast amounts of multimedia data are continuously collected, both within healthcare systems and by users with devices such as cameras, sensors and mobile phones.

In this talk I focused on the potential of multimedia data and applications to improve healthcare systems. First, I gave an overview of the various data sources. A person’s health is reflected in many data sources, such as images, videos, text and sensors. Medical data can also be divided into data with hard and soft ground truth. Hard ground truth means that there are procedures that verify certain labels of the given data (for example, a biopsy report for a cancerous tissue sample). Soft ground truth is data labeled by medical experts without verification of the outcome. Different data types also come with different levels of privacy risk: for example, activity data from sensors have a low chance of identifying the patient, whereas speech, social media and GPS data come with a higher chance of identification. Finally, it is important to take context into account, and results should be explainable and reproducible. This was followed by a discussion of the importance of multimodal data fusion and context-aware analysis, supported by three example use cases: mental health, artificial reproduction and colonoscopy.

I also discussed the importance of involving medical experts and patients as users. Medical experts and patients are two different user groups, with different needs and requirements. One common requirement for both groups is the need for an explanation of how decisions were made. In addition, medical experts are mainly interested in support for their daily tasks, but are not very interested in, for example, huge amounts of sensor data from patients, because it increases their workload; they prefer interacting with patients over interacting with data. Patients, on the other hand, usually prefer to collect a lot of data and be informed about their current status, but are more concerned about their privacy. They also usually want medical experts to take as much data as possible into account when making their assessments.

Professor Susanne Boll mentioned that it is important to find out what is needed to make automatic analysis accepted by hospitals and who is taking the responsibility for decisions made by automatic systems. Understandability and reproducibility of methods were mentioned as an important first step.

The most relevant messages of the talk are that the multimedia community has the diverse skills needed to address several challenges related to medicine. Furthermore, it is important to focus on explainable and reproducible methods.

Mental Health Computing via Harvesting Social Media Data

By Jia Jia, Tsinghua University, China

Nowadays, with the rapid pace of life, mental health is receiving widespread attention. Common symptoms like stress, or clinical disorders like depression, are quite harmful, and thus it is of vital importance to detect mental health problems before they lead to severe consequences. Professional criteria like the International Classification of Diseases (ICD-10 [1]) and the Diagnostic and Statistical Manual of Mental Disorders (DSM [2]) define distinguishing behaviors in daily life that help diagnose disorders. However, traditional interventions based on face-to-face interviews or self-report questionnaires are expensive and suffer from delays. The potential antipathy towards consulting psychiatrists exacerbates these problems.

Social media platforms, like Twitter and Weibo, have become increasingly prevalent venues for users to express themselves and interact with friends. The user-generated content (UGC) shared on such platforms may help us better understand users’ real-life states and emotions in a timely manner, making analysis of users’ mental wellness feasible. Building on these discoveries, research efforts have been devoted to early detection of mental problems.

In this talk, I focused on the timely detection of mental wellness, concentrating on two typical mental problems: stress and depression. Starting with binary user-level detection, I expanded the research by considering the trigger and the severity of mental problems, involving different social media platforms that are popular in different cultures. I presented my recent progress from three perspectives:

  1. Through self-reported sentence pattern matching, I constructed a series of large-scale well-labeled datasets in the field of online mental health analysis;
  2. Based on previous psychological research, I extracted multiple groups of discriminating features for detection and presented several multi-modal models targeting different contexts. I conducted extensive experiments with these models, demonstrating significantly better performance compared to state-of-the-art methods; and
  3. I investigated in detail the contribution of each feature, of online behaviors, and even of cultural differences in different contexts. I managed to reveal behaviors not covered in traditional psychological criteria, and provided new perspectives and insights for current and future research.

The mental health care applications I developed were also demonstrated at the end.

Dr. B. Prabhakaran indicated that mental health understanding is a difficult problem, even for trained doctors, and that we will need to work with psychiatrists sooner rather than later. Thanks to his valuable comments regarding possible future directions, I envisage using augmented/mixed reality to create different immersive “controlled” scenarios in which human behavior can be studied. I am considering, for example, creating stressful situations (such as exams, or missing a flight) to better understand depression. For depression especially, I plan to incorporate EEG sensor data in my studies.



Towards Micro-Video Understanding

By Liqiang Nie, Shandong University, China

We are living in an era of ever-dwindling attention spans. To feed our hunger for quick content, bite-sized videos embracing the philosophy of “shorter-is-better” are becoming popular with the rise of micro-video sharing services. Typical services include Vine, Snapchat, Viddy, and Kwai. Micro-videos have spread like wildfire and are taking over the content and social media marketing space, by virtue of their brevity, authenticity, communicability, and low cost. Micro-videos can benefit many commercial applications, such as brand building. Despite their value, the analysis and modeling of micro-videos is non-trivial for the following reasons:

  1. micro-videos are short in length and of low quality;
  2. they can be described by multiple heterogeneous channels, spanning from social, visual, and acoustic to textual modalities;
  3. they are organized into a hierarchical ontology in terms of semantic venues; and
  4. there is no available benchmark dataset on micro-videos.

In my talk, I introduced some shallow and deep learning models for micro-video understanding that are worth studying and have proven effective:

  1. Popularity Prediction. Among the large volume of micro-videos, only a small portion will be widely viewed by users, while most gain little attention. Obviously, if we can identify the hot and popular micro-videos in advance, it will benefit many applications, like online marketing and network reservation;
  2. Venue Category Estimation. In a random sample of over 2 million Vine videos, I found that only 1.22% of the videos are associated with venue information. Location information can benefit many applications, such as footprint recording, personalization, and other location-based services; it is thus highly desirable to infer the missing geographic cues;
  3. Low quality sound. As the quality of the acoustic signal is usually relatively low, simply integrating acoustic features with visual and textual features often leads to suboptimal results, or even adversely degrades the overall quality.

In the future, I may try other meaningful tasks such as micro-video captioning or tagging, and detection of unsuitable content. Many micro-videos are annotated with erroneous words, i.e., topic tags or descriptions that are not well correlated with the content, and this negatively influences other applications, such as textual query search. It is also common for users to upload violent and erotic videos; at present, detection and alerting rely mainly on labor-intensive inspection. I plan to create systems that automatically detect erotic and violent content.

During the presentation, the audience asked about the datasets used in my work. In my previous work all the videos came from Vine, but this service has since closed, and the audience wondered how I would build datasets in the future. As there are many other micro-video sites, such as Kwai and Instagram, I can obtain sufficient data from them to support my further research.

Opinion Column: Survey on ACM Multimedia

For this edition of the Opinion Column, coinciding with ACM Multimedia 2018, we launched a short survey about the community’s perception of the conference. We prepared the survey together with senior members of the community, as well as the organizers of ACM Multimedia 2019. You can find the full survey here.


Overall, we collected 52 responses. The participant sample was slightly skewed towards more senior members of the community: around 70% described themselves as full, associate or assistant professors. Almost 20% were research scientists in industry. Half of the participants were long-term contributors to the conference, having attended more than 6 editions of ACM MM; however, only around a quarter had attended the last edition of MM in Seoul, Korea.

First, we asked participants to describe what ACM Multimedia means to them, using 3 words. We aggregated the responses in the word cloud below, where bigger words correspond to higher frequency. Most participants associated MM with prestigious, high-quality content and with high diversity of topics and modalities. While recognizing its prestige, some respondents expressed interest in a modernization of the MM focus.
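For readers curious about the mechanics, the word cloud aggregation amounts to a simple frequency count over the three-word answers. The sketch below uses invented responses; the real survey data is not reproduced here.

```python
from collections import Counter

# Hypothetical three-word answers, for illustration only.
responses = [
    "prestigious diverse multimodal",
    "high-quality diverse community",
    "prestigious multimodal networking",
]

# Lowercase and count every word; in a word cloud, higher counts are
# rendered in a larger font.
frequencies = Counter(
    word.lower() for answer in responses for word in answer.split()
)

print(frequencies.most_common(3))
```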


Next, we asked respondents “What brings you to ACM Multimedia?”, and provided a set of pre-defined options including “presenting my research”, “networking”, “community building”, “ACM MM is at the core of my scientific interests” and “other” (free text). One in five participants selected all options as relevant to their motivation for attending Multimedia. The large majority of participants (65%) declared that they attend ACM Multimedia to present research and to network. Inspecting the free-text answers under “other”, we found that some people were interested in specific tracks, and that others see MM as a good opportunity to showcase research to their graduate students.

The next question was about paper submission. We wanted to characterize what pushes researchers to submit to ACM Multimedia. We prepared 3 statements capturing different dimensions of analysis, and asked participants to rate each on a 5-point scale, from “Strongly disagree” (1) to “Strongly agree” (5).

The distribution of agreement for each question is shown in the plot below. On average, participants neither agreed nor disagreed with Multimedia being the only possible venue for their papers (average agreement score 2.9); they generally disagreed with the statement “I consider ACM Multimedia mostly to resubmit papers rejected from other venues” (average score 2.0), and strongly agreed with the idea of MM as a premier conference (average score 4.2).
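The average agreement scores quoted above are simple means over the 1-5 response codes. The sketch below shows the computation on made-up response lists, not the actual survey data.

```python
# Sketch: averaging 5-point Likert responses, coded 1 ("Strongly disagree")
# through 5 ("Strongly agree"). The response lists are hypothetical.

def average_agreement(scores):
    """Mean of the 1-5 agreement codes for one statement."""
    return sum(scores) / len(scores)

responses = {
    "MM is the only possible venue for my papers": [3, 2, 4, 3, 2, 3],
    "I mostly resubmit papers rejected from other venues": [1, 2, 2, 3, 2, 2],
    "MM is a premier conference": [5, 4, 4, 5, 4, 3],
}

for statement, scores in responses.items():
    print(f"{statement}: {average_agreement(scores):.1f}")
```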


One of the goals of this survey was to help the future Program Chairs of MM 2019 understand the extent to which participants agree with the reviewers’ guidelines to be introduced in the next edition of the conference. To this end, we invited respondents to express their agreement with a fundamental point of these guidelines: “Remember that the problem [..] is expected to involve more than a single modality, or [..] how people interpret and use multimedia. Papers that address a single modality only and also fail to contribute new knowledge on human use of multimedia must be rejected as out of scope for the conference”. Around 60% agreed or strongly agreed with this statement, while slightly more than 25% disagreed or strongly disagreed; the remaining 15% had no opinion.

We also asked participants to share any further comments regarding this last question or ACM MM in general. People generally approved of the introduction of these reviewing guidelines, and of the emphasis on multiple modalities and on human perception and applications of multimedia. Some suggested that, given the re-focusing implied by the new reviewing guidelines, the instructions should be made more specific, i.e., the chairs should clarify the definition of “involve”: how multimodal does a paper need to be?

Others encouraged the organizers to clarify further the broader scope of ACM Multimedia, defining its position with respect to other multimedia conferences (MMSys, MMM) and to computer vision conferences such as CVPR/ECCV (and to avoid overlapping conference dates).

Some comments proposed rating papers based on their impact on the community and on their level of innovation, even within a single modality, as forcing multiple modalities could “alienate” community members.

Beyond reviewing guidelines, a major theme emerging from the free-text comments was diversity in ACM Multimedia. Several participants called for more geographic diversity among participants and paper authors. Some also noted that more turnover in the organizing committees should be encouraged. Finally, many participants brought up the need for more balance in MM topics: while most accepted papers fall under the general umbrella of “Multimedia Content Understanding”, MM should in the future encourage more papers about systems, arts, and other emerging topics.

With this bottom-up survey analysis, we aimed to give voice to the major themes that the multimedia community cares about, and we hope to continue doing so in future editions of this column. We would like to thank all researchers and community members who contributed by shaping and filling in this survey, allowing us to get a broader picture of the community’s perception of ACM MM!

An interview with Géraldine Morin

Please describe your journey into research from your youth up to the present. What foundational lessons did you learn from this journey? Why were you initially attracted to multimedia?

My journey into research was not such a linear path (or ’straight path’ as some French institutions put it; a criterion for them to hire)… I started out convinced that I wanted to be a high school math teacher. Since I was accepted into a Math and CS engineering school after a competitive exam, I agreed to study there, working in parallel towards a pure math degree.
The first year, I did manage to follow both curricula (taking two math exams in September), but it was quite a challenge, and the second year I gave up on the math degree to keep following the engineering curriculum.
I finished with a master’s degree in applied Math (back then fully included in the engineering curriculum) and really enjoyed working on the Master’s thesis (I did my internship in Kaiserslautern, Germany), so I decided to apply for a Ph.D. grant.
I made it into the Ph.D. program in Grenoble and liked my Ph.D. topic in geometric modelling, but had a hard time with my advisor there.
So after two years I decided to give up (and passed a motorcycle driving licence), and went on to teach Math in high school for a year (also passing the teacher examination). Encouraged by my former German Master’s thesis advisor, I then applied to a Ph.D. program at Rice University in the US to work with Ron Goldman, a researcher whose work and papers I really liked. I got the position and really enjoyed doing research there.
After a wedding, a kid, and finishing the Ph.D. (in that order), I moved to Germany to live with my husband and found a one-year Postdoc position in Berlin. I then applied to Toulouse, where I have stayed since. In Toulouse, I was hired into a Computer Vision research group, where a subgroup of people were tackling problems in multimedia, and they offered me the chance to be the 3D person of their team 🙂

I learned that a career, or research path, is really shaped by the people you meet on your way, for good or bad. Perseverance in something you enjoy is certainly necessary, and not staying in a context that does not fit you is also important! I am glad I started again after giving up at first, but I do not regret my choice to give up either.

Research topics and areas are important, and a good match with your close collaborators is also very relevant to me. I really enjoy the multimedia community for that matter. The people are open-minded, curious, and very encouraging… At multimedia conferences I always feel that my research is valued and relevant to the field (in the other communities, CG or CV, I sometimes get a remark like, ‘oh well, I guess you are not really doing C{G|V}’…). Multimedia also has a good balance between theory and practice, and that’s fun!

Visit in Chicago during my Ph.D. in the US.



Tell us more about your vision and the objectives behind your current roles. What do you hope to accomplish and how will you bring this about?

I have just taken on the responsibility of a department, while we are changing the curriculum. This is a lot of organisational and administrative work, but it also forces me to have a larger vision of how the field of computer science is evolving and of what is important to teach. Interestingly, we prepare our students for jobs that do not exist yet! This new challenge also makes me realise how important it is to keep time for research, and the open-mindedness I get from my research activity.

Can you profile your current research, its challenges, opportunities, and implications?

As I mentioned before, my current challenge is to keep on being active in research. I follow two paths: the first is in geometric modeling, trying to bridge the gap between my current interest in skeleton-based models and two hot topics, 3D printing and machine learning.
The second is to continue working in multimedia, on distributing 3D content in a scalable way.
Concerning my involvement, I am also currently co-heading the French geometric modeling group, and I very much appreciate promoting our research community and contributing to keeping it active and recognised.

How would you describe the role of women especially in the field of multimedia?

I have just participated in my first Women in MM meeting at ACM, and very much appreciated it. I have to admit I was not really interested in women-targeted activities before I participated in my first women’s workshop (WiSH, Women in SHape) in 2013, which brought groups of women together to collaborate for one week… That was a great experience, and it made me realise that, despite the fact that I really enjoy working with my (almost all male) colleagues, it was also fun and very inspiring to work in women’s groups. Moreover, being questioned by younger colleagues about the ability of a woman to have both a family and a faculty job, I now think that my good experience as a faculty member and mother of 3 should be shared when needed.

How would you describe your top innovative achievements in terms of the problems you were trying to solve, your solutions, and the impact they have today and into the future?

My first contributions were in a quite theoretical field: during my Ph.D. I proposed to use analytic functions in a geometric modeling context. That raised some convergence issues that I managed to settle.
Later, I really enjoyed working with collaborators. With my colleague Romulus, who worked on streaming, I proposed a shared topic, and in 2006 we started to work on 3D streaming; that led us to collaborate with Wei Tsang Ooi from the National University of Singapore, and for more than 12 years now we have been advancing innovative solutions for the distribution of 3D content, with me working on adapted 3D models and them on system solutions, bringing in new colleagues along the way. We won the best paper award at ACM MM 2008 for my Ph.D. student’s paper (I am very proud of that, despite the fact that I could not attend the conference: I gave birth between submission and conference ;)

Over your distinguished career, what are your top lessons you want to share with the audience?

A very simple one: Enjoy what you do! and work will be fun.
For me, it is amazing that thinking over new ideas always remains so exciting 🙂

What is the best joke you know? 🙂

Hard one!

Jogging in the morning to N Seoul Tower for sunrise, ACM-MM 2018.



If you were conducting this interview, what questions would you ask, and then what would be your answers?

I have heard there are very detailed studies, especially in the US, about differences between male and female behaviour.
It seems that being aware of these helps. For example, women tend to judge themselves more harshly than men do…
(That’s not really a question and answer, more a remark :p)

Another try:
Q: What makes you feel confident, or helps you get over challenges?
A: I think I lack self-confidence, and I always ask for a lot of feedback from colleagues (for example, for dry runs).
If I get good feedback, it boosts my confidence; if I get worse feedback, it helps me improve… I win both ways 🙂



Assoc. Prof. Géraldine Morin: 

I am an Associate Professor (Maître de conférences) at ENSEEIHT, one of the schools of the Institut National Polytechnique de Toulouse, part of the Université de Toulouse, and I carry out my research at IRIT (UMR CNRS 5505). Before settling in Toulouse, I lived in Grenoble, where I graduated from ENSIMAG (engineering degree) and the Université Joseph Fourier (D.E.A. in applied mathematics), and also obtained a bachelor’s degree in pure mathematics, which I pursued in parallel with my first year of engineering school. I then did a Ph.D. in Geometric Modeling in the United States at Rice University (“Analytic Functions for Computer Aided Geometric Design”) under the supervision of Ron Goldman. Afterwards, I spent a year as a postdoc in computational geometry at the Freie Universität Berlin.

Predicting the Emotional Impact of Movies

Affective video content analysis aims at the automatic recognition of emotions elicited by videos. It has a large number of applications, including mood based personalized content recommendation [1], video indexing [2], and efficient movie visualization and browsing [3]. Beyond the analysis of existing video material, affective computing techniques can also be used to generate new content, e.g., movie summarization [4], personalized soundtrack recommendation to make user-generated videos more attractive [5]. Affective techniques can furthermore be used to enhance the user engagement with advertising content by optimizing the way ads are inserted inside videos [6].

While major progress has been achieved in computer vision for visual object detection, high-level concept recognition, and scene understanding, a natural further step is the modeling and recognition of affective concepts. This has recently received increasing interest from research communities, e.g., computer vision and machine learning, with an overall goal of endowing computers with human-like perception capabilities.

Efficient training and benchmarking of computational models, however, require a large and diverse collection of data annotated with ground truth, which is often difficult to collect, particularly in the field of affective computing. To address this issue we created the LIRIS-ACCEDE dataset. In contrast to most existing datasets, which contain few video resources and have limited accessibility due to copyright constraints, LIRIS-ACCEDE consists of videos with large content diversity annotated along emotional dimensions. The annotations are made according to the expected emotion of a video, i.e., the emotion that the majority of the audience feels in response to the content. All videos are shared under Creative Commons licenses and can thus be freely distributed without copyright issues. The dataset (videos, annotations, features and protocols) is publicly available, and it is currently composed of a total of six collections.


Credits and license information: (a) Cloudland, LateNite Films, shared under CC BY 3.0 Unported license at, (b) Origami, ESMA MOVIES, shared under CC BY 3.0 Unported license at, (c) Payload, Stu Willis, shared under CC BY 3.0 Unported license at, (d) The room of Franz Kafka, Fred. L’Epee, shared under CC BY-NC-SA 3.0 Unported license at, (e) Spaceman, Jono Schaferkotter & Before North, shared under CC BY-NC 3.0 Unported license at

Dataset & Collections

The LIRIS-ACCEDE dataset is composed of movies, and excerpts from movies, under Creative Commons licenses that enable the dataset to be publicly shared. The set contains 160 professionally made and amateur movies, covering different genres such as horror, comedy, drama and action. The languages are mainly English, with a small set of Italian, Spanish, French and other movies subtitled in English. The set has been used to create the six collections that make up the dataset. The two collections originally proposed are the Discrete LIRIS-ACCEDE collection, which contains short excerpts of movies, and the Continuous LIRIS-ACCEDE collection, which comprises long movies. Moreover, since 2015 the set has been used for tasks related to affect/emotion at the MediaEval Benchmarking Initiative for Multimedia Evaluation [7], where each year it was enriched with new data, features and annotations. Thus, the dataset also includes four additional collections dedicated to these tasks.

The movies are available together with emotional annotations. In emotional video content analysis, the goal is to automatically recognize emotions elicited by videos. In this context, three types of emotions can be considered: intended, induced and expected emotions [8]. The intended emotion is the emotion that the film maker wants to induce in the viewers. The induced emotion is the emotion that a viewer feels in response to the movie. The expected emotion is the emotion that the majority of the audience feels in response to the same content. While the induced emotion is subjective and context dependent, the expected emotion can be considered objective, as it reflects the more-or-less unanimous response of a general audience to a given stimulus [8]. Thus, the LIRIS-ACCEDE dataset focuses on the expected emotion. The representation of emotions we consider is the dimensional one, based on valence and arousal. Valence is defined on a continuous scale from the most negative to the most positive emotions, while arousal is defined continuously from the calmest to the most active emotions [9]. Moreover, violence annotations were provided in the MediaEval 2015 Affective Impact of Movies collection, while fear annotations were provided in the MediaEval 2017 and 2018 Emotional Impact of Movies collections.

Discrete LIRIS-ACCEDE collection A total of 160 films from various genres split into 9,800 short clips with valence and arousal annotations. More details below.
Continuous LIRIS-ACCEDE collection A total of 30 films with valence and arousal annotations per second. More details below.
MediaEval 2015 Affective Impact of Movies collection A subset of the films with labels for the presence of violence, as well as for the felt valence and arousal. More details below.
MediaEval 2016 Emotional Impact of Movies collection A subset of the films with score annotations for the expected valence and arousal. More details below.
MediaEval 2017 Emotional Impact of Movies collection A subset of the films with valence and arousal values and a label for the presence of fear for each 10 second segment, as well as precomputed features. More details below.
MediaEval 2018 Emotional Impact of Movies collection A subset of the films with valence and arousal values for each second, begin-end times of scenes containing fear, as well as precomputed features. More details below.

Ground Truth

The ground truth for the Discrete LIRIS-ACCEDE collection consists of the ranking of all video clips along both the valence and arousal dimensions. These rankings were obtained with a pairwise video clip comparison protocol designed to be used through crowdsourcing (with the CrowdFlower service). For each pair of video clips presented, raters had to select the one that conveyed the given emotion (in terms of valence or arousal) most strongly. The high inter-annotator agreement that was achieved indicates that the annotations were highly consistent, despite the large diversity of our raters’ cultural backgrounds. Affective ratings (scores) were also collected for a subset of the 9,800 clips in order to cross-validate the crowdsourced annotations. These affective ratings also made it possible to train Gaussian Process regression models that account for measurement noise and map the whole ranked LIRIS-ACCEDE dataset into the 2D valence-arousal affective space. More details can be found in [10].
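As a toy illustration of how pairwise judgements can be turned into a ranking (the actual protocol and the Gaussian Process modelling are described in [10]; the simple win-fraction scheme and clip ids below are ours, for illustration only):

```python
from collections import defaultdict

def rank_from_pairwise(comparisons):
    """Order clips from pairwise judgements: each comparison is a
    (winner, loser) pair meaning the winner conveyed the annotated
    emotion more strongly. Clips are sorted by their win fraction."""
    wins = defaultdict(int)
    total = defaultdict(int)
    for winner, loser in comparisons:
        wins[winner] += 1
        total[winner] += 1
        total[loser] += 1
    return sorted(total, key=lambda clip: wins[clip] / total[clip],
                  reverse=True)

# Toy judgements over three hypothetical clip ids.
comparisons = [("clip_A", "clip_B"), ("clip_A", "clip_C"),
               ("clip_B", "clip_C")]
print(rank_from_pairwise(comparisons))  # ['clip_A', 'clip_B', 'clip_C']
```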

To collect the ground truth for the Continuous and MediaEval 2016, 2017 and 2018 collections, which consists of valence and arousal scores for every second of each movie, French annotators continuously indicated their level of valence and arousal while watching the movies, using a modified version of the GTrace annotation tool [16] and a joystick. Each annotator continuously annotated one subset of the movies for induced valence, and another subset for induced arousal, so that each movie was continuously annotated by three to five different annotators. The continuous valence and arousal annotations were then down-sampled by averaging them over windows of 10 seconds with a shift of 1 second (i.e., yielding one value per second) in order to remove noise due to unintended movements of the joystick. Finally, the post-processed continuous annotations were averaged across annotators to create a continuous mean signal of the valence and arousal self-assessments, ranging from -1 (most negative for valence, most passive for arousal) to +1 (most positive for valence, most active for arousal). The details of this process are given in [11].
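A minimal sketch of this post-processing, assuming each annotator’s trace is a plain list with one value per second (the function names and details below are ours, not the authors’ implementation, which is described in [11]):

```python
def downsample(signal, window=10, shift=1):
    """Average a per-second annotation signal over `window`-second
    windows moved by `shift` seconds (one output value per shift)."""
    return [sum(signal[s:s + window]) / window
            for s in range(0, len(signal) - window + 1, shift)]

def mean_signal(annotators):
    """Average the smoothed signals of several annotators into a single
    continuous mean self-assessment ranging from -1 to +1."""
    smoothed = [downsample(a) for a in annotators]
    return [sum(vals) / len(vals) for vals in zip(*smoothed)]

# Two hypothetical 12-second annotation traces.
print(mean_signal([[0.0] * 12, [1.0] * 12]))  # [0.5, 0.5, 0.5]
```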

The ground truth for violence annotation, used in the MediaEval 2015 Affective Impact of Movies collection, was collected as follows. First, all the videos were annotated separately by two groups of annotators from two different countries. In each group, regular annotators labeled all the videos, and their labels were then reviewed by master annotators. Regular annotators were graduate students (typically single with no children) and master annotators were senior researchers. Within each group, each video received 2 different annotations, which were merged by the master annotators into the final annotation for the group. Finally, the annotations from the two groups were merged and reviewed once more by the task organizers. The details can be found in [12].

The ground truth for fear annotations, used in the MediaEval 2017 and 2018 Emotional Impact of Movies collections, was generated using a tool specifically designed for the classification of audio-visual media, which allows annotation to be performed while watching the movie. The annotations were made by two experienced team members of NICAM [17], both trained in the classification of media. Each movie was annotated by one annotator, who reported the start and stop times of each sequence in the movie expected to induce fear.


Through its six collections, the LIRIS-ACCEDE dataset constitutes a dataset of choice for affective video content analysis. It is one of the largest datasets for this purpose, and it is regularly enriched with new data, features and annotations. In particular, it is used for the Emotional Impact of Movies tasks at the MediaEval Benchmarking Initiative for Multimedia Evaluation. As all the movies are under Creative Commons licenses, the whole dataset can be freely shared and used by the research community, and is available at

Discrete LIRIS-ACCEDE collection [10]
In total, 160 films and short films of different genres were used and segmented into 9,800 video clips. The total running time of all 160 films is 73 hours, 41 minutes and 7 seconds, and a video clip was extracted on average every 27 seconds. The 9,800 segmented video clips last between 8 and 12 seconds and are representative enough to conduct experiments: the extracted segments are long enough to produce consistent excerpts that allow the viewer to feel emotions, while being short enough that the viewer feels only one emotion per excerpt.
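The quoted average extraction interval can be checked directly from the figures above:

```python
# Sanity check of the figures quoted above: total footage divided by
# the number of extracted clips gives the average extraction interval.
total_seconds = 73 * 3600 + 41 * 60 + 7   # 73 h 41 min 7 s
clips = 9800
print(round(total_seconds / clips))  # ~27 seconds between extractions
```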

The content of the movie was also considered to create homogeneous, consistent and meaningful excerpts that were not meant to disturb the viewers. A robust shot and fade in/out detection was implemented to make sure that each extracted video clip started and ended with a shot or a fade. Furthermore, the order of excerpts within a film was kept, allowing the study of temporal transitions of emotions.

Several movie genres are represented in this collection, such as horror, comedy, drama and action. The languages are mainly English, with a small set of Italian, Spanish, French and other movies subtitled in English. For this collection, the 9,800 video clips are ranked according to valence, from the clip inducing the most negative emotion to the one inducing the most positive, and according to arousal, from the clip inducing the calmest emotion to the one inducing the most active emotion. Besides the ranks, emotional scores (valence and arousal) are also provided for each clip.

Continuous LIRIS-ACCEDE collection [11]
The movie clips in the Discrete collection were annotated globally: a single value of arousal and valence represents a whole 8-to-12-second video clip. In order to allow deeper investigation of the temporal dependencies of emotions (since a felt emotion may influence the emotions felt later), longer movies were considered in this collection. To this end, 30 movies were selected from the set of 160 such that their genre, content, language and duration were diverse enough to be representative of the original Discrete LIRIS-ACCEDE collection. The selected videos are between 117 and 4,566 seconds long (mean = 884.2 s ± 766.7 s SD), and the total length of the 30 selected movies is 7 hours, 22 minutes and 5 seconds. The emotional annotations consist of a score of expected valence and arousal for each second of each movie.

MediaEval 2015 Affective Impact of Movies collection [12]
This collection has been used as the development and test sets for the MediaEval 2015 Affective Impact of Movies Task. The overall use case scenario of the task was to design a video search system that used automatic tools to help users find videos that fitted their particular mood, age or preferences. To address this, two subtasks were proposed:

  • Induced affect detection: the emotional impact of a video or movie can be a strong indicator for search or recommendation;
  • Violence detection: detecting violent content is an important aspect of filtering video content based on age.

The 9,800 video clips from the Discrete LIRIS-ACCEDE collection were used as the development set, and an additional 1,100 movie clips were proposed as the test set. For each of the 10,900 video clips, the annotations consist of: a binary value indicating the presence of violence, the class of the excerpt for felt arousal (calm, neutral, active), and its class for felt valence (negative, neutral, positive).

MediaEval 2016 Emotional Impact of Movies collection [13]
The MediaEval 2016 Emotional Impact of Movies task required participants to deploy multimedia features to automatically predict the emotional impact of movies, in terms of valence and arousal. Two subtasks were proposed:

  • Global emotion prediction: given a short video clip (around 10 seconds), participants’ systems were expected to predict a score of induced valence (negative-positive) and induced arousal (calm-excited) for the whole clip;
  • Continuous emotion prediction: as an emotion felt during a scene may be influenced by the emotions felt during the previous scene(s), the purpose here was to consider longer videos, and to predict the valence and arousal continuously along the video. Thus, a score of induced valence and arousal were to be provided for each 1s-segment of each video.

The development set was composed of the Discrete LIRIS-ACCEDE collection for the first subtask, and of the Continuous LIRIS-ACCEDE collection for the second subtask. In addition, a test set was provided to assess the performance of participants’ methods: 49 new movies under Creative Commons licenses were added. Following the same protocol as for the development set, 1,200 additional short video clips (between 8 and 12 seconds) were extracted for the first subtask, while 10 long movies (from 25 minutes to 1 hour and 35 minutes, for a total duration of 11.48 hours) were selected for the second subtask. The annotations consist of a score of expected valence and arousal for each movie clip used in the first subtask, and a score of expected valence and arousal for each second of the movies in the second subtask.

MediaEval 2017 Emotional Impact of Movies collection [14]
This collection was used for the MediaEval 2017 Emotional Impact of Movies task. Here, only long movies were considered, and emotion was characterized in terms of valence, arousal and fear. Two subtasks were proposed, for which the emotional impact had to be predicted for consecutive 10-second segments sliding over the whole movie with a shift of 5 seconds:

  • Valence/Arousal prediction: participants’ systems were supposed to predict a score of expected valence and arousal for each consecutive 10-second segment;
  • Fear prediction: the purpose here was to predict whether each consecutive 10-second segment was likely to induce fear or not. The targeted use case was the prediction of frightening scenes to help systems protect children from potentially harmful video content. This subtask is complementary to the valence/arousal prediction task in the sense that discrete emotions often overlap when mapped into the 2D valence/arousal space (for instance, fear, disgust and anger all combine very negative valence with high arousal).
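The segmentation scheme used by both subtasks (10-second windows sliding with a 5-second shift) can be sketched as follows; the function name is ours, for illustration only:

```python
def sliding_segments(duration, window=10, shift=5):
    """Start/end times (in seconds) of the consecutive windows sliding
    over a movie of `duration` seconds, as used for the predictions."""
    return [(start, start + window)
            for start in range(0, duration - window + 1, shift)]

print(sliding_segments(30))  # [(0, 10), (5, 15), (10, 20), (15, 25), (20, 30)]
```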

The Continuous LIRIS-ACCEDE collection was used as the development set for both subtasks. The test set consisted of a selection of 14 new movies under Creative Commons licenses, distinct from the original set of 160 movies. They are between 210 and 6,260 seconds long, and the total length of the 14 selected movies is 7 hours, 57 minutes and 13 seconds. In addition to the video data, general-purpose audio and visual content features were also provided, including deep features, Fuzzy Color and Texture Histogram, and Gabor features. The annotations consist of a valence value, an arousal value and a binary value for each 10-second segment, the latter indicating whether the segment was expected to induce fear.

MediaEval 2018 Emotional Impact of Movies collection [15]
The MediaEval 2018 Emotional Impact of Movies task is similar to the 2017 task. However, more data was provided, and a prediction of the emotional impact had to be made for every second of the movies rather than for 10-second segments as before. The two subtasks were:

  • Valence and Arousal prediction: participants’ systems had to predict a score of expected valence and arousal continuously (every second) for each movie;
  • Fear detection: the purpose here was to predict beginning and ending times of sequences inducing fear in movies. The targeted use case was the detection of frightening scenes to help systems protecting children from potentially harmful video content.

The development set for both subtasks consisted of the movies from the Continuous LIRIS-ACCEDE collection, together with the test set of the MediaEval 2017 Emotional Impact of Movies collection, i.e. 44 movies for a total duration of 15 hours and 20 minutes. The test set consisted of 12 other movies selected from the set of 160 movies, for a total duration of 8 hours and 56 minutes. As for the 2017 collection, general-purpose audio and visual content features were provided in addition to the video data. The annotations consist of valence and arousal values for each second of the movies (for the first subtask) as well as the beginning and ending times of each fear-inducing sequence (for the second subtask).


This work was supported in part by the French research agency ANR through the VideoSense Project under the Grant 2009 CORD 026 02 and through the Visen project within the ERA-NET CHIST-ERA framework under the grant ANR-12-CHRI-0002-04.


Should you have any inquiries or questions about the dataset, do not hesitate to contact us by email at: emmanuel dot dellandrea at ec-lyon dot fr.


[1] L. Canini, S. Benini, and R. Leonardi, “Affective recommendation of movies based on selected connotative features”, in IEEE Transactions on Circuits and Systems for Video Technology, 23(4), 636–647, 2013.
[2] S. Zhang, Q. Huang, S. Jiang, W. Gao, and Q. Tian, “Affective visualization and retrieval for music video”, in IEEE Transactions on Multimedia 12(6), 510–522, 2010.
[3] S. Zhao, H. Yao, X. Sun, X. Jiang, and P. Xu, “Flexible presentation of videos based on affective content analysis”, in Advances in Multimedia Modeling, 2013.
[4] H. Katti, K. Yadati, M. Kankanhalli, and C. Tat-Seng, “Affective video summarization and story board generation using pupillary dilation and eye gaze”, in IEEE International Symposium on Multimedia (ISM), 2011.
[5] R.R. Shah,Y. Yu, and R. Zimmermann, “Advisor: Personalized video soundtrack recommendation by late fusion with heuristic rankings”, in ACM International Conference on Multimedia, 2014.
[6] K. Yadati, H. Katti, and M. Kankanhalli, “Cavva: Computational affective video-in-video advertising”, in IEEE Transactions on Multimedia 16(1), 15–23, 2014.
[8] A. Hanjalic, “Extracting moods from pictures and sounds: Towards truly personalized TV”, in IEEE Signal Processing Magazine, 2006.
[9] J.A. Russell, “Core affect and the psychological construction of emotion”, in Psychological Review, 2003.
[10] Y. Baveye, E. Dellandrea, C. Chamaret, and L. Chen, “LIRIS-ACCEDE: A Video Database for Affective Content Analysis,” in IEEE Transactions on Affective Computing, 2015.
[11] Y. Baveye, E. Dellandrea, C. Chamaret, and L. Chen, “Deep Learning vs. Kernel Methods: Performance for Emotion Prediction in Videos,” in 2015 Humaine Association Conference on Affective Computing and Intelligent Interaction (ACII), 2015.
[12] M. Sjöberg, Y. Baveye, H. Wang, V. L. Quang, B. Ionescu, E. Dellandréa, M. Schedl, C.-H. Demarty, and L. Chen, “The mediaeval 2015 affective impact of movies task,” in MediaEval 2015 Workshop, 2015.
[13] E. Dellandrea, L. Chen, Y. Baveye, M. Sjoberg and C. Chamaret, “The MediaEval 2016 Emotional Impact of Movies Task”, in Working Notes Proceedings of the MediaEval 2016 Workshop, Hilversum, The Netherlands, October 20-21, 2016.
[14] E. Dellandrea, M. Huigsloot, L. Chen, Y. Baveye and M. Sjoberg, “The MediaEval 2017 Emotional Impact of Movies Task”, in Working Notes Proceedings of the MediaEval 2017 Workshop, Dublin, Ireland, September 13-15, 2017.
[15] E. Dellandréa, M. Huigsloot, L. Chen, Y. Baveye, Z. Xiao and M. Sjöberg, “The MediaEval 2018 Emotional Impact of Movies Task”, Working Notes Proceedings of the MediaEval 2018 Workshop, Sophia Antipolis, France, October 29-31, 2018.
[16] R. Cowie, M. Sawey, C. Doherty, J. Jaimovich, C. Fyans, and P. Stapleton, “Gtrace: General trace program compatible with emotionML”, in Humaine Association Conference on Affective Computing and Intelligent Interaction (ACII), 2013.

MPEG Column: 124th MPEG Meeting in Macau, China

The original blog post can be found at the Bitmovin Techblog and has been modified/updated here to focus on and highlight research aspects.

The MPEG press release comprises the following aspects:

  • Point Cloud Compression – MPEG promotes a video-based point cloud compression technology to the Committee Draft stage
  • Compressed Representation of Neural Networks – MPEG issues Call for Proposals
  • Low Complexity Video Coding Enhancements – MPEG issues Call for Proposals
  • New Video Coding Standard expected to have licensing terms timely available – MPEG issues Call for Proposals
  • Multi-Image Application Format (MIAF) promoted to Final Draft International Standard
  • 3DoF+ Draft Call for Proposal goes Public

Point Cloud Compression – MPEG promotes a video-based point cloud compression technology to the Committee Draft stage

At its 124th meeting, MPEG promoted its Video-based Point Cloud Compression (V-PCC) standard to Committee Draft (CD) stage. V-PCC addresses lossless and lossy coding of 3D point clouds with associated attributes such as colour. By leveraging existing video codecs and the video ecosystem in general (hardware acceleration, transmission services and infrastructure), as well as future video codecs, the V-PCC technology enables new applications. The current V-PCC encoder implementation provides a compression ratio of 125:1, which means that a dynamic point cloud of 1 million points could be encoded at 8 Mbit/s with good perceptual quality.
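As a rough sanity check, the quoted ratio and bitrate are consistent under plausible assumptions about the raw data; the bits-per-point and frame-rate values below are illustrative assumptions, not figures from the standard:

```python
# Back-of-the-envelope check of the V-PCC figures quoted above.
points_per_frame = 1_000_000
bits_per_point = 32   # assumed raw size per point (geometry + colour attributes)
fps = 30              # assumed frame rate for a dynamic point cloud

raw_bitrate = points_per_frame * bits_per_point * fps  # bits per second, uncompressed
compressed_bitrate = raw_bitrate / 125                 # apply the 125:1 ratio

print(f"raw: {raw_bitrate / 1e6:.0f} Mbit/s, "
      f"compressed: {compressed_bitrate / 1e6:.2f} Mbit/s")
```

Under these assumptions the raw stream is 960 Mbit/s, and 125:1 compression lands at roughly 7.7 Mbit/s, in line with the 8 Mbit/s figure above.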

A next step is the storage of V-PCC in ISOBMFF for which a working draft has been produced. It is expected that further details will be discussed in upcoming reports.

Research aspects: Video-based Point Cloud Compression (V-PCC) is at CD stage, and a first working draft for the storage of V-PCC in ISOBMFF has been provided. A natural next step is thus the delivery of V-PCC encapsulated in ISOBMFF over networks utilizing various approaches, protocols, and tools. Additionally, one may also think of using different encapsulation formats if needed.

MPEG issues Call for Proposals on Compressed Representation of Neural Networks

Artificial neural networks have been adopted for a broad range of tasks in multimedia analysis and processing, media coding, data analytics, and many other fields. Their recent success is based on the feasibility of processing much larger and more complex neural networks (deep neural networks, DNNs) than in the past, and on the availability of large-scale training data sets. Some applications require the deployment of a particular trained network instance to a potentially large number of devices and, thus, could benefit from a standard for the compressed representation of neural networks. Therefore, MPEG has issued a Call for Proposals (CfP) for compression technology for neural networks, targeting the compression of parameters and weights and focusing on four use cases: (i) visual object classification, (ii) audio classification, (iii) visual feature extraction (as used in MPEG CDVA), and (iv) video coding.
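To make the idea concrete, one family of techniques plausibly in scope for such a CfP is weight quantization; the sketch below (with invented parameter values, not anything from the MPEG call) maps float32 weights to 8-bit integers for a 4x size reduction at the cost of a bounded reconstruction error:

```python
import numpy as np

# Illustrative only: symmetric linear quantization of a weight tensor.
rng = np.random.default_rng(0)
weights = rng.normal(0.0, 0.1, size=10_000).astype(np.float32)

scale = float(np.abs(weights).max()) / 127.0          # one scale for the tensor
quantized = np.round(weights / scale).astype(np.int8)  # lossy, 1 byte per weight
restored = quantized.astype(np.float32) * scale        # dequantized approximation

print(f"size ratio: {weights.nbytes // quantized.nbytes}:1")
print(f"max abs error: {np.abs(weights - restored).max():.5f}")
```

The lossy/lossless split mentioned in the CfP maps directly onto such schemes: the quantized tensor is the lossy part, and any further entropy coding of the int8 values would be lossless on top of it.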

Research aspects: As pointed out last time, research here will mainly focus on compression efficiency for both lossy and lossless scenarios. Additionally, communication aspects, such as the transmission of compressed artificial neural networks within lossy, large-scale environments including update mechanisms, may become relevant in the (near) future.


MPEG issues Call for Proposals on Low Complexity Video Coding Enhancements

Upon request from the industry, MPEG has identified an area of interest in which video technology deployed in the market (e.g., AVC, HEVC) can be enhanced in terms of video quality without the need to necessarily replace existing hardware. Therefore, MPEG has issued a Call for Proposals (CfP) on Low Complexity Video Coding Enhancements.

The objective is to develop video coding technology with a data stream structure defined by two component streams: a base stream decodable by a hardware decoder, and an enhancement stream suitable for software implementation. The project is meant to be codec agnostic; in other words, the base encoder and base decoder can be AVC, HEVC, or any other codec in the market.
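The two-stream structure can be sketched as follows; this is a toy model of the principle only (the real base layer would be an actual AVC/HEVC bitstream, and the residual would itself be coded, not sent raw):

```python
import numpy as np

def encode(frame, step=8):
    """Toy two-stream encoder: a coarse base layer plus a residual enhancement.

    The quantized base stands in for a stream that a legacy hardware decoder
    (e.g. AVC/HEVC) could handle; the residual is the enhancement stream.
    """
    base = (frame // step) * step   # coarse, "hardware-decodable" layer
    enhancement = frame - base      # small residual carried in the second stream
    return base, enhancement

def decode(base, enhancement=None):
    """A legacy decoder uses only the base; an enhanced one adds the residual."""
    return base if enhancement is None else base + enhancement

frame = np.arange(16, dtype=np.int32).reshape(4, 4)
base, enh = encode(frame)
assert np.array_equal(decode(base, enh), frame)  # enhanced path reconstructs exactly
assert np.all(enh < 8)                           # residuals are bounded by the step
```

The design point is visible even in this toy: `decode(base)` alone yields a usable (if coarser) picture on legacy hardware, while the software enhancement path recovers the full quality.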

Research aspects: The interesting aspect here is that this use case assumes a legacy base decoder – most likely realized in hardware – which is enhanced with a software-based implementation to improve coding efficiency and/or quality without sacrificing the capabilities of the end user's device in terms of complexity and, thus, energy efficiency.


MPEG issues Call for Proposals for a New Video Coding Standard expected to have licensing terms timely available

At its 124th meeting, MPEG issued a Call for Proposals (CfP) for a new video coding standard to address combinations of both technical and application (i.e., business) requirements that may not be adequately met by existing standards. The aim is to provide a standardized video compression solution which combines coding efficiency similar to that of HEVC with a level of complexity suitable for real-time encoding/decoding and the timely availability of licensing terms.

Research aspects: This new work item is more related to business aspects (i.e., licensing terms) than technical aspects of video coding.


Multi-Image Application Format (MIAF) promoted to Final Draft International Standard

The Multi-Image Application Format (MIAF) defines interoperability points for the creation, reading, parsing, and decoding of images embedded in the High Efficiency Image File (HEIF) format by (i) only defining additional constraints on the HEIF format, (ii) limiting the supported encoding types to a set of specific profiles and levels, (iii) requiring specific metadata formats, and (iv) defining a set of brands for signaling such constraints, including specific depth map and alpha plane formats. For instance, it addresses use cases in which a capturing device uses one of the HEIF codecs with a specific HEVC profile and level in the HEIF files it creates, while a playback device is only capable of decoding AVC bitstreams.

Research aspects: MIAF is an application format which is defined as a combination of tools (incl. profiles and levels) of other standards (e.g., audio codecs, video codecs, systems) to address the needs of a specific application. Thus, the research is related to use cases enabled by this application format. 


3DoF+ Draft Call for Proposal goes Public

Following investigations on the coding of “three Degrees of Freedom plus” (3DoF+) content in the context of MPEG-I, the MPEG video subgroup has provided evidence demonstrating the capability to encode 3DoF+ content efficiently while maintaining compatibility with legacy HEVC hardware. As a result, MPEG decided to issue a draft Call for Proposals (CfP) to the public containing the information necessary to prepare for the final Call for Proposals expected to be issued at the 125th MPEG meeting (January 2019), with responses due at the 126th MPEG meeting (March 2019).

Research aspects: This work item is about video (coding) and, thus, research is about compression efficiency.


What else happened at #MPEG124?

  • MPEG-DASH 3rd edition is still in the final editing phase and not yet available. Last time, I wrote that we expect final publication later this year or early next year, and we hope this is still the case. At this meeting, Amendment 5 progressed to DAM, and conformance/reference software for SRD, SAND and Server Push was also promoted to DAM. In other words, DASH is pretty much in maintenance mode.
  • MPEG-I (systems part) is working on immersive media access and delivery and I guess more updates will come on this after the next meeting. OMAF is working on a 2nd edition for which a working draft exists and phase 2 use cases (public document) and draft requirements are discussed.
  • Versatile Video Coding (VVC): working draft 3 (WD3) and test model 3 (VTM3) have been issued at this meeting, including a large number of new tools. Both documents (and software) will be publicly available after editing periods (Nov. 23 for WD3 and Dec. 14 for VTM3).


JPEG Column: 81st JPEG Meeting in Vancouver, Canada

The 81st JPEG meeting was held in Vancouver, British Columbia, Canada. Significant efforts were put into the analysis of the responses to the call for proposals on the next generation image coding standard, nicknamed JPEG XL, which is expected to provide an image format with improved quality and flexibility combined with better compression efficiency. The responses to the call confirm the interest of different parties in this activity. Moreover, the initial subjective and objective evaluations of the different proposals confirm a significant improvement in both quality and compression efficiency to be provided by the future standard.

Apart from the multiple activities related to the development of several standards, a workshop on Blockchain technologies was held at the Telus facilities in Vancouver, with several talks on Blockchain and Distributed Ledger Technologies, and a panel where the influence of these technologies on multimedia was analysed and discussed. A new workshop is planned at the 82nd JPEG meeting to be held in Lisbon, Portugal, in January 2019.

The 81st JPEG meeting had the following highlights:

  • JPEG Completes Initial Assessment on Responses for the Next Generation Image Coding Standard (JPEG XL);
  • Workshop on Blockchain technology;
  • JPEG XS Core Coding System submitted to ISO for immediate publication as International Standard;
  • HTJ2K achieves Draft International Status;
  • JPEG Pleno defines a generic file format syntax architecture.

The following summarizes various highlights during JPEG’s Vancouver meeting.

JPEG XL completes the initial assessment of responses to the call for proposals

 The JPEG Committee launched the Next Generation Image Coding activity, also referred to as JPEG XL, with the aim of developing a standard for image coding that offers substantially better compression efficiency than existing image formats, along with features desirable for web distribution and efficient compression of high quality images. A Call for Proposals on Next Generation Image Coding was issued at the 79th JPEG meeting.

Seven submissions were received in response to the Call for Proposals. The submissions, along with the anchors, were evaluated in subjective tests by three independent research labs. At the 81st JPEG meeting in Vancouver, Canada, the proposals were evaluated using subjective and objective evaluation metrics, and a verification model (XLM) was agreed upon. Following this selection process, a series of experiments has been designed to compare the performance of the current XLM with alternative choices of coding components, including technologies submitted by some of the top performing proposals. These experiments are commonly referred to as core experiments and will serve to further refine and improve the XLM towards the final standard.

Workshop on Blockchain technology

On October 16th, 2018, JPEG organized its first workshop on Media Blockchain in Vancouver. Touradj Ebrahimi, JPEG Convenor, and Frederik Temmermans, a leading JPEG expert, presented the background of the JPEG standardization committee and ongoing JPEG activities such as JPEG Privacy and Security. Thereafter, Eric Paquet, Victoria Lemieux and Stephen Swift shared their experiences related to blockchain technology, focusing on standardization challenges and formalization, real-world adoption in media use cases, and the state of the art related to consensus models. The workshop closed with an interactive discussion between the speakers and the audience, moderated by JPEG Requirements Chair Fernando Pereira.

The presentations from the workshop are available for download on the JPEG website. In January 2019, during the 82nd JPEG meeting in Lisbon, Portugal, a second workshop will be organized to continue the discussion and interact with European stakeholders. More information about the program and registration will be made available on the JPEG website.

In addition to the workshop, JPEG issued an updated version of its white paper “JPEG White paper: Towards a Standardized Framework for Media Blockchain and Distributed Ledger Technologies” that elaborates on the blockchain initiative, exploring relevant standardization activities, industrial needs and use cases. The white paper will be further extended in the future with more elaborated use cases and conclusions drawn from the workshops. To keep informed and get involved in the discussion, interested parties are invited to register to the ad hoc group’s mailing list via


Touradj Ebrahimi, convenor of JPEG, giving the introductory talk in the Workshop on Blockchain technology.


JPEG XS Core Coding System submitted to ISO for immediate publication as International Standard

The JPEG committee is pleased to announce a significant milestone of the JPEG XS project, with the Core Coding System (aka JPEG XS Part-1) submitted to ISO for immediate publication as International Standard. This project aims at the standardization of a near-lossless, low-latency and lightweight compression scheme that can be used as a mezzanine codec within any AV market. Among the targeted use cases are video transport over professional video links (SDI, IP, Ethernet), real-time video storage, memory buffers, omnidirectional video capture and rendering, and sensor compression (for example in cameras and in the automotive industry). The Core Coding System allows for visually transparent quality at moderate compression rates, scalable end-to-end latency ranging from less than a line to a few lines of the image, and low-complexity real-time implementations in ASIC, FPGA, CPU and GPU. Besides the Core Coding System, the Profiles and levels (addressing specific application fields and use cases), together with the transport and container formats (defining different means to store and transport JPEG XS codestreams in files, over IP networks or SDI infrastructures), are also being finalized and are expected to be submitted for publication as International Standards in Q1 2019.


HTJ2K achieves Draft International Status

The JPEG Committee has reached a major milestone in the development of an alternative block coding algorithm for the JPEG 2000 family of standards, with ISO/IEC 15444-15 High Throughput JPEG 2000 (HTJ2K) achieving Draft International Standard (DIS) status.

The HTJ2K algorithm has demonstrated an average tenfold increase in encoding and decoding throughput compared to the algorithm currently defined by JPEG 2000 Part 1. This increase in throughput results in an average coding efficiency loss of 10% or less in comparison to the most efficient modes of the block coding algorithm in JPEG 2000 Part 1, and enables mathematically lossless transcoding to and from JPEG 2000 Part 1 codestreams.

The JPEG Committee has begun the development of HTJ2K conformance codestreams and reference software.

JPEG Pleno

The JPEG Committee is currently pursuing three activities in the framework of the JPEG Pleno Standardization: Light Field, Point Cloud and Holographic content coding.

At the Vancouver meeting, a generic file format syntax architecture was outlined that allows for efficient exchange of these modalities by utilizing a box-based file format. This format will enable the carriage of light field, point cloud and holography data, including associated metadata for colour space specification, camera calibration etc. In the particular case of light field data, this will encompass both texture and disparity information.
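The announcement does not detail the syntax, but box-based formats of this kind are typically parsed as a sequence of length-prefixed, typed chunks. The sketch below illustrates the general pattern under the assumption of ISOBMFF-style boxes (4-byte big-endian size including the 8-byte header, then a 4-byte type code); the type codes `box1`/`box2` are invented for illustration and are not from the JPEG Pleno specification:

```python
import struct

def parse_boxes(data: bytes):
    """Parse a flat sequence of ISOBMFF-style boxes into (type, payload) pairs."""
    offset, boxes = 0, []
    while offset + 8 <= len(data):
        size, = struct.unpack_from(">I", data, offset)   # size includes the header
        btype = data[offset + 4:offset + 8].decode("ascii")
        boxes.append((btype, data[offset + 8:offset + size]))
        offset += size
    return boxes

# Two toy boxes: 12 bytes (4-byte payload) and 10 bytes (2-byte payload).
payload = b"\x00\x00\x00\x0cbox1" + b"meta" + b"\x00\x00\x00\x0abox2" + b"\x01\x02"
assert parse_boxes(payload) == [("box1", b"meta"), ("box2", b"\x01\x02")]
```

Because each modality (light field, point cloud, holography) can live in its own typed box alongside metadata boxes, readers can skip boxes they do not understand, which is what makes such a generic syntax suitable for carrying several modalities in one file.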

For the coding of point clouds and holographic data, activities are still in an exploratory phase, addressing the elaboration of use cases and the refinement of requirements for coding such modalities. In addition, experimental procedures are being designed to facilitate the quality evaluation and testing of technologies that will be submitted in later calls for coding technologies. Interested parties active in point cloud and holography related markets and applications, both from industry and academia, are welcome to participate in this standardization activity.

Final Quote

“The JPEG XL standard will enable higher quality content while improving on compression efficiency and offering new features useful for emerging multimedia applications,” said Prof. Touradj Ebrahimi, the Convenor of the JPEG Committee.

About JPEG

The Joint Photographic Experts Group (JPEG) is a Working Group of ISO/IEC, the International Organisation for Standardization / International Electrotechnical Commission (ISO/IEC JTC 1/SC 29/WG 1), and of the International Telecommunication Union (ITU-T SG16), responsible for the popular JPEG, JPEG 2000, JPEG XR, JPSearch and, more recently, the JPEG XT, JPEG XS, JPEG Systems and JPEG Pleno families of imaging standards.

The JPEG Committee nominally meets four times a year, in different world locations. The 81st JPEG Meeting was held on 12-19 October 2018, in Vancouver, Canada. The next 82nd JPEG Meeting will be held on 19-25 January 2019, in Lisbon, Portugal.

More information about JPEG and its work is available on the JPEG website or by contacting Antonio Pinheiro or Frederik Temmermans of the JPEG Communication Subgroup.

If you would like to stay posted on JPEG activities, please subscribe to the jpeg-news mailing list on  

Future JPEG meetings are planned as follows:

  • No 82, Lisbon, Portugal, January 19 to 25, 2019
  • No 83, Geneva, Switzerland, March 16 to 22, 2019
  • No 84, Brussels, Belgium, July 13 to 19, 2019


Towards an Integrated View on QoE and UX: Adding the Eudaimonic Dimension

In the past, research on Quality of Experience (QoE) has frequently been limited to networked multimedia applications, such as the transmission of speech, audio and video signals. In parallel, usability and User Experience (UX) research addressed human-machine interaction systems which focus either on a functional (pragmatic) or an aesthetic (hedonic) aspect of the user's experience. In both the QoE and UX domains, the context of use (mental, social, physical, societal etc.) has mostly been considered as a control factor, in order to guarantee the functionality of the service or the ecological validity of the evaluation. This situation changes when systems are considered which explicitly integrate the usage environment and context they are used in, such as Cyber-Physical Systems (CPS), used e.g. in smart home or smart workplace scenarios. Such systems are equipped with sensors and actuators which are able to sample and manipulate the environment they are integrated into; the interaction with them is thus mediated through the environment, e.g. the environment can react to a user entering a room. In addition, such systems are used for applications which differ from standard multimedia communication in the sense that they are frequently used over a long or repeating period(s) of time, and/or in a professional use scenario. In such application scenarios, the motivation for system usage can be divided between the actual system user and a third party (e.g. the employer), resulting in differing factors affecting the related experiences (in comparison to services which are used on the user's own account). However, the impact of this duality of usage motivation on the resulting QoE or UX has rarely been addressed in the existing research of either scientific community.

In the context of QoE research, the European Network on Quality of Experience in Multimedia Systems and Services, Qualinet (COST Action IC 1003), as well as a number of Dagstuhl seminars [see note from the editors], started a scientific discussion about the definition of the term QoE and related concepts around 2011. This discussion resulted in a White Paper which defines QoE as “the degree of delight or annoyance of the user of an application or service. It results from the fulfillment of his or her expectations with respect to the utility and/or enjoyment of the application or service in the light of the user's personality and current state.” [White Paper 2012]. Besides this definition, the white paper describes a number of factors that influence a user's QoE perception, e.g. human, system and contextual factors. Although this discussion lists a large set of influencing factors quite thoroughly, it still focuses on rather short-term (or episodic) and media-related hedonic experiences. A first step towards integrating an additional (quality) dimension beyond the hedonic one has been described in [Hammer et al. 2018], where the authors introduced the eudaimonic perspective, understood as the user's overall well-being as a result of system usage. The term “eudaimonic” goes back to Aristotle and is commonly used to designate a deeper degree of well-being, resulting from self-fulfillment through developing one's own strengths.

UX research, on the other hand, has historically evolved from usability research (which for a long time focused on enhancing the efficiency and effectiveness of the system) and was initially concerned with the prevention of negative emotions related to technology use. Usability research identified pragmatic aspects of the analyzed ICT systems as an important contributor to such prevention. The shift towards a modern understanding of UX, however, focuses on human-machine interaction as a specific emotional experience (e.g., pleasure) and considers pragmatic aspects only as enablers of positive experiences, not as contributors to them. In line with this understanding, the concept of Positive or Hedonic Psychology, as introduced by [Kahnemann 1999], has been embedded and adopted in HCI and UX research. As a result, the related research community has mainly focused on the hedonic aspects of experiences, as described in [Diefenbach 2014] and as critically outlined by [Mekler 2016], in which the authors argue that this concentration on hedonic aspects has overshadowed the importance of eudaimonic aspects of well-being as described in positive psychology. With respect to the measurement of user experiences, the devotion to hedonic psychology also comes with the need for measuring emotional responses (or experiential qualities). In contrast to the majority of QoE research, where the focus is on measuring the (single) experienced (media) quality of a multimedia system, the measurement of experiential qualities in UX calls for measuring a range of qualities (e.g. [Bargas-Avila 2011] lists affect, emotion, fun, aesthetics, hedonics and flow as qualities assessed in the context of UX). Hence, this measurement approach considers a considerably broader range of quantified qualities.
However, the development of the UX domain towards design-based UX research, which steers away from quantitatively measurable qualities and towards a qualitative research approach (one that does not generate measurable numbers), has marginalized this measurement- or model-based UX research camp in recent UX developments, as noted by [Law 2014].

While existing work in QoE mainly focuses on hedonic aspects (and in UX, also on pragmatic ones), eudaimonic aspects such as the development of one’s own strengths have not been considered extensively so far in the context of both research areas. Especially in the usage context of professional applications, the meaningfulness of system usage (which is strongly related to eudaimonic aspects) and the growth of the user’s capabilities will certainly influence the resulting experiential quality(ies). In particular, professional applications must be designed such that the user continues to use the system in the long run without frustration, i.e. provide long-term acceptance for applications which the user is required to use by the employer. In order to consider these aspects, the so-called “HEP cube” has been introduced in [Hammer et al. 2018]. It opens a 3-dimensional space of hedonic (H), eudaimonic (E) and pragmatic (P) aspects of QoE and UX, which are integrated towards a Quality of User Experience (QUX) concept.

While a concise definition of QUX has not yet been established in this context, a number of QUX-related aspects, e.g. utility (P), joy-of-use (H) and meaningfulness (E), have been integrated into a multidimensional HEP construct, displayed in Figure 1. In addition to the well-known hedonic and pragmatic aspects of UX, it incorporates the eudaimonic dimension. It shows the assumed relationships between the aforementioned aspects of User Experience and QoE, as well as usefulness and motivation (which is strongly related to the eudaimonic dimension). These aspects are triggered by user needs (first layer) and moderated by the respective dimension aspects: joy-of-use (hedonic), ease-of-use (pragmatic), and purpose-of-use (eudaimonic). The authors expect that a consideration of the additional needs and QUX aspects, and an incorporation of these aspects into application design, will not only lead to higher acceptance rates, but also to a deeply grounded well-being of users. Furthermore, incorporating these aspects into QoE and/or QUX modelling will improve the models' prediction performance and ecological validity.


Figure 1: QUX as a multidimensional construct involving HEP attributes, existing QoE/UX, need fulfillment and motivation. Picture taken from Hammer, F., Egger-Lampl, S., Möller, S.: Quality-of-User-Experience: A Position Paper, Quality and User Experience, Springer (2018).


  • [White Paper 2012] Qualinet White Paper on Definitions of Quality of Experience (2012).  European Network on Quality of Experience in Multimedia Systems and  Services (COST Action IC 1003), Patrick Le Callet, Sebastian Möller and Andrew Perkis, eds., Lausanne, Switzerland, Version 1.2, March 2013.
  • [Kahnemann 1999] Kahneman, D.: Well-being: Foundations of Hedonic Psychology, chap. Objective Happiness, pp. 3–25. Russell Sage Foundation Press, New York (1999)
  • [Diefenbach 2014] Diefenbach, S., Kolb, N., Hassenzahl, M.: The ‘hedonic’ in human-computer interaction: History, contributions, and future research directions. In: Proceedings of the 2014 Conference on Designing Interactive Systems, pp. 305–314. ACM (2014)
  • [Mekler 2016] Mekler, E.D., Hornbæk, K.: Momentary pleasure or lasting meaning?: Distinguishing eudaimonic and hedonic user experiences. In: Proceedings of the 2016 CHI Conference on Human Factors in Computing Systems, pp. 4509–4520. ACM (2016)
  • [Bargas-Avila 2011] Bargas-Avila, J.A., Hornbæk, K.: Old wine in new bottles or novel challenges: A critical analysis of empirical studies of user experience. In: Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, pp. 2689–2698. ACM (2011)
  • [Law 2014] Law, E.L.C., van Schaik, P., Roto, V.: Attitudes towards user experience (UX) measurement. International Journal of Human-Computer Studies 72(6), 526–541 (2014)
  • [Hammer et al. 2018] Hammer, F., Egger-Lampl, S., Möller, S.: Quality-of-User-Experience: A Position Paper, Quality and User Experience, Springer (2018).

Note from the editors:

More details on the integrated view of QoE and UX can be found in Hammer, F., Egger-Lampl, S. & Möller, “Quality-of-user-experience: a position paper”. Springer Quality and User Experience (2018) 3: 9.

The Dagstuhl seminars mentioned by the authors started a scientific discussion about the definition of the term QoE in 2009. Three Dagstuhl Seminars were related to QoE: 09192 “From Quality of Service to Quality of Experience” (2009), 12181 “Quality of Experience: From User Perception to Instrumental Metrics” (2012), and 15022 “Quality of Experience: From Assessment to Application” (2015). A Dagstuhl Perspectives Workshop, 16472 “QoE Vadis?”, followed in 2016 and set out to jointly and critically reflect on future perspectives and directions of QoE research. During this Perspectives Workshop, the QoE-UX “wedding proposal” came up, proposing to marry the areas of QoE and UX. The reports from the Dagstuhl seminars as well as the Manifesto from the Perspectives Workshop are available online and listed below.

One step towards an integrated view of QoE and UX is reflected by QoMEX 2019. The 11th International Conference on Quality of Multimedia Experience will be held from June 5th to 7th, 2019 in Berlin, Germany. It will bring together leading experts from academia and industry to present and discuss current and future research on multimedia quality, quality of experience (QoE) and user experience (UX). In this way, it will contribute towards an integrated view on QoE and UX, and foster the exchange between the so-far distinct communities. More details: