Report by Björn Þór Jónsson, Cathal Gurrin and Klaus Schoeffmann.
Björn Þór Jónsson (http://www.ru.is/~bjorn/) is an Assistant Professor in the School of Computer Science at Reykjavik University. His research focuses on scalability of multimedia analytics and multimedia retrieval applications, especially on novel hardware architectures.
Cathal Gurrin (http://www.computing.dcu.ie/~cgurrin/) is a lecturer at the School of Computing, at Dublin City University, Ireland and a principal investigator at the Insight Centre for Data Analytics. His research interest is multimedia information retrieval and personal data analytics, with a special emphasis on lifelogging and its applications.
Klaus Schoeffmann (http://vidosearch.com/?page_id=2) is an Associate Professor in the distributed multimedia systems research group at the Institute of Information Technology (ITEC) at Klagenfurt University, Austria. His research focuses on human-computer-interaction with multimedia data (e.g., image and video browsing), multimedia content analysis, and multimedia tools and applications.
This report summarizes the presentations and discussions of the special session entitled “Perspectives on Multimedia Analytics” at MMM 2016, which was held in Miami, Florida on January 6, 2016. The special session consisted of four brief paper presentations, followed by a panel discussion with questions from the audience. The session was organized by Björn Þór Jónsson and Cathal Gurrin, and chaired and moderated by Klaus Schoeffmann. The goal of this report is to record the conclusions of the special session, in the hope that it may serve members of our community who are interested in Multimedia Analytics.
Presentations

Alan Smeaton opens the discussion. From the left: Klaus Schoeffmann (moderator), Alan Smeaton, Björn Þór Jónsson, Guillaume Gravier and Graham Healy.
First, Alan Smeaton presented an analysis of time-series-based recognition of semantic concepts [1]. He argued that while concept recognition in visual multimedia is typically based on simple concepts, there is a need to recognise semantic concepts which have a temporal aspect corresponding to activities or complex events. Furthermore, he argued that while various results are reported in the literature, there are research questions which remain unanswered, such as: “What concept detection accuracies are satisfactory for higher-level recognition?” and “Can recognition methods perform equally well across various concept detection performances?” Results suggested that, although improving concept detection accuracies can enhance the recognition of time-series-based concepts, concept detection does not need to be very accurate in order to characterize the dynamic evolution of time series if appropriate methods are used. In other words, even if semantic concept detectors still have low accuracy, it makes a lot of sense to apply them to temporally adjacent shots/frames in video in order to detect semantic events from them.
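The intuition that noisy per-frame detectors can still support event detection when temporally adjacent frames are combined can be illustrated with a minimal sketch. This is not the method of [1]; it is a plain centered moving average over hypothetical detector confidences, and the function names, the window size, the threshold and the example scores are all assumptions made purely for illustration.

```python
from typing import List


def moving_average(scores: List[float], window: int) -> List[float]:
    """Centered moving average; at the edges only the available neighbours are used."""
    if window < 1 or window % 2 == 0:
        raise ValueError("window must be a positive odd number")
    half = window // 2
    smoothed = []
    for i in range(len(scores)):
        lo = max(0, i - half)
        hi = min(len(scores), i + half + 1)
        smoothed.append(sum(scores[lo:hi]) / (hi - lo))
    return smoothed


def detect_event(scores: List[float], window: int = 3, threshold: float = 0.5) -> List[bool]:
    """Flag a frame as part of the event if its smoothed score reaches the threshold."""
    return [s >= threshold for s in moving_average(scores, window)]


# Hypothetical per-frame confidences from an unreliable concept detector.
# Assumed ground truth: no event in frames 0-4, event in frames 5-9.
# Frame 2 is a spurious spike and frame 7 is a missed detection.
frame_scores = [0.1, 0.2, 0.9, 0.1, 0.2, 0.8, 0.9, 0.1, 0.9, 0.8]

raw_decisions = [s >= 0.5 for s in frame_scores]   # thresholding each frame alone
smoothed_decisions = detect_event(frame_scores)    # thresholding after smoothing

print(raw_decisions)
print(smoothed_decisions)
```

In this toy example, thresholding each frame independently reproduces both detector errors, whereas thresholding the smoothed series recovers the assumed event boundary. Real time-series methods are considerably more sophisticated than a moving average; the sketch only shows why temporal context can compensate for low per-frame accuracy.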
Second, Björn Þór Jónsson presented ten research questions for scalable multimedia analytics [2]. He argued that the scale and complexity of multimedia collections are ever increasing, as is the desire to harvest useful insight from the collections. To optimally support the complex quest for insight, multimedia analytics has emerged as a new research area that combines concepts and techniques from multimedia analysis and visual analytics into a single framework. Björn argued further, however, that state-of-the-art database management solutions are not yet designed for multimedia analytics workloads, and that research is therefore required into scalable multimedia analytics, built on the three underlying pillars of visual analytics, multimedia analysis and database management. Björn then proposed ten specific research questions to address in this area.
Third, Guillaume Gravier presented a study of the needs and expectations of media professionals for multimedia analytics solutions [3]. The goal of the work was to help clarify what multimedia analytics encompasses by studying users' expectations. They focused on a very specific family of applications for search and navigation of broadcast and social news content. Through extensive conversations with media professionals, using mock-up interfaces and a human-centered design methodology, they analyzed the perceived usefulness of a number of functionalities leveraging existing or upcoming technology. Based on the results, Guillaume proposed a definition of research directions for (multi)media analytics.

Graham Healy gives the final presentation of the session. Sitting, from the left: Klaus Schoeffmann (moderator), Alan Smeaton, Björn Þór Jónsson, and Guillaume Gravier.
Finally, Graham Healy presented an analysis of human annotation quality using neural signals such as electroencephalography (EEG) [4]. They explored how neurophysiological signals correlate with attention and perception, in order to better understand the image-annotation task. Results indicated potential issues with “how well” a person manually annotates images and with variability across annotators. They proposed that such issues may arise in part as a result of subjectively interpretable instructions that may fail to elicit similar labelling behaviours and decision thresholds across participants. In particular, they found instances where an individual’s annotations differed from a group consensus, even though their EEG signals indicated that they were likely in consensus. Finally, Graham discussed the potential implications of the work for annotation tasks and crowd-sourcing in the future.
Discussions
First, a question was asked about a definition for multimedia analytics, and its relationship to multimedia analysis. Björn proposed the following definition of the main goal of scalable multimedia analytics: “. . . to produce the processes, techniques and tools to allow many diverse users to efficiently and effectively analyze large and dynamic multimedia collections over a long period of time to gain insight and knowledge” [2]. Guillaume, on the other hand, proposed that multimedia analytics could be defined as: “. . . the process of organizing multimedia data collections and providing tools to extract knowledge, gain insight and help make decisions by interacting with the data” [3]. Finally, Alan added that, in contrast with multimedia analysis, the multimedia analytics user is involved in every stage of the whole process: from media production/capturing, through data inspection, filtering, and structuring, up to the final consumption, visualization, and usage of the media data.
Clearly, all three definitions are largely in agreement, as they focus on the insight and knowledge gained through interaction with data. In addition, the first definition includes scalability aspects, such as the number of analysts and the duration of analysis. The speakers agreed that multimedia analysis was mostly concerned with the automatic analysis of media content, while media interaction is definitely an important aspect to consider in multimedia analytics. Björn even proposed that users might be more satisfied with a system that takes a few iterations of user interaction to reach a conclusion than with a system that takes a somewhat shorter time to reach the same conclusion without any interaction. Guillaume stressed that their work had demonstrated the importance of working with professional users to get their requirements early. When asked, Graham agreed that neural sensors could potentially become a weapon in the analyst’s arsenal, helping the analyst to understand what the brain finds interesting.
A question was asked about potential application areas for multimedia analytics. There was general agreement that many and diverse areas could benefit from multimedia analytics techniques. Alan listed a number of application areas, such as: on-line education, lifelogging, surveillance and forensics, medicine and biomedicine, and so on; in fact he struggled to find an area that could not be affected. There was also agreement that many multimedia analytics application areas would need to involve very large quantities of data. As an example, the recent YFCC 100M collection has nearly 100 million images and around 800 thousand videos; yet compared to web-scale collections it is still very small.
A further thread of discussion centered on where to focus research efforts. The works described by Björn and Guillaume already propose some long-term research questions and directions. Based on his experience, Alan proposed that work on improving the quality of a particular concept detector from 95% to 96%, for example, would not have any significant impact, while work on improving the higher-level detection to use more (and more varied) information would be much more productive. As a follow-up, Alan was asked whether researchers working on concept detection should focus on more general concepts with higher recall but often low precision (e.g., beach, car, food, etc.) or on more specific concepts with low recall but typically higher precision (e.g., NASCAR racing tyre, sushi, United Airlines plane, etc.). He answered that neither should be particularly preferred, and that work needs to continue on both types of concepts.
Finally, some questions were posed to the participants about details of their respective works; however these will not be reported here.
Summary
Overall, the conclusion of the discussion is that multimedia analytics should be a very fruitful research area in the future, with diverse applications in many areas and for many users. While the finer-grained conclusions of the discussion that we have described above were perhaps not revolutionary, we nevertheless felt it would be a service to the community to write them down in this short report.
The panel format of the special session made the discussion much more lively and interactive than that of a traditional technical session. We would like to thank the presenters and their co-authors for their excellent contributions. The session chairs would also particularly like to thank the moderator, Klaus Schoeffmann, for his contribution to the session, as a good panel moderator is very important for the success of the session.
References
[1] Peng Wang, Lifeng Sun, Shiqiang Yang, and Alan Smeaton. What are the limits to time series based recognition of semantic concepts? In Proc. MMM, Miami, FL, USA, 2016.
[2] Björn Þór Jónsson, Marcel Worring, Jan Zahálka, Stevan Rudinac, and Laurent Amsaleg. Ten research questions for scalable multimedia analytics. In Proc. MMM, Miami, FL, USA, 2016.
[3] Guillaume Gravier, Martin Ragot, Laurent Amsaleg, Rémi Bois, Grégoire Jadi, Éric Jamet, Laura Monceaux, and Pascale Sébillot. Shaping-up multimedia analytics: Needs and expectations of media professionals. In Proc. MMM, Miami, FL, USA, 2016.
[4] Graham Healy, Cathal Gurrin, and Alan Smeaton. Informed perspectives on human annotation using neural signals. In Proc. MMM, Miami, FL, USA, 2016.