ACM SIGMM Award for Outstanding PhD Thesis in Multimedia Computing, Communications and Applications 2015

Awardee

ACM Special Interest Group on Multimedia (SIGMM) is pleased to present the 2015 SIGMM Outstanding Ph.D. Thesis Award to Dr. Ting Yao and Honorable Mention recognition to Dr. Britta Meixner.

The award committee considers Dr. Yao’s dissertation, entitled “Multimedia Search by Self, External, and Crowdsourcing Knowledge,” worthy of this recognition: the thesis proposes an innovative knowledge transfer framework for multimedia search that is expected to have significant impact, especially in boosting search performance on big multimedia data.

Dr. Yao’s thesis proposes the knowledge transfer methodology in three multimedia search scenarios:

  1. Seeking consensus among multiple modalities in the context of search re-ranking,
  2. Leveraging external knowledge as a prior to be transferred to a problem that belongs to a domain different from the external knowledge, and
  3. Exploring the large user click-through data as crowdsourced human intelligence for annotation and search.

The effectiveness of the proposed framework has been convincingly demonstrated through thorough experiments. The framework makes substantial contributions to the principled integration of multimodal data, which is indispensable in multimedia search. The publications related to the thesis clearly demonstrate the major impact of this work in many research disciplines, including multimedia, the web, and information retrieval. The fact that parts of the proposed techniques have been, and are being, transferred to the commercial search service Bing further attests to the practical contributions of this thesis. Overall, the committee recognizes the significant impact of the thesis and its contributions to the multimedia community.

Bio of Awardee

Dr. Ting Yao is an associate researcher in the Multimedia Search and Mining group at Microsoft Research, Beijing, China. His research interests are in multimedia search and computing. He completed a Ph.D. in Computer Science at City University of Hong Kong in 2014. He received a B.Sc. degree in theoretical and applied mechanics (2004), a B.Eng. double degree in electronic information engineering (2004), and an M.Eng. degree in signal and information processing (2008), all from the University of Science and Technology of China, Hefei, China. The system he designed achieved second place in the THUMOS action recognition challenge at CVPR 2015. He was also the principal designer of the image retrieval systems that ranked third and fifth in the MSR-Bing Image Retrieval Challenge at ACM MM 2014 and 2013, respectively. He received the Best Paper Award at ACM ICIMCS 2013.

Honorable Mention

The award committee is pleased to present the Honorable Mention to Dr. Britta Meixner for the thesis entitled “Annotated Interactive Non-linear Video – Software Suite, Download and Cache Management.”

The thesis presents a fully functional software suite for authoring non-linear interactive videos, with download and cache management mechanisms for effective video playback. The committee is particularly impressed by the thorough study presented in the thesis, including extensive analysis of the properties of the software suite. The implementation, which has been made available as open-source software along with the thesis, undoubtedly has very high potential impact for the multimedia community.

Bio of Awardee

Dr. Britta Meixner received her Master’s degree (German Diplom) in Computer Science from the University of Passau, Germany, in 2008. Furthermore, she received the First State Examination for Lectureship at Secondary Schools for the subjects Computer Science and Mathematics from the Bavarian State Ministry for Education and Culture in 2008. She received her Ph.D. degree from the University of Passau, Germany, in 2014. The title of her thesis is “Annotated Interactive Non-linear Video – Software Suite, Download and Cache Management.” She is currently a postdoctoral research fellow at the University of Passau, Germany, and will be a postdoctoral research fellow at FXPAL, Palo Alto, CA, USA, starting October 2015. Her research interest is mainly in hypermedia. She is a winner of the 2015 “Women + Media Technology” award granted by Germany’s public broadcasters ARD and ZDF (ARD/ZDF Förderpreis “Frauen + Medientechnologie” 2015). She was a reviewer for the Springer Multimedia Tools and Applications (MTAP) journal, an organizer of the International Workshop on Interactive Content Consumption (WSICC) at ACM TVX in 2014 and 2015, and an Associate Chair at ACM TVX 2015.

Announcement of ACM SIGMM Rising Star Award 2015

ACM Special Interest Group on Multimedia (SIGMM) is pleased to present this year’s Rising Star Award in multimedia computing, communications and applications to Dr. Yu-Gang Jiang. The ACM SIGMM Rising Star Award recognizes a young researcher who has made outstanding research contributions to the field of multimedia computing, communication and applications during the early part of his or her career.

Dr. Yu-Gang Jiang has made fundamental contributions in the area of video analysis and retrieval, especially with innovative approaches to large-scale video concept detection. He has been an active leader in exploring the bag-of-visual-words (BoW) representation for concept detection, providing influential insights on critical representation design choices. He proposed the important idea of “soft-weighting” in his CIVR 2007 paper, which significantly advanced the performance of visual concept detection. Dr. Jiang has also proposed several important techniques for video and image search. In 2009, he proposed a novel domain-adaptive concept selection method for concept-based video search. His method selects the most relevant concepts for a given query considering not only the semantic concept-to-query relatedness but also the data distribution in the target domain. More recently, he proposed a method that generates query-adaptive hash codes for improved visual search, with which a finer-grained ranking of search results can be achieved compared to traditional hashing-based methods. His most recent work is in the emerging field of video content recognition by deep learning, where he proposed a comprehensive deep learning framework to model static, short-term motion and long-term temporal information in videos. Very promising results were obtained on the widely used UCF101 dataset.

As a postdoctoral researcher at Columbia University and later as a faculty member at Fudan University, Dr. Jiang has devoted significant efforts to video event recognition, a problem that is receiving increasing attention in the multimedia community. His extensive contributions in this area include not only innovative algorithm design, but also large benchmark construction, system development, and survey tutorials. He devised a comprehensive system in 2010 using multimodal features, contextual concepts and temporal clues, which won the multimedia event detection (MED) task in NIST TRECVID 2010. He constructed the Columbia Consumer Video (CCV) benchmark in 2011, which has been widely used. He continues to lead major efforts in creating and sharing large-scale video datasets in critical areas (including 200+ event categories and 100,000 partial-copy videos) as community resources.

The high impact of his work is reflected in its large number of citations. His recent paper on video search result organization received the Best Poster Paper Award at ACM MM 2014. His shared benchmark datasets and source code have been used worldwide. In addition, he has made extensive contributions to the professional community by serving as conference program chair, invited speaker, and tutorial presenter. In summary, Dr. Yu-Gang Jiang receives the 2015 ACM SIGMM Rising Star Award for his significant contributions in the areas of video content recognition and search.

ACM SIGMM Award for Outstanding Technical Contributions to Multimedia Computing, Communications and Applications

The 2015 winner of the prestigious ACM Special Interest Group on Multimedia (SIGMM) award for Outstanding Technical Contributions to Multimedia Computing, Communications and Applications is Prof. Dr. Tat-Seng Chua. The award is given in recognition of his pioneering contributions to multimedia, text and social media processing.

Tat-Seng Chua is a leading researcher in multimedia, text and social media analysis and retrieval. He is one of the few researchers who have made substantial contributions in the fields of multimedia, information retrieval and social media. Dr. Chua’s contributions in multimedia date back to the early 1990s, when he was among the first to work on image retrieval with relevance feedback (1991), video retrieval and sequencing by exploring metadata and cinematic rules (1995), and fine-grained image retrieval at the segment level (1995). These works helped shape the development of the field for many years. Given the limitations of visual content analysis, his research advocates the integration of text, metadata and visual content, coupled with domain knowledge, for large-scale media analysis. He developed a multi-source, multi-modal and multi-resolution framework, with a human in the loop, for such analysis and retrieval tasks. This has helped his group not only publish papers in top conferences and journals, but also achieve top positions in large-scale video evaluations: his group participated in TRECVID in 2000-2006 and VideOlympics in 2007-09, and won the highly competitive Star (Multimedia) Challenge in 2008. Leveraging this experience, he developed a large-scale multi-label image test set named NUS-WIDE, which has been widely used, with over 600 citations. He recently started a company named ViSenze Pte Ltd (www.visenze.com) to commercialize his research in mobile visual fashion search.

In his more recent work on multimedia question-answering (MMQA), he developed a joint text-visual model that exploits correlations between text queries, text-based answers, and visual concepts in images and videos to return both relevant text and video answers. The early work was carried out in the domain of news video (2003) and has motivated several follow-on works in image QA. His recent work tackled the more complicated “how-to” type of QA in product domains (2010-13), and exploited SemanticNet to perform attribute-based image retrieval using various types of domain knowledge (2013-14). His current work aims to build a live, continuous-learning system to support the dynamic annotation and retrieval of images and micro-videos in social media streams.

In information retrieval and social media research, Dr. Chua has focused on the key problems of organizing large-scale unstructured text content to support question-answering (QA). His works point towards the use of linguistics and domain knowledge for effective large-scale information analysis, organization and retrieval. Given his strong interest in both multimedia and text processing, it was natural for him to venture into social media research, which involves the analysis of text, multimedia, and social network content. His group developed a live social observatory system to carry out research in building descriptive, predictive and prescriptive analytics of multiple live social media streams. The system has been well recognized by peers. His recent work on “multi-screen social TV” won the 2015 IEEE MultiMedia Best Paper Award.

Dr. Chua has been involved in most key conferences in these areas by serving as general chair, technical program chair, or invited keynote speaker, as well as by leading innovative research and winning many best paper or best student paper awards in recent years. He is the Steering Committee Chair of two international multimedia conference series: ACM ICMR (International Conference on Multimedia Retrieval) and MMM (MultiMedia Modeling). In summary, he is an extraordinarily accomplished and outstanding researcher in multimedia, text and social media processing, truly exemplifying the characteristics of the ACM SIGMM Award for Outstanding Technical Contributions.

ACM SIGMM/TOMM 2015 Award Announcements

The ACM Special Interest Group in Multimedia (SIGMM) and ACM Transactions on Multimedia Computing, Communications and Applications (TOMM) are pleased to announce the following awards for 2015 recognizing outstanding achievements and services made in the multimedia community.
SIGMM Technical Achievement Award:
Dr. Tat-Seng Chua, National University of Singapore

SIGMM Rising Star Award:
Dr. Yu-Gang Jiang, Fudan University
SIGMM Best Ph.D. Thesis Award:
Dr. Ting Yao, City University of Hong Kong (currently Microsoft Research)

TOMM Nicolas D. Georganas Best Paper Award:
“A Quality of Experience Model for Haptic Virtual Environments” by Abdelwahab Hamam, Abdulmotaleb El Saddik, and Jihad Alja’am, published in TOMM, vol. 10, Issue 3, 2014.
TOMM Best Associate Editor Award:
Dr. Pradeep K. Atrey, State University of New York, Albany
Additional information about each award and recipient is available on the SIGMM web site.
http://www.sigmm.org/
Awards will be presented at the annual SIGMM event, the ACM Multimedia Conference, held in Brisbane, Australia, during October 26-30, 2015.
ACM is the professional society of computer scientists, and SIGMM is its special interest group on multimedia. TOMM (formerly TOMCCAP) is the flagship journal of SIGMM.

Call for Nominations: Editor-In-Chief of ACM TOMM

The term of the current Editor-in-Chief (EiC) of the ACM Transactions on Multimedia Computing, Communications, and Applications (TOMM) (http://tomm.acm.org/) is coming to an end, and the ACM Publications Board has set up a nominating committee to assist the Board in selecting the next EiC. Nominations, including self-nominations, are invited for a three-year term as TOMM EiC, beginning on 1 January 2016. The EiC appointment may be renewed at most one time. This is an entirely voluntary position, but ACM will provide appropriate administrative support. The EiC is responsible for maintaining the highest editorial quality, for setting the technical direction of the papers published in TOMM, and for maintaining a reasonable pipeline of articles for publication. He/she has final say on the acceptance of papers, the size of the Editorial Board, and the appointment of Associate Editors. The EiC is expected to adhere to the commitments expressed in the policy on Rights and Responsibilities in ACM Publishing (http://www.acm.org/publications/policies/RightsResponsibilities). For more information about the role of the EiC, see ACM’s Evaluation Criteria for Editors-in-Chief (http://www.acm.org/publications/policies/evaluation/). Nominations should include a vita along with a brief statement of why the nominee should be considered. Self-nominations are encouraged and should include a statement of the candidate’s vision for the future development of TOMM. The deadline for submitting nominations is 24 July 2015, although nominations will continue to be accepted until the position is filled. Please send all nominations to the nominating committee chair, Nicu Sebe (sebe@disi.unitn.it). The search committee members are:

  • Nicu Sebe (University of Trento), Chair
  • Rainer Lienhart (University of Augsburg)
  • Alejandro Jaimes (Yahoo!)
  • John R. Smith (IBM)
  • Lynn Wilcox (FXPAL)
  • Wei Tsang Ooi (NUS)
  • Mary Lou Soffa (University of Virginia), ACM Publications Board Liaison

GamingAnywhere: An Open-Source Cloud Gaming Platform

Overview

GamingAnywhere is an open-source cloud gaming platform. In addition to its openness, we designed GamingAnywhere for high extensibility, portability, and reconfigurability. GamingAnywhere currently supports Windows and Linux, and can be ported to other OS’s, including OS X and Android. Our performance study demonstrates that GamingAnywhere achieves high responsiveness and video quality while imposing low network traffic [1,2]. The value of GamingAnywhere, however, comes from its openness: researchers, service providers, and gamers may customize GamingAnywhere to meet their needs. This is not possible with other closed and proprietary cloud gaming platforms.

Figure 1: A demonstration of the GamingAnywhere system. There are four devices in the photo: one game server (the laptop on the left) and three game clients (a MacBook, an Android phone, and an iPad 2).

Motivation

Computer games have become very popular; for example, gamers spent 24.75 billion USD on computer games, hardware, and accessories in 2011. Traditionally, computer games are delivered either in boxes or via Internet downloads, and gamers have to install the games on physical machines to play them. The installation process has become extremely tedious because games are increasingly complicated and computer hardware and system software are very fragmented. Take Blizzard’s StarCraft II as an example: it may take more than an hour to install it on an i5 PC, and another hour to apply the online patches. Furthermore, gamers may find that their computers are not powerful enough to enable all the visual effects while still achieving high frame rates. Hence, gamers have to repeatedly upgrade their computers in order to play the latest computer games.

Cloud gaming is a better way to deliver a high-quality gaming experience and opens new business opportunities. In a cloud gaming system, computer games run on powerful cloud servers, while gamers interact with the games via networked thin clients. The thin clients are lightweight and can be ported to resource-constrained platforms, such as mobile devices and TV set-top boxes. With cloud gaming, gamers can play the latest computer games anywhere and anytime, while game developers can optimize their games for a specific PC configuration. The huge potential of cloud gaming has been recognized by the game industry: (i) a market report predicts that the cloud gaming market will increase 9 times between 2011 and 2017, and (ii) several cloud gaming startups were recently acquired by leading game developers.

Although cloud gaming is a promising direction for the game industry, achieving good user experience without excessive hardware investment is a tough problem. Gamers are hard to please: they demand both high responsiveness and high video quality, yet do not want to pay too much. Therefore, service providers have to not only design their systems to meet the gamers’ needs but also take error resiliency, scalability, and resource allocation into consideration. This renders the design and implementation of cloud gaming systems extremely challenging. Indeed, while real-time video streaming seems to be a mature technology at first glance, cloud gaming systems have to execute games, handle user inputs, and perform rendering, capturing, encoding, packetizing, transmitting, decoding, and displaying in real time, and are thus much more difficult to optimize.

We observe that many systems researchers have new ideas to improve the cloud gaming experience for gamers and to reduce capital expenditure (CAPEX) and operational expenditure (OPEX) for service providers. However, all existing cloud gaming platforms are closed and proprietary, which prevents researchers from testing their ideas on real cloud gaming systems. Therefore, new ideas have either only been tested using simulators/emulators or, worse, never been evaluated and published. Hence, very few new ideas on cloud gaming (specifically) or highly interactive distributed systems (more generally) have been transferred to industry. To better bridge the multimedia research community and the game/software industry, we presented GamingAnywhere, the first open-source cloud gaming testbed, in April 2013. We hope GamingAnywhere can gather enough attention and quickly grow into a community with critical mass, just like OpenFlow, which shares the same motivation as GamingAnywhere in a different research area.

Design Philosophy

GamingAnywhere aims to provide an open platform for researchers to develop and study real-time multimedia streaming applications in the cloud. The design objectives of GamingAnywhere include:

  1. Extensibility: GamingAnywhere adopts a modularized design. Both platform-dependent components, such as audio and video capturing, and platform-independent components, such as codecs and network protocols, can be easily modified or replaced. Developers should be able to follow the programming interfaces of GamingAnywhere’s modules to extend its capabilities. The platform is not limited to games: any real-time multimedia streaming application, such as live casting, can be built on the same system architecture.
  2. Portability: In addition to desktops, mobile devices are becoming one of the most important client platforms for cloud services as wireless networks become increasingly widespread. For this reason, we maintain the principle of portability when designing and implementing GamingAnywhere. Currently the server supports Windows and Linux, while the client supports Windows, Linux, and OS X. New platforms can be easily supported by replacing the platform-dependent components of GamingAnywhere. Besides these easily replaceable modules, the external components leveraged by GamingAnywhere are highly portable as well, which also makes GamingAnywhere easier to port to mobile devices.
  3. Configurability: System researchers may conduct experiments on real-time multimedia streaming applications with diverse system parameters. A large number of built-in audio and video codecs are supported by GamingAnywhere. In addition, GamingAnywhere exposes all available configurations to users, so that the best combinations of parameters can be tried out, and the system fitted to a customized usage scenario, by simply editing a text-based configuration file.
  4. Openness: GamingAnywhere is publicly available at http://gaminganywhere.org/. Use of GamingAnywhere in academic research is free of charge, but researchers and developers should follow the license terms stated in the binary and source packages.
 
Figure 2: A demonstration of GamingAnywhere running on an Android phone, playing a Mario game in an N64 emulator on a PC.

How to Start

We offer GamingAnywhere in two types of software packs: all-in-one and binary. The all-in-one pack allows gamers to recompile GamingAnywhere from scratch, while the binary packs are for gamers who just want to try out GamingAnywhere. There are binary packs for Windows and Linux. All the packs are downloadable as zipped archives and can be installed by simply uncompressing them. GamingAnywhere consists of three binaries: (i) ga-client, the thin client; (ii) ga-server-periodic, a server that periodically captures game screens and audio; and (iii) ga-server-event-driven, another server that uses code-injection techniques to capture game screens and audio on demand (i.e., whenever an updated game screen is available). Readers are welcome to visit the GamingAnywhere website at http://gaminganywhere.org/. Table 1 gives the latest supported OS’s and versions, and all the source code and pre-compiled binary packages can be downloaded from this page. The website provides a variety of documents to help users quickly set up the GamingAnywhere server and client on their own computers, including the Quick Start Guide, the Configuration File Guide, and a FAQ. If you have questions that are not answered in the documents, we also provide an interactive forum for online discussion.

Table 1: Supported operating systems and versions.

            Windows        Linux       Mac OS X    Android
  Server    Windows 7+     Supported   Supported   –
  Client    Windows XP+    Supported   Supported   4.1+
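As a rough sketch of how the binaries fit together, a typical session starts one of the servers on the gaming PC and then points the thin client at it over the network. The commands below are only an illustration: the binary names are those listed above, but the configuration file names, the port number, and the RTSP path are assumptions; please consult the Quick Start Guide on gaminganywhere.org for the exact invocation.

# on the game server (Windows or Linux); the config file name is hypothetical
ga-server-periodic config/server.periodic.conf

# on the thin client; the client config name, port 8554, and the /desktop path are assumptions
ga-client config/client.rel.conf rtsp://<server-ip>:8554/desktop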

 

Future Perspectives

Cloud gaming is getting increasingly popular, but to turn cloud gaming into an even bigger success, there are still many challenges ahead of us. In [3], we share our views on the most promising research opportunities for providing high-quality and commercially viable cloud gaming services. These opportunities span fairly diverse research directions: from very system-oriented game integration to quite human-centric QoE modeling, and from cloud-related GPU virtualization to content-dependent video codecs. We believe these research opportunities are of great interest to both the research community and industry for future, better cloud gaming platforms. GamingAnywhere enables several future research directions on cloud gaming and beyond. For example, techniques for cloud management, such as resource allocation and Virtual Machine (VM) migration, are critical to the success of commercial deployments. These cloud management techniques need to be optimized for cloud games; e.g., VM placement decisions need to be aware of gaming experience [4]. Beyond cloud gaming, as dynamic and adaptive binding between computing devices and displays becomes increasingly popular, screencast technologies, which enable such binding over wireless networks, also employ real-time video streaming as their core technology. The ACM MMSys’15 paper [5] demonstrates that GamingAnywhere, though designed for cloud gaming, also serves as a good reference implementation and testbed for experimenting with different innovations and alternatives for improving screencast performance. Furthermore, we expect that future applications, such as mobile smart lenses and even telepresence, can make good use of GamingAnywhere as part of their core technologies. We are happy to offer GamingAnywhere to the community and more than happy to welcome community members to join us in hacking on future, better real-time streaming systems for the good of all.

ACM TOMM (TOMCCAP) Call for Special Issue Proposals

ACM Transactions on Multimedia
Computing, Communications and Applications
ACM TOMM (previously known as ACM TOMCCAP)

Deadline for Proposal Submission: May 1st, 2015
Notification: June 1st, 2015
http://tomm.acm.org/

ACM TOMM is one of the world’s leading journals on multimedia. As in previous years, we are planning to publish a special issue (SI) in 2016. Proposals are accepted until May 1st, 2015. Each special issue is the responsibility of its guest editors. If you wish to guest edit a special issue, you should prepare a proposal as outlined below and send it via e-mail to the Senior Associate Editor (SAE) for Special Issue Management of TOMM, Shervin Shirmohammadi (shervin@ieee.org).

Proposals must:

  • Cover a currently-hot or emerging topic in the area of multimedia computing, communications, and applications;
  • Set out the importance of the special issue’s topic in that area;
  • Give a strategy for the recruitment of high quality papers;
  • Indicate a draft timeline in which the special issue could be produced (paper writing, reviewing, and submission of final copies to TOMM), assuming the proposal is accepted;
  • Include a list of recent (submission deadline within the last year) or currently-open special issues in similar topics and clearly explain how the proposed SI is different from those SIs;
  • Include the list of the proposed guest editors, their short bios, and their editorial and journal/conference organization experience as related to the Special Issue’s topic.

As in previous years, the special issue will be published as an online-only issue in the ACM Digital Library. This gives the guest editors greater flexibility in the review process and in the number of papers to be accepted, while still ensuring timely publication.

The proposals will be reviewed by the SAE together with the Editor-in-Chief (EiC). Evaluation criteria include: relevance to multimedia, ability to attract many excellent submissions, a topic that is neither too specific nor too broad, quality and detail of the proposal, distinction from recent or current SIs on similar topics, experience and reputation of the guest editors, and geographic/ethnic diversity of the guest editors. The final decision will be made by the EiC. A notification of the decision will be given by June 1st, 2015. Once a proposal is accepted, we will contact you to discuss the further process.

For questions please contact:
Shervin Shirmohammadi – Senior Associate Editor for Special Issue Management shervin@ieee.org
Ralf Steinmetz – Editor in Chief (EiC) steinmetz.eic@kom.tu-darmstadt.de
Sebastian Schmidt – Information Director TOMM@kom.tu-darmstadt.de

Summary of the 5th BAMMF

Bay Area Multimedia Forum (BAMMF)

BAMMF is a Bay Area Multimedia Forum series. Experts from both academia and industry are invited to exchange ideas and information through talks, tutorials, posters, panel discussions and networking sessions. Topics of the forum include emerging areas in vision, audio, touch, speech, text, various sensors, human-computer interaction, natural language processing, machine learning, media-related signal processing, communication, and cross-media analysis. Talks at the event may cover advances in algorithms and development, demonstrations of new inventions, product innovation, business opportunities, etc. If you are interested in giving a presentation at the forum, please contact us.

The 5th BAMMF

The 5th BAMMF was held in the George E. Pake Auditorium in Palo Alto, CA, USA on November 20, 2014. The slides and videos of the speakers at the forum have been made available on the BAMMF web page, and we provide here an overview of their talks. For speakers’ bios, the slides and videos, please visit the web page.

Industrial Impact of Deep Learning – From Speech Recognition to Language and Multimodal Processing

Li Deng (Deep Learning Technology Center, Microsoft Research, Redmond, USA)

Since 2010, deep neural networks have started making real impact in the speech recognition industry, building upon earlier work on (shallow) neural nets and (deep) graphical models developed by both the speech and machine learning communities. This keynote will first reflect on the historical path to this transformative success. The role of well-timed academic-industrial collaboration will be highlighted, as will the advances in big data, big compute, and the seamless integration between application-domain knowledge of speech and general principles of deep learning. Then, an overview will be given of the sweeping achievements of deep learning in speech recognition since its initial success in 2010 (as well as in image recognition since 2012). Such achievements have resulted in across-the-board, industry-wide deployment of deep learning. The final part of the talk will focus on applications of deep learning to large-scale language/text and multimodal processing, a more challenging area where potentially much greater industrial impact than in speech and image recognition is emerging.

Brewing a Deeper Understanding of Images

Yangqing Jia (Google)

In this talk I will introduce the recent developments in the image recognition fields from two perspectives: as a researcher and as an engineer. For the first part I will describe our recent entry “GoogLeNet” that won the ImageNet 2014 challenge, including the motivation of the model and knowledge learned from the inception of the model. For the second part, I will dive into the practical details of Caffe, an open-source deep learning library I created at UC Berkeley, and show how one could utilize the toolkit for a quick start in deep learning as well as integration and deployment in real-world applications.

Applied Deep Learning

Ronan Collobert (Facebook)

I am interested in machine learning algorithms which can be applied in real-life applications and which can be trained on “raw data”. Specifically, I prefer to trade simple “shallow” algorithms with task-specific handcrafted features for more complex (“deeper”) algorithms trained on raw features. In that respect, I will present several general deep learning architectures which excel in performance on various Natural Language, Speech and Image Processing tasks. I will look into specific issues related to each application domain, and will attempt to propose general solutions for each use case.

Compositional Language and Visual Understanding

Richard Socher (Stanford)

In this talk, I will describe deep learning algorithms that learn representations for language that are useful for solving a variety of complex language tasks. I will focus on 3 projects:

  • Contextual sentiment analysis (e.g. having an algorithm that actually learns what’s positive in this sentence: “The Android phone is better than the IPhone”)
  • Question answering to win trivia competitions (like IBM Watson’s Jeopardy system but with one neural network)
  • Multimodal sentence-image embeddings to find images that visualize sentences and vice versa (with a fun demo!)

All three tasks are solved with a similar type of recursive neural network algorithm.

 

Call for Workshop Proposals @ ACM Multimedia 2015

We invite proposals for Workshops to be held at the ACM Multimedia 2015 Conference. Accepted workshops will take place in conjunction with the main conference, which is scheduled for October 26-30, 2015, in Brisbane, Australia.

We solicit proposals for two different kinds of workshops: regular workshops and data challenge.

Regular Workshops

The regular workshops should offer a forum for discussions of a broad range of emerging and specialized topics of interest to the SIG Multimedia community. There are a number of important issues to consider when preparing a workshop proposal:

  1. The topic of the proposed workshop should offer a perspective distinct from, and complementary to, the research themes of the main conference. We therefore strongly advise proposers to carefully review the themes of the main conference (which can be found here) when preparing a proposal.
  2. The SIG Multimedia community expects the workshop program to nurture and grow the workshop’s research theme towards becoming mainstream in the multimedia research field and one of the themes of the main conference in the future.
  3. Interdisciplinary theme workshops are strongly encouraged.
  4. Workshops should offer a discussion forum of a different type than that of the main conference. In particular, they should avoid becoming “mini-conferences” with accompanying keynote presentations and best paper awards. While formal presentation of ideas through regular oral sessions is allowed, we strongly encourage organizers to propose alternative ways for participants to discuss open issues, key methods and important research topics related to the workshop theme. Examples are panels, group brainstorming sessions, mini-tutorials around key ideas, and proof-of-concept demonstration sessions.

Data Challenge Workshops

We are also seeking organizers to propose Challenge-Based Workshops. Both academic and corporate organizers are welcome.

The organizers should provide a dataset that exemplifies the complexities of current and future multimodal/multimedia problems, and one or more multimodal/multimedia tasks whose performance can be objectively measured. Participants in the challenge evaluate their methods against the challenge data in order to identify areas of strength and weakness. The best-performing methods will be presented in the form of papers and oral/poster presentations at the workshop.

More information

For details on submitting workshops proposals and the evaluation criteria, please check the following site:

http://www.acmmm.org/2015/call-for-workshop-proposals/

Important dates:

  • Proposal Submission: February 10, 2015
  • Notification of Acceptance February 27, 2015

Looking forward to receiving many excellent submissions!

Alan Hanjalic, Lexing Xie and Svetha Venkatesh
Workshops Chairs, ACM Multimedia 2015

openSMILE:) The Munich Open-Source Large-scale Multimedia Feature Extractor

A tutorial for version 2.1

Introduction

The openSMILE feature extraction and audio analysis tool enables you to extract large audio (and, recently, also video) feature spaces incrementally and fast, and to apply machine learning methods to classify and analyze your data in real time. It combines acoustic features from Music Information Retrieval and Speech Processing, as well as basic computer vision features. Large, standard acoustic feature sets are included and usable out-of-the-box to ensure comparable standards in feature extraction in related research. The purpose of this article is to briefly introduce openSMILE, its features, potential, and intended use-cases, as well as to give a hands-on tutorial packed with examples that should get you started quickly with using openSMILE.

About openSMILE

SMILE is originally an acronym for Speech & Music Interpretation by Large-space feature Extraction. Due to the recent addition of video processing in version 2.0, the acronym openSMILE evolved to open-Source Media Interpretation by Large-space feature Extraction. The development of the toolkit was started at Technische Universität München (TUM) for the EU-FP7 research project SEMAINE. The original primary focus was on state-of-the-art acoustic emotion recognition for emotionally aware, interactive virtual agents. After the project, openSMILE was continuously extended into a universal audio analysis toolkit. It has been used and evaluated extensively in the series of INTERSPEECH challenges on emotion, paralinguistics, and speaker states and traits: from the first INTERSPEECH 2009 Emotion Challenge up to the upcoming challenge at INTERSPEECH 2015 (see openaudio.eu for a summary of the challenges). Since 2013 the code-base has been transferred to audEERING and the development is continued by them under a dual-license model – keeping openSMILE free for the research community.

openSMILE is written in C++ and is available both as a standalone command-line executable and as a dynamic library. The main features of openSMILE are its capability for on-line incremental processing and its modularity. Feature extractor components can be freely interconnected to create new and custom features, all via a simple text-based configuration file. New components can be added to openSMILE via an easy binary plug-in interface and an extensive internal API. Scriptable batch feature extraction is supported just as well as live on-line extraction from live recorded audio streams. This enables you to build and design systems on off-line databases, and then use exactly the same code to run your developed system in an interactive on-line prototype or even a product. openSMILE is intended as a toolkit for researchers and developers, but not for end-users. It thus cannot be configured through a Graphical User Interface (GUI). However, it is a fast, scalable, and highly flexible command-line backend application, on which several front-end applications could be based. Examples are network interface components and, in the latest release of openSMILE (version 2.1), a batch feature extraction GUI for Windows platforms, which allows users to easily choose a configuration file, the desired output files and formats, and the files and folders on which to run the analysis.

Made popular in the field of speech emotion recognition and paralinguistic speech analysis, openSMILE is now being widely used in this community. According to Google Scholar, the two papers on openSMILE ([Eyben10] and [Eyben13a]) are currently cited over 380 times. Research teams across the globe are using it for several tasks, including paralinguistic speech analysis such as alcohol intoxication detection, VoiceXML telephony-based spoken dialogue systems (as implemented by the HALEF framework), natural, speech-enabled virtual agent systems, and human behavioural signal processing, to name only a few examples.

Key Features

The key features of openSMILE are:

  • It is cross-platform (Windows, Linux, Mac, new in 2.1: Android)
  • It offers both incremental processing and batch processing.
  • It efficiently extracts a large number of features very fast by re-using already computed values.
  • It has multi-threading support for parallel feature extraction and classification.
  • It is extensible with new custom components and plug-ins.
  • It supports audio file in- and output as well as live sound recording and playback.
  • The computation of MFCC, PLP, (log-)energy, and delta regression coefficients is fully HTK compatible.
  • It has a wide range of general audio signal processing components:
    • Windowing functions (Hamming, Hann, Gauss, Sine, …),
    • Fast-Fourier Transform,
    • Pre-emphasis filter,
    • Finite Impulse Response (FIR) filterbanks,
    • Autocorrelation,
    • Cepstrum,
    • Overlap-add re-synthesis,
  • … and speech-related acoustic descriptors:
    • Signal energy,
    • Loudness based on a simplified sub-band auditory model,
    • Mel-/Bark-/Octave-scale spectra,
    • MFCC and PLP-CC,
    • Pitch (ACF and SHS algorithms and Viterbi smoothing),
    • Voice quality (Jitter, Shimmer, HNR),
    • Linear Predictive Coding (LPC),
    • Line Spectral Pairs (LSP),
    • Formants,
    • Spectral shape descriptors (Roll-off, slope, etc.),
  • … and music-related descriptors:
    • Pitch classes (semitone spectrum),
    • CHROMA and CENS features.
  • It supports multi-modal fusion on the feature level through openCV integration.
  • Several post-processing methods for low-level descriptors are included:
    • Moving average smoothing,
    • Moving average mean subtraction and variance normalization (e.g. for on-line Cepstral mean subtraction),
    • On-line histogram equalization (experimental),
    • Delta regression coefficients of arbitrary order,
    • Binary operations to re-combine descriptors.
  • A wide range of statistical functionals for feature summarization is supported, e.g.:
    • Means, Extremes,
    • Moments,
    • Segment statistics,
    • Sample-values,
    • Peak statistics,
    • Linear and quadratic regression,
    • Percentiles,
    • Durations,
    • Onsets,
    • DCT coefficients,
    • Zero-crossings.
  • Generic and popular data file formats are supported:
    • Hidden Markov Toolkit (HTK) parameter files (read/write)
    • WEKA Arff files (currently only non-sparse) (read/write)
    • Comma separated value (CSV) text (read/write)
    • LibSVM feature file format (write)

In the latest release (2.1) the new features are:

  • Integration and improvement of the emotion recognition models from openEAR,
  • LSTM-RNN based voice-activity detector prototype models included,
  • A fast linear SVM sink component which supports linear-kernel SVM models trained with the WEKA SMO classifier,
  • LSTM-RNN JSON network file support for networks trained with the CURRENNT toolkit,
  • Spectral harmonics descriptors,
  • Android support,
  • Improvements to configuration files and command-line options,
  • Improvements and fixes.

openSMILE’s architecture

openSMILE has a very modular architecture, designed for incremental data-flow. A central dataMemory component hosts shared memory buffers (known as dataMemory levels) to which a single component can write data and from which one or more other components can read data. There are data-source components, which read data from files or other external sources and introduce them to the dataMemory. Then there are data-processor components, which read data, modify them, and save them to a new buffer – these are the actual feature extractor components. Finally, data-sink components read the final data and save them to files or digest them in other ways (classifiers etc.). As all components which process data and connect to the dataMemory share some common functionality, they are all derived from a single base class, cSmileComponent; each such component is connected to the dataMemory through its cDataWriter and cDataReader sub-components.

Getting openSMILE and the documentation

The latest openSMILE packages can be downloaded from the openSMILE website. At the time of writing the most recent release is 2.1. Grab the complete package of the latest release; it includes the source code and the binaries for Linux and Windows. The most up-to-date releases might not always include a full-blown set of binaries for all platforms, so sometimes you might have to compile from source if you want the latest cutting-edge version. While the tutorial in the next section should give you a good quick-start, it does not and cannot cover every detail of openSMILE. For learning more and getting further help, there are three main resources: The first is the openSMILE documentation, called the openSMILE book. It contains detailed instructions on how to install, compile, and use openSMILE and introduces you to the basics of openSMILE. However, it might not be the most up-to-date resource for the newest features. The second resource is the on-line help built into the binaries, which provides the most up-to-date documentation of the available components and their options and features. We will tell you how to use the on-line help in the next section. If you cannot find your answer in either of these resources, you can ask for help in the discussion forums on the openSMILE website or read the source code.

Quick-start tutorial

You can’t wait to get openSMILE and try it out on your own data? Then this is your section. In the following the basic concepts of openSMILE are described, pre-built use-cases of automatic, on-line voice activity detection and speech emotion recognition are presented, and the concept of configuration files and the data-flow architecture are explained.

a. Basic concepts

Please refer to the openSMILE book for detailed installation and compilation instructions. Here we assume that you have a compiled SMILExtract binary (optionally with PortAudio support, if you want to use the live audio recording examples below), with which you can run:

SMILExtract -h
SMILExtract -H cWaveSource

to see general usage instructions (first line) and the on-line help for the cWaveSource component (second line), for example. However, from this on-line help it is hard to get a general picture of the openSMILE concepts. We thus briefly describe how to use openSMILE for the most common tasks. Loosely speaking, the SMILExtract binary can be seen as a special kind of code interpreter which executes custom configuration scripts. What openSMILE actually does when you invoke it is controlled entirely by this configuration script. So, in order to do something with openSMILE you need:

  • The binary SMILExtract,
  • a (set of) configuration file(s),
  • and optionally other files, such as classification models, etc.

The configuration file defines all the components that are to be used as well as their data-flow interconnections. All the components are run iteratively in the “tick-loop”, i.e. a run method (tick()) of each component is called in every loop iteration. Each component then checks if there are new data to process and, if so, processes the data and makes them available for other components to process further. Every component returns a status value, which indicates whether the component has processed data or not. If no component has had any further data to process, the end of the data input (EOI) is assumed. All components are then switched to an EOI state and the tick-loop is executed again to process data which require special attention at the end of the input, such as delta-regression coefficients. Since version 2.0-rc1, multi-pass processing is supported, i.e. the whole processing can be re-run in several passes. Using this is not encouraged, since it breaks incremental processing, but for some experiments it might be necessary. The minimal, generic use-case scenario for openSMILE is thus as follows:

SMILExtract -C config/my_configfile.conf

Each configuration file can define additional command-line options. The most prominent examples are the options for input and output files (-I and -O). These options are not shown when the normal help is invoked with the -h option. To show the options defined by a configuration file, use this command line:

SMILExtract -ccmdHelp -C config/my_configfile.conf

The default command-line for processing audio files for feature extraction is:

SMILExtract -C config/my_configfile.conf -I input_file.wav -O output_file

This runs SMILExtract with the configuration given in my_configfile.conf. The following two sections will show you how to quickly get some advanced applications running as pre-configured use-cases for voice activity detection and speech emotion recognition.

b. Use-case: The openSMILE voice-activity detector

The latest openSMILE release (2.1) contains a research prototype of an intelligent, data-driven voice-activity detector (VAD) based on Long Short-Term Memory Recurrent Neural Networks (LSTM-RNN), similar to the system introduced in [Eyben13b]. The VAD examples are contained in the folder scripts/vad; a README in that folder describes further details. Here we give a brief tutorial on how to use the two included use-case examples:

  • vad_opensource.conf: Runs the LSTM-RNN VAD and dumps the activations (voice probability) for each frame to a CSV text file. To run the example on a wave file, type:
    cd scripts/vad;
    SMILExtract -I ../../example-audio/media-interpretation.wav \
                 -C vad_opensource.conf -csvoutput vad.csv

    This will write the VAD probabilities, scaled to the range -1 to +1 (2nd column), and the corresponding timestamps (1st column) to vad.csv. A VAD probability greater than 0 indicates voice presence.

  • vad_segmenter.conf: Runs the VAD on an input wave file and automatically extracts the voice segments to new wave files. Optionally, the raw voicing probabilities can be saved to a file as in the example above. To run the example on a wave file, type:
    cd scripts/vad;
    mkdir -p voice_segments
    SMILExtract -I ../../example-audio/media-interpretation.wav -C vad_segmenter.conf \
                -waveoutput voice_segments/segment_

    This will create a new wave file for each detected voice segment (numbered consecutively, starting at 1). The vad_segmenter.conf optionally supports output to CSV with the -csvoutput filename option. The start and end times (in seconds) of the voice segments relative to the start of the input file can optionally be dumped with the -saveSegmentTimes filename option. The columns of the output file are: segment filename, start (sec.), end (sec.), and length of the segment as a number of raw (10 ms) frames.

To visualise the VAD output over the waveform, we recommend using Sonic Visualiser. If you have Sonic Visualiser installed (on Linux) you can open both the wave file and the VAD output with this command:

sonic-visualiser example-audio/media-interpretation.wav vad.csv

An annotation layer import dialog should appear. The first column should be detected as Time and the second column as Value. If this is not the case, select these values manually, specify that timing is given explicitly (this should be the default), and click OK. You should then see the VAD probability curve overlaid on the waveform.

c. Use-case: Automatic speech emotion recognition

As of version 2.1, openSMILE supports running the emotion recognition models from the openEAR toolkit [Eyben09] in a live emotion recognition demo. In order to start this live speech emotion recognition demo, download the speech emotion recognition models and unzip them in the top-level folder of the openSMILE package. A folder named models should be created there, which contains a README.txt and a sub-folder emo. If this is the case, you are ready to run the demo. Type:

SMILExtract -C config/emobase_live4.conf

to run it. The classification output will be shown on the console. NOTE: This example requires a binary compiled with PortAudio support. Refer to the openSMILE book for details on how to compile your binary with PortAudio support on Linux. For Windows, pre-compiled binaries (SMILExtractPA*.exe) are included, which should be used instead of the standard SMILExtract.exe for the above example. If you want to choose a different audio recording device, use:

SMILExtract -C config/emobase_live4.conf -device ID

To see a list of available devices and their IDs, type:

SMILExtract -C config/emobase_live4.conf -listdevices

Note: If you have a different directory layout or have installed SMILExtract in a system path, you must make sure that the models are located in a directory named “models” in the directory from which you call the binary, or you must adapt the path to the models in the configuration file (emobase_live4.conf). In openSMILE 2.1, the emotion recognition models can also be used for off-line/batch analysis. Two configuration files are provided for this purpose: config/emobase_live4_batch.conf and config/emobase_live4_batch_single.conf. The latter computes a single feature vector for the input file and returns a single result; use it if your audio files are already chunked into short phrases or sentences. The former, emobase_live4_batch.conf, runs an energy-based segmentation on the input and returns a result for every segment; use it for longer, un-cut audio files. To run the analysis in batch mode, type:

SMILExtract -C config/emobase_live4_batch(_single).conf -I example-audio/opensmile.wav > result.txt

This will redirect the result(s) from SMILExtract’s standard output (console) to the file result.txt. The file is by default in a machine-parseable format, where key=value tokens are separated by :: and a single result is given on each line, for example:

SMILE-RESULT::ORIGIN=libsvm::TYPE=regression::COMPONENT=arousal::VIDX=0::NAME=(null)::
     VALUE=1.237816e-01
SMILE-RESULT::ORIGIN=libsvm::TYPE=regression::COMPONENT=valence::VIDX=0::NAME=(null)::
     VALUE=1.825088e-01
SMILE-RESULT::ORIGIN=libsvm::TYPE=classification::COMPONENT=emodbEmotion::VIDX=0::
     NAME=(null)::CATEGORY_IDX=2::CATEGORY=disgust::PROB=0;anger:0.033040::
     PROB=1;boredom:0.210172::PROB=2;disgust:0.380724::PROB=3;fear:0.031658::
     PROB=4;happiness:0.016040::PROB=5;neutral:0.087751::PROB=6;sadness:0.240615
SMILE-RESULT::ORIGIN=libsvm::TYPE=classification::COMPONENT=abcAffect::VIDX=0::
    NAME=(null)::CATEGORY_IDX=0::CATEGORY=agressiv::PROB=0;agressiv:0.614545::
    PROB=1;cheerful:0.229169::PROB=2;intoxicated:0.037347::PROB=3;nervous:0.011133::
    PROB=4;neutral:0.091070::PROB=5;tired:0.016737
SMILE-RESULT::ORIGIN=libsvm::TYPE=classification::COMPONENT=avicInterest::VIDX=0::
    NAME=(null)::CATEGORY_IDX=1::CATEGORY=loi2::PROB=0;loi1:0.006460::
    PROB=1;loi2:0.944799::PROB=2;loi3:0.048741

The above example is the result of the analysis of the file example-audio/media-interpretation.wav.
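Because each result occupies a single line in the file (the line breaks in the listing above are only for display), the output can be post-processed with standard text tools. The snippet below is a minimal sketch, not part of openSMILE itself, that assumes the result.txt produced by the batch command above and prints the predicted category of every classification component:

grep 'TYPE=classification' result.txt | awk -F'::' '{
    comp = ""; cat = ""
    # scan the ::-separated tokens for the component name and the winning category
    for (i = 1; i <= NF; i++) {
        if ($i ~ /^COMPONENT=/) comp = substr($i, 11)
        if ($i ~ /^CATEGORY=/)  cat  = substr($i, 10)
    }
    print comp ": " cat
}'

For the sample output above, this would print lines such as “emodbEmotion: disgust” and “abcAffect: agressiv”.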

d. Understanding configuration files

The pre-configured examples above are a good quick-start that shows the diverse potential of the tool. We will now take a deeper look at openSMILE configuration files. First, we will use simple, small configuration files and modify them in order to understand the basic concepts of these files. Then, we will show you how to write your own configuration files from scratch. The demo files used in this section are provided in the 2.1 release package in the folder config/demo. We start with demo1_energy.conf, which extracts basic frame-wise logarithmic energy. To run this file on one of the included audio examples in the folder example-audio, type the following command:

SMILExtract -C config/demo/demo1_energy.conf -I example-audio/opensmile.wav -O energy.csv

This will create a file called energy.csv containing the frame timestamps and the corresponding log-energy values. The second example we discuss here is the audio recorder example (audiorecorder.conf). NOTE: This example requires a binary compiled with PortAudio support. Refer to the openSMILE book for details on how to compile your binary with PortAudio support on Linux. For Windows, pre-compiled binaries (SMILExtractPA*.exe) are included, which should be used instead of the standard SMILExtract.exe for the following example. This example implements a simple live audio recorder: audio is recorded from the default audio device to an uncompressed PCM wave file. To run the example and record to rec.wav, type:

SMILExtract -C config/demo/audiorecorder.conf -O rec.wav

Modifying existing configuration files is the fastest way to create custom extraction scripts. We will now change the demo1_energy.conf file to extract Root-Mean-Square (RMS) energy instead of logarithmic energy. This can be achieved by changing the respective options in the section of the cEnergy component (identified by the section heading [energy:cEnergy]) from

rms = 0
log = 1

to

rms = 1
log = 0
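For orientation, the relevant section of the modified file might then look roughly like the following sketch (the reader/writer level names here are assumptions for illustration only; keep whatever demo1_energy.conf actually defines and change only the rms and log options):

[energy:cEnergy]
reader.dmLevel = frames
writer.dmLevel = energy
rms = 1
log = 0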

As a second example, we will merge audiorecorder.conf and demo1_energy.conf to create a configuration file which computes the frame-wise RMS energy from live audio input. First, we start with concatenating the two files. On Linux, type:

cat config/demo/audiorecorder.conf config/demo/demo1_energy.conf > config/demo/live_energy.conf

On Windows, use a text editor such as Notepad++ to combine the files via copy and paste. Now we must remove the cWaveSource component from the original demo1_energy.conf, as this should be replaced by the cPortaudioSource component of the audiorecorder.conf file. To do this, we search for the line

instance[waveSource].type = cWaveSource

and comment it out by prefixing it with a ; or the C-style // or the script- and INI-style #. We also remove the corresponding configuration file section for waveSource. We do the same for the waveSink component and its section, to leave only the output of the computed frame-wise energy to a CSV file. Theoretically, we could also keep the waveSink component and its section, but we would then need to change the command-line option defined for its output filename, as otherwise the same option would control both the CSV output and the wave-file output. In this case we should replace the filename option in the waveSink section by:

filename = \cm[waveoutput{output.wav}:name of output wave file]
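Whichever variant you choose, the component list of the merged live_energy.conf should end up looking roughly like the following sketch (instance names are illustrative and must match the section names actually used in the two demo files; here the waveSink is commented out rather than kept):

[componentInstances:cComponentManager]
instance[dataMemory].type = cDataMemory
; audio input from the sound card (from audiorecorder.conf)
instance[portaudioSource].type = cPortaudioSource
; the original file input is replaced by the PortAudio source above
;instance[waveSource].type = cWaveSource
instance[framer].type = cFramer
instance[energy].type = cEnergy
; CSV output of the frame-wise energy (from demo1_energy.conf)
instance[csvSink].type = cCsvSink
; the wave-file output is not needed here
;instance[waveSink].type = cWaveSink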

Now, run your new configuration file with:

SMILExtract -C config/demo/live_energy.conf -O live_energy.csv

and inspect the contents of the live_energy.csv file with a text editor.

openSMILE configuration files are made up of sections, similar to INI files. Each section is identified by a header which takes the form:

[instancename:cComponentType]

The first part (instancename) is a custom-chosen name for the section. It must be unique throughout the whole configuration file and all included sub-files. The second part defines the type of this configuration section and thereby its allowed contents. The configuration section type name must be one of the available component names (from the list printed by the command SMILExtract -L), as configuration file sections are linked to component instances.

The contents of each section are lines of key=value pairs, until the next section header is found. Besides simple key=value pairs as in INI files, a more advanced structure is supported by openSMILE. The key can be a hierarchical value built of key1.subkey, for example, or an array such as keyarray[0] and keyarray[1]. The value field, in turn, can denote an array of values, if the values are separated by a semicolon (;). Quotes around values are not needed and not yet supported, and multi-line values are not allowed. Boolean flags are always expressed as numeric values, with 1 for on or true and 0 for off or false. The keys are referred to as the configuration options of the components, i.e. those listed by the on-line help (SMILExtract -H cComponentType). Since version 2.1, configuration sections can be split into multiple parts across the configuration file. That is, the same header (same instancename and typename) may occur more than once; in that case the options from all occurrences are joined.

There is one configuration section that must always be present: that of the component manager:

[componentInstances:cComponentManager]
instance[dataMemory].type = cDataMemory
instance[instancename].type = cComponentType
instance[instancename2].type = cComponentType2
...

The component manager is the main instance which creates all component instances of the currently loaded configuration, makes them read their configuration settings from the parsed configuration file (through the configManager component), and runs the tick-loop, i.e. the loop in which data is processed incrementally by calling each component once per iteration to process newly available data frames. Each component that shall be included in the configuration must be listed in this section, and for each component listed there, a corresponding configuration file section with the same instancename and of the same component type must exist. The only exception is the first line, which instantiates the central dataMemory component. It must always be present in the instance list, but no configuration file section has to be supplied for it.

Each component that processes data has a data-reader and/or a data-writer sub-component, which are configurable via the reader and writer objects. The only options of interest to us now in these objects are the dmLevel options. These options configure the data-flow connections in your configuration file, i.e. they define in which order data is processed by the components, or in other words, which component is connected to which other component: Each component that modifies or creates data (e.g. by reading it from external sources) writes its data to a unique dataMemory location (called a level). The name of this location is defined in the configuration file via the option writer.dmLevel=name_of_level. The level names must be unique, and only one single component can write to each level. Multiple components can, however, read from a single level, enabling re-use of already computed data by multiple components. For example, we typically have a wave source component which reads audio data from an uncompressed audio file (see also the demo1_energy.conf file):

[wavesource:cWaveSource]
writer.dmLevel = wave
filename = input.wav

The above section reads data from input.wav into the dataMemory level wave. If we next want to chunk the audio data into overlapping analysis windows of 20 ms length at a rate of 10 ms, we need a cFramer component:

[framer:cFramer]
reader.dmLevel = wave
writer.dmLevel = frames20ms
frameSize = 0.02
frameStep = 0.01

The crucial line in the above code is the one which sets the reader dataMemory level (reader.dmLevel = wave) to the output level of the wave source component, which effectively connects the framer to the wave source.

To create new configuration files from scratch, a configuration file template generator is available. We will use it to create a configuration for computing magnitude spectra via the Fast Fourier Transform (FFT). The template file generator requires a list of the components that we want to have in the configuration file, so we must build this list first. In openSMILE, most processing steps are wrapped in individual components to increase flexibility and the re-usability of intermediate data. For our example we thus need the following components:

  • An audio file reader (cWaveSource),
  • a component which generates short-time analysis frames (cFramer),
  • a component which applies a windowing function to these frames such as a Hamming window (cWindower),
  • a component which performs an FFT (cTransformFFT),
  • a component which computes spectral magnitudes from the complex FFT result (cFFTmagphase),
  • and finally a component which writes the magnitude spectra to a CSV file (cCsvSink).

To generate our configuration file template, we thus run (note that the component names are case-sensitive!):

SMILExtract -l 0 -logfile my_fft_magnitude.conf -cfgFileTemplate -configDflt cWaveSource,cFramer,
    cWindower,cTransformFFT,cFFTmagphase,cCsvSink

The switch -cfgFileTemplate enables the template file output and makes -configDflt accept a comma-separated list of component names. If -configDflt is used by itself, it prints only the default configuration section of a single component (whose name is given as the argument to that option). This invocation of SMILExtract prints the configuration file template to the log, i.e., to standard error and to the (log-)file given by the -logfile option. The switch -l 0 suppresses all other log messages (by setting the log level to 0), leaving only the configuration file template lines in the specified file.

The file generated by the above command cannot be used as is just yet: we need to update the data-flow connections first. In our example this is trivial, as each component reads from the previous one, except for the wave source, which has no reader. We have to change:

[waveSource:cWaveSource]
writer.dmLevel = < >

to

[waveSource:cWaveSource]
writer.dmLevel = wave

The same for the framer, resulting in:

[framer:cFramer]
reader.dmLevel = wave
writer.dmLevel = frames

and for the windower:

[windower:cWindower]
reader.dmLevel = frames
writer.dmLevel = windowed
...
winFunc = Hamming
...

where we also change the windowing function from the default (Hanning) to Hamming, and in the same fashion we go down all the way to the csvSink component:

[transformFFT:cTransformFFT]
reader.dmLevel = windowed
writer.dmLevel = fftcomplex

...

[fFTmagphase:cFFTmagphase]
reader.dmLevel = fftcomplex
writer.dmLevel = fftmag

...

[csvSink:cCsvSink]
reader.dmLevel = fftmag

The configuration file can now be used with the command:

SMILExtract -C my_fft_magnitude.conf

However, if you run the above, you will most likely get an error message that the file input.wav is not found. This is actually good news: it means you have configured the data-flow correctly. (If you have not, you will get error messages about missing data memory levels and the like.) The missing-file problem is due to the hard-coded input file name given by the option filename = input.wav in the wave source section. If you change this line to filename = example-audio/opensmile.wav, your configuration will run without errors and write the result to a file called smileoutput.csv.

To avoid having to change the filenames in the configuration file for every input file you want to process, openSMILE provides a very convenient feature: it allows you to define command-line options in the configuration files. To use this feature, you replace the value of the filename option with a \cm[] command, e.g. for the input file:

filename = \cm[inputfile(I){input.wav}:input filename]

and for the output file:

filename = \cm[outputfile(O){output.csv}:output filename]

The syntax of the \cm command is: [longoptionName(shortOption-1charOnly){default value}:description for on-line help].
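With these two \cm options in place, the configuration can be invoked like the stock configuration files, passing the input and output names on the command line, for example:

SMILExtract -C my_fft_magnitude.conf -I example-audio/opensmile.wav -O fft_magnitudes.csv

Here -I and -O are the short option names defined in the \cm commands above; fft_magnitudes.csv is just an example output name.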

e. Reference feature sets

A major advantage of openSMILE over related feature extraction toolkits is that it comes with several reference and baseline feature sets which were used for the INTERSPEECH Challenges (2009-2014) on Emotion, Paralinguistics, and Speaker States and Traits, as well as for the Audio-Visual Emotion Challenges (AVEC) from 2011-2013. All of the INTERSPEECH configuration files are found under config/ISxx_*.conf.

All the INTERSPEECH Challenge configuration files follow a common standard regarding the data output options they define. The default output file option (-O) defines the name of the WEKA ARFF file to which functionals are written. To additionally save the data in CSV format, use the option -csvoutput filename. To disable the default ARFF output, use -O ?. To enable saving of intermediate parameters, i.e. frame-wise Low-Level Descriptors (LLD), in CSV format, the option -lldoutput filename can be used. By default, lines are appended to the functionals ARFF and CSV files if they exist, while the LLD files are overwritten. To change this behaviour, the boolean (1/0) options -appendstaticarff 1/0, -appendstaticcsv 1/0, and -appendlld 0/1 are provided.

Besides the Challenge feature sets, openSMILE 2.1 is capable of extracting parameters for the Geneva Minimalistic Acoustic Parameter Set (GeMAPS; submitted for publication as [Eyben14], configuration files will be made available together with the publication of the article), which is a small set of acoustic parameters relevant for affective voice research. It was standardized and agreed upon by several research teams, including linguists, psychologists, and engineers. Besides these large-scale brute-forced acoustic feature sets, several other configuration files are provided for extracting individual LLD. These include Mel-Frequency Cepstral Coefficients (MFCC*.conf) and Perceptual Linear Predictive Coding Cepstral Coefficients (PLP*.conf), as well as the fundamental frequency and loudness (prosodyShsViterbiLoudness.conf, or smileF0.conf for fundamental frequency only).
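As a concrete example, a single file could be processed with one of the Challenge configurations and the output options described above roughly as follows (assuming the INTERSPEECH 2009 Emotion Challenge set is named IS09_emotion.conf, as in the 2.1 release; the output file names are arbitrary examples):

SMILExtract -C config/IS09_emotion.conf -I example-audio/opensmile.wav -O is09_func.arff -csvoutput is09_func.csv -lldoutput is09_lld.csv

This writes the functionals both to an ARFF file and a CSV file, and the frame-wise LLD to a separate CSV file.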

Conclusion and summary

We have introduced openSMILE version 2.1 in this article and have given a hands-on practical guide on how to use it to extract audio features with out-of-the-box baseline feature sets, as well as customized acoustic descriptors. We have also shown how to use the voice activity detector and the pre-trained emotion models from the openEAR toolkit for live, incremental emotion recognition. The openSMILE toolkit features a large collection of baseline acoustic feature sets for paralinguistic speech and music analysis and a flexible and complete framework for audio analysis. In future work, more effort will be put into documentation, speed-up of the underlying framework, and the implementation of new, robust acoustic and visual descriptors.

Acknowledgements

This research was supported by an ERC Advanced Grant in the European Community’s 7th Framework Programme under grant agreement 230331-PROPEREMO (Production and perception of emotion: an affective sciences approach) to Klaus Scherer and by the National Center of Competence in Research (NCCR) Affective Sciences financed by the Swiss National Science Foundation (51NF40-104897) and hosted by the University of Geneva. The research leading to these results has received funding from the European Community’s Seventh Framework Programme under grant agreement No. 338164 (ERC Starting Grant iHEARu). The authors would like to thank audEERING UG (haftungsbeschränkt) for providing up-to-date pre-release documentation, computational resources, and great support in maintaining the free open-source releases.