ACM SIGMM/TOMM 2015 Award Announcements

The ACM Special Interest Group in Multimedia (SIGMM) and the ACM Transactions on Multimedia Computing, Communications and Applications (TOMM) are pleased to announce the following awards for 2015, recognizing outstanding achievements and services in the multimedia community.
SIGMM Technical Achievement Award:
Dr. Tat-Seng Chua, National University of Singapore

SIGMM Rising Star Award:
Dr. Yu-Gang Jiang, Fudan University

SIGMM Best Ph.D. Thesis Award:
Dr. Ting Yao, City University of Hong Kong (currently Microsoft Research)

TOMM Nicolas D. Georganas Best Paper Award:
“A Quality of Experience Model for Haptic Virtual Environments” by Abdelwahab Hamam, Abdulmotaleb El Saddik, and Jihad Alja’am, published in TOMM, vol. 10, Issue 3, 2014.

TOMM Best Associate Editor Award:
Dr. Pradeep K. Atrey, State University of New York, Albany
Additional information about each award and recipient is available on the SIGMM web site:
http://www.sigmm.org/
The awards will be presented at the annual SIGMM event, the ACM Multimedia Conference, to be held in Brisbane, Australia, October 26-30, 2015.
ACM is the professional society of computer scientists, and SIGMM is the special interest group on multimedia. TOMM (formerly TOMCCAP) is the flagship journal publication of SIGMM.

MPEG Column: 112th MPEG Meeting

This blog post is also available at the bitmovin tech blog and blog.timmerer.com.

The 112th MPEG meeting in Warsaw, Poland was a special meeting for me. It was my 50th MPEG meeting, which roughly adds up to one year of MPEG meetings (i.e., one year of my life I’ve spent in MPEG meetings incl. traveling – scary, isn’t it? … more on this in another blog post). But what happened at this 112th MPEG meeting (my 50th meeting)…

  • Requirements: CDVA, Future of Video Coding Standardization (no acronym yet), Genome compression
  • Systems: M2TS (ISO/IEC 13818-1:2015), DASH 3rd edition, Media Orchestration (no acronym yet), TRUFFLE
  • Video/JCT-VC/JCT-3D: MPEG-4 AVC, Future Video Coding, HDR, SCC
  • Audio: 3D audio
  • 3DG: PCC, MIoT, Wearable

MPEG Friday Plenary. Photo (c) Christian Timmerer.

As usual, the official press release and other publicly available documents can be found here. Let’s dig into the different subgroups:

Requirements

In the requirements subgroup, experts were working on the Call for Proposals (CfP) for Compact Descriptors for Video Analysis (CDVA), including an evaluation framework. The evaluation framework includes 800-1000 objects (large objects like building facades, landmarks, etc.; small(er) objects like paintings, books, statues, etc.; and scenes like interior scenes, natural scenes, and multi-camera shots), and the evaluation of the responses is planned for the 114th meeting in San Diego.

The future of video coding standardization is currently being shaped in MPEG, paving the way for the successor of the HEVC standard. The current goal is to provide (native) support for scalability (more than two spatial resolutions) and a 30% compression gain for some applications (requiring a limited increase in decoder complexity), with a 50% compression gain actually preferred (at a significant increase in encoder complexity). MPEG will hold a workshop at the next meeting in Geneva discussing specific compression techniques, objective (HDR) video quality metrics, and compression technologies for specific applications (e.g., multiple-stream representations, energy-saving encoders/decoders, games, drones). The goal is to have the International Standard for this new video coding standard around 2020.

MPEG has recently started a new project referred to as Genome Compression, which is, of course, about the compression of genome information. A big dataset has been collected and experts are working on the Call for Evidence (CfE). The plan is to hold a workshop at the next MPEG meeting in Geneva on the prospects of genome compression and storage standardization, targeting users, manufacturers, service providers, technologists, etc.

Summer in Warsaw. Photo (c) Christian Timmerer.

Systems

The 5th edition of the MPEG-2 Systems standard has been published as ISO/IEC 13818-1:2015 on the 1st of July 2015 and is a consolidation of the 4th edition + Amendments 1-5.

In terms of MPEG-DASH, the draft text of ISO/IEC 23009-1 3rd edition, comprising the 2nd edition + COR 1 + AMD 1 + AMD 2 + AMD 3 + COR 2, is available for committee-internal review. Publication is expected, most likely, in 2016. Currently, MPEG-DASH includes a lot of activity in the following areas: spatial relationship description, generalized URL parameters, authentication, access control, multiple MPDs, full duplex protocols (aka HTTP/2 etc.), advanced and generalized HTTP feedback information, and various core experiments:

  • SAND (Server and Network Assisted DASH)
  • FDH (Full Duplex DASH)
  • SAP-Independent Segment Signaling (SISSI)
  • URI Signing for DASH
  • Content Aggregation and Playback Control (CAPCO)

In particular, the core experiment process is very open as most work is conducted during the Ad hoc Group (AhG) period which is discussed on the publicly available MPEG-DASH reflector.

MPEG systems recently started an activity related to media orchestration, which applies to capture as well as consumption and concerns scenarios with multiple sensors as well as multiple rendering devices, including one-to-many and many-to-one scenarios, resulting in a worthwhile, customized experience.

Finally, the systems subgroup started an exploration activity regarding real-time streaming of files (a.k.a. TRUFFLE), which should perform a gap analysis leading to extensions of the MPEG Media Transport (MMT) standard. However, some experts within MPEG concluded that most (or all) use cases identified within this activity could actually be solved with existing technology such as DASH. Thus, this activity may still need some discussion…

Video/JCT-VC/JCT-3D

The MPEG video subgroup is working towards a new amendment for the MPEG-4 AVC standard covering resolutions up to 8K and higher frame rates for lower resolutions. Interestingly, although MPEG is most of the time ahead of industry, 8K and high frame rates are already supported in browser environments (e.g., using bitdash 8K, HFR) and modern encoding platforms like bitcodin. However, it’s good that we finally have means for interoperable signaling of this profile.

In terms of future video coding standardization, the video subgroup released a call for test material. Two sets of test sequences are already available and will be investigated regarding compression until the next meeting.

After a successful call for evidence for High Dynamic Range (HDR), the technical work has started in the video subgroup with the goal of developing an architecture (“H2M”) as well as three core experiments (optimization without HEVC specification changes, alternative reconstruction approaches, and objective metrics).

The main topic of the JCT-VC was screen content coding (SCC), which came up with new coding tools that better compress content that is (fully or partially) computer generated, leading to a significant improvement in compression of approximately 50% (or more) rate reduction for specific screen content.

Audio

The audio subgroup is mainly concentrating on 3D audio where they identified the need for intermediate bitrates between 3D audio phase 1 and 2. Currently, phase 1 identified 256, 512, 1200 kb/s whereas phase 2 focuses on 128, 96, 64, 48 kb/s. The broadcasting industry needs intermediate bitrates and, thus, phase 2 is extended to bitrates between 128 and 256 kb/s.

3DG

MPEG 3DG is working on point cloud compression (PCC), for which open source software has been identified. Additionally, there are new activities in the areas of the Media Internet of Things (MIoT) and wearable computing (like glasses and watches) that could lead to new standards developed within MPEG. Therefore, stay tuned on these topics as they may shape your future.

The week after the MPEG meeting I met the MPEG convenor and the JPEG convenor again during ICME2015 in Torino but that’s another story…

L. Chiariglione, H. Hellwagner, T. Ebrahimi, C. Timmerer (from left to right) during ICME2015. Photo (c) T. Ebrahimi.

Call for Nominations: Editor-In-Chief of ACM TOMM

The term of the current Editor-in-Chief (EiC) of the ACM Transactions on Multimedia Computing, Communications, and Applications (TOMM) (http://tomm.acm.org/) is coming to an end, and the ACM Publications Board has set up a nominating committee to assist the Board in selecting the next EiC. Nominations, including self-nominations, are invited for a three-year term as TOMM EiC, beginning on 1 January 2016. The EiC appointment may be renewed at most one time. This is an entirely voluntary position, but ACM will provide appropriate administrative support.

The EiC is responsible for maintaining the highest editorial quality, for setting the technical direction of the papers published in TOMM, and for maintaining a reasonable pipeline of articles for publication. He/she has final say on acceptance of papers, the size of the Editorial Board, and the appointment of Associate Editors. The EiC is expected to adhere to the commitments expressed in the policy on Rights and Responsibilities in ACM Publishing (http://www.acm.org/publications/policies/RightsResponsibilities). For more information about the role of the EiC, see ACM’s Evaluation Criteria for Editors-in-Chief (http://www.acm.org/publications/policies/evaluation/).

Nominations should include a vita along with a brief statement of why the nominee should be considered. Self-nominations are encouraged, and should include a statement of the candidate’s vision for the future development of TOMM. The deadline for submitting nominations is 24 July 2015, although nominations will continue to be accepted until the position is filled. Please send all nominations to the nominating committee chair, Nicu Sebe (sebe@disi.unitn.it). The search committee members are:

  • Nicu Sebe (University of Trento), Chair
  • Rainer Lienhart (University of Augsburg)
  • Alejandro Jaimes (Yahoo!)
  • John R. Smith (IBM)
  • Lynn Wilcox (FXPAL)
  • Wei Tsang Ooi (NUS)
  • Mary Lou Soffa (University of Virginia), ACM Publications Board Liaison

GamingAnywhere: An Open-Source Cloud Gaming Platform

Overview

GamingAnywhere is an open-source cloud gaming platform. In addition to its openness, we designed GamingAnywhere for high extensibility, portability, and reconfigurability. GamingAnywhere currently supports Windows and Linux, and can be ported to other OS’s, including OS X and Android. Our performance study demonstrates that GamingAnywhere achieves high responsiveness and video quality yet imposes low network traffic [1,2]. The value of GamingAnywhere, however, comes from its openness: researchers, service providers, and gamers may customize GamingAnywhere to meet their needs. This is not possible with other, closed and proprietary cloud gaming platforms.

Figure 1: A demonstration of the GamingAnywhere system. There are four devices in the photo: one game server (the laptop on the left-hand side) and three game clients (a MacBook, an Android phone, and an iPad 2).

Motivation

Computer games have become very popular; for example, gamers spent 24.75 billion USD on computer games, hardware, and accessories in 2011. Traditionally, computer games are delivered either in boxes or via Internet downloads, and gamers have to install the games on physical machines to play them. The installation process has become extremely tedious because the games are complicated and the computer hardware and system software are heavily fragmented. Take Blizzard’s StarCraft II as an example: it may take more than an hour to install it on an i5 PC, and another hour to apply the online patches. Furthermore, gamers may find that their computers are not powerful enough to enable all the visual effects while still achieving high frame rates. Hence, gamers have to repeatedly upgrade their computers in order to play the latest computer games.

Cloud gaming is a better way to deliver a high-quality gaming experience and opens new business opportunities. In a cloud gaming system, computer games run on powerful cloud servers, while gamers interact with the games via networked thin clients. The thin clients are light-weight and can be ported to resource-constrained platforms, such as mobile devices and TV set-top boxes. With cloud gaming, gamers can play the latest computer games anywhere and anytime, while game developers can optimize their games for a specific PC configuration. The huge potential of cloud gaming has been recognized by the game industry: (i) a market report predicts that the cloud gaming market will increase 9 times between 2011 and 2017, and (ii) several cloud gaming startups were recently acquired by leading game developers.

Although cloud gaming is a promising direction for the game industry, achieving good user experience without excessive hardware investment is a tough problem. This is because gamers are hard to please: they simultaneously demand high responsiveness and high video quality, but do not want to pay too much. Therefore, service providers have to not only design their systems to meet the gamers’ needs but also take error resiliency, scalability, and resource allocation into consideration. This renders the design and implementation of cloud gaming systems extremely challenging. Indeed, while real-time video streaming seems to be a mature technology at first glance, cloud gaming systems have to execute games, handle user inputs, and perform rendering, capturing, encoding, packetizing, transmitting, decoding, and displaying in real time, and are thus much more difficult to optimize.

We observe that many systems researchers have new ideas to improve the cloud gaming experience for gamers and to reduce capital expenditure (CAPEX) and operational expenditure (OPEX) for service providers. However, all existing cloud gaming platforms are closed and proprietary, which prevents researchers from testing their ideas on real cloud gaming systems. Therefore, the new ideas were either only tested using simulators/emulators or, worse, never evaluated and published. Hence, very few new ideas on cloud gaming (specifically) or highly interactive distributed systems (more generally) have been transferred to industry. To better bridge the multimedia research community and the game/software industry, we presented GamingAnywhere, the first open-source cloud gaming testbed, in April 2013. We hope GamingAnywhere can gather enough attention and quickly grow into a community with critical mass, just like OpenFlow, which shares the same motivation as GamingAnywhere in a different research area.

Design Philosophy

GamingAnywhere aims to provide an open platform for researchers to develop and study real-time multimedia streaming applications in the cloud. The design objectives of GamingAnywhere include:

  1. Extensibility: GamingAnywhere adopts a modularized design. Both platform-dependent components such as audio and video capturing and platform-independent components such as codecs and network protocols can be easily modified or replaced. Developers should be able to follow the programming interfaces of modules in GamingAnywhere to extend the capabilities of the system. It is not limited only to games, and any real-time multimedia streaming application such as live casting can be done using the same system architecture.
  2. Portability: In addition to desktops, mobile devices are becoming one of the most promising client platforms for cloud services as wireless networks become increasingly popular. For this reason, we maintain the principle of portability when designing and implementing GamingAnywhere. Currently the server supports Windows and Linux, while the client supports Windows, Linux, and OS X. New platforms can be easily included by replacing the platform-dependent components of GamingAnywhere. Besides the easily replaceable modules, the external components leveraged by GamingAnywhere are highly portable as well. This also makes GamingAnywhere easier to port to mobile devices.
  3. Configurability: A system researcher may conduct experiments for real-time multimedia streaming applications with diverse system parameters. A large number of built-in audio and video codecs are supported by GamingAnywhere. In addition, GamingAnywhere exports all available configurations to users so that it is possible to try out the best combinations of parameters by simply editing a text-based configuration file and fitting the system into a customized usage scenario.
  4. Openness: GamingAnywhere is publicly available at http://gaminganywhere.org/. Use of GamingAnywhere in academic research is free of charge, but researchers and developers should follow the license terms stated in the binary and source packages.
 
Figure 2: A demonstration of GamingAnywhere running on an Android phone, playing Mario in an N64 emulator on a PC.

How to Start

We offer GamingAnywhere in two types of software packs: all-in-one and binary. The all-in-one pack allows gamers to recompile GamingAnywhere from scratch, while the binary packs are for gamers who just want to try out GamingAnywhere. There are binary packs for Windows and Linux. All packs are downloadable as zipped archives and can be installed by simply uncompressing them. GamingAnywhere consists of three binaries: (i) ga-client, the thin client, (ii) ga-server-periodic, a server which periodically captures game screens and audio, and (iii) ga-server-event-driven, another server which uses code injection techniques to capture game screens and audio on demand (i.e., whenever an updated game screen is available); a sketch of a typical launch sequence is given after Table 1 below. Readers are welcome to visit the GamingAnywhere website at http://gaminganywhere.org/. Table 1 gives the latest supported OS’s and versions, and all source code and pre-compiled binary packages can be downloaded from this page. The website provides a variety of documents to help users quickly set up the GamingAnywhere server and client on their own computers, including the Quick Start Guide, the Configuration File Guide, and a FAQ. If you have questions that are not answered in the documents, we also provide an interactive forum for online discussion.

Table 1: Supported operating systems and versions.

           Windows       Linux       Mac OS X    Android
  Server   Windows 7+    Supported   Supported
  Client   Windows XP+   Supported   Supported   4.1+
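
As a rough illustration of how the three binaries fit together, a typical session might look like the sketch below. The binary names are the ones listed above; the configuration file names, host, port, and stream path are placeholders, so please consult the Quick Start Guide and the Configuration File Guide for the exact values used by your GamingAnywhere version.

# On the server machine, start one of the two server binaries with a server configuration:
ga-server-periodic <server-config-file>

# On the client machine, connect the thin client to the server's stream
# (GamingAnywhere streams via RTSP/RTP; the URL details depend on the server configuration):
ga-client <client-config-file> rtsp://<server-host>:<port>/<path>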

 

Future Perspectives

Cloud gaming is getting increasingly popular, but to turn cloud gaming into an even bigger success, there are still many challenges ahead of us. In [3], we share our views on the most promising research opportunities for providing high-quality and commercially viable cloud gaming services. These opportunities span fairly diverse research directions: from very system-oriented game integration to quite human-centric QoE modeling; from cloud-related GPU virtualization to content-dependent video codecs. We believe these research opportunities are of great interest to both the research community and the industry for future, better cloud gaming platforms.

GamingAnywhere enables several future research directions on cloud gaming and beyond. For example, techniques for cloud management, such as resource allocation and Virtual Machine (VM) migration, are critical to the success of commercial deployments. These cloud management techniques need to be optimized for cloud games, e.g., the VM placement decisions need to be aware of gaming experience [4]. Beyond cloud gaming, as dynamic and adaptive binding between computing devices and displays becomes increasingly popular, screencast technologies, which enable such binding over wireless networks, also employ real-time video streaming as their core technology. The ACM MMSys’15 paper [5] demonstrates that GamingAnywhere, though designed for cloud gaming, also serves as a good reference implementation and testbed for experimenting with different innovations and alternatives for improving screencast performance. Furthermore, we expect future applications, such as mobile smart lenses and even telepresence, to make good use of GamingAnywhere as part of their core technologies. We are happy to offer GamingAnywhere to the community and more than happy to welcome community members to join us in the hacking of future, better real-time streaming systems for the good of humankind.

MPEG Column: 111th MPEG Meeting

— original post available here, by Christian Timmerer (Multimedia Communication blog, AAU/bitmovin)

The 111th MPEG meeting (note: link includes press release and all publicly available output documents) was held in Geneva, Switzerland, with some interesting aspects that I’d like to highlight here. Undoubtedly, it was the shortest meeting I’ve ever attended (and my first meeting was #61), as the final plenary concluded at 2015/02/20T18:18!

MPEG111 opening plenary

In terms of the requirements subgroup, it is worth mentioning the call for evidence (CfE) for high dynamic range (HDR) and wide color gamut (WCG) video coding, which comprises a first milestone towards a new video coding format. The purpose of this CfE is to explore whether (a) the coding efficiency and/or (b) the functionality of the HEVC Main 10 and Scalable Main 10 profiles can be significantly improved for HDR and WCG content. In addition, the requirements subgroup issued a draft call for evidence on free viewpoint TV. Both documents are publicly available here.

The video subgroup continued discussions related to the future of video coding standardisation and issued a public document requesting contributions on “future video compression technology”. Interesting application requirements come from over-the-top (OTT) streaming use cases, which call for HDR and WCG as well as video over cellular networks. Well, at least the former is covered by the CfE mentioned above. Furthermore, features like scalability and perceptual quality should be considered from the ground up and not (only) as extensions. Yes, scalability really helps a lot in OTT streaming, starting with easier content management and cache-efficient delivery, and it allows for more aggressive buffer modelling and, thus, adaptation logic within the client, enabling better Quality of Experience (QoE) for the end user. It seems that complexity (at the encoder) is not so much a concern as long as it scales with cloud deployments such as http://www.bitcodin.com/ (e.g., the bitdash demo area shows some neat 4K/8K/HFR DASH demos which have been encoded with bitcodin). Closely related to 8K, there’s a new AVC amendment coming up covering 8K; one can do this already today (see above), but it’s good to have standards support for it. For HEVC, the JCT-3D/VC issued the FDAM4 for 3D Video Extensions and started with PDAM5 for Screen Content Coding Extensions (both documents becoming publicly available after an editing period of about a month).

And what about audio? The audio subgroup has decided that ISO/IEC DIS 23008-3 3D Audio shall be promoted directly to IS, which means that the DIS was already in such a good state that only editorial comments are applied, which actually saves a balloting cycle. We have to congratulate the audio subgroup on this remarkable milestone.

Finally, I’d like to discuss a few topics related to DASH, which is progressing towards its 3rd edition. This edition will incorporate amendment 2 (Spatial Relationship Description, Generalized URL parameters and other extensions), amendment 3 (Authentication, Access Control and multiple MPDs), and everything else that will be incorporated within this year, like some aspects documented in the technologies under consideration or currently being discussed within the core experiments (CEs). Currently, MPEG-DASH conducts five core experiments:

  • Server and Network Assisted DASH (SAND)
  • DASH over Full Duplex HTTP-based Protocols (FDH)
  • URI Signing for DASH (CE-USD)
  • SAP-Independent Segment Signaling (SISSI)
  • Content aggregation and playback control (CAPCO)

The description of the core experiments is publicly available and, compared to the previous meeting, there is a new CE about content aggregation and playback control (CAPCO), which “explores solutions for aggregation of DASH content from multiple live and on-demand origin servers, addressing applications such as creating customized on-demand and live programs/channels from multiple origin servers per client, targeted preroll ad insertion in live programs and also limiting playback by client such as no-skip or no fast forward.” This process is quite open and anybody can join by subscribing to the email reflector.

The CE for DASH over Full Duplex HTTP-based Protocols (FDH) is gaining momentum and basically defines the usage of DASH with the push features of WebSockets and HTTP/2. At this meeting MPEG issued a working draft, and the CE on Server and Network Assisted DASH (SAND) got its own part 5, which goes to CD, but the documents are not publicly available. However, I’m pretty sure I can report more on this next time, so stay tuned or feel free to comment here.

ACM TOMM (TOMCCAP) Call for Special Issue Proposals

ACM Transactions on Multimedia
Computing, Communications and Applications
ACM TOMM (previously known as ACM TOMCCAP)

Deadline for Proposal Submission: May 1st, 2015
Notification: June 1st, 2015
http://tomm.acm.org/

ACM TOMM is one of the world’s leading journals on multimedia. As in previous years, we are planning to publish a special issue (SI) in 2016. Proposals are accepted until May 1st, 2015. Each special issue is the responsibility of its guest editors. If you wish to guest edit a special issue, you should prepare a proposal as outlined below and send it via e-mail to the Senior Associate Editor (SAE) for Special Issue Management of TOMM, Shervin Shirmohammadi (shervin@ieee.org).

Proposals must:

  • Cover a currently-hot or emerging topic in the area of multimedia computing, communications, and applications;
  • Set out the importance of the special issue’s topic in that area;
  • Give a strategy for the recruitment of high quality papers;
  • Indicate a draft timeline in which the special issue could be produced (paper writing, reviewing, and submission of final copies to TOMM), assuming the proposal is accepted;
  • Include a list of recent (submission deadline within the last year) or currently-open special issues in similar topics and clearly explain how the proposed SI is different from those SIs;
  • Include the list of the proposed guest editors, their short bios, and their editorial and journal/conference organization experience as related to the Special Issue’s topic.

As in previous years, the special issue will be published as an online-only issue in the ACM Digital Library. This gives the guest editors higher flexibility in the review process and in the number of papers to be accepted, while still ensuring timely publication.

The proposals will be reviewed by the SAE together with the Editor-in-Chief (EiC). Evaluation criteria include: relevance to multimedia, ability to attract many excellent submissions, a topic that is neither too specific nor too broad, quality and detail of the proposal, distinction from recent or current SIs with similar topics, experience and reputation of the guest editors, and geographic/ethnic diversity of the guest editors. The final decision will be made by the EiC. A notification of the decision will be given by June 1st, 2015. Once a proposal is accepted, we will contact you to discuss the further process.

For questions please contact:
Shervin Shirmohammadi – Senior Associate Editor for Special Issue Management shervin@ieee.org
Ralf Steinmetz – Editor in Chief (EiC) steinmetz.eic@kom.tu-darmstadt.de
Sebastian Schmidt – Information Director TOMM@kom.tu-darmstadt.de

Summary of the 5th BAMMF

Bay Area Multimedia Forum (BAMMF)

BAMMF is a Bay Area Multimedia Forum series. Experts from both academia and industry are invited to exchange ideas and information through talks, tutorials, posters, panel discussions and networking sessions. Topics of the forum will include emerging areas in vision, audio, touch, speech, text, various sensors, human computer interaction, natural language processing, machine learning, media-related signal processing, communication, and cross-media analysis etc. Talks in the event may cover advancement in algorithms and development, demonstration of new inventions, product innovation, business opportunities, etc. If you are interested in giving a presentation at the forum, please contact us.

The 5th BAMMF

The 5th BAMMF was held in the George E. Pake Auditorium in Palo Alto, CA, USA on November 20, 2014. The slides and videos of the speakers at the forum have been made available on the BAMMF web page, and we provide here an overview of their talks. For speakers’ bios, the slides and videos, please visit the web page.

Industrial Impact of Deep Learning – From Speech Recognition to Language and Multimodal Processing

Li Deng (Deep Learning Technology Center, Microsoft Research, Redmond, USA)

Since 2010, deep neural networks have started making real impact in the speech recognition industry, building upon earlier work on (shallow) neural nets and (deep) graphical models developed by both the speech and machine learning communities. This keynote will first reflect on the historical path to this transformative success. The role of well-timed academic-industrial collaboration will be highlighted, as will the advances in big data, big compute, and seamless integration between application-domain knowledge of speech and general principles of deep learning. Then, an overview will be given on the sweeping achievements of deep learning in speech recognition since its initial success in 2010 (as well as in image recognition since 2012). Such achievements have resulted in across-the-board, industry-wide deployment of deep learning. The final part of the talk will focus on applications of deep learning to large-scale language/text and multimodal processing, a more challenging area where potentially much greater industrial impact than in speech and image recognition is emerging.

Brewing a Deeper Understanding of Images

Yangqing Jia (Google)

In this talk I will introduce the recent developments in the image recognition fields from two perspectives: as a researcher and as an engineer. For the first part I will describe our recent entry “GoogLeNet” that won the ImageNet 2014 challenge, including the motivation of the model and knowledge learned from the inception of the model. For the second part, I will dive into the practical details of Caffe, an open-source deep learning library I created at UC Berkeley, and show how one could utilize the toolkit for a quick start in deep learning as well as integration and deployment in real-world applications.

Applied Deep Learning

Ronan Collobert (Facebook)

I am interested in machine learning algorithms which can be applied in real-life applications and which can be trained on “raw data”. Specifically, I prefer to trade simple “shallow” algorithms with task-specific handcrafted features for more complex (“deeper”) algorithms trained on raw features. In that respect, I will present several general deep learning architectures which excel in performance on various Natural Language, Speech and Image Processing tasks. I will look into specific issues related to each application domain, and will attempt to propose general solutions for each use case.

Compositional Language and Visual Understanding

Richard Socher (Stanford)

In this talk, I will describe deep learning algorithms that learn representations for language that are useful for solving a variety of complex language tasks. I will focus on 3 projects:

  • Contextual sentiment analysis (e.g. having an algorithm that actually learns what’s positive in this sentence: “The Android phone is better than the iPhone”)
  • Question answering to win trivia competitions (like IBM Watson’s Jeopardy system but with one neural network)
  • Multimodal sentence-image embeddings to find images that visualize sentences and vice versa (with a fun demo!)

All three tasks are solved with a similar type of recursive neural network algorithm.

 

MPEG Column: 110th MPEG Meeting

— original post available here, by Christian Timmerer (Multimedia Communication blog, AAU/bitmovin)

The 110th MPEG meeting was held at the Strasbourg Convention and Conference Centre featuring the following highlights:

  • The future of video coding standardization
  • Workshop on media synchronization
  • Standards at FDIS: Green Metadata and CDVS
  • What’s happening in MPEG-DASH?

Additional details about MPEG’s 110th meeting can also be found here, including the official press release and all publicly available documents.

The Future of Video Coding Standardization

MPEG110 hosted a panel discussion about the future of video coding standardization. The panel was organized jointly by MPEG and ITU-T SG 16’s VCEG featuring Roger Bolton (Ericsson), Harald Alvestrand (Google), Zhong Luo (Huawei), Anne Aaron (Netflix), Stéphane Pateux (Orange), Paul Torres (Qualcomm), and JeongHoon Park (Samsung).

As expected, “maximizing compression efficiency remains a fundamental need” and, as usual, MPEG will study “future application requirements, and the availability of technology developments to fulfill these requirements”. Therefore, two Ad-hoc Groups (AhGs), which are open to the public, have been established.

The presentations of the brainstorming session on the future of video coding standardization can be found here.

Workshop on Media Synchronization

MPEG110 also hosted a workshop on media synchronization for hybrid delivery (broadband-broadcast), featuring six presentations “to better understand the current state-of-the-art for media synchronization and identify further needs of the industry”.

  • An overview of MPEG systems technologies providing advanced media synchronization, Youngkwon Lim, Samsung
  • Hybrid Broadcast – Overview of DVB TM-Companion Screens and Streams specification, Oskar van Deventer, TNO
  • Hybrid Broadcast-Broadband distribution for new video services: a use cases perspective, Raoul Monnier, Thomson Video Networks
  • HEVC and Layered HEVC for UHD deployments, Ye Kui Wang, Qualcomm
  • A fingerprinting-based audio synchronization technology, Masayuki Nishiguchi, Sony Corporation
  • Media Orchestration from Capture to Consumption, Rob Koenen, TNO

The presentation material is available here. Additionally, MPEG established an AhG on timeline alignment (that’s what the project is internally called) to study use cases and solicit contributions on gap analysis as well as technical contributions [email][subscription].

Standards at FDIS: Green Metadata and CDVS

My first report on MPEG Compact Descriptors for Visual Search (CDVS) dates back to July 2011 and provides details about the call for proposals. Now, finally, the FDIS has been approved at the 110th MPEG meeting. CDVS defines a compact image description that facilitates the comparison and search of pictures that include similar content, e.g., when showing the same objects in different scenes from different viewpoints. The compression of key point descriptors not only increases compactness, but also significantly speeds up the search and classification of images within large image databases when compared to a raw representation of the same underlying features. Application of CDVS for real-time object identification, e.g. in computer vision and other applications, is envisaged as well.

Another standard that reached FDIS status is Green Metadata (first reported in August 2012). This standard specifies the format of metadata that can be used to reduce the energy consumption of the encoding, decoding, and presentation of media content, while simultaneously controlling or avoiding degradation of the Quality of Experience (QoE). Moreover, the metadata specified in this standard can facilitate a trade-off between energy consumption and QoE. MPEG is also working on amendments to the ubiquitous MPEG-2 TS (ISO/IEC 13818-1) and ISOBMFF (ISO/IEC 14496-12) so that green metadata can be delivered by these formats.

What’s happening in MPEG-DASH?

MPEG-DASH is in a kind of maintenance mode but is still receiving new proposals in the area of SAND parameters, and some core experiments are ongoing. Also, the DASH-IF is working towards new interoperability points and test vectors in preparation for actual deployments. And deployments are happening, e.g., a 40h live stream right before Christmas (by bitmovin, one of the top 100 companies that matter most in online video). Additionally, VideoNext was co-located with CoNEXT’14, targeting scientific presentations about the design, quality and deployment of adaptive video streaming. Webex recordings of the talks are available here. In terms of standardization, MPEG-DASH is progressing towards the 2nd amendment, including spatial relationship description (SRD), generalized URL parameters and other extensions. In particular, SRD will enable new use cases which can only be addressed using MPEG-DASH, and the FDIS is scheduled for the next meeting, which will be in Geneva, Feb 16-20, 2015. I’ll report on this in my next blog post, stay tuned…

Call for Workshop Proposals @ ACM Multimedia 2015

We invite proposals for Workshops to be held at the ACM Multimedia 2015 Conference. Accepted workshops will take place in conjunction with the main conference, which is scheduled for October 26-30, 2015, in Brisbane, Australia.

We solicit proposals for two different kinds of workshops: regular workshops and data challenge workshops.

Regular Workshops

The regular workshops should offer a forum for discussions of a broad range of emerging and specialized topics of interest to the SIG Multimedia community. There are a number of important issues to be considered when preparing a workshop proposal:

  1. The topic of the proposed workshop should offer a perspective distinct from and complementary to the research themes of the main conference. We therefore strongly advise carefully reviewing the themes of the main conference (which can be found here) when preparing a proposal.
  2. The SIG Multimedia community expects the workshop program to nurture and grow the workshop’s research theme towards becoming mainstream in the multimedia research field and, in the future, one of the themes of the main conference.
  3. Interdisciplinary theme workshops are strongly encouraged.
  4. Workshops should offer a discussion forum of a different type than that of the main conference. In particular, they should avoid becoming “mini-conferences” with accompanying keynote presentations and best paper awards. While formal presentation of ideas through regular oral sessions is allowed, we strongly encourage organizers to propose alternative ways to allow participants to discuss open issues, key methods and important research topics related to the workshop theme. Examples are panels, group brainstorming sessions, mini-tutorials around key ideas and proof-of-concept demonstration sessions.

Data Challenge Workshops

We are also seeking organizers to propose Challenge-Based Workshops. Both academic and corporate organizers are welcome.

Data Challenge workshops are solicited from both academic and corporate organizers. The organizers should provide a dataset that is exemplary of the complexities of current and future multimodal/multimedia problems, and one or more multimodal/multimedia tasks whose performance can be objectively measured. Participants in the challenge will evaluate their methods against the challenge data in order to identify areas of strength and weakness. The best performing participating methods will be presented in the form of papers and oral/poster presentations at the workshop.

More information

For details on submitting workshops proposals and the evaluation criteria, please check the following site:

http://www.acmmm.org/2015/call-for-workshop-proposals/

Important dates:

  • Proposal Submission: February 10, 2015
  • Notification of Acceptance: February 27, 2015

Looking forward to receiving many excellent submissions!

Alan Hanjalic, Lexing Xie and Svetha Venkatesh
Workshops Chairs, ACM Multimedia 2015

openSMILE:) The Munich Open-Source Large-scale Multimedia Feature Extractor

A tutorial for version 2.1

Introduction

The openSMILE feature extraction and audio analysis tool enables you to extract large audio (and recently also video) feature spaces incrementally and fast, and to apply machine learning methods to classify and analyze your data in real time. It combines acoustic features from Music Information Retrieval and Speech Processing, as well as basic computer vision features. Large, standard acoustic feature sets are included and usable out of the box to ensure comparable standards in feature extraction in related research. The purpose of this article is to briefly introduce openSMILE, its features, potentials, and intended use-cases, and to give a hands-on tutorial packed with examples that should get you started quickly with using openSMILE.

About openSMILE

SMILE is originally an acronym for Speech & Music Interpretation by Large-space feature Extraction. Due to the recent addition of video processing in version 2.0, the acronym openSMILE evolved to open-Source Media Interpretation by Large-space feature Extraction. The development of the toolkit was started at Technische Universität München (TUM) for the EU-FP7 research project SEMAINE. The original primary focus was on state-of-the-art acoustic emotion recognition for emotionally aware, interactive virtual agents. After the project, openSMILE was continuously extended into a universal audio analysis toolkit. It has been used and evaluated extensively in the series of INTERSPEECH challenges on emotion, paralinguistics, and speaker states and traits: from the first INTERSPEECH 2009 Emotion Challenge up to the upcoming challenge at INTERSPEECH 2015 (see openaudio.eu for a summary of the challenges). Since 2013 the code-base has been transferred to audEERING and the development is continued by them under a dual-license model – keeping openSMILE free for the research community.

openSMILE is written in C++ and is available as both a standalone command-line executable and a dynamic library. The main features of openSMILE are its capability of on-line incremental processing and its modularity. Feature extractor components can be freely interconnected to create new and custom features, all via a simple text-based configuration file. New components can be added to openSMILE via an easy binary plug-in interface and an extensive internal API. Scriptable batch feature extraction is supported just as well as live on-line extraction from live recorded audio streams. This enables you to build and design systems on off-line databases, and then use exactly the same code to run your developed system in an interactive on-line prototype or even a product.

openSMILE is intended as a toolkit for researchers and developers, but not for end-users. It thus cannot be configured through a Graphical User Interface (GUI). However, it is a fast, scalable, and highly flexible command-line backend application, on which several front-end applications could be based. Examples are network interface components and, in the latest release of openSMILE (version 2.1), a batch feature extraction GUI for Windows platforms. As seen in the above figure, the GUI allows you to easily choose a configuration file, the desired output files and formats, and to select files and folders on which to run the analysis.

Made popular in the field of speech emotion recognition and paralinguistic speech analysis, openSMILE is now being widely used in this community. According to Google Scholar, the two papers on openSMILE ([Eyben10] and [Eyben13a]) are currently cited over 380 times. Research teams across the globe are using it for several tasks, including paralinguistic speech analysis such as alcohol intoxication detection, VoiceXML telephony-based spoken dialogue systems (as implemented by the HALEF framework), natural, speech-enabled virtual agent systems, and human behavioural signal processing, to name only a few examples.

Key Features

The key features of openSMILE are:

  • It is cross-platform (Windows, Linux, Mac, new in 2.1: Android)
  • It offers both incremental processing and batch processing.
  • It efficiently extracts a large number of features very fast by re-using already computed values.
  • It has multi-threading support for parallel feature extraction and classification.
  • It is extensible with new custom components and plug-ins.
  • It supports audio file in- and output as well as live sound recording and playback.
  • The computation of MFCC, PLP, (log-)energy, and delta regression coefficients is fully HTK compatible.
  • It has a wide range of general audio signal processing components:
    • Windowing functions (Hamming, Hann, Gauss, Sine, …),
    • Fast-Fourier Transform,
    • Pre-emphasis filter,
    • Finite Impulse Response (FIR) filterbanks,
    • Autocorrelation,
    • Cepstrum,
    • Overlap-add re-synthesis,
  • … and speech-related acoustic descriptors:
    • Signal energy,
    • Loudness based on a simplified sub-band auditory model,
    • Mel-/Bark-/Octave-scale spectra,
    • MFCC and PLP-CC,
    • Pitch (ACF and SHS algorithms and Viterbi smoothing),
    • Voice quality (Jitter, Shimmer, HNR),
    • Linear Predictive Coding (LPC),
    • Line Spectral Pairs (LSP),
    • Formants,
    • Spectral shape descriptors (Roll-off, slope, etc.),
  • … and music-related descriptors:
    • Pitch classes (semitone spectrum),
    • CHROMA and CENS features.
  • It supports multi-modal fusion on the feature level through OpenCV integration.
  • Several post-processing methods for low-level descriptors are included:
    • Moving average smoothing,
    • Moving average mean subtraction and variance normalization (e.g. for on-line Cepstral mean subtraction),
    • On-line histogram equalization (experimental),
    • Delta regression coefficients of arbitrary order,
    • Binary operations to re-combine descriptors.
  • A wide range of statistical functionals for feature summarization is supported, e.g.:
    • Means, Extremes,
    • Moments,
    • Segment statistics,
    • Sample-values,
    • Peak statistics,
    • Linear and quadratic regression,
    • Percentiles,
    • Durations,
    • Onsets,
    • DCT coefficients,
    • Zero-crossings.
  • Generic and popular data file formats are supported:
    • Hidden Markov Toolkit (HTK) parameter files (read/write)
    • WEKA Arff files (currently only non-sparse) (read/write)
    • Comma separated value (CSV) text (read/write)
    • LibSVM feature file format (write)

In the latest release (2.1) the new features are:

  • Integration and improvement of the emotion recognition models from openEAR,
  • LSTM-RNN based voice-activity detector prototype models included,
  • A fast linear SVM sink component which supports linear kernel SVM models trained with the WEKA SMO classifier,
  • LSTM-RNN JSON network file support for networks trained with the CURRENNT toolkit,
  • Spectral harmonics descriptors,
  • Android support,
  • Improvements to configuration files and command-line options,
  • Improvements and fixes.

openSMILE’s architecture

openSMILE has a very modular architecture, designed for incremental data-flow. A central dataMemory component hosts shared memory buffers (known as dataMemory levels) to which a single component can write data and from which one or more other components can read data. There are data-source components, which read data from files or other external sources and introduce them to the dataMemory. Then there are data-processor components, which read data, modify them, and save them to a new buffer – these are the actual feature extractor components. Finally, data-sink components read the final data and save them to files or digest them in other ways (classifiers etc.).

As all components which process data and connect to the dataMemory share some common functionality, they are all derived from a single base class cSmileComponent. The following figure shows the class hierarchy, and the connections between the cDataWriter and cDataReader components and the dataMemory (dotted lines).

Getting openSMILE and the documentation

The latest openSMILE packages can be downloaded here. At the time of writing, the most recent release is 2.1. Grab the complete package of the latest release, which includes the source code and the binaries for Linux and Windows. The most up-to-date releases might not always include a full-blown set of binaries for all platforms, so sometimes you might have to compile from source if you want the latest cutting-edge version.

While the tutorial in the next section should give you a good quick-start, it does not and cannot cover every detail of openSMILE. For learning more and getting further help, there are three main resources. The first is the openSMILE documentation, called the openSMILE book. It contains detailed instructions on how to install, compile, and use openSMILE and introduces you to the basics of openSMILE. However, it might not be the most up-to-date resource for the newest features. The second resource is the on-line help built into the binaries, which provides the most up-to-date documentation of available components and their options and features. We will tell you how to use the on-line help in the next section. If you cannot find your answer in either of these resources, you can ask for help in the discussion forums on the openSMILE website or read the source code.

Quick-start tutorial

You can’t wait to get openSMILE and try it out on your own data? Then this is your section. In the following the basic concepts of openSMILE are described, pre-built use-cases of automatic, on-line voice activity detection and speech emotion recognition are presented, and the concept of configuration files and the data-flow architecture are explained.

a. Basic concepts

Please refer to the openSMILE book for detailed installation and compilation instructions. Here we assume that you have a compiled SMILExtract binary (optionally with PortAudio support, if you want to use the live audio recording examples below), with which you can run:

SMILExtract -h
SMILExtract -H cWaveSource

to see general usage instructions (first line) and the on-line help for the cWaveSource component (second line), for example. However, from this on-line help it is hard to get a general picture of the openSMILE concepts. We thus briefly describe how to use openSMILE for the most common tasks. Very loosely speaking, the SMILExtract binary can be seen as a special kind of code interpreter which executes custom configuration scripts. What openSMILE actually does when you invoke it is controlled entirely by this configuration script. So, in order to do something with openSMILE you need:

  • The binary SMILExtract,
  • a (set of) configuration file(s),
  • and optionally other files, such as classification models, etc.

The configuration file defines all the components that are to be used as well as their data-flow interconnections. All the components are iteratively run in the “tick-loop”, i.e., a run method (tick()) of each component is called in every loop iteration. Each component then checks if there is new data to process and, if yes, processes the data and makes it available for other components to process further. Every component returns a status value which indicates whether the component has processed data or not. If no component has had any further data to process, the end of the data input (EOI) is assumed. All components are then switched to an EOI state and the tick-loop is executed again to process data which require special attention at the end of the input, such as delta-regression coefficients.

Since version 2.0-rc1, multi-pass processing is supported, i.e., a feature that enables re-running of the whole processing. Its use is not encouraged, since it breaks incremental processing, but for some experiments it might be necessary. The minimal, generic use-case scenario for openSMILE is thus as follows:

SMILExtract -C config/my_configfile.conf
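
To make the idea of such a configuration script a bit more concrete, the sketch below wires up a small source -> processor -> sink chain (wave input, framing, frame-wise energy, CSV output). It is illustrative only: the component names and the instance/option syntax follow openSMILE's conventions, but the options shown are a minimal subset and may differ from the configuration files shipped with the release.

// illustrative openSMILE configuration sketch (not a shipped config file)
[componentInstances:cComponentManager]
instance[dataMemory].type = cDataMemory
instance[waveSource].type = cWaveSource
instance[framer].type = cFramer
instance[energy].type = cEnergy
instance[csvSink].type = cCsvSink

[waveSource:cWaveSource]
writer.dmLevel = wave
filename = \cm[inputfile(I){input.wav}:name of input file]

[framer:cFramer]
reader.dmLevel = wave
writer.dmLevel = frames
frameSize = 0.025
frameStep = 0.010

[energy:cEnergy]
reader.dmLevel = frames
writer.dmLevel = energy

[csvSink:cCsvSink]
reader.dmLevel = energy
filename = \cm[outputfile(O){output.csv}:name of output file]

The \cm[...] entries in the sketch are what expose command-line options such as -I and -O, which are discussed next.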

Each configuration file can define additional command-line options. Most prominent examples are the options for in- and output files (-I and -O). These options are not shown when the normal help is invoked with the -h option. To show the options defined by a configuration file, use this command-line:

SMILExtract -ccmdHelp -C config/my_configfile.conf

The default command-line for processing audio files for feature extraction is:

SMILExtract -C config/my_configfile.conf -I input_file.wav -O output_file

This runs SMILExtract with the configuration given in my_configfile.conf. The following two sections will show you how to quickly get some advanced applications running as pre-configured use-cases for voice activity detection and speech emotion recognition.

b. Use-case: The openSMILE voice-activity detector

The latest openSMILE release (2.1) contains a research prototype of an intelligent, data-driven voice-activity detector (VAD) based on Long Short-Term Memory Recurrent Neural Networks (LSTM-RNN), similar to the system introduced in [Eyben13b]. The VAD examples are contained in the folder scripts/vad, and a README in that folder describes further details. Here we give a brief tutorial on how to use the two included use-case examples:

  • vad_opensource.conf: Runs the LSTM-RNN VAD and dumps the activations (voice probability) for each frame to a CSV text file. To run the example on a wave file, type:
    cd scripts/vad;
    SMILExtract -I ../../example-audio/media-interpretation.wav \
                 -C vad_opensource.conf -csvoutput vad.csv

    This will write the VAD probabilities scaled to the range -1 to +1 (2nd column) and the corresponding timestamps (1st column) to vad.csv. A VAD probability greater than 0 indicates voice presence.

  • vad_segmenter.conf: Runs the VAD on an input wave file and automatically extracts voice segments to new wave files. Optionally, the raw voicing probabilities as in the above example can be saved to a file. To run the example on a wave file, type:
    cd scripts/vad;
    mkdir -p voice_segments
    SMILExtract -I ../../example-audio/media-interpretation.wav -C vad_segmenter.conf \
                -waveoutput voice_segments/segment_

    This will create new wave files (numbered consecutively, starting at 1). The vad_segmenter.conf optionally supports output to CSV with the -csvoutput filename option. The start and end times (in seconds) of the voice segments relative to the start of the input file can optionally be dumped with the -saveSegmentTimes filename option; a sketch combining these options is shown below. The columns of the output file are: segment filename, start (sec.), end (sec.), and length of the segment as the number of raw (10 ms) frames.
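
    For convenience, the optional outputs described above can be combined into a single run. The following invocation is an illustrative sketch: it only uses the options named above, and the output file names (vad_probabilities.csv, segment_times.csv) are arbitrary placeholders.

    cd scripts/vad;
    mkdir -p voice_segments
    SMILExtract -I ../../example-audio/media-interpretation.wav -C vad_segmenter.conf \
                -waveoutput voice_segments/segment_ \
                -csvoutput vad_probabilities.csv \
                -saveSegmentTimes segment_times.csv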

To visualise the VAD output over the waveform, we recommend using Sonic Visualiser. If you have Sonic Visualiser installed (on Linux), you can open both the wave file and the VAD output with this command:

sonic-visualiser example-audio/media-interpretation.wav vad.csv

An annotation layer import dialog should appear. The first column should be detected as Time and the second column as Value. If this is not the case, select these values manually, specify that timing is given explicitly (this should be the default), and click OK. You should see something like this:

c. Use-case: Automatic speech emotion recognition

As of version 2.1, openSMILE supports running the emotion recognition models from the openEAR toolkit [Eyben09] in a live emotion recognition demo. In order to start this live speech emotion recognition demo, download the speech emotion recognition models and unzip them in the top-level folder of the openSMILE package. A folder named models should be created there, which contains a README.txt and a sub-folder emo. If this is the case, you are ready to run the demo. Type:

SMILExtract -C config/emobase_live4.conf

to run it. The classification output will be shown on the console. NOTE: This example requires that you are running a binary with PortAudio support enabled. Refer to the openSMILE book for details on how to compile your binary with PortAudio support on Linux. For Windows, pre-compiled binaries (SMILExtractPA*.exe) are included, which should be used instead of the standard SMILExtract.exe for the above example. If you want to choose a different audio recording device, use:

SMILExtract -C config/emobase_live4.conf -device ID

To see a list of available devices and their IDs, type:

SMILExtract -C config/emobase_live4.conf -listdevices

Note: If you have a different directory layout or have installed SMILExtract in a system path, you must make sure that the models are located in a directory named “models” in the directory from which you call the binary, or you must adapt the path to the models in the configuration file (emobase_live4.conf).

In openSMILE 2.1, the emotion recognition models can also be used for off-line/batch analysis. Two configuration files are provided for this purpose: config/emobase_live4_batch.conf and config/emobase_live4_batch_single.conf. The latter will compute a single feature vector for the input file and return a single result; use this if your audio files are already chunked into short phrases or sentences. The former, emobase_live4_batch.conf, will run an energy-based segmentation on the input and return a result for every segment; use this for longer, un-cut audio files. To run the analysis in batch mode, type:

SMILExtract -C config/emobase_live4_batch(_single).conf -I example-audio/opensmile.wav > result.txt

This will redirect the result(s) from SMILExtract’s standard output (console) to the file result.txt. By default, the file is in a machine-parseable format, where key=value tokens are separated by :: and a single result is given on each line, for example:

SMILE-RESULT::ORIGIN=libsvm::TYPE=regression::COMPONENT=arousal::VIDX=0::NAME=(null)::
     VALUE=1.237816e-01
SMILE-RESULT::ORIGIN=libsvm::TYPE=regression::COMPONENT=valence::VIDX=0::NAME=(null)::
     VALUE=1.825088e-01
SMILE-RESULT::ORIGIN=libsvm::TYPE=classification::COMPONENT=emodbEmotion::VIDX=0::
     NAME=(null)::CATEGORY_IDX=2::CATEGORY=disgust::PROB=0;anger:0.033040::
     PROB=1;boredom:0.210172::PROB=2;disgust:0.380724::PROB=3;fear:0.031658::
     PROB=4;happiness:0.016040::PROB=5;neutral:0.087751::PROB=6;sadness:0.240615
SMILE-RESULT::ORIGIN=libsvm::TYPE=classification::COMPONENT=abcAffect::VIDX=0::
    NAME=(null)::CATEGORY_IDX=0::CATEGORY=agressiv::PROB=0;agressiv:0.614545::
    PROB=1;cheerful:0.229169::PROB=2;intoxicated:0.037347::PROB=3;nervous:0.011133::
    PROB=4;neutral:0.091070::PROB=5;tired:0.016737
SMILE-RESULT::ORIGIN=libsvm::TYPE=classification::COMPONENT=avicInterest::VIDX=0::
    NAME=(null)::CATEGORY_IDX=1::CATEGORY=loi2::PROB=0;loi1:0.006460::
    PROB=1;loi2:0.944799::PROB=2;loi3:0.048741

The above example is the result of the analysis of the file example-audio/media-interpretation.wav.
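If many files need to be analysed, the batch-mode configuration can simply be invoked once per file and the console output collected in a single file. The following is a minimal sketch for a POSIX shell; the wildcard pattern and the file name all_results.txt are only illustrative:

# collect one result block per file; prepend the file name for later reference
for f in example-audio/*.wav; do
    echo "FILE: $f" >> all_results.txt
    SMILExtract -C config/emobase_live4_batch_single.conf -I "$f" >> all_results.txt
done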

d. Understanding configuration files

The pre-configured examples above are a good quick-start that shows the diverse potential of the tool. We will now take a deeper look at openSMILE configuration files. First, we will use simple, small configuration files and modify them in order to understand the basic concepts of these files. Then, we will show you how to write your own configuration files from scratch. The demo files used in this section are provided in the 2.1 release package in the folder config/demo. We start with demo1_energy.conf, which extracts basic frame-wise logarithmic energy. To run this file on one of the included audio examples in the folder example-audio, type the following command:

SMILExtract -C config/demo/demo1_energy.conf -I example-audio/opensmile.wav -O energy.csv

This will create a file called energy.csv, which contains the frame-wise logarithmic energy values in CSV format.

The second example we will discuss here is the audio recorder example (audiorecorder.conf). NOTE: This example requires that you are running a binary with PortAudio support enabled. Refer to the openSMILE book for details on how to compile your binary with PortAudio support for Linux. For Windows, pre-compiled binaries (SMILExtractPA*.exe) are included, which should be used instead of the standard SMILExtract.exe for the following example. This example implements a simple live audio recorder: audio is recorded from the default audio device to an uncompressed PCM wave file. To run the example and record to rec.wav, type:

SMILExtract -C config/demo/audiorecorder.conf -O rec.wav

Modifying existing configuration files is the fastest way to create custom extraction scripts. We will now change the demo1_energy.conf file to extract Root-Mean-Square (RMS) energy instead of logarithmic energy. This can be achieved by changing the respective options in the section of the cEnergy component (identified by the section heading [energy:cEnergy]) from

rms = 0
log = 1

to

rms = 1
log = 0
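
After this change, the complete energy section should look roughly like the following sketch. The reader and writer dmLevel names are whatever demo1_energy.conf already defines; the names shown here are only illustrative placeholders:

[energy:cEnergy]
; keep the dmLevel names that are already present in demo1_energy.conf
reader.dmLevel = frames
writer.dmLevel = energy
rms = 1
log = 0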

As a second example, we will merge audiorecorder.conf and demo1_energy.conf to create a configuration file which computes the frame-wise RMS energy from live audio input. We start by concatenating the two files. On Linux, type:

cat config/demo/audiorecorder.conf config/demo/demo1_energy.conf > config/demo/live_energy.conf
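
On Windows, the two files can be concatenated from a Command Prompt instead; a minimal sketch using cmd.exe's copy command in binary mode (adjust the paths to your installation):

copy /b config\demo\audiorecorder.conf+config\demo\demo1_energy.conf config\demo\live_energy.conf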

Alternatively, a text editor such as Notepad++ can be used to combine the files via copy and paste. Now we must remove the cWaveSource component from the original demo1_energy.conf, as it is replaced by the cPortaudioSource component of the audiorecorder.conf file. To do this, we search for the line

instance[waveSource].type = cWaveSource

and comment it out by prefixing it with a ; or the C-style // or the script- and INI-style #. We also remove the corresponding configuration file section for waveSource. We do the same for the waveSink component and its corresponding section, leaving only the output of the computed frame-wise energy to a CSV file. Theoretically, we could also keep the waveSink section and component, but we would then have to change the command-line option defined for the output filename, as it would otherwise be the same for the CSV output and the wave-file output. In that case we would replace the filename option in the waveSink section by:

filename = \cm[waveoutput{output.wav}:name of output wave file]

Now, run your new configuration file with:

SMILExtract -C config/demo/live_energy.conf -O live_energy.csv

and inspect the contents of the live_energy.csv file with a text editor.

openSMILE configuration files are made up of sections, similar to INI files. Each section is identified by a header which takes the form:

[instancename:cComponentType]

The first part (instancename) is a custom-chosen name for the section. It must be unique throughout the whole configuration file and all included sub-files. The second part defines the type of this configuration section and thereby its allowed contents. The configuration section typename must be one of the available component names (from the list printed by the command SMILExtract -L), as configuration file sections are linked to component instances.

The contents of each section are lines of key=value pairs, until the next section header is found. Besides simple key=value pairs as in INI files, a more advanced structure is supported by openSMILE. The key can be a hierarchical value built from sub-keys, such as key1.subkey, or an array such as keyarray[0] and keyarray[1]. The value field, in turn, can also denote an array of values if the values are separated by a semicolon (;). Quotes around values are not needed and not yet supported, and multi-line values are not allowed. Boolean flags are always expressed as numeric values, with 1 for on/true and 0 for off/false. The keys are referred to as the configuration options of the components, i.e. those listed by the on-line help (SMILExtract -H cComponentType).

Since version 2.1, configuration sections can be split into multiple parts across the configuration file. That is, the same header (same instancename and typename) may occur more than once; in that case all options from all occurrences are joined. There is one configuration section that must always be present: that of the component manager:

[componentInstances:cComponentManager]
instance[dataMemory].type = cDataMemory
instance[instancename].type = cComponentType
instance[instancename2].type = cComponentType2
...

The component manager is the main instance which creates all component instances of the currently loaded configuration, makes them read their configuration settings from the parsed configuration file (through the configManager component), and runs the tick-loop, i.e. the loop in which data are processed incrementally by calling each component once to process newly available data frames. Each component that shall be included in the configuration must be listed in this section, and for each component listed there, a corresponding configuration file section with the same instancename and of the same component type must exist. The only exception is the first line, which instantiates the central dataMemory component. It must always be present in the instance list, but no configuration file section has to be supplied for it.

Each component that processes data has a data-reader and/or a data-writer sub-component, which are configurable via the reader and writer objects. The only options of interest to us now in these objects are the dmLevel options. These options configure the data-flow connections in your configuration file, i.e. they define in which order data is processed by the components, or in other words, which component is connected with which other component. Each component that modifies or creates data (e.g. by reading it from external sources) writes its data to a unique dataMemory location (called a level). The name of this location is defined in the configuration file via the option writer.dmLevel = name_of_level. The level names must be unique and only one single component can write to each level. Multiple components can, however, read from a single level, enabling re-use of already computed data by multiple components. E.g., we typically have a wave source component which reads audio data from an uncompressed audio file (see also the demo1_energy.conf file):

[wavesource:cWaveSource]
writer.dmLevel = wave
filename = input.wav

The above reads data from input.wav into the dataMemory level wave. If next we want to chunk the audio data into overlapping analysis windows of 20ms length at a rate of 10ms, we need a cFramer component:

[framer:cFramer]
reader.dmLevel = wave
writer.dmLevel = frames20ms
frameSize = 0.02
frameStep = 0.01

The crucial line in the above code is the one which sets the reader dataMemory level (reader.dmLevel = wave) to the output level of the wave source component, effectively connecting the framer to the wave source.

To create new configuration files from scratch, a configuration file template generator is available. We will use it to create a configuration for computing magnitude spectra via the Fast Fourier Transform (FFT). The template file generator requires a list of the components that we want to have in the configuration file, so we must build this list first. In openSMILE most processing steps are wrapped in individual components to increase flexibility and re-usability of intermediate data. For our example we thus need the following components:

  • An audio file reader (cWaveSource),
  • a component which generates short-time analysis frames (cFramer),
  • a component which applies a windowing function to these frames such as a Hamming window (cWindower),
  • a component which performs an FFT (cTransformFFT),
  • a component which computes spectral magnitudes from the complex FFT result (cFFTmagphase),
  • and finally a component which writes the magnitude spectra to a CSV file (cCsvSink).

To generate our configuration file template, we thus run (note that the component names are case-sensitive!):

SMILExtract -l 0 -logfile my_fft_magnitude.conf -cfgFileTemplate -configDflt cWaveSource,cFramer,
    cWindower,cTransformFFT,cFFTmagphase,cCsvSink

The switch -cfgFileTemplate enables the template file output and makes -configDflt accept a comma-separated list of component names. If -configDflt is used by itself, it will print only the default configuration section of a single component (whose name is given as the argument to that option). This invocation of SMILExtract prints the configuration file template to the log, i.e., to standard error and to the (log-)file given by the -logfile option. The switch -l 0 suppresses all other log messages (by setting the log-level to 0), leaving only the configuration file template lines in the specified file. The file generated by the above command cannot be used as is yet: we need to update the data-flow connections first. In our example this is trivial, as each component always reads from the previous one, except for the wave source, which has no reader. We have to change:

[waveSource:cWaveSource]
writer.dmLevel = < >

to

[waveSource:cWaveSource]
writer.dmLevel = wave

The same for the framer, resulting in:

[framer:cFramer]
reader.dmLevel = wave
writer.dmLevel = frames

and for the windower:

[windower:cWindower]
reader.dmLevel = frames
writer.dmLevel = windowed
...
winFunc = Hamming
...

where we also change the windowing function from the default (Hanning) to Hamming, and in the same fashion we go down all the way to the csvSink component:

[transformFFT:cTransformFFT]
reader.dmLevel = windowed
writer.dmLevel = fftcomplex

...

[fFTmagphase:cFFTmagphase]
reader.dmLevel = fftcomplex
writer.dmLevel = fftmag

...

[csvSink:cCsvSink]
reader.dmLevel = fftmag

The configuration file can now be used with the command:

SMILExtract -C my_fft_magnitude.conf

However, if you run the above, you will most likely get an error message that the file input.wav is not found. This is good news, as it means, first of all, that you have configured the data-flow correctly. In case you did not, you will get error messages about missing data memory levels, etc. The missing file problem is due to the hard-coded input file name given by the option filename = input.wav in the wave source section. If you change this line to filename = example-audio/opensmile.wav, your configuration will run without errors. It writes the result to a file called smileoutput.csv.

To avoid having to change the filenames in the configuration file for every input file you want to process, openSMILE provides a very convenient feature: it allows you to define command-line options in the configuration files. In order to use this feature, you replace the value of the filename option by the command \cm[], e.g. for the input file:

filename = \cm[inputfile(I){input.wav}:input filename]

and for the output file:

filename = \cm[outputfile(O){output.csv}:output filename]

The syntax of the \cm command is: [longoptionName(shortOption-1charOnly){default value}:description for on-line help].
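
Putting all of the above together, the data-flow-relevant lines of the finished my_fft_magnitude.conf should look roughly like the following sketch. Frame size, frame step, and level names follow the earlier examples; all other options generated by the template can be left at their defaults:

[componentInstances:cComponentManager]
instance[dataMemory].type = cDataMemory
instance[waveSource].type = cWaveSource
instance[framer].type = cFramer
instance[windower].type = cWindower
instance[transformFFT].type = cTransformFFT
instance[fFTmagphase].type = cFFTmagphase
instance[csvSink].type = cCsvSink

[waveSource:cWaveSource]
writer.dmLevel = wave
filename = \cm[inputfile(I){input.wav}:input filename]

[framer:cFramer]
reader.dmLevel = wave
writer.dmLevel = frames
frameSize = 0.02
frameStep = 0.01

[windower:cWindower]
reader.dmLevel = frames
writer.dmLevel = windowed
winFunc = Hamming

[transformFFT:cTransformFFT]
reader.dmLevel = windowed
writer.dmLevel = fftcomplex

[fFTmagphase:cFFTmagphase]
reader.dmLevel = fftcomplex
writer.dmLevel = fftmag

[csvSink:cCsvSink]
reader.dmLevel = fftmag
filename = \cm[outputfile(O){output.csv}:output filename]

With the command-line options in place, the configuration can be run on any file, for example:

SMILExtract -C my_fft_magnitude.conf -I example-audio/opensmile.wav -O fft_magnitudes.csv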

e. Reference feature sets

A major advantage of openSMILE over related feature extraction toolkits is that it comes with several reference and baseline feature sets which were used for the INTERSPEECH Challenges (2009-2014) on Emotion, Paralinguistics, and Speaker States and Traits, as well as for the Audio-Visual Emotion Challenges (AVEC) from 2011-2013. All of the INTERSPEECH configuration files are found under config/ISxx_*.conf. All the INTERSPEECH Challenge configuration files follow a common standard regarding the data output options they define. The default output file option (-O) defines the name of the WEKA ARFF file to which functionals are written. To additionally save the data in CSV format, use the option -csvoutput filename. To disable the default ARFF output, use -O ?. To enable saving of intermediate parameters, the frame-wise Low-Level Descriptors (LLD), in CSV format, the option -lldoutput filename can be used. By default, lines are appended to the functionals ARFF and CSV files if they exist, but the LLD files will be overwritten. To change this behaviour, the boolean (1/0) options -appendstaticarff 1/0, -appendstaticcsv 1/0, and -appendlld 0/1 are provided. A combined example invocation is given at the end of this section.

Besides the Challenge feature sets, openSMILE 2.1 is capable of extracting parameters for the Geneva Minimalistic Acoustic Parameter Set (GeMAPS; submitted for publication as [Eyben14], configuration files will be available together with the publication of the article), which is a small set of acoustic parameters relevant for affective voice research. It was standardized and agreed upon by several research teams, including linguists, psychologists, and engineers.

Besides these large-scale brute-forced acoustic feature sets, several other configuration files are provided for extracting individual LLD. These include Mel-Frequency Cepstral Coefficients (MFCC*.conf) and Perceptual Linear Predictive Coding Cepstral Coefficients (PLP*.conf), as well as the fundamental frequency and loudness (prosodyShsViterbiLoudness.conf, or smileF0.conf for fundamental frequency only).
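
As an example of the output options described above, the following call extracts one of the Challenge feature sets to both ARFF and CSV and additionally saves the frame-wise LLDs. This is a sketch that assumes the 2009 Emotion Challenge configuration is named config/IS09_emotion.conf in your release; the output file names are only illustrative:

SMILExtract -C config/IS09_emotion.conf -I example-audio/opensmile.wav \
            -O is09_functionals.arff -csvoutput is09_functionals.csv \
            -lldoutput is09_lld.csv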

Conclusion and summary

We have introduced openSMILE version 2.1 in this article and have given a hands-on practical guide on how to use it to extract audio features with out-of-the-box baseline feature sets, as well as customized acoustic descriptors. It was also shown how to use the voice activity detector and the pre-trained emotion models from the openEAR toolkit for live, incremental emotion recognition. The openSMILE toolkit features a large collection of baseline acoustic feature sets for paralinguistic speech and music analysis and a flexible and complete framework for audio analysis. In future work, more effort will be put into documentation, a speed-up of the underlying framework, and the implementation of new, robust acoustic and visual descriptors.

Acknowledgements

This research was supported by an ERC Advanced Grant in the European Community’s 7th Framework Programme under grant agreement 230331-PROPEREMO (Production and perception of emotion: an affective sciences approach) to Klaus Scherer and by the National Center of Competence in Research (NCCR) Affective Sciences financed by the Swiss National Science Foundation (51NF40-104897) and hosted by the University of Geneva. The research leading to these results has received funding from the European Community’s Seventh Framework Programme under grant agreement No. 338164 (ERC Starting Grant iHEARu). The authors would like to thank audEERING UG (haftungsbeschränkt) for providing up-to-date pre-release documentation, computational resources, and great support in maintaining the free open-source releases.