Summary of the 5th BAMMF

Bay Area Multimedia Forum (BAMMF)

BAMMF is a Bay Area Multimedia Forum series. Experts from both academia and industry are invited to exchange ideas and information through talks, tutorials, posters, panel discussions and networking sessions. Topics of the forum will include emerging areas in vision, audio, touch, speech, text, various sensors, human computer interaction, natural language processing, machine learning, media-related signal processing, communication, and cross-media analysis etc. Talks in the event may cover advancement in algorithms and development, demonstration of new inventions, product innovation, business opportunities, etc. If you are interested in giving a presentation at the forum, please contact us.

The 5th BAMMF

The 5th BAMMF was held in the George E. Pake Auditorium in Palo Alto, CA, USA on November 20, 2014. The slides and videos of the speakers at the forum have been made available on the BAMMF web page, and we provide here an overview of their talks. For speakers’ bios, the slides and videos, please visit the web page.

Industrial Impact of Deep Learning – From Speech Recognition to Language and Multimodal Processing

Li Deng (Deep Learning Technology Center, Microsoft Research, Redmond, USA)

Since 2010, deep neural networks have started making a real impact in the speech recognition industry, building upon earlier work on (shallow) neural nets and (deep) graphical models developed by both the speech and machine learning communities. This keynote will first reflect on the historical path to this transformative success. The role of well-timed academic-industrial collaboration will be highlighted, as will the advances in big data, big compute, and the seamless integration between application-domain knowledge of speech and general principles of deep learning. Then, an overview will be given of the sweeping achievements of deep learning in speech recognition since its initial success in 2010 (as well as in image recognition since 2012). Such achievements have resulted in across-the-board, industry-wide deployment of deep learning. The final part of the talk will focus on applications of deep learning to large-scale language/text and multimodal processing, a more challenging area where potentially much greater industrial impact than in speech and image recognition is emerging.

Brewing a Deeper Understanding of Images

Yangqing Jia (Google)

In this talk I will introduce the recent developments in the image recognition fields from two perspectives: as a researcher and as an engineer. For the first part I will describe our recent entry “GoogLeNet” that won the ImageNet 2014 challenge, including the motivation of the model and knowledge learned from the inception of the model. For the second part, I will dive into the practical details of Caffe, an open-source deep learning library I created at UC Berkeley, and show how one could utilize the toolkit for a quick start in deep learning as well as integration and deployment in real-world applications.

Applied Deep Learning

Ronan Collobert (Facebook)

I am interested in machine learning algorithms which can be applied in real-life applications and which can be trained on “raw data”. Specifically, I prefer to trade simple “shallow” algorithms with task-specific handcrafted features for more complex (“deeper”) algorithms trained on raw features. In that respect, I will present several general deep learning architectures, which excel in performance on various Natural Language, Speech and Image Processing tasks. I will look into specific issues related to each application domain, and will attempt to propose general solutions for each use case.

Compositional Language and Visual Understanding

Richard Socher (Stanford)

In this talk, I will describe deep learning algorithms that learn representations for language that are useful for solving a variety of complex language tasks. I will focus on 3 projects:

  • Contextual sentiment analysis (e.g. having an algorithm that actually learns what’s positive in this sentence: “The Android phone is better than the IPhone”)
  • Question answering to win trivia competitions (like IBM Watson’s Jeopardy system but with one neural network)
  • Multimodal sentence-image embeddings to find images that visualize sentences and vice versa (with a fun demo!)

All three tasks are solved with a similar type of recursive neural network algorithm.


MPEG Column: 110th MPEG Meeting

Original posts here by Multimedia Communication blog, Christian Timmerer, AAU/bitmovin

The 110th MPEG meeting was held at the Strasbourg Convention and Conference Centre featuring the following highlights:

  • The future of video coding standardization
  • Workshop on media synchronization
  • Standards at FDIS: Green Metadata and CDVS
  • What’s happening in MPEG-DASH?

Additional details about MPEG’s 110th meeting can be also found here including the official press release and all publicly available documents.

The Future of Video Coding Standardization

MPEG110 hosted a panel discussion about the future of video coding standardization. The panel was organized jointly by MPEG and ITU-T SG 16’s VCEG featuring Roger Bolton (Ericsson), Harald Alvestrand (Google), Zhong Luo (Huawei), Anne Aaron (Netflix), Stéphane Pateux (Orange), Paul Torres (Qualcomm), and JeongHoon Park (Samsung).

As expected, “maximizing compression efficiency remains a fundamental need” and as usual, MPEG will study “future application requirements, and the availability of technology developments to fulfill these requirements”. Therefore, two Ad-hoc Groups (AhGs) have been established which are open to the public:

The presentations of the brainstorming session on the future of video coding standardization can be found here.

Workshop on Media Synchronization

MPEG110 also hosted a workshop on media synchronization for hybrid delivery (broadband-broadcast) featuring six presentations “to better understand the current state-of-the-art for media synchronization and identify further needs of the industry”.

  • An overview of MPEG systems technologies providing advanced media synchronization, Youngkwon Lim, Samsung
  • Hybrid Broadcast – Overview of DVB TM-Companion Screens and Streams specification, Oskar van Deventer, TNO
  • Hybrid Broadcast-Broadband distribution for new video services :  a use cases perspective, Raoul Monnier, Thomson Video Networks
  • HEVC and Layered HEVC for UHD deployments, Ye Kui Wang, Qualcomm
  • A fingerprinting-based audio synchronization technology, Masayuki Nishiguchi, Sony Corporation
  • Media Orchestration from Capture to Consumption, Rob Koenen, TNO

The presentation material is available here. Additionally, MPEG established an AhG on timeline alignment (that’s how the project is internally called) to study use cases and solicit contributions on gap analysis and also technical contributions [email][subscription].

Standards at FDIS: Green Metadata and CDVS

My first report on MPEG Compact Descriptors for Visual Search (CDVS) dates back to July 2011 which provides details about the call for proposals. Now, finally, the FDIS has been approved during the 110th MPEG meeting. CDVS defines a compact image description that facilitates the comparison and search of pictures that include similar content, e.g. when showing the same objects in different scenes from different viewpoints. The compression of key point descriptors not only increases compactness, but also significantly speeds up, when compared to a raw representation of the same underlying features, the search and classification of images within large image databases. Application of CDVS for real-time object identification, e.g. in computer vision and other applications, is envisaged as well.

Another standard that reached FDIS status is Green Metadata (first reported in August 2012). This standard specifies the format of metadata that can be used to reduce energy consumption from the encoding, decoding, and presentation of media content, while simultaneously controlling or avoiding degradation in the Quality of Experience (QoE). Moreover, the metadata specified in this standard can facilitate a trade-off between energy consumption and QoE. MPEG is also working on amendments to the ubiquitous MPEG-2 TS ISO/IEC 13818-1 and ISOBMFF ISO/IEC 14496-12 so that green metadata can be delivered by these formats.

What’s happening in MPEG-DASH?

MPEG-DASH is in a kind of maintenance mode but is still receiving new proposals in the area of SAND parameters, and some core experiments are going on. Also, the DASH-IF is working towards new interoperability points and test vectors in preparation of actual deployments. When speaking about deployments, they are happening, e.g., a 40h live stream right before Christmas (by bitmovin, a top-100 company that matters most in online video). Additionally, VideoNext was co-located with CoNEXT’14 targeting scientific presentations about the design, quality and deployment of adaptive video streaming. Webex recordings of the talks are available here. In terms of standardization, MPEG-DASH is progressing towards the 2nd amendment including spatial relationship description (SRD), generalized URL parameters and other extensions. In particular, SRD will enable new use cases which can only be addressed using MPEG-DASH, and the FDIS is scheduled for the next meeting which will be in Geneva, Feb 16-20, 2015. I’ll report on this within my next blog post, stay tuned.

Call for Workshop Proposals @ ACM Multimedia 2015

We invite proposals for Workshops to be held at the ACM Multimedia 2015 Conference. Accepted workshops will take place in conjunction with the main conference, which is scheduled for October 26-30, 2015, in Brisbane, Australia.

We solicit proposals for two different kinds of workshops: regular workshops and data challenge.

Regular Workshops

The regular workshops should offer a forum for discussions of a broad range of emerging and specialized topics of interest to the SIG Multimedia community. There are a number of important issues to be considered when generating a workshop proposal:

  1. The topic of the proposed workshop should offer a perspective distinct from and complementary to the research themes of the main conference. We therefore strongly advise to carefully review the themes of the main conference (which can be found here), when generating a proposal.
  2. The SIG Multimedia community expects the workshop program to be part of the program to nurture and to grow the workshop research theme towards becoming mainstream in the multimedia research field and one of the themes of the main conference in the future.
  3. Interdisciplinary theme workshops are strongly encouraged.
  4. Workshops should offer a discussion forum of a different type than that of the main conference. In particular, they should avoid becoming “mini-conferences” with accompanying keynote presentations and best paper awards. While formal presentation of ideas through regular oral sessions is allowed, we strongly encourage organizers to propose alternate ways to allow participants to discuss open issues, key methods and important research topics related to the workshop theme. Examples are panels, group brainstorming sessions, mini-tutorials around key ideas and proof of concept demonstration sessions.

Data Challenge Workshops

We are also seeking organizers to propose Challenge-Based Workshops. Both academic and corporate organizers are welcome.

Data Challenge workshops are solicited from both academic and corporate organizers. The organizers should provide a dataset that is exemplar of the complexities of current and future multimodal/multimedia problems, and one or more multimodal/ multimedia tasks whose performance can be objectively measured. Participants in the challenge will evaluate their methods against the challenge data in order to identify areas of strengths and weakness. Best performing participating methods will be presented in the form of papers and oral/poster presentations at the workshop.

More information

For details on submitting workshops proposals and the evaluation criteria, please check the following site:

Important dates:

  • Proposal Submission: February 10, 2015
  • Notification of Acceptance: February 27, 2015

Looking forward to receiving many excellent submissions!

Alan Hanjalic, Lexing Xie and Svetha Venkatesh
Workshops Chairs, ACM Multimedia 2015

openSMILE:) The Munich Open-Source Large-scale Multimedia Feature Extractor

A tutorial for version 2.1


The openSMILE feature extraction and audio analysis tool enables you to extract large audio (and recently also video) feature spaces incrementally and fast, and apply machine learning methods to classify and analyze your data in real-time. It combines acoustic features from Music Information Retrieval and Speech Processing, as well as basic computer vision features. Large, standard acoustic feature sets are included and usable out-of-the-box to ensure comparable standards in feature extraction in related research. The purpose of this article is to briefly introduce openSMILE, its features, potentials, and intended use-cases as well as to give a hands-on tutorial packed with examples that should get you started quickly with using openSMILE.

About openSMILE

SMILE is originally an acronym for Speech & Music Interpretation by Large-space feature Extraction. Due to the recent addition of video-processing in version 2.0, the acronym openSMILE evolved to open-Source Media Interpretation by Large-space feature Extraction. The development of the toolkit was started at Technische Universität München (TUM) for the EU-FP7 research project SEMAINE. The original primary focus was on state-of-the-art acoustic emotion recognition for emotionally aware, interactive virtual agents. After the project, openSMILE has been continuously extended to a universal audio analysis toolkit. It has been used and evaluated extensively in the series of INTERSPEECH challenges on emotion, paralinguistics, and speaker states and traits: From the first INTERSPEECH 2009 Emotion Challenge up to the upcoming Challenge at INTERSPEECH 2015 (see for a summary of the challenges). In 2013 the code-base was transferred to audEERING, which continues the development under a dual-license model, keeping openSMILE free for the research community. openSMILE is written in C++ and is available both as a standalone command-line executable and as a dynamic library.

The main features of openSMILE are its capability of on-line incremental processing and its modularity. Feature extractor components can be freely interconnected to create new and custom features, all via a simple text-based configuration file. New components can be added to openSMILE via an easy binary plug-in interface and an extensive internal API. Scriptable batch feature extraction is supported just as well as live on-line extraction from live recorded audio streams. This enables you to build and design systems on off-line databases, and then use exactly the same code to run your developed system in an interactive on-line prototype or even product.

openSMILE is intended as a toolkit for researchers and developers, but not for end-users. It thus cannot be configured through a Graphical User Interface (GUI). However, it is a fast, scalable, and highly flexible command-line backend application, on which several front-end applications could be based. Examples are network interface components and, in the latest release of openSMILE (version 2.1), a batch feature extraction GUI for Windows platforms. As seen in the above figure, the GUI allows you to easily choose a configuration file, the desired output files and formats, and to select files and folders on which to run the analysis.

Made popular in the field of speech emotion recognition and paralinguistic speech analysis, openSMILE is now being widely used in this community. According to Google Scholar, the two papers on openSMILE ([Eyben10] and [Eyben13a]) are currently cited over 380 times. Research teams across the globe are using it for several tasks, including paralinguistic speech analysis (such as alcohol intoxication detection), VoiceXML telephony-based spoken dialogue systems (as implemented by the HALEF framework), natural, speech-enabled virtual agent systems, and human behavioural signal processing, to name only a few examples.

Key Features

The key features of openSMILE are:

  • It is cross-platform (Windows, Linux, Mac, new in 2.1: Android)
  • It offers both incremental processing and batch processing.
  • It efficiently extracts a large number of features very fast by re-using already computed values.
  • It has multi-threading support for parallel feature extraction and classification.
  • It is extensible with new custom components and plug-ins.
  • It supports audio file in- and output as well as live sound recording and playback.
  • The computation of MFCC, PLP, (log-)energy, and delta regression coefficients is fully HTK compatible.
  • It has a wide range of general audio signal processing components:
    • Windowing functions (Hamming, Hann, Gauss, Sine, …),
    • Fast-Fourier Transform,
    • Pre-emphasis filter,
    • Finite-Impulse-Response (FIR) filterbanks,
    • Autocorrelation,
    • Cepstrum,
    • Overlap-add re-synthesis,
  • … and speech-related acoustic descriptors:
    • Signal energy,
    • Loudness based on a simplified sub-band auditory model,
    • Mel-/Bark-/Octave-scale spectra,
    • MFCC and PLP-CC,
    • Pitch (ACF and SHS algorithms and Viterbi smoothing),
    • Voice quality (Jitter, Shimmer, HNR),
    • Linear Predictive Coding (LPC),
    • Line Spectral Pairs (LSP),
    • Formants,
    • Spectral shape descriptors (Roll-off, slope, etc.),
  • … and music-related descriptors:
    • Pitch classes (semitone spectrum),
    • CHROMA and CENS features.
  • It supports multi-modal fusion on the feature level through openCV integration.
  • Several post-processing methods for low-level descriptors are included:
    • Moving average smoothing,
    • Moving average mean subtraction and variance normalization (e.g. for on-line Cepstral mean subtraction),
    • On-line histogram equalization (experimental),
    • Delta regression coefficients of arbitrary order,
    • Binary operations to re-combine descriptors.
  • A wide range of statistical functionals for feature summarization is supported, e.g.:
    • Means, Extremes,
    • Moments,
    • Segment statistics,
    • Sample-values,
    • Peak statistics,
    • Linear and quadratic regression,
    • Percentiles,
    • Durations,
    • Onsets,
    • DCT coefficients,
    • Zero-crossings.
  • Generic and popular data file formats are supported:
    • Hidden Markov Toolkit (HTK) parameter files (read/write)
    • WEKA Arff files (currently only non-sparse) (read/write)
    • Comma separated value (CSV) text (read/write)
    • LibSVM feature file format (write)

In the latest release (2.1) the new features are:

  • Integration and improvement of the emotion recognition models from openEAR,
  • LSTM-RNN based voice-activity detector prototype models included,
  • Fast linear SVMsink component which supports linear kernel SVM models trained with the WEKA SMO classifier,
  • LSTM-RNN JSON network file support for networks trained with the CURRENNT toolkit,
  • Spectral harmonics descriptors,
  • Android support,
  • Improvements to configuration files and command-line options,
  • Improvements and fixes.

openSMILE’s architecture

openSMILE has a very modular architecture, designed for incremental data-flow. A central dataMemory component hosts shared memory buffers (known as dataMemory levels) to which a single component can write data and from which one or more other components can read data. There are data-source components, which read data from files or other external sources and introduce them to the dataMemory. Then there are data-processor components, which read data, modify them, and save them to a new buffer; these are the actual feature extractor components. In the end, data-sink components read the final data and save them to files or digest them in other ways (classifiers etc.).

As all components which process data and connect to the dataMemory share some common functionality, they are all derived from a single base class cSmileComponent. The following figure shows the class hierarchy, and the connections between the cDataWriter and cDataReader components to the dataMemory (dotted lines).

Getting openSMILE and the documentation

The latest openSMILE packages can be downloaded here. At the time of writing the most recent release is 2.1. Grab the complete package of the latest release. This includes the source code and the binaries for Linux and Windows. The most recent releases might not always include a full set of binaries for all platforms, so sometimes you might have to compile from source if you want the latest cutting-edge version.

While the tutorial in the next section should give you a good quick-start, it does not and cannot cover every detail of openSMILE. For learning more and getting further help, there are three main resources. The first is the openSMILE documentation, called the openSMILE book. It contains detailed instructions on how to install, compile, and use openSMILE and introduces you to the basics of openSMILE. However, it might not be the most up-to-date resource for the newest features. Thus, the second resource is the on-line help built into the binaries. This provides the most up-to-date documentation of available components and their options and features. We will tell you how to use the on-line help in the next section. If you cannot find your answer in either of these resources, you can ask for help in the discussion forums on the openSMILE website or read the source-code.

Quick-start tutorial

You can’t wait to get openSMILE and try it out on your own data? Then this is your section. In the following the basic concepts of openSMILE are described, pre-built use-cases of automatic, on-line voice activity detection and speech emotion recognition are presented, and the concept of configuration files and the data-flow architecture are explained.

a. Basic concepts

Please refer to the openSMILE book for detailed installation and compilation instructions. Here we assume that you have a compiled SMILExtract binary (optionally with PortAudio support, if you want to use the live audio recording examples below), with which you can run:

SMILExtract -h
SMILExtract -H cWaveSource

to see general usage instructions (first line) and the on-line help for the cWaveSource component (second line), for example. However, from this on-line help it is hard to get a general picture of the openSMILE concepts. We thus describe briefly how to use openSMILE for the most common tasks. Very loosely said, the SMILExtract binaries can be seen as a special kind of code interpreter which executes custom configuration scripts. What openSMILE actually does in the end when you invoke it is only controlled by this configuration script. So, in order to do something with openSMILE you need:

  • The binary SMILExtract,
  • a (set of) configuration file(s),
  • and optionally other files, such as classification models, etc.

The configuration file defines all the components that are to be used as well as their data-flow interconnections. All the components are iteratively run in the “tick-loop“, i.e. a run method (tick()) of each component is called in every loop iteration. Each component then checks if there are new data to process, and if yes, processes the data, and makes them available for other components to process them further. Every component returns a status value, which indicates whether the component has processed data or not. If no component has had any further data to process, the end of the data input (EOI) is assumed. All components are switched to an EOI state and the tick-loop is executed again to process data which require special attention at the end of the input, such as delta-regression coefficients. Since version 2.0-rc1, multi-pass processing is supported, i.e. providing a feature to enable re-running of the whole processing. It is not encouraged to use this, since it breaks incremental processing, but for some experiments it might be necessary. The minimal, generic use-case scenario for openSMILE is thus as follows:

SMILExtract -C config/my_configfile.conf
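To make the tick-loop described above more concrete, here is a toy sketch in Python. It is not openSMILE code: the classes Source and Delta, the function run_tick_loop, and the use of plain lists as buffers are all assumptions made for illustration. It only shows the control flow: every component is ticked in each iteration, the loop stops when no component reports progress, and a second pass with an end-of-input flag lets components flush remaining data.

```python
# Toy model of a tick-loop; names are illustrative and do not mirror
# the openSMILE C++ API. Plain lists stand in for dataMemory levels.

class Source:
    """Writes one input value per tick until its input is exhausted."""
    def __init__(self, data, out):
        self.data = list(data)
        self.out = out

    def tick(self, eoi=False):
        if not self.data:
            return False              # nothing processed in this tick
        self.out.append(self.data.pop(0))
        return True


class Delta:
    """Writes the difference between consecutive input values."""
    def __init__(self, inp, out):
        self.inp, self.out = inp, out
        self.pos = 0                  # read position in the input buffer
        self.flushed = False

    def tick(self, eoi=False):
        if self.pos + 1 < len(self.inp):
            self.out.append(self.inp[self.pos + 1] - self.inp[self.pos])
            self.pos += 1
            return True
        if eoi and not self.flushed and self.inp:
            # At end of input, pad one value so output and input lengths match.
            self.out.append(0)
            self.flushed = True
            return True
        return False


def run_tick_loop(components):
    # Normal phase: tick every component until none reports progress.
    while any([c.tick() for c in components]):
        pass
    # EOI phase: tick again with the end-of-input flag set.
    while any([c.tick(eoi=True) for c in components]):
        pass


level_a, level_b = [], []
run_tick_loop([Source([1, 4, 9, 16], level_a), Delta(level_a, level_b)])
print(level_b)  # [3, 5, 7, 0]
```

The list comprehension inside any() is deliberate: it ensures that every component is ticked in each iteration, rather than stopping at the first one that reports progress.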

Each configuration file can define additional command-line options. Most prominent examples are the options for in- and output files (-I and -O). These options are not shown when the normal help is invoked with the -h option. To show the options defined by a configuration file, use this command-line:

SMILExtract -ccmdHelp -C config/my_configfile.conf

The default command-line for processing audio files for feature extraction is:

SMILExtract -C config/my_configfile.conf -I input_file.wav -O output_file

This runs SMILExtract with the configuration given in my_configfile.conf. The following two sections will show you how to quickly get some advanced applications running as pre-configured use-cases for voice activity detection and speech emotion recognition.
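For batch extraction over many files, the default command-line above can be generated by a small script. The following is a minimal sketch in Python; the helper name build_commands, the output directory, and the ".csv" output extension are assumptions (the actual output format depends on the configuration file used). Only the -C, -I and -O options shown above are used.

```python
from pathlib import Path


def build_commands(config, wav_files, out_dir, binary="SMILExtract"):
    """Return one SMILExtract argument list per input wave file.

    The output name is derived from the input name; the '.csv' extension
    is an assumption and should match what the configuration file writes.
    """
    commands = []
    for wav in wav_files:
        out = Path(out_dir) / (Path(wav).stem + ".csv")
        commands.append([binary, "-C", str(config), "-I", str(wav), "-O", str(out)])
    return commands


cmds = build_commands("config/my_configfile.conf",
                      ["audio/a.wav", "audio/b.wav"], "features")
for cmd in cmds:
    print(" ".join(cmd))
```

Each argument list can then be passed to subprocess.run(cmd, check=True) from the Python standard library to actually execute the extraction.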

b. Use-case: The openSMILE voice-activity detector

The latest openSMILE release (2.1) contains a research prototype of an intelligent, data-driven voice-activity detector (VAD) based on Long Short-Term Memory Recurrent Neural Networks (LSTM-RNN), similar to the system introduced in [Eyben13b]. The VAD examples are contained in the folder scripts/vad. A README in that folder describes further details. Here we give a brief tutorial on how to use the two included use-case examples:

  • vad_opensource.conf: Runs the LSTM-RNN VAD and dumps the activations (voice probability) for each frame to a CSV text file. To run the example on a wave file, type:
    cd scripts/vad;
    SMILExtract -I ../../example-audio/media-interpretation.wav \
                -C vad_opensource.conf -csvoutput vad.csv

    This will write the VAD probabilities scaled to the range -1 to +1 (2nd column) and the corresponding timestamps (1st column) to vad.csv. A VAD probability greater than 0 indicates voice presence.

  • vad_segmenter.conf: Runs the VAD on an input wave file and automatically extracts voice segments to new wave files. Optionally, the raw voicing probabilities as in the above example can be saved to a file. To run the example on a wave file, type:
    cd scripts/vad;
    mkdir -p voice_segments
    SMILExtract -I ../../example-audio/media-interpretation.wav -C vad_segmenter.conf \
                -waveoutput voice_segments/segment_

    This will create a new wave file for each voice segment (numbered consecutively, starting at 1). The vad_segmenter.conf optionally supports output to CSV with the -csvoutput filename option. The start and end times (in seconds) of the voice segments relative to the start of the input file can optionally be dumped with the -saveSegmentTimes filename option. The columns of the output file are: segment filename, start (sec.), end (sec.), length of segment as number of raw (10ms) frames.
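The frame-wise output of the first example can also be post-processed with a few lines of script. The sketch below, in Python, turns a two-column file (timestamp, VAD value) into voiced time ranges using the "greater than 0" rule stated above. The function name voiced_segments is hypothetical, and the ";" column delimiter is an assumption; adjust it to match the file actually written. The sample values are made up for illustration.

```python
import csv
import io


def voiced_segments(csv_text, delimiter=";", threshold=0.0):
    """Return (start, end) timestamp pairs for runs where the VAD value > threshold."""
    segments = []
    start = None
    last_time = None
    for row in csv.reader(io.StringIO(csv_text), delimiter=delimiter):
        if len(row) < 2:
            continue                  # skip empty or malformed lines
        try:
            time, value = float(row[0]), float(row[1])
        except ValueError:
            continue                  # skip a header line, if present
        if value > threshold and start is None:
            start = time              # a voiced run begins
        elif value <= threshold and start is not None:
            segments.append((start, time))
            start = None              # the voiced run ended at this frame
        last_time = time
    if start is not None:
        segments.append((start, last_time))  # input ended inside a voiced run
    return segments


# Made-up sample data, not real openSMILE output.
sample = "0.00;-0.8\n0.01;0.3\n0.02;0.6\n0.03;-0.2\n0.04;0.9\n"
print(voiced_segments(sample))  # [(0.01, 0.03), (0.04, 0.04)]
```

For real files, read the content with open("vad.csv").read() and pass it to the function.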

To visualise the VAD output over the waveform, we recommend using Sonic-visualiser. If you have sonic-visualiser installed (on Linux) you can open both the wave-file and the VAD output with this command:

sonic-visualiser example-audio/media-interpretation.wav vad.csv

An annotation layer import dialog should appear. The first column should be detected as Time and the second column as value. If this is not the case, select these values manually, and specify that timing is specified explicitly (should be the default) and click OK. You should see something like this:

c. Use-case: Automatic speech emotion recognition

As of version 2.1, openSMILE supports running the emotion recognition models from the openEAR toolkit [Eyben09] in a live emotion recognition demo. In order to start this live speech emotion recognition demo, download the speech emotion recognition models and unzip them in the top-level folder of the openSMILE package. A folder named models should be created there which contains a README.txt, and a sub-folder emo. If this is the case, you are ready to run the demo. Type:

SMILExtract -C config/emobase_live4.conf

to run it. The classification output will be shown on the console. NOTE: This example requires that you are running a binary with PortAudio support enabled. Refer to the openSMILE book for details on how to compile your binary with portaudio support for Linux. For Windows pre-compiled binaries (SMILExtractPA*.exe) are included, which should be used instead of the standard SMILExtract.exe for the above example. If you want to choose a different audio recording device, use

SMILExtract -C config/emobase_live4.conf -device ID

To see a list of available devices and their IDs, type:

SMILExtract -C config/emobase_live4.conf -listdevices

Note: If you have a different directory layout or have installed SMILExtract in a system path, you must make sure that the models are located in a directory named “models” in the directory from which you call the binary, or you must adapt the path to the models in the configuration file (emobase_live4.conf).

In openSMILE 2.1, the emotion recognition models can also be used for off-line/batch analysis. Two configuration files are provided for this purpose: config/emobase_live4_batch.conf and config/emobase_live4_batch_single.conf. The latter of the two will compute a single feature vector for the input file and return a single result. Use this if your audio files are already chunked into short phrases or sentences. The first, emobase_live4_batch.conf, will run an energy-based segmentation on the input and will return a result for every segment. Use this for longer, un-cut audio files. To run the analysis in batch mode, type:

SMILExtract -C config/emobase_live4_batch(_single).conf -I example-audio/opensmile.wav > result.txt

This will redirect the result(s) from SMILExtract’s standard output (console) to the file result.txt. The file is by default in a machine parseable format, where key=value tokens are separated by :: and a single result is given on each line, for example:


The above example is the result of the analysis of the file example-audio/media-interpretation.wav.
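A result line in the format described above (key=value tokens separated by ::) can be split into a dictionary with a short script. The following Python sketch shows one way to do this; the function name parse_result_line is hypothetical, and the sample line with its keys is made up for illustration and is not real openSMILE output.

```python
def parse_result_line(line):
    """Split one 'key=value::key=value' result line into a dict of strings."""
    fields = {}
    for token in line.strip().split("::"):
        if "=" not in token:
            continue                  # ignore tokens that are not key=value
        key, value = token.split("=", 1)
        fields[key.strip()] = value.strip()
    return fields


# Hypothetical sample line; the keys are placeholders, not openSMILE's.
sample_line = "name=demo::label=neutral::score=0.75"
print(parse_result_line(sample_line))
# {'name': 'demo', 'label': 'neutral', 'score': '0.75'}
```

Values are kept as strings; convert numeric fields with float() where needed.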

d. Understanding configuration files

The above, pre-configured examples are a good quick-start to show the diverse potential of the tool. We will now take a deeper look at openSMILE configuration files. First, we will use simple, small configuration files, and modify these in order to understand the basic concepts of these files. Then, we will show you how to write your own configuration files from scratch. The demo files used in this section are provided in the 2.1 release package in the folder config/demo. We will first start with demo1_energy.conf. This file extracts basic frame-wise logarithmic energy. To run this file on one of the included audio examples in the folder example-audio, type the following command:

SMILExtract -C config/demo/demo1_energy.conf -I example-audio/


.wav -O energy.csv

This will create a file called energy.csv. Its content should look similar to this:

The second example we will discuss here is the audio recorder example (audiorecorder.conf).

NOTE: This example requires that you are running a binary with PortAudio support enabled. Refer to the openSMILE book for details on how to compile your binary with portaudio support for Linux. For Windows pre-compiled binaries (SMILExtractPA*.exe) are included, which should be used instead of the standard SMILExtract.exe for the following example.

This example implements a simple live audio recorder. Audio is recorded from the default audio device to an uncompressed PCM wave file. To run the example and record to rec.wav, type:

SMILExtract -C config/demo/audiorecorder.conf -O rec.wav

Modifying existing configuration files is the fastest way to create custom extraction scripts. We will now change the demo1_energy.conf file to extract Root-Mean-Square (RMS) energy instead of logarithmic energy. This can be achieved by changing the respective options in the section of the cEnergy component (identified by the section heading [energy:cEnergy]) from

rms = 0
log = 1


rms = 1
log = 0
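To make the difference between the two settings concrete, here is a minimal Python sketch of the two energy measures for a single frame. It uses common textbook definitions (square root of the mean squared sample for RMS, natural logarithm of the mean squared sample for log energy). The exact scaling and flooring used inside the cEnergy component are not given in this article, so the function names and the floor value below are assumptions for illustration only.

```python
import math

def rms_energy(frame):
    """Root-mean-square energy: square root of the mean squared sample value."""
    return math.sqrt(sum(x * x for x in frame) / len(frame))

def log_energy(frame, floor=1e-12):
    """Natural log of the mean squared sample value.

    The floor (an assumed value, not taken from openSMILE) avoids log(0)
    for silent frames.
    """
    mean_sq = sum(x * x for x in frame) / len(frame)
    return math.log(max(mean_sq, floor))

# A toy frame of four samples with constant magnitude 0.5
frame = [0.5, -0.5, 0.5, -0.5]
print("RMS energy:", rms_energy(frame))
print("log energy:", log_energy(frame))
```

RMS energy stays on the same scale as the signal amplitude, while log energy compresses the dynamic range, which is why the two options are offered as alternatives.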

As a second example, we will merge audiorecorder.conf and demo1_energy.conf to create a configuration file which computes the frame-wise RMS energy from live audio input. First, we concatenate the two files. On Linux, type:

cat config/demo/audiorecorder.conf config/demo/demo1_energy.conf > config/demo/live_energy.conf

On Windows, use a text editor such as Notepad++ to combine the files via copy and paste. Now we must remove the cWaveSource component from the original demo1_energy.conf, as this should be replaced by the cPortaudioSource component of the audiorecorder.conf file. To do this, we search for the line

instance[waveSource].type = cWaveSource

and comment it out by prefixing it with a ; or the C-style // or the script- and INI-style #. We also remove the corresponding configuration file section for waveSource. We do the same for the waveSink component and the corresponding section, to leave only the output of the computed frame-wise energy to a CSV file. Theoretically, we could also leave the waveSink section and component in place, but we would then need to change the command-line option defined for the output filename, as without any changes it is the same for the CSV output and the wave-file output. In this case we should replace the filename option in the waveSink section by:

filename = \cm[waveoutput{output.wav}:name of output wave file]

Now, run your new configuration file with:

SMILExtract -C config/demo/live_energy.conf -O live_energy.csv

and inspect the contents of the live_energy.csv file with a text editor.

openSMILE configuration files are made up of sections, similar to INI files. Each section is identified by a header which takes the form:

[instancename:typename]

The first part (instancename) is a custom-chosen name for the section. It must be unique throughout the whole configuration file and all included sub-files. The second part defines the type of this configuration section and thereby its allowed contents. The typename must be one of the available component names (from the list printed by the command SMILExtract -L), as configuration file sections are linked to component instances.

The contents of each section are lines of key=value pairs, until the next section header is found. Besides simple key=value pairs as in INI files, openSMILE supports a more advanced structure. The key can be hierarchical, such as key1.subkey, or an array element, such as keyarray[0] and keyarray[1]. The value field can in turn denote an array of values if the values are separated by a semicolon (;). Quotes around values are not needed and not yet supported, and multi-line values are not allowed. Boolean flags are always expressed as numeric values, with 1 for on or true and 0 for off or false. The keys are the configuration options of the components, i.e. those listed by the on-line help (SMILExtract -H cComponentType).

Since version 2.1, configuration sections can be split into multiple parts across the configuration file. That is, the same header (same instancename and typename) may occur more than once. In that case, all options from all occurrences will be joined.

There is one configuration section that must always be present: that of the component manager:

instance[dataMemory].type = cDataMemory
instance[instancename].type = cComponentType
instance[instancename2].type = cComponentType2
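The section syntax described above can be illustrated with a small parser. This is a minimal Python sketch, not openSMILE's own configuration reader: it only handles section headers, comment lines, key=value pairs, semicolon-separated arrays and repeated headers. The function name parse_config and the key hypotheticalList are made up for this example.

```python
import re

# Section header of the form [instancename:typename]
HEADER = re.compile(r"^\[(\w+):(\w+)\]$")
# Comment prefixes mentioned in the text: ;  //  #
COMMENT_PREFIXES = (";", "//", "#")

def parse_config(text):
    """Parse sections into {(instancename, typename): {key: value}}."""
    sections = {}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith(COMMENT_PREFIXES):
            continue  # skip blank lines and commented-out lines
        match = HEADER.match(line)
        if match:
            # A repeated header re-opens the same section, so options are joined.
            current = sections.setdefault(match.groups(), {})
            continue
        if current is None or "=" not in line:
            raise ValueError(f"unexpected line: {raw!r}")
        key, value = (part.strip() for part in line.split("=", 1))
        items = [item.strip() for item in value.split(";")]
        # A semicolon-separated value denotes an array of values.
        current[key] = items if len(items) > 1 else items[0]
    return sections

# The second [energy:cEnergy] block and its key are hypothetical; they only
# demonstrate joined sections and array values.
SAMPLE = """
[energy:cEnergy]
rms = 1
; log = 1
log = 0

[energy:cEnergy]
hypotheticalList = a;b;c
"""

config = parse_config(SAMPLE)
print(config[("energy", "cEnergy")])
```

Keying the result by the (instancename, typename) pair keeps the two parts of the header separate, which makes it easy to check later that every instance listed in the component manager section has a matching configuration section.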

The component manager is the main instance which creates all component instances of the currently loaded configuration, makes them read their configuration settings from the parsed configuration file (through the configManager component), and runs the tick-loop, i.e. the loop where data are processed incrementally by calling each component once to process newly available data frames. Each component that is to be included in the configuration must be listed in this section, and for each component listed there, a corresponding configuration file section with the same instancename and of the same component type must exist. The only exception is the first line, which instantiates the central dataMemory component. It must always be present in the instance list, but no configuration file section has to be supplied for it.

Each component that processes data has a data-reader and/or a data-writer sub-component, which are configurable via the reader and writer objects. The only options of interest to us now in these objects are the dmLevel options. These options configure the data-flow connections in your configuration file, i.e. they define in which order data is processed by the components, or in other words, which component is connected to which other component. Each component that modifies or creates data (e.g. by reading it from external sources) writes its data to a unique dataMemory location (called a level). The name of this location is defined in the configuration file via the option writer.dmLevel=name_of_level. The level names must be unique, and only a single component can write to each level. Multiple components can, however, read from a single level, enabling re-use of already computed data by multiple components. For example, we typically have a wave source component which reads audio data from an uncompressed audio file (see also the demo1_energy.conf file):

writer.dmLevel = wave
filename = input.wav

The above reads data from input.wav into the dataMemory level wave. If next we want to chunk the audio data into overlapping analysis windows of 20ms length at a rate of 10ms, we need a cFramer component:

reader.dmLevel = wave
writer.dmLevel = frames20ms
frameSize = 0.02
frameStep = 0.01

The crucial line in the above code is the one which sets the reader dataMemory level (reader.dmLevel = wave) to the output level of the wave source component, effectively connecting the framer to the wave source component.

To create new configuration files from scratch, a configuration file template generator is available. We will use it to create a configuration for computing magnitude spectra via the Fast Fourier Transform (FFT). The template file generator requires a list of the components that we want to have in the configuration file, so we must build this list first. In openSMILE, most processing steps are wrapped in individual components to increase flexibility and re-usability of intermediate data. For our example we thus need the following components:

  • An audio file reader (cWaveSource),
  • a component which generates short-time analysis frames (cFramer),
  • a component which applies a windowing function to these frames such as a Hamming window (cWindower),
  • a component which performs an FFT (cTransformFFT),
  • a component which computes spectral magnitudes from the complex FFT result (cFFTmagphase),
  • and finally a component which writes the magnitude spectra to a CSV file (cCsvSink).
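To make the role of each processing step in this list concrete, the following Python sketch mimics the framing, windowing, transform and magnitude stages on a list of samples. It is an illustration only, not openSMILE code: a naive discrete Fourier transform stands in for a real FFT, the Hamming window uses the common 0.54/0.46 definition, and all function names are hypothetical.

```python
import cmath
import math

def frames(signal, size, step):
    """Yield overlapping frames of `size` samples, advancing by `step` samples."""
    for start in range(0, len(signal) - size + 1, step):
        yield signal[start:start + size]

def hamming(n):
    """Hamming window coefficients (common 0.54/0.46 definition)."""
    return [0.54 - 0.46 * math.cos(2 * math.pi * i / (n - 1)) for i in range(n)]

def dft(x):
    """Naive O(N^2) discrete Fourier transform; stands in for a real FFT."""
    n = len(x)
    return [
        sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]

def magnitude_spectra(signal, size, step):
    """Frame, window and transform the signal; yield one magnitude spectrum per frame."""
    window = hamming(size)
    for frame in frames(signal, size, step):
        windowed = [s * w for s, w in zip(frame, window)]
        spectrum = dft(windowed)
        # Keep the non-redundant half of the spectrum of a real-valued signal.
        yield [abs(c) for c in spectrum[: size // 2 + 1]]

# Toy input: a sine with exactly 4 cycles per 32-sample frame
signal = [math.sin(2 * math.pi * 4 * t / 32) for t in range(64)]
spectra = list(magnitude_spectra(signal, size=32, step=16))
peak_bin = max(range(len(spectra[0])), key=lambda k: spectra[0][k])
print("frames:", len(spectra), "peak bin of first frame:", peak_bin)
```

Each function corresponds to one stage of the chain, which mirrors the way openSMILE keeps intermediate results (frames, windowed frames, complex spectra) available separately.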

To generate our configuration file template, we thus run (note that the component names are case sensitive!):

SMILExtract -l 0 -logfile my_fft_magnitude.conf -cfgFileTemplate -configDflt cWaveSource,cFramer,

The switch -cfgFileTemplate enables the template file output, and makes -configDflt accept a comma-separated list of component names. If -configDflt is used by itself, it will print only the default configuration section of a single component (whose name is given as the argument to that option). This invocation of SMILExtract prints the configuration file template to the log (i.e., to standard error and to the log file given by the -logfile option). The switch -l 0 suppresses all other log messages (by setting the log-level to 0), leaving only the configuration file template lines in the specified file.

The file generated by the above command cannot yet be used as is. We need to update the data-flow connections first. In our example this is trivial, as each component reads from the previous one, except for the wave source, which has no reader. We have to change:

writer.dmLevel = < >

to

writer.dmLevel = wave

The same for the framer, resulting in:

reader.dmLevel = wave
writer.dmLevel = frames

and for the windower:

reader.dmLevel = frames
writer.dmLevel = windowed
winFunc = Hamming

where we also change the windowing function from the default (Hanning) to Hamming, and in the same fashion we go down all the way to the csvSink component:

reader.dmLevel = windowed
writer.dmLevel = fftcomplex


reader.dmLevel = fftcomplex
writer.dmLevel = fftmag


reader.dmLevel = fftmag
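The wiring rules used above (each level has exactly one writer, and every reader must refer to a level that some component writes) can be checked mechanically. The following Python sketch does this for the chain just configured. The level names are the ones used in this section, while the instance names and the check_dataflow function are hypothetical and not part of openSMILE.

```python
def check_dataflow(components):
    """Return a list of wiring errors.

    `components` is a list of (instance name, reader level or None,
    writer level or None) tuples.
    """
    writers = {}
    errors = []
    for name, _, writer in components:
        if writer is None:
            continue
        if writer in writers:
            errors.append(
                f"level '{writer}' is written by both {writers[writer]} and {name}"
            )
        else:
            writers[writer] = name
    for name, reader, _ in components:
        if reader is not None and reader not in writers:
            errors.append(f"{name} reads from missing level '{reader}'")
    return errors

# Instance names are assumptions; level names follow the text.
pipeline = [
    ("waveSource", None, "wave"),
    ("framer", "wave", "frames"),
    ("windower", "frames", "windowed"),
    ("fft", "windowed", "fftcomplex"),
    ("fftmagphase", "fftcomplex", "fftmag"),
    ("csvSink", "fftmag", None),
]
print(check_dataflow(pipeline) or "data-flow looks consistent")
```

A mistyped level name, for example a sink reading from fftmagnitude instead of fftmag, would show up as a "missing level" entry in the returned list.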

The configuration file can now be used with the command:

SMILExtract -C my_fft_magnitude.conf

However, if you run the above, you will most likely get an error message that the file input.wav is not found. This is good news, as it means first of all that you have configured the data-flow correctly. Had you not, you would get error messages about missing data memory levels, etc. The missing file problem is due to the hard-coded input file name set by the option filename = input.wav in the wave source section. If you change this line to filename = example-audio/opensmile.wav, your configuration will run without errors. It writes the result to a file called smileoutput.csv.

To avoid having to change the filenames in the configuration file for every input file you want to process, openSMILE provides a very convenient feature: it allows you to define command-line options in the configuration files. To use this feature, replace the value of the filename option with the \cm[] command, e.g. for the input file:

filename = \cm[inputfile(I){input.wav}:input filename]

and for the output file:

filename = \cm[outputfile(O){output.csv}:output filename]

The syntax of the \cm command is: [longoptionName(shortOption-1charOnly){default value}:description for on-line help].
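As an illustration of this syntax, the following Python sketch splits a \cm[...] value into its parts and then picks either the value given on a command line or the default. It is not how openSMILE itself parses these values. The regular expression, the function names, and the assumption that options are passed as a single dash followed by the long or short name are all made up for this example.

```python
import re

# \cm[longoptionName(shortOption){default value}:description]
# The short option in parentheses is treated as optional here.
CM_PATTERN = re.compile(
    r"\\cm\[(?P<long>\w+)(?:\((?P<short>\w)\))?\{(?P<default>[^}]*)\}:(?P<help>[^\]]*)\]"
)

def parse_cm(value):
    """Return the parts of a \\cm[...] value as a dict, or None if it does not match."""
    match = CM_PATTERN.fullmatch(value.strip())
    return match.groupdict() if match else None

def resolve(spec, argv):
    """Pick the value from argv if the option is present, else the default.

    Assumption: options are given as '-<long>' or '-<short>' followed by the value.
    """
    names = {"-" + spec["long"]}
    if spec["short"]:
        names.add("-" + spec["short"])
    for flag, value in zip(argv, argv[1:]):
        if flag in names:
            return value
    return spec["default"]

spec = parse_cm(r"\cm[inputfile(I){input.wav}:input filename]")
print(spec)
print(resolve(spec, ["-C", "my_fft_magnitude.conf", "-I", "example-audio/opensmile.wav"]))
print(resolve(spec, ["-C", "my_fft_magnitude.conf"]))
```

The last two calls show the intended effect: the same configuration file works for any input file given with -I, and falls back to the default value when the option is omitted.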

e. Reference feature sets

A major advantage of openSMILE over related feature extraction toolkits is that it comes with several reference and baseline feature sets which were used for the INTERSPEECH Challenges (2009-2014) on Emotion, Paralinguistics and Speaker States and Traits, as well as the Audio-Visual Emotion Challenges (AVEC) from 2011-2013. All of the INTERSPEECH configuration files are found under config/ISxx_*.conf. They follow a common standard regarding the data output options they define. The default output file option (-O) defines the name of the WEKA ARFF file to which functionals are written. To additionally save the data in CSV format, use the option -csvoutput filename. To disable the default ARFF output, use -O ?. To enable saving of intermediate parameters, i.e. frame-wise Low-Level Descriptors (LLD), in CSV format, the option -lldoutput filename can be used. By default, lines are appended to the functionals ARFF and CSV files if they exist, but the LLD files will be overwritten. To change this behaviour, the boolean (1/0) options -appendstaticarff 1/0, -appendstaticcsv 1/0, and -appendlld 0/1 are provided.

Besides the Challenge feature sets, openSMILE 2.1 is capable of extracting parameters for the Geneva Minimalistic Acoustic Parameter Set (GeMAPS; submitted for publication as [Eyben14], configuration files will be available together with publication of the article), which is a small set of acoustic parameters relevant for affective voice research. It was standardized and agreed upon by several research teams, including linguists, psychologists, and engineers. Besides these large-scale brute-forced acoustic feature sets, several other configuration files are provided for extracting individual LLD. These include Mel-Frequency Cepstral Coefficients (MFCC*.conf) and Perceptual Linear Predictive Coding Cepstral Coefficients (PLP*.conf), as well as the fundamental frequency and loudness (prosodyShsViterbiLoudness.conf, or smileF0.conf for fundamental frequency only).

Conclusion and summary

We have introduced openSMILE version 2.1 in this article and have given a hands-on practical guide on how to use it to extract audio features of out-of-the-box baseline feature sets, as well as customized acoustic descriptors. It was also shown how to use the voice activity detector and pre-trained emotion models from the openEAR toolkit for live, incremental emotion recognition. The openSMILE toolkit features a large collection of baseline acoustic feature sets for paralinguistic speech and music analysis and a flexible and complete framework for audio analysis. In future work, more effort will be put into documentation, speed-up of the underlying framework, and the implementation of new, robust acoustic and visual descriptors.


This research was supported by an ERC Advanced Grant in the European Community’s 7th Framework Programme under grant agreement 230331-PROPEREMO (Production and perception of emotion: an affective sciences approach) to Klaus Scherer and by the National Center of Competence in Research (NCCR) Affective Sciences financed by the Swiss National Science Foundation (51NF40-104897) and hosted by the University of Geneva. The research leading to these results has received funding from the European Community’s Seventh Framework Programme under grant agreement No. 338164 (ERC Starting Grant iHEARu). The authors would like to thank audEERING UG (haftungsbeschränkt) for providing up-to-date pre-release documentation, computational resources, and great support in maintaining the free open-source releases.

MPEG Column: Press release for the 109th MPEG meeting

MPEG collaborates with SC24 experts to develop committee draft of MAR reference model

SC 29/WG 11 (MPEG) is pleased to announce that the Mixed and Augmented Reality Reference Model (MAR RM), developed jointly and in close collaboration with SC 24/WG 9, has reached Committee Draft status at the 109th WG 11 meeting. The MAR RM defines not only the main concepts and terms of MAR, but also its application domain and an overall system architecture that can be applied to all MAR systems, regardless of the particular algorithms, implementation methods, computational platforms, display systems, and sensors/devices used. The MAR RM can therefore be used as a consultation source to aid in the development of MAR applications or services, business models, or new (or extensions to existing) standards. It identifies representative system classes and use cases with respect to the defined architecture, but does not specify technologies for the encoding of MAR information, or interchange formats.

2nd edition of HEVC includes scalable and multi-view video coding

At the 109th MPEG meeting, the standard development work was completed for two important extensions to the High Efficiency Video Coding standard (ISO/IEC 23008-2, also standardized by ITU-T as Rec. H.265).
The first of these are the scalability extensions of HEVC, known as SHVC, which add support for embedded bitstream scalability in which different levels of encoding quality are efficiently supported by adding or removing layered subsets of encoded data. The other are the multiview extensions of HEVC, known as MV-HEVC, which provide efficient representation of video content with multiple camera views and optional depth map information, such as for 3D stereoscopic and autostereoscopic video applications. MV-HEVC is the 3D video extension of HEVC, and further work for more efficient coding of 3D video is ongoing.
SHVC and MV-HEVC will be combined with the original content of the HEVC standard and also the recently-completed format range extensions (known as RExt), so that a new edition of the standard will be published that contains all extensions approved up to this time.

In addition, the finalization of reference software and a conformance test set for HEVC was completed at the 109th meeting, as ISO/IEC 23008-5 and ISO/IEC 23008-8, respectively. These important standards will greatly help industry achieve effective interoperability between products using HEVC and provide valuable information to ease the development of such products.
In consideration of the recent dramatic developments in video coding technology, including the completion of the development of the HEVC standard and several major extensions, MPEG plans to host a brainstorming event during its 110th meeting which will be open to the public. The event will be co-hosted by MPEG’s frequent collaboration partner in video coding standardization work, the Video Coding Experts Group (VCEG) of ITU-T Study Group 16. More information on how to register for the event will be available at

MPEG-H 3D Audio extended to lower bit rates

At its 109th meeting, MPEG selected technology for Version II of the MPEG-H 3D Audio standard (ISO/IEC 23008-3) based on responses submitted to the Call for Proposals issued in January 2013. This follows the selection of Version I technology, which was chosen at the 105th meeting in August 2013. While Version I technology was evaluated for bitrates between 1.2 Mb/s and 256 kb/s, Version II technology is focused on bitrates between 128 kb/s and 48 kb/s.
The selected technology supports content in multiple formats: channel-based, channels and objects (C+O), and scene-based Higher Order Ambisonics (HOA). A total of six submissions were reviewed: three for coding C+O content and three for coding HOA content.
The selected technologies for Version II were shown to be within the framework of the unified Version I technology.
The submissions were evaluated using a comprehensive set of subjective listening tests in which the resulting statistical analysis guided the selection process. At the highest bitrate of 128 kb/s for the coding of a signal supporting a 22.2 loudspeaker configuration, both of the selected technologies had performance of “Good” on the MUSHRA subjective quality scale. It is expected that the C+O and HOA Version II technologies will be merged into a unified architecture.
This MPEG-H 3D Audio Version II is expected to reach Draft International Standard by June 2015.

The 109th meeting also saw the technical completion of Version I of the MPEG-H 3D Audio standard, which is expected to become an International Standard by February 2015.

Public seminar for media synchronization planned for 110th MPEG meeting in October

A public seminar on Media Synchronization for Hybrid Delivery will be held on the 22nd of October 2014 during the 110th MPEG meeting in Strasbourg. The purpose of this seminar is to introduce MPEG’s activity on media stream synchronization for heterogeneous delivery environments, including hybrid environments employing both broadcast and broadband networks, with existing MPEG systems technologies such as MPEG-2 TS, DASH, and MMT. The seminar will also strive to ensure alignment of MPEG’s present and future projects with user and industry use-case needs. The main topics covered by the seminar presentations include:

  • Hybrid Broadcast – Broadband distribution for UHD deployments and 2nd screen content
  • Inter Destination Media Synchronization
  • MPEG Standardization efforts on Time Line Alignment of media contents
  • Audio Fingerprint based Synchronization

You are invited to join the seminar to learn more about MPEG activities in this area and to work with us to further develop technologies and standards supporting new applications of rich and heterogeneous media delivery.
The seminar is open to the public and registration is free of charge.

First MMT Developers’ Day held at MPEG 109, second planned for MPEG 110

Following the recent finalization of the MPEG Media Transport standard (ISO/IEC 23008-1), MPEG hosted an MMT Developers’ Day to better understand the rate of MMT adoption and to provide a channel for MPEG to receive comments from industry about the standard. During the event, four oral presentations were given: “Multimedia transportation technology and status in China”, “MMT delivery considering bandwidth utilization”, “Fast channel change/ Targeted Advertisement insertion over hybrid media delivery”, and “MPU Generator.” In addition, seven demonstrations were presented: Reliable 4K HEVC Realtime Transmission by using MMT-FEC, MMT Analyzer, Applications of MMT content through Broadcast, Storage, and Network Delivery, Media Delivery Optimization with the MMT Cache Middle Box, MMT-based Transport Technology for Advanced Services in Super Hi-Vision, target ad insertion and multi-view content composition in broadcasting system with MMT, and QoS management for Media Delivery. MPEG is planning to host a 2nd MMT Developers’ Day during the 110th meeting on Wednesday, Oct 22nd.

Seminar at MPEG 109 introduces MPEG’s activity for Free Viewpoint Television

A seminar for FTV (Free Viewpoint Television) was held during the 109th MPEG meeting in Sapporo. FTV is an emerging visual media technology that will revolutionize the viewing of 3D scenes to facilitate a more immersive experience by allowing users to freely navigate the view of a 3D scene as if they were actually there. The purpose of the seminar was to introduce MPEG’s activity on FTV to interested parties and to align future MPEG standardization of FTV technologies with user and industry needs.

Digging Deeper – How to Contact MPEG

Communicating the large and sometimes complex array of technology that the MPEG Committee has developed is not a simple task. Experts, past and present, have contributed a series of tutorials and vision documents that explain each of these standards individually. The repository is growing with each meeting, so if something you are interested in is not yet there, it may appear shortly, but you should also not hesitate to request it. You can start your MPEG adventure at

Further Information

Future MPEG meetings are planned as follows:

  • No. 110, Strasbourg, FR, 20 – 24 October 2014
  • No. 111, Geneva, CH, 16 – 20 February 2015
  • No. 112, Warsaw, PL, 22 – 26 June 2015

For further information about MPEG, please contact:

Dr. Leonardo Chiariglione (Convenor of MPEG, Italy)
Via Borgionera, 103
10040 Villar Dora (TO), Italy
Tel: +39 011 935 04 61


Dr. Arianne T. Hinds
Cable Television Laboratories
858 Coal Creek Circle
Louisville, Colorado 80027 USA
Tel: +1 303 661 3419

The MPEG homepage also has links to other MPEG pages that are maintained by the MPEG subgroups. It also contains links to public documents that are freely available for download by those who are not MPEG members. Journalists that wish to receive MPEG Press Releases by email s

Launching the first-ever National Data Science Bowl

What is the National Data Science Bowl?

Take a deep dive and see how tiny plants and animals fuel the world

We are pioneering a new language to understand our incredibly beautiful and complex world. A language that is forward-looking rather than retrospective, different from the words of historians and famed novelists. It is data science, and through it, we have the power to use insights from our past to build an unprecedented future. We need your help building that future. The 2014/2015 National Data Science Bowl offers tremendous potential to modernize the way we understand and address a major environmental challenge: monitoring the health of our oceans.


ACM is a partner in the first-ever National Data Science Bowl, which launched on 12/15.

This 90-day competition offers data scientists the chance to solve a critical problem facing our world’s oceans using the power of data.

Participants are challenged to examine nearly 100,000 underwater images to develop an algorithm that will enable researchers to monitor certain sea life at a speed and scale never before possible.

$175,000 in prize money will go to the top three individual contestants and the top academic team.

Report from SLAM 2014

ISCA/IEEE Workshop on Speech, Language and Audio in Multimedia

Following SLAM 2013 in Marseille, France, SLAM 2014 was the second edition of the workshop, held in Malaysia as a satellite of Interspeech 2014. The workshop was organized over two days, one for science and one for socializing and community building. With about 15 papers and 30 attendees, the risky second edition of the workshop showed the will to build a strong scientific community at the frontier of speech and audio processing, natural language processing and multimedia content processing.

The first day featured talks covering various topics related to speech, language and audio processing applied to multimedia data. Two keynotes from Shri Narayanan (University of Southern California) and Min-Yen Kan (National University of Singapore) nicely completed the program.
The second day took us on a tour of Penang, followed by a visit to the campus of Universiti Sains Malaysia, where the local organizers are based. As expected, the tour offered plenty of opportunities to strengthen the links between participants and build a stronger community. Most participants later went to Singapore to attend Interspeech, the main conference in the domain of speech communication, where further discussions took place.

We hope to co-locate the next SLAM edition with a multimedia conference such as ACM Multimedia in 2015. Stay tuned!