ACM TOMM (TOMCCAP) Call for Special Issue Proposals

ACM Transactions on Multimedia
Computing, Communications and Applications
ACM TOMM (previously known as ACM TOMCCAP)

Deadline for Proposal Submission: May, 1st 2015
Notification: June, 1st 2015
http://tomm.acm.org/

ACM TOMM is one of the world’s leading journals on multimedia. As in previous years, we are planning to publish a special issue (SI) in 2016. Proposals are accepted until May, 1st 2015. Each special issue is in the responsibility of the guest editors. If you wish to guest edit a special issue, you should prepare a proposal as outlined below, then send this via e-mail to the Senior Associate Editor (SAE) for Special Issue Management of TOMM, Shervin Shirmohammadi shervin@ieee.org

Proposals must:

  • Cover a currently-hot or emerging topic in the area of multimedia computing, communications, and applications;
  • Set out the importance of the special issue’s topic in that area;
  • Give a strategy for the recruitment of high quality papers;
  • Indicate a draft timeline in which the special issue could be produced (paper writing, reviewing, and submission of final copies to TOMM), assuming the proposal is accepted;
  • Include a list of recent (submission deadline within the last year) or currently-open special issues in similar topics and clearly explain how the proposed SI is different from those SIs;
  • Include the list of the proposed guest editors, their short bios, and their editorial and journal/conference organization experience as related to the Special Issue’s topic.

As in the previous years, the special issue will be published as online-only issue in the ACM Digital Library. This gives the guest editors higher flexibility in the review process and the number of papers to be accepted, while yet ensuring a timely publication.

The proposals will be reviewed by the SAE together with the Editor in Chief (EiC). Evaluation criteria includes: relevance to multimedia, ability to attract many excellent submissions, topic not too specific or too broad, quality and details of the proposal, distinguished from recent or current SIs with similar topic, experience and reputation of the guest editors, geographic/ethnic diversity of the guest editors. The final decision will be made by the EiC. A notification of the decision will be given by June 1st 2015. Once a proposal is accepted we will contact you to discuss the further process.

For questions please contact:
Shervin Shirmohammadi – Senior Associate Editor for Special Issue Management shervin@ieee.org
Ralf Steinmetz – Editor in Chief (EiC) steinmetz.eic@kom.tu-darmstadt.de
Sebastian Schmidt – Information Director TOMM@kom.tu-darmstadt.de

Summary of the 5th BAMMF

Bay Area Multimedia Forum (BAMMF)

BAMMF is a Bay Area Multimedia Forum series. Experts from both academia and industry are invited to exchange ideas and information through talks, tutorials, posters, panel discussions and networking sessions. Topics of the forum will include emerging areas in vision, audio, touch, speech, text, various sensors, human computer interaction, natural language processing, machine learning, media-related signal processing, communication, and cross-media analysis etc. Talks in the event may cover advancement in algorithms and development, demonstration of new inventions, product innovation, business opportunities, etc. If you are interested in giving a presentation at the forum, please contact us.

The 5th BAMMF

The 5th BAMMF was held in the George E. Pake Auditorium in Palo Alto, CA, USA on November 20, 2014. The slides and videos of the speakers at the forum have been made available on the BAMMF web page, and we provide here an overview of their talks. For speakers’ bios, the slides and videos, please visit the web page.

Industrial Impact of Deep Learning – From Speech Recognition to Language and Multimodal Processing

Li Deng (Deep Learning Technology Center, Microsoft Research, Redmond, USA)

Since 2010, deep neural networks have started making real impact in speech recognition industry, building upon earlier work on (shallow) neural nets and (deep) graphical models developed by both speech and machine learning communities. This keynote will first reflect on the historical path to this transformative success. The role of well-timed academic-industrial collaboration will be highlighted, so will be the advances of big data, big compute, and seamless integration between application-domain knowledge of speech and general principles of deep learning. Then, an overview will be given on the sweeping achievements of deep learning in speech recognition since its initial success in 2010 (as well as in image recognition since 2012). Such achievements have resulted in across-the-board, industry-wide deployment of deep learning. The final part of the talk will focus on applications of deep learning to large-scale language/text and multimodal processing, a more challenging area where potentially much greater industrial impact than in speech and image recognition is emerging.

Brewing a Deeper Understanding of Images

Yangqing Jia (Google)

In this talk I will introduce the recent developments in the image recognition fields from two perspectives: as a researcher and as an engineer. For the first part I will describe our recent entry “GoogLeNet” that won the ImageNet 2014 challenge, including the motivation of the model and knowledge learned from the inception of the model. For the second part, I will dive into the practical details of Caffe, an open-source deep learning library I created at UC Berkeley, and show how one could utilize the toolkit for a quick start in deep learning as well as integration and deployment in real-world applications.

Applied Deep Learning

Ronan Collobert (Facebook)

I am interested in machine learning algorithms which can be applied in real-life applications and which can be trained on “raw data”. Specifically, I prefer to trade simple “shallow” algorithms with task-specific handcrafted features for more complex (“deeper”) algorithms trained on raw features. In that respect, I will present several general deep learning architectures, which excels in performance on various Natural Language, Speech and Image Processing tasks. I will look into specific issues related to each application domain, and will attempt to propose general solutions for each use case.

Compositional Language and Visual Understanding

Richard Socher (Stanford)

In this talk, I will describe deep learning algorithms that learn representations for language that are useful for solving a variety of complex language tasks. I will focus on 3 projects:

  • Contextual sentiment analysis (e.g. having an algorithm that actually learns what’s positive in this sentence: “The Android phone is better than the IPhone”)
  • Question answering to win trivia competitions (like IBM Watson’s Jeopardy system but with one neural network)
  • Multimodal sentence-image embeddings to find images that visualize sentences and vice versa (with a fun demo!) All three tasks are solved with a similar type of recursive neural network algorithm.

 

MPEG Column: 110th MPEG Meeting

— original posts here by Multimedia Communication blogChristian TimmererAAU/bitmovin

The 110th MPEG meeting was held at the Strasbourg Convention and Conference Centre featuring the following highlights:

  • The future of video coding standardization
  • Workshop on media synchronization
  • Standards at FDIS: Green Metadata and CDVS
  • What’s happening in MPEG-DASH?

Additional details about MPEG’s 110th meeting can be also found here including the official press release and all publicly available documents.

The Future of Video Coding Standardization

MPEG110 hosted a panel discussion about the future of video coding standardization. The panel was organized jointly by MPEG and ITU-T SG 16’s VCEG featuring Roger Bolton (Ericsson), Harald Alvestrand (Google), Zhong Luo (Huawei), Anne Aaron (Netflix), Stéphane Pateux (Orange), Paul Torres (Qualcomm), and JeongHoon Park (Samsung).

As expected, “maximizing compression efficiency remains a fundamental need” and as usual, MPEG will study “future application requirements, and the availability of technology developments to fulfill these requirements”. Therefore, two Ad-hoc Groups (AhGs) have been established which are open to the public:

The presentations of the brainstorming session on the future of video coding standardization can be found here.

Workshop on Media Synchronization

MPEG101 also hosted a workshop on media synchronization for hybrid delivery (broadband-broadcast) featuring six presentations “to better understand the current state-of-the-art for media synchronization and identify further needs of the industry”.

  • An overview of MPEG systems technologies providing advanced media synchronization, Youngkwon Lim, Samsung
  • Hybrid Broadcast – Overview of DVB TM-Companion Screens and Streams specification, Oskar van Deventer, TNO
  • Hybrid Broadcast-Broadband distribution for new video services :  a use cases perspective, Raoul Monnier, Thomson Video Networks
  • HEVC and Layered HEVC for UHD deployments, Ye Kui Wang, Qualcomm
  • A fingerprinting-based audio synchronization technology, Masayuki Nishiguchi, Sony Corporation
  • Media Orchestration from Capture to Consumption, Rob Koenen, TNO

The presentation material is available here. Additionally, MPEG established an AhG on timeline alignment (that’s how the project is internally called) to study use cases and solicit contributions on gap analysis and also technical contributions [email][subscription].

Standards at FDIS: Green Metadata and CDVS

My first report on MPEG Compact Descriptors for Visual Search (CDVS) dates back to July 2011 which provides details about the call for proposals. Now, finally, the FDIS has been approved during the 110th MPEG meeting. CDVS defines a compact image description that facilitates the comparison and search of pictures that include similar content, e.g. when showing the same objects in different scenes from different viewpoints. The compression of key point descriptors not only increases compactness, but also significantly speeds up, when compared to a raw representation of the same underlying features, the search and classification of images within large image databases. Application of CDVS for real-time object identification, e.g. in computer vision and other applications, is envisaged as well.

Another standard reached FDIS status entitled Green Metadata (first reported in August 2012). This standard specifies the format of metadata that can be used to reduce energy consumption from the encoding, decoding, and presentation of media content, while simultaneously controlling or avoiding degradation in the Quality of Experience (QoE). Moreover, the metadata specified in this standard can facilitate a trade-off between energy consumption and QoE. MPEG is also working on amendments to the ubiquitous MPEG-2 TS ISO/IEC 13818-1 and ISOBMFF ISO/IEC 14496-12 so that green metadata can be delivered by these formats.

What’s happening in MPEG-DASH?

MPEG-DASH is in a kind of maintenance mode but still receiving new proposals in the area of SAND parameters and some core experiments are going on. Also, the DASH-IF is working towards new interoperability points and test vectors in preparation of actual deployments. When speaking about deployments, they are happening, e.g., a 40h live stream right before Christmas (by bitmovin, a top-100 company that matters most in online video). Additionally, VideoNext was co-located with CoNEXT’14 targeting scientific presentations about the design, quality and deployment of adaptive video streaming. Webex recordings of the talks are available here. In terms of standardization, MPEG-DASH is progressing towards the 2nd amendment including spatial relationship description (SRD), generalized URL parameters and other extensions. In particular, SRD will enable new use cases which can be only addressed using MPEG-DASH and the FDIS is scheduled for the next meeting which will be in Geneva, Feb 16-20, 2015. I’ll report on this within my next blog post, stay tuned..

Call for Workshop Proposals @ ACM Multimedia 2015

We invite proposals for Workshops to be held at the ACM Multimedia 2015 Conference. Accepted workshops will take place in conjunction with the main conference, which is scheduled for October 26-30, 2015, in Brisbane, Australia.

We solicit proposals for two different kinds of workshops: regular workshops and data challenge.

Regular Workshops

The regular workshops should offer a forum for discussions of broad range of emerging and specialized topics of interest to the SIG Multimedia community. There are a number of important issues to be considered when generating a workshop proposal:

  1. The topic of the proposed workshop should offer a perspective distinct from and complementary to the research themes of the main conference. We therefore strongly advise to carefully review the themes of the main conference (which can be found here), when generating a proposal.
  2. The SIG Multimedia community expects the workshop program to be part of the program to nurture and to grow the workshop research theme towards becoming mainstream in the multimedia research field and one of the themes of the main conference in the future.
  3. Interdisciplinary theme workshops are strongly encouraged.
  4. Workshops should offer a discussion forum of a different type than that of the main conference. In particular, they should avoid becoming “mini-conferences” with accompanying keynote presentations and best paper awards. While formal presentation of ideas through regular oral sessions are allowed, we strongly encourage organizers to propose alternate ways to allow participants to discuss open issues, key methods and important research topics related to the workshop theme. Examples are panels, group brainstorming sessions, mini-tutorials around key ideas and proof of concept demonstration sessions.

Data Challenge Workshops

We are also seeking organizers to propose Challenge-Based Workshops. Both academic and corporate organizers are welcome.

Data Challenge workshops are solicited from both academic and corporate organizers. The organizers should provide a dataset that is exemplar of the complexities of current and future multimodal/multimedia problems, and one or more multimodal/ multimedia tasks whose performance can be objectively measured. Participants in the challenge will evaluate their methods against the challenge data in order to identify areas of strengths and weakness. Best performing participating methods will be presented in the form of papers and oral/poster presentations at the workshop.

More information

For details on submitting workshops proposals and the evaluation criteria, please check the following site:

http://www.acmmm.org/2015/call-for-workshop-proposals/

Important dates:

  • Proposal Submission: February 10, 2015
  • Notification of Acceptance February 27, 2015

Looking forward to receiving many excellent submissions!

Alan Hanjalic, Lexing Xie and Svetha Venkatesh
Workshops Chairs, ACM Multimedia 2015

openSMILE:) The Munich Open-Source Large-scale Multimedia Feature Extractor

A tutorial for version 2.1

Introduction

The openSMILE feature extraction and audio analysis tool enables you to extract large audio (and recently also video) feature spaces incrementally and fast, and apply machine learning methods to classify and analyze your data in real-time. It combines acoustic features from Music Information Retrieval and Speech Processing, as well as basic computer vision features. Large, standard acoustic feature sets are included and usable out-of-the-box to ensure comparable standards in feature extraction in related research. The purpose of this article is to briefly introduce openSMILE, it’s features, potentials, and intended use-cases as well as to give a hands-on tutorial packed with examples that should get you started quickly with using openSMILE. About openSMILE SMILE is originally an acronym for Speech & Music Interpretation by Large-space feature Extraction. Due to the recent addition of video-processing in version 2.0, the acronym openSMILE evolved to open-Source Media Interpretation by Large-space feature Extraction. The development of the toolkit has been started at Technische Universität München (TUM) for the EU-FP7 research project SEMAINE. The original primary focus was on state-of-the-art acoustic emotion recognition for emotionally aware, interactive virtual agents. After the project, openSMILE has been continuously extended to a universal audio analysis toolkit. It has been used and evaluated extensively in the series of INTERSPEECH challenges on emotion, paralinguistics, and speaker states and traits: From the first INTERSPEECH 2009 Emotion Challenge up to the upcoming Challenge at INTERSPEECH 2015 (see openaudio.eu for a summary of the challenges). Since 2013 the code-base has been transferred to audEERING and the development is continued by them under a dual-license model – keeping openSMILE free for the research community. openSMILE is written in C++ and is available as both a standalone command-line executable as well as a dynamic library. The main features of openSMILE are its capability of on-line incremental processing and its modularity. Feature extractor components can be freely interconnected to create new and custom features, all via a simple text-based configuration file. New components can be added to openSMILE via an easy binary plug-in interface and an extensive internal API. Scriptable batch feature extraction is supported just as well as live on-line extraction from live recorded audio streams. This enables you to build and design systems on off-line databases, and then use exactly the same code to run your developed system in an interactive on-line prototype or even product. openSMILE is intended as a toolkit for researchers and developers, but not for end-users. It thus cannot be configured through a Graphical User Interface (GUI). However, it is a fast, scalable, and highly flexible command-line backend application, on which several front-end applications could be based. Such examples are network interface components, and in the latest release of openSMILE (version 2.1) a batch feature extraction GUI for Windows platforms: As seen in the above figure, the GUI allows to easily choose a configuration file, the desired output files and formats, and to select files and folders on which to run the analysis. Made popular in the field of speech emotion recognition and paralinguistic speech analysis, openSMILE is now beeing widely used in this community. According to google scholar the two papers on openSMILE ([Eyben10] and [Eyben13a]) are currently cited over 380 times. Research teams across the globe are using it for several tasks, including paralinguistic speech analysis, such as alcohol intoxication detection, in VoiceXML telephony-based spoken dialogue systems — as implemented by the HALEF framework, natural, speech enabled virtual agent systems, and human behavioural signal processing, to name only a few examples. Key Features The key features of openSMILE are:

  • It is cross-platform (Windows, Linux, Mac, new in 2.1: Android)
  • It offers both incremental processing and batch processing.
  • It efficiently extracts a large number of features very fast by re-using already computed values.
  • It has multi-threading support for parallel feature extraction and classification.
  • It is extensible with new custom components and plug-ins.
  • It supports audio file in- and output as well as live sound recording and playback.
  • The computation of MFCC, PLP, (log-)energy, and delta regression coefficients is fully HTK compatible.
  • It has a wide range of general audio signal processingcomponents:
    • Windowing functions (Hamming, Hann, Gauss, Sine, …),
    • Fast-Fourier Transform,
    • Pre-emphasis filter,
    • Finit-Impulse-Response (FIR) filterbanks,
    • Autocorrelation,
    • Cepstrum,
    • Overlap-add re-synthesis,
  • … and speech-related acoustic descriptors:
    • Signal energy,
    • Loudness based on a simplified sub-band auditory model,
    • Mel-/Bark-/Octave-scale spectra,
    • MFCC and PLP-CC,
    • Pitch (ACF and SHS algorithms and Viterbi smoothing),
    • Voice quality (Jitter, Shimmer, HNR),
    • Linear Predictive Coding (LPC),
    • Line Spectral Pairs (LSP),
    • Formants,
    • Spectral shape descriptors (Roll-off, slope, etc.),
  • … and music-related descriptors:
    • Pitch classes (semitone spectrum),
    • CHROMA and CENS features.
  • It supports multi-modal fusion on the feature level through openCV integration.
  • Several post-processingmethods for low-level descriptors are included:
    • Moving average smoothing,
    • Moving average mean subtraction and variance normalization (e.g. for on-line Cepstral mean subtraction),
    • On-line histogram equalization (experimental),
    • Delta regression coefficients of arbitrary order,
    • Binary operations to re-combine descriptors.
  • A wide range of statistical functionalsfor feature summarization is supported, e.g.:
    • Means, Extremes,
    • Moments,
    • Segment statistics,
    • Sample-values,
    • Peak statistics,
    • Linear and quadratic regression,
    • Percentiles,
    • Durations,
    • Onsets,
    • DCT coefficients,
    • Zero-crossings.
  • Generic and popular data file formatsare supported:
    • Hidden Markov Toolkit (HTK) parameter files (read/write)
    • WEKA Arff files (currently only non-sparse) (read/write)
    • Comma separated value (CSV) text (read/write)
    • LibSVM feature file format (write)

In the latest release (2.1) the new features are:

  • Integration and improvement of the emotion recognition models from openEAR,
  • LSTM-RNN based voice-activity detector prototype models included,
  • Fast linear SVMsink component which supports linear kernel SVM models trained with the WEKA SMO classifier,
  • LSTM-RNN JSON network file support for networks trained with the CURRENNT toolkit,
  • Spectral harmonics descriptors,
  • Android support,
  • Improvements to configuration files and command-line options,
  • Improvements and fixes.

openSMILE’s architecture openSMILE has a very modular architecture, designed for incremental data-flow. A central dataMemory component hosts shared memory buffers (known as dataMemory levels) to which a single component can write data and one or more other components can read data from. There are data-source components, which read data from files or other external sources and introduce them to the dataMemory. Then there are data-processor components, which read data, modify them, and save it to a new buffer – these are the actual feature extractor components. In the end data-sink components read the final data and save them to files or digest it in other ways (classifiers etc.): As all components which process data and connect to the dataMemory share some common functionality, they are all derived from a single base class cSmileComponent. The following figure shows the class hierarchy, and the connections between the cDataWriter and cDataReader components to the dataMemory (dotted lines). Getting openSMILE and the documentation The latest openSMILE packages can be downloaded here. At the time of writing the most recent release is 2.1. Grab the complete package of the latest release. This includes the source code, the binaries for Linux and Windows. Some most up-to-date releases might not always include a full-blown set of binaries for all platforms, so sometimes you might have to compile from source, if you want the latest cutting-edge version. While the tutorial in the next section should give you a good quick-start, it does not and can not cover every detail of openSMILE. For learning more and getting further help, there are three main resources: The first is the openSMILE documentation, called the openSMILE book. It contains detailed instructions on how to install, compile, and use openSMILE and introduces you to the basics of openSMILE. However, it might not be the most up-to-date resource for the newest features. Thus, the second resource, is the on-line help built into the binaries. This provides the most up-to-date documentation of available components and their options and features. We will tell you how to use the on-line help in the next section. If you cannot find your answer in neither of these resources, you can ask for help in the discussion forums on the openSMILE website or read the source-code.

Quick-start tutorial

You can’t wait to get openSMILE and try it out on your own data? Then this is your section. In the following the basic concepts of openSMILE are described, pre-built use-cases of automatic, on-line voice activity detection and speech emotion recognition are presented, and the concept of configuration files and the data-flow architecture are explained.

a. Basic concepts

Please refer to the openSMILE book for detailed installation and compilation instructions. Here we assume that you have a compiled SMILExtract binary (optionally with PortAudio support, if you want to use the live audio recording examples below), with which you can run:

SMILExtract -h
SMILExtract -H cWaveSource

to see general usage instructions (first line) and the on-line help for the cWaveSource component (second line), for example. However, from this on-line help it is hard to get a general picture of the openSMILE concepts. We thus describe briefly how to use openSMILE for the most common tasks. Very loosely said, the SMILExtract binaries can be seen as a special kind of code interpreter which executes custom configuration scripts. What openSMILE actually does in the end when you invoke it is only controlled by this configuration script. So, in order to do something with openSMILE you need:

  • The binary SMILExtract,
  • a (set of) configuration file(s),
  • and optionally other files, such as classification models, etc.

The configuration file defines all the components that are to be used as well as their data-flow interconnections. All the components are iteratively run in the “tick-loop“, i.e. a run method (tick()) of each component is called in every loop iteration. Each component then checks if there are new data to process, and if yes, processes the data, and makes them available for other components to process them further. Every component returns a status value, which indicates whether the component has processed data or not. If no component has had any further data to process, the end of the data input (EOI) is assumed. All components are switched to an EOI state and the tick-loop is executed again to process data which require special attention at the end of the input, such as delta-regression coefficients. Since version 2.0-rc1, multi-pass processing is supported, i.e. providing a feature to enable re-running of the whole processing. It is not encouraged to use this, since it breaks incremental processing, but for some experiments it might be necessary. The minimal, generic use-case scenario for openSMILE is thus as follows:

SMILExtract -C config/my_configfile.conf

Each configuration file can define additional command-line options. Most prominent examples are the options for in- and output files (-I and -O). These options are not shown when the normal help is invoked with the -h option. To show the options defined by a configuration file, use this command-line:

SMILExtract -ccmdHelp -C config/my_configfile.conf

The default command-line for processing audio files for feature extraction is:

SMILExtract -C config/my_configfile.conf -I input_file.wav -O output_file

This runs SMILExtract with the configuration given in my_configfile.conf. The following two sections will show you how to quickly get some advanced applications running as pre-configured use-cases for voice activity detection and speech emotion recognition.

b. Use-case: The openSMILE voice-activity detector

The latest openSMILE release (2.1) contains a research prototype of an intelligent, data-drive voice-activity detector (VAD) based on Long Short-Term Memory Recurrent Neural Networks (LSTM-RNN), similar to the system introduced in [Eyben13b]. The VAD examples are contained in the folder scripts/vad. A README in that folder describes further details. Here we give a brief tutorial on how to use the two included use-case examples:

  • vad_opensource.conf: Runs the LSTM-RNN VAD and dumps the activations (voice probability) for each frame to a CSV text file. To run the example on a wave file, type:
    cd scripts/vad;
    SMILExtracct -I ../../example-audio/media-interpretation.wav \
                 -C vad_opensoure.conf -csvoutput vad.csv

    This will write the VAD probabilities scaled to the range -1 to +1 (2nd column) and the corresponding timestamps (1st column) to vad.csv. A VAD probability greater 0 indicates voice presence.

  • vad_segmeter.conf: Runs the VAD on an input wave file, and automatically extract voice segments to new wave files. Optionally the raw voicing probabilities as in the above example can be saved to file. To run the example on a wave file, type:
    cd scripts/vad;
    mkdir -p voice_segments
    SMILExtract -I ../../example-audio/media-interpretation.wav -C vad_segmenter.conf \
                -waveoutput voice_segments/segment_

    This will create a new wave file (numbered consecutively, starting at 1). The vad_segmenter.conf optionally supports output to CSV with the -csvoutput filename option. The start and end times (in seconds) of the voice segments relative to the start of the input file can be optionally dumped with the -saveSegmentTimes filename option. The columns of the output file are: segment filename, start (sec.), end (sec.), length of segment as number of raw (10ms) frames.

To visualise the VAD output over the waveform, we recommend using Sonic-visualiser. If you have sonc-visualiser installed (on Linux) you can open both the wave-file and the VAD output with this command:

sonic-visualiser example-audio/media-interpretation.wav vad.csv

An annotation layer import dialog should appear. The first column should be detected as Time and the second column as value. If this is not the case, select these values manually, and specify that timing is specified explicitly (should be the default) and click OK. You should see something like this:

c. Use-case: Automatic speech emotion recognition

As of version 2.1, openSMILE supports running the emotion recognition models from the openEAR toolkit [Eyben09] in live emotion recognition demo. In order to start this live speech emotion recognition demo, download the speech emotion recognition models and unzip them in the top-level folder of the openSMILE package. A folder named models should be created there which contains a README.txt, and a sub-folder emo. If this is the case, you are ready to run the demo. Type:

SMILExtract -C config/emobase_live4.conf

to run it. The classification output will be shown on the console. NOTE: This example requires that you are running a binary with PortAudio support enabled. Refer to the openSMILE book for details on how to compile your binary with portaudio support for Linux. For Windows pre-compiled binaries (SMILExtractPA*.exe) are included, which should be used instead of the standard SMILExtract.exe for the above example. If you want to choose a different audio recording device, use

SMILExtract -C config/emobase_live4.conf -device ID

To see a list of available devices and their IDs, type:

SMILExtract -C config/emobase_live4.conf -listdevices

Note: If you have a different directory layout or have installed SMILExtract in a system path, you must make sure that the models are located in a directory named “models” located in the directory from where you call the binary, or you must adapt the path to the models in the configuration file (emobase_live4.conf). In openSMILE 2.1, the emotion recognition models can also be used for off-line/batch analysis. Two configuration files are provided for this purpose: config/emobase_live4_batch.conf and config/emobase_live4_batch_single.conf. The latter of the two will compute a single feature vector for the input file and return a single result. Use this, if your audio files are already chunked into short phrases or sentences. The first, emobase_live4_batch.conf will run an energy based segementation on the input and will return a result for every segment. Use this for longer, un-cut audio files. To run analyis in batch mode, type:

SMILExtract -C config/emobase_live4_batch(_single).conf -I example-audio/opensmile.wav > result.txt

This will redirect the result(s) from SMILExtract’s standard output (console) to the file result.txt. The file is by default in a machine parseable format, where key=value tokens are separated by :: and a single result is given on each line, for example:

SMILE-RESULT::ORIGIN=libsvm::TYPE=regression::COMPONENT=arousal::VIDX=0::NAME=(null)::
     VALUE=1.237816e-01
SMILE-RESULT::ORIGIN=libsvm::TYPE=regression::COMPONENT=valence::VIDX=0::NAME=(null)::
     VALUE=1.825088e-01
SMILE-RESULT::ORIGIN=libsvm::TYPE=classification::COMPONENT=emodbEmotion::VIDX=0::
     NAME(null)::CATEGORY_IDX=2::CATEGORY=disgust::PROB=0;anger:0.033040::
     PROB=1;boredom:0.210172::PROB=2;disgust:0.380724::PROB=3;fear:0.031658::
     PROB=4;happiness:0.016040::PROB=5;neutral:0.087751::PROB=6;sadness:0.240615
SMILE-RESULT::ORIGIN=libsvm::TYPE=classification::COMPONENT=abcAffect::VIDX=0::
    NAME=(null)::CATEGORY_IDX=0::CATEGORY=agressiv::PROB=0;agressiv:0.614545::
    PROB=1;cheerful:0.229169::PROB=2;intoxicated:0.037347::PROB=3;nervous:0.011133::
    PROB=4;neutral:0.091070::PROB=5;tired:0.016737
SMILE-RESULT::ORIGIN=libsvm::TYPE=classification::COMPONENT=avicInterest::VIDX=0::
    NAME=(null)::CATEGORY_IDX=1::CATEGORY=loi2::PROB=0;loi1:0.006460::
    PROB=1;loi2:0.944799::PROB=2;loi3:0.048741

The above example is the result of the analysis of the file example-audio/media-interpretation.wav.

d. Understanding configuration files

The above, pre-configured examples are a good quick-start to show the diverse potential of the tool. We will now take a deeper look at openSMILE configuration files. First, we will use simple, small configuration files, and modify these in order to understand the basic concepts of these files. Then, we will show you how to write your own configuration files from scratch. The demo files used in this section are provided in the 2.1 release package in the folder config/demo. We will first start with demo1_energy.conf. This file extracts basic frame-wise logarithmic energy. To run this file on one of the included audio examples in the folder example-audio, type the following command:

SMILExtract -C config/demo/demo1_energy.conf -I example-audio/

openSMILE

.wav -O energy.csv

This will create a file called energy.csv. Its content should look similar to this: The second example we will discuss here, is the audio recorder example (audiorecorder.conf). NOTE: This example requires that you are running a binary with PortAudio support enabled. Refer to the openSMILE book for details on how to compile your binary with portaudio support for Linux. For Windows pre-compiled binaries (SMILExtractPA*.exe) are included, which should be used instead of the standard SMILExtract.exe for the following example. This example implements a simple live audio recorder. Audio is recorded from the default audio device to an uncompressed PCM wave file. To run the example and record to rec.wav, type:

SMILExtract -C config/demo/audiorecorder.conf -O rec.wav

Modifiying existing configuration files is the fasted way to create custom extraction scripts. We will now change the demo1_energy.conf file to extract Root-Mean-Square (RMS) energy instead of logarithmic energy. This can be achieved by changing the respective options in the section of the cEnergy component (identified by the section heading [energy:cEnergy]) from

rms = 0
log = 1

to

rms = 1
log = 0

As a second example, we will merge audiorecorder.conf and demo1_energy.conf to create a configuration file which computes the frame-wise RMS energy from live audio input. First, we start with concatenating the two files. On Linux, type:

cat config/demo/audiorecorder.conf config/demo/demo1_energy.conf > config/demo/live_energy.conf

On Windows, use a text editor such as Notepad++ to combine the files via copy and paste. Now we must remove the cWaveSource component from the original demo1_energy.conf, as this should be replaced by the cPortaudioSource component of the audiorecorder.conf file. To do this, we search for the line

instance[waveSource].type = cWaveSource

and comment it out by prefixing it with a ; or the C-style // or the script- and INI-style #. We also remove the corresponding configuration file section for waveSource. We do the same for the waveSink component and the corresponding section, the leave only the output of the computed frame-wise energy to a CSV file. Theoretically, we could also leave the waveSink section and component, but we would need to change the command-line option defined for the output filename, as this is the same for the CSV output and the wave-file output without any changes. In this case we should replace the filename option in the waveSink section by:

filename = \cm[waveoutput{output.wav}:name of output wave file]

Now, run your new configuration file with:

SMILExtract -C config/demo/live_energy.conf -O live_energy.csv

and inspect the contents of the live_energy.csv file with a text editor. openSMILE configuration files are made up of sections, similar to INI files. Each section is identified by a header which takes the form:

[instancename:cComponentType]

The first part (instancename) is a custom-chosen name for the section. It must be unique throughout the whole configuration file and all included sub-files. The second part defines the type of this configuration section and thereby its allowed contents. The configuration section typename must be one of the available component names (from the list printed by the command SMILExtract -L), as configuration file sections are linked to component instances. The contents of each section are lines of key=value pairs, until the next section header is found. Besides simple key=value pairs as in INI files, a more advanced structure is supported by openSMILE. The key can be a hierarchical value build of key1.subkey, for example, or an array such as keyarray[0] and keyarray[1]. On the other side, the value field can also denote an array of values, if the values are separated by a semi-colon (;). Quotes for the values are not needed and not yet supported, and multi-line values are not allowed. Boolean flags are always expressed as numeric values with 1 for on or true and 0 for off or false. The keys are referred to as the configuration options of the components, i.e. those listed by the on-line help (SMILExtract -H cComponentType). Since version 2.1, configuration sections can be split into multiple parts across the configuration file. That is, the same header (same instancename and typename) may occur more than once. In that case all options from all occurrences will be joint. There is one configuration section that must always be present: that of the component manager:

[componentInstances:cComponentManager]
instance[dataMemory].type = cDataMemory
instance[instancename].type = cComponentType
instance[instancename2].type = cComponentType2
...

The component manager is the main instance which creates all component instances of the currently loaded configuration, makes them read their configuration settings from the parsed configuration file (through the configManager component), and runs the tick-loop, i.e. the loop where data are processed incrementally by calling each component once to process newly available data frames. Each component that shall be included in the configuration, must be listed in this section, and for each component listed there, a corresponding configuration file section with the same instancename and of the same component type must exist. The only exception is the first line, which instantiates the central dataMemory component. It must be always present in the instance list, but no configuration file section has to be supplied for it. Each component that processes data has a data-reader and/or a data-writer sub-component, which are configurable via the reader and writer objects. The only options of interest to us now in these objects are the dmLevel options. These options configure the data-flow connections in your configuration file, i.e. they define in which order data is processed by the components, or in other words, which component is connected with which other component: Each component that modifies data or creates data (i.e. reading it from external sources etc.), will write its data to a unique dataMemory location (called level). The name of this location is defined in the configuration file via the option writer.dmLevel=name_of_evel. The level names must be unique and only one single component can write to each level. Multiple components can, however, read from a single level, enabling re-use of already computed data by multiple components. E.g. we typically have a wave source component which reads audio data from an uncompressed audio file (see also the demo1_energy.conf file):

[wavesource:cWaveSource]
writer.dmLevel = wave
filename = input.wav

The above reads data from input.wav into the dataMemory level wave. If next we want to chunk the audio data into overlapping analysis windows of 20ms length at a rate of 10ms, we need a cFramer component:

[framer:cFramer]
reader.dmLevel = wave
writer.dmLevel = frames20ms
frameSize = 0.02
frameStep = 0.01

The crucial line in the above code is the line which sets the reader dataMemory level (reader.dmLevel = wave) to the output level of the wave source component – effectively connecting the framer to the wave source component. To create new configuration files from scratch, a configuration file template generator is available. We will use it to create a configuration for computing magnitude spectra via the Fast-Fourier Transform (FFT). The template file generator requires a list of components that we want to have in the configuration file, so we must build this list first. In openSMILE most processing steps are wrapped in individual components to increase flexibility and re-usability of intermediate data. For our example we thus need the following components:

  • An audio file reader (cWaveSource),
  • a component which generates short-time analysis frames (cFramer),
  • a component which applies a windowing function to these frames such as a Hamming window (cWindower),
  • a component which performs a FFT (cTranformFFT),
  • a component which computes spectral magnitudes from the complex FFT result (cFFTmagphase),
  • and finally a component which writes the magnitude spectra to a CSV file (cCsvSink).

The generate our configuration file template, we thus run (note, that the component names are case sensitive!):

SMILExtract -l 0 -logfile my_fft_magnitude.conf -cfgFileTemplate -configDflt cWaveSource,cFramer,
    cWindower,cTransformFFT,cFFTmagphase,cCsvSink

The switch -cfgFileTemplate enables the template file output, and makes -configDflt accept a comma separated list of component names. If -configDflt is used by itself, it will print only the default configuration section of a single component (of which the name is given as argument to that option). This invocation of SMILExtract prints the configuration file template to the log (i.e., standard error and to the (log-)file given by the -logfile option). The switch -l 0 suppresses all other log messages (by setting the log-level to 0), leaving only the configuration file template lines in the specified file. The file generated by the above command cannot be used as is, yet. We need to update the data-flow connections first. In our example this is trivial, as one component always reads from the previous one, except for the wave source, which has no reader. We have to change:

[waveSource:cWaveSource]
writer.dmLevel = < >

to

[waveSource:cWaveSource]
writer.dmLevel = wave

The same for the framer, resulting in:

[framer:cFramer]
reader.dmLevel = wave
writer.dmLevel = frames

and for the windower:

[windower:cWindower]
reader.dmLevel = frames
writer.dmLevel = windowed
...
winFunc = Hamming
...

where we also change the windowing function from the default (Hanning) to Hamming, and in the same fashion we go down all the way to the csvSink component:

[transformFFT:cTransformFFT]
reader.dmLevel = windowed
writer.dmLevel = fftcomplex

...

[fFTmagphase:cFFTmagphase]
reader.dmLevel = fftcomplex
writer.dmLevel = fftmag

...

[csvSink:cCsvSink]
reader.dmLevel = fftmag

The configuration file can now be used with the command:

SMILExtract -C my_fft_magnitude.conf

However, if you run the above, you will most likely get an error message that the file input.wav is not found. This is good news, as it first of all means you have configured the data-flow correctly. In case you did not, you will get error messages about missing data memory levels, etc. The missing file problem is due to the hard-coded input file name with the option filename = input.wav in the wave source section. If you change this line to filename = example-audio/opensmile.wav your configuration will run without errors. It writes the result to a file called smileoutput.csv. To avoid having to change the filenames in the configuration file for every input file you want to process, openSMILE provides a very convenient feature: it allows you to define command-line options in the configuration files. In order to use this feature you replace the value of the filename by the command \cm[], e.g. for the input file:

filename = \cm[inputfile(I){input.wav}:input filename]

and for the output file:

filename = \cm[outputfile(O){output.csv}:output filename]

The syntax of the \cm command is: [longoptionName(shortOption-1charOnly){default value}:description for on-line help].

e. Reference feature sets

A major advantage of openSMILE over related feature extraction toolkits is that is comes with several reference and baseline feature sets which were used for the INTERSPEECH Challenges (2009-2014) on Emotion, Paralinguistics and Speaker States and Traits, as well as the Audio-Visual Emotion Challenges (AVEC) from 2011-2013. All of the INTERSPEECH configuration files are found under config/ISxx_*.conf. All the INTERSPEECH Challenge configuration files follow a common standard regarding the data output options they define. The default output file option (-O) defines the name of the WEKA ARFF file to which functionals are written. To save the data in CSV format additionally, use the option -csvoutput filename. To disable the default ARFF output, use -O ?. To enable saving of intermediate parameters, frame-wise Low-Level Descriptors (LLD), in CSV format the option -lldoutput filename can be used. By default, lines are appended to the functions ARFF and CSV files is they exist, but the LLD files will be overwritten. To change this behaviour, the boolean (1/0) options -appendstaticarff 1/0, -appendstaticcsv 1/0, and -appendlld 0/1 are provided. Besides the Challenge feature sets, openSMILE 2.1 is capable of extracting parameters for the Geneva Minimalistic Acoustic Parameter Set (GeMAPS — submitted for publication as [Eyben14], configuration files will be available together with publication of the article), which is a small set of acoustic paramters relevant for affective voice research. It was standardized and agreed upon by several research teams, including linguists, psychologists, and engineers. Besides these large-scale brute-forced acoustic feature sets, several other configuration files are provided for extracting individual LLD. These include Mel-Frequency Cepstral Coefficients (MFCC*.conf) and Perceptual Linear Predictive Coding Cepstral Coefficients (PLP*.conf), as well as the fundamental frequency and loudness (prosodyShsViterbiLoudness.conf, or smileF0.conf for fundamental frequency only).

Conclusion and summary

We have introduced openSMILE version 2.1 in this article and have given a hands-on practical guide on how to use it to extract audio features of out-of-the-box baseline feature sets, as well as customized acoustic descriptors. It was also shown how to use the voice activity detector, and pre-trained emotion models from the openEAR toolkit for live, incremental emotion recognition. The openSMILE toolkit features a large collection of baseline acoustic feature sets for paralinguistic speech and music analysis and a flexible and complete framework for audio analysis. In future work, more efforts will be put in documentation, speed-up of the underlying framework, and the implementation of new, robust acoustic and visual descriptors.

Acknowledgements

This research was supported by an ERC Advanced Grant in the European Community’s 7th Framework Programme under grant agreement 230331-PROPEREMO (Production and perception of emotion: an affective sciences approach) to Klaus Scherer and by the National Center of Competence in Research (NCCR) Affective Sciences financed by the Swiss National Science Foundation (51NF40-104897) and hosted by the University of Geneva. The research leading to these results has received funding from the European Community’s Seventh Framework Programme under grant agreement No.\ 338164 (ERC Starting Grant iHEARu). The authors would like to thank audEERING UG (haftungsbeschränkt) for providing up-to-date pre-release documentation, computational resources, and great support in maintaining the free open-source releases.

MPEG Column: Press release for the 109th MPEG meeting

MPEG collaborates with SC24 experts to develop committee draft of MAR reference model

SC 29/WG 11 (MPEG) is pleased to announce that the Mixed and Augmented Reality Reference Model (MAR RM), developed jointly and in close collaboration with SC 24/WG 9, has reached Committee Draft status at the 109th WG 11 meeting. The MAR RM defines not only the main concepts and terms of MAR, but also its application domain and an overall system architecture that can be applied to all MAR systems, regardless of the particular algorithms, implementation methods, computational platforms, display systems, and sensors/devices used. The MAR RM can therefore be used as a consultation source to aid in the development of MAR applications or services, business models, or new (or extensions to existing) standards. It identifies representative system classes and use cases with respect to the defined architecture, but does not specify technologies for the encoding of MAR information, or interchange formats.

2nd edition of HEVC includes scalable and multi-view video coding

At the 109th MPEG meeting, the standard development work was completed for two important extensions to the High Efficiency Video Coding standard (ISO/IEC 23008-2, also standardized by ITU-T as Rec. H.265).
The first of these are the scalability extensions of HEVC, known as SHVC, adding support for embedded bitstream scalability in which different levels of encoding quality are efficiently supported by adding or removing layered subsets of encoded data. The other are the multiview extensions of HEVC, known as MV-HEVC providing efficient representation of video content with multiple camera views and optional depth map information, such as for 3D stereoscopic and autostereoscopic video applications. MV-HEVC is the 3D video extension of HEVC, and further work for more efficient coding of 3D video is ongoing.
SHVC and MV-HEVC will be combined with the original content of the HEVC standard and also the recently-completed format range extensions (known as RExt), so that a new edition of the standard will be published that contains all extensions approved up to this time.

In addition, the finalization of reference software and a conformance test set for HEVC was completed at the 109th meeting, as ISO/IEC 23008-5 and ISO/IEC 23008-8, respectively. These important standards will greatly help industry achieve effective interoperability between products using HEVC and provide valuable information to ease the development of such products.
In consideration of the recent dramatic developments in video coding technology, including the completion of the development of the HEVC standard and several major extensions, MPEG plans to host a brainstorming event during its 110th meeting which will be open to the public. The event will be co-hosted by MPEG’s frequent collaboration partner in video coding standardization work, the Video Coding Experts Group (VCEG) of ITU-T Study Group 16. More information on how to register for the event will be available at http://mpeg.chiariglione.org/meetings/110.

MPEG-H 3D Audio extended to lower bit rates

At its 109th meeting, MPEG has selected technology for Version II of the MPEG-H 3D Audio standard (ISO/IEC 23008-3) based on responses submitted to the Call for Proposals issued in January 2013. This follows from selection of Version I technology, which was chosen at the 105th meeting, in August 2013. While Version I technology was evaluated for bitrates between 1.2 Mb/s to 256 kb/s, Version II technology is focused on bitrates between 128 kb/s to 48 kb/s.
The selected technology supports content in multiple formats: channel-based, channels and objects (C+O), and scene-based Higher Order Ambisonics (HOA). A total of six submissions were reviewed: three for coding C+O content and three for coding HOA content.
The selected technologies for Version II were shown to be within the framework of the unified Version I technology.
The submissions were evaluated using a comprehensive set of subjective listening tests in which the resulting statistical analysis guided the selection process. At the highest bitrate of 128 kb/s for the coding of a signal supporting a 22.2 loudspeaker configuration, both of the selected technologies had performance of “Good” on the MUSHRA subjective quality scale. It is expected that the C+O and HOA Version II technologies will be merged into a unified architecture.
This MPEG-H 3D Audio Version II is expected to reach Draft International Standard by June 2015.

The 109th meeting also saw the technical completion of Version I of the MPEG-H 3D Audio standard and is expected to be an International Standard by February, 2015.

Public seminar for media synchronization planned for 110th MPEG meeting in October

A public seminar on Media Synchronization for Hybrid Delivery will be held on the 22nd of October 2014 during the 110th MPEG meeting in Strasbourg. The purpose of this seminar is to introduce MPEG’s activity on media stream synchronization for heterogeneous delivery environments, including hybrid environments employing both broadcast and broadband networks, with existing MPEG systems technologies such as MPEG-2 TS, DASH, and MMT. The seminar will also strive to ensure alignment of its present and future projects with users and industry use-cases needs. Main topics covered by the seminar interventions include:

  • Hybrid Broadcast – Broadband distribution for UHD deployments and 2nd screen content
  • Inter Destination Media Synchronization
  • MPEG Standardization efforts on Time Line Alignment of media contents
  • Audio Fingerprint based Synchronization

You are invited to join the seminar to learn more about MPEG activities in this area and to work with us to further develop technologies and standards supporting new applications of rich and heterogeneous media delivery.
The seminar is open to the public and registration is free of charge.

First MMT Developers’ Day held at MPEG 109, second planned for MPEG 110

Following the recent finalization of the MPEG Media Transport standard (ISO/IEC 23008-1), MPEG has hosted an MMT Developers’ Day to better understand the rate of MMT adoption and to provide a channel for MPEG to receive comments from industries about the standard. During the event four oral presentations have been presented including “Multimedia transportation technology and status in China”, “MMT delivery considering bandwidth utilization”, “Fast channel change/ Targeted Advertisement insertion over hybrid media delivery”, and “MPU Generator.” In addition, seven demonstrations have been presented such as Reliable 4K HEVC Realtime Transmission by using MMT-FEC, MMT Analyzer, Applications of MMT content through Broadcast, Storage, and Network Delivery, Media Delivery Optimization with the MMT Cache Middle Box, MMT-based Transport Technology for Advanced Services in Super Hi-Vision, target ad insertion and multi-view content composition in broadcasting system with MMT, and QoS management for Media Delivery. MPEG is planning to host a 2nd MMT Developer’s Day during the 110th meeting on Wednesday, Oct 22nd.

Seminar at MPEG 109 introduces MPEG’s activity for Free Viewpoint Television

A seminar for FTV (Free Viewpoint Television) was held during the 109th MPEG meeting in Sapporo. FTV is an emerging visual media technology that will revolutionize the viewing of 3D scenes to facilitate a more immersive experience by allowing users to freely navigate the view of a 3D scene as if they were actually there. The purpose of the seminar was to introduce MPEG’s activity on FTV to interested parties and to align future MPEG standardization of FTV technologies with user and industry needs.

Digging Deeper – How to Contact MPEG

Communicating the large and sometimes complex array of technology that the MPEG Committee has developed is not a simple task. Experts, past and present, have contributed a series of tutorials and vision documents that explain each of these standards individually. The repository is growing with each meeting, so if something you are interested is not yet there, it may appear shortly – but you should also not hesitate to request it. You can start your MPEG adventure at http://mpeg.chiariglione.org/

Further Information

Future MPEG meetings are planned as follows:

  • No. 110, Strasbourg, FR, 20 – 24 October 2014
  • No. 111, Geneva, CH, 16 – 20 February 2015
  • No. 112, Warsaw, PL, 22 – 26 June 2015

For further information about MPEG, please contact:

Dr. Leonardo Chiariglione (Convenor of MPEG, Italy)
Via Borgionera, 103
10040 Villar Dora (TO), Italy
Tel: +39 011 935 04 61
leonardo@chiariglione.org

or

Dr. Arianne T. Hinds
Cable Television Laboratories
858 Coal Creek Circle
Louisville, Colorado 80027 USA
Tel: +1 303 661 3419
a.hinds@cablelabs.com.

The MPEG homepage also has links to other MPEG pages that are maintained by the MPEG subgroups. It also contains links to public documents that are freely available for download by those who are not MPEG members. Journalists that wish to receive MPEG Press Releases by email s

Launching the first-ever National Data Science Bowl

What is the National Data Science Bowl ?

Take a deep dive and see how tiny plants and animals fuel the world

We are pioneering a new language to understand our incredibly beautiful and complex world. A language that is forward-looking rather than retrospective, different from the words of historians and famed novelists. It is data science; and through it, we have the power to use insights from our past to build an unprecedented future. We need your help building that future. The 2014/2015 National Data Science Bowl offers tremendous potential to modernize the way we understand and address a major environmental challenge— monitoring the health of our oceans.

Compete

ACM is a partner in the first-ever National Data Science Bowl, which launched on 12/15.

This 90-day competition offers data scientists the chance to solve a critical problem facing our world’s oceans using the power of data.

Participants are challenged to examine nearly 100,000 underwater images to develop an algorithm that will enable researchers to monitor certain sea life at a speed and scale never before possible.

$175,000 in prize money to top three individual contestants and the top academic team.

Report from SLAM 2014

ISCA/IEEE Workshop on Speech, Language and Audio in Multimedia

Following SLAM 2013 in Marseille, France, SLAM 2014 was the second edition of the workshop, held in Malaysia as a satellite of Interspeech 2014. The workshop was organized over two days, one for science and one for socializing and community building. With about 15 papers and 30 attendees, the highly-risky second edition of the workshop showed the will to build a strong scientific community at the frontier of speech and audio processing, natural language processing and multimedia content processing.

The first day featured talks covering various topics related to speech, language and audio processing applied to multimedia data. Two keynotes from Shri Narayanan (University of Southern California) and Min-Yen Kan (National University of Singapore) nicely completed the program.
The second day took us on a tour of Penang followed by a visit of the campus of Universiti Sains Malaysia from which local organizers are. The tour offered plenty of opportunities to strengthen the links between participants and build a stronger community, as expected. Most participants later went ot Singapore to attend Interspeech, the main conference in the domain of speech communication, where further discussions went on.

We hope to collocate the next SLAM edition with a multimedia conference such as ACM Multimedia in 2015. Keep posted!

Slow Internet? – More bandwidth is not the answer

The RITE (Reducing Internet Transport Latency) EU-project has, since its start nearly two years ago, worked to reduce the delay experienced when using the Internet. The approach is to make small, smart, changes to the mechanisms that makes Internet communication work. These mechanisms were developed to maximise throughput, but delay was overlooked until recently.

One of the tasks of the RITE EU-project is to raise awareness about how to achieve faster response in the Internet and to clear up misconceptions.

A common misconception is the idea that higher bandwidth means lower delay. It may be right if you only consider the case where a large file is downloaded (the download takes less time to complete), but there’s very many cases where other measures are needed to reduce the delay. It’s not enough to just buy the highest capacity Internet connection. To clear this up, RITE has produced a video and educational material showing some of the biggest sources of Internet delay. The video also explains the difference between delay and bandwidth.

The RITE project has made educational material to go with the video. A Kahoot! quiz that allows the students to compete, experiments that can be run from any computer and a fact sheet containing tips for how you can reduce your Internet delay is among the available resources.

You can learn more about Internet delay and read about the project on the RITE website: http://www.riteproject.eu

Resources:

Take the Kahoot! quiz.
Educational material.

SSI: An Open Source Platform for Social Signal Interpretation

Introduction

Automatic detection and interpretation of social signals carried by voice, gestures, mimics, etc. will play a key-role for next-generation interfaces as it paves the way towards more intuitive and natural human-computer interaction. In this article we introduce Social Signal Interpretation (SSI), a framework for real-time recognition of social signals. SSI supports a large range of sensor devices, filter and feature algorithms, as well as, machine learning and pattern recognition tools. It encourages developers to add new components using SSI’s C++ API, but also addresses front end users by offering an XML interface to build pipelines with a text editor. SSI is freely available under GPL at openssi.net.

Key Features

The Social Signal Interpretation (SSI) framework offers tools to record, analyse and recognize human behaviour in real-time, such as gestures, mimics, head nods, and emotional speech. Following a patch-based design pipelines are set up from autonomic components and allow the parallel and synchronized processing of sensor data from multiple input devices. In particularly, SSI supports the machine learning pipeline in its full length and offers a graphical interface that assists a user to collect own training corpora and obtain personalized models. In addition to a large set of built-in components SSI also encourages developers to extend available tools with new functions. For inexperienced users an easy-to-use XML editor is available to draft and run pipelines without special programming skills. SSI is written in C++ and optimized to run on computer systems with multiple CPUs.

The key features of SSI include:

  • Synchronized reading from multiple sensor devices
  • General filter and feature algorithms, such as image processing, signal filtering, frequency analysis and statistical measurements in real-time
  • Event-based signal processing to combine and interpret high level information, such as gestures, keywords, or emotional user states
  • Pattern recognition and machine learning tools for on-line and off-line processing, including various algorithms for feature selection, clustering and classification
  • Patch-based pipeline design (C++-API or easy-to-use XML editor) and a plug-in system to integrate new components

SSI also includes wrappers for many popular sensor devices and signal processing libraries, such as the e-Health Sensor Shield, IOM biofeedback system (Wild Divine), MicrosoftKinect, TheEyeTribe, Wii Remote Control, ARTKplus, FFMpeg, OpenCV, WEKA, Torch, DSPFilters, Fubi, Praat, OpenSmile, LibSox, and EmoVoice. To get SSI, please visit our download page.

 

Figure 1: Sketch summarizing the various tasks covered by SSI.

Framework Overview

Social Signal Interpretation (SSI) is an open source project meant to support the development of recognition systems using live input of multiple sensors [1]. Therefore it offers support for a large variety of filter and feature algorithms to process captured signals, as well as, tools to accomplish the full machine learning pipeline. Two type of users are addressed: developers are provided a C++-API that encourages them to write new components and front end users can define recognition pipelines in XML from available components.

Since social cues are expressed through a variety of channels, such as face, voice, postures, etc., multiple kind of sensors are required to obtain a complete picture of the interaction. In order to combine information generated by different devices raw signal streams need to be synchronized and handled in a coherent way. Therefore an architecture is established to handle diverse signals in a coherent way, no matter if it is a waveform, a heart beat signal, or a video image.

 

Figure 2: Examples of sensor devices SSI supports.

Sensor devices deliver raw signals, which need to undergo a number of processing steps in order to carve out relevant information and separate it from noisy or irrelevant parts. Therefore, SSI comes with a large repertoire of filter and feature algorithms to treat audiovisual and physiological signals. By putting processing blocks in series developers can quickly build complex processing pipelines, without having to care much about implementation details such as buffering and synchronization, which will be automatically handled by the framework. Since processing blocks are allocated to separate threads, individual window sizes can be chosen for each processing step.

 

Figure 3: Streams are processed in parallel using tailored window sizes.

Since human communication does not follow the precise mechanisms of a machine, but is tainted with a high amount of variability, uncertainty and ambiguity, robust recognizers have to be built that use probabilistic models to recognize and interpret the observed behaviour. To this end, SSI assembles all tasks of a machine learning pipeline including pre-processing, feature extraction, and online classification/fusion in real-time. Feature extraction converts a signal chunk into a set of compact features – keeping only the essential information necessary to classify the observed behaviour. Classification, finally accomplishes a mapping of observed feature vectors onto a set of discrete states or continuous values. Depending on whether the chunks are reduced to a single feature vector or remain a series of variable length, a statistical or dynamic classification scheme is applied. Examples of both types are included in the SSI framework.

 

Figure 4: Support for statistical and dynamic classification schemes.

To solve ambiguity in human interaction information extracted from diverse channels need to be combined. In SSI information can be fused at various levels. Already at data level, e. g. when depth information is enhanced with colour information. At feature level, when features of two ore more channels are put together to a single feature vector. Or at decision level, when probabilities of different recognizers are combined. In the latter cases, fused information should represent the same moment in time. If this is not possible due to temporal offsets (e. g. a gesture followed by a verbal instruction) fusion has to take place at event level. The preferred level depends on the type of information that is fused.

 

Figure 5: Fusion on feature, decision and event level.

XML Pipeline

In this article we will focus on SSI’s XML interface, which allows the definition of pipelines as plain text files. No particular programming skills or development environments are required. To assemble a pipeline any text editor can be used. However, there is also an XML editor in SSI, which offers special functions and simplifies the task of writing pipelines, e.g. by listing options and descriptions for a selected component.

 

Figure 6: SSI’s XML editor offers convenient access to available components (left panel). A component’s options can be directly accessed and edited in a special sheet (right panel).

To illustrate how XML pipelines are built in SSI, we will start off with a simple unimodal example. Let’s assume we wish to build an application that converts a sound into a spectrum of frequencies (a so called spectrogram). This is a typical task in audio processing as many properties of speech are best studied in the frequency domain. The following pipeline captures sound from a microphone and transforms it into a spectrogram. Both, the raw and the transformed signal, are finally visualized.

<?xml version="1.0" ?>
<pipeline ssi-v="1">

        <register>  
                <load name="ssiaudio.dll"/>
                <load name="ssisignal.dll"/>
                <load name="ssigraphic.dll" />
        </register>

        <!-- SENSOR -->
        <sensor create="ssi_sensor_Audio" option="audio" scale="true">
                <provider channel="audio" pin="audio"/>
        </sensor>

        <!-- PROCESSING -->
        <transformer create="ssi_feature_Spectrogram" minfreq="100" maxfreq="5100" nbanks="50">
                <input pin="audio" frame="0.01s" delta="0.015s"/>
                <output pin="spect"/>
        </transformer>

 <!-- VISUALIZATION -->
        <consumer create="ssi_consumer_SignalPainter" name="audio" size="10" type="2">
                <input pin="audio" frame="0.02s"/>
        </consumer>
        <consumer create="ssi_consumer_SignalPainter" name="spectrogram" size="10" type="1">
                <input pin="spect" frame="1"/>
        </consumer>

</pipeline>

 

Since SSI uses a plugin system, components are loaded dynamically at runtime. Therefore, we load the required components by including the according DLL files grouped within the <register> element. In our case, three DLLs will be loaded, namely ssiaudio.dll, which includes components to read from an audio source, ssisignal.dll, which includes generic signal processing algorithms, and ssigraphic.dll, which includes tools for visualizing the raw and processed signals. The remaining part of the pipeline defines which components will be created and in which way they connect with each other.

Sensors, introduced with the keyword <sensor>, define the source of a pipeline. Here, we place a component with the name ssi_sensor_Audio, which encapsulates an audio input. Generally, components offer options, which allow us to alter their behaviour. For instance, setting scale=true will tell the audio sensor to deliver the wave signal as floating point values in range [-1..1], otherwise we would receive a stream of integers. By setting option=audio we further instruct the component to save the final configuration to a file audio.option.

Next, we have to define, which sources of a sensor we want to tap. A sensor offers at least one such channel. To connect a channel we use the <provider> statement, which has two attributes: channel is the unique identifier of the channel and pin defines a freely selectable identifier, which is later on used to refer to the according signal stream. Internally, SSI will now create a buffer and constantly write incoming audio samples to it. By default, it will keep the last 10 seconds of the stream. Connected components that read from the buffer will receive a copy of the requested samples.

Components that connect to a stream, apply some processing, and output the result as a new stream, are tagged with the keyword <transformer>. The new stream may differ from the input stream in sample rate and type, e.g. a video image can be mapped to a single position value. Or – as in our case – a one dimensional audio stream of 16kHz is mapped on a multidimensional spectral representation of lower frequency, a so called spectrogram. A spectogram is created by calculating the energy in different frequency bins. The component in SSI that applies this kind of transformation is called ssi_feature_Spectrogram. Via the options minfreq, maxfreq, and nbanks, the frequency range in Hz, which is defined by [minfreq…maxfreq], and the number of bins are set. Options not set in the pipeline, like here the number of coefficients (nfft) and the type of window (wintype) used for the Fourier transformation, will be initialized with their default values (unless indirectly overwritten in an option file loaded via option as in case of the audio sensor).

By the tag <input> we specify the input stream for the transformer. Since we want to read raw audio samples we put audio, i.e. the pin we selected for the audio channel. Now, we need to decide on the size of the blocks in which the input stream is processed. We do this through the attribute frame, which defines the frame hop, i.e. the number of samples a window will be shifted after a read operation. Optionally, this window can be extended by a certain number of samples given by the attribute delta. In our case, we choose 0.01s and 0.015s, respectively. I.e. at each loop a block of length 0.025 seconds is retrieved and then shifted by 0.01 seconds. If we assume a sample rate of 16kHz (the default rate of an audio stream) this converts to a block length of 400 samples (16kHz*[0.01s+0.015s]) and a frame shift of 160 samples (16kHz*0.01s). In other words, at each calculation step, 400 samples are copied from the buffer and afterwards the read position is increased by 160 samples. Since the output is a single sample of dimension equal to the number of bins in the spectrogram, the sample rate of the output stream becomes 100 Hz (1/0.01s). For the new, transformed stream, SSI creates another buffer, to which we assign a new pin spect that we wrap in the <output> element.

Finally, we want to visualize the streams. Components that read from a buffer, but do not write back a new stream, are tagged with the keyword <consumer>. To draw the current content of a buffer in a graph we use an instance of ssi_consumer_SignalPainter that we connect to a stream pin within the <input> tag. To draw both, the raw and the transformed stream, we add two of them and connect one to audio and one to spect. The option type allows us to choose an appropriate kind of visualization. In case of the raw audio we set the frame length to 0.02s, i.e. the stream will be updated every 20 milliseconds. In case of the spectrogram, we set 1 (no trailing s), which sets the update rate to a single frame.

Now, we are ready to run the pipeline by typing xmlpipe <filename> on the console (or hitting F5 if you use SSI’s XML editor). When running for the first time, a pop-up shows up, so we can select an input source. The choice will be remembered and stored in the file audio.option. The output should be something like:

 

Figure 7: Left top: plot of a raw audio signal. Right bottom: spectrogram with low frequency bins being blue and high energy bins being red. The console window provides information on the current state of the pipeline.

Sometimes, when pipelines become long, it is clearer to outsource important options to a separate file. In the pipeline we mark those parts with $(<key>) and create a new file, which includes statements of the form <key> = <value>. For instance, we could alter the spectogram to:

<transformer create="ssi_feature_Spectrogram" minfreq="$(minfreq)" maxfreq="$(maxfreq)" nbanks="$(nbanks)">
        <input pin="audio" frame="0.01s" delta="0.015s"/>
        <output pin="spect"/>
</transformer>

and set the actual values in another file (while pipelines should end on .pipeline, config files should end on .pipeline-config):

minfreq = 100 # minimum frequency maxfreq = 5100 # maximum frequency nbanks = 100 # $(select{5,10,25,50,100}) number of bins

For convenience, SSI offers a small GUI named xmlpipeui.exe, which lists available options in a table and automatically parses new keys from a pipeline:

 

Figure 8: GUI to manage options and run pipelines with different configurations.

In SSI, the counterpart to streams are events. Unlike streams, which have a continuous nature, events may occur asynchronously and have a definite onset and offset. To demonstrate this feature we will extend our previous example and add an activity detector to drive the feature extraction, i.e. the spectogram will be displayed only during times when there is activity in the audio.

To do so, we add two more components. Another transformer (AudioActivity), which calculates loudness (method=0) and sets values below some threshold to zero (threshold=0.1). And another consumer (ZeroEventSender), which picks up the result and looks for parts in the signal that are non-zero. If such a part is detected and if it is longer than a second (mindur=1.0), an event is fired. To identify them, events will be equipped with an address composed of an event name and a sender name: <event>@<sender&gt. In our example, options ename and sname are applied to set the address to activity@audio.

<!-- ACTIVITY DETECTION -->
<transformer create="ssi_feature_AudioActivity" method="0" threshold="0.1">
        <input pin="audio" frame="0.01s" delta="0.015s"/>
        <output pin="activity"/>
</transformer>
<consumer create="ssi_consumer_ZeroEventSender" mindur="1.0" maxdur="5.0" sname="audio" ename="activity">
        <input pin="activity" frame="0.1s"/>
</consumer>

 

We can now change the visualization of the spectogram from continuous to event triggered. Therefore we replace the attribute frame by listen=activity@audio. We also set the length of the graph to zero (size=0), which will allow it to adapt its length dynamically to the duration of the event.

<consumer create="ssi_consumer_SignalPainter" name="spectrogram" size="0" type="1">
        <input pin="spect" listen="activity@audio" />
</consumer>

 

To display a list of current events in the framework, we also include an instance of the component called EventMonitor. Since it only reacts to events and is neither a consumer, nor a transformer, it is wrapped with the keyword <object>. With the <listen> tag we can determine, which events we want to receive. By setting address=@ and span=10000we configure the monitor to display any event in the last 10 seconds.

<object create="ssi_listener_EventMonitor" mpos="400,300,400,300">
        <listen address="@" span="10000"/>
</object>

The output of the new pipeline is shown below. Again, it contains continuous plots of the raw audio and the activity signal (left top). In the graph below the triggered spectrogram is displayed, showing the result for the latest activity event (18.2s – 19.4s). This corresponds to the top entry in the monitor (right bottom). It also lists three previous events. Since activity events do not contain additional meta data, they have a size of 0 bytes. In case of a classification event, e.g., class names with probabilities are attached.

 

Figure 9: In this example activity detection has been added to drive the spectrogram (see graph between raw audio and spectrogram). After a period of activity, an event is fired, which triggers the visualization of the spectrogram. Past events are listed in the window below the console.

Multi-modal Enjoyment Detection

We will now move to a more complex application – multi-modal enjoyment detection. The system we want to focus on has been developed as part of the European project (FP7) ILHAIRE (Incorporating Laughter into Human Avatar Interactions: Research and Experiments, see http://www.ilhaire.eu/). It combines the input of two sensors, a microphone and a camera, to predict in real-time the level of enjoyment of a user. In this context, we define enjoyment as an episode of positive emotion, indicated by visual and auditory cues of enjoyment, such as smiles and voiced laughters. On the basis of the frequency and intensity of these cues the level of enjoyment is determined.

 

Figure 10: The more cues of enjoyment a user displays, the higher will be the output of the system.

Training data for tuning the detection models was recorded in several sessions among three to four users having funny conversation, each session alsted for about 1.5h. During recordings each user was equipped with a headset and filmed with a Kinect and a HD camera. To allow simultaneous recording of four users, the setup included several pcs synchronized over the network. The possibility to keep pipelines, which are distributed over several machines in a network, in sync allows it to create large multi-modal corpora with multiple users. In this particular case, the raw data captured by SSI summed up to about ~4.78 GB per minute including audio, Kinect body and face tracking, as well as, HD video streams.

The following pipeline snippet connects to an audio and Kinect sensor and stores the captured signals on disk. Note that the audio stream and the Kinect rgb video are muxed in a single file. Therefore the audio is passed as an additional input source wrapped by a <xinput> element. An additional line is added at the top of the file to configure the <framework> to wait for a synchronization signal on port 1234.

<!-- SYNCHRONIZATION -->
<framework sync="true" sport="1234" slisten="true"/>

<!-- AUDIO SENSOR -->
<sensor create="ssi_sensor_Audio" option="audio" scale="true">
        <provider channel="audio" pin="audio"/>
</sensor>

<!-- KINECT SENSOR -->
<sensor create="ssi_sensor_MicrosoftKinect">
        <provider channel="rgb" pin="kinect_rgb"/>
        <provider channel="au" pin="kinect_au"/>
        <provider channel="face" pin="kinect_face"/>
</sensor>
<!-- STORAGE -->
<consumer create="ssi_consumer_FFMPEGWriter" url="rgb.mp4">
        <input pin="kinect_rgb" frame="1"/>
        <xinput size="1">
                <input pin="audio"/>
        </xinput>
</consumer>
<consumer create="ssi_consumer_FileWriter" path="au">
        <input pin="kinect_au" frame="5"/>
</consumer>
<consumer create="ssi_consumer_FileWriter" path="face">
        <input pin="kinect_face" frame="5"/>
</consumer>

 

Based on the audiovisual content, raters were asked to annotate audible and visual cues of laughter in the recordings. Afterwards, features are extracted from the raw signals and each feature vector is labelled by the according annotation tracks. The labelled feature vectors serve as input for a learning phase during which a separation of the feature space is seeked that allows a good segregation of the class labels. E.g. a feature measuring the extension of lip corners may correlate with smiling and hence be picked as an indicator for enjoyment. Since in complex recognition tasks no definite mapping exists, numerous approaches have been proposed to solve this task. SSI includes several well established learning algorithms, such as K-Nearest Neighbours, Gaussian Mixture Models, or Support Vector Machines. These algorithms are part of SSI’s machine learning library, which also allows provides tools to simulate (parts of) pipelines in a best-effort manner and to evaluate models in terms of expected recognition accuracy.

   

Figure 11: Manual annotations of the enjoyment cues are used to train detection models for each modality.

The following C++ code snippet gives an impression how learning is accomplished using SSI’s machine learning library. First, the raw audio file (“user1.wav”) and an annotation file (“user1.anno”) are loaded. Next, the audio stream is converted into a list of samples to which feature extraction is applied. Finally, a model is trained using those samples. To see how well the model performs on the training data an additional evaluation step is added.

// read audio
ssi_stream_t stream;
WavTools::ReadWavFile ("user1.wav", stream);

// read annotation
Annotation anno;
ModelTools::LoadAnnotation (anno, "user1.anno");

// create samples
SampleList samples;
ModelTools::LoadSampleList (samples, stream, anno, "user1");

// extract features
SampleList samples_t;
EmoVoiceFeat *ev = ssi_create (EmoVoiceFeat, "ev", true);
ModelTools::TransformSampleList (samples, samples_t, *ev);

// create model
IModel *svm = ssi_create (SVM, "svm", true);
Trainer trainer (svm);

// train and save
trainer.train (samples_t);
trainer.save (model);

// evaluation
Evaluation eval;
eval.evalKFold (trainer, samples_t, 10);
eval.print ();

 

After the learning phase the pre-trained classification models are ready to be plugged into a pipeline. To accomplish the pipeline at hand, two models were trained: one to detect audible laughter cues (e.g. laughter bursts) from the voice, and one to detect visual cues (e.g. smiles) in the face, which are now connected to the according feature extracting components. Activity detection is applied to decide if a frame contains sufficient information to be included in the classification process. For example, if no face is detected in the current image or if the energy of an audio chunk is too low, the frame will be discarded. Otherwise, classification is applied and the result is forwarded to the fusion component. The first steps in the pipeline are stream-based, i.e. signals are continuously processed over a fixed length window. Late components responsible for cue detection and fusion are event based processes, applied only where signals carry information relevant to the recognition process. To derive a final decision, both type of cues are finally combined.

The pipeline is basically an extension of the recording pipeline described earlier, which includes additional processing steps. To process the audio stream, the following snippet is added:

<!-- VOCAL ACTIVITY DETECTION -->
<transformer create="ssi_feature_AudioActivity" threshold="0.025">
        <input pin="audio" frame="19200" delta="28800"/>
        <output pin="voice_activity"/>
</transformer>      

<!-- VOCAL FEATURE EXTRACTION -->
<transformer create="ssi_feature_EmoVoiceFeat">
        <input pin="audio" frame="19200" delta="28800"/>
        <output pin="audio_feat"/>
</transformer>

<!-- VOCAL LAUGTHER CLASSIFICATION -->
<consumer create="ssi_consumer_Classifier" trainer="models\voice" sname="laughter" ename="voice">
        <input pin="audio_feat" frame="1" delta="0" trigger="voice_activity"></input>
</consumer>

  While video processing is accomplished by:

<!-- FACIAL ACTIVITY DETECTION -->
<transformer create="ssi_feature_MicrosoftKinectFAD" minfaceframes="10">
        <input pin="kinect_face" frame="10" delta="15"/>
        <output pin="face_activity"/>
</transformer>      

<!-- FACIAL FEATURE EXTRACTION -->
<transformer create="ssi_feature_MicrosoftKinectAUFeat">
        <input pin="kinect_au" frame="10" delta="15"/>
        <output pin="kinect_au_feat"/>
</transformer>

<!-- FACIAL LAUGHTER CLASSIFICATION -->
<consumer create="ssi_consumer_Classifier" trainer="models\face" sname="laughter" ename="face">
        <input pin="kinect_au_feat" frame="1" delta="0" trigger="face_activity"></input>         
</consumer>

 

Obviously, both snippets share a very similar structure, though different components are loaded and frame/delta sizes are adjusted to fit the samples rates. Note that this time the trigger stream (voice_activity/face_activity) is directly applied by the keyword trigger in the <input> section of the classifier. The pre-trained models for detecting cues in the vocal and facial feature streams are loaded from file via the trainer option. Cue probabilities are finally combined via vector fusion:

<object create="ssi_listener_VectorFusionModality" ename="enjoyment" sname="fusion"
        update_ms="400" fusionspeed="1.0f" gradient="0.5f" threshold="0.1f" >
        <listen address="laughter@voice,face"/>
</object>

 

The core idea of vector based fusion is to handle detect events (laughter cues in our case) as independent vectors in a single or multidimensional event space and derive a final decision by aggregating vectors while taking into account temporal relationships (the influence of an event is reduced over time) [2]. In contrast to standard segment-based fusion approaches, which force a decision in all modalities at each fusion step, it is individually decided if and when a modality contributes. The following animation illustrates this, with green dots representing detected cues, whereas red dots mean that no cue was detected. Note that the final fusion decision – represented by the green bar on the right – grows if cues are detected and afterwards shrinks again as no new input is added:

 

Figure 12: The pre-trained models are applied to detect enjoyment cues at run-time (green dots). The more cues are detected across modalities, the higher is the output of the fusion algorithm (green bar).

The following video clip demonstrates the detection pipeline in action. Input streams are visualized on the left (top: video stream with face tracking, bottom: raw audio stream and acdtivity plot). Laughter cues detected in the two modalities are shown in two separate bar plots on top of each other. Result of final multi-modal laughter detection in the bar plot on the very right.

Conclusion

In this article we have introduced Social Signal Interpretation (SSI), a multi-modal signal processing framework. We have introduced the basic concepts of SSI and demonstrated by means of two examples how SSI allows users to quickly set up processing pipelines in XML. While in this article we have focused on how to build applications from available components, a simple plugin system offers developers the possibility to extend the pool of available modules with new components. By sharing these within the multimedia community everyone is encouraged to enrich the functions of SSI in future. Readers who would like to learn more about SSI or get free access to the source code, please visit http://openssi.net.

Future Work

So far, applications developed with SSI target desktop machines, possibly distributed over several such machines within a network. Though, wireless sensor devices become more and more popular offering some amount of mobility, yet it is not possible to monitor subjects outside a certain radius, unless they take a desktop computer with them. Smartphones and similar pocket computers can help to overcome this limitation. In a pilot project, a plugin for SSI has been developed to stream in real-time audiovisual content and other sensor data over a wireless LAN connection from a mobile device running Android to an SSI server. The data is analysed on the fly on the server and the result is send back to the mobile device. Such scenarios give ample scope for new applications, which allow it to follow a user “in the wild”. In the CARE project (a sentient Context-Aware Recommender System for the Elderly) an recommender system is currently developed, which in real-time assists solely living elderly people in their home environment by recommending them depending on the situation physical, mental and social activities that are aimed to induce senior citizens to be self-confident, independent and active at everyday life again.

Acknowledgements

The work described in this article is funded by the European Union under research grant CEEDs (FP7-ICT-2009-5) and TARDIS (FP7-ICT-2011-7), and ILHAIRE, a Seventh Framework Programme (FP7/2007-2013) under grant agreement n°270780.