MPEG Column: Press release for the 109th MPEG meeting

MPEG collaborates with SC24 experts to develop committee draft of MAR reference model

SC 29/WG 11 (MPEG) is pleased to announce that the Mixed and Augmented Reality Reference Model (MAR RM), developed jointly and in close collaboration with SC 24/WG 9, has reached Committee Draft status at the 109th WG 11 meeting. The MAR RM defines not only the main concepts and terms of MAR, but also its application domain and an overall system architecture that can be applied to all MAR systems, regardless of the particular algorithms, implementation methods, computational platforms, display systems, and sensors/devices used. The MAR RM can therefore be used as a consultation source to aid in the development of MAR applications or services, business models, or new (or extensions to existing) standards. It identifies representative system classes and use cases with respect to the defined architecture, but does not specify technologies for the encoding of MAR information, or interchange formats.

2nd edition of HEVC includes scalable and multi-view video coding

At the 109th MPEG meeting, the standard development work was completed for two important extensions to the High Efficiency Video Coding standard (ISO/IEC 23008-2, also standardized by ITU-T as Rec. H.265).
The first of these is the set of scalability extensions of HEVC, known as SHVC, which adds support for embedded bitstream scalability in which different levels of encoding quality are efficiently supported by adding or removing layered subsets of encoded data. The other is the set of multiview extensions of HEVC, known as MV-HEVC, which provides an efficient representation of video content with multiple camera views and optional depth map information, e.g. for 3D stereoscopic and autostereoscopic video applications. MV-HEVC is the 3D video extension of HEVC, and further work on more efficient coding of 3D video is ongoing.
SHVC and MV-HEVC will be combined with the original content of the HEVC standard and also the recently-completed format range extensions (known as RExt), so that a new edition of the standard will be published that contains all extensions approved up to this time.

In addition, the finalization of reference software and a conformance test set for HEVC was completed at the 109th meeting, as ISO/IEC 23008-5 and ISO/IEC 23008-8, respectively. These important standards will greatly help industry achieve effective interoperability between products using HEVC and provide valuable information to ease the development of such products.
In consideration of the recent dramatic developments in video coding technology, including the completion of the development of the HEVC standard and several major extensions, MPEG plans to host a brainstorming event during its 110th meeting which will be open to the public. The event will be co-hosted by MPEG’s frequent collaboration partner in video coding standardization work, the Video Coding Experts Group (VCEG) of ITU-T Study Group 16. More information on how to register for the event will be available at http://mpeg.chiariglione.org/meetings/110.

MPEG-H 3D Audio extended to lower bit rates

At its 109th meeting, MPEG selected technology for Version II of the MPEG-H 3D Audio standard (ISO/IEC 23008-3) based on responses submitted to the Call for Proposals issued in January 2013. This follows the selection of Version I technology at the 105th meeting in August 2013. While Version I technology was evaluated at bitrates from 1.2 Mb/s down to 256 kb/s, Version II technology focuses on bitrates from 128 kb/s down to 48 kb/s.
The selected technology supports content in multiple formats: channel-based, channels and objects (C+O), and scene-based Higher Order Ambisonics (HOA). A total of six submissions were reviewed: three for coding C+O content and three for coding HOA content.
The selected technologies for Version II were shown to be within the framework of the unified Version I technology.
The submissions were evaluated using a comprehensive set of subjective listening tests in which the resulting statistical analysis guided the selection process. At the highest bitrate of 128 kb/s for the coding of a signal supporting a 22.2 loudspeaker configuration, both of the selected technologies had performance of “Good” on the MUSHRA subjective quality scale. It is expected that the C+O and HOA Version II technologies will be merged into a unified architecture.
This MPEG-H 3D Audio Version II is expected to reach Draft International Standard by June 2015.

The 109th meeting also saw the technical completion of Version I of the MPEG-H 3D Audio standard, which is expected to become an International Standard by February 2015.

Public seminar for media synchronization planned for 110th MPEG meeting in October

A public seminar on Media Synchronization for Hybrid Delivery will be held on 22 October 2014 during the 110th MPEG meeting in Strasbourg. The purpose of this seminar is to introduce MPEG's activity on media stream synchronization for heterogeneous delivery environments, including hybrid environments employing both broadcast and broadband networks, based on existing MPEG systems technologies such as MPEG-2 TS, DASH, and MMT. The seminar will also strive to align MPEG's present and future projects in this area with the needs of users and industry use cases. The main topics covered by the seminar presentations include:

  • Hybrid Broadcast – Broadband distribution for UHD deployments and 2nd screen content
  • Inter Destination Media Synchronization
  • MPEG Standardization efforts on Time Line Alignment of media contents
  • Audio Fingerprint based Synchronization

You are invited to join the seminar to learn more about MPEG activities in this area and to work with us to further develop technologies and standards supporting new applications of rich and heterogeneous media delivery.
The seminar is open to the public and registration is free of charge.

First MMT Developers’ Day held at MPEG 109, second planned for MPEG 110

Following the recent finalization of the MPEG Media Transport standard (ISO/IEC 23008-1), MPEG hosted an MMT Developers' Day to better understand the rate of MMT adoption and to provide a channel for receiving comments from industry about the standard. During the event, four oral presentations were given: "Multimedia transportation technology and status in China", "MMT delivery considering bandwidth utilization", "Fast channel change/Targeted Advertisement insertion over hybrid media delivery", and "MPU Generator". In addition, seven demonstrations were presented: Reliable 4K HEVC Realtime Transmission using MMT-FEC, an MMT Analyzer, Applications of MMT Content through Broadcast, Storage, and Network Delivery, Media Delivery Optimization with the MMT Cache Middle Box, MMT-based Transport Technology for Advanced Services in Super Hi-Vision, Targeted Ad Insertion and Multi-view Content Composition in a Broadcasting System with MMT, and QoS Management for Media Delivery. MPEG is planning to host a second MMT Developers' Day during the 110th meeting on Wednesday, October 22nd.

Seminar at MPEG 109 introduces MPEG’s activity for Free Viewpoint Television

A seminar for FTV (Free Viewpoint Television) was held during the 109th MPEG meeting in Sapporo. FTV is an emerging visual media technology that will revolutionize the viewing of 3D scenes to facilitate a more immersive experience by allowing users to freely navigate the view of a 3D scene as if they were actually there. The purpose of the seminar was to introduce MPEG’s activity on FTV to interested parties and to align future MPEG standardization of FTV technologies with user and industry needs.

Digging Deeper – How to Contact MPEG

Communicating the large and sometimes complex array of technology that the MPEG Committee has developed is not a simple task. Experts, past and present, have contributed a series of tutorials and vision documents that explain each of these standards individually. The repository is growing with each meeting, so if something you are interested in is not yet there, it may appear shortly – but you should also not hesitate to request it. You can start your MPEG adventure at http://mpeg.chiariglione.org/

Further Information

Future MPEG meetings are planned as follows:

  • No. 110, Strasbourg, FR, 20 – 24 October 2014
  • No. 111, Geneva, CH, 16 – 20 February 2015
  • No. 112, Warsaw, PL, 22 – 26 June 2015

For further information about MPEG, please contact:

Dr. Leonardo Chiariglione (Convenor of MPEG, Italy)
Via Borgionera, 103
10040 Villar Dora (TO), Italy
Tel: +39 011 935 04 61
leonardo@chiariglione.org

or

Dr. Arianne T. Hinds
Cable Television Laboratories
858 Coal Creek Circle
Louisville, Colorado 80027 USA
Tel: +1 303 661 3419
a.hinds@cablelabs.com

The MPEG homepage also has links to other MPEG pages that are maintained by the MPEG subgroups. It also contains links to public documents that are freely available for download by those who are not MPEG members. Journalists that wish to receive MPEG Press Releases by email should contact one of the persons listed above.

Launching the first-ever National Data Science Bowl

What is the National Data Science Bowl?

Take a deep dive and see how tiny plants and animals fuel the world

We are pioneering a new language to understand our incredibly beautiful and complex world: a language that is forward-looking rather than retrospective, different from the words of historians and famed novelists. It is data science, and through it we have the power to use insights from our past to build an unprecedented future. We need your help building that future. The 2014/2015 National Data Science Bowl offers tremendous potential to modernize the way we understand and address a major environmental challenge: monitoring the health of our oceans.

Compete

ACM is a partner in the first-ever National Data Science Bowl, which launched on December 15.

This 90-day competition offers data scientists the chance to solve a critical problem facing our world’s oceans using the power of data.

Participants are challenged to examine nearly 100,000 underwater images to develop an algorithm that will enable researchers to monitor certain sea life at a speed and scale never before possible.

A total of $175,000 in prize money will be awarded to the top three individual contestants and the top academic team.

Report from SLAM 2014

ISCA/IEEE Workshop on Speech, Language and Audio in Multimedia

Following SLAM 2013 in Marseille, France, SLAM 2014 was the second edition of the workshop, held in Malaysia as a satellite of Interspeech 2014. The workshop was organized over two days, one for science and one for socializing and community building. With about 15 papers and 30 attendees, the highly risky second edition of the workshop showed the will to build a strong scientific community at the frontier of speech and audio processing, natural language processing and multimedia content processing.

The first day featured talks covering various topics related to speech, language and audio processing applied to multimedia data. Two keynotes from Shri Narayanan (University of Southern California) and Min-Yen Kan (National University of Singapore) nicely completed the program.
The second day took us on a tour of Penang, followed by a visit to the campus of Universiti Sains Malaysia, home of the local organizers. The tour offered plenty of opportunities to strengthen the links between participants and build a stronger community, as expected. Most participants later went to Singapore to attend Interspeech, the main conference in the domain of speech communication, where further discussions went on.

We hope to co-locate the next edition of SLAM with a multimedia conference such as ACM Multimedia in 2015. Stay tuned!

Slow Internet? – More bandwidth is not the answer

The RITE (Reducing Internet Transport Latency) EU project has, since its start nearly two years ago, worked to reduce the delay experienced when using the Internet. The approach is to make small, smart changes to the mechanisms that make Internet communication work. These mechanisms were developed to maximise throughput, but delay was overlooked until recently.

One of the tasks of the RITE project is to raise awareness about how to achieve faster response times on the Internet and to clear up misconceptions.

A common misconception is the idea that higher bandwidth means lower delay. This may be true if you only consider the case where a large file is downloaded (the download takes less time to complete), but there are many cases where other measures are needed to reduce the delay. It is not enough to simply buy the highest-capacity Internet connection. To clear this up, RITE has produced a video and educational material showing some of the biggest sources of Internet delay. The video also explains the difference between delay and bandwidth.
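As a rough illustration of why this is so, the time needed to fetch a small web object can be approximated as one round-trip time for the request plus the serialization delay of the response. The short C++ program below is a back-of-the-envelope sketch with assumed numbers (a 50 ms round-trip time and a 20 kB object); it is not taken from the RITE material:

// Back-of-the-envelope sketch (assumed numbers, not RITE material): fetch time for a
// small web object modelled as one RTT plus the serialization delay of the response.
#include <cstdio>

int main() {
    const double rtt_s = 0.050;         // assumed round-trip time: 50 ms
    const double object_bytes = 20e3;   // assumed object size: 20 kB
    const double bandwidths_bps[] = {10e6, 100e6, 1000e6};  // 10, 100, 1000 Mbit/s

    for (double bw : bandwidths_bps) {
        double transfer_s = (object_bytes * 8.0) / bw;      // serialization delay
        double total_ms = (rtt_s + transfer_s) * 1000.0;
        std::printf("%6.0f Mbit/s -> total %.1f ms (of which %.0f ms is the RTT)\n",
                    bw / 1e6, total_ms, rtt_s * 1000.0);
    }
    return 0;
}

Going from 10 Mbit/s to 1 Gbit/s reduces the total in this example only from about 66 ms to about 50 ms, because the round-trip time, not the bandwidth, dominates.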

The RITE project has produced educational material to go with the video. A Kahoot! quiz that allows students to compete, experiments that can be run from any computer, and a fact sheet with tips on how to reduce your Internet delay are among the available resources.

You can learn more about Internet delay and read about the project on the RITE website: http://www.riteproject.eu

Resources:

Take the Kahoot! quiz.
Educational material.

SSI: An Open Source Platform for Social Signal Interpretation

Introduction

Automatic detection and interpretation of social signals carried by voice, gestures, facial expressions, etc. will play a key role for next-generation interfaces, as it paves the way towards more intuitive and natural human-computer interaction. In this article we introduce Social Signal Interpretation (SSI), a framework for real-time recognition of social signals. SSI supports a large range of sensor devices, filter and feature algorithms, as well as machine learning and pattern recognition tools. It encourages developers to add new components using SSI's C++ API, but also addresses front-end users by offering an XML interface to build pipelines with a text editor. SSI is freely available under the GPL at openssi.net.

Key Features

The Social Signal Interpretation (SSI) framework offers tools to record, analyse and recognize human behaviour in real-time, such as gestures, facial expressions, head nods, and emotional speech. Following a patch-based design, pipelines are set up from autonomous components and allow the parallel and synchronized processing of sensor data from multiple input devices. In particular, SSI supports the machine learning pipeline in its full length and offers a graphical interface that assists users in collecting their own training corpora and obtaining personalized models. In addition to a large set of built-in components, SSI also encourages developers to extend the available tools with new functions. For inexperienced users an easy-to-use XML editor is available to draft and run pipelines without special programming skills. SSI is written in C++ and optimized to run on computer systems with multiple CPUs.

The key features of SSI include:

  • Synchronized reading from multiple sensor devices
  • General filter and feature algorithms, such as image processing, signal filtering, frequency analysis and statistical measurements in real-time
  • Event-based signal processing to combine and interpret high level information, such as gestures, keywords, or emotional user states
  • Pattern recognition and machine learning tools for on-line and off-line processing, including various algorithms for feature selection, clustering and classification
  • Patch-based pipeline design (C++-API or easy-to-use XML editor) and a plug-in system to integrate new components

SSI also includes wrappers for many popular sensor devices and signal processing libraries, such as the e-Health Sensor Shield, the IOM biofeedback system (Wild Divine), Microsoft Kinect, TheEyeTribe, Wii Remote Control, ARTKplus, FFMpeg, OpenCV, WEKA, Torch, DSPFilters, Fubi, Praat, OpenSmile, LibSox, and EmoVoice. To get SSI, please visit our download page.

 

Figure 1: Sketch summarizing the various tasks covered by SSI.

Framework Overview

Social Signal Interpretation (SSI) is an open source project meant to support the development of recognition systems using the live input of multiple sensors [1]. To this end, it offers a large variety of filter and feature algorithms to process captured signals, as well as tools to accomplish the full machine learning pipeline. Two types of users are addressed: developers are provided with a C++ API that encourages them to write new components, while front-end users can define recognition pipelines in XML from the available components.

Since social cues are expressed through a variety of channels, such as the face, voice, posture, etc., multiple kinds of sensors are required to obtain a complete picture of the interaction. In order to combine information generated by different devices, the raw signal streams need to be synchronized and handled coherently. SSI therefore establishes an architecture that treats diverse signals in a uniform way, whether it is a waveform, a heartbeat signal, or a video image.

 

Figure 2: Examples of sensor devices SSI supports.

Sensor devices deliver raw signals, which need to undergo a number of processing steps in order to carve out relevant information and separate it from noisy or irrelevant parts. To this end, SSI comes with a large repertoire of filter and feature algorithms to treat audiovisual and physiological signals. By putting processing blocks in series, developers can quickly build complex processing pipelines without having to care much about implementation details such as buffering and synchronization, which are handled automatically by the framework. Since processing blocks are allocated to separate threads, individual window sizes can be chosen for each processing step.

 

Figure 3: Streams are processed in parallel using tailored window sizes.

Since human communication does not follow the precise mechanisms of a machine, but is tainted with a high amount of variability, uncertainty and ambiguity, robust recognizers have to be built that use probabilistic models to recognize and interpret the observed behaviour. To this end, SSI assembles all tasks of a machine learning pipeline, including pre-processing, feature extraction, and online classification/fusion in real-time. Feature extraction converts a signal chunk into a set of compact features, keeping only the essential information necessary to classify the observed behaviour. Classification, finally, maps the observed feature vectors onto a set of discrete states or continuous values. Depending on whether a chunk is reduced to a single feature vector or remains a series of variable length, a statistical or a dynamic classification scheme is applied. Examples of both types are included in the SSI framework.

 

Figure 4: Support for statistical and dynamic classification schemes.
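To make this distinction more tangible, the following toy C++ sketch (our own illustration with hypothetical classifiers, not SSI's actual API) contrasts a statistical scheme, which collapses a chunk into a single summary feature, with a dynamic scheme, which works directly on the variable-length frame sequence:

// Toy sketch (hypothetical classifiers, not SSI's C++ API): a statistical scheme reduces a
// chunk to a single summary feature, a dynamic scheme keeps the temporal frame sequence.
#include <cstdio>
#include <cstddef>
#include <numeric>
#include <vector>

using Frame = std::vector<float>;  // features of one frame
using Chunk = std::vector<Frame>;  // variable-length series of frames

// Statistical scheme: collapse the chunk to one number (mean feature value) and threshold it.
const char* classifyStatistical(const Chunk& chunk) {
    double sum = 0.0;
    std::size_t n = 0;
    for (const Frame& f : chunk) {
        sum += std::accumulate(f.begin(), f.end(), 0.0);
        n += f.size();
    }
    return (n > 0 && sum / n > 0.5) ? "active" : "silent";
}

// Dynamic scheme: keep the temporal structure and look at how values evolve frame by frame.
const char* classifyDynamic(const Chunk& chunk) {
    std::size_t rising = 0;
    for (std::size_t i = 1; i < chunk.size(); ++i)
        if (chunk[i][0] > chunk[i - 1][0]) ++rising;
    return (2 * rising > chunk.size()) ? "rising" : "flat";
}

int main() {
    Chunk chunk = {{0.2f}, {0.4f}, {0.7f}, {0.9f}};  // a toy chunk of four one-dimensional frames
    std::printf("statistical: %s, dynamic: %s\n", classifyStatistical(chunk), classifyDynamic(chunk));
    return 0;
}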

To resolve ambiguity in human interaction, information extracted from diverse channels needs to be combined. In SSI, information can be fused at various levels: at the data level, e.g. when depth information is enhanced with colour information; at the feature level, when features of two or more channels are combined into a single feature vector; or at the decision level, when the probabilities of different recognizers are combined. In the latter cases, the fused information should represent the same moment in time. If this is not possible due to temporal offsets (e.g. a gesture followed by a verbal instruction), fusion has to take place at the event level. The preferred level depends on the type of information that is fused.

 

Figure 5: Fusion on feature, decision and event level.
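As a generic illustration of two of these levels (hypothetical helper functions, not SSI components), feature-level fusion concatenates the per-modality feature vectors into one vector for a single classifier, whereas decision-level fusion combines the class probabilities produced by separate recognizers:

// Generic illustration (hypothetical helpers, not SSI components) of feature-level
// versus decision-level fusion of an audio and a video modality.
#include <cstdio>
#include <cstddef>
#include <vector>

// Feature-level fusion: stack the per-modality feature vectors into a single vector.
std::vector<float> fuseFeatures(const std::vector<float>& audio_feat,
                                const std::vector<float>& video_feat) {
    std::vector<float> fused(audio_feat);
    fused.insert(fused.end(), video_feat.begin(), video_feat.end());
    return fused;
}

// Decision-level fusion: average the class probabilities of two separate recognizers.
std::vector<float> fuseDecisions(const std::vector<float>& audio_probs,
                                 const std::vector<float>& video_probs) {
    std::vector<float> fused(audio_probs.size());
    for (std::size_t i = 0; i < fused.size(); ++i)
        fused[i] = 0.5f * (audio_probs[i] + video_probs[i]);
    return fused;
}

int main() {
    std::vector<float> ff = fuseFeatures({0.1f, 0.9f}, {0.3f});         // one 3-dimensional vector
    std::vector<float> df = fuseDecisions({0.7f, 0.3f}, {0.5f, 0.5f});  // {0.6, 0.4}
    std::printf("feature-level dimension: %zu, decision-level P(class 0): %.2f\n", ff.size(), df[0]);
    return 0;
}

Event-level fusion, in contrast, would combine such decisions only after they have been turned into time-stamped events, as described above.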

XML Pipeline

In this article we will focus on SSI’s XML interface, which allows the definition of pipelines as plain text files. No particular programming skills or development environments are required. To assemble a pipeline any text editor can be used. However, there is also an XML editor in SSI, which offers special functions and simplifies the task of writing pipelines, e.g. by listing options and descriptions for a selected component.

 

Figure 6: SSI’s XML editor offers convenient access to available components (left panel). A component’s options can be directly accessed and edited in a special sheet (right panel).

To illustrate how XML pipelines are built in SSI, we will start off with a simple unimodal example. Let's assume we wish to build an application that converts a sound into a spectrum of frequencies (a so-called spectrogram). This is a typical task in audio processing, as many properties of speech are best studied in the frequency domain. The following pipeline captures sound from a microphone and transforms it into a spectrogram. Both the raw and the transformed signal are finally visualized.

<?xml version="1.0" ?>
<pipeline ssi-v="1">

        <register>  
                <load name="ssiaudio.dll"/>
                <load name="ssisignal.dll"/>
                <load name="ssigraphic.dll" />
        </register>

        <!-- SENSOR -->
        <sensor create="ssi_sensor_Audio" option="audio" scale="true">
                <provider channel="audio" pin="audio"/>
        </sensor>

        <!-- PROCESSING -->
        <transformer create="ssi_feature_Spectrogram" minfreq="100" maxfreq="5100" nbanks="50">
                <input pin="audio" frame="0.01s" delta="0.015s"/>
                <output pin="spect"/>
        </transformer>

        <!-- VISUALIZATION -->
        <consumer create="ssi_consumer_SignalPainter" name="audio" size="10" type="2">
                <input pin="audio" frame="0.02s"/>
        </consumer>
        <consumer create="ssi_consumer_SignalPainter" name="spectrogram" size="10" type="1">
                <input pin="spect" frame="1"/>
        </consumer>

</pipeline>

 

Since SSI uses a plugin system, components are loaded dynamically at runtime. Therefore, we load the required components by including the corresponding DLL files, grouped within the <register> element. In our case, three DLLs will be loaded: ssiaudio.dll, which includes components to read from an audio source; ssisignal.dll, which includes generic signal processing algorithms; and ssigraphic.dll, which includes tools for visualizing the raw and processed signals. The remaining part of the pipeline defines which components will be created and in which way they connect with each other.

Sensors, introduced with the keyword <sensor>, define the source of a pipeline. Here, we place a component with the name ssi_sensor_Audio, which encapsulates an audio input. Generally, components offer options that allow us to alter their behaviour. For instance, setting scale=true tells the audio sensor to deliver the wave signal as floating point values in the range [-1..1]; otherwise we would receive a stream of integers. By setting option=audio we further instruct the component to save its final configuration to a file named audio.option.

Next, we have to define which sources of a sensor we want to tap. A sensor offers at least one such channel. To connect a channel we use the <provider> statement, which has two attributes: channel is the unique identifier of the channel, and pin defines a freely selectable identifier that is later used to refer to the corresponding signal stream. Internally, SSI will now create a buffer and constantly write incoming audio samples to it. By default, it keeps the last 10 seconds of the stream. Connected components that read from the buffer receive a copy of the requested samples.

Components that connect to a stream, apply some processing, and output the result as a new stream are tagged with the keyword <transformer>. The new stream may differ from the input stream in sample rate and type; for example, a video image can be mapped to a single position value, or, as in our case, a one-dimensional audio stream at 16 kHz is mapped onto a multidimensional spectral representation with a lower rate, a so-called spectrogram. A spectrogram is created by calculating the energy in different frequency bins. The component in SSI that applies this kind of transformation is called ssi_feature_Spectrogram. Via the options minfreq, maxfreq, and nbanks, the frequency range in Hz, defined by [minfreq…maxfreq], and the number of bins are set. Options not set in the pipeline, such as the number of coefficients (nfft) and the type of window (wintype) used for the Fourier transformation, are initialized with their default values (unless indirectly overwritten by an option file loaded via option, as in the case of the audio sensor).

With the <input> tag we specify the input stream for the transformer. Since we want to read raw audio samples, we put audio, i.e. the pin we selected for the audio channel. Now we need to decide on the size of the blocks in which the input stream is processed. We do this through the attribute frame, which defines the frame hop, i.e. the number of samples by which the window is shifted after a read operation. Optionally, this window can be extended by a certain number of samples given by the attribute delta. In our case, we choose 0.01s and 0.015s, respectively, i.e. at each step a block of length 0.025 seconds is retrieved and the window is then shifted by 0.01 seconds. If we assume a sample rate of 16 kHz (the default rate of an audio stream), this converts to a block length of 400 samples (16 kHz * [0.01 s + 0.015 s]) and a frame shift of 160 samples (16 kHz * 0.01 s). In other words, at each calculation step 400 samples are copied from the buffer, and afterwards the read position is increased by 160 samples. Since the output is a single sample whose dimension equals the number of bins in the spectrogram, the sample rate of the output stream becomes 100 Hz (1/0.01 s). For the new, transformed stream, SSI creates another buffer, to which we assign a new pin spect that we wrap in the <output> element.

Finally, we want to visualize the streams. Components that read from a buffer but do not write back a new stream are tagged with the keyword <consumer>. To draw the current content of a buffer in a graph, we use an instance of ssi_consumer_SignalPainter that we connect to a stream pin within the <input> tag. To draw both the raw and the transformed stream, we add two instances and connect one to audio and one to spect. The option type allows us to choose an appropriate kind of visualization. In the case of the raw audio we set the frame length to 0.02s, i.e. the stream is updated every 20 milliseconds. In the case of the spectrogram we set 1 (no trailing s), which sets the update rate to a single frame.

Now, we are ready to run the pipeline by typing xmlpipe <filename> on the console (or hitting F5 if you use SSI’s XML editor). When running for the first time, a pop-up shows up, so we can select an input source. The choice will be remembered and stored in the file audio.option. The output should be something like:

 

Figure 7: Left top: plot of a raw audio signal. Right bottom: spectrogram, with low-energy bins shown in blue and high-energy bins in red. The console window provides information on the current state of the pipeline.

Sometimes, when pipelines become long, it is clearer to move important options out to a separate file. In the pipeline we mark those parts with $(<key>) and create a new file that includes statements of the form <key> = <value>. For instance, we could alter the spectrogram to:

<transformer create="ssi_feature_Spectrogram" minfreq="$(minfreq)" maxfreq="$(maxfreq)" nbanks="$(nbanks)">
        <input pin="audio" frame="0.01s" delta="0.015s"/>
        <output pin="spect"/>
</transformer>

and set the actual values in another file (while pipelines should end in .pipeline, config files should end in .pipeline-config):

minfreq = 100 # minimum frequency
maxfreq = 5100 # maximum frequency
nbanks = 100 # $(select{5,10,25,50,100}) number of bins

For convenience, SSI offers a small GUI named xmlpipeui.exe, which lists available options in a table and automatically parses new keys from a pipeline:

 

Figure 8: GUI to manage options and run pipelines with different configurations.

In SSI, events are the counterpart to streams. Unlike streams, which have a continuous nature, events may occur asynchronously and have a definite onset and offset. To demonstrate this feature we will extend our previous example and add an activity detector to drive the feature extraction, i.e. the spectrogram will be displayed only during times when there is activity in the audio.

To do so, we add two more components: another transformer (AudioActivity), which calculates the loudness (method=0) and sets values below a threshold to zero (threshold=0.1), and another consumer (ZeroEventSender), which picks up the result and looks for parts of the signal that are non-zero. If such a part is detected and it is longer than a second (mindur=1.0), an event is fired. To identify them, events are equipped with an address composed of an event name and a sender name: <event>@<sender>. In our example, the options ename and sname are used to set the address to activity@audio.

<!-- ACTIVITY DETECTION -->
<transformer create="ssi_feature_AudioActivity" method="0" threshold="0.1">
        <input pin="audio" frame="0.01s" delta="0.015s"/>
        <output pin="activity"/>
</transformer>
<consumer create="ssi_consumer_ZeroEventSender" mindur="1.0" maxdur="5.0" sname="audio" ename="activity">
        <input pin="activity" frame="0.1s"/>
</consumer>

 

We can now change the visualization of the spectrogram from continuous to event-triggered. To do so, we replace the attribute frame with listen=activity@audio. We also set the length of the graph to zero (size=0), which allows it to adapt its length dynamically to the duration of the event.

<consumer create="ssi_consumer_SignalPainter" name="spectrogram" size="0" type="1">
        <input pin="spect" listen="activity@audio" />
</consumer>

 

To display a list of current events in the framework, we also include an instance of the component called EventMonitor. Since it only reacts to events and is neither a consumer nor a transformer, it is wrapped with the keyword <object>. With the <listen> tag we determine which events we want to receive. By setting address=@ and span=10000 we configure the monitor to display any event from the last 10 seconds.

<object create="ssi_listener_EventMonitor" mpos="400,300,400,300">
        <listen address="@" span="10000"/>
</object>

The output of the new pipeline is shown below. Again, it contains continuous plots of the raw audio and the activity signal (left top). In the graph below, the triggered spectrogram is displayed, showing the result for the latest activity event (18.2s – 19.4s). This corresponds to the top entry in the monitor (right bottom), which also lists three previous events. Since activity events do not contain additional metadata, they have a size of 0 bytes. In the case of a classification event, for example, class names and probabilities are attached.

 

Figure 9: In this example activity detection has been added to drive the spectrogram (see graph between raw audio and spectrogram). After a period of activity, an event is fired, which triggers the visualization of the spectrogram. Past events are listed in the window below the console.

Multi-modal Enjoyment Detection

We will now move to a more complex application: multi-modal enjoyment detection. The system we focus on has been developed as part of the European FP7 project ILHAIRE (Incorporating Laughter into Human Avatar Interactions: Research and Experiments, see http://www.ilhaire.eu/). It combines the input of two sensors, a microphone and a camera, to predict the level of enjoyment of a user in real-time. In this context, we define enjoyment as an episode of positive emotion, indicated by visual and auditory cues such as smiles and voiced laughter. The level of enjoyment is determined on the basis of the frequency and intensity of these cues.

 

Figure 10: The more cues of enjoyment a user displays, the higher will be the output of the system.

Training data for tuning the detection models was recorded in several sessions in which three to four users had funny conversations; each session lasted about 1.5 hours. During the recordings each user was equipped with a headset and filmed with a Kinect and an HD camera. To allow the simultaneous recording of four users, the setup included several PCs synchronized over the network. The ability to keep pipelines that are distributed over several machines in sync makes it possible to create large multi-modal corpora with multiple users. In this particular case, the raw data captured by SSI amounted to roughly 4.78 GB per minute, including audio, Kinect body and face tracking, as well as HD video streams.

The following pipeline snippet connects to an audio and a Kinect sensor and stores the captured signals on disk. Note that the audio stream and the Kinect RGB video are muxed into a single file; therefore the audio is passed as an additional input source wrapped in an <xinput> element. An additional line is added at the top of the file to configure the <framework> to wait for a synchronization signal on port 1234.

<!-- SYNCHRONIZATION -->
<framework sync="true" sport="1234" slisten="true"/>

<!-- AUDIO SENSOR -->
<sensor create="ssi_sensor_Audio" option="audio" scale="true">
        <provider channel="audio" pin="audio"/>
</sensor>

<!-- KINECT SENSOR -->
<sensor create="ssi_sensor_MicrosoftKinect">
        <provider channel="rgb" pin="kinect_rgb"/>
        <provider channel="au" pin="kinect_au"/>
        <provider channel="face" pin="kinect_face"/>
</sensor>
<!-- STORAGE -->
<consumer create="ssi_consumer_FFMPEGWriter" url="rgb.mp4">
        <input pin="kinect_rgb" frame="1"/>
        <xinput size="1">
                <input pin="audio"/>
        </xinput>
</consumer>
<consumer create="ssi_consumer_FileWriter" path="au">
        <input pin="kinect_au" frame="5"/>
</consumer>
<consumer create="ssi_consumer_FileWriter" path="face">
        <input pin="kinect_face" frame="5"/>
</consumer>

 

Based on the audiovisual content, raters were asked to annotate audible and visual cues of laughter in the recordings. Afterwards, features are extracted from the raw signals and each feature vector is labelled according to the annotation tracks. The labelled feature vectors serve as input for a learning phase, during which a separation of the feature space is sought that allows a good segregation of the class labels. For example, a feature measuring the extension of the lip corners may correlate with smiling and hence be picked as an indicator of enjoyment. Since no definite mapping exists for such complex recognition tasks, numerous approaches have been proposed to solve this problem. SSI includes several well-established learning algorithms, such as K-Nearest Neighbours, Gaussian Mixture Models, and Support Vector Machines. These algorithms are part of SSI's machine learning library, which also provides tools to simulate (parts of) pipelines in a best-effort manner and to evaluate models in terms of expected recognition accuracy.

   

Figure 11: Manual annotations of the enjoyment cues are used to train detection models for each modality.

The following C++ code snippet gives an impression of how learning is accomplished using SSI's machine learning library. First, the raw audio file ("user1.wav") and an annotation file ("user1.anno") are loaded. Next, the audio stream is converted into a list of samples to which feature extraction is applied. Finally, a model is trained on those samples. To see how well the model performs on the training data, an additional evaluation step is added.

// read audio
ssi_stream_t stream;
WavTools::ReadWavFile ("user1.wav", stream);

// read annotation
Annotation anno;
ModelTools::LoadAnnotation (anno, "user1.anno");

// create samples
SampleList samples;
ModelTools::LoadSampleList (samples, stream, anno, "user1");

// extract features
SampleList samples_t;
EmoVoiceFeat *ev = ssi_create (EmoVoiceFeat, "ev", true);
ModelTools::TransformSampleList (samples, samples_t, *ev);

// create model
IModel *svm = ssi_create (SVM, "svm", true);
Trainer trainer (svm);

// train and save
trainer.train (samples_t);
trainer.save ("model");

// evaluation
Evaluation eval;
eval.evalKFold (trainer, samples_t, 10);
eval.print ();

 

After the learning phase, the pre-trained classification models are ready to be plugged into a pipeline. To accomplish the pipeline at hand, two models were trained: one to detect audible laughter cues (e.g. laughter bursts) in the voice, and one to detect visual cues (e.g. smiles) in the face; these are now connected to the corresponding feature extraction components. Activity detection is applied to decide whether a frame contains sufficient information to be included in the classification process. For example, if no face is detected in the current image or if the energy of an audio chunk is too low, the frame is discarded. Otherwise, classification is applied and the result is forwarded to the fusion component. The first steps in the pipeline are stream-based, i.e. signals are continuously processed over a fixed-length window. Later components, responsible for cue detection and fusion, are event-based processes, applied only where the signals carry information relevant to the recognition process. To derive a final decision, both types of cues are finally combined.

The pipeline is basically an extension of the recording pipeline described earlier, which includes additional processing steps. To process the audio stream, the following snippet is added:

<!-- VOCAL ACTIVITY DETECTION -->
<transformer create="ssi_feature_AudioActivity" threshold="0.025">
        <input pin="audio" frame="19200" delta="28800"/>
        <output pin="voice_activity"/>
</transformer>      

<!-- VOCAL FEATURE EXTRACTION -->
<transformer create="ssi_feature_EmoVoiceFeat">
        <input pin="audio" frame="19200" delta="28800"/>
        <output pin="audio_feat"/>
</transformer>

<!-- VOCAL LAUGTHER CLASSIFICATION -->
<consumer create="ssi_consumer_Classifier" trainer="models\voice" sname="laughter" ename="voice">
        <input pin="audio_feat" frame="1" delta="0" trigger="voice_activity"></input>
</consumer>

Video processing is accomplished analogously:

<!-- FACIAL ACTIVITY DETECTION -->
<transformer create="ssi_feature_MicrosoftKinectFAD" minfaceframes="10">
        <input pin="kinect_face" frame="10" delta="15"/>
        <output pin="face_activity"/>
</transformer>      

<!-- FACIAL FEATURE EXTRACTION -->
<transformer create="ssi_feature_MicrosoftKinectAUFeat">
        <input pin="kinect_au" frame="10" delta="15"/>
        <output pin="kinect_au_feat"/>
</transformer>

<!-- FACIAL LAUGHTER CLASSIFICATION -->
<consumer create="ssi_consumer_Classifier" trainer="models\face" sname="laughter" ename="face">
        <input pin="kinect_au_feat" frame="1" delta="0" trigger="face_activity"></input>         
</consumer>

 

Obviously, both snippets share a very similar structure, though different components are loaded and the frame/delta sizes are adjusted to fit the sample rates. Note that this time the trigger stream (voice_activity/face_activity) is directly applied via the keyword trigger in the <input> section of the classifier. The pre-trained models for detecting cues in the vocal and facial feature streams are loaded from file via the trainer option. The cue probabilities are finally combined via vector fusion:

<object create="ssi_listener_VectorFusionModality" ename="enjoyment" sname="fusion"
        update_ms="400" fusionspeed="1.0f" gradient="0.5f" threshold="0.1f" >
        <listen address="laughter@voice,face"/>
</object>

 

The core idea of vector-based fusion is to treat detected events (laughter cues in our case) as independent vectors in a single- or multidimensional event space and to derive a final decision by aggregating the vectors while taking temporal relationships into account (the influence of an event is reduced over time) [2]. In contrast to standard segment-based fusion approaches, which force a decision in all modalities at each fusion step, it is decided individually if and when a modality contributes. The following animation illustrates this, with green dots representing detected cues, whereas red dots mean that no cue was detected. Note that the final fusion decision – represented by the green bar on the right – grows when cues are detected and afterwards shrinks again as no new input is added:

 

Figure 12: The pre-trained models are applied to detect enjoyment cues at run-time (green dots). The more cues are detected across modalities, the higher is the output of the fusion algorithm (green bar).
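To make the decay-and-aggregate behaviour more concrete, here is a minimal, self-contained sketch (our own simplified illustration with made-up numbers and a simple linear decay; it is not the actual VectorFusionModality implementation and ignores the multidimensional event space):

// Minimal sketch of the decay-and-aggregate idea (simplified illustration, not SSI's
// VectorFusionModality): each detected cue contributes its weight, contributions fade
// over time, and the fused level is capped at 1.0.
#include <cstdio>
#include <vector>

struct Cue {
    double time;    // when the cue was detected (seconds)
    double weight;  // detection confidence in [0..1]
};

double fuse(const std::vector<Cue>& cues, double now, double gradient) {
    double level = 0.0;
    for (const Cue& c : cues) {
        if (c.time > now) continue;                              // cue has not occurred yet
        double decayed = c.weight - gradient * (now - c.time);   // linear decay as a stand-in
        if (decayed > 0.0) level += decayed;
    }
    return level > 1.0 ? 1.0 : level;
}

int main() {
    std::vector<Cue> cues = {{0.5, 0.6}, {1.0, 0.8}, {1.5, 0.7}};  // three detected laughter cues
    for (double t = 0.0; t <= 4.0; t += 0.5)
        std::printf("t=%.1fs  fused level=%.2f\n", t, fuse(cues, t, 0.5));
    return 0;
}

In this toy run the fused level rises while the three cues arrive between 0.5 s and 1.5 s and then falls back towards zero, mirroring the behaviour of the green bar in the animation above.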

The following video clip demonstrates the detection pipeline in action. The input streams are visualized on the left (top: video stream with face tracking; bottom: raw audio stream and activity plot). Laughter cues detected in the two modalities are shown in two separate bar plots on top of each other. The result of the final multi-modal laughter detection is shown in the bar plot on the far right.

Conclusion

In this article we have introduced Social Signal Interpretation (SSI), a multi-modal signal processing framework. We have presented the basic concepts of SSI and demonstrated, by means of two examples, how SSI allows users to quickly set up processing pipelines in XML. While in this article we have focused on how to build applications from the available components, a simple plugin system offers developers the possibility to extend the pool of available modules with new components. By sharing these within the multimedia community, everyone is encouraged to enrich the functionality of SSI in the future. Readers who would like to learn more about SSI or obtain free access to the source code are invited to visit http://openssi.net.

Future Work

So far, applications developed with SSI target desktop machines, possibly distributed over several such machines within a network. Although wireless sensor devices are becoming more and more popular and offer some degree of mobility, it is not possible to monitor subjects outside a certain radius unless they take a desktop computer with them. Smartphones and similar pocket computers can help to overcome this limitation. In a pilot project, a plugin for SSI has been developed that streams audiovisual content and other sensor data in real-time over a wireless LAN connection from a mobile device running Android to an SSI server. The data is analysed on the fly on the server and the result is sent back to the mobile device. Such scenarios give ample scope for new applications that make it possible to follow a user "in the wild". In the CARE project (a sentient Context-Aware Recommender System for the Elderly), a recommender system is currently being developed that assists elderly people living alone in their home environment in real-time by recommending, depending on the situation, physical, mental and social activities aimed at helping senior citizens become self-confident, independent and active in everyday life again.

Acknowledgements

The work described in this article is funded by the European Union under the research grants CEEDs (FP7-ICT-2009-5) and TARDIS (FP7-ICT-2011-7), and by ILHAIRE, a Seventh Framework Programme (FP7/2007-2013) project under grant agreement n°270780.

Nicolas D. Georganas Best Paper Award

The ACM Transactions on Multimedia Computing, Communications and Applications (TOMM), 2014

The 2014 ACM Transactions on Multimedia Computing, Communications and Applications (TOMM) Nicolas D. Georganas Best Paper Award is presented to the paper “A framework for network aware caching for video on demand systems” (TOMM vol. 9, Issue 4) by Bogdan Carbunar, Rahul Potharaju, Michael Pearce, Venugopal Vasudevan and Michael Needham.

The purpose of this award is to recognize the most significant work published in ACM TOMM (formerly TOMCCAP) in a given calendar year. The whole readership of ACM TOMM was invited to nominate articles published in Volume 9 (2013). Based on the nominations, the winner was chosen by the TOMM Editorial Board. The main assessment criteria were quality, novelty, timeliness and clarity of presentation, in addition to relevance to multimedia computing, communications, and applications.

The winning paper examines caching strategies for video-on-demand solutions using log traces collected from Motorola equipment deployed at several Comcast sites. The authors propose several fundamental metrics for characterizing video-on-demand CDN architectures and contribute several caching strategies based on observations extracted from real-world data. The superior performance of the proposed solutions makes the work highly relevant both for video-on-demand providers (e.g., to improve caching strategies) and for academics (e.g., for designing realistic simulators using the characterizations outlined in the paper).

The award honors the founding Editor-in-Chief of TOMM, Nicolas D. Georganas, for his outstanding contributions to the field of multimedia computing and his significant contributions to ACM. He profoundly influenced research and the whole multimedia community.

The Editor-in-Chief Prof. Dr.-Ing. Ralf Steinmetz and the Editorial Board of ACM TOMM cordially congratulate the winner. The award will be presented on November 4th 2014 at the ACM Multimedia 2014 in Orlando, Florida.

Bios of the Award Recipients:

Bogdan Carbunar is an assistant professor in the School of Computing and Information Sciences at the Florida International University. Previously, he held various researcher positions within the Applied Research Center at Motorola.  His research interests include distributed systems, security and applied cryptography.  He holds a Ph.D. in Computer Science from Purdue University.

Rahul Potharaju is an Applied Scientist at Microsoft. Before that, he obtained his PhD degree in Computer Science from Purdue University and his Master's degree in Computer Science from Northwestern University. Rahul is passionate about transforming big data into actionable insights and building large-scale data-intensive systems, with a particular interest in analytics-as-a-service clouds and automated problem inference systems. He is a recipient of the Motorola Engineering Excellence award in 2009 and the Purdue Diamond Award in 2014. His research has been adopted by several business groups inside Microsoft and won the Microsoft Trustworthy Reliability Computing Award in 2013.

Michael Pearce attended Iowa State University and the University of Illinois at Chicago. He is a Distinguished Member of Technical Staff at Motorola Solutions and a Motorola Solutions Science Advisory Board Associate. He is currently leading a team investigating hybrid cloud computing, data analytics, and dynamic fault tolerant distributed systems.

Venu Vasudevan is Senior Director of the Multi-Screen Media & Analytics team at Arris Advanced Technologies, and Adjunct Assistant Professor at Rice University’s Department of Electrical and Computer Engineering. He leads research efforts on media delivery architectures for advancing television and multi-screen platforms with specific emphasis on the applications of sensing, media analytics and data science to interactive TV, content discovery and advanced advertising. He received a Ph.D. in Computer Science from The Ohio State University and a B.S. from the Indian Institute of Technology in Electrical and Computer Engineering. Dr. Vasudevan has served as a speaker and panelist at industry events such as Yankee Group, Digital Hollywood and the Pelorus Group summits. He also serves on several NSF panels on Small Business Innovation. He has published more than 50 conference, journal and book publications and has over ten issued patents. He won the best paper award at IEEE Percom 2009 and received recognition from the Hawaii International Conference and the ACM Convergence on Small and Personal Computers.

Michael Needham is a Principal Engineer for Roberson and Associates. Before that, he was a researcher with Motorola for more than 20 years. He has extensive knowledge and experience in a broad range of technologies related to the telecommunications and video media industries, with very strong analytical and communication skills. He has received 25 U.S. patents and 2 best paper awards.

Announcement of ACM SIGMM Rising Star Award 2014

ACM Special Interest Group on Multimedia (SIGMM) is pleased to present the SIGMM Rising Star Award in its inaugural year to Dr. Meng Wang.

This new award of ACM SIGMM recognizes a young researcher who has made outstanding research contributions to the field of multimedia computing, communication and applications during the early part of his or her career. Dr. Meng Wang has made such significant contributions in the areas of media tagging and tag processing as well as multimedia accessibility research.

In media tagging and tag processing, Dr. Wang has made a wide range of contributions, from active learning and multi-graph learning to semi-supervised learning in media tagging. The work “Unified Video Annotation via Multigraph Learning”, published in IEEE TCSVT (2009), is one of the first approaches able to adaptively fuse multimodal information sources. The work has been extensively followed in the multimedia and computer vision areas, and has received over 220 Google citations. His paper “Assistive Tagging: A Survey of Multimedia Tagging with Human-Computer Joint Exploration”, published in ACM Computing Surveys (2012), covers a range of assistive techniques to support media tagging. This work organizes promising approaches into several key topics and presents a definitive taxonomy covering related research under each topic. It points to key areas of research already conducted as well as new fruitful areas yet to be explored, and provides an excellent guide for any researcher interested in this important subject.

In multimedia accessibility research, Dr. Wang developed a variety of innovative techniques, including image and video recoloring, search result filtering, and color indication, to help color-blind users access and understand color images and videos. Moreover, he worked on dynamic captioning, which appeared in ACM Multimedia 2010 and received the best paper award. The work aims to help hearing-impaired audiences better understand video stories. It also improves on conventional static captions in many ways, such as optimizing caption locations around speaking faces, progressively highlighting scripts, and adding a visualization of the audio volume to improve the overall user experience.

Bio of Awardee

Dr. Meng Wang is a Professor at the Hefei University of Technology, China. He received the B.E. degree from the Special Class for the Gifted Young and the Ph.D. degree from the Department of Electronic Engineering and Information Science of the University of Science and Technology of China (USTC), Hefei, China, in 2003 and 2008, respectively. He previously worked as an associate researcher at Microsoft Research Asia and then as a core member of a startup in Silicon Valley. After that, he worked at the National University of Singapore as a senior research fellow. His current research interests include multimedia content analysis, search, mining, recommendation, and large-scale computing. He has authored more than 150 manuscripts, including book chapters, journal and conference papers published in TMM, TOMCCAP, TCSVT, ACM MM, and SIGIR. He received best paper awards successively from the 17th and 18th ACM International Conferences on Multimedia, the best paper award from the 16th International Conference on Multimedia Modeling, the best paper award from the 4th International Conference on Internet Multimedia Computing and Service, and the best demo award from the 20th ACM International Conference on Multimedia. He is a member of IEEE and ACM.

SIGMM Award for Outstanding Technical Contributions to Multimedia Computing, Communications and Applications

The 2014 winner of the prestigious ACM Special Interest Group on Multimedia (SIGMM) award for Outstanding Technical Contributions to Multimedia Computing, Communications and Applications is Prof. Dr. Klara Nahrstedt.

Klara Nahrstedt is a leading researcher in multimedia systems. She has made seminal contributions in QoS management for distributed multimedia systems. Her pioneering work on QoS brokerage, with QoS translation, QoS negotiation and QoS adaptation services placed between the application and transport layers to enable end-to-end QoS contracts, changed the way multimedia end-system architectures are designed and built. This result was published as the “QoS Broker” in 1995. Her novel QoS adaptation extended this work by modeling the end-to-end QoS problem with a control-theoretical approach. This work gained wide recognition as the first use of control theory in multimedia systems and received the Leonard C. Abraham Paper Award from the IEEE Communications Society.

Addressing the end-to-end QoS management problem, she made fundamental contributions to QoS routing. In her 1999 JSAC paper “Distributed Quality of Service Routing in Ad-hoc Networks” she derived a distributed, time- and bandwidth-sensitive routing scheme for dynamic multi-hop mobile environments. Her IEEE Network Magazine paper, “An Overview of Quality-of-Service Routing for the Next Generation High-Speed Networks: Problems and Solutions”, received the “Best Tutorial Paper” Award from the IEEE Communications Society in 1999 and is still highly relevant today.

She has made seminal contributions to multimedia wireless networks. Her novel pricing scheme for ad hoc networks, which applies pricing to a clique of mutually interfering nodes rather than to individual connections as was done in wired networks, as well as her results on cross-layer QoS approaches, including bandwidth and delay management, found wide acceptance and acknowledgement in industry.

She has made seminal contributions in the area of multimedia scheduling for mobile devices. Her fundamental work on energy-efficient dynamic soft-real-time CPU scheduling for mobile multimedia devices, and her development of the first energy-efficient OS for mobile multimedia devices, GRACE-OS, have been widely recognized in academia and industry.

She leads the 3D tele-immersive systems and networking field. She was the first to develop a multi-view 3D video adaptation framework for bandwidth management and view-casting protocols for multi-view 3D video. She has developed new metrics for 3D immersive video and the first comprehensive framework, based on sound theoretical underpinnings, for Quality of Experience in Distributed Interactive Multimedia Environments. This work received the Best Student Paper Award at the premier SIGMM conference, ACM Multimedia 2011, and her PhD student received the SIGMM 2012 Best PhD Thesis Award as a result.

Her two textbooks, Multimedia Systems (with R. Steinmetz, Springer-Verlag, 2004) and Multimedia Computing, Communications and Applications (with R. Steinmetz, Prentice Hall, 1995), are among the world's most widely used textbooks on multimedia technology. They present the entire field of multimedia technology in a comprehensive manner.

Prof. Nahrstedt’s research leadership has translated into several awards including the 2009 Humboldt Fellow Research Award, the 2012 IEEE Computer Society Technical Achievement Award, the 2013 ACM Fellow recognition and the 2014 induction into the German National Academy of Sciences.

In summary, Prof. Nahrstedt’s accomplishments include her pioneering and extraordinary contributions in quality of service for multimedia systems and networking and her visionary leadership of the computing community.

The award will be presented on November 5th 2014 at the ACM Multimedia Conference in Orlando, Florida.

ACM SIGMM/TOMM 2014 Award Announcements

The ACM Special Interest Group in Multimedia (SIGMM) and ACM Transactions on Multimedia Computing, Communications and Applications (TOMM) are pleased to announce the following awards for 2014 recognizing outstanding achievements and services made in the multimedia community.

SIGMM Technical Achievement Award:
Dr. Klara Nahrstedt

SIGMM Rising Star Award:
Dr. Meng Wang

SIGMM Best Ph.D. Thesis Award:
Dr. Zhigang Ma

TOMM Nicolas D. Georganas Best Paper Award:
"A Framework for Network Aware Caching for Video on Demand Systems" by Bogdan Carbunar, Rahul Potharaju, Michael Pearce, Venugopal Vasudevan and Michael Needham, published in TOMM, vol. 9, Issue 4, 2013.

TOMM Best Associate Editor Award:
Dr. Mohamed Hefeeda

Additional information of each award and recipient is available on the SIGMM web site.

http://www.sigmm.org/

Awards will be presented in the annual SIGMM event, ACM Multimedia Conference, held in Orlando, Florida, USA during November 3-7, 2014.

ACM is the professional society of computer scientists, and SIGMM is its special interest group on multimedia. TOMM (formerly TOMCCAP) is the flagship journal publication of SIGMM.

SIGMM Outstanding Ph.D. Thesis Award 2014

ACM Special Interest Group on Multimedia (SIGMM) is pleased to present the 2014 SIGMM Outstanding Ph.D. Thesis Award to Dr. Zhigang Ma.

The award committee considered Dr. Ma's dissertation, entitled “From Concepts to Events: A Progressive Process for Multimedia Content Analysis”, worthy of this recognition, as the proposed framework, grounded in mathematical theory, has great potential for developing real-world applications as well as for addressing myriad technical challenges.

The fundamental innovations presented in Dr. Ma’s thesis consist of

  1. feature selection through subspace sparsity, which leads to greatly improved accuracy with a compact representation,
  2. semi-supervised learning with joint feature selection, allowing the exploitation of massive amounts of unlabeled data with only a few labeled examples,
  3. multimedia event detection by learning an intermediate representation, and
  4. knowledge adaptation for multimedia event detection when only very few examples are available.

Despite the variety of problems addressed, these innovations are based on a unified machine learning framework, which is applicable to diverse application domains. The proposed solutions have been proven to be effective and general through a large set of experiments on a variety of challenging data sets, including personal photos, web images, consumer videos, YouTube-style Internet video corpora, health care surveillance data, and 3D human motion data.

Bio of Awardee

Dr. Zhigang Ma received the B.Sc. and M.Sc. degrees from Zhejiang University, China, in 2004 and 2006, respectively, and the Ph.D. degree from University of Trento, Italy, in 2013. The title of his thesis is “From Concepts to Events: A Progressive Process for Multimedia Content Analysis”. He is currently a Postdoctoral Research Fellow with the School of Computer Science, Carnegie Mellon University, Pittsburgh, USA. His research interest is mainly in multimedia analysis using machine learning techniques. He received the best PhD thesis award from Gruppo Italiano Ricercatori in Pattern Recognition, Italy, in 2014. He was a PC member for ACM MM 2014 and a TPC member for ICME 2014.