Author: Pascal Mettes
Supervisor(s) and Committee member(s): Arnold W.M. Smeulders (promotor), Cees G.M. Snoek (co-promotor)
URL: https://dare.uva.nl/search?identifier=d27fd9d1-095b-418d-9c1d-263afd1da3e1
ISBN: 978-94-6182-851-4
This thesis investigates the role of objects in the spatio-temporal recognition of activities in videos. We investigate what, when, and where specific activities occur in visual content by examining object representations, centered on the main question: what do objects tell us about the extent of activities in visual space and time? The thesis presents six works on this topic.
First, the spatial extent of activities is investigated using objects and their parts. In a part-based setting, we hypothesize that parts coming from the context around activities, such as interacting objects, and parts from other activities improve recognition. We analyze this hypothesis by tracing back where the selected parts come from and find that part selection should not be done separately for each activity, but should instead be shared and optimized over all activities. This leads us to conclude that the spatial extent of activities goes beyond the activity itself to include its surrounding context and even parts of other activities.
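To make the contrast concrete, the following Python sketch, purely illustrative and not the thesis' selection algorithm, compares per-activity part selection with a selection shared over all activities, assuming a hypothetical matrix of part discriminativeness scores.

    # Minimal sketch (not the thesis' exact algorithm): contrast per-activity
    # part selection with selection that is shared across all activities.
    # Assumes a hypothetical score matrix `part_scores` of shape
    # (num_parts, num_activities) holding how discriminative each candidate
    # part is for each activity.
    import numpy as np

    rng = np.random.default_rng(0)
    part_scores = rng.random((1000, 20))   # 1000 candidate parts, 20 activities
    budget = 50                            # number of parts we can keep

    # Per-activity selection: each activity keeps its own top parts.
    per_activity = {a: np.argsort(part_scores[:, a])[-budget:]
                    for a in range(part_scores.shape[1])}

    # Shared selection: one part set optimized over all activities at once,
    # here greedily, by total discriminative value across activities.
    shared = np.argsort(part_scores.sum(axis=1))[-budget:]

    # Parts useful to several activities (e.g., interacting objects in the
    # surrounding context) rank high in the shared set even if no single
    # activity would have selected them on its own.
    print("parts in the shared set not picked by activity 0:",
          len(set(shared) - set(per_activity[0])))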
Second, over two works, it is investigated whether activities exhibit different object preferences over time and which objects matter for representing activities. A video activity such as a birthday party consists of many temporal fragments, for example cutting a cake, singing, and unwrapping presents. The fragments are all connected to the activity, but each has different object preferences. We propose a video representation dubbed bag-of-fragments, which incorporates the presence of multiple fragments in activity videos. Experimentally, we show that using multiple fragments with different object preferences aids activity recognition. To find a suitable object representation for frames and fragments, we investigate how to leverage the complete ImageNet hierarchy for pre-training deep networks. Reorganizing the ImageNet tree yields deep networks that have a clear positive effect on the recognition of activities, indicating the importance of employing the right set of objects and the right hierarchical level for activities.
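As a rough illustration of the idea, and not the exact bag-of-fragments formulation from the thesis, the sketch below assumes per-frame object probabilities from a pre-trained network, splits a video into temporal fragments, and keeps the strongest fragment response per object as the video descriptor.

    # Minimal sketch: a video is a sequence of per-frame object-probability
    # vectors (e.g., from an ImageNet-pretrained network), cut into temporal
    # fragments; each fragment gets its own object profile, and the video
    # descriptor keeps the strongest fragment response per object.
    import numpy as np

    def bag_of_fragments(frame_object_scores: np.ndarray, num_fragments: int) -> np.ndarray:
        """frame_object_scores: (num_frames, num_objects) array."""
        fragments = np.array_split(frame_object_scores, num_fragments, axis=0)
        # Average within a fragment: its object preferences.
        fragment_profiles = np.stack([f.mean(axis=0) for f in fragments])
        # Max over fragments: a fragment about 'cake' and one about 'gift'
        # can both contribute to a 'birthday party' video descriptor.
        return fragment_profiles.max(axis=0)

    rng = np.random.default_rng(0)
    video = rng.random((300, 1000))            # 300 frames, 1000 object scores each
    descriptor = bag_of_fragments(video, num_fragments=10)
    print(descriptor.shape)                    # (1000,)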
Third, the full spatio-temporal extent of activities is investigated, where over three works the extensive annotation burden of action localization is progressively reduced. An accepted standard in activity localization is to use action proposals at test time and select the best one with a classifier trained on carefully annotated bounding boxes. We first propose to annotate actions in video with points instead of boxes. We introduce an overlap measure and an extended Multiple Instance Learning algorithm to exploit point supervision. We show that training on proposals guided by a few point annotations performs as well as training on box annotations, while being much faster to annotate. We extend this work by replacing manual point annotations with pseudo-annotations: automatic annotations derived from visual cues such as objects. When using spatio-temporal proposals, pseudo-annotations work as well as boxes and points, resulting in effective activity localization using only video labels as activity annotations. In the last work, we take the link between activities and objects to its logical extreme by examining the spatial relations of objects for action localization without any training examples. To arrive at spatial awareness, we build our embedding on top of freely available actor and object detectors. The relevance of objects is determined in a word embedding space and further enforced with estimated spatial preferences. Experimental evaluation shows that activity localization without training examples is possible when jointly embedding actors, objects, and their spatial relations over time.
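The sketch below illustrates, under simplified assumptions, how point annotations can guide proposal selection: proposals are scored by how many annotated points they cover, and the best-covered proposal in a positive video acts as the training instance, in a Multiple Instance Learning spirit. The overlap measure used in the thesis is more refined; the code only conveys the mechanism, with hypothetical data structures for proposals and points.

    # Minimal sketch of the idea behind point supervision (the thesis'
    # exact overlap measure differs). A proposal is a dict frame -> box,
    # a point annotation is a dict frame -> (x, y).

    def inside(box, point):
        x1, y1, x2, y2 = box
        px, py = point
        return x1 <= px <= x2 and y1 <= py <= y2

    def point_overlap(proposal, points):
        """Fraction of annotated points covered by the proposal."""
        hits = sum(1 for f, p in points.items()
                   if f in proposal and inside(proposal[f], p))
        return hits / max(len(points), 1)

    def best_proposal(proposals, points):
        # MIL-style selection: among all candidate tubes in a positive video,
        # keep the one that agrees most with the cheap point annotations.
        return max(proposals, key=lambda prop: point_overlap(prop, points))

    # Toy example: two proposals, three annotated points on three frames.
    proposals = [
        {0: (0, 0, 50, 50), 1: (5, 5, 55, 55), 2: (10, 10, 60, 60)},
        {0: (100, 100, 150, 150), 1: (100, 100, 150, 150)},
    ]
    points = {0: (20, 20), 1: (30, 30), 2: (120, 120)}
    print(point_overlap(proposals[0], points))               # 2/3
    print(best_proposal(proposals, points) is proposals[0])  # True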
The works of this thesis lead to the conclusion that objects provide valuable information about the presence and spatio-temporal extent of activities in videos.
Intelligent Sensory Information Systems group
URL: https://ivi.fnwi.uva.nl/isis/
The world is full of digital images and videos. In this deluge of visual information, the grand challenge is to unlock its content. This quest is the central research aim of the Intelligent Sensory Information Systems group. We address the complete knowledge chain of image and video retrieval by machine and human. Topics of study are semantic understanding, image and video mining, interactive picture analytics, and scalability. Our research strives for automation that matches human visual cognition, interaction surpassing man and machine intelligence, visualization blending it all into interfaces giving instant insight, and database architectures for extreme-sized visual collections. Our research culminates in state-of-the-art image and video search engines, which we evaluate in leading benchmarks, often as the best performer, in user studies, and in challenging applications.