Section: New Results
Action Recognition in Videos
Participants: Piotr Bilinski, François Brémond.
The aim of this work is to learn and recognize short human actions in videos. We perform an extensive evaluation of local spatio-temporal descriptors and then propose a new action recognition approach for RGB camera videos, as well as a new approach for RGB-D cameras. For all our experiments, we develop an evaluation framework based on the bag-of-words model, Support Vector Machines and cross-validation: actions in videos are represented with the bag-of-words model, and classification is performed with non-linear multi-class Support Vector Machines under a leave-one-person-out cross-validation protocol.
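A minimal sketch of such an evaluation framework, assuming pre-computed local spatio-temporal descriptors and using scikit-learn; the data layout (one descriptor matrix, action label and person identifier per video) and the function names are illustrative assumptions, not the published implementation:

    import numpy as np
    from sklearn.cluster import KMeans
    from sklearn.model_selection import LeaveOneGroupOut
    from sklearn.svm import SVC

    def bow_histogram(descriptors, codebook):
        # Quantize local descriptors against the codebook and build a
        # normalized bag-of-words histogram.
        words = codebook.predict(descriptors)
        hist = np.bincount(words, minlength=codebook.n_clusters).astype(float)
        return hist / max(hist.sum(), 1.0)

    def evaluate(videos, codebook_size=1000):
        # videos: list of (descriptor_matrix, action_label, person_id) tuples.
        # For brevity the codebook is learned on all descriptors; a strict
        # protocol would rebuild it from the training folds only.
        all_desc = np.vstack([d for d, _, _ in videos])
        codebook = KMeans(n_clusters=codebook_size).fit(all_desc)

        X = np.array([bow_histogram(d, codebook) for d, _, _ in videos])
        y = np.array([label for _, label, _ in videos])
        groups = np.array([person for _, _, person in videos])

        # Leave-one-person-out: each fold holds out all videos of one person.
        scores = []
        for train_idx, test_idx in LeaveOneGroupOut().split(X, y, groups):
            clf = SVC(kernel="rbf")  # non-linear multi-class SVM (one-vs-one)
            clf.fit(X[train_idx], y[train_idx])
            scores.append(clf.score(X[test_idx], y[test_idx]))
        return np.mean(scores)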
Local spatio-temporal descriptors have been shown to achieve very good performance for action recognition in videos, and many different descriptors have been proposed in recent years. However, they are usually evaluated with overly specific experimental protocols and on different datasets, and existing evaluations make assumptions that prevent a full comparison of the descriptors. To explore the capabilities of these descriptors, we perform an extensive evaluation of local spatio-temporal descriptors for action recognition in videos. Four widely used state-of-the-art descriptors (HOG, HOF, HOG-HOF and HOG3D) and four video datasets (Weizmann, KTH, ADL and KECK) have been selected. In contrast to other evaluations, we test all the computed descriptors, perform experiments with several codebook sizes and use several datasets of varying difficulty. Our results show how the recognition rate depends on the codebook size and the dataset. We observe that the HOG descriptor alone usually performs worst but outperforms the other descriptors when combined with the HOF descriptor. We also observe that smaller codebook sizes lead to consistently good performance across different datasets. This work has been published in [32].
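The reported dependence of the recognition rate on the codebook size can be probed by sweeping it per dataset; a minimal sketch reusing the evaluate helper above, where load_datasets is a hypothetical loader returning the pre-computed descriptors for each dataset:

    # Hypothetical sweep over codebook sizes for each dataset.
    codebook_sizes = [100, 250, 500, 1000, 2000, 4000]
    for name, videos in load_datasets().items():  # e.g. Weizmann, KTH, ADL, KECK
        for size in codebook_sizes:
            acc = evaluate(videos, codebook_size=size)
            print(f"{name}: codebook={size:5d}  accuracy={acc:.3f}")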
We also propose a new action recognition method for RGB camera videos based on feature point tracking and a new head estimation algorithm. We track feature points throughout a video and compute appearance features (HOG-HOF) for each trajectory. Additionally, we estimate a head position for each visible person in the video using a chain of segmentation, person, head and face detectors. Finally, we create an action descriptor combining all these sources of information. Our approach has been evaluated on several datasets, including two benchmark datasets (KTH and ADL) and our new action recognition dataset. This new dataset has been created in cooperation with the CHU Nice Hospital and contains people performing daily-living activities such as standing up, sitting down, walking and reading a magazine.
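A minimal sketch of the feature point tracking step, using OpenCV's Shi-Tomasi corner detector and pyramidal Lucas-Kanade optical flow; the detector choice, parameters and re-detection policy are illustrative assumptions and may differ from the published method:

    import cv2
    import numpy as np

    def track_feature_points(video_path, max_corners=500, min_len=15):
        # Track feature points across a video and return the trajectories
        # that survived for at least min_len frames.
        cap = cv2.VideoCapture(video_path)
        ok, frame = cap.read()
        prev_gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        pts = cv2.goodFeaturesToTrack(prev_gray, max_corners, 0.01, 7)
        tracks = [[tuple(p.ravel())] for p in pts]
        finished = []

        while True:
            ok, frame = cap.read()
            if not ok or len(tracks) == 0:
                break
            gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
            new_pts, status, _ = cv2.calcOpticalFlowPyrLK(prev_gray, gray, pts, None)
            alive_tracks, alive_pts = [], []
            for track, p, s in zip(tracks, new_pts, status.ravel()):
                if s:  # point tracked successfully into this frame
                    track.append(tuple(p.ravel()))
                    alive_tracks.append(track)
                    alive_pts.append(p)
                elif len(track) >= min_len:
                    finished.append(track)
            tracks = alive_tracks
            pts = np.asarray(alive_pts, dtype=np.float32).reshape(-1, 1, 2)
            prev_gray = gray

        cap.release()
        return finished + [t for t in tracks if len(t) >= min_len]

HOG-HOF appearance features would then be computed in spatio-temporal volumes around each returned trajectory.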
We also study the usefulness of low-cost RGB-D cameras for the action recognition task. We propose a new action recognition method using both RGB and depth information: feature points are tracked in the RGB videos, and the depth information is used to additionally represent the trajectories in a four-dimensional space. Experiments have been successfully performed on our new RGB-D action recognition dataset, recorded with Microsoft's Kinect device.
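A minimal sketch of lifting a tracked 2D trajectory into the four-dimensional (x, y, depth, time) space, assuming depth maps registered to the RGB frames; the data layout is an illustrative assumption:

    import numpy as np

    def lift_to_4d(track, depth_maps, start_frame):
        # track: list of (x, y) image positions, one per consecutive frame.
        # depth_maps: per-frame depth images registered to the RGB frames
        # (e.g. from a Kinect device).
        points = []
        for i, (x, y) in enumerate(track):
            t = start_frame + i
            z = float(depth_maps[t][int(round(y)), int(round(x))])
            points.append((x, y, z, float(t)))
        return np.asarray(points)  # one (x, y, z, t) row per frame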