TY - GEN
T1 - Human interaction recognition using low-rank matrix approximation and super descriptor tensor decomposition
AU - Khokher, Muhammad Rizwan
AU - Bouzerdoum, Abdesselam
AU - Phung, Son Lam
PY - 2017/6/16
Y1 - 2017/6/16
N2 - Audio-visual recognition systems rely on efficient feature extraction. Many spatio-temporal interest point detectors for visual feature extraction are either too sparse, leading to loss of information, or too dense, resulting in noisy and redundant information. Furthermore, interest point detectors designed for a controlled environment can be affected by camera motion. In this paper, a salient spatio-temporal interest point detector is proposed based on a low-rank and group-sparse matrix approximation. The detector handles camera motion through short-window video stabilization. The multimodal audio-visual features from multiple descriptors are represented by a super descriptor, from which a compact set of features is extracted through tensor decomposition and feature selection. This tensor decomposition retains the spatio-temporal structure among features obtained from multiple descriptors. Experimental validation is conducted on two benchmark human interaction recognition datasets: TVHID and Parliament. Experimental results show that the proposed approach outperforms many state-of-the-art methods, achieving classification rates of 74.7% and 88.5% on the TVHID and Parliament datasets, respectively.
KW - Human interaction recognition
KW - low-rank and group-sparse matrix approximation
KW - spatiotemporal interest point detection
KW - tensor decomposition
UR - http://www.scopus.com/inward/record.url?scp=85023752750&partnerID=8YFLogxK
U2 - 10.1109/ICASSP.2017.7952476
DO - 10.1109/ICASSP.2017.7952476
M3 - Conference contribution
AN - SCOPUS:85023752750
T3 - ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings
SP - 1847
EP - 1851
BT - 2017 IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2017 - Proceedings
PB - Institute of Electrical and Electronics Engineers Inc.
T2 - 2017 IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2017
Y2 - 5 March 2017 through 9 March 2017
ER -