TY - JOUR
T1 - Actor-Aware Self-Supervised Learning for Semi-Supervised Video Representation Learning
AU - Assefa, Maregu
AU - Jiang, Wei
AU - Alemu, Kumie Gedamu
AU - Yilma, Getinet
AU - Adhikari, Deepak
AU - Ayalew, Melese
AU - Seid, Abegaz Mohammed
AU - Erbad, Aiman
N1 - Publisher Copyright: © 2023 IEEE.
PY - 2023/11/1
Y1 - 2023/11/1
N2 - Self-supervised contrastive learning has shown a significant improvement in performance for action recognition tasks by discovering useful signals from unlabeled videos. Nevertheless, the unique features of existing video benchmark datasets have led the learned video representations to be contextually biased toward dominant backgrounds and scene correlations, ultimately leading to poor generalization on scene-invariant action recognition. Therefore, we propose Actor-aware Self-supervised Learning for Semi-supervised Video Representation Learning (ActorSL). We align localized actors and their corresponding scene information to encourage the model to learn discriminative regions and mitigate the model's dependency on the video background during contrastive training. Furthermore, we present an inter-video Background Mixing (iBM) augmentation strategy to introduce scene consistency into the model. For iBM, we patch inter-video crops of four randomly selected frames to create a unique frame for each video. The patched frame is blended with the target video frames to generate a spatially augmented sample. Then, the actor-scene aligned features and features of iBM-augmented videos are utilized to jointly optimize contrastive loss and consistency regularization in a semi-supervised way. Moreover, iBM combines the one-hot-encoded labels of patches with the label of the target video as a label smoothing regularizer to soften the decision boundaries of the semi-supervised model. Our experimental results reveal that ActorSL notably improves on current state-of-the-art semi-supervised methods on the Kinetics-400, UCF101, and HMDB51 datasets under a low-label regime.
AB - Self-supervised contrastive learning has shown a significant improvement in performance for action recognition tasks by discovering useful signals from unlabeled videos. Nevertheless, the unique features of existing video benchmark datasets have led the learned video representations to be contextually biased toward dominant backgrounds and scene correlations, ultimately leading to poor generalization on scene-invariant action recognition. Therefore, we propose Actor-aware Self-supervised Learning for Semi-supervised Video Representation Learning (ActorSL). We align localized actors and their corresponding scene information to encourage the model to learn discriminative regions and mitigate the model's dependency on the video background during contrastive training. Furthermore, we present an inter-video Background Mixing (iBM) augmentation strategy to introduce scene consistency into the model. For iBM, we patch inter-video crops of four randomly selected frames to create a unique frame for each video. The patched frame is blended with the target video frames to generate a spatially augmented sample. Then, the actor-scene aligned features and features of iBM-augmented videos are utilized to jointly optimize contrastive loss and consistency regularization in a semi-supervised way. Moreover, iBM combines the one-hot-encoded labels of patches with the label of the target video as a label smoothing regularizer to soften the decision boundaries of the semi-supervised model. Our experimental results reveal that ActorSL notably improves on current state-of-the-art semi-supervised methods on the Kinetics-400, UCF101, and HMDB51 datasets under a low-label regime.
KW - Action recognition
KW - Actor-aware pseudo-labeling
KW - Contrastive learning
KW - Inter-video background mixing
KW - Semi-supervised learning
UR - http://www.scopus.com/inward/record.url?scp=85153494459&partnerID=8YFLogxK
U2 - 10.1109/TCSVT.2023.3267178
DO - 10.1109/TCSVT.2023.3267178
M3 - Article
AN - SCOPUS:85153494459
SN - 1051-8215
VL - 33
SP - 6679
EP - 6692
JO - IEEE Transactions on Circuits and Systems for Video Technology
JF - IEEE Transactions on Circuits and Systems for Video Technology
IS - 11
ER -