Mutual modality learning for video action classification

Authors: Komkov S.A., Dzabraev M.D., Petiushko A.A.

Journal: Компьютерная оптика (Computer Optics)

Section: Image processing, pattern recognition

Published in: Vol. 47, No. 4, 2023.

Free access

Models for video action classification are improving rapidly. However, their performance can still be readily improved by ensembling them with the same models trained on other modalities (e.g., optical flow). Unfortunately, using several modalities during inference is computationally expensive. Recent work examines ways to integrate the advantages of multiple modalities into a single RGB model, yet there is still room for improvement. In this paper, we explore various methods of embedding the power of an ensemble into a single model. We show that proper initialization, as well as mutual modality learning, enhances single-modality models. As a result, we achieve state-of-the-art results on the Something-Something-v2 benchmark.
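To make the two ideas in the abstract concrete, the following minimal sketch contrasts a two-modality ensemble (averaging class probabilities at inference) with a mutual learning objective (each model is trained on the labels and also pulled toward the other model's predictions). It follows the generic deep mutual learning formulation, cross-entropy plus a KL term toward the peer; the equal weighting of the two terms, the function names, and the toy logits are assumptions for illustration and are not taken from the paper.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of class logits."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]


def kl_divergence(log_p, log_q):
    """KL(p || q) computed from log-probabilities."""
    return sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q))


def mutual_learning_losses(rgb_logits, flow_logits, label):
    """Per-sample losses for two peers trained together (hypothetical helper).

    Each model gets cross-entropy on the true label plus a KL term that
    pulls its prediction toward the other modality's prediction.
    In a real training loop the peer's distribution would be treated as a
    constant when back-propagating each loss.
    """
    log_rgb = log_softmax(rgb_logits)
    log_flow = log_softmax(flow_logits)
    loss_rgb = -log_rgb[label] + kl_divergence(log_flow, log_rgb)
    loss_flow = -log_flow[label] + kl_divergence(log_rgb, log_flow)
    return loss_rgb, loss_flow


def ensemble_probs(rgb_logits, flow_logits):
    """Two-modality ensemble: average the class probabilities.

    This requires running both models (and computing optical flow) at
    inference time, which is the cost a single-modality model avoids.
    """
    p_rgb = [math.exp(v) for v in log_softmax(rgb_logits)]
    p_flow = [math.exp(v) for v in log_softmax(flow_logits)]
    return [(a + b) / 2 for a, b in zip(p_rgb, p_flow)]


if __name__ == "__main__":
    rgb = [2.0, 0.5, -1.0]   # toy logits from an RGB model
    flow = [0.0, 1.0, 0.0]   # toy logits from an optical-flow model
    print(mutual_learning_losses(rgb, flow, label=0))
    print(ensemble_probs(rgb, flow))
```

After training with such an objective, only the RGB model would be kept for inference, so the optical-flow branch adds cost at training time only.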


Keywords: video recognition, video action classification, video labeling, mutual learning, optical flow

Short URL: https://sciup.org/140301851

IDR: 140301851 | DOI: 10.18287/2412-6179-CO-1277



Research article