Attention Boosted Deep Networks for Video Classification

Junyong You*, Jari Korhonen

*Corresponding author for this work

Research output: Chapter in Book/Report/Conference proceeding › Published conference contribution

8 Citations (Scopus)

Abstract

Video classification can be performed by summarizing the image content of individual frames into one class using deep neural networks, e.g., CNNs and LSTMs. Human interpretation of video content is influenced by the attention mechanism; in other words, certain information contributes more to the decision on the video class than other information. In this paper, we propose to integrate the attention mechanism into deep networks for video classification. The proposed framework employs 2D CNNs with ImageNet pretrained weights to extract features from video frames, which are then fed to a bidirectional LSTM network for video classification. We have developed an attention block that can be added after the LSTM network in the proposed framework. Several different 2D CNN architectures have been tested in the experiments. The results on two publicly available datasets demonstrate that integrating attention boosts the performance of deep networks in video classification compared to omitting the attention block. We also found that applying attention to the LSTM outputs of the VGG19-based architecture yields the highest classification accuracy in the proposed framework.
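To make the pipeline concrete, below is a minimal PyTorch sketch of the architecture the abstract describes: a frozen ImageNet-pretrained VGG19 extracts per-frame features, a bidirectional LSTM models the frame sequence, and an attention block weights the LSTM outputs before classification. This is an illustration under assumed details (the class name, hidden size, single-layer LSTM, and linear-scoring attention are our choices), not the authors' implementation.

```python
import torch
import torch.nn as nn
import torchvision.models as models

class AttentionBiLSTMClassifier(nn.Module):
    """CNN features per frame -> bidirectional LSTM -> attention -> class logits."""

    def __init__(self, num_classes, hidden_size=256):
        super().__init__()
        # ImageNet-pretrained VGG19, truncated before its final classification
        # layer, used as a frozen 4096-d per-frame feature extractor.
        vgg = models.vgg19(weights=models.VGG19_Weights.IMAGENET1K_V1)
        self.backbone = nn.Sequential(
            vgg.features, vgg.avgpool, nn.Flatten(),
            *list(vgg.classifier.children())[:-1],
        )
        for p in self.backbone.parameters():
            p.requires_grad = False

        self.lstm = nn.LSTM(4096, hidden_size, batch_first=True, bidirectional=True)
        # Attention block over the LSTM outputs: a learned scoring layer gives
        # one weight per time step; the weighted sum summarizes the video.
        self.attn = nn.Linear(2 * hidden_size, 1)
        self.fc = nn.Linear(2 * hidden_size, num_classes)

    def forward(self, frames):                          # frames: (B, T, 3, 224, 224)
        b, t = frames.shape[:2]
        feats = self.backbone(frames.flatten(0, 1))     # (B*T, 4096)
        out, _ = self.lstm(feats.view(b, t, -1))        # (B, T, 2*hidden_size)
        weights = torch.softmax(self.attn(out), dim=1)  # (B, T, 1), sums to 1 over T
        video = (weights * out).sum(dim=1)              # attention-weighted summary
        return self.fc(video)
```

For example, `AttentionBiLSTMClassifier(num_classes=101)(torch.randn(2, 16, 3, 224, 224))` returns class logits for a batch of two 16-frame clips; swapping the VGG19 backbone for another pretrained 2D CNN corresponds to the architecture comparison described in the abstract.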

Original language: English
Title of host publication: Proceedings - International Conference on Image Processing (ICIP)
Publisher: IEEE Xplore
Pages: 1761-1765
Number of pages: 5
Publication status: Published - 2020
Event: IEEE International Conference on Image Processing (ICIP)
Duration: 25 Sept 2020 – 28 Sept 2020

Conference

Conference: IEEE International Conference on Image Processing (ICIP)
Period: 25/09/20 – 28/09/20

Keywords

  • Attention
  • bidirectional LSTM
  • CNN
  • video classification
