Video classification using convolutional LSTM
In this section, we will combine convolutional, max pooling, dense, and recurrent layers to classify each frame of a video clip. Each video shows human activities that persist across multiple frames (although they move between frames) and may leave the frame entirely. First, let's look in more detail at the dataset we will use for this project.
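Before turning to the dataset, the layer combination described above can be sketched in Keras. This is a minimal illustration, not the final model of this section: the clip length, frame size, and layer widths below are placeholder assumptions. The key idea is that `TimeDistributed` applies the same convolutional feature extractor to every frame, the LSTM propagates information across frames, and a per-frame dense layer produces a classification for each frame.

```python
import tensorflow as tf
from tensorflow.keras import layers, models

# Placeholder shapes: 16-frame clips of 64x64 RGB images,
# and UCF101's 101 action categories as the output space.
NUM_FRAMES, HEIGHT, WIDTH, CHANNELS = 16, 64, 64, 3
NUM_CLASSES = 101

model = models.Sequential([
    layers.Input(shape=(NUM_FRAMES, HEIGHT, WIDTH, CHANNELS)),
    # Shared per-frame feature extractor: convolution + max pooling
    layers.TimeDistributed(layers.Conv2D(16, 3, activation="relu")),
    layers.TimeDistributed(layers.MaxPooling2D()),
    layers.TimeDistributed(layers.Conv2D(32, 3, activation="relu")),
    layers.TimeDistributed(layers.MaxPooling2D()),
    layers.TimeDistributed(layers.Flatten()),
    # Recurrent layer carries context between frames;
    # return_sequences=True keeps one output per frame
    layers.LSTM(64, return_sequences=True),
    # Dense classifier applied to every frame independently
    layers.TimeDistributed(layers.Dense(NUM_CLASSES, activation="softmax")),
])

# One forward pass on a random batch of two clips:
dummy_clips = tf.random.uniform((2, NUM_FRAMES, HEIGHT, WIDTH, CHANNELS))
per_frame_probs = model(dummy_clips)
print(per_frame_probs.shape)  # one softmax vector per frame: (2, 16, 101)
```

If you instead wanted a single label per clip rather than per frame, you would drop `return_sequences=True` and the final `TimeDistributed` wrapper, so the LSTM's last state feeds a plain `Dense` classifier.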
UCF101 – action recognition dataset
UCF101 is an action recognition dataset of realistic action videos collected from YouTube, with 101 action categories covering 13,320 videos. The videos exhibit variations in camera motion, object appearance and pose, object scale, viewpoint, cluttered background, and illumination conditions.
The videos in the 101 action categories are further clustered into 25 groups, where the clips in each group share common features (for example, background and viewpoint), with four to seven videos of an action in each group. There are five...