Preparing Video Data for Deep Learning: Introducing Vid Prepper


This article is an introduction to preparing videos for machine learning and deep learning. Due to the size and computational cost of video data, it is vital that it is processed as efficiently as possible for your use case. This includes metadata analysis, standardization, augmentation, shot and object detection, and tensor loading. This article explores how these steps can be done and why we would do them. I have also built an open source Python package called vid-prepper, with the aim of providing a fast and efficient way to apply different preprocessing techniques to your video data. The package builds on some giants of the machine learning and deep learning world, so while it is useful in bringing them together in a common, easy-to-use framework, the real work is most definitely theirs!

Video has been an important part of my career. I started my data career at NPAW, a company that builds a SaaS video analytics platform for major video companies, and I currently work for the BBC. Video dominates the web landscape, but its use in AI is still fairly limited, although growing fast. I wanted to create something that helps people try things out more quickly and contribute to this really interesting area. This article discusses what the different package modules do and how to use them, starting with metadata analysis.

Metadata Analysis

from vid_prepper import metadata

At the BBC, I am fortunate to work at a professional organisation where hugely talented people create broadcast-quality video. However, most video data is not like this. Files often come in mixed formats, colour spaces and sizes; they may be corrupted or have parts missing; and they may carry quirks from older footage, like interlacing. It is important to be aware of all of this before processing videos for machine learning.

We will be training our models on GPUs, which are fantastic for tensor calculations at scale but expensive to run. When training large models on GPUs, we want to be as efficient as possible to avoid high costs. Corrupted videos, or videos in unexpected or unsupported formats, waste time and resources, can make models less accurate, or can even break the training pipeline. Checking and filtering your files beforehand is therefore a necessity.

Metadata Analysis is almost always an important first step in preparing video data (image source – Pexels)

I have built the metadata analysis module on ffprobe, part of the FFmpeg project, which is written in C and assembly. It is a hugely powerful and efficient tool used extensively in the industry, and the module can be used to analyse a single video file or a batch of them, as shown in the code below.

# Extract metadata
video_path = ["sample.mp4"]
video_info = metadata.Metadata.validate_videos(video_path)

# Extract metadata batch
video_paths = ["sample1.mp4", "sample2.mp4", "sample3.mp4"]
video_info = metadata.Metadata.validate_videos(video_paths)

This provides a dictionary output of the video metadata, including codecs, sizes, frame rates, duration, pixel formats, audio metadata and more. This is really useful for finding video data with issues or odd quirks, and also for selecting specific video data or choosing the formats and codecs to standardize to, based on the most commonly used ones.
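As a quick illustration, the batch output can help you pick a standardization target. The sketch below assumes the batch call returns a dictionary keyed by file path, with each value exposing a "codec" field; check the actual keys your version of validate_videos returns and adjust accordingly.

from collections import Counter

# Sketch only: pick the most common codec as the standardization target.
# Assumes video_info is keyed by file path and each entry has a "codec" key -
# adjust to the structure your version of validate_videos actually returns.
codec_counts = Counter(info.get("codec") for info in video_info.values())
target_codec, count = codec_counts.most_common(1)[0]
print(f"Most common codec: {target_codec} ({count} of {len(video_info)} videos)")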

Filtering Based on Metadata Issues

Given this seemed to be a pretty common use case, I built in the ability to filter the list of videos based on a set of checks. For example, videos with missing video or audio streams, codecs or formats other than those specified, or frame rates or durations outside the specified ranges can be identified by setting the filters and only_errors parameters, as shown below.

# Run tests on videos
videos = ["video1.mp4", "video2.mkv", "video3.mov"]

all_filters_with_params = {
    "filter_missing_video": {},
    "filter_missing_audio": {},
    "filter_variable_framerate": {},
    "filter_resolution": {"min_width": 1280, "min_height": 720},
    "filter_duration": {"min_seconds": 5.0},
    "filter_pixel_format": {"allowed": ["yuv420p", "yuv422p"]},
    "filter_codecs": {"allowed": ["h264", "hevc", "vp9", "prores"]}
}

errors = metadata.Metadata.validate_videos(
    videos,
    filters=all_filters_with_params,
    only_errors=True
)

Removing or identifying issues with the data before we get to the really intensive work of model training avoids wasting time and money, making this a vital first step.
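If you just want to carry the clean subset forward, a simple follow-up might look like this sketch (it assumes the returned errors can be tested for membership by file path; adjust to the structure validate_videos actually returns):

# Sketch only: keep the videos that passed every check.
clean_videos = [v for v in videos if v not in errors]
print(f"{len(clean_videos)} of {len(videos)} videos passed all checks")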

Standardization

from vid_prepper import standardize

Standardization is usually an important preprocessing step for video machine learning. It makes things much more efficient and consistent, and deep learning models often require specific input sizes (e.g. 224 × 224). If you have a lot of video data, any time spent at this stage is often repaid many times over in the training stage later on.

Standardizing video data can make processing much, much more efficient and give better results (image source – Pexels)

Codecs

Videos are often structured for efficient storage and distribution over the internet so that they can be broadcast cheaply and quickly. This usually involves heavy compression to make videos as small as possible. Unfortunately, this is pretty much diametrically opposed to what is good for deep learning. 

The bottleneck for deep learning is almost always decoding videos and loading them into tensors, so the more compressed a video file is, the longer that takes. This typically means avoiding ultra-compressed codecs like H.265 and VVC in favour of more lightly compressed alternatives with hardware-accelerated decoding, like H.264 or VP9, or, as long as you can avoid I/O bottlenecks, an intra-frame-only codec like MJPEG, which is often used in production because it is among the fastest ways of getting frames into tensors.
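If you want to re-encode ahead of training rather than pay the decode cost at load time, a plain FFmpeg call driven from Python is enough. The sketch below transcodes a heavily compressed file to H.264 (file names are placeholders; the CRF and preset values are just reasonable starting points):

import subprocess

# Re-encode an H.265 source to H.264 so it decodes faster later in the pipeline.
subprocess.run([
    "ffmpeg", "-y", "-i", "input_h265.mp4",
    "-c:v", "libx264", "-preset", "veryfast", "-crf", "18",
    "-c:a", "copy",
    "output_h264.mp4",
], check=True)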

Frame Rate

The standard frame rates (FPS) for video are 24 for cinema, 30 for TV and online content, and 60 for fast-motion content. These rates are chosen so that our eyes perceive a single smooth motion. However, deep learning models don't necessarily need such a high frame rate in the training videos to build numeric representations of motion or to generate smooth-looking video. As every frame is an additional tensor to compute, we want to reduce the frame rate to the smallest value we can get away with.

Different types of videos and the use case of our models will determine how low we can go. The less motion in a video, the lower we can set the input frame rate without compromising the results. For example, an input dataset of studio news clips or talk shows is going to require a lower frame rate than a dataset made up of ice hockey matches. Also, if we’re working on a video understanding or video-to-text model, rather than generating video for human consumption, it might be possible to set the frame rate even lower.

Calculating Minimum Frame Rate

It is actually possible to estimate a good minimum frame rate for your video dataset from motion statistics. Using an optical flow algorithm such as RAFT or Farneback on a sample of your dataset, you can calculate the per-pixel optical flow between consecutive frames. This gives the horizontal and vertical displacement of each pixel, from which the magnitude of the change is the square root of the sum of the squares.

Averaging this value over the frame gives the frame momentum, and taking the median and 95th percentile across all frames gives values that you can plug into the equations below to get a range of likely optimal minimum frame rates for your training data.

Optimal FPS (Lower) = Current FPS x Max model interpolation rate / Median momentum

Optimal FPS (Higher) = Current FPS x Max model interpolation rate / 95th percentile momentum

Here the max model interpolation rate is the maximum per-frame momentum the model can handle, usually given in the model card.

Working out momentum is nothing more than a bit of Pythagoras. No PhD maths here! (image source – Pexels)

You can then run small-scale tests of your training pipeline to determine the lowest frame rate that still gives optimal performance.
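As a rough sketch of that process, the snippet below uses OpenCV's Farneback optical flow (requires opencv-python) to estimate per-frame momentum on a sample clip and then applies the formulas above. The max_interp value is a placeholder you would take from your model's documentation, and the file name is illustrative.

import cv2
import numpy as np

def frame_momenta(video_path, max_frames=300):
    """Mean optical-flow magnitude per frame change (a rough 'momentum' score)."""
    cap = cv2.VideoCapture(video_path)
    fps = cap.get(cv2.CAP_PROP_FPS)
    ok, prev = cap.read()
    prev_gray = cv2.cvtColor(prev, cv2.COLOR_BGR2GRAY)
    momenta = []
    while ok and len(momenta) < max_frames:
        ok, frame = cap.read()
        if not ok:
            break
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        flow = cv2.calcOpticalFlowFarneback(prev_gray, gray, None,
                                            0.5, 3, 15, 3, 5, 1.2, 0)
        magnitude = np.sqrt(flow[..., 0] ** 2 + flow[..., 1] ** 2)  # Pythagoras per pixel
        momenta.append(magnitude.mean())
        prev_gray = gray
    cap.release()
    return fps, np.array(momenta)

fps, momenta = frame_momenta("sample.mp4")
max_interp = 2.0  # placeholder: max per-frame momentum your model handles (see its model card)
candidates = sorted([fps * max_interp / np.median(momenta),
                     fps * max_interp / np.percentile(momenta, 95)])
print(f"Candidate minimum FPS range: {candidates[0]:.1f} - {candidates[1]:.1f}")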

Vid Prepper

The standardize module in vid-prepper can standardize the size, codec, colour format and frame rate of a single video or batch of videos.

Again, it is built on FFmpeg and has the ability to accelerate things on GPU if that is available to you. To standardize videos, you can simply run the code below.

# Standardize batch of videos
video_file_paths = ["sample1.mp4", "sample2.mp4", "sample3.mp4"]
standardizer = standardize.VideoStandardizer(
            size="224x224",
            fps=16,
            codec="h264",
            color="rgb",
            use_gpu=False  # Set to True if you have CUDA
        )

standardizer.batch_standardize(videos=video_file_paths, output_dir="videos/")

In order to make things more efficient, especially if you are using expensive GPUs and don't want an I/O bottleneck from loading videos, the module also accepts WebDatasets. These can be loaded as in the following code:

# Standardize webdataset
standardizer = standardize.VideoStandardizer(
            size="224x224",
            fps=16,
            codec="h264",
            color="rgb",
            use_gpu=False  # Set to True if you have CUDA
        )

standardizer.standardize_wds("dataset.tar", key="mp4", label="cls")

Tensor Loader

from vid_prepper import loader

A video tensor typically has 4 or 5 dimensions: the pixel colour channels (usually RGB), the height and width of the frame, time, and an optional batch dimension. As mentioned above, decoding videos into tensors is often the biggest bottleneck in the preprocessing pipeline, so the steps taken up to this point make a big difference to how efficiently we can load our tensors.
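For intuition, the shapes look something like this in PyTorch (the exact dimension order vid-prepper returns may differ, so treat this purely as an illustration):

import torch

# One clip: 16 frames of 224x224 RGB -> (time, channels, height, width)
clip = torch.rand(16, 3, 224, 224)

# A batch of 8 clips adds a leading batch dimension -> (batch, time, channels, height, width)
batch = torch.rand(8, 16, 3, 224, 224)
print(clip.shape, batch.shape)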

This module converts videos into PyTorch tensors, using FFmpeg for frame sampling and NVDEC for GPU-accelerated decoding. You can alter the size of the tensors to fit your model, along with selecting the number of frames to sample per clip and the frame stride (the spacing between sampled frames). As with standardization, the option to use WebDatasets is also available. The code below shows how this is done.

# Load clips into tensors
video_loader = loader.VideoLoader(num_frames=16, frame_stride=2, size=(224, 224), device="cuda")
video_paths = ["video1.mp4", "video2.mp4", "video3.mp4"]
batch_tensor = video_loader.load_files(video_paths)

# Load a webdataset into tensors
wds_path = "data/shards/{00000..00009}.tar"
dataset = video_loader.load_wds(wds_path, key="mp4", label="cls")

Detector

from vid_prepper import detector

Detecting things within video content is often a necessary part of preprocessing. These might be particular objects, shots or transitions. This module brings together powerful tools and models from PySceneDetect, Hugging Face, IDEA Research and PyTorch to provide efficient detection.

Video detection is often a useful way of splitting videos into clips and getting only the clips you need for your model (image source – Pexels)

Shot Detection

In many video machine learning use cases (e.g. semantic search, seq2seq trailer generation and many more), splitting videos into individual shots is an important step. There are a few ways of doing this, and PySceneDetect is one of the more accurate and reliable. This module wraps PySceneDetect's content detection method, which can be called as shown below. It outputs the start and end frames of each shot.

# Detect shots in a video
video_path = "video.mp4"
video_detector = detector.VideoDetector(device="cuda")
shot_frames = video_detector.detect_shots(video_path)

Transition Detection

Whilst PySceneDetect is a strong tool for splitting videos into individual scenes, it is not always 100% accurate. Sometimes you can take advantage of repeated content (e.g. transitions) that breaks up shots. For example, BBC News has an upwards red and white wipe transition between segments that can easily be detected using something like PyTorch.

Transition detection works directly on tensors by looking for blocks of pixels whose change between frames exceeds a threshold that you can set. The example code below shows how it works.

# Detect gradual transitions/wipes
video_path = "video.mp4"
video_loader = loader.VideoLoader(num_frames=16,
                                  frame_stride=2,
                                  size=(224, 224),
                                  device="cpu",      # or "cuda" if available
                                  use_nvdec=False)
video_tensor = video_loader.load_file(video_path)

video_detector = detector.VideoDetector(device="cpu")  # or "cuda"
wipe_frames = video_detector.detect_wipes(video_tensor,
                                          block_grid=(8, 8),
                                          threshold=0.3)
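For context, the block-difference idea can be sketched in a few lines of plain PyTorch. This is only an illustration of the approach, not vid-prepper's actual implementation, and it assumes the clip tensor is laid out as (time, channels, height, width) with values in [0, 1]:

import torch
import torch.nn.functional as F

def block_change_scores(clip, grid=(8, 8)):
    """Mean absolute frame-to-frame change, averaged within each block of a grid."""
    diffs = (clip[1:] - clip[:-1]).abs().mean(dim=1, keepdim=True)  # (T-1, 1, H, W)
    _, _, h, w = diffs.shape
    return F.avg_pool2d(diffs, kernel_size=(h // grid[0], w // grid[1])).squeeze(1)

# Frames where most blocks change sharply are transition/wipe candidates
scores = block_change_scores(video_tensor.float(), grid=(8, 8))
candidate_frames = (scores > 0.3).float().mean(dim=(1, 2)) > 0.5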

Object Detection

Object detection is often required to find the clips you need in your video data. For example, you may want clips containing people or animals. This method uses an open source Grounding DINO model, checked by default against a small set of object labels from the standard COCO dataset. Both the model choice and the list of objects are completely customisable. Models are loaded via the Hugging Face transformers package, so the model you use needs to be available there. For custom labels, the default model takes a string in the text_queries parameter with the following structure: "dog. cat. ambulance."

# Detect objects in videos
video_path = "video.mp4"
video_loader = loader.VideoLoader(num_frames=16,
                                  frame_stride=2,
                                  size=(224, 224),
                                  device="cpu",      # or "cuda" if available
                                  use_nvdec=False)
video_tensor = video_loader.load_file(video_path)

video_detector = detector.VideoDetector(device="cpu")  # or "cuda"
text_queries = "dog. cat. ambulance."  # set to None to fall back to the default COCO label list
results = video_detector.detect_objects(video_tensor,
                                        text_queries=text_queries,
                                        text_threshold=0.3,
                                        model_id="IDEA-Research/grounding-dino-tiny")

Data Augmentation

Architectures like video transformers are incredibly powerful and can be used to create great new models. However, they often require a huge amount of data, which isn't always easy to come by for video. In those cases, we need a way to generate varied data that stops our models overfitting. Data augmentation is one solution to limited data availability.

For video, there are a number of standard methods for augmenting data, and most are supported by the major frameworks. Vid-prepper brings together two of the best: Kornia and Torchvision. With vid-prepper, you can perform individual augmentations like cropping, flipping, mirroring, padding, Gaussian blurring, adjusting brightness, colour, saturation and contrast, and coarse dropout (where parts of the video frame are masked). You can also chain them together for higher efficiency.

Augmentations all work on the video tensors rather than directly on the videos and support GPU acceleration if you have it. The example code below shows how to call the methods individually and how to chain them.

from vid_prepper import augmentor

# Individual augmentation examples
video_path = "video.mp4"
video_loader = loader.VideoLoader(num_frames=16,
                                  frame_stride=2,
                                  size=(224, 224),
                                  device="cpu",      # or "cuda" if available
                                  use_nvdec=False)
video_tensor = video_loader.load_file(video_path)

video_augmentor = augmentor.VideoAugmentor(device="cpu", use_gpu=False)
cropped = video_augmentor.crop(video_tensor, type="center", size=(200, 200))
flipped = video_augmentor.flip(video_tensor, type="horizontal")
brightened = video_augmentor.brightness(video_tensor, amount=0.2)


# Chained augmentations
augmentations = [
    ('crop', {'type': 'random', 'size': (180, 180)}),
    ('flip', {'type': 'horizontal'}),
    ('brightness', {'amount': 0.1}),
    ('contrast', {'amount': 0.1})
]

chained_result = video_augmentor.chain(video_tensor, augmentations)

Summing Up

Video preprocessing is hugely important in deep learning because video data is so much larger than text. Transformer models' appetite for oceans of data compounds this further. Three key elements make up the deep learning process: time, money and performance. By optimizing our input video data, we can minimize the first two while getting the best out of the third.

There are some amazing open source tools available for video machine learning, with more arriving every day. Vid-prepper stands on the shoulders of some of the best and most widely used, and tries to bring them together in an easy-to-use package. Hopefully you find some value in it and it helps you create the next generation of video models, which is extremely exciting!


