A Proposed Entirely AI Based Codec

bhobba · Apr 9, 2024

OK, an actual codec using AI has been detailed. It performs well, although the modern codec VVC is superior in some circumstances. Can we do better with an entirely AI codec? We will assume our source is in a standard format used for mastering: Apple ProRes 10-bit 4:4:4 at 50 frames per second.
https://www.apple.com/final-cut-pro/docs/Apple_ProRes.pdf

ProRes is an intra-only codec, and all versions are generally considered visually lossless. The Discrete Cosine Transform is applied to each frame, and the resulting coefficients are entropy encoded. It is a variable bitrate codec that will not transmit DCT coefficients that make no visual difference.

An obvious tool to use is the ISIZE bit-save algorithm, which reduces the size of the raw data file by 70% for ProRes with no visual loss.

As discussed in a previous post, scale-arbitrary Invertible Image Downscaling (AIID) is a highly efficient image processing technique. It allows an 8k video in a format like Apple ProRes to be downscaled to 8-bit FHD. Figure 2 from the AIID paper shows that FHD colour can be recovered with 50 dB PSNR, 4k with 40 dB PSNR, and 8k with about 30 dB PSNR, although Table 1 shows the 8k figure is closer to 35 dB PSNR. Generally, anything 40 dB and above is indistinguishable from the original. However, since the source has been pre-processed with ISIZE, the VMAF of 8k will be higher, possibly reaching the visually lossless quality of VMAF 95.
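For reference, the PSNR figures quoted above follow the standard definition, 10*log10(peak^2 / MSE). A minimal sketch in Python, assuming 8-bit samples (peak = 255) held in plain lists; the sample values are made up for illustration and are not taken from the AIID paper:

```python
import math

def psnr(original, reconstructed, peak=255):
    """Peak signal-to-noise ratio in dB between two equal-length sample lists."""
    if len(original) != len(reconstructed):
        raise ValueError("inputs must have the same number of samples")
    # Mean squared error over all samples
    mse = sum((a - b) ** 2 for a, b in zip(original, reconstructed)) / len(original)
    if mse == 0:
        return math.inf  # identical signals
    return 10 * math.log10(peak ** 2 / mse)

# Hypothetical 8-bit samples, each off by exactly one code value (MSE = 1)
original = [100, 120, 140, 160]
reconstructed = [101, 119, 141, 159]
print(round(psnr(original, reconstructed), 2))  # prints 48.13
```

For 10-bit material the same function can be called with peak=1023.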

A good approach would be to transmit the output of AIID using some standard codec (e.g., EVC Baseline) and upscale it to the TV's resolution, remembering that 8k downscaled to 4k is virtually indistinguishable from 8k with our current screen sizes and technology. 8k televisions also usually include very sophisticated technology for upscaling 4k to 8k. I own a recent Samsung 8k, and from my experiments, I can't tell the difference. However, if the AIID upscaling to 8k combined with ISIZE is good enough, the TV's upscaling may not be necessary.

The next step is to specify an FHD AI codec to distribute the output of the 8k video down-scaled by AIID. Using the visually lossless ProRes codec as it stands, and not accounting for the video now being 8-bit (which would lower the bitrate), the estimated bitrate from published data on the codec is 20 MBS. And that is for 8k at 50 frames per second.

Let's see if we can modify this using AI for better compression. Progress has been made in transmitting visually lossless video by removing unnecessary noise:
https://arxiv.org/pdf/2401.13616v1

Divide the video into blocks of, say, 32x32x32: 32x32 pixels across a group of 32 consecutive frames.

Apply the 3D DCT to each block, just as ProRes applies the 2D DCT to single frames, and transmit the coefficients using FLIIC.

See:
https://arxiv.org/abs/2304.00299

As the above paper shows, this alone can reduce bitrate considerably. Some of the lower bits can be truncated if further reduction is required.
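To make the 3D DCT step concrete, here is a rough sketch in pure Python. It assumes an orthonormal DCT-II applied separably along x, y and time, and uses a tiny 4x4x4 block of made-up values instead of 32x32x32 so that it runs quickly. The function names are illustrative only, and the FLIIC coding of the coefficients is not modelled.

```python
import math

def dct_1d(x):
    """Orthonormal DCT-II of a list of numbers."""
    n = len(x)
    out = []
    for k in range(n):
        s = sum(x[i] * math.cos(math.pi * (i + 0.5) * k / n) for i in range(n))
        scale = math.sqrt(1 / n) if k == 0 else math.sqrt(2 / n)
        out.append(scale * s)
    return out

def idct_1d(c):
    """Inverse of dct_1d (orthonormal DCT-III)."""
    n = len(c)
    out = []
    for i in range(n):
        s = c[0] * math.sqrt(1 / n)
        s += sum(c[k] * math.sqrt(2 / n) * math.cos(math.pi * (i + 0.5) * k / n)
                 for k in range(1, n))
        out.append(s)
    return out

def transform_3d(block, f):
    """Apply the 1D transform f along the x, y and t axes of block[t][y][x]."""
    T, H, W = len(block), len(block[0]), len(block[0][0])
    b = [[f(row) for row in frame] for frame in block]      # along x
    for t in range(T):                                      # along y
        for x in range(W):
            col = f([b[t][y][x] for y in range(H)])
            for y in range(H):
                b[t][y][x] = col[y]
    for y in range(H):                                      # along t
        for x in range(W):
            line = f([b[t][y][x] for t in range(T)])
            for t in range(T):
                b[t][y][x] = line[t]
    return b

# A static scene: the same (made-up) 4x4 frame repeated 4 times
N = 4
frame = [[(3 * y + 5 * x) % 17 for x in range(N)] for y in range(N)]
block = [[row[:] for row in frame] for _ in range(N)]

coeffs = transform_3d(block, dct_1d)
# Energy in every coefficient with temporal index t > 0
temporal_energy = sum(coeffs[t][y][x] ** 2
                      for t in range(1, N) for y in range(N) for x in range(N))
restored = transform_3d(coeffs, idct_1d)
print(temporal_energy < 1e-9)
```

For a static scene all the energy lands in the t = 0 plane of coefficients, which is the inter-frame redundancy the 3D transform is meant to exploit. Real footage would leave small, but not zero, temporal coefficients.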

Since we combine this with ISIZE and AIID, training FLIIC would need two sets of training images: one preprocessed by ISIZE and AIID, and a de-noised version of the same images, also preprocessed by ISIZE and AIID.

Indeed, the large redundancies between frames mean that, from a group of 32 frames, a CNN can produce a reasonable prediction of the next 32 frames:
https://arxiv.org/pdf/2206.05099

The encoder feeds the predicted and real frame blocks into a CRNN-type network, efficiently transmitting the difference to the decoder.

The encoder determines whether sending the 32 frames or the CRNN encoded difference is more efficient and sends a bit at the start of a frame sequence to indicate which to use.
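The mode decision can be sketched as follows. This is only an illustration of the signalling idea: zlib stands in for both the direct coder and the CRNN difference coder (an assumption made purely so the example runs), the function names are hypothetical, and a whole flag byte is used where the text calls for a single bit.

```python
import random
import zlib

def encode_group(frames, predicted):
    """Pick the cheaper of direct coding and residual coding; prefix a mode flag.

    frames, predicted: bytes objects of equal length (raw 8-bit samples).
    """
    direct = zlib.compress(frames)
    # Difference between real and predicted samples, wrapped to 0..255
    residual = bytes((a - b) % 256 for a, b in zip(frames, predicted))
    diff = zlib.compress(residual)
    if len(diff) < len(direct):
        return bytes([1]) + diff    # flag 1: difference to the prediction
    return bytes([0]) + direct      # flag 0: the frames themselves

def decode_group(payload, predicted):
    """Invert encode_group given the same prediction the encoder used."""
    flag, body = payload[0], zlib.decompress(payload[1:])
    if flag == 1:
        return bytes((r + p) % 256 for r, p in zip(body, predicted))
    return body

rng = random.Random(0)
noise = bytes(rng.randrange(256) for _ in range(4096))

# Case 1: hard-to-compress frames, but the prediction is wrong in only one sample
good_prediction = bytearray(noise)
good_prediction[10] ^= 1
good_prediction = bytes(good_prediction)
payload_a = encode_group(noise, good_prediction)

# Case 2: trivially compressible frames and a useless prediction
flat = bytes(4096)
payload_b = encode_group(flat, noise)

print(payload_a[0], payload_b[0])  # prints 1 0
```

The decoder must form exactly the same prediction as the encoder, otherwise the difference cannot be added back correctly.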

If some lower bits were removed, the result is no longer lossless, so the images could be enhanced by predicting inter-frame residuals, using the previously detailed method, for every five frames.

This is just one possible completely AI codec. However, it faces a big problem: the computation power required. All-AI codecs are expected to be common in 10 years, but they will use much more powerful processors than are currently in use.

I hope to write an Insights article putting this all together.

Thanks
Bill

Greg Bernhardt · Sep 2, 2024

Looking forward to an Insights article, thanks!

Algr · Sep 23, 2024

I'd been expecting something like that to come along.

All compression works by giving the playback device some form of understanding of the data, so that the encoder can call on the decoder's understanding in order to describe the data more efficiently. AI that understands the nature of photographs is already in use, so using AI's visual understanding as a video codec seems a logical progression.

bhobba · Sep 23, 2024

Algr said:

I'd been expecting something like that to come along.

Indeed.

The Insights article is taking longer than I expected because of the fast-changing nature of AI in both video and still images. In particular, VideoGigaGAN is a striking development I want to include:
https://videogigagan.github.io/

Plus, there is a whole new approach: compressing the image in a way that can be inverted back to the original image. The data lost via compression is made to follow a known statistical distribution. When decompressing, a random sample of that distribution is taken to simulate the missing data, and the compression is inverted to restore the original resolution.
https://research.monash.edu/files/614698441/613267800-oa.pdf

It is particularly useful for converting a colour image to greyscale and then back to the original colour.

Please don't worry - it's coming. I will get on it straight away.

Thanks
Bill

Algr · Sep 23, 2024

Does "invertible" mean decompressible? I haven't heard it called that before.

bhobba · Sep 24, 2024

Algr said:

Does "invertible" mean decompressible? I haven't heard it called that before.

My article and the links will explain it. But here is the gist.

Let's say you want to downscale an image by a factor of 2. You divide it into 4 pieces (two vertical and two horizontal), with one piece being a 2 times downscaled version of the image. If you want to recover the original, you reassemble the 4 pieces. In other words, the downscaled image has an inverse that gets the original back if you have all four pieces. The trick is to ensure the other three pieces have a known statistical distribution. So when inverting the downscaled image to the original, you simply take a sample from the distribution of the other three pieces. This leads to a reconstructed full-scale image, but with a bit of 'snow' in it. Anything 40 dB or greater is considered visually lossless, and at 2 times downscaling it is better than 40 dB. The paper above explains an even more sophisticated version that downscales by arbitrary amounts. A less sophisticated, but easier to understand, version is explained here:
https://arxiv.org/abs/2005.05650
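A toy version of the four-piece idea, for illustration only. It uses a fixed Haar-style 2x2 split in place of the trained invertible network in the linked papers, and the Gaussian detail distribution with sigma = 1.0 is an arbitrary assumption rather than anything learned. The function names and pixel values are made up.

```python
import random

def split(img):
    """Split an even-sized image into a half-size picture and three detail pieces."""
    low, dh, dv, dd = [], [], [], []
    for y in range(0, len(img), 2):
        rl, rh, rv, rd = [], [], [], []
        for x in range(0, len(img[0]), 2):
            a, b = img[y][x], img[y][x + 1]
            c, d = img[y + 1][x], img[y + 1][x + 1]
            rl.append((a + b + c + d) / 4)   # the downscaled image
            rh.append((a - b + c - d) / 4)   # horizontal detail
            rv.append((a + b - c - d) / 4)   # vertical detail
            rd.append((a - b - c + d) / 4)   # diagonal detail
        low.append(rl); dh.append(rh); dv.append(rv); dd.append(rd)
    return low, dh, dv, dd

def merge(low, dh, dv, dd):
    """Exact inverse of split: reassemble the four pieces."""
    img = []
    for rl, rh, rv, rd in zip(low, dh, dv, dd):
        top, bottom = [], []
        for l, h, v, d in zip(rl, rh, rv, rd):
            top += [l + h + v + d, l - h + v - d]
            bottom += [l + h - v - d, l - h - v + d]
        img += [top, bottom]
    return img

def sample_details(low, sigma, rng):
    """Draw a stand-in detail piece from an assumed zero-mean Gaussian."""
    return [[rng.gauss(0, sigma) for _ in row] for row in low]

img = [[52, 55, 61, 66],
       [63, 59, 55, 90],
       [62, 59, 68, 113],
       [63, 58, 71, 122]]

low, dh, dv, dd = split(img)
exact = merge(low, dh, dv, dd)            # all four pieces: original recovered

rng = random.Random(0)
approx = merge(low, *(sample_details(low, 1.0, rng) for _ in range(3)))
print(exact == img)  # prints True
```

With the true detail pieces the inverse is exact. With sampled details, each 2x2 block still averages to the downscaled value, but the individual pixels pick up the 'snow' described above.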

It works particularly well for converting colour to greyscale:
https://arxiv.org/abs/2105.02104

The reason is the almost universally used Bayer Filter:
https://en.wikipedia.org/wiki/Bayer_filter

Take 8k as an example. It is really four 4k streams. To get a single 4k greyscale image, you simply downscale the four concatenated streams by 2, which gives a greyscale image from which the original 8k can be recovered with imperceptible loss.
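The 'four streams' view can be shown with a few lines of Python. This sketch assumes an RGGB Bayer layout and uses a made-up 4x4 mosaic in place of an 8k sensor readout; the function name is illustrative.

```python
def split_bayer(mosaic):
    """Split an RGGB mosaic (a list of rows) into four half-resolution planes."""
    r = [row[0::2] for row in mosaic[0::2]]    # even rows, even columns
    g1 = [row[1::2] for row in mosaic[0::2]]   # even rows, odd columns
    g2 = [row[0::2] for row in mosaic[1::2]]   # odd rows, even columns
    b = [row[1::2] for row in mosaic[1::2]]    # odd rows, odd columns
    return r, g1, g2, b

# Hypothetical 4x4 sensor readout; each value sits behind one colour filter
mosaic = [[10, 20, 11, 21],
          [30, 40, 31, 41],
          [12, 22, 13, 23],
          [32, 42, 33, 43]]

r, g1, g2, b = split_bayer(mosaic)
print(r)  # prints [[10, 11], [12, 13]]
```

Each plane has half the width and half the height of the mosaic, which is the sense in which an 8k sensor image consists of four 4k streams.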

The best video codec we have is VVC. The invertible method, by itself, beats that:
https://arxiv.org/abs/2108.03690

The final Insights article will, hopefully, explain all.

Thanks
Bill
