OK, an actual codec using AI has been detailed. It performs well, although the modern codec VVC is superior in some circumstances. Can we do better with an entirely AI codec? We will assume our source is in a standard format used for mastering: Apple ProRes 10-bit 4:4:4 at 50 frames per second.
https://www.apple.com/final-cut-pro/docs/Apple_ProRes.pdf
It is an intra-only codec, and all versions are generally considered visually lossless. The Discrete Cosine Transform is applied to each frame, and the coefficients are Huffman entropy encoded. It is a variable-bitrate codec that does not transmit DCT coefficients that make no visual difference.
An obvious tool to use is the ISIZE BitSave algorithm. This algorithm reduces the size of the raw data file by 70% for ProRes with no visual loss.
As discussed in a previous post, scale-arbitrary Invertible Image Downscaling (AIID) is a highly efficient image processing technique. It allows an 8k video in a format like Apple ProRes to be downscaled to 8-bit FHD. Figure 2 from the AIID paper shows that FHD colour can be recovered with 50 dB PSNR, 4k with 40 dB PSNR, and 8k with about 30 dB PSNR, although Table 1 shows the 8k figure is closer to 35 dB PSNR. Generally, anything 40 dB and above is indistinguishable from the original. However, since the video has been pre-processed with ISIZE, the VMAF of 8k will be higher, possibly reaching the visually lossless quality of VMAF 95.
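To put those PSNR figures in pixel terms, here is a minimal sketch of the standard PSNR formula, assuming 8-bit samples (peak value 255). The function names are my own for illustration and are not taken from the AIID paper.

```python
import math

def psnr(original, restored, max_value=255):
    """Peak signal-to-noise ratio in dB between two equal-length sample sequences."""
    if len(original) != len(restored):
        raise ValueError("inputs must have the same number of samples")
    mse = sum((a - b) ** 2 for a, b in zip(original, restored)) / len(original)
    if mse == 0:
        return math.inf  # identical signals
    return 10 * math.log10(max_value ** 2 / mse)

def rms_error_for_psnr(target_db, max_value=255):
    """RMS sample error that corresponds to a given PSNR (inverse of the formula above)."""
    return max_value / 10 ** (target_db / 20)

# Assumption: 8-bit samples, so the peak value is 255.
for db in (30, 35, 40, 50):
    print(f"{db} dB PSNR -> RMS error of {rms_error_for_psnr(db):.2f} grey levels")
```

On an 8-bit scale, 30, 35, 40 and 50 dB correspond to RMS errors of roughly 8.1, 4.5, 2.55 and 0.8 grey levels, which gives a feel for why 40 dB and above is so hard to see.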
A good approach would be to transmit the output of AIID using some standard codec (e.g., EVC Baseline) and upscale it to the TV's resolution, remembering that downscaled 4k is virtually indistinguishable from 8k with our current screen sizes and technology. 8k televisions also usually include very sophisticated technology for upscaling 4k to 8k. I own a recent Samsung 8k, and in my experiments I can't tell the difference. However, if the AIID upscaling to 8k combined with ISIZE is good enough, the TV's upscaler may not be necessary.
The next step is to specify an FHD AI codec to distribute the 8k source down-scaled by AIID. Using the visually lossless ProRes codec as it stands, and not accounting for the video now being 8-bit (which would lower the bitrate), the estimated bitrate from published data on the codec is 20 MBS. And that is for 8k at 50 frames per second.
Let's see if we can modify this using AI for better compression. Progress has been made in transmitting visually lossless video by removing unnecessary noise:
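For a sense of scale, the sketch below compares the uncompressed sample rates of the 8k master and the AIID output. These are raw data rates, not codec bitrates, and the figures are my assumptions: 8k taken as 7680x4320, FHD as 1920x1080, and the AIID output kept as 4:4:4 (three full-resolution samples per pixel).

```python
def raw_bitrate(width, height, bit_depth, fps, samples_per_pixel=3):
    """Uncompressed bit rate in bits per second (4:4:4 means 3 samples per pixel)."""
    return width * height * samples_per_pixel * bit_depth * fps

# Assumed resolutions; the post only says "8k" and "FHD".
source_8k = raw_bitrate(7680, 4320, bit_depth=10, fps=50)
aiid_fhd = raw_bitrate(1920, 1080, bit_depth=8, fps=50)

print(f"8k 10-bit 4:4:4 at 50 fps: {source_8k / 1e9:.2f} Gb/s")
print(f"FHD 8-bit 4:4:4 at 50 fps: {aiid_fhd / 1e9:.2f} Gb/s")
print(f"ratio: {source_8k / aiid_fhd:.0f}x")
```

Under those assumptions the downscaled stream carries one twentieth of the raw samples (16 times fewer pixels, and 8 bits instead of 10) before any entropy coding is applied.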
https://arxiv.org/pdf/2401.13616v1
Divide the frames into groups of, say, 32x32x32 (32x32-pixel blocks spanning 32 frames).
Apply the 3D DCT to each group, as ProRes does with the 2D DCT for single images, and transmit the coefficients using FLIIC.
See:
https://arxiv.org/abs/2304.00299
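As a rough illustration of the transform step only, here is a sketch of a separable 3D DCT on one 32x32x32 group, assuming NumPy is available and using an orthonormal DCT-II. The helper names are hypothetical, and the FLIIC coding of the coefficients is not modelled here.

```python
import numpy as np

def dct_matrix(n):
    """Orthonormal DCT-II matrix: row k is the k-th cosine basis vector."""
    k = np.arange(n)[:, None]
    i = np.arange(n)[None, :]
    m = np.sqrt(2.0 / n) * np.cos(np.pi * (2 * i + 1) * k / (2 * n))
    m[0, :] = np.sqrt(1.0 / n)
    return m

def _apply_along_axes(block, inverse):
    """Apply the 1D transform (or its transpose) along time, height and width in turn."""
    out = np.asarray(block, dtype=float)
    for axis, n in enumerate(out.shape):
        m = dct_matrix(n).T if inverse else dct_matrix(n)
        # Contract the transform with the chosen axis, then put that axis back in place.
        out = np.moveaxis(np.tensordot(m, out, axes=(1, axis)), 0, axis)
    return out

def dct3(block):
    """Forward 3D DCT of a (frames, height, width) block."""
    return _apply_along_axes(block, inverse=False)

def idct3(coeffs):
    """Inverse 3D DCT; the matrices are orthonormal, so the inverse is the transpose."""
    return _apply_along_axes(coeffs, inverse=True)

# Hypothetical test data: one random 32x32 frame repeated for 32 frames (a static scene).
rng = np.random.default_rng(0)
frame = rng.integers(0, 256, size=(32, 32)).astype(float)
static_block = np.repeat(frame[None, :, :], 32, axis=0)
coeffs = dct3(static_block)
significant = np.count_nonzero(np.abs(coeffs) > 1e-6)
print(significant, "of", coeffs.size, "coefficients are non-negligible")
```

For a completely static scene, every coefficient with a non-zero temporal frequency vanishes, so only the first temporal plane needs to be sent. That is the inter-frame redundancy the 3D transform is meant to expose.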
As the above paper shows, this alone can reduce bitrate considerably. Some of the lower bits can be truncated if further reduction is required.
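Truncating lower bits can be sketched as below. I am assuming the values are integers (quantised coefficients or samples) and that truncation is applied to the magnitude so positive and negative values are treated symmetrically; neither detail comes from the paper.

```python
def truncate_low_bits(value, bits):
    """Zero the `bits` least significant bits of an integer's magnitude."""
    magnitude = (abs(value) >> bits) << bits
    return magnitude if value >= 0 else -magnitude

def max_truncation_error(bits):
    """Largest possible absolute error introduced by dropping `bits` low bits."""
    return (1 << bits) - 1

# Hypothetical 10-bit sample value.
sample = 0b1011011011            # 731
reduced = truncate_low_bits(sample, 2)
print(sample, "->", reduced, "error", sample - reduced)
```

Each dropped bit roughly doubles the worst-case error (1, 3, 7, ... for 1, 2, 3 bits), so the number of bits removed is a direct trade between bitrate and how far the result moves away from lossless.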
Since we combine this with ISIZE and AIID, training FLIIC would need two sets of training images: one preprocessed by ISIZE and AIID, and a de-noised version of the same images, also preprocessed by ISIZE and AIID.
Indeed, the large redundancies between frames mean that, from a group of 32 frames, a CNN can give a reasonable prediction of the next 32 frames:
https://arxiv.org/pdf/2206.05099
The encoder feeds the predicted and real frame blocks into a CRNN-type network, efficiently transmitting the difference to the decoder.
The encoder determines whether sending the 32 frames or the CRNN-encoded difference is more efficient and sends a bit at the start of each frame sequence to indicate which was used.
If some lower bits were removed, the result is no longer lossless, so the images could be enhanced by predicting inter-frame residuals every five frames using the previously detailed method.
This is just one possible all-AI codec. However, it faces a big problem: the computation power required. All-AI codecs are expected to be common in 10 years, but they will use much more powerful processors than are currently in use.
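The mode decision could look something like this sketch. The two payloads stand in for the already-encoded group (direct coding) and the CRNN-encoded difference; how they are produced is not modelled. The names are hypothetical, and for simplicity the flag occupies a whole byte rather than a single bit.

```python
from dataclasses import dataclass

@dataclass
class GroupPacket:
    use_residual: bool  # the flag sent at the start of the frame sequence
    payload: bytes

def choose_mode(direct_payload: bytes, residual_payload: bytes) -> GroupPacket:
    """Pick whichever encoding is smaller; ties go to direct coding."""
    if len(residual_payload) < len(direct_payload):
        return GroupPacket(True, residual_payload)
    return GroupPacket(False, direct_payload)

def serialize(packet: GroupPacket) -> bytes:
    """Prefix the payload with the mode flag (one byte here, one bit in a real stream)."""
    return bytes([1 if packet.use_residual else 0]) + packet.payload

def parse(data: bytes) -> GroupPacket:
    """Decoder side: read the flag, then hand the rest to the matching decoder."""
    return GroupPacket(data[0] == 1, data[1:])

# Hypothetical payload sizes: the prediction was good, so the difference is small.
packet = choose_mode(direct_payload=b"\x00" * 1000, residual_payload=b"\x01" * 120)
stream = serialize(packet)
print("residual mode" if parse(stream).use_residual else "direct mode", len(stream), "bytes")
```

Because the decoder makes the same prediction from the previous group, the flag is the only side information needed to tell it which decoding path to take.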
I hope to write an insights article putting this all together.
Thanks
Bill