Rate-Perception Optimized Preprocessing for Video Coding

Chengqian Ma, Zhiqiang Wu, Chunlei Cai, Pengwei Zhang,
Yi Wang, Long Zheng, Chao Chen, Quan Zhou
Bilibili Inc.
Shanghai, China
{machengqian01, wuzhiqiang01, caichunlei, zhangpengwei,
wangyi, zhenglong, chenchao02, zhouquan}@[Link]

arXiv:2301.10455v1, 25 Jan 2023

Figure 1. Left: the deployment workflow of RPP: a single pass over the input frame (RPP preprocess, then a standard codec such as HEVC or AV1 with its usual settings, e.g., preset and CRF), compatible with all standard video codecs. Right: frame segments of (a) H.265 at 0.409 bpp vs. (b) RPP + H.265 at 0.332 bpp, both at MS-SSIM 0.99. Zoom in to see the details.
Abstract

In the past decades, much progress has been made in the video compression field, including both traditional and learning-based video codecs. However, few studies focus on using preprocessing techniques to improve rate-distortion performance. In this paper, we propose a rate-perception optimized preprocessing (RPP) method. We first introduce an adaptive Discrete Cosine Transform loss function which can save bitrate while keeping the essential high-frequency components. Furthermore, we combine several state-of-the-art techniques from low-level vision into our approach, such as a high-order degradation model, an efficient lightweight network design, and an Image Quality Assessment model. By jointly using these powerful techniques, our RPP approach achieves, on average, 16.27% bitrate saving with different video encoders such as AVC, HEVC, and VVC under multiple quality metrics. In the deployment stage, our RPP method is simple and efficient and requires no changes to the settings of video encoding, streaming, or decoding. Each input frame only needs to make a single pass through RPP before being sent into the video encoder. In addition, in our subjective visual quality test, 87% of users rated videos with RPP better than or equal to videos compressed by the codec alone, while the RPP videos save about 12% bitrate on average. Our RPP framework has been integrated into the production environment of our video transcoding services, which serve millions of users every day. Our code and model will be released after the paper is accepted.

1. Introduction

In recent years, the demand for online streaming of high-definition video has been growing rapidly and is expected to continue growing in the coming years. These streaming high-definition videos consume huge bandwidth, accounting for more than 80% of all consumer Internet traffic [14]. Therefore, it is essential to build a highly efficient video compression system that generates better video quality at a given bandwidth budget. Many video coding standards have been developed during the past decades, such as H.264 [46], H.265 [38], H.266 [8], and AOMedia Video 1 (AV1) [11]. These traditional methods are built on many handcrafted modules, such as block partitioning, the Discrete Cosine Transform (DCT) [2], and intra/inter prediction. While these handcrafted methods have achieved good rate-distortion performance, learned video compression methods [10, 28, 29], inspired by the success of deep neural networks in other fields of image processing, continue to attract attention. These learned methods claim to achieve comparable or even better performance than traditional codecs. However, most existing learned video compression methods increase complexity on both the encoder and decoder sides. The computationally heavy decoder makes deployment impractical, especially on end-user devices such as mobile phones and laptops. Some studies convert the essential components of standard hybrid video encoder designs into a trainable framework in order to optimize all the modules of the video encoder end to end [29, 49]. However, few studies have attempted to use preprocessing methods to improve the rate-distortion performance of video compression systems.

In this paper, we propose a rate-perception optimized preprocessor (RPP) that can efficiently optimize rate and visual quality at the same time in an independent single forward pass. In particular, we introduce an adaptive Discrete Cosine Transform (DCT) loss into the training stage of RPP. In addition, we incorporate the full-reference image quality assessment model MS-SSIM [45] into training to optimize the perceptual quality of the model. At the same time, we design a lightweight fully convolutional neural network with an attention mechanism to improve efficiency.

The contributions of our work can be summarized as follows:

• We introduce the adaptive Discrete Cosine Transform (DCT) loss, which reduces spatial redundancy while still keeping the high-frequency components that are important for the content. In our experiments, training with the adaptive DCT loss significantly saves bitrate while maintaining the visual quality of the video.

• We propose a rate-perception optimized preprocessor (RPP), a lightweight fully convolutional neural network with an attention mechanism. The RPP model balances perception and distortion by utilizing both the adaptive DCT loss and a reference-based IQA loss. We also introduce a higher-order degradation model into the training stage to enhance the visual quality of the preprocessed frame.

• Our approach can be easily plugged into the preprocessing pipeline of any standard video codec, such as AVC, HEVC, AV1, or VVC. Powered by our approach, these standard video codecs achieve better BD-rate performance without any changes or sacrifices in video encoding and decoding. Compared with state-of-the-art video codecs, our model reduces the BD-rate by about 16.27% on average under multiple quality metrics. Furthermore, our RPP model is extremely efficient, achieving 1080p at 87 FPS during inference, far beyond real time.

2. Related Work

2.1. Image Compression

In the past decades, many traditional image compression methods such as JPEG [40], JPEG2000 [13], and BPG [6] have been proposed. These methods reduce image size efficiently by exploiting hand-crafted techniques. One of the most important parts of those hand-crafted designs is a transform such as the DCT, which linearly maps pixels into the frequency domain. One advantage of the DCT is its energy compaction, which makes it easy to reduce the spatial redundancy of the image. After the transform, these methods quantize the corresponding coefficients and then perform entropy coding. Recently, thanks to DNNs, learning-based image compression methods [3, 4, 31] have achieved competitive or better performance than the traditional image compression codecs.

2.2. Video Compression

There is a long history of progress in video compression. During the past decades, several video coding standards have been proposed and widely used in the real world, such as H.264 [46], H.265 [38], H.266 [8], and AOMedia Video 1 (AV1) [11]. With the continuous development of video coding standards, these traditional video compression methods provide strong performance and have made significant improvements. With hardware support, these methods are also practical for real-world applications such as online video streaming and digital TV. In recent years, many DNN-based methods have been proposed for individual parts of video coding, such as intra prediction and residual coding [10], mode decision [28], and entropy coding. Those methods improve the performance of one specific module of the traditional video codec. Instead of replacing a particular component of the traditional codec, some approaches focus on end-to-end optimized video compression frameworks [29, 49]. In addition, A. Chadha et al. [9] convert the essential components of standard video encoder designs into a trainable framework and jointly optimize a preprocessor with the differentiable framework in an end-to-end manner.

2.3. Metrics

In the past decades, Peak Signal-to-Noise Ratio (PSNR) was the most widely used full-reference method for assessing video fidelity and quality, and it continues to play a fundamental role in evaluating video compression algorithms. However, PSNR has been shown to correlate poorly with human perception [17, 43]. Thus, a variety of full-reference image quality assessment (IQA) and video quality assessment (VQA) metrics have been proposed [5, 25, 36, 44]. For
example, the Structural Similarity (SSIM) index [44] estimates perceptual distortions by considering structural information, and its variant, Multi-Scale SSIM (MS-SSIM) [45], provides better performance and more flexibility by incorporating multi-scale processing. Video Multi-method Assessment Fusion (VMAF) [25, 26] is another mainstream evaluation metric in industry: many well-known commercial companies such as Netflix [26], Meta [34], TikTok [48], and Intel [22], as well as standardization bodies such as AOMedia [12], adopt it for video codec evaluation. VMAF combines three quality features, Visual Information Fidelity (VIF) [36], the Detail Loss Metric (DLM) [23], and motion, to train a Support Vector Machine (SVM) regressor [15] that predicts the subjective score of video quality. Many studies have demonstrated that VMAF correlates remarkably better with the Mean Opinion Score (MOS) than SSIM and PSNR [5, 33, 47].

2.4. Image Enhancement

Image enhancement has been a long-standing problem because of its practical value in all kinds of vision applications. Recently, with the development of deep learning techniques such as network design and gradient-based optimization, learning-based methods [1, 42] have shown promising performance in various fields of image enhancement, including super-resolution, denoising, and deblurring. Some methods [20, 27] aim at real-time image super-resolution with well-designed lightweight CNNs that obtain good results with limited computational effort. Other approaches [16, 42] focus on designing degradation models that capture the complex degradation process of an image. Wang et al. [42] use a high-order degradation process to simulate complex real-world degradations. While much great work has been done in the image enhancement field, few works apply these methods and techniques to video coding.

3. Proposed Method

3.1. Overview

In this section, we give a brief overview of our rate-perception optimized preprocessing (RPP) method. The goal of our preprocessing model is to produce a preprocessed input frame that is optimized for both rate and perception via a learnable preprocessing neural network. Specifically, to balance rate and distortion, we design an adaptive DCT loss that reduces spatial redundancy while keeping the high-frequency components essential for perception. For the perception optimization part, we aim to perceptually enhance the preprocessed input frame by using the full-reference IQA model SSIM, which we employ as a loss function in our training procedure. In addition, we use a higher-order degradation modeling process to simulate complex real-world degradations [42]. By using this higher-order degradation method to generate pairs of training data, our preprocessing network learns to handle complicated real-world degradations, which also improves the perceptual quality of the network output. Furthermore, for the sake of performance and efficiency, we construct a lightweight fully convolutional neural network with a channel-wise attention mechanism [18]. In the deployment framework, a given video frame f_i simply makes a single forward pass through the RPP network. The processed frame f_o from the RPP network can then be encoded by a standard video codec, such as an AVC [46], HEVC [38], VVC [8], or AV1 [11] encoder.

3.2. Adaptive Discrete Cosine Transform Loss

Although many years have passed since the DCT was first introduced in image/video compression algorithms, DCT-like transforms are still the mainstream transforms today because of their high effectiveness and ease of use. Generally, the basis function of the two-dimensional (2D) DCT can be written as:

B_{h,w}^{i,j} = \cos\left(\frac{h\pi}{H}\left(i + \frac{1}{2}\right)\right) \cos\left(\frac{w\pi}{W}\left(j + \frac{1}{2}\right)\right)    (1)

Then the 2D DCT is formulated as:

F_{h,w} = \sum_{i=0}^{H-1} \sum_{j=0}^{W-1} f_{i,j} B_{h,w}^{i,j},  s.t. h \in \{0, 1, ..., H-1\}, w \in \{0, 1, ..., W-1\}    (2)

where F \in R^{H \times W} is the 2D DCT frequency spectrum, f \in R^{H \times W} is the input frame, H is the height of f, and W is its width. Normally, height and width are the same, and H and W are usually both denoted as N.

Given an input frame f, the DCT converts blocks of pixels into same-sized blocks of frequency coefficients. As mentioned, a crucial property of the DCT is that the blocks of frequency coefficients separate the high-frequency components from the low-frequency ones. In an image, most of the energy is concentrated in the lower frequencies, so traditional compression algorithms simply throw away higher-frequency coefficients to reduce spatial redundancy. However, some high-frequency components also play a very important role in the visual quality of the whole frame. Therefore, we introduce the adaptive DCT loss for video preprocessing. First, we use the DCT to transform a frame f into the frequency domain. Second, we select the frequency coefficients belonging to the high-frequency components via a mask I following the zigzag traversal order:

F'_{h,w} = F_{h,w} \cdot I_{h,w}    (3)
Figure 2. Example framework of training RPP: a first- and second-order degradation pipeline (blur, resize, noise, JPEG) produces degraded inputs from sharpened and enhanced ground truth; the RPP network (built from sub-pixel convolution blocks) is trained with the DCT loss on the CRF-coded prediction and with L1, SSIM, and perceptual reconstruction losses against the ground truth. (a) is the histogram of frequency coefficients of the predicted frame; (b) is the histogram of frequency coefficients filtered by the adaptive DCT function.
where

I_{h,w} = 0 if (h + w) < S; 1 if (h + w) \ge S    (4)
S \in \{0, 1, ..., (H-1) + (W-1)\}

In the DCT frequency domain, the magnitude of a frequency coefficient indicates how much energy that frequency component carries in the whole frame. If a frequency component has less energy, it is relatively less essential for reconstructing the frame. We therefore want to discard high-frequency components with relatively small coefficient magnitudes. To do so, we average the absolute values of the selected coefficients F'_{h,w} to obtain a threshold T:

T = \frac{1}{H \cdot W} \sum_{h+w \ge N} |F'_{h,w}|    (5)

If |F'_{h,w}| is smaller than the threshold T, the coefficient has a below-average effect on reconstructing the frame, and we select it into another set of coefficients F''_{h,w}. Finally, we compute the mean absolute error between the filtered DCT frequency coefficients F''_{h,w} and zero:

L_{dct} = \sum_{h+w \ge N} |F''_{h,w} - 0|,  where F''_{h,w} \in \{F'_{h,w} : |F'_{h,w}| < T\}    (6)

By using this loss function during training, the model is optimized to preserve the essential high-frequency components and discard trivial ones. With this optimization, frames processed by the model lead the video encoder to allocate more bits to the important high-frequency components such as edges and high-contrast areas. Meanwhile, since the adaptive DCT loss drives trivial high-frequency coefficients to zero, it also benefits the entropy coding process [19, 35], which consumes much less bitrate on runs of consecutive zeros.

3.3. Network and Image Degradation

Inspired by lightweight network architectures from the image enhancement field, we adopt a few ideas from them [20, 27]. Specifically, based on a feature extraction block like RFDB [27], we add a channel attention mechanism [18] so that the network pays more attention to different channel frequencies. Moreover, we use the efficient sub-pixel convolution first introduced by Shi et al. [37] to downscale and upscale the resolution of feature maps. The overall network architecture is shown in Fig. 4.

How the degradation of the training data is modeled is important for improving visual quality during network training. We include several general degradation methods [16] in our degradation model: blur, noise, resizing, and JPEG compression. For blur, we use isotropic and anisotropic Gaussian filters. For noise, we choose two commonly used noise types, Gaussian and Poisson. For resizing, we use both upsampling and downsampling with several resize algorithms, including area, bilinear, and bicubic operations. Since in real-world applications the input frames of our framework are mostly decoded from compressed video, we also add a video compression degradation, which can introduce blocking and ringing artifacts in both the spatial and temporal domains. As mentioned before, high-order degradation modeling [42] has been proposed to better simulate complex real-world degradations, and we utilize this idea in our image degradation model as well. By generating training pairs with these degradation models, our objective is to give the model the ability to remove common noise and compression noise, which also optimizes the rate, because video codecs cannot encode noise well.

3.4. Loss Functions

Our target is to train the preprocessing network by optimizing rate and perception at the same time. To optimize both rate and perception of the reconstructed frame \hat{f} relative to the input frame f, we combine the adaptive DCT loss L_{dct}, a reconstruction loss L_r, and a perceptual loss L_p. L_{dct} is the loss introduced in Eq. 6 to optimize rate and distortion. For the reconstruction loss L_r, we want to ensure the basic reconstruction ability of the model, so we adopt the L1 distance:

L_r = \frac{1}{HW} \sum_{i=0}^{H-1} \sum_{j=0}^{W-1} |f^{GT}_{i,j} - \hat{f}_{i,j}|    (7)

in which f^{GT} is the processed ground truth of f. It is common knowledge that contrast and edges in high-frequency areas correlate strongly with human perception. Multi-scale structural similarity (MS-SSIM) [45] is known to preserve structural information and contrast well in high-frequency regions. Thus, we adopt MS-SSIM for the perceptual loss:

L_p = 1 - L_{MS\text{-}SSIM}(\hat{f}, f^{GT})    (8)

Combining L_{dct}, L_r, and L_p, our overall loss function is:

L_{all} = \lambda_1 L_{dct} + \lambda_2 L_p + L_r    (9)

where \lambda_1 and \lambda_2 are the rate and perceptual coefficients, respectively.

4. Experiments

4.1. Experiments Setup

Datasets. We adopt the DIV2K and Flickr2K datasets [1] for training our RPP model; DIV2K has 1000 high-definition 2K-resolution images and Flickr2K has 2650 2K-resolution images. To evaluate the performance of our proposed method, we test it on the UVG dataset [30], the HEVC Standard 1080p Test Sequences [7], and the MCL-JCV dataset [41]. With their diverse content, these datasets are widely used to evaluate the performance of video compression algorithms.

Implementation Details. We train our RPP model in two stages. In the first, warm-up stage, we train the model on the reconstruction loss L_r using the Adam optimizer [21] with an initial learning rate of 1e-3, \beta_1 = 0.9, and \beta_2 = 0.999. The mini-batch size is 32. The training resolution is 128 x 128, randomly cropped from the original images in the datasets. After 600K warm-up iterations, we switch to the overall loss L_{all} with \lambda_1 = 10 and \lambda_2 = 0.1, and adjust the learning rate to 1e-4. For the adaptive DCT loss, we use both N = 8 and N = 16 to train the network at the same time, since the most common macroblock sizes in traditional video codecs are 8 and 16. With this setting, we train our RPP model for another 700K iterations until it converges. The training data of both stages are augmented by our two-order image degradation model. The whole training framework is implemented in PyTorch [32], and it takes only about one day to train the network on two NVIDIA GeForce RTX 3090s. In the deployment stage, the input frame is first sent through the deployed RPP model. We introduce a hyper-parameter \alpha to control the preprocessing intensity, for cases that do not require intensive preprocessing under our pretrained model setting and are sensitive to all high-frequency information in the video. The value of \alpha is set empirically from experiments. The preprocessed frame can be written as:

f_p = \alpha f_o + (1 - \alpha) f_i    (10)

where f_o is the output frame of the RPP model and f_i is the input frame. The preprocessed frame is then encoded by a standard video codec. Importantly, benefiting from our network design, our RPP model achieves 87.7 FPS inference for 1080p video when deployed with TensorRT [39] on a single NVIDIA GeForce RTX 3090. The inference performance at 720p and 4K is 185 FPS and 22.6 FPS, respectively.

Evaluation Method. To measure the performance of our proposed method, we use two evaluation metrics: MS-SSIM and VMAF. MS-SSIM is the most common metric in academic video coding, and VMAF is a mainstream perceptually oriented metric in the video-streaming industry. We test our proposed method with AVC/H.264, HEVC/H.265, VVC/H.266, and AV1, which cover all the popular standard video codecs.
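To make the adaptive DCT loss of Sec. 3.2 concrete, here is a minimal single-block NumPy sketch of Eqs. (4)-(6) with the N = 8/16 setting described above; the function name and the simplification of applying it to one block at a time are our assumptions, not the released implementation:

```python
import numpy as np

def adaptive_dct_loss(F, N):
    """Adaptive DCT loss (Eqs. 4-6) for one block of DCT coefficients F.

    Coefficients with h + w >= N are treated as high frequency (Eq. 4);
    those whose magnitude falls below the mean-magnitude threshold T
    (Eq. 5) form the set F'' that the loss drives toward zero (Eq. 6).
    """
    H, W = F.shape
    hh, ww = np.meshgrid(np.arange(H), np.arange(W), indexing="ij")
    high = (hh + ww) >= N                  # Eq. (4): high-frequency mask I
    T = np.abs(F * high).sum() / (H * W)   # Eq. (5): mean-magnitude threshold
    trivial = high & (np.abs(F) < T)       # Eq. (6): the sub-threshold set F''
    return np.abs(F[trivial]).sum()        # MAE between F'' and zero
```

High-frequency coefficients above the threshold (salient edges, contrast) contribute nothing to the loss and are left intact, while the sub-threshold tail is pushed to zero, which is what lets the entropy coder spend fewer bits on runs of zeros.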
Figure 3. (a) Rate-distortion curves for the UVG, MCL-JCV, and HEVC Class B datasets on MS-SSIM and VMAF. Curves are plotted for the standard codec and RPP + standard codec. The corresponding BD-rates for our proposed method are reported in Tables 1, 2, and 3, respectively, for each dataset. (b) Top: rate-distortion curves for H.264 and H.265 with the medium preset on the UVG dataset on MS-SSIM. Bottom: rate-distortion curves for H.266 on the UVG and HEVC Class B datasets on MS-SSIM.
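For reference, the BD-rate figures reported below follow the standard Bjøntegaard calculation, which is not spelled out in the text: fit log-bitrate as a cubic polynomial of the quality score for both rate-distortion curves, integrate both fits over the overlapping quality range, and convert the average log-rate gap into a percentage. A compact NumPy version (ours, for illustration only) is:

```python
import numpy as np

def bd_rate(rate_ref, qual_ref, rate_test, qual_test):
    """Bjøntegaard delta rate (%): average bitrate change of the test
    RD curve relative to the reference at equal quality (negative = saving)."""
    lr_ref = np.log10(np.asarray(rate_ref, dtype=float))
    lr_test = np.log10(np.asarray(rate_test, dtype=float))
    # Cubic fit of log-rate as a function of quality (MS-SSIM, VMAF, ...)
    p_ref = np.polyfit(qual_ref, lr_ref, 3)
    p_test = np.polyfit(qual_test, lr_test, 3)
    lo = max(np.min(qual_ref), np.min(qual_test))
    hi = min(np.max(qual_ref), np.max(qual_test))
    # Integrate both fits over the overlapping quality interval
    int_ref = np.polyint(p_ref)
    int_test = np.polyint(p_test)
    avg_diff = ((np.polyval(int_test, hi) - np.polyval(int_test, lo))
                - (np.polyval(int_ref, hi) - np.polyval(int_ref, lo))) / (hi - lo)
    return (10.0 ** avg_diff - 1.0) * 100.0
```

A test curve whose bitrate is uniformly 10% lower at every quality point yields a BD-rate of -10%.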
4.2. Experiments Results

In this section, we show the experimental results comparing standard video codecs with our RPP + standard video codecs. We fix \alpha = 0.5 in Eq. 10 for the HEVC dataset and the MCL-JCV dataset, and \alpha = 1 for the UVG dataset. The results in Figure 3(a) and Tables 1, 2, and 3 show that our proposed method clearly improves the BD-rate under both metrics with standard codecs over all three datasets. The average saving of RPP + H.264 is 18.21% under VMAF and 8.73% under MS-SSIM over the three datasets. The average saving of RPP + H.265 is 24.62% under VMAF and 13.51% under MS-SSIM. Some learning-based video encoders [29] have been shown to outperform traditional standard codecs only under the 'veryfast' preset. To demonstrate the generalizability of our approach, we also test RPP with the 'medium' preset. As shown in the top figure of Figure 3(b), our approach still outperforms the standard codecs, consistent with the 'veryfast' preset results in Figure 3(a). Furthermore, we test our RPP approach with H.266 on the UVG and HEVC Class B datasets. As shown in the bottom figure of Figure 3(b), the average saving of RPP + H.266 is 8.42% under MS-SSIM over the two datasets. As expected, our approach achieves significant gains when jointly used with all the mainstream standard codecs. In addition, our method yields a lower bitrate than the standard codec at the same Quantization Parameter (QP), which demonstrates the bit-saving ability of our approach.

Table 1. BD-rates on the UVG dataset for RPP+H.264 and RPP+H.265 with the 'veryfast' and 'medium' presets.

                          VMAF    MS-SSIM
RPP+H.264 (veryfast)    -26.92      -4.86
RPP+H.265 (veryfast)    -39.77      -8.70
RPP+H.264 (medium)      -27.30      -5.60
RPP+H.265 (medium)      -39.24      -9.58

Table 2. BD-rates on the MCL-JCV dataset for RPP+H.264 and RPP+H.265.

              VMAF    MS-SSIM
RPP+H.264   -15.88      -9.59
RPP+H.265   -19.14     -11.93

Table 3. BD-rates on the HEVC Class B dataset for RPP+H.264 and RPP+H.265.

              VMAF    MS-SSIM
RPP+H.264   -11.84     -11.75
RPP+H.265   -14.94     -19.90

4.3. Ablation Study and Analysis

Effectiveness of Adaptive DCT Loss. To investigate the effect of the adaptive DCT loss function, we set \lambda_1 = 0 in L_{all} so that the adaptive DCT loss does not affect the optimization during training. We run this ablation on the UVG dataset. As shown in Figure 4(a), the adaptive DCT loss brings a 3.05% BD-rate saving on H.264 and 6.04% on H.265 under MS-SSIM, which has a
very impressive effect. Compared to the BD-rate savings in Table 1, it contributes over 60% of the bitrate savings of the whole approach.

Choice and Analysis of Hyper-parameter \alpha. We test our approach on the HEVC Class B and MCL-JCV datasets with different \alpha values (0.2, 0.5, 0.8, 1.0) in Eq. 10. From Figure 4(b), we can see that \alpha = 0.5 gives the best rate-distortion curve among these values. As mentioned before, \alpha controls the preprocessing intensity of our approach. From our perspective, there are two reasons to have a hyper-parameter controlling the intensity. First, our model is trained with a fixed setting on a small public dataset, which means the data is not diverse enough. Second, some videos are extremely sensitive to the high-frequency components, and our fixed-setting pretrained model may over-preprocess them.

Figure 4. (a) Ablation study of the adaptive DCT loss on the UVG dataset. (b) Ablation study of the hyper-parameter \alpha on the HEVC Class B and MCL-JCV datasets.

5. Conclusion

In this paper, we propose a rate-perception optimized preprocessing (RPP) method that generates a rate-optimized and perceptually enhanced frame via a neural network for video coding. In the deployment stage, our RPP approach is plug-and-play with standard video codecs, without requiring any changes to encoding or decoding settings. In addition, our proposed method is very efficient and achieves far beyond real-time performance. As shown in the experimental results, our RPP approach achieves considerable and consistent gains with all mainstream standard video codecs on different metrics.

References

[1] Eirikur Agustsson and Radu Timofte. NTIRE 2017 challenge on single image super-resolution: Dataset and study. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, July 2017. 3, 5
[2] N. Ahmed, T. Natarajan, and K. R. Rao. Discrete cosine transform. IEEE Transactions on Computers, C-23(1):90–93, 1974. 1
[3] Johannes Ballé, Valero Laparra, and Eero P. Simoncelli. End-to-end optimized image compression. arXiv preprint arXiv:1611.01704, 2016. 2
[4] Johannes Ballé, David Minnen, Saurabh Singh, Sung Jin Hwang, and Nick Johnston. Variational image compression with a scale hyperprior. arXiv preprint arXiv:1802.01436, 2018. 2
[5] Nabajeet Barman, Steven Schmidt, Saman Zadtootaghaj, Maria G. Martini, and Sebastian Möller. An evaluation of video quality assessment metrics for passive gaming video streaming. In Proceedings of the 23rd Packet Video Workshop, pages 7–12, 2018. 2, 3
[6] Fabrice Bellard. BPG image format (2014). Volume, 1:2, 2016. 2
[7] Frank Bossen et al. Common test conditions and software reference configurations. JCTVC-L1100, 12(7), 2013. 5
[8] Benjamin Bross, Ye-Kui Wang, Yan Ye, Shan Liu, Jianle Chen, Gary J. Sullivan, and Jens-Rainer Ohm. Overview of the versatile video coding (VVC) standard and its applications. IEEE Transactions on Circuits and Systems for Video Technology, 31(10):3736–3764, 2021. 1, 2, 3
[9] Aaron Chadha and Yiannis Andreopoulos. Deep perceptual preprocessing for video coding. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 14852–14861, 2021. 2
[10] Tong Chen, Haojie Liu, Qiu Shen, Tao Yue, Xun Cao, and Zhan Ma. DeepCoder: A deep neural network based video compression. In 2017 IEEE Visual Communications and Image Processing (VCIP), pages 1–4. IEEE, 2017. 1, 2
[11] Yue Chen, Debargha Mukherjee, Jingning Han, Adrian Grange, Yaowu Xu, Sarah Parker, Cheng Chen, Hui Su, Urvang Joshi, Ching-Han Chiang, et al. An overview of coding tools in AV1: The first video codec from the Alliance for Open Media. APSIPA Transactions on Signal and Information Processing, 9, 2020. 1, 2, 3
[12] Yue Chen, Debargha Mukherjee, Jingning Han, Adrian Grange, Yaowu Xu, Zoe Liu, Sarah Parker, Cheng Chen, Hui Su, Urvang Joshi, et al. An overview of core coding tools in the AV1 video codec. In 2018 Picture Coding Symposium (PCS), pages 41–45. IEEE, 2018. 3
[13] Charilaos Christopoulos, Athanassios Skodras, and Touradj Ebrahimi. The JPEG2000 still image coding system: An overview. IEEE Transactions on Consumer Electronics, 46(4):1103–1127, 2000. 2
[14] U. Cisco. Cisco annual internet report (2018–2023) white paper. Cisco: San Jose, CA, USA, 2020. 1
[15] Corinna Cortes and Vladimir Vapnik. Support-vector networks. Machine Learning, 20(3):273–297, 1995. 3
[16] Michael Elad and Arie Feuer. Restoration of a single superresolution image from several blurred, noisy, and undersampled measured images. IEEE Transactions on Image Processing, 6(12):1646–1658, 1997. 3, 4
[17] Bernd Girod. What's wrong with mean-squared error? Digital Images and Human Vision, pages 207–220, 1993. 2
[18] Jie Hu, Li Shen, and Gang Sun. Squeeze-and-excitation networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 7132–7141, 2018. 3, 4
[19] David A. Huffman. A method for the construction of minimum-redundancy codes. Proceedings of the IRE, 40(9):1098–1101, 1952. 4
[20] Zheng Hui, Xinbo Gao, Yunchu Yang, and Xiumei Wang. Lightweight image super-resolution with information multi-distillation network. In Proceedings of the 27th ACM International Conference on Multimedia, pages 2024–2032, 2019. 3, 4
[21] Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014. 5
[22] Faouzi Kossentini, Hassen Guermazi, Nader Mahdi, Chekib Nouira, Amir Naghdinezhad, Hassene Tmar, Omar Khlif, Phoenix Worth, and Foued Ben Amara. The SVT-AV1 encoder: Overview, features and speed-quality tradeoffs. Applications of Digital Image Processing XLIII, 11510:469–490, 2020. 3
[23] Songnan Li, Fan Zhang, Lin Ma, and King Ngi Ngan. Image quality assessment by separately evaluating detail losses and additive impairments. IEEE Transactions on Multimedia, 13(5):935–949, 2011. 3
[24] Yawei Li, Kai Zhang, Radu Timofte, Luc Van Gool, Fangyuan Kong, Mingxi Li, Songwei Liu, Zongcai Du, Ding Liu, Chenhui Zhou, et al. NTIRE 2022 challenge on efficient super-resolution: Methods and results. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 1062–1102, 2022.
[25] Zhi Li, Anne Aaron, Ioannis Katsavounidis, Anush Moorthy, and Megha Manohara. Toward a practical perceptual video quality metric. The Netflix Tech Blog, 6(2), 2016. 2, 3
[26] Zhi Li, Christos Bampis, Julie Novak, Anne Aaron, Kyle Swanson, Anush Moorthy, and JD Cock. VMAF: The journey continues. Netflix Technology Blog, 25, 2018. 3
[27] Jie Liu, Jie Tang, and Gangshan Wu. Residual feature distillation network for lightweight image super-resolution. In European Conference on Computer Vision, pages 41–55. Springer, 2020. 3, 4
[33] Reza Rassool. VMAF reproducibility: Validating a perceptual practical video quality metric. In 2017 IEEE International Symposium on Broadband Multimedia Systems and Broadcasting (BMSB), pages 1–2. IEEE, 2017. 3
[34] Shankar L. Regunathan, Haixiong Wang, Yun Zhang, Yu Ryan Liu, David Wolstencroft, Srinath Reddy, Cosmin Stejerean, Sonal Gandhi, Minchuan Chen, Pankaj Sethi, et al. Efficient measurement of quality at scale in Facebook video ecosystem. In Applications of Digital Image Processing XLIII, volume 11510, pages 69–80. SPIE, 2020. 3
[35] J. Rissanen and G. G. Langdon. Arithmetic coding. IBM Journal of Research and Development, 23(2):149–162, 1979. 4
[36] Hamid R. Sheikh and Alan C. Bovik. Image information and visual quality. IEEE Transactions on Image Processing, 15(2):430–444, 2006. 2, 3
[37] Wenzhe Shi, Jose Caballero, Ferenc Huszár, Johannes Totz, Andrew P. Aitken, Rob Bishop, Daniel Rueckert, and Zehan Wang. Real-time single image and video super-resolution using an efficient sub-pixel convolutional neural network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1874–1883, 2016. 4
[38] Gary J. Sullivan, Jens-Rainer Ohm, Woo-Jin Han, and Thomas Wiegand. Overview of the high efficiency video coding (HEVC) standard. IEEE Transactions on Circuits and Systems for Video Technology, 22(12):1649–1668, 2012. 1, 2, 3
[39] Han Vanholder. Efficient inference with TensorRT. In GPU Technology Conference, volume 1, page 2, 2016. 5
[40] Gregory K. Wallace. The JPEG still picture compression standard. Communications of the ACM, 34(4):30–44, 1991. 2
[41] Haiqiang Wang, Weihao Gan, Sudeng Hu, Joe Yuchieh Lin, Lina Jin, Longguang Song, Ping Wang, Ioannis Katsavouni-
dis, Anne Aaron, and C-C Jay Kuo. Mcl-jcv: a jnd-based
[28] Zhenyu Liu, Xianyu Yu, Yuan Gao, Shaolin Chen, Xi- h. 264/avc video quality assessment dataset. In 2016 IEEE
angyang Ji, and Dongsheng Wang. Cu partition mode de- international conference on image processing (ICIP), pages
cision for hevc hardwired intra encoder using convolution 1509–1513. IEEE, 2016. 5
neural network. IEEE Transactions on Image Processing, [42] Xintao Wang, Liangbin Xie, Chao Dong, and Ying Shan.
25(11):5088–5103, 2016. 1, 2 Real-esrgan: Training real-world blind super-resolution with
[29] Guo Lu, Wanli Ouyang, Dong Xu, Xiaoyun Zhang, Chunlei pure synthetic data. In Proceedings of the IEEE/CVF Inter-
Cai, and Zhiyong Gao. Dvc: An end-to-end deep video com- national Conference on Computer Vision, pages 1905–1914,
pression framework. In Proceedings of the IEEE/CVF Con- 2021. 3, 5
ference on Computer Vision and Pattern Recognition, pages [43] Zhou Wang and Alan C Bovik. Mean squared error: Love
11006–11015, 2019. 1, 2, 6 it or leave it? a new look at signal fidelity measures. IEEE
[30] Alexandre Mercat, Marko Viitanen, and Jarno Vanne. Uvg signal processing magazine, 26(1):98–117, 2009. 2
dataset: 50/120fps 4k sequences for video codec analysis and [44] Zhou Wang, Alan C Bovik, Hamid R Sheikh, and Eero P Si-
development. In Proceedings of the 11th ACM Multimedia moncelli. Image quality assessment: from error visibility to
Systems Conference, pages 297–302, 2020. 5 structural similarity. IEEE transactions on image processing,
[31] David Minnen, Johannes Ballé, and George D Toderici. 13(4):600–612, 2004. 2, 3
Joint autoregressive and hierarchical priors for learned im- [45] Zhou Wang, Eero P Simoncelli, and Alan C Bovik. Multi-
age compression. Advances in neural information processing scale structural similarity for image quality assessment. In
systems, 31, 2018. 2 The Thrity-Seventh Asilomar Conference on Signals, Sys-
[32] Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, tems & Computers, 2003, volume 2, pages 1398–1402. Ieee,
James Bradbury, Gregory Chanan, Trevor Killeen, Zeming 2003. 2, 3, 5
Lin, Natalia Gimelshein, Luca Antiga, et al. Pytorch: An im- [46] Thomas Wiegand, Gary J Sullivan, Gisle Bjontegaard, and
perative style, high-performance deep learning library. Ad- Ajay Luthra. Overview of the h. 264/avc video coding stan-
vances in neural information processing systems, 32, 2019. dard. IEEE Transactions on circuits and systems for video
5 technology, 13(7):560–576, 2003. 1, 2, 3
[47] Fan Zhang, Angeliki V Katsenou, Mariana Afonso, Goce
Dimitrov, and David R Bull. Comparing vvc, hevc and av1
using objective and subjective assessments. arXiv preprint
arXiv:2003.10282, 2020. 3
[48] Han Zhang, Jizheng Xu, and Li Song. Video multimethod
assessment fusion based rate-distortion optimization for ver-
satile video coding. In 2021 IEEE International Conference
on Image Processing (ICIP), pages 2064–2068. IEEE, 2021.
3
[49] Saiping Zhang, Marta Mrak, Luis Herranz, Marc Górriz
Blanch, Shuai Wan, and Fuzheng Yang. Dvc-p: Deep video
compression with perceptual optimizations. In 2021 Inter-
national Conference on Visual Communications and Image
Processing (VCIP), pages 1–5. IEEE, 2021. 2