
A unified architecture for fast HEVC intra-prediction coding

2017, Journal of Real-Time Image Processing

The high efficiency video coding (HEVC) standard is the new video coding standard, which obtains over 50% bit rate savings compared with H.264/AVC for the same perceptual quality. Intra-prediction coding in HEVC achieves high coding performance at the expense of high computational complexity, due to the exhaustive evaluation of all available coding unit (CU) sizes, with up to 35 prediction modes for each CU, selecting the one with the lowest rate distortion cost, among other new features. This paper presents a Unified Architecture to form a novel fast HEVC intra-prediction coding algorithm, denoted as fast partitioning and mode decision. This approach combines a fast partitioning decision algorithm, based on decision trees trained using machine learning techniques, and a fast mode decision algorithm, based on a novel texture orientation detection algorithm that computes the mean directional variance along a set of co-lines with rational slopes using a sliding window over the prediction unit. Both proposed algorithms apply a similar approach, exploiting the strong correlation between several image features and the optimal CTU partitioning and the optimal prediction mode. The key point of the combined approach is that both algorithms compute the image features with low complexity, and the partition decision and the mode decision can also be taken with low complexity, using decision trees (if-else statements) and by selecting the minimum directional variance among a reduced set of directions. This approach can be implemented using any combination of nodes, obtaining a wide range of time savings, from 44 to 67%, with light penalties from 1.1 to 4.6%. Comparisons with similar state-of-the-art works show that the proposed approach achieves the best trade-off between complexity reduction and rate distortion.

J Real-Time Image Proc, DOI 10.1007/s11554-017-0685-4
ORIGINAL RESEARCH PAPER

Damian Ruiz · Gerardo Fernández-Escribano · José Luis Martínez · Pedro Cuenca

Received: 5 July 2016 / Accepted: 23 March 2017
© Springer-Verlag Berlin Heidelberg 2017
Affiliation: Instituto de Investigación en Informática de Albacete, Universidad de Castilla-La Mancha, Av. España S/N, 02071 Albacete, Spain

Keywords: HEVC · Intra-prediction · Machine learning · Texture orientation · Directional variance

1 Introduction

The new video coding standard, known as high efficiency video coding (HEVC) [1], has been approved by the Joint Collaborative Team on Video Coding (JCT-VC) working group, formed by the ITU and ISO organizations. HEVC has already replaced the successful H.264/AVC standard [2], especially for high-resolution formats beyond HD, such as the ultra-high-definition formats termed 4K and 8K. The new HEVC video coding tools make it possible to achieve bit rate savings of over 50% compared to H.264/AVC for the same objective video quality [3]. Furthermore, HEVC also outperforms other widespread video codecs used on the Internet, such as VP8 and VP9 [4]. The high performance of HEVC when using exclusively intra-picture tools for still image coding has aroused special interest, showing gains of around 40% compared with the successful JPEG-2000 standard [5].
There are several particular scenarios in which intra-picture coding is the optimal choice, such as still-picture photography storage, live TV interviews, where low latency is needed for natural communication between the interlocutors, and professional editing and post-production tasks commonly used in the TV and cinema industries, where high quality and fast access to the individual pictures are required. These production codecs, named mezzanine codecs, are highly relevant for the audio-visual industry, which demands two things: very high compression efficiency and a low computational burden. For this reason, the HEVC standard has approved a set of specific profiles that exclusively use the intra-prediction scheme, known as the "Main Still Picture" and "Main Intra" profiles, with support for different bit depths and chroma sampling formats. The high HEVC intra-coding performance is mainly attributable to two novel tools: the new flexible quad-tree picture partitioning [6], named the Coding Tree Unit (CTU), and the new high density of angular predictors for the mode decision [7]. The pictures are divided into CTUs, which can be recursively split to form a quad-tree structure with three new unit types: the coding unit (CU), the prediction unit (PU) and the transform unit (TU). The CUs, PUs and TUs cover a size range from 64 × 64 to 4 × 4, adapting the size of the coding units to the local image complexity, using the largest sizes for homogeneous regions and smaller sizes for complex textured areas. With the aim of achieving the best intra-coding performance, an exhaustive evaluation of all possible CU, PU and TU sizes and the full set of available prediction modes is carried out by using the well-known Rate Distortion Optimization (RDO) technique [8].
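The RDO principle referenced above can be illustrated with a minimal sketch: every candidate (mode, size) pair is scored with the Lagrangian cost J = D + λ·R, and the candidate with the lowest cost wins. The candidate list, the numbers, and the helper name `rdo_select` below are illustrative, not HEVC encoder output.

```python
# Rate-distortion optimization in its generic form: each candidate has a
# distortion D and a rate R, and the encoder picks the candidate that
# minimizes the Lagrangian cost J = D + lambda * R.
def rdo_select(candidates, lam):
    return min(candidates, key=lambda c: c["D"] + lam * c["R"])

# Illustrative numbers only (not produced by a real encoder).
candidates = [
    {"name": "Planar", "D": 120.0, "R": 10.0},
    {"name": "DC",     "D": 150.0, "R": 8.0},
    {"name": "V26",    "D": 90.0,  "R": 22.0},
]
print(rdo_select(candidates, lam=2.0)["name"])  # prints V26 (J = 134)
```

Note how the choice depends on λ: a larger λ penalizes rate more heavily, so the cheapest-to-signal mode can win even with higher distortion.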
While such flexibility leads to high compression efficiency, it comes at the expense of a huge computational burden, primarily due to the large number of available directional predictors and PU sizes [9], which is hindering the rapid adoption of HEVC by the professional market [10, 11]. Multiple approaches have been proposed in the literature for the reduction of HEVC complexity, many of which focus on reducing the number of block sizes to be evaluated by using a tree pruning scheme. Other approaches centre on detecting the most probable intra-direction, in order to avoid the evaluation of the full range of prediction modes by the RDO. The speed-up of intra-prediction coding can be achieved by applying advanced techniques that allow the decisions needed in the different stages of intra-prediction to be taken with low complexity. This approach constitutes the basis of this paper, which addresses the complexity reduction of HEVC intra-coding by using techniques not traditionally employed in video coding standards, such as machine learning (ML) and image processing algorithms for texture orientation detection. This paper presents a Unified Architecture to form a novel fast HEVC intra-prediction coding algorithm, denoted as Fast Partitioning and Mode Decision (FPMD). This approach combines, in a first stage, a Fast Partitioning Decision (FPD) algorithm, based on decision trees that are trained using ML techniques, and, in a second stage, a Fast Mode Decision (FMD) algorithm, based on a novel texture orientation detection algorithm, which computes the mean directional variance along a set of co-lines with rational slopes using a sliding window over the PU. Both algorithms apply a similar approach, exploiting the strong correlation between several image features and the optimal PU partitioning and the optimal prediction mode.
The key point of the combined approach is that both algorithms compute the image features with low complexity, and the partition decision and the mode decision can also be taken with low complexity, using decision trees and by selecting the minimum directional variance among a reduced set of directions. The rest of the paper is organized as follows. An overview of the HEVC intra-prediction scheme is given in Sect. 2. Section 3 presents a review of the fast intra-prediction approaches recently proposed in the literature. Section 4 presents the details of our Fast Partitioning Decision (FPD) algorithm and our Fast Mode Decision (FMD) algorithm, which form the Unified Architecture proposed for fast HEVC intra-prediction coding, while the experimental results are shown in Sect. 5. We summarize the conclusions in Sect. 6.

2 Technical background

HEVC can be considered an evolution of H.264/AVC, since it maintains the same block-based "hybrid" architecture used in previous video compression standards, applying inter-picture prediction for temporal decorrelation and intra-picture prediction for spatial image decorrelation. In addition, new tools have been introduced in HEVC that increase its coding efficiency compared to H.264/AVC, such as a new coding unit partitioning scheme named the Coding Tree Unit (CTU), a new angular intra-prediction algorithm, new transform sizes of 16 × 16 and 32 × 32, and a new filter in the decoding loop termed Sample Adaptive Offset (SAO). A detailed description of those tools and a general overview of the HEVC architecture can be found in [12]. The CTUs in intra-prediction can be iteratively partitioned into four square sub-blocks of half resolution; thus, a CTU can be considered a hierarchical tree where each branch ends in a node, which determines the CUs. Each CU is by itself a new root of two new trees that contain the PUs and TUs. The maximum CTU size is 64 × 64 pixels, allowing CU sizes in the range from 64 × 64 down to 8 × 8 pixels.
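Under this recursive quad-tree, an exhaustive encoder touches every block at every depth. A quick count over the full 64 × 64 to 4 × 4 size range reproduces the per-CTU block count (the helper name `blocks_per_ctu` is ours):

```python
# Number of blocks evaluated per 64x64 CTU when every square size from
# max_size down to min_size is checked: a block of side `size` tiles the
# CTU (max_size // size)**2 times, i.e. 4**depth blocks at each depth.
def blocks_per_ctu(max_size=64, min_size=4):
    count, size = 0, max_size
    while size >= min_size:
        count += (max_size // size) ** 2
        size //= 2
    return count

print(blocks_per_ctu())  # 1 + 4 + 16 + 64 + 256 = 341
```

This is exactly the 1 + 4 + 16 + 64 + 256 = 341 blocks per CTU that an exhaustive evaluation must visit.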
The PU is the basic entity in intra-prediction; it takes the same size as its CU, and only for the smallest CU size (8 × 8) can it also be split into 4 × 4 sub-PUs. Consequently, the PUs cover the widest range of sizes, from 64 × 64 down to 4 × 4 pixels. Finally, the TUs can be partitioned and transformed using a tree structure, termed the Residual Quad Tree (RQT), with a maximum of three depth levels, allowing TU sizes from 32 × 32 down to 4 × 4. Intra-prediction achieves optimal coding performance by using an exhaustive evaluation of the 35 prediction modes over all possible partition sizes from 64 × 64 down to 4 × 4, which means the evaluation of 341 different blocks per CTU. HEVC exploits the high spatial correlation between the PU pixels and the pixels from the top row and left column of the neighbouring PUs. Those samples, denoted as Pref, are used for the construction of the directional predictors. Detailed information on intra-prediction coding can be found in [13]. The intra-prediction modes include two non-directional modes, namely DC and Planar, which achieve high coding efficiency in smooth gradient areas, and 33 angular modes for image areas with edge patterns. The HEVC angular modes are defined with 1/32 fractional precision between two integer pixel positions of Pref, and they are clustered into 16 horizontal modes, named H2 to H17, and 17 vertical modes, named V18 to V34. The angular predictors can also be classified into two categories: the first one is composed of the five modes whose orientations match the integer positions of the reference pixels. We call this first category Integer Position Modes (IPM). These are the horizontal H10, the vertical V26 and the three diagonal modes H2, V18 and V34. The second category includes the rest of the modes, whose orientations fall between two reference samples, and therefore their predictors are computed by interpolation of the two nearest Pref samples.
We call this second category Fractional Position Modes (FPM). Table 1 collects the details of the 33 angular modes, denoted as Mi, where ri is the angular mode orientation, defined as a rational slope r = ry/rx, and θi is the orientation angle, such that θi = arctan(ri). As can be noted, the IPMs have an integer slope ri, whereas the FPMs have a non-integer slope ri.

Table 1 Orientations and slopes of angular modes in HEVC

Mode Mi | Angle θi | Slope ri || Mode Mi | Angle θi | Slope ri
H2      | 5π/4     | -1       || V18     | 3π/4     | 1
H3      | 39π/32   | -13/16   || V19     | 23π/32   | 16/13
H4      | 38π/32   | -21/32   || V20     | 22π/32   | 32/21
H5      | 37π/32   | -17/32   || V21     | 21π/32   | 32/17
H6      | 36π/32   | -13/32   || V22     | 20π/32   | 32/13
H7      | 35π/32   | -9/32    || V23     | 19π/32   | 32/9
H8      | 34π/32   | -5/32    || V24     | 18π/32   | 32/5
H9      | 33π/32   | -1/16    || V25     | 17π/32   | 16/1
H10     | π        | 0        || V26     | π/2      | ∞
H11     | 31π/32   | 1/16     || V27     | 15π/32   | -16/1
H12     | 30π/32   | 5/32     || V28     | 14π/32   | -32/5
H13     | 29π/32   | 9/32     || V29     | 13π/32   | -32/9
H14     | 28π/32   | 13/32    || V30     | 12π/32   | -32/13
H15     | 27π/32   | 17/32    || V31     | 11π/32   | -32/17
H16     | 26π/32   | 21/32    || V32     | 10π/32   | -32/21
H17     | 25π/32   | 13/16    || V33     | 9π/32    | -16/13
        |          |          || V34     | π/4      | -1

Lines with rational slopes are commonly used in digital image processing, since a rational slope favours their definition in the discrete space ℤ², denoted as the integer lattice Λ ⊂ ℤ². Given a continuous line with rational slope r = ry/rx, it can be represented by (1):

y = r·x + d, ∀ r ∈ ℚ  (1)

Only if x takes integer values that are multiples of rx, that is, x = k·rx, ∀ k ∈ ℤ, does y reach an integer position in Λ. Accordingly, the distance between two points with integer positions belonging to a line with rational slope ri is determined by the rx, ry parameters. As can be observed in Table 1, the FPM modes H3, V19, H9, V25, H11, V27, H17 and V33 have rational slopes of ±m/16 and ±16/m, ∀ m = 1, 13, which means those lines are defined with at least two points in integer positions for PU sizes larger than 16 × 16, that is, the 32 × 32 and 64 × 64 PU sizes, but not for 16 × 16, 8 × 8 and 4 × 4.
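The claim about integer-position points can be checked numerically: a line with slope ±m/16 crosses an integer lattice column only every 16 samples, so an N-wide PU contains at least two such points only when N > 16. A small sketch (the helper name `integer_positions` is ours):

```python
# A line with rational slope ry/rx passes through integer lattice
# positions only at x = 0, rx, 2*rx, ...  Count how many of those
# columns fall inside an N x N PU (x = 0 .. N-1).
def integer_positions(rx, N):
    return len(range(0, N, abs(rx)))

for N in (4, 8, 16, 32, 64):
    print(N, integer_positions(16, N), integer_positions(32, N))
```

The run confirms the text: slopes with denominator 16 yield two or more integer-position points only for 32 × 32 and 64 × 64 PUs, and slopes with denominator 32 only for 64 × 64 PUs.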
The other 20 FPM modes in HEVC have rational slopes of ±m/32 and ±32/m, ∀ m = 5, 9, 13, 17, 21; thus, their lines are defined with two points in integer positions only for PU sizes larger than 32 × 32, which applies only to the 64 × 64 PU size. With the aim of reducing the computational complexity, the HEVC reference model [14], version HM 16.6, implements a low complexity intra-prediction algorithm, which is based on Piao et al.'s scheme [15]. The algorithm workflow is performed by means of two processes that are repeated for each PU: the Rough Mode Decision (RMD) and the RDO. The RMD evaluates the 35 prediction modes by computing a low complexity Lagrange cost function (J_HAD), which uses the Sum of Absolute Hadamard Transformed Differences (SATD). The N modes with the lowest J_HAD cost are selected as candidate modes to be evaluated by the RDO stage, where N equals 3 for the PU sizes 64 × 64, 32 × 32 and 16 × 16, and 8 for the other PU sizes. The RDO carries out the exhaustive PU encoding and decoding in order to compute the Lagrange cost J_mode, and the mode with the lowest cost is selected as the optimal prediction mode for each PU.

3 Related work

Recently, many proposals have been presented by the research community in order to alleviate the high computational burden of HEVC intra-coding, avoiding the exhaustive evaluation of the full number of size-mode combinations through the rate distortion optimization stage. According to the algorithm approach, the proposals can be classified into three categories. The first one evaluates all the prediction modes but reduces the number of CU sizes required to be checked by both the RMD and the RDO stages, mostly by limiting the depth of the CTU tree; these are commonly named tree pruning or early termination algorithms.
The second category reduces the number of angular prediction modes to be checked in the RMD stage as well as, eventually, the number of candidates in the rate distortion optimization stage, mainly based on content features. The last category combines both approaches, reducing both the number of directional modes and the candidate CU sizes to be checked. In the first group, the proposals can be classified into two different sub-categories. The first sub-category contains those in which the partitioning decision is taken exclusively on the basis of the RD cost value of the different CU sizes [16–18]. The approaches in the second sub-category [19, 20] mainly use the same scheme, but the optimal CTU partitioning prediction is based on some content features extracted from the CUs. The most popular approaches in the second group are based on pixel gradient detection in the spatial domain using the Sobel filter, followed by the Histogram of Oriented Gradients (HOG) computation. In [21] a few directional modes are selected as candidates for the RDO based on the HOG. Chen et al. [22] use a 2 × 2 filter to select the strong primary gradient direction of the PU, introducing the concept of a nonparametric approach to estimate the distribution density of the gradient histogram. Yan et al. [23] apply a pixel-based edge detection algorithm based on the sum of absolute differences along the angles of the prediction modes by using a two-tap interpolation method. In [24], a reduction in intra-modes was proposed using five different filters to detect the dominant edge of the 4 × 4 PUs, and a set of 11 prediction modes closest to the dominant edge is evaluated. The proposal of Yao et al. [25] reduces the number of prediction modes to be evaluated to eleven modes, or to only two modes, DC and Planar, depending on the standard deviation of the dominant edge. Finally, the proposals presented in [26, 27] fall into the last group.
Shen [26] suggested a fast partitioning decision algorithm that uses the correlation between the content and the optimal CTU tree depth, limiting the minimum and maximum depth levels. The algorithm is based on the observed evidence that small CUs tend to be chosen for richly textured regions, whereas large CUs are chosen for homogeneous regions. It computes a depth predictor by using the neighbouring tree blocks, and two early terminations for the prediction modes based on the statistics of neighbouring blocks and the RD cost of the candidates. The authors reported a 21% time reduction with a rate penalty of 1.7%. Using a similar approach, the fast intra-prediction algorithm presented in [27] is based on the RD cost difference between the first and second candidate modes computed in the RMD stage, which reduces the number of RDO candidates from N to one mode (Best Candidate) or three modes (Best Candidate, DC and MPM). In addition, if the RD cost of the best mode of the RDO stage is under a threshold, the algorithm applies an early termination, pruning the tree to avoid the evaluation of smaller CU sizes. The algorithm achieves a computational complexity reduction of 30.5% with an average performance drop of 1.2%.

4 Unified architecture for fast HEVC intra-prediction coding

As mentioned in Sect. 2, HEVC intra-prediction achieves a high level of performance at the expense of a huge computational burden, due to the high number of combinations involved in the intra-prediction process, comprising the evaluation of all possible prediction modes and the full range of CU sizes. With the aim of reducing the HEVC intra-prediction complexity, this section presents our Unified Architecture to form a novel fast HEVC intra-prediction coding algorithm, denoted as Fast Partitioning and Mode Decision (FPMD).
This approach combines, in a first stage, a Fast Partitioning Decision (FPD) algorithm for the CTU coding decision, based on decision trees that are trained using ML techniques, and, in a second stage, a Fast Mode Decision (FMD) algorithm, based on a novel texture orientation detection algorithm.

4.1 Stage 1: The fast partitioning decision (FPD) algorithm

4.1.1 Observations and motivation

The first stage of the Unified Architecture proposed in this paper is derived from the analysis of the computational complexity of the intra-prediction algorithm implemented in the HM reference software [14]. In some preliminary tests, one sequence from each class of the JCT-VC test sequences [28] was encoded using the HM 16.6 reference model, and the computing time of each intra-prediction stage was collected. The results for two QPs, QP22 and QP37, show that the complexity of the RMD and RDO stages makes up over 80% of the total intra-prediction computation, and the remainder of the time is spent in the RQT stage and other auxiliary tasks. This fact has motivated the design of our FPD algorithm, which replaces the brute force scheme used in HEVC through the RMD and RDO with a low complexity algorithm based on a fast CU size classifier, previously trained using an ML methodology. The classifier selects a sub-optimal CTU partitioning, thus avoiding the exhaustive evaluation of all available CU sizes. The proposed FPD approach is based on the fact that the CU partitioning decision can be taken using the local CU features, considering that there exists a strong correlation between the optimal partitioning size and the texture complexity of the CU. It is well known that homogeneous blocks achieve the best performance when encoded with large block sizes. Conversely, highly textured blocks can be efficiently encoded by using small sizes that adjust to the details of the image.
The proposed algorithm uses a binary classifier based on a decision tree with two classes, Split and Non-Split, at each of the top three depths of the CTU tree. The algorithm starts with the largest CU size, 64 × 64, and the decision tree takes the partitioning decision using several attributes of the CU.

4.1.2 Training data set

With the aim of obtaining a training set covering a wide range of content complexities, the Spatial Information (SI) and Temporal Information (TI) metrics were computed for all JCT-VC test sequences [28], according to the ITU-T P.910 recommendation [29]. Figure 1 shows the Spatio-Temporal (ST) information of the selected training set sequences (PeopleOnStreet, Traffic, SlideEditing, BQTerrace, BasketballDrillText, Cactus, BasketballDrive and ParkScene).

4.1.3 Attribute selection

The training sequences were used for the extraction of attributes from the CUs, and they were also encoded with the HM encoder with the aim of obtaining the optimal partitioning of each CTU. The CTU partitioning was obtained for the four distortion levels QP22, QP27, QP32 and QP37 recommended in the Common Test Conditions and Software Reference Configurations (CTCs) by the JCT-VC [28]. That CTU partitioning allows us to classify each CU into a binary class that takes the Split or Non-Split value. Many features can be extracted from a CU to describe its content using first- and second-order metrics in the spatial domain, such as the Mean, Standard Deviation, Skewness, Kurtosis, Entropy, Autocorrelation, Inertia, or Covariance [30]. There are also useful attributes describing the CU features that can be computed in the Fourier domain, such as the DC energy, the number of nonzero AC coefficients, or the mean and variance of the AC coefficients. Initially, we extracted a large number of such statistics commonly used in image processing and image classification.
The attribute selection was carried out using the open-source WEKA tool [31], an effective ML workbench that provides a set of strategies, denoted as Attribute Evaluators, to rank the usefulness of the attributes, in conjunction with different search algorithms. Considering both factors, attribute ranking and computational complexity, we finally selected the three attributes from the spatial domain with the best performance in terms of ranking evaluation, which are based on the variance and mean computation of the CU. They are the following: (1) the variance of the 2N × 2N CU, denoted as σ²₂N; (2) the variance of the variances of the four N × N sub-CUs, denoted as σ²[σ²N]; and (3) the variance of the means of the four N × N sub-CUs, denoted as σ²[μN]. In order to reduce the complexity of the variance computation, it is implemented with the one-pass variance algorithm, named the "textbook one-pass" algorithm by Chan et al. [32]. This approach does not require passing through the data twice, once to calculate the mean and again to compute the sum of squared deviations from the mean. Only the first picture of each selected sequence is used to train the classifier, which is considered enough to obtain a high number of representative CTU samples of blocks with homogeneous areas and blocks with highly detailed textured areas.

(Fig. 1: ST information of the training sequences, plotted as Temporal Index (TI) versus Spatial Index (SI).)

4.1.4 Decision tree specification

Prior to the training of the decision tree, and considering that it is one of the key factors in ML, we proceeded to the classifier selection. From among several classifiers, C4.5 [33] is a well-known classifier for general-purpose classification problems [34] as well as for video coding purposes [35].
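The attribute computation above can be sketched in a few lines, assuming plain Python lists of rows as pixel blocks (the names `one_pass_variance` and `cu_attributes` are ours): the "textbook one-pass" formula accumulates Σx and Σx² in a single sweep, giving the population variance as Σx²/n − (Σx/n)².

```python
def one_pass_variance(block):
    # "Textbook one-pass" variance [32]: a single sweep accumulates the
    # sum and the sum of squares (population variance, biased form).
    s = sq = 0.0
    n = 0
    for row in block:
        for p in row:
            s += p
            sq += p * p
            n += 1
    mean = s / n
    return sq / n - mean * mean

def cu_attributes(cu):
    # The three selected attributes of a 2N x 2N CU: its own variance,
    # the variance of the variances of its four N x N sub-CUs, and the
    # variance of the means of those sub-CUs.
    n2 = len(cu)
    n = n2 // 2
    quads = [[row[x:x + n] for row in cu[y:y + n]]
             for y in (0, n) for x in (0, n)]
    sub_vars = [one_pass_variance(q) for q in quads]
    sub_means = [sum(map(sum, q)) / (n * n) for q in quads]
    return (one_pass_variance(cu),
            one_pass_variance([sub_vars]),
            one_pass_variance([sub_means]))

flat = [[7] * 4 for _ in range(4)]
print(cu_attributes(flat))  # (0.0, 0.0, 0.0)
```

Note that the one-pass form trades a second data sweep for a subtraction of two accumulated sums; Chan et al. [32] analyse the numerical caveats of this trade-off.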
The most common mechanism for measuring the accuracy of a trained decision tree is the tenfold cross-validation process available in WEKA, which provides the prediction error measured in terms of misclassified instances or the percentage of Correctly Classified Instances (CCI). The first node, denoted as Node64, is the most critical node, because a wrong Non-Split decision at the highest partitioning level can cause a 64 × 64 CTU that should be divided into several smaller CUs not to be split, and therefore the compression efficiency will be reduced. We used 4231 instances for the training of Node64, corresponding to the 4231 64 × 64 CTUs of the 8 frames used from the 8 test sequences. These instances suffer from the well-known imbalance problem, because there are significantly more instances belonging to Split, over 80%, than to Non-Split, with just 8%, for QP22. To address the imbalance issue, prior to the training of the decision tree for Node64, an unsupervised random sub-sampling instance filter [36] available in WEKA was applied. The training results show that the CCI after the tenfold cross-validation step for the decision trees of Node64 is over 90%, which can be considered a high-accuracy classification. The decision tree for Node64 is shown in Fig. 2. Node64 is defined with three inner nodes, three rules, and one condition for each rule, with a specific threshold Thi (∀ i = 1, 2, 3) that determines the binary decision within the inner nodes. Node32 processes the split 32 × 32 CUs from Node64, where the decision to Split or Non-Split each CU is taken, forwarding the CUs classified as Split to Node16, and otherwise sending the Non-Split CUs to the intra-prediction stage for an exhaustive evaluation of the 32 × 32 and 16 × 16 PU sizes.
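Once trained, a decision tree of this kind compiles down to nested if-else statements, which is precisely where the runtime saving comes from. The sketch below follows one plausible reading of the Fig. 2 layout; the threshold values are hypothetical placeholders, not the trained Th1–Th3, and the function name is ours.

```python
# Hypothetical thresholds, standing in for the trained Th1..Th3 produced
# by the C4.5 training described in the text.
TH1, TH2, TH3 = 40.0, 900.0, 1500.0

def node64_decision(var64, var_of_subvars, var_of_submeans):
    # Node64 as nested if-else: a CTU with smooth, consistent content
    # (all three variance attributes low) is kept whole; anything else
    # is forwarded to Node32 as Split.
    if var_of_submeans <= TH1:
        if var_of_subvars <= TH2 and var64 <= TH3:
            return "Non-Split"   # encode directly over PU64/PU32 sizes
    return "Split"               # forward the four 32x32 CUs to Node32

print(node64_decision(100.0, 20.0, 5.0))  # prints Non-Split
```

The same shape, with thresholds Th4–Th5 and Th6–Th8, applies to Node32 and Node16, so the whole partitioning decision reduces to a handful of comparisons per CU.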
The total training data set size for this node is 16,924 instances, and, in this case, only the class distributions for QP22 and QP27 were imbalanced, because there were significantly more instances belonging to Split, over 60%, than to Non-Split, with around 30%. The training results show that the CCI after the tenfold cross-validation step for the decision trees of Node32 is in the range of 83–89%, which can still be considered a high-accuracy classification. The decision tree for Node32 is shown in Fig. 3. Based on the variance of the 32 × 32 CU and the variance of the variances of its four 16 × 16 sub-CUs, the decision tree classifies each CU into the Split or Non-Split class. Node32 is defined with two inner nodes and one condition for each rule, with a specific threshold Thi (∀ i = 4, 5) that determines the binary decision within the inner nodes.

Node16 processes the 16 × 16 CUs split from Node32, and a new decision to Split or Non-Split is taken. The 4231 CTUs that comprise the data set, belonging to the 8 test sequences, are divided into 16 × 16 CUs for the Node16 training, and therefore the total training data set size for this node is 67,696 instances. The training results obtained after the tenfold cross-validation step for the decision trees of Node16 are in the CCI range of 72–79%. The decision tree for Node16 is shown in Fig. 4. Node16 is defined with three inner nodes and one condition for each rule, with a specific threshold Thi (∀ i = 6, 7, 8), which determines the binary decision within the inner nodes.

(Fig. 2: Decision tree for Node64. The CTU attributes σ²[μ32], σ²[σ²32] and σ²64 are tested against Th1–Th3; CTUs classified as Non-Split are encoded with RMD + RDO + RQT over the PU64 and PU32 sizes, and the rest are forwarded to Node32.)
(Fig. 3: Decision tree for Node32. The 32 × 32 CU attributes σ²[σ²16] and σ²32 are tested against Th4–Th5; Non-Split CUs are encoded with RMD + RDO + RQT over the PU32 and PU16 sizes, and the rest are forwarded to Node16.)
(Fig. 4: Decision tree for Node16. The 16 × 16 CU attributes σ²[σ²8] and σ²16 are tested against Th6–Th8; depending on the leaf, RMD + RDO + RQT is applied over the PU16 and PU8 sizes or over the PU8 and PU4 sizes.)

The proposed algorithm can be implemented in a scalable way, using only the top node (Node64) or combining it with the other two nodes, as follows:

1. Node64: The fast classifier replaces the RDO for the 64 × 64 CU size, so only if the classifier decision is "Split" are the four 32 × 32 CUs exhaustively evaluated by the RDO. This configuration achieves the minimum speed-up.
2. Node64 + Node32: The fast classifier substitutes the RDO for the 64 × 64 and 32 × 32 CUs. If the classifier decision in Node32 is "Split", the four 16 × 16 CUs are evaluated by the RDO in order to achieve the optimal partitioning in 8 × 8 and 4 × 4.
3. Node64 + Node32 + Node16: The fast classifier substitutes the RDO for the 64 × 64, 32 × 32 and 16 × 16 CUs. If the classifier decision in Node16 is "Split", the four 8 × 8 CUs are evaluated by the RDO, this configuration achieving the maximum speed-up.

4.2 Stage 2: The fast mode decision (FMD) algorithm

4.2.1 Observations and motivation

Many proposals have been presented in the literature for the fast optimal mode decision in HEVC [21–25]. Most of them are based on pixel gradient detection in the spatial domain, computing the gradient of the image using the Sobel filter [37] or other similar filters. This technique has proved robust when high-energy edges are present in the image, but natural images often have wide areas with weak edges, or even no edges, so this approach can be inefficient for the intra-prediction mode decision. This fact has motivated the algorithm presented for this second stage of our Unified Architecture, denoted as the Fast Mode Decision (FMD) algorithm. In this paper, a novel texture orientation detection algorithm is proposed, which computes the Mean Directional Variance (MDV) using a Sliding Window (SW) along a set of co-lines with rational slopes, denoted as MDV-SW. The key point of the proposed algorithm is the hypothesis that pixel correlation is maximum along the texture orientation, and consequently the variance computed in that direction will obtain a low value compared with the variance computed in other directions. Another noteworthy feature of this proposal is the use of a set of rational slopes, which are exclusively defined on integer positions of the discrete lattice Λ; thus, no pixel interpolation is required. Moreover, it was observed that there exists a strong dependence between the optimal mode selected by the RDO and the distortion applied, set by the QP parameter. Therefore, the directional variance computation for each N × N PU is expanded to an (N + 1) × (N + 1) window, so that the neighbouring pixels used as reference samples for the construction of the predictor are also included in the calculation of the directional variance.

In order to reduce the computational complexity of the gradient detection, we need to define lines whose points are located at integer positions of the lattice Λ ⊂ ℤ², so that we can describe the problem of the discretization of a direction in the discrete space ℤ² as the sub-sampling of an integer lattice.

4.2.2 Sub-sampling of integer lattice

The directional variance metric described in [38] uses the traditional definition of "digital line" for the discrete space
However, they do not provide enough accuracy for gradient detection in images or pixel blocks of small size. Figure 5a depicts an example of an image composed of four equally spaced bars (grey bars) with a slope of 1/3, over which the set of digital lines with the same slope, L(1/3, n), has been plotted. As can be observed, although the digital lines and the plotted bars have the same slope, the digital lines do not represent the orientation of the bars with high accuracy. There are some digital lines in which only two of every three pixels belong to the bar, and other digital lines in which two of every three pixels fall outside the bars while one pixel falls inside, reducing the pixel correlation along the digital lines. This is because digital lines are not straight; instead, they cover an area whose width depends on the slope factors (rx, ry). Figure 5b shows an example of two digital lines, L(1/3, n) and L(3/4, n), where the digital line areas have been shaded and the digital line width is denoted as l. Using simple trigonometry, it can be shown that l = ry·(rx − 1)/√(rx² + ry²), and consequently l increases as rx and ry increase, as is shown in Fig. 5b for the rational slopes 1/3 and 3/4. This feature of digital lines, their width, limits their orientation detection accuracy, mainly for directions represented by rational slopes that require large rx, ry factors.

For this reason, the concept of a down-sampling lattice in two-dimensional space described in [39] has been used, together with the co-line definition used in lattice theory [40]. According to [39], in a 2D system an integer lattice Λ can be obtained by down-sampling the cubic integer lattice ℤ² in two directions d1 and d2.
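As a quick numerical check of the width expression above (the helper name is ours), evaluating l for the two slopes of Fig. 5b confirms that the covered area widens as the slope factors grow:

```python
import math

def digital_line_width(ry, rx):
    """Width l = ry*(rx - 1)/sqrt(rx^2 + ry^2) of the area covered by a
    digital line with rational slope ry/rx."""
    return ry * (rx - 1) / math.sqrt(rx**2 + ry**2)

print(digital_line_width(1, 3))  # slope 1/3 -> l = 2/sqrt(10), approx. 0.63
print(digital_line_width(3, 4))  # slope 3/4 -> l = 9/5 = 1.8
```

The slope 3/4 line is almost three times wider than the slope 1/3 line, which is why directions needing large rx, ry factors are detected less accurately.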
Hence, the sub-lattice Λ ⊂ ℤ² can be formally represented by a non-singular 2 × 2 integer matrix, denoted as the sub-sampling matrix or generator matrix MΛ, such that dx1, dy1, dx2, dy2 ∈ ℤ, that is:

    MΛ = [d1 d2] = | dx1  dx2 |
                   | dy1  dy2 |        (4)

The use of the rational slopes as sub-sampling directions in MΛ, in their vector form d1 = [dx1, dy1]^T and d2 = [dx2, dy2]^T, makes it possible to obtain a sub-lattice Λ which describes the points of the lines with slopes d1 and d2, and thus it facilitates the variance computation along those orientations. Lattice theory states that, given a generator matrix MΛ, the lattice ℤ² is partitioned into |det(MΛ)| cosets of the lattice Λ, which are shifted versions of the sub-lattice Λ. Each coset can be obtained by a shifting vector sk = [skx, sky]^T, ∀ k = 0, 1, ..., |det(MΛ)| − 1. In [40] the concept of a co-line, denoted as CLsk(r, n), is introduced, defined as the intersection of the kth coset of lattice Λ and a digital line L(r, n).

[Fig. 5 a Example of four bars with orientation r = 1/3 and the set of digital lines L(1/3, n). b Area covered by two digital lines with rational slopes 1/3 and 3/4.]

[Fig. 6 a Example of co-lines and their respective cosets for the slope r1, using the direction vectors r1 = [1, 2]^T and r2 = [2, −1]^T of the sub-sampling matrix MΛ, with |det(MΛ)| = 5. b Example of co-lines and their respective cosets for the slope r2, with the same sub-sampling matrix MΛ.]
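The coset count |det(MΛ)| can be verified numerically for the Fig. 6 example (the labelling scheme below is our own): two pixels lie in the same coset exactly when their difference belongs to the sub-lattice, i.e. when adj(MΛ)·(p − q) ≡ 0 modulo |det(MΛ)|.

```python
# Sketch: enumerating the cosets induced by the Fig. 6 generator matrix
# M = [[1, 2], [2, -1]] (columns r1 = [1,2]^T, r2 = [2,-1]^T), |det(M)| = 5.
def coset_label(x, y, m=((1, 2), (2, -1))):
    (a, b), (c, d) = m
    det = a * d - b * c                  # det = -5 for the Fig. 6 matrix
    adj = ((d, -b), (-c, a))             # adjugate: adj(M) @ M = det * I
    return ((adj[0][0] * x + adj[0][1] * y) % abs(det),
            (adj[1][0] * x + adj[1][1] * y) % abs(det))

labels = {coset_label(x, y) for x in range(-8, 8) for y in range(-8, 8)}
print(len(labels))  # 5 distinct cosets, as |det(M)| predicts
```

Any lattice point of Λ itself, such as (1, 2), falls in the same coset as the origin, while the five labels exhaust all pixels of ℤ².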
Therefore, the pixels (x, y) belonging to the co-lines with slopes d1 and d2 can be obtained by the linear combination of both vectors d1, d2, ∀ c1, c2 ∈ ℤ, such that:

    [x, y]^T = c1·[dx1, dy1]^T + c2·[dx2, dy2]^T + [skx, sky]^T        (5)

Our proposal computes the variance along the pixels of the co-lines with a set of rational slopes ri, which can be obtained using the generator matrix MΛ, where any pair of rational slopes r1 = ry1/rx1 and r2 = ry2/rx2 can be used as the vector directions d1 and d2 in MΛ. Equation (5) can be rewritten as two independent equations using the r1 and r2 slopes:

    x = c1·rx1 + c2·rx2 + skx        (6)
    y = c1·ry1 + c2·ry2 + sky        (7)

By isolating the variable c1 in (6) and substituting it into (7), the expression of the co-lines with orientation r1 is obtained, as shown in (8) and (9), ∀ c2 ∈ ℤ, k = 0, ..., |det(MΛ)| − 1:

    y = r1·x + n,  ∀ |r1| ≤ 1,  with n = c2·(ry2 − r1·rx2) − r1·skx + sky        (8)
    x = y/r1 + n,  ∀ |r1| > 1,  with n = c2·(rx2 − ry2/r1) − sky/r1 + skx        (9)

With the aim of clarifying this process for the co-lines with slope r1, Fig. 6a shows an example using the direction vectors r1 = [1, 2]^T and r2 = [2, −1]^T. Given that |det(MΛ)| = 5, there are five cosets, determined by the shifting vectors s0 = [−1, 0], s1 = [0, −1], s2 = [0, 0], s3 = [0, 1] and s4 = [1, 0], which are represented by the dotted, black, grey, white and lined points, respectively. Setting c2 = 0, we obtain the expressions of the co-lines for the first five cosets, and setting c2 = 1, the next five co-lines are likewise obtained; these are depicted as solid lines and dashed lines, respectively, in Fig. 6a.

Following the same reasoning, the expression of the co-lines with orientation r2 is obtained by isolating the variable c2 in (6) and substituting it into (7), as shown in (10) and (11), ∀ c1 ∈ ℤ, k = 0, ..., |det(MΛ)| − 1:

    y = r2·x + n,  ∀ |r2| ≤ 1,  with n = c1·(ry1 − r2·rx1) − r2·skx + sky        (10)
    x = y/r2 + n,  ∀ |r2| > 1,  with n = c1·(rx1 − ry1/r2) − sky/r2 + skx        (11)

Figure 6b shows the same example as Fig. 6a, but now for the expression of the co-lines with slope r2. Setting c1 = 1, the expressions of the co-lines for the first five cosets are obtained, and the next set of five co-lines is equally obtained by setting c1 = 2; these are depicted as solid and dashed lines, respectively. As can be observed in Fig. 6a, b, the distance between two consecutive pixels belonging to one co-line is always the same, and their positions are free of any interpolation process. Consequently, the co-lines can represent the geometrical orientation with higher accuracy than the traditional digital lines, particularly for the small block sizes demanded in HEVC.

4.2.3 Selection of the co-line orientations

The decision of the co-line orientations that best match the mode directionality in HEVC is one of the key points of our approach. In order to estimate the dominant texture orientation in each PU, twelve rational slopes ri (∀ i = 0, ..., 11) have been selected. Four of them have the same slope as HEVC intra-prediction modes, that is, the horizontal, vertical and two diagonal orientations (the diagonals H2 and V34 are considered the same in terms of slope), and the other eight are defined with slopes close to some of the remaining angular modes, but with rational slopes of ±1/2, ±1/4, ±2 and ±4. Table 2 summarizes the set of six generator matrices MΛi, the vector directions used for the integer lattice sub-sampling, and the respective number of cosets defined by |det(MΛi)|.

Table 2 Six generator matrices defining the twelve rational slopes and the respective cosets

MΛi | Vector directions | Generator matrix | Cosets
MΛ0 | r3, r9 | [1 0; 0 1] | 1
MΛ1 | r0, r6 | [1 1; −1 1] | 2
MΛ2 | r1, r7 | [2 1; −1 2] | 5
MΛ3 | r5, r11 | [2 1; 1 −2] | 5
MΛ4 | r2, r8 | [4 1; −1 4] | 17
MΛ5 | r4, r10 | [4 1; 1 −4] | 17

The chosen orientations have a maximum slope factor of 4, instead of the 32 used by the HEVC modes (Table 1). Consequently, there are always at least two points that determine a co-line with rational slope ri, except for the smallest PU size of 4 × 4 and the largest slopes (r = ±1/4 and r = ±4), which use a pixel of a neighbouring block. Figure 7 shows the set of twelve rational slopes.

[Fig. 7 Co-lines r0 to r11 selected for the computation of the directional variance.]

With the aim of comparing the orientations of the selected slopes with the HEVC modes' directionality, Fig. 8 depicts both the 33 angular intra-prediction modes defined in HEVC (grey solid lines) and the twelve proposed co-lines with rational slopes (blue dashed lines), which have been chosen for the analysis of the dominant gradient.

[Fig. 8 Set of angular intra-predictions in HEVC (grey) and the co-lines defined with rational slopes r0 to r11 (dotted blue lines) for the dominant gradient analysis.]

As can be observed, the co-lines r0, r3, r6 and r9 overlap the angular modes H2, H10, V18, V26 and V34; thus, such modes can be estimated with high accuracy from the respective co-lines. The co-lines with slopes r1, r5, r7 and r11 mostly overlap the angular modes H5, H15, V21 and V31, so these co-lines can also be considered a good estimation of the co-located modes. Finally, the four co-lines with slopes r2, r4, r8 and r10 are located near the middle of two modes: H7 and H8 for the slope r2, H12 and H13 for the slope r4, V23 and V24 for the slope r8, and V28 and V29 for the slope r10. With the aim of covering the remaining angular modes that are not covered by the twelve proposed slopes ri, twelve classes denoted as Ci have been defined.
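The consistency of Eqs. (5)-(9) can be checked mechanically. The sketch below uses the Fig. 6 direction vectors; the choice c2 = 1 and coset s = [0, 0] is our own, and exact rational arithmetic avoids rounding:

```python
# Sketch: generate co-line pixels from Eqs. (6)-(7) with d1 = [1, 2]^T
# (slope r1 = 2) and d2 = [2, -1]^T, then verify the closed form of Eq. (9).
from fractions import Fraction

d1 = (1, 2)                      # (rx1, ry1) -> r1 = ry1/rx1 = 2
d2 = (2, -1)                     # (rx2, ry2)
sk = (0, 0)                      # shifting vector of one coset
c2 = 1                           # fixes one particular co-line of slope r1

r1 = Fraction(d1[1], d1[0])
# |r1| > 1, so Eq. (9) applies: x = y/r1 + n
n = c2 * (Fraction(d2[0]) - Fraction(d2[1]) / r1) - Fraction(sk[1]) / r1 + sk[0]

for c1 in range(-3, 4):                         # walk along the co-line
    x = c1 * d1[0] + c2 * d2[0] + sk[0]         # Eq. (6)
    y = c1 * d1[1] + c2 * d2[1] + sk[1]         # Eq. (7)
    assert Fraction(y) / r1 + n == x            # Eq. (9) holds exactly
print("all generated pixels satisfy Eq. (9)")
```

Every pixel generated by the linear combination lies exactly on the line x = y/r1 + n, with n = 5/2 in this instance, confirming that no interpolation is involved.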
Each class selects a set of candidate modes Mi on the left and right side of the respective rational slope ri. Table 3 shows the features of the proposed co-lines ri. It should be noted that the co-line r0 is the only slope that selects four angular modes in its class. This is because modes H2 and V34 of HEVC are the same in terms of orientation, so the two horizontal modes, H2 and H3, and the two vertical modes, V33 and V34, are selected as candidates. Based on empirical simulations, one candidate mode on the left and right side of each class has been added for the smaller PU sizes of 8 × 8 and 4 × 4, except for the horizontal (r3) and vertical (r6) slopes. This is motivated by the fact that, for those block sizes, the co-lines with non-Cartesian orientations are defined by a low number of pixels, so the accuracy of the orientation detection is also lower. In addition, the two non-angular modes, DC and Planar, are included as candidates for all the classes, because they match quite well with weak-edge images.

4.2.4 Computation of the mean directional variance (MDV) along co-lines

Cumulative variance along digital lines is proposed in [38], where it is proven to be an efficient metric for texture orientation detection in large images. However, due to the constraints imposed by factors such as the PU sizes, the high density of the intra-prediction modes in HEVC, and the need to reduce the encoder's computational burden, the following modifications and novel features are proposed:

1. The variance is computed along the co-lines according to (8)–(11), instead of using the digital line expressions.
2. In order to reduce the computational complexity of the variance, an approximation of the variance, denoted the textbook one-pass algorithm, proposed by Chan et al. [32], is used.
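The class-to-candidate mapping of Table 3 reduces to a simple lookup. The sketch below is illustrative only: it hard-codes three of the twelve classes, and the function name and the size threshold used to trigger the extra candidates are our assumptions.

```python
# Hypothetical sketch of the Table 3 class-to-mode-list mapping: each class Ci
# yields its angular candidates plus DC and Planar, widened with two side
# candidates for the small 8x8 and 4x4 PUs.
TABLE3 = {
    "I":    (["H2", "H3", "V33", "V34"], ["H4", "V32"]),
    "II":   (["H4", "H5", "H6"],         ["H3", "H7"]),
    "VIII": (["V20", "V21", "V22"],      ["V19", "V23"]),
    # ... remaining classes omitted for brevity
}

def mode_list(ci, pu_size):
    """Candidate modes submitted to the RMD for a PU classified as Ci."""
    base, extra = TABLE3[ci]
    modes = ["DC", "Planar"] + base        # non-angular modes always included
    if pu_size <= 8:                       # PU8 and PU4 widen the class
        modes = modes + extra
    return modes

print(mode_list("II", 4))  # ['DC', 'Planar', 'H4', 'H5', 'H6', 'H3', 'H7']
```

For large PUs the RMD thus evaluates 5 or 6 modes instead of 35, which is where the mode-decision speed-up comes from.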
Equation (12) shows the directional variance expression using the textbook one-pass approach, pj(ri, n) being the pixels belonging to CL(ri, n) and N the number of pixels of the nth co-line with slope ri:

    σ²[CL(ri, n)] = (1/N)·[ Σ_{j=1..N} pj²(ri, n) − (1/N)·( Σ_{j=1..N} pj(ri, n) )² ]        (12)

3. Finally, instead of calculating the cumulative variance of the digital lines as proposed in [38], the MDV is computed as the average of the individual variances obtained along the L co-lines with the same orientation ri, as shown in (13):

    MDV(ri) = (1/L)·Σ_{n=1..L} σ²[CL(ri, n)]        (13)

Table 3 Slope, angle and candidate modes assigned to the defined co-lines

Co-line | θi | Slope (ry/rx) | PU class (Ci) | Candidate modes Mi for PU64, PU32, PU16 | Additional candidates Mi for PU8, PU4
r0 | 5π/4 | −1/1 | I | H2, H3, V33, V34 | H4, V32
r1 | 8π/7 | −1/2 | II | H4, H5, H6 | H3, H7
r2 | 27π/25 | −1/4 | III | H7, H8 | H6, H9
r3 | π | 0 | IV | H9, H10, H11 | H8, H12
r4 | 23π/25 | 1/4 | V | H12, H13 | H11, H14
r5 | 6π/7 | 1/2 | VI | H14, H15, H16 | H13, H17
r6 | 3π/4 | 1/1 | VII | H17, V18, V19 | H16, V20
r7 | 9π/14 | 2/1 | VIII | V20, V21, V22 | V19, V23
r8 | 29π/50 | 4/1 | IX | V23, V24 | V22, V25
r9 | π/2 | ∞ | X | V25, V26, V27 | V24, V28
r10 | 21π/50 | −4/1 | XI | V28, V29 | V27, V30
r11 | 5π/14 | −2/1 | XII | V30, V31, V32 | V29, V33

It is worth noting that, for a rational slope (ry/rx), the distance between two pixels belonging to that orientation can be computed as √(rx² + ry²), and therefore it has a strong dependence on the rational values of the slope. From a statistical point of view, for most natural images the variance computed along the slopes with a large pixel distance could be penalized compared with the variance computed along the slopes with a smaller distance, such as the Cartesian orientations. However, for highly textured images, such as the image depicted in Fig. 5a, the correlation between the pixels with the same orientation (r = 1/3) is high, even if the pixel distance along the slope is large (√(3² + 1²)). The simulation results shown in Sect. 5 reveal that this algorithm achieves high performance for a wide variety of sequences, containing a large set of patterns and orientations.

4.2.5 Computation of the mean directional variance using a sliding window (MDV-SW)

As described previously, the MDV is computed for each PU using the pixels of the co-segments CSm(ri, n) belonging to that PU in a scalable manner. However, the angular modes in HEVC intra-prediction use as reference samples the neighbouring pixels of decoded PUs, denoted as PUd(x, y), which have previously been distorted by the quantization parameter QP. Based on empirical simulations, we observed a strong dependence between the optimal mode selected by the RDO and the QP parameter, especially for high QP values, which cause a strong smoothing of the reference pixels; thus, the correlation between the reference samples and the PU's pixels that are not yet distorted can be modified. Consequently, an enhancement to the MDV algorithm is proposed in this subsection. The main idea of this approach is to expand the window of the MDV computation for an N × N PU to a window of (N + 1) × (N + 1) pixels, overlapping the left column and top row of the window with the left and top decoded pixels of the neighbouring PUs. Figure 9a depicts an example of the computation of the MDV for a 4 × 4 PU(i, j) along the slope r6, where the reference samples (Pref) belonging to decoded PUs (PUd) are lined. As can be observed, five co-segments with lengths of 2, 3 and 4 pixels are defined for that slope. Figure 9b presents the new MDV approach, where the new (N + 1) × (N + 1) window allows the MDV computation to use the reference pixels from the neighbouring PUs, and two new co-segments are now available.
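Equations (12) and (13) can be sketched directly; the function names are ours, and the candidate orientation would simply be the ri with the minimum MDV:

```python
# Illustrative sketch of Eqs. (12)-(13): one-pass variance per co-line and the
# mean directional variance (MDV) over the L co-lines of one orientation.
def coline_variance(pixels):
    """Textbook one-pass variance of the pixels of one co-line, Eq. (12)."""
    n = len(pixels)
    s, sq = 0.0, 0.0
    for p in pixels:              # single pass: accumulate sum and sum of squares
        s += p
        sq += p * p
    return (sq - s * s / n) / n

def mdv(colines):
    """Eq. (13): average of the per-co-line variances for one slope ri."""
    return sum(coline_variance(c) for c in colines) / len(colines)

# A perfectly flat co-line has zero variance; mixing pixel values raises it.
print(mdv([[2, 2, 2], [0, 2]]))  # (0.0 + 1.0) / 2 = 0.5
```

The one-pass form needs a single sweep over the pixels (no separate mean computation), which is the reason it is preferred over the two-pass textbook variance in a complexity-constrained encoder.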
Accordingly, the MDV is computed over five overlapping window sizes of 65 × 65, 33 × 33, 17 × 17, 9 × 9 and 5 × 5 pixels; this novel approach is therefore named Mean Directional Variance with Sliding Window (MDV-SW). Figure 9c illustrates an example of the MDV-SW computation of four consecutive 4 × 4 PUs {PU(i + k, j + l) | ∀ k, l = 0, 1}, showing that the sliding windows overlap each other in order to capture the neighbouring reference samples. The MDV-SW implementation requires a slight computation increase compared with the MDV, because of the extended window size, so the border pixels are computed twice for adjacent PUs.

[Fig. 9 a Example of MDV computation for a 4 × 4 PU. b Example of MDV-SW computation over the expanded window for a 4 × 4 PU. c Example of MDV-SW using overlapped windows for the evaluation of 4 × 4 PUs.]

4.3 The fast partitioning and mode decision architecture (FPMD)

Finally, this subsection presents the proposed unified architecture that forms a novel fast HEVC intra-prediction coding algorithm, denoted as Fast Partitioning and Mode Decision (FPMD). This approach combines the FPD algorithm proposed in Sect. 4.1 and the FMD algorithm proposed in Sect. 4.2. The FPMD algorithm presented in this paper achieves a considerable complexity reduction of over 67%, at the expense of a slight penalty in terms of rate increase, due to the sub-optimal partitioning and mode decisions given by the FPMD. The architecture of the FPMD is depicted in Fig. 10, which shows the new functional algorithms introduced by FPMD shaded in grey.

The FPMD workflow operates at the CTU level, as described in Sect. 4.1, evaluating the CTU attributes that are used by the decision trees in the CTU classifier stage, which selects the different PU sizes. With the aim of organizing the set of PUs that partition the CTU, a Partition Map is arranged by depth levels (∀ d = 0, ..., 4). For each depth level d, the k PUs (PUd,k) belonging to that level are recorded in a list. Intra-prediction is computed by evaluating all the PUs included in the Partition Map lists, which are processed in depth-level order. Following the fast partitioning algorithm described in Sect. 4.1, for every PUd,k, the four sub-PUs (PUd+1,4k+i, ∀ i = 0, ..., 3) are also evaluated by the RMD, RDO and RQT stages. Consequently, five evaluations are always performed for each element of the Partition Map.

Then, the MDV-SW algorithm is run for each PU in the Partition Map, and the PU is classified into a class Ci, which includes a set of three or four candidate angular modes in addition to the two non-directional modes, DC and Planar, as described in Sect. 4.2. Those modes are arranged in a Mode List, and they are the only modes evaluated by the RMD, instead of the 35 modes evaluated by the original RMD stage of the HM reference software. As a result of the RMD process, a set of three candidate modes is selected to be checked by the RDO stage, and the mode with the lowest cost is further evaluated by the RQT, which selects the optimal TU size. Finally, by comparing the cost of PUd,k with the sum of the costs of its four sub-PUs, the best option, Non-Split (PUd,k) or Split (the four PUd+1,4k+i, ∀ i = 0, ..., 3), is selected.

5 Performance evaluation

With the aim of evaluating the proposed Fast Partitioning and Mode Decision (FPMD) algorithm, the FPD algorithm and the FMD algorithm for intra-prediction in HEVC were implemented in the HEVC HM 16.6 reference software [14]. The non-modified HM 16.6 reference software was used as the anchor, using the same test sequences and encoding parameters.
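The per-CTU control flow described above can be summarized in a short sketch. This is a high-level illustration only: the four helpers (classify_ctu, mdv_sw_class, rmd, rdo_rqt) are hypothetical stand-ins for the real HM stages, and the cost is represented as a plain (cost, mode) tuple.

```python
# High-level sketch of the per-CTU FPMD workflow (stage names are assumptions).
def fpmd_encode_ctu(ctu, classify_ctu, mdv_sw_class, rmd, rdo_rqt):
    partition_map = classify_ctu(ctu)       # decision trees pick PU sizes per depth
    best = {}
    for depth in sorted(partition_map):     # PUs processed in depth-level order
        for pu in partition_map[depth]:
            modes = mdv_sw_class(pu)        # class Ci: 3-4 angular modes + DC, Planar
            candidates = rmd(pu, modes)     # RMD keeps 3 candidate modes
            best[pu] = min(rdo_rqt(pu, m) for m in candidates)  # RDO + RQT cost
    return best

# Tiny demo with stub stages: cost = mode-name length, so "DC" always wins.
best = fpmd_encode_ctu(None,
                       lambda ctu: {0: ["PU0,0"], 1: ["PU1,0", "PU1,1"]},
                       lambda pu: ["DC", "Planar", "V25", "V26", "V27"],
                       lambda pu, modes: modes[:3],
                       lambda pu, mode: (len(mode), mode))
print(best["PU0,0"])  # (2, 'DC')
```

The Non-Split/Split comparison between PUd,k and its four sub-PUs would be layered on top of the returned per-PU costs; it is omitted here for brevity.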
The simulations were independently run for the three node configurations of the FPD algorithm, in order to report the results of applying the algorithm in a scalable way, starting with Node64 alone, then Node64 + Node32, and finally Node64 + Node32 + Node16. The FMD was evaluated for both the MDV and MDV-SW proposals. Then, all combinations of FPD + MDV-SW (Node64 + MDV-SW, Node64 + Node32 + MDV-SW and Node64 + Node32 + Node16 + MDV-SW) were also activated to show the simulation results for the FPMD algorithm, which combines both proposals.

[Fig. 10 Fast partitioning and mode decision (FPMD) algorithm proposed. Flow: CTU → CTU classifier (decision-tree nodes) → PartitionMap(d) = {PUd,i, ..., PUd,j} → for each PUd,k and its sub-PUs PUd+1,4k+i (i = 0, ..., 3): MDV-SW(PU) → ModeList(PU) = Ci → RMD{ModeList(PU)} → 3 best candidate modes → RDO{PU, 3 candidate modes} → RQT{PU, mode} → best {PU size, TU size, mode}; loop until the last PU and the last depth → end CTU.]

5.1 Encoding parameters and metrics

The experiments were conducted under the "Common Test Conditions and Software Reference Configurations" (CTC) recommended by the JCT-VC [28] for the "All-Intra" mode configuration and the Main profile (AI-Main). That recommendation specifies the use of four QPs (QP22, QP27, QP32 and QP37) and a set of 22 test sequences classified in five classes, named from A to E, which cover a wide range of resolutions and frame rates. All the sequences use 4:2:0 chroma sub-sampling and a bit depth of 8 bits. The algorithm performance was evaluated in terms of Computational Complexity Reduction (CCR) and Rate Distortion (RD) performance, and both were compared with the HM results. For the CCR measure, the Time Saving metric was computed following (14):

    Time Saving (%) = [ Enc.Time(HM16.6) − Enc.Time(Prop) ] / Enc.Time(HM16.6) × 100        (14)

Concerning the RD performance, the average Peak Signal-to-Noise Ratio (PSNR) was calculated for the luma (Y_PSNR) and chroma (U_PSNR, V_PSNR) components. The YUV PSNRs for the four QPs were used for the computation of the RD performance by means of the Bjøntegaard Delta-Rate metric (BD-rate) defined by the ITU [41] and recommended in the CTC [28]. In order to obtain the increase in BD-rate and the increase in Time Saving introduced by the fast mode decision algorithm when it is combined in the architecture of each node, the ΔBD-rate and ΔT.Saving metrics are also used, following (15) and (16), where N = 64, 64+32, 64+32+16:

    ΔBD-rate = BD-rate(NodeN + MDV-SW) − BD-rate(NodeN)        (15)
    ΔT.Saving = T.Saving(NodeN + MDV-SW) − T.Saving(NodeN)        (16)

5.2 Simulation results

Table 4 shows the experimental results of the FPD algorithm compared with the HM 16.6 reference software. It can be observed that, for the Node64 implementation, all the sequences achieve a negligible 0.1% penalty in terms of BD-rate and an average time saving of around 12%. The results for the Node64 + Node32 decision tree implementation report much better time savings of over 29%, with a bit rate increase lower than 1%. Finally, the overall algorithm (Node64 + Node32 + Node16) shows a computational reduction of over 53%, increasing the bit rate penalty to around 2.2%. It should be noted that only eight frames were used for training the decision trees, which is 0.081% of the total frames comprising the 22 simulated sequences (9780 frames). The overall experimental results confirm that the proposed FPD algorithm can reduce the computational complexity of HEVC intra-picture prediction by over 53% with a slight bit rate increase, favouring real-time software and hardware implementation.

Table 4 Performance results of the fast partitioning decision (FPD) algorithm

Classification | Sequence | Frames | N64 (T.sav % / BD-rate %) | N64+N32 (T.sav % / BD-rate %) | N64+N32+N16 (T.sav % / BD-rate %)
Class A (2560 × 1600) | Traffic | 150 | 11.60 / 0.0 | 29.39 / 0.9 | 56.61 / 1.6
Class A (2560 × 1600) | PeopleOnStreet | 150 | 12.19 / 0.0 | 28.03 / 0.6 | 50.09 / 1.1
Class B (1920 × 1080) | BasketballDrive | 500 | 14.36 / 0.2 | 36.87 / 2.4 | 59.19 / 3.1
Class B (1920 × 1080) | BQTerrace | 600 | 14.15 / 0.1 | 30.96 / 0.8 | 52.57 / 1.3
Class B (1920 × 1080) | Cactus | 500 | 13.18 / 0.1 | 29.77 / 1.0 | 56.95 / 2.1
Class B (1920 × 1080) | Kimono | 240 | 15.43 / 0.1 | 33.90 / 5.1 | 62.05 / 5.3
Class B (1920 × 1080) | ParkScene | 240 | 14.51 / 0.0 | 30.74 / 0.9 | 58.75 / 1.6
Class C (832 × 480) | BasketballDrill | 500 | 15.4 / 0.0 | 27.63 / 0.5 | 53.50 / 2.3
Class C (832 × 480) | BQMall | 600 | 10.27 / 0.0 | 26.88 / 0.8 | 50.76 / 2.5
Class C (832 × 480) | PartyScene | 500 | 10.89 / 0.0 | 22.08 / 0.1 | 45.48 / 2.1
Class C (832 × 480) | RaceHorses | 300 | 10.18 / 0.0 | 26.41 / 0.7 | 53.14 / 1.7
Class D (416 × 240) | BasketballPass | 500 | 10.25 / 0.0 | 23.54 / 0.6 | 45.66 / 2.1
Class D (416 × 240) | BQSquare | 600 | 5.98 / 0.0 | 21.11 / 0.3 | 35.61 / 1.1
Class D (416 × 240) | BlowingBubbles | 500 | 8.82 / 0.0 | 18.58 / 0.0 | 40.55 / 1.7
Class D (416 × 240) | RaceHorses | 300 | 6.80 / 0.0 | 19.75 / 0.4 | 43.20 / 1.8
Class E (1280 × 720) | FourPeople | 600 | 8.13 / 0.0 | 33.88 / 0.8 | 56.64 / 2.0
Class E (1280 × 720) | Johnny | 600 | 12.48 / 0.3 | 45.54 / 3.4 | 63.00 / 4.4
Class E (1280 × 720) | KristenAndSara | 600 | 26.81 / 0.3 | 44.39 / 1.7 | 61.19 / 2.6
Class A | | | 11.89 / 0.00 | 28.71 / 0.75 | 53.35 / 1.35
Class B | | | 14.32 / 0.10 | 32.44 / 2.04 | 57.90 / 2.68
Class C | | | 10.40 / 0.00 | 25.75 / 0.53 | 50.72 / 2.15
Class D | | | 7.43 / 0.00 | 20.74 / 0.33 | 41.25 / 1.68
Class E | | | 18.22 / 0.20 | 41.27 / 1.97 | 60.28 / 3.00
Average | | | 12.42 / 0.1 | 29.82 / 1.2 | 53.08 / 2.2

Table 5 shows the simulation results for the novel FMD algorithm, based on the MDV and MDV-SW proposals, compared with the HM 16.6 reference software. As can be observed, with the MDV-SW scheme the average time saving is slightly reduced to 29.7%; regarding the bit rate penalty, it should be noted that the average BD-rate is only 0.4%.

Table 6 reports the simulation results, individualized for each possible combination, for the final Unified Architecture based on the FPMD algorithm, in terms of BD-rate and Time Saving. As can be noted, the average speed-up is now improved from the range of 12–53% obtained for the FPD algorithm without the MDV-SW approach (as shown in Table 4) to the range of 41–67%.
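The evaluation metrics of Eqs. (14)-(16) reduce to three one-line computations. The function names are ours, and the sign convention of the deltas (combined node minus stand-alone node, yielding positive values) is inferred from the values reported in Tables 4-7:

```python
# Sketch of the evaluation metrics of Eqs. (14)-(16).
def time_saving(t_hm, t_prop):
    """Eq. (14): encoder time reduction versus the HM 16.6 anchor, in %."""
    return 100.0 * (t_hm - t_prop) / t_hm

def delta_bd_rate(bd_node, bd_node_mdvsw):
    """Eq. (15): BD-rate increase introduced by adding MDV-SW to a node."""
    return bd_node_mdvsw - bd_node

def delta_time_saving(ts_node, ts_node_mdvsw):
    """Eq. (16): extra time saving introduced by adding MDV-SW to a node."""
    return ts_node_mdvsw - ts_node

print(time_saving(100.0, 33.0))  # 67.0 (% of the anchor encoding time saved)
```

For example, the Traffic row of Table 7 follows from Tables 4 and 6: ΔT.Saving = 41.09 − 11.60 = 29.49 and ΔBD-rate = 0.9 − 0.0 = 0.90 for the N64 comparison.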
This enhancement comes at the expense of a bit rate increase: the BD-rate is practically doubled compared with the FPD algorithm, due to the error introduced by the FMD algorithm. An initial conclusion that can be drawn from this observation is that the penalty of the fast mode decision is not simply additive to the penalty of the fast partitioning decision; instead, the error due to a wrong mode decision is amplified when a wrong PU size classification is given.

The first node, N64 + MDV-SW, achieves an average time saving of around 40%, which is quite similar for classes A, B, C and D; only class E reaches a speed-up of over 45%. In terms of rate penalty, the results are also quite uniform, around 1%. Regarding the second node, N64 + N32 + MDV-SW, the time saving increases by 15% with respect to the N64 + MDV-SW implementation, achieving a notable average complexity reduction of 55.7%. The average BD-rate penalty is nearly doubled, 2.3%, compared with the previous node. Finally, the results for the overall node implementation including the MDV-SW approach, N64 + N32 + N16 + MDV-SW, show a considerable complexity reduction of 67%. In terms of rate penalty, the sequences of class A obtain the best performance with a 2.5% BD-rate, which is nearly half the average rate increase of 4.6%.

Table 5 Performance results of the fast mode decision (FMD) algorithm

Classification | Sequence | Frames | MDV (T.sav % / BD-rate %) | MDV-SW (T.sav % / BD-rate %)
Class A (2560 × 1600) | Traffic | 150 | 30.87 / 0.5 | 30.43 / 0.3
Class A (2560 × 1600) | PeopleOnStreet | 150 | 30.34 / 0.9 | 29.84 / 0.4
Class B (1920 × 1080) | BasketballDrive | 500 | 31.34 / 0.6 | 31.14 / 0.2
Class B (1920 × 1080) | BQTerrace | 600 | 30.47 / 0.6 | 30.30 / 0.3
Class B (1920 × 1080) | Cactus | 500 | 30.33 / 0.8 | 30.12 / 0.4
Class B (1920 × 1080) | Kimono | 240 | 31.69 / 0.2 | 31.16 / 0.1
Class B (1920 × 1080) | ParkScene | 240 | 30.98 / 0.2 | 30.53 / 0.1
Class C (832 × 480) | BasketballDrill | 500 | 28.71 / 1.5 | 28.41 / 0.6
Class C (832 × 480) | BQMall | 600 | 29.69 / 1.0 | 29.21 / 0.5
Class C (832 × 480) | PartyScene | 500 | 29.16 / 1.0 | 28.64 / 0.7
Class C (832 × 480) | RaceHorses | 300 | 30.34 / 0.9 | 29.65 / 0.3
Class D (416 × 240) | BasketballPass | 500 | 30.39 / 1.3 | 29.11 / 0.6
Class D (416 × 240) | BQSquare | 600 | 28.27 / 1.4 | 28.27 / 1.0
Class D (416 × 240) | BlowingBubbles | 500 | 27.96 / 1.3 | 27.96 / 0.8
Class D (416 × 240) | RaceHorses | 300 | 29.04 / 1.4 | 28.10 / 0.6
Class E (1280 × 720) | FourPeople | 600 | 30.53 / 0.8 | 30.33 / 0.3
Class E (1280 × 720) | Johnny | 600 | 30.90 / 0.8 | 30.85 / 0.4
Class E (1280 × 720) | KristenAndSara | 600 | 30.60 / 0.9 | 30.31 / 0.4
Class A | | | 30.61 / 0.7 | 30.14 / 0.3
Class B | | | 30.96 / 0.5 | 30.65 / 0.2
Class C | | | 29.48 / 1.1 | 28.98 / 0.6
Class D | | | 28.92 / 1.3 | 28.36 / 0.7
Class E | | | 30.68 / 0.8 | 30.50 / 0.4
Average | | | 30.10 / 0.9 | 29.70 / 0.4

Table 7 summarizes the results in terms of ΔBD-rate and ΔTime Saving for the three nodes. In terms of complexity reduction, the MDV-SW algorithm provides, as expected, the highest reduction for the first node, N64, of 30%, which is practically the speed-up achieved when MDV-SW is applied alone. However, for the full node implementation, namely N64 + N32 + N16, the speed-up due to MDV-SW is just around 15%, because the fast partitioning has already reduced the complexity by 50%, so the 30% speed-up due to the fast mode decision only affects the remaining 50% of the computational burden. Therefore, the benefits of the fast mode decision are masked when a high complexity reduction is achieved by the fast partitioning decision. The behaviour of the rate penalty is quite different from that of the time saving. Unexpectedly, for the first two nodes, N64 + MDV-SW and N64 + N32 + MDV-SW, the rate increase due to the fast mode decision is practically the same, 1%. The BD-rate obtained for the stand-alone MDV-SW implementation, 0.4%, is practically doubled when it is computed jointly with the fast partitioning mode, for both nodes.
Nevertheless, in the overall implementation, namely N64 + N32 + N16 + MDV-SW, the BD-rate increase due to MDV-SW is multiplied by 6 compared with MDV-SW alone, an increase of 2.3% with respect to the rate penalty of the same node without the MDV-SW approach.

Table 6 Performance results of the fast partitioning and mode decision (FPMD) algorithm compared with HM 16.6

Classification | Sequence | Frames | N64+MDV-SW (T.sav % / BD-rate %) | N64+N32+MDV-SW (T.sav % / BD-rate %) | N64+N32+N16+MDV-SW (T.sav % / BD-rate %)
Class A (2560 × 1600) | Traffic | 150 | 41.09 / 0.9 | 55.98 / 2.0 | 69.55 / 2.0
Class A (2560 × 1600) | PeopleOnStreet | 150 | 41.09 / 0.9 | 55.98 / 1.6 | 66.88 / 3.0
Class B (1920 × 1080) | BasketballDrive | 500 | 42.56 / 0.9 | 59.31 / 3.2 | 70.55 / 8.0
Class B (1920 × 1080) | BQTerrace | 600 | 42.86 / 0.7 | 56.70 / 1.4 | 67.41 / 6.5
Class B (1920 × 1080) | Cactus | 500 | 42.00 / 1.0 | 55.67 / 2.1 | 69.66 / 3.2
Class B (1920 × 1080) | Kimono | 240 | 44.08 / 1.1 | 57.92 / 6.7 | 72.23 / 7.6
Class B (1920 × 1080) | ParkScene | 240 | 43.20 / 0.8 | 56.82 / 1.8 | 70.64 / 2.0
Class C (832 × 480) | BasketballDrill | 500 | 39.41 / 1.0 | 54.09 / 1.5 | 67.11 / 2.5
Class C (832 × 480) | BQMall | 600 | 40.46 / 1.2 | 54.65 / 2.1 | 66.29 / 4.7
Class C (832 × 480) | PartyScene | 500 | 40.43 / 1.3 | 52.47 / 1.4 | 63.68 / 3.7
Class C (832 × 480) | RaceHorses | 300 | 39.60 / 0.7 | 53.74 / 1.5 | 67.15 / 5.0
Class D (416 × 240) | BasketballPass | 500 | 37.97 / 1.2 | 50.61 / 1.9 | 62.35 / 3.7
Class D (416 × 240) | BQSquare | 600 | 39.01 / 1.6 | 50.30 / 1.8 | 57.62 / 5.5
Class D (416 × 240) | BlowingBubbles | 500 | 38.36 / 1.3 | 48.65 / 1.4 | 60.02 / 3.4
Class D (416 × 240) | RaceHorses | 300 | 39.20 / 1.1 | 49.34 / 1.5 | 61.69 / 3.5
Class E (1280 × 720) | FourPeople | 600 | 42.14 / 1.1 | 58.34 / 1.9 | 69.74 / 4.3
Class E (1280 × 720) | Johnny | 600 | 50.23 / 1.3 | 64.72 / 4.8 | 72.96 / 8.8
Class E (1280 × 720) | KristenAndSara | 600 | 43.73 / 1.4 | 63.63 / 2.8 | 71.74 / 4.6
Class A | | | 41.78 / 1.0 | 55.98 / 1.8 | 68.24 / 2.5
Class B | | | 42.94 / 0.9 | 57.30 / 3.0 | 70.14 / 5.5
Class C | | | 39.97 / 1.1 | 53.75 / 1.6 | 66.09 / 4.0
Class D | | | 38.64 / 1.3 | 49.74 / 1.7 | 60.46 / 4.0
Class E | | | 45.48 / 1.3 | 62.33 / 3.2 | 71.51 / 5.9
Average | | | 41.67 / 1.1 | 55.71 / 2.3 | 67.34 / 4.6

5.3 Comparison with other fast intra-prediction algorithms

In Sect. 3, several fast intra-prediction algorithms were described. In this subsection, a performance comparison between those proposals and the FPMD proposed in this paper is made. The simulation results are reported in Table 8 using the same JCT-VC test sequences, CTCs and performance metrics, in order to provide a fair comparison. The Sun et al. algorithm [16] can be considered the best-performing algorithm in the balance of time saving, 50%, and bit rate penalty, 2.3%. The Sun et al. proposal outperforms the FPMD (N64 + MDV-SW) implementation in terms of encoder time reduction, with 50% instead of 41.67%, but its bit rate penalty is also over 1.2% higher.

Finally, the reported results prove that the proposed FPMD algorithm in its overall implementation (N64 + N32 + N16 + MDV-SW) achieves the highest time saving for intra-prediction coding, over 67% compared with the HEVC reference model, and it outperforms the best prior proposal in the balance of complexity reduction and bit rate penalty.

6 Conclusions

In this paper, we have presented a Unified Architecture that forms a novel fast HEVC intra-prediction coding algorithm, denoted as Fast Partitioning and Mode Decision (FPMD), which combines a Fast Partitioning Decision (FPD) algorithm and a Fast Mode Decision (FMD) algorithm. The FPD algorithm comprises a three-node decision tree using low-complexity attributes, allowing an early CU classification in terms of the optimal CTU partitioning, thereby reducing the number of PU sizes to be checked by the RMD and RDO stages.
Moreover, the algorithm can be implemented in a scalable way by combining the first node, the first two nodes, or all three nodes, achieving different levels of coding performance. The FMD algorithm is based on texture orientation detection, analysing the MDV along the digital co-lines. Instead of the 33 angular directions defined in HEVC, we have used twelve co-lines with rational slopes. The orientation with the lowest variance is selected as the dominant texture orientation, and a reduced number of directional candidate modes are selected to be further processed by the RDO stage. These approaches can be combined to form the Unified Architecture using any combination of nodes, obtaining a wide range of time savings, from 44 to 67%, and light BD-rate penalties from 1.1 to 4.6%, with respect to HM 16.6. Comparisons with similar state-of-the-art works show the proposed architecture achieves the best trade-off between complexity reduction and rate distortion.

Table 7 Performance differences between the non-combined proposals and the combined proposals for each node

Classification | Sequence | N64 vs. N64+MDV-SW (ΔT.Sav % / ΔBD-rate %) | N64+N32 vs. N64+N32+MDV-SW (ΔT.Sav % / ΔBD-rate %) | N64+N32+N16 vs. N64+N32+N16+MDV-SW (ΔT.Sav % / ΔBD-rate %)
Class A (2560 × 1600) | Traffic | 29.49 / 0.90 | 26.59 / 1.02 | 12.94 / 0.40
Class A (2560 × 1600) | PeopleOnStreet | 30.27 / 0.99 | 27.95 / 1.04 | 16.79 / 1.88
Class B (1920 × 1080) | BasketballDrive | 28.20 / 0.72 | 22.44 / 0.79 | 11.35 / 4.93
Class B (1920 × 1080) | BQTerrace | 28.71 / 0.65 | 25.75 / 0.68 | 14.84 / 5.22
Class B (1920 × 1080) | Cactus | 28.82 / 0.91 | 25.89 / 1.03 | 12.71 / 1.07
Class B (1920 × 1080) | Kimono | 28.65 / 0.96 | 24.02 / 1.52 | 10.18 / 2.31
Class B (1920 × 1080) | ParkScene | 28.68 / 0.78 | 26.08 / 0.91 | 11.89 / 0.37
Class C (832 × 480) | BasketballDrill | 29.14 / 1.04 | 26.46 / 1.06 | 13.62 / 0.15
Class C (832 × 480) | BQMall | 29.57 / 1.19 | 27.77 / 1.27 | 15.52 / 2.26
Class C (832 × 480) | PartyScene | 30.24 / 1.33 | 30.39 / 1.34 | 18.19 / 1.61
Class C (832 × 480) | RaceHorses | 29.35 / 0.73 | 27.34 / 0.78 | 14.01 / 3.31
Class D (416 × 240) | BasketballPass | 31.98 / 1.19 | 27.08 / 1.24 | 16.69 / 1.59
Class D (416 × 240) | BQSquare | 30.19 / 1.56 | 29.21 / 1.57 | 22.01 / 4.42
Class D (416 × 240) | BlowingBubbles | 31.55 / 1.32 | 30.07 / 1.35 | 19.46 / 1.67
Class D (416 × 240) | RaceHorses | 31.07 / 1.0 | 29.59 / 1.11 | 18.49 / 1.67
Class E (1280 × 720) | FourPeople | 29.66 / 1.08 | 24.45 / 1.13 | 13.10 / 2.36
Class E (1280 × 720) | Johnny | 23.42 / 1.03 | 19.18 / 1.39 | 9.96 / 4.39
Class E (1280 × 720) | KristenAndSara | 28.36 / 1.12 | 19.24 / 1.15 | 10.54 / 1.94
Class A | | 29.88 / 0.95 | 27.26 / 1.03 | 14.78 / 1.14
Class B | | 28.61 / 0.81 | 24.80 / 0.99 | 12.12 / 2.78
Class C | | 29.58 / 1.07 | 27.96 / 1.11 | 15.26 / 1.83
Class D | | 31.20 / 1.28 | 28.97 / 1.32 | 19.09 / 2.34
Class E | | 27.02 / 1.08 | 20.83 / 1.23 | 11.14 / 2.90
Average | | 29.25 / 1.03 | 25.89 / 1.13 | 14.26 / 2.31

Table 8 Performance comparison between FPMD and the related works

Proposal | Time saving (%) | BD-rate (%)
Sun [16] | 50 | 2.30
Huang [17] | 20 | 0.50
Cen [18] | 16 | 2.80
Tian [19] | 29 | 0.50
Khan [20] | 42 | 1.20
Jiang [21] | 20 | 0.74
Chen [22] | 37.6 | 1.65
Yan [23] | 23.5 | 1.30
Silva [24] | 20 | 0.90
Yao [25] | 36.2 | 1.86
Shen [26] | 21 | 1.70
Kim [27] | 30.5 | 1.20
FPMD (N64 + MDV-SW) | 41.67 | 1.10
FPMD (N64 + N32 + MDV-SW) | 55.71 | 2.30
FPMD (N64 + N32 + N16 + MDV-SW) | 67.30 | 4.60

Acknowledgements This work was supported by the MINECO and the European Commission (FEDER funds) under the projects TIN2012-38341-C04-04 and TIN2015-66972-C5-2-R.

References

1. High Efficiency Video Coding, Rec. ITU-T H.265 and ISO/IEC 23008-2 (2013)
2. Advanced Video Coding for Generic Audiovisual Services, Rec. ITU-T H.264 and ISO/IEC 14496-10 (MPEG-4 AVC) (2012)
3.
Ohm, J.-R., Sullivan, G.J., Schwarz, H., Tan, T.K., Wiegand, T.: Comparison of the coding efficiency of video coding standards—including high efficiency video coding (HEVC). IEEE Trans. Circuits Syst. Video Technol. 22(12), 1669–1684 (2012)
4. Prabhakar, B., Reddy, D.K.: Analysis of video coding standards using PSNR and bit rate saving. In: International Conference on Signal Processing and Communication Engineering Systems (SPACES), pp. 306–308 (2015)
5. Nguyen, T., Marpe, D.: Performance analysis of HEVC-based intra coding for still image compression. In: Picture Coding Symposium (PCS), pp. 233–236 (2012)
6. Kim, I.-K., Min, J., Lee, T., Han, W.-J., Park, J.: Block partitioning structure in the HEVC standard. IEEE Trans. Circuits Syst. Video Technol. 22(12), 1697–1706 (2012)
7. Min, J., Lee, S., Kim, I., Han, W.-J., Lainema, J., Ugur, K.: Unification of the directional intra prediction methods in TMuC. In: JCTVC-B100, Geneva, Switzerland (2010)
8. Sullivan, G.J., Wiegand, T.: Rate-distortion optimization for video compression. IEEE Signal Process. Mag. 15(6), 74–90 (1998)
9. Bossen, F., Bross, B., Sühring, K., Flynn, D.: HEVC complexity and implementation analysis. IEEE Trans. Circuits Syst. Video Technol. 22(12), 1685–1696 (2012)
10. Correa, G., Assuncao, P., Agostini, L., da Silva Cruz, L.A.: Complexity control of high efficiency video encoders for power-constrained devices. IEEE Trans. Consum. Electron. 57(4), 1866–1874 (2011)
11. Khan, M., Shafique, M., Grellert, M., Henkel, J.: Hardware-software collaborative complexity reduction scheme for the emerging HEVC intra encoder. In: Design, Automation & Test in Europe Conference & Exhibition (DATE), pp. 125–128 (2013)
12. Sullivan, G.J., Ohm, J.-R., Han, W.-J., Wiegand, T.: Overview of the high efficiency video coding (HEVC) standard. IEEE Trans. Circuits Syst. Video Technol. 22(12), 1649–1668 (2012)
13. Lainema, J., Bossen, F., Han, W.-J., Min, J., Ugur, K.: Intra coding of the HEVC standard. IEEE Trans.
Circuits Syst. Video Technol. 22(12), 1792–1801 (2012)
14. Joint Collaborative Team on Video Coding Reference Software, ver. HM 16.6. https://hevc.hhi.fraunhofer.de/
15. Piao, Y., Min, J.H., Chen, J.: Encoder improvement of unified intra prediction. In: JCTVC-C207, JCT-VC of ISO/IEC and ITU-T, Guangzhou, China (2010)
16. Sun, H., Zhou, D., Goto, S.: A low-complexity HEVC intra prediction algorithm based on level and mode filtering. In: IEEE International Conference on Multimedia and Expo (ICME), pp. 1085–1090
17. Huang, H., Zhao, Y., Lin, C., Bai, H.: Fast bottom-up pruning for HEVC intraframe coding. In: Visual Communications and Image Processing (VCIP), pp. 1–5 (2013)
18. Cen, Y., Wang, W., Yao, X.: A fast CU depth decision mechanism for HEVC. Inf. Process. Lett. 115(9), 719–724 (2015)
19. Tian, G., Goto, S.: Content adaptive prediction unit size decision algorithm for HEVC intra coding. In: Picture Coding Symposium (PCS), pp. 405–408 (2012)
20. Khan, M., Shafique, M., Henkel, J.: An adaptive complexity reduction scheme with fast prediction unit decision for HEVC intra encoding. In: IEEE International Conference on Image Processing (ICIP), pp. 1578–1582 (2013)
21. Jiang, W., Ma, H., Chen, Y.: Gradient based fast mode decision algorithm for intra prediction in HEVC. In: International Conference on Consumer Electronics, Communications and Networks (CECNet), pp. 1836–1840 (2012)
22. Chen, G., Liu, Z., Ikenaga, T., Wang, D.: Fast HEVC intra mode decision using matching edge detector and kernel density estimation alike histogram generation. In: IEEE International Symposium on Circuits and Systems (ISCAS), pp. 53–56 (2013)
23. Yan, S., Hong, L., He, W., Wang, Q.: Group-based fast mode decision algorithm for intra prediction in HEVC. In: International Conference on Signal Image Technology and Internet Based Systems (SITIS), pp. 225–229 (2012)
24.
da Silva, T.L., Agostini, L.V., da Silva Cruz, L.A.: Fast HEVC intra prediction mode decision based on edge direction information. In: European Signal Processing Conference (EUSIPCO), pp. 1214–1218 (2012)
25. Yao, Y., Li, X., Yu, L.: Fast intra mode decision algorithm for HEVC based on dominant edge assent distribution. Multimed. Tools Appl. 75, 1–19 (2014)
26. Shen, L., Zhang, Z., An, P.: Fast CU size decision and mode decision algorithm for HEVC intra coding. IEEE Trans. Consum. Electron. 59(1), 207–213 (2013)
27. Kim, Y., Jun, D., Jung, S., Choi, J.S., Kim, J.: A fast intra-prediction method in HEVC using rate-distortion estimation based on Hadamard transform. ETRI J. 35(2), 270–280 (2013)
28. Bossen, F.: Common test conditions and software reference configurations. Document JCTVC-L1100, ITU-T/ISO/IEC Joint Collaborative Team on Video Coding (JCT-VC), 12th Meeting, Geneva, CH, 14–23 Jan 2013
29. ITU-T Recommendation P.910: Subjective Video Quality Assessment Methods for Multimedia Applications. International Telecommunication Union, Geneva (1999)
30. Pratt, W.K.: Digital Image Processing: PIKS Inside, 3rd edn. Wiley, New York (2001)
31. Hall, M., Frank, E., Holmes, G., Pfahringer, B., Reutemann, P., Witten, I.H.: The WEKA data mining software: an update. SIGKDD Explor. 11(1), 10–18 (2009)
32. Chan, T.F., Golub, G.H., LeVeque, R.J.: Updating formulae and a pairwise algorithm for computing sample variances. Technical Report, Stanford University, Stanford, CA, USA (1979)
33. Quinlan, J.R.: C4.5: Programs for Machine Learning. Morgan Kaufmann, San Francisco (1993)
34. Chen, L., Lin, J.: A study on review manipulation classification using decision tree. In: International Conference on Service Systems and Service Management (ICSSSM), pp. 680–685 (2013)
35. Fernández-Escribano, G., Kalva, H., Cuenca, P., Orozco-Barbosa, L., Garrido, A.: A fast MB mode decision algorithm for MPEG-2 to H.264 P-frame transcoding. IEEE Trans. Circuits Syst. Video Technol.
18(2), 172–185 (2008)
36. Hulse, J.V., Khoshgoftaar, T.M., Napolitano, A.: Experimental perspectives on learning from imbalanced data. In: Proceedings of the 24th International Conference on Machine Learning, pp. 935–942 (2007)
37. Gupta, S., Mazumdar, S.G.: Sobel edge detection algorithm. Int. J. Comput. Sci. Manag. Res. 2(2), 1578–1583 (2013)
38. Jayachandra, D., Makur, A.: Directional variance: a measure to find the directionality in a given image segment. In: IEEE International Symposium on Circuits and Systems (ISCAS), pp. 1551–1554 (2010)
39. Lei, Z., Makur, A.: Enumeration of downsampling lattices in two-dimensional multirate systems. IEEE Trans. Signal Process. 56(1), 414–418 (2008)
40. Velisavljevic, V., Beferull-Lozano, B., Vetterli, M., Dragotti, P.L.: Directionlets: anisotropic multidirectional representation with separable filtering. IEEE Trans. Image Process. 15(7), 1916–1933 (2006)
41. Bjøntegaard, G.: Calculation of average PSNR differences between RD-curves. ITU-T SG16 Q.6 Document, VCEG-M33, Austin, US (2001)

Damian Ruiz received his B.S. and M.S. degrees in Electrical Engineering from the Universidad Politécnica de Madrid (UPM), Spain, and the Ph.D. degree from the University of Castilla-La Mancha (UCLM), Albacete, Spain, in 2000 and 2016, respectively. In 2012, he joined the Mobile Communication Group (MCG) at the Polytechnic University of Valencia (UPV), Valencia, Spain. In 2017, he joined the Department of Signal and Communications Theory at the King Juan Carlos University, Madrid, Spain, where he is currently an Associate Professor. His research interests include image and video coding, machine learning and perceptual video quality. He has over 25 publications in these areas in international refereed journals and conference proceedings. He has also been a visiting researcher at Florida Atlantic University, Boca Raton (USA).

Gerardo Fernández-Escribano received the M.Sc.
degree in Computer Engineering and the Ph.D. degree from the University of Castilla-La Mancha (UCLM), Albacete, Spain, in 2003 and 2007, respectively. In 2008, he joined the Department of Computer Systems at the UCLM, where he is currently an Associate Professor at the School of Industrial Engineering. His research interests include multimedia standards, video transcoding, video compression, video transmission and machine learning mechanisms. He has also been a visiting researcher at Florida Atlantic University, Boca Raton (USA), and at the Friedrich-Alexander-Universität Erlangen-Nürnberg (Germany).

José Luis Martínez (M'07) received his M.S. and Ph.D. degrees in Computer Science and Engineering from the University of Castilla-La Mancha, Albacete, Spain, in 2007 and 2009, respectively. In 2005, he joined the Department of Computer Engineering at the University of Castilla-La Mancha, where he was a researcher with the Computer Architecture and Technology group at the Albacete Research Institute of Informatics (I3A). In 2010, he joined the Department of Computer Architecture at the Complutense University of Madrid, where he was an assistant lecturer. In 2011, he rejoined the Department of Informatics Systems of the University of Castilla-La Mancha, where he is currently an assistant lecturer. His research interests include video coding, video standards, video transcoding and parallel video processing. He has also been a visiting researcher at Florida Atlantic University, Boca Raton (USA), and at the Centre for Communication Systems Research (CCSR) at the University of Surrey, Guildford (UK). He has over 70 publications in these areas in international refereed journals and conference proceedings.

Pedro Cuenca received his M.Sc. degree in Physics (Electronics and Computer Science, extraordinary award) from the University of Valencia in 1994. He received his Ph.D. degree in Computer Engineering in 1999 from the Polytechnic University of Valencia.
In 1995, he joined the Department of Computer Engineering at the University of Castilla-La Mancha (UCLM), where he is currently a Full Professor of Computer Architecture and Dean of the Faculty of Computer Engineering. His research topics are centred on video compression, QoS video transmission and video applications for multicore and GPU architectures. He has published over 100 papers in international journals and conferences. He has also been a visiting researcher at Nottingham Trent University, the University of Ottawa and the University of Surrey. He has served in the organization of international conferences as Chair and Technical Program Chair. He was the Chair of the IFIP 6.8 Working Group during the 2006–2012 period.