Innstraße 33
94032 Passau
http://www.fim.uni-passau.de
Master’s Thesis
Montassar AJAM
Matriculation Number: 75997
26.01.2017
Supervised by
Prof. Dr. Tomas Sauer
First of all, I would like to express my deep gratitude to Prof. Dr. Tomas Sauer, my master thesis supervisor and the Director of the Institute of Software Systems in Technical Applications of Computer Science (FORWISS Passau), for giving me the opportunity to carry out this thesis, for his valuable guidance and for many insightful conversations during the development of the ideas.
I would also like to thank Dr. Alexander Zimmermann for introducing me to the topic, for his support along the way, and of course for sharing his precious time during the learning process.
Special thanks to Mr. Steven Kienast for providing me with the tools that I needed to choose the right direction and successfully complete my thesis.
Finally, I owe more than thanks to my loved ones, who have supported me throughout the entire process, both by keeping me harmonious and by helping me put the pieces together. I will be grateful forever for your love.
I hereby declare that I wrote this thesis myself with the help of no more than the mentioned literature and auxiliary means.
Passau, 26.01.2017
........................................
Abstract
Road detection and tracking methods are the state of the art in present intelligent transportation systems and intelligent vehicle applications. The task is nevertheless very challenging, since the road is an outdoor scene imaged from a moving platform.
This master thesis focuses on active safety for the automotive sector, or more specifically, on the problem of feature extraction and data classification for road detection applications developed to enhance driver assistance systems for semi-autonomous or self-guided vehicles.
In this thesis, we provide a novel feature vector construction approach based on block-wise Discrete Cosine Transform texture, color and position features, which are concatenated, labeled and used to train some commonly used statistical binary classifiers, e.g., the Support Vector Machine. Finally, we elaborate the results and evaluate these classifiers.
In conclusion, the results of this study show that the combination of these different features leads to a successful detection of road regions and is well suited to this kind of task.
Contents
List of Figures
1 Introduction
1.1 Motivation
1.2 Objective
1.3 Scope
1.4 Definition of Terms
1.5 Outline
3 Method
3.1 Pre-processing
3.1.1 Raw Image Data Description
3.1.2 Image Enhancement
3.1.3 Region of Interest
3.2 Feature Extraction
3.2.1 Block-based Feature Extraction
3.2.2 Texture Feature Extraction
3.2.3 Color Feature Extraction
3.2.4 Position Feature Extraction
3.2.5 Feature Vector
3.3 Classification Model
3.3.1 Supervised Learning
3.3.2 Support Vector Machine
3.3.3 Artificial Neural Network
4 Implementation
4.1 Environment
4.2 Feature Vectors Labeling
4.3 Matlab Classification Learner App
4.3.1 Data Importing and Validation Method
4.3.2 Classification Algorithms
4.3.3 Model Comparison and Assessment
4.4 Matlab Neural Net Pattern Recognition App
4.4.1 Input and Target Data Importing
4.4.2 Validation Method and Parameters
4.4.3 Network Architecture
4.4.4 Training the Network
5 Evaluation
5.1 Chosen Parameters
5.2 Validation Method
5.2.1 Hold-out Validation
5.2.2 Accuracy of a Model
5.3 Performance Measurements
5.3.1 Support Vector Machine Classifier
5.3.2 Other Classification Techniques Performance
5.3.3 Neural Network Performance
5.4 Conclusion
List of Acronyms
Bibliography
List of Figures
1.1 Motivation
More than a million people die each year on the world's roads, with an estimated road traffic death rate of 4.3 per 100,000 population in Germany and 24.4 per 100,000 in Tunisia ([TPI13] Global status report on road safety, 2015, Table A2). In addition, the cost of dealing with the economic consequences of these road traffic crashes reaches billions of dollars ([JTA00] Estimating Global Road Fatalities, 2000, Table 11).
Therefore, strategies have been formulated to reduce road traffic accidents. Some of them focus on educating and training road users along with enforcing road traffic rules ([ETS14], p. 7-8), others consider enhancing road infrastructures to make them safer ([ETS14], p. 8), while emerging strategies pay more attention to promoting vehicle safety and the use of modern technologies such as intelligent transport system (ITS) tools.
Hence, the automotive industry has been directly affected by this new trend and has invested in research focused on vehicle safety. The ultimate goal is to start manufacturing semi-autonomous to fully autonomous "crash-less" cars with built-in Advanced Driver Assistance Systems (ADAS), which would provide both comfort and security to drivers.
One of the systems being developed consists of detecting and tracking road regions for autonomous driving. Its task is to analyze the images provided by a front camera and then to find the road regions with high precision. This model relies on image and video processing techniques as well as on recent advances in the field of computer vision. However, none of the previously conceived methods has achieved perfect results.
For instance, the car manufacturer Tesla Motors was the subject of a U.S. federal investigation [MHC16] during the progress of this thesis, due to a fatal crash of a driver operating a Tesla Model S sedan with its Autopilot system engaged. The car's Autopilot, "Model S Software Version 7.0", which rolled out worldwide early this year, is one of the most advanced autopilot systems in the world; it includes features that help steer the car on the highway using a front-facing camera next to the rear-view mirror. The Autopilot has nevertheless faced questions, as it failed to distinguish between a white truck and a bright sky [Lev16] (http://goo.gl/f55PJ7).
1.2 Objective
This thesis describes a novel approach to detecting and tracking road regions. The method utilizes data classification techniques to solve the problem of distinguishing between road and non-road regions in a precise manner.
The training data for this classification problem is constructed from texture, color and position features, which are extracted from pixel blocks of the input images collected by a camera installed at the front of a car.
1.3 Scope
The implementation and testing are based on MATLAB R2016a. For building our algorithm, we used the Image Processing, Computer Vision System, and Statistics and Machine Learning toolboxes. We also used the ImageMagick software for image display, and a labelling graphical user interface developed at the University of Passau by Mr. Steven Kienast to label the images. After designing the model, we use the MATLAB Classification Learner tool to train our model to classify data using supervised machine learning.
1.4 Definition of Terms
Image analysis: refers to processing an image where the ultimate goal is not to enhance or otherwise alter its appearance, but instead to extract meaningful information about its contents.
1.5 Outline
This thesis is divided into an introductory part, five main chapters listed below and a conclusion.
Chapter 2 reviews the previous research in this particular field, i.e., road detection using computer vision techniques, and describes the environments we have used in this thesis and the background knowledge that helped us to achieve our goals.
Chapter 3 : Method
This chapter explains the general methods followed in the thesis as well as the theoretical foundations behind them. We first describe the data we have used, such as the image sequences fed to the classifier.
Chapter 4 : Implementation
This chapter describes how the implementation is done, starting from the labelling process and the tools used for it, and ending with the toolboxes used in the implementation of our method.
Chapter 5 : Evaluation
In this chapter, we show the results obtained after the construction of the feature vectors and the training of the built models; the results represent the performance measurements of the constructed models.
Finally, the concluding chapter describes the problems that occurred and gives a short description of possible solutions to overcome them, in order to build better systems in the future.
camera can be either color or grayscale depending on the intended use. Many vision systems utilize high-quality specialty cameras which have a higher pixel density in the images they produce, thereby capturing more detailed information. Other vision systems use low-noise cameras which produce a cleaner image, where all or most of the information in the image accurately represents the information in the world that was captured. High frame rate cameras are another type used in vision systems; they are able to capture images at a very fast rate, often greater than 30 frames per second.
The optical system used in a typical computer vision system captures images and transfers them to the computer as a series of pixels. Each pixel has a red, green and blue value, each ranging between 0 and 255, or between 0 and 65535 for deep color images.
Figure 2.2 shows an image frame from a camera. r̂ and ĉ are the axes of the image frame, originating at the upper left corner of the image. Pixels in the image are most commonly indexed along the r̂ and ĉ axes, corresponding to the row and column of the image.
This paragraph gives a short overview of the real-time vision pipeline considerations for embedded systems and of the interrelation between the different components of embedded vision systems used in ADAS, since our main focus in this thesis lies on this particular type of application.
The strong and growing automotive market for ADAS based on embedded computer vision systems follows the advancement of powerful hardware, namely in the form of fast, real-time processing, human-machine interfaces and sufficiently large memory.
2.2 Matlab
MATLAB, short for MATrix LABoratory, is a numerical computing environment as well as a high-level programming language. It performs many computationally intensive tasks with considerably higher speed than other programming languages.
MATLAB is used in areas like signal and image processing, communications, control design, test and measurement, financial modeling and analysis, and computational biology. Add-on toolboxes (collections of special-purpose MATLAB functions, available separately) extend the MATLAB environment to solve particular classes of problems in these application areas. Figure 2.3 shows a screenshot of the MATLAB environment.
The toolboxes of current interest for this thesis are the Statistics and Machine Learning Toolbox and the Neural Network Toolbox.
The Statistics and Machine Learning Toolbox lets users fit probability distributions to data, run Monte Carlo simulations, and perform hypothesis tests. Regression and classification algorithms are also included, so that users can draw inferences from data and build predictive models using these algorithms [PT16].
For multidimensional data analysis, the Statistics and Machine Learning Toolbox provides feature selection, stepwise regression, principal component analysis (PCA), regularization, and other dimensionality reduction methods that let users identify the variables or features that impact a particular model.
The toolbox also provides supervised and unsupervised machine learning algorithms, including support vector machines (SVMs), boosted and bagged decision trees, k-nearest neighbors, k-means, k-medoids, hierarchical clustering, Gaussian mixture models, and hidden Markov models. Many of the statistics and machine learning algorithms can be used for computations on data sets that are too big to be stored in memory [PT16].
Exploratory Data Analysis: The toolbox lets users explore relations between variables using visualization techniques, such as scatter plot matrices and classical multidimensional scaling.
Regression and Analysis of variance: Using regression techniques, you can model
a continuous response variable as a function of one or more predictors. Statistics and
Machine Learning Toolbox offers a variety of regression algorithms, including linear
regression, generalized linear models, nonlinear regression, and mixed-effects models.
Analysis of variance (ANOVA) enables you to assign sample variance to different sources
and determine whether the variation arises within or among different population groups.
Statistics and Machine Learning Toolbox includes these ANOVA algorithms and related
techniques.
Big Data, Parallel Computing, and Code Generation: Statistics and Machine
Learning Toolbox allows its users to perform computationally demanding and data-
intensive statistical analysis and to generate portable and readable C code for select
functions for classification, regression, clustering, descriptive statistics, and probability
distributions.
Overview
Neural Network Toolbox provides algorithms, functions, and apps to create, train, vi-
sualize, and simulate neural networks. You can perform classification, regression, clus-
tering, dimensionality reduction, time-series forecasting, and dynamic system modeling
and control.
The toolbox includes convolutional neural network and autoencoder deep learning
algorithms for image classification and feature learning tasks. To speed up training of
large data sets, you can distribute computations and data across multicore processors,
GPUs, and computer clusters using Parallel Computing Toolbox [DBH06].
Deep Learning: Deep learning algorithms can learn discriminative features directly
from data such as images, text, and signals. These algorithms can be used to build highly
accurate classifiers when trained on large labeled training datasets. Neural Network
Toolbox supports training convolutional neural networks and autoencoder deep learning
algorithms for image classification and feature learning tasks.
RGB space is a broadly used color space for image display. It is composed of three color
components red, green, and blue. These components are called ”additive primaries” since
a color in RGB space is produced by adding them together. In contrast, CMY space is a
color space primarily used for printing. The three color components are cyan, magenta
and yellow. These three components are called ”subtractive primaries” since a color in
CMY space is produced through light absorption.
The CIE L*a*b* and CIE L*u*v* spaces consist of a luminance or lightness component (L) and two chromatic components, a and b or u and v [FR98]. CIE L*a*b* is designed to deal with subtractive colorant mixtures, while CIE L*u*v* is designed to deal with additive colorant mixtures.
HSV (or HSL, or HSB) space is widely used in computer graphics and is a more
intuitive way of describing color. The three color components are hue, saturation and
value (or lightness, brightness). The hue is invariant to the changes in illumination and
camera direction and hence more suited to object detection or retrieval. RGB coordinates can easily be translated to HSV (or HLS, or HSB) coordinates using the following steps given by Travis [Tra91]. To convert from RGB to HSV (assuming normalised RGB values), one first has to find the maximum and minimum values of the RGB triplet. Saturation, S, is then:
\[ S = \frac{\max - \min}{\max} \qquad (2.1) \]
and Value, V, is:
\[ V = \max \qquad (2.2) \]
The quantities R', G' and B' are then obtained as:
\[ R' = \frac{\max - R}{\max - \min}, \qquad G' = \frac{\max - G}{\max - \min}, \qquad B' = \frac{\max - B}{\max - \min} \qquad (2.3) \]
Hue, H, is obtained from R', G' and B' by a case analysis on which of the R, G and B channels is the maximum, and is then converted to degrees by multiplying by 60, giving HSV with S and V between 0 and 1 and H between 0 and 360. More details about the RGB and HSV color spaces will follow in Chapter 4, as they are both used in the implementation of our method.
Figures 2.4 and 2.5 illustrate the geometric representations of the RGB and HSV color points.
Another color space, called the opponent color space [FSZ03], which uses the color axes (R-G, 2B-R-G, R+G+B), is also interesting and widely used in the image processing literature. Its representation has the advantage of isolating the brightness information on the third axis. With this solution, the first two chromaticity axes, which are invariant to changes in illumination intensity and shadows, can be down-sampled, since humans are more sensitive to brightness than to chromatic information.
Color Histogram
The color histogram serves as an effective representation of the color content of an image
if the color pattern is unique compared with the rest of the data set. The color histogram
is easy to compute and effective in characterizing both the global and local distributions
of colors in an image. In addition, it is robust to translation and rotation about the
viewing axis and changes only slowly with the scale, occlusion and viewing angle.
Since any pixel in the image can be described by three components in a certain color space (for instance, the red, green and blue components in RGB space, or hue, saturation and value in HSV space), a histogram, i.e. the distribution of the number of pixels over the quantized bins, can be defined for each component. Clearly, the more bins a color histogram contains, the more discrimination power it has. However, a histogram with a large number of bins will not only increase the computational cost, but will also be inappropriate for building efficient indexes for image databases.
Furthermore, a very fine bin quantization does not necessarily improve the retrieval
performance in many applications. One way to reduce the number of bins is to use the
opponent color space which enables the brightness of the histogram to be down sampled.
Another way is to use clustering methods to determine the K best colors in a given space
for a given set of images. Each of these best colors will be taken as a histogram bin. Since
the clustering process takes the color distribution of images over the entire database into
consideration, the likelihood of histogram bins in which no or very few pixels fall will
be minimized. Another option is to use the bins that have the largest pixel numbers
since a small number of histogram bins capture the majority of pixels of an image. Such
a reduction does not degrade the performance of histogram matching, but may even
enhance it since small histogram bins are likely to be noisy.
The color histogram does not take the spatial information of pixels into consideration, thus very different images can have similar color distributions. To increase discrimination power, several improvements have been proposed to incorporate spatial information. A simple approach is to divide an image into sub-areas and calculate a histogram for each of those sub-areas. As introduced above, the division can be as simple as a rectangular partition, or as complex as a region or even object segmentation. Increasing the number of sub-areas increases the information about location, but also increases the memory and computational time.
The CCV addresses the problem that the color histogram does not consider the spatial information of pixels. In a CCV, each histogram bin is partitioned into two types: coherent and incoherent. A pixel whose value belongs to a large uniformly-colored region falls into the coherent type; otherwise it falls into the incoherent type [FSZ03].
Let α_i denote the number of coherent pixels and β_i the number of incoherent pixels in the i-th color bin of an image. Then the CCV of the image is defined as the vector ((α_1, β_1), (α_2, β_2), ..., (α_n, β_n)), where ((α_1 + β_1), (α_2 + β_2), ..., (α_n + β_n)) is the color histogram of the image. The CCV generally outperforms the color histogram in image retrieval tasks [PZM96], especially for images which have either mostly uniform color or mostly texture regions, due to its additional spatial information, but it is not widely used in computer vision and pattern recognition tasks. In addition, for both the color histogram and the color coherence vector representation, the HSV color space provides better results than the CIE L*u*v* and CIE L*a*b* spaces.
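As an illustration of the idea, the following MATLAB sketch computes a colour coherence vector for one quantised channel; the number of bins, the coherence threshold tau and the 8-connectivity are illustrative assumptions, not values prescribed by [FSZ03] or [PZM96], and the image name is a placeholder.

% Hedged sketch: colour coherence vector (CCV) for one colour channel.
% nBins, tau and the 8-connectivity are illustrative assumptions.
I     = imread('peppers.png');
ch    = im2double(I(:,:,1));        % use the red channel as an example
nBins = 8;                          % number of colour bins
tau   = 50;                         % minimum region size to count as coherent

binIdx = min(floor(ch * nBins) + 1, nBins);   % quantise into bins 1..nBins
alpha  = zeros(1, nBins);           % coherent pixel counts per bin
beta   = zeros(1, nBins);           % incoherent pixel counts per bin

for b = 1:nBins
    cc    = bwconncomp(binIdx == b, 8);           % connected regions of bin b
    sizes = cellfun(@numel, cc.PixelIdxList);     % size of each region
    alpha(b) = sum(sizes(sizes >= tau));          % pixels in large regions
    beta(b)  = sum(sizes(sizes <  tau));          % pixels in small regions
end

ccv = [alpha; beta];                % CCV: one (alpha_i, beta_i) pair per bin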
Texture classification is an active research topic in computer vision and pattern recognition. Early texture classification methods focused on the statistical analysis of texture images.
The LBP operator was originally designed for texture description. The operator assigns a label to every pixel of an image by thresholding the 3 × 3 neighborhood of each pixel with the center pixel value and considering the result as a binary number. The 256-bin LBP histogram computed over a region is then used for texture description. Figure 2.6 illustrates the basic LBP operator.
\[ LBP_{P,R} = \sum_{p=0}^{P-1} s(g_p - g_c)\, 2^p, \qquad s(x) = \begin{cases} 1, & x \ge 0 \\ 0, & x < 0 \end{cases} \qquad (2.7) \]
where g_c is the gray value of the central pixel, g_p is the value of its neighbors, P is the total number of involved neighbors, and R is the radius of the neighborhood.
To be able to deal with textures at different scales, the LBP operator was later ex-
tended to use neighborhoods of different sizes [OPM02]. Defining the local neighborhood
as a set of sampling points evenly spaced on a circle centered at the pixel to be labeled
allows any radius and number of sampling points.
Figure 2.7: Examples of the extended LBP operator: the circular (8, 1), (16, 2) and (24, 3) neighborhoods.
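A minimal MATLAB sketch of the basic 3 × 3 LBP operator of equation (2.7) could look as follows; the ordering of the eight neighbour offsets (and hence of the bit weights) is an arbitrary choice here, and the image name is a placeholder.

% Hedged sketch: basic 3x3 LBP labels and the 256-bin histogram of a region.
I  = im2double(rgb2gray(imread('peppers.png')));
[rows, cols] = size(I);
lbp = zeros(rows - 2, cols - 2);    % one label per interior pixel

% Offsets of the 8 neighbours (clockwise); the ordering is an assumption.
dr = [-1 -1 -1  0  1  1  1  0];
dc = [-1  0  1  1  1  0 -1 -1];

centre = I(2:end-1, 2:end-1);
for p = 1:8
    neigh = I(2+dr(p):end-1+dr(p), 2+dc(p):end-1+dc(p));
    lbp   = lbp + double(neigh >= centre) * 2^(p-1);   % s(g_p - g_c) * 2^p
end

h = histcounts(lbp(:), 0:256);      % 256-bin LBP histogram of the region

The multi-scale (P, R) variants mentioned above are available, for instance, through extractLBPFeatures in the Computer Vision System Toolbox.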
2.4.3 Gabor-Fisher
Gabor features [LW02] are very popular in the computer vision field; their effectiveness has been proved through many studies exploiting Gabor features, particularly in tasks such as face recognition.
Before we explain how the Gabor feature representation is expressed, we should first define what Gabor wavelets (kernels, filters) are. Gabor wavelets were introduced to image analysis due to their biological relevance and computational properties [Dau85], [JP87]. The Gabor wavelets, whose kernels are similar to the 2-D receptive field profiles of the mammalian cortical simple cells, exhibit desirable characteristics of spatial locality and orientation selectivity, and are optimally localized in the space and frequency domains. The Gabor wavelets (kernels, filters) can be defined as in [Dau80], equation (2.8), where ∥·∥ denotes the norm operator and k_{μ,ν} is the wave vector.
The Gabor wavelet representation of an image is the convolution of the image with a family of Gabor kernels as defined by (2.8). Let I(x, y) be the gray level distribution of an image; the convolution of the image with a Gabor kernel Ψ_{μ,ν} is denoted O_{μ,ν}(z).
The Gabor filtering coefficient O_{μ,ν}(z) is a complex number, which can be rewritten as O_{μ,ν}(z) = M_{μ,ν}(z) exp(iθ_{μ,ν}(z)), with M_{μ,ν}(z) being the magnitude and θ_{μ,ν}(z) being the phase. It is known that the magnitude information contains the variation of local energy in the image. In [LW02], the augmented Gabor feature vector Feat_Gabor is defined via uniform down-sampling, normalization and concatenation of the Gabor filtering coefficients:
where O^{(ρ)}_{μ,ν} is the concatenated column vector obtained from the magnitude matrix M_{μ,ν} down-sampled by a factor of ρ, and t denotes the transpose operator [YZ10].
The application of Gabor features is still mostly limited to object description applications [GA09], and in particular to face description, because they generally outperform other feature descriptors in these tasks.
The gradients are obtained by convolving the image I with the derivative masks D_x = [-1 \; 0 \; 1] and D_y = [-1 \; 0 \; 1]^T:
\[ I_x = I * D_x, \qquad I_y = I * D_y \]
The magnitude of the gradient is \( |G| = \sqrt{I_x^2 + I_y^2} \), and its orientation is given by \( \theta = \arctan(I_y / I_x) \).
• Finally the HOG feature vector is created by concatenating the histograms of all
small regions.
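As a small illustration of the gradient step above, the following MATLAB sketch computes I_x, I_y, the magnitude and the orientation; atan2 is used instead of arctan(I_y/I_x) to avoid division by zero, and the image name is a placeholder.

% Hedged sketch: image gradients, magnitude and orientation as used by HOG.
I  = im2double(rgb2gray(imread('peppers.png')));
Dx = [-1 0 1];                           % horizontal derivative mask
Dy = Dx';                                % vertical derivative mask

Ix = imfilter(I, Dx, 'replicate', 'conv');   % I * Dx (true convolution)
Iy = imfilter(I, Dy, 'replicate', 'conv');   % I * Dy

G     = sqrt(Ix.^2 + Iy.^2);             % gradient magnitude |G|
theta = atan2(Iy, Ix);                   % orientation, robust to Ix == 0

The complete descriptor, including the cell histograms and block normalisation, is available through extractHOGFeatures in the Computer Vision System Toolbox.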
In this master thesis, we have used the Discrete Cosine Transform to build the feature vector used in the classification of road and non-road regions, which is the main task of this work. Therefore, a more detailed explanation of this transform, and of the way the DCT feature is built, is given in Chapter 3 of this thesis.
3.1 Pre-processing
This step is of utmost importance as it helps us to reduce processing and computation time. The idea is to locate the region of interest (ROI) of each frame and to eliminate all irrelevant data points (e.g. sky region, buildings, ...) from later processing. But first, we adjust the intensity values of the raw image sequence for a better visual display.
The goal is to obtain a model able to perform accurately on unseen new examples after having experienced our learning data set.
Figure 4.3 illustrates sample frame captures of the used data set.
Color Depth
We use in this thesis sRGB deep color images, i.e. 16-bit color depth per channel. This color depth allows for far more than a billion color variations (2^48 possible colors) and therefore allows for a higher level of description of the image information.
To simplify the input/output operations, the image sequence is concatenated into a single binary PPM-formatted stream file (Portable Pixel Map) respecting the format specification, i.e. no data, delimiters, or padding before, after, or between the concatenated images.
The first step in performing a contrast stretch of an image is to determine the limits over which the image intensity values will be extended. These lower and upper limits of the new output are fixed. Next, the histogram of the original image is examined to determine the value limits in the input image. If the original range already covers the full possible set of values, straightforward contrast stretching will achieve nothing, but often most of the image data is contained within a restricted range, and this restricted range can be stretched linearly. This can be summarized by the function:
\[ P_{out} = (P_{in} - c)\left(\frac{b - a}{d - c}\right) + a \qquad (3.1) \]
with:
a (resp. b): the lower (resp. upper) intensity value of the output image.
c (resp. d): the lower (resp. upper) intensity value of the input image.
P_in (resp. P_out): the original (resp. output) pixel value.
The problem with using only this mapping is that a single outlying pixel with either a very high or very low value can severely affect the value of c or d, and this could lead to a very unrepresentative scaling. A more robust approach is therefore to first take a histogram of the image, and then select c and d at, say, the n-th percentile in the histogram (that is, n% of the pixels in the histogram will have values lower than c, and n% of the pixels will have values higher than d). This prevents outliers from affecting the scaling so much.
We carry out a contrast stretch on the input images by adjusting their intensity values for a better visual display. This is done by mapping the intensity values to new values such that 1% of the data is saturated at the low and high intensities of the input images (i.e. these values are rejected and left outside the chosen limits for low and high intensities); this operation increases the contrast of the output images.
This pre-processing step is optional and has no effect on the final results of the proposed method; it is done only for display purposes.
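A minimal MATLAB sketch of this 1% saturation stretch, assuming a single frame has already been extracted from the stream; the file name is a placeholder.

% Hedged sketch: contrast stretching with 1% saturation at both ends,
% i.e. c and d are taken at the 1st and 99th percentile of the histogram.
I      = imread('frame0001.ppm');            % placeholder file name
limits = stretchlim(I, [0.01 0.99]);         % per-channel input limits c, d
Iadj   = imadjust(I, limits, []);            % map [c,d] linearly to full range
imshow(Iadj);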
vanishing point of the road region, which is by definition the point at which the road region disappears or ceases to exist.
Vanishing point detection has been a case study in many previous works. In both the paper of Kong et al. [KAP09] and the paper of Bui & Nobuyama [BN12], a soft-voting method was proposed to detect the vanishing point. In our work, however, the vanishing point does not need to be computed, as we assume that the camera capturing the scenes is in a fixed position and that the vanishing point does not vary between the input sequence frames. An empirical position is chosen instead, and the frames are cropped horizontally along the line passing through the vanishing point and parallel to the frame's horizontal axis.
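Since the vanishing point row is fixed empirically, the ROI extraction reduces to dropping all rows above that line; a minimal sketch, with the row index and file name as placeholders:

% Hedged sketch: keep only the image region below the assumed vanishing point.
I     = imread('frame0001.ppm');    % placeholder file name
vpRow = 250;                        % empirically chosen vanishing point row
roi   = I(vpRow:end, :, :);         % crop horizontally along that line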
Block processing operations are performed by splitting the input frame into blocks of equal size; when the dimensions of the frame are not integral multiples of the chosen block size, the algorithm pads zeros along the rows and/or columns to obtain an integral number of blocks. The block size is a parameter that depends on the application, e.g. a block size of 8×8 pixels is chosen for JPEG compression. It is also possible to create an overlapping block border between neighbouring blocks. The overlap size is also a parameter to be chosen depending on the application specifications. When the overlap size is 0×0 pixels, the block splitting is called distinct block processing, as there is no overlap and the neighbouring blocks are totally independent.
Figure 3.4: Block splitting with overlaps.
Figure 3.5: Distinct block splitting.
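As a concrete illustration, blockproc from the Image Processing Toolbox performs exactly this kind of splitting; the block size, overlap and per-block functions below are placeholder choices, not the parameters used later in this thesis.

% Hedged sketch: distinct and overlapping block processing with blockproc.
I = im2double(rgb2gray(imread('peppers.png')));

blockSize = [16 16];                % block size, application dependent
overlap   = [4 4];                  % border added on each side of a block

% Distinct blocks (no overlap): each block is processed independently.
meanMap = blockproc(I, blockSize, @(b) mean(b.data(:)));

% Overlapping blocks: BorderSize pads each block with its neighbours' pixels;
% missing border pixels at the image edges are zero-padded by default.
dctPerBlock = blockproc(I, blockSize, @(b) dct2(b.data), ...
                        'BorderSize', overlap, 'TrimBorder', false);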
Texture Analysis
The discrete cosine transform operates on a signal to transform it from the spatial domain to the frequency domain. This helps to separate the input signal into spectral sub-bands of different importance. There exist four variants of the discrete cosine transform, namely DCT-I, DCT-II, DCT-III and DCT-IV. The most commonly used type, the DCT-II, is often simply referred to as "the DCT" and was defined for the first time in the paper of N. Ahmed et al. [ANR74] as follows: the DCT-II of a one-dimensional data sequence X(m), m = 0, 1, ..., (M − 1), is defined as:
\[ G_x(0) = \frac{\sqrt{2}}{M} \sum_{m=0}^{M-1} X(m) \]
\[ G_x(u) = \frac{2}{M} \sum_{m=0}^{M-1} X(m) \cos\frac{(2m+1)u\pi}{2M}, \qquad u = 1, 2, \cdots, (M-1) \qquad (3.2) \]
where G_x(u) is the u-th DCT coefficient. Similarly, the inverse transformation is defined as:
\[ X(m) = \frac{1}{\sqrt{2}}\, G_x(0) + \sum_{u=1}^{M-1} G_x(u) \cos\frac{(2m+1)u\pi}{2M}, \qquad m = 0, 1, \cdots, (M-1) \qquad (3.3) \]
When the data sequence is two-dimensional, as in the case of images, a two-dimensional DCT-II is applied to the data sequence to transform it to the frequency domain. The 2-D DCT-II for a data sequence X(m, n), m = 0, 1, ..., (M − 1), n = 0, 1, ..., (N − 1), is just an extension of the 1-D DCT-II defined in equation (3.2) and is given by:
\[ G_x(0,0) = \frac{2}{MN} \sum_{m=0}^{M-1} \sum_{n=0}^{N-1} X(m,n) \]
\[ G_x(0,v) = \frac{4}{MN} \sum_{m=0}^{M-1} \sum_{n=0}^{N-1} X(m,n) \cos\frac{(2n+1)v\pi}{2N}, \]
\[ G_x(u,0) = \frac{4}{MN} \sum_{m=0}^{M-1} \sum_{n=0}^{N-1} X(m,n) \cos\frac{(2m+1)u\pi}{2M}, \]
\[ G_x(u,v) = \frac{4}{MN} \sum_{m=0}^{M-1} \sum_{n=0}^{N-1} X(m,n) \cos\frac{(2m+1)u\pi}{2M} \cos\frac{(2n+1)v\pi}{2N}, \qquad (3.4) \]
where u = 1, 2, ..., (M − 1) and v = 1, 2, ..., (N − 1). The inverse transformation is given by:
\[ X(m,n) = \frac{1}{2}\, G_x(0,0) + \sum_{u=1}^{M-1} \sum_{v=1}^{N-1} G_x(u,v) \cos\frac{(2m+1)u\pi}{2M} \cos\frac{(2n+1)v\pi}{2N}, \qquad (3.5) \]
where m = 0, 1, ..., (M − 1) and n = 0, 1, ..., (N − 1).
The first DCT coefficient G_x(0, 0) is a special case because it is proportional to the average intensity of all the input samples. It is for that reason sometimes called the "DC component" of the signal (DC as in direct current).
In this section, a DCT feature extraction approach is proposed. This approach is composed of two stages. In the first stage, we choose a block size and an overlap parameter and then apply the DCT to each block of the image. In the second stage, we construct the feature vector of each block as follows.
Suppose we have chosen the block size to be [N, N] and the overlap size to be [P, P]. The 2-D DCT-II is then applied to a block of size [N + 2P, N + 2P], resulting in a coefficient array of the same size, in which the entry at position (i, j) represents the DCT coefficient G_x(i, j). Since the first coefficient G_x(0, 0) is significantly affected by any illumination change, as it represents the average intensity of the block, we exclude it from the feature vector in order to achieve robustness to illumination changes. Instead, we divide the remaining coefficients by this component. The feature vector then has a size of 2(N + 2P) − 1 and takes the form:
\[ DCT_{Feature}(Block) = \frac{1}{G_x(0,0)} \begin{pmatrix} G_x(0,1) \\ G_x(0,2) \\ G_x(0,3) \\ \vdots \\ G_x(0,\, N + 2P - 1) \\ G_x(1,0) \\ G_x(1,1) \\ \vdots \\ G_x(N + 2P - 1,\, N + 2P - 1) \end{pmatrix} \]
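A hedged MATLAB sketch of this construction for a single padded block follows. The normalisation by G_x(0,0) and the exclusion of the DC coefficient are taken from the description above; keeping all remaining coefficients in column order is an assumption, since the stated vector length suggests that a sub-selection may have been used, and the block data and sizes are placeholders.

% Hedged sketch: DCT texture feature of one padded block.
N = 16;  P = 4;                         % block size and overlap (placeholders)
block = rand(N + 2*P);                  % stands in for one padded image block

C  = dct2(block);                       % 2-D DCT-II coefficients G_x(i,j)
dc = C(1,1);                            % DC component (average intensity)
coeffs = C(:);                          % flatten the coefficients column-wise
coeffs(1) = [];                         % drop the DC coefficient
dctFeature = coeffs / dc;               % normalise for illumination robustness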
The approach we took to obtain the color features of the input frames is to use the color histogram associated with each block of these frames, which are divided in the same manner as for the previous feature, i.e. with the same block size and overlap size. For this reason, the choice of the color space becomes of utmost importance, as the constructed color histogram depends directly on the chosen color space.
To cover as many situations as possible, we have chosen to use two color spaces separately, the standard RGB color space and then the HSV color space. Thereafter, we compare the results to conclude which color space fits our problem better.
The step that comes after choosing the color space representation is the extraction of the color features. A three-channel color histogram is built for each block, i.e. each channel is represented by a 1-D color histogram, and the three color histogram distributions are then concatenated to obtain the color feature vector.
Because the number of elements per histogram, i.e. the number of bins, determines the size of our feature vector, choosing this parameter is of great importance in the construction of the feature vector.
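A possible MATLAB sketch of the per-block colour feature follows; the number of bins and the RGB/HSV switch are the adjustable choices discussed above, and the image used as a stand-in for one block is a placeholder.

% Hedged sketch: per-block colour feature = concatenated 1-D histograms of
% the three channels. nBins is the tuning parameter discussed above.
blockRGB = im2double(imread('peppers.png'));   % stands in for one RGB block
nBins    = 8;
edges    = linspace(0, 1, nBins + 1);

useHSV = true;                       % switch between the RGB and HSV variants
if useHSV
    blockRGB = rgb2hsv(blockRGB);    % H, S, V channels all lie in [0,1]
end

colorFeature = [];
for ch = 1:3
    c = blockRGB(:,:,ch);
    h = histcounts(c(:), edges, 'Normalization', 'probability');
    colorFeature = [colorFeature, h];            %#ok<AGROW> concatenate
end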
The position feature vector we construct is associated with each block and describes it by the position of its center pixel. If the block has no exact center pixel, e.g. a block of size 10×10, the pixel describing the block is the pixel with the highest coordinates in the smallest central square of size 2×2.
Figure 3.7: Center pixel of a 9×9 block.
Figure 3.8: Center pixel of a 10×10 block.
The position feature vector of a particular block is a two-dimensional vector. The first and second coefficients of this vector are respectively the x-coordinate and the y-coordinate of the pixel chosen to describe the block.
\[ Position_{Feature}(Block) = \begin{pmatrix} x\text{-coordinate of the central pixel} \\ y\text{-coordinate of the central pixel} \end{pmatrix} \]
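A small sketch of the centre-pixel rule as read from the description above; the formula floor(N/2)+1 gives the exact centre for odd N and the lower-right pixel of the central 2×2 square for even N, which is my interpretation of the rule, and the block coordinates are placeholders.

% Hedged sketch: position feature of a block whose top-left corner lies at
% row r0, column c0 of the frame (x = column index, y = row index assumed).
N  = 10;                 % block size (placeholder)
r0 = 41;  c0 = 121;      % top-left corner of the block in the frame
centreRow = r0 + floor(N/2);          % y-coordinate of the describing pixel
centreCol = c0 + floor(N/2);          % x-coordinate of the describing pixel
positionFeature = [centreCol; centreRow];   % [x; y], as in the vector above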
margins and support vectors. A more detailed description of SVMs follows, based on
Burges (1998) and Vapnik (1998).
Mathematical formulation
For a given dataset of n samples of the form (x_1, y_1), ..., (x_n, y_n), where y_i ∈ {−1, 1} is a constant denoting the class to which the point x_i belongs, each x_i is a p-dimensional real vector.
SV classifiers are based on the idea of a dividing (or separating) hyperplane f(x) with normal w ∈ R^p and an offset b ∈ R:
\[ f(x) = \langle w, x \rangle + b = 0, \qquad x \in \mathbb{R}^p \qquad (3.6) \]
When the data is linearly separable, the best separating hyperplane (i.e. decision boundary) is found by solving the optimization problem [JWHT14] [CST00]:
\[ w = \operatorname*{argmin}_{w \in \mathbb{R}^p} \frac{1}{2}\|w\|^2 \quad \text{subject to} \quad y_i(\langle w, x_i \rangle + b) \ge 1, \quad i = 1, \cdots, n \qquad (3.7) \]
In the non-separable case, a slack parameter ξ_i is introduced that measures the amount of violation of the constraints y_i(⟨w, x_i⟩ + b) ≥ 1. The optimization problem then takes the form:
\[ (w, \xi) = \operatorname*{argmin}_{w \in \mathbb{R}^p,\, \xi \in \mathbb{R}^n} \frac{1}{2}\|w\|^2 + C \sum_{i=1}^{n} \xi_i \quad \text{subject to} \quad y_i(\langle w, x_i \rangle + b) \ge 1 - \xi_i, \quad \xi_i \ge 0 \qquad (3.8) \]
The dual of this optimization problem can be formulated as:
\[ \alpha = \operatorname*{argmax}_{\alpha} \sum_i \alpha_i - \frac{1}{2} \sum_{j,k} \alpha_j \alpha_k y_j y_k \langle x_j, x_k \rangle \quad \text{subject to} \quad \alpha_i \ge 0 \ \forall i, \quad \sum_i \alpha_i y_i = 0 \qquad (3.9) \]
Non-linearly separable problems are handled by mapping the data into a higher-dimensional feature space F through a feature map φ : X → F, with an associated kernel function K : X × X → R, K(x, x') = ⟨φ(x), φ(x')⟩.
In this thesis we have built six models based on the support vector machine classifier. One model uses a linear kernel, and the five other models were built using the following kernel functions.
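The kernel functions themselves are listed in the original document. For reference, a hedged sketch of how such models can be trained with fitcsvm from the Statistics and Machine Learning Toolbox, which is what the Classification Learner app wraps; the kernels, box constraint and data shown are examples, not the author's exact settings.

% Hedged sketch: training soft-margin SVMs with different kernels.
% X is the n-by-p matrix of feature vectors, y the block labels (placeholders).
X = rand(200, 20);                   % placeholder training features
y = double(rand(200,1) > 0.5);       % placeholder labels

svmLinear = fitcsvm(X, y, 'KernelFunction', 'linear',   'BoxConstraint', 1);
svmRBF    = fitcsvm(X, y, 'KernelFunction', 'gaussian', 'BoxConstraint', 1, ...
                    'KernelScale', 'auto');
svmCubic  = fitcsvm(X, y, 'KernelFunction', 'polynomial', ...
                    'PolynomialOrder', 3);

labels = predict(svmRBF, X(1:5, :)); % predict the class of a few samples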
A human brain contains an enormous number of nerve cells called neurons. Each of these neurons is connected to many other similar neurons, creating a very complex network. Each cell collects inputs from all the neural cells it is connected to in the form of an electric signal; if this signal reaches a certain threshold, the neuron signals to all the cells it is connected to. The analogue of the neuron in an artificial neural network is called a "perceptron", and it operates the same way as its real analogue.
Figure 3.10: Schematic of a perceptron: the inputs x_1, ..., x_n are weighted by w_1, ..., w_n, summed together with a bias b, and passed through an activation function.
The perceptron can take several weighted inputs and sum them; if the combined input exceeds a threshold, it fires (i.e. sends an output which is determined by the activation function and is often chosen to lie between 0 and 1 or between −1 and 1). An important thing to note here is that the derivative of the activation function should be easy to calculate, and it is mathematically convenient if it can be expressed in terms of the original activation function, as this derivative is needed in the well-known training algorithm of neural networks, the backpropagation algorithm.
The functioning of a Perceptron can be summed up in the following equation:
\[ y = \Phi\left( \sum_{i=1}^{n} w_i x_i + b \right) \qquad (3.10) \]
where y is the output signal, Φ is the activation function, n is the number of connections to the perceptron, w_i is the weight associated with the i-th connection, x_i is the value of the i-th connection, and b represents the threshold. A graphical representation can be found in Figure 3.10. Despite the simplicity of the idea behind the perceptron, its strength becomes apparent when several perceptrons are combined and work together. The perceptrons are often organized in layers, where each layer takes inputs from the previous one, applies weights and then signals to the next layer.
Figure 3.11: A graphical representation of an artificial neural network with one hidden
layer. Picture by the author
A classifier must be able to learn from examples by adapting the weights on the incoming connections of the hidden units.
In an ANN, this is achieved by updating the weights of the connections between the layers. There are several ways of doing this; most of them involve initializing the weights and then feeding an example to the network. The error made by the network at the output level is then calculated and fed backwards through "backpropagation". This process is then used to update the weights. Hence, the network can learn to distinguish between several different classes by repeating the backpropagation algorithm.
Machine learning problems, and the ANN approach in particular, always face the risk of over-fitting, where the classifier becomes too good at recognizing the training examples at the expense of not being able to recognize a general input. This can be avoided by cross-validation, where the network is trained on one set of data and then evaluated on a separate one. When the error starts rising on the validation set, the network might be over-fitted. If previous network configurations are saved, the network can then be rolled back to the one which gave the smallest error.
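As a concrete counterpart to this description, a minimal sketch of the Neural Network Toolbox calls that the Pattern Recognition app generates internally; the hidden layer size, the data split ratios and the placeholder data are assumptions, not the settings used later in the thesis.

% Hedged sketch: two-class pattern recognition network trained by backpropagation.
% X: p-by-n matrix of feature vectors (one column per sample),
% labels: 1-by-n vector of 0/1 block labels. Both are placeholders here.
X      = rand(20, 500);
labels = double(rand(1, 500) > 0.5);
T      = full(ind2vec(labels + 1));            % one-hot targets, 2-by-n

net = patternnet(10);                          % one hidden layer, 10 neurons
net.divideParam.trainRatio = 0.70;             % train / validation / test split
net.divideParam.valRatio   = 0.15;
net.divideParam.testRatio  = 0.15;

[net, tr] = train(net, X, T);                  % backpropagation training
Y    = net(X);                                 % network outputs
perf = perform(net, T, Y);                     % performance measure
plotconfusion(T, Y);                           % confusion matrix plot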
4.1 Environment
The following software and operating systems were used in this thesis:
For labeling our frames, we have used a graphical user interface tool named Evaluation Gui, developed at the University of Passau. Using this tool, we can load the road image sequence, select in an input frame the pixels that correspond to the road borders, and finally obtain as output a binary file which represents a stream of road edge pixel positions. Some other options are also available, such as an interpolation of the selected pixels between different frames.
Each single frame treated by the Evaluation Gui is represented by its number in the image sequence, followed by the selected pixel positions of the left side of the road and then the selected pixel positions of the right side.
To associate with each block of the input image sequence a label indicating whether it belongs to a road region or not, we start by extracting all the road edge pixel positions from the binary file and then specify a polygonal region of interest by interpolating the pixel positions of each frame; the region of interest is the road area. The next step consists of constructing a pixel-wise mask image in which the pixels inside the specified region of interest have the label 1 and the rest have the label 0. This way we construct a pixel-wise mask image in which each pixel of the treated image has an associated label.
As we are going to carry out a block-wise classification, our aim in this step is to construct a block-wise mask in which each block is described by one label. For this reason, we have built a simple function that takes as input the pixel-wise mask and the block size chosen for extracting the features. For each block, the numbers of pixels labeled 1 and labeled 0 are counted, and the block receives the label that is more frequent among its pixels; a sketch of both masking steps is given below.
Figure 4.4 shows a block-wise labeled sample frame.
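A hedged sketch of the two masking steps described above; the polygon coordinates, block size and majority threshold are placeholders that only illustrate the procedure, not the thesis's actual parameters.

% Hedged sketch: pixel-wise road mask from the labelled road-edge polygon,
% followed by block-wise majority labelling.
I = imread('frame0001.ppm');                    % placeholder frame
[rows, cols, ~] = size(I);

% xRoad / yRoad: interpolated road-boundary pixel positions for this frame,
% ordered so that they enclose the road area (placeholder values here).
xRoad = [50 600 500 150];
yRoad = [rows rows 300 300];

pixelMask = poly2mask(xRoad, yRoad, rows, cols);    % 1 inside the road region

blockSize = [16 16];
% Majority label per block: 1 if more than half of its pixels are road pixels.
blockMask = blockproc(double(pixelMask), blockSize, ...
                      @(b) double(mean(b.data(:)) > 0.5));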
A table is then constructed; it contains a "bag" of feature vectors extracted from the processed frames, and with each vector an additional coefficient is associated which represents the label of that vector. The table is later used by the Learner app as labeled training data for the construction of the supervised machine learning model.
The following table illustrates an example of how the training data is organized as a table.
The Classification Learner app trains models to classify data using supervised machine learning, including decision trees, support vector machines (SVM), k-nearest neighbors, logistic regression and ensemble classifiers.
We can also use the Learner app for some other machine learning tasks, such as selecting features, specifying validation schemes, and assessing results. Using the Classification Learner app is simple and can be described in the following steps.
Decision trees  | Support vector machines | Nearest neighbor classifiers | Ensemble classifiers
Deep tree       | Linear SVM              | Fine KNN                     | Boosted trees
Medium tree     | Fine Gaussian SVM       | Medium KNN                   | Bagged trees
Shallow tree    | Medium Gaussian SVM     | Coarse KNN                   | Subspace KNN
                | Coarse Gaussian SVM     | Cubic KNN                    | Subspace discriminant
                | Quadratic SVM           | Cosine KNN                   |
                | Cubic SVM               | Weighted KNN                 |
Recognition Tool.
This last interface presents the state of the network while the training is in progress. When the training of the model ends, we can obtain the performance of the model as well as different plots presenting the error histogram and the confusion matrix.
This method uses only a portion of the data for validation. For that reason, it is recommended for large data sets, which matches our situation.
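A hedged sketch of hold-out validation and accuracy computation with cvpartition from the Statistics and Machine Learning Toolbox; the 25% hold-out fraction, the kernel and the placeholder data are assumptions, not the thesis's chosen parameters.

% Hedged sketch: hold-out validation and accuracy of a trained classifier.
% X, y are the labelled feature vectors (placeholders here).
X = rand(1000, 20);
y = double(rand(1000,1) > 0.5);

cv     = cvpartition(size(X,1), 'HoldOut', 0.25);    % 75% train / 25% test
Xtrain = X(training(cv), :);   ytrain = y(training(cv));
Xtest  = X(test(cv), :);       ytest  = y(test(cv));

mdl  = fitcsvm(Xtrain, ytrain, 'KernelFunction', 'gaussian', ...
               'KernelScale', 'auto');
yhat = predict(mdl, Xtest);
accuracy = mean(yhat == ytest);                      % fraction of correct labels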
Beyond the accuracy results, the distribution of the support vectors, i.e. the positions of these vectors in the frames from which the training data is constructed, is of particular interest for a precise evaluation of our SV models. As these support vectors define the separating hyperplane between road and non-road regions, studying their positions can give us more information about the behaviour of our predictive models. For that reason, we have overlaid some frames with their associated support vectors, shown at their positions, under different conditions.
As shown in Figure 5.3, the distribution of the support vectors follows the road edges, as they are the natural delimiters of the road and non-road regions. Robustness against contrast changes is unfortunately poorly achieved. As we can see from the next figure, when the frame presents both shadowed and non-shadowed road regions, the model has difficulty detecting the road edges, and the support vectors become distributed in a pseudo-irregular manner.
Figure 5.3: Support vector distribution over an associated frame, in the case of a shadowed frame.
It can be observed from these results that the support vector machine gives better results than most of the standard classifiers used. We can clearly see that the nearest neighbours classifier (KNN), in all its variants, does not appear to fit this kind of task. However, we can also notice that the logistic regression classifier and the decision tree with a medium number of leaves N (i.e. N can reach 20) give good results that are comparable to the SVM performance, and a complex tree, i.e. a decision tree with great flexibility with respect to the number of its leaves, can achieve astounding results when the number of samples gets bigger.
5.4 Conclusion
The implementation of this system in the Matlab environment with its image processing and machine learning toolboxes, following a design step consisting of constructing a suitable feature vector capable of describing as well as possible the information contained in the input images, has led us to the results of the last section and allows us to affirm the capability of this system to detect road regions. We can also note that real-time detection of the road regions can easily be reached thanks to optimized code and the speed of the classification process of the built models, which allows us to conclude that the main goals of this thesis have been reached.
This system also has some constraints; the shadows that happen to appear in the image sequences can make it harder for the classifier to find the boundary between the road and non-road regions. Nevertheless, the detection of the main part of the road is still achieved.
If one wants to build a larger system that is almost insensitive to such external interferences, it could be achieved by integrating feature vectors that are invariant to contrast, but that would definitely require some compromises in terms of computational time, as the feature vectors would be bigger and the models would take longer to classify them. At this point, it could also be interesting to use dimensionality reduction techniques such as principal component analysis (PCA) to speed up and lighten the processing and eventually keep only the necessary components of our feature vector. A system built this way would probably perform better and, more importantly, be more robust to contrast variations than the current one. These possible improvements will be discussed in more detail in the next chapter.
[BN12] Bui, TH and Eitaku Nobuyama: A local soft voting method for texture-
based vanishing point detection from unstructured road images. SICE Annual
Conference (SICE), 2012 . . . , pages 396–401, 2012.
[CB10] Corvee, E. and F. Bremond: Body Parts Detection for People Track-
ing Using Trees of Histogram of Oriented Gradient Descriptors. In 2010
7th IEEE International Conference on Advanced Video and Signal Based
Surveillance, pages 469–475, Aug 2010.
[Dau85] Daugman, John G.: Uncertainty relation for resolution in space, spatial
frequency, and orientation optimized by two-dimensional visual cortical fil-
ters. J. Opt. Soc. Am. A, 2(7):1160–1169, Jul 1985.
[DBH06] Demuth, H.B., M. Beale and M. Hagan: Neural Network Toolbox for
Use with MATLAB: User’s Guide. MathWorks, Incorporated, 2006.
[ETS14] ETSC: Road safety planning Good practice examples from national road
safety strategies in the EU. (October):1–15, 2014.
[FR98] Ford, Adrian and Alan Roberts: Colour space conversions. 1998.
[FSZ03] Feng, D.D., W.C. Siu and H.J. Zhang: Multimedia Information Retrieval
and Management: Technological Fundamentals and Applications. Engineer-
ing online library. Springer, 2003.
[GA09] Gao, Feng and Haizhou Ai: Face Age Classification on Consumer Images
with Gabor Feature and Fuzzy LDA Method, pages 132–141. Springer Berlin
Heidelberg, Berlin, Heidelberg, 2009.
[GZZ10] Guo, Z., L. Zhang and D. Zhang: A Completed Modeling of Local Binary
Pattern Operator for Texture Classification. IEEE Transactions on Image
Processing, 19(6):1657–1663, June 2010.
[Kal16] Kala, R.: On-road Intelligent Vehicles: Motion Planning for Intelligent
Transportation Systems. Elsevier Science, 2016.
[KAP09] Kong, Hui, Jean Yves Audibert and Jean Ponce: Vanishing point
detection for road detection. 2009 IEEE Computer Society Conference on
Computer Vision and Pattern Recognition Workshops, CVPR Workshops
2009, (3):96–103, 2009.
[Low99] Lowe, D. G.: Object recognition from local scale-invariant features. In Pro-
ceedings of the Seventh IEEE International Conference on Computer Vision,
volume 2, pages 1150–1157 vol.2, 1999.
[Low01] Lowe, D. G.: Local feature view clustering for 3D object recognition. In
Proceedings of the 2001 IEEE Computer Society Conference on Computer
Vision and Pattern Recognition. CVPR 2001, volume 1, pages I–682–I–688
vol.1, 2001.
[LW02] Liu, Chengjun and H. Wechsler: Gabor feature based classification using
the enhanced fisher linear discriminant model for face recognition. IEEE
Transactions on Image Processing, 11(4):467–476, Apr 2002.
[PZM96] Pass, Greg, Ramin Zabih and Justin Miller: Comparing Images Using
Color Coherence Vectors. In Proceedings of the Fourth ACM International
Conference on Multimedia, MULTIMEDIA ’96, pages 65–73, New York, NY,
USA, 1996. ACM.
[TPI13] Toroyan, Tami, Margie M. Peden and Kacem Iaych: Global status report on road safety 2015. World Health Organization, 19(2):150, 2013.
[Tra91] Travis, D.: Effective Color Displays: Theory and Practice. Computers and
people. Academic Press, 1991.
[YZ10] Yang, Meng and Lei Zhang: Gabor Feature Based Sparse Representa-
tion for Face Recognition with Gabor Occlusion Dictionary, pages 448–461.
Springer Berlin Heidelberg, Berlin, Heidelberg, 2010.