CONVOLUTIONAL NEURAL NETWORKS
APPLIED TO TRAFFIC SIGN DETECTION
IN GRAND THEFT AUTO V
ALEXANDER RAFAEL GARZÓN
ADVISOR: ALAIN L. KORNHAUSER
SUBMITTED IN PARTIAL FULFILLMENT
OF THE REQUIREMENTS FOR THE DEGREE OF
BACHELOR OF SCIENCE IN ENGINEERING
DEPARTMENT OF OPERATIONS RESEARCH AND FINANCIAL ENGINEERING
PRINCETON UNIVERSITY
JUNE 2016
I hereby declare that I am the sole author of this thesis.
I authorize Princeton University to lend this thesis to other institutions or individuals
for the purpose of scholarly research.
______________________________________
Alex Garzón
I further authorize Princeton University to reproduce this thesis by photocopying or by
other means, in total or in part, at the request of other institutions or individuals for the
purpose of scholarly research.
______________________________________
Alex Garzón
Abstract
In recent years, more and more companies have joined the quest to develop fully
autonomously driven vehicles. With a relatively recent research report suggesting
that by 2030 autonomous driving technologies will have developed into a global
industry worth $87 billion, it is no wonder so many companies are now investing so
heavily in creating such technologies. Many issues and obstacles need to be
addressed and resolved, though, before fully autonomously driven vehicles can be
sold to consumers and used on public streets. Perhaps the most fundamental is
giving the vehicles the ability to actually drive accurately, safely, and in accordance
with all traffic laws. One specific obstacle is for the vehicle to recognize road signs
just as a human driver would, such as stop signs, traffic lights, speed limit signs, and
warning signs. The focus of this study is on improving upon current detection
methods by boosting accuracy (fewer false detections, fewer missed detections) and
reducing image analysis time. This is done by attempting a more difficult single-step
approach to traffic sign detection, as opposed to the traditional, relatively easier
two-step approach described in Chapter 2. The study first attempts to develop a
reliable traffic sign detector by constructing, training, and tuning various
Convolutional Neural Networks. Training images are obtained both from public
real-world datasets and from the video game Grand Theft Auto V. The study then
explores the advantages of using a virtual environment (in this case a video game)
to train detectors for autonomous driving. It concludes that there are distinctive,
measurable advantages to training such detectors in a virtual environment, and that
investments in constructing virtual environments for training and testing
autonomously driven vehicles should be seriously considered.
Acknowledgements
These past four years at Princeton have been a great and sometimes wild
journey. It was a transformative time for me with many good memories and friends
made. I have grown a lot more certain about my general path from here onwards,
yet the exact path has been made more obscure given all the fantastic opportunities
after college that I have learned about and discovered while here at Princeton. This
senior thesis was a fantastic project to end my time here, and the machine learning
and statistical techniques herein directly relate to the machine learning team I will
be working on at Google post-graduation.
This thesis would not have been possible without the invaluable guidance
and vision of Professor Kornhauser. The research ideas he pitched to me were a
fantastic senior thesis project and also greatly appealing to my interests. The classes
of Computer Vision (COS 429) and Analysis of Big Data (ORF 350) and their
teachings by Professors Jianxiong Xiao and Han Liu, respectively, were also
extraordinarily useful in developing my understanding of Convolutional Neural
Networks and other Computer Vision techniques, features, and statistical methods.
I’d also like to thank Artur Filipowicz and Chenyi Chen. Artur was invaluable in
introducing me to Script Hook V, so that I could begin hacking GTA V. Chenyi was
invaluable in helping me use his NVIDIA Tesla K40, 12 GB RAM GPU in the PAVE lab.
When CNN training code run time can be cut from 10 hours to 30 minutes, it is truly
a much-appreciated blessing.
Lastly, I would like to thank my friends and family. Shirley, you’ve been an
awesome girlfriend, thank you so much for putting up with and supporting me. Ruina,
you’re the best early morning breakfast buddy ever, thank you for all the smiles and
support. To the ORFE squad, Raina, Chris, and Matt, we survived ORFE together,
cheers to all the late night memories p-setting, I could not have done this alone. To
the PCT family, you all have been the greatest. Also Katrina, you have been a sister
who has always believed in me and has pushed me to work harder, whether
you knew it or not, thank you so much, I will keep at it. Lastly, Papá, you should be
proud, you have done so much for your son, I am so thankful for all of it. I would not
be here, at Princeton, writing this thesis, had you not always been there for me
growing up and teaching me.
The end is now here. And equally so is the end the beginning. Thank you all.
Contents
Abstract . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . iii
Acknowledgments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . iv
List of Tables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . vii
List of Figures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . viii
1 Introduction 1
1.1 Problem & Objective Specification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.2 Why Grand Theft Auto V (GTA V) ? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.3 Why Convolutional Neural Networks (CNNs) ? . . . . . . . . . . . . . . . . . . . . . 5
2 Background & Literature Review 6
2.1 Previous Traffic Sign Detection Work . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
2.2 Previous Virtual Environment Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
2.3 Previous GTA V Hacking Development . . . . . . . . . . . . . . . . . . . . . . . . . 11
3 Data Creation and Development 12
3.1 CNN Training and Testing Data. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
3.2 LISA-TS Dataset . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
3.3 Training, Validation, and Test Set Construction . . . . . . . . . . . . . . . . . . . . 15
4 CNN Methodology & Results 19
4.1 CNN Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
4.2 CNN Construction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
4.3 Programming with Theano. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
4.4 Stop Sign MLP. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
4.5 Contrasting MLPs with MNIST. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
4.6 Stop Sign CNN. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
4.7 Contrasting CNNs with MNIST . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
4.8 Contrasting CNNs with GTSDB . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
4.9 Performance on GTA V Images . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
5 GTA V Methodology & Results 35
5.1 GTA V – System Setup . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
5.2 GTA V – Car Handling Script . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
5.3 GTA V – Position and Angle Structuring . . . . . . . . . . . . . . . . . . . . . . . . . 40
5.4 Live Action Video . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
6 Conclusion 47
6.1 CNN . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
6.2 GTA V . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
6.3 Future Extensions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
6.3.1 Additional CNN Component Creation. . . . . . . . . . . . . . . . . . . . . 51
6.3.2 Additional Tracking Layer Creation. . . . . . . . . . . . . . . . . . . . . . 51
6.3.3 Additional Virtual Environment and Dataset Exploration . . . . . . . . 52
Bibliography 56
Appendix 59
A.1 Data Processing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59
A.1.1 Image and True-values Grouping and Processing . . . . . . . . . . . . . 59
A.1.2 Reading, Formatting, and Pickling Data . . . . . . . . . . . . . . . . . . . 60
A.2 CNN Implementation (Theano) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
A.2.1 CNN Architecture Definition and Training . . . . . . . . . . . . . . . . . 61
A.2.2 CNN Testing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68
A.2.3 MLP Architecture Definition and Training. . . . . . . . . . . . . . . . . . 68
A.3 GTA V Hacking Code . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68
A.3.1 Live Video Code . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70
List of Tables
3.1 Training, Validation, Testing Set Counts . . . . . . . . . . . . . . . . . . . . . . . 16
4.1 Stop Sign MLP Validation and Testing Errors . . . . . . . . . . . . . . . . . . . 27
4.2 MNIST MLP Validation and Testing Errors . . . . . . . . . . . . . . . . . . . . . 28
4.3 Stop Sign CNN Validation and Testing Errors . . . . . . . . . . . . . . . . . . . 29
4.4 MNIST CNN Validation and Testing Errors . . . . . . . . . . . . . . . . . . . . . 30
4.5 Traffic Sign CNN (GTSDB) Validation and Testing Errors . . . . . . . . . . . . 31
4.6 GTA V Image Classification Results . . . . . . . . . . . . . . . . . . . . . . . . . . 34
5.1 Sample Stopping Times . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
List of Figures
1.1 Sample Image of Stop Sign Detector in GTA V . . . . . . . . . . . . . . . . . . . . 2
1.2 Sample Images of Traffic Signs in GTA V . . . . . . . . . . . . . . . . . . . . . . . 4
2.1 Image Depicting TORCS System Setup [19] . . . . . . . . . . . . . . . . . . . . . 10
3.1 Sample Image Abbreviated Annotations . . . . . . . . . . . . . . . . . . . . . . 14
3.2 Sample Images from LISA-TS Dataset . . . . . . . . . . . . . . . . . . . . . . . . . 18
4.1 Example of a Sobel Kernel . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
4.2 Example of Max-Pooling technique used . . . . . . . . . . . . . . . . . . . . . . 22
4.3 Three common activation functions . . . . . . . . . . . . . . . . . . . . . . . . . 22
4.4 Sample FC Layer . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
4.5 Sample CNN Architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
4.6 Sample MNIST Images . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
4.7 Sample GTSDB Images . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
4.8 Sample GTA V Image Classifications . . . . . . . . . . . . . . . . . . . . . . . . . 33
5.1 System Setup Architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
5.2 Camera Placement Images. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
5.3 GTA V Video Image . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
5.4 GTA V Traffic Sign CNN Script Images . . . . . . . . . . . . . . . . . . . . . . . . 45
5.5 GTA V Lane Detection CNN and Bird’s Eye View . . . . . . . . . . . . . . . . . 46
6.1 Speed Limit Sign Comparison . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
6.2 Speed Limit Sign Image . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
Chapter 1
Introduction
Figure 1.1: Sample Image of Stop Sign Detector in GTA V
Grand Theft Auto V is very representative of the real world. It was the second most
expensive video game ever developed, at a cost of $137 million USD [18], and it
emulates the real world closely. In the game, lighting changes with the time of day
and the weather. There are pedestrians crossing roads, other vehicles driving on the
roads, and even the occasional anomaly such as an animal crossing. Importantly, all
traffic signs in the game are based on American traffic signs. The CNN detector
trained here is focused on American traffic sign detection, but realistically,
retraining it on another country's traffic sign set is as trivial as changing the
image/object files used to render those signs in the game. This study will
demonstrate that a CNN traffic sign detector trained on game images can be at least
as good as one trained on real-world images.
An additional benefit of the game is that it is largely modifiable by a researcher to
fit the needs of developing and testing a CNN traffic sign detector. Collecting in-game
images for training is the most tedious task, although it does not take forever: a
researcher can generate about 150 unique and representative traffic sign images,
semi-manually, per hour. The researcher can then combine these images with
existing real-world datasets that contain thousands more. What is especially
valuable is the wide range of game hacking tools publicly available and kept up to
date by game enthusiasts, who maintain an API of functions and variables that can
be called from programming languages such as C++ and C# to interact with and
modify game behavior. For instance, a researcher can change the time of day,
lighting, and weather in the game. One can also place the camera at any position and
angle. For example, one can collect images from a low-seated sports car view and
from a high-seated truck view, train the detector on each, and compare whether a
variable such as camera mounting height makes a distinct difference in
performance. One can also zoom to different parts of the screen and crop out
irrelevant regions such as the sky, where a traffic sign is unlikely to be located.
There are also unique advantages to testing the system in game. For instance,
assuming a reasonably working autonomous system, the car can run indefinitely in
the game. If the detector does not work well at night, the lighting can be fixed
permanently to sunny, bright conditions. If the autonomous system cannot yet
detect pedestrians, or gets confused by them, they can be deleted automatically from
the game to prevent issues. If the car crashes or drives off the road, it can be
re-spawned at a legal starting position on the road. These are just a few of the many
in-game hacks that can make testing much easier than in the real world.
Simultaneously, one could automate image collection during testing and use those
images to retrain the detector afterwards, making it even better.
Specific details of how the car handling script works and what hacks were taken
advantage of will be described as they arise throughout the study.
Examples of Signs in Game
Figure 1.2: Sample Images of Traffic Signs in GTA V
The signs above in order left to right then top to bottom are: (1) Avoid Median, (2)
Yield, (3) No U-Turns, (4) Pedestrian Crossing, (5) Do Not Block Intersection, (6)
One Way Road, (7) Right Turn Only, (8) Stop Sign, and (9) Do Not Enter.
There are also many more traffic signs in the game, such as traffic signals, animal
crossings, and no parking signs. As one can see, they all very accurately resemble
their real-world counterparts in America. It is essential that traffic signs
Chapter 2
Background & Literature Review
traffic sign is away from the vehicle. All an autonomously driven vehicle system
should care about is whether a stop sign is being approached and how far away it is;
the exact bounding box of where the sign sits does not matter, because the vehicle
just needs to know when to begin slowing down and eventually stop. Detecting
whether there is a stop sign and how far away it is can both be part of the prediction
output from one CNN. Currently, though, this study has focused on the first part of
the prediction: whether or not there is a stop sign. Adding the second part is
straightforward once the first part works well, since once a dataset labeled with the
true distances to stop signs in images is available, the CNN can be retrained to also
predict those distances. The only constraint that currently prevents training in that
manner is the lack of such a labeled dataset, which makes sense: one cannot easily
measure the distance to a sign from an image after it is taken and the location has
been left behind. In a virtual environment, however, that distance could be known
by grabbing game variables, computing the distance from their values, and
recording it at the time the image is captured. This is one advantage of being able to
use a virtual environment to train a CNN (assuming that a CNN trained in a virtual
environment translates well to real-world usage, and vice versa, a fact demonstrated
through experimentation later in this thesis). Although current traffic sign detection
schemes use a different technique from the one proposed in this study, such papers
are still quite relevant, as the detection tasks are related and it is important to be
familiar with other methods being developed.
The most relevant paper found for this study's work is entitled Multi-Column Deep
Neural Network for Traffic Sign Classification [30]. The researchers were very
successful in the GTSDB competition; they focused on training a set of CNNs to
detect traffic signs and then averaging their predictions to produce the final
detection prediction. Another important technique they used was processing each
image into multiple forms for training. In addition to the original photo, they
produced four modified forms, which they called Imadjust, Histeq, Adapthisteq, and
Conorm. Briefly, Imadjust increases the contrast of the picture up to the point where
1% of the data is saturated at the highest and lowest intensities. Histeq increases
contrast so that the histogram of pixel intensities in the output image is close to
uniform. Adapthisteq is similar to Histeq, except the image is tiled and each tile is
processed to become close to uniform in intensity. Lastly, Conorm is an edge
enhancer; they used a difference of Gaussians to enhance the edges, though one
could also attempt a modified form of the Sobel operator, which is introduced later
in Chapter 4. It is these modifications that allowed the team to boost their
performance and gain the extra edge to win, but still, roughly 95% of that
performance could have come from just their first single CNN with no image
processing. This suggests that CNNs are very strong tools for this kind of problem.
The team goes into detail about the different layers in their CNN, and the intuition
they give for their decisions is quite useful in thinking about how to architect the
CNN in this study later on.
Another interesting research team has a paper entitled Traffic Sign Recognition –
How far are we from the solution? [31]. They, too, divide the problem into two
pieces, first traffic sign localization and then classification of the localized signs;
however, they attempt to do it with methods that do not include a CNN. Although
they are quite successful, they still fall short of the team above that used multiple
CNNs, further suggesting that CNNs are indeed a good solution for this problem. The
team first accomplishes traffic sign localization using integral channel features
(ChnFtrs), as created by other researchers in [2] for pedestrian detection. The
ChnFtrs detector uses ideas derived from HOGs (Histograms of Oriented Gradients);
specifically, the team in [31] uses 10 channels derived from ChnFtrs to localize
signs, and does so quite successfully. For the final classification step they use a
technique known as INNLP (Iterative Nearest Neighbors-based Linear Projections),
which also works quite well. However, for a simple classification problem where the
object is dead center in the image, even the CNN in this study performs incredibly
well after being trained for only a few minutes, as will be shown later with the
MNIST and GTSDB datasets. Thus, while their work is very interesting, it was
decided that it was not directly relevant to the CNN research here, since this study
does not use the heavily researched two-step method and instead focuses on the
more difficult single-step method of classification. Nonetheless, it was important
and eye-opening to see the other techniques that exist for the two-step traffic sign
detection approach. It will be interesting to see whether the single-step CNN can
ever be trained to a sufficient state that it can outdo the winners of GTSDB, who use
the relatively easier two-step method. As the results will show, although the
single-step detector turns out to be good, it is by no means perfect, yet it is
nonetheless impressive given the limited amount of data available for a CNN
single-step solution.
In fact, Chenyi’s entire system setup of having the game running, a car handling
script running, and a CNN running, and all three components communicating with
each other, does indeed very much mirror my system setup. Below is how Chenyi
very clearly illustrates his system setup [19]:
Figure 2.1: Image Depicting TORCS System Setup [19]
The biggest reason GTA V was chosen over TORCS is quite straightforward. The
research in this paper centers on improving traffic sign detection, but TORCS is a
racetrack environment: there simply are no street signs. It is a professional racing
track with very clear lane markings that works perfectly for Chenyi's research but is
not useful for traffic sign detection research. GTA V, on the other hand, has traffic
signs that very accurately resemble the real world in terms of placement, size, and
design (they are all based on standard American traffic signs). Furthermore, to
assist in training and testing, every sign, as with most objects in the game, is
uniquely identified by an ID number together with its location. Another Princeton
undergraduate student, Artur Filipowicz, is currently working on grabbing traffic
signs in close proximity to the vehicle in the game and calculating their exact
coordinates in game images. Knowing these true values would greatly speed up CNN
training for those pursuing the two-step method with bounding boxes, as the time
required for data collection and labeling would drop dramatically; the labeling
process would be instant and automated.
Currently that task is a work in progress. It is also exactly what is needed here to
train the CNN to predict distances to signs, since Artur's code would be able to
output the true distances to signs in images collected inside the game for training
the CNN.
What is used in this research is the foundational system setup, modeled on Chenyi's
setup for TORCS, that Artur created for GTA V in his independent research work
[21]. The setup here is based on his, with modifications for the particular CNN and
system used, and with optimizations to speed up the communication of game images
over to the CNN, so that the CNN can quickly compute its outputs and send them
over to the car handling script (the driving controller).
Chapter 3
Data Creation and Development
autonomous driving system expected to keep riders safe. Training a CNN involves
creating a network architecture by tuning hyperparameters (number of layers,
types of layers, order of layers, kernel dimensions, number of kernels, pooling sizes,
pooling padding size, dimensions of the input and output of each layer, et cetera)
and then letting the optimizer run across thousands of images in the training set,
thousands of times, to estimate optimal values for the thousands of parameters in
the layers via backpropagation and stochastic gradient descent. The number of
parameters being estimated is enormous, and the outcome depends in large part on
a well-curated selection of images for the training set. It is also important to note
that the word "thousands" can easily be replaced by "tens of thousands", "hundreds
of thousands", or "millions" and beyond; it all depends on factors such as the
difficulty of the detection problem, the amount of training data available, the ease of
constructing new data, and the computational "firepower" (hardware) available
to the researcher.
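To make the optimization step concrete, each minibatch of training images nudges every learnable parameter against the gradient of the loss. This is the generic stochastic gradient descent update rule, stated here for intuition rather than quoted from the thesis code, with λ denoting the learning rate that is tuned in Chapter 4:

θ ← θ − λ · ∇θ [ (1/|B|) Σ over (xi, yi) in B of L(fθ(xi), yi) ]

Here θ collects the kernel weights and biases of all layers, B is a minibatch of labeled training images, fθ(xi) is the network's prediction for image xi, and L is the classification loss (typically the negative log-likelihood of the correct label).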
/pedestrianCrossing_1330545944.avi_image3.png;pedestrianCrossing;355;179;384;207;0
/leftTurn_1330546134.avi_image2.png;turnLeft;594;179;617;203;0
/rightLaneMustTurn_1330546501.avi_image12.png;rightLaneMustTurn;928;52;1001;128;0
/signalAhead_1330546728.avi_image4.png;signalAhead;342;188;368;213;0
/laneEnds_1330547145.avi_image10.png;laneEnds;592;101;617;127;0
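These semicolon-separated records can be read with a few lines of Python. The field meanings assumed below (file name, sign tag, bounding-box corners, and a trailing flag) are inferred from the sample lines above rather than quoted from the LISA-TS documentation, so treat this as an illustrative sketch and not the thesis's actual AWK/Bash processing (Appendix A.1):

# Illustrative parser for the annotation records shown above.
# Field meanings are inferred from the samples: file name, sign tag,
# bounding-box corners (upper-left x/y, lower-right x/y), and a final flag.
from collections import namedtuple

Annotation = namedtuple("Annotation", "path tag x1 y1 x2 y2 flag")

def parse_annotation(line):
    path, tag, x1, y1, x2, y2, flag = line.strip().split(";")
    return Annotation(path, tag, int(x1), int(y1), int(x2), int(y2), int(flag))

ann = parse_annotation(
    "/pedestrianCrossing_1330545944.avi_image3.png;pedestrianCrossing;355;179;384;207;0")
print(ann.tag, ann.x2 - ann.x1, ann.y2 - ann.y1)   # sign tag and bounding-box size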
of the sign can be elongated. Not only does one want to recognize that there is a
speed limit sign, but one also wants to be able to determine the value of the speed
limit (25 mph, 35 mph, etc.).
Lastly, for modifying car behavior based on a stop sign, the desired behavior is quite
standard and well defined. The vehicle needs to come to a gradual stop at the
correct place and remain still for a couple of seconds before reaccelerating to the
appropriate speed and continuing through the intersection (assuming nothing is
blocking the intersection; determining that is an entirely different detection
problem that needs to be solved). Developing the car handling script to do this was
an equally challenging problem, as interacting with a game that was not purposefully
designed to be interacted with can at times prove difficult and require quite a few
interesting hacks or repeated attempts at getting the sought-after behavior. This is
also discussed in the next section.
respectively. The latter is of course greatly preferred; however, because the vast
majority of stop signs fall into the first category, there was no choice but to use the
704 x 480 category. The GTA V script simply disregards the bottom 10 rows of
pixels, which seems reasonable, since a sign is unlikely to be in the very bottom
portion of the image anyway. Alternatively, one could have distorted the image
aspect ratios slightly, although it was decided not to do that, as the shape of the sign
would change slightly and that could affect detection.
With that issue addressed, AWK and Bash scripts were developed to automatically
go through the 7,856 images and sort them into two categories: images containing
stop signs and images NOT containing stop signs (but quite likely containing other
signs). The scripts drop all the 640x480 images, keep only the 704x480 images, and
downsample those to a size of 140x95. The images are then grouped into training,
validation, and test sets (where positive is defined as containing a stop sign, and
negative as no stop sign present):
Set              # of Positive Images   # of Negative Images
Training Set     500                    500
Validation Set   200                    200
Test Set         200                    200
Table 3.1: Training, Validation, Testing Set Counts
Note: The images are all grouped into tracks, where one track is a burst of images
taken in the few seconds approaching a traffic sign. Because the training, validation,
and test sets should each reflect the diverse environment without being nearly
identical to one another, it was necessary to manually ensure that no track was
present in more than one set. Otherwise there is a big risk of overfitting.
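The thesis carried out this preprocessing with AWK and Bash scripts (Appendix A.1); the sketch below is a hypothetical Python equivalent, with made-up helper names, directory layout, and split ratios, showing the same steps of keeping only the 704x480 images, trimming the bottom rows, downsampling to 140x95, and assigning whole tracks to a single split so that no track leaks across sets:

# Hypothetical Python version of the AWK/Bash preprocessing described above
# (the actual scripts are in Appendix A.1); paths and split ratios are illustrative.
import os, random
from collections import defaultdict
from PIL import Image

def track_id(filename):
    # Images from one track share the same .avi prefix, e.g.
    # "pedestrianCrossing_1330545944.avi_image3.png" -> "pedestrianCrossing_1330545944.avi"
    return filename.split("_image")[0]

def build_splits(image_dir, out_size=(140, 95)):
    tracks = defaultdict(list)
    for name in os.listdir(image_dir):
        img = Image.open(os.path.join(image_dir, name))
        if img.size != (704, 480):            # drop the 640x480 images
            continue
        img = img.crop((0, 0, 704, 470))      # disregard the bottom 10 rows of pixels
        tracks[track_id(name)].append(img.resize(out_size))

    # Assign whole tracks to one split only, so no track appears in two sets.
    ids = sorted(tracks)
    random.Random(0).shuffle(ids)
    n = len(ids)
    return (ids[: int(0.6 * n)],              # training tracks
            ids[int(0.6 * n): int(0.8 * n)],  # validation tracks
            ids[int(0.8 * n):])               # test tracks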
Another concern is that, ideally, there would be no tracks at all. However, in the
LISA-TS dataset almost all data is grouped into tracks of 10 to 40 frames. Assuming
an average of 25 frames per track, the training set has only 20 unique tracks of
positive images, which really is not a lot. Still, all datasets explored in this research,
such as LISA, GTSDB [24], and others, are track based. The reason is simply that this
is how data is collected in the real world using a car and a camera: there is no
practical alternative to driving around and taking video (which is where the bursts
of frames/images come from). Of course, 500 images from 20 tracks is still much
better than 20 images (one from each track), although the ideal case would be 500
images from 500 tracks. It would be interesting to research how much better that
would actually be; given the reduction in redundant information and the much more
diverse training images, it intuitively seems it should give a substantial boost to
detector performance.
This is the first case where data availability constraints come into the equation: the
limited amount of data available can be a real constraint on training a CNN.
Fortunately, this amount of data is enough, though using GTA V to automatically
collect images could be a major breakthrough in collecting a much larger amount of
training data. One could also do what Chenyi [19] did and have someone manually
drive the car around for 12 hours to collect images. But he was collecting images of
lane markings, so the full 12 hours was useful; for this research, only a small
percentage of those 12 hours (say 5%) would be useful (the time when traffic signs
are present in the images). It would thus be substantially more costly to carry out a
similar procedure manually, not to mention all the data cleanup afterwards, so it
was decided not to undertake this task and instead to focus on the already available
and annotated LISA-TS dataset of real-world traffic sign images, specifically for stop
sign detection.
Here are examples of negative images from LISA-TS:
Figure 3.2: Sample Images from LISA-TS Dataset
Chapter 4
CNN Methodology & Results
• Convolutional Layers (CONV). The CONV layer is where the CNN gets its name:
the input to the layer is "convolved" with a set of filters, and the results of those
convolutions are passed as input to the next layer. A filter can be thought of as a
sliding window of weights (a matrix of numbers) that slides across the entire image
(a bigger matrix of numbers); at each position it takes a dot product with the
portion of the image it covers, and that value becomes part of the output. The
weights of these filters are what is learned during training. The filter object in a
CONV layer is 4-dimensional: (1) number of filters, (2) number of channels (number
of input matrices), (3) filter height, and (4) filter width. Usually it is very hard for
the human eye to get an idea of what these filters are doing, especially if the filters
do not come from the first convolutional layer; this is just their nature, and there is
not a better way to visualize them. But here is a small trivial example to give some
basic intuition:
One interesting convolution is that of a Sobel operator, which produces an image
with edges strongly emphasized. This is especially helpful in edge detection, and
consequently in detecting the edges of the road, the street signs, and many other
useful artifacts.
Figure 4.1: Example of a Sobel Kernel
As one can see, intuitively, the Sobel operator should produce a dot product that is
very large in absolute value when it is "convolved" on a 3x3 patch that has a vertical
edge through the middle of it, since such a patch would have left-hand-side and
right-hand-side values that differ in magnitude by a large amount. If one wanted to
detect horizontal edges, one could just rotate the kernel by 90 degrees.
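To make the intuition concrete, the short sketch below (not from the thesis code) convolves a tiny image containing a vertical edge with the standard 3x3 vertical-edge Sobel kernel using SciPy; the responses with large magnitude line up with the edge:

# Minimal Sobel illustration: convolve a tiny image that has a vertical edge
# with the standard vertical-edge Sobel kernel.
import numpy as np
from scipy.signal import convolve2d

sobel_x = np.array([[-1, 0, 1],
                    [-2, 0, 2],
                    [-1, 0, 1]])

image = np.zeros((5, 8))
image[:, 4:] = 1.0            # dark on the left, bright on the right

response = convolve2d(image, sobel_x, mode="valid")
print(response)               # large-magnitude values where the window straddles the edge
# Transposing (rotating) the kernel emphasizes horizontal edges instead.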
Figure 4.2: Example of Max-Pooling technique used
• Rectified Linear Unit Layers (ReLU). The ReLU layer is responsible for adding
further nonlinearity to the CNN; it acts as an activation function and helps the CNN
train faster. There is debate about which function to use in this layer, but three
major functions are in common use:
o (1) f(x) = max(0, x), range: [0, ∞)
o (2) f(x) = tanh(x), range: [-1, 1]
o (3) f(x) = 1 / (1 + e^(-x)), range: [0, 1]
Figure 4.3: Three common activation functions
Function (1) is often preferred since it is computationally simple and has been
shown to accelerate convergence in stochastic gradient descent. However, it can
have disadvantages associated with not damping large gradient values. Function (2)
is preferred over function (3) since it is centered on zero, which gives additional
useful properties. Thus it was decided to use function (2) in all the CNNs:
f(x) = tanh(x).
• Fully Connected (FC) Layer. This is typically the last section of the CNN, following
some combination of layers of the three types described above. The FC layer is
unique in that it is the only section of the CNN where every node in layer n has its
own connection to every node in layer n+1, as long as layers n and n+1 are both
part of the FC section. As one can imagine, so many weights make this
computationally expensive, so having combinations of CONV, POOL, and ReLU
layers before this step to greatly reduce the size of the input to the FC layer is very
important. If the FC layer contains Hidden Layers (HLs) within it, which it usually
does, then it can also be considered a Multilayer Perceptron (MLP). Learning in the
MLP/FC layer takes place via backpropagation, which is essentially a method of
using stochastic gradient descent to minimize the error of the predictions output by
the final layer of the FC section. Each layer within the FC section has a weight
matrix W and a bias vector b; these are the values that are learned. The last layer of
the CNN in this research is actually a logistic regression (LR) layer, a common
technique shown to improve performance [15]. The hidden layer (HL) essentially
transforms the input into a linearly separable space, and then the Logistic
Regression (LR) layer classifies (makes predictions from) the output of the HL.
Below is a useful visualization of an FC layer with three HLs [28]:
Figure 4.4: Sample FC Layer
The final construction for the CNN in this study was the following:
INPUT IMAGE -> [CONV -> POOL -> RELU]*3 -> HL -> RELU -> LR ->
PREDICTION (OUTPUT)
Note the "*3" notation: the three layers in the brackets are repeated three times, so
the CNN can be thought of as having 12 layers between input and output. Multiple
architectures were tried, and this one appeared to work best in practice and
intuitively seemed justifiable; more on that intuition is given in Section 4.2.
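To make the layer sequence concrete, here is a minimal Theano sketch of a forward pass through this architecture. It is an illustration rather than the thesis's training code (that is in Appendix A.2): the kernel counts and sizes are placeholder values, the input is assumed to be a single-channel 95x140 image, and tanh is used as the activation, matching the choice described above.

# Forward-pass sketch of INPUT -> [CONV -> POOL -> RELU]*3 -> HL -> RELU -> LR.
# Kernel counts/sizes and the grayscale 95x140 input are illustrative assumptions,
# not the exact settings used in the thesis (see Appendix A.2 for the real code).
import numpy as np
import theano
import theano.tensor as T
from theano.tensor.nnet import conv2d, softmax
from theano.tensor.signal import pool

rng = np.random.RandomState(0)
floatX = theano.config.floatX

def shared(shape):
    return theano.shared(0.01 * rng.standard_normal(shape).astype(floatX))

def conv_pool_tanh(inp, filter_shape):
    # One [CONV -> POOL -> RELU-layer] block; tanh is the chosen activation.
    W = shared(filter_shape)
    b = theano.shared(np.zeros(filter_shape[0], dtype=floatX))
    out = conv2d(inp, W)                                      # CONV
    out = pool.pool_2d(out, ws=(2, 2), ignore_border=True)    # POOL (older Theano uses ds=)
    return T.tanh(out + b.dimshuffle('x', 0, 'x', 'x'))       # activation

x = T.tensor4('x')                       # (batch, channels, height, width)
h = conv_pool_tanh(x, (20, 1, 5, 5))     # block 1
h = conv_pool_tanh(h, (40, 20, 5, 5))    # block 2
h = conv_pool_tanh(h, (60, 40, 4, 4))    # block 3

h = h.flatten(2)                         # 60 maps of 8x14 -> 6720 features per image
W_h, b_h = shared((6720, 500)), theano.shared(np.zeros(500, dtype=floatX))
h = T.tanh(T.dot(h, W_h) + b_h)          # HL followed by the tanh activation

W_o, b_o = shared((500, 2)), theano.shared(np.zeros(2, dtype=floatX))
p = softmax(T.dot(h, W_o) + b_o)         # LR layer: P(no stop sign), P(stop sign)

predict = theano.function([x], T.argmax(p, axis=1))
print(predict(rng.standard_normal((1, 1, 95, 140)).astype(floatX)))

In training, this same graph would simply be wrapped in a loop that applies the stochastic gradient descent update to every W and b using the data described in Chapter 3.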
To help you visualize, below is an illustration from [25] of a typical CNN
architecture:
Figure 4.5: Sample CNN Architecture
vision problems. The GPUs are more efficient and faster. These are two good
reasons that led to using Theano to develop, train, test, and operate the CNN in this
study.
There are many alternatives to Theano in the world of deep learning and CNN
construction, including Caffe, Torch, TensorFlow, and DL4J. Caffe seems to be a
promising alternative: it is a deep learning framework developed by the Berkeley
Vision and Learning Center (BVLC), it appears to be very popular with plenty of
documentation, it similarly has a flag for switching from CPU to GPU usage, and it
has libraries available in several languages such as Python and C++. However,
Theano is also great, and all CNNs within this research rely on Theano.
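For reference, in Theano the CPU/GPU switch mentioned above is usually made through an environment flag rather than a code change; the exact device name depends on the Theano version and GPU backend installed, and the script name below is made up:

# Theano reads its device from the THEANO_FLAGS environment variable before import,
# so the same training script can run on CPU or GPU without code changes, e.g.
#   THEANO_FLAGS=device=cpu python train_cnn.py
#   THEANO_FLAGS=device=gpu,floatX=float32 python train_cnn.py    # legacy CUDA backend
#   THEANO_FLAGS=device=cuda,floatX=float32 python train_cnn.py   # newer libgpuarray backend
import theano
print(theano.config.device, theano.config.floatX)   # confirm which device is active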
The following hardware was used in this study to train CNNs:
• Intel CPU (on a personal MacBook Pro)
• NVIDIA Tesla K40 GPU – 12 GB of memory (made available in the Princeton
PAVE Lab [27])
The NVIDIA GPU increased training speed by a factor of roughly 10 to 12.
Figure 4.6: Sample MNIST Images
Here are the results of the CNN. A total of 490 images were used, in roughly equal
portions of the above 6 signs: 400 for training, 50 for validation, and 40 for testing:
CNN Setup:                          VE (%):   TE (%):   Learning Rate (λ):   # Hidden Units:
K = [20,20], D = [8,6],  E = 200    4.0       5.0       0.003                500
K = [20,20], D = [8,6],  E = 200    18.0      5.0       0.03                 500
K = [20,20], D = [8,6],  E = 200    12.0      5.0       0.0003               500
K = [25,50], D = [8,6],  E = 200    4.0       0.0       0.003                500
K = [25,50], D = [8,6],  E = 200    6.0       2.5       0.003                800
K = [35,70], D = [8,6],  E = 200    4.0       2.5       0.003                500
K = [20,20], D = [10,6], E = 200    2.0       5.0       0.003                500
K = [25,50], D = [10,6], E = 200    2.0       0.0       0.003                500
K = [25,50], D = [12,6], E = 200    2.0       7.5       0.003                500
K = [25,50], D = [12,8], E = 200    0.0       5.0       0.003                500
Table 4.5: Traffic Sign CNN (GTSDB) Validation and Testing Errors
As one can see, the CNNs perform extraordinarily well on this kind of traffic sign
classification, which is a much easier problem than the one being attempted with
GTA V in this thesis. The top-performing CNN gets 2% error on the validation set
and 0% error on the testing set! Taking that to be roughly 1% error on average, and
assuming the traffic sign localization step (bounding box detection) were equally
error-free, a detection system with only 1% error on GTSDB traffic signs would have
been a very competitive performer in the GTSDB competition, especially
considering this study has not used any special data processing or other
performance-boosting methods and tunings.
Of course, not all the photos are that clear and well lit; for example:
Figure 4.7: Sample GTSDB Images
It is quite likely that the few errors occur on signs such as those above that are
either overexposed or underexposed, and thus look abnormal with obscured edges
and colors.
(1) D = YES (2) D = YES (3) D = YES
(4) D = YES (5) D = NO, NOT CORRECT (6) D = YES, CORRECT
(7) D = YES, CORRECT (8) D = YES, CORRECT (9) D = NO, NOT CORRECT
EXAMPLE OF NEGATIVE IMAGE SET:
(1) NO, CORRECT (2) NO, CORRECT (3) YES, NOT CORRECT
(4) YES, NOT CORRECT (5) NO, CORRECT (6) NO, CORRECT
(7) NO, CORRECT (8) YES, NOT CORRECT (9) NO, CORRECT
Figure 4.8: Sample GTA V Image Classifications
These are just 18 images selectively pulled from the 200 game images that were
tested. Overall, roughly 73% of the 200 images were classified correctly and 27%
were classified incorrectly. This is almost 3 out of 4! It is good news that the
detector can work well on game images. Admittedly, some of the game images seem
to be easier in that the stop sign is, on average, somewhat larger in the image than in
the LISA-TS images used to train, validate, and test the CNN in Section 4.6.
Nonetheless, the detector still did a good job. Looking at the photos, one can also
speculate about why the CNN detector may have predicted wrongly. In the positive
set, the detector surprisingly classified image (4) correctly, but it missed images (5)
and (9), which are also dark. It seems that the detector struggles when the sign is
very dark, as in (5), or dark and blending into the background, as in (9). Of course,
to test this hypothesis we would need to look at more images. Perhaps this suggests
that more training with darker images is needed to further improve the CNN. It
could also suggest that the image preprocessing tactics used in [30], such as
increasing the contrast, might be very useful for detecting signs in relatively darker
images. As for the negative set, discerning what was inaccurately detected as a stop
sign is a somewhat harder task, but in general the detector seemed to struggle with
images containing dark shadows or less light overall, further supporting the
intuition that the detector struggles in less well-lit images.
Here is the detection rates table, showing the breakdown of error between positive
and negative testing images:
Image Set: Correctly Classified (%): Incorrectly Classified (%):
Positive Set 79 21
Negative Set 67 33
Total Set 73 27
Table 4.6: GTA V Image Classification Results
It seems that the detector has a somewhat bigger problem with false positives than
with false negatives (of the roughly 100 images in each set, about 21 stop signs were
missed while about 33 images were falsely flagged). One reason false negatives
could be low (a good thing) is that the positive image set tended to have the stop
sign relatively larger in the image than in the LISA-TS images, as stated before. As
for why false positives are relatively more common, it is not immediately clear, but
one possible reason, as alluded to earlier, is that the detector falsely detects stop
signs in low-light images.
An interesting note about the images: as one can see, they are a very diverse set,
taken at all times of day and under different weather conditions. If desired, one
could also have driven through many different environments, such as urban, rural,
mountainous, desert, coastal, and industrial areas, since they all exist in this game.
This study, however, concentrated on urban and suburban scenes plus a few
mountainous images. The breadth of imagery in the game is a truly promising
feature, as it improves the odds that this one game alone can produce a dataset
diverse enough to reflect what an autonomously driven vehicle would encounter in
the real world. Hopefully, once data collection in the game is more automated, large
new datasets can be collected from it and the CNN detector's performance can be
improved even further.
Most importantly, these results show that a CNN detector trained on real-world
images works very well on in-game images, and thus one can conclude that a CNN
detector trained on images from GTA V could also work reasonably well on images
from the real world. This supports the belief that a virtual environment can be used
to successfully train CNNs, which is great news.
Chapter 5
GTA V Methodology & Results
The system rests on the assumption that one would expect multiple detections,
possibly 10, 20, 30, or even more, before the stop sign, assuming the detector is
operating at 12 Hz. However, that number cannot simply be averaged and used as
the trigger for stopping, since how much it varies is largely due to road conditions:
on a straight road the stop sign can be seen from a long distance, while on a curvy
road the stop sign appears suddenly, or the road may be straight but the stop sign
initially occluded by some other object or by bad weather. Thus it was decided to
adjust the camera zoom so that only stop signs more than 14 meters away can be
seen; the camera is simply zoomed slightly ahead. A car traveling at 20 mph is
moving at roughly 8.9 meters per second. Given that the system operates at 12 Hz,
or roughly one image processed per 80 milliseconds, and assuming a near-perfect
detector with under 5% error, one can be quite certain after 5 missed detections
(400 milliseconds of no detection) that the stop sign is within 14 meters. At that
point the car begins to stop, having roughly 1061 milliseconds (about 1 second) to
halt at the mark 1 meter before the stop sign. Here is a table of the settings for
different speeds (15 to 50 mph, in increments of 5). It was assumed there are no
stop signs on roads of 55 mph or greater, which seems reasonable though not
perfect. The rule could still be applied at higher speeds, but the stopping might be
too abrupt unless the camera is zoomed incredibly far out, and that could introduce
other problems since photo quality tends to degrade with distance. Here is the
table:
Camera Zoom (meters)   Vehicle Speed (mph)   Vehicle Speed (meters/sec)   Time left to stop (milliseconds)
10                     15                    6.7                          940
14                     20                    8.9                          1061
18                     25                    11.1                         1130
23                     30                    13.4                         1241
28                     35                    15.6                         1331
34                     40                    17.9                         1443
40                     45                    20.1                         1539
46                     50                    22.3                         1617
Table 5.1: Sample Stopping Times
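The arithmetic behind Table 5.1 can be reproduced in a few lines. The sketch below is not the car handling script itself, just a restatement of the rule described above (trigger after 5 missed frames of roughly 80 ms each, then stop 1 meter before the sign); its output matches the table to within small rounding differences:

# Reproduces the Table 5.1 arithmetic (illustrative only, not the car handling script).
MPH_TO_MPS = 0.44704      # miles per hour -> meters per second
FRAME_TIME_S = 0.080      # detector runs at roughly 12 Hz
MISSED_FRAMES = 5         # consecutive missed detections before the stop is triggered
STOP_MARGIN_M = 1.0       # come to rest 1 meter before the sign

def time_left_to_stop_ms(camera_zoom_m, speed_mph):
    v = speed_mph * MPH_TO_MPS
    travelled = v * FRAME_TIME_S * MISSED_FRAMES      # distance covered while confirming the miss
    return 1000.0 * (camera_zoom_m - travelled - STOP_MARGIN_M) / v

for zoom, mph in [(10, 15), (14, 20), (18, 25), (23, 30),
                  (28, 35), (34, 40), (40, 45), (46, 50)]:
    print(zoom, mph, round(mph * MPH_TO_MPS, 1), round(time_left_to_stop_ms(zoom, mph)))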
In the game these values would work very well, assuming a strong detector and that
the stop sign is missed because it has genuinely dropped out of the camera's field of
vision rather than being occluded by some object. In real life, however, a car might
not be able to come to a complete stop from 50 mph in 1617 milliseconds (1.6
seconds), and the driver certainly would not enjoy it. To adjust for this, one could
extend the camera zoom, roughly tripling the stopping time to 5 seconds at 50 mph,
which seems much more reasonable and closer to the comfort zone. However, that
would require a zoom of about 150 meters, which is quite far, and there is a high
possibility of another car occluding a sign at that distance. It was therefore
concluded that this type of stopping system would not be very effective in the real
world. A better approach is to have the CNN not only detect that there is a stop sign
but also predict its distance based on how large it appears in the photo.
Unfortunately, as mentioned before, there are no datasets whose annotations
include the true distance from the vehicle to the sign, since obtaining such a value
would require someone physically measuring the distance for each photo. Given
that, using those photos to train a CNN (supervised training) to predict distance to
the sign is not possible. However, this is where the advantages of GTA V come into
play. In GTA V one can use the hacking code and libraries to grab nearby objects,
and from this one can also get other variables that help in estimating distances to
those objects. Artur is currently working on code to estimate the true distances to
signs in captured game images. If he can successfully develop a script that
accurately determines this distance for all captured images containing signs, then
those images and their true values could be used to retrain the currently existing
CNN, and the CNN could then begin outputting predicted distance values. The only
limiting constraint is first getting a dataset on which to train the CNN.
For now, the CNN is focused on image classification. It writes a one to the CNN
output text file when the image being processed contains a stop sign and a zero
when it does not. Thus the output vector is simply a scalar:
My Stop Sign CNN Output Vector
Stop Sign Detection:
SSP
SSP = stop sign present (1 if yes, 0 if no)
In contrast, here is Chenyi's simplified lane detection CNN, implemented for GTA V
by Artur, which outputs two scalars:
Chenyi's CNN Output Vector
Driving car portion:
DTC ATC
DTC = distance to center of road
ATC = angle of adjustment to return to center of road
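As a rough illustration of this text-file hand-off (the real implementation is the Theano code in Appendix A.2 together with the C++ car handling script in Appendix A.3; the file names and polling rate below are assumptions):

# Hypothetical sketch of the CNN side of the text-file hand-off described above.
import time

IMAGE_FILE = "gta_frame.raw"     # written by the GTA V script (assumed name)
OUTPUT_FILE = "cnn_output.txt"   # read by the car handling script (assumed name)

def classify(frame_bytes):
    # Placeholder for the trained stop sign CNN: 1 if a stop sign is present, else 0.
    return 0

while True:
    with open(IMAGE_FILE, "rb") as f:
        ssp = classify(f.read())
    with open(OUTPUT_FILE, "w") as f:
        f.write(str(ssp))        # the SSP scalar: 1 = stop sign present, 0 = not present
    time.sleep(1.0 / 12)         # roughly the 12 Hz rate the system runs at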
5.3 GTA V – Position And Angle Structuring
Another idea that came about with the game hacks being available is training the
CNN on images taken from cameras at multiple positions and angles. In the game,
the camera can be placed literally anywhere in relation to the car. It can be set
using five parameters, <x,y,z,w,v>, where <x,y,z> define an offset from where the
player is in the game (presumably in the car) to where the camera should go; for
example, x is how far in front, y is how far to the side, and z is how far up. Then
<w,v> control the angle of the camera. Using these parameters, one can point the
camera at exactly whatever angle one wishes. For example, if x, y, z are in meters
and w, v in degrees, setting <w,v,x,y,z> = <0,-90,0,0,10> (that is, x = y = 0, z = 10,
w = 0, v = -90) would have a camera hovering 10 meters above the car and looking
down on it, giving a bird's eye view. This could be a very interesting point of view
from which to train the car, although its practicality in the real world currently
does not seem great. A more practical approach is to assume the camera has to be
attached to the car (as all major companies are doing) and experiment with where
on the car is optimal. For instance, would a camera under the front bumper, barely
above the ground and looking forward, work better, or would a camera mounted
much higher on the car do better? The script created can very easily test this.
Below are 6 settings with which the car is currently able to run (note: the ground is
defined as 10 cm above the pavement). The 6 photos are ordered such that the first
row is 0 and 0.5 meters off the ground, the second row is 1 and 1.5 meters, and the
last row is 2 and 3 meters off the ground:
Collecting a sufficient number of images to run the CNN, say 500 per setting if using
all 6, or perhaps 1000 per setting if using only 2, could easily have taken at least 12
hours of manually playing the game and driving around. It was decided to focus
first on other research in this thesis where the data was more easily accessible.
Nonetheless, having the car ready to collect such data at different positions and
angles is a great start. If someone could drive around the game and collect images,
those images could easily be fed into the CNN as a test set; or, if the person is able
to collect a sufficiently large number of images, one could train an entirely new
CNN on that dataset and see how it compares to training on real-world images.
Additionally, the bird's eye view images shown below in Section 5.4 could be a
further topic of research.
Figure 5.3: GTA V Video Image
Figure 5.5: GTA V Lane Detection CNN and Bird's Eye View
Chapter 6
Conclusion
6.1 CNN
The CNNs trained within this thesis all exceeded my expectations in how they
performed. CNNs are truly a very powerful detection tool for computer vision
tasks. The most important CNN in this thesis is undoubtedly the best performer in
Section 4.6, which had a validation error of 19% and a testing error of 21.5%. It was
a single CNN with 12 layers, trained and validated on a set of only 1400 images from
the LISA-TS dataset; 700 of those images contained stop signs, and 700 did not.
They are real images taken from behind the windshield of a car driving on
real-world roads. The fact that this CNN could perform so well despite being given
such a limited amount of data is truly remarkable. It shows that the single-step
approach to traffic sign detection is viable and provides an alternative to the
traditional two-step approach.
Additionally, if more training data is thrown at the problem, the results in this thesis
strongly suggest that the CNN from Section 4.6 would continue to improve. In all
likelihood, increasing the data by an order of magnitude would bring validation and
testing errors below 10%, and the single-step detector would come close to rivaling
the winners of the GTSDB competition. An even more important result is that, since
the single-step method works for traffic sign classification and the CNN can locate
the sign's presence, it is also likely that it can be trained to return a prediction of the
sign's distance from the vehicle based on the sign's size. The last obstacle to testing
that is dataset creation, which will hopefully be made possible by Artur's
independent work this semester on obtaining true distance values by calculating
them from in-game parameters. And as shown in Section 4.9, a CNN trained to
classify or detect in GTA V is likely to perform equally well on real-world images,
which is what we want. The outcomes of the CNN research in this thesis are quite
inspiring.
On another note, it is important to pause and explain the intuition behind why the
single-step traffic sign detection method requires a larger training dataset. The
reasoning is that a large image of the road ahead has a very small signal-to-noise
ratio: there are many objects in a vast image, and the CNN is expected to learn to
detect one small object that many images have in common. To get a well-performing
CNN, it therefore needs a very large training dataset to optimize over. If the dataset
is too small, there is a risk of the CNN training itself to detect mutually shared noise
among the images, such as the amount of road pavement, trees, or other objects that
the CNN architect did not notice were coincidentally more common in the positive
set than the negative set, or vice versa. In terms of performance, there is only upside
to adding more and more high-quality, curated images to the training and validation
sets.
Additionally, even if the single-step CNN hits a bottleneck at 5% error and cannot
reach the 1% error of the two-step method, that does not make it useless. Rather, it
can still be quite valuable. For instance, if it can predict distances to signs with only
5% error, that is excellent performance, as the two-step traffic sign detectors only
compute bounding boxes and do not directly give an indication of distance to the
sign. Knowing the distance to the sign is of huge benefit, because it allows the
vehicle to calibrate its behavior much more accurately, and it prevents disasters if
the sign happens to be temporarily occluded by another vehicle or object. If a stop
sign is detected at 40, 39, ... 30, ..., 20 meters away over the span of a few seconds
but then suddenly goes missing due to occlusion, the self-driving car will still know
when to stop, and this is an essential criterion for an autonomously driven vehicle.
6.2 GTA V
Working with GTA V was a great success. Whenever there was an idea about setting
up a certain camera view, setting the car to a certain speed, or changing car behavior
based on information from a text file (such as a CNN output/prediction), it was
accomplished by leveraging the tools available in the hacking library Script Hook V
(C++) (and equivalently in ScriptHookDotNet, which is in C#). Not only were these
ideas accomplished to satisfaction using Script Hook V, but along the way much was
learned about the abundance of tools available in the game that could be leveraged
for future research. Below are some ideas for ways to use GTA V in future research,
some of which were previously briefly mentioned:
One, the capability to locate nearby objects can be used to find the signs in the
vicinity, and other information about each sign, such as its game coordinates, can be
calculated. Artur is currently developing a script to compute the true distances to
signs on the upcoming stretch of road, and this would be immensely useful in
training a CNN that can predict the distance to detected stop signs, for example. No
real-world datasets have been found that include these true values, and this makes
sense, as going out to measure the distance for every photo is extremely costly in
man-hours.
Two, the position and angle structuring code can be used to create large datasets of
traffic sign images from an essentially unlimited number of camera placements and
angles, although one would of course choose strategic, practical positions and
angles. While it would require some man-hours, a database of 10,000 images could
easily be achievable by playing the game for one week. The results would hopefully
make it possible to train well-working CNNs and determine which placements and
angles have which advantages in detecting certain signs. From this research, one
could infer that it is likely ideal to have multiple cameras mounted in different
places on a self-driving car, as Google already does; more specifically, one could
train and test for the optimal combined camera locations and thus work towards a
set of CNNs operating together in a system that approaches 100% accuracy in traffic
sign detection. Such research would be a strong advancement of the current
progress in the realm of traffic sign detection.
Three, someone could take on the task of locating every traffic sign in the entire
GTA V game world. The player and car could then be spawned at each of those locations
systematically, in front of and facing the sign, and the car could be driven
automatically straight down the road past the sign while images are collected for
training. A script that automates this with minimal human oversight could increase
the amount of data collected by one and possibly two orders of magnitude: instead of
collecting 5,000 images, one could collect 500,000. A dataset of that size is entirely
reasonable and not unheard of; one of the most famous examples in the object detection
and object localization fields is the ImageNet competition [29], whose 2015 dataset
contained a whopping 1.2 million images. With GPU computational capacity increasing
every year, large datasets are becoming much more feasible to train on, and given that
greater capacity, larger datasets should be acquired so that CNNs can be trained on
detection problems previously considered infeasible, or simply too difficult at the
time.
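A hypothetical outline of that collection loop is sketched below; the three helper functions are stubs standing in for the corresponding Script Hook V operations (teleporting the car to a sign, advancing it a small step, and saving a screenshot), and the list of sign locations would come from a one-time sweep of the game world:

# Hypothetical outline of the automated collection loop described above.
def teleport_to_sign(location):
    pass  # stub: place the car on the road, facing the sign

def drive_straight_past(location):
    pass  # stub: advance the car a small step toward/past the sign

def capture_frame(filename):
    pass  # stub: save the current game frame to disk

def collect_dataset(sign_locations, frames_per_pass=50, out_dir='./gta_signs'):
    image_count = 0
    for loc in sign_locations:
        teleport_to_sign(loc)
        for _ in range(frames_per_pass):
            drive_straight_past(loc)
            capture_frame('%s/%07d.png' % (out_dir, image_count))
            image_count += 1
    return image_count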
decision)? If we are 75% confident the speed limit is dropping from 50 to 30 mph, but
there is a 25% chance it is increasing from 50 to 60 mph, what should the defined
behavior of the vehicle handler be in such a case? Safety tradeoffs need to be made.
The tracking layer will prove quite interesting to work on.
Several of the academic papers read for this research do implement tracking layers of
different designs. Further research could also test how changing the way the tracking
layer functions affects overall system performance. An initial plan could be a data
structure holding anywhere from 0 to 10 upcoming objects, where each object carries a
priority of importance, a confidence level that it actually exists, an estimated
location, and other important information abstracted from, or calculated on top of,
the CNN's output. A minimal sketch of such a structure is given below.
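The sketch below is one possible shape for that structure (field names are illustrative, not taken from an existing implementation): a bounded list of upcoming objects, each with a priority, an existence confidence, and an estimated location.

# Minimal sketch of the proposed tracking-layer data structure.
import heapq

class TrackedObject(object):
    def __init__(self, kind, priority, confidence, location):
        self.kind = kind              # e.g. 'stop_sign', 'speed_limit'
        self.priority = priority      # importance for the vehicle handler
        self.confidence = confidence  # 0.0 - 1.0 belief that the object exists
        self.location = location      # estimated (x, y, z) or distance

class TrackingLayer(object):
    MAX_OBJECTS = 10

    def __init__(self):
        self.objects = []

    def add(self, obj):
        self.objects.append(obj)
        # keep only the MAX_OBJECTS highest-priority objects
        self.objects = heapq.nlargest(self.MAX_OBJECTS, self.objects,
                                      key=lambda o: o.priority)

    def most_urgent(self):
        return max(self.objects, key=lambda o: o.priority) if self.objects else None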
6.3.3 Additional Virtual Environment and Dataset Exploration
In addition to GTA V, one could also begin exploring other virtual environments. Just
as TORCS worked very well for Chenyi Chen to train a lane detection CNN, and GTA V
seems similarly promising for training a good traffic sign detection CNN, other virtual
environments may have their own quirks that make them ideal for training particular
types of CNNs for detecting objects that would enhance the abilities of an autonomously
driven vehicle. For instance, in GTA V one can also pilot boats and planes, among other
vehicles. Companies like Amazon and Google are currently heavily invested in designing
autonomously flying drones. Drones in mid-flight tend to have few obstacles in their
path, but the takeoff and landing phases could be ripe areas for CNN-based detection.
In GTA V one can pilot planes, blimps, and helicopters, but no small aircraft such as
drones are available. With the hacks, however, it might be feasible to create one,
given a good amount of work. That being said, an
Figure 6.1: Speed Limit Sign Comparison (American speed limit sign from [23]; European speed limit sign from [24])
Which sign looks easier to detect? The European one is almost always easier: it is
circular, has a bright red border, and is uniform throughout Europe. The researchers
in [1] confirm this. They tried using the GTSDB-winning detectors on American traffic
signs, and by far the hardest sign to detect was the speed limit sign. Not only is it
a very generic sign, it is often elongated with extra information such as "Radar
Enforced", "Photo Enforced", "State Maximum", and many other variations. It is also
very similar to other common objects.
Figure 6.2: Speed Limit Sign Image
The above picture illustrates how generic a speed limit sign is. In fact, the
researchers in [1] had a significant problem with their detector labeling other
objects, such as the back of a truck (which is also white, rectangular, and can carry
text), as speed limit signs.
It is through training CNNs on various sets of traffic signs that researchers might
begin to hit these kinds of roadblocks and perhaps ask a country or region to
standardize its traffic signs into a more detectable, uniform design. Changing street
signs is relatively cheap compared to higher-cost options such as mounting special
detectors or sensors on signs to assist autonomously driven cars, an approach that is
also impractical: imagine if a sensor broke and the car therefore failed to detect a
stop sign. That would easily be a very costly and unacceptable error.
Final Thoughts
Working on traffic sign detection and with GTA V was a great pleasure, and others are
strongly encouraged to pursue further research in this area. Based on all the work
done so far, training, testing, and developing CNN detectors in virtual environments
seems to be a very tractable and effective method. There are many new techniques that
can be tried with relative ease in a virtual environment that are not feasible in the
real world. Go get started now! Vehicular accidents are one of the leading causes of
death in America and in the world; further work in getting autonomous driving tools
onto cars on the road could quickly save many lives. Go make a difference.
Bibliography
[1]
Andreas Møgelmose, Dongran Liu, and Mohan M. Trivedi, Traffic Sign Detection for U.S. Roads:
Remaining Challenges and a Case for Tracking, Oct 2014
http://cvrr.ucsd.edu/publications/2014%5CMoegelmoseLiuTrivedi_ITSC2014.pdf
[2]
Piotr Dollár, Integral Channel Features, 2009
http://authors.library.caltech.edu/60048/1/dollarBMVC09ChnFtrs.pdf
[3]
Mohammed Boumediene, Jean-Philippe Lauffenburger, Jeremie Daniel, and Christophe Cudel,
Coupled Detection, Association and Tracking for Traffic Sign Recognition*, June 2014
http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=6856492
[4]
Mohammed Boumediene, Christophe Cudel, Michel Basset, Abdelaziz Ouamri, Triangular traffic signs
detection based on RSLD algorithm, Aug 2013
http://link.springer.com/article/10.1007%2Fs00138-013-0540-y
[5]
Miguel Angel Garcia-Garrido, Fast Traffic Sign Detection and Recognition Under Changing Lighting
Conditions, Sept 2006
http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=1706843
[5]
Andrzej Ruta, Video-based Traffic Sign Detection, Tracking and Recognition, 2009
http://www.brunel.ac.uk/~csstyyl/papers/tmp/thesis.pdf
[6]
Karla Brkic, An overview of traffic sign detection methods, 2010
https://www.fer.unizg.hr/_download/repository/BrkicQualifyingExam.pdf
[7]
Shu Wang, A New Edge Feature For Head-Shoulder Detection, 2013
http://ieeexplore.ieee.org/stamp/stamp.jsp?arnumber=6738581
[8]
M. Liang, Traffic sign detection by ROI extraction and histogram features-based recognition, Aug
2013
http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=6706810&tag=1
[9]
Auranuch Lorsakul, Road Lane and Traffic Sign Detection & Tracking for Autonomous Urban Driving,
2000
http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.332.6612&rep=rep1&type=pdf
[10]
Traffic Sign Recognition for Intelligent Vehicle/Driver Assistance System Using Neural Network on
OpenCV, 2007
http://bartlab.org/Dr.%20Jackrit's%20Papers/ney/3.KRS036_Final_Submission.pdf
[11]
Arturo de la Escalera, Road Traffic Sign Detection and Classification, Dec 1997
http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=649946
[12]
Script Hook V v1.0.505.2a
By Alexander Blade, Frequently Updated
http://www.dev-c.com/gtav/scripthookv/
[13]
Script Hook V .NET v2.5.1
By Crosire, First Published: April 27, 2015, Frequently Updated
https://www.gta5-mods.com/tools/scripthookv-net
[14]
Theano Python Library
Developed by Machine Learning Group at the Université de Montréal
https://github.com/Theano/Theano
[15]
Deep Learning Tutorials, LeNet5.
http://deeplearning.net/tutorial/code/convolutional_mlp.py
[16]
The MNIST Database of handwritten digits, Yann LeCun, Corinna Cortes, and Christopher J.C. Burges
http://yann.lecun.com/exdb/mnist/
[17]
Lux Research, Self-driving Cars an $87 Billion Opportunity in 2030, Though None Reach Full
Autonomy, May 2014.
http://www.luxresearchinc.com/news-and-events/press-releases/read/self-driving-cars-87-billion-
opportunity-2030-though-none-reach
[18]
Brendan Sinclair, GTA V dev costs over $137 million, says analyst, Feb 2013.
http://www.gamesindustry.biz/articles/2013-02-01-gta-v-dev-costs-over-USD137-million-says-
analyst
[19]
C. Chen, A. Seff, A. Kornhauser, J. Xiao. DeepDriving: Learning Affordance for Direct Perception in
Autonomous Driving
http://deepdriving.cs.princeton.edu/
[20]
B. Wymann, E. Espié, C. Guionneau, C. Dimitrakakis, R. Coulom, and A. Sumner. TORCS, The Open
Racing Car Simulator, 2014.
http://www.torcs.org
[21]
A. Filipowicz, D. Stanley, B. Zhang. TorcsNet in GTA 5, 2015.
http://devpost.com/software/pave-gtav-lane-detection - updates
[22]
S. Macy. GTA 5 Has Now Sold Over 60 Million Copies, 3 Feb 2016.
http://www.ign.com/articles/2016/02/03/gta-5-has-now-sold-over-60-million-copies
[23]
LISA-TS Dataset,
Andreas Møgelmose, Vision based Traffic Sign Detection and Analysis for Intelligent Driver
Assistance Systems: Perspectives and Survey, 2012.
http://cvrr.ucsd.edu/LISA/datasets.html
[24]
German Traffic Sign Detection Benchmark (GTSDB)
Sebastian Houben and Johannes Stallkamp and Jan Salmen and Marc Schlipsing and Christian Igel,
Detection of Traffic Signs in Real-World Images: The German Traffic Sign Detection Benchmark,
2013.
http://benchmark.ini.rub.de/
[25]
Convolutional Neural Network
https://en.wikipedia.org/wiki/Convolutional_neural_network
[26]
Convolutional Neural Network
http://deeplearning.net/software/theano/theano.pdf
[27]
PAVE (Princeton Autonomous Vehicle Engineering) Laboratory GPU
https://www.princeton.edu/ris/projects/pave/
[28]
Michael Nielsen, Why are deep neural networks hard to train?, Jan 2016.
http://neuralnetworksanddeeplearning.com/chap5.html
[29]
ImageNet Competition, Stanford Vision Lab, Stanford University, Princeton University
http://image-net.org/
[30]
Dan Cireşan, Ueli Meier, Jonathan Masci and Jürgen Schmidhuber
Multi-Column Deep Neural Network for Traffic Sign Classification, 2012.
http://people.idsia.ch/~juergen/nn2012traffic.pdf
[31]
Markus Mathias, Radu Timofte, Rodrigo Benenson, and Luc Van Gool
Traffic Sign Recognition – How far are we from the solution?
http://rodrigob.github.io/documents/2013_ijcnn_traffic_signs.pdf
[32]
Radu Timofte, KUL Belgium Traffic Sign Dataset
http://btsd.ethz.ch/shareddata/
[33]
Swedish Traffic Sign (STS) Dataset
http://www.cvl.isy.liu.se/research/datasets/traffic-signs-dataset/
Appendix A
Code
Next, one can divide the positive and negative images into any desired ratio among the
training, validation, and testing sets. I chose 5:1:1 as my ratio and started with 700
positive and 700 negative images in total. For other experiments I then tried different
ratios and other data.
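For reference, a minimal sketch of such a split (the directory name is illustrative) is to shuffle the filenames once and slice them into the 5:1:1 proportions:

# Minimal sketch of a 5:1:1 train/validation/test split over one class of crops.
import os
import random

files = sorted(os.listdir('./binaryDetection140x95/positives'))
random.seed(0)
random.shuffle(files)

n = len(files)
n_train = n * 5 // 7   # 5 parts training
n_valid = n // 7       # 1 part validation; the remainder is testing
train_files = files[:n_train]
valid_files = files[n_train:n_train + n_valid]
test_files = files[n_train + n_valid:]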
A.1.2 Reading, Formatting, and Pickling Data:
There is some work involved in packaging all the photos into the NumPy data structures
that Theano needs before it can work with them. Here is how I did it for the LISA-TS
dataset, in Python:
from scipy import misc
import numpy
import os
import cPickle
import gzip
import theano
import theano.tensor as T

# alternate the order of positive and negative examples
train_set_x = numpy.empty([1000, 39900])
train_set_y = numpy.empty([1000, 1])
# for fn in os.listdir('.')
for i in range(0, 500):  # will do 0 through 499
    img = misc.imread('./binaryDetection140x95/pos500_training/' + str(i) + '.png')
    imgArray = img.ravel()
    # print i
    # print imgArray.shape
    train_set_x[i*2, ] = imgArray
    train_set_y[i*2, ] = 1
for i in range(0, 500):  # will do 0 through 499
    img = misc.imread('./binaryDetection140x95/neg500_training/' + str(i) + '.png')
    imgArray = img.ravel()
    train_set_x[i*2+1, ] = imgArray
    train_set_y[i*2+1, ] = 0

valid_set_x = numpy.empty([200, 39900])
valid_set_y = numpy.empty([200, 1])
test_set_x = numpy.empty([200, 39900])
test_set_y = numpy.empty([200, 1])
for i in range(0, 100):  # will do 0 through 99
    imgArrayV = misc.imread('./binaryDetection140x95/pos100_validation/' + str(i+530) + '.png').ravel()
    imgArrayT = misc.imread('./binaryDetection140x95/pos100_testing/' + str(i+660) + '.png').ravel()
    valid_set_x[i*2, ] = imgArrayV
    valid_set_y[i*2, ] = 1
    test_set_x[i*2, ] = imgArrayT
    test_set_y[i*2, ] = 1
for i in range(0, 100):  # will do 0 through 99
    imgArrayV = misc.imread('./binaryDetection140x95/neg100_validation/' + str(i+1060) + '.png').ravel()
    imgArrayT = misc.imread('./binaryDetection140x95/neg100_testing/' + str(i+800) + '.png').ravel()
    valid_set_x[i*2+1, ] = imgArrayV
    valid_set_y[i*2+1, ] = 0
    test_set_x[i*2+1, ] = imgArrayT
    test_set_y[i*2+1, ] = 0

data = [(train_set_x, train_set_y.flatten()),
        (valid_set_x, valid_set_y.flatten()),
        (test_set_x, test_set_y.flatten())]
with gzip.open('stopsignAlter.pkl.gz', 'wb') as f:
    print "dumping"
    cPickle.dump(data, f)
print "done"
A.2 CNN Implementation (Theano)
It took considerable work to learn how Theano behaves in practice and how a CNN works
in theory, and then to put the two together into a working CNN architecture. Below is
the code for the different pieces of working with CNNs and Theano; everything is in
Python.
A.2.1 CNN architecture definition and training:
import os
import sys
import timeit
import numpy
import theano
import theano.tensor as T
from theano.tensor.signal import downsample
from theano.tensor.nnet import conv
from logistic_sgd_ss import LogisticRegression, load_data
from mlp import HiddenLayer
import cPickle
class LeNetConvPoolLayer(object):
"""Pool Layer of a convolutional network """
# ... (the body of LeNetConvPoolLayer and the start of the evaluate_cnn function are omitted here)
# allocate symbolic variables for the data
index = T.lscalar()  # index to a [mini]batch
# start-snippet-1
x = T.matrix('x')   # the data is presented as rasterized images
y = T.ivector('y')  # the labels are presented as a 1D vector of [int] labels

######################
# BUILD ACTUAL MODEL #
######################
print '... building the model'

# Reshape matrix of rasterized images of shape (batch_size, 3 * 95 * 140)
# to a 4D tensor, compatible with our LeNetConvPoolLayer
# (the original LeNet tutorial used 28x28 MNIST images)
layer0_input = x.reshape((batch_size, 3, 95, 140))

# Construct the first convolutional pooling layer:
# filtering reduces the image size to (95-16+1, 140-16+1) = (80, 125)
# maxpooling reduces this further to (80/2, 125/2) = (40, 62)
# 4D output tensor is thus of shape (batch_size, nkerns[0], 40, 62)
layer0 = LeNetConvPoolLayer(
    rng,
    input=layer0_input,
    image_shape=(batch_size, 3, 95, 140),
    filter_shape=(nkerns[0], 3, 16, 16),
    poolsize=(2, 2)
)

# Construct the second convolutional pooling layer
# filtering reduces the image size to (40-8+1, 62-8+1) = (33, 55)
# maxpooling reduces this further to (33/2, 55/2) = (16, 27)
# 4D output tensor is thus of shape (batch_size, nkerns[1], 16, 27)
layer1 = LeNetConvPoolLayer(
    rng,
    input=layer0.output,
    image_shape=(batch_size, nkerns[0], 40, 62),
    filter_shape=(nkerns[1], nkerns[0], 8, 8),
    poolsize=(2, 2)
)

# Construct the third convolutional pooling layer
# filtering reduces the image size to (16-5+1, 27-5+1) = (12, 23)
# maxpooling reduces this further to (12/2, 23/2) = (6, 11)
# 4D output tensor is thus of shape (batch_size, nkerns[2], 6, 11)
layer2 = LeNetConvPoolLayer(
    rng,
    input=layer1.output,
    image_shape=(batch_size, nkerns[1], 16, 27),
    filter_shape=(nkerns[2], nkerns[1], 5, 5),
    poolsize=(2, 2)
)
# ... (definitions of layer3, layer4, cost, grads, updates, and the validate_model/test_model functions are omitted here)
train_model = theano.function(
    [index],
    cost,
    updates=updates,
    givens={
        x: train_set_x[index * batch_size: (index + 1) * batch_size],
        y: train_set_y[index * batch_size: (index + 1) * batch_size]
    }
)
getGrads = theano.function(
    [index],
    grads,
    givens={
        x: train_set_x[index * batch_size: (index + 1) * batch_size],
        y: train_set_y[index * batch_size: (index + 1) * batch_size]
    }
)
# end-snippet-1

###############
# TRAIN MODEL #
###############
print '... training'
# early-stopping parameters
patience = 10000  # look at this many examples regardless
patience_increase = 2  # wait this much longer when a new best is found
improvement_threshold = 0.995  # a relative improvement of this much is
                               # considered significant
validation_frequency = min(n_train_batches, patience / 2)
                               # go through this many minibatches before
                               # checking the network on the validation set;
                               # in this case we check every epoch

best_validation_loss = numpy.inf
best_iter = 0
test_score = 0.
start_time = timeit.default_timer()

epoch = 0
done_looping = False

while (epoch < n_epochs) and (not done_looping):
    epoch = epoch + 1
    for minibatch_index in xrange(n_train_batches):
        iter = (epoch - 1) * n_train_batches + minibatch_index
        theseGrads = getGrads(minibatch_index)
        if iter % 50 == 0:
            print 'training @ iter = ', iter
        cost_ij = train_model(minibatch_index)
        if (iter + 1) % validation_frequency == 0:
            # compute zero-one loss on validation set
            validation_losses = [validate_model(i) for i
                                 in xrange(n_valid_batches)]
            this_validation_loss = numpy.mean(validation_losses)
            print('epoch %i, minibatch %i/%i, validation error %f %%' %
                  (epoch, minibatch_index + 1, n_train_batches,
                   this_validation_loss * 100.))
            # if we got the best validation score until now
            if this_validation_loss < best_validation_loss:
                # improve patience if loss improvement is good enough
                if this_validation_loss < best_validation_loss * \
                        improvement_threshold:
                    patience = max(patience, iter * patience_increase)
                # save best validation score and iteration number
                best_validation_loss = this_validation_loss
                best_iter = iter
                # test it on the test set
                test_losses = [
                    test_model(i)
                    for i in xrange(n_test_batches)
                ]
                test_score = numpy.mean(test_losses)
                print(('     epoch %i, minibatch %i/%i, test error of '
                       'best model %f %%') %
                      (epoch, minibatch_index + 1, n_train_batches,
                       test_score * 100.))
                # save the best model, one pickled file per layer
                with open('best_model_c_ss_0n3.pkl', 'w') as f:
                    cPickle.dump(layer0, f)
                with open('best_model_c_ss_1n3.pkl', 'w') as f:
                    cPickle.dump(layer1, f)
                with open('best_model_c_ss_2n3.pkl', 'w') as f:
                    cPickle.dump(layer2, f)
                with open('best_model_c_ss_3n3.pkl', 'w') as f:
                    cPickle.dump(layer3, f)
                with open('best_model_c_ss_4n3.pkl', 'w') as f:
                    cPickle.dump(layer4, f)
        if patience <= iter:
            done_looping = True
            break
end_time = timeit.default_timer()
print('Optimization complete.')
print('Best validation score of %f %% obtained at iteration %i, '
      'with test performance %f %%' %
      (best_validation_loss * 100., best_iter + 1, test_score * 100.))
print >> sys.stderr, ('The code for file ' +
                      os.path.split(__file__)[1] +
                      ' ran for %.2fm' % ((end_time - start_time) / 60.))
'''
import cPickle
save_file = open('bestTrained.p', 'wb')
# cPickle.dump(params.get_value(borrow=True), save_file, -1)
# cPickle.dump(params, save_file, -1)
cPickle.dump(layer3, save_file, -1)
save_file.close()
'''
if __name__ == '__main__':
    evaluate_cnn()


def experiment(state, channel):
    evaluate_cnn(state.learning_rate, dataset=state.dataset)
A.2.2 CNN Testing:
# can test the CNN on any image set, assuming proper formatting first
import cPickle
import gzip
import os
import sys
import timeit
import numpy
import theano
import theano.tensor as T
from logistic_sgd import LogisticRegression, load_data
from convolutional_mlp import LeNetConvPoolLayer


def predict():
    """
    An example of how to load a trained model and use it
    to predict labels.
    """
    # load the saved model, one pickled file per layer
    layer0 = cPickle.load(open('best_model_c_ss_0.pkl'))
    layer1 = cPickle.load(open('best_model_c_ss_1.pkl'))
    layer2 = cPickle.load(open('best_model_c_ss_2.pkl'))
    layer3 = cPickle.load(open('best_model_c_ss_3.pkl'))
    layer4 = cPickle.load(open('best_model_c_ss_4.pkl'))

    # compile a predictor function for the final layer
    predict_model_4 = theano.function(
        inputs=[layer4.input],
        outputs=layer4.y_pred)

    # We can test it on some examples from the test set
    # dataset = 'stopsign.pkl.gz'
    dataset = 'stopsignAlter.pkl.gz'
    datasets = load_data(dataset)
    test_set_x, test_set_y = datasets[2]
    test_set_x = test_set_x.get_value()
    # print test_set_x
    # print test_set_x[1]

    # compile functions for the convolution/pooling layers
    convPool_0 = theano.function(
        inputs=[layer0.input],
        outputs=layer0.output)
    convPool_1 = theano.function(
        inputs=[layer1.input],
        outputs=layer1.output)
    convPool_2 = theano.function(
        inputs=[layer2.input],
        outputs=layer2.output)
    # fully connected (MLP) layer function
    mlp_3 = theano.function(
        inputs=[layer3.input],
        outputs=layer3.output)

    batchsize = 20
    layer0input = test_set_x[0:batchsize, ]
    layer0input = layer0input.reshape(batchsize, 3, 95, 140)
    layer0output = convPool_0(layer0input)
    layer1input = layer0output
    layer1output = convPool_1(layer1input)
    layer2input = layer1output
    layer2output = convPool_2(layer2input)
    layer3input = layer2output
    layer3input = layer3input.reshape(batchsize, 800)
    layer3output = mlp_3(layer3input)
    layer4input = layer3output
    predicted_values = predict_model_4(layer4input)
    print ("Predicted values for the first %i examples in test set:" % batchsize)
    print predicted_values[0:batchsize]
    print test_set_y.eval()[0:batchsize]
    print (predicted_values[0:batchsize] - test_set_y.eval()[0:batchsize])
    #
    # predicted_values = predict_model_3(test_set_x[0:200, ])
    # print ("Predicted values for the first 200 examples in test set:")
    # print predicted_values
    # print test_set_y.eval()[0:200]


predict()
A.3 GTA V Script (C#, Script Hook V .NET), excerpt:
counter2 = 3; // number of frames to show debug (until we might take the next picture)
}
}
}
catch (Exception exception){UI.Notify("error - python probably writing to file");}
if (counter2 > 0)
{
if (enabled){drawDebug(Game.Player.Character.CurrentVehicle);}
counter2--;
}
if (enabled){drive(angle, disp);}
if (Game.Player.Character.IsInVehicle())
{
camera.AttachTo(Game.Player.Character.CurrentVehicle, new Vector3(0f, 3f,
heightZ));
camera.Rotation = Game.Player.Character.CurrentVehicle.Rotation;
}
this.mContainer.Draw();counter++;
}
float average(Queue<float> list)
{
float sum = 0;
foreach (float f in list){sum += f;}
return sum / list.Count;
}
Vector3 vel;
float targetSpeed = 5f;
void drive(float angle, float disp)
{
Vehicle car = Game.Player.Character.CurrentVehicle;
float turn = -1 * disp * Math.Abs(disp) - angle;
float originalVelW = .7f;
float desiredVelW = 1f - originalVelW;
vel = car.ForwardVector + ((float)Math.Sin(turn / 180f * Math.PI)) * car.RightVector;
vel.Normalize(); car.Velocity = (originalVelW * car.Velocity + desiredVelW * targetSpeed
* vel);
}
const int IMAGE_HEIGHT = 210;const int IMAGE_WIDTH = 280;
void screenshot(String filename)
{
var foregroundWindowsHandle = GetForegroundWindow();
var rect = new Rect();
GetWindowRect(foregroundWindowsHandle, ref rect);
Rectangle bounds = new Rectangle(rect.Left, rect.Top, rect.Right - rect.Left, rect.Bottom
- rect.Top);
using (Bitmap bitmap = new Bitmap(bounds.Width, bounds.Height))
{
using (Graphics g = Graphics.FromImage(bitmap))
{
g.ScaleTransform(.2f, .2f);
g.CopyFromScreen(new Point(bounds.Left, bounds.Top), Point.Empty, bounds.Size);
}
Bitmap output = new Bitmap(IMAGE_WIDTH, IMAGE_HEIGHT);
using (Graphics g = Graphics.FromImage(output))
{
g.DrawImage(bitmap, 0, 0, IMAGE_WIDTH, IMAGE_HEIGHT);
}
output.Save(filename, ImageFormat.Bmp);
}
}
private void onKeyUp(object sender, KeyEventArgs e)
{
if (e.KeyCode == Keys.I)
{
if (Game.Player.Character.IsInVehicle())
{
enabled = true;
GTA.Native.Function.Call(Hash.RENDER_SCRIPT_CAMS, true, true, camera.Handle,
true, true);
}else{UI.Notify("Please enter a vehicle.");}
}
if (e.KeyCode == Keys.U){UI.Notify("red light
detected");UI.DrawTexture("t_red_light.png",1,1, displayTime, p,s);}
if (e.KeyCode == Keys.J){UI.Notify("yellow light
detected");UI.DrawTexture("t_yellow_light.png", 1, 1, displayTime, p, s);}
if (e.KeyCode == Keys.M){UI.Notify("green light
detected");UI.DrawTexture("t_green_light.png", 1, 1, displayTime, p, s);}
if (e.KeyCode == Keys.K){UI.Notify("stop sign
detected");UI.DrawTexture("t_stop_sign.png", 1, 1, displayTime, p, s);}
if (e.KeyCode == Keys.O)
{
enabled = false;
UI.Notify("Relinquishing control");
GTA.Native.Function.Call(Hash.RENDER_SCRIPT_CAMS, false, false, camera.Handle,
true, true);
}
if (e.KeyCode == Keys.N)
{
Vehicle vehicle = World.CreateVehicle(VehicleHash.Adder,
Game.Player.Character.Position + Game.Player.Character.ForwardVector * 3.0f,
Game.Player.Character.Heading + 90);
vehicle.CanTiresBurst = false;
vehicle.PrimaryColor = VehicleColor.ModshopBlack1;
vehicle.CustomSecondaryColor = Color.DarkOrange;
vehicle.PlaceOnGround();
vehicle.NumberPlate = " 888 ";
}
if (e.KeyCode == Keys.L)
{
enabled = !enabled;
if(enabled)
UI.Notify("activated self driving");
else
UI.Notify("deactivated self driving");
}
if (e.KeyCode == Keys.NumPad1){heightZ = 0f;}
if (e.KeyCode == Keys.NumPad2){heightZ = 0.5f;}
if (e.KeyCode == Keys.NumPad3){heightZ = 1.0f;}
if (e.KeyCode == Keys.NumPad4){heightZ = 1.5f;}
if (e.KeyCode == Keys.NumPad5){heightZ = 2.0f;}
if (e.KeyCode == Keys.NumPad9){heightZ = 3.0f;}
UI.Notify("endKeyUp");
}
void drawDebug(Vehicle car)
{
Vector3 forward = (float)Math.Cos(angle * D_TO_R) * car.ForwardVector -
(float)Math.Sin(angle * D_TO_R) * car.RightVector;
Vector3 right = -(float)Math.Sin(angle * D_TO_R) * car.ForwardVector -
(float)Math.Cos(angle * D_TO_R) * car.RightVector;
Vector3 center = car.Position + disp * right;
float r = 0.1f;
for (float i = 0; i < 15; i += .2f)
{
World.DrawMarker(MarkerType.DebugSphere, center + i * forward,
Vector3.WorldUp, new Vector3(1, 1, 1), new Vector3(r, r, r), Color.Blue);
}
if (enabled && false) // this is turned off
{
Vector3 debugV = car.Position + 15f * vel; r = .3f;
World.DrawMarker(MarkerType.DebugSphere, debugV, Vector3.WorldUp, new
Vector3(1, 1, 1), new Vector3(r, r, r), Color.Blue);
}}}