In our six-week summer training we learnt the basics of Python in the first two weeks, then spent one week on NumPy, one week on pandas, and the last week on machine learning.
Python is a popular programming language. It was created by Guido van Rossum and released in 1991.
Some of its key features are:
• Python works on different platforms (Windows, Mac, Linux, Raspberry Pi, etc.).
• Python has a simple syntax similar to the English language.
• Python has syntax that allows developers to write programs with fewer lines than some
other programming languages.
• Python runs on an interpreter system, meaning that code can be executed as soon as it is
written. This means that prototyping can be very quick.
• Python can be treated in a procedural way, an object-oriented way or a functional way.
• Interpreted
• There are no separate compilation and execution steps as in C and C++.
• The program runs directly from the source code.
• Internally, Python converts the source code into an intermediate form called bytecode, which is then translated into the native language of the specific computer in order to run it.
• There is no need to worry about linking and loading with libraries, etc.
• Platform Independent
• Python programs can be developed and executed on multiple operating
system platforms.
• Python can be used on Linux, Windows, Macintosh, Solaris and many
more.
• Free and Open Source; Redistributable
• High-level Language
• In Python, no need to take care about low-level details such as
managing the memory used by the program.
• Simple
• Closer to English language; Easy to Learn
• More emphasis on the solution to the problem rather than the syntax
• Embeddable
• Python can be used within C/C++ program to give scripting capabilities
for the program’s users.
• Robust:
• Exception handling features
• Built-in memory management techniques
• Rich Library Support
• The Python Standard Library is very vast.
• Known as the “batteries included” philosophy of Python; It can help
do various things involving regular expressions, documentation
generation, unit testing, threading, databases, web browsers, CGI,
email, XML, HTML, WAV files, cryptography, GUI and many more.
• Besides the standard library, there are various other high-quality
libraries such as the Python Imaging Library which is an amazingly
simple image manipulation library.
1. GNU Debugger uses Python as a pretty printer to show complex structures such
as C++ containers.
2. Python has also been used in artificial intelligence.
3. Python is often used for natural language processing tasks.
Applications:
1. GUI based desktop applications
2. Graphic design, image processing applications, Games, and Scientific/
computational Applications
3. Web frameworks and applications
4. Enterprise and Business applications
5. Operating Systems
6. Education
7. Database Access
8. Language Development
9. Prototyping
10. Software Development
1.6 Organizations using Python:
1. Google (Components of Google spider and Search Engine)
2. Yahoo (Maps)
3. YouTube
4. Mozilla
5. Dropbox
6. Microsoft
7. Cisco
8. Spotify
9. Quora
1.7 Data Types:
Boolean:
are either True or False.
Numbers:
can be integers (1 and 2), floats (1.1 and 1.2), fractions (1/2 and 2/3).
Strings:
are sequences of Unicode characters, e.g. an HTML document.
Lists:
are ordered sequences of values.
Tuples:
are ordered, immutable sequences of values.
Sets:
are unordered bags of values.
1.8 Variables:
Variables are nothing but reserved memory locations to store values. This means that
when you create a variable, you reserve some space in memory.
Based on the data type of a variable, the interpreter allocates memory and decides what
can be stored in the reserved memory. Therefore, by assigning different data types to
variables, you can store integers, decimals or characters in these variables.
Ex:
counter = 100    # An integer assignment
miles = 1000.0   # A floating point number
name = "John"    # A string
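Since Python is dynamically typed, the same variable name can even be rebound to a value of a different type later. A small sketch extending the example above:

```python
counter = 100        # an integer
miles = 1000.0       # a float
name = "John"        # a string

# Python is dynamically typed: the same name can later be rebound
# to a value of a completely different type.
counter = "one hundred"

print(type(counter).__name__)   # str
print(type(miles).__name__)     # float
```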
1.9 Operators:
• Arithmetic Operators
• Comparison Operators
• Logical Operators
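The operator tables in the original report were figures; the three operator families can be sketched as follows (the values 10 and 3 are chosen only for illustration):

```python
a, b = 10, 3

# Arithmetic operators
print(a + b)    # 13
print(a - b)    # 7
print(a * b)    # 30
print(a // b)   # 3   (floor division)
print(a % b)    # 1   (remainder)
print(a ** b)   # 1000 (exponentiation)

# Comparison operators evaluate to a Boolean
print(a > b)    # True
print(a == b)   # False

# Logical operators combine Boolean expressions
print(a > 5 and b > 5)   # False
print(a > 5 or b > 5)    # True
print(not a > 5)         # False
```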
1.10 LOOPS :
Programming languages provide various control structures that allow for more complicated
execution paths.
A loop statement allows us to execute a statement or group of statements
multiple times.
Python provides the following types of loops to handle looping requirements.
Table 1.4 Loop types
1. while loop - Repeats a statement or group of statements while a given condition is TRUE. It tests the condition before executing the loop body.
2. for loop - Executes a sequence of statements multiple times and abbreviates the code that manages the loop variable.
3. nested loops - You can use one or more loops inside another while or for loop.
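The three loop types in Table 1.4 can be sketched briefly (the numbers are arbitrary examples):

```python
# while loop: tests the condition before each iteration
count = 0
while count < 3:
    count += 1

# for loop: the loop variable is managed automatically
total = 0
for n in [1, 2, 3, 4]:
    total += n

# nested loops: a for loop inside another for loop
pairs = []
for i in range(2):
    for j in range(2):
        pairs.append((i, j))

print(count, total, pairs)
```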
Table 1.5 types of conditional statements
Sr.No. Statement & Description
1 if statements
2 if...else statements
3 nested if statements
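The statement types in Table 1.5 can be illustrated with a short sketch (the grading thresholds below are made up purely for illustration):

```python
def grade(marks):
    # if...elif...else chain; a nested if sits inside the final else
    if marks >= 80:
        return "A"
    elif marks >= 50:
        return "B"
    else:
        if marks >= 35:   # nested if statement
            return "C"
        return "F"

print(grade(90), grade(60), grade(40), grade(10))   # A B C F
```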
NumPy is a Python library used for working with arrays. It also has functions for working in the domain of linear algebra, Fourier transforms, and matrices.
NumPy was created in 2005 by Travis Oliphant. It is an open-source project and you can use it
freely.
In Python we have lists that serve the purpose of arrays, but they are slow to process.
NumPy aims to provide an array object that is up to 50x faster than traditional Python lists.
The array object in NumPy is called ndarray; it provides a lot of supporting functions that make working with ndarray very easy.
Arrays are very frequently used in data science, where speed and resources are very important.
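A minimal sketch of the ndarray object described above (the sample values are arbitrary):

```python
import numpy as np

# Create an ndarray from a Python list
arr = np.array([1, 2, 3, 4, 5])

print(type(arr).__name__)   # ndarray
print(arr.shape)            # (5,)
print(arr * 2)              # vectorized: every element is doubled
print(arr.mean())           # 3.0
```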
The name "Pandas" has a reference to both "Panel Data", and "Python Data Analysis" and was
created by Wes McKinney in 2008.
Pandas allows us to analyse big data and draw conclusions based on statistical theories.
Pandas can clean messy data sets and make them readable and relevant.
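A small sketch of Pandas cleaning a messy data set (the names and scores are hypothetical):

```python
import pandas as pd

# A tiny, hypothetical data set with one missing value
df = pd.DataFrame({
    "name": ["Alice", "Bob", "Carol"],
    "score": [85, None, 72],
})

print(df["score"].isnull().sum())   # 1 missing value
clean = df.dropna()                 # drop rows with missing data
print(len(clean))                   # 2
print(clean["score"].mean())        # 78.5
```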
1.16 Function:
Function blocks begin with the keyword def, followed by the function name and parentheses ().
Any input parameters or arguments should be placed within these parentheses.
The first statement of a function can be an optional statement - the documentation string (docstring) of the function.
The code block within every function starts with a colon (:) and is indented. The statement return
[expression] exits a function, optionally passing back an expression to
the caller. A return statement with no arguments is the same as return None.
Syntax:
def function_name(parameters):
    """docstring"""
    statement(s)
Example:
def greet(name):
    """
    This function greets the person
    passed in as a parameter.
    """
    print("Hello, " + name + ". Good morning!")
Machine Learning algorithms enable the computers to learn from data, and even improve
themselves, without being explicitly programmed.
The premise of machine learning is to build algorithms that can receive input data and use statistical analysis to predict an output, while updating outputs as new data becomes available.
1. Supervised Learning
2. Unsupervised Learning
3. Reinforcement Learning
The goal is to approximate the mapping function so well that when you have new input data (x), you can predict the output variables (Y) for that data.
As shown in the above example, we initially take some data and mark each item as 'Spam' or 'Not Spam'. This labelled data is used to train the supervised model.
Once it is trained, we can test our model with some new test mails and check whether the model is able to predict the right output.
• Regression: A regression problem is when the output variable is a real value, such as "dollars" or "weight".
Example of Supervised Learning Algorithms:
• Linear Regression
• Logistic Regression
• Nearest Neighbour
• Gaussian Naive Bayes
• Decision Trees
• Support Vector Machine (SVM)
• Random Forest
Figure 1.3 – Unsupervised Learning
In the above example, we give some characters to our model, namely 'Ducks' and 'Not Ducks'. In our training data, we don't provide any label for the corresponding data. The unsupervised model is able to separate the two kinds of characters by looking at the type of data, and it models the underlying structure or distribution in the data in order to learn more about it.
What is Flask?
Flask is a lightweight framework for building web applications in Python.
In order to build a Flask app, you’ll need the following minimal directory structure:
project
├── templates
└── app.py
We write our Flask app into app.py. In the templates/ directory, we store the HTML templates
that our Flask app will use to display to the end user.
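A minimal app.py matching this structure might look as follows. This is only a sketch: the route body and page text are placeholders, not the project's actual code.

```python
from flask import Flask

app = Flask(__name__)

@app.route("/")
def home():
    # The real app would call render_template("index.html") to serve
    # a template from the templates/ directory.
    return "Hello from Flask!"

# Start the development server with:
#   app.run(debug=True)
```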
CHAPTER 2 - TRAINING WORK UNDERTAKEN
2.1 Introduction
In our project we decided to predict the price of used cars using data science and machine
learning that we learned during our six-week summer training.
2.2.1 STEP 1:
Figure 2. 1 – Importing libraries
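The exact imports are shown in Figure 2.1; this sketch lists the libraries such a project typically needs (the full list is an assumption, not copied from the figure):

```python
import pandas as pd    # data loading and manipulation
import numpy as np     # numerical operations
# Figure 2.1 likely also imports plotting and modelling tools, e.g.:
# import matplotlib.pyplot as plt
# import seaborn as sns
# from sklearn.model_selection import train_test_split

print(pd.__name__, np.__name__)   # pandas numpy
```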
2.2.2 Step 2:
pd.read_csv is used to read the "car data.csv" dataset and store it in car_dataset.
This dataset contains information about used cars listed on different websites.
This data can be used for many purposes, such as price prediction to exemplify the use of linear regression in machine learning.
The columns in the given dataset are as follows:
1. Car_Name (This column should be filled with the name of the car.)
2. Year (This column should be filled with the year in which the car was bought.)
3. Selling_Price (This column should be filled with the price the owner wants to sell the
car at.)
4. Present_Price (This is the current ex-showroom price of the car.)
5. Kms_Driven (This is the distance completed by the car in km.)
6. Fuel_Type (Fuel type of the car.)
7. Seller_Type (Defines whether the seller is a dealer or an individual.)
8. Transmission (Defines whether the car is manual or automatic.)
9. Owner (Defines the number of owners the car has previously had.)
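Step 2 might be sketched as follows. Since the CSV file is not available here, the sketch builds a tiny frame with the same columns; the two sample rows are purely illustrative:

```python
import pandas as pd

# In the report: car_dataset = pd.read_csv("car data.csv")
# Here a tiny stand-in frame with the same nine columns:
car_dataset = pd.DataFrame({
    "Car_Name":      ["ritz", "sx4"],
    "Year":          [2014, 2013],
    "Selling_Price": [3.35, 4.75],
    "Present_Price": [5.59, 9.54],
    "Kms_Driven":    [27000, 43000],
    "Fuel_Type":     ["Petrol", "Diesel"],
    "Seller_Type":   ["Dealer", "Dealer"],
    "Transmission":  ["Manual", "Manual"],
    "Owner":         [0, 0],
})

print(car_dataset.head())   # first five rows (here: both rows)
```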
car_dataset.head() shows the first five rows of the car dataset.
Figure 2. 2 – Dataset.head()
2.2.3 Step 3:
car_dataset.shape is used for checking the number of rows and columns in the car dataset.
From the output (301, 9) we observe that our dataset contains 301 rows and 9 columns.
Similarly, the car_dataset.info() function is used for getting information about the dataset:
it gives the count of non-null values in each column of the dataset and also tells the datatype of each column, i.e. whether it is float64, int64 or object.
In our car dataset we have zero null values in every column. We have two columns with float64 values, i.e. (Selling_Price and Present_Price), and three columns with int64 datatype, i.e. (Year, Kms_Driven, Owner).
Figure 2. 3 – Checking row, columns and dataset info
2.2.4 Step 4:
The describe() method returns a description of the data in the DataFrame.
If the DataFrame contains numerical data, the description contains this information for each column: the count of non-null values, the mean, the standard deviation, the minimum, the quartiles and the maximum.
In our dataset we have five columns with numerical data, i.e. (Year, Selling_Price, Present_Price, Kms_Driven and Owner).
The count of non-null values in our dataset is 301 for each of the above columns, which is equal to the number of rows in our dataset, i.e. 301. It means the above columns don't have any null values.
The minimum value in the Year column is 2003.000000, in Selling_Price it is 0.100000, in Present_Price it is 0.320000, in Kms_Driven it is 500.000000 and in Owner it is 0.000000.
The maximum value in the Year column is 2018.000000, in Selling_Price it is 35.000000, in Present_Price it is 92.600000, in Kms_Driven it is 500000.000000 and in Owner it is 3.000000.
2.2.5 Step 5:
car_dataset.isnull().sum() is used to count the null values in each column.
From the above output we came to know that our dataset doesn't have any column that contains null values.
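The inspection calls from steps 3 to 5 can be sketched together on a small stand-in frame (the values are hypothetical):

```python
import pandas as pd

# Tiny stand-in for car_dataset, used only to illustrate the calls
car_dataset = pd.DataFrame({
    "Year":          [2014, 2013, 2017],
    "Selling_Price": [3.35, 4.75, 7.25],
    "Kms_Driven":    [27000, 43000, 6900],
})

print(car_dataset.shape)            # (rows, columns)
car_dataset.info()                  # dtypes and non-null counts
print(car_dataset.describe())       # count, mean, std, min, quartiles, max
print(car_dataset.isnull().sum())   # null values per column
```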
Figure 2.6 – Plotting fuel type categorical data
As we can infer from the graph, most of the cars in the second-hand market use petrol as their fuel, with a few using diesel and only a handful using CNG.
2.3.2 Visualizing seller type-
This graph shows that the number of second-hand cars sold by individuals is lower compared to the number sold by dealers.
2.3.3 Visualizing Transmission_type –
We can infer from this graph that most of the cars in the market are of manual transmission.
2.4 Visualizing Numerical Data-
From this graph, we infer that many cars have had zero previous owners, very few have had one owner, and a very limited number have had 3 previous owners.
A Box Plot, also known as a Whisker plot, is created to display the summary of a set of data values, showing properties like the minimum, first quartile, median, third quartile and maximum. In the box plot, a box is drawn from the first quartile to the third quartile, and a vertical line goes through the box at the median. Here the x-axis denotes the data to be plotted, while the y-axis shows the frequency distribution.
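The quantities a box plot summarises can be computed directly; a sketch with hypothetical selling prices (in lakhs) chosen only for illustration:

```python
import numpy as np

# Hypothetical selling prices (in lakhs)
prices = np.array([1.0, 2.5, 4.0, 6.0, 9.0, 13.0, 35.0])

q1, median, q3 = np.percentile(prices, [25, 50, 75])
iqr = q3 - q1
# Points beyond 1.5 * IQR from the quartiles are drawn as outliers
outliers = prices[(prices < q1 - 1.5 * iqr) | (prices > q3 + 1.5 * iqr)]

print(prices.min(), q1, median, q3, prices.max())
print(outliers)   # the extreme value 35.0 is flagged here
```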
Figure 2.11 – Box plot of selling price (in lakhs)
From the above graph we infer that most of the cars have a maximum selling price of 13 lakh and a minimum selling price of 1 lakh, with a median selling price of 4 lakh.
Selling price has many outliers.
2.4.3 Visualizing present price-
From the above graph we infer that most of the cars have a maximum present price of 21 lakh and a minimum present price of 1 lakh, with a median present price of 9 lakh.
2.4.4 Visualizing Kms_Driven by a car-
From the above graph we infer that most of the cars have a maximum of 90000 kms driven and a minimum of 500 kms driven, with median kms driven of 32000. The Kms_Driven boxplot also contains outliers.
From the above graph we also infer that most cars have been driven around 45000 kms.
2.4.5 Visualizing Year in which the new car was bought
This graph shows the year in which the cars were bought from the showroom. As we can see, most cars were bought in the year 2015 (i.e. recently), and fewer cars were bought in earlier years like 2003, 2004, 2005, 2006, 2007, 2008 and 2009.
Figure 2 .16- Checking for the distribution of categorical Data
From the above output we observe that in the Fuel_Type column most cars use Petrol, with a count of 239; diesel cars are 60 and CNG cars are 2 in number.
In the Seller_Type column, Dealers are 195 in number and Individuals are 106 in number.
In the Transmission column, we observe that cars with manual transmission are 261 in number and automatic cars are 40 in number.
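These per-category counts come from value_counts(); a sketch on a small, hypothetical fuel column:

```python
import pandas as pd

# Illustrative Fuel_Type values (counts are hypothetical)
fuel = pd.Series(["Petrol"] * 5 + ["Diesel"] * 2 + ["CNG"])

counts = fuel.value_counts()
print(counts)   # Petrol 5, Diesel 2, CNG 1
```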
2.6 Encoding
To use the categorical data in our model for prediction of the car price, we had to transform it into numerical data.
So, we convert Petrol to 0, Diesel to 1 and CNG to 2 in our Fuel_Type column.
In Seller_Type column, we convert Dealer to 0 and Individual to 1.
In Transmission column, we convert Manual to 0 and Automatic to 1.
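The mapping described above can be sketched with pandas' replace (the three sample rows are made up for illustration; only the mapping itself is from the report):

```python
import pandas as pd

df = pd.DataFrame({
    "Fuel_Type":    ["Petrol", "Diesel", "CNG"],
    "Seller_Type":  ["Dealer", "Individual", "Dealer"],
    "Transmission": ["Manual", "Automatic", "Manual"],
})

# The same encoding as in the report
df = df.replace({
    "Fuel_Type":    {"Petrol": 0, "Diesel": 1, "CNG": 2},
    "Seller_Type":  {"Dealer": 0, "Individual": 1},
    "Transmission": {"Manual": 0, "Automatic": 1},
})

print(df)
```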
Figure 2.17 – Encoding of categorical columns
After encoding, we visualize the first five rows of our dataset again.
After encoding and visualization, our data was ready for training and testing.
Let us now store the features and the target value in two separate variables.
X contains all columns except Car_Name and Selling_Price.
We drop the Car_Name column because it is not a useful feature for predicting the car price.
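The feature/target split can be sketched like this (the two sample rows and their values are illustrative):

```python
import pandas as pd

car_dataset = pd.DataFrame({
    "Car_Name":      ["ritz", "sx4"],
    "Selling_Price": [3.35, 4.75],
    "Year":          [2014, 2013],
    "Kms_Driven":    [27000, 43000],
})

# Features: every column except the car name and the target
X = car_dataset.drop(["Car_Name", "Selling_Price"], axis=1)
Y = car_dataset["Selling_Price"]   # target value

print(list(X.columns))   # ['Year', 'Kms_Driven']
```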
2.7.1 Train-Test Split Evaluation
The train-test split is a technique for evaluating the performance of a machine learning
algorithm.
It can be used for classification or regression problems and can be used for any supervised
learning algorithm.
The procedure involves taking a dataset and dividing it into two subsets. The first subset is used
to fit the model and is referred to as the training dataset. The second subset is not used to train
the model; instead, the input element of the dataset is provided to the model, then predictions
are made and compared to the expected values. This second dataset is referred to as the test
dataset.
This is how we expect to use the model in practice. Namely, to fit it on available data with
known inputs and outputs, then make predictions on new examples in the future where we do
not have the expected output or target values.
The train-test procedure is appropriate when there is a sufficiently large dataset available.
In our project we use 10% of the data for testing and the remaining 90% for training.
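A sketch of the 90/10 split using sklearn's train_test_split (the toy arrays and the random_state value are illustrative, not taken from the project code):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# 20 illustrative samples with 2 features each
X = np.arange(40).reshape(20, 2)
Y = np.arange(20)

# test_size=0.1 -> 10% of rows for testing, 90% for training
X_train, X_test, Y_train, Y_test = train_test_split(
    X, Y, test_size=0.1, random_state=2)

print(len(X_train), len(X_test))   # 18 2
```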
After train_test_split, we apply machine learning algorithms to fit our model. We apply three algorithms, i.e.:
1. Linear Regression
2. Random Forest Regression
3. Decision Tree Regression
Linear regression performs the task of predicting a dependent variable value (y) based on a given independent variable (x). So, this regression technique finds a linear relationship between x (input) and y (output); hence the name Linear Regression.
Then we fit the model with the X_train and Y_train data.
For model evaluation, we use the predict method to do the prediction on the X_test data.
For calculating the error, we use the R-squared error method.
R-squared is a statistical measure that represents the goodness of fit of a regression model.
The ideal value for r-square is 1. The closer the value of r-square to 1, the better is the model
fitted.
R-squared is a comparison of the residual sum of squares (SSres) with the total sum of squares (SStot):
R² = 1 - (SSres / SStot)
where the total sum of squares is calculated by summing the squares of the distances between the data points and the line of the mean value.
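The fit/predict/score steps described above can be sketched with sklearn (the exactly-linear toy data, y = 2x + 1, is chosen so the fit is perfect; it is not the project's data):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

# Illustrative data that is exactly linear: y = 2x + 1
X = np.array([[1], [2], [3], [4], [5]])
y = np.array([3, 5, 7, 9, 11])

model = LinearRegression()
model.fit(X, y)             # fit on training data
pred = model.predict(X)     # predict

r2 = r2_score(y, pred)
print(round(r2, 4))         # 1.0 for a perfect fit
```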
Figure 2.23 – Random Forest diagram
Figure 2.24 - Importing RandomForestRegressor from the ensemble module and creating an object of it
We calculate the Mean Squared Error and Root Mean Squared Error for checking the accuracy of our model.
The Mean Squared Error (MSE) or Mean Squared Deviation (MSD) of an estimator measures the average of the squared errors, i.e. the average squared difference between the estimated values and the true values:
MSE = (1/N) * Σ (y(i) - ŷ(i))²
where N is the number of data points, y(i) is the i-th measurement, and ŷ(i) is its corresponding prediction.
In the above figure the root mean squared error is very low (0.54), so the model is well fitted; the lower the RMSE, the better a given model is able to fit a dataset.
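The MSE and RMSE formulas can be sketched directly in NumPy (the true values and predictions below are made up; each prediction is off by 0.5):

```python
import numpy as np

# Illustrative true values and model predictions
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.5, 5.5, 6.5, 9.5])

# MSE: average of squared differences; RMSE is its square root
mse = np.mean((y_true - y_pred) ** 2)
rmse = np.sqrt(mse)

print(mse, rmse)   # 0.25 0.5
```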
Decision Tree is the most powerful and popular tool for classification and prediction. A
Decision tree is a flowchart-like tree structure, where each internal node denotes a test on an
attribute, each branch represents an outcome of the test, and each leaf node (terminal node)
holds a class label.
Figure 2.25 – Decision tree diagram
Figure 2.26 - Importing DecisionTreeRegressor from the tree module and creating an object of it
We fit the regressor2 object with the X_train and Y_train data and do the prediction on the X_test data.
We calculate the Mean Squared Error and Root Mean Squared Error between the actual Y_test values and the predicted y values for checking the accuracy of our model.
We observe that the root mean squared error of the decision tree regressor (0.86) is higher than that of the random forest regressor model.
So, we decided to apply Random Forest in our model for better accuracy. We use the built-in RandomForestRegressor() class from sklearn to build our regression model.
The following code helps us save our model using the Pickle module. Our ML model is saved
as “model.pkl”. We will later use this file to predict the output when new input data is provided
from our web-app.
We will initialize our app and then load the “model.pkl” file to the app.
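The save/load cycle with Pickle can be sketched as follows (a plain dict stands in for the fitted model here; the real project pickles the trained RandomForestRegressor object):

```python
import pickle

# Stand-in for the fitted model object
model = {"name": "RandomForestRegressor", "n_estimators": 100}

# Save the trained model, as the report does with "model.pkl"
with open("model.pkl", "wb") as f:
    pickle.dump(model, f)

# Later (e.g. in app.py) load it back
with open("model.pkl", "rb") as f:
    loaded = pickle.load(f)

print(loaded == model)   # True
```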
2.3. Define the app route for the default page of the web-app:
Routes refer to URL patterns of an app (such as myapp.com/home or myapp.com/about). @app.route("/") is a Python decorator that Flask provides to assign URLs in our app to functions easily.
The decorator tells our app that whenever a user visits our app domain (localhost:5000 for local servers) at the given .route(), it should execute the home() function. Flask uses the Jinja template library to render templates. In our application, we will use templates to render HTML which will display in the browser.
2.4. Redirecting the API to predict the Car price :
We create a new app route ('/predict') that reads the input from our 'index.html' form and, on clicking the predict button, outputs the result using render_template.
Figure 2.30 creating predict function to use the predict button in our web-app
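Putting the two routes together, the app might be sketched as follows. This is only a sketch: the form field name and the stand-in prediction formula are hypothetical; the real app loads model.pkl and calls model.predict, and renders templates/index.html instead of inline HTML.

```python
from flask import Flask, request

app = Flask(__name__)

@app.route("/")
def home():
    # The real app renders templates/index.html with the input form
    return "<form action='/predict' method='post'>...</form>"

@app.route("/predict", methods=["POST"])
def predict():
    # "Kms_Driven" is a hypothetical field name, and the formula
    # below is a stand-in for the pickled model's predict call.
    kms = float(request.form["Kms_Driven"])
    price = 5.0 - kms / 100000.0
    return f"Predicted price: {price:.2f} lakh"
```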
Figure 2.31 Index.html file
CHAPTER 3: RESULT AND DISCUSSION
The project is saved in a folder called "myflask". We first run the 'mml.py' file to get our ML model and then we run 'app.py'. On running this file, our application is hosted on the local server at port 5000.
You can simply type "localhost:5000" in your web browser to open your web application after running 'app.py'.
• car data.csv — This is the dataset we used
• mml.py — This is our machine learning code
• model.pkl — This is the file we obtain after we run the mml.py file. It is present in the same directory
• app.py — This is the Flask application we created above
• templates — This folder contains our ‘index.html’ file. This is mandatory in Flask while
rendering templates. All HTML files are placed under this folder.
• static — This folder contains the “css” folder. The static folder in Flask application is
meant to hold the CSS and JavaScript files.
It is always a good idea to run your application first in the local server and check the
functionality of your application before hosting it online in a cloud platform. Let’s see
what happens when we run ‘app.py’ :
Figure 3. 3 chrome web page showing the result
Now, let’s enter the required values and click on the “Predict” button and see what happens.
Figure 3. 4 Entering the required values
Observe the URL (127.0.0.1:5000/predict), this is the use of app routes. On clicking the
“Predict” button we are taken to the predict page where the predict function renders the
‘index.html’ page with the output of our application.
CHAPTER 4 : CONCLUSION AND FUTURE SCOPE
4.1 Conclusion
Here, I have come to the end of the project on car price prediction using machine learning.
The purpose of this project is to enhance our technical skills.
I would like to share my experience while doing this project. I learnt many new things about the different libraries of Python, like pandas, NumPy and sklearn. We also came to know how to visualize a dataset and how to handle null values, duplicate values and categorical data, as well as how to implement various machine learning algorithms, fit these models and, after that, predict the price of a car. I also learnt how to use Flask, a Python web framework that allows us to build web applications. Thus, it was a wonderful learning experience for me while working on this project.
This project has developed my thinking skills and increased my interest in the field of data science and machine learning with Python. It gave me real insight into the world of data science.
A very special thanks to our HOD sir for giving us two months to work on our technical
skills.
Thank You
4.2 Future Scope
Scope of Python: Python is considered one of the most promising languages for a career in the technology industry, and career opportunities in Python are increasing tremendously around the world. Since Python has simple code and is quick to read, significant companies are in demand for Python developers. Python is an excellent tool for designing progressive ideas, and the number of candidates interested in Python increases every day.
Today, companies both in India and abroad are on the lookout for skilled Python developers, and knowing the Python language gives a competitive advantage compared to other languages. Indian IT companies created around 2 lakh Python-related jobs in 2018 and are still expecting more Python developers. The Python language is becoming ever more popular since it is used in upcoming technologies such as artificial intelligence and machine learning.
The scope of Python is also growing in the fields of data science and data analysis, with Python job roles offering highly promising pay in large companies.
• Research Analyst
• DevOps Engineer
• Python developer
• Data Analyst
• Software developer
The scope of Machine Learning is not limited to the investment sector. Rather, it is expanding
across all fields such as banking and finance, information technology, media & entertainment,
gaming, and the automotive industry.
The scope of Machine Learning in India, as well as in other parts of the world, is high in
comparison to other career fields when it comes to job opportunities. According to Gartner,
there will be 2.3 million jobs in the field of Artificial Intelligence and Machine Learning by
2022. Also, the salary of a Machine Learning Engineer is much higher than the salaries
offered to other job profiles.
REFERENCES
Online Sources
Books
[2] Al Sweigart, Automate the Boring Stuff with Python, 2nd Edition.