Data Analysis
Standard deviation: a number that describes how spread out the values are
o Ex: 0.9 means that most of the numbers are within a range of 0.9 from the mean value
o A higher standard deviation indicates that the values are spread out over a wider
range
x = numpy.std(speed) (see the combined example after this list)
Variance: another number that indicates how spread out the values are. The square root of the
variance is the standard deviation.
o x = numpy.var(speed)
Percentiles: a percentile gives the value that a given percent of the values
are lower than.
o x = numpy.percentile(ages, 75) gives the age that 75% of the people are younger than
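A minimal runnable sketch of all three NumPy calls, using made-up speed and age lists:
import numpy
speed = [86, 87, 88, 86, 87, 85, 86]   # hypothetical speeds
ages = [5, 31, 43, 48, 50, 41, 7, 11, 15, 39, 80, 82, 32, 2, 8, 6, 25, 36, 27, 61, 31]
print(numpy.std(speed))              # standard deviation
print(numpy.var(speed))              # variance (standard deviation squared)
print(numpy.percentile(ages, 75))    # value that 75% of the ages are below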
Data distribution
Creating a random data set
250 random floats between 0 and 5
x = numpy.random.uniform(0.0, 5.0, 250)
import numpy
import matplotlib.pyplot as plt
x = numpy.random.uniform(0.0, 5.0, 250)
plt.hist(x, 5)   # 5 bars
plt.show()
Normal data distribution, also known as the bell curve
x = numpy.random.normal(5.0, 1.0, 100000)   # mean 5.0, standard deviation 1.0, 100000 values
plt.hist(x, 100)
plt.show()
Scatter plot: a diagram where each value in the data set is represented by a dot.
o Uses Matplotlib with two arrays of equal length, one for the x-axis and one for the y-axis
x = [4,5,6]
y = [100,99,101]
plt.scatter(x,y)
plt.show()
Regression: the term regression is used when you try to find the relationship between
variables; that relationship can then be used to predict the outcome of future events.
The slope and intercept come from fitting a line to the data; scipy.stats.linregress is
one way to obtain them:
import matplotlib.pyplot as plt
from scipy import stats
x = [1,2,3]
y = [100,90,99]
# fit a straight line to the points
slope, intercept, r, p, std_err = stats.linregress(x, y)
def myfunc(x):
    return slope*x + intercept
mymodel = list(map(myfunc, x))
plt.scatter(x, y)
plt.plot(x, mymodel)
plt.show()
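Once the line is fitted, the same myfunc can be used to predict a future value; the input
2.5 below is just an illustrative x value:
print(myfunc(2.5))   # predicted y for x = 2.5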
R-SQUARED
o Used to measure the relationship between x and y; if there is no
relationship, polynomial regression cannot be used to predict
anything.
o The r-squared value ranges from 0 to 1, where 0 means no
relationship and 1 means 100% related.
o The r-squared score is a good indicator of how well the
data set fits the model.
import numpy
from sklearn.metrics import r2_score
x = [1,2,3]
y = [100,90,80]
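The snippet above stops before computing the score; a minimal way to finish it, assuming a
simple degree-1 polynomial fit with numpy (the choice of fit is an assumption, not stated in
these notes):
import numpy
from sklearn.metrics import r2_score
x = [1,2,3]
y = [100,90,80]
# fit a straight line (degree-1 polynomial) and score how well it explains y
mymodel = numpy.poly1d(numpy.polyfit(x, y, 1))
print(r2_score(y, mymodel(x)))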
Ex: predict the CO2 emission of a car based on engine size and weight
import pandas as pd: pandas allows us to read CSV files and returns a data
frame object
df = pd.read_csv('data.csv')
Then make a list of the independent values and call this variable X; put the
dependent values in a variable called y.
X = df[['weight', 'volume']]
y = df['CO2']
from sklearn import linear_model
The LinearRegression object has a method called fit() that takes the independent and dependent
values as parameters and fills the regression object with data that describes the
relationship.
reg = linear_model.LinearRegression()
reg.fit(X, y)
predictedCO2 = reg.predict([[2300, 1300]])
print(predictedCO2)
Result :
[107.2087328]
We have predicted that a car with a 1.3-liter engine and a weight of 2300 kg
will release approximately 107 grams of CO2 for every kilometer it drives.
Scale: when data has different values and even different measurement units, it can be difficult
to compare them, e.g. kg to meters, or altitude to time.
We can scale data into new values that are easier to compare.
The sklearn.preprocessing module has a StandardScaler() class, which returns a scaler object with
methods for transforming data sets.
import pandas
from sklearn import linear_model
from sklearn.preprocessing import StandardScaler
scale = StandardScaler()
df = pandas.read_csv("cars.csv")
X = df[['Weight', 'Volume']]
y = df['CO2']
scaledX = scale.fit_transform(X.values)
regr = linear_model.LinearRegression()
regr.fit(scaledX, y)
# predict the CO2 emission from a 1.3 liter car that weighs 2300 kilograms
scaled = scale.transform([[2300, 1.3]])
predictedCO2 = regr.predict([scaled[0]])
print(predictedCO2)
Train/test
Used to measure the accuracy of your model.
You split the data set into two sets: a training set and a testing set.
Ex: 80% training and 20% testing
You train the model using the training set, and you test the model using the testing set.
Training the model means creating the model; testing the model means testing the accuracy of the model.
train_x = x[:80]
train_y = y[:80]
test_x = x[80:]
test_y = y[80:]
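A self-contained sketch of the split, assuming a made-up data set of 100 random points (the
data below is illustrative only):
import numpy
numpy.random.seed(2)
x = numpy.random.normal(3, 1, 100)
y = numpy.random.normal(150, 40, 100) / x
# first 80 points for training, last 20 for testing
train_x = x[:80]
train_y = y[:80]
test_x = x[80:]
test_y = y[80:]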
Decision Tree
A flow chart that can help you make decisions based on previous experience.
import pandas
df = pandas.read_csv("data_comedy.csv")
print(df)
d = {'YES': 1, 'NO': 0}
df['Go'] = df['Go'].map(d)
print(df)
Then we have to separate the feature columns from the target column:
the features are the columns that we try to predict from, and the target column is the column with
the values we try to predict.
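A minimal sketch of that split plus a fitted tree, assuming data_comedy.csv has feature
columns named 'Age' and 'Experience' alongside the 'Go' target (these column names are an
assumption):
from sklearn.tree import DecisionTreeClassifier
features = ['Age', 'Experience']   # assumed feature column names
X = df[features]
y = df['Go']
dtree = DecisionTreeClassifier()
dtree.fit(X, y)
print(dtree.predict([[40, 10]]))   # hypothetical comedian: 40 years old, 10 years of experience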