Data Analysis



Defining the question - identify

I. Collecting data - collect
II. Cleaning data - clean
III. Manipulating data - manipulate
IV. Analyzing data - analyze
V. Visualizing data - visualize

1- Collecting: Gather data, which may be quantitative (numeric) or qualitative (descriptive).
2- Cleaning: Remove errors, duplicates and wrong rows, handle missing data by filling gaps, and fix structure, format and typos.
3- Manipulating: Organize the data: apply logic to create the required features and variables, and extract data to generate a new, derived data set.
4- Analyzing: Apply techniques, algorithms and logical methods, e.g. customer segmentation or comparing business growth. Types of data analysis:
a. Descriptive: what is happening.
b. Diagnostic: why it is happening.
c. Predictive: what is likely to happen, e.g. regression.
d. Prescriptive: what is the best action. It recommends how to optimize the business, e.g. increasing salaries or giving a discount.
5- Visualizing: Show the analyzed data in a visual or graphical form for easy interpretation.
Data types in data analysis:

• Qualitative (categorical values)
  • Nominal: labels for naming variables, e.g. colors, tastes, languages, gender, smoker or non-smoker, blood type
  • Ordinal: values with a natural order on a scale, e.g. sizes, letter grades, time of day (morning, …)

• Quantitative (numerical values)
  • Discrete: counted, finite, non-negative, e.g. dice 1 -> 6, number of …
  • Continuous: measured values that can take any number within a range
    o Interval: equal steps between values but no true zero, e.g. temperature in °C
    o Ratio: like interval but with a true zero, e.g. height, weight, age
import numpy as np # For Linear Algebra
import pandas as pd # For data manipulation
import seaborn as sns # For data visualization
import matplotlib.pyplot as plt # For data visualization
MACHINE LEARNING: ANALYSING DATA AND PREDICTING THE OUTCOME

The snippets below assume import numpy and a list of numbers called speed (or ages).

• Mean: the average value. x = numpy.mean(speed)
• Median: the midpoint value of the sorted data. x = numpy.median(speed)
• Mode: the most common value. x = stats.mode(speed)   # requires: from scipy import stats
• Standard deviation: a number that describes how spread out the values are.
  o Ex: a standard deviation of 0.9 means that most of the values are within 0.9 of the mean.
  o A higher standard deviation indicates that the values are spread out over a wider range.
  o x = numpy.std(speed)
• Variance: another number that indicates how spread out the values are. The square root of the variance is the standard deviation.
  o x = numpy.var(speed)
• Percentiles: a percentile is the value below which a given percentage of the values fall.
  o x = numpy.percentile(ages, 75)   # 75% of the people are …. or younger
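A minimal sketch of the measures above, using a made-up list of speeds (the values are hypothetical and chosen only for illustration). To keep the example dependent on NumPy alone, the mode is taken with Python's built-in statistics module rather than scipy.stats.mode.

```python
import statistics
import numpy as np

# Hypothetical speeds, chosen only for illustration
speed = [86, 87, 88, 86, 87, 85, 86]

mean = np.mean(speed)            # average value
median = np.median(speed)        # midpoint of the sorted values
mode = statistics.mode(speed)    # most common value
std = np.std(speed)              # population standard deviation
var = np.var(speed)              # population variance (std squared)
p75 = np.percentile(speed, 75)   # 75% of the values are at or below this

print(mean, median, mode, std, var, p75)
```

Note that numpy.std and numpy.var divide by n (population formulas) by default; pass ddof=1 if a sample estimate is wanted.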

• Data distribution
  Creating a random data set: 250 random floats between 0 and 5
  x = numpy.random.uniform(0.0, 5.0, 250)

• Histogram: used to visualize a data set, here with the matplotlib library

import numpy
import matplotlib.pyplot as plt
x = numpy.random.uniform(0.0, 5.0, 250)
plt.hist(x, 5)   # 5 bars
plt.show()
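The numbers behind such a histogram can be inspected without drawing it. The sketch below is one way to do that with numpy.histogram; the seeded generator and the fixed bin range of 0 to 5 are assumptions made here so the run is repeatable (plt.hist would by default take the bin range from the data itself).

```python
import numpy as np

rng = np.random.default_rng(seed=0)    # seeded so the run is repeatable
x = rng.uniform(0.0, 5.0, 250)         # 250 random floats in [0, 5)

# Count how many values fall into each of 5 equal-width bins over [0, 5]
counts, edges = np.histogram(x, bins=5, range=(0.0, 5.0))

for left, right, n in zip(edges[:-1], edges[1:], counts):
    print(f"{left:.1f} to {right:.1f}: {n}")
```

With uniform data each bin should hold roughly the same number of values, which is why the bars of the histogram have similar heights.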
• Normal data distribution, also known as the bell curve
x = numpy.random.normal(5.0, 1.0, 100000)
plt.hist(x, 100)
plt.show()

Here 5.0 is the mean and 1.0 is the standard deviation.
The values are concentrated around 5.0: about 68% of them lie within 1.0 (one standard deviation) of the mean, and values become rarer the further they are from it.
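How strongly the values cluster around the mean can be checked numerically. This is a small sketch with a seeded generator (the seed is an arbitrary choice made here); for a normal distribution roughly 68% of the values should fall within one standard deviation of the mean and roughly 95% within two.

```python
import numpy as np

rng = np.random.default_rng(seed=0)
x = rng.normal(5.0, 1.0, 100000)   # mean 5.0, standard deviation 1.0

# Fraction of values within 1 and within 2 standard deviations of the mean
within_1 = np.mean(np.abs(x - 5.0) <= 1.0)
within_2 = np.mean(np.abs(x - 5.0) <= 2.0)

print(round(x.mean(), 2), round(within_1, 2), round(within_2, 2))
```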

• Scatter plot: a diagram where each value in the data set is represented by a dot.
  o Matplotlib needs two arrays of the same length, one for the x-axis and one for the y-axis.

import matplotlib.pyplot as plt

x = [4, 5, 6]
y = [100, 99, 101]

plt.scatter(x, y)
plt.show()

• Regression: the term regression is used when you try to find the relationship between variables. That relationship is then used to predict the outcome of future events.

• Linear regression: uses the relationship between the data points to draw the straight line that fits them best. This line can be used to predict future values.

import matplotlib.pyplot as plt
from scipy import stats

x = [5, 7, 8]
y = [99, 86, 87]

# linregress returns some important key values of linear regression
slope, intercept, r, p, std_err = stats.linregress(x, y)

# Uses the slope and intercept to return a new value: the position on the
# y-axis where the corresponding x value is placed on the line
def myfunc(x):
    return slope * x + intercept

mymodel = list(map(myfunc, x))

plt.scatter(x, y)      # draw the original scatter plot
plt.plot(x, mymodel)   # draw the line of linear regression
plt.show()             # display the diagram
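The fitted line can also be used directly to predict a y value for an x that is not in the data. The sketch below reuses the three points from the example but swaps scipy's linregress for numpy.polyfit with degree 1, which gives the same least-squares slope and intercept; the new x value of 10 is a hypothetical input.

```python
import numpy as np

x = [5, 7, 8]
y = [99, 86, 87]

# Degree-1 least-squares fit; coefficients come back highest power first
slope, intercept = np.polyfit(x, y, 1)

def predict(new_x):
    """Return the y value on the fitted line for new_x."""
    return slope * new_x + intercept

print(slope, intercept)
print(predict(10))   # hypothetical new x value
```

A negative slope here means y tends to fall as x grows. With only three points the prediction is illustrative at best.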
• Polynomial regression:
  o If your data points clearly do not fit a linear regression (a straight line through the data points), polynomial regression might be a better choice.
  o Like linear regression, it uses the relationship between the variables x and y to find the best way to draw a line through the data points. Here the line is a curve.

import numpy as np
import matplotlib.pyplot as plt

x = [1, 2, 3]
y = [100, 90, 99]

# Make a polynomial model. Degree 2 is used here because three points
# cannot determine the four coefficients of a degree-3 polynomial.
mymodel = np.poly1d(np.polyfit(x, y, 2))
myline = np.linspace(1, 3, 100)

plt.scatter(x, y)
plt.plot(myline, mymodel(myline))
plt.show()
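To see what polyfit actually returns, it helps to fit data whose formula is known in advance. In this sketch the points are generated from y = 2x^2 - 3x + 1 (a made-up curve, not data from these notes), so the fitted coefficients should come back as approximately 2, -3 and 1.

```python
import numpy as np

# Hypothetical data generated from y = 2x^2 - 3x + 1, so the fit is known
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0, 5.0])
y = 2 * x**2 - 3 * x + 1

coeffs = np.polyfit(x, y, 2)   # highest power first
model = np.poly1d(coeffs)      # callable polynomial built from the coefficients

print(coeffs)
print(model(6.0))              # prediction just outside the data range
```

Real data contains noise, so the coefficients will not be this clean, and predictions far outside the observed x range should be treated with caution.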
• R-squared
  o Measures how well the relationship between x and y is captured by the model. If there is no relationship, the polynomial regression cannot be used to predict anything.
  o The r-squared value ranges from 0 to 1, where 0 means no relationship and 1 means 100% related.
  o The r-squared score is a good indicator of how well the model fits the data set.

import numpy
from sklearn.metrics import r2_score

x = [1, 2, 3]
y = [100, 90, 80]

# Degree 2: three points are not enough to fit a degree-3 polynomial
mymodel = numpy.poly1d(numpy.polyfit(x, y, 2))

print(r2_score(y, mymodel(x)))
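The score itself is simple to compute by hand: r-squared is 1 minus the ratio of the unexplained variation to the total variation. The sketch below applies that formula with NumPy only, on a small made-up data set and a straight-line fit; it is meant to show what r2_score measures, not to replace it.

```python
import numpy as np

# Hypothetical data, chosen only for illustration
x = np.array([1, 2, 3, 4, 5])
y = np.array([2, 4, 5, 4, 5])

model = np.poly1d(np.polyfit(x, y, 1))   # straight-line fit
pred = model(x)

ss_res = np.sum((y - pred) ** 2)         # variation the model does not explain
ss_tot = np.sum((y - y.mean()) ** 2)     # total variation around the mean
r2 = 1 - ss_res / ss_tot

print(r2)
```

For this data the line explains 60% of the variation in y, so the score is 0.6.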
• Multiple regression: like linear regression but with more than one independent value, meaning that we try to predict a value based on two or more variables.

Ex: predict the CO2 emission of a car based on engine size and weight.

import pandas as pd   # pandas lets us read CSV files and returns a DataFrame object

df = pd.read_csv('data.csv')

Then make a list of the independent values and call this variable X. Put the dependent values in a variable called y.

X = df[['weight', 'volume']]
y = df['CO2']

From the sklearn module we use the LinearRegression() method to create a linear regression object. This object has a method called fit() that takes the independent and dependent values as parameters and fills the regression object with data that describes the relationship.

from sklearn import linear_model

reg = linear_model.LinearRegression()
reg.fit(X, y)

Now we have a regression object that is ready to predict CO2 values based on a car's weight and volume.

# predict the CO2 emission of a car where the weight is 2300 kg and the volume is 1300 cm3
predictedCO2 = reg.predict([[2300, 1300]])

print(predictedCO2)

Result:
[107.2087328]

We have predicted that a car with a 1.3-liter engine and a weight of 2300 kg will release approximately 107 grams of CO2 for every kilometer it drives.
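What the regression object learns is one coefficient per independent variable plus an intercept. The sketch below shows this with NumPy's least-squares solver instead of sklearn's LinearRegression; the five cars and the coefficients used to build the target are invented here so that the result can be checked, and are not taken from the CSV file in the notes.

```python
import numpy as np

# Hypothetical cars: columns are weight (kg) and engine volume (cm3)
X = np.array([[1000.0, 1000.0],
              [1200.0, 1500.0],
              [1500.0, 1600.0],
              [1700.0, 2000.0],
              [2000.0, 2200.0]])
# Made-up target built from known coefficients so the fit can be verified
y = 0.01 * X[:, 0] + 0.02 * X[:, 1] + 70.0

# Add a column of ones so the model can learn an intercept
A = np.column_stack([X, np.ones(len(X))])
coef, *_ = np.linalg.lstsq(A, y, rcond=None)
w_weight, w_volume, intercept = coef

new_car = np.array([2300.0, 1300.0, 1.0])   # weight, volume, constant 1
print(coef)
print(new_car @ coef)                       # predicted value for the new car
```

Each coefficient says how much the prediction changes when that variable increases by one unit while the other stays fixed.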
• Scale: when the data has different value ranges and even different measurement units, the values can be difficult to compare, e.g. kilograms with meters or altitude with time.

We can scale the data into new values that are easier to compare.
The sklearn module has a class called StandardScaler() which returns a scaler object with methods for transforming data sets.

import pandas
from sklearn import linear_model
from sklearn.preprocessing import StandardScaler

scale = StandardScaler()

df = pandas.read_csv("cars.csv")

X = df[['Weight', 'Volume']]
y = df['CO2']

scaledX = scale.fit_transform(X.values)

regr = linear_model.LinearRegression()
regr.fit(scaledX, y)

scaled = scale.transform([[2300, 1.3]])

# predict the CO2 emission of a 1.3-liter car that weighs 2300 kilograms
predictedCO2 = regr.predict([scaled[0]])

print(predictedCO2)
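Standardization replaces each value with z = (x - mean) / std, computed per column. The sketch below does this by hand with NumPy on a few made-up rows, which is the same formula StandardScaler applies (it uses the population standard deviation); the rows are hypothetical and not read from cars.csv.

```python
import numpy as np

# Hypothetical rows of [weight in kg, volume in liters]
X = np.array([[790.0, 1.0],
              [1160.0, 1.2],
              [929.0, 1.0],
              [1365.0, 1.6]])

mean = X.mean(axis=0)        # per-column mean
std = X.std(axis=0)          # per-column population standard deviation
scaled = (X - mean) / std    # z = (x - mean) / std

# A new row must be scaled with the SAME mean and std as the training data
new_row = (np.array([2300.0, 1.3]) - mean) / std

print(scaled)
print(new_row)
```

After scaling, every column has mean 0 and standard deviation 1, so weight and volume become directly comparable.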

• Train/test
Used to measure the accuracy of your model.
You split the data set into two sets: a training set and a testing set, e.g. 80% for training and 20% for testing.
You train the model on the training set and test it on the testing set.
Training the model means creating the model. Testing the model means measuring its accuracy.

# assuming x and y each hold 100 data points:
# the first 80 are used for training and the last 20 for testing
train_x = x[:80]
train_y = y[:80]

test_x = x[80:]
test_y = y[80:]
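Slicing the first 80 rows only gives a fair split if the data is in random order. A common precaution, sketched below with NumPy, is to shuffle the row indices first and compute the split point from the data length; the data set and the seed here are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(seed=2)
x = rng.normal(3.0, 1.0, 100)                # hypothetical feature values
y = 2.0 * x + rng.normal(0.0, 0.5, 100)      # hypothetical target values

# Shuffle the row indices so the split does not depend on the original order
order = rng.permutation(len(x))
split = int(0.8 * len(x))                    # 80% for training

train_idx, test_idx = order[:split], order[split:]
train_x, train_y = x[train_idx], y[train_idx]
test_x, test_y = x[test_idx], y[test_idx]

print(len(train_x), len(test_x))
```

Indexing x and y with the same index arrays keeps each x value paired with its y value.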
• Decision tree
A flow chart that can help you make decisions based on previous experience.

import pandas
df = pandas.read_csv("data_comedy.csv")
print(df)

To make a decision tree, all data has to be numerical, so we have to convert the non-numerical columns to numerical values.
pandas has a map() method that takes a dictionary with information on how to convert the values, e.g. {'UK': 0, 'USA': 1, 'N': 2}.

# change string values into numerical values
d = {'UK': 0, 'USA': 1, 'N': 2}
df['Nationality'] = df['Nationality'].map(d)

d = {'YES': 1, 'NO': 0}
df['Go'] = df['Go'].map(d)
print(df)
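The dictionary-based conversion can be tried without a CSV file. The sketch below mirrors the idea in plain Python on three invented rows; it is an illustration of the lookup, not pandas' own implementation. One difference worth knowing: pandas' map() produces NaN for a value that is missing from the dictionary, while dict.get returns None.

```python
# Hypothetical rows; the column names follow the notes above
rows = [
    {"Nationality": "UK", "Go": "NO"},
    {"Nationality": "USA", "Go": "YES"},
    {"Nationality": "N", "Go": "YES"},
]

nationality_codes = {"UK": 0, "USA": 1, "N": 2}
go_codes = {"YES": 1, "NO": 0}

encoded = [
    {
        # dict.get returns None for a value that has no code
        "Nationality": nationality_codes.get(row["Nationality"]),
        "Go": go_codes.get(row["Go"]),
    }
    for row in rows
]

print(encoded)
```

Checking for unmapped values after the conversion is a good habit, because a classifier cannot be trained on missing entries.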

Then we have to separate the feature columns from the target column.
The features are the columns that we predict from, and the target column is the column with the values we try to predict.

X is the feature columns, y is the target column.

features = ['Age', 'Experience', 'Rank', 'Nationality']
X = df[features]
y = df['Go']
print(X)
print(y)

To create and draw the tree:

from sklearn import tree
from sklearn.tree import DecisionTreeClassifier
import matplotlib.pyplot as plt

dtree = DecisionTreeClassifier()
dtree = dtree.fit(X, y)

tree.plot_tree(dtree, feature_names=features)
plt.show()

To predict new values (one row with Age, Experience, Rank, Nationality):

print(dtree.predict([[40, 10, 7, 1]]))
