Session22 To 24 PYTHON COLAB


Session 22 - 24

Google CoLAB & PYTHON


Google Colab
Why Google Colab ?

 Google Colab comes with collaboration baked into the product.


 It also runs on Google servers
 Don’t need to install anything
 Notebooks are saved to your Google Drive account
 Google Colab recently added support for Tensor Processing Unit
( TPU ) apart from its existing GPU and CPU instances. So, it’s
a big deal for all deep learning people
Hardware accelerator
Creating Folder on Google Drive
Creating New Colab
Notebook
Create a new notebook via Right click > More > Colaboratory
Notebook Settings

Edit > Notebook settings or Runtime>Change runtime type and select GPU as Hardware accelerator
Open python notebook in different ways
Bar Graph
# BAR GRAPH
import matplotlib.pyplot as plt

x = ['A', 'B', 'C', 'D', 'E']
y = [22, 9, 40, 27, 55]

plt.bar(x, y)
plt.show()
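Prime Numbers from n1 to n2
n1 = int(input("Enter limit for Prime numbers From :"))
n2 = int(input("Enter limit for Prime numbers upto :"))
print("Prime Numbers between", n1, "and", n2, "are :")
for n in range(n1, n2 + 1):          # include n2 itself
    if n < 2:                        # 0, 1 and negative numbers are not prime
        continue
    divisors = 0
    for i in range(1, n + 1):
        if n % i == 0:
            divisors += 1
    if divisors == 2:                # a prime has exactly two divisors: 1 and itself
        print(n)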
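The divisor-counting loop above checks every number below n. A faster variant only needs to try divisors up to the square root of n. The following is a minimal sketch under that assumption; the helper name is_prime and the fixed range 10 to 30 are illustrative and not part of the original slides.

```python
import math

def is_prime(n):
    """Return True if n is prime, using trial division up to sqrt(n)."""
    if n < 2:
        return False
    for i in range(2, math.isqrt(n) + 1):
        if n % i == 0:
            return False
    return True

# Primes from 10 to 30, both ends included
primes = [n for n in range(10, 31) if is_prime(n)]
print(primes)
```

math.isqrt needs Python 3.8 or later, which current Colab runtimes provide.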
Cricket Data Analysis
# Reading an excel file using Python
import xlrd
# Give the location of the file
loc = ("Cricket_data.xls")
# To open Workbook
wb = xlrd.open_workbook(loc)
sheet = wb.sheet_by_index(0)
m1=[]
m2=[]
a=[]
b=[]
c=[]
#data reading from excel to temporary memory
for i in range(1,sheet.nrows,+1):
    k0=(sheet.cell_value(i,0))
    m1.append(k0)
for i in range(1,sheet.ncols,+1):
    k1=(sheet.cell_value(0,i))
    m2.append(k1)

# Question 1 - Highest Run and Players Name
for i in range(1,sheet.nrows,+1):
    k2=int(sheet.cell_value(i,3))
    a.append(k2)
largest2=max(a)
for j in range (0,len(a),+1):
    if (largest2==a[j]):
        t2=j
print("Highest run scorer is ", m1[t2], "and score is ", largest2 )

#Question 2-Highest Wicket And Players Name
for i in range(1,sheet.nrows,+1):
    k3=int(sheet.cell_value(i,4))
    b.append(k3)
largest2=max(b)
for j in range (0,len(b),+1):
    if (largest2==b[j]):
        t2=j
print("Maximum Wicket Taker is ", m1[t2], "and wicket taken is", largest2 )
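The "find the maximum, then find its position" pattern used in both questions can be written once as a function. This is only a sketch: the player names and numbers below are made-up stand-ins for the spreadsheet columns, and top_performer is a hypothetical helper name.

```python
# Hypothetical sample data standing in for the spreadsheet columns
players = ["Player A", "Player B", "Player C", "Player D"]
runs = [450, 612, 389, 612]
wickets = [12, 3, 21, 7]

def top_performer(names, scores):
    """Return (name, score) for the first player with the highest score."""
    best_index = max(range(len(scores)), key=lambda j: scores[j])
    return names[best_index], scores[best_index]

name, score = top_performer(players, runs)
print("Highest run scorer is", name, "and score is", score)
name, score = top_performer(players, wickets)
print("Maximum wicket taker is", name, "and wickets taken is", score)
```

Note one difference on ties: this sketch reports the first player with the top score, while the loop on the slide keeps the last one.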
ABA Review
PYTHON
Session - 23

For Self Reading


https://www.w3schools.com/python/
List of Chapters
•Chapter 1: Basics
•Chapter 2: Conditionals
•Chapter 3: Functions
•Chapter 4: Iteration
•Chapter 5: Strings
•Chapter 6: Collection Data Types
•Chapter 7: Advanced Functions
•Chapter 8: Exception Handling
•Chapter 9: Python Modules
•Chapter 10: Files
•Chapter 11: Documentation
Why Python?
 High-level language, can do a lot with relatively little
code
 Fairly popular among high-level languages

 Robust support for object-oriented programming

 Support for integration with other languages


Chapter 1: Basics
 Running python programs
 Variables
 Printing
 Operators
 Input
 Comments
 Scope
Hello, world!

Let's get started! Here's an example of Python statements
typed at the interactive prompt:

>>> print("Hello, world")
Hello, world
>>> x = 2
>>> print('hello:', x, x**2, x**3)
hello: 2 4 8


Easy examples
 Example basics.py
print(1+3)
pi = 3.1415926
print(pi)
message = "Hello, world"
print(message)

Output:
4
3.1415926
Hello, world
Variable types - Examples
 Example types.py:
pi = 3.1415926
message = "Hello, world"
i = 2+2

print(type(pi))
print(type(message))
print(type(i))

Output:
<class 'float'>
<class 'str'>
<class 'int'>
Operators
 + addition
 - subtraction
 * multiplication
 / division
 // floor (integer) division
 ** exponentiation
 % modulus (remainder after division)
 Comparison operators
Operators
 Example operators.py
print(2*2)
print(2**3)
print(10%3)
print(1.0/2.0)
print(1/2)
print(1//2)

Output:
4
8
1
0.5
0.5
0
 Note: in Python 3, / always performs floating point division;
use // for integer (floor) division, as in the last line
Type conversion

int(), float(), str(), and bool() convert to integer, floating point,
string, and boolean (True or False) types, respectively

Example typeconv.py:              Output:
print(1.0/2.0)                    0.5
print(1//2)                       0
print(float(1)/float(2))          0.5
print(int(3.1415926))             3
print(str(3.1415926))             3.1415926
print(bool(1))                    True
print(bool(0))                    False
Chapter 2: Conditionals

True and False booleans

Comparison and Logical Operators

if, elif, and else statements
Comparison operators

== : is equal to?

!= : not equal to

> : greater than

< : less than

>= : greater than or equal to

<= : less than or equal to

is : do two references refer to the same object?
(See Chapter 6)
Logical operators

and, or, not

>>> 2+2==5 or 1+1==2


True
>>> 2+2==5 and 1+1==2
False
>>> not(2+2==5) and 1+1==2
True

Note: We do NOT use &&, ||, !, as in C!
If statements

Example ifelse.py

if (1+1==2):
    print("1+1==2")
    print("I always thought so!")
else:
    print("My understanding of math must be faulty!")


Simple one-line if:
if (1+1==2): print("I can add!")
elif statement

Equivalent of “else if” in C

Example elif.py:
x=3
if (x == 1):
    print("one")
elif (x == 2):
    print("two")
else:
    print("many")
Chapter 3: Functions

Defining functions

Return values

Local variables

Built-in functions

Functions of functions

Passing lists, dictionaries, and keywords to functions
Functions

Define them in the file above the point they're used

Body of the function should be indented consistently
(4 spaces is typical in Python)

Example: square.py
def square(n):
    return n*n

print("The square of 3 is", end=" ")
print(square(3))

Output:
The square of 3 is 9
Function variables are local

Variables declared in a function do not exist outside that
function

Example square2.py
def square(n):
    m = n*n
    return m

print("The square of 3 is", end=" ")
print(square(3))
print(m)

Output:
The square of 3 is 9
File "./square2.py", line 9, in <module>
print(m)
NameError: name 'm' is not defined
Scope

Variables assigned within a function are local to that
function call

Variables assigned at the top of a module are global to
that module; there's only “global” within a module
 Within a function, Python will try to match a variable
name to one assigned locally within the function; if that
fails, it will try within enclosing function-defining (def)
statements (if appropriate); if that fails, it will try to
resolve the name in the global scope (but the variable
must be declared global for the function to be able to
change it). If none of these match, Python will look
through the list of built-in names
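The lookup order described above (local, then enclosing functions, then global, then built-in) can be seen in a few lines. This is an illustrative sketch; all names in it are invented for the example.

```python
counter = 0                # global to this module

def read_global():
    return counter + 1     # reading a global needs no declaration

def change_global():
    global counter         # required before assigning to the global name
    counter += 1

def outer():
    label = "enclosing"    # local to outer, visible inside inner
    def inner():
        # label is found in the enclosing function; str and len are built-ins
        return label + " scope, " + str(len(label))
    return inner()

change_global()
print(counter)
print(read_global())
print(outer())
```

Without the global statement, the assignment inside change_global would raise UnboundLocalError, because Python would treat counter as a local variable.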
Multiple return values
 Can return multiple values by packaging them into a
tuple

def onetwothree(x):
    return x*1, x*2, x*3

print(onetwothree(3))

Output:
(3, 6, 9)
Built-in Functions

Several useful built-in functions. Example math.py

print(pow(2,3))
print(abs(-14))
print(max(1,-5,3,0))

Output:
8
14
3
Chapter 4: Iteration

while loops

for loops

range function

Flow control within loops: break, continue, pass, and
the “loop else”
while

Example while.py
i=1
while i < 4:
    print(i)
    i += 1
Output:
1
2
3
for

Example for.py
for i in range(3):
    print(i, end=" ")
output:
0 1 2

range(n) yields the integers from 0 to n-1.
range(0,10,2) yields 0, 2, 4, 6, 8
Flow control within loops

General structure of a loop:
while <statement> (or for <item> in <object>):
    <statements within loop>
    if <test1>: break       # exit loop now
    if <test2>: continue    # go to top of loop now
    if <test3>: pass        # does nothing!
else:
    <other statements>      # if exited loop without
                            # hitting a break
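A concrete instance of that structure may make the "loop else" easier to follow. In this sketch the function name first_even and the test data are assumptions chosen for illustration.

```python
def first_even(numbers):
    """Return the first non-negative even number, or None if there is none."""
    for n in numbers:
        if n < 0:
            continue          # skip negatives, go to top of loop
        if n % 2 == 0:
            found = n
            break             # exit loop now; the else block is skipped
    else:
        found = None          # runs only when the loop ended without a break
    return found

print(first_even([3, -4, 8, 10]))
print(first_even([1, 3, 5]))
```

The else block also runs when the sequence is empty, since no break is ever reached.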
Parallel traversals
 If we want to go through 2 lists (more later) in
parallel, can use zip:
A = [1, 2, 3]
B = [4, 5, 6]
for (a,b) in zip(A,B):
    print(a, "*", b, "=", a*b)

output:
1 * 4 = 4
2 * 5 = 10
3 * 6 = 18
Chapter 5: Strings

String basics

Escape sequences

Slices

Block quotes

Formatting

String methods
String basics

Strings can be delimited by single or double quotes

Python uses Unicode, so strings are not limited to ASCII
characters

An empty string is denoted by having nothing between string
delimiters (e.g., '')

Can access elements of strings with [], with indexing starting
from zero:
>>> "snakes"[3]
'k'

Note: can't go other way --- can't set “snakes”[3] = 'p' to
change a string; strings are immutable

a[-1] gets the last element of string a (negative indices work
through the string backwards from the end)

Strings like a = r'c:\home\temp.dat' (starting with an r
character before delimiters) are “raw” strings (interpret
literally)
More string basics

Type conversion:
>>> int("42")
42
>>> str(20.4)
'20.4'

Compare strings with the is-equal operator, == (it compares
contents, as with std::string in C++):
>>> a = "hello"
>>> b = "hello"
>>> a == b
True

>>> location = "Chattanooga " + "Tennessee"
>>> location
'Chattanooga Tennessee'
String methods

Strings are objects with many built-in methods.
Methods that create new strings need their result
assigned (since strings are immutable, they cannot
be changed in-place).

S.capitalize()

S.center(width)

S.count(substring [, start-idx [, end-idx]])

S.find(substring [, start [, end]])

S.isalpha(), S.isdigit(), S.islower(), S.isspace(),
S.isupper()

S.join(sequence)

And many more!
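A few of the methods listed above, applied to one sample string. The string and the arguments are arbitrary choices for illustration.

```python
s = "hello world"

print(s.capitalize())             # first letter upper-cased, new string returned
print(s.center(15, "*"))          # padded to width 15 with '*'
print(s.count("o"))               # number of non-overlapping occurrences
print(s.find("world"))            # index of first occurrence, -1 if absent
print(s.isalpha())                # False here because of the space
print("-".join(["a", "b", "c"]))  # the separator string joins the sequence
print(s)                          # the original string is unchanged
```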
replace method

Doesn't really replace (strings are immutable) but
makes a new string with the replacement performed:
>>> a = "abcdefg"
>>> b = a.replace('c', 'C')
>>> b
'abCdefg'
>>> a
'abcdefg'
Regular Expressions
• Regular expressions are a way to do pattern-matching.
The basic concept (and most of the syntax of the
actual regular expression) is the same as in Java or
Perl; in Python they are provided by the re module
Regular Expression Syntax
• Common regular expression syntax:
. Matches any char but newline (by default)
^ Matches the start of a string
$ Matches the end of a string
* Any number of what comes before this
+ One or more of what comes before this
| Or
\w Any alphanumeric character or underscore
\d Any digit
\s Any whitespace character
(Note: \W matches NON-alphanumeric, \D NON digits, etc)
[aeiou] matches any of a, e, i, o, u
junk Matches the string 'junk'
Match Object Functions
• re.search() and re.match() return a match object
(or None if nothing matches). This object has some useful functions:
group(): return the matched string
start(): starting position of the match
end(): ending position of the match
span(): tuple containing the (start,end) positions of
the match
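The match object functions can be tried on a short string. A minimal sketch using the standard re module; the sample text is invented for the example.

```python
import re

text = "Order 66 shipped on day 12"

m = re.search(r"\d+", text)       # first run of digits anywhere in the string
if m:
    print(m.group(), m.start(), m.end(), m.span())

start_only = re.match(r"\d+", text)   # match() only tries the start of the string
print(start_only)

vowels = re.findall(r"[aeiou]", "junk mail")   # every match, as a list of strings
print(vowels)
```

Because search() and match() return None when nothing matches, the result should be checked before calling group().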
Chapter 6: Collection Data Types

Tuples

Lists

Dictionaries
Tuples

Tuples are a collection of data items. They may be of
different types. Tuples are immutable like strings.
Lists are like tuples but are mutable.
>>> "Tony", "Pat", "Stewart"
('Tony', 'Pat', 'Stewart')
Python displays tuples in (); we can also write them with (),
and if we have only one item, we need a trailing
comma to indicate it's a tuple: ("Tony",).

An empty tuple is denoted by ()

Need to enclose tuple in () if we want to pass it all
together as one argument to a function
Lists

Like tuples, but mutable, and designated by square
brackets instead of parentheses:
>>> [1, 3, 5, 7, 11]
[1, 3, 5, 7, 11]
>>> [0, 1, 'boom']
[0, 1, 'boom']

An empty list is []

Append an item:
>>> x = [1, 2, 3]
>>> x.append("done")
>>> print(x)
[1, 2, 3, 'done']
Lists and Tuples Contain Object
References

Lists and tuples contain object references. Since lists
and tuples are also objects, they can be nested
>>> a=[0,1,2]
>>> b=[a,3,4]
>>> print(b)
[[0, 1, 2], 3, 4]
>>> print(b[0][1])
1
>>> print(b[1][0])
... TypeError: 'int' object is not subscriptable
Dictionaries

Collections where items are accessed by a key, not by
position (since Python 3.7 they keep insertion order,
but they are still not indexed by position)

Like a hash in Perl

Collection of arbitrary objects; use object references
like lists

Nestable

Can grow and shrink in place like lists

Concatenation, slicing, and other operations that
depend on the order of elements do not work on
dictionaries
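The properties listed above (key access, growing and shrinking in place, nesting) in a short sketch. The names and numbers are made up for illustration.

```python
runs = {"Tony": 42, "Pat": 17}      # hypothetical player -> runs mapping

runs["Stewart"] = 63                # grows in place
del runs["Pat"]                     # shrinks in place
runs["Tony"] += 8                   # values are reached by key, not position

squad = {"batting": runs, "captain": "Tony"}   # dictionaries can be nested

print(runs["Tony"])
print(squad["batting"]["Stewart"])
print(sorted(runs))                 # sorted list of the keys
print(runs.get("Pat", "not found")) # get() avoids a KeyError for missing keys
```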
The “is” operator

Python “variables” are really object references. The
“is” operator checks to see if these references refer
to the same object (note: could have two identical
objects which are not the same object...)

References to small integer constants are usually identical
(an implementation detail that should not be relied on).
References to strings may or may not show up as
referring to the same object. Two identical, mutable
objects are not necessarily the same object
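The difference between equal objects and the same object is easiest to see with lists. A small sketch with arbitrary values:

```python
a = [0, 1, 2]
b = [0, 1, 2]     # equal contents, but a separate object
c = a             # a second reference to the same object as a

print(a == b)     # equality of contents
print(a is b)     # identity: two different objects
print(a is c)     # identity: same object

c.append(3)       # changing the object through c is visible through a
print(a)
print(b)
```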
“in” operator

For collection data types, the “in” operator
determines whether something is a member of the
collection (and “not in” tests if not a member):
>>> team = ("David", "Robert", "Paul")
>>> "Howell" in team
False
>>> "Stewart" not in team
True
Chapter 7: Advanced Functions
 Passing lists and keyword dictionaries to functions
 Lambda functions
 apply() (Python 2 only; in Python 3 call f(*args, **kwargs) directly)
 map()
 filter()
 reduce() (functools.reduce in Python 3)
 List comprehensions
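The topics in this chapter list fit into one short sketch, written for Python 3. The function describe and all data values are hypothetical names chosen for the example.

```python
from functools import reduce

def describe(name, *scores, **options):
    """Accept extra positional values as a tuple and keywords as a dictionary."""
    unit = options.get("unit", "runs")
    return f"{name}: {sum(scores)} {unit}"

numbers = [1, 2, 3, 4, 5]

square = lambda n: n * n                               # lambda function
squares = list(map(square, numbers))                   # map()
evens = list(filter(lambda n: n % 2 == 0, numbers))    # filter()
total = reduce(lambda x, y: x + y, numbers)            # reduce()
odd_squares = [n * n for n in numbers if n % 2 == 1]   # list comprehension

args = ("Tony", 10, 20)
kwargs = {"unit": "points"}
summary = describe(*args, **kwargs)    # unpacking takes the place of apply()

print(squares, evens, total, odd_squares)
print(summary)
```

In Python 3, map() and filter() return lazy iterators, which is why the sketch wraps them in list().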
Chapter 8: Exception Handling

Basics of exception handling
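As a starting point for this chapter, here is a minimal sketch of try, except, else and finally. The function name safe_divide is an assumption made for the example.

```python
def safe_divide(a, b):
    """Divide a by b, returning None instead of raising on division by zero."""
    try:
        result = a / b
    except ZeroDivisionError:
        print("Cannot divide by zero")
        return None
    else:
        return result                    # runs only if no exception was raised
    finally:
        print("safe_divide finished")    # always runs, with or without an error

print(safe_divide(10, 4))
print(safe_divide(1, 0))
```

Catching a specific exception class, rather than using a bare except, lets unexpected errors still surface.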
Chapter 9: Python Modules

Basics of modules

Import and from … import statements

Changing data in modules

Reloading modules

Module packages

__name__ and __main__

Import as statement
Module basics
 Each file in Python is considered a module. Everything within
the file is encapsulated within a namespace (which is the
name of the file)
 To access code in another module (file), import that file, and
then access the functions or data of that module by prefixing
with the name of the module, followed by a period
 To import a module:
import sys
(note: no file suffix)
 Can import user-defined modules or some “standard” modules
like sys and random
 Any python program needs one “top level” file which imports
any other needed modules
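The three import forms named in the chapter outline, shown with standard library modules. Which modules to import is an arbitrary choice for this sketch.

```python
import math                   # access names with the module prefix: math.sqrt
import statistics as st       # "import as" gives the module a shorter name
from random import choice     # "from ... import" brings one name in directly

root = math.sqrt(16)
average = st.mean([22, 9, 40, 27, 55])
pick = choice(["A", "B", "C"])

print(root)
print(average)
print(pick)
```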
Python standard library
 There are over 200 modules in the Standard Library
 Consult the Python Library Reference Manual,
included in the Python installation and/or available
at http://www.python.org
What import does

An import statement does three things:
- Finds the file for the given module
- Compiles it to bytecode
- Runs the module's code to build any objects (top-level
code, e.g., variable initialization)

The module name is only a simple name; Python uses a
module search path to find it. It will search: (a) the
directory of the top-level file, (b) directories in the
environment variable PYTHONPATH, (c) standard
directories, and (d) directories listed in any .pth files (one
directory per line in a plain text file); the path can be listed
by printing sys.path
The sys module

Printing the command-line arguments,
print-argv.pl
import sys
cmd_options = sys.argv
i=0
for cmd in cmd_options:
    print("Argument", i, "=", cmd)
    i += 1
output:
localhost(Chapter8)% ./print-argv.pl test1 test2
Argument 0 = ./print-argv.pl
Argument 1 = test1
Argument 2 = test2
The random module

import random
guess = random.randint(1,100)
print(guess)
dinner = random.choice(["meatloaf", "pizza",
"chicken pot pie"])
print(dinner)
Chapter 10: Files

Basic file operations
Opening a file

open(filename, mode)
where filename is a Python string, and mode is a
Python string, 'r' for reading, 'w' for writing, or 'a' for
append
Basic file operations

Basic operations:

outfile = open('output.dat', 'w')


infile = open('input.dat', 'r')
Basic file operations

Basic operations
outfile = open('output.dat', 'w')
infile = open('input.dat', 'r')
A = infile.read()         # read whole file into string
A = infile.read(N)        # read up to N characters
A = infile.readline()     # read next line
A = infile.readlines()    # read file into list of strings
outfile.write(A)          # Write string A into file
outfile.writelines(A)     # Write list of strings
outfile.close()           # Close a file
infile.close()
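In current Python the same operations are usually wrapped in a with statement, which closes the file automatically. The sketch below writes to a scratch file in a temporary directory so that it can run anywhere; the file name and contents are assumptions for illustration.

```python
import os
import tempfile

# Hypothetical scratch file, created in a fresh temporary directory
path = os.path.join(tempfile.mkdtemp(), "output.dat")

with open(path, "w") as outfile:          # 'w' creates or overwrites the file
    outfile.write("first line\n")
    outfile.writelines(["second line\n", "third line\n"])

with open(path, "a") as outfile:          # 'a' appends to the end
    outfile.write("appended line\n")

with open(path, "r") as infile:           # 'r' opens for reading
    first = infile.readline()             # next line, including the newline
    rest = infile.readlines()             # remaining lines as a list of strings

print(repr(first))
print(rest)
```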
Redirecting stdout
 Print statements normally go to stdout (“standard
output,” i.e., the screen)
 stdout can be redirected to a file:

import sys
sys.stdout = open('output.txt', 'w')
print(message) # will show up in output.txt

 Alternatively, to print just some stuff to a file:


print(message, file=logfile) # if logfile is an open file object
Chapter 11: Documentation

Comments

dir

Documentation strings
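The three items on this slide can be seen together in a few lines. The function square is reused from Chapter 3; the docstring text is an assumption written for this sketch.

```python
def square(n):
    """Return n multiplied by itself."""
    # A comment is for people reading the source; Python ignores it
    return n * n

print(square.__doc__)            # the documentation string is stored on the function
print("upper" in dir(str))       # dir() lists the names an object provides
print(len(dir(square)) > 0)
```

In a notebook, help(square) shows the same docstring in a formatted form.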
PYTHON EXAMPLES
• 1. Bike Sharing
• 2. Twitter Sentiment Analysis

https://www.onlinegdb.com/online_python_compiler
Exercise 1 - Bike Sharing (1) - Startup
Example
• Bike sharing systems are a new generation of traditional bike rentals where the
whole process from membership, rental and return back has become automatic.
Through these systems, the user is able to easily rent a bike from a particular
position and return it at another position. Currently, there are over 500
bike-sharing programs around the world which are composed of over 500
thousand bicycles. Today, there exists great interest in these systems due to their
important role in traffic, environmental and health issues.
• Apart from interesting real-world applications of bike sharing systems, the
characteristics of data being generated by these systems make them attractive for
the research. Opposed to other transport services such as bus or subway, the
duration of travel, departure, and arrival position is explicitly recorded in these
systems. This feature turns the bike sharing system into a virtual sensor network
that can be used for sensing mobility in the city. Hence, it is expected that most of
the important events in the city could be detected via monitoring these data.
Example 1 – Bike Sharing (2)
• Overview & Python Coding
• https://medium.com/@wilamelima/analysing-bike-sharing-trends-with-python-a9f574c596b9

• Data Set Description


• https://archive.ics.uci.edu/ml/datasets/Bike+Sharing+Dataset

• PYTHON online interpreter (another & easy one)


• https://repl.it/languages/python3
Exercise 2 – Twitter Sentiment Analysis
For Beginners :
https://towardsdatascience.com/a-beginners-guide-to-sentiment-analysis-in-python
95e354ea84f6

For Advanced Users:


https://www.datacamp.com/community/tutorials/simplifying-sentiment-analysis-python
Applied Business Analytics
Review
Example 1 – Soft Drink Preferences
• A group of 25 people was surveyed to find their soft drink
preferences (1 – Pepsi, 2 – Bovonto, 3 – Coke, 4 – Limca)
• R Coding – Input from Key Board
• # Sample data - 3 4 1 1 3 4 3 3 1 3 2 1 2 1 2 3 2 3 1 1 1 1 4 3 1

soft<-scan()

barplot(table(soft),xlab="Soft Drink",ylab="Frequency of Preferences", main="Soft Drink Preferences",col="white")

barplot(table(soft)/length(soft),xlab="Soft Drink",ylab="Freq. of Preferences",main=" Soft Drink Preferences ",col="gray70")

09/05/2024 ABA 78
Example 2 – Students enrolled in a University 2015-2019

• R Coding

nos <-
c(2810,890,540,3542,1363,471,4301,1663,652,5362,2071,895,6593,2752,1113)

year.stud <- matrix(nos,byrow=T,ncol=3) # Vector to Matrix

rownames(year.stud) <- c("2015","2016","2017","2018","2019")

colnames(year.stud) <- c("Commerce","Science","Humanity")

barplot(t(year.stud),main="Students Enrolment in AU",beside=F)

Example 3 – Correlation of Asian Paints closing price with other variables
• DATA SET : Asian paints.csv
cp<-read.csv(file.choose())
summary(cp)
cor(cp$close, cp$OPEN)
cor(cp$close, cp$HIGH)
cor(cp$close, cp$LOW)
cor(cp$close, cp$vwap)
par(mfrow=c(2,2))
plot(cp$close, cp$OPEN)
plot(cp$close, cp$HIGH)
plot(cp$close, cp$LOW)
plot(cp$close, cp$vwap)

plot(cp)

• Which variables are having highest positive correlation? Why?
• Any negative correlation between the variables? Why?
• Which variables are having lowest positive correlation? Why?
Example 4 : Correlation of POC between ITES (INFY &
TCS)
• CSV files : INFY_2122_FY_POC.csv & TCS_2122_FY_POC

INFY=read.csv(file.choose())
TCS=read.csv(file.choose())
cor(INFY$closePOC, TCS$closePOC)
par(mfrow=c(2,2))
boxplot(INFY$closePOC, main="INFOSYS")
boxplot(TCS$closePOC, main="TCS")
hist(INFY$closePOC)
hist(TCS$closePOC)

• What is the correlation value of TCSPOC & INFYPOC?
• Who is performing better (based on one year data)?
• What are the outliers communicating?
• Do the boxplot and the histogram show the same information?
Example 5 – Wilcox Test

# Data in two numeric vectors


women_weight <- c(38.9, 61.2, 73.3, 21.8, 63.4, 64.6, 48.4, 48.8, 48.5)
men_weight <- c(67.8, 60, 63.4, 76, 89.4, 73.3, 67.3, 61.3, 62.4)
# Create a data frame
my_data <- data.frame(group = rep(c("Woman", "Man"), each = 9), weight =
c(women_weight, men_weight))
#Question : Is there any significant difference between women and men weights?
# Compute two-samples Wilcoxon test
res <- wilcox.test(weight ~ group, data = my_data, exact = FALSE)
print(res)

# INFERENCE The p-value of the test is 0.02712, which is less than the significance level (0.05).
# We can conclude that men’s median weight is significantly different from women’s median weight
# if you want to test whether the median men’s weight is less than the median women’s weight,
wilcox.test(weight ~ group, data = my_data, exact = FALSE, alternative = "less")
#Or, if you want to test whether the median men’s weight is greater than the median women’s weight,
wilcox.test(weight ~ group, data = my_data, exact = FALSE, alternative =
"greater")
boxplot(men_weight,women_weight, xlab = "Gender", ylab="Weight", names=c("Men","Women"))

Example 6 - Linear Modeling
• df1<-read.csv(file.choose())
• #Visualization
• boxplot(df1[3:9])
• boxplot(df1[11:13])
• boxplot(df1$POC1)
• #Simple Linear Regression, DV: Closing Price, IV: Opening Price
• reg1=lm(df1$close~df1$OPEN)
• summary(reg1)
• reg2=lm(df1$close~df1$LOW)
• summary(reg2)
• reg3=lm(df1$close~df1$HIGH)
• summary(reg3)
• reg4=lm(df1$close~df1$ltp)
• summary(reg4)
• reg5=lm(df1$close~df1$vwap)
• summary(reg5)
Example 7 Regression using R

CSV - tcsnifty50POC_1year

data1=read.csv(file.choose())

reg1=lm(data1$tcsPOC~data1$NIFTYPOC)

summary(reg1)

Example 8 - Sales and Advertisement
• We’ll use the marketing data set [datarium package], which contains the impact of the
amount of money spent on three advertising media (youtube, facebook and
newspaper) on sales.

• sales = b0 + b1*youtube + b2*facebook + b3*newspaper

• install.packages("datarium")
• library(datarium)

• data("marketing", package = "datarium")


• head(marketing, 4)
Some concepts in R for Module II – (1)
Calculating Probability for Continuous distributions
Continuous Distribution   R name    Parameters
Beta                      beta      shape1, shape2
Cauchy                    cauchy    location, scale
Chi-Squared               chisq     df
Exponential               exp       rate
F                         f         df1, df2
Gamma                     gamma     shape, rate (or scale)
Log-Normal                lnorm     meanlog, sdlog
Logistic                  logis     location, scale
Normal                    norm      mean, sd
Student’s t               t         df
Uniform                   unif      min, max
Weibull                   weibull   shape, scale
Wilcoxon rank sum         wilcox    m, n
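For comparison with the R names in this table, Python's standard library offers the normal distribution through statistics.NormalDist. The sketch below assumes a normal variable with mean 100 and sd 15, numbers chosen only for illustration; cdf plays the role of R's pnorm and inv_cdf the role of qnorm.

```python
from statistics import NormalDist

iq = NormalDist(mu=100, sigma=15)     # hypothetical normal variable

p_below_115 = iq.cdf(115)             # P(X <= 115)
p_between = iq.cdf(115) - iq.cdf(85)  # P(85 <= X <= 115), within one sd of the mean
q95 = iq.inv_cdf(0.95)                # value below which 95% of the distribution lies

print(round(p_below_115, 4))
print(round(p_between, 4))
print(round(q95, 2))
```

Other distributions in the table are not in the standard library; a package such as SciPy would be needed for those.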
Some concepts in R for Module II - (2)
Calculating Probability for Discrete distributions
Discrete Distribution   R name    Parameters
Binomial                binom     size (n), prob (p)
Geometric               geom      prob (p)
Hypergeometric          hyper     m, n, k
Negative Binomial       nbinom    size, prob (or mu)
Poisson                 pois      lambda (mean)
General Statistics (1)
• Null Hypothesis (Ho)
• Nothing Happened, The mean was unchanged, The treatment has no effect,
The model did not improve
• Alternate Hypothesis (Ha)
• Something Happened, The mean rose, The treatment improved the patient’s
health, The model fit better
• Assume Ho is TRUE
• T-statistics
• P-Value
• Small (P<α ) – Strong evidence against Ho, i.e. Reject Ho
• not small (P >= α ) – Retain H0 (failing to reject Ho )
• Example : P< 0.05 –Reject Ho
• α = 0.05 corresponds to a 95% confidence level: (100-95)/100 = 5/100 = 0.05
• High Risk Applications
α = 0.01 corresponds to a 99% confidence level: (100-99)/100 = 1/100 = 0.01
α = 0.001 corresponds to a 99.9% confidence level: (100-99.9)/100 = 0.1/100 = 0.001
General Statistics (2)
Testing mean of the sample – (t-test – small sample, n<30)
• You have sample from a population, given this sample you
want to know if the mean of the population could reasonably
be “m”
• t.test is making inferences about a population mean from the
sample.

• t test - ask if the population means could be 95


• x<-rnorm(50,mean=100,sd=15)
• t.test(x,mu=95) #p value is X.XXXXXX < 0.05
• plot(x)
• P<α (small) and so it is unlikely (based on the sample data)
that 95 could be the mean of the population i.e. Reject Ho
• Run the above 3 lines multiple times & Check the answer
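The t statistic behind a one-sample t-test can be computed by hand: t = (sample mean - m) / (s / sqrt(n)). A minimal sketch using only the Python standard library follows; the sample values and the function name one_sample_t are assumptions for illustration, and no p-value is computed because that requires the t distribution.

```python
import math
import statistics

def one_sample_t(sample, mu0):
    """Return the one-sample t statistic for H0: population mean equals mu0."""
    n = len(sample)
    mean = statistics.mean(sample)
    sd = statistics.stdev(sample)        # sample standard deviation (n - 1 divisor)
    standard_error = sd / math.sqrt(n)
    return (mean - mu0) / standard_error

sample = [102, 98, 110, 95, 105, 99, 101, 108]   # hypothetical measurements
t = one_sample_t(sample, 95)
print(round(t, 3), "with", len(sample) - 1, "degrees of freedom")
```

A t statistic far from zero corresponds to a small p-value, which is what leads to rejecting Ho.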
General Statistics (3)
Testing for Normality
• You want a statistical test to determine whether your data
sample is from a normally distributed population.
• shapiro.test(x)
• plot(x)
• Shapiro-Wilk test:
• Null hypothesis: the data are normally distributed
• Alternative hypothesis: the data are not normally distributed
• P<α Reject NULL, indicates that the population is likely not
normally distributed.
• P> α large p-value suggests the underlying population could be
normally distributed.
General Statistics (4)
Confidence Interval for a Median - WILCOX
 The procedure for calculating the Confidence Interval for a mean is well-defined
and widely known. The same is not true for the median. The Wilcoxon signed rank
test is a pretty standard procedure for this.
Comparing the locations of two samples nonparametrically
• You want to know: Is one population shifted to the left/right compared with the
other?

• wilcox.test : tells us whether the central locations of the two populations are
significantly different, or equivalently whether their relative frequencies are
different

• Example : We randomly select a group of employees and ask each one to
complete the same task under 2 different circumstances (favourable and
unfavourable conditions). We measure their completion times and check whether
they are significantly different or not.
Example 9 – Wilcox Test
# Data in two numeric vectors
women_weight <- c(38.9, 61.2, 73.3, 21.8, 63.4, 64.6, 48.4, 48.8, 48.5)
men_weight <- c(67.8, 60, 63.4, 76, 89.4, 73.3, 67.3, 61.3, 62.4)
# Create a data frame
my_data <- data.frame(group = rep(c("Woman", "Man"), each = 9), weight =
c(women_weight, men_weight))
#Question : Is there any significant difference between women and men weights?
# Compute two-samples Wilcoxon test
res <- wilcox.test(weight ~ group, data = my_data, exact = FALSE)
print(res)

# INFERENCE The p-value of the test is 0.02712, which is less than the significance level (0.05).
# We can conclude that men’s median weight is significantly different from women’s median weight
# if you want to test whether the median men’s weight is less than the median women’s weight,
wilcox.test(weight ~ group, data = my_data, exact = FALSE, alternative = "less")
#Or, if you want to test whether the median men’s weight is greater than the median women’s weight,
wilcox.test(weight ~ group, data = my_data, exact = FALSE, alternative =
"greater")
boxplot(men_weight,women_weight, xlab = "Gender", ylab="Weight", names=c("Men","Women"))

Regression - Base Model 1 & 2
[Figure: two panels titled SAMPLE SKETCHES, each showing curves for b<0, b=0, 0<b<1, b=1 and b>1 over x from 0.50 to 3.00]
Statistical Inferences
• Statistical Inferences
• Intercept
• Beta
• P-value
*** - Statistically significant at 1 percent level
** - Statistically significant at 5 percent level
* - Statistically significant at 10 percent level
• R2 Value
The higher the R2, the better the model fit (0.70 and above is good;
0.50 to 0.70 is acceptable, depending on the application)
• Residual Error

Try it Exercise – Assignment 1
• Download Your Company & Sector
• Regression using POC (Minimum 3 SLR required)
• Plot all POCs
• Assignment content
• 1. Data set (% of change, cleaned)
• 2. LM Summary output
• 3. Linear Regression Equation(s)
• 4. Plot & Boxplot output
• 5. Inferences

Example 10 - MLR- HR

CSV File Name


Supervisor_Performance_data.csv

R Coding
Employee_Supervisor_Performance.r
Example 11 – All Tyre brands are equal
• Imagine that you are interested in understanding whether knowing the brand of car
tyre can help you predict whether you will get more or less mileage before you
need to replace them.
• We’ll draw what is hopefully a random sample of 60 tyres from four different
manufacturers and use the mean mileage by brand to help inform our thinking.
• While we expect variation across our sample, we’re interested in whether the
differences between the tyre brands (the groups) are significantly different from what
we would expect from random variation within the groups.
H0: μ1 = μ2 = μ3 = μ4
• Our research or testable hypothesis is
μApollo=μBridgestone=μCEAT=μFalken
Our null hypothesis is basically “Tyre brand doesn’t matter in predicting the
mileage – i.e. all tyre brands are the same”.
The alternate hypothesis is that at least one of the tyre brand populations is different
from the other three.

09/05/2024 Data Analytics using R 97


ANOVA – A Real Time R Example (1)
• Step 1 : Read the file - tyre.csv # file.choose()
• tyre<-read.csv(file.choose())
• summary(tyre)

• Step 2 : Dependent Variable is Mileage & Independent Variable is Brands
- Does Mileage depend on Brand?
• Step 3 : Graph - Mileage distribution of all brands of tyres
• boxplot(tyre$Mileage, horizontal = TRUE, main="Mileage distribution across all brands", col = "blue")
ANOVA – A Real Time R Example (2)
• Step 4 :#Groupwise Boxplotting
• boxplot(tyre$Mileage~tyre$Brands, main="Boxplot comparing Mileage of Four Brands of Tyre", col= rainbow(4), horizontal = TRUE)
• Step 5 : ONEWAY ANOVA Coding
• tyres.aov<- aov(Mileage~Brands, tyre)
• summary(tyres.aov)

• Step 6 : Inference
http://rpubs.com/ibecav/308410

ANOVA – A Real Time R Example (3)
Pairwise – t test
• Apollo VS Bridgestone
• Apollo VS CEAT
• Apollo VS Falken
• Bridgestone VS CEAT
• Bridgestone VS Falken and
• CEAT VS Falken.
• pairwise.t.test(tyre$Mileage,tyre$Brands,p.adjust.method = "none")
• What is your understanding on the results ?
• TukeyHSD(tyres.aov, conf.level = 0.95)
• TukeyHSD(tyres.aov, conf.level = 0.99)
• TukeyHSD(tyres.aov, conf.level = 0.90)
• NOTE : We can see the 6 pairings, The diff column is the difference between the
means of the two brands listed. So the mean for Bridgestone is 3,019 miles less than
Apollo. And Falken is 5845 miles more than Bridgestone etc.,



Example 12 – Detergent Sales, Discount and Location

• Location- KK Nagar & Anna Nagar


• Discount - No Discount, 10 % discount & 20 % discount
• Sales - in Kilograms

Assumption
NULL : Sales do not depend on Discount and Location.



Example 13 – POC is depending on Months

• Date
• Close
• No. of Trade
• ClosePOC
• NOTPOC
• Month

• Assumption – POC is not depending on Months



Example 14 – Learning & Employability

• CGPA
• GENDER
• COURSE
• CTC

• Assumption – CTC depending on CGPA, GENDER and Course



Example 15 – Case 1
data1=read.csv(file.choose())
#ONEWAY ANOVA Coding
a3<- aov(data1$HOEP~data1$Time)
summary(a3)
boxplot(data1$HOEP~data1$Time,xlab = "Hour",
ylab="HOEP", main="HOEP based on Time")


Example 16 – Alcohol consumption behaviour
E:\TSM_2022_2023\TSM_TEACHING\ABA\Session_PPTs\goggles.csv

Gender  Alcohol  Participant was
1       1        Male who consumed no alcohol
1       2        Male who consumed 2 pints
1       3        Male who consumed 4 pints
2       1        Female who consumed no alcohol
2       2        Female who consumed 2 pints
2       3        Female who consumed 4 pints



Equal opportunity in Employment
• Preemployment Test
• 1. Measure abilities that are directly related to the job under
consideration
• 2. Must not discriminate on the basis of race or national origin
• y – Job performance, x – Preemployment Test
• Model 1 (Pooled) :
• Model 2 (Minority):
• Model 2 (White) :



Employment on pretest



Inferences
• Model1 – Race distinction is ignored
• – is minimum required level of performance
• > for eligibility
• > for minority eligibility
• > for white eligibility
• If is – relaxation of pretest requirements for white
• If is – Tightening of pretest requirements for minorities



Adjusted R-Square
• When we add a new variable to a regression model, R-Square never
decreases, even if the variable explains very little of the dependent
variable. Adjusted R-Square penalizes every added variable, i.e. if the
adjusted R-Square decreases after the addition of a new variable, the
variable contributes too little to justify keeping it in the model.
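
The penalty can be seen from the standard formula: adjusted R-Square = 1 - (1 - R-Square)(n - 1)/(n - p - 1), where n is the number of observations and p the number of predictors. A small Python sketch follows; the R-Square values, n and p are made-up numbers for illustration.

```python
def adjusted_r_squared(r_squared, n_obs, n_predictors):
    """Adjusted R-Square for a model with n_predictors predictors plus an intercept."""
    return 1 - (1 - r_squared) * (n_obs - 1) / (n_obs - n_predictors - 1)

# Hypothetical case: a fourth predictor raises R-Square only from 0.800 to 0.801
before = adjusted_r_squared(0.800, n_obs=30, n_predictors=3)
after = adjusted_r_squared(0.801, n_obs=30, n_predictors=4)

print(f"adjusted R-Square with 3 predictors: {before:.4f}")
print(f"adjusted R-Square with 4 predictors: {after:.4f}")
print("new variable lowers adjusted R-Square:", after < before)
```

In this made-up case R-Square goes up slightly while adjusted R-Square goes down, which is the signal that the added variable is not pulling its weight.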



Why MANOVA?
• There may be circumstances in which we are interested in several
dependent variables, and in these cases the simple ANOVA model is
inadequate.

• Why don’t we conduct a separate ANOVA for each dependent variable?

• The reason is that the more tests we conduct on the same data, the more we
inflate the family-wise error rate (a greater chance of making a Type I
error).

• MANOVA, by including all dependent variables in the same analysis, can
capture the relationship between outcome variables. Related to this point,
ANOVA can tell us only whether groups differ along a single dimension,
whereas MANOVA has the power to detect whether groups differ along
a combination of dimensions.
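
The inflation of the family-wise error rate can be illustrated numerically. The sketch below assumes the tests are independent and each uses a 0.05 level; both are simplifying assumptions, and the function name is hypothetical.

```python
def family_wise_error_rate(alpha, n_tests):
    """Probability of at least one Type I error across n_tests independent tests."""
    return 1 - (1 - alpha) ** n_tests

alpha = 0.05  # assumed per-test significance level
for n_tests in (1, 2, 3, 5, 10):
    rate = family_wise_error_rate(alpha, n_tests)
    print(f"{n_tests:2d} tests -> family-wise error rate {rate:.3f}")
```

Even with only two dependent variables tested separately, the chance of at least one false positive is already close to 10% rather than 5%.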



Theory of MANOVA (1)
• In ANOVA, variances (systematic and unsystematic) are single
values. In MANOVA, these variances are contained in a matrix.

• Hypothesis SSCP (Sums of Squares and Cross Products): the matrix
that represents the systematic variance and is called hypothesis sum
of squares and cross-products matrix, denoted by H.

• Error SSCP: matrix that represents the unsystematic variance and is
called error sum of squares and cross-products matrix, denoted
by E.

• Total SSCP: total amount of variance, denoted by T.
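
The three matrices are linked by T = H + E, in the same way that the total sum of squares splits into model and residual parts in ANOVA. The Python sketch below builds all three for two dependent variables and checks the identity; the scores and the helper names sscp, column_means and add are hypothetical.

```python
from statistics import mean

def column_means(rows):
    """Mean of each of the two columns."""
    return [mean(r[0] for r in rows), mean(r[1] for r in rows)]

def sscp(rows, center):
    """2x2 matrix of sums of squares (diagonal) and cross-products (off-diagonal)."""
    m = [[0.0, 0.0], [0.0, 0.0]]
    for row in rows:
        d = [row[0] - center[0], row[1] - center[1]]
        for i in range(2):
            for j in range(2):
                m[i][j] += d[i] * d[j]
    return m

def add(a, b):
    return [[a[i][j] + b[i][j] for j in range(2)] for i in range(2)]

# Hypothetical (DV1, DV2) scores for three groups, for illustration only
groups = {
    "G1": [(5, 14), (5, 11), (4, 16)],
    "G2": [(4, 14), (4, 15), (1, 13)],
    "G3": [(4, 13), (5, 15), (5, 14)],
}
all_rows = [row for rows in groups.values() for row in rows]
grand = column_means(all_rows)

# T: every observation measured from the grand mean
T = sscp(all_rows, grand)

E = [[0.0, 0.0], [0.0, 0.0]]
H = [[0.0, 0.0], [0.0, 0.0]]
for rows in groups.values():
    group_mean = column_means(rows)
    # E: observations measured from their own group mean
    E = add(E, sscp(rows, group_mean))
    # H: group mean measured from the grand mean, once per observation in the group
    H = add(H, sscp([group_mean] * len(rows), grand))

print("T =", T)
print("H + E =", add(H, E))
```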



Theory of MANOVA (2)
• Pillai-Bartlett Trace (also known as Pillai’s trace)
• Pillai’s trace is used as a test statistic in MANOVA. It is a non-negative
statistic ranging from 0 to s, where s is the smaller of the number of
dependent variables and the hypothesis degrees of freedom (so from 0 to 1
when s = 1). Larger values mean that effects are contributing more to
the model; you should reject the null hypothesis for large values.

• Pillai’s trace is generally considered the most robust statistic
for general use, especially under departures from assumptions. Other
commonly used tests include Hotelling’s T², Roy’s largest root and
Wilks’s lambda.
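
Pillai’s trace can be written as V = trace(H (H + E)^-1), using the hypothesis and error SSCP matrices. The Python sketch below evaluates it for two dependent variables; the H and E entries are hypothetical numbers, not output from any data set in these slides.

```python
def inverse_2x2(m):
    """Inverse of a 2x2 matrix via the closed-form formula."""
    (a, b), (c, d) = m
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]

def matmul_2x2(x, y):
    return [[sum(x[i][k] * y[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]

def pillai_trace(h, e):
    """Pillai's trace V = trace(H (H + E)^-1) for 2x2 SSCP matrices."""
    total = [[h[i][j] + e[i][j] for j in range(2)] for i in range(2)]
    product = matmul_2x2(h, inverse_2x2(total))
    return product[0][0] + product[1][1]

# Hypothetical hypothesis and error SSCP matrices
H = [[10.0, -7.0], [-7.0, 19.0]]
E = [[50.0, 5.0], [5.0, 120.0]]

v = pillai_trace(H, E)
print(f"Pillai's trace V = {v:.4f}")
```

With two dependent variables and at least two hypothesis degrees of freedom, V lies between 0 and 2; a value of 0 would mean the groups explain none of the variation.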
Example 17 : OCD - Obsessive Compulsive Disorder

• The effects of cognitive behavior therapy (CBT) on obsessive
compulsive disorder (OCD).
• Two dependent variables (DV1 and DV2) are considered: the
occurrence of obsession-related behaviors (Actions) and the
occurrence of obsession-related cognitions (Thoughts).
• OCD sufferers are grouped into three conditions: with cognitive
behavior therapy (CBT), with behavior therapy (BT), and with
no treatment (NT).



MANOVA using R
• ocdData <- read.csv(file.choose())
• # rename level labels
• ocdData$Group<-factor(ocdData$Group, levels = c("CBT", "BT", "No Treatment Control"), labels = c("CBT", "BT", "NT"))
• str(ocdData)
• # bind the two dependent variables into one outcome matrix
• outcome <- cbind(ocdData$Actions, ocdData$Thoughts)
• boxplot(outcome~ocdData$Group, main="Boxplot of comparing OCD", col= rainbow(4), horizontal = TRUE, data=ocdData)
• # univariate ANOVA for each dependent variable
• ocdmodel1<-aov(outcome~Group,data=ocdData)
• summary(ocdmodel1)
• # MANOVA on both dependent variables (Pillai's trace is the default test)
• ocdModel2 <- manova(outcome ~ Group, data=ocdData)
• summary(ocdModel2, intercept=TRUE)
• # summary(ocdModel2, intercept=TRUE, test="Wilks")
• # summary(ocdModel2, intercept=TRUE, test="Hotelling")
• # summary(ocdModel2, intercept=TRUE, test="Roy")
AOV & MANOVA Output



Steps in MANOVA
1. Univariate ANOVA for DV1 (Actions)
2. Univariate ANOVA for DV2 (Thoughts)
3. The relationship between DVs: cross-products
4. The total SSCP matrix (T)
5. The residual (error) SSCP matrix (E)
6. The model (hypothesis) SSCP matrix (H)



MANOVA Result analysis
• For the Group variable, Pillai’s trace has a p value of 0.049, which
indicates a significant difference at the 5% level.
• ANOVA - The p values indicate that there was no significant
difference between therapy groups in terms of Thoughts (p=.136) or
Actions (p=.08).
• MANOVA - In contrast, the multivariate test indicates that therapy
had a significant impact on OCD.
• The reason for the apparent anomaly is simple: the MANOVA takes account
of the correlation between dependent variables, and so for these data it
has more power to detect group differences.
Try it exercise - MANOVA
• TCS four financial year equity data
• Date
• TCSCLOSEPOC
• TCSLTPPOC
• Month

• Assumption – (TCSCLOSEPOC+TCSLTPPOC) depends on Month



Assignment 2 - 10 marks
• Your own company (mid term question) using POC
• Assignment content
• 1. Equity Data set 2 years (cleaned)
• 2. COR, LM, GLM, ANOVA, MANOVA (you have the freedom to select
variables)
• 3. Output 1 - Plot, Boxplot, Histogram
• 4. Output 2 – Statistical summary
• 5. Inferences
• Last Date – 09/4/2023, 17.00

9 May 2024 V. Senthil 120


• Linear vs Logistic Regression – Read the Y-Axis
Example 18 - R Coding for Logistic Regression
#Logit Regression – Demo
#data set – admission_data - Admission (binary) depends on GRE, GPA & Rank of the college
df <- read.csv(file.choose())
summary(df)
#Logit Regression GLM (RANK treated as a numeric variable)
logit1 <- glm(admitted ~ GRE+GPA+RANK,data=df,family="binomial")
summary(logit1)
#Logit Regression GLM with RANK treated as a categorical variable
df$RANK<- as.factor(df$RANK)
logit2 <- glm(admitted ~ GRE+GPA+RANK,data=df,family="binomial")
summary(logit2)
#Prediction
#Suppose a student has a GRE score of 790, a GPA of 3.8 and
#studied at a rank-1 college.
#We want to predict the chance of that student being admitted.
x <- data.frame(GRE=790,GPA=3.8,RANK=as.factor(1))
p<- predict(logit2,x) # default type="link": prediction on the log-odds scale
print(p)
p<- predict(logit2,x,type="response") # predicted probability of admission
print(p)
• logreg1=glm(admitted~GRE+GPA+RANK, data=df, family="binomial")
Prediction
• Suppose a student has a GRE score of 790, a GPA of 3.8 and studied
at a rank-1 college. We want to predict the chance of that student
being admitted.
x <- data.frame(GRE=790,GPA=3.8,RANK=as.factor(1))
p<- predict(logit2,x,type="response")
print(p)

• The model predicts about a 70% chance that this student will be admitted.
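
The difference between the two predict() calls (with and without type="response") is the inverse logit transformation: the default output is on the log-odds scale, and the response scale converts it to a probability. A minimal Python sketch follows; the log-odds value 0.85 is a hypothetical number chosen for illustration, not the fitted value from admission_data.

```python
import math

def inverse_logit(log_odds):
    """Convert a value on the log-odds (link) scale to a probability."""
    return 1.0 / (1.0 + math.exp(-log_odds))

# Hypothetical linear predictor, as predict() would return without type="response"
log_odds = 0.85
probability = inverse_logit(log_odds)

print(f"log-odds = {log_odds:.2f} -> probability = {probability:.3f}")
print(f"log-odds = 0.00 -> probability = {inverse_logit(0.0):.3f}")
```

This is also why the y-axis matters when comparing linear and logistic regression: the linear predictor is unbounded, while the probability always stays between 0 and 1.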
SEM Basics (5)
• SEM = Reliability + Regression
• Structural equation models combine measurement models (e.g.,
reliability) with structural models (e.g., regression). The sem package,
developed by John Fox, allows for some basic structural equation
models. To use it, add the sem package by using the package
manager.

• Structural Equation Modeling may be thought of as regression
corrected for attenuation.

09/05/2024 Applied Business Analytics using R 125


What test ? When to apply ?



WHAT WE DID SO FAR ?

Evaluation parameter              Descriptions                                                                    Marks
Mini Project/Assignments & Quiz   Mini Project Report / Assignment & Quiz                                         10
Case Presentation                 HBR Case Presentation and Interactions                                          10
Coursera Courses                  One Coursera course with score card                                             10
Mid Term Test                     Mid-term examination weightage will be 50% of total internal marks.             30
End term Marks                    A written exam will be conducted at the end of the trimester.                   40
                                  This will carry 40% of total marks
Total marks                                                                                                       100



All the BEST & Thanks
