Mathematical Computations Using R
II B.Sc Statistics
Unit-I
1. History of R programming
➢ R was created by Ross Ihaka and Robert Gentleman at the University of Auckland, New Zealand, and is currently developed by the R Development Core Team.
➢ R made its first appearance in 1993.
➢ The language was named R partly after the first letter of the first names of the two R authors
(Ross Ihaka and Robert Gentleman), and partly as a play on the name of the Bell Labs language S.
➢ A large group of individuals has contributed to R by sending code and bug reports.
➢ Since mid-1997 there has been a core group (the "R Core Team") who can modify the R source code
archive.
1.2 R Commands
help() Obtain documentation for a given R command
c(), scan() Enter data manually to a vector in R
seq() Make arithmetic progression vector
rep() Make vector of repeated values
data() Load (often into a data.frame) built-in dataset
View() View dataset in a spreadsheet-type format
str() Display internal structure of an R object
read.csv(), read.table() Load an existing data file into a data.frame
library(), require() Make available an R add-on package
dim() See dimensions (# of rows/cols) of data.frame
length() Give length of a vector
ls() Lists memory contents
rm() Removes an item from memory
names() Lists names of variables in a data.frame
hist() Command for producing a histogram
histogram() Lattice command for producing a histogram
stem() Make a stem plot
table() List all values of a variable with frequencies
xtabs() Cross-tabulation tables using formulas
mosaicplot() Make a mosaic plot
cut() Groups values of a variable into larger bins
mean(), median() Identify “center” of distribution
by() apply function to a column split by factors
summary() Display 5-number summary and mean
var(), sd() Find variance, sd of values in vector
sum() Add up all values in a vector
quantile() Compute quantiles of a dataset
plot() Produces a scatterplot
barplot() Produces a bar graph
barchart() Lattice command for producing bar graphs
boxplot() Produces a boxplot
bwplot() Lattice command for producing boxplots
xyplot() Lattice command for producing a scatterplot
lm() Determine the least-squares regression line
anova() Analysis of variance (can use on results of lm())
predict() Obtain predicted values from linear model
nls() estimate parameters of a nonlinear model
residuals() gives (observed - predicted) for a model fit to data
sample() take a sample from a vector of data
replicate() repeat some process a set number of times
cumsum() produce running total of values for input vector
ecdf() builds empirical cumulative distribution function
dbinom(), etc. tools for binomial distributions
dpois(), etc. tools for Poisson distributions
pnorm(), etc. tools for normal distributions
qt(), etc. tools for student t distributions
pchisq(), etc. tools for chi-square distributions
binom.test() hypothesis test and confidence interval for 1 proportion
prop.test() inference for 1 proportion using normal approx.
chisq.test() carries out a chi-square test
fisher.test() Fisher test for contingency table
t.test() t test for inference on population mean
qqnorm(), qqline() tools for checking normality
addmargins() adds marginal sums to an existing table
prop.table() compute proportions from a contingency table
par() query and edit graphical settings
power.t.test() power calculations for 1- and 2-sample t
anova() compute analysis of variance table for fitted model
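Several of the commands above can be tried together in a short session; a minimal sketch (the values are chosen for illustration):

```r
# Build a small vector with seq() and rep(), then summarize it
x <- seq(2, 10, by = 2)    # arithmetic progression: 2 4 6 8 10
y <- rep(1:2, times = 3)   # repeated values: 1 2 1 2 1 2
length(x)                  # 5
sum(x)                     # 30
mean(x)                    # 6
summary(x)                 # five-number summary plus the mean
```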
Normal
Normal random numbers are the backbone of classical statistical theory due to the central limit theorem. The normal
distribution has two parameters: a mean µ and a standard deviation σ. These are the
location and spread parameters. For example, IQs may be normally distributed with mean 100 and standard
deviation 16, and human gestation may be approximately normal with mean 280 days and standard deviation
about 10. The family of normals can be standardized to the normal with mean 0 (centered) and
variance 1. This is achieved by "standardizing" the numbers, i.e. Z = (X − µ)/σ.
Here are some examples
> rnorm(1, 100, 16) # an IQ score
[1] 94.1719
> rnorm(1, mean = 280, sd = 10) # how long for a baby (10 days early)
[1] 270.4325
Here the function is called as rnorm(n,mean=0,sd=1) where one specifies the mean and the standard deviation.
> x=rnorm(100)
>hist(x,probability=TRUE,col=gray(.9),main="normal mu=0,sigma=1")
>curve(dnorm(x),add=T)
## also for IQs using rnorm(100,mean=100,sd=16)
Binomial
The binomial random numbers are discrete random numbers. They have the distribution of the number of
successes in n independent Bernoulli trials where a Bernoulli trial results in success
or failure, success with probability p.
A single Bernoulli trial is given with n=1 in the binomial
> n = 1; p = .5 # set the probability
> rbinom(1, n, p) # different each time
[1] 1
> rbinom(10, n, p) # 10 different such numbers
[1] 0 1 1 0 1 0 1 0 1 0
A binomially distributed number is the same as the number of 1's in n such Bernoulli numbers; for the last
example, this would be 5. There are then two parameters: n (the number of Bernoulli trials)
and p (the success probability). To generate binomial numbers, we simply change the value of n
from 1 to the desired number of trials. For example, with 10 trials:
> n = 10; p=.5
>rbinom(1,n,p) # 6 successes in 10 trials
6
> rbinom(5, n, p) # 5 binomial numbers
[1] 6 6 4 5 4
The number of successes is of course discrete, but as n gets large, the distribution starts to look quite normal. This
is a case of the central limit theorem, which states in general that the standardized quantity (X − µ)/σ is
normal in the limit. In our specific case, the graphs show 100 binomially distributed random numbers for
three values of n with p = .25. Notice in the graphs that as n increases the
shape becomes more and more bell-shaped. These graphs were made with the commands
> n = 5; p = .25 # change as appropriate
> x = rbinom(100, n, p) # 100 random numbers
> hist(x, probability = TRUE)
## use points, not curve, as dbinom wants integers only for x
> xvals = 0:n; points(xvals, dbinom(xvals, n, p), type = "h", lwd = 3)
> points(xvals, dbinom(xvals, n, p), type = "p", lwd = 3)
... repeat with n = 15, n = 50
Exponential
The exponential distribution is important for theoretical work. It is used to describe lifetimes of electrical
components (to first order). For example, if the mean life of a light bulb is 2500 hours one may think its
lifetime is random with exponential distribution having mean 2500. The one parameter is the rate = 1/mean.
We specify it as follows rexp(n,rate=1). Here is an example with the rate being 1/2500.
> x=rexp(100,1/2500)
> hist(x, probability=TRUE, col=gray(.9), main="exponential, mean=2500")
>curve(dexp(x,1/2500),add=T)
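The fitted distribution can also be used for probabilities. For instance, with mean lifetime 2500 hours, the chance a bulb lasts beyond 5000 hours is e^(−2); a quick check with pexp():

```r
# P(lifetime > 5000) for an exponential with mean 2500 (rate = 1/2500)
pexp(5000, rate = 1/2500, lower.tail = FALSE)   # exp(-2), about 0.1353
```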
Matrices
A matrix is a two-dimensional rectangular data set. It can be created using a vector input to the matrix function.
# Create a matrix.
M = matrix( c('a','a','b','c','b','a'), nrow = 2, ncol = 3, byrow = TRUE)
print(M)
Arrays
While matrices are confined to two dimensions, arrays can have any number of dimensions. The array function
takes a dim attribute which creates the required number of dimensions. In the example below we create an
array with two elements, each of which is a 3x3 matrix.
# Create an array.
a <- array(c('green','yellow'),dim = c(3,3,2))
print(a)
Factors
Factors are R objects created using a vector. A factor stores the vector along with the distinct values of
the elements in the vector as levels. The levels are always character, irrespective of whether the input
vector is numeric, character, or logical. Factors are useful in statistical modeling. They are created using
the factor() function, and the nlevels() function gives the count of levels.
# Create a vector.
apple_colors<- c('green','green','yellow','red','red','red','green')
# Create a factor object.
factor_apple<- factor(apple_colors)
# Print the factor.
print(factor_apple)
print(nlevels(factor_apple))
Data Frames:
Data frames are tabular data objects. Unlike a matrix, in a data frame each column can contain a different mode
of data: the first column can be numeric, the second character, and the third logical. A data frame is a list of
vectors of equal length.
Data Frames are created using the data.frame() function.
# Create the data frame.
BMI <- data.frame(
gender = c("Male", "Male","Female"),
height = c(152, 171.5, 165),
weight = c(81,93, 78),
Age = c(42,38,26)
)
print(BMI)
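Columns of a data frame are extracted with the $ operator and rows by indexing; a short sketch using the same BMI data:

```r
# Same data frame as above
BMI <- data.frame(
  gender = c("Male", "Male", "Female"),
  height = c(152, 171.5, 165),
  weight = c(81, 93, 78),
  Age    = c(42, 38, 26)
)
BMI$height        # one column as a vector: 152.0 171.5 165.0
BMI[1, ]          # first row
mean(BMI$Age)     # average age of the three subjects
dim(BMI)          # 3 rows, 4 columns
```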
Objects
Objects are assigned values using <- , an arrow formed out of < and -. (An equal sign, =, can also be used.)
For example, the following command assigns the value 5 to the object x.
x <- 5
After this assignment, the object x ‘contains’ the value 5. Another assignment to the same object will change
the content.
x <- 107
We can check the content of an object by simply entering the name of the object on an interactive command
line. Try that throughout these examples to see the results of the different operations and functions
illustrated.
Distributions
sample(x, size, replace = FALSE, prob = NULL) # take a simple random sample of size n from the population x, with or without replacement
rbinom(n, size, p)
pbinom()
qbinom()
dbinom()
rnorm(n, mean, sd) # randomly generate n numbers from a Normal distribution with the specified mean and sd
pnorm() # find the probability (area under the curve) of a Normal(10, 3^2) distribution to the left of a given value
qnorm() # find the value x such that the area under the Normal(10, 3^2) curve to the left of x equals a given probability
1.7 Data Input
Unlike SAS, which has DATA and PROC steps, R has data structures (vectors, matrices, arrays, data frames)
that you can operate on through functions that perform statistical analyses and create graphs. This section
describes how to enter or import data into R, and how to prepare it for use
in statistical analyses. Topics include R data structures, importing data (from Excel, SPSS, SAS, Stata, and
ASCII Text Files), entering data from the keyboard, creating an interface with a database management system,
exporting data (to Excel, SPSS, SAS, Stata, and Tab Delimited Text Files), annotating data (with variable
labels and value labels), and listing data. In addition, methods for handling missing values and date values are
presented.
1.8 Data Frames
A data frame is used for storing data tables. It is a list of vectors of equal length. For example, the following
variable df is a data frame containing three vectors n, s, b.
> n = c(2, 3, 5)
> s = c("aa", "bb", "cc")
> b = c(TRUE, FALSE, TRUE)
>df = data.frame(n, s, b) # df is a data frame
1.8 Graphics
This section provides the most basic information needed to start producing plots in R, by way of a series of
charts, graphs and visualizations. R has also been used to produce figures that help to visualize important
concepts or teaching points. On the organization of R graphics: this section briefly describes how R's graphics
functions are organized so that the user knows where to start looking for a particular function. The R graphics
system can be broken into four distinct levels: graphics packages; graphics systems; a graphics engine,
including standard graphics devices; and graphics device packages.
To visualize data:
• ggplot2 - R's famous package for making beautiful graphics. ggplot2
lets you use the grammar of graphics to build layered, customizable plots.
• ggvis - Interactive, web based graphics built with the grammar of
graphics.
• rgl - Interactive 3D visualizations with R
• Colors : The package colorspace provides a set of functions for
transforming between color spaces and mixcolor() for mixing colors within a
color space.
• htmlwidgets - A fast way to build interactive (javascript based)
visualizations with R. Packages that implement htmlwidgets include:
• leaflet (maps)
• dygraphs (time series)
• DT (tables)
• diagrammeR (diagrams)
• networkD3 (network graphs)
• threejs (3D scatterplots and globes).
The R programming language has numerous libraries to create charts and graphs. R provides the usual range
of standard statistical plots, including scatterplots, boxplots, histograms, bar plots, pie charts, and basic
3D plots.
Types of charts
• scatterplots
• boxplots
• histograms
• bar plots
• pie charts
• basic 3D plots
1.9 Table
A table is an arrangement of information in rows and columns that makes comparing and contrasting
information easier. As you can see in the following example, the data are much easier to read than they would
be in a list containing the same data.
read.table() # read spreadsheet data (i.e. more than one variable) from a text file
table() # frequency counts of entries; ideally the entries are factors (although it works with integers or even reals)
Example
smoke <-matrix(c(51,43,22,92,28,21,68,22,9),ncol=3,byrow=TRUE)
colnames(smoke) <-c("High","Low","Middle")
rownames(smoke) <-c("current","former","never")
smoke<-as.table(smoke)
smoke
High Low Middle
current 51 43 22
former 92 28 21
never 68 22 9
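The smoke table can be passed to addmargins() and prop.table() from the command list in Unit I; a short sketch:

```r
# Same table as above, built in one step with dimnames
smoke <- as.table(matrix(c(51, 43, 22, 92, 28, 21, 68, 22, 9),
                         ncol = 3, byrow = TRUE,
                         dimnames = list(c("current", "former", "never"),
                                         c("High", "Low", "Middle"))))
addmargins(smoke)            # appends row and column sums (grand total 356)
round(prop.table(smoke), 3)  # each cell as a proportion of the grand total
```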
Unit II
• Title – Every diagram must be given a suitable title, which should be brief and self-explanatory.
• Size – The size of the diagram should be appropriate: neither too small nor too big.
• Paper used – Diagrams are generally prepared on blank paper.
• Scale – For one-dimensional diagrams, especially bar diagrams, the Y-axis is generally more
important from the point of view of deciding the scale, because we represent magnitude along this
axis.
• Index – When two or more variables are presented and different types of line/shading patterns are
used to distinguish them, an index must be given to show their details.
• Selection of Proper Type of Diagram – It is very important to select the correct type of diagram to
represent data effectively.
• Data presented in the form of diagrams are able to attract the attention of even a common man.
(2) Easy to Remember
• Diagrams are used to represent a huge mass of complex data in a simplified and intelligible form,
which is easy to understand.
(5) Diagrams Are Useful in Making Comparisons
• It becomes easier to compare two sets of data visually by presenting them through diagrams.
(6) More Informative
• Diagrams not only depict the characteristics of data but also bring out other hidden facts and relations
which are not possible from the classified and tabulated data.
Diagrammatic presentation is a technique of presenting numeric data through pictograms, cartograms, bar
diagrams, pie diagrams, etc. It is the most attractive and appealing way to represent statistical data. Under
pictograms, we use pictures to present data.
Simple Bar Diagram
A simple bar chart is used to represent data involving only one variable classified on a spatial, quantitative
or temporal basis. In a simple bar chart, we make bars of equal width but variable length, i.e. the magnitude
of a quantity is represented by the height or length of the bars. Simple bar diagram is used for comparative
study of two or more items or value of a single variable. These can also be drawn either vertically or
horizontally. Distance between these bars should be equal.
[Figure: simple bar diagram of "Population in Lakhs" for Andhra, Karnataka, Kerala and Tamil Nadu; Y-axis from 0 to 600]
R-coding
> year<-c("2005","2006","2007")
> color<-c("red","blue")
> profit=matrix(c(1000,1500,2000,1800,1300,1200),nrow=2,ncol=3,byrow=T)
>barplot(profit,names.arg=year,xlab="year",ylab="profit",col=color,main="Annual Profit",beside=T)
[Figure: grouped bar plot titled "Annual Profit"; profit (0 to 2000) plotted against year]
Construction of Sub divided Bar Diagram
A sub-divided or component bar chart is used to represent data in which the total magnitude is divided into
different or components. In this diagram, first we make simple bars for each class taking the total magnitude
in that class and then divide these simple bars into parts in the ratio of various components.
R-code
> funds<-c("Share","Surplus","loans","Foreign currency")
> colors<-c("green","blue")
> values<-matrix(c(339,998,5843,2552,352,1043,5614,3262),nrow=2,ncol=4,byrow=TRUE)
> barplot(values,names.arg=funds,xlab="year",ylab="funds",main="sources of funds",col=colors)
> barplot(values,names.arg=funds,xlab="year",ylab="funds",main="sources of funds",col=colors)
> barplot(values,names.arg=funds,xlab="year",ylab="funds",main="sources of
funds",col=colors,horiz=TRUE)
[Figures: vertical and horizontal sub-divided bar plots titled "sources of funds", showing the Share, Surplus, loans and Foreign currency components (funds, 0 to 10000) against year]
What Is a Histogram?
A histogram is a graphical representation that organizes a group of data points into user-specified ranges. It is
similar in appearance to a bar graph. The histogram condenses a data series into an easily interpreted visual
by taking many data points and grouping them into logical ranges or bins.
Construction of Histogram
R-code
> x<-c(5,15,25,35,45,55,65,75,85)
> f<-c(4,6,7,14,16,14,8,16,5)
> a<-rep(x,f)
> brk=seq(0,90,by=10)
> hist(a,brk,xlab="class interval",ylab="frequency",col="green",main="histogram")
[Figure: histogram titled "histogram"; frequency (0 to 15) against class interval (0 to 80)]
Computation measures of Central Values
A measure of central tendency (also referred to as measures of centre or central location) is a summary
measure that attempts to describe a whole set of data with a single value that represents the middle or centre
of its distribution.
There are three main measures of central tendency:
The mode
The median
The mean.
Each of these measures describes a different indication of the typical or central
value in the distribution.
The arithmetic mean of n observations is
x̄ = (Σᵢ₌₁ⁿ xᵢ) / n
R-code
> Family<-c("A","B","C","D","E","F","G","H","I","J")
> Expenditure<-c(30,70,10,75,500,8,42,250,40,36)
> mean(Expenditure)
output
mean= 106.1
R-code
> persons<-c(2,3,4,5,6)
> house<-c(10,25,30,25,10)
> fx=sum(persons*house)
> fx
[1] 400
> f=sum(house)
>f
[1] 100
> fxx=(fx/f)
> fxx
Output
Mean= 4
Harmonic mean
R-code
> har<-c(6,15,35,40,900,520,300,400,1800,2000)
> st=(1/har)
> st
[1] 0.1666666667 0.0666666667 0.0285714286 0.0250000000 0.0011111111
[6] 0.0019230769 0.0033333333 0.0025000000 0.0005555556 0.0005000000
> stt=data.frame(har,st)
> stt
har st
1 6 0.1666666667
2 15 0.0666666667
3 35 0.0285714286
4 40 0.0250000000
5 900 0.0011111111
6 520 0.0019230769
7 300 0.0033333333
8 400 0.0025000000
9 1800 0.0005555556
10 2000 0.0005000000
> n=length(har)
>n
[1] 10
> sttt=sum(st)
> sttt
[1] 0.2968278
> haa=(n/sttt)
> haa
output
[1] 33.68956
Geometric mean
In statistics, the geometric mean is calculated by raising the product of a series of numbers to the power of
the reciprocal of the length of the series. The geometric mean is most useful when the numbers in the series
are not independent of each other or when the numbers tend to fluctuate widely.
geoMean<-function(values){
prod(values)^(1/length(values))
}
values<-c(2,4,6,8)
geoMean(values)
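For long series, prod() can overflow or underflow; a numerically safer sketch uses the identity GM = exp(mean(log(x))) (the helper name geoMeanLog is our own choice):

```r
# Equivalent geometric mean via logarithms (avoids overflow of prod())
geoMeanLog <- function(values) exp(mean(log(values)))
values <- c(2, 4, 6, 8)
geoMeanLog(values)   # same value as prod(values)^(1/length(values))
```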
Harmonic Mean
The harmonic mean is a type of average that is calculated by dividing the number of values in a data series by
the sum of the reciprocals (1/x_i) of each value in the data series. The harmonic mean is often used to
calculate the average of ratios or rates.
a = c(10, 2, 19, 24, 6, 23, 47, 24, 54, 77)
length(a)/sum(1/a) # harmonic mean
Mode
The mode is the most commonly occurring value in a distribution. Consider this dataset showing the retirement
age of 11 people, in whole years.
54, 54, 54, 55, 56, 57, 57, 58, 58, 60, 60
This table shows a simple frequency distribution of the retirement age data.
Age Frequency
54 3
55 1
56 1
57 2
58 2
60 2
The most commonly occurring value is 54, therefore the mode of this
distribution is 54 years.
Using R code:
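Base R has no built-in function for the statistical mode, so one is usually written by hand; a minimal sketch (the function name Mode is our own choice), applied to the retirement-age data above:

```r
# Most frequent value via table(); ties return the first maximum
Mode <- function(x) {
  tab <- table(x)
  as.numeric(names(tab)[which.max(tab)])
}
age <- c(54, 54, 54, 55, 56, 57, 57, 58, 58, 60, 60)
Mode(age)   # 54
```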
Measures of Dispersion
▪ The measure of dispersion shows how the data is spread or scattered around the mean.
S = √( Σᵢ₌₁ⁿ (Xᵢ − X̄)² / (n − 1) )
Example
Sample data (Xᵢ): 10 12 14 15 17 18 18 24
Here X̄ = 16 and Σ(Xᵢ − X̄)² = 130, so
S = √(130/7) = 4.3095
Skewness
Skewness measures the asymmetry of a distribution; a distribution can have positive or negative skewness.
Kurtosis
Kurtosis measures the peakedness or flatness of a distribution relative to the normal.
Unit III
Discrete Distributions
In this chapter we introduce discrete random variables, those that take values
in a finite or countably infinite support set. We discuss probability mass
functions and some special expectations, namely, the mean, variance and
standard deviation. Some of the more important discrete distributions are
explored in detail, and the more general concept of expectation is defined, which
paves the way for moment generating functions.
Every discrete random variable X has associated with it a probability mass function (PMF)
fX : SX → [0, 1] defined by
fX(x) = IP(X = x), x ∈ SX.   (3.1.2)
Since values of the PMF represent probabilities, we know from Chapter 4 that
PMFs enjoy certain properties. In particular, all PMFs satisfy
1. fX(x) > 0 for x ∈ SX,
2. Σ_{x∈SX} fX(x) = 1, and
3. IP(X ∈ A) = Σ_{x∈A} fX(x), for any event A ⊂ SX.
Example 3.1. Toss a coin 3 times. The sample space would be
S = {HHH, HTH, THH, TTH, HHT, HTT, THT, TTT } .
Now let X be the number of Heads observed. Then X has support S X = {0, 1, 2,
3}. Assuming that the coin is fair and was tossed in exactly the same way each
time, it is not unreasonable to suppose that the outcomes in the sample space
are all equally likely. What is the PMF of
X? Notice that X is zero exactly when the outcome TTT occurs, and this event
has probability 1/8. Therefore, fX(0) = 1/8, and the same reasoning shows that
fX(3) = 1/8. Exactly three outcomes result in X = 1, thus fX(1) = 3/8, and fX(2)
holds the remaining 3/8 probability (the total is 1). We can represent the PMF
with a table:
x ∈ SX:            0    1    2    3   Total
fX(x) = IP(X = x): 1/8  3/8  3/8  1/8    1
µ = IE X = Σ_{x∈SX} x fX(x).   (3.1.3)
To compute the variance σ², we subtract the value of µ from each entry in
x, square the answers, multiply by f, and sum. The standard deviation σ is
simply the square root of σ².
[1] 1.5
[1] 0.75
[1] 0.8660254
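The three printed values (mean, variance, standard deviation) can be reproduced directly from the PMF table of the coin-toss example:

```r
x <- c(0, 1, 2, 3)        # support of X
f <- c(1, 3, 3, 1)/8      # PMF from the table
mu <- sum(x * f)
mu                        # 1.5
sigma2 <- sum((x - mu)^2 * f)
sigma2                    # 0.75
sqrt(sigma2)              # 0.8660254
```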
Density, cumulative distribution function, quantile function and random variate generation for many
standard probability distributions are available in the stats package.
Keywords
distribution
Details
The functions for the density/mass function, cumulative distribution function, quantile
function and random variate generation are named in the
form dxxx, pxxx, qxxx and rxxx respectively.
For the geometric distribution see dgeom. (This is also a special case of the negative
binomial.)
The Bernoulli Distribution
Density, distribution function, quantile function and random generation for the Bernoulli distribution with
parameter prob.
Usage
dbern(x, prob, log = FALSE)
pbern(q, prob, lower.tail = TRUE, log.p = FALSE)
qbern(p, prob, lower.tail = TRUE, log.p = FALSE)
rbern(n, prob)
Arguments
x, q vector of quantiles.
p vector of probabilities.
n number of observations. If length(n) > 1, the length is taken to be the number required.
prob probability of success on each trial.
log, log.p logical; if TRUE, probabilities p are given as log(p).
lower.tail logical; if TRUE (default), probabilities are P[X ≤ x], otherwise, P[X > x].
Details
P(x) = p^x (1 − p)^(1−x), for x = 0 or 1
If an element of x is not 0 or 1, the result of dbern is zero, without a warning. p(x) is computed using
Loader's algorithm, see the reference below.
The quantile is defined as the smallest value x such that F(x)≥p, where F is the distribution function.
Value
dbern gives the density, pbern gives the distribution function, qbern gives the quantile function
and rbern generates random deviates.
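The dbern family comes from an add-on package (e.g. Rlab) rather than base R; in base R the same distribution is available as a binomial with size = 1, which may be the simpler route:

```r
# A Bernoulli(p) is a Binomial(1, p) in base R
dbinom(1, size = 1, prob = 0.3)   # P(X = 1) = 0.3
dbinom(0, size = 1, prob = 0.3)   # P(X = 0) = 0.7
rbinom(5, size = 1, prob = 0.3)   # five Bernoulli draws
```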
The Binomial Distribution
Density, distribution function, quantile function and random generation for the binomial distribution with
parameters size and prob.
Usage
dbinom(x, size, prob, log = FALSE)
pbinom(q, size, prob, lower.tail = TRUE, log.p = FALSE)
qbinom(p, size, prob, lower.tail = TRUE, log.p = FALSE)
rbinom(n, size, prob)
Arguments
x, q vector of quantiles.
p vector of probabilities.
n number of observations. If length(n) > 1, the length is taken to be the number required.
size number of trials (zero or more).
prob probability of success on each trial.
log, log.p logical; if TRUE, probabilities p are given as log(p).
lower.tail logical; if TRUE (default), probabilities are P[X≤x], otherwise, P[X>x].
Details
p(x) = (n choose x) p^x q^(n−x),  x = 0, 1, 2, …, n, where q = 1 − p
The quantile is defined as the smallest value x such that F(x)≥p, where F is the distribution function.
Value
dbinom gives the density, pbinom gives the distribution function, qbinom gives the quantile function
and rbinom generates random deviates.
The length of the result is determined by n for rbinom, and is the maximum of the lengths of the numerical
arguments for the other functions.
The numerical arguments other than n are recycled to the length of the result. Only the first elements of the
logical arguments are used.
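A quick check that pbinom() is the cumulative sum of dbinom(); for X ~ Binomial(10, 0.5):

```r
# P(X <= 3) two ways; both equal 176/1024 = 0.171875
pbinom(3, size = 10, prob = 0.5)
sum(dbinom(0:3, size = 10, prob = 0.5))
```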
The Poisson Distribution
Density, distribution function, quantile function and random generation for the Poisson distribution with
parameter lambda.
Usage
dpois(x, lambda, log = FALSE)
ppois(q, lambda, lower.tail = TRUE, log.p = FALSE)
qpois(p, lambda, lower.tail = TRUE, log.p = FALSE)
rpois(n, lambda)
Arguments
x, q vector of quantiles.
p vector of probabilities.
n number of observations. If length(n) > 1, the length is taken to be the number required.
lambda vector of (non-negative) means.
log, log.p logical; if TRUE, probabilities p are given as log(p).
lower.tail logical; if TRUE (default), probabilities are P[X ≤ x], otherwise, P[X > x].
Details
The Poisson distribution has density p(x) = λ^x e^(−λ) / x! for x = 0, 1, 2, … . The mean and variance
are E(X) = Var(X) = λ.
Note that λ = 0 is really a limit case (setting 0^0 = 1) resulting in a point mass at 0, see also the example.
If an element of x is not integer, the result of dpois is zero, with a warning. p(x) is computed using Loader's
algorithm, see the reference in dbinom.
The quantile is right continuous: qpois(p, lambda) is the smallest integer x such that P(X≤x)≥p.
Setting lower.tail = FALSE allows one to get much more precise results when the default, lower.tail =
TRUE, would return 1; see the example below.
Value
dpois gives the (log) density, ppois gives the (log) distribution function, qpois gives the quantile function,
and rpois generates random deviates.
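Evaluating the density formula above at a point and checking it against dpois(); for λ = 3:

```r
dpois(2, lambda = 3)            # lambda^x * exp(-lambda) / x!
3^2 * exp(-3) / factorial(2)    # same value, about 0.224
```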
The Geometric Distribution
Density, distribution function, quantile function and random generation for the geometric distribution
with parameter prob.
Usage
dgeom(x, prob, log = FALSE)
pgeom(q, prob, lower.tail = TRUE, log.p = FALSE)
qgeom(p, prob, lower.tail = TRUE, log.p = FALSE)
rgeom(n, prob)
Arguments
x, q vector of quantiles representing the number of failures in a sequence of Bernoulli trials before
success occurs.
p vector of probabilities.
n number of observations. If length(n) > 1, the length is taken to be the number required.
prob probability of success in each trial. 0 < prob ≤ 1.
Details
The geometric distribution with prob = p has density p(x) = p(1−p)^x for x = 0, 1, 2, …, 0 < p ≤ 1.
The quantile is defined as the smallest value x such that F(x)≥p, where F is the distribution function.
Value
dgeom gives the density, pgeom gives the distribution function, qgeom gives the quantile function,
and rgeom generates random deviates.
The length of the result is determined by n for rgeom, and is the maximum of the lengths of the numerical
arguments for the other functions.
The numerical arguments other than n are recycled to the length of the result. Only the first elements of the
logical arguments are used.
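A small check of the density p(x) = p(1−p)^x, the probability of x failures before the first success; with p = 0.5:

```r
dgeom(0, prob = 0.5)   # 0.5   (success on the first trial)
dgeom(2, prob = 0.5)   # 0.125 (two failures, then a success)
pgeom(2, prob = 0.5)   # P(X <= 2) = 1 - 0.5^3 = 0.875
```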
Unit –IV
The Normal Distribution
Density, distribution function, quantile function and random generation for the normal distribution with
mean equal to mean and standard deviation equal to sd.
Usage
dnorm(x, mean = 0, sd = 1, log = FALSE)
pnorm(q, mean = 0, sd = 1, lower.tail = TRUE, log.p = FALSE)
qnorm(p, mean = 0, sd = 1, lower.tail = TRUE, log.p = FALSE)
rnorm(n, mean = 0, sd = 1)
Arguments
x, q vector of quantiles.
p vector of probabilities.
n number of observations. If length(n) > 1, the length is taken to be the number required.
Details
If mean or sd are not specified they assume the default values of 0 and 1, respectively.
The normal distribution has density f(x) = (1 / (σ√(2π))) e^(−(x−μ)² / (2σ²)), where μ is the mean of the
distribution and σ the standard deviation.
Value
dnorm gives the density, pnorm gives the distribution function, qnorm gives the quantile function,
and rnorm generates random deviates.
The length of the result is determined by n for rnorm, and is the maximum of the lengths of the numerical
arguments for the other functions.
The numerical arguments other than n are recycled to the length of the result. Only the first elements of the
logical arguments are used.
For sd = 0 this gives the limit as sd decreases to 0, a point mass at μ. sd < 0 is an error and returns NaN.
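The standardization Z = (X − µ)/σ from Unit I can be verified with pnorm(); the IQ probability below equals the standard-normal value at z = 1:

```r
# P(IQ <= 116) with mean 100 and sd 16; z = (116 - 100)/16 = 1
pnorm(116, mean = 100, sd = 16)
pnorm(1)   # same value, about 0.8413
```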
The Uniform Distribution
These functions provide information about the uniform distribution on the interval
from min to max. dunif gives the density, punif gives the distribution function, qunif gives the quantile
function and runif generates random deviates.
Keywords
distribution
Usage
dunif(x, min = 0, max = 1, log = FALSE)
punif(q, min = 0, max = 1, lower.tail = TRUE, log.p = FALSE)
qunif(p, min = 0, max = 1, lower.tail = TRUE, log.p = FALSE)
runif(n, min = 0, max = 1)
Arguments
x, q vector of quantiles.
p vector of probabilities.
n number of observations. If length(n) > 1, the length is taken to be the number required.
min, max lower and upper limits of the distribution. Must be finite.
log, log.p logical; if TRUE, probabilities p are given as log(p).
lower.tail logical; if TRUE (default), probabilities are P[X≤x], otherwise, P[X>x].
Details
If min or max are not specified they assume the default values of 0 and 1 respectively.
For the case of u:=min==max, the limit case of X≡u is assumed, although there is no density in that case
and dunif will return NaN (the error condition).
runif will not generate either of the extreme values unless max = min or max-min is small compared
to min, and in particular not for the default arguments.
Value
dunif gives the density, punif gives the distribution function, qunif gives the quantile function,
and runif generates random deviates.
The length of the result is determined by n for runif, and is the maximum of the lengths of the numerical
arguments for the other functions.
The numerical arguments other than n are recycled to the length of the result. Only the first elements of the
logical arguments are used.
The Exponential Distribution
Density, distribution function, quantile function and random generation for the exponential distribution with
rate rate (i.e., mean 1/rate).
Keywords
distribution
Usage
dexp(x, rate = 1, log = FALSE)
pexp(q, rate = 1, lower.tail = TRUE, log.p = FALSE)
qexp(p, rate = 1, lower.tail = TRUE, log.p = FALSE)
rexp(n, rate = 1)
Arguments
x, q vector of quantiles.
p vector of probabilities.
n number of observations. If length(n) > 1, the length is taken to be the number required.
rate vector of rates.
log, log.p logical; if TRUE, probabilities p are given as log(p).
lower.tail logical; if TRUE (default), probabilities are P[X≤x], otherwise, P[X>x].
Details
If rate is not specified, it assumes the default value of 1. The exponential distribution with rate λ has
density f(x) = λ e^(−λx) for x ≥ 0.
Value
dexp gives the density, pexp gives the distribution function, qexp gives the quantile function,
and rexp generates random deviates.
The length of the result is determined by n for rexp, and is the maximum of the lengths of the numerical
arguments for the other functions.
The numerical arguments other than n are recycled to the length of the result. Only the first elements of the
logical arguments are used.
The Gamma Distribution
Density, distribution function, quantile function and random generation for the Gamma distribution with
parameters shape and scale.
Keywords
distribution
Usage
dgamma(x, shape, rate = 1, scale = 1/rate, log = FALSE)
pgamma(q, shape, rate = 1, scale = 1/rate, lower.tail = TRUE,
log.p = FALSE)
qgamma(p, shape, rate = 1, scale = 1/rate, lower.tail = TRUE,
log.p = FALSE)
rgamma(n, shape, rate = 1, scale = 1/rate)
Arguments
x, q vector of quantiles.
p vector of probabilities.
n number of observations. If length(n) > 1, the length is taken to be the number required.
rate an alternative way to specify the scale.
shape, scale shape and scale parameters. Must be positive, scale strictly.
log, log.p logical; if TRUE, probabilities/densities p are returned as log(p).
lower.tail logical; if TRUE (default), probabilities are P[X≤x], otherwise, P[X>x].
Details
Note that for smallish values of shape (and moderate scale) a large part of the mass of the Gamma
distribution is on values of x so near zero that they will be represented as zero in computer arithmetic.
So rgamma may well return values which will be represented as zero. (This will also happen for very large
values of scale since the actual generation is done for scale = 1.)
Value
dgamma gives the density, pgamma gives the distribution function, qgamma gives the quantile function,
and rgamma generates random deviates.
The length of the result is determined by n for rgamma, and is the maximum of the lengths of the numerical
arguments for the other functions.
The numerical arguments other than n are recycled to the length of the result. Only the first elements of the
logical arguments are used.
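The Gamma functions can be exercised the same way (a sketch; shape = 2 and rate = 1 are arbitrary choices). The last line illustrates the note above about rgamma returning values represented as zero when shape is very small:

```r
# Gamma distribution with shape = 2, rate = 1 (so scale = 1, mean = shape/rate = 2)
dgamma(1, shape = 2, rate = 1)    # density at x = 1: equals exp(-1)
pgamma(1, shape = 2, rate = 1)    # P[X <= 1]
qgamma(0.5, shape = 2, rate = 1)  # median
set.seed(1)
g <- rgamma(1000, shape = 2, rate = 1)
mean(g)                           # close to shape/rate = 2
# For very small shape, much of the mass lies so near zero that
# rgamma can return values represented as exactly 0:
rgamma(10, shape = 0.001)
```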
Unit V
LINEAR CORRELATION
The term correlation is used even by the layman, often without knowing it. For example, when
parents advise their children to work hard so that they may get good marks, they are correlating good
marks with hard work. The study of the characteristics of only one variable, such as height, weight,
age, marks or wages, is known as univariate analysis. The statistical analysis of the relationship
between two variables is known as bivariate analysis. Sometimes the variables may be inter-related. In
the health sciences we study the relationship between blood pressure and age, consumption level of
some nutrient and weight gain, total income and medical expenditure, and so on. The nature and
strength of the relationship may be examined by correlation and regression analysis. Thus correlation
refers to the relationship between two or more variables. Correlation is a statistical analysis which
measures and analyses the degree or extent to which two variables fluctuate with reference to each
other. The word relationship is important: it indicates that there is some connection between the
variables, and correlation measures the closeness of that relationship. Correlation does not by itself
indicate a cause-and-effect relationship. Price and supply, and income and expenditure, are examples
of correlated variables.
Meaning of Correlation:
In a bivariate distribution we may be interested in finding out whether there is any correlation or
covariation between the two variables under study. If a change in one variable effects a change in the
other variable, the variables are said to be correlated. If the two variables deviate in the same
direction, i.e., if an increase in one results in a corresponding increase in the other, the correlation
is said to be direct or positive.
Example
Correlation is said to be perfect if the deviation in one variable is followed by a corresponding and
proportional deviation in the other.
Definitions:
Ya-Kun-Chou:
A.M. Tuttle:
Uses of correlation:
2. It is useful for economists to study the relationship between variables like price, quantity.
4. It is helpful in measuring the degree of relationship between the variables like income and
expenditure, price and supply, supply and demand etc.
SCATTER DIAGRAM
If a scatter diagram is drawn for two independent variables, it is easily verified that if any straight
line is drawn through the plotted points, not more than two points will lie on the line and most of the
other points will be at a considerable distance from it. When the scatter diagram suggests that the two
variables are linearly related, the problem arises of deciding which of the many possible lines is the
best-fitting line. The least squares method is the most widely accepted method of fitting a straight
line and is discussed here adequately.
[Scatter diagrams showing perfect positive (r = +1) and perfect negative (r = -1) correlation]
1. If all the plotted dots lie on a straight line falling from upper left hand corner to lower right
hand corner, there is a perfect negative correlation between the two variables. In this case
the coefficient of correlation takes the value r = -1.
2. If all the plotted points form a straight line from lower left hand corner to the upper right hand
corner then there is Perfect positive correlation. We denote this as r = +1
3. If the plotted points in the plane form a band and show a rising trend from the lower left hand
corner to the upper right hand corner, the two variables are highly positively correlated.
[Scatter diagrams showing highly positive and highly negative correlation]
4. If the points fall in a narrow band from the upper left hand corner to the lower right hand
corner, there will be a high degree of negative correlation.
5. If the plotted points in the plane are spread all over the diagram, there is no correlation
between the two variables.
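The cases above can be visualised in R with plot() (a sketch; the data are simulated for illustration, and the noise level sd = 10 is an arbitrary choice):

```r
# Scatter diagrams for different degrees of correlation
set.seed(42)
x <- 1:50
par(mfrow = c(1, 3))                                 # three panels side by side
plot(x, 2 * x, main = "Perfect positive (r = +1)")
plot(x, -2 * x, main = "Perfect negative (r = -1)")
plot(x, x + rnorm(50, sd = 10), main = "Highly positive")
par(mfrow = c(1, 1))
```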
Merits:
1. It is the simplest and most attractive method of finding the nature of the correlation between
two variables.
4. It is the first step in finding out the relation between two variables.
Demerits:
By this method we cannot get the exact degree of correlation between the two variables.
Types of Correlation:
Correlation is classified into various types. The most important ones are
It depends upon the direction of change of the variables. If the two variables tend to move
together in the same direction, i.e., an increase in the value of one variable is accompanied by an
increase in the value of the other, or a decrease in the value of one variable is accompanied by a
decrease in the value of the other, then the correlation is called positive or direct correlation. Price
and supply, height and weight, and yield and rainfall are some examples of positive correlation.
If the two variables tend to move together in opposite directions so that increase (or) decrease
in the value of one variable is accompanied by a decrease or increase in the value of the other variable,
then the correlation is called negative (or) inverse correlation. Price and demand, yield of crop and
price, are examples of negative correlation.
If the ratio of change between the two variables is a constant then there will be linear
correlation between them.
Example
X 10 20 30 40 50
Y 20 40 60 80 100
Here the ratio of change between the two variables is the same. If we plot these points on a
graph we get a straight line.
If the amount of change in one variable does not bear a constant ratio to the amount of change
in the other, then the relation is called curvilinear (or non-linear) correlation. The graph will be a
curve.
Example
X 10 20 30 40 50
Y 10 30 70 90 120
Here there is a non-linear relationship between the variables: the ratio of change between them is not
fixed for all points. If we plot them on a graph, the points will not lie on a straight line; they will
form a curve.
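The two small datasets above can be checked in R: the first pair is exactly linear (r = 1), while the second is close to linear but not exactly so.

```r
x        <- c(10, 20, 30, 40, 50)
y_linear <- c(20, 40, 60, 80, 100)   # y = 2x: perfectly linear
y_curve  <- c(10, 30, 70, 90, 120)   # ratio of change is not constant
cor(x, y_linear)                     # exactly 1
cor(x, y_curve)                      # high, but less than 1
plot(x, y_curve, type = "b")         # the points trace a curve, not a line
```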
When we study only two variables, the relationship is called simple correlation, for example,
quantity of money and price level, or demand and price. In multiple correlation we study more
than two variables simultaneously. The relationship between the price, demand and supply of a
commodity is an example of multiple correlation.
Example:
Calculate the coefficient of correlation for the following data.
x 1 2 3 4 5 6 7 8 9
y 9 8 10 12 11 13 14 16 15
Solution:
x      y      x²     y²     xy
1      9      1      81     9
2      8      4      64     16
3      10     9      100    30
4      12     16     144    48
5      11     25     121    55
6      13     36     169    78
7      14     49     196    98
8      16     64     256    128
9      15     81     225    135
Total  45     108    285    1356   597
r = [nΣxy − (Σx)(Σy)] / √{[nΣx² − (Σx)²][nΣy² − (Σy)²]}

r = (9 × 597 − 45 × 108) / √{(9 × 285 − 45²)(9 × 1356 − 108²)}
  = (5373 − 4860) / √{(2565 − 2025)(12204 − 11664)}
  = 513 / √(540 × 540)
  = 513 / 540
r = 0.95
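The hand computation above can be reproduced in R, both from the formula and with the built-in cor() function:

```r
x <- 1:9
y <- c(9, 8, 10, 12, 11, 13, 14, 16, 15)
n <- length(x)
# Pearson correlation coefficient, matching the hand computation
r <- (n * sum(x * y) - sum(x) * sum(y)) /
  sqrt((n * sum(x^2) - sum(x)^2) * (n * sum(y^2) - sum(y)^2))
r            # 0.95
cor(x, y)    # the built-in function gives the same value
```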
Regression
MEANING OF REGRESSION: The dictionary meaning of the word regression is ‘stepping back’ or
‘going back’. Regression is the measure of the average relationship between two or more variables in
terms of the original units of the data. It also attempts to establish the nature of the relationship
between the variables, that is, to study the functional relationship between them and thereby provide a
mechanism for prediction, or forecasting.
Example 9.9
Calculate the regression coefficient and obtain the lines of regression for the following data
Solution:
Regression line of Y on X:
Y = 0.929X − 3.716 + 11
Y = 0.929X + 7.284
Example 9.10
Calculate the two regression equations of X on Y and Y on X from the data given below, taking
deviations from the actual means of X and Y.
Solution:
= 39.25 (when the price is Rs. 20, the likely demand is 39.25)
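Since the data tables for these worked examples are not reproduced here, the fitting procedure can be sketched in R using the data from the correlation example above (an illustrative choice, not the data of Example 9.9 or 9.10); lm() returns the intercept and the regression coefficient directly:

```r
# Regression of y on x using the correlation example data
x <- 1:9
y <- c(9, 8, 10, 12, 11, 13, 14, 16, 15)
fit <- lm(y ~ x)
coef(fit)                                    # intercept and slope b_yx
predict(fit, newdata = data.frame(x = 10))   # forecast y when x = 10
```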