How To Run R
COURSE: B.COM(HONS)
SEMESTER: 3RD
INDEX
OBJECTIVES
INTRODUCTION
EXPERIMENT 1
EXPERIMENT 2
EXPERIMENT 3
EXPERIMENT 4
EXPERIMENT 5
EXPERIMENT 6
OBJECTIVES
The objective of this lab is to understand various aspects of research, and the identification and use of
various statistical tests using software tools available to a researcher. Labs are structured to give
students experience with conducting experiments, analyzing data, thinking critically about
theory and data, and communicating their results and analysis in writing and oral presentation.
1 HOW TO RUN R:
R operates in two modes: interactive and batch. The one typically used is interactive mode. In this mode, you
type in commands, R displays results, you type in more commands, and so on. On the other hand, batch mode
does not require interaction with the user. It’s useful for production jobs, such as when a program must be run
periodically, say once per day, because you can automate the process.
Once an R session has started, you can execute R commands at the > prompt. The window in which all this
appears is called the R console. As a quick example, consider a standard normal distribution, that is, one
with mean 0 and variance 1. If a random variable X has that distribution, then its values are centered
around 0, some negative, some positive, averaging in the end to 0. Now form a new
random variable Y = |X|. Since we’ve taken the absolute value, the values of Y will not be centered around 0,
and the mean of Y will be positive. Let’s find the mean of Y. Our approach is based on a simulated example of
N(0,1) variates.
> mean(abs(rnorm(100)))
[1] 0.7194236
This code generates the 100 random variates, finds their absolute values, and then finds the mean of the absolute
values.
The [1] you see means that the first item in this line of output is item 1. In this case, our output consists of only
one line (and one item), so this is redundant. This notation becomes helpful when you need to read voluminous
output that consists of a lot of items spread over many lines. For example, if there were two rows of output with
six items per row, the second row would be labeled [7].
> rnorm(10)
[1] -0.6427784 -1.0416696 -1.4020476 -0.6718250 -0.9590894 -0.8684650
[7] -0.5974668 0.6877001 1.3577618 -2.2794378
Here, there are 10 values in the output, and the label [7] in the second row lets you quickly see that 0.6877001,
for instance, is the eighth output item.
You can also store R commands in a file. By convention, R code files
have the suffix .R or .r. If you create a code file called z.R, you can execute the contents of that file by issuing
the following command:
> source("z.R")
1.2 BATCH MODE
Sometimes it’s convenient to automate R sessions. For example, you may wish to run an R
script that generates a graph without needing to bother with manually launching R and
executing the script yourself. Here you would run R in batch mode.
As an example, let’s put our graph-making code into a file named z.R with the following
contents:
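A minimal sketch of z.R, reconstructed from the step-by-step breakdown that follows (the file name xh.pdf is taken from that breakdown):

```r
pdf("xh.pdf")     # set graphical output file
hist(rnorm(100))  # generate 100 N(0,1) variates and plot their histogram
dev.off()         # close the graphical output file
```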
The items marked with # are comments. They’re ignored by the R interpreter. Comments
serve as notes to remind us and others what the code is doing, in a human-readable format.
Here’s a step-by-step breakdown of what we’re doing in the preceding code:
• We call the pdf() function to inform R that we want the graph we create to be saved in the PDF
file xh.pdf.
• We call rnorm() (for random normal) to generate 100 N(0,1) random variates.
• We call hist() on those variates to draw a histogram of these values.
• We call dev.off() to close the graphical “device” we are using, which is
the file xh.pdf in this case. This is the mechanism that actually causes the file to be written to
disk.
We could run this code automatically, without entering R’s interactive mode, by invoking R
with an operating system shell command (such as at the $ prompt commonly used in Linux
systems):
$ R CMD BATCH z.R
You can confirm that this worked by using your PDF viewer to display the saved histogram.
(It will just be a plain-vanilla histogram, but R is capable of producing quite sophisticated
variations.)
INTRODUCTION TO FUNCTIONS
As in most programming languages, the heart of R programming consists of
writing functions. A function is a group of instructions that takes inputs, uses
them to compute other values, and returns a result.
As a simple introduction, let’s define a function named oddcount(), whose purpose is to count the odd
numbers in a vector of integers. Normally, we would compose the function code using a text editor and
save it in a file, but in this quick-and-dirty example, we’ll enter it line by line in R’s interactive mode. We
will then call the function on a couple of test cases.
> # counts the number of odd integers in x
> oddcount <- function(x) {
+    k <- 0  # assign 0 to k
+    for (n in x) {
+       if (n %% 2 == 1) k <- k+1  # %% is the modulo operator
+    }
+    return(k)
+ }
> oddcount(c(1,3,5))
[1] 3
> oddcount(c(1,2,3,7,9))
[1] 4
First, we told R that we wanted to define a function named oddcount with one argument, x. The left brace
demarcates the start of the body of the function. We wrote one R statement per line.
Until the body of the function is finished, R reminds you that you are still in the definition by using + as its
prompt, instead of the usual >. (Actually, + is a line-continuation character, not a prompt for a new input.)
R resumes the > prompt after you finally enter a right brace to conclude the function body.
After defining the function, we evaluated two calls to oddcount(). Since there are three odd numbers in
the vector (1,3,5), the call oddcount(c(1,3,5)) returns the value 3. There are four odd numbers in
(1,2,3,7,9), so the second call returns 4.
Notice that the modulo operator for remainder arithmetic is %% in R, as indicated by the comment. For
example, 38 divided by 7 leaves a remainder of 3:
> 38 %% 7
[1] 3
For instance, let’s see what happens with the following code:
for (n in x) {
   if (n %% 2 == 1) k <- k+1
}
First, it sets n to x[1], and then it tests that value for being odd or even. If the value is odd, which is the case
here, the count variable k is incremented. Then n is set to x[2], tested for being odd or even, and so on.
By the way, C/C++ programmers might be tempted to write the preceding loop like this:
for (i in 1:length(x)) {
   if (x[i] %% 2 == 1) k <- k+1
}
Here, length(x) is the number of elements in x. Suppose there are 25 elements. Then 1:length(x) means
1:25, which in turn means 1,2,3,...,25. This code would also work (unless x were to have length 0), but one
of the major themes of R programming is to avoid loops if possible; if not, keep loops simple. Look again
at our original formulation:
for (n in x) {
   if (n %% 2 == 1) k <- k+1
}
It’s simpler and cleaner, as we do not need to resort to using the length() function and array indexing.
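To illustrate the loop-avoidance theme, the same count can also be written with no explicit loop at all. This is only a sketch alongside the example, and the name oddcount_vec is hypothetical.

```r
# Vectorized alternative: x %% 2 == 1 yields a logical vector,
# and sum() counts its TRUE entries.
oddcount_vec <- function(x) {
  sum(x %% 2 == 1)
}

oddcount_vec(c(1, 3, 5))
oddcount_vec(c(1, 2, 3, 7, 9))
```

Unlike the 1:length(x) loop, this form also behaves sensibly for a vector of length 0.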
Near the end of the oddcount() code, we use the return() statement:
return(k)
This has the function return the computed value of k to the code that called it. However, simply writing
the following also works:
k
R functions will return the last value computed if there is no explicit return() call. However, this approach
must be used with care.
In programming language terminology, x is the formal argument (or formal parameter) of the function
oddcount(). In the first function call in the preceding example, c(1,3,5) is referred to as the actual
argument.
Let’s make a simple data set (in R parlance, a vector) consisting of the numbers 1, 2, and 4, and
name it x:
> x <- c(1,2,4)
The standard assignment operator in R is <-. You can also use =, but this is discouraged, as it
does not work in some special situations. Note that there are no fixed types associated with
variables. Here, we’ve assigned a vector to x, but later we might assign something of a
different type to it. We’ll look at vectors and the other types in Section 1.4.
The c stands for concatenate. Here, we are concatenating the numbers 1, 2, and 4. More
precisely, we are concatenating three one-element vectors that consist of those numbers. This is
because any number is also considered to be a one-element vector. Now we can also do the
following:
> q <- c(x,x,8)
which sets q to (1,2,4,1,2,4,8) (yes, including the duplicates).
Now let’s confirm that the data is
really in x. To print the vector to the screen, simply type its name. If you type any variable name
(or, more generally, any expression) while in interactive mode, R will print out the value of that
variable (or expression). Programmers familiar with other languages such as Python will find
this feature familiar. For our example, enter this:
> x
[1] 1 2 4
Yep, sure enough, x consists of the numbers 1, 2, and 4. Individual elements of a vector are
accessed via [ ]. Here’s how we can print out the third element of x:
> x[3]
[1] 4
As in other languages, the selector (here, 3) is called the index or subscript. Those familiar with
ALGOL-family languages, such as C and C++, should note that elements of R vectors are
indexed starting from 1, not 0. Subsetting is a very important operation on vectors. Here’s an
example:
> x <- c(1,2,4)
> x[2:3]
[1] 2 4
The expression x[2:3] refers to the subvector of x consisting of elements 2 through 3, which
are 2 and 4 here. We can easily find the mean and standard deviation of our data set, as
follows:
> mean(x)
[1] 2.333333
> sd(x)
[1] 1.527525
This again demonstrates typing an expression at the prompt in order to print it. In the first line,
our expression is the function call mean(x). The return value from that call is printed
automatically, without requiring a call to R’s print() function.
If we want to save the computed mean in a variable instead of just printing it to the screen, we
could execute this code:
> y <- mean(x)
Again, let’s confirm that y really does contain the mean of x:
> y
[1] 2.333333
Comments are especially valuable for documenting program code, but they are useful in
interactive sessions, too, since R records the command history (as discussed in Section 1.6). If
you save your session and resume it later, the comments can help you remember what you were
doing.
Finally, let’s do something with one of R’s internal data sets (these are
used for demos). You can get a list of these data sets by typing the following:
> data()
One of the data sets is called Nile and contains data on the flow of the Nile River. Let’s find
the mean and standard deviation of this data set:
> mean(Nile)
[1] 919.35
> sd(Nile)
[1] 169.2275
We can also plot a histogram of the data:
> hist(Nile)
A window pops up with the histogram in it, as shown in Figure 1-1. This graph is bare-bones simple, but R
has all kinds of optional bells and whistles for plotting. For instance, you can change the number of bins by
specifying the breaks variable. The call hist(z,breaks=12) would draw a histogram of the data set z with 12
bins. You can also create nicer labels, make use of color, and make many other changes to create a more
informative and eye appealing graph. When you become more familiar with R, you’ll be able to construct
complex, rich color graphics of striking beauty.
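As a hedged illustration of those options, the sketch below redraws the Nile histogram with a suggested bin count, a title, an axis label, and a fill color. The title, label, and color are arbitrary choices. Note that R treats breaks as a suggestion and may pick a nearby number of bins.

```r
# Nile is one of R's built-in data sets
h <- hist(Nile,
          breaks = 12,                      # suggested number of bins
          main = "Annual flow of the Nile", # plot title
          xlab = "Flow",                    # x-axis label
          col = "lightblue")                # fill color of the bars
length(h$counts)                            # number of bins actually drawn
```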
Well, that’s the end of our first, five-minute introduction to R. Quit R by calling the q() function
(or alternatively by pressing CTRL-D in Linux or CMD-D on a Mac):
> q()
Save workspace image? [y/n/c]: n
That last prompt asks whether you want to save your variables so that you can resume work
later. If you answer y, then all those objects will be loaded automatically the next time you run
R. This is a very important feature, especially when working with large or numerous data sets.
Answering y here also saves the session’s command history. We’ll talk more about saving your
workspace.
EXPERIMENT-2
THEORY:
With R, it’s important to understand that there is a difference between the actual
R object and the manner in which that R object is printed to the console. Often, the printed
output may have additional bells and whistles to make the output more friendly to the users.
However, these bells and whistles are not inherently part of the object.
R has five basic or “atomic” classes of objects:
• character
• numeric (real numbers)
• integer
• complex
• logical (TRUE/FALSE)
The most basic type of R object is a vector. Empty vectors can be created with the
vector() function. There is really only one rule about vectors in R, which is that a vector can
only contain objects of the same class. But of course, like any good rule, there is an
exception, which is a list, which we will get to a bit later. A list is represented as a vector but
can contain objects of different classes. Indeed, that’s usually why we use them.
There is also a class for “raw” objects, but they are not commonly used directly in data
analysis.
1.3 CREATING VECTORS
The c() function can be used to create vectors of objects by concatenating things together.
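A minimal sketch of such calls, one for each atomic class (the values are arbitrary illustrations):

```r
x <- c(0.5, 0.6)       # numeric
x <- c(TRUE, FALSE)    # logical
x <- c(T, F)           # logical, using the shorthand
x <- c("a", "b", "c")  # character
x <- 9:29              # integer
x <- c(1+0i, 2+4i)     # complex
class(x)               # class of the last assignment
```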
Note that in the above example, T and F are short-hand ways to specify TRUE and FALSE.
However, in general one should try to use the explicit TRUE and FALSE values when
indicating logical values. The T and F values are primarily there for when you’re feeling
lazy.
You can also use the vector() function to initialize vectors.
> x <- vector("numeric", length = 10)
> x
[1] 0 0 0 0 0 0 0 0 0 0
To calculate the frequency of each value in the State vector, you can use the table function.
To calculate the mean of a vector, you can use the mean function.
If the vector contains an NA (not available) value, the mean function returns NA.
To calculate the mean of a vector excluding NA values, you can include the na.rm = TRUE
parameter in the mean function.
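These three steps can be sketched as follows. The vectors State and x are made-up examples, not data from this lab.

```r
# Hypothetical data for illustration
State <- c("Delhi", "Mumbai", "Delhi", "Goa")
x <- c(10, 20, NA, 30)

table(State)           # frequency of each distinct value
mean(x)                # NA, because x contains a missing value
mean(x, na.rm = TRUE)  # mean of the non-missing values 10, 20 and 30
```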
To convert a variable to numeric, you can use the as.numeric function.
data$x = as.numeric(data$x)
Some useful vectors can be created quickly with R. The colon operator is used to generate
integer sequences:
> 1:10
[1] 1 2 3 4 5 6 7 8 9 10
> -3:4
[1] -3 -2 -1 0 1 2 3 4
> 9:5
[1] 9 8 7 6 5
More generally, the function seq() can generate any arithmetic progression.
> seq(2, 6, by=0.4)
[1] 2.0 2.4 2.8 3.2 3.6 4.0 4.4 4.8 5.2 5.6 6.0
> rep(5,3)
[1] 5 5 5
> rep(2:5,each=3)
[1] 2 2 2 3 3 3 4 4 4 5 5 5
> rep(-1:3, length.out=10)
[1] -1 0 1 2 3 -1 0 1 2 3
> 2^(0:10)
[1] 1 2 4 8 16 32 64 128 256 512 1024
LISTS:
A list allows you to store a variety of objects.
You can use subscripts to select the specific component of the list.
> x <- list(1:3, TRUE, "Hello", list(1:2, 5))
> x[[3]]
[1] "Hello"
> x[c(1,3)]
[[1]]
[1] 1 2 3
[[2]]
[1] "Hello"
We can also name some or all of the entries in our list, by supplying argument names to list():
> x <- list(y=1:3, TRUE, z="Hello")
> x
$y
[1] 1 2 3
[[2]]
[1] TRUE
$z
[1] "Hello"
Notice that the [[1]] has been replaced by $y, which gives us a clue as to
how we can recover the entries by their name. We can still use the numeric
position if we prefer:
> x$y
[1] 1 2 3
> x[[1]]
[1] 1 2 3
The function names() can be used to obtain a character vector of all the
names of the components of a list:
> names(x)
EXPERIMENT-3
2.1 THEORY:
Matrices are much used in statistics, and so play an important role in R. To create a matrix
use the function matrix(), specifying elements by column first:
> matrix(1:12, nrow=3, ncol=4)
     [,1] [,2] [,3] [,4]
[1,]    1    4    7   10
[2,]    2    5    8   11
[3,]    3    6    9   12
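This is called column-major order. Of course, we need only give one of the dimensions:
> matrix(1:12, nrow=3)
The values are recycled if there are too few of them to fill the matrix:
> matrix(1:3, nrow=3, ncol=4)
     [,1] [,2] [,3] [,4]
[1,]    1    1    1    1
[2,]    2    2    2    2
[3,]    3    3    3    3
There are special functions for constructing certain matrices:
> diag(3)
     [,1] [,2] [,3]
[1,]    1    0    0
[2,]    0    1    0
[3,]    0    0    1
> diag(1:3)
     [,1] [,2] [,3]
[1,]    1    0    0
[2,]    0    2    0
[3,]    0    0    3
> 1:5 %o% 1:5
     [,1] [,2] [,3] [,4] [,5]
[1,]    1    2    3    4    5
[2,]    2    4    6    8   10
[3,]    3    6    9   12   15
[4,]    4    8   12   16   20
[5,]    5   10   15   20   25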
The last operator performs an outer product, so it creates a matrix with (i, j)-th entry x_i y_j.
The function outer() generalizes this to any function f on two arguments, to create a matrix
with entries f(x_i, y_j). (More on functions later.)
> outer(1:3, 1:4, "+")
     [,1] [,2] [,3] [,4]
[1,]    2    3    4    5
[2,]    3    4    5    6
[3,]    4    5    6    7
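As a small illustration of passing an arbitrary two-argument function to outer(), the sketch below builds a table of powers. The choice of x^y and the name pow are arbitrary.

```r
# Entry (i, j) is f(x_i, y_j) with f(x, y) = x^y
pow <- outer(1:3, 1:4, function(x, y) x^y)
pow
```

The function passed to outer() must be vectorized, because outer() calls it once on the expanded vectors rather than once per pair.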
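Matrix multiplication is performed using the operator %*%, which is quite
distinct from elementwise multiplication *.
> A <- matrix(c(1:8,10), 3, 3)
> x <- c(1,2,3)
> A %*% x # matrix multiplication
     [,1]
[1,]   30
[2,]   36
[3,]   45
> A*x # NOT matrix multiplication
     [,1] [,2] [,3]
[1,]    1    4    7
[2,]    4   10   16
[3,]    9   18   30
> t(A) # transpose
     [,1] [,2] [,3]
[1,]    1    2    3
[2,]    4    5    6
[3,]    7    8   10
> det(A) # determinant
[1] -3
> diag(A) # diagonal
[1] 1 5 10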
2.2 ARRAY:
Of course, if we have a data set consisting of more than two pieces of categorical information
about each subject, then a matrix is not sufficient. The generalization of matrices to higher
dimensions is the array. Arrays are defined much like matrices, with a call to the array()
command. Here is a 2 × 3 × 3 array:
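A minimal sketch of a call that builds the array printed below, assuming it is filled with the integers 1 to 18 in column-major order:

```r
# dim gives the extent of each of the three dimensions
arr <- array(1:18, dim = c(2, 3, 3))
```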
> arr
, , 1
     [,1] [,2] [,3]
[1,]    1    3    5
[2,]    2    4    6
, , 2
     [,1] [,2] [,3]
[1,]    7    9   11
[2,]    8   10   12
, , 3
     [,1] [,2] [,3]
[1,]   13   15   17
[2,]   14   16   18
Each 2-dimensional slice defined by the last co-ordinate of the array is shown as a 2 × 3
matrix. Note that we no longer specify the number of rows and columns separately, but use a
single vector dim whose length is the number of dimensions. You can recover this vector
with the dim() function.
> dim(arr)
[1] 2 3 3
Arrays can be subsetted and modified in exactly the same way as a matrix, only using
the appropriate number of co-ordinates.
> arr[1,2,3]
[1] 15
> arr[,2,]
     [,1] [,2] [,3]
[1,]    3    9   15
[2,]    4   10   16
> arr[,,1,drop=FALSE]
, , 1
     [,1] [,2] [,3]
[1,]    0    3    5
[2,]    2    4    6
Factors
R has a special data structure to store categorical variables. It tells R that a variable is
nominal or ordinal by making it a factor.
data$x = as.factor(data$x)
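A minimal sketch of the conversion on a hypothetical data frame (the column name x and its values are made up):

```r
data <- data.frame(x = c("low", "high", "low", "medium"),
                   stringsAsFactors = FALSE)

data$x <- as.factor(data$x)  # convert the character column to a factor
levels(data$x)               # distinct categories, sorted alphabetically
table(data$x)                # count of each category
```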
EXPERIMENT-4
Aim: To Create Sample (Dummy) Data in R and perform data manipulation with R
2.3 THEORY:
This covers how to execute most frequently used data manipulation tasks with R. It
includes various examples with datasets and code. It gives you a quick look at several
functions used in R.
# for multiple
# OR
> DF[keeps]
> DF
name=c('a','b','c','d','e','e'),
marks=c(44,55,22,33,66,77))
> d3
d3[order(d3$roll), ]
OR
d3[with(d3,order(roll)),]
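The fragments above can be pieced together into a runnable sketch. The roll values and the keeps vector are assumptions chosen for illustration; only the name and marks columns come from the text.

```r
# roll values are hypothetical; name and marks are as in the text
d3 <- data.frame(roll  = c(3, 1, 5, 2, 6, 4),
                 name  = c('a', 'b', 'c', 'd', 'e', 'e'),
                 marks = c(44, 55, 22, 33, 66, 77))

keeps <- c("roll", "marks")    # columns to retain (hypothetical choice)
d3[keeps]

d3[order(d3$roll), ]           # sort rows by roll, ascending
d3[with(d3, order(roll)), ]    # same result using with()
```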
2.6 SUBSETS:
roll=c(1:5)
names=c(letters[1:5])
marks=c(12,33,44,55,66)
d4=data.frame(roll,names,marks)
sub1=subset(d4,marks>33 & roll>4)
sub1
sub1=subset(d4,marks>33 & roll>4,select = c(roll,names))
sub1
d$class=c(1,2,1,2,1,2)
table(cls)
In this example, we are replacing 1 with 6 in Q1 variable
mydata$Q1[mydata$Q1==1] <- 6
In this example, we are replacing "Delhi" with "Mumbai" in State variable. We need
to convert the variable from factor to character.
mydata$State = as.character(mydata$State)
mydata$State[mydata$State=='Delhi'] <- 'Mumbai'
2.13 SORTING
Sorting is one of the most common data manipulation tasks. It is generally used when
we want to see the top 5 highest / lowest values of a variable.
2.14 SORTING A VECTOR
x= sample(1:50)
x = sort(x, decreasing = TRUE)
The function sort() is used for sorting a one-dimensional vector. It cannot be used on
objects with more than one dimension.
Note : "-" sign before mydata$SAT tells R to sort SAT variable in descending order.
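The note refers to sorting a data frame with order(). A minimal sketch, assuming a hypothetical mydata with a numeric SAT column (the values are made up):

```r
mydata <- data.frame(id = 1:4, SAT = c(1200, 1450, 990, 1310))

# order() returns row positions; negating SAT reverses the sort order
mydata.sorted <- mydata[order(-mydata$SAT), ]
mydata.sorted
```

The minus sign only works for numeric columns; for other types, order() also accepts decreasing = TRUE.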
data2=data.frame(roll=c(1,2,3,5),
marks=c(20,25,43,60))
2.18 CONCLUSION:
2.19 DATA2
result=intersect(data1$roll, data2$roll)
result
result=merge(data1,data2,all=FALSE)
result
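A self-contained sketch of the two operations. data1 is not shown in the text, so the version below is a hypothetical stand-in; data2 matches the definition given earlier.

```r
# data1 is hypothetical; data2 is as defined in the text
data1 <- data.frame(roll = c(1, 2, 3, 4),
                    name = c('a', 'b', 'c', 'd'))
data2 <- data.frame(roll  = c(1, 2, 3, 5),
                    marks = c(20, 25, 43, 60))

result <- intersect(data1$roll, data2$roll)  # roll numbers present in both
result

result <- merge(data1, data2, all = FALSE)   # inner join on the shared column roll
result
```

With all = FALSE, merge() keeps only the rows whose roll appears in both data frames.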
EXPERIMENT-5
Aim: Study and implementation of various control structures in R
THEORY:
A loop helps you repeat the same operation on different variables, columns, or
datasets. For example, suppose you want to multiply each variable by 5. Instead of multiplying
each variable one by one, you can perform this task in a loop. Its main benefit is to reduce
the duplication in your code, which makes later changes easier.
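A minimal sketch of that idea, using a hypothetical data frame with three numeric columns:

```r
df <- data.frame(a = 1:3, b = 4:6, c = 7:9)

# Multiply every column by 5 without writing one statement per column
for (col in names(df)) {
  df[[col]] <- df[[col]] * 5
}
df
```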
If-else statements are an important part of R programming. In this tutorial, we will see
various ways to apply conditional statements (if, else, and nested if) in R. R also has many
powerful packages for data manipulation. In the later part of this tutorial, we will see how
if-else statements are used in popular packages.
SAMPLE DATA
Let's create sample data to show how to use the if-else functions. This data frame is
used in the examples that follow.
x1 x2 x3
1 129 A
3 178 B
5 140 C
7 186 D
9 191 E
11 104 F
13 150 G
15 183 H
17 151 I
19 142 J
set.seed(123)
mydata = data.frame(x1 = seq(1,20,by=2),
x2 = sample(100:200,10,FALSE),
x3 = LETTERS[1:10])
x1 = seq(1,20,by=2) : The variable 'x1' contains the odd numbers from 1 to 19. In total,
these are 10 numeric values.
x2 = sample(100:200,10,FALSE) : The variable 'x2' consists of 10 non-repeating random
numbers between 100 and 200.
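The ifelse() function in R works like the IF function in MS Excel. Its syntax is ifelse(test, yes, no):
where test is TRUE the result takes the value from yes, otherwise the value from no.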
Suppose you are asked to create a binary variable - 1 or 0 based on the variable 'x2'. If value of a
variable 'x2' is greater than 150, assign 1 else 0.
mydata$x4 = ifelse(mydata$x2>150,1,0)
In this case, it creates a variable x4 on the same data frame 'mydata'. The output is shown in the
image below -
APPLY IFELSE() ON CHARACTER VARIABLES
If variable 'x3' contains character values - 'A', 'D', the variable 'x1' should be multiplied by 2.
Otherwise it should be multiplied by 3.
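A minimal sketch of one way to write this rule, using %in% to test for 'A' or 'D'. To keep the example reproducible, x2 is typed in from the sample data table instead of being re-sampled.

```r
# x2 is copied from the sample data table rather than generated with sample()
mydata <- data.frame(x1 = seq(1, 20, by = 2),
                     x2 = c(129, 178, 140, 186, 191, 104, 150, 183, 151, 142),
                     x3 = LETTERS[1:10])

# Rows where x3 is "A" or "D" get x1 * 2; all other rows get x1 * 3
mydata$y <- ifelse(mydata$x3 %in% c("A", "D"), mydata$x1 * 2, mydata$x1 * 3)
mydata
```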
x1 x2 x3 y
1 129 A 2
3 178 B 9
5 140 C 15
7 186 D 14
9 191 E 27
11 104 F 33
13 150 G 39
15 183 H 45
17 151 I 51
19 142 J 57
EXPERIMENT-6
Aim: Data Manipulation with dplyr package
THEORY:
The dplyr package is one of the most powerful and popular packages in R. It was
written by Hadley Wickham, a well-known R programmer who has written many useful R
packages such as ggplot2 and tidyr. This tutorial includes several examples and tips on how to use
the dplyr package for cleaning and transforming data. It is a tutorial on data manipulation
and data wrangling with R.
WHAT IS DPLYR?
dplyr is a powerful R package to manipulate, clean and summarize data.
In short, it makes data exploration and data manipulation easy and fast in R.
People have been using SQL to analyze data for decades. Most modern data analysis
software, such as Python, R and SAS, supports SQL commands. But SQL was never designed to
perform data analysis. It was rather designed for querying and managing data. There are many
data analysis operations where SQL fails or makes simple things difficult, for example
calculating the median for multiple variables or converting wide format data to long format.
The dplyr package, by contrast, was designed for data analysis.
The names of dplyr functions are similar to SQL commands, such as select() for selecting
variables, group_by() for grouping data by a grouping variable, and the join functions, including
inner_join() and left_join(), for joining two data sets. It also supports the kind of sub-queries
for which SQL is popular.
To install dplyr package, type the command below
install.packages("dplyr")
To load dplyr package, type the command below
library(dplyr)
In this tutorial, we are using the following data which contains income generated by states from
year 2002 to 2015. Note : This data do not contain actual income figures of the states.
This dataset contains 51 observations (rows) and 16 variables (columns). The snapshot of first 6
rows of the dataset is shown below.
Index State Y2002 Y2003 Y2004 Y2005 Y2006 Y2007 Y2008 Y2009 Y2010 Y2011 Y2012 Y2013 Y2014 Y2015
Submit the following code. Change the file path in the code below.
mydata = read.csv("C:\\Users\\Deepanshu\\Documents\\sampledata.csv")
The sample_n function selects random rows from a data frame (or table). The second
parameter of the function tells R the number of rows to select.
sample_n(mydata,3)
Index State Y2002 Y2003 Y2004 Y2005 Y2006 Y2007 Y2008 Y2009
33 N New York 1395149 1611371 1170675 1446810 1426941 1463171 1732098 1426216
The sample_frac function returns randomly N% of rows. In the example below, it returns
randomly 10% of rows.
sample_frac(mydata,0.1)
The distinct function is used to eliminate duplicate rows.
x1 = distinct(mydata)
In this dataset, there is not a single duplicate row, so it returned the same number of rows as
in mydata.
The .keep_all argument is used to retain all other variables in the output data frame.
In the example below, we are using two variables - Index, Y2010 to determine uniqueness.
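Both cases can be sketched on a small made-up data frame. The data frame df, its values, and the result names x2 and x3 are hypothetical; the column names mirror the dataset.

```r
library(dplyr)

df <- data.frame(Index = c("A", "A", "C"),
                 Y2010 = c(10, 10, 30),
                 Y2011 = c(1, 2, 3))

# One row per distinct Index; .keep_all = TRUE keeps the other columns
x2 <- distinct(df, Index, .keep_all = TRUE)

# Uniqueness judged on the pair (Index, Y2010)
x3 <- distinct(df, Index, Y2010, .keep_all = TRUE)
```

Without .keep_all = TRUE, distinct() returns only the columns used to determine uniqueness.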
SELECT( ) FUNCTION
select() syntax : select(data ,....)
data : Data Frame
.... : Variables by name or by function
EXAMPLE- SELECTING VARIABLES (OR COLUMNS)
Suppose you are asked to select only a few variables. The code below selects variables "Index",
columns from "State" to "Y2008".
mydata2 = select(mydata, Index, State:Y2008)
The following functions help you to select variables based on their names.
Helpers          Description
starts_with()    Starts with a prefix
ends_with()      Ends with a suffix
contains()       Contains a literal string
matches()        Matches a regular expression
num_range()      Numerical range like x01, x02, x03
one_of()         Variables in a character vector
everything()     All variables
The code below keeps variable 'State' in the front and the remaining variables follow that.
[1] "State" "Index" "Y2002" "Y2003" "Y2004" "Y2005" "Y2006" "Y2007" "Y2008" "Y2009"
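The reordering described above can be sketched as follows. The data frame df and its values are made up; everything() is the dplyr selection helper listed in the table above.

```r
library(dplyr)

df <- data.frame(Index = c("A", "C"),
                 State = c("Alabama", "California"),
                 Y2002 = c(1, 2))

# State first, followed by all remaining columns in their original order
df2 <- select(df, State, everything())
names(df2)
```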
RENAME( ) FUNCTION
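The rename() function changes column names using the form new_name = old_name. A minimal sketch on a made-up data frame (df and the new name Index1 are illustrative choices):

```r
library(dplyr)

df <- data.frame(Index = c("A", "C"), Y2002 = c(1, 2))

# New name on the left, existing name on the right
df2 <- rename(df, Index1 = Index)
names(df2)
```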
FILTER( ) FUNCTION
Suppose you need to subset data. You want to filter rows and retain only those values in which
Index is equal to A.
mydata7 = filter(mydata, Index == "A")
Index State Y2002 Y2003 Y2004 Y2005 Y2006 Y2007 Y2008 Y2009
The %in% operator can be used to select multiple items. In the following program, we are
telling R to select rows against 'A' and 'C' in column 'Index'.
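A minimal sketch of filtering with %in%, on a made-up data frame (df, sub, and the values are hypothetical):

```r
library(dplyr)

df <- data.frame(Index = c("A", "B", "C", "D"), Y2002 = c(1, 2, 3, 4))

# Keep the rows whose Index is either "A" or "C"
sub <- filter(df, Index %in% c("A", "C"))
sub
```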
Suppose you need to apply 'AND' condition. In this case, we are picking data for 'A' and 'C' in
the column 'Index' and income greater than 1.3 million in Year 2002.
mydata8 = filter(mydata6, Index %in% c("A", "C") & Y2002 >= 1300000 )
The '|' denotes OR in the logical condition. It means any of the two conditions.
mydata9 = filter(mydata6, Index %in% c("A", "C") | Y2002 >= 1300000)
EXAMPLE- NOT CONDITION
The "!" sign is used to reverse the logical condition.
mydata10 = filter(mydata6, !Index %in% c("A", "C"))
EXAMPLE-CONTAINS CONDITION
The grepl function is used to search for pattern matching. In the following code, we are
looking for records wherein column state contains 'Ar' in their name.
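A minimal sketch of a pattern-based filter, on a made-up data frame (df, sub, and the state names chosen are illustrative):

```r
library(dplyr)

df <- data.frame(State = c("Arizona", "Arkansas", "Texas"), Y2002 = c(1, 2, 3))

# grepl() returns TRUE for each State value that contains "Ar"
sub <- filter(df, grepl("Ar", State))
sub
```

The match is case-sensitive by default, so "Ar" does not match a lowercase "ar" inside a name.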
SUMMARISE( ) FUNCTION
In the example below, we are calculating mean and median for the variable Y2015.
In the following example, we are calculating number of records, mean and median for
variables Y2005 and Y2006. The summarise_at function allows us to select multiple
variables by their names.
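Both summaries can be sketched on a small made-up data frame. The data frame df, its values, and the result names are hypothetical. The sketch passes a named list of functions to summarise_at, since the funs() helper is deprecated in recent dplyr versions.

```r
library(dplyr)

df <- data.frame(Y2005 = c(1, 2, 3, 10),
                 Y2006 = c(2, 4, 6, 8),
                 Y2015 = c(5, 7, 9, 11))

# Mean and median of a single variable
s1 <- summarise(df, Y2015_mean = mean(Y2015), Y2015_med = median(Y2015))

# Several variables at once; output columns are named <variable>_<function>
s2 <- summarise_at(df, vars(Y2005, Y2006), list(mean = mean, median = median))

# Number of records
s3 <- summarise(df, n = n())
```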
EXAMPLE- SUMMARIZE WITH CUSTOM FUNCTIONS
We can also use custom functions in the summarise function. In this case, we are computing
the number of records, number of missing values, mean and median for variables Y2011 and
Y2012. The dot (.) denotes each variable specified in the second argument of the function.
summarise_at(mydata, vars(Y2011, Y2012),
funs(n(), missing = sum(is.na(.)), mean(., na.rm = TRUE), median(.,na.rm = TRUE)))
Suppose you want to subtract mean from its original value and then calculate variance of it.
set.seed(222)
mydata <- data.frame(X1=sample(1:100,100), X2=runif(100))
summarise_at(mydata,vars(X1,X2), function(x) var(x - mean(x)))
X1 X2
1 841.6667 0.08142161
ALTERNATIVE METHOD :
We are checking the number of levels/categories and count of missing observations
in a categorical (factor) variable.
summarise_all(mydata["Index"], funs(nlevels(.), nmiss=sum(is.na(.))))
nlevels nmiss
1 19 0
ARRANGE( ) FUNCTION
Use : Sort data by one or more variables
The default sorting order of arrange() function is ascending. In this example, we are
sorting data by multiple variables.
arrange(mydata, Index, Y2011)
Suppose you need to sort one variable in descending order and another in ascending
order.
arrange(mydata, desc(Index), Y2011)
It is important to understand the pipe (%>%) operator before knowing the other
functions of dplyr package. dplyr utilizes pipe operator from another package
(magrittr).
It allows you to write sub-queries as we do in SQL.
Note : All the functions in the dplyr package can be used without the pipe operator. The
question arises, "Why use the pipe operator %>%?" The answer is that it lets you chain multiple
functions together.
SYNTAX :
filter(data_frame, variable == value)
or
data_frame %>% filter(variable == value)
The %>% is NOT restricted to filter function. It can be used with any function.
EXAMPLE :
The code below demonstrates the usage of pipe %>% operator. In this example, we are
selecting 10 random observations of two variables "Index" "State" from the data frame
"mydata".
dt = sample_n(select(mydata, Index, State),10)
or
dt = mydata %>% select(Index, State) %>% sample_n(10)
GROUP_BY() FUNCTION :
SYNTAX :
group_by(data, variables)
or
data %>% group_by(variables)
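A minimal sketch of group_by() combined with summarise(), on a made-up data frame (df, by_index, and the values are hypothetical):

```r
library(dplyr)

df <- data.frame(Index = c("A", "A", "C"), Y2011 = c(10, 20, 30))

# Split by Index, then compute one summary row per group
by_index <- df %>%
  group_by(Index) %>%
  summarise(n = n(), Y2011_mean = mean(Y2011))
by_index
```

group_by() on its own does not change the data; it only marks the groups that later verbs such as summarise() operate on.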