R Programming
Hadoop & R Programming
Strings
Any value written within a pair of single or double quotes in R is treated as a string.
Internally, R stores every string within double quotes, even when you create it with single
quotes.
•The quotes at the beginning and end of a string must both be double quotes or both be single
quotes. They cannot be mixed.
•Double quotes can be inserted into a string that starts and ends with single quotes.
•Single quotes can be inserted into a string that starts and ends with double quotes.
•Double quotes cannot be inserted into a string that starts and ends with double quotes unless they are escaped with a backslash (\").
•Single quotes cannot be inserted into a string that starts and ends with single quotes unless they are escaped with a backslash (\').
Examples of Valid Strings
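A minimal sketch of strings that follow the quoting rules; the variable names s1 to s4 are illustrative only and not taken from the slides.

```r
# Single quotes at both ends.
s1 <- 'Start and end with single quotes'
print(s1)

# Double quotes at both ends.
s2 <- "Start and end with double quotes"
print(s2)

# A single quote inside a double-quoted string.
s3 <- "Single quote ' in between double quotes"
print(s3)

# A double quote inside a single-quoted string.
s4 <- 'Double quote " in between single quotes'
print(s4)
```

Note that print() always shows the string in double quotes and escapes any embedded double quote.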
Examples of Invalid Strings
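Invalid strings stop a script at parse time, so one way to demonstrate them safely is to hand the code to parse() as text. The helper parses() below is a hypothetical name introduced only for this sketch.

```r
# Hypothetical helper: TRUE if the text parses as R code, FALSE otherwise.
parses <- function(code) {
  tryCatch({
    parse(text = code)
    TRUE
  }, error = function(e) FALSE)
}

# Mixed quotes: opens with a single quote, closes with a double quote.
print(parses("x <- 'mixed quotes\""))

# Unescaped single quote inside a single-quoted string.
print(parses("x <- 'single ' inside single quotes'"))

# Unescaped double quote inside a double-quoted string.
print(parses("x <- \"double \" inside double quotes\""))

# Escaping the inner quote with a backslash makes the string valid.
print(parses("x <- \"double \\\" inside double quotes\""))
```

The first three calls should report FALSE and the last one TRUE.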
String Manipulation
Concatenating Strings - paste() function
Many strings in R are combined using the paste() function. It can take any number of
arguments to be combined together.
Syntax
The basic syntax for paste function is −
paste(..., sep = " ", collapse = NULL)
Following is the description of the parameters used −
1. ... represents any number of arguments to be combined.
2. sep is the separator placed between the arguments. It is optional and defaults to a single space.
3. collapse is an optional string used to join the elements of the resulting vector into one
string. It does not affect the spaces inside an individual string.
Program:
a <- "Hello"
b <- 'How'
c <- "are you? "
print(paste(a,b,c))
print(paste(a,b,c, sep = "-"))
print(paste(a,b,c, sep = "", collapse = ""))
Program output:
[1] "Hello How are you? "
[1] "Hello-How-are you? "
[1] "HelloHoware you? "
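The effect of collapse is easier to see when paste() receives a vector with several elements. The following is a small sketch; the vector words and the names joined and labels are illustrative assumptions.

```r
words <- c("Hello", "How", "are", "you?")

# Only one vector is passed, so sep has nothing to separate;
# the result is still a vector of four strings.
print(paste(words, sep = "-"))

# collapse joins the four elements into a single string.
joined <- paste(words, collapse = " ")
print(joined)

# sep and collapse together: pair up the elements first, then join the pairs.
labels <- paste(c("a", "b", "c"), 1:3, sep = "_", collapse = ", ")
print(labels)
```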
Formatting numbers & strings - format() function
Numbers and strings can be formatted to a specific style using the format() function.
Syntax
The basic syntax for format function is −
format(x, digits, nsmall, scientific, width, justify = c("left", "right", "centre", "none"))
Following is the description of the parameters used −
x is the vector input.
digits is the number of significant digits to display.
nsmall is the minimum number of digits to the right of the decimal point.
scientific is set to TRUE to display scientific notation.
width is the minimum width of the result; shorter values are padded with blanks.
justify is the alignment of strings: left, right or centre.
# Total number of digits displayed. Last digit rounded off.
result <- format(23.123456789, digits = 9)
print(result)
# Display numbers in scientific notation.
result <- format(c(6, 13.14521), scientific = TRUE)
print(result)
# The minimum number of digits to the right of the decimal point.
result <- format(23.47, nsmall = 5)
print(result)
# Format treats everything as a string.
result <- format(6)
print(result)
# Numbers are padded with blank in the beginning for width.
result <- format(13.7, width = 6)
print(result)
# Left justify strings.
result <- format("Hello", width = 8, justify = "l")
print(result)
# Justify string with center.
result <- format("Hello", width = 8, justify = "c")
print(result)
Program output:
[1] "23.1234568"
[1] "6.000000e+00" "1.314521e+01"
[1] "23.47000"
[1] "6"
[1] "  13.7"
[1] "Hello   "
[1] " Hello  "
4. Changing the case - toupper() & tolower() functions
Syntax
toupper(x)
tolower(x)
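Example
# Changing to upper case (input string reconstructed from the output).
result <- toupper("Changing To Upper")
print(result)
# Changing to lower case (input string is illustrative).
result <- tolower("Changing To Lower")
print(result)
Program output:
[1] "CHANGING TO UPPER"
[1] "changing to lower"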
5. Extracting parts of a string - substring() function
Syntax
substring(x,first,last)
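Example:
# Extract characters from the 5th to the 7th position (input string reconstructed from the output).
result <- substring("Extract", 5, 7)
print(result)
Program output:
[1] "act"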
Vectors?
A. Vectors are the most basic R data objects and there are six types of atomic vectors. They are
logical, integer, double, complex, character and raw.
Vector Creation
Even when you write just one value in R, it becomes a vector of length 1 and belongs to one of
the above vector types.
# Atomic vector of type character.
print("abc")
# Atomic vector of type double.
print(12.5)
# Atomic vector of type integer.
print(63L)
# Atomic vector of type logical.
print(TRUE)
# Atomic vector of type complex.
print(2+3i)
# Atomic vector of type raw.
print(charToRaw('hello'))
When we execute the above code, it produces the following result −
[1] "abc"
[1] 12.5
[1] 63
[1] TRUE
[1] 2+3i
[1] 68 65 6c 6c 6f
Multiple Elements Vector
Using colon operator with numeric data
# Creating a sequence from 5 to 13.
v <- 5:13
print(v)
# Creating a sequence from 6.6 to 12.6.
v <- 6.6:12.6
print(v)
# If the final element specified does not belong to the sequence then it is discarded.
v <- 3.8:11.4
print(v)
When we execute the above code, it produces the following result −
[1] 5 6 7 8 9 10 11 12 13
[1] 6.6 7.6 8.6 9.6 10.6 11.6 12.6
[1] 3.8 4.8 5.8 6.8 7.8 8.8 9.8 10.8
print(seq(5, 9, by = 0.4))
[1] 5.0 5.4 5.8 6.2 6.6 7.0 7.4 7.8 8.2 8.6 9.0
The non-character values are coerced to character type if one of the elements is a character.
s <- c('apple','red',5,TRUE)
print(s)
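The coercion rule can be checked with class(). This is a small sketch with illustrative variable names (v1, v2, v3); it also shows that converting back to numbers only works for elements that look like numbers.

```r
# Logical and numeric together: TRUE becomes 1 and the vector is numeric.
v1 <- c(1.5, TRUE)
print(class(v1))

# Any character element turns the whole vector into character.
v2 <- c('apple', 5, TRUE)
print(v2)
print(class(v2))

# as.numeric() converts back where possible; "apple" and "TRUE" become NA
# (R would normally warn about this, so the warning is suppressed here).
v3 <- suppressWarnings(as.numeric(v2))
print(v3)
```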
Accessing Vector Elements
•The [ ] brackets are used for indexing. Indexing starts with position 1.
•Giving a negative value in the index drops that element from result.
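# Define the vector (reconstructed from the output; the fifth element is assumed).
t <- c("Sun","Mon","Tue","Wed","Thurs","Fri","Sat")
# Accessing vector elements using position.
print(t[c(2,3,6)])
# Accessing vector elements using logical indexing.
print(t[c(TRUE,FALSE,FALSE,FALSE,FALSE,TRUE,FALSE)])
# Accessing vector elements using negative indexing.
x <- t[c(-2,-5)]
print(x)
# Accessing vector elements using 0/1 indexing.
y <- t[c(0,0,0,0,0,0,1)]
print(y)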
When we execute the above code, it produces the following result −
[1] "Mon" "Tue" "Fri"
[1] "Sun" "Fri"
[1] "Sun" "Tue" "Wed" "Fri" "Sat"
[1] "Sun"
Vector Manipulation
Vector arithmetic
Two vectors of same length can be added, subtracted, multiplied or divided giving the result as a
vector output.
# Create two vectors.
v1 <- c(3,8,4,5,0,11)
v2 <- c(4,11,0,8,1,2)
# Vector addition.
add.result <- v1+v2
print(add.result)
# Vector subtraction.
sub.result <- v1-v2
print(sub.result)
# Vector multiplication.
multi.result <- v1*v2
print(multi.result)
# Vector division.
divi.result <- v1/v2
print(divi.result)
Vector Element Recycling
If we apply arithmetic operations to two vectors of unequal length, then the elements of the
shorter vector are recycled to complete the operations.
v1 <- c(3,8,4,5,0,11)
v2 <- c(4,11)
# v2 is recycled to c(4,11,4,11,4,11).
add.result <- v1+v2
print(add.result)
sub.result <- v1-v2
print(sub.result)
When we execute the above code, it produces the following result −
[1] 7 19 8 16 4 22
[1] -1 -3 0 -6 -4 0
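When the longer length is not a multiple of the shorter one, R still recycles but also issues a warning. The sketch below uses illustrative vectors and catches the warning so that the message and the result are both visible.

```r
v1 <- c(1, 2, 3, 4, 5)
v2 <- c(10, 20)

# v2 is recycled to c(10, 20, 10, 20, 10). Because 5 is not a multiple
# of 2, R emits a warning, which is reported and then muffled here.
result <- withCallingHandlers(
  v1 + v2,
  warning = function(w) {
    message("Warning caught: ", conditionMessage(w))
    invokeRestart("muffleWarning")
  }
)
print(result)
```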
Vector Element Sorting
Elements in a vector can be sorted using the sort() function.
v <- c(3,8,4,5,0,11, -9, 304)
# Sort the elements of the vector.
sort.result <- sort(v)
print(sort.result)
# Sort the elements in the reverse order.
revsort.result <- sort(v, decreasing = TRUE)
print(revsort.result)
# Sorting character vectors.
v <- c("Red","Blue","yellow","violet")
sort.result <- sort(v)
print(sort.result)
# Sorting character vectors in reverse order.
revsort.result <- sort(v, decreasing = TRUE)
print(revsort.result)
Creating a List
The following example creates a list containing strings, numbers, a vector and a logical value.
# Create a list containing strings, numbers, a vector and a logical value.
list_data <- list("Red", "Green", c(21,32,11), TRUE, 51.23, 119.1)
print(list_data)
When we execute the above code, it produces the following result −
The list elements can be given names and they can be accessed using these names.
print(list_data)
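Naming and accessing list elements can be sketched as follows. The list contents and the names "months", "values" and "flag" are assumptions made for this illustration.

```r
# Create a list with three elements.
list_data <- list(c("Jan", "Feb", "Mar"), c(21, 32, 11), TRUE)

# Give names to the elements in the list.
names(list_data) <- c("months", "values", "flag")

# Access an element by name with $ or with double brackets.
print(list_data$months)
print(list_data[["values"]])

# Single brackets return a list that contains the element.
print(list_data["flag"])
```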
A list can be converted to a vector so that the elements of the vector can be used for further
manipulation. All the arithmetic operations on vectors can be applied after the list is converted
into vectors. To do this conversion, we use the unlist() function. It takes the list as input and
produces a vector.
# Create lists.
list1 <- list(1:5)
print(list1)
list2 <-list(10:14)
print(list2)
# Convert the lists to vectors.
v1 <- unlist(list1)
v2 <- unlist(list2)
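Once the lists are converted, ordinary vector arithmetic applies. A minimal, self-contained sketch of that last step (the name result is illustrative):

```r
list1 <- list(1:5)
list2 <- list(10:14)

# Convert the lists to vectors.
v1 <- unlist(list1)
v2 <- unlist(list2)

# Add the two vectors element by element.
result <- v1 + v2
print(result)
```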
Matrices
Matrices are the R objects in which the elements are arranged in a two-dimensional rectangular
layout. They contain elements of the same atomic types. Though we can create a matrix
containing only characters or only logical values, they are not of much use. We use matrices
containing numeric elements to be used in mathematical calculations.
A Matrix is created using the matrix() function.
Syntax
The basic syntax for creating a matrix in R is −
matrix(data, nrow, ncol, byrow, dimnames)
Following is the description of the parameters used −
data is the input vector which becomes the data elements of the matrix.
nrow is the number of rows to be created.
ncol is the number of columns to be created.
byrow is a logical value. If TRUE then the input vector elements are arranged by row.
dimnames is a list with the names assigned to the rows and columns.
Create a matrix taking a vector of numbers as input.
# Elements are arranged sequentially by row.
M <- matrix(c(3:14), nrow = 4, byrow = TRUE)
print(M)
# Elements are arranged sequentially by column.
N <- matrix(c(3:14), nrow = 4, byrow = FALSE)
print(N)
# Define the column and row names.
rownames = c("row1", "row2", "row3", "row4")
colnames = c("col1", "col2", "col3")
P <- matrix(c(3:14), nrow = 4, byrow = TRUE, dimnames = list(rownames, colnames))
print(P)
When we execute the above code, it produces the following result −
     [,1] [,2] [,3]
[1,]    3    4    5
[2,]    6    7    8
[3,]    9   10   11
[4,]   12   13   14
     [,1] [,2] [,3]
[1,]    3    7   11
[2,]    4    8   12
[3,]    5    9   13
[4,]    6   10   14
     col1 col2 col3
row1    3    4    5
row2    6    7    8
row3    9   10   11
row4   12   13   14
Accessing Elements of a Matrix
Elements of a matrix can be accessed by using the column and row index of the element. We
consider the matrix P above to find the specific elements below.
# Define the column and row names.
rownames = c("row1", "row2", "row3", "row4")
colnames = c("col1", "col2", "col3")
# Create the matrix.
P <- matrix(c(3:14), nrow = 4, byrow = TRUE, dimnames = list(rownames, colnames))
# Access the element at 3rd column and 1st row.
print(P[1,3])
# Access the element at 2nd column and 4th row.
print(P[4,2])
# Access only the 2nd row.
print(P[2,])
# Access only the 3rd column.
print(P[,3])
When we execute the above code, it produces the following result −
[1] 5
[1] 13
col1 col2 col3
   6    7    8
row1 row2 row3 row4
   5    8   11   14
Matrix Computations
Arithmetic operators such as +, -, * and / work element by element on matrices. The result
of the operation is also a matrix.
The dimensions (number of rows and columns) must be the same for the matrices involved in
the operation.
Matrix Addition & Subtraction
# Create two 2x3 matrices.
matrix1 <- matrix(c(3, 9, -1, 4, 2, 6), nrow = 2)
print(matrix1)
matrix2 <- matrix(c(5, 2, 0, 9, 3, 4), nrow = 2)
print(matrix2)
# Add the matrices.
result <- matrix1 + matrix2
cat("Result of addition","\n")
print(result)
# Subtract the matrices
result <- matrix1 - matrix2
cat("Result of subtraction","\n")
print(result)
When we execute the above code, it produces the following result −
     [,1] [,2] [,3]
[1,]    3   -1    2
[2,]    9    4    6
     [,1] [,2] [,3]
[1,]    5    0    3
[2,]    2    9    4
Result of addition
     [,1] [,2] [,3]
[1,]    8   -1    5
[2,]   11   13   10
Result of subtraction
     [,1] [,2] [,3]
[1,]   -2   -1   -1
[2,]    7   -5    2
Matrix Multiplication & Division
# Create two 2x3 matrices.
matrix1 <- matrix(c(3, 9, -1, 4, 2, 6), nrow = 2)
print(matrix1)
matrix2 <- matrix(c(5, 2, 0, 9, 3, 4), nrow = 2)
print(matrix2)
# Multiply the matrices.
result <- matrix1 * matrix2
cat("Result of multiplication","\n")
print(result)
# Divide the matrices
result <- matrix1 / matrix2
cat("Result of division","\n")
print(result)
When we execute the above code, it produces the following result −
     [,1] [,2] [,3]
[1,]    3   -1    2
[2,]    9    4    6
     [,1] [,2] [,3]
[1,]    5    0    3
[2,]    2    9    4
Result of multiplication
     [,1] [,2] [,3]
[1,]   15    0    6
[2,]   18   36   24
Result of division
     [,1]      [,2]      [,3]
[1,]  0.6      -Inf 0.6666667
[2,]  4.5 0.4444444 1.5000000
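The * operator multiplies matching elements; it is not the algebraic matrix product. A short sketch contrasting it with the %*% operator is given below. Transposing matrix2 with t() is a choice made here only so that the dimensions conform (2x3 times 3x2).

```r
matrix1 <- matrix(c(3, 9, -1, 4, 2, 6), nrow = 2)
matrix2 <- matrix(c(5, 2, 0, 9, 3, 4), nrow = 2)

# Element-wise product: both matrices must have the same dimensions.
elementwise <- matrix1 * matrix2
print(elementwise)

# Matrix product: columns of the left operand must equal rows of the right.
product <- matrix1 %*% t(matrix2)
print(dim(product))
print(product)
```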
Data Frames
A data frame is a table or a two-dimensional array-like structure in which each column contains
values of one variable and each row contains one set of values from each column.
•The data stored in a data frame can be of numeric, factor or character type.
# Create the data frame.
emp.data <- data.frame(
emp_id = c (1:5),
emp_name = c("Rick","Dan","Michelle","Ryan","Gary"),
salary = c(623.3,515.2,611.0,729.0,843.25),
start_date = as.Date(c("2012-01-01", "2013-09-23", "2014-11-15", "2014-05-11",
"2015-03-27")),
stringsAsFactors = FALSE
)
# Print the data frame.
print(emp.data)
When we execute the above code, it produces the following result −
emp_id emp_name salary start_date
1 1 Rick 623.30 2012-01-01
2 2 Dan 515.20 2013-09-23
3 3 Michelle 611.00 2014-11-15
4 4 Ryan 729.00 2014-05-11
5 5 Gary 843.25 2015-03-27
Get the Structure of the Data Frame
The structure of the data frame can be seen by using str() function.
# Create the data frame.
emp.data <- data.frame(
emp_id = c (1:5),
emp_name = c("Rick","Dan","Michelle","Ryan","Gary"),
salary = c(623.3,515.2,611.0,729.0,843.25),
start_date = as.Date(c("2012-01-01", "2013-09-23", "2014-11-15", "2014-05-11",
"2015-03-27")),
stringsAsFactors = FALSE
)
# Get the structure of the data frame.
str(emp.data)
Summary of Data in Data Frame
The statistical summary and nature of the data can be obtained by applying summary()
function.
# Create the data frame.
emp.data <- data.frame(
emp_id = c (1:5),
emp_name = c("Rick","Dan","Michelle","Ryan","Gary"),
salary = c(623.3,515.2,611.0,729.0,843.25),
start_date = as.Date(c("2012-01-01", "2013-09-23", "2014-11-15", "2014-05-11",
"2015-03-27")),
stringsAsFactors = FALSE
)
# Print the summary.
print(summary(emp.data))
When we execute the above code, it produces the following result
Extract Data from Data Frame
Extract specific columns from a data frame using column name.
# Create the data frame.
emp.data <- data.frame(
emp_id = c (1:5),
emp_name = c("Rick","Dan","Michelle","Ryan","Gary"),
salary = c(623.3,515.2,611.0,729.0,843.25),
start_date = as.Date(c("2012-01-01","2013-09-23","2014-11-15","2014-05-11",
"2015-03-27")),
stringsAsFactors = FALSE
)
# Extract Specific columns.
result <- data.frame(emp.data$emp_name,emp.data$salary)
print(result)
emp.data.emp_name emp.data.salary
1 Rick 623.30
2 Dan 515.20
3 Michelle 611.00
4 Ryan 729.00
5 Gary 843.25
Extract the first two rows and then all columns
# Create the data frame.
emp.data <- data.frame(
emp_id = c (1:5),
emp_name = c("Rick","Dan","Michelle","Ryan","Gary"),
salary = c(623.3,515.2,611.0,729.0,843.25),
start_date = as.Date(c("2012-01-01", "2013-09-23", "2014-11-15", "2014-05-11",
"2015-03-27")),
stringsAsFactors = FALSE
)
# Extract the first two rows.
result <- emp.data[1:2,]
print(result)
When we execute the above code, it produces the following result −
  emp_id emp_name salary start_date
1      1     Rick  623.3 2012-01-01
2      2      Dan  515.2 2013-09-23
Extract 3rd and 5th row with 2nd and 4th column
# Create the data frame.
emp.data <- data.frame(
emp_id = c (1:5),
emp_name = c("Rick","Dan","Michelle","Ryan","Gary"),
salary = c(623.3,515.2,611.0,729.0,843.25),
start_date = as.Date(c("2012-01-01", "2013-09-23", "2014-11-15", "2014-05-11",
"2015-03-27")),
stringsAsFactors = FALSE
)
# Extract 3rd and 5th row with 2nd and 4th column.
result <- emp.data[c(3,5),c(2,4)]
print(result)
emp_name start_date
3 Michelle 2014-11-15
5 Gary 2015-03-27
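Rows can also be selected with a logical condition instead of fixed positions. The sketch below uses a reduced copy of the employee data; the salary threshold of 620 and the name high are arbitrary choices for illustration.

```r
# Create a reduced version of the data frame.
emp.data <- data.frame(
  emp_id = 1:5,
  emp_name = c("Rick", "Dan", "Michelle", "Ryan", "Gary"),
  salary = c(623.3, 515.2, 611.0, 729.0, 843.25),
  stringsAsFactors = FALSE
)

# Keep the rows where salary exceeds 620, and only two named columns.
high <- emp.data[emp.data$salary > 620, c("emp_name", "salary")]
print(high)
```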
Expand Data Frame
A data frame can be expanded by adding columns and rows.
Add Column
Just add the column vector using a new column name.
# Create the data frame.
emp.data <- data.frame(
emp_id = c (1:5),
emp_name = c("Rick","Dan","Michelle","Ryan","Gary"),
salary = c(623.3,515.2,611.0,729.0,843.25),
start_date = as.Date(c("2012-01-01", "2013-09-23", "2014-11-15", "2014-05-11",
"2015-03-27")),
stringsAsFactors = FALSE
)
# Add the "dept" column.
emp.data$dept <- c("IT","Operations","IT","HR","Finance")
v <- emp.data
print(v)
Add Row
To add more rows permanently to an existing data frame, we need to bring in the new rows in
the same structure as the existing data frame and use the rbind() function.
In the example below we create a data frame with new rows and merge it with the existing data
frame to create the final data frame.
# Create the first data frame.
emp.data <- data.frame(
emp_id = c (1:5),
emp_name = c("Rick","Dan","Michelle","Ryan","Gary"),
salary = c(623.3,515.2,611.0,729.0,843.25),
start_date = as.Date(c("2012-01-01", "2013-09-23", "2014-11-15", "2014-05-11",
"2015-03-27")),
dept = c("IT","Operations","IT","HR","Finance"),
stringsAsFactors = FALSE
)
# Create the second data frame
emp.newdata <- data.frame(
emp_id = c (6:8),
emp_name = c("Rasmi","Pranab","Tusar"),
salary = c(578.0,722.5,632.8),
start_date = as.Date(c("2013-05-21","2013-07-30","2014-06-17")),
dept = c("IT","Operations","Finance"),
stringsAsFactors = FALSE
)
# Bind the two data frames.
emp.finaldata <- rbind(emp.data,emp.newdata)
print(emp.finaldata)
When we execute the above code, it produces the following result −
emp_id emp_name salary start_date dept
1 1 Rick 623.30 2012-01-01 IT
2 2 Dan 515.20 2013-09-23 Operations
3 3 Michelle 611.00 2014-11-15 IT
4 4 Ryan 729.00 2014-05-11 HR
5 5 Gary 843.25 2015-03-27 Finance
6 6 Rasmi 578.00 2013-05-21 IT
7 7 Pranab 722.50 2013-07-30 Operations
8 8 Tusar 632.80 2014-06-17 Finance
Reshaping?
A. Data reshaping in R is about changing the way data is organized into rows and columns. Most
of the time, data processing in R is done by taking the input data as a data frame. It is easy to
extract data from the rows and columns of a data frame, but there are situations when we need
the data frame in a format that is different from the format in which we received it. R has many
functions to split, merge and change the rows to columns and vice versa in a data frame.
We can join multiple vectors column-wise using the cbind() function; when only vectors are
passed, the result is a matrix rather than a data frame. We can also append the rows of one
data frame to another using the rbind() function.
# Create vector objects.
city <- c("Tampa","Seattle","Hartford","Denver")
state <- c("FL","WA","CT","CO")
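# Zip codes entered as numbers lose leading zeros (06161 is stored as 6161).
zipcode <- c(33602,98104,06161,80294)
# Combine the three vectors column-wise. With plain vectors, cbind() returns a character matrix.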
addresses <- cbind(city,state,zipcode)
# Print a header.
cat("# # # # The First data frame\n")
# Print the data frame.
print(addresses)
# Create another data frame with similar columns
new.address <- data.frame(
city = c("Lowry","Charlotte"),
state = c("CO","FL"),
zipcode = c("80230","33949"),
stringsAsFactors = FALSE
)
# Print a header.
cat("# # # The Second data frame\n")
# Print the data frame.
print(new.address)
# Combine rows from both the data frames.
all.addresses <- rbind(addresses,new.address)
# Print a header.
cat("# # # The combined data frame\n")
# Print the result.
print(all.addresses)
# # # # The First data frame
city state zipcode
[1,] "Tampa" "FL" "33602"
[2,] "Seattle" "WA" "98104"
[3,] "Hartford" "CT" "6161"
[4,] "Denver" "CO" "80294"
# # # The Second data frame
city state zipcode
1 Lowry CO 80230
2 Charlotte FL 33949
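The output above shows the Hartford zip code as "6161" because it was entered as a number. One possible variant, sketched here as an assumption rather than as the slides' own method, builds the first table with data.frame() and stores zip codes as character strings so that leading zeros survive and rbind() combines two true data frames.

```r
# Build the first table as a data frame, with zip codes as strings.
addresses <- data.frame(
  city = c("Tampa", "Seattle", "Hartford", "Denver"),
  state = c("FL", "WA", "CT", "CO"),
  zipcode = c("33602", "98104", "06161", "80294"),
  stringsAsFactors = FALSE
)

# Second data frame with the same columns.
new.address <- data.frame(
  city = c("Lowry", "Charlotte"),
  state = c("CO", "FL"),
  zipcode = c("80230", "33949"),
  stringsAsFactors = FALSE
)

# Append the rows of the second data frame to the first.
all.addresses <- rbind(addresses, new.address)
print(all.addresses)
```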
Explain about Packages?
A. An R package is a collection of R functions, compiled code and sample data. Packages are stored
under a directory called "library" in the R environment. R installs a default set of packages
during installation, and more can be added later when they are needed for a specific
purpose. When we start the R console, only the default packages are available. Other
installed packages have to be loaded explicitly before an R program can use them.
Below is a list of commands to be used to check, verify and use the R packages.
.libPaths()
This returns the library locations where R packages are stored. The result depends on the
local settings of your PC.
library()
This lists all the installed packages. The result depends on the local settings of your PC,
for example:
Packages in library ‘C:/Program Files/R/R-3.2.2/library’:
base        The R Base Package
boot        Bootstrap Functions (Originally by Angelo Canty for S)
class       Functions for Classification
cluster     "Finding Groups in Data": Cluster Analysis Extended Rousseeuw et al.
codetools   Code Analysis Tools for R
compiler    The R Compiler Package
datasets    The R Datasets Package
foreign     Read Data Stored by 'Minitab', 'S', 'SAS', 'SPSS', 'Stata', 'Systat', 'Weka', 'dBase', ...
graphics    The R Graphics Package
grDevices   The R Graphics Devices and Support for Colours and Fonts
grid        The Grid Graphics Package
KernSmooth  Functions for Kernel Smoothing Supporting Wand & Jones (1995)
lattice     Trellis Graphics for R
MASS        Support Functions and Datasets for Venables and Ripley's MASS
Matrix      Sparse and Dense Matrix Classes and Methods
methods     Formal Methods and Classes
mgcv        Mixed GAM Computation Vehicle with GCV/AIC/REML Smoothness Estimation
nlme        Linear and Nonlinear Mixed Effects Models
nnet        Feed-Forward Neural Networks and Multinomial Log-Linear Models
parallel    Support for Parallel computation in R
rpart       Recursive Partitioning and Regression Trees
spatial     Functions for Kriging and Point Pattern Analysis
splines     Regression Spline Functions and Classes
stats       The R Stats Package
stats4      Statistical Functions using S4 Classes
survival    Survival Analysis
tcltk       Tcl/Tk Interface
tools       Tools for Package Development
utils       The R Utils Package
There are two ways to add new R packages. One is installing directly from the CRAN repository
and the other is downloading the package to your local system and installing it manually.
The following command gets a package directly from CRAN and installs it
in the R environment. You may be prompted to choose a mirror; choose the one
nearest to your location.
install.packages("Package Name")
install.packages("XML")
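An installed package still has to be loaded before use. A common pattern, sketched here under the assumption that installing on demand is acceptable, is to install only when the package is missing and then load it. The package name "stats" is used so that the sketch runs without a network connection; substitute the package you actually need, such as "XML".

```r
# Name of the package to use (illustrative; "stats" ships with R).
pkg <- "stats"

# Install from CRAN only if the package is not already available.
if (!requireNamespace(pkg, quietly = TRUE)) {
  install.packages(pkg)
}

# Load the package; character.only = TRUE lets us pass the name as a string.
library(pkg, character.only = TRUE)

# Confirm that the package is now loaded.
print(pkg %in% loadedNamespaces())
```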