First Course On R



Ayush Ranjan Sinha

IIM Kozhikode
Explanation and Use:

1. Assignment (=):

o x = 4 and y = 2 assign the values 4 and 2 to the variables x and y, respectively.

o In R, you can also use <- for assignment; it is the operator most R code uses by convention.

2. Variable (x):

o Simply typing x and running it will display the value stored in x (in this case,
4).

o This is a shorthand way to view the value of a variable in the console.

3. print(x):

o This explicitly tells R to print the value of x.

o print() is helpful in situations like functions or loops where you want to
output intermediate values.
4. Example Output:
When you run the code above, the output would look like this:
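As a minimal sketch of the commands described in items 1 to 3 (the loop at the end is an added illustration, not part of the original commands), with the lines R prints shown as comments:

```r
# Assign values; x <- 4 would work equally well
x = 4
y = 2

# Typing a name at the console displays its value
x          # [1] 4

# Explicit printing
print(x)   # [1] 4

# Inside a loop, print() is what makes intermediate values appear
for (i in 1:2) {
  print(x * i)   # prints [1] 4, then [1] 8
}
```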
Explanation and Use:

1. Division (z = x / y):
o Here, z is assigned the result of x / y, where x is 4 and y is 2. This will give z
the value of 2.
o Division (/) is a standard arithmetic operator in R, used for dividing two
numbers or variables.
2. log(4):
o This computes the natural logarithm (base e) of 4.
o In R, log() by default calculates the natural log. If you want a logarithm with
a different base (e.g., base 10), use log(4, base = 10).
3. log(x):
o This calculates the natural log of x, which is 4 in this case, resulting in
approximately 1.386294.
4. exp(x):
o The exp() function calculates the exponential function e^x.
o With x = 4, this computes e^4, resulting in approximately 54.59815.
5. Variable (x):
o Typing x alone again will simply output the current value of x (which is 4), a
quick way to check its value.
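The arithmetic above can be checked with a short sketch, assuming x = 4 and y = 2 as in the previous section. The last line is an added illustration showing that log() and exp() undo each other.

```r
x = 4
y = 2

z = x / y            # 2
z

log(4)               # natural log: 1.386294
log(4, base = 10)    # base-10 log, same as log10(4): about 0.602
log(x)               # 1.386294
exp(x)               # e^4: 54.59815
x                    # still 4; none of the calls above changed x

log(exp(x))          # 4
```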

Explanation and Use:


1. Vector Creation (x = c(3, 4, 2, 1, 9)):
o c() is the function for combining values into a vector.
o Here, x is assigned the vector containing the numbers 3, 4, 2, 1, 9.
2. Accessing Elements (x[1] and x[5]):
o x[1] returns the first element of x, which is 3.
o x[5] returns the fifth element of x, which is 9.
o The square brackets [ ] are used to access elements by their position (index)
within the vector.
3. Maximum Value (max(x)):
o max(x) returns the largest value in the vector x, which is 9.
4. Index of the Minimum Value (which.min(x)):
o which.min(x) returns the index (position) of the smallest value in x.
o Since 1 is the smallest value and it’s in the fourth position, this command will
return 4.
5. Length of Vector (length(x)):
o length(x) returns the number of elements in x, which is 5.
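A short sketch of these vector commands, using the same x as the text; min() and which.max() are added here only as the natural counterparts of max() and which.min().

```r
x = c(3, 4, 2, 1, 9)

x[1]           # 3, the first element (R indices start at 1)
x[5]           # 9, the fifth element
max(x)         # 9
which.min(x)   # 4, the position of the smallest value
length(x)      # 5

# Counterparts, added for illustration
min(x)         # 1
which.max(x)   # 5
```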
Explanation and Use:

1. Vector Arithmetic (x + y and 2 * x):


o x + y: Adds each corresponding element of vectors x and y. Given x = c(3,
4, 2, 1, 9) and y = c(3, 4, 5, 2, 6), the result will be c(6, 8, 7, 3,
15).
o 2 * x: Multiplies each element in x by 2, resulting in c(6, 8, 4, 2, 18).
2. Creating a Sequence (101:200):
o 101:200 creates a sequence of integers from 101 to 200. This is often used to
generate a range of numbers quickly in R.
3. Logical Comparisons (x > 3 and x == 2):
o x > 3: Checks each element in x to see if it’s greater than 3, returning a
logical vector. For x = c(3, 4, 2, 1, 9), it returns FALSE, TRUE, FALSE,
FALSE, TRUE.
o x == 2: Checks each element in x to see if it’s equal to 2, returning TRUE
where the condition is met. Here, it returns FALSE, FALSE, TRUE, FALSE,
FALSE.
4. Counting Elements that Satisfy a Condition (sum(x > 3)):
o sum(x > 3): Counts how many elements in x are greater than 3 by summing
the TRUE values in the logical vector. Each TRUE is counted as 1, so for x =
c(3, 4, 2, 1, 9), it returns 2 since there are two values (4 and 9) greater
than 3.
5. Finding Indices of Elements that Satisfy a Condition (which(x > 3)):
o which(x > 3): Returns the indices (positions) of elements in x that are
greater than 3. For x = c(3, 4, 2, 1, 9), it returns 2 5 (indicating the 2nd
and 5th elements are greater than 3).
6. Subsetting Vector with Condition (x[which(x > 3)]):
o x[which(x > 3)]: Returns the actual values in x that are greater than 3. Here,
it will return 4 9 since these are the elements greater than 3.
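The six items above can be run together as a sketch, with x and y as given in item 1. The final line is an added note: a logical vector can be used directly as an index, which gives the same result as going through which() when there are no missing values.

```r
x = c(3, 4, 2, 1, 9)
y = c(3, 4, 5, 2, 6)

x + y              # 6 8 7 3 15
2 * x              # 6 8 4 2 18

s = 101:200        # integers 101, 102, ..., 200
length(s)          # 100

x > 3              # FALSE TRUE FALSE FALSE TRUE
x == 2             # FALSE FALSE TRUE FALSE FALSE

sum(x > 3)         # 2
which(x > 3)       # 2 5
x[which(x > 3)]    # 4 9
x[x > 3]           # 4 9, logical indexing without which()
```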

Practical Use:

These commands demonstrate essential skills in R for filtering, counting, and subsetting data based
on conditions—common tasks in data analysis. For instance, sum() and which() can help in data
cleaning (identifying specific values) or in selecting specific subsets of data for analysis or
visualization.

Explanation and Use:

1. Dimensions of the Matrix (dim(A), dim(A)[1], and dim(A)[2]):

o dim(A) returns the dimensions of A as a vector. For a 2x2 matrix, this will be 2 2.
o dim(A)[1] returns the number of rows (2), and dim(A)[2] returns the number of
columns (2).

2. Accessing Elements (A[1, 1] and A[1, 2]):

o A[1, 1] accesses the element in the first row and first column of A, which is 4.
o A[1, 2] accesses the element in the first row and second column of A, which is 5
when byrow = TRUE (or 6 without it).

3. Accessing Columns and Rows (A[, 1] and A[1, ]):

o A[, 1] returns the entire first column of A as a vector. With byrow = TRUE, this will
be 4 6.
o A[1, ] returns the entire first row of A as a vector. With byrow = TRUE, this will be
4 5.
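The command that creates A is not shown in these notes. The sketch below assumes A is built from the values 4, 5, 6 and a fourth entry; the 7 is an assumption, since the text only fixes the first three values. B is a hypothetical name for the same data filled column by column.

```r
# Fill row by row; the fourth value (7) is assumed
A = matrix(c(4, 5, 6, 7), nrow = 2, ncol = 2, byrow = TRUE)
A

dim(A)       # 2 2
dim(A)[1]    # 2 rows
dim(A)[2]    # 2 columns

A[1, 1]      # 4
A[1, 2]      # 5
A[, 1]       # 4 6, the first column
A[1, ]       # 4 5, the first row

# Without byrow = TRUE, R fills column by column
B = matrix(c(4, 5, 6, 7), nrow = 2, ncol = 2)
B[1, 2]      # 6
```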

Practical Use:

These matrix commands are used in R for handling structured data, such as data tables and arrays.
You can create matrices with specific layouts, retrieve rows or columns, and access individual
elements, making these operations fundamental in matrix algebra, statistical modeling, and data
manipulation.
Practical Use:

These matrix operations are fundamental in linear algebra and are widely used in statistical
modeling, physics, machine learning, and economics. For example, matrix multiplication is essential
for transforming data, while finding an inverse is necessary in solving systems of linear equations,
which is common in regression analysis and optimization problems.
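As a minimal sketch of the two operations named here, matrix multiplication and inversion: the matrices A and B and the vector b are hypothetical examples, not values taken from the course material.

```r
A = matrix(c(4, 5, 6, 7), nrow = 2, byrow = TRUE)   # hypothetical
B = matrix(c(1, 0, 2, 1), nrow = 2, byrow = TRUE)   # hypothetical

A * B            # element-wise product, not matrix multiplication
AB = A %*% B     # matrix product
AB
t(A)             # transpose

A_inv = solve(A) # inverse of A (A must be square and non-singular)
A %*% A_inv      # identity matrix, up to rounding error

# Solving the linear system A %*% sol = b without forming the inverse
b = c(1, 2)
sol = solve(A, b)
sol
```

Calling solve(A, b) directly is generally preferable to computing solve(A) %*% b when only the solution of the system is needed.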
Explanation and Use:

1. Creating a Vector with a Missing Value (x = c(2, 3, 4, NA)):

o x is a vector with values 2, 3, 4, NA. The NA denotes a missing value, which often represents
incomplete or missing data.

2. Checking for Missing Values (is.na(x), sum(is.na(x)), which(is.na(x))):

o is.na(x): Returns a logical vector showing TRUE for missing values, so the output will be
FALSE, FALSE, FALSE, TRUE.

o sum(is.na(x)): Counts the missing values in x. Here, it will return 1 because there is one NA in
x.

o which(is.na(x)): Returns the index of missing values in x, so it will return 4 (the position of
NA).
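A short sketch of the missing-value checks on x. The two mean() lines are an added illustration of why missing values need attention: most summary functions return NA unless told to skip them.

```r
x = c(2, 3, 4, NA)

is.na(x)               # FALSE FALSE FALSE TRUE
sum(is.na(x))          # 1
which(is.na(x))        # 4

# Added illustration: NA propagates through calculations
mean(x)                # NA
mean(x, na.rm = TRUE)  # 3, the mean of 2, 3 and 4
```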

3. Assigning a Value to a Specific Index (x[5] = 7):

o This assigns 7 to the fifth element of x, changing x to c(2, 3, 4, NA, 7).

4. Assigning NA to a Matrix Element (A[3, 3] = NA):

o This sets the element at row 3, column 3 of A to NA.

5. Finding Positions of Missing Values in a Matrix (which(is.na(A)), which(is.na(A), arr.ind = TRUE)):

o which(is.na(A)): Returns the positions of NA values in A as single (linear) indices, counted
down the columns of A.

o which(is.na(A), arr.ind = TRUE): Returns the row and column indices of missing values, one
row per NA, which locates them more precisely. For example, if A[3,3] and A[5,5] are NA, it
returns (3,3) and (5,5).

6. Assigning Another NA to a Matrix Element (A[5, 5] = NA):

o This sets the element at row 5, column 5 of A to NA, adding another missing value.

7. Removing Specific Rows (A[-c(3, 5), ]):

o A[-c(3, 5), ] returns A without rows 3 and 5; for a matrix with five rows, that leaves rows 1, 2,
and 4. A itself is unchanged unless the result is assigned back to it.
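The matrix A used in these commands is not shown in the notes. The sketch below assumes a hypothetical 5 x 5 matrix holding the numbers 1 to 25; A_clean is likewise a name introduced only for this example.

```r
x = c(2, 3, 4, NA)
x[5] = 7                 # x is now 2 3 4 NA 7
x

# Hypothetical 5 x 5 matrix, filled column by column
A = matrix(1:25, nrow = 5, ncol = 5)
A[3, 3] = NA
A[5, 5] = NA

which(is.na(A))                        # 13 25, linear indices down the columns
pos = which(is.na(A), arr.ind = TRUE)  # one (row, col) pair per NA
pos                                    # rows (3, 3) and (5, 5)

# Drop rows 3 and 5; A itself is left unchanged
A_clean = A[-c(3, 5), ]
dim(A_clean)             # 3 5
```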
Explanation and Use:

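1. Reading an Excel File (d = read_excel(file.choose(), header = TRUE)):

o read_excel() is a function from the readxl package, commonly used for reading Excel
files into R. Load the package first with library(readxl).
o file.choose() opens a file dialog to manually select the Excel file.

o header = TRUE is meant to specify that the first row contains column headers (variable names).
Note that header is the argument name used by read.csv() and read.table(); read_excel() has no
header argument and calls the equivalent setting col_names, which is TRUE by default. Calling
read_excel(file.choose()) therefore already treats the first row as headers.

o This command reads the selected Excel file into d.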

2. Converting to a Data Frame (d = data.frame(Advertising)):

o This command assumes Advertising is an object (like a matrix or list) that needs
conversion into a data frame.

o data.frame() is used to convert data structures into data frames, making data easier
to work with, especially for analysis.

3. Viewing the First and Last Rows of Data (head(d) and tail(d, 10)):

o head(d) displays the first six rows of d by default.

o tail(d, 10) displays the last 10 rows of d.

4. Viewing the Data in a Spreadsheet-like Format (View(d)):

o View(d) opens d in a separate tab in RStudio, displaying the data in a table format. Useful for
a comprehensive look at the dataset.

5. Viewing Structure of Data (str(d)):

o str(d) shows the internal structure of d, including column names, data types, and the first
few values of each column. This is very helpful for an overview of the dataset's format.

6. Creating a Histogram (hist(d$Sales)):

o hist(d$Sales) creates a histogram for the Sales column of d, providing a visual representation
of the distribution of sales data.

o This command is useful for understanding data distribution and identifying patterns or
outliers.

7. Checking Data Dimensions (dim(d)):

o dim(d) returns the dimensions (number of rows and columns) of d, which helps understand
the dataset size.

8. Removing the First Column (d = d[, -1]):

o d = d[, -1] removes the first column from d and reassigns the result to d.

o The -1 notation is a shortcut for excluding the specified column by position, which can be
helpful when the first column (e.g., row IDs) is unnecessary for analysis.
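The inspection steps can be tried without the Excel file. In the sketch below, d is a small stand-in data frame with made-up values; the column names TV, Radio, Newspaper and Sales follow the later sections, and the ID column is an assumption standing in for the unneeded first column.

```r
# With the real file one would run (requires the readxl package):
#   library(readxl)
#   d = read_excel(file.choose())

# Stand-in data with made-up values so the sketch runs on its own
d = data.frame(
  ID        = 1:12,
  TV        = c(230, 44, 17, 151, 180, 9, 57, 120, 8, 200, 66, 215),
  Radio     = c(38, 39, 46, 41, 11, 49, 33, 20, 2, 3, 6, 24),
  Newspaper = c(69, 45, 69, 58, 58, 75, 23, 11, 1, 21, 24, 4),
  Sales     = c(22, 10, 9, 18, 13, 7, 12, 13, 5, 11, 9, 17)
)

head(d)         # first six rows
tail(d, 10)     # last ten rows
str(d)          # column names, types and first values
dim(d)          # 12 5
hist(d$Sales)   # histogram of the Sales column
# View(d)       # interactive viewer; intended for RStudio

d = d[, -1]     # drop the first (ID) column
dim(d)          # 12 4
```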
Explanation and Use:

1. Viewing the First Few Rows (head(d)):

o head(d) displays the first six rows of the data frame d, allowing you to quickly inspect
the beginning of your dataset.

2. Counting Total Missing Values (sum(is.na(d))):

o is.na(d) creates a logical matrix indicating TRUE for missing values and FALSE for
non-missing values.
o sum(is.na(d)) then sums up all TRUE values, providing the total count of missing
values across the entire data frame.

3. Checking Missing Values Row-Wise (head(is.na(d))):

o is.na(d) generates a logical matrix (same as above), and head(is.na(d)) displays the
first six rows of this matrix, showing where missing values are present in those initial
rows.

o For each TRUE, there’s a missing value in d at that position.

4. Identifying Missing Values by Column (colSums(is.na(d))):

o colSums(is.na(d)) calculates the number of missing values for each column in d.

o is.na(d) returns TRUE for missing values, and colSums counts these TRUE values by column,
providing a quick overview of where missing values are concentrated.
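A minimal sketch of these counts, assuming a small made-up data frame d with a few missing entries (the values are illustrative only):

```r
# Made-up data with four missing entries
d = data.frame(
  TV    = c(230, NA, 17, 151),
  Radio = c(38, 39, NA, NA),
  Sales = c(22, 10, NA, 18)
)

sum(is.na(d))       # 4 missing values in total
head(is.na(d))      # logical matrix, TRUE where a value is missing
colSums(is.na(d))   # TV 1, Radio 2, Sales 1
```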

Practical Use:

These commands are essential for data cleaning and preparation, helping you quickly locate
and quantify missing values. Understanding missing values in your dataset is crucial, as they
can affect calculations, analyses, and model performance. Using colSums(is.na(d)) allows
you to see which columns need attention, aiding in making decisions about data imputation or
handling.
Explanation and Use:

1. Handling Missing Values (d = na.omit(d)):

o na.omit(d) removes any rows in d that contain missing values. This approach is useful
for handling small amounts of missing data, especially in large datasets where
imputing or dealing with missing values individually might not be practical.

2. Displaying Column Names (names(d)):

o names(d) outputs the names of all columns in d, giving a quick overview of the
variables in the dataset.

3. Inspecting Data Structure (str(d)):

o str(d) provides a compact display of the dataset’s structure, including the data types
of each column and a preview of the data, which helps confirm that variables are in
the expected format for analysis.

4. Viewing Values of a Single Column (head(d$Sales)):

o head(d$Sales) returns the first few values in the Sales column, giving a sample view
of the data for this variable.
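The effect of na.omit() is easiest to see on a small example. The data frame below is made up for illustration; two of its five rows contain a missing value.

```r
# Made-up data: rows 2 and 3 each contain an NA
d = data.frame(
  TV    = c(230, 44, NA, 151, 180),
  Radio = c(38, NA, 46, 41, 11),
  Sales = c(22, 10, 9, 18, 13)
)
nrow(d)          # 5

d = na.omit(d)   # keep only rows with no missing values
nrow(d)          # 3

names(d)         # "TV" "Radio" "Sales"
str(d)
head(d$Sales)    # 22 18 13
```

Because na.omit() drops an entire row for a single missing entry, it is worth comparing nrow(d) before and after to see how much data was lost.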

5. Creating Histograms for Sales:

o hist(d$Sales, col = "red"): Creates a histogram with bars colored red to visualize the
distribution of Sales.
o hist(d$Sales, col = "red", xlab = "Sales"): Adds a label for the x-axis as "Sales".

o hist(d$Sales, col = "red", xlab = "Sales", ylab = "Number of Markets"): Adds a label for the
y-axis as "Number of Markets".

o hist(d$Sales, col = "red", xlab = "Sales", ylab = "Number of Markets", main = "Distribution of
Sales"): Sets the title of the histogram to "Distribution of Sales".

o hist(d$Sales, col = "red", xlab = "Sales", ylab = "Number of Markets", main = ""): Creates the
histogram without a title, in case it’s unnecessary.

6. Attaching the Data Frame (attach(d)):

o attach(d) allows you to refer to columns directly by their names without prefixing with d$.
This can make code cleaner, especially when plotting or performing analyses on multiple
columns. However, it's good practice to detach it afterward to avoid confusion in larger
scripts.
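A sketch of the fully labelled histogram and of attach() followed by detach(), using made-up values for d. The with() call at the end is an added alternative, not part of the original commands: it gives the same by-name access to columns without changing the search path.

```r
# Made-up data for illustration
d = data.frame(
  TV    = c(230, 44, 17, 151, 180, 9, 57, 120),
  Sales = c(22, 10, 9, 18, 13, 7, 12, 13)
)

# Histogram with colour, axis labels and title
h = hist(d$Sales, col = "red", xlab = "Sales",
         ylab = "Number of Markets", main = "Distribution of Sales")

attach(d)            # columns of d become visible by name
m1 = mean(Sales)
detach(d)            # remove d from the search path again

m2 = with(d, mean(Sales))   # same result, no attach() needed
```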

Practical Use:

This approach is foundational for data cleaning, inspecting, and visualizing initial insights.
The histograms provide a quick sense of the Sales distribution, which can help identify
patterns, skewness, or outliers in the data. Using attach simplifies column access in further
analysis, though it's essential to manage attached data carefully in complex scripts to avoid
ambiguity.

Explanation and Use:

1. Creating a Scatter Plot (plot(TV, Sales, col = "red")):

o plot() generates a scatter plot with TV on the x-axis and Sales on the y-axis.

o col = "red" sets the color of the points in the plot to red, helping differentiate it
visually.

2. Customizing Line Width (plot(TV, Sales, col = "red", lwd = 2)):


o lwd = 2 sets the line width. In a scatter plot drawn with the default open-circle symbol, this
thickens the outline of each point. It does not carry over to trend or regression lines added
later; functions such as abline() take their own lwd argument.

3. Setting Up a Multi-Panel Layout (par(mfrow = c(1, 3))):

o par(mfrow = c(1, 3)) arranges the plotting area into a grid of 1 row and 3 columns.
This allows you to display three plots side-by-side in a single output window, useful
for comparing relationships simultaneously.

o After setting mfrow, each subsequent plot command will fill a panel in this grid.

4. Plotting Different Variables against Sales:

o plot(Radio, Sales, col = "blue", lwd = 2): Plots Radio against Sales with blue points drawn
with thicker outlines.

o plot(Newspaper, Sales, col = "red", lwd = 2): Plots Newspaper against Sales with red points.

o plot(TV, Sales, col = "green", lwd = 2): Plots TV against Sales with green points.

Example Output:

With this setup, you'll get three side-by-side scatter plots:

1. Radio vs. Sales in blue.

2. Newspaper vs. Sales in red.

3. TV vs. Sales in green.

These plots allow a quick visual comparison of how each advertising medium correlates with Sales.
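The three-panel layout can be sketched as follows, assuming a data frame d with made-up values. The sketch uses d$ instead of attach(d) so that it runs on its own, and it saves the previous graphics settings in op (a name introduced here) so the layout can be reset afterwards.

```r
# Made-up data for illustration
d = data.frame(
  TV        = c(230, 44, 17, 151, 180, 9, 57, 120),
  Radio     = c(38, 39, 46, 41, 11, 49, 33, 20),
  Newspaper = c(69, 45, 69, 58, 58, 75, 23, 11),
  Sales     = c(22, 10, 9, 18, 13, 7, 12, 13)
)

op = par(mfrow = c(1, 3))   # 1 row, 3 columns; op keeps the old setting

plot(d$Radio, d$Sales, col = "blue", lwd = 2,
     xlab = "Radio", ylab = "Sales")
plot(d$Newspaper, d$Sales, col = "red", lwd = 2,
     xlab = "Newspaper", ylab = "Sales")
plot(d$TV, d$Sales, col = "green", lwd = 2,
     xlab = "TV", ylab = "Sales")

par(op)                     # back to one plot per window
```

Without the final par(op), every later plot would keep filling the 1 x 3 grid.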
Explanation and Use:

1. Load the ISLR Library (library(ISLR)):

o library(ISLR) loads the ISLR package, which contains the datasets used in the book
"An Introduction to Statistical Learning with Applications in R".

o If you get an error indicating that the package isn’t installed, install it with
install.packages("ISLR"). Installation is needed only once, whereas library(ISLR) must be run
in every new R session.

2. Load the Default Dataset:

o After loading the ISLR package, you can access its datasets, including Default.

o attach(Default) attaches the Default dataset, allowing you to access its columns
directly by name without using the Default$ prefix.

3. View Column Names (names(Default)):

o names(Default) lists all the column names in the Default dataset, giving an overview
of the variables available for analysis.

Practical Use:

The Default dataset is commonly used for logistic regression analysis, as it contains data on whether
a customer defaulted on their credit card payment (default), whether they are a student (student),
their credit card balance (balance), and their annual income (income). Loading the dataset and
inspecting its structure helps prepare for analysis, such as predicting the probability of default based
on other variables.
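A hedged sketch of the loading step: it checks whether the ISLR package is available before using it, so the code runs either way. The variable default_cols is a name introduced for this example.

```r
# install.packages("ISLR")   # one-time installation, if needed

if (requireNamespace("ISLR", quietly = TRUE)) {
  library(ISLR)
  default_cols = names(Default)   # column names of the Default dataset
  print(default_cols)
  print(dim(Default))             # number of rows and columns
} else {
  default_cols = NULL
  message("Package ISLR is not installed; run install.packages(\"ISLR\") first.")
}
```

After library(ISLR), columns can be reached as Default$default, Default$student and so on; attach(Default) is a convenience, not a requirement.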
Explanation and Use:

1. Getting Help Documentation (help(Default)):

o help(Default) opens the help documentation for the Default dataset, if available,
providing details about the dataset’s variables and context.

2. Displaying the First Few Rows (head(Default)):

o head(Default) shows the first six rows of Default, providing a quick preview of the
data.

3. Creating a Frequency Table (table(default)):

o table(default) counts the occurrences of each unique value in the default column
(e.g., "Yes" and "No"), showing how many customers defaulted versus didn’t.

4. Storing and Displaying the Frequency Table (TO = table(default)):

o TO = table(default) stores the frequency table for default in TO.

o TO displays this stored table, allowing for further use in calculations.

5. Accessing Specific Elements of TO (TO[1]):

o TO[1] retrieves the first value in TO, which corresponds to the count of
non-defaulting cases (if No is first in the table).

6. Cross-Tabulating student and default (T2 = table(student, default)):

o table(student, default) creates a contingency table T2 with the student categories (e.g., "Yes"
and "No") in the rows and the default categories in the columns, showing the count of defaults
within each student group.

o This table helps understand how being a student might relate to the likelihood of
defaulting.

7. Calculating Row-Wise Proportions (prop.table(T2, margin = 1)):


o prop.table(T2, margin = 1) calculates row-wise proportions for T2. The margin = 1
argument normalizes the counts within each row, converting them to proportions
that sum to 1 across the row.

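8. Simple Arithmetic Expression (127 / 2817 + 127):

o This command performs a basic arithmetic calculation. Because division is evaluated before
addition, it calculates 127 / 2817 and then adds 127, giving about 127.045.

o If the intent is a proportion, such as 127 cases out of 2817 + 127, the sum needs parentheses:
127 / (2817 + 127), which is about 0.0431.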
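The table commands can be tried on a small stand-in. The data frame toy below contains ten made-up customers and is not the real Default data; it only mirrors the student and default columns. P is a name introduced for the proportion table.

```r
# Made-up stand-in for the student and default columns
toy = data.frame(
  student = c("No", "No", "No", "Yes", "Yes", "No", "Yes", "No", "No", "Yes"),
  default = c("No", "No", "Yes", "No", "Yes", "No", "No", "No", "No", "No")
)

TO = table(toy$default)      # frequency table of default
TO                           # No 8, Yes 2
TO[1]                        # count for the first category, "No"

# Rows are student status, columns are default status
T2 = table(student = toy$student, default = toy$default)
T2

P = prop.table(T2, margin = 1)   # proportions within each row
P                                # student "Yes" row: 0.75 and 0.25

# Operator precedence in the arithmetic example
127 / 2817 + 127       # about 127.045: division first, then addition
127 / (2817 + 127)     # about 0.0431: a proportion
```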

Practical Use:

These commands enable exploration of categorical relationships within the Default dataset.
Frequency tables (table and prop.table) help identify patterns, such as whether students default at a
different rate compared to non-students.
