First Course On R
IIM Kozhikode
Explanation and Use:
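1. Assignment (=):
o The = operator stores a value in a variable. In R, you can also use <- for assignment,
which is the more common convention in R code.
2. Variable (x):
o Simply typing x and running it will display the value stored in x (in this case,
4).
3. print(x):
o print(x) displays the value of x explicitly, giving the same output as typing x alone.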
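The items above can be tried together in a short session. This is a minimal sketch: the value 4 comes from the text, while the second variable name y is only illustrative.

```r
x = 4        # assignment with =
y <- 4       # the same assignment written with <- (y is an illustrative name)
x            # typing the name auto-prints the value at the console
print(x)     # explicit printing; needed inside functions and loops
```

Inside a loop or a function body, a bare x prints nothing, which is why print(x) is worth knowing early.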
1. Division (z = x / y):
o Here, z is assigned the result of x / y, where x is 4 and y is 2. This will give z
the value of 2.
o Division (/) is a standard arithmetic operator in R, used for dividing two
numbers or variables.
2. log(4):
o This computes the natural logarithm (base e) of 4.
o In R, log() by default calculates the natural log. If you want a logarithm with
a different base (e.g., base 10), use log(4, base = 10).
3. log(x):
o This calculates the natural log of x, which is 4 in this case, resulting in
approximately 1.386294.
4. exp(x):
o The exp() function calculates the exponential function e^x.
o With x = 4, this computes e^4, resulting in approximately 54.59815.
5. Variable (x):
o Typing x alone again will simply output the current value of x (which is 4), a
quick way to check its value.
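The five steps above can be reproduced as follows. This is a sketch that assumes x and y were set to 4 and 2, as the explanation states.

```r
x = 4
y = 2
z = x / y            # division: 4 / 2
z
log(4)               # natural logarithm (base e)
log(4, base = 10)    # logarithm with base 10
log(x)               # same as log(4), since x is 4
exp(x)               # e^4
x                    # x itself is unchanged by these calls
```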
Practical Use:
These commands demonstrate essential skills in R for filtering, counting, and subsetting data based
on conditions—common tasks in data analysis. For instance, sum() and which() can help in data
cleaning (identifying specific values) or in selecting specific subsets of data for analysis or
visualization.
Dimensions of the Matrix (dim(A), dim(A)[1], and dim(A)[2]):
dim(A) returns the dimensions of A as a vector. For a 2x2 matrix, this will be 2 2.
dim(A)[1] returns the number of rows (2), and dim(A)[2] returns the number of
columns (2).
A[1, 1] accesses the element in the first row and first column of A, which is 4.
A[1, 2] accesses the element in the first row and second column of A, which is 5
when byrow = TRUE (or 6 without it).
A[, 1] returns the entire first column of A as a vector. With byrow = TRUE, this will
be 4 6.
A[1, ] returns the entire first row of A as a vector. With byrow = TRUE, this will be 4
5.
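A possible reconstruction of the matrix used above is sketched below. The entries 4, 5 and 6 follow from the values quoted in the text; the fourth entry (7) is an assumption, as is the name B for the matrix filled without byrow = TRUE.

```r
# Entries 4, 5, 6 follow the text; the fourth entry (7) is assumed.
A = matrix(c(4, 5, 6, 7), nrow = 2, ncol = 2, byrow = TRUE)
A
dim(A)       # 2 2
dim(A)[1]    # number of rows: 2
dim(A)[2]    # number of columns: 2
A[1, 1]      # 4
A[1, 2]      # 5
A[, 1]       # first column: 4 6
A[1, ]       # first row: 4 5

# Without byrow = TRUE, R fills column by column, so element [1, 2] becomes 6.
B = matrix(c(4, 5, 6, 7), nrow = 2, ncol = 2)
B[1, 2]      # 6
```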
Practical Use:
These matrix commands are used in R for handling structured data, such as data tables and arrays.
You can create matrices with specific layouts, retrieve rows or columns, and access individual
elements, making these operations fundamental in matrix algebra, statistical modeling, and data
manipulation.
Practical Use:
These matrix operations are fundamental in linear algebra and are widely used in statistical
modeling, physics, machine learning, and economics. For example, matrix multiplication is essential
for transforming data, while finding an inverse is necessary in solving systems of linear equations,
which is common in regression analysis and optimization problems.
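As an illustration of the two operations named above, the sketch below multiplies a small matrix by itself, inverts it, and solves a linear system. The matrix values and the right-hand side b are made up for the example.

```r
A = matrix(c(4, 5, 6, 7), nrow = 2, byrow = TRUE)  # illustrative values
b = c(1, 2)                                         # illustrative right-hand side

A %*% A               # matrix multiplication (A * A would multiply element by element)
A_inv = solve(A)      # inverse of A
A %*% A_inv           # identity matrix, up to rounding error
sol = solve(A, b)     # solves A %*% sol = b without forming the inverse explicitly
sol
```

For solving systems of equations, solve(A, b) is usually preferable to computing solve(A) %*% b, since it avoids forming the inverse.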
Creating a Vector with a Missing Value (x = c(2, 3, 4, NA)):
x is a vector with values 2, 3, 4, NA. The NA denotes a missing value, which often represents
incomplete or missing data.
is.na(x): Returns a logical vector showing TRUE for missing values, so the output will be
FALSE, FALSE, FALSE, TRUE.
sum(is.na(x)): Counts the missing values in x. Here, it will return 1 because there is one NA in
x.
which(is.na(x)): Returns the index of missing values in x, so it will return 4 (the position of
NA).
which(is.na(A), arr.ind = TRUE): Returns row and column indices of missing values, allowing
you to locate them more precisely. For example, if A[3,3] and A[5,5] are NA, it will return
(3,3) and (5,5).
A[-c(3, 5), ] removes rows 3 and 5 from A, displaying only rows 1, 2, and 4.
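The missing-value commands above can be sketched as follows. The vector x is taken from the text; the 5 x 5 matrix A is a hypothetical stand-in, built so that its missing values sit at [3, 3] and [5, 5] as in the example.

```r
x = c(2, 3, 4, NA)
is.na(x)            # FALSE FALSE FALSE  TRUE
sum(is.na(x))       # 1
which(is.na(x))     # 4

# Hypothetical 5 x 5 matrix with missing values at [3, 3] and [5, 5].
A = matrix(1:25, nrow = 5)
A[3, 3] = NA
A[5, 5] = NA
which(is.na(A), arr.ind = TRUE)   # one row per NA: (3, 3) and (5, 5)
A[-c(3, 5), ]                     # rows 1, 2 and 4 remain
```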
Explanation and Use:
o read_excel() is a function from the readxl package, commonly used for reading Excel
files into R.
o file.choose() opens a file dialog to manually select the Excel file.
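o header = TRUE specifies that the first row contains column headers (variable names).
Note that header is the argument name used by read.csv() and read.table();
read_excel() uses col_names = TRUE for the same purpose, which is already its default.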
o This command assumes Advertising is an object (like a matrix or list) that needs
conversion into a data frame.
o data.frame() is used to convert data structures into data frames, making data easier
to work with, especially for analysis.
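Because file.choose() needs an interactive session, the sketch below uses a small temporary CSV file as a stand-in for the Excel file. The file contents are made up; only the column names (TV, Radio, Newspaper, Sales) follow the variables used later in these notes.

```r
# Hypothetical stand-in for the data file: a small CSV written to a temporary path.
path = tempfile(fileext = ".csv")
writeLines(c("TV,Radio,Newspaper,Sales",
             "100,20,30,15",
             "50,10,40,9"), path)

Advertising = read.csv(path, header = TRUE)  # header = TRUE: first row holds column names
d = data.frame(Advertising)                  # make sure we are working with a data frame
class(d)
names(d)

# With an Excel file and the readxl package installed, the interactive equivalent would be:
# Advertising = readxl::read_excel(file.choose())
```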
3. Viewing the First and Last Rows of Data (head(d) and tail(d, 10)):
View(d) opens d in a separate tab in RStudio, displaying the data in a table format. Useful for
a comprehensive look at the dataset.
str(d) shows the internal structure of d, including column names, data types, and the first
few values of each column. This is very helpful for an overview of the dataset's format.
hist(d$Sales) creates a histogram for the Sales column of d, providing a visual representation
of the distribution of sales data.
This command is useful for understanding data distribution and identifying patterns or
outliers.
dim(d) returns the dimensions (number of rows and columns) of d, which helps understand
the dataset size.
d = d[, -1] removes the first column from d and reassigns the result to d.
The -1 notation is a shortcut for excluding the specified column by position, which can be
helpful when the first column (e.g., row IDs) is unnecessary for analysis.
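The inspection commands above can be tried on a small example. The data frame below is a hypothetical stand-in with made-up values and an ID column in first position, so that d[, -1] has something to drop.

```r
# Hypothetical stand-in for the Advertising data; the values are made up.
d = data.frame(ID = 1:8,
               TV = c(230, 44, 17, 151, 180, 8, 57, 120),
               Sales = c(22, 10, 9, 18, 13, 7, 12, 13))
head(d)          # first six rows
tail(d, 3)       # last three rows
str(d)           # column names, types, first few values
dim(d)           # 8 rows, 3 columns
# View(d)        # opens a spreadsheet-style tab; RStudio / interactive sessions only
hist(d$Sales)    # distribution of Sales

d = d[, -1]      # drop the first column (the ID column here)
names(d)         # "TV" "Sales"
```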
Explanation and Use:
o head(d) displays the first six rows of the data frame d, allowing you to quickly inspect
the beginning of your dataset.
o is.na(d) creates a logical matrix indicating TRUE for missing values and FALSE for non-
missing values.
o sum(is.na(d)) then sums up all TRUE values, providing the total count of missing
values across the entire data frame.
o is.na(d) generates a logical matrix (same as above), and head(is.na(d)) displays the
first six rows of this matrix, showing where missing values are present in those initial
rows.
is.na(d) returns TRUE for missing values, and colSums counts these TRUE values by column,
providing a quick overview of where missing values are concentrated.
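A short sketch of these missing-value checks on a data frame follows. The data are made up and contain three NA values, placed so that the per-column counts differ.

```r
# Hypothetical data frame with three missing values; the values are made up.
d = data.frame(TV = c(230, NA, 17, 151, 180, 8, 57),
               Radio = c(38, 39, NA, 41, 11, NA, 33),
               Sales = c(22, 10, 9, 18, 13, 7, 12))
head(d)               # first six rows of the data
sum(is.na(d))         # total number of missing values: 3
head(is.na(d))        # TRUE marks a missing cell in the first six rows
colSums(is.na(d))     # missing values per column: TV 1, Radio 2, Sales 0
```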
Practical Use:
These commands are essential for data cleaning and preparation, helping you quickly locate
and quantify missing values. Understanding missing values in your dataset is crucial, as they
can affect calculations, analyses, and model performance. Using colSums(is.na(d)) allows
you to see which columns need attention, aiding in making decisions about data imputation or
handling.
Explanation and Use:
o na.omit(d) removes any rows in d that contain missing values. This approach is useful
for handling small amounts of missing data, especially in large datasets where
imputing or dealing with missing values individually might not be practical.
o names(d) outputs the names of all columns in d, giving a quick overview of the
variables in the dataset.
o str(d) provides a compact display of the dataset’s structure, including the data types
of each column and a preview of the data, which helps confirm that variables are in
the expected format for analysis.
o head(d$Sales) returns the first few values in the Sales column, giving a sample view
of the data for this variable.
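The effect of na.omit() is easiest to see by comparing row counts before and after. The sketch below uses a made-up four-row data frame in which two rows contain a missing value.

```r
d = data.frame(TV = c(230, NA, 17, 151),
               Sales = c(22, 10, NA, 18))   # made-up values
nrow(d)            # 4 rows before cleaning
d = na.omit(d)     # keep only rows with no missing values
nrow(d)            # 2 rows remain
names(d)           # "TV" "Sales"
str(d)
head(d$Sales)      # first values of the Sales column
```

Note that na.omit() drops the whole row even if only one column is missing, so it is worth checking how many rows are lost.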
hist(d$Sales, col = "red"): Creates a histogram with bars colored red to visualize the
distribution of Sales.
hist(d$Sales, col = "red", xlab = "Sales"): Adds a label for the x-axis as "Sales".
hist(d$Sales, col = "red", xlab = "Sales", ylab = "Number of Markets"): Adds a label for the y-
axis as "Number of Markets".
hist(d$Sales, col = "red", xlab = "Sales", ylab = "Number of Markets", main = "Distribution of
Sales"): Sets the title of the histogram to "Distribution of Sales".
hist(d$Sales, col = "red", xlab = "Sales", ylab = "Number of Markets", main = ""): Creates the
histogram without a title, in case it’s unnecessary.
attach(d) allows you to refer to columns directly by their names without prefixing with d$.
This can make code cleaner, especially when plotting or performing analyses on multiple
columns. However, it's good practice to detach it afterward to avoid confusion in larger
scripts.
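The histogram options above build on one another, so the final call can be sketched in one piece, followed by attach() and detach(). The Sales values are made up, and h is an illustrative name for the object that hist() returns.

```r
# Made-up Sales values standing in for d$Sales.
d = data.frame(Sales = c(22, 10, 9, 18, 13, 7, 12, 13, 5, 11))

h = hist(d$Sales,
         col = "red",                      # bar colour
         xlab = "Sales",                   # x-axis label
         ylab = "Number of Markets",       # y-axis label
         main = "Distribution of Sales")   # title; main = "" leaves it blank
sum(h$counts)     # every observation falls in exactly one bar

attach(d)         # columns can now be used by name
mean(Sales)       # same as mean(d$Sales)
detach(d)         # detach once done to avoid name clashes later
```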
Practical Use:
This approach is foundational for data cleaning, inspecting, and visualizing initial insights.
The histograms provide a quick sense of the Sales distribution, which can help identify
patterns, skewness, or outliers in the data. Using attach simplifies column access in further
analysis, though it's essential to manage attached data carefully in complex scripts to avoid
ambiguity.
o col = "red" sets the color of the points in the plot to red, helping differentiate it
visually.
o par(mfrow = c(1, 3)) arranges the plotting area into a grid of 1 row and 3 columns.
This allows you to display three plots side-by-side in a single output window, useful
for comparing relationships simultaneously.
o After setting mfrow, each subsequent plot command will fill a panel in this grid.
plot(Radio, Sales, col = "blue", lwd = 2): Plots Radio against Sales with blue points; lwd = 2
draws the point symbols with a thicker outline.
plot(Newspaper, Sales, col = "red", lwd = 2): Plots Newspaper against Sales with red points.
plot(TV, Sales, col = "green", lwd = 2): Plots TV against Sales with green points.
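The three-panel layout can be sketched as below. The data values are made up, and the columns are referenced with d$ rather than through attach(d). Saving and restoring the previous settings through old_par (an illustrative name) is an addition not described in the text.

```r
# Made-up values standing in for the Advertising data.
d = data.frame(TV = c(230, 44, 17, 151, 180),
               Radio = c(38, 39, 46, 41, 11),
               Newspaper = c(69, 45, 69, 58, 58),
               Sales = c(22, 10, 9, 18, 13))

old_par = par(mfrow = c(1, 3))   # 1 row, 3 columns; par() returns the previous settings
plot(d$Radio, d$Sales, col = "blue", lwd = 2)
plot(d$Newspaper, d$Sales, col = "red", lwd = 2)
plot(d$TV, d$Sales, col = "green", lwd = 2)
par(old_par)                     # restore the single-panel layout
```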
Example Output:
These plots allow a quick visual comparison of how each advertising medium correlates with Sales.
Explanation and Use:
o library(ISLR) loads the ISLR package, which contains the datasets used in the book
"An Introduction to Statistical Learning with Applications in R".
o If you get an error indicating that the package isn’t installed, install it with
install.packages("ISLR"). This only needs to be run once; library(ISLR) is then run in
each new session.
o After loading the ISLR package, you can access its datasets, including Default.
o attach(Default) attaches the Default dataset, allowing you to access its columns
directly by name without using the Default$ prefix.
o names(Default) lists all the column names in the Default dataset, giving an overview
of the variables available for analysis.
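A guarded version of these steps is sketched below, so that the script does not stop with an error when ISLR is missing. The requireNamespace() check is an addition for illustration and is not part of the original commands.

```r
# install.packages("ISLR")   # run once if the package is not installed
if (requireNamespace("ISLR", quietly = TRUE)) {
  library(ISLR)              # makes the Default dataset available
  print(names(Default))      # "default" "student" "balance" "income"
  print(dim(Default))        # number of rows and columns
}
```

print() is used explicitly here because values are not auto-printed line by line inside an if block.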
Practical Use:
The Default dataset is commonly used for logistic regression analysis, as it contains data on whether
a customer defaulted on their credit card payment (default), whether they are a student (student),
their credit card balance (balance), and their annual income (income). Loading the dataset and
inspecting its structure helps prepare for analysis, such as predicting the probability of default based
on other variables.
Explanation and Use:
o help(Default) opens the help documentation for the Default dataset, if available,
providing details about the dataset’s variables and context.
o head(Default) shows the first six rows of Default, providing a quick preview of the
data.
o table(default) counts the occurrences of each unique value in the default column
(e.g., "Yes" and "No"), showing how many customers defaulted versus didn’t.
o TO[1] retrieves the first value in TO, which corresponds to the count of non-
defaulting cases (if No is first in the table).
o This table helps understand how being a student might relate to the likelihood of
defaulting.
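o This command performs a basic arithmetic calculation. Note that R divides before it
adds, so 127 / 2817 + 127 computes 127 / 2817 and then adds 127. To obtain 127 as a
proportion of the combined total, the sum needs parentheses: 127 / (2817 + 127).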
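The frequency-table commands can be sketched on a small made-up data frame that borrows the column names default and student. The names toy and TS are illustrative; TO follows the text. The last two lines show how parentheses change the arithmetic result.

```r
# Made-up stand-in with the same column names as Default.
toy = data.frame(default = c("No", "No", "Yes", "No", "Yes", "No"),
                 student = c("Yes", "No", "Yes", "No", "No", "Yes"))

TO = table(toy$default)               # counts of "No" and "Yes"
TO
TO[1]                                 # count for the first level ("No")
TS = table(toy$student, toy$default)  # rows: student, columns: default
TS
prop.table(TS, margin = 1)            # row proportions: default rate within each student group

# Operator precedence: division happens before addition.
127 / 2817 + 127      # (127 / 2817) + 127
127 / (2817 + 127)    # 127 as a share of 2817 + 127
```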
Practical Use:
These commands enable exploration of categorical relationships within the Default dataset.
Frequency tables (table and prop.table) help identify patterns, such as whether students default at a
different rate compared to non-students.