Topic 1 Data Management in R EDUC 216
What is R?
R is a free software environment for statistical computing and graphics. It is supported by the
R Foundation for Statistical Computing.
R is a GNU package distributed under the GNU General Public License, which means it is
free to use and open source.
The R-chitecture
R exists as a base package with a reasonable amount of functionality. The software R and its
packages are stored in a central location known as the CRAN, or the Comprehensive R Archive
Network. Once a package is stored in the CRAN, anyone with an internet connection can download
it from the CRAN and install it to use within their own copy of R.
Advantages
-free
-versatile
-rapidly expanding tool and can respond quickly to new developments
Disadvantages
-less easy to use (typing instructions rather than pointing, clicking, and dragging things with
a mouse)
-works with a command line rather than a graphical user interface
To install R onto your computer you need to visit the project website (http://www.R-project.org).
The figure below shows the process of obtaining the installation files. On the main project page,
on the left-hand side, click on the link labelled ‘CRAN’.
There are various copies (mirrors) of CRAN across the globe; therefore the link to the CRAN will
take you to a page of links to the various ‘mirror’ sites. Scroll down this list to find a mirror
near you.
Once you have been redirected to the CRAN mirror that you selected, you will see a web page
that asks you which platform you use (Linux, MacOS or Windows). Click the link that applies
to you.
If you click on the ‘Windows’ link, then you’ll be taken to another page with some more links.
Click on ‘base’, which will direct you to the webpage with the link to the setup file. Once there,
click on the link that says ‘Download R ___ for Windows’, which will initiate the download of the
R setup file. Once this file has been downloaded, double click on it and you will enter a (hopefully)
familiar install procedure.
If you click on the ‘MacOS’ link, you will be taken directly to a page from where you can download
the install package by clicking on the link labelled ‘R-__.pkg’. Clicking this link will download the
install file; once downloaded, double click on it and you will enter the normal MacOS install
procedure.
An IDE (integrated development environment) is a software application that helps programmers
develop software more easily and more productively. An IDE typically consists of a code editor,
a compiler and debugging tools. RStudio is an IDE for R.
To Install RStudio
1. Go to www.rstudio.com and click on the "Download RStudio" button.
2. Click on "Download RStudio Desktop."
3. For Windows: Click on the version recommended for your system, or the latest
Windows version, and save the executable file. Run the .exe file and follow the
installation instructions.
4. For MacOS: Click on the version recommended for your system, or the latest Mac
version, save the .dmg file on your computer, double-click it to open, and then
drag and drop RStudio to your Applications folder.
[Figure: the RStudio window and its four panes: Editor/Script/Data pane, Environment/History, Console, and Files/Plots/Packages/Help]
Code Editor/Source – a separate window where you can write your commands rather than
writing directly to the console. Here, you can enter multiple lines of code, save your script file to
disk, and perform other tasks on your script.
It’s Smart – it recognizes and highlights various elements of the code; it helps you find matching
brackets in your scripts.
Console – It is the main window where you can both type commands and see the results of
executing these commands. This is where you do all the interactive work with R.
Files, Plots, Packages, and Help – Files: this is where you can browse the folders and files on
your computer. Plots: this is where R displays your plots. Packages: this is where you can
view a list of all the installed packages. Help: this is where you can browse the built-in Help
system of R.
File – This menu allows you to do general things such as saving the workspace. Likewise, you can
open previously saved files and print graphs, data or output. In essence, it contains all the options
that are customarily found in File menus.
Edit – This menu contains edit functions such as cut and paste. From here, you can also clear the
console, activate a rudimentary data editor, and change how the Graphical User Interface looks.
View – This menu lets you select whether or not to see the toolbar and whether to show a status
bar at the bottom of the window.
Misc – This menu contains options to stop ongoing computations, to list any objects in your
working environment, and also to select whether R autocompletes words and filenames for you.
Packages – This menu is very important because it is where you load, install and update packages.
Window – If you have multiple windows, this menu allows you to change how the windows in R
are arranged.
Help – This menu routes you to online help (links to frequently asked questions, the R webpage
etc.) and offers offline help (pdf manuals and system help files).
Resize – This menu is for resizing the image in the graphics window so that it is a fixed size, is
scaled to fit the window while retaining its aspect ratio (fit to window), or expands to fit the
window without maintaining its aspect ratio.
Commands in R are generally made up of two parts: objects and functions. These are separated by
‘<-’, which you can think of as meaning ‘is created from’. As such, the general form of a
command is: object <- function, which means ‘object is created from function’.
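As a minimal sketch of this general form (the object names scores and average are made up for illustration and are not part of the original notes):

```r
# 'scores' is created from the function c(), which combines values
scores <- c(10, 20, 30)

# 'average' is created from the function mean() applied to 'scores'
average <- mean(scores)

# Typing the name of an object prints its contents
average
```

Each line reads the same way: the object on the left is created from the function on the right.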
R is case sensitive, which means that if the same name is written in upper case and in lower
case, R treats the two as completely different things.
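A short illustration of case sensitivity, using two hypothetical objects whose names differ only in the case of the first letter:

```r
# Two objects whose names differ only in case
score <- 5
Score <- 10

# R keeps them as two unrelated objects with their own values
score
Score

# Comparing them shows they do not hold the same value
score == Score   # FALSE
```

The same applies to function names, so a typing slip in capitalisation is a common source of "object not found" errors.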
Installing Packages
A package is a self-contained set of code that adds functionality to R, similar to the way that an
add-in adds functionality to Microsoft Excel.
In Windows, if you select Packages => Install package(s)…, the window that opens first asks
you to select a CRAN mirror and then to choose the package you want to install.
If you know the package you want to install, then the simplest way to execute this command is
install.packages("package.name"), in which ‘package.name’ is replaced by the name of the
package that you’d like to install. Note that the name of the package must be enclosed in
quotation marks.
Once a package is installed, you need to reference it (for example, with the library() function)
for R to know that you’re using it. You need to install the package only once, but you need to
reference it each time you start a new session of R.
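The two steps can be sketched as follows. This is only an illustration: the install line is commented out because it needs an internet connection, and the tools package is used for the loading step only because it ships with every R installation (it is an assumption for the demo, not a package named in these notes).

```r
# Step 1 (once per computer): download and install a package from CRAN.
# Commented out so that this sketch also runs without internet access.
# install.packages("foreign")

# Step 2 (every new R session): load the package so R knows you are using it.
# 'tools' comes with R itself, so this line works without step 1.
library(tools)

# A loaded package appears on R's search path
"package:tools" %in% search()
```

If step 2 is skipped in a new session, R reports that it cannot find the functions of the package even though the package is installed.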
R Workspace
The collection of objects and things you have created in a session is known as your workspace.
Before you import data into R, you must first determine your working directory. You should
always set the working directory at the start of a session.
A working directory is the directory where you want to store your data files.
- Create a folder and place the data files you’ll be using in that folder.
- To set the working directory to this folder, use the setwd() command to specify this newly
created folder as the working directory.
Example: setwd("D:/R Training/Files")
By executing this command, we can now access files in that folder directly without having
to refer to the full file path.
To check what the current working directory is, execute the command getwd().
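The sketch below tries out setwd() and getwd() without touching your own folders. It assumes a throwaway folder inside R's temporary directory; the folder name r_training_files and the file name demo.csv are made up for the demonstration.

```r
# Remember where we started so we can go back afterwards
old_dir <- getwd()

# Create a practice folder (hypothetical name) inside R's temporary directory
demo_dir <- file.path(tempdir(), "r_training_files")
dir.create(demo_dir, showWarnings = FALSE)

# Make it the working directory and confirm the change
setwd(demo_dir)
getwd()

# Files can now be written and found by name alone, without the full path
write.csv(data.frame(x = 1:3), "demo.csv", row.names = FALSE)
file.exists("demo.csv")

# Return to the original working directory
setwd(old_dir)
```

On your own computer you would replace demo_dir with the path of the folder that holds your data files.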
A <- 1
B <- 2
Creating Variables
Example:
R_name <- c("Renan", "John", "Chlea", "Jean")
Province <- c("Bohol", "Cebu", "Negros Oriental", "Siquijor")
Age <- c(28, 32, 27, 30)
Variables that consist of data that are text are known as string variables. Variables that contain
data that are numbers are known as numeric variables.
Example:
Birthdate <- as.Date(c("1990-06-21", "1986-07-16", "1991-09-08", "1988-05-24"))
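To see how R classifies each kind of variable, the class() function can be applied to the examples above. This is a small sketch that repeats three of the definitions so that it runs on its own:

```r
# A string variable, a numeric variable and a date variable
R_name <- c("Renan", "John", "Chlea", "Jean")
Age <- c(28, 32, 27, 30)
Birthdate <- as.Date(c("1990-06-21", "1986-07-16", "1991-09-08", "1988-05-24"))

# class() reports how R stores each variable
class(R_name)     # "character"
class(Age)        # "numeric"
class(Birthdate)  # "Date"
```

Note that R uses the word character for what these notes call string variables.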
Creating Dataframes
If we want to combine R_name, Province, Birthdate, and Age into a dataframe, we can use the
data.frame() function.
Example:
Profile <- data.frame(R_name, Province, Birthdate, Age)
In this command, we create a new object (called Profile). As such, our dataframe consists of four
variables (names of the respondents, their provinces, birthdates and ages).
Now that the dataframe has been created we can refer to these variables at any point using the
general form:
dataframe$variableName
Check:
Profile$Province
Profile$Age
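Putting the pieces together, a self-contained sketch of building the dataframe and referring to its variables might look like this (the calls to nrow(), ncol() and mean() are additions for illustration):

```r
# The four variables from the examples above
R_name <- c("Renan", "John", "Chlea", "Jean")
Province <- c("Bohol", "Cebu", "Negros Oriental", "Siquijor")
Birthdate <- as.Date(c("1990-06-21", "1986-07-16", "1991-09-08", "1988-05-24"))
Age <- c(28, 32, 27, 30)

# Combine them into one dataframe: one row per respondent, one column per variable
Profile <- data.frame(R_name, Province, Birthdate, Age)

# Check its size
nrow(Profile)  # 4 respondents
ncol(Profile)  # 4 variables

# Refer to single variables with dataframe$variableName
Profile$Province
mean(Profile$Age)  # 29.25
```

Typing Profile on its own prints the whole dataframe in the console.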
Dataframes are not the only way to combine variables in R. You can also use the list( ) and cbind()
functions to combine variables.
The list() function creates a list of separate objects; you can imagine it as your handbag (or
manbag), but nicely organized. Your handbag contains lots of different objects: wallet, phone,
iPod, pen, etc. Those objects can be different, but this doesn’t stop them from being collected
into the same bag. The list() function creates a sort of bag into which you can place objects that
you have created in R.
Profile1 <- list(Province, Age)
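A brief sketch of how the objects inside a list keep their own type. The double square brackets used to take an object back out of the list are standard R but are not covered in these notes, and the name Profile1 is assumed for the list:

```r
Province <- c("Bohol", "Cebu", "Negros Oriental", "Siquijor")
Age <- c(28, 32, 27, 30)

# Put both objects in the same 'bag'
Profile1 <- list(Province, Age)

# Take each object back out by its position in the list
Profile1[[1]]   # the provinces, still text
Profile1[[2]]   # the ages, still numbers

class(Profile1[[1]])  # "character"
class(Profile1[[2]])  # "numeric"
```

Unlike a dataframe, a list does not require its objects to have the same length.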
The function cbind() is used simply for pasting columns of data together (you can also use rbind()
to combine rows of data together).
Profile2 <- cbind(Province, Age)
Notice that the numbers are in quotes; this is because the variable containing provinces is text,
so it causes the ages to be text as well. For this reason, cbind() is most useful for combining
variables of the same type.
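The effect described above can be checked directly. In this sketch, the second matrix (Numbers, with the made-up column Age_next) is added only to show the contrast when all columns are numeric:

```r
Province <- c("Bohol", "Cebu", "Negros Oriental", "Siquijor")
Age <- c(28, 32, 27, 30)

# Mixing text and numbers: every value becomes text
Profile2 <- cbind(Province, Age)
Profile2
class(Profile2[, "Age"])   # "character"

# Combining variables of the same type keeps the numbers as numbers
Numbers <- cbind(Age, Age_next = Age + 1)
is.numeric(Numbers)        # TRUE
```

This is why a dataframe, which lets each column keep its own type, is usually the better choice for mixed data.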
A coding variable (also known as a grouping variable or factor) is a variable that uses numbers to
represent different groups of data. As such, it is a numeric variable, but these numbers represent
names (i.e., it is a nominal variable).
First we can enter the data and then worry about turning these data into a coding variable.
Example
sex<-c(0,0,1,1)
In situations like this, in which all cases in the same group are grouped together in the data file,
we could do the same thing more quickly using the rep() function. This function takes the
general form of rep(number to repeat, how many repetitions).
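For the sex variable above, a minimal sketch with rep() could be:

```r
# Two cases coded 0 followed by two cases coded 1
sex <- c(rep(0, 2), rep(1, 2))

sex   # 0 0 1 1, the same as c(0, 0, 1, 1)
```

With only four cases the saving is small, but with hundreds of cases per group rep() avoids a great deal of typing.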
To turn this variable into a factor, we use the factor() function. This function takes the general
form:
factor(variable, levels = c(x, y, …, z), labels = c("label1", "label2", …, "label3"))
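Applied to the sex variable, the general form might be filled in as below. The notes do not say which group each number stands for, so the labels "Male" for 0 and "Female" for 1 are an assumption made only for this sketch:

```r
# The numeric codes entered earlier
sex <- c(0, 0, 1, 1)

# Turn the codes into a factor; the labels are assumed for illustration
sex <- factor(sex, levels = c(0, 1), labels = c("Male", "Female"))

sex          # the values now print as labels instead of numbers
levels(sex)  # the groups R knows about
table(sex)   # how many cases fall in each group
```

The order of the labels must match the order of the levels, otherwise the groups end up with the wrong names.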
If we have used a regular series such as 1, 2, 3, 4, we can abbreviate this as c(1:4), where the
colon simply means ‘all the values between’; so c(1:4) is the same as c(1,2,3,4).
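A quick check of the colon shorthand (the object names series and written_out are made up for the example):

```r
# The colon generates every whole number from the first value to the last
series <- c(1:4)
written_out <- c(1, 2, 3, 4)

series                      # 1 2 3 4
all(series == written_out)  # TRUE: both hold the same values
```

The same shorthand works inside factor(), for example levels = c(1:4).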
Missing Values
Missing data can occur for a variety of reasons: in long questionnaires participants accidentally
miss out questions; in experimental procedures mechanical faults can lead to a datum not being
recorded; and in research on delicate topics(e.g., sexual behavior) participants may exert their right
not to answer a question.
In R the code used for a missing value is NA (in capital letters), which stands for ‘not available’.
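A small sketch of how NA behaves, assuming that the second respondent’s age was not recorded (the data are made up for the example). The functions is.na() and the na.rm argument of mean() are standard R but are not covered in these notes:

```r
# The second age is missing
Age <- c(28, NA, 27, 30)

# is.na() flags which values are missing
is.na(Age)   # FALSE TRUE FALSE FALSE

# A calculation on data containing NA returns NA
mean(Age)

# na.rm = TRUE tells the function to leave out the missing values
mean(Age, na.rm = TRUE)   # about 28.33
```

Note that NA is typed without quotation marks; "NA" in quotes would be ordinary text.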
You can read a CSV data file using the read.csv() function:
> data <- read.csv(file = "data.csv", header = TRUE)
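To try read.csv() without having a data file to hand, the sketch below first writes a tiny CSV file to R’s temporary folder and then reads it back. The file contents and the csv_path object are made up for the demonstration:

```r
# Create a small CSV file to practise on (hypothetical contents)
csv_path <- file.path(tempdir(), "data.csv")
write.csv(data.frame(R_name = c("Renan", "John"), Age = c(28, 32)),
          csv_path, row.names = FALSE)

# Read it back; header = TRUE means the first line holds the variable names
data <- read.csv(file = csv_path, header = TRUE)

# Inspect the result
str(data)
data$Age
```

If the file is in your working directory, the file name alone (for example "data.csv") is enough.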
You can load the xlsx package, which is used for reading Excel files, with the require() function:
> require("xlsx")
Loading required package: xlsx
You can install the foreign package using the install.packages() function:
install.packages("foreign")
To read an SPSS file into a data frame, load the package and use the read.spss() function:
library(foreign)
Data <- read.spss(file = "data.spss", to.data.frame = TRUE)
It is possible to do some basic data editing (and analysis) using a package called Rcmdr (short
for R Commander). This package loads a window-style interface for basic data manipulation.
R Commander offers a basic spreadsheet-style interface for entering data (i.e. like Excel).
To create a new dataframe, select Data => New data set… which opens a dialog box that enables
you to name the dataframe.
To convert a numeric variable to a factor or coding variable, select Data => Manage variables in
active data set => Convert numeric variables to factors …
Select the variable that you want to convert. If you want to type some labels for the levels of
your coding variable, then select ‘Supply level names’ and click on OK.
To activate a submenu that enables you to open a text file, SPSS, Stata or Excel file, select Data
=> Import Data.