Afin8015 Topic 1 2023.

AFIN-8015: Financial Data Science


Topic-1- Introduction to Data Analytics & R

Contents

1 Introduction to AFIN8015
1.1 Assessment Structure

2 Data Science - An Introduction
2.1 What is Data Science?
2.1.1 Key Component of Data Science
2.1.2 Data Science Domains
2.1.3 Data Science Process
2.1.4 Life Cycle of a Data Science Project
2.1.5 Data Science Project Roles
2.2 Financial Analytics
2.2.1 Types of Financial Data (not an exhaustive list)

3 Introduction to R (part-I)
3.1 What is R?
3.2 Why should we learn R?
3.3 Installing R for Windows
3.4 R GUI
3.5 R Studio- A better way to do R
3.6 Installing RStudio for Windows
3.7 RStudio GUI/IDE
3.8 Installing Packages
3.9 Getting Help
3.10 Task Views in R-Introduction & Installation
3.11 R core packages.
3.12 Example-1 Hello R! and more.

4 R Data Types and Data Structures
4.1 Data Types
4.1.1 Double
4.1.2 Integer
4.1.3 Logical
4.1.4 Character
4.1.5 Factor
4.1.6 Date & Time
4.1.7 Missing Data in R
4.2 Data Structures in R
4.2.1 Vector
4.2.2 Matrices
4.2.3 Arrays
4.2.4 Data Frames
4.2.5 Lists

5 Data Import Export in R
5.1 Tabular Data
5.1.1 Reading Data from a Text File
5.1.2 Reading Data from CSV File
5.1.3 Reading from Excel Files
5.2 Importing Data using RStudio

6 To Do: Financial Market Data
6.1 FACTSET
6.2 Basic Analysis in R

References
Part 1

Introduction to AFIN8015

• This course introduces the fundamental process of data science for finance to students with an interest
in the rapidly growing area of FinTech (Financial Technology).

• The unit focuses on developing critical computational, statistical, and other contemporary analytical
skills that are essential for people conducting the data-driven financial analytics in the FinTech area.

• Main focus on

– Practical hands on introduction to the concepts and analytical skills through applied data-driven
examples.
– Discussion on contemporary methods in data science such as Regression and Classification meth-
ods, Data Management, Visualisation, Machine Learning etc.
– Practical implementation of these models using industry standard data sources and software (R).

• Course targets students interested in financial analytics, including data analytics, predictive and clas-
sification methods.
• The data science concepts and applications covered in this unit focus on analysing various types of
data sources in Financial Service sector to generate actionable insights.

• This is very much a hands-on course, with seminars conducted in the computer laboratories
emphasising empirical work and applied analysis of real market data.

– Use of personal computers is encouraged.


– R will be used for computation throughout the unit.

1.1 Assessment Structure

Assessment                     Weighting   Due       Details
Early Diagnostic Online Quiz   5%          Week-3    Online test on iLearn with 10 to 15 MCQs.
Financial Data Analysis 1      40%         Week-6    Students will be required to analyse real-world financial data sets using relevant descriptive statistics and visualisation techniques.
Financial Data Analysis 2      55%         Week-11   Students will conduct quantitative and qualitative analysis using data science tools and techniques and present the findings.
Table 1.1: Assessment Structure
Part 2

Data Science - An Introduction

2.1 What is Data Science?

• There is no single definition.

• Data science is a cross disciplinary practice.

• A simple definition, according to Wikipedia: “Data science is an inter-disciplinary field that uses
scientific methods, processes, algorithms and systems to extract knowledge and insights from many
structured and unstructured data. Data science is related to data mining, deep learning and big data.”

• Data science1 is a cross-disciplinary practice that draws on methods from

– data engineering, descriptive statistics, data mining, machine learning, and predictive analytics

• For this course, we will focus on data science applied to business analytics, specifically, financial
analytics to conduct Financial Data Science.
1 Data Science and Data Analytics are related terms with at times similar meanings.

2.1.1 Key Component of Data Science
• Statistics & Mathematics

• Domain Expertise: Specialised Knowledge or Skills e.g., Financial Analytical Methods for Classification
and Prediction.

• Computer Science

– Computer Programming (R, Python, other tools)


– Algorithms, Data Structures etc.
– Data Mining, Big Data Tools (Hadoop, NoSQL, etc)

• Data Engineering & Data Visualisation (part of computer science, but a bit different)

• Machine Learning

– Predictive Methods
– Classification Methods

• A combination of Descriptive, Predictive and Prescriptive Analytics.

• Figure 2.1 (Minelli et al., 2012) shows the major components of a big data analytics framework and
the questions it attempts to answer.
Figure 2.1: Big Data Analytics

Financial Data Science is the application of Data Science to generate insights in the Financial Services
domain.

2.1.2 Data Science Domains


• Figure 2.2 shows active data science domains
Figure 2.2: Data Science Domains (Dasgupta et al., 2018)

2.1.3 Data Science Process


• The methodical discovery of useful relationships and patterns in data is enabled by a set of iterative
activities collectively known as the data science process.

• The standard data science process involves (1) understanding the problem, (2) preparing the data
samples, (3) developing the model, (4) applying the model on a dataset to see how the model may
work in the real world, and (5) deploying and maintaining the models.

• Figure 2.3 shows a basic data science process (source: chapter 2 of Kotu (2018))

Figure 2.3: Data Science Process (Kotu, 2018)


• The steps within the data science process are not linear and have to undergo many iterations, checks
and balances, with back and forth reassessment.

2.1.4 Life Cycle of a Data Science Project


• Figure 2.4 depicts a typical data science project life cycle (chapter 1 of Mount & Zumel (2019))

Figure 2.4: Data Science Project- Life cycle


2.1.5 Data Science Project Roles
• Mount & Zumel (2019) mention the following roles in a data science project

Role Responsibilities
Project sponsor Represents the business interests; champions the project
Client Represents end users’ interests; domain expert
Data scientist Sets and executes analytic strategy; communicates with sponsor and client
Data architect Manages data and data storage; sometimes manages data collection
Operations Manages infrastructure; deploys final project results
Table 2.1: Data Science Project Roles

2.2 Financial Analytics

• Financial Analytics can be referred to as the in-depth exploration and analysis of financial data gener-
ated or used by companies to support their decision making process.

• Financial Analytics uses various Descriptive, Predictive and Prescriptive methods to generate action-
able insights enabling professionals to produce reliable estimates and forecasts of financial risk, re-
turns, asset pricing, cash flows etc.

• Financial Analytics may involve a range of methods, from simple descriptive statistics and data
visualisation to sophisticated statistical, machine learning, big data and time-series methods used to
model and forecast cash flows, profitability, risk, and return of financial assets.

2.2.1 Types of Financial Data (not an exhaustive list)


• Time Series Data

– Frequencies (annual, monthly, daily, intraday etc).

• Cross Sectional Data

– Characterised by individual companies, countries etc.

• Panel Data

– Both time series and cross sectional. For example, financial performance per month for N compan-
ies

• Unstructured Data

– Text Data etc.


Part 3

Introduction to R (part-I)

3.1 What is R?

According to the official webpage:

• R is a language and environment for statistical computing and graphics. It is a GNU project which is
similar to the S language and environment which was developed at Bell Laboratories (formerly AT&T,
now Lucent Technologies) by John Chambers and colleagues. R can be considered as a different
implementation of S. There are some important differences, but much code written for S runs unaltered
under R.1

According to Wikipedia

• R is a free software programming language and a software environment for statistical computing and
graphics. The R language is widely used among statisticians and data miners for developing statistical
software and data analysis. Polls and surveys of data miners are showing R’s popularity has increased
substantially in recent years.

1 http://www.r-project.org/about.html

To Summarise

• R is the most amazing free statistical software ever!

• This recent video by Revolution Analytics does a great job in summarizing R https://www.youtube.com/watch?v=ZCQHm

3.2 Why should we learn R?

R follows a type inference2 coding structure and provides a wide variety of statistical and graphical tech-
niques, including;

• Linear and non-linear modelling

• Univariate & Multivariate Statistics

• Classical statistical tests

• Time-series analysis/ Econometrics

• Simulation and Modelling

• Datamining-classification, clustering etc.


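2 Type inference refers to the automatic deduction of the type of an expression in a programming language.
Strictly speaking, R is dynamically typed: the type belongs to the value and is determined at run time, so
variables need no type declarations.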
• For computationally intensive tasks, C, C++, and Fortran code can be linked and called at run time.

• R is easily extensible through functions and extensions, and the R community is noted for its active
contributions in terms of packages. Total 11,799 packages and counting3

# Run to see the total number of current packages available to users


length(available.packages()[, 1])

3.3 Installing R for Windows

• The latest version of R can be downloaded from the R homepage4.

• R download page: http://www.cran.r-project.org/bin/windows/base/ The page also provides some
instructions and FAQs on R installation.

3.4 R GUI

The R GUI in Windows is shown in Figure 3.1.

Figure 3.1: R GUI (Windows)

3 The R command above needs an Internet connection to run. We will discuss how to set up a proxy on work computers later.
4 R is available for all three major operating systems (Windows/Mac/Linux), but here we only look at using R on
Windows. Using R on Mac/Linux will be slightly different.
3.5 R Studio- A better way to do R

• RStudio IDE is a powerful and productive user interface for R. It’s free and open source, and works
great on Windows, Mac, and Linux. (IDE: Integrated Development Environment)

• A good introduction to RStudio is found in the recently published book; Learning RStudio for R Statist-
ical Computing. http://www.packtpub.com/learning-rstudio-for-r-statistical-computing/book

3.6 Installing RStudio for Windows

• RStudio and R both work together so both have to be installed5.

3.7 RStudio GUI/IDE

• RStudio GUI is composed of 4 panes which can be rearranged according to the requirements.

• There are a lot of short introductions to RStudio available online so we will not go into more details.
The figure below gives the snapshot of RStudio GUI.

• A short intro to RStudio https://vimeo.com/97166163

5 Go to http://www.rstudio.com/ide/download/desktop to download RStudio for desktop
3.8 Installing Packages

• R provides several in-house and user contributed packages.

• The easiest way to install packages is via the R console. The command install.packages("package
name") installs R packages directly from the internet. Options controlling which dependencies of a
package are installed can be specified when calling this function. The first call to this function asks
the user to choose a CRAN mirror.

Run the following to install the tidyverse package in R. Also use the help function to get the details.

# Opens a webpage when called from R or shows help in the help window
# in RStudio
help(install.packages)
# Install package tidyverse with all the required dependencies.
install.packages("tidyverse", dependencies = c("Depends", "Suggests"))
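
Re-installing a package every time a script runs is slow, so it can help to check first whether the package is already available. The sketch below is one possible pattern, not part of the original notes; install_if_missing is a hypothetical helper name, while requireNamespace and install.packages are standard R functions.

```r
# install_if_missing() is a hypothetical helper, not a base R function
install_if_missing = function(pkg) {
  # requireNamespace() returns TRUE when the package can be loaded
  if (!requireNamespace(pkg, quietly = TRUE)) {
    install.packages(pkg)
  }
  requireNamespace(pkg, quietly = TRUE)
}

# "stats" ships with R, so nothing is downloaded here
install_if_missing("stats")
```

The same call with "tidyverse" would download the package only on the first run.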

3.9 Getting Help

As R is constantly evolving and new functions/packages are introduced every day, it is good to know
the sources of help. The most basic help one can get is via the help() function. This function shows the
help file for a function, which has been written by the package maintainers.

help("function name")

The following can be used to search for a function etc.

# Replace 'search string' with the expression you want to search for
??"search string"
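
Besides help() and ??, a couple of other base functions are handy when the exact function name is not known. The following is a small illustration using functions that ship with R; the choice of sd and of the "read." pattern is only an example.

```r
# argument list of a function (here the standard deviation function sd)
args(sd)

# find functions whose names match a pattern,
# here all names starting with "read."
found = apropos("^read\\.")
found

# ?mean is shorthand for help("mean"), and
# help.search("linear model") is the long form of ??"linear model"
```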

• All the R packages (with few exceptions) have a user’s manual listing the functions in a package. This
can be downloaded in PDF format from the R package download page6.

• R also provides some search tools given at http://cran.r-project.org/search.html The R Site search
is helpful in searching for topics related to problem in hand.

• Other than these, various good R-related blogs are on the internet which can be really helpful. A
combined up-to-date view of 452 contributed blogs can be found at R-bloggers7.

• Overall, there is quite a big community of R users, and help can be found for most topics.

6 For example, the reference manual for the quantreg package is at http://cran.r-project.org/web/packages/quantreg/quantreg.pdf
7 Go to www.r-bloggers.com
3.10 Task Views in R-Introduction & Installation

• Task Views in R provide packages grouped together according to a generalized task they are used for.

• Table below gives the name of task views available8.

• The following commands install the package ctv and then Finance task view.

# install package task views


install.packages("ctv")
library("ctv") # the R function library() is used to load a package
# install Finance task view
install.views("Finance")
8 This list of available task views can be found at http://cran.r-project.org/web/views/
Table 3.1: CRAN Task Views
Task View                   Description
Bayesian                    Bayesian Inference
ChemPhys                    Chemometrics and Computational Physics
ClinicalTrials              Clinical Trial Design, Monitoring, and Analysis
Cluster                     Cluster Analysis & Finite Mixture Models
DifferentialEquations       Differential Equations
Distributions               Probability Distributions
Econometrics                Computational Econometrics
Environmetrics              Analysis of Ecological and Environmental Data
ExperimentalDesign          Design of Experiments (DoE) & Analysis of Experimental Data
Finance                     Empirical Finance
Genetics                    Statistical Genetics
Graphics                    Graphic Displays & Dynamic Graphics & Graphic Devices & Visualization
HighPerformanceComputing    High-Performance and Parallel Computing with R
MachineLearning             Machine Learning & Statistical Learning
MedicalImaging              Medical Image Analysis
OfficialStatistics          Official Statistics & Survey Methodology
Optimization                Optimization and Mathematical Programming
Pharmacokinetics            Analysis of Pharmacokinetic Data
Phylogenetics               Phylogenetics, Especially Comparative Methods
Psychometrics               Psychometric Models and Methods
ReproducibleResearch        Reproducible Research
Robust                      Robust Statistical Methods
SocialSciences              Statistics for the Social Sciences
Spatial                     Analysis of Spatial Data
SpatioTemporal              Handling and Analyzing Spatio-Temporal Data
Survival                    Survival Analysis
TimeSeries                  Time Series Analysis
gR                          gRaphical Models in R
Multivariate                Multivariate Statistics
NaturalLanguageProcessing   Natural Language Processing


3.11 R core packages.

• R comes with a few bundled core packages which provide various data analytics/statistical
capabilities. The base package has the basic functions and operators required for analytical
programming; stats is another example of a core R package.

# List of R core packages


row.names(installed.packages(priority = "base"))
# [1] "base" "compiler" "datasets" "graphics" "grDevices"
# [6] "grid" "methods" "parallel" "splines" "stats"
# [11] "stats4" "tcltk" "tools" "utils"

• A command such as help(package = "base") should give a help page similar to Figure 3.3, which
lists selectable help pages for the various functions in R’s base package.
Figure 3.3: R base package help directory

• Next we will learn some basics about programming, but before that we’ll do the following example,
which illustrates what we have learnt so far along with some useful tips9.

3.12 Example-1 Hello R! and more.

message("Hello R!") #use to display messages


print("Hello R!") #use to display variables/messages

# [1] "Hello R!"

msg = "Hello R!" #type inference no need to define strings!


print(msg)

# [1] "Hello R!"

• R packages come with various datasets and demo codes specific to the packages.

demo(package = "base")
## run demo() to display all the available demos in the loaded
## packages

Let's run the recursion demo from the base package in RStudio

demo(recursion)

9 All the examples/codes are executed in RStudio
Figure 3.2: RStudio IDE
Part 4

R Data Types and Data Structures

4.1 Data Types

• As per R’s official language definition, in every computer language variables provide a means of
accessing the data stored in memory.

• R does not provide direct access to the computer’s memory but rather provides a number of specialized
data structures we will refer to as objects. These objects are referred to through symbols or variables.

4.1.1 Double
Doubles are numbers like 5.0, 5.5, 10.999 etc. They may or may not include decimal places. Doubles are
mostly used to represent a continuous variable like weight, age etc.
x = 8.5
is.double(x) #to check if the data type is double

# [1] TRUE
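
Because doubles are stored in binary floating point, some decimal fractions cannot be represented exactly. This matters when comparing computed prices or returns. The lines below are a small illustration added here, using only base R functions.

```r
a = 0.1 + 0.2
a == 0.3                    # FALSE: exact comparison of doubles can mislead
isTRUE(all.equal(a, 0.3))   # TRUE: comparison up to a small tolerance
print(a, digits = 17)       # shows the stored value to 17 significant digits
```

When two doubles should be "equal", all.equal (wrapped in isTRUE) is usually the safer test.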

4.1.2 Integer
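Integers are whole numbers (positive, negative or zero). Note that a number typed without a decimal point
is still stored as a double by default.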
x = 9
typeof(x)

# [1] "double"

The following specifically assigns an integer to x


x = as.integer(9)
typeof(x)

# [1] "integer"
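
As an alternative to as.integer, an integer can be written directly with the L suffix. The short sketch below also shows which common operations keep or lose the integer type.

```r
x = 9L               # the L suffix creates an integer literal
typeof(x)            # "integer"
y = 1:5              # the : operator also produces integers
typeof(y)            # "integer"
typeof(x / 2L)       # "double": ordinary division always returns a double
x %/% 2L             # 4: integer (floor) division keeps the integer type
```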

4.1.3 Logical
A variable of data type logical has the value TRUE or FALSE.
x = 11
y = 10
a = x > y
a

# [1] TRUE
typeof(a)

# [1] "logical"

When a calculation is performed on logical objects in R, FALSE is treated as 0 and TRUE is treated as
1.
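
This 0/1 coercion makes it easy to count or average conditions. A minimal sketch, using made-up daily returns purely for illustration:

```r
returns = c(0.012, -0.004, 0.007, -0.010, 0.003)   # made-up daily returns
up = returns > 0      # logical vector: TRUE where the return is positive
sum(up)               # 3: number of positive returns
mean(up)              # 0.6: share of positive returns
```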

4.1.4 Character
Characters represent string values in R. An object of type character can hold alphanumeric strings.
Character objects are specified by placing a string or collection of characters between double quotes
("string"); single quotes work as well. Everything inside the quotes is considered a string in R.

x = "This is a string"
print(x)

# [1] "This is a string"

x = "a"
typeof(x)

# [1] "character"
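
A few base R functions cover most everyday string handling. The example below is an added illustration; the ticker string is just sample text.

```r
ticker = "bhp"                             # sample text only
nchar(ticker)                              # 3: number of characters
toupper(ticker)                            # "BHP"
paste(toupper(ticker), "AX", sep = ".")    # "BHP.AX": join strings
substr("2012-01-31", 1, 4)                 # "2012": characters 1 to 4
typeof('single quotes work too')           # "character"
```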
4.1.5 Factor
Factor is an important data type for representing categorical data. It also comes in handy when dealing
with panel or longitudinal data. Examples of factors are blood type (A, B, AB, O) and sex (Male or Female).
Factor objects can be created from character objects or from numeric objects.
Note: The function c is used to create a vector of values, which can be of any data type.
b.type = c("A", "AB", "B", "O") #character object
# use factor function to convert to factor object
b.type = factor(b.type)
b.type

# [1] A AB B O
# Levels: A AB B O

# to get individual elements (levels) in factor object


levels(b.type)

# [1] "A" "AB" "B" "O"
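
By default the levels of a factor are sorted alphabetically. When the categories have a natural order, the levels argument of factor can set it explicitly. A small sketch with made-up credit ratings:

```r
rating = c("AA", "BBB", "A", "AA", "BBB", "AA")     # made-up credit ratings
f = factor(rating, levels = c("BBB", "A", "AA"))    # set the level order
levels(f)        # "BBB" "A" "AA"
table(f)         # number of observations per level
as.integer(f)    # underlying integer codes: 3 1 2 3 1 3
```

The integer codes show that a factor is stored as integers plus a lookup table of level labels.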

4.1.6 Date & Time


R is capable of dealing with calendar dates and times. Dates are important objects when dealing with time
series models. The function as.Date can be used to create an object of class Date.1
Tip: Use args(function name) to see the various arguments in a function.
1 See help(as.Date) for more details about the formats of dates.
date1 = "31-01-2012"
date1 = as.Date(date1, "%d-%m-%Y")
date1

# [1] "2012-01-31"

data.class(date1)

# [1] "Date"

# The date and time are internally interpreted as Double so the


# function typeof will return the type Double
typeof(date1)

# [1] "double"
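
Since a Date is stored as a number of days, ordinary arithmetic works on it. The following added sketch shows a few common operations using base R only; the dates are arbitrary examples.

```r
d1 = as.Date("31-01-2012", "%d-%m-%Y")
d2 = as.Date("2012-03-01")          # ISO format needs no format string
d2 - d1                             # time difference in days
as.numeric(d2 - d1)                 # 30
d1 + 7                              # one week later: "2012-02-07"
format(d1, "%d/%m/%Y")              # back to character: "31/01/2012"
# a monthly sequence of dates
months = seq(as.Date("2012-01-01"), by = "month", length.out = 3)
months
```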

• R has two inbuilt classes, POSIXct and POSIXlt, which can be used to represent calendar dates
and times.

• A character date or time can be converted to these classes by calling as.POSIXct (to create a
POSIXct object) or as.POSIXlt (to create a POSIXlt object). These functions accept a date, a time, or
a date with time as character input and use a format argument to specify a non-default format. A time
zone can also be specified when dealing with a specific time zone2.
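
The conversion described above can be sketched as follows. The timestamps are arbitrary examples, and the time zone is fixed to "UTC" here only so that the result does not depend on the computer's local settings.

```r
t1 = as.POSIXct("31-01-2012 09:30:00", format = "%d-%m-%Y %H:%M:%S", tz = "UTC")
t2 = as.POSIXct("2012-01-31 16:00:00", tz = "UTC")   # default format
class(t1)                                  # "POSIXct" "POSIXt"
difftime(t2, t1, units = "hours")          # time difference of 6.5 hours
format(t1, "%Y-%m-%d %H:%M")               # "2012-01-31 09:30"
lt = as.POSIXlt(t1)                        # list-like representation
lt$hour                                    # 9
```

POSIXct stores a single number (seconds since 1970), which suits data frames, while POSIXlt exposes components such as hour and minute.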
2 See help(as.POSIXct) or help(as.POSIXlt) for further details. strptime is a very useful function to convert
one format of date and time in character to another. See help(strptime) for the different date/time formats.

4.1.7 Missing Data in R
• Datasets available for research often have missing data. In R, missing data is represented by NA (Not
Available), which can stand in for a value of any data type. Another symbol, NaN (Not a Number),
represents an undefined numeric result such as 0/0.

• NULL in R represents a null object with length zero, or an undefined object.

• We often come across infinite values in models (for instance, from division by zero). -Inf and Inf
represent negative and positive infinite values in R.

• The following example shows how to detect missing values in a data vector.

m.data = c("100", "200", "missing")
# converting m.data to double will create one missing value, as 'missing'
# is not a double
m.data = as.double(m.data)
# the warning message tells us that an NA was inserted for a value
# which couldn't be converted to type double
is.na(m.data) #check for the missing value

# [1] FALSE FALSE TRUE

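
The special values NA, NaN, Inf and NULL behave differently in calculations. The sketch below, added for illustration with made-up numbers, shows the checks that are typically used for each.

```r
x = c(1.5, NA, 3.0, NaN)
is.na(x)               # FALSE TRUE FALSE TRUE (NaN also counts as missing)
is.nan(x)              # FALSE FALSE FALSE TRUE
mean(x)                # missing, because NA/NaN propagate through calculations
mean(x, na.rm = TRUE)  # 2.25: missing values are dropped first
1 / 0                  # Inf
is.finite(c(1, Inf, -Inf, NA))   # TRUE FALSE FALSE FALSE
length(NULL)           # 0
```

Most summary functions (mean, sum, sd and so on) accept na.rm = TRUE to skip missing values.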


4.2 Data Structures in R

Every data analysis requires the data to be structured in a well-defined way. These coherent ways of
putting data together form the basic data structures in R. Every data set intended for analysis has to be
imported into the R environment as a data structure. R has the following basic data structures:

• Vector

• Matrix

• Array

• Data Frame

• Lists

4.2.1 Vector
• Vectors are groups of values having the same data type.

• There can be numeric vectors, character vectors and so on. Vectors are mostly used to represent a
single variable in a data set.

• A vector is constructed using the function c. The same function c can be used to combine different
vectors of the same data type.
vec1 = c(1, 2, 3, 4, 5)
vec1

# [1] 1 2 3 4 5
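
A few more vector operations are worth seeing early on: combining, indexing, element-wise arithmetic and type coercion. The following is a short added illustration.

```r
vec1 = c(1, 2, 3, 4, 5)
vec2 = c(6, 7)
vec3 = c(vec1, vec2)     # combine two numeric vectors
length(vec3)             # 7
vec3[2]                  # 2: second element (indexing starts at 1)
vec3[vec3 > 4]           # 5 6 7: logical indexing
vec1 * 10                # arithmetic is applied element by element
typeof(c(1, "a"))        # "character": mixed types are coerced to one type
```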

4.2.2 Matrices
• A matrix is a collection of data elements arranged in a two-dimensional rectangular layout. Like vectors,
all the elements in a matrix are of the same data type.

\[
\begin{pmatrix} 1 & 2 \\ 3 & 4 \\ 5 & 6 \end{pmatrix}
\]

• The function matrix is used to create matrices in R. Note that all the elements in a matrix object are of
the same basic type. Let's create the matrix in the example above

m1 = matrix(c(1, 2, 3, 4, 5, 6), nrow = 3, ncol = 2, byrow = TRUE)


# nrow-specify number of rows, ncol-specify number of columns,
# byrow-fill the matrix in rows with the data supplied
m1 #print the matrix
# [,1] [,2]
# [1,] 1 2
# [2,] 3 4
# [3,] 5 6

• A vector can be converted to matrix using dim function, e.g;

m2 = c(1, 2, 3, 4, 5, 6)
dim(m2) = c(3, 2) #the matrix will be filled by columns
m2

# [,1] [,2]
# [1,] 1 4
# [2,] 2 5
# [3,] 3 6

# use dim to get the dimension (#rows and #columns) of a matrix


dim(m1)

# [1] 3 2
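
Matrices can also be built from existing vectors with cbind (bind as columns) and rbind (bind as rows). A minimal sketch, with vector names a and b chosen only for this example:

```r
a = c(1, 3, 5)
b = c(2, 4, 6)
mc = cbind(a, b)     # 3 x 2 matrix, the vectors become columns
mr = rbind(a, b)     # 2 x 3 matrix, the vectors become rows
dim(mc)              # 3 2
dim(mr)              # 2 3
mc[2, "b"]           # value 4: row 2 of the column named b
```

The vector names are kept as column (or row) names, so elements can be selected by name as well as by position.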

Note: Functions cbind and rbind can also be used to create matrices by combining two or more vectors by
columns or by rows.

Matrix Manipulations
• For calculations on matrices; all the mathematical functions available for vectors are applicable on a
matrix. All operations are applied on each element in a matrix, e.g.

m3 = m1 * 2 # all elements will be multiplied by 2 individually


m3

# [,1] [,2]
# [1,] 2 4
# [2,] 6 8
# [3,] 10 12

• A matrix can be multiplied element-wise with a vector. The vector is recycled down the columns, and R
gives a warning if the length of the matrix is not a multiple of the length of the vector. Try different
combinations of matrix and vector arithmetic to see the results and errors.
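
To see how a vector is recycled in element-wise arithmetic with a matrix, consider the small example below (added for illustration). The vector has 3 elements and the matrix has 6, so the vector is used twice, once per column.

```r
m1 = matrix(c(1, 2, 3, 4, 5, 6), nrow = 3, ncol = 2, byrow = TRUE)
v = c(10, 100, 1000)     # length 3, while length(m1) is 6
m4 = m1 * v              # v is recycled down each column
m4

#      [,1] [,2]
# [1,]   10   20
# [2,]  300  400
# [3,] 5000 6000
```

Each row of m1 is scaled by the matching element of v; this is different from the matrix product computed with %*%.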

• Mathematical matrix operations are also available for matrices in R. For instance, %*% is used for
matrix multiplication; the matrices must agree dimensionally for matrix multiplication. For example
(note the use of the : operator to create a sequence):
dim(m1) # 3 rows and 2 columns

# [1] 3 2

# create another matrix with 2 rows and 3 columns


m3 = matrix(c(1:6), ncol = 3)
m1 %*% m3
# [,1] [,2] [,3]
# [1,] 5 11 17
# [2,] 11 25 39
# [3,] 17 39 61

R facilitates various matrix specific operations. Table 4.1 gives most of the available functions and operators.
Use help() or ? followed by the function name to get more details about the operators and functions.
Table 4.1: Functions and operators for matrices
Operator or Function    Description
X * Y                   Element-wise multiplication
X %*% Y                 Matrix multiplication
X %o% Y                 Outer product, XY'
crossprod(X, Y)         X'Y
crossprod(X)            X'X
t(X)                    Transpose
diag(x)                 Creates a diagonal matrix with the elements of vector x on the principal diagonal
diag(X)                 Returns a vector containing the elements of the principal diagonal of matrix X
diag(k)                 If k is a scalar, this creates a k x k identity matrix
solve(X, b)             Returns the vector x in the equation b = Xx (i.e., X^{-1}b)
solve(X)                Inverse of X, where X is a square matrix
y = eigen(X)            y$values are the eigenvalues of X; y$vectors are the eigenvectors of X
y = svd(X)              Singular value decomposition of X
R = chol(X)             Cholesky factorization of X. Returns the upper triangular factor, such that R'R = X
y = qr(X)               QR decomposition of X
cbind(X, Y, ...)        Combine matrices (vectors) horizontally. Returns a matrix
rbind(X, Y, ...)        Combine matrices (vectors) vertically. Returns a matrix
rowMeans(X)             Returns vector of row means
rowSums(X)              Returns vector of row sums
colMeans(X)             Returns vector of column means
colSums(X)              Returns vector of column sums
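
As a quick check on a few of these functions, the sketch below solves a small linear system. The matrix and vector are made-up numbers chosen so the answer is easy to verify by hand.

```r
X = matrix(c(2, 1, 1, 3), nrow = 2)   # symmetric 2 x 2 matrix
b = c(3, 5)
Xinv = solve(X)          # inverse of X
X %*% Xinv               # identity matrix (up to rounding)
sol = solve(X, b)        # solves X %*% sol = b without forming the inverse
sol                      # 0.8 1.4
t(X)                     # transpose (equal to X here, as X is symmetric)
crossprod(X)             # same as t(X) %*% X
```

Calling solve(X, b) directly is generally preferred to computing solve(X) %*% b, since it avoids forming the inverse explicitly.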
4.2.3 Arrays
• Arrays are the generalisation of vectors and matrices. A vector in R is a one dimensional array and a
matrix a two dimensional array. An array is a multiply subscripted collection of data entries of the same
data type. Arrays can be constructed using the function array, for example3

z = c(1:24) #vector of length 24


# constructing a 3 by 4 by 2 array
a1 = array(z, dim = c(3, 4, 2))
a1

# , , 1
#
#      [,1] [,2] [,3] [,4]
# [1,]    1    4    7   10
# [2,]    2    5    8   11
# [3,]    3    6    9   12
#
# , , 2
#
#      [,1] [,2] [,3] [,4]
# [1,]   13   16   19   22
# [2,]   14   17   20   23
# [3,]   15   18   21   24

3 Function dim can also be used to define an array by assigning dimensions to a vector.

• Individual elements of an array are accessed by referring to them by their index. This is done by giving
the name of the array followed by the subscripts (indices) in square brackets, separated by commas.
We access the element [1,3,1] of array a1 in the following example

# element in the row 1 and column 3 in the first subset


a1[1, 3, 1]

# [1] 7
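
Leaving an index empty selects everything along that dimension, which gives whole rows or slices of an array. A short added sketch using the same 3 by 4 by 2 array:

```r
a1 = array(1:24, dim = c(3, 4, 2))
dim(a1)              # 3 4 2
a1[, , 2]            # the whole second 3 x 4 slice
a1[2, , 1]           # 2 5 8 11: row 2 of the first slice
apply(a1, 3, sum)    # 78 222: sum of each slice
```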

• Next we discuss the Data Frames which are the most convenient data structures for data analysis in
R.

4.2.4 Data Frames


• Data frames are the most convenient data structure in R for representing tabular data.

• In quantitative research, data is often in the form of data tables. These data tables have multiple rows
and can have multiple columns, with each column representing a different variable (quantity).
• A data frame in R is the most natural way to represent these data sets, as its columns can be of
different data types. Most statistical routines in R accept a data frame as input.
The following example uses an important function, str, on R’s inbuilt data frame “swiss”. The str function
is used to see the internal structure of an object in R.

# swiss dataframe has standardized fertility measure and


# socio-economic indicators for each of 47 French-speaking provinces
# of Switzerland at about 1888.
data(swiss)
str(swiss)

# 'data.frame': 47 obs. of 6 variables:


# $ Fertility : num 80.2 83.1 92.5 85.8 76.9 ...
# $ Agriculture : num 17 45.1 39.7 36.5 43.5 ...
# $ Examination : int 15 6 5 12 17 ...
# $ Education : int 12 9 5 7 15 ...
# $ Catholic : num 9.96 84.84 ...
# $ Infant.Mortality: num 22.2 22.2 20.2 20.3 20.6 ...

• Data frames have two attributes, namely names and row.names, which contain the column names
and row names respectively. The data in a named column can be accessed with the $ operator.
# using names and row.names
names(swiss) #name of the columns (can also use colnames)

# [1] "Fertility" "Agriculture" "Examination"


# [4] "Education" "Catholic" "Infant.Mortality"

colnames(swiss)

# [1] "Fertility" "Agriculture" "Examination"


# [4] "Education" "Catholic" "Infant.Mortality"

row.names(swiss) #name of the rows

# [1] "Courtelary" "Delemont" "Franches-Mnt" "Moutier"


# [5] "Neuveville" "Porrentruy" "Broye" "Glane"
# [9] "Gruyere" "Sarine" "Veveyse" "Aigle"
# [13] "Aubonne" "Avenches" "Cossonay" "Echallens"
# [17] "Grandson" "Lausanne" "La Vallee" "Lavaux"
# [21] "Morges" "Moudon" "Nyone" "Orbe"
# [25] "Oron" "Payerne" "Paysd'enhaut" "Rolle"
# [29] "Vevey" "Yverdon" "Conthey" "Entremont"
# [33] "Herens" "Martigwy" "Monthey" "St Maurice"
# [37] "Sierre" "Sion" "Boudry" "La Chauxdfnd"
# [41] "Le Locle" "Neuchatel" "Val de Ruz" "ValdeTravers"
# [45] "V. De Geneve" "Rive Droite" "Rive Gauche"
swiss$Fertility #returns the vector of data in the column Fertility

# [1] 80.2 83.1 92.5 85.8 76.9 76.1 83.8 92.4 82.4 82.9 87.1 64.1 66.9
# [14] 68.9 61.7 68.3 71.7 55.7 54.3 65.1 65.5 65.0 56.6 57.4 72.5 74.2
# [27] 72.0 60.5 58.3 65.4 75.5 69.3 77.3 70.5 79.4 65.0 92.2 79.3 70.4
# [40] 65.7 72.7 64.4 77.6 67.6 35.0 44.7 42.8
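
The $ operator is one of several ways to reach into a data frame. The sketch below, again on swiss, shows equivalent column extractions and selection of rows by name or by a logical condition; the variable names (f1, f_df, high and so on) are introduced here for illustration only.

```r
data(swiss)

# Three equivalent ways to extract the Fertility column as a numeric vector
f1 = swiss$Fertility
f2 = swiss[["Fertility"]]
f3 = swiss[, "Fertility"]
identical(f1, f2) && identical(f2, f3)

# Single brackets with one index keep the data frame structure
f_df = swiss["Fertility"]
class(f_df)

# Rows can be selected by row name and columns by column name in one call
lausanne = swiss["Lausanne", c("Fertility", "Education")]
lausanne

# A logical condition selects rows; drop = FALSE keeps the result a data frame
high = swiss[swiss$Fertility > 90, "Fertility", drop = FALSE]
high
```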

• Data frames are constructed using the function data.frame. For example, the following creates a data
frame from a character vector and a numeric vector.

num1 = 1:5
ch1 = c("A", "B", "C", "D", "E")
df1 = data.frame(ch1, num1)
df1

# ch1 num1
# 1 A 1
# 2 B 2
# 3 C 3
# 4 D 4
# 5 E 5
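
Columns can also be named directly in the call to data.frame, and a data frame can be extended after it is built. The following is a small sketch; the names df2, df3, id, label and id_squared are hypothetical and chosen only for this example.

```r
# Build a data frame with explicit column names
df2 = data.frame(
  id    = 1:5,
  label = c("A", "B", "C", "D", "E"),
  stringsAsFactors = FALSE  # keep label as character on every R version
)

# Add a derived column and check the resulting structure
df2$id_squared = df2$id^2
str(df2)

# Rows can be appended with rbind as long as the column names match
new_row = data.frame(id = 6L, label = "F", id_squared = 36,
                     stringsAsFactors = FALSE)
df3 = rbind(df2, new_row)
nrow(df3)
```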
4.2.5 Lists
• A list is a generic vector that can contain other objects. Its elements can be of any type and
structure, and they can have different lengths.

• A list can contain another list and therefore it can be used to construct arbitrary data structures.

• A list can be constructed using the list function, for example

e1 = c(2, 3, 5) #element-1
e2 = c("aa", "bb", "cc", "dd", "ee") #element-2
e3 = c(TRUE, FALSE, TRUE, FALSE, FALSE) #element-3
e4 = df1 #element-4 (previously constructed data frame)
lst1 = list(e1, e2, e3, e4) # lst1 contains copies of e1,e2,e3,e4
str(lst1) #show the structure of lst1

# List of 4
# $ : num [1:3] 2 3 5
# $ : chr [1:5] "aa" "bb" ...
# $ : logi [1:5] TRUE FALSE TRUE ...
# $ :'data.frame': 5 obs. of 2 variables:
# ..$ ch1 : chr [1:5] "A" "B" ...
# ..$ num1: int [1:5] 1 2 3 4 5
• Components are always numbered and may always be referred to as such.

• Thus if lst1 is the name of a list with four components, these may be individually referred to as lst1[[1]],
lst1[[2]], lst1[[3]] and lst1[[4]]. Note: When a single square bracket is used, the component is returned
as a list, while the double square bracket returns the component itself.

# first element of lst1
lst1[[1]]

# [1] 2 3 5

lst1[1]

# [[1]]
# [1] 2 3 5
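
The difference between the two bracket forms is easiest to see by checking the class of what each returns. This is a minimal sketch on a small stand-in list (lst_demo is a hypothetical name, not an object from the notes).

```r
lst_demo = list(c(2, 3, 5), c("aa", "bb"))  # two-element list for illustration

class(lst_demo[1])    # single bracket returns a sub-list
class(lst_demo[[1]])  # double bracket returns the element itself

# Arithmetic works on the extracted element, not on the sub-list
total = sum(lst_demo[[1]])
total

# Single brackets can select several components at once
length(lst_demo[c(1, 2)])
```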

The elements in a list can also be named, either in the call to list or afterwards with names, and can
then be referred to individually by their names.

names(lst1) = c("e1", "e2", "e3", "e4")


names(lst1) #name of the elements

# [1] "e1" "e2" "e3" "e4"

lst1$e1 #using the $ operator to refer to the element

# [1] 2 3 5
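
Names can equally be supplied when the list is created, and because a list may contain another list, nested structures can be reached by chaining $. The sketch below uses a made-up record; the object name, field names and all values are purely illustrative.

```r
# Name the elements when the list is created; 'meta' is itself a list
record = list(
  ticker = "XYZ",                     # made-up ticker
  prices = c(40.1, 41.3, 39.8),       # made-up prices
  meta   = list(exchange = "EX1", currency = "CUR")
)

names(record)            # names of the top-level elements
record$meta$currency     # chained access into the nested list
record[["prices"]][2]    # second value of the prices element

# Elements can be added by assignment and removed by assigning NULL
record$sector = "Unknown"
record$prices = NULL
names(record)
```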
Part 5

Data Import Export in R

Errors using inadequate data are much less than those using no data at all.
-Charles Babbage

5.1 Tabular Data

5.1.1 Reading Data from a Text File


• The easiest way to import data into R is from a tabular format saved in a text file.

• To import tabular data from a text file, R provides the function read.table(). It is the most
convenient function for importing tabular data from text files and works well for data files of
small or moderate size that hold data in a rectangular format. The arguments which can be passed
to read.table() are given below.

args(read.table)

# function (file, header = FALSE, sep = "", quote = "\"'", dec = ".",
# numerals = c("allow.loss", "warn.loss", "no.loss"), row.names,
# col.names, as.is = !stringsAsFactors, na.strings = "NA",
# colClasses = NA, nrows = -1, skip = 0, check.names = TRUE,
# fill = !blank.lines.skip, strip.white = FALSE, blank.lines.skip = TRUE,
# comment.char = "#", allowEscapes = FALSE, flush = FALSE,
# stringsAsFactors = FALSE, fileEncoding = "", encoding = "unknown",
# text, skipNul = FALSE)
# NULL

• Some of the important arguments for the function read.table are discussed below; for the rest, see the
help file using help(read.table).

Argument      Description
file          The name of the tabular (text) file to import, along with the full path
header        A logical argument to specify if the names of the variables are available in the first row
sep           Character to specify the separator type; the default "" takes any white space as a separator
quote         To specify if the character vectors in the data are in quotes; this should specify the type of quotes
as.is         A logical value to specify whether character vectors are kept as characters (TRUE) or converted to factors (FALSE). The default behaviour is to read characters as characters and not factors
strip.white   A logical value to specify if the extra leading and trailing white spaces have to be removed from the character fields. This is used only when sep has been specified (sep != "")
fill          Logical value to specify whether rows with fewer fields than the others should be padded with blank fields
• The example below imports a tab delimited text file. Note the use of “\t” in the sep argument for tab
delimited data. The header argument is also TRUE here as our dataset has variable names in the first row.

• Note that in the example below, the working directory for the RStudio session has already been set to
the destination file’s directory. If the working directory is different from the location of the data file, then
either the working directory should be changed using setwd or RStudio’s GUI, or the full path for the file’s
location should be provided with the file name.

data_readtable = read.table("demo_data.txt", sep = "\t", header = TRUE)
head(data_readtable)

# Date AAPL MSFT
# 1 2/01/1998 4.06 16.39
# 2 5/01/1998 3.97 16.30
# 3 6/01/1998 4.73 16.39
# 4 7/01/1998 4.38 16.20
# 5 8/01/1998 4.55 16.31
# 6 9/01/1998 4.55 15.88
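
Since demo_data.txt is not distributed with these notes, the same call can be tried on a small file created on the fly. The sketch below writes three rows (copied from the output above) to a temporary file and reads them back. It also converts the Date column, assuming the dates are in day/month/year order; that ordering is an assumption, as is the object name demo.

```r
# Write a small tab-delimited file to a temporary location
tmp_txt = tempfile(fileext = ".txt")
writeLines(c("Date\tAAPL\tMSFT",
             "2/01/1998\t4.06\t16.39",
             "5/01/1998\t3.97\t16.30",
             "6/01/1998\t4.73\t16.39"), tmp_txt)

# Read it back: tab separator, first row holds the variable names
demo = read.table(tmp_txt, sep = "\t", header = TRUE,
                  stringsAsFactors = FALSE)
str(demo)

# Date is read as character; convert it assuming day/month/year order
demo$Date = as.Date(demo$Date, format = "%d/%m/%Y")
class(demo$Date)
```

Converting the dates early makes later sorting, subsetting and plotting by date behave as expected.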

• After importing from a text file, the data can be saved in .Rdata format using save, or written to
another text file using write.table, as shown below:
# saving data as an object in .Rdata format
save(data_readtable, file = "data1.Rdata")
# saving data into another text file
write.table(data_readtable, file = "data1.txt")

• The .Rdata file can be easily loaded into the system using the load function, as shown below. Note that
we are still in the same working directory; if this is not the case, you will have to provide the path or
change the directory.

load("data1.Rdata") #using load to load R data
head(data_readtable)

# Date AAPL MSFT
# 1 2/01/1998 4.06 16.39
# 2 5/01/1998 3.97 16.30
# 3 6/01/1998 4.73 16.39
# 4 7/01/1998 4.38 16.20
# 5 8/01/1998 4.55 16.31
# 6 9/01/1998 4.55 15.88
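
The save/load round trip and the effect of the write.table options can be checked on a tiny data frame. This is a sketch under stated assumptions: df_demo and the temporary file names are hypothetical, and the two rows are taken from the demo data above.

```r
df_demo = data.frame(Date = c("2/01/1998", "5/01/1998"),
                     AAPL = c(4.06, 3.97),
                     stringsAsFactors = FALSE)

# Save the object, remove it from the workspace, then restore it
rdata_file = tempfile(fileext = ".Rdata")
save(df_demo, file = rdata_file)
original = df_demo
rm(df_demo)
loaded_names = load(rdata_file)  # load returns the names of restored objects
loaded_names
identical(df_demo, original)     # the object comes back under its old name

# write.table writes row names and quotes by default; switch both off
# so the file can be read back with header = TRUE without surprises
txt_file = tempfile(fileext = ".txt")
write.table(original, file = txt_file, sep = "\t",
            row.names = FALSE, quote = FALSE)
back = read.table(txt_file, sep = "\t", header = TRUE,
                  stringsAsFactors = FALSE)
all.equal(back, original)
```

Reading a file that contains a header row without header = TRUE is what produces default column names such as V1, V2, V3, with the variable names appearing as the first data row.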
5.1.2 Reading Data from CSV File
• Reading data from a CSV file is made easy by the read.csv function, a wrapper around read.table with
defaults suited to CSV files. It facilitates direct import of data from CSV files. read.csv takes the
following arguments:

args(read.csv)

# function (file, header = TRUE, sep = ",", quote = "\"", dec = ".",
# fill = TRUE, comment.char = "", ...)
# NULL

• The following example imports a CSV file with the same data as previously imported from a text file.

# Check the working directory before importing else provide full path
data_readcsv = read.csv("demo_data.csv")
head(data_readcsv)

# Date AAPL MSFT
# 1 2/01/1998 4.06 16.39
# 2 5/01/1998 3.97 16.30
# 3 6/01/1998 4.73 16.39
# 4 7/01/1998 4.38 16.20
# 5 8/01/1998 4.55 16.31
# 6 9/01/1998 4.55 15.88
• Similar to write.table, data can also be written to an external CSV file using write.csv. The following
example uses an inbuilt data set in R and exports it to a CSV file (see footnote 1).

data(iris) #R inbuilt dataset
head(iris)

# Sepal.Length Sepal.Width Petal.Length Petal.Width Species
# 1 5.1 3.5 1.4 0.2 setosa
# 2 4.9 3.0 1.4 0.2 setosa
# 3 4.7 3.2 1.3 0.2 setosa
# 4 4.6 3.1 1.5 0.2 setosa
# 5 5.0 3.6 1.4 0.2 setosa
# 6 5.4 3.9 1.7 0.4 setosa

write.csv(iris, "data_iris.csv", row.names = FALSE)
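
To see what row.names = FALSE changes, the export can be repeated into temporary files and read back. The sketch below is illustrative; the file and object names are hypothetical, and stringsAsFactors = TRUE is passed only so that Species returns as a factor, as in the original iris data.

```r
data(iris)

# Export without row names, then re-import
csv_file = tempfile(fileext = ".csv")
write.csv(iris, csv_file, row.names = FALSE)
iris_back = read.csv(csv_file, stringsAsFactors = TRUE)
dim(iris_back)
names(iris_back)

# Export with the default row.names = TRUE for comparison:
# the row numbers come back as an extra first column
csv_rn = tempfile(fileext = ".csv")
write.csv(iris, csv_rn)
iris_rn = read.csv(csv_rn, stringsAsFactors = TRUE)
dim(iris_rn)
names(iris_rn)[1]
```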

5.1.3 Reading from Excel Files


Although it is less convenient, R does provide methods to import data from Excel files with the help of
external packages. Overall, the methods provided by packages like gdata, XLConnect and xlsx are not as
efficient as exporting the Excel sheet to a CSV file and then importing it into R.
Footnote 1: Notice the use of row.names = FALSE in write.csv to avoid creating one more column in the CSV file with row numbers.
5.2 Importing Data using RStudio

• To import data, click on Import Dataset → From Text File... → Browse for the file to import.

• Remember the file should be in a tabular format; a text file or a CSV file are the best options. On
clicking Import, the data will be imported into a data frame and displayed by RStudio.

• This will also generate the basic data import command used for importing and viewing the file in the
RStudio console, as shown in the figure below. Note that the path in the command as shown in the
console has been scrambled, as it will be different for every computer.
Figure 5.1: Basic Import Dataset Wizard in RStudio
Figure 5.2: Basic Import Dataset Wizard in RStudio
Figure 5.3: Data import in RStudio
Part 6

To Do: Financial Market Data

6.1 FACTSET

• Setup your FACTSET access.

– Request an account here if you haven't received one: https://advantage.factset.com/academic_idrequest

• Spend some time to learn the interface and download some data.

• Download Stock Price Data.

6.2 Basic Analysis in R

• Download BHP stock price series (2010-2021) from FACTSET

• Import the data in R

• Use the summary function to get the basic summary of the data

• Use the plot function to get the basic scatter plot of the data

• Challenge: Convert prices to logarithmic returns, then calculate the summary and make the plot using
R
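
As a starting point for the challenge, the log return between two consecutive prices is log(P_t) - log(P_(t-1)), which in R is diff(log(prices)). The sketch below applies this to the first six AAPL prices from the demo data in Part 5 rather than to FACTSET data; the variable names are hypothetical, and the same steps would apply to a BHP price column once imported.

```r
# First six AAPL prices from the demo data shown in Part 5
prices = c(4.06, 3.97, 4.73, 4.38, 4.55, 4.55)

# Log return: r_t = log(P_t) - log(P_(t-1))
log_ret = diff(log(prices))
length(log_ret)  # one fewer observation than the price series

# Basic summary and a simple line plot of the returns
summary(log_ret)
plot(log_ret, type = "l", xlab = "Observation", ylab = "Log return")
```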
Next Time

• Data Exploration using R

• Data Visualisation using R

References

Dasgupta, Nataraj, Farias, Ricardo Anjoleto, & Lanzetta, Vitor Bianchi. 2018. Hands-On Data Science with R. Packt Publishing.

Kotu, Vijay. 2018. Data Science: Concepts and Practice. 2nd edn.

Minelli, Michael, Chambers, Michele, & Dhiraj, Ambiga. 2012. Big data, big analytics: emerging business intelligence and analytic trends for today’s businesses. John Wiley & Sons.

Mount, John, & Zumel, Nina. 2019. Practical Data Science with R, Second Edition. Manning Publications.
