Afin8015 Topic 1 2023.
Contents
1 Introduction to AFIN8015
1.1 Assessment Structure
3 Introduction to R (part-I)
3.1 What is R?
3.2 Why should we learn R?
3.3 Installing R for Windows
3.4 R GUI
3.5 RStudio - A better way to do R
3.6 Installing RStudio for Windows
3.7 RStudio GUI/IDE
3.8 Installing Packages
3.9 Getting Help
3.10 Task Views in R - Introduction & Installation
3.11 R core packages
3.12 Example-1 Hello R! and more
References
Part 1
Introduction to AFIN8015
• This course introduces the fundamental process of data science for finance to students with an interest
in the rapidly growing area of FinTech (Financial Technology).
• The unit focuses on developing critical computational, statistical, and other contemporary analytical
skills that are essential for people conducting data-driven financial analytics in the FinTech area.
• Main focus on
– Practical, hands-on introduction to the concepts and analytical skills through applied data-driven
examples.
– Discussion on contemporary methods in data science such as Regression and Classification meth-
ods, Data Management, Visualisation, Machine Learning etc.
– Practical implementation of these models using industry standard data sources and software (R).
• Course targets students interested in financial analytics, including data analytics, predictive and clas-
sification methods.
• The data science concepts and applications covered in this unit focus on analysing various types of
data sources in Financial Service sector to generate actionable insights.
• This is very much a hands-on course, with the seminars conducted in the computer laboratories
emphasising empirical work and applied analysis of real market data.
• A simple definition, according to Wikipedia, “Data science is an inter-disciplinary field that uses sci-
entific methods, processes, algorithms and systems to extract knowledge and insights from many
structural and unstructured data. Data science is related to data mining, deep learning and big data.”
– data engineering, descriptive statistics, data mining, machine learning, and predictive analytics
• For this course, we will focus on data science applied to business analytics, specifically, financial
analytics to conduct Financial Data Science.
1
Data Science and Data Analytics are related terms with at times similar meanings.
2.1.1 Key Components of Data Science
• Statistics & Mathematics
• Domain Expertise: Specialised Knowledge or Skills e.g., Financial Analytical Methods for Classification
and Prediction.
• Computer Science
• Data Engineering & Data Visualisation (part of computer science, but somewhat distinct)
• Machine Learning
– Predictive Methods
– Classification Methods
• Figure 2.1 (Minelli et al., 2012) shows the major components of a big data analytics framework and
the questions it attempts to answer.
Figure 2.1: Big Data Analytics
Financial Data Science is the application of Data Science to generate insights in the Financial Services
domain.
• The standard data science process involves (1) understanding the problem, (2) preparing the data
samples, (3) developing the model, (4) applying the model on a dataset to see how the model may
work in the real world, and (5) deploying and maintaining the models.
• Figure-2 shows a basic data science process (source: Chapter 2 of Kotu (2018)).
Role              Responsibilities
Project sponsor   Represents the business interests; champions the project
Client            Represents end users’ interests; domain expert
Data scientist    Sets and executes analytic strategy; communicates with sponsor and client
Data architect    Manages data and data storage; sometimes manages data collection
Operations        Manages infrastructure; deploys final project results
Table 2.1: Data Science Project Roles
• Financial Analytics can be referred to as the in-depth exploration and analysis of financial data gener-
ated or used by companies to support their decision making process.
• Financial Analytics uses various Descriptive, Predictive and Prescriptive methods to generate action-
able insights enabling professionals to produce reliable estimates and forecasts of financial risk, re-
turns, asset pricing, cash flows etc.
• Financial Analytics may involve methods ranging from simple descriptive statistics
and data visualisation to sophisticated statistical, machine learning and big
data methods, including time-series modelling, to model and forecast cash flows,
profitability, risk, and return of financial assets.
• Panel Data
– Both time series and cross sectional. For example, financial performance per month for N compan-
ies
• Unstructured Data
Introduction to R (part-I)
3.1 What is R?
• R is a language and environment for statistical computing and graphics. It is a GNU project which is
similar to the S language and environment which was developed at Bell Laboratories (formerly AT&T,
now Lucent Technologies) by John Chambers and colleagues. R can be considered as a different
implementation of S. There are some important differences, but much code written for S runs unaltered
under R.1
According to Wikipedia
• R is a free software programming language and a software environment for statistical computing and
graphics. The R language is widely used among statisticians and data miners for developing statistical
software and data analysis. Polls and surveys of data miners are showing R’s popularity has increased
substantially in recent years.
1
http://www.r-project.org/about.html
To Summarise
• This recent video by Revolution Analytics does a great job in summarizing R https://www.youtube.com/watch?v=ZCQHm
R follows a type inference2 coding structure and provides a wide variety of statistical and graphical tech-
niques, including;
• R is easily extensible through functions and extensions, and the R community is noted for its active
contributions in terms of packages. Total 11,799 packages and counting3
3.4 R GUI
• RStudio IDE (Integrated Development Environment) is a powerful and productive user interface for R. It’s free and open source.
• A good introduction to RStudio is found in the recently published book; Learning RStudio for R Statist-
ical Computing. http://www.packtpub.com/learning-rstudio-for-r-statistical-computing/book
• RStudio GUI is composed of 4 panes which can be rearranged according to the requirements.
• There are a lot of short introductions to RStudio available online so we will not go into more details.
The figure below gives the snapshot of RStudio GUI.
5
Go to http://www.rstudio.com/ide/download/desktop to download RStudio for desktop
3.8 Installing Packages
• The easiest way to install packages is via the R console. The command install.packages("package
name") installs R packages directly from the internet. Options to install the various dependencies of a
package can be specified when calling this function. A call to this function asks the user to choose
a CRAN mirror at the first instance.
Run the following to install the tidyverse package in R. Also use the help function to get the details.
# Opens a webpage when called from R or shows help in the help window
# in RStudio
help(install.packages)
# Install package tidyverse with all the required dependencies.
install.packages("tidyverse", dependencies = c("Depends", "Suggests"))
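Re-running install.packages every time a script is sourced is slow, so a common pattern is to install only when a package is missing. The following is a minimal sketch of that idea; the helper name install_if_missing is hypothetical (it is not part of R), and it is called here with the bundled stats package so that nothing is downloaded.

```r
# Hypothetical helper (not part of R): install a package only if it is missing.
install_if_missing <- function(pkg) {
  if (requireNamespace(pkg, quietly = TRUE)) {
    return(FALSE)            # already installed, nothing to do
  }
  install.packages(pkg)      # downloads from the chosen CRAN mirror
  TRUE
}

# 'stats' ships with R, so this call returns FALSE without installing anything.
installed_now <- install_if_missing("stats")
installed_now

# Once installed, a package is attached to the session with library().
library(stats)
```

Replacing "stats" with "tidyverse" would trigger a download on a machine where it is not yet installed.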
As R is constantly evolving and new functions/packages are introduced every day, it is good to know the
sources of help. The most basic help one can get is via the help() function. This function shows the help
file for a function, which has been written by the package maintainers.
help("function name")
# Replace 'search string' with the expression you want to search for
??"search string"
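As a concrete illustration of the help tools, the sketch below uses the function mean as an arbitrary example topic (an assumption made here, not taken from the notes). apropos and args are standard R functions that complement help().

```r
# Show the help file for a known function (same as typing ?mean)
help("mean")

# Find the names of all visible objects whose name contains "mean"
matches <- apropos("mean")
head(matches)

# Quick reminder of a function's arguments without opening the help page
args(sd)
```

The ?? operator (help.search) is the tool to use when the exact function name is not known.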
• All R packages (with few exceptions) have a user’s manual listing the functions in the package. This
can be downloaded in PDF format from the R package download page6.
• R also provides some search tools, given at http://cran.r-project.org/search.html. The R Site search
is helpful in searching for topics related to the problem at hand.
• Beyond these, there are various good R-related blogs on the internet which can be really helpful. A
combined, up-to-date view of 452 contributed blogs can be found at R-bloggers7.
• Overall, there is quite a big community of R users and help can be found for most topics.
6
For example reference manual for quantreg package is at http://cran.r-project.org/web/packages/quantreg/quantreg.pdf
7
Go to www.r-bloggers.com
3.10 Task Views in R-Introduction & Installation
• Task Views in R provide packages grouped together according to a generalized task they are used for.
• The following commands install the package ctv and then Finance task view.
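A minimal sketch of what such commands might look like is given here, assuming the ctv package and its install.views function. The flag run_install is a hypothetical guard added for this sketch, because installing a whole task view downloads a large number of packages.

```r
# Assumption: the 'ctv' package provides install.views() for installing task views.
# run_install is a hypothetical switch; set it to TRUE to actually download packages.
run_install <- FALSE
task_view <- "Finance"

if (run_install) {
  install.packages("ctv")          # install the task view helper package
  ctv::install.views(task_view)    # install all packages in the Finance task view
}
```

With run_install left at FALSE the script only defines the two variables and makes no changes to the R library.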
• R comes with a few bundled core packages which provide various data analytics/statistical capabilities
to R. The base package in R has the basic functions and operators which are required for analytical
programming; stats is another example of a core R package.
• The above command should give a help page similar to the one in the figure, which gives selectable help
pages for various functions in R’s base package.
Figure 3.3: R base package help directory
• Next we will learn some basics about programming, but before that we’ll do the following example,
which illustrates what we have learnt till now along with some useful tips9.
• R packages come with various datasets and demo codes specific to the packages.
demo(package = "base")
## run demo() to display all the available demos in the loaded
## packages
demo(recursion)
9
All the examples/codes are executed in RStudio
Figure 3.2: RStudio IDE
Part 4
• According to R’s official language definition, in every computer language variables provide a means of
accessing the data stored in memory.
• R does not provide direct access to the computer’s memory but rather provides a number of specialized
data structures we will refer to as objects. These objects are referred to through symbols or variables.
4.1.1 Double
Doubles are numbers like 5.0, 5.5, 10.999 etc. They may or may not include decimal places. Doubles are
mostly used to represent a continuous variable such as weight or age.
x = 8.5
is.double(x) #to check if the data type is double
# [1] TRUE
4.1.2 Integer
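Integers are whole numbers (positive, negative or zero). Note that R stores a number as a double by
default; use as.integer (or the L suffix, e.g. 9L) to create an integer.
x = 9
typeof(x)
# [1] "double"
x = as.integer(9)
typeof(x)
# [1] "integer"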
4.1.3 Logical
A variable of data type logical has the value TRUE or FALSE.
x = 11
y = 10
a = x > y
a
# [1] TRUE
typeof(a)
# [1] "logical"
To perform calculations on logical objects, R replaces FALSE by zero and TRUE by 1.
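Because TRUE counts as 1 and FALSE as 0, summing a logical vector counts how many conditions hold and averaging it gives the proportion. A small sketch with made-up return values (the numbers are hypothetical, chosen only for illustration):

```r
# Hypothetical daily returns, for illustration only
returns <- c(0.02, -0.01, 0.03, -0.04, 0.01)

is_positive <- returns > 0            # logical vector: TRUE FALSE TRUE FALSE TRUE
n_positive <- sum(is_positive)        # TRUE is counted as 1, so this is 3
share_positive <- mean(is_positive)   # proportion of TRUE values, 0.6

n_positive
share_positive
```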
4.1.4 Character
Characters represent the string values in R. An object of type character can hold alphanumeric strings.
Character objects are specified by assigning a string or collection of characters between double quotes
("string"). Everything inside double quotes is considered a string in R.
x = "This is a string"
print(x)
x = "a"
typeof(x)
# [1] "character"
4.1.5 Factor
Factor is an important data type to represent categorical data. It also comes in handy when dealing with
panel or longitudinal data. Examples of factors are blood type (A, B, AB, O) and sex (Male or Female).
Factor objects can be created from a character object or from a numeric object.
Note: the function c is used to create a vector of values, which can be of any data type.
b.type = c("A", "AB", "B", "O") #character object
# use factor function to convert to factor object
b.type = factor(b.type)
b.type
# [1] A AB B O
# Levels: A AB B O
date1 = as.Date("2012-01-31")
date1
# [1] "2012-01-31"
data.class(date1)
# [1] "Date"
typeof(date1)
# [1] "double"
• R has two inbuilt classes POSIXct and POSIXlt to deal with date and time which can be used to repres-
ent calendar dates and times.
• A character date or time can be converted to these two classes by calling the function as.POSIXct to
create a POSIXct object. This function accepts date, time or date with time as character input and uses
a format argument to specify a non default format. A time zone can also be specified when dealing
with a specific time zone 2
2
See help(as.POSIXct)or help(as.POSIXlt) for further details. strptime is a very useful function to convert one format of date and
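The conversion described above can be sketched as follows. The date-time values and the UTC time zone are illustrative assumptions; the format codes (%d, %m, %Y, %H, %M) are the standard ones documented in help(strptime).

```r
# Default format (year-month-day hour:minute:second), time zone given explicitly
dt1 <- as.POSIXct("2012-01-31 14:30:00", tz = "UTC")

# A non-default format needs the format argument
dt2 <- as.POSIXct("31/01/2012 14:30", format = "%d/%m/%Y %H:%M", tz = "UTC")

dt1 == dt2                 # both represent the same instant

# POSIXlt stores the components separately, which makes them easy to extract
lt <- as.POSIXlt(dt1)
lt$hour

# format() converts a date-time back to a character string
format(dt1, "%Y/%m/%d")
```

POSIXct stores the date-time as a single number, which suits data frames, while POSIXlt is a list of components such as hour and minute.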
4.1.7 Missing Data in R
• Datasets available for research often have missing data. In R, missing data is represented by NA (Not
Available), which can stand in for a missing value of any data type. Another symbol, used for undefined
numeric results, is NaN (Not a Number).
• The following example shows how to detect missing values in a data vector.
Note: NULL in R represents a null object with length zero, or an undefined object.
Note: We often come across +/- infinite values (Inf, -Inf) in the models (for instance, from division by zero).
m.data = c("100", "200", "missing")
# converting m.data to double will create one missing value, as 'missing' cannot be read as a number
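A minimal sketch of detecting and handling missing values, using the same m.data vector. The variable name m.num and the use of suppressWarnings are choices made for this sketch; without it R prints a coercion warning.

```r
m.data <- c("100", "200", "missing")

# Coercion to double turns the non-numeric string into NA (warning suppressed here)
m.num <- suppressWarnings(as.double(m.data))

is.na(m.num)                 # which elements are missing
n_missing <- sum(is.na(m.num))
n_missing

# Most summary functions need na.rm = TRUE to skip missing values
mean_value <- mean(m.num, na.rm = TRUE)
mean_value

# NaN and infinite values have their own tests
is.nan(0 / 0)
is.infinite(1 / 0)
```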
Every data analysis requires the data to be structured in a well-defined way. These coherent ways of
putting data together form the basic data structures in R. Every data set intended for analysis has to be
imported into the R environment as a data structure. R has the following basic data structures:
• Vector
• Matrix
• Array
• Data Frame
• Lists
4.2.1 Vector
• Vectors are groups of values having the same data type.
• There can be numeric vectors, character vectors and so on. Vectors are mostly used to represent a
single variable in a data set.
• A vector is constructed using the function c. The same function c can be used to combine different
vectors of the same data type.
vec1 = c(1, 2, 3, 4, 5)
vec1
# [1] 1 2 3 4 5
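The sketch below illustrates combining vectors with c and what happens when the data types differ. The names vec2, vec3 and mixed are introduced here for illustration.

```r
vec1 <- c(1, 2, 3, 4, 5)
vec2 <- c(6, 7)

# c() also combines existing vectors into a longer one
vec3 <- c(vec1, vec2)
length(vec3)

# Mixing types forces a common type: here everything becomes character
mixed <- c(vec1, "a")
typeof(mixed)

# Arithmetic on a vector is applied element by element
doubled <- vec1 * 2
doubled
```

Since a vector can hold only one data type, R silently coerces mixed input, which is worth checking with typeof after importing data.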
4.2.2 Matrices
• A matrix is a collection of data elements arranged in a two-dimensional rectangular layout. Like vectors
all the elements in a matrix are of same data type.
1 2
3 4
5 6
• The function matrix is used to create matrices in R. Note that all the elements in a matrix object are of
the same basic type. Let’s create the matrix in the example above
m2 = c(1, 2, 3, 4, 5, 6)
dim(m2) = c(3, 2) #the matrix will be filled by columns
m2
#      [,1] [,2]
# [1,]    1    4
# [2,]    2    5
# [3,]    3    6
# [1] 3 2
#      [,1] [,2]
# [1,]    2    4
# [2,]    6    8
# [3,]   10   12
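As a sketch of the matrix function itself, the code below builds a 3 x 2 matrix filled by rows and multiplies it by a scalar and by a vector. The name m1 and the byrow = TRUE choice are assumptions made here; they are not confirmed by the notes.

```r
# Assumption: m1 is filled row by row, giving rows (1 2), (3 4), (5 6)
m1 <- matrix(1:6, nrow = 3, ncol = 2, byrow = TRUE)
dim(m1)                     # 3 rows and 2 columns

# Multiplying by a scalar works element by element
m1_doubled <- m1 * 2
m1_doubled

# A vector is recycled down the columns: row 1 times 1, row 2 times 10, row 3 times 100
m1_recycled <- m1 * c(1, 10, 100)
m1_recycled
```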
• A matrix can be multiplied element-wise with a vector; the vector is recycled, so the length of the
matrix (its number of elements) should be a multiple of the length of the vector. Try different
combinations of matrix and vector arithmetic to see the results and errors.
• Mathematical matrix operations are also available for matrices in R. For instance, %*% is used for
matrix multiplication; the matrices must agree dimensionally for matrix multiplication.
Note: the : operator can be used to create a sequence.
For example
dim(m1) # 3 rows and 2 columns
# [1] 3 2
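A small sketch of matrix multiplication with %*%. The matrices A and B are made up for this illustration; the inner dimensions must agree, here (3 x 2) times (2 x 2).

```r
A <- matrix(1:6, nrow = 3)   # 3 x 2, filled by columns using the : operator
B <- matrix(1:4, nrow = 2)   # 2 x 2

AB <- A %*% B                # (3 x 2) %*% (2 x 2) gives a 3 x 2 matrix
AB
dim(AB)

# B %*% A would fail because the inner dimensions (2 and 3) do not agree;
# try() keeps the script running and lets us inspect the error
result <- try(B %*% A, silent = TRUE)
inherits(result, "try-error")
```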
R facilitates various matrix-specific operations. Table 4.1 gives most of the available functions and operators.
Use help() or ? followed by the function name to get more details about the operators and functions.
Table 4.1: Functions and operators for matrices
Operator or Function   Description
X * Y                  Element-wise multiplication
X %*% Y                Matrix multiplication
Y %o% X                Outer product, YX’
crossprod(X,Y)         X’Y
crossprod(X)           X’X
t(X)                   Transpose
diag(x)                Creates diagonal matrix with elements of x in the principal diagonal
diag(X)                Returns a vector containing the elements of the principal diagonal
diag(k)                If k is a scalar, this creates a k x k identity matrix
solve(X, b)            Returns vector x in the equation b = Xx (i.e., X^-1 b)
solve(X)               Inverse of X, where X is a square matrix
y = eigen(X)           y$values are the eigenvalues of X; y$vectors are the eigenvectors of X
y = svd(X)             Singular value decomposition of X
R = chol(X)            Cholesky factorization of X. Returns the upper triangular factor, such that R’R = X
y = qr(X)              QR decomposition of X
cbind(X,Y,...)         Combine matrices (vectors) horizontally. Returns a matrix
rbind(X,Y,...)         Combine matrices (vectors) vertically. Returns a matrix
rowMeans(X)            Returns vector of row means
rowSums(X)             Returns vector of row sums
colMeans(X)            Returns vector of column means
colSums(X)             Returns vector of column sums
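A few of the functions in the table can be tried on a small example. The matrix X and vector b below are made-up values, chosen so that X is symmetric and positive definite (which chol requires).

```r
X <- matrix(c(2, 1, 1, 3), nrow = 2)   # symmetric, positive definite
b <- c(1, 2)

X_inv <- solve(X)        # inverse of X
x_sol <- solve(X, b)     # solves b = X x for x
R <- chol(X)             # upper triangular factor with t(R) %*% R equal to X

# Checks (all.equal allows for floating point rounding)
all.equal(X_inv %*% X, diag(2))
all.equal(as.vector(X %*% x_sol), b)
all.equal(t(R) %*% R, X)
all.equal(crossprod(X), t(X) %*% X)

colSums(X)
rowMeans(X)
```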
4.2.3 Arrays
• Arrays are the generalisation of vectors and matrices. A vector in R is a one dimensional array and a
matrix a two dimensional array. An array is a multiply subscripted collection of data entries of the same
data type. Arrays can be constructed using the function array, for example3
a1 = array(1:24, dim = c(3, 4, 2))
a1
# , , 1
#
#      [,1] [,2] [,3] [,4]
# [1,]    1    4    7   10
# [2,]    2    5    8   11
# [3,]    3    6    9   12
#
# , , 2
#
#      [,1] [,2] [,3] [,4]
# [1,]   13   16   19   22
# [2,]   14   17   20   23
# [3,]   15   18   21   24
3
Function dim can also be used to define an array by assigning dimensions to a vector.
• Individual elements of an array are accessed by referring to them by their index. This is done by giving
the name of the array followed by the subscripts (indices) in square brackets, separated by commas.
We access the element [1,3,1] of array a1 in the following example
a1[1, 3, 1]
# [1] 7
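Arrays can also be defined by assigning dimensions to a vector with dim, and whole slices can be extracted by leaving an index empty. A minimal sketch, where the name a2 is introduced for illustration:

```r
# Define an array by assigning dimensions to a vector
a2 <- 1:24
dim(a2) <- c(3, 4, 2)     # 3 rows, 4 columns, 2 layers

a2[1, 3, 1]               # single element: row 1, column 3, layer 1

# Leaving an index empty selects everything along that dimension
layer2 <- a2[, , 2]       # the whole second layer, a 3 x 4 matrix
dim(layer2)

row2_layer1 <- a2[2, , 1] # second row of the first layer
row2_layer1
```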
• Next we discuss the Data Frames which are the most convenient data structures for data analysis in
R.
• In quantitative research data is often in the form of data tables. These data tables have multiple rows
and can have multiple columns with each column representing a different variable (quantity).
• A data frame in R is the most natural way to represent these data sets as it can have different data
type in the data frame object. Most statistical routines in R require a data frame as input.
The following example uses an important function, str, on R’s inbuilt data frame “swiss”. The str function is
used to see the internal structure of an object in R.
• Data frames have two attributes, namely names and row.names; these two contain the column names
and row names respectively. The data in a named column can be accessed with the $ operator.
# using names and row.names
names(swiss) # names of the columns (can also use colnames)
colnames(swiss)
swiss$Fertility # data in the column named Fertility
# [1] 80.2 83.1 92.5 85.8 76.9 76.1 83.8 92.4 82.4 82.9 87.1 64.1 66.9
# [14] 68.9 61.7 68.3 71.7 55.7 54.3 65.1 65.5 65.0 56.6 57.4 72.5 74.2
# [27] 72.0 60.5 58.3 65.4 75.5 69.3 77.3 70.5 79.4 65.0 92.2 79.3 70.4
# [40] 65.7 72.7 64.4 77.6 67.6 35.0 44.7 42.8
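The pieces described above can be put together in a short sketch on the built-in swiss data frame. The variable names col_names and first_fertility are introduced here for illustration.

```r
# Internal structure: number of observations, variables and their types
str(swiss)

# Column names and the first few row names
col_names <- names(swiss)
col_names
head(row.names(swiss), 3)

# The $ operator extracts one column as a vector
first_fertility <- swiss$Fertility[1:3]
first_fertility

# Dimensions of the data frame: rows and columns
dim(swiss)
```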
• Data frames are constructed using the function data.frame. For example following creates a data frame
of a character and numeric vector.
num1 = 1:5
ch1 = c("A", "B", "C", "D", "E")
df1 = data.frame(ch1, num1)
df1
#   ch1 num1
# 1   A    1
# 2   B    2
# 3   C    3
# 4   D    4
# 5   E    5
4.2.5 Lists
• A list is like a generic vector containing other objects. Lists can have numerous elements of any type and
structure, and these elements can also be of different lengths.
• A list can contain another list and therefore it can be used to construct arbitrary data structures.
e1 = c(2, 3, 5) #element-1
e2 = c("aa", "bb", "cc", "dd", "ee") #element-2
e3 = c(TRUE, FALSE, TRUE, FALSE, FALSE) #element-3
e4 = df1 #element-4 (previously constructed data frame)
lst1 = list(e1, e2, e3, e4) # lst contains copies of e1,e2,e3,e4
str(lst1) #show the structure of lst1
# List of 4
# $ : num [1:3] 2 3 5
# $ : chr [1:5] "aa" "bb" ...
# $ : logi [1:5] TRUE FALSE TRUE ...
# $ :'data.frame': 5 obs. of 2 variables:
# ..$ ch1 : chr [1:5] "A" "B" ...
# ..$ num1: int [1:5] 1 2 3 4 5
• Components are always numbered and may always be referred to as such.
• Thus if lst1 is the name of a list with four components, these may be individually referred to as lst1[[1]],
lst1[[2]], lst1[[3]] and lst1[[4]].
Note: When a single square bracket is used the component of a list is returned as a list, while the
double square bracket returns the component itself.
# first element of lst1
lst1[[1]]
# [1] 2 3 5
lst1[1]
# [[1]]
# [1] 2 3 5
The elements in a list can also be named using the list function, and these elements can then be referred to
individually via their names.
# [1] 2 3 5
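A minimal sketch of a named list follows. The list name lst2 and the element names numbers, codes and flags are made up for this illustration.

```r
# Names are given as name = value pairs inside list()
lst2 <- list(numbers = c(2, 3, 5),
             codes = c("aa", "bb", "cc"),
             flags = c(TRUE, FALSE, TRUE))

names(lst2)

# Named elements can be accessed with $ or with [[ ]] and the name
lst2$numbers
lst2[["codes"]]

# Single brackets still return a (shorter) list
sub_list <- lst2["flags"]
class(sub_list)
```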
Part 5
Errors using inadequate data are much less than those using no data at all.
-Charles Babbage
5.1 Tabular Data
• To import tabular data from a text file, R provides the function read.table(). It is the most
convenient function to import tabular data from text files and can be easily used for data files of
small or moderate size having data in a rectangular format. The arguments which can be passed
to read.table() are given below.
args(read.table)
# function (file, header = FALSE, sep = "", quote = "\"'", dec = ".",
# numerals = c("allow.loss", "warn.loss", "no.loss"), row.names,
# col.names, as.is = !stringsAsFactors, na.strings = "NA",
# colClasses = NA, nrows = -1, skip = 0, check.names = TRUE,
# fill = !blank.lines.skip, strip.white = FALSE, blank.lines.skip = TRUE,
# comment.char = "#", allowEscapes = FALSE, flush = FALSE,
# stringsAsFactors = FALSE, fileEncoding = "", encoding = "unknown",
# text, skipNul = FALSE)
# NULL
• Some of the important arguments for the function read.table are discussed below, for the rest see the
help file using help(read.table).
Argument Description
file The name of the tabular (text) file to import along with the full
path
quote To specify if the character vectors in the data are in quotes, this
strip.white A logical value to specify if the extra leading and trailing white
filled.
• The example below imports a tab delimited text file.
Note the use of "\t" in the sep argument for tab delimited data. The header argument is also TRUE here
as our dataset has variable names in the first row.
• Note that in the example below, the working directory for the RStudio session has already been set to
the destination file’s directory. If the working directory is different from the location of the data file then
either the working directory should be changed using setwd or RStudio’s GUI, or the full path to the file
should be given.
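Since the course data file is not reproduced here, the sketch below first writes a few tab delimited lines to a temporary file and then imports them with read.table. The sample rows follow the Date/AAPL/MSFT layout used in these notes, and the names sample_lines, tab_file and data_tab are assumptions of this sketch.

```r
# Write a small tab delimited file to a temporary location (stand-in for the real file)
sample_lines <- c("Date\tAAPL\tMSFT",
                  "2/01/1998\t4.06\t16.39",
                  "5/01/1998\t3.97\t16.3",
                  "6/01/1998\t4.73\t16.39")
tab_file <- tempfile(fileext = ".txt")
writeLines(sample_lines, tab_file)

# header = TRUE: first row holds the variable names; sep = "\t": tab delimited
data_tab <- read.table(tab_file, header = TRUE, sep = "\t",
                       stringsAsFactors = FALSE)
str(data_tab)
head(data_tab)
```

With header = TRUE the price columns are read as numbers; without it the names would become the first data row and every column would be read as text.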
• After importing from a text file, this data can now be saved in .Rdata format using save, or
written to another text file using write.table, as shown below:
# saving data as an object in .Rdata format
save(data_readtable, file = "data1.Rdata")
# saving data into another text file
write.table(data_readtable, file = "data1.txt")
• The .Rdata file can be easily loaded into the system using the load function, as shown below.
Note: we are still in the same working directory; if this is not the case you will have to provide the
path or change the directory.
load("data1.Rdata") # using load to load R data
head(data_readtable)
#          V1   V2    V3
# 1      Date AAPL  MSFT
# 2 2/01/1998 4.06 16.39
# 3 5/01/1998 3.97  16.3
# 4 6/01/1998 4.73 16.39
# 5 7/01/1998 4.38  16.2
# 6 8/01/1998 4.55 16.31
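The save and load round trip can be checked with a small self-contained sketch. The data frame prices and the temporary file are made up for this illustration, so nothing is written to the working directory.

```r
# Small illustrative data frame (values taken from the sample output above)
prices <- data.frame(Date = c("2/01/1998", "5/01/1998"),
                     AAPL = c(4.06, 3.97),
                     stringsAsFactors = FALSE)

rdata_file <- tempfile(fileext = ".Rdata")
save(prices, file = rdata_file)   # stores the object together with its name

rm(prices)                        # remove it from the workspace
exists("prices")                  # FALSE at this point

# load() restores the object under its original name and returns that name
loaded_names <- load(rdata_file)
loaded_names
head(prices)
```

Because load restores objects under their saved names, it can overwrite an existing object with the same name in the workspace.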
5.1.2 Reading Data from CSV File
• Reading data from a CSV file is made easy by the read.csv function, which is an extension
of read.table. It facilitates direct import of data from CSV files. read.csv takes the following
arguments
args(read.csv)
# function (file, header = TRUE, sep = ",", quote = "\"", dec = ".",
#     fill = TRUE, comment.char = "", ...)
# NULL
• The following example imports a CSV file with the same data as previously imported from a text file.
# Check the working directory before importing else provide full path
data_readcsv = read.csv("demo_data.csv")
head(data_readcsv)
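After importing, the Date column is still plain text. The sketch below writes a small CSV to a temporary file, imports it and converts the dates. It assumes the dates are in day/month/year order; the names csv_file and data_csv are introduced here and demo_data.csv itself is not used.

```r
# Stand-in CSV file with the same layout as the course data
csv_file <- tempfile(fileext = ".csv")
writeLines(c("Date,AAPL,MSFT",
             "2/01/1998,4.06,16.39",
             "5/01/1998,3.97,16.3"), csv_file)

data_csv <- read.csv(csv_file, stringsAsFactors = FALSE)

# Assumption: dates are day/month/year, hence the format string below
data_csv$Date <- as.Date(data_csv$Date, format = "%d/%m/%Y")

str(data_csv)
data_csv
```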
• To import data click on Import Dataset → From Text File.. → Browse for the file to import.
• Remember the file should be in a tabular format, a text file or a csv are the best options. On clicking
Import the data will be imported in a Data Frame and will be made visible by RStudio.
• This will also generate basic data import command used for importing and viewing the file in the
RStudio console as shown in the figure below. Note that the path in the command as shown in the
console has been scrambled as it will be different for every computer
Figure 5.1: Basic Import Dataset Wizard in RStudio
Figure 5.2: Basic Import Dataset Wizard in RStudio
Figure 5.3: Data import in RStudio
Part 6
6.1 FACTSET
• Use the summary function to get the basic summary of the data
• Use the plot function to get the basic scatter plot of the data
• Challenge: Convert prices to logarithmic returns and calculate the summary and make the plot using
R
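One possible starting point for the challenge is sketched below. It uses a handful of AAPL prices from the sample output earlier in these notes rather than the FACTSET data, and the names aapl and log_ret are assumptions of this sketch. A logarithmic return is the difference of the log prices, r_t = log(P_t) - log(P_{t-1}).

```r
# A few AAPL prices from the earlier sample output (stand-in for the FACTSET data)
aapl <- c(4.06, 3.97, 4.73, 4.38, 4.55)

# Log return: difference of consecutive log prices (one fewer value than prices)
log_ret <- diff(log(aapl))
log_ret

# Basic summary statistics of the returns
summary(log_ret)

# Plot only when running interactively (e.g. in RStudio)
if (interactive()) {
  plot(log_ret, type = "b", xlab = "Period", ylab = "Log return")
}
```

A useful property to check is that log returns add up over time, so their sum equals the log of the last price divided by the first.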
Next Time
References
Dasgupta, Nataraj, Farias, Ricardo Anjoleto, & Lanzetta, Vitor Bianchi. 2018. Hands-On Data Science with R. Packt Publishing.
Kotu, Vijay. 2018. Data Science: Concepts and Practice. 2nd edn.
Minelli, Michael, Chambers, Michele, & Dhiraj, Ambiga. 2012. Big data, big analytics: emerging business intelligence and analytic trends for today’s businesses. John Wiley & Sons.
Mount, John, & Zumel, Nina. 2019. Practical Data Science with R, Second Edition. Manning Publications.