
Statistics and Machine Learning in Python
Release 0.3 beta

Edouard Duchesnay, Tommy Löfstedt, Feki Younes

Nov 13, 2019


CONTENTS

1 Introduction
   1.1 Python ecosystem for data-science
   1.2 Introduction to Machine Learning
   1.3 Data analysis methodology

2 Python language
   2.1 Import libraries
   2.2 Basic operations
   2.3 Data types
   2.4 Execution control statements
   2.5 Functions
   2.6 List comprehensions, iterators, etc.
   2.7 Regular expression
   2.8 System programming
   2.9 Scripts and argument parsing
   2.10 Networking
   2.11 Modules and packages
   2.12 Object Oriented Programming (OOP)
   2.13 Exercises

3 Scientific Python
   3.1 Numpy: arrays and matrices
   3.2 Pandas: data manipulation
   3.3 Matplotlib: data visualization

4 Statistics
   4.1 Univariate statistics
   5.8 Gradient descent

6 Deep Learning
   6.1 Backpropagation
   6.2 Multilayer Perceptron (MLP)
   6.3 Convolutional neural network
   6.4 Transfer Learning Tutorial

7 Indices and tables
CHAPTER ONE

INTRODUCTION

1.1 Python ecosystem for data-science

1.1.1 Python language

• Interpreted.
• Garbage-collected (which does not prevent memory leaks).
• Dynamically typed (whereas, e.g., Java is statically typed).

1.1.2 Anaconda

Anaconda is a Python distribution that ships most of the Python tools and libraries used in data science.

Installation

1. Download Anaconda (Python 3.x) from http://continuum.io/downloads
2. Install it; on Linux:

bash Anaconda3-2.4.1-Linux-x86_64.sh

3. Add the Anaconda path to the PATH variable in your .bashrc file:

export PATH="${HOME}/anaconda3/bin:$PATH"

Managing packages with conda

Update the conda package and environment manager to the current version:

conda update conda

Install additional packages. These commands install the Qt back-end (a fix for a temporary issue when running Spyder):

conda install pyqt
conda install PyOpenGL
conda update --all

Install seaborn for graphics:


conda install seaborn

# install a specific version from the anaconda channel
conda install -c anaconda pyqt=4.11.4

List installed packages:

conda list

Search available packages:

conda search pyqt
conda search scikit-learn

Environments

• A conda environment is a directory that contains a specific collection of conda packages that you have installed.
• Control a package environment for a specific purpose: collaborating with someone else, delivering an application to your client, etc.
• Switch between environments.

List all environments:

conda info --envs

1. Create a new environment
2. Activate it
3. Install new packages

conda create --name test
# Or
conda env create -f environment.yml

source activate test
conda info --envs
conda list
conda search -f numpy
conda install numpy

Miniconda

Miniconda is Anaconda without the collection of (>700) packages. With Miniconda you download only the packages you want, using the conda command: conda install PACKAGENAME

1. Download Miniconda (Python 3.x) from https://conda.io/miniconda.html
2. Install it; on Linux:

bash Miniconda3-latest-Linux-x86_64.sh

3. Add the Miniconda path to the PATH variable in your .bashrc file:

export PATH=${HOME}/miniconda3/bin:$PATH

4. Install the required packages:


conda install -y scipy


conda install -y pandas
conda install -y matplotlib
conda install -y statsmodels
conda install -y scikit-learn
conda install -y sqlite
conda install -y spyder
conda install -y jupyter

1.1.3 Commands

python: the Python interpreter. On the DOS/Unix command line, execute a whole file with:

python file.py

Interactive mode:

python

Quit with CTRL-D.

ipython: advanced interactive Python interpreter:

ipython

Quit with CTRL-D.

pip: an alternative for package management (-U to update, --user to install in the user directory):

pip install -U --user seaborn

For neuroimaging:

pip install -U --user nibabel
pip install -U --user nilearn

spyder: IDE (integrated development environment) with:

• Syntax highlighting.
• Code introspection for code completion (use TAB).
• Support for multiple Python consoles (including IPython).
• Explore and edit variables from a GUI.
• Debugging.
• Code navigation (go to a function definition) with CTRL.

3 or 4 panels:

• text editor
• help / variable explorer
• IPython interpreter

Shortcuts: F9 runs the current line or selection.


1.1.4 Libraries

scipy.org: https://www.scipy.org/docs.html

Numpy: basic numerical operations; matrix operations plus some basic solvers:

import numpy as np
X = np.array([[1, 2], [3, 4]])
# v = np.array([1, 2]).reshape((2, 1))
v = np.array([1, 2])
np.dot(X, v)          # no broadcasting
X * v                 # broadcasting
np.dot(v, X)
X - X.mean(axis=0)

Scipy: general scientific libraries with advanced solvers:

import scipy
import scipy.linalg
scipy.linalg.svd(X, full_matrices=False)

Matplotlib: visualization:

import numpy as np
import matplotlib.pyplot as plt
#%matplotlib qt
x = np.linspace(0, 10, 50)
sinus = np.sin(x)
plt.plot(x, sinus)
plt.show()

Pandas: manipulation of structured data (tables), input/output of Excel files, etc.

Statsmodels: advanced statistics.

Scikit-learn: machine learning.
Library        Arrays, num.   Structured   Solvers:   Solvers:   Stats:   Stats:     Machine
               comp., I/O     data, I/O    basic      advanced   basic    advanced   learning
Numpy          X                           X
Scipy                                      X          X          X
Pandas                        X
Statsmodels                                                      X        X
Scikit-learn                                                                         X
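Pandas, Statsmodels and Scikit-learn are covered in later chapters; as a minimal, illustrative teaser (the toy data below are made up for this sketch, not taken from the text):

import pandas as pd
import statsmodels.api as sm
from sklearn.linear_model import LinearRegression

# Pandas: structured data (a table with named columns)
df = pd.DataFrame(dict(x=[1.0, 2.0, 3.0, 4.0], y=[2.1, 3.9, 6.2, 8.1]))

# Statsmodels: ordinary least squares with an intercept
ols = sm.OLS(df["y"], sm.add_constant(df["x"])).fit()

# Scikit-learn: the same linear fit through the machine-learning API
lr = LinearRegression().fit(df[["x"]], df["y"])

print(ols.params.values, lr.coef_)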


1.2 Introduction to Machine Learning

1.2.1 Machine learning within data science

Machine learning covers two main types of data analysis:


1. Exploratory analysis: unsupervised learning. Discover the structure within the data. E.g.: experience (in years within a company) and salary are correlated.
2. Predictive analysis: supervised learning. This is sometimes described as "learn from the past to predict the future". Scenario: a company wants to detect potential future clients among a base of prospects. Retrospective data analysis: we go through the data constituted of previously prospected companies, with their characteristics (size, domain, localization, etc.). Some of these companies became clients, others did not. The question is: can we predict which of the new companies are more likely to become clients, based on their characteristics and on previous observations? In this example, the training data consist of a set of n training samples. Each sample x_i is a vector of p input features (company characteristics) together with a target feature y_i ∈ {Yes, No} (whether the company became a client or not).
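A minimal sketch of this supervised setting with scikit-learn; the prospect features and the choice of a logistic-regression classifier are illustrative assumptions, not part of the example above:

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.RandomState(0)
X = rng.randn(100, 3)                                   # n=100 samples, p=3 made-up features
y = (X[:, 0] + 0.5 * rng.randn(100) > 0).astype(int)    # 1 = became a client, 0 = did not

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
clf = LogisticRegression().fit(X_train, y_train)    # learn from the past...
print(clf.predict(X_test[:5]))                      # ...to predict the future (new prospects)
print(clf.score(X_test, y_test))                    # accuracy on held-out samples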

1.2.2 IT/computing science tools

• High Performance Computing (HPC).
• Data flow, data bases, file I/O, etc.


• Python: the programming language.
• Numpy: a Python library particularly useful for handling raw numerical data (matrices, mathematical operations).
• Pandas: input/output and manipulation of structured data (tables).

1.2.3 Statistics and applied mathematics

• Linear models.
• Non-parametric statistics.
• Linear algebra: matrix operations, inversion, eigenvalues.
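The linear-algebra items map directly onto NumPy calls; a small illustrative sketch:

import numpy as np

A = np.array([[2.0, 1.0],
              [1.0, 3.0]])
Ainv = np.linalg.inv(A)                 # matrix inversion
eigvals, eigvecs = np.linalg.eig(A)     # eigenvalues / eigenvectors
print(np.dot(A, Ainv))                  # identity matrix (up to rounding)
print(eigvals)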

1.3 Data analysis methodology

1. Formalize the customer's needs into a learning problem:
   • A target variable: supervised problem.
     – Target is qualitative: classification.
     – Target is quantitative: regression.
   • No target variable: unsupervised problem.
     – Visualization of high-dimensional samples: PCA, manifold learning, etc.
     – Finding groups of samples (hidden structure): clustering.
2. Ask questions about the datasets:
   • Number of samples.
   • Number of variables, type of each variable.
3. Define the sample:
   • For a prospective study, formalize the experimental design: inclusion/exclusion criteria. The conditions that define the acquisition of the dataset.
   • For a retrospective study, formalize the experimental design: inclusion/exclusion criteria. The conditions that define the selection of the dataset.
4. In a document, formalize (i) the project objectives; (ii) the required learning dataset (more specifically, the input data and the target variables); (iii) the conditions that define the acquisition of the dataset. In this document, warn the customer that the learned algorithms may not work on new data acquired under different conditions.
5. Read the learning dataset.
6. (i) Sanity check (basic descriptive statistics); (ii) data cleaning (impute missing data, recoding); final Quality Control (QC): perform descriptive statistics and think! (remove possible confounding variables, etc.). See the sketch after this list.
7. Explore the data (visualization, PCA) and perform basic univariate statistics to assess associations between the target and input variables.
8. Perform more complex multivariate machine learning.
9. Model validation using a left-out-sample strategy (cross-validation, etc.).
10. Apply on new data.
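A minimal sketch of steps 6 and 9 on a toy dataset (the column names, the mean imputation and the logistic-regression model are assumptions made for the illustration):

import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Step 6: sanity check and simple cleaning
df = pd.DataFrame(dict(x1=[1.0, 2.0, np.nan, 4.0, 5.0, 6.0],
                       x2=[1.0, 0.0, 1.0, 0.0, 1.0, 0.0],
                       target=[0, 0, 0, 1, 1, 1]))
print(df.describe())              # basic descriptive statistics
df = df.fillna(df.mean())         # impute missing data with column means

# Step 9: model validation with a left-out-sample strategy (cross-validation)
X, y = df[["x1", "x2"]].values, df["target"].values
scores = cross_val_score(LogisticRegression(), X, y, cv=3)
print(scores.mean())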

CHAPTER TWO

PYTHON LANGUAGE

Source: Kevin Markham, https://github.com/justmarkham/python-reference

2.1 Import libraries

# 'generic import' of math module
import math
math.sqrt(25)

# import a function
from math import sqrt
sqrt(25)    # no longer have to reference the module

# import multiple functions at once
from math import cos, floor

# import all functions in a module (generally discouraged)
# from os import *

# define an alias
import numpy as np

# show all functions in math module
content = dir(math)

2.2 Basic operations

# Numbers
10 + 4          # add (returns 14)
10 - 4          # subtract (returns 6)
10 * 4          # multiply (returns 40)
10 ** 4         # exponent (returns 10000)
10 / 4          # divide (returns 2.5: true division in Python 3)
10 / float(4)   # divide (returns 2.5)
5 % 4           # modulo (returns 1) - also known as the remainder

10 / 4          # true division (returns 2.5)
10 // 4         # floor division (returns 2)

# Boolean operations
# comparisons (these return True)
5 > 3
5 >= 3
5 != 3
5 == 5

# boolean operations (these return True)
5 > 3 and 6 > 3
5 > 3 or 5 < 3
not False
False or not False and True    # evaluation order: not, and, or

2.3 Data types

# determine the type of an object
type(2)         # returns 'int'
type(2.0)       # returns 'float'
type('two')     # returns 'str'
type(True)      # returns 'bool'
type(None)      # returns 'NoneType'

# check if an object is of a given type
isinstance(2.0, int)            # returns False
isinstance(2.0, (int, float))   # returns True

# convert an object to a given type
float(2)
int(2.9)
str(2.9)

# zero, None, and empty containers are converted to False
bool(0)
bool(None)
bool('')    # empty string
bool([])    # empty list
bool({})    # empty dictionary

# non-empty containers and non-zeros are converted to True
bool(2)
bool('two')
bool([2])

2.3.1 Lists

Different objects categorized along a certain ordered sequence: lists are ordered, iterable, mutable (adding or removing objects changes the list size), and can contain multiple data types.


# create an empty list (two ways)
empty_list = []
empty_list = list()

# create a list
simpsons = ['homer', 'marge', 'bart']

# examine a list
simpsons[0]     # print element 0 ('homer')
len(simpsons)   # returns the length (3)

# modify a list (does not return the list)
simpsons.append('lisa')                   # append element to end
simpsons.extend(['itchy', 'scratchy'])    # append multiple elements to end
simpsons.insert(0, 'maggie')              # insert element at index 0 (shifts everything right)
simpsons.remove('bart')                   # searches for first instance and removes it
simpsons.pop(0)                           # removes element 0 and returns it
del simpsons[0]                           # removes element 0 (does not return it)
simpsons[0] = 'krusty'                    # replace element 0

# concatenate lists (slower than 'extend' method)
neighbors = simpsons + ['ned', 'rod', 'todd']

# find elements in a list
simpsons.count('lisa')    # counts the number of instances
simpsons.index('itchy')   # returns index of first instance

# list slicing [start:end:stride]
weekdays = ['mon', 'tues', 'wed', 'thurs', 'fri']
weekdays[0]       # element 0
weekdays[0:3]     # elements 0, 1, 2
weekdays[:3]      # elements 0, 1, 2
weekdays[3:]      # elements 3, 4
weekdays[-1]      # last element (element 4)
weekdays[::2]     # every 2nd element (0, 2, 4)
weekdays[::-1]    # backwards (4, 3, 2, 1, 0)

# alternative method for returning the list backwards
list(reversed(weekdays))

# sort a list in place (modifies but does not return the list)
simpsons.sort()
simpsons.sort(reverse=True)    # sort in reverse
simpsons.sort(key=len)         # sort by a key

# return a sorted list (but does not modify the original list)
sorted(simpsons)
sorted(simpsons, reverse=True)
sorted(simpsons, key=len)

# create a second reference to the same list
num = [1, 2, 3]
same_num = num
same_num[0] = 0    # modifies both 'num' and 'same_num'

# copy a list (three ways)
new_num = num.copy()
new_num = num[:]
new_num = list(num)

# examine objects
id(num) == id(same_num)    # returns True
id(num) == id(new_num)     # returns False
num is same_num            # returns True
num is new_num             # returns False
num == same_num            # returns True
num == new_num             # returns True (their contents are equivalent)

# concatenate +, replicate *
[1, 2, 3] + [4, 5, 6]
["a"] * 2 + ["b"] * 3

2.3.2 Tuples

Like lists, but their size cannot change: ordered, iterable, immutable, can contain multiple data types.

# create a tuple
digits = (0, 1, 'two')          # create a tuple directly
digits = tuple([0, 1, 'two'])   # create a tuple from a list
zero = (0,)                     # trailing comma is required to indicate it's a tuple

# examine a tuple
digits[2]          # returns 'two'
len(digits)        # returns 3
digits.count(0)    # counts the number of instances of that value (1)
digits.index(1)    # returns the index of the first instance of that value (1)

# elements of a tuple cannot be modified
# digits[2] = 2    # throws an error

# concatenate tuples
digits = digits + (3, 4)

# create a single tuple with elements repeated (also works with lists)
(3, 4) * 2    # returns (3, 4, 3, 4)

# tuple unpacking
bart = ('male', 10, 'simpson')    # create a tuple

2.3.3 Strings

A sequence of characters; they are iterable and immutable.

# create a string
s = str(42)         # convert another data type into a string
s = 'I like you'



# examine a string
s[0]      # returns 'I'
len(s)    # returns 10

# string slicing is like list slicing
s[:6]     # returns 'I like'
s[7:]     # returns 'you'
s[-1]     # returns 'u'

# basic string methods (do not modify the original string)
s.lower()             # returns 'i like you'
s.upper()             # returns 'I LIKE YOU'
s.startswith('I')     # returns True
s.endswith('you')     # returns True
s.isdigit()           # returns False (returns True if every character in the string is a digit)
s.find('like')        # returns index of first occurrence (2), but doesn't support regex
s.find('hate')        # returns -1 since not found
s.replace('like', 'love')    # replaces all instances of 'like' with 'love'

# split a string into a list of substrings separated by a delimiter
s.split(' ')    # returns ['I', 'like', 'you']
s.split()       # same thing
s2 = 'a, an, the'
s2.split(',')   # returns ['a', ' an', ' the']

# join a list of strings into one string using a delimiter
stooges = ['larry', 'curly', 'moe']
' '.join(stooges)    # returns 'larry curly moe'

# concatenate strings
s3 = 'The meaning of life is'
s4 = '42'
s3 + ' ' + s4        # returns 'The meaning of life is 42'
s3 + ' ' + str(42)   # same thing

# remove whitespace from start and end of a string
s5 = '  ham and cheese  '
s5.strip()    # returns 'ham and cheese'

# string substitutions: all of these return 'raining cats and dogs'
'raining %s and %s' % ('cats', 'dogs')                         # old way
'raining {} and {}'.format('cats', 'dogs')                     # new way
'raining {arg1} and {arg2}'.format(arg1='cats', arg2='dogs')   # named arguments

# string formatting
# more examples: http://mkaz.com/2012/10/10/python-string-format/
'pi is {:.2f}'.format(3.14159)    # returns 'pi is 3.14'

2.3.4 Strings 2/2

Normal strings allow for escaped characters:

print('first line\nsecond line')

Out:

first line
second line

Raw strings treat backslashes as literal characters:

print(r'first line\nfirst line')

Out:

first line\nfirst line

Sequences of bytes are not strings and should be decoded before some operations:

s = b'first line\nsecond line'
print(s)

print(s.decode('utf-8').split())

Out:

b'first line\nsecond line'
['first', 'line', 'second', 'line']

2.3.5 Dictionaries

Dictionaries are structures which can contain multiple data types, organized as key-value pairs: for each (unique) key, the dictionary outputs one value. Keys can be strings, numbers, or tuples, while the corresponding values can be any Python object. Dictionaries are iterable and mutable, but have no positional order.

# create an empty dictionary (two ways)
empty_dict = {}
empty_dict = dict()

# create a dictionary (two ways)
family = {'dad': 'homer', 'mom': 'marge', 'size': 6}
family = dict(dad='homer', mom='marge', size=6)

# convert a list of tuples into a dictionary
list_of_tuples = [('dad', 'homer'), ('mom', 'marge'), ('size', 6)]
family = dict(list_of_tuples)

# examine a dictionary
family['dad']       # returns 'homer'
len(family)         # returns 3
family.keys()       # returns list: ['dad', 'mom', 'size']
family.values()     # returns list: ['homer', 'marge', 6]
family.items()      # returns list of tuples:
                    # [('dad', 'homer'), ('mom', 'marge'), ('size', 6)]
'mom' in family     # returns True
'marge' in family   # returns False (only checks keys)

# modify a dictionary (does not return the dictionary)
family['cat'] = 'snowball'            # add a new entry
family['cat'] = 'snowball ii'         # edit an existing entry
del family['cat']                     # delete an entry
family['kids'] = ['bart', 'lisa']     # value can be a list
family.pop('dad')                     # removes an entry and returns the value ('homer')
family.update({'baby': 'maggie', 'grandpa': 'abe'})    # add multiple entries

# accessing values more safely with 'get'
family['mom']                        # returns 'marge'
family.get('mom')                    # same thing
try:
    family['grandma']                # throws an error
except KeyError as e:
    print("Error", e)
family.get('grandma')                # returns None
family.get('grandma', 'not found')   # returns 'not found' (the default)

# accessing a list element within a dictionary
family['kids'][0]                # returns 'bart'
family['kids'].remove('lisa')    # removes 'lisa'

# string substitution using a dictionary
'youngest child is %(baby)s' % family    # returns 'youngest child is maggie'

Out:

Error 'grandma'

2.3.6 Sets

Like dictionaries, but with unique keys only (no corresponding values). They are: unordered, iterable, mutable, can contain multiple data types, and are made up of unique elements (strings, numbers, or tuples).
# create an empty set
empty_set = set()

# create a set
languages = {'python', 'r', 'java'}            # create a set directly
snakes = set(['cobra', 'viper', 'python'])     # create a set from a list

# examine a set
len(languages)           # returns 3
'python' in languages    # returns True

# set operations
languages & snakes    # returns intersection: {'python'}
languages | snakes    # returns union: {'cobra', 'r', 'java', 'viper', 'python'}
languages - snakes    # returns set difference: {'r', 'java'}
snakes - languages    # returns set difference: {'cobra', 'viper'}

# modify a set (does not return the set)
languages.add('sql')        # add a new element
languages.add('r')          # try to add an existing element (ignored, no error)
languages.remove('java')    # remove an element
try:
    languages.remove('c')   # try to remove a non-existing element (throws an error)
except KeyError as e:
    print("Error", e)
languages.discard('c')              # removes an element if present, but ignored otherwise
languages.pop()                     # removes and returns an arbitrary element
languages.clear()                   # removes all elements
languages.update(['go', 'spark'])   # add multiple elements (can also pass a set)

# get a sorted list of unique elements from a list
sorted(set([9, 0, 2, 1, 0]))    # returns [0, 1, 2, 9]

Out:

Error 'c'

2.4 Execution control statements

2.4.1 Conditional statements

x = 3

# if statement
if x > 0:
    print('positive')

# if/else statement
if x > 0:
    print('positive')
else:
    print('zero or negative')

# if/elif/else statement
if x > 0:
    print('positive')
elif x == 0:
    print('zero')
else:
    print('negative')

# single-line if statement (sometimes discouraged)
if x > 0: print('positive')

# single-line if/else statement (sometimes discouraged)
# known as a 'ternary operator'
'positive' if x > 0 else 'zero or negative'

Out:

positive
positive
positive
positive
2.4.2 Loops

Loops are a set of instructions which repeat until termination conditions are met. This includes iterating through all values in an object, going through a range of values, etc.

# range returns an iterable of integers (wrap it in list() to materialize it)
range(0, 3)     # 0, 1, 2: includes first value but excludes second value
range(3)        # same thing: starting at zero is the default
range(0, 5, 2)  # 0, 2, 4: third argument specifies the 'stride'

# for loop
fruits = ['apple', 'banana', 'cherry']
for i in range(len(fruits)):
    print(fruits[i].upper())

# alternative for loop (recommended style)
for fruit in fruits:
    print(fruit.upper())

# use range when iterating over a large sequence to avoid actually creating the integer list in memory
v = 0
for i in range(10 ** 6):
    v += 1

quote = """
our incomes are like our shoes; if too small they gall and pinch us but
if too large they cause us to stumble and to trip
"""

count = {k: 0 for k in set(quote.split())}
for word in quote.split():
    count[word] += 1

# iterate through two things at once (using tuple unpacking)
family = {'dad': 'homer', 'mom': 'marge', 'size': 6}
for key, value in family.items():
    print(key, value)

# use enumerate if you need to access the index value within the loop
for index, fruit in enumerate(fruits):
    print(index, fruit)

# for/else loop
for fruit in fruits:
    if fruit == 'banana':
        print("Found the banana!")
        break    # exit the loop and skip the 'else' block
else:
    # this block executes ONLY if the for loop completes without hitting 'break'
    print("Can't find the banana")

# while loop
count = 0
while count < 5:
    print("This will print 5 times")
    count += 1    # equivalent to 'count = count + 1'

Out:

APPLE
BANANA
CHERRY
APPLE
BANANA
CHERRY
dad homer
mom marge
size 6
0 apple
1 banana
2 cherry
Found the banana!
This will print 5 times
This will print 5 times
This will print 5 times
This will print 5 times
This will print 5 times

2.4.3 Exceptions handling
dct = dict(a=[1, 2], b=[4, 5])

key = 'c'
try:
    dct[key]
except KeyError:
    print("Key %s is missing. Add it with empty value" % key)
    dct['c'] = []

print(dct)

Out:

Key c is missing. Add it with empty value
{'a': [1, 2], 'b': [4, 5], 'c': []}


2.5 Functions

Functions are sets of instructions launched when called upon; they can have multiple input values and a return value.

# define a function with no arguments and no return values
def print_text():
    print('this is text')

# call the function
print_text()

# define a function with one argument and no return values
def print_this(x):
    print(x)

# call the function
print_this(3)        # prints 3
n = print_this(3)    # prints 3, but doesn't assign 3 to n
                     # because the function has no return statement

# a function with two arguments and a return value
def add(a, b):
    return a + b

add(2, 3)
add("deux", "trois")
add(["deux", "trois"], [2, 3])

# define a function with one argument and one return value
def square_this(x):
    return x ** 2

# include an optional docstring to describe the effect of a function
def square_this(x):
    """Return the square of a number."""
    return x ** 2

# call the function
square_this(3)          # returns 9
var = square_this(3)    # assigns 9 to var, but does not print 9

# default arguments
def power_this(x, power=2):
    return x ** power

power_this(2)       # 4
power_this(2, 3)    # 8

# use 'pass' as a placeholder if you haven't written the function body
def stub():
    pass

# return two values from a single function
def min_max(nums):
    return min(nums), max(nums)

# return values can be assigned to a single variable as a tuple
nums = [1, 2, 3]
min_max_num = min_max(nums)    # min_max_num = (1, 3)

# return values can be assigned into multiple variables using tuple unpacking
min_num, max_num = min_max(nums)    # min_num = 1, max_num = 3

Out:

this is text
3
3

2.6 List comprehensions, iterators, etc.

2.6.1 List comprehensions

A concise way to build whole lists without writing explicit loops. For more: http://python-3-patterns-idioms-test.readthedocs.io/en/latest/Comprehensions.html
# for loop to create a list of cubes
nums = [1, 2, 3, 4, 5]
cubes = []
for num in nums:
    cubes.append(num**3)

# equivalent list comprehension
cubes = [num**3 for num in nums]    # [1, 8, 27, 64, 125]

# for loop to create a list of cubes of even numbers
cubes_of_even = []
for num in nums:
    if num % 2 == 0:
        cubes_of_even.append(num**3)

# equivalent list comprehension
# syntax: [expression for variable in iterable if condition]
cubes_of_even = [num**3 for num in nums if num % 2 == 0]    # [8, 64]

# for loop to cube even numbers and square odd numbers
cubes_and_squares = []
for num in nums:
    if num % 2 == 0:
        cubes_and_squares.append(num**3)
    else:
        cubes_and_squares.append(num**2)

# equivalent list comprehension (using a ternary expression)
# syntax: [true_condition if condition else false_condition for variable in iterable]
cubes_and_squares = [num**3 if num % 2 == 0 else num**2 for num in nums]    # [1, 8, 9, 64, 25]

# for loop to flatten a 2d-matrix
matrix = [[1, 2], [3, 4]]
items = []
for row in matrix:
    for item in row:
        items.append(item)

# equivalent list comprehension
items = [item for row in matrix for item in row]    # [1, 2, 3, 4]

# set comprehension
fruits = ['apple', 'banana', 'cherry']
unique_lengths = {len(fruit) for fruit in fruits}    # {5, 6}

# dictionary comprehension
fruit_lengths = {fruit: len(fruit) for fruit in fruits}
# {'apple': 5, 'banana': 6, 'cherry': 6}

2.7 Regular expression

1. Compile a regular expression with a pattern:

import re

# 1. compile a regular expression with a pattern
regex = re.compile("^.+(sub-.+)_(ses-.+)_(mod-.+)")

2. Match the compiled RE on a string.

Capture the pattern 'anyprefixsub-<subj id>_ses-<session id>_<modality>':

strings = ["abcsub-033_ses-01_mod-mri", "defsub-044_ses-01_mod-mri", "ghisub-055_ses-02_mod-ctscan"]

print([regex.findall(s)[0] for s in strings])

Out:

[('sub-033', 'ses-01', 'mod-mri'), ('sub-044', 'ses-01', 'mod-mri'), ('sub-055', 'ses-02', 'mod-ctscan')]

Match methods on a compiled regular expression:

Method/Attribute    Purpose
match(string)       Determine if the RE matches at the beginning of the string.
search(string)      Scan through a string, looking for any location where this RE matches.
findall(string)     Find all substrings where the RE matches, and return them as a list.
finditer(string)    Find all substrings where the RE matches, and return them as an iterator.

3. Replace with a compiled RE on a string:


regex = re.compile("(sub-[^_]+)")    # match (sub-...)_

print([regex.sub("SUB-", s) for s in strings])

regex.sub("SUB-", "toto")

Out:

['abcSUB-_ses-01_mod-mri', 'defSUB-_ses-01_mod-mri', 'ghiSUB-_ses-02_mod-ctscan']

Replace all non-alphanumeric characters in a string:

re.sub('[^0-9a-zA-Z]+', '', 'h^&ell’.,|o w]{+orld')
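A short sketch contrasting match() and search() from the table above (the pattern and strings are made up for the illustration):

regex = re.compile("sub-[0-9]+")
print(regex.match("abcsub-033"))      # None: match() only succeeds at the beginning of the string
print(regex.search("abcsub-033"))     # a Match object: search() scans the whole string
print(bool(regex.match("sub-033")))   # True: the pattern is at the beginning here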

2.8 System programming

2.8.1 Operating system interfaces (os)

import os

Current working directory:

# Get the current working directory
cwd = os.getcwd()
print(cwd)

# Set the current working directory
os.chdir(cwd)

Out:

/home/edouard/git/pystatsml/python_lang

Temporary directory:

import tempfile

tmpdir = tempfile.gettempdir()

Join paths:

mytmpdir = os.path.join(tmpdir, "foobar")

# list containing the names of the entries in the directory given by path
os.listdir(tmpdir)

Create a directory:

if not os.path.exists(mytmpdir):
    os.mkdir(mytmpdir)

os.makedirs(os.path.join(tmpdir, "foobar", "plop", "toto"), exist_ok=True)


2.8.2 File input/output

filename = os.path.join(mytmpdir, "myfile.txt")
print(filename)

# Write
lines = ["Dans python tout est bon", "Enfin, presque"]

## write line by line
fd = open(filename, "w")
fd.write(lines[0] + "\n")
fd.write(lines[1] + "\n")
fd.close()

## use a context manager to automatically close your file
with open(filename, 'w') as f:
    for line in lines:
        f.write(line + '\n')

# Read
## read one line at a time (entire file does not have to fit into memory)
f = open(filename, "r")
f.readline()    # one string per line (including newlines)
f.readline()    # next line
f.close()

## read the whole file at once, return a list of lines
f = open(filename, 'r')
f.readlines()   # one list, each line is one string
f.close()

## use list comprehension to duplicate readlines without reading entire file at once
f = open(filename, 'r')
[line for line in f]
f.close()

## use a context manager to automatically close your file
with open(filename, 'r') as f:
    lines = [line for line in f]

Out:

/tmp/foobar/myfile.txt

2.8.3 Explore, list directories

Walk:

import os

WD = os.path.join(tmpdir, "foobar")

for dirpath, dirnames, filenames in os.walk(WD):
    print(dirpath, dirnames, filenames)

Out:

/tmp/foobar ['plop'] ['myfile.txt']
/tmp/foobar/plop ['toto'] []
/tmp/foobar/plop/toto [] []

glob, basename and file extension:

import tempfile
import glob

tmpdir = tempfile.gettempdir()

filenames = glob.glob(os.path.join(tmpdir, "*", "*.txt"))
print(filenames)

# take basename then remove extension
basenames = [os.path.splitext(os.path.basename(f))[0] for f in filenames]
print(basenames)

Out:

['/tmp/foobar/myfile.txt']
['myfile']

shutil - high-level file operations:

import shutil

src = os.path.join(tmpdir, "foobar", "myfile.txt")
dst = os.path.join(tmpdir, "foobar", "plop", "myfile.txt")
print("copy %s to %s" % (src, dst))

shutil.copy(src, dst)

print("File %s exists ?" % dst, os.path.exists(dst))

src = os.path.join(tmpdir, "foobar", "plop")
dst = os.path.join(tmpdir, "plop2")
print("copy tree %s under %s" % (src, dst))

try:
    shutil.copytree(src, dst)
    shutil.rmtree(dst)
    shutil.move(src, dst)
except (FileExistsError, FileNotFoundError) as e:
    pass


Out:
copy /tmp/foobar/myfile.txt to /tmp/foobar/plop/myfile.txt
File /tmp/foobar/plop/myfile.txt exists ? True
copy tree /tmp/foobar/plop under /tmp/plop2

2.8.4 Command execution with subprocess

• For more advanced use cases, the underlying Popen interface can be used directly.
• subprocess.run() runs the command described by args.
• It waits for the command to complete.
• It returns a CompletedProcess instance.
• It does not capture stdout or stderr by default. To do so, pass PIPE for the stdout and/or stderr arguments.

import subprocess

# doesn't capture output
p = subprocess.run(["ls", "-l"])
print(p.returncode)

# Run through the shell.
subprocess.run("ls -l", shell=True)

# Capture output
out = subprocess.run(["ls", "-a", "/"], stdout=subprocess.PIPE, stderr=subprocess.STDOUT)
# out.stdout is a sequence of bytes that should be decoded into a utf-8 string
print(out.stdout.decode('utf-8').split("\n")[:5])

Out:

0
['.', '..', 'bin', 'boot', 'cdrom']

2.8.5 Multiprocessing and multithreading

Process

A process is a name given to a program instance that has been loaded into memory and managed by the operating system.

Process = address space + execution context (thread of control).

Process address space (segments):
• Code.
• Data (static/global).
• Heap (dynamic memory allocation).
• Stack.

Execution context:
• Data registers.
• Stack pointer (SP).
• Program counter (PC).
• Working registers.

OS scheduling of processes: context switching (i.e. saving/loading the execution context).

Pros/cons:
• Context switching is expensive.
• (Potentially) complex data sharing (not necessarily true).
• Cooperating processes - no need for memory protection (separate address spaces).
• Relevant for parallel computation with memory allocation.

Threads
• Threads share the same address space (data registers): access to code, heap and (global) data.
• Separate execution stack, PC and working registers.

Pros/cons:
• Faster context switching: only SP, PC and working registers.
• Can exploit fine-grain concurrency.
• Simple data sharing through the shared address space.
• Precautions have to be taken or two threads will write to the same memory at the same time. This is what the global interpreter lock (GIL) is for.
• Relevant for GUI and I/O (network, disk) concurrent operations.

In Python
• The threading module uses threads.
• The multiprocessing module uses processes.

Multithreading

import time
import threading

def list_append(count, sign=1, out_list=None):
    if out_list is None:
        out_list = list()
    for i in range(count):
        out_list.append(sign * i)
        sum(out_list)    # do some computation
    return out_list

size = 10000    # Number of numbers to add

out_list = list()    # result is a simple list
thread1 = threading.Thread(target=list_append, args=(size, 1, out_list, ))
thread2 = threading.Thread(target=list_append, args=(size, -1, out_list, ))

startime = time.time()
# Will execute both in parallel
thread1.start()
thread2.start()
# Joins threads back to the parent process
thread1.join()
thread2.join()
print("Threading elapsed time ", time.time() - startime)

print(out_list[:10])

Out:

Threading elapsed time  1.7868659496307373
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]

Multiprocessing

import multiprocessing

# Sharing requires a specific mechanism
out_list1 = multiprocessing.Manager().list()
p1 = multiprocessing.Process(target=list_append, args=(size, 1, None))
out_list2 = multiprocessing.Manager().list()
p2 = multiprocessing.Process(target=list_append, args=(size, -1, None))

startime = time.time()
p1.start()
p2.start()
p1.join()
p2.join()
print("Multiprocessing elapsed time ", time.time() - startime)

# print(out_list[:10]) is not available

Out:

Multiprocessing elapsed time  0.3927607536315918

Sharing objects between processes with Managers

Managers provide a way to create data which can be shared between different processes, including sharing over a network between processes running on different machines. A manager object controls a server process which manages shared objects.

import multiprocessing
import time

size = int(size / 100)    # Number of numbers to add

# Sharing requires a specific mechanism
out_list = multiprocessing.Manager().list()
p1 = multiprocessing.Process(target=list_append, args=(size, 1, out_list))
p2 = multiprocessing.Process(target=list_append, args=(size, -1, out_list))

startime = time.time()

p1.start()
p2.start()

p1.join()
p2.join()

print(out_list[:10])
print("Multiprocessing with shared object elapsed time ", time.time() - startime)

Out:

[0, 1, 2, 0, 3, -1, 4, -2, 5, -3]


Multiprocessing with shared object elapsed time  0.7650048732757568

2.9 Scripts and argument parsing

Example: the word count script.

import os
import os.path
import argparse
import re
import pandas as pd

if __name__ == "__main__":
    # parse command line options
    output = "word_count.csv"
    parser = argparse.ArgumentParser()
    parser.add_argument('-i', '--input',
                        help='list of input files.',
                        nargs='+', type=str)
    parser.add_argument('-o', '--output',
                        help='output csv file (default %s)' % output,
                        type=str, default=output)
    options = parser.parse_args()

    if options.input is None:
        parser.print_help()
        raise SystemExit("Error: input files are missing")
    else:
        filenames = [f for f in options.input if os.path.isfile(f)]

    # Match words
    regex = re.compile("[a-zA-Z]+")

    # Count word occurrences over all input files
    count = dict()
    for filename in filenames:
        fd = open(filename, "r")
        for line in fd:
            for word in regex.findall(line.lower()):
                if not word in count:
                    count[word] = 1
                else:
                    count[word] += 1

    fd = open(options.output, "w")

    # Pandas
    df = pd.DataFrame([[k, count[k]] for k in count], columns=["word", "count"])
    df.to_csv(options.output, index=False)

2.10 Networking

# TODO

2.10.1 FTP

# Full FTP features with ftplib
import ftplib
ftp = ftplib.FTP("ftp.cea.fr")
ftp.login()
ftp.cwd('/pub/unati/people/educhesnay/pystatml')
ftp.retrlines('LIST')

fd = open(os.path.join(tmpdir, "README.md"), "wb")
ftp.retrbinary('RETR README.md', fd.write)
fd.close()
ftp.quit()

# File download with urllib
import urllib.request
ftp_url = 'ftp://ftp.cea.fr/pub/unati/people/educhesnay/pystatml/README.md'
urllib.request.urlretrieve(ftp_url, os.path.join(tmpdir, "README2.md"))

Out:

-rw-r--r--    1 ftp    ftp       3019 Oct 16 00:30 README.md
-rw-r--r--    1 ftp    ftp    9588437 Oct 28 19:58 StatisticsMachineLearningPythonDraft.pdf

2.10.2 HTTP

# TODO

2.10.3 Sockets

# TODO

2.10.4 xmlrpc

# TODO

2.11 Modules and packages

A module is a Python file. A package is a directory which MUST contain a special file called __init__.py.

To import, extend the PYTHONPATH variable:

export PYTHONPATH=path_to_parent_python_module:${PYTHONPATH}

Or:

import sys
sys.path.append("path_to_parent_python_module")

The __init__.py file can be empty. But you can set which modules the package exports as the API, while keeping other modules internal, by overriding the __all__ variable, like so:

parentmodule/__init__.py file:

from . import submodule1
from . import submodule2

from .submodule3 import function1
from .submodule3 import function2

__all__ = ["submodule1", "submodule2", "function1", "function2"]

Users can then import:

import parentmodule.submodule1
import parentmodule.function1
Python Unit Testing
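A minimal unittest sketch (illustrative only, not the author's example; the add function is made up):

import unittest

def add(a, b):
    return a + b

class TestAdd(unittest.TestCase):
    def test_add(self):
        # the test passes if add() returns the expected sum
        self.assertEqual(add(2, 3), 5)

if __name__ == "__main__":
    unittest.main()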

2.12 Object Oriented Programming (OOP)

Sources
• http://python-textbok.readthedocs.org/en/latest/Object_Oriented_Programming.html

Principles
• Encapsulate data (attributes) and code (methods) into objects.
• Class = template or blueprint that can be used to create objects.
• An object is a specific instance of a class.
• Inheritance: OOP allows classes to inherit commonly used state and behaviour from other classes. This reduces code duplication.
• Polymorphism: calling code is agnostic as to whether an object belongs to a parent class or one of its descendants (abstraction, modularity). The same method called on 2 objects of 2 different classes will behave differently.

import math
class Shape2D:
    def area(self):
        raise NotImplementedError()

# __init__ is a special method called the constructor

# Inheritance + Encapsulation
class Square(Shape2D):
    def __init__(self, width):
        self.width = width

    def area(self):
        return self.width ** 2

class Disk(Shape2D):
    def __init__(self, radius):
        self.radius = radius

    def area(self):
        return math.pi * self.radius ** 2

shapes = [Square(2), Disk(3)]

# Polymorphism
print([s.area() for s in shapes])

s = Shape2D()
try:
    s.area()
except NotImplementedError as e:
    print("NotImplementedError")

Out:

[4, 28.274333882308138]
NotImplementedError


2.13 Exercises

2.13.1 Exercise 1: functions

Create a function that acts as a simple calculator. If the operation is not specified, default to addition. If the operation is misspecified, return an error message. Ex: calc(4, 5, "multiply") returns 20; calc(3, 5) returns 8; calc(1, 2, "something") returns an error message.
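One possible sketch (the error-message wording is a design choice, not prescribed by the exercise):

def calc(a, b, operation="add"):
    # default to addition; return an error message for unknown operations
    if operation == "add":
        return a + b
    elif operation == "multiply":
        return a * b
    else:
        return "Error: unknown operation '%s'" % operation

print(calc(4, 5, "multiply"))   # 20
print(calc(3, 5))               # 8
print(calc(1, 2, "something"))  # error message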

2.13.2 Exercise 2: functions + list + loop

Given a list of numbers, return a list where all adjacent duplicate elements have been reduced to a single element. Ex: [1, 2, 2, 3, 2] returns [1, 2, 3, 2]. You may create a new list or modify the passed-in list.

Remove all duplicate values (adjacent or not). Ex: [1, 2, 2, 3, 2] returns [1, 2, 3].
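One possible sketch (the function names are illustrative):

def remove_adjacent_duplicates(nums):
    # keep an element only if it differs from the previous one
    out = []
    for x in nums:
        if not out or out[-1] != x:
            out.append(x)
    return out

def remove_duplicates(nums):
    # keep the first occurrence of each value, preserving order
    seen, out = set(), []
    for x in nums:
        if x not in seen:
            seen.add(x)
            out.append(x)
    return out

print(remove_adjacent_duplicates([1, 2, 2, 3, 2]))  # [1, 2, 3, 2]
print(remove_duplicates([1, 2, 2, 3, 2]))           # [1, 2, 3]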

2.13.3 Exercise 3: File I/O

1. Copy/paste the BSD 4-clause license (https://en.wikipedia.org/wiki/BSD_licenses) into a text file. Read the file and count the occurrences of each word within the file. Store the words' occurrence counts in a dictionary.
2. Write an executable Python command count_words.py that parses a list of input files provided after an --input parameter. The dictionary of occurrences is saved in a csv file provided by --output, with default value word_count.csv. Use: open, regular expressions, argparse (https://docs.python.org/3/howto/argparse.html).

2.13.4 Exercise 4: OOP

1. Create a class Employee with 2 attributes provided in the constructor: name, years_of_service. With one method salary which returns 1500 + 100 * years_of_service.
2. Create a subclass Manager which redefines the salary method as 2500 + 120 * years_of_service.
3. Create a small dictionary-based database where the key is the employee's name. Populate the database with: samples = Employee('lucy', 3), Employee('john', 1), Manager('julie', 10), Manager('paul', 3).
4. Return a table of name, salary rows, i.e. a list of lists [[name, salary]].
5. Compute the average salary.
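A possible sketch of the classes and queries described above (the salary formulas come from the exercise statement; the rest is one way to do it):

class Employee:
    def __init__(self, name, years_of_service):
        self.name = name
        self.years_of_service = years_of_service

    def salary(self):
        return 1500 + 100 * self.years_of_service

class Manager(Employee):
    def salary(self):
        return 2500 + 120 * self.years_of_service

samples = [Employee('lucy', 3), Employee('john', 1), Manager('julie', 10), Manager('paul', 3)]
db = {e.name: e for e in samples}                       # dictionary-based database keyed by name
table = [[name, e.salary()] for name, e in db.items()]  # [[name, salary], ...]
average = sum(row[1] for row in table) / len(table)
print(table, average)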


CHAPTER THREE

SCIENTIFIC PYTHON

3.1 Numpy: arrays and matrices

NumPy is an extension to the Python programming language, adding support for large, multi-dimensional (numerical) arrays and matrices, along with a large library of high-level mathematical functions to operate on these arrays.

Sources:
• Kevin Markham: https://github.com/justmarkham

import numpy as np

3.1.1 Create arrays

Create ndarrays from lists. Note: every element must be the same type (will be converted if possible).

data1 = [1, 2, 3, 4, 5]             # list
arr1 = np.array(data1)              # 1d array
data2 = [range(1, 5), range(5, 9)]  # list of lists
arr2 = np.array(data2)              # 2d array
arr2.tolist()                       # convert array back to list

Create special arrays:

np.zeros(10)
np.zeros((3, 6))
np.ones(10)
np.linspace(0, 1, 5)    # 0 to 1 (inclusive) with 5 points
np.logspace(0, 3, 4)    # 10^0 to 10^3 (inclusive) with 4 points

arange is like range, except it returns an array (not a list):

int_array = np.arange(5)
float_array = int_array.astype(float)


3.1.2 Examining arrays

arr1.dtype    # dtype('int64')
arr2.dtype    # dtype('int64')
arr2.ndim     # 2
arr2.shape    # (2, 4) - axis 0 is rows, axis 1 is columns
arr2.size     # 8 - total number of elements
len(arr2)     # 2 - size of first dimension (aka axis)
3.1.3 Reshaping

arr = np.arange(10, dtype=float).reshape((2, 5))
print(arr.shape)
print(arr.reshape(5, 2))

Out:

(2, 5)
[[0. 1.]
 [2. 3.]
 [4. 5.]
 [6. 7.]
 [8. 9.]]

Add an axis:

a = np.array([0, 1])
a_col = a[:, np.newaxis]
print(a_col)
# or
a_col = a[:, None]

Out:

[[0]
 [1]]

Transpose:

print(a_col.T)

Out:

[[0 1]]

Flatten: always returns a flat copy of the original array:

arr_flt = arr.flatten()
arr_flt[0] = 33
print(arr_flt)
print(arr)

Out:

[33.  1.  2.  3.  4.  5.  6.  7.  8.  9.]
[[0. 1. 2. 3. 4.]
 [5. 6. 7. 8. 9.]]

Ravel: returns a view of the original array whenever possible:

arr_flt = arr.ravel()
arr_flt[0] = 33
print(arr_flt)
print(arr)

Out:

[33.  1.  2.  3.  4.  5.  6.  7.  8.  9.]
[[33.  1.  2.  3.  4.]
 [ 5.  6.  7.  8.  9.]]

3.1.4 Summary on axis, reshaping/flattening and selection

Numpy internals: by default Numpy uses the C convention, i.e. it is row-major: the matrix is stored by rows. In C, the last index changes most rapidly as one moves through the array as stored in memory.

For 2D arrays, a sequential move in memory will:
• iterate over rows (axis 0)
  – iterate over columns (axis 1)

For 3D arrays, a sequential move in memory will:
• iterate over planes (axis 0)
  – iterate over rows (axis 1)
    – iterate over columns (axis 2)


x = np.arange(2 * 3 * 4)
print(x)

Out:

[ 0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23]

Reshape into 3D (axis 0, axis 1, axis 2):

x = x.reshape(2, 3, 4)
print(x)

Out:

[[[ 0  1  2  3]
  [ 4  5  6  7]
  [ 8  9 10 11]]

 [[12 13 14 15]
  [16 17 18 19]
  [20 21 22 23]]]

Selection: get the first plane:

print(x[0, :, :])

Out:

[[ 0  1  2  3]
 [ 4  5  6  7]
 [ 8  9 10 11]]

Selection: get the first rows:

print(x[:, 0, :])

Out:

[[ 0  1  2  3]
 [12 13 14 15]]

Selection: get the first columns:

print(x[:, :, 0])

Out:

[[ 0  4  8]
 [12 16 20]]

Ravel:

print(x.ravel())

Out:

[ 0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23]

3.1.5 Stack arrays

Stack flat arrays in columns:

a = np.array([0, 1])
b = np.array([2, 3])

ab = np.stack((a, b)).T
print(ab)

# or
np.hstack((a[:, None], b[:, None]))

Out:

[[0 2]
 [1 3]]

3.1.6 Selection

Single item:

arr = np.arange(10, dtype=float).reshape((2, 5))

arr[0]       # 0th element (slices like a list)
arr[0, 3]    # row 0, column 3: returns 3.0
arr[0][3]    # alternative syntax

Slicing

Syntax: start:stop:step with start (default 0), stop (default last), step (default 1)

arr[0, :]      # row 0: returns 1d array ([0., 1., 2., 3., 4.])
arr[:, 0]      # column 0: returns 1d array ([0., 5.])
arr[:, :2]     # columns strictly before index 2 (2 first columns)
arr[:, 2:]     # columns from index 2 (included) onwards
arr2 = arr[:, 1:4]    # columns between index 1 (included) and 4 (excluded)
print(arr2)

Out:

[[1. 2. 3.]
 [6. 7. 8.]]

Slicing returns a view (not a copy):

arr2[0, 0] = 33
print(arr2)
print(arr)


Out:

[[33.  2.  3.]
 [ 6.  7.  8.]]
[[ 0. 33.  2.  3.  4.]
 [ 5.  6.  7.  8.  9.]]

Row 0: reverse order:

print(arr[0, ::-1])

# The rule of thumb here can be: in the context of lvalue indexing (i.e. the indices are
# placed in the left-hand side value of an assignment), no view or copy of the array is
# created (because there is no need to). However, with regular values, the above rules
# for creating views do apply.

Out:

[ 4.  3.  2. 33.  0.]

Fancy indexing: integer or boolean array indexing

Fancy indexing returns a copy, not a view.

Integer array indexing:

arr2 = arr[:, [1, 2, 3]]    # return a copy
print(arr2)
arr2[0, 0] = 44
print(arr2)
print(arr)

Out:

[[33.  2.  3.]
 [ 6.  7.  8.]]
[[44.  2.  3.]
 [ 6.  7.  8.]]
[[ 0. 33.  2.  3.  4.]
 [ 5.  6.  7.  8.  9.]]

Boolean array indexing:

arr2 = arr[arr > 5]    # return a copy

print(arr2)
arr2[0] = 44
print(arr2)
print(arr)

Out:

[33.  6.  7.  8.  9.]
[44.  6.  7.  8.  9.]
[[ 0. 33.  2.  3.  4.]
 [ 5.  6.  7.  8.  9.]]


However, in the context of lvalue indexing (left-hand side value of an assignment), fancy indexing authorizes the modification of the original array:

arr[arr > 5] = 0
print(arr)

Out:

[[0. 0. 2. 3. 4.]
 [5. 0. 0. 0. 0.]]

Boolean array indexing, continued:

names = np.array(['Bob', 'Joe', 'Will', 'Bob'])
names == 'Bob'                          # returns a boolean array
names[names != 'Bob']                   # logical selection
(names == 'Bob') | (names == 'Will')    # keywords "and/or" don't work with boolean arrays
names[names != 'Bob'] = 'Joe'           # assign based on a logical selection
np.unique(names)                        # set function

3.1.7 Vectorized operations

nums = np.arange(5)
nums * 10                  # multiply each element by 10
nums = np.sqrt(nums)       # square root of each element
np.ceil(nums)              # also floor, rint (round to nearest int)
np.isnan(nums)             # checks for NaN
nums + np.arange(5)        # add element-wise
np.maximum(nums, np.array([1, -2, 3, -4, 5]))    # compare element-wise

# Compute Euclidean distance between 2 vectors
vec1 = np.random.randn(10)
vec2 = np.random.randn(10)
dist = np.sqrt(np.sum((vec1 - vec2) ** 2))

# math and stats
rnd = np.random.randn(4, 2)    # random normals in 4x2 array
rnd.mean()
rnd.std()
rnd.argmin()          # index of minimum element
rnd.sum()
rnd.sum(axis=0)       # sum of columns
rnd.sum(axis=1)       # sum of rows

# methods for boolean arrays
(rnd > 0).sum()       # counts number of positive values
(rnd > 0).any()       # checks if any value is True
(rnd > 0).all()       # checks if all values are True

# random numbers
np.random.seed(12234)          # Set the seed
np.random.rand(2, 3)           # 2 x 3 matrix in [0, 1]
np.random.randn(10)            # random normals (mean 0, sd 1)
np.random.randint(0, 2, 10)    # 10 randomly picked 0 or 1


3.1.8 Broadcasting

Sources: https://docs.scipy.org/doc/numpy-1.13.0/user/basics.broadcasting.html

Implicit conversion to allow operations on arrays of different sizes. - The smaller array is stretched or "broadcasted" across the larger array so that they have compatible shapes. - Fast vectorized operation in C instead of Python. - No needless copies.

Rules

Starting with the trailing axis and working backward, Numpy compares array dimensions.
• If two dimensions are equal, it continues.
• If one of the operands has dimension 1, it is stretched to match the largest one.
• When one of the shapes runs out of dimensions (because it has fewer dimensions than the other shape), Numpy will use 1 in the comparison process until the other shape's dimensions run out as well.

Fig. 1: Broadcasting of arrays (source: http://www.scipy-lectures.org)

a = np.array([[ 0,  0,  0],
              [10, 10, 10],
              [20, 20, 20],
              [30, 30, 30]])

b = np.array([0, 1, 2])

print(a + b)

Out:

[[ 0  1  2]
 [10 11 12]
 [20 21 22]
 [30 31 32]]

Examples

Shapes of operands A, B and result:

A      (2d array):  5 x 4
B      (1d array):      1
Result (2d array):  5 x 4

A      (2d array):  5 x 4
B      (1d array):      4
Result (2d array):  5 x 4

A      (3d array):  15 x 3 x 5
B      (3d array):  15 x 1 x 5
Result (3d array):  15 x 3 x 5

A      (3d array):  15 x 3 x 5
B      (2d array):       3 x 5
Result (3d array):  15 x 3 x 5

A      (3d array):  15 x 3 x 5
B      (2d array):       3 x 1
Result (3d array):  15 x 3 x 5
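A quick check of two of these shape combinations (the shapes are taken from the list above):

import numpy as np

A = np.ones((5, 4))
B = np.arange(4)             # shape (4,) broadcasts against (5, 4)
print((A + B).shape)         # (5, 4)

A = np.ones((15, 3, 5))
B = np.ones((3, 1))          # shape (3, 1) broadcasts against (15, 3, 5)
print((A + B).shape)         # (15, 3, 5)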

3.1.9 Exercises

Given the array:

X = np.random.randn(4, 2)    # random normals in a 4x2 array

• For each column, find the row index of the minimum value.
• Write a function standardize(X) that returns an array whose columns are centered and scaled (by their standard deviation).
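One possible sketch of the two items (using the default, population standard deviation; that normalization choice is an assumption):

import numpy as np

X = np.random.randn(4, 2)

# row index of the minimum value of each column
print(X.argmin(axis=0))

def standardize(X):
    # center each column and scale it by its standard deviation
    return (X - X.mean(axis=0)) / X.std(axis=0)

Xs = standardize(X)
print(Xs.mean(axis=0), Xs.std(axis=0))    # ~0 and 1 for each column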


3.2 Pandas: data manipulation

It is often said that 80% of data analysis is spent on cleaning and preparing data. This section focuses on a small, but important, aspect of data manipulation and cleaning with Pandas.

Sources:
• Kevin Markham: https://github.com/justmarkham
• Pandas doc: http://pandas.pydata.org/pandas-docs/stable/index.html

Data structures
• Series is a one-dimensional labeled array capable of holding any data type (integers, strings, floating point numbers, Python objects, etc.). The axis labels are collectively referred to as the index. The basic method to create a Series is to call pd.Series([1, 3, 5, np.nan, 6, 8]).
• DataFrame is a 2-dimensional labeled data structure with columns of potentially different types. You can think of it like a spreadsheet or SQL table, or a dict of Series objects. It stems from the R data.frame() object.

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
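A quick look at the Series structure described above (the DataFrame side is covered in the following subsections):

s = pd.Series([1, 3, 5, np.nan, 6, 8])
print(s.index)     # the axis labels
print(s.values)    # underlying numpy array
print(s.mean())    # NaN is skipped by default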
3.2.1 Create DataFrame

columns = ['name', 'age', 'gender', 'job']

user1 = pd.DataFrame([['alice', 19, "F", "student"],
                      ['john', 26, "M", "student"]],
                     columns=columns)

user2 = pd.DataFrame([['eric', 22, "M", "student"],
                      ['paul', 58, "F", "manager"]],
                     columns=columns)

user3 = pd.DataFrame(dict(name=['peter', 'julie'],
                          age=[33, 44], gender=['M', 'F'],
                          job=['engineer', 'scientist']))

print(user3)

Out:

    name  age gender        job
0  peter   33      M   engineer
1  julie   44      F  scientist
3.2.2 Combining DataFrames

Concatenate DataFrames:

user1.append(user2)
users = pd.concat([user1, user2, user3])
print(users)

Out:

    name  age gender        job
0  alice   19      F    student
1   john   26      M    student
0   eric   22      M    student
1   paul   58      F    manager
0  peter   33      M   engineer
1  julie   44      F  scientist

Join DataFrames:

user4 = pd.DataFrame(dict(name=['alice', 'john', 'eric', 'julie'],
                          height=[165, 180, 175, 171]))
print(user4)

Out:

    name  height
0  alice     165
1   john     180
2   eric     175
3  julie     171

Use intersection of keys from both frames:

merge_inter = pd.merge(users, user4, on="name")
print(merge_inter)

Out:

    name  age gender        job  height
0  alice   19      F    student     165
1   john   26      M    student     180
2   eric   22      M    student     175
3  julie   44      F  scientist     171

Use union of keys from both


frames
users =pd.merge(users, user4, on="name", how='outer')
print(users)

Out:
name age gender job height
0 alice 19 F student 165.0
1 john 26 M student 180.0
(continues on next
page)

3.2. Pandas: data manipulation 43


Statistics and Machine Learning in Python, Release 0.3 beta

(continued from previous


2 eric 22 M student 175.0 page)
3 paul 58 F manager NaN
4 peter 33 M engineer NaN
5 julie 44 F scientist 171.0

Reshaping by pivoting

"Unpivot" a DataFrame from wide format to long (stacked) format:

stacked = pd.melt(users, id_vars="name", var_name="variable", value_name="value")
print(stacked)

Out:

     name variable      value
0   alice      age         19
1    john      age         26
2    eric      age         22
3    paul      age         58
4   peter      age         33
5   julie      age         44
6   alice   gender          F
7    john   gender          M
8    eric   gender          M
9    paul   gender          F
10  peter   gender          M
11  julie   gender          F
12  alice      job    student
13   john      job    student
14   eric      job    student
15   paul      job    manager
16  peter      job   engineer
17  julie      job  scientist
18  alice   height        165
19   john   height        180
20   eric   height        175
21   paul   height        NaN
22  peter   height        NaN
23  julie   height        171

"Pivot" a DataFrame from long (stacked) format to wide format:

print(stacked.pivot(index='name', columns='variable', values='value'))

Out:

variable  age gender height        job
name
alice      19      F    165    student
eric       22      M    175    student
john       26      M    180    student
julie      44      F    171  scientist
paul       58      F    NaN    manager
peter      33      M    NaN   engineer


3.2.3 Summarizing

# examine the users data
users               # print the first 30 and last 30 rows
type(users)         # DataFrame
users.head()        # print the first 5 rows
users.tail()        # print the last 5 rows

users.index         # "the index" (aka "the labels")
users.columns       # column names (which is "an index")
users.dtypes        # data types of each column
users.shape         # number of rows and columns
users.values        # underlying numpy array
users.info()        # concise summary (includes memory usage as of pandas 0.15.0)

Out:

<class 'pandas.core.frame.DataFrame'>
Int64Index: 6 entries, 0 to 5
Data columns (total 5 columns):
name      6 non-null object
age       6 non-null int64
gender    6 non-null object
job       6 non-null object
height    4 non-null float64
dtypes: float64(1), int64(1), object(3)
memory usage: 288.0+ bytes

3.2.4 Columns selection

users['gender']              # select one column
type(users['gender'])        # Series
users.gender                 # select one column using attribute access

# select multiple columns
users[['age', 'gender']]     # select two columns
my_cols = ['age', 'gender']  # or, create a list...
users[my_cols]               # ...and use that list to select columns
type(users[my_cols])         # DataFrame

3.2.5 Rows selection (basic)

iloc is strictly integer position based

df = users.copy()
df.iloc[0]         # first row
df.iloc[0, 0]      # first item of first row
df.iloc[0, 0] = 55

for i in range(users.shape[0]):
    row = df.iloc[i]
    row.age *= 100  # setting a copy, and not the original frame data

print(df)  # df is not modified

Out:

/home/edouard/anaconda3/lib/python3.7/site-packages/pandas/core/generic.py:5096: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame
See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  self[name] = value
    name  age gender        job  height
0     55   19      F    student   165.0
1   john   26      M    student   180.0
2   eric   22      M    student   175.0
3   paul   58      F    manager     NaN
4  peter   33      M   engineer     NaN
5  julie   44      F  scientist   171.0

loc is label based (use it instead of the deprecated ix indexer, which mixed integer- and
label-based access).

df = users.copy()
df.loc[0]            # first row
df.loc[0, "age"]     # first item of first row
df.loc[0, "age"] = 55

for i in range(df.shape[0]):
    df.loc[i, "age"] *= 10

print(df)  # df is modified

Out:

    name  age gender        job  height
0  alice  550      F    student   165.0
1   john  260      M    student   180.0
2   eric  220      M    student   175.0
3   paul  580      F    manager     NaN
4  peter  330      M   engineer     NaN
5  julie  440      F  scientist   171.0

3.2.6 Rows selection (filtering)

Simple logical filtering

users[users.age < 20]        # only show users with age < 20
young_bool = users.age < 20  # or, create a Series of booleans...
young = users[young_bool]    # ...and use that Series to filter rows
users[users.age < 20].job    # select one column from the filtered results
print(young)

Out:

    name  age gender      job  height
0  alice   19      F  student   165.0

Advanced logical filtering

users[users.age < 20][['age', 'job']]            # select multiple columns
users[(users.age > 20) & (users.gender == 'M')]  # use multiple conditions
users[users.job.isin(['student', 'engineer'])]   # filter specific values

3.2.7 Sorting

df = users.copy()

df.age.sort_values() # only works for a Series


df.sort_values(by='age') # sort rows by a specific column
df.sort_values(by='age', ascending=False) # use descending order instead
df.sort_values(by=['job', 'age']) # sort by multiple columns
df.sort_values(by=['job', 'age'], inplace=True) # modify df

print(df)

Out:

    name  age gender        job  height
4  peter   33      M   engineer     NaN
3   paul   58      F    manager     NaN
5  julie   44      F  scientist   171.0
0  alice   19      F    student   165.0
2   eric   22      M    student   175.0
1   john   26      M    student   180.0

3.2.8 Descriptive statistics

Summarize all numeric columns

print(df.describe())

Out:

             age      height
count   6.000000    4.000000
mean   33.666667  172.750000
std    14.895189    6.344289
min    19.000000  165.000000
25%    23.000000  169.500000
50%    29.500000  173.000000
75%    41.250000  176.250000
max    58.000000  180.000000

Summarize all columns

print(df.describe(include='all'))
print(df.describe(include=['object']))  # limit to one (or more) types


Out:

         name        age gender      job      height
count       6   6.000000      6        6    4.000000
unique      6        NaN      2        4         NaN
top      eric        NaN      M  student         NaN
freq        1        NaN      3        3         NaN
mean      NaN  33.666667    NaN      NaN  172.750000
std       NaN  14.895189    NaN      NaN    6.344289
min       NaN  19.000000    NaN      NaN  165.000000
25%       NaN  23.000000    NaN      NaN  169.500000
50%       NaN  29.500000    NaN      NaN  173.000000
75%       NaN  41.250000    NaN      NaN  176.250000
max       NaN  58.000000    NaN      NaN  180.000000

        name gender      job
count      6      6        6
unique     6      2        4
top     eric      M  student
freq       1      3        3

Statistics per group (groupby)

print(df.groupby("job").mean())
print(df.groupby("job")["age"].mean())
print(df.groupby("job").describe(include='all'))

Out:

                 age      height
job
engineer   33.000000         NaN
manager    58.000000         NaN
scientist  44.000000  171.000000
student    22.333333  173.333333
job
engineer     33.000000
manager      58.000000
scientist    44.000000
student      22.333333
Name: age, dtype: float64
           name                  ...  height
          count unique    top freq  ...    min    25%    50%    75%    max
job                                 ...
engineer      1      1  peter    1  ...    NaN    NaN    NaN    NaN    NaN
manager       1      1   paul    1  ...    NaN    NaN    NaN    NaN    NaN
scientist     1      1  julie    1  ...  171.0  171.0  171.0  171.0  171.0
student       3      3   eric    1  ...  165.0  170.0  175.0  177.5  180.0

[4 rows x 44 columns]

Groupby in a loop

for grp, data in df.groupby("job"):
    print(grp, data)

Out:

engineer     name  age gender       job  height
4           peter   33      M  engineer     NaN
manager    name  age gender      job  height
3          paul   58      F  manager     NaN
scientist     name  age gender        job  height
5           julie   44      F  scientist   171.0
student     name  age gender      job  height
0          alice   19      F  student   165.0
2           eric   22      M  student   175.0
1           john   26      M  student   180.0

3.2.9 Quality check

Remove duplicate data

df = users.append(df.iloc[0], ignore_index=True)

print(df.duplicated())                  # Series of booleans
                                        # (True if a row is identical to a previous row)
df.duplicated().sum()                   # count of duplicates
df[df.duplicated()]                     # only show duplicates
df.age.duplicated()                     # check a single column for duplicates
df.duplicated(['age', 'gender']).sum()  # specify columns for finding duplicates
df = df.drop_duplicates()               # drop duplicate rows

Out:

0    False
1    False
2    False
3    False
4    False
5    False
6     True
dtype: bool

Missing data

# Missing values are often just excluded
df = users.copy()

df.describe(include='all')  # excludes missing values

# find missing values in a Series
df.height.isnull()        # True if NaN, False otherwise
df.height.notnull()       # False if NaN, True otherwise
df[df.height.notnull()]   # only show rows where height is not NaN
df.height.isnull().sum()  # count the missing values

# find missing values in a DataFrame
df.isnull()        # DataFrame of booleans
df.isnull().sum()  # calculate the sum of each column


Strategy 1: drop missing values

df.dropna()           # drop a row if ANY values are missing
df.dropna(how='all')  # drop a row only if ALL values are missing

Strategy 2: fill in missing values

df.height.mean()
df = users.copy()
df.loc[df.height.isnull(), "height"] = df["height"].mean()

print(df)

Out:

    name  age gender        job  height
0  alice   19      F    student  165.00
1   john   26      M    student  180.00
2   eric   22      M    student  175.00
3   paul   58      F    manager  172.75
4  peter   33      M   engineer  172.75
5  julie   44      F  scientist  171.00

3.2.10 Rename values

df = users.copy()
print(df.columns)
# keep the same column order as the original frame: name, age, gender, job, height
df.columns = ['nom', 'age', 'genre', 'travail', 'taille']

df.travail = df.travail.map({'student': 'etudiant', 'manager': 'manager',
                             'engineer': 'ingenieur', 'scientist': 'scientific'})
# assert df.travail.isnull().sum() == 0

df['travail'].str.contains("etu|inge")

Out:

Index(['name', 'age', 'gender', 'job', 'height'], dtype='object')

3.2.11 Dealing with outliers

size = pd.Series(np.random.normal(loc=175, size=20, scale=10))

# Corrupt the first 3 measures
size[:3] += 500

Based on parametric statistics: use the mean

Assume the random variable follows the normal distribution and exclude data outside 3
standard deviations:
• Probability that a sample lies within 1 sd: 68.27%
• Probability that a sample lies within 3 sd: 99.73% (68.27 + 2 × 15.73)


size_outlr_mean = size.copy()
size_outlr_mean[((size - size.mean()).abs() >3 * size.std())] =size.mean()
print(size_outlr_mean.mean())

Out:

248.48963819938044

Based on non-parametric statistics: use the median

The median absolute deviation (MAD), based on the median, is a robust non-parametric
statistic: https://en.wikipedia.org/wiki/Median_absolute_deviation

mad =1.4826 * np.median(np.abs(size - size.median()))


size_outlr_mad = size.copy()

size_outlr_mad[((size - size.median()).abs() >3 * mad)] =size.median()


print(size_outlr_mad.mean(), size_outlr_mad.median())

Out:

173.80000467192673 178.7023568870694

3.2.12 File I/O

csv

import tempfile, os.path


tmpdir =tempfile.gettempdir()
csv_filename =os.path.join(tmpdir, "users.csv")
users.to_csv(csv_filename, index=False)
other = pd.read_csv(csv_filename)

Read csv from url

url = 'https://raw.github.com/neurospin/pystatsml/master/datasets/salary_table.csv'
salary = pd.read_csv(url)

Excel

xls_filename = os.path.join(tmpdir, "users.xlsx")
users.to_excel(xls_filename, sheet_name='users', index=False)

pd.read_excel(xls_filename, sheet_name='users')

# Multiple sheets
with pd.ExcelWriter(xls_filename) as writer:
    users.to_excel(writer, sheet_name='users', index=False)
    df.to_excel(writer, sheet_name='salary', index=False)

pd.read_excel(xls_filename, sheet_name='users')
pd.read_excel(xls_filename, sheet_name='salary')

SQL (SQLite)

import pandas as pd
import sqlite3

db_filename = os.path.join(tmpdir, "users.db")

Connect

conn = sqlite3.connect(db_filename)

Creating tables with pandas

url = 'https://raw.github.com/neurospin/pystatsml/master/datasets/salary_table.csv'
salary = pd.read_csv(url)

salary.to_sql("salary", conn, if_exists="replace")

Push modifications

cur = conn.cursor()
values = (100, 14000, 5, 'Bachelor', 'N')
cur.execute("insert into salary values (?, ?, ?, ?, ?)", values)
conn.commit()

Reading results into a pandas DataFrame

salary_sql = pd.read_sql_query("select * from salary;", conn)
print(salary_sql.head())

pd.read_sql_query("select * from salary;", conn).tail()
pd.read_sql_query('select * from salary where salary>25000;', conn)
pd.read_sql_query('select * from salary where experience=16;', conn)
pd.read_sql_query('select * from salary where education="Master";', conn)

Out:
index salary experience education management
0 0 13876 1 Bachelor Y
1 1 11608 1 Ph.D N
2 2 18701 1 Ph.D Y
3 3 11283 1 Master N
4 4 11767 1 Ph.D N

3.2.13 Exercises


Data Frame

1. Read the iris dataset at
   'https://github.com/neurospin/pystatsml/tree/master/datasets/iris.csv'
2. Print the column names.
3. Get the numerical columns.
4. For each species, compute the mean of the numerical columns and store it in a stats table
   like:

          species  sepal_length  sepal_width  petal_length  petal_width
   0      setosa          5.006        3.428         1.462        0.246
   1  versicolor          5.936        2.770         4.260        1.326
   2   virginica          6.588        2.974         5.552        2.026

Missing data

Add some missing data to the previous users table:

df = users.copy()
df.loc[[0, 2], "age"] = None
df.loc[[1, 3], "gender"] = None

1. Write a function fillmissing_with_mean(df) that fills all missing values of numerical
   columns with the mean of the current column.
2. Save the original users and the "imputed" frame in a single Excel file "users.xlsx" with
   two sheets: original, imputed.

3.3 Matplotlib: data visualization

Sources:
• Nicolas P. Rougier: http://www.labri.fr/perso/nrougier/teaching/matplotlib
• https://www.kaggle.com/benhamner/d/uciml/iris/python-data-visualizations

3.3.1 Basic plots
import numpy as np
import matplotlib.pyplot as plt

# inline plot (for jupyter)


%matplotlib inline

x =np.linspace(0, 10, 50)


sinus =np.sin(x)

plt.plot(x, sinus)
plt.show()

plt.plot(x, sinus, "o")


plt.show()
# use plt.plot to get
color / marker
abbreviations


# Rapid multiplot

cosinus = np.cos(x)
plt.plot(x, sinus, "-b", x, sinus, "ob", x, cosinus, "-r", x, cosinus, "or")
plt.xlabel('this is x!')
plt.ylabel('this is y!')
plt.title('My First Plot')
plt.show()

# Step by step
plt.plot(x, sinus, label='sinus', color='blue', linestyle='--', linewidth=2)
plt.plot(x, cosinus, label='cosinus', color='red', linestyle='-', linewidth=2)
plt.legend()
plt.show()

3.3.2 Scatter (2D) plots

Load dataset

import pandas as pd
try:
    salary = pd.read_csv("../datasets/salary_table.csv")
except:
    url = 'https://raw.github.com/neurospin/pystatsml/master/datasets/salary_table.csv'
    salary = pd.read_csv(url)

df = salary

Simple scatter with colors

colors = colors_edu = {'Bachelor': 'r', 'Master': 'g', 'Ph.D': 'blue'}
plt.scatter(df['experience'], df['salary'],
            c=df['education'].apply(lambda x: colors[x]), s=100)

<matplotlib.collections.PathCollection at 0x7f39efac6358>


Scatter plot with colors and symbols

## Figure size
plt.figure(figsize=(6, 5))

## Define colors / symbols manually
symbols_manag = dict(Y='*', N='.')
colors_edu = {'Bachelor': 'r', 'Master': 'g', 'Ph.D': 'b'}

## Group by education x management => 6 groups
for values, d in salary.groupby(['education', 'management']):
    edu, manager = values
    plt.scatter(d['experience'], d['salary'],
                marker=symbols_manag[manager], color=colors_edu[edu],
                s=150, label=manager + "/" + edu)

## Set labels
plt.xlabel('Experience')
plt.ylabel('Salary')
plt.legend(loc=4)  # lower right
plt.show()


3.3.3 Saving Figures

### bitmap format
plt.plot(x, sinus)
plt.savefig("sinus.png")
plt.close()

# Prefer vectorial format (SVG: Scalable Vector Graphics): can be edited with
# Inkscape, Adobe Illustrator, Blender, etc.
plt.plot(x, sinus)
plt.savefig("sinus.svg")
plt.close()

# Or pdf
plt.plot(x, sinus)
plt.savefig("sinus.pdf")
plt.close()

3.3.4 Seaborn

Sources:
• http://stanford.edu/~mwaskom/software/seaborn
• https://elitedatascience.com/python-seaborn-tutorial

If needed, install using: pip install -U --user seaborn


Boxplot

Box plots are non-parametric: they display variation in samples of a statistical population
without making any assumptions about the underlying statistical distribution.
import seaborn as sns

sns.boxplot(x="education", y="salary", hue="management", data=salary)

<matplotlib.axes._subplots.AxesSubplot at 0x7f39ed42ff28>


sns.boxplot(x="management", y="salary", hue="education", data=salary)


sns.stripplot(x="management", y="salary", hue="education", data=salary, jitter=True,␣
˓→dodge=True, linewidth=1)# J i t t e r and split options separate datapoints according to␣
˓→group"

<matplotlib.axes._subplots.AxesSubplot at 0x7f39eb61d780>

### Density plot with one figure containing multiple axis


One figure can contain several axis, whose contain the graphic
elements


# Set up the matplotlib figure: 3 x 1 axes
f, axes = plt.subplots(3, 1, figsize=(9, 9), sharex=True)

i = 0
for edu, d in salary.groupby(['education']):
    sns.distplot(d.salary[d.management == "Y"], color="b",
                 bins=10, label="Manager", ax=axes[i])
    sns.distplot(d.salary[d.management == "N"], color="r",
                 bins=10, label="Employee", ax=axes[i])
    axes[i].set_title(edu)
    axes[i].set_ylabel('Density')
    i += 1
ax = plt.legend()


Violin plot (distribution)

ax =sns.violinplot(x="salary", data=salary)

Tune bandwidth

ax = sns.violinplot(x="salary", data=salary, bw=.15)

ax =sns.violinplot(x="management", y="salary", hue="education", data=salary)


Tips dataset

One waiter recorded information about each tip he received over a period of a few months
working in one restaurant. He collected several variables:

import seaborn as sns
#sns.set(style="whitegrid")
tips = sns.load_dataset("tips")
print(tips.head())

ax = sns.violinplot(x=tips["total_bill"])

   total_bill   tip     sex smoker  day    time  size
0       16.99  1.01  Female     No  Sun  Dinner     2
1       10.34  1.66    Male     No  Sun  Dinner     3
2       21.01  3.50    Male     No  Sun  Dinner     3
3       23.68  3.31    Male     No  Sun  Dinner     2
4       24.59  3.61  Female     No  Sun  Dinner     4


Group by day

ax = sns.violinplot(x="day", y="total_bill", data=tips, palette="muted")

Group by day and color by time (lunch vs dinner)

ax = sns.violinplot(x="day", y="total_bill", hue="time", data=tips, palette="muted",
                    split=True)


Pairwise scatter plots

g = sns.PairGrid(salary, hue="management")
g.map_diag(plt.hist)
g.map_offdiag(plt.scatter)
ax = g.add_legend()


3.3.5 Time series

import seaborn as sns
sns.set(style="darkgrid")

# Load an example dataset with long-form data
fmri = sns.load_dataset("fmri")

# Plot the responses for different events and regions
ax = sns.pointplot(x="timepoint", y="signal",
                   hue="region", style="event",
                   data=fmri)
# version 0.9
# sns.lineplot(x="timepoint", y="signal",
#              hue="region", style="event",
#              data=fmri)


CHAPTER

FOUR

STATISTICS

4.1 Univariate statistics

Basic univariate statistics are required to explore a dataset:

• Discover associations between a variable of interest and potential predictors. It is
  strongly recommended to start with simple univariate methods before moving to complex
  multivariate predictors.
• Assess the prediction performances of machine learning predictors.
• Most univariate statistics are based on the linear model, which is one of the main models
  in machine learning.

4.1.1 Estimators of the main statistical measures

Mean

Properties of the expected value operator $E(\cdot)$ of a random variable $X$:

$$E(X + c) = E(X) + c \tag{4.1}$$
$$E(X + Y) = E(X) + E(Y) \tag{4.2}$$
$$E(aX) = a E(X) \tag{4.3}$$

The estimator $\bar{x}$ on a sample of size $n$, $x = x_1, \ldots, x_n$, is given by

$$\bar{x} = \frac{1}{n} \sum_i x_i$$

$\bar{x}$ is itself a random variable with properties:

• $E(\bar{x}) = E(X)$,
• $Var(\bar{x}) = \frac{Var(X)}{n}$.

Variance

$$Var(X) = E((X - E(X))^2) = E(X^2) - (E(X))^2$$

The estimator is

$$\sigma_x^2 = \frac{1}{n - 1} \sum_i (x_i - \bar{x})^2$$

Note here the subtracted 1 degree of freedom (df) in the divisor. In standard statistical
practice, $df = 1$ provides an unbiased estimator of the variance of a hypothetical infinite
population. With $df = 0$ it instead provides a maximum likelihood estimate of the variance
for normally distributed variables.

Standard deviation

$$Std(X) = \sqrt{Var(X)}$$

The estimator is simply $\sigma_x = \sqrt{\sigma_x^2}$.

Covariance

$$Cov(X, Y) = E((X - E(X))(Y - E(Y))) = E(XY) - E(X)E(Y).$$

Properties:

$$Cov(X, X) = Var(X)$$
$$Cov(X, Y) = Cov(Y, X)$$
$$Cov(cX, Y) = c\, Cov(X, Y)$$
$$Cov(X + c, Y) = Cov(X, Y)$$

The estimator with $df = 1$ is

$$\sigma_{xy} = \frac{1}{n - 1} \sum_i (x_i - \bar{x})(y_i - \bar{y}).$$

Correlation

$$Cor(X, Y) = \frac{Cov(X, Y)}{Std(X)\, Std(Y)}$$

The estimator is

$$\rho_{xy} = \frac{\sigma_{xy}}{\sigma_x \sigma_y}.$$

Standard Error (SE)

The standard error (SE) is the standard deviation (of the sampling distribution) of a
statistic:

$$SE(X) = \frac{Std(X)}{\sqrt{n}}.$$

It is most commonly considered for the mean, with the estimator

$$SE(\bar{x}) = Std(\bar{x}) = \sigma_{\bar{x}} \tag{4.4}$$
$$= \frac{\sigma_x}{\sqrt{n}}. \tag{4.5}$$
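A quick numerical illustration of these estimators with NumPy (a sketch; the two samples below
are simulated and anticipate the exercise that follows):

import numpy as np

np.random.seed(42)
x = np.random.normal(loc=1.78, scale=0.1, size=10)
y = np.random.normal(loc=1.66, scale=0.1, size=10)

xbar = np.mean(x)                   # sample mean
xvar = np.var(x, ddof=1)            # unbiased variance (df = 1)
xstd = np.std(x, ddof=1)            # standard deviation
xycov = np.cov(x, y, ddof=1)[0, 1]  # covariance
xycor = np.corrcoef(x, y)[0, 1]     # correlation
se = xstd / np.sqrt(len(x))         # standard error of the mean

print(xbar, xvar, xycov, xycor, se)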

Exercises

• Generate 2 random samples: $x \sim N(1.78, 0.1)$ and $y \sim N(1.66, 0.1)$, both of size 10.
• Compute $\bar{x}$, $\sigma_x$, $\sigma_{xy}$ (xbar, xvar, xycov) using only the np.sum()
  operation. Explore the np. module to find out which numpy functions perform the same
  computations and compare them (using assert) with your previous results.

4.1.2 Main distributions

Normal distribution

The normal distribution, noted $\mathcal{N}(\mu, \sigma)$, has parameters $\mu$, the mean
(location), and $\sigma > 0$, the std-dev. Estimators: $\bar{x}$ and $\sigma_x$.

The normal distribution, noted $\mathcal{N}$, is useful because of the central limit theorem
(CLT), which states that: given certain conditions, the arithmetic mean of a sufficiently
large number of iterates of independent random variables, each with a well-defined expected
value and well-defined variance, will be approximately normally distributed, regardless of
the underlying distribution.

import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import norm
%matplotlib inline

mu = 0                     # mean
variance = 2               # variance
sigma = np.sqrt(variance)  # standard deviation
x = np.linspace(mu - 3 * variance, mu + 3 * variance, 100)
plt.plot(x, norm.pdf(x, mu, sigma))

[<matplotlib.lines.Line2D at 0x7f5cd6d3afd0>]


The Chi-Square distribution

The chi-square or $\chi^2_n$ distribution with $n$ degrees of freedom (df) is the distribution
of the sum of the squares of $n$ independent standard normal random variables
$\mathcal{N}(0, 1)$. Let $X \sim \mathcal{N}(\mu, \sigma^2)$, then
$Z = (X - \mu)/\sigma \sim \mathcal{N}(0, 1)$, and:

• The squared standard normal $Z^2 \sim \chi^2_1$ (one df).
• The sum of squares of $n$ standard normal random variables:
  $\sum_i^n Z_i^2 \sim \chi^2_n$.

The sum of two $\chi^2$ RVs with $p$ and $q$ df is a $\chi^2$ RV with $p + q$ df. This is
useful when summing/subtracting sums of squares.

The $\chi^2$-distribution is used to model errors measured as sums of squares or the
distribution of the sample variance.
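A short empirical check of this property with scipy (a sketch; the degrees of freedom and
sample size below are arbitrary): the sum of squares of n standard normals is compared with
the chi2(n) density.

import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import chi2

n = 5                          # degrees of freedom
z = np.random.randn(10000, n)  # standard normal samples
s = (z ** 2).sum(axis=1)       # sums of n squared normals

x = np.linspace(0, 25, 100)
plt.hist(s, bins=50, density=True, alpha=.5, label="sum of %i squared N(0, 1)" % n)
plt.plot(x, chi2.pdf(x, df=n), 'r-', label="chi2(%i) pdf" % n)
plt.legend()
plt.show()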

The Fisher's F-distribution

The $F$-distribution, $F_{n,p}$, with $n$ and $p$ degrees of freedom, is the ratio of two
independent $\chi^2$ variables. Let $X \sim \chi^2_n$ and $Y \sim \chi^2_p$, then:

$$F_{n,p} = \frac{X/n}{Y/p}$$

The $F$-distribution plays a central role in hypothesis testing, answering the question: are
two variances equal? Is the ratio of two errors significantly large?

import numpy as np
from scipy.stats import f
import matplotlib.pyplot as plt
%matplotlib inline

fvalues =np.linspace(.1, 5, 100)

# pdf(x, df1, df2): Probability density function at x of F.


plt.plot(fvalues, f.pdf(fvalues, 1, 30), 'b-', label="F(1, 30)")
plt.plot(fvalues, f.pdf(fvalues, 5, 30), 'r-', label="F(5, 30)")
plt.legend()

# cdf(x, df1, df2): Cumulative distribution function of F. #


ie.
proba_at_f_inf_3 =f.cdf(3, 1, 30) # P(F(1,30) <3)

# ppf(q, df1, df2): Percent point function (inverse of cdf) at q of F.


f_at_proba_inf_95 =f.ppf(.95, 1, 30) # q such P(F(1,30) <.95)
assert f.cdf(f_at_proba_inf_95, 1, 30) ==.95

# sf(x, df1, df2): Survival function (1 - cdf) at x of F.


proba_at_f_sup_3 =f.sf(3, 1, 30) # P(F(1,30) >3)
assert proba_at_f_inf_3 +proba_at_f_sup_3 ==1

# p-value: P(F(1, 30)) <0.05


low_proba_fvalues =fvalues[fvalues >f_at_proba_inf_95]
plt.fill_between(low_proba_fvalues, 0, f.pdf(low_proba_fvalues, 1, 30),
alpha=.8, label="P < 0.05")
plt.show()


The Student's t-distribution

Let $M \sim \mathcal{N}(0, 1)$ and $V \sim \chi^2_n$. The $t$-distribution, $T_n$, with $n$
degrees of freedom is the ratio:

$$T_n = \frac{M}{\sqrt{V/n}}$$

The distribution of the difference between an estimated parameter and its true (or assumed)
value divided by the standard deviation of the estimated parameter (standard error) follows a
$t$-distribution. Is this parameter different from a given value?
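A small sketch comparing the t pdf with the standard normal (the degrees of freedom values are
chosen arbitrarily to show the heavier tails for small df):

import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import t, norm

x = np.linspace(-5, 5, 200)
plt.plot(x, norm.pdf(x), 'k-', label="N(0, 1)")
for df in [1, 5, 30]:
    plt.plot(x, t.pdf(x, df=df), label="t(df=%i)" % df)
plt.legend()
plt.show()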

4.1.3 Hypothesis Testing

Examples

• Test a proportion: biased coin? 200 heads have been found over 300 flips; is the coin
  biased?
• Test the association between two variables.
  – Example height and sex: in a sample of 25 individuals (15 females, 10 males), is female
    height different from male height?
  – Example age and arterial hypertension: in a sample of 25 individuals, is age correlated
    with arterial hypertension?

Steps

1. Model the data.
2. Fit: estimate the model parameters (frequency, mean, correlation, regression coefficient).
3. Compute a test statistic from the model parameters.
4. Formulate the null hypothesis: what would be the (distribution of the) test statistic if
   the observations were the result of pure chance?


5. Compute the probability ($p$-value) of obtaining a larger value for the test statistic by
   chance (under the null hypothesis).

Flip coin: Simplified example

Biased coin? 2 heads have been found over 3 flips; is the coin biased?

1. Model the data: the number of heads follows a Binomial distribution.
2. Compute model parameters: N=3, P = the frequency of heads over the number of flips: 2/3.
3. Compute a test statistic, same as the frequency.
4. Under the null hypothesis, the distribution of the number of heads is:

   flip 1  flip 2  flip 3    #heads
     -       -       -          0
     H       -       -          1
     -       H       -          1
     -       -       H          1
     H       H       -          2
     H       -       H          2
     -       H       H          2
     H       H       H          3

There are 8 possible configurations; the probabilities of the different values for $k$, the
number of successes, are:

• $P(k = 0) = 1/8$
• $P(k = 1) = 3/8$
• $P(k = 2) = 3/8$
• $P(k = 3) = 1/8$

plt.bar([0, 1, 2, 3], [1/8, 3/8, 3/8, 1/8], width=0.9)
_ = plt.xticks([0, 1, 2, 3], [0, 1, 2, 3])
plt.xlabel("Distribution of the number of heads over 3 flips under the null hypothesis")

Text(0.5, 0, 'Distribution of the number of heads over 3 flips under the null hypothesis')


5. Compute the probability ($p$-value) of observing a value larger than or equal to 2 under
   the null hypothesis. This probability is the $p$-value:

$$P(k \geq 2 | H_0) = P(k = 2) + P(k = 3) = 3/8 + 1/8 = 4/8 = 1/2$$
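This value can be checked numerically with scipy (a quick sketch; the survival function gives
the upper-tail probability directly):

from scipy.stats import binom

# P(k >= 2) over 3 flips of a fair coin: sf(1) = P(k > 1) = P(k >= 2)
pval = binom.sf(1, n=3, p=0.5)
print(pval)  # 0.5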

Flip coin: Real example

Biased coin? 60 heads have been found over 100 flips; is the coin biased?

1. Model the data: the number of heads follows a Binomial distribution.
2. Compute model parameters: N=100, P=60/100.
3. Compute a test statistic, same as the frequency: 60/100.
4. Under the null hypothesis, the distribution of the number of heads ($k$) follows the
   binomial distribution of parameters N=100, P=0.5:

$$Pr(X = k | H_0) = Pr(X = k | N = 100, p = 0.5) = \binom{100}{k}\, 0.5^k (1 - 0.5)^{(100 - k)}.$$

$$Pr(X \geq 60 | H_0) = \sum_{k=60}^{100} \binom{100}{k}\, 0.5^k (1 - 0.5)^{(100 - k)}$$
$$= 1 - \sum_{k=0}^{59} \binom{100}{k}\, 0.5^k (1 - 0.5)^{(100 - k)},$$

i.e. one minus the cumulative distribution function.

Use the tabulated binomial distribution


import scipy.stats
import matplotlib.pyplot as plt

#tobs = 2.39687663116  # assume the t-value
succes = np.linspace(30, 70, 41)
plt.plot(succes, scipy.stats.binom.pmf(succes, 100, 0.5),
         'b-', label="Binomial(100, 0.5)")
upper_succes_tvalues = succes[succes > 60]
plt.fill_between(upper_succes_tvalues, 0,
                 scipy.stats.binom.pmf(upper_succes_tvalues, 100, 0.5),
                 alpha=.8, label="p-value")
_ = plt.legend()

pval = 1 - scipy.stats.binom.cdf(60, 100, 0.5)
print(pval)

0.01760010010885238

Random sampling of the Binomial distribution under the null hypothesis

sccess_h0 = scipy.stats.binom.rvs(100, 0.5, size=10000, random_state=4)
print(sccess_h0)

import seaborn as sns
_ = sns.distplot(sccess_h0, hist=False)

pval_rnd = np.sum(sccess_h0 >= 60) / (len(sccess_h0) + 1)
print("P-value using monte-carlo sampling of the Binomial distribution under H0=", pval_rnd)

[60 52 51 ... 45 51 44]
P-value using monte-carlo sampling of the Binomial distribution under H0= 0.025897410258974102


One sample t-test

The one-sample t-test is used to determine whether a sample comes from a population with a
specific mean. For example, you want to test whether the average height of a population is
1.75 m.

1. Model the data

Assume that height is normally distributed: $X \sim \mathcal{N}(\mu, \sigma)$, i.e.:

$$\text{height}_i = \text{average height over the population} + \text{error}_i \tag{4.6}$$
$$x_i = \bar{x} + \varepsilon_i \tag{4.7}$$

The $\varepsilon_i$ are called the residuals.

2. Fit: estimate the model parameters

$\bar{x}$, $\sigma_x$ are the estimators of $\mu$, $\sigma$.

3. Compute a test statistic

In testing the null hypothesis that the population mean is equal to a specified value
$\mu_0 = 1.75$, one uses the statistic:

$$t = \frac{\text{difference of means}}{\text{std-dev of noise}} \sqrt{n} \tag{4.8}$$
$$= \text{effect size} \cdot \sqrt{n} \tag{4.9}$$
$$= \frac{\bar{x} - \mu_0}{\sigma_x} \sqrt{n} \tag{4.10}$$

Remarks: although the parent population does not need to be normally distributed, the
distribution of the population of sample means, $\bar{x}$, is assumed to be normal. By the
central limit theorem, if the sampling of the parent population is independent then the
sample means will be approximately normal.

4. Compute the probability of the test statistic under the null hypothesis. This requires
the distribution of the t statistic under $H_0$.

Example

Given the following sample, we will test whether its true mean is 1.75.

Warning: when computing the std or the variance, set ddof=1. The default value, ddof=0,
leads to the biased estimator of the variance.

import numpy as np

x = [1.83, 1.83, 1.73, 1.82, 1.83, 1.73, 1.99, 1.85, 1.68, 1.87]

xbar = np.mean(x)      # sample mean
mu0 = 1.75             # hypothesized value
s = np.std(x, ddof=1)  # sample standard deviation
n = len(x)             # sample size

print(xbar)

tobs = (xbar - mu0) / (s / np.sqrt(n))
print(tobs)

1.816
2.3968766311585883

The $p$-value is the probability of observing a value $t$ more extreme than the observed one,
$t_{obs}$, under the null hypothesis $H_0$: $P(t > t_{obs} | H_0)$.

import scipy.stats as stats
import matplotlib.pyplot as plt

#tobs = 2.39687663116  # assume the t-value
tvalues = np.linspace(-10, 10, 100)
plt.plot(tvalues, stats.t.pdf(tvalues, n-1), 'b-', label="T(n-1)")
upper_tval_tvalues = tvalues[tvalues > tobs]
plt.fill_between(upper_tval_tvalues, 0,
                 stats.t.pdf(upper_tval_tvalues, n-1), alpha=.8, label="p-value")
_ = plt.legend()


4.1.4 Testing pairwise associations

Univariate statistical analysis: explore associations between pairs of variables.

• In statistics, a categorical variable or factor is a variable that can take on one of a
  limited, and usually fixed, number of possible values, thus assigning each individual to a
  particular group or "category". The levels are the possible values of the variable.
  Number of levels = 2: binomial; number of levels > 2: multinomial. There is no intrinsic
  ordering to the categories. For example, gender is a categorical variable having two
  categories (male and female) and there is no intrinsic ordering to the categories.
  Other examples: Sex (Female, Male), Hair color (blonde, brown, etc.).
• An ordinal variable is a categorical variable with a clear ordering of the levels. For
  example: drinks per day (none, small, medium, high).
• A continuous or quantitative variable $x \in \mathbb{R}$ is one that can take any value in
  a range of possible values, possibly infinite. E.g.: salary, experience in years, weight.

What statistical test should I use?

See: http://www.ats.ucla.edu/stat/mult_pkg/whatstat/

### Pearson correlation test: test association between two quantitative variables

Test the correlation coefficient of two quantitative variables. The test calculates a Pearson
correlation coefficient and the $p$-value for testing non-correlation.

Let $x$ and $y$ be two quantitative variables, where $n$ samples were observed. The linear
correlation coefficient is defined as:

$$r = \frac{\sum_{i=1}^n (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^n (x_i - \bar{x})^2}\, \sqrt{\sum_{i=1}^n (y_i - \bar{y})^2}}.$$

Under $H_0$, the test statistic $t = \sqrt{n - 2}\, \frac{r}{\sqrt{1 - r^2}}$ follows a
Student distribution with $n - 2$ degrees of freedom.

Fig. 1: Statistical tests

import numpy as np
import scipy.stats as stats
n = 50
x = np.random.normal(size=n)
y =2 * x +np.random.normal(size=n)

# Compute with scipy


cor, pval =stats.pearsonr(x, y)
print(cor, pval)

0.904453622242007 2.189729365511301e-19

Two sample (Student) t-test: compare two means

Fig. 2: Two-sample model

The two-sample t-test (Snedecor and Cochran, 1989) is used to determine if two population
means are equal. There are several variations on this test. If data are paired (e.g. 2
measures, before and after treatment for each individual) use the one-sample t-test of the
difference. The variances of the two samples may be assumed to be equal (a.k.a.
homoscedasticity) or unequal (a.k.a. heteroscedasticity).

1. Model the data

Assume that the two random variables are normally distributed:
$y_1 \sim \mathcal{N}(\mu_1, \sigma_1)$, $y_2 \sim \mathcal{N}(\mu_2, \sigma_2)$.

2. Fit: estimate the model parameters

Estimate means and variances: $\bar{y}_1$, $s^2_{y_1}$, $\bar{y}_2$, $s^2_{y_2}$.

3. t-test

The general principle is

$$t = \frac{\text{difference of means}}{\text{standard dev of error}} \tag{4.11}$$
$$= \frac{\text{difference of means}}{\text{its standard error}} \tag{4.12}$$
$$= \frac{\bar{y}_1 - \bar{y}_2}{\sqrt{\sum \varepsilon^2}} \sqrt{n - 2} \tag{4.13}$$
$$= \frac{\bar{y}_1 - \bar{y}_2}{s_{\bar{y}_1 - \bar{y}_2}} \tag{4.14}$$

Since $y_1$ and $y_2$ are independent:

$$s^2_{\bar{y}_1 - \bar{y}_2} = s^2_{\bar{y}_1} + s^2_{\bar{y}_2} = \frac{s^2_{y_1}}{n_1} + \frac{s^2_{y_2}}{n_2} \tag{4.15-4.16}$$

thus

$$s_{\bar{y}_1 - \bar{y}_2} = \sqrt{\frac{s^2_{y_1}}{n_1} + \frac{s^2_{y_2}}{n_2}} \tag{4.17}$$
Equal or unequal sample sizes, unequal variances (Welch's t-test)

Welch's t-test defines the t statistic as

$$t = \frac{\bar{y}_1 - \bar{y}_2}{\sqrt{\frac{s^2_{y_1}}{n_1} + \frac{s^2_{y_2}}{n_2}}}.$$

To compute the p-value one needs the degrees of freedom associated with this variance
estimate. It is approximated using the Welch-Satterthwaite equation:

$$\nu \approx \frac{\left(\frac{s^2_{y_1}}{n_1} + \frac{s^2_{y_2}}{n_2}\right)^2}{\frac{s^4_{y_1}}{n_1^2 (n_1 - 1)} + \frac{s^4_{y_2}}{n_2^2 (n_2 - 1)}}.$$
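A minimal sketch of Welch's t-test with scipy (the two samples below are simulated only for
illustration; equal_var=False requests the Welch correction):

import numpy as np
import scipy.stats as stats

np.random.seed(0)
y1 = np.random.normal(loc=1.75, scale=0.10, size=10)  # group 1
y2 = np.random.normal(loc=1.70, scale=0.05, size=20)  # group 2: unequal variance and size

# equal_var=False performs Welch's t-test (Welch-Satterthwaite degrees of freedom)
print(stats.ttest_ind(y1, y2, equal_var=False))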

Equal or unequal sample sizes, equal variances

If we assume equal variances (i.e., $s^2_{y_1} = s^2_{y_2} = s^2$), where $s^2$ is an
estimator of the common variance of the two samples:

$$s^2 = \frac{s^2_{y_1} (n_1 - 1) + s^2_{y_2} (n_2 - 1)}{n_1 + n_2 - 2} \tag{4.18}$$
$$= \frac{\sum_i^{n_1} (y_{1i} - \bar{y}_1)^2 + \sum_j^{n_2} (y_{2j} - \bar{y}_2)^2}{(n_1 - 1) + (n_2 - 1)} \tag{4.19}$$

then

$$s_{\bar{y}_1 - \bar{y}_2} = \sqrt{\frac{s^2}{n_1} + \frac{s^2}{n_2}} = s \sqrt{\frac{1}{n_1} + \frac{1}{n_2}}$$

Therefore, the t statistic that is used to test whether the means are different is:

$$t = \frac{\bar{y}_1 - \bar{y}_2}{s \cdot \sqrt{\frac{1}{n_1} + \frac{1}{n_2}}}.$$

Equal sample sizes, equal variances

If we simplify the problem assuming equal samples of size $n_1 = n_2 = n$, we get

$$t = \frac{\bar{y}_1 - \bar{y}_2}{s \sqrt{2}} \cdot \sqrt{n} \tag{4.20}$$
$$\approx \text{effect size} \cdot \sqrt{n} \tag{4.21}$$
$$\approx \frac{\text{difference of means}}{\text{standard deviation of the noise}} \cdot \sqrt{n} \tag{4.22}$$

Example

Given the following two samples, test whether their means are equal using the standard
t-test, assuming equal variance.

import scipy.stats as stats

height = np.array([1.83, 1.83, 1.73, 1.82, 1.83, 1.73, 1.99, 1.85, 1.68, 1.87,
                   1.66, 1.71, 1.73, 1.64, 1.70, 1.60, 1.79, 1.73, 1.62, 1.77])

grp = np.array(["M"] * 10 + ["F"] * 10)

# Compute with scipy
print(stats.ttest_ind(height[grp == "M"], height[grp == "F"], equal_var=True))
Ttest_indResult(statistic=3.5511519888466885, pvalue=0.00228208937112721)

ANOVA F-test (quantitative ~ categorical (>= 2 levels))

Analysis of variance (ANOVA) provides a statistical test of whether or not the means of
several groups are equal, and therefore generalizes the t-test to more than two groups.
ANOVAs are useful for comparing (testing) three or more means (groups or variables) for
statistical significance. It is conceptually similar to multiple two-sample t-tests, but is
less conservative.

Here we will consider one-way ANOVA with one independent variable. Wikipedia:

• Test if any group is on average superior, or inferior, to the others versus the null
  hypothesis that all strategies yield the same mean response.
• Detect any of several possible differences.
• The advantage of the ANOVA F-test is that we do not need to pre-specify which strategies
  are to be compared, and we do not need to adjust for making multiple comparisons.
• The disadvantage of the ANOVA F-test is that if we reject the null hypothesis, we do not
  know which strategies can be said to be significantly different from the others.

1. Model the data

A company has applied three marketing strategies to three samples of customers in order to
increase their business volume. The marketing department is asking whether the strategies led
to different increases of business volume. Let $y_1$, $y_2$ and $y_3$ be the three samples of
business volume increase.

Here we assume that the three populations were sampled from three random variables that are
normally distributed, i.e., $Y_1 \sim N(\mu_1, \sigma_1)$, $Y_2 \sim N(\mu_2, \sigma_2)$ and
$Y_3 \sim N(\mu_3, \sigma_3)$.

2. Fit: estimate the model parameters

Estimate means and variances: $\bar{y}_i$, $\sigma_i$, $\forall i \in \{1, 2, 3\}$.

3. F-test

The formula for the one-way ANOVA F-test statistic is

$$F = \frac{\text{Explained variance}}{\text{Unexplained variance}} \tag{4.23}$$
$$= \frac{\text{Between-group variability}}{\text{Within-group variability}} = \frac{s^2_B}{s^2_W}. \tag{4.24}$$

The "explained variance", or "between-group variability", is

$$s^2_B = \sum_i n_i (\bar{y}_{i\cdot} - \bar{y})^2 / (K - 1),$$

where $\bar{y}_{i\cdot}$ denotes the sample mean in the $i$th group, $n_i$ is the number of
observations in the $i$th group, $\bar{y}$ denotes the overall mean of the data, and $K$
denotes the number of groups.

The "unexplained variance", or "within-group variability", is

$$s^2_W = \sum_{ij} (y_{ij} - \bar{y}_{i\cdot})^2 / (N - K),$$

where $y_{ij}$ is the $j$th observation in the $i$th out of $K$ groups and $N$ is the overall
sample size. This $F$-statistic follows the $F$-distribution with $K - 1$ and $N - K$ degrees
of freedom under the null hypothesis. The statistic will be large if the between-group
variability is large relative to the within-group variability, which is unlikely to happen if
the population means of the groups all have the same value.

Note that when there are only two groups for the one-way ANOVA F-test, $F = t^2$, where $t$
is the Student's $t$ statistic.
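A minimal one-way ANOVA sketch with scipy (the three groups below are simulated, mirroring the
marketing example above; stats.f_oneway computes the F statistic and its p-value):

import numpy as np
import scipy.stats as stats

np.random.seed(1)
# Simulated business volume increases for three marketing strategies
y1 = np.random.normal(loc=10.0, scale=2, size=30)
y2 = np.random.normal(loc=10.5, scale=2, size=30)
y3 = np.random.normal(loc=12.0, scale=2, size=30)

fval, pval = stats.f_oneway(y1, y2, y3)  # one-way ANOVA F-test
print("F = %.3f, p-value = %.5f" % (fval, pval))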


Chi-square, $\chi^2$ (categorical ~ categorical)

Computes the chi-square, $\chi^2$, statistic and $p$-value for the hypothesis test of
independence of frequencies in the observed contingency table (cross-table). The observed
frequencies are tested against an expected contingency table obtained by computing the
expected frequencies based on the marginal sums under the assumption of independence.

Example: 20 participants: 10 exposed to some chemical product and 10 non-exposed
(exposed = 1 or 0). Among the 20 participants, 10 had cancer and 10 did not (cancer = 1 or 0).
$\chi^2$ tests the association between those two variables.

import numpy as np
import pandas as pd
import scipy.stats as stats

# Dataset:
# 20 samples:
# 10 first exposed
exposed = np.array([1] * 10 + [0] * 10)
# 8 first with cancer, 10 without, the last two with.
cancer = np.array([1] * 8 + [0] * 10 + [1] * 2)

crosstab = pd.crosstab(exposed, cancer, rownames=['exposed'], colnames=['cancer'])
print("Observed table:")
print(" ")
print(crosstab)

chi2, pval, dof, expected = stats.chi2_contingency(crosstab)
print("Statistics:")
print(" ")
print("Chi2 = %f, pval = %f" % (chi2, pval))
print("Expected table:")
print(" ")
print(expected)

Observed table:

cancer   0  1
exposed
0        8  2
1        2  8
Statistics:

Chi2 = 5.000000, pval = 0.025347
Expected table:

[[5. 5.]
 [5. 5.]]

Computing expected cross-table

# Compute expected cross-table based on proportion
exposed_marg = crosstab.sum(axis=0)
exposed_freq = exposed_marg / exposed_marg.sum()

cancer_marg = crosstab.sum(axis=1)
cancer_freq = cancer_marg / cancer_marg.sum()

print('Exposed frequency? Yes: %.2f' % exposed_freq[0],
      'No: %.2f' % exposed_freq[1])
print('Cancer frequency? Yes: %.2f' % cancer_freq[0],
      'No: %.2f' % cancer_freq[1])

print('Expected frequencies:')
print(np.outer(exposed_freq, cancer_freq))

print('Expected cross-table (frequencies * N):')
print(np.outer(exposed_freq, cancer_freq) * len(exposed))

Exposed frequency? Yes: 0.50 No: 0.50


Cancer frequency? Yes: 0.50 No: 0.50
Expected frequencies:
[[0.25 0.25]
[0.25 0.25]]
Expected cross-table (frequencies * N):
[[5. 5.]
[5. 5.] ]

4.1.5 Non-parametric tests of pairwise associations

Spearman rank-order correlation (quantitative ~ quantitative)

The Spearman correlation is a non-parametric measure of the monotonicity of the relationship
between two datasets.

When to use it? Observe the data distribution:
• presence of outliers
• the distribution of the residuals is not Gaussian.

Like other correlation coefficients, this one varies between -1 and +1, with 0 implying no
correlation. Correlations of -1 or +1 imply an exact monotonic relationship. Positive
correlations imply that as $x$ increases, so does $y$. Negative correlations imply that as
$x$ increases, $y$ decreases.

import numpy as np
import scipy.stats as stats
import matplotlib.pyplot as plt

x = np.array([44.4, 45.9, 41.9, 53.3, 44.7, 44.1, 50.7, 45.2, 46, 47, 48, 60.1])
y = np.array([2.6, 3.1, 2.5, 5.0, 3.6, 4.0, 5.2, 2.8, 4, 4.1, 4.5, 3.8])

plt.plot(x, y, "bo")

# Non-Parametric Spearman
cor, pval = stats.spearmanr(x, y)
print("Non-Parametric Spearman cor test, cor: %.4f, pval: %.4f" % (cor, pval))

# Parametric Pearson cor test
cor, pval = stats.pearsonr(x, y)
print("Parametric Pearson cor test: cor: %.4f, pval: %.4f" % (cor, pval))

Non-Parametric Spearman cor test, cor: 0.7110, pval: 0.0095


Parametric Pearson cor test: cor: 0.5263, pval: 0.0788

Wilcoxon signed-rank test (quantitative ~ cte)

Source: https://en.wikipedia.org/wiki/Wilcoxon_signed-rank_test

The Wilcoxon signed-rank test is a non-parametric statistical hypothesis test used when
comparing two related samples, matched samples, or repeated measurements on a single sample
to assess whether their population mean ranks differ (i.e. it is a paired difference test).
It is equivalent to a one-sample test of the difference of paired samples.

It can be used as an alternative to the paired Student's t-test, t-test for matched pairs, or
the t-test for dependent samples when the population cannot be assumed to be normally
distributed.

When to use it? Observe the data distribution:
• presence of outliers
• the distribution of the residuals is not Gaussian

It has a lower sensitivity compared to the t-test. It may be problematic to use when the
sample size is small.

Null hypothesis $H_0$: the difference between the pairs follows a symmetric distribution
around zero.

import scipy.stats as stats
n = 20
# Business Volume time 0
bv0 = np.random.normal(loc=3, scale=.1, size=n)
# Business Volume time 1
bv1 = bv0 + 0.1 + np.random.normal(loc=0, scale=.1, size=n)

# create an outlier
bv1[0] -= 10

# Paired t-test
print(stats.ttest_rel(bv0, bv1))

# Wilcoxon
print(stats.wilcoxon(bv0, bv1))

Ttest_relResult(statistic=0.8167367438079456, pvalue=0.4242016933514212)
WilcoxonResult(statistic=40.0, pvalue=0.015240061183200121)

Mann–Whitney U test (quantitative ~ categorical (2 levels))

In statistics, the Mann–Whitney U test (also called the Mann–Whitney–Wilcoxon, Wilcoxon
rank-sum test or Wilcoxon–Mann–Whitney test) is a nonparametric test of the null hypothesis
that two samples come from the same population against an alternative hypothesis, especially
that a particular population tends to have larger values than the other.

It can be applied to unknown distributions, contrary to e.g. a t-test that has to be applied
only to normal distributions, and it is nearly as efficient as the t-test on normal
distributions.

import scipy.stats as stats
n = 20
# Business Volume group 0
bv0 = np.random.normal(loc=1, scale=.1, size=n)

# Business Volume group 1
bv1 = np.random.normal(loc=1.2, scale=.1, size=n)

# create an outlier
bv1[0] -= 10

# Two-samples t-test
print(stats.ttest_ind(bv0, bv1))

# Mann-Whitney U test
print(stats.mannwhitneyu(bv0, bv1))

Ttest_indResult(statistic=0.6227075213159515, pvalue=0.5371960369300763)
MannwhitneyuResult(statistic=43.0, pvalue=1.1512354940556314e-05)

4.1.6 Linear model

Given $n$ random samples $(y_i, x_{1i}, \ldots, x_{pi})$, $i = 1, \ldots, n$, the linear
regression models the relation between the observations $y_i$ and the independent variables
$x_i^p$ as

$$y_i = \beta_0 + \beta_1 x_{1i} + \cdots + \beta_p x_{pi} + \varepsilon_i, \qquad i = 1, \ldots, n$$

Fig. 3: Linear model

• The $\beta$'s are the model parameters, i.e., the regression coefficients.
• $\beta_0$ is the intercept or the bias.
• $\varepsilon_i$ are the residuals.
• An independent variable (IV) is a variable that stands alone and isn't changed by the other
  variables you are trying to measure. For example, someone's age might be an independent
  variable. Other factors (such as what they eat, how much they go to school, how much
  television they watch) aren't going to change a person's age. In fact, when you are looking
  for some kind of relationship between variables you are trying to see if the independent
  variable causes some kind of change in the other variables, or dependent variables. In
  Machine Learning, these variables are also called the predictors.
• A dependent variable. It is something that depends on other factors. For example, a
test score could be a dependent variable because it could change depending on several
factors such as how much you studied, how much sleep you got the night before you
took the test, or even how hungry you were when you took it. Usually when you are
looking for a relationship between two things you are trying to find out what makes
the dependent variable change the way it does. In Machine Learning this variable is
called a target variable.

Simple regression: test association between two quantitative variables

Using the dataset "salary", explore the association between the dependent variable (e.g.
Salary) and the independent variable (e.g. Experience, which is quantitative).
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

url ='https://raw.github.com/neurospin/pystatsml/master/datasets/salary_table.csv'
salary =pd.read_csv(url)

1. Model the data

Model the data on some hypothesis, e.g.: salary is a linear function of the experience.

$$\text{salary}_i = \beta\, \text{experience}_i + \beta_0 + \epsilon_i,$$

more generally

$$y_i = \beta x_i + \beta_0 + \epsilon_i$$

• $\beta$: the slope or coefficient or parameter of the model,
• $\beta_0$: the intercept or bias is the second parameter of the model,
• $\epsilon_i$: the $i$th error, or residual, with $\epsilon \sim \mathcal{N}(0, \sigma^2)$.

The simple regression is equivalent to the Pearson correlation.

2. Fit: estimate the model parameters

The goal is to estimate $\beta$, $\beta_0$ and $\sigma^2$.

Minimize the mean squared error (MSE) or the sum of squared errors (SSE). The so-called
Ordinary Least Squares (OLS) finds $\beta$, $\beta_0$ that minimize the
$SSE = \sum_i \epsilon_i^2$:

$$SSE = \sum_i (y_i - \beta x_i - \beta_0)^2$$

Recall from calculus that an extreme point can be found by computing where the derivative is
zero, i.e. to find the intercept, we perform the steps:

$$\frac{\partial SSE}{\partial \beta_0} = \sum_i (y_i - \beta x_i - \beta_0) = 0$$
$$\sum_i y_i = \beta \sum_i x_i + n \beta_0$$
$$n \bar{y} = n \beta \bar{x} + n \beta_0$$
$$\beta_0 = \bar{y} - \beta \bar{x}$$

To find the regression coefficient, we perform the steps:

$$\frac{\partial SSE}{\partial \beta} = \sum_i x_i (y_i - \beta x_i - \beta_0) = 0$$

Plug in $\beta_0$:

$$\sum_i x_i (y_i - \beta x_i - \bar{y} + \beta \bar{x}) = 0$$
$$\sum_i x_i y_i - \bar{y} \sum_i x_i = \beta \sum_i x_i (x_i - \bar{x})$$

Divide both sides by $n$:

$$\frac{1}{n} \sum_i x_i y_i - \bar{y}\bar{x} = \frac{1}{n} \beta \sum_i x_i (x_i - \bar{x})$$

$$\beta = \frac{\frac{1}{n} \sum_i x_i y_i - \bar{y}\bar{x}}{\frac{1}{n} \sum_i (x_i - \bar{x})^2} = \frac{Cov(x, y)}{Var(x)}.$$
from scipy import stats
import numpy as np
y, x = salary.salary, salary.experience
beta, beta0, r_value, p_value, std_err = stats.linregress(x, y)
print("y = %f x + %f, r: %f, r-squared: %f,\np-value: %f, std_err: %f"
      % (beta, beta0, r_value, r_value**2, p_value, std_err))

print("Regression line with the scatterplot")
yhat = beta * x + beta0  # regression line
plt.plot(x, yhat, 'r-', x, y, 'o')
plt.xlabel('Experience (years)')
plt.ylabel('Salary')
plt.show()

print("Using seaborn")
import seaborn as sns
sns.regplot(x="experience", y="salary", data=salary);
y =491.486913 x +13584.043803, r : 0.538886, r-squared: 0.290398,
p-value: 0.000112, std_err: 115.823381
Regression line with the scatterplot

Using seaborn


3. F-Test

3.1 Goodness of fit

The goodness of fit of a statistical model describes how well it fits a set of observations.
Measures of goodness of fit typically summarize the discrepancy between observed values and
the values expected under the model in question. We will consider the explained variance,
also known as the coefficient of determination, denoted $R^2$, pronounced R-squared.

The total sum of squares, $SS_{\text{tot}}$, is the sum of the sum of squares explained by
the regression, $SS_{\text{reg}}$, plus the sum of squares of residuals unexplained by the
regression, $SS_{\text{res}}$, also called the SSE, i.e. such that

$$SS_{\text{tot}} = SS_{\text{reg}} + SS_{\text{res}}$$

The mean of $y$ is

$$\bar{y} = \frac{1}{n} \sum_i y_i.$$

The total sum of squares is the total squared sum of deviations from the mean of $y$, i.e.

$$SS_{\text{tot}} = \sum_i (y_i - \bar{y})^2$$

The regression sum of squares, also called the explained sum of squares, is

$$SS_{\text{reg}} = \sum_i (\hat{y}_i - \bar{y})^2,$$

where $\hat{y}_i = \beta x_i + \beta_0$ is the estimated value of salary $\hat{y}_i$ given a
value of experience $x_i$.

The sum of squares of the residuals, also called the residual sum of squares (RSS), is

$$SS_{\text{res}} = \sum_i (y_i - \hat{y}_i)^2.$$

$R^2$ is the explained sum of squares of errors. It is the variance explained by the
regression divided by the total variance, i.e.

$$R^2 = \frac{\text{explained SS}}{\text{total SS}} = \frac{SS_{\text{reg}}}{SS_{\text{tot}}} = 1 - \frac{SS_{\text{res}}}{SS_{\text{tot}}}.$$

3.2 Test

Let $\hat{\sigma}^2 = SS_{\text{res}} / (n - 2)$ be an estimator of the variance of
$\epsilon$. The 2 in the denominator stems from the 2 estimated parameters: intercept and
coefficient.

• Unexplained variance: $\frac{SS_{\text{res}}}{\hat{\sigma}^2} \sim \chi^2_{n-2}$
• Explained variance: $\frac{SS_{\text{reg}}}{\hat{\sigma}^2} \sim \chi^2_1$. The single
  degree of freedom comes from the difference between
  $\frac{SS_{\text{tot}}}{\hat{\sigma}^2}$ ($\sim \chi^2_{n-1}$) and
  $\frac{SS_{\text{res}}}{\hat{\sigma}^2}$ ($\sim \chi^2_{n-2}$), i.e.
  $(n - 1) - (n - 2)$ degrees of freedom.

The Fisher statistic is the ratio of the two variances:

$$F = \frac{\text{Explained variance}}{\text{Unexplained variance}} = \frac{SS_{\text{reg}} / 1}{SS_{\text{res}} / (n - 2)} \sim F(1, n - 2)$$

Using the $F$-distribution, compute the probability of observing a value greater than $F$
under $H_0$, i.e. $P(x > F | H_0)$, i.e. the survival function (1 - Cumulative Distribution
Function) at $x$ of the given $F$-distribution.
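A short numerical sketch of these quantities for the simple salary regression (reusing beta,
beta0, x and y from the scipy.stats.linregress fit above):

import numpy as np
from scipy import stats

yhat = beta * x + beta0                   # fitted values
ss_tot = np.sum((y - y.mean()) ** 2)      # total sum of squares
ss_reg = np.sum((yhat - y.mean()) ** 2)   # explained sum of squares
ss_res = np.sum((y - yhat) ** 2)          # residual sum of squares

r2 = ss_reg / ss_tot                      # coefficient of determination
n = len(y)
fval = (ss_reg / 1) / (ss_res / (n - 2))  # F statistic with (1, n-2) df
pval = stats.f.sf(fval, 1, n - 2)         # survival function of F(1, n-2)
print("R2 = %.4f, F = %.3f, p-value = %.6f" % (r2, fval, pval))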

Multiple regression

Theory

Multiple Linear Regression is the most basic supervised learning algorithm.

Given: a set of training data $\{x_1, \ldots, x_N\}$ with corresponding targets
$\{y_1, \ldots, y_N\}$.

In linear regression, we assume that the model that generates the data involves only a linear
combination of the input variables, i.e.

$$y(x_i, \beta) = \beta_0 + \beta_1 x_i^1 + \ldots + \beta_P x_i^P,$$

or, simplified,

$$y(x_i, \beta) = \beta_0 + \sum_{j=1}^{P-1} \beta_j x_i^j.$$

Extending each sample with an intercept, $x_i := [1, x_i] \in \mathbb{R}^{P+1}$, allows us to
use a more general notation based on linear algebra and write it as a simple dot product:

$$y(x_i, \beta) = x_i^T \beta,$$

where $\beta \in \mathbb{R}^{P+1}$ is a vector of weights that defines the $P + 1$ parameters
of the model. From now on we have $P$ regressors + the intercept.

Minimize the Mean Squared Error (MSE) loss:

$$MSE(\beta) = \frac{1}{N} \sum_{i=1}^N (y_i - y(x_i, \beta))^2 = \frac{1}{N} \sum_{i=1}^N (y_i - x_i^T \beta)^2$$

Let $X = [x_1^T, \ldots, x_N^T]$ be an $N \times (P + 1)$ matrix of $N$ samples of $P$ input
features with one column of ones, and let $y = [y_1, \ldots, y_N]$ be a vector of the $N$
targets. Then, using linear algebra, the mean squared error (MSE) loss can be rewritten:

$$MSE(\beta) = \frac{1}{N} ||y - X\beta||_2^2.$$

The $\beta$ that minimises the MSE can be found by:

$$\nabla_\beta \left( \frac{1}{N} ||y - X\beta||_2^2 \right) = 0 \tag{4.25}$$
$$\frac{1}{N} \nabla_\beta (y - X\beta)^T (y - X\beta) = 0 \tag{4.26}$$
$$\frac{1}{N} \nabla_\beta (y^T y - 2\beta^T X^T y + \beta^T X^T X \beta) = 0 \tag{4.27}$$
$$-2 X^T y + 2 X^T X \beta = 0 \tag{4.28}$$
$$X^T X \beta = X^T y \tag{4.29}$$
$$\beta = (X^T X)^{-1} X^T y, \tag{4.30}$$

where $(X^T X)^{-1} X^T$ is a pseudo inverse of $X$.
Fit with numpy

import numpy as np
from scipy import linalg
np.random.seed(seed=42) # make the example reproducible
# Dataset
N, P = 50, 4
X = np.random.normal(size=N * P).reshape((N, P))
## Our model needs an intercept so we add a column of 1s:
X[:, 0] = 1
print(X[:5, :])

betastar = np.array([10, 1., .5, 0.1])
e = np.random.normal(size=N)
y = np.dot(X, betastar) + e

# Estimate the parameters
Xpinv = linalg.pinv2(X)
betahat = np.dot(Xpinv, y)
print("Estimated beta:\n", betahat)

[[ 1.         -0.1382643   0.64768854  1.52302986]
 [ 1.         -0.23413696  1.57921282  0.76743473]
 [ 1.          0.54256004 -0.46341769 -0.46572975]
 [ 1.         -1.91328024 -1.72491783 -0.56228753]
 [ 1.          0.31424733 -0.90802408 -1.4123037 ]]
Estimated beta:
 [10.14742501  0.57938106  0.51654653  0.17862194]

4.1.7 Linear model with statsmodels

Sources:
http://statsmodels.sourceforge.net/devel/examples/

Multiple regression

Interface with statsmodels

import statsmodels.api as sm

## Fit and summary:
model = sm.OLS(y, X).fit()
print(model.summary())

# prediction of new values
ypred = model.predict(X)

# residuals + prediction == true values
assert np.all(ypred + model.resid == y)
OLS Regression Results
==============================================================================
Dep. Variable: y R-squared: 0.363
Model: OLS Adj. R-squared: 0.322
Method: Least Squares F-statistic: 8.748
Date: Wed, 06 Nov 2019 Prob (F-statistic): 0.000106
Time:                        18:03:24   Log-Likelihood:                -71.271
No. Observations: 50 AIC: 150.5
Df Residuals: 46 BIC: 158.2
Df Model: 3
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [0.025 0.975]
const         10.1474      0.150     67.520      0.000       9.845      10.450
x1             0.5794      0.160      3.623      0.001       0.258       0.901
x2             0.5165      0.151      3.425      0.001       0.213       0.820
x3             0.1786      0.144      1.240      0.221      -0.111       0.469
==============================================================================
Omnibus: 2.493 Durbin-Watson: 2.369
Prob(Omnibus): 0.288 Jarque-Bera ( J B ) : 1.544
Skew: 0.330 Prob(JB): 0.462
Kurtosis: 3.554 Cond. No. 1.27
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

Interface with Pandas

Use R language syntax for data.frame. For an additive model: y_i = \beta_0 + x_i^1 \beta_1 + x_i^2 \beta_2 + \epsilon_i ≡ y ~ x1 + x2.

import pandas as pd
import statsmodels.formula.api as smfrmla

df = pd.DataFrame(np.column_stack([X, y]), columns=['inter', 'x1', 'x2', 'x3', 'y'])
print(df.columns, df.shape)
# Build a model excluding the intercept, it is implicit
model = smfrmla.ols("y ~ x1 + x2 + x3", df).fit()
print(model.summary())

Index(['inter', 'x1', 'x2', 'x3', 'y'], dtype='object') (50, 5)


OLS Regression Results
==============================================================================
Dep. Variable: y R-squared: 0.363
Model: OLS Adj. R-squared: 0.322
Method: Least Squares F-statistic: 8.748
Date: Wed, 06 Nov 2019 Prob (F-statistic): 0.000106
Time: 18:03:24 Log-Likelihood: -71.271
No. Observations: 50 AIC: 150.5
Df Residuals: 46 BIC: 158.2
Df Model: 3
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [0.025 0.975]

Intercept 10.1474 0.150 67.520 0.000 9.845 10.450


x1 0.5794 0.160 3.623 0.001 0.258 0.901
x2             0.5165      0.151      3.425      0.001       0.213       0.820
x3 0.1786 0.144 1.240 0.221 -0.111 0.469
==============================================================================
Omnibus: 2.493 Durbin-Watson: 2.369
Prob(Omnibus): 0.288 Jarque-Bera ( J B ) : 1.544
Skew: 0.330 Prob(JB): 0.462
Kurtosis: 3.554 Cond. No. 1.27
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

Multiple regression with categorical independent variables or factors: Analysis of covariance (ANCOVA)

Analysis of covariance (ANCOVA) is a linear model that blends ANOVA and linear
regression. ANCOVA evaluates whether population means of a dependent variable (DV) are
equal across levels of a categorical independent variable (IV) often called a treatment,
while statistically controlling for the effects of other quantitative or continuous variables
that are not of primary interest, known as covariates (CV).

import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

try:
df = pd.read_csv("../datasets/salary_table.csv")
except:
url ='https://raw.github.com/neurospin/pystatsml/master/datasets/salary_table.csv'
df = pd.read_csv(url)

import seaborn as sns


fig, axes = plt.subplots(1, 3)

sns.distplot(df.salary[df.management == "Y"], color="r", bins=10, label="Manager:Y", ax=axes[0])
sns.distplot(df.salary[df.management == "N"], color="b", bins=10, label="Manager:N", ax=axes[0])

sns.regplot("experience", "salary", data=df, ax=axes[1])

sns.regplot("experience", "salary", hue=df.management, data=df, ax=axes[2])

#sns.stripplot("experience", "salary", hue="management", data=df, ax=axes[2])

TypeError                                 Traceback (most recent call last)
<ipython-input-28-c2d69cab90c3> in <module>
      8
      9 sns.regplot("experience", "salary", hue=df.management,
---> 10             data=df, ax=axes[2])
     11
     12 #sns.stripplot("experience", "salary", hue="management", data=df, ax=axes[2])

TypeError: regplot() got an unexpected keyword argument 'hue'

One-way AN(C)OVA

• ANOVA: one categorical independent variable, i.e. one factor.
• ANCOVA: ANOVA with some covariates.

import statsmodels.formula.api as smfrmla

oneway = smfrmla.ols('salary ~ management + experience', df).fit()
print(oneway.summary())
aov = sm.stats.anova_lm(oneway, typ=2)  # Type 2 ANOVA DataFrame
print(aov)

OLS Regression Results


==============================================================================
Dep. Variable: salary R-squared: 0.865
Model: OLS Adj. R-squared: 0.859
Method: Least Squares F-statistic: 138.2
Date: Thu, 07 Nov 2019 Prob (F-statistic): 1.90e-19
Time: 12:16:50 Log-Likelihood: -407.76
No. Observations: 46 AIC: 821.5
Df Residuals: 43 BIC: 827.0
Df Model: 2

Covariance Type:            nonrobust
===================================================================================
                      coef    std err          t      P>|t|      [0.025      0.975]
Intercept        1.021e+04    525.999     19.411      0.000    9149.578    1.13e+04
management[T.Y]  7145.0151    527.320     13.550      0.000    6081.572    8208.458
experience        527.1081     51.106     10.314      0.000     424.042     630.174
==============================================================================
Omnibus:                       11.437   Durbin-Watson:                   2.193
Prob(Omnibus):                  0.003   Jarque-Bera (JB):               11.260
Skew:                          -1.131   Prob(JB):                      0.00359
Kurtosis:                       3.872   Cond. No.                         22.4
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
                  sum_sq    df           F        PR(>F)
management  5.755739e+08   1.0  183.593466  4.054116e-17
experience  3.334992e+08   1.0  106.377768  3.349662e-13
Residual    1.348070e+08  43.0         NaN           NaN
Two-way AN(C)OVA

ANCOVA with two categorical independent variables, i.e. two factors.

import statsmodels.formula.api as smfrmla

twoway = smfrmla.ols('salary ~ education + management + experience', df).fit()
print(twoway.summary())
aov = sm.stats.anova_lm(twoway, typ=2)  # Type 2 ANOVA DataFrame
print(aov)

OLS Regression Results


==============================================================================
Dep. Variable: salary R-squared: 0.957
Model: OLS Adj. R-squared: 0.953
Method: Least Squares F-statistic: 226.8
Date: Thu, 07 Nov 2019 Prob (F-statistic): 2.23e-27
Time: 12:16:52 Log-Likelihood: -381.63
No. Observations: 46 AIC: 773.3
Df Residuals: 41 BIC: 782.4
Df Model: 4
Covariance Type: nonrobust
===================================================================================
                        coef    std err          t      P>|t|      [0.025      0.975]
Intercept           8035.5976    386.689     20.781      0.000    7254.663    8816.532
education[T.Master] 3144.0352    361.968      8.686      0.000    2413.025    3875.045
education[T.Ph.D]   2996.2103    411.753      7.277      0.000    2164.659    3827.762
management[T.Y]     6883.5310    313.919     21.928      0.000    6249.559    7517.503
experience           546.1840     30.519     17.896      0.000     484.549     607.819
==============================================================================
Omnibus:                        2.293   Durbin-Watson:                   2.237
Prob(Omnibus):                  0.318   Jarque-Bera (JB):                1.362
Skew:                          -0.077   Prob(JB):                        0.506
Kurtosis:                       2.171   Cond. No.                         33.5
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
                  sum_sq    df           F        PR(>F)
education   9.152624e+07   2.0   43.351589  7.672450e-11
management  5.075724e+08   1.0  480.825394  2.901444e-24
experience  3.380979e+08   1.0  320.281524  5.546313e-21
Residual    4.328072e+07  41.0         NaN           NaN

Comparing two nested models

oneway is nested within twoway. Comparing two nested models tells us if the additional
predictors (i.e. education) of the full model significantly decrease the residuals. Such a
comparison can be done using an 𝐹-test on residuals:
print(twoway.compare_f_test(oneway)) # return F, pval, df

(43.35158945918107, 7.672449570495418e-11, 2.0)

Factor coding

See http://statsmodels.sourceforge.net/devel/contrasts.html

By default Pandas uses "dummy coding". Explore:
print(twoway.model.data.param_names)
print(twoway.model.data.exog[:10, : ] )

['Intercept', 'education[T.Master]', 'education[T.Ph.D]', 'management[T.Y]', 'experience']


[[1. 0. 0. 1. 1.]
[1. 0. 1. 0. 1.]
[1. 0. 1. 1. 1.]
[1. 1. 0. 0. 1.]
[1. 0. 1. 0. 1.]
[1. 1. 0. 1. 2.]
[1. 1. 0. 0. 2.]
[1. 0. 0. 0. 2.]
[1. 0. 1. 0. 2.]
[1. 1. 0. 0. 3.]]

Contrasts and post-hoc tests

# t-test of the specific contribution of experience:
ttest_exp = twoway.t_test([0, 0, 0, 0, 1])
ttest_exp.pvalue, ttest_exp.tvalue
print(ttest_exp)

# Alternatively, you can specify the hypothesis tests using a string
twoway.t_test('experience')

# Post-hoc is salary of Master different salary of Ph.D?
# ie. t-test salary of Master = salary of Ph.D.
print(twoway.t_test('education[T.Master] = education[T.Ph.D]'))

                             Test for Constraints
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
c0           546.1840     30.519     17.896      0.000     484.549     607.819
==============================================================================
                             Test for Constraints
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
c0           147.8249    387.659      0.381      0.705    -635.069     930.719
==============================================================================
4.1.8 Multiple comparisons

import numpy as np
np.random.seed(seed=42)  # make example reproducible

# Dataset
n_samples, n_features = 100, 1000
n_info = int(n_features / 10)  # number of features with information
n1, n2 = int(n_samples / 2), n_samples - int(n_samples / 2)
snr = .5
Y = np.random.randn(n_samples, n_features)
grp = np.array(["g1"] * n1 + ["g2"] * n2)

# Add some group effect for n_info features
Y[grp == "g1", :n_info] += snr

#
import scipy.stats as stats
import matplotlib.pyplot as plt
tvals, pvals = np.full(n_features, np.NAN), np.full(n_features, np.NAN)
for j in range(n_features):
    tvals[j], pvals[j] = stats.ttest_ind(Y[grp == "g1", j], Y[grp == "g2", j],
                                         equal_var=True)

fig, axis = plt.subplots(3, 1, figsize=(9, 9))

axis[0].plot(range(n_features), tvals, 'o')
axis[0].set_ylabel("t-value")

axis[1].plot(range(n_features), pvals, 'o')
axis[1].axhline(y=0.05, color='red', linewidth=3, label="p-value=0.05")
#axis[1].axhline(y=0.05, label="toto", color='red')
axis[1].set_ylabel("p-value")
axis[1].legend()

axis[2].hist([pvals[n_info:], pvals[:n_info]],
             stacked=True, bins=100, label=["Negatives", "Positives"])
axis[2].set_xlabel("p-value histogram")
axis[2].set_ylabel("density")
axis[2].legend()

plt.tight_layout()

Note that under the null hypothesis the distribution of the p-values is uniform.

Statistical measures:

• True Positive (TP) equivalent to a hit. The test correctly concludes the presence of an effect.
• True Negative (TN). The test correctly concludes the absence of an effect.
• False Positive (FP) equivalent to a false alarm, Type I error. The test improperly concludes
  the presence of an effect. Thresholding at p-value < 0.05 leads to 47 FP.
• False Negative (FN) equivalent to a miss, Type II error. The test improperly concludes the
  absence of an effect.

P, N = n_info, n_features - n_info  # Positives, Negatives
TP = np.sum(pvals[:n_info] < 0.05)  # True Positives
FP = np.sum(pvals[n_info:] < 0.05)  # False Positives
print("No correction, FP: %i (expected: %.2f), TP: %i" % (FP, N * 0.05, TP))
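To make the uniformity claim above concrete, here is a quick check (a sketch, not in the
original) comparing the p-values of the null (no-effect) features to a uniform distribution with
a Kolmogorov-Smirnov test, reusing pvals and n_info from above:

# Sketch: p-values of null features should be approximately Uniform(0, 1)
from scipy import stats

ks_stat, ks_pval = stats.kstest(pvals[n_info:], 'uniform')  # null features only
print("KS test against Uniform(0,1): stat=%.3f, p-value=%.3f" % (ks_stat, ks_pval))
# A large KS p-value means no evidence against uniformity.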


Bonferroni correction for multiple comparisons

The Bonferroni correction is based on the idea that if an experimenter is testing 𝑃 hypotheses,
then one way of maintaining the familywise error rate (FWER) is to test each individual
hypothesis at a statistical significance level of 1/𝑃 times the desired maximum overall level.

So, if the desired significance level for the whole family of tests is 𝛼 (usually 0.05), then the
Bonferroni correction would test each individual hypothesis at a significance level of 𝛼/𝑃. For
example, if a trial is testing 𝑃 = 8 hypotheses with a desired 𝛼 = 0.05, then the Bonferroni
correction would test each individual hypothesis at 𝛼 = 0.05/8 = 0.00625.

import statsmodels.sandbox.stats.multicomp as multicomp

_, pvals_fwer, _, _ = multicomp.multipletests(pvals, alpha=0.05, method='bonferroni')
TP = np.sum(pvals_fwer[:n_info] < 0.05)  # True Positives
FP = np.sum(pvals_fwer[n_info:] < 0.05)  # False Positives
print("FWER correction, FP: %i, TP: %i" % (FP, TP))
The False discovery rate (FDR) correction for multiple comparisons

FDR-controlling procedures are designed to control the expected proportion of rejected null
hypotheses that were incorrect rejections ("false discoveries"). FDR-controlling procedures
provide less stringent control of Type I errors compared to the familywise error rate (FWER)
controlling procedures (such as the Bonferroni correction), which control the probability of at
least one Type I error. Thus, FDR-controlling procedures have greater power, at the cost of
increased rates of Type I errors.

import statsmodels.sandbox.stats.multicomp as multicomp

_, pvals_fdr, _, _ = multicomp.multipletests(pvals, alpha=0.05, method='fdr_bh')
TP = np.sum(pvals_fdr[:n_info] < 0.05)  # True Positives
FP = np.sum(pvals_fdr[n_info:] < 0.05)  # False Positives
print("FDR correction, FP: %i, TP: %i" % (FP, TP))

4.1.9 Exercises

Simple linear regression and correlation (application)

Load the dataset: birthwt Risk Factors Associated with Low Infant Birth Weight at
https://raw. github.com/neurospin/pystatsml/master/datasets/birthwt.csv
1. Test the association of mother's age and birth weight (bwt) using the correlation test and
   linear regression.
2. Test the association of mother's weight (lwt) and birth weight using the correlation test and
   linear regression.
3. Produce two scatter plots of: (i) age by birth weight; (ii) mother's weight by birth weight.
   Conclusion?


Simple linear regression (maths)

Considering the salary and the experience of the salary table
https://raw.github.com/neurospin/pystatsml/master/datasets/salary_table.csv

Compute:
• Estimate the model parameters 𝛽, 𝛽0 using scipy stats.linregress(x, y)
• Compute the predicted values 𝑦ˆ

Compute:
• 𝑦¯: y_mu
• 𝑆𝑆tot: ss_tot
• 𝑆𝑆reg: ss_reg
• 𝑆𝑆res: ss_res
• Check the partition of variance formula based on sums of squares by using
  assert np.allclose(val1, val2, atol=1e-05)
• Compute 𝑅2 and compare it with the r_value above
• Compute the 𝐹 score
• Compute the 𝑝-value:
• Plot the 𝐹(1, 𝑛) distribution for 100 𝑓 values within [10, 25]. Draw 𝑃(𝐹(1, 𝑛) > 𝐹),
  i.e. color the surface defined by the 𝑓 values larger than 𝐹 below the 𝐹(1, 𝑛) density.
• 𝑃(𝐹(1, 𝑛) > 𝐹) is the 𝑝-value, compute it.

Multiple regression

Considering the simulated data used below:

1. What are the dimensions of pinv(𝑋)?
2. Compute the MSE between the predicted values and the true values.

import numpy as np
from scipy import linalg
np.random.seed(seed=42)  # make the example reproducible

# Dataset
N, P = 50, 4
X = np.random.normal(size=N * P).reshape((N, P))
## Our model needs an intercept so we add a column of 1s:
X[:, 0] = 1
print(X[:5, :])

betastar = np.array([10, 1., .5, 0.1])
e = np.random.normal(size=N)
y = np.dot(X, betastar) + e

# Estimate the parameters
Xpinv = linalg.pinv2(X)
betahat = np.dot(Xpinv, y)
print("Estimated beta:\n", betahat)

Two sample t-test (maths)

Given the following two samples, test whether their means are equal.

height = np.array([1.83, 1.83, 1.73, 1.82, 1.83, 1.73, 1.99, 1.85, 1.68, 1.87,
                   1.66, 1.71, 1.73, 1.64, 1.70, 1.60, 1.79, 1.73, 1.62, 1.77])
grp = np.array(["M"] * 10 + ["F"] * 10)

• Compute the means/std-dev per groups.
• Compute the 𝑡-value (standard two sample t-test with equal variances).
• Compute the 𝑝-value.
• The 𝑝-value is one-sided: a two-sided test would test P(T > tval) and P(T < -tval).
  What would the two-sided 𝑝-value be?
• Compare the two-sided 𝑝-value with the one obtained by stats.ttest_ind using
  assert np.allclose(arr1, arr2).

Two sample t-test (application)

Risk Factors Associated with Low Infant Birth Weight: https://raw.github.com/neurospin/


pystatsml/master/datasets/birthwt.csv
1. Explore the data
2. Recode smoke factor
3. Compute the means/std-dev per groups.
4. Plot birth weight by smoking (box plot, violin plot or histogram)
5. Test the effect of smoking on birth weight

Two sample t-test and random permutations

Generate 100 samples following the model:

    𝑦 = 𝑔 + 𝜀

where the noise 𝜀 ∼ 𝑁(1, 1) and 𝑔 ∈ {0, 1} is a group indicator variable with 50 ones and 50
zeros.

• Write a function tstat(y, g) that computes the two samples t-test of y split in two groups
  defined by g.
• Sample the t-statistic distribution under the null hypothesis using random permutations.
• Assess the p-value.

Univariate associations (development)

Write a function univar_stat(df, target, variables) that computes the parametric statistics and
𝑝-values between the target variable (provided as a string) and all variables (provided as a
list of strings) of the pandas DataFrame df. The target is a quantitative variable but variables
may be quantitative or qualitative. The function returns a DataFrame with four columns:
variable, test, value, p_value.

Apply it to the salary dataset available at
https://raw.github.com/neurospin/pystatsml/master/datasets/salary_table.csv, with target being
S: salaries for IT staff in a corporation.

Multiple comparisons

This exercise has 2 goals: apply your knowledge of statistics using vectorized numpy operations.
Given the dataset provided for multiple comparisons, compute the two-sample 𝑡-test (assuming
equal variance) for each (column) feature of the Y array given the two groups defined by the grp
variable. You should return two vectors of size n_features: one for the 𝑡-values and one for the
𝑝-values.

ANOVA

Perform an ANOVA on the dataset described below:

• Compute between and within variances
• Compute the 𝐹-value: fval
• Compare the 𝑝-value with the one obtained by stats.f_oneway using
  assert np.allclose(arr1, arr2)

# dataset
mu_k = np.array([1, 2, 3])    # means of 3 samples
sd_k = np.array([1, 1, 1])    # sd of 3 samples
n_k = np.array([10, 20, 30])  # sizes of 3 samples
grp = [0, 1, 2]               # group labels
n = np.sum(n_k)
label = np.hstack([[k] * n_k[k] for k in [0, 1, 2]])

y = np.zeros(n)
for k in grp:
    y[label == k] = np.random.normal(mu_k[k], sd_k[k], n_k[k])

# Compute with scipy
fval, pval = stats.f_oneway(y[label == 0], y[label == 1], y[label == 2])



4.2 Lab 1: Brain volumes study

The study provides the brain volumes of grey matter (gm), white matter (wm) and cerebrospinal
fluid (csf) of 808 anatomical MRI scans.

4.2.1 Manipulate data

Set the working directory within a directory called "brainvol".

Create 2 subdirectories: data that will contain downloaded data and reports for results of the
analysis.

import os
import os.path
import pandas as pd
import tempfile
import urllib.request

WD = os.path.join(tempfile.gettempdir(), "brainvol")
os.makedirs(WD, exist_ok=True)
#os.chdir(WD)

# use cookiecutter file organization
# https://drivendata.github.io/cookiecutter-data-science/
os.makedirs(os.path.join(WD, "data"), exist_ok=True)
#os.makedirs("reports", exist_ok=True)
Fetch data
• Demographic data demo.csv (columns: participant_id, site, group, age, sex) and tissue
volume data: group is Control or Patient. site is the recruiting site.
• Gray matter volume gm.csv (columns: participant_id, session, gm_vol)
• White matter volume wm.csv (columns: participant_id, session, wm_vol)
• Cerebrospinal Fluid csf.csv (columns: participant_id, session, csf_vol)

base_url = 'https://raw.github.com/neurospin/pystatsml/master/datasets/brain_volumes/%s'
data = dict()
for file in ["demo.csv", "gm.csv", "wm.csv", "csf.csv"]:
    urllib.request.urlretrieve(base_url % file, os.path.join(WD, "data", file))

demo = pd.read_csv(os.path.join(WD, "data", "demo.csv"))
gm = pd.read_csv(os.path.join(WD, "data", "gm.csv"))
wm = pd.read_csv(os.path.join(WD, "data", "wm.csv"))
csf = pd.read_csv(os.path.join(WD, "data", "csf.csv"))

print("tables can be merge using shared columns")


print(gm.head())

Out:

tables can be merge using shared columns


participant_id session gm_vol
0 sub-S1-0002 ses-01 0.672506
1    sub-S1-0002  ses-02  0.678772
2 sub-S1-0002 ses-03 0.665592
3 sub-S1-0004 ses-01 0.890714
4 sub-S1-0004 ses-02 0.881127

Merge tables according to participant_id

brain_vol =pd.merge(pd.merge(pd.merge(demo, gm), wm), csf)


assert brain_vol.shape ==(808, 9)

Drop rows with missing values

brain_vol = brain_vol.dropna()
assert brain_vol.shape ==(766, 9)

Compute Total Intra-cranial volume: tiv_vol = gm_vol + csf_vol + wm_vol.
brain_vol["tiv_vol"] =brain_vol["gm_vol"] +brain_vol["wm_vol"] + brain_vol["csf_vol"]

Compute tissue fractions: gm_f = gm_vol / tiv_vol, wm_f = wm_vol / tiv_vol.
brain_vol["gm_f"] =brain_vol["gm_vol"] / brain_vol["tiv_vol"]
brain_vol["wm_f"] =brain_vol["wm_vol"] / brain_vol["tiv_vol"]

Save in an excel file brain_vol.xlsx

brain_vol.to_excel(os.path.join(WD, "data", "brain_vol.xlsx"),
                   sheet_name='data', index=False)

4.2.2 Descriptive Statistics

Load excel file brain_vol.xlsx

import os
import pandas as pd
import seaborn as sns
import statsmodels.formula.api as smfrmla
import statsmodels.api as sm

brain_vol = pd.read_excel(os.path.join(WD, "data", "brain_vol.xlsx"), sheet_name='data')

Descriptive statistics. Most of the participants have several MRI sessions (column session).
Select only rows from session one "ses-01".

# Round float at 2 decimals when printing
pd.options.display.float_format = '{:,.2f}'.format

brain_vol1 = brain_vol[brain_vol.session == "ses-01"]
# Check that there are no duplicates
assert len(brain_vol1.participant_id.unique()) == len(brain_vol1.participant_id)

Global descriptive statistics of numerical variables.


desc_glob_num =brain_vol1.describe()
print(desc_glob_num)

Out:
          age  gm_vol  wm_vol  csf_vol  tiv_vol    gm_f    wm_f
count  244.00  244.00  244.00   244.00   244.00  244.00  244.00
mean    34.54    0.71    0.44     0.31     1.46    0.49    0.30
std     12.09    0.08    0.07     0.08     0.17    0.04    0.03
min     18.00    0.48    0.05     0.12     0.83    0.37    0.06
25%     25.00    0.66    0.40     0.25     1.34    0.46    0.28
50%     31.00    0.70    0.43     0.30     1.45    0.49    0.30
75%     44.00    0.77    0.48     0.37     1.57    0.52    0.31
max     61.00    1.03    0.62     0.63     2.06    0.60    0.36

Global descriptive statistics of categorical variables.
desc_glob_cat =brain_vol1[["site", "group", "sex"]].describe(include='all')
print(desc_glob_cat)

print("Get count by level")


desc_glob_cat =pd.DataFrame({col: brain_vol1[col].value_counts().to_dict()
for col in [ " s i t e" , "group", "sex"]})
print(desc_glob_cat)

Out:
       site    group  sex
count   244      244  244
unique    7        2    2
top      S7  Patient    M
freq     65      157  155
Get count by level
          site   group     sex
Control    nan   87.00     nan
F          nan     nan   89.00
M          nan     nan  155.00
Patient    nan  157.00     nan
S1       13.00     nan     nan
S3       29.00     nan     nan
S4       15.00     nan     nan
S5       62.00     nan     nan
S6        1.00     nan     nan
S7       65.00     nan     nan
S8       59.00     nan     nan

Remove the single participant from site S6.

brain_vol = brain_vol[brain_vol.site != "S6"]
brain_vol1 = brain_vol[brain_vol.session == "ses-01"]
desc_glob_cat = pd.DataFrame({col: brain_vol1[col].value_counts().to_dict()
                              for col in ["site", "group", "sex"]})
print(desc_glob_cat)
Out:

          site   group     sex
Control    nan   86.00     nan
F          nan     nan   88.00
M          nan     nan  155.00
Patient    nan  157.00     nan
S1       13.00     nan     nan
S3       29.00     nan     nan
S4       15.00     nan     nan
S5       62.00     nan     nan
S7       65.00     nan     nan
S8       59.00     nan     nan

Descriptive statistics of numerical variables per clinical status.

desc_group_num = brain_vol1[["group", 'gm_vol']].groupby("group").describe()
print(desc_group_num)

Out:

        gm_vol
         count mean  std  min  25%  50%  75%  max
group
Control  86.00 0.72 0.09 0.48 0.66 0.71 0.78 1.03
Patient 157.00 0.70 0.08 0.53 0.65 0.70 0.76 0.90

4.2.3 Statistics

Objectives:
1. Site effect of gray matter atrophy
2. Test the association between the age and gray matter atrophy in the control and
patient population independently.
3. Test for differences of atrophy between the patients and the controls
4. Test for interaction between age and clinical status, ie: is the brain atrophy process
in patient population faster than in the control population.
5. The effect of the medication in the patient population.

import statsmodels.api as sm
import statsmodels.formula.api as smfrmla
import scipy.stats
import seaborn as sns

1. Site effect on Grey Matter atrophy

The model is a one-way ANOVA: gm_f ~ site. The ANOVA test has important assumptions that must
be satisfied in order for the associated p-value to be valid.

• The samples are independent.
• Each sample is from a normally distributed population.
• The population standard deviations of the groups are all equal. This property is known as
  homoscedasticity. A quick way to screen these assumptions is sketched below.
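A minimal sketch (not part of the original lab) of how the normality and homoscedasticity
assumptions could be screened with scipy, assuming the brain_vol1 DataFrame defined above:

# Sketch: quick screening of ANOVA assumptions on gm_f across sites
import scipy.stats

groups = [brain_vol1.gm_f[brain_vol1.site == s] for s in brain_vol1.site.unique()]

# Normality within each site (Shapiro-Wilk test)
for s, g in zip(brain_vol1.site.unique(), groups):
    w, p = scipy.stats.shapiro(g)
    print("Site %s: Shapiro-Wilk W=%.3f, p-value=%.3f" % (s, w, p))

# Homoscedasticity across sites (Levene's test)
stat, p = scipy.stats.levene(*groups)
print("Levene test: stat=%.3f, p-value=%.3f" % (stat, p))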


Plot
sns.violinplot("site", "gm_f", data=brain_vol1)

Stats with scipy

fstat, pval = scipy.stats.f_oneway(*[brain_vol1.gm_f[brain_vol1.site == s]
                                     for s in brain_vol1.site.unique()])
print("Oneway Anova gm_f ~ site F=%.2f, p-value=%E" % (fstat, pval))

Out:

Oneway Anova gm_f ~ site F=14.82, p-value=1.188136E-12

Stats with statsmodels

anova = smfrmla.ols("gm_f ~ site", data=brain_vol1).fit()
# print(anova.summary())
print("Site explains %.2f%% of the grey matter fraction variance" %
      (anova.rsquared * 100))

print(sm.stats.anova_lm(anova, typ=2))

Out:

Site explains 23.82% of the grey matter fraction variance


            sum_sq      df      F  PR(>F)
site          0.11    5.00  14.82    0.00
Residual      0.35  237.00    nan     nan
2. Test the association between the age and gray matter atrophy in the control and patient
population independently.
Plot

sns.lmplot("age", "gm_f", hue="group", data=brain_vol1)

brain_vol1_ctl =brain_vol1[brain_vol1.group =="Control"]


brain_vol1_pat =brain_vol1[brain_vol1.group =="Patient"]

Stats with scipy

print("--- In control population ---")
beta, beta0, r_value, p_value, std_err = \
    scipy.stats.linregress(x=brain_vol1_ctl.age, y=brain_vol1_ctl.gm_f)

print("gm_f = %f * age + %f" % (beta, beta0))
print("Corr: %f, r-squared: %f, p-value: %f, std_err: %f"
      % (r_value, r_value**2, p_value, std_err))

print("--- In patient population ---")
beta, beta0, r_value, p_value, std_err = \
    scipy.stats.linregress(x=brain_vol1_pat.age, y=brain_vol1_pat.gm_f)

print("gm_f = %f * age + %f" % (beta, beta0))
print("Corr: %f, r-squared: %f, p-value: %f, std_err: %f"
      % (r_value, r_value**2, p_value, std_err))

print("Decrease seems faster in patient than in control population")

Out:

--- In control population ---
gm_f = -0.001181 * age + 0.529829
Corr: -0.325122, r-squared: 0.105704, p-value: 0.002255, std_err: 0.000375
--- In patient population ---
gm_f = -0.001899 * age + 0.556886
Corr: -0.528765, r-squared: 0.279592, p-value: 0.000000, std_err: 0.000245
Decrease seems faster in patient than in control population

Stats with statsmodels

print("--- In control population ---")
lr = smfrmla.ols("gm_f ~ age", data=brain_vol1_ctl).fit()
print(lr.summary())
print("Age explains %.2f%% of the grey matter fraction variance" %
      (lr.rsquared * 100))

print("--- In patient population ---")
lr = smfrmla.ols("gm_f ~ age", data=brain_vol1_pat).fit()
print(lr.summary())
print("Age explains %.2f%% of the grey matter fraction variance" %
      (lr.rsquared * 100))
Out:
--- In control population ---
                            OLS Regression Results
==============================================================================
Dep. Variable:                   gm_f   R-squared:                       0.106
Model:                            OLS   Adj. R-squared:                  0.095
Method:                 Least Squares   F-statistic:                     9.929
Date:                jeu., 31 oct. 2019 Prob (F-statistic):            0.00226
Time:                        16:09:40   Log-Likelihood:                 159.34
No. Observations:                  86   AIC:                            -314.7
Df Residuals:                      84   BIC:                            -309.8
Df Model:                           1
Covariance Type:            nonrobust
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
Intercept      0.5298      0.013     40.350      0.000       0.504       0.556
age           -0.0012      0.000     -3.151      0.002      -0.002      -0.000
==============================================================================
Omnibus:                        0.946   Durbin-Watson:                   1.628
Prob(Omnibus):                  0.623   Jarque-Bera (JB):                0.782
Skew:                           0.233   Prob(JB):                        0.676
Kurtosis:                       2.962   Cond. No.                         111.
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
Age explains 10.57% of the grey matter fraction variance
--- In patient population ---
                            OLS Regression Results
==============================================================================
Dep. Variable:                   gm_f   R-squared:                       0.280
Model:                            OLS   Adj. R-squared:                  0.275
Method:                 Least Squares   F-statistic:                     60.16
Date:                jeu., 31 oct. 2019 Prob (F-statistic):           1.09e-12
Time:                        16:09:40   Log-Likelihood:                 289.38
No. Observations:                 157   AIC:                            -574.8
Df Residuals:                     155   BIC:                            -568.7
Df Model:                           1
Covariance Type:            nonrobust
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
Intercept      0.5569      0.009     60.817      0.000       0.539       0.575
age           -0.0019      0.000     -7.756      0.000      -0.002      -0.001
==============================================================================
Omnibus:                        2.310   Durbin-Watson:                   1.325
Prob(Omnibus):                  0.315   Jarque-Bera (JB):                1.854
Skew:                           0.230   Prob(JB):                        0.396
Kurtosis:                       3.268   Cond. No.                         111.
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
Age explains 27.96% of the grey matter fraction variance

Before testing for differences of atrophy between the patients and the controls, run preliminary
tests for an age x group effect (patients would be older or younger than controls).

Plot

sns.violinplot("group", "age", data=brain_vol1)


Stats with scipy

print(scipy.stats.ttest_ind(brain_vol1_ctl.age, brain_vol1_pat.age))

Out:
Ttest_indResult(statistic=-1.2155557697674162, pvalue=0.225343592508479)

Stats with statsmodels

print(smfrmla.ols("age ~ group", data=brain_vol1).fit().summary())
print("No significant difference in age between patients and controls")

Out:
OLS Regression Results
==============================================================================
Dep. Variable: age R-squared: 0.006
Model: OLS Adj. R-squared: 0.002
Method: Least Squares F-statistic: 1.478
Date: jeu., 31 oct. 2019 Prob (F-statistic): 0.225
Time: 16:09:40 Log-Likelihood: -949.69
No. Observations: 243 AIC: 1903.
Df Residuals: 241 BIC: 1910.
Df Model: 1
Covariance Type: nonrobust
====================================================================================
                       coef    std err          t      P>|t|      [0.025      0.975]
Intercept           33.2558      1.305     25.484      0.000      30.685      35.826
group[T.Patient]     1.9735      1.624      1.216      0.225      -1.225       5.172
==============================================================================
Omnibus:                       35.711   Durbin-Watson:                   2.096
Prob(Omnibus):                  0.000   Jarque-Bera (JB):               20.726
Skew:                           0.569   Prob(JB):                     3.16e-05
Kurtosis:                       2.133   Cond. No.                         3.12
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
No significant difference in age between patients and controls

Preliminary tests for sex x group (more/less males in patients than in Controls)

crosstab = pd.crosstab(brain_vol1.sex, brain_vol1.group)
print("Observed contingency table")
print(crosstab)

chi2, pval, dof, expected = scipy.stats.chi2_contingency(crosstab)

print("Chi2 = %f, pval = %f" % (chi2, pval))
print("Expected contingency table under the null hypothesis")
print(expected)
print("No significant difference in sex between patients and controls")

Out:

Observed contingency table
group    Control  Patient
sex
F             33       55
M             53      102
Chi2 = 0.143253, pval = 0.705068
Expected contingency table under the null hypothesis
[[ 31.14403292  56.85596708]
 [ 54.85596708 100.14403292]]
No significant difference in sex between patients and controls

3. Test for differences of atrophy between the patients and the controls

print(sm.stats.anova_lm(smfrmla.ols("gm_f ~ group", data=brain_vol1).fit(), typ=2))
print("No significant difference in age between patients and controls")

Out:

            sum_sq      df     F  PR(>F)
group         0.00    1.00  0.01    0.92
Residual      0.46  241.00   nan     nan
No significant difference in age between patients and controls

This model is simplistic; we should adjust for age and site.


print(sm.stats.anova_lm(smfrmla.ols(
"gm_f ~ group +age +site", data=brain_vol1).fit(), typ=2))
print("No significant difference in age between patients and controls")

Out:
            sum_sq      df      F  PR(>F)
group         0.00    1.00   1.82    0.18
site          0.11    5.00  19.79    0.00
age           0.09    1.00  86.86    0.00
Residual      0.25  235.00    nan     nan
No significant difference in age between patients and controls

4. Test for interaction between age and clinical status, ie: is the brain atrophy process in
patient population faster than in the control population.

ancova =smfrmla.ols("gm_f ~ group:age +age +site", data=brain_vol1).fit()


print(sm.stats.anova_lm(ancova, typ=2))

print("= Parameters =")


print(ancova.params)

print("%.3f%% of grey matter loss per year (almost %.1f%% per decade)" %\
(ancova.params.age * 100, ancova.params.age * 100 * 10))

print("grey matter loss in patients is accelerated by %.3f%% per decade" %


(ancova.params['group[T.Patient]:age'] * 100 * 10))

Out:
            sum_sq      df      F  PR(>F)
site          0.11    5.00  20.28    0.00
age           0.10    1.00  89.37    0.00
group:age     0.00    1.00   3.28    0.07
Residual      0.25  235.00    nan     nan
= Parameters =
Intercept               0.52
site[T.S3]              0.01
site[T.S4]              0.03
site[T.S5]              0.01
site[T.S7]              0.06
site[T.S8]              0.02
age                    -0.00
group[T.Patient]:age   -0.00
dtype: float64
-0.148% of grey matter loss per year (almost -1.5% per decade)
grey matter loss in patients is accelerated by -0.232% per decade

Total running time of the script: ( 0 minutes 4.267 seconds)

4.3 Multivariate statistics

Multivariate statistics includes all statistical techniques for analyzing samples made of two or
more variables. The data set (a 𝑁 × 𝑃 matrix X) is a collection of 𝑁 independent samples of
column vectors [x_1, . . . , x_i, . . . , x_N] of length 𝑃:

    X = \begin{bmatrix} x_1^T \\ \vdots \\ x_i^T \\ \vdots \\ x_N^T \end{bmatrix}
      = \begin{bmatrix}
        x_{11} & \cdots & x_{1j} & \cdots & x_{1P} \\
        \vdots &        & \vdots &        & \vdots \\
        x_{i1} & \cdots & x_{ij} & \cdots & x_{iP} \\
        \vdots &        & \vdots &        & \vdots \\
        x_{N1} & \cdots & x_{Nj} & \cdots & x_{NP}
        \end{bmatrix}_{N \times P}.

4.3.1 Linear Algebra

Euclidean norm and distance

The Euclidean norm of a vector a ∈ R^𝑃 is denoted

    \|a\|_2 = \sqrt{\sum_{i=1}^{P} a_i^2}.

The Euclidean distance between two vectors a, b ∈ R^𝑃 is

    \|a - b\|_2 = \sqrt{\sum_{i=1}^{P} (a_i - b_i)^2}.

Dot product and projection

Source: Wikipedia
Algebraic definition
The dot product, denoted "·", of two 𝑃-dimensional vectors a = [𝑎1, 𝑎2, ..., 𝑎𝑃] and
b = [𝑏1, 𝑏2, ..., 𝑏𝑃] is defined as

    a \cdot b = a^T b = \sum_{i=1}^{P} a_i b_i =
    \begin{bmatrix} a_1 & \cdots & a_P \end{bmatrix}
    \begin{bmatrix} b_1 \\ \vdots \\ b_P \end{bmatrix}.

The Euclidean norm of a vector can be computed using the dot product, as

    \|a\|_2 = \sqrt{a \cdot a}.

Geometric definition: projection


In Euclidean space, a Euclidean vector is a geometrical object that possesses both a
magnitude and a direction. A vector can be pictured as an arrow. Its magnitude is its
length, and its direction is the direction that the arrow points. The magnitude of a vector
a is denoted by ‖a‖2 .
The dot product of two Euclidean vectors a and b is defined by

a · b = ‖a‖₂ ‖b‖₂ cos 𝜃,



where 𝜃is the angle between a and b.


In particular, if a and b are orthogonal, then the angle between them is 90° and

a · b = 0.

At the other extreme, if they are codirectional, then the angle between them is 0° and

a · b = ‖a‖2 ‖b‖ 2

This implies that the dot product of a vector a by itself is


a · a = ‖a‖₂².
The scalar projection (or scalar component) of a Euclidean vector a in the direction of a
Eu- clidean vector b is given by

𝑎_𝑏 = ‖a‖₂ cos 𝜃,

where 𝜃is the angle between a and b.


In terms of the geometric definition of the dot product, this can be rewritten
    𝑎_𝑏 = \frac{a \cdot b}{\|b\|_2},

Fig. 5:
Projection.

import numpy as np
np.random.seed(42)

a = np.random.randn(10)
b = np.random.randn(10)

np.dot(a, b)

-4.085788532659924
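To connect this with the projection formula above, a small sketch (not in the original) computes
the scalar projection of a onto b, reusing the vectors a and b defined above:

# Scalar projection of a in the direction of b: a_b = (a . b) / ||b||_2
a_b = np.dot(a, b) / np.linalg.norm(b)
print(a_b)

# Equivalent, using the cosine of the angle between a and b
cos_theta = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
assert np.allclose(a_b, np.linalg.norm(a) * cos_theta)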


4.3.2 Mean vector

The mean (𝑃 × 1) column-vector 𝜇 whose estimator is

    \bar{x} = \frac{1}{N}\sum_{i=1}^{N} x_i
            = \frac{1}{N}\sum_{i=1}^{N}
              \begin{bmatrix} x_{i1} \\ \vdots \\ x_{ij} \\ \vdots \\ x_{iP} \end{bmatrix}
            = \begin{bmatrix} \bar{x}_1 \\ \vdots \\ \bar{x}_j \\ \vdots \\ \bar{x}_P \end{bmatrix}.

4.3.3 Covariance matrix

• The covariance matrix Σ_XX is a symmetric positive semi-definite matrix whose element in the
  𝑗, 𝑘 position is the covariance between the 𝑗th and 𝑘th elements of a random vector, i.e. the
  𝑗th and 𝑘th columns of X.
• The covariance matrix generalizes the notion of covariance to multiple dimensions.
• The covariance matrix describes the shape of the sample distribution around the mean,
  assuming an elliptical distribution:

    \Sigma_{XX} = E[(X - E(X))^T (X - E(X))],

whose estimator S_XX is a 𝑃 × 𝑃 matrix given by

    S_{XX} = \frac{1}{N-1}(X - \mathbf{1}\bar{x}^T)^T (X - \mathbf{1}\bar{x}^T).

If we assume that X is centered, i.e. X is replaced by X − 1x̄^T, then the estimator is

    S_{XX} = \frac{1}{N-1} X^T X =
    \begin{bmatrix}
    s_1    & \cdots & s_{1j} & \cdots & s_{1P} \\
    \vdots &        & \vdots &        & \vdots \\
    s_{1j} & \cdots & s_j    & \cdots & s_{jP} \\
    \vdots &        & \vdots &        & \vdots \\
    s_{1P} & \cdots & s_{jP} & \cdots & s_P
    \end{bmatrix},

where

    s_{jk} = \frac{1}{N-1} x_j^T x_k = \frac{1}{N-1}\sum_{i=1}^{N} x_{ij} x_{ik}

is an estimator of the covariance between the 𝑗th and 𝑘th variables.
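As a small numerical check (a sketch, not from the original), the estimator above matches
numpy's np.cov when applied to centered data:

import numpy as np

np.random.seed(0)
N, P = 1000, 3
X = np.random.multivariate_normal(mean=np.zeros(P),
                                  cov=[[1, .5, 0], [.5, 1, 0], [0, 0, 1]], size=N)

Xc = X - X.mean(axis=0)                          # center the data
S = Xc.T @ Xc / (N - 1)                          # S_XX = 1/(N-1) X^T X on centered X
assert np.allclose(S, np.cov(X, rowvar=False))   # same as numpy's estimator
print(S.round(2))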
## Avoid warnings and force inline plot
%matplotlib inline
import warnings
warnings.filterwarnings("ignore")
##
import numpy as np
import scipy
import matplotlib.pyplot as plt
import seaborn as sns
import pystatsml.plot_utils
import seaborn as sns  # nice color

np.random.seed(42)
colors = sns.color_palette()

n_samples, n_features = 100, 2

mean, Cov, X = [None] * 4, [None] * 4, [None] * 4
mean[0] = np.array([-2.5, 2.5])
Cov[0] = np.array([[1, 0],
                   [0, 1]])

mean[1] = np.array([2.5, 2.5])
Cov[1] = np.array([[1, .5],
                   [.5, 1]])

mean[2] = np.array([-2.5, -2.5])
Cov[2] = np.array([[1, .9],
                   [.9, 1]])

mean[3] = np.array([2.5, -2.5])
Cov[3] = np.array([[1, -.9],
                   [-.9, 1]])

# Generate dataset
for i in range(len(mean)):
    X[i] = np.random.multivariate_normal(mean[i], Cov[i], n_samples)

# Plot
for i in range(len(mean)):
    # Points
    plt.scatter(X[i][:, 0], X[i][:, 1], color=colors[i], label="class %i" % i)
    # Means
    plt.scatter(mean[i][0], mean[i][1], marker="o", s=200, facecolors='w',
                edgecolors=colors[i], linewidth=2)
    # Ellipses representing the covariance matrices
    pystatsml.plot_utils.plot_cov_ellipse(Cov[i], pos=mean[i], facecolor='none',
                                          linewidth=2, edgecolor=colors[i])

plt.axis('equal')
_ = plt.legend(loc='upper left')


4.3.4 Correlation matrix

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

url ='https://python-graph-gallery.com/wp-content/uploads/mtcars.csv'
df = pd.read_csv(url)

#Compute the correlation matrix


corr = df.corr()

# Generate a mask for the upper triangle


mask =np.zeros_like(corr, dtype=np.bool)
mask[np.triu_indices_from(mask)] = True

f, ax = plt.subplots(figsize=(5.5, 4.5))
cmap = sns.color_palette("RdBu_r", 11)
# Draw the heatmap with the mask and correct aspect ratio
_ = sns.heatmap(corr, mask=None, cmap=cmap, vmax=1, center=0,
                square=True, linewidths=.5, cbar_kws={"shrink": .5})


Re-order the correlation matrix using AgglomerativeClustering:

# convert correlation to distances
d = 2 * (1 - np.abs(corr))

from sklearn.cluster import AgglomerativeClustering

clustering = AgglomerativeClustering(n_clusters=3, linkage='single',
                                     affinity="precomputed").fit(d)
lab = 0

clusters = [list(corr.columns[clustering.labels_ == lab]) for lab in set(clustering.labels_)]
print(clusters)

reordered = np.concatenate(clusters)

R = corr.loc[reordered, reordered]

[['mpg', 'cyl', 'disp', 'hp', 'wt', 'qsec', 'vs', 'carb'], ['am', 'gear'], ['drat']]

f, ax = plt.subplots(figsize=(5.5, 4.5))
# Draw the heatmap with the mask and correct aspect ratio
_ = sns.heatmap(R, mask=None, cmap=cmap, vmax=1, center=0,
                square=True, linewidths=.5, cbar_kws={"shrink": .5})


4.3.5 Precision matrix

In statistics, precision is the reciprocal of the variance, and the precision matrix is the
matrix inverse of the covariance matrix.
It is related to partial correlations that measures the degree of association between two
vari- ables, while controlling the effect of other variables.

import numpy as np

Cov = np.array([[1.0, 0.9, 0.9, 0.0, 0.0, 0.0],


[0.9, 1.0, 0.9, 0.0, 0.0, 0.0],
[0.9, 0.9, 1.0, 0.0, 0.0, 0.0],
[0.0, 0.0, 0.0, 1.0, 0.9, 0.0],
[0.0, 0.0, 0.0, 0.9, 1.0, 0.0],
[0.0, 0.0, 0.0, 0.0, 0.0, 1.0]])

print("# Precision matrix:")


Prec =np.linalg.inv(Cov)
print(Prec.round(2))

print("# Partial correlations:")


Pcor = np.zeros(Prec.shape)
Pcor[::] =np.NaN

for i , j in
zip(*np.triu_indices_from(Prec,
1)):
Pcor[i, j ] =- Prec[i, j ] /
np.sqrt(Prec[i, i ] * Prec[j,
j])

print(Pcor.round(2))


# Precision matrix:
[[ 6.79 -3.21 -3.21  0.    0.    0.  ]
 [-3.21  6.79 -3.21  0.    0.    0.  ]
 [-3.21 -3.21  6.79  0.    0.    0.  ]
 [ 0.   -0.   -0.    5.26 -4.74 -0.  ]
 [ 0.    0.    0.   -4.74  5.26  0.  ]
 [ 0.    0.    0.    0.    0.    1.  ]]
# Partial correlations:
[[ nan 0.47 0.47 -0.  -0.  -0. ]
 [ nan  nan 0.47 -0.  -0.  -0. ]
 [ nan  nan  nan -0.  -0.  -0. ]
 [ nan  nan  nan  nan 0.9   0. ]
 [ nan  nan  nan  nan  nan -0. ]
 [ nan  nan  nan  nan  nan nan]]

4.3.6 Mahalanobis distance

• The Mahalanobis distance is a measure of the distance between two points x and 𝜇 where the
  dispersion (i.e. the covariance structure) of the samples is taken into account.
• The dispersion is considered through the covariance matrix.

This is formally expressed as

    D_M(x, \mu) = \sqrt{(x - \mu)^T \Sigma^{-1} (x - \mu)}.

Intuitions

• Distances along the principal directions of dispersion are contracted since they correspond
  to likely dispersion of points.
• Distances orthogonal to the principal directions of dispersion are dilated since they
  correspond to unlikely dispersion of points.

For example

    D_M(\mathbf{1}) = \sqrt{\mathbf{1}^T \Sigma^{-1} \mathbf{1}}.

ones = np.ones(Cov.shape[0])
d_euc = np.sqrt(np.dot(ones, ones))
d_mah = np.sqrt(np.dot(np.dot(ones, Prec), ones))

print("Euclidean norm of ones=%.2f. Mahalanobis norm of ones=%.2f" % (d_euc, d_mah))

Euclidean norm of ones=2.45. Mahalanobis norm of ones=1.77

The first dot product shows that distances along the principal directions of dispersion are
contracted:
print(np.dot(ones, Prec))

[0.35714286 0.35714286 0.35714286 0.52631579 0.52631579 1. ]


import numpy as np
import scipy
import scipy.spatial
import scipy.linalg
import matplotlib.pyplot as plt
import seaborn as sns
import pystatsml.plot_utils
%matplotlib inline
np.random.seed(40)
colors = sns.color_palette()

mean = np.array([0, 0])
Cov = np.array([[1, .8],
                [.8, 1]])
samples = np.random.multivariate_normal(mean, Cov, 100)
x1 = np.array([0, 2])
x2 = np.array([2, 2])

plt.scatter(samples[:, 0], samples[:, 1], color=colors[0])
plt.scatter(mean[0], mean[1], color=colors[0], s=200, label="mean")
plt.scatter(x1[0], x1[1], color=colors[1], s=200, label="x1")
plt.scatter(x2[0], x2[1], color=colors[2], s=200, label="x2")

# plot covariance ellipsis
pystatsml.plot_utils.plot_cov_ellipse(Cov, pos=mean, facecolor='none',
                                      linewidth=2, edgecolor=colors[0])
# Compute distances
d2_m_x1 = scipy.spatial.distance.euclidean(mean, x1)
d2_m_x2 = scipy.spatial.distance.euclidean(mean, x2)

Covi = scipy.linalg.inv(Cov)
dm_m_x1 = scipy.spatial.distance.mahalanobis(mean, x1, Covi)
dm_m_x2 = scipy.spatial.distance.mahalanobis(mean, x2, Covi)

# Plot distances
vm_x1 = (x1 - mean) / d2_m_x1
vm_x2 = (x2 - mean) / d2_m_x2
jitter = .1
plt.plot([mean[0] - jitter, d2_m_x1 * vm_x1[0] - jitter],
         [mean[1], d2_m_x1 * vm_x1[1]], color='k')
plt.plot([mean[0] - jitter, d2_m_x2 * vm_x2[0] - jitter],
         [mean[1], d2_m_x2 * vm_x2[1]], color='k')

plt.plot([mean[0] + jitter, dm_m_x1 * vm_x1[0] + jitter],
         [mean[1], dm_m_x1 * vm_x1[1]], color='r')
plt.plot([mean[0] + jitter, dm_m_x2 * vm_x2[0] + jitter],
         [mean[1], dm_m_x2 * vm_x2[1]], color='r')

plt.legend(loc='lower right')
plt.text(-6.1, 3,
         'Euclidian: d(m, x1) = %.1f < d(m, x2) = %.1f' % (d2_m_x1, d2_m_x2), color='k')
plt.text(-6.1, 3.5,
         'Mahalanobis: d(m, x1) = %.1f > d(m, x2) = %.1f' % (dm_m_x1, dm_m_x2), color='r')

plt.axis('equal')
print('Euclidian d(m, x1) = %.2f < d(m, x2) = %.2f' % (d2_m_x1, d2_m_x2))
print('Mahalanobis d(m, x1) = %.2f > d(m, x2) = %.2f' % (dm_m_x1, dm_m_x2))

Euclidian d(m, x1) =2.00 <d(m, x2) =2.83


Mahalanobis d(m, x1) =3.33 >d(m, x2) = 2.11

If the covariance matrix is the identity matrix, the Mahalanobis distance reduces to the
Euclidean distance. If the covariance matrix is diagonal, then the resulting distance measure is
called a normalized Euclidean distance.

More generally, the Mahalanobis distance is a measure of the distance between a point x and a
distribution 𝒩(x|𝜇, Σ). It is a multi-dimensional generalization of the idea of measuring how
many standard deviations away x is from the mean. This distance is zero if x is at the mean, and
grows as x moves away from the mean: along each principal component axis, it measures the
number of standard deviations from x to the mean of the distribution.
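A tiny check of the first claim (a sketch, not from the original text): with an identity
covariance matrix, the Mahalanobis distance computed with scipy equals the Euclidean distance.

import numpy as np
import scipy.spatial

x = np.array([2., 2.])
mu = np.array([0., 0.])

d_euc = scipy.spatial.distance.euclidean(mu, x)
d_mah = scipy.spatial.distance.mahalanobis(mu, x, np.linalg.inv(np.eye(2)))  # identity covariance
assert np.allclose(d_euc, d_mah)
print(d_euc, d_mah)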

4.3.7 Multivariate normal distribution

The distribution, or probability density function (PDF) (sometimes just density), of a
continuous random variable is a function that describes the relative likelihood for this random
variable to take on a given value.

The multivariate normal distribution, or multivariate Gaussian distribution, of a 𝑃-dimensional
random vector x = [𝑥1, 𝑥2, . . . , 𝑥𝑃]^T is

    p(x \mid \mu, \Sigma) = \frac{1}{(2\pi)^{P/2} |\Sigma|^{1/2}}
                            \exp\left\{-\frac{1}{2}(x - \mu)^T \Sigma^{-1}(x - \mu)\right\}.
import numpy as np
import matplotlib.pyplot as plt
import scipy.stats
from scipy.stats import multivariate_normal
from mpl_toolkits.mplot3d import Axes3D

def multivariate_normal_pdf(X, mean, sigma):
    """Multivariate normal probability density function over X (n_samples x n_features)"""
    P = X.shape[1]
    det = np.linalg.det(sigma)
    norm_const = 1.0 / (((2 * np.pi) ** (P / 2)) * np.sqrt(det))
    X_mu = X - mean
    inv = np.linalg.inv(sigma)
    d2 = np.sum(np.dot(X_mu, inv) * X_mu, axis=1)
    return norm_const * np.exp(-0.5 * d2)

# mean and covariance
mu = np.array([0, 0])
sigma = np.array([[1, -.5],
                  [-.5, 1]])

# x, y grid
x, y = np.mgrid[-3:3:.1, -3:3:.1]
X = np.stack((x.ravel(), y.ravel())).T
norm = multivariate_normal_pdf(X, mu, sigma).reshape(x.shape)

# Do it with scipy
norm_scpy = multivariate_normal(mu, sigma).pdf(np.stack((x, y), axis=2))
assert np.allclose(norm, norm_scpy)

# Plot
fig = plt.figure(figsize=(10, 7))
ax = fig.gca(projection='3d')
surf = ax.plot_surface(x, y, norm, rstride=3,
                       cstride=3, cmap=plt.cm.coolwarm,
                       linewidth=1, antialiased=False)

ax.set_zlim(0, 0.2)
ax.zaxis.set_major_locator(plt.LinearLocator(10))
ax.zaxis.set_major_formatter(plt.FormatStrFormatter('%.02f'))

ax.set_xlabel('X')
ax.set_ylabel('Y')
ax.set_zlabel('p(x)')

plt.title('Bivariate Normal/Gaussian distribution')
fig.colorbar(surf, shrink=0.5, aspect=7, cmap=plt.cm.coolwarm)
plt.show()


4.3.8 Exercises

Dot product and Euclidean norm

Given a = [2, 1]𝑇 and b = [1, 1]𝑇


1. Write a function euclidean(x) that computes the Euclidean norm of vector, x.
2. Compute the Euclidean norm of a.
3. Compute the Euclidean distance of ‖a − b‖2 .
4. Compute the projection of b in the direction of vector a: 𝑏𝑎.
5. Simulate a dataset X of 𝑁 = 100 samples of 2-dimensional vectors.
6. Project all samples in the direction of the vector a.

Covariance matrix and Mahalanobis norm

1. Sample a dataset X of 𝑁 = 100 samples of 2-dimensional vectors from the bivariate normal
   distribution 𝒩(𝜇, Σ) where 𝜇 = [1, 1]^T and Σ = [[1, 0.8], [0.8, 1]].
2. Compute the mean vector x̄ and center X. Compare the estimated mean x̄ to the true mean, 𝜇.
3. Compute the empirical covariance matrix S. Compare the estimated covariance matrix S to the
   true covariance matrix, Σ.


4. Compute S⁻¹ (Sinv), the inverse of the covariance matrix, by using scipy.linalg.inv(S).
5. Write a function mahalanobis(x, xbar, Sinv) that computes the Mahalanobis distance
of a vector x to the mean, x¯.
6. Compute the Mahalanobis and Euclidean distances of each sample x� to the mean x¯.
Store the results in a 100 × 2 dataframe.

4.4 Time Series in python

Two libraries:
• Pandas: https://pandas.pydata.org/pandas-docs/stable/timeseries.html
• statsmodels: http://www.statsmodels.org/devel/tsa.html

4.4.1 Stationarity

A time series (TS) is said to be stationary if its statistical properties such as mean and
variance remain constant over time:

• constant mean
• constant variance
• an autocovariance that does not depend on time.

What makes a TS non-stationary? There are 2 major reasons behind non-stationarity of a TS:

1. Trend – varying mean over time. For example, in this case we saw that on average, the number
   of passengers was growing over time.
2. Seasonality – variations at specific time-frames. E.g. people might have a tendency to buy
   cars in a particular month because of pay increments or festivals.
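A minimal sketch (not part of the original text) of how stationarity is commonly screened in
Python, using rolling statistics and the augmented Dickey-Fuller test from statsmodels; the
series name ts below is a placeholder for any pandas Series:

import pandas as pd
from statsmodels.tsa.stattools import adfuller

def check_stationarity(ts, window=12):
    """Rolling statistics plus augmented Dickey-Fuller test for a pandas Series `ts`."""
    rolling_mean = ts.rolling(window).mean()
    rolling_std = ts.rolling(window).std()
    print(pd.concat([ts, rolling_mean, rolling_std], axis=1,
                    keys=["series", "rolling mean", "rolling std"]).tail())

    adf_stat, pval, *_ = adfuller(ts.dropna())
    print("ADF statistic=%.3f, p-value=%.3f" % (adf_stat, pval))
    # Small p-value: reject the null hypothesis of a unit root (non-stationarity).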

4.4.2 Pandas Time Series Data Structure

A Series is similar to a list or an array in Python. It represents a series of values


(numeric or otherwise) such as a column of data. It provides additional functionality,
methods, and operators, which make it a more powerful version of a list.
import pandas as pd
import numpy as np

# Create a Series from a l i s t


ser =pd.Series([1, 3])
print(ser)

# String as index
prices = {'apple': 4.99,
'banana': 1.99,
'orange': 3.99}
ser = pd.Series(prices)


print(ser)

x =pd.Series(np.arange(1,3), index=[x for x in 'ab'])


print(x)
print(x['b'])

0    1
1    3
dtype: int64
apple     4.99
banana    1.99
orange    3.99
dtype: float64
a    1
b    2
dtype: int64
2
4.4.3 Time Series Analysis of Google Trends

source: https://www.datacamp.com/community/tutorials/time-series-analysis-tutorial
Get Google Trends data of keywords such as ‘diet’ and ‘gym’ and see how they vary over
time while learning about trends and seasonality in time series data.
In the Facebook Live code along session on the 4th of January, we checked out Google trends
data of keywords ‘diet’, ‘gym’ and ‘finance’ to see how they vary over time. We asked
ourselves if there could be more searches for these terms in January when we’re all
trying to turn over a new leaf?
In this tutorial, you’ll go through the code that we put together during the session step by
step. You’re not going to do much mathematics but you are going to do the following:
• Read data
• Recode data
• Exploratory Data Analysis

4.4.4 Read data

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Plot appears on its own windows
%matplotlib inline
# Tools / Preferences / Ipython Console / Graphics / Graphics Backend / Backend: "automatic"
# Interactive Matplotlib Jupyter Notebook
# %matplotlib inline

try:
    url = "https://raw.githubusercontent.com/datacamp/datacamp_facebook_live_ny_resolution/master/datasets/multiTimeline.csv"
    df = pd.read_csv(url, skiprows=2)
except:
    df = pd.read_csv("../datasets/multiTimeline.csv", skiprows=2)

print(df.head())

# Rename columns
df.columns = ['month', 'diet', 'gym', 'finance']

     Month  diet: (Worldwide)  gym: (Worldwide)  finance: (Worldwide)
0  2004-01                100                31                    48
1  2004-02                 75                26                    49
2  2004-03                 67                24                    47
3  2004-04                 70                22                    48
4  2004-05                 72                22                    43

# Describe
print(df.describe())

             diet         gym     finance
count  168.000000  168.000000  168.000000
mean    49.642857   34.690476   47.148810
std      8.033080    8.134316    4.972547
min     34.000000   22.000000   38.000000
25%     44.000000   28.000000   44.000000
50%     48.500000   32.500000   46.000000
75%     53.000000   41.000000   50.000000
max    100.000000   58.000000   73.000000

4.4.5 Recode data

Next, you'll turn the 'month' column into a DateTime data type and make it the index of
the DataFrame.
Note that you do this because you saw in the result of the .info() method that the 'Month'
column was actually of data type object. Now, that generic data type encapsulates everything
from strings to integers, etc. That's not exactly what you want when you want to be looking at
time series data. That's why you'll use .to_datetime() to convert the 'month' column in your
DataFrame to a DateTime.

Be careful! Make sure to include the inplace argument when you're setting the index of the
DataFrame df so that you actually alter the original index and set it to the 'month' column.

df.month = pd.to_datetime(df.month)
df.set_index('month', inplace=True)

print(df.head())

            diet  gym  finance
month
2004-01-01   100   31       48
2004-02-01    75   26       49
2004-03-01    67   24       47
2004-04-01    70   22       48
2004-05-01    72   22       43

4.4.6 Exploratory Data Analysis

You can use a built-in pandas visualization method .plot() to plot your data as 3 line plots
on a single figure (one for each column, namely, ‘diet’, ‘gym’, and ‘finance’).

df.plot()
plt.xlabel('Year');

# change figure parameters
# df.plot(figsize=(20, 10), linewidth=5, fontsize=20)

# Plot single column
# df[['diet']].plot(figsize=(20, 10), linewidth=5, fontsize=20)
# plt.xlabel('Year', fontsize=20);

Note that this data is relative. As you can read on Google trends:
Numbers represent search interest relative to the highest point on the chart for the given
region and time. A value of 100 is the peak popularity for the term. A value of 50 means
that the term is half as popular. Likewise a score of 0 means the term was less than 1%
as popular as the peak.

4.4.7 Resampling, Smoothing, Windowing, Rolling average: Trends

A rolling average takes, for each time point, the average of the points in a window on either side of it; the number of points is specified by the window size.


Remove seasonality by resampling with a pandas Series at year-end frequency ('A').
See: http://pandas.pydata.org/pandas-docs/stable/timeseries.html ('A': year-end frequency).
diet = df['diet']

diet_resamp_yr = diet.resample('A').mean()
diet_roll_yr = diet.rolling(12).mean()

ax = diet.plot(alpha=0.5, style='-')  # store axis (ax) for later plots

diet_resamp_yr.plot(style=':', label='Resample at year frequency', ax=ax)
diet_roll_yr.plot(style='--', label='Rolling average (smooth), window size=12', ax=ax)
ax.legend()

<matplotlib.legend.Legend at 0x7f0db4e0a2b0>

Rolling average (smoothing) with Numpy

x = np.asarray(df[['diet']])
win = 12
win_half = int(win / 2)
# print([((idx-win_half), (idx+win_half)) for idx in np.arange(win_half, len(x))])

diet_smooth = np.array([x[(idx-win_half):(idx+win_half)].mean() for idx in np.arange(win_half, len(x))])
plt.plot(diet_smooth)

[<matplotlib.lines.Line2D at 0x7f0db4cfea90>]


Trends: plot diet and gym

Build a new DataFrame which is the concatenation of the diet and gym smoothed data:

gym = df['gym']

df_avg = pd.concat([diet.rolling(12).mean(), gym.rolling(12).mean()], axis=1)
df_avg.plot()
plt.xlabel('Year')

Text(0.5, 0, 'Year')

Detrending


df_dtrend = df[["diet", "gym"]] - df_avg
df_dtrend.plot()
plt.xlabel('Year')

Text(0.5, 0, 'Year')

4.4.8 First-order differencing: Seasonal Patterns

# diff = original - shifted data
# (exclude first term for some implementation details)
assert np.all((diet.diff() == diet - diet.shift())[1:])

df.diff().plot()
plt.xlabel('Year')

Text(0.5, 0, 'Year')


4.4.9 Periodicity and Correlation

df.plot()
plt.xlabel('Year');
print(df.corr())

             diet       gym   finance
diet     1.000000 -0.100764 -0.034639
gym     -0.100764  1.000000 -0.284279
finance -0.034639 -0.284279  1.000000


Plot correlation matrix


sns.heatmap(df.corr(), cmap="coolwarm")

<matplotlib.axes._subplots.AxesSubplot at 0x7f0db29f3ba8>

‘diet’ and ‘gym’ are negatively correlated! Remember that you have a seasonal and a trend component. From the correlation coefficient, ‘diet’ and ‘gym’ are negatively correlated:
• the trend components are negatively correlated,
• the seasonal components would be positively correlated.
The actual correlation coefficient captures both of those.
Seasonal correlation: correlation of the first-order differences of these time series

df.diff().plot()
plt.xlabel('Year');

print(df.diff().corr())

             diet       gym   finance
diet     1.000000  0.758707  0.373828
gym      0.758707  1.000000  0.301111
finance  0.373828  0.301111  1.000000


Plot correlation matrix

sns.heatmap(df.diff().corr(), cmap="coolwarm")

<matplotlib.axes._subplots.AxesSubplot at 0x7f0db28aeb70>

Decomposing time series in trend, seasonality and residuals

from statsmodels.tsa.seasonal import seasonal_decompose

x = gym

x = x.astype(float)  # force float

decomposition = seasonal_decompose(x)
trend = decomposition.trend
seasonal = decomposition.seasonal
residual = decomposition.resid

plt.subplot(411)
plt.plot(x, label='Original')
plt.legend(loc='best')
plt.subplot(412)
plt.plot(trend, label='Trend')
plt.legend(loc='best')
plt.subplot(413)
plt.plot(seasonal, label='Seasonality')
plt.legend(loc='best')
plt.subplot(414)
plt.plot(residual, label='Residuals')
plt.legend(loc='best')
plt.tight_layout()

4.4.10 Autocorrelation

A time series is periodic if it repeats itself at equally spaced intervals, say, every 12 months. The Autocorrelation Function (ACF) is a measure of the correlation of the TS with a lagged version of itself. For instance at lag 5, the ACF would compare the series at time instants t1 ... t2 with the series at instants t1-5 ... t2-5 (t1-5 and t2 being end points).
Plot

# from pandas.plotting import autocorrelation_plot
from pandas.plotting import autocorrelation_plot
x =df["diet"].astype(float)
autocorrelation_plot(x)

<matplotlib.axes._subplots.AxesSubplot at 0x7f0db25b2dd8>

Compute Autocorrelation Function (ACF)

from statsmodels.tsa.stattools import acf

x_diff = x.diff().dropna()  # first item is NA

lag_acf = acf(x_diff, nlags=36)
plt.plot(lag_acf)
plt.title('Autocorrelation Function')

/home/edouard/anaconda3/lib/python3.7/site-packages/statsmodels/tsa/stattools.py:541:␣
˓→FutureWarning: fft=True will become the default in a future version of statsmodels. To␣

˓→suppress this warning, explicitly set fft=False.


warnings.warn(msg, FutureWarning)

Text(0.5, 1.0, 'Autocorrelation Function')


ACF peaks every 12 months: Time series is correlated with itself shifted by 12 months.

4.4.11 Time Series Forecasting with Python using Autoregressive Moving Average
(ARMA) models

Source:
• https://www.packtpub.com/mapt/book/big_data_and_business_intelligence/9781783553358/7/ch07lvl1sec77/arma-models
• http://en.wikipedia.org/wiki/Autoregressive%E2%80%93moving-average_model
• ARIMA: https://www.analyticsvidhya.com/blog/2016/02/time-series-forecasting-codes-python/
ARMA models are often used to forecast a time series. These models combine autoregressive and moving average models. In moving average models, we assume that a variable is the sum of the mean of the time series and a linear combination of noise components.
The autoregressive and moving average models can have different orders. In general, we can define an ARMA model with p autoregressive terms and q moving average terms as follows:

x_t = \sum_{i=1}^{p} a_i x_{t-i} + \sum_{i=1}^{q} b_i \varepsilon_{t-i} + \varepsilon_t
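To make the recursion concrete, here is a minimal NumPy sketch that simulates an ARMA(1, 1) process directly from the equation above; the coefficients a1 and b1 and the series length are arbitrary illustrative choices, not values estimated from the Google Trends data.

import numpy as np

np.random.seed(42)
n, a1, b1 = 200, 0.7, 0.3          # illustrative ARMA(1, 1) coefficients
eps = np.random.normal(size=n)     # white-noise components
x = np.zeros(n)
for t in range(1, n):
    # x_t = a1 * x_{t-1} + b1 * eps_{t-1} + eps_t
    x[t] = a1 * x[t - 1] + b1 * eps[t - 1] + eps[t]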

Choosing p and q

Plot the partial autocorrelation function for an estimate of p, and likewise use the autocorrelation function for an estimate of q.
Partial Autocorrelation Function (PACF): This measures the correlation between the TS and a lagged version of itself, after eliminating the variations already explained by the intervening comparisons. E.g. at lag 5, it will check the correlation but remove the effects already explained by lags 1 to 4.
from statsmodels.tsa.stattools import acf, pacf

x = df["gym"].astype(float)

x_diff = x.diff().dropna()  # first item is NA

# ACF and PACF plots:
lag_acf = acf(x_diff, nlags=20)
lag_pacf = pacf(x_diff, nlags=20, method='ols')

# Plot ACF:
plt.subplot(121)
plt.plot(lag_acf)
plt.axhline(y=0, linestyle='--', color='gray')
plt.axhline(y=-1.96/np.sqrt(len(x_diff)), linestyle='--', color='gray')
plt.axhline(y=1.96/np.sqrt(len(x_diff)), linestyle='--', color='gray')
plt.title('Autocorrelation Function (q=1)')

# Plot PACF:
plt.subplot(122)
plt.plot(lag_pacf)
plt.axhline(y=0, linestyle='--', color='gray')
plt.axhline(y=-1.96/np.sqrt(len(x_diff)), linestyle='--', color='gray')
plt.axhline(y=1.96/np.sqrt(len(x_diff)), linestyle='--', color='gray')
plt.title('Partial Autocorrelation Function (p=1)')
plt.tight_layout()

In this plot, the two dotted lines on either side of 0 are the confidence intervals. These can be used to determine the p and q values as:
• p: The lag value where the PACF chart crosses the upper confidence interval for the first time; in this case p=1.
• q: The lag value where the ACF chart crosses the upper confidence interval for the first time; in this case q=1.

Fit ARMA model with statsmodels

1. Define the model by calling ARMA() and passing in the p and q parameters.
2. The model is prepared on the training data by calling the fit() function.
3. Predictions can be made by calling the predict() function and specifying the index of the time or times to be predicted.

from statsmodels.tsa.arima_model import ARMA

model = ARMA(x, order=(1, 1)).fit()  # fit model

print(model.summary())
plt.plot(x)
plt.plot(model.predict(), color='red')
plt.title('RSS: %.4f' % sum((model.fittedvalues - x)**2))
/home/edouard/anaconda3/lib/python3.7/site-packages/statsmodels/tsa/base/tsa_model.
˓→py:165: ValueWarning: No frequency information was provided, so inferred frequency MS␣
˓ → will be used.
% freq, ValueWarning)
/home/edouard/anaconda3/lib/python3.7/site-packages/statsmodels/tsa/kalmanf/kalmanfilter.
˓→py:221: RuntimeWarning: divide by zero encountered in true_divide
Z_mat, R_mat, T_mat)

                              ARMA Model Results
==============================================================================
Dep. Variable:                    gym   No. Observations:                  168
Model:                     ARMA(1, 1)   Log Likelihood                -436.852
Method:                       css-mle   S.D. of innovations              3.229
Date:                Tue, 29 Oct 2019   AIC                            881.704
Time:                        11:47:14   BIC                            894.200
Sample:                    01-01-2004   HQIC                           886.776
                         - 12-01-2017
==============================================================================
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
const         36.4316      8.827      4.127      0.000      19.131      53.732
ar.L1.gym      0.9967      0.005    220.566      0.000       0.988       1.006
ma.L1.gym     -0.7494      0.054    -13.931      0.000      -0.855      -0.644
                                    Roots
==============================================================================
                  Real          Imaginary           Modulus         Frequency
------------------------------------------------------------------------------
AR.1            1.0033           +0.0000j            1.0033            0.0000
MA.1            1.3344           +0.0000j            1.3344            0.0000
------------------------------------------------------------------------------

Text(0.5, 1.0, 'RSS: 1794.4661')



CHAPTER

FIVE

MACHINE LEARNING

5.1 Dimension reduction and feature extraction

5.1.1 Introduction

In machine learning and statistics, dimensionality reduction or dimension reduction is the process of reducing the number of features under consideration, and can be divided into feature selection (not addressed here) and feature extraction.
Feature extraction starts from an initial set of measured data and builds derived values (features) intended to be informative and non-redundant, facilitating the subsequent learning and generalization steps, and in some cases leading to better human interpretations. Feature extraction is related to dimensionality reduction.
The input matrix X, of dimension N × P, is

X = \begin{bmatrix} x_{11} & \dots & x_{1P} \\ \vdots & & \vdots \\ x_{N1} & \dots & x_{NP} \end{bmatrix}

where the rows represent the samples and the columns represent the variables.
The goal is to learn a transformation that extracts a few relevant features. This is generally done by exploiting the covariance Σ_XX between the input features.

5.1.2 Singular value decomposition and matrix factorization

Matrix factorization principles

Decompose the data matrix X_{N×P} into a product of a mixing matrix U_{N×K} and a dictionary matrix V_{P×K}:

X = U V^T.

If we consider only a subset of components, K < rank(X) < min(P, N − 1), X is approximated by a matrix X̂:

X ≈ X̂ = U V^T.


Each row x_i of X is a linear combination (with mixing coefficients u_i) of the dictionary items V.
The N P-dimensional data points then lie in a space whose dimension is less than N − 1 (2 dots lie on a line, 3 on a plane, etc.).

Fig. 1: Matrix factorization

Singular value decomposition (SVD) principles

Singular-value decomposition (SVD) factorises the data matrix X_{N×P} into a product:

X = U D V^T,

where

\begin{bmatrix} x_{11} & & x_{1P} \\ & \mathbf{X} & \\ x_{N1} & & x_{NP} \end{bmatrix} =
\begin{bmatrix} u_{11} & & u_{1K} \\ & \mathbf{U} & \\ u_{N1} & & u_{NK} \end{bmatrix}
\begin{bmatrix} d_1 & & 0 \\ & \mathbf{D} & \\ 0 & & d_K \end{bmatrix}
\begin{bmatrix} v_{11} & & v_{1P} \\ & \mathbf{V}^T & \\ v_{K1} & & v_{KP} \end{bmatrix}
V: right-singular vectors
• V = [v_1, · · · , v_K] is a P × K orthogonal matrix.
• It is a dictionary of patterns to be combined (according to the mixing coefficients) to reconstruct the original samples.
• V performs the initial rotations (projection) along the K = min(N, P) principal component directions, also called loadings.
• Each v_j performs the linear combination of the variables that has maximum sample variance, subject to being uncorrelated with the previous v_{j−1}.
D: singular values
• D is a K × K diagonal matrix made of the singular values of X with d_1 ≥ d_2 ≥ · · · ≥ d_K ≥ 0.
• D scales the projection along the coordinate axes by d_1, d_2, · · · , d_K.


• Singular values are the square roots of the eigenvalues of X^T X.
U: left-singular vectors
• U = [u_1, · · · , u_K] is an N × K orthogonal matrix.
• Each row u_i provides the mixing coefficients of the dictionary items to reconstruct the sample x_i.
• It may be understood as the coordinates on the new orthogonal basis (obtained after the initial rotation) called principal components in the PCA.

SVD for variables transformation

V transforms correlated variables (X) into a set of uncorrelated ones (UD) that better
expose the various relationships among the original data items.
X = U D V^T,        (5.1)
X V = U D V^T V,    (5.2)
X V = U D I,        (5.3)
X V = U D           (5.4)

At the same time, SVD is a method for identifying and ordering the dimensions along
which data points exhibit the most variation.

import numpy as np
import scipy
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

np.random.seed(42)

# dataset
n_samples = 100
experience = np.random.normal(size=n_samples)
salary = 1500 + experience + np.random.normal(size=n_samples, scale=.5)
X = np.column_stack([experience, salary])

# PCA using SVD
X -= X.mean(axis=0)  # Centering is required
U, s, Vh = scipy.linalg.svd(X, full_matrices=False)
# U : Unitary matrix having left singular vectors as columns.
#     Of shape (n_samples, n_samples) or (n_samples, n_comps), depending on
#     full_matrices.
#
# s : The singular values, sorted in non-increasing order. Of shape (n_comps,),
#     with n_comps = min(n_samples, n_features).
#
# Vh: Unitary matrix having right singular vectors as rows.
#     Of shape (n_features, n_features) or (n_comps, n_features) depending
#     on full_matrices.
plt.figure(figsize=(9, 3))

plt.subplot(131)
plt.scatter(U[:, 0], U[:, 1], s=50)
plt.axis('equal')
plt.title("U: Rotated and scaled data")

plt.subplot(132)

# Project data
PC = np.dot(X, Vh.T)
plt.scatter(PC[:, 0], PC[:, 1], s=50)
plt.axis('equal')
plt.title("XV: Rotated data")
plt.xlabel("PC1")
plt.ylabel("PC2")

plt.subplot(133)
plt.scatter(X[:, 0], X[:, 1], s=50)
for i in range(Vh.shape[0]):
    plt.arrow(x=0, y=0, dx=Vh[i, 0], dy=Vh[i, 1], head_width=0.2,
              head_length=0.2, linewidth=2, fc='r', ec='r')
    plt.text(Vh[i, 0], Vh[i, 1], 'v%i' % (i+1), color="r", fontsize=15,
             horizontalalignment='right', verticalalignment='top')
plt.axis('equal')
plt.ylim(-4, 4)

plt.title("X: original data (v1, v2: PC dir.)")
plt.xlabel("experience")
plt.ylabel("salary")

plt.tight_layout()

5.1.3 Principal components analysis (PCA)

Sources:
• C. M. Bishop Pattern Recognition and Machine Learning, Springer, 2006
• Everything you did and didn’t know about PCA
• Principal Component Analysis in 3 Simple Steps


Principles

• Principal components analysis is the main method used for linear dimension reduction.
• The idea of principal component analysis is to find the K principal component directions (called the loadings) V_{K×P} that capture the variation in the data as much as possible.
• It converts a set of N P-dimensional observations N_{N×P} of possibly correlated variables into a set of N K-dimensional samples C_{N×K}, where K < P. The new variables are linearly uncorrelated. The columns of C_{N×K} are called the principal components.
• The dimension reduction is obtained by using only K < P components that exploit the correlation (covariance) among the original variables.
• PCA is mathematically defined as an orthogonal linear transformation V_{K×P} that transforms the data to a new coordinate system such that the greatest variance by some projection of the data comes to lie on the first coordinate (called the first principal component), the second greatest variance on the second coordinate, and so on.

C_{N×K} = X_{N×P} V_{P×K}

• PCA can be thought of as fitting a P-dimensional ellipsoid to the data, where each axis of the ellipsoid represents a principal component. If some axis of the ellipse is small, then the variance along that axis is also small, and by omitting that axis and its corresponding principal component from our representation of the dataset, we lose only a commensurately small amount of information.
• Finding the K largest axes of the ellipse will permit us to project the data onto a space of dimensionality K < P while maximizing the variance of the projected data.

Dataset preprocessing

Centering

Consider a data matrix, X , with column-wise zero empirical mean (the sample mean of
each column has been shifted to zero), ie. X is replaced by X − 1x¯𝑇.

Standardizing

Optionally, standardize the columns, i.e., scale them by their standard deviation. Without standardization, a variable with a high variance will capture most of the effect of the PCA: the principal direction will be aligned with this variable. Standardization will, however, raise noise variables to the same level as informative variables.
The covariance matrix of centered standardized data is the correlation matrix.

Eigendecomposition of the data covariance matrix

To begin with, consider the projection onto a one-dimensional space (𝐾 = 1). We can define
the direction of this space using a 𝑃 -dimensional vector v, which for convenience (and
without loss of generality) we shall choose to be a unit vector so that ‖v‖ 2 = 1 (note
that we are only


interested in the direction defined by v, not in the magnitude of v itself). PCA consists of two main steps:
Projection in the directions that capture the greatest variance
Each P-dimensional data point x_i is then projected onto v, where the coordinate (in the coordinate system of v) is a scalar value, namely x_i^T v. I.e., we want to find the vector v that maximizes these coordinates along v, which we will see corresponds to maximizing the variance of the projected data. This is equivalently expressed as

v = \arg\max_{\|v\|=1} \frac{1}{N} \sum_i (x_i^T v)^2.

We can write this in matrix form as

v = \arg\max_{\|v\|=1} \frac{1}{N} \|Xv\|^2 = \frac{1}{N} v^T X^T X v = v^T S_{XX} v,

where S_XX is a biased estimate of the covariance matrix of the data, i.e.

S_{XX} = \frac{1}{N} X^T X.

We now maximize the projected variance v^T S_XX v with respect to v. Clearly, this has to be a constrained maximization to prevent ‖v‖_2 → ∞. The appropriate constraint comes from the normalization condition ‖v‖_2 ≡ ‖v‖_2^2 = v^T v = 1. To enforce this constraint, we introduce a Lagrange multiplier that we shall denote by λ, and then make an unconstrained maximization of

v^T S_{XX} v − λ(v^T v − 1).

By setting the gradient with respect to v equal to zero, we see that this quantity has a
stationary point when

SXX v = 𝜆v.

We note that v is an eigenvector of SXX .


If we left-multiply the above equation by v𝑇 and make use of v𝑇 v = 1, we see that the
variance is given by

v𝑇SXX v = 𝜆,

and so the variance will be at a maximum when v is equal to the eigenvector


corresponding to the largest eigenvalue, 𝜆. This eigenvector is known as the first
principal component.
We can define additional principal components in an incremental fashion by choosing each
new direction to be that which maximizes the projected variance amongst all possible
directions that are orthogonal to those already considered. If we consider the general case
of a 𝐾-dimensional projection space, the optimal linear projection for which the variance of
the projected data is maximized is now defined by the 𝐾 eigenvectors, v 1 , . . . , v K , of the
data covariance matrix SXX that corresponds to the 𝐾 largest eigenvalues, 𝜆1 ≥ 𝜆2 ≥ · · · ≥
𝜆𝐾.


Back to SVD

The sample covariance matrix of centered data X is given by

S_{XX} = \frac{1}{N-1} X^T X.

We rewrite X^T X using the SVD decomposition of X as

X^T X = (U D V^T)^T (U D V^T)
      = V D^T U^T U D V^T
      = V D^2 V^T

V^T X^T X V = D^2

\frac{1}{N-1} V^T X^T X V = \frac{1}{N-1} D^2

V^T S_{XX} V = \frac{1}{N-1} D^2.

Considering only the k-th right-singular vector v_k associated with the singular value d_k,

v_k^T S_{XX} v_k = \frac{1}{N-1} d_k^2.
It turns out that if you have done the singular value decomposition then you already have the eigenvalue decomposition of X^T X, where:
• The eigenvectors of S_XX are equivalent to the right-singular vectors, V, of X.
• The eigenvalues, λ_k, of S_XX, i.e. the variances of the components, are equal to 1/(N−1) times the squared singular values, d_k^2.
Moreover, computing PCA with the SVD does not require forming the matrix X^T X, so computing the SVD is now the standard way to calculate a principal components analysis from a data matrix, unless only a handful of components are required.
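As a quick numerical sanity check of this equivalence, the following sketch (on arbitrary simulated data, not the salary example above) compares the eigendecomposition of S_XX with the SVD of the centered data; the eigenvalues should equal d_k^2/(N−1) and the eigenvectors should match the right-singular vectors up to sign.

import numpy as np

np.random.seed(42)
X = np.random.randn(100, 5)
X -= X.mean(axis=0)                      # center the data
N = X.shape[0]

U, d, Vh = np.linalg.svd(X, full_matrices=False)
S = X.T @ X / (N - 1)                    # sample covariance matrix
eigval, eigvec = np.linalg.eigh(S)       # eigh returns eigenvalues in ascending order
eigval, eigvec = eigval[::-1], eigvec[:, ::-1]

print(np.allclose(eigval, d ** 2 / (N - 1)))       # eigenvalues = d^2 / (N - 1)
print(np.allclose(np.abs(eigvec), np.abs(Vh.T)))   # eigenvectors = right-singular vectors (up to sign)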

PCA outputs

The SVD or the eigendecomposition of the data covariance matrix provides three main quantities:
1. Principal component directions or loadings are the eigenvectors of X𝑇 X. The V 𝐾 × 𝑃
or the right-singular vectors of an SVD of X are called principal component directions
of
X. They are generally computed using the SVD of X.
2. Principal components is the 𝑁 × 𝐾 matrix C which is obtained by projecting X onto
the principal components directions, i.e.

C 𝑁 ×𝐾 = X 𝑁 ×𝑃 V𝑃 × 𝐾 .

Since X = UDV𝑇 and V is orthogonal (V𝑇 V = I):


C_{N×K} = X_{N×P} V_{P×K}     (5.5)
        = U D V^T V_{P×K}     (5.6)
        = U D I_{K×K}         (5.7)
C_{N×K} = U D                 (5.8)

Thus c_j = X v_j = u_j d_j, for j = 1, . . . , K. Hence u_j is simply the projection of the row vectors of X, i.e., the input predictor vectors, on the direction v_j, scaled by d_j.

c_1 = \begin{bmatrix} x_{1,1} v_{1,1} + \dots + x_{1,P} v_{1,P} \\ x_{2,1} v_{1,1} + \dots + x_{2,P} v_{1,P} \\ \vdots \\ x_{N,1} v_{1,1} + \dots + x_{N,P} v_{1,P} \end{bmatrix}

3. The variance of each component is given by the eigenvalues λ_k, k = 1, . . . , K. It can be obtained from the singular values:

var(c_k) = \frac{1}{N-1} (X v_k)^2     (5.9)
         = \frac{1}{N-1} (u_k d_k)^2   (5.10)
         = \frac{1}{N-1} d_k^2         (5.11)

Determining the number of PCs

We must choose K* ∈ [1, . . . , K], the number of required components. This can be done by calculating the explained variance ratio of the K* first components and by choosing K* such that the cumulative explained variance ratio is greater than some given threshold (e.g., ≈ 90%). This is expressed as

cumulative explained variance(c_k) = \frac{\sum_j^{K^*} var(c_j)}{\sum_j^{K} var(c_j)}.
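A minimal sketch of this selection rule with scikit-learn, using the iris data purely as an illustration and an assumed 90% threshold:

import numpy as np
from sklearn import datasets
from sklearn.decomposition import PCA

X = datasets.load_iris().data
pca = PCA().fit(X)                                   # keep all components
cumvar = np.cumsum(pca.explained_variance_ratio_)    # cumulative explained variance ratio
K_star = int(np.argmax(cumvar >= 0.90)) + 1          # smallest K* reaching the threshold
print(cumvar, K_star)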

Interpretation and visualization

PCs
Plot the samples projected onto the first principal components, e.g. PC1 against PC2.
PC directions
Exploring the loadings associated with a component provides the contribution of each original variable in the component.
Remark: The loadings (PC directions) are the coefficients of the multiple regression of the PC on the original variables:


c = X v                    (5.12)
X^T c = X^T X v            (5.13)
(X^T X)^{-1} X^T c = v     (5.14)

Another way to evaluate the contribution of the original variables in each PC can be obtained by computing the correlation between the PCs and the original variables, i.e. the columns of X, denoted x_j, for j = 1, . . . , P. For the k-th PC, compute and plot the correlations with all original variables

cor(c_k, x_j), j = 1 . . . P, k = 1 . . . K.

These quantities are sometimes called the correlation loadings.
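As a sketch (assuming a centered data matrix X of shape N × P and its principal components C = XV, of shape N × K, computed as above), the correlation loadings can be obtained with a couple of nested loops:

import numpy as np

def correlation_loadings(X, C):
    """Correlation between each principal component (column of C)
    and each original variable (column of X)."""
    P, K = X.shape[1], C.shape[1]
    corr = np.zeros((K, P))
    for k in range(K):
        for j in range(P):
            corr[k, j] = np.corrcoef(C[:, k], X[:, j])[0, 1]
    return corr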


import numpy as np
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt

np.random.seed(42)

# dataset
n_samples = 100
experience = np.random.normal(size=n_samples)
salary = 1500 + experience + np.random.normal(size=n_samples, scale=.5)
X = np.column_stack([experience, salary])

# PCA with scikit-learn
pca = PCA(n_components=2)
pca.fit(X)
print(pca.explained_variance_ratio_)

PC = pca.transform(X)

plt.subplot(121)
plt.scatter(X[:, 0], X[:, 1])
plt.xlabel("x1")
plt.ylabel("x2")

plt.subplot(122)
plt.scatter(PC[:, 0], PC[:, 1])
plt.xlabel("PC1 (var=%.2f)" % pca.explained_variance_ratio_[0])
plt.ylabel("PC2 (var=%.2f)" % pca.explained_variance_ratio_[1])
plt.axis('equal')
plt.tight_layout()

[0.93646607 0.06353393]


5.1.4 Multi-dimensional Scaling (MDS)

Resources:
• http://www.stat.pitt.edu/sungkyu/course/2221Fall13/lec8_mds_combined.pdf
• https://en.wikipedia.org/wiki/Multidimensional_scaling
• Hastie, Tibshirani and Friedman (2009). The Elements of Statistical Learning: Data
Mining, Inference, and Prediction. New York: Springer, Second Edition.
The purpose of MDS is to find a low-dimensional projection of the data in which the pairwise distances between data points are preserved, as closely as possible (in a least-squares sense).
• Let D be the (N × N) pairwise distance matrix where d_ij is a distance between points i and j.
• The MDS concept can be extended to a wide variety of data types specified in terms of a similarity matrix.
Given the dissimilarity (distance) matrix D_{N×N} = [d_ij], MDS attempts to find K-dimensional projections of the N points x_1, . . . , x_N ∈ R^K, concatenated in an X_{N×K} matrix, so that d_ij ≈ ‖x_i − x_j‖ are as close as possible. This can be obtained by the minimization of a loss function called the stress function

stress(X) = \sum_{i \neq j} (d_{ij} − \|x_i − x_j\|)^2.

This loss function is known as least-squares or Kruskal-Shepard scaling. A modification of least-squares scaling is the Sammon mapping

stress_{Sammon}(X) = \sum_{i \neq j} \frac{(d_{ij} − \|x_i − x_j\|)^2}{d_{ij}}.


The Sammon mapping performs better at preserving small distances compared to the least-squares scaling.
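For concreteness, here is a small NumPy sketch of both stress functions written as raw, unnormalized sums (some presentations normalize them or take a square root); D is assumed to be a precomputed N × N distance matrix and X a candidate N × K embedding:

import numpy as np

def kruskal_stress(D, X):
    # pairwise Euclidean distances of the embedding
    Dhat = np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1))
    return ((D - Dhat) ** 2).sum()

def sammon_stress(D, X):
    Dhat = np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1))
    off = ~np.eye(D.shape[0], dtype=bool)      # exclude the zero diagonal
    return ((D[off] - Dhat[off]) ** 2 / D[off]).sum()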

Classical multidimensional scaling


Also known as principal coordinates analysis, PCoA.
• The distance matrix, D, is transformed to a similarity matrix, B, often using
centered inner products.
• The loss function becomes

stress_{classical}(X) = \sum_{i \neq j} (b_{ij} − \langle x_i, x_j \rangle)^2.

• The stress function in classical MDS is sometimes called strain.


• The solution for the classical MDS problems can be found from the eigenvectors of
the similarity matrix.
• If the distances in D are Euclidean and double centered inner products are used, the
results are equivalent to PCA.

Example

The eurodist dataset provides the road distances (in kilometers) between 21 cities in Europe. Given this matrix of pairwise (non-Euclidean) distances D = [d_ij], MDS can be used to recover the coordinates of the cities in some Euclidean referential whose orientation is arbitrary.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# Pairwise distance between European cities
try:
    url = '../datasets/eurodist.csv'
    df = pd.read_csv(url)
except:
    url = 'https://raw.github.com/neurospin/pystatsml/master/datasets/eurodist.csv'
    df = pd.read_csv(url)

print(df.iloc[:5, :5])

city = df["city"]
D = np.array(df.iloc[:, 1:])  # Distance matrix

# Arbitrary choice of K=2 components
from sklearn.manifold import MDS
mds = MDS(dissimilarity='precomputed', n_components=2, random_state=40, max_iter=3000, eps=1e-9)
X = mds.fit_transform(D)

        city  Athens  Barcelona  Brussels  Calais
0     Athens       0       3313      2963    3175
1  Barcelona    3313          0      1318    1326
2   Brussels    2963       1318         0     204
3     Calais    3175       1326       204       0
4  Cherbourg    3339       1294       583     460

Recover coordinates of the cities in a Euclidean referential whose orientation is arbitrary:

from sklearn import metrics
Deuclidean = metrics.pairwise.pairwise_distances(X, metric='euclidean')
print(np.round(Deuclidean[:5, :5]))

[[ 0. 3116. 2994. 3181. 3428.]


[3116. 0. 1317. 1289. 1128.]
[2994. 1317. 0. 198. 538.]
[3181. 1289. 198. 0. 358.]
[3428. 1128. 538. 358. 0.]]

Plot the results:

# Plot: apply some rotation and flip
theta = 80 * np.pi / 180.
rot = np.array([[np.cos(theta), -np.sin(theta)],
                [np.sin(theta), np.cos(theta)]])
Xr = np.dot(X, rot)
# flip x
Xr[:, 0] *= -1
plt.scatter(Xr[:, 0], Xr[:, 1])

for i in range(len(city)):
    plt.text(Xr[i, 0], Xr[i, 1], city[i])
plt.axis('equal')

(-1894.1017744377398, 2914.3652937179477, -1712.9885463201906, 2145.4522453884565)


Determining the number of components

We must choose K* ∈ {1, . . . , K}, the number of required components, by plotting the values of the stress function obtained using k ≤ N − 1 components. In general, start with 1, . . . , K ≤ 4, and choose K* where you can clearly distinguish an elbow in the stress curve.
Thus, in the plot below, we choose to retain the information accounted for by the first two components, since this is where the elbow is in the stress curve.
k_range =range(1, min(5, D.shape[0]-1))
stress =[MDS(dissimilarity='precomputed', n_components=k,
random_state=42, max_iter=300, eps=1e-9).fit(D).stress_ for k in k_range]

print(stress)
plt.plot(k_range, stress)
plt.xlabel("k")
plt.ylabel("stress")

[48644495.28571428, 3356497.365752386, 2858455.495887962, 2756310.637628011]

Text(0, 0.5, 'stress')


5.1.5 Nonlinear dimensionality reduction

Sources:
• Scikit-learn documentation
• Wikipedia

Nonlinear dimensionality reduction, or manifold learning, covers unsupervised methods that attempt to identify low-dimensional manifolds within the original P-dimensional space that represent high data density. These methods then provide a mapping from the high-dimensional space to the low-dimensional embedding.

Isomap

Isomap is a nonlinear dimensionality reduction method that combines a procedure to compute the distance matrix with MDS. The distance calculation is based on geodesic distances evaluated on a neighborhood graph:
1. Determine the neighbors of each point: all points within some fixed radius, or the K nearest neighbors.
2. Construct a neighborhood graph: each point is connected to another if it is a K nearest neighbor, with edge length equal to the Euclidean distance.
3. Compute the shortest path between pairs of points d_ij to build the distance matrix D.
4. Apply MDS on D.

import matplotlib.pyplot as plt


from mpl_toolkits.mplot3d import Axes3D
from sklearn import manifold, datasets
X, color =datasets.samples_generator.make_s_curve(1000, random_state=42)

fig =plt.figure(figsize=(10, 5))


plt.suptitle("Isomap Manifold Learning", fontsize=14)

ax =fig.add_subplot(121, projection='3d')
ax.scatter(X[:, 0], X[ :, 1], X[ :, 2], c=color, cmap=plt.cm.Spectral)
ax.view_init(4, -72)
plt.title('2D "S shape" manifold in 3D')

Y =manifold.Isomap(n_neighbors=10, n_components=2).fit_transform(X)
ax = fig.add_subplot(122)
plt.scatter(Y[:, 0], Y[: , 1], c=color, cmap=plt.cm.Spectral)
plt.title("Isomap")
plt.xlabel("First component")
plt.ylabel("Second component")
plt.axis('tight')

(-5.276311544714793, 5.4164373180970316, -1.23497771017066, 1.2910940054965336)

5.1.6 Exercises

PCA

Write a basic PCA class

Write a class BasicPCA with two


methods:


• fit(X) that estimates the data mean, the principal components directions V and the explained variance of each component.
• transform(X) that projects the data onto the principal components.
Check that your BasicPCA gives results similar to those from sklearn.

Apply your Basic PCA on the iris dataset

The data set is available at: https://raw.github.com/neurospin/pystatsml/master/datasets/iris.csv
• Describe the data set. Should the dataset be standardized?
• Describe the structure of correlations among variables.
• Compute a PCA with the maximum number of components.
• Compute the cumulative explained variance ratio. Determine the number of components K from your computed values.
• Print the K principal components directions and the correlations of the K principal components with the original variables. Interpret the contribution of the original variables into the PC.
• Plot the samples projected into the K first PCs.
• Color samples by their species.

MDS

Apply MDS from sklearn on the iris dataset available at: https://raw.github.com/neurospin/pystatsml/master/datasets/iris.csv
• Center and scale the dataset.
• Compute the Euclidean pairwise distances matrix.
• Select the number of components.
• Show that classical MDS on the Euclidean pairwise distances matrix is equivalent to PCA.

5.2 Clustering

Wikipedia: Cluster analysis or clustering is the task of grouping a set of objects in such a way that objects in the same group (called a cluster) are more similar (in some sense or another) to each other than to those in other groups (clusters). Clustering is one of the main tasks of exploratory data mining, and a common technique for statistical data analysis, used in many fields, including machine learning, pattern recognition, image analysis, information retrieval, and bioinformatics.
Sources: http://scikit-learn.org/stable/modules/clustering.html


5.2.1 K-means clustering

Source: C. M. Bishop Pattern Recognition and Machine Learning, Springer, 2006


Suppose we have a data set X = {x_1, · · · , x_N} that consists of N observations of a random D-dimensional Euclidean variable x. Our goal is to partition the data set into some number, K, of clusters, where we shall suppose for the moment that the value of K is given. Intuitively, we might think of a cluster as comprising a group of data points whose inter-point distances are small compared to the distances to points outside of the cluster. We can formalize this notion by first introducing a set of D-dimensional vectors μ_k, where k = 1, . . . , K, in which μ_k is a prototype associated with the k-th cluster. As we shall see shortly, we can think of the μ_k as representing the centres of the clusters. Our goal is then to find an assignment of data points to clusters, as well as a set of vectors {μ_k}, such that the sum of the squares of the distances of each data point to its closest prototype vector μ_k is at a minimum.
It is convenient at this point to define some notation to describe the assignment of data points to clusters. For each data point x_i, we introduce a corresponding set of binary indicator variables r_ik ∈ {0, 1}, where k = 1, . . . , K, that describes which of the K clusters the data point x_i is assigned to, so that if data point x_i is assigned to cluster k then r_ik = 1, and r_ij = 0 for j ≠ k. This is known as the 1-of-K coding scheme. We can then define an objective function, denoted inertia, as

J(r, μ) = \sum_i^N \sum_k^K r_{ik} \|x_i − μ_k\|_2^2

which represents the sum of the squares of the Euclidean distances of each data point to its assigned vector μ_k. Our goal is to find values for the {r_ik} and the {μ_k} so as to minimize the function J. We can do this through an iterative procedure in which each iteration involves two successive steps corresponding to successive optimizations with respect to the r_ik and the μ_k. First we choose some initial values for the μ_k. Then in the first phase we minimize J with respect to the r_ik, keeping the μ_k fixed. In the second phase we minimize J with respect to the μ_k, keeping the r_ik fixed. This two-stage optimization process is then repeated until convergence. We shall see that these two stages of updating r_ik and μ_k correspond respectively to the expectation (E) and maximization (M) steps of the expectation-maximisation (EM) algorithm, and to emphasize this we shall use the terms E step and M step in the context of the K-means algorithm.
Consider first the determination of the r_ik. Because J is a linear function of r_ik, this optimization can be performed easily to give a closed form solution. The terms involving different i are independent and so we can optimize for each i separately by choosing r_ik to be 1 for whichever value of k gives the minimum value of ‖x_i − μ_k‖^2. In other words, we simply assign the i-th data point to the closest cluster centre. More formally, this can be expressed as

r_{ik} = \begin{cases} 1, & \text{if } k = \arg\min_j \|x_i − μ_j\|^2 \\ 0, & \text{otherwise.} \end{cases}     (5.15)

Now consider the optimization of the μ_k with the r_ik held fixed. The objective function J is a quadratic function of μ_k, and it can be minimized by setting its derivative with respect to μ_k to zero, giving

2 \sum_i r_{ik} (x_i − μ_k) = 0


which we can easily solve for μ_k to give

μ_k = \frac{\sum_i r_{ik} x_i}{\sum_i r_{ik}}.

The denominator in this expression is equal to the number of points assigned to cluster k, and so this result has a simple interpretation, namely set μ_k equal to the mean of all of the data points x_i assigned to cluster k. For this reason, the procedure is known as the K-means algorithm.
The two phases of re-assigning data points to clusters and re-computing the cluster means are repeated in turn until there is no further change in the assignments (or until some maximum number of iterations is exceeded). Because each phase reduces the value of the objective function J, convergence of the algorithm is assured. However, it may converge to a local rather than global minimum of J.

from sklearn import cluster, datasets
import matplotlib.pyplot as plt
import seaborn as sns  # nice color
%matplotlib inline

iris = datasets.load_iris()
X = iris.data[:, :2]  # use only 'sepal length' and 'sepal width'
y_iris = iris.target

km2 = cluster.KMeans(n_clusters=2).fit(X)
km3 = cluster.KMeans(n_clusters=3).fit(X)
km4 = cluster.KMeans(n_clusters=4).fit(X)

plt.figure(figsize=(9, 3))
plt.subplot(131)
plt.scatter(X[:, 0], X[:, 1], c=km2.labels_)
plt.title("K=2, J=%.2f" % km2.inertia_)

plt.subplot(132)
plt.scatter(X[:, 0], X[:, 1], c=km3.labels_)
plt.title("K=3, J=%.2f" % km3.inertia_)

plt.subplot(133)
plt.scatter(X[:, 0], X[:, 1], c=km4.labels_)
plt.title("K=4, J=%.2f" % km4.inertia_)

Text(0.5, 1.0, 'K=4, J=27.97')


Exercises

1. Analyse clusters

• Analyse the plot above visually. What would a good value of 𝐾 be?
• If you instead consider the inertia, the value of 𝐽 , what would a good value of 𝐾
be?
• Explain why there is such difference.
• For 𝐾 = 2 why did 𝐾-means clustering not find the two “natural” clusters? See the
assumptions of 𝐾-means: http://scikit-learn.org/stable/auto_examples/cluster/plot_
kmeans_assumptions.html#example-cluster-plot-kmeans-assumptions-py

2. Re-implement the 𝐾-means clustering algorithm (homework)

Write a function kmeans(X, K) that returns an integer vector of the samples' labels.
5.2.2 Gaussian mixture models

The Gaussian mixture model (GMM) is a simple linear superposition of Gaussian


components over the data, aimed at providing a rich class of density models. We turn to a
formulation of Gaussian mixtures in terms of discrete latent variables: the 𝐾 hidden
classes to be discovered.
Differences compared to 𝐾-means:
• Whereas the 𝐾-means algorithm performs a hard assignment of data points to
clusters, in which each data point is associated uniquely with one cluster, the GMM
algorithm makes a soft assignment based on posterior probabilities.
• Whereas the classic K-means is only based on Euclidean distances, the classic GMM uses Mahalanobis distances that can deal with non-spherical distributions. It should be noted that the Mahalanobis distance could also be plugged into an improved version of K-means clustering. The Mahalanobis distance is unitless and scale-invariant, and takes into account the correlations of the data set.
The Gaussian mixture distribution can be written as a linear superposition of K Gaussians in the form:

p(x) = \sum_{k=1}^{K} \mathcal{N}(x \mid \mu_k, \Sigma_k)\, p(k),

where:
• The p(k) are the mixing coefficients, also known as the class probability of class k, and they sum to one: \sum_{k=1}^{K} p(k) = 1.
• \mathcal{N}(x \mid \mu_k, \Sigma_k) = p(x \mid k) is the conditional distribution of x given a particular class k. It is the multivariate Gaussian distribution defined over a P-dimensional vector x of continuous variables.
The goal is to maximize the log-likelihood of the GMM:

\ln \prod_{i=1}^{N} p(x_i) = \ln \prod_{i=1}^{N} \left\{ \sum_{k=1}^{K} \mathcal{N}(x_i \mid \mu_k, \Sigma_k)\, p(k) \right\} = \sum_{i=1}^{N} \ln \left\{ \sum_{k=1}^{K} \mathcal{N}(x_i \mid \mu_k, \Sigma_k)\, p(k) \right\}.

To compute the class parameters p(k), μ_k, Σ_k we sum over all samples, weighting each sample i by its responsibility, or contribution, to class k: p(k | x_i), such that for each point its contributions to all classes sum to one: \sum_k p(k \mid x_i) = 1. This contribution is the conditional probability of class k given x: p(k | x) (sometimes called the posterior). It can be computed using Bayes' rule:

p(k \mid x) = \frac{p(x \mid k)\, p(k)}{p(x)}                                                                      (5.16)
            = \frac{\mathcal{N}(x \mid \mu_k, \Sigma_k)\, p(k)}{\sum_{k=1}^{K} \mathcal{N}(x \mid \mu_k, \Sigma_k)\, p(k)}     (5.17)

Since the class parameters, p(k), μ_k and Σ_k, depend on the responsibilities p(k | x), and the responsibilities depend on the class parameters, we need a two-step iterative algorithm: the expectation-maximization (EM) algorithm. We discuss this algorithm next.
The expectation-maximization (EM) algorithm for Gaussian mixtures

Given a Gaussian mixture model, the goal is to maximize the likelihood function with respect to the parameters (comprised of the means and covariances of the components and the mixing coefficients).
Initialize the means μ_k, covariances Σ_k and mixing coefficients p(k).
1. E step. For each sample i, evaluate the responsibilities for each class k using the current parameter values

p(k \mid x_i) = \frac{\mathcal{N}(x_i \mid \mu_k, \Sigma_k)\, p(k)}{\sum_{k=1}^{K} \mathcal{N}(x_i \mid \mu_k, \Sigma_k)\, p(k)}

2. M step. For each class, re-estimate the parameters using the current responsibilities

\mu_k^{new} = \frac{1}{N_k} \sum_{i=1}^{N} p(k \mid x_i)\, x_i                                              (5.18)
\Sigma_k^{new} = \frac{1}{N_k} \sum_{i=1}^{N} p(k \mid x_i)\, (x_i − \mu_k^{new})(x_i − \mu_k^{new})^T      (5.19)
p^{new}(k) = \frac{N_k}{N}                                                                                  (5.20)

where N_k = \sum_{i=1}^{N} p(k \mid x_i).

3. Evaluate the log-likelihood

\sum_{i=1}^{N} \ln \left\{ \sum_{k=1}^{K} \mathcal{N}(x_i \mid \mu_k, \Sigma_k)\, p(k) \right\},

and check for convergence of either the parameters or the log-likelihood. If the convergence
criterion is not satisfied return to step 1.
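To make the two steps concrete, here is a minimal NumPy/SciPy sketch of a single EM iteration. It is only a sketch: it assumes data X of shape N × P and current parameters pis (K,), mus (K × P) and covs (K × P × P), and it omits the initialization, the log-likelihood check and the numerical safeguards that a full implementation such as sklearn's GaussianMixture (used below) includes.

import numpy as np
from scipy.stats import multivariate_normal

def em_step(X, pis, mus, covs):
    N, K = X.shape[0], len(pis)
    # E step: responsibilities p(k | x_i) for each sample and class
    resp = np.column_stack([pis[k] * multivariate_normal.pdf(X, mean=mus[k], cov=covs[k])
                            for k in range(K)])
    resp /= resp.sum(axis=1, keepdims=True)
    # M step: re-estimate the parameters with the current responsibilities
    Nk = resp.sum(axis=0)                       # effective number of points per class
    pis_new = Nk / N
    mus_new = (resp.T @ X) / Nk[:, None]
    covs_new = np.array([(resp[:, k, None] * (X - mus_new[k])).T @ (X - mus_new[k]) / Nk[k]
                         for k in range(K)])
    return pis_new, mus_new, covs_new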
import numpy as np
from sklearn import datasets
import matplotlib.pyplot as plt
import seaborn as sns  # nice color
import sklearn
from sklearn.mixture import GaussianMixture

import pystatsml.plot_utils

colors = sns.color_palette()

iris = datasets.load_iris()
X = iris.data[:, :2]  # 'sepal length (cm)' 'sepal width (cm)'
y_iris = iris.target

gmm2 = GaussianMixture(n_components=2, covariance_type='full').fit(X)
gmm3 = GaussianMixture(n_components=3, covariance_type='full').fit(X)
gmm4 = GaussianMixture(n_components=4, covariance_type='full').fit(X)

plt.figure(figsize=(9, 3))
plt.subplot(131)
plt.scatter(X[:, 0], X[:, 1], c=[colors[lab] for lab in gmm2.predict(X)])
for i in range(gmm2.covariances_.shape[0]):
    pystatsml.plot_utils.plot_cov_ellipse(cov=gmm2.covariances_[i, :], pos=gmm2.means_[i, :],
                                          facecolor='none', linewidth=2, edgecolor=colors[i])
    plt.scatter(gmm2.means_[i, 0], gmm2.means_[i, 1], edgecolor=colors[i],
                marker="o", s=100, facecolor="w", linewidth=2)
plt.title("K=2")

plt.subplot(132)
plt.scatter(X[:, 0], X[:, 1], c=[colors[lab] for lab in gmm3.predict(X)])
for i in range(gmm3.covariances_.shape[0]):
    pystatsml.plot_utils.plot_cov_ellipse(cov=gmm3.covariances_[i, :], pos=gmm3.means_[i, :],
                                          facecolor='none', linewidth=2, edgecolor=colors[i])
    plt.scatter(gmm3.means_[i, 0], gmm3.means_[i, 1], edgecolor=colors[i],
                marker="o", s=100, facecolor="w", linewidth=2)
plt.title("K=3")

plt.subplot(133)
plt.scatter(X[:, 0], X[:, 1], c=[colors[lab] for lab in gmm4.predict(X)])
for i in range(gmm4.covariances_.shape[0]):
    pystatsml.plot_utils.plot_cov_ellipse(cov=gmm4.covariances_[i, :], pos=gmm4.means_[i, :],
                                          facecolor='none', linewidth=2, edgecolor=colors[i])
    plt.scatter(gmm4.means_[i, 0], gmm4.means_[i, 1], edgecolor=colors[i],
                marker="o", s=100, facecolor="w", linewidth=2)
_ = plt.title("K=4")


5.2.3 Model selection

Bayesian information criterion

In statistics, the Bayesian information criterion (BIC) is a criterion for model selection among a finite set of models; the model with the lowest BIC is preferred. It is based, in part, on the likelihood function and it is closely related to the Akaike information criterion (AIC).
X = iris.data
y_iris = iris.target

bic = list()
#print(X)

ks = np.arange(1, 10)

for k in ks:
    gmm = GaussianMixture(n_components=k, covariance_type='full')
    gmm.fit(X)
    bic.append(gmm.bic(X))

k_chosen = ks[np.argmin(bic)]

plt.plot(ks, bic)
plt.xlabel("k")
plt.ylabel("BIC")

print("Choose k=", k_chosen)
Choose k= 2


5.2.4 Hierarchical clustering

Hierarchical clustering is an approach to clustering that builds hierarchies of clusters following two main approaches:
• Agglomerative: A bottom-up strategy, where each observation starts in its own cluster, and pairs of clusters are merged upwards in the hierarchy.
• Divisive: A top-down strategy, where all observations start out in the same cluster, and the clusters are then split recursively downwards in the hierarchy.
In order to decide which clusters to merge or to split, a measure of dissimilarity between clusters is introduced. More specifically, this comprises a distance measure and a linkage criterion. The distance measure is just what it sounds like, and the linkage criterion is essentially a function of the distances between points, for instance the minimum distance between points in two clusters, the maximum distance between points in two clusters, the average distance between points in two clusters, etc. One particular linkage criterion, the Ward criterion, will be discussed next.

Ward clustering

Ward clustering belongs to the family of agglomerative hierarchical clustering algorithms. This means that it is based on a "bottom-up" approach: each sample starts in its own cluster, and pairs of clusters are merged as one moves up the hierarchy.
In Ward clustering, the criterion for choosing the pair of clusters to merge at each step is the minimum variance criterion. Ward's minimum variance criterion minimizes the total within-cluster variance at each merge. To implement this method, at each step find the pair of clusters that leads to the minimum increase in total within-cluster variance after merging. This increase is a weighted squared distance between cluster centers.
The main advantage of agglomerative hierarchical clustering over K-means clustering is that you can benefit from known neighborhood information, for example, neighboring pixels in an image.
from sklearn import cluster, datasets
import matplotlib.pyplot as plt
import seaborn as sns # nice color

iris = datasets.load_iris()
X = iris.data[:, :2]  # 'sepal length (cm)' 'sepal width (cm)'
y_iris = iris.target

ward2 = cluster.AgglomerativeClustering(n_clusters=2, linkage='ward').fit(X)


ward3 = cluster.AgglomerativeClustering(n_clusters=3, linkage='ward').fit(X)
ward4 =cluster.AgglomerativeClustering(n_clusters=4, linkage='ward').fit(X)

plt.figure(figsize=(9, 3))
plt.subplot(131)
plt.scatter(X[:, 0], X[: , 1], c=ward2.labels_)
plt.title("K=2")

plt.subplot(132)
plt.scatter(X[:, 0], X[: , 1], c=ward3.labels_)
plt.title("K=3")

plt.subplot(133)
plt.scatter(X[:, 0], X[: , 1], c=ward4.labels_) # .astype(np.float))
plt.title("K=4")

Text(0.5, 1.0, 'K=4')

5.2.5 Exercises

Perform clustering of the iris dataset based on all variables using Gaussian mixture
models. Use PCA to visualize clusters.


5.3 Linear methods for regression

5.3.1 Ordinary least squares

Linear regression models the output, or target variable y ∈ R, as a linear combination of the (P − 1)-dimensional input x ∈ R^{(P−1)}. Let X be the N × P matrix with each row an input vector (with a 1 in the first position), and similarly let y be the N-dimensional vector of outputs in the training set. The linear model will predict y given X using the parameter vector, or weight vector, β ∈ R^P according to

y = X β + ε,

where ε ∈ R^N are the residuals, or the errors of the prediction. β is found by minimizing an objective function, which is the loss function L(β), i.e. the error measured on the data. This error is the sum of squared errors (SSE) loss.

L(β) = SSE(β)                        (5.21)
     = \sum_i^N (y_i − x_i^T β)^2    (5.22)
     = (y − X β)^T (y − X β)         (5.23)
     = \|y − X β\|_2^2               (5.24)

Minimizing the SSE is the Ordinary Least Squares (OLS) regression objective. It is a simple ordinary least squares minimization whose analytic solution is:

β_{OLS} = (X^T X)^{-1} X^T y

The gradient of the loss:

\frac{\partial L(\beta, X, y)}{\partial \beta} = 2 \sum_i x_i (x_i \cdot \beta − y_i)
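As a minimal sketch (on simulated data, not the advertising dataset used below), the closed-form solution and the gradient can be written directly in NumPy; note that at the OLS solution the gradient is numerically zero:

import numpy as np

np.random.seed(42)
N, P = 50, 3
X = np.column_stack([np.ones(N), np.random.randn(N, P - 1)])  # first column: intercept
beta_true = np.array([1., 2., -1.])
y = X @ beta_true + np.random.randn(N) * .5

beta_ols = np.linalg.inv(X.T @ X) @ X.T @ y     # (X^T X)^{-1} X^T y
grad = 2 * X.T @ (X @ beta_ols - y)             # gradient of the SSE loss
print(beta_ols)
print(np.allclose(grad, 0))                     # ~0 at the minimum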

5.3.2 Linear regression with scikit-learn

Scikit learn offer many models for supervised learning, and they all follow the same
application programming interface (API), namely:

model = Estimator()
model.fit(X, y)
predictions = model.predict(X)
%matplotlib inline
import warnings
warnings.filterwarnings(action='once')


from mpl_toolkits.mplot3d import Axes3D
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import sklearn.linear_model as lm
import sklearn.metrics as metrics
%matplotlib inline

# Fit Ordinary Least Squares: OLS
csv = pd.read_csv('https://raw.githubusercontent.com/neurospin/pystatsml/master/datasets/Advertising.csv', index_col=0)
X = csv[['TV', 'Radio']]
y = csv['Sales']

lr = lm.LinearRegression().fit(X, y)

y_pred = lr.predict(X)
print("R-squared =", metrics.r2_score(y, y_pred))
print("Coefficients =", lr.coef_)

# Plot
fig = plt.figure()
ax = fig.add_subplot(111, projection='3d')

ax.scatter(csv['TV'], csv['Radio'], csv['Sales'], c='r', marker='o')

xx1, xx2 = np.meshgrid(
    np.linspace(csv['TV'].min(), csv['TV'].max(), num=10),
    np.linspace(csv['Radio'].min(), csv['Radio'].max(), num=10))

XX = np.column_stack([xx1.ravel(), xx2.ravel()])

yy = lr.predict(XX)
ax.plot_surface(xx1, xx2, yy.reshape(xx1.shape), color='None')
ax.set_xlabel('TV')
ax.set_ylabel('Radio')
_ = ax.set_zlabel('Sales')

/home/edouard/anaconda3/lib/python3.7/importlib/_bootstrap.py:219: RuntimeWarning: numpy.ufunc size changed, may indicate binary incompatibility. Expected 192 from C header, got 216 from PyObject
  return f(*args, **kwds)

R-squared = 0.8971942610828956
Coefficients = [0.04575482 0.18799423]

5.3.3 Overfitting

In statistics and machine learning, overfitting occurs when a statistical model describes
random errors or noise instead of the underlying relationships. Overfitting generally
occurs when a model is excessively complex, such as having too many parameters relative
to the number of observations. A model that has been overfit will generally have poor
predictive performance, as it can exaggerate minor fluctuations in the data.
A learning algorithm is trained using some set of training samples. If the learning
algorithm has the capacity to overfit the training samples the performance on the training
sample set will improve while the performance on unseen test sample set will decline.
The overfitting phenomenon has three main explanations: excessively complex models, multicollinearity, and high dimensionality.

Model complexity

Complex learners with too many parameters relative to the number of observations may
overfit the training dataset.


Multicollinearity

Predictors are highly correlated, meaning that one can be linearly predicted from the
others. In this situation the coefficient estimates of the multiple regression may change
erratically in response to small changes in the model or the data. Multicollinearity does
not reduce the predictive power or reliability of the model as a whole, at least not within
the sample data set; it only affects computations regarding individual predictors. That is,
a multiple regression model with correlated predictors can indicate how well the entire
bundle of predictors predicts the outcome variable, but it may not give valid results about
any individual predictor, or about which predictors are redundant with respect to others.
In case of perfect multicollinearity the predictor matrix is singular and therefore cannot
be inverted. Under these circumstances, for a general linear model y = X 𝛽 + 𝜀, the
ordinary least-squares estimator, 𝛽 𝑂𝐿𝑆 = (X𝑇 X) − 1 X 𝑇 y, does not exist.
An example where correlated predictor may produce an unstable model follows:
import numpy as np
from mpl_toolkits.mplot3d import Axes3D
import matplotlib.pyplot as plt

# business volume
bv = np.array([10, 20, 30, 40, 50])
# Tax
tax = .2 * bv
# business potential
bp = .1 * bv + np.array([-.1, .2, .1, -.2, .1])

X = np.column_stack([bv, tax])
beta_star = np.array([.1, 0])  # true solution

'''
Since tax and bv are correlated, there is an infinite number of linear combinations
leading to the same prediction.
'''

# 10 times the bv then subtract it 9 times using the tax variable:
beta_medium = np.array([.1 * 10, -.1 * 9 * (1/.2)])
# 100 times the bv then subtract it 99 times using the tax variable:
beta_large = np.array([.1 * 100, -.1 * 99 * (1/.2)])

# Check that all models lead to the same result
assert np.all(np.dot(X, beta_star) == np.dot(X, beta_medium))
assert np.all(np.dot(X, beta_star) == np.dot(X, beta_large))

Multicollinearity between the predictors, business volume and tax, produces an unstable
model with arbitrarily large coefficients.
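This instability can be made concrete with a small sketch (not part of the original example, the alpha value is an arbitrary choice): with nearly collinear predictors, the OLS coefficients change erratically across noisy replicates of the data, while a small ℓ2 (ridge) penalty keeps them stable.

import numpy as np
import sklearn.linear_model as lm

bv = np.array([10., 20., 30., 40., 50.])     # business volume

for seed in range(3):
    rng = np.random.RandomState(seed)
    tax = .2 * bv + rng.randn(5) * .01       # almost perfectly correlated with bv
    X = np.column_stack([bv, tax])
    y = .1 * bv + rng.randn(5) * .1          # target depends on bv only

    ols = lm.LinearRegression(fit_intercept=False).fit(X, y)
    ridge = lm.Ridge(alpha=1., fit_intercept=False).fit(X, y)
    print("OLS:", ols.coef_, "Ridge:", ridge.coef_)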


Dealing with multicollinearity:


• Regularisation by e.g. ℓ2 shrinkage: introduce a bias in the solution so that 𝑋^T 𝑋 becomes non-singular (invertible). See ℓ2 shrinkage.


• Feature selection: select a small number of features. See: Isabelle Guyon and André
Elisseeff An introduction to variable and feature selection The Journal of Machine
Learning Research, 2003.
• Feature selection: select a small number of features using ℓ1 shrinkage.
• Extract few independent (uncorrelated) features using e.g. principal components
analysis (PCA), partial least squares regression (PLS-R) or regression methods
that cut the number of predictors to a smaller set of uncorrelated components.

High dimensionality

High dimension means a large number of input features. A linear predictor associates one
parameter to each input feature, so a high-dimensional situation (𝑃, the number of features,
is large) with a relatively small number of samples 𝑁 (the so-called large 𝑃, small 𝑁 situation)
generally leads to overfitting of the training data. Thus it is generally a bad idea to add
many input features into the learner. This phenomenon is called the curse of dimensionality.
One of the most important criteria to use when choosing a learning algorithm is the relative
size of 𝑃 and 𝑁.
• Remember that the “covariance” matrix X^T X used in the linear model is a 𝑃 × 𝑃
matrix of rank min(𝑁, 𝑃). Thus if 𝑃 > 𝑁 the equation system is overparameterized
and admits an infinity of solutions that might be specific to the learning dataset.
See also ill-conditioned or singular matrices.
• The sampling density of 𝑁 samples in a 𝑃-dimensional space is proportional to
𝑁^{1/𝑃}. Thus a high-dimensional space becomes very sparse, leading to poor
estimations of sample densities.
• Another consequence of the sparse sampling in high dimensions is that all sample
points are close to an edge of the sample. Consider 𝑁 data points uniformly
distributed in a 𝑃-dimensional unit ball centered at the origin. Suppose we consider
a nearest-neighbor estimate at the origin. The median distance from the origin to the
closest data point is given by the expression

    𝑑(𝑃, 𝑁) = (1 − (1/2)^{1/𝑁})^{1/𝑃}.

A more complicated expression exists for the mean distance to the closest point. For
𝑁 = 500, 𝑃 = 10, 𝑑(𝑃, 𝑁) ≈ 0.52, more than halfway to the boundary. Hence most data
points are closer to the boundary of the sample space than to any other data point. The
reason that this presents a problem is that prediction is much more difficult near the
edges of the training sample. One must extrapolate from neighboring sample points
rather than interpolate between them. (Source: T. Hastie, R. Tibshirani, J. Friedman. The
Elements of Statistical Learning: Data Mining, Inference, and Prediction. Second Edition, 2009.)
A numerical check of this value is sketched after this list.
• Structural risk minimization provides a theoretical background of this phenomenon.
(See VC dimension.)
• See also bias–variance trade-off.
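As a quick check of the median-distance formula above (a sketch, not in the original text):

def median_dist_to_closest(P, N):
    # Median distance from the origin to the closest of N points uniformly
    # distributed in the P-dimensional unit ball
    return (1 - 0.5 ** (1 / N)) ** (1 / P)

print(median_dist_to_closest(P=10, N=500))   # approx. 0.52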

import numpy as np
import matplotlib.pyplot as plt
import sklearn.linear_model as lm
import sklearn.metrics as metrics
import seaborn  # nicer plots

def fit_on_increasing_size(model):
    n_samples = 100
    n_features_ = np.arange(10, 800, 20)
    r2_train, r2_test, snr = [], [], []
    for n_features in n_features_:
        # Sample the dataset (* 2 nb of samples)
        n_features_info = int(n_features / 10)
        np.random.seed(42)  # Make reproducible
        X = np.random.randn(n_samples * 2, n_features)
        beta = np.zeros(n_features)
        beta[:n_features_info] = 1
        Xbeta = np.dot(X, beta)
        eps = np.random.randn(n_samples * 2)
        y = Xbeta + eps
        # Split the dataset into train and test samples
        Xtrain, Xtest = X[:n_samples, :], X[n_samples:, :]
        ytrain, ytest = y[:n_samples], y[n_samples:]
        # fit/predict
        lr = model.fit(Xtrain, ytrain)
        y_pred_train = lr.predict(Xtrain)
        y_pred_test = lr.predict(Xtest)
        snr.append(Xbeta.std() / eps.std())
        r2_train.append(metrics.r2_score(ytrain, y_pred_train))
        r2_test.append(metrics.r2_score(ytest, y_pred_test))
    return n_features_, np.array(r2_train), np.array(r2_test), np.array(snr)

def plot_r2_snr(n_features_, r2_train, r2_test, xvline, snr, ax):
    """
    Two scales plot. Left y-axis: train/test r-squared. Right y-axis: SNR.
    """
    ax.plot(n_features_, r2_train, label="Train r-squared", linewidth=2)
    ax.plot(n_features_, r2_test, label="Test r-squared", linewidth=2)
    ax.axvline(x=xvline, linewidth=2, color='k', ls='--')
    ax.axhline(y=0, linewidth=1, color='k', ls='--')
    ax.set_ylim(-0.2, 1.1)
    ax.set_xlabel("Number of input features")
    ax.set_ylabel("r-squared")
    ax.legend(loc='best')
    ax.set_title("Prediction perf.")
    ax_right = ax.twinx()
    ax_right.plot(n_features_, snr, 'r-', label="SNR", linewidth=1)
    ax_right.set_ylabel("SNR", color='r')
    for tl in ax_right.get_yticklabels():
        tl.set_color('r')

# Model = linear regression
mod = lm.LinearRegression()

# Fit models on dataset
n_features, r2_train, r2_test, snr = fit_on_increasing_size(model=mod)

argmax = n_features[np.argmax(r2_test)]

# plot
fig, axis = plt.subplots(1, 2, figsize=(9, 3))

# Left pane: all features
plot_r2_snr(n_features, r2_train, r2_test, argmax, snr, axis[0])

# Right pane: Zoom on 100 first features
plot_r2_snr(n_features[n_features <= 100],
            r2_train[n_features <= 100], r2_test[n_features <= 100],
            argmax,
            snr[n_features <= 100],
            axis[1])
plt.tight_layout()


Exercises

Study the code above and:


• Describe the datasets: 𝑁 : nb_samples, 𝑃: nb_features.
• What is n_features_info?
• Give the equation of the generative model.
• What is modified by the loop?
• What is the SNR?

Comment the graph above, in terms of training and test performances:


• How do the train and test performances change as a function of the number of features 𝑃?
• Is it the expected results when compared to the SNR?
• What can you conclude?

5.3.4 Ridge regression (ℓ2-regularization)

Overfitting generally leads to excessively complex weight vectors, accounting for noise or
spurious correlations within predictors. To avoid this phenomenon the learning should
constrain the solution in order to fit a global pattern. This constraint will reduce (bias)
the capacity of the learning algorithm. Adding such a penalty will force the coefficients to
be small, i.e. to shrink them toward zero.
Therefore the loss function 𝐿 (𝛽 ) (generally the SSE) is combined with a penalty
function
Ω(𝛽 ) leading to the general form:

Penalized(𝛽) = 𝐿 (𝛽 ) + 𝜆Ω(𝛽)

The respective contribution of the loss and the penalty is controlled by the regularization
parameter 𝜆.
Ridge regression imposes an ℓ2 penalty on the coefficients, i.e. it penalizes with the
Euclidean norm of the coefficients while minimizing SSE. The objective function becomes:


Ridge(𝛽) = ∑_{i=1}^{N} (𝑦_i − x_i^T 𝛽)^2 + 𝜆‖𝛽‖_2^2        (5.25)
         = ‖y − X𝛽‖_2^2 + 𝜆‖𝛽‖_2^2.                        (5.26)

The 𝛽 that minimises Ridge(𝛽) can be found by the following derivation:

∇_𝛽 Ridge(𝛽) = 0                                           (5.27)
∇_𝛽 ((y − X𝛽)^T (y − X𝛽) + 𝜆𝛽^T 𝛽) = 0                     (5.28)
∇_𝛽 (y^T y − 2𝛽^T X^T y + 𝛽^T X^T X 𝛽 + 𝜆𝛽^T 𝛽) = 0        (5.29)
−2X^T y + 2X^T X 𝛽 + 2𝜆𝛽 = 0                               (5.30)
−X^T y + (X^T X + 𝜆I)𝛽 = 0                                 (5.31)
(X^T X + 𝜆I)𝛽 = X^T y                                      (5.32)
𝛽 = (X^T X + 𝜆I)^{-1} X^T y                                (5.33)

• The solution adds a positive constant to the diagonal of X^T X before inversion. This
makes the problem nonsingular, even if X^T X is not of full rank, and was the main
motivation behind ridge regression.
• Increasing 𝜆 shrinks the 𝛽 coefficients toward 0.
• This approach penalizes the objective function by the Euclidean (ℓ2) norm of the
coefficients such that solutions with large coefficients become unattractive.

The gradient of the loss:

∂L(𝛽, X, y)/∂𝛽 = 2( ∑_i x_i (x_i · 𝛽 − 𝑦_i) + 𝜆𝛽 )
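As a quick sanity check of the closed form above (a sketch, not part of the original text; the data and 𝜆 value are arbitrary), we can compare it to scikit-learn's Ridge estimator fitted without an intercept:

import numpy as np
import sklearn.linear_model as lm

np.random.seed(42)
X = np.random.randn(50, 5)
y = np.dot(X, np.array([1., 2., 0., 0., -1.])) + np.random.randn(50)
lam = 10.

# Closed form: beta = (X^T X + lambda I)^{-1} X^T y
beta_closed = np.linalg.solve(X.T @ X + lam * np.eye(5), X.T @ y)

ridge = lm.Ridge(alpha=lam, fit_intercept=False).fit(X, y)
print(np.allclose(beta_closed, ridge.coef_))   # True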
The ridge penalty shrinks the coefficients toward zero. The figure illustrates the OLS
solution in the left pane, the ℓ1 and ℓ2 penalties in the middle pane, and the penalized OLS
in the right pane. The right pane shows how the penalties shrink the coefficients toward zero.
The black points are the minima found in each case, and the white points represent the
true solution used to generate the data.
Fig. 2: ℓ1 and ℓ2 shrinkages

import matplotlib.pyplot as plt
import numpy as np
import sklearn.linear_model as lm

# lambda is alpha!
mod = lm.Ridge(alpha=10)

# Fit models on dataset
n_features, r2_train, r2_test, snr = fit_on_increasing_size(model=mod)

argmax = n_features[np.argmax(r2_test)]

# plot
fig, axis = plt.subplots(1, 2, figsize=(9, 3))

# Left pane: all features
plot_r2_snr(n_features, r2_train, r2_test, argmax, snr, axis[0])

# Right pane: Zoom on 100 first features
plot_r2_snr(n_features[n_features <= 100],
            r2_train[n_features <= 100], r2_test[n_features <= 100],
            argmax,
            snr[n_features <= 100],
            axis[1])
plt.tight_layout()

Exercise

What benefit has been obtained by using ℓ2 regularization?


5.3.5 Lasso regression (ℓ1-regularization)

Lasso regression penalizes the coefficients by the ℓ1 norm. This constraint will reduce
(bias) the capacity of the learning algorithm. Adding such a penalty forces the coefficients
to be small, i.e. it shrinks them toward zero. The objective function to minimize becomes:

    Lasso(𝛽) = ‖y − X𝛽‖_2^2 + 𝜆‖𝛽‖_1.        (5.34)
This penalty forces some coefficients to be exactly zero, providing a feature selection
property.
import matplotlib.pyplot as plt
import numpy as np
import sklearn.linear_model as lm

# lambda is alpha!
mod = lm.Lasso(alpha=.1)

# Fit models on dataset
n_features, r2_train, r2_test, snr = fit_on_increasing_size(model=mod)

argmax = n_features[np.argmax(r2_test)]

# plot
fig, axis = plt.subplots(1, 2, figsize=(9, 3))

# Left pane: all features
plot_r2_snr(n_features, r2_train, r2_test, argmax, snr, axis[0])

# Right pane: Zoom on 200 first features
plot_r2_snr(n_features[n_features <= 200],
            r2_train[n_features <= 200], r2_test[n_features <= 200],
            argmax,
            snr[n_features <= 200],
            axis[1])
plt.tight_layout()

Sparsity of the ℓ1 norm


Occam’s razor

Occam’s razor (also written as Ockham’s razor, and lex parsimoniae in Latin, which means
law of parsimony) is a problem solving principle attributed to William of Ockham (1287-
1347), who was an English Franciscan friar and scholastic philosopher and theologian.
The principle can be interpreted as stating that among competing hypotheses, the one
with the fewest assumptions should be selected.

Principle of parsimony

The simplest of two competing theories is to be preferred. Definition of parsimony:


Economy of explanation in conformity with Occam’s razor.
Among possible models with similar loss, choose the simplest one:
• Choose the model with the smallest coefficient vector, i.e. the smallest ℓ2 (‖𝛽‖_2) or
ℓ1 (‖𝛽‖_1) norm of 𝛽, i.e. the ℓ2 or ℓ1 penalty. See also bias-variance tradeoff.
• Choose the model that uses the smallest number of predictors. In other words,
choose the model that has many predictors with zero weights. Two approaches are
available to obtain this: (i) Perform a feature selection as a preprocessing prior to
applying the learning algorithm, or (ii) embed the feature selection procedure
within the learning process.

Sparsity-induced penalty or embedded feature selection with the ℓ1 penalty

The penalty based on the ℓ1 norm promotes sparsity (scattered, or not dense): it forces
many coefficients to be exactly zero. This also makes the coefficient vector scattered.
The figure below illustrates the OLS loss under a constraint acting on the ℓ1 norm of the
coefficient vector, i.e. it illustrates the following optimization problem:

    minimize_𝛽  ‖y − X𝛽‖_2^2    subject to  ‖𝛽‖_1 ≤ 1.

Fig. 3: Sparsity of L1 norm
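A small sketch (synthetic data, not part of the original text; the alpha values are arbitrary) illustrating this sparsity property: the ℓ1-penalized fit sets many coefficients exactly to zero, while the ℓ2-penalized fit only shrinks them.

import numpy as np
import sklearn.linear_model as lm

np.random.seed(42)
X = np.random.randn(100, 50)
beta = np.zeros(50)
beta[:5] = 1.                                  # only 5 informative features
y = np.dot(X, beta) + np.random.randn(100)

lasso = lm.Lasso(alpha=.1).fit(X, y)
ridge = lm.Ridge(alpha=10.).fit(X, y)
print("Exact zeros - Lasso: %i, Ridge: %i" %
      (np.sum(lasso.coef_ == 0), np.sum(ridge.coef_ == 0)))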

Optimization issues

Section to be completed
• No more closed-form solution.
• Convex but not differentiable.
• Requires specific optimization algorithms, such as the fast iterative shrinkage-
thresholding algorithm (FISTA): Amir Beck and Marc Teboulle, A Fast Iterative
Shrinkage-Thresholding Algorithm for Linear Inverse Problems SIAM J. Imaging Sci.,
2009.

5.3.6 Elastic-net regression (ℓ2-ℓ1-regularization)

The Elastic-net estimator combines the ℓ1 and ℓ2 penalties, and results in the problem of
minimizing:

    Enet(𝛽) = ‖y − X𝛽‖_2^2 + 𝛼 ( 𝜌 ‖𝛽‖_1 + (1 − 𝜌) ‖𝛽‖_2^2 ),        (5.35)

where 𝛼 acts as a global penalty and 𝜌 as an ℓ1/ℓ2 ratio.

Rationale

• If there are groups of highly correlated variables, Lasso tends to arbitrarily select
only one from each group. These models are difficult to interpret because covariates
that are strongly associated with the outcome are not included in the predictive
model. Conversely, the elastic net encourages a grouping effect, where strongly
correlated predictors tend to be in or out of the model together.
• Studies on real-world data and simulation studies show that the elastic net often
outperforms the lasso, while enjoying a similar sparsity of representation.
import matplotlib.pyplot as plt
import numpy as np
import sklearn.linear_model as lm

mod = lm.ElasticNet(alpha=.5, l1_ratio=.5)

# Fit models on dataset
n_features, r2_train, r2_test, snr = fit_on_increasing_size(model=mod)

argmax = n_features[np.argmax(r2_test)]

# plot
fig, axis = plt.subplots(1, 2, figsize=(9, 3))

# Left pane: all features
plot_r2_snr(n_features, r2_train, r2_test, argmax, snr, axis[0])

# Right pane: Zoom on 100 first features
plot_r2_snr(n_features[n_features <= 100],
            r2_train[n_features <= 100], r2_test[n_features <= 100],
            argmax,
            snr[n_features <= 100],
            axis[1])
plt.tight_layout()

5.4 Linear classification

Given a training set of 𝑁 samples, D = {(x_1, y_1), . . . , (x_N, y_N)}, where x_i is a
multidimensional input vector of dimension 𝑃 and y_i its class label (target or response).
Multiclass classification problems can be seen as several binary classification problems
with y_i ∈ {0, 1}, where the classifier aims to discriminate the samples of the current class
(label 1) versus the samples of the other classes (label 0).
Therefore, for each class the classifier seeks a vector of parameters w that performs
a linear combination of the input variables, w^T x. This step performs a projection or a
rotation of the input sample onto a good discriminative one-dimensional sub-space, that best
discriminates samples of the current class vs. samples of the other classes.
This score (a.k.a. decision function) is transformed, using the nonlinear activation function
f(.), into a "posterior probability" of class 1: p(y = 1|x) = f(w^T x), where
p(y = 1|x) = 1 − p(y = 0|x).
The decision surfaces (hyperplanes orthogonal to w) correspond to f(x) = constant, so that
w^T x = constant and hence the decision surfaces are linear functions of x, even if the
function f(.) is nonlinear.
A thresholding of the activation (shifted by the bias or intercept) provides the predicted
class label.

The vector of parameters w, which defines the discriminative axis, minimizes an objective
function J(w) that is a sum of a loss function L(·) and some penalty on the weights
vector, Ω(w):

    min_w  J = ∑_i L(y_i, f(w^T x_i)) + Ω(w).

5.4.1 Fisher’s linear discriminant with equal class covariance

This geometric method does not make any probabilistic assumptions; instead it relies on
distances. It looks for the linear projection of the data points onto a vector, w, that
maximizes the between/within variance ratio, denoted F(w). Under a few assumptions,
it will provide the same results as linear discriminant analysis (LDA), explained below.
Suppose two classes of observations, C_0 and C_1, have means 𝜇_0 and 𝜇_1 and the same total
within-class scatter ("covariance") matrix,

    S_W = ∑_{i∈C_0} (x_i − 𝜇_0)(x_i − 𝜇_0)^T + ∑_{j∈C_1} (x_j − 𝜇_1)(x_j − 𝜇_1)^T        (5.36)
        = X_c^T X_c,                                                                     (5.37)

where X_c is the (N × P) matrix of data centered on their respective means:

    X_c = [X_0 − 𝜇_0; X_1 − 𝜇_1],

where X_0 and X_1 are the (N_0 × P) and (N_1 × P) matrices of samples of classes C_0
and C_1. Let S_B be the "between-class" scatter matrix, given by

    S_B = (𝜇_1 − 𝜇_0)(𝜇_1 − 𝜇_0)^T.
The linear combination of features w^T x has means w^T 𝜇_i for i = 0, 1, and variance
w^T X_c^T X_c w. Fisher defined the separation between these two distributions to be the
ratio of the variance between the classes to the variance within the classes:

    F_Fisher(w) = 𝜎²_between / 𝜎²_within                                  (5.38)
                = (w^T 𝜇_1 − w^T 𝜇_0)² / (w^T X_c^T X_c w)                (5.39)
                = (w^T (𝜇_1 − 𝜇_0))² / (w^T X_c^T X_c w)                  (5.40)
                = (w^T (𝜇_1 − 𝜇_0)(𝜇_1 − 𝜇_0)^T w) / (w^T X_c^T X_c w)    (5.41)
                = (w^T S_B w) / (w^T S_W w).                              (5.42)
The Fisher most discriminant projection

In the two-class case, the maximum separation occurs by a projection on (𝜇_1 − 𝜇_0) using
the Mahalanobis metric S_W^{-1}, so that

    w ∝ S_W^{-1} (𝜇_1 − 𝜇_0).

Demonstration

Differentiating F_Fisher(w) with respect to w gives

    ∇_w F_Fisher(w) = 0
    ∇_w ( (w^T S_B w) / (w^T S_W w) ) = 0
    (w^T S_W w)(2 S_B w) − (w^T S_B w)(2 S_W w) = 0
    (w^T S_W w)(S_B w) = (w^T S_B w)(S_W w)
    S_B w = ((w^T S_B w) / (w^T S_W w)) (S_W w)
    S_B w = 𝜆 (S_W w)
    S_W^{-1} S_B w = 𝜆 w.

Since we do not care about the magnitude of w, only its direction, we replaced the scalar
factor (w^T S_B w) / (w^T S_W w) by 𝜆.
In the multiple-class case, the solutions w are determined by the eigenvectors of
S_W^{-1} S_B that correspond to the K − 1 largest eigenvalues.
However, in the two-class case (in which S_B = (𝜇_1 − 𝜇_0)(𝜇_1 − 𝜇_0)^T) it is easy to
show that w = S_W^{-1}(𝜇_1 − 𝜇_0) is the unique eigenvector of S_W^{-1} S_B:

    S_W^{-1} (𝜇_1 − 𝜇_0)(𝜇_1 − 𝜇_0)^T w = 𝜆 w
    S_W^{-1} (𝜇_1 − 𝜇_0)(𝜇_1 − 𝜇_0)^T S_W^{-1} (𝜇_1 − 𝜇_0) = 𝜆 S_W^{-1} (𝜇_1 − 𝜇_0),

where here 𝜆 = (𝜇_1 − 𝜇_0)^T S_W^{-1} (𝜇_1 − 𝜇_0). Which leads to the result

    w ∝ S_W^{-1} (𝜇_1 − 𝜇_0).
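A small numerical check of this result (a sketch, not part of the original text; the two-class Gaussian data are generated here for illustration): the direction S_W^{-1}(𝜇_1 − 𝜇_0) computed by hand is colinear (up to scaling) with scikit-learn's LDA coefficient vector.

import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis as LDA

np.random.seed(42)
Cov = np.array([[1., .8], [.8, 1.]])
X0 = np.random.multivariate_normal([0, 0], Cov, 100)
X1 = np.random.multivariate_normal([0, 2], Cov, 100)
X = np.vstack([X0, X1])
y = np.array([0] * 100 + [1] * 100)

mu0, mu1 = X0.mean(axis=0), X1.mean(axis=0)
Xc = np.vstack([X0 - mu0, X1 - mu1])       # center each class on its own mean
Sw = np.dot(Xc.T, Xc)                      # within-class scatter matrix
w = np.linalg.solve(Sw, mu1 - mu0)         # w proportional to Sw^{-1}(mu1 - mu0)

w_lda = LDA().fit(X, y).coef_.ravel()
cos = np.dot(w, w_lda) / (np.linalg.norm(w) * np.linalg.norm(w_lda))
print("cosine(w, w_lda) =", cos)           # close to 1: same direction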

The separating hyperplane

The separating hyperplane is a (P − 1)-dimensional hypersurface, orthogonal to the
projection vector w. There is no single best way to find the origin of the plane along w,
or equivalently the classification threshold that determines whether a point should be
classified as belonging to C_0 or to C_1. However, if the projected points have roughly the
same distribution, then the threshold can be chosen as the hyperplane exactly between the
projections of the two means, i.e. as

    T = w · (1/2)(𝜇_1 + 𝜇_0).
import matplotlib.pyplot as plt
import warnings

warnings.filterwarnings('ignore')
%matplotlib inline


Fig. 4: The Fisher most discriminant projection


5.4.2 Linear discriminant analysis (LDA)

Linear discriminant analysis (LDA) is a probabilistic generalization of Fisher's linear
discriminant. It uses Bayes' rule to fix the threshold based on prior probabilities of
classes.
1. First compute the class-conditional distributions of x given class C_k:
p(x|C_k) = 𝒩(x | 𝜇_k, S_W). Where 𝒩(x | 𝜇_k, S_W) is the multivariate Gaussian
distribution defined over a P-dimensional vector of continuous variables, given by

    𝒩(x | 𝜇_k, S_W) = 1 / ((2𝜋)^{P/2} |S_W|^{1/2}) exp{ −(1/2) (x − 𝜇_k)^T S_W^{-1} (x − 𝜇_k) }

2. Estimate the prior probabilities of class k: p(C_k) = N_k / N.
3. Compute posterior probabilities (i.e. the probability of each class given a sample),
combining the conditionals with the priors using Bayes' rule:

    p(C_k | x) = p(C_k) p(x | C_k) / p(x)

Where p(x) is the marginal distribution obtained by summing over classes. As usual, the
denominator in Bayes' theorem can be found in terms of the quantities appearing in the
numerator, because

    p(x) = ∑_k p(x | C_k) p(C_k)

4. Classify x using the maximum a posteriori probability: C_k = arg max_{C_k} p(C_k | x).
LDA is a generative model since the class-conditional distributions can be used to generate
samples of each class.
LDA is useful to deal with imbalanced group sizes (e.g.: N_1 ≫ N_0) since prior
probabilities can be used to explicitly re-balance the classification by setting
p(C_0) = p(C_1) = 1/2 or whatever seems relevant.
LDA can be generalised to the multiclass case with K > 2.
With N_1 = N_0, LDA leads to the same solution as Fisher's linear discriminant.

Exercise

How many parameters are required to estimate to perform an LDA?

import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis as LDA

# Dataset
n_samples, n_features = 100, 2
mean0, mean1 = np.array([0, 0]), np.array([0, 2])
Cov = np.array([[1, .8], [.8, 1]])
np.random.seed(42)
X0 = np.random.multivariate_normal(mean0, Cov, n_samples)
X1 = np.random.multivariate_normal(mean1, Cov, n_samples)
X = np.vstack([X0, X1])
y = np.array([0] * X0.shape[0] + [1] * X1.shape[0])

# LDA with scikit-learn

lda = LDA()
proj =lda.fit(X, y).transform(X)
y_pred_lda =lda.predict(X)

errors =y_pred_lda != y
print("Nb errors=%i, error rate=
%.2f" % (errors.sum(),
errors.sum() / len(y_pred_lda)))
Nb errors=10, error rate=0.05

5.4.3 Logistic regression

Logistic regression is called a generalized linear model, i.e. it is a linear model with a
link function that maps the output of a linear multiple regression to the posterior
probability of class 1, p(1|x), using the logistic sigmoid function:

    p(1 | x, w) = 1 / (1 + exp(−w · x))
def logistic(x): return 1 / (1 +np.exp(-x))
def logistic_loss(x): return np.log(1 +np.exp(-x))

x =np.linspace(-6, 6, 100)
plt.subplot(121)
plt.plot(x, logistic(x))
plt.grid(True)
plt.title('Logistic
(sigmoid)')

x =np.linspace(-3, 3, 100)
plt.subplot(122)
plt.plot(x, logistic_loss(x), label='Logistic loss')
plt.plot(x, np.maximum(0, 1 - x), label='Hinge loss')
plt.legend()
plt.title('Losses')
plt.grid(True)


The loss function for sample i is the negative log of the probability:

    L(w, x_i, y_i) = −log(p(1 | w, x_i))        if y_i = 1
                   = −log(1 − p(1 | w, x_i))    if y_i = 0

For the whole dataset X, y = {x_i, y_i} the loss function to minimize, L(w, X, y), is the
negative log likelihood (nll), which can be simplified using a 0/1 coding of the label in
the case of binary classification:

    L(w, X, y) = −log ℒ(w, X, y)                                                      (5.43)
               = −log Π_i { p(1 | w, x_i)^{y_i} (1 − p(1 | w, x_i))^{(1 − y_i)} }     (5.44)
               = −∑_i { y_i log p(1 | w, x_i) + (1 − y_i) log(1 − p(1 | w, x_i)) }    (5.45)
               = −∑_i { y_i (w · x_i) − log(1 + exp(w · x_i)) }                       (5.46)
This is solved by numerical methods using the gradient of the loss:

    ∂L(w, X, y)/∂w = ∑_i x_i (p(1 | w, x_i) − y_i)

Logistic regression is a discriminative model since it focuses only on the posterior
probability of each class p(C_k | x). It only requires estimating the P weights of the w
vector. Thus it should be favoured over LDA with many input features. In small dimension
and balanced situations it would provide predictions similar to LDA. However imbalanced
group sizes cannot be explicitly controlled. This can be managed using a reweighting of the
input samples.
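A minimal sketch (not from the original text; the simulated data, step size and iteration count are arbitrary choices) of plain gradient descent on this negative log-likelihood; with a small step size it should end up close to scikit-learn's (almost) unregularized solution:

import numpy as np
from sklearn import linear_model

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

np.random.seed(42)
X = np.random.randn(100, 2)
w_true = np.array([1., -2.])
y = (np.random.rand(100) < sigmoid(np.dot(X, w_true))).astype(int)

w = np.zeros(2)
for _ in range(1000):
    grad = np.dot(X.T, sigmoid(np.dot(X, w)) - y)   # gradient of the nll
    w -= 0.01 * grad                                # gradient descent step

logreg = linear_model.LogisticRegression(C=1e8, fit_intercept=False).fit(X, y)
print("gradient descent:", w)
print("scikit-learn:    ", logreg.coef_.ravel())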


from sklearn import linear_model


logreg =linear_model.LogisticRegression(C=1e8, solver='lbfgs')
# This class implements regularized logistic regression. C is the inverse of regularization
# strength. Large value => no regularization.

logreg.fit(X, y)
y_pred_logreg = logreg.predict(X)

errors =y_pred_logreg != y
print("Nb errors=%i, error rate=%.2f" % (errors.sum(), errors.sum() / len(y_pred_logreg)))
print(logreg.coef_)

Nb errors=10, error rate=0.05


[[-5.1516729 5.57303883]]

Exercise

Explore the Logistic Regression parameters and propose a solution in the case of a highly
imbalanced training dataset N_1 ≫ N_0 when we know that in reality both classes have the
same probability p(C_1) = p(C_0).

5.4.4 Overfitting

VC dimension (for Vapnik–Chervonenkis dimension) is a measure of the capacity


(complexity, expressive power, richness, or flexibility) of a statistical classification
algorithm, defined as the cardinality of the largest set of points that the algorithm can
shatter.
Theorem: Linear classifier in 𝑅𝑃 have VC dimension of 𝑃+ 1. Hence in dimension two (𝑃
= 2) any random partition of 3 points can be learned.

Fig. 5: In 2D we can shatter any three non-collinear points

5.4.5 Ridge Fisher's linear classification (L2-regularization)

When the matrix S_W is not full rank or P ≫ N, the estimate of the Fisher most
discriminant projection is not unique. This can be solved using a biased version of S_W:

    S_W^{Ridge} = S_W + 𝜆I


where I is the P × P identity matrix. This leads to the regularized (ridge) estimator of
Fisher's linear discriminant analysis:

    w_Ridge ∝ (S_W + 𝜆I)^{-1} (𝜇_1 − 𝜇_0)

Fig. 6: The Ridge Fisher most discriminant projection

Increasing 𝜆 will:
• shrink the coefficients toward zero;
• make the covariance converge toward the diagonal matrix, reducing the contribution of
the pairwise covariances.

5.4.6 Ridge logistic regression (L2-regularization)

The objective function to be minimized is now the combination of the logistic loss
(negative log-likelihood) −log ℒ(w) with a penalty of the L2 norm of the weights vector.
In the two-class case, using the 0/1 coding we obtain:

    min_w  Logistic Ridge(w) = −log ℒ(w, X, y) + 𝜆 ‖w‖_2^2

# Dataset
# Build a classification task using 3 informative features
from sklearn import datasets

X, y = datasets.make_classification(n_samples=100,
                                    n_features=20,
                                    n_informative=3,
                                    n_redundant=0,
                                    n_repeated=0,
                                    n_classes=2,
                                    random_state=0,
                                    shuffle=False)

import numpy as np
import matplotlib.pyplot as plt

from sklearn import linear_model
l r =linear_model.LogisticRegression(C=1, solver='lbfgs')
# This class implements regularized logistic regression. C is the inverse of regularization
# strength. Large value => no regularization.

lr.fit(X, y)
y_pred_lr = lr.predict(X)

# Retrieve proba from coef vector


print(lr.coef_.shape)
probas =1 / (1 +np.exp(- (np.dot(X, lr.coef_.T) +lr.intercept_))).ravel()
print("Diff", np.max(np.abs(lr.predict_proba(X)[:, 1] - probas)))
#plt.plot(lr.predict_proba(X)[:, 1], probas, "ob")

errors =y_pred_lr != y
print("Nb errors=%i, error rate=%.2f" % (errors.sum(), errors.sum() / len(y)))
print(lr.coef_)

(1, 20)
Diff 0.0
Nb errors=26, error rate=0.26
[[- 0.7579822 -0.01228473 -0.11412421 0.25491221 0.4329847
0.12899737
0.14564739 0.16763962 0.85071394 0.02116803 -0.1611039 -0.0146019
-0.03399884 0.43127728 -0.05831644 -0.0812323 0.15877844 0.29387389
0.54659524 0.03376169]]

5.4.7 Lasso logistic regression (L1-regularization)

The objective function to be minimized is now the combination of the logistic loss −log ℒ(w)
with a penalty of the L1 norm of the weights vector. In the two-class case, using the 0/1
coding we obtain:

    min_w  Logistic Lasso(w) = −log ℒ(w, X, y) + 𝜆 ‖w‖_1
from sklearn import linear_model

# The l1 penalty requires the liblinear (or saga) solver
lrl1 = linear_model.LogisticRegression(penalty='l1', solver='liblinear')
# This class implements regularized logistic regression. C is the inverse of regularization
# strength. Large value => no regularization.

lrl1.fit(X, y)
y_pred_lrl1 = lrl1.predict(X)

errors =y_pred_lrl1 != y
print("Nb errors=%i, error rate=%.2f" % (errors.sum(), errors.sum() / len(y_pred_lrl1)))
print(lrl1.coef_)

Nb errors=27, error rate=0.27
[[-0.1133675   0.681581    0.          0.          0.19755021  0.36482981
   0.08056873  0.06205466  0.76016779  0.         -0.10808542  0.
   0.          0.33750372  0.          0.          0.07903305  0.2015893
   0.48384297  0.        ]]

5.4.8 Ridge linear Support Vector Machine (L2-regularization)

Support Vector Machines seek a separating hyperplane with maximum margin to enforce
robustness against noise. Like logistic regression it is a discriminative method that only
focuses on predictions.
Here we present the non-separable case of Maximum Margin Classifiers with ±1 coding
(i.e.: y_i ∈ {−1, +1}). In the next figure the legend applies to samples of the "dot" class.

Fig. 7: Linear large margin classifiers

Linear SVM for classification (also called SVM-C or SVC) minimizes:

    min_w  Linear SVM(w) = penalty(w) + C Hinge loss(w) = ‖w‖_2^2 + C ∑_i^N 𝜉_i
    with  ∀i  y_i (w · x_i) ≥ 1 − 𝜉_i

Here we introduced the slack variables 𝜉_i, with 𝜉_i = 0 for points that are on or inside
the correct margin boundary and 𝜉_i = |y_i − (w · x_i)| for other points. Thus:
1. If y_i (w · x_i) ≥ 1 then the point lies outside the margin but on the correct side of
the decision boundary. In this case 𝜉_i = 0. The constraint is thus not active for this
point. It does not contribute to the prediction.
2. If 1 > y_i (w · x_i) ≥ 0 then the point lies inside the margin and on the correct
side of the decision boundary. In this case 0 < 𝜉_i ≤ 1. The constraint is active for
this point. It does contribute to the prediction as a support vector.
3. If y_i (w · x_i) < 0 then the point is on the wrong side of the decision boundary
(misclassification). In this case 𝜉_i > 1. The constraint is active for this point.
It does contribute to the prediction as a support vector.
This loss is called the hinge loss, defined as:

    max(0, 1 − y_i (w · x_i))

So linear SVM is close to ridge logistic regression, using the hinge loss instead of the
logistic loss. Both will provide very similar predictions.


from sklearn import svm

svmlin = svm.LinearSVC()
# Remark: by default LinearSVC uses squared_hinge as loss
svmlin.fit(X, y)
y_pred_svmlin = svmlin.predict(X)

errors =y_pred_svmlin != y
print("Nb errors=%i, error rate=%.2f" % (errors.sum(), errors.sum() / len(y_pred_svmlin)))
print(svmlin.coef_)

Nb errors=26, error rate=0.26
[[-0.056121   -0.05149163  0.09940419  0.17726429  0.31189589  0.3533912
   0.06519484  0.08921394  0.00271961  0.00601045 -0.06201172 -0.00741169
  -0.02156861  0.18272446 -0.02162812 -0.04061099  0.07204358  0.13083485
   0.23721453  0.0082412 ]]

5.4.9 Lasso linear Support Vector Machine (L1-regularization)

Linear SVM for classification (also called SVM-C or SVC) with l1-regularization:

    min_w  F Lasso linear SVM(w) = 𝜆 ‖w‖_1 + C ∑_i^N 𝜉_i
    with  ∀i  y_i (w · x_i) ≥ 1 − 𝜉_i
from sklearn import svm

svmlinl1 =svm.LinearSVC(penalty='l1', dual=False)


# Remark: by default LinearSVC uses squared_hinge as loss

svmlinl1.fit(X, y)
y_pred_svmlinl1 =svmlinl1.predict(X)

errors =y_pred_svmlinl1 != y
print("Nb errors=%i, error rate=%.2f" % (errors.sum(), errors.sum() / len(y_pred_
˓→ svmlinl1)))

print(svmlinl1.coef_)

Nb errors=26, error rate=0.26
[[-0.0533391  -0.03541637  0.09261429  0.16763337  0.29934574  0.3406516
   0.05808022  0.07587758  0.          0.         -0.0555901  -0.00194174
  -0.01312517  0.16866053 -0.01450499 -0.02500558  0.06073932  0.11738939
   0.22485446  0.00473282]]

Exercise

Compare the predictions of Logistic regression (LR) and their SVM counterparts, i.e.: L2 LR
vs L2 SVM and L1 LR vs L1 SVM.
• Compute the correlation between pairs of weights vectors.
• Compare the predictions of two classifiers using their decision function:

– Give the equation of the decision function for a linear classifier, assuming that
there is no intercept.


– Compute the correlation between the decision functions.


– Plot the pairwise decision function of the classifiers.
• Conclude on the differences between Linear SVM and logistic regression.

5.4.10 Elastic-net classification (L2-L1-regularization)

The objective function to be minimized is now the combination of the logistic loss
−log ℒ(w) or the hinge loss with a combination of L1 and L2 penalties. In the two-class
case, using the 0/1 coding we obtain:

    min_w  Logistic enet(w) = −log ℒ(w, X, y) + 𝛼 ( 𝜌 ‖w‖_1 + (1 − 𝜌) ‖w‖_2^2 )        (5.48)
    min_w  Hinge enet(w) = Hinge loss(w) + 𝛼 ( 𝜌 ‖w‖_1 + (1 − 𝜌) ‖w‖_2^2 )             (5.49)
from sklearn import datasets
from sklearn import linear_model as lm
import matplotlib.pyplot as plt

X, y = datasets.make_classification(n_samples=100,
                                    n_features=20,
                                    n_informative=3,
                                    n_redundant=0,
                                    n_repeated=0,
                                    n_classes=2,
                                    random_state=0,
                                    shuffle=False)

enetloglike = lm.SGDClassifier(loss="log", penalty="elasticnet",
                               alpha=0.0001, l1_ratio=0.15,
                               class_weight='balanced')
enetloglike.fit(X, y)

enethinge = lm.SGDClassifier(loss="hinge", penalty="elasticnet",
                             alpha=0.0001, l1_ratio=0.15,
                             class_weight='balanced')
enethinge.fit(X, y)

SGDClassifier(alpha=0.0001, average=False, class_weight='balanced',
       early_stopping=False, epsilon=0.1, eta0=0.0, fit_intercept=True,
       l1_ratio=0.15, learning_rate='optimal', loss='hinge', max_iter=1000,
       n_iter_no_change=5, n_jobs=None, penalty='elasticnet', power_t=0.5,
       random_state=None, shuffle=True, tol=0.001, validation_fraction=0.1,
       verbose=0, warm_start=False)

Exercise

Compare the predictions of the Elastic-net Logistic regression (LR) and the Hinge-loss
Elastic-net classifier.

• Compute the correlation between pairs of weights vectors.
• Compare the predictions of two classifiers using their decision function:

– Compute the correlation between the decision functions.


– Plot the pairwise decision function of the classifiers.


• Conclude on the differences between the two losses.

5.4.11 Metrics of classification performance evaluation

Metrics for binary classification

source: https://en.wikipedia.org/wiki/Sensitivity_and_specificity
Imagine a study evaluating a new test that screens people for a disease. Each person taking
the test either has or does not have the disease. The test outcome can be positive
(classifying the person as having the disease) or negative (classifying the person as not
having the disease). The test results for each subject may or may not match the subject’s
actual status. In that setting:
• True positive (TP): Sick people correctly identified as sick
• False positive (FP): Healthy people incorrectly identified as sick
• True negative (TN): Healthy people correctly identified as healthy
• False negative (FN): Sick people incorrectly identified as healthy
• Accuracy (ACC):
ACC = (TP + TN) / (TP + FP + FN + TN)
• Sensitivity (SEN) or recall of the positive class or true positive rate (TPR) or hit
rate: SEN = TP / P = TP / (TP+FN)
• Specificity (SPC) or recall of the negative class or true negative
rate: SPC = TN / N = TN / (TN+FP)
• Precision or positive predictive value
(PPV): PPV = TP / (TP + FP)
• Balanced accuracy (bACC): a useful performance measure that avoids inflated
performance estimates on imbalanced datasets (Brodersen, et al. (2010). "The balanced
accuracy and its posterior distribution"). It is defined as the arithmetic mean of
sensitivity and specificity, or the average accuracy obtained on either class:
bACC = 1/2 (SEN + SPC)
• F1 score (or F-score), which is a weighted average of precision and recall, is useful
to deal with imbalanced datasets.
The four outcomes can be formulated in a 2×2 contingency table or confusion matrix:
https://en.wikipedia.org/wiki/Sensitivity_and_specificity
For more precision see: http://scikit-learn.org/stable/modules/model_evaluation.html

from sklearn import metrics
y_pred = [0, 1, 0, 0]
y_true = [0, 1, 0, 1]

metrics.accuracy_score(y_true, y_pred)

# The overall precision and recall
metrics.precision_score(y_true, y_pred)
metrics.recall_score(y_true, y_pred)

# Recalls on individual classes: SEN & SPC
recalls = metrics.recall_score(y_true, y_pred, average=None)
recalls[0]  # is the recall of class 0: specificity
recalls[1]  # is the recall of class 1: sensitivity

# Balanced accuracy
b_acc = recalls.mean()

# The overall precision and recall on each individual class
p, r, f, s = metrics.precision_recall_fscore_support(y_true, y_pred)

Significance of classification rate
P-value associated with a classification rate. Compare the number of correct
classifications (= accuracy × 𝑁) to the null hypothesis of a Binomial distribution of
parameters 𝑝 (typically the 50% chance level) and 𝑁 (the number of observations).
Is 65% of accuracy a significant prediction rate among 70 observations?
Since this is an exact, two-sided test of the null hypothesis, the p-value can be divided by
2 since we test that the accuracy is superior to the chance level.

import scipy.stats

acc, N =0.65, 70
pval =scipy.stats.binom_test(x=int(acc * N), n=N, p=0.5) / 2
print(pval)

0.01123144774625465

Area Under Curve (AUC) of Receiver operating characteristic (ROC)

Some classifiers may have found a good discriminative projection w. However if the
threshold used to decide the final predicted class is poorly adjusted, the performances will
show a high specificity and a low sensitivity, or the contrary.
In this case it is recommended to use the AUC of a ROC analysis, which basically provides a
measure of overlap of the two classes when points are projected on the discriminative
axis. For more detail on ROC and AUC see:
https://en.wikipedia.org/wiki/Receiver_operating_characteristic.
from sklearn import metrics
score_pred =np.array([.1 ,.2, .3, .4, .5, .6, .7, .8])
y_true = np.array([0, 0, 0, 0, 1, 1, 1, 1])
thres =.9
y_pred =(score_pred >thres).astype(int)

print("Predictions:", y_pred)
metrics.accuracy_score(y_true, y_pred)
# The overall precision an recall on each individual class
p, r , f , s =metrics.precision_recall_fscore_support(y_true, y_pred)
print("Recalls:", r )
# 100% of specificity, 0% of sensitivity

# However AUC=1 indicating a perfect separation of the two classes


auc =metrics.roc_auc_score(y_true, score_pred)
print("AUC:", auc)

Predictions: [0 0 0 0 0 0 0 0]
Recalls: [1. 0.]
AUC: 1.0

5.4.12 Imbalanced classes

Learning with discriminative (logistic regression, SVM) methods is generally based on
minimizing the misclassification of training samples, which may be unsuitable for
imbalanced datasets where the recognition might be biased in favor of the most numerous
class. This problem can be addressed with a generative approach, which typically requires
more parameters to be determined, leading to reduced performance in high dimensions.
Dealing with imbalanced classes may be addressed in three main ways (see Japkowicz and
Stephen (2002) for a review): resampling, reweighting and one-class learning.
In sampling strategies, either the minority class is oversampled or the majority class is
undersampled, or some combination of the two is deployed. Undersampling (Zhang and
Mani, 2003) the majority class would lead to a poor usage of the left-out samples.
Sometimes one cannot afford such a strategy since we are also facing a small sample size
problem even for the majority class. Informed oversampling, which goes beyond a trivial
duplication of minority class samples, requires the estimation of class conditional
distributions in order to generate synthetic samples. Here generative models are
required. An alternative, proposed in (Chawla et al., 2002), generates samples along the
line segments joining any/all of the k minority class nearest neighbors. Such a procedure
blindly generalizes the minority area without regard to the majority class, which may be
particularly problematic with high-dimensional and potentially skewed class distributions.
Reweighting, also called cost-sensitive learning, works at an algorithmic level by
adjusting the costs of the various classes to counter the class imbalance. Such
reweighting can be implemented within SVM (Chang and Lin, 2001) or logistic
regression (Friedman et al., 2010) classifiers. Most classifiers of scikit-learn offer such
reweighting possibilities.
The class_weight parameter can be set to the "balanced" mode, which uses the values of y
to automatically adjust weights inversely proportional to class frequencies in the input
data, as 𝑁/(2𝑁_k).

import numpy as np
from sklearn import linear_model
from sklearn import datasets
from sklearn import metrics
import matplotlib.pyplot as plt

# dataset


X, y = datasets.make_classification(n_samples=500,
                                    n_features=5,
                                    n_informative=2,
                                    n_redundant=0,
                                    n_repeated=0,
                                    n_classes=2,
                                    random_state=1,
                                    shuffle=False)

print(*["#samples of class %i = %i;" % (lev, np.sum(y == lev)) for lev in np.unique(y)])

print('# No Reweighting balanced dataset')


lr_inter =linear_model.LogisticRegression(C=1)
lr_inter.fit(X, y)
p, r , f , s =metrics.precision_recall_fscore_support(y, lr_inter.predict(X))
print("SPC: %.3f; SEN: %.3f" % tuple(r))
print('# =>The predictions are balanced in sensitivity and specificity\n')

# Create an imbalanced dataset by subsampling class 0: keep only ~5% of
# class 0's samples and all class 1's samples.
n0 = int(np.rint(np.sum(y == 0) / 20))
subsample_idx = np.concatenate((np.where(y == 0)[0][:n0], np.where(y == 1)[0]))
Ximb =X[subsample_idx, : ]
yimb =y[subsample_idx]
print(*["#samples of class %i =%i;" % (l ev, np.sum(yimb ==lev)) for lev in
np.unique(yimb)])

print('# No Reweighting on imbalanced dataset')


lr_inter = linear_model.LogisticRegression(C=1)
lr_inter.fit(Ximb, yimb)
p, r , f , s = metrics.precision_recall_fscore_support(yimb, lr_inter.predict(Ximb))
print("SPC: %.3f; SEN: %.3f" % tuple(r))
print('# =>Sensitivity >> specificity\n')

print('# Reweighting on imbalanced dataset')


lr_inter_reweight =linear_model.LogisticRegression(C=1, class_weight="balanced")
lr_inter_reweight.fit(Ximb, yimb)
p, r, f, s = metrics.precision_recall_fscore_support(yimb, lr_inter_reweight.predict(Ximb))
print("SPC: %.3f; SEN: %.3f" % tuple(r))
print('# =>The predictions are balanced in sensitivity and specificity\n')
#samples of class 0 =250; #samples of class 1 =250;
# No Reweighting balanced dataset
SPC: 0.940; SEN: 0.928
# => The predictions are balanced in sensitivity and specificity

#samples of class 0 =12; #samples of class 1 =250;
# No Reweighting on imbalanced dataset
SPC: 0.750; SEN: 0.992
# => Sensitivity >> specificity

# Reweighting on imbalanced dataset
SPC: 1.000; SEN: 0.972
# => The predictions are balanced in sensitivity and specificity


5.4.13 Exercise

Fisher linear discriminant rule

Write a class FisherLinearDiscriminant that implements Fisher's linear discriminant
analysis. This class must be compliant with the scikit-learn API by providing two methods:
- fit(X, y), which fits the model and returns the object itself; - predict(X), which returns
a vector of the predicted values. Apply the object on the dataset presented for the LDA.

5.5 Non linear learning algorithms

5.5.1 Support Vector Machines (SVM)

SVMs are kernel-based methods that require only a user-specified kernel function K(x_i, x_j),
i.e., a similarity function over pairs of data points (x_i, x_j), mapped into a kernel (dual)
space on which learning algorithms operate linearly, i.e. every operation on points is a
linear combination of K(x_i, x_j).
Outline of the SVM algorithm:
1. Map points x into kernel space using a kernel function: x → K(x, .).
2. Learning algorithms operate linearly by dot products in the high-dimensional kernel
space: K(., x_i) · K(., x_j).
• Using the kernel trick (Mercer's theorem), replace the dot product in the
high-dimensional space by a simpler operation such that K(., x_i) · K(., x_j) = K(x_i, x_j).
Thus we only need to compute a similarity measure for each pair of points and store it
in an N × N Gram matrix.
• Finally, the learning process consists of estimating the 𝛼_i of the decision
function that maximises the hinge loss (of f(x)) plus some penalty when applied on all
training points:

    f(x) = sign( ∑_i^N 𝛼_i y_i K(x_i, x) ).

3. Predict a new point x using the decision function.

Gaussian kernel (RBF, Radial Basis Function):

One of the most commonly used kernels is the Radial Basis Function (RBF) kernel. For a
pair of points x_i, x_j the RBF kernel is defined as:

    K(x_i, x_j) = exp( −‖x_i − x_j‖² / (2𝜎²) )        (5.50)
                = exp( −𝛾 ‖x_i − x_j‖² )              (5.51)

Where 𝜎 (or 𝛾) defines the kernel width parameter. Basically, we consider a Gaussian
function centered on each training sample x_i. It has a ready interpretation as a similarity
measure as it decreases with the squared Euclidean distance between the two feature vectors.
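A quick check (a sketch, not in the original text; the data and gamma value are arbitrary) that the RBF formula above matches scikit-learn's pairwise implementation:

import numpy as np
from sklearn.metrics.pairwise import rbf_kernel

np.random.seed(0)
X = np.random.randn(5, 3)
gamma = .5

# Pairwise squared Euclidean distances, then the RBF formula
sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
K_manual = np.exp(-gamma * sq_dists)
print(np.allclose(K_manual, rbf_kernel(X, gamma=gamma)))   # True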


Fig. 8: Support Vector Machines.

Non-linear SVMs also exist for regression problems.


%matplotlib inline
import warnings
warnings.filterwarnings(action='once')

import numpy as np
from sklearn.svm import SVC
from sklearn import datasets
import matplotlib.pyplot as plt

# dataset
X, y = datasets.make_classification(n_samples=10,
                                    n_features=2, n_redundant=0,
                                    n_classes=2,
                                    random_state=1,
                                    shuffle=False)
clf = SVC(kernel='rbf')  # , gamma=1
clf.fit(X, y)
print("#Errors: %i" % np.sum(y != clf.predict(X)))

clf.decision_function(X)

# Useful internals:
# Array of support vectors
clf.support_vectors_

# indices of support vectors within original X
np.all(X[clf.support_, :] == clf.support_vectors_)

#Errors: 0
True
5.5.2 Decision tree

A tree can be “learned” by splitting the training dataset into subsets based on a feature
value test.
Each internal node represents a “test” on an feature resulting on the split of the current
sample. At each step the algorithm selects the feature and a cutoff value that maximises a
given metric. Different metrics exist for regression tree (target is continuous) or
classification tree (the target is qualitative).
This process is repeated on each derived subset in a recursive manner called recursive
partition- ing. The recursion is completed when the subset at a node has all the same
value of the target variable, or when splitting no longer adds value to the predictions.
This general principle is implemented by many recursive partitioning tree algorithms.
Decision trees are simple to understand and interpret however they tend to overfit the
data. However decision trees tend to overfit the training set. Leo Breiman propose random
forest to deal with this issue.
A single decision tree is usually overfits the data it is learning from because it learn from
only one pathway of decisions. Predictions from a single decision tree usually don’t make
accurate predictions on new data.


Fig. 9: Classification tree.

5.5.3 Random forest

A random forest is a meta estimator that fits a number of decision tree learners on
various sub-samples of the dataset and use averaging to improve the predictive accuracy
and control over-fitting.
Random forest models reduce the risk of overfitting by introducing randomness by:
• building multiple trees (n_estimators)
• drawing observations with replacement (i.e., a bootstrapped sample)
• splitting nodes on the best split among a random subset of the features selected at
every node

from sklearn.ensemble import RandomForestClassifier

forest = RandomForestClassifier(n_estimators=100)
forest.fit(X, y)

print("#Errors: %i" % np.sum(y != forest.predict(X)))


#Errors: 0


5.5.4 Extra Trees (Low Variance)

Extra Trees is like Random Forest, in that it builds multiple trees and splits nodes using
random subsets of features, but with two key differences: it does not bootstrap
observations (meaning it samples without replacement), and nodes are split on random
splits, not best splits. In summary, ExtraTrees:
• builds multiple trees with bootstrap=False by default, which means it samples without
replacement;
• splits nodes based on random splits among a random subset of the features selected at
every node.
In Extra Trees, randomness doesn't come from bootstrapping the data, but rather from the
random splits of all observations. ExtraTrees stands for Extremely Randomized Trees.
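A minimal usage sketch (not in the original text; it reuses the X, y arrays and the numpy import from the SVC example above):

from sklearn.ensemble import ExtraTreesClassifier

extra = ExtraTreesClassifier(n_estimators=100, random_state=1)
extra.fit(X, y)
print("#Errors: %i" % np.sum(y != extra.predict(X)))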

5.6 Resampling Methods


import matplotlib.pyplot as plt
import warnings

warnings.filterwarnings('ignore')
%matplotlib inline

5.6.1 Train, Validation and Test Sets

Machine learning algorithms overfit training data. Predictive performances MUST be
evaluated on an independent hold-out dataset.

Fig. 10: Train, Validation and Test Sets.

1. Training dataset: Dataset used to fit the model (set the model parameters like
weights). The training error can be easily calculated by applying the statistical
learning method to the observations used in its training. But because of overfitting,
the training error rate can dramatically underestimate the error that would be
obtained on new samples.


2. Validation dataset: Dataset used to provide an unbiased evaluation of a model fit on
the training dataset while tuning model hyperparameters. The validation error is the
average error that results from a learning method predicting the response on new
(validation) samples, that is, on samples that were not used in training the method.
3. Test dataset: Dataset used to provide an unbiased evaluation of a final model fit on
the training dataset. It is only used once a model is completely trained (using the
train and validation sets).
What is the Difference Between Test and Validation Datasets? by Jason Brownlee
Thus the original dataset is generally split into a training, a validation and a test data
set. A large training+validation set (80%) and a small test set (20%) might provide a poor
estimation of the predictive performances (the same argument stands for train vs validation
samples). On the contrary, a large test set and a small training set might produce a poorly
estimated learner. This is why, in situations where we cannot afford such a split, it is
recommended to use a cross-validation scheme to estimate the predictive power of a learning
algorithm.

5.6.2 Cross-Validation (CV)

The Cross-Validation scheme randomly divides the set of observations into 𝐾 groups, or
folds, of approximately equal size. The first fold is treated as a validation set, and the
method 𝑓() is fitted on the remaining union of 𝐾 − 1 folds: 𝑓(𝑋^{−𝐾}, 𝑦^{−𝐾}).
The measure of performance (the score function 𝑆), either an error measure or a
correct-prediction measure, is an average of a loss or correct-prediction measure, noted ℒ,
between a true target value and the predicted target value. The score function is evaluated
on the observations in the held-out fold. For each sample 𝑖 we consider the model
𝑓(𝑋^{−𝑘(𝑖)}, 𝑦^{−𝑘(𝑖)}) estimated on the data set without the group 𝑘(𝑖) that contains 𝑖.
This procedure is repeated 𝐾 times; each time, a different group of observations is treated
as a test set. Then we compare the predicted value 𝑓^{−𝑘(𝑖)}(𝑥_𝑖) = 𝑦̂_𝑖 with the true
value 𝑦_𝑖 using an error or loss function ℒ(𝑦, 𝑦̂).
For 10-fold CV we can either average over the 10 values (macro measure) or concatenate the
10 experiments and compute the micro measures.
Two strategies, micro vs macro estimates:
Micro measure: average(individual scores): compute a score 𝑆 for each sample and average
over all samples. It is similar to an averaged score computed over all concatenated samples:

    𝑆(𝑓) = (1/𝑁) ∑_𝑖^𝑁 ℒ( 𝑦_𝑖, 𝑓^{−𝑘(𝑖)}(𝑥_𝑖) ).

Macro measure: mean(CV scores) (the most commonly used method): compute a score 𝑆 on each
fold 𝑘 and average across folds:

    𝑆(𝑓) = (1/𝐾) ∑_𝑘^𝐾 𝑆_𝑘(𝑓),  with  𝑆_𝑘(𝑓) = (1/𝑁_𝑘) ∑_{𝑖∈𝑘} ℒ( 𝑦_𝑖, 𝑓^{−𝑘(𝑖)}(𝑥_𝑖) ).


These two measures (an average of averages vs. a global average) are generally similar. They
may differ slightly if folds are of different sizes.
This validation scheme is known as K-Fold CV. Typical choices of 𝐾 are 5 or 10,
[Kohavi 1995]. The extreme case where 𝐾 = 𝑁 is known as leave-one-out cross-validation,
LOO-CV.
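A tiny sketch (not in the original text; the toy array is only for illustration) showing that leave-one-out is indeed the 𝐾 = 𝑁 case of K-Fold:

import numpy as np
from sklearn.model_selection import KFold, LeaveOneOut

X = np.arange(8).reshape(4, 2)
print([list(test) for _, test in KFold(n_splits=4).split(X)])   # [[0], [1], [2], [3]]
print([list(test) for _, test in LeaveOneOut().split(X)])       # same folds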

CV for regression

Usually the error function ℒ() is the r-squared score. However other functions could be used.

%matplotlib inline
import warnings
warnings.filterwarnings(action='once')

import numpy as np
from sklearn import datasets
import sklearn.linear_model as lm
import sklearn.metrics as metrics
from sklearn.model_selection import KFold

X, y = datasets.make_regression(n_samples=100, n_features=100,
                                n_informative=10, random_state=42)
estimator = lm.Ridge(alpha=10)

cv = KFold(n_splits=5, random_state=42)
r2_train, r2_test = list(), list()

for train, test in cv.split(X):
    estimator.fit(X[train, :], y[train])
    r2_train.append(metrics.r2_score(y[train], estimator.predict(X[train, :])))
    r2_test.append(metrics.r2_score(y[test], estimator.predict(X[test, :])))
print("Train r2:%.2f" % np.mean(r2_train))
print("Test r2:%.2f" % np.mean(r2_test))

Train r2:0.99
Test r2:0.73
Scikit-learn provides a user-friendly function to perform CV:
from sklearn.model_selection import cross_val_score

scores =cross_val_score(estimator=estimator, X=X, y=y, cv=5)


print("Test r2:%.2f" % scores.mean())

# provide a cv
cv =KFold(n_splits=5, random_state=42)
scores =cross_val_score(estimator=estimator, X=X, y=y, cv=cv)
print("Test r2:%.2f" % scores.mean())

Test r2:0.73
Test r2:0.73


CV for classification

With classification problems it is essential to sample folds where each set contains approximately the same percentage of samples of each target class as the complete set. This is called stratification. In this case, we will use StratifiedKFold, which is a variation of k-fold that returns stratified folds.
Usually the error functions ℒ() are, at least, the sensitivity and the specificity. However, other functions could be used.
import numpy as np
from sklearn import datasets
import sklearn.linear_model as lm
import sklearn.metrics as metrics
from sklearn.model_selection import StratifiedKFold

X, y = datasets.make_classification(n_samples=100, n_features=100,
                                    n_informative=10, random_state=42)

estimator = lm.LogisticRegression(C=1, solver='lbfgs')

cv = StratifiedKFold(n_splits=5)

# Lists to store scores by folds (for macro measure only)
recalls_train, recalls_test, acc_test = list(), list(), list()

# Or vector of test predictions (for both macro and micro measures, not for training samples)
y_test_pred = np.zeros(len(y))

for train, test in cv.split(X, y):
    estimator.fit(X[train, :], y[train])
    recalls_train.append(metrics.recall_score(y[train], estimator.predict(X[train, :]), average=None))
    recalls_test.append(metrics.recall_score(y[test], estimator.predict(X[test, :]), average=None))
    acc_test.append(metrics.accuracy_score(y[test], estimator.predict(X[test, :])))

    # Store test predictions (for micro measures)
    y_test_pred[test] = estimator.predict(X[test, :])

print("== Macro measures ==")
# Use lists of per-fold scores
recalls_train = np.array(recalls_train)
recalls_test = np.array(recalls_test)
print("Train SPC:%.2f; SEN:%.2f" % tuple(recalls_train.mean(axis=0)))
print("Test SPC:%.2f; SEN:%.2f" % tuple(recalls_test.mean(axis=0)))
print("Test ACC:%.2f, balanced ACC:%.2f" %
      (np.mean(acc_test), recalls_test.mean()), "Folds:", acc_test)

# Same accuracies recomputed from the vector of test predictions
acc_test = [metrics.accuracy_score(y[test], y_test_pred[test]) for train, test in cv.split(X, y)]
print("Test ACC:%.2f" % np.mean(acc_test), "Folds:", acc_test)

print("== Micro measures ==")
print("Test SPC:%.2f; SEN:%.2f" %
      tuple(metrics.recall_score(y, y_test_pred, average=None)))
print("Test ACC:%.2f" % metrics.accuracy_score(y, y_test_pred))

== Macro measures ==
Train SPC:1.00; SEN:1.00
Test SPC:0.78; SEN:0.82
Test ACC:0.80, balanced ACC:0.80 Folds: [0.9, 0.7, 0.95, 0.7, 0.75]
Test ACC:0.80 Folds: [0.9, 0.7, 0.95, 0.7, 0.75]
== Micro measures ==
Test SPC:0.78; SEN:0.82
Test ACC:0.80

Scikit-learn provides a user-friendly function to perform CV:
from sklearn.model_selection import cross_val_score

scores = cross_val_score(estimator=estimator, X=X, y=y, cv=5)
scores.mean()

# provide CV and score
def balanced_acc(estimator, X, y, **kwargs):
    '''
    Balanced accuracy scorer
    '''
    return metrics.recall_score(y, estimator.predict(X), average=None).mean()

scores = cross_val_score(estimator=estimator, X=X, y=y, cv=5, scoring=balanced_acc)
print("Test ACC:%.2f" % scores.mean())

Test ACC:0.80

Note that with the Scikit-learn user-friendly function we average the scores obtained on individual folds (an average of averages), which may provide slightly different results than the overall average presented earlier.

5.6.3 Parallel computation with joblib

Dataset

import numpy as np
from sklearn import datasets
import sklearn.linear_model as lm
import sklearn.metrics as metrics
from sklearn.model_selection import StratifiedKFold

X, y = datasets.make_classification(n_samples=20, n_features=5,
                                    n_informative=2, random_state=42)
cv = StratifiedKFold(n_splits=5)

Use the cross_validate function


from sklearn.model_selection import cross_validate

estimator = lm.LogisticRegression(C=1, solver='lbfgs')
cv_results = cross_validate(estimator, X, y, cv=cv, n_jobs=5)
print(np.mean(cv_results['test_score']), cv_results['test_score'])

0.8 [0.5 0.5 1. 1. 1. ]

Sequential computation


If we want to have full control of the operations performed within each fold (to retrieve the model parameters, etc.), we would like to parallelize the following sequential code:

estimator = lm.LogisticRegression(C=1, solver='lbfgs')

y_test_pred_seq = np.zeros(len(y))  # Store predictions in the original order
coefs_seq = list()
for train, test in cv.split(X, y):
    X_train, X_test, y_train, y_test = X[train, :], X[test, :], y[train], y[test]
    estimator.fit(X_train, y_train)
    y_test_pred_seq[test] = estimator.predict(X_test)
    coefs_seq.append(estimator.coef_)

test_accs = [metrics.accuracy_score(y[test], y_test_pred_seq[test]) for train, test in cv.split(X, y)]

print(np.mean(test_accs), test_accs)
coefs_cv = np.array(coefs_seq)
print(coefs_cv)
print(coefs_cv.mean(axis=0))
print("Std Err of the coef")
print(coefs_cv.std(axis=0) / np.sqrt(coefs_cv.shape[0]))

0.8 [0.5, 0.5, 1.0, 1.0, 1.0]
[[[-0.87692513  0.6260013   1.18714373 -0.30685978 -0.38037393]]

 [[-0.7464993   0.6213816   1.10144804  0.19800115 -0.40112109]]

 [[-0.96020317  0.5113513   1.1210943   0.08039112 -0.2643663 ]]

 [[-0.85755505  0.5201055   1.06637346 -0.10994258 -0.29152132]]

 [[-0.89914467  0.5148148   1.08675378 -0.24767837 -0.27899525]]]
[[-0.86806546  0.55873093  1.11256266 -0.07721769 -0.32327558]]
Std Err of the coef
[[0.03125544 0.02376198 0.01850211 0.08566194 0.02510739]]

Parallel computation with joblib
from sklearn.externals.joblib import Parallel, delayed
from sklearn.base import is_classifier, clone

def _split_fit_predict(estimator, X, y, train, test):
    X_train, X_test, y_train, y_test = X[train, :], X[test, :], y[train], y[test]
    estimator.fit(X_train, y_train)
    return [estimator.predict(X_test), estimator.coef_]

estimator = lm.LogisticRegression(C=1, solver='lbfgs')

parallel = Parallel(n_jobs=5)
cv_ret = parallel(
    delayed(_split_fit_predict)(
        clone(estimator), X, y, train, test)
    for train, test in cv.split(X, y))

y_test_pred_cv, coefs_cv = zip(*cv_ret)

# Retrieve predictions in the original order
y_test_pred = np.zeros(len(y))
for i, (train, test) in enumerate(cv.split(X, y)):
    y_test_pred[test] = y_test_pred_cv[i]

test_accs = [metrics.accuracy_score(y[test], y_test_pred[test]) for train, test in cv.split(X, y)]
print(np.mean(test_accs), test_accs)

0.8 [0.5, 0.5, 1.0, 1.0, 1.0]

Test that we obtain the same predictions and the same coefficients:

assert np.all(y_test_pred == y_test_pred_seq)
assert np.allclose(np.array(coefs_cv).squeeze(), np.array(coefs_seq).squeeze())

5.6.4 CV for model selection: setting the hyper parameters

It is important to note that CV may be used for two separate goals:

1. Model assessment: having chosen a final model, estimating its prediction error (generalization error) on new data.
2. Model selection: estimating the performance of different models in order to choose the best one. One special case of model selection is the selection of the model's hyper parameters. Indeed, remember that most learning algorithms have hyper parameters (typically the regularization parameter) that have to be set.
Generally we must address the two problems simultaneously. The usual approach for both
problems is to randomly divide the dataset into three parts: a training set, a validation
set, and a test set.
• The training set (train) is used to fit the models;
• the validation set (val) is used to estimate prediction error for model selection or to
determine the hyper parameters over a grid of possible values.
• the test set (test) is used for assessment of the generalization error of the final
chosen model.
Grid search procedure
Model selection of the best hyper parameters over a grid of possible values


For each possible value of the hyper parameter α_k, the grid search procedure is (see the sketch below):

1. Fit the learner on the training set: f(X_train, y_train, α_k).

2. Evaluate the model on the validation set and keep the parameter(s) that minimise the error measure:

   α* = arg min_{α_k} ℒ(f(X_val, α_k), y_val)

3. Refit the learner on all training + validation data using the best hyper parameters:

   f* ≡ f(X_train∪val, y_train∪val, α*)

4. Model assessment of f* on the test set: ℒ(f*(X_test), y_test).
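A minimal sketch of this procedure on a single train/validation/test split (the Ridge learner, the grid of alphas and the data are purely illustrative):

import numpy as np
from sklearn import datasets
import sklearn.linear_model as lm
import sklearn.metrics as metrics
from sklearn.model_selection import train_test_split

X, y = datasets.make_regression(n_samples=200, n_features=50, n_informative=5,
                                noise=10, random_state=42)
X_tmp, X_test, y_tmp, y_test = train_test_split(X, y, test_size=0.25, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(X_tmp, y_tmp, test_size=0.25, random_state=42)

# 1-2) Fit on train for each candidate alpha, evaluate on validation, keep the best
alphas = [0.01, 0.1, 1, 10, 100]
val_scores = [metrics.r2_score(y_val, lm.Ridge(alpha=a).fit(X_train, y_train).predict(X_val))
              for a in alphas]
alpha_best = alphas[int(np.argmax(val_scores))]  # maximize r2, i.e. minimize the error

# 3) Refit on train + validation with the best alpha
model = lm.Ridge(alpha=alpha_best).fit(X_tmp, y_tmp)

# 4) Model assessment on the test set
print("alpha*:", alpha_best, "Test r2: %.2f" % metrics.r2_score(y_test, model.predict(X_test)))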

Nested CV for model selection and assessment

Most of the time, we cannot afford such a three-way split. Thus, again we will use CV, but in this case we need two nested CVs.
One outer CV loop, for model assessment. This CV performs K splits of the dataset into a training plus validation set (X_-K, y_-K) and a test set (X_K, y_K).
One inner CV loop, for model selection. For each run of the outer loop, the inner loop performs L splits of the dataset (X_-K, y_-K) into a training set (X_-K,-L, y_-K,-L) and a validation set (X_-K,L, y_-K,L).

Implementation with scikit-learn

Note that the inner CV loop combined with the learner forms a new learner with an automatic model (parameter) selection procedure. This new learner can be easily constructed using Scikit-learn: the learner is wrapped inside a GridSearchCV class. Then the new learner can be plugged into the classical outer CV loop.

import numpy as np
from sklearn import datasets
import sklearn.linear_model as lm
from sklearn.model_selection import GridSearchCV
import sklearn.metrics as metrics
from sklearn.model_selection import KFold

# Dataset
noise_sd = 10
X, y, coef = datasets.make_regression(n_samples=50, n_features=100, noise=noise_sd,
                                      n_informative=2, random_state=42, coef=True)

print("SNR:", np.std(np.dot(X, coef)) / noise_sd)

# Wrap the learner in a GridSearchCV (inner CV over an illustrative grid of alpha and l1_ratio)
param_grid = {'alpha': 10. ** np.arange(-3, 3), 'l1_ratio': [.1, .5, .9]}
model = GridSearchCV(lm.ElasticNet(max_iter=10000), param_grid, cv=5)

SNR: 2.6358469446381614

1) Biased usage: fit on all data, omit the outer CV loop

model.fit(X, y)
print("Train r2:%.2f" % metrics.r2_score(y, model.predict(X)))
print(model.best_params_)

Train r2:0.96
{'alpha': 1.0, 'l1_ratio': 0.9}

2) User made outer CV, useful to extract specific information

cv = KFold(n_splits=5, random_state=42)
r2_train, r2_test = list(), list()
alphas = list()

for train, test in cv.split(X, y):
    X_train, X_test, y_train, y_test = X[train, :], X[test, :], y[train], y[test]
    model.fit(X_train, y_train)

    r2_test.append(metrics.r2_score(y_test, model.predict(X_test)))
    r2_train.append(metrics.r2_score(y_train, model.predict(X_train)))

    alphas.append(model.best_params_)

print("Train r2:%.2f" % np.mean(r2_train))
print("Test r2:%.2f" % np.mean(r2_test))
print("Selected alphas:", alphas)

Train r2:1.00
Test r2:0.55
Selected alphas: [{'alpha': 0.001, 'l1_ratio': 0.9}, {'alpha': 0.001, 'l1_ratio': 0.9},
                  {'alpha': 0.001, 'l1_ratio': 0.9}, {'alpha': 0.01, 'l1_ratio': 0.9},
                  {'alpha': 0.001, 'l1_ratio': 0.9}]

3) User-friendly sklearn for outer CV

from sklearn.model_selection import cross_val_score
scores = cross_val_score(estimator=model, X=X, y=y, cv=cv)
print("Test r2:%.2f" % scores.mean())

Test r2:0.55

Regression models with built-in cross-validation

Sklearn will automatically select a grid of parameters; most of the time the default values are used. n_jobs is the number of CPUs to use during the cross validation; if -1, all CPUs are used.

from sklearn import datasets
import sklearn.linear_model as lm
import sklearn.metrics as metrics
from sklearn.model_selection import cross_val_score

# Dataset
X, y, coef = datasets.make_regression(n_samples=50, n_features=100, noise=10,
                                      n_informative=2, random_state=42, coef=True)

print("== Ridge (L2 penalty) ==")


model = lm.RidgeCV(cv=3)
# Let sklearn select a l i s t of alphas with default LOO-CV
scores = cross_val_score(estimator=model, X=X, y=y, cv=5)
print("Test r2:%.2f" % scores.mean())

print("== Lasso (L1 penalty) ==")


model =lm.LassoCV(n_jobs=-1, cv=3)
# Let sklearn select a l i s t of alphas with default 3CV
scores =cross_val_score(estimator=model, X=X, y=y, cv=5)
print("Test r2:%.2f" % scores.mean())

print("== ElasticNet (L1 penalty) ==")


model =lm.ElasticNetCV(l1_ratio=[.1, .5, .9], n_jobs=-1, cv=3)
# Let sklearn select a l i s t of alphas with default 3CV
scores =cross_val_score(estimator=model, X=X, y=y, cv=5)
print("Test r2:%.2f" % scores.mean())

== Ridge (L2 penalty) ==
Test r2:0.16
== Lasso (L1 penalty) ==
Test r2:0.74
== ElasticNet (L1 penalty) ==
Test r2:0.58

Classification models with built-in cross-validation
from sklearn import datasets
import sklearn.linear_model as lm
import sklearn.metrics as metrics
from sklearn.model_selection import cross_val_score

X, y = datasets.make_classification(n_samples=100, n_features=100,
                                    n_informative=10, random_state=42)

# provide CV and score
def balanced_acc(estimator, X, y, **kwargs):
    '''
    Balanced accuracy scorer
    '''
    return metrics.recall_score(y, estimator.predict(X), average=None).mean()

print("== Logistic Ridge (L2 penalty) ==")
# Logistic regression with built-in CV over the regularization parameter C
model = lm.LogisticRegressionCV(Cs=10, cv=3, n_jobs=-1)
scores = cross_val_score(estimator=model, X=X, y=y, cv=5, scoring=balanced_acc)
print("Test ACC:%.2f" % scores.mean())

== Logistic Ridge (L2 penalty) ==
Test ACC:0.77

5.6.5 Random Permutations

A permutation test is a type of non-parametric randomization test in which the null distribution of a test statistic is estimated by randomly permuting the observations.
Permutation tests are highly attractive because they make no assumptions other than that the observations are independent and identically distributed under the null hypothesis.

1. Compute an observed statistic t_obs on the data.
2. Use randomization to compute the distribution of t under the null hypothesis: perform N random permutations of the data. For each sample i of permuted data, compute the statistic t_i. This procedure provides the distribution of t under the null hypothesis H0: P(t|H0).
3. Compute the p-value = P(t > t_obs | H0) = |{t_i > t_obs}| / N, where the t_i's include t_obs.

Example with a correlation

The statistic is the correlation.

import numpy as np
import scipy.stats as stats
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
#%matplotlib qt

np.random.seed(42)
x = np.random.normal(loc=10, scale=1, size=100)
y = x + np.random.normal(loc=-3, scale=3, size=100)  # snr = 1/2

# Permutation: simulate the null hypothesis
nperm = 10000
perms = np.zeros(nperm + 1)

perms[0] = np.corrcoef(x, y)[0, 1]

for i in range(1, nperm + 1):
    perms[i] = np.corrcoef(np.random.permutation(x), y)[0, 1]

# Plot
# Re-weight to obtain distribution
weights = np.ones(perms.shape[0]) / perms.shape[0]
plt.hist([perms[perms >= perms[0]], perms], histtype='stepfilled',
         bins=100, label=["t>t obs (p-value)", "t<t obs"],
         weights=[weights[perms >= perms[0]], weights])

plt.xlabel("Statistic distribution under null hypothesis")
plt.axvline(x=perms[0], color='blue', linewidth=1, label="observed statistic")
_ = plt.legend(loc="upper left")
# One-tailed empirical p-value
pval_perm = np.sum(perms >= perms[0]) / perms.shape[0]

# Compare with Pearson's correlation test
_, pval_test = stats.pearsonr(x, y)

print("Permutation two tailed p-value=%.5f. Pearson test p-value=%.5f" % (2 * pval_perm, pval_test))

Permutation two tailed p-value=0.06959. Pearson test p-value=0.07355

Exercise

Given the logistic regression presented above and its validation with a 5-fold CV:
1. Compute the p-value associated with the prediction accuracy using a permutation
test.
2. Compute the p-value associated with the prediction accuracy using a parametric
test.

5.6.6 Bootstrapping

Bootstrapping is a random sampling with replacement strategy which provides a non-parametric method to assess the variability of performance scores, such as standard errors or confidence intervals.
A great advantage of bootstrap is its simplicity. It is a straightforward way to derive estimates of standard errors and confidence intervals for complex estimators of complex parameters of the distribution, such as percentile points, proportions, odds ratios, and correlation coefficients.

1. Perform B samplings, with replacement, of the dataset.
2. For each sample b fit the model and compute the scores.
3. Assess standard errors and confidence intervals of the scores using the scores obtained on the B resampled datasets.

import numpy as np
from sklearn import datasets
import sklearn.linear_model as lm
import sklearn.metrics as metrics
import pandas as pd

# Regression dataset
n_features =5
n_features_info = 2
n_samples = 100
X =np.random.randn(n_samples, n_features)
beta =np.zeros(n_features)
beta[:n_features_info] =1
Xbeta =np.dot(X, beta)
eps =np.random.randn(n_samples)
y =Xbeta + eps

# Fit model on all data (!! risk of overfit)
model = lm.RidgeCV()
model.fit(X, y)
print("Coefficients on all data:")
print(model.coef_)

# Bootstrap loop
nboot = 100  # !! Should be at least 1000
scores_names = ["r2"]
scores_boot = np.zeros((nboot, len(scores_names)))
coefs_boot = np.zeros((nboot, X.shape[1]))

orig_all = np.arange(X.shape[0])
for boot_i in range(nboot):
boot_tr =np.random.choice(orig_all, size=len(orig_all), replace=True)
boot_te =np.setdiff1d(orig_all, boot_tr, assume_unique=False)
Xtr, ytr =X[boot_tr, : ] , y[boot_tr]
Xte, yte =X[boot_te, : ] , y[boot_te]
model.fit(Xtr, ytr)
y_pred = model.predict(Xte).ravel()
scores_boot[boot_i, : ] =metrics.r2_score(yte, y_pred)
coefs_boot[boot_i, : ] =model.coef_

# Compute Mean, SE, CI


scores_boot =pd.DataFrame(scores_boot, columns=scores_names)
scores_stat =scores_boot.describe(percentiles=[.975, .5, .025])

print("r-squared: Mean=%.2f, SE=%.2f, CI=(%.2f %.2f)" %\


tuple(scores_stat.loc[["mean", "std", "5%", "95%"], "r 2"] ))

coefs_boot = pd.DataFrame(coefs_boot)
coefs_stat =coefs_boot.describe(percentiles=[.975, .5, .025])
print("Coefficients distribution")
print(coefs_stat)


Coefficients on all data:
[ 1.0257263   1.11323    -0.0499828  -0.09263008  0.15267576]
r-squared: Mean=0.61, SE=0.10, CI=(nan nan)
Coefficients distribution
                0           1           2           3           4
count  100.000000  100.000000  100.000000  100.000000  100.000000
mean     1.012534    1.132775   -0.056369   -0.100046    0.164236
std      0.094269    0.104934    0.111308    0.095098    0.095656
min      0.759189    0.836394   -0.290386   -0.318755   -0.092498
2.5%     0.814260    0.948158   -0.228483   -0.268790   -0.044067
50%      1.013097    1.125304   -0.057039   -0.099281    0.164194
97.5%    1.170183    1.320637    0.158680    0.085064    0.331809
max      1.237874    1.340585    0.291111    0.151059    0.450812

5.7 Ensemble learning: bagging, boosting and stacking

These methods are ensemble learning techniques: machine learning paradigms where multiple models (often called "weak learners") are trained to solve the same problem and combined to get better results. The main hypothesis is that when weak models are correctly combined we can obtain more accurate and/or more robust models.

5.7.1 Single weak learner

In machine learning, no matter if we are facing a classification or a regression problem,


the choice of the model is extremely important to have any chance to obtain good results.
This choice can depend on many variables of the problem: quantity of data, dimensionality
of the space, distribution hypothesis. . .
A low bias and a low variance, although they most often vary in opposite directions, are the two most fundamental features expected for a model. Indeed, to be able to "solve" a problem, we want our model to have enough degrees of freedom to resolve the underlying complexity of the data we are working with, but we also want it to have not too many degrees of freedom, to avoid high variance and be more robust. This is the well known bias-variance tradeoff.
In ensemble learning theory, we call weak learners (or base models) models that can be used as building blocks for designing more complex models by combining several of them. Most of the time, these basic models do not perform so well by themselves, either because they have a high bias (low degree of freedom models, for example) or because they have too much variance to be robust (high degree of freedom models, for example). The idea of ensemble methods is then to combine several of them in order to create a strong learner (or ensemble model) that achieves better performances.
Usually, ensemble models are used in order to:
• decrease the variance, for the bagging (Bootstrap Aggregating) technique;
• reduce the bias, for the boosting technique;
• improve the predictive force, for the stacking technique.
To understand these techniques, we will first explore what bootstrapping is and its different hypotheses.


Fig. 11: towardsdatascience blog

5.7.2 Bootstrapping

Bootstrapping is a statistical technique which consists in generating samples of size B


(called bootstrap samples) from an initial dataset of size N by randomly drawing with
replacement B observations.
Illustration of the bootstrapping process.
Under some assumptions, these samples have pretty good statistical properties: in first approximation, they can be seen as being drawn both directly from the true underlying (and often unknown) data distribution and independently from each other. So, they can be considered as representative and independent samples of the true data distribution.
The hypotheses that have to be verified to make this approximation valid are twofold:
• First, the size N of the initial dataset should be large enough to capture most of the complexity of the underlying distribution, so that sampling from the dataset is a good approximation of sampling from the real distribution (representativity).
• Second, the size N of the dataset should be large enough compared to the size B of the bootstrap samples so that samples are not too much correlated (independence).
Bootstrap samples are often used, for example, to evaluate the variance or confidence intervals of a statistical estimator. By definition, a statistical estimator is a function of some observations and, so, a random variable with variance coming from these observations. In order to estimate the variance of such an estimator, we need to evaluate it on several independent samples drawn from the distribution of interest. In most cases, considering truly independent samples would require too much data compared to the amount really available. We can then use bootstrapping to generate several bootstrap samples that can be considered as being "almost-representative" and "almost-independent" (almost i.i.d. samples). These bootstrap samples will allow us to approximate the variance of the estimator, by evaluating its value for each of them.


Fig. 12: towardsdatascience blog

Fig. 13: towardsdatascience blog


Bootstrapping is often used to evaluate the variance or confidence intervals of some statistical estimators.

5.7.3 Bagging

In parallel methods we fit the different considered learners independently from each other and, so, it is possible to train them concurrently. The most famous such approach is "bagging" (standing for "bootstrap aggregating"), which aims at producing an ensemble model that is more robust than the individual models composing it.
When training a model, no matter if we are dealing with a classification or a regression problem, we obtain a function that takes an input, returns an output and that is defined with respect to the training dataset.
The idea of bagging is then simple: we want to fit several independent models and "average" their predictions in order to obtain a model with a lower variance. However, we can't, in practice, fit fully independent models because it would require too much data. So, we rely on the good "approximate properties" of bootstrap samples (representativity and independence) to fit models that are almost independent.
First, we create multiple bootstrap samples so that each new bootstrap sample will act as another (almost) independent dataset drawn from the true distribution. Then, we can fit a weak learner on each of these samples and finally aggregate them such that we kind of "average" their outputs and, so, obtain an ensemble model with less variance than its components. Roughly speaking, as the bootstrap samples are approximately independent and identically distributed (i.i.d.), so are the learned base models. Then, "averaging" the weak learners' outputs does not change the expected answer but reduces its variance.
So, assuming that we have L bootstrap samples (approximations of L independent
datasets) of size B denoted

Fig. 14: Medium Science Blog

Each {...} is a bootstrap sample of B observations.

We can fit L almost independent weak learners (one on each dataset)

Fig. 15: Medium Science Blog

and then aggregate them into some kind of averaging process in order to get an ensemble
model with a lower variance. For example, we can define our strong model such that
There are several possible ways to aggregate the multiple models fitted in parallel:
• For a regression problem, the outputs of individual models can literally be averaged to obtain the output of the ensemble model.
• For a classification problem, the class outputted by each model can be seen as a vote, and the class that receives the majority of the votes is returned by the ensemble model (this is called hard-voting). Still for a classification problem, we can also consider the probabilities of each class returned by all the models, average these probabilities and keep the class with the highest average probability (this is called soft-voting).
-> Averages or votes can either be simple or weighted if any relevant weights can be used (a minimal voting sketch is given below).

Fig. 16: Medium Science Blog
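As a minimal sketch of hard versus soft voting (the three base classifiers are chosen only for illustration), scikit-learn's VotingClassifier implements both aggregation rules:

from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)
base = [('lr', LogisticRegression(solver='liblinear')),
        ('nb', GaussianNB()),
        ('tree', DecisionTreeClassifier(max_depth=3))]

# Hard voting: majority class vote; soft voting: average of predicted probabilities
hard = VotingClassifier(estimators=base, voting='hard')
soft = VotingClassifier(estimators=base, voting='soft')

print("Hard voting ACC: %.2f" % cross_val_score(hard, X, y, cv=5).mean())
print("Soft voting ACC: %.2f" % cross_val_score(soft, X, y, cv=5).mean())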
Finally, we can mention that one of the big advantages of bagging is that it can be parallelised. As the different models are fitted independently from each other, intensive parallelisation techniques can be used if required.

Fig. 17: Medium Science Blog

Bagging consists in fitting several base models on different bootstrap samples and building an ensemble model that "averages" the results of these weak learners.
Question: can you name an algorithm based on the Bagging technique? Hint: leaf
Examples
Here, we are trying some examples of bagging:
• Bagged Decision Trees for Classification

import pandas
from sklearn import model_selection
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
dataframe = pandas.read_csv("https://raw.githubusercontent.com/jbrownlee/Datasets/master/pima-indians-diabetes.data.csv", names=names)

array = dataframe.values
x = array[:, 0:8]
y = array[:, 8]
max_features = 3

kfold = model_selection.KFold(n_splits=10, random_state=2020)
rf = DecisionTreeClassifier(max_features=max_features)
num_trees = 100

model = BaggingClassifier(base_estimator=rf, n_estimators=num_trees, random_state=2020)
results = model_selection.cross_val_score(model, x, y, cv=kfold)
print("Accuracy: %0.2f (+/- %0.2f)" % (results.mean(), results.std()))

• Random Forest Classification
import pandas
from sklearn import model_selection
from sklearn.ensemble import RandomForestClassifier

names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
dataframe = pandas.read_csv("https://raw.githubusercontent.com/jbrownlee/Datasets/master/pima-indians-diabetes.data.csv", names=names)

array = dataframe.values
x = array[:, 0:8]
y = array[:, 8]

num_trees = 100
max_features = 3

kfold = model_selection.KFold(n_splits=10, random_state=2020)
model = RandomForestClassifier(n_estimators=num_trees, max_features=max_features)
results = model_selection.cross_val_score(model, x, y, cv=kfold)
print("Accuracy: %0.2f (+/- %0.2f)" % (results.mean(), results.std()))

Both of these algorithms will print Accuracy: 0.77 (+/- 0.07); they are equivalent.

5.7.4 Boosting

In sequential methods the different combined weak models are no longer fitted independently from each other. The idea is to fit models iteratively such that the training of the model at a given step depends on the models fitted at the previous steps. "Boosting" is the most famous of these approaches and it produces an ensemble model that is in general less biased than the weak learners that compose it.
Boosting methods work in the same spirit as bagging methods: we build a family of models that are aggregated to obtain a strong learner that performs better.
However, unlike bagging, which mainly aims at reducing variance, boosting is a technique that consists in fitting sequentially multiple weak learners in a very adaptive way: each model in the sequence is fitted giving more importance to observations in the dataset that were badly handled by the previous models in the sequence. Intuitively, each new model focuses its efforts on the most difficult observations to fit up to now, so that we obtain, at the end of the process, a strong learner with lower bias (even if we can notice that boosting can also have the effect of reducing variance).
– > Boosting, like bagging, can be used for regression as well as for classification
problems.
Being mainly focused on reducing bias, the base models that are often considered for boosting are models with low variance but high bias. For example, if we want to use trees as our base models, we will choose most of the time shallow decision trees with only a few depths.
Another important reason that motivates the use of low variance but high bias models as
weak learners for boosting is that these models are in general less computationally
expensive to fit (few degrees of freedom when parametrised). Indeed, as computations to
fit the different mod- els can’t be done in parallel (unlike bagging), it could become too
expensive to fit sequentially several complex models.
Once the weak learners have been chosen, we still need to define how they will be
sequentially fitted and how they will be aggregated. We will discuss these questions in
the two follow- ing subsections, describing more especially two important boosting
algorithms: adaboost and gradient boosting.
In a nutshell, these two meta-algorithms differ on how they create and aggregate the
weak learners during the sequential process. Adaptive boosting updates the weights
attached to each of the training dataset observations whereas gradient boosting updates
the value of these observations. This main difference comes from the way both methods
try to solve the optimisation problem of finding the best model that can be written as a
weighted sum of weak learners.

Fig. 18: Medium Science Blog

Boosting consists in, iteratively, fitting a weak learner, aggregating it to the ensemble model and "updating" the training dataset to better take into account the strengths and weaknesses of the current ensemble model when fitting the next base model.

1/ Adaptive boosting

In adaptive boosting (often called "adaboost"), we try to define our ensemble model as a weighted sum of L weak learners:


Fig. 19: Medium Science Blog

Finding the best ensemble model with this form is a difficult optimisation problem. Then,
instead of trying to solve it in one single shot (finding all the coefficients and weak
learners that give the best overall additive model), we make use of an iterative
optimisation process that is much more tractable, even if it can lead to a sub-optimal
solution. More especially, we add the weak learners one by one, looking at each iteration
for the best possible pair (coefficient, weak learner) to add to the current ensemble model.
In other words, we define recurrently the (s_l)’s such that

Fig. 20: towardsdatascience Blog

where c_l and w_l are chosen such that s_l is the model that best fits the training data and, so, is the best possible improvement over s_(l-1). We can then denote

Fig. 21: towardsdatascience Blog

where E(.) is the fitting error of the given model and e(.,.) is the loss/error function.
Thus, instead of optimising “globally” over all the L models in the sum, we approximate
the optimum by optimising “locally” building and adding the weak learners to the strong
model one by one.
More especially, when considering a binary classification, we can show that the adaboost algorithm can be re-written into a process that proceeds as follows. First, it updates the observation weights in the dataset and trains a new weak learner with a special focus given to the observations misclassified by the current ensemble model. Second, it adds the weak learner to the weighted sum according to an update coefficient that expresses the performance of this weak model: the better a weak learner performs, the more it contributes to the strong learner.
So, assume that we are facing a binary classification problem, with N observations in our
dataset and we want to use adaboost algorithm with a given family of weak models. At the
very beginning of the algorithm (first model of the sequence), all the observations have
the same weights 1/N. Then, we repeat L times (for the L learners in the sequence) the
following steps:
• fit the best possible weak model with the current observation weights;
• compute the value of the update coefficient, some kind of scalar evaluation metric of the weak learner that indicates how much this weak learner should be taken into account in the ensemble model;
• update the strong learner by adding the new weak learner multiplied by its update coefficient;
• compute new observation weights that express which observations we would like to focus on at the next iteration (weights of observations wrongly predicted by the aggregated model increase and weights of the correctly predicted observations decrease).
Repeating these steps, we have then built sequentially our L models and aggregated them into a simple linear combination weighted by coefficients expressing the performance of each learner (a minimal from-scratch sketch of these updates is given below).
Notice that there exist variants of the initial adaboost algorithm, such as LogitBoost (classification) or L2Boost (regression), that mainly differ by their choice of loss function.

Fig. 22: Medium Science Blog

Adaboost updates the weights of the observations at each iteration. Weights of well classified observations decrease relative to weights of misclassified observations. Models that perform better have higher weights in the final ensemble model.
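Below is a minimal from-scratch sketch of these updates for binary classification (labels recoded to {-1, +1}, decision stumps as weak learners); it follows the classical discrete AdaBoost formulas and is only illustrative:

import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y01 = make_classification(n_samples=200, n_features=10, random_state=42)
y = 2 * y01 - 1                      # recode labels to {-1, +1}

n, L = len(y), 20
w = np.ones(n) / n                   # initial observation weights: 1/N
learners, coefs = [], []

for l in range(L):
    stump = DecisionTreeClassifier(max_depth=1).fit(X, y, sample_weight=w)
    pred = stump.predict(X)
    err = np.sum(w * (pred != y)) / np.sum(w)          # weighted error
    alpha = 0.5 * np.log((1 - err) / (err + 1e-10))    # update coefficient
    w = w * np.exp(-alpha * y * pred)                  # increase weights of misclassified obs.
    w = w / w.sum()
    learners.append(stump)
    coefs.append(alpha)

# Strong learner: sign of the weighted sum of the weak learners
F = np.sign(np.sum([a * h.predict(X) for a, h in zip(coefs, learners)], axis=0))
print("Train accuracy: %.2f" % np.mean(F == y))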

2/ Gradient boosting

In gradient boosting, the ensemble model we try to build is also a weighted sum of weak
learners

Fig. 23: Medium Science Blog

Just as we mentioned for adaboost, finding the optimal model under this form is too difficult and an iterative approach is required. The main difference with adaptive boosting is in the definition of the sequential optimisation process. Indeed, gradient boosting casts the problem into a gradient descent one: at each iteration we fit a weak learner to the opposite of the gradient of the current fitting error with respect to the current ensemble model. Let's try to clarify this last point. First, the theoretical gradient descent process over the ensemble model can be written


Fig. 24: Medium Science Blog

where E(.) is the fitting error of the given model, c_l is a coefficient corresponding to the
step size and

Fig. 25: Medium Science Blog

This entity is the opposite of the gradient of the fitting error with respect to the ensemble
model at step l-1. This opposite of the gradient is a function that can, in practice, only be
evaluated for observations in the training dataset (for which we know inputs and outputs): these evaluations are called pseudo-residuals, attached to each observation.
Moreover, even if we know for the observations the values of these pseudo-residuals, we
don’t want to add to our ensemble model any kind of function: we only want to add a new
instance of weak model. So, the natural thing to do is to fit a weak learner to the pseudo-
residuals computed for each observation. Finally, the coefficient c_l is computed following a
one dimensional optimisation process (line-search to obtain the best step size c_l).
So, assume that we want to use gradient boosting technique with a given family of weak
models. At the very beginning of the algorithm (first model of the sequence), the pseudo-
residuals are set equal to the observation values. Then, we repeat L times (for the L
models of the sequence) the following steps:
• fit the best possible weak model to the pseudo-residuals (approximate the opposite of the gradient with respect to the current strong learner);
• compute the value of the optimal step size that defines by how much we update the ensemble model in the direction of the new weak learner;
• update the ensemble model by adding the new weak learner multiplied by the step size (make a step of gradient descent);
• compute new pseudo-residuals that indicate, for each observation, in which direction we would like to update next the ensemble model predictions.
Repeating these steps, we have then built sequentially our L models and aggregated them following a gradient descent approach (see the sketch below). Notice that, while adaptive boosting tries to solve at each iteration exactly the "local" optimisation problem (find the best weak learner and its coefficient to add to the strong model), gradient boosting uses instead a gradient descent approach and can more easily be adapted to a large number of loss functions. Thus, gradient boosting can be considered as a generalization of adaboost to arbitrary differentiable loss functions.
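A minimal sketch of this idea for least-squares regression (where the pseudo-residuals are simply the residuals y - F(x)) could look as follows; the tree depth, learning rate and data are purely illustrative:

import numpy as np
from sklearn.datasets import make_regression
from sklearn.tree import DecisionTreeRegressor

X, y = make_regression(n_samples=200, n_features=5, noise=10, random_state=42)

L, lr = 100, 0.1
F = np.full(len(y), y.mean())        # initial ensemble prediction
trees = []

for l in range(L):
    residuals = y - F                            # pseudo-residuals for the squared loss
    tree = DecisionTreeRegressor(max_depth=2).fit(X, residuals)
    F = F + lr * tree.predict(X)                 # gradient descent step in function space
    trees.append(tree)

print("MSE after boosting: %.1f" % np.mean((y - F) ** 2))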
Note: there is an algorithm which gained huge popularity after Kaggle competitions: XGBoost (Extreme Gradient Boosting). It is a gradient boosting algorithm with more flexible parameters (varying number of terminal nodes and leaf weights) to avoid correlations between sub-learners. Having these important qualities, XGBoost is one of the most used algorithms in data science. LightGBM is a recent implementation of this algorithm; it was published by Microsoft and it gives the same scores (if parameters are equivalent) but runs quicker than a classic XGBoost.

Fig. 26: Medium Science Blog

Gradient boosting updates the values of the observations at each iteration. Weak learners are trained to fit the pseudo-residuals that indicate in which direction to correct the current ensemble model predictions to lower the error.
Examples
Here, we are trying an example of Boosting and comparing it to Bagging. Both algorithms use the same weak learners to build the macro-model.
• Adaboost Classifier

from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.datasets import load_breast_cancer
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import accuracy_score
from sklearn.metrics import f1_score

breast_cancer = load_breast_cancer()
x = pd.DataFrame(breast_cancer.data, columns=breast_cancer.feature_names)
y = pd.Categorical.from_codes(breast_cancer.target, breast_cancer.target_names)

# Transforming string Target to an int
encoder = LabelEncoder()
binary_encoded_y = pd.Series(encoder.fit_transform(y))

# Train Test Split
train_x, test_x, train_y, test_y = train_test_split(x, binary_encoded_y, random_state=1)

clf_boosting = AdaBoostClassifier(
    DecisionTreeClassifier(max_depth=1),
    n_estimators=200
)
clf_boosting.fit(train_x, train_y)
predictions = clf_boosting.predict(test_x)
print("For Boosting : F1 Score {}, Accuracy {}".format(round(f1_score(test_y, predictions), 2),
                                                       round(accuracy_score(test_y, predictions), 2)))

• Random Forest as a bagging classifier
classifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import load_breast_cancer
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import accuracy_score
from sklearn.metrics import f1_score

breast_cancer = load_breast_cancer()
x = pd.DataFrame(breast_cancer.data, columns=breast_cancer.feature_names)
y = pd.Categorical.from_codes(breast_cancer.target, breast_cancer.target_names)

# Transforming string Target to an int
encoder = LabelEncoder()
binary_encoded_y = pd.Series(encoder.fit_transform(y))

# Train Test Split
train_x, test_x, train_y, test_y = train_test_split(x, binary_encoded_y, random_state=1)

clf_bagging = RandomForestClassifier(n_estimators=200, max_depth=1)
clf_bagging.fit(train_x, train_y)
predictions = clf_bagging.predict(test_x)
print("For Bagging : F1 Score {}, Accuracy {}".format(round(f1_score(test_y, predictions), 2),
                                                      round(accuracy_score(test_y, predictions), 2)))

Comparison

Metric Bagging Boosting


Accuracy 0.91 0.97
F1-Score 0.88 0.95


5.7.5 Overview of stacking

Stacking mainly differs from bagging and boosting on two points:
• First, stacking often considers heterogeneous weak learners (different learning algorithms are combined), whereas bagging and boosting consider mainly homogeneous weak learners.
• Second, stacking learns to combine the base models using a meta-model, whereas bagging and boosting combine weak learners following deterministic algorithms.
As we already mentioned, the idea of stacking is to learn several different weak learners
and combine them by training a meta-model to output predictions based on the multiple
predic- tions returned by these weak models. So, we need to define two things in order to
build our stacking model: the L learners we want to fit and the meta-model that combines
them.
For example, for a classification problem, we can choose as weak learners a KNN classifier,
a logistic regression and a SVM, and decide to learn a neural network as meta-model.
Then, the neural network will take as inputs the outputs of our three weak learners and
will learn to return final predictions based on it.
So, assume that we want to fit a stacking ensemble composed of L weak learners. Then we
have to follow the steps thereafter:
• split the training data in two folds
• choose L weak learners and fit them to data of the first fold
• for each of the L weak learners, make predictions for observations in the second
fold
• fit the meta-model on the second fold, using predictions made by the weak learners
as inputs
In the previous steps, we split the dataset into two folds because predictions on data that have been used for the training of the weak learners are not relevant for the training of the meta-model.

Fig. 27: Medium Science Blog


Stacking consists in training a meta-model to produce outputs based on the outputs


returned by some lower layer weak learners.
A possible extension of stacking is multi-level stacking. It consists in doing stacking with
multiple layers. As an example,

Fig. 28: Medium Science Blog

Multi-level stacking considers several layers of stacking: some meta-models are trained on outputs returned by lower layer meta-models, and so on. Here we have represented a 3-layer stacking model.
Examples
Here, we are trying an example of Stacking and comparing it to Bagging and Boosting. We note that many other applications (datasets) would show more difference between these techniques.
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.datasets import load_breast_cancer
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import accuracy_score
from sklearn.metrics import f1_score
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression

breast_cancer = load_breast_cancer()
x = pd.DataFrame(breast_cancer.data, columns=breast_cancer.feature_names)
y = pd.Categorical.from_codes(breast_cancer.target, breast_cancer.target_names)

# Transforming string Target to an int
encoder = LabelEncoder()
binary_encoded_y = pd.Series(encoder.fit_transform(y))

# Train Test Split
train_x, test_x, train_y, test_y = train_test_split(x, binary_encoded_y, random_state=2020)
boosting_clf_ada_boost = AdaBoostClassifier(
    DecisionTreeClassifier(max_depth=1),
    n_estimators=3
)
bagging_clf_rf = RandomForestClassifier(n_estimators=200, max_depth=1, random_state=2020)

clf_rf = RandomForestClassifier(n_estimators=200, max_depth=1, random_state=2020)

clf_ada_boost = AdaBoostClassifier(
    DecisionTreeClassifier(max_depth=1, random_state=2020),
    n_estimators=3
)

clf_logistic_reg = LogisticRegression(solver='liblinear', random_state=2020)

# Customizing an Exception message
class NumberOfClassifierException(Exception):
    pass

# Creating a stacking class
class Stacking():
    '''
    This is a test class for stacking!
    Please feel free to change it to fit your needs.
    We suppose that at least the first N-1 classifiers have a predict_proba function.
    '''

    def __init__(self, classifiers):
        if len(classifiers) < 2:
            raise NumberOfClassifierException("You must fit your classifier with 2 classifiers at least")
        else:
            self._classifiers = classifiers

    def fit(self, data_x, data_y):
        stacked_data_x = data_x.copy()
        for classfier in self._classifiers[:-1]:
            classfier.fit(data_x, data_y)
            stacked_data_x = np.column_stack((stacked_data_x, classfier.predict_proba(data_x)))

        last_classifier = self._classifiers[-1]
        last_classifier.fit(stacked_data_x, data_y)

    def predict(self, data_x):
        stacked_data_x = data_x.copy()
        for classfier in self._classifiers[:-1]:
            prob_predictions = classfier.predict_proba(data_x)
            stacked_data_x = np.column_stack((stacked_data_x, prob_predictions))

        last_classifier = self._classifiers[-1]
        return last_classifier.predict(stacked_data_x)

bagging_clf_rf.fit(train_x, train_y)
boosting_clf_ada_boost.fit(train_x, train_y)

classifers_list = [clf_rf, clf_ada_boost, clf_logistic_reg]
clf_stacking = Stacking(classifers_list)
clf_stacking.fit(train_x, train_y)

predictions_bagging = bagging_clf_rf.predict(test_x)
predictions_boosting = boosting_clf_ada_boost.predict(test_x)
predictions_stacking = clf_stacking.predict(test_x)

print("For Bagging : F1 Score {}, Accuracy {}".format(round(f1_score(test_y, predictions_bagging), 2),
                                                      round(accuracy_score(test_y, predictions_bagging), 2)))
print("For Boosting : F1 Score {}, Accuracy {}".format(round(f1_score(test_y, predictions_boosting), 2),
                                                       round(accuracy_score(test_y, predictions_boosting), 2)))
print("For Stacking : F1 Score {}, Accuracy {}".format(round(f1_score(test_y, predictions_stacking), 2),
                                                       round(accuracy_score(test_y, predictions_stacking), 2)))

Comparison

Metric     Bagging  Boosting  Stacking
Accuracy   0.90     0.94      0.98
F1-Score   0.88     0.93      0.98

5.8 Gradient descent

Gradient descent is an optimization algorithm used to minimize some function by iteratively moving in the direction of steepest descent as defined by the negative of the gradient. In machine learning, we use gradient descent to update the parameters of our model. Parameters refer to coefficients in Linear Regression and weights in neural networks.
This section aims to provide you with an explanation of gradient descent and intuitions towards the behaviour of different algorithms for optimizing it. These explanations will help you put them to use.
We are first going to introduce gradient descent, solve it for a regression problem and look at its different variants. Then, we will briefly summarize challenges during training. Finally, we will introduce the most common optimization algorithms by showing their motivation to resolve these challenges and list some advice to facilitate the algorithm choice.

5.8.1 Introduction

Consider the 3-dimensional graph below in the context of a cost function. Our goal is to move from the mountain in the top right corner (high cost) to the dark blue sea in the bottom left (low cost). The arrows represent the direction of steepest descent (negative gradient) from any given point, the direction that decreases the cost function as quickly as possible.

Fig. 29: adalta.it

Starting at the top of the mountain, we take our first step downhill in the direction
specified by the negative gradient. Next we recalculate the negative gradient (passing in
the coordinates of our new point) and take another step in the direction it specifies. We
continue this process iteratively until we get to the bottom of our graph, or to a point
where we can no longer move downhill–a local minimum.

Learning rate

The size of these steps is called the learning rate. With a high learning rate we can cover
more ground each step, but we risk overshooting the lowest point since the slope of the hill
is constantly changing. With a very low learning rate, we can confidently move in the
direction of the negative gradient since we are recalculating it so frequently. A low
learning rate is more precise, but calculating the gradient is time-consuming, so it will
take us a very long time to get to the bottom.

Fig. 30: jeremyjordan


Cost function

A Loss Function (Error function) tells us “how good” our model is at making predictions
for a given set of parameters. The cost function has its own curve and its own gradients.
The slope of this curve tells us how to update our parameters to make the model more
accurate.

5.8.2 Numerical solution for gradient descent

Let’s run gradient descent using a linear regression cost function.


There are two parameters in our cost function we can control: - $ .. raw::
latex beta‘_0$ : (the bias) - $:raw-latex:beta_1 $ : (weight or
coefficient)
Since we need to consider the impact each one has on the final prediction, we need to
use partial derivatives. We calculate the partial derivatives of the cost function
1 1 ∑︁� ∑︁
with 𝜕𝑀𝑆𝐸 2 in 1
𝑓(𝛽 ,respect
0𝛽) =1
to each parameter
= �and store the results
(��− (𝛽1��+ 0𝛽))
a gradient.
((𝛽1��+ 𝛽0) −
2
2 𝜕𝛽 2𝑁
Given the cost function �= = �= ��)
2𝑁 1 1
The gradient can be calculated as
[︃ 𝜕𝑓 ︃] ]︂
︂[ 1 ∑︀ −2((𝛽1 �� + 0𝛽) − ︂] ︂[∑︀ −1 ((𝛽 1��+ 𝛽0) − �

𝑓 (𝛽 , 0𝛽) =
𝜕𝛽0
= 1 2𝑁 ∑︀ = −1 𝑁 ∑︀
1 𝜕𝑓
𝜕𝛽1 2𝑁 −2
�� )�((𝛽
�1 �
� + 0𝛽) − � 𝑁 �)�((𝛽 � + 0𝛽) −
�1 �
�) � �)
To solve for the gradient, we iterate through our data points using our
:math:beta_1 and :math:beta_0 values and compute the
partial derivatives. This new gradient tells us the slope of our cost function at our cur-
rent position (current parameter values) and the direction we should move to update our
parameters. The size of our update is controlled by the learning rate.
Pseudocode of this algorithm (a runnable NumPy version follows):

Function gradient_descent(X, Y, learning_rate, number_iterations):

    m := 1
    b := 1
    data_length := length(X)

    loop i : 1 --> number_iterations:
        m_deriv := 0
        b_deriv := 0
        loop j : 1 --> data_length:
            m_deriv := m_deriv + X[j] * ((m * X[j] + b) - Y[j])
            b_deriv := b_deriv + ((m * X[j] + b) - Y[j])
        m := m - (m_deriv / data_length) * learning_rate
        b := b - (b_deriv / data_length) * learning_rate

    return m, b
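A runnable NumPy version of this pseudocode (the synthetic data, learning rate and number of iterations are illustrative) might look like:

import numpy as np

def gradient_descent(X, Y, learning_rate=0.05, n_iterations=1000):
    m, b = 1.0, 1.0
    N = len(X)
    for _ in range(n_iterations):
        err = (m * X + b) - Y                 # prediction errors
        m_deriv = np.sum(X * err) / N         # dMSE/dm (with the 1/2N convention)
        b_deriv = np.sum(err) / N             # dMSE/db
        m -= learning_rate * m_deriv          # step in the direction of the negative gradient
        b -= learning_rate * b_deriv
    return m, b

# Noisy line y = 2x + 1
rng = np.random.RandomState(42)
X = rng.uniform(-3, 3, 100)
Y = 2 * X + 1 + rng.normal(scale=0.5, size=100)
print(gradient_descent(X, Y))   # expected to approach (2, 1)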


5.8.3 Gradient descent variants

There are three variants of gradient descent, which differ in how much data we use to
compute the gradient of the objective function. Depending on the amount of data, we make
a trade-off between the accuracy of the parameter update and the time it takes to perform
an update.

Batch gradient descent

Batch gradient descent, known also as Vanilla gradient descent, computes the gradient of
the cost function with respect to the parameters 𝜃for the entire training dataset :

𝜃= 𝜃− 𝜂· ∇𝜃𝐽 (𝜃)

As we need to calculate the gradients for the whole dataset to perform just one update,
batch gradient descent can be very slow and is intractable for datasets that don’t fit in
memory. Batch gradient descent also doesn’t allow us to update our model online.

Stochastic gradient descent

Stochastic gradient descent (SGD) in contrast performs a parameter update for each training example x(i) and label y(i):
• Choose an initial vector of parameters 𝜃 and learning rate 𝜂.
• Repeat until an approximate minimum is obtained:

  – Randomly shuffle examples in the training set.

  – For i ∈ 1, . . . , n:
        𝜃 = 𝜃 − 𝜂 · ∇𝜃𝐽(𝜃; x(i); y(i))
Batch gradient descent performs redundant computations for large datasets, as it recom-
putes gradients for similar examples before each parameter update. SGD does away with
this redundancy by performing one update at a time. It is therefore usually much faster
and can also be used to learn online. SGD performs frequent updates with a high variance
that cause the objective function to fluctuate heavily as in the image below.
While batch gradient descent converges to the minimum of the basin the parameters are
placed in, SGD’s fluctuation, on the one hand, enables it to jump to new and potentially
better local minima. On the other hand, this ultimately complicates convergence to the
exact minimum, as SGD will keep overshooting. However, it has been shown that when we
slowly decrease the learning rate, SGD shows the same convergence behaviour as batch
gradient descent, almost certainly converging to a local or the global minimum for non-
convex and convex optimization respectively.

Mini-batch gradient descent

Mini-batch gradient descent finally takes the best of both worlds and performs an update for every mini-batch of n training examples:

    𝜃 = 𝜃 − 𝜂 · ∇𝜃𝐽(𝜃; x(i:i+n); y(i:i+n))

This way, it:


Fig. 31: Wikipedia


• reduces the variance of the parameter updates, which can lead to more stable convergence;
• can make use of highly optimized matrix operations common to state-of-the-art deep learning libraries that make computing the gradient very efficient. Common mini-batch sizes range between 50 and 256, but can vary for different applications.
Mini-batch gradient descent is typically the algorithm of choice when training a neural network (a minimal sketch is given below).
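A minimal sketch of mini-batch SGD for the same linear regression problem (the batch size, learning rate and number of epochs are illustrative) could be:

import numpy as np

rng = np.random.RandomState(42)
X = rng.uniform(-3, 3, 1000)
Y = 2 * X + 1 + rng.normal(scale=0.5, size=1000)

m, b = 1.0, 1.0
learning_rate, batch_size, n_epochs = 0.05, 32, 50

for epoch in range(n_epochs):
    idx = rng.permutation(len(X))                 # shuffle the training set
    for start in range(0, len(X), batch_size):
        batch = idx[start:start + batch_size]
        err = (m * X[batch] + b) - Y[batch]
        m -= learning_rate * np.mean(X[batch] * err)   # gradient estimated on the mini-batch
        b -= learning_rate * np.mean(err)

print(m, b)   # expected to approach (2, 1)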

5.8.4 Gradient descent challenges

Vanilla mini-batch gradient descent, however, does not guarantee good convergence, and raises a few challenges that need to be addressed:
• Choosing a proper learning rate can be difficult. A learning rate that is too small leads to painfully slow convergence, while a learning rate that is too large can hinder convergence and cause the loss function to fluctuate around the minimum or even to diverge.
• Learning rate schedules try to adjust the learning rate during training by e.g. annealing, i.e. reducing the learning rate according to a pre-defined schedule or when the change in objective between epochs falls below a threshold. These schedules and thresholds, however, have to be defined in advance and are thus unable to adapt to a dataset's characteristics.
• Additionally, the same learning rate applies to all parameter updates. If our data is sparse and our features have very different frequencies, we might not want to update all of them to the same extent, but perform a larger update for rarely occurring features.
• Another key challenge of minimizing highly non-convex error functions, common for neural networks, is avoiding getting trapped in their numerous suboptimal local minima. These saddle points (local minima) are usually surrounded by a plateau of the same error, which makes it notoriously hard for SGD to escape, as the gradient is close to zero in all dimensions.

5. Gradient descent optimization algorithms

In the following, we will outline some algorithms that are widely used by the deep
learning community to deal with the aforementioned challenges.

Momentum

SGD has trouble navigating ravines (areas where the surface curves much more steeply in
one dimension than in another), which are common around local optima. In these
scenarios, SGD oscillates across the slopes of the ravine while only making hesitant
progress along the bottom towards the local optimum as in the image below.
Fig. 33: No momentum: moving toward the locally largest gradient creates oscillations.

Fig. 34: With momentum: accumulated velocity avoids oscillations.

Momentum is a method that helps accelerate SGD in the relevant direction and dampens
oscillations, as can be seen in the images above. It does this by adding a fraction $\rho$
of the update vector of the past time step to the current update vector:

$$v_t = \rho v_{t-1} + \nabla_\theta J(\theta) \qquad (5.52)$$
$$\theta = \theta - v_t$$

vx = 0
while True:
    dx = gradient(J, x)
    vx = rho * vx + dx
    x -= learning_rate * vx

Note: The momentum term $\rho$ is usually set to 0.9 or a similar value.
Essentially, when using momentum, we push a ball down a hill. The ball accumulates
momentum as it rolls downhill, becoming faster and faster on the way (until it reaches its
terminal velocity, if there is air resistance, i.e. $\rho < 1$).
The same thing happens to our parameter updates: the momentum term increases for
dimensions whose gradients point in the same direction and reduces updates for
dimensions whose gradients change direction. As a result, we gain faster convergence
and reduced oscillation.

AdaGrad: adaptive learning rates

• Adds element-wise scaling of the gradient based on the historical sum of squares in
each dimension.
• "Per-parameter learning rates" or "adaptive learning rates".

grad_squared = 0
while True:
    dx = gradient(J, x)
    grad_squared += dx * dx
    x -= learning_rate * dx / (np.sqrt(grad_squared) + 1e-7)

• Progress along "steep" directions is damped.
• Progress along "flat" directions is accelerated.
• Problem: the accumulated sum of squares only grows, so the step size decays to zero
over a long training run.

RMSProp: “Leaky AdaGrad”


grad_squared = 0
while True:
    dx = gradient(J, x)
    grad_squared = decay_rate * grad_squared + (1 - decay_rate) * dx * dx
    x -= learning_rate * dx / (np.sqrt(grad_squared) + 1e-7)

• decay_rate = 1: gradient descent
• decay_rate = 0: AdaGrad


Nesterov accelerated gradient

However, a ball that rolls down a hill, blindly following the slope, is highly
unsatisfactory. We would like a smarter ball, one that has a notion of where it is
going, so that it knows to slow down before the hill slopes up again. Nesterov accelerated
gradient (NAG) is a way to give our momentum term this kind of prescience. We know
that we will use our momentum term $\gamma v_{t-1}$ to move the parameters $\theta$.
Computing $\theta - \gamma v_{t-1}$ thus gives us an approximation of the next position of the parameters
(the gradient is missing for the full update), a rough idea of where our parameters are going
to be. We can now effectively look ahead by calculating the gradient not w.r.t. our
current parameters $\theta$ but w.r.t. the approximate future position of our parameters:

$$v_t = \gamma v_{t-1} + \eta \nabla_\theta J(\theta - \gamma v_{t-1}) \qquad (5.53)$$
$$\theta = \theta - v_t$$

Again, we set the momentum term $\gamma$ to a value of around 0.9. While momentum first computes
the current gradient and then takes a big jump in the direction of the updated
accumulated gradient, NAG first makes a big jump in the direction of the previous
accumulated gradient, measures the gradient and then makes a correction, which results in
the complete NAG update. This anticipatory update prevents us from going too fast and
results in increased responsiveness, which has significantly increased the performance of
RNNs on a number of tasks.
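A runnable sketch of the NAG update (5.53) on the same toy least-squares problem used above (all settings illustrative; the gradient is evaluated at the look-ahead point $\theta - \gamma v$):

import numpy as np

rng = np.random.RandomState(0)
X = rng.randn(100, 2)
y = X.dot(np.array([1.0, -2.0])) + 0.1 * rng.randn(100)

def grad(theta):
    return 2 * X.T.dot(X.dot(theta) - y) / len(y)

theta, v = np.zeros(2), np.zeros(2)
gamma, eta = 0.9, 0.1
for epoch in range(200):
    v = gamma * v + eta * grad(theta - gamma * v)  # gradient at the look-ahead position
    theta -= v
print(theta)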

Adam

Adaptive Moment Estimation (Adam) is a method that computes adaptive learning rates
for each parameter. In addition to storing an exponentially decaying average of past
squared gradients $v_t$, Adam also keeps an exponentially decaying average of past gradients
$m_t$, similar to momentum. Whereas momentum can be seen as a ball running
down a slope, Adam behaves like a heavy ball with friction, which thus prefers flat minima
in the error surface. We compute the decaying averages of past and past squared gradients
$m_t$ and $v_t$ respectively as follows:

$$m_t = \beta_1 m_{t-1} + (1 - \beta_1) \nabla_\theta J(\theta) \qquad (5.54)$$
$$v_t = \beta_2 v_{t-1} + (1 - \beta_2) \left(\nabla_\theta J(\theta)\right)^2$$

$m_t$ and $v_t$ are estimates of the first moment (the mean) and the second moment (the
uncentered variance) of the gradients respectively, hence the name of the method. Adam
(almost):
first_moment = 0
second_moment = 0
while True:
    dx = gradient(J, x)
    # Momentum:
    first_moment = beta1 * first_moment + (1 - beta1) * dx
    # AdaGrad/RMSProp:
    second_moment = beta2 * second_moment + (1 - beta2) * dx * dx
    x -= learning_rate * first_moment / (np.sqrt(second_moment) + 1e-7)

As $m_t$ and $v_t$ are initialized as vectors of 0's, the authors of Adam observe that they are
biased towards zero, especially during the initial time steps, and especially when the decay
rates are small (i.e. $\beta_1$ and $\beta_2$ are close to 1). They counteract these biases by computing
bias-corrected first and second moment estimates:

$$\hat{m}_t = \frac{m_t}{1 - \beta_1^t} \qquad (5.55)$$

$$\hat{v}_t = \frac{v_t}{1 - \beta_2^t} \qquad (5.56)$$

They then use these to update the parameters (Adam update rule):

$$\theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{\hat{v}_t} + \epsilon}\,\hat{m}_t$$
• $\hat{m}_t$: accumulated gradient (velocity).
• $\hat{v}_t$: element-wise scaling of the gradient based on the historical sum of squares in
each dimension.
• Choose Adam as the default optimizer.
• Default values: 0.9 for $\beta_1$, 0.999 for $\beta_2$, and $10^{-7}$ for $\epsilon$.
• Learning rate in a range between 1e-3 and 5e-4.
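A runnable sketch of the full Adam update, including the bias correction above, again on a toy least-squares problem (data and learning rate are illustrative):

import numpy as np

rng = np.random.RandomState(0)
X = rng.randn(100, 2)
y = X.dot(np.array([1.0, -2.0])) + 0.1 * rng.randn(100)

theta = np.zeros(2)
m, v = np.zeros(2), np.zeros(2)
beta1, beta2, eps, eta = 0.9, 0.999, 1e-7, 0.1
for t in range(1, 501):
    g = 2 * X.T.dot(X.dot(theta) - y) / len(y)
    m = beta1 * m + (1 - beta1) * g          # first moment (eq. 5.54)
    v = beta2 * v + (1 - beta2) * g * g      # second moment (eq. 5.54)
    m_hat = m / (1 - beta1 ** t)             # bias correction (eq. 5.55)
    v_hat = v / (1 - beta2 ** t)             # bias correction (eq. 5.56)
    theta -= eta * m_hat / (np.sqrt(v_hat) + eps)
print(theta)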


CHAPTER

SIX

DEEP LEARNING

1. Backpropagation

1. Course outline:

1. Backpropagation and chain rule
2. Lab: with numpy and pytorch

%matplotlib inline

6.1.2 Backpropagation and chain rule

We will set up a two-layer network (source: PyTorch tutorial):

$$Y = \max(X W^{(1)}, 0)\, W^{(2)}$$

A fully-connected ReLU network with one hidden layer and no biases, trained to predict y
from x using Euclidean error.

Chain rule

Forward pass with local partial derivatives of output given inputs:

$$x \rightarrow z^{(1)} = x^T W^{(1)} \rightarrow h^{(1)} = \max(z^{(1)}, 0) \rightarrow z^{(2)} = {h^{(1)}}^T W^{(2)} \rightarrow L(z^{(2)}, y) = (z^{(2)} - y)^2$$

with $W^{(1)}$ feeding into $z^{(1)}$ and $W^{(2)}$ feeding into $z^{(2)}$. The local partial derivatives are:

$$\frac{\partial z^{(1)}}{\partial W^{(1)}} = x, \quad
\frac{\partial z^{(1)}}{\partial x} = W^{(1)}, \quad
\frac{\partial h^{(1)}}{\partial z^{(1)}} = \begin{cases} 1 & \text{if } z^{(1)} > 0 \\ 0 & \text{else} \end{cases}, \quad
\frac{\partial z^{(2)}}{\partial W^{(2)}} = h^{(1)}, \quad
\frac{\partial z^{(2)}}{\partial h^{(1)}} = W^{(2)}, \quad
\frac{\partial L}{\partial z^{(2)}} = 2(z^{(2)} - y)$$

Backward: compute the gradient of the loss with respect to each parameter vector by applying the chain rule
from the loss downstream to the parameters.

For $W^{(2)}$:

$$\frac{\partial L}{\partial W^{(2)}} = \frac{\partial L}{\partial z^{(2)}} \frac{\partial z^{(2)}}{\partial W^{(2)}} \qquad (6.1)$$
$$= 2(z^{(2)} - y)\, h^{(1)} \qquad (6.2)$$

For $W^{(1)}$:

$$\frac{\partial L}{\partial W^{(1)}} = \frac{\partial L}{\partial z^{(2)}} \frac{\partial z^{(2)}}{\partial h^{(1)}} \frac{\partial h^{(1)}}{\partial z^{(1)}} \frac{\partial z^{(1)}}{\partial W^{(1)}} \qquad (6.3)$$
$$= 2(z^{(2)} - y)\, W^{(2)} \begin{cases} 1 & \text{if } z^{(1)} > 0 \\ 0 & \text{else} \end{cases}\, x \qquad (6.4)$$
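To sanity-check such analytical gradients, one can compare them against a finite-difference approximation. A minimal numpy sketch, assuming scalar inputs and weights for readability (all names and values here are illustrative, not part of the lab code below):

import numpy as np

def loss(w1, w2, x, y):
    # Forward pass of the two-layer ReLU network with squared error
    z1 = x * w1
    h1 = max(z1, 0.0)
    z2 = h1 * w2
    return (z2 - y) ** 2

# Analytical gradient of L w.r.t. w2 (eq. 6.2), at some arbitrary point
x, y, w1, w2 = 1.5, 0.3, 0.8, -0.4
h1 = max(x * w1, 0.0)
z2 = h1 * w2
grad_w2_analytic = 2 * (z2 - y) * h1

# Central finite-difference approximation
eps = 1e-6
grad_w2_numeric = (loss(w1, w2 + eps, x, y) - loss(w1, w2 - eps, x, y)) / (2 * eps)
print(grad_w2_analytic, grad_w2_numeric)  # the two values should be very close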

Recap: Vector derivatives

Given a function $y = x\, w$ with $y$ the output, $x$ the input and $w$ the coefficients:

• Scalar to Scalar: $x \in \mathbb{R}$, $y \in \mathbb{R}$, $w \in \mathbb{R}$

Regular derivative:

$$\frac{\partial y}{\partial w} = x \in \mathbb{R}$$

If $w$ changes by a small amount, how much will $y$ change?

• Vector to Scalar: $x \in \mathbb{R}^N$, $y \in \mathbb{R}$, $w \in \mathbb{R}^N$

The derivative is the gradient of partial derivatives: $\frac{\partial y}{\partial w} \in \mathbb{R}^N$

$$\frac{\partial y}{\partial w} = \nabla_w y = \begin{bmatrix} \frac{\partial y}{\partial w_1} \\ \vdots \\ \frac{\partial y}{\partial w_N} \end{bmatrix} \qquad (6.5)$$

For each element $w_i$ of $w$, if it changes by a small amount, then how much will $y$ change?

• Vector to Vector: $x \in \mathbb{R}^N$, $y \in \mathbb{R}^M$

The derivative is the Jacobian of partial derivatives (TO COMPLETE):

$$\frac{\partial y}{\partial x} \in \mathbb{R}^{N \times M}$$
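A small numpy illustration of these shapes (values are arbitrary and only used to show the dimensions involved):

import numpy as np

N, M = 3, 2
x = np.random.randn(N)     # input vector
w = np.random.randn(N)     # coefficient vector (scalar-output case)
W = np.random.randn(N, M)  # coefficient matrix (vector-output case)

y = x.dot(w)               # scalar output: the gradient dy/dw is x, of shape (N,)
z = x.dot(W)               # vector output of shape (M,): the Jacobian dz/dx is W, of shape (N, M)
print(np.shape(y), z.shape, W.shape)  # () (2,) (3, 2)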

Backpropagation summary

Backpropagation algorithm in a graph:
1. Forward pass: for each node, compute the local partial derivatives of the output given the inputs.
2. Backward pass: apply the chain rule from the end to each parameter.
   - Update the parameter with gradient descent using the current upstream gradient and the current local gradient.
   - Compute the upstream gradient for the backward nodes.

Think locally and remember that at each node:
- For the loss, the gradient is the error.
- At each step, the upstream gradient is obtained by multiplying the upstream gradient (an error) with the current parameters (vector or matrix).
- At each step, the current local gradient equals the input, therefore the current update is the current upstream gradient times the input.

import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import sklearn.model_selection

3. Lab: with numpy and pytorch

Load iris data set

Goal: Predict Y = [petal_length, petal_width] = f(X = [sepal_length, sepal_width])
• Plot data with seaborn
• Remove setosa samples
• Recode ‘versicolor’:1, ‘virginica’:2
• Scale X and Y
• Split data in train/test 50%/50%
iris = sns.load_dataset("iris")
#g = sns.pairplot(iris, hue="species")
df = iris[iris.species != "setosa"]
g = sns.pairplot(df, hue="species")
df['species_n'] = iris.species.map({'versicolor': 1, 'virginica': 2})

# Y = 'petal_length', 'petal_width'; X = 'sepal_length', 'sepal_width'
X_iris = np.asarray(df.loc[:, ['sepal_length', 'sepal_width']], dtype=np.float32)
Y_iris = np.asarray(df.loc[:, ['petal_length', 'petal_width']], dtype=np.float32)
label_iris = np.asarray(df.species_n, dtype=int)

# Scale
from sklearn.preprocessing import StandardScaler
scalerx, scalery = StandardScaler(), StandardScaler()
X_iris = scalerx.fit_transform(X_iris)
Y_iris = StandardScaler().fit_transform(Y_iris)

# Split train test
X_iris_tr, X_iris_val, Y_iris_tr, Y_iris_val, label_iris_tr, label_iris_val = \
    sklearn.model_selection.train_test_split(X_iris, Y_iris, label_iris, train_size=0.5,
                                             stratify=label_iris)

/home/edouard/anaconda3/lib/python3.7/site-packages/ipykernel_launcher.py:5: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  """


Backpropagation with numpy

This implementation uses numpy to manually compute the forward pass, loss, and
backward pass.
# X=X_iris_tr; Y=Y_iris_tr; X_val=X_iris_val; Y_val=Y_iris_val

def two_layer_regression_numpy_train(X, Y, X_val, Y_val, lr, nite):

    # N is batch size; D_in is input dimension;
    # H is hidden dimension; D_out is output dimension.
    # N, D_in, H, D_out = 64, 1000, 100, 10
    N, D_in, H, D_out = X.shape[0], X.shape[1], 100, Y.shape[1]

    W1 = np.random.randn(D_in, H)
    W2 = np.random.randn(H, D_out)

    losses_tr, losses_val = list(), list()

    learning_rate = lr
    for t in range(nite):
        # Forward pass: compute predicted y
        z1 = X.dot(W1)
        h1 = np.maximum(z1, 0)
        Y_pred = h1.dot(W2)

        # Compute and print loss
        loss = np.square(Y_pred - Y).sum()

        # Backprop to compute gradients of w1 and w2 with respect to loss
        grad_y_pred = 2.0 * (Y_pred - Y)
        grad_w2 = h1.T.dot(grad_y_pred)
        grad_h1 = grad_y_pred.dot(W2.T)
        grad_z1 = grad_h1.copy()
        grad_z1[z1 < 0] = 0
        grad_w1 = X.T.dot(grad_z1)

        # Update weights
        W1 -= learning_rate * grad_w1
        W2 -= learning_rate * grad_w2

        # Forward pass for validation set: compute predicted y
        z1 = X_val.dot(W1)
        h1 = np.maximum(z1, 0)
        y_pred_val = h1.dot(W2)
        loss_val = np.square(y_pred_val - Y_val).sum()

        losses_tr.append(loss)
        losses_val.append(loss_val)

        if t % 10 == 0:
            print(t, loss, loss_val)

    return W1, W2, losses_tr, losses_val

W1, W2, losses_tr, losses_val = two_layer_regression_numpy_train(
    X=X_iris_tr, Y=Y_iris_tr, X_val=X_iris_val, Y_val=Y_iris_val, lr=1e-4, nite=50)
plt.plot(np.arange(len(losses_tr)), losses_tr, "-b",
         np.arange(len(losses_val)), losses_val, "-r")

0 15126.224825529907 2910.260853330454
10 71.5381374591153 104.97056197642135
20 50.756938353833334 80.02800827986354
30 46.546510744624236 72.85211241738614
40 44.41413064447564 69.31127324764276

[<matplotlib.lines.Line2D at 0x7f960cf5e9b0>,
 <matplotlib.lines.Line2D at 0x7f960cf5eb00>]


Backpropagation with PyTorch Tensors

source
Numpy is a great framework, but it cannot utilize GPUs to accelerate its numerical
computations. For modern deep neural networks, GPUs often provide speedups of 50x or
greater, so unfortunately numpy won't be enough for modern deep learning. Here we
introduce the most fundamental PyTorch concept: the Tensor. A PyTorch Tensor is
conceptually identical to a numpy array: a Tensor is an n-dimensional array, and PyTorch
provides many functions for operating on these Tensors. Behind the scenes, Tensors can
keep track of a computational graph and gradients, but they're also useful as a generic tool
for scientific computing. Also unlike numpy, PyTorch Tensors can utilize GPUs to
accelerate their numeric computations. To run a PyTorch Tensor on GPU, you simply need
to cast it to a new datatype. Here we use PyTorch Tensors to fit the same two-layer network.
Like the numpy example above, we need to manually implement the forward
and backward passes through the network:
import torch

# X=X_iris_tr; Y=Y_iris_tr; X_val=X_iris_val; Y_val=Y_iris_val

def two_layer_regression_tensor_train(X, Y, X_val, Y_val, lr, nite):

    dtype = torch.float
    device = torch.device("cpu")
    # device = torch.device("cuda:0") # Uncomment this to run on GPU

    # N is batch size; D_in is input dimension;
    # H is hidden dimension; D_out is output dimension.
    N, D_in, H, D_out = X.shape[0], X.shape[1], 100, Y.shape[1]

    # Create input and output data
    X = torch.from_numpy(X)
    Y = torch.from_numpy(Y)
    X_val = torch.from_numpy(X_val)
    Y_val = torch.from_numpy(Y_val)

    # Randomly initialize weights
    W1 = torch.randn(D_in, H, device=device, dtype=dtype)
    W2 = torch.randn(H, D_out, device=device, dtype=dtype)

    losses_tr, losses_val = list(), list()

    learning_rate = lr
    for t in range(nite):
        # Forward pass: compute predicted y
        z1 = X.mm(W1)
        h1 = z1.clamp(min=0)
        y_pred = h1.mm(W2)

        # Compute and print loss
        loss = (y_pred - Y).pow(2).sum().item()

        # Backprop to compute gradients of w1 and w2 with respect to loss
        grad_y_pred = 2.0 * (y_pred - Y)
        grad_w2 = h1.t().mm(grad_y_pred)
        grad_h1 = grad_y_pred.mm(W2.t())
        grad_z1 = grad_h1.clone()
        grad_z1[z1 < 0] = 0
        grad_w1 = X.t().mm(grad_z1)

        # Update weights using gradient descent
        W1 -= learning_rate * grad_w1
        W2 -= learning_rate * grad_w2

        # Forward pass for validation set: compute predicted y
        z1 = X_val.mm(W1)
        h1 = z1.clamp(min=0)
        y_pred_val = h1.mm(W2)
        loss_val = (y_pred_val - Y_val).pow(2).sum().item()

        losses_tr.append(loss)
        losses_val.append(loss_val)

        if t % 10 == 0:
            print(t, loss, loss_val)

    return W1, W2, losses_tr, losses_val

W1, W2, losses_tr, losses_val = two_layer_regression_tensor_train(
    X=X_iris_tr, Y=Y_iris_tr, X_val=X_iris_val, Y_val=Y_iris_val, lr=1e-4, nite=50)
plt.plot(np.arange(len(losses_tr)), losses_tr, "-b",
         np.arange(len(losses_val)), losses_val, "-r")

0 8086.1591796875 5429.57275390625
10 225.77589416503906 331.83734130859375
20 86.46501159667969 117.72447204589844
30 52.375606536865234 73.84156036376953
40 43.16458511352539 64.0667495727539

[<matplotlib.lines.Line2D at 0x7f960033c470>,
 <matplotlib.lines.Line2D at 0x7f960033c5c0>]

Backpropagation with PyTorch: Tensors and autograd

source
A fully-connected ReLU network with one hidden layer and no biases, trained to predict y
from x by minimizing squared Euclidean distance. This implementation computes the
forward pass using operations on PyTorch Tensors, and uses PyTorch autograd to compute
gradients. A PyTorch Tensor represents a node in a computational graph. If x is a Tensor
that has x.requires_grad=True, then x.grad is another Tensor holding the gradient of x
with respect to some scalar value.

import torch

# X=X_iris_tr; Y=Y_iris_tr; X_val=X_iris_val; Y_val=Y_iris_val
# del X, Y, X_val, Y_val

def two_layer_regression_autograd_train(X, Y, X_val, Y_val, lr, nite):

    dtype = torch.float
    device = torch.device("cpu")
    # device = torch.device("cuda:0") # Uncomment this to run on GPU

    # N is batch size; D_in is input dimension;
    # H is hidden dimension; D_out is output dimension.
    N, D_in, H, D_out = X.shape[0], X.shape[1], 100, Y.shape[1]

    # Setting requires_grad=False indicates that we do not need to compute gradients
    # with respect to these Tensors during the backward pass.
    X = torch.from_numpy(X)
    Y = torch.from_numpy(Y)
    X_val = torch.from_numpy(X_val)
    Y_val = torch.from_numpy(Y_val)

    # Create random Tensors for weights.
    # Setting requires_grad=True indicates that we want to compute gradients with
    # respect to these Tensors during the backward pass.
    W1 = torch.randn(D_in, H, device=device, dtype=dtype, requires_grad=True)
    W2 = torch.randn(H, D_out, device=device, dtype=dtype, requires_grad=True)

    losses_tr, losses_val = list(), list()

    learning_rate = lr
    for t in range(nite):
        # Forward pass: compute predicted y using operations on Tensors; these
        # are exactly the same operations we used to compute the forward pass using
        # Tensors, but we do not need to keep references to intermediate values since
        # we are not implementing the backward pass by hand.
        y_pred = X.mm(W1).clamp(min=0).mm(W2)

        # Compute and print loss using operations on Tensors.
        # Now loss is a Tensor of shape (1,)
        # loss.item() gets the scalar value held in the loss.
        loss = (y_pred - Y).pow(2).sum()

        # Use autograd to compute the backward pass. This call will compute the
        # gradient of loss with respect to all Tensors with requires_grad=True.
        # After this call w1.grad and w2.grad will be Tensors holding the gradient
        # of the loss with respect to w1 and w2 respectively.
        loss.backward()

        # Manually update weights using gradient descent. Wrap in torch.no_grad()
        # because weights have requires_grad=True, but we don't need to track this
        # in autograd.
        # An alternative way is to operate on weight.data and weight.grad.data.
        # Recall that tensor.data gives a tensor that shares the storage with
        # tensor, but doesn't track history.
        # You can also use torch.optim.SGD to achieve this.
        with torch.no_grad():
            W1 -= learning_rate * W1.grad
            W2 -= learning_rate * W2.grad

            # Manually zero the gradients after updating weights
            W1.grad.zero_()
            W2.grad.zero_()

        y_pred = X_val.mm(W1).clamp(min=0).mm(W2)

        # Compute the validation loss using operations on Tensors.
        loss_val = (y_pred - Y_val).pow(2).sum()

        if t % 10 == 0:
            print(t, loss.item(), loss_val.item())

        losses_tr.append(loss.item())
        losses_val.append(loss_val.item())

    return W1, W2, losses_tr, losses_val

W1, W2, losses_tr, losses_val = two_layer_regression_autograd_train(
    X=X_iris_tr, Y=Y_iris_tr, X_val=X_iris_val, Y_val=Y_iris_val, lr=1e-4, nite=50)
plt.plot(np.arange(len(losses_tr)), losses_tr, "-b",
         np.arange(len(losses_val)), losses_val, "-r")

0 8307.1806640625 2357.994873046875
10 111.97289276123047 250.04209899902344
20 65.83244323730469 201.63694763183594
30 53.70908737182617 183.17051696777344
40 48.719329833984375 173.3616943359375

[<matplotlib.lines.Line2D at 0x7f95ff2ad978>,
 <matplotlib.lines.Line2D at 0x7f95ff2adac8>]

Backpropagation with PyTorch: nn

source
This implementation uses the nn package from PyTorch to build the network. PyTorch
autograd makes it easy to define computational graphs and take gradients, but raw
autograd can be a bit too low-level for defining complex neural networks; this is where
the nn package can help. The nn package defines a set of Modules, which you can think of
as neural network layers that produce output from input and may have some
trainable weights.


import torch

# X=X_iris_tr; Y=Y_iris_tr; X_val=X_iris_val; Y_val=Y_iris_val
# del X, Y, X_val, Y_val

def two_layer_regression_nn_train(X, Y, X_val, Y_val, lr, nite):

    # N is batch size; D_in is input dimension;
    # H is hidden dimension; D_out is output dimension.
    N, D_in, H, D_out = X.shape[0], X.shape[1], 100, Y.shape[1]

    X = torch.from_numpy(X)
    Y = torch.from_numpy(Y)
    X_val = torch.from_numpy(X_val)
    Y_val = torch.from_numpy(Y_val)

    # Use the nn package to define our model as a sequence of layers. nn.Sequential
    # is a Module which contains other Modules, and applies them in sequence to
    # produce its output. Each Linear Module computes output from input using a
    # linear function, and holds internal Tensors for its weight and bias.
    model = torch.nn.Sequential(
        torch.nn.Linear(D_in, H),
        torch.nn.ReLU(),
        torch.nn.Linear(H, D_out),
    )

    # The nn package also contains definitions of popular loss functions; in this
    # case we will use Mean Squared Error (MSE) as our loss function.
    loss_fn = torch.nn.MSELoss(reduction='sum')

    losses_tr, losses_val = list(), list()

    learning_rate = lr
    for t in range(nite):
        # Forward pass: compute predicted y by passing x to the model. Module objects
        # override the call operator so you can call them like functions. When
        # doing so you pass a Tensor of input data to the Module and it produces
        # a Tensor of output data.
        y_pred = model(X)

        # Compute and print loss. We pass Tensors containing the predicted and true
        # values of y, and the loss function returns a Tensor containing the loss.
        loss = loss_fn(y_pred, Y)

        # Zero the gradients before running the backward pass.
        model.zero_grad()

        # Backward pass: compute gradient of the loss with respect to all the learnable
        # parameters of the model. Internally, the parameters of each Module are stored
        # in Tensors with requires_grad=True, so this call will compute gradients for
        # all learnable parameters in the model.
        loss.backward()

        # Update the weights using gradient descent. Each parameter is a Tensor, so
        # we can access its gradients like we did before.
        with torch.no_grad():
            for param in model.parameters():
                param -= learning_rate * param.grad

        y_pred = model(X_val)
        loss_val = (y_pred - Y_val).pow(2).sum()

        if t % 10 == 0:
            print(t, loss.item(), loss_val.item())

        losses_tr.append(loss.item())
        losses_val.append(loss_val.item())

    return model, losses_tr, losses_val

model, losses_tr, losses_val = two_layer_regression_nn_train(
    X=X_iris_tr, Y=Y_iris_tr, X_val=X_iris_val, Y_val=Y_iris_val, lr=1e-4, nite=50)
plt.plot(np.arange(len(losses_tr)), losses_tr, "-b",
         np.arange(len(losses_val)), losses_val, "-r")

0 82.32025146484375 91.3389892578125
10 50.322200775146484 63.563087463378906
20 40.825225830078125 57.13555145263672
30 37.53572082519531 55.74506378173828
40 36.191200256347656 55.499732971191406

[<matplotlib.lines.Line2D at 0x7f95ff296668>,
 <matplotlib.lines.Line2D at 0x7f95ff2967b8>]

Backpropagation with PyTorch optim

This implementation uses the nn package from PyTorch to build the network. Rather than
manually updating the weights of the model as we have been doing, we use the optim package to
define an Optimizer that will update the weights for us. The optim package defines many
optimization algorithms that are commonly used for deep learning, including
SGD+momentum, RMSProp, Adam, etc.

import torch

# X=X_iris_tr; Y=Y_iris_tr; X_val=X_iris_val; Y_val=Y_iris_val

def two_layer_regression_nn_optim_train(X, Y, X_val, Y_val, lr, nite):

    # N is batch size; D_in is input dimension;
    # H is hidden dimension; D_out is output dimension.
    N, D_in, H, D_out = X.shape[0], X.shape[1], 100, Y.shape[1]

    X = torch.from_numpy(X)
    Y = torch.from_numpy(Y)
    X_val = torch.from_numpy(X_val)
    Y_val = torch.from_numpy(Y_val)

    # Use the nn package to define our model and loss function.
    model = torch.nn.Sequential(
        torch.nn.Linear(D_in, H),
        torch.nn.ReLU(),
        torch.nn.Linear(H, D_out),
    )
    loss_fn = torch.nn.MSELoss(reduction='sum')

    losses_tr, losses_val = list(), list()

    # Use the optim package to define an Optimizer that will update the weights of
    # the model for us. Here we will use Adam; the optim package contains many other
    # optimization algorithms. The first argument to the Adam constructor tells the
    # optimizer which Tensors it should update.
    learning_rate = lr
    optimizer = torch.optim.Adam(model.parameters(), lr=learning_rate)
    for t in range(nite):
        # Forward pass: compute predicted y by passing x to the model.
        y_pred = model(X)

        # Compute and print loss.
        loss = loss_fn(y_pred, Y)

        # Before the backward pass, use the optimizer object to zero all of the
        # gradients for the variables it will update (which are the learnable
        # weights of the model). This is because by default, gradients are
        # accumulated in buffers (i.e, not overwritten) whenever .backward()
        # is called. Checkout docs of torch.autograd.backward for more details.
        optimizer.zero_grad()

        # Backward pass: compute gradient of the loss with respect to model
        # parameters
        loss.backward()

        # Calling the step function on an Optimizer makes an update to its
        # parameters
        optimizer.step()

        with torch.no_grad():
            y_pred = model(X_val)
            loss_val = loss_fn(y_pred, Y_val)

        if t % 10 == 0:
            print(t, loss.item(), loss_val.item())

        losses_tr.append(loss.item())
        losses_val.append(loss_val.item())

    return model, losses_tr, losses_val

model, losses_tr, losses_val = two_layer_regression_nn_optim_train(
    X=X_iris_tr, Y=Y_iris_tr, X_val=X_iris_val, Y_val=Y_iris_val, lr=1e-3, nite=50)
plt.plot(np.arange(len(losses_tr)), losses_tr, "-b",
         np.arange(len(losses_val)), losses_val, "-r")

0 92.271240234375 83.96189880371094
10 64.25907135009766 59.872535705566406
20 47.6252555847168 50.228126525878906
30 40.33802032470703 50.60377502441406
40 38.19448471069336 54.03163528442383

[<matplotlib.lines.Line2D at 0x7f95ff200080>,
 <matplotlib.lines.Line2D at 0x7f95ff2001d0>]


2. Multilayer Perceptron (MLP)

1. Course outline:

1. Recall of linear classifier
2. MLP with scikit-learn
3. MLP with pytorch
4. Test several MLP architectures
5. Limits of MLP

Sources:

Deep learning
• cs231n.stanford.edu

Pytorch
• WWW tutorials
• github tutorials
• github examples

MNIST and pytorch:
• MNIST nextjournal.com/gkoehler/pytorch-mnist
• MNIST github/pytorch/examples
• MNIST kaggle

%matplotlib inline

import os
import numpy as np
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
from torch.optim import lr_scheduler
import torchvision
from torchvision import transforms
from torchvision import datasets
from torchvision import models
#
from pathlib import Path
import matplotlib.pyplot as plt

# Device configuration
device = torch.device('cuda:0' if torch.cuda.is_available() else 'cpu')
print(device)

cuda:0

Hyperparameters


6.2.2 Dataset: MNIST Handwritten Digit Recognition

from pathlib import Path
WD = os.path.join(Path.home(), "data", "pystatml", "dl_mnist_pytorch")
os.makedirs(WD, exist_ok=True)
os.chdir(WD)
print("Working dir is:", os.getcwd())
os.makedirs("data", exist_ok=True)
os.makedirs("models", exist_ok=True)

def load_mnist(batch_size_train, batch_size_test):

    train_loader = torch.utils.data.DataLoader(
        datasets.MNIST('data', train=True, download=True,
                       transform=transforms.Compose([
                           transforms.ToTensor(),
                           transforms.Normalize((0.1307,), (0.3081,))  # Mean and Std of the MNIST dataset
                       ])),
        batch_size=batch_size_train, shuffle=True)

    val_loader = torch.utils.data.DataLoader(
        datasets.MNIST('data', train=False,
                       transform=transforms.Compose([
                           transforms.ToTensor(),
                           transforms.Normalize((0.1307,), (0.3081,))  # Mean and Std of the MNIST dataset
                       ])),
        batch_size=batch_size_test, shuffle=True)
    return train_loader, val_loader

train_loader, val_loader = load_mnist(64, 10000)

dataloaders = dict(train=train_loader, val=val_loader)

# Info about the dataset
D_in = np.prod(dataloaders["train"].dataset.data.shape[1:])
D_out = len(dataloaders["train"].dataset.targets.unique())
print("Datasets shapes:", {x: dataloaders[x].dataset.data.shape for x in ['train', 'val']})
print("N input features:", D_in, "Output classes:", D_out)

Working dir is: /volatile/duchesnay/data/pystatml/dl_mnist_pytorch
Datasets shapes: {'train': torch.Size([60000, 28, 28]), 'val': torch.Size([10000, 28, 28])}
N input features: 784 Output classes: 10

Now let's take a look at some mini-batch examples.

batch_idx, (example_data, example_targets) = next(enumerate(train_loader))
print("Train batch:", example_data.shape, example_targets.shape)
batch_idx, (example_data, example_targets) = next(enumerate(val_loader))
print("Val batch:", example_data.shape, example_targets.shape)

Train batch: torch.Size([64, 1, 28, 28]) torch.Size([64])
Val batch: torch.Size([10000, 1, 28, 28]) torch.Size([10000])

So one validation data batch is a tensor of shape torch.Size([10000, 1, 28, 28]): 10000 examples
of 28x28 pixels in grayscale (i.e. no RGB channels, hence the one). We can plot some of them
using matplotlib.

def show_data_label_prediction(data, y_true, y_pred=None, shape=(2, 3)):
    y_pred = [None] * len(y_true) if y_pred is None else y_pred
    fig = plt.figure()
    for i in range(np.prod(shape)):
        plt.subplot(*shape, i + 1)
        plt.tight_layout()
        plt.imshow(data[i][0], cmap='gray', interpolation='none')
        plt.title("True: {} Pred: {}".format(y_true[i], y_pred[i]))
        plt.xticks([])
        plt.yticks([])

show_data_label_prediction(data=example_data, y_true=example_targets, y_pred=None, shape=(2, 3))

6.2.3 Recall of linear classifier

Binary logistic regression

One neuron as output layer:

$$f(x) = \sigma(x^T w)$$

Softmax Classifier (Multinomial Logistic Regression)

• Input $x$: a vector of dimension $d^{(0)}$ (layer 0).
• Output $f(x)$: a vector of dimension $d^{(1)}$ (layer 1), the number of possible labels.

The model has $d^{(1)}$ neurons as output layer:

$$f(x) = \text{softmax}(x^T W + b)$$

where $W$ is a $d^{(0)} \times d^{(1)}$ matrix of coefficients and $b$ is a $d^{(1)}$-dimensional vector of biases.

MNIST classification using multinomial logistic regression

source: Logistic regression MNIST
Here we fit a multinomial logistic regression with L2 penalty on a subset of the MNIST
digits classification task.
source: scikit-learn.org
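A minimal numpy sketch of the softmax output layer described above (the shapes and the random W, b below are purely illustrative, not trained weights):

import numpy as np

def softmax(z):
    z = z - z.max()               # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()

d0, d1 = 4, 3                     # input dimension, number of classes
x = np.random.randn(d0)
W = np.random.randn(d0, d1)
b = np.random.randn(d1)

probs = softmax(x.T.dot(W) + b)   # f(x) = softmax(x^T W + b)
print(probs, probs.sum())         # class probabilities, summing to 1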
X_train = train_loader.dataset.data.numpy()
#print(X_train.shape)
X_train = X_train.reshape((X_train.shape[0], -1))
y_train = train_loader.dataset.targets.numpy()

X_test = val_loader.dataset.data.numpy()
X_test = X_test.reshape((X_test.shape[0], -1))
y_test = val_loader.dataset.targets.numpy()

print(X_train.shape, y_train.shape)

(60000, 784) (60000,)

import matplotlib.pyplot as plt
import numpy as np

#from sklearn.datasets import fetch_openml
from sklearn.linear_model import LogisticRegression
#from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.utils import check_random_state

scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

# Turn up tolerance for faster convergence
clf = LogisticRegression(C=50., multi_class='multinomial', solver='sag', tol=0.1)
clf.fit(X_train, y_train)
#sparsity = np.mean(clf.coef_ == 0) * 100
score = clf.score(X_test, y_test)

print("Test score with penalty: %.4f" % score)


Test score with penalty: 0.9035

coef = clf.coef_.copy()
plt.figure(figsize=(10, 5))
scale = np.abs(coef).max()
for i in range(10):
    l1_plot = plt.subplot(2, 5, i + 1)
    l1_plot.imshow(coef[i].reshape(28, 28), interpolation='nearest',
                   cmap=plt.cm.RdBu, vmin=-scale, vmax=scale)
    l1_plot.set_xticks(())
    l1_plot.set_yticks(())
    l1_plot.set_xlabel('Class %i' % i)
plt.suptitle('Classification vector for...')

plt.show()

6.2.4 Model: Two Layer MLP

MLP with Scikit-learn

from sklearn.neural_network import MLPClassifier

mlp = MLPClassifier(hidden_layer_sizes=(100, ), max_iter=5, alpha=1e-4,
                    solver='sgd', verbose=10, tol=1e-4, random_state=1,
                    learning_rate_init=0.01, batch_size=64)

mlp.fit(X_train, y_train)
print("Training set score: %f" % mlp.score(X_train, y_train))
print("Test set score: %f" % mlp.score(X_test, y_test))

print("Coef shape=", len(mlp.coefs_))

fig, axes = plt.subplots(4, 4)
# use global min / max to ensure all weights are shown on the same scale
vmin, vmax = mlp.coefs_[0].min(), mlp.coefs_[0].max()
for coef, ax in zip(mlp.coefs_[0].T, axes.ravel()):
    ax.matshow(coef.reshape(28, 28), cmap=plt.cm.gray, vmin=.5 * vmin, vmax=.5 * vmax)
    ax.set_xticks(())
    ax.set_yticks(())

plt.show()


Iteration 1, loss = 0.28611761


Iteration 2, loss = 0.13199804
Iteration 3, loss = 0.09278073
Iteration 4, loss = 0.07177168
Iteration 5, loss = 0.05288073

/home/ed203246/anaconda3/lib/python3.7/site-packages/sklearn/neural_network/multilayer_perceptron.py:562: ConvergenceWarning: Stochastic Optimizer: Maximum iterations (5) reached and the optimization hasn't converged yet.
  % self.max_iter, ConvergenceWarning)

Training set score: 0.989067
Test set score: 0.971900
Coef shape= 2

MLP with pytorch
class TwoLayerMLP(nn.Module):

    def __init__(self, d_in, d_hidden, d_out):
        super(TwoLayerMLP, self).__init__()
        self.d_in = d_in
        self.linear1 = nn.Linear(d_in, d_hidden)
        self.linear2 = nn.Linear(d_hidden, d_out)

    def forward(self, X):
        X = X.view(-1, self.d_in)
        X = self.linear1(X)
        return F.log_softmax(self.linear2(X), dim=1)


Train the Model

• First we want to make sure our network is in training mode.
• Iterate over epochs.
• Alternate train and validation dataset.
• Iterate over all training/val data once per epoch. Loading the individual batches is
handled by the DataLoader.
• Set the gradients to zero using optimizer.zero_grad() since PyTorch by default
accumulates gradients.
• Forward pass:
  – model(inputs): produce the output of our network.
  – torch.max(outputs, 1): softmax predictions.
  – criterion(outputs, labels): loss between the output and the ground truth label.
• In training mode, backward pass backward(): collect a new set of gradients which we
propagate back into each of the network's parameters using optimizer.step().
• We'll also keep track of the progress with some printouts. In order to create a nice
training curve later on we also create two lists for saving training and testing
losses. On the x-axis we want to display the number of training examples the
network has seen during training.
• Save model state: neural network modules as well as optimizers have the ability to
save and load their internal state using .state_dict(). With this we can continue
training from previously saved state dicts if needed - we'd just need to call
.load_state_dict(state_dict).

# %load train_val_model.py
import numpy as np
import torch
import time
import copy

def train_val_model(model, criterion, optimizer, dataloaders, num_epochs=25,
                    scheduler=None, log_interval=None):
    since = time.time()

    best_model_wts = copy.deepcopy(model.state_dict())
    best_acc = 0.0

    # Store losses and accuracies across epochs
    losses, accuracies = dict(train=[], val=[]), dict(train=[], val=[])

    for epoch in range(num_epochs):
        if log_interval is not None and epoch % log_interval == 0:
            print('Epoch {}/{}'.format(epoch, num_epochs - 1))
            print('-' * 10)

        # Each epoch has a training and validation phase
        for phase in ['train', 'val']:
            if phase == 'train':
                model.train()  # Set model to training mode
            else:
                model.eval()   # Set model to evaluate mode

            running_loss = 0.0
            running_corrects = 0

            # Iterate over data.
            nsamples = 0
            for inputs, labels in dataloaders[phase]:
                inputs = inputs.to(device)
                labels = labels.to(device)
                nsamples += inputs.shape[0]

                # zero the parameter gradients
                optimizer.zero_grad()

                # forward
                # track history if only in train
                with torch.set_grad_enabled(phase == 'train'):
                    outputs = model(inputs)
                    _, preds = torch.max(outputs, 1)
                    loss = criterion(outputs, labels)

                    # backward + optimize only if in training phase
                    if phase == 'train':
                        loss.backward()
                        optimizer.step()

                # statistics
                running_loss += loss.item() * inputs.size(0)
                running_corrects += torch.sum(preds == labels.data)

            if scheduler is not None and phase == 'train':
                scheduler.step()

            #nsamples = dataloaders[phase].dataset.data.shape[0]
            epoch_loss = running_loss / nsamples
            epoch_acc = running_corrects.double() / nsamples

            losses[phase].append(epoch_loss)
            accuracies[phase].append(epoch_acc)
            if log_interval is not None and epoch % log_interval == 0:
                print('{} Loss: {:.4f} Acc: {:.2f}%'.format(
                    phase, epoch_loss, 100 * epoch_acc))

            # deep copy the model
            if phase == 'val' and epoch_acc > best_acc:
                best_acc = epoch_acc
                best_model_wts = copy.deepcopy(model.state_dict())

        if log_interval is not None and epoch % log_interval == 0:
            print()

    time_elapsed = time.time() - since
    print('Training complete in {:.0f}m {:.0f}s'.format(
        time_elapsed // 60, time_elapsed % 60))
    print('Best val Acc: {:.2f}%'.format(100 * best_acc))

    # load best model weights
    model.load_state_dict(best_model_wts)

    return model, losses, accuracies

Run one epoch and save the model

model = TwoLayerMLP(D_in, 50, D_out).to(device)
print(next(model.parameters()).is_cuda)
optimizer = optim.SGD(model.parameters(), lr=0.01, momentum=0.5)
criterion = nn.NLLLoss()

# Explore the model
for parameter in model.parameters():
    print(parameter.shape)

print("Total number of parameters =",
      np.sum([np.prod(parameter.shape) for parameter in model.parameters()]))

model, losses, accuracies = train_val_model(model, criterion, optimizer, dataloaders,
                                            num_epochs=1, log_interval=1)

print(next(model.parameters()).is_cuda)
torch.save(model.state_dict(), 'models/mod-%s.pth' % model.__class__.__name__)
True
torch.Size([50, 784])
torch.Size([50])
torch.Size([10, 50])
torch.Size([10])
Total number of parameters =39760
Epoch 0/0

train Loss: 0.4472 Acc: 87.65%


val Loss: 0.3115 Acc: 91.25%

Training complete in 0m10s


Best val Acc: 91.25%
True

Use the model to make new predictions. Consider the device, i.e., load data on the device with
example_data.to(device) for prediction, then move back to CPU with example_data.cpu():

batch_idx, (example_data, example_targets) = next(enumerate(val_loader))
example_data = example_data.to(device)

with torch.no_grad():
    output = model(example_data).cpu()

example_data = example_data.cpu()

# print(output.is_cuda)

# Softmax predictions
preds = output.argmax(dim=1)

print("Output shape=", output.shape, "label shape=", preds.shape)
print("Accuracy = {:.2f}%".format((example_targets == preds).sum().item() * 100. / len(example_targets)))

Output shape= torch.Size([10000, 10]) label shape= torch.Size([10000])
Accuracy = 91.25%

show_data_label_prediction(data=example_data, y_true=example_targets, y_pred=preds, shape=(3, 4))

Plot misclassified samples

errors = example_targets != preds
#print(errors, np.where(errors))
print("Nb errors = {}, (Error rate = {:.2f}%)".format(errors.sum(), 100 * errors.sum().item() / len(errors)))
err_idx = np.where(errors)[0]
show_data_label_prediction(data=example_data[err_idx],
                           y_true=example_targets[err_idx],
                           y_pred=preds[err_idx])

Nb errors = 875, (Error rate = 8.75%)

Continue training from checkpoints: reload the model and run 10 more epochs.

model = TwoLayerMLP(D_in, 50, D_out)
model.load_state_dict(torch.load('models/mod-%s.pth' % model.__class__.__name__))
model.to(device)

optimizer = optim.SGD(model.parameters(), lr=0.01, momentum=0.5)
criterion = nn.NLLLoss()

model, losses, accuracies = train_val_model(model, criterion, optimizer, dataloaders,
                                            num_epochs=10, log_interval=2)

_ = plt.plot(losses['train'], '-b', losses['val'], '--r')

Epoch 0/9

train Loss: 0.3088 Acc: 91.12%


val Loss: 0.2877 Acc: 91.92%

Epoch 2/9

train Loss: 0.2847 Acc: 91.97%


val Loss: 0.2797 Acc: 92.05%

Epoch 4/9

train Loss: 0.2743 Acc: 92.30%


val Loss: 0.2797 Acc: 92.11%

Epoch 6/9

train Loss: 0.2692 Acc: 92.46%

val Loss: 0.2717 Acc: 92.25%

Epoch 8/9

train Loss: 0.2643 Acc: 92.66%


val Loss: 0.2684 Acc: 92.44%

Training complete in 1m51s


Best val Acc: 92.52%

5. Test several MLP architectures

• Define a MultiLayerMLP([D_in, 512, 256, 128, 64, D_out]) class that takes the sizes of
the layers as parameters of the constructor.
• Add some non-linearity with the relu activation function.

class MLP(nn.Module):

    def __init__(self, d_layer):
        super(MLP, self).__init__()
        self.d_layer = d_layer
        layer_list = [nn.Linear(d_layer[l], d_layer[l + 1]) for l in range(len(d_layer) - 1)]
        self.linears = nn.ModuleList(layer_list)

    def forward(self, X):
        X = X.view(-1, self.d_layer[0])
        # relu(Wl x) for all hidden layers
        for layer in self.linears[:-1]:
            X = F.relu(layer(X))
        # softmax(Wl x) for output layer
        return F.log_softmax(self.linears[-1](X), dim=1)

model = MLP([D_in, 512, 256, 128, 64, D_out]).to(device)

optimizer = optim.SGD(model.parameters(), lr=0.01, momentum=0.5)
criterion = nn.NLLLoss()

model, losses, accuracies = train_val_model(model, criterion, optimizer, dataloaders,
                                            num_epochs=10, log_interval=2)

_ = plt.plot(losses['train'], '-b', losses['val'], '--r')

Epoch 0/9

train Loss: 1.2111 Acc: 61.88%


val Loss: 0.3407 Acc: 89.73%

Epoch 2/9

train Loss: 0.1774 Acc: 94.74%


val Loss: 0.1510 Acc: 95.47%

Epoch 4/9

train Loss: 0.0984 Acc: 97.16%


val Loss: 0.1070 Acc: 96.76%

Epoch 6/9

train Loss: 0.0636 Acc: 98.14%


val Loss: 0.0967 Acc: 96.98%

Epoch 8/9

train Loss: 0.0431 Acc: 98.75%


val Loss: 0.0822 Acc: 97.55%

Training complete in 1m54s


Best val Acc: 97.55%


6.2.6 Reduce the size of the training dataset

Reduce the size of the training dataset by considering only 10 mini-batches of size 16.

train_loader, val_loader = load_mnist(16, 1000)

train_size = 10 * 16
# Stratified sub-sampling
targets = train_loader.dataset.targets.numpy()
nclasses = len(set(targets))

indices = np.concatenate([np.random.choice(np.where(targets == lab)[0],
                                           int(train_size / nclasses), replace=False)
                          for lab in set(targets)])
np.random.shuffle(indices)

train_loader = torch.utils.data.DataLoader(train_loader.dataset, batch_size=16,
    sampler=torch.utils.data.SubsetRandomSampler(indices))

# Check train subsampling
train_labels = np.concatenate([labels.numpy() for inputs, labels in train_loader])
print("Train size=", len(train_labels), " Train label count=",
      {lab: np.sum(train_labels == lab) for lab in set(train_labels)})
print("Batch sizes=", [inputs.size(0) for inputs, labels in train_loader])

# Put together train and val
dataloaders = dict(train=train_loader, val=val_loader)

# Info about the dataset
D_in = np.prod(dataloaders["train"].dataset.data.shape[1:])
D_out = len(dataloaders["train"].dataset.targets.unique())
print("Datasets shape", {x: dataloaders[x].dataset.data.shape for x in ['train', 'val']})
print("N input features", D_in, "N output", D_out)

Train size= 160  Train label count= {0: 16, 1: 16, 2: 16, 3: 16, 4: 16, 5: 16, 6: 16, 7: 16, 8: 16, 9: 16}
Batch sizes= [16, 16, 16, 16, 16, 16, 16, 16, 16, 16]
Datasets shape {'train': torch.Size([60000, 28, 28]), 'val': torch.Size([10000, 28, 28])}
N input features 784 N output 10

model = MLP([D_in, 512, 256, 128, 64, D_out]).to(device)
optimizer = optim.SGD(model.parameters(), lr=0.01, momentum=0.5)
criterion = nn.NLLLoss()

model, losses, accuracies = train_val_model(model, criterion, optimizer, dataloaders,
                                            num_epochs=100, log_interval=20)

_ = plt.plot(losses['train'], '-b', losses['val'], '--r')

Epoch 0/99

train Loss: 2.3066 Acc: 9.38%


val Loss: 2.3058 Acc: 10.34%

Epoch 20/99

train Loss: 2.1213 Acc: 58.13%


val Loss: 2.1397 Acc: 51.34%

Epoch 40/99

train Loss: 0.4651 Acc: 88.75%


val Loss: 0.8372 Acc: 73.63%

Epoch 60/99

train Loss: 0.0539 Acc: 100.00%


val Loss: 0.8384 Acc: 75.46%

Epoch 80/99

train Loss: 0.0142 Acc: 100.00%


val Loss: 0.9417 Acc: 75.55%

Training complete in 1m57s


Best val Acc: 76.02%


Use an optimizer with an adaptive learning rate: Adam

model = MLP([D_in, 512, 256, 128, 64, D_out]).to(device)
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)
criterion = nn.NLLLoss()

model, losses, accuracies = train_val_model(model, criterion, optimizer, dataloaders,
                                            num_epochs=100, log_interval=20)

_ = plt.plot(losses['train'], '-b', losses['val'], '--r')

Epoch 0/99

train Loss: 2.2523 Acc: 20.62%


val Loss: 2.0853 Acc: 45.51%

Epoch 20/99

train Loss: 0.0010 Acc: 100.00%


val Loss: 1.0113 Acc: 78.08%

Epoch 40/99

train Loss: 0.0002 Acc: 100.00%


val Loss: 1.1456 Acc: 78.12%

Epoch 60/99

train Loss: 0.0001 Acc: 100.00%


val Loss: 1.2630 Acc: 77.98%

Epoch 80/99

train Loss: 0.0000 Acc: 100.00%


val Loss: 1.3446 Acc: 77.87%

Training complete in 1m54s
Best val Acc: 78.52%

6.2.7 Run MLP on CIFAR-10 dataset

The CIFAR-10 dataset consists of 60000 32x32 colour images in 10 classes, with 6000
images per class. There are 50000 training images and 10000 test images.
The dataset is divided into five training batches and one test batch, each with 10000
images. The test batch contains exactly 1000 randomly-selected images from each class.
The training batches contain the remaining images in random order, but some training
batches may contain more images from one class than another. Between them, the training
batches contain exactly 5000 images from each class.

Here are the classes in the dataset, as well as 10 random images from each: - airplane
- automobile
- bird
- cat
- deer
- dog
- frog
- horse
- ship
- truck

Load CIFAR-10 dataset


from pathlib import Path
WD = os.path.join(Path.home(), "data", "pystatml", "dl_cifar10_pytorch")
os.makedirs(WD, exist_ok=True)
os.chdir(WD)
print("Working dir is:", os.getcwd())
os.makedirs("data", exist_ok=True)
os.makedirs("models", exist_ok=True)

import numpy as np
import torch
import torch.nn as nn
import torchvision
import torchvision.transforms as transforms

# Device configuration
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

# Hyper-parameters
num_epochs = 5
learning_rate = 0.001

# Image preprocessing modules
transform = transforms.Compose([
    transforms.Pad(4),
    transforms.RandomHorizontalFlip(),
    transforms.RandomCrop(32),
    transforms.ToTensor()])

# CIFAR-10 dataset
train_dataset = torchvision.datasets.CIFAR10(root='data/',
                                             train=True,
                                             transform=transform,
                                             download=True)

val_dataset = torchvision.datasets.CIFAR10(root='data/',
                                           train=False,
                                           transform=transforms.ToTensor())

# Data loader
train_loader = torch.utils.data.DataLoader(dataset=train_dataset,
                                           batch_size=100,
                                           shuffle=True)

val_loader = torch.utils.data.DataLoader(dataset=val_dataset,
                                         batch_size=100,
                                         shuffle=False)

# Put together train and val
dataloaders = dict(train=train_loader, val=val_loader)

Working dir is: /volatile/duchesnay/data/pystatml/dl_cifar10_pytorch
Files already downloaded and verified
Datasets shape: {'train': (50000, 32, 32, 3), 'val': (10000, 32, 32, 3)}
N input features: 3072 N output: 10

model = MLP([D_in, 512, 256, 128, 64, D_out]).to(device)
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)
criterion = nn.NLLLoss()

model, losses, accuracies = train_val_model(model, criterion, optimizer, dataloaders,
                                            num_epochs=50, log_interval=10)

_ = plt.plot(losses['train'], '-b', losses['val'], '--r')

Epoch 0/49

train Loss: 2.0171 Acc: 24.24%


val Loss: 1.8761 Acc: 30.84%

Epoch 10/49

train Loss: 1.5596 Acc: 43.70%


val Loss: 1.5853 Acc: 43.07%

Epoch 20/49

train Loss: 1.4558 Acc: 47.59%


val Loss: 1.4210 Acc: 48.88%

Epoch 30/49

train Loss: 1.3904 Acc: 49.79%


val Loss: 1.3890 Acc: 50.16%

Epoch 40/49

train Loss: 1.3497 Acc: 51.24%


val Loss: 1.3625 Acc: 51.41%

Training complete in 8m59s


Best val Acc: 52.38%


3. Convolutional neural network

1. Outline

2. Architectures
3. Train and test functions
4. CNN models
5. MNIST
6. CIFAR-10

Sources:
Deep learning - cs231n.stanford.edu
CNN - Stanford cs231n
Pytorch - WWW tutorials - github tutorials - github examples
MNIST and pytorch: - MNIST nextjournal.com/gkoehler/pytorch-mnist - MNIST github/pytorch/examples - MNIST kaggle

2. Architectures

Sources:
• cv-tricks.com
• zhenye-na.github.io (https://zhenye-na.github.io/2018/12/01/cnn-deep-leearning-ai-week2.html)


LeNet

The first Convolutional Networks were developed by Yann LeCun in the 1990s.

Fig. 1: LeNet

AlexNet

(2012, Alex Krizhevsky, Ilya Sutskever and Geoff Hinton)

Fig. 2: AlexNet

• Deeper, bigger.
• Featured Convolutional Layers stacked on top of each other (previously it was
common to only have a single CONV layer always immediately followed by a POOL
layer).
• ReLU (Rectified Linear Unit) for the non-linear part, instead of a Tanh or Sigmoid.
The advantage of the ReLU over sigmoid is that it trains much faster than the latter
because the derivative of sigmoid becomes very small in the saturating region and
therefore the updates to the weights almost vanish. This is called the vanishing gradient
problem.
• Dropout: reduces the over-fitting by using a Dropout layer after every FC layer.
A Dropout layer has a probability, p, associated with it and is applied at every neuron
of the response map separately. It randomly switches off the activation with the
probability p.

Why does DropOut work?

Fig. 3: AlexNet architecture

Fig. 4: Dropout

The idea behind dropout is similar to model ensembles. Due to the dropout layer,
different sets of neurons which are switched off represent a different architecture, and
all these different architectures are trained in parallel, with weight given to each subset
and the summation of weights being one. For n neurons attached to DropOut, the
number of subset architectures formed is 2^n. So it amounts to prediction being
averaged over these ensembles of models. This provides a structured model
regularization which helps in avoiding the over-fitting. Another view of DropOut being
helpful is that since neurons are randomly chosen, they tend to avoid developing
co-adaptations among themselves, thereby enabling them to develop meaningful features,
independent of others. A minimal usage sketch is given below.
• Data augmentation is carried out to reduce over-fitting. This data augmentation
includes mirroring and cropping the images to increase the variation in the training
data-set.
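A minimal sketch of dropout in PyTorch, assuming a small fully-connected block (the layer sizes are illustrative only):

import torch
import torch.nn as nn

block = nn.Sequential(
    nn.Linear(256, 128),
    nn.ReLU(),
    nn.Dropout(p=0.5),   # randomly zeroes activations with probability p during training
    nn.Linear(128, 10),
)

block.train()            # dropout active
x = torch.randn(4, 256)
print(block(x).shape)    # torch.Size([4, 10])

block.eval()             # dropout disabled at inference time
print(block(x).shape)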
GoogLeNet. (Szegedy et al. from Google, 2014) was a Convolutional Network. Its main
contribution was the development of an
• Inception Module that dramatically reduced the number of parameters in the network
(4M, compared to AlexNet with 60M).

Fig. 5: Inception Module

• There are also several followup versions to the GoogLeNet, most recently Inception-v4.
VGGNet. (Karen Simonyan and Andrew Zisserman, 2014)
• 16 CONV/FC layers and, appealingly, features an extremely homogeneous architecture.
• Only performs 3x3 convolutions and 2x2 pooling from the beginning to the end.
Replaces large kernel-sized filters (11 and 5 in the first and second convolutional
layers, respectively) with multiple 3x3 kernel-sized filters one after another.
With a given receptive field (the effective area size of the input image on which the output
depends), multiple stacked smaller-size kernels are better than a single larger-size
kernel, because multiple non-linear layers increase the depth of the network, which
enables it to learn more complex features, and at a lower cost. For example, three
3x3 filters stacked on top of each other with stride 1 have a receptive field of 7, but for C
channels the number of parameters involved is 3 x (3^2 C^2) = 27 C^2, compared to
7^2 C^2 = 49 C^2 for a single kernel of size 7.
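A quick check of this parameter count (ignoring biases; C is an illustrative channel count, assumed equal for input and output):

C = 64
params_3x3_stack = 3 * (3 * 3 * C * C)    # three stacked 3x3 conv layers: 27*C^2
params_7x7_single = 7 * 7 * C * C         # one 7x7 conv layer: 49*C^2
print(params_3x3_stack, params_7x7_single)  # 110592 vs 200704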


Fig. 6: VGGNet

Fig. 7: VGGNet architecture

• Lot more memory and parameters (140M)

ResNet. (Kaiming He et al., 2015)

ResNet block variants (Source):

Fig. 8: ResNet block

Fig. 9: ResNet 18

• Skip connections
• Batch normalization
• State-of-the-art CNN models and the default choice (as of May 10, 2016). In
particular, also see more
• recent developments that tweak the original architecture from Kaiming He et al.:
Identity Mappings in Deep Residual Networks (published March 2016).

Models in pytorch (a minimal residual-block sketch is given below)
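A minimal sketch of a basic residual block with a skip connection and batch normalization, assuming equal input/output channels and stride 1 (a simplified illustration, not the exact torchvision implementation):

import torch
import torch.nn as nn
import torch.nn.functional as F

class BasicResBlock(nn.Module):
    def __init__(self, channels):
        super(BasicResBlock, self).__init__()
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)

    def forward(self, x):
        out = F.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return F.relu(out + x)   # skip connection: add the input back before the activation

x = torch.randn(1, 64, 32, 32)
print(BasicResBlock(64)(x).shape)  # torch.Size([1, 64, 32, 32])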


Fig. 10: ResNet 18 architecture

3. Architecures general guidelines

• ConvNets stack CONV, POOL, FC layers.
• Trend towards smaller filters and deeper architectures: stack 3x3 instead of 5x5.
• Trend towards getting rid of POOL/FC layers (just CONV).
• Historically, architectures looked like [(CONV-RELU) x N -> POOL?] x M -> (FC-RELU) x K -> SOFTMAX, where N is usually up to ~5, M is large, and 0 <= K <= 2 (see the sketch below).
• But recent advances such as ResNet/GoogLeNet have challenged this paradigm.
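A sketch of this historical pattern with N=2, M=2, K=1 (arbitrary small values) for 32x32 RGB images:

import torch
import torch.nn as nn

# [(CONV-RELU) x N -> POOL] x M -> (FC-RELU) x K -> SOFTMAX, with N=2, M=2, K=1.
net = nn.Sequential(
    # block 1
    nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
    nn.Conv2d(16, 16, 3, padding=1), nn.ReLU(),
    nn.MaxPool2d(2),                      # 32x32 -> 16x16
    # block 2
    nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(),
    nn.Conv2d(32, 32, 3, padding=1), nn.ReLU(),
    nn.MaxPool2d(2),                      # 16x16 -> 8x8
    nn.Flatten(),
    nn.Linear(32 * 8 * 8, 128), nn.ReLU(),
    nn.Linear(128, 10),
    nn.LogSoftmax(dim=1))

print(net(torch.randn(1, 3, 32, 32)).shape)  # torch.Size([1, 10])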

6.3.4 Train function

%matplotlib inline

import os
import numpy as np
import torch
import torch.nn as nn
import torch.optim as optim
from torch.optim import lr_scheduler
import torchvision
import torchvision.transforms as transforms
from torchvision import models
#
from pathlib import Path
import matplotlib.pyplot as plt

# Device configuration
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

# %load train_val_model.py
import numpy as np
import torch
import time
import copy


def train_val_model(model, criterion, optimizer, dataloaders, num_epochs=25,
                    scheduler=None, log_interval=None):
    since = time.time()

    best_model_wts = copy.deepcopy(model.state_dict())
    best_acc = 0.0

    # Store losses and accuracies across epochs
    losses, accuracies = dict(train=[], val=[]), dict(train=[], val=[])

    for epoch in range(num_epochs):
        if log_interval is not None and epoch % log_interval == 0:
            print('Epoch {}/{}'.format(epoch, num_epochs - 1))
            print('-' * 10)

        # Each epoch has a training and validation phase
        for phase in ['train', 'val']:
            if phase == 'train':
                model.train()  # Set model to training mode
            else:
                model.eval()   # Set model to evaluate mode

            running_loss = 0.0
            running_corrects = 0

            # Iterate over data.
            nsamples = 0
            for inputs, labels in dataloaders[phase]:
                inputs = inputs.to(device)
                labels = labels.to(device)
                nsamples += inputs.shape[0]

                # zero the parameter gradients
                optimizer.zero_grad()

                # forward
                # track history if only in train
                with torch.set_grad_enabled(phase == 'train'):
                    outputs = model(inputs)
                    _, preds = torch.max(outputs, 1)
                    loss = criterion(outputs, labels)

                    # backward + optimize only if in training phase
                    if phase == 'train':
                        loss.backward()
                        optimizer.step()

                # statistics
                running_loss += loss.item() * inputs.size(0)
                running_corrects += torch.sum(preds == labels.data)

            if scheduler is not None and phase == 'train':
                scheduler.step()

            #nsamples = dataloaders[phase].dataset.data.shape[0]
            epoch_loss = running_loss / nsamples
            epoch_acc = running_corrects.double() / nsamples

            losses[phase].append(epoch_loss)
            accuracies[phase].append(epoch_acc)
            if log_interval is not None and epoch % log_interval == 0:
                print('{} Loss: {:.4f} Acc: {:.2f}%'.format(
                    phase, epoch_loss, 100 * epoch_acc))

            # deep copy the model
            if phase == 'val' and epoch_acc > best_acc:
                best_acc = epoch_acc
                best_model_wts = copy.deepcopy(model.state_dict())

        if log_interval is not None and epoch % log_interval == 0:
            print()

    time_elapsed = time.time() - since
    print('Training complete in {:.0f}m {:.0f}s'.format(
        time_elapsed // 60, time_elapsed % 60))
    print('Best val Acc: {:.2f}%'.format(100 * best_acc))

    # load best model weights
    model.load_state_dict(best_model_wts)

    return model, losses, accuracies

6.3.5 CNN models

LeNet-5

Here we implement LeNet-5 with ReLU activation. Sources: (1), (2).
import torch.nn as nn
import torch.nn.functional as F


class LeNet5(nn.Module):
    """
    layers: (nb channels in input layer,
             nb channels in 1rst conv,
             nb channels in 2nd conv,
             nb neurons for 1rst FC: TO BE TUNED,
             nb neurons for 2nd FC,
             nb neurons for 3rd FC,
             nb neurons output FC TO BE TUNED)
    """
    def __init__(self, layers=(1, 6, 16, 1024, 120, 84, 10), debug=False):
        super(LeNet5, self).__init__()
        self.layers = layers
        self.debug = debug
        self.conv1 = nn.Conv2d(layers[0], layers[1], 5, padding=2)
        self.conv2 = nn.Conv2d(layers[1], layers[2], 5)
        self.fc1 = nn.Linear(layers[3], layers[4])
        self.fc2 = nn.Linear(layers[4], layers[5])
        self.fc3 = nn.Linear(layers[5], layers[6])

    def forward(self, x):
        x = F.max_pool2d(F.relu(self.conv1(x)), 2)  # same shape / 2
        x = F.max_pool2d(F.relu(self.conv2(x)), 2)  # -4 / 2
        if self.debug:
            print("### DEBUG: Shape of last convnet=", x.shape[1:],
                  ". FC size=", np.prod(x.shape[1:]))
        x = x.view(-1, self.layers[3])
        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(x))
        x = self.fc3(x)
        return F.log_softmax(x, dim=1)

VGGNet like: conv-relu blocks
# Defining the network (MiniVGGNet)
import torch.nn as nn
import torch.nn.functional as F


class MiniVGGNet(torch.nn.Module):

    def __init__(self, layers=(1, 16, 32, 1024, 120, 84, 10), debug=False):
        super(MiniVGGNet, self).__init__()
        self.layers = layers
        self.debug = debug

        # Conv block 1
        self.conv11 = nn.Conv2d(in_channels=layers[0], out_channels=layers[1],
                                kernel_size=3, stride=1, padding=0, bias=True)
        self.conv12 = nn.Conv2d(in_channels=layers[1], out_channels=layers[1],
                                kernel_size=3, stride=1, padding=0, bias=True)

        # Conv block 2
        self.conv21 = nn.Conv2d(in_channels=layers[1], out_channels=layers[2],
                                kernel_size=3, stride=1, padding=0, bias=True)
        self.conv22 = nn.Conv2d(in_channels=layers[2], out_channels=layers[2],
                                kernel_size=3, stride=1, padding=1, bias=True)

        # Fully connected layers
        self.fc1 = nn.Linear(layers[3], layers[4])
        self.fc2 = nn.Linear(layers[4], layers[5])
        self.fc3 = nn.Linear(layers[5], layers[6])

    def forward(self, x):
        x = F.relu(self.conv11(x))
        x = F.relu(self.conv12(x))
        x = F.max_pool2d(x, 2)

        x = F.relu(self.conv21(x))
        x = F.relu(self.conv22(x))
        x = F.max_pool2d(x, 2)

        if self.debug:
            print("### DEBUG: Shape of last convnet=", x.shape[1:],
                  ". FC size=", np.prod(x.shape[1:]))

        x = x.view(-1, self.layers[3])
        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(x))
        x = self.fc3(x)

        return F.log_softmax(x, dim=1)

ResNet-like Model:

Stack multiple ResNet blocks.

#
# An implementation of https://arxiv.org/pdf/1512.03385.pdf
#
# See section 4.2 for the model architecture on CIFAR-10
#
# Some part of the code was referenced from below
# https://github.com/pytorch/vision/blob/master/torchvision/models/resnet.py
#
import torch.nn as nn


# 3x3 convolution
def conv3x3(in_channels, out_channels, stride=1):
    return nn.Conv2d(in_channels, out_channels, kernel_size=3,
                     stride=stride, padding=1, bias=False)


# Residual block
class ResidualBlock(nn.Module):
    def __init__(self, in_channels, out_channels, stride=1, downsample=None):
        super(ResidualBlock, self).__init__()
        self.conv1 = conv3x3(in_channels, out_channels, stride)
        self.bn1 = nn.BatchNorm2d(out_channels)
        self.relu = nn.ReLU(inplace=True)
        self.conv2 = conv3x3(out_channels, out_channels)
        self.bn2 = nn.BatchNorm2d(out_channels)
        self.downsample = downsample

    def forward(self, x):
        residual = x
        out = self.conv1(x)
        out = self.bn1(out)
        out = self.relu(out)
        out = self.conv2(out)
        out = self.bn2(out)
        if self.downsample:
            residual = self.downsample(x)
        out += residual
        out = self.relu(out)
        return out


# ResNet
class ResNet(nn.Module):
    def __init__(self, block, layers, num_classes=10):
        super(ResNet, self).__init__()
        self.in_channels = 16
        self.conv = conv3x3(3, 16)
        self.bn = nn.BatchNorm2d(16)
        self.relu = nn.ReLU(inplace=True)
        self.layer1 = self.make_layer(block, 16, layers[0])
        self.layer2 = self.make_layer(block, 32, layers[1], 2)
        self.layer3 = self.make_layer(block, 64, layers[2], 2)
        self.avg_pool = nn.AvgPool2d(8)
        self.fc = nn.Linear(64, num_classes)

    def make_layer(self, block, out_channels, blocks, stride=1):
        downsample = None
        if (stride != 1) or (self.in_channels != out_channels):
            downsample = nn.Sequential(
                conv3x3(self.in_channels, out_channels, stride=stride),
                nn.BatchNorm2d(out_channels))
        layers = []
        layers.append(block(self.in_channels, out_channels, stride, downsample))
        self.in_channels = out_channels
        for i in range(1, blocks):
            layers.append(block(out_channels, out_channels))
        return nn.Sequential(*layers)

    def forward(self, x):
        out = self.conv(x)
        out = self.bn(out)
        out = self.relu(out)
        out = self.layer1(out)
        out = self.layer2(out)
        out = self.layer3(out)
        out = self.avg_pool(out)
        out = out.view(out.size(0), -1)
        out = self.fc(out)
        return F.log_softmax(out, dim=1)
        #return out

ResNet9

• DAWNBench on cifar10
• ResNet9: train to 94% CIFAR10 accuracy in 100 seconds


6.3.6 MNIST digit classification

from pathlib import Path
from torchvision import datasets, transforms
import os

WD = os.path.join(Path.home(), "data", "pystatml", "dl_mnist_pytorch")
os.makedirs(WD, exist_ok=True)
os.chdir(WD)
print("Working dir is:", os.getcwd())
os.makedirs("data", exist_ok=True)
os.makedirs("models", exist_ok=True)


def load_mnist(batch_size_train, batch_size_test):

    train_loader = torch.utils.data.DataLoader(
        datasets.MNIST('data', train=True, download=True,
                       transform=transforms.Compose([
                           transforms.ToTensor(),
                           transforms.Normalize((0.1307,), (0.3081,))
                       ])),
        batch_size=batch_size_train, shuffle=True)

    test_loader = torch.utils.data.DataLoader(
        datasets.MNIST('data', train=False,
                       transform=transforms.Compose([
                           transforms.ToTensor(),
                           transforms.Normalize((0.1307,), (0.3081,))
                       ])),
        batch_size=batch_size_test, shuffle=True)
    return train_loader, test_loader


train_loader, val_loader = load_mnist(64, 1000)

dataloaders = dict(train=train_loader, val=val_loader)

# Info about the dataset
data_shape = dataloaders["train"].dataset.data.shape[1:]
D_in = np.prod(data_shape)
D_out = len(dataloaders["train"].dataset.targets)
print("Datasets shape", {x: dataloaders[x].dataset.data.shape for x in ['train', 'val']})
print("N input features", D_in, "N output", D_out)

Working dir is: /volatile/duchesnay/data/pystatml/dl_mnist_pytorch
Datasets shape {'train': torch.Size([60000, 28, 28]), 'val': torch.Size([10000, 28, 28])}
N input features 784 N output 60000

LeNet

Dry run in debug mode to get the shape of the last convnet layer.

model = LeNet5((1, 6, 16, 1, 120, 84, 10), debug=True)
batch_idx, (data_example, target_example) = next(enumerate(train_loader))
print(model)
_ = model(data_example)


LeNet5(
(conv1): Conv2d(1, 6, kernel_size=(5, 5), stride=(1, 1), padding=(2, 2))
(conv2): Conv2d(6, 16, kernel_size=(5, 5), stride=(1, 1))
(fc1): Linear(in_features=1, out_features=120, bias=True)
(fc2): Linear(in_features=120, out_features=84, bias=True)
(fc3): Linear(in_features=84, out_features=10, bias=True)
)
### DEBUG: Shape of last convnet= torch.Size([16, 5, 5]) . FC size= 400
Set the first FC layer to 400 neurons.
model = LeNet5((1, 6, 16, 400, 120, 84, 10)).to(device)
optimizer = optim.SGD(model.parameters(), lr=0.01, momentum=0.5)
criterion = nn.NLLLoss()

# Explore the model
for parameter in model.parameters():
    print(parameter.shape)

print("Total number of parameters =",
      np.sum([np.prod(parameter.shape) for parameter in model.parameters()]))

model, losses, accuracies = train_val_model(model, criterion, optimizer, dataloaders,
                                            num_epochs=5, log_interval=2)

_ = plt.plot(losses['train'], '-b', losses['val'], '--r')


torch.Size([6, 1, 5, 5])
torch.Size([6])
torch.Size([16, 6, 5, 5])
torch.Size([16])
torch.Size([120, 400])
torch.Size([120])
torch.Size([84, 120])
torch.Size([84])
torch.Size([10, 84])
torch.Size([10])
Total number of parameters =61706
Epoch 0/4

train Loss: 0.6541 Acc: 80.22%


val Loss: 0.1568 Acc: 94.88%

Epoch 2/4

train Loss: 0.0829 Acc: 97.46%


val Loss: 0.0561 Acc: 98.29%

Epoch 4/4

train Loss: 0.0545 Acc: 98.32%


val Loss: 0.0547 Acc: 98.23%

Training complete in 0m53s


Best val Acc: 98.29%


MiniVGGNet

model = MiniVGGNet(layers=(1, 16, 32, 1, 120, 84, 10), debug=True)

print(model)
_ = model(data_example)

MiniVGGNet(
(conv11): Conv2d(1, 16, kernel_size=(3, 3), stride=(1, 1))
(conv12): Conv2d(16, 16, kernel_size=(3, 3), stride=(1, 1))
(conv21): Conv2d(16, 32, kernel_size=(3, 3), stride=(1, 1))
(conv22): Conv2d(32, 32, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
(fc1): Linear(in_features=1, out_features=120, bias=True)
(fc2): Linear(in_features=120, out_features=84, bias=True)
(fc3): Linear(in_features=84, out_features=10, bias=True)
)
### DEBUG: Shape of last convnet= torch.Size([32, 5, 5]) . FC size= 800

Set the first FC layer to 800 neurons.
model = MiniVGGNet((1, 16, 32, 800, 120, 84, 10)).to(device)
optimizer = optim.SGD(model.parameters(), lr=0.01, momentum=0.5)
criterion = nn.NLLLoss()

# Explore the model
for parameter in model.parameters():
    print(parameter.shape)

print("Total number of parameters =",
      np.sum([np.prod(parameter.shape) for parameter in model.parameters()]))

model, losses, accuracies = train_val_model(model, criterion, optimizer, dataloaders,
                                            num_epochs=5, log_interval=2)

_ = plt.plot(losses['train'], '-b', losses['val'], '--r')

torch.Size([16, 1, 3, 3])
torch.Size([16])
torch.Size([16, 16, 3, 3])
torch.Size([16])
torch.Size([32, 16, 3, 3])
torch.Size([32])
torch.Size([32, 32, 3, 3])
torch.Size([32])
torch.Size([120, 800])
torch.Size([120])
torch.Size([84, 120])
torch.Size([84])
torch.Size([10, 84])
torch.Size([10])
Total number of parameters =123502
Epoch 0/4

train Loss: 1.4950 Acc: 45.62%


val Loss: 0.2489 Acc: 91.97%

Epoch 2/4

train Loss: 0.0838 Acc: 97.30%


val Loss: 0.0653 Acc: 97.81%

Epoch 4/4

train Loss: 0.0484 Acc: 98.51%


val Loss: 0.0443 Acc: 98.60%

Training complete in 0m57s


Best val Acc: 98.60%


Reduce the size of training dataset

Reduce the size of the training dataset by considering only 10 mini-batches of size 16.

train_loader, val_loader = load_mnist(16, 1000)

train_size = 10 * 16

# Stratified sub-sampling
targets = train_loader.dataset.targets.numpy()
nclasses = len(set(targets))

indices = np.concatenate([np.random.choice(np.where(targets == lab)[0],
                                           int(train_size / nclasses), replace=False)
                          for lab in set(targets)])
np.random.shuffle(indices)

train_loader = torch.utils.data.DataLoader(train_loader.dataset, batch_size=16,
    sampler=torch.utils.data.SubsetRandomSampler(indices))

# Check train subsampling
train_labels = np.concatenate([labels.numpy() for inputs, labels in train_loader])
print("Train size=", len(train_labels), " Train label count=",
      {lab: np.sum(train_labels == lab) for lab in set(train_labels)})
print("Batch sizes=", [inputs.size(0) for inputs, labels in train_loader])

# Put together train and val
dataloaders = dict(train=train_loader, val=val_loader)

# Info about the dataset
data_shape = dataloaders["train"].dataset.data.shape[1:]
D_in = np.prod(data_shape)
D_out = len(dataloaders["train"].dataset.targets.unique())
print("Datasets shape", {x: dataloaders[x].dataset.data.shape for x in ['train', 'val']})
print("N input features", D_in, "N output", D_out)

Train size= 160  Train label count= {0: 16, 1: 16, 2: 16, 3: 16, 4: 16, 5: 16, 6: 16, 7: 16, 8: 16, 9: 16}
Batch sizes= [16, 16, 16, 16, 16, 16, 16, 16, 16, 16]
Datasets shape {'train': torch.Size([60000, 28, 28]), 'val': torch.Size([10000, 28, 28])}
N input features 784 N output 10

LeNet5
model = LeNet5((1, 6, 16, 400, 120, 84, D_out)).to(device)
optimizer = optim.SGD(model.parameters(), lr=0.01, momentum=0.5)
criterion = nn.NLLLoss()

model, losses, accuracies = train_val_model(model, criterion, optimizer, dataloaders,
                                            num_epochs=100, log_interval=20)

_ = plt.plot(losses['train'], '-b', losses['val'], '--r')

Epoch 0/99

train Loss: 2.3009 Acc: 12.50%


val Loss: 2.2938 Acc: 10.90%

Epoch 20/99

train Loss: 0.3881 Acc: 90.00%


val Loss: 0.6946 Acc: 78.96%

Epoch 40/99

train Loss: 0.0278 Acc: 100.00%


val Loss: 0.5514 Acc: 86.55%

Epoch 60/99

train Loss: 0.0068 Acc: 100.00%


val Loss: 0.6252 Acc: 86.77%

Epoch 80/99

train Loss: 0.0033 Acc: 100.00%


val Loss: 0.6711 Acc: 86.83%

Training complete in 1m54s


Best val Acc: 86.92%


MiniVGGNet

model = MiniVGGNet((1, 16, 32, 800, 120, 84, 10)).to(device)
optimizer = optim.SGD(model.parameters(), lr=0.01, momentum=0.5)
criterion = nn.NLLLoss()

model, losses, accuracies = train_val_model(model, criterion, optimizer, dataloaders,
                                            num_epochs=100, log_interval=20)

_ = plt.plot(losses['train'], '-b', losses['val'], '--r')

Epoch 0/99

train Loss: 2.3051 Acc: 10.00%


val Loss: 2.3061 Acc: 9.82%

Epoch 20/99

train Loss: 2.2842 Acc: 10.00%


val Loss: 2.2860 Acc: 9.82%

Epoch 40/99

train Loss: 0.6609 Acc: 75.63%


val Loss: 0.7725 Acc: 72.44%

Epoch 60/99

train Loss: 0.2788 Acc: 93.75%


val Loss: 0.7235 Acc: 80.34%

Epoch 80/99

train Loss: 0.0023 Acc: 100.00%


val Loss: 0.7130 Acc: 86.56%

Training complete in 2m11s
Best val Acc: 86.72%

6.3.7 CIFAR-10 dataset

Source Yunjey Choi

from pathlib import Path

WD = os.path.join(Path.home(), "data", "pystatml", "dl_cifar10_pytorch")
os.makedirs(WD, exist_ok=True)
os.chdir(WD)
print("Working dir is:", os.getcwd())
os.makedirs("data", exist_ok=True)
os.makedirs("models", exist_ok=True)

import numpy as np
import torch
import torch.nn as nn
import torchvision
import torchvision.transforms as transforms

# Device configuration
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

# Hyper-parameters
num_epochs = 5
learning_rate = 0.001

# Image preprocessing modules
transform = transforms.Compose([
    transforms.Pad(4),
    transforms.RandomHorizontalFlip(),
    transforms.RandomCrop(32),
    transforms.ToTensor()])

# CIFAR-10 dataset
train_dataset = torchvision.datasets.CIFAR10(root='data/',
                                             train=True,
                                             transform=transform,
                                             download=True)

val_dataset = torchvision.datasets.CIFAR10(root='data/',
                                           train=False,
                                           transform=transforms.ToTensor())

# Data loader
train_loader = torch.utils.data.DataLoader(dataset=train_dataset,
                                           batch_size=100,
                                           shuffle=True)

val_loader = torch.utils.data.DataLoader(dataset=val_dataset,
                                         batch_size=100,
                                         shuffle=False)

# Put together train and val
dataloaders = dict(train=train_loader, val=val_loader)

# Info about the dataset
data_shape = dataloaders["train"].dataset.data.shape[1:]
D_in = np.prod(data_shape)
D_out = len(set(dataloaders["train"].dataset.targets))
print("Datasets shape:", {x: dataloaders[x].dataset.data.shape for x in ['train', 'val']})
print("N input features:", D_in, "N output:", D_out)

Working dir is: /volatile/duchesnay/data/pystatml/dl_cifar10_pytorch
Files already downloaded and verified
Datasets shape: {'train': (50000, 32, 32, 3), 'val': (10000, 32, 32, 3)}
N input features: 3072 N output: 10

LeNet

model = LeNet5((3, 6, 16, 1, 120, 84, D_out), debug=True)
batch_idx, (data_example, target_example) = next(enumerate(train_loader))
print(model)
_ = model(data_example)

LeNet5(
(conv1): Conv2d(3, 6, kernel_size=(5, 5), stride=(1, 1), padding=(2, 2))
(conv2): Conv2d(6, 16, kernel_size=(5, 5), stride=(1, 1))
(fc1): Linear(in_features=1, out_features=120, bias=True)
(fc2): Linear(in_features=120, out_features=84, bias=True)
(fc3): Linear(in_features=84, out_features=10, bias=True)
)
### DEBUG: Shape of last convnet= torch.Size([16, 6, 6]) . FC size= 576

Set 576 neurons to the first FC layer.

SGD with momentum, lr=0.001, momentum=0.5:

model = LeNet5((3, 6, 16, 576, 120, 84, D_out)).to(device)
optimizer = optim.SGD(model.parameters(), lr=0.001, momentum=0.5)
criterion = nn.NLLLoss()

# Explore the model
for parameter in model.parameters():
    print(parameter.shape)

print("Total number of parameters =",
      np.sum([np.prod(parameter.shape) for parameter in model.parameters()]))

model, losses, accuracies = train_val_model(model, criterion, optimizer, dataloaders,
                                            num_epochs=25, log_interval=5)

_ = plt.plot(losses['train'], '-b', losses['val'], '--r')


torch.Size([6, 3, 5, 5])
torch.Size([6])
torch.Size([16, 6, 5, 5])
torch.Size([16])
torch.Size([120, 576])
torch.Size([120])
torch.Size([84, 120])
torch.Size([84])
torch.Size([10, 84])
torch.Size([10])
Total number of parameters =83126
Epoch 0/24

train Loss: 2.3047 Acc: 10.00%


val Loss: 2.3043 Acc: 10.00%

Epoch 5/24

train Loss: 2.3019 Acc: 10.01%


val Loss: 2.3015 Acc: 10.36%

Epoch 10/24

train Loss: 2.2989 Acc: 12.97%


val Loss: 2.2979 Acc: 11.93%

Epoch 15/24

train Loss: 2.2854 Acc: 10.34%


val Loss: 2.2808 Acc: 10.26%

Epoch 20/24

train Loss: 2.1966 Acc: 15.87%

val Loss: 2.1761 Acc: 17.29%

Training complete in 4m40s


Best val Acc: 22.81%

Increase learning rate and momentum: lr=0.01, momentum=0.9.
model = LeNet5((3, 6, 16, 576, 120, 84, D_out)).to(device)
optimizer = optim.SGD(model.parameters(), lr=0.01, momentum=0.9)
criterion = nn.NLLLoss()

model, losses, accuracies = train_val_model(model, criterion, optimizer, dataloaders,
                                            num_epochs=25, log_interval=5)

_ = plt.plot(losses['train'], '-b', losses['val'], '--r')

Epoch 0/24

train Loss: 2.1439 Acc: 19.81%


val Loss: 1.9415 Acc: 29.59%

Epoch 5/24

train Loss: 1.3457 Acc: 51.62%


val Loss: 1.2294 Acc: 55.74%

Epoch 10/24

train Loss: 1.1607 Acc: 58.39%


val Loss: 1.1031 Acc: 60.68%

Epoch 15/24

train Loss: 1.0710 Acc: 62.08%

val Loss: 1.0167 Acc: 64.26%

Epoch 20/24

train Loss: 1.0078 Acc: 64.25%


val Loss: 0.9505 Acc: 66.62%

Training complete in 4m58s


Best val Acc: 67.30%

Adaptive learning rate: Adam.
model = LeNet5((3, 6, 16, 576, 120, 84, D_out)).to(device)
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)
criterion = nn.NLLLoss()

model, losses, accuracies = train_val_model(model, criterion, optimizer, dataloaders,
                                            num_epochs=25, log_interval=5)

_ = plt.plot(losses['train'], '-b', losses['val'], '--r')

Epoch 0/24

train Loss: 1.8857 Acc: 29.70%


val Loss: 1.6223 Acc: 40.22%

Epoch 5/24

train Loss: 1.3564 Acc: 50.88%


val Loss: 1.2271 Acc: 55.97%

Epoch 10/24

train Loss: 1.2169 Acc: 56.35%

val Loss: 1.1393 Acc: 59.72%

Epoch 15/24

train Loss: 1.1296 Acc: 59.67%


val Loss: 1.0458 Acc: 63.05%

Epoch 20/24

train Loss: 1.0830 Acc: 61.16%


val Loss: 1.0047 Acc: 64.49%

Training complete in 4m34s


Best val Acc: 65.76%

MiniVGGNet

model = MiniVGGNet(layers=(3, 16, 32, 1, 120, 84, D_out), debug=True)

print(model)
_ = model(data_example)

MiniVGGNet(
(conv11): Conv2d(3, 16, kernel_size=(3, 3), stride=(1, 1))
(conv12): Conv2d(16, 16, kernel_size=(3, 3), stride=(1, 1))
(conv21): Conv2d(16, 32, kernel_size=(3, 3), stride=(1, 1))
(conv22): Conv2d(32, 32, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
(fc1): Linear(in_features=1, out_features=120, bias=True)
(fc2): Linear(in_features=120, out_features=84, bias=True)
(fc3): Linear(in_features=84, out_features=10, bias=True)
)
### DEBUG: Shape of last convnet= torch.Size([32, 6, 6]) . FC size= 1152


Set 1152 neurons to the first FC layer.

SGD with large momentum and learning rate:

model = MiniVGGNet((3, 16, 32, 1152, 120, 84, D_out)).to(device)
optimizer = optim.SGD(model.parameters(), lr=0.01, momentum=0.9)
criterion = nn.NLLLoss()

model, losses, accuracies = train_val_model(model, criterion, optimizer, dataloaders,
                                            num_epochs=25, log_interval=5)

_ = plt.plot(losses['train'], '-b', losses['val'], '--r')

Epoch 0/24

train Loss: 2.3027 Acc: 10.12%


val Loss: 2.3004 Acc: 10.31%

Epoch 5/24

train Loss: 1.4790 Acc: 45.88%


val Loss: 1.3726 Acc: 50.32%

Epoch 10/24

train Loss: 1.1115 Acc: 60.74%


val Loss: 1.0193 Acc: 64.00%

Epoch 15/24

train Loss: 0.8937 Acc: 68.41%


val Loss: 0.8297 Acc: 71.18%

Epoch 20/24

train Loss: 0.7848 Acc: 72.14%


val Loss: 0.7136 Acc: 75.42%

Training complete in 4m27s


Best val Acc: 76.73%


Adam

model = MiniVGGNet((3, 16, 32, 1152, 120, 84, D_out)).to(device)
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)
criterion = nn.NLLLoss()

model, losses, accuracies = train_val_model(model, criterion, optimizer, dataloaders,
                                            num_epochs=25, log_interval=5)

_ = plt.plot(losses['train'], '-b', losses['val'], '--r')

Epoch 0/24

train Loss: 1.8366 Acc: 31.34%


val Loss: 1.5805 Acc: 41.62%

Epoch 5/24

train Loss: 1.1755 Acc: 57.79%


val Loss: 1.1027 Acc: 60.83%

Epoch 10/24

train Loss: 0.9741 Acc: 65.53%


val Loss: 0.8994 Acc: 68.29%

Epoch 15/24

train Loss: 0.8611 Acc: 69.74%


val Loss: 0.8465 Acc: 70.90%

Epoch 20/24

train Loss: 0.7916 Acc: 71.90%


val Loss: 0.7513 Acc: 74.03%

Training complete in 4m23s
Best val Acc: 74.87%

ResNet

model = ResNet(ResidualBlock, [2, 2, 2], num_classes=D_out).to(device)  # 195738 parameters
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)
criterion = nn.NLLLoss()

model, losses, accuracies = train_val_model(model, criterion, optimizer, dataloaders,
                                            num_epochs=25, log_interval=5)

_ = plt.plot(losses['train'], '-b', losses['val'], '--r')

Epoch 0/24

train Loss: 1.4402 Acc: 46.84%


val Loss: 1.7289 Acc: 38.18%

Epoch 5/24

train Loss: 0.6337 Acc: 77.88%


val Loss: 0.8672 Acc: 71.34%

Epoch 10/24

train Loss: 0.4851 Acc: 83.11%


val Loss: 0.5754 Acc: 80.47%

Epoch 15/24

train Loss: 0.3998 Acc: 86.22%
val Loss: 0.6208 Acc: 80.16%

Epoch 20/24

train Loss: 0.3470 Acc: 87.99%


val Loss: 0.4696 Acc: 84.20%

Training complete in 7m5s


Best val Acc: 85.60%

6.4 Transfer Learning Tutorial

Sources:
• cs231n @ Stanford
• Sasank Chilamkurthy

Quote cs231n @ Stanford:
In practice, very few people train an entire Convolutional Network from scratch (with
random initialization), because it is relatively rare to have a dataset of sufficient size.
Instead, it is common to pretrain a ConvNet on a very large dataset (e.g. ImageNet, which
contains 1.2 million images with 1000 categories), and then use the ConvNet either as
an initialization or a fixed feature extractor for the task of interest.
These two major transfer learning scenarios look as follows:
• ConvNet as fixed feature extractor:
– Take a ConvNet pretrained on ImageNet,


– Remove the last fully-connected layer (this layer's outputs are the 1000 class scores for a different task like ImageNet).
– Treat the rest of the ConvNet as a fixed feature extractor for the new dataset.
In practice:
– Freeze the weights of the whole network except those of the final fully connected layer. This last fully connected layer is replaced with a new one with random weights and only this layer is trained.
• Finetuning the convnet:
Fine-tune the weights of the pretrained network by continuing the backpropagation. It is possible to fine-tune all the layers of the ConvNet. Instead of random initialization, we initialize the network with a pretrained network, like the one trained on the ImageNet 1000-class dataset. The rest of the training looks as usual.
%matplotlib inline

import os
import numpy as np
import torch
import torch.nn as nn
import torch.optim as optim
from torch.optim import lr_scheduler
import torchvision
import torchvision.transforms as transforms
from torchvision import models
#
from pathlib import Path
import matplotlib.pyplot as plt

# Device configuration
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

6.4.1 Training function

Combine train and test/validation into a single function.

Now, let's write a general function to train a model. Here, we will illustrate:
• Scheduling the learning rate
• Saving the best model

In the following, parameter scheduler is an LR scheduler object from torch.optim.lr_scheduler.
# %load train_val_model.py
import numpy as np
import torch
import time
import copy


def train_val_model(model, criterion, optimizer, dataloaders, num_epochs=25,
                    scheduler=None, log_interval=None):
    since = time.time()

    best_model_wts = copy.deepcopy(model.state_dict())
    best_acc = 0.0

    # Store losses and accuracies across epochs
    losses, accuracies = dict(train=[], val=[]), dict(train=[], val=[])

    for epoch in range(num_epochs):
        if log_interval is not None and epoch % log_interval == 0:
            print('Epoch {}/{}'.format(epoch, num_epochs - 1))
            print('-' * 10)

        # Each epoch has a training and validation phase
        for phase in ['train', 'val']:
            if phase == 'train':
                model.train()  # Set model to training mode
            else:
                model.eval()   # Set model to evaluate mode

            running_loss = 0.0
            running_corrects = 0

            # Iterate over data.
            nsamples = 0
            for inputs, labels in dataloaders[phase]:
                inputs = inputs.to(device)
                labels = labels.to(device)
                nsamples += inputs.shape[0]

                # zero the parameter gradients
                optimizer.zero_grad()

                # forward
                # track history if only in train
                with torch.set_grad_enabled(phase == 'train'):
                    outputs = model(inputs)
                    _, preds = torch.max(outputs, 1)
                    loss = criterion(outputs, labels)

                    # backward + optimize only if in training phase
                    if phase == 'train':
                        loss.backward()
                        optimizer.step()

                # statistics
                running_loss += loss.item() * inputs.size(0)
                running_corrects += torch.sum(preds == labels.data)

            if scheduler is not None and phase == 'train':
                scheduler.step()

            #nsamples = dataloaders[phase].dataset.data.shape[0]
            epoch_loss = running_loss / nsamples
            epoch_acc = running_corrects.double() / nsamples

            losses[phase].append(epoch_loss)
            accuracies[phase].append(epoch_acc)
            if log_interval is not None and epoch % log_interval == 0:
                print('{} Loss: {:.4f} Acc: {:.4f}'.format(
                    phase, epoch_loss, epoch_acc))

            # deep copy the model
            if phase == 'val' and epoch_acc > best_acc:
                best_acc = epoch_acc
                best_model_wts = copy.deepcopy(model.state_dict())

        if log_interval is not None and epoch % log_interval == 0:
            print()

    time_elapsed = time.time() - since
    print('Training complete in {:.0f}m {:.0f}s'.format(
        time_elapsed // 60, time_elapsed % 60))
    print('Best val Acc: {:4f}'.format(best_acc))

    # load best model weights
    model.load_state_dict(best_model_wts)

    return model, losses, accuracies

6.4.2 CIFAR-10 dataset

Source Yunjey Choi

WD = os.path.join(Path.home(), "data", "pystatml", "dl_cifar10_pytorch")
os.makedirs(WD, exist_ok=True)
os.chdir(WD)
print("Working dir is:", os.getcwd())
os.makedirs("data", exist_ok=True)
os.makedirs("models", exist_ok=True)

# Image preprocessing modules
transform = transforms.Compose([
    transforms.Pad(4),
    transforms.RandomHorizontalFlip(),
    transforms.RandomCrop(32),
    transforms.ToTensor()])

# CIFAR-10 dataset
train_dataset = torchvision.datasets.CIFAR10(root='data/',
                                             train=True,
                                             transform=transform,
                                             download=True)

test_dataset = torchvision.datasets.CIFAR10(root='data/',
                                            train=False,
                                            transform=transforms.ToTensor())

# Data loader
train_loader = torch.utils.data.DataLoader(dataset=train_dataset,
                                           batch_size=100,
                                           shuffle=True)

val_loader = torch.utils.data.DataLoader(dataset=test_dataset,
                                         batch_size=100,
                                         shuffle=False)

# Put together train and val
dataloaders = dict(train=train_loader, val=val_loader)

# Info about the dataset
data_shape = dataloaders["train"].dataset.data.shape[1:]
D_in = np.prod(data_shape)
D_out = len(set(dataloaders["train"].dataset.targets))
print("Datasets shape", {x: dataloaders[x].dataset.data.shape for x in ['train', 'val']})
print("N input features", D_in, "N output", D_out)

Working dir is: /home/edouard/data/pystatml/dl_cifar10_pytorch
Files already downloaded and verified
Datasets shape {'train': (50000, 32, 32, 3), 'val': (10000, 32, 32, 3)}
N input features 3072 N output 10

Finetuning the convnet

• Load a pretrained model and reset the final fully connected layer.
• SGD optimizer.
model_ft = models.resnet18(pretrained=True)
num_ftrs = model_ft.fc.in_features
# Here the size of each output sample is set to 10.
model_ft.fc = nn.Linear(num_ftrs, D_out)

model_ft = model_ft.to(device)

criterion = nn.CrossEntropyLoss()

# Observe that all parameters are being optimized
optimizer_ft = optim.SGD(model_ft.parameters(), lr=0.001, momentum=0.9)

# Decay LR by a factor of 0.1 every 7 epochs
exp_lr_scheduler = lr_scheduler.StepLR(optimizer_ft, step_size=7, gamma=0.1)

model, losses, accuracies = train_val_model(model_ft, criterion, optimizer_ft,
    dataloaders, scheduler=exp_lr_scheduler, num_epochs=25, log_interval=5)

epochs = np.arange(len(losses['train']))
_ = plt.plot(epochs, losses['train'], '-b', epochs, losses['val'], '--r')

Epoch 0/24
train Loss: 1.2478 Acc: 0.5603
val Loss: 0.9084 Acc: 0.6866
Epoch 5/24

train Loss: 0.5801 Acc: 0.7974


val Loss: 0.5918 Acc: 0.7951

Epoch 10/24

train Loss: 0.4765 Acc: 0.8325


val Loss: 0.5257 Acc: 0.8178

Epoch 15/24

train Loss: 0.4555 Acc: 0.8390


val Loss: 0.5205 Acc: 0.8201

Epoch 20/24

train Loss: 0.4557 Acc: 0.8395


val Loss: 0.5183 Acc: 0.8212

Training complete in 277m 16s


Best val Acc: 0.822800

Adam optimizer
model_ft = models.resnet18(pretrained=True)
num_ftrs = model_ft.fc.in_features
# Here the size of each output sample is set to 10.
model_ft.fc = nn.Linear(num_ftrs, D_out)

model_ft = model_ft.to(device)

criterion = nn.CrossEntropyLoss()

# Observe that all parameters are being optimized
optimizer_ft = torch.optim.Adam(model_ft.parameters(), lr=0.001)

# Decay LR by a factor of 0.1 every 7 epochs
exp_lr_scheduler = lr_scheduler.StepLR(optimizer_ft, step_size=7, gamma=0.1)

model, losses, accuracies = train_val_model(model_ft, criterion, optimizer_ft,
    dataloaders, scheduler=exp_lr_scheduler, num_epochs=25, log_interval=5)

epochs = np.arange(len(losses['train']))
_ = plt.plot(epochs, losses['train'], '-b', epochs, losses['val'], '--r')

Epoch 0/24

train Loss: 1.0491 Acc: 0.6407


val Loss: 0.8981 Acc: 0.6881

Epoch 5/24

train Loss: 0.5495 Acc: 0.8135


val Loss: 0.9076 Acc: 0.7147

Epoch 10/24

train Loss: 0.3352 Acc: 0.8834


val Loss: 0.4148 Acc: 0.8613

Epoch 15/24

train Loss: 0.2819 Acc: 0.9017


val Loss: 0.4019 Acc: 0.8646

Epoch 20/24

train Loss: 0.2719 Acc: 0.9050


val Loss: 0.4025 Acc: 0.8675

Training complete in 293m 37s


Best val Acc: 0.868800


ResNet as a feature extractor

Freeze all of the network except the final layer: set requires_grad == False to freeze the parameters so that the gradients are not computed in backward().
model_conv = torchvision.models.resnet18(pretrained=True)
for param in model_conv.parameters():
    param.requires_grad = False

# Parameters of newly constructed modules have requires_grad=True by default
num_ftrs = model_conv.fc.in_features
model_conv.fc = nn.Linear(num_ftrs, D_out)

model_conv = model_conv.to(device)

criterion = nn.CrossEntropyLoss()

# Observe that only parameters of final layer are being optimized as
# opposed to before.
optimizer_conv = optim.SGD(model_conv.fc.parameters(), lr=0.001, momentum=0.9)

# Decay LR by a factor of 0.1 every 7 epochs
exp_lr_scheduler = lr_scheduler.StepLR(optimizer_conv, step_size=7, gamma=0.1)

model, losses, accuracies = train_val_model(model_conv, criterion, optimizer_conv,
    dataloaders, scheduler=exp_lr_scheduler, num_epochs=25, log_interval=5)

epochs = np.arange(len(losses['train']))
_ = plt.plot(epochs, losses['train'], '-b', epochs, losses['val'], '--r')

Epoch 0/24

train Loss: 1.9107 Acc: 0.3265


val Loss: 1.7982 Acc: 0.3798

Epoch 5/24

train Loss: 1.6666 Acc: 0.4165


val Loss: 1.7067 Acc: 0.4097

Epoch 10/24

train Loss: 1.6411 Acc: 0.4278


val Loss: 1.6737 Acc: 0.4269

Epoch 15/24

train Loss: 1.6315 Acc: 0.4299


val Loss: 1.6724 Acc: 0.4218

Epoch 20/24

train Loss: 1.6400 Acc: 0.4274


val Loss: 1.6755 Acc: 0.4250

Training complete in 61m 46s


Best val Acc: 0.430200

Adam optimizer
model_conv = torchvision.models.resnet18(pretrained=True)
for param in model_conv.parameters():
    param.requires_grad = False

# Parameters of newly constructed modules have requires_grad=True by default
num_ftrs = model_conv.fc.in_features
model_conv.fc = nn.Linear(num_ftrs, D_out)

model_conv = model_conv.to(device)

criterion = nn.CrossEntropyLoss()

# Observe that only parameters of final layer are being optimized as
# opposed to before.
optimizer_conv = optim.Adam(model_conv.fc.parameters(), lr=0.001)

# Decay LR by a factor of 0.1 every 7 epochs
exp_lr_scheduler = lr_scheduler.StepLR(optimizer_conv, step_size=7, gamma=0.1)

# The scheduler must be passed as a keyword argument: passing it positionally
# is interpreted as num_epochs and raises
# "TypeError: train_val_model() got multiple values for argument 'num_epochs'".
model, losses, accuracies = train_val_model(model_conv, criterion, optimizer_conv,
    dataloaders, scheduler=exp_lr_scheduler, num_epochs=25, log_interval=5)

epochs = np.arange(len(losses['train']))
_ = plt.plot(epochs, losses['train'], '-b', epochs, losses['val'], '--r')



CHAPTER

SEVEN

INDICES AND TABLES

• genindex
• modindex
• search

