01 Intro To ML Wo Videos


Introduction to Machine Learning

Olivia Pfeiler
[email protected]

Applied Data Science


Master of Science in Engineering
Introduction round

• Switch on the video
• Introduce yourself
• Which study program are you in?
• Your background, e.g. which bachelor study program did you finish?
In this course you will learn …

• Brief history of ML/AI
• Basic ML terms & definitions
• Data pre-processing methods
• Overview of fundamental ML algorithms
• Model evaluation methods
• Feature engineering & selection
• How to work with real-world data (examples with R)
• Data Science / ML best practices & some things to think about
How will you learn?

• Theory & practical examples (in R)
• 1st part of the class with Olivia Pfeiler will be more theoretical
• 2nd part of the class with Kathrin Plankensteiner will be mainly practical
• Slides and code examples will be available in the Moodle course
• There is no “script / book” for this class
Overview of classes
Date | Start | End | Topic | Location | Lecturer
04.10.2024 | 4:50 pm | 8:10 pm | History of ML/AI & basic ML terms | EDV1 (Villach) / VCR | Pfeiler
11.10.2024 | | | Blended learning (algebra theory); complete DataCamp class, deadline Nov 1, 2024 | | Pfeiler / self-study
18.10.2024 | | | Blended learning (data pre-processing); complete DataCamp class, deadline Nov 7, 2024 | | Pfeiler / self-study
08.11.2024 | 4:50 pm | 8:10 pm | Recap data preprocessing; learning | EDV1 (Villach) / VCR | Pfeiler
15.11.2024 | 4:50 pm | 8:10 pm | Decision Trees & decision metrics | EDV1 (Villach) / VCR | Pfeiler
22.11.2024 | 4:50 pm | 8:10 pm | Kmeans & resampling methods | VCR | Pfeiler
06.12.2024 | 4:50 pm | 8:10 pm | Feature engineering & feature selection | EDV4 (Villach) / VCR | Pfeiler
11.12.2024 | 5:40 pm | 9:00 pm | ML lab 1 (exercises & homework) | EDV2 (Villach) / VCR | Plankensteiner
13.12.2024 | 4:50 pm | 8:10 pm | Regression (linear & non-linear) & information about student projects | EDV1 (Villach) / VCR | Pfeiler
How to pass the class?

• Attendance in the class room or online is not part of the final grade
• Homework
  • Linear Algebra exercises (self-study + DataCamp class, deadline Nov. 1st, 2024) → 5% of final grade
  • Data pre-processing (self-study + DataCamp class, deadline Nov. 7th, 2024) → 5% of final grade
  • ML exercise 1 (with K. Plankensteiner, submission Jan. 24) → 5% of final grade
  • ML exercise 2 (with K. Plankensteiner, submission Jan. 24) → 5% of final grade
• One group work → 20% of final grade
  • Apply learned ML methods on a given data set & implement a web app with Shiny
  • Deliverables: Shiny app + presentation on Jan 25, 2025
• Final exam (date after Jan. 10, 2025, please align on possible dates) → 60% of final grade
Bibliography

• An Introduction to Statistical Learning. Gareth James, Daniela Witten, Trevor Hastie, Robert Tibshirani. Springer New York, 2013
• The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Trevor Hastie, Robert Tibshirani, and Jerome Friedman. Springer, 2009
• Understanding Machine Learning: From Theory to Algorithms. Shai Shalev-Shwartz and Shai Ben-David. Cambridge University Press, 2014
• Machine Learning for Hackers. Drew Conway & John Myles White. O'Reilly UK Ltd, 2012
• Introduction to Machine Learning with Python: A Guide for Data Scientists. Andreas C. Müller & Sarah Guido. O'Reilly UK Ltd, 2016
Prerequisites

• Statistics & Probability Theory
  → covered by lectures from Prof. Kloss-Brandstätter & Prof. Pilz
• Programming Skills: R
  → introduction made by Prof. Kloss-Brandstätter
  → Do you need an R refresher? DataCamp – Introduction to R
• Linear Algebra
• Calculus
Plan for today

Topics
• AI & ML – why is there a hype?
• AI & ML – a quick tour through the history
• Machine Learning – general explanation
• The Machine Learning Life Cycle

Procedure
• Two 90 min sessions with one 20 min break
AI & Machine Learning hype

Buzzword Bingo: Decision Tree, Neural Network, Supervised Learning, Python, Data Science, Support Vector Machines, K-means, GPU, Deep Learning, Feature Engineering, Algorithm, Data Pre-processing, Big Data, Decision Metric, Cloud, Linear Regression, TensorFlow
AI & Machine Learning in the media

Source: https://www.quytech.com/blog/wp-content/uploads/2021/05/AI-spending.png
Why is AI/ML so popular right now?
• ML matured a lot in the last decade & changed a lot in the last years
  • e.g. statistical and probabilistic underpinning of methods
  • tools to implement ML methods have also matured
• Abundant data
  • the amount of data collected and stored is growing rapidly → “information overload”
  • diverse data sources (broad applicability): email, social networks, RSS, podcasts, websites, …
• Abundant computation
  • computation power is easily accessible and cheap
  • enabling “abundant data” & powerful machine learning algorithms
Will the hype continue?

AI winters & AI hypes

Source: https://hackernoon.com
approx. 2007: “Data explosion” / “information overload”
Hype Cycle

• Offers a snapshot of the relative market promotion and perceived value of innovations.
• Highlights overhyped areas
• Estimates when innovations and trends will reach maturity

Source: https://www.gartner.com/en/documents/3887767
Gartner Hype Cycle for AI – Status 2023

Plateau of Productivity will be reached in

Source: Gartner, 2023
AI & ML – A quick tour through the history

AI & Machine Learning history

Source: https://atos.net
<1960 “I repeat”

1950: Alan Turing published a paper entitled “Computing Machinery and Intelligence”
• Turing Test or “Imitation Game”: a simple test that could be used to prove that machines could think

1959: Arthur Samuel defines the term Machine Learning:
• “ML is the field of study that gives computers the ability to learn without being explicitly programmed”
• He wrote a program that learnt checkers well enough to beat him (by self play)

Source: www.wikipedia.org
1960 – 2010 “I imitate”

1965: ELIZA developed by Joseph Weizenbaum
• ELIZA is an interactive program that is able to have a dialogue on any subject
• Link to ELIZA video @ YouTube
1960 – 2010 “I imitate”

1986: Mercedes Benz builds the first driverless car
Source: https://driving.ca

1997: Deep Blue (IBM) challenged and defeated the then world chess champion Garry Kasparov
• Deep Blue examines thousands of moves using a min-max search algorithm & continually increases its power
Source: www.britannica.com
2010 – 2018 “I learn”

2011: IBM’s Watson wins Jeopardy beating two former champions
2011: Apple introduces Siri
2014: Amazon introduces Alexa

Source: www.infoworld.com
2010 – 2018 “I learn”

2012: Era of “ImageNet”
Label: Wolf 2023
2014: Label: A brown bear is swimming in the water
2017: Label: Two brown bears sitting on top of rocks

Source: Vinyals et al., IEEE Transactions on Pattern Analysis and Machine Intelligence, 2017
2010 – 2018 “I learn”

2016: Google DeepMind’s AlphaGo defeats Go champion Lee Sedol
• Link to AlphaGo video @ YouTube
2018 - 2022 “I learn to learn”

Link to “Solving Rubik’s Cube” video: openai.com
> 2022 “I contribute”

Source: https://www.openai.com/
Machine Learning – general explanation
AI, Machine Learning & Deep Learning

• AI: Mimic the intelligence and behavioral pattern of humans
• ML: Computers learn from data without complex set of rules
• DL: ML inspired by our brain’s network of neurons
• Data Science: Mathematics, Statistics, Visualization, EDA
What is Machine Learning?
• ML explores the study and construction of algorithms that can learn from and make predictions on data (allow computers to learn)
• Basic principles of ML have a strong relation to
  • Statistical and mathematical theory
  • Numerical optimization
  • Learning algorithms
• Two main goals
  • Make predictions
  • Understand systems better
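
To make "learn from data and make predictions" concrete, here is a minimal sketch that estimates a linear rule from examples instead of hard-coding it. The data values and the helper names `fit_line` and `predict` are assumptions made for this illustration. Python is used here only for illustration; the course examples are in R.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def predict(model, x):
    """Apply the learned parameters to a new input."""
    slope, intercept = model
    return slope * x + intercept


# Made-up training data that follows y = 2x + 1 exactly.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]

model = fit_line(xs, ys)       # "learning": parameters come from the data
print(model)                   # goal 2: understand the system (slope, intercept)
print(predict(model, 10.0))    # goal 1: make a prediction for unseen input
```

The program never contains the rule "multiply by 2 and add 1"; it recovers it from the examples, which touches both goals named above.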
Data Analytics vs. Machine Learning

data analytics ≙ data analysis

“… process of inspecting, cleansing, transforming, and modeling data with the goal of discovering useful information, suggesting conclusions, and supporting decision-making” [Wikipedia]

Exploratory Data Analysis (EDA):
• Box plots, histograms, bar charts, time series, …

Descriptive statistics:
• Summary statistics (mean, std. dev, …), key parameters

Further methods:
• Outlier detection
• Correlation analysis
• Regression
• Machine Learning (ML)
• …
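
A few of the data analysis steps listed above (summary statistics, outlier detection, correlation analysis) can be sketched with the Python standard library. The measurements and the helper names `iqr_outliers` and `pearson` are assumptions for this illustration, and the 1.5 × IQR rule is just one common outlier convention (the one box plots use). Python is used here only for illustration; the course examples are in R.

```python
import statistics


def iqr_outliers(values, k=1.5):
    """Return values outside [Q1 - k*IQR, Q3 + k*IQR] (box-plot rule)."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)


# Made-up sensor readings with one suspicious value.
temps = [20.1, 19.8, 20.5, 21.0, 20.3, 35.0, 19.9, 20.7]

print("mean:", statistics.fmean(temps))
print("std. dev:", statistics.stdev(temps))
print("outliers:", iqr_outliers(temps))
print("correlation:", pearson([1, 2, 3, 4], [2, 4, 6, 8]))
```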
When do we need ML?
• Human expertise does not exist (navigating on Mars)
• Humans are unable to explain their expertise / tasks are too complex to program (speech recognition, spam filters, robotics, driving, …)
• Solution changes in time (routing on a computer network)
• Solutions need to be adapted to particular cases (user biometrics)
• Tasks beyond human capabilities (weather prediction, genomic data, search engines)
  • Example: It is easier to write a program that learns how to play checkers or backgammon well by self-play than to convert the expertise of a master player into a program.
• No need to use ML for computing a payroll, we “just” need an algorithm
Elements of ML
Learning
• Different types of learning: supervised vs. unsupervised, active vs. passive, …

Models & Algorithms
• Models: Theoretical assumptions explaining relationships in the data
• Algorithms: Routines to get model parameters & make predictions

Data & Data pre-processing
• Training/Validation/Test data
• Preparation methods to ensure high data quality
• Feature engineering

Evaluation & Generalization
• How well a model performs on new data
• Also called “inductive reasoning” or “inductive inference”
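
The training/validation/test data mentioned above can be produced with a simple random split. This is a minimal sketch: the 60/20/20 proportions, the fixed seed and the function name `train_val_test_split` are assumptions chosen for illustration, not values given in the slides. Python is used here only for illustration; the course examples are in R.

```python
import random


def train_val_test_split(rows, val_frac=0.2, test_frac=0.2, seed=42):
    """Shuffle the rows once, then cut them into three disjoint parts."""
    rows = list(rows)                  # copy, so the caller's data is untouched
    random.Random(seed).shuffle(rows)  # fixed seed keeps the split repeatable
    n_test = int(len(rows) * test_frac)
    n_val = int(len(rows) * val_frac)
    test = rows[:n_test]
    val = rows[n_test:n_test + n_val]
    train = rows[n_test + n_val:]
    return train, val, test


data = list(range(100))                # stand-in for 100 observations
train, val, test = train_val_test_split(data)
print(len(train), len(val), len(test))
```

The model is fitted on `train`, tuned on `val`, and `test` is touched only once at the end to estimate how well the model generalizes to new data.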
Some advantages of ML
• Decision rules learnt from data
  → new relations may become visible
• Objective and repeatable decision making
  → no subjective reasoning
• ML models are flexible, because e.g. predictors can be combined (non-)linearly
• ML models will improve over time (with better data quality, larger data sets, more computing power, …)
• Basic ML methods are implemented and easy to use in common SW (MATLAB, Python, R, …)
Some things to consider when using ML
• ML can only learn from information given by the data → high quality and reliable training data are needed
• 80% (data preparation) vs. 20% (data evaluation) rule is true, BUT just for the Proof of Concept (PoC). Even more resources are needed for deployment
• Large/complex ML models may overfit → loss in prediction power / generalization
• Selection of ML method is “subjective”
• Basic ML models often have a “linear nature”
• Complex models need extensive computing power for the training
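
The overfitting point can be illustrated with a deliberately extreme model: a 1-nearest-neighbour "memorizer" that stores every training point, compared with a straight-line fit. This is a sketch under stated assumptions: the noisy linear data, the seed and all function names are made up for illustration. Python is used here only for illustration; the course examples are in R.

```python
import random


def mse(model, xs, ys):
    """Mean squared error of a prediction function on a data set."""
    return sum((model(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


def fit_memorizer(xs, ys):
    """1-nearest-neighbour: predict the target of the closest training x."""
    pairs = list(zip(xs, ys))
    return lambda x: min(pairs, key=lambda p: abs(p[0] - x))[1]


def fit_line(xs, ys):
    """Least-squares straight line, a much simpler model."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return lambda x: slope * x + (my - slope * mx)


rng = random.Random(0)


def sample(n):
    """Made-up data: y = 2x + 1 plus Gaussian noise."""
    xs = [rng.uniform(0, 10) for _ in range(n)]
    ys = [2 * x + 1 + rng.gauss(0, 2) for x in xs]
    return xs, ys


train_x, train_y = sample(30)
test_x, test_y = sample(30)

errors = {}
for name, fit in [("memorizer", fit_memorizer), ("line", fit_line)]:
    model = fit(train_x, train_y)
    errors[name] = (mse(model, train_x, train_y), mse(model, test_x, test_y))
    print(name, "train MSE:", errors[name][0], "test MSE:", errors[name][1])
```

The memorizer reproduces the training data perfectly, noise included, so its training error says nothing about new data. On data like this the simpler line typically has the higher training error but the lower test error; the exact numbers depend on the seed.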
The Machine Learning Life Cycle

1) Understand Business & Process
2) Understand Data
3) Gather & Prepare Data
4) Evaluate & Test Models
5) Select & Optimize Model
6) Deploy & maintain Models

Legend: topics addressed in this class
CRISP-DM: CRoss-Industry Standard Process for Data Mining
CRISP-DM – alternative presentation

Source: https://en.wikipedia.org/wiki/Cross-industry_standard_process_for_data_mining
ML life cycle – Roles and Responsibilities
In the PAST …
• Data Scientists, supported by domain experts, were responsible for the whole ML life cycle

BUT ML projects are …
• complex, similar to SW projects
• need diverse competencies to be successful
• Proof of Concept (PoC) is just the beginning

RESULT …
• 87 % of Data Science projects fail or stop after the PoC [source]

[Figure: ML life cycle (CRISP-DM: CRoss-Industry Standard Process for Data Mining) with the roles Domain Expert and Data Scientist]
ML life cycle – Roles and Responsibilities
TODAY …
• Successful DS teams consist of people with different expertise working together
• Domain Expert: Understands the problem, the business needs and the data
• Data Engineer: Develops and manages the pipeline for raw data collection and pre-processing to feed ML models
• Data Scientist: Analyses data, develops & evaluates ML/DL models
• ML Engineer: Develops pipelines to deploy and maintain ML models

[Figure: ML life cycle (CRISP-DM: CRoss-Industry Standard Process for Data Mining) with the roles Domain Expert, Data Engineer, Data Scientist and ML Engineer]
Managing the ML life cycle – MLOps
• Compound of “Machine Learning” and “operations”
• Practice to manage the whole ML lifecycle in order to optimize both the governance and the scalability
• Collaboration and communication across all data science roles to manage ML in productive environments → pipelines

[Figure: ML life cycle diagram, steps 1) to 6)]
Blended learning assignments
1. Recap on algebra theory
• Slides & link to online tutorials available in Moodle
• ToDo: complete DataCamp class – deadline Nov 1, 2024
• Knowing the content is essential to understand how ML algorithms work

2. Self-study on data pre-processing methods (partly recap of basic statistical methods)
• Slides & videos will be available soon in Moodle (you will be informed via email)
• ToDo: will be announced by Oct. 11, 2024 – deadline Nov 7, 2024
• Knowing the content will be essential to understand the following lectures and the next homework & exercises

Questions?