Introduction To Gather


© Explore Data Science Academy


Overview

This tutorial is laid out as follows:

The Explore Data Science process

Gathering data in the real world

What is involved in “gathering” data?

Where do we get data?

Where do we store data and how has this changed over time?

Conclusion
The Explore Data Science Process

The Explore Data Science Process is about solving real-world problems using data.

GATHER

Databases
• SQL Queries
• Types of Databases
• Modifying Data
• Schema Maintenance

Statistics
• Probability
• Distributions
• Set Theory

Gathering data in the real world

Be prepared to spend A LOT of time gathering data!

The fallacy of “perfect data”

• No such thing as a perfect dataset.
• Part of the gathering process involves checking, cleaning, and getting into the right format.
• Sometimes we need to meet halfway between what is available and what is required.

What do we do with imperfect data?

• The PARETO PRINCIPLE states that roughly 80% of the effects come from 20% of the causes.
• For Data Science, this means quickly understanding the 20% of data that accounts for 80% of the results.
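
As a rough illustration of the Pareto principle, the sketch below ranks categories by their contribution and finds the smallest group that reaches 80% of the total. The function name `pareto_cutoff` and the sales figures are invented for illustration and are not part of the course material.

```python
def pareto_cutoff(contributions, threshold=0.8):
    """Return the largest contributors that together reach `threshold` of the total.

    `contributions` maps a category name to a non-negative number.
    """
    total = sum(contributions.values())
    if total <= 0:
        return []
    # Rank categories from largest to smallest contribution.
    ranked = sorted(contributions.items(), key=lambda kv: kv[1], reverse=True)
    selected, running = [], 0.0
    for name, value in ranked:
        selected.append(name)
        running += value
        if running / total >= threshold:
            break
    return selected


# Hypothetical sales per product (illustrative numbers only).
sales = {"A": 500, "B": 300, "C": 80, "D": 50, "E": 30,
         "F": 20, "G": 10, "H": 5, "I": 3, "J": 2}
top = pareto_cutoff(sales)
share = len(top) / len(sales)
print(top, f"{share:.0%} of products")
```

In this made-up dataset, two of the ten products account for 80% of sales. Real data rarely splits this neatly, so treat the 80/20 ratio as a rule of thumb rather than a target.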

ITERATE!

• Use what data is available and get started.
• Doing descriptive analytics and building models helps you understand what more data you might need to continue.
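
To make "use what data is available and get started" concrete, here is a minimal sketch that cleans a small, imperfect dataset and computes basic descriptive statistics with Python's standard library. The raw values, the `clean_ages` helper, and the 0 to 120 plausibility range are all assumptions made for illustration.

```python
import statistics


def clean_ages(raw_values, low=0, high=120):
    """Convert raw strings to ints, dropping blanks, non-numbers and implausible values."""
    cleaned = []
    for raw in raw_values:
        text = raw.strip()
        if not text:
            continue  # missing value
        try:
            age = int(text)
        except ValueError:
            continue  # not a whole number, e.g. "n/a"
        if low <= age <= high:
            cleaned.append(age)
    return cleaned


# Hypothetical survey responses, deliberately messy.
raw_ages = ["34", " 29", "", "n/a", "41", "250", "38"]
ages = clean_ages(raw_ages)

summary = {
    "count": len(ages),
    "dropped": len(raw_ages) - len(ages),
    "mean": statistics.mean(ages),
    "median": statistics.median(ages),
}
print(summary)
```

The number of dropped values is useful feedback in its own right: it hints at what additional or better data to gather in the next iteration.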
What is involved in “Gathering” data?

Data is the essence of data science - it’s in the name!

Processes Involved
• Finding data, e.g. using web scraping to extract data.
• Creating data, e.g. collecting or transforming data.
• Storing data, e.g. using AWS to maintain databases.
• Managing data, e.g. backing up and granting access to data.

A continual process
• We need data to solve problems, so we gather it after we have specified the problem.
• A continual feedback loop exists between E-G-A-D (and we never stop gathering data).

Relevance
• Impossible to do data science without good quality data - garbage in, garbage out!
• Data needs to be in the correct format in order to analyse and visualise it.
Where do we get data?

Getting data is a critical part of data science. Sometimes you get lucky and it’s already available...

Using other people’s data

Open data sources, for example
• Stats SA
• UCT’s Data Portal
• City of Cape Town
• The World Bank

Proprietary data sources
• Industry datasets
• Company specific datasets

You should not share any proprietary data without written consent from the source. You also need to be aware of regulations like the Protection of Personal Information (POPI) Act.

Collecting your own data

Create your own new datasets
• Primary research, including:
  ○ Surveys
  ○ Interviews
  ○ Simulating data

Collect other people’s data
• Use web scraping to pull data off websites
• Use APIs to pull data off systems and specific applications
• Capture data electronically that used to be on paper
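
As a small illustration of pulling data off a web page, the sketch below extracts table cells from an HTML snippet using Python's built-in `html.parser` module. The HTML is invented and embedded directly so the example runs offline; a real scraper would first download the page, and should respect the site's terms of use and any applicable data regulations. `TableExtractor` is a hypothetical helper name.

```python
from html.parser import HTMLParser


class TableExtractor(HTMLParser):
    """Collect the text of each <td>/<th> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None       # cells of the row currently being read
        self._in_cell = False  # True while inside a <td> or <th>

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._in_cell = True
            self._row.append("")

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            self._in_cell = False
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None

    def handle_data(self, data):
        if self._in_cell:
            self._row[-1] += data.strip()


# Invented page content (not real statistics).
page = """
<table>
  <tr><th>City</th><th>Population</th></tr>
  <tr><td>Alpha</td><td>1200</td></tr>
  <tr><td>Beta</td><td>850</td></tr>
</table>
"""

parser = TableExtractor()
parser.feed(page)
header, *records = parser.rows
data = [dict(zip(header, record)) for record in records]
print(data)
```

Note that every scraped value arrives as a string, so a cleaning step (for example converting `Population` to an integer) is usually still needed before analysis.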
Where do we store data and how has this changed over time?

There are multiple mediums for storing data and these are constantly changing and improving.

Old School
• Prehistoric data storage included writing on clay tablets or on rock
• Data was then written or typed on paper and stored in filing cabinets

Local Storage
• Data stored physically on a local computer, external drive or on a server in a database or in a file system

Cloud Storage
• We are now starting to store data in the “cloud” e.g. on Amazon Web Services, Microsoft Azure, or Google Cloud
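
For local storage in a database, a minimal sketch using Python's built-in `sqlite3` module is shown below. It uses an in-memory database so nothing is written to disk; the `observations` table and its rows are invented for illustration.

```python
import sqlite3

# ":memory:" keeps the database in RAM; pass a file path instead to persist it locally.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE observations (id INTEGER PRIMARY KEY, source TEXT NOT NULL, value REAL)"
)

# Hypothetical rows: where each value came from and what it was.
rows = [("survey", 3.5), ("survey", 4.0), ("api", 2.5)]
conn.executemany("INSERT INTO observations (source, value) VALUES (?, ?)", rows)
conn.commit()

# A simple SQL query: count and average the values per source.
query = (
    "SELECT source, COUNT(*), AVG(value) "
    "FROM observations GROUP BY source ORDER BY source"
)
result = conn.execute(query).fetchall()
print(result)
conn.close()
```

The same SQL would work largely unchanged against a database hosted on a server or in the cloud; only the connection step would differ.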
Conclusion

What you have learnt

There is no such thing as a perfect dataset.

The process of finding, creating, storing, and managing data.

The multiple mediums for storing data.
