Data Science Summary
Exploratory question
• An exploratory question is one in which you analyze the data to see if there are patterns, trends, or relationships between variables.
• These types of analyses are also called “hypothesis-generating” analyses because rather than testing a hypothesis, as would be done with an inferential, causal, or mechanistic question, you are looking for patterns that would support proposing a hypothesis.
Mechanistic question
• Finally, none of the questions described so far will lead to an answer that tells us, if the diet does indeed cause a reduction in the number of viral illnesses, how the diet leads to that reduction.
• A question that asks how a diet high in fresh fruits and vegetables leads to a reduction in the number of viral illnesses would be a mechanistic question.
Feasible
Data on cognitive intelligence and working mothers, as well as data on mother-child characteristics, are available in the IFLS.
Examples
• Career Women (Working Mothers) and the Development of Children's Intelligence
Problem oriented
Analytical
Amenable to Economic Analysis
Data Pre-processing
• Typically, 80% of project time is spent getting data ready.
• Data in the real world is often dirty; that is, it needs to be cleaned up before it can be used for a desired purpose.
• This cleaning is often called data pre-processing.
• What makes data “dirty”? Here are some of the factors that indicate that data is not clean or ready to process:
- Incomplete. Some attribute values are lacking, certain attributes of interest are lacking, or attributes contain only aggregate data.
- Noisy. The data contains errors or outliers. For example, some of the data points in a dataset may contain extreme values that can severely affect the dataset’s range.
- Inconsistent. The data contains discrepancies in codes or names. For example, if the “Name” column for registration records of employees contains values other than alphabetical letters, or if records do not start with a capital letter, discrepancies are present.
• Pre-processing tasks include:
- Data Munging
- Handling Missing Data
- Smoothing Noisy Data
Data Munging
• Often, the data is not in a format that is easy to work with.
• Thus, we need to convert it to something more suitable for a computer to understand.
• This can be done manually, automatically, or, in many cases, semiautomatically.
EXAMPLE:
Consider the following text recipe: “Add two diced tomatoes, three cloves of garlic, and a pinch of salt in the mix.” This can be turned into a table.
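The recipe sentence from the Data Munging example can be converted into rows automatically. The following is only a minimal sketch: the function name recipe_to_rows, the NUMBER_WORDS lookup, and the column choice (ingredient, quantity, unit or form) are assumptions made for illustration, and the parsing rules only cover sentences shaped like this one.

```python
import re

# Hypothetical lookup for number words; extend it for other recipes.
NUMBER_WORDS = {"a": 1, "an": 1, "one": 1, "two": 2, "three": 3}

def recipe_to_rows(text):
    """Split a recipe sentence into (ingredient, quantity, unit_or_form) rows."""
    # Drop the leading "Add" and the trailing "in the mix."
    body = re.sub(r"^\s*add\s+", "", text.strip().rstrip("."), flags=re.IGNORECASE)
    body = re.sub(r"\s+in the mix$", "", body, flags=re.IGNORECASE)
    rows = []
    # Items are separated by commas, optionally followed by "and".
    for item in re.split(r",\s*(?:and\s+)?", body):
        words = item.split()
        quantity = NUMBER_WORDS[words[0].lower()]
        if "of" in words:
            # "three cloves of garlic" -> unit "cloves", ingredient "garlic"
            split_at = words.index("of")
            unit = " ".join(words[1:split_at])
            ingredient = " ".join(words[split_at + 1:])
        else:
            # "two diced tomatoes" -> form "diced", ingredient "tomatoes"
            unit = " ".join(words[1:-1])
            ingredient = words[-1]
        rows.append((ingredient, quantity, unit))
    return rows

recipe = "Add two diced tomatoes, three cloves of garlic, and a pinch of salt in the mix."
table = recipe_to_rows(recipe)
for ingredient, quantity, unit in table:
    print(f"{ingredient:<10}{quantity:<4}{unit}")
```

This is the "automatic" end of the spectrum. In practice a person usually checks the rows afterwards, which is the semiautomatic case.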
Machine Learning
• Machine learning is part of data science.
• Data is often generated in such massive volumes that working through it by hand becomes too tedious for a data scientist. That is when machine learning comes into action.
• Definition:
- Machine learning is the ability given to a
system to learn and process data sets
autonomously without human intervention. Or,
- Machine learning is the ability of
algorithms to learn from data
• Machine Learning (ML) uses various statistical
approaches for making predictions.
• It also has a major role in pattern finding in data,
that is, it can find various patterns in complex data
given to it.
• ML is part of a larger domain, Artificial Intelligence (AI).
• ML deals with algorithms that learn from given
data and make predictions.
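As a small illustration of "learning from given data and making predictions", the sketch below fits a straight line to a handful of input/output pairs using ordinary least squares and then predicts an unseen value. The data (hours studied versus exam score) and the function name fit_line are made up for this example; real projects would normally use a library rather than hand-written formulas.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    slope = covariance / variance
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Made-up training data: inputs and their known outputs.
hours_studied = [1, 2, 3, 4, 5]
exam_score = [52, 54, 56, 58, 60]

# "Training": the pattern found here is a slope and an intercept.
slope, intercept = fit_line(hours_studied, exam_score)

# "Prediction": apply the learned pattern to an input not seen before.
predicted = slope * 6 + intercept
print(slope, intercept, predicted)
```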
Why Do We Need Machine Learning?
• Machine Learning has made the analysis of large amounts of data very efficient. Conventional, explicitly programmed algorithms are not capable of handling such complex tasks, which is why ML is in use.
• ML models learn from previous computations to supply reliable, repeatable decisions and results.
• ML is widely used in fields like banking, healthcare, science, and many more.
• Tasks like pattern recognition and the prediction of future data would not be practical without ML.
How Does Machine Learning Work?
• Machine Learning has a basic working pattern. The algorithm takes both the inputs and the corresponding outputs as training data.
• Then it trains a model using the given data. The algorithm looks for various patterns in the given data.
• These patterns help in future predictions. The results obtained from training the model help in improving how the model works.
Machine Learning Approaches
• There are various approaches when it comes to implementing Machine Learning algorithms.
• The approaches in ML are classified into two categories:
- Grouping of the algorithms by their learning style.
2. Association
• Association rules allow you to determine associations amongst data objects inside large databases.
• This helps to find more diverse rules in the data.
• Association rules provide us with deeper information about groups.
• For example, people who drink coffee may also like to drink tea.
ML: Semi-Supervised Learning
Supervised learning
• We have to manually label the data.
• This requires a lot of time and can be expensive.
• It also requires ML engineers or specialized data scientists to do the labeling.
Unsupervised learning
• The results obtained are less accurate.
• This is because the data is not labeled, so the correct outputs are unknown.
Semi-supervised learning
• Mitigates the problems of both supervised and unsupervised learning.
• The algorithm trains using both labeled and unlabeled data.
• The labeled data is a small part compared to the unlabeled data.
• The programmer first uses unsupervised learning to group the unlabeled data.
• Then supervised learning labels all the remaining unlabeled data.
ML: Decision Tree Algorithms
• In trees, the data splits according to specific parameters. A tree consists of nodes and leaves.
• Here, the leaves are the results, whereas the nodes represent the points where the data is split.
• Splitting here means that if there are two options, yes and no, each data point follows only one of them at a time.
• There are two types of trees:
- classification trees (the outcome is a category, such as yes/no)
- regression trees (the outcome is continuous)
• Various algorithms are used to construct decision trees:
- ID3 algorithm (Iterative Dichotomiser 3)
- CART (Classification and Regression Trees)
- Chi-square method
- Decision Stump
- M5 algorithm
ML: Bayesian Algorithm
• All Bayesian methods are built on Bayes' theorem.
• Bayes' theorem is all about probability, especially conditional probability.
• It gives the probability that an event A will happen given that an event B has already happened.
• The most famous algorithm is the Naïve Bayes classifier.
• It works on probability and can calculate the likelihood of events happening.
• There are various other algorithms, such as:
- Gaussian Naïve Bayes
- Multinomial Naïve Bayes
- Bayesian Belief Network
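Bayes' theorem can be worked through with a few numbers. The sketch below computes P(A | B) = P(B | A) * P(A) / P(B), expanding P(B) over the two cases "A" and "not A". The spam-filter scenario and all probabilities in it are invented for illustration, and the function name posterior is an assumption, not part of any library.

```python
def posterior(prior_a, likelihood_b_given_a, likelihood_b_given_not_a):
    """Bayes' theorem: P(A | B) = P(B | A) * P(A) / P(B)."""
    # P(B) by the law of total probability over A and not-A.
    evidence = (likelihood_b_given_a * prior_a
                + likelihood_b_given_not_a * (1 - prior_a))
    return likelihood_b_given_a * prior_a / evidence

# Invented numbers: A = "email is spam", B = "email contains the word 'offer'".
p_spam = 0.2                 # P(A)
p_offer_given_spam = 0.6     # P(B | A)
p_offer_given_ham = 0.05     # P(B | not A)

p_spam_given_offer = posterior(p_spam, p_offer_given_spam, p_offer_given_ham)
print(round(p_spam_given_offer, 2))
```

With these numbers, seeing the word raises the probability of spam from the prior of 0.2 to about 0.75. A Naïve Bayes classifier applies the same update for many words at once, under the simplifying assumption that the words are independent given the class.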