Data Science Summary


Stating and Refining the Question

• Doing data analysis requires quite a bit of thinking, and when you’ve completed a good data analysis, you’ve spent more time thinking than doing.
• The thinking begins before you even look at a dataset, and it’s well worth devoting careful thought to your question.

Descriptive question

• A descriptive question is one that seeks to summarize a characteristic of a set of data.
• Examples include determining the proportion of males, the mean number of servings of fresh fruits and vegetables per day, or the frequency of viral illnesses in a set of data collected from a group of individuals.
• There is no interpretation of the result itself, as the result is a fact, an attribute of the set of data that you are working with.

Exploratory question

• An exploratory question is one in which you analyze the data to see if there are patterns, trends, or relationships between variables.
• These types of analyses are also called “hypothesis-generating” analyses because, rather than testing a hypothesis as would be done with an inferential, causal, or mechanistic question, you are looking for patterns that would support proposing a hypothesis.

Inferential question

• An inferential question would be a restatement of this proposed hypothesis as a question and would be answered by analyzing a different set of data, which in this example is a representative sample of adults in Indonesia.
• You will be able to infer what is true, on average, for the adult population in Indonesia from the analysis you perform on the representative sample.

Predictive question

• A predictive question would be one where you ask what types of people will eat a diet high in fresh fruits and vegetables during the next year.
• In this type of question you are less interested in what causes someone to eat a certain diet, just what predicts whether someone will eat this certain diet.
• For example, higher income may be one of the final set of predictors, and you may not know (or even care) why people with higher incomes are more likely to eat a diet high in fresh fruits and vegetables, but what is most important is that income is a factor that predicts this behavior.

Causal question

• A causal question asks about whether changing one factor will change another factor, on average, in a population.
• Sometimes the underlying design of the data collection, by default, allows for the question that you ask to be causal.
• An example of this would be data collected in the context of a randomized trial, in which people were randomly assigned to eat a diet high in fresh fruits and vegetables or one that

Mechanistic question

• Finally, none of the questions described so far will lead to an answer that will tell us, if the diet does, indeed, cause a reduction in the number of viral illnesses, how the diet leads to a reduction in the number of viral illnesses.
• A question that asks how a diet high in fresh fruits and vegetables leads to a reduction in the number of viral illnesses would be a mechanistic question.

Characteristics of a Good Question (Peng & Matsui, 2015)

There are five key characteristics of a good question for a data analysis:
• It should be of interest to your audience.
• It has not already been answered.
• It should stem from a plausible framework.
• It should be answerable.
• It should be specific.

• Is eating a healthier diet better for you?
• Does eating at least 5 servings per day of fresh fruits and vegetables lead to fewer upper respiratory tract infections (colds)? >> More specific

Good Economic Research Question

• Especially for research in the field of economics, there are additional criteria that need to be considered.
• These criteria are as follows:
• Problem oriented
• Analytical
• Interesting and significant
• Amenable to economic analysis
• Feasible

Example
• Career Women (Working Mothers) and Children’s Cognitive Development

Analytical
Addressing this issue (research question) requires a study of the process of children’s cognitive development and of the role the mother or caregiver / childcare plays in it.

Interesting and Significant
• Previous studies found that a lack of maternal attention reduces a child’s cognitive ability.
• However, experts have not yet reached a settled consensus. This makes the topic interesting to study, because the results can be used to shape policies that may affect many families in Indonesia.

Amenable to Economic Analysis
1. At first glance, it looks like a research topic in educational psychology.
2. The topic is a research issue in labor economics and human capital theory.
3. It is an application of educational production function theory.

Feasible
Data on cognitive ability and working mothers, as well as data on mother and child characteristics, are available in the IFLS.

Data Pre-processing
• 80% of project time is typically spent getting data ready.
• Data in the real world is often dirty; that is, it needs to be cleaned up before it can be used for a desired purpose.
• This is often called data pre-processing.
• What makes data “dirty”? Here are some of the factors that indicate that data is not clean or ready to process:
- Incomplete. When some of the attribute values are lacking, certain attributes of interest are lacking, or attributes contain only aggregate data.
- Noisy. When data contains errors or outliers. For example, some of the data points in a dataset may contain extreme values that can severely affect the dataset’s range.
- Inconsistent. Data contains discrepancies in codes or names. For example, if the “Name” column for registration records of employees contains values other than alphabetical letters, or if records do not start with a capital letter, discrepancies are present.

[Figure: Forms of data preprocessing]

Data Cleaning
• Since there are several reasons why data could be “dirty,” there are just as many ways to “clean” it.
• Three key methods describe ways in which data may be “cleaned,” or better organized:
- Data Munging
- Handling Missing Data
- Smooth Noisy Data

Data Munging
• Often, the data is not in a format that is easy to work with.
• Thus, we need to convert it to something more suitable for a computer to understand.
• This can be done manually, automatically, or, in many cases, semiautomatically.
EXAMPLE:
Consider the following text recipe.
“Add two diced tomatoes, three cloves of garlic, and a pinch of salt in the mix.” This can be turned into a table with one row per ingredient, giving its name, quantity, and unit or preparation (tomato, 2, diced; garlic, 3, cloves; salt, 1, pinch).

Handling Missing Data
• Sometimes data may be in the right format, but some of the values are missing.
• Missing data can result from:
- problems with the process of collecting data,
- equipment malfunction,
- comprehensiveness not having been considered important at the time of collection,
- system or human error while storing or transferring the data.
• Strategies to combat missing data include:
- ignoring that record,
- using a global constant to fill in all missing values,
- imputation,
- inference-based solutions (Bayesian formula or a decision tree), etc.

Smooth Noisy Data
• There are times when the data is not missing, but it is corrupted for some reason.
• Data corruption may be a result of faulty data collection instruments, data entry problems, or technology limitations.
• Just as there is no single technique to take care of missing data, there is no one way to remove noise, or smooth out the noisiness in the data.

Example:
A digital thermometer measures temperature to one decimal point (e.g., 70.1°F), but the storage system ignores the decimal points. Now we have 70.1°F and 70.9°F both stored as 70°F. This may not seem like a big deal, but for humans a 99.4°F temperature means you are fine, and 99.8°F means you have a fever, and if our storage system represents both as 99°F, then it fails to differentiate between healthy and sick persons!

Data Integration
• To be as efficient and effective for various data analyses as possible, data from various sources commonly needs to be integrated.
• The following steps describe how to integrate multiple databases or files:
1. Combine data from multiple sources into a coherent storage place (e.g., a single file or a database).
2. Engage in schema integration, or the combining of metadata from different sources.
3. Detect and resolve data value conflicts.
4. Address redundant data in data integration.

Data Transformation
Data must be transformed so it is consistent and readable (by a system).
The following five processes may be used for data transformation:
• Smoothing: Remove noise from data.
• Aggregation: Summarization, data cube construction.
• Generalization: Concept hierarchy climbing.
• Normalization: Scale values to fall within a small, specified range. Some of the techniques that are used for accomplishing normalization are:
- Min–max normalization.
- Z-score normalization.
- Normalization by decimal scaling.
• Attribute or feature construction: New attributes constructed from the given ones.

Data Reduction
• Data reduction is a key process in which a reduced representation of a dataset that produces the same or similar analytical results is obtained.
• One example of a large dataset that could warrant reduction is a data cube.
• Data cubes are multidimensional sets of data that can be stored in a spreadsheet.
• Data Reduction Techniques:
- Data Cube Aggregation. We reduce the data to its more meaningful size and structure for the task at hand.
- Dimensionality Reduction. In contrast with the data cube aggregation method, where the data reduction was done with the task in mind, the dimensionality reduction method works with respect to the nature of the data.

Data Discretization
• We are often dealing with data that are collected from processes that are continuous, such as temperature, ambient light, and a company’s stock price. But sometimes we need to convert these continuous values into more manageable parts.
• This mapping is called discretization.
• There are three types of attributes involved in discretization:
- Nominal: Values from an unordered set
- Ordinal: Values from an ordered set
- Continuous: Real numbers
• To achieve discretization, divide the range of continuous attributes into intervals.
• For instance, we could decide to split the range of temperature values into cold, moderate, and hot, or the price of company stock into above or below its market valuation.

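
As a rough illustration of the three normalization techniques named under Data Transformation and of the interval-based discretization of temperature, the sketch below uses only the Python standard library. The function names, the sample readings, the use of the population standard deviation, and the cut points of 60°F and 80°F are all assumptions made for this example, not values taken from the notes.

```python
import statistics


def min_max_normalize(values, new_min=0.0, new_max=1.0):
    """Rescale values linearly so they fall within [new_min, new_max]."""
    low, high = min(values), max(values)
    if high == low:
        raise ValueError("min-max normalization needs at least two distinct values")
    return [new_min + (v - low) / (high - low) * (new_max - new_min) for v in values]


def z_score_normalize(values):
    """Express each value as its distance from the mean in standard deviations."""
    mean = statistics.mean(values)
    std = statistics.pstdev(values)  # population standard deviation (an assumption)
    if std == 0:
        raise ValueError("z-score normalization needs non-constant values")
    return [(v - mean) / std for v in values]


def decimal_scaling_normalize(values):
    """Divide by the smallest power of ten that brings every |value| below 1."""
    largest = max(abs(v) for v in values)
    digits = 0
    while largest / 10 ** digits >= 1:
        digits += 1
    return [v / 10 ** digits for v in values]


def discretize_temperature(temp_f, cold_below=60.0, hot_from=80.0):
    """Map a continuous temperature to an ordinal label (hypothetical cut points)."""
    if temp_f < cold_below:
        return "cold"
    if temp_f < hot_from:
        return "moderate"
    return "hot"


temperatures_f = [50.0, 70.0, 90.0]  # made-up sample readings
scaled = min_max_normalize(temperatures_f)
standardized = z_score_normalize(temperatures_f)
decimal_scaled = decimal_scaling_normalize(temperatures_f)
labels = [discretize_temperature(t) for t in temperatures_f]
print(scaled, standardized, decimal_scaled, labels)
```

Where the cut points sit is a modelling decision: they decide which readings share a label, much as the storage rounding in the thermometer example decides which readings become indistinguishable.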
Machine Learning: Introduction

Machine Learning
• Machine learning is part of data science.
• Data is often generated in such massive volumes that working through it by hand becomes tedious for a data scientist. That is when machine learning comes into play.
• Definition:
- Machine learning is the ability given to a
system to learn and process data sets
autonomously without human intervention. Or,
- Machine learning is the ability of
algorithms to learn from data
• Machine Learning (ML) uses various statistical
approaches for making predictions.
• It also plays a major role in pattern finding; that is, it can find various patterns in the complex data given to it.
• ML is part of a larger domain, Artificial Intelligence (AI).
• ML deals with algorithms that learn from given
data and make predictions.
Why Do We Need Machine Learning?
• Machine Learning has made the analysis of large amounts of data very efficient. Conventional, explicitly programmed algorithms struggle with such complex tasks, which is why ML is used.
• ML models learn from previous computations to supply reliable, repeatable decisions and results.
• ML is widely used in fields like banking, healthcare, science, and many more.
• Tasks like pattern recognition and prediction of future data would not be possible without ML.

How Does Machine Learning Work?
• Machine Learning has a basic working pattern. The algorithm takes both the inputs and the corresponding outputs as training data.
• Then it trains a model using the given data. The algorithm looks for various patterns in the given data.
• These patterns help in future predictions. The results obtained from training the model will help in improving the model’s working.

Machine Learning Approaches
• There are various approaches when it comes to implementing Machine Learning algorithms.
• The approaches in ML are now classified into two categories:
- Grouping of the algorithms by their learning style.
- Grouping of the algorithms by their similarity.
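
The working pattern described under "How Does Machine Learning Work?" (take example inputs with their outputs, train a model, then predict) can be sketched with a deliberately tiny model. The example below fits a straight line by ordinary least squares, which is only one simple choice of model. The function names and the study-hours data are made up for illustration.

```python
def fit_line(xs, ys):
    """Train: find the slope and intercept that best fit the (x, y) pairs."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def predict(model, x):
    """Predict: apply the learned pattern to an input the model has not seen."""
    slope, intercept = model
    return slope * x + intercept


# Hypothetical training data: hours studied (input) and quiz score (output).
hours = [1.0, 2.0, 3.0, 4.0]
scores = [3.0, 5.0, 7.0, 9.0]

model = fit_line(hours, scores)
prediction = predict(model, 5.0)
print(model, prediction)
```

Because every output comes with its input, this is also a small instance of learning from labeled data.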

[Figure: Machine Learning Applications]

[Figure: Advantages and Disadvantages of Machine Learning]

ML: Supervised Learning

• Supervised learning is the machine learning task of learning a function that maps an input to an output based on example input-output pairs.
• It infers a function from labeled training data consisting of a set of training examples.
• In this type of learning, we have labeled input data. This means that the data presented to the model already contains the correct answer.
• Some might wonder why we are providing data that already has the right answer. The answer is simple.
• We are giving this pre-labeled data to make the model learn from it. This means if data with similar features is given to the machine in the future, it will recognize it.
• In addition, the labeled data (training data) help to train the model to improve its accuracy.

Supervised learning is of two types:

1. Classification
• It specifies the category to which data elements belong and is best used when the output has finite and discrete values.
• It also predicts a category for an input variable.
• The output in this problem is a category, such as ‘blue’, ‘green’, ‘sunny’, ‘not sunny’, ‘disease’, ‘no disease’, etc.
• The output here is in the form of classes.

2. Regression
• It helps find the correlation between variables and enables us to predict a continuous output variable based on one or more predictor variables.
• The output in this problem is a real value, such as ‘mass’, ‘percent’, ‘rupees’, ‘dollars’, etc.

ML: Unsupervised Learning

• Unsupervised learning is a type of machine learning algorithm used to draw inferences from datasets consisting of input data without labeled responses.
• The most common unsupervised learning method is cluster analysis, which is used for exploratory data analysis to find hidden patterns or grouping in data.
• In this type of learning, there is no labeled input data. This means that the machine won’t be given any labeled training data to learn from.
• Thus, the task of the machine is to group similar types of data together. This is done based on patterns and differences without any sort of previous training.
• Thus, the machine will have to figure everything out from scratch without any help.

Unsupervised learning is of two types:
1. Clustering
• It involves segregating data based on the similarity between data instances.
• It is an iterative process that seeks out cluster centers, called centroids, and assigns each data point to one of the centroids.
• Here, we have groups of data according to certain criteria.
• For example, people who drink coffee are grouped separately, whereas people who don’t drink it fall into a different group.

2. Association
• Association rules allow you to determine associations amongst data objects inside large databases.
• This helps to find more diverse rules in the data.
• Association rules provide us with deeper information about groups.
• For example, people who drink coffee may also like to drink tea.

ML: Semi-Supervised Learning
supervised learning
• We have to manually label the data.
• This requires a lot of time and can be expensive.
• Also, it requires ML engineers or specialized data scientists to do such jobs.
unsupervised learning
• The results obtained are less accurate.
• This is because the data is not labeled and unknown.
semi-supervised learning
• Mitigates the drawbacks of both supervised and unsupervised learning.
• The algorithm trains using both labeled and unlabeled data.
• The labeled data is a small part as compared to the unlabeled data.
• The programmer uses unsupervised learning to first group the unlabeled data.
• Then the supervised learning labels all the remaining unlabeled data.

ML: Decision Tree Algorithms
• In trees, the data splits according to specific parameters. A tree consists of nodes and leaves.
• Here, leaves are the result, whereas the nodes represent the points where the data is split.
• Splitting here means that, when there are two options such as yes and no, each data point follows only one of them.
• There are two types of trees:
- classification trees (a yes/no type) and
- regression trees (the data is continuous)
• There are various algorithms with which the decision tree is constructed:
- ID3 algorithm (Iterative Dichotomiser 3 algorithm)
- CART (Classification and Regression Trees)
- Chi-square method
- Decision Stump
- M5 algorithm

ML: Bayesian Algorithm
• All Bayesian methods are built on Bayes’ theorem.
• Bayes’ theorem is all about probability, especially conditional probability.
• It gives the probability that an event A will happen given that an event B has already happened.
• The most famous algorithm used is the Naïve Bayes classifier.
• It works on probability and it can calculate the likelihood of events happening.
• There are various other algorithms like:
- Gaussian Naïve Bayes
- Multinomial Naïve Bayes
- Bayesian Belief Network
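
To make the conditional-probability idea concrete, Bayes' theorem can be written as P(A|B) = P(B|A) · P(A) / P(B). The sketch below applies it to a made-up screening scenario. The function name and all three input probabilities are assumptions chosen for illustration, not figures from the notes.

```python
def bayes_posterior(prior_a, prob_b_given_a, prob_b_given_not_a):
    """Return P(A|B) from P(A), P(B|A) and P(B|not A) using Bayes' theorem."""
    # Total probability of observing B, whether or not A holds.
    prob_b = prob_b_given_a * prior_a + prob_b_given_not_a * (1 - prior_a)
    return prob_b_given_a * prior_a / prob_b


# Hypothetical numbers: A = "has the disease", B = "test is positive".
posterior = bayes_posterior(
    prior_a=0.01,             # 1% of people have the disease
    prob_b_given_a=0.95,      # the test detects 95% of true cases
    prob_b_given_not_a=0.05,  # 5% of healthy people still test positive
)
print(round(posterior, 3))
```

With these assumed inputs the posterior comes out at roughly 16%, far below the 95% detection rate, because the disease is rare to begin with.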

ML: Clustering Algorithms


• This is an unsupervised learning approach, which is very useful when it comes to grouping data.
• In clustering, similar data points fall into a single group or cluster.
• Data points that are not similar fall into some other group or cluster.
• The algorithms used are:
• K-means
• K-medians
• Hierarchical clustering
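
A minimal sketch of K-means, the first algorithm listed, is shown below in plain Python. It assumes squared Euclidean distance and starts from the first k points as centroids, which is a simplification made for this example (practical implementations usually initialize more carefully). The function names and the sample points are illustrative only.

```python
def nearest_centroid(point, centroids):
    """Return the index of the centroid closest to point (squared Euclidean)."""
    distances = [sum((a - b) ** 2 for a, b in zip(point, c)) for c in centroids]
    return distances.index(min(distances))


def k_means(points, k, max_iterations=100):
    """Alternate between assigning points to centroids and re-centering them."""
    centroids = list(points[:k])  # naive initialization (assumption)
    for _ in range(max_iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            clusters[nearest_centroid(p, centroids)].append(p)
        new_centroids = []
        for cluster, old in zip(clusters, centroids):
            if cluster:
                # New centroid is the coordinate-wise mean of the cluster.
                new_centroids.append(
                    tuple(sum(coord) / len(cluster) for coord in zip(*cluster))
                )
            else:
                new_centroids.append(old)  # keep an empty cluster's centroid
        if new_centroids == centroids:
            break  # assignments have stopped changing
        centroids = new_centroids
    labels = [nearest_centroid(p, centroids) for p in points]
    return centroids, labels


# Two visibly separate groups of made-up 2-D points.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.0), (8.0, 8.0), (9.0, 9.0), (8.0, 9.5)]
centroids, labels = k_means(points, k=2)
print(centroids, labels)
```

K-medians follows the same loop but re-centers each cluster on the coordinate-wise median instead of the mean.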

ML: Artificial Neural Network (ANN) Algorithms

• ANNs are loosely modeled on the biological neurons (the nerve cells) in your brain. An ANN consists of many artificial neurons, each playing a role analogous to that of a nerve cell in a brain.
• Neural networks are the foundation of Deep Learning. They can simulate aspects of the biological nervous system.
• An ANN is capable of both Machine Learning and pattern recognition.
• An ANN is mainly an information processing technique. It is also a type of graph consisting of nodes and connecting arcs.
• A basic neural network consists of three layers: the input layer, the hidden layer, and the output layer.
• The input layer takes the input. The hidden layer processes the input data.
• It performs various tasks on the given data. Then the processed data passes through the output layer.
• In diagrams, each neuron is represented by a circle, and the connections are drawn as arcs.
• There are various algorithms in ANN:
• Perceptron learning
• Multilayer perceptron
• Back-propagation
• Stochastic gradient descent