Notes - Unit 01 - Data Science and Big Data Analytics


DATA SCIENCE AND BIG DATA ANALYTICS

UNIT 01

SYLLABUS

BASICS AND NEED OF DATA SCIENCE AND BIG DATA, APPLICATIONS OF DATA SCIENCE, EXPLOSION OF
DATA, 5 V'S OF BIG DATA, RELATIONSHIP BETWEEN DATA SCIENCE AND INFORMATION SCIENCE,

BUSINESS INTELLIGENCE VS DATA SCIENCE, DATA SCIENCE LIFE CYCLE,

DATA: DATA TYPES, DATA COLLECTION, NEED OF DATA WRANGLING,

METHODS: DATA CLEANING, DATA INTEGRATION, DATA REDUCTION, DATA TRANSFORMATION, DATA
DISCRETIZATION

Data Preprocessing in Data Mining


Preprocessing

Data preprocessing is a data mining technique used to transform raw data into a
useful and efficient format.

Why is Data preprocessing important?

Data is preprocessed mainly to ensure its quality. Quality can be checked against
the following measures (a small pandas sketch of such checks follows the list):

• Accuracy: whether the data entered is correct or not.
• Completeness: whether all required data is recorded and available.
• Consistency: whether the same data kept in different places matches.
• Timeliness: whether the data is kept correctly up to date.
• Believability: whether the data is trustworthy.
• Interpretability: how easily the data can be understood.
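As a rough illustration (not part of the syllabus text), the sketch below checks a
few of these quality measures with pandas on a small hypothetical student-marks
table; the column names and values are assumptions made only for this example.

```python
import pandas as pd

# Hypothetical student-marks table used only for illustration.
df = pd.DataFrame({
    "Rollno": ["01", "02", "02", None],
    "Name":   ["vgs", "abc", "abc", "xyz"],
    "Sub1":   [78, 65, 65, 250],           # 250 is an impossible mark
})

# Completeness: how many values are missing in each column?
print(df.isnull().sum())

# Consistency: are there duplicate records for the same roll number?
print(df.duplicated(subset=["Rollno"]).sum())

# Accuracy (very rough check): flag marks outside the valid 0-100 range.
print(df[(df["Sub1"] < 0) | (df["Sub1"] > 100)])
```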

Major Tasks in Data Preprocessing:


1. Data cleaning (to deal with missing values and noisy data)
2. Data integration
3. Data reduction
4. Data transformation
Example of a student marks table with missing values (missing entries shown as --):

Rollno  Name  Result  Sub1  Sub2  Sub3  Sub4  Sub5
  --    vgs   dist     78    89    --    --    --
  02     --    --      --    --    --    --    --
  03     --    --      --    --    --    --    --

Steps Involved in Data Preprocessing:

1. Data Cleaning:
The data can have many irrelevant and missing parts. Data cleaning is done to
handle them. It involves handling missing data, noisy data, etc.
Data cleaning is the process of removing incorrect, incomplete, and inaccurate
data from datasets; it also replaces missing values. Some techniques used in
data cleaning are:

• (a). Missing Data:

This situation arises when some values are missing from the dataset. It can be
handled in various ways.
Some of them are:

1. Ignore the tuples:
This approach is suitable only when the dataset is quite large and multiple
values are missing within a tuple.

2. Fill the Missing values:

There are various ways to do this. The missing values can be filled manually,
with the attribute mean, or with the most probable value.

• Standard values like “Not Available” or “NA” can be used to replace the
missing values.
• Missing values can also be filled manually, but this is not recommended when
the dataset is big.
• The attribute’s mean value can be used to replace the missing value when the
data is normally distributed, whereas for a non-normal distribution the
attribute’s median value can be used instead.
• When using regression or decision tree algorithms, the missing value can be
replaced by the most probable value.
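A minimal pandas sketch of these fill strategies, assuming a hypothetical numeric
column Sub1 and a categorical column Result (the names and values are illustrative
only, not from the notes):

```python
import pandas as pd

df = pd.DataFrame({"Sub1": [78, None, 89, 92],
                   "Result": ["dist", None, "dist", "pass"]})

# Ignore the tuples: drop rows that contain missing values.
dropped = df.dropna()

# Replace missing values with a standard placeholder such as "NA".
placeholder = df.fillna("NA")

# Fill a numeric attribute with its mean (roughly normal data)
# or its median (skewed data).
df["Sub1_mean"] = df["Sub1"].fillna(df["Sub1"].mean())
df["Sub1_median"] = df["Sub1"].fillna(df["Sub1"].median())

# Fill a categorical attribute with its most probable (most frequent) value.
df["Result"] = df["Result"].fillna(df["Result"].mode()[0])
print(df)
```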

• (b). Noisy Data:

Noisy data is meaningless data that cannot be interpreted by machines. It can
be generated by faulty data collection, data entry errors, etc.
• It can be handled in the following ways:
1. Binning Method:
This method works on sorted data in order to smooth it. The data is first
sorted, then divided into segments (bins) of equal size, and each bin is
handled separately. There are three ways to smooth the data in a bin (a short
sketch of smoothing by bin means appears after this list):
Smoothing by bin means:
each value in a bin is replaced by the mean value of the bin.
Smoothing by bin medians:
each value in a bin is replaced by the median value of the bin.
Smoothing by bin boundaries:
the minimum and maximum values of the bin are taken as the bin boundaries,
and each value is replaced by the closest boundary value.

2. Regression:
Here data can be made smooth by fitting it to a regression function. The
regression used may be linear (having one independent variable) or multiple
(having multiple independent variables).

3. Clustering:
This approach groups similar data into clusters. Outliers either go undetected
or fall outside the clusters.
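A small numpy sketch of smoothing by bin means, one of the binning variants
described above; the values and bin size are hypothetical:

```python
import numpy as np

# Hypothetical noisy attribute values, sorted first as binning requires.
values = np.sort(np.array([4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]))

bin_size = 4                      # equal-size bins of 4 values each
smoothed = values.astype(float)   # copy as floats so means fit

# Smoothing by bin means: replace every value in a bin by the bin's mean.
for start in range(0, len(values), bin_size):
    bin_slice = slice(start, start + bin_size)
    smoothed[bin_slice] = values[bin_slice].mean()

print(smoothed)
# Smoothing by bin boundaries would instead snap each value to the nearer
# of the bin's minimum and maximum values.
```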

2. Data Transformation:

This step is taken in order to transform the data into forms suitable for the
mining process. It involves the following ways (a short sketch of normalization
and discretization appears after this list):

1. Normalization:
It is done in order to scale the data values into a specified range, such as
-1.0 to 1.0 or 0.0 to 1.0.

2. Attribute Selection:
In this strategy, new attributes are constructed from the given set of attributes to
help the mining process.

3. Discretization:
This is done to replace the raw values of a numeric attribute by interval
levels or conceptual levels.
4. Concept Hierarchy Generation:
Here attributes are converted from a lower level to a higher level in the
hierarchy. For example, the attribute “city” can be converted to “country”.
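A short pandas sketch of min-max normalization into the range 0.0 to 1.0 and of
discretization of a numeric attribute into interval labels; the column name, bin
edges, and labels are assumptions for illustration:

```python
import pandas as pd

df = pd.DataFrame({"Sub1": [35, 48, 62, 78, 89, 95]})

# Normalization: min-max scaling of the values into the range 0.0-1.0.
col = df["Sub1"]
df["Sub1_norm"] = (col - col.min()) / (col.max() - col.min())

# Discretization: replace raw numeric values by interval / conceptual labels.
df["Sub1_level"] = pd.cut(df["Sub1"],
                          bins=[0, 40, 60, 75, 100],
                          labels=["fail", "pass", "first class", "distinction"])
print(df)
```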

3. Data Reduction:
Data mining is a technique used to handle huge amounts of data, and analysis
becomes harder as the volume of data grows. Data reduction techniques address
this: they aim to increase storage efficiency and reduce data storage and
analysis costs.
The various steps of data reduction are:
1. Data Cube Aggregation:
Aggregation operations are applied to the data to construct a data cube.

2. Attribute Subset Selection:

Only the highly relevant attributes should be used; all the rest can be
discarded. For performing attribute selection, one can use the significance
level and the p-value of the attribute: an attribute whose p-value is greater
than the significance level can be discarded.

3. Numerosity Reduction:
This enables storing a model of the data instead of the whole data, for
example regression models.

4. Dimensionality Reduction:
This reduces the size of the data by using encoding mechanisms. It can be
lossy or lossless. If the original data can be retrieved after reconstruction
from the compressed data, the reduction is called lossless; otherwise it is
called lossy. Two effective methods of dimensionality reduction are wavelet
transforms and PCA (Principal Component Analysis).
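A minimal scikit-learn sketch of dimensionality reduction with PCA, assuming a
small hypothetical numeric feature matrix (this is a lossy reduction, since the
original data cannot be exactly reconstructed from the retained components):

```python
import numpy as np
from sklearn.decomposition import PCA

# Hypothetical data: 5 records, each with 4 numeric attributes.
X = np.array([[2.5, 2.4, 0.5, 1.0],
              [0.5, 0.7, 2.2, 2.9],
              [2.2, 2.9, 0.8, 1.1],
              [1.9, 2.2, 1.0, 0.9],
              [3.1, 3.0, 0.4, 1.2]])

# Reduce the 4 original attributes to 2 principal components.
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)

print(X_reduced.shape)                 # (5, 2)
print(pca.explained_variance_ratio_)   # fraction of variance kept per component
```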

Data Integration in Data Mining

Data Integration is a data preprocessing technique that involves “combining data
from multiple heterogeneous data sources” into a coherent data store and
providing a unified view of the data. These sources may include multiple data
cubes, databases, or flat files.
The data integration approaches are formally defined as a triple <G, S, M> where
G stands for the global schema,
S stands for the heterogeneous source schemas,
M stands for the mappings between queries over the source and global schemas.
There are mainly two major approaches for data integration – one is the “tight
coupling approach” and the other is the “loose coupling approach”.

Tight Coupling:
• Here, a data warehouse is treated as an information retrieval component.
• In this coupling, data is combined from different sources into a single
physical location through the process of ETL – Extraction, Transformation, and
Loading (a toy pandas sketch of this idea follows the two approaches below).
Loose Coupling:
• Here, an interface is provided that takes the query from the user, transforms
it in a way the source database can understand, and then sends the query
directly to the source databases to obtain the result.
• The data remains only in the actual source databases.
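A toy pandas sketch of the tight-coupling (ETL) idea: records are extracted from
two hypothetical sources, transformed to a common schema, and loaded into one
combined warehouse-like table. The source tables, column names, and renaming are
assumptions for illustration only.

```python
import pandas as pd

# Extraction: two heterogeneous sources describing the same students.
src_a = pd.DataFrame({"roll_no": ["01", "02"], "name": ["vgs", "abc"]})
src_b = pd.DataFrame({"RollNo": ["01", "02"], "Sub1": [78, 65]})

# Transformation: map both sources onto one global schema.
src_b = src_b.rename(columns={"RollNo": "roll_no"})

# Loading: combine into a single unified view (the warehouse).
warehouse = src_a.merge(src_b, on="roll_no", how="outer")
print(warehouse)
```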
Issues in Data Integration:
There are three main issues to consider during data integration: schema
integration, redundancy, and detection and resolution of data value conflicts.
These are explained briefly below.
1. Schema Integration:
• Integrate metadata from different sources.
• Matching real-world entities from multiple sources is referred to as the
entity identification problem.
2. Redundancy:
• An attribute may be redundant if it can be derived or obtained from another
attribute or set of attributes.
• Inconsistencies in attributes can also cause redundancies in the resulting
data set.
• Some redundancies can be detected by correlation analysis (see the sketch
after this list).
3. Detection and resolution of data value conflicts:
• This is the third important issue in data integration.
• Attribute values from different sources may differ for the same real-world
entity.
• An attribute in one system may be recorded at a lower level of abstraction
than the “same” attribute in another.
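A brief sketch of detecting redundant attributes by correlation analysis, as
mentioned under the redundancy issue above: an attribute that is (almost)
perfectly correlated with another can likely be derived from it. The attributes
and threshold are hypothetical.

```python
import pandas as pd

df = pd.DataFrame({
    "marks":      [78, 65, 89, 92, 55],
    "percentage": [78.0, 65.0, 89.0, 92.0, 55.0],   # derivable from marks
    "attendance": [80, 72, 95, 60, 70],
})

# Pearson correlation matrix between the numeric attributes.
corr = df.corr()
print(corr)

# Flag attribute pairs with near-perfect correlation as redundancy candidates.
threshold = 0.95
for a in corr.columns:
    for b in corr.columns:
        if a < b and abs(corr.loc[a, b]) >= threshold:
            print(f"{a} and {b} look redundant (r = {corr.loc[a, b]:.2f})")
```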
