Notes - Unit01 - Data Science and Big Data Analytics
UNIT 01
SYLLABUS
BASICS AND NEED OF DATA SCIENCE AND BIG DATA, APPLICATIONS OF DATA SCIENCE, EXPLOSION OF
DATA, 5 VS OF BIG DATA, RELATIONSHIP BETWEEN DATA SCIENCE AND INFORMATION SCIENCE,
METHODS: DATA CLEANING, DATA INTEGRATION, DATA REDUCTION, DATA TRANSFORMATION, DATA
DISCRETIZATION
Data preprocessing is a data mining technique used to transform raw data into a useful
and efficient format.
Preprocessing is mainly concerned with data quality, which can be assessed by factors
such as accuracy, completeness, consistency, timeliness, believability, and interpretability.
1. Data Cleaning:
Raw data can have many irrelevant and missing parts. Data cleaning is done to handle
this; it involves handling missing data, noisy data, etc.
Data cleaning is the process of removing incorrect, incomplete, and inaccurate data
from the datasets, and it also replaces missing values. Some common techniques for
handling missing values are:
• Standard values like “Not Available” or “NA” can be used to replace the missing values.
• Missing values can also be filled in manually, but this is not recommended when the
dataset is big.
• The attribute’s mean value can be used to replace the missing value when the data is
normally distributed; in the case of a non-normal distribution, the median value of the
attribute can be used (see the sketch after this list).
• When regression or decision tree algorithms are used, the missing value can be
replaced by the most probable value.
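As a hedged illustration of the replacement strategies above, the minimal pandas sketch below fills missing values in a hypothetical numeric column `age` with a constant, the mean, or the median; the column name and data are assumptions made for illustration, not part of the notes.

```python
import pandas as pd

# Hypothetical dataset with missing values in a numeric attribute "age"
df = pd.DataFrame({"age": [25, 30, None, 45, None, 38]})

# Mean replacement: reasonable when the attribute is roughly normally distributed
df["age_mean_filled"] = df["age"].fillna(df["age"].mean())

# Median replacement: preferred when the distribution is skewed (non-normal)
df["age_median_filled"] = df["age"].fillna(df["age"].median())

# Constant replacement: mark missing values with a standard placeholder like "NA"
df["age_flagged"] = df["age"].astype("string").fillna("NA")

print(df)
```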
Noisy data can be smoothed with methods such as the following:
1. Binning:
This method smooths sorted data values by consulting their neighborhood: the sorted
values are distributed into bins, and each bin is smoothed by its mean, median, or
boundary values.
2. Regression:
Here, data can be smoothed by fitting it to a regression function. The regression used
may be linear (having one independent variable) or multiple (having several independent
variables); a sketch follows.
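A minimal sketch of regression-based smoothing, assuming a noisy attribute `y` that depends roughly linearly on a single attribute `x`; the noisy values are replaced by the fitted values of a linear regression. The data here is synthetic and purely illustrative.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
x = np.arange(50, dtype=float).reshape(-1, 1)      # independent variable
y = 3.0 * x.ravel() + 5.0 + rng.normal(0, 10, 50)  # noisy dependent attribute

# Fit a linear regression and replace the noisy values with the fitted line
model = LinearRegression().fit(x, y)
y_smoothed = model.predict(x)
print(y_smoothed[:5])
```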
3. Clustering:
This approach groups similar data into clusters. Outliers typically fall outside the
clusters, although some may go undetected (see the sketch below).
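The following sketch uses k-means clustering from scikit-learn to flag points that lie far from their assigned cluster centre as potential outliers; the synthetic data and the distance threshold are assumptions chosen only to illustrate the idea.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(1)
data = np.vstack([
    rng.normal(0, 1, (50, 2)),    # one dense cluster
    rng.normal(10, 1, (50, 2)),   # another dense cluster
    [[30.0, 30.0]],               # an obvious outlier
])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(data)

# Distance of each point to its assigned cluster centre
distances = np.linalg.norm(data - kmeans.cluster_centers_[kmeans.labels_], axis=1)

# Points much farther from their centre than typical are treated as outliers
threshold = distances.mean() + 3 * distances.std()
outliers = data[distances > threshold]
print(outliers)
```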
2. Data Transformation:
This step is taken in order to transform the data into forms suitable for the mining
process. It involves the following ways:
1. Normalization:
It is done in order to scale the data values into a specified range, such as -1.0 to 1.0
or 0.0 to 1.0 (see the sketch below).
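A minimal min-max normalization sketch, scaling a hypothetical attribute into the 0.0 to 1.0 range; scikit-learn's MinMaxScaler could equally be used, and the values are made up for illustration.

```python
import numpy as np

values = np.array([200.0, 300.0, 400.0, 600.0, 1000.0])

# Min-max normalization into the range [0.0, 1.0]
normalized = (values - values.min()) / (values.max() - values.min())
print(normalized)  # [0.0, 0.125, 0.25, 0.5, 1.0]
```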
2. Attribute Selection:
In this strategy, new attributes are constructed from the given set of attributes to
help the mining process.
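As a small hedged example of this strategy, the sketch below constructs a hypothetical `total_price` attribute from assumed `quantity` and `unit_price` attributes; all names and values are illustrative, not taken from the notes.

```python
import pandas as pd

df = pd.DataFrame({"quantity": [2, 5, 1], "unit_price": [9.99, 3.50, 20.00]})

# Construct a new attribute from the given attributes to aid the mining process
df["total_price"] = df["quantity"] * df["unit_price"]
print(df)
```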
3. Discretization:
This is done to replace the raw values of a numeric attribute with interval levels or
conceptual levels (see the sketch below).
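A hedged sketch of discretization with pandas, replacing raw numeric ages with interval labels; the bin edges and label names are assumptions chosen for illustration.

```python
import pandas as pd

ages = pd.Series([5, 17, 25, 42, 67, 80])

# Replace raw numeric values with conceptual labels (interval levels)
labels = pd.cut(ages, bins=[0, 18, 40, 65, 120],
                labels=["child", "young adult", "middle-aged", "senior"])
print(labels)
```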
4. Concept Hierarchy Generation:
Here attributes are converted from a lower level to a higher level in the hierarchy. For
example, the attribute “city” can be converted to “country” (see the sketch below).
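A minimal sketch of climbing one level of a concept hierarchy, mapping the lower-level attribute “city” to the higher-level attribute “country”; the mapping table is an illustrative assumption.

```python
import pandas as pd

df = pd.DataFrame({"city": ["Pune", "Mumbai", "Paris", "Tokyo"]})

# Illustrative city -> country concept hierarchy
city_to_country = {"Pune": "India", "Mumbai": "India",
                   "Paris": "France", "Tokyo": "Japan"}

# Generalize the attribute one level up the hierarchy
df["country"] = df["city"].map(city_to_country)
print(df)
```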
3. Data Reduction:
Data mining is a technique used to handle huge amounts of data, and analysis becomes
harder as the volume of data grows. Data reduction techniques are used to deal with
this. They aim to increase storage efficiency and reduce data storage and analysis
costs.
The various steps of data reduction are:
1. Data Cube Aggregation:
An aggregation operation is applied to the data to construct the data cube (see the
sketch below).
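A hedged sketch of cube-style aggregation with pandas: per-transaction sales rows (an assumed dataset) are rolled up into one aggregated value per (year, branch) cell, which is the kind of summarization a data cube supports.

```python
import pandas as pd

sales = pd.DataFrame({
    "year":   [2023, 2023, 2023, 2023, 2024, 2024, 2024, 2024],
    "branch": ["A", "A", "B", "B", "A", "A", "B", "B"],
    "amount": [100, 150, 200, 120, 130, 170, 210, 140],
})

# Aggregate (roll up) the detailed rows into one value per (year, branch) cell
cube = sales.pivot_table(index="year", columns="branch",
                         values="amount", aggfunc="sum")
print(cube)
```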
2. Attribute Subset Selection:
Only the highly relevant attributes are retained; the rest can be discarded.
3. Numerosity Reduction:
This enables a model of the data to be stored instead of the whole data, for example
regression models (see the sketch below).
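A hedged sketch of parametric numerosity reduction: instead of storing every (x, y) observation, only the fitted regression coefficients are kept. The data set and variable names are synthetic assumptions.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(2)
x = rng.uniform(0, 100, 10_000).reshape(-1, 1)
y = 2.5 * x.ravel() + 7.0 + rng.normal(0, 1, 10_000)

model = LinearRegression().fit(x, y)

# Store only the model parameters instead of the 10,000 raw observations
reduced_representation = {"slope": model.coef_[0], "intercept": model.intercept_}
print(reduced_representation)
```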
4. Dimensionality Reduction:
This reduces the size of the data by encoding mechanisms. It can be lossy or lossless.
If the original data can be retrieved after reconstruction from the compressed data, the
reduction is called lossless; otherwise it is called lossy. Two effective methods of
dimensionality reduction are wavelet transforms and PCA (Principal Component Analysis);
a PCA sketch follows.
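A minimal PCA sketch with scikit-learn, projecting a hypothetical 4-dimensional dataset down to 2 principal components; the random data is only a stand-in for real attributes.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(3)
data = rng.normal(size=(100, 4))   # 100 samples, 4 numeric attributes

pca = PCA(n_components=2)
reduced = pca.fit_transform(data)  # shape (100, 2)

print(reduced.shape, pca.explained_variance_ratio_)
```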
4. Data Integration:
Data integration merges data from multiple heterogeneous sources into a coherent store.
Two common approaches are tight coupling and loose coupling.
Tight Coupling:
• Here, a data warehouse is treated as an information retrieval component.
• In this coupling, data is combined from different sources into a single physical
location through the process of ETL (Extraction, Transformation, and Loading).
Loose Coupling:
• Here, an interface is provided that takes the query from the user, transforms it into
a form the source database can understand, and then sends the query directly to the
source databases to obtain the result.
• The data remains only in the actual source databases.
Issues in Data Integration:
There are three issues to consider during data integration: schema integration,
redundancy, and detection and resolution of data value conflicts. These are explained
briefly below.
1. Schema Integration:
• Metadata from different sources is integrated.
• Matching real-world entities from multiple sources is referred to as the entity
identification problem.
2. Redundancy:
• An attribute may be redundant if it can be derived or obtained from another attribute
or set of attributes.
• Inconsistencies in attribute naming can also cause redundancies in the resulting data
set.
• Some redundancies can be detected by correlation analysis (see the sketch below).
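A small hedged sketch of detecting a redundant attribute via correlation analysis: an assumed `temp_f` column is a deterministic transform of `temp_c`, and the correlation matrix exposes it. Column names and values are illustrative.

```python
import pandas as pd

df = pd.DataFrame({"temp_c": [10.0, 15.0, 20.0, 25.0, 30.0]})
df["temp_f"] = df["temp_c"] * 9 / 5 + 32   # redundant: derivable from temp_c
df["humidity"] = [80, 70, 65, 60, 55]

# A correlation close to +/-1 between two attributes suggests redundancy
print(df.corr())
```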
3. Detection and resolution of data value conflicts:
• This is the third important issue in data integration.
• Attribute values from different sources may differ for the same real-world entity. For
example, a price may be recorded in different currencies or units in different systems.
• An attribute in one system may be recorded at a lower level of abstraction than the
“same” attribute in another.