C-42 Exp 3 Sma
TUS3F202135 C42
LAB MANUAL
PART A
(PART A: TO BE REFERRED BY STUDENTS)
Experiment No-03
A.1 Aim:
Data Cleaning and Storage- Pre-process, filter and store social media data for business (Using Python,
MongoDB, R, etc.).
Lab Objective: To understand the fundamental concepts of social media networks.
Lab Outcome: Collect, monitor, store and track social media data.
PART B
(PART B: TO BE COMPLETED BY STUDENTS)
(Students must submit the soft copy as per the following segments within two hours of the practical.
The soft copy must be uploaded on Blackboard, or emailed to the concerned lab in-charge
faculty at the end of the practical in case there is no Blackboard access available.)
Roll No.: C42 Name: Sarvesh Patil
Class: BE-C Batch: C3
Date of Experiment: Date of Submission:
Grade:
B.1 Study the fundamentals of social media platforms and implement data cleaning,
pre-processing, filtering and storing of social media data for business:
(Paste your search material completed during the 2 hours of the practical in the lab here)
• Students can use any social media data of their choice.
• Perform cleaning, pre-processing and filtering on the chosen data.
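As a starting point, the cleaning and filtering tasks above can be sketched in plain Python. This is an illustrative example only: the field names ("id", "text") and the sample tweets are assumptions, not data from a real social media API.

```python
# Hypothetical sketch: clean a small batch of tweets, then filter for a brand keyword.
# Field names and sample records are invented for illustration.

def clean_tweets(tweets):
    """Deduplicate by id, drop empty texts, and strip surrounding whitespace."""
    seen = set()
    cleaned = []
    for t in tweets:
        text = (t.get("text") or "").strip()
        if not text or t["id"] in seen:
            continue
        seen.add(t["id"])
        cleaned.append({**t, "text": text})
    return cleaned

def filter_by_keyword(tweets, keyword):
    """Keep only tweets whose text mentions the keyword (case-insensitive)."""
    return [t for t in tweets if keyword.lower() in t["text"].lower()]

raw = [
    {"id": 1, "text": "  Great service from @acme! "},
    {"id": 1, "text": "  Great service from @acme! "},   # duplicate record
    {"id": 2, "text": ""},                               # empty text
    {"id": 3, "text": "Shipping was slow, acme please fix"},
]
cleaned = clean_tweets(raw)
business = filter_by_keyword(cleaned, "acme")
```

The same cleaned records could then be stored in a database such as MongoDB for later analysis, as the Aim suggests.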
B.4 Conclusion:
(Students must write the conclusion as per the attainment of individual outcome listed above
and learning/observation noted in section B.3)
Hence, exploratory data analysis was conducted on a Twitter dataset using Python.
Ensures accuracy: Clean data ensures that the insights derived from the data are accurate and
reliable. Data that is not clean can result in inaccurate or unreliable conclusions.
Improves decision-making: Clean data helps decision-makers make informed decisions based on
accurate and reliable insights.
Increases efficiency: Cleaning data before analysis saves time and resources, as it eliminates the
need to redo analysis or correct errors after analysis.
Facilitates data integration: Data cleaning helps to integrate multiple data sources that may have
different formats and standards, making it easier to combine and analyze data from different
sources.
Enhances data quality: Data cleaning improves the overall quality of data, making it more
valuable for various purposes such as research, business intelligence, and data analytics.
Overall, data cleaning is an essential step in data management that ensures data accuracy,
reliability, and consistency, which are crucial for making informed decisions and deriving
insights.
Q2. What are the steps involved in data cleaning?
Ans:- There are several steps involved in data cleaning. The specific steps may vary depending
on the nature and complexity of the data, but the following are some common steps that are
typically involved in data cleaning:
Identify data quality issues: The first step is to identify the quality issues in the data. This
involves examining the data for missing values, duplicate records, incorrect data formats, and
other anomalies.
Define data cleaning rules: Once the data quality issues are identified, data cleaning rules need to
be defined. These rules specify how to address the issues identified in step 1. For example, a rule
may be to delete duplicate records or to fill in missing values using statistical methods.
Data profiling: Data profiling involves analyzing the data to understand its structure,
relationships, and distributions. This helps to identify any patterns or anomalies that may require
attention.
Data validation: Data validation involves checking the data against external sources or standards
to ensure that it is accurate and complete. For example, data may be validated against a set of
predefined rules or against data from a different source.
Data transformation: Data transformation involves converting the data into a format that is
consistent and usable for analysis. This may involve converting data types, standardizing values,
or normalizing data.
Data enrichment: Data enrichment involves adding additional information to the data to enhance
its value. For example, adding geographical information to customer data can help analyze
customer behavior across different locations.
Data integration: Data integration involves combining data from different sources into a unified
dataset. This may involve standardizing data formats, resolving data conflicts, and matching
records across different sources.
Data documentation: Data documentation involves creating documentation that describes the
data, its sources, and any changes made during the data cleaning process. This helps to ensure
that the data is well-documented and easy to understand for future use.
Overall, data cleaning is a complex process that requires careful planning and execution to
ensure the data is clean, accurate, and usable for analysis.
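The first two steps above (identifying quality issues, then applying cleaning rules) can be sketched with a toy dataset. The records and fields here are invented for illustration; a real dataset would need domain-specific rules.

```python
# Sketch: profile a toy dataset for quality issues, then apply simple cleaning rules.
from collections import Counter

records = [
    {"name": "Ana", "age": 34},
    {"name": "Ana", "age": 34},    # duplicate record
    {"name": "Bob", "age": None},  # missing value
    {"name": "Cid", "age": 29},
]

# Step 1: identify quality issues (simple profiling).
missing = sum(1 for r in records if r["age"] is None)
counts = Counter(tuple(sorted(r.items())) for r in records)
duplicates = sum(n - 1 for n in counts.values())

# Step 2: apply cleaning rules -- drop duplicates, impute missing age with the mean.
unique = list({tuple(sorted(r.items())): r for r in records}.values())
ages = [r["age"] for r in unique if r["age"] is not None]
mean_age = sum(ages) / len(ages)
cleaned = [{**r, "age": r["age"] if r["age"] is not None else mean_age}
           for r in unique]
```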
Handling missing values: Missing values can be handled in several ways, including imputation
(filling in missing values with a reasonable estimate), deletion (removing rows with missing
values), or interpolation (estimating missing values based on the values of neighboring data
points).
Standardizing data: Standardizing data involves converting data values to a common scale or
format. For example, converting dates to a standardized date format, or converting units of
measurement to a common unit.
Correcting data format errors: Data format errors can include incorrect data types, inconsistent
date formats, or inconsistent naming conventions. Correcting data format errors ensures data
consistency and accuracy.
Handling outliers: Outliers are extreme values that are significantly different from the majority
of data points. Outliers can be handled by either removing them or transforming them to a more
appropriate value.
Handling inconsistent data: Inconsistent data refers to data that does not conform to established
standards or rules. This can include misspelled names, inconsistent addresses, or inconsistent
data formats. Handling inconsistent data involves identifying and correcting errors to ensure data
consistency.
Resolving data conflicts: Data conflicts can occur when data from different sources or systems
contain conflicting information. Resolving data conflicts involves identifying the source of the
conflict and determining the correct value based on established rules or standards.
Data profiling: Data profiling involves analyzing the data to identify patterns, distributions, and
relationships. This can help identify potential quality issues and inform data cleaning strategies.
Overall, data cleaning techniques are used to ensure that data is accurate, complete, and
consistent. The specific techniques used depend on the nature and complexity of the data and the
quality issues that need to be addressed.
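Two of the techniques listed above, imputation of missing values and outlier handling, can be demonstrated with the standard library. The data and the 1.5 × IQR threshold are illustrative choices, not fixed rules.

```python
# Sketch: impute missing values with the median, then remove outliers
# using the interquartile range (IQR) rule. Data is illustrative.
import statistics

values = [10, 12, 11, 13, None, 12, 300]  # 300 is an outlier, None is missing

# Imputation: replace missing values with the median of the observed values.
observed = [v for v in values if v is not None]
median = statistics.median(observed)
imputed = [median if v is None else v for v in values]

# Outlier handling: keep values within 1.5 * IQR of the quartiles.
q1, _, q3 = statistics.quantiles(imputed, n=4)
iqr = q3 - q1
lo, hi = q1 - 1.5 * iqr, q3 + 1.5 * iqr
inliers = [v for v in imputed if lo <= v <= hi]
```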
Q4. What are the steps involved in data pre-processing? Enlist the data pre-processing techniques.
Ans:- Data pre-processing refers to the process of preparing raw data for analysis. It involves
several steps to transform and clean the data to make it usable for further analysis. The following
are some common steps involved in data pre-processing:
Data collection: This step involves collecting raw data from various sources such as databases,
spreadsheets, or text files.
Sarvesh Patil
TUS3F202135 C42
Data cleaning: Data cleaning involves identifying and correcting errors, inconsistencies, and
inaccuracies in the data.
Data integration: Data integration involves combining data from different sources to create a
unified dataset.
Data transformation: Data transformation involves converting the data into a format that is
consistent and usable for analysis. This may involve converting data types, standardizing values,
or normalizing data.
Data reduction: Data reduction involves reducing the size of the dataset while retaining the most
important information. This may involve sampling, feature selection, or feature extraction
techniques.
Data discretization: Data discretization involves converting continuous data into discrete
categories or ranges. This can be useful for some analysis techniques.
Data normalization: Data normalization involves scaling the data to a common range or format.
This is useful for analysis techniques that are sensitive to the scale of the data.
Data aggregation: Data aggregation involves combining data at a higher level of granularity,
such as summing sales data by month or year.
Data balancing: Data balancing involves addressing imbalances in the data distribution, such as
oversampling minority classes or undersampling majority classes.
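Two of the steps above, normalization and discretization, can be sketched briefly. The bin boundaries and labels below are arbitrary assumptions chosen for the example.

```python
# Sketch: min-max normalization to [0, 1], and discretization of a
# continuous value into labelled bins. Boundaries are illustrative.

def min_max(values):
    """Scale values linearly so the minimum maps to 0 and the maximum to 1."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

def discretize(value, bins):
    """bins: list of (upper_bound, label) pairs, checked in order."""
    for upper, label in bins:
        if value <= upper:
            return label
    return "high"

ages = [18, 30, 45, 60]
scaled = min_max(ages)
band = discretize(45, [(25, "young"), (50, "middle")])
```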
Data pre-processing techniques can be categorized into two types: numerical data pre-processing
techniques and categorical data pre-processing techniques. Some common data pre-processing
techniques include:
Scaling
Standardization
Binning
Log transformation
Outlier removal
One-hot encoding
Label encoding
Binary encoding
Frequency encoding
Overall, data pre-processing is a critical step in the data analysis process, as it ensures that the
data is accurate, consistent, and usable for analysis.
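Two of the categorical encodings listed above can be sketched without any external library. The sample category values are invented; in practice a library such as scikit-learn would typically provide these encoders.

```python
# Sketch: label encoding (category -> integer) and one-hot encoding
# (category -> indicator vector). Sample values are illustrative.

def label_encode(categories):
    """Map each distinct category (sorted alphabetically) to an integer."""
    mapping = {c: i for i, c in enumerate(sorted(set(categories)))}
    return [mapping[c] for c in categories], mapping

def one_hot(categories):
    """Represent each category as a 0/1 indicator vector over all classes."""
    classes = sorted(set(categories))
    return [[1 if c == cls else 0 for cls in classes] for c in categories]

platforms = ["twitter", "facebook", "twitter", "instagram"]
labels, mapping = label_encode(platforms)
vectors = one_hot(platforms)
```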
Q5. Explain the difference between data cleaning and data pre-processing techniques.
Ans:- Data cleaning and data pre-processing are both important steps in preparing data for
analysis, but they are different in nature and scope.
Data cleaning is the process of identifying and correcting errors, inconsistencies, and
inaccuracies in the data. This may involve removing duplicate records, handling missing values,
correcting data format errors, and resolving data conflicts. Data cleaning is focused on
improving the quality of the data by ensuring that it is accurate, complete, and consistent.
Data pre-processing, on the other hand, involves a broader set of activities aimed at preparing
raw data for analysis. This may involve several steps, including data cleaning, data integration,
data transformation, data reduction, data discretization, data normalization, data aggregation, and
data balancing. The goal of data pre-processing is to transform raw data into a format that is
usable for analysis, by addressing issues such as data quality, data format, data size, and data
type.
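To make the distinction concrete, here is a toy pipeline in which cleaning is just one stage inside a broader pre-processing flow. All names and records are illustrative assumptions.

```python
# Sketch: cleaning (fixing quality) as one stage of pre-processing
# (preparing data for analysis). Records are invented for illustration.

def clean(rows):
    """Cleaning: drop rows whose text is missing or empty (a quality issue)."""
    return [r for r in rows if r.get("text")]

def transform(rows):
    """Pre-processing beyond cleaning: lowercase text, derive a length feature."""
    return [{**r, "text": r["text"].lower(), "length": len(r["text"])}
            for r in rows]

raw = [{"text": "GOOD product"}, {"text": ""}, {"text": "Bad SUPPORT"}]
prepared = transform(clean(raw))
```

Cleaning removed the empty row; the transformation step then reshaped the surviving rows into an analysis-ready form, which is the broader goal of pre-processing.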