BIG DATA ANALYTICS
Unit 1
2. Shortage of Skilled Professionals: With the exponential rise in data, there is a critical
need for data scientists and analysts skilled in big data. This shortage can lead to delays
and inefficiencies in data analysis.
3. Extracting Meaningful Insights: The vast quantity of data requires precise methods
to filter relevant insights for specific departments, which can be challenging to
implement effectively across large organizations.
4. Managing Large Data Volumes: Handling the sheer volume of data generated daily
is challenging for both data storage and retrieval systems. Efficient data management
and storage solutions are critical for smooth operations.
5. Data Quality and Storage: The storage and maintenance of high-quality data pose
difficulties, especially when merging inconsistent or incomplete data from various
sources. Solutions like data lakes help but bring their own issues, including error-prone
data integration.
6. Data Security and Privacy: The extensive data usage and storage introduce significant
security and privacy concerns, as data from disparate sources can be exposed to
unauthorized access, requiring robust security solutions.
List and explain the applications of Big Data.
Healthcare
Big data analytics helps in predicting epidemics, improving the quality of life, and
avoiding preventable diseases. Analyzing patient records, health patterns, and trends
allows for better diagnosis and personalized medicine.
Finance
Risk management, fraud detection, and personalized banking services are powered by
big data. Financial institutions analyze vast amounts of transactional data to spot
suspicious activities and trends.
Retail
Ever wonder how online stores recommend products? That’s big data at work. By
analyzing customers’ past purchases, browsing habits, and social media activity,
retailers can offer personalized shopping experiences and optimize their stock
management.
Transportation
Route optimization, traffic management, and predictive maintenance of vehicles rely
heavily on big data. Companies like Uber and Lyft use real-time data to optimize routes
and reduce waiting times for passengers.
Social Media
Platforms like Facebook and Twitter analyze user data to understand user preferences,
detect trends, and even predict user behavior, making sure that content and ads are
relevant to each individual.
Education
By analyzing student performance data, schools can personalize learning plans, identify
students at risk of falling behind, and improve the overall educational experience.
Energy Sector
Big data helps in predicting energy consumption, optimizing energy usage, and
detecting inefficiencies or anomalies in energy grids.
Manufacturing
Predictive maintenance and real-time quality assurance are big in manufacturing. Big
data can forecast when machines are likely to fail and ensure products meet quality
standards by analyzing data from the production line.
Agriculture
Farmers use data from weather forecasts, soil sensors, and crop health imagery to
maximize yield and reduce waste. Precision agriculture is all about making farming as
efficient and productive as possible.
Variety:
Big Data comes in different forms, from structured data (databases) to unstructured data
(text, images, videos) and semi-structured data (XML, JSON).
Velocity:
Data is generated at unprecedented speeds, requiring real-time or near-real-time
processing. Think about social media updates, financial transactions, etc.
Veracity:
This refers to the trustworthiness and accuracy of the data. Big Data often contains
uncertainties, biases, and noise that need to be filtered out.
Value:
The ultimate goal is to derive meaningful insights and value from the data. It's not just
about having data but making sense of it to drive decision-making.
Define Data Science. Identify the disciplines that are required for a Data Scientist.
Data science is the professional field that deals with turning data into value, such as
new insights or predictive models. It brings together expertise from fields including
statistics, mathematics, computer science, and communication, as well as domain
expertise such as business knowledge.
Statistics and Mathematics: Essential for data analysis and creating statistical models.
Computer Science: Necessary for handling large datasets, developing algorithms, and
using computational tools.
Communication: Critical for explaining technical insights to non-technical audiences.
Domain Expertise (e.g., Business Knowledge): Helps in applying insights effectively
within a specific industry or functional context.
a) Structured
● Structured data is one of the types of big data. By structured data, we mean data
that can be processed, stored, and retrieved in a fixed format.
● It refers to highly organized information that can be readily and seamlessly
stored and accessed from a database by simple search engine algorithms.
● For instance, the employee table in a company database will be structured as the
employee details, their job positions, their salaries, etc., will be present in an
organized manner.
b) Unstructured
● Unstructured data refers to data that lacks any specific form or structure
whatsoever. This makes unstructured data very difficult and time-consuming to
process and analyze. Email is a common example of unstructured data.
c) Semi-structured
● Semi-structured data is the third type of big data.
● Semi-structured data contains elements of both the formats mentioned above,
that is, structured and unstructured data.
● To be precise, it refers to data that has not been classified under a particular
repository (database), yet contains vital information or tags that separate
individual elements within the data. XML and JSON files are typical examples.
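To make the three categories concrete, here is the same hypothetical employee record in each form; the field names and values are illustrative only, not taken from any real system:

```
-- Structured: a row in a relational table with a fixed schema
INSERT INTO employee (id, name, position, salary)
VALUES (101, 'Asha Rao', 'Analyst', 55000);

-- Unstructured: free text, e.g. the body of an email
-- "Hi, this is Asha from the analytics team; my move to the
--  Analyst position was confirmed today."

-- Semi-structured: JSON, no enforced schema, but self-describing tags
-- {"id": 101, "name": "Asha Rao", "position": "Analyst", "salary": 55000}
```

The structured row can be queried directly; the email text must first be parsed or mined; the JSON record sits in between, since its tags let individual elements be located without a fixed database schema.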
Unit 2
The Hadoop ecosystem contains different sub-projects (tools) such as Sqoop, Pig, and
Hive that are used to help Hadoop modules.
● Sqoop: It is used to import and export data between HDFS and RDBMS (a sample
invocation appears after this list).
● Pig: It is a procedural language platform used to develop scripts for MapReduce
operations.
● Hive: It is a platform used to develop SQL-type scripts to do MapReduce
operations.
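As an illustration, a typical pair of Sqoop invocations might look like the following; the connection string, credentials, table names, and HDFS paths are placeholders, not from the course material:

```
# Import the "emp" table from a MySQL database into HDFS
sqoop import \
  --connect jdbc:mysql://dbhost/company \
  --username analyst -P \
  --table emp \
  --target-dir /data/emp

# Export processed results from HDFS back into a relational table
sqoop export \
  --connect jdbc:mysql://dbhost/company \
  --username analyst -P \
  --table emp_summary \
  --export-dir /data/emp_summary
```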
Note: There are various ways to execute MapReduce operations:
● The traditional approach using Java MapReduce program for structured, semi-
structured, and unstructured data.
● The scripting approach for MapReduce to process structured and semi-structured
data using Pig.
● The Hive Query Language (HiveQL or HQL) for MapReduce to process
structured data using Hive.
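To illustrate the last two approaches, the same simple task, counting employees per department in a hypothetical `emp` data set, might be written as follows; the table, field, and file names are assumptions for the example:

```
-- HiveQL: a declarative, SQL-type query compiled into MapReduce jobs
SELECT dept, COUNT(*) FROM emp GROUP BY dept;

-- Pig Latin: a step-by-step procedural dataflow script
emp     = LOAD '/data/emp.csv' USING PigStorage(',')
          AS (name:chararray, dept:chararray);
grouped = GROUP emp BY dept;
counts  = FOREACH grouped GENERATE group AS dept, COUNT(emp) AS n;
DUMP counts;
```

The Hive version states what result is wanted; the Pig version spells out the steps to get there. Both are translated into the same kind of MapReduce jobs described below.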
MapReduce
● MapReduce is a parallel programming model for writing distributed applications,
devised at Google for efficient processing of large amounts of data (multi-terabyte
data-sets) on large clusters (thousands of nodes) of commodity hardware in a
reliable, fault-tolerant manner.
● The MapReduce program runs on Hadoop which is an Apache open-source
framework.
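To give a concrete feel for the model, below is a minimal sketch of the classic word-count example using the Hadoop Java MapReduce API; the class and variable names are our own, and the job-submission driver is omitted for brevity:

```java
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCount {
  // Map phase: for each line of input, emit a (word, 1) pair per token.
  public static class TokenizerMapper
      extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable offset, Text line, Context context)
        throws IOException, InterruptedException {
      StringTokenizer tokens = new StringTokenizer(line.toString());
      while (tokens.hasMoreTokens()) {
        word.set(tokens.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // Reduce phase: sum the 1s gathered for each word and emit (word, total).
  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text word, Iterable<IntWritable> counts, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable c : counts) {
        sum += c.get();
      }
      context.write(word, new IntWritable(sum));
    }
  }
}
```

The framework itself splits the input across mappers, shuffles the map output by key to the reducers, and re-runs failed tasks, which is what makes the model reliable and fault-tolerant on commodity hardware.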
Advantages of Hadoop
● Hadoop framework allows the user to quickly write and test distributed systems.
It is efficient, and it automatically distributes the data and work across the
machines, in turn utilizing the underlying parallelism of the CPU cores.
● Hadoop does not rely on hardware to provide fault tolerance and high availability
(FTHA); rather, the Hadoop library itself has been designed to detect and handle
failures at the application layer.
● Servers can be added or removed from the cluster dynamically and Hadoop
continues to operate without interruption.
● Another big advantage of Hadoop is that, apart from being open source, it is
compatible with all platforms since it is Java-based.
OR
List and explain the different types of NoSQL Databases.
Explain the following HDFS commands with examples: i) put ii) expunge iii)
chmod iv) copyToLocal v) mkdir
1. put
- Description: Uploads a file from the local file system to HDFS.
- Example: `hdfs dfs -put myfile.txt /data/`
- Explanation: This command will upload `myfile.txt` from the local system to the
HDFS directory `/data/`.
2. expunge
- Description: Removes files from the HDFS trash permanently.
- Example: `hdfs dfs -expunge`
- Explanation: Empties the trash in HDFS, permanently deleting all files stored
there.
3. chmod
- Description: Changes permissions of a file or directory in HDFS.
- Example: `hdfs dfs -chmod 777 /data/myfile.txt`
- Explanation: This sets `myfile.txt` permissions to `777`, allowing read, write, and
execute access to everyone.
4. copyToLocal
- Description: Copies a file from HDFS to the local file system.
- Example: `hdfs dfs -copyToLocal /data/myfile.txt ./`
- Explanation: Copies `myfile.txt` from the HDFS `/data/` directory to the current
local directory.
5. mkdir
- Description: Creates a new directory in HDFS.
- Example: `hdfs dfs -mkdir /data/newdir`
- Explanation: This command creates a new directory `newdir` in the HDFS path
`/data/`.
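A short session tying the five commands together (file and directory names are hypothetical; `-rm` is included to show where `expunge` fits in):

```
hdfs dfs -mkdir /data/newdir                        # create an HDFS directory
hdfs dfs -put myfile.txt /data/newdir/              # upload a local file into it
hdfs dfs -chmod 644 /data/newdir/myfile.txt         # owner read/write, others read-only
hdfs dfs -copyToLocal /data/newdir/myfile.txt ./    # fetch a copy back locally
hdfs dfs -rm /data/newdir/myfile.txt                # delete (goes to trash if enabled)
hdfs dfs -expunge                                   # permanently empty the trash
```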
What are NoSQL databases? Distinguish between NoSQL and relational databases.
Refer to sir's notes, pages 15-16.