BIG DATA ANALYTICS



Unit 1

Describe the various challenges of big data.


Sir notes page 8-9
Big data presents numerous challenges due to the scale, complexity, and variety of the
data involved. Here are several primary challenges identified:

1. Synchronization Across Disparate (i.e., different/various) Data Sources: Integrating
diverse data from multiple sources into one analytical platform is difficult, as
discrepancies can lead to inaccurate insights.

2. Shortage of Skilled Professionals: With the exponential rise in data, there is a critical
need for data scientists and analysts skilled in big data. This shortage can lead to delays
and inefficiencies in data analysis.

3. Extracting Meaningful Insights: The vast quantity of data requires precise methods
to filter relevant insights for specific departments, which can be challenging to
implement effectively across large organizations.

4. Managing Large Data Volumes: Handling the sheer volume of data generated daily
is challenging for both data storage and retrieval systems. Efficient data management
and storage solutions are critical for smooth operations.

5. Data Management Landscape Uncertainty: The rapid evolution of data management
technologies means companies often struggle to identify the best tools for their needs
without introducing new complexities or risks.

6. Data Quality and Storage: The storage and maintenance of high-quality data pose
difficulties, especially when merging inconsistent or incomplete data from various
sources. Solutions like data lakes help but bring their own issues, including error-prone
data integration.

7. Data Security and Privacy: The extensive data usage and storage introduce significant
security and privacy concerns, as data from disparate sources can be exposed to
unauthorized access, requiring robust security solutions.

List and explain the applications of Big Data.
Healthcare
Big data analytics helps in predicting epidemics, improving the quality of life, and
avoiding preventable diseases. Analyzing patient records, health patterns, and trends
allows for better diagnosis and personalized medicine.

Finance
Risk management, fraud detection, and personalized banking services are powered by
big data. Financial institutions analyze vast amounts of transactional data to spot
suspicious activities and trends.

Retail
Ever wonder how online stores recommend products? That’s big data at work. By
analyzing customers’ past purchases, browsing habits, and social media activity,
retailers can offer personalized shopping experiences and optimize their stock
management.

Transportation
Route optimization, traffic management, and predictive maintenance of vehicles rely
heavily on big data. Companies like Uber and Lyft use real-time data to optimize routes
and reduce waiting times for passengers.

Social Media
Platforms like Facebook and Twitter analyze user data to understand user preferences,
detect trends, and even predict user behavior, making sure that content and ads are
relevant to each individual.

Education
By analyzing student performance data, schools can personalize learning plans, identify
students at risk of falling behind, and improve the overall educational experience.

Energy Sector
Big data helps in predicting energy consumption, optimizing energy usage, and
detecting inefficiencies or anomalies in energy grids.

Manufacturing
Predictive maintenance and real-time quality assurance are big in manufacturing. Big
data can forecast when machines are likely to fail and ensure products meet quality
standards by analyzing data from the production line.

Agriculture
Farmers use data from weather forecasts, soil sensors, and crop health imagery to
maximize yield and reduce waste. Precision agriculture is all about making farming as
efficient and productive as possible.

Define Big Data. Discuss the characteristics of Big Data


Big Data refers to complex and large data sets that have to be processed and analyzed
to uncover valuable information that can benefit businesses and organizations.

Characteristics of Big Data (The 5 Vs)


Volume:
The sheer scale of data generated is enormous. We're talking petabytes, exabytes, and
beyond.

Variety:
Big Data comes in different forms, from structured data (databases) to unstructured data
(text, images, videos) and semi-structured data (XML, JSON).

Velocity:
Data is generated at unprecedented speeds, requiring real-time or near-real-time
processing. Think about social media updates, financial transactions, etc.

Veracity:
This refers to the trustworthiness and accuracy of the data. Big Data often contains
uncertainties, biases, and noise that need to be filtered out.

Value:
The ultimate goal is to derive meaningful insights and value from the data. It's not just
about having data but making sense of it to drive decision-making.

Distinguish between traditional business intelligence (BI) and Big Data.


Traditional BI methodology is based on the principle of grouping all business data into
a central server. Typically, this data is analyzed in offline mode, after storing the
information in an environment called a Data Warehouse. The data is structured in a
conventional relational database with an additional set of indexes and forms of access
to the tables (multidimensional cubes). A Big Data solution differs from BI in many
aspects.
These are the main differences between Big Data and Business Intelligence:
1. In a Big Data environment, information is stored on a distributed file system, rather
than on a central server. It is a much safer and more flexible space.
2. Big Data solutions carry the processing functions to the data, rather than the data to
the functions. As the analysis is centered on the information, it's easier to handle larger
amounts of information in a more agile way.
3. Big Data can analyze data in different formats, both structured and unstructured. The
volume of unstructured data (those not stored in a traditional database) is growing at
levels much higher than the structured data. Nevertheless, its analysis carries different
challenges. Big Data solutions solve them by allowing a global analysis of various
sources of information.
4. Data processed by Big Data solutions can be historical or come from real-time
sources. Thus, companies can make decisions that affect their business in an agile and
efficient way.
5. Big Data technology uses massively parallel processing (MPP) concepts, which
improve the speed of analysis. With MPP, many instructions are executed
simultaneously: the various jobs are divided into several parallel execution parts, and
at the end the partial results are reunited and presented. This allows large volumes of
information to be analyzed quickly.
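The MPP idea described above can be illustrated with a small sketch: divide the job into parts, run the parts in parallel, then reunite the partial results. This is a toy illustration using a Python thread pool; the function names are illustrative, and a real MPP system would run each part on a separate machine rather than a thread.

```python
# Toy sketch of the MPP idea: divide the job into parallel parts,
# execute the parts simultaneously, then reunite the partial results.
# A thread pool stands in here for the nodes of a real MPP system,
# which would run each part on its own machine.
from concurrent.futures import ThreadPoolExecutor

def partial_sum(chunk):
    # Each worker processes one slice of the data independently.
    return sum(chunk)

def parallel_total(data, workers=4):
    # Split the data into roughly equal parts, one per worker.
    size = max(1, len(data) // workers)
    chunks = [data[i:i + size] for i in range(0, len(data), size)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = pool.map(partial_sum, chunks)  # parts run in parallel
    return sum(partials)  # partial results are reunited and presented

print(parallel_total(list(range(1000))))  # → 499500
```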

Define Data Science. Identify the disciplines that are required for Data Scientist.
Data science is the professional field that deals with turning data into value such as new
insights or predictive models. It brings together expertise from fields including
statistics, mathematics, computer science, communication as well as domain expertise
such as business knowledge.

Statistics and Mathematics: Essential for data analysis and creating statistical models.
Computer Science: Necessary for handling large datasets, developing algorithms, and
using computational tools.
Communication: Critical for explaining technical insights to non-technical audiences.
Domain Expertise (e.g., Business Knowledge): Helps in applying insights effectively
within a specific industry or functional context

Describe structured, semi-structured, and unstructured data (types of data).

a) Structured
● Structured data is one of the types of big data; it means data that can be
processed, stored, and retrieved in a fixed format.
● It refers to highly organized information that can be readily and seamlessly
stored in and accessed from a database by simple search engine algorithms.
● For instance, the employee table in a company database will be structured, as the
employee details, their job positions, their salaries, etc., will be present in an
organized manner.
b) Unstructured
● Unstructured data refers to the data that lacks any specific form or structure
whatsoever. This makes it very difficult and time-consuming to process and
analyze unstructured data. Email is an example of unstructured data. Structured
and unstructured are two important types of big data.
c) Semi-structured
● Semi structured is the third type of big data.
● Semi-structured data pertains to the data containing both the formats mentioned
above, that is, structured and unstructured data.
● To be precise, it refers to data that, although not classified under a particular
repository (database), still contains vital information or tags that segregate
individual elements within the data.
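A short sketch can make the three types concrete: the same employee record as a structured row (fixed columns), as semi-structured JSON (tagged but flexible fields), and as unstructured free text. The field names and values here are purely illustrative.

```python
import csv
import io
import json

# Structured: fixed columns, ready for a relational table.
structured = io.StringIO("name,position,salary\nAsha,Analyst,50000\n")
rows = list(csv.DictReader(structured))

# Semi-structured: no fixed schema, but tags segregate the elements.
semi = json.loads('{"name": "Asha", "position": "Analyst", "skills": ["SQL", "Pig"]}')

# Unstructured: free text (e.g. an email body); no tags to query directly.
unstructured = "Hi, this is Asha. I have joined as an Analyst this week."

print(rows[0]["salary"])          # structured fields are directly addressable
print(semi["skills"])             # semi-structured: tagged, schema flexible
print("Analyst" in unstructured)  # unstructured: only text search is possible
```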

UNIT 2

Give an overview of the Hadoop Ecosystem with a neat diagram.

The Hadoop ecosystem contains different sub-projects (tools) such as Sqoop, Pig, and
Hive that are used to help Hadoop modules.
● Sqoop: It is used to import and export data between HDFS and RDBMS.
● Pig: It is a procedural language platform used to develop a script for MapReduce
operations.
● Hive: It is a platform used to develop SQL type scripts to do MapReduce
operations.
Note: There are various ways to execute MapReduce operations:
● The traditional approach using Java MapReduce program for structured, semi-
structured, and unstructured data.
● The scripting approach for MapReduce to process structured and semi-structured
data using Pig.
● The Hive Query Language (HiveQL or HQL) for MapReduce to process
structured data using Hive.

With a neat diagram, explain the HDFS Architecture.


Hadoop has two major layers namely −
Processing/Computation layer (MapReduce), and
Storage layer (Hadoop Distributed File System).

MapReduce
● MapReduce is a parallel programming model for writing distributed applications
devised at Google for efficient processing of large amounts of data (multi-
terabyte data-sets), on large clusters (thousands of nodes) of commodity
hardware in a reliable, fault-tolerant manner.
● The MapReduce program runs on Hadoop which is an Apache open-source
framework.
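The map and reduce phases described above can be sketched in plain Python (a toy simulation of the word-count example, not the actual Hadoop API): the mapper emits (word, 1) pairs, the pairs are grouped by key in a shuffle step, and the reducer sums the counts for each word.

```python
from collections import defaultdict

def map_phase(line):
    # Mapper: emit a (word, 1) pair for every word in the input line.
    return [(word.lower(), 1) for word in line.split()]

def reduce_phase(word, counts):
    # Reducer: sum all counts emitted for one word.
    return word, sum(counts)

def word_count(lines):
    # Shuffle/sort step: group the mapper output by key (the word).
    grouped = defaultdict(list)
    for line in lines:
        for word, one in map_phase(line):
            grouped[word].append(one)
    # Reduce step: one call per distinct word.
    return dict(reduce_phase(w, c) for w, c in grouped.items())

print(word_count(["big data big clusters", "data everywhere"]))
# → {'big': 2, 'data': 2, 'clusters': 1, 'everywhere': 1}
```

In real Hadoop, the mapper and reducer run on many nodes at once and the framework performs the shuffle across the cluster; this sketch only shows the data flow on one machine.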

Hadoop Distributed File System


● The Hadoop Distributed File System (HDFS) is based on the Google File System
(GFS) and provides a distributed file system that is designed to run on
commodity hardware.
● It has many similarities with existing distributed file systems. However, the
differences from other distributed file systems are significant.
● It is highly fault-tolerant and is designed to be deployed on low-cost hardware.
It provides high throughput access to application data and is suitable for
applications having large datasets.
● Apart from the above-mentioned two core components, Hadoop framework also
includes the following two modules −
Hadoop Common − These are Java libraries and utilities required by other
Hadoop modules.
Hadoop YARN − This is a framework for job scheduling and cluster resource
management.
What is Hadoop? Identify the key features of Hadoop. Explain the advantages of
Hadoop.

Hadoop is an open-source framework by Apache, written in Java, that enables large
datasets to be processed across multiple computers in a cluster. It provides distributed
storage and computation, allowing it to scale easily from a single server to thousands of
machines, each with its own storage and processing power.

The key features of Hadoop often include:

1. Distributed Storage and Processing: Hadoop distributes large datasets across
clusters, allowing for efficient storage and parallel processing.
2. Fault Tolerance: Hadoop’s architecture is designed to handle hardware failures,
ensuring data integrity and continuous processing.
3. Scalability: Easily scales by adding more nodes to the cluster without needing to
modify the existing framework.
4. Data Locality: Hadoop moves processing to where the data is stored, minimizing
network congestion and improving speed.
5. Support for Diverse Data Types: Handles structured, semi-structured, and
unstructured data.

Advantages of Hadoop
● The Hadoop framework allows the user to quickly write and test distributed systems.
It is efficient, and it automatically distributes the data and work across the machines
and, in turn, utilizes the underlying parallelism of the CPU cores.
● Hadoop does not rely on hardware to provide fault tolerance and high
availability (FTHA); rather, the Hadoop library itself has been designed to detect and
handle failures at the application layer.
● Servers can be added or removed from the cluster dynamically and Hadoop
continues to operate without interruption.
● Another big advantage of Hadoop is that, apart from being open source, it is
compatible with all platforms since it is Java-based.

OR
List and explain the different types of NoSQL Databases.
Explain the following HDFS commands with examples: i) put ii) expunge iii)
chmod iv) copyToLocal v) mkdir

1. put
- Description: Uploads a file from the local file system to HDFS.
- Example: `hdfs dfs -put myfile.txt /data/`
- Explanation: This command will upload `myfile.txt` from the local system to the
HDFS directory `/data/`.

2. expunge
- Description: Removes files from the HDFS trash permanently.
- Example: `hdfs dfs -expunge`
- Explanation: Empties the trash in HDFS, permanently deleting all files stored
there.

3. chmod
- Description: Changes permissions of a file or directory in HDFS.
- Example: `hdfs dfs -chmod 777 /data/myfile.txt`
- Explanation: This sets `myfile.txt` permissions to `777`, allowing read, write, and
execute access to everyone.

4. copyToLocal
- Description: Copies a file from HDFS to the local file system.
- Example: `hdfs dfs -copyToLocal /data/myfile.txt ./`
- Explanation: Copies `myfile.txt` from the HDFS `/data/` directory to the current
local directory.
5. mkdir
- Description: Creates a new directory in HDFS.
- Example: `hdfs dfs -mkdir /data/newdir`
- Explanation: This command creates a new directory `newdir` in the HDFS path
`/data/`.

What are NoSQL databases? Distinguish between the NoSQL and relational
databases.

Sir notes
Page 15-16
