
SCHEME OF EVALUATION-TEST 1

BIG DATA ANALYTICS (18CS72)-30M

PART-A

1a) Big Data is high-volume, high-velocity and high-variety information assets that require
new forms of processing for enhanced decision making, insight discovery and process
optimization. 8M
Characteristics:
• Volume: Size defines the amount or quantity of data which is generated from an
application. The size determines the processing considerations needed for handling that
data.
• Velocity: Velocity is a measure of how fast the data generates and processes.
• Variety: Big Data comprises a variety of data. Data is generated from multiple sources
in a system. This introduces variety in the data and therefore introduces complexity. Data
consists of various forms and formats.
• Veracity: It is an important characteristic that takes into account the quality of the data
captured, which can vary greatly, affecting its accurate analysis.
Big Data Types:
• Social networks and web data, such as Facebook and Twitter.
• Transactions data and Business Processes data, such as credit card transactions and flight
bookings.
• Customer master data, such as data for facial recognition and for the name, date of birth,
gender, location and income category.
• Machine-generated data, such as machine-to-machine or Internet of Things data, and data
from sensors, trackers, web logs and computer system logs.
• Human-generated data, such as biometrics data, human-machine interaction data, email
records with a mail server and a MySQL database of student grades.

1b)
Big Data analytics in health care uses the following data sources: 7M
1. Clinical records
2. Pharmacy records
3. Electronic medical records
4. Diagnosis logs and notes
5. Additional data, such as deviations from a person's usual activities, medical leaves from a job
and social interactions.
Health care using Big Data can facilitate the following:
1. Provisioning of value-based and customer-centric healthcare
2. Utilizing the Internet of Things for health care
3. Preventing fraud, waste and abuse in the healthcare industry and reducing healthcare costs
4. Improving outcomes
5. Monitoring patients in real time.

• Value-based and customer-centric healthcare means cost-effective patient care by
improving healthcare quality using the latest knowledge, usage of electronic health and
medical records, and improving coordination among the healthcare-providing agencies,
which reduces avoidable overuse and healthcare costs.
• The Healthcare Internet of Things creates unstructured data.
• The data enables monitoring of device data for patient parameters such as
BP and ECGs.
• Prevention of fraud, waste and abuse uses Big Data predictive analytics and helps resolve
excessive or duplicate claims in a systematic manner.
• Patient real time monitoring uses machine learning algorithms which process real-time
events.

2a) Data processing architecture consists of five layers: (i) identification of data sources, (ii)
acquisition, ingestion, extraction, pre-processing and transformation of data, (iii) data storage at files,
servers, cluster or cloud, (iv) data processing, and (v) data consumption by a number of programs
and tools.
L1 considers the following aspects in a design:
• Amount of data needed at ingestion layer 2 (L2)
• Push from L1 or pull by L2 as per the mechanism used
• Source data-types: database, files, web or service
• Source formats, i.e., semi-structured, unstructured or structured.

L2 considers the following aspects:

• Ingestion and ETL processes, either in real time, which means store and use the data as generated,
or in batches.
• Batch processing uses discrete datasets at scheduled or periodic intervals of time.

L3 considers the following aspects:

• Data storage type (historical or incremental), format, compression, incoming data
frequency, querying patterns and consumption requirements for L4 or L5
• Data storage using Hadoop distributed file system or NoSQL data stores - HBase,
Cassandra, MongoDB.
L4 considers the following aspects:

• Data-processing software such as MapReduce, Hive, Pig, Spark, Mahout and Spark
Streaming
• Processing in scheduled batches, real time or hybrid, as per synchronous or
asynchronous processing requirements at L5.

L5 considers the consumption of data for the following:


• Data integration
• Dataset usage for reporting and visualization
• Analytics (real time, near real time or scheduled) and discovery
• Export of datasets to cloud, web or other systems
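The layer-to-layer flow can be pictured with a small sketch. The following Java skeleton is only illustrative and not part of the scheme; all class and method names are hypothetical, and an in-memory dataset stands in for real sources, storage and tools at L1-L5.

// Hypothetical sketch of the L1-L5 flow; names are illustrative only.
import java.util.List;
import java.util.stream.Collectors;

public class LayeredPipelineSketch {
    // L1: identify a data source (an in-memory list stands in for files, web or a database).
    static List<String> identifySource() {
        return List.of("rec1,10", "rec2,20", "rec3,30");
    }

    // L2: ingest and pre-process/transform the data (batch-style, over the whole dataset).
    static List<Integer> ingestAndTransform(List<String> raw) {
        return raw.stream()
                  .map(record -> Integer.parseInt(record.split(",")[1]))
                  .collect(Collectors.toList());
    }

    // L3 and L4: storage is skipped here; processing aggregates the transformed values.
    static int process(List<Integer> values) {
        return values.stream().mapToInt(Integer::intValue).sum();
    }

    // L5: consume the result, e.g. for reporting or visualization.
    public static void main(String[] args) {
        int total = process(ingestAndTransform(identifySource()));
        System.out.println("Total = " + total);
    }
}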
2b) Data are important for most aspects of marketing, sales and advertising.
• Customer Value (CV) depends on three factors - quality, service and price.
• Big Data analytics deploys large volumes of data to identify and derive intelligence using
predictive models about individuals. The facts enable marketing companies to decide
what products to sell.

A definition of marketing is the creation, communication and delivery of value to customers.


Customer (desired) value means what a customer desires from a product. Customer (perceived)
value means what the customer believes to have received from a product after purchase of the
product.
Customer value analytics (CVA) means analyzing what a customer really needs. CVA makes it
possible for leading marketers, such as Amazon, to deliver consistent customer experiences.

Following are the five application areas in order of the popularity of Big Data use cases:

1. CVA using the inputs of evaluated purchase patterns, preferences, quality, price and post
sales servicing requirements
2. Operational analytics for optimizing company operations
3. Detection of fraud and compliance
4. New products and innovations in service
5. Enterprise data warehouse optimization.

Big Data is providing marketing insights into:
(i) the most effective content at each stage of a sales cycle,
(ii) investment in improving the customer relationship management (CRM),
(iii) addition to strategies for increasing customer lifetime value (CLTV),
(iv) lowering of customer acquisition cost (CAC). Cloud services use Big Data analytics for
CAC, CLTV and other metrics, the essentials in any cloud-based business.
Big Data revolutionizes a number of areas of marketing and sales.
PART-B
3a)
Hadoop features are as follows: 7M
1. Fault-efficient, scalable, flexible and modular design which uses a simple and
modular programming model. The system provides servers at high
scalability. The system is scalable by adding new nodes to handle larger
data.
2. Robust design of HDFS: Execution of Big Data applications continues even
when an individual server or cluster fails. This is because of Hadoop
provisions for backup (due to replication of each data block at least
three times) and a data recovery mechanism. HDFS thus has high
reliability.
3. Store and process Big Data: Processes Big Data of 3V characteristics.
4. Distributed clusters computing model with data locality: Processes Big Data at
high speed as the application tasks and subtasks submit to the
DataNodes. One can achieve more computing power by increasing the
number of computing nodes.
5. Hardware fault-tolerant: A fault does not affect data and application
processing. If a node goes down, the other nodes take care of the residue.
This is due to multiple copies of all data blocks which replicate
automatically. Default is three copies of data blocks.
6. Open-source framework: Open source access and cloud services enable
large data store. Hadoop uses a cluster of multiple inexpensive servers or
the cloud.
7. Java and Linux based: Hadoop uses Java interfaces. The Hadoop base is Linux,
but it has its own set of shell-command support.
8. Hadoop provides various components and interfaces for distributed file
system and general input/output. This includes serialization, Java RPC
(Remote Procedure Call) and file-based data structures in Java.
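As an illustration of point 8 above, the Hadoop Java interfaces to the distributed file system can be exercised with a short program. The sketch below is not part of the scheme; the NameNode URI hdfs://namenode:9000 and the file path are assumed values, and it simply writes one small file with the default replication of three.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWriteSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://namenode:9000"); // assumed NameNode address
        conf.set("dfs.replication", "3");                 // three copies per data block (feature 5)
        try (FileSystem fs = FileSystem.get(conf)) {
            Path path = new Path("/user/test/sample.txt"); // assumed path
            try (FSDataOutputStream out = fs.create(path)) {
                out.writeUTF("Hello HDFS");               // blocks of this file replicate automatically
            }
            System.out.println("Replication of " + path + " = "
                    + fs.getFileStatus(path).getReplication());
        }
    }
}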

3b) i) The format in which HBase stores the rows line by line is: 4M

Row-Key Column-Family: {Column-Specifier: Version: Value}

(Version can be any timestamp from the server.)
The first row stores in HBase as follows:
ACVM_id: '2206' {'DT': 1600080000024: '121217', 'HR': 1600008007319: '16', 'KKHS':
1600081010821: '28', 'MHS': 1600082010582: '23', 'FNHS': 1600082018001: '38', 'NHS':
1600080158868: '8', 'OHS': 1600038028229: '50'}
ii) The records are put into rows and columns as follows: 4M
hbase(main):001:0> put 'ACVM_id', '2206', 'DT', '121217'
0 row(s) in 0.2112 seconds
hbase(main):002:0> put 'ACVM_id', '2206', 'HR', '16'
0 row(s) in 0.0112 seconds
hbase(main):003:0> put 'ACVM_id', '2206', 'HourlySales:KKHS', '28'
0 row(s) in 0.2112 seconds
hbase(main):004:0> put 'ACVM_id', '2206', 'HourlySales:MHS', '23'
0 row(s) in 0.0112 seconds
hbase(main):005:0> put 'ACVM_id', '2206', 'HourlySales:FNHS', '38'
0 row(s) in 0.2112 seconds
hbase(main):006:0> put 'ACVM_id', '2206', 'HourlySales:NHS', '8'
0 row(s) in 0.0112 seconds
hbase(main):007:0> put 'ACVM_id', '2206', 'HourlySales:OHS', '50'
0 row(s) in 0.0112 seconds
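The same cells can also be written programmatically through the HBase Java client API instead of the shell. This is a minimal sketch, assuming the table 'ACVM_id' with the column family 'HourlySales' already exists as in the answer above; the configuration is read from hbase-site.xml on the classpath.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class AcvmHourlySalesPut {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create(); // loads hbase-site.xml from the classpath
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table table = conn.getTable(TableName.valueOf("ACVM_id"))) {
            // Row key '2206'; a single Put can carry several cells of the same row.
            Put put = new Put(Bytes.toBytes("2206"));
            put.addColumn(Bytes.toBytes("HourlySales"), Bytes.toBytes("KKHS"), Bytes.toBytes("28"));
            put.addColumn(Bytes.toBytes("HourlySales"), Bytes.toBytes("MHS"),  Bytes.toBytes("23"));
            put.addColumn(Bytes.toBytes("HourlySales"), Bytes.toBytes("FNHS"), Bytes.toBytes("38"));
            put.addColumn(Bytes.toBytes("HourlySales"), Bytes.toBytes("NHS"),  Bytes.toBytes("8"));
            put.addColumn(Bytes.toBytes("HourlySales"), Bytes.toBytes("OHS"),  Bytes.toBytes("50"));
            table.put(put);                               // writes all cells in one call
        }
    }
}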

4a) 7M

• Hadoop ecosystem refers to a combination of technologies. The Hadoop ecosystem consists of
its own family of applications which tie up together with Hadoop. The system components
support the storage, processing, access, analysis, governance, security and operations for
Big Data.
• The system enables the applications which run Big Data and deploy HDFS. The data store
system consists of clusters, racks, DataNodes and blocks.
• Hadoop deploys application programming models, such as MapReduce and HBase. YARN
manages resources and schedules sub-tasks of the application.
• HBase is a columnar database and supports OLAP. Figure 2.2 shows Hadoop core
components HDFS, MapReduce and YARN along with the ecosystem.
Figure 2.2 also shows Hadoop ecosystem. The system includes the application support layer and
application layer components- AVRO, Zookeeper, Pig, Hive, Sqoop, Ambari, Chukwa, Mahout,
Spark, Flink and Flume. The figure also shows the components and their usages.
The four layers in Figure 2.2 are as follows:
(i) Distributed storage layer
(ii) Resource-manager layer for job or application sub-tasks scheduling and execution
(iii) Processing-framework layer, consisting of Mapper and Reducer for the MapReduce process-
flow
(iv) APIs at the application support layer (applications such as Hive and Pig). The codes communicate
and run using MapReduce or YARN at the processing-framework layer. Reducer outputs communicate
to the APIs (Figure 2.2).
AVRO enables data serialization between the layers. Zookeeper enables coordination among layer
components.
The holistic view of Hadoop architecture provides an idea of implementation of Hadoop
components of the ecosystem. Client hosts run applications using Hadoop ecosystem projects, such
as Pig, Hive and Mahout.
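To make the Mapper and Reducer process-flow of the processing-framework layer concrete, the classic word-count job serves as a minimal sketch. It is illustrative only and not part of the scheme; the input and output HDFS paths are assumed to be passed as command-line arguments.

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {
    // Mapper: emits (word, 1) for every token of an input line.
    public static class TokenMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();
        @Override
        protected void map(LongWritable key, Text value, Context ctx)
                throws IOException, InterruptedException {
            for (String token : value.toString().split("\\s+")) {
                if (!token.isEmpty()) { word.set(token); ctx.write(word, ONE); }
            }
        }
    }

    // Reducer: sums the counts received for each word.
    public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context ctx)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) sum += v.get();
            ctx.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenMapper.class);
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // assumed input path argument
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // assumed output path argument
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}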

4b) i) The format in which HBase stores the rows line by line is: 4M
Row-Key Column-Family: {Column-Specifier: Version: Value}
The first row stores in HBase as follows:
CCSR_id: '2206' {'DT': 1600080000024: '121217', 'JLRWS': 1600081010821: '28', 'HWS':
1600082010582: '23', 'ZWS': 1600082018001: '38', 'NWS': 1600080158868: '8', 'SSWS':
1600038028229: '50'}

ii) The records are put into rows and columns as follows: 4M

hbase(main):001:0> put 'CCSR_id', '220', 'DT', '121217'
0 row(s) in 0.2112 seconds
hbase(main):002:0> put 'CCSR_id', '220', 'WeeklySales:JLRWS', '28'
0 row(s) in 0.2112 seconds
hbase(main):003:0> put 'CCSR_id', '220', 'WeeklySales:HWS', '23'
0 row(s) in 0.0112 seconds
hbase(main):004:0> put 'CCSR_id', '220', 'WeeklySales:ZWS', '38'
0 row(s) in 0.2112 seconds
hbase(main):005:0> put 'CCSR_id', '220', 'WeeklySales:NWS', '8'
0 row(s) in 0.0112 seconds
hbase(main):006:0> put 'CCSR_id', '220', 'WeeklySales:SSWS', '50'
0 row(s) in 0.0112 seconds
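To complement the shell commands, the stored row can be read back with the HBase Java client API. This is a minimal sketch; the table name 'CCSR_id', row key '220' and column family 'WeeklySales' are taken from the answer above, and everything else is an assumption.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class ReadWeeklySales {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();  // loads hbase-site.xml from the classpath
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table table = conn.getTable(TableName.valueOf("CCSR_id"))) {
            // Fetch the row keyed '220' and read one weekly-sales cell back.
            Get get = new Get(Bytes.toBytes("220"));
            Result result = table.get(get);
            byte[] jlrws = result.getValue(Bytes.toBytes("WeeklySales"), Bytes.toBytes("JLRWS"));
            System.out.println("JLRWS = " + Bytes.toString(jlrws));
        }
    }
}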
