Information Management Unit 1
Department of IT
UNIT-1 NOTES
2016-2017
Some authors introduce terms and concepts peculiar to the relational data model.
The relational data model is the basis for most database management systems in use today.
In today's database environment, the database may be implemented with object-oriented
technology or with a mixture of object-oriented and relational technology.
Many systems developers believe that data modeling is the most important part of the systems
development process, for the following reasons:
1. The characteristics of data captured during data modeling are crucial in the design of
databases, programs, and other system components.
The facts and rules captured during the process of data modeling are essential in assuring data
integrity in an information system.
2. Data rather than processes are the most complex aspect of many modern information systems
and hence require a central role in structuring system requirements.
Often the goal is to provide a rich data resource that might support any type of information
inquiry (investigation), analysis, and summary.
3. Data tend to be more stable than the business processes that use that data.
Thus, an information system design that is based on a data orientation should have a longer
useful life than one based on a process orientation.
Business rules are important in data modeling because they govern how data are handled
and stored.
THE ER-MODEL: AN OVERVIEW
An entity-relationship model (E-R model) is a detailed, logical representation of the data for an
organization or for a business area.
The E-R model is expressed in terms of:
entities in the business environment
the relationships (or associations) among those entities
the attributes (or properties) of both the entities and their relationships.
An E-R model is normally expressed as an entity-relationship diagram (E-R diagram, or
ERD), which is a graphical representation of an E-R model.
SAMPLE ER MODEL
o A simplified E-R diagram for a small furniture manufacturing company, Pine Valley Furniture Company.
o A number of suppliers supply and ship different items to Pine Valley Furniture.
o The items are assembled into products that are sold to customers who order the products.
o Each customer order may include one or more lines corresponding to the products appearing on that order.
o The diagram in the above Figure shows the entities and relationships for this company.
o Attributes are omitted to simplify the diagram for now.
o Entities (the objects of the organization) are represented by the rectangle symbol.
o Relationships between entities are represented by lines connecting the related entities.
ORDER: The transaction associated with the sale of one or more products to a customer and identified by a transaction number from sales or accounting.
ITEM: A type of component that goes into making one or more products and can be supplied by one or more suppliers.
SUPPLIER: Another company that may provide items to Pine Valley Furniture.
SHIPMENT: The transaction associated with items received in the same package by Pine Valley Furniture from a supplier.
It is important to clearly define each entity as metadata.
Example
It is important to know that the CUSTOMER entity includes persons or organizations
that have not yet purchased products from Pine Valley Furniture.
It is common for different departments (e.g., Accounting, Marketing) in an organization to
have different meanings for the same term.
A weak entity type has no business meaning in an E-R diagram without the entity on
which it depends. The entity type on which the weak entity type depends is called the
identifying owner (or simply owner for short).
EMPLOYEE is a strong entity type with identifier Employee ID.
DEPENDENT is a weak entity type, as indicated by the double-lined rectangle.
The relationship between a weak entity type and its owner is called an identifying
relationship.
The attribute Dependent Name serves as a partial identifier.
There are a few special guidelines for naming entity types, which follow:
An entity type name is a singular noun (such as CUSTOMER, STUDENT, or AUTOMOBILE).
An entity type name should be specific to the organization.
An entity type name should be concise, using as few words as possible.
An abbreviation, or a short name, should be specified for each entity type name.
The name used for the same entity type should be the same on all E-R diagrams on
which the entity type appears.
Attributes
Each entity type has a set of attributes associated with it.
An attribute is a property or characteristic of an entity type that is of interest to the
organization.
An attribute has a noun name.
Following are some typical entity types and their associated attributes:
IDENTIFIER ATTRIBUTE
An identifier is an attribute (or combination of attributes) whose value distinguishes
individual instances of an entity type.
No two instances of the entity type may have the same value for the identifier attribute.
NAMING AND DEFINING ATTRIBUTES
There are a few special guidelines for naming attributes, which follow:
An attribute name is a singular noun or noun phrase (such as Customer ID, Age, or
Major).
An attribute name should be unique. No two attributes of the same entity type may
have the same name.
Each attribute name should follow a standard format.
1.2 MODELING RELATIONSHIPS AND THEIR BASIC CONCEPTS
Relationships are the glue that holds together various components of an E-R model.
A relationship is an association representing an interaction among the instances of
one or more entity types that is of interest to the organization.
Thus, a relationship has a verb phrase name.
Relationships and their characteristics (degree and cardinality) represent business
rules.
To understand relationships more clearly, we must distinguish between relationship
types and relationship instances.
To illustrate, consider the entity types EMPLOYEE and COURSE, where COURSE
represents training courses that may be taken by employees.
To track courses that have been completed by particular employees, we define a
relationship called Completes between the two entity types.
This is a many-to-many relationship, because each employee may complete any
number of courses (zero, one, or many courses), whereas a given course may be
completed by any number of employees (nobody, one employee, many employees).
In the below Figure the employee Melton has completed three courses (C++,
COBOL, and Perl).
The SQL course has been completed by two employees (Celko and Gosling), and the
Visual Basic course has not been completed by anyone.
ASSOCIATIVE ENTITIES
An associative entity is an entity type that associates the instances of one or more
entity types and contains attributes that are peculiar to the relationship between
those entity instances.
The associative entity CERTIFICATE is represented with the rectangle with
rounded corners, as shown in the below Figure
DEGREE OF A RELATIONSHIP
The degree of a relationship is the number of entity types that participate in that
relationship.
The three most common relationship degrees in E-R models are unary (degree 1),
binary (degree 2), and ternary (degree 3).
UNARY RELATIONSHIP
A unary relationship is a relationship between the instances of a single entity
type.
(Unary relationships are also called recursive relationships.)
Example: Is Married To is shown as a one-to-one relationship between instances of
the PERSON entity type.
BINARY RELATIONSHIP
A binary relationship is a relationship between the instances of two entity types and is
the most common type of relationship encountered in data modeling.
Example: The first relationship (one-to-one) indicates that an employee is assigned one
parking place, and that each parking place is assigned to one employee.
TERNARY RELATIONSHIP
A ternary relationship is a simultaneous relationship among the instances of three
entity types.
Example: vendors can supply various parts to warehouses.
The relationship Supplies is used to record the specific parts that are supplied by a given
vendor to a particular warehouse.
Thus there are three entity types: VENDOR, PART, and WAREHOUSE.
There are two attributes on the relationship Supplies: Shipping Mode and Unit Cost.
One instance of Supplies might record the fact that vendor X can ship part C to
warehouse Y, that the shipping mode is next-day air, and that the cost is $5 per unit.
CARDINALITY CONSTRAINTS
There is one more important data modeling notation for representing common and
important business rules.
Suppose there are two entity types, A and B, that are connected by a relationship.
A cardinality constraint specifies the number of instances of entity B that can (or
must) be associated with each instance of entity A.
Example:
Consider a video store that rents DVDs of movies.
The store may stock more than one DVD for each movie; this is intuitively a one-to-many relationship.
It is also true that the store may not have any DVDs of a given movie in stock at a
particular time (e.g., all copies may be checked out).
MINIMUM CARDINALITY
The minimum cardinality of a relationship is the minimum number of instances of
entity B that may be associated with each instance of entity A.
MAXIMUM CARDINALITY
The maximum cardinality of a relationship is the maximum number of instances of
entity B that may be associated with each instance of entity A.
1.3 THE RELATIONAL DATA MODEL
Data manipulation: Powerful operations (using the SQL language) are used to manipulate data stored in the relations.
Data integrity: The model includes mechanisms to specify business rules that maintain the integrity of data when they are manipulated.
RELATIONAL KEYS
We must be able to store and retrieve a row of data in a relation, based on the data
values stored in that row.
To achieve this goal, every relation must have a primary key.
A primary key is an attribute or a combination of attributes that uniquely identifies
each row in a relation.
We designate a primary key by underlining the attribute name(s).
Example: The primary key for the relation EMPLOYEE1 is EmpID.
Note that this attribute is underlined in the above Figure.
A composite key is a primary key that consists of more than one attribute.
Often we must represent the relationship between two tables or relations.
This is accomplished through the use of foreign keys.
A foreign key is an attribute (possibly composite) in a relation that serves as the
primary key of another relation.
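To make these definitions concrete, the following sketch shows how such keys might be declared when tables are created. It is a hypothetical illustration, not a schema from these notes: the CERTIFICATE table, its columns, and the JDBC connection details are all assumptions.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class CreateTables {
    public static void main(String[] args) throws Exception {
        // Placeholder connection URL and credentials; substitute your own.
        try (Connection conn = DriverManager.getConnection(
                 "jdbc:mysql://localhost/pvfc", "user", "password");
             Statement stmt = conn.createStatement()) {

            // Primary key: EmpID uniquely identifies each row of EMPLOYEE1.
            stmt.executeUpdate(
                "CREATE TABLE EMPLOYEE1 ("
              + " EmpID INTEGER PRIMARY KEY,"
              + " Name VARCHAR(50),"
              + " DeptName VARCHAR(30))");

            // Composite primary key (EmpID, CourseID); the foreign key EmpID
            // must match a primary key value in EMPLOYEE1 (referential integrity).
            stmt.executeUpdate(
                "CREATE TABLE CERTIFICATE ("
              + " EmpID INTEGER,"
              + " CourseID INTEGER,"
              + " DateCompleted DATE,"
              + " PRIMARY KEY (EmpID, CourseID),"
              + " FOREIGN KEY (EmpID) REFERENCES EMPLOYEE1 (EmpID))");
        }
    }
}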
PROPERTIES OF RELATIONS
We have defined relations as two-dimensional tables of data.
However, not all tables are relations.
Relations have several properties that distinguish them from non-relational tables.
Properties:
o Each relation (or table) in a database has a unique name.
o An entry at the intersection of each row and column is atomic (or single valued).
There can be only one value associated with each attribute on a specific row
of a table; no multivalued attributes are allowed in a relation.
o Each row is unique; no two rows in a relation can be identical.
o Each attribute (or column) within a table has a unique name.
o The sequence of columns (left to right) is insignificant.
The order of the columns in a relation can be changed without changing the
meaning or use of the relation.
o The sequence of rows (top to bottom) is insignificant.
As with columns, the order of the rows of a relation may be changed or
stored in any sequence.
REMOVING MULTIVALUED ATTRIBUTES FROM TABLES
No multivalued attributes are allowed in a relation.
Thus, a table that contains one or more multivalued attributes is not a relation.
For example, Figure below shows the employee data from the EMPLOYEE1
relation extended to include courses that may have been taken by those employees.
Because a given employee may have taken more than one course, the attributes
CourseTitle and DateCompleted are multivalued attributes.
For example, the employee with EmpID 100 has taken two courses.
If an employee has not taken any courses, the CourseTitle and DateCompleted
attribute values are null.
The multivalued attributes are eliminated in the below Figure by filling the relevant
data values into the previously vacant cells of the table.
The resulting table has only single-valued attributes.
The name EMPLOYEE2 is given to this relation to distinguish it from
EMPLOYEE1.
SAMPLE DATABASE
INTEGRITY CONSTRAINTS
The relational data model includes several types of constraints, or rules limiting
acceptable values and actions, whose purpose is to facilitate maintaining the
accuracy and integrity of data in the database.
The major types of integrity constraints are domain constraints, entity integrity,
and referential integrity.
DOMAIN CONSTRAINTS
All of the values that appear in a column of a relation must be from the same domain.
A domain is the set of values that may be assigned to an attribute.
A domain definition usually consists of the following components: domain name,
meaning, data type, size (or length), and allowable values or allowable range (if
applicable).
Table below shows domain definitions for the domains associated with the attributes
ENTITY INTEGRITY
The entity integrity rule is designed to ensure that every relation has a primary key
and that the data values for that primary key are all valid.
In particular, it guarantees that every primary key attribute is non-null.
In some cases, a particular attribute cannot be assigned a data value.
There are two situations in which this is likely to occur:
Either there is no applicable data value, or the applicable data value is not
known when values are assigned.
Suppose, for example, that you fill out an employment form that has a space reserved
for a fax number.
If you have no fax number, you leave this space empty because it does not
apply to you.
The relational data model allows us to assign a null value to an attribute in the
situations just described.
A null is a value that may be assigned to an attribute when no other value applies or
when the applicable value is unknown.
In reality, a null is not a value but rather it indicates the absence of a value.
REFERENTIAL INTEGRITY
In the relational data model, associations between tables are defined through the use
of foreign keys.
A referential integrity constraint is a rule that maintains consistency among the
rows of two relations.
The rule states that if there is a foreign key in one relation, either each foreign key
value must match a primary key value in another relation or the foreign key value
must be null.
2. BUSINESS RULES
There are many software products that help organizations manage their business rules (for
example, JRules from ILOG, an IBM company).
In the database world, it has been more common to use the related term integrity constraint
when referring to such rules for maintaining valid data values and relationships in the database.
A business rules approach is based on the following premises:
Business rules are a core concept in an enterprise (project) because they are an expression of
business policy and guide individual and aggregate behavior.
Well-structured business rules can be stated in natural language for end users and in a data
model for systems developers.
Business rules can be expressed in terms that are familiar to end users.
Users can define and then maintain their own rules.
Business rules are highly maintainable.
They are stored in a central repository, and each rule is expressed only once,
and then shared throughout the organization.
Each rule is discovered and documented only once, to be applied in all
systems development projects.
Enforcement of business rules can be automated through the use of software that can
interpret the rules and enforce them using the integrity mechanisms of the database
management system.
Automatic generation and maintenance of systems will not only simplify the systems
development process but also will improve the quality of systems.
Consistent
Expressible
Distinct
Business-oriented
Because the cursor is initially positioned before the first row, we need to first call the next() method before retrieving data.
The ResultSet object is used to loop through and process each row of data and retrieve
the column values that we want to access.
In this case, we access the value in the name column using the rec.getString method,
which is a part of the JDBC API.
For each of the common database types, there is a corresponding get and set method that
allows for retrieval and storage of data in the database.
Table provides some common examples of SQL-to-Java mappings.
It is important to note that while the ResultSet object maintains an active connection to
the database, depending on the size of the table, the entire table (i.e., the result of the
query) may or may not actually be in memory on the client machine.
How and when data are transferred between the database and client is handled by the
Oracle driver.
By default, a ResultSet object is read-only and can only be traversed in one direction
(forward).
However, advanced versions of the ResultSet object allow scrolling in both directions
and can be updateable as well.
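To make the pattern concrete, here is a minimal sketch of this loop in Java. The table name EMPLOYEE1, its columns, and the connection URL are illustrative assumptions, not code from these notes; an appropriate JDBC driver must be on the classpath.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class ListEmployees {
    public static void main(String[] args) throws Exception {
        // Placeholder Oracle connection details; substitute your own.
        try (Connection conn = DriverManager.getConnection(
                 "jdbc:oracle:thin:@localhost:1521:xe", "user", "password");
             Statement stmt = conn.createStatement();
             ResultSet rec = stmt.executeQuery("SELECT Name, EmpID FROM EMPLOYEE1")) {

            // The cursor starts before the first row, so next() must be
            // called before any column values can be read.
            while (rec.next()) {
                // Each SQL type has a corresponding getter: getString for
                // VARCHAR, getInt for INTEGER, getDate for DATE, and so on.
                String name = rec.getString("Name");
                int id = rec.getInt("EmpID");
                System.out.println(id + ": " + name);
            }
        }
    }
}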
5. STORED PROCEDURES
Stored procedures are modules of code that implement application logic and are included
on the database server.
Stored procedures have the following advantages:
Performance improves for compiled SQL statements.
Network traffic decreases as processing moves from the client to the server.
Security improves if the stored procedure rather than the data is accessed and
code is moved to the server, away from direct end-user access.
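A stored procedure is typically invoked from Java through the JDBC CallableStatement interface. The sketch below assumes a hypothetical procedure named raise_salary (taking an employee ID and a raise amount) has already been created on the server; the procedure name, its parameters, and the connection details are illustrative, not taken from these notes.

import java.sql.CallableStatement;
import java.sql.Connection;
import java.sql.DriverManager;

public class CallProcedure {
    public static void main(String[] args) throws Exception {
        // Placeholder connection details; substitute your own.
        try (Connection conn = DriverManager.getConnection(
                 "jdbc:oracle:thin:@localhost:1521:xe", "user", "password");
             // JDBC escape syntax for calling a server-side procedure.
             CallableStatement cs = conn.prepareCall("{call raise_salary(?, ?)}")) {

            cs.setInt(1, 100);        // hypothetical EmpID
            cs.setDouble(2, 500.00);  // hypothetical raise amount
            cs.execute();             // the logic runs on the database server
        }
    }
}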
Accelerators:
Speed time-to-value with analytical and industry-specific modules
7. NoSQL
WHAT IS NOSQL?
NoSQL is an approach to databases that represents a shift away from traditional relational
database management systems (RDBMS).
To define NoSQL, it is helpful to start by describing SQL, which is a query language used
by RDBMS.
Relational databases rely on tables, columns, rows, or schemas to organize and retrieve data.
In contrast, NoSQL databases do not rely on these structures and use more flexible data
models.
NoSQL can mean "not SQL" or "not only SQL." As RDBMS have increasingly failed to
meet the performance, scalability, and flexibility needs that next-generation, data-intensive
applications require, NoSQL databases have been adopted by mainstream enterprises.
NoSQL is particularly useful for storing unstructured data, which is growing far more
rapidly than structured data and does not fit the relational schemas of RDBMS.
Common types of unstructured data include: user and session data; chat, messaging, and log
data; time series data such as IoT and device data; and large objects such as video and
images.
query are retrieved. In an RDBMS, the data would be in different rows stored in different
places on disk, requiring multiple disk operations for retrieval.
Graph stores: A graph database uses graph structures to store, map, and query relationships.
They provide index-free adjacency, so that adjacent elements are linked together without
using an index.
Multi-model databases leverage some combination of the four types described above and
therefore can support a wider range of applications.
BENEFITS OF NOSQL
NoSQL databases offer enterprises important advantages over traditional RDBMS,
including:
Scalability: NoSQL databases use a horizontal scale-out methodology that makes it easy to
add or reduce capacity quickly and non-disruptively with commodity hardware.
Performance: By simply adding commodity resources, enterprises can increase
performance with NoSQL databases. This enables organizations to continue to deliver
reliably fast user experiences with a predictable return on investment for adding resources,
again without the overhead associated with manual sharding.
High Availability: NoSQL databases are generally designed to ensure high availability and
avoid the complexity that comes with a typical RDBMS architecture that relies on primary
and secondary nodes.
Global Availability: By automatically replicating data across multiple servers, data centers,
or cloud resources, distributed NoSQL databases can minimize latency and ensure a
consistent application experience wherever users are located.
Flexible Data Modeling: NoSQL offers the ability to implement flexible and fluid data
models. Application developers can leverage the data types and query options that are the
most natural fit to the specific application use case rather than those that fit the database
schema.
8. Hadoop HDFS
The Hadoop File System (HDFS) was developed using distributed file system design.
It is run on commodity hardware.
Unlike other distributed systems, HDFS is highly fault tolerant and designed using low-cost hardware.
HDFS holds very large amounts of data and provides easier access.
To store such huge data, the files are stored across multiple machines.
These files are stored in a redundant fashion to rescue the system from possible data losses
in case of failure.
HDFS also makes applications available for parallel processing.
Features of HDFS
It is suitable for distributed storage and processing.
Hadoop provides a command interface to interact with HDFS.
The built-in servers of namenode and datanode help users to easily check the status of
the cluster.
Streaming access to file system data.
HDFS provides file permissions and authentication.
HDFS Architecture
Given below is the architecture of a Hadoop File System.
HDFS follows the master-slave architecture and it has the following elements.
Namenode
The namenode is the commodity hardware that contains the GNU/Linux operating
system and the namenode software; it is software that can be run on commodity hardware.
The system having the namenode acts as the master server and it does the following
tasks:
Manages the file system namespace.
Regulates clients' access to files.
It also executes file system operations such as renaming, closing, and opening files
and directories.
Datanode
The datanode is a commodity hardware having the GNU/Linux operating system and
datanode software. For every node (Commodity hardware/System) in a cluster, there will
be a datanode. These nodes manage the data storage of their system.
Datanodes perform read-write operations on the file systems, as per client request.
They also perform operations such as block creation, deletion, and replication according
to the instructions of the namenode.
Block
Generally the user data is stored in the files of HDFS. The file in a file system will be
divided into one or more segments and/or stored in individual data nodes. These file
segments are called blocks. In other words, the minimum amount of data that HDFS
can read or write is called a block. The default block size is 64 MB, but it can be
increased as needed by changing the HDFS configuration (illustrated below).
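For example, the block size can be raised by setting the dfs.block.size property in hdfs-site.xml (named dfs.blocksize in newer Hadoop releases); the 128 MB value below is only illustrative.

<!-- hdfs-site.xml: illustrative override of the default block size -->
<configuration>
  <property>
    <name>dfs.block.size</name>
    <value>134217728</value> <!-- 128 MB, in bytes -->
  </property>
</configuration>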
Goals of HDFS
Fault detection and recovery: Since HDFS includes a large number of commodity
hardware components, failure of components is frequent. Therefore HDFS should have
mechanisms for quick and automatic fault detection and recovery.
Huge datasets: HDFS should have hundreds of nodes per cluster to manage the
applications having huge datasets.
Hardware at data: A requested task can be done efficiently, when the computation
takes place near the data. Especially where huge datasets are involved, it reduces the
network traffic and increases the throughput.
Starting HDFS
Initially you have to format the configured HDFS file system, open namenode (HDFS server),
and execute the following command.
$ hadoop namenode -format
After formatting the HDFS, start the distributed file system. The following command will start
the namenode as well as the data nodes as cluster.
$ start-dfs.sh
Listing Files in HDFS
After loading the information in the server, we can find the list of files in a directory, or the
status of a file, using ls. Given below is the syntax of ls; you can pass a directory or a
filename as an argument.
$ $HADOOP_HOME/bin/hadoop fs -ls <args>
Inserting Data into HDFS
Assume we have data in a file called file.txt in the local system that ought to be saved
in the HDFS file system. Follow the steps given below to insert the required file into the
Hadoop file system.
Step 1
You have to create an input directory.
$ $HADOOP_HOME/bin/hadoop fs -mkdir /user/input
Step 2
Transfer and store a data file from local systems to the Hadoop file system using the put
command.
$ $HADOOP_HOME/bin/hadoop fs -put /home/file.txt /user/input
Step 3
You can verify the file using ls command.
$ $HADOOP_HOME/bin/hadoop fs -ls /user/input
Retrieving Data from HDFS
Assume we have a file in HDFS called outfile. Given below is a simple demonstration for
retrieving the required file from the Hadoop file system.
Step 1
Initially, view the data from HDFS using cat command.
$ $HADOOP_HOME/bin/hadoop fs -cat /user/output/outfile
Step 2
Get the file from HDFS to the local file system using get command.
$ $HADOOP_HOME/bin/hadoop fs -get /user/output/ /home/hadoop_tp/
Shutting Down the HDFS
You can shut down the HDFS by using the following command.
$ stop-dfs.sh
9. MAPREDUCE
MapReduce is a framework using which we can write applications to process huge amounts of
data, in parallel, on large clusters of commodity hardware in a reliable manner.
What is MapReduce?
MapReduce is a processing technique and a programming model for distributed computing
based on Java.
The MapReduce algorithm contains two important tasks, namely Map and Reduce.
Map takes a set of data and converts it into another set of data, where individual elements are
broken down into tuples (key/value pairs).
Secondly, reduce task, which takes the output from a map as an input and combines those data
tuples into a smaller set of tuples.
As the name MapReduce implies, the reduce task is always performed after the map job.
Advantage of MapReduce
It is easy to scale data processing over multiple computing nodes.
Under the MapReduce model, the data processing primitives are called mappers and reducers.
Decomposing a data processing application into mappers and reducers is sometimes nontrivial.
But, once we write an application in the MapReduce form, scaling the application to run over
hundreds, thousands, or even tens of thousands of machines in a cluster is merely a
configuration change.
This simple scalability is what has attracted many programmers to use the MapReduce model.
The Algorithm
Generally, the MapReduce paradigm is based on sending the computation to where the data resides.
A MapReduce program executes in three stages, namely the map stage, the shuffle stage, and the reduce stage.
o Map stage: The map or mapper's job is to process the input data. Generally the input
data is in the form of a file or directory and is stored in the Hadoop file system (HDFS).
The input file is passed to the mapper function line by line. The mapper processes the
data and creates several small chunks of data.
o Reduce stage: This stage is the combination of the Shuffle stage and the Reduce stage.
The Reducer's job is to process the data that comes from the mapper. After processing, it
produces a new set of output, which will be stored in the HDFS. (A sketch of a mapper
and reducer follows this list.)
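To show the map and reduce stages concretely, here is a minimal sketch using the standard Hadoop Java API (the org.apache.hadoop.mapreduce classes). It implements word counting, the usual introductory MapReduce example; it is an illustration under those assumptions, not code from these notes.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Map stage: each input line is broken into (word, 1) pairs.
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reduce stage: the framework shuffles all values for a given word to
    // one reducer, which sums them.
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // HDFS input directory
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // HDFS output directory
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}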
During a MapReduce job, Hadoop sends the Map and Reduce tasks to the appropriate servers in
the cluster.
The framework manages all the details of data-passing such as issuing tasks, verifying task
completion, and copying data around the cluster between the nodes.
Most of the computing takes place on nodes with data on local disks that reduces the network
traffic.
After completion of the given tasks, the cluster collects and reduces the data to form an
appropriate result, and sends it back to the Hadoop server.
Terminology
PayLoad - Applications implement the Map and the Reduce functions, and form the core of the
job.
Mapper - Mapper maps the input key/value pairs to a set of intermediate key/value pairs.
NameNode - Node that manages the Hadoop Distributed File System (HDFS).
DataNode - Node where data is presented in advance before any processing takes place.
MasterNode - Node where JobTracker runs and which accepts job requests from clients.
SlaveNode - Node where Map and Reduce program runs.
JobTracker - Schedules jobs and tracks the assigned jobs to the Task Tracker.
Task Tracker - Tracks the task and reports status to JobTracker.
Job - An execution of a Mapper and Reducer across a dataset.
Task - An execution of a Mapper or a Reducer on a slice of data.
Task Attempt - A particular instance of an attempt to execute a task on a SlaveNode.
Example Scenario
Given below is the data regarding the electrical consumption of an organization. It contains the
monthly electrical consumption and the annual average for various years.
If the above data is given as input, we have to write applications to process it and produce
results such as finding the year of maximum usage, the year of minimum usage, and so on. This is a
walkover for programmers with a finite number of records.
They will simply write the logic to produce the required output, and pass the data to the
application written.
But think of the data representing the electrical consumption of all the large-scale industries of a
particular state since its formation.
When we write applications to process such bulk data, they will take a lot of time to execute.
There will be heavy network traffic when we move data from the source to the network server,
and so on.
To solve these problems, we have the MapReduce framework.
Input Data
The above data is saved as sample.txt and given as input. The input file looks as shown below.
10. HIVE
What is Hive
Hive is a data warehouse infrastructure tool to process structured data in Hadoop. It resides on
top of Hadoop to summarize Big Data, and makes querying and analyzing easy.
Initially Hive was developed by Facebook; later the Apache Software Foundation took it up and
developed it further as open source under the name Apache Hive. It is used by different
companies. For example, Amazon uses it in Amazon Elastic MapReduce.
Hive is not:
A relational database
A design for OnLine Transaction Processing (OLTP)
A language for real-time queries and row-level updates
Features of Hive:
It stores schema in a database and processed data into HDFS.
Architecture of Hive
The following component diagram depicts the architecture of Hive:
[Figure: Architecture of Hive, with components including the Execution Engine and HDFS or HBASE.]
Working of Hive
The following diagram depicts the workflow between Hive and Hadoop.
The following table defines how Hive interacts with the Hadoop framework: