Report of Big Data
BUSINESS AND DATA TECHNOLOGY
How different companies use big data to gain more benefit
Submitted to:
Sir Abdul Zahid Khan
Submitted by:
Muhammad Waseem
1252-FMS/BBAIT/S17
Date: 04/01/2018
Contents
What is big data?
BIG DATA HISTORY AND CURRENT CONSIDERATION
VOLUME:
VARIETY:
VELOCITY:
COMPLEXITY:
TYPES OF BIG DATA
1. Structured data
2. Unstructured data
3. Semi-structured data
4. Conversational data
Photo and Video image data
WHY BIG DATA
NEW DATA SOURCE
Clickstream data:
Shopping cart data:
Social networking data:
Sensor data:
SOLUTION WITH BIG DATA
HADOOP:
How Hadoop Works:
CASE STUDIES
What is big data?
Big data can be analyzed for insights that lead to better decisions and strategic business moves.
VOLUME:
Organizations collect data from a variety of sources, including business
transactions, social media and information from sensors or machine-to-machine data.
In the past, storing it would’ve been a problem – but new technologies (such as
Hadoop) have eased the burden.
VARIETY:
Data comes in all types of formats – from structured, numeric data in traditional
databases to unstructured text documents, email, video, audio, stock ticker data and
financial transactions.
VELOCITY:
Velocity refers to the speed at which data is generated every minute and every second. Take the example of Facebook, Twitter and other social sites: when we upload a picture, it can go viral within seconds, because we live in a world where everything is connected through the internet, so data is being generated at very high speed.
COMPLEXITY:
Today's data comes from multiple sources, which makes it difficult to link, match,
cleanse and transform data across systems. However, it’s necessary to connect and
correlate relationships, hierarchies and multiple data linkages or your data can
quickly spiral out of control.
The importance of big data doesn’t revolve around how much data you have, but what you do with it. You can take data from any source and analyze it to find answers that enable cost reductions, faster and better decisions, and new products and services. One example is generating coupons at the point of sale based on the customer’s buying habits.
The concept of Big Data is nothing complex; as the name suggests, “Big Data” refers to copious amounts of data which are too large to be processed and analyzed by traditional tools, and which cannot be stored or managed efficiently with them. Since the amount of Big Data increases exponentially (more than 500 terabytes of data are uploaded to Facebook alone in a single day), it represents a real problem in terms of analysis.
However, there is also huge potential in the analysis of Big Data. The proper
management and study of this data can help companies make better decisions based
on usage statistics and user interests, thereby helping their growth. Some companies
have even come up with new products and services, based on feedback received
from Big Data analysis opportunities.
Classification is essential for the study of any subject, so Big Data is widely classified into the following main types:
1. Structured data
Structured Data is used to refer to the data which is already stored in databases, in an
ordered manner. It accounts for about 20% of the total existing data, and is used the
most in programming and computer-related activities.
There are two sources of structured data: machines and humans. All the data received from sensors, web logs and financial systems is classified as machine-generated data. This includes data from medical devices, GPS data, usage statistics captured by servers and applications, and the huge amounts of data that move through trading platforms, to name a few.
Human-generated structured data mainly includes all the data a person enters into a computer, such as their name and other personal details. When a person clicks a link on the internet, or even makes a move in a game, data is created; companies can use it to understand customer behavior and make the appropriate decisions and modifications.
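To make this concrete, here is a minimal sketch (in Python, using an invented transactions table) of what structured, row-column data looks like and how easily it can be queried once the schema is known:

```python
import sqlite3

# Structured data: rows with a fixed schema in a relational table (in-memory here).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE transactions (id INTEGER, customer TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO transactions VALUES (?, ?, ?)",
    [(1, "Alice", 120.0), (2, "Bob", 75.5), (3, "Alice", 40.0)],
)

# Because the schema is known in advance, querying and aggregating is straightforward.
for customer, total in conn.execute(
    "SELECT customer, SUM(amount) FROM transactions GROUP BY customer"
):
    print(customer, total)   # e.g. Alice 160.0, Bob 75.5
```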
2. Unstructured data
While structured data resides in traditional row-column databases, unstructured data is the opposite: it has no clear format in storage. The rest of the data created, about 80% of the total, accounts for unstructured big data. Most of the data a person encounters belongs to this category, and until recently there was not much that could be done with it except store it or analyze it manually.
3. Semi-structured data.
The line between unstructured data and semi-structured data has always been unclear,
since most of the semi-structured data appear to be unstructured at a glance.
Information that is not stored in the traditional database format of structured data, but contains some organizational properties that make it easier to process, is considered semi-structured data. For example, NoSQL documents are considered to be semi-structured, since they contain keywords that can be used to process the document easily.
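As a small illustration of what such organizational properties look like, the sketch below (Python, with a made-up customer document) parses a JSON record, a typical semi-structured format, using its keys to navigate data that never lived in a row-column table:

```python
import json

# A hypothetical customer record in JSON: no fixed table schema,
# but the keys ("name", "orders", ...) give it enough structure to process.
raw_document = """
{
  "name": "Alice",
  "email": "alice@example.com",
  "orders": [
    {"item": "laptop", "price": 850.0},
    {"item": "mouse", "price": 20.0}
  ]
}
"""

record = json.loads(raw_document)                   # parse the semi-structured text
total = sum(o["price"] for o in record["orders"])   # use the keys to navigate it
print(record["name"], "spent", total)               # -> Alice spent 870.0
```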
Big Data analysis has been found to have a definite business value, as its analysis and
processing can help a company achieve cost reductions and dramatic growth. So it is
imperative that you do not wait too long to exploit the potential of this excellent business
opportunity.
4. Conversational data
Just take the example of social sites such as Twitter: the messages we send to each other are all recorded digitally. Even simple activities like listening to music or reading a book are now recorded electronically.
Photo and Video image data
The videos and images we upload to YouTube and other social sites are likewise recorded digitally.
Customers and prospects have become empowered in the online world. Social
networks and comparison and review websites have allowed them to quickly become
well informed before they buy. With information on their side and at their fingertips
wherever they are, mobile device users have the power to sacrifice loyalty in the blink of
an eye or the click of a mouse if quality and services are not satisfactory.
Clickstream data:
Analyzing all the clicks of every visitor on a website helps you understand site-navigation behavior, the paths people take to buying products and services, what else they looked at on the way to buying, the paths that led to abandonment, and more. This information helps improve the customer experience and conversion. It may also be possible to associate clicks with specific customers and prospects.
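A rough sketch of this kind of analysis, assuming a simplified log of (visitor, page) events rather than any particular analytics tool, might look like this in Python:

```python
from collections import Counter, defaultdict

# Hypothetical clickstream events: (visitor_id, page) in time order.
clicks = [
    ("v1", "/home"), ("v1", "/product"), ("v1", "/cart"), ("v1", "/checkout"),
    ("v2", "/home"), ("v2", "/product"), ("v2", "/cart"),   # abandoned cart
    ("v3", "/home"), ("v3", "/search"), ("v3", "/product"),
]

# Group clicks into one navigation path per visitor.
paths = defaultdict(list)
for visitor, page in clicks:
    paths[visitor].append(page)

# Count full paths and flag sessions that reached the cart but never checked out.
path_counts = Counter(" > ".join(p) for p in paths.values())
abandoned = [v for v, p in paths.items() if "/cart" in p and "/checkout" not in p]

print(path_counts.most_common(3))
print("abandoned carts:", abandoned)   # -> ['v2']
```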
Shopping cart data:
This data from your website enables you to see what people are putting into and taking
out of shopping carts en route to online checkout.
Social networking data:
Analyzing data from Facebook, LinkedIn and Twitter gives you an opportunity to obtain
information about customers you don’t yet have. It also allows you to identify previously
unknown relationships, what people like and dislike and so on. Analysis of this data also
enables you to identify who the influencers are in the network and which people link
multiple communities in the network. Targeting influencers with marketing campaigns
could significantly boost sales. Analyzing social network data also empowers you to
identify sentiment information—what people are saying about your products, your
customer service and your brand.
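One simple way to approximate the influencer analysis described above is to count each person's followers; the sketch below uses invented names and follow relationships purely for illustration:

```python
from collections import Counter

# Hypothetical "who follows whom" edges from a social network.
follows = [
    ("bob", "alice"), ("carol", "alice"), ("dave", "alice"),
    ("alice", "carol"), ("dave", "carol"), ("erin", "bob"),
]

# Influence here is approximated by follower count (in-degree).
followers = Counter(target for _, target in follows)
for person, count in followers.most_common():
    print(person, count)   # -> alice 3, carol 2, bob 1
```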
Sensor data:
Analyzing data from smart technologies such as Global Positioning System (GPS)
sensors in smartphones can give you information on product usage or location. Sensor
data may also be available for monitoring production lines, asset performance, supply
chains and distribution channels to see if customers are getting deliveries on time.
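As a minimal sketch of the delivery-monitoring idea, assuming hypothetical shipment IDs and GPS-reported delivery timestamps, one could compare actual delivery times against promised times like this:

```python
from datetime import datetime

# Hypothetical sensor readings: when each shipment was actually delivered,
# compared against the promised delivery time.
promised = {"S1": datetime(2018, 1, 3, 12, 0), "S2": datetime(2018, 1, 3, 12, 0)}
delivered = {"S1": datetime(2018, 1, 3, 11, 40), "S2": datetime(2018, 1, 3, 14, 5)}

for shipment, due in promised.items():
    status = "on time" if delivered[shipment] <= due else "late"
    print(shipment, status)   # -> S1 on time, S2 late
```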
How Hadoop Works:
The basic idea of the MapReduce algorithm is to break down the data into smaller, manageable pieces, process the data in parallel on your distributed cluster, and subsequently combine it into the desired result or output.
Hadoop MapReduce includes several stages, each with an important set of operations designed to handle big data. The first step is for the program to locate and read the "input file" containing the raw data. Since the file format is arbitrary, the data must be converted to something the program can process. This is the function of "InputFormat" and "RecordReader" (RR). InputFormat decides how to split the file into smaller pieces (using a function called InputSplit). Then the RecordReader transforms the raw data for processing by the map. The result is a sequence of "key" and "value" pairs.
Once the data is in a form acceptable to the map, each key-value pair of data is processed by the mapping function. To keep track of and collect the output data, the program uses an "OutputCollector". Another function called "Reporter" provides information that lets you know when the individual mapping tasks are complete.
Once all the mapping is done, the Reduce function performs its task on each output key-value pair. Finally, an OutputFormat feature takes those key-value pairs and organizes the output for writing to HDFS, which is the last step of the program.
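To make the map, shuffle and reduce flow concrete, here is a small word-count simulation in plain Python; it only mimics the stages described above and is not actual Hadoop code, which would normally be written against the Hadoop MapReduce Java API:

```python
from collections import defaultdict

def map_phase(line):
    """Map: emit a (key, value) pair for every word in an input line."""
    for word in line.lower().split():
        yield word, 1

def reduce_phase(key, values):
    """Reduce: combine all the values collected for one key."""
    return key, sum(values)

# Raw input split into records (in Hadoop, InputFormat/RecordReader do this).
records = ["Big data needs big tools", "Hadoop processes big data"]

# Shuffle: group every mapped value under its key before reducing.
grouped = defaultdict(list)
for record in records:
    for key, value in map_phase(record):
        grouped[key].append(value)

# Reduce each key and print the final output (in Hadoop this is written to HDFS).
for key in sorted(grouped):
    print(reduce_phase(key, grouped[key]))   # e.g. ('big', 3)
```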
Hadoop MapReduce is the heart of the Hadoop system. It is able to process the data in
a highly resilient, fault-tolerant manner. Obviously this is just an overview of a larger and
growing ecosystem with tools and technologies adapted to manage modern big data
problems.
CASE STUDIES
GOOGLE
Big data and big business go hand in hand – this is the first in a series where I will examine the different uses that the world’s leading corporations are making of the endless amount of digital information the world is producing every day. Google has not only significantly influenced the way we can now analyze big data (think MapReduce, BigQuery, etc.) – but they are probably more responsible than anyone else for making it part of our everyday lives. I believe that many of the innovative things Google is doing today, most companies will do in years to come. Many people, particularly those who didn’t get online until this century had started, will have had their first direct experience of manipulating big data through Google.

Although these days Google’s big data innovation goes well beyond basic search, it’s still their core business. They process 3.5 billion requests per day, and each request queries a database of 20 billion web pages. This index is refreshed daily, as Google’s bots crawl the web, copying down what they see and taking it back to be stored in Google’s index database. What pushed Google in front of other search engines has been its ability to analyze wider data sets for their search.
Initially it was PageRank, which included information about the sites that linked to a particular site in the index, to help take a measure of that site’s importance in the grand scheme of things. Previously, leading search engines worked almost entirely on the principle of matching relevant keywords in the search query to sites containing those words. PageRank revolutionized search by incorporating other elements alongside keyword analysis.

Their aim has always been to make as much of the world’s information available to as many people as possible (and get rich trying, of course…) and the way Google search works has been constantly revised and updated to keep up with this mission. Moving further away from keyword-based search and towards semantic search is the current aim. This involves analyzing not just the “objects” (words) in the query, but the connections between them, to determine what it means as accurately as possible. To this end, Google throws a whole heap of other information into the mix. Starting in 2007 it launched Universal Search, which pulls in data from hundreds of sources including language databases, weather forecasts and historical data, financial data, travel information, currency exchange rates, sports statistics and a database of mathematical functions. It continued to evolve in 2012 into the Knowledge Graph, which displays information on the subject of the search from a wide range of resources directly in the search results. It then mixes what it knows about you from
your previous search history (if you are signed in), which can include information about your location, as well as data from your Google+ profile and Gmail messages, to come up with its best guess at what you are looking for. The ultimate aim is undoubtedly to build the kind of machine we have become used to seeing in science fiction for decades – a computer which you can have a conversation with in your native tongue, and which will answer you with precisely the information you want.

Search is by no means all of what Google does, though. After all, it’s free, right? And Google is one of the most profitable businesses on the planet. That profit comes from what it gets in return for its searches – information about you. Google builds up vast amounts of data about the people using it. Essentially it then matches up companies with potential customers, through its AdSense algorithm. The companies pay handsomely for these introductions, which appear as adverts in the customers’ browsers. In 2010 it launched BigQuery, its commercial service for allowing companies to store and analyze big data sets on its cloud platforms. Companies pay for the storage space and computer time taken in running the queries.

Another big data project Google is working on is the self-driving car. Using and generating massive amounts of data from sensors, cameras and tracking devices, and coupling this with on-board, real-time data analysis from Google Maps, Street View and other sources, allows the Google car to drive safely on the roads without any input from a human driver.

Perhaps the most astounding use Google have found for their enormous data, though, is predicting the future. In 2008 the company published a paper in the science journal Nature claiming that their technology had the capability to detect outbreaks of flu with more accuracy than current medical techniques for detecting the spread of epidemics. The results were controversial – debate continues over the accuracy of the predictions. But the incident unveiled the possibility of “crowd prediction”, which in my opinion is likely to become a reality in the future as analytics becomes more sophisticated. Google may not quite yet be ready to predict the future – but its position as a main player and innovator in this field is secure.