Big Data Report

ABSTRACT--Big data is an all-encompassing term for any collection of data sets so large and complex that it becomes difficult to process using on-hand data management tools or traditional data processing applications. The challenges include capture, curation, storage, search, sharing, transfer, analysis, and visualization. The trend toward larger data sets is due to the additional information derivable from analysis of a single large set of related data, as compared to separate smaller sets with the same total amount of data, allowing correlations to be found to "spot business trends, prevent diseases, combat crime and so on."
At multiple terabytes in size, the text and images of
Wikipedia are a classic example of big data. The challenge
for large enterprises is determining who should own big
data initiatives that straddle the entire organization. Big
data is difficult to work with using most relational database
management systems and desktop statistics and
visualization packages, requiring instead "massively
parallel software running on tens, hundreds, or even
thousands of servers". What is considered "big data" varies
depending on the capabilities of the organization managing
the set, and on the capabilities of the applications that are
traditionally used to process and analyze the data set in its
domain. "For some organizations, facing hundreds of
gigabytes of data for the first time may trigger a need to
reconsider data management options. For others, it may
take tens or hundreds of terabytes before data size becomes
a significant consideration."

I. INTRODUCTION

Big data burst upon the scene in the first decade of the
21st century, and the first organizations to embrace it
were online and startup firms. Arguably, firms like
Google, eBay, LinkedIn, and Facebook were built around
big data from the beginning. They didn't have to reconcile or integrate big data with more traditional sources of data and the analytics performed upon them, because they didn't have those traditional forms. They didn't have to merge big data technologies with their traditional IT infrastructures because those infrastructures didn't exist. Big data could stand alone,
big data analytics could be the only focus of analytics,
and big data technology architectures could be the only
architecture.
Consider, however, the position of large, well-established businesses. Big data in those environments shouldn't be separate; it must be integrated with everything else that's going on in the company. Analytics on big data have to coexist with analytics on other types of data. Hadoop clusters have to do their work alongside IBM mainframes. Data scientists must somehow get along and work jointly with mere quantitative analysts.
In order to understand this coexistence, we interviewed 20 large organizations in the early months of 2013 about how big data fit into their overall data and analytics environments. Overall, we found the expected coexistence; in not a single one of these large organizations was big data being managed separately from other types of data and analytics. The integration was in fact leading to a new management perspective on analytics, which we'll call Analytics 3.0. In this paper we'll describe the overall context for how organizations think about big data, the organizational structure and skills required for it, and so on. We'll conclude by describing the Analytics 3.0 era. Big data may be new for startups and for online firms, but many large firms view it as something they have been wrestling with for a while. Some managers appreciate the innovative nature of big data, but more find it "business as usual" or part of a continuing evolution toward more data. They have been adding new forms of data to their systems and models for many years, and don't see anything revolutionary about big data. Put another way, many were pursuing big data before big data was big.
II. METHODOLOGY

A. Definition

Big data usually includes data sets with sizes beyond
the ability of commonly used software tools
to capture, curate, manage, and process the data within a
tolerable elapsed time. Big data sizes are a constantly
moving target, as of 2012 ranging from a few dozen
terabytes to many petabytes of data in a single data set.
In a 2001 research report and related lectures, META
Group (now Gartner) analyst Doug Laney defined data
growth challenges and opportunities as being three-
dimensional, i.e. increasing volume (amount of data),
velocity (speed of data in and out), and variety (range of
data types and sources). Gartner, and now much of the
industry, continue to use this "3Vs" model for describing
big data. In 2012, Gartner updated its definition as
follows: "Big data is high volume, high velocity, and/or
high variety information assets that require new forms of
processing to enable enhanced decision making, insight
discovery and process optimization." Additionally, some organizations add a fourth V, "Veracity," to describe the trustworthiness and quality of the data.
While Gartner's definition (the 3Vs) is still widely used, the growing maturity of the concept fosters a sharper distinction between big data and Business Intelligence with regard to data and their use. Business Intelligence uses descriptive statistics on data with high information density to measure things, detect trends, and so on. Big data uses inductive statistics and concepts from nonlinear system identification to infer laws (regressions, nonlinear relationships, and causal effects) from large data sets, in order to reveal relationships and dependencies and to predict outcomes and behaviours.
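To make the distinction concrete, here is a minimal sketch contrasting the two styles on a toy data set, assuming NumPy is available; the variable names, numbers, and generated data are illustrative assumptions, not part of any cited study.

import numpy as np

# A minimal sketch of the BI vs. big data distinction. Descriptive
# statistics summarize what the data already says; inductive statistics
# fit a model (a "law") that generalizes to unseen cases.
rng = np.random.default_rng(0)
ad_spend = rng.uniform(10, 100, size=10_000)             # hypothetical predictor
sales = 3.2 * ad_spend + rng.normal(0, 15, size=10_000)  # hypothetical outcome

# Business Intelligence style: descriptive statistics to "measure things".
print("mean sales:", sales.mean())
print("trend vs. earlier half:", np.median(sales[5000:]) - np.median(sales[:5000]))

# Big data style: inductive statistics to infer a law (here, a regression)
# and predict outcomes for inputs not yet observed.
slope, intercept = np.polyfit(ad_spend, sales, deg=1)
print(f"inferred law: sales ~= {slope:.2f} * ad_spend + {intercept:.2f}")
print("predicted sales at spend=120:", slope * 120 + intercept)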
Big data has also been defined as "a large volume [of] unstructured data which cannot be handled by standard database management systems like DBMS, RDBMS or ORDBMS".

The Large Hadron Collider (LHC) experiments illustrate the scale. Even though they work with less than 0.001% of the sensor stream data, the data flow from all four LHC experiments represents a 25-petabyte annual rate before replication (as of 2012), which becomes nearly 200 petabytes after replication. If all sensor data were recorded, the flow would be extremely hard to work with: it would exceed a 150-million-petabyte annual rate, or nearly 500 exabytes per day, before replication. To put the number in perspective, this is equivalent to 500 quintillion (5×10^20) bytes per day, almost 200 times more than all the other sources in the world combined.
The Square Kilometre Array is a telescope which consists of millions of antennas and is expected to be operational by 2024. Collectively, these antennas are expected to gather 14 exabytes of data and store one petabyte per day. It is considered one of the most ambitious scientific projects ever undertaken.

B. Characteristics of big data

Every day, we create 2.5 quintillion bytes of data; so much that 90% of the data in the world today has been created in the last two years alone. This data comes from everywhere: sensors used to gather climate information, posts to social media sites, digital pictures and videos, purchase transaction records, and cell phone GPS signals, to name a few. This data is big data.
Big data spans three dimensions: Volume, Velocity, and Variety.

1) Volume: Enterprises are awash with ever-growing data of all types, easily amassing terabytes, even petabytes, of information. For example:
- Turn 12 terabytes of Tweets created each day into improved product sentiment analysis.
- Convert 350 billion annual meter readings to better predict power consumption.

2) Velocity: Sometimes 2 minutes is too late. For time-sensitive processes such as catching fraud, big data must be used as it streams into your enterprise in order to maximize its value (see the sketch after this list). For example:
- Scrutinize 5 million trade events created each day to identify potential fraud.
- Analyze 500 million daily call detail records in real time to predict customer churn faster.

3) Variety: Big data is any type of data: structured and unstructured data such as text, sensor data, audio, video, click streams, log files, and more. New insights are found when analyzing these data types together. For example:
- Monitor hundreds of live video feeds from surveillance cameras to target points of interest.
- Exploit the 80% data growth in images, video, and documents to improve customer satisfaction.
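As a minimal illustration of the velocity point, the sketch below screens a stream of trade events the moment each one arrives, rather than after the data is warehoused. It assumes only the Python standard library; the event fields, threshold, and generator are hypothetical stand-ins, not a real fraud model.

import random
import time

def trade_stream(n=10):
    """Hypothetical stand-in for a live feed of trade events."""
    for i in range(n):
        yield {"trade_id": i,
               "amount": random.uniform(100, 50_000),
               "ts": time.time()}

def looks_fraudulent(event, threshold=40_000):
    # Toy rule: flag unusually large trades. A real system would score
    # each event against a trained model within milliseconds.
    return event["amount"] > threshold

# Each event is checked as it streams in, instead of stored first
# and analyzed later.
for event in trade_stream():
    if looks_fraudulent(event):
        print(f"ALERT: trade {event['trade_id']} "
              f"of ${event['amount']:,.2f} flagged for review")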
Big data is more than simply a matter of size; it is an opportunity to find insights in new and emerging types of data and content, to make your business more agile, and to answer questions that were previously considered beyond your reach. Until now, there was no practical way to harvest this opportunity. Today, IBM's platform for big data uses state-of-the-art technologies, including patented advanced analytics, to open the door to a world of possibilities.


Fig. 1. Big data spans three dimensions.




III. CASE STUDY
No single business trend in the last decade has as much potential impact on incumbent IT investments as big data. Indeed, big data promises (or threatens, depending on how you view it) to upend legacy technologies at many big companies. As IT modernization initiatives gain traction and the accompanying cost savings hit the bottom line, executives in both line-of-business and IT organizations are getting serious about the technology solutions that are tied to big data.
Companies are not only replacing legacy technologies in favor of open source solutions like Apache Hadoop; they are also replacing proprietary hardware with commodity hardware, custom-written applications with packaged solutions, and decades-old business intelligence tools with data visualization. This new combination of big data platforms, projects, and tools is driving new business innovations, from faster product time-to-market, to an authoritative (finally!) single view of the customer, to custom-packaged product bundles and beyond.
A. Big data stack

As with all strategic technology trends, big data introduces highly specialized features that set it apart from legacy systems. Each component of the stack is optimized around the large, unstructured, and semi-structured nature of big data. Working together, these moving parts comprise a holistic solution that's fine-tuned for specialized, high-performance processing and storage.


Fig. 2. Big data stack.

1) Storage: Storing large and diverse amounts of data on disk is becoming more cost-effective as disk technologies become more commoditized and efficient. Companies like EMC sell storage solutions that allow disks to be added quickly and cheaply, thereby scaling storage in lock step with growing data volumes. Indeed, many big-company executives see Hadoop as a low-cost alternative for the archival and quick retrieval of large amounts of historical data.

2) Platform Infrastructure: The big data platform is typically the collection of functions that comprise high-performance processing of big data. The platform includes capabilities to integrate, manage, and apply sophisticated computational processing to the data. Typically, big data platforms include a Hadoop (or similar open-source project) foundation. Hadoop was designed and built to optimize complex manipulation of large amounts of data while vastly exceeding the price/performance of traditional databases. Hadoop is a unified storage and processing environment that is highly scalable to large and complex data volumes.

3) Data: The expanse of big data is as broad and complex as the applications for it. Big data can mean human genome sequences, oil well sensors, cancer cell behaviors, locations of products on pallets, social media interactions, or patient vital signs, to name a few examples. The data layer in the stack implies that data is a separate asset, warranting discrete management and governance. To that end, a 2013 survey of data management professionals found that of the 339 companies responding, 71 percent admitted that they have yet to begin planning their big data strategies. The respondents cited concerns about data quality, reconciliation, timeliness, and security as significant barriers to big data adoption.
4) Application Code, Functions, and Services: Just as big data varies with the business application, the code used to manipulate and process the data can vary. Hadoop uses a processing engine called MapReduce not only to distribute data across the disks, but to apply complex computational instructions to that data. In keeping with the high-performance capabilities of the platform, MapReduce instructions are processed in parallel across various nodes on the big data platform, and then quickly assembled to provide a new data structure or answer set. An example of a big data application in Hadoop might be to identify all the customers who "like" us on social media. A text mining application might crunch through social media transactions, searching for words such as "fan," "love," "bought," or "awesome," and consolidate a list of key influencer customers.
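The sketch below mimics that text-mining example with plain Python map and reduce functions running in a single process; real Hadoop jobs are written against the MapReduce API and run in parallel across nodes, so the posts, keyword list, and single-machine execution here are simplifying assumptions.

from collections import defaultdict

posts = [  # hypothetical social media transactions
    ("alice", "huge fan, love this product"),
    ("bob",   "bought it yesterday, awesome"),
    ("carol", "returned mine, not a fan"),
]
KEYWORDS = {"fan", "love", "bought", "awesome"}

def map_phase(user, text):
    # Emit (user, 1) for each keyword hit, like a Hadoop mapper.
    for word in text.replace(",", "").split():
        if word in KEYWORDS:
            yield user, 1

def reduce_phase(pairs):
    # Sum the counts per user, like a Hadoop reducer.
    totals = defaultdict(int)
    for user, count in pairs:
        totals[user] += count
    return totals

intermediate = [pair for user, text in posts for pair in map_phase(user, text)]
influencers = reduce_phase(intermediate)
print(sorted(influencers.items(), key=lambda kv: -kv[1]))
# e.g. [('alice', 2), ('bob', 2), ('carol', 1)]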




5) Business View: Depending on the big data application, additional processing via MapReduce or custom Java code might be used to construct an intermediate data structure, such as a statistical model, a flat file, a relational table, or a cube. The resulting structure may be intended for additional analysis, or to be queried by a traditional SQL-based query tool. This business view ensures that big data is more consumable by the tools and the knowledge workers that already exist in an organization. One Hadoop project called Hive enables raw data to be restructured into relational tables that can be accessed via SQL and incumbent SQL-based toolsets, capitalizing on the skills that a company may already have in-house.
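To show the business-view idea, the small sketch below flattens raw semi-structured records into a relational table and queries it with ordinary SQL. Python's built-in sqlite3 stands in for Hive purely for illustration; Hive itself would define tables over files in Hadoop and answer HiveQL queries. The records and schema are hypothetical.

import json
import sqlite3

raw_logs = [  # raw, semi-structured clickstream records
    '{"user": "alice", "page": "/checkout", "ms": 320}',
    '{"user": "bob",   "page": "/home",     "ms": 95}',
    '{"user": "alice", "page": "/home",     "ms": 110}',
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE clicks (user TEXT, page TEXT, ms INTEGER)")
conn.executemany(
    "INSERT INTO clicks VALUES (?, ?, ?)",
    [(r["user"], r["page"], r["ms"]) for r in map(json.loads, raw_logs)],
)

# Any SQL-literate analyst or incumbent BI tool can now consume the data.
for row in conn.execute(
    "SELECT user, COUNT(*) AS visits, AVG(ms) AS avg_ms "
    "FROM clicks GROUP BY user ORDER BY visits DESC"
):
    print(row)  # ('alice', 2, 215.0) then ('bob', 1, 95.0)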

6) Presentation and Consumption: One of the more profound developments in the world of big data is the adoption of so-called data visualization. Unlike the specialized business intelligence technologies and unwieldy spreadsheets of yesterday, data visualization tools allow the average business person to view information in an intuitive, graphical way.

B. Organizational Structures for Big Data

The most likely organizational structures to initiate or accommodate big data technologies are either existing analytics groups (including groups with an "operations research" title), or innovation or architectural groups within IT organizations. In many cases these central services organizations are aligned in big data initiatives with analytically oriented functions or business units: marketing, for example, or the online businesses of banks or retailers (see the "Big Data at Macys.com" case study). Some of these business units have IT or analytics groups of their own. The organizations whose approaches seemed most effective and likely to succeed had close relationships between the business groups addressing big data and the IT organizations supporting them.

C. Big Data Skill Scarcity

In terms of skills, most of these large firms are augmenting (or trying to augment) their existing analytical staffs with data scientists who possess a higher level of IT capability, and the ability to manipulate big data technologies specifically, compared to traditional quantitative analysts. These skills might include natural language processing or text mining, video or image analytics, and visual analytics. Many of the data scientists are also able to code in scripting languages like Python, Pig, and Hive. In terms of backgrounds, some have Ph.D.s in scientific fields; others are simply strong programmers with some analytical skills. Many of our interviewees questioned whether a single data scientist could possess all the needed skills, and were taking a team-based approach to assembling them.

D. Types of Tools Used in Big Data

- Where is processing hosted? Distributed servers / cloud (e.g., Amazon EC2).
- Where is data stored? Distributed storage (e.g., Amazon S3).
- What is the programming model? Distributed processing (e.g., MapReduce).
- How is data stored and indexed? High-performance, schema-free databases (e.g., MongoDB; a short sketch follows this list).
- What operations are performed on the data? Analytic / semantic processing.
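As a sketch of the schema-free storage idea, the snippet below writes heterogeneous records into MongoDB using the pymongo driver. The host, database name, and documents are illustrative assumptions, and a MongoDB server must be running on localhost:27017 for it to execute.

from pymongo import MongoClient

client = MongoClient("localhost", 27017)
events = client["bigdata_demo"]["events"]

# Unlike rows in a relational table, documents in one collection need
# not share a schema: each record carries only the fields it has.
events.insert_many([
    {"type": "click", "user": "alice", "page": "/home"},
    {"type": "sensor", "device": 42, "temp_c": 21.7},
    {"type": "tweet", "user": "bob", "text": "awesome", "likes": 3},
])

# Queries can still filter on whatever fields exist.
for doc in events.find({"type": "click"}):
    print(doc["user"], "visited", doc["page"])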

IV. APPLICATION
eBay.com uses two data warehouses, at 7.5 petabytes and 40 petabytes, as well as a 40PB Hadoop cluster for search, consumer recommendations, and merchandising.
Amazon.com handles millions of back-end operations every day, as well as queries from more than half a million third-party sellers. The core technology that keeps Amazon running is Linux-based, and as of 2005 it had the world's three largest Linux databases, with capacities of 7.8 TB, 18.5 TB, and 24.7 TB.
Walmart handles more than 1 million customer transactions every hour, which are imported into databases estimated to contain more than 2.5 petabytes (2560 terabytes) of data, the equivalent of 167 times the information contained in all the books in the US Library of Congress.
Facebook handles 50 billion photos from its user base.
The FICO Falcon Credit Card Fraud Detection System protects 2.1 billion active accounts world-wide.
The volume of business data worldwide, across all companies, doubles every 1.2 years, according to estimates.




Windermere Real Estate uses anonymous GPS signals from nearly 100 million drivers to help new home buyers determine their typical drive times to and from work at various times of the day.
According to the TCS 2013 Global Trend Study, improved supply planning and product quality are the greatest benefits of big data for manufacturing.
Big data provides an infrastructure for transparency in the manufacturing industry: the ability to unravel uncertainties such as inconsistent component performance and availability.
Predictive manufacturing, as an applicable approach toward near-zero downtime and transparency, requires vast amounts of data and advanced prediction tools to systematically process data into useful information.
A conceptual framework of predictive manufacturing begins with data acquisition, where different types of sensory data are available, such as acoustics, vibration, pressure, current, voltage, and controller data. This vast amount of sensory data, in addition to historical data, constitutes the big data in manufacturing.
The generated big data acts as the input to predictive tools and preventive strategies such as Prognostics and Health Management (PHM).

V. FUTURE SCOPE
Spending on software firms specializing only in data management and analytics exceeds $15 billion. This industry on its own is worth more than $100 billion and is growing at almost 10% a year, roughly twice as fast as the software business as a whole.
In February 2012, the open source analyst firm Wikibon released the first market forecast for big data, listing $5.1B in revenue for 2012, with growth to $53.4B in 2017.
The McKinsey Global Institute estimates that data volume is growing 40% per year, and will grow 44x between 2009 and 2020.
Real-time big data isn't just a process for storing petabytes or exabytes of data in a data warehouse; it's about the ability to make better decisions and take meaningful actions at the right time.
Fast-forward to the present, and technologies like Hadoop give you the scale and flexibility to store data before you know how you are going to process it. Technologies such as MapReduce, Hive, and Impala enable you to run queries without changing the data structures underneath.
Our newest research finds that organizations are using big data to target customer-centric outcomes, tap into internal data, and build a better information ecosystem.
Big data is already an important part of the $64 billion database and data analytics market. It offers commercial opportunities of a scale comparable to enterprise software in the late 1980s, the Internet boom of the 1990s, and the social media explosion of today.

VI. CONCLUSION
Even though it hasn't been long since the advent of big data, these attributes add up to a new era. It is clear from our research that large organizations across industries are joining the data economy. They are not keeping traditional analytics and big data separate, but are combining them to form a new synthesis. Some aspects of Analytics 3.0 will no doubt continue to emerge, but organizations need to begin transitioning now to the new model. It means change in skills, leadership, organizational structures, technologies, and architectures. It is perhaps the most sweeping change in what we do to get value from data since the 1980s.
It's important to remember that the primary value from big data comes not from the data in its raw form, but from the processing and analysis of it, and from the insights, products, and services that emerge from that analysis. The sweeping changes in big data technologies and management approaches need to be accompanied by similarly dramatic shifts in how data supports decisions and product/service innovation. There is little doubt that analytics can transform organizations, and the firms that lead the 3.0 charge will seize the most value.
REFERENCES
[1] www.Slideshare.com
[2] www.wikipedia.com
[3] www.computereducation.org
Books
[4] "Big Data" by Viktor Mayer-Schonberger.
[5] "The Little Book of Big Data" by Noreen Burlingame.
[6] "Mining of Massive Datasets" by Anand Rajaraman and Jeffrey David Ullman.
[7] NewVantage Partners, "Big Data Executive Survey: Themes and Trends," 2012.
[8] Peter Evans and Marco Annunziata, "Industrial Internet: Pushing the Boundaries of Minds and Machines," GE report, Nov. 26, 2012. www.ge.com/docs/chapters/Industrial_Internet.pdf
[9] The M Group's use of big data is described in Joel Schectman, "Ad Firm Finds Way to Cut Big Data Costs," Wall Street Journal CIO Journal website, February 8, 2013. http://blogs.wsj.com/cio/2013/02/08/ad-firm-finds-way-to-cut-big-data-costs/
[10] Kerem Tomak, in "Two Expert Perspectives on High-Performance Analytics," Intelligence Quarterly (a SAS publication), 2nd quarter 2012, p. 6.
[11] Tom Vanderbilt, "Let the Robot Drive: The Autonomous Car of the Future Is Here," Wired, January 20, 2012. http://www.wired.com/magazine/2012/01/ff_autonomouscars/
[12] Andrew Leonard, "How Netflix Is Turning Viewers into Puppets," February 1, 2013. http://www.salon.com/2013/02/01/how_netflix_is_turning_viewers_into_puppets/
[13] Open source solutions are known as "projects" because they are developed jointly by a community of contributors. Thus, they represent a collection of diverse and often far-flung activities that, when unified, comprise a holistic solution. Because they are built by a community of developers who are typically unpaid for their work (many accept donations), these projects are often free of charge to individuals or companies who contribute additional functionality or guidance to the community. This is the opposite of proprietary software solutions, which are pre-packaged as products with finite release schedules and more rigid pricing models. By their very nature, open source projects are ongoing until the community stops using the software and/or the members of the developer community stop contributing to them.
[14] SAS 2013 Big Data Survey, page 1: http://www.sas.com/resources/whitepaper/wp_58466.pdf
[15] "Big Data: The Next Frontier for Innovation, Competition, and Creativity," McKinsey Global Institute, 2011.
[16] Cisco Systems, "Cisco Visual Networking Index: Global Mobile Data Traffic Forecast Update," February 6, 2013. http://www.cisco.com/en/US/solutions/collateral/ns341/ns525/ns537/ns705/ns827/white_paper_c11-520862.html
[17] For the complete story on this study, see http://wikibon.org/wiki/v/Financial_Comparison_of_Big_Data_MPP_Solution_and_Data_Warehouse_Appliance.
[18] SAS 2013 Big Data Survey, page 4: http://www.sas.com/resources/whitepaper/wp_58466.pdf
