ABSTRACT--Big data is an all-encompassing term for any collection of data sets so large and complex that it becomes difficult to process them using on-hand data management tools or traditional data processing applications. The challenges include capture, curation, storage, search, sharing, transfer, analysis, and visualization. The trend toward larger data sets is due to the additional information derivable from analysis of a single large set of related data, as compared to separate smaller sets with the same total amount of data; such analysis allows correlations to be found to "spot business trends, prevent diseases, combat crime and so on." At multiple terabytes in size, the text and images of Wikipedia are a classic example of big data. One challenge for large enterprises is determining who should own big data initiatives that straddle the entire organization. Big data is difficult to work with using most relational database management systems and desktop statistics and visualization packages, requiring instead "massively parallel software running on tens, hundreds, or even thousands of servers." What is considered "big data" varies with the capabilities of the organization managing the set, and with the capabilities of the applications traditionally used to process and analyze data sets in its domain: "For some organizations, facing hundreds of gigabytes of data for the first time may trigger a need to reconsider data management options. For others, it may take tens or hundreds of terabytes before data size becomes a significant consideration."
I. INTRODUCTION
Big data burst upon the scene in the first decade of the 21st century, and the first organizations to embrace it were online and startup firms. Arguably, firms like Google, eBay, LinkedIn, and Facebook were built around big data from the beginning. They didn't have to reconcile or integrate big data with more traditional sources of data and the analytics performed upon them, because they didn't have those traditional forms. They didn't have to merge big data technologies with their traditional IT infrastructures because those infrastructures didn't exist. Big data could stand alone, big data analytics could be the only focus of analytics, and big data technology architectures could be the only architecture.

Consider, however, the position of large, well-established businesses. Big data in those environments shouldn't be separate; it must be integrated with everything else that's going on in the company. Analytics on big data have to coexist with analytics on other types of data. Hadoop clusters have to do their work alongside IBM mainframes. Data scientists must somehow get along and work jointly with mere quantitative analysts.

In order to understand this coexistence, we interviewed 20 large organizations in the early months of 2013 about how big data fit into their overall data and analytics environments. Overall, we found the expected coexistence; in not a single one of these large organizations was big data being managed separately from other types of data and analytics. The integration was in fact leading to a new management perspective on analytics, which we'll call Analytics 3.0. In this paper we'll describe the overall context for how organizations think about big data, the organizational structure and skills required for it, and so on. We'll conclude by describing the Analytics 3.0 era.

Big data may be new for startups and for online firms, but many large firms view it as something they have been wrestling with for a while. Some managers appreciate the innovative nature of big data, but more find it business as usual or part of a continuing evolution toward more data. They have been adding new forms of data to their systems and models for many years, and don't see anything revolutionary about big data. Put another way, many were pursuing big data before big data was big.

II. METHODOLOGY
A. Definition
Big data usually includes data sets with sizes beyond the ability of commonly used software tools to capture, curate, manage, and process within a tolerable elapsed time. Big data size is a constantly moving target, as of 2012 ranging from a few dozen terabytes to many petabytes in a single data set. In a 2001 research report and related lectures, META Group (now Gartner) analyst Doug Laney defined data growth challenges and opportunities as three-dimensional, i.e. increasing volume (amount of data), velocity (speed of data in and out), and variety (range of data types and sources). Gartner, and now much of the industry, continue to use this "3Vs" model for describing big data. In 2012, Gartner updated its definition as follows: "Big data is high volume, high velocity, and/or high variety information assets that require new forms of processing to enable enhanced decision making, insight discovery and process optimization." Additionally, some organizations add a fourth V, "veracity," to describe data quality. While Gartner's definition (the 3Vs) is still widely used, the growing maturity of the concept fosters a sounder distinction between big data and business intelligence, regarding data and their use: business intelligence
uses descriptive statistics on data with high information density to measure things, detect trends, and so on; big data uses inductive statistics and concepts from nonlinear system identification to infer laws (regressions, nonlinear relationships, and causal effects) from large data sets, in order to reveal relationships and dependencies and to predict outcomes and behaviours. Big data has also been defined as "a large volume of unstructured data which cannot be handled by standard database management systems like DBMS, RDBMS or ORDBMS."
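To make the contrast concrete, here is a minimal single-machine sketch of the inductive style described above, using only numpy and synthetic data (the quadratic "law" and all figures are invented for illustration): fit a regression to a large sample to infer a law, then predict unseen outcomes.

import numpy as np

# Synthetic "large" data set obeying a hidden nonlinear law: y = 2x^2 + 3x + noise.
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, 100_000)
y = 2 * x**2 + 3 * x + rng.normal(0, 5, x.size)

# Business-intelligence style (descriptive): summarize the data that is there.
print("mean of y:", y.mean())

# Big data style (inductive): infer the law itself from the data...
a, b, c = np.polyfit(x, y, deg=2)   # least-squares fit of a quadratic
print("inferred law: y = %.2f x^2 + %.2f x + %.2f" % (a, b, c))

# ...and use it to predict outcomes for inputs never observed.
print("predicted y at x = 12:", a * 12**2 + b * 12 + c)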
Sensors at the Large Hadron Collider (LHC) produce far more data than can be recorded; as a result, the experiments work with less than 0.001% of the sensor stream data, and even so the data flow from all four LHC experiments represents a 25-petabyte annual rate before replication (as of 2012), becoming nearly 200 petabytes after replication. If all sensor data were recorded, the flow would be extremely hard to work with: it would exceed a 150-million-petabyte annual rate, or nearly 500 exabytes per day, before replication. To put the number in perspective, this is equivalent to 500 quintillion (5×10^20) bytes per day, almost 200 times more than all the other data sources in the world combined. The Square Kilometre Array is a telescope which consists of millions of antennas and is expected to be operational by 2024. Collectively, these antennas are expected to gather 14 exabytes and store one petabyte per day. It is considered to be one of the most ambitious scientific projects ever undertaken.
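A quick back-of-the-envelope check of these rates, as a few lines of Python (orders of magnitude only; the constants come from the figures quoted above):

# Sanity-check the LHC figures quoted above (orders of magnitude only).
PB = 10**15                              # petabyte in bytes
annual_rate = 150_000_000 * PB           # 150 million PB per year, unfiltered
per_day = annual_rate / 365
print(f"{per_day:.2e} bytes/day")        # ~4.1e+20, consistent with ~5x10^20

# Recording less than 0.001% of the stream caps the kept flow at:
ceiling = annual_rate * 0.00001          # 0.001% = 1e-5
print(f"{ceiling / PB:.0f} PB/year ceiling")  # ~1,500 PB; the quoted 25 PB fits under it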
B. Characteristics of big data
Every day, we create 2.5 quintillion bytes of data, so much that 90% of the data in the world today has been created in the last two years alone. This data comes from everywhere: sensors used to gather climate information, posts to social media sites, digital pictures and videos, purchase transaction records, and cell phone GPS signals, to name a few. This data is big data. Big data spans three dimensions: volume, velocity, and variety.
1) Volume: Enterprises are awash with ever-growing data of all types, easily amassing terabytes, even petabytes, of information. For example: turn the 12 terabytes of Tweets created each day into improved product sentiment analysis, or convert 350 billion annual meter readings to better predict power consumption.

2) Velocity: Sometimes two minutes is too late. For time-sensitive processes such as catching fraud, big data must be used as it streams into the enterprise in order to maximize its value: scrutinize the 5 million trade events created each day to identify potential fraud, or analyze 500 million daily call detail records in real time to predict customer churn faster (a minimal streaming sketch follows Fig. 1).
3) Variety: Big data is any type of data: structured and unstructured data such as text, sensor data, audio, video, click streams, log files, and more. New insights are found when analyzing these data types together: monitor hundreds of live video feeds from surveillance cameras to target points of interest, or exploit the 80% growth in image, video, and document data to improve customer satisfaction.

Big data is more than simply a matter of size; it is an opportunity to find insights in new and emerging types of data and content, to make a business more agile, and to answer questions that were previously considered beyond reach. Until now, there was no practical way to harvest this opportunity. Today, IBM's platform for big data uses state-of-the-art technologies, including patented advanced analytics, to open the door to a world of possibilities.
Fig. 1. Big data spans three dimensions.
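To make the velocity point concrete, here is a minimal streaming sketch in plain Python (the event feed, field names, window size, and threshold are all invented for illustration; production systems would use a dedicated stream processor): events are scored the moment they arrive, not after a batch load.

from collections import deque

def stream_fraud_check(events, window=1000, z=4.0):
    # Flag trade events whose amount deviates sharply from a rolling window.
    recent = deque(maxlen=window)     # rolling window of recent amounts
    for event in events:              # events arrive one at a time
        amt = event["amount"]
        if len(recent) >= 30:         # need some history before scoring
            mean = sum(recent) / len(recent)
            var = sum((a - mean) ** 2 for a in recent) / len(recent)
            if var > 0 and abs(amt - mean) / var ** 0.5 > z:
                yield event           # flag as potential fraud immediately
        recent.append(amt)

# Illustrative usage with synthetic events: a stable feed, then one outlier.
feed = [{"id": i, "amount": 100.0 + (i % 7)} for i in range(100)]
feed.append({"id": 999, "amount": 10_000.0})
for alert in stream_fraud_check(feed):
    print("potential fraud:", alert)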
III. CASE STUDY

No single business trend in the last decade has as much potential impact on incumbent IT investments as big data. Indeed, big data promises (or threatens, depending on how you view it) to upend legacy technologies at many big companies. As IT modernization initiatives gain traction and the accompanying cost savings hit the bottom line, executives in both line-of-business and IT organizations are getting serious about the technology solutions that are tied to big data. Companies are not only replacing legacy technologies in favor of open source solutions like Apache Hadoop; they are also replacing proprietary hardware with commodity hardware, custom-written applications with packaged solutions, and decades-old business intelligence tools with data visualization. This new combination of big data platforms, projects, and tools is driving new business innovations, from faster product time-to-market, to an authoritative (finally!) single view of the customer, to custom-packaged product bundles and beyond.

A. Big data stack
As with all strategic technology trends, big data introduces highly specialized features that set it apart from legacy systems. Each component of the stack is optimized for the large, unstructured, and semi-structured nature of big data. Working together, these moving parts comprise a holistic solution that's fine-tuned for specialized, high-performance processing and storage.
Fig. 2. Big data stack.
1) Storage:-Storing large and diverse amounts of data on disk is becoming more cost-effective as the disk technologies become more commoditized and efficient. Companies like EMC sell storage solutions that allow disks to be added quickly and cheaply, thereby scaling storage in lock step with growing data volumes. Indeed, many big company executives see Hadoop as a low-cost alternative for the archival and quick retrieval of large amounts of historical data.
2) Platform Infrastructure:-The big data platform is typically the collection of functions that comprise high-performance processing of big data. The platform includes capabilities to integrate, manage, and apply sophisticated computational processing to the data. Typically, big data platforms include a Hadoop (or similar open-source project) foundation. Hadoop was designed and built to optimize complex manipulation of large amounts of data while vastly exceeding the price/performance of traditional databases. Hadoop is a unified storage and processing environment that is highly scalable to large and complex data volumes.
3) Data:-The expanse of big data is as broad and complex as the applications for it. Big data can mean human genome sequences, oil well sensors, cancer cell behaviors, locations of products on pallets, social media interactions, or patient vital signs, to name a few examples. The data layer in the stack implies that data is a separate asset, warranting discrete management and governance. To that end, a 2013 survey of data management professionals [viii] found that of the 339 companies responding, 71 percent admitted that they have yet to begin planning their big data strategies. The respondents cited concerns about data quality, reconciliation, timeliness, and security as significant barriers to big data adoption.

4) Application Code, Functions, and Services:-Just as big data varies with the business application, the code used to manipulate and process the data can vary. Hadoop uses a processing engine called MapReduce not only to distribute data across the disks, but to apply complex computational instructions to that data. In keeping with the high-performance capabilities of the platform, MapReduce instructions are processed in parallel across various nodes on the big data platform, and the results are then quickly assembled to provide a new data structure or answer set. An example of a big data application in Hadoop might be to find all the customers who "like" us on social media: a text mining application might crunch through social media transactions, searching for words such as "fan," "love," "bought," or "awesome," and consolidate a list of key influencer customers (a minimal sketch of this pattern follows).
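As a single-machine illustration of the MapReduce pattern just described (Hadoop would run the same two phases in parallel across many nodes; the posts and keyword list below are made up):

from collections import defaultdict

# Hypothetical social media posts keyed by customer.
posts = [
    ("alice", "I love this product, total fan"),
    ("bob", "bought one yesterday, awesome"),
    ("carol", "meh"),
]
KEYWORDS = {"fan", "love", "bought", "awesome"}

# Map phase: emit (customer, 1) for every keyword occurrence in a post.
def map_post(record):
    customer, text = record
    return [(customer, 1) for word in text.lower().split()
            if word.strip(",.") in KEYWORDS]

# Shuffle/reduce phase: sum the emitted counts per customer.
def reduce_counts(pairs):
    totals = defaultdict(int)
    for customer, count in pairs:
        totals[customer] += count
    return totals

mapped = [pair for record in posts for pair in map_post(record)]
influencers = reduce_counts(mapped)
print(sorted(influencers.items(), key=lambda kv: -kv[1]))
# [('alice', 2), ('bob', 2)] -- carol posted no keywords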
5) Business View:-Depending on the big data application, additional processing via MapReduce or custom Java code might be used to construct an intermediate data structure, such as a statistical model, a flat file, a relational table, or a cube. The resulting structure may be intended for additional analysis, or may be queried by a traditional SQL-based query tool. This business view ensures that big data is consumable by the tools and the knowledge workers that already exist in an organization. One Hadoop project, called Hive, enables raw data to be restructured into relational tables that can be accessed via SQL and incumbent SQL-based toolsets, capitalizing on the skills that a company may already have in-house (a single-machine analogue is sketched below).
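Hive does this restructuring at cluster scale; the sketch below uses Python's built-in sqlite3 as a single-machine stand-in for the same idea, projecting raw semi-structured records into a relational business view (the log format and field names are hypothetical):

import json
import sqlite3

# Raw, semi-structured records as they might land in HDFS.
raw_lines = [
    '{"user": "alice", "action": "purchase", "amount": 42.5}',
    '{"user": "bob", "action": "view"}',
    '{"user": "alice", "action": "purchase", "amount": 19.0}',
]

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE events (user TEXT, action TEXT, amount REAL)")

# Restructure: parse each raw record and project it onto relational columns.
for line in raw_lines:
    rec = json.loads(line)
    db.execute("INSERT INTO events VALUES (?, ?, ?)",
               (rec["user"], rec["action"], rec.get("amount")))

# The "business view": familiar SQL over what began as raw text.
for row in db.execute("""SELECT user, SUM(amount) AS spent
                         FROM events WHERE action = 'purchase'
                         GROUP BY user"""):
    print(row)   # ('alice', 61.5)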
6) Presentation and Consumption: -One of the more profound developments in the world of big data is the adoption of so-called data visualization. Unlike the specialized business intelligence technologies and unwieldy spreadsheets of yesterday, data visualization tools allow the average business person to view information in an intuitive, graphical way.
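As a small illustration of the kind of graphical view such tools produce, here is a sketch using matplotlib purely as a stand-in (real deployments use interactive visualization products; the regional sales figures are invented):

import matplotlib.pyplot as plt

# Hypothetical output of a big data aggregation, e.g. the query sketched above.
regions = ["North", "South", "East", "West"]
sales = [410, 380, 520, 290]

fig, ax = plt.subplots()
ax.bar(regions, sales)                  # intuitive graphical view of the answer set
ax.set_ylabel("Sales (thousands of $)")
ax.set_title("Sales by region")
plt.savefig("sales_by_region.png")      # or plt.show() in an interactive session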
B. Organizational Structures for Big Data
The most likely organizational structures to initiate or accommodate big data technologies are either existing analytics groups (including groups with an operations research title), or innovation or architectural groups within IT organizations. In many cases these central services organizations are aligned in big data initiatives with analytically oriented functions or business units: marketing, for example, or the online businesses of banks or retailers (see the Big Data at Macys.com case study). Some of these business units have IT or analytics groups of their own. The organizations whose approaches seemed most effective and likely to succeed had close relationships between the business groups addressing big data and the IT organizations supporting them.
C. Big Data Skill Scarcity
In terms of skills, most of these large firms are augmenting (or trying to augment) their existing analytical staffs with data scientists, who possess a higher level of IT capability, and the ability to manipulate big data technologies specifically, than traditional quantitative analysts. These skills might include natural language processing or text mining, video or image analytics, and visual analytics. Many of the data scientists are also able to code in scripting languages and tools like Python, Pig, and Hive. In terms of backgrounds, some have Ph.D.s in scientific fields; others are simply strong programmers with some analytical skills. Many of our interviewees questioned whether one data scientist could possess all the needed skills, and were taking a team-based approach to assembling them.
D. Types of tools used in big data
Where is processing hosted? Distributed servers / cloud (e.g. Amazon EC2).
Where is data stored? Distributed storage (e.g. Amazon S3).
What is the programming model? Distributed processing (e.g. MapReduce).
How is data stored and indexed? High-performance schema-free databases (e.g. MongoDB); a small example of this schema-free style follows the list.
What operations are performed on data? Analytic / semantic processing.
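A minimal sketch of the schema-free style named above, using pymongo (it assumes a MongoDB server running on localhost; the database, collection, and field names are made up):

from pymongo import MongoClient

client = MongoClient("localhost", 27017)   # assumes a local MongoDB instance
events = client.demo_db.events              # database/collection created on first write

# Schema-free: documents in the same collection need not share fields.
events.insert_one({"user": "alice", "action": "purchase", "amount": 42.5})
events.insert_one({"user": "bob", "action": "view", "referrer": "search"})

# Query by example; no schema migration was needed for the new "referrer" field.
for doc in events.find({"action": "purchase"}):
    print(doc["user"], doc["amount"])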
IV. APPLICATION

eBay.com uses two data warehouses, at 7.5 petabytes and 40PB, as well as a 40PB Hadoop cluster for search, consumer recommendations, and merchandising (see "Inside eBay's 90PB data warehouse"). Amazon.com handles millions of back-end operations every day, as well as queries from more than half a million third-party sellers. The core technology that keeps Amazon running is Linux-based, and as of 2005 Amazon had the world's three largest Linux databases, with capacities of 7.8 TB, 18.5 TB, and 24.7 TB. Walmart handles more than 1 million customer transactions every hour, which are imported into databases estimated to contain more than 2.5 petabytes (2560 terabytes) of data, the equivalent of 167 times the information contained in all the books in the US Library of Congress. Facebook handles 50 billion photos from its user base. The FICO Falcon Credit Card Fraud Detection System protects 2.1 billion active accounts worldwide. The volume of business data worldwide, across all companies, doubles every 1.2 years, according to estimates.
Windermere Real Estate uses anonymous GPS signals from nearly 100 million drivers to help new home buyers determine their typical drive times to and from work throughout various times of the day. According to TCS's 2013 Global Trend Study, the greatest benefits of big data for manufacturing are improvements in supply planning and product quality. Big data provides an infrastructure for transparency in the manufacturing industry, which is the ability to unravel uncertainties such as inconsistent component performance and availability. Predictive manufacturing, as an applicable approach toward near-zero downtime and transparency, requires vast amounts of data and advanced prediction tools to systematically process data into useful information. A conceptual framework of predictive manufacturing begins with data acquisition, where different types of sensory data are available to acquire, such as acoustics, vibration, pressure, current, voltage, and controller data. This sensory data, in addition to historical data, constitutes big data in manufacturing. The generated big data acts as the input to predictive tools and preventive strategies such as Prognostics and Health Management (PHM).
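As a toy illustration of the prediction step in such a PHM pipeline (the vibration data, window sizes, and threshold are invented; real systems use far richer models):

# Toy health check: warn when the recent average of a vibration sensor
# drifts well above its historical baseline.
def health_alert(readings, baseline_n=100, recent_n=10, factor=1.5):
    baseline = sum(readings[:baseline_n]) / baseline_n
    recent = sum(readings[-recent_n:]) / recent_n
    return recent > factor * baseline   # True -> schedule maintenance

# Synthetic vibration amplitudes: stable operation, then gradual degradation.
readings = [1.0] * 100 + [1.0 + 0.1 * i for i in range(20)]
print(health_alert(readings))   # True: recent average ~2.5x the baseline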
V. FUTURE SCOPE

Major technology vendors have spent more than $15 billion on software firms specializing in data management and analytics. This industry on its own is worth more than $100 billion and is growing at almost 10% a year, roughly twice as fast as the software business as a whole. In February 2012, the open source analyst firm Wikibon released the first market forecast for big data, listing $5.1B in revenue in 2012 with growth to $53.4B in 2017. The McKinsey Global Institute estimates that data volume is growing 40% per year, and will grow 44x between 2009 and 2020.

Real-time big data isn't just about storing petabytes or exabytes of data in a data warehouse; it's about the ability to make better decisions and take meaningful actions at the right time. Fast forward to the present, and technologies like Hadoop give you the scale and flexibility to store data before you know how you are going to process it. Technologies such as MapReduce, Hive, and Impala enable you to run queries without changing the data structures underneath.

Our newest research finds that organizations are using big data to target customer-centric outcomes, tap into internal data, and build a better information ecosystem. Big data is already an important part of the $64 billion database and data analytics market. It offers commercial opportunities of a scale comparable to enterprise software in the late 1980s, the Internet boom of the 1990s, and the social media explosion of today.
VI. CONCLUSION

Even though it hasn't been long since the advent of big data, these attributes add up to a new era. It is clear from our research that large organizations across industries are joining the data economy. They are not keeping traditional analytics and big data separate, but are combining them to form a new synthesis. Some aspects of Analytics 3.0 will no doubt continue to emerge, but organizations need to begin transitioning now to the new model. It means change in skills, leadership, organizational structures, technologies, and architectures. It is perhaps the most sweeping change in what we do to get value from data since the 1980s.

It's important to remember that the primary value from big data comes not from the data in its raw form, but from the processing and analysis of it and the insights, products, and services that emerge from that analysis. The sweeping changes in big data technologies and management approaches need to be accompanied by similarly dramatic shifts in how data supports decisions and product/service innovation. There is little doubt that analytics can transform organizations, and the firms that lead the 3.0 charge will seize the most value.

REFERENCES

[1] www.Slideshare.com
[2] www.wikipedia.com
[3] www.computereducation.org
[4] Viktor Mayer-Schonberger, Big Data.
[5] Noreen Burlingame, The Little Book of Big Data.
[6] Anand Rajaraman and Jeffrey David Ullman, Mining of Massive Datasets.
[7] NewVantage Partners, "Big Data Executive Survey: Themes and Trends," 2012.
[8] Peter Evans and Marco Annunziata, "Industrial Internet: Pushing the Boundaries of Minds and Machines," GE report, Nov. 26, 2012. www.ge.com/docs/chapters/Industrial_Internet.pdf
[9] The M Group's use of big data is described in Joel Schectman, "Ad Firm Finds Way to Cut Big Data Costs," Wall Street Journal CIO Journal website, February 8, 2013. http://blogs.wsj.com/cio/2013/02/08/ad-firm-finds-way-to-cut-big-data-costs/
[10] Kerem Tomak, in "Two Expert Perspectives on High-Performance Analytics," Intelligence Quarterly (a SAS publication), 2nd quarter 2012, p. 6.
[11] Tom Vanderbilt, "Let the Robot Drive: The Autonomous Car of the Future Is Here," Wired, January 20, 2012. http://www.wired.com/magazine/2012/01/ff_autonomouscars/
[12] Andrew Leonard, "How Netflix Is Turning Viewers into Puppets," February 1, 2013. http://www.salon.com/2013/02/01/how_netflix_is_turning_viewers_into_puppets/
[13] Open source solutions are known as "projects" because they are developed jointly by a community of contributors. Thus, they represent a collection of diverse and often far-flung activities that, when unified, comprise a holistic solution. Because they are built by a community of developers who are typically unpaid for their work (many accept donations), these projects are often free of charge to individuals or companies who contribute additional functionality or guidance to the community. This is the opposite of proprietary software solutions, which are pre-packaged as products with finite release schedules and more rigid pricing models. By their very nature, open source projects are ongoing, until the community stops using the software and/or the members of the developer community stop contributing to them.
[14] SAS 2013 Big Data Survey, page 1. http://www.sas.com/resources/whitepaper/wp_58466.pdf
[15] "Big Data: The Next Frontier for Innovation, Competition, and Creativity," McKinsey Global Institute, 2011.
[16] Cisco Systems, "Cisco Visual Networking Index: Global Mobile Data Traffic Forecast Update," February 6, 2013. http://www.cisco.com/en/US/solutions/collateral/ns341/ns525/ns537/ns705/ns827/white_paper_c11-520862.html
[17] For the complete story on this study, see http://wikibon.org/wiki/v/Financial_Comparison_of_Big_Data_MPP_Solution_and_Data_Warehouse_Appliance
[18] SAS 2013 Big Data Survey, page 4. http://www.sas.com/resources/whitepaper/wp_58466.pdf