The Future of Business Intelligence in T
Ooko John
IT Consultant, Sports Consultant, Author
Managing Director
StarLight Entertainment®
http://www.starlight-entertainment.biz/
Copyright© 2015
When I ventured into the IT world almost a quarter century ago, the concept of a Decision Support System
(DSS) was defined by two distinct features: data and information. Fast forward to our world today, and
technological and socio-economic demands have sent pundits in this domain of information science
scurrying for better ways to redefine this very service. Near-universal terms such as Business Intelligence (BI)
and Enterprise Information Management (EIM) have emerged, only to seem to be phased out of
technological relevance. As for the well-known facets of DSS of the good years gone by, newer terms
such as knowledge and actionable insight have crept in. It is in this spirit of technological ecstasy that the
author of one of the books on one of the many DSS tools noted how, in the first decade of the 21st century, the
terms data warehouse and data mart were suddenly and without warning copied and pasted all over
every database professional’s résumé.
A brief look into the origins of DSS shows how the model of the 1990s required historical data, both
summarized and at a transaction level of detail. Users needed to be able to query these massive amounts
of data easily. Often, they never knew what relationships existed between data elements they were
searching for. One data warehousing anecdote talked of how a retail chain in the US learnt that new
fathers often shopped for diapers and beer in the same trip. Sales of both products soared when they
were shelved next to each other. This use of data warehousing technology was credited to the process
of BI, or Corporate Information Factory (CIF) as it’s referred to in some quarters.
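To make the diapers-and-beer anecdote concrete, here is a minimal sketch of how such a relationship might surface from transaction data. The baskets are invented for illustration, and the lift measure used here is my own choice of technique rather than anything the retailer in the anecdote is known to have used. A lift above 1 suggests two products are bought together more often than chance would predict.

```python
from collections import Counter
from itertools import combinations

# Hypothetical shopping baskets, invented purely for illustration.
BASKETS = [
    {"diapers", "beer", "milk"},
    {"diapers", "beer"},
    {"milk", "bread"},
    {"diapers", "beer", "bread"},
    {"milk"},
]


def pair_lift(baskets):
    """Return lift for every product pair that co-occurs in at least one basket."""
    n = len(baskets)
    item_counts = Counter(item for basket in baskets for item in basket)
    pair_counts = Counter(
        pair for basket in baskets for pair in combinations(sorted(basket), 2)
    )
    lifts = {}
    for (a, b), together in pair_counts.items():
        support_pair = together / n
        expected = (item_counts[a] / n) * (item_counts[b] / n)
        lifts[(a, b)] = support_pair / expected
    return lifts


lifts = pair_lift(BASKETS)
```

In this toy data the pair ("beer", "diapers") scores well above 1, while ("beer", "milk") scores below 1.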
In the words of Inmon, who with Kimball is one of the two most easily recognized pioneers of the data
warehouse model, discussions around BI began with the evolution of information and DSS.
One of the subtle definitions of Business Intelligence appears in the book BI Solutions Using SSAS Tabular
Model Succinctly by Parikshit Savjani. Savjani writes that business intelligence is the process of converting
data into information so that business decision makers or analysts can make informed decisions better
and faster. He adds that although the term business intelligence is used more in the modern days of Big
Data and analytics, the concept is not new to the world. The same concept was previously known as
Executive Information System (EIS) and later known as Decision Support System (DSS).
To be fair, the term Business Intelligence was used for the first time as early as 1958, courtesy of one
of my former employers, IBM. This was in a research paper by H. P. Luhn on the treatment of text
analytics.
In the BI model, the source of data can be anything ranging from flat files to a normalized Online
Transaction Processing (OLTP) database system. The end products are usually reports that allow end users
to derive meaningful information by slicing and dicing facts.
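As a rough illustration of what slicing and dicing facts means in practice, the sketch below filters and aggregates a tiny fact table. The rows, column names and helper functions are all hypothetical, and a real BI tool would do this over a cube or a star schema rather than a Python list.

```python
from collections import defaultdict

# Hypothetical fact rows: (region, product, quarter, sales).
COLUMNS = ("region", "product", "quarter", "sales")
FACTS = [
    ("East", "Diapers", "Q1", 120.0),
    ("East", "Beer", "Q1", 80.0),
    ("West", "Diapers", "Q1", 95.0),
    ("West", "Beer", "Q2", 60.0),
    ("East", "Diapers", "Q2", 130.0),
]
INDEX = {name: position for position, name in enumerate(COLUMNS)}


def slice_facts(facts, **fixed):
    """Keep rows matching the fixed dimension values (one value = slice, several = dice)."""
    return [
        row for row in facts
        if all(row[INDEX[dim]] == value for dim, value in fixed.items())
    ]


def roll_up(facts, *dims):
    """Sum the sales measure grouped by the chosen dimensions."""
    totals = defaultdict(float)
    for row in facts:
        key = tuple(row[INDEX[dim]] for dim in dims)
        totals[key] += row[INDEX["sales"]]
    return dict(totals)


sales_by_region_q1 = roll_up(slice_facts(FACTS, quarter="Q1"), "region")
sales_by_product = roll_up(FACTS, "product")
```

The same two helpers answer different questions simply by changing which dimensions are fixed and which are grouped, which is the essence of the reports described above.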
To set us on the path of yet another discussion toward what we have today as Big Data, let’s start with
the analysis of the data warehouse.
From the many definitions of a data warehouse, the two easily adopted versions are from the business
and technical angles. The business angle describes this entity of data management as a centralized
repository of highly detailed, time series data from operational systems, specifically designed for analysis
and reporting. The technical angle on the other hand, describes a data warehouse as the single
organizational repository of enterprise-wide data, across many or all lines of business and subject areas.
This angle also adds that this repository contains massive and integrated data, and that it represents the
complete organizational view of information needed to run and understand the business. What stands
out, distinguishing the two definitions, is the word integrated. It appears only in the technical definition
of a data warehouse. Its presence there reflects the design scope of every data warehouse,
a subject in which IT professionals, not business professionals, specialize.
When it comes to designing and building a data warehouse, there exist two distinct approaches. There is
the top-down approach and the bottom-up approach. The two approaches were respectively developed
by William H Inmon and Ralph Kimball. The top-down approach is a more mature concept and is used
when the technology and the economic problems are well known. It benefits from the synergy between
the business subjects and provides a single version of the truth. It is also a systemic method which
minimizes integration problems. This model is, however, very expensive and less flexible. Grounded in the
philosophy of its pioneer Bill Inmon, it’s meant to respond to the requirements of all the users in the
organization, and not of any particular business department.
Further, Inmon argues that the classical System Development Life Cycle (SDLC) does not work
in the world of the DSS analyst. He points out that the SDLC assumes that requirements are known at the start
of design, or can at least be discovered. For the DSS analyst, by contrast, new requirements are
usually the last thing to be discovered in the DSS development life cycle. Consequently, the DSS analyst
starts with existing requirements, and factoring in new requirements is almost an impossibility.
The bottom-up approach to building a data warehouse is faster to deliver. It’s based on experiments and
prototypes. It’s flexible and allows the organization to go further with lower costs, building independent
data marts and evaluating the advantages of the system along the development process. Still, there could
be problems when trying to integrate the data marts in a consistent enterprise data warehouse. This
approach suits the vision of Ralph Kimball, who considers that the data warehouse have (yes have, the
Kimball version is constituted by numerous data marts) to be easily understood by the users and provide
correct answers as soon as possible. For this reason, the bottom-up approach starts from business
requirements, while the top-down approach has in view data integration and consistency at the level of
the entire enterprise.
Figure 1
The two data warehouse approaches can be combined to benefit from the advantages provided by each
one. From a software engineering point of view, methods such as the waterfall or the spiral approach can be
used in this regard. The waterfall approach requires a structured and systematic analysis at each step
before moving to the next phase of the project. The spiral approach, otherwise known as the iterative
approach, allows faster generation of more and more developed functional systems.
As we speak, the word from the IT industry indicates that the most adequate method for developing a
data warehouse is the iterative approach. It allows for repeated iterations, with new versions emerging
from iterations that involve consultation with the business subject areas. The method also provides a scalable
architecture, answers informational demands of the whole organization and reduces the possible risks.
Figure 2
Once the data warehouse paradigm was fully adopted by the IT community, there soon followed, in the
early 2000s, the acceptance of related concepts such as the Enterprise Data Warehouse (EDW), BI and Analytics.
These new concepts helped enterprises to transform raw data collections into actionable wisdom.
Analytics applications such as Customer Analytics, Financial Analytics, Risk Analytics, Product Analytics
and Health-care Analytics have become integral parts of the business applications architecture of any
enterprise. The main drawback however has been that all of these applications dealt with only one type
of data - structured data.
Historically, much of the talk about Big Data has centred on the three Vs of volume, velocity and variety.
Volume refers to the quantity of data one is working with, while velocity refers to how quickly that data
flows from its sources to its destinations. The destinations are the various consumers of the data. Variety, on
the other hand, refers to the diversity of data that is presented to the consumer. This may be marketing
data combined with financial data, patient data combined with medical research or complex
environmental data such as weather analysis.
Still, the most important V in Big Data features refers to value. The real measure of Big Data is not its size
but rather the scale of its impact. This is the value which Big Data delivers to one’s business enterprise or
personal life. Data for data’s sake serves almost no purpose at all. However, data that has a positive and
outsized impact on one’s business or personal life is truly Big Data.
Figure 3.
Applicability of OldSQL, NewSQL, and NoSQL – courtesy of Big Data Imperatives by Soumendra Mohanty, Madhu
Jagadeesh and Harsha Srivatsa
Big Data is structured, semi-structured, unstructured and raw data in many different formats. In some
cases, Big Data looks totally different from the clean scalar numbers and text we have stored in our data
warehouses over the last few decades. The evolution of Big Data arose from the limitations of existing
RDBMS solutions such as SQL Server, Oracle, DB2 and Teradata. Despite being able to manage large volumes
of data, these systems fell short when the data was unstructured or semi-structured. This is the data
with which our world today is fully obsessed, through social media content such as tweets and the
devices that carry it, such as mobile phones.
This complexity of dealing with Big Data also stems from the belief in some IT quarters that it comes in
four different forms. There is Big Data that is at rest, that which is in motion, that which exists in many
forms and that which is in doubt. This school of thought also sees three styles of Big Data integration in the
enterprise: bulk data movement, real-time data and federated data.
More important than the data itself, enterprises have shown that insights from Big Data can be monetized.
When an e-commerce site detects an increase in favourable clicks from an experimental online
advertisement, that insight can be taken to the bottom line immediately. This direct cause-and-effect is
easily understood by management. Any analytic research group that consistently demonstrates these
insights is looked upon as a strategic resource for the enterprise by the highest levels of its management
team. This growth in business awareness of the value of data-driven insights is rapidly spreading outward
from the e-commerce world to virtually every business segment.
Two architectures have emerged to address Big Data Analytics: the Extended RDBMS and
MapReduce/Hadoop architectures. These architectures are being implemented as completely separate
systems and in various interesting hybrid combinations.
Figure 4
Alternative hybrid architectures using both RDBMS and Hadoop - courtesy of reference from Ralph Kimball’s work
In order to implement the Extended RDBMS architecture for Big Data, vendors have added features that
address Big Data Analytics from a solid relational perspective. The two most significant architectural
developments have been the overtaking of the high end of the market with Massively Parallel Processing
(MPP) and the growing adoption of columnar storage. Additionally, RDBMS vendors have added some complex
User-Defined Functions (UDFs) to their syntax. In a similar vein, RDBMS vendors have allowed complex
data structures to be stored in individual fields.
These kinds of embedded complex data structures have been known as Binary Large Objects (BLOBs) for
many years by database analysts. It's important to note that relational databases have had a hard time
providing general support for interpreting BLOBs. This is because BLOBs do not fit the relational
paradigm exactly. Admittedly, an RDBMS provides some value by hosting the BLOBs in a structured framework.
However, much of the complex interpretation and computation on such database objects must be done
with specially crafted UDFs or some other form of BI mechanism.
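The idea of interpreting a BLOB through a specially crafted UDF can be sketched with Python's built-in sqlite3 module, which lets a Python function be registered and called from SQL. The table, the JSON payloads and the `json_field` function name are all assumptions made for this illustration. SQLite is standing in here for the far larger MPP systems discussed above.

```python
import json
import sqlite3


def json_field(blob, key):
    """Hypothetical UDF: decode a JSON document stored as a BLOB and return one field."""
    if blob is None:
        return None
    return json.loads(bytes(blob).decode("utf-8")).get(key)


conn = sqlite3.connect(":memory:")
# Register the Python function so that SQL statements can call it by name.
conn.create_function("json_field", 2, json_field)

conn.execute("CREATE TABLE events (id INTEGER PRIMARY KEY, payload BLOB)")
conn.executemany(
    "INSERT INTO events (payload) VALUES (?)",
    [
        (json.dumps({"user": "amina", "clicks": 3}).encode("utf-8"),),
        (json.dumps({"user": "brian", "clicks": 7}).encode("utf-8"),),
    ],
)

# The database only hosts the bytes; the UDF supplies the interpretation.
rows = conn.execute(
    "SELECT json_field(payload, 'user'), json_field(payload, 'clicks') "
    "FROM events ORDER BY id"
).fetchall()
```

The database engine itself knows nothing about the structure inside `payload`. All of the interpretation lives in the UDF, which is exactly the division of labour described above.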
Figure 5
The MapReduce architecture was developed by Google in the early 2000s. The purpose was to perform web
page searches across thousands of physically separated machines. Complete systems of this framework
can be implemented in a variety of languages, though Java is the most popular language of
implementation.
MapReduce is essentially a UDF execution framework in which the functions can be extremely complex. Originally targeted at building
Google's webpage search index, a MapReduce job can be defined for virtually any data structure and any
application. The target processors that actually perform the requested computation can be implemented
as identical clusters or as a heterogeneous mix of processor types referred to as a grid.
The most significant implementation of MapReduce is Apache Hadoop. Known simply as Hadoop, this
is an open source Apache project with thousands of contributors and a whole industry of diverse
applications. Hadoop runs natively on its own distributed file system known as HDFS. It can also read and
write to other file systems such as Amazon S3. Around Hadoop there is a family of tools such as Sqoop,
Scribe, Flume, Pig, Hive, HBase, Oozie and ZooKeeper. Conventional database vendors have implemented
interfaces that allow Hadoop jobs to be run over massively distributed instances of their databases.
A MapReduce job is always divided into two distinct phases of map and reduce. The overall input to a
MapReduce job is divided into many equal sized splits. Each of these is assigned a map task. The map
function is then applied to each record in each split. For larger jobs, the job tracker schedules these map
tasks in parallel.
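The split, map and reduce flow described above can be imitated on a single machine. The sketch below is a toy word count in plain Python, not Hadoop code. The function names, the split size and the use of a thread pool in place of a job tracker are all my own assumptions for illustration.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def make_splits(records, split_size):
    """Divide the overall input into splits of (at most) equal size."""
    return [records[i:i + split_size] for i in range(0, len(records), split_size)]


def map_task(split):
    """Apply the map function to each record in one split, emitting (word, 1) pairs."""
    pairs = []
    for record in split:
        for word in record.lower().split():
            pairs.append((word, 1))
    return pairs


def shuffle(mapped_outputs):
    """Group all emitted values by key, as happens between the two phases."""
    grouped = defaultdict(list)
    for pairs in mapped_outputs:
        for key, value in pairs:
            grouped[key].append(value)
    return grouped


def reduce_task(key, values):
    """Reduce function: collapse all values for one key into a single count."""
    return key, sum(values)


def run_job(records, split_size=2):
    splits = make_splits(records, split_size)
    # The thread pool stands in for a scheduler running map tasks in parallel.
    with ThreadPoolExecutor() as pool:
        mapped = list(pool.map(map_task, splits))
    grouped = shuffle(mapped)
    return dict(reduce_task(key, values) for key, values in grouped.items())


counts = run_job(["big data big value", "data at rest", "data in motion"])
```

Each map task sees only its own split, and the reduce step sees only the values for one key. That isolation is what allows a real cluster to spread the work across many machines.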
This function is heavily IT focused. It’s often done with traditional ETL tools like Informatica. It can
also be accomplished with Hadoop open-source processing engines like Storm or with SQL engines
such as Hive.
Today’s data warehouse still builds on the founding principles of an enterprise version of truth
and a single data repository, but its design faces additional challenges. It must be able to address new data
types, new volumes, new data-quality levels, new performance needs, new metadata and new user
requirements. All of this follows from the arrival of Big Data in our midst.
In building this new data warehouse, businesses need to adapt to radical thinking. Businesses will need to
develop a physical architecture that will not be constrained by the boundaries of a single platform such as a
Relational Database Management System (RDBMS). That next-generation data warehouse, from the
view of physical architectural deployment, consists of a myriad of technologies that are data-driven from
an integration perspective. It’s also extremely flexible and scalable from a data architecture perspective.
Figure 6 shows the high-level components that illustrate the foundational building blocks of a data
warehouse in the Big Data era. It represents an integration of all the data in an enterprise.
Figure 6
Components of the next-generation data warehouse – courtesy of reference from Data Warehousing in the age of Big Data
by Krish Krishnan
The lowest tier represents the data. The next tier represents some of the technologies that integrate Big
Data across the diverse types and sources. The topmost layer represents the analytical layer which drives
the visualization needs of an enterprise.
At the data layer, the new-generation data warehouse is redesigned to consume legacy data from systems and
applications, and OLTP data from systems such as ERPs, CRMs and SCMs. Also considered for this consumption are
unstructured data, video data with its audio and metadata components, and static images such as X-ray and CAT
scans. Not to be left off this long list of data warehouse consumables are numerical and graphical data such as
seismic data, stock data and GPS data. Lastly, there is the new data explosion from social
media such as YouTube, Facebook, Twitter and LinkedIn, not to mention Google search.
Data consumption in the new data warehouse framework follows a six-step process of acquisition,
discovery, analysis, pre-processing, integration and visualization. It’s at the data discovery phase,
especially for unstructured data, where the complexity of data processing exists. Employed here are
special mathematical, statistical, text and data mining algorithmic analyses.
Figure 7
Enterprise data platform consisting of BDW and EDW – courtesy of Big Data Imperatives by Soumendra Mohanty, Madhu
Jagadeesh and Harsha Srivatsa
Big Data processing is implemented in a two-step process. The first step is a data-driven architecture that
includes analysis and design of data processing. The second step is the physical architecture
implementation.
In the data-driven integration architecture, all the data within the enterprise are categorized according to
type. Depending on its nature and its associated processing requirements, business logic rules encapsulate
and integrate the data into a series of program flows. This process incorporates enterprise metadata,
Master Data Management (MDM) and semantic technologies like taxonomies.
In the external data integration approach, the existing data processing and data warehouse platforms are
retained. A new data bus platform for processing Big Data is created in a new technology architecture.
This data bus is developed using metadata and semantic technologies. It creates a data integration
environment for data exploration and processing. In this architecture, workload processing is clearly
divided into processing Big Data in its own infrastructure while also processing the existing data
warehouse in its own infrastructure. This streamlining of workload helps maintain performance and data
quality.
The most typical option in the Big Data warehouse integration architecture employs Hadoop and RDBMS.
In this setup, a deployment connector is used to integrate Big Data and the traditional RDBMS platforms.
The Big Data processing platform is created in either Hadoop or a NoSQL store such as
MongoDB. Connectors, such as Sqoop between SQL Server and Hadoop, offer bridges to exchange data between
the two platforms. Workload processing in this architecture blends data processing across both platforms
to provide scalability and to reduce complexity.
Streamlining of data workload creates a scalable platform across both infrastructure layers. This allows
for seamless data discovery. The difficulty with this architecture is its dependency on the performance of
the connector. The connectors largely mimic Java Database Connectivity (JDBC) behaviour. Their
bandwidth to transport data can be a severe bottleneck considering the query complexity in the data
discovery process.
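A back-of-the-envelope calculation shows why connector bandwidth matters. The sketch below is only an estimate under stated assumptions: the volume and throughput figures are invented, decimal units are used (1 GB = 1000 MB), and parallel streams are assumed to scale linearly, which real connectors rarely achieve.

```python
def transfer_hours(data_gb, throughput_mb_per_s, parallel_streams=1):
    """Estimate hours needed to move data_gb through a connector.

    Assumes decimal units and perfectly linear scaling across streams.
    """
    total_mb = data_gb * 1000
    seconds = total_mb / (throughput_mb_per_s * parallel_streams)
    return seconds / 3600


# Hypothetical workload: 2 TB moved at 100 MB/s per stream.
single_stream = transfer_hours(2000, 100)
four_streams = transfer_hours(2000, 100, parallel_streams=4)
```

Under these assumptions a single stream needs several hours, so any discovery query that triggers a large cross-platform transfer inherits that delay.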
Figure 8
Integration-driven approach – courtesy of reference from Data Warehousing in the age of Big Data by Krish Krishnan
Data warehouse appliances emerged as a strong black-box architecture for processing workloads specific
to large-scale data in the last decade. One of the extensions of this architecture is the emergence of Big
Data appliances. These appliances are configured to handle the rigours of workloads and complexities of
Big Data and the current RDBMS architecture.
Figure 9
Conceptual Big Data appliance - courtesy of reference from Data Warehousing in the age of Big Data by Krish Krishnan
While the physical architectural implementation can differ among vendors like Teradata, Oracle, IBM and
Microsoft, the underlying concept remains the same. The technologies of Hadoop, NoSQL or a hybrid of
the two are employed to acquire, pre-process and store Big Data. The RDBMS layers are used to process
the output from the Hadoop and NoSQL layers. In-database MapReduce, R and RDBMS specific
translators, aided by special connectors, are used in the integrated architecture. These manage data
movement and transformation within the appliance.
Data Virtualisation
Yet another option to integrate Big Data with the old data warehouse technologies is through data
virtualisation. Data virtualisation refers to an integration technique that provides complete, high-quality,
and actionable information through virtual integration of data across multiple, disparate internal and
external sources. Instead of copying and moving existing source data into physical, integrated stores such as
data warehouses or data marts, data virtualisation creates a virtual or logical data store to deliver it to
business users and applications.
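The contrast between copying data and virtualising it can be sketched in a few lines. In the example below, two in-memory SQLite databases stand in for separate source systems, and the `VirtualCustomerView` class is a hypothetical logical layer that stores nothing itself. The names and the data are invented for illustration. Commercial data virtualisation products add query optimisation, caching and security on top of this basic idea.

```python
import sqlite3

# Two independent "source systems", simulated here as separate in-memory databases.
crm = sqlite3.connect(":memory:")
crm.execute("CREATE TABLE customers (id INTEGER, name TEXT)")
crm.executemany("INSERT INTO customers VALUES (?, ?)", [(1, "Amina"), (2, "Brian")])

billing = sqlite3.connect(":memory:")
billing.execute("CREATE TABLE invoices (customer_id INTEGER, amount REAL)")
billing.executemany(
    "INSERT INTO invoices VALUES (?, ?)", [(1, 250.0), (1, 100.0), (2, 75.0)]
)


class VirtualCustomerView:
    """Logical view: holds no data of its own, each call reads the live sources."""

    def __init__(self, crm_conn, billing_conn):
        self.crm = crm_conn
        self.billing = billing_conn

    def revenue_by_customer(self):
        names = dict(self.crm.execute("SELECT id, name FROM customers"))
        totals = self.billing.execute(
            "SELECT customer_id, SUM(amount) FROM invoices GROUP BY customer_id"
        )
        return {names[customer_id]: total for customer_id, total in totals}


view = VirtualCustomerView(crm, billing)
before = view.revenue_by_customer()

# A change in a source is visible immediately, with no reload step in between.
billing.execute("INSERT INTO invoices VALUES (2, 25.0)")
after = view.revenue_by_customer()
```

Because nothing is copied, there is no load job to schedule and no stale copy to reconcile. The price is that every query depends on the sources being reachable and responsive.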
The biggest benefit of the data virtualisation deployment model is the reuse of existing infrastructure
from the structured portion of an existing data warehouse. The approach also provides an opportunity to
distribute workload effectively across the platforms. It essentially allows for the best optimization to be
executed in the architectures.
Data Virtualisation, coupled with a strong semantic architecture, can create a scalable solution. Other
benefits of this setup are a very flexible architecture and a lower initial cost in deployment. Because of
the lack of governance in this setup, however, there is the prospect of too many silos being created. These
may degrade the overall system’s performance.
Figure 10
Data Visualisation
Data visualisation forms another option for integrating Big Data with existing data warehouse
architectures. This integration is accomplished through newer visualisation applications such as Tableau and
Spotfire. In this visualisation category we also encounter candidates such as R and SAS, or traditional
technologies such as MicroStrategy, Business Objects and Cognos.
All these tools can directly leverage the semantic architecture from their integration layers and create a
scalable interface.
Figure 11
Conceptual BDW architecture – courtesy of The Big Data Imperatives by Soumendra Mohanty, Madhu Jagadeesh and Harsha
Srivatsa
Today, our world is not accelerated by a single business or product. It’s controlled by data consumers.
They are the drivers of demand for specific products or services. This has been brought about by advances in
mobile technologies coupled with the advent of smart devices such as Fitbit. All this, coupled with the rise of
Generation Z, has created a perfect storm for the transformation of companies from being product-centric
to being service-centric.
What follows from this ideology are some very critical business questions. How successful can a
transformation strategy toward this demand be for an organization? How does that organization make
decisions on what areas to transform? What will be the timing of such a transformation? Finally, how does
customer behaviour or a market trend introduce transformation in the current business strategy?
In this view, one is bound to ask what Big Data is really disrupting. The big disruption is not just the
ability to capture and analyse more data than we have done in the past. It is the ability to do so at price points
that are orders of magnitude cheaper. This is the Jevons paradox, named after the economist who made
this observation about the Industrial Revolution: as technological advances make storing and analysing
data more efficient, companies are doing a lot more analysis. This technological reality
constitutes the biggest disruption from Big Data.
Figure 12
Many large technology companies, from Amazon to Google, from IBM to Microsoft, are getting in on the
Big Data craze. In their wake, dozens of start-ups are cropping up to deliver open-source and cloud-based
Big Data solutions. While the big companies are focused on horizontal Big Data
solutions, smaller companies are focused on delivering applications for specific lines of business and key
verticals. Some of these products optimize sales efficiency. Others provide recommendations for future
marketing campaigns. They do this by correlating marketing performance across a number of different
channels with actual product usage data. There are even Big Data products that can help companies hire
more efficiently and retain those employees once hired.
Currently, some of the very important business uses that have benefitted from the Big Data
phenomenon are genomics analysis, online game gesture tracking and data bag exploration. Then there
are loan risk analysis, insurance policy underwriting and customer churn analysis. Essentially, as better
analysis tools bring the overall cost of managing Big Data down, its consumption goes up.
Cloud Computing
Cloud computing is neither a single technology nor a single architecture. It is, for now, the adopted platform
of innovation for computing, networking and storage technologies. It’s designed to provide rapid
time-to-market delivery of those services at dramatic cost reductions.
The National Institute of Standards and Technology (NIST) defines Cloud computing as a model for
enabling convenient, on-demand network access to a shared pool of configurable computing resources.
That pool is constituted by networks, servers, storage, applications and services that can be rapidly
provisioned and released with minimal management effort or service provider interaction.
Two points follow readily from the NIST definition of Cloud computing. One, it’s a usage model and not
a technology. There are multiple flavours of Cloud computing, each with its own distinctive traits
and advantages. In this sense, Cloud computing is an umbrella term highlighting the similarities and
differences in each deployment model. The term avoids being prescriptive about the particular
technologies required to implement or support a particular platform. Two, Cloud computing is based on
a pool of network, compute, storage and application resources.
Figure 13
[Diagram not reproduced. Labels surviving from the figure: Private; Platform as a; Resource pooling; On-demand self-service]
Away from the NIST model of computing, there is the Cloud Cube Model. This is maintained by the Jericho
Forum. The group has an interesting model that attempts to categorise a cloud network based on four
dimensions: the physical location of the data, its ownership, its security boundary and its sourcing.
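The four dimensions lend themselves to a simple data structure. The sketch below is my own modelling of the cube, with each dimension reduced to two values. The value names reflect my understanding of the Jericho Forum's terminology and should be checked against the Forum's own paper, and the sample deployment is hypothetical.

```python
from dataclasses import dataclass
from enum import Enum
from itertools import product


class Location(Enum):
    INTERNAL = "internal"
    EXTERNAL = "external"


class Ownership(Enum):
    PROPRIETARY = "proprietary"
    OPEN = "open"


class Boundary(Enum):
    PERIMETERISED = "perimeterised"
    DE_PERIMETERISED = "de-perimeterised"


class Sourcing(Enum):
    INSOURCED = "insourced"
    OUTSOURCED = "outsourced"


@dataclass(frozen=True)
class CloudCubePosition:
    """One cell of the cube: a value chosen on each of the four dimensions."""
    location: Location
    ownership: Ownership
    boundary: Boundary
    sourcing: Sourcing

    def describe(self):
        parts = (self.location, self.ownership, self.boundary, self.sourcing)
        return "/".join(part.value for part in parts)


# Hypothetical deployment: data held off-site on a vendor's proprietary,
# outsourced service that is reached across the corporate perimeter.
sample = CloudCubePosition(
    Location.EXTERNAL, Ownership.PROPRIETARY, Boundary.DE_PERIMETERISED, Sourcing.OUTSOURCED
)
all_positions = [
    CloudCubePosition(*combo)
    for combo in product(Location, Ownership, Boundary, Sourcing)
]
```

With two values per dimension there are sixteen possible positions, which gives an organization a compact vocabulary for comparing candidate cloud arrangements.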
Figure 14
The Jericho Forum’s Cloud Cube Model – courtesy of Cloud Computing Bible by …
To be fair, the underlying technologies of Cloud computing have been in use in some form or another for
some time. The Internet, the underlying technology driving Cloud computing, has been with us for almost
fifty years. It’s for this reason that many technologically savvy professionals have held the notion that they
don’t understand what the fuss about Cloud computing is. Many even believe that there is nothing new
about the technology.
Figure 15
Virtualisation for example, being arguably the biggest technology driver behind Cloud computing, is
almost forty years old. It has been in use since the mainframe era. Just as server and storage vendors have
been using different types of virtualisation for nearly four decades, the technology has become equally
commonplace in the corporate network. It would be almost impossible to find a LAN today that does not
use VLAN functionality.
In the same way that memory and network virtualisation have standardized over time, server
virtualisation solutions have become the fundamental building blocks of the cloud. Here, we have the
Microsoft, VMware, Parallels and Xen flavours.
As businesses move toward Cloud computing, one important factor for success is adopting multi-tenant
Software-Defined Networking (SDN) solutions in data centres. Hyper-V Network Virtualisation (HNV) is a
key enabler for a multi-tenant SDN solution. It is also essential for implementing a hybrid cloud
environment where tenants can bring not only their own IPs, but their entire network topology. This is
possible because the virtualised networks are abstracted from the underlying network fabric. Network
virtualisation in general, and HNV in particular, are relatively new concepts. Unlike server virtualisation,
which is a relatively mature and widely understood technology, network virtualisation still lacks broader
familiarity.
The challenge for data warehouse modellers in the age of Cloud computing is to create and deploy models
that leverage these realigned technological services. Whether those data warehouse models are for small
or medium size businesses, they must be implemented for data and a broader visualization infrastructure.
That infrastructure must address the very important business operational aspects of analytics and
reporting.
In this new data warehouse model, the benefits to the consumer include scalability, and businesses are able
to cash in on the further benefit of reusable cloud components. Despite all the good things about Cloud
computing technology’s many services, there are also inevitable disadvantages. Workload
management is not flexible, as resources are shared and managed. There is also the vulnerability of high
volumes of data moving to and from the cloud in non-secure environments. These may violate
compliance requirements.
Figure 16
Data virtualisation based on the Cloud computing architecture – courtesy of Cloud Computing Bible by Barrie Sosinsky
Three models of Cloud computing are available today. There is Software as a Service (SaaS), Infrastructure
as a Service (IaaS) and Platform as a Service (PaaS).
The SaaS model is one with which most individuals are familiar even if they don’t understand its underlying
technologies. Google’s Gmail for example, is one of the most widely known and commonly used SaaS
platforms. SaaS Big Data Applications (BDAs) exist at the highest level of the cloud stack. Consumers are
ready to use them out-of-the-box. There is no complex expensive infrastructure to set up or software to
install and manage.
In this same model, Salesforce has expanded its offerings through a series of acquisitions including
Radian6 and Buddy Media. Salesforce now offers cloud-based social, data and analytics applications.
Newer entrants like AppDynamics, BloomReach, Content Analytics, New Relic and Rocket Fuel all deal
with large quantities of cloud-based data. Both AppDynamics and New Relic take data from cloud-based
applications and provide insights to improve their performance. BloomReach and Content Analytics use
Big Data to improve search discoverability for e-commerce sites. Rocket Fuel on the other hand, uses Big
Data to optimize the advertisements it renders to potential vendors.
IaaS is a model through which the service provider makes available, on a commercial basis, the hardware resources
needed to run a customer’s applications, in the form of network, compute and storage. At the lowest level, IaaS
makes it easy to store data in the cloud. Here we find services such as the Amazon Elastic Compute Cloud (EC2),
the Simple Storage Service (S3) and Google Cloud Storage. Perhaps more than any other company,
Amazon pioneered the public cloud space with its Amazon Web Services (AWS) offering. Other providers
such as AT&T, IBM, Microsoft and Rackspace have also continued to expand their cloud infrastructure offerings.
Finally, PaaS is best described as a development environment hosted on third-party infrastructure. It
facilitates rapid application design, testing and deployment. PaaS environments are often used as
application sandboxes. Using those platforms, developers are free to create and in some sense, improvise
in an environment where the cost of consuming resources is greatly reduced. Google App Engine, Google
Compute Engine, VMware’s SpringSource and Amazon’s AWS are common examples of popular PaaS
offerings. After coming out late in this quest, Microsoft has continued to expand its Azure cloud offering.
The company’s Azure HDInsight is an Apache Hadoop offering in the cloud which enables Big Data users
to spin Hadoop clusters up and down on demand. More recently, Google introduced Google Cloud
DataFlow which it’s positioning as a successor to MapReduce. Unlike MapReduce, which takes a batch-
based approach to processing data, Cloud DataFlow can handle both batch and streaming data.
Public and Private Cloud Approaches to Big Data – Cloud Computing Deployment Models
Cloud service deployments come in four forms: public, private, community and hybrid. Private
clouds provide the same kind of on-demand scalability as public clouds. They are, however, designed to be
used by a single organization. Private clouds essentially cordon off an infrastructure so that the data it
holds is separated from that of other organizations.
Organizations can run their own private cloud internally on their own physical hardware and software.
They can also use private clouds that are deployed on top of major cloud services such as AWS. This
combination often offers the best of both worlds: the flexibility, scalability and reduced up-front
investment of the cloud, together with the security that organizations require to protect their most
sensitive data.
Under the community cloud deployment model, more than one group with common and specific needs
share a cloud infrastructure. Examples may include a US federal agency cloud with stringent security
requirements, or a health and medical cloud with regulatory and policy requirements for privacy matters.
There is no mandate for the infrastructure to be either on-site or off-site to qualify as a community cloud.
The public cloud deployment model for its part, is what is most often thought of as the cloud. It’s a multi-
tenant capability which is shared by a number of consumers who are likely to have nothing in common.
Amazon, Apple, Microsoft and Google, to name but a few, are some of the well-known public cloud service
providers.
Finally, the hybrid cloud deployment combines two or more of the other deployment models. The
management framework in place for such environments makes them appear as a single cloud.
A driver for this deployment model is cloud peering or cloud bursting, in which workloads overflow from
one cloud to another at peak demand, coupled of course with demands for security or regulatory
compliance alongside price and performance.
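As a rough illustration of cloud bursting, the sketch below places work on a private cloud until its
capacity is used up and sends the overflow to a public cloud. The function name, the job sizes and the
capacity figure are all hypothetical; real hybrid cloud managers also weigh security, compliance, price
and performance.

```python
def place_jobs(job_sizes, private_capacity):
    """Assign each job to the private cloud while it fits, otherwise burst to public."""
    placement = []
    used = 0  # capacity units already consumed on the private cloud
    for size in job_sizes:
        if used + size <= private_capacity:
            used += size
            placement.append("private")
        else:
            # Private cloud cannot hold this job, so it bursts to the public cloud.
            placement.append("public")
    return placement


# Hypothetical workload: four jobs against a private cloud with 8 units of capacity.
print(place_jobs([4, 3, 2, 5], private_capacity=8))
```

In this toy run the first two jobs stay on the private cloud and the last two burst to the public one.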
How Big Data in the Cloud Environment is Changing Mobile Computing Services
Perhaps no area is as fascinating as the intersection of Big Data and mobile computing. Mobile computing
has resulted in the accessibility of Big Data anytime, and almost anywhere. What used to require software
installed on a desktop computer can now be accessed via a tablet or a smartphone. What’s more, Big Data
applications that used to require expensive custom hardware and complex software are becoming more
and more available. Users can monitor their heart rate, sleep patterns and even their electrocardiogram
(EKG) using a smartphone, some add-on hardware and an easy-to-install mobile application. They can in
turn combine this with other cloud services to upload and share the data with whomever they wish.
Big Data and mobile computing are coming together not only for the display and analysis of data. They
are also used in the essential preliminary step of data collection. In a well-known technological
movement called the Internet of Things (IoT), smartphones and other low-cost devices are rapidly
expanding the number of sensors available for this part of the data analytics process. Even data such as
traffic information, which used to require complex, expensive sensor networks, can now be gathered and
displayed using a smartphone and mobile applications.
Fitness is another area where Big Data is driving mobile computing. Whereas people used to track their
fitness activities using desktop software applications, Big Data and Cloud computing have given them
wearable devices like Fitbit. Today, a wearer can track how many steps they walk in a given day. Combined
with mobile applications and analytics, the device can alert them if they aren’t fit enough. With
applications such as Strava and low-cost Global Positioning System (GPS) devices, they can compete
virtually against other fitness enthusiasts. Strava has also proved useful for identifying popular
cycling route segments. Individual cyclists use these to compare their performance on those segments and
to train competitively without having to ride together.
Doctors using mobile devices in their profession have access to applications like Practice Fusion, which
allows them to gather and view patient data for Electronic Health Records (EHR). They can subsequently
analyse these records in real time on the same tablets they carry with them.
Many of the benefits of Cloud computing in the corporate arena are purely financial. Other network
externalities relating to Cloud computing, however, have brought much broader positive effects to
consumers the world over. The ubiquity of free or inexpensive computing accessed through the cloud is
already affecting communications in developing economies as well as in established regions. Research
and development, agriculture and banking in the emerging economies have been some of the most notable
beneficiaries of this new technological phenomenon.
The increasing prevalence of cloud, mobile and social technologies is opening the floodgates of data
generation and analysis. Leading companies, otherwise referred to as the Titans in the Big Data and Cloud
computing world, are able to create actionable insight from this new technological orientation to deepen
client engagement. Because of that advantage, they are able to go after new markets. They are also able
to respond to the needs of their businesses faster and more efficiently.
In order to maximize the impact of Cloud computing, however, analytical tools and capabilities must be
adopted and made available to all of the people making those decisions, irrespective of where in the
world they are located. Cloud computing in its present state has done its best to accommodate this.
Improved internet access in developing communities, in terms of higher bandwidth and lower cost, plus
the increasing adoption of mobile devices in those same locales, has helped to make this a reality. What
all this means is that access to Big Data and analytics, via virtually unlimited resources that scale on
demand, has enabled business users to easily view, understand and interact with data and insights. They
are therefore able to determine in real time how and when to act on clients’ support demands.
In this pursuit to help businesses support their clients’ demands, Cloud computing service providers such
as IBM have made available SaaS and PaaS solutions designed to extend the benefits of cloud to analytics.
A case in point is IBM Cloudant, a fully managed, distributed Database-as-a-Service (DBaaS). Cloudant
builds on the CouchDB NoSQL database. It provides scalable NoSQL database services that allow consumers
to develop new features without redesigning the database or migrating their data. Built-in text search,
data replication and synchronization features, plus MapReduce support for advanced analytics, all
contribute to a strong user experience.
There is an ongoing perception that Cloud computing will reduce, if not completely eliminate, a number
of IT and IT-related jobs. This perception is not unwarranted, given the automation of the IT service
supply chain that comes with this technology. Initial evidence has shown that there will be a permanent
alteration of some computing job roles and responsibilities as we have come to know them so far.
As businesses focus on applications and consumer usage continues to grow, the need for software
developers will grow accordingly. In fact, as the speed of application development and deployment
increases, the need for talented software developers should increase as well. Into this cycle of
technological metamorphosis in Cloud computing enters the role of the data scientist.
Whereas it’s true that the realm of Big Data analytics is vastly different from transaction processing
applications and BI applications, the obsession with the role of the data scientist within the corporate
IT titans is a bit overdone. Some IT professionals who have stayed close to the BI world have wondered
aloud how renaming a job role from statistician to data scientist came to triple its cost and increase
its voice. A rather prominent voice in this debate on LinkedIn reasoned that analytics and the
accompanying data-driven process of decision making, rather than the data scientists themselves, deserve
a stronger voice in the enterprise, along with the increased rewards that come with it. For this approach
to work efficiently in the workplace, however, the critic opined that businesses must fully understand
the role of the data scientist in the enterprise and how it fits within the already existing analytics
groups.
Since quantitative research has driven the analytics continuum for decades, it turns out that data
analysis is not a new skill. The role has been the realm of mathematicians, statisticians and pure
quantitative scientists. The development and enhancement of sophisticated algorithms to solve real-world
problems has mostly been the purview of academia and research institutions. Here, researchers spent years
on projects that enhanced the use of established algorithms like Hidden Markov Support Vector Machines,
Linear Dynamical Systems, Spectral Clustering and Machine Learning algorithms. In between using these
established algorithms, those researchers also developed newer models. Once those refined models rolled
out of the research laboratories, commercial organizations and product vendors adopted them and made
them usable for all other consumers.
In the world of Big Data, where there are too many unknowns, the role of the data scientist emerged from
the familiar corridors of research laboratories. Data scientists have incorporated advanced analytical
approaches, using sophisticated analytics and data visualization tools to discover patterns in Big Data.
In many cases, they work with well-established analytics techniques such as logistic regression,
clustering and classification methods to draw insights from Big Data. The new requirement for a
successful data scientist, however, is a deep understanding of the business domain. From there, a data
scientist can effectively analyse business strategy and processes, using advanced data visualization
tools, before delivering the desired intelligence.
Almost no cloud-based platform has had more success than Amazon Web Services (AWS). Until
recently, common wisdom had it that data which originated locally would be analysed locally using on-site
computer infrastructure. By the same logic, data which originated in the cloud would be stored and
analysed there. This design principle was largely due to the time-consuming nature of moving massive
amounts of data from the on-site infrastructure to the cloud for analysis. Today, the availability of
very high bandwidth connections to the cloud, and the ease of scaling computing resources up and down
there, have changed all this. More and more Big Data applications are moving to the cloud or, at a
minimum, are making use of the model when on-site systems are at capacity.
The Amazon site is one of the most important and heavily trafficked web sites in the world. It provides a
vast selection of products using an infrastructure based on web services. As the company and its range of
products and services have grown, the site’s infrastructure has been dramatically expanded to
accommodate peak traffic times.
Since 2006, the company’s web service platform has been made available to developers on a usage-basis
model. Through hardware virtualization on Xen hypervisors, Amazon has made it possible to
create private virtual servers that one can run worldwide. The servers can be provisioned with almost any
kind of application software one might envisage. Those applications in turn tap into a range of
support services that make distributed Cloud computing applications not only possible, but also robust.
At present, numerous other very large web sites run on Amazon’s infrastructure without their own clients
being aware of it.
AWS is based on Service Oriented Architecture (SOA) standards, including HTTP, REST and SOAP transfer
protocols. It also builds on open source and commercial operating systems, application servers and
browser-based access. Virtual private servers can provision virtual private clouds connected through
virtual private networks. This configuration provides reasonable security and control, which a system
administrator can manage.
AWS has a great value proposition for its customers: you pay for what you use. While one may not save a
great deal of money over time using AWS for enterprise-class web applications, one encounters very little
barrier to entry in terms of getting a site or application running quickly and robustly. With the largest
number of retail product Stock Keeping Units (SKUs), and a large ecosystem of partnerships which Amazon
supports at peak customer demand, AWS takes what is essentially unused infrastructure capacity on this
network and turns it into a very profitable Cloud computing business.
AWS is having an enormous impact on Cloud computing. In fact, the company’s cloud services represent
the largest pure IaaS play in Cloud computing today. It is also one of the best examples of what is possible
using a SOA. The structure of AWS is therefore highly educational in understanding just how disruptive
Cloud computing can be to traditional fixed asset IT deployments. It also represents how virtualization
enables a flexible approach to system right-sizing, and how dispersed systems can impart reliability to
mission critical systems.
Figure 17
Components of AWS
Amazon Elastic Compute Cloud (EC2) is the central application in the AWS portfolio. It enables the
creation, use and management of virtual private servers running Linux or Windows operating systems
over a Xen hypervisor. Instances launched from Amazon Machine Images (AMIs) are sized at various levels
and rented by the hour of computing time. Spread over data centres worldwide, EC2 applications can be
built to be highly scalable, with high redundancy and fault tolerance. In its implementations,
EC2 is supported by tools such as Amazon Simple Queue Service (SQS), Amazon Simple Notification
Service (SNS), Amazon CloudWatch and Elastic Load Balancing.
Amazon Relational Database Service (RDS) allows cloud consumers to create instances of the MySQL
database to support their web sites and the many applications that rely on data-driven services. MySQL
is a member of the ubiquitous LAMP (Linux, Apache, MySQL and Perl) web services platform. Through RDS,
cloud developers can port applications, source code and databases directly over to AWS, all while
preserving their previous investments in those same technologies. RDS provides features such
as automated software patching, database backups and automated database scaling via an API call.
Close by is Amazon CloudFront. This is an edge-storage or content-delivery system that caches data
in different physical locations. As a result, cloud consumers get faster data transfer speeds and
lower latency. CloudFront is set up to work with Amazon Simple Storage Service (S3), an online backup
and storage system. Other services in the AWS stack are Amazon Elastic Block Store (EBS) and Amazon
SimpleDB. Beyond these, there is a continually growing list of very dynamic services and utilities that
support, or are partners to, the AWS infrastructure. A very recent addition, now competing with MongoDB,
is Amazon DynamoDB.
Big Data is a combination of transactional and interactive data. While technologies have mastered the
art of managing volumes of transaction data, it is the interactive data that is adding variety
and velocity to the ever-growing data reservoir.
Since its founding in 1998, Google has grown by multiple orders of magnitude in several different
dimensions. That magnitude is measured in how many queries Google handles, the size of its search index,
the amount of user data it stores, the number of services it provides and the number of cloud users who
rely on those services. From a hardware perspective, the Google search engine has gone from a server
sitting under a desk in a laboratory at Stanford University to hundreds of thousands of servers located in
dozens of data-centres around the world.
Having set out to break with the traditional approach of scaling hardware upward as service demands
grow, Google instead adopted horizontal scaling, utilising numerous inexpensive servers.
Initially conceived as a way of saving money, the approach also sidestepped the limitations of the
traditional approach, such as high expense and single points of failure in hardware and software.
With this design philosophy in place, Google came up with the Data Stack 1.0 system. It comprised the
Google File System (GFS), MapReduce and Bigtable. GFS is a distributed, cluster-based file system that
assumes that any disk can fail. In this system, data is stored in multiple locations, so that it remains
available even when the disk it was first written to fails. MapReduce, on the other hand, is a computing
paradigm that divides problems into easily parallelizable pieces and orchestrates running them across a
cluster of machines. As for Bigtable, it’s considered a forerunner of the NoSQL database. It enables structured
storage to scale out to multiple servers through replication. A failure on any particular tablet server never
causes data loss.
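The MapReduce paradigm mentioned above can be sketched in a few lines. This is a single-process
illustration of the idea only, not Google's implementation: the function names and the sample input are
made up for the example, and a real framework would run the map and reduce steps on many machines.

```python
from collections import defaultdict


def map_phase(document):
    """Emit a (word, 1) pair for every word in one input split."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle(pairs):
    """Group emitted values by key, as a framework does between the two phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Combine all values for one key into a single result."""
    return key, sum(values)


# Illustrative input: each string stands for a split that could live on a different machine.
splits = ["big data in the cloud", "the cloud scales", "big clusters"]
mapped = [pair for split in splits for pair in map_phase(split)]
counts = dict(reduce_phase(key, values) for key, values in shuffle(mapped).items())
print(counts)
```

Because each split is mapped independently and each key is reduced independently, both phases can be
spread across a cluster without the pieces needing to coordinate.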
Technological refinements of the original stack came first from its limitations, and second from cloud
consumers’ service demands. From these refinements came components such as Colossus,
Megastore, Spanner, FlumeJava and Dremel. Colossus is a distributed file system that works
around many of the limitations of GFS. Megastore is a geographically replicated, consistent
NoSQL data store that uses the Paxos algorithm to ensure consistent reads and writes. What this
means is that if a user writes data in one data centre, it is immediately available in all other data
centres. Spanner is a globally replicated data store that can handle data locality constraints, for
example ‘This data is allowed to reside only in European data centres’. Spanner managed to solve the
problem of global time ordering in a geographically distributed system. Here, it uses atomic clocks to
guarantee synchronization to within a known bound.
So What Is BigQuery?
BigQuery, like many tools, technological or otherwise, started with a problem. Google engineers were
having a hard time keeping up with the growth of their data. The number of Gmail users, for example, is
in the hundreds of millions. By the year 2012, there were more than 100 billion Google searches done
every month. Not to be outdone is the Google Books data store, where searches over book contents, my
own books included, present added query challenges.
To make sense of all this data was a time-consuming and frustrating experience. Dremel, the distributed
SQL query engine that performs complex queries over data stored on Colossus, GFS or elsewhere, became
the base for Google’s BigQuery. Dremel enabled Google employees to run extremely fast SQL queries on
large datasets.
Since its public launch in 2012, BigQuery has allowed users outside Google to take advantage of the power
and performance of Dremel. It has since expanded to become not just a query engine but a
hosted, managed, cloud-based structured storage provider. Its primary function is to enable interactive
analytic queries over Big Data. It tackles Big Data problems by attempting to be scale-invariant.
What that means is that whether one has a thousand or a hundred billion rows in a table, the mechanism
for handling them should be the same. Although some variance in execution time is expected between
running a query over a megabyte and doing so over a terabyte, the latter shouldn’t be a million times
slower. In other words, someone who starts using BigQuery when receiving a thousand records a day
won’t hit a brick wall when that grows to one billion records.
BigQuery is a system that stores and operates on structured data. Its schemas describe the columns, or
fields of the structured data. Each field has a name and a data type that indicates the kind of data that can
be stored in that field. Those data types can be either primitive or record types.
Primitive types are basic. They store a single value. This can be a string, a floating point number, an integer
or a Boolean flag. A record type, however, is a collection of other fields. For the most part, a record is just
a way of grouping fields together. For example, if one stores location as latitude and longitude, one could
have a location record with two fields, lat and long. Fields can also be repeated, which means that they
can store more than one value.
These two features of record types and repeated fields distinguish BigQuery from most relational
databases which can only store flat rows. Records and repeated fields make it possible to store data in a
more natural way than a relational database would allow. For example, if a table contains
customer orders, a consumer may want to store an entire order as a single record, even though there may
be multiple items in the order. This makes it easier to perform analysis of the orders without having to
flatten the data or normalize it into multiple tables.
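To make record types and repeated fields concrete, the sketch below writes a schema in the
name/type/mode shape that BigQuery schemas use, then shows one order stored as a single nested row. The
field names (order_id, items, sku, quantity) and the sample values are hypothetical, and the flatten
helper exists only to show what a flat relational layout would need instead.

```python
# Schema sketch: a RECORD groups fields, and mode REPEATED allows more than one value.
order_schema = [
    {"name": "order_id", "type": "STRING", "mode": "REQUIRED"},
    {"name": "location", "type": "RECORD", "mode": "NULLABLE", "fields": [
        {"name": "lat", "type": "FLOAT", "mode": "NULLABLE"},
        {"name": "long", "type": "FLOAT", "mode": "NULLABLE"},
    ]},
    {"name": "items", "type": "RECORD", "mode": "REPEATED", "fields": [
        {"name": "sku", "type": "STRING", "mode": "REQUIRED"},
        {"name": "quantity", "type": "INTEGER", "mode": "REQUIRED"},
    ]},
]

# One order kept as one row, with its items nested inside it (made-up values).
order = {
    "order_id": "A-1001",
    "location": {"lat": -1.29, "long": 36.82},
    "items": [
        {"sku": "TEA-01", "quantity": 2},
        {"sku": "MUG-07", "quantity": 1},
    ],
}


def flatten(order_row):
    """Produce one flat row per item, as a purely flat table layout would require."""
    return [
        {
            "order_id": order_row["order_id"],
            "lat": order_row["location"]["lat"],
            "long": order_row["location"]["long"],
            "sku": item["sku"],
            "quantity": item["quantity"],
        }
        for item in order_row["items"]
    ]


flat_rows = flatten(order)
total_quantity = sum(item["quantity"] for item in order["items"])
print(len(flat_rows), total_quantity)
```

The nested form keeps the order in one place, while the flat form repeats the order-level fields on
every item row.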
Under the BigQuery system, collections of rows of data following a single schema are organized into
tables. These tables are similar to tables in a typical relational database but have some restrictions. The
only way to modify the tables is to append to or rewrite them. There is no way to update individual rows.
BigQuery also doesn’t support RDBMS table modification statements like ALTER TABLE, DROP TABLE or
UPDATE. Collections of tables with similar access restrictions are organized into datasets. These are
comparable to the multiple database catalogues one is allowed to have in the RDBMS world.
The technological beauty coming off the BigQuery advent is that tables in the cloud can be joined against
each other as long as the user running the query has access to both. This means that if someone publishes
a table with weather data, another user can join that against his sales table to determine how the weather
affects his sales.
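The weather-against-sales join described above looks roughly like the following. The sketch uses
Python's built-in sqlite3 module as a local stand-in, since the point is the shape of the query rather
than the BigQuery service itself; the table names, columns and figures are invented for illustration.

```python
import sqlite3

# In-memory database standing in for two tables owned by different publishers.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE weather (day TEXT, rainfall_mm REAL);
    CREATE TABLE sales (day TEXT, umbrellas_sold INTEGER);
    INSERT INTO weather VALUES ('2015-03-01', 0.0), ('2015-03-02', 12.5);
    INSERT INTO sales VALUES ('2015-03-01', 3), ('2015-03-02', 41);
""")

# Join the published weather table against the user's own sales table on the date.
rows = conn.execute("""
    SELECT s.day, w.rainfall_mm, s.umbrellas_sold
    FROM sales AS s
    JOIN weather AS w ON s.day = w.day
    ORDER BY s.day
""").fetchall()

for row in rows:
    print(row)
conn.close()
```

Each result row pairs a day's rainfall with that day's sales, which is the starting point for asking how
one affects the other.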
Currently there are a number of public datasets containing either financial information or GitHub commit
history. Researchers can mine these sources or combine the data with their own for newer insight. Any
dataset can be shared with any user by just making an API call or using the UI to edit the access control
settings. If a consumer wants users on a different continent to run queries against his data, he
doesn’t have to ship it to them or let them log into his servers. He would simply share the dataset. The
other users would then be able to run queries against this data directly. The owner can either require
the new users to pay the BigQuery bill for any queries they run, by adding them to the Access Control
List, or allow them to run queries that are billed to him, by adding them to his project.
One of the main limitations of database query performance is the sequential nature of most query
executions. Although most databases can make use of multiple processors, they often use their available
parallelism to run multiple queries at once. They do this instead of taking advantage of multiple processors
for a single query. That said, even if they did parallelize single query execution, the database would still
be limited by disk I/O speeds if the data is stored on a single disk. Reading the disk from multiple places
in parallel may actually be slower than reading it sequentially.
The SQL query language is highly parallelizable, however, as long as one has a way of exploiting that.
The Dremel query engine created a way to parallelize SQL execution across thousands of machines. When
run in the Google infrastructure, the Dremel architecture scales linearly to tens of thousands of processor
cores and hundreds of thousands of disks. The performance goal of this architecture was to process a
terabyte of data in a second. Those goals have been met and exceeded.
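The general idea of parallelizing a single query can be sketched as a scatter and gather: each worker
scans its own shard and returns a small partial result, and a root step merges the partials. The code
below is an illustration of that idea only, with hypothetical names and data; it is not the Dremel
architecture, and Python threads are used here for simplicity rather than for real speed-up.

```python
from concurrent.futures import ThreadPoolExecutor


def scan_shard(rows):
    """Leaf step: compute a partial COUNT and SUM over one shard of the table."""
    return len(rows), sum(rows)


def parallel_average(shards):
    """Root step: fan the scan out to workers, then merge the partial results."""
    with ThreadPoolExecutor(max_workers=len(shards)) as pool:
        partials = list(pool.map(scan_shard, shards))
    total_count = sum(count for count, _ in partials)
    total_sum = sum(subtotal for _, subtotal in partials)
    return total_sum / total_count


# Made-up column values, split into three shards that could sit on separate disks.
shards = [[10, 20, 30], [40, 50], [60]]
average = parallel_average(shards)
print(average)  # 35.0
```

Only the small (count, sum) pairs travel back to the root, which is why adding shards and workers lets
the same query cover more data without one machine reading everything.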
Figure 18
Google Cloud Platform and BigQuery – courtesy of Google BigQuery Analytics by Jordan Tigani and Siddartha Naidu
(Diagram labels: App Engine, Datastore backup load, MapReduce, Hadoop connector, Compute Engine, Cloud
Storage, BigQuery load and extract jobs)
While we are still discussing Google and BigQuery, let’s take a glimpse into the company’s self-driving
car project. In this project, Google combines a variety of mapping data with information from real-time
laser detection systems, multiple radars, GPS and other devices that allow these cars to ‘see’ traffic, traffic
lights and roads. While the cars can’t think for themselves in the literal sense, they can do a great job at
pattern matching. By combining existing data from maps with real-time data from a car’s sensors, Google
systems allow the cars to ‘make’ driving decisions.
Despite substantial hype and reported successes for early adopters of Cloud computing, more than 54%
of enterprises have yet to invest in this technology, according to Gartner. These enterprises face
different challenges to Cloud computing adoption. Chief among the foreseeable challenges is the ability
to analyse Big Data in real time. There is also the simple question of what it takes for technologies as
complex as Hadoop to mature to a point where they can be easily accepted by mainstream users.
In another recent report, the Cisco Global Cloud Network Survey, 60% of respondents said Wide Area
Network (WAN) performance was the key challenge for Cloud computing growth. In line with this finding,
services such as VeloCloud have sprung up to fill this niche in the form of hybrid WANs.
Hybrid WAN refers to using a mix of public Internet with private circuits for enterprise WAN transport.
The use of broadband internet simply as a separate network for uses such as guest web surfing, or as a
standby network in cases of failure does not capture the benefits of a true hybrid WAN.
Broadband internet does not have the same predictable performance, capacity or reliability as private
circuits. Because of this, businesses that have already adopted it are often still using it for
less critical purposes. However, businesses that want to leverage the cost and other advantages of
broadband are becoming increasingly application-centric and therefore dependent on a private
network-like experience. This is the niche that next-generation hybrid WAN architectures are increasingly
exploiting, not only to integrate broadband but also to apply technologies that give it enterprise-grade
performance and availability.
VeloCloud is currently being marketed as one such solution, providing enterprise-grade
performance through dynamic multi-path optimization. To do this, the capacity and performance of both
broadband and private networks are continuously monitored. Traffic, by application and business priority,
is then dynamically steered to the best link and path at each moment in time. This dynamic use of different
services delivers the advantage of virtualization. If necessary, on-demand remediation techniques such as
error correction and jitter buffering are also automatically applied. Another key benefit is the enhanced
visibility across multiple sites and providers.
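The steering idea described above, in which links are monitored and traffic is sent over the best one
according to its priority, can be sketched as follows. This is a toy model under stated assumptions, not
VeloCloud's algorithm: the Link fields, the scoring weights and the measurements are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Link:
    name: str
    latency_ms: float
    loss_pct: float


def score(link):
    """Lower is better; packet loss is weighted heavily (an assumed weighting)."""
    return link.latency_ms + 50.0 * link.loss_pct


def steer(app_priority, links):
    """Send high-priority traffic over the best link and the rest over the runner-up."""
    ranked = sorted(links, key=score)
    if app_priority == "high" or len(ranked) == 1:
        return ranked[0].name
    return ranked[1].name


# First measurement: broadband is faster but lossy, so the private circuit scores better.
links = [Link("private_circuit", 30.0, 0.0), Link("broadband", 20.0, 0.5)]
print(steer("high", links), steer("low", links))

# Later measurement: broadband loss has cleared, so high-priority traffic moves to it.
links = [Link("private_circuit", 30.0, 0.0), Link("broadband", 15.0, 0.0)]
print(steer("high", links))
```

Re-running the decision on every new measurement is what makes the steering dynamic rather than a fixed
primary and standby arrangement.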
Figure 19
The analytics landscape is experiencing a significant transformation. With massive volumes of data, more
of that living outside the enterprise data warehouse, and increasing user demand for speed, autonomy
and agility, organizations are struggling with an increasing divide between end users and centralized BI
teams. The centralised teams, gatekeepers of mission-critical data, are burdened with legacy
technologies, reporting requirements and older processes, all of which prevent them from meeting the
business’s speed demands. End users, driven by a thirst for data-driven daily decisions, have kick-started
their own analytic initiatives on decentralised data using desktop discovery tools.
These shadow BI initiatives have increased end-user autonomy. They have also created analytical silos
and inconsistencies in data analysis. These further hamper the desire for data-driven decisions.
Without the flexibility and speed which the business demands, or the consistency and governance which
BI requires, an organization cannot become data-driven. A recent Gartner report - Create a Centralized and
Decentralized Organizational Model for Business Intelligence - states that successful companies need to
navigate the complexities of these two separate worlds of analytics by implementing a 2-tier
organizational approach.
However, the lack of agility stemming from the use of legacy BI platforms results in high costs and long
wait times. Mistrust in the data provided by discovery tools also results in more arguments over numbers
and less time spent making data-driven decisions.
BIRST’s 2-tier approach and technology enable BI leaders to govern, support and scale multiple
integrated environments. They also provide end users with autonomy, ease of use and speed when working
with both curated and non-curated data.
Figure 20
BIRST 2-Tier Analytics Engine – courtesy of A Platform for 2-Tier BI and Analytics, a Technical White paper
Every day 2.5 quintillion bytes of data are created. The average cost of security incidents in the era of Big
Data is estimated to be over US $40 million. The big question is how do organisations create a
comprehensive strategy that includes data protection throughout the entire information delivery flow?
References:
1. ADT WP_IBM_Building_DataDriven_Application_Cloud
2. Building a data warehouse step by step by Manole VELICANU and Gheorghe MATEI
3. SAAS_Overview Data Warehouse: The Choice of Inmon versus Kimball by Ian Abramson
4. Building The Data Warehouse 3rd Ed Wiley 2003 by W. H. Inmon
5. The Data Warehouse Toolkit, 3rd Edition: The Definitive Guide to Dimensional Modeling by Ralph Kimball and Margy
Ross
6. Exam 70-463_ Implementing a Data Warehouse with Microsoft SQL Server 2012 by Dejan Sarka, Matija Lah and Grega
Jerkič
7. Gartner Predicts Business Intelligence and Analytics Will Remain Top Focus for CIOs Through 2017 by Rob van der
Meulen and Janessa Rivera
8. Google BigQuery Analytics by Jordan Tigani and Siddartha Naidu
9. Magic Quadrant for Business Intelligence and Analytics Platforms by Rita L. Sallam, Bill Hostmann, Kurt Schlegel, Joao
Tapadinhas, Josh Parenteau, Thomas W. Oestreich
10. ReDefining the role of the Data Scientist by Gary Angel
11. The future of business intelligence by Dominic A Ienco
12. Big Data Imperatives by Soumendra Mohanty, Madhu Jagadeesh and Harsha Srivatsa
13. Information management: IBM and the future of data by John K Waters
14. Pro SQL Server 2012 Integration Services by Francis Rodrigues, Michael Coles and David Dye
15. Big Data Bootcamp by David Feinleib
16. The Economics of Cloud Computing by Bill Williams
17. The Evolving Role of the Enterprise Data Warehouse in the Era of Big Data Analytics by Ralph Kimball
18. Ten Mistakes to Avoid When Democratizing BI and Analytics by David Stodder
19. Data Warehousing in the age of Big Data by Krish Krishnan
20. The Evolving Role of the Enterprise Data Warehouse in the Era of Big Data Analytics by Ralph Kimball
21. Microsoft System Center: Network Virtualization and Cloud Computing by Nader Benmessaoud, CJ Williams and Uma
Mahesh Mudigonda
22. Cloud computing Bible by Barrie Sosinsky
23. BI Solutions Using SSAS Tabular Model Succinctly by Parikshit Savjani
24. Technology whitepaper - Taking Hybrid WANs Further by VeloCloud
25. A Platform for 2-Tier BI and Analytics, A Technical white paper