Data Architecture
Name : 1. Adrian Hartanto
2. Latif Arif Putranto
Faculty : Fachran Nazarullah, S.Kom
Semester : 3
Quarter : 1
Class : 3SC8
Continuing Education Center for Computing and Information Technology
Faculty of Engineering, University of Indonesia
2019
PREFACE
The authors realize that the preparation of this paper is still far
from perfect. Therefore, the authors welcome constructive criticism and
suggestions to improve this paper so that it can serve as a reference for
subsequent papers and assignments.
The authors also apologize if, in writing this paper, there are typos or
errors that confuse the reader in understanding the authors' intent.
Author
TABLE OF CONTENTS
PREFACE
TABLE OF CONTENTS
TABLE OF FIGURES
CHAPTER I INTRODUCTION
CHAPTER II BASIC THEORY
II.3 Characteristics of Big Data
CHAPTER III PROBLEM ANALYSIS
III.1 Definition of Big Data
III.4 Challenges of Big Data Architecture
III.6 Disadvantages of Big Data
1. Incompatible tools
2. New approach
3. Chances of Failure
4. Correlation Errors
5. Security and Privacy Concerns
CHAPTER IV CONCLUSION
IV.1 Conclusion
IV.2 Suggestion
BIBLIOGRAPHY
TABLE OF FIGURES
CHAPTER I
INTRODUCTION
I.1. Background
This paper discusses Big Data architecture for business,
including an explanation of the function of Big Data architecture.
The structure of this paper is as follows:
CHAPTER I INTRODUCTION
I.1 Background
I.2 Writing Objective
I.3 Problem Domain
I.4 Writing Methodology
IV.2 Suggestion
BIBLIOGRAPHY
CHAPTER II
BASIC THEORY
Server - generally hardware, this level provides the basic computing power for the
entire organization and is typically centrally located. This is the equipment in the
computer room of the newspaper business mentioned above.
Middleware - generally software, this level sits on top of the server level and provides
the infrastructure necessary to keep the hardware running and the information
flowing. These are the tools and utilities used by the information technology people in
the newspaper business.
Client - A combination of hardware and software, this level provides the capabilities
accessible by a user and allows them to access the information a business has
available. These are the things the reporters use in newspaper business (personal
computers, printers, applications, etc.).
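The three levels above can be sketched as a minimal Python example. Every class and method name here is an illustrative assumption (not from any real product); the sketch only shows how a request flows from the client tier through the middleware tier to the server tier.

```python
# Toy three-tier sketch: client -> middleware -> server.
# All names are hypothetical; real tiers would be separate machines/processes.

class Server:
    """Server tier: central storage and computing power."""
    def __init__(self):
        self._articles = {1: "Local election results", 2: "Weather report"}

    def fetch(self, article_id):
        return self._articles.get(article_id)

class Middleware:
    """Middleware tier: keeps information flowing between the other tiers."""
    def __init__(self, server):
        self._server = server
        self._cache = {}  # a typical middleware utility: reduce server load

    def get_article(self, article_id):
        if article_id not in self._cache:
            self._cache[article_id] = self._server.fetch(article_id)
        return self._cache[article_id]

class Client:
    """Client tier: what a reporter actually uses at their desk."""
    def __init__(self, middleware):
        self._middleware = middleware

    def read(self, article_id):
        text = self._middleware.get_article(article_id)
        return text if text is not None else "not found"

client = Client(Middleware(Server()))
print(client.read(1))  # Local election results
```

The design point is that the client never talks to the server directly; the middleware tier can add caching, routing, or security without either side changing.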
In addition to structured data, which conforms to a fixed schema, big data
comes in two further formats:
Unstructured
Unstructured data refers to the data that lacks any specific form or structure
whatsoever. This makes it very difficult and time-consuming to process and analyze
unstructured data. Email is an example of unstructured data.
Semi-structured
Semi-structured data contains elements of both formats mentioned
above, that is, structured and unstructured data. To be precise, it refers to data that,
although not classified under a particular repository (database), nevertheless contains
vital information or tags that segregate individual elements within the data.
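A short illustration: JSON is a common semi-structured format, because each record carries tags (keys) that segregate its elements even though no database schema is imposed. This sketch uses Python's standard `json` module; the field names are invented for the example.

```python
import json

# A semi-structured record: no fixed schema, but tags label each element.
raw = '{"from": "alice@example.com", "subject": "Q3 report", "body": "Free-form text here..."}'

record = json.loads(raw)

# The tags let us pick out individual elements...
print(record["subject"])     # Q3 report

# ...while the body itself remains unstructured free text.
print(type(record["body"]))  # <class 'str'>
```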
Gartner analyst Doug Laney listed the three 'V's of Big Data: Variety, Velocity, and
Volume. Together, these characteristics are enough to convey what big data is. Let's look at
them in depth:
1) Variety
Variety of Big Data refers to structured, unstructured, and semi-structured data that is
gathered from multiple sources. While in the past data could only be collected from
spreadsheets and databases, today data comes in an array of forms such as emails, PDFs,
photos, videos, audio, social media posts, and much more.
2) Velocity
Velocity essentially refers to the speed at which data is being created in real time. In a
broader sense, it comprises the rate of change, the linking of incoming data sets arriving at
varying speeds, and bursts of activity.
3) Volume
We already know that Big Data indicates huge 'volumes' of data being
generated on a daily basis from various sources such as social media platforms, business
processes, machines, networks, human interactions, etc. Such large amounts of data are
stored in data warehouses.[2]
Data sources. All big data architectures start with your sources. These can include data
from databases, data from real-time sources (such as IoT devices), and static files
generated by applications, such as Windows logs.
Real-time message ingestion. If there are real-time sources, you'll need to build a
mechanism into your architecture to ingest that data.
Data store. You'll need storage for the data that will be processed by the big data
architecture. Often, data will be stored in a data lake, a large repository for
unstructured data that scales easily.
A combination of batch processing and real-time processing. You will need to
handle both real-time data and static data, so a combination of batch and real-time
processing should be built into your big data architecture. This is because the large
volume of data processed can be handled efficiently using batch processing, while
real-time data needs to be processed immediately to bring value. Batch processing
involves long-running jobs to filter, aggregate, and prepare the data for analysis.
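The batch side described above (long-running jobs that filter, aggregate, and prepare data) can be sketched in plain Python. The record layout and field names below are invented for illustration; a real job would read from files in the data store and write its output back out.

```python
from collections import defaultdict

# Hypothetical raw events, e.g. parsed from static log files.
events = [
    {"user": "a", "action": "click", "ms": 120},
    {"user": "b", "action": "error", "ms": 0},
    {"user": "a", "action": "click", "ms": 80},
    {"user": "c", "action": "click", "ms": 200},
]

# Filter: keep only the click events.
clicks = [e for e in events if e["action"] == "click"]

# Aggregate: total latency per user, prepared for later analysis.
totals = defaultdict(int)
for e in clicks:
    totals[e["user"]] += e["ms"]

print(dict(totals))  # {'a': 200, 'c': 200}
```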
Analytical data store. After you prepare the data for analysis, you need to bring it
together in one place so you can perform analysis on the entire data set. The
importance of the analytical data store is that all your data is in one place so your
analysis can be comprehensive, and it is optimized for analysis rather than
transactions. This might take the form of a cloud-based data warehouse or a relational
database, depending on your needs.
Analysis or reporting tools. After ingesting and processing various data sources,
you'll need to include a tool to analyze the data. Frequently, you'll use a BI (Business
Intelligence) tool to do this work, and it may require a data scientist to explore the
data.
Automation. Moving the data through these various systems requires orchestration,
usually in some form of automation. Ingesting and transforming the data, moving it in
batch and stream processes, loading it into an analytical data store, and finally
deriving insights must form a repeatable workflow so that you can continually gain
insights from your big data.[3]
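The repeatable workflow described above can be sketched as a tiny orchestrator that runs steps in a fixed order. Real systems would use a dedicated orchestration tool; every function name here is hypothetical.

```python
# A toy orchestrator: each step's output feeds the next, so the whole
# pipeline can be re-run end to end whenever new data arrives.

def ingest(data):
    return [x.strip() for x in data]          # clean up raw input

def transform(data):
    return [x.upper() for x in data]          # stand-in for real transforms

def load(data):
    return {"analytical_store": data}         # stand-in for a real data store

def run_pipeline(raw, steps):
    result = raw
    for step in steps:
        result = step(result)
    return result

pipeline = [ingest, transform, load]
print(run_pipeline([" sale ", " refund "], pipeline))
# {'analytical_store': ['SALE', 'REFUND']}
```

Because the steps are data-in/data-out functions, re-running the workflow on fresh input is just another call to `run_pipeline`.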
CHAPTER III
PROBLEM ANALYSIS
III.1 Definition of Big Data
Big data is a term that describes a large volume of structured, semi-structured, and
unstructured data that has the potential to be mined for information and used in machine
learning projects and other advanced analytics applications. Big data is often characterized by
the 3Vs: the extreme volume of data, the wide variety of data types, and the velocity at which
the data must be processed. Several other Vs have been added to descriptions of big data,
including veracity, value, and variability. Although big data doesn't equate to any specific
volume of data, the term is often used to describe terabytes, petabytes, and even exabytes of
data captured over time.
Most big data architectures include some or all of the following components:
Data sources. All big data solutions start with one or more data sources. Examples
include:
o Application data stores, such as relational databases.
o Static files produced by applications, such as web server log files.
o Real-time data sources, such as IoT devices.
Data storage. Data for batch processing operations is typically stored in a distributed
file store that can hold high volumes of large files in various formats. This kind of store
is often called a data lake. Options for implementing this storage include Azure Data
Lake Store or blob containers in Azure Storage.
Batch processing. Because the data sets are so large, often a big data solution must
process data files using long-running batch jobs to filter, aggregate, and otherwise
prepare the data for analysis. Usually these jobs involve reading source files, processing
them, and writing the output to new files. Options include running U-SQL jobs in
Azure Data Lake Analytics, using Hive, Pig, or custom Map/Reduce jobs in an
HDInsight Hadoop cluster, or using Java, Scala, or Python programs in an HDInsight
Spark cluster.
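To make the "custom Map/Reduce jobs" idea concrete, here is a minimal map-shuffle-reduce word count in plain Python. Real jobs would run distributed on a Hadoop or Spark cluster; this only illustrates the programming model.

```python
from collections import defaultdict

lines = ["big data needs big storage", "big data needs processing"]

# Map phase: emit a (word, 1) pair for every word in every input line.
mapped = [(word, 1) for line in lines for word in line.split()]

# Shuffle phase: group values by key, as the framework does between phases.
groups = defaultdict(list)
for word, count in mapped:
    groups[word].append(count)

# Reduce phase: sum the counts for each word.
counts = {word: sum(vals) for word, vals in groups.items()}

print(counts["big"])   # 3
print(counts["data"])  # 2
```

On a cluster, the map and reduce functions stay this simple; the framework handles splitting the input, shuffling pairs between machines, and retrying failed tasks.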
Real-time message ingestion. If the solution includes real-time sources, the
architecture must include a way to capture and store real-time messages for stream
processing. This might be a simple data store, where incoming messages are dropped
into a folder for processing. However, many solutions need a message ingestion store to
act as a buffer for messages, and to support scale-out processing, reliable delivery, and
other message queuing semantics. This portion of a streaming architecture is often
referred to as stream buffering. Options include Azure Event Hubs, Azure IoT Hub, and
Kafka.
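The "stream buffering" role can be sketched with a bounded in-memory queue: producers drop messages in, and a consumer drains them at its own pace. Real ingestion stores such as Event Hubs or Kafka add durability, partitioning, and scale-out on top of this idea; the toy below shows only the buffering itself, with an invented message shape.

```python
import queue

# Bounded buffer: decouples fast producers from a slower consumer.
buffer = queue.Queue(maxsize=100)

# Producer side: incoming real-time messages are dropped into the buffer.
for reading in [{"sensor": "t1", "temp": 21.5}, {"sensor": "t2", "temp": 19.0}]:
    buffer.put(reading)

# Consumer side: the stream processor drains messages when it is ready.
consumed = []
while not buffer.empty():
    consumed.append(buffer.get())

print(len(consumed))  # 2
```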
Stream processing. After capturing real-time messages, the solution must process
them by filtering, aggregating, and otherwise preparing the data for analysis. The
processed stream data is then written to an output sink. Azure Stream Analytics
provides a managed stream processing service based on perpetually running SQL
queries that operate on unbounded streams. You can also use open source Apache
streaming technologies like Storm and Spark Streaming in an HDInsight cluster.
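Filtering and aggregating an unbounded stream is usually done over windows. The sketch below applies a tumbling sixty-second window to a small batch of timestamped events; the event shape and window size are invented for the example, and a real engine like Stream Analytics or Spark Streaming would do this continuously.

```python
from collections import defaultdict

# Timestamped stream events: (seconds since start, measured value).
events = [(5, 10.0), (30, 12.0), (70, 8.0), (110, 9.0), (125, 4.0)]

WINDOW = 60  # tumbling window size in seconds

# Assign each event to its window and aggregate within the window.
sums = defaultdict(float)
for ts, value in events:
    sums[ts // WINDOW] += value

print(dict(sums))  # {0: 22.0, 1: 17.0, 2: 4.0}
```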
Analytical data store. Many big data solutions prepare data for analysis and then
serve the processed data in a structured format that can be queried using analytical
tools. The analytical data store used to serve these queries can be a Kimball-style
relational data warehouse, as seen in most traditional business intelligence (BI)
solutions. Alternatively, the data could be presented through a low-latency NoSQL
technology such as HBase, or an interactive Hive database that provides a metadata
abstraction over data files in the distributed data store. Azure SQL Data Warehouse
provides a managed service for large-scale, cloud-based data warehousing. HDInsight
supports Interactive Hive, HBase, and Spark SQL, which can also be used to serve data
for analysis.
Analysis and reporting. The goal of most big data solutions is to provide insights
into the data through analysis and reporting. To empower users to analyze the data, the
architecture may include a data modeling layer, such as a multidimensional OLAP cube
or tabular data model in Azure Analysis Services. It might also support self-service BI,
using the modeling and visualization technologies in Microsoft Power BI or Microsoft
Excel. Analysis and reporting can also take the form of interactive data exploration by
data scientists or data analysts. For these scenarios, many Azure services support
analytical notebooks, such as Jupyter, enabling these users to leverage their existing
skills with Python or R. For large-scale data exploration, you can use Microsoft R
Server, either standalone or with Spark.
Orchestration. Most big data solutions consist of repeated data processing
operations, encapsulated in workflows, that transform source data, move data between
multiple sources and sinks, load the processed data into an analytical data store, or push
the results straight to a report or dashboard. To automate these workflows, you can use
an orchestration technology such as Azure Data Factory or Apache Oozie and Sqoop.[4]
Having a bird’s eye view of big data and its application in different industries will
help you better appreciate what your role is or what it is likely to be in the future, in your
industry or across different industries.
Figure 3.2 Big Data overview
3.1 Banking and Securities
The Securities and Exchange Commission (SEC) is using big data to monitor financial
market activity. It currently uses network analytics and natural language processing
to catch illegal trading activity in the financial markets. Retail traders, big banks, hedge
funds, and other so-called 'big boys' in the financial markets use big data for trade analytics
in high-frequency trading, pre-trade decision-support analytics, sentiment measurement,
predictive analytics, and more.
This industry also relies heavily on big data for risk analytics, including anti-money
laundering, enterprise risk management, "Know Your Customer", and fraud
mitigation. Big data providers specific to this industry include 1010data, Panopticon
Software, StreamBase Systems, NICE Actimize, and Quartet FS.
3.2 Transportation
In recent times, huge amounts of data from location-based social networks and high
speed data from telecoms have affected travel behavior. Regrettably, research to understand
travel behavior has not progressed as quickly.
In most places, transport demand models are still based on poorly understood new
social media structures.
Some applications of big data by governments, private organizations, and individuals include:
Government use of big data: traffic control, route planning, intelligent transport
systems, and congestion management (by predicting traffic conditions).
Individual use of big data: route planning to save fuel and time, travel
arrangements in tourism, etc.
Big data providers in this industry include Qualcomm and Manhattan Associates.
3.3 Education
On the technical side, there are challenges in integrating data from different sources, on
different platforms, and from different vendors that were not designed to work with one
another. Politically, issues of privacy and personal data protection associated with big data
used for educational purposes are a challenge.
Big data is used quite significantly in higher education. For example, the University
of Tasmania, an Australian university with over 26,000 students, has deployed a learning
management system that tracks, among other things, when a student logs onto the
system, how much time is spent on different pages in the system, and the overall
progress of a student over time.
In a different use case of big data in education, it is also used to measure
teachers' effectiveness to ensure a good experience for both students and teachers. Teachers'
performance can be fine-tuned and measured against student numbers, subject matter, student
demographics, student aspirations, behavioral classification, and several other variables.
III.4 Challenges of Big Data Architecture
When done right, a big data architecture can save your company money and help
predict important trends, but it is not without its challenges. Be aware of the following issues
when working with big data.
Data Quality
Anytime you are working with diverse data sources, data quality is a challenge. This
means that you'll need to work to ensure that the data formats match and that you don't
have duplicate or missing data that would make your analysis unreliable. You'll need
to analyze and prepare your data before you can bring it together with other data for analysis.
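A minimal example of that preparation step: normalising mismatched formats and dropping duplicate or incomplete records before they reach the analytical store. The record layout and date formats are invented for illustration.

```python
# Records from two sources with mismatched date formats, a duplicate,
# and a row with a missing value.
raw = [
    {"id": 1, "date": "2019-09-11", "amount": "100"},
    {"id": 2, "date": "11/09/2019", "amount": "250"},
    {"id": 1, "date": "2019-09-11", "amount": "100"},  # duplicate
    {"id": 3, "date": "2019-09-12", "amount": None},   # missing value
]

def normalise_date(d):
    # Convert DD/MM/YYYY to ISO YYYY-MM-DD so the formats match.
    if "/" in d:
        day, month, year = d.split("/")
        return f"{year}-{month}-{day}"
    return d

seen, clean = set(), []
for rec in raw:
    if rec["amount"] is None or rec["id"] in seen:
        continue  # drop incomplete rows and duplicates
    seen.add(rec["id"])
    clean.append({"id": rec["id"],
                  "date": normalise_date(rec["date"]),
                  "amount": int(rec["amount"])})

print(clean)
# [{'id': 1, 'date': '2019-09-11', 'amount': 100},
#  {'id': 2, 'date': '2019-09-11', 'amount': 250}]
```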
Scaling
The value of big data is in its volume. However, this can also become a significant
issue. If you have not designed your architecture to scale up, you can quickly run into
problems. First, the costs of supporting the infrastructure can mount if you don't plan for
them. This can be a burden on your budget. And second, if you don't plan for scaling, your
performance can degrade significantly. Both issues should be addressed in the planning
phases of building your big data architecture.
Security
While big data can give you great insights into your data, it's challenging to protect
that data. Fraudsters and hackers can be very interested in your data, and they may try to
either add their own fake data or skim your data for sensitive information. A cybercriminal
can fabricate data and introduce it to your data lake. For example, suppose you track website
clicks to discover anomalous patterns in traffic and find criminal activity on your site. A
cybercriminal can penetrate your system, adding noise to the data so that it is impossible to
find the criminal activity. Conversely, there is a huge volume of sensitive information to be
found in your big data, and a cybercriminal could mine your data for that information if you
don't secure the perimeters, encrypt your data, and work to anonymize the data to remove
sensitive information.[5]
Advantages of Big Data:
1. Cost Cutting
Big Data provides business intelligence that can improve the efficiency of operations and cut
down on costs. Big Data technologies such as Hadoop and other cloud-based analytics help
significantly reduce costs when storing massive amounts of data. They can also find far more
efficient ways of doing business.
Though the initial implementation may seem expensive, it will eventually save a lot of money in
the long run. The reduction in waiting time reduces the stress on the organization's IT
landscape, so resources previously set aside to respond to report requests are freed
up.
2. Better Decision Making
Big Data is able to analyse data from the past, which can be used to make predictions
about the future. This helps businesses make better decisions in the present as well as prepare
for the future. Data insights into customer movements, promotions, and competitive offerings
give useful information about customer trends. With real-time analytics, quicker
decisions can be made that are better suited to current customers.
3. New Products and Services
With Big Data analytics, far more companies are now able to create new products and
services to meet the needs of their customers. Companies can analyse past data about
customer feedback and product launches, which helps them come up with better
products. Additionally, real-time market analysis helps in customer-oriented marketing by
allowing businesses to understand changes in consumer behavior and shifts in the supply and
demand of products. Understanding consumer needs, buying behaviors, and preferences can
help meet the increasing demand for personalized services.
4. Fraud Detection
Big Data helps automatically detect fraudulent attempts to hack into your organisation,
and a real-time safeguard system can notify you instantly. Once you detect a
fraudulent attempt, you can immediately take appropriate action. You can map the entire data
landscape across your organization using Big Data tools.
This will let you analyse different internal threats and use this data to keep sensitive
information secure and safe. Data is stored in line with regulatory requirements and protected
in an appropriate way. Because of this, many industries have begun to use Big Data for data
safety and protection, especially organizations and companies that deal with financial
information.
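One simple form of such automated detection is flagging transactions that deviate sharply from a customer's normal behaviour. The z-score rule below is a generic textbook technique sketched for illustration, not the method of any particular product, and the amounts are invented.

```python
import statistics

# A customer's recent transaction amounts, plus one suspicious outlier.
amounts = [42.0, 39.5, 41.0, 40.5, 38.0, 43.0, 500.0]

mean = statistics.mean(amounts[:-1])    # baseline from normal history
stdev = statistics.stdev(amounts[:-1])

def is_suspicious(amount, threshold=3.0):
    # Flag anything more than `threshold` standard deviations from normal.
    return abs(amount - mean) / stdev > threshold

print(is_suspicious(500.0))  # True
print(is_suspicious(41.5))   # False
```

A real system would maintain these baselines per customer and update them continuously as new transactions stream in.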
Disadvantages of Big Data:
1. Incompatible tools
Hadoop is the most commonly used tool for Big Data analytics. However, the
standard version of Hadoop is not currently able to handle real-time data analysis. This means
that other tools need to be used until Hadoop adds support for a real-time approach.
2. New approach
Most organizations are used to working in a manner where insights and updates are
received approximately once a week. With Big Data bringing in insights every second, the
organisation will require a different approach and work method to handle this influx of
information at a much faster rate than it is used to.
Insights need action and with Big Data, this action is now required in real-time. This
will drastically affect work culture, a change that the company may or may not be
immediately ready for. This could definitely be a great challenge to some organisations and
may lead to a restructuring of plans and decisions.
3. Chances of Failure
Many organizations may see other companies using Big Data, and its benefits being
touted all over the internet as the best tool to grow one’s business. This may cause them to
take hasty decisions and try to implement it immediately without understanding how to use it
and whether it is suited to their business or not.
If Big Data is not implemented in the appropriate manner, it could cause more harm
than good. Companies that are not used to handling data at such a rapid rate may make
inaccurate analysis which could lead to bigger problems for the organization.
4. Correlation Errors
A common technique used to analyse Big Data is to draw correlations by linking one
variable to another to form a pattern. However, these correlations may not always stand for
anything substantial or meaningful. In fact, just because two variables are linked or correlated
does not imply that a causal relationship is present between them. In short, correlation
does not always imply causation. A thorough analysis with the help of a data expert will help
you understand which of these correlations mean anything to your business and which
absolutely don't.
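The point can be demonstrated numerically. Below, two variables driven by a common third factor come out perfectly correlated even though neither causes the other; all data is invented for illustration, and the Pearson coefficient is computed by hand to keep the sketch self-contained.

```python
# Two variables driven by a shared factor (e.g. city size):
city_size = [1, 2, 3, 4, 5]
ice_cream_sales = [s * 10 for s in city_size]  # grows with city size
drownings = [s * 2 for s in city_size]         # also grows with city size

def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x) ** 0.5
    vy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (vx * vy)

# Perfectly correlated, yet ice cream clearly does not cause drownings:
print(round(pearson(ice_cream_sales, drownings), 2))  # 1.0
```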
5. Security and Privacy Concerns
Though it may seem ironic, since we already mentioned safety and security as an
advantage of Big Data, it is important to understand that although Big Data analytics allows
you to find fraudulent attempts, the framework itself is prone to data breaches, as is the case
with many technological undertakings.
The information that you provide to a third party may get leaked to competitors and
customers. There are also privacy concerns, as many customers are not comfortable with the
idea that Big Data is capable of collecting detailed information about their identities.[6]
CHAPTER IV
CONCLUSION
IV.1. Conclusion
The availability of Big Data, low-cost commodity hardware, and new information
management and analytic software have produced a unique moment in the history of data
analysis. The convergence of these trends means that we have the capabilities required to
analyze astonishing data sets quickly and cost-effectively for the first time in history. These
capabilities are neither theoretical nor trivial. They represent a genuine leap forward and a
clear opportunity to realize enormous gains in terms of efficiency, productivity, revenue, and
profitability. The Age of Big Data is here, and these are truly revolutionary times if both
business and technology professionals continue to work together and deliver on the promise.
IV.2. Suggestion
Studying big data is quite important because the age of big data is here. In the future,
big data will be very important because it will be needed by anyone engaged in
information technology.
BIBLIOGRAPHY
[3] What Does Big Data Architecture Look Like? From: https://dzone.com/articles/what-is-big-data-architecture. Retrieved 11 September 2019.