Data Structures
Big data can come in multiple forms, including structured and unstructured data such as financial data, text files,
multimedia files, and genetic mappings.
Most Big Data is unstructured or semi-structured in nature, which requires different techniques and tools to
process and analyze it.
Distributed computing environments and massively parallel processing (MPP) architectures that enable parallelized
data ingest and analysis are the preferred approach to process such complex data.
For example, a classic Relational Database Management System (RDBMS) may store call logs for a software
support call center.
The RDBMS may store characteristics of the support calls as typical structured data, with attributes such as time
stamps, machine type, problem type, and operating system.
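The parallelized ingest and analysis mentioned above can be illustrated with a minimal single-machine sketch: split the call-log data into chunks, analyze the chunks in parallel, and merge the partial results, the way an MPP system distributes work across nodes. The log format, field names, and worker count below are invented for illustration.

```python
# A minimal sketch of parallelized analysis; the log format is hypothetical.
from multiprocessing import Pool
from collections import Counter

def count_problem_types(chunk):
    """Count occurrences of the (hypothetical) problem-type field in one chunk."""
    counts = Counter()
    for line in chunk:
        # Assumed format: "timestamp,machine_type,problem_type,os"
        fields = line.strip().split(",")
        if len(fields) == 4:
            counts[fields[2]] += 1
    return counts

def parallel_count(lines, workers=4):
    """Split the data into chunks, analyze them in parallel, and merge results."""
    size = max(1, len(lines) // workers)
    chunks = [lines[i:i + size] for i in range(0, len(lines), size)]
    total = Counter()
    with Pool(workers) as pool:
        for partial in pool.map(count_problem_types, chunks):
            total.update(partial)
    return total

if __name__ == "__main__":
    logs = ["2024-01-01T10:00,serverA,disk_failure,linux",
            "2024-01-01T10:05,serverB,disk_failure,windows",
            "2024-01-01T10:07,serverA,network_timeout,linux"]
    print(parallel_count(logs, workers=2))
```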
Big Data growth is increasingly unstructured.
Data Structures
Structured data: Data containing a defined data type, format, and structure (for example, transaction data,
online analytical processing [OLAP] data cubes, traditional RDBMS tables, CSV files, and even simple
spreadsheets).
Semi-structured data: Textual data files with a discernible pattern that enables parsing (such
as Extensible Markup Language [XML] data files that are self-describing and defined by an XML
schema).
Quasi-structured data: Textual data with erratic data formats that can be formatted with effort,
tools, and time (for instance, web clickstream data that may contain inconsistencies in data values
and formats).
Unstructured data: Data that has no inherent structure, which may include text documents, PDFs,
images, and video.
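As a small illustration of how structure dictates parsing, the sketch below reads a tiny structured CSV file and a semi-structured XML snippet; the column names, tags, and values are invented for illustration.

```python
# Minimal sketch contrasting structured and semi-structured parsing.
import csv
import io
import xml.etree.ElementTree as ET

# Structured: fixed columns, parsed positionally by a standard CSV reader.
csv_data = "timestamp,problem_type\n2024-01-01,disk_failure\n"
for row in csv.DictReader(io.StringIO(csv_data)):
    print(row["timestamp"], row["problem_type"])

# Semi-structured: self-describing tags, parsed by walking the XML tree.
xml_data = "<calls><call time='2024-01-01'><problem>disk_failure</problem></call></calls>"
root = ET.fromstring(xml_data)
for call in root.findall("call"):
    print(call.get("time"), call.findtext("problem"))
```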
Analyst Perspective on Data Repositories
The introduction of spreadsheets enabled business users to create simple logic on data structured
in rows and columns and create their own analyses of business problems.
As data needs grew, so did more scalable data warehousing solutions. These technologies enabled
data to be managed centrally, providing benefits of security, failover, and a single repository where
users could rely on getting an "official" source of data.
With the EDW (Enterprise Data Warehouse) model, data is managed and controlled by IT groups
and database administrators (DBAs), and data analysts must depend on IT for access and changes
to the data schemas. This imposes longer lead times for analysts to get data; most of the time is
spent waiting for approvals rather than starting meaningful work.
In-database analytics creates relationships to multiple data sources within an organization and
saves time spent creating these data feeds on an individual basis. In-database processing for deep
analytics enables faster turnaround time for developing and executing new analytic models, while
reducing, though not eliminating, the cost associated with data stored in local, "shadow" file
systems. In addition, rather than holding only the typical structured data of the EDW, analytic
sandboxes can house a greater variety of data, including raw and unstructured data.
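A minimal sketch of the in-database idea follows, using SQLite as a stand-in for the sandbox database; the table and column names are invented. The point is that the aggregation runs inside the database, so only a small result set comes back instead of a full extract.

```python
# Minimal sketch of in-database processing with SQLite as a stand-in.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
conn.executemany("INSERT INTO sales VALUES (?, ?)",
                 [("east", 120.0), ("west", 80.0), ("east", 60.0)])

# The aggregation runs inside the database; only the small result set
# is pulled back, instead of extracting all rows to a local file.
for region, total in conn.execute(
        "SELECT region, SUM(amount) FROM sales GROUP BY region"):
    print(region, total)
conn.close()
```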
Types of Data Repositories, from an Analyst Perspective
State of the Practice in Analytics
Current business problems provide many opportunities for organizations to
become more analytical and data driven.
Comparing BI with Data Science
BI vs Data Science
• Data scientists and Business Intelligence (BI) analysts have different roles within an
organization
• A company needs both types of professionals to really optimize its use of data. In a nutshell, BI
analysts focus on interpreting past data, while data scientists extrapolate on past data to make
predictions for the future.
• Data scientists help companies mitigate the uncertainty of the future by giving them valuable
information about projected sales and making general predictions of future performance.
• BI analysts, on the other hand, interpret past trends. These big data professionals perform
more meticulous, plan-based work, for example, reporting on the financial health of a company.
Current Analytical Architecture
For data sources to be loaded into the data warehouse, data needs to be well understood,
structured, and normalized with the appropriate data type definitions
This kind of centralization enables security, backup, and failover of highly critical data, but it also
means that data typically must go through significant preprocessing and checkpoints before it can
enter this sort of controlled environment, which does not lend itself to data exploration and iterative
analytics.
Additional local systems may emerge in the form of departmental warehouses and local data marts
that business users create to accommodate their need for flexible analysis.
Local data marts may not have the same constraints for security and structure as the main EDW
and allow users to do some level of more in-depth analysis.
These local systems are often not synchronized or integrated with other data stores and may not be backed up.
Data is read by additional applications across the enterprise for BI and reporting purposes.
High-priority operational processes get critical data feeds from the data warehouses and
repositories.
Analysts create data extracts from the EDW to analyze data offline in R or other local analytical
tools.
Many times these tools are limited to in-memory analytics on desktops, analyzing samples of data
rather than the entire population of a dataset.
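The "sample rather than population" pattern described above can be sketched briefly: read a large extract in chunks and keep only a random sample in memory. pandas is assumed to be available, and the file name is hypothetical.

```python
# Minimal sketch of in-memory sampling of a large data extract.
import pandas as pd

samples = []
for chunk in pd.read_csv("large_extract.csv", chunksize=100_000):
    # Keep a 1% random sample of each chunk so the desktop stays in-memory.
    samples.append(chunk.sample(frac=0.01, random_state=42))

sample_df = pd.concat(samples, ignore_index=True)
print(sample_df.describe())
```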
Data evolution and the rise of Big Data sources
To better understand the market drivers related to Big Data, it is helpful to first review the history
of data stores and the kinds of repositories and tools used to manage them.
Video surveillance, such as the thousands of video cameras spread across a city
Mobile devices, which provide geospatial location data of the users, as well as metadata about text
messages, phone calls, and application usage on smart phones
Smart devices, which provide sensor-based collection of information from smart electric grids, smart
buildings, and many other public and industry infrastructures
Nontraditional IT devices, including the use of radio-frequency identification (RFID) readers, GPS
navigation systems, and seismic processing
Emerging Big Data Ecosystem and a New Approach to Analytics
Cloudera, Hortonworks, and Pivotal have provided this value-add for the open source framework
Hadoop.
Data devices-"Sensornet" gat her data from multiple locations and continuously generate new data
about the is data.
Consider someone playing an online video game through a PC, game console, or smartphone. In
this case, the video game provider captures data about the skill and levels attained by the player.
Smartphones provide another rich source of data. In addition to messaging and basic phone
usage, they store and transmit data about Internet usage, SMS usage, and real-time location.
This metadata can be used for analyzing traffic patterns by scanning the density of smartphones
in locations to track the speed of cars or the relative traffic congestion on busy roads.
Retail shopping loyalty cards record not just the amount an individual spends, but also the locations
of stores that person visits, the kinds of products purchased, the stores where goods
are purchased most often, and the combinations of products purchased together.
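The "combinations of products purchased together" analysis can be sketched as simple pair counting over baskets; the basket contents below are invented for illustration.

```python
# Minimal sketch of counting product pairs purchased together.
from collections import Counter
from itertools import combinations

baskets = [["milk", "bread", "eggs"],
           ["milk", "bread"],
           ["bread", "eggs"]]

pair_counts = Counter()
for basket in baskets:
    # Count each unordered product pair once per basket.
    for pair in combinations(sorted(set(basket)), 2):
        pair_counts[pair] += 1

print(pair_counts.most_common(3))
```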
Emerging Big Data Ecosystem and a New Approach to Analytics
Data collectors are the entities that collect data from devices and users.
Examples include a cable TV provider tracking the shows a person watches, which TV channels
someone will or will not pay to watch on demand, and the prices someone is willing to pay for premium
TV content.
Retail stores tracking the path a customer takes through their store while pushing a shopping
cart with an RFID chip so they can gauge which products get the most foot traffic using
geospatial data collected from the RFID chips
Data aggregators
Make sense of the data collected from the various entities from the "SensorNet" or the "Internet of
Things." These organizations compile data from the devices and usage patterns collected by
government agencies, retail stores, and websites. In turn, they can choose to transform and package
the data as products to sell to list brokers, who may want to generate marketing lists of people who
may be good targets for specific ad campaigns.
Data users and buyers -These groups directly benefit from the data collected and aggregated by others
within the data value chain.
Retail banks, acting as data buyers, may want to know which customers have the highest likelihood
to apply for a second mortgage or a home equity line of credit.
To provide input for this analysis, retail banks may purchase data from a data aggregator.
Using technologies such as Hadoop to perform natural language processing on unstructured, textual
data from social media websites, users can gauge the reaction to events such as presidential
campaigns.
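As a single-machine sketch of gauging reaction in unstructured text, the example below scores sentiment with NLTK's VADER analyzer; at Big Data scale, this scoring step would be distributed across a framework such as Hadoop. The posts are invented, and NLTK with its VADER lexicon is assumed to be installed.

```python
# Sketch of sentiment scoring on short social media posts.
import nltk
nltk.download("vader_lexicon", quiet=True)
from nltk.sentiment import SentimentIntensityAnalyzer

posts = ["The candidate's speech was inspiring!",
         "Terrible debate performance, very disappointing."]

analyzer = SentimentIntensityAnalyzer()
for post in posts:
    # 'compound' ranges from -1 (most negative) to +1 (most positive).
    print(analyzer.polarity_scores(post)["compound"], post)
```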
Emerging Big Data ecosystem
Key Roles for the New Big Data Ecosystem
There are three recurring sets of activities that data scientists perform:
Reframe business challenges as analytics challenges: diagnose the business problem and determine
which kinds of candidate analytical methods can be applied to solve it.
Design, implement, and deploy statistical models and data mining techniques on Big Data.
Develop insights that lead to actionable recommendations. It is critical to note that applying
advanced methods to data problems does not necessarily drive new business value. Instead, it is
important to learn how to draw insights out of the data and communicate them effectively.
Data scientists are generally thought of as having five main sets of skills and behavioral
characteristics, as shown below:
Quantitative skill; technical aptitude; skeptical mind-set and critical thinking; curious and
creative; communicative and collaborative.
Big Data in Healthcare Industry
• Big data reduces the costs of treatment, since there is less chance of having to perform
unnecessary diagnoses.
• It helps in predicting outbreaks of epidemics and also in deciding what preventive measures
could be taken to minimize the effects of the same.
• Detecting diseases early prevents them from getting any worse, which in turn makes their
treatment easy and effective.
• Patients can be provided with evidence-based medicine which is identified and prescribed
after doing research on past medical results.
• Apple has come up with Apple HealthKit, CareKit, and ResearchKit. The main goal is to
empower the iPhone users to store and access their real-time health records on their phones.
Big Data in Education Industry
• Grading Systems
New advancements in grading systems have been introduced as a result of a proper analysis of student data
• Career Prediction
Appropriate analysis and study of every student’s records will help understand each student’s progress, strengths,
weaknesses, interests, and more. It would also help in determining which career would be the most suitable for the
student in future.
Big Data in Government Sector
Welfare Schemes
In making faster and informed decisions regarding various political programs
To identify areas that are in immediate need of attention
To stay up to date in the field of agriculture by keeping track of all existing land and livestock.
To overcome national challenges such as unemployment, terrorism, energy resources
exploration, and much more.
Cyber Security
Big Data is used extensively for fraud detection.
It is also used in catching tax evaders.
Example
The Food and Drug Administration (FDA), which runs under the jurisdiction of the Federal
Government of the USA, leverages the analysis of big data to discover patterns and
associations in order to identify and examine expected or unexpected occurrences of
food-based infections.
Big Data in Media and Entertainment Industry
With people having access to various digital gadgets, the generation of large amounts of
data is inevitable, and this is the main cause of the rise of big data in the media and
entertainment industry.
Beyond this, social media platforms are another way in which huge amounts of data are
being generated. Businesses in the media and entertainment industry have realized the
importance of this data and have been able to benefit from it for their growth.
Some of the benefits extracted from big data in the media and entertainment industry are given below:
Example
One on-demand music platform uses Big Data Analytics: it collects data from all its users around the globe and then
uses the analyzed data to give informed music recommendations and suggestions to every individual user.
Big Data in Weather Patterns
There are weather sensors and satellites deployed all around the globe. A huge amount of
data is collected from them, and then this data is used to monitor the weather and
environmental conditions.
Data collected from these sensors and satellites contribute to big data and can be used in
different ways such as:
In weather forecasting
To study global warming
In understanding the patterns of natural disasters
To make necessary preparations in the case of crises
To predict the availability of usable water around the world
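As a toy illustration of forecasting from sensor readings, the sketch below predicts the next reading with a simple moving average; real forecasting systems like those described here use far richer models, and the temperature series is invented.

```python
# Toy sketch: next-value forecast from recent sensor readings.
def moving_average_forecast(readings, window=3):
    """Predict the next value as the mean of the last `window` readings."""
    recent = readings[-window:]
    return sum(recent) / len(recent)

hourly_temps = [21.0, 21.4, 22.1, 22.9, 23.2]
print(f"Next-hour forecast: {moving_average_forecast(hourly_temps):.1f} C")
```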
Example
IBM Deep Thunder, a research project by IBM, provides weather forecasting through
high-performance computing of big data. IBM is also assisting Tokyo with improved
weather forecasting for natural disasters and with predicting the probability of damaged
power lines.
Big Data in Transportation Industry
Since the rise of big data, it has been used in various ways to make transportation more efficient
and easier. Following are some of the areas where big data contributes to transportation.
Route planning: Big data can be used to understand and estimate users’ needs on different
routes and on multiple modes of transportation and then utilize route planning to reduce
their wait time.
Congestion management and traffic control: Using big data, real-time estimation of congestion
and traffic patterns is now possible. For example, people use Google Maps to locate the
least traffic-prone routes.
Safety level of traffic: Using the real-time processing of big data and predictive analysis to
identify accident-prone areas can help reduce accidents and increase the safety level of traffic.
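One way to sketch the identification of accident-prone areas is to cluster reported accident coordinates; the example below uses scikit-learn's DBSCAN on invented coordinates.

```python
# Sketch of finding accident hotspots by clustering accident locations.
import numpy as np
from sklearn.cluster import DBSCAN

# (latitude, longitude) of reported accidents (invented)
points = np.array([[12.97, 77.59], [12.97, 77.60], [12.98, 77.59],
                   [13.10, 77.70]])

# eps is the neighborhood radius in degrees; label -1 marks noise points.
labels = DBSCAN(eps=0.02, min_samples=2).fit_predict(points)
print(labels)  # points sharing a label form one accident-prone cluster
```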
Example
Let’s take Uber as an example here. Uber generates and uses a huge amount of data regarding
drivers, their vehicles, locations, every trip from every vehicle, etc. All this data is analyzed and
then used to predict supply, demand, location of drivers, and fares that will be set for every trip.
Big Data in Banking Sector
The amount of data in the banking sector is skyrocketing every second. According to a
GDC prognosis, this data is estimated to grow 700 percent by the end of the next
year. Proper study and analysis of this data can help detect illegal
activities that are being carried out in the sector.
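A minimal sketch of flagging unusual transactions with an off-the-shelf anomaly detector follows, using scikit-learn's IsolationForest; the transaction amounts and contamination rate are invented for illustration.

```python
# Sketch of flagging suspicious transactions as anomalies.
import numpy as np
from sklearn.ensemble import IsolationForest

amounts = np.array([[25.0], [40.0], [32.0], [28.0], [5000.0], [35.0]])

# contamination is the assumed fraction of anomalous transactions.
model = IsolationForest(contamination=0.2, random_state=42).fit(amounts)
print(model.predict(amounts))  # -1 marks a suspicious transaction
```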
Key Roles for a Successful Analytics Project
Project Sponsor: Responsible for the genesis of the project. Provides the impetus and requirements for the
project and defines the core business problem. Generally provides the funding and gauges the degree of value
from the final outputs of the working team. This person sets the priorities for the project and clarifies the desired
outputs.
Project Manager: Ensures that key milestones and objectives are met on time and at the expected quality.
Business Intelligence Analyst: Provides business domain expertise based on a deep understanding of the data,
key performance indicators (KPIs), key metrics, and business intelligence from a reporting perspective.
Business Intelligence Analysts generally create dashboards and reports and have knowledge of the data feeds
and sources.
Data Analytics Lifecycle Overview
Database Administrator (DBA): Provisions and configures the database environment to support the analytics
needs of the working team. These responsibilities may include providing access to key databases or tables and
ensuring the appropriate security levels are in place related to the data repositories.
Data Engineer: Leverages deep technical skills to assist with tuning SQL queries for data management and
data extraction and provides support for data ingestion into the analytic sandbox.
Data Scientist: Provides subject matter expertise for analytical techniques, data modeling, and applying valid
analytical techniques to given business problems. Ensures overall analytics objectives are met. Designs and
executes analytical methods and approaches with the data available to the project.
Phase 1-Discovery: In Phase 1, the team learns the business domain, assesses available resources, frames the
business problem, and formulates initial hypotheses.
Phase 2-Data preparation: Phase 2 requires the presence of an analytic sandbox, in which the team can
work with data and perform analytics for the duration of the project. The team needs to execute extract, load,
and transform (ELT) or extract, transform, and load (ETL) to get data into the sandbox; together these are
sometimes abbreviated as ETLT. Data should be transformed in the ETLT process so the team can work with
it and analyze it. In this phase, the team also needs to familiarize itself with the data thoroughly and take steps
to condition the data.
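A compact sketch of the ETL step described above, using SQLite as the sandbox; the file, table, and column names are hypothetical.

```python
# Minimal ETL sketch: extract from CSV, transform, load into the sandbox.
import csv
import sqlite3

def etl(csv_path, conn):
    conn.execute("CREATE TABLE IF NOT EXISTS calls (ts TEXT, problem TEXT)")
    with open(csv_path, newline="") as f:
        for row in csv.DictReader(f):
            # Transform: normalize the problem label before loading.
            problem = row["problem_type"].strip().lower()
            conn.execute("INSERT INTO calls VALUES (?, ?)",
                         (row["timestamp"], problem))
    conn.commit()

conn = sqlite3.connect(":memory:")
# etl("support_calls.csv", conn)  # hypothetical input file
```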
Phase 3-Model planning: Phase 3 is model planning, where the team determines the methods, techniques,
and workflow it intends to follow for the subsequent model building phase. The team explores the data to
learn about the relationships between variables and subsequently selects key variables and the most suitable
models.
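A small sketch of this exploration step: inspect pairwise correlations to shortlist candidate predictor variables. pandas is assumed, and the column names and values are invented.

```python
# Sketch of model-planning exploration via a correlation check.
import pandas as pd

df = pd.DataFrame({"ad_spend":    [10, 20, 30, 40],
                   "visits":      [110, 190, 320, 380],
                   "temperature": [21, 35, 18, 28]})

# Variables strongly correlated with the target are candidate predictors.
print(df.corr()["visits"].sort_values(ascending=False))
```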
Overview of Data Analytics Lifecycle
Phase 4-Model building: In Phase 4, the team develops datasets for testing, training, and production
purposes. In addition, in this phase the team builds and executes models based on the work done in the model
planning phase. The team also considers whether its existing tools will suffice for running the models, or if it
will need a more robust environment for executing models and workflows (for example, fast hardware and
parallel processing, if applicable).
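A minimal sketch of building and evaluating a model on separate training and test sets follows, assuming scikit-learn and a synthetic dataset; the model choice here is illustrative, not prescribed by the lifecycle.

```python
# Sketch of Phase 4: train/test split and model fitting.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=200, n_features=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

model = LogisticRegression().fit(X_train, y_train)
# Held-out accuracy indicates whether the planned model is adequate.
print(f"Test accuracy: {model.score(X_test, y_test):.2f}")
```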
Phase 5-Communicate results: In Phase 5, the team, in collaboration with major stakeholders, determines if
the results of the project are a success or a failure based on the criteria developed in Phase 1. The team should
identify key findings, quantify the business value, and develop a narrative to summarize and convey findings
to stakeholders.
Phase 6-Operationalize: In Phase 6, the team delivers final reports, briefings, code, and technical documents.
In addition, the team may run a pilot project to implement the models in a production environment.
Once team members have run models and produced findings, it is critical to frame these results in a
way that is tailored to the audience that engaged the team.
Phase_1:Discovery
• In this phase, the data science team must learn and investigate the problem, develop context and
understanding, and learn about the data sources needed and available for the project.
• The team formulates initial hypotheses that can later be tested with data.
Learning the Business Domain
Understanding the domain area of the problem is essential.
• Data scientists have deep computational and quantitative knowledge that can be broadly applied across
many disciplines involving statistics and mathematics.
• Deep knowledge of the methods, techniques, and ways for applying heuristics to a variety of business and
conceptual problems.
• Alternatively, a person may have deep knowledge of a field of study, such as oceanography, biology, or genetics,
with some depth of quantitative knowledge.
• The team needs to determine how much business or domain knowledge the data scientist needs to develop
models
Resources
• Available tools and technology the team will be using, and the types of systems needed for later phases to
operationalize the models
• Evaluate the level of analytical sophistication within the organization and gaps that may exist related to tools,
technology, and skills
Phase_1:Discovery
• Framing the Problem
• Framing is the process of stating the analytics problem to be solved. At this point, it is a best practice to write
down the problem statement and share it with the key stakeholders.
• Each team member may hear slightly different things related to the needs and the problem and have
somewhat different ideas of possible solutions
• The team needs to clearly articulate the current situation and its main challenges.
• Identify the main objectives of the project, what needs to be achieved in business terms, and what needs
to be done to meet those needs.
Identifying Key Stakeholders
• During discussion, the team can identify the success criteria, key risks, and stakeholders, which should
include anyone who will benefit from the project or will be significantly impacted by the project.
Interviewing the Analytics Sponsor
• Team must use its knowledge and expertise to identify the true underlying problem and appropriate solution.
• The data science team typically may have a more objective understanding of the problem set than the
stakeholders, who may be suggesting solutions to a given problem.
• Team can probe deeper into the context and domain to clearly define the problem and propose possible paths
from the problem to a desired outcome.
Phase_1:Discovery
Common questions that are helpful to ask during the discovery phase when interviewing the project sponsor
• What business problem is the team trying to solve?
• What is the desired outcome of the project?
• What data sources are available?
• What industry issues may impact the analysis?
• What timelines need to be considered?
• Who could provide insight into the project?
• Who has final decision-making authority on the project?
• How will the focus and scope of the problem change if the following dimensions change:
Time: Analyzing 1 year or 10 years' worth of data?
• People: Assess impact of changes in resources on project timeline.
• Risk: Conservative to aggressive
• Resources: None to unlimited (tools, technology, systems)
• Size and attributes of data: Including internal and external data sources
Phase_1:Discovery
• Developing Initial Hypotheses
• Identifying Potential Data Sources
Identify the kinds of data the team will need to solve the problem, and consider the
volume, type, and time span of the data needed to test the hypotheses.
• The team should perform five main activities during this step of the discovery phase:
Identify data sources: Make a list of candidate data sources the team may need to test the
initial hypotheses outlined in this phase. Make an inventory of the datasets currently
available and those that can be purchased or otherwise acquired for the tests the team
wants to perform
Capture aggregate data sources: This is for previewing the data and providing high-level
understanding. It enables the team to gain a quick overview of the data and perform
further exploration on specific areas. It also points the team to possible areas of interest
within the data
Phase_1:Discovery
Review the raw data: Obtain preliminary data from initial data feeds. Begin understanding
the interdependencies among the data attributes, and become familiar with the content
of the data, its quality, and its limitations.
Evaluate the data structures and tools needed: The data type and structure dictate which
tools the team can use to analyze the data. This evaluation gets the team thinking about
which technologies may be good candidates for the project and how to start getting
access to these tools
Scope the sort of data infrastructure needed for this type of problem: In addition to the
tools needed, the data influences the kind of infrastructure that's required, such as disk
storage and network capacity.
Phase 2: Data Preparation
• Data Analytics Lifecycle involves data preparation, which includes the steps to explore, preprocess, and condition
data prior to modeling and analysis.
• The team needs to create a robust environment in which it can explore the data that is separate from a
production environment.
• To get the data into the sandbox, the team needs to perform ETLT, a combination of extracting, transforming,
and loading data into the sandbox.
• The team needs to learn about the data in the sandbox.
• The team also must decide how to condition and transform data to get it into a format to facilitate subsequent
analysis.
• The team may perform data visualizations to help team members understand the data, including its trends,
outliers, and relationships among data variables.
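A quick sketch of such a visual check follows, assuming matplotlib and an invented measurement series with one suspicious value.

```python
# Sketch of a quick visual data check during data preparation.
import matplotlib.pyplot as plt

measurements = [9.8, 10.1, 10.0, 9.9, 10.2, 25.0]  # one suspicious outlier

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(8, 3))
ax1.plot(measurements, marker="o")  # trend across observations
ax1.set_title("Trend")
ax2.boxplot(measurements)           # outliers show up beyond the whiskers
ax2.set_title("Outliers")
plt.tight_layout()
plt.show()
```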
Preparing the Analytic Sandbox
• The first subphase of data preparation requires the team to obtain an analytic sandbox (also commonly
referred to as a workspace), in which the team can explore the data without interfering with live production
databases, such as those holding the company's financial data.
• When developing the analytic sandbox, it is a best practice to collect all kinds of data there, as team
members need access to high volumes and varieties of data for a Big Data analytics project.
• The analytic sandbox enables organizations to undertake more ambitious data science projects and move
beyond doing traditional data analysis and Business Intelligence to perform more robust and advanced predictive
analytics
A data sandbox, in the context of big data, is a scalable and developmental platform used to explore an
organization's rich information sets through interaction and collaboration. It allows a company to realize its
actual investment value in big data.