Big Data Analysis - Module 1
8th sem Elective - IV
Prof. Vinutha H
Asst. Professor
Dept. of CSE
Dr.AIT
What is BIG DATA?
• Big Data is the next generation of data warehousing and business analytics, and is
poised to deliver top-line revenue cost-efficiently for enterprises
• Rapid pace of innovation and change; where we are today is not where we'll be in
just two years, and definitely not where we'll be in a decade
• Big Data has been around for decades for firms that have been handling tons of
transactional data over the years—even dating back to the mainframe era
• A perfect storm of convergence: traditional data management and analytics software + hardware
technologies + open-source technology + commodity hardware
– All are merging to create new alternatives for IT and business executives
• "Big Data" isn't new - some companies have dealt with billions of transactions for many years
• Recent innovations –
– give the ability to leverage new technology and approaches
– enable firms to handle more data affordably
– take advantage of a variety of data types, such as unstructured data
• Ability to store data in an affordable way has changed the game in industries
• Organizations can now store more data than they ever could before
– They don't have to make decisions about which half to keep or how much history to keep
– It's now economically feasible to keep all the history and go back later to start looking through it
again
• Aside from the changes in the actual hardware and software technology, there has also been a
massive change in the actual evolution of data systems
• 3 pinnacle stages in the evolution of data systems:
– Dependent (Early Days)
• Data systems were fairly new
• Users didn't quite know what they wanted
• IT assumed that “Build it and they shall come.”
• Big data refers to "datasets whose size is beyond the ability of typical database software
tools to capture, store, manage, and analyze"
- McKinsey study
• Big data in many sectors today will range from a few dozen terabytes to multiple petabytes
(thousands of terabytes)
• Big Data isn't just a description of raw volume. "The real issue is usability."
• The real challenge - identifying or developing the most cost-effective and reliable methods for
extracting value from all the terabytes and petabytes of data now available
Why Big Data Now?
• Past mistakes - investing in new technologies that didn’t fit into existing business frameworks
• many companies made substantial investments in customer-facing technologies that subsequently
failed to deliver expected value
• Management either forgot (or just didn’t know) that big projects require a synchronized
transformation of people, process, and technology - All three must be marching in step or the
project is doomed
• The technology of Big Data is the easy part
• Hard part is figuring out what you are going to do with the output
generated by your Big Data analytics
• Making sure that you have the people and process pieces ready
before buying the technology
A Convergence of Key Trends
• Big companies have been collecting and storing large amounts of data for a long
time
• Difference between “Old Big Data” and “New Big Data” – Accessibility
• Large amounts of information were stored on tape
• The real change has been in the ways that we access that data and use it to create
value
• Eg: technologies like Hadoop make it functionally practical to access a tremendous
amount of data, and then extract value from it
• Convergence of several trends: more data + less expensive and faster hardware
• The availability of lower-cost hardware - easier and more feasible to retrieve and
process information, quickly and at lower costs
• Cost/benefit has really been a game changer – getting raw speed at an affordable price
• Traditional data - “structured” & it is put into a database based on the type of
data (i.e., character, numeric, floating point, etc)
• New data – "unstructured" eg: text, audio, video, image, geospatial, and Internet
data (including click streams and log files)
• “Semi-structured” data –
– a combination of different types of data
– has some pattern or structure that is not as strictly defined as
structured data Eg: call center logs may contain customer name +
date of call + complaint where the complaint information is
unstructured and not easily synthesized into a data store
– XML data
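A minimal sketch of parsing such a semi-structured call-center record in Python: the customer name and date are structured fields, while the complaint stays as free text. The XML layout and field names here are illustrative assumptions, not a real system's schema.

```python
import xml.etree.ElementTree as ET

# Hypothetical semi-structured call-center record: structured fields
# (customer, date) alongside an unstructured free-text complaint.
record_xml = """
<call>
  <customer>Jane Doe</customer>
  <date>2023-05-14</date>
  <complaint>Billed twice for the same order; very frustrating.</complaint>
</call>
"""

def parse_call_record(xml_text):
    """Extract the structured fields; keep the complaint as raw text."""
    root = ET.fromstring(xml_text)
    return {
        "customer": root.findtext("customer"),
        "date": root.findtext("date"),
        "complaint": root.findtext("complaint"),  # unstructured payload
    }

rec = parse_call_record(record_xml)
print(rec["customer"])  # Jane Doe
```

The tags give the record just enough structure to pull out the queryable fields, while the complaint text would still need text analytics to synthesize into a data store.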
• Velocity - speed at which data is created, accumulated, ingested, and
processed
• Internet data (i.e., clickstream, social media, social networking links, search engine data)
• Primary research (i.e., surveys, experiments, observations)
• Secondary research (i.e., competitive and marketplace data, industry reports, consumer data, business
data)
• Location data (i.e., mobile device data, geospatial data)
• Image data (i.e., video, satellite image, surveillance)
• Supply chain data (i.e., EDI, vendor catalogs and pricing, quality information)
• Device data (i.e., sensors, PLCs, RF devices, LIMs, telemetry)
• Black Box Data (i.e., voices of the flight crew, recordings of microphones and earphones, and the
performance information of the aircraft)
• Stock Exchange Data (i.e., information about the ‘buy’ and ‘sell’ decisions made on a share of different
companies made by the customers)
• Power grid data (i.e., information consumed by a particular node with respect to a base station)
• Transport data (i.e., Transport data includes model, capacity, distance and availability of a vehicle)
• The wide variety of data leads to complexities in ingesting the data into data
storage
• Unstructured data (the kind that tends to complicate the data definition, takes up lots of storage
capacity, and is typically more difficult to analyze)
– does not have a predefined data model and/or does not fit well into a relational database
– typically text heavy, but may contain data such as dates, numbers, and facts
• Semi-structured data (the kind that doesn't fit into a formal structure
of data models)
– contains tags that separate semantic elements, which includes the capability to enforce
hierarchies within the data
If unstructured data is so complicating, then why bother?
• The amount of data (all data, everywhere) is doubling every two years
• Most new data is unstructured (unstructured data represents almost 95 percent of new data, while
structured data represents only 5 percent)
• Unstructured data tends to grow exponentially, unlike structured data, which tends to grow in a
more linear fashion
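The contrast between exponential and linear growth can be made concrete with a small projection; the starting volume and doubling period below are illustrative, matching the "doubling every two years" claim above.

```python
def project_volume(initial, years, doubling_period=2):
    """Exponential growth: volume doubles every `doubling_period` years."""
    return initial * 2 ** (years / doubling_period)

# Starting from 10 TB of unstructured data, doubling every two years:
print(project_volume(10, 4))   # 40.0 TB after four years
print(project_volume(10, 10))  # 320.0 TB after a decade
```

A linearly growing store would add a fixed amount per year instead, which is why unstructured data quickly dominates total capacity.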
• Unstructured data is vastly underutilized - there's a lot of money to be made for smart individuals
and companies that can mine unstructured data successfully
• The explosion of data is happening due to more open and transparent societies
– Eg: “Resumes used to be considered private information,” - “Not anymore with the advent of
LinkedIn.”
– Instagram and Flickr for pictures, Facebook for circle of friends, and Twitter for our personal
thoughts (and what the penalty can be given the recent London Olympics, where a Greek athlete
was sent home for violating strict guidelines on what athletes can say in social media)
“Even if you don’t know how you are going to apply it today,
unstructured data has value”
– As technology evolved to absorb greater volumes of data, the costs of data environments started
to come down, and companies began collecting even more transactional data
– Today, many companies have the capability to store and analyze data generated from every
search you run on their websites, every article you read, and every product you look at
– By combining that specific data with anonymous data from external sources, they can predict
your likely behavior with astonishing accuracy
– It might sound creepy, but it's also helping keep us safe from criminals and terrorists. "A lot of
the technology used by the CIA and other security agencies evolved through database
marketing," says Doyle.
– Some of the tools originally developed for database marketers are now used to detect fraud and
prevent money-laundering
Big Data and the New School of Marketing
• Dan Springer, CEO of Responsys, defines the new school of marketing
– "Today's consumers have changed. They've put down the newspaper, they fast
forward through TV commercials, and they junk unsolicited email. Why?
– They have new options that better fit their digital lifestyle.
– They can choose which marketing messages they receive, when, where, and
from whom. They prefer marketers who talk with them, not at them.
– New School marketers deliver what today's consumers want: relevant
interactive communication across the digital power channels: email, mobile,
social, display and the web"
• Consumers Have Changed. So Must Marketers
– lifecycle model is still the best way to approach marketing
– But a linear concept of a succession of lifecycle "stages" is no longer a useful framework for planning marketing
campaigns and programs
– Because today's new cross-channel customer is online, offline, captivated, distracted, satisfied, annoyed,
vocal, or quiet at any given moment
– Marketing to today's cross-channel consumer demands a more nimble, holistic approach, one in which
customer behavior and preference data determine the content, timing, and delivery channel of
marketing messages
– Marketing campaigns should be cohesive: content should be versioned and distributable across multiple
channels.
– Marketers can still drive conversions and revenue, based on their own needs, with targeted campaigns sent
manually, but more of their marketing should be driven by, and sent via preferred channels in response
to, individual customer behaviors and events
– How can marketers plan for that?
• Permission, integration, and automation are the keys
• Along with a more practical lifecycle model designed to make every acquisition marketing investment
result in conversion, after conversion, after conversion
The Right Approach: Cross-Channel Lifecycle
Marketing
• starts with the capture of customer permission, contact information, and
preferences for multiple channels
• It also requires marketers to have the right integrated marketing and customer
information systems, so that
(1) they can have complete understanding of customers through stated
preferences and observed behavior at any given time; and
(2) they can automate and optimize their programs and processes throughout
the customer lifecycle
• Once marketers have that, they need a practical framework for planning marketing
activities
Loops that guide marketing strategies and tactics in the Cross-Channel Lifecycle
Marketing approach: conversion, repurchase, stickiness, win-back and re-permission
Social and Affiliate Marketing
• Word-of-mouth marketing has been the most powerful form of marketing since before the Internet
– Eg: from the Avon Lady to the Tupperware parties that made buying plastics acceptable back in the 1940s
• The concept of affiliate marketing, or pay for performance marketing on the Internet is credited to William J. Tobin, the
founder of PC Flowers & Gifts as he was granted patents around the concept of an online business rewarding another site (an
affiliate site) for each referred transaction or purchase
• Amazon.com launched its own affiliate program in 1996, and middleman affiliate networks like Linkshare and Commission
Junction emerged during the late-1990s Internet boom, providing the tools and technology to allow any brand to put affiliate
marketing practices to use
• Today, industry analysts estimate affiliate marketing to be a $3 billion industry
• By 2012, with the social web of Facebook, Twitter, and Tumblr, any consumer can do affiliate marketing.
• Couponmountain.com and other well-known affiliate sites generate multimillion-dollar yearly revenues for driving transactions
for the merchants they promote
• Most people trust a recommendation from the people they know
• Professional affiliate marketing sites provide the aggregation of many merchant offers on one centralized site, but they completely
lack the concept of trusted-source recommendations
• Using the backbone and publication tools created by companies like Facebook and Twitter, brands will soon find that
rewarding their own consumers for their advocacy is a required piece of their overall digital marketing mix
Empowering Marketing with Social Intelligence
• Niv Singer, Chief Technology Officer at Tracx, a social media intelligence software provider, speaks about
the big data challenges faced in the social media realm and how it's impacting the way business is
done today, and in the future
– As a result of the growing popularity and use of social media, "big data" is created that is
immense, and continues to grow exponentially
– Millions of status updates, blog posts, photographs, and videos are shared every second
– Successful organizations will not only need to identify the information relevant to their company
and products, but also be able to dissect it, make sense of it, and respond to it in real time and
on a continuous basis, drawing business intelligence that helps predict future customer behavior
• Real challenge is to unify social profiles for a single user who may be using different names or
handles on each of their social networks
• So an algorithm combs through key factors (content of posts, location, etc.), among others, to
provide robust identity unification
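A toy sketch of how such identity unification might score a pair of profiles on shared key factors. The factor names and weights below are assumptions for illustration, not Tracx's actual algorithm.

```python
def unification_score(profile_a, profile_b, weights=None):
    """Score how likely two social profiles belong to the same person,
    as a weighted overlap of illustrative key factors."""
    weights = weights or {"name": 0.5, "location": 0.3, "content_terms": 0.2}
    score = 0.0
    if profile_a["name"].lower() == profile_b["name"].lower():
        score += weights["name"]          # same handle, ignoring case
    if profile_a["location"] == profile_b["location"]:
        score += weights["location"]      # posting from the same place
    if set(profile_a["content_terms"]) & set(profile_b["content_terms"]):
        score += weights["content_terms"]  # overlapping post topics

    return score

a = {"name": "JSmith", "location": "NYC", "content_terms": {"coffee", "jazz"}}
b = {"name": "jsmith", "location": "NYC", "content_terms": {"jazz", "vinyl"}}
print(unification_score(a, b))  # 1.0 -> likely the same person
```

A production system would use many more signals and fuzzy matching, but the idea is the same: combine weak per-factor evidence into one unification score.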
• Clients should have the flexibility to sort influencers by any of these characteristics:
– Very intelligent software is required to parse all that social data to define things like the sentiment of a post.
– A system that also learns over time what that sentiment means to a specific client or brand
– Then represent that data with increased levels of accuracy
– This provides clients a way to “train” a social platform to measure sentiment more closely to the way they would be doing
it manually themselves.
– It’s important for brands to be able to understand the demographic information of the individual driving social discussions
around their brand such as gender, age, and geography so they can better understand their customers and better target
campaigns and programs based on that knowledge
• In terms of geography
– Social check-in data from Facebook, Foursquare, and similar social sites and applications over maps are combined to show
brands at the country, state/region, state, and down to the street level where conversations are happening about their
brand, products, or competitors
– This capability enables marketers to deliver better service or push coupons in real time, right when someone states a need,
offering value within steps from where they already are, which has immense potential to drive sales and brand loyalty
• Singer believes that the real power comes in mining social data for business intelligence, not only for marketing, but also for
customer support and sales
• As a result, they've designed it as a data management system that just happens to be focused on managing
unstructured social data, but which can easily integrate with other kinds of data sets too
• Eg1: Integration with CRM systems like Salesforce.com and Microsoft Dynamics to enable
companies to get a more holistic view of what's going on with their clients by supplementing existing
data sets (which are more static in nature) with the social data set (which is more dynamic and real-
time)
• Eg2: Integration with popular analytics platforms like Google Analytics and Omniture, so marketers
can see a direct correlation and payoff of social campaigns through improved social sentiment or an
increase in social conversations around their brand or product
• Social media is the world's largest and purest focus group
• Marketers now have the opportunity to mine social conversations for purchase intent and brand lift
through Big Data
• So, marketers can communicate with consumers when they are emotionally engaged, regardless of
the channel
• Since this data is captured in real-time, Big Data is coercing marketing organizations into moving
more quickly to optimize media mix and message
• Because this data sheds light on all aspects of consumer behavior, companies are aligning data to insight to
prescription across channels, across media, and across the path to purchase
Fraud and Big Data
• Fraud is intentional deception made for personal gain or to damage another individual
• One of the most common forms of fraudulent activity is credit card fraud
• The credit card fraud rate in many countries is increasing
• As per Javelin's research, "8th Annual Card Issuers' Safety Scorecard: Proliferation of Alerts Lead to
Quicker Detection Time and Lower Fraud Costs," credit card fraud incidence increased 87 percent in
2011
• Despite the significant increase in incidence, total cost of credit card fraud increased only 20 percent
• The comparatively small rise in total cost can be attributed to an increasing sophistication of fraud
detection mechanisms
One approach to solve fraud with Big Data
• According to the Capgemini Financial Services Team:
– Even though fraud detection is improving, the rate of incidents is rising
– This means banks need more proactive approaches to prevent fraud
– While issuers' investments in detection and resolution have resulted in an influx of customer-
facing tools and falling average detection times among credit card fraud victims, the rising
incidence rate indicates that credit card issuers should prioritize preventing fraud.
• Social media and mobile phones are forming the new frontiers for fraud
• Social networks are a great resource for fraudsters: consumers are still sharing a significant amount
of personal information frequently used to authenticate a consumer's identity. Those with public
profiles (those visible to everyone) are more likely to expose this personal information.
• In order to prevent the fraud, credit card transactions are monitored and checked in near real time. If
the checks identify pattern inconsistencies and suspicious activity, the transaction is identified for
review and escalation.
• The Capgemini Financial Services team believes that due to the nature of data streams and
processing required, Big Data technologies provide an optimal technology solution based on the
following three Vs:
– 1. High volume. Years of customer records and transactions (150 billion records per year)
– 2. High velocity. Dynamic transactions and social media information
– 3. High variety. Social media plus other unstructured data such as customer emails, call center
conversations, as well as transactional structured data
• Capgemini's new fraud Big Data initiative focuses on flagging suspicious credit card transactions
to prevent fraud in near real-time via multi-attribute monitoring
• Real-time inputs involving transaction data and customers records are monitored via validity checks
and detection rules
• Pattern recognition is performed against the data to score and weight individual transactions across
each of the rules and scoring dimensions
• A cumulative score is then calculated for each transaction record and compared against thresholds to
decide if the transaction is potentially suspicious or not
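The cumulative rule-scoring step described above can be sketched as follows; the rule names, weights, and threshold are invented for illustration, not Capgemini's actual detection rules.

```python
# Hypothetical detection rules; each returns a score contribution.
RULES = [
    ("large_amount", lambda t: 40 if t["amount"] > 5000 else 0),
    ("foreign_country", lambda t: 30 if t["country"] != t["home_country"] else 0),
    ("rapid_repeat", lambda t: 30 if t["tx_last_hour"] > 5 else 0),
]
THRESHOLD = 60  # cumulative score at or above this flags the transaction

def score_transaction(tx):
    """Weight each rule, sum into a cumulative score, compare to threshold."""
    total = sum(rule(tx) for _, rule in RULES)
    return total, total >= THRESHOLD

tx = {"amount": 7200, "country": "RO", "home_country": "US", "tx_last_hour": 1}
print(score_transaction(tx))  # (70, True) -> review and escalate
```

Keeping each rule as an independent scorer makes it easy to add or retune dimensions without touching the aggregation logic.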
• Capgemini uses an open-source tool named Elasticsearch, a distributed, free and open-source
search server based on Apache Lucene.
– It can be used to search all kinds of documents in near real-time
– They use the tool to index new transactions
– Indexed historical data sets can be used in conjunction with real-time data to identify
deviations from typical payment patterns
– This Big Data component allows overall historical patterns to be compared and contrasted, and allows the
number of attributes and characteristics about consumer behavior to be very wide, with little impact on
overall performance.
– Once the transaction data has been processed, a percolator query then identifies new
transactions that match stored suspicious-activity profiles
– Percolator is a system for incrementally processing updates to large data sets.
– Percolator is the technology that Google used in building the index— that links keywords and URLs—used to
answer searches on the Google page
– Percolator query can handle both structured and unstructured data.
– This provides scalability to the event processing framework, and allows specific suspicious transactions to
be enriched with additional unstructured information—phone location/geospatial records, customer travel
schedules, and so on
– This ability to enrich the transaction further can reduce false positives and improve the customer
experience while redirecting fraud efforts to actual instances of suspicious activity
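Percolation reverses the usual search: instead of running a query against stored documents, each new document is matched against a set of stored queries. A dependency-free, in-memory analogue of that idea (the fraud profiles below are invented for illustration, not real Elasticsearch queries):

```python
# Stored "queries" (suspicious-activity profiles), each a predicate
# over an incoming transaction document.
stored_queries = {
    "night_atm": lambda doc: doc["channel"] == "atm" and doc["hour"] < 5,
    "geo_jump": lambda doc: doc["km_from_last_tx"] > 500,
}

def percolate(doc):
    """Return the names of all stored queries the document matches."""
    return [name for name, query in stored_queries.items() if query(doc)]

tx = {"channel": "atm", "hour": 3, "km_from_last_tx": 900}
print(percolate(tx))  # ['night_atm', 'geo_jump']
```

In Elasticsearch the profiles would be stored in a `percolator` field and matched by the search engine itself, which is what lets the attribute set grow very wide with little impact on performance.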
Another approach to solving fraud with Big Data -
social network analysis (SNA)
• SNA is the precise analysis of social networks.
• Social network analysis views social relationships as networks of individuals and the ties between them
• SNA reveals all individuals involved in fraudulent activity, from perpetrators to their associates, and helps analysts understand their relationships and behaviors
in order to identify a bust-out fraud case
• According to a recent article in bankersonline.com posted by Experian, “bust out” is a hybrid credit and fraud problem and the scheme is
typically defined by the following behavior:
– The account in question is delinquent or charged-off
– The balance is close to or over the limit
– One or more payments have been returned
– The customer cannot be located
– The above conditions exist with more than one account and/or financial institution.
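The bust-out indicators listed above can be expressed as a simple checklist; the field names and the near-limit threshold are illustrative assumptions, since a real system would pull these signals from account records.

```python
def looks_like_bust_out(acct):
    """Flag an account when all of the bust-out indicators hold."""
    conditions = [
        acct["delinquent_or_charged_off"],               # delinquent/charged-off
        acct["balance"] >= 0.95 * acct["credit_limit"],  # close to or over limit
        acct["returned_payments"] >= 1,                  # returned payment(s)
        not acct["customer_reachable"],                  # customer can't be located
        acct["related_accounts_affected"] > 1,           # multiple accounts/institutions
    ]
    return all(conditions)

acct = {"delinquent_or_charged_off": True, "balance": 9800,
        "credit_limit": 10000, "returned_payments": 2,
        "customer_reachable": False, "related_accounts_affected": 3}
print(looks_like_bust_out(acct))  # True
```

SNA solutions go further by scoring these indicators across a whole network of linked accounts rather than one account at a time.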
• There are some Big Data solutions in the market, like SAS's SNA solution, which helps institutions go beyond individual and account views
to analyze all related activities and relationships at a network dimension.
• The network dimension allows you to visualize social networks and see previously hidden connections and relationships, which potentially could
be a group of fraudsters.
• Obviously there are huge reams of data involved behind the scenes, but the key to SNA solutions like SAS's is the visualization techniques that let
users easily engage and take action
Risk and Big Data
• The two most common types of risk management: credit risk management and market risk
management
• The tactics for risk professionals typically include
– avoiding risk
– reducing the negative effect or probability of risk
– or accepting some or all of the potential consequences in exchange for a potential upside gain
• Credit risk analytics - focus on past credit behaviors to predict the likelihood that a borrower will
default on any type of debt by failing to make payments they are obligated to make. For example, "Is
this person likely to default on their $300,000 mortgage?"
• Market risk analytics - focus on understanding the likelihood that the value of a portfolio will
decrease due to the change in stock prices, interest rates, foreign exchange rates, and commodity
prices. For example, “Should we sell this holding if the price drops another 10 percent?”
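The credit-risk example can be made concrete with the standard expected-loss decomposition (probability of default x loss given default x exposure at default), and the market-risk example with a simple price-drop trigger; all the numbers below are illustrative.

```python
def expected_loss(pd_default, lgd, ead):
    """Expected loss = probability of default x loss given default x
    exposure at default (the standard decomposition)."""
    return pd_default * lgd * ead

# Credit risk: a $300,000 mortgage with a 2% default probability
# and 40% loss given default (illustrative numbers).
print(round(expected_loss(0.02, 0.40, 300_000), 2))  # 2400.0

def should_review(cost_basis, price, drop=0.10):
    """Market risk trigger: has the price fallen `drop` below cost basis?"""
    return price <= cost_basis * (1 - drop)

print(should_review(100.0, 89.0))  # True -> consider selling
```

The expected-loss figure is what a risk-adjusted pricing or provisioning decision would weigh against the potential upside of the loan.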
Credit Risk Management
• Credit risk management is a critical function that spans a diversity of businesses across a wide range
of industries
• Ori Peled is the American Product Leader for MasterCard Advisors Risk & Marketing Solutions. He
brings several years of information services experience to his current role with MasterCard,
having served in various product development capacities at Dun & Bradstreet. Peled shares his insight
with us on credit risk:
– Whether you're a small B2B regional plastics manufacturer or a large global consumer financial
institution, the underlying credit risk principles are essentially the same: driving the business
using the optimal balance of risk and reward
• Traditionally, credit risk management was rooted in the philosophy of minimizing losses
• Credit risk professionals and business leaders came to understand that there are acceptable levels of
risk that can boost profitability beyond what would normally have been achieved by simply focusing
on avoiding write-offs
• The shift to the more profitable credit risk management approach is aided in large part by an ever-
expanding availability of data, tools, and advanced analytics