Lecture 2.3.7-2.3.9


APEX INSTITUTE OF TECHNOLOGY

(CSE)
DEPARTMENT OF COMPUTER SCIENCE & ENGINEERING

Predictive Analytics (21CSH-340)


Faculty: Dr. Jitender Kaushal
Associate Professor
(E14621)
DISCOVER . LEARN . EMPOWER
Sequence Data Mining: Techniques and Applications
COURSE OBJECTIVES
• Understand how analytics has provided solutions to industries, using real case studies.
• Explain modeling and relationships; derive and reclassify fields; integrate and collect data.
• Review and explore data prior to modeling.
COURSE OUTCOMES
On completion of this course, the students shall be able to:
• CO3 (Unit-2): Perform data collection and initial data handling by managing data structures and measurement levels.
• CO4: Analyze and transform data for predictive modeling through various data transformation techniques.
What is a sequence?
• Ordered set of elements: s = a1, a2, ..., an
• Each element ai could be
• Numerical
• Categorical: domain is a finite set of symbols S, |S| = m
• Multiple attributes
• The length n of a sequence is not fixed
• Order determined by time or position and could be regular or
irregular
Real-life sequences
• Classical applications
• Speech: sequence of phonemes
• Language: sequence of words and delimiters
• Handwriting: sequence of strokes
• Newer applications
• Bioinformatics:
• Genes: Sequence of 4 possible nucleotides, |S|=4
• Example: AACTGACCTGGGCCCAATCC
• Proteins: Sequence of 20 possible amino-acids, |S|=20
• Telecommunications: alarms, data packets


• Retail data mining: customer behavior, sessions in an e-store (for example, Amazon)
• Intrusion detection
Intrusion detection
Intrusions could be detected at:
• Network level (denial-of-service attacks, port scans, etc.) [KDD Cup 99]
• Sequence of TCP dumps
• Host level (attacks on privileged programs like lpr, sendmail)
• Sequence of system calls
• |S| = set of all possible system calls, ~100
Example system-call trace: open, lseek, lstat, mmap, execve, ioctl, ioctl, close, execve, close, unlink
Outline
• Traditional mining operations on sequences
• Classification
• Clustering
• Finding repeated patterns
• Primitives for handling sequence data
• Sequence-specific mining operations
• Partial sequence classification (Tagging)
• Segmenting a sequence
• Predicting next symbol of a sequence
• Applications in biological sequence mining
Classification of whole sequences
Given:
• a set of classes C and
• a number of example sequences in each class,
train a model so that for an unseen sequence we can say to which class
it belongs
Example:
• Given a set of protein families, find family of a new protein
• Given a sequence of packets, label session as intrusion or normal
• Given several utterances of a set of words, classify a new utterance to the
right word
Conventional classification methods
• Conventional classification methods assume
• record data: fixed number of attributes
• single record per instance to be classified (no order)
• Sequences:
• variable length, order important.
• Adapting conventional classifiers to sequences
• Generative classifiers
• Boundary-based classifiers
• Distance based classifiers
• Kernel-based classifiers
Generative methods
• For each class i, train a generative model Mi to maximize the likelihood over all training sequences in class i
• Estimate Pr(ci) as the fraction of training instances in class i
• For a new sequence x,
• find Pr(x|ci)*Pr(ci) for each class i using Mi
• choose the i with the largest value of Pr(x|ci)*Pr(ci)

Need a generative model for sequence data
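
To make the decision rule concrete, the following is a minimal sketch (not from the slides), assuming each class has a trained model object exposing a hypothetical log_likelihood(sequence) method; suitable sequence models are introduced on the later slides (Markov chains, HMMs).

import math

def classify(x, class_models, class_priors):
    # class_models: dict class -> model with an assumed log_likelihood(seq) method
    # class_priors: dict class -> Pr(c), the fraction of training sequences in class c
    best_class, best_score = None, -math.inf
    for c, model in class_models.items():
        # Compare log Pr(x|c) + log Pr(c), equivalent to comparing Pr(x|c) * Pr(c)
        score = model.log_likelihood(x) + math.log(class_priors[c])
        if score > best_score:
            best_class, best_score = c, score
    return best_class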


Boundary-based methods
• Data: points in a fixed multi-dimensional space
• Output of training: Boundaries that define regions within
which same class predicted
• Application: Tests on boundaries to find region

Linear discriminants Decision trees Neural networks

Need to embed sequence data in a fixed coordinate space


Kernel-based classifiers
• Define a function K(xi, x) that intuitively defines the similarity between two sequences and satisfies two properties
• K is symmetric, i.e., K(xi, x) = K(x, xi)
• K is positive definite
• Training: learn for each class c
• wic for each training sequence xi
• bc
• Application: predict the class of x
• For each class c, find f(x,c) = Σi wic K(xi, x) + bc
• The predicted class is the c with the highest value of f(x,c)
• Well-known kernel classifiers
• Nearest neighbor classifier
• Support Vector Machines
• Radial Basis Functions
Need a kernel/similarity function
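
One well-known sequence kernel satisfying both properties (not named on the slide, added here as an illustration) is the k-spectrum kernel: represent each sequence by its k-gram counts and take the dot product. A minimal sketch, assuming sequences are strings of symbols:

from collections import Counter

def kgram_counts(seq, k):
    # Count every contiguous k-gram in the sequence
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))

def spectrum_kernel(x, y, k=3):
    # K(x, y) = dot product of the two k-gram count vectors; symmetric and positive definite
    cx, cy = kgram_counts(x, k), kgram_counts(y, k)
    return sum(cx[g] * cy[g] for g in cx if g in cy)

print(spectrum_kernel("AACTGACCTG", "ACTGACC", k=3))  # similarity of two nucleotide strings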
Sequence clustering
• Given a set of sequences, create groups such that similar sequences are in the same group
• Three kinds of clustering algorithms
• Distance-based (need a similarity function): K-means, various hierarchical algorithms
• Model-based (need generative models): Expectation Maximization algorithm
• Density-based (need a dimensional embedding): CLIQUE
Outline
• Traditional mining on sequences
• Primitives for handling sequence data
• Embed sequence in a fixed dimensional space
• All conventional record mining techniques will apply
• Generative models for sequence
• Sequence classification: generative methods
• Clustering sequences: model-based approach
• Distance between two sequences
• Sequence classification: SVM and NN
• Clustering sequences: distance-based approach
• Sequence-specific mining operations
• Applications
Embedding sequences in a fixed dimensional space
• Ignore order: each symbol maps to a dimension
• extensively used in text classification and clustering
• Extract aggregate features
• Real-valued elements: Fourier coefficients, wavelet coefficients, auto-regressive coefficients
• Categorical data: number of symbol changes
• Sliding window techniques (k: window size)
• Define a coordinate for each possible k-gram a (m^k coordinates)
• The a-th coordinate is the number of times a occurs in the sequence
• (k,b) mismatch score: the a-th coordinate is the number of k-grams in the sequence with at most b mismatches with a
• Define a coordinate for each of the k positions
• Multiple rows per sequence
Sliding window examples
Example trace: open, lseek, ioctl, mmap, execve, ioctl, ioctl, open, execve, close, mmap
• One symbol per column: one row per trace; each column counts occurrences of one symbol (o, c, l, i, e, m, ...)
• Sliding window, window size 3, one row per trace: each column counts occurrences of one 3-gram (ioe, cli, oli, lie, lim, ...)
• Mismatch scores (b = 1): each column counts the 3-grams in the trace that differ from the column's 3-gram in at most one position
• Multiple rows per trace: one row per window position, with attributes A1, A2, A3 holding the three symbols in the window (o l i; l i m; i m e; ...; e c m)
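
A minimal sketch (assumed, not from the slides) of the k-gram sliding-window embedding, producing one fixed-length count vector per trace; the example alphabet and traces are illustrative:

from collections import Counter
from itertools import product

def kgram_embedding(traces, alphabet, k=3):
    # One coordinate per possible k-gram over the alphabet (|alphabet|^k coordinates)
    coords = [''.join(p) for p in product(alphabet, repeat=k)]
    rows = []
    for trace in traces:
        counts = Counter(trace[i:i + k] for i in range(len(trace) - k + 1))
        rows.append([counts[g] for g in coords])  # one fixed-length row per trace
    return coords, rows

coords, rows = kgram_embedding(["olimecio", "olimioe"], alphabet="oclime", k=3)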
Detecting attacks on privileged programs
• Short sequences of system calls made during normal executions of a program are very consistent, yet different from the sequences of its abnormal executions
• Two approaches
• STIDE (Warrender 1999)
• Create a dictionary of the unique k-windows in normal traces, count what fraction occur in new traces, and threshold
• RIPPER-based (Lee 1998)
• next...
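
A minimal sketch of the STIDE-style dictionary approach described above; the window size and the example traces are illustrative assumptions:

def stide_train(normal_traces, k=6):
    # Dictionary of all unique k-windows observed in the normal traces
    return {tuple(t[i:i + k]) for t in normal_traces for i in range(len(t) - k + 1)}

def stide_score(trace, normal_windows, k=6):
    # Fraction of k-windows in the new trace never seen in normal data; threshold this
    windows = [tuple(trace[i:i + k]) for i in range(len(trace) - k + 1)]
    unseen = sum(1 for w in windows if w not in normal_windows)
    return unseen / max(len(windows), 1)

normal = stide_train([["open", "lseek", "lstat", "mmap", "execve", "ioctl", "close"]], k=3)
print(stide_score(["open", "lseek", "unlink", "mmap", "execve", "ioctl", "close"], normal, k=3))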
Classification models on k-gram trace data
• When both normal and abnormal data are available
• class label = normal/abnormal
• Example (7-grams): "vtimes open seek read read read seek" → normal; "lstat lstat lstat bind open close vtimes" → abnormal
• When only normal traces are available
• 6 attributes; class label = the k-th (here 7th) system call
• Example: attributes "vtimes open seek read read read", class "seek"; attributes "lstat lstat lstat bind open close", class "vtimes"

Learn rules to predict the class label [RIPPER]
Examples of output RIPPER rules
• Both-traces:
• if the 2nd system call is vtimes and the 7th is vtrace, then the sequence is “normal”
• if the 6th system call is lseek and the 7th is sigvec, then the sequence is “normal”
• …
• if none of the above, then the sequence is “abnormal”
• Only-normal:
• if the 3rd system call is lstat and the 4th is write, then the 7th is stat
• if the 1st system call is sigblock and the 4th is bind, then the 7th is setsockopt
• …
• if none of the above, then the 7th is open
Experimental results on sendmail traces
• The output rule sets contain ~250 rules, each with 2 or 3 attribute tests
• Score each trace by counting the fraction of mismatches and thresholding

Percent of mismatching traces (Only-normal / BOTH):
sscp-1: 13.5 / 32.2
sscp-2: 13.6 / 30.4
sscp-3: 13.6 / 30.4
syslog-remote-1: 11.5 / 21.2
syslog-remote-2: 8.4 / 15.6
syslog-local-1: 6.1 / 11.1
syslog-local-2: 8.0 / 15.9
decode-1: 3.9 / 2.1
decode-2: 4.2 / 2.0
sm565a: 8.1 / 8.0
sm5x: 8.2 / 6.5
sendmail: 0.6 / 0.1

Summary: only normal traces are sufficient to detect intrusions
More realistic experiments [from Warrender 99]

Program (STIDE threshold / %false-pos; RIPPER threshold / %false-pos):
Site-1 lpr: 12 / 0.0; 3 / 0.0016
Site-2 lpr: 12 / 0.0013; 4 / 0.0265
named: 20 / 0.0019; 10 / 0.0
xlock: 20 / 0.00008; 10 / 0.0

• Different programs need different thresholds
• Simple methods (e.g. STIDE) work just as well
• Results are sensitive to the window size
• Is it possible to do better with sequence-specific methods?
Outline
• Traditional mining on sequences
• Primitives for handling sequence data
• Embed sequence in a fixed dimensional space
• All conventional record mining techniques will apply
• Distance between two sequences
• Sequence classification: SVM and NN
• Clustering sequences: distance-based approach
• Generative models for sequences
• Sequence classification: whole and partial
• Clustering sequences: model-based approach
• Sequence-specific mining operations
• Applications in biological sequence mining
Probabilistic models for sequences
• Independent model
• One-level dependence (Markov chains)
• Fixed memory (Order-l Markov chains)
• Variable memory models
• More complicated models
• Hidden Markov Models
Independent model
• Model structure
• A parameter for each symbol in S, e.g. Pr(A) = 0.1, Pr(C) = 0.9
• Probability of a sequence s being generated from the model
• Example: Pr(AACA) = Pr(A) Pr(A) Pr(C) Pr(A) = Pr(A)^3 Pr(C) = 0.1^3 × 0.9
• Training the symbol probabilities
• Data T: set of sequences
• Count(s ∈ T): total number of times symbol s appears in the training data T
• Pr(s) = Count(s ∈ T) / length(T)
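
A quick sketch checking the worked example above with the slide's parameters:

# Independent model with Pr(A) = 0.1 and Pr(C) = 0.9 (the slide's example)
pr = {"A": 0.1, "C": 0.9}
p = 1.0
for sym in "AACA":
    p *= pr[sym]       # Pr(AACA) = Pr(A)^3 * Pr(C)
print(p)               # approximately 0.0009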
First Order Markov Chains
• Model structure
• A state for each symbol in S, plus a start state
• Edges between states labeled with transition probabilities (in the slide's example: start→A 0.5, start→C 0.5, A→A 0.1, A→C 0.9, C→A 0.4, C→C 0.6)
• Probability of a sequence s being generated from the model
• Example: Pr(AACA) = Pr(A) Pr(A|A) Pr(C|A) Pr(A|C) = 0.5 × 0.1 × 0.9 × 0.4
• Training the transition probabilities between states
• Pr(s|b) = Count(bs ∈ T) / Count(b ∈ T)
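
A minimal sketch (assumed, not from the slides) of training a first-order Markov chain by counting, and of scoring a sequence with it; a hypothetical start marker plays the role of the start state:

from collections import Counter

START = "^"  # assumed start-state marker

def train_markov(sequences):
    # Pr(s|b) = Count(bs in T) / Count(b in T)
    bigrams, context = Counter(), Counter()
    for seq in sequences:
        prev = START
        for sym in seq:
            bigrams[(prev, sym)] += 1
            context[prev] += 1
            prev = sym
    return {pair: c / context[pair[0]] for pair, c in bigrams.items()}

def sequence_prob(seq, trans):
    # Pr(seq) = Pr(a1) * product of Pr(ai | a(i-1)); unseen transitions get probability 0
    p, prev = 1.0, START
    for sym in seq:
        p *= trans.get((prev, sym), 0.0)
        prev = sym
    return p

trans = train_markov(["AACA", "ACCA", "CCAA"])
print(sequence_prob("AACA", trans))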
Higher order Markov Chains (l = memory of the chain)
• Model
• A state for each possible suffix of length l, giving |S|^l states
• Edges between states with transition probabilities
• Example (l = 2): Pr(AACA) = Pr(AA) Pr(C|AA) Pr(A|AC) = 0.5 × 0.9 × 0.7
• Training the model
• Pr(s|σ) = Count(σs ∈ T) / Count(σ ∈ T), where σ is a length-l suffix
Variable Memory models
• Probabilistic Suffix Automata (PSA)
• Model
• States: substrings of size no greater than l, where no string is a suffix of another
• Calculating Pr(AACA) = Pr(A) Pr(A|A) Pr(C|A) Pr(A|AC) = 0.5 × 0.3 × 0.7 × 0.1
• Training: not straightforward
• Eased by Prediction Suffix Trees
• PSTs can be converted to PSAs after training
Prediction Suffix Trees (PSTs)
• Suffix trees with the emission probabilities of the observations attached to each tree node
• Example (with the tree shown on the slide): Pr(AACA) = 0.28 × 0.3 × 0.7 × 0.1
• Linear time algorithms exist for constructing such PSTs from training data [Apostolico 2000]
Hidden Markov Models
• Doubly stochastic models: hidden states with transition probabilities between them, and a per-state emission distribution over the symbols (the slide shows a 4-state example over {A, C})
• Efficient dynamic programming algorithms exist for
• Finding Pr(S) (the forward algorithm)
• Finding the highest-probability path P that maximizes Pr(S|P) (Viterbi)
• Training the model (Baum-Welch algorithm)
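
A minimal sketch of the forward algorithm for computing Pr(S); the two-state model and its numbers are illustrative assumptions, not the slide's four-state example:

def forward_prob(obs, start, trans, emit):
    # start[s]: Pr(first state is s); trans[s][t]: Pr(t|s); emit[s][o]: Pr(o|s)
    alpha = {s: start[s] * emit[s].get(obs[0], 0.0) for s in start}
    for o in obs[1:]:
        alpha = {t: sum(alpha[s] * trans[s][t] for s in alpha) * emit[t].get(o, 0.0)
                 for t in start}
    return sum(alpha.values())  # Pr(S), summed over all hidden state paths

start = {"S1": 0.5, "S2": 0.5}
trans = {"S1": {"S1": 0.9, "S2": 0.1}, "S2": {"S1": 0.5, "S2": 0.5}}
emit = {"S1": {"A": 0.9, "C": 0.1}, "S2": {"A": 0.6, "C": 0.4}}
print(forward_prob("AACA", start, trans, emit))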
Discriminative training of HMMs
• Models trained to maximize likelihood of data might perform badly
when
• Model not representative of data
• Training data insufficient
• Alternatives to Maximum-likelihood/EM
• Objective functions:
• Minimum classification error
• Maximum posterior probability of actual label Pr(c|x)
• Maximum mutual information with class
• These objectives are harder to train; a number of alternatives to EM have been proposed
• Generalized probabilistic descent [Katagiri 98]
• Deterministic annealing [Rao 01]
HMMs for profiling system calls
• Training:
• Initial number of states = 40 (roughly equals number of distinct system calls)
• Train on normal traces
• Testing:
• Need to handle variable length and online data
• For each call, find the probability of the model emitting it given all the calls before it
• If that probability is below a threshold, flag the call as abnormal
• A trace is abnormal if the fraction of abnormal calls is high
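
A minimal sketch of this per-call scoring (an assumed implementation, reusing HMM parameters in the same dictionary form as the forward-algorithm sketch above); the threshold and fraction are illustrative:

def per_call_probs(trace, start, trans, emit):
    # Pr(call_t | calls before t), computed with a normalized forward (belief) vector
    probs, belief, first = [], dict(start), True
    for o in trace:
        if first:
            joint = {s: belief[s] * emit[s].get(o, 0.0) for s in belief}
            first = False
        else:
            joint = {t: sum(belief[s] * trans[s][t] for s in belief) * emit[t].get(o, 0.0)
                     for t in start}
        p = sum(joint.values())            # Pr(o | prefix)
        probs.append(p)
        belief = {s: (joint[s] / p if p > 0 else start[s]) for s in joint}
    return probs

def trace_is_abnormal(probs, threshold=1e-7, max_fraction=0.1):
    # Flag calls below the threshold; the trace is abnormal if too many calls are flagged
    flagged = sum(1 for p in probs if p < threshold)
    return flagged / len(probs) > max_fraction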
More realistic experiments [from Warrender 99]

Program (STIDE threshold / %false-pos; RIPPER threshold / %false-pos; HMM threshold / %false-pos):
Site-1 lpr: 12 / 0.0; 3 / 0.0016; 10^-7 / 0.0003
Site-2 lpr: 12 / 0.0013; 4 / 0.0265; 10^-7 / 0.0015
named: 20 / 0.0019; 10 / 0.0; 10^-7 / 0.0
xlock: 20 / 0.00008; 10 / 0.0; 10^-7 / 0.0

• HMMs
• take longer to train
• are less sensitive to thresholds and have no window parameter
• give the best overall performance
• VMMs and Sparse Markov Transducers have also been shown to perform significantly better than fixed-window methods [Eskin 01]
ROC curves comparing different methods
[from Warrender 99]
Outline
• Traditional mining on sequences
• Primitives for handling sequence data
• Sequence-specific mining operations
• Partial sequence classification (Tagging)
• Segmenting a sequence
• Predicting next symbol of a sequence
• Applications in biological sequence mining
Partial sequence classification (Tagging)
• The tagging problem:
• Given:
• a set of tags L
• training examples of sequences showing the breakup of the sequence into the set of tags
• learn to break up a sequence into tags
• (classification of parts of sequences)
• Examples:
• Text segmentation
• Break a sequence of words forming an address string into subparts like road, city name, etc.
• Continuous speech recognition
• Identify words in continuous speech
Text sequence segmentation
Example: addresses, bibliographic records

Address: 4089 (house number) | Whispering Pines (building) | Nobel Drive (road) | San Diego (city) | CA (state) | 92122 (zip)

Bibliographic record: P.P. Wangikar, T.P. Graycar, D.A. Estell, D.S. Clark, J.S. Dordick (authors) | (1993) (year) | Protein and Solvent Engineering of Subtilisin BPN' in Nearly Anhydrous Organic Media (title) | J. Amer. Chem. Soc. (journal) | 115 (volume) | 12231-12237 (pages)
Approaches used for tagging
• Learn to identify start and end boundaries of each label
• Method: for each label, build two classifiers for accepting its two boundaries.
• Any conventional classifier would do:
• Rule-based most common.
• K-windows approach:
• For each label, train a classifier on k-windows
• During testing
• classify each k-window
• Adapt state-based generative models like HMM
State-based models for sequence tagging
• Two approaches:
• A separate model per tag, with special prefix and suffix states to capture the start and end of the tag (the slide shows one such model each for "Road name" and "Building name")
• A combined model over all tags

Combined model over all tags
Example: IE
• Naïve model: one state per element
… Mahatma Gandhi Road Near Parkland ...
• Nested model: each element is another HMM
… [Mahatma Gandhi Road Near : Landmark] Parkland ...
Other approaches
• Disadvantages of generative models (HMMs)
• Maximizing joint probability of sequence and labels may not maximize
accuracy
• Conditional independence of features a restrictive assumption
• Alternative: Conditional Random Fields
• Maximize conditional probability of labels given sequence
• Arbitrary overlapping features allowed
Outline
• Traditional mining on sequences
• Primitives for handling sequence data
• Sequence-specific mining operations
• Partial sequence classification (Tagging)
• Segmenting a sequence
• Predicting next symbol of a sequence
• Applications in biological sequence mining
Segmenting sequences
• Basic premise: the sequence is assumed to be generated from multiple models
• Goal: discover the boundaries where the model changes
• Applications:
• Bioinformatics
• Better models by breaking up heterogeneous sequences
• Example sequence from the slide: A C T G G T T T A C C C C T G T G, segmented into regions generated by different models
Simpler problem: segmenting a 0/1 sequence
• Players A and B
• A has a set of coins with different biases
• A repeatedly
• picks an arbitrary coin
• tosses it an arbitrary number of times
• B observes the resulting heads/tails sequence, e.g. 0 0 1 0 1 1 0 1 0 1 1 0 1 0 0 0 1
• B guesses the transition points and the biases
An MDL-based approach
• Given n head/tail observations
• Can assume n different coins with bias 0 or 1
• Data fits perfectly (with probability one)
• Many coins needed
• Or assume one coin
• May fit the data poorly
• The "best segmentation" is a compromise between model length and data fit
• Example: 0 0 1 0 | 1 1 0 1 0 1 1 | 0 1 0 0 0 1, with estimated segment biases 1/4, 5/7, 1/3
MDL
For each segment i:
• L(Mi): cost of the model parameters (encoding Pr(head)) plus the segment boundary (log(sequence length))
• L(Di|Mi): cost of describing the data in segment Si given model Mi: -log(h^H (1-h)^T), where h = Pr(head), H = #heads, T = #tails

Goal: find the segmentation that minimizes the total cost
Σ over segments i of [ L(Mi) + L(Di|Mi) ]
How to find optimal segments
• Sequence of 17 tosses: 0 0 1 0 1 1 0 1 0 1 1 0 1 0 0 0 1
• Build a derived graph with 18 nodes (one per boundary position) and all possible forward edges
• Edge cost = model cost + data cost of the segment the edge spans
• The optimal segmentation is the shortest path from the first node to the last
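
A minimal sketch (assumed, not from the slides) of this shortest-path formulation for the 0/1 case; because every edge goes forward, the shortest path reduces to a simple dynamic program over boundary positions. The model cost here charges log2(n) bits each for the parameter and the boundary, a simplifying assumption.

import math

def data_cost(seq, i, j):
    # -log2 likelihood of segment seq[i:j] under its maximum-likelihood head probability
    n, heads = j - i, sum(seq[i:j])
    h = heads / n
    if h in (0.0, 1.0):
        return 0.0
    return -(heads * math.log2(h) + (n - heads) * math.log2(1 - h))

def best_segmentation(seq):
    n = len(seq)
    model_cost = 2 * math.log2(n)        # assumed: parameter + boundary, log2(n) bits each
    best = [math.inf] * (n + 1)          # best[j] = min cost of segmenting seq[:j]
    best[0], back = 0.0, [0] * (n + 1)
    for j in range(1, n + 1):
        for i in range(j):               # edge (i, j) = one candidate segment
            cost = best[i] + model_cost + data_cost(seq, i, j)
            if cost < best[j]:
                best[j], back[j] = cost, i
    cuts, j = [], n                      # recover segment boundaries from back pointers
    while j > 0:
        cuts.append(j)
        j = back[j]
    return sorted(cuts), best[n]

seq = [0, 0, 1, 0, 1, 1, 0, 1, 0, 1, 1, 0, 1, 0, 0, 0, 1]  # the slide's 17 tosses
print(best_segmentation(seq))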
Non-independent models
• In the previous method each segment is assumed to be independent of the others; this does not allow model reuse of the form: 0 0 1 0 1 1 0 1 0 1 1 0 0 0 0 0 1 0
• The (k,h) segmentation problem:
• Assume a fixed number h of models is to be used for segmenting an n-element sequence into k parts (k > h)
• (k,k) segmentation is solvable by dynamic programming
• The general (k,h) problem is NP-hard
Approximations for (k,h) segmentation
1. Solve (k,k) to get the k segments (S1, S2, S3, S4 in the slide's example over A C T G G T T T A C C C C T G T G)
2. Solve (n,h) to get the h models M1, ..., Mh (example: 2-medians)
3. Assign each segment to the best of the h models
A second variant (replace step 2 above with):
• Find summary statistics for each of the k segments and cluster them into h clusters
Results of segmenting genetic sequences
[from Gionis & Mannila 2003]
References
• S. F. Altschul, T. L. Madden, A. A. Schaffer, J. Zhang, Z. Zhang, W. Miller, and D. J. Lipman. Gapped BLAST and
PSI-BLAST: A new generation of protein database search programs. Nucleic Acids Research, 25:3389–3402, 1997.
• Apostolico, A., and Bejerano, G. 2000. Optimal amnesic probabilistic automata or how to learn and classify proteins
in linear time and space. In Proceedings of RECOMB2000. http://citeseer.nj.nec.com/apostolico00optimal.html
• Vinayak R. Borkar, Kaustubh Deshmukh, and Sunita Sarawagi. Automatic text segmentation for extracting
structured records. SIGMOD 2001.
• C. Burge and S. Karlin, Prediction of Complete Gene Structures in Human Genomic DNA. Journal of Molecular
Biology, 268:78-94, 1997.
• Mary Elaine Califf and R. J. Mooney. Relational learning of pattern-match rules for information extraction. AAAI 1999.
• S. Chakrabarti, S. Sarawagi and B.Dom, Mining surprising patterns using temporal description length,VLDB, 1998
• M. Collins. Discriminative training methods for Hidden Markov Models: Theory and experiments with perceptron algorithms. EMNLP 2002.
• R. Durbin, S. Eddy, A. Krogh, and G. Mitchison, Biological sequence analysis: probabilistic models of proteins
and nucleic acids, Cambridge University Press, 1998.
• Eleazar Eskin, Wenke Lee and Salvatore J. Stolfo. ``Modeling System Calls for Intrusion Detection with Dynamic
Window Sizes.'' Proceedings of DISCEX II. June 2001.
• IDS http://www.cs.columbia.edu/ids/publications/
• D Freitag and A McCallum, Information Extraction with HMM Structures Learned by Stochastic Optimization,
AAAI 2000
References
• A. Gionis and H. Mannila. Finding recurrent sources in sequences. ACM RECOMB 2003.
• M. T. Goodrich. Efficient piecewise-approximation using the uniform metric. Symposium on Computational Geometry, 1994.
• D. Haussler. Convolution kernels on discrete structure. Technical report, UC Santa Cruz, 1999.
• K. Karplus, C. Barrett, and R. Hughey, Hidden Markov models for detecting remote protein
homologies. Bioinformatics 14(10): 846-856, 1998.
• L. Lo Conte, S. Brenner, T. Hubbard, C. Chothia, and A. Murzin. SCOP database in 2002:
refinements accommodate structural genomics. Nucleic Acids Research, 30:264-267, 2002
• Wenke Lee and Sal Stolfo. ``Data Mining Approaches for Intrusion Detection'' In Proceedings of
the Seventh USENIX Security Symposium (SECURITY '98), San Antonio, TX, January 1998
• A. McCallum and D. Freitag and F. Pereira, Maximum entropy Markov models for information
extraction and segmentation, ICML-2000
• Rabiner, Lawrence R., 1990. A tutorial on hidden Markov models and selected applications in
speech recognition. In Alex Weibel and Kay-Fu Lee (eds.), Readings in Speech Recognition. Los
Altos, CA: Morgan Kaufmann, pages 267--296.
• D. Ron, Y. Singer and N. Tishby. The power of amnesia: learning probabilistic automata with
variable memory length. Machine Learning, 25:117-- 149, 1996
• Warrender, Christina, Stephanie Forrest, and Barak Pearlmutter. Detecting Intrusions Using
System Calls: Alternative Data Models. To appear, 1999 IEEE Symposium on Security and
Privacy. 1999
REFERENCES
• Predictive Analytics: The Power to Predict Who Will Click, Buy, Lie, or Die. E. Siegel. John Wiley & Sons, Inc. https://mirtech.ir/wp-content/uploads/2018/04/Predictive-Analytics.pdf
• Too Big to Ignore: The Business Case for Big Data. P. Simon. Wiley India. https://support.sas.com/content/dam/SAS/support/en/books/too-big-to-ignore/69508_excerpt.pdf
• Data Smart: Using Data Science to Transform Information into Insight. J. W. Foreman. Wiley. (Available at Amazon)
THANK YOU

For queries
Email: [email protected]

