CS246: Mining Massive Datasets. Jure Leskovec, Stanford University
http://cs246.stanford.edu
Would like to do prediction: estimate a function f(x) so that y = f(x), where y can be:
- Real number: Regression
- Categorical: Classification
- Complex object
The data is labeled: we are given training pairs (x, y) with x ∈ X and y ∈ Y.
1-nearest neighbor:
- How many neighbors to look at? One
- Weighting function (optional): unused
- How to fit with the local points? Just predict the same output as the nearest neighbor
Suppose x1, ..., xm are two-dimensional, x1 = (x11, x12), x2 = (x21, x22), ...
One can draw the nearest-neighbor regions in the plane.
k-nearest neighbors:
- Distance metric: Euclidean
- How many neighbors to look at? k (e.g., k = 9)
- Weighting function (optional): unused
- How to fit with the local points? Just predict the average output among the k nearest neighbors
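A minimal sketch of this predictor, assuming the training data sits in NumPy arrays and using exactly the Euclidean metric and unweighted averaging described above (the function name `knn_predict` is illustrative, not from the lecture):

```python
import numpy as np

def knn_predict(X, y, q, k=9):
    """Predict the output at query q as the average of the k nearest training outputs.

    X : (m, d) array of training points
    y : (m,) array of training outputs (real-valued for regression)
    q : (d,) query point
    """
    dists = np.linalg.norm(X - q, axis=1)   # Euclidean distance to every training point
    nearest = np.argsort(dists)[:k]         # indices of the k closest points
    return y[nearest].mean()                # unweighted average of their outputs
```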
[Figure: the Euclidean distance d(xi, q) between a query q and a point p]
Data structures for nearest-neighbor search:
- Main memory:
  - Linear scan
  - Tree based: Quadtree, kd-tree
  - Hashing: Locality-Sensitive Hashing
- Secondary storage:
  - R-trees
Quadtree: the simplest spatial structure on Earth!
- Split the space into 2^d equal subsquares
- Repeat until done: only one pixel left, only one point left, or only a few points left
- Variants: split only one dimension at a time, which gives kd-trees (in a moment)
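A rough sketch of such a quadtree for 2-D points, under the "only a few points left" stopping rule; the `Node` layout and the `max_points` and `max_depth` parameters are assumptions for illustration, not code from the lecture:

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    box: tuple                                     # (xmin, ymin, xmax, ymax) of this cell
    children: list = field(default_factory=list)   # four sub-squares; empty at a leaf
    points: list = field(default_factory=list)     # points stored at a leaf

def build_quadtree(points, box, max_points=4, depth=0, max_depth=16):
    """Split a 2-D square into 4 equal sub-squares until only a few points are left."""
    node = Node(box)
    if len(points) <= max_points or depth == max_depth:
        node.points = points                       # stop: only a few points left
        return node
    xmin, ymin, xmax, ymax = box
    xm, ym = (xmin + xmax) / 2, (ymin + ymax) / 2
    quadrants = {                                  # each point falls in exactly one quadrant
        (False, False): (xmin, ymin, xm, ym), (True, False): (xm, ymin, xmax, ym),
        (False, True):  (xmin, ym, xm, ymax), (True, True):  (xm, ym, xmax, ymax),
    }
    buckets = {key: [] for key in quadrants}
    for p in points:
        buckets[(p[0] > xm, p[1] > ym)].append(p)
    node.children = [build_quadtree(buckets[key], sub, max_points, depth + 1, max_depth)
                     for key, sub in quadrants.items()]
    return node
```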
Range search (find all points within distance r of a query q):
- Put the root node on the stack
- Repeat: pop the next node T from the stack; for each child C of T:
  - if C is a leaf, examine the point(s) in C
  - if C intersects the ball of radius r around q, add C to the stack
Nearest neighbor:
- Start a range search with r = ∞
- Whenever a point is found, update r
- Only investigate nodes with respect to the current r
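A sketch of both searches on the `Node` structure from the quadtree sketch above (the helper names are hypothetical):

```python
import math

def ball_intersects_box(q, r, box):
    """True if the ball of radius r around q touches the rectangle `box`."""
    xmin, ymin, xmax, ymax = box
    cx = min(max(q[0], xmin), xmax)                 # closest point of the box to q
    cy = min(max(q[1], ymin), ymax)
    return math.hypot(q[0] - cx, q[1] - cy) <= r

def range_search(root, q, r):
    """Return all stored points within distance r of the query q."""
    found, stack = [], [root]
    while stack:
        node = stack.pop()
        if not node.children:                       # leaf: examine its points
            found += [p for p in node.points if math.dist(p, q) <= r]
        else:
            for child in node.children:             # only descend into cells that
                if ball_intersects_box(q, r, child.box):   # intersect the query ball
                    stack.append(child)
    return found

def nearest_neighbor(root, q):
    """Range search with a shrinking radius: start at r = infinity, tighten on each hit."""
    best, r, stack = None, math.inf, [root]
    while stack:
        node = stack.pop()
        if not node.children:
            for p in node.points:
                d = math.dist(p, q)
                if d < r:
                    best, r = p, d                  # found a closer point: shrink r
        else:
            for child in node.children:
                if ball_intersects_box(q, r, child.box):
                    stack.append(child)
    return best
```

The pruning test compares the current radius against the closest point of a cell's box to q, so a cell is expanded only if it could still contain an answer.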
Advantages: the splitting dimension and position can be chosen adaptively, e.g.:
- Pick the dimension of largest variance and split at the median (balanced split)
- Do SVD or CUR, project, and split
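For instance, the balanced "largest variance, split at median" strategy could be sketched as follows (a dictionary-based node layout and the `leaf_size` parameter are assumptions for brevity):

```python
import numpy as np

def build_kdtree(points, leaf_size=4):
    """Recursively split on the dimension of largest variance, at the median."""
    if len(points) <= leaf_size:
        return {"leaf": True, "points": points}
    dim = np.argmax(points.var(axis=0))              # dimension of largest variance
    order = np.argsort(points[:, dim])
    mid = len(points) // 2                           # median index -> balanced split
    return {
        "leaf": False,
        "dim": int(dim),
        "split": float(points[order[mid], dim]),     # splitting threshold
        "left": build_kdtree(points[order[:mid]], leaf_size),
        "right": build_kdtree(points[order[mid:]], leaf_size),
    }
```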
Range search on a kd-tree works the same way as for the quadtree: put the root node on a stack; repeatedly pop a node T, examine the point(s) in any leaf child, and push each child whose cell intersects the ball of radius r around q.
Randomized kd-trees:
- Find the top few dimensions of largest variance
- Randomly select one of these dimensions; split on the median
- Construct many complete (i.e., one point per leaf) trees
- Drawbacks: more memory; an additional parameter to tune (the number of trees)
Search:
- Descend through each tree until a leaf is reached
- Maintain a single priority queue for all the trees
- For approximate search, stop after a certain number of nodes have been examined
[Plot: approximate nearest-neighbor search performance for d = 128, n = 100k; Muja-Lowe, 2010]
- Better when the split passes through sparse regions
- Lower nodes may spill too much: use a hybrid of spill and non-spill nodes
Practical tips:
- For high-dimensional data, use randomized projections (CUR) or SVD
- Use Best-Bin-First (BBF): make a priority queue of all unexplored nodes and visit them in order of their closeness to the query (sketched below)
- Space permitting:
  - Keep extra statistics (lower and upper bounds) for each cell and use the triangle inequality to prune the space
  - Use spilling to avoid backtracking
  - Use lookup tables for fast distance computation
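A condensed sketch of the Best-Bin-First idea on the dictionary-style kd-tree built above: unexplored branches go into a priority queue keyed by a lower bound on their distance to the query, and the search stops after a fixed budget of leaves, giving an approximate answer (the `max_leaves` budget and node layout are assumptions, not from the lecture):

```python
import heapq
import numpy as np

def bbf_nearest(root, q, max_leaves=20):
    """Approximate NN: always expand the unexplored branch closest to q (a NumPy array)."""
    best, best_d = None, float("inf")
    heap = [(0.0, 0, root)]                   # (lower bound on distance, tie-breaker, node)
    counter, leaves_seen = 1, 0
    while heap and leaves_seen < max_leaves:
        bound, _, node = heapq.heappop(heap)
        if bound >= best_d:                   # this branch cannot hold anything closer
            continue
        while not node["leaf"]:
            d = node["dim"]
            near, far = (("left", "right") if q[d] <= node["split"]
                         else ("right", "left"))
            # the far child is at least |q[d] - split| away along dimension d
            far_bound = max(bound, abs(q[d] - node["split"]))
            heapq.heappush(heap, (far_bound, counter, node[far]))
            counter += 1
            node = node[near]                 # descend toward the query
        leaves_seen += 1
        for p in node["points"]:              # brute-force inside the leaf
            dist = np.linalg.norm(p - q)
            if dist < best_d:
                best, best_d = p, dist
    return best
```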
R-trees. Advantages:
- Supports near(est) neighbor search (similar to before)
- Works for points and rectangles
- Avoids empty spaces
[Figure, repeated across several slides: an R-tree example with bounding rectangles P1-P4 covering entries A-G]
Insertion of a point x:
- Find the MBR intersecting x and insert it there
- If a node is full, split it:
  - Linear split: choose two far-apart nodes as ends; randomly choose the remaining nodes and assign each so that it requires the smallest MBR enlargement
  - Quadratic split: choose the two nodes so that the dead space between them is maximized; insert the remaining nodes so that area enlargement is minimized
[Figure: the split illustrated on rectangles P1-P4 containing entries A-J]
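The "smallest MBR enlargement" rule can be sketched with a small helper; the `(xmin, ymin, xmax, ymax)` representation and the function names are assumptions for illustration:

```python
def area(mbr):
    xmin, ymin, xmax, ymax = mbr
    return (xmax - xmin) * (ymax - ymin)

def enlarge(mbr, p):
    """Smallest MBR containing both `mbr` and the point p = (x, y)."""
    xmin, ymin, xmax, ymax = mbr
    return (min(xmin, p[0]), min(ymin, p[1]), max(xmax, p[0]), max(ymax, p[1]))

def choose_subtree(mbrs, p):
    """Pick the child MBR whose area grows the least when p is added."""
    return min(range(len(mbrs)),
               key=lambda i: area(enlarge(mbrs[i], p)) - area(mbrs[i]))
```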
Binary classification:
- Input: vectors xi with labels yi
- Goal: find a vector w = (w1, w2, ..., wn), where each wi is a real number, such that
  f(x) = +1 if w1 x1 + w2 x2 + ... + wn xn ≥ θ, and -1 otherwise
- Note: the threshold can be absorbed into the weights by extending x → (x, 1) and w → (w, -θ); the decision boundary is then the hyperplane w·x = 0
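A minimal sketch of this decision rule with the threshold absorbed as noted above (names are illustrative):

```python
import numpy as np

def predict(w, x):
    """Linear threshold unit: +1 if w·x >= 0, otherwise -1.

    Assumes the threshold theta has already been absorbed, i.e. x carries a
    constant 1 as its last feature and w carries -theta as its last weight.
    """
    return 1 if np.dot(w, x) >= 0 else -1
```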
(Very) loose motivation: a neuron
- Inputs are feature values x1, x2, ...
- Each feature xi has a weight wi
- The activation is the sum f(x) = Σi wi xi = w·x
- If f(x) is positive: predict +1; if negative: predict -1
[Figure: a spam classifier over features such as "viagra" and "nigeria"; the hyperplane w·x = 0 separates Spam (+1) from Ham (-1)]
Perceptron algorithm:
- Start with w0 = 0
- Pick training examples x one by one (e.g., streamed from disk)
- Predict the class of x using the current weights: y' = sign(wt·x)
- If y' is correct (i.e., y' = y): no change, wt+1 = wt
- If y' is wrong: adjust the weights, wt+1 = wt + η y x, where
  - η is the learning rate parameter
  - x is the training example
  - y is the true class label (in {+1, -1})
[Figure: geometric view of the mistake update wt+1 = wt + η y x]
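A compact sketch of this training loop; the learning rate value and epoch count are arbitrary placeholders, not values from the lecture:

```python
import numpy as np

def perceptron_train(X, y, eta=0.1, epochs=10):
    """X: (m, n) array of examples, y: (m,) array of labels in {+1, -1}."""
    w = np.zeros(X.shape[1])                        # start with w0 = 0
    for _ in range(epochs):
        for xt, yt in zip(X, y):                    # stream examples one by one
            y_hat = 1 if np.dot(w, xt) >= 0 else -1 # predict with current weights
            if y_hat != yt:                         # mistake: w_{t+1} = w_t + eta * y * x
                w = w + eta * yt * xt
    return w
```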
If the training data is not linearly separable, the perceptron learning algorithm will eventually repeat the same set of weights and therefore enter an infinite loop.
Properties of the perceptron:
- Separability: some parameter setting classifies the training set perfectly
- Convergence: if the training set is separable, the perceptron will converge (binary case)
- Mistake bound: the number of mistakes is less than 1/γ², where γ is the margin
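For reference, the standard result behind this bound (Novikoff's theorem, stated here as a well-known fact rather than taken from the slides): if every example satisfies ||xi|| ≤ R and some unit vector w* separates the data with margin γ, i.e. yi (w*·xi) ≥ γ for all i, then the perceptron makes at most (R/γ)² mistakes; with the data scaled so that R = 1, this is the 1/γ² bound above.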
Problems with the perceptron:
- Overfitting
- No regularization
- If the data is not separable, the weights "dance around"
- Mediocre generalization: finds a barely separating solution
Winnow algorithm: similar to the perceptron, just different (multiplicative) updates
- Initialize: θ = n; wi = 1
- Prediction is 1 iff w·x ≥ θ
- If no mistake: do nothing
- If f(x) = 1 but w·x < θ: wi ← 2 wi (if xi = 1) (promotion)
- If f(x) = 0 but w·x ≥ θ: wi ← wi / 2 (if xi = 1) (demotion)
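A small sketch of these multiplicative updates, assuming boolean features x ∈ {0,1}^n and labels f(x) ∈ {0,1} (function and parameter names are illustrative):

```python
import numpy as np

def winnow_train(X, y, epochs=10):
    """X: (m, n) boolean feature matrix, y: (m,) labels in {0, 1}."""
    n = X.shape[1]
    w, theta = np.ones(n), float(n)                  # initialize: theta = n, w_i = 1
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            y_hat = 1 if np.dot(w, xi) >= theta else 0
            if y_hat == yi:
                continue                             # no mistake: do nothing
            if yi == 1:                              # promotion: double active weights
                w[xi == 1] *= 2.0
            else:                                    # demotion: halve active weights
                w[xi == 1] /= 2.0
    return w, theta
```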
Balanced version:
- Keep two weights, wi+ and wi-, for each variable; the effective weight is the difference
- Update rule:
  - If f(x) = 1 but (w+ - w-)·x ≤ θ: wi+ ← 2 wi+ and wi- ← wi-/2, where xi = 1 (promotion)
  - If f(x) = 0 but (w+ - w-)·x ≥ θ: wi+ ← wi+/2 and wi- ← 2 wi-, where xi = 1 (demotion)
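And a sketch of the balanced variant with a positive and a negative weight per feature (same assumptions as the previous sketch):

```python
import numpy as np

def balanced_winnow_train(X, y, epochs=10):
    """Effective weight is w_plus - w_minus; predict 1 iff (w+ - w-)·x >= theta."""
    n = X.shape[1]
    w_plus, w_minus, theta = np.ones(n), np.ones(n), float(n)
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            active = (xi == 1)
            y_hat = 1 if np.dot(w_plus - w_minus, xi) >= theta else 0
            if y_hat == yi:
                continue                             # no mistake: do nothing
            if yi == 1:                              # promotion on active features
                w_plus[active] *= 2.0
                w_minus[active] /= 2.0
            else:                                    # demotion on active features
                w_plus[active] /= 2.0
                w_minus[active] *= 2.0
    return w_plus, w_minus, theta
```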
Note: γ is a functional margin; its effect could disappear as w grows. Nevertheless, this has been shown to be a very effective algorithmic addition.
Hypothesis: w ∈ R^n; prediction is +1 iff w·x ≥ θ
Additive weight update (perceptron): w ← w + η yj xj (on a mistake on example xj)
Perceptron
- Online: can adjust to a changing target over time
- Advantages: simple; guaranteed to learn a linearly separable problem
- Limitations: only linear separations; only converges for linearly separable data; not really efficient with many features

Winnow
- Online: can adjust to a changing target over time
- Advantages: simple; guaranteed to learn a linearly separable problem; suitable for problems with many irrelevant attributes
- Limitations: only linear separations; only converges for linearly separable data; not really efficient with many features