Module 2 Algorithm For Massive Datasets
In this matrix, we can read off the values of h by scanning from the top
until we come to a 1. Thus, we see that h(S2) = c, h(S3) = b, and h(S4) = a.
Jaccard Similarity
Considering the columns for sets S1 and S2, the rows can be divided into
three classes:
1. Type X rows have 1 in both columns.
2. Type Y rows have 1 in one of the columns and 0 in the other.
3. Type Z rows have 0 in both columns.
If x and y are the numbers of type-X and type-Y rows, respectively, then
SIM(S1, S2) = x/(x + y).
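As a minimal sketch in Python (the function name and the 0/1-list column representation are illustrative, not from the slides), counting type-X and type-Y rows directly:

```python
def jaccard_from_columns(col1, col2):
    """Jaccard similarity of two sets given as 0/1 columns of the
    characteristic matrix: x / (x + y), where x counts type-X rows
    (1 in both columns) and y counts type-Y rows (1 in exactly one)."""
    x = sum(1 for a, b in zip(col1, col2) if a == 1 and b == 1)
    y = sum(1 for a, b in zip(col1, col2) if a != b)
    return x / (x + y) if x + y else 0.0
```

For example, columns [1, 0, 1, 1] and [1, 1, 0, 1] have x = 2 and y = 2, giving similarity 0.5.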
Minhash Signatures
•To represent sets, we pick at random some number n of permutations of
the rows of M.
•Call the minhash functions determined by these permutations h1, h2, . . .
, hn.
•From the column representing set S, construct the minhash signature for
S, the vector [h1(S), h2(S), . . . , hn(S)].
•We normally represent this list of hash-values as a column.
•We can form from matrix M a signature matrix, in which the ith column
of M is replaced by the minhash signature for (the set of) the ith column.
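A direct sketch of this construction in Python, assuming each set is given as the set of row numbers where its column has a 1 (names and representation are illustrative; as noted below, explicit permutations are impractical at scale):

```python
import random

def minhash_signature(columns, n, num_rows, seed=0):
    """Minhash signatures from n random row permutations.
    columns: one set of row numbers per set S; returns an n-row
    signature matrix, one row per permutation."""
    rng = random.Random(seed)
    sig = []
    for _ in range(n):
        perm = list(range(num_rows))
        rng.shuffle(perm)  # perm[r] = position of row r after permuting
        # h(S) = first position, in permuted order, where S's column has a 1
        sig.append([min(perm[r] for r in col) for col in columns])
    return sig
```

Identical columns necessarily receive identical signature columns, which is the property the estimate below relies on.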
Example – Signature Matrix
Computing Minhash Signatures
• It is not feasible to permute a large characteristic matrix explicitly.
• Even picking a random permutation of millions or billions of rows is
time-consuming, and the necessary sorting of the rows would take
even more time.
•However, it is possible to simulate the effect of a random permutation by a
random hash function that maps row numbers to as many buckets as
there are rows.
Initially, the signature matrix consists of all ∞’s. We then handle each row
r in turn: for every column that has 1 in row r, and for every hash function
hi, we replace the current signature entry for that column by hi(r) whenever
hi(r) is smaller. (The slides show the matrix after handling the rows
numbered 2, 3, and 4.)
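The row-by-row update can be sketched in Python. The sets and hash functions below are a reconstruction of the running example (the matrices themselves appear only as figures on the slides): S1 = {0, 3}, S2 = {2}, S3 = {1, 3, 4}, S4 = {0, 2, 3}, with h1(x) = x + 1 mod 5 and h2(x) = 3x + 1 mod 5.

```python
def minhash_signatures(columns, hash_funcs, num_rows):
    """Simulate minhash permutations with hash functions on row numbers.
    Start every signature entry at infinity; for each row r with a 1 in
    column c, lower SIG[i][c] to h_i(r) whenever h_i(r) is smaller."""
    sig = [[float("inf")] * len(columns) for _ in hash_funcs]
    for r in range(num_rows):
        hashes = [h(r) for h in hash_funcs]
        for c, col in enumerate(columns):
            if r in col:
                for i, hv in enumerate(hashes):
                    if hv < sig[i][c]:
                        sig[i][c] = hv
    return sig

cols = [{0, 3}, {2}, {1, 3, 4}, {0, 2, 3}]
h1 = lambda x: (x + 1) % 5
h2 = lambda x: (3 * x + 1) % 5
print(minhash_signatures(cols, [h1, h2], 5))  # [[1, 3, 0, 1], [0, 2, 0, 0]]
```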
•We can estimate the Jaccard similarities of the underlying sets from
this signature matrix.
•Notice that columns 1 and 4 are identical, so we guess that SIM(S1,
S4) = 1.0.
•From the characteristic matrix, however, we see that the true Jaccard
similarity of S1 and S4 is 2/3 (2 rows have 1 in both columns, out of the
3 rows in which at least one of the columns has a 1).
•Remember that the fraction of rows that agree in the
signature matrix is only an estimate of the true Jaccard similarity.
For additional examples: the signature columns for S1 and S3 agree in half
the rows (though the true similarity is 1/4), while the signatures of S1 and
S2 estimate 0 as their Jaccard similarity (the correct value).
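Estimating similarity from a signature matrix is then just a column comparison (a sketch; names illustrative):

```python
def signature_similarity(sig, c1, c2):
    """Fraction of signature rows in which columns c1 and c2 agree --
    an estimate of the Jaccard similarity of the underlying sets."""
    agree = sum(1 for row in sig if row[c1] == row[c2])
    return agree / len(sig)
```

With only a few rows the estimate can be far off, as the S1/S3 case above shows; the variance shrinks as the number of hash functions grows.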
Exercise
1. Compute the Jaccard similarity of each of the pairs of columns.
2. Compute, for each pair of columns of that figure, the fraction of the
120 permutations of the rows that make the two columns hash to the
same value.
Exercise
Using the data from Fig. 3.4, add to the signatures of the columns
the values of the following hash functions.
(a) h3(x) = 2x + 4 mod 5.
(b) h4(x) = 3x − 1 mod 5.
Exercise
For the above data,
a) Compute the minhash signature for each column if we use the
following three hash functions: h1(x) = 2x + 1 mod 6; h2(x) = 3x +
2 mod 6; h3(x) = 5x + 2 mod 6.
b) Which of these hash functions are true permutations?
c) How close are the estimated Jaccard similarities for the six pairs
of columns to the true Jaccard similarities?
Distance Measures
Suppose we have a set of points, called a space.
A distance measure on this space is a function d(x, y) that takes two
points in the space as arguments and produces a real number, and
satisfies the following axioms:
1. d(x, y) ≥ 0 (no negative distances).
2. d(x, y) = 0 if and only if x = y (distances are positive, except for the
distance from a point to itself).
3. d(x, y) = d(y, x) (distance is symmetric).
4. d(x, y) ≤ d(x, z) + d(z, y) (the triangle inequality).
1. Euclidean Distances
The most familiar distance measure is the one we normally think of
as “distance.”
An n-dimensional Euclidean space is one where points are vectors of
n real numbers.
The conventional distance measure in this space, which we
shall refer to as the L2-norm, is defined:
d([x1, x2, . . . , xn], [y1, y2, . . . , yn]) = √( (x1 − y1)^2 + (x2 − y2)^2 + · · · + (xn − yn)^2 )
For any constant r, we can define the Lr-norm to be the distance measure d
defined by:
d([x1, x2, . . . , xn], [y1, y2, . . . , yn]) = ( |x1 − y1|^r + |x2 − y2|^r + · · · + |xn − yn|^r )^(1/r)
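A sketch of the Lr-norm in Python (function name illustrative); r = 2 recovers the Euclidean distance and r = 1 the Manhattan distance:

```python
def lr_distance(x, y, r=2):
    """L_r-norm distance: (sum_i |x_i - y_i| ** r) ** (1/r)."""
    return sum(abs(a - b) ** r for a, b in zip(x, y)) ** (1.0 / r)
```

For example, lr_distance([0, 0], [3, 4]) is 5.0, while with r = 1 it is 7.0.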
Given two vectors x and y, the cosine of the angle between them is
the dot product x.y / the L2-norms of x and y (i.e., their Euclidean
distances from the origin).
Recall that the dot product of vectors [x1, x2, . . . , xn] and [y1, y2, . . . , yn]
is x1y1 + x2y2 + · · · + xnyn.
Cosine Distance
Let our two vectors be x = [1, 2, −1] and y = [2, 1, 1]. The dot product
x.y is 1 × 2 + 2 × 1 + (−1) × 1 = 3.
For example, x has L2-norm (1^2 + 2^2 + (−1)^2)^(1/2) = √6. In fact, the
L2-norm of both vectors is √6.
Thus, the cosine of the angle between x and y is 3/(√6√6) or 1/2.
The angle whose cosine is ½ is 60 degrees, so that is the cosine
distance between x and y.
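The example can be checked with a short Python sketch (returning the angle in degrees, as in the text):

```python
import math

def cosine_distance(x, y):
    """Cosine distance: the angle, in degrees, between vectors x and y."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return math.degrees(math.acos(dot / (norm_x * norm_y)))

print(cosine_distance([1, 2, -1], [2, 1, 1]))  # 60 degrees (up to rounding)
```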
Edit Distance
This distance makes sense when points are strings.
The distance between two strings x = x1x2 · · · xn and y = y1y2 · · · ym is
the smallest number of insertions and deletions of single characters that
will convert x to y.
Eg: The edit distance between the strings x = abcde and y =
acfdeg is 3. To convert x to y:
1. Delete b.
2. Insert f after c.
3. Insert g after e.
No sequence of fewer than three insertions and/or deletions will convert
x to y. Thus, d(x, y) = 3.
Another way to define and calculate the edit distance d(x, y) is to
compute a longest common subsequence (LCS) of x and y.
An LCS of x and y is a string that is constructed by deleting positions
from x and y, and that is as long as any string that can be constructed
that way.
The edit distance d(x, y) can be calculated as the length of x plus the
length of y minus twice the length of their LCS.
•Example: The strings x = abcde and y = acfdeg have a unique LCS, which
is acde.
•We can be sure it is the longest possible, because it contains every
symbol appearing in both x and y.
•Fortunately, these common symbols appear in the same order in both
strings, so we are able to use them all in an LCS.
•Note that the length of x is 5, the length of y is 6, and the length of their
LCS is 4. The edit distance is thus 5 + 6 − 2 × 4 = 3, which agrees with the
direct calculation.
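Both the LCS and the resulting edit distance can be computed by standard dynamic programming; a sketch (names illustrative):

```python
def lcs_length(x, y):
    """Length of a longest common subsequence of strings x and y."""
    dp = [[0] * (len(y) + 1) for _ in range(len(x) + 1)]
    for i, a in enumerate(x):
        for j, b in enumerate(y):
            # extend the LCS on a match; otherwise keep the better prefix
            dp[i + 1][j + 1] = dp[i][j] + 1 if a == b else max(dp[i + 1][j], dp[i][j + 1])
    return dp[len(x)][len(y)]

def edit_distance(x, y):
    """Insertion/deletion edit distance: |x| + |y| - 2 * |LCS(x, y)|."""
    return len(x) + len(y) - 2 * lcs_length(x, y)

print(edit_distance("abcde", "acfdeg"))  # 3
```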
Home work: Find edit Distance for x = aba and y = bab.
Hamming Distance
Given a space of vectors, we define the Hamming distance between
two vectors to be the number of components in which they differ.
The Hamming distance between the vectors 10101 and 11110 is 3.
That is, these vectors differ in the second, fourth, and fifth
components, while they agree in the first and third components.
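In Python this is a one-line count (a sketch; it works for any equal-length sequences of components):

```python
def hamming(x, y):
    """Number of components in which equal-length vectors x and y differ."""
    return sum(1 for a, b in zip(x, y) if a != b)

print(hamming("10101", "11110"))  # 3
```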
Hamming Distance
Find the Hamming distances between each pair of the following
vectors: 000000, 110011, 010101, and 011100.
Locality-Sensitive Functions
•The LSH technique distinguishes strongly between pairs at a low
distance and pairs at a high distance.
•These functions can apply to different distance measures.
•There are three conditions that we need for a family of functions:
•In many cases, the function f will “hash” items, and the decision will
be based on whether or not the result is equal.
•It is convenient to use the notation f(x) = f(y) to mean that
f(x, y) is “yes; make x and y a candidate pair.”
•We also use f(x) ≠ f(y) to mean “do not make x and y a candidate pair
unless some other function concludes we should do so.”
•A collection of functions of this form will be called a family of
functions.
•Let d1 < d2 be two distances according to some distance measure d. A
family F of functions is said to be (d1, d2, p1, p2)-sensitive if for every f
in F:
o If d(x, y) ≤ d1, then the probability that f(x) = f(y) is at least p1.
o If d(x, y) ≥ d2, then the probability that f(x) = f(y) is at most p2.
Locality-Sensitive Families for Jaccard Distance
•For the moment, we have only one way to find a family of locality-sensitive
functions: use the family of minhash functions, and assume that the
distance measure is the Jaccard distance.
•As before, we interpret a minhash function h to make x and y a candidate
pair if and only if h(x) = h(y).
•The family of minhash functions is a (d1, d2, 1−d1, 1−d2)-sensitive
family for any d1 and d2, where 0 ≤ d1 < d2 ≤ 1.
•The reason is that if d(x, y) ≤ d1, where d is the Jaccard distance, then SIM(x,
y) = 1 − d(x, y) ≥ 1 − d1. Since the probability that a minhash function sends x
and y to the same value equals SIM(x, y), that probability is at least 1 − d1.
(That is, as the distance between x and y decreases, the probability that
h(x) = h(y) increases; as the distance increases, that probability decreases.
Hence p1 = 1 − d1 and p2 = 1 − d2.)
Example:
Let d1 = 0.3 and d2 = 0.6.
Then we can assert that the family of minhash functions is a (0.3, 0.6, 0.7, 0.4)-
sensitive family.
That is, if the Jaccard distance between x and y is at most 0.3 (i.e., SIM(x, y) ≥
0.7) then there is at least a 0.7 chance that a minhash function will send x and
y to the same value.
And if the Jaccard distance between x and y is at least 0.6 (i.e., SIM(x, y) ≤ 0.4),
then there is at most a 0.4 chance that x and y will be sent to the same value.
Note that we could make the same assertion with another choice of d1 and d2;
only d1 < d2 is required.
LSH Families for Hamming Distance
• Suppose we have a space of d-dimensional vectors, and h(x,
y) denotes the Hamming distance between vectors x and y.
• If we take any one position of the vectors, say the ith position, we can
define the function fi(x) to be the ith bit of vector x.
• Then fi(x) = fi(y) if and only if vectors x and y agree in the ith
position.
• Then the probability that fi(x) = fi(y) for a randomly chosen i is exactly
1 − h(x, y)/d;
• i.e., it is the fraction of positions in which x and y agree.
Thus, the family F consisting of the functions {f1, f2, . . . , fd} is a
(d1, d2, 1 − d1/d, 1 − d2/d)-sensitive
family of hash functions, for any d1 < d2.
1. While Jaccard distance runs from 0 to 1, the Hamming distance on
a vector space of dimension d runs from 0 to d.
It is therefore necessary to scale the distances by dividing by d, to turn
them into probabilities.
2. While there is essentially an unlimited supply of minhash functions, the
size of the family F for Hamming distance is only d.
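The claim that a randomly chosen f_i agrees with probability 1 − h(x, y)/d can be checked by counting agreeing positions (a sketch; the function name is illustrative):

```python
def agreement_probability(x, y):
    """Fraction of positions i with f_i(x) = f_i(y), i.e., the probability
    that a uniformly chosen bit-sampling function from F makes x and y a
    candidate pair; this equals 1 - h(x, y) / d."""
    return sum(1 for a, b in zip(x, y) if a == b) / len(x)

print(agreement_probability("10101", "11110"))  # 0.4, i.e., 1 - 3/5
```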
Random Hyperplanes and the Cosine
Distance
Consider two vectors x and y that make an angle θ between them.
Note that these vectors may be in a space of many dimensions, but
they always define a plane, and the angle between them is
measured in this plane.
First, consider a vector v that is normal to the hyperplane whose
projection is represented by the dashed line.
All angles for the line that is the intersection of the random
hyperplane and the plane of x and y are equally likely.
Thus, the hyperplane will look like the dashed line with probability
θ/180 and will look like the dotted line otherwise.
Thus, each hash function f in our locality-sensitive family F is built
from a randomly chosen vector vf .
Given two vectors x and y, say f(x) = f(y) if and only if the dot
products vf .x and vf .y have the same sign.
Then F is a locality-sensitive family for the cosine distance.
That is, F is a (d1, d2, (180 − d1)/180, (180 − d2)/180)-sensitive family,
for any d1 < d2.
Sketches (Signatures for Cosine Distance)
•Instead of choosing a random vector from all possible vectors, it turns out to
be sufficiently random if we restrict our choice to vectors whose
components are +1 and −1.
•The dot product of any vector x with a vector v of +1’s and −1’s is formed
by adding the components of x where v is +1 and then subtracting the
other components of x – those where v is −1.
•If we pick a collection of random vectors, say v1, v2, . . . , vn, then we
can apply them to an arbitrary vector x by computing v1.x, v2.x, . . . , vn.x
and then replacing any positive value by +1 and any negative value by −1.
The result is called the sketch of x.
You can handle 0’s arbitrarily, e.g., by choosing a result of +1 or −1 at
random, since there is only a tiny probability of a zero dot product.
Example:
Suppose our space consists of 4-dimensional vectors, and we pick three
random vectors: v1 = [+1,−1,+1,+1], v2 = [−1,+1,−1,+1], and v3 =
[+1,+1,−1,−1]. For the vector x = [3, 4, 5, 6], the sketch is [+1,+1,−1].
That is, v1.x = 3−4+5+6 = 10. Since the result is positive, the first
component of the sketch is +1. Similarly, v2.x = 2 and v3.x = −4, so the
second component of the sketch is +1 and the third component is −1.
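The example can be reproduced with a short sketch-computing function (a zero dot product is resolved to +1 here, one of the arbitrary choices mentioned above):

```python
def sketch(x, random_vectors):
    """Sketch of x: the sign (+1 or -1) of each dot product v_i . x."""
    result = []
    for v in random_vectors:
        dot = sum(vi * xi for vi, xi in zip(v, x))
        result.append(1 if dot >= 0 else -1)  # ties (dot == 0) go to +1
    return result

vs = [[1, -1, 1, 1], [-1, 1, -1, 1], [1, 1, -1, -1]]
print(sketch([3, 4, 5, 6], vs))  # [1, 1, -1]
```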
LSH Families for Euclidean Distance
Each hash function in this family is based on a randomly chosen line, divided
into segments (“buckets”) of length a; a point hashes to the bucket containing
its projection onto the line. Two points at distance d ≫ a have only a small
chance of being hashed to the same bucket.
The family F just described forms a (a/2, 2a, 1/2, 1/3)-sensitive family of
hash functions.
That is, for distances up to a/2 the probability is at least 1/2 that two
points at that distance will fall in the same bucket.
While for distances at least 2a, the probability that points at that distance
will fall in the same bucket is at most 1/3.
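A sketch of one member of this family in Python. The slides leave the choice of random line unspecified; a Gaussian direction vector is one conventional choice, and the names here are illustrative:

```python
import math
import random

def make_line_hash(dim, a, seed=0):
    """One hash function for Euclidean LSH: project points onto a random
    line, then report which segment ("bucket") of length a the projection
    falls in."""
    rng = random.Random(seed)
    direction = [rng.gauss(0, 1) for _ in range(dim)]
    norm = math.sqrt(sum(c * c for c in direction))
    unit = [c / norm for c in direction]
    offset = rng.uniform(0, a)  # random shift of the bucket boundaries
    def h(point):
        projection = sum(u * p for u, p in zip(unit, point))
        return math.floor((projection + offset) / a)
    return h
```

Points whose projections land in the same segment hash to the same bucket; by the analysis above, points at distance at most a/2 do so with probability at least 1/2.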
Applications of Locality-Sensitive Hashing
1. Entity Resolution: This term refers to matching data records that
refer to the same real-world entity, e.g., the same person.
2. Matching Fingerprints: It is possible to represent fingerprints as
sets.
3. Matching Newspaper Articles.
Entity Resolution
•It is common to have several data sets available, and to know that they
refer to some of the same entities.
• For example, several different bibliographic sources provide information
about many of the same books or papers.
•In the general case, we have records describing entities of some type, such
as people or books.
•The records may all have the same format, or they may have different
formats, with different kinds of information.
An Entity-Resolution Example
•There are many reasons why information about an entity may vary, even
if the field in question is supposed to be the same.
•For example, names may be expressed differently in different records
because of misspellings, absence of a middle initial, use of a nickname,
and many other reasons.
•For example, “Bob S. Jomes” and “Robert Jones Jr.” may or may not be
the same person.
•If records come from different sources, the fields may differ as well. One
source’s records may have an “age” field, while another does not. The
second source might have a “date of birth” field.
Real example of how LSH was used to deal with an entity resolution problem.
Company A was engaged by Company B to solicit customers for B. Company B
would pay A a yearly fee, as long as the customer maintained their
subscription. They later quarreled and disagreed over how many customers A
had provided to B. Each had about 1,000,000 records, some of which
described the same people; those were the customers A had provided to B.
The records had different data fields, but unfortunately none of those fields
was “this is a customer that A had provided to B.”
The problem was to match records from the two sets to see if a pair
represented the same person.
•The strategy for identifying records involved scoring the differences in
three fields: name, address, and phone.
•To create a score describing the likelihood that two records, one from A
and the other from B, described the same person, 100 points was
assigned to each of the three fields, so records with exact matches in all
three fields got a score of 300.
•However, it is not feasible to score all one trillion pairs of records. Thus, a
simple LSH was used to focus on likely candidates. Three “hash
functions” were used.
In practice, there was no hashing; rather the records were sorted by
name, so records with identical names would appear consecutively
and get scored for overall similarity of the name, address,
and phone.
Validating Record Matches
•In the example at hand, there was an easy way to decide whether a scored pair
of records truly represented the same person, and the technique can be applied
in many similar situations.
•It was decided to look at the creation-dates for the records at hand, and
to assume that 90 days was an absolute maximum delay between the
time the service was bought at Company A and registered at B.
•Thus, a proposed match between two records that were chosen at
random, subject only to the constraint that the date on the B-record was
between 0 and 90 days after the date on the A-record, would have an
average delay of 45 days.
•It was found that of the pairs with a perfect 300 score, the average
delay was 10 days.
•If you assume that 300-score pairs are surely correct matches, then
you can look at the pool of pairs with any given score s, and compute
the average delay of those pairs.
•Suppose that the average delay is x, and the fraction of true matches
among those pairs with score s is f. Then x = 10f + 45(1 − f), or x =
45−35f.
•Solving for f, we find that the fraction of the pairs with score s that are
truly matches is (45 − x)/35.
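The arithmetic above can be stated as a one-line sketch:

```python
def fraction_true_matches(avg_delay):
    """Solve avg_delay = 10*f + 45*(1 - f) for f, the fraction of true
    matches among the pairs at a given score: f = (45 - avg_delay) / 35."""
    return (45.0 - avg_delay) / 35.0

print(fraction_true_matches(10))  # 1.0: the perfect-score (300) pool
```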