MinHash for dummies
Duplicate document detection is becoming increasingly important for businesses that have huge collections of email, web pages and documents with multiple copies that may or may not be kept up to date.
MinHash is a fairly simple algorithm that from all my Googling has been explained very poorly in blogs or in the kind of mathematical terms that I forgot long ago. So in this article I will attempt to explain how MinHash works at a practical code level.
Before I start, please take a look at http://infolab.stanford.edu/~ullman/mmds/ch3.pdf. That document goes into a lot of theory, and was ultimately where my understanding on MinHash came from. Unfortunately it approaches the algorithm from a theoretical standpoint, but if I gloss over some aspect of the MinHash algorithm here, you will almost certainly find a fuller explanation in the PDF.
I'll also be using pseudo Java in these examples instead of traditional math. This means when I use terms like Set, I am referring to the group of classes that implement a Set in Java. Of course a set in Java shares a lot of properties with a set in maths, but keep in mind that this is written from a developers point of view.
MinHash is a fairly simple algorithm that from all my Googling has been explained very poorly in blogs or in the kind of mathematical terms that I forgot long ago. So in this article I will attempt to explain how MinHash works at a practical code level.
Before I start, please take a look at http://infolab.stanford.edu/~ullman/mmds/ch3.pdf. That document goes into a lot of theory, and was ultimately where my understanding on MinHash came from. Unfortunately it approaches the algorithm from a theoretical standpoint, but if I gloss over some aspect of the MinHash algorithm here, you will almost certainly find a fuller explanation in the PDF.
I'll also be using pseudo Java in these examples instead of traditional math. This means when I use terms like Set, I am referring to the group of classes that implement a Set in Java. Of course a set in Java shares a lot of properties with a set in maths, but keep in mind that this is written from a developers point of view.
What does it mean when we talk about document similarity?
The whole point of MinHash is to efficiently answer the question "How similar is this document to that document".
Before we can answer that question, we need to look at a more simple example of two sets of strings and determine how similar they are.
Here are our two sets:
Set a = new Set(["chair", "desk", "rug", "keyboard", "mouse"]);
Set b = new Set(["chair", "rug", "keyboard"]);
A measure of how similar these two sets are is known as the Jaccard Coefficient. It is calculated as number of common elements / (total number of elements  number of common elements).
In our case we have 3 common elements: "chair", "rug" and "keyboard". We have a total of 8 elements. So the Jaccard Coefficient = 3 / (8  3) = 0.6, or 60%. http://www.planetcalc.com/1664/ is an online calculator that you can play around with to determine the similarity of two sets.
Documents as sets
While a document can be though of as a giant set of words, we don't just break down a document into individual words, place them in a set and calculate the similarity, because that looses the importance of the order of the words. The sets
Set a = new Set(["I", "went", "to", "work", "today"]);
Set b = new Set(["today", "I", "went", "to", "work"]);
would be considered 100% similar, even though they clearly are not.
Instead we break a document down into what are known as shingles. Each shingle contains a set number of words, and a document is broken down into total words  single length + 1 number of shingles. So if a document contained a single sentence of "The quick brown fox jumps over the lazy dog", that would be broken down into the following 5 word long shingles :
 The quick brown fox jumps
 quick brown fox jumps over
 brown fox jumps over the
 fox jumps over the lazy
 jumps over the lazy dog
So our document as a set looks like:
Set a = new Set(["The quick brown fox jumps", "quick brown fox jumps over", "brown fox jumps over the", "fox jumps over the lazy", "jumps over the lazy dog"]);
These sets of shingles can then be compared for similarity using the Jaccard Coefficient.
Optimizing the process
So we now have a way to compare two documents for similarity, but it is not an efficient process. To find similar documents to document A in a directory of 10000 documents, we need compare each pair individually. This is obviously not going to scale well.
What we can do to reduce some cycles is compare sets of randomly selected shingles from two documents. So for a document that is 10000 words long, we break it down into 9996 shingles, and randomly select say 200 of those to represent the document. If the second document is 20000 words long, we boil that down from 19996 shingles to another set of 200 randomly selected shingles.
Now instead of matching 9996 shingles to 19996 other shingles, we are are comparing 200 shingles to another 200 shingles. We have reduced our workload at the expense of accuracy, but this is still good enough for most scenarios.
NOTE
So just how much accuracy did we sacrifice by comparing only a small subset of randomly selected shingles? I actually don't know, but I assume it sits on a bell curve where the number of random selections improves the accuracy to a point where more random selections won't provide any meaningful increase in accuracy. If you know the math behind this, please leave a comment.
Computing and saving the random shingle selections
So this is where MinHash comes in.
We could save the 200 randomly selected shingles, but storing variable length strings kinda sucks. It complicates database design and comparing strings is slow.
The clever part about MinHash is that it allows us to save integers instead of strings, and also takes the hassle out of randomly selecting shingles. It works like this.
1. Break down the document a set of shingles.
2. Calculate the hash value for every shingle.
3. Store the minimum hash value found in step 2.
4. Repeat steps 2 and 3 with different hash algorithms 199 more times to get a total of 200 min hash values.
At this point instead of 200 randomly selected shingles, we have 200 integer hash values. This is effectively a random selection of shingles, because a good hash algorithm is supposed to generate a number that is as likely to be large as it is likely to be small. This kind of distribution of hash codes is what hashing is all about, so selecting the smallest hash is, for all intents and purposes, a random selection of a shingle.
Um, 199 more hash algorithms? WTF!
So you see the String.hashCode() method, but where do these other 199 hash codes come from? This is the step that most explanations on the MinHash algorithm conveniently leave out.
The short answer is that you XOR the value returned by String.hashCode() with 199 random numbers to generate the 199 other hash code values. Just make sure that you are using the same 199 random numbers across all the documents.
If you want some more detail on why this works, see http://stackoverflow.com/a/19711615/157605.
Great, but I still have to compare every document to every other document. Can this be optimized?
This is where the second neat trick of MinHash comes in.
Locality sensitive hashing (LSH) involves generating a hash code such that similar items will tend to get similar hash codes. This is the opposite of what .hashCode() does.
LSH allows you to precompute a hash code that is then quickly and easily compared to another precomputed LSH hash code to determine if two objects should be compared in more detail or quickly discarded.
We don't actually calculate a LSH hash code as such, but the idea of a boiling down a complex object (like our collection of minhash codes) into something that is quickly and easily compared with other complex objects is still applicable.
What we do is take the collection of minhash values, and categorize them into bands and row. For the sake of simplicity we'll assume each document has 20 minhash values. We will break them down into 5 bands with 4 rows each. Also for the sake of simplicity we'll assume that .hashCode() (and any number XORed with it) results in a number between 0 and 9.
So conceptually the minhash values from 5 documents in the first of the 5 bands and its rows looks like this.
Band One 

What we are looking for are rows within a band that are the same between documents. Document One and Document Three in this case both have rows with values 1, 3, 6, 0. This tells us that these two documents should be compared for their similarity.
Although it is not shown here, you should look through Bands Two  Five looking for documents that share rows. Any documents that share rows in any bands should be compared for their similarity.
This is useful when you have a document, and you want to know which other documents to compare to it for similarity. I'll assume that the minhash values are stored in SQL, so to get out candidates for comparison you would write:
SELECT DocName
FROM Documents
WHERE (
BandOneMinHashAlgorithmOne = 1 AND
BandOneMinHashAlgorithmTwo = 3 AND
BandOneMinHashAlgorithmThree = 6 AND
BandOneMinHashAlgorithmFour = 0
) OR
(
) OR (
) OR (
) OR (
)
(
BandTwoMinHashAlgorithmOne = # AND
BandTwoMinHashAlgorithmTwo = # AND
BandTwoMinHashAlgorithmThree = # AND
BandTwoMinHashAlgorithmFour = #
BandThreeMinHashAlgorithmOne = # AND
BandThreeMinHashAlgorithmTwo = # AND
BandThreeMinHashAlgorithmThree = # AND
BandThreeMinHashAlgorithmFour = #
BandFourMinHashAlgorithmOne = # AND
BandFourMinHashAlgorithmTwo = # AND
BandFourMinHashAlgorithmThree = # AND
BandFourMinHashAlgorithmFour = #
BandFiveMinHashAlgorithmOne = # AND
BandFiveMinHashAlgorithmTwo = # AND
BandFiveMinHashAlgorithmThree = # AND
BandFiveMinHashAlgorithmFour = #
This explanation completely glosses over why this works and what combination of bands and rows you should construct when looking for similar documents. If you want to get into that detail, take a look at http://infolab.stanford.edu/~ullman/mmds/ch3.pdf.
Comments
For MinHash, you are taking the set of all unique shingles (between the two documents) and trying measure the fraction that are common to both documents, which is defined to be the document similarity. If the fraction is p, a randomly selected shingle has probability p of being a common shingle, and probability 1p of being a noncommon shingle  this is known as a Bernoulli distribution. The standard deviation of a Bernoulli distribution is sqrt(p*(1p)). So, the average of N Bernoulli samples will have a normal (bellshaped) distribution with standard deviation sqrt(p*(1p)/N). But, we don't know p  that's what we are trying to measure. The largest possible value for the standard deviation occurs when p=0.5, so we know the standard deviation is at most 0.5/sqrt(N). To have 95% confidence that the true value is within our estimated range, the range should be plus or minus 2 standard deviations. So, a conservative estimate for the uncertainty in our result is plus or minus 1/sqrt(N).
If we take N=200, our result is plus or minus 1/sqrt(200) = 0.07. So, if 100 or our 200 sampled shingles are common to both documents we know that p is in the range 0.43 to 0.57. If p is close to 0 or 1 the error bar will actually be smaller than 0.07, but we don't that that will be the case when we pick N. If we took four times as many samples the error bar would only be cut in half due to the sqrt.
Useful references:
http://en.wikipedia.org/wiki/Central_limit_theorem
http://en.wikipedia.org/wiki/Bernoulli_distribution
http://en.wikipedia.org/wiki/68–95–99.7_rule