MinHash for dummies

Duplicate document detection is becoming increasingly important for businesses that have huge collections of email, web pages and documents with multiple copies that may or may not be kept up to date.

MinHash is a fairly simple algorithm that from all my Googling has been explained very poorly in blogs or in the kind of mathematical terms that I forgot long ago. So in this article I will attempt to explain how MinHash works at a practical code level.

Before I start, please take a look at http://infolab.stanford.edu/~ullman/mmds/ch3.pdf. That document goes into a lot of theory, and was ultimately where my understanding on MinHash came from. Unfortunately it approaches the algorithm from a theoretical standpoint, but if I gloss over some aspect of the MinHash algorithm here, you will almost certainly find a fuller explanation in the PDF.

I'll also be using pseudo Java in these examples instead of traditional math. This means when I use terms like Set, I am referring to the group of classes that implement a Set in Java. Of course a set in Java shares a lot of properties with a set in maths, but keep in mind that this is written from a developers point of view.

What does it mean when we talk about document similarity?  

The whole point of MinHash is to efficiently answer the question "How similar is this document to that document".

Before we can answer that question, we need to look at a more simple example of two sets of strings and determine how similar they are.

Here are our two sets:

Set a = new Set(["chair", "desk", "rug", "keyboard", "mouse"]);
Set b = new Set(["chair", "rug", "keyboard"]);

A measure of how similar these two sets are is known as the Jaccard Coefficient. It is calculated as number of common elements / (total number of elements - number of common elements).

In our case we have 3 common elements: "chair", "rug" and "keyboard". We have a total of 8 elements. So the Jaccard Coefficient  = 3 / (8 - 3) = 0.6, or 60%. http://www.planetcalc.com/1664/ is an online calculator that you can play around with to determine the similarity of two sets.

Documents as sets

While a document can be though of as a giant set of words, we don't just break down a document into individual words, place them in a set and calculate the similarity, because that looses the importance of the order of the words. The sets

Set a = new Set(["I", "went", "to", "work", "today"]);
Set b = new Set(["today", "I", "went", "to", "work"]);

would be considered 100% similar, even though they clearly are not.

Instead we break a document down into what are known as shingles. Each shingle contains a set number of words, and a document is broken down into total words - single length + 1 number of shingles. So if a document contained a single sentence of "The quick brown fox jumps over the lazy dog", that would be broken down into the following 5 word long shingles :
  1. The quick brown fox jumps
  2. quick brown fox jumps over
  3. brown fox jumps over the
  4. fox jumps over the lazy
  5. jumps over the lazy dog
So our document as a set looks like:

Set a = new Set(["The quick brown fox jumps", "quick brown fox jumps over", "brown fox jumps over the", "fox jumps over the lazy", "jumps over the lazy dog"]);

These sets of shingles can then be compared for similarity using the Jaccard Coefficient.

Optimizing the process

So we now have a way to compare two documents for similarity, but it is not an efficient process. To find similar documents to document A in a directory of 10000 documents, we need compare each pair individually. This is obviously not going to scale well.

What we can do to reduce some cycles is compare sets of randomly selected shingles from two documents. So for a document that is 10000 words long, we break it down into 9996 shingles, and randomly select say 200 of those to represent the document. If the second document is 20000 words long, we boil that down from 19996 shingles to another set of 200 randomly selected shingles.

Now instead of matching 9996 shingles to 19996 other shingles, we are are comparing 200 shingles to another 200 shingles. We have reduced our workload at the expense of accuracy, but this is still good enough for most scenarios.

NOTE
So just how much accuracy did we sacrifice by comparing only a small subset of randomly selected shingles? I actually don't know, but I assume it sits on a bell curve where the number of random selections improves the accuracy to a point where more random selections won't provide any meaningful increase in accuracy. If you know the math behind this, please leave a comment.

I chose 200 random selections after reading http://blog.cluster-text.com/tag/minhash/.

Computing and saving the random shingle selections

So this is where MinHash comes in.

We could save the 200 randomly selected shingles, but storing variable length strings kinda sucks. It complicates database design and comparing strings is slow.

The clever part about MinHash is that it allows us to save integers instead of strings, and also takes the hassle out of randomly selecting shingles. It works like this.

1. Break down the document a set of shingles.
2. Calculate the hash value for every shingle.
3. Store the minimum hash value found in step 2.
4. Repeat steps 2 and 3 with different hash algorithms 199 more times to get a total of 200 min hash values.

At this point instead of 200 randomly selected shingles, we have 200 integer hash values. This is effectively a random selection of shingles, because a good hash algorithm is supposed to generate a number that is as likely to be large as it is likely to be small. This kind of distribution of hash codes is what hashing is all about, so selecting the smallest hash is, for all intents and purposes, a random selection of a shingle.

Um, 199 more hash algorithms? WTF!

So you see the String.hashCode() method, but where do these other 199 hash codes come from? This is the step that most explanations on the MinHash algorithm conveniently leave out.

The short answer is that you XOR the value returned by String.hashCode() with 199 random numbers to generate the 199 other hash code values. Just make sure that you are using the same 199 random numbers across all the documents.

If you want some more detail on why this works, see http://stackoverflow.com/a/19711615/157605.

Great, but I still have to compare every document to every other document. Can this be optimized?

This is where the second neat trick of MinHash comes in. 

Locality sensitive hashing (LSH) involves generating a hash code such that similar items will tend to get similar hash codes. This is the opposite of what .hashCode() does.

LSH allows you to precompute a hash code that is then quickly and easily compared to another precomputed LSH hash code to determine if two objects should be compared in more detail or quickly discarded.

We don't actually calculate a LSH hash code as such, but the idea of a boiling down a complex object (like our collection of minhash codes) into something that is quickly and easily compared with other complex objects is still applicable.

What we do is take the collection of minhash values, and categorize them into bands and row. For the sake of simplicity we'll assume each document has 20 minhash values. We will break them down into 5 bands with 4 rows each. Also for the sake of simplicity we'll assume that .hashCode() (and any number XORed with it) results in a number between 0 and 9.

So conceptually the minhash values from 5 documents in the first of the 5 bands and its rows looks like this.

Band One
MinHash Algorithm One MinHash Algorithm Two MinHash Algorithm Three MinHash Algorithm Four
Document One 1 3 6 0
Document Two 2 3 1 0
Document Three 1 3 6 0
Document Four 2 1 3 1
What we are looking for are rows within a band that are the same between documents. Document One and Document Three in this case both have rows with values 1, 3, 6, 0. This tells us that these two documents should be compared for their similarity.

Although it is not shown here, you should look through Bands Two - Five looking for documents that share rows. Any documents that share rows in any bands should be compared for their similarity. 

This is useful when you have a document, and you want to know which other documents to compare to it for similarity. I'll assume that the minhash values are stored in SQL, so to get out candidates for comparison you would write:

SELECT DocName 
FROM Documents
WHERE (
BandOneMinHashAlgorithmOne = 1 AND
BandOneMinHashAlgorithmTwo = 3 AND
BandOneMinHashAlgorithmThree = 6 AND
BandOneMinHashAlgorithmFour = 0
) OR
(
BandTwoMinHashAlgorithmOne = # AND
BandTwoMinHashAlgorithmTwo = # AND
BandTwoMinHashAlgorithmThree = # AND
BandTwoMinHashAlgorithmFour = #
) OR (
BandThreeMinHashAlgorithmOne = # AND
BandThreeMinHashAlgorithmTwo = # AND
BandThreeMinHashAlgorithmThree = # AND
BandThreeMinHashAlgorithmFour = #
) OR (
BandFourMinHashAlgorithmOne = # AND
BandFourMinHashAlgorithmTwo = # AND
BandFourMinHashAlgorithmThree = # AND
BandFourMinHashAlgorithmFour = #
) OR (
BandFiveMinHashAlgorithmOne = # AND
BandFiveMinHashAlgorithmTwo = # AND
BandFiveMinHashAlgorithmThree = # AND
BandFiveMinHashAlgorithmFour = #
)

This explanation completely glosses over why this works and what combination of bands and rows you should construct when looking for similar documents. If you want to get into that detail, take a look at http://infolab.stanford.edu/~ullman/mmds/ch3.pdf.

Comments

Bill Dimm said…
Regarding how the accuracy depends on the number of hash functions, that comes from the Central Limit Theorem. If you compute the mean of a bunch of independent samples from the same probability distribution, the probability distribution of the mean itself will be normal (i.e. bell-shaped) if you take enough samples even if the probability distribution of the individual samples is far from normal. The standard deviation of the bell-shaped distribution for the mean is given by the standard deviation of the individual sample distribution divided by sqrt(N), where N is the number of samples.

For MinHash, you are taking the set of all unique shingles (between the two documents) and trying measure the fraction that are common to both documents, which is defined to be the document similarity. If the fraction is p, a randomly selected shingle has probability p of being a common shingle, and probability 1-p of being a non-common shingle -- this is known as a Bernoulli distribution. The standard deviation of a Bernoulli distribution is sqrt(p*(1-p)). So, the average of N Bernoulli samples will have a normal (bell-shaped) distribution with standard deviation sqrt(p*(1-p)/N). But, we don't know p -- that's what we are trying to measure. The largest possible value for the standard deviation occurs when p=0.5, so we know the standard deviation is at most 0.5/sqrt(N). To have 95% confidence that the true value is within our estimated range, the range should be plus or minus 2 standard deviations. So, a conservative estimate for the uncertainty in our result is plus or minus 1/sqrt(N).

If we take N=200, our result is plus or minus 1/sqrt(200) = 0.07. So, if 100 or our 200 sampled shingles are common to both documents we know that p is in the range 0.43 to 0.57. If p is close to 0 or 1 the error bar will actually be smaller than 0.07, but we don't that that will be the case when we pick N. If we took four times as many samples the error bar would only be cut in half due to the sqrt.

Useful references:
http://en.wikipedia.org/wiki/Central_limit_theorem
http://en.wikipedia.org/wiki/Bernoulli_distribution
http://en.wikipedia.org/wiki/68–95–99.7_rule
Ryan Moulton said…
This comment has been removed by the author.
Ryan Moulton said…
If you'd like to incorporate weights info the minhashes, (for instance, for counting tokens more than once, or for idf weighting) I wrote a guide for how to do that here: Minhash
Sowmya said…
wonderful article. Very interesting to read this article.I would like to thank you for the efforts you had made for writing this awesome article. This article resolved my all queries. keep it up.
blockchain training in hyderabad
blockchain course in hyderabad
blockchain coaching in hyderabad
blockchain training institute in hyderabad
blockchain institute in hyderabad

Thanks for Sharing This Article.It is very so much valuable content. I hope these Commenting lists will help to my website
blockchain online training
best blockchain online training
top blockchain online training
ed pills said…
This is really interesting, You are a very skilled blogger. I have joined your feed and look forward to seeking more of your magnificent post. Also, I have shared your site in my social networks!
ed medication said…
Its not my first time to visit this website, i am visiting this site dailly and get fastidious data from here all the time.
Hi there, just became alert to your blog through Google, and found that it is really informative. I'm going to watch out for brussels. I will appreciate if you continue this in future. Many people will be benefited from your writing. Cheers!
erection pills said…
I'm gone to say to my little brother, that he should also pay a visit this web site on regular basis to obtain updated from most up-to-date news.
sparkdeveloper said…
thank you . can you share the code ? please.
Anonymous said…
We are well established IT and outsourcing firm working in the market since 2013. We are providing training to the people ,
like- Web Design , Graphics Design , SEO, CPA Marketing & YouTube Marketing.Call us Now whatsapp: +(88) 01537587949
Digital marketing training
good post Mobile XPRESS
Free bangla sex video:Delivery Companies in UK
good post Mobile XPRESS
CrownQQ Agen DominoQQ BandarQ dan Domino99 Online Terbesar

Yuk Buruan ikutan bermain di website CrownQQ
Sekarang CROWNQQ Memiliki Game terbaru Dan Ternama loh...

10 permainan :
=> Poker
=> Bandar Poker
=> Domino99
=> BandarQ
=> AduQ
=> Sakong
=> Capsa Susun
=> Bandar 66
=> Perang Baccarat
=> Perang Dadu

Keunggulan bermain di CrownQQ :
=> Bonus Refferal 20%
=> Bonus Turn Over 0,5%
=> Minimal Depo 20.000
=> Minimal WD 20.000
=> 100% Member Asli
=> Pelayanan DP & WD 24 jam
=> Livechat Kami 24 Jam Online
=> Bisa Dimainkan Di Hp Android
=> Di Layani Dengan 5 Bank Terbaik
=> 1 User ID 10 Permainan Menarik
=> Menyediakan deposit Via Pulsa

Link Resmi CrownQQ:
- mastercqq.net
- mastercqq.org

Info Lebih lanjut Kunjungi :
Website : DominoQQ Online
Daftar CrownQQ : Poker Online
Info CrownQQ : Kontakk
Linktree : Agen Poker Online

WHATSAPP : +6287771354805
Line : CS_CROWNQQ
Facebook : CrownQQ Official
Kemenangan CrownQQ : Agen BandarQ
salar said…
The right place to learn to create websites and digital marketing strategies. Trainers explain real-time examples for clear understanding.
jeslynwang said…
Nagaqq Yang Merupakan Agen Bandarq terbaik , Domino 99, Dan Bandar Poker Online Terpercaya di asia hadir untuk anda semua dengan permainan permainan menarik dan bonus menarik untuk anda semua

Bonus yang diberikan NagaQQ :
* Bonus rollingan 0.5%,setiap senin di bagikannya
* Bonus Refferal 10% + 10%,seumur hidup
* Bonus Jackpot, yang dapat anda dapatkan dengan mudah
* Minimal Depo 15.000
* Minimal WD 20.000
* Deposit via Pulsa TELKOMSEL
* 6 JENIS BANK ( BCA , BNI, BRI , MANDIRI , CIMB , DANAMON )

Memegang Gelar atau title sebagai AGEN POKER ONLINE Terbaik di masanya

11 Games Yang di Hadirkan NagaQQ :
* Poker Online
* BandarQ
* Domino99
* Bandar Poker
* Bandar66
* Sakong
* Capsa Susun
* AduQ
* Perang Bacarrat
* Perang Dadu
* BD QQ (New Game)


Info Lebih lanjut Kunjungi :
Website : NAGAQQ
Facebook : NagaQQ official
WHATSAPP : +62 87833350051
Line : Cs_nagaQQ
TELEGRAM :+855967014811
Tamizha Karthic said…
Thank you for sharing knowledgeable blog with us Facebook ads mastery course in tamil
It's been nice to see such articles!!
I would Love to share
Arnold DK said…
Nice article, it is very valuable information. Thanks for sharing these information with all of us. whatsapp mod

kajal said…

A dark mode is coming to Google Chrome on desktop
Every app these days seems to have a new dark mode or one in development. This is due in part to the appearance a slick black background can give but also due to recent research from Google that shows darker colors use up less battery power on devices with...
Google now allows you to play podcasts within search results
The podcasting world has always been a bit of the Wild West. You'd discover a series through word of mouth or maybe by way of an article listing recommendations. Then it's on you to use a podcasting app or go to the website to download the episode and listen..
LEAK: Windows 10 removes live tiles in new Start menu
A new leak shows that Windows 10's new look will not feature live tiles. For those who do not know, live tiles are the moving tiles on your Start menu.

jobs for free
download movies for free
Download latest movies for free techisq downloader
download movies link app
Download all movies for free
Nice articles and your informatin valuable and good artices thank for the sharing information Brake Oil Plant

a fantastic article Thank you for sharing. Everyone seeking a software agency should read this post.
Brisk logic is a Mohali-based software design, app development, and web design firm.
digital transformation and business automation
Anonymous said…
The search engine determines whether or not the results of the query were accurate based on the users’ actions. If you add enough keywords in the description, title, tags, and transcript then your video will appear in more search results. The digital marketing agency London will tell you that comments, subscribers and likes on a YouTube video increase the relevancy of a post. You can get more hits through other users if your keywords are relevant.
KPR RUMAH said…
Rumah subsidi adalah bagian dari program pemerintah agar masyarakat dapat membeli rumah dengan harga dan cicilan yang terbilang cukup ringan. Melalui Kementerian Perumahan Rakyat (Kemenpera), pemerintah menggalakkan sebuah program satu juta rumah, dan salah satunya adalah KPR rumah bersubsidi. Sebagai informasi, rumah bersubsidi merupakan rumah yang mendapat kemudahan dari pemerintah yakni berupa bebasnya pajak dan cicilan dengan bunga yang sangat rendah. Selain itu, keunggulan lain yang bisa didapatkan dari rumah bersubsidi yakni memiliki suku bunga flat hingga lunas.
Rumah
How to Unsend an Email in Outlook App?

Check out the steps mentioned to know how to unsend an email in Outlook app. Select the sent items folder in the folder panel on the left of Outlook window. Here, open message that you want to recall and then double-click to open message. If you choose the message and it appears in reading panel then it will not allow to recall this message.If you have ribbon then choose actions and then recall this message from message tab. choose to delete unread copies of message or delete unread copies or replace it with the new one by clicking on OK. Lastly, if you are sending the replace message then compose new one and choose to send.


How to Setup Bellsouth Email on iPhone?

There are times when users have the query about how to setup Bellsouth email on iPhone then start by setting up your email. Now, launch the settings app on iPhone and choose mail, contacts, and calendar. Touch to add accounts under the accounts heading and choose other than account type. After that, click to add an email account. Now, enter your name, Bellsouth email address and password in the requested field. After that, you need to enter the Bellsouth email details in the requested fields. Now, choose to next to let your iPhone device check if settings are working properly or not. Lastly, you need to customize the email settings as per your personal preference.


Troubleshooting Steps for Yahoo Not Working

If you are an iPhone user then sometimes users do encounter with the issue of Yahoo not working. If you are the one facing same issue then check out the steps. For this, press and hold the power button along with volume up button. Within few seconds, you will see power-off slider. After that, drag the slider and let your device turn off properly. Now, to turn on press and hold the power button and then wait till the Apple logo appears. These are the steps that users need to follow to smoothly deal with the issue of Yahoo not working issue.

How to Turn Off Outlook Notifications on Apple Watch?

Check out the steps to know about to know how to turn off Outlook notifications on Apple Watch. If you don’t want to receive email notifications on Apple Watch, then you can smoothly turn it off by going to my watch (tab) > mail > custom > under alerts. You can access your iPhone settings by going to the settings application. After that, you can access notifications by going to the notifications section. The included section should contain a listing of Outlook. Lastly, to access Outlook and choose the tab. Lastly, these are the steps to turn off Outlook notifications on Apple Watch smoothly.
Gerry Max said…
Using Chrome's Incognito mode when using a public computer or someone else's device is a very useful feature. It prevents your browsing history from being seen by others. You can also use it on your own device for privacy reasons. However, Chrome does not provide an option to disable incognito mode. But there is another way to Disable Incognito Mode Chrome Browser. To achieve this, launch the Chrome app and tap the Square icon to switch between incognito and regular windows. You can tap the X icon to close the incognito mode on any iPhone series upon reaching this point. If you prefer not to close the incognito mode completely, you can toggle between regular and incognito mode.
Gerry Max said…
Various methods can be used to resolve the various CenturyLink Email related issues. So, to resolve the issue, check whether the CenturyLink email configuration settings are correct for both the servers, Incoming Mail Server and the Outgoing Mail Server. If you can't access CenturyLink email
, then the issue can be with your log-in account login password. So, try to change your password and then re-login to your account. By deleting the unnecessary spam and junk mails, then moving them to your recycle bin or trash, and then deleting them once again from there, you can easily troubleshoot your CenturyLink Email problem.
Gerry Max said…
Any malfunction with Yahoo's products may cause your account to be suspended. Yahoo is very strict about maintaining the quality of its products and services. The Sign-in Helper can be used immediately to regain access to your Yahoo account if it has been locked temporarily. To Unlock Yahoo Email, use the Yahoo Help Forum Page. Now, you must click on the 'Account Locked' message. Now click on 'Sign-in Helper' and enter your Yahoo Mail address or phone number.
murtazagrn said…
. On Friday the 13th, no less! To be precise, Friday, April 13th, 2029. When 99942 Apophis was found in 2004, scientists believed it had a slim chance of hitting Earth.
Facts About Subway

Fun Facts about Friday
CV Nabil menyediakan Travel Bogor Bandung dan Bandung Bogor – Kami merupakan shuttle bogor – bandung dan sebalik nya. Dengan konsep menjemput door to door. Umum nya pada setiap orang, masalah waktu bisa dibilang merupakan satu pertimbangan yang utama dikarenakan faktor penentu, terlebih ada beberapa bagi mereka yang memang sangat dikejar kejar waktu atau memiliki jadwal yang sangat singkat.
Travel Bogor Bandung Rental Mobil Bis
Neil Taylor said…
As discussed, if you have lost or forgotten the Spectrum email account password, you need to follow the Charter email password recovery process. Once you complete the steps of password recovery, you can use the new password to log into your Spectrum account and access your emails. Follow the below-mentioned steps to reset Charter password:

Full read: https://www.gethumanhelp.com/how-to-change-charter-spectrum-email-password/
komputer said…
mungkin sudah tahu beberapa trik internet dan komputer yang membuat aktivitas harianmu lebih efektif dan efisien.Memang terdapat banyak sekali trik internet dan komputer yang sangat mempengaruhi aktivitas harian kita di luar sana, namun tentu tidak semuanya sudah kita ketahui.
wisata komputer
When it comes to providing support on hp printer offline, Support Printer Help has experience in making it easy to solve customer problems. We can solve all kinds of printer related problems quickly and very easily. Whether it’s a driver related issue, software issues or any hardware changes required in their printer, we understand the customer’s point of view first and identify the exact problem and fix the problem without taking the printer out of the house.
murtaza said…
SNS Bank– Its special feature is that it offers an EU payment option. With the help of it, an individual can open a bank account anywhere in the EU and there is no requirement of a BSN.
Bank Account in the Netherlands
rohit said…
How To Open an Offshore Bank Account In 2022?- ODINT Consulting

Often foreigners have financial transactions in other countries, open an offshore bank account if you want an effective strategy to implement profits overseas and also save in trading.

Open Offshore Bank Account

Open Offshore Bank Account
rohit said…
Dubai Company Registration 2022 - 100% Remote Incorporation and Ownership

Your 100% dubai company registration guide- complete details, process, documentation. 100% ownership and remote incorporation

DUBAI Company registration

DUBAI Company registration
Anonymous said…
Difference between MOA and AOA with Comparison Chart- ODINT Consulting

The key deference between memorandum of association and articles of association are stated here with the help of a comparison chart and both MOA and AOA are defined as well.

Difference between MOA and AOA

Difference between MOA and AOA
shivam said…
An Employment Pass Singapore is provided to foreign experts who are employed by Singaporean companies and meet all of the eligibility requirements. The employer is the third party who has been appointed to apply on the candidate's behalf. As a result, employees are compensated with a wage of at least $4,500 per month. It's good for up to two years, or three years if you're renewing an employment permit. Medical insurance is not likely to be offered in this situation, as companies will have to determine whether or not to do so themselves.

Popular posts from this blog

Authenticating via Kerberos with Keycloak and Windows 2008 Active Directory

Fixing OpenVPN "Authenticate/Decrypt packet error: cipher final failed"