Duplicate document detection is becoming increasingly important for businesses that have huge collections of email, web pages and documents with multiple copies that may or may not be kept up to date. MinHash is a fairly simple algorithm that from all my Googling has been explained very poorly in blogs or in the kind of mathematical terms that I forgot long ago. So in this article I will attempt to explain how MinHash works at a practical code level. Before I start, please take a look at http://infolab.stanford.edu/~ullman/mmds/ch3.pdf . That document goes into a lot of theory, and was ultimately where my understanding on MinHash came from. Unfortunately it approaches the algorithm from a theoretical standpoint, but if I gloss over some aspect of the MinHash algorithm here, you will almost certainly find a fuller explanation in the PDF. I'll also be using pseudo Java in these examples instead of traditional math. This means when I use terms like Set, I am referring to the gr...
Comments
Digital Marketing Course In Hyderabad