MinHash for dummies
Duplicate document detection is becoming increasingly important for businesses that have huge collections of email, web pages and documents with multiple copies that may or may not be kept up to date.
MinHash is a fairly simple algorithm that from all my Googling has been explained very poorly in blogs or in the kind of mathematical terms that I forgot long ago. So in this article I will attempt to explain how MinHash works at a practical code level.
Before I start, please take a look at http://infolab.stanford.edu/~ullman/mmds/ch3.pdf. That document goes into a lot of theory, and was ultimately where my understanding on MinHash came from. Unfortunately it approaches the algorithm from a theoretical standpoint, but if I gloss over some aspect of the MinHash algorithm here, you will almost certainly find a fuller explanation in the PDF.
I'll also be using pseudo Java in these examples instead of traditional math. This means when I use terms like Set, I am referring to the group of classes that implement a Set in Java. Of course a set in Java shares a lot of properties with a set in maths, but keep in mind that this is written from a developers point of view.
MinHash is a fairly simple algorithm that from all my Googling has been explained very poorly in blogs or in the kind of mathematical terms that I forgot long ago. So in this article I will attempt to explain how MinHash works at a practical code level.
Before I start, please take a look at http://infolab.stanford.edu/~ullman/mmds/ch3.pdf. That document goes into a lot of theory, and was ultimately where my understanding on MinHash came from. Unfortunately it approaches the algorithm from a theoretical standpoint, but if I gloss over some aspect of the MinHash algorithm here, you will almost certainly find a fuller explanation in the PDF.
I'll also be using pseudo Java in these examples instead of traditional math. This means when I use terms like Set, I am referring to the group of classes that implement a Set in Java. Of course a set in Java shares a lot of properties with a set in maths, but keep in mind that this is written from a developers point of view.
What does it mean when we talk about document similarity?
The whole point of MinHash is to efficiently answer the question "How similar is this document to that document".
Before we can answer that question, we need to look at a more simple example of two sets of strings and determine how similar they are.
Here are our two sets:
Set a = new Set(["chair", "desk", "rug", "keyboard", "mouse"]);
Set b = new Set(["chair", "rug", "keyboard"]);
A measure of how similar these two sets are is known as the Jaccard Coefficient. It is calculated as number of common elements / (total number of elements - number of common elements).
In our case we have 3 common elements: "chair", "rug" and "keyboard". We have a total of 8 elements. So the Jaccard Coefficient = 3 / (8 - 3) = 0.6, or 60%. http://www.planetcalc.com/1664/ is an online calculator that you can play around with to determine the similarity of two sets.
Documents as sets
While a document can be though of as a giant set of words, we don't just break down a document into individual words, place them in a set and calculate the similarity, because that looses the importance of the order of the words. The sets
Set a = new Set(["I", "went", "to", "work", "today"]);
Set b = new Set(["today", "I", "went", "to", "work"]);
would be considered 100% similar, even though they clearly are not.
Instead we break a document down into what are known as shingles. Each shingle contains a set number of words, and a document is broken down into total words - single length + 1 number of shingles. So if a document contained a single sentence of "The quick brown fox jumps over the lazy dog", that would be broken down into the following 5 word long shingles :
- The quick brown fox jumps
- quick brown fox jumps over
- brown fox jumps over the
- fox jumps over the lazy
- jumps over the lazy dog
So our document as a set looks like:
Set a = new Set(["The quick brown fox jumps", "quick brown fox jumps over", "brown fox jumps over the", "fox jumps over the lazy", "jumps over the lazy dog"]);
These sets of shingles can then be compared for similarity using the Jaccard Coefficient.
Optimizing the process
So we now have a way to compare two documents for similarity, but it is not an efficient process. To find similar documents to document A in a directory of 10000 documents, we need compare each pair individually. This is obviously not going to scale well.
What we can do to reduce some cycles is compare sets of randomly selected shingles from two documents. So for a document that is 10000 words long, we break it down into 9996 shingles, and randomly select say 200 of those to represent the document. If the second document is 20000 words long, we boil that down from 19996 shingles to another set of 200 randomly selected shingles.
Now instead of matching 9996 shingles to 19996 other shingles, we are are comparing 200 shingles to another 200 shingles. We have reduced our workload at the expense of accuracy, but this is still good enough for most scenarios.
NOTE
So just how much accuracy did we sacrifice by comparing only a small subset of randomly selected shingles? I actually don't know, but I assume it sits on a bell curve where the number of random selections improves the accuracy to a point where more random selections won't provide any meaningful increase in accuracy. If you know the math behind this, please leave a comment.
Computing and saving the random shingle selections
So this is where MinHash comes in.
We could save the 200 randomly selected shingles, but storing variable length strings kinda sucks. It complicates database design and comparing strings is slow.
The clever part about MinHash is that it allows us to save integers instead of strings, and also takes the hassle out of randomly selecting shingles. It works like this.
1. Break down the document a set of shingles.
2. Calculate the hash value for every shingle.
3. Store the minimum hash value found in step 2.
4. Repeat steps 2 and 3 with different hash algorithms 199 more times to get a total of 200 min hash values.
At this point instead of 200 randomly selected shingles, we have 200 integer hash values. This is effectively a random selection of shingles, because a good hash algorithm is supposed to generate a number that is as likely to be large as it is likely to be small. This kind of distribution of hash codes is what hashing is all about, so selecting the smallest hash is, for all intents and purposes, a random selection of a shingle.
Um, 199 more hash algorithms? WTF!
So you see the String.hashCode() method, but where do these other 199 hash codes come from? This is the step that most explanations on the MinHash algorithm conveniently leave out.
The short answer is that you XOR the value returned by String.hashCode() with 199 random numbers to generate the 199 other hash code values. Just make sure that you are using the same 199 random numbers across all the documents.
If you want some more detail on why this works, see http://stackoverflow.com/a/19711615/157605.
Great, but I still have to compare every document to every other document. Can this be optimized?
This is where the second neat trick of MinHash comes in.
Locality sensitive hashing (LSH) involves generating a hash code such that similar items will tend to get similar hash codes. This is the opposite of what .hashCode() does.
LSH allows you to precompute a hash code that is then quickly and easily compared to another precomputed LSH hash code to determine if two objects should be compared in more detail or quickly discarded.
We don't actually calculate a LSH hash code as such, but the idea of a boiling down a complex object (like our collection of minhash codes) into something that is quickly and easily compared with other complex objects is still applicable.
What we do is take the collection of minhash values, and categorize them into bands and row. For the sake of simplicity we'll assume each document has 20 minhash values. We will break them down into 5 bands with 4 rows each. Also for the sake of simplicity we'll assume that .hashCode() (and any number XORed with it) results in a number between 0 and 9.
So conceptually the minhash values from 5 documents in the first of the 5 bands and its rows looks like this.
Band One |
|
What we are looking for are rows within a band that are the same between documents. Document One and Document Three in this case both have rows with values 1, 3, 6, 0. This tells us that these two documents should be compared for their similarity.
Although it is not shown here, you should look through Bands Two - Five looking for documents that share rows. Any documents that share rows in any bands should be compared for their similarity.
This is useful when you have a document, and you want to know which other documents to compare to it for similarity. I'll assume that the minhash values are stored in SQL, so to get out candidates for comparison you would write:
SELECT DocName
FROM Documents
WHERE (
BandOneMinHashAlgorithmOne = 1 AND
BandOneMinHashAlgorithmTwo = 3 AND
BandOneMinHashAlgorithmThree = 6 AND
BandOneMinHashAlgorithmFour = 0
) OR
(
) OR (
) OR (
) OR (
)
(
BandTwoMinHashAlgorithmOne = # AND
BandTwoMinHashAlgorithmTwo = # AND
BandTwoMinHashAlgorithmThree = # AND
BandTwoMinHashAlgorithmFour = #
BandThreeMinHashAlgorithmOne = # AND
BandThreeMinHashAlgorithmTwo = # AND
BandThreeMinHashAlgorithmThree = # AND
BandThreeMinHashAlgorithmFour = #
BandFourMinHashAlgorithmOne = # AND
BandFourMinHashAlgorithmTwo = # AND
BandFourMinHashAlgorithmThree = # AND
BandFourMinHashAlgorithmFour = #
BandFiveMinHashAlgorithmOne = # AND
BandFiveMinHashAlgorithmTwo = # AND
BandFiveMinHashAlgorithmThree = # AND
BandFiveMinHashAlgorithmFour = #
This explanation completely glosses over why this works and what combination of bands and rows you should construct when looking for similar documents. If you want to get into that detail, take a look at http://infolab.stanford.edu/~ullman/mmds/ch3.pdf.
Comments
Opening A Bank Account For Companies In Poland
The next issue is opening an account with a bank. This is a crucial step because it will facilitate your business’s operations as well as all of your transactions (including paying your employees and vendors).
Open a Bank Account in Singapore During Covid-19: Required Documentation
A B1 visa grants you to enter Poland for business purposes, for example, aggregate bartering, business bunch gatherings, counsel with business partners, and property settlement.
Read More: Poland PR Permit For Indian Entrepreneurs
Read Moore: PESEL For Foreigners
Benefits of Poland Permanent Residency
It combines commercial and residential property on a complete township that covers 60 acres of land with 800 units. Also, It has located near Dwarka Expressway, the most important role in delivering business growth.
Especially, It is a perception in a simplified form of one of the high street retail hubs. Their propose great variety in retail, Fun Zone, Food courts & service apartments. Thus, investing in these retail and office spaces can be a great option.
Read More: Type of Various Business forms in Poland
Read more: Register for Cryptocurrency Company In Poland
Read More: Office Cost in Poland
Best Web Designing Company in Hyderabad
Poland Social Security Benefits
Read More: Minimum Wage in Poland
1. Consequences: Tax avoidance leads to the deferment of tax liability. Tax evasion leads to penalty or imprisonment.
2. Objective: The objective of Tax avoidance is to reduce tax liability by applying the script of law whereas Tax evasion is done to reduce tax liability by exercising unfair means. Tax planning is done to reduce the liability of tax by applying the provision and moral of law.
3. Permissible: Tax planning and Tax avoidance are permissible whereas Tax evasion is not permissible.
Read More : Taxation for Delaware LLC to corporation conversion
Turkey company registration
Read More: Aim of Limited Partnership In Delaware
A consolidation could bring about a significantly bigger and all the more remarkable business, which is more interesting to expected investors and financial backers.
Read More: Delaware LLC Merger
Company registration in Poland
According to the Polish involvement exclusions, Polish businesses are not required to keep tax on profits if the payment receiver is based in some other EU country and receives the necessary taxes from its personnel across the globe. Investment income is subject to a 19% tax rate in Poland.
- Strategic geographical location
- Tax benefits
- Government grants
- Largest manufacturing centre
- Investments
- Various business opportunities
Read More: Delaware vs Florida LLC
Read More: Taxes In Poland
visit:https://www.affordablehousingsgurgaon.com/pareena-affordable-housing-sector-68-gurgaon/
Real Estate in Shadnagar
Washing Machine Repair in Noida
Home Cleaning services in Gurgaon
Spain Company Registration
Austria Company Registration
Sweden Company Registration
The publication fee will be charged to the time that they are paid when the Articles of Association are published in the Belgian Official Journal; around 250 euros.
Belgium Company Registration
Switzerland Company Registration
Ireland Company Registration
Holding Company in Ireland
Reasons to Invest in Ireland
Delaware Certificate of Incorporation
www.needleskart.com
techseeholic.com
virginia military divorce
Abogado Disputas Contratos Comerciales
I also worked in the field of realestate in which we provide new properties/apartment/flats. So, if you are looking for best real estate company in Gurugram then please visit our website.
It gives you the gift of healthy nature in an ultra-modern setting, and park yourself in luxury and cherish the surrounding nature."
These independent floors are designed to give you a redesigned view of the hill-facing Theme Based Park or waterbody that provides both a luxury lifestyle and peace in the lap of nature.
M3M Antalya Hills 79 offers 2.5 BHK & 3.5 BHK luxury apartments; each floor is open on two sides. It features independent floors with private space in the basement, and it offers a wide balcony to feel the fresh air blowing from the Aravalli hills!
At Antalya hills gurgaon, your elders get to relax their minds amid the greenery, away from the chaotic noise of the city. Your children can learn from the city while still enjoying themselves in the lap of nature. Here you can enjoy the constant cooling of public areas with the perfect balance of sunlight, shade, and natural ventilation.
There are exceptional amenities planned for every age group and every kind of you, “the calm, the social, the cheerful, and the sporty at M3M Antalya hills 79. Every pocket is equipped with a club with curated amenities to cater to the needs of all age groups. Now, the community lifestyle has become the new normal.
It offers amenities like a sports arena, multiple courts, Kid’s play area, a multi-tier security system, a swimming pool, private space in the basement and an AER lounge. M3M Antalya Hills floors are fully loaded with features like Modular Kitchen, Italian Marble, AC, False-ceiling, Wardrobe, TV Unit, Loaded Washrooms, etc. It also offers sustainable features “with the sustainable township and energy-efficient apartments, now save money on bills to spend on memories with your family.”
PREFER .
https://ondemandint.com/blog/company-act-in-germany/
radhe exchange demo
ABWhatsApp
Properties in Dwarka Expressway - It offers prime real estate in Gurgaon. Boasting proximity to Delhi, excellent connectivity, and future growth potential, it's a hotspot for investors seeking high-value properties.
Businesses and brands have started to understand the advantages of doing digital marketing with the aid of artificial intelligence. With the use of this revolutionary technology, many strategies of communicating with the intended audience may be developed. AI in digital marketing can be used in the following ways: Personalized Content: To provide personalized content, AI systems examine user behavior, demographics, and preferences.
Chatbots and Virtual Assistants: These artificial intelligence (AI)-driven tools automate client interactions and deliver prompt answers to their questions.
Predictive analytics: AI systems examine vast volumes of data to spot patterns and trends that may be used to forecast future market trends as well as consumer behavior and preferences.
Search Engine Optimization Services
Social Media Marketing Services
Pay Per Click Services
Content Marketing Services
Website Designing Services
Web Development Services
Magento Development Services
Wordpress Development Services
JAVA jobs in netherlands.
JAVA jobs in netherlands.
Sabyasachi Net Worth, Age, Bio, Wife, Income [2024 Updated]
For More Information Visit our Website -
Read More:- Stroke rehabilitation in Gurgaon
This is one of the extensively popular e-commerce business ways Get fast and professional support for all your problems with just one subscription. Once I was in an absolutely emergency situation. I hurriedly looked for someone to at least help me. Then I found this service and signed up. My specialist was called mobilexpressfix.com, and he solved my problem flawlessly. Thank you
The right place to learn to create websites and digital marketing strategies.Get fast and professional support for all your problems with just one subscription. Once I was in an absolutely emergency situation. I hurriedly looked for someone to at least help me. Then I found this service and signed up. My specialist was called with https://mobilexpressfix.com/ , Thanks for sharing this type of valuable content and he solved my problem flawlessly. Thank you.
Digital Marketing Course In Hyderabad
Digital Marketing Course In Hyderabad
is an advanced, exclusive, and outstanding township that bargains a generation involvement to its buyers. The apartments
in this project offer an elegant ambiance, with classy interiors and ample space. The large windows ensure plenty of
essential light and ventilation for the residents. The M3M Mansion Sector 113 Gurgaon has been developed smartly,
providing maximum utilizable area. The Project is providing 3.5BHK, & 4.5 BHK Apartments. You may Like your convenient
time playing the various indoor games, in the library, or the lavish clubhouse that the project comes with. Take a
balcony seat in the lap of nature with Pre-Launch Apartments at M3M Mansion Sector 113 Gurgaon. It helps you experience
the joy of stress-free living. This Project will amaze you is the considerate spread of greenery and water features.
study abroad consultants in Hyderabad