Set Similarity and Min Hash
Given two sets S1, S2, find similarity(S1, S2) - based not hamming distance (not Euclidean).
Jaccard Measure
View sets at a bit-array. Indexes representing each possible element, and 1/0 representing presence/absence of the element in the set.
Then Jaccard measure =
What happens when: n element in each set from a possible universe u, s.t. n << u?
Ok, as long as just |S1 U S2| is not too large.
Implementation is straightforward (In C#)
class JaccardSimilarity
{
public static double Similarity<T>(HashSet<T> set1, HashSet<T> set2)
{
int intersectionCount = set1.Intersect(set2).Count();
int unionCount = set1.Union(set2).Count();
return (1.0 * intersectionCount) / unionCount;
}
}
Intersection: O(nlogn) with sort-merge join, or O(n) with a big constant using hash join.
Union: O(n), again with some overhead.
Space is also O(n) at best.
Hash similarity
Find a hash function sig (signature) such that sim(S1, S2) is approximated by sim(sig(s1), sig(s2))
Idea 1: Sample P elements from the universe, and define sig(S1) as bits for P elements (i.e reduce the sets to a random sample of the universe).
But problems with sparsity (n << u)
Idea 2: So don't count entries that are absent in both sets. E.g:
Four combinations:
A = 1, 1 (Element present in both sets)
B = 0, 1
C = 1, 0
D = 0, 0
Count: A / A+B+C
E1 | E2 | E3 | E4 | E5 | E6 | |
S1 | 1 | 1 | 0 | 0 | 0 | 0 |
S1 | 1 | 0 | 1 | 0 | 0 | 0 |
sim(S1, S2) = 1 / 3
Min Hash
Combine ideas 1 and 2.
Randomly permute element order (columns). Hash of S is element number of first element that is present in the bit-array.
Key insight: P(h(s1) = h(s2)) = jaccardsimilarity(s1, s2)
Why? Both measures are A/A+B+C
This about this probabilistically..
Too fragile with a single permutation. Create min-hash signature (instead of a single integer) using N random permutations.
Then mhsig(S) = list of n indexes where element is present h(S)
Now, similarity(S1, S2) using min-hash is: Fraction of permutations where mhsig(S1) = mhsig(S2)
Expectation of similarity now is same as jaccard similarity measure.
We still can not implement this efficiently! Luckily there are some nifty tricks..
Instead of permuting rows n times, use n different hash functions and the let hash index order provide the random permutation.
But where is the min part in min-hash?
Foreach(set S)
Foreach(hash function h)
Find elements with that are present in S.
Compute hash of the element index if element is present
Keep the hash with minimum index value
This will give you the index of the first 1 from a random permutation
Computation cost of min hash is clearly higher, then why bother?
Answer: MinHash gives you a small signature of a (potentially large) set which can be used for similarity testing.
Summary: Signature generation time may be large, but querying time is smaller and takes less space.
A common use is in Nearest Neighbor type problem, where there are S Sets, and you wish to find the nearest one (or k nearest).
Compare all signature-pairs? Hint: use Locality Sensitive Hashing to eliminate pairs with dissimilar signatures -- Maybe my next post.
Here is the code..
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Diagnostics;
namespace SetSimilarity
{
class MinHash
{
private const int m_numHashFunctions = 100; //Modify this parameter
private delegate int Hash(int index);
private Hash[] m_hashFunctions;
public MinHash(int universeSize)
{
Debug.Assert(universeSize > 0);
m_hashFunctions = new Hash[m_numHashFunctions];
Random r = new Random(11);
for (int i = 0; i < m_numHashFunctions; i++)
{
uint a = (uint)r.Next(universeSize);
uint b = (uint)r.Next(universeSize);
uint c = (uint)r.Next(universeSize);
m_hashFunctions[i] = x => QHash((uint)x, a, b, c, (uint)universeSize);
}
}
public double Similarity<T>(HashSet<T> set1, HashSet<T> set2)
{
Debug.Assert(set1.Count > 0 && set2.Count > 0);
int numSets = 2;
Dictionary<T, bool[]> bitMap = BuildBitMap(set1, set2);
int[,] minHashValues = GetMinHashSlots(numSets, m_numHashFunctions);
ComputeMinHashForSet(set1, 0, minHashValues, bitMap);
ComputeMinHashForSet(set2, 1, minHashValues, bitMap);
return ComputeSimilarityFromSignatures(minHashValues, m_numHashFunctions);
}
private void ComputeMinHashForSet<T>(HashSet<T> set, short setIndex, int[,] minHashValues, Dictionary<T, bool[]> bitArray)
{
int index = 0;
foreach (T element in bitArray.Keys)
{
for (int i = 0; i < m_numHashFunctions; i++)
{
if(set.Contains(element))
{
int hindex = m_hashFunctions[i](index);
if (hindex < minHashValues[setIndex, i])
{
minHashValues[setIndex, i] = hindex;
}
}
}
index++;
}
}
private static int[,] GetMinHashSlots(int numSets, int numHashFunctions)
{
int[,] minHashValues = new int[numSets, numHashFunctions];
for (int i = 0; i < numSets; i++)
{
for (int j = 0; j < numHashFunctions; j++)
{
minHashValues[i, j] = Int32.MaxValue;
}
}
return minHashValues;
}
private static int QHash(uint x, uint a, uint b, uint c, uint bound)
{
//Modify the hash family as per the size of possible elements in a Set
int hashValue = (int)((a * (x >> 4) + b * x + c) & 131071);
return Math.Abs(hashValue);
}
private static Dictionary<T, bool[]> BuildBitMap<T>(HashSet<T> set1, HashSet<T> set2)
{
Dictionary<T, bool[]> bitArray = new Dictionary<T, bool[]>();
foreach (T item in set1)
{
bitArray.Add(item, new bool[2] { true, false });
}
foreach (T item in set2)
{
bool[] value;
if (bitArray.TryGetValue(item, out value))
{
//item is present in set1
bitArray[item] = new bool[2] { true, true };
}
else
{
//item is not present in set1
bitArray.Add(item, new bool[2] { false, true });
}
}
return bitArray;
}
private static double ComputeSimilarityFromSignatures(int[,] minHashValues, int numHashFunctions)
{
int identicalMinHashes = 0;
for (int i = 0; i < numHashFunctions; i++)
{
if (minHashValues[0, i] == minHashValues[1, i])
{
identicalMinHashes++;
}
}
return (1.0 * identicalMinHashes) / numHashFunctions;
}
}
}
My code is public, feel free to copy and optimize.
(Rest based on lecture notes of Rajeev Motwani, which in turn is based on notes from Jeff Ulman)
Comments
Anonymous
January 31, 2009
Thank you so much, Sahil, for this code. It is indeed very useful. But I could not understand what 'x' means here: m_hashFunctions[i] = x => QHash((uint)x, a, b, c, (uint)universeSize); Where is 'x' getting initialized when it is passed to QHash function? Also, what is meant by '=>' operator? I shall be greatly thankful if somebody could please help. Regards, Sarika.Anonymous
June 13, 2012
m_hashFunctions[i] = x => QHash((uint)x, a, b, c, (uint)universeSize); I know this question is old, but in case anyone is still looking for an answer, the code right of the '=' symbol is a lambda expression. There is more information here: msdn.microsoft.com/.../bb397687.aspxAnonymous
September 22, 2012
Thank you for your code, I have some questions: I have found that the result is related to the universeSize and the m_numHashFunctions in the Constructor, so, I don't know how to set the value. RegardsAnonymous
February 02, 2013
It does not appear that the parameter "bound" is ever used in your QHash function? Should bound be removed as an input parameter from QHash, or is the current QHash function code incorrect and bound should be used somewhere?