More info on my String Normalization algorithm
Starting last Friday, I ended up working on a StringNormalizationList data structure that encapsulates the StringNormalization class I was working on before. Essentially, this data structure lets you keep adding strings to it, and then you must Clean the structure. The cleaning process builds "a set of sets": a list of all the sets of words that the algorithm deems to be similar (or rather, the same). This data structure now lets me go through a long list of strings and "bin" them into the appropriate buckets. Definitely pretty useful.
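To make the add-then-Clean idea concrete, here is a rough Python sketch of how such a structure could look. It is not the actual StringNormalizationList code. The `is_similar` helper, the 0.8 threshold, and the method names `add` and `clean` are all assumptions made up for illustration, and the standard library's `difflib.SequenceMatcher` stands in for whatever comparison the StringNormalization class really does.

```python
from difflib import SequenceMatcher


def is_similar(a, b, threshold=0.8):
    """Hypothetical stand-in for the StringNormalization comparison.

    Treats two strings as "the same" when their case-insensitive
    similarity ratio reaches the (assumed) threshold.
    """
    return SequenceMatcher(None, a.lower(), b.lower()).ratio() >= threshold


class StringNormalizationList:
    """Collects strings, then bins them into sets of similar strings."""

    def __init__(self, similar=is_similar):
        self._strings = []
        self._similar = similar

    def add(self, s):
        # Adding is cheap; no comparison happens until clean() is called.
        self._strings.append(s)

    def clean(self):
        """Return the "set of sets" as a list of sets of similar strings."""
        buckets = []  # list of (representative, members) pairs
        for s in self._strings:
            for representative, members in buckets:
                if self._similar(s, representative):
                    members.add(s)
                    break
            else:
                # No existing bucket matched, so s starts a new one.
                buckets.append((s, {s}))
        return [members for _, members in buckets]


names = StringNormalizationList()
for name in ["Microsoft", "microsoft", "Micro soft", "Apple", "Aple"]:
    names.add(name)
bins = names.clean()
```

Each new string is compared only against the first member of each bucket, so the cost grows with the number of buckets rather than the number of strings. That is still a lot of comparisons on a long list, which is where the optimizations come in.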
There are some optimizations that I'd like to do. In particular I'd like to find some sort of hash function that hashes similar string values to similar hash keys. I'll pose the question here... Does anyone know of such a hash function?
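For a sense of what such a function might look like, phonetic codes are one well-known family: they map words that sound alike to the same short key. Below is a minimal Python sketch of the classic American Soundex rules. It is only an illustration of the idea, not something the StringNormalization code uses, and the helper name `soundex` is my own. Note that it gives similar-sounding strings identical keys rather than merely "similar" ones, and it knows nothing about typos that change how a word sounds.

```python
# Consonant groups used by classic American Soundex.
_CODES = {}
for _digit, _letters in enumerate(
    ("BFPV", "CGJKQSXZ", "DT", "L", "MN", "R"), start=1
):
    for _ch in _letters:
        _CODES[_ch] = str(_digit)


def soundex(word):
    """Return the 4-character Soundex key for word ('' if it has no letters)."""
    letters = [c for c in word.upper() if "A" <= c <= "Z"]
    if not letters:
        return ""
    first = letters[0]
    key = [first]
    prev = _CODES.get(first, "")
    for c in letters[1:]:
        if c in "HW":
            # H and W are skipped and do not separate repeated codes.
            continue
        code = _CODES.get(c, "")
        if code and code != prev:
            key.append(code)
        # Vowels (and Y) have no code, so they reset prev and
        # allow the same code to be emitted again afterwards.
        prev = code
    # Pad with zeros, then truncate to one letter plus three digits.
    return ("".join(key) + "000")[:4]
```

With a key like this, strings could be grouped by key in a dictionary first, and the more expensive similarity check run only within each group.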
Comments
- Anonymous
May 13, 2003
How about SOUNDEX??
- Anonymous
May 13, 2003
Look up Knuth's vol. 3.
- Anonymous
February 19, 2004
Soundex internally implements the same algo!?