More info on my String Normalization algorithm

Starting last Friday, I ended up working on a StringNormalizationList data structure that encapsulates the StringNormalization class I was working on before. Essentially, this data structure lets you keep adding strings to it, after which you must Clean the structure. The cleaning process builds "a set of sets": a list of all the sets of words that the algorithm deems to be similar (or rather, the same). This data structure now lets me go through a long list of strings and "bin" them into the appropriate buckets. Definitely pretty useful.
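To make the add-then-Clean idea concrete, here is a rough Python sketch. It is not the actual StringNormalizationList code: the method names, the 0.8 threshold, the normalization rule, and the use of difflib as the similarity test are all assumptions made for illustration.

```python
import difflib


class StringNormalizationList:
    """Hypothetical sketch: collect strings, then clean() bins similar ones."""

    def __init__(self, threshold=0.8):
        self.threshold = threshold  # assumed similarity cutoff, not from the post
        self.items = []

    def add(self, s):
        self.items.append(s)

    @staticmethod
    def _normalize(s):
        # Stand-in for the StringNormalization class:
        # lowercase and keep only letters and digits.
        return "".join(ch for ch in s.lower() if ch.isalnum())

    def _similar(self, a, b):
        matcher = difflib.SequenceMatcher(
            None, self._normalize(a), self._normalize(b)
        )
        return matcher.ratio() >= self.threshold

    def clean(self):
        """Return the "set of sets" as a list of sets of similar strings."""
        buckets = []  # each bucket is a list; its first item is the representative
        for s in self.items:
            for bucket in buckets:
                # Greedy: compare only against the bucket's first member.
                if self._similar(s, bucket[0]):
                    bucket.append(s)
                    break
            else:
                buckets.append([s])
        return [set(b) for b in buckets]


snl = StringNormalizationList()
for name in ["Microsoft Corp.", "microsoft corp", "Apple", "apple!"]:
    snl.add(name)
groups = snl.clean()
```

Note that this greedy version compares every string against every existing bucket, so it gets slow on long lists.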

There are some optimizations that I'd like to make. In particular, I'd like to find some sort of hash function that maps similar string values to similar hash keys. I'll pose the question here... Does anyone know of such a hash function?
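To show the kind of function I mean, here is a sketch that uses the classic Soundex phonetic code as the key. This is only one candidate, not what my code does today, and it maps similar-sounding words to the same key rather than to merely nearby keys. The bucket_by_key helper is a made-up name for illustration.

```python
from collections import defaultdict


def soundex(word):
    """Classic American Soundex: first letter plus three digits."""
    codes = {}
    for letters, digit in (("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                           ("l", "4"), ("mn", "5"), ("r", "6")):
        for ch in letters:
            codes[ch] = digit

    chars = [ch for ch in word.lower() if ch.isalpha()]
    if not chars:
        return ""
    result = chars[0].upper()
    prev = codes.get(chars[0], "")
    for ch in chars[1:]:
        if ch in "hw":
            continue  # h and w do not separate letters with the same code
        digit = codes.get(ch, "")
        if digit and digit != prev:
            result += digit
        prev = digit  # a vowel resets prev, so a repeated code after it counts
    return (result + "000")[:4]  # pad or truncate to four characters


def bucket_by_key(strings, key=soundex):
    """Hypothetical helper: bin strings by a key function in one pass."""
    buckets = defaultdict(set)
    for s in strings:
        buckets[key(s)].add(s)
    return dict(buckets)


bins = bucket_by_key(["Robert", "Rupert", "Tymczak", "Ashcraft"])
```

Since the key always keeps the first letter, two strings that differ in their first character land in different bins, so this would at best serve as a cheap first pass before a finer comparison.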

Comments

  • Anonymous
    May 13, 2003
    How about SOUNDEX??
  • Anonymous
    May 13, 2003
    Look up Knuth's vol. 3.
  • Anonymous
    February 19, 2004
    Soundex internally implements the same algo!?