Compartilhar via


Another project...

From the start of this semester I've been also working on what I call a "Name Normalization" Algorithm (or more generally "String Normalization"). What this is supposed to do is take an input of some number of strings and determine which ones are "the same". This sounds like a simple problem, but it really isn't. Take a look at Google's Britney Search Archive and you can quickly realize that this problem is huge. The idea of the algorithm is to be able to take in a set of strings and pick out the ones that are the same. So, for example, if your name is Rich, but I search for Richie I get all results for Rich as well. Granted this is one of the easier cases to handle. There are some screwballs in there... for example, November and December only vary by 3 characters. So it's not so easy to determine that these two strings aren't in the same "class".

In any case, I just finished about 1 hour ago the "Alpha Version 1.0" release. I tried it against the input from the Google Britney Search Archive, and I was able to get 588/592 matches without any problems. Actually those 4 that I missed were: breathny spears, betny spears, brethany spears, brethine spears. If you ask me, these are horrbily POOR spelling of Britney. Anyway, there is one parameter that I can tweak to include these 4 in my positive results too, so I think it's a pretty good Alpha release. Oh, I forgot to add that on average, it takes 3 to 5 milli-seconds to do a comparison. Not too shabby.

Please note that this was in no way, shape, or form re-engineered. This was entirely derived from various readings online, and some theories/concepts that I was thinking about. I justed used the Google data as a baseline check.

To return to the point... why did I call this "String Normalization". Well, actually the fundamental idea of this is normalizing functions on a graph. Here, you want to go through a whole bunch of strings (ie: functions) and normalize them to one common string (ie: one common axis spacing and min/max, typically [0,1]).

Comments

  • Anonymous
    May 04, 2003
    Any chance you'll make that code available? (Or is it already out there and I'm just missing it.)

    -Andrej Kyselica
  • Anonymous
    May 04, 2003
    I haven't posted source just yet since i also need to submit this for class. I'll consider it once classes are done this term (which is in about a week).
  • Anonymous
    May 04, 2003
    Awesome, this has tons of potential. Are you considering using this in your photo album for name searches?
  • Anonymous
    May 05, 2003
    Happy Birthday!
  • Anonymous
    May 05, 2003
    Thanks man...