다음을 통해 공유


A Porter Stemmer in F#

 

Citing from the introduction to Martin Porter’s work [1]:

 Removing suffixes by automatic means is an operation which is especially
useful in the field of information retrieval.

Consequently, a stemmer is a (software) program to reduce inflected words to their root form (word stem). To derive a word’s root form a stemming program uses stemming algorithm(s). For example, a stemming algorithm might reduce the word catering to the stem cater. Notice, that different words may be reduced to the same stem, because they may be rooted in the same origin.

Potential stemming algorithms include dictionary-based approaches, i.e. looking up a word’s stem, or rule- and stochastic-based approaches [2]. Among the many stemming approaches, Porter’s variant is probably among of the most known ones. Reasons for this may be its availability in documentation and implementation format, and it’s natural appeal, i.e. one can immediately understand what the stemmer does by following Porter’s rules and provided examples.

For example, a rule dictates that any English word, which ends in “SSES” should be contracted to “SS”, i.e. the suffix commonly indicating plural, should be removed. Porter gives a handful of rules (30 or so) nicely arranged in a pipeline processing scheme, i.e. some rules are executed in a batch (named step) whenever the rules pre-condition is matched. Broadly there are 5 different steps, taking care of for example, plural, participle and noun suffixes.

Rule conditions govern whether or not a rule should be executed on some word(-fragment). Rules include:

  • Does a word contain vowels
  • Does a word end with a certain double consonant
  • Does a word end with a particular suffix
  • Does a word contain a sequence of vowel and consonant of length l

Whenever a rule is executed it either leads to a word contraction (removing a suffix), composition with an additional “e” or substitution of word parts (letters).

Equipped with this knowledge, I followed Porter’s paper and implemented the basic Porter stemming algorithm in F#. Since Porter did not only give the basic algorithm (rules), but also tests for it, I used a method resembling test-driven development to implement the stemmer. That is, using FsUnit, I implemented rule tests like:

 "caresses" |> porterStem |> should equal "caress"

to drive implementation. Clearly, in the beginning, such a test expecting the word “caresses” to be stemmed to the stem “caress”, would fail. However, after having implemented step 1 (the previously mentioned batch of processing rules) given by Porter the stemmer would pass the test. Following this methodology, I implemented each step after I took and coded Porter’s tests for that step. Since tests are localised, i.e. a test tests a particular rule or batch of rules, tests might interfere with one another, i.e. produce some result deemed incorrect by a test but perfectly correct with regard to the implementation.

The F# implementation makes use of pattern matching, active patterns and pipelining. For example, avoiding the actual implementation of steps at the moment, the stemming process a word goes through is defined by:

 word
|> step1A 
|> step1B
|> step1C
|> step2
|> step3
|> step4
|> step5A
|> step5B

Here, a word is simply a sequence of letters, which can be either consonant or vowel:

 type Symbol = char
///<Summary>
///defines a letter, which can be the empty letter
///or a vowel (Vow) or a consonant (Cons)
///</Summary>
type Letter = 
    | EmptyLetter
    | Vow of Symbol
    | Cons of Symbol

let isVowel (v:Symbol) = 
    v = 'a' || v = 'A' || 
    v = 'e' || v = 'E' || 
    v = 'i' || v = 'I' || 
    v = 'o' || v = 'O' || 
    v = 'u' || v = 'U'

let (|Consonant|Vowel|) (s:Symbol) = 
    match isVowel(s) with
    | true -> Vowel(s)
    | false -> Consonant(s)

Since the actual implementation is rather large, you may find the stemmers implementation on SkyDrive. I created a folder specifically for my F# explorations. Notice, that this is not a full-blown project but rather a dump for (F#) source files. As such, I do not take responsibility for correctness, while I may answer question, the code is free – in lieu of Porters FAQ statement.

In order to compile the stemmer, simply create a standard F# project. In order to run the tests, you need NUnit as advised for FsUnit. The F# project was created using Visual Studio 2010 and F# 2.0. I zipped up all necessary files, including:

  • The stemmer’s implementation
  • The tests
  • FsUnit
  • The research paper in txt form (see References)
  • A todo list

With that, happy stemming and read you soon.

F# Porter Stemmer SkyDrive link

Update:

  • 2011-06-25 (v0.2) : bug fixing and additional rules specified by Porter

Todo:

  • Optimisations already given by Porter
  • Documentation
  • F#isation

References

  1. M.F.Porter (1980): An Algorithm for suffix stripping
  2. Wikipedia: Stemming