Condividi tramite


F#-Querying WordNet Online

 

Todays post focuses on using F# to query Princeton’s WordNet Online service for information about some word. According to WordNet’s home:

WordNet® is a large lexical database of English. Nouns, verbs, adjectives and adverbs are grouped into sets of cognitive synonyms (synsets), each expressing a distinct concept. Synsets are interlinked by means of conceptual-semantic and lexical relations. The resulting network of meaningfully related words and concepts can be navigated with the browser. WordNet is also freely and publicly available for download. WordNet's structure makes it a useful tool for computational linguistics and natural language processing.

Traditionally, WordNet could be accessed from Prolog or via API abstractions of Prolog’s database database means. However, (more or less) recently the ability to query WordNet online was provided. Since I am aiming at gaining confidence with F#, I tried to make the wonderful world of SynSet approachable via F#.

In order to use WordNet Online there will be the following steps:

  1. Initialize and query user about what word she wants to ask WordNet
  2. Ask WordNet about the user word
  3. Process WordNet’s answer

In order to realise points 1 – 3, we’ll develop a F# console application, which reads the console to obtain a word, make a web request to WordNet online to ask about the word’s SynSets. The result will be a HTML page, which will be processed using the HTMLAgilityPack, which is a CodePlex hosted project, with the following purpose:

This is an agile HTML parser that builds a read/write DOM and supports plain XPATH or XSLT (you actually don't HAVE to understand XPATH nor XSLT to use it, don't worry...). It is a .NET code library that allows you to parse "out of the web" HTML files. The parser is very tolerant with "real world" malformed HTML. The object model is very similar to what proposes System.Xml, but for HTML documents (or streams).

Within the realm of html parsing, we’ll use F# pattern and regular expression matching. Notice that, this is merely a prototypical solution and as such, many improvements can potentially be made to it, including:

  • Stability and Robustness
    I do not claim, that the solution handles all WordNet answers correctly. I will do some more testing and tweaking but for now, it seams to work.
  • Performance
    I implemented a synchronous pipeline model. So you might take more advantage of asynchronous workflows.
  • Completeness
    There are part of speech (adverb) which are currently not handled, by I intent to add those.
  • Understandability
    I tried to stay within the F# idioms but, I assume some processing still bears imperative looks.

Now, to give you a feel, for what I intent to do, the following is a picture of WordNet Online being asked about the word “dog”.

WordNet

What I intent to do is to represent this information using an object SynSet having a SynType (its part of speech) like noun, etc. associated example sentences, and so on. Also, as you can see, WordNet Online allows us to specify certain query options, such as whether or not we want to be shown example sentences (e.g., “the dog barked all night”). Clearly, representing option information will be helpful too. For that, I defined the following types and some pattern match expression within their own module FWordNetTypes.

 module FWordNetTypes

open System.Text
open System.Text.RegularExpressions

open HtmlAgilityPack

type WordNetOptions = 
    { 
        ShowExampleSentences:bool;
        ShowGlosses:bool;
        ShowFrequencyCounts:bool;
        ShowDatabaseLocations:bool;
        ShowLexicalFileInfo:bool;
        ShowLexicalFileNumbers:bool;
        ShowSenseKeys:bool; 
        ShowSenseNumbers:bool
    }
    member this.BuildWordNetOptionsString() = 
        let optionString o = ( if o = true then "1" else System.String.Empty )
        let x = ref ( StringBuilder() )
        x := (!x).Append(
            System.String.Format("o0={0}&o1={1}&o2={2}&o3={3}&o4={4}&o5={5}&o6={6}&o7={7}&o8=1"
                , (optionString this.ShowExampleSentences)
                , (optionString this.ShowGlosses)
                , (optionString this.ShowFrequencyCounts)
                , (optionString this.ShowDatabaseLocations)
                , (optionString this.ShowLexicalFileInfo)
                , (optionString this.ShowLexicalFileNumbers)
                , (optionString this.ShowSenseKeys)
                , (optionString this.ShowSenseNumbers)
            )
        )
        (!x).ToString()

type SynType =
    | SynNoun
    | SynVerb
    | SynAdjective 
    | SynNone

type SynSetWord = 
    { 
        Word:string; 
        SenseKey:string; 
        SenseNumber:string
    }
    override this.ToString() = 
        this.Word + "#" + this.SenseNumber + "("  + this.SenseKey + ")"

let SynTypeToString s =
    match s with
        | SynNoun -> "Noun"
        | SynVerb -> "Verb"
        | SynAdjective -> "Adjective"
        | _ -> "None" 

type SynSet = 
    { 
        SType:SynType; 
        LexicalFileInfo:string;
        LexicalFileNumber:string;
        SynWords:seq<SynSetWord>; 
        SynGlos:seq<string>; 
        SynExampleSentences:seq<string>; 
        FrequencyCount:int; 
        DatabaseLocation:string; 
    }
    override this.ToString() = 
        let sb = ref (StringBuilder())
        sb := (!sb).Append("SynType          : " + (SynTypeToString this.SType) + "\r\n")
        sb := (!sb).Append("SynWord(s)       :\r\n")
        this.SynWords |> Seq.iter(fun s -> sb := (!sb).Append("[+]" + s.ToString() + ";\r\n"))

        sb := (!sb).Append("SynGlos:\r\n")
        this.SynGlos |> Seq.iter(fun s -> sb := (!sb).Append("[+]" + s + ";\r\n"))
        
        sb := (!sb).Append("SynExample(s)    :\r\n")
        this.SynExampleSentences |> Seq.iter(fun s -> sb := (!sb).Append("[+]" + s + ";\r\n"))

        sb := (!sb).Append("Frequency        : " + this.FrequencyCount.ToString() + "\r\n")
        sb := (!sb).Append("DB Location      : " + this.DatabaseLocation + "\r\n")
        sb := (!sb).Append("Lexical File Info: " + this.LexicalFileInfo + "\r\n")
        sb := (!sb).Append("Lexical File No. : " + this.LexicalFileNumber)

        (!sb).ToString()

let EmptySynSet = 
    { 
        SType = SynNone; 
        SynWords = Seq.empty ; 
        SynGlos = Seq.empty ; 
        SynExampleSentences = Seq.empty ; 
        FrequencyCount = -1; 
        DatabaseLocation = System.String.Empty; 
        LexicalFileInfo = System.String.Empty; 
        LexicalFileNumber = System.String.Empty; 
    }

let (|Noun|Verb|None|Adjective|) (v:string) =
    match v.ToLowerInvariant().Trim() with
        | "(n)" -> Noun
        | "(v)" -> Verb
        | "(adj)" -> Adjective
        | _ -> None

let (|POS|SynWord|SynWordDesc|SynIndicator|SynExample|SynGlos|) (v:HtmlNode, word:string) =
    if (v.Name.ToLowerInvariant() = "a" && 
        v.Attributes.Contains("class") && 
        v.Attributes.Item("class").Value.ToLowerInvariant() = "pos") then 
        POS(v.InnerHtml.Trim())
    elif (v.Name.ToLowerInvariant() = "b" || 
         (v.InnerHtml.Contains(word) && v.Name.ToLowerInvariant() <> "i" ) ) || 
         (
            v.Name.ToLowerInvariant() = "a" && 
            v.Attributes.Contains("href") && 
            v.InnerHtml.Contains("S:") <> true
         ) then 
        let processedInner = v.InnerHtml.Trim()
        let senseKeyNumberRegEx = Regex("(?<SenseWord>.+)\#(?<SenseNumber>\d+).*\((?<SenseKey>.*)\)")
        let matches = senseKeyNumberRegEx.Match(processedInner)
        if matches.Success then
            let senseWord  = matches.Groups.Item("SenseWord").Value;
            let senseKey = matches.Groups.Item("SenseKey").Value;
            let senseNumber = matches.Groups.Item("SenseNumber").Value;
            
            SynWord( 
                { 
                    Word = senseWord; 
                    SenseKey = senseKey; 
                    SenseNumber = senseNumber; 
                }
            )
        else
            SynWord( 
                { 
                    Word = processedInner ; 
                    SenseKey = System.String.Empty; 
                    SenseNumber = System.String.Empty; 
                } 
            )
    elif ( v.Name.ToLowerInvariant() = "a" && 
           v.Attributes.Contains("href") && 
           v.InnerHtml.Contains("S:")) then 
        SynIndicator(v.InnerHtml.Trim())
    elif (v.Name.ToLowerInvariant() = "i" &&
          v.InnerHtml.StartsWith("\"") && 
          v.InnerHtml.EndsWith("\"")) then 
        SynExample(v.InnerHtml.Trim())
    elif (v.InnerHtml.Trim().StartsWith("(") && 
          v.InnerHtml.Trim().EndsWith(")") ) then 
        SynGlos(v.InnerHtml.Replace("(", "").Replace(")", "").Trim())
    else 
        SynWordDesc(v.InnerHtml.Trim())

As you can see, WordNetOptions, SynSetWord and SynType are the main types to handle WordNet Online options, SynSets and words as part of a SynSet. The rest is F# patterns to make the parsing of WordNet HTML answers more intuitive and readable. For example, instead of dealing with strings of the form “(v)”, I want to be able to ask about part of speech. Or instead of decomposing the answer in terms of HTML nodes, I want to derive decision logic on the basis of words, glosses and example sentences, i.e. problem domain entities. As mentioned, I do not claim this to be the ideal solution, however, I want to stress the importance of decomposing a problem within the problem domain, which is more often than not more suitable then expressing it within the realm of technical terms.

Clearly, patterns abstracting HTML node patterns were even more powerful, i.e. instead of evaluating node names and attributes all over the place, why not have a pattern such as BoldNode or AnchorNode. I plan to incorporate this in the near term.

Equipped with the types, we want to fill via instantiation, we can decompose the problem flow into:

  1. Get the user’s word from command-line
  2. Check whether the word is a word (not done here) and it is not a stop word, i.e. a word which has not entry in the WordNet database, because it does not express a concept, has no SynSet, etc.
  3. Set query options, such a retrieve example sentences or not
  4. Ask WordNet
  5. Process WordNet’s answer
  6. Display potential answer

These 6 steps are shown in the following:

 WriteAsHeader "Ask WordNet ..."

printfn "enter a word: "

let stopwords = [|"a"; "the"; "these"; "this"; "those"; "them"; "their" |]
let isStopWord = 
    fun word -> 
        stopwords |> Seq.exists(fun s -> s.ToLowerInvariant() = word)

let word = System.Console.ReadLine().ToLowerInvariant()

if (isStopWord word) then printfn "word is a stop word ... exit"
elif (System.String.IsNullOrEmpty(word)) then printfn "Cannot look for empty word ... exit"
else 
    let options = { 
        ShowExampleSentences    = true; 
        ShowGlosses             = true; 
        ShowFrequencyCounts     = true; 
        ShowDatabaseLocations   = true;
        ShowLexicalFileInfo     = true;
        ShowLexicalFileNumbers  = true;
        ShowSenseKeys           = true;
        ShowSenseNumbers        = true; 
    }
    let showErrors = false
    let answer = askWordNet word options 
    if (System.String.IsNullOrEmpty(answer) <> true) then 
        printfn "obtained answer from WordNet ..."
        answer |> saveAnswerHTML word
        let processedAnswer = processWordNetAnswer word answer options showErrors
        match processedAnswer with 
            | Some(answer)  ->  
                printfn "WordNet said ... :"
                answer  
                    |> Seq.iter( fun k -> WriteAsHeader (k.ToString())) 
            | _ -> ()
    else
        printfn "failed to obtain answer from WordNet ..."

printfn "press <any> key to exit" 
System.Console.ReadLine() |> ignore

As you can see, the main processing is rather small. The stopwords array would clearly not be constant but expanded and stored elsewhere. WriteAsHeader is a simple function, which encloses some string within two horizontal bars (see bottom FWordNetHelper module). saveAnswerHTML will simply dump and HTML response in a file in the local file system, which can be helpful in offline testing mode.

The askWordNet function is a simple wrapper around:

  1. Constructing an HTTP request URL from the user’s word and WordNet’s processing options
  2. Invoking a web request via .NET means (i.e., System.Net.HttpWebRequest)
  3. Getting the request’s response
  4. Outputting the response as a string
 let askWordNet word (option:WordNetOptions) =
    let url = "https://wordnetweb.princeton.edu/perl/webwn"
    if (System.String.IsNullOrEmpty(word) = true) then failwith "word cannot be null or empty"
    let wordneturl = url + (buildRequestString word option)

    printfn "requesting: %s" wordneturl

    let rq = System.Net.HttpWebRequest.Create(wordneturl)
    let resp = ref ""
    
    try
        using (new System.IO.StreamReader(rq.GetResponse().GetResponseStream())) ( fun s -> 
            resp := s.ReadToEnd()
        )
    with
        | :? System.Net.WebException -> ( printfn "failed to connect to \"%s\"\r\n" url )

    !resp

The programs heavy-lifting is done inside the processWordNetAnswer , which takes the user’s word, the WordNet options and the request’s response (as a string), to do the following:

  1. Create an HTML document using the HtmlAgilityPack
  2. Scrape the HTML document, so that a list of SynSet entries will be constructed for the user’s word
  3. Return the constructed list
 let processWordNetAnswer (word:string) answer options showErrors = 
    let doc = new HtmlDocument()
    
    doc.LoadHtml(answer)
    printfn "Html Parse Errors ... "
    if showErrors then 
        doc.ParseErrors 
            |> Seq.iter(fun err -> 
                (printfn "%d %d %s %s" err.Line err.LinePosition err.Reason err.SourceText) 
            )
    
    let synList = List<SynSet>()

    (FindListItemNode doc.DocumentNode "li")
        |> Seq.iter(fun liNode -> 
            let freqRegex = Regex("^(\((?<Frequency>\d+)\))?" + 
                                  "(\{(?<DatabaseLocation>\d+)\})?" + 
                                  "(\x20&lt;(?<FileInfo>.+\..+)&gt;)?" +
                                  "(\[(?<FileNumber>\d+)\])?(\x20)?<a")

            let rMatch = freqRegex.Match(liNode.InnerHtml)
            let mutable freq = -1;
            let mutable dbloc = System.String.Empty
            let mutable fileInfo = System.String.Empty
            let mutable fileNumber = System.String.Empty

            if rMatch.Success then 

                let freqStr = rMatch.Groups.Item("Frequency").Value
                let locStr = rMatch.Groups.Item("DatabaseLocation").Value
                let fileInfoStr = rMatch.Groups.Item("FileInfo").Value
                let fileNoStr = rMatch.Groups.Item("FileNumber").Value

                if (System.String.IsNullOrEmpty(freqStr) = false) then  freq <- System.Int32.Parse(freqStr) ;
                if (System.String.IsNullOrEmpty(locStr) = false) then dbloc <- locStr
                if (System.String.IsNullOrEmpty(fileInfoStr) = false) then fileInfo <- fileInfoStr
                if (System.String.IsNullOrEmpty(fileNoStr) = false) then fileNumber <- fileNoStr

            let xSyn = ref { 
                EmptySynSet with 
                    FrequencyCount = freq; 
                    DatabaseLocation = dbloc; 
                    LexicalFileInfo = fileInfo ; 
                    LexicalFileNumber = fileNumber; 
            }
            
            liNode.ChildNodes 
                |> Seq.iter(fun node -> xSyn := ( FillSynSet node !xSyn word) )
            
            synList.Add( sanitiseSynSet !xSyn )
        )
    
    if synList.Count <> 0 then Some(synList)
    else None

Using HtmlAgilityPack proves extremely beneficial to decompose the HTML response, into WordNet relevant atoms. FindListItemNode is a simple recursive HTML document traversing function which seeks to find the list item nodes within the html document. FillSynSet uses the power of F# pattern matching to fill a SynSet from information found in the reponses list items.

 let FillSynSet (node:HtmlNode) (syn:SynSet) (word:string) = 
    match (node, word) with
        | POS(n) -> 
            match n with
                | Noun -> { syn with SType = SynNoun; }
                | Verb -> { syn with SType = SynVerb; }
                | Adjective -> { syn with SType = SynAdjective; }
                | _ -> ( failwith ("Error: Unrecognized POS \"" + n.ToString() + "\"!") )
        | SynExample(e) -> 
            (
                let exampleSentences = 
                    ( e.Split([|";"|], System.StringSplitOptions.RemoveEmptyEntries) 
                        |> Seq.map( fun entry -> entry.Trim() ) 
                    )
                { syn with SynExampleSentences = ( Seq.append syn.SynExampleSentences exampleSentences ) ; } 
            )
        | SynWord(n) -> { syn with SynWords = (Seq.append syn.SynWords [n]) ; }
        | SynGlos(g) -> { syn with SynGlos = ( Seq.append syn.SynGlos [g] ); }
        | SynIndicator(n) | SynWordDesc (n) -> syn

let FindListItemNode (startNode:HtmlNode) soughtNodeName = 
    let col = new List<HtmlNode>()

    let rec FindListItemNode_t (startNode:HtmlNode) soughtNodeName currentCollection = 
        if startNode = null then ()
        else
            if startNode.Name = soughtNodeName then ( col.Add(startNode) ) 
            else 
                startNode.ChildNodes 
                    |> Seq.iter(fun n -> (FindListItemNode_t n soughtNodeName col))

    FindListItemNode_t startNode soughtNodeName col
    col

In order to have some data cleansing, the sanitiseSynSet function takes a SynSet and cleans it from potentially unwanted artefacts (not done extensively). As you can see, having a simple WordNet scraper is indeed not that hard, all left are mentioned helper functions, which depending on your visualisation needs can be dropped:

 module FWordNetHelper

open FWordNetTypes

let saveAnswerHTML (word:string) (answer:string) = 
    let time = System.DateTime.Now.Ticks.ToString()
    let fileName = System.String.Format(@".\{0}_{1}.html"
        , word
        , time
    )

    using (System.IO.File.CreateText(fileName)) (fun f ->
        f.Write(answer)
        f.Flush()
    )

let buildRequestString (word:string) (option:WordNetOptions) =
    let str = System.String.Format("?s={0}&sub=Search+WordNet&{1}"
        , word
        , ( option.BuildWordNetOptionsString() ) 
    )
    str

let WriteAsHeader s = 
    printfn "______________________________________________________\r\n"
    printfn "%s" s
    printfn "______________________________________________________\r\n"

 

From that, we might actually finish our little program by adding necessary namespace and module declarations:

 module FWordNetOnline

open System.Text
open System.Text.RegularExpressions
open System.Net
open System.IO
open System.Collections.Generic

open FWordNetTypes
open FWordNetHelper

open HtmlAgilityPack

and ask WordNet about the known word “Dog”:

AskWordNet

which, if all goes well, will answer with the following (partial) processed response:

WordNetResponse

Equipped with that dear reader, a happy exploration of the wonderful world of words using WordNet Online. In case of comments, improvements, please do not hesitate to comment – as only the ignorant is immune to learning.

Comments

  • Anonymous
    June 11, 2011
    As an addition to this post, here are two interesting services:Image-Net: http://www.image-net.org/indexImageNet is an image database organized according to the WordNet hierarchy (currently only the nouns), in which each node of the hierarchy is depicted by hundreds and thousands of images. Currently we have an average of over five hundred images per node. We hope ImageNet will become a useful resource for researchers, educators, students and all of you who share our passion for pictures.WordNik: http://www.wordnik.com/Our goal is to show you as much information as possible, as fast as we can find it, for every word in English, and to give you a place where you can make your own opinions about words known.Traditional dictionaries make you wait until they've found what they consider to be "enough" information about a word before they will show it to you. Wordnik knows you don't want to wait—if you're interested in a word, we're interested too!