F#-Querying WordNet Online
Today's post focuses on using F# to query Princeton’s WordNet Online service for information about a word. According to WordNet’s home page:
WordNet® is a large lexical database of English. Nouns, verbs, adjectives and adverbs are grouped into sets of cognitive synonyms (synsets), each expressing a distinct concept. Synsets are interlinked by means of conceptual-semantic and lexical relations. The resulting network of meaningfully related words and concepts can be navigated with the browser. WordNet is also freely and publicly available for download. WordNet's structure makes it a useful tool for computational linguistics and natural language processing.
Traditionally, WordNet could be accessed from Prolog or via API abstractions of Prolog’s database facilities. However, (more or less) recently the ability to query WordNet online was provided. Since I am aiming at gaining confidence with F#, I tried to make the wonderful world of SynSets approachable via F#.
Using WordNet Online involves the following steps:
- Initialize and ask the user which word she wants to look up in WordNet
- Ask WordNet about the user word
- Process WordNet’s answer
To realise these three steps, we’ll develop an F# console application which reads a word from the console and makes a web request to WordNet Online to ask about the word’s SynSets. The result is an HTML page, which will be processed using the HtmlAgilityPack, a CodePlex-hosted project with the following purpose:
This is an agile HTML parser that builds a read/write DOM and supports plain XPATH or XSLT (you actually don't HAVE to understand XPATH nor XSLT to use it, don't worry...). It is a .NET code library that allows you to parse "out of the web" HTML files. The parser is very tolerant with "real world" malformed HTML. The object model is very similar to what proposes System.Xml, but for HTML documents (or streams).
For the HTML parsing, we’ll use F# pattern matching and regular expressions. Note that this is merely a prototypical solution and, as such, many improvements can be made to it, including:
- Stability and Robustness
I do not claim that the solution handles all WordNet answers correctly. I will do some more testing and tweaking, but for now it seems to work.
- Performance
I implemented a synchronous pipeline model, so the solution could take more advantage of asynchronous workflows.
- Completeness
There are parts of speech (e.g., adverbs) which are currently not handled, but I intend to add those.
- Understandability
I tried to stay within the F# idioms, but I assume some processing still looks imperative.
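The performance point above can be explored with F# asynchronous workflows. The following is only a minimal sketch under stated assumptions: fetchAll and fakeFetch are hypothetical names that do not appear in the program below, and the download function is passed in as a parameter so that the pipeline can be tried without any network access.

```fsharp
// Minimal sketch: look up several words concurrently with async workflows.
// `fetch` is a parameter so the pipeline can be exercised without the network.
let fetchAll (fetch: string -> Async<string>) (words: string list) : (string * string) list =
    words
    |> List.map (fun w -> async {
        let! html = fetch w
        return w, html })
    |> Async.Parallel          // results keep the order of the input list
    |> Async.RunSynchronously
    |> List.ofArray

// Hypothetical stand-in for a real download function (assumption, not WordNet).
let fakeFetch (word: string) = async {
    do! Async.Sleep 10
    return sprintf "<html>%s</html>" word }

let results = fetchAll fakeFetch [ "dog"; "cat" ]
results |> List.iter (fun (w, html) -> printfn "%s -> %d characters" w html.Length)
```

In a real run, fetch could wrap an asynchronous web request; whether that pays off depends on how many words are looked up at once.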
Now, to give you a feel for what I intend to do, the following is a picture of WordNet Online being asked about the word “dog”.
What I intend to do is to represent this information using an object SynSet having a SynType (its part of speech, such as noun), associated example sentences, and so on. Also, as you can see, WordNet Online allows us to specify certain query options, such as whether or not we want to be shown example sentences (e.g., “the dog barked all night”). Clearly, representing option information will be helpful too. For that, I defined the following types and some pattern matching expressions within their own module FWordNetTypes.
module FWordNetTypes

open System.Text
open System.Text.RegularExpressions
open HtmlAgilityPack

// Display options of WordNet Online, sent as the URL parameters o0..o7.
type WordNetOptions =
    {
        ShowExampleSentences:bool;
        ShowGlosses:bool;
        ShowFrequencyCounts:bool;
        ShowDatabaseLocations:bool;
        ShowLexicalFileInfo:bool;
        ShowLexicalFileNumbers:bool;
        ShowSenseKeys:bool;
        ShowSenseNumbers:bool
    }
    member this.BuildWordNetOptionsString() =
        // A set option is sent as "1", an unset option as an empty value.
        let optionString o = if o then "1" else System.String.Empty
        System.String.Format("o0={0}&o1={1}&o2={2}&o3={3}&o4={4}&o5={5}&o6={6}&o7={7}&o8=1",
                             optionString this.ShowExampleSentences,
                             optionString this.ShowGlosses,
                             optionString this.ShowFrequencyCounts,
                             optionString this.ShowDatabaseLocations,
                             optionString this.ShowLexicalFileInfo,
                             optionString this.ShowLexicalFileNumbers,
                             optionString this.ShowSenseKeys,
                             optionString this.ShowSenseNumbers)

// Part of speech of a SynSet.
type SynType =
    | SynNoun
    | SynVerb
    | SynAdjective
    | SynNone

// A single word of a SynSet together with its sense information.
type SynSetWord =
    {
        Word:string;
        SenseKey:string;
        SenseNumber:string
    }
    override this.ToString() =
        this.Word + "#" + this.SenseNumber + "(" + this.SenseKey + ")"

let SynTypeToString s =
    match s with
    | SynNoun -> "Noun"
    | SynVerb -> "Verb"
    | SynAdjective -> "Adjective"
    | _ -> "None"

type SynSet =
    {
        SType:SynType;
        LexicalFileInfo:string;
        LexicalFileNumber:string;
        SynWords:seq<SynSetWord>;
        SynGlos:seq<string>;
        SynExampleSentences:seq<string>;
        FrequencyCount:int;
        DatabaseLocation:string;
    }
    override this.ToString() =
        // StringBuilder is mutable itself, so no ref cell is needed.
        let sb = StringBuilder()
        sb.Append("SynType : " + (SynTypeToString this.SType) + "\r\n") |> ignore
        sb.Append("SynWord(s) :\r\n") |> ignore
        this.SynWords |> Seq.iter(fun s -> sb.Append("[+]" + s.ToString() + ";\r\n") |> ignore)
        sb.Append("SynGlos:\r\n") |> ignore
        this.SynGlos |> Seq.iter(fun s -> sb.Append("[+]" + s + ";\r\n") |> ignore)
        sb.Append("SynExample(s) :\r\n") |> ignore
        this.SynExampleSentences |> Seq.iter(fun s -> sb.Append("[+]" + s + ";\r\n") |> ignore)
        sb.Append("Frequency : " + this.FrequencyCount.ToString() + "\r\n") |> ignore
        sb.Append("DB Location : " + this.DatabaseLocation + "\r\n") |> ignore
        sb.Append("Lexical File Info: " + this.LexicalFileInfo + "\r\n") |> ignore
        sb.Append("Lexical File No. : " + this.LexicalFileNumber) |> ignore
        sb.ToString()

// Starting point for record copies; -1 marks an unknown frequency.
let EmptySynSet =
    {
        SType = SynNone;
        SynWords = Seq.empty ;
        SynGlos = Seq.empty ;
        SynExampleSentences = Seq.empty ;
        FrequencyCount = -1;
        DatabaseLocation = System.String.Empty;
        LexicalFileInfo = System.String.Empty;
        LexicalFileNumber = System.String.Empty;
    }

// Classifies WordNet's part-of-speech markers such as "(n)".
let (|Noun|Verb|None|Adjective|) (v:string) =
    match v.ToLowerInvariant().Trim() with
    | "(n)" -> Noun
    | "(v)" -> Verb
    | "(adj)" -> Adjective
    | _ -> None

// Classifies a child node of a WordNet list item into problem domain entities.
let (|POS|SynWord|SynWordDesc|SynIndicator|SynExample|SynGlos|) (v:HtmlNode, word:string) =
    let name = v.Name.ToLowerInvariant()
    let inner = v.InnerHtml.Trim()
    let isLink = name = "a" && v.Attributes.Contains("href")
    let isSynIndicator = isLink && v.InnerHtml.Contains("S:")
    if name = "a"
       && v.Attributes.Contains("class")
       && v.Attributes.Item("class").Value.ToLowerInvariant() = "pos" then
        POS(inner)
    elif name = "b"
         || (v.InnerHtml.Contains(word) && name <> "i")
         || (isLink && not isSynIndicator) then
        // Words may carry a sense number and a sense key, e.g. word#1 (key).
        let senseKeyNumberRegEx = Regex(@"(?<SenseWord>.+)\#(?<SenseNumber>\d+).*\((?<SenseKey>.*)\)")
        let matches = senseKeyNumberRegEx.Match(inner)
        if matches.Success then
            SynWord(
                {
                    Word = matches.Groups.Item("SenseWord").Value;
                    SenseKey = matches.Groups.Item("SenseKey").Value;
                    SenseNumber = matches.Groups.Item("SenseNumber").Value;
                }
            )
        else
            SynWord(
                {
                    Word = inner;
                    SenseKey = System.String.Empty;
                    SenseNumber = System.String.Empty;
                }
            )
    elif isSynIndicator then
        SynIndicator(inner)
    elif name = "i"
         && v.InnerHtml.StartsWith("\"")
         && v.InnerHtml.EndsWith("\"") then
        SynExample(inner)
    elif inner.StartsWith("(") && inner.EndsWith(")") then
        SynGlos(v.InnerHtml.Replace("(", "").Replace(")", "").Trim())
    else
        SynWordDesc(inner)
As you can see, WordNetOptions, SynSet and SynSetWord are the main types to handle WordNet Online options, SynSets and words as part of a SynSet. The rest is F# patterns to make the parsing of WordNet HTML answers more intuitive and readable. For example, instead of dealing with strings of the form “(v)”, I want to be able to ask about the part of speech. Or instead of decomposing the answer in terms of HTML nodes, I want to derive decision logic on the basis of words, glosses and example sentences, i.e. problem domain entities. As mentioned, I do not claim this to be the ideal solution. However, I want to stress the importance of decomposing a problem within the problem domain, which is more often than not more suitable than expressing it in technical terms.
Clearly, patterns abstracting HTML node shapes would be even more powerful; i.e., instead of evaluating node names and attributes all over the place, why not have a pattern such as BoldNode or AnchorNode? I plan to incorporate this in the near term.
Equipped with the types that we want to fill via instantiation, we can decompose the problem flow into:
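The BoldNode and AnchorNode idea can be sketched with partial active patterns. This is a minimal sketch and not part of the program: the Node record is a hypothetical stand-in for HtmlAgilityPack's HtmlNode so that the example runs on its own, and describe is an illustrative helper.

```fsharp
// Hypothetical stand-in for an HTML node (the program itself uses HtmlNode).
type Node = { Name: string; Attributes: Map<string, string>; Inner: string }

// Partial active patterns that hide the node name and attribute checks.
let (|BoldNode|_|) (n: Node) =
    if n.Name.ToLowerInvariant() = "b" then Some n.Inner else None

let (|AnchorNode|_|) (n: Node) =
    if n.Name.ToLowerInvariant() = "a" then
        Some (Map.tryFind "class" n.Attributes, n.Inner)
    else
        None

// Decision logic now reads in terms of node kinds instead of string comparisons.
let describe (n: Node) =
    match n with
    | BoldNode text -> sprintf "bold: %s" text
    | AnchorNode (Some "pos", text) -> sprintf "part of speech: %s" text
    | AnchorNode (_, text) -> sprintf "link: %s" text
    | _ -> "other"

let posNode = { Name = "a"; Attributes = Map.ofList [ "class", "pos" ]; Inner = "(n)" }
printfn "%s" (describe posNode)
```

With such patterns in place, the larger node-classifying pattern could match on BoldNode or AnchorNode instead of repeating the name and attribute tests.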
- Get the user’s word from the command line
- Check whether the word is a word (not done here) and that it is not a stop word, i.e. a word which has no entry in the WordNet database because it does not express a concept, has no SynSet, etc.
- Set query options, such as whether to retrieve example sentences or not
- Ask WordNet
- Process WordNet’s answer
- Display potential answer
These six steps are shown in the following:
WriteAsHeader "Ask WordNet ..."
printfn "enter a word: "
let stopwords = [|"a"; "the"; "these"; "this"; "those"; "them"; "their" |]
let isStopWord =
    fun word ->
        stopwords |> Seq.exists(fun s -> s.ToLowerInvariant() = word)
let word = System.Console.ReadLine().ToLowerInvariant()
if (isStopWord word) then printfn "word is a stop word ... exit"
elif (System.String.IsNullOrEmpty(word)) then printfn "Cannot look for empty word ... exit"
else
    let options = {
        ShowExampleSentences = true;
        ShowGlosses = true;
        ShowFrequencyCounts = true;
        ShowDatabaseLocations = true;
        ShowLexicalFileInfo = true;
        ShowLexicalFileNumbers = true;
        ShowSenseKeys = true;
        ShowSenseNumbers = true;
    }
    let showErrors = false
    let answer = askWordNet word options
    if (System.String.IsNullOrEmpty(answer) <> true) then
        printfn "obtained answer from WordNet ..."
        answer |> saveAnswerHTML word
        let processedAnswer = processWordNetAnswer word answer options showErrors
        match processedAnswer with
        | Some(answer) ->
            printfn "WordNet said ... :"
            answer
            |> Seq.iter( fun k -> WriteAsHeader (k.ToString()))
        | _ -> ()
    else
        printfn "failed to obtain answer from WordNet ..."
printfn "press <any> key to exit"
System.Console.ReadLine() |> ignore
As you can see, the main processing is rather small. The stopwords array would clearly not be constant but expanded and stored elsewhere. WriteAsHeader is a simple function which encloses a string within two horizontal bars (see the FWordNetHelper module at the bottom). saveAnswerHTML simply dumps an HTML response into a file in the local file system, which can be helpful for offline testing.
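Storing the stop words elsewhere could, for instance, mean reading them from a text file. The following is a minimal sketch under assumptions: the file name stopwords-sample.txt and the one-word-per-line format are made up for illustration, and loadStopWords is a hypothetical helper that is not part of the program.

```fsharp
open System
open System.IO
open System.Collections.Generic

// Load stop words from a text file (one word per line) into a case-insensitive set.
let loadStopWords (path: string) =
    let words =
        File.ReadLines path
        |> Seq.map (fun line -> line.Trim())
        |> Seq.filter (fun line -> line <> "")
    HashSet<string>(words, StringComparer.OrdinalIgnoreCase)

// Write a small sample file so that the sketch runs on its own (assumed location).
let samplePath = Path.Combine(Path.GetTempPath(), "stopwords-sample.txt")
File.WriteAllLines(samplePath, [| "a"; "the"; "these"; "this" |])

let stopWordSet = loadStopWords samplePath
let isListedStopWord (word: string) = stopWordSet.Contains word

printfn "%b %b" (isListedStopWord "The") (isListedStopWord "dog")
```

A set lookup also avoids scanning the whole array for every query, which starts to matter once the list grows.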
The askWordNet function is a simple wrapper around:
- Constructing an HTTP request URL from the user’s word and WordNet’s processing options
- Invoking a web request via .NET means (i.e., System.Net.HttpWebRequest)
- Getting the request’s response
- Outputting the response as a string
let askWordNet word (option:WordNetOptions) =
    let url = "https://wordnetweb.princeton.edu/perl/webwn"
    if System.String.IsNullOrEmpty(word) then failwith "word cannot be null or empty"
    let wordneturl = url + (buildRequestString word option)
    printfn "requesting: %s" wordneturl
    let rq = System.Net.WebRequest.Create(wordneturl)
    try
        // Dispose the response and the reader once the body has been read.
        use response = rq.GetResponse()
        use reader = new System.IO.StreamReader(response.GetResponseStream())
        reader.ReadToEnd()
    with
    | :? System.Net.WebException ->
        printfn "failed to connect to \"%s\"\r\n" url
        // An empty answer signals the failure to the caller.
        System.String.Empty
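The first step, constructing the request URL, can be tried in isolation. This is a simplified sketch: buildQuery is a hypothetical helper that takes the option flags as a plain list instead of a WordNetOptions record, and it leaves out the trailing o8 parameter. It shows one detail worth considering, namely escaping the word so that multi-word queries such as “hot dog” survive the request.

```fsharp
open System

// Build the query part of a WordNet Online URL.
// Flags follow the post's convention: "1" when set, an empty value otherwise.
let buildQuery (word: string) (flags: bool list) =
    let options =
        flags
        |> List.mapi (fun i enabled -> sprintf "o%d=%s" i (if enabled then "1" else ""))
        |> String.concat "&"
    // EscapeDataString percent-encodes spaces and reserved characters.
    sprintf "?s=%s&sub=Search+WordNet&%s" (Uri.EscapeDataString word) options

printfn "%s" (buildQuery "hot dog" [ true; false; true ])
```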
The program’s heavy lifting is done inside processWordNetAnswer, which takes the user’s word, the WordNet options and the request’s response (as a string), to do the following:
- Create an HTML document using the HtmlAgilityPack
- Scrape the HTML document, so that a list of SynSet entries will be constructed for the user’s word
- Return the constructed list
let processWordNetAnswer (word:string) answer options showErrors =
    let doc = new HtmlDocument()
    doc.LoadHtml(answer)
    if showErrors then
        printfn "Html Parse Errors ... "
        doc.ParseErrors
        |> Seq.iter(fun err ->
            printfn "%d %d %s %s" err.Line err.LinePosition err.Reason err.SourceText)
    // Optional prefix of a list item: (frequency){database location} <file info>[file number]
    let freqRegex = Regex(@"^(\((?<Frequency>\d+)\))?" +
                          @"(\{(?<DatabaseLocation>\d+)\})?" +
                          @"(\x20<(?<FileInfo>.+\..+)>)?" +
                          @"(\[(?<FileNumber>\d+)\])?(\x20)?<a")
    let synList = List<SynSet>()
    FindListItemNode doc.DocumentNode "li"
    |> Seq.iter(fun liNode ->
        let rMatch = freqRegex.Match(liNode.InnerHtml)
        // Groups that did not take part in the match yield an empty string.
        let group (name:string) =
            if rMatch.Success then rMatch.Groups.Item(name).Value else System.String.Empty
        let freqStr = group "Frequency"
        let initial =
            { EmptySynSet with
                FrequencyCount = (if System.String.IsNullOrEmpty(freqStr) then -1 else System.Int32.Parse(freqStr));
                DatabaseLocation = group "DatabaseLocation";
                LexicalFileInfo = group "FileInfo";
                LexicalFileNumber = group "FileNumber" }
        // Thread the SynSet through all child nodes instead of mutating a ref cell.
        let syn =
            liNode.ChildNodes
            |> Seq.fold (fun acc node -> FillSynSet node acc word) initial
        synList.Add(sanitiseSynSet syn))
    if synList.Count <> 0 then Some(synList)
    else None
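The regular expression for the list item prefix can be tried on its own. Note that the sample string below is hypothetical: it was constructed to fit the pattern and is not copied from a real WordNet answer, whose exact markup may differ.

```fsharp
open System.Text.RegularExpressions

// Same pattern as in processWordNetAnswer, written as one verbatim string.
let prefixRegex =
    Regex(@"^(\((?<Frequency>\d+)\))?(\{(?<DatabaseLocation>\d+)\})?(\x20<(?<FileInfo>.+\..+)>)?(\[(?<FileNumber>\d+)\])?(\x20)?<a")

// Hypothetical list item prefix (assumption, built to match the pattern).
let sample = "(42){02084071} <noun.animal>[05] <a href=\"#\">S:</a>"

let m = prefixRegex.Match sample

// Named groups that did not participate in the match have an empty Value.
let group (name: string) = m.Groups.[name].Value

if m.Success then
    printfn "frequency=%s location=%s file=%s number=%s"
        (group "Frequency") (group "DatabaseLocation") (group "FileInfo") (group "FileNumber")
```

All four parts are optional in the pattern, so a list item carrying only some of them still matches, and the missing groups come back empty.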
Using HtmlAgilityPack proves extremely beneficial for decomposing the HTML response into WordNet-relevant atoms. FindListItemNode is a simple recursive function which traverses the HTML document to find its list item nodes. FillSynSet uses the power of F# pattern matching to fill a SynSet from information found in the response’s list items.
let FillSynSet (node:HtmlNode) (syn:SynSet) (word:string) =
    match (node, word) with
    | POS(n) ->
        match n with
        | Noun -> { syn with SType = SynNoun; }
        | Verb -> { syn with SType = SynVerb; }
        | Adjective -> { syn with SType = SynAdjective; }
        | _ -> ( failwith ("Error: Unrecognized POS \"" + n.ToString() + "\"!") )
    | SynExample(e) ->
        // Several example sentences may be separated by semicolons.
        let exampleSentences =
            e.Split([|";"|], System.StringSplitOptions.RemoveEmptyEntries)
            |> Seq.map( fun entry -> entry.Trim() )
        { syn with SynExampleSentences = ( Seq.append syn.SynExampleSentences exampleSentences ) ; }
    | SynWord(n) -> { syn with SynWords = (Seq.append syn.SynWords [n]) ; }
    | SynGlos(g) -> { syn with SynGlos = ( Seq.append syn.SynGlos [g] ); }
    | SynIndicator(n) | SynWordDesc (n) -> syn

// Collects all nodes with the sought name; a matching node is not searched further.
let FindListItemNode (startNode:HtmlNode) soughtNodeName =
    let col = new List<HtmlNode>()
    let rec collect (node:HtmlNode) =
        if isNull node then ()
        elif node.Name = soughtNodeName then col.Add(node)
        else node.ChildNodes |> Seq.iter collect
    collect startNode
    col
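The traversal can also be written without a mutable collection, using a recursive sequence expression. This is a minimal sketch: the Node record is a hypothetical stand-in for HtmlNode so that the example runs without HtmlAgilityPack, and findNodes is an illustrative name.

```fsharp
// Hypothetical stand-in for an HTML node tree (the program itself uses HtmlNode).
type Node = { Name: string; Children: Node list }

// Lazily yield all nodes with the sought name. As in FindListItemNode,
// the search does not descend into a node once it has matched.
let rec findNodes (soughtName: string) (node: Node) : seq<Node> =
    seq {
        if node.Name = soughtName then
            yield node
        else
            for child in node.Children do
                yield! findNodes soughtName child
    }

let leaf name = { Name = name; Children = [] }

let doc =
    { Name = "html"
      Children =
        [ { Name = "ul"; Children = [ leaf "li"; leaf "li" ] }
          leaf "p" ] }

printfn "%d list items found" (findNodes "li" doc |> Seq.length)
```

The result is produced on demand, so a caller that only needs the first list item does not pay for walking the whole document.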
For some data cleansing, the sanitiseSynSet function takes a SynSet and cleans it of potentially unwanted artefacts (not done extensively). As you can see, writing a simple WordNet scraper is indeed not that hard. All that is left are the mentioned helper functions, which, depending on your visualisation needs, can be dropped:
module FWordNetHelper

open FWordNetTypes

// Dump the raw HTML answer into a time-stamped file in the current directory.
let saveAnswerHTML (word:string) (answer:string) =
    let time = System.DateTime.Now.Ticks.ToString()
    let fileName = System.String.Format(@".\{0}_{1}.html", word, time)
    use f = System.IO.File.CreateText(fileName)
    f.Write(answer)

// Build the query string; the word is escaped so that spaces and reserved characters survive.
let buildRequestString (word:string) (option:WordNetOptions) =
    System.String.Format("?s={0}&sub=Search+WordNet&{1}",
                         System.Uri.EscapeDataString(word),
                         option.BuildWordNetOptionsString())

// Print a string enclosed within two horizontal bars.
let WriteAsHeader s =
    printfn "______________________________________________________\r\n"
    printfn "%s" s
    printfn "______________________________________________________\r\n"
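The sanitiseSynSet function itself is not listed in this post. One possible shape of such a cleansing step is sketched below. This is an assumption and not the original implementation: MiniSynSet is a simplified, hypothetical stand-in for the SynSet record, and the cleaning rules (trimming, stripping surrounding quotes, dropping blanks and duplicates) are merely plausible choices.

```fsharp
// Simplified stand-in for the SynSet record (assumption: only the text fields).
type MiniSynSet =
    { Words: string list
      Glosses: string list
      Examples: string list }

// Trim entries, strip surrounding double quotes, and drop blanks and duplicates.
let cleanEntries (entries: string list) =
    entries
    |> List.map (fun e -> e.Trim().Trim('"').Trim())
    |> List.filter (fun e -> e <> "")
    |> List.distinct

let sanitise (syn: MiniSynSet) =
    { Words = cleanEntries syn.Words
      Glosses = cleanEntries syn.Glosses
      Examples = cleanEntries syn.Examples }

let raw =
    { Words = [ " dog "; "dog"; "" ]
      Glosses = [ "a domestic animal " ]
      Examples = [ "\"the dog barked all night\""; "   " ] }

printfn "%A" (sanitise raw)
```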
With that, we can finish our little program by adding the necessary namespace and module declarations:
module FWordNetOnline
open System.Text
open System.Text.RegularExpressions
open System.Net
open System.IO
open System.Collections.Generic
open FWordNetTypes
open FWordNetHelper
open HtmlAgilityPack
and ask WordNet about the known word “Dog”:
which, if all goes well, will answer with the following (partial) processed response:
Equipped with that, dear reader, I wish you a happy exploration of the wonderful world of words using WordNet Online. If you have comments or improvements, please do not hesitate to comment, as only the ignorant are immune to learning.
Comments
- Anonymous
June 11, 2011
As an addition to this post, here are two interesting services:
Image-Net: http://www.image-net.org/index
ImageNet is an image database organized according to the WordNet hierarchy (currently only the nouns), in which each node of the hierarchy is depicted by hundreds and thousands of images. Currently we have an average of over five hundred images per node. We hope ImageNet will become a useful resource for researchers, educators, students and all of you who share our passion for pictures.
WordNik: http://www.wordnik.com/
Our goal is to show you as much information as possible, as fast as we can find it, for every word in English, and to give you a place where you can make your own opinions about words known. Traditional dictionaries make you wait until they've found what they consider to be "enough" information about a word before they will show it to you. Wordnik knows you don't want to wait: if you're interested in a word, we're interested too!