Using the MicrosoftNgram Python Module

Over the past few posts I've shown some samples that use the
MicrosoftNgram Python module.  Writing documentation is not something the engineers
I know enjoy doing; in fact, the only documentation available right now is
what you get from help(MicrosoftNgram).  Here's an attempt to rectify the situation.

To get started, you'll of course need to get the module,
which you can download here.

The main class is named LookupService.  An instance of this
object encapsulates two crucial pieces of information: (a) the user token, and
(b) the language model of interest.  The user token is a GUID issued by
Microsoft Research.  We use it only to track the amount of usage;
neither the phrases nor the models you query are tracked, in the interest of protecting
users' privacy.

The language model is the dataset against which you query probabilities.  Details on language models were covered in last week's post, but in a nutshell a model has three properties: source, version, and order, written together as source/version/order (for example, bing-body/jun09/3).  The following calls to the constructor are all functionally equivalent, provided that (i) you use your actual GUID, not xx.., and (ii) for the first case, you've defined an environment variable called NGRAM_TOKEN and set its value to your GUID.  They also show that the default model is bing-body/jun09/3.

>>> s = MicrosoftNgram.LookupService()
>>> s.GetModel()
'bing-body/jun09/3'
>>> s = MicrosoftNgram.LookupService(token='xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx')
>>> s.GetModel()
'bing-body/jun09/3'
>>> s = MicrosoftNgram.LookupService('xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx')
>>> s.GetModel()
'bing-body/jun09/3'
>>> s = MicrosoftNgram.LookupService(token='xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx',model='bing-body/jun09/3')
>>> s.GetModel()
'bing-body/jun09/3'
>>> s = MicrosoftNgram.LookupService('xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx','bing-body/jun09/3')
>>> s.GetModel()
'bing-body/jun09/3'
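
Since a model name packs the three properties (source, version, order) into one string, it can be handy to pull them apart again.  Here is a minimal sketch in plain Python; ModelSpec and parse_model_name are hypothetical names of my own, not part of the MicrosoftNgram module, and the 'source/version/order' layout is inferred from names such as 'bing-body/jun09/3'.

```python
from typing import NamedTuple


class ModelSpec(NamedTuple):
    """Hypothetical container for the three parts of a model name."""
    source: str   # e.g. 'bing-body'
    version: str  # e.g. 'jun09'
    order: int    # n-gram order, e.g. 3 for trigrams


def parse_model_name(name: str) -> ModelSpec:
    """Split a 'source/version/order' string into its components."""
    parts = name.split('/')
    if len(parts) != 3:
        raise ValueError("expected 'source/version/order', got %r" % name)
    source, version, order = parts
    return ModelSpec(source, version, int(order))


spec = parse_model_name('bing-body/jun09/3')
print(spec.source, spec.version, spec.order)
```

The result of s.GetModel() could be passed straight to such a helper, for instance to check the order of the model you are about to query.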

I prefer to set the environment variable, since I can't be bothered to memorize a GUID.  Speaking of environment variables, note that MicrosoftNgram uses urllib under the covers, so if you need to go through a proxy, set HTTP_PROXY appropriately.
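
The token fallback described above might look something like the following.  This is a sketch of the described behavior, not the module's actual source, and resolve_token is a hypothetical name.  It is written for Python 3, where the proxy-aware functions live in urllib.request; the last line prints whatever proxy settings urllib detects, which should include HTTP_PROXY if you have set it.

```python
import os
import urllib.request


def resolve_token(token=None, environ=None):
    """Return the explicit token if given, else fall back to NGRAM_TOKEN.

    Hypothetical helper: it mirrors the behavior described in the post,
    not the module's actual implementation.
    """
    if environ is None:
        environ = os.environ
    if token:
        return token
    try:
        return environ['NGRAM_TOKEN']
    except KeyError:
        raise ValueError('pass token=... or set NGRAM_TOKEN') from None


# Placeholder GUID, exactly as in the post; substitute your own.
placeholder = 'xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx'
print(resolve_token(environ={'NGRAM_TOKEN': placeholder}))

# urllib reads proxy settings from the environment (and, on some
# platforms, from system settings); this shows what it found.
print(urllib.request.getproxies())
```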

Once you have a LookupService object, you can call the various methods. 

>>> s = MicrosoftNgram.LookupService(model='bing-body/apr10/5')
>>> s.GetConditionalProbability('happy cat is happy')
-0.93900499999999998
>>> s.GetConditionalProbability('happy cat is sad')
-4.2167089999999998
>>> s.GetJointProbability('kthxbai')
-7.6080370000000004
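
The numbers returned are log-probabilities, so comparing two phrases means subtracting and then exponentiating.  The sketch below assumes the logs are base 10 (an assumption on my part, but one that matches the orders of magnitude involved) and reuses the two conditional values from the session above.

```python
# Values copied from the session above; assumed to be base-10 logs.
log_p_happy = -0.939005
log_p_sad = -4.216709

# A difference of logs is the log of a ratio.
log_ratio = log_p_happy - log_p_sad
ratio = 10 ** log_ratio

print('log10 ratio: %.3f' % log_ratio)
print('likelihood ratio: about %.0f to 1' % ratio)
```

Under that assumption the ratio lands between 1,000 and 2,000.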

Well, it's good to know that happy cat is more than 1000x more likely to be happy than sad.  What else can happy cat be?

>>> for t in s.Generate('happy cat is', maxgen=5): print t
...
('always', -0.36325089999999999)
('a', -0.89422170000000001)
('happy', -0.93900499999999998)

So we know that happy cat is never sad (well, most likely anyway: bing-body/apr10 has a unigram cutoff of 10).  We can further infer that, since 'sad' is not among the continuations the model generates for 'happy cat is', the conditional probability computed above for 'happy cat is sad' must have come from backing off to a lower-order n-gram.
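
To make "backing off" concrete, here is a toy sketch of one common backoff scheme, the kind used in ARPA-style language models: if the full n-gram is missing, add the context's backoff weight and retry with a shorter context.  I'm not claiming this is exactly what the service does, and every number in the tables below is invented for illustration; none of them come from bing-body/apr10.

```python
# Toy model: every number here is invented for illustration only.
LOGPROB = {
    ('happy', 'cat', 'is', 'happy'): -0.9,
    ('is', 'sad'): -2.5,
    ('sad',): -4.0,
}
BACKOFF = {
    ('happy', 'cat', 'is'): -0.8,
    ('cat', 'is'): -0.5,
    ('is',): -0.3,
}


def cond_logprob(context, word):
    """Return (log10 P(word | context), order of the n-gram that matched)."""
    ngram = tuple(context) + (word,)
    if ngram in LOGPROB:
        return LOGPROB[ngram], len(ngram)
    if not context:
        raise KeyError('unseen word: %r' % word)
    # Pay the context's backoff penalty, then retry with a shorter context.
    penalty = BACKOFF.get(tuple(context), 0.0)
    logp, order = cond_logprob(context[1:], word)
    return penalty + logp, order


print(cond_logprob(('happy', 'cat', 'is'), 'happy'))
print(cond_logprob(('happy', 'cat', 'is'), 'sad'))
```

In this toy, 'happy' is answered directly by the 4-gram, while 'sad' is only found after shortening the context twice, so it pays two backoff penalties on top of the bigram's log-probability.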
