MOSS Search Word Stemming - Part 2

So How Does MOSS Expand Search Query Terms to Related Words?

 

Here is how this works in MOSS:

 

In MOSS, stemming works in combination with the word breaker, the component that determines where word boundaries are. The word breaker is used at both index time and query time. For most languages the stemmer is used only at query time, where it performs both morphological analysis and morphological generation. The current exceptions are Arabic and Hebrew, for which stemming is restricted to morphological analysis and runs at both query time and index time. A stemmer links word forms to their base form. For example, "running," "ran," and "runs" are all variants of the verb "to run." Stemming is currently turned off by default for some languages, including English. Stemmers are available only for languages that have significant morphological variation among their word forms. For languages where no stemmer is available (such as Vietnamese), turning on this feature in the Search Results page (Core Results Web Part) has no effect, because exact match is all that is needed in such languages.
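To make the division of labor concrete, here is a minimal Python sketch of the idea, not MOSS's actual implementation. It assumes a tiny hand-written lexicon (`LEXICON`), a toy word breaker that splits on non-letters, and OR semantics between query terms; all names are hypothetical. The index stores exact word forms only, and expansion happens at query time: analysis maps the query term to its base form, and generation produces the variants to look up.

```python
import re

# Hypothetical toy lexicon: base form -> inflectional variants.
LEXICON = {
    "run": ["run", "runs", "ran", "running"],
    "page": ["page", "pages", "paged", "paging"],
}

# Reverse map used for morphological analysis: variant -> base form.
BASE_FORM = {v: base for base, variants in LEXICON.items() for v in variants}


def break_words(text):
    """Toy word breaker: lowercase the text and split on non-letters."""
    return re.findall(r"[a-z]+", text.lower())


def build_index(docs):
    """Index time: word breaking only, so exact word forms are stored."""
    index = {}
    for doc_id, text in docs.items():
        for word in break_words(text):
            index.setdefault(word, set()).add(doc_id)
    return index


def expand(term):
    """Query time: analysis (term -> base form), then generation (base -> variants)."""
    base = BASE_FORM.get(term, term)
    return LEXICON.get(base, [term])  # no lexicon entry means exact match only


def search(index, query, stemming=True):
    """Return ids of documents matching any query term (or any of its variants)."""
    hits = set()
    for term in break_words(query):
        forms = expand(term) if stemming else [term]
        for form in forms:
            hits |= index.get(form, set())
    return hits


docs = {1: "She ran home.", 2: "Running shoes", 3: "A basketball game"}
index = build_index(docs)
print(sorted(search(index, "run")))                  # stemming on
print(sorted(search(index, "run", stemming=False)))  # stemming off
```

With stemming on, the query "run" reaches the documents containing "ran" and "running"; with stemming off, it matches nothing, because no document contains the exact form "run". A term with no lexicon entry falls back to exact match, which mirrors the case of a language without a stemmer.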

 

Word Stemming is NOT the same thing as Wild Card Searching, which our engine supports as well. Wild Card Searching means using * in the query. You are asking the search engine to find all words that start with the given text string and end with anything, since * matches any character any number of times until the end of the word, which in most languages (excluding most East Asian languages) is indicated by white space. So a wild card query such as "Share*" will return results including "SharePoint", while a query using morphological processing would bring back "sharing", which is an inflectional variant of "share". The terms Wild Card Searching and Word Stemming are often used interchangeably, but they are in fact separate mechanisms that can return different results.
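The difference can be sketched in a few lines of Python. This is an illustration under stated assumptions, not how the MOSS engine is implemented: `VOCABULARY` stands in for the words found in an index, `INFLECTIONS` is a hand-written table of inflectional variants, and the wild card is supported only in trailing position.

```python
# Hypothetical set of words that appear in an index.
VOCABULARY = ["share", "shares", "shared", "sharing", "sharepoint", "shareholder"]

# Hypothetical table of inflectional variants, keyed by base form.
INFLECTIONS = {"share": {"share", "shares", "shared", "sharing"}}


def wildcard_match(pattern, vocabulary):
    """Trailing-* wild card: keep every word that starts with the given prefix."""
    if not pattern.endswith("*"):
        raise ValueError("this sketch only handles a trailing *")
    prefix = pattern[:-1].lower()
    return [word for word in vocabulary if word.startswith(prefix)]


def stem_match(term, vocabulary):
    """Stemming: keep only the inflectional variants of the query term."""
    forms = INFLECTIONS.get(term.lower(), {term.lower()})
    return [word for word in vocabulary if word in forms]


print(wildcard_match("Share*", VOCABULARY))
print(stem_match("share", VOCABULARY))
```

The wild card query picks up "sharepoint" and "shareholder" but misses "sharing", because "sharing" does not begin with the string "share". The stemmed query finds "sharing" but leaves out the unrelated words that merely share a prefix.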

 

Word Stemming would bring back words closely related to the query terms (usually inflectional variants for most languages, but for some languages derivational variants as well).

 

For example, here are some sample results for the following queries:

  • If you type in "run" --> in addition to exact matches on "run", it will bring back matches on "runs", "ran" and "running"
  • If you type in "page" --> in addition to exact matches on "page", it will bring back matches on "pages", "paged" and "paging"
  • If you type in "basket" --> in addition to exact matches on "basket", it will find "baskets", but it will not find "basketball". A wild card search for "basket*" would find "basketball"; our engine supports wild card search, and I will discuss it in another article. Word Stemming does not currently handle this case because we have focused on matching only inflectional variants of words rather than derivational variants.

However, this option is turned off by default out of the box for English and some other languages. You can turn it on by going to the Search Results Web Part, then Options, and turning on the feature called "Enable Search Term Stemming".

 

Thanks to Ian Johnson from the Natural Language Group at Microsoft for providing his feedback on this.

 

Hope that helps

Mike

Comments

  • Anonymous
    December 26, 2006
    Great investigation!  Thanks!

  • Anonymous
    December 26, 2006
    Two posts that explain Search Word Stemming in MOSS by Mike Taghizadeh: MOSS Search Word Stemming - Part...

  • Anonymous
    December 28, 2006
    This article implies support for stemming but it does not seem to be enabled by default. Stemming =...

  • Anonymous
    January 01, 2007
    Still wondering where that option is. I've looked all over and can't find it. I googled and still can't find it. Maybe the option to turn on stemming doesn't exist. At least that's what me thinks.

  • Anonymous
    January 02, 2007
    Thank you Sharon. Works like a charm.

  • Anonymous
    January 03, 2007
    Recommended Reading for January (click here for previous recommendations): · MOSS Search Word Stemming:

  • Anonymous
    January 04, 2007
    A quick follow-up on Sharon's comment. Does turning on stemming really increase the index size? My reading of the post is that stemming is only configurable for the search web part. That means when you type the keyword run, the search engine will also search for ran, running, runs, etc. in the index, but it doesn't change the composition of the index. Also, are there any numbers out there on the impact of turning on that feature in terms of precision/recall and performance? Thanks, Tony.

  • Anonymous
    January 04, 2007
    Is there a way to turn on stemming just for People search?

  • Anonymous
    January 04, 2007
    Stemming is set on the results web part, which brings back both content and people. You can look into building your own web part that separates the two, so you can pick and choose.

  • Anonymous
    January 05, 2007
    Stemming is NOT the same thing as wild card searching. I talked about this in the article. We do support wild card as well; look into the SDK.

  • Anonymous
    January 05, 2007
    The only thing that I found in the SDK regarding wild card search involves building a custom web part. Is there no way to simply flip a switch to allow for wild cards in search?

  • Anonymous
    January 05, 2007
    No, we support this through building a custom web part.

  • Anonymous
    January 06, 2007
    I warmly recommend that you read these articles.... · MOSS Search Word Stemming: Part 1 and Part 2 – written

  • Anonymous
    January 15, 2007
    Does anyone know where I can find an example of how to build a custom web part that has the wild card search feature? Thanks

  • Anonymous
    February 22, 2007
    Mike Taghizadeh covering MOSS 2007 Search Capabilities

  • Anonymous
    July 24, 2007
    If the configuration to allow stemming is set in the core results web part, will that setting be published from an authoring environment to a production environment in a WCM publishing scenario? Or do I have to go to the search page on the production host and make the same config?

  • Anonymous
    November 27, 2007
    Ray (et al), I'm in total agreement. Why bother saying you have search capabilities when they do not include wildcard? Isn't that the whole idea behind any search? If I'm "searching" for something vague like "micro", the search results should bring back microscope, microgram, micrometer, microsoft, microphone, etc. Get it? Now I would have to actually type in microsoft to get info about that searched phrase! So the way MOSS has search working would basically NOT allow someone to find documents or names in SP that have microsoft in them if the search keyword = micro. Wildcard should be the default, and NON wildcard should be an option that can be turned off if needed. I'm so close to wrapping up my custom search for a client and now have to figure out wildcard searches to be 100% done.

  • Anonymous
    December 06, 2007
    I cannot get this to work either. I am using an English MOSS environment where the "Enable Search Term Stemming" checkbox is unchecked in the search results page. I have done the following: uploaded a text file with the term "new york" to a MOSS site; replaced the tsneu.xml in C:\Program Files\Microsoft Office Servers\12.0\Data\Office Server\Applications\<GUID>\Config with the following contents:
    <XML ID="Microsoft Search Thesaurus">
    <thesaurus xmlns="x-schema:tsSchema.xml">
     <diacritics_sensitive>0</diacritics_sensitive>
     <expansion>
      <sub>detroit</sub>
      <sub>new york</sub>
     </expansion>
    </thesaurus>
    </XML>
    I then restarted Office SharePoint Server Search and performed a full crawl afterwards. I searched for "detroit", expecting to retrieve the "new york" file, but I did not get the expected results. Did I miss something? I hope you can help me. Best regards, Andries

  • Anonymous
    January 21, 2008
    We configured Word Stemming for Finnish language support, and after that the search fails to function correctly. There seems to be something wrong with the crawl, because the number of indexed items has dropped radically. Our environment has been migrated from SPS2003. Can "legacy leftovers from SPS2003" cause problems? Another question: how much do browser settings affect the search?

  • Anonymous
    February 01, 2008
    In response to the question by Keutmann: yes, stemming is supported for the Danish language. I have it running in a solution.

  • Anonymous
    July 17, 2008
    Is it possible to highlight the matched word when using stemming? If I type in tutorial, the word tutorial is highlighted, but not in results where it appears as tutorials. An answer would be appreciated. Thanks

  • Anonymous
    November 05, 2008
    quit whining about wildcard search

  • Anonymous
    November 12, 2008
    Now I must tell our big customer, which is a huge financial institution, that MS doesn't support wildcards. This is really, really annoying.

  • Anonymous
    December 12, 2008
    I am running into an error when stemming is enabled for the German language. Is stemming supported for German?
