MOSS Search Word Stemming - Part 2
So How Does MOSS Expand Search Query Terms to Related Words?
Here is how this works in MOSS:
In MOSS, stemming works together with the word breaker, the component that determines where word boundaries are. The word breaker is used at both index time and query time. For most languages the stemmer is used only at query time, where it performs both morphological analysis and morphological generation. The current exceptions are Arabic and Hebrew, where stemming is restricted to morphological analysis and runs at both query time and index time.
A stemmer links word forms to their base form. For example, "running," "ran," and "runs" are all variants of the verb "to run." Stemming is currently turned off by default for some languages, including English. Stemmers are available only for languages that have significant morphological variation among their word forms. For languages where no stemmer is available (such as Vietnamese), turning on this feature in the Search Result Page (CoreResult Web Part) has no effect, because exact matching is all that such languages need.
Word Stemming is NOT the same thing as Wild Card Searching, which our engine also supports. A wild card search uses * in the query. It asks the search engine to find all words that start with the given text string and end with anything, since * matches any character any number of times until the end of the word, which in most languages (excluding most East Asian languages) is indicated by white space. So a wild card query such as "Share*" will return results including "SharePoint", while a query for "Share" that uses morphological processing would bring back "sharing", which is an inflectional variant of "share". The terms Wild Card Searching and Word Stemming are often used interchangeably, but they are in fact separate mechanisms that can return different results.
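As a rough illustration of the wild card half of that distinction, here is a minimal Python sketch of prefix matching with a trailing *. It is not how the MOSS engine is implemented; the function name `wildcard_match` and the sample word list are made up for this example, and the matching is done with Python's standard `fnmatch` module.

```python
import fnmatch

def wildcard_match(pattern, words):
    """Return the words that match a shell-style pattern such as 'Share*', ignoring case."""
    # Hypothetical helper for illustration only; not part of any MOSS API.
    lowered = pattern.lower()
    return [w for w in words if fnmatch.fnmatchcase(w.lower(), lowered)]

# Made-up sample of indexed words.
indexed_words = ["share", "sharing", "SharePoint", "shared", "shore"]

# "sharing" does not start with "share", so the wild card query misses it,
# even though it is an inflectional variant of "share".
print(wildcard_match("Share*", indexed_words))  # ['share', 'SharePoint', 'shared']
```

The wild card query picks up "SharePoint" purely because of its spelling, and misses "sharing" for the same reason.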
Word Stemming would bring back words closely related to the query terms (usually inflectional variants for most languages, but for some languages derivational variants as well).
For example, here are sample results for the following queries:
- If you type in "run", in addition to exact matches on "run", it will bring back matches on "runs", "ran" and "running".
- If you type in "page", in addition to exact matches on "page", it will bring back matches on "pages", "paged" and "paging".
- If you type in "basket", in addition to exact matches on "basket", it will find "baskets", but it will not find "basketball". A wild card search for "basket*" would find "basketball"; our engine supports wild card search, and I will discuss it in another article. Word Stemming does not currently handle this case because we have focused on matching inflectional variants of words rather than derivational variants.
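The query-time expansion in these examples can be sketched in a few lines of Python. This is only a toy model: the `INFLECTIONS` table is hand-written for illustration, whereas a real stemmer derives the variants from language-specific morphological rules, and the names `expand_query_term` and `search` are hypothetical rather than part of any MOSS API.

```python
# Hypothetical, hand-written table of inflectional variants (illustration only).
INFLECTIONS = {
    "run": ["runs", "ran", "running"],
    "page": ["pages", "paged", "paging"],
    "basket": ["baskets"],
}

def expand_query_term(term):
    """Return the query term followed by its inflectional variants."""
    key = term.lower()
    return [key] + INFLECTIONS.get(key, [])

def search(term, documents):
    """Return documents that contain the term or one of its variants as a whole word."""
    variants = set(expand_query_term(term))
    return [doc for doc in documents if variants & set(doc.lower().split())]

# Made-up sample documents.
documents = ["She ran home", "Basketball season starts", "Two baskets of fruit"]

print(expand_query_term("run"))      # ['run', 'runs', 'ran', 'running']
print(search("basket", documents))   # ['Two baskets of fruit']
```

Because the expansion happens on the query side, the indexed documents are left untouched, and "Basketball" is never matched since it is not an inflectional variant of "basket".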
However, this option is turned off by default out of the box for English and some other languages. You can turn it on by going to the Search Results Web Part, then Options, and enabling the feature called "Enable Search Term Stemming".
Thanks to Ian Johnson from the Natural Language Group at Microsoft for his feedback on this.
Hope that helps
Mike
Comments
Anonymous
December 26, 2006
Great investigation! Thanks!
Anonymous
December 26, 2006
Two posts that explain Search Word Stemming in MOSS by Mike Taghizadeh MOSS Search Word Stemming - Part...
Anonymous
December 28, 2006
This article implies support for stemming but it does not seem to be enabled by default. Stemming =...
Anonymous
January 01, 2007
Still wondering where that option is. I've looked all over and can't find it. I googled and still can't find it. Maybe the option to turn on stemming doesn't exist. At least that's what me thinks.
Anonymous
January 02, 2007
Thank you Sharon. Works like a charm.
Anonymous
January 03, 2007
Recommended Reading for January (click here for previous recommendations): · MOSS Search Word Stemming:
Anonymous
January 04, 2007
A quick follow-up on Sharon's comment. Does turning on stemming really increase the index size? My reading of the post is that stemming is only configurable for the search web part. That means that when you type the keyword run, the search engine will also search for ran, running, runs, etc. in the index, but it doesn't change the composition of the index. Also, are there any numbers out there on the impact of turning on that feature in terms of precision/recall and performance? Thanks, Tony.
Anonymous
January 04, 2007
Is there a way to turn on stemming just for People search?
Anonymous
January 04, 2007
Stemming is for the results web part, which brings content and people back. You can look into building your own web part which separates the two, and then you can pick and choose.
Anonymous
January 05, 2007
Stemming is NOT the same thing as wild card searching. I talked about this in the article. We do support wild card as well; look into the SDK.
Anonymous
January 05, 2007
The only thing that I found in the SDK regarding wild card search involves building a custom web part. Is there no way to simply flip a switch to allow for wild cards in search?
Anonymous
January 05, 2007
No, we support this through building a custom web part.
Anonymous
January 06, 2007
I warmly recommend reading these articles.... · MOSS Search Word Stemming: Part 1 and Part 2 – written
Anonymous
January 15, 2007
Does anyone know where I can find an example of how to build a custom web part that has the wild card search feature? Thanks
Anonymous
February 22, 2007
Mike Taghizadeh covering MOSS 2007 Search Capabilities
Anonymous
July 24, 2007
If the configuration to allow stemming is set in the core results web part, will that setting be published from an authoring environment to a production environment in a WCM publishing scenario? Or do I have to go to the search page on the production host and make the same configuration change?
Anonymous
November 27, 2007
Ray (et al.), I'm in total agreement. Why bother saying you have search capabilities when they do not include wildcard? Isn't that the whole idea behind any search? If I'm "searching" for something vague like "micro", the search results should bring back microscope, microgram, micrometer, microsoft, microphone, etc. Get it? Now I would have to actually type in microsoft to get info about that searched phrase! So the way MOSS has search working would basically NOT allow someone to find documents or names in SP that have microsoft in them if the search keyword = micro. Wildcard should be the default and NON wildcard should be an option if it needs to be turned off. I'm so close to wrapping up my custom search for a client and now have to figure out wildcard searches to be 100% done.
Anonymous
December 06, 2007
I cannot get this to work either. I am using an English MOSS environment where the "Enable Search Term Stemming" checkbox is unchecked in the search results page. I have done the following:
- Uploaded a text file with the term "new york" to a MOSS site.
- Replaced the tsneu.xml in C:\Program Files\Microsoft Office Servers\12.0\Data\Office Server\Applications\<GUID>\Config with the following contents:
<XML ID="Microsoft Search Thesaurus">
  <thesaurus xmlns="x-schema:tsSchema.xml">
    <diacritics_sensitive>0</diacritics_sensitive>
    <expansion>
      <sub>detroit</sub>
      <sub>new york</sub>
    </expansion>
  </thesaurus>
</XML>
- Restarted Office SharePoint Server Search and performed a full crawl afterwards.
- Searched for "detroit" expecting to retrieve the "new york" file.
But I did not get the expected results. Did I miss something? I hope you can help me. Best regards, Andries
Anonymous
January 21, 2008
We configured Word Stemming for Finnish language support, and after that the search fails to function correctly. There seems to be something wrong with the crawl, because the number of indexed items has dropped radically. Our environment has been migrated from SPS2003. Can "legacy leftovers from SPS2003" cause problems? Another question: How much do browser settings affect the search?
Anonymous
February 01, 2008
In response to the question by Keutmann: yes, stemming is supported for the Danish language. I have it running in a solution.
Anonymous
July 17, 2008
Is it possible to highlight the word when using stemming? If I type in tutorial, the word tutorial will be highlighted, but not in results where it's tutorials. An answer would be appreciated. Thanks
Anonymous
November 05, 2008
quit whining about wildcard search
Anonymous
November 12, 2008
Now I must tell our big customer, which is a huge financial institution, that MS doesn't support wildcards. This is really, really annoying.
Anonymous
December 12, 2008
I am running into an error when stemming is enabled for the German language. Is stemming supported for German?