Full text indexing
Hi Gang,
An interesting thread got going in my email the other day, and I thought it might be of help to some of you. It started as a query to “Working Programmer” author Ted Neward. A reader had a question related to his series of articles on MongoDB.
Imran Saeed asks Ted:
“I liked your article on MongoDB and very good coverage of the topic. Is it possible to cover some “free text” search capabilities in the database in some future article on this topic? My question is probably more basic than free text. A relational database is good for relating data but it would be nice to get an idea of doing something similar in NoSQL world but with practical application. As an example, consider a blog site which has blogs/articles by user name so how would I go about searching for all the articles written by (let’s say) “Neward” in this system?”
Ted answers:
“Answering your question I think comes in two forms, depending on exactly what you’re looking for. If we use your blog site example as the context of the question, then the first question becomes how the author is described in the document—if it’s a field, for example, then it’s a simple query to find all documents where “author” equals “Neward”, and we’re off and running. If, however, the goal is to find all articles where “Neward” appears in the article body (in other words, it’s not a field), such as what you might do to build a tag cloud of documents, then I think your best bet is to write a JavaScript function that lives on the MongoDB server that scans through the article looking for the keyword in question, and invoke that function on the server as part of the query to find that term.
In fact, if this is something you would want to support on a regular basis, then you might even have *two* article body fields, one to hold the “readable” text, and the other to hold the text minus all the filler words that aren’t really searchable; for example, this last paragraph could probably be reduced to just a half-dozen words by cutting out all the conjunctions and verbs, which would speed up your keyword queries by quite a lot.”
Ted then brings in an expert in the technology, who weighs in:
“For full text indexing most people use an integration with sphinx or solr. Those give you much better text search capabilities than anything you'll get with the query language. Having said that, it is possible to get a lightweight text search by adding a field to your document that contains an array of key words/searchable terms. Then you would just query on that array. I believe that would be done with the $in operator. You don't get some of the benefits that a proper full text indexing engine will give you such as stemming (i.e. knowing that “know” is the stem of “knowing”), but it is lightweight and easy to do. The MongoDB guys have an eventual plan to add full text indexing to the database but it isn't a pressing feature just yet.”
-- Keith Ward, editor in chief, MSDN Magazine.