Character based versus word based search, or Searching with wildcards on Windows Vista

There is a major difference between file search on Windows XP and desktop search on Windows Vista (or on Windows XP with Windows Desktop Search [WDS] installed) that has perhaps not been made sufficiently clear.

On Windows XP, search is character based. That is, if you search for the string 'test', it will find files named 'my test data.doc' and 'additional testing.xls' as well as 'latest junk.txt', and (if you tell it to search file contents too) files containing words such as 'test', 'tester' and 'fattest'.

On Windows Vista, and on Windows XP with WDS installed, search is normally word based. Searching for the string 'test' will only find documents with the word 'test' in them, or words beginning with 'test'. So it will find the files named 'my test data.doc' and 'additional testing.xls' but it will not find 'latest junk.txt'. Moreover, it will find documents containing 'test' or 'tester' but it will not find documents containing 'fattest'.
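The difference between the two behaviours can be sketched in a few lines of Python. This is only an illustration of the matching rules described above, not how either search engine is implemented; the function names and the simple tokenizer (splitting on anything that is not a letter or digit) are assumptions made for the example.

```python
import re

# File names taken from the examples in the text.
FILES = ["my test data.doc", "additional testing.xls", "latest junk.txt"]

def character_search(term, names):
    """Character based: the term may occur anywhere in the name."""
    term = term.lower()
    return [n for n in names if term in n.lower()]

def word_search(term, names):
    """Word based: the term must be a word, or the beginning of a word."""
    term = term.lower()
    hits = []
    for n in names:
        # Assumed tokenizer: runs of letters and digits count as words.
        words = re.findall(r"[a-z0-9]+", n.lower())
        if any(w.startswith(term) for w in words):
            hits.append(n)
    return hits

if __name__ == "__main__":
    print(character_search("test", FILES))  # all three names
    print(word_search("test", FILES))       # 'latest junk.txt' is missing
```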

One cannot really say one is better than the other; if one is really looking for the word 'test', finding documents containing 'fattest' is just unwanted noise. On the other hand, sometimes one is really looking for a string, wherever it occurs, and then character based search is the only game in town.

The main reason for the change is that by making search word based one can use an index to make searches much faster. This is why searches on Windows Vista are generally so much faster than on Windows XP (without WDS): on Windows XP each search basically plows through every single file, looking for the string, while on Windows Vista an index lookup produces the right documents instantly. By the way, this is how most Internet search engines work and that is why they too are word based.
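To see why an index helps, here is a minimal sketch of a word index next to a full scan. It assumes a toy in-memory index (a sorted word list plus a word-to-documents map); the real index format used by Windows Desktop Search is not described here, and all names below are made up for the illustration.

```python
import bisect
import re
from collections import defaultdict

def tokenize(text):
    # Assumed tokenizer: runs of letters and digits count as words.
    return re.findall(r"[a-z0-9]+", text.lower())

def build_index(docs):
    """Build the index once: map each word to the documents containing it."""
    postings = defaultdict(set)
    for doc_id, text in docs.items():
        for word in tokenize(text):
            postings[word].add(doc_id)
    vocab = sorted(postings)  # sorted so prefix lookups can use binary search
    return vocab, postings

def index_lookup(vocab, postings, term):
    """Word based: find words starting with term without reading any document."""
    term = term.lower()
    hits = set()
    i = bisect.bisect_left(vocab, term)
    while i < len(vocab) and vocab[i].startswith(term):
        hits |= postings[vocab[i]]
        i += 1
    return hits

def full_scan(docs, term):
    """Character based: read every document and look for the string."""
    term = term.lower()
    return {doc_id for doc_id, text in docs.items() if term in text.lower()}

if __name__ == "__main__":
    docs = {
        "a.txt": "A test written by the tester",
        "b.txt": "The fattest cat",
        "c.txt": "Nothing relevant here",
    }
    vocab, postings = build_index(docs)
    print(index_lookup(vocab, postings, "test"))
    print(full_scan(docs, "test"))
```

The lookup touches only the few index entries that start with the search term, while the scan has to read every document each time. That is also why the index cannot answer "anywhere in a word" questions cheaply: 'fattest' is filed under 'f', not under 't'.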

But what if you really want to look for a string anywhere? The good news is that you can do that also on Windows Vista (or Windows XP with WDS). You do it by searching for a string that contains '?' and/or '*'. As on Windows XP (and harking all the way back to DOS), '?' matches exactly one arbitrary character while '*' matches zero, one or more arbitrary characters. So to search for 'test' occurring anywhere in words, search for '*test*'. It finds all of the examples above, just as 'test' would on Windows XP. Note that the pattern is matched against the whole value, not against each word separately, so searching for 'lo?t' will not find documents with the word 'lost' in them unless 'lost' is the entire value. You would have to search for '*lo?t*', even though that will also find documents containing words such as 'plotting'.
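The whole-value rule is easy to get wrong, so here is a small sketch of it. It translates a wildcard pattern into a regular expression and requires the expression to match the complete value. The helper names are hypothetical and case-insensitive matching is an assumption; the sketch only models the '?' and '*' rules stated above.

```python
import re

def wildcard_to_regex(pattern):
    """Translate '?' and '*' into a regular expression; all else is literal."""
    parts = []
    for ch in pattern:
        if ch == "*":
            parts.append(".*")   # zero, one or more arbitrary characters
        elif ch == "?":
            parts.append(".")    # exactly one arbitrary character
        else:
            parts.append(re.escape(ch))
    # Assumption: matching ignores case.
    return re.compile("".join(parts), re.IGNORECASE | re.DOTALL)

def matches_value(pattern, value):
    """True only if the pattern matches the whole value, not just a part."""
    return wildcard_to_regex(pattern).fullmatch(value) is not None

if __name__ == "__main__":
    print(matches_value("lo?t", "lost"))                # True
    print(matches_value("lo?t", "I lost my keys"))      # False
    print(matches_value("*lo?t*", "I lost my keys"))    # True
    print(matches_value("*lo?t*", "plotting a graph"))  # True
```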

You can also restrict your search to a single property. For example, 'name:*bum*' will return any document where the string 'bum' occurs somewhere in the name property, 'subject:???' will search for documents where the subject has exactly three characters, and 'filename:l?t*t' will find documents with file names such as 'latest', 'littlest' or 'lqtfgdfhgt', and so on. See New Mansions in Search - Advanced Query Syntax for more on querying over specific properties.

There is one exception: if what you search for has a '*' wildcard at the end and no other wildcards, as in 'subject:test*', the search will still be word based and the results will contain any document that has a word beginning with 'test' somewhere in the subject. However, you can force the character based search by using the operator '~'. 'subject:~test*' will return only those documents where the subject begins with 'test' and 'subject:~*test*' will return those where 'test' occurs anywhere (also inside words).
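The rules for which kind of search you get can be summarised as a small decision function. This is a sketch of the rules as described in this post, not of the actual query parser: the function name is hypothetical, and edge cases the post does not cover (such as a lone '*') are simply given whatever the stated rules imply.

```python
def search_mode(query):
    """Return 'word' or 'character' for a query such as 'subject:test*'."""
    # Drop an optional property prefix like 'subject:' or 'name:'.
    value = query.split(":", 1)[1] if ":" in query else query

    # A leading '~' forces character based search.
    if value.startswith("~"):
        return "character"

    # No wildcards at all: ordinary word based search.
    if "*" not in value and "?" not in value:
        return "word"

    # The exception: a single '*' at the end and no other wildcards.
    body = value[:-1]
    if value.endswith("*") and "*" not in body and "?" not in body:
        return "word"

    # Any other use of '?' or '*' means character based search.
    return "character"

if __name__ == "__main__":
    for q in ["test", "subject:test*", "subject:~test*",
              "subject:~*test*", "*test*", "filename:l?t*t"]:
        print(q, "->", search_mode(q))
```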

So what is the price for using '*' and '?', that is, using character based search? Time! The search engine is forced to go through every document in the scope and look for the specified pattern. If you are searching over a large collection of documents, this can take a long time, but the choice is up to you, the user!

Comments

  • Anonymous
    September 10, 2007
    Say you have a document containing the number 1234567890 (it might be a part number or an invoice number)