SharePoint Search: “Why are my results so bad?” Part 2 - PDFs vs. Office Documents

In a previous post, I discussed why search results can seem so bad.  Honestly, labeling results as "bad" is misleading because the Search Engine is ranking results by only the metadata and rank model that the Search Admin told the Search Engine to use.  In reality, there are no bad search results, only bad Search Admins. ;)

I recently ran across a case of "perceived bad results" that I thought was really interesting because the ranking behavior from the default SP2013 rank model had completely flipped when compared to the ranking behavior in SPO.  In SP2013, customers often questioned why, when given an Office Document (e.g. PowerPoint or Word) and a PDF which were equivalent in content, the Office Document always ranked higher.

The reason is that the SP2013 default rank model heavily weighs PowerPoint and Word file types while PDFs are not weighted.  In fact, in the SP2013 default rank model, Office Documents are the most heavily weighted of all document types.  Since a PDF is not weighted in the default rank model, it receives a default weighting that is lower than Office Documents' weightings.

In SPO, this behavior has "flipped," not because PDFs are weighted higher, but because Office Documents are now weighted lower than the default File Type.  Since PDFs still have no defined weighting, they receive the default weighting, which in SPO is higher than the Office Document weightings.

Adding to the foregoing confusion is this question: "What defines the default weight for a File Type?"  During indexing, every file is assigned an "Internal File Type."  For example, a Word document is InternalFileType=1, a PowerPoint is InternalFileType=2, and a PDF is InternalFileType=15.  While the default rank models in SPO and SP2013 define weights for Types 1 and 2, they do not define a weight for Type 15.  During rank score calculation, a PDF's File Type is therefore transformed from 15 to 0.  This transformation is clearly seen in the rank log output (see the Search Query Tool for more on how to get a rank log).  The default rank models do define a weighting for Type 0, but they label this type as HTML.  So, in reality, any undefined File Type receives the same weighting as an HTML file: in SPO that weighting is higher than the one for Office Documents, while in SP2013 it is lower. 8-O
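To make the fallback concrete, here is a minimal Python sketch of the lookup described above.  It is only an illustration: the real rank model is not a Python dictionary, and the names `SPO_BUCKET_ADDS`, `DEFAULT_BUCKET`, and `file_type_add` are made up for this example.  The two values are the hidden_nodes_adds numbers from the SPO rank log excerpts in this post; no other File Type weights are assumed.

```python
# Illustrative sketch only; the names here are hypothetical and are not
# part of any SharePoint API.
DEFAULT_BUCKET = 0  # labeled HTML in the rank model; also the fallback type

# hidden_nodes_adds values taken from the SPO rank log excerpts in this post.
SPO_BUCKET_ADDS = {
    0: 0.39136,     # HTML / default
    2: -0.0780378,  # PowerPoint
}


def file_type_add(internal_file_type, bucket_adds=SPO_BUCKET_ADDS):
    """Return (transformed_type, used_default, add) for an InternalFileType."""
    if internal_file_type in bucket_adds:
        # The model defines this type, so it keeps its own value.
        return internal_file_type, False, bucket_adds[internal_file_type]
    # Undefined types (such as 15 for PDF) fall back to Type 0.
    return DEFAULT_BUCKET, True, bucket_adds[DEFAULT_BUCKET]


print(file_type_add(2))   # PowerPoint keeps Type 2
print(file_type_add(15))  # PDF is transformed to Type 0
```

Any other undefined type would take the same path as the PDF here, which is why every unlisted File Type ends up ranked like an HTML file.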

Below is an excerpt from a rank log, using the default SPO rank model, of a PowerPoint document.  It shows InternalFileType=2 (PPT) getting a boost of -0.078.

<bucketed_static_feature name='InternalFileType' property_name='InternalFileType' used_default='0'
raw_value_in='2' raw_value='2' raw_value_transformed='2'
hidden_nodes_adds=' -0.0780378'/>

Compare to the following excerpt from a rank log using the default SPO rank model of a PDF document showing an initial InternalFileType=15 (PDF), then transforming the value to 0 (HTML/default), and applying a boost of 0.391.

<bucketed_static_feature name='InternalFileType' property_name='InternalFileType' used_default='1'
raw_value_in='15' raw_value='15' raw_value_transformed='0'
hidden_nodes_adds='0.39136'/>
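If you want to compare these entries without reading the XML by eye, a short script can pull out the relevant attributes.  The sketch below uses Python's standard xml.etree.ElementTree module on the two excerpts above, copied verbatim.  The `summarize` helper is a hypothetical name for this example, and a full rank log contains many more elements than are handled here.

```python
import xml.etree.ElementTree as ET

# The two rank log excerpts from this post, copied verbatim.
PPT_EXCERPT = """<bucketed_static_feature name='InternalFileType' property_name='InternalFileType' used_default='0'
raw_value_in='2' raw_value='2' raw_value_transformed='2'
hidden_nodes_adds=' -0.0780378'/>"""

PDF_EXCERPT = """<bucketed_static_feature name='InternalFileType' property_name='InternalFileType' used_default='1'
raw_value_in='15' raw_value='15' raw_value_transformed='0'
hidden_nodes_adds='0.39136'/>"""


def summarize(fragment):
    """Extract the File Type fields from one bucketed_static_feature element."""
    node = ET.fromstring(fragment)
    return {
        "raw": int(node.get("raw_value_in")),
        "transformed": int(node.get("raw_value_transformed")),
        "used_default": node.get("used_default") == "1",
        # float() tolerates the leading space in ' -0.0780378'.
        "add": float(node.get("hidden_nodes_adds")),
    }


ppt = summarize(PPT_EXCERPT)
pdf = summarize(PDF_EXCERPT)
print(ppt)
print(pdf)
# Gap between the PDF and PowerPoint values for this one feature.
print(round(pdf["add"] - ppt["add"], 4))
```

For these two excerpts the gap on the InternalFileType feature is roughly 0.47 in favor of the PDF.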

To address this issue in SP2013 or SPO, you should define an appropriate weight for PDFs in a custom rank model or use the XRANK operator to apply an appropriate boost for PDFs.
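As a rough sketch of the XRANK option, the helper below wraps a user's query in a KQL XRANK expression with a constant boost (cb) for items matching FileType:pdf.  The function name `with_xrank_boost` is hypothetical, and the boost value is only a placeholder: the right number (and its sign) depends on your rank model and content, so tune it against your own rank logs.

```python
def with_xrank_boost(user_query, rank_expression="FileType:pdf", constant_boost=1.0):
    """Build a KQL query that adds a constant boost to items matching rank_expression.

    KQL form: <match expression> XRANK(cb=<value>) <rank expression>
    The default boost of 1.0 is a placeholder, not a recommended value.
    """
    return f"({user_query}) XRANK(cb={constant_boost}) {rank_expression}"


# Example: boost PDFs for a query on "budget report".
print(with_xrank_boost("budget report", constant_boost=0.5))
```

XRANK only changes the rank score of items that already match the query, so it does not change which results come back, only their order.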

Comments

  • Anonymous
    April 22, 2017
    Why did you implement the default values to give "bad results"? Why does it matter what container format something has? It should only be the content that matters. There should be some machine learning automatically applied to the search algorithm so it gives the best result. When I use Bing I don't have to define the ranking model to display great results. Why do I need to do that in SP?//ranting complete