SharePoint 2013: Crawled Properties for HTML Meta tags
Introduction
HTML Web pages are one of the SharePoint data sources. The SharePoint crawler retrieves metadata from Web content as crawled properties. A default set of crawled properties is available before you perform the first crawl.
After the first crawl, the list of crawled properties and the list of crawled property categories change to include the additional crawled properties and categories found during that crawl.
One example of custom crawled properties that SharePoint retrieves from Web pages is the content of HTML meta tags.
Overview
The crawled properties that the SharePoint search crawler creates for custom HTML meta tags are added to the following crawled property categories:
- Web
- Document Parser
Web is a standard crawled property category. A crawled property is added to this category under the name of the custom meta tag, converted to uppercase.
Document Parser is a custom crawled property category. A crawled property is added to this category under the name of the custom meta tag, in its original case.
Example:
The Web page http://authors.library.caltech.edu/37214/ contains the following meta tag:
<meta name="eprints.citation" content=" DiMarco, E. Joseph and Khabiboulline, Emil and Orris, Darryl F. and Tartaglia, Michael A. and Terechkine, Iouri (2013) Superconducting Solenoid Lens for a High Energy Part of a Proton Linac Front End. IEEE Transactions on Applied Superconductivity, 23 (3). Art. No. 4100905. ISSN 1051-8223 http://resolver.caltech.edu/CaltechAUTHORS:20130228-145009767 " />
After crawling this page, the SharePoint crawler creates the following crawled properties:
Web category -
Property name: EPRINTS.CITATION
Category: Web
Property Set ID: d1b5d3f0-c0b3-11cf-9a92-00a0c908dbf1
Document Parser category -
Property name: eprints.citation
Category: Document Parser
Property Set ID: 64ae120f-487d-445a-8d5a-5258f99cb970
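The naming rule shown in this example can be sketched in a few lines of Python. This is only an illustration of the rule, not SharePoint code: the `MetaTagCollector` class and the `crawled_properties_for` function are hypothetical names, the sample HTML is shortened, and the Property Set IDs are the two values quoted in the example.

```python
from html.parser import HTMLParser

# Property Set IDs as quoted in the example above.
WEB_PROPSET = "d1b5d3f0-c0b3-11cf-9a92-00a0c908dbf1"
DOCUMENT_PARSER_PROPSET = "64ae120f-487d-445a-8d5a-5258f99cb970"


class MetaTagCollector(HTMLParser):
    """Hypothetical helper: collects the name of every <meta name="..."> tag."""

    def __init__(self):
        super().__init__()
        self.names = []

    def handle_starttag(self, tag, attrs):
        # Self-closing tags such as <meta ... /> also end up here by default.
        if tag == "meta":
            name = dict(attrs).get("name")
            if name:
                self.names.append(name)


def crawled_properties_for(meta_name):
    """Return the two crawled properties described above for one meta tag name."""
    return [
        # Web category: same name, converted to uppercase.
        {"name": meta_name.upper(), "category": "Web", "propset": WEB_PROPSET},
        # Document Parser category: same name, original case.
        {"name": meta_name, "category": "Document Parser",
         "propset": DOCUMENT_PARSER_PROPSET},
    ]


# Shortened sample page; the content value is a placeholder.
sample_html = '<html><head><meta name="eprints.citation" content="Sample citation" /></head></html>'

collector = MetaTagCollector()
collector.feed(sample_html)
for name in collector.names:
    for prop in crawled_properties_for(name):
        print(prop["category"], prop["name"], prop["propset"])
```

Running the sketch against a saved copy of a page gives a quick preview of the crawled property names to look for after the crawl.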
HTML meta tags and corresponding crawled properties
Different publishers use different meta tag namespaces.
The most common are:
- Highwire Press tags (e.g., citation_title)
- Eprints tags (e.g., eprints.title)
- Dublin Core tags (e.g., DC.title)
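The example names above suggest that the namespace of a tag can usually be recognized from its prefix. A minimal Python sketch of that idea follows; the `detect_namespace` function is a hypothetical helper, and treating the prefixes case-insensitively is an assumption made here, not something SharePoint is documented to do.

```python
# Prefixes taken from the example tag names in the list above.
NAMESPACE_PREFIXES = {
    "citation_": "Highwire Press",
    "eprints.": "Eprints",
    "DC.": "Dublin Core",
}


def detect_namespace(meta_name):
    """Return the namespace whose prefix matches meta_name, or None."""
    lowered = meta_name.lower()
    for prefix, namespace in NAMESPACE_PREFIXES.items():
        # Assumption: compare case-insensitively, so "dc.title" also matches.
        if lowered.startswith(prefix.lower()):
            return namespace
    return None


for tag_name in ("citation_title", "eprints.title", "DC.title", "keywords"):
    print(tag_name, "->", detect_namespace(tag_name))
```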
2.1 Crawled properties for Citation Meta tags
| Web | Document Parser | Content |
| --- | --- | --- |
| CITATION_ABSTRACT | citation_abstract | abstract |
| CITATION_AUTHOR | citation_author | author |
| CITATION_AUTHORS | citation_authors | authors |
| CITATION_DATE | citation_date | date |
| CITATION_PUBLICATION_DATE | citation_publication_date | date |
| CITATION_ONLINE_DATE | citation_online_date | date |
| CITATION_YEAR | citation_year | date |
| CITATION_DOI | citation_doi | doi |
| CITATION_FIRST_PAGE | citation_first_page | first page |
| CITATION_ID | citation_id | |
| CITATION_ISSN | citation_issn | issn |
| CITATION_ISBN | citation_isbn | isbn |
| CITATION_ISSUE | citation_issue | issue |
| CITATION_JOURNAL_TITLE | citation_journal_title | journal title |
| CITATION_LAST_PAGE | citation_last_page | last page |
| CITATION_PUBLISHER | citation_publisher | publisher |
| CITATION_TITLE | citation_title | title |
| CITATION_VOLUME | citation_volume | volume |
2.2 Crawled properties for Eprints Meta tags
| Web | Document Parser | Content |
| --- | --- | --- |
| EPRINTS.ABSTRACT | eprints.abstract | abstract |
| EPRINTS.CREATORS_NAME | eprints.creators_name | author |
| EPRINTS.DATE | eprints.date | date |
| EPRINTS.ID_NUMBER | eprints.id_number | |
| EPRINTS.CITATION | eprints.citation | |
| EPRINTS.PAGERANGE | eprints.pagerange | pages |
| EPRINTS.ISSN | eprints.issn | issn |
| EPRINTS.NUMBER | eprints.number | issue |
| EPRINTS.PUBLICATION | eprints.publication | journal title |
| EPRINTS.PUBLISHER | eprints.publisher | publisher |
| EPRINTS.VOLUME | eprints.volume | volume |
| EPRINTS.OFFICIAL_URL | eprints.official_url | |
| EPRINTS.TITLE | eprints.title | title |
| EPRINTS.TYPE | eprints.type | document type |
2.3 Crawled properties for Dublin Core Meta tags
| Web | Document Parser | Content |
| --- | --- | --- |
| DC.DESCRIPTION.ABSTRACT | DC.description.abstract | abstract |
| DC.DESCRIPTION | DC.description | abstract |
| DC.CREATOR.PERSONALNAME | DC.creator.personalname | author |
| DC.CREATOR | DC.creator | author |
| DC.CONTRIBUTOR | DC.contributor | author |
| DC.DATE.CREATED | DC.date.created | date |
| DC.DATE | DC.date | date |
| DC.IDENTIFIER.DOI | DC.identifier.doi | doi |
| DC.CITATION.PAGE | DC.citation.page | pages |
| DC.IDENTIFIER.ISSN | DC.identifier.issn | issn |
| DC.SOURCE.ISSN | DC.source.issn | issn |
| DC.PUBLISHER | DC.publisher | publisher |
| DC.CITATION.VOLUME | DC.citation.volume | volume |
| DC.IDENTIFIER | DC.identifier | |
| DC.RELATION | DC.relation | |
| DC.SOURCE | DC.source | |
| DC.TITLE | DC.title | title |
| DC.TYPE | DC.type | document type |
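
The Content column in the three tables above shows that tags from different namespaces often carry the same kind of information. One way to use the tables, sketched here in Python, is to look up a content field regardless of which namespace a page uses. The `CONTENT_FIELD_SOURCES` dictionary and the `content_field` function are hypothetical, only a few rows of the tables are included, the lookup order is an arbitrary choice, and the sample values are for illustration only.

```python
# Document Parser property names per content field, copied from the tables above.
# Only a few fields are listed; the order within each list is an arbitrary choice.
CONTENT_FIELD_SOURCES = {
    "title": ["citation_title", "eprints.title", "DC.title"],
    "author": ["citation_author", "eprints.creators_name", "DC.creator"],
    "date": ["citation_date", "eprints.date", "DC.date"],
    "issn": ["citation_issn", "eprints.issn", "DC.identifier.issn"],
}


def content_field(crawled_values, field):
    """Return the first non-empty value found for the given content field."""
    for property_name in CONTENT_FIELD_SOURCES[field]:
        value = crawled_values.get(property_name)
        if value:
            return value
    return None


# Illustrative crawled values for a page that uses Eprints tags.
page = {
    "eprints.title": "Superconducting Solenoid Lens for a High Energy Part of a Proton Linac Front End",
    "eprints.issn": "1051-8223",
}

print(content_field(page, "title"))
print(content_field(page, "issn"))
print(content_field(page, "author"))  # no author tag in the sample, so None
```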