SharePoint 2013: Crawled Properties for HTML Meta tags

Introduction

 

One of the SharePoint data sources is HTML Web pages. The SharePoint crawler retrieves metadata from Web content as crawled properties. A default set of crawled properties is available before you perform the first crawl.

The list of crawled properties, as well as the list of crawled property categories, changes after the first crawl to include the additional crawled properties and categories found during that crawl.

One example of custom crawled properties that SharePoint retrieves from Web pages is the content of HTML meta tags.

Overview

The crawled properties that the SharePoint search crawler creates for custom HTML meta tags are added to the following crawled property categories:

  • Web
  • Document Parser

Web is a standard crawled property category. Crawled properties are added to this category under the name of the custom meta tag, converted to uppercase.

Document Parser is a custom crawled property category. Crawled properties are added to this category under the name of the custom meta tag, in its original case.

 Example:

The Web page http://authors.library.caltech.edu/37214/ contains the following meta tag -

 <meta name="eprints.citation" content="  DiMarco, E. Joseph and Khabiboulline, Emil and Orris, Darryl F. and Tartaglia, Michael A. and Terechkine, Iouri  (2013) Superconducting Solenoid Lens for a High Energy Part of a Proton Linac Front End.  IEEE Transactions on Applied Superconductivity, 23  (3).   Art. No. 4100905.  ISSN 1051-8223     http://resolver.caltech.edu/CaltechAUTHORS:20130228-145009767  " />

 After crawling this page, the SharePoint crawler creates the following crawled properties:

Web category -

 Property name: EPRINTS.CITATION
 Category: Web
 Property Set ID: d1b5d3f0-c0b3-11cf-9a92-00a0c908dbf1 

Document Parser category -

 Property name: eprints.citation
 Category: Document Parser
 Property Set ID: 64ae120f-487d-445a-8d5a-5258f99cb970
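
The naming rule shown in this example can be sketched in a few lines of Python. This is only an illustration and not part of SharePoint: the class MetaTagCollector and the function expected_crawled_properties are hypothetical names, the sample page is reduced to a single meta tag with placeholder content, and the sketch assumes that every named meta tag yields one crawled property in each category, with the property set IDs listed above.

```python
from html.parser import HTMLParser

# Property set IDs copied from the example above.
WEB_PROPSET = "d1b5d3f0-c0b3-11cf-9a92-00a0c908dbf1"
DOCUMENT_PARSER_PROPSET = "64ae120f-487d-445a-8d5a-5258f99cb970"


class MetaTagCollector(HTMLParser):
    """Collects the name attribute of every <meta name="..."> tag."""

    def __init__(self):
        super().__init__()
        self.names = []

    def handle_starttag(self, tag, attrs):
        # Also invoked for self-closing tags such as <meta ... />.
        if tag == "meta":
            name = dict(attrs).get("name")
            if name:
                self.names.append(name)


def expected_crawled_properties(html):
    """Predict (category, property name, property set ID) for each meta tag.

    Assumption: one property per category for every named meta tag.
    """
    collector = MetaTagCollector()
    collector.feed(html)
    collector.close()
    predicted = []
    for name in collector.names:
        # Web category: meta tag name in uppercase.
        predicted.append(("Web", name.upper(), WEB_PROPSET))
        # Document Parser category: meta tag name in its original case.
        predicted.append(("Document Parser", name, DOCUMENT_PARSER_PROPSET))
    return predicted


# Placeholder content; only the name attribute matters for the rule.
page = '<html><head><meta name="eprints.citation" content="placeholder" /></head></html>'
for row in expected_crawled_properties(page):
    print(row)
```

Running the sketch against a saved copy of a page gives a quick checklist of property names to look for in the search schema after the crawl.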

 

HTML meta tags and corresponding crawled properties

 

Different publishers use many different meta tag namespaces.

The most common are:

  • Highwire Press tags (e.g., citation_title)
  • Eprints tags (e.g., eprints.title)
  • Dublin Core tags (e.g., DC.title)
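
As a rough illustration, the namespace of a meta tag can usually be recognized from the prefix of its name. The sketch below assumes the three prefixes seen in the examples above (citation_, eprints., DC.) and compares them case-insensitively, so that uppercase Web category names are recognized as well. The function name meta_tag_namespace is hypothetical.

```python
# Prefixes taken from the examples in the list above (stored in lowercase).
NAMESPACE_PREFIXES = {
    "citation_": "Highwire Press",
    "eprints.": "Eprints",
    "dc.": "Dublin Core",
}


def meta_tag_namespace(name):
    """Return the namespace a meta tag name appears to belong to, or None."""
    lowered = name.lower()
    for prefix, namespace in NAMESPACE_PREFIXES.items():
        if lowered.startswith(prefix):
            return namespace
    return None


for tag_name in ("citation_title", "EPRINTS.TITLE", "DC.title", "description"):
    print(tag_name, "->", meta_tag_namespace(tag_name))
```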

  

2.1 Crawled properties for Citation Meta tags

 

| Web                       | Document Parser           | Content       |
|---------------------------|---------------------------|---------------|
| CITATION_ABSTRACT         | citation_abstract         | abstract      |
| CITATION_AUTHOR           | citation_author           | author        |
| CITATION_AUTHORS          | citation_authors          | authors       |
| CITATION_DATE             | citation_date             | date          |
| CITATION_PUBLICATION_DATE | citation_publication_date | date          |
| CITATION_ONLINE_DATE      | citation_online_date      | date          |
| CITATION_YEAR             | citation_year             | date          |
| CITATION_DOI              | citation_doi              | doi           |
| CITATION_FIRST_PAGE       | citation_first_page       | first page    |
| CITATION_ID               | citation_id               |               |
| CITATION_ISSN             | citation_issn             | issn          |
| CITATION_ISBN             | citation_isbn             | isbn          |
| CITATION_ISSUE            | citation_issue            | issue         |
| CITATION_JOURNAL_TITLE    | citation_journal_title    | journal title |
| CITATION_LAST_PAGE        | citation_last_page        | last page     |
| CITATION_PUBLISHER        | citation_publisher        | publisher     |
| CITATION_TITLE            | citation_title            | title         |
| CITATION_VOLUME           | citation_volume           | volume        |

   

2.2 Crawled properties for Eprints Meta tags

 

| Web                   | Document Parser       | Content       |
|-----------------------|-----------------------|---------------|
| EPRINTS.ABSTRACT      | eprints.abstract      | abstract      |
| EPRINTS.CREATORS_NAME | eprints.creators_name | author        |
| EPRINTS.DATE          | eprints.date          | date          |
| EPRINTS.ID_NUMBER     | eprints.id_number     |               |
| EPRINTS.CITATION      | eprints.citation      |               |
| EPRINTS.PAGERANGE     | eprints.pagerange     | pages         |
| EPRINTS.ISSN          | eprints.issn          | issn          |
| EPRINTS.NUMBER        | eprints.number        | issue         |
| EPRINTS.PUBLICATION   | eprints.publication   | journal title |
| EPRINTS.PUBLISHER     | eprints.publisher     | publisher     |
| EPRINTS.VOLUME        | eprints.volume        | volume        |
| EPRINTS.OFFICIAL_URL  | eprints.official_url  |               |
| EPRINTS.TITLE         | eprints.title         | title         |
| EPRINTS.TYPE          | eprints.type          | document type |

  

2.3 Crawled properties for Dublin Core Meta tags

 

| Web                     | Document Parser         | Content       |
|-------------------------|-------------------------|---------------|
| DC.DESCRIPTION.ABSTRACT | DC.description.abstract | abstract      |
| DC.DESCRIPTION          | DC.description          | abstract      |
| DC.CREATOR.PERSONALNAME | DC.creator.personalname | author        |
| DC.CREATOR              | DC.creator              | author        |
| DC.CONTRIBUTOR          | DC.contributor          | author        |
| DC.DATE.CREATED         | DC.date.created         | date          |
| DC.DATE                 | DC.date                 | date          |
| DC.IDENTIFIER.DOI       | DC.identifier.doi       | doi           |
| DC.CITATION.PAGE        | DC.citation.page        | pages         |
| DC.IDENTIFIER.ISSN      | DC.identifier.issn      | issn          |
| DC.SOURCE.ISSN          | DC.source.issn          | issn          |
| DC.PUBLISHER            | DC.publisher            | publisher     |
| DC.CITATION.VOLUME      | DC.citation.volume      | volume        |
| DC.IDENTIFIER           | DC.identifier           |               |
| DC.RELATION             | DC.relation             |               |
| DC.SOURCE               | DC.source               |               |
| DC.TITLE                | DC.title                | title         |
| DC.TYPE                 | DC.type                 | document type |
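
Because the three namespaces describe the same kinds of content under different names, it can help to group crawled properties by the Content column of the tables above. The following is a minimal sketch of that idea, using only an excerpt of the tables (title, author and doi rows). The names CONTENT_FIELD and group_by_content are hypothetical, and the sample values are made up for illustration.

```python
# Excerpt of the three tables above: Document Parser property name -> content.
CONTENT_FIELD = {
    "citation_title": "title",
    "eprints.title": "title",
    "DC.title": "title",
    "citation_author": "author",
    "eprints.creators_name": "author",
    "DC.creator": "author",
    "citation_doi": "doi",
    "DC.identifier.doi": "doi",
}


def group_by_content(properties):
    """Group crawled property values by the content they describe.

    `properties` maps a Document Parser property name to its value.
    Names that are not listed in CONTENT_FIELD are skipped.
    """
    grouped = {}
    for name, value in properties.items():
        field = CONTENT_FIELD.get(name)
        if field is not None:
            grouped.setdefault(field, []).append(value)
    return grouped


# Made-up sample values; DC.relation has no content entry and is skipped.
sample = {
    "DC.title": "Example title",
    "citation_author": "Example Author",
    "DC.relation": "example relation",
}
print(group_by_content(sample))
```

The same grouping is a reasonable starting point when deciding which crawled properties to map to a single managed property.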