Scrape SharePoint sites for their markdown content

Saketh Nalla 6 Reputation points
2024-12-11T17:15:36.6333333+00:00

Hello,

I'm trying to scrape a site that links to multiple SharePoint sites as well as dynamic and static HTML pages, and I need their markdown/HTML content. The sites are on SharePoint Server. I can already scrape the dynamic and static HTML pages for their HTML content using 'HtmlAgilityPack' in a .NET web service.

I need help finding the right APIs to get the HTML content of the SharePoint sites. I know I have to get an authentication token using the credentials (ClientID, ClientSecret and TenantID) of the app I created in Azure AD, and I found that I have to grant that app the "Sites.Read.All" and "Sites.Manage.All" permissions. I tried testing these APIs in Graph Explorer, but I couldn't find one that meets my objective.

Thank you.

Microsoft Graph
SharePoint Server Development

1 answer

  1. Emily Du-MSFT 48,646 Reputation points Microsoft Vendor
    2024-12-12T08:16:41.9+00:00

    Here are the steps to get the HTML content of a SharePoint page using the REST API.

    1. Use the ClientID, ClientSecret and TenantID to get an access token.

    Reference: https://www.billyperalta.com/Accessing%20SharePoint%20REST%20API%20using%20Postman/

    2. Request the page's Title and CanvasContent1 (the page content) from the 'Site Pages' list, either by item ID:

    https://contoso.sharepoint.com/sites/SiteName/_api/web/lists/getbytitle('Site Pages')/items(1)?$select=Title,CanvasContent1
    

    Or by page name:

    https://contoso.sharepoint.com/sites/SiteName/_api/web/lists/getbytitle('Site Pages')/items?$select=Title,FileLeafRef,CanvasContent1&$filter=FileLeafRef eq 'Employee.aspx'
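
The two requests in step 2 could be built and parsed roughly as follows. This is a minimal sketch, not a confirmed implementation: the function names are hypothetical, the site URL is the placeholder from the answer, and the `Accept: application/json;odata=nometadata` header (with its `value` array for collections) is an assumption about the response format. The access token is assumed to come from step 1.

```python
import urllib.parse
import urllib.request

# Placeholder site URL taken from the answer's examples.
SITE_URL = "https://contoso.sharepoint.com/sites/SiteName"
PAGES_LIST = "/_api/web/lists/getbytitle('Site%20Pages')/items"


def _query(params: dict) -> str:
    # Percent-encode values but keep "$" and "," readable, as in the answer's URLs.
    return urllib.parse.urlencode(params, safe="$,", quote_via=urllib.parse.quote)


def page_by_id_url(site_url: str, item_id: int) -> str:
    """URL for one Site Pages item, selected by its list item ID."""
    return f"{site_url}{PAGES_LIST}({item_id})?" + _query(
        {"$select": "Title,CanvasContent1"})


def page_by_name_url(site_url: str, page_name: str) -> str:
    """URL for Site Pages items whose file name equals page_name."""
    escaped = page_name.replace("'", "''")  # OData escapes a quote by doubling it
    return f"{site_url}{PAGES_LIST}?" + _query({
        "$select": "Title,FileLeafRef,CanvasContent1",
        "$filter": f"FileLeafRef eq '{escaped}'",
    })


def build_request(url: str, access_token: str) -> urllib.request.Request:
    """Attach the bearer token and ask for plain JSON (assumed response format)."""
    return urllib.request.Request(url, headers={
        "Authorization": f"Bearer {access_token}",
        "Accept": "application/json;odata=nometadata",
    })


def extract_canvas_html(payload: dict) -> list:
    """Collect CanvasContent1 from a single-item or collection payload."""
    items = payload["value"] if "value" in payload else [payload]
    return [item["CanvasContent1"] for item in items if item.get("CanvasContent1")]
```

To send a request, pass the result of `build_request(...)` to `urllib.request.urlopen`, decode the body with `json.load`, and hand the resulting dictionary to `extract_canvas_html`. The returned HTML strings can then go through the same HTML parsing already used for the other pages.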
    

    Note: Microsoft is providing this information as a convenience to you. The sites are not controlled by Microsoft. Microsoft cannot make any representations regarding the quality, safety, or suitability of any software or information found there. Please make sure that you completely understand the risk before retrieving any suggestions from the above link.
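
For step 1, the token request might look like the sketch below. It assumes the standard OAuth 2.0 client-credentials grant against the Microsoft identity platform v2.0 token endpoint, with a `https://<host>/.default` scope. That may not be the flow the linked reference describes, and some SharePoint configurations may reject secret-based app-only tokens and require a certificate instead. `build_token_request` is a hypothetical helper; it only builds the request and does not send it.

```python
import urllib.parse
import urllib.request


def build_token_request(tenant_id: str, client_id: str, client_secret: str,
                        resource_host: str) -> urllib.request.Request:
    """Build (but do not send) an OAuth 2.0 client-credentials token request."""
    # Assumed endpoint: Microsoft identity platform v2.0 token endpoint.
    token_url = f"https://login.microsoftonline.com/{tenant_id}/oauth2/v2.0/token"
    form = {
        "grant_type": "client_credentials",
        "client_id": client_id,
        "client_secret": client_secret,
        # Assumed scope: all app permissions already granted for this host.
        "scope": f"https://{resource_host}/.default",
    }
    body = urllib.parse.urlencode(form).encode("ascii")
    return urllib.request.Request(token_url, data=body, method="POST")
```

Sending the request with `urllib.request.urlopen` and reading `access_token` from the JSON response would yield the bearer token used in the page requests.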



