

Create your own web scraper using Node.js and get data in JSON format

Want to make your own scraper to scrape any data from any website and return it in JSON format so you can use it anywhere you like? If yes, then you are at the right place.

In this article I will guide you through scraping a website to get the desired data using Node.js and returning that data in JSON format, which you can then use, e.g., to build an app that runs on live data from the internet.

I will be using Windows 10 x64 and VS 2015 for this article, and will scrape data from a news website, i.e.

var url = "http://www.thenews.com.pk/CitySubIndex.aspx?ID=14";
  • Modify the router.get() function as follows:
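router.get('/', function (req, res) {
    request(url, function (error, response, body) {
        // Report failures to the client instead of leaving the request hanging.
        if (error || response.statusCode !== 200) {
            console.log(error);
            return res.status(500).send('Unable to fetch the page');
        }
        // Parse the HTML and send the extracted data back as JSON.
        var data = scrapeDataFromHtml(body);
        res.send(data);
    });
});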

https://2.bp.blogspot.com/-OYxZo1mabII/VmQKlyzykDI/AAAAAAAAATI/PeHXsfstPZ4/s640/node6.gif

var scrapeDataFromHtml = function (html) {
    var data = {};
    var $ = cheerio.load(html);
    var j = 1;
    // Each news item sits in its own container div.
    $('div.DetailPageIndexBelowContainer').each(function () {
        var a = $(this);
        var fullNewsLink = a.children().children().attr("href");
        var headline = a.children().first().text().trim();
        var description = a.children().children().children().last().text();
        var metadata = {
            headline: headline,
            description: description,
            fullNewsLink: fullNewsLink
        };

        // Store the items under the keys 1, 2, 3, ...
        data[j] = metadata;
        j++;
    });
    return data;
};

This function selects each ‘div’ with the class ‘DetailPageIndexBelowContainer’ and traverses its DOM to fetch the ‘fullNewsLink’, ‘headline’ and ‘description’. It then puts these values in an object called ‘metadata’. I have another object called ‘data’, and on each iteration the ‘metadata’ values are stored in it under the next number, so in the end I can return ‘data’ as JSON. If you only want one thing from a website, you don’t need a loop or the second object. You can reach the element directly by traversing the DOM and return the single object.
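
To make the shape of the result concrete, here is a minimal sketch of what the ‘data’ object looks like once it is turned into JSON. The headlines, descriptions and links are invented placeholders, not real output from the site, and toArray is a hypothetical helper (not part of the original code) for cases where you would rather consume a JSON array than an object keyed by 1, 2, 3.

```javascript
// Sample of the object built by scrapeDataFromHtml.
// All values below are invented placeholders, not real scraped data.
var data = {
    1: { headline: "First headline", description: "First description", fullNewsLink: "/first" },
    2: { headline: "Second headline", description: "Second description", fullNewsLink: "/second" }
};

// Hypothetical helper: turn the numbered object into a plain array, in key order.
function toArray(numbered) {
    return Object.keys(numbered)
        .sort(function (a, b) { return Number(a) - Number(b); })
        .map(function (key) { return numbered[key]; });
}

// The same content the route sends back, pretty-printed here for reading.
var json = JSON.stringify(data, null, 2);
var list = toArray(data);

console.log(json);
console.log(list.length);
```

Keeping the numbered object matches the code above. The array form is only a convenience if the app that reads the JSON prefers to loop over a list.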

Now run it and check the output:

https://4.bp.blogspot.com/-XPhdStWXmxE/VmQKlA-xrfI/AAAAAAAAAS0/apTW4v-8UZ8/s640/node12.JPG

And yes! It’s running perfectly and returning the required data in JSON format.

PS: If the site I am using as an example removes the page, changes its layout, or changes its CSS files or class names, this code will no longer return the desired result, and you will have to write new logic for it. The approach stays the same, though: I have explained the logic and how to traverse the DOM tree of any website.
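
One way to notice such a change early is to check the scraped object before sending it. The sketch below is only an illustration under my own assumptions: findScrapeProblems is a hypothetical helper that is not in the source code, and it assumes the result has the same shape as above (numbered items with ‘headline’, ‘description’ and ‘fullNewsLink’).

```javascript
// Hypothetical check: list the reasons a scrape result looks wrong,
// or return an empty array if every item has all three fields.
function findScrapeProblems(data) {
    var problems = [];
    var keys = Object.keys(data);

    // Nothing matched at all: the container class has probably changed.
    if (keys.length === 0) {
        problems.push("no items found: the container selector may no longer match");
    }

    // Something matched, but the inner structure may have moved.
    keys.forEach(function (key) {
        ["headline", "description", "fullNewsLink"].forEach(function (field) {
            if (!data[key][field]) {
                problems.push("item " + key + " is missing " + field);
            }
        });
    });

    return problems;
}

console.log(findScrapeProblems({}));
console.log(findScrapeProblems({
    1: { headline: "Sample", description: "", fullNewsLink: undefined }
}));
```

If the returned list is not empty, the selectors in scrapeDataFromHtml most likely need to be updated to match the new page.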

Source Code: https://github.com/umerqureshi93/webscrapper