How to write a scraper (JavaScript)


The job of a scraper is to retrieve the list of titles available on a content provider's website and generate a valid PKB file with this information.

Get started with a pre-existing model

The scrapers are stored in the ezpaarse/platforms-scraper folder and organized by platform. Not every platform needs a scraper. The current list is available on GitHub.

A scraper for a platform is named using the syntax scrape_platform_ressource.js, where platform is the name of the platform and ressource is the kind of data it covers (journals, books, encyclopedia, AllTitles, etc.).
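
As a small illustration of this naming convention, the helper below builds a scraper file name from a platform name and a resource type. The function scraperFileName is hypothetical and is not part of ezPAARSE; it only spells out the pattern described above.

```javascript
// Hypothetical helper (not part of ezPAARSE): builds a scraper file name
// following the scrape_platform_ressource.js convention.
function scraperFileName(platform, ressource) {
  if (!platform || !ressource) {
    throw new Error('both platform and ressource are required');
  }
  return 'scrape_' + platform + '_' + ressource + '.js';
}

console.log(scraperFileName('acs', 'journals')); // scrape_acs_journals.js
```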

There is no standard format for writing a scraper: the HTML pages of publisher sites are structured too differently. However, there is a library (see below) that lets you initialize an element, enrich it, and add it to a KBART list.

Common functionality is shared through the pkbrows library, which helps generate KBART files with normalized names when creating parsers.

Calling pkbrows

var PkbRows = require('../.lib/pkbrows.js');
var pkb = new PkbRows('acs');

The first line loads the library. The second declares pkb, with a parameter indicating which platform the scraper works for, so that the KBART file is automatically written to the corresponding folder. In this example, acs is the short name of the American Chemical Society platform.

Initializing the elements

// initialize a KBART record
var info = {};
info = pkb.initRow(info);

The element is initialized with all the mandatory KBART fields.
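
To give an idea of what this initialization might do, here is a minimal sketch. The initRow function below is a stand-in written for this page, not the actual pkbrows code; the field list follows the KBART Phase I column names, and the real library may use a different or longer list.

```javascript
// Stand-in for pkb.initRow (assumption: the real implementation may differ).
// Column names follow KBART Phase I.
var KBART_FIELDS = [
  'publication_title', 'print_identifier', 'online_identifier',
  'date_first_issue_online', 'num_first_vol_online', 'num_first_issue_online',
  'date_last_issue_online', 'num_last_vol_online', 'num_last_issue_online',
  'title_url', 'first_author', 'title_id',
  'embargo_info', 'coverage_depth', 'coverage_notes', 'publisher_name'
];

// Give every missing field an empty value, keeping values already present.
function initRow(row) {
  KBART_FIELDS.forEach(function (field) {
    if (row[field] === undefined) { row[field] = ''; }
  });
  return row;
}

var info = initRow({});
console.log(Object.keys(info).length); // 16
```

Starting from empty strings rather than missing properties means every record produces the same number of columns, even when a page provides only a title.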

Enriching each element

info.publication_title  = $('#journalLogo img').attr('alt');

Every KBART field of the element can be filled in this way.
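
Because the record appears to be a plain object, a misspelled field name would silently create an extra property instead of filling a KBART column. The sketch below shows one possible guard. setField is a hypothetical helper, not part of pkbrows, and it assumes the record was initialized so that every valid field already exists; the journal title is sample data.

```javascript
// Hypothetical helper: only sets fields that already exist on the record,
// and trims surrounding whitespace left over from the HTML.
function setField(row, field, value) {
  if (!Object.prototype.hasOwnProperty.call(row, field)) {
    throw new Error('unknown KBART field: ' + field);
  }
  row[field] = String(value).trim();
  return row;
}

// Stand-in for an initialized record (only two fields shown).
var info = { publication_title: '', title_url: '' };
setField(info, 'publication_title', '  Accounts of Chemical Research ');
```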

Adding elements to the KBART list


Generating the KBART file


The file is generated in the corresponding folder, with a name and a format that comply with KBART.
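
The last two steps, adding elements to the list and generating the output, can be sketched end to end as follows. MiniPkb is a stand-in class written for illustration only; its method names (addRow, toKbart) and its three-column field list are assumptions, not the confirmed pkbrows API, so check pkbrows.js for the actual calls. The sketch returns the tab-separated text instead of writing a file.

```javascript
// Stand-in for PkbRows, for illustration only. Method names (addRow, toKbart)
// are assumptions, not the library's confirmed API.
function MiniPkb(fields) {
  this.fields = fields;
  this.rows = [];
}

// Fill in every missing field with an empty string.
MiniPkb.prototype.initRow = function (row) {
  this.fields.forEach(function (f) {
    if (row[f] === undefined) { row[f] = ''; }
  });
  return row;
};

// Store a copy so that reusing the same object for the next title
// does not alter rows already in the list.
MiniPkb.prototype.addRow = function (row) {
  var copy = {};
  this.fields.forEach(function (f) { copy[f] = row[f]; });
  this.rows.push(copy);
};

// Serialize the header and the rows as tab-separated text.
MiniPkb.prototype.toKbart = function () {
  var fields = this.fields;
  var lines = [fields.join('\t')];
  this.rows.forEach(function (row) {
    lines.push(fields.map(function (f) { return row[f]; }).join('\t'));
  });
  return lines.join('\n') + '\n';
};

var pkb = new MiniPkb(['publication_title', 'title_url', 'title_id']);
var info = pkb.initRow({});
info.publication_title = 'Sample Journal'; // sample data
info.title_id = 'sample';
pkb.addRow(info);
var output = pkb.toKbart();
```

A real scraper would repeat the initialize, enrich, and add steps for each title found on the platform, then write the result once at the end.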

See the example for the American Chemical Society.

platforms/contribute/scraper_en.txt · Last modified: 2014/05/09 09:11 by porquet