How to write a new parser (javascript)

Use the new platform generation tool

Start by informing the other ezPAARSE contributors that you are going to implement a new parser. To do so, subscribe to the epaarse-contribute public list. This way, the other contributors and the team will be able to help and give advice.

Parsers are written in JavaScript. Writing a parser mostly consists of adapting the regular expressions proposed in the generated parser.js file (see below).

A new platform generation tool called platform-init is now available and makes it easy to generate all the files needed to create a functioning parser.

You will find the answers to the questions asked by platform-init in the analysis page of the corresponding platform. The main questions are about the domains the parser will react to, whether it uses a knowledge base, and what it is capable of analysing.

Writing a new parser is thus mostly:

  • use platform-init, which will create all the necessary files for you
  • adapt and enrich the test file so that it corresponds to the analysis page of the platform
  • adapt and enrich the parser to output a result corresponding to the test file content (see below)
  • launch the validation tests (see below)

Once the tests pass, the parser can be integrated into the GitHub repository.

Each folder is named after the platform processed by the parser, e.g.: "cairn" or "springer".

The main parser types

Depending on the structure discovered during the analysis of a new platform and on the similarities with an already analyzed platform you can rely on an existing parser to create or adapt yours:

  • parser for a platform without a knowledge base (e.g. ScienceDirect)
  • parser for a platform with a knowledge base: a proprietary identifier is used (e.g. Springer)
  • parser for a platform hosting a single journal: each URL (or platform) corresponds to exactly one journal, whose title appears in the URL, while the other journals use the same kind of platform (e.g. EDP)
  • parser for a platform hosting a single journal: each URL corresponds to one journal, which is identified by a proprietary identifier (e.g. BMC)

What does a parser contain?

You'll notice that every parser is a folder containing:

  • a file named parser.js, that we describe in the section below
  • a file named manifest.json that contains descriptive information about the parser
  • a test/ folder
    • a file *platform_name*.version.csv that contains validation data for testing the parser

The parser.js file

This file holds the variable part of the parser, i.e. the part that differs for every platform. The generic part of the parser is loaded at the beginning of the file, through a module call.

See the parseur-generique_en page for more details on the generic part.

The module declarations

var URL    = require('url');
var Parser = require('../.lib/parser.js'); //generic part of the parser

analyseEC: the function that needs to be adapted

The analyseEC function is passed as an argument to the parser's constructor:

module.exports = new Parser(function analyseEC(parsedUrl) {
  // it is here that you'll write the code specific to the platform
});

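and its variables will be reused as they are:

function analyseEC(parsedUrl) {
  var result    = {};               // filled in by the platform-specific code
  var param     = parsedUrl.query;  // parameters of the query string, as an object
  var path      = parsedUrl.path;   // path of the URL, query string included
  var match;                        // receives the result of regular expression matches
  // ... platform-specific code goes here ...
}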

From the URL given as an argument, we get elements such as:

  • the path (/resume.php?ID_ARTICLE=ARSS_195_0012 for a URL like http://www.cairn.info/resume.php?ID_ARTICLE=ARSS_195_0012)
  • parameters like ID_REVUE=ARSS&ID_NUMPUBLIE=ARSS_195&AJOUTBIBLIO=ARSS_195_0012, available as a JavaScript object (JSON structure):
    {ID_REVUE: 'ARSS', ID_NUMPUBLIE: 'ARSS_195', AJOUTBIBLIO: 'ARSS_195_0012'}

All the details are available in the documentation of the URL and Query String node.js modules.
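
As an illustration, the sketch below parses the cairn URL quoted above with the legacy url.parse function of node.js. Passing true as the second argument (so that the query string is turned into an object) is an assumption about how the generic part prepares parsedUrl; it is not taken from the ezPAARSE code.

```javascript
'use strict';
var URL = require('url');

// Second argument set to true: the query string is parsed into an object.
// (Assumption: parsedUrl is built this way before reaching analyseEC.)
var parsedUrl = URL.parse('http://www.cairn.info/resume.php?ID_ARTICLE=ARSS_195_0012', true);

console.log(parsedUrl.path);             // /resume.php?ID_ARTICLE=ARSS_195_0012
console.log(parsedUrl.pathname);         // /resume.php
console.log(parsedUrl.query.ID_ARTICLE); // ARSS_195_0012
```

Note the difference between path, which still carries the query string, and pathname, which does not.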

Depending on how the platform analysis is organized, you can adopt one strategy or another to characterize the consultation events that you need the parser to recognize:

  1. pathname: see line 13 of the parser skeleton.
  2. attribute(s) of the request present in the URL (param, line 12)
  3. a regular expression, see lines 17 or 23 of the parser skeleton.
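
The three strategies can be sketched as follows. This is a minimal standalone illustration, not the generated skeleton: the host platform.example.org, the URL shapes and the rtype and mime values are hypothetical, and the function is called directly instead of being passed to the Parser constructor, so that the example runs without the ezPAARSE library.

```javascript
'use strict';
var URL = require('url');

// Hypothetical URL shapes, for illustration only.
function analyseEC(parsedUrl) {
  var result = {};
  var param  = parsedUrl.query || {};
  var path   = parsedUrl.pathname;
  var match;

  if (path === '/resume.php' && param.ID_ARTICLE) {
    // strategies 1 and 2: a fixed pathname combined with a query parameter
    result.rtype  = 'ABS';
    result.mime   = 'HTML';
    result.unitid = param.ID_ARTICLE;
  } else if ((match = /^\/path\/to\/(document-([0-9]+)-test\.(pdf|html))$/.exec(path)) !== null) {
    // strategy 3: a regular expression capturing the identifiers
    result.rtype    = 'ARTICLE';
    result.mime     = match[3].toUpperCase();
    result.title_id = match[1];
    result.unitid   = match[2];
  }

  // an empty object means the URL was not recognized
  return result;
}

var fromParam = analyseEC(URL.parse('http://platform.example.org/resume.php?ID_ARTICLE=ARSS_195_0012', true));
var fromRegex = analyseEC(URL.parse('http://platform.example.org/path/to/document-123456-test.pdf', true));

console.log(JSON.stringify(fromParam));
console.log(JSON.stringify(fromRegex));
```

Testing the pathname first and falling back to a regular expression keeps the simple cases readable; a real parser would order the branches according to the platform analysis.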

Notes and tips

To test the parser, follow the example:

. ../../env
cat ./test/parser-skeleton.2014-03-10.csv | ../../bin/csvextractor --fields="in-url" -c --noheader | ./parser.js

that gives the following result:

{"rtype":"ARTICLE","mime":"PDF","title_id":"document-123456-test.pdf","unitid":"123456"}
{"rtype":"ARTICLE","mime":"HTML","title_id":"document-78910-test.html","unitid":"78910"}
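
To make the role of the validation file more concrete, here is a hedged sketch of how a row of test data could be compared with the parser output. The column layout is an assumption: only the in-url field appears in the command above, and the out- prefixed columns, the semicolon separator, the example URLs and the stand-in analyseEC function are hypothetical. It does not reproduce the actual ezPAARSE test runner.

```javascript
'use strict';
var URL = require('url');

// Stand-in for a real parser: recognizes hypothetical document-<id>-test.<ext> URLs.
function analyseEC(parsedUrl) {
  var result = {};
  var match  = /\/(document-([0-9]+)-test\.(pdf|html))$/.exec(parsedUrl.pathname);
  if (match) {
    result.rtype    = 'ARTICLE';
    result.mime     = match[3].toUpperCase();
    result.title_id = match[1];
    result.unitid   = match[2];
  }
  return result;
}

// Hypothetical validation data: "out-" columns hold the expected values (assumed layout).
var csv = [
  'out-unitid;out-rtype;out-mime;in-url',
  '123456;ARTICLE;PDF;http://platform.example.org/path/to/document-123456-test.pdf',
  '78910;ARTICLE;HTML;http://platform.example.org/path/to/document-78910-test.html'
].join('\n');

// Returns one boolean per data row: true when the parser output matches the row.
function checkRows(csvText) {
  var lines  = csvText.split('\n');
  var header = lines[0].split(';');

  return lines.slice(1).map(function (line) {
    var cells = line.split(';');
    var row   = {};
    header.forEach(function (name, i) { row[name] = cells[i]; });

    var ec = analyseEC(URL.parse(row['in-url'], true));

    // a row passes when every out-* column equals the matching parser field
    return header
      .filter(function (name) { return name.indexOf('out-') === 0; })
      .every(function (name) { return ec[name.slice(4)] === row[name]; });
  });
}

var outcomes = checkRows(csv);
console.log(outcomes);
```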

Validate the integration

You have written a parser, and it executes on the command line (see above). It's time to validate its integration in the code base of ezPAARSE!

  1. once the parser is working, remember to launch the
    make jshint

    command in the base directory of your ezPAARSE install. It makes sure that the coding rules of the ezPAARSE team are followed. JSHint reports explicitly where the rules are broken, so that the offending code can be easily corrected.

  2. once jshint doesn't complain anymore, the tests specific to the parser can be launched. ezPAARSE has to be started (with the make start command). You can then execute:
    make test-platforms-verbose

    (or make test-platforms to get fewer messages). ezPAARSE checks that every parser contains all the previously described elements (test file, manifest.json, etc.) and yields the expected results.

  3. when those tests pass, the parser is ready to be shared with the community via GitHub. To do so:
    • add the new files to source control with:
      git add ...
    • commit your work locally with:
      git commit -a -m "short comment"

      (and don't forget to describe your changes in the commit message)

    • publish your work on github with:
      git push
platforms/contribute/parser_en.txt · Last modified: 2014/12/17 13:39 by lechaudel