This page contains obsolete information and will soon be updated (9/12/2015)

Contribute to the analysis of a content provider's platform

Four distinct steps are needed for ezPAARSE to be able to analyse a new content provider's platform. These steps can be performed by different people:

Information specific to the platform (test data, knowledge base) can be stored in an Excel spreadsheet based on this model. This Excel file is then used to create the CSV test file and the knowledge base file (platform.version.csv and platform.pkb.csv).

Analysis of the platform (Librarian)

The analysis of a platform can be achieved by someone who is not a coder but has a good command of what a URL is. This analysis needs to follow the model and be based on this sample.

  1. log in to the platform (from your institution or through remote access)
  2. create a wiki page on analogIST that describes this platform following the model
  3. enumerate the different types and formats of resources found on the platform (on a first pass, it's more important to focus on the types and formats that ezPAARSE already recognizes: Resource Types and Resource Formats)
  4. for every combination of type and format, fill in the wiki page (for that, you can use the URL analysis utility)
    1. note the type and format of the resource
    2. copy/paste the URL as is (the URL reflects the proxy step) so that you can come back to it directly by clicking on it
    3. split the URL path (in the Host part, you can now delete the proxy part)
    4. document the elements to be found when this URL has to be parsed
  5. go to the next type/format combination

For step 4, use the URL analysis utility.
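
The URL-splitting part of the analysis can be illustrated with a few lines of Python. This is only a minimal sketch, not the URL analysis utility itself: the proxy suffix, the platform host and the path layout are all made-up assumptions, and `split_url` is a hypothetical helper name.

```python
from urllib.parse import urlsplit, parse_qs

# Assumption: the proxy rewrites hosts by appending this suffix (hypothetical value).
PROXY_SUFFIX = ".proxy.example.org"


def split_url(url, proxy_suffix=PROXY_SUFFIX):
    """Split a proxied URL into the parts that the analysis needs to document."""
    parts = urlsplit(url)
    host = parts.hostname or ""
    # Drop the proxy part from the host, as the analysis steps ask.
    if host.endswith(proxy_suffix):
        host = host[: -len(proxy_suffix)]
    return {
        "host": host,
        # Each path segment is a candidate element to document.
        "path_segments": [segment for segment in parts.path.split("/") if segment],
        "query": parse_qs(parts.query),
    }


# Hypothetical URL copied from a browser while going through the proxy.
analysis = split_url(
    "http://www.platform.example.com.proxy.example.org"
    "/journal/1234-5678/article/42.pdf?download=true"
)
for name, value in analysis.items():
    print(name, value)
```

Each printed element (host, path segments, query parameters) corresponds to something the wiki page should describe for that type/format combination.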

Prepare a test file (Librarian+)

This test file is intended to validate that the parser works properly. It is a CSV file formatted as follows: columns prefixed with in- contain the data to be sent to the parser, and columns prefixed with out- contain the data that the parser is expected to identify.

Note that parsers are independent from the knowledge bases. The test file must only contain data that is present in the URL or any other materials provided to the parser.

  1. Create a file in a spreadsheet (platform.version.csv) that contains one line per analyzed resource, and save it in CSV format
  2. For each combination of resource type and format (e.g. ARTICLE/HTML), add a line to the file, filling in the columns for the elements given as input (in-) and those for the elements to be recognized (out-)
  3. Move to the next resource
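
To make the in-/out- convention concrete, here is a small sketch of what such a file could contain and how its columns separate into input and expected output. The column names after the prefixes (`url`, `rtype`, `mime`, `print_identifier`), the semicolon delimiter and the URLs are assumptions chosen for illustration, not a statement of the actual ezPAARSE layout.

```python
import csv
import io

# Hypothetical content of a platform.version.csv test file:
# one line per type/format combination found during the analysis.
TEST_CSV = """out-rtype;out-mime;out-print_identifier;in-url
ARTICLE;HTML;1234-5678;http://www.platform.example.com/journal/1234-5678/article/42
ARTICLE;PDF;1234-5678;http://www.platform.example.com/journal/1234-5678/article/42.pdf
"""


def split_row(row):
    """Separate a CSV row into parser input (in-) and expected output (out-)."""
    given = {key[len("in-"):]: value for key, value in row.items() if key.startswith("in-")}
    expected = {key[len("out-"):]: value for key, value in row.items() if key.startswith("out-")}
    return given, expected


reader = csv.DictReader(io.StringIO(TEST_CSV), delimiter=";")
rows = [split_row(row) for row in reader]

for given, expected in rows:
    print(given["url"], "->", expected)
```

Note that every out- value in this sample can be read directly from the URL, which is consistent with the rule that the test file must not depend on a knowledge base.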

Write the parser (Coder)

Implementing a parser requires programming skills, especially in writing regular expressions. To write a parser, you need to choose a programming language (JavaScript, PHP, Perl, Python, etc.).

We explain this work in a detailed fashion on parser_en.
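
As a rough idea of what the core of a parser looks like, the sketch below recognizes one invented URL scheme with a regular expression. The URL layout, the `parse_url` function and the output field names are hypothetical and do not describe the ezPAARSE parser interface; parser_en remains the reference for that.

```python
import re
from urllib.parse import urlsplit

# Assumed (made-up) URL scheme for the platform:
#   /journal/<ISSN>/article/<id>       -> article in HTML
#   /journal/<ISSN>/article/<id>.pdf   -> article in PDF
ARTICLE_RE = re.compile(r"^/journal/(\d{4}-\d{3}[\dX])/article/(\d+)(\.pdf)?$")


def parse_url(url):
    """Return the elements recognized in the URL, or an empty dict if none."""
    match = ARTICLE_RE.match(urlsplit(url).path)
    if match is None:
        return {}
    issn, _article_id, pdf_suffix = match.groups()
    return {
        "rtype": "ARTICLE",
        "mime": "PDF" if pdf_suffix else "HTML",
        # The ISSN is read from the URL itself, so no knowledge base is needed here.
        "print_identifier": issn,
    }


print(parse_url("http://www.platform.example.com/journal/1234-5678/article/42.pdf"))
print(parse_url("http://www.platform.example.com/about"))
```

Returning an empty result for unrecognized URLs keeps the parser's behaviour easy to check against a test file.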

Use the platform.version.csv file to validate the execution of a parser

Every parser must be automatically testable, and the platform.version.csv file is used for that purpose. If you wish to launch the test manually, you can use the make test command. Here is a schema of how the test works with the csv file:
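
In code, the same flow could be sketched roughly as follows: each line's in- columns are sent to the parser, and the result is compared with the out- columns. This is an illustration only, not what make test actually runs; `toy_parser`, `run_tests`, the column names and the semicolon delimiter are all assumptions.

```python
import csv
import io
import re

# Hypothetical test file content (see the in-/out- column convention).
TEST_CSV = """in-url;out-rtype;out-mime
http://www.platform.example.com/article/42;ARTICLE;HTML
http://www.platform.example.com/article/42.pdf;ARTICLE;PDF
"""


def toy_parser(url):
    """Stand-in parser for a made-up URL scheme."""
    match = re.search(r"/article/\d+(\.pdf)?$", url)
    if match is None:
        return {}
    return {"rtype": "ARTICLE", "mime": "PDF" if match.group(1) else "HTML"}


def run_tests(csv_text, parser):
    """Return a list of (line number, expected, actual) for every failing line."""
    failures = []
    reader = csv.DictReader(io.StringIO(csv_text), delimiter=";")
    # Data starts on line 2 of the file; line 1 is the header.
    for line_no, row in enumerate(reader, start=2):
        given = {k[len("in-"):]: v for k, v in row.items() if k.startswith("in-")}
        expected = {k[len("out-"):]: v for k, v in row.items() if k.startswith("out-")}
        actual = parser(given["url"])
        if actual != expected:
            failures.append((line_no, expected, actual))
    return failures


failures = run_tests(TEST_CSV, toy_parser)
print("all lines passed" if not failures else failures)
```

Reporting the failing line number makes it easy to go back to the spreadsheet and check whether the parser or the test data is wrong.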

Example of a test file:


Create a PKB file (Librarian+/Coder)

ezPAARSE uses files called knowledge bases, named after this pattern: platform_AllTitles.txt. These are text files formatted according to the KBART standard. There is often one (or more) for each platform, but they are not needed when the parser is able to extract a normalized identifier (like an ISSN) directly from the URL.

These KBART files are stored in a dedicated folder structure, ezpaarse/platforms-kb/platform, organized along the same lines as the parsers.

The Publisher Knowledge Bases are useful for:

  • matching proprietary identifiers used on content providers' platforms with normalized identifiers (like ISSNs, DOIs, etc.)
  • including the titles of accessed resources in the results of an ezPAARSE analysis
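
Both uses can be sketched with a tab-separated KBART extract. The column names below are standard KBART fields, but the rows are invented, real files carry many more columns, and `load_pkb` and `enrich` are hypothetical helpers rather than ezPAARSE functions.

```python
import csv
import io

# Hypothetical extract of a platform_AllTitles.txt file (KBART is tab-separated).
KBART_TEXT = (
    "publication_title\tprint_identifier\tonline_identifier\ttitle_id\n"
    "Journal of Examples\t1234-5678\t8765-4321\tjoe\n"
    "Annals of Samples\t2345-6789\t\taos\n"
)


def load_pkb(text):
    """Index KBART rows by the platform's proprietary identifier (title_id)."""
    reader = csv.DictReader(io.StringIO(text), delimiter="\t")
    return {row["title_id"]: row for row in reader}


def enrich(event, pkb):
    """Add the normalized identifier and the title when the PKB knows the title_id."""
    enriched = dict(event)
    entry = pkb.get(event.get("title_id"))
    if entry is not None:
        enriched["print_identifier"] = entry["print_identifier"]
        enriched["publication_title"] = entry["publication_title"]
    return enriched


pkb = load_pkb(KBART_TEXT)
# The parser only found a proprietary identifier in the URL.
print(enrich({"title_id": "joe", "rtype": "ARTICLE"}, pkb))
```

An event whose identifier is missing from the knowledge base is returned unchanged, so the lookup never blocks the analysis.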

All the details on PKBs can be found on pkb_en.

platforms/contribute/start_en.txt · Last modified: 2016/03/22 08:45 by porquet