parser

Last edited by Alexandre Delanoë Oct 04, 2016

HOW TO: Reference a new web scraper/API + parser

Global scope

Three main steps:

  • develop and index a parser in gargantext.util.parsers

  • develop and index a scraper in gargantext.moissonneurs

  • adapt forms for a new source in templates and views

Reference the parser in the gargantext website

The gargantext website is stored in gargantext/gargantext

Reference your new parser in constants.py

  • import your parser (l.125)
from gargantext.util.parsers import \
    EuropressParser, RISParser, PubmedParser, ISIParser, CSVParser, ISTexParser, CernParser

The imported name is the name of the parser class referenced in gargantext/util/parsers; here the name is CernParser

  • index your RESOURCETYPE in RESOURCETYPES (l.145), at the end of the list
# type 10
   {    "name": 'SCOAP (XML MARC21 Format)',
        "parser": CernParser,
        "default_language": "en",
        'accepted_formats':["zip","xml"],
   },
Note that the name here is composed of the API_name (SCOAP) + (GENERICFILETYPE FORMAT_XML Format)
The complexity of the naming reflects three things:
    * the name of the API (which differs from the producing organization)
    * the format type: XML
    * the XML standard of this format: MARC21 (cf. CernParser in gargantext/util/parsers/Cern.py)

The default_language is the default accepted language; it determines which tagger is loaded by default

from gargantext.util.taggers import NltkTagger
TO DO: load tagger types on demand, depending on the languages and the installation
TO DO: provide a module to download additional parsers
TO DO: provide install tagger module scripts inside lib

The formats correspond to the file types accepted when a file is uploaded through the parsing form, which is available in gargantext/view/pages/projects.py and exposed in /templates/pages/projects/project.html
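
As a minimal sketch of how accepted_formats could filter uploads, the following copies the entry above into a stand-alone list and tests a filename extension against it. The helper name is_accepted and the string used in place of the parser class are assumptions made for illustration; they are not taken from the gargantext code base.

```python
import os

# Stand-alone copy of the entry shown above. The parser class is replaced by
# its name so that the sketch runs without gargantext installed.
RESOURCETYPES = [
    {
        "name": "SCOAP (XML MARC21 Format)",
        "parser": "CernParser",
        "default_language": "en",
        "accepted_formats": ["zip", "xml"],
    },
]


def is_accepted(resource_type, filename):
    """Hypothetical helper: True if the file extension is accepted."""
    extension = os.path.splitext(filename)[1].lstrip(".").lower()
    return extension in resource_type["accepted_formats"]
```

For example, is_accepted(RESOURCETYPES[0], "records.xml") accepts the file, while a .csv file would be rejected for this resource type.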

Reference your parser script

Add your parser script to the folder gargantext/util/parsers/

Here the filename was Cern.py

Declare it in gargantext/util/parsers/__init__.py:
from .Cern import CernParser

At this step, your parser is listed in the form and you can upload a file, but nothing will happen yet

The right way to write the parser script

Only three requirements:

  • your parser class should inherit from the base class _Parser() (gargantext/gargantext/util/parsers/_Parser)
  • your parser class must have a parse method that takes a file buffer as input
  • your parser must structure and store its data in a variable named hyperdata_list so that the toolchain indexes it properly. Be careful with the date format: provide publication_date as a string in the format YYYY-mm-dd HH:MM:SS

Adding a scraper API to offer a search option:

In progress

  • Add the pop-up question "Do you have a corpus?" with a search option in /templates/pages/projects/project.html (line 181)

Reference a scraper (moissonneur) in gargantext

  • add accepted_formats in constants
  • add a check_file routine to the Form check; it should inherit from utils/files.py, which also implements the upload size limit check
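
A check_file routine of this kind might look like the sketch below. Everything in it is an assumption for illustration: the function signature, the UPLOAD_LIMIT value, and the use of ValueError are not taken from utils/files.py.

```python
import os

# Assumed limit for illustration only; the real value lives in gargantext.
UPLOAD_LIMIT = 1024 * 1024  # bytes


def check_file(filename, size, accepted_formats, limit=UPLOAD_LIMIT):
    """Hypothetical check: raise ValueError if the upload must be rejected."""
    extension = os.path.splitext(filename)[1].lstrip(".").lower()
    if extension not in accepted_formats:
        raise ValueError("unsupported format: %r" % extension)
    if size > limit:
        raise ValueError("file too large: %d bytes (limit %d)" % (size, limit))
    return True
```

Keeping the format check and the size check in one routine lets the form report a single, explicit reason when an upload is refused.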

Suggestions for next steps:

  • XML parser MARC21 UNIMARC ...
  • A project type is determined by the first element added, i.e. the first element determines the corpus type of all the corpora within the project