HOW TO: Reference a new web scraper/API + parser
Global scope
Three main steps:
- develop and index a parser in gargantext.util.parsers
- develop and index a scraper in gargantext.moissonneurs
- adapt the forms for the new source in templates and views
Reference the parser in the gargantext website
The gargantext website is stored in gargantext/gargantext
Reference your new parser in constants.py
- import your parser (line 125)
from gargantext.util.parsers import \
EuropressParser, RISParser, PubmedParser, ISIParser, CSVParser, ISTexParser, CernParser
The imported name is the class name of the parser defined in gargantext/util/parsers; here it is CernParser
- index your RESOURCETYPE in RESOURCETYPES (line 145), at the end of the list
# type 10
{ "name": "SCOAP (XML MARC21 Format)",
  "parser": CernParser,
  "default_language": "en",
  "accepted_formats": ["zip", "xml"],
},
Note: the name here is composed of the API name (SCOAP) + (GENERICFILETYPE FORMAT_XML Format)
The naming reflects three things:
* the name of the API (which differs from the organization that produces the data)
* the format type: XML
* the XML standard used by this format: MARC21 (cf. CernParser in gargantext/util/parsers/Cern.py)
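The naming convention above can be sketched as a small helper. This is only an illustration under stated assumptions: `resource_name` and `make_resource_type` are hypothetical names that do not exist in gargantext, and the parser is passed as a plain placeholder value instead of the real CernParser class.

```python
def resource_name(api_name, file_type, standard=None):
    """Build a display name such as 'SCOAP (XML MARC21 Format)'.

    Hypothetical helper: api_name is the API, file_type the generic
    file type, standard the optional norm used within that file type.
    """
    parts = [file_type]
    if standard:
        parts.append(standard)
    return "{} ({} Format)".format(api_name, " ".join(parts))


def make_resource_type(api_name, file_type, standard, parser, language, formats):
    """Assemble a dict with the same keys as the RESOURCETYPES entry above."""
    return {
        "name": resource_name(api_name, file_type, standard),
        "parser": parser,
        "default_language": language,
        "accepted_formats": list(formats),
    }


# The parser is a placeholder here; in constants.py it would be the parser class.
entry = make_resource_type("SCOAP", "XML", "MARC21", None, "en", ["zip", "xml"])
print(entry["name"])
```

Building the name from its three parts keeps the API name, the file type and the standard visibly separate, which is the point of the convention.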
The default_language is the default accepted language; it determines which default tagger is loaded:
from gargantext.util.taggers import NltkTagger
TO DO: load tagger types on demand, depending on the languages and the installation
TO DO: provide a module to download additional parsers
TO DO: provide install tagger module scripts inside lib
The accepted formats are the file types allowed when a file is uploaded through the parsing form available in gargantext/view/pages/projects.py and exposed in /templates/pages/projects/project.html
Reference your parser script
Add your parser script to the folder gargantext/util/parsers/
Here my filename was Cern.py
Declare it in gargantext/util/parsers/__init__.py:
from .Cern import CernParser
At this point you will be able to see your parser in the form and upload a file with it, but nothing will happen yet
The right way to write the parser script
Only three requirements:
- your parser class must inherit from the base class _Parser, defined in gargantext/gargantext/util/parsers/_Parser
- your parser class must have a parse method that takes a file buffer as input
- your parser must structure the data and store it in a variable named hyperdata_list so that the toolchain can index it properly. Be careful with the date format: provide publication_date as a string formatted YYYY-mm-dd HH:MM:SS
Adding a scraper API to offer a search option:
In progress
- Add the pop-up question "Do you have a corpus?" and a search option in /templates/pages/projects/project.html (line 181)
Reference a scraper (moissonneur) in gargantext
- add accepted_formats in constants
- add a check_file routine to the form check; it should reuse utils/files.py, which already implements the upload size limit check
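A check_file routine of this kind might look like the sketch below. It is an assumption-laden illustration: the signature, the error messages and the `MAX_UPLOAD_SIZE` value are hypothetical, and the real routine would rely on the size check already implemented in utils/files.py instead of redefining it.

```python
import os

# Hypothetical limit (1 MiB); the real value is whatever utils/files.py enforces.
MAX_UPLOAD_SIZE = 1024 * 1024


def check_file(filename, size, accepted_formats, max_size=MAX_UPLOAD_SIZE):
    """Return a list of error messages; an empty list means the file is accepted."""
    errors = []
    # Compare the lower-cased extension (without the dot) to accepted_formats.
    extension = os.path.splitext(filename)[1].lstrip(".").lower()
    if extension not in accepted_formats:
        errors.append("unsupported format: %r" % extension)
    if size > max_size:
        errors.append("file too large: %d bytes" % size)
    return errors


print(check_file("records.xml", 2048, ["zip", "xml"]))
print(check_file("records.pdf", 2048, ["zip", "xml"]))
```

Returning a list of messages instead of raising on the first problem lets the form report a wrong extension and an oversized file in one pass.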
Suggestions for next steps:
- XML parser MARC21 UNIMARC ...
- A project type is determined by the first element added, i.e. the first element determines the corpus type of all the corpora within the project