# HOW TO: Reference a new web scraper/API + parser

## Global scope

Three main steps:

- develop and index a parser in gargantext.util.parsers
- develop and index a scraper in gargantext.moissonneurs
- adapt the forms for the new source in templates and views

## Reference the parser in the gargantext website

The gargantext website is stored in gargantext/gargantext.

### Reference your new parser in constants.py

* import your parser (l.125)

```
from gargantext.util.parsers import \
    EuropressParser, RISParser, PubmedParser, ISIParser, CSVParser, ISTexParser, CernParser
```

The imported name must match the parser name declared in gargantext/util/parsers; here it is CernParser.

* index your RESOURCETYPE into RESOURCETYPES (l.145) **at the end of the list**

```
# type 10
{   "name": 'SCOAP (XML MARC21 Format)',
    "parser": CernParser,
    "default_language": "en",
    'accepted_formats': ["zip", "xml"],
},
```

Note that the name here is composed of the API name (SCOAP) + (GENERICFILETYPE FORMAT_XML Format).

The naming is complex because it encodes three things:

* the name of the API (which differs from the producing organisation)
* the format type: XML
* the XML standard of this format: MARC21 (cf. CernParser in gargantext/util/parsers/Cern.py)

`default_language` is the default accepted language, which **should load** the corresponding default tagger:

```
from gargantext.util.taggers import NltkTagger
```

TO DO: load tagger types on demand, depending on the languages and the installation

TO DO: provide a module to download additional parsers

TO DO: provide install scripts for tagger modules inside lib

The accepted formats are the file types allowed when a file is submitted through the parsing form, which is available in `gargantext/view/pages/projects.py` and exposed in `/templates/pages/projects/project.html`.

## Reference your parser script

### Add your parser script to the folder gargantext/util/parsers/

Here my filename was Cern.py.

### Declare it in `gargantext/util/parsers/__init__.py`

`from .Cern import CernParser`

At this point you can see your parser and submit a file through the form, but nothing will happen yet.

## The right way to write the parser script

There are only three requirements:

* your parser class should inherit from the base class _Parser() in `gargantext/gargantext/util/parsers/_Parser`
* your parser class must have a parse method that takes a **file buffer** as input
* your parser must structure the data and store it in a variable named **hyperdata_list** so that the toolchain indexes it properly

! Be careful with the date format: provide `publication_date` as a string in the format YYYY-mm-dd HH:MM:SS

# Adding a scraper API to offer a search option

In progress.

* Add the pop-up question "Do you have a corpus?" and its search option in `/templates/pages/projects/project.html`, line 181

## Reference a scraper (moissonneur) in gargantext

* add accepted_formats in constants
* add a check_file routine in the Form check ==> but it should inherit from utils/files.py, which also implements the upload size limit check

# Suggestions for next steps:

* XML parser MARC21 UNIMARC ...
* A project's type is set by the first element added, i.e. the first element determines the corpus type of all the corpora within the project