[DOC] How to add a parser step by step in docs/overview

c0e4f25e · c24b · 730c6319 · c0e4f25e
Commit c0e4f25e authored May 12, 2016 by c24b

Showing with 21 additions and 15 deletions

parser.md docs/overview/parser.md +21 -15

--- a/docs/overview/parser.md
+++ b/docs/overview/parser.md
@@ -11,18 +11,20 @@ in gargantext.moissonneurs
 in templates and views

 ## Reference parser into gargantext website
+gargantext website is stored in gargantext/gargantext
+
 ### reference your new parser into constants.py
 * import your parser l.125
 ```
 from gargantext.util.parsers import \
    EuropressParser, RISParser, PubmedParser, ISIParser, CSVParser, ISTexParser, CernParser
 ```
-Le parser correspond au nom du parser référencé dans gargantext/util/parser
-ici il est appelé CernParser
+The parser corresponds to the name of the parser referenced in gargantext/util/parser
+here the name is CernParser


-* index your RESOURCETYPES
-RESOURCETYPES (l.145) **at the end of the list**
+* index your RESOURCETYPE
+in RESOURCETYPES (l.145) **at the end of the list**
 ```
 # type 10
   {    "name": 'SCOAP (XML MARC21 Format)',
@@ -31,18 +33,21 @@ RESOURCETYPES (l.145) **at the end of the list**
        'accepted_formats':["zip","xml"],
   },
 ```
-A noter le nom ici est composé de l'API_name(SCOAP) + (GENERICFILETYPE FORMAT_XML Format)
-La complexité du nommage correspond à trois choses:
-    * le nom de l'API (different de l'organisme de production)
-    * le type de format: XML
-    * la norme XML de ce format : MARC21 (cf. CernParser in gargantext/util/parser/Cern.py )
+    Note: the name here is composed of the API_name (SCOAP) + (GENERICFILETYPE FORMAT_XML Format)
+    The naming complexity reflects three things:
+        * the name of the API (different from the producing organization)
+        * the format type: XML
+        * the XML standard of this format: MARC21 (cf. CernParser in gargantext/util/parser/Cern.py)
+
+The default_langage corresponds to the default accepted language, which **should load** the corresponding default tagger
+

-La langue correspond à la langue par défaut acceptée et qui charge le tagger correspondant
 ```
 from gargantext.util.taggers import NltkTagger
 ```
    TO DO: load tagger types on demand, depending on the languages and the install
    TO DO: provide a module to download additional parsers
+    TO DO: provide install tagger module scripts inside lib

 The formats correspond to the file types accepted when uploading the file through the parsing
 form available in `gargantext/view/pages/projects.py` and
@@ -63,24 +68,25 @@ but nothing will occur

 Three main and only requirements:
 * your parser class should inherit from the base class _Parser()
-* your parser class must have a parse method that take a **filename** as input
+`gargantext/gargantext/util/parser/_Parser`
+* your parser class must have a parse method that takes a **file buffer** as input
 * your parser must structure and store data into the **hyperdata_list** variable
 to be properly indexed by the toolchain
+! Be careful with the date format: provide publication_date as a string in the format YYYY-mm-dd HH:MM:SS

 # Adding a scraper API to offer a search option:
+In progress
 * Add pop up question Do you have a corpus
 option search in /templates/pages/projects/project.html line 181


+## Reference a scraper (moissonneur) into gargantext

-
-# Some changes
 * adding accepted_formats in constants
 * adding check_file routine in Form check ==> but should inherit from utils/files.py
 which also implements the upload size limit check

-# Suggestion next step:
-
+# Suggestions for next steps:
 * XML parser MARC21 UNIMARC ...
 * A project type is qualified by the first element added, i.e.:
 the first element determines the type of corpus of all the corpora within the project