Commit 3374d428 authored May 11, 2016 by c24b
[DO] Adding Parser Doc Integration with CERN example into overview
parent 9aae706c
Showing 1 changed file with 84 additions and 0 deletions
docs/overview/parser.md (0 → 100644)
# HOW TO: Reference a new webscraper/API + parser
## Global scope
Three main steps:
- develop and index a parser in gargantext.util.parsers
- develop and index a scraper in gargantext.moissonneurs
- adapt the forms for the new source in templates and views
## Reference the parser in the gargantext website
### reference your new parser in constants.py
* import your parser (l.125)
```
from gargantext.util.parsers import \
EuropressParser, RISParser, PubmedParser, ISIParser, CSVParser, ISTexParser, CernParser
```
The parser is the name of the parser referenced in gargantext/util/parsers;
here it is called CernParser
* index your parser in RESOURCETYPES (l.145), **at the end of the list**
```
# type 10
{ "name": 'SCOAP (XML MARC21 Format)',
"parser": CernParser,
"default_language": "en",
'accepted_formats':["zip","xml"],
},
```
Note that the name here is composed of the API name (SCOAP) + (GENERICFILETYPE FORMAT_XML Format)
The complexity of the naming reflects three things:
* the name of the API (different from the organization that produces the data)
* the format type: XML
* the XML standard of this format: MARC21 (cf. CernParser in gargantext/util/parsers/Cern.py)
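The naming convention described above can be sketched as a small helper. This is only an illustration: `resource_name` is a hypothetical function and is not part of gargantext.

```python
def resource_name(api_name, file_type, standard):
    """Build a display name such as 'SCOAP (XML MARC21 Format)'.

    Hypothetical helper, for illustration only.
    """
    return "{} ({} {} Format)".format(api_name, file_type, standard)


# API name, then format type, then the standard of that format
print(resource_name("SCOAP", "XML", "MARC21"))
```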
The language is the default accepted language; it determines which tagger is loaded
```
from gargantext.util.taggers import NltkTagger
```
TO DO: load the tagger types on demand, depending on the languages and the install
TO DO: offer a module to download additional parsers
The formats are the file types accepted when a file is uploaded through the parsing form, which is available in
`gargantext/view/pages/projects.py`
and exposed in
`/templates/pages/projects/project.html`
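A minimal sketch of how the accepted formats could be checked when a file is submitted. The body of `check_file` below is an assumption written for illustration and may differ from the actual routine; the parser is given as a string so that the example runs on its own.

```python
import os

# Illustrative copy of the RESOURCETYPES entry; the real list lives in constants.py.
# "CernParser" is a string here (not the class) to keep the sketch self-contained.
RESOURCETYPES = [
    {"name": "SCOAP (XML MARC21 Format)",
     "parser": "CernParser",
     "default_language": "en",
     "accepted_formats": ["zip", "xml"],
    },
]


def check_file(filename, resource_type):
    """Return True if the file extension is accepted for this resource type.

    Hypothetical form-level check, not the project's actual implementation.
    """
    extension = os.path.splitext(filename)[1].lstrip(".").lower()
    return extension in resource_type["accepted_formats"]


scoap = RESOURCETYPES[0]
print(check_file("corpus.xml", scoap), check_file("corpus.pdf", scoap))
```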
## reference your parser script
## add your script into gargantext/util/
here the filename is Cern.py
## declare it in gargantext/util/parsers/__init__.py
`from .Cern import CernParser`
## add your parser script into gargantext/util/parsers/
At this step, you will be able to see your parser and add a file with the form;
the form will send the job to the toolchain
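To give an idea of what a parser script for MARC21 XML might contain, here is a rough standalone sketch. `SketchMarcParser`, its `parse` method and the sample record are assumptions made for illustration; the sketch does not use gargantext's parser base class and is not the code of `CernParser`. It only extracts the title (MARC field 245, subfield a).

```python
import xml.etree.ElementTree as ET

# Namespace of the MARC21 "slim" XML schema
MARC_URI = "http://www.loc.gov/MARC21/slim"
MARC_NS = {"marc": MARC_URI}

# Made-up sample record, for illustration only
SAMPLE = """<collection xmlns="http://www.loc.gov/MARC21/slim">
  <record>
    <datafield tag="245" ind1=" " ind2=" ">
      <subfield code="a">An example title</subfield>
    </datafield>
  </record>
</collection>"""


class SketchMarcParser:
    """Hypothetical stand-in for a parser class."""

    def parse(self, xml_text):
        """Yield one dict per MARC record, keeping only the title."""
        root = ET.fromstring(xml_text)
        for record in root.iter("{%s}record" % MARC_URI):
            title = record.find(
                "marc:datafield[@tag='245']/marc:subfield[@code='a']", MARC_NS
            )
            yield {"title": title.text if title is not None else None}


docs = list(SketchMarcParser().parse(SAMPLE))
print(docs)
```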
## parse_extract_indexhyperdata(corpus)
* Add the pop-up question "Do you have a corpus?" (option search) in /templates/pages/projects/project.html, line 181
adding
# Some changes
* adding accepted_formats in constants
* adding check_file routine in Form check
# Suggested next steps:
* XML parser MARC21 UNIMARC ...
* A project type is determined by the first element added, i.e.
the first element determines the corpus type of all the corpora within the project