Commit dcfd1f71 authored by delanoe

Documentation removed after subtree tests. Adding it again.

parent 2cd4a570
#Gargantext
Welcome to the Gargantext documentation!
List of Gargantext's own JSON API URLs
===================================
2016-05-27
### /api/nodes/2
```
{
"id": 2,
"parent_id": 1,
"name": "abstract:\"evaporation+loss\"",
"typename": "CORPUS"
}
```
------------------------------
### /api/nodes?pagination_limit=-1
```
{
"records": [
{
"id": 9,
"parent_id": 2,
"name": "A recording evaporimeter",
"typename": "DOCUMENT"
},
(...)
{
"id": 119,
"parent_id": 81,
"name": "GRAPH EXPLORER COOC (in:81)",
"typename": "COOCCURRENCES"
}
],
"count": 119,
"parameters": {
"formated": "json","pagination_limit": -1,
"fields": ["id","parent_id","name","typename"],
"pagination_offset": 0
}
}
```
------------------------------
### /api/nodes?types[]=CORPUS
```
{
"records": [
{
"id": 2,
"parent_id": 1,
"name": "abstract:\"evaporation+loss\"",
"typename": "CORPUS"
},
(...)
{
"id": 8181,
"parent_id": 1,
"name": "abstract:(astrogeology+OR ((space OR spatial) AND planetary) AND geology)",
"typename": "CORPUS"
}
],
"count": 2,
"parameters": {
"pagination_limit": 10,
"types": ["CORPUS"],
"formated": "json",
"pagination_offset": 0,
"fields": ["id","parent_id","name","typename"]
}
}
```
------------------------------
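The three URLs above differ only in the optional node id and in the query parameters. A minimal sketch of a helper that builds them is shown below; `build_nodes_url` is a hypothetical name, and the base URL `http://localhost:8000` is assumed from the install instructions rather than guaranteed by the API.

```python
from urllib.parse import urlencode


def build_nodes_url(base_url, node_id=None, **params):
    """Build a URL for the /api/nodes endpoint (hypothetical helper).

    List-valued parameters are serialized as repeated `name[]=value`
    pairs, matching the `types[]=CORPUS` form used in this document.
    """
    path = "/api/nodes" if node_id is None else f"/api/nodes/{node_id}"
    pairs = []
    for name, value in params.items():
        if isinstance(value, (list, tuple)):
            pairs.extend((f"{name}[]", item) for item in value)
        else:
            pairs.append((name, value))
    # Keep the square brackets readable instead of percent-encoding them.
    query = urlencode(pairs, safe="[]")
    return base_url.rstrip("/") + path + ("?" + query if query else "")


print(build_nodes_url("http://localhost:8000", types=["CORPUS"]))
# http://localhost:8000/api/nodes?types[]=CORPUS
```

The resulting string can be passed to any HTTP client; authentication is not covered here.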
### /api/nodes/5?fields[]=ngrams
Here `5` stands for a doc_id or a list_id
```
{
"ngrams": [
[1.0,{"id":2299,"n":1,"terms":"designs"}],
[1.0,{"id":1917,"n":1,"terms":"height"}],
[1.0,{"id":1755,"n":2,"terms":"higher speeds"}],
[1.0,{"id":1940,"n":1,"terms":"cylinders"}],
[1.0,{"id":2221,"n":3,"terms":"other synthesized materials"}],
(...)
[2.0,{"id":1970,"n":1,"terms":"storms"}],
[9.0,{"id":1754,"n":2,"terms":"spherical gauges"}],
[1.0,{"id":1895,"n":1,"terms":"direction"}],
[1.0,{"id":2032,"n":1,"terms":"testing"}],
[1.0,{"id":1981,"n":2,"terms":"wind effects"}]
]
}
```
------------------------------
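As an illustration of how such a response could be consumed, the sketch below ranks the ngrams of a trimmed sample by their first element. It assumes that this first element is the weight of the ngram and that `terms` is a JSON string; `top_terms` is a hypothetical helper, not part of Gargantext.

```python
import json

# Trimmed sample in the shape of the response above (values copied from it).
payload = json.loads("""
{"ngrams": [
  [1.0, {"id": 2299, "n": 1, "terms": "designs"}],
  [2.0, {"id": 1970, "n": 1, "terms": "storms"}],
  [9.0, {"id": 1754, "n": 2, "terms": "spherical gauges"}]
]}
""")


def top_terms(payload, limit=10):
    """Return (terms, weight) pairs sorted by decreasing weight."""
    ranked = sorted(payload["ngrams"], key=lambda pair: pair[0], reverse=True)
    return [(ngram["terms"], weight) for weight, ngram in ranked[:limit]]


print(top_terms(payload, limit=2))
# [('spherical gauges', 9.0), ('storms', 2.0)]
```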
### /api/nodes/3?fields[]=id&fields[]=hyperdata&fields[]=typename
```
{
"id": 3,
"typename": "DOCUMENT",
"hyperdata": {
"language_name": "English",
"language_iso3": "eng",
"language_iso2": "en",
"title": "A blabla analysis of laser treated aluminium blablabla",
"name": "A blabla analysis of laser treated aluminium blablabla",
"authors": "A K. Jain, V.N. Kulkarni, D.K. Sood",
"authorsRAW": [
{"name": "....", "affiliations": ["... Research Centre,.. 085, Country"]},
{"name": "....", "affiliations": ["... Research Centre,.. 086, Country"]}
(...)
],
"abstract": "Laser processing of materials, being a rapid melt quenching process, quite often produces a surface which is far from being ideally smooth for ion beam analysis. (...)",
"genre": ["research-article"],
"doi": "10.1016/0029-554X(81)90998-8",
"journal": "Nuclear Instruments and Methods In Physics Research",
"publication_year": "1981",
"publication_date": "1981-01-01 00:00:00",
"publication_month": "01",
"publication_day": "01",
"publication_hour": "00",
"publication_minute": "00",
"publication_second": "00",
"id": "61076EB1178A97939B1C893904C77FB7DA2276D0",
"source": "elsevier",
"distributor": "istex"
}
}
```
## TODO: continue the list
// dot ngram_parsing_flow.dot -Tpng -o ngram_parsing_flow.png
digraph ngramflow {
edge [fontsize=10] ;
label=<<B><U>gargantext.util.toolchain</U></B><BR/>(ngram extraction flow)>;
labelloc="t" ;
"extracted_ngrams" -> "grouplist" ;
"extracted_ngrams" -> "occs+ti_rank" ;
"project stoplist (todo)" -> "stoplist" ;
"stoplist" -> "mainlist" ;
"occs+ti_rank" -> "mainlist" [label=" TI_RANK_LIMIT"];
"mainlist" -> "coocs" [label=" COOCS_THRESHOLD"] ;
"coocs" -> "specificity" ;
"specificity" -> "maplist" [label="MAPLIST_LIMIT\nMONOGRAM_PART"];
"mainlist" -> "tfidf" ;
"tfidf" -> "explore" [label="doc relations with all map and candidates"];
"maplist" -> "explore" ;
"grouplist" -> "occs+ti_rank" ;
"grouplist" -> "coocs" ;
"grouplist" -> "tfidf" ;
}
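To read the flow above as an execution order, the edges of the digraph can be fed to a topological sort. This is only a sketch: the edge list is copied from the graph, but `execution_order` is a hypothetical helper and the real toolchain may schedule its steps differently. It relies on `graphlib`, available in the standard library from Python 3.9.

```python
from graphlib import TopologicalSorter

# Edges copied from the digraph above: step -> steps it feeds.
EDGES = {
    "extracted_ngrams": ["grouplist", "occs+ti_rank"],
    "project stoplist (todo)": ["stoplist"],
    "stoplist": ["mainlist"],
    "occs+ti_rank": ["mainlist"],
    "mainlist": ["coocs", "tfidf"],
    "coocs": ["specificity"],
    "specificity": ["maplist"],
    "tfidf": ["explore"],
    "maplist": ["explore"],
    "grouplist": ["occs+ti_rank", "coocs", "tfidf"],
}


def execution_order(edges):
    """Return one valid order in which the steps could run."""
    sorter = TopologicalSorter()
    for source, targets in edges.items():
        for target in targets:
            sorter.add(target, source)  # target depends on source
    return list(sorter.static_order())


order = execution_order(EDGES)
print(" -> ".join(order))
```

Several orders are valid; the only guarantees are that each step comes after the steps it depends on, and that "explore" comes last since nothing follows it.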
#Contribution guide
## Community
* [http://gargantext.org/about](http://gargantext.org/about)
* IRC Chat: (OFTC/FreeNode) #gargantex
##Tools
* gogs
* server access
* forge
* gargantext box
##Gargantext
* Gargantext box install
(S.I.R. = Setup, Install & Run procedures)
* Architecture Overview
* Database Schema Overview
* Interface design Overview
##To do:
* Docs
* Interface design
* Parsers/scrapers
* Computing
## How to contribute:
1. Clone the repo
2. Create a new branch <username>-refactoring
3. Run the gargantext-box
4. Code
5. Test
6. Commit
### Example 1: Adding a parser
* Create your new file cern.py in gargantex/scrapers/
* Reference it in gargantex/scrapers/urls.py
by adding this line:
import scrapers.cern as cern
* Reference it in gargantext/constants:
```
# type 9
{ 'name': 'Cern',
'parser': CernParser,
'default_language': 'en',
},
```
* Add an APIKEY in gargantex/settings
### Example 2: User Interface Design
#Contribution guide
* A question or a problem? Ask the community
* Sources
* Tools
* Contribution workflow: for contributions, bugs and features
* Some examples of contributions
## Community
Need help? Ask the community
* [http://gargantext.org/about](http://gargantext.org/about)
* IRC Chat: (OFTC/FreeNode) #gargantex
## Source
Sources are available under the XXX LICENSE
You can install Gargantext through the [installation procedure](./install.md)
##Tools
* gogs
* forge.iscpif.fr
* server access
* gargantext box
## Contributing: workflow procedure
Once you have installed and tested Gargantext, follow these steps:
1. Clone the stable release into your project
Note: The current stable release <release_branch> is: refactoring
Inside the repo, check out the reference branch and get the latest changes:
git checkout <ref_branch>
git pull
It is highly recommended to create a generic branch on a stable release, such as:
git checkout -b <username>-<release_branch>
git pull
2. Create your project on stable release
git checkout -b <username>-<release_branch>-<project_name>
Make your modifications and commit them as you see fit:
git commit -m "foo/bar/1"
git commit -m "foo/bar/2"
git push
If you want to save your local changes, you can merge them into your generic branch <username>-<release_branch>:
git checkout <username>-<release_branch>
git pull
git merge <username>-<release_branch>-<project_name>
git commit -m "[Merge OK] comment"
##Technical Overview
* Interface Overview
* Database Schema Overview
* Architecture Overview
### Example 1: Adding a parser
### Example 2: User Interface Design
Life cycle of ngram counts
--------------------------
### (current design and possible directions) ###
In what creates the counts, we can distinguish two levels or steps:
1. the initial extraction and the storage of the weight of the
ngram-document relation (let's call these nodes "1doc")
2. everything else: preparing the aggregated counts for the terms
table ("stats"), and for the working tables of the graphs and of the
publication search.
We could perhaps speak of per-document indexing for level 1 and of "modelling" for level 2.
Note that level 1 deals with **forms**, i.e. bare ngrams (the observed form <=> a unique character string after normalization), whereas level 2 deals with richer objects... As processing goes on we still end up with ngrams, but they are:
- filtered (we do not compute everything on everything)
- typed with the map, stop and main lists (and perhaps soon user
"ownlists")...
- grouped (what the `+` in the terms table shows, and which could
perhaps also be displayed on the graph side?)
We can say that at level 2 we handle **terms** rather than **forms**... they are still ngrams, but enriched by their inclusion in a series of mini models (aggregations and a usage-driven typology of ngrams).
### Database tables
Adopting this distinction between forms and terms clarifies when the contents of the tables must be updated. In terms of data structure, counts are always stored as n-tuples, which can be summarized as follows:
- **1doc**: (doc:node - form:ngr - weight:float) in NodeNgram
tables
- **occs/gen/spec/tirank**: (measure_type:node - term:ngr -
weight:float) in NodeNgram tables
- **cooc**: (graph_type:node - term1:ngr - term2:ngr -
weight:float) in NodeNgramNgram tables
- **tfidf**: (publication_links_type:node - doc:node - term:ngr -
correlation:float) in NodeNodeNgram tables.
Here "type" is the node carrying the nature of the computed stat, or else
the reference of the graph for cooc and of the index tied to the publication
search for tfidf.
There are also relations that hold no counts but are essential for
building the counts of the others:
- map/main/stoplist: (list_type:node - form or term:ngr) in
NodeNgram tables
- "groups": (mainform:ngr - subform:ngr) in NodeNgramNgram
tables.
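The n-tuples listed above can be pictured as three record shapes. The sketch below uses hypothetical field names and made-up ids; it only illustrates the shapes and a naive aggregation over 1doc rows, not the actual ORM models or the toolchain's real computation of occurrences.

```python
from collections import defaultdict
from typing import NamedTuple


class NodeNgram(NamedTuple):
    """(node, ngram, weight): 1doc rows and occs/gen/spec/tirank rows."""
    node_id: int
    ngram_id: int
    weight: float


class NodeNgramNgram(NamedTuple):
    """(node, ngram1, ngram2, weight): cooc rows."""
    node_id: int
    ngram1_id: int
    ngram2_id: int
    weight: float


class NodeNodeNgram(NamedTuple):
    """(node1, node2, ngram, score): tfidf rows."""
    node1_id: int
    node2_id: int
    ngram_id: int
    score: float


def total_weight_per_ngram(rows):
    """Sum the weights of NodeNgram rows, ngram by ngram."""
    totals = defaultdict(float)
    for row in rows:
        totals[row.ngram_id] += row.weight
    return dict(totals)


# Hypothetical 1doc rows for two documents (nodes 3 and 5).
one_doc_rows = [NodeNgram(3, 1754, 2.0), NodeNgram(5, 1754, 1.0), NodeNgram(5, 1970, 1.0)]
print(total_weight_per_ngram(one_doc_rows))
# {1754: 3.0, 1970: 1.0}
```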
### Update scenarios
Over the course of the "user scenarios", several events come to
**modify these counts**:
1. (A) terms created by the user (e.g. by
selecting/adding them in the annotation view)
2. (B) imports of terms whose forms were never indexed on
this corpus
3. (C) terms ungrouped by the user
4. (D) a term moving from the stoplist to the other lists
5. (E) any other list change and/or creation of new
groups...
A and B are the only two steps, besides the initial extraction, where
forms are added. Currently A and B are handled right away for
level 1 (per-document tables): it seems best to re-index the 1doc
as early as possible after A or B. In the annotation view,
the user expects the highlighting to appear immediately
on the displayed document. For the import B, this is convenient because
the list of new terms is at hand, which avoids storing it
somewhere until a later recomputation.
The other information updated right away for A and B is the membership
in lists and in groups (for B), which requires no computation.
C, D and E do not affect level 1 (per-document tables) since they
add no new forms, but they modify lists and groups,
and must therefore trigger an update of the tfidf
(which requires a recomputation) and of the
coocs on map (applied when a new graph is requested).
C and D also require updating the per-term stats
(occurrences, gen/spec etc.) since subforms and
stoplist items do not appear in the stats.
To summarize, in every case:
=> adding to a list or a group, and any count of a
new form in the documents, are handled as soon as the user acts
=> but the more "advanced" models, i.e. the
occs, gen, spec stats and the working tables "coocs on map" and
"tfidf", must wait for a recomputation.
Ideally they would all be updated incrementally in the future
instead of forcing this recomputation... but for now this is where we stand.
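The rules of this section can be condensed into a small lookup, keyed by the labels A to E that the text uses for the five events listed above (in that order). This is a sketch of the reasoning only: `plan_updates` is a hypothetical helper that encodes just the statements made explicitly for each event, and nothing in Gargantext is claimed to be organised this way.

```python
ADDS_FORMS = {"A", "B"}                    # new forms appear in documents
CHANGES_LISTS_OR_GROUPS = {"C", "D", "E"}  # no new forms, lists/groups change
NEEDS_STATS_UPDATE = {"C", "D"}            # subforms / stoplist items change


def plan_updates(event):
    """Return (immediate, deferred) updates for an event labelled A to E."""
    if event not in ADDS_FORMS | CHANGES_LISTS_OR_GROUPS:
        raise ValueError(f"unknown event: {event!r}")
    immediate = ["list/group membership"]
    deferred = []
    if event in ADDS_FORMS:
        immediate.append("1doc re-indexing")
    if event in CHANGES_LISTS_OR_GROUPS:
        deferred += ["tfidf", "coocs on map"]
    if event in NEEDS_STATS_UPDATE:
        deferred.append("per-term stats (occs, gen/spec)")
    return immediate, deferred


for label in "ABCDE":
    print(label, plan_updates(label))
```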
### Associated functions
| | GUI | API action → url | VIEW | SUBROUTINES |
|-------|-----|------------------|------|-------------|
| A | annotations/highlight.js, annotations/ngramlists.js | PUT → api/ngrams, PUT/DEL → api/ngramlists/change | ApiNgrams, ListChange | util.toolchain.ngrams_addition.index_new_ngrams |
| B | NGrams_dyna_chart_and_table | POST/PATCH → api/ngramlists/import | CSVLists | util.ngramlists_tools.import_ngramlists, util.ngramlists_tools.merge_ngramlists, util.toolchain.ngrams_addition.index_new_ngrams |
| C,D,E | NGrams_dyna_chart_and_table | PUT/DEL → api/ngramlists/change, PUT/DEL → api/ngramlists/groups | ListChange, GroupChange | util.toolchain.ngrams_addition.index_new_ngrams |
The import B was brought back into service a few weeks ago, and I have just
reconnected A in the annotation view.
#Contribution guide
## Community
* [http://gargantext.org/about](http://gargantext.org/about)
* IRC Chat: (OFTC/FreeNode) #gargantex
##Tools
* gogs
* server access
* gargantext box
##Gargantext
* Gargantext box install
see [install procedure](install.md)
* Architecture Overview
* Database Schema Overview
* Interface design Overview
##To do:
* Docs
* Interface design
* [Parsers](./overview/parser.md) / [scrapers](./overview/scraper.md)
* Computing
## How to contribute:
1. Clone the repo
2. Create a new branch <username>-refactoring
3. Run the gargantext-box
4. Code
5. Test
6. Commit
94eb7bdf57557b72dcd1b93a42af044b pubmed.zip
# API
Be more careful about authorizations.
cf. "ng-resource".
# Projects
## Overview of all projects
- re-implement deletion
## Single project view
- re-implement deletion
# Taggers
Path for data used by taggers should be defined in `gargantext.constants`.
# Database
# Sharing
Here follows a brief description of how sharing could be implemented.
## Database representation
The database representation of sharing can be distributed among 4 tables:
- `persons`, of which items represent either a user or a group
- `relationships` describes the relationships between persons (affiliation
of a user to a group, contact between two users, etc.)
- `nodes` contains the projects, corpora, documents, etc. to share (they shall
inherit the sharing properties from their parents)
- `permissions` stores the relations between the tables described
above: it only consists of 2 foreign keys, plus an integer
between 1 and 3 representing the level of sharing, the start date
(when the sharing was set) and the end date (when applicable, the time
at which sharing was removed, `NULL` otherwise)
## Python code
The permission levels should be set in `gargantext.constants`, and defined as:
```python
PERMISSION_NONE = 0 # 0b0000
PERMISSION_READ = 1 # 0b0001
PERMISSION_WRITE = 3 # 0b0011
PERMISSION_OWNER = 7 # 0b0111
```
The requests to check for permissions (or add new ones) should not be rewritten
every time. They should be "hidden" within the models:
- `Person.owns(node)` returns a boolean
- `Person.can_read(node)` returns a boolean
- `Person.can_write(node)` returns a boolean
- `Person.give_right(node, permission)` gives a right to a given user
- `Person.remove_right(node, permission)` removes a right from a given user
- `Person.get_nodes(permission[, type])` returns an iterator on the list of
nodes on which the person has at least the given permission (optional
argument: type of requested node)
- `Node.get_persons(permission[, type])` returns an iterator on the list of
users who have at least the given permission on the node (optional argument:
type of requested persons, such as `USER` or `GROUP`)
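Because each level includes the bits of the levels below it, "at least the given permission" can be tested with a bitwise AND. The sketch below illustrates this with an in-memory stand-in for `Person`; it takes plain node ids instead of node objects, and the storage dictionary is an assumption made only to keep the example runnable, where a real model would query the `permissions` table.

```python
PERMISSION_NONE = 0   # 0b0000
PERMISSION_READ = 1   # 0b0001
PERMISSION_WRITE = 3  # 0b0011
PERMISSION_OWNER = 7  # 0b0111


class Person:
    """In-memory stand-in for the model described above."""

    def __init__(self, name):
        self.name = name
        self._rights = {}  # node id -> granted permission level

    def give_right(self, node_id, permission):
        current = self._rights.get(node_id, PERMISSION_NONE)
        self._rights[node_id] = current | permission

    def _has(self, node_id, required):
        # All bits of `required` must be present in the granted level.
        granted = self._rights.get(node_id, PERMISSION_NONE)
        return (granted & required) == required

    def can_read(self, node_id):
        return self._has(node_id, PERMISSION_READ)

    def can_write(self, node_id):
        return self._has(node_id, PERMISSION_WRITE)

    def owns(self, node_id):
        return self._has(node_id, PERMISSION_OWNER)

    def get_nodes(self, permission):
        return (n for n in self._rights if self._has(n, permission))


david = Person("David")
david.give_right(12, PERMISSION_OWNER)
print(david.can_read(12), david.can_write(12), david.owns(12))
# True True True
```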
## Example
Let's imagine the `persons` table contains the following data:
| id | type | username |
|----|-------|-----------|
| 1 | USER | David |
| 2 | GROUP | C.N.R.S. |
| 3 | USER | Alexandre |
| 4 | USER | Untel |
| 5 | GROUP | I.S.C. |
| 6 | USER | Bidule |
Assume "David" owns the groups "C.N.R.S." and "I.S.C.", "Alexandre" belongs to
the group "I.S.C.", with "Untel" and "Bidule" belonging to the group "C.N.R.S.".
"Alexandre" and "David" are in contact.
The `relationships` table then contains:
| person1_id | person2_id | type |
|------------|------------|---------|
| 1 | 2 | OWNER |
| 1 | 5 | OWNER |
| 3 | 5 | MEMBER |
| 4 | 2 | MEMBER |
| 6 | 2 | MEMBER |
| 1 | 3 | CONTACT |
The `nodes` table is populated as such:
| id | type | name |
|----|----------|----------------------|
| 12 | PROJECT | My super project |
| 13 | CORPUS | A given corpus |
| 13 | CORPUS | The corpus |
| 14 | DOCUMENT | Some document |
| 15 | DOCUMENT | Another document |
| 16 | DOCUMENT | Yet another document |
| 17 | DOCUMENT | Last document |
| 18 | PROJECT | Another project |
| 19 | PROJECT | That project |
If we want to express that "David" created "My super project" (and its children)
and wants everyone in "C.N.R.S." to be able to view it, but not modify it,
`permissions` should contain:
| person_id | node_id | permission |
|-----------|---------|------------|
| 1 | 12 | OWNER |
| 2 | 12 | READ |
If "David" also wanted "Alexandre" (and no one else) to view and modify "The
corpus" (and its children), we would have:
| person_id | node_id | permission |
|-----------|---------|------------|
| 1 | 12 | OWNER |
| 2 | 12 | READ |
| 3 | 13 | WRITE |
If "Alexandre" created "That project" and wants "Bidule" (and no one else) to be
able to view and modify it (and its children), the table should then have:
| person_id | node_id | permission |
|-----------|---------|------------|
| 3 | 19 | OWNER |
| 6 | 19 | WRITE |
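To check the example by hand, the sketch below resolves the permission a person effectively holds on a node, combining rows granted directly with rows granted to the groups the person belongs to. It assumes that group rights propagate to members, uses the memberships as stated in the prose (Alexandre in I.S.C.; Untel and Bidule in C.N.R.S.), and ignores inheritance from parent nodes, which the tables above do not describe. `effective_permission` is a hypothetical helper.

```python
PERMISSION_NONE, PERMISSION_READ, PERMISSION_WRITE, PERMISSION_OWNER = 0, 1, 3, 7

# (member_id, group_id) pairs of type MEMBER, as stated in the prose:
# Alexandre (3) is in I.S.C. (5); Untel (4) and Bidule (6) are in C.N.R.S. (2).
MEMBERSHIPS = [(3, 5), (4, 2), (6, 2)]

# (person_id, node_id, permission) rows of the second `permissions` table.
PERMISSIONS = [
    (1, 12, PERMISSION_OWNER),
    (2, 12, PERMISSION_READ),
    (3, 13, PERMISSION_WRITE),
]


def effective_permission(person_id, node_id):
    """Combine the rights held directly and through group membership."""
    holders = {person_id}
    holders |= {group for member, group in MEMBERSHIPS if member == person_id}
    level = PERMISSION_NONE
    for holder, node, permission in PERMISSIONS:
        if holder in holders and node == node_id:
            level |= permission
    return level


print(effective_permission(4, 12))  # Untel, via C.N.R.S.
# 1
print(effective_permission(3, 12))  # Alexandre is not in C.N.R.S.
# 0
```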
#User guide
1. Login
run the gargantext box following the install procedure
open a web browser at http://127.0.0.1:8000/
click on Test Gargantext
login with:
```
Login : gargantua
Password : autnagrag
```
2. Create a project
3. Import an existing corpus
4. Create corpus from search
5. Explore stats
6. Explore graphs
7. Query
8. Refine
* Time periods
* Nodes
9. Export
#Architecture Overview
#Database Schema
#Website
Gargantext is a web platform to explore your corpora using text-mining[...](about.md)
## Getting started
* [Install](install.md) the Gargantext box
* [Take a tour](demo.md) of the different features offered by Gargantext
##Need some help?
Ask the community at:
* [http://gargantext.org/about](http://gargantext.org/about)
* IRC Chat: (OFTC/FreeNode) #gargantex
##Want to contribute?
* take a look at the [architecture overview](overview.md)
* read the [contribution guide](contribution-guide.md)
## News
## Credits and acknowledgments
#Install Instructions for Gargamelle:
Gargamelle is the Gargantext platform toolbox: a full platform system
with minimal modules.
First you need to get the source code to install it.
The folder will be /srv/gargantext:
* docs contains all the information on gargantext
/srv/gargantext/docs/
* install contains all the installation files
/srv/gargantext/install/
Need help?
See [http://gargantext.org/about](http://gargantext.org/about) and [tools](./contribution_guide.md) for the community
## Get the source code
by cloning gargantext into /srv/gargantext
``` bash
git clone ssh://gitolite@delanoe.org:1979/gargantext /srv/gargantext \
&& cd /srv/gargantext \
&& git fetch origin stable \
&& git checkout stable
```
## Install
```bash
# go into the directory
user@computer: cd /srv/gargantext/
# go inside the installation folder
user@computer: cd install
# run the installation
user@computer: ./install
```
The installation requires creating a user for Gargantext; you will be prompted for:
```bash
Username (leave blank to use 'gargantua'):
#email is not mandatory
Email address:
Password:
Password (again):
```
If this step completes successfully, you should see:
```bash
Superuser created successfully.
[ ok ] Stopping PostgreSQL 9.5 database server: main.
```
## Run
Once the installation is complete, the Gargantext platform will be available at localhost:8000.
To start the Gargantext platform:
``` bash
# go into the directory
user@computer: cd /srv/gargantext/
# start Gargantext
user@computer: ./start
# press Ctrl+D or type exit in the terminal to quit
```
Then open a Chromium browser and go to localhost:8000
Click on "Enter Gargantext"
Log in with the username and password you created
Enjoy! ;)
* Create user gargantua
The main user of Gargantext is Gargantua (role of Pantagruel soon)!
``` bash
sudo adduser --disabled-password --gecos "" gargantua
```
* Create the directories you need
In this example, the gargantext package will be installed in /srv/
``` bash
for dir in "/srv/gargantext" \
           "/srv/gargantext_lib" \
           "/srv/gargantext_static" \
           "/srv/gargantext_media" \
           "/srv/env_3-5"; do
    sudo mkdir -p "$dir"
    sudo chown gargantua:gargantua "$dir"
done
```
You should see:
```bash
$ tree /srv
/srv
├── gargantext
├── gargantext_lib
├── gargantext_media
│   └── srv
│       └── env_3-5
└── gargantext_static
* Get the main libraries
Download and uncompress them, then give the main user access to them.
Please be patient: because of the size of the library packages (27 GB),
this step can take a long time.
``` bash
wget http://dl.gargantext.org/gargantext_lib.tar.bz2 \
&& tar xvjf gargantext_lib.tar.bz2 -C /srv/gargantext_lib \
&& sudo chown -R gargantua:gargantua /srv/gargantext_lib \
&& echo "Libs installed"
```
* Get the source code of Gargantext
by cloning the repository of gargantext
``` bash
git clone ssh://gitolite@delanoe.org:1979/gargantext /srv/gargantext \
&& cd /srv/gargantext \
&& git fetch origin refactoring \
&& git checkout refactoring
```
TODO(soon): git clone https://gogs.iscpif.fr/gargantext.git
See the [next steps of installation procedure](install.md#Install)
#Architecture Overview
#Database Schema
#Website
# HOW TO: Reference a new web scraper/API + parser
## Global scope
Three main steps:
- develop and index a parser
in gargantext.util.parsers
- develop and index a scraper
in gargantext.moissonneurs
- adapt forms for a new source
in templates and views
## Reference the parser in the gargantext website
The gargantext website is stored in gargantext/gargantext
### Reference your new parser in constants.py
* import your parser (l.125)
```
from gargantext.util.parsers import \
EuropressParser, RISParser, PubmedParser, ISIParser, CSVParser, ISTexParser, CernParser
```
The parser corresponds to the name of the parser referenced in gargantext/util/parser;
here the name is CernParser
* index your RESOURCETYPE
in RESOURCETYPES (l.145) **at the end of the list**
```
# type 10
{ "name": 'SCOAP (XML MARC21 Format)',
"parser": CernParser,
"default_language": "en",
'accepted_formats':["zip","xml"],
},
```
Note that the name here is composed of the API_name (SCOAP) + (GENERICFILETYPE FORMAT_XML Format)
The complexity of the naming reflects three things:
* the name of the API (different from the producing organization)
* the format type: XML
* the XML standard of this format: MARC21 (cf. CernParser in gargantext/util/parser/Cern.py )
The default_language corresponds to the default accepted language, which **should load** the corresponding default tagger
```
from gargantext.util.taggers import NltkTagger
```
TO DO: load tagger types on demand depending on the languages and the install
TO DO: offer a module to download additional parsers
TO DO: provide install tagger module scripts inside lib
The formats correspond to the file types accepted when the file is submitted through the
parsing form available in `gargantext/view/pages/projects.py` and
exposed in `/templates/pages/projects/project.html`
## Reference your parser script
## Add your parser script to the folder gargantext/util/parser/
Here my filename was Cern.py
## Declare it in `gargantext/util/parser/__init__.py`
from .Cern import CernParser
At this stage you will be able to see your parser and submit a file through the form,
but nothing will happen yet
## The right way to write the parser script
Three main and only requirements:
* your parser class should inherit from the base class _Parser()
`gargantext/gargantext/util/parser/_Parser`
* your parser class must have a parse method that takes a **file buffer** as input
* your parser must structure and store data in a variable named **hyperdata_list**
so that it is properly indexed by the toolchain
! Be careful with the date format: provide publication_date as a string in the format YYYY-mm-dd HH:MM:SS
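The three requirements and the date format can be illustrated with a toy parser. Everything specific here is an assumption: `_Parser` is a stand-in defined only to make the sketch runnable, `TabSeparatedParser` and its input format (one `title<TAB>dd/mm/YYYY` record per line) are hypothetical, and returning `hyperdata_list` is a guess at how the toolchain picks it up.

```python
import io
from datetime import datetime


class _Parser:
    """Stand-in for the real base class, only here to make the sketch runnable."""

    def parse(self, file):
        raise NotImplementedError


class TabSeparatedParser(_Parser):
    """Hypothetical parser for lines of the form `title<TAB>dd/mm/YYYY`."""

    def parse(self, file):
        hyperdata_list = []
        for line in file:
            line = line.strip()
            if not line:
                continue
            title, raw_date = line.split("\t")
            date = datetime.strptime(raw_date, "%d/%m/%Y")
            hyperdata_list.append({
                "title": title,
                # Format required above: YYYY-mm-dd HH:MM:SS
                "publication_date": date.strftime("%Y-%m-%d %H:%M:%S"),
            })
        return hyperdata_list


buffer = io.StringIO("A recording evaporimeter\t01/01/1981\n")
print(TabSeparatedParser().parse(buffer))
# [{'title': 'A recording evaporimeter', 'publication_date': '1981-01-01 00:00:00'}]
```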
# Adding a scraper API to offer a search option:
In progress
* Add a pop-up question "Do you have a corpus?"
with a search option in /templates/pages/projects/project.html line 181
## Reference a scraper (moissonneur) in gargantext
* add accepted_formats in constants
* add a check_file routine in the Form check ==> but it should inherit from utils/files.py,
which also implements the upload size limit check
# Suggestions for next steps:
* XML parser MARC21 UNIMARC ...
* A project type is determined by the first element added, i.e.
the first element determines the corpus type of all the corpora within the project
#resources
Adding a new source into Gargantext requires a previous declaration
of the source inside constants.py
```python
RESOURCETYPES= [
{ "type":9, #give a unique type int
"name": 'SCOAP [XML]', #resource name as proposed into the add corpus FORM [generic format]
"parser": "CernParser", #name of the new parser class inside a CERN.py file (set to None if not implemented)
"format": 'MARC21', #specific format
'file_formats':["zip","xml"],# accepted file format
"crawler": "CernCrawler", #name of the new crawler class inside a CERN.py file (set to None if no Crawler implemented)
'default_languages': ['en', 'fr'], #supported defaut languages of the source
},
...
]
```
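Since each entry needs a unique `type` integer, a small consistency check can catch mistakes when a source is added. The sketch below is not part of Gargantext: `check_resourcetypes` is a hypothetical helper, and the set of required keys is simply taken from the sample entry above.

```python
REQUIRED_KEYS = {
    "type", "name", "parser", "format",
    "file_formats", "crawler", "default_languages",
}


def check_resourcetypes(resourcetypes):
    """Raise ValueError on a missing key or a duplicated type number."""
    seen = set()
    for resource in resourcetypes:
        missing = REQUIRED_KEYS - resource.keys()
        if missing:
            raise ValueError(f"{resource.get('name')!r}: missing {sorted(missing)}")
        if resource["type"] in seen:
            raise ValueError(f"duplicate type {resource['type']}")
        seen.add(resource["type"])


# The sample entry from the declaration above.
RESOURCETYPES = [
    {"type": 9,
     "name": "SCOAP [XML]",
     "parser": "CernParser",
     "format": "MARC21",
     "file_formats": ["zip", "xml"],
     "crawler": "CernCrawler",
     "default_languages": ["en", "fr"],
     },
]
check_resourcetypes(RESOURCETYPES)  # passes silently
```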
## adding a new parser
Once you have declared your new parser inside constants.py,
add your new parser file into /srv/gargantext/utils/parsers/
following this naming convention:
* The filename must be in uppercase without the Parser mention,
e.g. MailParser => MAIL.py
* Inside this file, the Parser class must be named exactly as declared under parser in constants.py
* Your new parser shall inherit from the base class Parser and provide a parse(filebuffer) method
```python
#!/usr/bin/env python3
# filename: /srv/gargantext/util/parser/MAIL.py
from ._Parser import Parser

class MailParser(Parser):
    def parse(self, file):
        ...
```
## adding a new crawler
Once you have declared your new crawler inside constants.py,
add your new crawler file into /srv/gargantext/util/crawler/
following this naming convention:
* The filename must be in uppercase without the Crawler mention,
e.g. MailCrawler => MAIL.py
* Inside this file, the Crawler class must be named exactly as declared under crawler in constants.py
* Your new crawler shall inherit from the base class Crawler and provide three methods:
* scan_results => ids
* sample => yes/no
* fetch
```python
#!/usr/bin/env python3
# filename: /srv/gargantext/util/crawler/MAIL.py
from ._Crawler import Crawler

class MailCrawler(Crawler):
    def scan_results(self, query):
        ...
        self.ids = set()

    def sample(self, results_nb):
        ...

    def fetch(self, ids):
        ...
```
// dot ngram_parsing_flow.dot -Tpng -o ngram_parsing_flow.png
digraph ngramflow {
edge [fontsize=10] ;
label=<<B><U>gargantext.util.toolchain</U></B><BR/>(ngram extraction flow)>;
labelloc="t" ;
"extracted_ngrams" -> "grouplist" ;
"extracted_ngrams" -> "occs+tfidfs" ;
"main_user_stoplist" -> "stoplist" ;
"stoplist" -> "mainlist" ;
"occs+tfidfs" -> "mainlist" [label=" TFIDF_LIMIT"];
"mainlist" -> "coocs" [label=" COOCS_THRESHOLD"] ;
"coocs" -> "specificity" ;
"specificity" -> "maplist" [label="MAPLIST_LIMIT\nMONOGRAM_PART"];
"maplist" -> "explore" ;
"grouplist" -> "maplist" ;
}