Commit 175e4fab authored by sim

Remove old documentation

parent e5d705ad
mkdocs build --clean
mkdocs serve --dev-addr=0.0.0.0:8888
# Gargantext
Welcome to the Gargantext documentation!
List of garg's own JSON API URLs
===================================
2016-05-27
### /api/nodes/2
```
{
"id": 2,
"parent_id": 1,
"name": "abstract:\"evaporation+loss\"",
"typename": "CORPUS"
}
```
------------------------------
### /api/nodes?pagination_limit=-1
```
{
"records": [
{
"id": 9,
"parent_id": 2,
"name": "A recording evaporimeter",
"typename": "DOCUMENT"
},
(...)
{
"id": 119,
"parent_id": 81,
"name": "GRAPH EXPLORER COOC (in:81)",
"typename": "COOCCURRENCES"
}
],
"count": 119,
"parameters": {
"formated": "json","pagination_limit": -1,
"fields": ["id","parent_id","name","typename"],
"pagination_offset": 0
}
}
```
------------------------------
### /api/nodes?types[]=CORPUS
```
{
"records": [
{
"id": 2,
"parent_id": 1,
"name": "abstract:\"evaporation+loss\"",
"typename": "CORPUS"
},
(...)
{
"id": 8181,
"parent_id": 1,
"name": "abstract:(astrogeology+OR ((space OR spatial) AND planetary) AND geology)",
"typename": "CORPUS"
}
],
"count": 2,
"parameters": {
"pagination_limit": 10,
"types": ["CORPUS"],
"formated": "json",
"pagination_offset": 0,
"fields": ["id","parent_id","name","typename"]
}
}
```
------------------------------
### /api/nodes/5?fields[]=ngrams
<5> represents a doc_id or a list_id
```
{
"ngrams": [
[1.0,{"id":2299,"n":1,"terms":designs}],
[1.0,{"id":1917,"n":1,"terms":height}],
[1.0,{"id":1755,"n":2,"terms":higher speeds}],
[1.0,{"id":1940,"n":1,"terms":cylinders}],
[1.0,{"id":2221,"n":3,"terms":other synthesized materials}],
(...)
[2.0,{"id":1970,"n":1,"terms":storms}],
[9.0,{"id":1754,"n":2,"terms":spherical gauges}],
[1.0,{"id":1895,"n":1,"terms":direction}],
[1.0,{"id":2032,"n":1,"terms":testing}],
[1.0,{"id":1981,"n":2,"terms":"wind effects"}]
]
}
```
------------------------------
### /api/nodes/3?fields[]=id&fields[]=hyperdata&fields[]=typename
```
{
"id": 3,
"typename": "DOCUMENT",
"hyperdata": {
"language_name": "English",
"language_iso3": "eng",
"language_iso2": "en",
"title": "A blabla analysis of laser treated aluminium blablabla",
"name": "A blabla analysis of laser treated aluminium blablabla",
"authors": "A K. Jain, V.N. Kulkarni, D.K. Sood"
"authorsRAW": [
{"name": "....", "affiliations": ["... Research Centre,.. 085, Country"]},
{"name": "....", "affiliations": ["... Research Centre,.. 086, Country"]}
(...)
],
"abstract": "Laser processing of materials, being a rapid melt quenching process, quite often produces a surface which is far from being ideally smooth for ion beam analysis. (...)",
"genre": ["research-article"],
"doi": "10.1016/0029-554X(81)90998-8",
"journal": "Nuclear Instruments and Methods In Physics Research",
"publication_year": "1981",
"publication_date": "1981-01-01 00:00:00",
"publication_month": "01",
"publication_day": "01",
"publication_hour": "00",
"publication_minute": "00",
"publication_second": "00",
"id": "61076EB1178A97939B1C893904C77FB7DA2276D0",
"source": "elsevier",
"distributor": "istex"
}
}
```
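For instance, these endpoints can be queried with any HTTP client. Here is a minimal Python sketch (the base URL and the absence of authentication are assumptions; adjust to your instance):
```python
import requests

BASE_URL = "http://localhost:8000"  # assumption: a local Gargantext instance

# Mirror the /api/nodes?types[]=CORPUS example above.
response = requests.get(
    BASE_URL + "/api/nodes",
    params={"types[]": "CORPUS", "pagination_limit": 10},
)
response.raise_for_status()
for record in response.json()["records"]:
    print(record["id"], record["typename"], record["name"])
```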
## TODO: continue the list
# Definitions and notation for the documentation (!= python notation)
## Node
The table (nodes) is a list of nodes: `[Node]`
Each Node has:
- a typename
- a parent_id
- a name
### Each Node has a parent_id
Node A
├── Node B
└── Node C
If Node A is the parent of Node B and Node C,
then NodeA.id == NodeB.parent_id == NodeC.parent_id (a minimal sketch follows).
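As an illustration of this invariant (illustrative only, not the actual Gargantext model):
```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Node:
    id: int
    typename: str
    name: str
    parent_id: Optional[int] = None

node_a = Node(id=1, typename="PROJECT", name="Node A")
node_b = Node(id=2, typename="CORPUS", name="Node B", parent_id=node_a.id)
node_c = Node(id=3, typename="CORPUS", name="Node C", parent_id=node_a.id)

# Node A is the parent of Node B and Node C:
assert node_a.id == node_b.parent_id == node_c.parent_id
```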
### Each Node has a typename
Notation: `Node["FOO"]("bar")` is a Node of typename "FOO" and with name "bar".
Then:
- Node[PROJECT] is a project.
- Node[CORPUS] is a corpus.
- Node[DOCUMENT] is a document.
The Node syntax here does not exactly follow the Python code
(for clarity and to begin with): in Python code, typenames are
represented as UPPERCASE strings (e.g. "PROJECT").
### Each Node has a typename and a parent
Node[USER](name)
├── Node[PROJECT](myProject1)
│   ├── Node[CORPUS](myCorpus1)
│   ├── Node[CORPUS](myCorpus2)
│   └── Node[CORPUS](myCorpus3)
└── Node[PROJECT](myProject2)
/!\\ 3 ways to manage the rights of a Node:
1. Node[USER] is a folder containing all the user's projects, corpora and
documents (i.e. Node[USER] is the parent_id of the children).
2. Each node has a user_id (mainly used today).
3. Rights management for groups (already implemented but not
used, since it is not connected to the frontend).
## Global Parameters
The global user is Gargantua (a Node with typename USER).
This node is the parent of the other nodes for parameters.
Node[USER](gargantua) (gargantua.id == Node[USER].user_id)
├── Node[TFIDF-Global](global) : without group
│   ├── Node[TFIDF](database1)
│   ├── Node[TFIDF](database2)
│   └── Node[TFIDF](database3)
└── Node[ANOTHERMETRIC](global)
[//]: # (Are there any plans to add user wide or project wide parameters or metrics? For example TFIDF nodes related to a normal user -- ie. not Gargantua?)
Yes, we can in the future (but we have other priorities first).
[//]: # (What is the purpose of the 3 child nodes of Node[TFIDF-Global]? Are they TFIDF metrics related to databases 1, 2 and 3? If so, shouldn't they be children of related CORPUS nodes?)
Node placement in the tree indicates the context of the metric: the
Metrics Node has the corpus Node as parent to indicate the context of
the metrics.
Answer:
Node[USER](foo)
Node[USER](bar)
├── Node[PROJECT](project1)
│   ├── Node[CORPUS](corpus1)
│   │   ├── Node[DOCUMENT](doc1)
│   │   ├── Node[DOCUMENT](doc2)
│   │   └── Node[TFIDF-global](name of the metrics)
│   ├── Node[CORPUS](corpus2)
│   └── Node[CORPUS](corpus3)
└── Node[PROJECT](project2)
## NodeNgram
NodeNgram is a relation between a Node and an ngram (a minimal sketch follows the list):
- documents and ngrams
- metrics and ngrams (the position of the metrics node indicates the
context)
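A minimal sketch of the shape of this relation (field names are illustrative, not the actual schema):
```python
from dataclasses import dataclass

@dataclass
class NodeNgram:
    node_id: int   # a document node, or a metrics node giving the context
    ngram_id: int
    weight: float

# The document/ngram weight and the metrics/ngram weight share one shape:
doc_count = NodeNgram(node_id=9, ngram_id=1754, weight=2.0)
metric_value = NodeNgram(node_id=119, ngram_id=1754, weight=0.7)
```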
# Community Parameters
# User Parameters
// dot ngram_parsing_flow.dot -Tpng -o ngram_parsing_flow.png
digraph ngramflow {
edge [fontsize=10] ;
label=<<B><U>gargantext.util.toolchain</U></B><BR/>(ngram extraction flow)>;
labelloc="t" ;
"extracted_ngrams" -> "grouplist" ;
"extracted_ngrams" -> "occs+ti_rank" ;
"project stoplist (todo)" -> "stoplist" ;
"stoplist" -> "mainlist" ;
"occs+ti_rank" -> "mainlist" [label=" TI_RANK_LIMIT"];
"mainlist" -> "coocs" [label=" COOCS_THRESHOLD"] ;
"coocs" -> "specificity" ;
"specificity" -> "maplist" [label="MAPLIST_LIMIT\nMONOGRAM_PART"];
"mainlist" -> "tfidf" ;
"tfidf" -> "explore" [label="doc relations with all map and candidates"];
"maplist" -> "explore" ;
"grouplist" -> "occs+ti_rank" ;
"grouplist" -> "coocs" ;
"grouplist" -> "tfidf" ;
}
# Contribution guide
## Community
* [http://gargantext.org/about](http://gargantext.org/about)
* IRC Chat: (OFTC/FreeNode) #gargantext
## Tools
* gogs
* server access
* forge
* gargantext box
## Gargantext
* Gargantext box install
(S.I.R.= Setup Install & Run procedures)
* Architecture Overview
* Database Schema Overview
* Interface design Overview
## To do:
* Docs
* Interface design
* Parsers/scrapers
* Computing
## How to contribute:
1. Clone the repo
2. Create a new branch: <username>-refactoring
3. Run the gargantext-box
4. Code
5. Test
6. Commit
### Example 1: Adding a parser
* create your new file cern.py into gargantext/scrapers/
* reference it into gargantext/scrapers/urls.py by adding this line:
    import scrapers.cern as cern
* reference it into gargantext/constants
```
# type 9
{ 'name': 'Cern',
'parser': CernParser,
'default_language': 'en',
},
```
* add an APIKEY in gargantext/settings
### Example 2: User Interface Design
# Contribution guide
* A question or a problem? Ask the community
* Sources
* Tools
* Contribution workflow: for contributions, bugs and features
* Some examples of contributions
## Community
Need help? Ask the community:
* [http://gargantext.org/about](http://gargantext.org/about)
* IRC Chat: (OFTC/FreeNode) #gargantext
## Sources
Sources are available through the XXX LICENSE.
You can install Gargantext through the [installation procedure](./install.md).
## Tools
* gogs
* forge.iscpif.fr
* server access
* gargantext box
## Contributing: workflow procedure
Once you have installed and tested Gargantext:
1. Clone the stable release into your project
Note: the current stable release <release_branch> is: refactoring
Inside the repo, check out the reference branch and get the last changes:

    git checkout <ref_branch>
    git pull

It is highly recommended to create a generic branch on a stable release:

    git checkout -b <username>-<release_branch>
    git pull

2. Create your project on the stable release

    git checkout -b <username>-<release_branch>-<project_name>

Do your modifications and commits as you want:

    git commit -m "foo/bar/1"
    git commit -m "foo/bar/2"
    git push

If you want to save your local changes, you can merge them into your generic branch <username>-<release_branch>:

    git checkout <username>-<release_branch>
    git pull
    git merge <username>-<release_branch>-<project_name>
    git commit -m "[Merge OK] comment"
## Technical Overview
* Interface Overview
* Database Schema Overview
* Architecture Overview
### Example 1: Adding a parser
### Example 2: User Interface Design
Life cycle of ngram counts
-----------------------------------
### (current scheme and leads) ###
In what creates the counts, we can distinguish two levels or steps:
1. the initial extraction and the storage of the weight of the
ngram-document relation (let us call these nodes "1doc")
2. everything else: the preparation of the aggregated counts for the
terms table ("stats"), and for the work tables of the graphs and of the
publication search.
We could perhaps speak of indexing by docs for level 1 and of "modelings" for level 2.
Note that level 1 concerns **forms** or ngrams alone (the observed form <=> character string, unique after normalization), whereas at level 2 we have richer objects... As the processing goes on we still always have ngrams, but:
- filtered (we do not compute everything on everything)
- typed with the map, stop and main lists (and perhaps soon user
"ownlists")...
- grouped (what we see with the `+` of the terms table, and which we
could perhaps also show on the graph side?)
We could say that at level 2 we manipulate **terms** rather than **forms**... they are still ngrams, but enriched by their inclusion in a series of mini models (aggregations and a typology of ngrams guided by the usages).
### Tables in the DB
If we adopt this distinction between forms and terms, it clarifies at which moment we must update what we have in the tables. On the data-structure side, the counts are always stored via n-tuples, which we can summarize as follows:
- **1doc**: (doc:node - form:ngr - weight:float) in NodeNgram
tables
- **occs/gen/spec/tirank**: (measure_type:node - term:ngr -
weight:float) in NodeNgram tables
- **cooc**: (graph_type:node - term1:ngr - term2:ngr -
weight:float) in NodeNgramNgram tables
- **tfidf**: (publis_links_type:node - doc:node - term:ngr -
correlation:float) in NodeNodeNgram tables.
Here "type" is the node carrying the nature of the obtained stat, or the
ref of the graph for cooc and of the index linked to the publication
search for the tfidf.
There are also the relations that contain no counts but are
essential to build the counts of the others:
- map/main/stop lists: (list_type:node - form or term:ngr) in
NodeNgram tables
- "groups": (mainform:ngr - subform:ngr) in NodeNgramNgram
tables.
### Update scenarios
In the course of the "user scenarios", several events come to
**modify these counts**:
A. term creations made by the user (e.g. by
selection/addition in the annotation view)
B. imports of terms corresponding to forms never indexed on
this corpus
C. term ungroupings made by the user
D. the move of a term from the stop list to the other lists
E. any other list change and/or creation of new
groups...
A and B are the only two steps, apart from the initial extraction, where
forms are added. Currently A and B are handled immediately for
level 1 (the per-doc tables): it seems good to perform the
re-indexing of the 1doc as early as possible after A or B. For the
annotation view, the user expects to see the highlighting appear
immediately on the displayed doc. For import B, it is convenient because
we have the list of new terms at hand, which avoids storing it
somewhere while waiting for a later recomputation.
The other info updated immediately for A and B is the membership
in lists and in groups (for B), which requires no computation.
C, D and E do not affect level 1 (the per-doc tables) since they
add no new forms, but they constitute modifications
of the lists and the groups, and must therefore trigger a
modification of the tfidf (which requires a recomputation) and of
the coocs on map (effect applied when a new graph is requested).
C and D also require an update of the per-term stats
(occurrences, gen/spec etc.), since the subform elements and the
stop-list elements do not appear in the stats.
So, to summarize, in all cases:
=> the addition to a list, to a group and any count of a
new form in the docs are handled as soon as the user acts
=> but the more "advanced" modelings represented by the
occs, gen and spec stats and by the "coocs on map" and
"tfidf" work tables must wait for a recomputation.
Ideally, in the future, they would all be updated incrementally
instead of forcing this recomputation... but for now this is where we are.
### Associated functions
|       | GUI                                                   | API action → url                                                   | VIEW                      | SUBROUTINES                                                                                                                         |
|-------|-------------------------------------------------------|--------------------------------------------------------------------|---------------------------|-------------------------------------------------------------------------------------------------------------------------------------|
| A     | "annotations/highlight.js, annotations/ngramlists.js" | "PUT → api/ngrams, PUT/DEL → api/ngramlists/change"                | "ApiNgrams, ListChange"   | util.toolchain.ngrams_addition.index_new_ngrams                                                                                     |
| B     | NGrams_dyna_chart_and_table                           | POST/PATCH → api/ngramlists/import                                 | CSVLists                  | "util.ngramlists_tools.import_ngramlists, util.ngramlists_tools.merge_ngramlists, util.toolchain.ngrams_addition.index_new_ngrams" |
| C,D,E | NGrams_dyna_chart_and_table                           | "PUT/DEL → api/ngramlists/change, PUT/DEL → api/ngramlists/groups" | "ListChange, GroupChange" | util.toolchain.ngrams_addition.index_new_ngrams                                                                                     |
Import B was put back into service a few weeks ago, and I have just
reconnected A in the annotation view.
# Contribution guide
## Community
* [http://gargantext.org/about](http://gargantext.org/about)
* IRC Chat: (OFTC/FreeNode) #gargantext
## Tools
* gogs
* server access
* gargantext box
## Gargantext
* Gargantext box install
see [install procedure](install.md)
* Architecture Overview
* Database Schema Overview
* Interface design Overview
## To do:
* Docs
* Interface design
* [Parsers](./overview/parser.md) / [scrapers](./overview/scraper.md)
* Computing
## How to contribute:
1. Clone the repo
2. Create a new branch: <username>-refactoring
3. Run the gargantext-box
4. Code
5. Test
6. Commit
94eb7bdf57557b72dcd1b93a42af044b pubmed.zip
# API
Be more careful about authorizations.
cf. "ng-resource".
# Projects
## Overview of all projects
- re-implement deletion
## Single project view
- re-implement deletion
# Taggers
Path for data used by taggers should be defined in `gargantext.constants`.
# Database
# Sharing
Here follows a brief description of how sharing could be implemented.
## Database representation
The database representation of sharing can be distributed among 4 tables:
- `persons`, of which items represent either a user or a group
- `relationships` describes the relationships between persons (affiliation
of a user to a group, contact between two users, etc.)
- `nodes` contains the projects, corpora, documents, etc. to share (they shall
inherit the sharing properties from their parents)
- `permissions` stores the relations between the three tables described
above: it consists of 2 foreign keys, an integer between 1 and 3
representing the level of sharing, the start date (when the sharing has
been set) and the end date (when necessary, the time at which sharing
has been removed, `NULL` otherwise)
## Python code
The permission levels should be set in `gargantext.constants`, and defined as:
```python
PERMISSION_NONE = 0 # 0b0000
PERMISSION_READ = 1 # 0b0001
PERMISSION_WRITE = 3 # 0b0011
PERMISSION_OWNER = 7 # 0b0111
```
The requests to check for permissions (or add new ones) should not be rewritten
every time. They should be "hidden" within the models (a sketch follows the list):
- `Person.owns(node)` returns a boolean
- `Person.can_read(node)` returns a boolean
- `Person.can_write(node)` returns a boolean
- `Person.give_right(node, permission)` gives a right to a given user
- `Person.remove_right(node, permission)` removes a right from a given user
- `Person.get_nodes(permission[, type])` returns an iterator on the list of
nodes on which the person has at least the given permission (optional
argument: type of requested node)
- `Node.get_persons(permission[, type])` returns an iterator on the list of
users who have at least the given permission on the node (optional argument:
type of requested persons, such as `USER` or `GROUP`)
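The bitmask values defined above make these checks cheap. Here is a hedged sketch of the underlying test (the surrounding model code is an assumption, not the actual implementation):
```python
PERMISSION_NONE  = 0  # 0b0000
PERMISSION_READ  = 1  # 0b0001
PERMISSION_WRITE = 3  # 0b0011
PERMISSION_OWNER = 7  # 0b0111

def has_permission(granted: int, required: int) -> bool:
    # Higher levels contain the lower bits, so OWNER implies WRITE,
    # and WRITE implies READ.
    return granted & required == required

assert has_permission(PERMISSION_OWNER, PERMISSION_READ)
assert has_permission(PERMISSION_WRITE, PERMISSION_READ)
assert not has_permission(PERMISSION_READ, PERMISSION_WRITE)
```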
## Example
Let's imagine the `persons` table contains the following data:
| id | type | username |
|----|-------|-----------|
| 1 | USER | David |
| 2 | GROUP | C.N.R.S. |
| 3 | USER | Alexandre |
| 4 | USER | Untel |
| 5 | GROUP | I.S.C. |
| 6 | USER | Bidule |
Assume "David" owns the groups "C.N.R.S." and "I.S.C.", "Alexandre" belongs to
the group "I.S.C.", with "Untel" and "Bidule" belonging to the group "C.N.R.S.".
"Alexandre" and "David" are in contact.
The `relationships` table then contains:
| person1_id | person2_id | type |
|------------|------------|---------|
| 1 | 2 | OWNER |
| 1 | 5 | OWNER |
| 3 | 2 | MEMBER |
| 4 | 5 | MEMBER |
| 6 | 5 | MEMBER |
| 1 | 3 | CONTACT |
The `nodes` table is populated as such:
| id | type | name |
|----|----------|----------------------|
| 12 | PROJECT | My super project |
| 13 | CORPUS   | The corpus           |
| 14 | DOCUMENT | Some document |
| 15 | DOCUMENT | Another document |
| 16 | DOCUMENT | Yet another document |
| 17 | DOCUMENT | Last document |
| 18 | PROJECT | Another project |
| 19 | PROJECT | That project |
If we want to express that "David" created "My super project" (and its children)
and wants everyone in "C.N.R.S." to be able to view it, but not modify it,
`permissions` should contain:
| person_id | node_id | permission |
|-----------|---------|------------|
| 1 | 12 | OWNER |
| 2 | 12 | READ |
If "David" also wanted "Alexandre" (and no one else) to view and modify "The
corpus" (and its children), we would have:
| person_id | node_id | permission |
|-----------|---------|------------|
| 1 | 12 | OWNER |
| 2 | 12 | READ |
| 3 | 13 | WRITE |
If "Alexandre" created "That project" and wants "Bidule" (and no one else) to be
able to view and modify it (and its children), the table should then have:
| person_id | node_id | permission |
|-----------|---------|------------|
| 3 | 19 | OWNER |
| 6 | 19 | WRITE |
# User guide
1. Login
Run the Gargantext box following the install procedure,
open a web browser at http://127.0.0.1:8000/,
click on Test Gargantext,
and log in with:
```
Login : gargantua
Password : autnagrag
```
2. Create a project
3. Import an existing corpus
4. Create corpus from search
5. Explore stats
6. Explore graphs
7. Query
8. Refine
* Time periods
* Nodes
9. Export
# Architecture Overview
# Database Schema
# Website
Gargantext is a web platform to explore your corpora using text-mining [...](about.md)
## Getting started
* [Install](install.md) the Gargantext box
* [Take a tour](demo.md) of the different features offered by Gargantext
## Architecture
* [Architecture](architecture.md) of Gargantext
## Need some help?
Ask the community at:
* [http://gargantext.org/about](http://gargantext.org/about)
* IRC Chat: (OFTC/FreeNode) #gargantext
## Want to contribute?
* take a look at the [architecture overview](overview.md)
* read the [contribution guide](contribution-guide.md)
## News
## Credits and acknowledgments
# Install Instructions for Gargamelle
Gargamelle is the Gargantext platform toolbox: a full platform system
with minimal modules.
First you need to get the source code to install it.
The folder will be /srv/gargantext:
* docs contains all information on gargantext:
/srv/gargantext/docs/
* install contains all the installation files:
/srv/gargantext/install/
Help needed?
See [http://gargantext.org/about](http://gargantext.org/about) and [tools](./contribution_guide.md) for the community.
## Get the source code
by cloning gargantext into /srv/gargantext
``` bash
git clone ssh://gitolite@delanoe.org:1979/gargantext /srv/gargantext \
&& cd /srv/gargantext \
&& git fetch origin stable \
&& git checkout stable
```
## Install
```bash
# go into the directory
user@computer: cd /srv/gargantext/
# go inside the installation folder
user@computer: cd install
#execute the installation
user@computer: ./install
```
The installation requires creating a user for gargantext; you will be asked:
```bash
Username (leave blank to use 'gargantua'):
#email is not mandatory
Email address:
Password:
Password (again):
```
If this step succeeds, you should see:
```bash
Superuser created successfully.
[ ok ] Stopping PostgreSQL 9.5 database server: main.
```
## Run
Once installed, the Gargantext platform will be available at localhost:8000.
To start the Gargantext platform:
``` bash
# go into the directory
user@computer: cd /srv/gargantext/
# start the platform
user@computer: ./start
# type ctrl+d to exit or simply type exit in the terminal
```
Then open up a chromium browser and go to localhost:8000.
Click on "Enter Gargantext".
Log in with the username and password you created.
Enjoy! ;)
* Create user gargantua
The main user of Gargantext is Gargantua (role of Pantagruel soon)!
``` bash
sudo adduser --disabled-password --gecos "" gargantua
```
* Create the directories you need
In this example the gargantext package will be installed in /srv/.
``` bash
for dir in "/srv/gargantext" \
           "/srv/gargantext_lib" \
           "/srv/gargantext_static" \
           "/srv/gargantext_media" \
           "/srv/env_3-5"; do
  sudo mkdir -p $dir ;
  sudo chown gargantua:gargantua $dir ;
done
```
You should see:
```bash
$ tree /srv
/srv
├── env_3-5
├── gargantext
├── gargantext_lib
├── gargantext_media
└── gargantext_static
```
* Get the main libraries
Download, uncompress, and give the main user access to them.
Please be patient: due to the size of the library packages (27 GB),
this step can take a long time.
``` bash
wget http://dl.gargantext.org/gargantext_lib.tar.bz2 \
&& tar xvjf gargantext_lib.tar.bz2 -C /srv/gargantext_lib \
&& sudo chown -R gargantua:gargantua /srv/gargantext_lib \
&& echo "Libs installed"
```
* Get the source code of Gargantext
by cloning the repository of gargantext
``` bash
git clone ssh://gitolite@delanoe.org:1979/gargantext /srv/gargantext \
&& cd /srv/gargantext \
&& git fetch origin stable \
&& git checkout stable
```
TODO(soon): git clone https://gogs.iscpif.fr/gargantext.git
* Install and configure the virtual environment
``` bash
cd /srv/
pip3 install virtualenv
virtualenv /srv/env_3-5 -p /usr/bin/python3.5
pip install -r /srv/gargantext/install
echo '/srv/gargantext' > /srv/env_3-5/lib/python3.5/site-packages/gargantext.pth
echo 'alias venv="source /srv/env_3-5/bin/activate"' >> ~/.bashrc
```
See the [next steps of installation procedure](install.md#Install)
See the [next manual steps of installation procedure](Debian.sh)
# Gargantext foundations
Collaborative platform for multi-scale text experiments
Embrace the past, update the present, forecast the future.
# Main Types of Entity definitions
Documentation valid for 3.0.\* versions of Gargantext.
## Nature of the entities
In an object-oriented programming language, they are objects.
In a purely functional language, they are types.
## Project
A project is a list of corpora (a project may have duplicate corpora).
## Corpus
A corpus is a set of documents: duplicate documents are authorized but
not recommended for the methodology, since they show artificially repeated content in the corpus.
In the document view, users may then delete duplicates with a specific
function.
## Document
A document is the main Entity of Textual Context (ETC), composed of:
- a title (truncated field name in the database)
- the date of publication
- a journal (or source)
- an abstract
- the authors
Users may add many fields to the document.
The main fields mentioned above are used for the main statistics in Gargantext.
### Source Type
Source Type is the source (database) from which documents have been
extracted.
In 3.0.\* versions of Gargantext, each corpus has only one source type
(i.e. database). But users can build their own corpus from the CSV format.
## Ngrams
### Definitions
### Gram
A gram is a contiguous sequence of letters separated by spaces.
### N-gram
An n-gram is a contiguous sequence of n grams separated by spaces (where n
is a non-negative natural number).
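A minimal sketch of this definition (illustrative only):
```python
def ngrams(text: str, n: int) -> list:
    """Return the contiguous sequences of n grams in the text."""
    grams = text.split()
    return [" ".join(grams[i:i + n]) for i in range(len(grams) - n + 1)]

print(ngrams("laser treated aluminium", 2))
# ['laser treated', 'treated aluminium']
```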
## N-gram Lists
## Main ngrams lists: Stop/Map/Main
The recipe of Gargantext consists of offering the right ngrams for the map,
at the best level of complexity, in order to unveil its richness
according to these 2 main rules:
- If ngrams are too specific, then the graph becomes too sparse.
- If ngrams are too generic, then the graph becomes too connected.
As a consequence, finding the right balance of specific and generic
ngrams is the main target.
In the first versions of Gargantext, this balance is found with linear
methods. After 3.1.\*, non-linear methods trained on users' datasets
enable the system to find a better balance at any scale.
### Definition
3 main kinds of lists (a small sketch follows the list):
1. The Stop List contains blacklisted ngrams, i.e. the noise, or in other words the ngrams users do not want to deal with.
2. The Map List contains the ngrams that will be shown in the map.
3. The Main List, or Candidate List, contains all other ngrams that are neither in the stop list nor in the map list. They _could_ go into the map according to the choice of the user or, by default, the default parameters of Gargantext.
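A small sketch of this typology (the threshold and scores are hypothetical, not Gargantext's actual parameters):
```python
def route_ngram(is_noise: bool, map_score: float, map_threshold: float = 0.5) -> str:
    """Assign an ngram to one of the three lists."""
    if is_noise:
        return "STOP"
    if map_score >= map_threshold:
        return "MAP"
    return "MAIN"  # candidate: the user may still promote it to MAP

print(route_ngram(is_noise=False, map_score=0.7))  # MAP
```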
### Storage
The relation between the list and the ngram is stored as a Node-Ngram
relation where:
- the Node has typename (STOP|MAIN|MAP) and, as parent_id, the context
(CORPUS in version 3.0.*; but it could be PROJECT)
- the Ngrams depend on the context of the Node List, where NodeNgrams is
not null and the Node has typename DOCUMENT.
Node[USER](name1)
├── Node[PROJECT](project1)
│   ├── Node[CORPUS](corpus1)
│   │   ├── Node[MAPLIST](list name)
│   │   ├── Node[STOPLIST](list name)
│   │   ├── Node[MAINLIST](list name)
│   │   │
│   │   ├── Node[DOCUMENT](doc1)
│   │   ├── Node[DOCUMENT](doc2)
│   │   └── Node[DOCUMENT](doc3)
### Policy
#### Algo
Let there be a set of ngrams where NodeNgram != 0; then
find 2 subsets of these ngrams that show a split:
- stop ngrams
- not-stop ngrams
Then, for the subset "not-stop ngrams",
find 2 subsets of ngrams that show a split:
- map ngrams
- other ngrams
#### Techno algo
A classifier (Support Vector Machine) is used on the following scaled measures
for each step (a sketch follows the list):
- n (of the "n" gram)
- Occurrences: Zipf's law (in fact already used in TFICF; these
features are correlated, kept here for pedagogical purposes)
- TFICF-CORPUS-SOURCETYPE
- TFICF-SOURCETYPE-ALL
- Genericity score
- Specificity score
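A hedged sketch of such a split using scikit-learn (an assumption; the actual implementation may differ, and the feature values are hypothetical):
```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Columns: n, occurrences, tficf_corpus_sourcetype, tficf_sourcetype_all,
# genericity, specificity -- one row per ngram.
X_train = np.array([
    [1, 120.0, 0.02, 0.01, 0.9, 0.1],
    [2,   8.0, 0.40, 0.30, 0.2, 0.8],
    [1,  30.0, 0.10, 0.08, 0.5, 0.4],
])
y_train = ["stop", "not_stop", "not_stop"]  # first split; the map/others
                                            # split is done the same way

clf = make_pipeline(StandardScaler(), SVC())
clf.fit(X_train, y_train)
print(clf.predict([[2, 10.0, 0.35, 0.25, 0.3, 0.7]]))
```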
## Metrics
### Term Frequency - Inverse Context Frequency (TF-ICF)
TFICF, short for Term Frequency - Inverse Context Frequency, is a numerical
statistic intended to reflect how important an ngram is to a
context of text.
TFICF(ngram,contextLocal,contextGlobal) = TF(ngram,contextLocal) \* ICF(ngram, contextGlobal)
where
* TF(ngram, contextLocal) is the ngram frequency (occurrences) in contextLocal.
* ICF(ngram, contextGlobal) is the inverse (log) context frequency of the ngram in contextGlobal.
Other types of TFICF:
- TFICF(ngram, DOCUMENT, CORPUS)
- TFICF(ngram, CORPUS, PROJECT)
- TFICF(ngram, PROJECT, DATABASETYPE)
- TFICF(ngram, DATABASETYPE, ALL)
If the context is a document in a set of documents (corpus), then it is a TFIDF as usual.
Then TFICF-DOCUMENT-CORPUS == TFICF(ngram,DOCUMENT,CORPUS) = TFIDF.
TFICF is the generalization of [TFIDF, Term Frequency - Inverse Document Frequency](https://en.wikipedia.org/wiki/Tf%E2%80%93idf).
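A minimal sketch of the formula (counts are hypothetical):
```python
import math

def tficf(tf: float, contexts_total: int, contexts_with_ngram: int) -> float:
    """TF(ngram, contextLocal) * ICF(ngram, contextGlobal), ICF as a log ratio."""
    icf = math.log(contexts_total / contexts_with_ngram)
    return tf * icf

# With documents as local contexts and the corpus as the global context,
# this reduces to the usual TFIDF:
print(tficf(tf=3.0, contexts_total=1000, contexts_with_ngram=10))  # 3 * ln(100)
```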
#### Implementation
TFICF = TF * log(ICF)
To prepare the groups, we need to store TF and ICF separately (in
NodeNgram via 2 nodes).
Let TF and ICF be typenames of Nodes.
Node[USER](gargantua)
├── Node[OCCURRENCES](source)
├── Node[TF](all sourcetype)
├── Node[ICF](all sourcetype)
├── Node[SOURCETYPE](Pubmed)
│   ├── Node[OCCURRENCES](all corpora)
│   ├── Node[TF](all corpora)
│   └── Node[ICF](all corpora)
├── Node[SOURCETYPE](WOS)
## Other ngram lists
### Group List
#### Definition
The group list gives a quantifiable link between two ngrams.
#### Policy to build group lists
To group the ngrams:
- stemming or lemmatization
- c-value
- clustering (see graphs)
- manually by the user (supervised learning)
The scale is the character.
#### Storage
In the NodeNgramNgram table, where the Node has typename GROUP, for ngram1
and ngram2.
### Favorite List
#### Definition
Favorite Nodes.
The scale is the node.
#### Building policy
- manually by the user (supervised learning)
#### Storage
NodeNode relation where first Node has type Favorite.
# Architecture Overview
# Database Schema
# Website
# HOW TO: Reference a new web scraper/API + parser
## Global scope
Three main moves to make:
- develop and index a parser
in gargantext.util.parsers
- develop and index a scraper
in gargantext.moissonneurs
- adapt forms for a new source
in templates and views
## Reference the parser into the gargantext website
The gargantext website is stored in gargantext/gargantext
### Reference your new parser into constants.py
* import your parser (l. 125)
```
from gargantext.util.parsers import \
EuropressParser, RISParser, PubmedParser, ISIParser, CSVParser, ISTexParser, CernParser
```
The parser corresponds to the name of the parser referenced in gargantext/util/parser;
here the name is CernParser.
* index your RESOURCETYPE
into RESOURCETYPES (l. 145) **at the end of the list**
```
# type 10
{ "name": 'SCOAP (XML MARC21 Format)',
"parser": CernParser,
"default_language": "en",
'accepted_formats':["zip","xml"],
},
```
Note that the name here is composed of the API name (SCOAP) + (GENERICFILETYPE FORMAT_XML Format).
The naming complexity corresponds to three things:
* the name of the API (different from the producing organization)
* the format type: XML
* the XML norm of this format: MARC21 (cf. CernParser in gargantext/util/parser/Cern.py)
The default_language corresponds to the default accepted lang that **should load** the corresponding default tagger:
```
from gargantext.util.taggers import NltkTagger
```
TODO: load the tagger types on demand depending on the languages and the install.
TODO: offer a module to download additional parsers.
TODO: provide install tagger module scripts inside lib.
The formats correspond to the file types accepted when a file is sent through the parsing form
available in `gargantext/view/pages/projects.py` and
exposed in `/templates/pages/projects/project.html`.
## Reference your parser script
## Add your parser script into the folder gargantext/util/parser/
Here my filename was Cern.py.
## Declare it into gargantext/util/parser/__init__.py
    from .Cern import CernParser
At this step, you will be able to see your parser and add a file with the form,
but nothing will occur yet.
## The good way to write the parser script
Three main and only requirements:
* your parser class should inherit from the base class _Parser()
(`gargantext/gargantext/util/parser/_Parser`)
* your parser class must have a parse method that takes a **file buffer** as input
* your parser must structure and store data into the **hyperdata_list** variable name
to be properly indexed by the toolchain
! Be careful with the date format: provide a publication_date as a string in the format YYYY-mm-dd HH:MM:SS
# Adding a scraper API to offer the search option:
In progress.
* Add a pop-up question "Do you have a corpus?";
the search option is in /templates/pages/projects/project.html line 181
## Reference a scraper (moissonneur) into gargantext
* add accepted_formats in constants
* add a check_file routine in the Form check ==> but it should inherit from utils/files.py,
which also implements the upload size limit check
# Suggestions for next steps:
* XML parser: MARC21, UNIMARC...
* A project type is qualified by the first element added, i.e.
the first element determines the type of corpus of all the corpora within the project
# Resources
Adding a new source into Gargantext requires a previous declaration
of the source inside constants.py
```python
RESOURCETYPES= [
{ "type":9, #give a unique type int
"name": 'SCOAP [XML]', #resource name as proposed into the add corpus FORM [generic format]
"parser": "CernParser", #name of the new parser class inside a CERN.py file (set to None if not implemented)
"format": 'MARC21', #specific format
'file_formats':["zip","xml"],# accepted file format
"crawler": "CernCrawler", #name of the new crawler class inside a CERN.py file (set to None if no Crawler implemented)
'default_languages': ['en', 'fr'], # supported default languages of the source
},
...
]
```
## Adding a new parser
Once you have declared your new parser inside constants.py,
add your new parser file into /srv/gargantext/utils/parsers/
following this naming convention:
* The filename must be in uppercase without the Parser mention,
e.g. MailParser => MAIL.py
* Inside this file, the parser class must be named following the exact name declared as parser in constants.py
* Your new parser shall inherit from the base class Parser and provide a parse(filebuffer) method
```python
#!/usr/bin/env python3
# filename: /srv/gargantext/util/parser/MAIL.py
from ._Parser import Parser

class MailParser(Parser):
    def parse(self, file):
        # parse the file buffer and fill the hyperdata list
        ...
```
## Adding a new crawler
Once you have declared your new crawler inside constants.py,
add your new crawler file into /srv/gargantext/utils/crawlers/
following this naming convention:
* The filename must be in uppercase without the Crawler mention,
e.g. MailCrawler => MAIL.py
* Inside this file, the crawler class must be named following the exact name declared as crawler in constants.py
* Your new crawler shall inherit from the base class Crawler and provide three methods:
* scan_results => ids
* sample => yes/no
* fetch
```python
#!/usr/bin/env python3
# filename: /srv/gargantext/util/crawler/MAIL.py
from ._Crawler import Crawler

class MailCrawler(Crawler):
    def scan_results(self, query):
        # scan the source for the query and collect the matching ids
        ...
        self.ids = set()

    def sample(self, results_nb):
        # fetch a sample of the results
        ...

    def fetch(self, ids):
        # download the full records for the given ids
        ...
```
// dot ngram_parsing_flow.dot -Tpng -o ngram_parsing_flow.png
digraph ngramflow {
edge [fontsize=10] ;
label=<<B><U>gargantext.util.toolchain</U></B><BR/>(ngram extraction flow)>;
labelloc="t" ;
"extracted_ngrams" -> "grouplist" ;
"extracted_ngrams" -> "occs+tfidfs" ;
"main_user_stoplist" -> "stoplist" ;
"stoplist" -> "mainlist" ;
"occs+tfidfs" -> "mainlist" [label=" TFIDF_LIMIT"];
"mainlist" -> "coocs" [label=" COOCS_THRESHOLD"] ;
"coocs" -> "specificity" ;
"specificity" -> "maplist" [label="MAPLIST_LIMIT\nMONOGRAM_PART"];
"maplist" -> "explore" ;
"grouplist" -> "maplist" ;
}
* Reference your pages in mkdocs.yml
* Write each file in markdown (github flavor)
* Generate the doc: create-dic.sh > generates a site folder
* [RTFM](http://www.mkdocs.org/)!
site_name: Gargantext
theme: readthedocs