Commit a26bb0f6 authored by delanoe

Merge branch 'romain-reintegration-graphExplorer' into anoe-graph

parents b2067a0c f5314bfa
# Install Instructions for Gargantext (CNRS):
## Get the source code
by cloning gargantext into /srv/gargantext
``` bash
git clone ssh://gitolite@delanoe.org:1979/gargantext /srv/gargantext \
&& cd /srv/gargantext \
&& git fetch origin stable \
&& git checkout stable
```
Help needed?
See [http://gargantext.org/about](http://gargantext.org/about) and [tools](./contribution_guide.md) for the community.

The folder will be /srv/gargantext:
* docs contains all information on gargantext (/srv/gargantext/docs/)
* install contains all the installation files (/srv/gargantext/install/)

Once you have set up and installed the Gargantext box, you can use the ./install/run.sh utility
to load the gargantext web platform and access it through your web browser.
______________________________
Two installation procedures are provided:
1. Semi-automatic installation [EASY]
2. Step by step installation [ADVANCED]

Only the semi-automatic installation is covered here; see [manual_install](manual_install.md)
to follow the step by step procedure.
______________________________
## Prerequisites
## Init Setup
## Install
## Run
--------------------
# Semi-automatic installation
All the procedure files are located in /srv/gargantext/install/
``` bash
user@computer:$ cd /srv/gargantext/install/
```
## Prerequisites
* A Debian based OS >= [FIXME]
* At least 35 GB free in /srv/ [FIXME]
todo: reduce the size of gargantext lib
todo: remove lib once docker is configured
tip: if you have enough space for the full package you can:
* resize your partition
* make a symlink on gargantext_lib
* A [docker engine installation](https://docs.docker.com/engine/installation/linux/)
## Init Setup
Prepare your environment and make the initial setup.
This initial step creates a user for the gargantext platform, along with downloading additional libs and files.
It also installs docker, builds the docker image and builds the gargantext box
``` bash
user@computer:/srv/gargantext/install/$ ./init.sh
```
### Install
Once the init step is done
* Enter into the docker environment
Inside folder /srv/gargantext/install/
enter the gargantext image
``` bash
user@computer:/srv/gargantext/install/$ ./docker/enterGargantextImage
```
go to the installation folder
``` bash
root@dockerimage8989809:$ cd /srv/gargantext/install/
```
[HERE] Check whether the postgresql and python configurations are already done upstream, when the docker file is created
* Install Python environment
Inside the docker image, execute as root:
``` bash
root@dockerimage8989809:/srv/gargantext/install/$ python/configure
```
* Configure PostgreSql
Inside the docker image, execute as root:
``` bash
root@dockerimage8989809:/srv/gargantext/install/$ postgres/configure
```
[If OK] remove these lines
Install Gargantext server
* Configure the database
Inside the docker container:
``` bash
service postgresql start
#su gargantua
#activate the virtualenv
source /srv/env_3-5/bin/activate
```
You have entered the virtualenv, as shown with (env_3-5)
``` bash
(env_3-5) $ python /srv/gargantext/dbmigrate.py
(env_3-5) $ /srv/gargantext/manage.py makemigrations
(env_3-5) $ /srv/gargantext/manage.py migrate
(env_3-5) $ python /srv/gargantext/dbmigrate.py
#will create tables and not hyperdata_nodes
(env_3-5) $ python /srv/gargantext/dbmigrate.py
#will create table hyperdata_nodes
#launch the server a first time to create the first user
(env_3-5) $ /srv/gargantext/manage.py runserver 0.0.0.0:8000
(env_3-5) $ /srv/gargantext/init_accounts.py /srv/gargantext/install/init/account.csv
```
FIXME: dbmigrate needs to be launched several times since tables are
ordered in alphabetical order (and not dependency order)
* Exit the docker
```
exit (or Ctrl+D)
```
## Run Gargantext
Enter the docker container:
``` bash
...@@ -126,31 +141,30 @@ Enter the docker container:
```
Inside the docker container:
``` bash
#start Database (postgresql)
service postgresql start
#change to user
su gargantua
#activate the virtualenv
source /srv/env_3-5/bin/activate
#go to gargantext srv
(env_3-5) $ cd /srv/gargantext/
#run the server
(env_3-5) $ ./manage.py runserver 0.0.0.0:8000
```
Keep it open and, outside the docker, launch the browser
``` bash
chromium http://127.0.0.1:8000/
```
* Click on Test Gargantext
```
Login : gargantua
Password : autnagrag
```
Enjoy :)
See [User Guide](/demo/tuto.md) for quick usage example
...@@ -12,14 +12,16 @@ LISTTYPES = {
'STOPLIST' : UnweightedList,
'MAINLIST' : UnweightedList,
'MAPLIST' : UnweightedList,
'SPECCLUSION' : WeightedList,
'GENCLUSION' : WeightedList,
'OCCURRENCES' : WeightedIndex, # could be WeightedList
'COOCCURRENCES': WeightedMatrix,
'TFIDF-CORPUS' : WeightedIndex,
'TFIDF-GLOBAL' : WeightedIndex,
'TIRANK-LOCAL' : WeightedIndex, # could be WeightedList
'TIRANK-GLOBAL' : WeightedIndex, # could be WeightedList
}
# 'OWNLIST' : UnweightedList, # £TODO use this for any term-level tags
NODETYPES = [
# TODO separate id not array index, read by models.node
...@@ -37,7 +39,7 @@ NODETYPES = [
'COOCCURRENCES', # 9
# scores
'OCCURRENCES', # 10
'SPECCLUSION', # 11
'CVALUE', # 12
'TFIDF-CORPUS', # 13
'TFIDF-GLOBAL', # 14
...@@ -47,6 +49,7 @@ NODETYPES = [
# more scores (sorry!)
'TIRANK-LOCAL', # 16
'TIRANK-GLOBAL', # 17
'GENCLUSION', # 18
]
INDEXED_HYPERDATA = {
...@@ -222,12 +225,16 @@ DEFAULT_RANK_CUTOFF_RATIO = .75 # MAINLIST maximum terms in %
DEFAULT_RANK_HARD_LIMIT = 5000 # MAINLIST maximum terms abs
# (makes COOCS larger ~ O(N²) /!\)
DEFAULT_COOC_THRESHOLD = 3 # inclusive minimum for COOCS coefs
# (makes COOCS more sparse)
DEFAULT_MAPLIST_MAX = 350 # MAPLIST maximum terms
DEFAULT_MAPLIST_MONOGRAMS_RATIO = .2 # quota of monograms in MAPLIST
# (vs multigrams = 1-mono)
DEFAULT_MAPLIST_GENCLUSION_RATIO = .6 # quota of top genclusion in MAPLIST
# (vs top specclusion = 1-gen)
DEFAULT_MAX_NGRAM_LEN = 7 # limit used after POStagging rule
# (initial ngrams number is a power law of this /!\)
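For a sense of how the two MAPLIST ratios combine, here is the quota arithmetic with the default values above (it mirrors the do_maplist code further down; the printed numbers are just the defaults worked through):
``` python
# sketch: maplist quotas derived from the defaults above
limit = 350                # DEFAULT_MAPLIST_MAX
genclusion_part = .6       # DEFAULT_MAPLIST_GENCLUSION_RATIO
monograms_part = .2        # DEFAULT_MAPLIST_MONOGRAMS_RATIO

genclusion_limit = round(limit * genclusion_part)   # 210 terms picked by genericity
speclusion_limit = limit - genclusion_limit         # 140 terms picked by specificity
print(round(genclusion_limit * monograms_part))     # 42 monograms in the topgen pool
print(genclusion_limit - round(genclusion_limit * monograms_part))  # 168 multigrams
print(round(speclusion_limit * monograms_part))     # 28 monograms in the topspec pool
print(speclusion_limit - round(speclusion_limit * monograms_part))  # 112 multigrams
```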
...@@ -272,7 +279,7 @@ DOWNLOAD_DIRECTORY = UPLOAD_DIRECTORY
# about batch processing...
BATCH_PARSING_SIZE = 256
BATCH_NGRAMSEXTRACTION_SIZE = 3000 # how many distinct ngrams before INTEGRATE
# Scrapers config
...@@ -282,7 +289,7 @@ QUERY_SIZE_N_DEFAULT = 1000
# Grammar rules for chunking
RULE_JJNN = "{<JJ.*>*<NN.*|>+<JJ.*>*}"
RULE_NPN = "{<JJ.*>*<NN.*>+((<P|IN> <DT>? <JJ.*>* <NN.*>+ <JJ.*>*)|(<JJ.*>))*}"
RULE_TINA = "^((VBD,|VBG,|VBN,|CD.?,|JJ.?,|\?,){0,2}?(N.?.?,|\?,)+?(CD.,)??)\
+?((PREP.?|DET.?,|IN.?,|CC.?,|\?,)((VBD,|VBG,|VBN,|CD.?,|JJ.?,|\?\
,){0,2}?(N.?.?,|\?,)+?)+?)*?$"
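These patterns are chunk grammars for NLTK's RegexpParser; as a minimal sketch of how the JJ/NN rule chunks a POS-tagged sentence (the `NP:` label and the sample tokens are illustrative, not from the source):
``` python
import nltk

# wrap the RULE_JJNN pattern body in a named chunk rule
grammar = r"NP: {<JJ.*>*<NN.*|>+<JJ.*>*}"
parser = nltk.RegexpParser(grammar)

tagged = [('complex', 'JJ'), ('network', 'NN'), ('analysis', 'NN'),
          ('is', 'VBZ'), ('fun', 'NN')]
print(parser.parse(tagged))
# (S (NP complex/JJ network/NN analysis/NN) is/VBZ (NP fun/NN))
```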
...@@ -42,6 +42,9 @@ CELERY_ACCEPT_CONTENT = ['pickle', 'json', 'msgpack', 'yaml']
CELERY_IMPORTS = ("gargantext.util.toolchain", "graph.cooccurrences")
# garg's custom unittests runner (adapted to our db models)
TEST_RUNNER = 'unittests.framework.GargTestRunner'
# Application definition
INSTALLED_APPS = [
...@@ -123,6 +126,9 @@ DATABASES = {
'PASSWORD': 'C8kdcUrAQy66U',
'HOST': '127.0.0.1',
'PORT': '5432',
'TEST': {
'NAME': 'test_gargandb',
},
}
}
......
...@@ -19,7 +19,7 @@ from gargantext.constants import DEFAULT_CSV_DELIM, DEFAULT_CSV_DELIM_GRO
# import will implement the same text cleaning procedures as toolchain
from gargantext.util.toolchain.parsing import normalize_chars
from gargantext.util.toolchain.ngrams_extraction import normalize_forms
from sqlalchemy.sql import exists
from os import path
......
from gargantext.util.languages import languages
from gargantext.constants import LANGUAGES, DEFAULT_MAX_NGRAM_LEN, RULE_JJNN, RULE_NPN
import nltk
import re
......
...@@ -39,11 +39,11 @@ def do_mainlist(corpus,
# retrieve helper nodes if not provided
if not ranking_scores_id:
ranking_scores_id = session.query(Node.id).filter(
Node.typename == "TIRANK-GLOBAL",
Node.parent_id == corpus.id
).first()
if not ranking_scores_id:
raise ValueError("MAINLIST: TIRANK node needed for mainlist creation")
if not stoplist_id:
stoplist_id = session.query(Node.id).filter(
......
...@@ -9,37 +9,49 @@ from gargantext.util.db_cache import cache
from gargantext.util.lists import UnweightedList
from sqlalchemy import desc, asc
from gargantext.constants import DEFAULT_MAPLIST_MAX,\
DEFAULT_MAPLIST_GENCLUSION_RATIO,\
DEFAULT_MAPLIST_MONOGRAMS_RATIO
def do_maplist(corpus,
overwrite_id = None,
mainlist_id = None,
specclusion_id = None,
genclusion_id = None,
grouplist_id = None,
limit=DEFAULT_MAPLIST_MAX,
genclusion_part=DEFAULT_MAPLIST_GENCLUSION_RATIO,
monograms_part=DEFAULT_MAPLIST_MONOGRAMS_RATIO
):
'''
According to Genericity/Specificity and mainlist
Parameters:
- mainlist_id (starting point, already cleaned of stoplist terms)
- specclusion_id (ngram inclusion by cooc specificity -- ranking factor)
- genclusion_id (ngram inclusion by cooc genericity -- ranking factor)
- grouplist_id (filtering grouped ones)
- overwrite_id: optional if preexisting MAPLIST node to overwrite
+ 3 params to modulate the terms choice
- limit for the amount of picked terms
- monograms_part: a ratio of terms with only one lexical unit to keep
(multigrams quota = limit * (1-monograms_part))
- genclusion_part: a ratio of terms ranked by genclusion to keep
(speclusion quota = limit * (1-genclusion_part))
'''
if not (mainlist_id and specclusion_id and genclusion_id and grouplist_id):
raise ValueError("Please provide mainlist_id, specclusion_id, genclusion_id and grouplist_id")
quotas = {'topgen':{}, 'topspec':{}}
genclusion_limit = round(limit * genclusion_part)
speclusion_limit = limit - genclusion_limit
quotas['topgen']['monograms'] = round(genclusion_limit * monograms_part)
quotas['topgen']['multigrams'] = genclusion_limit - quotas['topgen']['monograms']
quotas['topspec']['monograms'] = round(speclusion_limit * monograms_part)
quotas['topspec']['multigrams'] = speclusion_limit - quotas['topspec']['monograms']
print("MAPLIST quotas:", quotas)
#dbg = DebugTime('Corpus #%d - computing Miam' % corpus.id)
...@@ -54,11 +66,19 @@ def do_maplist(corpus,
)
ScoreSpec=aliased(NodeNgram)
ScoreGen=aliased(NodeNgram)
# ngram with both ranking factors spec and gen
query = (session.query(
ScoreSpec.ngram_id,
ScoreSpec.weight,
ScoreGen.weight,
Ngram.n
)
.join(Ngram, Ngram.id == ScoreSpec.ngram_id)
.join(ScoreGen, ScoreGen.ngram_id == ScoreSpec.ngram_id)
.filter(ScoreSpec.node_id == specclusion_id)
.filter(ScoreGen.node_id == genclusion_id)
# we want only terms within mainlist
.join(MainlistTable, Ngram.id == MainlistTable.ngram_id)
...@@ -68,36 +88,99 @@ def do_maplist(corpus,
.outerjoin(IsSubform,
IsSubform.c.ngram2_id == ScoreSpec.ngram_id)
.filter(IsSubform.c.ngram2_id == None)
# specificity-ranked
.order_by(desc(ScoreSpec.weight))
)
# format in scored_ngrams array:
# -------------------------------
# [(37723, 8.428, 14.239, 3 ), etc]
# ngramid wspec wgen nwords
scored_ngrams = query.all()
n_ngrams = len(scored_ngrams)
if n_ngrams == 0:
raise ValueError("No ngrams in cooc table ?")
# results, with same structure as quotas
chosen_ngrams = {
'topgen':{'monograms':[], 'multigrams':[]},
'topspec':{'monograms':[], 'multigrams':[]}
}
# specificity and genericity are rather reverse-correlated
# but occasionally they can have common ngrams (same ngram well ranked in both)
# => we'll use a lookup table to check if we didn't already get it
already_gotten_ngramids = {}
# 2 loops to fill spec-clusion then gen-clusion quotas
# (1st loop uses order from DB, 2nd loop uses our own sort at end of 1st)
for rkr in ['topspec', 'topgen']:
got_enough_mono = False
got_enough_multi = False
all_done = False
i = -1
while((not all_done) and (not (got_enough_mono and got_enough_multi))):
# retrieve sorted ngram n° i
i += 1
(ng_id, wspec, wgen, nwords) = scored_ngrams[i]
# before any continue case, we check the next i for max reached
all_done = (i+1 >= n_ngrams)
if ng_id in already_gotten_ngramids:
continue
# NB: nwords could be replaced by a simple search on r' '
if nwords == 1:
if got_enough_mono:
continue
else:
# add ngram to results and lookup
chosen_ngrams[rkr]['monograms'].append(ng_id)
already_gotten_ngramids[ng_id] = True
# multi
else:
if got_enough_multi:
continue
else:
# add ngram to results and lookup
chosen_ngrams[rkr]['multigrams'].append(ng_id)
already_gotten_ngramids[ng_id] = True
got_enough_mono = (len(chosen_ngrams[rkr]['monograms']) >= quotas[rkr]['monograms'])
got_enough_multi = (len(chosen_ngrams[rkr]['multigrams']) >= quotas[rkr]['multigrams'])
# at the end of the first loop we just need to sort all by the second ranker (gen)
scored_ngrams = sorted(scored_ngrams, key=lambda ng_infos: ng_infos[2], reverse=True)
obtained_spec_mono = len(chosen_ngrams['topspec']['monograms'])
obtained_spec_multi = len(chosen_ngrams['topspec']['multigrams'])
obtained_gen_mono = len(chosen_ngrams['topgen']['monograms'])
obtained_gen_multi = len(chosen_ngrams['topgen']['multigrams'])
obtained_total = obtained_spec_mono \
+ obtained_spec_multi \
+ obtained_gen_mono \
+ obtained_gen_multi
print("MAPLIST: top_spec_monograms =", obtained_spec_mono)
print("MAPLIST: top_spec_multigrams =", obtained_spec_multi)
print("MAPLIST: top_gen_monograms =", obtained_gen_mono)
print("MAPLIST: top_gen_multigrams =", obtained_gen_multi)
print("MAPLIST: kept %i ngrams in total " % obtained_total) print("MAPLIST: kept %i ngrams in total " % obtained_total)
obtained_data = chosen_ngrams['topspec']['monograms'] \
+ chosen_ngrams['topspec']['multigrams'] \
+ chosen_ngrams['topgen']['monograms'] \
+ chosen_ngrams['topgen']['multigrams']
# NEW MAPLIST NODE
# -----------------
# saving the parameters of the analysis in the Node JSON
new_hyperdata = { 'corpus': corpus.id,
'limit' : limit,
'monograms_part' : monograms_part,
'genclusion_part' : genclusion_part,
}
if overwrite_id:
# overwrite pre-existing node
...@@ -118,9 +201,7 @@ def do_maplist(corpus,
the_id = the_maplist.id
# create UnweightedList object and save (=> new NodeNgram rows)
datalist = UnweightedList(obtained_data)
# save
datalist.save(the_id)
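Outside the DB context, the double quota loop above boils down to something like this toy run (ids, weights and quotas are invented, and the early-exit bookkeeping is simplified):
``` python
# (ngram_id, wspec, wgen, nwords), pre-sorted by wspec desc
scored_ngrams = [(1, 9.0, 1.0, 1), (2, 8.0, 2.0, 2),
                 (3, 7.0, 8.0, 2), (4, 2.0, 9.0, 1)]
quotas = {'topspec': {'monograms': 1, 'multigrams': 1},
          'topgen':  {'monograms': 1, 'multigrams': 1}}
chosen = {r: {'monograms': [], 'multigrams': []} for r in quotas}
taken = set()

for rkr in ['topspec', 'topgen']:
    for ng_id, wspec, wgen, nwords in scored_ngrams:
        kind = 'monograms' if nwords == 1 else 'multigrams'
        if ng_id in taken or len(chosen[rkr][kind]) >= quotas[rkr][kind]:
            continue
        chosen[rkr][kind].append(ng_id)
        taken.add(ng_id)
    # the 2nd pass uses the genericity ranking instead
    scored_ngrams = sorted(scored_ngrams, key=lambda t: t[2], reverse=True)

print(chosen)
# {'topspec': {'monograms': [1], 'multigrams': [2]},
#  'topgen': {'monograms': [4], 'multigrams': [3]}}
```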
......
...@@ -10,8 +10,8 @@ from .ngram_groups import compute_groups
from .metric_tfidf import compute_occs, compute_tfidf_local, compute_ti_ranking
from .list_main import do_mainlist
from .ngram_coocs import compute_coocs
from .metric_specgen import compute_specgen
from .list_map import do_maplist
from .mail_notification import notify_owner
from gargantext.util.db import session
from gargantext.models import Node
...@@ -136,22 +136,26 @@ def parse_extract_indexhyperdata(corpus):
# => used for doc <=> ngram association
# ------------
# -> cooccurrences on mainlist: compute + write (=> Node and NodeNgramNgram)*
coocs = compute_coocs(corpus,
on_list_id = mainlist_id,
groupings_id = group_id,
just_pass_result = True,
diagonal_filter = False) # preserving the diagonal
# (useful for spec/gen)
print('CORPUS #%d: [%s] computed mainlist coocs for specif rank' % (corpus.id, t()))
# -> specclusion/genclusion: compute + write (2 Nodes + 2 lists in NodeNgram)
(spec_id, gen_id) = compute_specgen(corpus,cooc_matrix = coocs)
# no need here for subforms because cooc already counted them in mainform
print('CORPUS #%d: [%s] new spec-clusion node #%i' % (corpus.id, t(), spec_id))
print('CORPUS #%d: [%s] new gen-clusion node #%i' % (corpus.id, t(), gen_id))
# maplist: compute + write (to Node and NodeNgram)
map_id = do_maplist(corpus,
mainlist_id = mainlist_id,
specclusion_id=spec_id,
genclusion_id=gen_id,
grouplist_id=group_id
)
print('CORPUS #%d: [%s] new maplist node #%i' % (corpus.id, t(), map_id))
...@@ -187,7 +191,7 @@ def recount(corpus):
- ndocs
- ti_rank
- coocs
- specclusion/genclusion
- tfidf
NB: no new extraction, no list change, just the metrics
...@@ -208,10 +212,15 @@ def recount(corpus):
old_tirank_id = None
try:
old_spec_id = corpus.children("SPECCLUSION").first().id
except:
old_spec_id = None
try:
old_gen_id = corpus.children("GENCLUSION").first().id
except:
old_gen_id = None
try:
old_ltfidf_id = corpus.children("TFIDF-CORPUS").first().id
except:
...@@ -254,11 +263,13 @@ def recount(corpus):
just_pass_result = True)
print('RECOUNT #%d: [%s] updated mainlist coocs for specif rank' % (corpus.id, t()))
# -> specclusion/genclusion: compute + write (=> NodeNodeNgram)
(spec_id, gen_id) = compute_specgen(corpus, cooc_matrix = coocs,
spec_overwrite_id = old_spec_id, gen_overwrite_id = old_gen_id)
print('RECOUNT #%d: [%s] updated spec-clusion node #%i' % (corpus.id, t(), spec_id))
print('RECOUNT #%d: [%s] updated gen-clusion node #%i' % (corpus.id, t(), gen_id))
......
"""
Computes genericity/specificity metrics from the ngram cooccurrence matrix.
+ SAVE => WeightedList => NodeNgram
"""
from gargantext.models import Node, Ngram, NodeNgram, NodeNgramNgram
from gargantext.util.db import session, aliased, func, bulk_insert
from gargantext.util.lists import WeightedList
from collections import defaultdict
from pandas import DataFrame
from numpy import diag
def round3(floating_number):
"""
Rounds a floating number to 3 decimals
Good when we don't need so much detail in the data written to the DB
"""
return float("%.3f" % floating_number)
def compute_specgen(corpus, cooc_id=None, cooc_matrix=None,
spec_overwrite_id = None, gen_overwrite_id = None):
'''
Compute genericity/specificity:
P(j|i) = N(ij) / N(ii)
P(i|j) = N(ij) / N(jj)
Gen(i) = Sum{j} P(j_k|i)
Spec(i) = Sum{j} P(i|j_k)
Gen-clusion(i) = (Spec(i) + Gen(i)) / 2
Spec-clusion(i) = (Spec(i) - Gen(i)) / 2
Parameters:
- cooc_id: mandatory id of a cooccurrences node to use as base
- spec_overwrite_id: optional preexisting specificity node to overwrite
- gen_overwrite_id: optional preexisting genericity node to overwrite
'''
matrix = defaultdict(lambda : defaultdict(float))
if cooc_id == None and cooc_matrix == None:
raise TypeError("compute_specificity: needs a cooc_id or cooc_matrix param")
elif cooc_id:
cooccurrences = (session.query(NodeNgramNgram)
.filter(NodeNgramNgram.node_id==cooc_id)
)
# no filtering: cooc already filtered on mainlist_id at creation
for cooccurrence in cooccurrences:
matrix[cooccurrence.ngram1_id][cooccurrence.ngram2_id] = cooccurrence.weight
# matrix[cooccurrence.ngram2_id][cooccurrence.ngram1_id] = cooccurrence.weight
elif cooc_matrix:
# copy WeightedMatrix into local matrix structure
for (ngram1_id, ngram2_id) in cooc_matrix.items:
w = cooc_matrix.items[(ngram1_id, ngram2_id)]
# ------- 8< --------------------------------------------
# tempo hack to ignore lines/columns where diagonal == 0
# £TODO find why they exist and then remove this snippet
if (((ngram1_id,ngram1_id) not in cooc_matrix.items) or
((ngram2_id,ngram2_id) not in cooc_matrix.items)):
continue
# ------- 8< --------------------------------------------
matrix[ngram1_id][ngram2_id] = w
nb_ngrams = len(matrix)
print("SPECIFICITY: computing on %i ngrams" % nb_ngrams)
# example corpus (7 docs, 8 nouns)
# --------------------------------
# "The report says that humans are animals."
# "The report says that rivers are full of water."
# "The report says that humans like to make war."
# "The report says that animals must eat food."
# "The report says that animals drink water."
# "The report says that humans like food and water."
# "The report says that grass is food for some animals."
#===========================================================================
cooc_counts = DataFrame(matrix).fillna(0)
# cooc_counts matrix
# ------------------
# animals food grass humans report rivers war water
# animals 4 2 1 1 4 0 0 1
# food 2 3 1 1 3 0 0 1
# grass 1 1 1 0 1 0 0 0
# humans 1 1 0 3 3 0 1 1
# report 4 3 1 3 7 1 1 3
# rivers 0 0 0 0 1 1 0 1
# war 0 0 0 1 1 0 1 0
# water 1 1 0 1 3 1 0 3
#===========================================================================
# conditional p(col|line)
diagonal = list(diag(cooc_counts))
# debug
# print("WARN diag: ", diagonal)
# print("WARN diag: =================== 0 in diagonal ?\n",
# 0 in diagonal ? "what ??? zeros in the diagonal :/" : "ok no zeros",
# "\n===================")
p_col_given_line = cooc_counts / list(diag(cooc_counts))
# p_col_given_line
# ----------------
# animals food grass humans report rivers war water
# animals 1.0 0.7 1.0 0.3 0.6 0.0 0.0 0.3
# food 0.5 1.0 1.0 0.3 0.4 0.0 0.0 0.3
# grass 0.2 0.3 1.0 0.0 0.1 0.0 0.0 0.0
# humans 0.2 0.3 0.0 1.0 0.4 0.0 1.0 0.3
# report 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0
# rivers 0.0 0.0 0.0 0.0 0.1 1.0 0.0 0.3
# war 0.0 0.0 0.0 0.3 0.1 0.0 1.0 0.0
# water 0.2 0.3 0.0 0.3 0.4 1.0 0.0 1.0
#===========================================================================
# total per lines (<=> genericity)
Gen = p_col_given_line.sum(axis=1)
# Gen.sort_values(ascending=False)
# ---
# report 8.0
# animals 3.9
# food 3.6
# water 3.3
# humans 3.3
# grass 1.7
# war 1.5
# rivers 1.5
#===========================================================================
# total columnwise (<=> specificity)
Spec = p_col_given_line.sum(axis=0)
# Spec.sort_values(ascending=False)
# ----
# grass 4.0
# food 3.7
# water 3.3
# humans 3.3
# report 3.3
# animals 3.2
# war 3.0
# rivers 3.0
#===========================================================================
# our "inclusion by specificity" metric
Specclusion = Spec-Gen
# Specclusion.sort_values(ascending=False)
# -----------
# grass 1.1
# war 0.8
# rivers 0.8
# food 0.0
# humans -0.0
# water -0.0
# animals -0.3
# report -2.4
#===========================================================================
# our "inclusion by genericity" metric
Genclusion = Spec+Gen
# Genclusion.sort_values(ascending=False)
# -----------
# report 11.3
# food 7.3
# animals 7.2
# water 6.7
# humans 6.7
# grass 5.7
# war 4.5
# rivers 4.5
#===========================================================================
# specificity node
if spec_overwrite_id:
# overwrite pre-existing id
the_spec_id = spec_overwrite_id
session.query(NodeNgram).filter(NodeNgram.node_id==the_spec_id).delete()
session.commit()
else:
specnode = corpus.add_child(
typename = "SPECCLUSION",
name = "Specclusion (in:%s)" % corpus.id
)
session.add(specnode)
session.commit()
the_spec_id = specnode.id
if not Specclusion.empty:
data = WeightedList(
zip( Specclusion.index.tolist()
, [v for v in map(round3, Specclusion.values.tolist())]
)
)
data.save(the_spec_id)
else:
print("WARNING: had no terms in COOCS => empty SPECCLUSION node")
#===========================================================================
# genclusion node
if gen_overwrite_id:
the_gen_id = gen_overwrite_id
session.query(NodeNgram).filter(NodeNgram.node_id==the_gen_id).delete()
session.commit()
else:
gennode = corpus.add_child(
typename = "GENCLUSION",
name = "Genclusion (in:%s)" % corpus.id
)
session.add(gennode)
session.commit()
the_gen_id = gennode.id
if not Genclusion.empty:
data = WeightedList(
zip( Genclusion.index.tolist()
, [v for v in map(round3, Genclusion.values.tolist())]
)
)
data.save(the_gen_id)
else:
print("WARNING: had no terms in COOCS => empty GENCLUSION node")
#===========================================================================
return(the_spec_id, the_gen_id)
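The whole Gen/Spec pipeline above fits in a few pandas lines; a minimal standalone sketch with a made-up 3-term cooccurrence matrix (term names and counts invented):
``` python
from numpy import diag
from pandas import DataFrame

# made-up symmetric cooc counts; diagonal N(ii) = occurrences of term i
counts = DataFrame({'cat':  {'cat': 4, 'pet': 3, 'tail': 1},
                    'pet':  {'cat': 3, 'pet': 5, 'tail': 1},
                    'tail': {'cat': 1, 'pet': 1, 'tail': 2}})

# each cell N(ij) divided by the diagonal count of its column, as in compute_specgen
p = counts / list(diag(counts))

gen = p.sum(axis=1)    # row sums <=> genericity
spec = p.sum(axis=0)   # column sums <=> specificity
print(spec - gen)      # spec-clusion ranking ('tail' comes out most specific)
print(spec + gen)      # gen-clusion ranking ('cat' comes out most generic)
```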
"""
Computes a specificity metric from the ngram cooccurrence matrix.
+ SAVE => WeightedList => NodeNgram
"""
from gargantext.models import Node, Ngram, NodeNgram, NodeNgramNgram
from gargantext.util.db import session, aliased, func, bulk_insert
from gargantext.util.lists import WeightedList
from collections import defaultdict
from pandas import DataFrame
import pandas as pd
def compute_specificity(corpus, cooc_id=None, cooc_matrix=None, overwrite_id = None):
'''
Compute the specificity, simple calculus.
Parameters:
- cooc_id: mandatory id of a cooccurrences node to use as base
- overwrite_id: optional preexisting specificity node to overwrite
'''
matrix = defaultdict(lambda : defaultdict(float))
if cooc_id == None and cooc_matrix == None:
raise TypeError("compute_specificity: needs a cooc_id or cooc_matrix param")
elif cooc_id:
cooccurrences = (session.query(NodeNgramNgram)
.filter(NodeNgramNgram.node_id==cooc_id)
)
# no filtering: cooc already filtered on mainlist_id at creation
for cooccurrence in cooccurrences:
matrix[cooccurrence.ngram1_id][cooccurrence.ngram2_id] = cooccurrence.weight
matrix[cooccurrence.ngram2_id][cooccurrence.ngram1_id] = cooccurrence.weight
elif cooc_matrix:
# copy WeightedMatrix into local matrix structure
for (ngram1_id, ngram2_id) in cooc_matrix.items:
w = cooc_matrix.items[(ngram1_id, ngram2_id)]
matrix[ngram1_id][ngram2_id] = w
nb_ngrams = len(matrix)
print("SPECIFICITY: computing on %i ngrams" % nb_ngrams)
x = DataFrame(matrix).fillna(0)
# proba (x/y) ( <= we divide each line by its total)
x = x / x.sum(axis=1)
# vectorisation
# d:Matrix => v: Vector (len = nb_ngrams)
# v = d.sum(axis=1) (- lui-même)
xs = x.sum(axis=1) - x
ys = x.sum(axis=0) - x
# top included or excluded
#n = ( xs + ys) / (2 * (x.shape[0] - 1))
# top generic or specific (asc is spec, desc is generic)
v = ( xs - ys) / ( 2 * (x.shape[0] - 1))
## d ##
#######
# Grenelle biodiversité kilomètres site élus île
# Grenelle 0 0 4 0 0 0
# biodiversité 0 0 0 0 4 0
# kilomètres 4 0 0 0 4 0
# site 0 0 0 0 4 6
# élus 0 4 4 4 0 0
# île 0 0 0 6 0 0
## d.sum(axis=1) ##
###################
# Grenelle 4
# biodiversité 4
# kilomètres 8
# site 10
# élus 12
# île 6
# temporary result
# ----------------
# for now we use the row sums as the specificity ranking
# (**same** order as with the pre-refactoring formula, but simpler to compute)
# TODO analyse the mathematical AND semantic coherence of this indicator
#v.sort_values(inplace=True)
# [ ('biodiversité' , 0.333 ),
# ('Grenelle' , 0.5 ),
# ('île' , 0.599 ),
# ('kilomètres' , 1.333 ),
# ('site' , 1.333 ),
# ('élus' , 1.899 ) ]
# ----------------
# specificity node
if overwrite_id:
# overwrite pre-existing id
the_id = overwrite_id
session.query(NodeNgram).filter(NodeNgram.node_id==the_id).delete()
session.commit()
else:
specnode = corpus.add_child(
typename = "SPECIFICITY",
name = "Specif (in:%s)" % corpus.id
)
session.add(specnode)
session.commit()
the_id = specnode.id
# print(v)
pd.options.display.float_format = '${:,.2f}'.format
if not v.empty:
data = WeightedList(
zip( v.index.tolist()
, v.values.tolist()[0]
)
)
data.save(the_id)
else:
print("WARNING: had no terms in COOCS => empty SPECIFICITY node")
return(the_id)
...@@ -18,7 +18,8 @@ def compute_coocs( corpus,
stoplist_id = None,
start = None,
end = None,
symmetry_filter = False,
diagonal_filter = True):
""" """
Count how often some extracted terms appear Count how often some extracted terms appear
together in a small context (document) together in a small context (document)
...@@ -55,6 +56,9 @@ def compute_coocs( corpus,
NB the expected type of parameter value is datetime.datetime
(string is also possible but format must follow
this convention: "2001-01-01" aka "%Y-%m-%d")
- symmetry_filter: prevent calculating where ngram1_id > ngram2_id
- diagonal_filter: prevent calculating where ngram1_id == ngram2_id
(deprecated parameters)
- field1,2: allowed to count other things than ngrams (eg tags) but no use case at present
...@@ -69,7 +73,7 @@ def compute_coocs( corpus,
JOIN nodes_ngrams AS idxb
ON idxa.node_id = idxb.node_id <== that's cooc
---------------------------------
AND idxa.ngram_id <> idxb.ngram_id (diagonal_filter)
AND idxa.node_id = MY_DOC ;
on entire corpus
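The self-join on node_id is the whole trick; stripped of SQLAlchemy, the count amounts to something like this (toy doc/ngram sets, not from the source):
``` python
from collections import defaultdict
from itertools import combinations

# toy index: document id -> set of ngram ids it contains
doc_ngrams = {1: {'a', 'b', 'c'}, 2: {'a', 'b'}}

cooc = defaultdict(int)
for ngrams in doc_ngrams.values():
    # every pair present in the same doc counts once (diagonal filtered out)
    for x, y in combinations(sorted(ngrams), 2):
        cooc[(x, y)] += 1

print(dict(cooc))   # {('a', 'b'): 2, ('a', 'c'): 1, ('b', 'c'): 1}
```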
...@@ -152,16 +156,14 @@ def compute_coocs( corpus,
ucooc
# for debug (2/4)
# , Xngram.terms.label("w_x")
# , Yngram.terms.label("w_y")
)
.join(Yindex, Xindex.node_id == Yindex.node_id ) # <- by definition of cooc
.join(Node, Node.id == Xindex.node_id) # <- b/c within corpus
.filter(Node.parent_id == corpus.id) # <- b/c within corpus
.filter(Node.typename == "DOCUMENT") # <- b/c within corpus
)
# outerjoin the synonyms if needed
...@@ -179,12 +181,12 @@ def compute_coocs( corpus,
.group_by(
Xindex_ngform_id, Yindex_ngform_id # <- what we're counting
# for debug (3/4)
# ,"w_x", "w_y"
)
# for debug (4/4)
# .join(Xngram, Xngram.id == Xindex_ngform_id)
# .join(Yngram, Yngram.id == Yindex_ngform_id)
.order_by(ucooc)
)
...@@ -192,6 +194,9 @@ def compute_coocs( corpus,
# 4) INPUT FILTERS (reduce N before O(N²))
if on_list_id:
# £TODO different lists, or one list for x and all the ngrams for y
# as this would allow expanding the list to its nearest neighbours (MacLachlan)
# (with a rectangular matrix)
m1 = aliased(NodeNgram)
m2 = aliased(NodeNgram)
...@@ -226,6 +231,10 @@ def compute_coocs( corpus,
)
if diagonal_filter:
# don't compute ngram with itself
coocs_query = coocs_query.filter(Xindex_ngform_id != Yindex_ngform_id)
if start or end:
Time = aliased(NodeHyperdata)
...@@ -268,6 +277,7 @@ def compute_coocs( corpus,
# threshold
# £TODO adjust COOC_THRESHOLD a posteriori:
# ex: sometimes 2 sometimes 4 depending on sparsity
print("COOCS: filtering pairs under threshold:", threshold)
coocs_query = coocs_query.having(ucooc >= threshold)
......
...@@ -77,7 +77,7 @@ def extract_ngrams(corpus, keys=('title', 'abstract', ), do_subngrams = DEFAULT_
continue
# get ngrams
for ngram in ngramsextractor.extract(value):
tokens = tuple(normalize_forms(token[0]) for token in ngram)
if do_subngrams:
# ex tokens = ["very", "cool", "exemple"]
...@@ -90,7 +90,7 @@ def extract_ngrams(corpus, keys=('title', 'abstract', ), do_subngrams = DEFAULT_
subterms = [tokens]
for seqterm in subterms:
ngram = ' '.join(seqterm)
if len(ngram) > 1:
# doc <=> ngram index
nodes_ngrams_count[(document.id, ngram)] += 1
...@@ -118,7 +118,7 @@ def extract_ngrams(corpus, keys=('title', 'abstract', ), do_subngrams = DEFAULT_
raise error
def normalize_forms(term_str, do_lowercase=DEFAULT_ALL_LOWERCASE_FLAG):
"""
Removes unwanted trailing punctuation
AND optionally puts everything to lowercase
...@@ -127,14 +127,14 @@ def normalize_forms(term_str, do_lowercase=DEFAULT_ALL_LOWERCASE_FLAG):
(benefits from normalize_chars upstream so there's less cases to consider)
"""
# print('normalize_forms IN: "%s"' % term_str)
term_str = sub(r'^[-\'",;/%(){}\\\[\]\. ©]+', '', term_str)
term_str = sub(r'[-\'",;/%(){}\\\[\]\. ©]+$', '', term_str)
if do_lowercase:
term_str = term_str.lower()
# print('normalize_forms OUT: "%s"' % term_str)
return term_str
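A quick standalone check of the new stripping behaviour (the sample strings are invented):
``` python
from re import sub

def normalize_forms(term_str, do_lowercase=True):
    # strip unwanted leading/trailing punctuation (now incl. quotes and ©)
    term_str = sub(r'^[-\'",;/%(){}\\\[\]\. ©]+', '', term_str)
    term_str = sub(r'[-\'",;/%(){}\\\[\]\. ©]+$', '', term_str)
    if do_lowercase:
        term_str = term_str.lower()
    return term_str

print(normalize_forms('"(Metabolic Syndrome)."'))   # -> metabolic syndrome
print(normalize_forms('© Elsevier B.V'))            # -> elsevier b.v
```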
......
...@@ -57,7 +57,7 @@ class CSVLists(APIView):
params in request.GET:
onto_corpus: the corpus whose lists are getting patched
params in request.data:
csvfile: the csv file
/!\ We assume we checked the file size client-side before upload
......
# Gargantext Installation
You will find here a Dockerfile and a docker-compose script
that build a development container for Gargantext
along with a PostgreSQL 9.5.X server.
* Install Docker
On your host machine, you need Docker.
[Installation guide details](https://docs.docker.com/engine/installation/#installation)
* Clone the gargantext repository and get the refactoring branch
```
git clone ssh://gitolite@delanoe.org:1979/gargantext /srv/gargantext
cd /srv/gargantext
git fetch origin refactoring
git checkout refactoring
```
* Install additional dependencies into gargantext_lib
```
wget http://dl.gargantext.org/gargantext_lib.tar.bz2 \
&& sudo tar xvjf gargantext_lib.tar.bz2 -o /srv/gargantext_lib \
&& sudo chown -R gargantua:gargantua /srv/gargantext_lib
```
* Developers: create your own branch based on refactoring
see [CHANGELOG](CHANGELOG.md) for migrations and branch names
```
git checkout -b username-refactoring refactoring
```
Build the docker images:
- a database container
- a gargantext container
```
cd /srv/gargantext/install/
docker-compose build -t gargantex /srv/gargantext/install/docker/config/
docker-compose run web bundle install
```
Finally, setup the PostgreSQL database with the following commands.
```
docker-compose run web bundle exec rake db:create
docker-compose run web bundle exec rake db:migrate
docker-compose run web bundle exec rake db:seed
```
## OS
## Debian Stretch
See install/debian
If you do not have a Debian environment, then install docker and
execute /srv/gargantext/install/docker/dev/install.sh
You need a docker image.
All the steps are explained in [docker/dev/install.sh](docker/dev/install.sh) (not automatic yet).
Bug reports are welcome.
...@@ -26,6 +26,7 @@ ENV PYTHON_ENV /srv/env_3-5
RUN apt-get update && \
apt-get install -y \
apt-utils ca-certificates locales \
python3-dev \
sudo aptitude gcc g++ wget git postgresql-9.5 vim \
build-essential make
...@@ -44,7 +45,7 @@ RUN apt-get update && apt-get install -y \
postgresql-server-dev-9.5 libpq-dev libxml2 \
libxml2-dev xml-core libgfortran-5-dev \
virtualenv python3-virtualenv \
python3.5 \
python3-six python3-numpy python3-setuptools \
# ^for numpy, pandas
python3-numexpr \
......
#!/bin/bash
#######################################################################
# ____ _
# | _ \ ___ ___| | _____ _ __
# | | | |/ _ \ / __| |/ / _ \ '__|
# | |_| | (_) | (__| < __/ |
# |____/ \___/ \___|_|\_\___|_|
#
######################################################################
sudo docker build -t gargantext .
# OR Get the ID of your container
#ID=$(docker build .) && docker run -i -t $ID
# OR
# cd /tmp
# wget http://dl.gargantext.org/gargantext_docker_image.tar \
# && sudo docker import - gargantext:latest < gargantext_docker_image.tar
#!/bin/bash
echo "Adding user gargantua";
sudo adduser --disabled-password --gecos "" gargantua;
echo "Creating the environnement into /srv/";
for dir in "/srv/gargantext" "/srv/gargantext_lib" "/srv/gargantext_static" "/srv/gargantext_media""/srv/env_3-5"; do
sudo mkdir -p $dir ;
sudo chown gargantua:gargantua $dir ;
done;
echo "Downloading the libs. Please be patient!";
wget http://dl.gargantext.org/gargantext_lib.tar.bz2 \
&& tar xvjf gargantext_lib.tar.bz2 -o /srv/gargantext_lib \
&& sudo chown -R gargantua:gargantua /srv/gargantext_lib \
&& echo "Libs installed";
echo 'Install docker'
sudo apt-get install -y docker-engine
echo 'Build gargantext image'
cd /srv/gargantext/install/
./docker/config/build
#Next steps
#install and configure git
#sudo apt-get install -y git
#clone your SSH key
#cp ~/.ssh/id_rsa.pub id_rsa.pub
#clone the repo
#~ git clone ssh://gitolite@delanoe.org:1979/gargantext /srv/gargantext \
#~ && cd /srv/gargantext \
# get on branch
#~ && git fetch origin unstable \
#~ && git checkout unstable \
#~ echo "Currently on /srv/gargantext unstable branch";
#create your own branch
# git checkout -b my-unstable
...@@ -256,7 +256,7 @@
</div>
<!-- Sidebar -->
<div id="sidecolumn">
<div style="text-align: center;">
<a href="http://www.cnrs.fr" target="_blank"><img width="40%" src="https://www.ipmc.cnrs.fr/~duprat/comm/images/logo_cnrs_transparent.gif"></a>
</div>
......
...@@ -149,11 +149,11 @@ function CRUD( list_id , ngram_ids , http_method , callback) {
var div_info = "";
if( $( ".colorgraph_div" ).length>0 )
div_info += '<ul id="colorGraph" class="nav navbar-nav">'
div_info += ' <li class="dropdown">'
div_info += '<a href="#" class="dropdown-toggle" data-toggle="dropdown">'
div_info += ' <img title="Set Colors" src="/static/img/colors.png" width="22px"><b class="caret"></b></img>'
div_info += '</a>'
div_info += ' <ul class="dropdown-menu">'
...@@ -186,11 +186,11 @@ function CRUD( list_id , ngram_ids , http_method , callback) {
div_info = "";
if( $( ".sizegraph_div" ).length>0 )
div_info += '<ul id="sizeGraph" class="nav navbar-nav">'
div_info += ' <li class="dropdown">'
div_info += '<a href="#" class="dropdown-toggle" data-toggle="dropdown">'
div_info += ' <img title="Set Sizes" src="/static/img/NodeSize.png" width="18px"><b class="caret"></b></img>'
div_info += '</a>'
div_info += ' <ul class="dropdown-menu">'
......
...@@ -18,14 +18,53 @@
}
.navbar {
margin-bottom:1px;
}
#defaultop{
min-height: 5%;
/*max-height: 10%;*/
text-align: center;
}
#defaultop li.basicitem{
/*font-family: "Helvetica Neue", Helvetica, Arial, sans-serif ;*/
padding-left: .4em;
padding-right: .4em;
padding-bottom: 0;
font-size: 90% ;
}
#defaultop > div {
float: none;
display: inline-block;
text-align: left;
}
#defaultop .nav > li > a {
text-align: center;
padding-top: .4em;
padding-bottom: .2em;
margin-left: auto ;
margin-right: auto ;
}
/*searchnav should get same padding as our .navbar-nav > li > a or bootstrap's*/
#defaultop div#searchnav {
padding-top: 13px;
padding-bottom: 9px;
}
#defaultop .settingslider {
max-width: 80px;
display: inline-block ;
}
#sigma-example {
...@@ -165,9 +204,7 @@
display:inline-block;
border:solid 1px;
/*box-shadow: 0px 0px 0px 1px rgba(0,0,0,0.3); */
border-radius: 6px;
border-color:#BDBDBD;
padding:0px 2px 0px 2px;
margin:1px 0px 1px 0px;
...@@ -367,6 +404,12 @@
padding-left:5%;
}
/* small messages */
p.micromessage{
font-size: 85%;
color: #707070 ;
}
.btn-sm:hover {
font-weight: bold;
}
...@@ -376,7 +419,7 @@
.tab { display: inline-block; zoom:1; *display:inline; background: #eee; border: solid 1px #999; border-bottom: none; -moz-border-radius: 4px 4px 0 0; -webkit-border-radius: 4px 4px 0 0; }
.tab a { font-size: 12px; line-height: 2em; display: block; padding: 0 10px; outline: none; }
.tab a:hover { text-decoration: underline; }
.tab.active { background: #fff; padding-top: 6px; position: relative; top: 3px; border-color: #666; }
.tab a.active { font-weight: bold; }
.tab-container .panel-container { background: #fff; border: solid #666 1px; padding: 10px; -moz-border-radius: 0 4px 4px 4px; -webkit-border-radius: 0 4px 4px 4px; }
.panel-container { margin-bottom: 10px; }
.fsslider {
position: relative;
min-width: 80px;
height: 8px;
display: inline-block;
width: 100%;
......
...@@ -21,14 +21,15 @@ box-shadow: 0px 0px 3px 0px #888888;
}*/
#sidecolumn {
overflow-y: scroll;
padding-bottom: 10px;
padding-left: 5px;
right: 0px;
/* this width one is just a first guess...
(it will be changed in main.js to sidecolumnSize param)
*/
width: 25em;
position: fixed;
height: 100%;
border: 1px #888888 solid;
......
...@@ -30,6 +30,11 @@ var mainfile = ["db.json"];
// getUrlParam.file = window.location.origin+"/"+$("#graphid").html(); // garg exclusive
// var corpusesList = {} // garg exclusive -> corpus comparison
var tagcloud_limit = 50;
// for the css of sidecolumn and canvasLimits size
var sidecolumnSize = "20%"
var current_url = window.location.origin+window.location.pathname+window.location.search
getUrlParam.file = current_url.replace(/projects/g, "api/projects")
......
...@@ -22,45 +22,27 @@
/[$\w]+/g
);
// on window resize
// @param canvasdiv: id of the div (without '#')
function sigmaLimits( canvasdiv ) {
console.log('FUN t.TinawebJS:sigmaLimits') ;
var canvas = document.getElementById(canvasdiv) ;
var sidecolumn = document.getElementById('sidecolumn') ;
var ancho_total = window.innerWidth - sidecolumn.offsetWidth ;
var alto_total = window.innerHeight - sidecolumn.offsetTop ;
// setting new size
canvas.style.width = ancho_total - 5 ;
canvas.style.height = alto_total - 5 ;
// fyi result
var pw=canvas.offsetWidth;
var ph=canvas.offsetHeight;
console.log("new canvas! w:"+pw+" , h:"+ph) ;
}
SelectionEngine = function() { SelectionEngine = function() {
console.log('FUN t.TinawebJS:SelectionEngine:new') console.log('FUN t.TinawebJS:SelectionEngine:new')
// Selection Engine!! finally... // Selection Engine!! finally...
@@ -381,6 +363,8 @@ SelectionEngine = function() {
 TinaWebJS = function ( sigmacanvas ) {
     console.log('FUN t.TinawebJS:TinaWebJS:new')
+    // '#canvasid'
     this.sigmacanvas = sigmacanvas;
     this.init = function () {
@@ -392,11 +376,11 @@ TinaWebJS = function ( sigmacanvas ) {
         return this.sigmacanvas;
     }
-    this.AdjustSigmaCanvas = function ( sigmacanvas ) {
+    this.AdjustSigmaCanvas = function ( canvasdiv ) {
         console.log('FUN t.TinawebJS:AdjustSigmaCanvas')
-        var canvasdiv = "";
-        if( sigmacanvas ) canvasdiv = sigmacanvas;
-        else canvasdiv = this.sigmacanvas;
+        if (! canvasdiv)
+            // '#canvasid' => 'canvasid'
+            canvasdiv = sigmacanvas.substring(1);
         return sigmaLimits( canvasdiv );
     }
@@ -565,8 +549,8 @@ TinaWebJS = function ( sigmacanvas ) {
     // === un/hide leftpanel === //
     $("#aUnfold").click(function(e) {
-        //SHOW leftcolumn
-        sidebar = $("#leftcolumn");
+        //SHOW sidecolumn
+        sidebar = $("#sidecolumn");
         fullwidth=$('#fixedtop').width();
         e.preventDefault();
         // $("#wrapper").toggleClass("active");
@@ -590,7 +574,7 @@ TinaWebJS = function ( sigmacanvas ) {
         }, 400);
     }
     else {
-        //HIDE leftcolumn
+        //HIDE sidecolumn
         $("#aUnfold").attr("class","leftarrow");
         sidebar.animate({
             "right" : "-" + sidebar.width() + "px"
@@ -178,7 +178,7 @@ function MainFunction( RES ) {
     // [ Initiating Sigma-Canvas ]
     var twjs_ = new TinaWebJS('#sigma-example');
     print( twjs_.AdjustSigmaCanvas() );
-    $( window ).resize(function() { print(twjs_.AdjustSigmaCanvas()) });
+    window.onresize = function(){twjs_.AdjustSigmaCanvas()} // TODO: debounce?
     // [ / Initiating Sigma-Canvas ]
     print("categories: "+categories)
@@ -357,6 +357,9 @@ function MainFunction( RES ) {
         partialGraph.stopForceAtlas2();
     }, fa2seconds*1000);
+    // apply width from settings on left column
+    document.getElementById('sidecolumn').style.width = sidecolumnSize ;
 }
@@ -119,13 +119,14 @@
         <a tabindex="-1"
            data-url="/projects/{{project.id}}/corpora/{{ corpus.id }}/explorer?field1=ngrams&amp;field2=ngrams&amp;distance=distributional&amp;bridgeness=5" onclick='gotoexplorer(this)' >With distributional distance</a>
     </li>
+    <!--
     <li>
         <a tabindex="-1"
            onclick="javascript:location.href='/projects/{{project.id}}/corpora/{{ corpus.id }}/myGraphs'"
            data-target='#' href='#'>My Graphs
         </a>
     </li>
+    --!>
 </ul>
@@ -213,16 +214,11 @@
             </div>
         </div>
     </div>
-    {% else %}
-    <div class="container theme-showcase">
-        <div class="jumbotron" style="margin-bottom:0">
-        </div>
-    </div>
     {% endif %}
 {% endif %}
 {% endblock %}
 {% block content %}
 {% endblock %}
@@ -235,7 +231,7 @@
     <p>
         Gargantext
        <span class="glyphicon glyphicon-registration-mark" aria-hidden="true"></span>
-        , version 3.0.3.1,
+        , version 3.0.3.3,
        <a href="http://www.cnrs.fr" target="blank" title="Institution that enables this project.">
         Copyrights
        <span class="glyphicon glyphicon-copyright-mark" aria-hidden="true"></span>
UNIT TESTS
==========
Prerequisite
------------
Running the unit tests involves creating a **temporary test DB**!
+ it implies **CREATEDB permissions** for settings.DATABASES.user
  (this has security consequences)
+ for instance in gargantext you would need to run this in psql as postgres:
  `# ALTER USER gargantua CREATEDB;`
A sensible precaution ("principe de précaution") could be to grant gargantua the CREATEDB right on the **dev** machines (to be able to run tests) and not to grant it on the **prod** machines (no testing there, but more protection just in case).
Usage
------
```bash
./manage.py test unittests/ -v 2    # from the django root directory (inside the container)
# or for a single module
./manage.py test unittests.tests_010_basic -v 2
```
( `-v 2` is the verbosity level )
Tests
------
1. **tests_010_basic**
2. **tests ???**
3. **tests ???**
4. **tests ???**
5. **tests ???**
6. **tests ???**
7. **tests_070_routes**
   Checks the response types from the app URL routes:
- "/"
- "/api/nodes"
- "/api/nodes/<ID>"
GargTestRunner
---------------
Most of the tests will interact with a DB, but we don't want to touch the real one, so we provide a customized test runner class in `unittests/framework.py` that creates a test database.
It must be referenced in Django's `settings.py` like this:
```python
TEST_RUNNER = 'unittests.framework.GargTestRunner'
```
(This way the `./manage.py test` command will be using GargTestRunner.)
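As a quick sanity check (a hypothetical snippet, not part of the suite), you can ask Django which runner it resolved:

```python
# assumes DJANGO_SETTINGS_MODULE already points to gargantext.settings
from django.conf import settings
from django.test.utils import get_runner

runner_cls = get_runner(settings)   # honours settings.TEST_RUNNER
print(runner_cls.__name__)          # expected: GargTestRunner
```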
Using a DB session
------------------
To emulate a session the way we usually do it in gargantext, our `unittests.framework` also
provides a session object on the test database via `GargTestRunner.testdb_session`.
To work correctly, it needs to be read *inside the test setup*: the session is only created once the runner has built the test DB, so reading it at module import time would still yield `None`.
**Example**
```python
from django.test import TestCase
from unittests.framework import GargTestRunner
from gargantext.models import Node

class MyTestRecipes(TestCase):
def setUp(self):
# -------------------------------------
session = GargTestRunner.testdb_session
# -------------------------------------
new_project = Node(
typename = 'PROJECT',
name = "hello i'm a project",
)
session.add(new_project)
session.commit()
```
Accessing the URLS
------------------
Django's test framework provides a client to browse the URLs.
**Example**
```python
from django.test import TestCase, Client

class MyTestRecipes(TestCase):
def setUp(self):
self.client = Client()
def test_001_get_front_page(self):
''' get the about page localhost/about '''
# --------------------------------------
the_response = self.client.get('/about')
# --------------------------------------
self.assertEqual(the_response.status_code, 200)
```
Logging in
-----------
Most of our functionalities are only available after login, so we provide a fake user at the initialization of the test DB.
His login is 'pcorser' and his password is 'peter'.
**Example**
```python
from django.test import TestCase, Client

class MyTestRecipes(TestCase):
def setUp(self):
self.client = Client()
# login ---------------------------------------------------
response = self.client.post(
'/auth/login/',
{'username': 'pcorser', 'password': 'peter'}
)
# ---------------------------------------------------------
def test_002_get_to_a_restricted_page(self):
''' get the projects page /projects '''
the_response = self.client.get('/projects')
self.assertEqual(the_response.status_code, 200)
```
*If you enjoy the adventures of Peter Corser, read the previous album ["Doors"](https://gogs.iscpif.fr/leclaire/doors)* (script M. Leclaire, drawings R. Loth) (available in all good bookstores)
FIXME
-----
Will the test client's URL GETs still have read access to the original DB?
cf. http://stackoverflow.com/questions/19714521
cf. http://stackoverflow.com/questions/11046039
cf. test_073_get_api_one_node
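Until that question is settled, a possible probe would look like this (a sketch, not part of the suite; it assumes a truly isolated test DB starts with zero nodes):

```python
from django.test import TestCase, Client

class IsolationProbe(TestCase):
    def setUp(self):
        self.client = Client()
        self.client.post('/auth/login/',
                         {'username': 'pcorser', 'password': 'peter'})

    def test_000_no_preexisting_nodes(self):
        # a non-zero count before we add anything would mean
        # reads are leaking through to the original DB
        count = self.client.get('/api/nodes').json()['count']
        self.assertEqual(count, 0)
```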
"""
A test runner derived from default (DiscoverRunner) but adapted to our custom DB
cf. docs.djangoproject.com/en/1.9/topics/testing/advanced/#using-different-testing-frameworks
cf. gargantext/settings.py => TEST_RUNNER
cf. dbmigrate.py
FIXME: will URL GETs still have read access to the original DB?
cf. http://stackoverflow.com/questions/19714521
cf. http://stackoverflow.com/questions/11046039
cf. test_073_get_api_one_node
"""
# basic elements
from django.test.runner import DiscoverRunner, get_unique_databases_and_mirrors
from sqlalchemy import create_engine
from gargantext.settings import DATABASES
# things needed to create a user
from django.contrib.auth.models import User
# here we setup a minimal django so as to load SQLAlchemy models ---------------
# and then be able to import models and Base.metadata.tables
from os import environ
from django import setup
environ.setdefault("DJANGO_SETTINGS_MODULE", "gargantext.settings")
setup() # models can now be imported
from gargantext import models # Base is now filled
from gargantext.util.db import Base # contains metadata.tables
# ------------------------------------------------------------------------------
# things needed to provide a session
from sqlalchemy.orm import sessionmaker, scoped_session
class GargTestRunner(DiscoverRunner):
"""
We use the default test runner but we just add
our own dbmigrate elements at db creation
=> we let django.test.runner do the test db creation + auto migrations
=> we retrieve the test db name from django.test.runner
=> we create a test engine like in gargantext.db.create_engine but with the test db name
=> we create tables for our models like in dbmigrate with the test engine
TODO: list of tables to be created are hard coded in self.models
"""
# we'll also expose a session as GargTestRunner.testdb_session
testdb_session = None
def __init__(self, *args, **kwargs):
# our custom tables to be created (in correct order)
self.models = ['ngrams', 'nodes', 'contacts', 'nodes_nodes', 'nodes_ngrams', 'nodes_nodes_ngrams', 'nodes_ngrams_ngrams', 'nodes_hyperdata']
self.testdb_engine = None
        # and execute default django init
        super(GargTestRunner, self).__init__(*args, **kwargs)
def setup_databases(self, *args, **kwargs):
"""
Complement the database creation
by our own "models to tables" migration
"""
# default django setup performs base creation + auto migrations
old_config = super(GargTestRunner, self).setup_databases(*args, **kwargs)
# retrieve the testdb_name set by DiscoverRunner
testdb_names = []
for db_infos in get_unique_databases_and_mirrors():
# a key has the form: (IP, port, backend, dbname)
for key in db_infos:
# db_infos[key] has the form (dbname, {'default'})
testdb_names.append(db_infos[key][0])
        # /!\ assumption: a single database /!\
testdb_name = testdb_names[0]
# now we use a copy of our normal db config...
db_params = DATABASES['default']
# ...just changing the name
db_params['NAME'] = testdb_name
# connect to this test db
testdb_url = 'postgresql+psycopg2://{USER}:{PASSWORD}@{HOST}:{PORT}/{NAME}'.format_map(db_params)
self.testdb_engine = create_engine( testdb_url )
print("TESTDB INIT: opened connection to database **%s**" % db_params['NAME'])
# we retrieve real tables declarations from our loaded Base
sqla_models = (Base.metadata.tables[model_name] for model_name in self.models)
# example: Base.metadata.tables['ngrams']
# ---------------------------------------
# Table('ngrams', Column('id', Integer(), table=<ngrams>, primary_key=True),
# Column('terms', String(length=255), table=<ngrams>),
# Column('n', Integer(), table=<ngrams>),
# schema=None)
# and now creation of each table in our test db (like dbmigrate)
for model in sqla_models:
try:
model.create(self.testdb_engine)
print('TESTDB INIT: created model: `%s`' % model)
except Exception as e:
print('TESTDB INIT ERROR: could not create model: `%s`, %s' % (model, e))
# we also create a session to provide it the way we usually do in garg
# (it's a class based static var to be able to share it with our tests)
GargTestRunner.testdb_session = scoped_session(sessionmaker(bind=self.testdb_engine))
# and let's create a user too otherwise we'll never be able to login
user = User.objects.create_user(username='pcorser', password='peter')
# old_config will be used by DiscoverRunner
# (to remove everything at the end)
return old_config
def teardown_databases(self, old_config, *args, **kwargs):
"""
After all tests
"""
# close the session
GargTestRunner.testdb_session.close()
# free the connection
self.testdb_engine.dispose()
# default django teardown performs destruction of the test base
super(GargTestRunner, self).teardown_databases(old_config, *args, **kwargs)
# snippets if we choose direct model building instead of setup() and Base.metadata.tables[model_name]
# from sqlalchemy.types import Integer, String, DateTime, Text, Boolean, Float
# from gargantext.models.nodes import NodeType
# from gargantext.models.hyperdata import HyperdataKey
# from sqlalchemy.schema import Table, Column, ForeignKey, UniqueConstraint, MetaData
# from sqlalchemy.dialects.postgresql import JSONB, DOUBLE_PRECISION
# from sqlalchemy.ext.mutable import MutableDict, MutableList
# Double = DOUBLE_PRECISION
# sqla_models = [i for i in sqla_models]
# print (sqla_models)
# sqla_models = [Table('ngrams', MetaData(bind=None), Column('id', Integer(), primary_key=True, nullable=False), Column('terms', String(length=255)), Column('n', Integer()), schema=None), Table('nodes', MetaData(bind=None), Column('id', Integer(), primary_key=True, nullable=False), Column('typename', NodeType()), Column('user_id', Integer(), ForeignKey('auth_user.id')), Column('parent_id', Integer(), ForeignKey('nodes.id')), Column('name', String(length=255)), Column('date', DateTime()), Column('hyperdata', JSONB(astext_type=Text())), schema=None), Table('contacts', MetaData(bind=None), Column('id', Integer(), primary_key=True, nullable=False), Column('user1_id', Integer(), primary_key=True, nullable=False), Column('user2_id', Integer(), primary_key=True, nullable=False), Column('is_blocked', Boolean()), Column('date_creation', DateTime()), schema=None), Table('nodes_nodes', MetaData(bind=None), Column('node1_id', Integer(), ForeignKey('nodes.id'), primary_key=True, nullable=False), Column('node2_id', Integer(), ForeignKey('nodes.id'), primary_key=True, nullable=False), Column('score', Float(precision=24)), schema=None), Table('nodes_ngrams', MetaData(bind=None), Column('node_id', Integer(), ForeignKey('nodes.id'), primary_key=True, nullable=False), Column('ngram_id', Integer(), ForeignKey('ngrams.id'), primary_key=True, nullable=False), Column('weight', Float()), schema=None), Table('nodes_nodes_ngrams', MetaData(bind=None), Column('node1_id', Integer(), ForeignKey('nodes.id'), primary_key=True, nullable=False), Column('node2_id', Integer(), ForeignKey('nodes.id'), primary_key=True, nullable=False), Column('ngram_id', Integer(), ForeignKey('ngrams.id'), primary_key=True, nullable=False), Column('score', Float(precision=24)), schema=None), Table('nodes_ngrams_ngrams', MetaData(bind=None), Column('node_id', Integer(), ForeignKey('nodes.id'), primary_key=True, nullable=False), Column('ngram1_id', Integer(), ForeignKey('ngrams.id'), primary_key=True, nullable=False), Column('ngram2_id', Integer(), ForeignKey('ngrams.id'), primary_key=True, nullable=False), Column('weight', Float(precision=24)), schema=None), Table('nodes_hyperdata', MetaData(bind=None), Column('id', Integer(), primary_key=True, nullable=False), Column('node_id', Integer(), ForeignKey('nodes.id')), Column('key', HyperdataKey()), Column('value_int', Integer()), Column('value_flt', DOUBLE_PRECISION()), Column('value_utc', DateTime(timezone=True)), Column('value_str', String(length=255)), Column('value_txt', Text()), schema=None)]
"""
BASIC UNIT TESTS FOR GARGANTEXT IN DJANGO
=========================================
"""
from django.test import TestCase
class NodeTestCase(TestCase):
def setUp(self):
from gargantext.models import nodes
self.node_1000 = nodes.Node(id=1000)
self.new_node = nodes.Node()
def test_010_node_has_id(self):
'''new_node.id'''
self.assertEqual(self.node_1000.id, 1000)
def test_011_node_write(self):
'''write new_node to DB and commit'''
from gargantext.util.db import session
self.assertFalse(self.new_node._sa_instance_state._attached)
session.add(self.new_node)
session.commit()
self.assertTrue(self.new_node._sa_instance_state._attached)
"""
ROUTE UNIT TESTS
================
"""
from django.test import TestCase
from django.test import Client
# to be able to create Nodes
from gargantext.models import Node
# to be able to compare in test_073_get_api_one_node()
from gargantext.constants import NODETYPES
# provides GargTestRunner.testdb_session
from unittests.framework import GargTestRunner
class RoutesChecker(TestCase):
def setUp(self):
"""
Will be run before each test
"""
self.client = Client()
# login with our fake user
response = self.client.post(
'/auth/login/',
{'username': 'pcorser', 'password': 'peter'}
)
print(response.status_code)
session = GargTestRunner.testdb_session
new_project = Node(
typename = 'PROJECT',
name = "hello i'm a project",
)
session.add(new_project)
session.commit()
self.a_node_id = new_project.id
print("created a project with id: %i" % new_project.id)
def test_071_get_front_page(self):
''' get the front page / '''
front_response = self.client.get('/')
self.assertEqual(front_response.status_code, 200)
self.assertIn('text/html', front_response.get('Content-Type'))
        # we assume the page will always contain this title
self.assertIn(b'<h1>Gargantext</h1>', front_response.content)
def test_072_get_api_nodes(self):
''' get "/api/nodes" '''
api_response = self.client.get('/api/nodes')
self.assertEqual(api_response.status_code, 200)
# 1) check the type is json
self.assertTrue(api_response.has_header('Content-Type'))
self.assertIn('application/json', api_response.get('Content-Type'))
# 2) let's try to get things in the json
json_content = api_response.json()
json_count = json_content['count']
json_nodes = json_content['records']
self.assertEqual(type(json_count), int)
self.assertEqual(type(json_nodes), list)
print("\ntesting nodecount: %i " % json_count)
def test_073_get_api_one_node(self):
''' get "api/nodes/<node_id>" '''
# we first get one node id by re-running this bit from test_072
a_node_id = self.client.get('/api/nodes').json()['records'][0]['id']
one_node_route = '/api/nodes/%i' % a_node_id
# print("\ntesting node route: %s" % one_node_route)
api_response = self.client.get(one_node_route)
self.assertTrue(api_response.has_header('Content-Type'))
self.assertIn('application/json', api_response.get('Content-Type'))
json_content = api_response.json()
nodetype = json_content['typename']
nodename = json_content['name']
print("\ntesting nodename:", nodename)
print("\ntesting nodetype:", nodetype)
self.assertIn(nodetype, NODETYPES)
# TODO http://localhost:8000/api/nodes?types[]=CORPUS
# £TODO test request.*
# print ("request")
# print ("user.id", request.user.id)
# print ("user.name", request.user.username)
# print ("path", request.path)
# print ("path_info", request.path_info)