[DOC] Foundations main Definitions (ngrams lists)

Commit f745d104 authored Jun 26, 2017 by delanoe
Showing with 151 additions and 15 deletions

docs/architecture.md +1 -1

docs/ngram_lists.md +128 -0

gargantext/models/nodes.py +4 -1

gargantext/util/toolchain/list_map.py +18 -13

--- a/docs/architecture.md
+++ b/docs/architecture.md
@@ -68,7 +68,7 @@ This node is the parent of the other nodes for parameters.

 [//]: # (Are there any plans to add user wide or project wide parameters or metrics?  For example TFIDF nodes related to a normal user -- ie. not Gargantua?)

-Yes we can in the futur (but we have others priorities before.
+Yes, we can in the future (but we have other priorities first).

 [//]: # (What is the purpose of the 3 child nodes of Node[TFIDF-Global]?  Are they TFIDF metrics related to databases 1, 2 and 3? If so, shouldn't they be children of related CORPUS nodes?)


--- a/docs/ngram_lists.md
+++ b/docs/ngram_lists.md
+# Gargantext foundations: main definitions
+
+Documentation valid for 3.0.\* versions of Gargantext.
+
+## Project
+A project is a list of corpora (a project may have duplicate corpora).
+
+## Corpus
+A corpus is a set of documents. Duplicate documents are allowed but
+not recommended methodologically, since they introduce artificially repeated content into the corpus.
+
+In the document view, users may delete duplicates with a dedicated
+function.
+
+## Document
+A document is the main Entity of Textual Context (ETC). It is composed of:
+    - a title (truncated field name in the database)
+    - the date of publication
+    - a journal (or source)
+    - an abstract
+    - the authors
+Users may add many other fields to the document.
+
+The main fields mentioned above are used for the main statistics in Gargantext.
+
+
+### Source Type
+Source Type is the source (database) from which documents have been
+extracted.
+
+In 3.0.\* versions of Gargantext, each corpus has only one source type
+(i.e. one database). However, users can build their own corpus using the CSV format.
+
+
+## Ngrams
+
+### Definitions
+
+### Gram 
+A gram is a contiguous sequence of letters delimited by spaces.
+
+### N-gram
+An n-gram is a contiguous sequence of n grams separated by spaces (where n
+is a positive natural number).
+
+
+## N-gram Lists
+
+
+## Main ngrams lists: Stop/Map/Main
+
+### Definition
+
+There are three main kinds of lists:
+    1. The Stop List contains blacklisted ngrams, i.e. the noise: ngrams that users do not want to deal with.
+    2. The Map List contains the ngrams that will be shown in the map.
+    3. The Main List (or Candidate List) contains all other ngrams, i.e. those that are in neither the Stop List nor the Map List. These ngrams may still be moved to the map, either by the user's choice or, by default, according to Gargantext's default parameters.
+
+### Storage
+
+The relation between a list and an ngram is stored as a Node-Ngram
+relation, where:
+    - the Node has a type name (STOP|MAIN|MAP) and its parent_id is the context
+      (CORPUS in versions 3.0.\*, but it could be PROJECT);
+    - the Ngrams depend on the context of the list Node, where NodeNgrams is
+      not null and the Node has typename Document.
+
+
+    Node[USER](name1)
+    ├── Node[PROJECT](project1)
+    │   ├── Node[CORPUS](corpus1)
+    │   │   ├── Node[MAPLIST](list name)
+    │   │   ├── Node[STOPLIST](list name)
+    │   │   ├── Node[MAINLIST](list name)
+    │   │   │  
+    │   │   ├── Node[DOCUMENT](doc1)
+    │   │   ├── Node[DOCUMENT](doc2)
+    │   │   └── Node[DOCUMENT](doc3)
+
+
+### Policy
+
+
+#### Stops
+
+
+
+## Metrics
+
+### Term Frequency - Inverse Context Frequency (TF-ICF)
+
+TFICF, short for term frequency-inverse context frequency, is a numerical
+statistic that is intended to reflect how important an ngram is to a
+context of text.
+
+TFICF(ngram,contextLocal,contextGlobal) = TF(ngram,contextLocal) \* ICF(ngram, contextGlobal)
+where
+ * TF(ngram, contextLocal) is the ngram frequency (occurrences) in contextLocal.
+ * ICF(ngram, contextGlobal) is the inverse (log) document frequency (occurrences) in contextGlobal.
+
+
+Other types of TFICF:
+    - TFICF(ngram, DOCUMENT, CORPUS)
+    - TFICF(ngram, CORPUS, PROJECT)
+    - TFICF(ngram, PROJECT, DATABASETYPE)
+    - TFICF(ngram, DATABASETYPE, ALL)
+
+
+If the local context is a document and the global context is a set of documents (a corpus), then TFICF is the usual TFIDF:
+TFICF-DOCUMENT-CORPUS == TFICF(ngram, DOCUMENT, CORPUS) == TFIDF.
+TFICF is thus a generalization of [TFIDF, Term Frequency - Inverse Document Frequency](https://en.wikipedia.org/wiki/Tf%E2%80%93idf).
+
+
+
+## Other ngram lists
+
+### Group List
+
+
+
+#### Definition
+#### Policy to build group lists
+#### Storage
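
Note on the TF-ICF definition added in docs/ngram_lists.md: the formula TFICF = TF \* ICF can be illustrated with a small Python sketch. This is not Gargantext code. The function names `tf`, `icf` and `tficf` are hypothetical, contexts are modelled as plain lists of ngrams, and the exact logarithm form `log(N / n)` is an assumption, since the documentation only says "inverse (log)".

```python
import math
from collections import Counter


def tf(ngram, context_local):
    """Number of occurrences of `ngram` in the local context (a list of ngrams)."""
    return Counter(context_local)[ngram]


def icf(ngram, context_global):
    """Inverse context frequency, assumed here to be log(N / n).

    `context_global` is a list of local contexts, N is their count and
    n is the number of local contexts that contain `ngram`.
    """
    containing = sum(1 for context in context_global if ngram in context)
    if containing == 0:
        # ngram absent everywhere: return 0.0 instead of dividing by zero
        return 0.0
    return math.log(len(context_global) / containing)


def tficf(ngram, context_local, context_global):
    """TFICF(ngram, contextLocal, contextGlobal) = TF * ICF."""
    return tf(ngram, context_local) * icf(ngram, context_global)


# TFICF(ngram, DOCUMENT, CORPUS): the corpus is a list of documents,
# each document being a list of ngrams. This case is the usual TFIDF.
corpus = [["map", "list", "map"], ["stop", "list"], ["main", "list"]]
score_map = tficf("map", corpus[0], corpus)    # frequent locally, rare globally
score_list = tficf("list", corpus[0], corpus)  # present in every document
```

With these inputs, an ngram present in every document such as "list" gets a score of zero, while "map", which appears only in the first document, gets a positive score. The other variants (CORPUS in PROJECT, and so on) would use the same functions with a coarser notion of context.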
+
+
+
+
+
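Note on the Gram and N-gram definitions added in docs/ngram_lists.md: read literally, they amount to splitting a text on spaces and sliding a window of n grams over the result. The sketch below illustrates that reading only. The helper names `grams` and `ngrams` are hypothetical and this is not how Gargantext itself extracts ngrams.

```python
def grams(text):
    """Split a text into grams, i.e. sequences of characters delimited by spaces."""
    return text.split()


def ngrams(text, n):
    """Return every contiguous sequence of n grams, each joined by single spaces."""
    if n < 1:
        raise ValueError("n must be a positive integer")
    tokens = grams(text)
    # Slide a window of size n over the grams; a text shorter than n yields []
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


bigrams = ngrams("term frequency inverse context", 2)
```

Here `bigrams` holds the three 2-grams of the sample text, and calling `ngrams` with n larger than the number of grams returns an empty list.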
--- a/gargantext/models/nodes.py
+++ b/gargantext/models/nodes.py
@@ -21,7 +21,10 @@ class NodeType(TypeDecorator):


 class Node(Base):
-    """This model can fit many purposes.
+    """This model can fit many purposes:
+    
+    myFirstCorpus = session.query(CorpusNode).first()
+
    It intends to provide a generic model, allowing hierarchical structure
    and NoSQL-like data structuring.
    The possible types are defined in `gargantext.constants.NODETYPES`.

--- a/gargantext/util/toolchain/list_map.py
+++ b/gargantext/util/toolchain/list_map.py
@@ -12,15 +12,19 @@ from gargantext.constants     import DEFAULT_MAPLIST_MAX,\
                                     DEFAULT_MAPLIST_GENCLUSION_RATIO,\
                                     DEFAULT_MAPLIST_MONOGRAMS_RATIO

+def do_maplist_query():
+    return None
+
+
 def do_maplist(corpus,
-               overwrite_id = None,
-               mainlist_id  = None,
-               specclusion_id = None,
-               genclusion_id = None,
-               grouplist_id = None,
-               limit=DEFAULT_MAPLIST_MAX,
-               genclusion_part=DEFAULT_MAPLIST_GENCLUSION_RATIO,
-               monograms_part=DEFAULT_MAPLIST_MONOGRAMS_RATIO
+               overwrite_id    = None,
+               mainlist_id     = None,
+               specclusion_id  = None,
+               genclusion_id   = None,
+               grouplist_id    = None,
+               limit           = DEFAULT_MAPLIST_MAX,
+               genclusion_part = DEFAULT_MAPLIST_GENCLUSION_RATIO,
+               monograms_part  = DEFAULT_MAPLIST_MONOGRAMS_RATIO
               ):
    '''
    According to Genericity/Specificity and mainlist
@@ -28,9 +32,9 @@ def do_maplist(corpus,
    Parameters:
      - mainlist_id (starting point, already cleaned of stoplist terms)
      - specclusion_id (ngram inclusion by cooc specificity -- ranking factor)
-      - genclusion_id (ngram inclusion by cooc genericity -- ranking factor)
+      - genclusion_id  (ngram inclusion by cooc genericity  -- ranking factor)
      - grouplist_id (filtering grouped ones)
-      - overwrite_id: optional if preexisting MAPLIST node to overwrite
+      - overwrite_id: optional id of a preexisting MAPLIST node to overwrite

      + 3 params to modulate the terms choice
        - limit for the amount of picked terms
@@ -77,6 +81,7 @@ def do_maplist(corpus,
                        )
                .join(Ngram, Ngram.id == ScoreSpec.ngram_id)
                .join(ScoreGen, ScoreGen.ngram_id == ScoreSpec.ngram_id)
+                
                .filter(ScoreSpec.node_id == specclusion_id)
                .filter(ScoreGen.node_id == genclusion_id)

@@ -155,10 +160,10 @@ def do_maplist(corpus,
        # at the end of the first loop we just need to sort all by the second ranker (gen)
        scored_ngrams = sorted(scored_ngrams, key=lambda ng_infos: ng_infos[2], reverse=True)

-    obtained_spec_mono = len(chosen_ngrams['topspec']['monograms'])
+    obtained_spec_mono  = len(chosen_ngrams['topspec']['monograms'])
    obtained_spec_multi = len(chosen_ngrams['topspec']['multigrams'])
-    obtained_gen_mono = len(chosen_ngrams['topgen']['monograms'])
-    obtained_gen_multi = len(chosen_ngrams['topgen']['multigrams'])
+    obtained_gen_mono   = len(chosen_ngrams['topgen']['monograms'])
+    obtained_gen_multi  = len(chosen_ngrams['topgen']['multigrams'])
    obtained_total = obtained_spec_mono   \
                    + obtained_spec_multi \
                    + obtained_gen_mono   \