Commit 009bd667 authored by Alexandre Delanoë

[FIX] README

parent 514025ca
@@ -8,8 +8,9 @@ processing, text-mining, complex networks analysis and interactive data
 visualization to pave the way toward new kinds of interactions with your
 digital corpora.
-You will not find this software very useful without also running or being
-granted access to a [backend](https://gitlab.iscpif.fr/gargantext/haskell-gargantext).
+This repo deals with the frontend or client which needs
+a backend server running or being granted access to a
+[backend](https://gitlab.iscpif.fr/gargantext/haskell-gargantext).
 This software is free software, developed by the CNRS Complex Systems
 Institute of Paris Île-de-France (ISC-PIF) and its partners.
@@ -191,7 +192,7 @@ yarn css # for manual setup
 <!-- A `purs ide` connection will be available on port 9002 while the -->
 <!-- development server is running. -->
-A guide to getting set up with the IDE integration is coming soon, I hope.
+A guide to getting set up with the IDE integration is coming soon.
 ### Testing
@@ -236,8 +237,6 @@ yarn rebuild-set # or darn rebuild-set
 yarn rebase-set && yarn rebuild-set # or darn rebase-set && darn rebuild-set
``` ```
-This will occasionally result in swearing when you go on to build.
 ## Theory Introduction
 Making sense of out text isn't actually that hard, but it does require
@@ -245,21 +244,23 @@ a little background knowledge to understand.
 ### N-grams
-N-grams are at the heart of how Gargantext makes sense out of text.
+N-grams in contexts (of texts) are at the heart of how Gargantext makes
+sense out of text.
 There are two common meanings in the literature for n-gram:
 - a sequence of `n` characters
 - a sequence of `n` words
-Gargantext is focused on words. Here are some example word n-grams;
+Gargantext is focused on words. Here are some example word n-grams
+usually extracted by our Natural Language Process toolkit;
 - `coffee` (unigram or 1-gram)
-- `need coffee` (bigram or 2-gram)
-- `one coffee please` (trigram or 3-gram)
-- `here is your coffee` (4-gram)
-- `i need some more coffee` (5-gram)
+- `black coffee` (bigram or 2-gram)
+- `hot black coffee` (trigram or 3-gram)
+- `arabica hot black coffee` (4-gram)
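
For readers of this diff, the word n-grams listed in the README can be illustrated with a minimal Python sketch. It naively enumerates every contiguous run of `n` words; the function name `word_ngrams` is hypothetical, and this is not how GarganText's NLP toolkit selects n-grams, only a way to see what a 1-gram, 2-gram, and so on look like.

```python
def word_ngrams(text, n):
    """Return every contiguous sequence of n words in text, lower-cased."""
    words = text.lower().split()
    # Slide a window of n words over the text; yields nothing if n > len(words).
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]


if __name__ == "__main__":
    phrase = "arabica hot black coffee"
    for n in range(1, 5):
        print(n, word_ngrams(phrase, n))
```

A real extractor would keep only linguistically meaningful sequences (for example noun phrases such as `black coffee`) rather than every window.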
-N-grams are matched case insensitively and across whole words. Examples:
+N-grams are matched case insensitively and across whole words removing
+the linked syntax if exists. Examples:
 | Text | N-gram | Matches |
 |--------------|--------------|----------------------|
@@ -271,20 +272,23 @@ You may read more about n-grams [on wikipedia](https://en.wikipedia.org/wiki/N-g
 <!-- TODO: Discuss punctuation -->
-Gargantext allows you to define n-grams interactively in your browser
-and explore the relationships they uncover across a corpus of text.
+Gargantext allows you to define and refine n-grams interactively in your
+browser and explore the relationships they uncover across a corpus of
+text.
-Various metrics can be applied to n-grams, the most common of which is
-the number of times an n-gram appears in a document.
+Various metrics can be applied to n-grams, the most common of which
+is the number of times an n-gram appears in a document (occurrences).
+GarganText uses extensively the cooccurrences: times 2 n-grams appear in
+same context of text.
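
The two metrics named in this hunk, occurrences and cooccurrences, can be sketched in a few lines of Python. This is an illustration under stated assumptions, not GarganText's implementation: a "context" is taken to be one plain string (for instance a document or a field), matching is case insensitive on whole words as the README describes, and the names `count_occurrences` and `cooccurrences` are hypothetical.

```python
import re
from collections import Counter
from itertools import combinations


def count_occurrences(ngram, document):
    """Count case-insensitive, whole-word matches of ngram in document."""
    pattern = r"\b" + re.escape(ngram.lower()) + r"\b"
    return len(re.findall(pattern, document.lower()))


def cooccurrences(ngrams, contexts):
    """For each pair of n-grams, count the contexts in which both appear."""
    counts = Counter()
    for context in contexts:
        # Sort so that each pair is always recorded in the same order.
        present = sorted(g for g in set(ngrams) if count_occurrences(g, context) > 0)
        counts.update(combinations(present, 2))
    return counts


if __name__ == "__main__":
    contexts = [
        "Hot black coffee, please.",
        "Black coffee and arabica beans.",
        "Arabica is a species of coffee.",
    ]
    terms = ["black coffee", "arabica", "coffee"]
    print(count_occurrences("coffee", contexts[0]))
    print(cooccurrences(terms, contexts))
```

Note that `black coffee` and `coffee` cooccur whenever the longer n-gram is present, because the sketch does not try to resolve overlapping matches.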
 ## Glossary
 document
 : One or more texts comprising a single logical document
 field
-: A portion of a document, e.g. `title`, `abstract`, `body`
+: A portion of a document or metadata, e.g. `title`, `abstract`, `body`
 corpus
-: A collection of documents
+: A collection of documents as set (with no repetition)
 n-gram/ngram
 : A word or words to be indexed, consisting of `n` words.
 This technically includes skip-grams, but in the general case
@@ -296,8 +300,10 @@ bigram/2-gram
 trigram/3-gram
 : A three-word n-gram, e.g. `coffee cup holder`
 skip-gram
-: An n-gram where the words are not all adjacent. Not yet supported.
+: An n-gram where the words are not all adjacent. Group 2 different
+n-grams to enable such feature.
 k-skip-n-gram
-: An n-gram where the words are at most distance k from each other.
+: An n-gram where the words are at most distance k from each other. This
+feature is used for advanced research in text (not yet supported in
+GarganText)
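
The `k-skip-n-gram` glossary entry can be made concrete with a small Python sketch restricted to bigrams. It assumes that "distance k" means at most `k` words are skipped between the two words of the pair; the function name `k_skip_bigrams` is hypothetical, and the README states that this feature is not yet supported in GarganText, so this is purely illustrative.

```python
def k_skip_bigrams(text, k):
    """Return ordered word pairs with at most k words skipped between them."""
    words = text.lower().split()
    pairs = []
    for i, first in enumerate(words):
        # j - i - 1 is the number of skipped words, so j may reach i + k + 1.
        for j in range(i + 1, min(i + k + 2, len(words))):
            pairs.append((first, words[j]))
    return pairs


if __name__ == "__main__":
    print(k_skip_bigrams("arabica hot black coffee", 0))
    print(k_skip_bigrams("arabica hot black coffee", 1))
```

With `k = 0` the result is the ordinary list of adjacent bigrams; larger `k` adds pairs such as `("hot", "coffee")`.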