# Gargantext Purescript

## About this project

Gargantext is a collaborative web platform for the exploration of sets
of unstructured documents. It combines tools from natural language
processing, text-mining, complex networks analysis and interactive data
visualization to pave the way toward new kinds of interactions with your
digital corpora.

You will not find this software very useful without also running or being
granted access to a [backend](https://gitlab.iscpif.fr/gargantext/haskell-gargantext).

This software is free software, developed by the CNRS Complex Systems
Institute of Paris Île-de-France (ISC-PIF) and its partners.

## Development

### Installing dependencies

#### Debian

##### Testing Distribution and above
```shell
sudo apt update && sudo apt install nodejs yarn
```

##### Stable Distribution
```shell
curl -sL https://deb.nodesource.com/setup_11.x | sudo bash -
sudo apt update && sudo apt install nodejs
```

```shell
curl -sS https://dl.yarnpkg.com/debian/pubkey.gpg | sudo apt-key add -
echo "deb https://dl.yarnpkg.com/debian/ stable main" | sudo tee /etc/apt/sources.list.d/yarn.list
sudo apt update && sudo apt install yarn
```

#### OSX
```shell
brew install node yarn
```

### Building

Before building gargantext, you must install the dependencies. We use
[yarn](https://yarnpkg.com/en/) for this. They have excellent
[installation instructions](https://yarnpkg.com/en/docs/install).

Once you have yarn installed, you may install everything else simply:

```shell
yarn install && yarn install-ps
```

You may now build:

```shell
yarn build
```

And run a repl:

```shell
yarn repl
```

## Note to the contributors

Please follow CONTRIBUTING.md

### How do I?

#### Add a javascript dependency?

Add it to `package.json`, under `dependencies` if it is needed at
runtime or `devDependencies` if it is not.

#### Add a purescript dependency?

Add it to `psc-package.json` without the `purescript-` prefix.

If it is not in the package set, you will need to read the next section.

#### Add a custom or override package to the local package set?

You need to add an entry to the relevant map in
`packages.dhall`. There are comments in the file explaining how it
works. It's written in dhall, so you can use comments and such.

You will then need to rebuild the package set:

```shell
yarn rebuild-set
```

#### Upgrade the base package set that the local set is based on to the latest version?

```shell
yarn rebase-set && yarn rebuild-set
```

## Theory Introduction

Making sense out of text isn't actually that hard, but it does require
a little background knowledge.

### N-grams

N-grams are at the heart of how Gargantext makes sense out of text.

There are two common meanings in the literature for n-gram:
- a sequence of `n` characters
- a sequence of `n` words

Gargantext is focused on words. Here are some example word n-grams:

- `coffee` (unigram or 1-gram)
- `need coffee` (bigram or 2-gram)
- `one coffee please` (trigram or 3-gram)
- `here is your coffee` (4-gram)
- `i need some more coffee` (5-gram)
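
As a rough illustration of the word n-grams listed above, the following Python sketch splits a text into words and returns every contiguous run of `n` words. The function name `word_ngrams` and the tokenisation rule (lowercase, then split on non-word characters) are assumptions made for this example; they are not taken from Gargantext's code.

```python
import re


def word_ngrams(text, n):
    """Return every contiguous sequence of n words in text, lowercased."""
    # Assumption: a "word" is a run of letters, digits or underscores.
    words = re.findall(r"\w+", text.lower())
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]


print(word_ngrams("Here is your coffee", 2))
# ['here is', 'is your', 'your coffee']
```

A text with fewer than `n` words yields an empty list.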

N-grams are matched case-insensitively and across whole words. Examples:

| Text         | N-gram       | Matches              |
|--------------|--------------|----------------------|
| `Coffee cup` | `coffee`     | YES                  |
| `Coffee cup` | `off`        | NO, not a whole word |
| `Coffee cup` | `coffee cup` | YES                  |

You may read more about n-grams [on Wikipedia](https://en.wikipedia.org/wiki/N-gram).
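
The matching rule shown in the table can be sketched in Python with a case-insensitive regular expression anchored on word boundaries. The helper `ngram_matches` is hypothetical and only mirrors the three rows of the table; how Gargantext itself treats punctuation is not covered here.

```python
import re


def ngram_matches(text, ngram):
    """Return True if ngram occurs in text as whole words, ignoring case."""
    # \b on both sides rejects matches that start or end inside a word.
    pattern = r"\b" + re.escape(ngram) + r"\b"
    return re.search(pattern, text, flags=re.IGNORECASE) is not None


print(ngram_matches("Coffee cup", "coffee"))      # True
print(ngram_matches("Coffee cup", "off"))         # False
print(ngram_matches("Coffee cup", "coffee cup"))  # True
```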

<!-- TODO: Discuss punctuation -->

Gargantext allows you to define n-grams interactively in your browser
and explore the relationships they uncover across a corpus of text.

Various metrics can be applied to n-grams, the most common of which is
the number of times an n-gram appears in a document.
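
To make the occurrence count concrete, here is a minimal Python sketch that counts how often an n-gram appears in each document of a small corpus. The `count_occurrences` helper and the sample documents are invented for illustration and say nothing about how Gargantext computes its metrics internally.

```python
import re


def count_occurrences(document, ngram):
    """Count whole-word, case-insensitive occurrences of ngram in document."""
    pattern = r"\b" + re.escape(ngram) + r"\b"
    return len(re.findall(pattern, document, flags=re.IGNORECASE))


# Hypothetical two-document corpus, for illustration only.
corpus = {
    "doc-1": "Coffee cup, coffee pot and a coffeehouse.",
    "doc-2": "I need some more coffee.",
}

counts = {name: count_occurrences(text, "coffee") for name, text in corpus.items()}
print(counts)  # {'doc-1': 2, 'doc-2': 1}
```

Note that `coffeehouse` is not counted, since the n-gram must match whole words.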

## Glossary

document
: One or more texts comprising a single logical document
field
: A portion of a document, e.g. `title`, `abstract`, `body`
corpus
: A collection of documents
n-gram/ngram
: A word or words to be indexed, consisting of `n` words.
  This technically includes skip-grams, but in the general case
  the words will be contiguous.
unigram/1-gram
: A one-word n-gram, e.g. `cow`, `coffee`
bigram/2-gram
: A two-word n-gram, e.g. `coffee cup`
trigram/3-gram
: A three-word n-gram, e.g. `coffee cup holder`
<!-- skip-grams are not yet supported -->
<!-- skip-gram -->
<!-- : An n-gram where the words are not all adjacent -->
<!-- k-skip-n-gram -->
<!-- : An n-gram where the words are at most distance k from each other -->