Coherent Stemming interface (#324) · Issues · gargantext / haskell-gargantext

Coherent Stemming interface

Stepping stone towards fixing [purescript-gargantext#633](https://gitlab.iscpif.fr/gargantext/purescript-gargantext/issues/633) (closed).

To summarise the context: we would like better control over our queries. At the moment we get 0 results for searches like "postpartum" on corpus documents, either because Postgres' built-in full-text-search stemming isn't enough or because we stem with the "wrong" algorithm for the query at hand (for example Porter vs Lancaster). In the case of "postpartum", the Porter algorithm cannot stem it further, whereas Lancaster would stem it into `postpart`. If in Postgres we then use the `:*` prefix-matching syntax and search for `to_tsquery('postpart:*')`, we would get results.

This ticket outlines a direction of travel. At the moment our stemming interface is a bit all over the place:

1. We have a `Gargantext.Core.Text.Terms.Mono.Stem` module which exposes a `stem` function, but this function delegates to the `stem` function from the [stemmer](https://hackage.haskell.org/package/stemmer) package, which is deprecated in favour of [snowball](https://hackage.haskell.org/package/snowball). The latter uses a C library (and implements the Porter algorithm), but when I tried to use it, it segfaulted;

2. We have a Porter implementation sitting at `Gargantext.Core.Text.Terms.Mono.Stem.En`, but it is used inconsistently: for example, `Gargantext.Database.Action.Search.searchInCorpus` calls it directly instead of going through the "main" interface;

3. We have poor support for languages, partly because our `Lang` type includes an `All` data constructor, which makes it awkward to define a total mapping between a `Lang` and a stemming algorithm;

4. We might want to pick a different algorithm for different contexts. For example, we might want an "expert view" in our corpus search that runs searches with different stemming strategies and compares the results.

## Proposal

My proposal is as follows:

* Let's refactor `Gargantext.Core.Text.Terms.Mono.Stem` so that it exposes a single, nicely encapsulated function:

```hs
stem :: Lang -> StemmingAlgorithm -> T.Text -> T.Text

...

data StemmingAlgorithm
  = -- | Use the built-in 'porter' implementation from gargantext.
    Porter
    -- | Use the 'stem' implementation from the 'stemmer' package.
  | Stemmer
    -- | Use Lancaster stemming.
  | Lancaster
```

This means that every request to stem a word needs to spell out, concretely:

a. The language;
b. The algorithm.
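
As a rough illustration of the proposed entry point, here is a minimal sketch. The two-constructor `Lang`, the `*Stub` back ends, and `stemWithAll` are assumptions made for the example, not existing gargantext code; the stubs only lower-case their input and do no real stemming.

```hs
import qualified Data.Text as T

-- Hypothetical two-language subset of the real 'Lang' type, with no 'All'.
data Lang = EN | FR
  deriving (Show, Eq, Enum, Bounded)

data StemmingAlgorithm = Porter | Stemmer | Lancaster
  deriving (Show, Eq, Enum, Bounded)

-- | Single entry point: every caller names both the language and the algorithm.
stem :: Lang -> StemmingAlgorithm -> T.Text -> T.Text
stem lang algo = case algo of
  Porter    -> porterStub lang
  Stemmer   -> stemmerStub lang
  Lancaster -> lancasterStub lang

-- Placeholder back ends (assumptions): they only lower-case their input.
-- A real version would call the actual Porter, 'stemmer' and Lancaster code.
porterStub, stemmerStub, lancasterStub :: Lang -> T.Text -> T.Text
porterStub    _ = T.toLower
stemmerStub   _ = T.toLower
lancasterStub _ = T.toLower

-- | Run every algorithm on one word, e.g. to feed an "expert view".
stemWithAll :: Lang -> T.Text -> [(StemmingAlgorithm, T.Text)]
stemWithAll lang word =
  [ (algo, stem lang algo word) | algo <- [minBound .. maxBound] ]
```

Because `StemmingAlgorithm` derives `Enum` and `Bounded`, `stemWithAll` picks up any constructor added later without further changes.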

If we want, we could have a helper function which would default to our built-in `porter` algorithm if the language is English, or switch to one of the `stemmer` algorithms for other languages.
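
Such a helper might look roughly like the sketch below. The name `defaultAlgorithm` and the three-constructor `Lang` are hypothetical, chosen only to keep the example self-contained.

```hs
-- Hypothetical three-language subset of the real 'Lang' type, with no 'All'.
data Lang = EN | FR | DE
  deriving (Show, Eq, Enum, Bounded)

data StemmingAlgorithm = Porter | Stemmer | Lancaster
  deriving (Show, Eq)

-- | Default choice: built-in Porter for English, the 'stemmer' package otherwise.
defaultAlgorithm :: Lang -> StemmingAlgorithm
defaultAlgorithm EN = Porter
defaultAlgorithm FR = Stemmer
defaultAlgorithm DE = Stemmer
```

Listing each constructor instead of using a wildcard pattern means GHC's `-Wincomplete-patterns` warning flags this function whenever a language is added, which is only possible once `All` is gone. A wrapper along the lines of `stemDefault lang = stem lang (defaultAlgorithm lang)` (again a hypothetical name) could then serve callers that do not care about the algorithm.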

I would suggest refactoring the `Lang` type to get rid of the `All` constructor. Since `Lang` already has `Enum` and `Bounded` instances, we can recover the `All` semantics with:

```hs
allLangs :: [Lang]
allLangs = [minBound .. maxBound]
```

If `All` is needed as input for a query that we perform from the backend, then I would suggest creating newtype wrappers so that the type used in the frontend is decoupled from the concrete backend type we end up working with. Getting rid of `All` means that we can map each language precisely to a particular stemming algorithm. That would also fix the current assumption that most of the document corpus is in English, which leads us to use the wrong stemming algorithm for other languages.
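
One possible reading of the newtype idea is sketched below. `LangFilter` and `resolveLangs` are hypothetical names, and the three-constructor `Lang` is a stand-in for the real type.

```hs
-- Hypothetical three-language subset of the backend 'Lang' type, without 'All'.
data Lang = EN | FR | DE
  deriving (Show, Eq, Enum, Bounded)

allLangs :: [Lang]
allLangs = [minBound .. maxBound]

-- | Hypothetical frontend-facing wrapper: 'Nothing' stands for "all languages".
newtype LangFilter = LangFilter (Maybe Lang)
  deriving (Show, Eq)

-- | Translate the frontend choice into concrete backend languages.
resolveLangs :: LangFilter -> [Lang]
resolveLangs (LangFilter Nothing)     = allLangs
resolveLangs (LangFilter (Just lang)) = [lang]
```

With this shape, `resolveLangs (LangFilter Nothing)` yields `[EN,FR,DE]`, so the "all" case exists only at the API boundary and every backend function still receives a concrete `Lang`.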

@anoe What do you think?