Coherent Stemming interface
Stepping stone towards fixing purescript-gargantext#633 (closed)
To summarise the context, we would like to have better control over our queries. At the moment we get 0 results for searches like "postpartum" on corpus documents, either because Postgres' built-in full-text-search stemming isn't enough or because we stem with the "wrong" algorithm for the query at hand (e.g. Porter vs Lancaster). In the case of "postpartum", the Porter algorithm cannot stem it any further, whereas Lancaster would stem it into postpart. This means that if we use the :* prefix-matching syntax in Postgres and search for to_tsquery('postpart:*'), we would get results.
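To make the prefix-matching idea concrete, here is a minimal sketch of how a stemmed term could be turned into a prefix query string. The names toPrefixQuery and matchesPrefix are hypothetical and not part of gargantext; matchesPrefix is only a rough model of what the :* match checks, not how Postgres implements it.

```haskell
module Main where

import qualified Data.Text as T

-- | Hypothetical helper: turn an already-stemmed term into a prefix query
-- suitable for to_tsquery, e.g. "postpart" becomes "postpart:*".
-- A real implementation would also need to sanitise the term.
toPrefixQuery :: T.Text -> T.Text
toPrefixQuery stemmed = stemmed <> T.pack ":*"

-- | Rough model of the prefix match: does the indexed lexeme start
-- with the stemmed query term?
matchesPrefix :: T.Text -> T.Text -> Bool
matchesPrefix stemmed lexeme = stemmed `T.isPrefixOf` lexeme

main :: IO ()
main = do
  -- prints "postpart:*"
  putStrLn (T.unpack (toPrefixQuery (T.pack "postpart")))
  -- prints "True": the Lancaster-style stem is a prefix of the indexed word
  print (matchesPrefix (T.pack "postpart") (T.pack "postpartum"))
```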
This ticket outlines a direction of travel. At the moment our stemming interface is a bit all over the place:
- We have a Gargantext.Core.Text.Terms.Mono.Stem module which exposes a stem function, but this function uses the stem function from the stemmer package. That package is deprecated in favour of snowball, which uses a C library (and implements the porter algorithm), but when I tried to use snowball, it segfaulted;
- We have a porter implementation sitting at Gargantext.Core.Text.Terms.Mono.Stem.En, but this is used in an ad-hoc way, for example as part of Gargantext.Database.Action.Search.searchInCorpus, instead of going through the "main" interface;
- We have poor support for languages, also because our Lang type includes an All data constructor, which makes it annoying to have a total mapping between a Lang and a stemming algorithm;
- We might want to pick a different algorithm for different contexts, for example we might want to have an "expert view" in our corpus search and run searches with different stemming strategies, and compare the results.
Proposal
My proposal is as follows:
- Let's refactor the Gargantext.Core.Text.Terms.Mono.Stem module so that it exposes a single, nicely encapsulated abstract function:
stem :: Lang -> StemmingAlgorithm -> T.Text -> T.Text
...
data StemmingAlgorithm
  = -- | Use the 'porter' implementation from gargantext.
    Porter
    -- | Use the 'stemmer' implementation from the 'stemmer' package.
  | Stemmer
    -- | Use Lancaster stemming.
  | Lancaster
This means that every request to stem a word needs to spell out, concretely:
a. The language; b. The algorithm.
If we want, we could have a helper function which would default to our built-in porter algorithm if the language is English, or switch to one of the stemmer algorithms for other languages.
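A minimal sketch of such a helper, under stated assumptions: the Lang type below is a hypothetical three-language subset without All, stem is a placeholder that returns its input unchanged, and defaultAlgorithm and stemDefault are illustrative names rather than existing gargantext functions.

```haskell
module Main where

import qualified Data.Text as T

-- Hypothetical subset of Lang, without the All constructor.
data Lang = EN | FR | DE
  deriving (Show, Eq, Ord, Enum, Bounded)

data StemmingAlgorithm = Porter | Stemmer | Lancaster
  deriving (Show, Eq)

-- | Total mapping from language to default algorithm. Every constructor
-- is listed explicitly, so GHC's incomplete-patterns warning flags any
-- language added later without a default.
defaultAlgorithm :: Lang -> StemmingAlgorithm
defaultAlgorithm EN = Porter
defaultAlgorithm FR = Stemmer
defaultAlgorithm DE = Stemmer

-- | Placeholder for the proposed interface: returns the word unchanged.
-- The real function would dispatch to the chosen implementation.
stem :: Lang -> StemmingAlgorithm -> T.Text -> T.Text
stem _lang _algo word = word

-- | Stem with the default algorithm for the given language.
stemDefault :: Lang -> T.Text -> T.Text
stemDefault lang = stem lang (defaultAlgorithm lang)

main :: IO ()
main = mapM_ describe [minBound .. maxBound]
  where
    describe :: Lang -> IO ()
    describe l = putStrLn (show l <> " -> " <> show (defaultAlgorithm l))
```

Callers that need a specific strategy (for example an "expert view") would still call stem directly with an explicit algorithm.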
I would suggest refactoring the Lang type to get rid of the All constructor. Since Lang is already an instance of Enum and Bounded, we can recover the All semantics by doing:
allLangs :: [Lang]
allLangs = [minBound .. maxBound]
If All is needed as input for a query that we perform from the backend, then I would suggest that we create newtype wrappers, so that what we use in the frontend is decoupled from the concrete backend type we end up working with. Getting rid of All means that we can precisely map each language to a particular stemming algorithm. This would also solve the problem that at the moment we assume most of the document corpus is in English, and therefore do not use the correct stemming algorithm where we should.
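One possible shape for such a wrapper, as a sketch only: QueryLang and resolveLangs are hypothetical names, Lang is again a small illustrative subset, and encoding "all languages" as Nothing is an assumption rather than something the current API does.

```haskell
module Main where

-- Hypothetical subset of the backend Lang type, without All.
data Lang = EN | FR | DE
  deriving (Show, Eq, Ord, Enum, Bounded)

-- | Hypothetical frontend-facing wrapper: Nothing stands for
-- "all languages", Just l selects a single language.
newtype QueryLang = QueryLang (Maybe Lang)
  deriving (Show, Eq)

-- | Resolve the frontend request into concrete backend languages,
-- each of which can then be mapped to its own stemming algorithm.
resolveLangs :: QueryLang -> [Lang]
resolveLangs (QueryLang Nothing)  = [minBound .. maxBound]
resolveLangs (QueryLang (Just l)) = [l]

main :: IO ()
main = do
  -- prints "[EN,FR,DE]"
  print (resolveLangs (QueryLang Nothing))
  -- prints "[FR]"
  print (resolveLangs (QueryLang (Just FR)))
```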
@anoe What do you think?