Coherent Stemming interface
Stepping stone towards fixing purescript-gargantext#633 (closed)
To summarise the context, we would like to have better control over our queries. At the moment we get 0 results for searches like "postpartum" on corpus documents, either because Postgres' built-in full-text-search stemming isn't enough or because we stem with the "wrong" (for the query at hand) algorithm (e.g. Porter vs Lancaster). In the case of "postpartum", the Porter algorithm cannot stem it any further, whereas Lancaster would stem it into postpart. This means that if we use the :* prefix-matching syntax in Postgres and search for to_tsquery('postpart:*'), we would get results.
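As a minimal sketch of the prefix-query idea: the helpers below are hypothetical (neither toPrefixQuery nor matchesAsPrefix exists in Gargantext today). They assume the stem is a single bare word with no whitespace, quotes or tsquery operators in it, and that the stem was already produced by whichever algorithm we picked.

```haskell
import qualified Data.Text as T

-- | Hypothetical helper: turn an already-stemmed single word into a
-- prefix-matching tsquery string, e.g. "postpart" becomes "postpart:*".
-- Assumption: the input contains no whitespace or tsquery operators.
toPrefixQuery :: T.Text -> T.Text
toPrefixQuery stemmed = stemmed <> T.pack ":*"

-- | Why the prefix form helps: the shorter (Lancaster-style) stem is a
-- prefix of the word that a less aggressive stemmer leaves untouched.
matchesAsPrefix :: T.Text -> T.Text -> Bool
matchesAsPrefix stemmed indexedWord = stemmed `T.isPrefixOf` indexedWord
```

In GHCi, toPrefixQuery (T.pack "postpart") evaluates to "postpart:*", and matchesAsPrefix (T.pack "postpart") (T.pack "postpartum") is True.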
This ticket outlines a direction of travel. At the moment our stemming interface is a bit all over the place:
- We have a Gargantext.Core.Text.Terms.Mono.Stem module which exposes a stem function, but this function uses the stem function from the stemmer package, which is deprecated in favour of snowball. The latter uses a C library (and implements the Porter algorithm), but when I tried to use it, it segfaulted;
- We have a Porter implementation sitting at Gargantext.Core.Text.Terms.Mono.Stem.En, but it is used inconsistently, for example as part of Gargantext.Database.Action.Search.searchInCorpus instead of the "main" interface;
- We have poor support for languages, partly because our Lang type includes an All data constructor, which makes it annoying to have a total mapping between a Lang and a stemming algorithm;
- We might want to pick a different algorithm for different contexts. For example, we might want to have an "expert view" in our corpus search, run searches with different stemming strategies, and compare the results.
Proposal
My proposal is as follows:
- Let's refactor the Gargantext.Core.Text.Terms.Mono.Stem module so that it exposes a single, nicely encapsulated abstract function:
    stem :: Lang -> StemmingAlgorithm -> T.Text -> T.Text
    ...
    data StemmingAlgorithm
      = -- | Use the 'porter' implementation for gargantext.
        Porter
        -- | Use the 'stemmer' implementation from the 'stemmer' package.
      | Stemmer
        -- | Use Lancaster stemming.
      | Lancaster
This means that every request to stem a word needs to spell out, concretely:
a. the language;
b. the algorithm.
If we want, we could have a helper function which would default to our built-in Porter algorithm if the language is English, or switch to one of the stemmer algorithms for other languages.
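A rough sketch of what such a helper could look like, under some loud assumptions: the Lang type here is a cut-down, hypothetical stand-in for the real one, the names defaultAlgorithm and stemDefault are made up for illustration, and the body of stem is a placeholder that only lowercases its input so that the example compiles and runs. The real function would dispatch to the actual implementations.

```haskell
import qualified Data.Text as T

-- | Hypothetical, cut-down language type (no 'All' constructor).
data Lang = EN | FR | DE
  deriving (Show, Eq, Enum, Bounded)

data StemmingAlgorithm = Porter | Stemmer | Lancaster
  deriving (Show, Eq, Enum, Bounded)

-- | Placeholder for the proposed unified entry point. Every branch only
-- lowercases its input; no real stemming happens in this sketch.
stem :: Lang -> StemmingAlgorithm -> T.Text -> T.Text
stem _lang algo = case algo of
  Porter    -> T.toLower
  Stemmer   -> T.toLower
  Lancaster -> T.toLower

-- | Default choice: built-in Porter for English, the 'stemmer' package
-- for every other language.
defaultAlgorithm :: Lang -> StemmingAlgorithm
defaultAlgorithm EN = Porter
defaultAlgorithm _  = Stemmer

-- | Hypothetical helper for callers that do not care which algorithm is used.
stemDefault :: Lang -> T.Text -> T.Text
stemDefault lang = stem lang (defaultAlgorithm lang)
```

Call sites that do care (the "expert view", for instance) would keep calling stem with an explicit algorithm, while everything else could go through stemDefault.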
I would suggest refactoring the Lang type to get rid of the All constructor. Since Lang already has Enum and Bounded instances, we can recover the All semantics by doing:
    allLangs :: [Lang]
    allLangs = [minBound .. maxBound]
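To make this concrete, here is a small self-contained sketch. The three-constructor Lang is a hypothetical stand-in for the real type, and the language-to-algorithm choices in algorithmFor are purely illustrative, not a recommendation.

```haskell
-- | Hypothetical, cut-down language type without an 'All' constructor.
data Lang = EN | FR | DE
  deriving (Show, Eq, Enum, Bounded)

data StemmingAlgorithm = Porter | Stemmer | Lancaster
  deriving (Show, Eq)

-- | Every language, recovered from the 'Enum' and 'Bounded' instances.
allLangs :: [Lang]
allLangs = [minBound .. maxBound]

-- | A total mapping with no wildcard case: with -Wincomplete-patterns
-- enabled, GHC warns as soon as a new 'Lang' constructor is left unmapped.
-- The choices below are illustrative only.
algorithmFor :: Lang -> StemmingAlgorithm
algorithmFor EN = Porter
algorithmFor FR = Stemmer
algorithmFor DE = Stemmer

-- | The full language/algorithm table, derived rather than hand-written.
langTable :: [(Lang, StemmingAlgorithm)]
langTable = [(l, algorithmFor l) | l <- allLangs]
```

With an All constructor in Lang, algorithmFor would need a case that has no sensible answer, which is exactly the problem described above.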
If All is needed as input for a query that we perform from the backend, then I would suggest that we create newtype wrappers, so that what we use in the frontend is decoupled from the concrete backend type we end up working with. Getting rid of All means that we can precisely map each language to a particular stemming algorithm. This would also solve the problem that, at the moment, we assume most of the document corpus is in English and therefore don't use the correct stemming algorithm where we should.
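One possible shape for such a wrapper, as a sketch only: QueryLang and resolveLangs are hypothetical names, Lang is again a cut-down stand-in, and using Maybe to encode "all languages" is just one option among several.

```haskell
-- | Hypothetical backend language type, with no 'All' constructor.
data Lang = EN | FR | DE
  deriving (Show, Eq, Enum, Bounded)

-- | Hypothetical API-facing wrapper: 'Nothing' plays the role that the
-- 'All' constructor plays today, without leaking into the backend type.
newtype QueryLang = QueryLang (Maybe Lang)
  deriving (Show, Eq)

-- | Expand the API-facing value into the concrete languages to search.
resolveLangs :: QueryLang -> [Lang]
resolveLangs (QueryLang Nothing)  = [minBound .. maxBound]
resolveLangs (QueryLang (Just l)) = [l]
```

The backend would then only ever see concrete Lang values, so every one of them can be mapped to a stemming algorithm.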
@anoe What do you think?