haskell-gargantext

Closed
Open
Opened Mar 04, 2024 by Alfredo Di Napoli (@AlfredoDiNapoli)

Coherent Stemming interface

Stepping stone towards fixing purescript-gargantext#633 (closed)

To summarise the context: we would like better control over our queries. At the moment, searches like "postpartum" on corpus documents return 0 results, either because Postgres' built-in full-text-search stemming isn't enough or because we stem with the "wrong" algorithm for the query at hand (e.g. Porter vs. Lancaster). In the case of "postpartum", the Porter algorithm cannot stem it any further, whereas Lancaster would stem it to postpart. If we then use the :* prefix-matching syntax in Postgres and search for to_tsquery('postpart:*'), we would get results.
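To make the prefix-matching idea concrete, here is a minimal sketch of how a stemmed term could be turned into the string handed to to_tsquery. The name toPrefixQuery is hypothetical and not part of the codebase, and the sketch assumes the stem contains no tsquery operators, so no escaping is attempted.

```haskell
import qualified Data.Text as T

-- | Hypothetical helper: turn an already-stemmed term into a
-- prefix-match query string suitable for Postgres' to_tsquery.
-- Assumption: the stem contains no tsquery operators, so nothing is escaped.
toPrefixQuery :: T.Text -> T.Text
toPrefixQuery stemmed = stemmed <> T.pack ":*"
```

With the Lancaster stem from the example above, toPrefixQuery (T.pack "postpart") yields the string postpart:*.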

This ticket outlines a direction of travel. At the moment our stemming interface is a bit all over the place:

  1. We have a Gargantext.Core.Text.Terms.Mono.Stem module which exposes a stem function, but this function delegates to the stem function from the stemmer package. That package is deprecated in favour of snowball, which uses a C library (and implements the Porter algorithm), but it segfaulted when I tried to use it;

  2. We have a Porter implementation sitting at Gargantext.Core.Text.Terms.Mono.Stem.En, but it is used ad hoc, for example as part of Gargantext.Database.Action.Search.searchInCorpus instead of the "main" interface;

  3. We have poor support for languages, partly because our Lang type includes an All data constructor, which makes it awkward to define a total mapping between a Lang and a stemming algorithm;

  4. We might want to pick a different algorithm for different contexts, for example we might want to have an "expert view" in our corpus search and run searches with different stemming strategies, and compare the results.

Proposal

My proposal is as follows:

  • Let's refactor Gargantext.Core.Text.Terms.Mono.Stem so that it exposes a single, nicely encapsulated abstract function:
stem :: Lang -> StemmingAlgorithm -> T.Text -> T.Text

...

data StemmingAlgorithm
  = -- | Use the 'porter' implementation for gargantext.
    Porter
    -- | Use the 'stemmer' implementation from the 'stemmer' package.
  | Stemmer
    -- | Use Lancaster stemming.
  | Lancaster

This means that every request to stem a word needs to spell out, concretely:

a. The language; b. The algorithm.
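As a rough sketch of what that single entry point could look like, assuming stand-in types: the Lang constructors and the three *Stub backends below are hypothetical placeholders, not the real Gargantext code and not real stemmers. The Lancaster stub only reproduces the "postpartum" example from this ticket. compareStems is also a hypothetical name, hinting at the "expert view" from point 4.

```haskell
import Data.Maybe (fromMaybe)
import qualified Data.Text as T

-- Hypothetical stand-in for the real 'Lang' type (constructors are assumptions).
data Lang = EN | FR
  deriving (Show, Eq, Enum, Bounded)

data StemmingAlgorithm
  = Porter
  | Stemmer
  | Lancaster
  deriving (Show, Eq, Enum, Bounded)

-- | Single entry point: callers always name both the language and the algorithm.
stem :: Lang -> StemmingAlgorithm -> T.Text -> T.Text
stem lang algo = case algo of
  Porter    -> porterStub lang
  Stemmer   -> stemmerStub lang
  Lancaster -> lancasterStub lang

-- Placeholder backends, NOT real stemmers: they only lowercase the input,
-- and the Lancaster stub additionally strips a trailing "um" so that it
-- mirrors the single example discussed in this ticket.
porterStub, stemmerStub, lancasterStub :: Lang -> T.Text -> T.Text
porterStub _ = T.toLower
stemmerStub _ = T.toLower
lancasterStub _ w =
  let lower = T.toLower w
  in fromMaybe lower (T.stripSuffix (T.pack "um") lower)

-- | Run every algorithm on the same word, e.g. for a side-by-side comparison.
compareStems :: Lang -> T.Text -> [(StemmingAlgorithm, T.Text)]
compareStems lang w = [ (algo, stem lang algo w) | algo <- [minBound .. maxBound] ]
```

Because the case expression covers every constructor, adding a new StemmingAlgorithm makes GHC warn (with -Wincomplete-patterns) until stem handles it.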

If we want, we could have a helper function that defaults to our built-in Porter algorithm if the language is English, and switches to one of the stemmer algorithms for other languages.
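Such a helper might look roughly like the sketch below. The names defaultAlgorithm and stemDefault, and the Lang constructors, are assumptions for illustration. The stemming function is passed in as a parameter so that the snippet stays self-contained.

```haskell
import qualified Data.Text as T

-- Hypothetical stand-in for the real 'Lang' type (constructors are assumptions).
data Lang = EN | FR | DE
  deriving (Show, Eq, Enum, Bounded)

data StemmingAlgorithm = Porter | Stemmer | Lancaster
  deriving (Show, Eq)

-- | Total mapping from language to default algorithm. Every constructor is
-- listed explicitly (no wildcard), so a newly added language is flagged
-- by the compiler's incomplete-pattern warning.
defaultAlgorithm :: Lang -> StemmingAlgorithm
defaultAlgorithm EN = Porter
defaultAlgorithm FR = Stemmer
defaultAlgorithm DE = Stemmer

-- | Stem with the default algorithm for the given language.
-- The first argument stands in for the proposed 'stem' function.
stemDefault
  :: (Lang -> StemmingAlgorithm -> T.Text -> T.Text)
  -> Lang -> T.Text -> T.Text
stemDefault stemFn lang = stemFn lang (defaultAlgorithm lang)
```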

I would suggest refactoring the Lang type to get rid of the All constructor. Since Lang already has Enum and Bounded instances, we can recover the All semantics by doing:

allLangs :: [Lang]
allLangs = [minBound .. maxBound]

If All is needed as input for a query that we perform from the backend, then I would suggest that we create newtype wrappers, so that what we use in the frontend is decoupled from the concrete backend type we end up working with. Getting rid of All means that we can map each language precisely to a particular stemming algorithm. It would also solve the problem that we currently assume most of the document corpus is in English, and therefore fail to use the correct stemming algorithm where we should.
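One possible shape for such a wrapper, as a sketch only: QueryLang and resolveLangs are hypothetical names, the Lang constructors are assumptions, and encoding "all languages" as Nothing is just one option among several.

```haskell
-- Hypothetical stand-in for the real 'Lang' type, without an 'All' constructor.
data Lang = EN | FR | DE
  deriving (Show, Eq, Enum, Bounded)

-- | Every language, recovered from the Enum and Bounded instances.
allLangs :: [Lang]
allLangs = [minBound .. maxBound]

-- | Hypothetical API-level wrapper: 'Nothing' means "all languages",
-- so the core 'Lang' type never needs to represent that case itself.
newtype QueryLang = QueryLang (Maybe Lang)
  deriving (Show, Eq)

-- | Expand the API-level value into the concrete languages to query.
resolveLangs :: QueryLang -> [Lang]
resolveLangs (QueryLang Nothing)     = allLangs
resolveLangs (QueryLang (Just lang)) = [lang]
```

The "all languages" case is resolved once at the API boundary, and everything downstream works with concrete Lang values that each map to a specific stemming algorithm.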

@anoe What do you think?

Reference: gargantext/haskell-gargantext#324