Uniform ngrams creation (#224) · Issues · gargantext / haskell-gargantext

Uniform ngrams creation

Currently we tokenize in multiple ways:

1. We tokenize with CoreNLP/spaCy etc.
2. We use PostgresFTS for some searches and to hint the highlighter (for performance reasons).
3. The highlighter tokenizes it on the frontend.
4. We have a custom stemmer on docs search and it's the English one, irrespective of the document language (`G.C.T.T.M.S.En -> stemIt`). This one is used in doc search (`G.D.A.Search -> searchInCorpus`).

Algorithms 1-3 will never be the same. 3 is the most brittle: it doesn't include any language information whatsoever and should be replaced with some kind of highlighter. In this issue I added 2, but it is part of a GraphQL query which, if needed, can be replaced with some other algorithm without affecting the frontend.

The highlighting should be moved to the backend. That removes step 3, and we can then somehow merge 1 and 2 on the backend. How the highlighting is done doesn't matter for the frontend: this would be another GraphQL endpoint that returns the tokenization for a given text, that's all.
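A minimal sketch of what such a tokenization result could look like, so that the frontend only needs to paint character ranges. The names `Token` and `tokenizeWithOffsets` are hypothetical, and the naive alphanumeric splitter is only a stand-in for whatever tokenizer the backend ends up using (CoreNLP, spaCy, PostgresFTS or a merged one):

```haskell
import Data.Char (isAlphaNum)

-- | A token plus its character offsets in the original text
-- (start inclusive, end exclusive). Hypothetical type, for illustration only.
data Token = Token
  { tokenText  :: String
  , tokenStart :: Int
  , tokenEnd   :: Int
  } deriving (Eq, Show)

-- | Naive stand-in tokenizer: maximal runs of alphanumeric characters.
-- A real backend would delegate to its NLP tokenizer here.
tokenizeWithOffsets :: String -> [Token]
tokenizeWithOffsets = go 0
  where
    go _ [] = []
    go i s@(c:cs)
      | isAlphaNum c =
          let (w, rest) = span isAlphaNum s
              j         = i + length w
          in Token w i j : go j rest
      | otherwise = go (i + 1) cs
```

Returning offsets rather than bare strings means the frontend never has to re-tokenize the text to find where each term sits.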

It is worth noting that PostgreSQL, apart from full-text search, also supports trigrams (the pg_trgm extension) with a database index, so queries of the form LIKE '%abc%' can be performant: https://www.postgresql.org/docs/current/pgtrgm.html
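To make the trigram idea concrete, here is a small sketch that approximates how the pg_trgm documentation describes trigram extraction: words are lowercased, non-alphanumeric characters separate words, and each word is padded with two leading spaces and one trailing space. This is an illustration only, not pg_trgm's implementation; the function names are made up here, and details such as locale handling are ignored.

```haskell
import Data.Char (isAlphaNum, toLower)
import Data.List (nub, sort)

-- | Lowercased alphanumeric words; everything else acts as a separator.
wordsAlnum :: String -> [String]
wordsAlnum s = case dropWhile (not . isAlphaNum) s of
  "" -> []
  s' -> let (w, rest) = span isAlphaNum s'
        in map toLower w : wordsAlnum rest

-- | All consecutive 3-character windows of a string.
windows3 :: String -> [String]
windows3 (a:b:c:rest) = [a, b, c] : windows3 (b : c : rest)
windows3 _            = []

-- | Sorted, de-duplicated trigrams of a text, padding each word
-- with two leading spaces and one trailing space.
trigrams :: String -> [String]
trigrams = sort . nub . concatMap (\w -> windows3 ("  " ++ w ++ " ")) . wordsAlnum

-- | Share of trigrams two texts have in common (shared / union).
similarity :: String -> String -> Double
similarity a b
  | null unionT = 0
  | otherwise   = fromIntegral (length shared) / fromIntegral (length unionT)
  where
    ta     = trigrams a
    tb     = trigrams b
    shared = filter (`elem` tb) ta
    unionT = nub (ta ++ tb)
```

A trigram index lets the database narrow a substring search to rows that contain the trigrams of the search term, instead of scanning every row.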

Edited Feb 19, 2024 by Przemyslaw Kaminski