Uniform ngrams creation
Currently we tokenize in multiple ways:
1. We tokenize with CoreNLP/spaCy etc.
2. We use Postgres FTS for some searches and to hint the highlighter (for performance reasons).
3. The highlighter tokenizes the text on the frontend.
4. We have a custom stemmer for doc search, and it is always the English one, irrespective of the document language (G.C.T.T.M.S.En -> stemIt). It is used in doc search (G.D.A.Search -> searchInCorpus).
Algorithms 1-3 will never produce the same tokenization. 3 is the most brittle: it doesn't include any language information whatsoever and should be replaced with some kind of highlighter. In this issue I added 2, but it is part of a GraphQL query which, if needed, can be replaced with some other algorithm without affecting the frontend.
The highlighting should be moved to the backend. That removes step 3, and 1 and 2 can then be merged on the backend in some way. How the highlighting is done doesn't matter to the frontend: this would be another GraphQL endpoint that returns the tokenization for a given text, that's all.
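To make the idea of a tokenization endpoint more concrete, here is a minimal Python sketch of what the backend could return and how a frontend could consume it. Everything in it is an assumption made for illustration: the names `Token`, `tokenize_with_offsets` and `highlight` are hypothetical, and the naive regex only stands in for whatever tokenizer the backend ends up using. The point is just that the frontend receives character offsets and never tokenizes anything itself.

```python
import re
from dataclasses import dataclass
from typing import List, Set


@dataclass(frozen=True)
class Token:
    text: str   # the token as it appears in the input
    start: int  # offset of the first character (inclusive)
    end: int    # offset just past the last character (exclusive)


def tokenize_with_offsets(text: str) -> List[Token]:
    """Hypothetical backend side: return tokens with character offsets.

    The regex is a placeholder for the real, language-aware tokenizer.
    """
    return [Token(m.group(), m.start(), m.end())
            for m in re.finditer(r"\w+", text)]


def highlight(text: str, tokens: List[Token], terms: Set[str]) -> str:
    """Hypothetical frontend side: wrap matching tokens in [ ].

    Uses only the offsets from the response; no tokenization happens here.
    """
    out, pos = [], 0
    for tok in tokens:
        if tok.text.lower() in terms:
            out.append(text[pos:tok.start])  # untouched text before the match
            out.append("[" + tok.text + "]")
            pos = tok.end
    out.append(text[pos:])  # remainder after the last match
    return "".join(out)
```

With this shape, swapping the tokenizer on the backend changes the offsets that are returned but not the frontend code, which is the decoupling argued for above.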
It is worth noting that PostgreSQL, apart from full-text search, also has trigrams with index support in the DB, so one can have performant queries of the form LIKE '%abc%':
https://www.postgresql.org/docs/current/pgtrgm.html
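For intuition about why a trigram index can serve such queries, the following Python sketch roughly mimics how pg_trgm splits text into trigrams. It is a simplification, not the extension's actual code: it handles only ASCII alphanumerics, and the function names `trigrams` and `may_contain` are hypothetical. According to the pg_trgm documentation, each word is treated as if it had two spaces prefixed and one space suffixed, and an index search extracts the trigrams of the search string and looks them up in the index.

```python
import re
from typing import Set


def trigrams(text: str) -> Set[str]:
    """Approximate pg_trgm trigram extraction for ASCII text."""
    result = set()
    for word in re.findall(r"[a-z0-9]+", text.lower()):
        padded = "  " + word + " "  # two spaces before, one after
        for i in range(len(padded) - 2):
            result.add(padded[i:i + 3])
    return result


def may_contain(row: str, substring: str) -> bool:
    """Cheap pre-filter for LIKE '%substring%' (single-word substring only).

    The wildcards on both sides mean the pattern contributes only its inner
    trigrams, without padding. A row can match only if it contains all of
    them; a real query still rechecks the surviving candidates.
    """
    s = substring.lower()
    needed = {s[i:i + 3] for i in range(len(s) - 2)}
    return needed <= trigrams(row)
```

For example, `trigrams("cat")` yields the four trigrams `"  c"`, `" ca"`, `"cat"` and `"at "`, matching the example in the pg_trgm documentation. In the database itself no such code is needed: a GIN or GiST index using the extension's trigram operator class performs this filtering.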