
Opened Jun 07, 2023 by Przemyslaw Kaminski (@cgenie)

Uniform ngrams creation

Currently we tokenize in multiple ways:

  1. We tokenize with CoreNLP, spaCy, etc.
  2. We use PostgreSQL full-text search (FTS) for some searches and to hint the highlighter (for performance reasons).
  3. The highlighter tokenizes the text on the frontend.
  4. We have a custom stemmer for document search, and it is always the English one, irrespective of the document language (G.C.T.T.M.S.En -> stemIt). It is used in doc search (G.D.A.Search -> searchInCorpus).

Algorithms 1-3 will never produce the same tokens. 3 is the most brittle: it doesn't include any language information whatsoever and should be replaced with some other kind of highlighter. In this issue I added 2., but it is part of a GraphQL query which, if needed, can be replaced with some other algorithm without affecting the frontend.
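
To make the mismatch concrete, here is a minimal Python sketch (not Gargantext code; both functions are hypothetical stand-ins for any two of the tokenizers listed above). Two reasonable tokenizers already disagree on a short string, so ngrams produced by one will not line up with highlights produced by the other.

```python
import re


def whitespace_tokens(text: str) -> list[str]:
    """Naive tokenizer: split on whitespace only, punctuation stays attached."""
    return text.split()


def word_tokens(text: str) -> list[str]:
    """Regex tokenizer: lowercased runs of word characters, punctuation dropped."""
    return re.findall(r"\w+", text.lower())


sample = "Text-mining, e.g. n-grams"
print(whitespace_tokens(sample))  # hyphenated words and "e.g." stay whole
print(word_tokens(sample))        # the same words are split into pieces
```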

The highlighting should be moved to the backend. That removes step 3, and we can then somehow merge 1 and 2 on the backend. How the highlighting is done doesn't matter for the frontend: this would be another GraphQL endpoint that returns the tokenization for a given text, that's all.
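
As a rough sketch of what such an endpoint could return, the Python below models the response as a list of character-offset spans. Everything here is an assumption for illustration: `TokenSpan`, `tokenize_with_offsets`, and `highlight` are hypothetical names, and the regex tokenizer is only a placeholder for whatever language-aware algorithm the backend ends up using.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class TokenSpan:
    start: int  # inclusive character offset into the original text
    end: int    # exclusive character offset
    text: str   # the exact substring text[start:end]


def tokenize_with_offsets(text: str) -> list[TokenSpan]:
    """Placeholder for the backend tokenizer (hypothetical, not language-aware)."""
    return [TokenSpan(m.start(), m.end(), m.group()) for m in re.finditer(r"\w+", text)]


def highlight(text: str, spans: list[TokenSpan], terms: set[str]) -> str:
    """What the frontend could do: wrap matching spans, with no tokenizing of its own."""
    out, cursor = [], 0
    for span in spans:
        if span.text.lower() in terms:
            out.append(text[cursor:span.start])
            out.append("[" + text[span.start:span.end] + "]")
            cursor = span.end
    out.append(text[cursor:])
    return "".join(out)


doc = "Uniform ngrams creation"
spans = tokenize_with_offsets(doc)
print(highlight(doc, spans, {"ngrams"}))
```

Because the frontend only consumes offsets, the backend tokenizer can be swapped without any frontend change, which is the property the paragraph above asks for.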

It is worth noting that PostgreSQL, apart from full-text search, also supports trigram matching (pg_trgm) backed by a database index, so one can have performant queries of the form LIKE '%abc%': https://www.postgresql.org/docs/current/pgtrgm.html
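
The idea behind a trigram index can be sketched in a few lines of Python. This is a simplified illustration, not pg_trgm's actual algorithm: it uses plain 3-character windows without pg_trgm's word splitting and padding, it matches case-insensitively (closer to ILIKE than LIKE), and `build_index` and `search_substring` are hypothetical names.

```python
from collections import defaultdict


def trigrams(s: str) -> set[str]:
    """All 3-character windows of the lowercased string (empty if too short)."""
    s = s.lower()
    return {s[i:i + 3] for i in range(len(s) - 2)}


def build_index(rows: dict[int, str]) -> dict[str, set[int]]:
    """Inverted index: trigram -> ids of the rows containing it."""
    index: dict[str, set[int]] = defaultdict(set)
    for row_id, text in rows.items():
        for tri in trigrams(text):
            index[tri].add(row_id)
    return index


def search_substring(rows: dict[int, str], index: dict[str, set[int]], pattern: str) -> list[int]:
    """Substring search: narrow candidates via the index, then recheck each one."""
    pattern = pattern.lower()
    tris = trigrams(pattern)
    if not tris:
        candidates = set(rows)  # pattern shorter than 3 chars: full scan
    else:
        candidates = set.intersection(*(index.get(t, set()) for t in tris))
    # The recheck is needed: sharing all trigrams does not guarantee a match.
    return sorted(r for r in candidates if pattern in rows[r].lower())


rows = {1: "abcdef", 2: "xxabcxx", 3: "abxc"}
index = build_index(rows)
print(search_substring(rows, index, "abc"))
```

Only rows that contain every trigram of the pattern are rechecked, which is why a substring query no longer has to scan the whole table.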

Edited Feb 19, 2024 by Przemyslaw Kaminski
Milestone: Epic 0.0.7
Reference: gargantext/haskell-gargantext#224