Opened May 22, 2025 by Alfredo Di Napoli (@AlfredoDiNapoli)

Ngrams extraction is a bit iffy

While working on #466, I stumbled upon something a bit iffy regarding the extraction of the ngrams.

In particular, #466 forced me to try and reason about the transactional properties of our DB operations, and we have quite a lot of code inside Database.Action.Flow that does things which are a bit problematic. Looking at insertMasterDocs, I have added the following commentary:

-- FIXME(adn): the use of 'extractNgramsT' is iffy and problematic -- we shouldn't
-- be contacting the NLP server in the middle of some DB ops! we should extract
-- the tokens /before/ inserting things into the DB.

The purpose of this function is to insert documents into the database, which also entails extracting and persisting the tokens synthesised from them. However, the ngrams extraction happens in the middle of what is morally a DB transaction, and it does so by making an HTTP request to the NLP server(!).

This is problematic: because we are (for very good reasons) strict about the effects we can perform inside a DB transaction, we have no choice but to break what would otherwise be a perfectly fine atomic DB transaction into several parts: one that creates the parent nodes, another that does the ngrams extraction, and another that saves the documents with the extracted ngrams. If an exception strikes in the middle (the glaring example being the NLP server not responding), we get no DB rollback! We are left with a dangling empty corpus node without the ngrams, instead of rolling back to a clean state.
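To make the failure mode concrete, the current shape is roughly the following (all names here are illustrative stand-ins, not the real gargantext API -- the point is only where the NLP call sits relative to the transactions):

```haskell
-- Hypothetical sketch of the current, problematic shape: the NLP call
-- sits *between* two independent DB transactions, so if it fails, the
-- first transaction has already committed and cannot be rolled back.
insertMasterDocs docs = do
  parentIds <- runDBTx (createParentNodes docs)    -- transaction 1: commits
  ngrams    <- extractNgramsT docs                 -- HTTP call to the NLP server(!)
  runDBTx (saveDocsWithNgrams parentIds ngrams)    -- transaction 2: may never run
```

If `extractNgramsT` throws, transaction 1 is already committed and the parent nodes are left dangling with no ngrams attached.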

Looking at the whole of insertMasterDocs, it is not clear to me why we rely on the documents having their Id attached in order to do ngrams extraction -- in my simpleton mind, all we need to extract the terms is to pass the NLP server some notion of text, something we can extract from the documents from the get-go. If we do that, we can extract the terms before calling this function, which can then work as a single atomic DB transaction.

This ticket is about refactoring insertMasterDocs and friends to achieve the above.
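A minimal sketch of the intended shape, again with illustrative names rather than the real API: hoist the NLP call out of the function entirely, and make the DB work one transaction.

```haskell
-- Hypothetical sketch of the proposed refactor: extract the terms up
-- front (plain text is enough, no Ids needed), then run all DB work
-- inside a single atomic transaction that rolls back as a unit.
insertMasterDocs docs = do
  ngrams <- extractNgramsT (map docText docs)   -- NLP call before any DB work
  runDBTx $ do                                  -- one atomic transaction
    parentIds <- createParentNodes docs
    saveDocsWithNgrams parentIds ngrams         -- fails/rolls back together
```

With this shape, a failing NLP server aborts the whole operation before anything touches the database, and any DB-side failure rolls everything back to a clean state.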

Milestone: Stabilisation
Labels: Doing
Reference: gargantext/haskell-gargantext#473