Separate ngram extraction from document insertion
Fixes #473.
This MR refactors the code around `insertMasterDocs` to separate ngrams generation from document creation. Before this MR, ngram extraction happened in the middle of `insertMasterDocs`, meaning we had to contact the NLP server in the middle of what could otherwise have been a perfectly atomic DB transaction. This risked leaving the system in an inconsistent state, i.e. a document inserted without any ngrams (effectively leaving the flow incomplete).
This MR fixes that by splitting the process in two parts: first we generate the ngrams, storing them in a `Map` indexed by `DocumentHashId`; later we match every `Node` created with the previously generated ngrams. This last step can happen in a pure fashion (it's just a map lookup), so it can be embedded safely inside `insertMasterDocs`, which is now a single `DBUpdate`.
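To illustrate the shape of the two-phase flow, here is a rough Haskell sketch. Everything except `DocumentHashId` and the idea of a single `DBUpdate` is a placeholder: `Doc`, `Ngrams`, `callNLPServer`, and `docHash` are hypothetical stand-ins for the real project types and functions, not the actual API.

```haskell
{-# LANGUAGE TupleSections #-}
import           Data.Map.Strict (Map)
import qualified Data.Map.Strict as Map

-- Placeholder types; the real code has its own document/ngrams types.
type Doc    = String
type Ngrams = [String]
newtype DocumentHashId = DocumentHashId String deriving (Eq, Ord)

-- Stub for the NLP call; in the real code this contacts the NLP server.
callNLPServer :: Doc -> IO Ngrams
callNLPServer = pure . words

-- Placeholder hashing.
docHash :: Doc -> DocumentHashId
docHash = DocumentHashId

-- Phase 1 (impure): talk to the NLP server *before* any DB transaction,
-- accumulating the results in a Map keyed by DocumentHashId.
extractNgrams :: [Doc] -> IO (Map DocumentHashId Ngrams)
extractNgrams docs =
  Map.fromList <$> traverse (\d -> (docHash d,) <$> callNLPServer d) docs

-- Phase 2 (pure): matching a created Node to its ngrams is just a map
-- lookup, so it can run safely inside the single DBUpdate transaction.
lookupNgrams :: Map DocumentHashId Ngrams -> DocumentHashId -> Maybe Ngrams
lookupNgrams = flip Map.lookup
```

Because phase 2 is pure, the DB transaction never blocks on the network, and a failure in the NLP call simply aborts before anything is written.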
@cgenie I don't think my work necessarily conflicts with yours, but I did some refactoring around typeclasses like `UniqParameters` & friends, as they were a bit iffy. Please have a look and flag anything that might create problems on your side (this is still work in progress, btw).