Ngrams extraction is a bit iffy (#473) · Issues · gargantext / haskell-gargantext

Ngrams extraction is a bit iffy

While working on https://gitlab.iscpif.fr/gargantext/haskell-gargantext/issues/466, I stumbled upon something a bit iffy regarding the extraction of the ngrams.

In particular, #466 forced me to try and reason about the transactional properties of our DB operations, and we have quite a lot of code inside `Database.Action.Flow` that does things which are a bit problematic. If we look at [insertMasterDocs](https://gitlab.iscpif.fr/gargantext/haskell-gargantext/blob/dev/src/Gargantext/Database/Action/Flow.hs#L417) I have added the following commentary:

```hs
-- FIXME(adn): the use of 'extractNgramsT' is iffy and problematic -- we shouldn't
-- be contacting the NLP server in the middle of some DB ops! we should extract
-- the tokens /before/ inserting things into the DB.
```

The purpose of this function is to insert documents into the database, which also entails extracting and persisting the tokens synthesised out of them. However, the extraction of the tokens happens _in the middle_ of what is morally a DB transaction, and it happens by doing an HTTP request to the NLP server(!).

This is **problematic**: because we are [strict in the effects](https://gitlab.iscpif.fr/gargantext/haskell-gargantext/blob/dev/src/Gargantext/Database/Transactional.hs#L62) we can perform inside a DB transaction (for very good reasons), we have no choice but to break what would otherwise be a perfectly fine atomic DB transaction into a few parts: one that creates the parent nodes, another that does the ngrams extraction, and another that saves the documents with the extracted ngrams. If an exception strikes in the middle (the glaring example being the NLP server not responding), we won't get a DB rollback! We are left with a dangling empty corpus node without the ngrams, rather than rolling back to a clean state.
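To make the failure mode concrete, here is a toy sketch of the split flow. Everything in it is a hypothetical stand-in and not Gargantext code: the "database" is an `IORef` holding a list of rows, `runTx` plays the role of one separately committed transaction, and `extractNgramsViaHttp` simulates an NLP server that is down.

```haskell
import Control.Exception (ErrorCall (..), SomeException, throwIO, try)
import Data.IORef (IORef, modifyIORef', newIORef, readIORef)

-- Toy database: the rows written so far (hypothetical, not the real schema).
type DB = IORef [String]

-- Each call stands in for its own, separately committed DB transaction.
runTx :: DB -> String -> IO ()
runTx db row = modifyIORef' db (++ [row])

-- Stand-in for the HTTP request to the NLP server; here it is always down.
extractNgramsViaHttp :: String -> IO [String]
extractNgramsViaHttp _ = throwIO (ErrorCall "NLP server not responding")

-- The multi-part flow: commit the parent node, extract, then save the document.
insertDocsSplit :: DB -> String -> IO ()
insertDocsSplit db doc = do
  runTx db "corpus node"                      -- first transaction: committed
  ngrams <- extractNgramsViaHttp doc          -- plain IO, outside any transaction
  runTx db ("document + " ++ unwords ngrams)  -- second transaction: never reached

main :: IO ()
main = do
  db <- newIORef []
  result <- try (insertDocsSplit db "some text") :: IO (Either SomeException ())
  rows <- readIORef db
  putStrLn (either (const "failed") (const "ok") result)  -- prints: failed
  print rows                                              -- prints: ["corpus node"]
```

The first write survives the exception, which is the dangling corpus node in miniature.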

If we look at the whole `insertMasterDocs`, it is not clear to me why we rely on the documents with their Id attached in order to do the ngrams extraction -- in my simpleton mind, all we need to extract the terms is to pass the NLP server some notion of text, something we can get from the documents right from the start. If we do that, then we can extract the terms _before_ calling this function, which can then work as an atomic DB transaction.
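A rough sketch of the shape I have in mind, under the assumption that the text of a document really is enough to extract its ngrams. All names here (`DBTx`, `runDBTx`, `extractNgrams`, `insertDocsTx`, `insertDocs`) are made up for illustration and are not the actual Gargantext API; the toy transaction is a pure function over an in-memory list, so it cannot do HTTP by construction.

```haskell
import Control.Exception (ErrorCall (..), SomeException, throwIO, try)
import Data.Char (toLower)
import Data.IORef (IORef, atomicModifyIORef', newIORef, readIORef)

type Ngrams = [String]
type Rows = [String]

-- A toy transaction: a pure function over the DB state, so no IO can sneak in.
newtype DBTx = DBTx (Rows -> Rows)

-- Apply the whole transaction in a single atomic step.
runDBTx :: IORef Rows -> DBTx -> IO ()
runDBTx db (DBTx f) = atomicModifyIORef' db (\rows -> (f rows, ()))

-- Hypothetical NLP call; the Bool simulates whether the server is reachable.
extractNgrams :: Bool -> String -> IO Ngrams
extractNgrams serverUp txt
  | serverUp = pure (words (map toLower txt))
  | otherwise = throwIO (ErrorCall "NLP server not responding")

-- Parent node and documents are written together, with ngrams already in hand.
insertDocsTx :: [(String, Ngrams)] -> DBTx
insertDocsTx docs = DBTx $ \rows ->
  rows ++ ["corpus node"] ++ [d ++ " -> " ++ unwords ns | (d, ns) <- docs]

-- Extract first (this may fail), then run one atomic transaction.
insertDocs :: IORef Rows -> Bool -> [String] -> IO ()
insertDocs db serverUp docs = do
  extracted <- mapM (\d -> (,) d <$> extractNgrams serverUp d) docs
  runDBTx db (insertDocsTx extracted)

main :: IO ()
main = do
  db <- newIORef []
  _ <- try (insertDocs db False ["Hello World"]) :: IO (Either SomeException ())
  readIORef db >>= print  -- prints: []
  insertDocs db True ["Hello World"]
  readIORef db >>= print  -- prints: ["corpus node","Hello World -> hello world"]
```

With the server down nothing is written at all, because the failure happens before the transaction starts; no rollback is even needed in that case.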

This ticket is about refactoring `insertMasterDocs` and friends to achieve the above.