Separate ngram extraction from document insertion
Fixes #473.
This MR refactors the code around `insertMasterDocs` to separate ngrams generation from document creation. Before this MR, ngram extraction happened in the middle of `insertMasterDocs`, meaning we had to contact the NLP server in the middle of what could otherwise have been a perfectly atomic DB transaction. This risked leaving the system in an inconsistent state, i.e. a document inserted without any ngrams (effectively leaving the flow incomplete).
This MR fixes that by splitting the process in two parts: first we generate the ngrams, storing them in a `Map` indexed by `DocumentHashId`; later we match every `Node` created with the previously generated ngrams. This last step can happen in a pure fashion (it's just a map lookup), so it can be embedded safely inside `insertMasterDocs`, which is now a single `DBUpdate`.
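To illustrate the shape of the two-phase flow, here is a rough Haskell sketch. Everything except `DocumentHashId` and the idea of a single `DBUpdate` is a placeholder: `Doc`, `Ngrams`, `callNLPServer`, and `docHash` are hypothetical stand-ins for the real project types and functions, not the actual API.

```haskell
{-# LANGUAGE TupleSections #-}
import           Data.Map.Strict (Map)
import qualified Data.Map.Strict as Map

-- Placeholder types; the real code has its own document/ngrams types.
type Doc    = String
type Ngrams = [String]
newtype DocumentHashId = DocumentHashId String deriving (Eq, Ord)

-- Stub for the NLP call; in the real code this contacts the NLP server.
callNLPServer :: Doc -> IO Ngrams
callNLPServer = pure . words

-- Placeholder hashing.
docHash :: Doc -> DocumentHashId
docHash = DocumentHashId

-- Phase 1 (impure): talk to the NLP server *before* any DB transaction,
-- accumulating the results in a Map keyed by DocumentHashId.
extractNgrams :: [Doc] -> IO (Map DocumentHashId Ngrams)
extractNgrams docs =
  Map.fromList <$> traverse (\d -> (docHash d,) <$> callNLPServer d) docs

-- Phase 2 (pure): matching a created Node to its ngrams is just a map
-- lookup, so it can run safely inside the single DBUpdate transaction.
lookupNgrams :: Map DocumentHashId Ngrams -> DocumentHashId -> Maybe Ngrams
lookupNgrams = flip Map.lookup
```

Because phase 2 is pure, the DB transaction never blocks on the network, and a failure in the NLP call simply aborts before anything is written.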
@cgenie I don't think my work necessarily conflicts with yours, but I did some refactoring around typeclasses like `UniqParameters` & friends, as they were a bit iffy. Please have a look and flag anything that might create problems on your side (this is still work in progress, btw).