A tree data structure for rich text, shared between the frontend and backend
While working on the blocknotejs integration, I came to believe that most of the complexity of gargantext lies in the table schema, which was designed as a surrogate for an actual tree data structure representing rich text with annotations.
When a document is ingested in garg, it is stored as lossy flat text. This flat text is then enriched by the NLP machinery and by context window definitions, so the text is no longer flat. This additional depth is persisted in the database as a static tree that we reconstruct through joins.
Then we want to display something that users can interact with. We add highlights, contexts as boxes, and tables displaying ngrams. Additional information (such as graph parameters) is stored locally or in memory in the frontend. When there is an interaction, we want to update the tree that we store in the database. Collaborative features are hindered by this architecture: I update something on the frontend, then I need to click somewhere to update the database, then others need to refresh the page. Concurrent edits are not handled (the last to update the db wins). Even with websockets and automatic refetch, conflict resolution will be messy.
The proposal: instead of thinking that texts are flat, let's admit that they are already rich and structured. They have paragraphs, styles, titles, metadata, and a history of updates. And we want to enrich them even more. Some groups of words are important, some are not. We want to add POS and NER tagging, labels, etc., all of that collaboratively, with automatic conflict resolution and syncing between clients.
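To make the idea concrete, here is a minimal sketch of what such a tree could look like. All type and field names (RichDoc, RichBlock, RichSpan, Mark) are hypothetical and are not the blocknotejs schema; the history of updates is left out because it would presumably come from the sync layer. The point is only that the flat text becomes a projection of the tree rather than the source of truth.

```typescript
// Hypothetical annotation layers attached to a run of text.
type Mark =
  | { kind: "style"; value: "bold" | "italic" }
  | { kind: "ngram"; term: string }
  | { kind: "pos"; tag: string }
  | { kind: "ner"; entity: string };

// A run of text carrying zero or more annotations.
type RichSpan = { text: string; marks: Mark[] };

// A structural unit of the document (title, paragraph).
type RichBlock = { id: string; type: "title" | "paragraph"; children: RichSpan[] };

// The whole document: metadata plus an ordered list of blocks.
type RichDoc = { metadata: Record<string, string>; blocks: RichBlock[] };

const doc: RichDoc = {
  metadata: { source: "example", lang: "en" },
  blocks: [
    {
      id: "b1",
      type: "title",
      children: [{ text: "Carbon tax", marks: [{ kind: "ngram", term: "carbon tax" }] }],
    },
    {
      id: "b2",
      type: "paragraph",
      children: [
        { text: "France", marks: [{ kind: "pos", tag: "PROPN" }, { kind: "ner", entity: "LOC" }] },
        { text: " debates a ", marks: [] },
        {
          text: "carbon tax",
          marks: [{ kind: "ngram", term: "carbon tax" }, { kind: "style", value: "bold" }],
        },
        { text: ".", marks: [] },
      ],
    },
  ],
};

// The flat text is one view among others, obtained by dropping every annotation.
function plainText(d: RichDoc): string {
  return d.blocks
    .map((block) => block.children.map((span) => span.text).join(""))
    .join("\n");
}
```

Calling plainText(doc) gives back the two lines of text without any marks, which is roughly what we store today at ingestion time.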
Within this paradigm, the garg server is just a user like the others, one that might batch-update the rich-text data structure. Conflict resolution is handled via CRDT merges.
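As an illustration of what a CRDT merge buys us, here is a toy state-based add-wins set (an observed-remove set) of curated ngrams. It is a sketch under my own assumptions, not a proposal for the actual implementation: in practice we would most likely rely on an existing CRDT library such as Yjs rather than write this ourselves, and the names (NgramSetState, addNgram, removeNgram, mergeStates, visibleNgrams) are hypothetical.

```typescript
// Each add gets a unique tag; a remove only tombstones the tags it has seen.
type NgramSetState = {
  added: Record<string, string>; // unique tag -> ngram
  removed: Record<string, true>; // tags whose add was observed, then removed
};

const emptySet: NgramSetState = { added: {}, removed: {} };

// Add an ngram under a tag that is unique per replica (replica id + sequence number).
function addNgram(s: NgramSetState, replica: string, seq: number, ngram: string): NgramSetState {
  const tag = `${replica}:${seq}`;
  return { added: { ...s.added, [tag]: ngram }, removed: { ...s.removed } };
}

// Remove an ngram by tombstoning every add of it that this replica knows about.
function removeNgram(s: NgramSetState, ngram: string): NgramSetState {
  const removed: Record<string, true> = { ...s.removed };
  for (const tag of Object.keys(s.added)) {
    if (s.added[tag] === ngram) removed[tag] = true;
  }
  return { added: { ...s.added }, removed };
}

// Merge is a plain union of both parts: commutative, associative and idempotent.
function mergeStates(a: NgramSetState, b: NgramSetState): NgramSetState {
  return {
    added: { ...a.added, ...b.added },
    removed: { ...a.removed, ...b.removed },
  };
}

// An ngram is visible if at least one of its adds has not been tombstoned.
function visibleNgrams(s: NgramSetState): string[] {
  const seen: Record<string, true> = {};
  for (const tag of Object.keys(s.added)) {
    const ngram = s.added[tag];
    if (ngram !== undefined && !s.removed[tag]) seen[ngram] = true;
  }
  return Object.keys(seen).sort();
}

// The garg server seeds the set, then two users edit concurrently.
const base = addNgram(emptySet, "garg", 1, "climate change");
const alice = removeNgram(base, "climate change");
const bob = addNgram(addNgram(base, "bob", 1, "climate change"), "bob", 2, "carbon tax");

const mergedAB = visibleNgrams(mergeStates(alice, bob));
const mergedBA = visibleNgrams(mergeStates(bob, alice));
```

Both merge orders converge to the same set, containing "carbon tax" and "climate change": the removal by alice only covers the add she had seen, so the concurrent re-add by bob survives. Nobody has to refresh a page or win a race against the database.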
And the frontend is just a set of views on the rich text. An ngrams table with occurrence counts would be a matter of folding the tree once, in O(N). Updates would be O(1). Highlights would be done in a rich-text editor. Contexts are basically the paragraphs, which we can edit.
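A minimal sketch of that fold, assuming a simplified tree where a leaf may carry an ngram label (the types and function names here are hypothetical). The O(1) update assumes the counts live in a hash-map-like table keyed by ngram, so that tagging or untagging one span only touches one or two entries.

```typescript
// Simplified tree: branches hold children, leaves hold text and an optional ngram label.
type Leaf = { kind: "leaf"; text: string; ngram?: string };
type Branch = { kind: "branch"; children: TreeNode[] };
type TreeNode = Leaf | Branch;

// Fold the whole tree once: every node is visited exactly one time, so O(N).
function countNgrams(node: TreeNode, counts: Record<string, number> = {}): Record<string, number> {
  if (node.kind === "leaf") {
    if (node.ngram !== undefined) {
      counts[node.ngram] = (counts[node.ngram] || 0) + 1;
    }
    return counts;
  }
  for (const child of node.children) {
    countNgrams(child, counts);
  }
  return counts;
}

// Incremental update when one span changes label: no need to fold the tree again.
function applyTagChange(counts: Record<string, number>, before?: string, after?: string): void {
  if (before !== undefined) {
    const remaining = (counts[before] || 0) - 1;
    if (remaining <= 0) delete counts[before];
    else counts[before] = remaining;
  }
  if (after !== undefined) {
    counts[after] = (counts[after] || 0) + 1;
  }
}

const tree: TreeNode = {
  kind: "branch",
  children: [
    {
      kind: "branch", // title
      children: [
        { kind: "leaf", text: "Carbon tax", ngram: "carbon tax" },
        { kind: "leaf", text: " debates" },
      ],
    },
    {
      kind: "branch", // paragraph
      children: [
        { kind: "leaf", text: "A " },
        { kind: "leaf", text: "carbon tax", ngram: "carbon tax" },
        { kind: "leaf", text: " affects " },
        { kind: "leaf", text: "energy prices", ngram: "energy price" },
      ],
    },
  ],
};

const ngramTable = countNgrams(tree);
```

Here ngramTable maps "carbon tax" to 2 and "energy price" to 1. If a user then untags one "carbon tax" span, applyTagChange(ngramTable, "carbon tax", undefined) brings its count down to 1 without walking the document again.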
Where do we store additional data, such as ngrams not present in the document? We can create a table within the parent document that stores this additional data.
What about the data needed to build a graph? Garg can create a table in the document that stores all the information needed. Then we display the graph based on this data. It would be a great improvement for auditability, as this table would be the blueprint for the viz.
What about community-led curation of ngrams? Curated ngrams should be defined in a .csv and willingly shared. I don't think it's a sufficient reason to have separate instances of gargantext and a complex db schema over which the user has no control (once ngrams are in the system, they are not easily deletable).
We would get many collaborative features for free, and better auditability through version history. The db schema and the frontend and backend codebases would be massively simplified, and performance would be improved. It would also be easier to make local-first software that is less tied to an internet connection and the remote database, hence better privacy and responsiveness. One consequence is that we could focus our research and work on the NLP and knowledge visualization parts.
On the other hand, it would mean that we rely more on the JS ecosystem and that the frontend gets more responsibility (but that is inevitable if we want reactivity and collaborative features).