Improve ngrams grouping and merging (#498) · Issues · gargantext / haskell-gargantext

Improve ngrams grouping and merging

Follow up to #342 (closed).

I'm turning what @davidchavalarias said in that issue into actionable points, so that I can tackle them separately:

## ngrams grouping

As far as ngrams grouping is concerned, we need to check:

- [ ] Understand whether we have to cap the maximum number of levels we can group, as it has been reported that deep grouping used to cause bugs with infinite loops, especially in the frontend. I have quickly scanned the backend but I haven't found any evidence that we have a hard limit, so this has to be investigated;
- [ ] If we have to impose a hard limit on levels, then adding the `n + 1` level should have it merged with level `n` -- I think the best we can do in terms of merging is simply to have the leaves of level `n + 1` all flattened out as children of `n - 1`, which puts them at level `n`;
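To make the flattening rule concrete, here is a minimal sketch in Python rather than in the backend's Haskell. The `Ngram` type, `descendants` and `cap_depth` are hypothetical names for illustration only, not existing backend code, and the root of a group is assumed to sit at level 1.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Ngram:
    """Hypothetical in-memory ngram group: a term plus its child terms."""
    term: str
    children: list[Ngram] = field(default_factory=list)


def descendants(node: Ngram) -> list[str]:
    """All terms strictly below `node`, in depth-first order."""
    out = []
    for child in node.children:
        out.append(child.term)
        out.extend(descendants(child))
    return out


def cap_depth(node: Ngram, max_level: int, level: int = 1) -> Ngram:
    """Return a copy of the tree that never goes deeper than `max_level`.

    Terms that would sit at level max_level + 1 or deeper are re-attached
    as leaves under their ancestor at level max_level - 1.
    """
    if max_level < 2:
        raise ValueError("max_level must be at least 2 for a group to have children")
    if level == max_level - 1:
        # Everything below this node becomes a direct, childless child of it.
        return Ngram(node.term, [Ngram(term) for term in descendants(node)])
    return Ngram(node.term, [cap_depth(c, max_level, level + 1) for c in node.children])


# A > B > C > D plus B > E, capped at 3 levels: D moves up next to C and E.
tree = Ngram("A", [Ngram("B", [Ngram("C", [Ngram("D")]), Ngram("E")])])
capped = cap_depth(tree, max_level=3)
```

The sketch assumes the input is a proper tree. Cyclic data, which may be what caused the reported infinite loops, would need a separate check.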

## ngrams searching

- [x] Make sure that, if we have a deep ngram tree, children deep in the hierarchy still show up in the search (i.e. the backend can search an ngrams table taking nested children into account);
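As an illustration of what "taking nested children into account" means, the sketch below walks each group to any depth and reports the root of every group containing a matching term. The `roots` list, the `children` mapping and the function name are assumptions made for the example. The real backend search works on its own table representation.

```python
def matching_roots(roots, children, query):
    """Roots of the groups that contain a term matching `query` at any depth.

    `children` maps a term to the list of its direct children.
    """
    needle = query.lower()
    hits = []
    for root in roots:
        stack, seen = [root], set()
        while stack:
            term = stack.pop()
            if term in seen:  # guard against cycles in malformed data
                continue
            seen.add(term)
            if needle in term.lower():
                hits.append(root)
                break
            stack.extend(children.get(term, []))
    return hits


roots = ["climate", "energy"]
children = {
    "climate": ["global warming"],
    "global warming": ["greenhouse effect"],
    "energy": ["solar power"],
}
# "greenhouse effect" sits two levels below "climate" and is still found.
found = matching_roots(roots, children, "greenhouse")
```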

## ngrams import and export

There must be some rules that govern imports and exports of a `Terms` node:

- [x] Exporting via JSON (and importing via JSON) should keep all the levels intact -- i.e. if I have a tree with 4 levels, those 4 levels should all be present when importing back. This can be expressed as a general roundtrip property for imports and exports;

- [ ] TSV export should merge levels from 2 onwards, for backward compatibility. Then importing (which IIRC is implemented using JSON) would preserve the levels;

- [ ] During import, there might be existing ngrams, and a merge strategy must be put in place, which David spelled out like this:

### First resolve types conflicts (map/candidates/stop).

This could become quite complex because we have 3 types (map, candidates, stop) AND levels. We should specify at import whether the types of the imported list overwrite those of the ggt list or not.

Example:

ggt map terms : A > B ; C > D ;
imported stop terms : B > C

No overwrite: the map type propagates to B > C through either of A > B and C > D, so the merged list has a map group A > B > C > D.
Overwrite: the stop type propagates to A > B and C > D because of B > C, so the merged list has a stop group A > B > C > D.
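The overwrite rule can be sketched as a small function that picks one type for a merged group. This is only an illustration under stated assumptions: the function name and the term-to-type dictionaries are hypothetical, and raising an error when one side carries several types within the same group is a placeholder, since the issue does not say how that case should be handled.

```python
def resolve_group_type(group, ggt_type, imported_type, overwrite):
    """Pick the single type ("map", "candidates" or "stop") of a merged group.

    `ggt_type` and `imported_type` map terms to the type they have in the
    existing list and in the imported list respectively.
    """
    if overwrite:
        preferred, fallback = imported_type, ggt_type
    else:
        preferred, fallback = ggt_type, imported_type
    for side in (preferred, fallback):
        types = {side[term] for term in group if term in side}
        if len(types) == 1:
            return types.pop()
        if len(types) > 1:
            # Not covered by the rules above: left to the caller in this sketch.
            raise ValueError(f"several types on the same side: {sorted(types)}")
    raise ValueError("no term of the group has a type")


group = {"A", "B", "C", "D"}
ggt_type = {"A": "map", "B": "map", "C": "map", "D": "map"}
imported_type = {"B": "stop", "C": "stop"}
kept = resolve_group_type(group, ggt_type, imported_type, overwrite=False)     # map wins
replaced = resolve_group_type(group, ggt_type, imported_type, overwrite=True)  # stop wins
```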

### Then resolve the hierarchy conflicts

Merge the map terms, candidate terms and stop terms separately: detect the connected components of the merge between the ggt map terms and the imported map terms. If you have, for example, in map groups:

* ggt : A > B > C ; B > D ; A > F
* imported : D > A ; D > E > F

Then A, B, C, D, E and F should be in the same group after import.
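Detecting the connected components amounts to treating every `parent > child` relation from both lists as an undirected edge. A minimal sketch, assuming each list is given as `(parent, child)` pairs (the function name and this input shape are illustrative, not the backend's data model):

```python
def connected_groups(*edge_lists):
    """Group terms that are linked, directly or not, by any parent/child edge."""
    adjacency = {}
    for edges in edge_lists:
        for parent, child in edges:
            adjacency.setdefault(parent, set()).add(child)
            adjacency.setdefault(child, set()).add(parent)

    groups, seen = [], set()
    for start in adjacency:
        if start in seen:
            continue
        group, stack = set(), [start]
        while stack:
            term = stack.pop()
            if term in group:
                continue
            group.add(term)
            stack.extend(adjacency[term] - group)
        seen |= group
        groups.append(group)
    return groups


ggt = [("A", "B"), ("B", "C"), ("B", "D"), ("A", "F")]  # A > B > C ; B > D ; A > F
imported = [("D", "A"), ("D", "E"), ("E", "F")]         # D > A ; D > E > F
groups = connected_groups(ggt, imported)
# One group here: F is linked through both A > F and E > F.
```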

As for the resolution of hierarchy conflicts, start from the top levels and resolve by recurrence, taking levels N and N+1 at each step. There are at least two types of conflicts, illustrated below:

a) Here we have a first conflict: D > A and A >> D (through A > B > D). The term with the highest number of occurrences wins. So, let's say it's D, we have
D > A > B > C ; D > E ; A > F ; E > F (B > D is removed because D changed its level)

And we move on to resolve levels 2-3.

b) Here we have a second type of conflict: at level 2, there is a competition between two parents for F. The one with the highest number of occurrences wins (let's say E), so we end up with:
D > A > B > C ; D > E > F
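One possible reading of this top-down resolution, sketched for a single connected group: this is an interpretation, not a specification. It assumes that the competing roots are the top terms of each list, that a term already placed at a higher level is never moved again (which is what drops B > D), and that ties keep the first claimant. The function name, the `(parent, child)` pairs and the occurrence counts are made up for the example.

```python
def resolve_hierarchy(ggt_edges, imported_edges, occurrences):
    """Return a term -> parent mapping (None for the root) for one group."""
    children = {}
    for parent, child in list(ggt_edges) + list(imported_edges):
        children.setdefault(parent, []).append(child)

    def top_terms(edges):
        # Terms that are a parent but never a child within one list.
        return {p for p, _ in edges} - {c for _, c in edges}

    # Conflict a): competing roots, the most frequent one wins.
    candidates = sorted(top_terms(ggt_edges) | top_terms(imported_edges))
    root = max(candidates, key=lambda term: occurrences[term])

    parent_of = {root: None}
    level = [root]
    while level:
        claims = {}
        for parent in level:
            for child in children.get(parent, []):
                if child in parent_of:
                    continue  # already placed higher up, so this edge is dropped
                best = claims.get(child)
                # Conflict b): several parents at this level, the most frequent wins.
                if best is None or occurrences[parent] > occurrences[best]:
                    claims[child] = parent
        parent_of.update(claims)
        level = list(claims)
    return parent_of


ggt = [("A", "B"), ("B", "C"), ("B", "D"), ("A", "F")]
imported = [("D", "A"), ("D", "E"), ("E", "F")]
occurrences = {"A": 10, "B": 5, "C": 3, "D": 20, "E": 12, "F": 4}  # made-up counts
result = resolve_hierarchy(ggt, imported, occurrences)
# With D and E the most frequent, this gives D > A > B > C ; D > E > F.
```

Terms that cannot be reached from the winning root are simply absent from the result. What to do with them is not covered by the rules above.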

## Implementation note(s)

The merge strategy might be the hardest part of this ticket (and perhaps better handled in a "part 3"?) because, as far as I know, we use the patch-class machinery to deal with conflicts. Ideally that should fit this model quite naturally, but the way patching works is a bit foreign to me, so I will have to learn the current algorithm a bit better.
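Separately from the merge strategy, the JSON roundtrip rule listed under import and export lends itself to a property test. The sketch below shows the shape of such a property in Python. `export_json` and `import_json` are stand-ins for the real export and import code, and the nested-dictionary tree is an assumed representation, so only the structure of the check carries over.

```python
import json
import random


def export_json(tree):
    """Stand-in for the real JSON export."""
    return json.dumps(tree, sort_keys=True)


def import_json(payload):
    """Stand-in for the real JSON import."""
    return json.loads(payload)


def random_tree(rng, depth):
    """Random group {"term": ..., "children": [...]} with at most `depth` levels."""
    children = []
    if depth > 1:
        children = [random_tree(rng, depth - 1) for _ in range(rng.randint(0, 3))]
    return {"term": f"term-{rng.randrange(10**6)}", "children": children}


def depth_of(tree):
    """Number of levels in the tree, counting the root as level 1."""
    return 1 + max((depth_of(child) for child in tree["children"]), default=0)


def roundtrip_holds(tree):
    """The property: importing what was exported gives back the same tree."""
    return import_json(export_json(tree)) == tree


rng = random.Random(498)  # fixed seed so that failures are reproducible
all_ok = all(roundtrip_holds(random_tree(rng, depth=4)) for _ in range(100))
```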

Edited Aug 04, 2025 by Alfredo Di Napoli