Improve ngrams grouping and merging
Follow up to #342.
I'm turning what @davidchavalarias said in that issue into actionable points, so that I can tackle them separately:
ngrams grouping
As far as ngrams grouping is concerned, we need to check:
- Understand if we have to cap the maximum number of levels we can group, as it has been reported that this used to create bugs with infinite loops, especially in the frontend. I have quickly scanned the backend but I haven't found any evidence that we have a hard limit, so that has to be investigated;
- If we have to impose a hard limit on levels, then adding level n + 1 should have it merged with level n. I think the best we can do in terms of merging is simply to have all the leaves of level n + 1 flattened out as children of level n - 1;
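To make the flattening rule concrete, here is a minimal sketch in Python, used purely for illustration. The child-to-parent dictionary and the names `ancestors` and `cap_depth` are hypothetical and do not come from the codebase. Levels are assumed to be counted from 1 at the roots, and the walk up the tree refuses to loop on a cycle, which is the kind of infinite loop mentioned above.

```python
from typing import Dict, List, Optional

# Hypothetical in-memory representation: child -> parent, None for a root.
Parents = Dict[str, Optional[str]]


def ancestors(term: str, parents: Parents) -> List[str]:
    """Return parent, grandparent, ... up to the root (nearest first).

    Raises ValueError on a cycle instead of looping forever.
    """
    chain: List[str] = []
    seen = {term}
    current = parents.get(term)
    while current is not None:
        if current in seen:
            raise ValueError(f"cycle detected at {current!r}")
        seen.add(current)
        chain.append(current)
        current = parents.get(current)
    return chain


def cap_depth(parents: Parents, max_level: int) -> Parents:
    """Re-parent every term deeper than max_level onto level max_level - 1."""
    if max_level < 2:
        raise ValueError("max_level must be at least 2")
    capped: Parents = {}
    for term in parents:
        chain = ancestors(term, parents)
        level = len(chain) + 1  # roots are level 1
        if level <= max_level:
            capped[term] = parents[term]
        else:
            # chain[-1] is the root (level 1), so chain[-k] is the level-k ancestor.
            capped[term] = chain[-(max_level - 1)]
    return capped


tree = {"A": None, "B": "A", "C": "B", "D": "C"}  # A > B > C > D
flattened = cap_depth(tree, 3)  # D (level 4) becomes a child of B (level 2)
```

With a cap of 3, `D` ends up as a sibling of `C` under `B`, so the tree never exceeds three levels. Whether the real limit should be enforced at insertion time or at read time is left open here.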
ngrams searching
- Make sure that if we have a deep ngram tree, children deep in the hierarchy still show up in the search (i.e. the backend can search an ngrams table taking nested children into account);
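As an illustration of the expected behaviour only (not of how the backend actually queries its tables), the sketch below searches every term regardless of depth and reports the root of the group each match belongs to. The child-to-parent dictionary and the names `root_of` and `search_terms` are assumptions made for this example.

```python
from typing import Dict, Optional

# Hypothetical in-memory representation: child -> parent, None for a root.
Parents = Dict[str, Optional[str]]


def root_of(term: str, parents: Parents) -> str:
    """Walk up to the top of the group, failing loudly on a cycle."""
    seen = {term}
    current = term
    while parents.get(current) is not None:
        current = parents[current]
        if current in seen:
            raise ValueError(f"cycle detected at {current!r}")
        seen.add(current)
    return current


def search_terms(query: str, parents: Parents) -> Dict[str, str]:
    """Map every matching term, at any depth, to the root of its group."""
    needle = query.casefold()
    return {
        term: root_of(term, parents)
        for term in parents
        if needle in term.casefold()
    }


table = {
    "network": None,
    "neural network": "network",
    "deep neural network": "neural network",
    "graph": None,
}
hits = search_terms("deep", table)  # the level-3 child is found, with its root
```

The point of the example is the property to test: a match at level 3 or deeper must be returned just like a match at level 1.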
ngrams import and export
There must be some rules that govern imports and exports of a Terms node:
- Exporting via JSON (and importing via JSON) should keep all the levels intact, i.e. if I have a tree with 4 levels, those 4 levels should all be present when importing back. This can be expressed as a general roundtrip property for imports and exports;
- TSV export should merge levels from 2 onwards, for backward compatibility. Then importing (which IIRC is implemented using JSON) would preserve the levels;
- During import, there might be existing ngrams, and a merge strategy must be put in place, which David spelled out like this:
First resolve type conflicts (map/candidates/stop).
This could become quite complex because we have 3 types (map, candidates, stop) AND levels. We should specify at import whether the types of the imported list overwrite the ggt list or not. Example:
ggt map terms: A > B ; C > D ; imported stop terms: B > C
- No overwrite: the map type propagates to B > C through either of A > B and C > D, so the merged list has a map group A > B > C > D
- Overwrite: the stop type propagates to A > B and C > D because of B > C, so the merged list has a stop group A > B > C > D
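The overwrite rule can be sketched as follows, in Python for illustration only. It assumes the merged groups (the connected components across both lists) are already known and passed in. `resolve_group_types` and its parameters are hypothetical names. The text above does not say what to do when the winning side carries several types inside one group, so the sketch raises instead of guessing.

```python
from typing import Dict, Iterable, Set


def resolve_group_types(
    groups: Iterable[Set[str]],
    existing_types: Dict[str, str],
    imported_types: Dict[str, str],
    overwrite: bool,
) -> Dict[str, str]:
    """Give every term of a merged group the type of the winning list."""
    if overwrite:
        preferred, fallback = imported_types, existing_types
    else:
        preferred, fallback = existing_types, imported_types

    merged: Dict[str, str] = {}
    for group in groups:
        kinds = {preferred[t] for t in group if t in preferred}
        if not kinds:
            # The winning list says nothing about this group: keep the other side.
            kinds = {fallback[t] for t in group if t in fallback}
        if len(kinds) != 1:
            # Unspecified case: the sketch does not invent a tie-break.
            raise ValueError(f"cannot pick a single type for {sorted(group)}")
        kind = next(iter(kinds))
        merged.update({term: kind for term in group})
    return merged


# ggt map terms: A > B ; C > D ; imported stop terms: B > C
groups = [{"A", "B", "C", "D"}]
ggt_types = {"A": "map", "B": "map", "C": "map", "D": "map"}
imported_types = {"B": "stop", "C": "stop"}

kept = resolve_group_types(groups, ggt_types, imported_types, overwrite=False)
replaced = resolve_group_types(groups, ggt_types, imported_types, overwrite=True)
```

With `overwrite=False` all four terms stay `map`; with `overwrite=True` all four become `stop`, matching the example.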
Then resolve the hierarchy conflicts
Merge the map terms, candidate terms and stop terms separately: detect the connected components of the merge between the ggt map terms and the imported map terms. If you have, for example, in map groups:
- ggt : A > B > C ; B > D ; A > F
- imported : D > A ; D > E > F
Then A, B, C, D, E, F should be in the same group after import.
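Detecting the connected components can be sketched with a small union-find that ignores the direction of each parent > child link. This is an illustration in Python, not the existing implementation; `chain_edges` and `connected_groups` are made-up helper names.

```python
from typing import Dict, Iterable, List, Set, Tuple

Edge = Tuple[str, str]  # (parent, child)


def chain_edges(chain: str) -> List[Edge]:
    """Turn "A > B > C" into [("A", "B"), ("B", "C")]."""
    terms = [t.strip() for t in chain.split(">")]
    return list(zip(terms, terms[1:]))


def connected_groups(edges: Iterable[Edge]) -> List[Set[str]]:
    """Group terms that are linked directly or indirectly, in any direction."""
    leader: Dict[str, str] = {}

    def find(term: str) -> str:
        leader.setdefault(term, term)
        while leader[term] != term:
            leader[term] = leader[leader[term]]  # path halving
            term = leader[term]
        return term

    for parent, child in edges:
        leader[find(parent)] = find(child)  # union the two groups

    groups: Dict[str, Set[str]] = {}
    for term in list(leader):
        groups.setdefault(find(term), set()).add(term)
    return list(groups.values())


ggt = chain_edges("A > B > C") + chain_edges("B > D") + chain_edges("A > F")
imported = chain_edges("D > A") + chain_edges("D > E > F")
groups = connected_groups(ggt + imported)  # a single group holding A..F
```

This step only decides which terms belong together; it says nothing yet about who ends up as whose parent.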
As for the resolution of hierarchy conflicts, start from the top levels and resolve by recurrence, taking levels N and N+1 at each step. There are at least two types of conflicts, illustrated below:
a) Here we have a first conflict: D > A and A >> D (A is an indirect ancestor of D, via A > B > D). The term with the highest number of occurrences wins. So let's say it's D, we have D > A > B > C ; D > E ; A > F ; E > F (B > D is removed because D changed its level). And we move on to resolve levels 2-3.
b) Here we have a second type of conflict: at level 2, there is a competition between two parents for F. The one with the highest number of occurrences wins (let's say E), so we end up with: D > A > B > C ; D > E > F
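One possible reading of this procedure, sketched in Python for illustration and not taken from the patch machinery: choose the top term among the roots of the two lists by occurrence count, then walk down one level at a time, letting the most frequent parent of the current level win each contested child and dropping links that point back to an already placed term. The names `list_roots` and `resolve_hierarchy`, the alphabetical tie-break, and the occurrence counts below are all assumptions.

```python
from typing import Dict, Iterable, List, Optional, Set, Tuple

Edge = Tuple[str, str]  # (parent, child)


def list_roots(edges: Iterable[Edge]) -> Set[str]:
    """Terms that are a parent somewhere in one list but never a child in it."""
    edges = list(edges)
    return {p for p, _ in edges} - {c for _, c in edges}


def resolve_hierarchy(
    existing: List[Edge], imported: List[Edge], occurrences: Dict[str, int]
) -> Dict[str, Optional[str]]:
    """Return a child -> parent map (None for a top term) without conflicts."""
    children: Dict[str, List[str]] = {}
    for parent, child in existing + imported:
        children.setdefault(parent, []).append(child)
        children.setdefault(child, [])
    top_candidates = list_roots(existing) | list_roots(imported)

    def weight(term: str) -> int:
        return occurrences.get(term, 0)

    parent_of: Dict[str, Optional[str]] = {}
    unplaced = set(children)
    while unplaced:
        # Conflict (a): competing tops, the most frequent candidate wins.
        pool = (top_candidates & unplaced) or unplaced
        top = max(sorted(pool), key=weight)  # sorted: alphabetical tie-break
        parent_of[top] = None
        unplaced.discard(top)
        level = [top]
        while level:
            # Conflict (b): gather every level-N parent claiming a child.
            claims: Dict[str, List[str]] = {}
            for parent in level:
                for child in children[parent]:
                    if child in unplaced:  # links to placed terms are dropped
                        claims.setdefault(child, []).append(parent)
            for child, parents in claims.items():
                parent_of[child] = max(sorted(parents), key=weight)
            unplaced -= set(claims)
            level = list(claims)
    return parent_of


ggt = [("A", "B"), ("B", "C"), ("B", "D"), ("A", "F")]
imported = [("D", "A"), ("D", "E"), ("E", "F")]
counts = {"D": 10, "E": 8, "A": 6, "B": 4, "C": 2, "F": 1}  # made-up numbers
result = resolve_hierarchy(ggt, imported, counts)  # D > A > B > C ; D > E > F
```

With these counts, D beats A for the top spot, B > D is dropped because D is already placed, and E beats A as the parent of F. One consequence of walking level by level is that a parent higher in the tree always beats a deeper one, whatever their counts; whether that is the intended behaviour is not stated above.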
Implementation note(s)
That might be the hardest part of this ticket (and perhaps better handled in a "part 3"?) because, as far as I know, we use that patch-class machinery to deal with conflicts, which should ideally fit quite naturally in this model, but the way that patching works is a bit foreign to me, so I will have to learn the current algorithm a bit better.