haskell-gargantext

Opened Jul 28, 2025 by Alfredo Di Napoli (@AlfredoDiNapoli) · 2 of 6 tasks completed

Improve ngrams grouping and merging

Follow up to #342.

I'm turning what @davidchavalarias said in that issue into actionable points, so that I can tackle them separately:

ngrams grouping

As far as ngrams grouping is concerned, we need to check the following:

  • Understand whether we have to cap the maximum number of levels we can group, as it has been reported that deep grouping used to cause bugs with infinite loops, especially in the frontend. I have quickly scanned the backend but I haven't found any evidence that we have a hard limit, so this has to be investigated;
  • If we have to impose a hard limit n on levels, then adding a level n + 1 should merge it into level n. I think the best we can do in terms of merging is simply to flatten all the leaves of level n + 1 so that they become children of the node at level n - 1;
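
To make the flattening rule concrete, here is a minimal sketch, written in Python for brevity rather than in the backend's Haskell. `Node` and `cap_depth` are hypothetical names, and the sketch assumes that the root counts as level 1 and that anything deeper than the cap is hoisted to the deepest allowed level; neither assumption is confirmed by the current code.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical ngram tree node: a term plus its grouped children."""
    term: str
    children: list = field(default_factory=list)  # list of Node


def descendants(node):
    """Yield every strict descendant of `node` as a childless copy."""
    for child in node.children:
        yield Node(child.term)
        yield from descendants(child)


def cap_depth(root, max_level):
    """Return a copy of the tree with no node deeper than `max_level`.

    The root is level 1. Terms below the cap are re-attached to their
    ancestor at level `max_level - 1`, so they end up at level `max_level`.
    """
    if max_level < 2:
        raise ValueError("max_level must be at least 2")

    def go(node, level):
        if level == max_level - 1:
            # Children of this node sit on the last allowed level:
            # hoist everything below them so they become siblings.
            flat = []
            for child in node.children:
                flat.append(Node(child.term))
                flat.extend(descendants(child))
            return Node(node.term, flat)
        return Node(node.term, [go(c, level + 1) for c in node.children])

    return go(root, 1)


# A > B > C > D capped at 3 levels: D moves up next to C, under B.
tree = Node("A", [Node("B", [Node("C", [Node("D")])])])
capped = cap_depth(tree, 3)
print([c.term for c in capped.children[0].children])  # ['C', 'D']
```

A cap applied this way is idempotent, which might make it a convenient invariant to check in property tests.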

ngrams searching

  • Make sure that if we have a deep ngram tree, children deep in the hierarchy still show up in the search (i.e. the backend can search an ngrams table taking nested children into account);
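
As an illustration of what "taking nested children into account" could mean, the sketch below (Python, not the backend's Haskell) returns a top-level row whenever the query matches the row's own term or any term nested below it. `Node`, `matches` and `search` are hypothetical names, and the choice to return the top-level row rather than the matching child is an assumption.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical ngram tree node: a term plus its grouped children."""
    term: str
    children: list = field(default_factory=list)  # list of Node


def matches(node, query):
    """True if the query occurs in this term or in any nested child."""
    q = query.lower()
    if q in node.term.lower():
        return True
    return any(matches(child, query) for child in node.children)


def search(roots, query):
    """Return the top-level terms whose subtree contains the query."""
    return [root.term for root in roots if matches(root, query)]


table = [
    Node("graph", [Node("network", [Node("hypergraph theory")])]),
    Node("protein"),
]
print(search(table, "hyper"))  # ['graph']
```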

ngrams import and export

There must be some rules that govern imports and exports of a Terms node:

  • Exporting via JSON (and importing via JSON) should keep all the levels intact, i.e. if I have a tree with 4 levels, those 4 levels should all be present when importing it back. This can be expressed as a general roundtrip property for imports and exports;

  • TSV export should merge levels from 2 onwards, for backward compatibility. Then importing (which IIRC is implemented using JSON) would preserve the levels;

  • During import, there might be existing ngrams, and a merge strategy must be put in place, which David spelled out like this:

First, resolve type conflicts (map/candidates/stop).

This could become quite complex because we have 3 types (map, candidates, stop) AND levels. We should specify at import time whether or not the types of the imported list overwrite those of the ggt list. Example:

ggt map terms : A > B ; C > D ; imported stop terms : B > C

No overwrite : the map type propagates to B > C through either of A > B and C > D; the merged list has a map group A > B > C > D.

Overwrite : the stop type propagates to A > B and C > D because of B > C; the merged list has a stop group A > B > C > D.
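
The example above can be sketched as follows (Python, for illustration only). The representation is an assumption: each side is a dict from list type to `(parent, child)` pairs, and `components` and `resolve_types` are hypothetical names. The sketch only decides which type each merged group gets; it does not order the terms inside the group, and it deliberately fails when the winning side brings more than one type into the same group, since that case is not specified here.

```python
def components(edges):
    """Group terms linked by any edge, ignoring direction (union-find)."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in edges:
        parent[find(a)] = find(b)
    groups = {}
    for term in list(parent):
        groups.setdefault(find(term), set()).add(term)
    return list(groups.values())


def resolve_types(ggt, imported, overwrite):
    """Return a list of (list_type, terms) for every merged group."""

    def type_of(side):
        return {t: typ for typ, edges in side.items() for e in edges for t in e}

    all_edges = [e for side in (ggt, imported)
                 for edges in side.values() for e in edges]
    ggt_type, imported_type = type_of(ggt), type_of(imported)
    # The side listed first decides the type of any group it touches.
    first, second = ((imported_type, ggt_type) if overwrite
                     else (ggt_type, imported_type))
    result = []
    for group in components(all_edges):
        winners = ({first[t] for t in group if t in first}
                   or {second[t] for t in group if t in second})
        if len(winners) != 1:
            raise ValueError(f"ambiguous types {winners} for group {group}")
        result.append((winners.pop(), frozenset(group)))
    return result


ggt = {"map": [("A", "B"), ("C", "D")]}
imported = {"stop": [("B", "C")]}
print(resolve_types(ggt, imported, overwrite=False)[0][0])  # map
print(resolve_types(ggt, imported, overwrite=True)[0][0])   # stop
```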

Then resolve the hierarchy conflicts

Merge the map terms, candidate terms and stop terms separately: detect the connected components of the merge between the ggt map terms and the imported map terms. Suppose, for example, that the map groups are:

  • ggt : A > B > C ; B > D ; A > F
  • imported : D > A ; D > E > F

Then A, B, C, D, E and F should be in the same group after import.
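
Detecting the connected components for the example above could look like the sketch below (Python, illustration only). It treats every `parent > child` relation as an undirected link and walks the resulting graph breadth-first; `merged_groups` is a hypothetical name. Note that F also lands in the group, since it is linked through A > F and E > F.

```python
from collections import defaultdict, deque


def merged_groups(ggt_edges, imported_edges):
    """Return the sets of terms that end up in the same group after merging."""
    neighbours = defaultdict(set)
    for parent, child in [*ggt_edges, *imported_edges]:
        neighbours[parent].add(child)
        neighbours[child].add(parent)

    seen, groups = set(), []
    for start in sorted(neighbours):
        if start in seen:
            continue
        seen.add(start)
        group, queue = set(), deque([start])
        while queue:
            term = queue.popleft()
            group.add(term)
            for other in neighbours[term] - seen:
                seen.add(other)
                queue.append(other)
        groups.append(group)
    return groups


# ggt: A > B > C ; B > D ; A > F      imported: D > A ; D > E > F
ggt_edges = [("A", "B"), ("B", "C"), ("B", "D"), ("A", "F")]
imported_edges = [("D", "A"), ("D", "E"), ("E", "F")]
print(len(merged_groups(ggt_edges, imported_edges)))  # 1
```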

As for the resolution of hierarchy conflicts, start from the top levels and resolve recursively, taking levels N and N+1 at each step. There are at least two types of conflicts, illustrated below:

a) Here we have a first conflict: D > A and A >> D (A is an indirect ancestor of D, through B). The term with the highest number of occurrences wins. Say it's D: we then have D > A > B > C ; D > E ; A > F ; E > F (B > D is removed because D changed its level).

We then move on to resolve levels 2-3.

b) Here we have a second type of conflict: at level 2, two parents compete for F. The one with the highest number of occurrences wins (say E), so we end up with: D > A > B > C ; D > E > F
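
One possible reading of this level-by-level resolution is sketched below (Python, illustration only; this is not how the backend currently works). The names `roots_of` and `resolve_hierarchy` and the occurrence counts are hypothetical. The sketch assumes a single connected group, picks the root among the roots of the two input lists by occurrences, then assigns each level's children top-down, dropping relations that point at an already placed term. Terms that cannot be reached from the winning root are left out, and a group with no root at all (a pure cycle) is not handled.

```python
from collections import defaultdict


def roots_of(edges):
    """Terms that are parents but never children within one list."""
    return {p for p, _ in edges} - {c for _, c in edges}


def resolve_hierarchy(ggt_edges, imported_edges, occurrences):
    """Return (root, parent_of) for the merged group."""
    children = defaultdict(list)
    for parent, child in [*ggt_edges, *imported_edges]:
        if child not in children[parent]:
            children[parent].append(child)

    # Conflict a): competing roots, the most frequent term wins.
    candidates = roots_of(ggt_edges) | roots_of(imported_edges)
    root = max(sorted(candidates), key=lambda t: occurrences[t])

    parent_of, placed, level = {}, {root}, [root]
    while level:
        claims = defaultdict(list)
        for parent in level:
            for child in children[parent]:
                if child not in placed:  # e.g. B > D once D is the root
                    claims[child].append(parent)
        # Conflict b): several parents claim a child, the most frequent wins.
        for child, parents in claims.items():
            parent_of[child] = max(parents, key=lambda t: occurrences[t])
        level = list(claims)
        placed.update(level)
    return root, parent_of


ggt_edges = [("A", "B"), ("B", "C"), ("B", "D"), ("A", "F")]
imported_edges = [("D", "A"), ("D", "E"), ("E", "F")]
# Hypothetical counts chosen so that D beats A and E beats A.
occurrences = {"A": 30, "B": 10, "C": 5, "D": 50, "E": 40, "F": 8}

root, parent_of = resolve_hierarchy(ggt_edges, imported_edges, occurrences)
print(root, parent_of["F"])  # D E
```

With these counts the result is D > A > B > C ; D > E > F, matching the outcome described above.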

Implementation note(s)

That might be the hardest part of this ticket (and perhaps better handled in a "part 3"?) because, as far as I know, we use the patch-class machinery to deal with conflicts. Ideally that machinery should fit this model quite naturally, but the way patching works is a bit foreign to me, so I will first have to get a better understanding of the current algorithm.

Edited Aug 04, 2025 by Alfredo Di Napoli
Reference: gargantext/haskell-gargantext#498