Opened Mar 03, 2025 by Przemyslaw Kaminski@cgenie
Investigate possible improvements to GGTX `Language` type

Please look at the function insertMasterDocs in Gargantext.Database.Action.Flow. What we do there is fetch an NLP server for a single predefined language and then insert all documents using that NLP server.

This isn't optimal: each document could have a different _hd_language_iso2 setting in its hyperdata (cf. HyperdataDocument).

What would be best is to process each document with the NLP server that matches the language set in its hyperdata.

The problem is that not all APIs allow for easy language filtering. E.g. the arXiv [search_query documentation](https://info.arxiv.org/help/api/user-manual.html#query_details) doesn't mention anything about language.

On the other hand, OpenAlex is better as it allows language in works filters: https://docs.openalex.org/api-entities/works/filter-works
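
To make the OpenAlex case concrete, here is a minimal sketch of building a works query restricted to one language. It only constructs the URL and sends no request. The helper name `openalex_works_url` is made up for illustration, and the `language:<iso2>` filter form is my reading of the filter docs linked above, so please double-check it there.

```python
from urllib.parse import urlencode

# Public OpenAlex works endpoint.
OPENALEX_WORKS_URL = "https://api.openalex.org/works"


def openalex_works_url(search: str, lang_iso2: str) -> str:
    """Build a works query that keeps only results in one language.

    Hypothetical helper: `lang_iso2` is an ISO 639-1 code such as "en" or "fr".
    """
    params = {
        "search": search,
        # Assumed filter syntax: language:<iso2>
        "filter": f"language:{lang_iso2.lower()}",
    }
    # safe=":" keeps the colon readable instead of encoding it as %3A.
    return f"{OPENALEX_WORKS_URL}?{urlencode(params, safe=':')}"


if __name__ == "__main__":
    print(openalex_works_url("complex networks", "FR"))
```
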

My suggestion would be to remove the predefined language from insertMasterDocs and instead use the language defined in each document (defaulting to English when none is set).
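
A rough sketch of that idea, in Python only to keep it short (the real change would live in the Haskell code): group documents by their declared language, fall back to English, and pick an NLP server per group. The `language_iso2` key, the `servers` mapping and all function names are hypothetical stand-ins, not the actual Gargantext types.

```python
from collections import defaultdict
from typing import Dict, Iterable, List

DEFAULT_LANG = "en"  # the suggestion above: default to English


def document_language(doc: dict, default: str = DEFAULT_LANG) -> str:
    """Return the document's ISO 639-1 code, or `default` if missing or blank.

    `language_iso2` is a hypothetical key standing in for _hd_language_iso2.
    """
    lang = doc.get("language_iso2")
    if isinstance(lang, str) and lang.strip():
        return lang.strip().lower()
    return default


def group_by_language(docs: Iterable[dict]) -> Dict[str, List[dict]]:
    """Bucket documents so each bucket can go to its own NLP server."""
    groups: Dict[str, List[dict]] = defaultdict(list)
    for doc in docs:
        groups[document_language(doc)].append(doc)
    return dict(groups)


def pick_server(servers: Dict[str, str], lang: str) -> str:
    """Pick the server for `lang`, falling back to the default-language one."""
    return servers.get(lang, servers[DEFAULT_LANG])


if __name__ == "__main__":
    docs = [{"title": "a", "language_iso2": "FR"}, {"title": "b"}]
    servers = {"en": "nlp-en", "fr": "nlp-fr"}  # placeholder names
    for lang, batch in group_by_language(docs).items():
        print(lang, pick_server(servers, lang), len(batch))
```

Grouping first means each NLP server is looked up once per language rather than once per document.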

Then, review the APIs and make sure we filter by the given language. Where that isn't possible (as with arXiv), maybe we could run some heuristics to determine the language?

Some libraries to help us out:

  • lingua-py: https://github.com/pemistahl/lingua-py (here's a simple paste, try it with uv run python detect.py: https://paste.sr.ht/~cgenie/2dc1f37b1fe75cb665eb1c5baa1ce23d7960f967)
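
For the heuristic fallback, something along these lines might work. This is a sketch, not the contents of the paste above: it trusts the declared language when there is one, otherwise asks a detector, otherwise defaults to English. `resolve_language` and `make_lingua_detector` are hypothetical names, and the three languages passed to lingua are an arbitrary choice. lingua is imported lazily, so the rest of the code runs without it installed.

```python
from typing import Callable, Optional

Detector = Callable[[str], Optional[str]]


def make_lingua_detector() -> Detector:
    """Wrap lingua-py so it returns a lowercase ISO 639-1 code or None.

    Requires the third-party package `lingua-language-detector`.
    """
    from lingua import Language, LanguageDetectorBuilder

    # Restricting the candidate set is an assumption made for this sketch.
    detector = LanguageDetectorBuilder.from_languages(
        Language.ENGLISH, Language.FRENCH, Language.GERMAN
    ).build()

    def detect(text: str) -> Optional[str]:
        language = detector.detect_language_of(text)
        if language is None:
            return None
        return language.iso_code_639_1.name.lower()

    return detect


def resolve_language(
    declared: Optional[str],
    text: str,
    detect: Detector,
    default: str = "en",
) -> str:
    """Prefer the declared language, then detection, then the default."""
    if declared and declared.strip():
        return declared.strip().lower()
    return detect(text) or default


if __name__ == "__main__":
    detect = make_lingua_detector()
    print(resolve_language(None, "Bonjour tout le monde, comment allez-vous ?", detect))
```

Passing the detector in as a function keeps the fallback logic independent of lingua, so another library could be swapped in later.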
Reference: gargantext/haskell-gargantext#449