Investigate possible improvements to GGTX `Language` type (#449) · Issues · gargantext / haskell-gargantext

Please look at the function `insertMasterDocs` in the module
`Gargantext.Database.Action.Flow`. There we fetch an NLP server for a
single predefined language and then insert all documents using that
server.

This isn't optimal: each document could have a different
`_hd_language_iso2` setting in its hyperdata
(cf. `HyperdataDocument`).

It would be best to use, for each document, the NLP server that
matches the language given in its hyperdata.

The problem is that not all APIs allow for easy language filtering.
E.g. the arXiv
[`search_query` documentation](https://info.arxiv.org/help/api/user-manual.html#query_details)
doesn't mention anything about language.

On the other hand, OpenAlex is better as it allows `language` in works
filters: https://docs.openalex.org/api-entities/works/filter-works
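
To make the OpenAlex case concrete, here is a minimal sketch of building a works query restricted to one language. It assumes the public `https://api.openalex.org/works` endpoint, the `filter=attribute:value` syntax from the page linked above, and lower-case ISO 639-1 codes as values; `build_works_url` is a hypothetical helper name, not something from our codebase.

```python
from urllib.parse import urlencode

# Assumption: public OpenAlex works endpoint.
OPENALEX_WORKS = "https://api.openalex.org/works"


def build_works_url(lang_iso2: str, search: str = "") -> str:
    """Build a works URL that only returns documents in the given language."""
    # Assumption: OpenAlex expects lower-case ISO 639-1 codes, e.g. "fr".
    params = {"filter": f"language:{lang_iso2.lower()}"}
    if search:
        params["search"] = search
    # Keep ':' and ',' readable, since the filter syntax relies on them.
    return f"{OPENALEX_WORKS}?{urlencode(params, safe=':,')}"
```

Calling `build_works_url("FR", "complex networks")` would give a URL with `filter=language:fr` in it, so the language restriction happens on the API side and we don't have to guess afterwards.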

My suggestion would be to remove the predefined language from
`insertMasterDocs` and instead use the language defined in each
document (defaulting to English).
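
A rough, language-agnostic sketch of that idea, written in Python only for brevity (the real change would be in Haskell): bucket the documents by their language, falling back to English, so that each bucket can be sent to the matching NLP server. Documents are plain dicts here and the `language_iso2` key is a hypothetical stand-in for `_hd_language_iso2`; `doc_language` and `group_by_language` are made-up names.

```python
from collections import defaultdict

DEFAULT_LANG = "EN"  # assumption: English fallback, as suggested above


def doc_language(doc: dict) -> str:
    """Return the document's ISO 639-1 code, upper-cased, or the default."""
    lang = (doc.get("language_iso2") or "").strip()
    return lang.upper() if lang else DEFAULT_LANG


def group_by_language(docs: list[dict]) -> dict[str, list[dict]]:
    """Bucket documents so each bucket can go to a single NLP server."""
    groups: dict[str, list[dict]] = defaultdict(list)
    for doc in docs:
        groups[doc_language(doc)].append(doc)
    return dict(groups)
```

Grouping first means we would still fetch each NLP server only once per language, instead of once per document.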

Then, review the APIs and make sure we filter by the given language.
If that's not possible (as with arXiv), maybe we could run some
heuristics to determine the language?

Some libraries to help us out:
- lingua-py: https://github.com/pemistahl/lingua-py
  Here's a simple paste (try with `uv run python detect.py`):
  https://paste.sr.ht/~cgenie/2dc1f37b1fe75cb665eb1c5baa1ce23d7960f967
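
For the heuristic fallback, something along these lines could work. This is only a sketch, not the content of the paste above: it trusts the declared language when there is one, otherwise it guesses from title and abstract, and finally defaults to English. The lingua-py calls follow that project's README; the candidate language set, the dict keys (`language_iso2`, `title`, `abstract`) and the function names are assumptions for illustration.

```python
from typing import Callable, Optional

# A detector maps text to an ISO 639-1 code such as "FR", or None if unsure.
Detector = Callable[[str], Optional[str]]


def make_lingua_detector() -> Detector:
    """Build a detector backed by lingua-py (imported lazily)."""
    from lingua import Language, LanguageDetectorBuilder

    # Assumption: a small candidate set, chosen only for illustration.
    detector = LanguageDetectorBuilder.from_languages(
        Language.ENGLISH, Language.FRENCH, Language.GERMAN, Language.SPANISH
    ).build()

    def detect(text: str) -> Optional[str]:
        language = detector.detect_language_of(text)
        return language.iso_code_639_1.name if language is not None else None

    return detect


def resolve_language(doc: dict, detect: Detector, default: str = "EN") -> str:
    """Prefer the declared language, then a guess, then the default."""
    declared = (doc.get("language_iso2") or "").strip()
    if declared:
        return declared.upper()
    # Guess from whatever text fields are present.
    text = " ".join(filter(None, (doc.get("title"), doc.get("abstract"))))
    guessed = detect(text) if text else None
    return guessed.upper() if guessed else default
```

Keeping the detector as a plain function argument means `resolve_language` doesn't depend on lingua-py directly, so we could swap in another library or a stub for tests.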