Investigate possible improvements to GGTX `Language` type
Please look at the file `Gargantext.Database.Action.Flow`, function `insertMasterDocs`. What we do there is fetch an NLP server for a predefined language and then insert the documents using that NLP server.
This isn't optimal: each document could have a different `_hd_language_iso2` setting in its hyperdata (cf. `HyperdataDocument`).
It would be best to apply precisely the NLP server for the language that the document hyperdata defines.
The problem is that not all APIs allow for easy language filtering. E.g. the arXiv [`search_query` documentation](https://info.arxiv.org/help/api/user-manual.html#query_details) doesn't mention anything about language.
On the other hand, OpenAlex is better, as it allows `language` in works filters: https://docs.openalex.org/api-entities/works/filter-works
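As a rough illustration of what filtering at the source could look like, here is a minimal sketch that builds an OpenAlex works query restricted to one language. The helper name `openalex_works_url` is made up for this example, and it assumes OpenAlex expects lowercase ISO 639-1 codes; it only builds the URL and doesn't perform the request.

```python
from urllib.parse import urlencode

OPENALEX_WORKS = "https://api.openalex.org/works"

def openalex_works_url(query, lang_iso2):
    """Build a works search URL restricted to one language.

    `openalex_works_url` is a hypothetical helper, not part of any library.
    Assumption: OpenAlex wants lowercase ISO 639-1 codes in the filter.
    """
    params = {
        "search": query,
        "filter": f"language:{lang_iso2.lower()}",
    }
    # safe=":" keeps the filter readable instead of encoding it as %3A
    return f"{OPENALEX_WORKS}?{urlencode(params, safe=':')}"

url = openalex_works_url("graph theory", "FR")
print(url)
```

On our side the language is stored as an upper-case ISO2 code, hence the lower-casing before it goes into the filter.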
My suggestion would be to remove the predefined language from `insertMasterDocs`, then use the language defined in each document (or default to English).
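To make the suggestion more concrete, here is a small language-agnostic sketch (in Python rather than Haskell, just for brevity) of the idea: group documents by their own language, defaulting to English, so each group can then be sent to the matching NLP server. The function `group_by_language`, the `supported` set and the dict-based documents are all assumptions for illustration; only the `_hd_language_iso2` field name comes from our code.

```python
from collections import defaultdict

DEFAULT_LANG = "EN"  # fall back to English, as suggested above

def group_by_language(docs, supported, default=DEFAULT_LANG):
    """Group documents by their declared language.

    Documents with a missing, empty or unsupported `_hd_language_iso2`
    end up in the default group. `supported` stands in for the set of
    languages we have an NLP server for (hypothetical).
    """
    groups = defaultdict(list)
    for doc in docs:
        lang = (doc.get("_hd_language_iso2") or default).upper()
        if lang not in supported:
            lang = default
        groups[lang].append(doc)
    return dict(groups)

docs = [
    {"title": "A", "_hd_language_iso2": "fr"},
    {"title": "B", "_hd_language_iso2": None},
    {"title": "C"},
    {"title": "D", "_hd_language_iso2": "xx"},
]
groups = group_by_language(docs, supported={"EN", "FR"})
for lang, members in sorted(groups.items()):
    print(lang, [d["title"] for d in members])
```

Each group would then be inserted with the NLP server for its language, instead of one server for the whole batch.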
Then, review the APIs and make sure we filter by the given language. Where that's not possible (as with arXiv), maybe run some heuristics to try to determine the language?
Some libraries to help us out:
- lingua-py: https://github.com/pemistahl/lingua-py
Here's a simple paste (try with `uv run python detect.py`): https://paste.sr.ht/~cgenie/2dc1f37b1fe75cb665eb1c5baa1ce23d7960f967
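For the fallback logic itself, here is a toy sketch: use the declared language when present, otherwise guess from the text. The stopword counting below is only a stdlib stand-in so the example runs without dependencies; it is not how lingua-py works, and a real implementation would call a proper detector instead. The `title` and `abstract` keys and both function names are hypothetical.

```python
import re

# Tiny illustrative stopword lists; a real detector would use a trained model.
STOPWORDS = {
    "EN": {"the", "and", "of", "is", "in", "to", "that", "with"},
    "FR": {"le", "la", "les", "et", "des", "est", "dans", "une"},
}

def guess_language(text, default="EN"):
    """Pick the language whose stopwords occur most often in `text`."""
    words = re.findall(r"[^\W\d_]+", text.lower())
    scores = {lang: sum(w in sw for w in words) for lang, sw in STOPWORDS.items()}
    best = max(scores, key=scores.get)
    # No stopword hit at all: we know nothing, so use the default.
    return best if scores[best] > 0 else default

def document_language(doc, default="EN"):
    """Prefer the declared language, otherwise guess from title and abstract."""
    declared = doc.get("_hd_language_iso2")
    if declared:
        return declared.upper()
    text = " ".join(filter(None, [doc.get("title"), doc.get("abstract")]))
    return guess_language(text, default)

print(document_language({"title": "Les réseaux de neurones dans la physique"}))
print(document_language({"title": "The structure of the graph is sparse"}))
print(document_language({"_hd_language_iso2": "de", "title": "whatever"}))
```

The point is the ordering: the declared language wins, detection only runs for documents that come without one (e.g. from arXiv), and English stays the last resort.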