[Meta] Mismatch between compound word representation in the frontend and in the database.
Summary
Meta issue about the mismatch between the representation of ngrams containing a hyphen or single quote in the frontend and in the database. In the document UI, spaces are added after special characters. When such a term is highlighted and chosen as a map term, the special characters in the term become spaces. Furthermore, documents are displayed from the raw context, without some of the processing that happens in the backend, such as hyphen removal. These diverging transformations create a mismatch between frontend and backend, breaking comparison-based features in the terms table (search bar, highlights, score count in the doc view, map term selection from the document and the search bar).
Steps to reproduce
- Create a new empty corpus
- Click on the flower -> WriteNodesDocument icon
- Docs -> Add Document button in the panel
- Fill the form with the abstract: words are nice but word's shaped as a porte-monnaie are better
- From the doc modal, highlight the word porte- monnaie and add it to map terms.
- From the search terms bar, search portemonnaie and add it to the map terms. It shows an occurrence count of 1.
- From the search terms bar, search porte- monnaie. Nothing found.
Example Project
Minimal example on the dev instance
What is the current bug behavior?
- porte- monnaie highlighted from the doc modal has a count of 0 occurrences.
- portemonnaie selected from the search bar has a count of 1.
- Searching among already selected terms with a hyphen will not give the expected results.
- porte- monnaie from the search bar yields no result.
- porte-monnaie yields no result.
- porte monnaie (with two spaces) returns porte monnaie with a count of 0.
- There is no term highlighted in the document view.
What is the expected correct behavior?
- We should be able to select the term porte-monnaie from the document with the highlight.
- The term should have an occurrence count of 1.
- It should be highlighted in green in the document view.
- It should be searchable: when it is already selected, searching porte-monnaie should yield one result in one document.
Relevant logs and/or screenshots
Tokenization separates words on the hyphen
ghci> ffmap _terms_label myTerms
[[["words"],["are"],["nice"],[","],["word"],["'s"],["shape"],["as","a"],["porte"],["-"],["monnaie"],["are"],["better"]]]
The function cleanTextForNLP removes hyphens
ghci> cleanTextForNLP "Words are nice but words shaped as a porte-monnaie are better"
"Words are nice but words shaped as a portemonnaie are better"
The function multiterms finds the terms. One of those terms is portemonnaie, without the hyphen. cleanTextForNLP is called before multiterms is applied.
ghci> multiterms defaultNLPServer EN "words are nice, word's shaped as a porte-monnaie are better"
[(Terms {_terms_label = ["portemonnaie"], _terms_stem = fromList ["portemonnaie"]},1),(Terms {_terms_label = ["word"], _terms_stem = fromList ["word"]},2)]
Possible fixes
Temporary workaround for users: try to add the compound term from the search bar without the hyphen, quote or space (e.g., portefeuille -> add to map terms).
Dirty frontend fixes (mostly in /purescript-gargantext/src/Gargantext/Core/NgramsTable/Functions.purs):
- Normalization at query time, before the query is sent: if a term with special characters is highlighted, remove the spaces and those special characters before it is sent. If the highlighted ngram is already in base, we should then find an entry with a count != 0 in the table, without spaces or special characters. A minimal sketch of this normalization is given after this list.
- To fix the highlights in the document UI: remove the spaces after special characters, and the special characters themselves, when contexts are loaded and displayed in the document UI, so that the displayed text matches the ngram in base. Or, better, make the highlight rules more flexible.
- Same thing for the search terms: either we multiply queries with simple transformations whenever we see special characters and merge the result sets (less restrictive result set), or we modify the query to remove special characters and spaces (more restrictive result set).
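A minimal sketch of that normalization, written in Haskell for illustration (the actual code would live in the PureScript file above); normalizeTerm and the special-character set are assumptions, not existing API. The point is that porte- monnaie, porte-monnaie and portemonnaie all collapse to the same key before comparison.
import Data.Char (isSpace, toLower)

-- Characters the document UI follows with an extra space (assumed set).
specialChars :: [Char]
specialChars = "-'’"

-- Drop each special character together with the spaces inserted after it,
-- then lowercase, so frontend and database compare the same string.
normalizeTerm :: String -> String
normalizeTerm = map toLower . go
  where
    go [] = []
    go (c:cs)
      | c `elem` specialChars = go (dropWhile isSpace cs)
      | otherwise             = c : go cs

ghci> normalizeTerm "porte- monnaie" == normalizeTerm "portemonnaie"
True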
Saner fixes:
- Remove all text transformations in the frontend: only normalize queries and define more flexible matching rules.
- The use of cleanTextForNLP should be reevaluated. porte-monnaie should appear as such in the ngrams table and in the document UI, not as portemonnaie or porte- monnaie.
- Highlighting should be more flexible: we do not transform the context to match the ngrams, but we highlight when the normalized term matches the normalized part of the context displayed in the document UI (see the sketch below).
Compound words seem very tricky; we should plan a spike to look at the strategies used in Postgres full-text search or Algolia.
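As a rough illustration of the flexible highlighting idea (a sketch, not the current implementation; normalizeWithOffsets and highlightSpans are hypothetical names): normalize the displayed context only for matching, remember the original offsets, and highlight the untouched original span.
import Data.Char (isSpace, toLower)
import Data.List (tails)

specialChars :: [Char]
specialChars = "-'’"

-- Normalize a context while remembering where each kept character came from.
normalizeWithOffsets :: String -> [(Int, Char)]
normalizeWithOffsets = go . zip [0 ..]
  where
    go [] = []
    go ((i, c) : rest)
      | c `elem` specialChars = go (dropWhile (isSpace . snd) rest)
      | otherwise             = (i, toLower c) : go rest

-- Spans in the original (displayed) context whose normalized form matches the
-- normalized ngram; the UI would highlight these spans as-is.
highlightSpans :: String -> String -> [(Int, Int)]
highlightSpans ngram context =
  [ (fst (head win), fst (last win))
  | win <- windows (length needle) (normalizeWithOffsets context)
  , map snd win == needle
  ]
  where
    needle       = map snd (normalizeWithOffsets ngram)
    windows n xs = [ take n t | t <- tails xs, length t >= n ]

ghci> highlightSpans "porte-monnaie" "word's shaped as a porte- monnaie are better"
[(19,32)]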
Fix at the UX level:
From my understanding, one cannot add ngrams which have not already been defined in base by the NLP layer. So there are many ways to create terms with a count of 0 with the highlight mechanism (when one highlights too many words, not enough, in between words, or pseudo-words excluded during the NLP phase). So before defining a new term, we should evaluate whether we can create a map term from the user selection; if not, we display an error.
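A hedged sketch of that check (validateSelection is a hypothetical name, and the set of known ngrams is assumed to hold already-normalized labels): compare the normalized selection against the ngrams the NLP layer has already produced for the corpus, and refuse to create a term when nothing matches.
import qualified Data.Set as Set
import Data.Char (isSpace, toLower)

-- Same normalization idea as in the earlier sketch, repeated to stay self-contained.
normalize :: String -> String
normalize = map toLower . go
  where
    go [] = []
    go (c:cs)
      | c `elem` "-'’" = go (dropWhile isSpace cs)
      | otherwise      = c : go cs

-- Accept the selection only if it corresponds to an ngram already in base.
validateSelection :: Set.Set String -> String -> Either String String
validateSelection knownNgrams selection
  | n `Set.member` knownNgrams = Right n
  | otherwise = Left ("No known ngram matches the selection: " <> selection)
  where
    n = normalize selection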
It's still suboptimal though: since the list of potential map terms is already defined by the NLP layer, it would be better to let the user select the term or group of terms with a click (since the ngram boundaries are already known).
Do we want the NLP layer to be the only source of truth, though? A user should be able to enter arbitrary terms or groups of terms and use them for the clustering. If we want this degree of flexibility, can we create an isomorphic and efficient data structure that allows us to navigate between the human-readable form of the document and the degraded form used by the machine? Is normalization at every layer the only way?
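One possible shape for such a structure, purely illustrative (nothing of the sort exists today, and the names are assumptions): keep the raw text, its normalized form, and an offset map from normalized positions back to raw positions, so the document UI and the NLP layer can each work on their preferred form without losing the link between them.
-- Illustrative only.
data AlignedText = AlignedText
  { rawText        :: String  -- what the document UI displays
  , normalizedText :: String  -- what the NLP layer and the ngrams table see
  , toRawOffset    :: [Int]   -- index i of normalizedText comes from rawText at (toRawOffset !! i)
  }

-- Building it amounts to running the normalization once while recording, for each
-- kept character, its position in the raw text (as in the highlight sketch above);
-- navigating between the two forms is then just an index lookup in either direction.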