[Meta] Mismatch between compound words representation in frontend and in database. (#386) · Issues · gargantext / haskell-gargantext

## Summary

Meta issue about the mismatch between how ngrams containing a hyphen or a single quote are represented in the frontend and in the database. In the document UI, a space is added after special characters. When such a term is highlighted and chosen as a map term, its special characters become spaces. Furthermore, documents are displayed from the raw context, without some of the processing that happens in the backend, such as hyphen removal. These differences in processing create a mismatch between frontend and backend that breaks the comparison-based features of the terms table (search bar, highlights, score count in doc view, map term selection from the document and from the search bar).

## Steps to reproduce

- Create a new empty corpus.
- Click on the flower -> WriteNodesDocument icon.
- Docs -> Add Document button in the panel.
- Fill the form with the abstract `words are nice but word's shaped as a porte-monnaie are better`.
- From the doc modal, highlight the word `porte- monnaie` and add it to the map terms.
- From the search terms bar, search `portemonnaie` and add it to the map terms. It shows an occurrence count of 1.
- From the search terms bar, search `porte- monnaie`. Nothing is found.

## Example Project

[Minimal example on the dev instance](https://dev.sub.gargantext.org/#/lists/acourt.yoelis@dev.sub.gargantext.org/188157)

## What is the current bug behavior?

- `porte- monnaie` highlighted from the doc modal has a count of 0 occurrences.
- `portemonnaie` selected from the search bar has a count of 1.
- Searching among already selected terms that contain a hyphen does not give the expected results.
    - `porte- monnaie` from the search bar yields no result.
    - `porte-monnaie` yields no result.
    - `porte  monnaie` (with two spaces) returns `porte monnaie` with a count of 0.
- No term is highlighted in the document view.

## What is the expected correct behavior?

- We should be able to select the term `porte-monnaie` from the document with the highlight.
- The term should have an occurrence count of 1.
- It should be highlighted in green in the document view.
- It should be searchable. Once it is selected, searching `porte-monnaie` should yield one result in one document.

## Relevant logs and/or screenshots

Tokenization separates words on the hyphen:

```
ghci> ffmap _terms_label myTerms
[[["words"],["are"],["nice"],[","],["word"],["'s"],["shape"],["as","a"],["porte"],["-"],["monnaie"],["are"],["better"]]]
```

The function `cleanTextForNLP` removes hyphens:

```
ghci> cleanTextForNLP "Words are nice but words shaped as a porte-monnaie are better"
"Words are nice but words shaped as a portemonnaie are better"
```

The function `multiterms` finds the terms. One of those terms is `portemonnaie`, without the hyphen, because `cleanTextForNLP` is called before `multiterms` is applied.

```haskell
ghci> multiterms defaultNLPServer EN "words are nice, word's shaped as a porte-monnaie are better"
[(Terms {_terms_label = ["portemonnaie"], _terms_stem = fromList ["portemonnaie"]},1),(Terms {_terms_label = ["word"], _terms_stem = fromList ["word"]},2)]
```
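
To make the mismatch concrete, here is a minimal sketch of the two transformations described in this issue. Both functions are hypothetical stand-ins written for illustration: `backendClean` only mimics the hyphen removal observed from `cleanTextForNLP` above, and `frontendDisplay` only mimics the space that the document UI adds after special characters. Neither is the actual gargantext code.

```haskell
module Main where

-- Hypothetical stand-in for the backend behaviour observed above:
-- hyphens are dropped, so "porte-monnaie" becomes "portemonnaie".
backendClean :: String -> String
backendClean = filter (/= '-')

-- Hypothetical stand-in for the document UI: a space is inserted after
-- each hyphen or single quote, so "porte-monnaie" becomes "porte- monnaie".
frontendDisplay :: String -> String
frontendDisplay = concatMap pad
  where
    pad c
      | c `elem` "-'" = [c, ' ']
      | otherwise     = [c]

main :: IO ()
main = do
  let raw = "porte-monnaie"
  putStrLn (backendClean raw)     -- prints: portemonnaie
  putStrLn (frontendDisplay raw)  -- prints: porte- monnaie
  -- The two layers no longer agree on the same word.
  print (backendClean raw == frontendDisplay raw)  -- prints: False
```

Any feature that compares the string shown to the user with the string stored by the backend fails on such a word, which is consistent with the counts of 0 reported above.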

## Possible fixes

Temporary workaround for users: try adding the compound term from the search bar *without the hyphen, quote or space* (e.g., `portefeuille` -> add to map terms).

### Dirty frontend fixes (mostly in `/purescript-gargantext/src/Gargantext/Core/NgramsTable/Functions.purs`):

   - Normalize at query time: if a highlighted term contains special characters, remove the spaces and those special characters before it is sent. If the highlighted ngram is already in the database, we should then find a new entry with a count != 0 in the table, without spaces or special characters.
   - To fix the highlights in the document UI: remove the special characters, and the spaces after them, when contexts are fetched and displayed in the document UI. This way the text will match the ngram in the database. Better still, we make the highlight rules more flexible.
   - Same thing for the search terms: either we issue several queries with simple transformations when we see special characters and merge the result sets (less restrictive result set), or we modify the query to remove special characters and spaces (more restrictive result set).
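
The two query-side options can be sketched as follows. This is written in Haskell for illustration only (the frontend is PureScript), and every name in it is hypothetical. It assumes that the only special characters are the hyphen and the single quote, and that spaces are removed only when they directly follow a special character, so that ordinary multi-word terms keep their spaces.

```haskell
module Main where

import Data.List (nub)

-- Assumption: only the hyphen and the single quote count as "special".
isSpecial :: Char -> Bool
isSpecial c = c `elem` "-'"

-- More restrictive option: drop each special character together with
-- the spaces that directly follow it.
normalizeQuery :: String -> String
normalizeQuery [] = []
normalizeQuery (c : cs)
  | isSpecial c = normalizeQuery (dropWhile (== ' ') cs)
  | otherwise   = c : normalizeQuery cs

-- Less restrictive option: build a few simple variants of the query.
-- The caller would run each of them and merge the result sets.
queryVariants :: String -> [String]
queryVariants q = nub [q, normalizeQuery q, spaced]
  where
    -- Special characters replaced by a single space.
    spaced = unwords (words (map (\c -> if isSpecial c then ' ' else c) q))

main :: IO ()
main = do
  putStrLn (normalizeQuery "porte- monnaie")  -- prints: portemonnaie
  print (queryVariants "porte-monnaie")
  -- prints: ["porte-monnaie","portemonnaie","porte monnaie"]
```

The variant list is deliberately small. Each extra variant widens the result set, so the choice of transformations is a precision trade-off rather than a purely technical one.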

### Saner fixes:

   - All text transformations should be removed from the frontend. We only normalize queries and define more flexible matching rules.
   - The use of `cleanTextForNLP` should be reevaluated. `porte-monnaie` should appear as such in the ngrams table and in the document UI, not as `portemonnaie` and `porte- monnaie`.
   - Highlighting should be more flexible: we don't transform the context to match the ngrams; instead we highlight wherever the normalized term matches the normalized part of the context displayed in the document UI.
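
One possible shape for this flexible highlighting is to normalize the context while remembering where each kept character came from, so that a match found in normalized form can be mapped back to offsets in the text actually displayed. The sketch below is an illustration under assumptions, not the existing implementation: the names are hypothetical, and it ignores hyphens, single quotes, spaces and case when comparing.

```haskell
module Main where

import Data.Char (toLower)
import Data.List (tails)

-- Assumption: these characters are ignored when comparing.
ignored :: Char -> Bool
ignored c = c `elem` "-' "

-- Normalized characters paired with their index in the original text.
normalizeIndexed :: String -> [(Int, Char)]
normalizeIndexed s =
  [ (i, toLower c) | (i, c) <- zip [0 ..] s, not (ignored c) ]

-- Inclusive (start, end) offsets, in the ORIGINAL context, of every place
-- where the normalized term matches the normalized context.
highlightSpans :: String -> String -> [(Int, Int)]
highlightSpans term context =
  [ (fst (head window), fst (last window))
  | n > 0
  , suffix <- tails indexed
  , let window = take n suffix
  , length window == n
  , map snd window == needle
  ]
  where
    indexed = normalizeIndexed context
    needle  = map snd (normalizeIndexed term)
    n       = length needle

main :: IO ()
main = do
  let context = "shaped as a porte-monnaie are better"
  print (highlightSpans "portemonnaie" context)    -- prints: [(12,24)]
  print (highlightSpans "porte- monnaie" context)  -- prints: [(12,24)]
```

The displayed text is never modified, only the comparison is normalized. Because spaces are ignored, this sketch can also match across word boundaries, so a real implementation would probably need an extra boundary check.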

Compound words seem very tricky; we should plan a spike to look at the strategies used in PostgreSQL full-text search or Algolia.

### Fix at the UX level:

From my understanding, one cannot add an ngram that has not already been defined in the database by the NLP layer. So there are many ways to create terms with a count of 0 with the highlight mechanism (when one highlights too many words, not enough words, a span between words, or pseudo-words excluded during the NLP phase). So before defining a new term, we should evaluate whether we can create a map term from the user selection. If not, we display an error.
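
A minimal sketch of that check, assuming the set of terms known to the NLP layer is available in its stored form. The names `KnownTerms`, `normalize` and `validateSelection` are hypothetical, and the normalization is only a stand-in for whatever the backend really applies.

```haskell
module Main where

import qualified Data.Set as Set

-- Terms already defined by the NLP layer, in their stored form.
type KnownTerms = Set.Set String

-- Hypothetical normalization: drop hyphens and single quotes together
-- with the spaces that directly follow them.
normalize :: String -> String
normalize [] = []
normalize (c : cs)
  | c `elem` "-'" = normalize (dropWhile (== ' ') cs)
  | otherwise     = c : normalize cs

-- Either an error message to display, or the stored term to add to the
-- map terms.
validateSelection :: KnownTerms -> String -> Either String String
validateSelection known selection
  | candidate `Set.member` known = Right candidate
  | otherwise =
      Left ("No known term matches the selection: " ++ show selection)
  where
    candidate = normalize selection

main :: IO ()
main = do
  let known = Set.fromList ["portemonnaie", "word"]
  print (validateSelection known "porte- monnaie")  -- prints: Right "portemonnaie"
  print (validateSelection known "porte")           -- a Left carrying the error message
```

Returning the stored term on success means the entry added to the map terms is the one that already has a count, instead of a new entry with a count of 0.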

It's still suboptimal, though. Since the list of potential map terms is already defined by the NLP layer, it would be better to let the user select the term or group of terms with a click (since the ngram boundaries are already known).

Do we want the NLP layer to be the only source of truth, though? A user should be able to enter arbitrary terms or groups of terms and use them for the clustering. If we want this degree of flexibility, can we create an isomorphic and efficient data structure allowing us to navigate between the human-readable form of the document and the degraded form used by the machine? Is normalization at every layer the only way?
