Improve Phylo robustness and performance (#292) · gargantext / haskell-gargantext

In the context of #290 (closed), I took a deeper look at the Phylo code, and in https://gitlab.iscpif.fr/gargantext/haskell-gargantext/merge_requests/232 (!232, merged) I paved the way for more systematic testing and benchmarking of Phylo.

While looking at the code, I have realised a few things:

1. The Phylo code is fairly slow (this is advertised in the UI as well). I think there are a couple of places where we could try to parallelise it, but we need benchmarks and performance tests _first_, so that we know what we are improving;

2. While looking at the generated `.dot` file, I noticed that we seem to emit the wrong data in some places. Take a look at this excerpt, for example:

```dot
        group1984198620 [fontname=Arial
                        ,shape=square
                        ,penwidth=4
                        ,nodeType=group
                        ,gid=group1984198620
                        ,from=1984
                        ,to=1986
                        ,strFrom="\"1986-01-01\""
                        ,strTo="\"1986-01-01\""
                        ,branchId="0 2 1 1 1 1 1 1 1 1 1"
                        ,bId=0
                        ,support=2
                        ,weight="Just 2.0"
                        ,source="[]"
                        ,sourceFull="[]"
                        ,density=0.0
                        ,cooc="fromList [((2,2),3.0)]"
                        ,lbl="\"competitive intelligence\""
                        ,foundation="\"2\""
                        ,role="\"3.0\""
                        ,frequence="\"6.359649122807036e-2\""
                        ,seaLvl="[0.0,0.1,0.2,0.30000000000000004,0.4,0.5,0.6,0.7,0.7999999999999999,0.8999999999999999,0.9999999999999999]"];
```

This looks a bit iffy to me (but maybe that's intended):

* The `weight` field is being represented as the `Just 2.0` string, which sounds like it's a mistake -- shouldn't this be just `2.0`, treated as a double?

* The `cooc` field includes `fromList`, i.e. it is the direct `show` of the underlying `Map`, which seems suspect; I would have expected just a list of tuples here;

* Things like `strFrom` and `strTo` include a quoted date, whereas I would have expected them _not_ to include the internal quotes (i.e. render this as just `1986-01-01`, for example);

* Numbers like `foundation`, `role`, `frequence` etc. are all rendered as strings, but possibly they could be numbers?
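
To make the points above concrete, here is a minimal sketch of what typed attribute rendering could look like. All names (`AttrValue`, `renderAttr`, `weightAttr`, `coocAttr`, `renderAttrs`) are hypothetical and not taken from the Phylo code. One detail worth noting: `show` on a small `Double` yields exponent notation (as in the `frequence` value above), which as far as I know is not a valid unquoted DOT numeral, so the sketch uses `showFFloat` instead.

```haskell
import qualified Data.Map.Strict as Map
import Data.List (intercalate)
import Numeric (showFFloat)

-- | Hypothetical attribute value: numbers stay numbers, text is quoted exactly once.
data AttrValue
  = AttrInt Int
  | AttrDouble Double
  | AttrText String
  deriving (Eq, Show)

-- | Render the right-hand side of a DOT attribute.
renderAttr :: AttrValue -> String
renderAttr (AttrInt n)    = show n
-- showFFloat avoids exponent notation such as 6.3e-2.
renderAttr (AttrDouble d) = showFFloat Nothing d ""
renderAttr (AttrText s)   = "\"" ++ concatMap escape s ++ "\""
  where
    -- Only the double quote is escaped inside a DOT quoted string.
    escape '"' = "\\\""
    escape c   = [c]

-- | An absent weight yields no attribute at all, rather than a shown 'Maybe'.
weightAttr :: Maybe Double -> Maybe (String, AttrValue)
weightAttr = fmap (\w -> ("weight", AttrDouble w))

-- | Render the co-occurrence map as a plain list of tuples, without 'fromList'.
coocAttr :: Map.Map (Int, Int) Double -> (String, AttrValue)
coocAttr m = ("cooc", AttrText (show (Map.toList m)))

-- | Render a full attribute list, e.g. [from=1984,strFrom="1986-01-01"].
renderAttrs :: [(String, AttrValue)] -> String
renderAttrs kvs =
  "[" ++ intercalate "," [k ++ "=" ++ renderAttr v | (k, v) <- kvs] ++ "]"
```

With this shape, `renderAttr (AttrText "1986-01-01")` carries a single level of quoting, and a `Just 2.0` weight is emitted as the bare number `2.0`.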

3. As mentioned in #290, we have an issue where the `cooc` field becomes _too long_; for now I have fixed this by manually patching `graphviz`, but it sounds like we should come up with a more succinct representation, if possible. It looks like strings of _unbounded_ length (or of length linear in the number of documents) are going to be a problem.
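
As a starting point for discussion, here is one possible direction for a shorter `cooc`, sketched under assumptions. `encodeCooc` and `topCooc` are hypothetical names, and I have not checked what consumers of the `.dot` file need. The compact encoding only shrinks the constant factor (the string is still linear in the number of entries); actually bounding the length needs something like `topCooc`, which is lossy and only acceptable if dropping the lightest entries is fine.

```haskell
import qualified Data.Map.Strict as Map
import Data.List (intercalate, sortBy)
import Data.Ord (comparing, Down(..))
import Numeric (showFFloat)

-- | Encode each entry as "i,j:w" and join entries with ';'.
-- This drops the "fromList [((" ... "))]" overhead of 'show'.
encodeCooc :: Map.Map (Int, Int) Double -> String
encodeCooc m =
  intercalate ";"
    [ show i ++ "," ++ show j ++ ":" ++ showFFloat Nothing w ""
    | ((i, j), w) <- Map.toList m
    ]

-- | Keep at most n entries, preferring the highest weights.
-- Lossy, hypothetical policy: ties are broken by key order.
topCooc :: Int -> Map.Map (Int, Int) Double -> Map.Map (Int, Int) Double
topCooc n =
  Map.fromList . take n . sortBy (comparing (Down . snd)) . Map.toList
```

For the excerpt above, `encodeCooc` would turn `fromList [((2,2),3.0)]` into `2,2:3.0`.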

It would be nice to spend a bit of time investigating the performance of Phylo as well as increasing its test coverage, because if the above rendering _wasn't_ intentional, a test would have caught this.
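
The kind of test I have in mind could be as simple as the sketch below. It assumes the attributes of a node are available as `(name, raw value)` pairs, where the raw value is the text between the outer quotes in the `.dot` file; `showLeaks` and `leakyAttrs` are hypothetical names. The check is a heuristic: a legitimate label containing `Just ` or `fromList` would be flagged too.

```haskell
import Data.List (isInfixOf)

-- | Substrings suggesting that a Haskell value was rendered via 'show':
-- a shown 'Maybe', a shown 'Map', or an escaped inner quote (backslash + quote).
showLeaks :: [String]
showLeaks = ["Just ", "fromList", "\\\""]

-- | Names of the attributes whose raw value contains one of the markers above.
leakyAttrs :: [(String, String)] -> [String]
leakyAttrs attrs = [k | (k, v) <- attrs, any (`isInfixOf` v) showLeaks]
```

A test would then assert that `leakyAttrs` returns an empty list for every node of a rendered Phylo. On the excerpt above it would flag `weight`, `cooc`, `strFrom`, `strTo`, `lbl`, `foundation`, `role` and `frequence`.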

@anoe I'm more than happy to take a look at this in the new year.