Improve PhyloMaker performance by up to 50%
@anoe This is the first of several patches I plan to land to improve the performance and readability of the Phylo code.
I have spent a few weeks getting familiar with the code, which is quite complex and intricate. After a few false starts, I went back to the drawing board and started with the low-hanging fruit: small wins where the performance gains are immediately visible. At a glance, this is what I have done:
- I have replaced the `nub` and `sort` functions in `PhyloMaker` with `nubOrd` or, where possible, with `nub` and `sort` from the `discrimination` package, which run in O(n); by contrast, `Data.List.nub` is notoriously slow and quadratic (O(n^2)), and `Data.List.sort` is a merge sort that runs in O(n*log n). A minimal sketch of the substitution follows below.
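For illustration, here is a minimal sketch of the kind of substitution involved. The `dedupSort*` helpers are made up for this example and are not from the gargantext code base; only the imports and the complexity notes reflect the real libraries:

```haskell
module NubSortSketch where

import qualified Data.List                 as L        -- O(n^2) nub, O(n log n) merge sort
import           Data.Containers.ListUtils (nubOrd)    -- O(n log n), from `containers`
import qualified Data.Discrimination       as D        -- O(n) nub/sort, from `discrimination`

-- Hypothetical "deduplicate then sort" helpers, just to compare the variants.
dedupSortSlow, dedupSortLinear :: [Int] -> [Int]
dedupSortSlow   = L.sort . L.nub     -- before: quadratic nub, then merge sort
dedupSortLinear = D.sort . D.nub     -- after: both linear via discrimination

dedupOnly :: Ord a => [a] -> [a]
dedupOnly = nubOrd                   -- fallback when only an Ord instance is available
```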
Furthermore, the majority of the time in `PhyloMaker` is spent calculating `relatedComponents`, but since we do this over every similarity `Set` independently, we can parallelise it: I have added a strategic `parMap` (a sketch of the pattern is below).
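This is roughly the shape of the change, assuming a hypothetical worker that stands in for `relatedComponents`; the real signatures in the code base differ:

```haskell
module ParallelSketch where

import Control.DeepSeq             (NFData)
import Control.Parallel.Strategies (parMap, rdeepseq)

-- Apply the per-set worker to every similarity set, evaluating the results
-- in parallel. `similaritySet` and `component` are placeholder type variables.
relatedComponentsAll
  :: NFData component
  => (similaritySet -> [component])  -- per-set worker, sequential inside
  -> [similaritySet]                 -- one entry per similarity Set
  -> [[component]]
relatedComponentsAll worker = parMap rdeepseq worker
-- previously: map worker, i.e. each similarity Set processed one after another
```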
This patch also adds a new (test) executable called `garg-phylo-profile`, which shares code with `garg-phylo` but is meant to be used for profiling, as it loads some canned data.
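For context, a profiling driver of this kind typically just loads the canned input, forces the result, and exits, so the profile measures the build itself. Here is a self-contained toy sketch; `loadCannedCorpus`, `buildPhylo`, and the input path are hypothetical stand-ins, not the actual `garg-phylo-profile` code:

```haskell
module Main where

import Control.DeepSeq   (force)
import Control.Exception (evaluate)

-- Stand-in loader: one document per line of the canned file.
loadCannedCorpus :: FilePath -> IO [String]
loadCannedCorpus = fmap lines . readFile

-- Stand-in for the real phylo-building pipeline.
buildPhylo :: [String] -> [(String, Int)]
buildPhylo = map (\doc -> (doc, length doc))

main :: IO ()
main = do
  corpus <- loadCannedCorpus "bench/canned-corpus.txt"  -- hypothetical path
  phylo  <- evaluate (force (buildPhylo corpus))        -- force full evaluation
  putStrLn ("built " <> show (length phylo) <> " entries")
```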
Results
Before my patch, running `garg-phylo-profile` would take:

    real 0m17,210s
    user 0m25,466s
    sys  0m0,747s
After my patch:

    real 0m9,230s
    user 0m34,577s
    sys  0m0,465s
In terms of the eventlog, we are doing a bit better, but I need to investigate further: I think there are parts that are highly parallel and can still be improved, though that will happen in a separate round of investigation:
As you can see, this is now much more parallel (green means "doing work" here); productivity went up a bit and GC time went down a bit. However, there is a huge gap where we are (likely) waiting for work to be computed (my hunch is that we are waiting for the similarities to be computed). This can be improved, but that will happen in due course.