Improve Phylo robustness and performance
In the context of #290 (closed), I have taken a deeper look at the Phylo code, and in the context of !232 (merged) I have paved the way for more systematic testing and benchmarking of Phylo.
While looking at the code, I have realised a few things:
- The Phylo code is fairly slow (and this is advertised in the UI as well), but I think there are a couple of places where we could try to parallelise it. We need benchmarks and performance tests first, though, to make sure we know what we are improving;
- While taking a look at the generated `.dot` file, I have noticed that we seem to emit the wrong data in some places. Take a look at this excerpt, for example:
group1984198620 [fontname=Arial
,shape=square
,penwidth=4
,nodeType=group
,gid=group1984198620
,from=1984
,to=1986
,strFrom="\"1986-01-01\""
,strTo="\"1986-01-01\""
,branchId="0 2 1 1 1 1 1 1 1 1 1"
,bId=0
,support=2
,weight="Just 2.0"
,source="[]"
,sourceFull="[]"
,density=0.0
,cooc="fromList [((2,2),3.0)]"
,lbl="\"competitive intelligence\""
,foundation="\"2\""
,role="\"3.0\""
,frequence="\"6.359649122807036e-2\""
,seaLvl="[0.0,0.1,0.2,0.30000000000000004,0.4,0.5,0.6,0.7,0.7999999999999999,0.8999999999999999,0.9999999999999999]"];
This looks a bit iffy to me (but maybe that's intended):
- The `weight` field is being represented as the `Just 2.0` string, which sounds like a mistake: shouldn't this be just `2.0`, treated as a double?
- The `cooc` field includes the `fromList` prefix, which is the direct `show` of the underlying `Map`. This seems suspect; I would have expected just a list of tuples here;
- Things like `strFrom` and `strTo` include a quoted date, whereas I would have expected them not to include the internal quotes (i.e. render this as just `1986-01-01`, for example);
- Values like `foundation`, `role`, `frequence` etc. are all strings, but possibly they could be numbers?
- As mentioned in #290 (closed), we have an issue where the `cooc` field becomes too long; for now I have fixed this by manually patching `graphviz`, but it sounds like we should come up with a more succinct representation, if possible. It looks like unbounded strings (or strings whose length is linear in the number of documents) are going to be a problem.
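To make the rendering points above concrete, here is a minimal sketch of what I would expect the attribute values to look like. All helper names (`renderWeight`, `renderCooc`, `dotQuote`) are hypothetical and not taken from the Phylo code, and I am assuming that `weight` is a `Maybe Double` and `cooc` a `Map` keyed by pairs of `Int`, which is only what the excerpt suggests.

```haskell
import qualified Data.Map.Strict as Map

-- Hypothetical helper: render an optional weight as a bare number
-- instead of the output of 'show' on the 'Maybe' (e.g. "Just 2.0").
-- An absent weight becomes the empty string (an assumption).
renderWeight :: Maybe Double -> String
renderWeight = maybe "" show

-- Hypothetical helper: render the co-occurrence map as a plain list
-- of tuples, which drops the "fromList " prefix that 'show' on a Map adds.
renderCooc :: Map.Map (Int, Int) Double -> String
renderCooc = show . Map.toList

-- Hypothetical helper: quote a value for DOT exactly once, escaping only
-- inner double quotes. Calling 'show' on a String and then quoting it
-- again is what produces values such as "\"1986-01-01\"".
dotQuote :: String -> String
dotQuote s = "\"" ++ concatMap escape s ++ "\""
  where
    escape '"' = "\\\""
    escape c   = [c]

-- Render a single key=value attribute.
attr :: String -> String -> String
attr key value = key ++ "=" ++ value
```

With these, `attr "weight" (renderWeight (Just 2.0))` gives `weight=2.0`, and `attr "strFrom" (dotQuote "1986-01-01")` gives `strFrom="1986-01-01"`. This does nothing about the length of `cooc`; it only removes the `show` artefacts.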
It would be nice to spend a bit of time investigating the performance of Phylo as well as increasing its test coverage, because if the above rendering wasn't intentional, a test would have caught this.
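As an illustration of the kind of check I have in mind, here is a rough sketch that flags attribute values which look like leaked `show` output. The function names are hypothetical, the input is assumed to be a list of key/value pairs already extracted from a node (after DOT unquoting), and the heuristics are deliberately crude, so a label that legitimately starts and ends with a quote would be a false positive.

```haskell
import Data.List (isPrefixOf)

-- Heuristic (an assumption, not an existing Phylo check): a value looks
-- like leaked 'show' output if it starts with a Maybe constructor, starts
-- with the "fromList " prefix of a Map, or is itself wrapped in quotes.
leaksShow :: String -> Bool
leaksShow v =
     "Just " `isPrefixOf` v
  || v == "Nothing"
  || "fromList " `isPrefixOf` v
  || wrappedInQuotes v

-- True when the value has at least two characters and both the first
-- and the last one are double quotes.
wrappedInQuotes :: String -> Bool
wrappedInQuotes ('"' : rest@(_ : _)) = last rest == '"'
wrappedInQuotes _                    = False

-- Return the keys of all attributes whose value looks suspicious.
suspectAttributes :: [(String, String)] -> [String]
suspectAttributes attrs = [key | (key, value) <- attrs, leaksShow value]

-- A few attributes copied from the excerpt above.
sampleAttributes :: [(String, String)]
sampleAttributes =
  [ ("weight", "Just 2.0")
  , ("cooc", "fromList [((2,2),3.0)]")
  , ("strFrom", "\"1986-01-01\"")
  , ("bId", "0")
  ]
```

On the sample, `suspectAttributes sampleAttributes` returns `["weight","cooc","strFrom"]`; a test asserting that this list is empty for every node in a generated `.dot` file would have flagged the excerpt.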
@anoe I'm more than happy to take a look at this in the new year.