For PostgreSQL queries, don't use `offset` (#434) · Issues · gargantext / haskell-gargantext

For PostgreSQL queries, don't use `offset`

This is inspired by the articles:

Basically, using offset in PostgreSQL queries results in these issues:

When there are 'ties' at the page break, you can get unpredictable results on the current and next page (because the second time you make a query, sorting can be done differently). Although our 'terms' have unique names and, e.g. for docs, it rarely happens that two documents have the same title, however, we allow also sorting by score, document source, document date, document tag, and these offer real threats of ties.
offset requires the DB to sort all results and then throw away offset number of these

Proposition is to replace offset by something else.

One solution is a PostgreSQL cursor: https://www.postgresql.org/docs/current/plpgsql-cursors.html. However, this requires keeping an open connection to the client.

Another proposition is to use something like last-id. If we sort by id, then for the next page, we just require results with id > last_id. If we sort by name, we could additionally sort by id and return results with (name == last_name AND id > last_id) OR (name > last_name). The additional id part in ordering and next query allows splitting the 'tie' when two name's are the same. Also, by requesting name >= last_name we already allow PostgreSQL to optimize the query by first filtering by name and then sorting (with offset it has to sort the whole result-set first).

From a technical point of view in our code, we currently have: Maybe Offset, Maybe Limit, Maybe OrderBy. One would change that to Maybe Id, Maybe Limit, Maybe OrderBy. The (Offset, OrderBy) pair should be replaced by (Id, OrderBy).

This issue is just a proposition, I'm open to any comments, including ones that just tell us to stick with Offset :)

This is inspired by the articles:
- https://cedardb.com/blog/pagination/
- https://www.citusdata.com/blog/2016/03/30/five-ways-to-paginate/

Basically, using `offset` in PostgreSQL queries results in these issues:

1. When there are 'ties' at the page break, you can get unpredictable results on the current and next page (because the second time you make a query, sorting can be done differently). Although our 'terms' have unique names and, e.g. for docs, it rarely happens that two documents have the same title, however, we allow also sorting by score, document source, document date, document tag, and these offer real threats of ties.
2. `offset` requires the DB to sort all results and then throw away `offset` number of these

Proposition is to replace `offset` by something else.

One solution is a PostgreSQL `cursor`: https://www.postgresql.org/docs/current/plpgsql-cursors.html. However, this requires keeping an open connection to the client.

Another proposition is to use something like `last-id`. If we sort by `id`, then for the next page, we just require results with `id > last_id`. If we sort by `name`, we could additionally sort by `id` and return results with `(name == last_name AND id > last_id) OR (name > last_name)`. The additional `id` part in ordering and next query allows splitting the 'tie' when two `name`'s are the same. Also, by requesting `name >= last_name` we already allow PostgreSQL to optimize the query by first filtering by name and then sorting (with `offset` it has to sort the whole result-set first).

From a technical point of view in our code, we currently have: `Maybe Offset, Maybe Limit, Maybe OrderBy`. One would change that to `Maybe Id, Maybe Limit, Maybe OrderBy`. The `(Offset, OrderBy)` pair should be replaced by `(Id, OrderBy)`.

This issue is just a proposition, I'm open to any comments, including ones that just tell us to stick with `Offset` :)