For PostgreSQL queries, don't use `offset`
This is inspired by the articles:
- https://cedardb.com/blog/pagination/
- https://www.citusdata.com/blog/2016/03/30/five-ways-to-paginate/
Basically, using offset
in PostgreSQL queries results in these issues:
- When there are 'ties' at the page break, you can get unpredictable results on the current and next page (because the second time you make a query, sorting can be done differently). Although our 'terms' have unique names and, e.g. for docs, it rarely happens that two documents have the same title, however, we allow also sorting by score, document source, document date, document tag, and these offer real threats of ties.
-
offset
requires the DB to sort all results and then throw awayoffset
number of these
Proposition is to replace offset
by something else.
One solution is a PostgreSQL cursor
: https://www.postgresql.org/docs/current/plpgsql-cursors.html. However, this requires keeping an open connection to the client.
Another proposition is to use something like last-id
. If we sort by id
, then for the next page, we just require results with id > last_id
. If we sort by name
, we could additionally sort by id
and return results with (name == last_name AND id > last_id) OR (name > last_name)
. The additional id
part in ordering and next query allows splitting the 'tie' when two name
's are the same. Also, by requesting name >= last_name
we already allow PostgreSQL to optimize the query by first filtering by name and then sorting (with offset
it has to sort the whole result-set first).
From a technical point of view in our code, we currently have: Maybe Offset, Maybe Limit, Maybe OrderBy
. One would change that to Maybe Id, Maybe Limit, Maybe OrderBy
. The (Offset, OrderBy)
pair should be replaced by (Id, OrderBy)
.
This issue is just a proposition, I'm open to any comments, including ones that just tell us to stick with Offset
:)