# Streamlined deployments of GGTX
@fmaniere @cgenie I'm opening this ticket to get the ball rolling on the deployment story, as well as to give us visibility and high-level coordination on the topic.
I know that both of you have probably given this some individual thought, and Fabien has been doing some excellent work in trying to get things moving, but I think we need a holistic plan to bring us from A to B.
NOTE: This is going to be a long ticket, so grab your favourite beverage before diving in.
## Problem description
Gargantext is an open source project under active development, but its deployments have long been centralised, meaning that in that area the bus factor is exactly 1. Recently, as Murphy's law would have it, the bus factor manifested itself, and deployments stopped/slowed down, leaving a lot of valuable contributions stagnating on the `dev` branch without any way of being pushed out for users to test.
### Current (guessed) status quo
The current deployment scenario is, as far as I understand it, the following:

1. A main GGTX engineer hops onto an instance he would like to upgrade (let's pick `dev.sub.gargantext.org` as an example);
2. A "feature branch" is cut, typically from a precise `dev` commit, and that branch is used as a baseline for the changes;
3. The code is compiled;
4. The version is bumped, with a versioning scheme which at the moment is not publicly disclosed and is, in general, not obvious to guess (i.e. it doesn't adhere to the classic PVP policy);
5. A changelog is (manually) forged that highlights all the changes between the two versions; as far as I understand, it is curated by a human;
6. The code is run (probably guarded by some `systemd` service) and the DB migrations are executed;
7. The release branch is merged back into `dev`.
I'm not exactly sure whether the code gets run before the version bump and changelog have happened, but I have included the "run" step last because it's not obvious what should happen in terms of rollbacks (especially database migrations) in case step 6 fails.
## Requirements
Going forward, we have a bunch of requirements for deployments:
- Inspectable -- the deployment process should be "auditable", i.e. it should be documented or, even better, self-contained in a bunch of scripts (Ansible, for example), so that there are no "obscure steps" left to the imagination;
- Automatic -- it should ideally be possible to run a single command to tag a release (example: `gargantext release -o <version_number>`), to generate a changelog (example: `gargantext generate-changelog -i <version_number>`) and to deploy (example: `gargantext deploy -i dev.sub.gargantext.org`);
- Decentralised -- it should be possible for a pool of trusted GGTX engineers to all deploy, provided their SSH keys are present on the server(s). This way, the bus factor increases;
- Nix-based -- we are already using `nix` to provide system dependencies, so it's conceivable that `nix` should play a role. We should research whether there are tools that integrate with Nix (I know that Ansible should, at least) or whether Nix has a bespoke way of deploying things (I know about `NixOps`, but that is NixOS-specific);
- CI-powered -- see also ticket #485 that @cgenie opened; given that CI has already cached the dependencies and has to build GGTX anyway (even though not in a shape that is currently viable for production), it feels like a waste not to use it for some sort of continuous deployment. We could in theory also explore Hydra if that helps with deployments, but I don't think it does, and I think that setting up a Hydra instance would be a bit of a hassle.
## Plan of action
I have a lot of ideas, but there are a few preliminary steps we can conceivably take, for example researching the solution space (I have used my fair share of tools for other clients, but I'm always interested in hearing about new / more fitting ones). I will open other umbrella tickets to get us going before coming back here with a more thorough implementation plan, as this ticket is already long enough.
## Straw-man proposal
NOTE: I'm discussing only the backend for now; I need to think about how the frontend fits into this. To me, that's a separate project, and as such the backend release and deploy code shouldn't be intertwined with it, so I'm omitting it for now. I do understand that the release and deploy of the two happen in lockstep, as they do for many projects out there.
Given the discussion and the feedback from the initial writeup, here is my initial proposal, which we can revise and discuss as we see fit:
Let's introduce a set of CLI commands, `gargantext release` and `gargantext deploy`, whose semantics will be explained later. In a nutshell, `gargantext release` can be used to tag a new release, generate a new changelog, push things to `dev` and that's about it, whereas `gargantext deploy` can be used to actually deploy the server to a target machine.
### `gargantext release`
The `release` command should be responsible for:
1. Discovering the current release of `gargantext` via the programmatic API that GHC exposes through the `Paths_gargantext` module, in the `version` CAF;
2. Bumping the release by tweaking the `Version` type, following the PVP loosely -- I propose we do something very simple (see the sketch after this list):
   a. By default, if the `release` command is run without any strong preferences, a minor release is performed, meaning that in the example above we would go from `0.0.7.4.7` to `0.0.7.4.8`;
   b. In the initial description I stressed the term "loosely": if we were to stick to the classic PVP, in case of breaking changes we would have to bump the second digit by one, so instead of `0.0.7.4.7` we would release `0.1.0.0`, which is a bit counterintuitive given that our GGTX versioning scheme is not PVP-compliant. I propose we simplify our lives and use a very simple schema:
      I. For minor changes we bump the last digit, so we have `0.0.7.4.7` -> `0.0.7.4.8`;
      II. For breaking changes we bump the penultimate digit, so we have `0.0.7.4.7` -> `0.0.7.5.0`;
      III. For new versions of the platform we bump the 3rd digit, so we have `0.0.7.4.7` -> `0.0.8.0.0`;
3. Generating the changelog (see the later section);
4. Modifying the current version in `gargantext.cabal` via `sed`, in place. We could use something more sophisticated like cabal-install-parsers, but that would be overkill; a simple `sed` substitution is enough;
5. Ensuring that the project still builds;
6. Pushing the newly released version to `dev`: that will trigger a CI rule to build the executables which will be used for the deploy of that particular version.
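To make the bumping rules concrete, here is a minimal Haskell sketch of what the version logic could look like. `Paths_gargantext` and its `version` CAF are the real cabal-generated API; `BumpKind`, `bumpVersion` and the `main` wrapper are hypothetical names I'm using purely for illustration:

```haskell
-- Minimal sketch of the proposed (non-PVP) bumping scheme. `Paths_gargantext`
-- is the module cabal generates for the package; everything else here is a
-- hypothetical illustration, not existing code.
module Release where

import Data.Version (Version, makeVersion, showVersion, versionBranch)
import Paths_gargantext (version)

data BumpKind = Minor | Breaking | Platform

-- Minor    bumps the last digit:       0.0.7.4.7 -> 0.0.7.4.8
-- Breaking bumps the penultimate one:  0.0.7.4.7 -> 0.0.7.5.0
-- Platform bumps the third digit:      0.0.7.4.7 -> 0.0.8.0.0
bumpVersion :: BumpKind -> Version -> Version
bumpVersion kind v = makeVersion (go kind (versionBranch v))
  where
    go Minor    [a, b, c, d, e] = [a, b, c, d, e + 1]
    go Breaking [a, b, c, d, _] = [a, b, c, d + 1, 0]
    go Platform [a, b, c, _, _] = [a, b, c + 1, 0, 0]
    go _        branch          = branch -- unexpected shape: leave it alone

main :: IO ()
main = putStrLn (showVersion (bumpVersion Minor version))
```

The in-place edit of step 4 could then be as dumb as `sed -i 's/^version:.*/version: 0.0.7.4.8/' gargantext.cabal`, with the new version string computed as above.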
### `gargantext deploy`
The `deploy` command should take as input at least a version and a target host, and (a rough sketch follows the list):
1. Acquire the binaries produced by CI after the push to `dev`;
2. Push the binaries to the target host;
3. Apply the DB migrations on the host, if any;
4. Restart the server.
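As a strawman, here is what `deploy` could shell out to under the hood. The artifact URL layout, remote paths, migration entry point and `systemd` unit name are all placeholder assumptions on my part, not a description of our real setup:

```haskell
-- Rough sketch of the deploy steps, assuming passwordless SSH access and a
-- `gargantext` systemd unit on the target; all paths/URLs are hypothetical.
module Deploy where

import System.Process (callProcess)

deploy :: String -- ^ version, e.g. "0.0.7.4.8"
       -> String -- ^ target host, e.g. "dev.sub.gargantext.org"
       -> IO ()
deploy ver host = do
  -- 1. Acquire the CI-built binary for this version (hypothetical URL layout).
  callProcess "curl" ["-fLo", "gargantext-server",
                      "https://ci.example.org/artifacts/" <> ver <> "/gargantext-server"]
  -- 2. Push the binary to the target host.
  callProcess "scp" ["gargantext-server", host <> ":/opt/gargantext/bin/"]
  -- 3. Apply the DB migrations on the host (hypothetical migration command).
  callProcess "ssh" [host, "/opt/gargantext/bin/gargantext-server migrate"]
  -- 4. Restart the server.
  callProcess "ssh" [host, "sudo systemctl restart gargantext"]
```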
### Deployment UML sequence diagrams
In the following section I'm including some visual cues to showcase how things would work, in the form of UML sequence diagrams:

*(figure: `gargantext release` sequence diagram)*

*(figure: `gargantext deploy` sequence diagram)*
### Generating the changelog
In order to generate the changelog, we will need to acquire the set of commits pushed to `dev` between the last "release commit" (i.e. the one pushed during a `gargantext release`) and `HEAD`, filter only merge requests, and have a way for merge requests to be "annotated" so that we can point back to the relevant issue. There are two ways to do this, one simple and naive, the other more complicated:
- Simple: every last commit that we push before opening an MR should look something like `Fixes #xxx (did feature x y z)` -- then we can scan the `git log` history, look for such text fragments and simply put the issue number and the brief description of the fix in the changelog. We don't need a very long description, because the changelog will already include a hyperlink to the full issue; no need to be verbose;
- Complicated: we still need to mention the related ticket an MR is closing, but then we could use the GitLab API to get the issue title from the MR number. The workflow would be that from the `git log` we identify the MR commits, we grab the text description of the MR via the GitLab API, we find the referenced issue, and we include that and its title.
I think both approaches require discipline from developers, and both carry the risk of missing changelog entries; even the complicated approach still relies on developers mentioning the initial issue in the MR description.
My slight inclination would be to go with the "complicated" approach if getting access to a GitLab API key is easy enough, but to have a "preflight" check during `release`: before actually committing anything to `gitlab`, the command should spit out the rendered changelog for auditing. A human can then inspect that the changelog is sensible and, if it isn't, amend the MR description to include the backlink.
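For the "simple" flavour, the scanning step could be as naive as the following sketch; the tag used for the range and the output format are assumptions on my part:

```haskell
-- Naive sketch of the "simple" approach: scan `git log` subjects for
-- "Fixes #123 (short description)" markers and render one changelog
-- bullet per match.
module Changelog where

import Data.Char (isDigit)
import Data.List (isPrefixOf, tails)
import System.Process (readProcess)

-- Extract (issue number, rest of the subject) if a commit subject
-- contains a "Fixes #" marker anywhere.
fixesEntry :: String -> Maybe (String, String)
fixesEntry subject =
  case [t | t <- tails subject, "Fixes #" `isPrefixOf` t] of
    (t:_) -> let (num, rest) = span isDigit (drop (length "Fixes #") t)
             in if null num then Nothing else Just (num, dropWhile (== ' ') rest)
    []    -> Nothing

main :: IO ()
main = do
  -- Commit subjects since the last release commit (the tag name is hypothetical).
  subjects <- lines <$> readProcess "git" ["log", "--format=%s", "v0.0.7.4.7..HEAD"] ""
  let entries = [e | Just e <- map fixesEntry subjects]
  mapM_ (\(num, descr) -> putStrLn ("- #" <> num <> " " <> descr)) entries
```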
## Technical decisions
In the past I have used Ansible, and I find it a bit clunky to set up and fragile to write, but the glaring advantage is that code gets "shipped" via SSH to the remote machine and executed there, meaning that after we finish #487 any of the trusted developers can perform deploys and releases from their local machines, while preventing malicious users from just checking out the code and doing the same.
I don't really mind whether we end up using Ansible or not, but the key feature is that it should be a system where we have a permission/auth mechanism to prevent unauthorised deploys. I'm a dinosaur in this regard, so I'd argue we should deploy to bare metal and not into a Docker container. I'm in favour of not reinventing the wheel here and using something like Ansible or similar, as it's really battle-tested and used by a gazillion companies out there.