ADR-001 GitHub Failover

Status

✅ Accepted

Context

GitHub is used for multiple purposes within the organisation:

  1. Source control
  2. Pipelines (GitHub Actions)
  3. SSO (authentication with other service providers such as Auth0, Sentry)
  4. Project management

Likely: GitHub goes down for a short period of time:

  1. Source control - inability to commit and pull code is a low to medium risk. There is the possibility that a team might not be able to action a production bug as a result of this feature not being available. Alternatives: GitLab, AWS CodeCommit

  2. Pipelines - inability to run jobs or deploy code is a low to medium risk. There is the possibility that a team might not be able to action a production bug as a result of this feature not being available. Alternatives: CircleCI, AWS CodePipeline

  3. SSO - inability to authenticate with GitHub is a medium risk. We use GitHub as the SSO provider for Auth0, which provides access to the Cloud Platform cluster, including application logs. It is also used for authentication with CircleCI. The impact on cluster access will be limited to new team members who have not yet generated their kubeconfig and to those whose sessions have expired. Access to logs and other services that use GitHub SSO will be affected only for people who have not logged in recently; others will still have valid sessions. Alternatives: no clear alternative for all use cases; separate accounts are possible for some

  4. Project management - inability to perform project management activities is low risk. Alternatives: Trello, Jira, Notion

Unlikely: GitHub goes down for a prolonged period of time, for example over a day:

  1. Source control - inability to commit and pull code is a high risk. The need to make changes, update dependencies and maintain secure systems requires regular ability to update code

  2. Pipelines - inability to run jobs or deploy code is a high risk. The need to make changes, update dependencies and maintain secure systems requires regular ability to update systems

  3. SSO - inability to authenticate is a high risk. Losing access to the services that require GitHub SSO for a long period of time could lead to a degradation of service

  4. Project management - inability to perform project management activities is low risk. Alternatives: Trello, Jira, Notion

Github incident history

Decision

In the likely scenario of parts of GitHub being unavailable for a short period of time, the risk to MoJ services is medium. Transitioning any of the functionality to a different provider is a considerable effort, especially for SSO. Code repositories can be migrated to GitLab and GitHub Actions workflows can be translated to CircleCI jobs, though both will take some effort. We should make sure the MoJ organisation accounts with both of these providers are correctly set up and ready, just in case. Alternatives for SSO should be considered. Project management is low risk.
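As an illustration of the repository migration mentioned above, the following is a minimal sketch of copying a repository to a standby remote using standard `git clone --mirror` and `git push --mirror`. The function names, the `workdir` default and the URLs in the usage note are hypothetical and not part of this decision; the sketch also assumes the target repository already exists and that credentials for both remotes are configured.

```python
import subprocess
from typing import List


def mirror_commands(source_url: str, target_url: str,
                    workdir: str = "repo.git") -> List[List[str]]:
    """Build the git commands that copy all refs from source_url to target_url."""
    return [
        # A mirror clone is a bare copy that includes every branch and tag.
        ["git", "clone", "--mirror", source_url, workdir],
        # Pushing with --mirror makes the target's refs match the local copy,
        # including removing refs that no longer exist in the source.
        ["git", "-C", workdir, "push", "--mirror", target_url],
    ]


def run_mirror(source_url: str, target_url: str,
               workdir: str = "repo.git") -> None:
    """Run the mirror commands in order, stopping at the first failure."""
    for command in mirror_commands(source_url, target_url, workdir):
        subprocess.run(command, check=True)
```

A call might look like `run_mirror("git@github.com:example-org/example-repo.git", "git@gitlab.com:example-org/example-repo.git")`, where both URLs are placeholders. The source must be reachable for the clone to succeed, so a mirror like this only helps if it is refreshed before an outage. It copies Git data only; issues, pull requests and pipeline definitions would still need separate handling.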

In the unlikely scenario of GitHub being unavailable for a long period of time, the risk to MoJ services is high. Teams might be unable to address production issues because they cannot commit or deploy code, or access production logs and deployments. Alternatives for source code management, deployment pipelines and SSO should be considered. Alternatives for project management can be reviewed, though that is still considered low risk.

Consequences

  1. Evaluate organisational accounts for an alternative repository store and deployment pipelines and provide recommendations and instructions.
  2. Investigate alternatives to GitHub SSO for critical services.
This page was last reviewed on 5 March 2024. It needs to be reviewed again on 5 September 2024 by the page owner #operations-engineering-alerts.