
Responding to an Incident

This runbook outlines how to declare and manage an incident, so that the team and customers are promptly notified and the incident can be mitigated effectively.

Declaring an Incident to the Operations Engineering Team

To declare an incident, trigger the “Declare an Incident” workflow in the #operations-engineering-team Slack channel, either by using the /declare-an-incident command or by manually triggering the workflow from the bookmarked items, as shown:

Manually Trigger Workflow

Enter the title of the incident, as shown:

Input Incident Title

A message will be sent to the #operations-engineering-team Slack channel declaring an incident.

This notifies members of the Operations Engineering team of the incident, allowing for a coordinated mitigation effort.

A huddle can be started to gather the team to investigate the issue.

Consider keeping all incident-related discussion in the thread of the incident declaration message.

Declaring an Incident to Our Customers

Once the team has been notified that an incident has occurred and has begun investigating the issue, it is important that our customers are promptly notified.

You can use this template for incident-related communications:

Incident Title

Incident Description.

Am I Affected?

How the incident impacts users.

What Are We Doing?

Actions that the Operations Engineering team are currently taking to mitigate the issue.

For example:

GitHub Actions For Internal/Private Repositories Not Running

We have received reports that GitHub Actions are not running for Internal/Private GitHub Repositories because the GitHub Actions quota has been depleted.

Am I Affected?

If you have an Internal/Private GitHub Repository within ministryofjustice or moj-analytical-services that depends on GitHub Actions, you are most likely affected by this issue.

What Are We Doing?

The billing should allow for overages, so we are currently investigating why this process is not working as expected.
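If incident messages are ever drafted by a script rather than by hand, the template above could be filled in along these lines. This is only a minimal sketch: the function name `format_incident_message` and its parameters are hypothetical and not part of any existing Operations Engineering tooling.

```python
def format_incident_message(title: str, description: str, impact: str, actions: str) -> str:
    """Fill in the customer-facing incident template (hypothetical helper)."""
    sections = [
        title,
        description,
        "Am I Affected?",
        impact,
        "What Are We Doing?",
        actions,
    ]
    # Blank lines between sections mirror the layout of the template.
    return "\n\n".join(section.strip() for section in sections)


if __name__ == "__main__":
    print(
        format_incident_message(
            title="GitHub Actions For Internal/Private Repositories Not Running",
            description="We have received reports that GitHub Actions are not running.",
            impact="Repositories that depend on GitHub Actions are most likely affected.",
            actions="We are currently investigating why overage billing is not working.",
        )
    )
```

The returned text can then be pasted into Slack and adjusted before posting.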

It is important to target the appropriate channels for communications.

For all Operations Engineering incidents, the following channel should be targeted: #operations-engineering-update

For all GitHub-related incidents, the following additional channel should be targeted: #github-community

If the incident affects other teams, it is important to contact them via their team Slack channels.
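The channel rules above can be expressed as a small lookup. The sketch below is illustrative only: the function `target_channels` and the example team channel `#some-team` are hypothetical, while the two fixed channel names are taken from this runbook.

```python
from typing import List, Sequence

# Channels named in this runbook.
BASE_CHANNELS = ["#operations-engineering-update"]
GITHUB_CHANNELS = ["#github-community"]


def target_channels(github_related: bool, affected_team_channels: Sequence[str] = ()) -> List[str]:
    """Return the Slack channels to notify for an incident (hypothetical helper)."""
    channels = list(BASE_CHANNELS)
    if github_related:
        channels.extend(GITHUB_CHANNELS)
    # Add the channels of any other affected teams, skipping duplicates.
    for channel in affected_team_channels:
        if channel not in channels:
            channels.append(channel)
    return channels


if __name__ == "__main__":
    # "#some-team" is a placeholder for an affected team's own channel.
    print(target_channels(github_related=True, affected_team_channels=["#some-team"]))
```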

Don’t forget to notify customers once the incident has been resolved. For this, you could use the following template:

UPDATE (RESOLVED)

Actions we have taken which have resolved the issue.

Actions that we have taken to prevent the issue from recurring in the future.

Please feel free to comment on this thread, or raise a question in #ask-operations-engineering if you face any additional problems.

For example:

UPDATE (RESOLVED)

We have updated our overage spending limit for Internal/Private GitHub Actions to $10,000, and we have received reports that this has now resolved the issue for users.

Our monthly quota resets on the 1st of every month, so we expect the $10,000 to be more than enough to cover the next few days.

Please feel free to comment on this thread, or raise a question in #ask-operations-engineering if you face any additional problems.

This page was last reviewed on 8 November 2024. It needs to be reviewed again on 8 May 2025 by the page owner #operations-engineering-alerts.