Incident Log
This is a log of incidents managed by Operations Engineering.
Q4 2024 (October to December)
Mean Time to Repair: 24h 10m
Mean Time to Resolve: 24h 10m
Incident on 2024-10-14 - Onboarding Mechanism Release Failure (AP GitHub Organisation)
Key events
- First detected: 2024-10-14 10:30
- Repaired: 2024-10-15 11:29 (Fix forward approach taken)
- Resolved: 2024-10-15 11:40 (Communications sent out to user base)
Time to repair: 24h 10m
Time to resolve: 24h 10m
Identified: The Operations Engineering team identified the issue after users started encountering SSO prompts and re-authentication errors with GitHub CLI and git operations (via SSH).
Impact: The disruption affected multiple users across the moj-analytical-services GitHub organisation, leading to re-authentication prompts and users seeing a banner for enabling SSO.
Context:
- 2024-10-14 10:30 Planning began for the SSO rollout.
- 2024-10-14 14:00 SSO was enabled for the AP GitHub organisation without anticipating the impact on current users.
- 2024-10-14 14:40 Users began experiencing issues with GitHub CLI re-authentication and the SSO banner.
- 2024-10-15 09:30 Communications began with Jacob from the Analytical Platform team, who replicated the issue.
- 2024-10-15 11:29 The team decided to “fix forward” rather than rolling back.
- 2024-10-15 11:40 The AP team sent communications to users, and the community began offering solutions.
Resolution: The team opted to “fix forward” by addressing the issues without rolling back the changes. Emergency communications were sent to inform users and support the transition.
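For reference, the user-side recovery typically involves re-authorising credentials for the organisation once SSO is enabled. The commands below are a rough sketch of that guidance, assuming GitHub's standard SSO behaviour and the stock GitHub CLI; they are illustrative rather than the exact instructions that were sent to users.

```
# Re-authenticate the GitHub CLI so its token is authorised for the SSO-enabled org
gh auth status
gh auth refresh -h github.com

# SSH keys (and any personal access tokens) must also be authorised for the org
# via github.com -> Settings -> SSH and GPG keys -> "Configure SSO", then test:
ssh -T git@github.com
```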
Review Actions:
- Improve risk assessment for changes across all GitHub organisations, including `moj-analytical-services`.
- Implement a 24-hour delay between communication and changes to allow for better stakeholder engagement.
- Formalise the testing process and release process to include better review procedures.
- Improve stakeholder engagement by creating an incident process runbook.
Q2 2024 (April to June)
Mean Time to Repair: 0h 0m
Mean Time to Resolve: 0h 0m
Q1 2024 (January to March)
Mean Time to Repair: 0h 0m
Mean Time to Resolve: 0h 0m
Q4 2023 (October to December)
Mean Time to Repair: 83h 50m
Mean Time to Resolve: 2185h 6m
Incident on 2023-12-15 - Sentry Session Replay Quota Exhausted
Key events
- First detected: 2023-12-15 15:34
- Repaired: 2023-12-15 16:16
- Resolved: 2024-04-11 11:06
Time to repair: 0h 42m
Time to resolve: 2808h 28m
Identified: Automated alert email from Sentry Support at 80% and 100% Replay quota consumption.
Impact: This resulted in no further Replays being available to any projects until the monthly subscription resets on 14th January 2024.
Context:
- 2023-12-15 15:34 Email alert from Sentry notifying us that 80% of the replay quota had been consumed.
- 2023-12-15 16:16 Notification to users via #operations-engineering-update that the quota was due to run out. No further resolution possible until the quota is increased.
- 2023-12-15 22:37 Email alert from Sentry notifying us that 100% of the replay quota had been consumed.
Resolution: Users notified that the quota was exhausted and of the action being taken to increase the quota.
Review Actions:
N/A - The only action that will prevent further incidents is an increase in quota, which is already underway.
Incident on 2023-12-08 - obr.uk email migration - Email failure due to cached MTA-STS policy
Key events
- First detected: 2023-12-08 14:12
- Repaired: 2023-12-12 15:18
- Resolved: 2023-12-12 15:18
Time to repair: 97h 6m
Time to resolve: 97h 6m
Identified: Reported by Email Migration Team.
Impact: Emails not delivered.
Context:
- 2023-12-08 14:12 HM Treasury (HMT) Email Team report that emails are being blocked by an unknown MTA-STS policy.
- 2023-12-08 15:25 MTA-STS policy removed by Operations Engineering.
- 2023-12-09 14:15 HMT Email Team report that some email is being delivered. Email via Google still being blocked.
- 2023-12-10 10:30 HMT Email Team report that email appears to be blocked by a cached MTA-STS policy. Cache based on a policy `max_age` of 14 days. Partial delivery of emails.
- 2023-12-10 11:08 Original MTA-STS policy reinstated with a short `max_age` to try and clear the cached policy.
- 2023-12-11 10:30 Technical call held to discuss options as email still only being partially delivered. Reinstating the policy did not resolve the issue. Decision taken to implement a new `none` type policy.
- 2023-12-11 10:43 New policy deployed.
- 2023-12-11 14:00 Technical call held. Email still being partially delivered due to the cached policy. Decision made to add the old MX record server back to see the impact on the policy.
- 2023-12-11 14:40 HMT Email Team start seeing Non-Delivery Report (NDR) errors. Decision made to revert MX record changes and remove old mail servers. NDR errors stop.
- 2023-12-12 12:00 Technical call held. Identified that the `max_age` value should be at least 24 hours for some email providers to recognise policy changes.
- 2023-12-12 13:05 New MTA-STS policy deployed.
- 2023-12-12 15:18 HMT Email Team report that emails are no longer blocked by the original MTA-STS policy. Email service working as expected.
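For reference, a `none` mode MTA-STS policy with a 24-hour `max_age`, which is the shape of policy this timeline converges on, looks roughly like the sketch below. The domain, the `id` value, and the `#` annotation lines are illustrative placeholders rather than the records that were actually deployed.

```
# DNS TXT record at _mta-sts.example.gov.uk - changing "id" signals to senders that the policy has changed
v=STSv1; id=20231212130500

# Policy file served at https://mta-sts.example.gov.uk/.well-known/mta-sts.txt
version: STSv1
mode: none
max_age: 86400
```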
Resolution: Implementation of a new `none` type MTA-STS policy with a 24-hour `max_age` and no associated MX records resulted in the cached policy being replaced. This enabled email to be delivered as expected.
Review Actions:
- Update Readme to include correct decommissioning actions - DONE
- Create runbook to help assess DNS changes required for email migrations - DONE
- Review of the current MTA-STS Infrastructure as Code solution and process, to feed into the longer-term strategy for this product, i.e. tactical uplift vs tackling longer-term technical debt - TODO
Incident on 2023-11-17 - Sentry Session Replay Quota Breach
Key events
- First detected: 2023-11-17 09:21
- Repaired: 2023-11-17 15:45
- Resolved: 2024-04-11 11:06
Time to repair: 6h 24m
Time to resolve: 3649h 45m
Identified: Automated alert email from Sentry Support at 80% and 100% Replay quota consumption.
Impact: This resulted in no further Replays being available to any projects until the monthly subscription resets on 14th December 2023. There are only 3 projects currently using this feature.
Context:
- 2023-11-16 17:17 Email alert from Sentry notifying us that 80% of the replay quota had been consumed.
- 2023-11-16 18:01 Email alert from Sentry notifying us that 100% of the replay quota had been consumed. No more quota until the monthly reset on 14th December.
- 2023-11-17 09:21 Operations Engineering Team aware of email alert from Sentry.
- 2023-11-17 10:38 Investigation started. Data shows that the spike was a result of replays from one project.
- 2023-11-17 10:43 First contact with the Service Team to alert them to issues and to establish what may have changed. A change was made to the config that enabled this feature for their app. The config resulted in the spike. Agreed remediation work (i.e. to remove the trigger on non-error transactions) to reduce risk of recurrence.
- 2023-11-17 12:21 Search of the GitHub org to identify any other impacted teams. Two other teams identified, but they had not contributed to the replay sessions. Teams contacted to notify them that the quota was exhausted. No immediate impact to these teams.
- 2023-11-17 14:00 Contacted Sentry Customer Success Manager to discuss estimations for a potential quota increase, plus some user questions about feature behaviour to help set appropriate limits.
- 2023-11-17 14:40 PR raised to remove the replay trigger on non-error transactions.
- 2023-11-17 15:40 PR approved
- 2023-11-17 15:45 PR merged
Resolution: Worked with the Service Team to update their configuration to reduce the risk of further quota spikes. Until the quota is increased, it is likely that further breaches may occur.
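The kind of configuration change involved here, sampling Session Replays only when an error occurs rather than for a percentage of ordinary sessions, looks roughly like the following sketch. It assumes the @sentry/browser JavaScript SDK; exact option and integration names vary by SDK version, the DSN is a placeholder, and this is not the Service Team's actual code.

```typescript
import * as Sentry from "@sentry/browser";

Sentry.init({
  dsn: "https://examplePublicKey@o0.ingest.sentry.io/0", // placeholder DSN
  integrations: [Sentry.replayIntegration()],
  // Do not record replays for ordinary sessions - this is what drove the quota spike.
  replaysSessionSampleRate: 0,
  // Still capture a replay when an error occurs.
  replaysOnErrorSampleRate: 1.0,
});
```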
Review Actions:
- Improve detection of spikes by adding Replays to existing anomaly detection and alerting - Done
- Forward automated email alerts from Sentry to the alerts channel in Slack to improve visibility and align to operational practices - Done
- Increase the Replay subscription to allow users to continue using the feature - Done
- Create user documentation to help Service Teams set appropriate sample rates - Done