Skip to main content

ADR-016: Archive DNS-IAC

Status

✅ Accepted

Context

The Operations Engineering team is responsible for managing the Ministry of Justice (MoJ) Domain Name Service (DNS) estate. The MoJ DNS estate is large, and currently consists of 419 hosted zones in Amazon Web Services (AWS) Route53.

The purpose of DNS-IAC was to bring our DNS estate under the modern practice of managing “Infrastructure As Code” (IAC). Using IAC is typically considered good practice as it allows you to incorporate modern software development practices into how you manage infrastructure such as version control, auditing of system changes, peer reviews of system changes etc.

Problems

The current DNS-IAC solution falls short of its intended purpose as it only covers ~25% of our DNS estate. This section will detail some of the root problems identified with DNS-IAC as it currently stands.

Monolithic Terraform State File

Due to a monolithic Terraform state file, deployment of changes through Continuous Integration / Continuous Deployment ( CI/CD) pipelines were considered too slow. Adding more of the DNS estate to the current solution would further slow down this process, making it slower to implement changes to our DNS estate. Which contributes to why the team is reluctant to attempt to manage the entire estate with the current solution.

Shared AWS Account

The AWS where we currently host our DNS estate is a legacy account that is shared across multiple teams and business units. This makes it difficult to be confident that manual changes are not being made outside our team and to completely restrict management of the DNS estate to a single solution.

Understanding of The Domain

The team does not have documented research into the core problems, constraints and needs within this area. Making it difficult to make effective decisions and create solutions that resolve whole problems.

Symptoms

Drift

Drift occurs when resources that Terraform managed are changed outside of Terraform (manually). This change in state then requires additional work to inform Terraform of the new changes.

Two Processes For The Same Problem

Both users and Operations and Engineering are experiencing the pain having two solutions for the same problem. One being DNS-IAC, the other being manually changing DNS in Route53.

As a user, it’s fairly jarring to raise a Pull Request (PR) with minor changes to your DNS records and be shown a plan with a few weeks worth of drift.

As a member of Operations Engineering - its typically down to the individual to decide which process to follow. Whether to quickly add the proposed changes manually in Route53 or to correct all current drift and then apply the new changes locally via terraform.

This complexity makes it difficult to define a consistent underlying process that can be assessed and improved.

Options

1. Keep DNS-IAC As Is

Pros

  • No operational changes required

Cons

  • Hard to articulate the process to new engineers
  • Two processes for the same problem
  • DNS-IAC will still only cover ~25% of our DNS estate

2. Refactor DNS-IAC So it Can Meet The Needs of The Business and Users

Pros

  • Potential to scale DNS-IAC to encompass our entire domain*
  • Can try to start and enforce a single process for DNS management*
  • Potential to enable additional self-service capabilities for teams wanting more autonomy managing their DNS estate ( less work for Operations Engineering)

*having a shared AWS account may make some these pros challenging to implement

Cons

  • Work required to refactor DNS-IAC, CI/CD pipelines and break up the monolithic Terraform state
  • May build a solution that does not resolve core problems or scale in the correct way for the long term, due to these areas not yet being explored and understood yet
  • Work is not prioritised, meaning choosing this option (without prioritising the work) is actually is choosing Option 1

3. Archive DNS-IAC

Pros

  • Single process for the management of our entre DNS estate
  • Reduces maintenance time and cost for Operations Engineering, which can be invested into other higher priority areas
  • Easy to explain to new engineers and to enable criticism of the single process, which we can use to define needs and build a solution in the future

Cons

  • All DNS changes become manual
  • Lose the ability to optionally get reviews of DNS changes before they are implemented
  • Harder to rollback changes (where they could have been made through code)

Decision

Operations Engineering have decided on Option 3, to Archive DNS-IAC.

We have yet to fully understand the needs, business constraints and technical constraints in this area to provide a well-rounded solution that can meet the needs of the business and users.

Understanding all of this will take time, and require the work to be prioritised by the team. Archiving the current solution will reduce the maintenance burden on the team, enabling efforts to be invested into higher priority work and allow the appropriate time needed to fully understand the requirements before building a new solution.

This page was last reviewed on 28 May 2024. It needs to be reviewed again on 28 November 2024 by the page owner #operations-engineering-alerts .