THE SCENARIO

You are a senior cloud infrastructure engineer at a large financial services institution. Your team owns the GCP infrastructure layer that powers internal data pipelines, including a critical overnight batch process that generates regulatory compliance reports for SOX controls and AML transaction monitoring.

On Tuesday morning, the compliance operations team reports that last night's batch run failed silently: no alerts fired and no errors were surfaced anywhere, but the regulatory reports are incomplete. This is the third intermittent failure in two weeks. The Chief Technology Officer has escalated: regulators audit these reports quarterly, and the next audit window opens in six weeks.

While investigating, you discover four overlapping problems:

(1) The batch pipeline runs on Compute Engine instances provisioned through Terraform, but someone has been making changes through the GCP console — the Terraform state file no longer reflects what is actually deployed. Resources exist in GCP that Terraform doesn't know about, and Terraform-managed resources have been manually modified.

(2) Cloud SQL, the batch pipeline's data source, runs in a single-zone configuration with automated backups disabled. A single zone outage would take a compliance-critical database offline, and without backups any data lost in that outage could not be restored.

(3) The GCP infrastructure has no meaningful monitoring or alerting. The compliance team discovered the batch failure by manually checking report outputs the next morning. No Cloud Monitoring alerts, no log-based metrics, no dashboards.

(4) A review of the Terraform codebase reveals several security and cost issues beyond the batch failure itself.
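
For problem (1), one possible starting point for reconciling drift is Terraform's import block (Terraform 1.5 and later), which brings an existing resource under management through the normal plan and apply flow. The sketch below is illustrative only: the project ID "example-project" is a placeholder that does not come from this challenge, and which resources actually need importing depends on what the investigation finds.

```hcl
# Hypothetical sketch: adopt an already-deployed instance into Terraform state.
# "example-project" is a placeholder project ID, not a value from this challenge.
import {
  to = google_compute_instance.batch_processor[0]
  id = "projects/example-project/zones/us-east4-a/instances/batch-node-0"
}
```

Running terraform plan -refresh-only beforehand reports how Terraform-managed resources differ from the recorded state without proposing infrastructure changes. It does not find resources Terraform has never managed, so those would have to be inventoried separately.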

Your manager asks you to own the investigation, fix the immediate infrastructure issues, and establish monitoring so this never fails silently again. The first meaningful changes must ship within two weeks, with a team of two engineers.
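
As a rough illustration of what "never fails silently" could look like in Terraform, the sketch below pairs a log-based metric with a Cloud Monitoring alert policy. Several names are assumptions rather than anything defined in this challenge: the log filter (including the jsonPayload.job field), the metric name, and the oncall_channel_id variable are all hypothetical, and the real filter depends on how the batch job writes its logs.

```hcl
# Hypothetical: ID of an existing notification channel for the on-call rotation.
variable "oncall_channel_id" {
  type = string
}

# Counts error-level log entries from the batch job.
# The filter is an assumption about the job's log format.
resource "google_logging_metric" "batch_failure" {
  name   = "batch_job_failures"
  filter = "resource.type=\"gce_instance\" AND severity>=ERROR AND jsonPayload.job=\"compliance-batch\""

  metric_descriptor {
    metric_kind = "DELTA"
    value_type  = "INT64"
  }
}

# Fires when any batch error is counted within a five-minute window.
resource "google_monitoring_alert_policy" "batch_failure" {
  display_name = "Compliance batch failure"
  combiner     = "OR"

  conditions {
    display_name = "Batch errors logged"

    condition_threshold {
      filter          = "metric.type=\"logging.googleapis.com/user/${google_logging_metric.batch_failure.name}\" AND resource.type=\"gce_instance\""
      comparison      = "COMPARISON_GT"
      threshold_value = 0
      duration        = "0s"

      aggregations {
        alignment_period   = "300s"
        per_series_aligner = "ALIGN_SUM"
      }
    }
  }

  notification_channels = [var.oncall_channel_id]
}
```

An error-count alert like this only catches failures that log something. A run that never starts or exits without logging would need a complementary signal, such as a success marker whose absence is alerted on.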

STARTER CODE

The following Terraform configuration manages your batch processing infrastructure. Read it carefully — it may contain issues beyond the primary batch failure.

# compute.tf — Batch Processing Infrastructure

resource "google_compute_instance" "batch_processor" {
  count        = var.instance_count
  name         = "batch-node-${count.index}"
  machine_type = "n1-standard-16"
  zone         = "us-east4-a"

  boot_disk {
    initialize_params {
      image = "debian-cloud/debian-11"
      size  = 500
    }
  }

  network_interface {
    network    = google_compute_network.main.id
    subnetwork = google_compute_subnetwork.main.id
    access_config {}   # assigns ephemeral public IP to every instance
  }

  service_account {
    email  = "default"
    scopes = ["https://www.googleapis.com/auth/cloud-platform"]
  }

  metadata = {
    enable-oslogin = "FALSE"
  }

  labels = {}
}

resource "google_compute_firewall" "batch_allow" {
  name    = "batch-allow-all"
  network = google_compute_network.main.id

  allow {
    protocol = "all"
  }

  source_ranges = ["0.0.0.0/0"]
}

resource "google_sql_database_instance" "compliance_db" {
  name             = "compliance-data"
  database_version = "POSTGRES_14"
  region           = "us-east4"

  settings {
    tier              = "db-custom-16-61440"
    availability_type = "ZONAL"

    backup_configuration {
      enabled = false
    }

    ip_configuration {
      ipv4_enabled = true
      authorized_networks {
        name  = "allow-all"
        value = "0.0.0.0/0"
      }
    }
  }
}
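
The single-zone, backups-disabled Cloud SQL settings above correspond to problem (2) in the scenario. As a minimal sketch of one way that part could be addressed, the variant below changes only availability and backup settings. It is a replacement for the resource above, not an addition to it. The backup start time is an arbitrary assumption, and network exposure is deliberately left out of this sketch.

```hcl
# Hypothetical revision of compliance_db covering availability and backups only.
resource "google_sql_database_instance" "compliance_db" {
  name                = "compliance-data"
  database_version    = "POSTGRES_14"
  region              = "us-east4"
  deletion_protection = true # guard against accidental terraform destroy

  settings {
    tier              = "db-custom-16-61440"
    availability_type = "REGIONAL" # standby in a second zone

    backup_configuration {
      enabled                        = true
      point_in_time_recovery_enabled = true
      start_time                     = "06:00" # UTC; assumed to fall outside the batch window
    }
  }
}
```

Changing availability_type on a running instance typically involves a restart, so the change would need to be scheduled around the overnight batch.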

CONSTRAINTS

Honor all constraints below. Strong submissions address each one explicitly. Generic solutions that ignore these constraints will score lower regardless of technical quality.

Infrastructure: Your stack is GCP: Compute Engine, Cloud SQL, Cloud Monitoring, Cloud Logging, VPC, IAM. You may not introduce services outside this set. Work within what exists.

IaC Discipline: All infrastructure changes must go through Terraform. Console changes caused the drift problem — do not perpetuate that pattern. Your remediation must restore and enforce IaC discipline.

Scope: Your first deliverable must be shippable in two weeks by a team of two engineers. Identify what is in scope for that window and what comes later. Do not propose a quarter-long overhaul.

Ownership: Your team will own this infrastructure in production, including overnight on-call. Whatever you build or change, you are on the hook for it. Design and document accordingly.

Compliance Environment: This is a regulated financial services environment. Audit trails, access controls, and data protection are not optional — they are compliance requirements. Your infrastructure decisions must reflect this context.
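
To illustrate the access-control side of these constraints, here is a hedged sketch of a dedicated, narrowly scoped service account for the batch nodes in place of the default Compute Engine account. The account ID, the project_id variable, and the choice of roles are all assumptions. The roles the pipeline actually needs would have to be confirmed during the investigation.

```hcl
# Hypothetical: project ID supplied by the caller.
variable "project_id" {
  type = string
}

# Dedicated identity for batch nodes (name is illustrative).
resource "google_service_account" "batch" {
  project      = var.project_id
  account_id   = "batch-processor"
  display_name = "Batch processor"
}

# Grant only what the nodes need; these two roles are assumed examples.
resource "google_project_iam_member" "batch_log_writer" {
  project = var.project_id
  role    = "roles/logging.logWriter"
  member  = "serviceAccount:${google_service_account.batch.email}"
}

resource "google_project_iam_member" "batch_metric_writer" {
  project = var.project_id
  role    = "roles/monitoring.metricWriter"
  member  = "serviceAccount:${google_service_account.batch.email}"
}
```

The instance's service_account block would then reference google_service_account.batch.email, so that what each node can do is governed by these IAM bindings.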

GCP Cloud Infrastructure Engineer Skills Challenge

What You'll Be Doing

THE SCENARIO

STARTER CODE

CONSTRAINTS

What You'll Accomplish

How Your Work Will Be Scored

What to Submit
