You are a senior cloud infrastructure engineer at a large financial services institution. Your team owns the GCP infrastructure layer that powers internal data pipelines, including a critical overnight batch process that generates regulatory compliance reports for SOX controls and AML transaction monitoring.
On Tuesday morning, the compliance operations team reports that last night's batch run failed silently — no alerts fired, no errors appeared in any dashboard, but the regulatory reports are incomplete. This is the third intermittent failure in two weeks. The Chief Technology Officer has escalated: regulators audit these reports quarterly, and the next audit window opens in six weeks.
While investigating, you discover four overlapping problems:
(1) The batch pipeline runs on Compute Engine instances provisioned through Terraform, but someone has been making changes through the GCP console — the Terraform state file no longer reflects what is actually deployed. Resources exist in GCP that Terraform doesn't know about, and Terraform-managed resources have been manually modified.
(2) Cloud SQL, the batch pipeline's data source, is running in a single-zone configuration with automated backups disabled. A zone failure would take the database offline, and with no backups there is no recovery path if compliance-critical data is lost or corrupted.
(3) The GCP infrastructure has no meaningful monitoring or alerting. The compliance team discovered the batch failure by manually checking report outputs the next morning. No Cloud Monitoring alerts, no log-based metrics, no dashboards.
(4) A review of the Terraform codebase reveals several security and cost issues beyond the batch failure itself.
Your manager asks you to own the investigation, fix the immediate infrastructure issues, and establish monitoring so this never fails silently again. The first meaningful changes must ship within two weeks, with a team of two engineers.
The following Terraform configuration manages your batch processing infrastructure. Read it carefully — it may contain issues beyond the primary batch failure.
# compute.tf — Batch Processing Infrastructure
resource "google_compute_instance" "batch_processor" {
  count        = var.instance_count
  name         = "batch-node-${count.index}"
  machine_type = "n1-standard-16"
  zone         = "us-east4-a"

  boot_disk {
    initialize_params {
      image = "debian-cloud/debian-11"
      size  = 500
    }
  }

  network_interface {
    network    = google_compute_network.main.id
    subnetwork = google_compute_subnetwork.main.id
    access_config {} # assigns ephemeral public IP to every instance
  }

  service_account {
    email  = "default"
    scopes = ["https://www.googleapis.com/auth/cloud-platform"]
  }

  metadata = {
    enable-oslogin = "FALSE"
  }

  labels = {}
}

resource "google_compute_firewall" "batch_allow" {
  name    = "batch-allow-all"
  network = google_compute_network.main.id

  allow {
    protocol = "all"
  }

  source_ranges = ["0.0.0.0/0"]
}

resource "google_sql_database_instance" "compliance_db" {
  name             = "compliance-data"
  database_version = "POSTGRES_14"
  region           = "us-east4"

  settings {
    tier              = "db-custom-16-61440"
    availability_type = "ZONAL"

    backup_configuration {
      enabled = false
    }

    ip_configuration {
      ipv4_enabled = true

      authorized_networks {
        name  = "allow-all"
        value = "0.0.0.0/0"
      }
    }
  }
}
Honor all constraints below. Strong submissions address each one explicitly. Generic solutions that ignore these constraints will score lower regardless of technical quality.
Infrastructure: Your stack is GCP: Compute Engine, Cloud SQL, Cloud Monitoring, Cloud Logging, VPC, IAM. You may not introduce services outside this set. Work within what exists.
IaC Discipline: All infrastructure changes must go through Terraform. Console changes caused the drift problem — do not perpetuate that pattern. Your remediation must restore and enforce IaC discipline.
Scope: Your first deliverable must be shippable in two weeks by a team of two engineers. Identify what is in scope for that window and what comes later. Do not propose a quarter-long overhaul.
Ownership: Your team will own this infrastructure in production, including overnight on-call. Whatever you build or change, you are on the hook for it. Design and document accordingly.
Compliance Environment: This is a regulated financial services environment. Audit trails, access controls, and data protection are not optional — they are compliance requirements. Your infrastructure decisions must reflect this context.
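As an illustration of the IaC Discipline constraint above, one possible way to bring a console-created resource back under Terraform without recreating it is an import block. This is a minimal sketch, not a prescribed solution: it assumes Terraform 1.5 or later, and the project ID, firewall rule name, and resource address are hypothetical placeholders that do not appear in the scenario.

```hcl
# Sketch only: import blocks require Terraform 1.5 or later.
# "example-project" and "console-created-rule" are hypothetical names
# standing in for a resource that was created through the GCP console.
import {
  to = google_compute_firewall.console_created
  id = "projects/example-project/global/firewalls/console-created-rule"
}
```

Running `terraform plan -generate-config-out=generated.tf` with this block in place asks Terraform to draft a matching resource definition into `generated.tf`, which can then be reviewed and committed so the import goes through code review rather than the console. Config generation is marked experimental in Terraform, so the generated file should be treated as a starting point.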
Diagnose and remediate Terraform state drift in a GCP production environment
Fix critical Cloud SQL reliability and security gaps for a compliance-critical database
Design a monitoring and alerting strategy to prevent silent batch failures
Scope a two-week infrastructure remediation plan in a regulated financial services environment
Demonstrate critical evaluation and iterative use of AI tools for infrastructure code
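For the monitoring objective above, one possible shape for catching a silent failure is to alert on the absence of a success signal rather than on the presence of an error. The following is a hedged sketch using only Cloud Logging and Cloud Monitoring. It assumes the batch job writes a structured log entry with `jsonPayload.event = "batch_complete"` when it finishes, which the scenario does not state. The metric name, the variable `oncall_channel_id`, and the 23-hour window are likewise illustrative assumptions.

```hcl
# Hypothetical input: ID of an existing Cloud Monitoring notification channel.
variable "oncall_channel_id" {
  type        = string
  description = "Notification channel for the on-call rotation (assumed to exist)"
}

# Counts log entries emitted when a batch run completes.
# The filter assumes the job logs jsonPayload.event = "batch_complete".
resource "google_logging_metric" "batch_success" {
  name   = "batch_run_success"
  filter = "resource.type=\"gce_instance\" AND jsonPayload.event=\"batch_complete\""

  metric_descriptor {
    metric_kind = "DELTA"
    value_type  = "INT64"
  }
}

# Fires when no completion entry has been counted for 23 hours,
# i.e. when last night's run did not report success.
resource "google_monitoring_alert_policy" "batch_missing" {
  display_name = "Batch run did not report completion"
  combiner     = "OR"

  conditions {
    display_name = "No batch_complete log entry in the last 23h"

    condition_absent {
      filter   = "metric.type=\"logging.googleapis.com/user/${google_logging_metric.batch_success.name}\" AND resource.type=\"gce_instance\""
      duration = "82800s"

      aggregations {
        alignment_period   = "3600s"
        per_series_aligner = "ALIGN_SUM"
      }
    }
  }

  notification_channels = [var.oncall_channel_id]
}
```

An absence condition can only fire for a time series that has reported data before, so a sketch like this would need at least one successful run after deployment before it offers any protection.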