You are joining the Financial Crime Data Engineering team at a large financial institution. The team builds and maintains the data pipelines that feed the institution's anti-money laundering (AML) and fraud detection systems. These pipelines ingest transaction data from multiple source systems, enrich it with entity-level behavioral profiles, and deliver structured outputs to downstream case management and regulatory reporting platforms.
The institution processes approximately 3.5 million financial transactions per hour across retail banking, wire transfers, ACH, and card networks. All data processing runs on Google Cloud Platform — Dataproc clusters for Spark workloads, BigQuery for analytical queries and historical lookups, and Cloud Storage for raw data landing and intermediate staging.
A recent regulatory examination identified a gap in the institution's transaction monitoring coverage: the current pipeline does not adequately detect structuring behavior — patterns where individuals deliberately split transactions to stay below the $10,000 Currency Transaction Report (CTR) threshold. The existing pipeline processes individual transactions in isolation. It does not build rolling behavioral windows at the entity level that would reveal whether a customer has made multiple deposits of $9,500 across different branches over a three-day period.
The compliance team has mandated a 90-day remediation timeline. Your team has been assigned to design and build a new pipeline module — the Structuring Detection Enrichment Layer — that sits between raw transaction ingestion and the downstream case management platform. This module must:
transaction_date and source_system)entity_idtransaction_date, clustered by entity_idHonor the following constraints in your solution. These reflect the real operating environment for this role.
GCP infrastructure only: Your solution must use the existing GCP stack — Dataproc (Spark), BigQuery, and Cloud Storage. Do not propose migrating to a different cloud provider or introducing tools not already in the environment (e.g., no Kafka, no Snowflake, no Databricks). Work within what exists.
Fixed downstream API contract: The case management platform consumes enriched records via a fixed API (JSON-over-HTTPS, max 5,000 records per batch call, 200ms timeout). You cannot modify this contract. Your output stage must conform to it.
4-week first deliverable: The regulatory timeline is 90 days, but your first sprint deliverable must be scoped to 4 weeks. Identify what you would deliver first and what comes later — do not propose a 90-day monolithic build.
Audit trail non-negotiable: Every enriched output record must be traceable back to its source transactions. No transformations that lose lineage. Compliance requires 7-year retention of all intermediate artifacts. This is a regulatory requirement, not a nice-to-have.
Existing cluster resources: Assume the existing 12-node Dataproc cluster (n1-highmem-16). You may propose configuration changes or scaling recommendations, but your proof-of-concept must demonstrate it can run within this resource envelope. No unlimited-compute assumptions.
Design a multi-window aggregation pipeline that computes rolling entity-level behavioral profiles (24-hour, 72-hour, and 30-day windows) over high-volume transaction data using PySpark on Dataproc, demonstrating your ability to move beyond row-level processing to stateful, entity-centric analytics.
Integrate batch and historical data sources by orchestrating reads from both Cloud Storage (raw Parquet ingestion) and BigQuery (historical lookback), managing the handoff between landing-zone data and warehoused history within a single coherent pipeline module.
Produce schema-compliant enriched output that conforms to a fixed downstream API contract you cannot change — requiring you to map your enrichment logic to an external system's expectations while handling batching constraints, payload limits, and delivery reliability.
Implement end-to-end data lineage and auditability so that every enriched output record can be traced back to its constituent source transactions, satisfying a 7-year regulatory retention requirement and demonstrating your understanding of compliance-grade data engineering practices
Make and justify infrastructure and performance trade-offs — including partitioning strategies, cluster sizing considerations, and pipeline scheduling decisions — that show you can reason about operational cost, processing latency, and system reliability in a production GCP environment handling millions of transactions per hour.
On this page