Provn
    How it worksBrowse jobsFor companiesBlogLog in

    © 2026 Provn Inc. All rights reserved.

    About•Blog•Terms of Service•Privacy Policy

    Made with love in Seattle

    Challenges/Collabera/Backend Engineer, Data Engineer/Senior Big Data Developer/Big Data Developer Challenge

    Senior Big Data Developer/Big Data Developer Challenge

    Big Data
    Python
    PySpark
    Scala
    Estimated Time:
    45 minutes
    Status:Not started

    What You'll Be Doing

    The Scenario

    You are joining the Financial Crime Data Engineering team at a large financial institution. The team builds and maintains the data pipelines that feed the institution's anti-money laundering (AML) and fraud detection systems. These pipelines ingest transaction data from multiple source systems, enrich it with entity-level behavioral profiles, and deliver structured outputs to downstream case management and regulatory reporting platforms.

    The institution processes approximately 3.5 million financial transactions per hour across retail banking, wire transfers, ACH, and card networks. All data processing runs on Google Cloud Platform — Dataproc clusters for Spark workloads, BigQuery for analytical queries and historical lookups, and Cloud Storage for raw data landing and intermediate staging.


    The Problem

    A recent regulatory examination identified a gap in the institution's transaction monitoring coverage: the current pipeline does not adequately detect structuring behavior — patterns where individuals deliberately split transactions to stay below the $10,000 Currency Transaction Report (CTR) threshold. The existing pipeline processes individual transactions in isolation. It does not build rolling behavioral windows at the entity level that would reveal whether a customer has made multiple deposits of $9,500 across different branches over a three-day period.

    The compliance team has mandated a 90-day remediation timeline. Your team has been assigned to design and build a new pipeline module — the Structuring Detection Enrichment Layer — that sits between raw transaction ingestion and the downstream case management platform. This module must:

    • Aggregate transactions into rolling entity-level windows (24-hour, 72-hour, and 30-day periods) to surface structuring patterns
    • Enrich each transaction record with the entity's aggregated behavior profile (total volume, frequency, amount distribution, branch diversity)
    • Produce output records conforming to the downstream case management platform's fixed input schema (you cannot modify this schema)
    • Maintain full data lineage from source transaction to enriched output for regulatory audit purposes

    What You Know About the Current System

    • Source data lands in Cloud Storage as partitioned Parquet files (partitioned by transaction_date and source_system)
    • Entity resolution has already been performed upstream — each transaction record includes a resolved entity_id
    • The existing daily batch pipeline runs on a 12-node Dataproc cluster (n1-highmem-16 instances) and currently completes in approximately 3.5 hours
    • Historical transaction data for lookback windows lives in BigQuery, partitioned by transaction_date, clustered by entity_id
    • The downstream case management platform consumes data via a fixed API contract (JSON-over-HTTPS, max 5,000 records per batch call, 200ms timeout per call)
    • Compliance requires that all intermediate data artifacts be retained for 7 years and be reproducible from source

    Constraints

    Honor the following constraints in your solution. These reflect the real operating environment for this role.

    • GCP infrastructure only: Your solution must use the existing GCP stack — Dataproc (Spark), BigQuery, and Cloud Storage. Do not propose migrating to a different cloud provider or introducing tools not already in the environment (e.g., no Kafka, no Snowflake, no Databricks). Work within what exists.

    • Fixed downstream API contract: The case management platform consumes enriched records via a fixed API (JSON-over-HTTPS, max 5,000 records per batch call, 200ms timeout). You cannot modify this contract. Your output stage must conform to it.

    • 4-week first deliverable: The regulatory timeline is 90 days, but your first sprint deliverable must be scoped to 4 weeks. Identify what you would deliver first and what comes later — do not propose a 90-day monolithic build.

    • Audit trail non-negotiable: Every enriched output record must be traceable back to its source transactions. No transformations that lose lineage. Compliance requires 7-year retention of all intermediate artifacts. This is a regulatory requirement, not a nice-to-have.

    • Existing cluster resources: Assume the existing 12-node Dataproc cluster (n1-highmem-16). You may propose configuration changes or scaling recommendations, but your proof-of-concept must demonstrate it can run within this resource envelope. No unlimited-compute assumptions.

    What You'll Accomplish

    Design a multi-window aggregation pipeline that computes rolling entity-level behavioral profiles (24-hour, 72-hour, and 30-day windows) over high-volume transaction data using PySpark on Dataproc, demonstrating your ability to move beyond row-level processing to stateful, entity-centric analytics.

    Integrate batch and historical data sources by orchestrating reads from both Cloud Storage (raw Parquet ingestion) and BigQuery (historical lookback), managing the handoff between landing-zone data and warehoused history within a single coherent pipeline module.

    Produce schema-compliant enriched output that conforms to a fixed downstream API contract you cannot change — requiring you to map your enrichment logic to an external system's expectations while handling batching constraints, payload limits, and delivery reliability.

    Implement end-to-end data lineage and auditability so that every enriched output record can be traced back to its constituent source transactions, satisfying a 7-year regulatory retention requirement and demonstrating your understanding of compliance-grade data engineering practices

    Make and justify infrastructure and performance trade-offs — including partitioning strategies, cluster sizing considerations, and pipeline scheduling decisions — that show you can reason about operational cost, processing latency, and system reliability in a production GCP environment handling millions of transactions per hour.

    How Your Work Will Be Scored

    Core Technical Execution - 20%Enterprise Data Architecture & Scale - 15%AI Fluency - 15%Resume & Domain Fit - 50%

    What to Submit

    No submission guidelines provided.

    On this page

    Top of Page
    What You'll Be Doing
    How It's Scored
    What to Submit