Provn



    Senior Software Engineer — DAT Freight & Analytics, Broker Tech

    Software Engineering
    TypeScript
    Distributed Systems
    Incident Response
    AI Fluency
    Estimated Time: 1 hour

    What You'll Be Doing

    The Scenario

    You are a Senior Software Engineer on DAT's Broker Tech team, which powers the Convoy Platform — the integration layer between DAT's freight matching network and the Transportation Management Systems (TMS) that brokers use every day.

    The Convoy Platform lets brokers post loads, receive carrier matches, and execute shipments without leaving their TMS. Brokers connect via API and webhook — when a load changes status (matched, accepted, in-transit, delivered), their TMS receives a webhook event and updates accordingly.
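To make the webhook contract concrete, here is a minimal sketch of what a status-change event and its validation might look like. The four status names come from the scenario above; the field names (`eventId`, `loadId`, `occurredAt`) and the helper `parseLoadStatusEvent` are assumptions for illustration, not Convoy's actual schema.

```typescript
// Hypothetical event shape. Only the four status names are taken from the
// scenario text; every field name below is an assumption.
const LOAD_STATUSES = ["matched", "accepted", "in-transit", "delivered"] as const;
type LoadStatus = (typeof LOAD_STATUSES)[number];

interface LoadStatusEvent {
  eventId: string; // assumed unique identifier for this status change
  loadId: string;
  status: LoadStatus;
  occurredAt: string; // assumed ISO-8601 timestamp
}

// Type guard: true only for one of the four known status strings.
function isLoadStatus(value: unknown): value is LoadStatus {
  return (
    typeof value === "string" &&
    (LOAD_STATUSES as readonly string[]).indexOf(value) !== -1
  );
}

// Parses a raw message body. Returns null instead of throwing so the caller
// can treat a malformed message as a non-retryable failure.
function parseLoadStatusEvent(raw: string): LoadStatusEvent | null {
  let data: unknown;
  try {
    data = JSON.parse(raw);
  } catch {
    return null;
  }
  if (typeof data !== "object" || data === null) return null;
  const r = data as Record<string, unknown>;
  if (
    typeof r.eventId !== "string" ||
    typeof r.loadId !== "string" ||
    typeof r.occurredAt !== "string" ||
    !isLoadStatus(r.status)
  ) {
    return null;
  }
  return {
    eventId: r.eventId,
    loadId: r.loadId,
    status: r.status,
    occurredAt: r.occurredAt,
  };
}
```

Validating at the boundary means a malformed message is rejected once rather than retried.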

    This morning you inherit a production incident that was partially resolved overnight by the on-call engineer but has left the system in a degraded state. Here is the incident summary in your queue:

    Incident Report | INC-4471 | SEV-2 — Partially Mitigated

    Reported: 02:14 AM | Owner: Broker Tech | Status: Degraded — monitoring

    What happened:

    A surge in load activity (~3x normal volume) caused our shipment-events Kafka consumer group to fall behind. Consumer lag hit 42,000 messages at peak. Three downstream effects occurred:

    • Webhook delivery to broker TMS endpoints began timing out and retrying — 847 webhook events were delivered more than once to 34 different brokers.
    • Because our webhook delivery service retried without idempotency checking, some brokers' TMS systems processed the same status-change event multiple times — resulting in duplicate shipment records in at least 12 confirmed broker accounts.
    • Two brokers called in to report that carrier match notifications showed conflicting statuses: a load marked 'matched' in Convoy and 'available' in their TMS simultaneously.

    What on-call did:

    Scaled up consumer instances from 3 to 12 to drain the backlog. Consumer lag is now at 800 and falling. Webhook delivery has stabilized but the duplicate data is still in broker TMS systems. Root cause not yet identified.

    What is NOT resolved:

    • Duplicate shipment records in 12+ broker accounts. We do not know the full scope.
    • Root cause of the consumer lag spike is unconfirmed.
    • No idempotency controls exist on the webhook delivery path — this will happen again.
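One way to start scoping the unresolved duplicates is to group delivery records by broker and event and count repeats. The sketch below assumes a delivery log exists and can be exported with `brokerId` and `eventId` fields; the record shape and both function names are hypothetical.

```typescript
// Hypothetical delivery log record; the incident report does not say what
// delivery history is actually retained.
interface DeliveryRecord {
  brokerId: string;
  eventId: string;
  deliveredAt: string;
}

interface DuplicateSummary {
  brokerId: string;
  eventId: string;
  deliveries: number;
}

// Returns one entry per (broker, event) pair that was delivered more than once.
function findDuplicateDeliveries(records: DeliveryRecord[]): DuplicateSummary[] {
  const counts = new Map<string, DuplicateSummary>();
  for (const r of records) {
    // JSON.stringify of a tuple avoids key collisions from delimiter characters.
    const key = JSON.stringify([r.brokerId, r.eventId]);
    const entry = counts.get(key);
    if (entry) {
      entry.deliveries += 1;
    } else {
      counts.set(key, { brokerId: r.brokerId, eventId: r.eventId, deliveries: 1 });
    }
  }
  return Array.from(counts.values()).filter((s) => s.deliveries > 1);
}

// Distinct brokers that received at least one duplicate, sorted for stable output.
function affectedBrokers(duplicates: DuplicateSummary[]): string[] {
  return Array.from(new Set(duplicates.map((d) => d.brokerId))).sort();
}
```

A duplicate delivery does not always mean a duplicate record on the broker side, so this gives an upper bound on affected accounts, not a confirmed count.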

    Existing Code

    You also have access to the relevant section of the webhook delivery service. Read the code carefully — it may contain issues beyond the primary incident.

    // webhook-delivery.service.ts
    // Shipment status change consumer — processes Kafka events and delivers webhooks
    export class WebhookDeliveryService {
      constructor(
        private readonly http: HttpClient,
        private readonly db: DatabaseService,
        private readonly logger: Logger,
      ) {}
    
      async processShipmentEvent(event: KafkaMessage): Promise<void> {
        const payload = JSON.parse(event.value.toString());
        const brokers = await this.db.query(
          `SELECT * FROM broker_subscriptions WHERE load_id = ${payload.loadId}`
        );
        for (const broker of brokers) {
          try {
            await this.http.post(broker.webhookUrl, payload, { timeout: 5000 });
            this.logger.log(`Webhook delivered to broker ${broker.id}`);
          } catch (err) {
            this.logger.log(`Webhook failed, retrying: ${err}`);
            await this.processShipmentEvent(event); // retry
          }
        }
      }
    }
    

    Your Task — Three Deliverables

    Deliverable 1 — Revised Implementation

    Produce a revised version of the webhook delivery service that addresses the core production issues identified in the incident. Your implementation should be a working TypeScript/Node.js file — not pseudocode, not a diagram.

    Your implementation must address:

    • Idempotency: webhook events should not produce duplicate side effects when delivered more than once
    • Error handling: distinguish between retryable and non-retryable failures; do not retry infinitely
    • Observability: structured logging with enough context that an on-call engineer could diagnose a repeat incident from logs alone
    • At least one additional issue you identify in the existing code beyond the primary incident
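The error-handling and observability requirements above can be sketched as a few pure functions. This is a sketch under stated assumptions, not a reference solution: the status-code classification, `MAX_ATTEMPTS`, the backoff constants, and all function names are illustrative choices.

```typescript
type FailureKind = "retryable" | "non-retryable";

// Assumed policy: timeouts and network errors (no HTTP status), 408, 429, and
// 5xx are worth retrying; other statuses will not succeed on a resend.
function classifyFailure(httpStatus: number | undefined): FailureKind {
  if (httpStatus === undefined) return "retryable";
  if (httpStatus === 408 || httpStatus === 429 || httpStatus >= 500) return "retryable";
  return "non-retryable";
}

const MAX_ATTEMPTS = 5; // assumed cap so retries are always bounded

// Exponential backoff with a ceiling. `attempt` is 1-based. Jitter is left
// out so the output stays deterministic for this example.
function backoffDelayMs(attempt: number, baseMs: number = 500, capMs: number = 30000): number {
  return Math.min(capMs, baseMs * Math.pow(2, attempt - 1));
}

type RetryDecision =
  | { action: "retry"; delayMs: number }
  | { action: "give-up"; reason: "non-retryable" | "max-attempts" };

// Decides what to do after attempt number `attempt` has failed.
function decide(attempt: number, httpStatus: number | undefined): RetryDecision {
  if (classifyFailure(httpStatus) === "non-retryable") {
    return { action: "give-up", reason: "non-retryable" };
  }
  if (attempt >= MAX_ATTEMPTS) {
    return { action: "give-up", reason: "max-attempts" };
  }
  return { action: "retry", delayMs: backoffDelayMs(attempt) };
}

// One JSON object per log line so fields can be filtered in a log search.
// A real logger would also attach a timestamp.
function structuredLog(event: string, context: Record<string, unknown>): string {
  return JSON.stringify({ event, ...context });
}
```

Keeping the decision logic free of I/O makes it straightforward to unit test, and a per-broker decision avoids reprocessing the whole event when a single endpoint fails.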

    Scope note: This is a proof-of-concept implementation, not a full production rewrite. A focused, working solution that demonstrates the right patterns is more valuable than a comprehensive but skeletal one.

    Deliverable 2 — README.md (Sections A, B, and C)

    Section A — Written Analysis (300–500 words)

    Address all four of the following in your written analysis:

    • Root cause: what caused the consumer lag spike, and what conditions allowed the duplicate-delivery problem to propagate as far as it did?
    • Design gap: why does the existing webhook delivery architecture not protect against this class of failure? What specific change closes the gap?
    • Consistency tradeoff: the duplicate shipment records are now in broker TMS systems. Describe the trade-off between (a) attempting an automated cleanup and (b) leaving cleanup to brokers — and which you would recommend given DAT's position in the broker/carrier relationship.
    • Scope decision: name one thing you explicitly chose NOT to include in your implementation and explain why — what would you tackle in a follow-up PR?

    Section B — Production Runbook + Reasoning Question

    Section B has two parts. Complete both.

    Part B1 — Incident Runbook

    Write a runbook for the next on-call engineer who encounters consumer lag on the shipment-events consumer group. The runbook should cover:

    • How to confirm the issue (which metrics or logs to check first)
    • Immediate mitigation steps (in the order they should be executed)
    • How to confirm the incident is resolved — not just mitigated
    • One question the on-call engineer should answer before closing the incident to prevent recurrence
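For the "resolved, not just mitigated" check, it may help to state the lag condition explicitly. The sketch below computes total consumer lag from per-partition offsets and requires lag to stay under a threshold for several consecutive samples. The interface, the threshold, and the sample count are assumptions; offsets are shown as plain numbers for brevity, although Kafka offsets are 64-bit.

```typescript
// Hypothetical snapshot of one partition, as reported by whatever tooling
// exposes log-end and committed offsets for the consumer group.
interface PartitionOffsets {
  partition: number;
  logEndOffset: number;
  committedOffset: number;
}

// Total lag = sum over partitions of (log end offset - committed offset).
function totalLag(partitions: PartitionOffsets[]): number {
  return partitions.reduce(
    (sum, p) => sum + Math.max(0, p.logEndOffset - p.committedOffset),
    0,
  );
}

// "Recovered" here means the most recent `consecutive` samples are all at or
// below `threshold`. A single low reading is not enough.
function isLagRecovered(samples: number[], threshold: number, consecutive: number): boolean {
  if (consecutive < 1 || samples.length < consecutive) return false;
  return samples.slice(-consecutive).every((lag) => lag <= threshold);
}
```

For example, `isLagRecovered([42000, 800, 120, 40, 10], 100, 2)` returns `true`, while asking for three consecutive samples returns `false` because the third-most-recent sample (120) is still above the threshold.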

    Part B2 — Required Reasoning Question (answer without AI assistance)

    Describe a scenario where an AI coding assistant would give you a plausible but incorrect answer for this type of problem — specifically, idempotency in a message-driven webhook delivery system. What would the incorrect output look like, and what would you check to identify the error before acting on it?

    Answer this question in your own words without using an AI tool. We want to understand how you reason about AI failure modes — not how AI describes them.

    Section C — AI Usage Log (Mandatory)

    This is not a trick. We want to see how you work with AI — not whether you used it.

    In a short section of your README, document your AI collaboration process. For each significant interaction with an AI tool, briefly note:

    • What you asked the AI to help with
    • What it gave you
    • What you kept, changed, or rejected — and why

    Three interactions documented is sufficient. The log does not need to be exhaustive.

    Deliverable 3 — Video Walkthrough (8–10 minutes)

    Record your walkthrough as an MP4 or MOV file and upload it directly on the Provn platform as a separate file.

    Structure your video to cover:

    • Summary (60 seconds): the incident, your diagnosis, and your recommended fix
    • Code walkthrough (3–4 minutes): walk through your revised implementation — explain the key decisions, not just what the code does
    • Runbook walkthrough (1–2 minutes): walk through your Part B1 runbook — how would you actually use this at 2am?
    • Mandatory AI question (1–2 minutes): see the AI Usage Guidance section below
    • Reflection (30–60 seconds): what would you tackle next, and what trade-off are you least confident in?

    Speak naturally. Communication is assessed on clarity of technical reasoning and logical structure — not verbal polish, accent, or filler words.

    Constraints

    Honor all four. AI tools will typically ignore them. Evaluators will check each one.

    • Stack constraint: your implementation must use TypeScript and Node.js — the existing DAT Broker Tech stack. Do not introduce a different runtime, language, or framework.
    • Infrastructure constraint: idempotency must be implemented using a persistent store (e.g. a database or Redis-style cache) — not in-memory state that would not survive a pod restart in Kubernetes.
    • Organizational constraint: you cannot change the Kafka topic structure or partition schema. The consumer group configuration is managed by a separate platform team and any changes require a two-week change request. Your fix must work within the current consumer group design.
    • Ownership constraint: your on-call team will inherit whatever you build. Write your runbook and code comments for the engineer who gets paged at 2am — not the one who built it.
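To illustrate the infrastructure constraint, here is a minimal sketch of an idempotency check behind a store interface. The key format, the `IdempotencyStore` interface, the 7-day TTL, and `deliverOnce` are all assumptions. A real `claim` would typically map to an atomic operation such as Redis `SET key value NX EX ttl` or a SQL `INSERT ... ON CONFLICT DO NOTHING`; the in-memory class exists only so the example runs and would not satisfy the constraint.

```typescript
// Kafka coordinates identify a message without any change to the topic or
// partition schema. Offsets are strings because they are 64-bit values.
interface MessageCoordinates {
  topic: string;
  partition: number;
  offset: string;
}

// Assumed key format: one key per (message, broker) pair.
function idempotencyKey(msg: MessageCoordinates, brokerId: string): string {
  return ["webhook", msg.topic, msg.partition, msg.offset, brokerId].join(":");
}

// Hypothetical persistent store contract.
interface IdempotencyStore {
  // Atomically records the key; resolves false if it already existed.
  claim(key: string, ttlSeconds: number): Promise<boolean>;
  release(key: string): Promise<void>;
}

// Stand-in for tests ONLY. It does not survive a pod restart.
class InMemoryStoreForTestsOnly implements IdempotencyStore {
  private readonly keys = new Set<string>();
  async claim(key: string, _ttlSeconds: number): Promise<boolean> {
    if (this.keys.has(key)) return false;
    this.keys.add(key);
    return true;
  }
  async release(key: string): Promise<void> {
    this.keys.delete(key);
  }
}

type DeliveryOutcome = "delivered" | "skipped-duplicate";

// Claims the key, sends, and releases the claim if sending fails so that a
// later retry is not mistaken for a duplicate.
async function deliverOnce(
  store: IdempotencyStore,
  key: string,
  send: () => Promise<void>,
  ttlSeconds: number = 7 * 24 * 3600, // assumed retention window
): Promise<DeliveryOutcome> {
  const claimed = await store.claim(key, ttlSeconds);
  if (!claimed) return "skipped-duplicate";
  try {
    await send();
    return "delivered";
  } catch (err) {
    await store.release(key);
    throw err;
  }
}
```

Claiming before sending trades a small risk of a missed delivery (a crash between claim and send) for protection against duplicates. Keying on topic, partition, and offset covers consumer redelivery but not a producer that publishes the same change twice, which would need an event identifier in the payload.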

    Evaluation Criteria

    Your submission is evaluated across five dimensions. Weights reflect what DAT's Broker Tech team cares most about.

    • Systems Design & Technical Judgment (30%): Identifies the root cause correctly, explains the design gap that allowed the incident to propagate, and proposes an architecture fix with explicit trade-off reasoning — not just a patch.
    • Production Code Quality & Engineering Craft (25%): TypeScript implementation is production-intentioned: meaningful types, structured logging with context, explicit error handling that distinguishes retryable from non-retryable failures, and at least one meaningful test.
    • Message-Driven & Integration Architecture (20%): Demonstrates working knowledge of Kafka delivery semantics and idempotency design, not just familiarity with the terms. Webhook contract design reflects the reality of broker TMS integration failures.
    • Communication & Technical Leadership (10%): Written analysis and runbook are structured for handoff — a new team member could act on them without asking clarifying questions. Trade-off reasoning is legible to a non-engineer.
    • AI Fluency (15%): Evidence of directing AI with domain-specific constraints, critical evaluation of AI output, and iteration. The AI Usage Log and video answer to the mandatory question are the primary evidence sources.

    AI Usage Guidance

    We expect you to use AI tools. We evaluate how you use them — not whether you use them. Evidence of iteration, redirection, and critical evaluation scores higher than a polished output with no process documentation.

    The single highest-signal indicator: your video answer to the mandatory AI question. If you cannot name a specific moment where you redirected AI output, evaluators will assume you did not.

    Mandatory AI question (include in your video):

    "Walk me through one moment where you disagreed with, pushed back on, or redirected what the AI gave you — and what you did instead. Name the specific moment. Explain what the AI produced that didn't meet the bar, what you did differently, and why."

    Note: Part B2 of your README must be completed without AI assistance. This is not about AI detection — it is about understanding how you reason through AI failure modes independently.

    Submission Checklist

    Before you submit, confirm:

    • Implementation file(s) — TypeScript/Node.js, uploaded as a separate file
    • README.md — includes Section A (written analysis), Section B (Part B1 runbook + Part B2 reasoning question), and Section C (AI Usage Log)
    • Video walkthrough — 8–10 minutes, MP4 or MOV, uploaded as a separate file
    • All four constraints honored — check each one before submitting
    • Part B2 answered in your own words, without AI assistance

    Upload each deliverable as a separate file directly on the Provn platform: your implementation file(s), your README document, and your video walkthrough. Do not bundle files into a ZIP. Do not link to external repositories or video platforms.

    What You'll Accomplish

    • Diagnose a production incident involving Kafka consumer lag and webhook delivery failures in a distributed system
    • Implement idempotency controls and structured error handling in a message-driven integration architecture
    • Write a production runbook that enables effective incident response under real on-call conditions
    • Reason independently about AI coding assistant failure modes in distributed systems contexts
    • Communicate technical trade-offs and architecture decisions clearly to both engineering and non-technical stakeholders

    How Your Work Will Be Scored

    • Systems Design & Technical Judgment (30%)
    • Production Code Quality & Engineering Craft (25%)
    • Message-Driven & Integration Architecture (20%)
    • Communication & Technical Leadership (10%)
    • AI Fluency (15%)

