Senior Software Engineer — DAT Freight & Analytics, Broker Tech

The Scenario

You are a Senior Software Engineer on DAT's Broker Tech team, which powers the Convoy Platform — the integration layer between DAT's freight matching network and the Transportation Management Systems (TMS) that brokers use every day.

The Convoy Platform lets brokers post loads, receive carrier matches, and execute shipments without leaving their TMS. Brokers connect via API and webhook — when a load changes status (matched, accepted, in-transit, delivered), their TMS receives a webhook event and updates accordingly.

This morning you inherit a production incident that was partially resolved overnight by the on-call engineer but has left the system in a degraded state. Here is the incident summary in your queue:

Incident Report | INC-4471 | SEV-2 — Partially Mitigated

Reported: 02:14 AM | Owner: Broker Tech | Status: Degraded — monitoring

What happened:

A surge in load activity (~3x normal volume) caused our shipment-events Kafka consumer group to fall behind. Consumer lag hit 42,000 messages at peak. Three downstream effects occurred:

Webhook delivery to broker TMS endpoints began timing out and retrying — 847 webhook events were delivered more than once to 34 different brokers.
Because our webhook delivery service retried without idempotency checking, some brokers' TMS systems processed the same status-change event multiple times — resulting in duplicate shipment records in at least 12 confirmed broker accounts.
Two brokers called in to report that carrier match notifications showed conflicting statuses: a load marked 'matched' in Convoy and 'available' in their TMS simultaneously.

What on-call did:

Scaled up consumer instances from 3 to 12 to drain the backlog. Consumer lag is now at 800 and falling. Webhook delivery has stabilized but the duplicate data is still in broker TMS systems. Root cause not yet identified.

What is NOT resolved:

Duplicate shipment records in 12+ broker accounts. We do not know the full scope.
Root cause of the consumer lag spike is unconfirmed.
No idempotency controls exist on the webhook delivery path — this will happen again.

Existing Code

You also have access to the relevant section of the webhook delivery service. Read the code carefully — it may contain issues beyond the primary incident.

// webhook-delivery.service.ts
// Shipment status change consumer — processes Kafka events and delivers webhooks
export class WebhookDeliveryService {
  constructor(
    private readonly http: HttpClient,
    private readonly db: DatabaseService,
    private readonly logger: Logger,
  ) {}

  async processShipmentEvent(event: KafkaMessage): Promise<void> {
    const payload = JSON.parse(event.value.toString());
    const brokers = await this.db.query(
      `SELECT * FROM broker_subscriptions WHERE load_id = ${payload.loadId}`
    );
    for (const broker of brokers) {
      try {
        await this.http.post(broker.webhookUrl, payload, { timeout: 5000 });
        this.logger.log(`Webhook delivered to broker ${broker.id}`);
      } catch (err) {
        this.logger.log(`Webhook failed, retrying: ${err}`);
        await this.processShipmentEvent(event); // retry
      }
    }
  }
}

Your Task — Three Deliverables

Deliverable 1 — Revised Implementation

Produce a revised version of the webhook delivery service that addresses the core production issues identified in the incident. Your implementation should be a working TypeScript/Node.js file — not pseudocode, not a diagram.

Your implementation must address:

Idempotency: webhook events should not produce duplicate side effects when delivered more than once
Error handling: distinguish between retryable and non-retryable failures; do not retry infinitely
Observability: structured logging with enough context that an on-call engineer could diagnose a repeat incident from logs alone
At least one additional issue you identify in the existing code beyond the primary incident
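
One possible reading of the error-handling item above is sketched below: classify each failed delivery as retryable or not, and cap the number of attempts with a bounded backoff. This is an illustrative sketch and not a reference solution. The names `classifyFailure`, `nextDelayMs`, and `shouldRetry`, the constants, and the status-code policy are all assumptions; none of them exist in the current service.

```typescript
// Hypothetical sketch: every name and policy value here is an assumption.
type FailureKind = "retryable" | "non-retryable";

const MAX_ATTEMPTS = 5;      // assumed cap on delivery attempts
const BASE_DELAY_MS = 500;   // assumed first backoff delay
const MAX_DELAY_MS = 30000;  // assumed ceiling on any single delay

// status is the HTTP status code, or undefined for a timeout or network error.
function classifyFailure(status: number | undefined): FailureKind {
  if (status === undefined) return "retryable";              // no response received
  if (status === 408 || status === 429) return "retryable";  // timeout, rate limit
  if (status >= 500) return "retryable";                     // server-side failure
  return "non-retryable";                                    // any other status
}

// Exponential backoff with a ceiling. attempt is 1-based.
function nextDelayMs(attempt: number): number {
  return Math.min(BASE_DELAY_MS * 2 ** (attempt - 1), MAX_DELAY_MS);
}

// Retry only retryable failures, and only while attempts remain.
function shouldRetry(status: number | undefined, attempt: number): boolean {
  return classifyFailure(status) === "retryable" && attempt < MAX_ATTEMPTS;
}
```

Other 4xx responses are treated as non-retryable here on the assumption that resending an identical request would fail the same way; whether that holds for a particular broker's TMS is something to verify rather than take as given.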

Scope note: This is a proof-of-concept implementation, not a full production rewrite. A focused, working solution that demonstrates the right patterns is more valuable than a comprehensive but skeletal one.

Deliverable 2 — README.md (Sections A, B, and C)

Section A — Written Analysis (300–500 words)

Address all four of the following in your written analysis:

Root cause: what caused the consumer lag spike, and what conditions allowed the duplicate-delivery problem to propagate as far as it did?
Design gap: why does the existing webhook delivery architecture not protect against this class of failure? What specific change closes the gap?
Consistency trade-off: the duplicate shipment records are now in broker TMS systems. Describe the trade-off between (a) attempting an automated cleanup and (b) leaving cleanup to brokers, and say which you would recommend given DAT's position in the broker/carrier relationship.
Scope decision: name one thing you explicitly chose NOT to include in your implementation and explain why — what would you tackle in a follow-up PR?

Section B — Production Runbook + Reasoning Question

Section B has two parts. Complete both.

Part B1 — Incident Runbook

Write a runbook for the next on-call engineer who encounters consumer lag on the shipment-events consumer group. The runbook should cover:

How to confirm the issue (which metrics or logs to check first)
Immediate mitigation steps (in the order they should be executed)
How to confirm the incident is resolved — not just mitigated
One question the on-call engineer should answer before closing the incident to prevent recurrence
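
As a rough illustration of the gap between "mitigated" and "resolved" in the items above, the sketch below computes total consumer lag from per-partition offsets and checks that it stays under a threshold across a whole window of samples, not only on the latest reading. The `PartitionOffsets` shape, the function names, and the threshold are assumptions for illustration; real figures would come from whatever monitoring the team already runs.

```typescript
// Hypothetical sketch: the data shape and function names are assumptions.
interface PartitionOffsets {
  partition: number;
  logEndOffset: number;     // latest offset written to the partition
  committedOffset: number;  // last offset committed by the consumer group
}

// Lag per partition is end offset minus committed offset, floored at zero.
function totalLag(parts: PartitionOffsets[]): number {
  return parts.reduce(
    (sum, p) => sum + Math.max(0, p.logEndOffset - p.committedOffset),
    0,
  );
}

// Treat lag as recovered only if every sample in the window is at or below
// the threshold. An empty window counts as "not recovered".
function lagRecovered(samples: number[], threshold: number): boolean {
  return samples.length > 0 && samples.every((lag) => lag <= threshold);
}
```

A lag check of this kind says nothing about the duplicate records already sitting in broker systems, so it can only be one part of a resolution check.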

Part B2 — Required Reasoning Question (answer without AI assistance)

Describe a scenario where an AI coding assistant would give you a plausible but incorrect answer for this type of problem — specifically, idempotency in a message-driven webhook delivery system. What would the incorrect output look like, and what would you check to identify the error before acting on it?

Answer this question in your own words without using an AI tool. We want to understand how you reason about AI failure modes — not how AI describes them.

Section C — AI Usage Log (Mandatory)

This is not a trick. We want to see how you work with AI — not whether you used it.

In a short section of your README, document your AI collaboration process. For each significant interaction with an AI tool, briefly note:

What you asked the AI to help with
What it gave you
What you kept, changed, or rejected — and why

Three interactions documented is sufficient. The log does not need to be exhaustive.

Deliverable 3 — Video Walkthrough (8–10 minutes)

Record your walkthrough as an MP4 or MOV file and upload it directly on the Provn platform as a separate file.

Structure your video to cover:

Summary (60 seconds): the incident, your diagnosis, and your recommended fix
Code walkthrough (3–4 minutes): walk through your revised implementation — explain the key decisions, not just what the code does
Runbook walkthrough (1–2 minutes): walk through your Part B1 runbook — how would you actually use this at 2am?
Mandatory AI question (1–2 minutes): see the AI Usage Guidance section below
Reflection (30–60 seconds): what would you tackle next, and what trade-off are you least confident in?

Speak naturally. Communication is assessed on clarity of technical reasoning and logical structure — not verbal polish, accent, or filler words.

Constraints

Honor all four. AI tools will typically ignore them. Evaluators will check each one.

Stack constraint: your implementation must use TypeScript and Node.js — the existing DAT Broker Tech stack. Do not introduce a different runtime, language, or framework.
Infrastructure constraint: idempotency must be implemented using a persistent store (e.g. a database or Redis-style cache) — not in-memory state that would not survive a pod restart in Kubernetes.
Organizational constraint: you cannot change the Kafka topic structure or partition schema. The consumer group configuration is managed by a separate platform team and any changes require a two-week change request. Your fix must work within the current consumer group design.
Ownership constraint: your on-call team will inherit whatever you build. Write your runbook and code comments for the engineer who gets paged at 2am — not the one who built it.
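
To make the infrastructure constraint concrete, the sketch below derives a deterministic idempotency key for one delivery, so that the same Kafka message redelivered to the same broker always maps to the same key in a persistent store. It is a minimal sketch under stated assumptions: the `DeliveryCoordinates` shape, the `idempotencyKey` function, and the `webhook:` prefix are hypothetical and do not come from the existing service.

```typescript
// Hypothetical sketch: the type, function name, and key format are assumptions.
interface DeliveryCoordinates {
  topic: string;
  partition: number;
  offset: string;    // Kafka offsets are 64-bit, so a string avoids precision loss
  brokerId: string;
}

// One key per (Kafka message, broker) pair. A redelivered message carries the
// same topic, partition, and offset, so it produces the same key.
function idempotencyKey(c: DeliveryCoordinates): string {
  return `webhook:${c.topic}:${c.partition}:${c.offset}:${c.brokerId}`;
}

const key = idempotencyKey({
  topic: "shipment-events",
  partition: 3,
  offset: "1042",
  brokerId: "b-17",
});
console.log(key); // webhook:shipment-events:3:1042:b-17
```

A key like this only helps if it is recorded atomically in a store that survives a pod restart, for example with Redis `SET` using the `NX` and `EX` options or a unique-key insert in a relational database. Note also that a key built from Kafka coordinates does not deduplicate the same logical event published twice at different offsets; a key built from business fields would be needed for that case.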

Evaluation Criteria

Your submission is evaluated across five dimensions. Weights reflect what DAT's Broker Tech team cares most about.

Systems Design & Technical Judgment (30%): Identifies the root cause correctly, explains the design gap that allowed the incident to propagate, and proposes an architecture fix with explicit trade-off reasoning — not just a patch.
Production Code Quality & Engineering Craft (25%): TypeScript implementation is production-intentioned: meaningful types, structured logging with context, explicit error handling that distinguishes retryable from non-retryable failures, and at least one meaningful test.
Message-Driven & Integration Architecture (20%): Demonstrates working knowledge of Kafka delivery semantics and idempotency design, not just name-level familiarity with the concepts. Webhook contract design reflects the reality of broker TMS integration failures.
Communication & Technical Leadership (10%): Written analysis and runbook are structured for handoff — a new team member could act on them without asking clarifying questions. Trade-off reasoning is legible to a non-engineer.
AI Fluency (15%): Evidence of directing AI with domain-specific constraints, critical evaluation of AI output, and iteration. The AI Usage Log and video answer to the mandatory question are the primary evidence sources.

AI Usage Guidance

We expect you to use AI tools. We evaluate how you use them — not whether you use them. Evidence of iteration, redirection, and critical evaluation scores higher than a polished output with no process documentation.

The single highest-signal indicator: your video answer to the mandatory AI question. If you cannot name a specific moment where you redirected AI output, evaluators will assume you did not.

Mandatory AI question (include in your video):

"Walk me through one moment where you disagreed with, pushed back on, or redirected what the AI gave you — and what you did instead. Name the specific moment. Explain what the AI produced that didn't meet the bar, what you did differently, and why."

Note: Part B2 of your README must be completed without AI assistance. This is not about AI detection — it is about understanding how you reason through AI failure modes independently.

Submission Checklist

Before you submit, confirm:

Implementation file(s) — TypeScript/Node.js, uploaded as a separate file
README.md — includes Section A (written analysis), Section B (Part B1 runbook + Part B2 reasoning question), and Section C (AI Usage Log)
Video walkthrough — 8–10 minutes, MP4 or MOV, uploaded as a separate file
All four constraints honored — check each one before submitting
Part B2 answered in your own words, without AI assistance

Upload each deliverable as a separate file directly on the Provn platform: your implementation file(s), your README document, and your video walkthrough. Do not bundle files into a ZIP. Do not link to external repositories or video platforms.
