Performance Degradation Impacting CA & US2 (Canada and United States 2) Users

Incident Report for Spare Platform

Postmortem

April 16, 2026

Executive Summary

On April 16, 2026 at 12:50 pm PT, the Spare platform experienced a service disruption affecting customers in our CA and US2 regions. During a 52-minute window, users experienced varying degrees of system instability:

  • For CA Region Users: The impact was primarily limited to specific dashboard views. Most notably, users reported that the Duty Schedule page loaded vehicle numbers, but driver names remained stuck in a "loading" state or were missing entirely.
  • For US2 Region Users: In addition to the missing driver names on the Duty Schedule, users experienced slowness on numerous other pages. Key operations, including searching for riders, retrieving trip estimates, and loading payment options, were up to 10 times slower than normal.

Service was fully restored to all affected organizations by 1:42 pm PT.

Incident Details

The disruption was first identified at 12:50 pm PT when internal monitoring detected extreme resource saturation on a primary database cluster.

The instability manifested differently across regions due to variations in infrastructure capacity. In the US2 region, the heavy load caused several high-traffic API endpoints to experience severe latency, with response times for tasks like searching riders or updating the live map rising to between 10 and 60 seconds.

Cause

The primary root cause was a software regression introduced in a recent update to the "list duties" endpoint.

  • Logic Error: A refactor intended to implement a new redundancy pattern contained a bug related to the "list duties by IDs" filter.
  • Database Overload: When the frontend attempted to look up details for a small group of duties, the system instead attempted to count and process every duty record within the entire organization.
  • Resource Exhaustion: This massive, unintended data processing task created excessive database load, leading to the performance degradation observed in US2 and the loading failures in both regions.
  • Testing Gap: This issue was not detected during manual QA because local test environments utilized smaller datasets where the performance impact was not visible. Additionally, current automated tests did not specifically target the "list by ID" query parameter.
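The bullet points above describe a common failure mode: a refactor that mishandles an "empty list of IDs" case, so a request meant to fetch a handful of records silently becomes an organization-wide scan. The sketch below is purely illustrative, assuming a hypothetical query builder; the function and table names are not Spare's actual code.

```python
def build_duty_query(org_id, duty_ids=None):
    """Buggy version: returns (sql, params) for listing duties.

    Because `if duty_ids` is falsy for an empty list, a request for zero
    duties drops the ID filter entirely and queries EVERY duty in the org,
    which is the kind of unbounded load described in this incident.
    """
    sql = "SELECT * FROM duties WHERE org_id = ?"
    params = [org_id]
    if duty_ids:  # BUG: [] (empty filter) falls through to a full org scan
        placeholders = ", ".join("?" for _ in duty_ids)
        sql += f" AND id IN ({placeholders})"
        params.extend(duty_ids)
    return sql, params


def build_duty_query_fixed(org_id, duty_ids=None):
    """Fixed version: distinguishes 'no filter' (None) from 'these IDs'."""
    sql = "SELECT * FROM duties WHERE org_id = ?"
    params = [org_id]
    if duty_ids is not None:
        if not duty_ids:
            # An explicit empty filter should match nothing, not everything.
            return "SELECT * FROM duties WHERE 1 = 0", []
        placeholders = ", ".join("?" for _ in duty_ids)
        sql += f" AND id IN ({placeholders})"
        params.extend(duty_ids)
    return sql, params
```

The key design point is treating `None` (no filter requested) and `[]` (filter on zero IDs) as distinct cases, so an empty filter can never widen into a full scan.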

Mitigation and Resolution

Our engineering team implemented the following steps to stabilize the platform:

  • Workload Management: We temporarily disabled non-essential background tasks in the CA and US2 regions to alleviate immediate pressure on the database.
  • Database Optimization: The team manually executed optimization commands (ANALYZE) and temporarily adjusted database indexes to improve query efficiency.
  • Service Rollback: Once the specific cause was identified, engineering rolled back the recent API deployment in CA at 1:40 pm PT, followed immediately by US2.
  • Service Restoration: Full system stability was confirmed, and all background services were re-enabled by 1:42 pm PT.
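As a small illustration of the ANALYZE step above: ANALYZE refreshes the statistics the query planner uses to pick efficient plans, which matters most when a table's contents have shifted under load. The example below uses SQLite (whose ANALYZE is analogous to PostgreSQL's) with a toy `duties` table; it is a sketch of the concept, not Spare's production procedure.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE duties (id INTEGER PRIMARY KEY, org_id INTEGER)")
conn.executemany(
    "INSERT INTO duties (org_id) VALUES (?)",
    [(i % 3,) for i in range(300)],
)
conn.execute("CREATE INDEX idx_duties_org ON duties(org_id)")

# ANALYZE gathers fresh statistics; the planner reads them from sqlite_stat1
# (PostgreSQL keeps the equivalent in pg_statistic / pg_stats).
conn.execute("ANALYZE")
stats = conn.execute("SELECT tbl, idx FROM sqlite_stat1").fetchall()
print(stats)
```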

Timeline

All times are in Pacific Time (PT).

  • 12:50 pm: Incident begins; monitoring detects critical resource saturation in the US2 region.
  • 1:13 pm: Incident team responds by disabling recurring background jobs to reduce load.
  • 1:16 pm: Reports received from customers regarding driver names failing to load on the Duty Schedule.
  • 1:31 pm: Engineering initiates manual database optimization and index adjustments.
  • 1:40 pm: Successful API rollback in the CA region; performance begins to normalize.
  • 1:42 pm: API rollback completed in US2; full service restored across all affected regions.

Next Steps

To prevent a recurrence and improve platform resilience, we are taking the following actions:

  • Expand Test Coverage: We are updating our automated test suites to ensure that all query parameters for core API endpoints are rigorously tested for both accuracy and performance.
  • Infrastructure Review: We are reviewing our database tier configurations to ensure they are appropriately sized to handle unexpected workload spikes.
  • Advanced Monitoring: We are adding CPU-utilization analysis, with high-urgency alerting, to our "canary" deployment process to detect performance regressions before they reach the general user population.
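The "Expand Test Coverage" action above could take the shape of a regression test that exercises the list-by-IDs parameter directly, including the empty-list edge case at the heart of this incident. The sketch below uses a hypothetical `list_duties` stand-in, not Spare's actual API.

```python
def list_duties(all_duties, org_id, duty_ids=None):
    """Stand-in for a 'list duties' endpoint: None means no ID filter,
    a list means return only duties with those IDs."""
    rows = [d for d in all_duties if d["org_id"] == org_id]
    if duty_ids is not None:
        wanted = set(duty_ids)
        rows = [d for d in rows if d["id"] in wanted]
    return rows


duties = [{"id": i, "org_id": i % 2} for i in range(1000)]

# Accuracy: only the requested IDs come back.
assert {d["id"] for d in list_duties(duties, 0, [2, 4])} == {2, 4}

# The edge case from this incident: an empty ID list must return nothing,
# not every duty in the organization.
assert list_duties(duties, 0, []) == []
```

A performance-focused variant of the same test would additionally assert that the generated query stays bounded by the requested IDs rather than scanning the whole table.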

We understand the critical role our platform plays in your daily operations and sincerely apologize for the frustration this disruption caused your team and your riders. We are committed to strengthening our infrastructure to deliver the reliability you expect.

Posted Apr 17, 2026 - 12:16 PDT

Resolved

The issue impacting CA and US2 (Canada and United States 2) environments has been resolved.

Users should now be able to access and view duty schedules in the Spare Operations platform as expected.

Thank you for your patience while we worked to restore service.
Posted Apr 16, 2026 - 13:56 PDT

Monitoring

A rollback has been implemented for the CA and US2 (Canada and United States 2) environments, and system performance has improved.

Users should now be able to load and view duty schedules as expected.

Our team is continuing to monitor the situation closely to ensure stability.
Posted Apr 16, 2026 - 13:46 PDT

Update

We are continuing to investigate the performance degradation impacting CA and US2 (Canada and United States 2) environments.

At this time, some users may still experience issues loading the Duty Schedule page or viewing duty details, which may impact real-time operations.

Our engineering team is actively working to identify the root cause and restore full functionality as quickly as possible.

We’ll share further updates as soon as more information is available.
Posted Apr 16, 2026 - 13:32 PDT

Investigating

We are currently experiencing a performance degradation impacting CA and US2 (Canada and United States 2) environments.

Users may be unable to view or load duty schedules within the Spare Operations Demand Response Admin Platform.

Our team is actively investigating the issue and will provide updates as more information becomes available.
Posted Apr 16, 2026 - 13:17 PDT
This incident affected: Administrator Portal (Canada, United States 2).