Spare Incident Report
April 16, 2026
Executive Summary
On April 16, 2026 at 12:50 pm PT, the Spare platform experienced a service disruption affecting customers in our CA and US2 regions. During a 52-minute window, users experienced varying degrees of system instability:
- For CA Region Users: The impact was primarily limited to specific dashboard views. Most notably, users reported that the Duty Schedule page would load vehicle numbers, but driver names would remain stuck in a "loading" state or fail to appear.
- For US2 Region Users: In addition to the missing driver names on the Duty Schedule, users experienced slowness on numerous other pages. Key operations, including searching for riders, retrieving trip estimates, and loading payment options, were up to 10 times slower than normal.
Service was fully restored to all affected organizations by 1:42 pm PT.
Incident Details
The disruption was first identified at 12:50 pm PT when internal monitoring detected extreme resource saturation on a primary database cluster.
The instability manifested differently across regions due to variations in infrastructure capacity. In the US2 region, the heavy load caused several high-traffic API endpoints to experience severe latency, with response times for tasks like searching riders or updating the live map rising to between 10 and 60 seconds.
Cause
The primary root cause was a software regression introduced in a recent update to the "list duties" endpoint.
- Logic Error: A refactor intended to implement a new redundancy pattern introduced a bug in the handling of the "list duties by IDs" filter.
- Database Overload: Because the filter was not applied, when the frontend attempted to look up details for a small group of duties, the system instead attempted to count and process every duty record in the entire organization.
- Resource Exhaustion: This massive, unintended data processing task created excessive database load, leading to the performance degradation observed in US2 and the loading failures in both regions.
- Testing Gap: The issue was not detected during manual QA because local test environments used smaller datasets where the performance impact was not visible. Additionally, existing automated tests did not specifically exercise the "list by IDs" query parameter.
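To illustrate the failure mode, the sketch below shows the general pattern of this class of bug, assuming a hypothetical list_duties helper (this is not Spare's actual code): an empty IDs filter silently falls through to a full scan of the organization's records.

```python
def list_duties(duties, ids=None):
    """Hypothetical buggy helper: return duties matching the given IDs.

    Failure mode: when `ids` is falsy (None OR an empty list), the
    filter clause is dropped and every record is returned, turning a
    small lookup into a scan of the entire organization's duties.
    """
    if ids:  # BUG: an empty list skips the filter entirely
        wanted = set(ids)
        return [d for d in duties if d["id"] in wanted]
    return list(duties)  # full, unintended scan


def list_duties_fixed(duties, ids=None):
    """Corrected: only an absent filter (None) means "all duties";
    an explicitly empty ID list returns no records."""
    if ids is None:
        return list(duties)
    wanted = set(ids)
    return [d for d in duties if d["id"] in wanted]
```

Against a production-sized dataset, the buggy path processes every row, which is consistent with the resource saturation described above.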
Mitigation and Resolution
Our engineering team implemented the following steps to stabilize the platform:
- Workload Management: We temporarily disabled non-essential background tasks in the CA and US2 regions to alleviate immediate pressure on the database.
- Database Optimization: The team manually executed optimization commands (ANALYZE) and temporarily adjusted database indexes to improve query efficiency.
- Service Rollback: Once the specific cause was identified, engineering rolled back the recent API deployment in CA at 1:40 pm PT, followed immediately by US2.
- Service Restoration: Full system stability was confirmed, and all background services were re-enabled by 1:42 pm PT.
Timeline
All times are in Pacific Time (PT).
- 12:50 pm: Incident begins; monitoring detects critical resource saturation in the US2 region.
- 1:13 pm: Incident team responds by disabling recurring background jobs to reduce load.
- 1:16 pm: Reports received from customers regarding driver names failing to load on the Duty Schedule.
- 1:31 pm: Engineering initiates manual database optimization and index adjustments.
- 1:40 pm: Successful API rollback in the CA region; performance begins to normalize.
- 1:42 pm: API rollback completed in US2; full service restored across all affected regions.
Next Steps
To prevent a recurrence and improve platform resilience, we are taking the following actions:
- Expand Test Coverage: We are updating our automated test suites to ensure that all query parameters for core API endpoints are rigorously tested for both accuracy and performance.
- Infrastructure Review: We are reviewing our database tier configurations to ensure they are appropriately sized to handle unexpected workload spikes.
- Advanced Monitoring: We are adding high-urgency alerting on CPU utilization to our "canary" deployment process so that performance regressions are detected before they reach the general user population.
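The expanded test coverage described above might look like the following sketch, written against the same hypothetical list_duties helper (names and signatures are illustrative, not Spare's actual test suite): one test checks accuracy of the IDs filter, and one guards against the filter degrading into a full scan.

```python
def list_duties(duties, ids=None):
    """Illustrative implementation under test: None means "no filter";
    an explicit ID list restricts the result to exactly those duties."""
    if ids is None:
        return list(duties)
    wanted = set(ids)
    return [d for d in duties if d["id"] in wanted]


def test_list_by_ids_returns_only_requested_duties():
    duties = [{"id": i} for i in range(10_000)]
    result = list_duties(duties, ids=[3, 7])
    # Accuracy: exactly the requested records come back.
    assert sorted(d["id"] for d in result) == [3, 7]


def test_list_by_empty_ids_does_not_scan_everything():
    duties = [{"id": i} for i in range(10_000)]
    # Performance guard: an empty filter must not degrade into
    # returning (and processing) the whole organization's duties.
    assert list_duties(duties, ids=[]) == []
```

Tests of this shape run against a realistically sized dataset, so a regression like the one in this incident fails in CI rather than in production.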
We understand the critical role our platform plays in your daily operations and sincerely apologize for the frustration this disruption caused your team and your riders. We are committed to strengthening our infrastructure to deliver the reliability you expect.