Spare Incident Report
February 4, 2026
Executive Summary
On February 4, 2026, the Spare platform experienced service disruptions primarily impacting customers served by our CA region infrastructure. Users experienced two distinct windows of instability characterized by system slowness, operation timeouts, and intermittent error messages.
The disruption was caused by database resource contention following a scheduled infrastructure adjustment, compounded by a "cold cache" state and stale query optimization statistics that led to connection pool exhaustion. Our engineering team addressed the issues through immediate capacity increases, manual optimization of database statistics, and temporary suspension of non-essential background tasks to prioritize core service stability. Service was fully restored to all affected organizations by 11:14 am PT.
Incident Details
The incident occurred in two phases. The first began at 7:20 am PT when monitoring systems alerted the team to severe connection pool exhaustion in one of our database clusters. A second period of instability occurred at 10:16 am PT when, under peak morning load, the database query planner was working from stale statistics and produced inefficient query plans.
Cause
The primary root cause was the exhaustion of the database connection pool, which prevented backend services from processing queries.
Our detailed investigation identified two contributing factors:
- Resource Contention: A database scale-down operation performed the previous evening resulted in insufficient capacity to handle the surge in morning traffic.
- Stale Query Statistics: Following the scale-up required to restore capacity, the database query planner lacked up-to-date statistics, which had to be refreshed with a manual ANALYZE operation. This caused the system to fall back on inefficient query plans, leading to high disk I/O and further instances of connection pool exhaustion.
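For readers interested in the mechanics, the sketch below shows one way this kind of staleness can be spotted, assuming a PostgreSQL-family database (consistent with the ANALYZE operation described above). It is illustrative only: the use of Python with the psycopg2 driver, the connection string, and the freshness threshold are assumptions, not a description of our production tooling. The check reads PostgreSQL's pg_stat_user_tables view, which records when each table's planner statistics were last refreshed, either manually or by autovacuum.

```python
# Illustrative sketch only: the connection string and threshold are assumptions.
import os
from datetime import datetime, timedelta, timezone

import psycopg2  # example PostgreSQL client, used here purely for illustration

# Placeholder DSN, e.g. "postgresql://user:pass@host:5432/dbname"
DSN = os.environ.get("DATABASE_URL", "postgresql://localhost/postgres")
STALE_AFTER = timedelta(hours=12)  # arbitrary freshness threshold for this example


def report_stale_statistics(dsn: str) -> None:
    """Print tables whose planner statistics have not been refreshed recently."""
    now = datetime.now(timezone.utc)
    with psycopg2.connect(dsn) as conn, conn.cursor() as cur:
        cur.execute(
            """
            SELECT schemaname, relname, last_analyze, last_autoanalyze
            FROM pg_stat_user_tables
            """
        )
        for schema, table, last_analyze, last_autoanalyze in cur.fetchall():
            # Take the most recent of a manual ANALYZE and autovacuum's auto-analyze.
            freshest = max(filter(None, [last_analyze, last_autoanalyze]), default=None)
            if freshest is None or now - freshest > STALE_AFTER:
                print(f"{schema}.{table}: statistics stale (last refreshed: {freshest})")


if __name__ == "__main__":
    report_stale_statistics(DSN)
```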
Mitigation and Resolution
Our team’s response focused on immediate capacity restoration and system stabilization:
- Capacity Increase: Engineering immediately scaled the CA database back to its previous capacity tier and increased the number of API pods to distribute the load more effectively.
- Query Optimization: To resolve the second wave of instability, the team manually executed ANALYZE commands across the database clusters to refresh statistics and restore efficient query processing (a sketch of this operation follows this list).
- Workload Management: We temporarily disabled non-critical background jobs during the incident to reduce database pressure.
- Service Restoration: Full service was confirmed for all organizations served from our CA region by 11:14 am PT.
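As a minimal sketch of what the statistics refresh described under "Query Optimization" involves, again assuming a PostgreSQL-family database and using Python with psycopg2 purely for illustration (the cluster connection strings below are placeholders, not real endpoints), the operation amounts to running ANALYZE against each affected database:

```python
# Illustrative sketch only: cluster DSNs are placeholders, not real endpoints.
import psycopg2

# Hypothetical list of database clusters needing a statistics refresh.
CLUSTER_DSNS = [
    "postgresql://user:pass@ca-cluster-1.example.internal:5432/spare",
    "postgresql://user:pass@ca-cluster-2.example.internal:5432/spare",
]


def refresh_statistics(dsn: str) -> None:
    """Run ANALYZE so the planner works from current row counts and histograms."""
    conn = psycopg2.connect(dsn)
    try:
        conn.autocommit = True  # run the statement outside an explicit transaction
        with conn.cursor() as cur:
            cur.execute("ANALYZE;")  # database-wide statistics refresh
    finally:
        conn.close()


if __name__ == "__main__":
    for dsn in CLUSTER_DSNS:
        refresh_statistics(dsn)
        print(f"Statistics refreshed for cluster {dsn.split('@')[-1]}")
```

Because ANALYZE samples rows rather than scanning entire tables, a database-wide refresh like this is typically quick relative to the instability caused by stale statistics.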
Timeline
All times are in Pacific Time (PT).
- 07:20 am: Monitoring detects connection pool exhaustion; Incident Team responds.
- 07:26 am: Team initiates CA database scale-up to restore capacity.
- 08:00 am: API pod health checks temporarily removed to reduce database load during recovery.
- 09:21 am: Database scale-up completes; initial service restoration achieved.
- 10:16 am: A second wave of high error rates is detected as morning load peaks.
- 11:03 am: Non-essential background jobs are disabled to prioritize core platform stability.
- 11:14 am: Manual database optimization (ANALYZE) completed; full service restored to all organizations.
- 12:52 pm: Background jobs re-enabled after confirming sustained system stability.
Next Steps
To ensure continued reliability and prevent a recurrence of this resource contention, we are undertaking the following:
- Infrastructure Configuration: We are updating our infrastructure-as-code (Terraform) configurations to prevent automatic down-scaling of production-critical database tiers, and to schedule database statistics refreshes (via ANALYZE operations) more appropriately.
- Operational Procedures: We are implementing a mandatory requirement to refresh database statistics following any manual or automated scaling event.
- Performance Monitoring: We are enhancing our alerts for "cold cache" symptoms and disk I/O wait times to detect potential bottlenecks before they result in connection exhaustion.
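To make the monitoring item above concrete, the following sketch shows one shape such a check could take, once more assuming a PostgreSQL-family database; the thresholds, connection string, and use of Python with psycopg2 are illustrative assumptions rather than a description of our monitoring stack. It compares connections in use against the server's max_connections setting and counts backends currently waiting on disk I/O:

```python
# Illustrative sketch only: thresholds and the alerting hook are assumptions.
import os

import psycopg2

DSN = os.environ.get("DATABASE_URL", "postgresql://localhost/postgres")
CONNECTION_UTILIZATION_LIMIT = 0.80  # example threshold: 80% of max_connections
IO_WAIT_BACKEND_LIMIT = 20           # example threshold: backends waiting on disk I/O


def check_database_pressure(dsn: str) -> list[str]:
    """Return warning messages if connection usage or I/O waits look unhealthy."""
    warnings = []
    with psycopg2.connect(dsn) as conn, conn.cursor() as cur:
        cur.execute("SELECT count(*) FROM pg_stat_activity;")
        in_use = cur.fetchone()[0]

        cur.execute("SELECT setting::int FROM pg_settings WHERE name = 'max_connections';")
        max_connections = cur.fetchone()[0]

        cur.execute("SELECT count(*) FROM pg_stat_activity WHERE wait_event_type = 'IO';")
        io_waiters = cur.fetchone()[0]

    if in_use / max_connections > CONNECTION_UTILIZATION_LIMIT:
        warnings.append(f"Connection usage high: {in_use}/{max_connections}")
    if io_waiters > IO_WAIT_BACKEND_LIMIT:
        warnings.append(f"{io_waiters} backends waiting on disk I/O")
    return warnings


if __name__ == "__main__":
    for message in check_database_pressure(DSN):
        print(f"ALERT: {message}")  # in practice this would page the on-call engineer
```

In practice a check like this would run continuously and feed an alerting pipeline rather than print to a console.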
We understand that this disruption was more than just a technical failure; it was a breakdown in the service you and your riders depend on every day. We are deeply sorry for the immense frustration and operational strain this caused your teams. Reliability is the foundation of our partnership, and we fell short of that standard today. We are working with absolute urgency to implement the safeguards required to deliver the level of uptime performance you expect.