Incident Management
How we declare, respond to, communicate, and learn from incidents.
Purpose
Restore normal operations quickly, protect customer trust, and learn from urgent reactive work without turning every issue into a heavyweight process.
Nest follows the incident.io principle that incidents are not only engineering outages. An incident is anything unexpected that pulls us away from planned work with urgency. Incidents often start in engineering, but the process also covers support, operations, billing, security/privacy, vendor, partner, and customer success issues.
Operating principles
- Declare early. A false alarm is cheaper than uncoordinated urgent work.
- Anyone at Nest can declare an incident.
- Manage incidents in the open in incident.io and Slack. Do not create private incident channels; redact sensitive details instead.
- Mitigate customer or business impact first. Root-cause analysis and cleanup happen after impact is stable.
- Keep a written trail of decisions, evidence, actions, and handoffs.
- Use debriefs to understand contributors, mitigators, risks, and learnings. Avoid reducing incidents to a single "root cause."
Tooling
- incident.io is the system of record for the incident.
- Slack is the primary working surface: #incidents, #engineering, #production-support, and the per-incident channel.
- Linear is the system of record for follow-up actions after the incident.
- Zendesk remains the support intake and customer ticketing system.
- Google Meet or Zoom is used when voice/video coordination is faster than Slack. Summarize key decisions and action owners back into the incident channel.
- The public status page is status.nest.vet.
When to declare
Declare an incident when one or more of these is true:
- Customer, clinic, partner, or internal operations impact is happening or likely to happen soon.
- The work is urgent enough that someone must drop planned work to respond.
- Multiple people or functions need to coordinate.
- Support or ops needs a reliable source of truth for customer updates.
- The issue may require rollback, feature disablement, status page updates, or executive decision-making.
- We want a durable timeline because the issue may teach us something important.
If you are unsure, declare a Minor incident and downgrade or close it later.
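The checklist above is an any-of rule, which can be sketched in a few lines of Python. This is only an illustration: the field names are hypothetical labels for the bullets above, not incident.io fields or an existing Nest tool.

```python
from dataclasses import dataclass, fields


@dataclass
class DeclareSignals:
    """One flag per bullet in the checklist. Names are illustrative only."""
    impact_now_or_soon: bool = False
    must_drop_planned_work: bool = False
    needs_cross_team_coordination: bool = False
    support_needs_source_of_truth: bool = False
    may_need_rollback_or_exec_decision: bool = False
    want_durable_timeline: bool = False


def should_declare(signals: DeclareSignals, unsure: bool = False) -> bool:
    """Declare when any signal is true, or when the reporter is unsure."""
    any_signal = any(getattr(signals, f.name) for f in fields(signals))
    return any_signal or unsure
```

A single true signal is enough, and being unsure counts as a reason to declare (as Minor) rather than a reason to wait.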
Severity
Use one company-wide severity model. If an older workflow says Low, Medium, or High, map those to Minor, Major, and Critical respectively.
| Severity | Use when | Examples | Response expectations |
|---|---|---|---|
| Minor | Limited impact, low spread, or urgent investigation that can usually stay within working hours. | Scoped integration degradation; a contained data exposure with no payment/card details, credentials, keys, or broad customer impact; a single-customer issue that needs coordination. | Declare in incident.io, coordinate in the incident channel, update support if customers are involved, debrief optional. |
| Major | Meaningful customer, operational, or business impact with a workaround or contained blast radius. | Billing issue affecting a cohort, data integration failure affecting multiple clinics, customer-visible degradation, repeated support tickets for the same production problem. | Incident Lead required, notify John and Jacob, assign Support/Ops comms if customers are impacted, status page if impact is visible, short debrief expected. |
| Critical | Severe or widespread customer/business impact, loss of a critical workflow, or sensitive exposure involving payment/card details, credentials, or keys. | Raven down, Bubble checkout broken, widespread billing failure, broad data integration outage, suspected credential/key exposure. | Use incident.io app escalation, notify John/Jacob/Ishani as appropriate, assign Support/Ops comms, update status page, full debrief required. |
Pick the more severe level if people are debating severity during active response. Update the severity later when scope is clearer.
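The legacy mapping and the "pick the more severe level" rule can be sketched as follows. The enum and function names are assumptions made for illustration; they do not correspond to incident.io configuration.

```python
from enum import IntEnum


class Severity(IntEnum):
    """Company-wide severity levels, ordered from least to most severe."""
    MINOR = 1
    MAJOR = 2
    CRITICAL = 3


# Mapping for older workflows that still say Low, Medium, or High.
LEGACY_TO_SEVERITY = {
    "low": Severity.MINOR,
    "medium": Severity.MAJOR,
    "high": Severity.CRITICAL,
}


def from_legacy(label: str) -> Severity:
    """Translate a legacy label into the company-wide model."""
    return LEGACY_TO_SEVERITY[label.strip().lower()]


def settle_debate(*proposed: Severity) -> Severity:
    """During active response, take the most severe level anyone proposed."""
    return max(proposed)
```

For example, `settle_debate(Severity.MINOR, Severity.MAJOR)` returns `Severity.MAJOR`; the severity can still be lowered later once scope is clearer.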
Status
Use simple incident statuses:
- Investigating: we believe something is wrong, but scope or cause is not clear.
- Fixing: we understand the likely issue and are applying mitigation or repair.
- Monitoring: impact appears mitigated and we are validating recovery.
- Resolved: immediate impact is over and remaining work can move to Linear.
- Closed: debrief and required follow-ups are captured.
Minor incidents can move from Resolved to Closed immediately when no debrief is needed.
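As a rough sketch, the closing rule can be written as a small check. It assumes that Major and Critical incidents need a captured debrief before closing and that Minor incidents do not; the function name and string values are illustrative, and follow-up tracking is left out for brevity.

```python
# Statuses in the order an incident normally passes through them.
STATUSES = ["Investigating", "Fixing", "Monitoring", "Resolved", "Closed"]


def can_close(status: str, severity: str, debrief_captured: bool) -> bool:
    """Return True when an incident may move from Resolved to Closed.

    Assumption: only Minor incidents may skip the debrief.
    """
    if status != "Resolved":
        return False
    if severity == "Minor":
        return True
    return debrief_captured
```

Under these assumptions, `can_close("Resolved", "Minor", False)` is allowed, while a Critical incident stays Resolved until its debrief is captured.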
Roles
Every incident needs one clear Incident Lead. Other responsibilities should be assigned as actions or lightweight hats only when they are needed.
| Responsibility | When needed | What to do |
|---|---|---|
| Incident Lead | Every incident | Coordinate response, assign actions, keep the incident moving, decide when to escalate, and make sure updates happen. Default leads are Akansh or John until a formal rotation exists. |
| Investigation owner | When a specific person needs to drive technical or operational diagnosis | Investigate, propose mitigations, run commands or deploy/rollback changes, and post evidence into the channel. |
| Comms owner | Customer-visible incidents or Major/Critical incidents | Keep internal/customer updates moving. For Nest, this is usually Support/Ops. |
Do not create standing roles for executives, legal, security, finance, or other functions. Escalate to those people only when the incident needs their decision, context, or approval.
The Incident Lead should avoid becoming the only person debugging. Their job is coordination, context, escalation, and decision flow.
Response procedure
- Declare the incident in incident.io and choose an initial severity.
- Work in the generated incident channel. Post a short summary: what happened, known impact, current status, who is leading, and what happens next.
- Assign the Incident Lead and any needed investigation or comms owners.
- Establish whether impact is customer-visible. If yes, route Support/Ops into the response and decide whether to update status.nest.vet.
- Mitigate first: rollback, pause risky jobs, disable a non-critical feature, scale down load, or apply another low-risk stabilization step.
- Capture evidence in the channel: dashboards, logs, error messages, commands, hypotheses, and what was verified. Summarize any meeting decisions back into Slack.
- Send internal updates in the incident channel and #incidents. For Major/Critical incidents, also notify #engineering and the relevant leaders.
- Send customer/status page updates every 30 minutes or as meaningful updates are available. Include the next expected update time.
- Move to Monitoring when impact appears mitigated. Validate billing, data integration, and affected customer workflows before resolving.
- Resolve the incident when immediate impact is over and remaining work can be tracked normally in Linear.
- Create Linear follow-up actions with owners and links back to the incident.
- Close the incident after the required debrief and follow-ups are captured.
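The 30-minute cadence for customer and status page updates in the procedure above can be sketched as a small helper for working out the "next expected update" time. This treats 30 minutes as the longest acceptable gap, which is an interpretation; the function names are hypothetical.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

# Longest gap between customer/status page updates (assumed reading of the cadence).
UPDATE_INTERVAL = timedelta(minutes=30)


def next_update_due(last_update: datetime) -> datetime:
    """Time to quote as the next expected update."""
    return last_update + UPDATE_INTERVAL


def is_update_overdue(last_update: datetime, now: Optional[datetime] = None) -> bool:
    """True once the promised update time has arrived or passed."""
    now = now or datetime.now(timezone.utc)
    return now >= next_update_due(last_update)
```

A meaningful change can always be posted sooner; the helper only guards against going silent.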
Communication
Good updates answer four questions:
- What is happening?
- Who or what is affected?
- What are we doing now?
- When will we update again?
Use plain language. Avoid unexplained internal acronyms. If the answer is "unknown," say that directly and explain what is being checked.
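One way to keep updates consistent is a small formatter that refuses to produce an update unless all four questions are answered. This is a sketch only; `format_update` and its labels are hypothetical and not part of incident.io or Slack.

```python
def format_update(happening: str, affected: str, doing_now: str, next_update: str) -> str:
    """Build a plain-text update that answers the four questions in order."""
    answers = {
        "What is happening": happening,
        "Who or what is affected": affected,
        "What we are doing now": doing_now,
        "Next update": next_update,
    }
    # An empty answer is not allowed; write "Unknown" plus what is being checked.
    missing = [label for label, text in answers.items() if not text.strip()]
    if missing:
        raise ValueError("Missing answers: " + ", ".join(missing))
    return "\n".join(f"{label}: {text.strip()}" for label, text in answers.items())
```

An answer such as "Unknown; we are checking billing job logs" passes the check, while a blank field does not.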
Customer communication
Create or update the status page when customers can observe impact or when Support/Ops needs a single external source of truth. Customer updates should:
- Provide enough context to reduce speculation.
- Explain whether we are investigating, fixing, or monitoring.
- Tell customers what they should do, if there is an action to take.
- Avoid over-specific technical claims until verified.
- Repeat "no material change" updates rather than going silent.
Customer communications are owned by Support/Ops unless the Incident Lead assigns someone else.
Sensitive information
Do not post payment/card details, credentials, keys, raw secrets, or unredacted customer-sensitive data in incident channels, incident.io, Linear, Zendesk, or status page updates. Use sanitized summaries and approved secure systems for any required sensitive evidence.
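As an illustration of sanitizing text before pasting it into a channel, the sketch below masks card-like digit runs and obvious `key=value` secrets. The patterns are assumptions chosen for the example; they will miss many secret formats and may over-redact long numeric IDs, so they complement manual review rather than replace it.

```python
import re

# Runs of 13 to 19 digits, optionally separated by single spaces or dashes.
CARD_LIKE = re.compile(r"\b\d(?:[ -]?\d){12,18}\b")

# "name=value" or "name: value" pairs whose name suggests a secret.
SECRET_PAIR = re.compile(r"(?i)\b(password|secret|token|api[_-]?key)\b(\s*[:=]\s*)\S+")


def redact(text: str) -> str:
    """Mask card-like numbers and secret-looking values in a log line."""
    text = CARD_LIKE.sub("[REDACTED CARD]", text)
    return SECRET_PAIR.sub(r"\1\2[REDACTED]", text)
```

For example, `redact("api_key=sk_live_abc123 failed")` keeps the key name for context and replaces only the value.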
After-hours response
There is no formal on-call rotation or compensation program today. Critical incidents may use the incident.io app for escalation because Slack notifications are not expected to be reliable after hours. Response is best-effort based on who is available and willing until a formal rotation is introduced.
Debriefs
Use the term Incident Debrief rather than post-mortem.
| Severity | Debrief requirement |
|---|---|
| Minor | Optional. Add a short note if there is a useful learning or follow-up. |
| Major | Short debrief expected. Capture timeline, impact, contributors, mitigators, and Linear follow-ups. |
| Critical | Full debrief required. Facilitate with the Incident Lead, Akansh, John, or Jacob. |
Debriefs should be blameless and specific. Names are fine when needed to explain what happened, but the discussion should start from the assumption that people were acting reasonably with the information they had.
Full debrief template
Use this structure for Critical incidents and high-learning Major incidents:
- Summary: what happened, current state, and customer/business impact.
- Key information: severity, affected systems, affected customers or clinics, incident link, status page link, and participants.
- Timeline: key events, decisions, mitigations, and turning points.
- Contributors: technical, human, process, vendor, or business conditions that made the incident possible or worse.
- Mitigators: what reduced impact or helped the response.
- Risks and learnings: what this showed about our systems or organization.
- Follow-ups: Linear issues, owners, priority, and due expectations.
Let follow-up actions settle after the debrief unless they are urgent. Avoid creating low-value work only because the incident is recent.