Incident Response Checklist for a Small Website Outage

A website outage feels urgent because every minute looks expensive. The fastest teams do not improvise every step. They follow a short incident response checklist: confirm the problem, define impact, assign roles, inspect recent changes, restore service, and write down what happened.

This guide is for small teams, solo operators, and content site owners who need a practical outage process without enterprise incident tooling.

Key Takeaways

Confirm the outage from more than one network before changing production.
Separate triage, communication, and fixing work when more than one person is available.
Check recent changes before digging through every possible cause.
After recovery, capture the timeline and one or two prevention actions.

The First Five Minutes

Start by confirming whether the site is down for everyone or only from one location. From a second network, such as a phone on mobile data, check the homepage, a direct post URL, wp-admin or the app login, DNS resolution, and the server response. A local browser cache, DNS resolver issue, or regional network problem can look like a full outage.
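
The checks above can be scripted so they run the same way from each network. The following is a minimal Python sketch, not a definitive tool: the function names (`resolve_host`, `fetch_status`, `confirm_outage`) and the example URLs are assumptions made for illustration, and it uses only the standard library.

```python
import socket
import urllib.error
import urllib.request
from urllib.parse import urlparse


def resolve_host(hostname, resolver=socket.getaddrinfo):
    """Return the sorted IP addresses for hostname, or [] if DNS fails."""
    try:
        infos = resolver(hostname, 443)
    except socket.gaierror:
        return []
    # Each result is (family, type, proto, canonname, sockaddr); sockaddr[0] is the IP.
    return sorted({info[4][0] for info in infos})


def fetch_status(url, opener=urllib.request.urlopen, timeout=10):
    """Return the HTTP status code for url, or a short error label."""
    try:
        with opener(url, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as exc:
        # A 4xx/5xx answer still proves the server is reachable.
        return exc.code
    except (urllib.error.URLError, OSError) as exc:
        return f"error: {exc}"


def confirm_outage(urls, resolver=socket.getaddrinfo, opener=urllib.request.urlopen):
    """Check DNS and HTTP for each URL and return one result row per URL."""
    results = []
    for url in urls:
        host = urlparse(url).hostname
        results.append({
            "url": url,
            "dns": resolve_host(host, resolver),
            "status": fetch_status(url, opener),
        })
    return results


if __name__ == "__main__":
    # Hypothetical URLs: replace with your homepage, one post, and the login page.
    for row in confirm_outage([
        "https://example.com/",
        "https://example.com/sample-post/",
        "https://example.com/wp-admin/",
    ]):
        print(row)
```

Running the same script from two networks and comparing the rows helps separate a real outage (failures everywhere) from a local DNS or routing problem (failures from one place only).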

Write down the start time, who noticed it, and the first symptom. For example: homepage returns 502, SSL certificate error, database connection error, DNS not resolving, or page loads but checkout fails. Precise symptoms save time because each one points at a different layer: DNS, TLS, the web server, the database, or the application.

Assign Simple Roles

If two or more people are available, split roles. One person investigates and fixes. One person communicates status to stakeholders or users. One person keeps a timeline. On a tiny team, one person can do all three, but the roles still help keep the work organized.

Use one communication channel for the incident. Avoid spreading decisions across email, chat, phone, and comments. A single incident thread makes the timeline easier to reconstruct later.

Triage Checklist

Can the server be reached by SSH?
Is the web server running?
Is the database running and accepting connections?
Did CPU, memory, disk, or bandwidth spike?
Did a deploy, plugin update, DNS change, or certificate renewal happen recently?
Are error logs showing one repeated failure?
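
Two of the questions above, whether a service accepts connections and whether the logs show one repeated failure, are easy to script. This is a rough Python sketch under stated assumptions: `port_open`, `normalize`, and `most_repeated_errors` are illustrative names, the hostname, port numbers (22 for SSH, 443 for HTTPS, 3306 for a MySQL-style database), and the log path in the usage section are examples rather than values from any particular setup, and "a line that contains the word error" is a deliberately crude filter.

```python
import re
import socket
from collections import Counter


def port_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def normalize(line):
    """Replace digit runs so lines that differ only in timestamps or IDs group together."""
    return re.sub(r"\d+", "N", line.strip())


def most_repeated_errors(log_lines, top=3):
    """Count lines mentioning 'error' (case-insensitive) and return the most common ones."""
    errors = [normalize(line) for line in log_lines if "error" in line.lower()]
    return Counter(errors).most_common(top)


if __name__ == "__main__":
    # Hypothetical host, ports, and log path: adjust to your own server.
    for name, port in [("ssh", 22), ("https", 443), ("database", 3306)]:
        print(name, "reachable" if port_open("example.com", port) else "NOT reachable")
    with open("/var/log/nginx/error.log", encoding="utf-8", errors="replace") as handle:
        for message, count in most_repeated_errors(handle):
            print(count, message)
```

If one normalized message accounts for most of the error lines, it is usually the best place to start reading.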

Recent changes deserve special attention. Many outages come from a deploy, dependency update, expired certificate, full disk, broken configuration, or database service failure. Check the simple path before assuming a complex attack or rare infrastructure issue.

Restore Service Before Perfect Diagnosis

The goal during an outage is service restoration. If a recent deploy caused the issue, roll back. If the server is out of disk space, free enough space to recover and then investigate. If a plugin update broke WordPress, disable the plugin and bring the site back before writing the full explanation.

Avoid making many untracked changes at once. Each fix should be noted in the incident timeline with the time and result. This keeps the team from losing the path that actually restored service.
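
A timeline does not need special tooling. As one possible shape, the Python sketch below keeps entries in memory and renders them as plain text; the class and field names (`IncidentTimeline`, `TimelineEntry`, `who`, `action`, `result`) are invented for this example, and a shared document or chat thread works just as well.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class TimelineEntry:
    at: datetime
    who: str
    action: str
    result: str = "pending"


@dataclass
class IncidentTimeline:
    entries: list = field(default_factory=list)

    def note(self, who, action, result="pending", at=None):
        """Append one entry; the timestamp defaults to the current UTC time."""
        entry = TimelineEntry(at or datetime.now(timezone.utc), who, action, result)
        self.entries.append(entry)
        return entry

    def render(self):
        """Return the timeline as plain text, one line per entry, oldest first."""
        ordered = sorted(self.entries, key=lambda e: e.at)
        return "\n".join(
            f"{e.at:%H:%M} UTC  {e.who}: {e.action} -> {e.result}" for e in ordered
        )


if __name__ == "__main__":
    timeline = IncidentTimeline()
    timeline.note("sam", "homepage returns 502", "confirmed from a second network")
    timeline.note("sam", "rolled back latest deploy", "site recovered")
    print(timeline.render())
```

Recording the result next to each action is the part that matters: it shows which change actually restored service and which ones did nothing.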

Status Communication

Small sites do not always need a public status page, but they still need clear communication. If users are affected, publish a short message that says what is affected, when it started, and when the next update will come. Do not guess a root cause too early.

A useful first message is simple: “We are investigating elevated errors on the main site. Admin access and some public pages may fail. Next update in 30 minutes.” Keep it factual.
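
To keep updates consistent under pressure, the message can be generated from the three facts it needs. The Python sketch below is one way to do that; `status_update` and its parameters are hypothetical names, and the wording is only a template to adapt.

```python
from datetime import datetime, timedelta, timezone


def status_update(affected, started_at, next_update_minutes=30, now=None):
    """Build a short, factual status message with no root-cause guess."""
    now = now or datetime.now(timezone.utc)
    next_update = now + timedelta(minutes=next_update_minutes)
    return (
        f"We are investigating {affected}. "
        f"Started around {started_at:%H:%M} UTC. "
        f"Next update by {next_update:%H:%M} UTC."
    )


if __name__ == "__main__":
    started = datetime.now(timezone.utc) - timedelta(minutes=10)
    print(status_update("elevated errors on the main site", started))
```

The function has no field for a cause on purpose, which makes it harder to publish a guess by accident.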

Post-Incident Review

After recovery, write a short review while the details are fresh. Include timeline, impact, root cause if known, what restored service, what detection worked, and what prevention action will be taken. The review does not need blame or long theory.

Pick one or two follow-up actions, not ten. Examples include adding disk alerts, testing backups monthly, staging plugin updates, documenting rollback, or adding uptime checks for both homepage and login.
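
As an illustration of how small such a follow-up can be, here is a sketch of a disk alert in Python. The 85% threshold, the function names, and the idea of running it from a scheduler such as cron are assumptions for the example; how the warning is delivered (email, chat, monitoring service) is left open.

```python
import shutil


def disk_used_percent(path="/"):
    """Return the percentage of the filesystem at path that is in use."""
    usage = shutil.disk_usage(path)
    return 100.0 * usage.used / usage.total


def disk_alert(path="/", threshold=85.0, usage_fn=disk_used_percent):
    """Return a warning string if usage is at or above threshold, else None."""
    percent = usage_fn(path)
    if percent >= threshold:
        return f"Disk {path} is {percent:.0f}% full (threshold {threshold:.0f}%)"
    return None


if __name__ == "__main__":
    warning = disk_alert("/")
    if warning:
        print(warning)  # Replace with your own notification method.
```

Scheduling a check like this every few minutes turns a full-disk outage into an early warning.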

FAQ

What should I check first during a website outage?

Confirm the outage, check recent changes, then inspect the web server, database, disk, CPU, and error logs. Recent changes often reveal the fastest rollback path.

Do small sites need incident response?

Yes, but it can be lightweight. A one-page checklist is enough for many sites and is much better than starting from memory during stress.

Should I restart the server immediately?

Not always. A restart can hide useful evidence. Check basic service status and logs first, unless the site needs emergency recovery and a restart is the known safe action.

Final Check

A good incident response checklist makes outages shorter and less chaotic. Confirm impact, track changes, restore service, communicate clearly, and convert the incident into one or two practical improvements.