Alert fatigue is an operating model problem, not a tooling problem.

Buying another detection product will not fix a queue nobody can finish. Triage discipline, ownership, and tuning will.

Walk into most security operations centers (SOCs) and you will find the same scene: thousands of alerts per day, a team that can investigate a fraction of them, and a quiet, corrosive understanding that most of the queue will never be looked at. The usual response is to buy a better tool. The queue then gets bigger, because the new tool sees more.

Having run managed detection and response across more than fifteen organizations and about 1,500 endpoints, I can report a hard truth: the teams that get on top of alerting are rarely the ones with the most products. They are the ones with the most discipline.

Where the volume really comes from

In a typical environment, a large share of alert volume traces back to a small set of causes: detections left on vendor defaults, known-benign behavior that nobody has documented and suppressed, the same misconfiguration firing on a schedule, and duplicate coverage where two tools report the same event. None of these is a detection problem. They are housekeeping debts, and they compound.

The operating model that works

  • Severity that means something. If everything is high, nothing is. Severity tiers need definitions tied to business impact, and the discipline to use them.
  • A tuning cadence. Every week, the noisiest detections get reviewed: fix the cause, suppress with documentation, or accept with a named owner. Tuning is not a project. It is a rhythm.
  • Triage time as a managed metric. Median time from alert to disposition is the heartbeat of a SOC. If you do not measure it, you are managing by anecdote.
  • Clear ownership of the response. An alert that ends with "told IT" is not closed. Containment, remediation, and verification need named owners and follow-through.
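As an illustration of the triage-time metric above, here is a minimal sketch of how median time from alert to disposition could be computed from exported alert records. The record shape (a pair of created and dispositioned timestamps) and the function name `median_triage_minutes` are assumptions for the example, not the export format of any particular SIEM or ticketing tool.

```python
from datetime import datetime
from statistics import median


def median_triage_minutes(alerts):
    """Median minutes from alert creation to disposition.

    `alerts` is an iterable of (created_at, dispositioned_at) datetime pairs.
    Alerts that are still open carry None as the second element and are skipped.
    This record shape is hypothetical; adapt it to your own export.
    """
    durations = [
        (closed - created).total_seconds() / 60
        for created, closed in alerts
        if closed is not None
    ]
    return median(durations) if durations else None


# Hypothetical sample data: three dispositioned alerts and one still open.
sample = [
    (datetime(2024, 1, 8, 9, 0), datetime(2024, 1, 8, 9, 12)),   # 12 minutes
    (datetime(2024, 1, 8, 9, 5), datetime(2024, 1, 8, 10, 5)),   # 60 minutes
    (datetime(2024, 1, 8, 9, 30), datetime(2024, 1, 8, 9, 37)),  # 7 minutes
    (datetime(2024, 1, 8, 10, 0), None),                         # still open
]

print(median_triage_minutes(sample))  # 12.0
```

Skipping open alerts keeps the sketch simple, but it flatters the number when the backlog is large. Reporting the count of undispositioned alerts next to the median is one way to keep the metric honest.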

What good looks like

In environments that adopt this discipline, the numbers move fast. Median triage drops from hours to minutes because analysts see a short, honest queue instead of an infinite one. Critical alerts get genuine attention because they are rare enough to deserve it. And leadership finally gets a security report it can read: what fired, what mattered, what was done, and what is being fixed at the root.

The capacity math nobody runs

A competent analyst can properly investigate somewhere between 20 and 40 alerts in a working day, depending on complexity. Taking 30 as the midpoint, a team of four has a real capacity of about 120 investigations a day. If the environment generates 2,000 alerts a day, the team is structurally able to look at six percent of them. No amount of effort changes that arithmetic; only volume reduction or automation does. Run this calculation for your own SOC. If the gap between volume and capacity is more than tenfold, your true detection coverage is whatever happens to be at the top of the queue, and the honest risk statement belongs in front of leadership.
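The arithmetic above is simple enough to script. The sketch below uses the figures from the text (2,000 alerts a day, four analysts) and assumes 30 investigations per analyst per day, the midpoint of the 20 to 40 range. The function name `triage_coverage` is illustrative only.

```python
def triage_coverage(alerts_per_day, analysts, investigations_per_analyst):
    """Return (daily capacity, coverage fraction, volume-to-capacity gap).

    coverage is the share of daily alerts the team can properly investigate,
    capped at 1.0; gap is how many times volume exceeds capacity.
    """
    if alerts_per_day <= 0 or analysts <= 0 or investigations_per_analyst <= 0:
        raise ValueError("all inputs must be positive")
    capacity = analysts * investigations_per_analyst
    coverage = min(1.0, capacity / alerts_per_day)
    gap = alerts_per_day / capacity
    return capacity, coverage, gap


# Figures from the text; 30 per analyst is an assumed midpoint of 20 to 40.
capacity, coverage, gap = triage_coverage(
    alerts_per_day=2000, analysts=4, investigations_per_analyst=30
)
print(f"capacity={capacity}/day, coverage={coverage:.0%}, gap={gap:.1f}x")
# capacity=120/day, coverage=6%, gap=16.7x
```

With these inputs the gap is well past the tenfold threshold, which is the point at which the risk statement belongs in front of leadership.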

The encouraging part: the first tuning pass is usually dramatic. In one environment we reviewed, a single misconfigured detection produced close to a third of total volume, and the top ten detections together produced over 70 percent. Two weeks of disciplined housekeeping bought back more capacity than a year of hiring would have.

Questions to put to your SOC lead this week

  • What is our median time from alert to disposition, and is it trending?
  • Which ten detections produced the most alerts last month, and what did we change as a result?
  • What share of last month's alerts were closed without investigation, and who accepted that risk?
  • For our last five true positives, how long did containment take, and who verified remediation?
  • If volume doubled tomorrow, what is the plan other than working harder?

A SOC lead with crisp answers runs an operating model. A SOC lead who answers with tool names is describing the problem.

The security report leadership should demand

Alert volume is the number most SOCs report and the least useful one for governance. The monthly security report worth reading fits on one page and carries six numbers with trends:

  • Median triage time.
  • True positives, and what they were.
  • Mean time from detection to containment for those cases.
  • The share of alerts auto-closed or suppressed, with the risk owner named.
  • Critical vulnerabilities open beyond SLA.
  • The top recurring alert cause, and what is being done at its root.

Two of these, containment time and beyond-SLA vulnerabilities, are the ones boards and regulators increasingly ask for directly. A leadership team that reviews these six monthly will catch a degrading security operation two quarters before an incident report would have told them, and a SOC that knows these numbers are read behaves differently in week one.

The practical first step

Pull thirty days of alert data and rank detections by volume. The top ten will usually explain most of the queue. Fix or document each one, then set the weekly tuning rhythm before adding any new tooling. The cheapest capacity you will ever buy is the noise you stop generating.
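The ranking step can be done in a spreadsheet, but a few lines of code make it repeatable for the weekly rhythm. The sketch below assumes the thirty days of alerts have been reduced to a list with one detection name per alert. The function name `top_detections` and the sample detection names are hypothetical.

```python
from collections import Counter


def top_detections(detection_names, n=10):
    """Rank detections by alert count.

    Returns (ranked, cumulative_share), where ranked is a list of
    (detection name, count, share of total volume) for the n noisiest
    detections, and cumulative_share is their combined share of all alerts.
    """
    counts = Counter(detection_names)
    total = sum(counts.values())
    top = counts.most_common(n)
    ranked = [(name, count, count / total) for name, count in top]
    cumulative_share = sum(count for _, count in top) / total if total else 0.0
    return ranked, cumulative_share


# Hypothetical data: one list entry per alert, holding its detection name.
alerts = (
    ["dns-tunnel-default"] * 50
    + ["backup-job-admin-login"] * 30
    + ["usb-mount"] * 15
    + ["impossible-travel"] * 5
)

ranked, share = top_detections(alerts, n=2)
for name, count, fraction in ranked:
    print(f"{name}: {count} alerts ({fraction:.0%})")
print(f"top 2 combined: {share:.0%}")
```

Running the same ranking each week and comparing it with the previous one shows whether tuning is actually reducing volume or only moving it between detections.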

Facing this problem? This is the work TechEccentric does: analytics, AI and machine learning, and cybersecurity for organizations where the operating systems behind decisions have to hold up.
