Before we hired our first #SOC analyst or triaged our first alert, we defined where we wanted to get to: what great looked like.
Here’s [some] of what we wrote:
We believe that a highly effective SOC:
1. leads with tech; doesn’t solve issues w/ sticky notes
2. automates repetitive tasks
3. responds and contains incidents before damage
4. has a firm handle on capacity v. loading
5. is able to answer, “are we getting better, or worse?”
6. doesn’t trade quality for efficiency; measures quality, has a self-correcting process
7. has a culture where “I don’t know” is always an acceptable answer; pathways exist to get help
8. is always learning (from success & failure)
9. has high retention, career pathways
With these goals in mind (things we want to achieve), we thought about the measurements that would inform us where we are on that journey.
Let’s start with “automates repetitive tasks”. What does that mean?
When an alert is sent to a SOC analyst for a decision, do they have to jump into a #SIEM or pivot to EDR to get more information?
For example, analysts may find themselves continuously asking questions like: Where has this user previously logged in from? At what times?
How much time is spent wrestling with security tech to get more context to make a decision?
I think about “automates repetitive tasks” as decision support. Using tech/sw automation to enable our #SOC to answer the right questions about a security event in an easy way.
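To make that concrete, here’s a minimal sketch (in Python, with made-up field names and an in-memory list standing in for your SIEM/auth logs) of the kind of context a bot could attach to an alert so the analyst never has to pivot:

```python
from collections import Counter
from datetime import datetime

# Hypothetical auth-log rows a bot might pull back; the field names and the
# data source are assumptions, not any particular product's schema.
auth_events = [
    {"user": "jdoe", "src_country": "US", "timestamp": "2022-01-02T13:05:00"},
    {"user": "jdoe", "src_country": "US", "timestamp": "2022-01-03T09:12:00"},
    {"user": "jdoe", "src_country": "RO", "timestamp": "2022-01-04T02:47:00"},
]

def login_context(user, events):
    """Summarize where, and at what hours, a user has previously logged in."""
    mine = [e for e in events if e["user"] == user]
    countries = Counter(e["src_country"] for e in mine)
    hours = Counter(datetime.fromisoformat(e["timestamp"]).hour for e in mine)
    return {
        "total_logins": len(mine),
        "countries": countries.most_common(),
        "busiest_hours": hours.most_common(3),
    }

# Attach this summary to the alert instead of asking the analyst to go search for it.
print(login_context("jdoe", auth_events))
```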
How do you get there? Think about the classes of work you send to your SOC. Try to group them.
Here’s an example:
- Unusual API calls/sequence
- Process/file execution event
- SaaS app event
- Unusual API auth event
- Outbound network conn
- Inbound network conn
- Etc
The point is to think about the various classes of work that show up and the steps required to get to an informed decision. Not a fast decision, but an informed decision using contextual data.
Now think about the work time of these various classes of work.
Work time is the diff (typically measured in minutes) between when the work “starts” and when it's “done”.
Some are likely better than others. E.g., work times for EDR events are 50% shorter than for SaaS app login triage, because we spend 20 min per alert reviewing auth logs.
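A rough sketch of the calculation, assuming your case management or SOAR tool can export start/done timestamps per work item (the field names and records below are invented):

```python
from collections import defaultdict
from datetime import datetime
from statistics import median

# Hypothetical work items; in practice these come from your case management platform.
work_items = [
    {"class": "EDR process event", "started": "2022-01-03T10:00", "done": "2022-01-03T10:08"},
    {"class": "EDR process event", "started": "2022-01-03T11:30", "done": "2022-01-03T11:42"},
    {"class": "SaaS app login",    "started": "2022-01-03T09:15", "done": "2022-01-03T09:41"},
]

def work_time_minutes(item):
    """Work time = minutes between when the work starts and when it's done."""
    start = datetime.fromisoformat(item["started"])
    done = datetime.fromisoformat(item["done"])
    return (done - start).total_seconds() / 60

by_class = defaultdict(list)
for item in work_items:
    by_class[item["class"]].append(work_time_minutes(item))

for cls, minutes in by_class.items():
    print(f"{cls}: median work time {median(minutes):.0f} min over {len(minutes)} items")
```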
You can find repetitive tasks by studying the steps taken by your SOC and then automate them.
What metrics inform if it’s working? Work times for the various classes should go down. Analysts get to spend more time on making the right decisions vs. wrestling with security tech.
In our SOC, it’s not just about automating “decision support”; several classes of work are automated end-to-end. Meaning, at this point we understand the work well enough that automation, aka "bots", handles entire classes of work.
Bots do more than fetch information for analysts; they close alerts, perform investigations, create incidents, etc.
Next, there’s the more obvious “responds & contains incidents before damage”. How quickly does your SOC need to respond? Is it 1 min, is it 10? Set a target.
I think about measuring response as the time between when the first alert (lead) was created and when the incident was contained.
E.g., the host was contained, the account was disabled, the EC2 instance was shut down, the long-term access key was reset, etc.
There's nuance here: the first lead may have fired minutes, hours, or days before detection. Take that into consideration as well.
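A minimal sketch of the measurement, assuming you record when the first lead fired and when containment finished (the IDs and timestamps below are made up):

```python
from datetime import datetime

# Hypothetical incident records: when the first lead (alert) was created and
# when containment actions completed.
incidents = [
    {"id": "INC-101", "first_lead": "2022-01-05T08:02", "contained": "2022-01-05T08:31"},
    {"id": "INC-102", "first_lead": "2022-01-06T22:47", "contained": "2022-01-07T01:15"},
]

for inc in incidents:
    lead = datetime.fromisoformat(inc["first_lead"])
    done = datetime.fromisoformat(inc["contained"])
    minutes = (done - lead).total_seconds() / 60
    print(f"{inc['id']}: time to contain = {minutes:.0f} min")
```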
There are a couple of things to consider here as well:
In your SOC you may have different alert severities (critical, high, medium, low, etc). And you likely use them as buffers to reduce variance and instruct analysts which alerts to handle first.
When you’re looking at your response times, inspect which alert severities most often detect security incidents. If most of your incidents start from “low” severity alerts that likely get looked at *after* the higher severities, what does that mean? Do you need to improve that?
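One way to answer that question, sketched with a made-up mapping of incidents to the severity of the alert that first caught them:

```python
from collections import Counter

# Hypothetical: for each incident, the severity of the alert that detected it.
incident_source_severity = ["low", "high", "low", "medium", "low", "critical"]

counts = Counter(incident_source_severity)
total = sum(counts.values())
for sev, n in counts.most_common():
    print(f"{sev}: {n} incidents ({n / total:.0%})")
# If "low" dominates this output, your queueing order may be working against you.
```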
Whether you’re building a #SOC, scaling a customer success team, or a D&R Engineering team, define where you want to get to (what are your goals / what does success look like?) and think about the measurements that will tell you if you’re there or not.
If you’re interested in reading more about #SOC metrics, here are a few blogs on the topic:
How to think about presenting good security metrics:
- Anchor your audience (why are these metrics important?)
- Make multiple passes with increasing detail
- Focus on structures and functions
- Ensure your audience leaves w/ meaning
Don’t read a graph, tell a story
Ex ⬇️
*Anchor your audience 1/4*
Effective leaders have a firm handle on SOC analyst capacity vs. how much work shows up. To stay ahead, one measurement we analyze is a time series of alerts sent to our SOC.
*Anchor your audience 2/4*
This is a graph of the raw trend of unique alerts sent to our SOC for review between Nov 1, 2021 and Jan 2, 2022. This time period includes two major holidays so we’ll expect some seasonality to show up around these dates.
Once a month we get in front of our exec/senior leadership team and talk about #SOC performance relative to our business goals (grow ARR, retain customers, improve gross margin).
A 🧵on how we translate business objectives to SOC metrics.
As a business we want to grow Annual Recurring Revenue (ARR), retain and grow our customers (Net Revenue Retention - NRR) and improve gross margin (net sales minus the cost of services sold). There are others but for this thread we'll focus on ARR, NRR, and gross margin.
/1
I think about growing ARR as the ability to process more work. It's more inputs. Do we have #SOC capacity available backed by the right combo of tech/people/process to service more work?
Things that feed more work: new customers, cross selling, new product launches.
/2
Purpose: Be clear with your team about what success looks like, and create a team and culture that guides you there. Go through the exercise of articulating your team's purpose.
The "purpose" we've aligned on at Expel in our SOC: protect our customers and help them improve.
People: To get to where you want to go, what are the traits, skills, and experiences you need to be successful?
Traits (who you are)
Skills (what you know)
Experiences (what you've encountered/accomplished)
A good alert includes (see the sketch after this list):
- Detection context
- Investigation/response context
- Orchestration actions
- Prevalence info
- Environmental context (e.g., src IP is a scanner)
- Pivots/visual to understand what else happened
- Able to answer, "Is host already under investigation?"
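Here’s one way to picture that as a data structure; it’s a sketch with invented field names, not any product’s actual alert schema:

```python
from dataclasses import dataclass, field

@dataclass
class Alert:
    """Illustrative shape of an alert carrying the context listed above."""
    name: str
    detection_context: str            # what it detects, when it was pushed to prod, gotchas
    investigation_context: str        # questions to answer, data sources to check
    orchestration_actions: list = field(default_factory=list)  # actions already taken or available
    prevalence: str = ""              # how common this file/domain/behavior is in the environment
    environmental_context: str = ""   # e.g., "src IP is a known vuln scanner"
    related_pivots: list = field(default_factory=list)         # queries/links: what else happened?
    host_under_investigation: bool = False  # is this host already part of an open investigation?
```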
Detection context. Tell me what the alert is meant to detect, when it was pushed to prod/last modified and by whom. Tell me about "gotchas" and point me to examples when this detection found evil. Also, where in the attack lifecycle did we alert? This informs the right pivots.
Investigation/response context. Given a type of activity detected, guide an analyst through response.
If it's #BEC, what questions do we need to answer, and which data sources do we need? If it's a coinminer in AWS, guide the analyst through CloudTrail and the steps to remediate.
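One way to make response context portable is to ship it as data alongside the detection. This is a sketch with hypothetical runbook entries, not our actual playbooks:

```python
# Hypothetical runbook table mapping a detection type to the questions to
# answer and where to look; contents are illustrative only.
RUNBOOKS = {
    "bec": {
        "questions": [
            "Did the attacker log in successfully?",
            "Were inbox rules created or mail forwarded externally?",
        ],
        "data_sources": ["identity provider sign-in logs", "mailbox audit logs"],
    },
    "aws_coinminer": {
        "questions": [
            "Which principal launched the instances?",
            "Are there other API calls from the same access key?",
        ],
        "data_sources": ["CloudTrail"],
        "remediation": ["isolate/terminate instances", "rotate or disable the access key"],
    },
}

def response_guidance(detection_type: str) -> dict:
    """Return the runbook an alert of this type should carry with it."""
    return RUNBOOKS.get(detection_type, {"questions": ["No runbook yet: escalate."]})

print(response_guidance("bec"))
```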
Gathering my thoughts for a panel discussion tomorrow on scaling #SOC operations in a world with increasing data as part of the Sans #BlueTeamSummit.
No idea where the chat will take us, but luck favors the prepared. A 🧵 of random thoughts likely helpful for a few.
Before you scale anything, start with strategy. What does great look like? Are you already there and now you want to scale? Or do you have some work to do?
Before we scaled anything @expel_io we defined what great #MDR service looked like, and delivered it.
We started with the customer and worked our way back. What does a 10 ⭐ MDR experience look like?
We asked a lot of questions. When an incident happens, when do we notify? How do we notify? What can we tell a customer now vs. what details can we provide later?
1. Collect data, you won't know what it means
2. Collect data, *kind of* understand it
3. Collect data, understand it. Able to say: "This is what's happening, let's try changing *that*"
4. Operational control. "If we do *this*, *that* will happen"
What you measure is mostly irrelevant. It’s that you measure and understand what it means and what you can do to move your process dials up or down.
If you ask questions about your #SOC constantly (ex: how much analyst time do we spend on suspicious logins and how can we reduce that?) - progress is inevitable.
W/o constantly asking questions and answering them using data, scaling/progress is coincidental.