Correlation Rules and Engine Debugging
Introduction
This document outlines basic rule creation, tuning and debugging for the McAfee Correlation Engine. The correlation engine can reside within an Event Receiver or as the Advanced Correlation Engine (ACE) appliance. The ACE removes the correlation overhead from the Event Receiver, allowing it to operate at its maximum ratings.
An Event Receiver is an appliance (hardware or virtual) which houses just one of the 3 possible correlation engines. This is:
An ACE is an appliance (hardware or virtual) which houses 3 correlation engines. These are:
What doesn’t this document provide?
This document is intended to provide the reader with an insight into debugging correlation rules and the associated correlation engine. As such, it provides a great deal of technical information.
What does that mean? This means that without expert guidance, any adjustments you make as a result of this document could cause unintended consequences. Use any tip or trick highlighted in this document carefully.
Table of Contents
Rules
Event Flow 4
Writing Rules 5
Rule Caveats 9
Debugging a Bad Rule 13
When a rule doesn’t do what you want it to.
Locations 14
Where the Correlation Engine stores its assets.
In-Bound Events 15
Looking at the inbound events.
CPU Utilization 16
How is the engine performing?
Correlation Engine Status 17
Complete dump of the engine's vital signs.
Additional Options 23
Useful options when things are proving difficult.
Conclusion 24
Some final items of interest.
Rules
The Event Flow
Before we move into the detail of rule writing, here's a quick primer on how event traffic moves around the McAfee SIEM environment. While a Correlation Engine can reside on an Event Receiver, this diagram assumes an Advanced Correlation Engine (ACE) is in the environment.
Events are processed (Collected, Parsed, Normalized, Enriched and Aggregated) and then, based on the poll time, are sent to the ESM.
The ESM then forwards these to the ACE for possible use in one or more Correlation Rules. These are forwarded at roughly the same cycle as the ESM to ERC poll.
When the ACE triggers an event, it is queued and sent back to the ESM via the ESM to ACE poll.
Writing Rules
Now that you understand the flow of events, let’s talk about writing rules. While not a difficult process, Correlation Rule writing has a few guidelines which, if not followed, could introduce unintended results or no results at all. The following section is designed to provide you with a brief summary of some do’s and don’ts on rule writing.
First, we'll cover the different components used in creating a Correlation Rule. Each of the components below can be used individually or together within a rule.
Match Component
This is the most frequently used component. It performs a criteria match based on the elements of an event that are contained within it. One or more filters can be within a Match Component. Each Match Component within a rule may match separate events in order to satisfy the rule.
Deviation Component
As the label describes, this component uses the traditional model for Standard Deviation and applies this deviation to the filters contained within the component. In addition to traditional Deviation, we’ve added Percent from Average and
Gates
AND OR SET
As illustrated above, there are three gate possibilities. These are:
AND – All the components within this gate will have to match.
OR – Any of the components within this gate will have to match.
SET – One or more of the components within this gate will have to match. You select how many of the available Match Components have to match true in order for this gate to match. Examples would be 2 of 4 and 3 of 6.
NOTE: There is an additional option called Sequence that is used on the AND or SET logical element to require the conditions of the rule to occur in the sequence you place them in the Correlation Logic field for the rule to be triggered.
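The three gates can be pictured as simple boolean combinators over the match results of their components. The following is a minimal sketch for illustration only, not the engine's implementation: the function names are hypothetical, the Sequence option is not modeled, and reading "2 of 4" as "at least 2 of 4" is an assumption.

```python
from typing import Sequence


def and_gate(results: Sequence[bool]) -> bool:
    """AND: every component in the gate has to match."""
    return all(results)


def or_gate(results: Sequence[bool]) -> bool:
    """OR: a single matching component is enough."""
    return any(results)


def set_gate(results: Sequence[bool], required: int) -> bool:
    """SET: `required` of the available components have to match.

    Assumption: "2 of 4" is read here as "at least 2 of 4".
    """
    if not 1 <= required <= len(results):
        raise ValueError("required must be between 1 and the number of components")
    return sum(1 for matched in results if matched) >= required
```

For example, with four components of which two have matched, `set_gate(results, 2)` is satisfied while `and_gate(results)` is not.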
Gates can be used alone or in nested groups. Multiple event records may satisfy the components within a gate. But as we will describe shortly, the more nesting that is used, the more costly the rule becomes from a performance standpoint.
Next is the “How” to use these components most effectively. This section is not designed to provide every possible option for creating Correlation Rules. Rather, its purpose is to provide guidance on efficient use of the components and pitfalls you might encounter.
Below are examples of what appear to be similar rules. Notice that the filters in each of the match components are looking for the same values. However, in Figure 1 there are three match components, whereas in Figure 2 there is just one. While each of these rules will trigger on essentially the same criteria, the Figure 1 example will behave substantially differently from Figure 2.
Figure 1
Figure 2
Here’s why:
In Figure 1, the rule as a whole will potentially fire much more often because:
An Example:
Whereas in Figure 2, the rule uses less memory because it won’t maintain state (memory allocations) until all three criteria are within a single event.
Below are two examples of a nested rule. Figure 3 is a bad example. At a minimum, this could have been combined into a single AND gate, and preferably, it would have been a single Match Component. Figure 4 is a better example of how to use nesting effectively. In this example a single Match Component is combined with 3 existing Attack rules, only one of which will have to match.
NOTE: Without a SET gate, an OR gate implies a single match among the components in the gate.
Nesting rules can be extremely effective when looking for different types of events as part of a single rule. However, nesting uses more resources (memory and CPU) because of the overhead associated with the additional logic required to perform the matches. Please keep that in mind when crafting nested rules and reduce the number of components where possible to keep the rule as efficient as possible.
Bad Example
Figure 3
Rule Caveats
Below are items that have the potential to induce extra or excessive resource usage (memory and CPU) and which should be considered when crafting custom Correlation Rules. This doesn’t mean "don’t do the items listed below"; rather, these are highlighted so that you can be aware of the impact these items have on the overall performance of the Correlation Engine and can craft as efficient a rule as possible given your criteria.
OR gates mixed with AND gates result in a lot of extra processing
This is due to the additional processing required to perform state evaluation on the overall rule. The engine will have to check the OR and AND conditions more frequently.
Watch Lists are stored in the database and are read into short term memory at engine execution time to prevent frequent reads to the McAfeeEDB (McAfeeEDB is the database in which the parsed events are stored on the ESM). This adds to memory usage, so very large lists, those with entries in the thousands or more, could use excessive memory and be the cause of slower engine performance. Also, should a list change, the current state of the correlation rule using the changed list has to be ended, the new list read in, and state then maintained again.
NOTE: A large watch list would be one that has more than 100K entries.
Using another rule (Referenced) within a rule uses more memory and CPU than simply including the logic from the referenced rule. This does not mean referenced rules shouldn’t be used, as the upside of leveraging existing rules is functionally very beneficial. But should the gate of the Referenced Rule be an AND and your new rule also contains an AND(s), it may be more beneficial to just use the logic from the Referenced Rule. This is especially true if the Referenced Rule is only referenced once.
Figure 5 is an example of what a Rule with a reference component (green) looks like.
High match rate/low fire rate/high timeout
This is where a rule with multiple Match Components has one Match Component that has a substantially larger match rate than the other components and the overall rule triggers very infrequently. These match rates, among many other settings and values, can be checked using a script that “dumps” details of what is going on within the engine. This is specifically outlined in the script section starting on page 18.
One really slow rule can’t be fixed with load balancing
The Correlation Engine has the ability to auto balance itself across its processors. However, if you have a rule which uses a less efficient methodology, this auto balance capability may not help. You can check this using the correlator.sh script, which is specifically outlined in the script section towards the end of this document.
The Correlation Engine is constrained by the available resources. Keeping the items previously mentioned in mind is one way of managing this CPU and memory use. Another is to determine if the customer environment actually needs all of the standard rules provided (176 as of the writing of this document). It’s possible that a few (or more) can be disabled or at least tuned to reduce their memory and/or CPU usage.
If one were trying to squeeze every last ounce of performance out of the Correlation Engine, consider whether a Match Component should use a Watch List or Variable. The Pros and Cons of each are below.
Watch Lists
Variables
Rule Attributes
Some final bits of advice to ensure that your custom Correlation Rules will perform as efficiently as possible. Each rule has a couple of attributes that have a few caveats. Below, we outline some of them and items to consider when creating a new correlation rule.
Group By
NOTE: Grouping by multiple high cardinality fields (SrcIP, DestIP, etc.) may cause high memory usage.
Time Window
Gate Logic
Even after the Time Window on a given rule has expired, some metadata for that rule is kept in memory for a period of time. That time period is called Time Order Tolerance and by default it is set to 60 minutes. Time Order Tolerance is designed to account for events that come in out of sequence (or late, after a rule’s Time Window has expired). This uses additional memory and, depending on how events are matched (or not) and your time thresholds (a number of factors come into play), it could use a lot of memory.
To prevent potential excess memory usage, you can reduce the Time Order Tolerance to something less than 60 minutes. The upside is that environments that are struggling with resources would benefit from the memory that is freed up.
However, and this is a big however, you need to ensure or be comfortable that events for ALL data sources will not arrive late. Ever. If they do, and they arrive after the expiration of the Time Order Tolerance, then they won’t be included in a previously expired Correlation Rule.
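The trade-off can be made concrete with a small model. This is a simplified sketch of the behavior described above, not the engine's actual logic: the function name is hypothetical, and it assumes the tolerance is measured from the moment the rule's Time Window expires.

```python
from datetime import datetime, timedelta

# Default Time Order Tolerance stated in the text.
DEFAULT_TOLERANCE = timedelta(minutes=60)


def late_event_still_counts(window_end: datetime,
                            arrival: datetime,
                            tolerance: timedelta = DEFAULT_TOLERANCE) -> bool:
    """Return True if an event arriving at `arrival` can still be applied
    to a rule whose Time Window expired at `window_end`.

    Assumption: the tolerance period starts when the Time Window expires.
    """
    return arrival <= window_end + tolerance
```

Under this model, an event that arrives 45 minutes after the window expired is still included with the default 60 minute tolerance, but would be dropped if the tolerance were reduced to 30 minutes.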
Debugging a Bad Rule
Once a rule has been written and is running, you might find that it doesn’t trigger, or that all correlated events seem to be getting to the ESM a bit more slowly after the new rule has been added, or that no events appear at all. When these types of behaviors appear, customers will typically call support and work through the issue.
However, the following pages provide the reader with a workflow to determine what might be causing the issue and allow them to resolve it on their own.
NOTE: The steps outlined on the following pages are designed to help the reader debug moderate correlation engine issues. They are not intended to be a complete debugging guide. If you attempt to go beyond the scope of this document, you may do more harm than intended. If you are unsure, do not proceed and seek additional support.
Important Locations
Before we debug, we need to know where everything is and to note where the Correlation Engine keeps its important directories and files. Everything that is important (or at least covered herein) is located within a single directory, regardless of whether this is an Event Receiver or an ACE appliance. That directory is:
/usr/local/ace/
From here, most things correlation related can be found, checked or investigated. As with any component within McAfee SIEM, there are some files or commands that can be useful. To keep this section simple, we’ll stick to the most important items.
The Directories
bin         Contains the correlator.sh shell script. This script can be used for a variety of tasks, most notably to check the efficiency of the engine and individual rules.
enrichment  Contains the enrichment rules.
historical  Contains the event files used for historical correlation, if historical correlation is enabled.
incoming    Stores the incoming events sent to it from the ESM.
lib         No need to look in here.
log         Contains the logs of the running correlation engines. These could be useful during debugging. We have outlined some uses within this document.
As previously noted, the ESM forwards events from the Event Receivers (ERC) to the Correlation Engine. The engine stores these event files in the incoming directory to wait for further processing. Sometimes, due to bad rules, too many events, the engine being stalled/stopped or another issue, these event files can get stacked up and the Correlation Engine gets behind.
To see how many event files are waiting, if any, run the following command from the /usr/local/ace directory:
Depending on the EPS of your environment, your result, an example of which is below, should be a single or small double digit number. The ultimate goal is to have as few files waiting as possible and for this number to be stable or reducing over time.
If you have a large number of files waiting, or if the number is growing after running the above command a couple of times, this could be an indicator that there is an issue that will need to be investigated further.
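The same check can be scripted. The following is a minimal Python sketch rather than the command referred to above: it assumes the incoming directory sits at /usr/local/ace/incoming (the base path and directory name come from this document), that a Python interpreter is available wherever it is run, and the function names are hypothetical.

```python
import os

# Assumed location: the "incoming" directory under /usr/local/ace/.
INCOMING_DIR = "/usr/local/ace/incoming"


def count_waiting_files(path: str = INCOMING_DIR) -> int:
    """Count the regular files currently waiting in the incoming directory."""
    with os.scandir(path) as entries:
        return sum(1 for entry in entries if entry.is_file())


def backlog_trend(samples: list) -> str:
    """Classify a series of counts taken a few minutes apart.

    A count that is higher at the end than at the start is reported as
    'growing'; anything else is 'stable or reducing'.
    """
    if len(samples) < 2:
        raise ValueError("need at least two samples to judge a trend")
    return "growing" if samples[-1] > samples[0] else "stable or reducing"
```

Calling `count_waiting_files()` a few times, some minutes apart, and passing the collected counts to `backlog_trend` gives the "stable or reducing" versus "growing" signal described above.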
CPU Utilization
The Correlation Engine reserves a minimum amount of resources (memory and CPU) to operate. Checking CPU utilization is one way to see if the engine is performing as expected or if it is in distress. The engine uses a dynamic calculation at runtime to determine the number of cores to use. The calculation is the number of cores detected, minus 2.
To monitor the CPU you can use either top or htop. Both are useful tools but display the data in slightly different ways. htop has a slight advantage as it displays activity in a more colorful manner. Below are a couple of screen shots.
What to look for? In a moderately loaded ACE, you should see the CPU percentage (first of the two highlighted lines in the screen shot below) as a number over 100%. Maybe 300%, possibly 700%, but definitely over 99-100%. See Figure 9. If you ever notice that the CPU utilization stays at 99-100% for an extended period of time, it could mean that a single rule is hogging cycles. The other thing to look for is that utilization is spread across 4 CPUs. Usage does not have to be uniform, just that activity is across all four. See Figure 8.
Figure 8
Correlation Engine Status
If, after inspecting the incoming directory and the CPU usage, or because events are simply not showing up, you believe there is a problem, you can look at the internals of the Correlation Engine. The engine keeps a wide variety of statistics and values for the engine itself as well as for the individual rules and components within the rules.
To dump the current status of the engine, execute the following script.
The correlator.sh script performs a status dump of the engine at the moment in time it is executed.
NOTE: If this script takes more than 5 minutes to execute, this could be an indication that the engine is underperforming. The values found in the output file reflect activity since the last time the engine was restarted.
Once you have the result file, you can search it for valuable information on how the engine is performing. The following pages provide examples of sections of the output to review when debugging your correlation engine.
#1 – Memory Critical
Correlation is a memory intensive operation. The act of maintaining state on rules matching Source IP, Destination IP or User Name, each having high cardinality, can be expensive in terms of memory used. So if you’re experiencing slow event generation or you believe the engine is under stress, one of the first items to check would be the memoryCritical property.
NOTE: The -A 3 will grab the next three lines after the grep match.
What you are looking for is one or more rules which have the highest activeInstances values. If you see one or more rules whose values are much larger than those of the rest of the rules, use some of the steps found further in this document to debug the reason for this large number of active instances.
NOTE: It is possible that, due to the high cardinality of a specific field (Source IP, Destination IP or Username as examples), any rule, even the default rules, may need additional match components added to limit its activeInstances.
#2 – Processor Balance
- or -
#3 – Which Rule is on Which Processor
The “mate” to the previous grep is one looking for timeSpent. The example is:
<value>rulesProcessor1.timeSpent=48851.7ms</value>
<value>Correlation Engine-47-4000004.timeSpent=35425.8ms</value>
..
..
<value>Correlation Engine-47-4000013.timeSpent=13425.9ms</value>
<value>rulesProcessor2.timeSpent=9310.3ms</value>
<value>Correlation Engine-47-4000014.timeSpent=8765.1ms</value>
..
..
<value>Correlation Engine-47-4000023.timeSpent=545.2ms</value>
The Correlation Engine prioritizes rules by their expense (processing time), so this particular element shows which rule is taking the most processing time and which core it is using. This element is ordered in processor sequence and the Correlation Engine will always put the most expensive rule first within each processor group.
Thus the first rule on processor 1 will be the most expensive rule overall. Because of this, you should easily be able to determine which rule is using the most processing time.
In the example above, note the time spent for each rule (red), the time spent for each processor (pink) and the Signature ID (green).
NOTE: timeSpent is reset each time the rule set is rebalanced (automatically or manually). This is done to be more sensitive to performance changes in the engine. As an example, a rule that was slow last week may not be slow this week. In addition, when new rules are added timeSpent is reset because the rules need to be balanced across processors based on current processing time against the whole rule base.
NOTE: If you see an extremely large number and it is in E-notation (scientific or exponential notation, which occurs with numbers greater than 15 digits), then you can be pretty certain that this rule is very expensive and should be reviewed.
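When the status dump is large, the timeSpent lines can also be grouped programmatically. The sketch below is a hedged illustration that assumes the lines look exactly like the example above (`<value>NAME.timeSpent=NUMBERms</value>`); the function names are hypothetical and the sample text simply repeats the values shown in this section.

```python
import re

# Matches lines such as <value>rulesProcessor1.timeSpent=48851.7ms</value>,
# including values written in E-notation (for example 1.2E+16).
VALUE_RE = re.compile(
    r"<value>(?P<name>.+?)\.timeSpent=(?P<ms>[0-9.eE+-]+)ms</value>"
)


def parse_time_spent(text: str) -> dict:
    """Group rule timings under the rulesProcessorN entry that precedes them."""
    groups = {}
    current = None
    for match in VALUE_RE.finditer(text):
        name, ms = match.group("name"), float(match.group("ms"))
        if name.startswith("rulesProcessor"):
            current = name
            groups[current] = {"total_ms": ms, "rules": []}
        elif current is not None:
            groups[current]["rules"].append((name, ms))
    return groups


def most_expensive_rule(groups: dict) -> tuple:
    """Return the (rule name, ms) pair with the largest timeSpent."""
    all_rules = [rule for g in groups.values() for rule in g["rules"]]
    return max(all_rules, key=lambda rule: rule[1])


SAMPLE = """
<value>rulesProcessor1.timeSpent=48851.7ms</value>
<value>Correlation Engine-47-4000004.timeSpent=35425.8ms</value>
<value>Correlation Engine-47-4000013.timeSpent=13425.9ms</value>
<value>rulesProcessor2.timeSpent=9310.3ms</value>
<value>Correlation Engine-47-4000014.timeSpent=8765.1ms</value>
<value>Correlation Engine-47-4000023.timeSpent=545.2ms</value>
"""
```

With the sample above, `most_expensive_rule(parse_time_spent(SAMPLE))` picks out the first rule listed under rulesProcessor1, which agrees with the ordering described in this section.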
Determining Rule Performance
Once you have identified the offending rule based on its CPU usage (previous page), how can you determine how it’s performing? Is the rule logic performing as intended? To find out, we need to go further into the output of the correlator.sh script.
As mentioned in the Writing Rules section (page 5), the state for each Rule and each Match Component within each Rule is maintained in memory and the Correlation Engine keeps statistics on these. We can view these statistics in the output of the correlator.sh script.
Let’s use the Bad Rule example from page 7. In Figure 10, you see three Match Components.
Figure 10
From the previous example of the grep, you are able to determine Time Spent and the Signature ID of the rule in question. While grep is an excellent tool, sometimes just looking through the output is helpful as well. So for this section, we’ll open the output of the correlator.sh script in an editor. An example using vi is below.
NOTE: To search for the Signature ID of the correlated event you are interested in, use a slash (/) followed by the value you are searching for. Once you’ve located the string you entered, use a lower case n to continue the search. During the search, you may match the Signature ID a number of times.
For this example, you are looking for the <status name> element which will have the Rule Name in it. You may pass
Using our example, there are three (3) Match Components. So in the output for this rule, we will have three <status name="rule_ elements, each corresponding to a Match Component. This means that:
<status name="rule_1"> matches
<status name="rule_2"> matches
<status name="rule_3"> matches
Knowing this, we can see each component’s statistics and how it is performing. The XML Example:
1. Since these Match Components are individually defined versus all in a single Match Component, the engine will be using a lot of processing (CPU) attempting to match each component individually.
2. When a match does occur in Rule 3, it’s maintaining state (memory) for over half of the events it’s seeing. This is aggressive and will be expensive in terms of memory use.
3. It may never trigger. This is because while Rule 1 and Rule 2 have seen 400M+ events, nothing has matched, even though Rule 3 has seen a match rate above 50%. In other words, if Rule 1 and Rule 2 haven’t matched by now, they may never. Thus Rule 3 is keeping memory state on over half of what it is seeing without any chance of a match (rule trigger).
The solution in this example is to combine the Match Components into a single component. This will reduce memory usage and improve CPU utilization, thus ensuring the engine can run as efficiently as possible.
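The match rate of each component can be computed directly from the matchAttempts and matches properties. The following is a minimal sketch, assuming the status elements have the shape shown in Appendix A; the function name and the wrapping outer element are hypothetical, and the sample values are those from Appendix A.

```python
import xml.etree.ElementTree as ET


def match_rates(xml_text: str) -> dict:
    """Return {status name: (matchAttempts, matches, match rate)}."""
    root = ET.fromstring(xml_text)
    rates = {}
    for status in root.iter("status"):
        props = {}
        for prop in status.findall("property"):
            props[prop.get("name")] = (prop.findtext("value") or "").strip()
        if "matchAttempts" in props and "matches" in props:
            attempts = int(props["matchAttempts"])
            matches = int(props["matches"])
            rate = matches / attempts if attempts else 0.0
            rates[status.get("name")] = (attempts, matches, rate)
    return rates


# Values copied from Appendix A; the outer element is a stand-in wrapper.
SAMPLE = """
<status name="rule">
  <status name="rule_2">
    <property name="matchAttempts"><value> 423454245 </value></property>
    <property name="matches"><value> 0 </value></property>
  </status>
  <status name="rule_3">
    <property name="matchAttempts"><value> 423454245 </value></property>
    <property name="matches"><value> 243489245 </value></property>
  </status>
</status>
"""
```

A component with a high rate sitting next to components whose rate is zero, as rule_3 and rule_2 are here, is the high match rate/low fire rate pattern described under Rule Caveats.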
NOTE: This is just one example of rule tuning. Each rule will behave differently and may require different tuning. But using the steps outlined here, it’s straightforward to see where a rule has gone 'wrong'.
Additional Options
Sometimes, even after you identify the rule, disable it, and then roll policy out, the Engine is busy determining how to catch up. Alternatively, your experience tells you that the engine has a corrupt Rule XML file, or you want to delete the queued up events and start from scratch.
If these reasons apply in your environment, you can force the situation a bit. You do this by killing the java process which is running the engine. Generally there is no harm in doing this, as there is another process which will start the engine immediately without user intervention.
To do this, first you need to determine what the Process ID is. You can use one of two commands. These are:
– or –
Figure 11 is the output of the first example with the Java process (red) and the Process ID (yellow) highlighted:
Conclusion
Finally, here are a couple of sections contained within the correlator.sh output.
As we’ve previously mentioned, the Correlation Engine dynamically determines the number of cores to use. That calculation is the number of cores detected minus 2. This section shows you what was detected and what is used.
<property name="coresUsed">
<value>
2
</value>
<description>
The number of CPU cores being used
</description>
</property>
<property name="totalCores">
<value>
4
</value>
<description>
Total detected CPU cores
</description>
</property>
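The relationship between the two properties can be checked in a few lines. This is a sketch under stated assumptions: the two property elements are wrapped in a hypothetical outer element so they parse as one document, the function names are illustrative, and the behavior on machines with two or fewer cores is not described in this document and is not modeled.

```python
import xml.etree.ElementTree as ET

# "The number of cores detected minus 2", as stated in the text.
RESERVED_CORES = 2


def expected_cores_used(total_cores: int) -> int:
    """Apply the documented calculation: detected cores minus 2."""
    return total_cores - RESERVED_CORES


def read_core_properties(xml_text: str) -> dict:
    """Extract coresUsed and totalCores from a fragment of the status dump."""
    root = ET.fromstring(xml_text)
    values = {}
    for prop in root.iter("property"):
        name = (prop.get("name") or "").strip()
        if name in ("coresUsed", "totalCores"):
            values[name] = int((prop.findtext("value") or "").strip())
    return values


# Values from the example above; <status> is a stand-in wrapper element.
SAMPLE = """
<status>
  <property name="coresUsed"><value> 2 </value></property>
  <property name="totalCores"><value> 4 </value></property>
</status>
"""
```

For the sample, `read_core_properties(SAMPLE)` yields 2 cores used out of 4 detected, which matches `expected_cores_used(4)`.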
Like dsstatus on a Receiver, the Correlation Engine also keeps track of its processing performance. This section provides you with the live EPS and Total Events Processed as of the execution of the script. The EPS number here will be what the EPS was at the moment the status was run. It could be low or high and should not be viewed as the EPS of the engine. See page 25 for the processing record counts in the logs for a more accurate EPS of the engine.
The results will look something like Figure 12, with a number of entries listed. This would mean that events are getting to the engine, but for some reason the engine is ignoring them. One easy step to take is to roll policy out to the engine to make sure it has the most recent rules.
Grep for files processed in the logs
Once you know that files are making it to the engine, this next command will check to see if the files are being processed. The syntax is:
The results are in Figure 13. If you have one or more entries in the log and they are recent events, you can be assured that the engine is processing events. The event counts here are compressed. Multiplying these by your aggregation rate can provide an estimate of the event rate on the engine. McAfee uses a 10:1 default aggregation rate; however, your rate will be different.
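The estimate is simple arithmetic. The sketch below uses hypothetical function names and an illustrative interval; only the 10:1 default aggregation rate comes from this document, and it should be replaced with the rate observed in your own environment.

```python
# Default aggregation rate quoted in the text (10:1).
DEFAULT_AGGREGATION_RATE = 10


def estimate_raw_events(compressed_count: int,
                        aggregation_rate: float = DEFAULT_AGGREGATION_RATE) -> float:
    """Estimate raw events from a compressed (aggregated) count in the log."""
    return compressed_count * aggregation_rate


def estimate_eps(compressed_count: int,
                 interval_seconds: float,
                 aggregation_rate: float = DEFAULT_AGGREGATION_RATE) -> float:
    """Estimate events per second over the interval the count covers."""
    if interval_seconds <= 0:
        raise ValueError("interval_seconds must be positive")
    return estimate_raw_events(compressed_count, aggregation_rate) / interval_seconds
```

As an illustration, a log entry reporting 1,200 compressed events over a 60 second interval would correspond to roughly 12,000 raw events, or about 200 EPS, at the default rate.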
Appendix A - Full Rule Element in XML
<status name="rule_2">
<property name="matchAttempts" >
<value>
423454245
</value>
<description>
Number of match attempts
</description>
</property>
<property name="matches">
<value>
0
</value>
<description>
Number of matches
</description>
</property>
</status>
<status name="rule_3">
<property name="matchAttempts" >
<value>
423454245
</value>
<description>
Number of match attempts
</description>
</property>
<property name="matches">
<value>
243489245
</value>
<description>
Number of matches
</description>
</property>
</status>
</status>
Appendix B - Arguments for the correlator.sh script
NOTE: Columns which are grayed out are outlined here for informational purposes and are NOT intended for general use without support assistance.
-add <file | file uncompressed | file <type> | file <type> uncompressed>
    Add a new event file to be processed. One could strip out a file from the incoming directory and add it in. Typically used in debugging.
-eventTypes <eventTypePath>
-incoming <path>
    Tells the engine to use a different incoming location for the events from the ESM. Used for debugging only.
-port <port>
    Tells the Correlation Engine what port to listen on for commands. Internal use only.
-rebalance
    Forces the engine to re-balance its usage across the threads. This is supposed to be performed as a normal process; however, should an imbalance occur, using this can force the engine to perform the balance immediately.
-shutdownStatus
    Paired with the shutdown command. It queries the correlation engine to see if it has shut down yet. It could take time to save its state file.