Seattle Area System Administrators Guild: Scaling Nagios To Monitor Large Heterogeneous Environments


Seattle Area System Administrators Guild
Scaling Nagios to monitor large heterogeneous environments
Dave Blunt
February 21, 2008

SASAG Scaling Nagios

What is Nagios?
An open source host, service, and network monitoring program (www.nagios.org).
Started as NetSaint in 1999 and became Nagios in 2002.
Availability and performance monitoring: is it up or is it down? How much load, memory, or disk is in use?

2008 GroundWork Open Source, Inc.

What is Nagios?

Diagram: Nagios architecture. Components shown: the Nagios configuration files, the Nagios parent PID, the Nagios status log, the CGIs, five Nagios child PIDs, plugins, notifications, and an event handler.


What can Nagios suffer from?


Configuration file maintenance issues
CPU and disk I/O bottlenecks
Blocking host checks
File-based performance bottlenecks


What can Nagios suffer from?


Configuration file maintenance issues
Use a web-based configuration tool
Monarch (sourceforge.net/projects/monarch)
Fruity (sourceforge.net/projects/fruity)

Facilitates monitoring across multiple Windows domains, SNMP communities, and other security zones.
Diagram: a CGI and a central Configuration store feeding the Nagios configuration files of Nagios instances 1, 2, 3, ... n.


What can Nagios suffer from?


CPU and disk I/O bottlenecks
Optimize Nagios
nagios.sourceforge.net/docs/2_0/tuning.html

Use database to store config and status information


NDOUtils (www.nagios.org/downloads)
Foundation (sourceforge.net/projects/gwfoundation)

Placing the database on a separate server greatly improves performance; both NDOUtils and Foundation support this.
Diagram: the Nagios parent PID and the CGIs connected to Configuration and Status And Events stores, with five Nagios child PIDs running plugins, notifications, and an event handler.


What can Nagios suffer from?


Blocking host checks
Passive host updates
Fping (fping.sourceforge.net)

Parallelizing the pings gives a huge increase in host check capacity (8,000+ checks a minute).
The downside of passive host updates is the possibility of some extra service alarms.
Diagram: the same layout (Nagios parent PID, CGIs, Configuration, Status And Events, five Nagios child PIDs running plugins, notifications, and an event handler) with an Fping Feeder added between Nagios and Host A, Host B, ... Host n.
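
As a rough illustration of the passive host update idea, the sketch below turns fping results into Nagios external command lines. It assumes fping's default per-host output ("<host> is alive" or "<host> is unreachable") and the standard `PROCESS_HOST_CHECK_RESULT` external command; the function name `fping_to_commands` and the sample host names are hypothetical, and the feeder described in the talk may work quite differently.

```python
import time

def fping_to_commands(fping_output, now=None):
    """Turn fping's per-host result lines into Nagios external command lines."""
    ts = int(time.time()) if now is None else int(now)
    commands = []
    for line in fping_output.splitlines():
        host, _, verdict = line.strip().partition(" is ")
        if not verdict:
            continue  # skip blank or unrecognised lines
        # Nagios host states: 0 = UP, 1 = DOWN (anything not "alive" is treated as DOWN here)
        state = 0 if verdict == "alive" else 1
        commands.append(
            f"[{ts}] PROCESS_HOST_CHECK_RESULT;{host};{state};fping: {host} is {verdict}"
        )
    return commands

# Hypothetical sample of what one parallel fping run might print.
sample = "hostA is alive\nhostB is unreachable\n"
for cmd in fping_to_commands(sample, now=1203552000):
    print(cmd)
```

Each resulting line would then be written to the Nagios external command file (or handed to an event broker), so a single fping run can update many hosts without Nagios performing a blocking check for each one.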


What can Nagios suffer from?


File-based performance bottlenecks
Remove the Nagios pipe file bottleneck with Event Brokers
Bronx (archive.groundworkopensource.com/groundworkopensource/trunk/bronx/)
Feed data into Bronx as a replacement for NSCA, and also have Bronx send data to Foundation

DNX (dnx.sourceforge.net)
Specifically tied to distributed monitoring.


Typical scaling limits in Nagios*


Typical mix of Hosts/Services    Active service checks/min**    Passive service checks/min**
700/7,000                        770                            -

*With dual 3GHz Xeon, 4GB RAM, 10k RPM disk, RHEL4 ES 32-bit OS.
**Based on a Service being checked once every 10 minutes, and 1% of Services and Hosts being in transition between OK and non-OK states. Retry interval for non-OK states is 1 minute.
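
One reading of footnote (**) that reproduces the 770 figure: every service is checked once per 10-minute interval, and on top of that the 1% of services in transition are rechecked every minute. The sketch below works through that arithmetic; the function and parameter names are illustrative, and this is an interpretation of the footnote rather than a formula given in the talk.

```python
def checks_per_minute(services, interval_min=10, pct_in_transition=1, retry_min=1):
    """Estimate service checks per minute under the footnote's assumptions."""
    # Every service is checked once per normal check interval.
    steady = services // interval_min
    # Services in transition are rechecked once per retry interval.
    retries = services * pct_in_transition // 100 // retry_min
    return steady + retries

print(checks_per_minute(7000))  # 700 steady + 70 retries = 770
```

The same formula gives 2,970 for 27,000 services, which matches the figure in the later scaling table.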


So, how could you scale up?


How can I drive my primary Nagios with Passive service checks?


Run additional Nagios instances and forward their data to Bronx or NSCA, or set up DNX
At some point you end up having too much monitoring infrastructure!

Passive agents, e.g. GroundWork Distributed Monitoring Agent, NT_Scheduler, Cron


Monarch supports creation of configuration files for passive agents

Different tier one monitoring tools, e.g. syslog, SNMP traps, Ganglia, Cacti
Feed data from these up to your primary Nagios server: install the right agent on that server, process the results, and then submit them to Nagios.
Syslog-ng (www.balabit.com/network-security/syslog-ng/)
Snmptt (www.snmptt.org)
Ganglia (ganglia.sourceforge.net)
Cacti (www.cacti.net)
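
To make "submit to Nagios" concrete, here is a minimal sketch of formatting one passive service check result in the tab-separated form that `send_nsca` reads on standard input (host, service description, return code, plugin output). The function name `nsca_line` and the example host and service names are hypothetical; a Bronx-based setup may expect something different.

```python
def nsca_line(host, service, return_code, output):
    """Format one passive service check result as a send_nsca input line."""
    if return_code not in (0, 1, 2, 3):  # OK, WARNING, CRITICAL, UNKNOWN
        raise ValueError("return code must be 0-3")
    # Tabs and newlines delimit fields and records, so collapse all
    # whitespace in the free-text output to single spaces.
    clean = " ".join(output.split())
    return f"{host}\t{service}\t{return_code}\t{clean}\n"

# Hypothetical result produced by a cron job, log watcher, or trap handler.
print(nsca_line("web01", "Disk /", 1, "DISK WARNING - 85% used"), end="")
```

A passive agent would typically pipe such lines into `send_nsca -H <central server> -c <config file>`, with the service defined on the primary Nagios server so the result has somewhere to land.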


One Nagios Instance


Many Nagios Instances


How do I implement many Nagios instances?


Use a web-based configuration tool
Monarch (sourceforge.net/projects/monarch)

Enable configuration data transfer between instances


SSH

Enable check result data transfer between instances


NSCA (Nagios Service Check Acceptor www.nagios.org/downloads), or Bronx

Optimize each Nagios instance for its purpose


Turn off active checking on parent
Set command_check_interval=-1 on parent
Turn off performance data, event handler, and notification processing on children

Alternative approach
DNX: still in beta, but offers significant maintenance advantages
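
The per-role tuning listed above could be expressed as `nagios.cfg` overrides along the following lines. This is a sketch rather than a recommended configuration: the directive names are standard main-configuration options, but the exact set chosen here (in particular enabling passive checks and external commands on the parent) is an assumption that goes beyond what the slide states, and the `PARENT`, `CHILD`, and `render` names are illustrative.

```python
# Overrides for the central (parent) instance: no active checks, passive results only.
PARENT = {
    "execute_service_checks": 0,
    "execute_host_checks": 0,
    "accept_passive_service_checks": 1,  # assumption: parent is fed passively
    "accept_passive_host_checks": 1,     # assumption: parent is fed passively
    "check_external_commands": 1,        # assumption: results arrive as external commands
    "command_check_interval": -1,        # check for external commands as often as possible
}

# Overrides for child instances: they only run checks and forward results.
CHILD = {
    "process_performance_data": 0,
    "enable_event_handlers": 0,
    "enable_notifications": 0,
}

def render(overrides):
    """Render overrides as nagios.cfg 'name=value' lines."""
    return "".join(f"{name}={value}\n" for name, value in overrides.items())

print(render(PARENT))
print(render(CHILD))
```

Keeping the role differences in one small table like this makes it easier to see that the parent does the alerting and display work while the children do the checking.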


Typical scaling limits in Nagios*


Typical mix of Hosts/Services    Active service checks/min**    Passive service checks/min**
700/7,000                        770                            -
2,700/27,000                     -                              2,970
8,000/80,000***                  -                              10,000***

*With dual 3GHz Xeon, 4GB RAM, 10k RPM disk, RHEL4 ES 32-bit OS.
**Based on a Service being checked once every 10 minutes, and 1% of Services and Hosts being in transition between OK and non-OK states. Retry interval for non-OK states is 1 minute.
***Using Bronx Event Broker and assumptions listed for note (**)

Heterogeneous environments
Mix of operating systems, network security zones, applications, and administrators!
Approaches to the problem:
Same agent type on every system: consistent, but limited coverage
Mix of methods: flexible, but more difficult to maintain, and the data must be normalized
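
As an illustration of what normalizing data can mean in practice, the sketch below maps two differently shaped inputs onto the four Nagios service states. The severity-to-state mapping is a hypothetical policy chosen for the example, not something prescribed in the talk; only the syslog severity scale (0 = emergency through 7 = debug) and the Nagios state codes are standard.

```python
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3  # Nagios service states

def from_syslog_severity(severity):
    """Map a syslog severity (0 = emergency ... 7 = debug) to a Nagios state."""
    if not 0 <= severity <= 7:
        return UNKNOWN
    if severity <= 3:   # emergency, alert, critical, error
        return CRITICAL
    if severity == 4:   # warning
        return WARNING
    return OK           # notice, informational, debug

def from_exit_code(code):
    """Plugin-style exit codes 0-2 are already Nagios states; anything else is UNKNOWN."""
    return code if code in (OK, WARNING, CRITICAL) else UNKNOWN

print(from_syslog_severity(3), from_syslog_severity(4), from_exit_code(127))
```

Whatever mapping is chosen, applying it in one place means every source reports into Nagios with the same vocabulary.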


Methods
UNIX
SNMP / SNMP traps
SSH with plugins (www.nagios.org/downloads)
NRPE with plugins (www.nagios.org/downloads)
Cron with plugins
Port-based checks
Syslog (aka traps)

Windows (http://www.crn.com/software/206801053)
SNMP / SNMP traps
NRPE_NT with plugins (www.nagiosexchange.org/Windows_NRPE.66.0.html?&tx_netnagext_pi1[p_view]=235)
WMI (with proxy)
NT_Scheduler with plugins
Port-based checks
Event logs (aka traps) (www.intersectalliance.com/projects/SnareWindows/ or www.steveshipway.org/software/f_nagios.html)

Network
SNMP / SNMP traps
Syslog
Port-based checks

Special devices
SNMP / SNMP traps
Syslog
Port-based checks
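
Port-based checks are the one method common to all four groups. A minimal sketch of such a check, following the Nagios plugin convention of one status line plus an exit code (0 = OK, 2 = CRITICAL), might look like this; the function name `check_port` and the default timeout are assumptions for illustration, and the standard plugins distribution already ships a more complete TCP check.

```python
import socket

def check_port(host, port, timeout=5.0):
    """Return (exit_code, message) in the style of a Nagios plugin."""
    try:
        # A completed TCP connection is treated as OK.
        with socket.create_connection((host, port), timeout=timeout):
            return 0, f"TCP OK - {host}:{port} accepts connections"
    except OSError as exc:
        # Refused connections, timeouts, and resolution failures all land here.
        return 2, f"TCP CRITICAL - {host}:{port}: {exc}"
```

A wrapper script would print the message and exit with the returned code, which is all Nagios needs to record the service state.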


GroundWork Open Source, Inc.
139 Townsend Street, Suite 100
San Francisco, CA 94107
phone: (415) 992-4500
www.groundworkopensource.com

