
I'm having trouble diagnosing some random lag on a six-server LAMP cluster serving a MediaWiki site. While we're serving some 100 pages/sec, the servers themselves are running fine: load below 0.5, no locked processes, no paging, no errors being logged, etc.

  • Lag is present on all servers and is random: one minute it's fine, the next it's there.
  • DNS lookups on the servers are randomly slow. For example, the time reported by "time nslookup google.com" varies randomly from a few milliseconds to several seconds, and the lookup sometimes times out entirely. While we use IP addresses internally on the cluster, this may be a symptom of the root issue. We are not running our own DNS server.
  • The Apache server-status pages randomly lag or time out. Benchmarking with ab between servers shows that a few requests sometimes take almost exactly 3000 ms. Benchmarking server-status on the local server itself usually shows no issue (it showed a lag only once in a few hundred tests).
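The DNS symptom above can be turned into numbers by timing repeated lookups and summarizing them. The following is only a minimal sketch: the function names, the 1-second "slow" threshold, and the sample count are assumptions made for illustration, and socket.getaddrinfo goes through the system resolver (including any local cache), so its timings may differ from what nslookup reports.

```python
import socket
import statistics
import time


def time_call(fn, *args):
    """Run fn(*args) once and return (elapsed_seconds, succeeded)."""
    start = time.perf_counter()
    try:
        fn(*args)
        ok = True
    except OSError:  # socket.gaierror and timeouts are OSError subclasses
        ok = False
    return time.perf_counter() - start, ok


def sample_lookups(hostname, count=20, pause=0.5, resolver=socket.getaddrinfo):
    """Resolve hostname `count` times, pausing between attempts."""
    results = []
    for _ in range(count):
        results.append(time_call(resolver, hostname, None))
        time.sleep(pause)
    return results


def summarize(results, slow_threshold=1.0):
    """Summarize (elapsed, ok) pairs; slow_threshold is an arbitrary cutoff."""
    times = [elapsed for elapsed, _ in results]
    return {
        "samples": len(results),
        "failures": sum(1 for _, ok in results if not ok),
        "slow": sum(1 for elapsed in times if elapsed >= slow_threshold),
        "median_s": statistics.median(times),
        "max_s": max(times),
    }
```

Calling summarize(sample_lookups("google.com")) on each server at the same time would show whether the slow lookups hit all machines together, which would point away from any single host.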

The servers are sitting behind a switch and a firewall which I don't have any access to, so I don't know their setup or status. While we are under heavier than normal load, 2 Mbps of incoming and 20 Mbps of outgoing traffic shouldn't be stressing the switch or firewall, should it? My feeling is that it is the switch/firewall or something above them at the ISP, like their DNS, but I can't confirm it.

I need some other tests or methods of diagnosing this lag to try and narrow down the ultimate cause.
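One test along these lines is to time bare TCP connects between servers, separately from any HTTP processing. A delay of almost exactly 3 seconds is consistent with a dropped SYN being retransmitted after an initial 3-second timeout, which was a common TCP default at the time; that is a hypothesis to check, not a confirmed cause. The sketch below assumes hypothetical names and thresholds (time_connect, classify, the 0.25 s tolerance, the 0.5 s "slow" cutoff) chosen only for illustration.

```python
import socket
import time


def time_connect(host, port, timeout=10.0):
    """Return the seconds needed to open one TCP connection, or None on failure."""
    start = time.perf_counter()
    try:
        with socket.create_connection((host, port), timeout=timeout):
            pass  # connection is closed again immediately
    except OSError:
        return None
    return time.perf_counter() - start


def classify(elapsed, retransmit_s=3.0, tolerance=0.25, slow_s=0.5):
    """Label one measurement; all thresholds are assumptions, not standards."""
    if elapsed is None:
        return "failed"
    if abs(elapsed - retransmit_s) <= tolerance:
        return "near-3s"  # possibly one lost SYN followed by a retransmit
    if elapsed >= slow_s:
        return "slow"
    return "ok"
```

Running classify(time_connect(peer_ip, 80)) in a loop from each server to each of the others, and again to a host outside the firewall, would show whether the stalls only appear on paths that cross the firewall.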

2 Answers

1

The problem turned out to be that the firewall had a hard limit of 10,000 connections. The difficulty in tracking this down came mostly from not having access to the firewall and from having to convince the service provider that there was indeed an issue.
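Without access to the firewall, a rough server-side proxy for its connection count is to tally the TCP sockets on each server by state. This is a sketch under stated assumptions: it reads the Linux /proc/net/tcp and /proc/net/tcp6 tables, the function names are made up for illustration, and the totals will not match the firewall's own connection table exactly, since the firewall tracks and expires connections by its own rules.

```python
from collections import Counter

# State codes used in the "st" column of /proc/net/tcp on Linux.
TCP_STATES = {
    "01": "ESTABLISHED", "02": "SYN_SENT", "03": "SYN_RECV",
    "04": "FIN_WAIT1", "05": "FIN_WAIT2", "06": "TIME_WAIT",
    "07": "CLOSE", "08": "CLOSE_WAIT", "09": "LAST_ACK",
    "0A": "LISTEN", "0B": "CLOSING",
}


def count_states(lines):
    """Count sockets per TCP state from /proc/net/tcp-style lines."""
    counts = Counter()
    for line in lines:
        fields = line.split()
        if len(fields) < 4 or fields[0] == "sl":  # skip blanks and the header
            continue
        code = fields[3].upper()
        counts[TCP_STATES.get(code, code)] += 1
    return counts


def read_proc(paths=("/proc/net/tcp", "/proc/net/tcp6")):
    """Sum the per-state counts over the IPv4 and IPv6 socket tables."""
    total = Counter()
    for path in paths:
        try:
            with open(path) as handle:
                total += count_states(handle)
        except OSError:
            continue  # table not present on this system
    return total
```

Summing the non-LISTEN counts across all six servers and watching whether the lag appears as the total approaches a round figure such as 10,000 gives the kind of evidence that can be shown to a service provider.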

1

Diagnosing problems almost always requires you to have some form of monitoring in place.

Roll out something like OpenNMS, InterMapper, Cacti, or if you're desperate, Nagios, and look at the traffic, system load, etc. when you see a problem. The information your monitoring system provides will probably help you figure out what's wrong.

  • We are monitoring with Zabbix. None of the major variables shows any correlation with the issue, but there are a few hundred other things I can check to see if anything pops out. If the issue is indeed outside of the six servers, this probably wouldn't show anything.
    – uesp
    Commented Nov 14, 2011 at 23:12
