TCP/IP Troubleshooting Part 1
This article describes a structured approach for troubleshooting problems with TCP/IP
networks.
This is the first of a series of articles on TCP/IP troubleshooting, and future articles will
focus on key issues highlighted in this article.
What do you think of when you hear the phrase "TCP/IP troubleshooting"? People who
are visually imaginative may see a flowchart. More linear-minded types may see a series
of numbered steps. Others (far too many) may feel a sense of inadequacy and
frustration.
TCP/IP troubleshooting should be simple, right? After all, it's just a protocol: a series of
steps to transfer bits over the network. But what a protocol - four layers, and multiple
protocols at each layer.
1. Type ipconfig to check if your IP address, subnet mask, and default gateway are
correct.
2. Now ping 127.0.0.1 to see if the TCP/IP stack on your computer is working.
3. Now ping your own computer's IP address.
4. Now try pinging the IP address of another computer on the same subnet.
5. Now try pinging your default gateway (the near-side interface of the router that
connects your subnet to the rest of the network).
6. Now try pinging the IP address of a computer on a different subnet.
And so on.
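If you want to see just how mechanical this sequence is, here is a minimal Python sketch of it. Treat it as an illustration, not a recommended tool: the function names (ping_ladder, ping_once, first_failure) are made up for this example, the addresses are whatever applies to your own network, and it assumes a system ping command that accepts -n (Windows) or -c (most other systems) for the packet count. It also treats an exit code of zero as success, which is a simplification.

```python
import platform
import subprocess
from typing import Callable, List, Optional, Tuple


def ping_ladder(own_ip: str, neighbor_ip: str, gateway_ip: str,
                remote_ip: str) -> List[Tuple[str, str]]:
    """Return the ping targets in the same order as the steps above."""
    return [
        ("loopback", "127.0.0.1"),
        ("own address", own_ip),
        ("same-subnet neighbor", neighbor_ip),
        ("default gateway", gateway_ip),
        ("remote subnet host", remote_ip),
    ]


def ping_once(address: str, timeout_s: int = 5) -> bool:
    """Send one echo request with the system ping command (assumed to exist)."""
    count_flag = "-n" if platform.system() == "Windows" else "-c"
    try:
        result = subprocess.run(
            ["ping", count_flag, "1", address],
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout_s,
        )
    except (OSError, subprocess.TimeoutExpired):
        return False
    # Assumption: exit code 0 means a reply came back.
    return result.returncode == 0


def first_failure(steps: List[Tuple[str, str]],
                  probe: Callable[[str], bool] = ping_once) -> Optional[str]:
    """Walk the ladder in order; return the label of the first failing step."""
    for label, address in steps:
        if not probe(address):
            return label
    return None
```

Calling first_failure(ping_ladder(...)) with your own addresses returns the label of the first step that does not answer, or None if every step answers.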
I call this the "brain-dead approach" because it's so methodical you can basically turn off
your brain and just follow the steps. It's also somewhat inefficient, for it automatically
assumes that your problem most likely starts with your own computer and that the
problem is more likely to be closer to you (your network card, your computer's IP address
configuration, your local subnet) than further away (other subnets). And it's a method that
was probably developed before the Internet really took off, that is, before DNS became
ubiquitous for name resolution and before firewalls and VPNs became a fact of life for
most corporate networks.
What I mean is this: one of your users says "I can't connect to the server right now."
What could be the problem? It helps to dissect this simple sentence to understand the
issues that may be involved. For example:
"I can't"
Is this the only user who has called in reporting network problems? If there are others, do
they have similar issues? If so, then right away it's clear you don't need to take a
brain-dead approach and begin your troubleshooting at the user's computer. Instead, the issue is
most likely "out there" somewhere: maybe your DNS server is offline or your DNS
provider is experiencing difficulty. Or maybe a router on your internal network is going
crazy and dropping packets. Or maybe the server your users are trying to connect to has
crashed.
You should also stop and think about any commonalities these users who are having
problems may have. For example, are their machines all on the same subnet? If so, then
maybe the default gateway for that subnet is misconfigured or the router crashed. Or
maybe a contractor working in the plenum crawlspace has accidentally cut a network
cable connecting the subnet's workgroup switch to the department's main Ethernet
backbone switch. Or maybe someone malicious has installed a rogue DHCP server on
that subnet and it's stealing machines as their leases come up for renewal and assigning
them unroutable addresses to create a denial of service condition.
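Checking whether the affected machines share a subnet is easy to do by hand for three callers and tedious for thirty. Here is a small sketch using Python's standard ipaddress module. The function name affected_subnets and the sample addresses are hypothetical, and it assumes every client uses the same prefix length (/24 here), which may not hold on your network.

```python
import ipaddress
from collections import Counter


def affected_subnets(client_addresses, prefix_length=24):
    """Count affected clients per subnet, assuming one prefix length for all."""
    counts = Counter()
    for address in client_addresses:
        # strict=False lets us derive the network from a host address.
        network = ipaddress.ip_network(f"{address}/{prefix_length}", strict=False)
        counts[str(network)] += 1
    return counts


# Hypothetical addresses of users who called the Help Desk.
reports = ["10.1.2.15", "10.1.2.77", "10.1.2.130", "10.1.9.4"]
for subnet, count in affected_subnets(reports).most_common():
    print(subnet, count)
```

If most of the reports land in a single subnet, that subnet's gateway, switch, or DHCP service is a sensible place to look first.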
If it's only that one user though who has the problem, then it's probably time to play
brain-dead and start asking questions like "OK, is your computer turned on? Is the
network cable securely attached at the back of your machine?" and so on.
"connect to"
A good question to ask this user is "What do you mean by connect?" That's because
"connect" is a technical-sounding word that users often use to impress the Help Desk and show
they know what they're talking about. Well, they usually don't. Why? Because there are
different kinds of connectivity including MAC-level communications, TCP sessions,
password-authentication, access rights and privileges, NAT-traversal connectivity,
firewall pass-through, application-level sessions, and so on. What kind of connectivity
problem are they actually having? What are they actually trying to do when they say they
want to "connect to" the server? Are they trying to access a share on that server? Do they
get an "Access denied" message when they do this? Are they getting a login box
prompting them for credentials? Is it rejecting their credentials? Are they having trouble
finding the share in Active Directory? Is it a mapped drive they are having problems
with? Are they trying to browse to find the server in My Network Places? And so on.
And is it just that server they're having trouble connecting to, or are they having problems
connecting to anything on the network? Determining the scope of the problem here is
important: is connectivity failing in just one way or many ways?
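To make the point that "connect" can fail at different stages, here is a rough Python sketch that separates a name resolution failure from a TCP-level failure. It uses only the standard socket module. The function name classify_connection and the result strings are invented for this example, and it stops at the TCP session: authentication, access rights and application-level sessions are not covered.

```python
import socket


def classify_connection(host: str, port: int, timeout_s: float = 3.0) -> str:
    """Report which stage fails first: name resolution or the TCP connection."""
    try:
        candidates = socket.getaddrinfo(host, port, type=socket.SOCK_STREAM)
    except socket.gaierror:
        return "name resolution failed"

    # Only the first resolved address is tried, to keep the sketch short.
    family, socktype, proto, _canonname, sockaddr = candidates[0]
    with socket.socket(family, socktype, proto) as sock:
        sock.settimeout(timeout_s)
        try:
            sock.connect(sockaddr)
        except ConnectionRefusedError:
            # Something answered, but nothing is accepting on that port.
            return "host reachable, port refused"
        except socket.timeout:
            return "no response (timeout)"
        except OSError:
            return "network error"
    return "tcp connection ok"
```

A result of "tcp connection ok" alongside a user who still can't open a share suggests the problem sits higher up, in credentials or permissions, not in the network path.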
"the server"
You've got this user over here, and this server over there, and the network between. They
can't connect. Why? Well, where exactly is that server anyway? Is it on the user's subnet?
On an adjacent subnet? In a different department? On a different floor? In a different
building? On a different continent? What kind of network connects the user with that
particular server? A wired Ethernet LAN? A wireless LAN (WLAN)? A fractional T1
line? Frame Relay? A VPN tunnel over the Internet? A dial-up modem connection?
Cable modem or DSL? First determine the type of connection (possibly several types)
between the user and the server, and then ponder where things might break down. Maybe
the CSU/DSU has gone wonky; try cycling its power or contact your service provider,
who should be monitoring it. Maybe the janitor is cleaning the server room and he
bumped a power bar and an Ethernet switch has gone offline. Check for an alert message
from your network management software, assuming you're using managed switches.
Maybe there's been a power blackout at the remote branch office where that server is
located. Call them on the phone and see what's happening.
And is it server or servers? Is the user having trouble connecting to only that server or to
other servers as well? Are others having problems connecting to other servers also? What
are the commonalities (if any) between all the servers being affected? (Or apparently
being affected; remember, the problem may be with the users' computers or, more likely,
with the network infrastructure itself.)
"right now."
The time element is crucial in troubleshooting. Did the problem just start happening?
When was the last time you successfully connected to the server? How long has it been
going on? Is it continuous or intermittent? Intermittent network problems involving
unreliable WAN links and other issues can be difficult to troubleshoot, especially if
they're transient, that is, brief and occasional.
Time can also help you relate the problem to other circumstances that might be impacting
your network. Did the problem start this morning at 10 am? What else happened on your
network around then? Were patches applied by a WSUS server? Did scheduled
maintenance on a domain controller occur? Was a construction crew in the building
compound using a backhoe to repair a water main break?
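If you keep any kind of change log, lining it up against the time the problem started is simple to automate. The sketch below is one way to do it in Python. The function name events_near, the 30-minute window and the log entries are all hypothetical, chosen only to mirror the questions above.

```python
from datetime import datetime, timedelta


def events_near(problem_start, change_log, window=timedelta(minutes=30)):
    """Return change-log entries logged within `window` of the problem start."""
    return [
        (when, what)
        for when, what in change_log
        if abs(when - problem_start) <= window
    ]


# Hypothetical change log: (timestamp, description) pairs.
change_log = [
    (datetime(2024, 1, 15, 3, 0), "scheduled backup"),
    (datetime(2024, 1, 15, 9, 50), "patches applied by WSUS server"),
    (datetime(2024, 1, 15, 10, 5), "domain controller maintenance"),
]
problem_start = datetime(2024, 1, 15, 10, 0)

for when, what in events_near(problem_start, change_log):
    print(when.isoformat(), what)
```

Entries that fall inside the window are candidates to investigate, not proof of a cause.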
A Structured Approach
My own approach to TCP/IP troubleshooting is structured around three critical areas:
2. Determine which troubleshooting steps might apply given the above problem
elements. This includes:
o Verifying physical media connectivity for the client(s), server(s) and
network infrastructure hardware involved. This means checking cables,
making sure network adapters are properly seated, and looking for other
causes of network connections displaying a media disconnected state.
o Verifying TCP/IP configuration of the client(s), server(s) and network
infrastructure hardware involved. On the clients and servers this means IP
address, subnet mask, default gateway, DNS settings, and so on. For
network infrastructure hardware this typically means routing tables on routers
and Internet gateways.
o Verifying routing connectivity between the client(s) and server(s)
involved. This means using ping, pathping, tracert, and other similar tools
to verify end-to-end TCP/IP connectivity at the network level; packet
sniffing to monitor transport layer sessions; using nslookup, telnet and
other tools to troubleshoot application layer issues involving name
resolution problems, authentication problems, and so on.
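As a small illustration of the TCP/IP configuration check in the list above, here is a Python sketch that sanity-checks a host's IP address, subnet mask and default gateway against each other. It uses the standard ipaddress module. The function name check_ip_config and the particular checks are my own choices for this example, it handles IPv4 only, and the network/broadcast test is not meaningful for very small subnets such as /31 or /32.

```python
import ipaddress


def check_ip_config(address: str, netmask: str, gateway: str) -> list:
    """Return a list of problems found in a host's basic IPv4 settings."""
    problems = []
    interface = ipaddress.ip_interface(f"{address}/{netmask}")
    network = interface.network
    gateway_ip = ipaddress.ip_address(gateway)

    # A default gateway must be reachable on the host's own subnet.
    if gateway_ip not in network:
        problems.append(f"gateway {gateway} is outside {network}")
    # A host should not be assigned the network or broadcast address.
    if interface.ip in (network.network_address, network.broadcast_address):
        problems.append(f"{address} is the network or broadcast address of {network}")
    # 169.254.0.0/16 usually means the host configured itself without DHCP.
    if interface.ip.is_link_local:
        problems.append(f"{address} is a link-local (169.254.0.0/16) address")
    return problems
```

For example, check_ip_config("192.168.1.10", "255.255.255.0", "192.168.2.1") reports that the gateway is outside 192.168.1.0/24, while a consistent configuration returns an empty list.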
Conclusion
Troubleshooting TCP/IP networks can be frustrating, but it can also be fun. In future
articles we'll zoom in on the troubleshooting steps and tools you need in
order to successfully solve the issues that might arise on your network. Until then, stay
connected!