High-Performance Networking Unleashed

- 28 -

Monitoring Your Network

by Frank C. Pappas and Emil Rensing

Congratulations! By now you've probably become quite familiar with all sorts of networking topics, including hardware and software selection, network topologies, cabling options, and much, much more. In fact, you're probably itching to drop this book and begin planning the creation or rebirth of your own corporate super-network. But wait--you're not ready to solo just yet! Almost. But not quite. There are just a few more important issues that we need to cover before you'll be ready to assemble your team, call or fax hardware vendors and systems consultants, and provide a wide array of important and amazing network services for your managers, colleagues, secretaries, and interns.

The chapters that you've covered so far throughout this book have given you a number of vital tools to help you build--or, just as importantly, rebuild--a corporate local- or wide-area network from the ground up. No expense has been spared; we've covered everything from the physical connections between the computers to the various protocols that allow the many different network and client operating systems to be joined together in vast heterogeneous internetworks. So, truth be told, you probably are just about ready to go out and architect a Grade-A system that would make us (and you!) proud. After all, you've put in a lot of effort reading this book from cover to cover, have invested lots of time researching vendors and their offerings, and have used all of your education, experience, and intuition to spec out a system that meets and exceeds every desire, expectation, or daydream that your techno-geek boss may once have verbalized.

However--and here's the kicker--have you stopped to consider what will happen to that magnificent creation of yours a week, a month, or a year down the road? What will you do when this testament to high-tech corporate communications begins to develop quirks or, worse yet, becomes the very nightmare it was designed to replace? Taking a good deal of poetic license with the words of an actor whom we feel can't quite act his way out of a wet paper bag: When your network begins to fall apart at the seams, "What do you do? What do you do?"

The answer itself is simple: You make sure that your network never degrades to the point that people will start asking questions (or start building a gallows). To avoid the lynch-mob, you'll need to be proactive when it comes to systems maintenance. What's more, you need to be especially vigilant when it comes to problem detection and resolution, knowing where, when, and (most importantly) how to look for those bottlenecks, malfunctions, and other assorted quirks that will undoubtedly develop over the life of your network, as well as how best to resolve them when they do crop up. Unfortunately, keeping abreast of every aspect of your network can be a daunting task. How do you know when there's a problem? Well, just as realtors the world over will tell you that the most important feature of any house is "location, location, location," network engineers and systems administrators have their own axiom for making sure that your networks don't get out of hand: "monitor, monitor, monitor!" This way, you can be sure that your networks will always be preventing headaches instead of causing them.

The Mother of All Network Problems

Without a doubt, a corporate network can be one of the most complex technology environments you'll ever be required to support and maintain. Of course, we're not saying that you'll never take a job over at MAE-East troubleshooting backbone problems, or that you'll never get that dream job with the Air Force connecting, via T-1 lines, all of the SAC missile silos in North Dakota. The chances are, however, that the ragged mix of state-of-the-art and hand-me-down hardware, software, and cable that you call a network is in dire need of a good dose of tender loving care. Your job, as a skilled network engineer or systems administrator, is to make sure that your network is optimally configured so that it can, well, be all that it can be.

Now, as any seasoned network administrator will tell you, when you receive a report of network trouble, there's one thing that should be done immediately: Blame the User (BTU). Ninety percent of all supposed network trouble is actually the result of users doing something they shouldn't be doing, such as installing inappropriate or uncertified software and hardware, tinkering with Registry settings, folding or yanking on the CAT5 that's running into their NIC, and other such frustrating mischief. This is especially the case when only a few workstations are afflicted by the mysterious network malady. While we'd love to give these users a good spanking or send them to bed without dinner, the BTU policy is an excellent way of simultaneously keeping your network on its toes and responding to user problems in a timely and efficient manner.

Of course, BTU won't be of much help to you unless you follow a few highly important tips. First, never, ever let the user know about the BTU system--they'll probably get mad and throw something hard at you (such as a monitor). Second, remember that almost all BTU-related problems are either fixed, or at least identified, within the first ten minutes or so after you've responded to the call for help. If, after the first ten or fifteen minutes, neither you nor the user can come up with even the foggiest of ideas as to why the network is acting up--or if any more than a few machines are affected--it may be time to regroup and start looking for a more pervasive cause somewhere on your server or in your spider web of hubs, routers, and cable.

TIP: If you're stumped as to the cause of a single-workstation network outage, be sure to check the cable running between the network drop on the wall and the network interface card (NIC) at the back of the client workstation. While this most likely will seem absurdly obvious, you'll be surprised how often this three-second procedure will be the answer you need. Chances are that the cable was inadvertently yanked or twisted and no longer has a good connection in one of the two sockets. If that doesn't work, try swapping in a new section of cable; every so often network wiring will just mysteriously die.

Things To Look For

It is of the utmost importance to remember that your network is more than just a simple collection of highly expensive hardware linked together via some fancy protocols and a bunch of wiring. The utility--and frustration--involved with modern corporate networking comes from the fact that these networks are such complex systems, where multiple protocols run across different hardware types, with each particular brand of hardware often hosting a unique combination of software and services. While providing a robust, reliable, and efficient networking solution for your company may seem like magic to your user community, we all know that you'll most likely lose a few nights of sleep (and some hair) in the process. Because there's a slight chance that we may end up working together in the same network operations center (NOC) one of these days, we're here to provide you with some monitoring tricks, tips, and pointers so you can effectively manage a top-notch network, keep your good looks, and (hopefully) never go crazy!

In the most general sense, there are five areas that can provide their own unique set of serious (that is, non-BTU) network problems: cable, routers and hubs, hardware, software, and operating systems. We've already covered some of these topics throughout the course of this book. Chapters 6, "Hubs," and 8, "Switches," provide a bit of information on the proper role of hubs and switches in a networked environment so you can be extra confident that you're using the right tools for the right job.

How You Can Analyze Your Network

There are a number of different methods that you will use to analyze your network. While some utilize specialized (also known as expensive) hardware and software reminiscent of an episode from your favorite science-fiction television show, there are other avenues you can explore that are anything but high-tech or prohibitively expensive. Even if you (or your company) can't quite justify spending $10,000 on the latest button-and-light-covered LAN analyzer that once graced the cover of Network World, take heart. While these gadgets certainly perform some wonderful tasks that help to speed up network troubleshooting, systems administrators and engineers were resolving complex network problems long before these products came to market. In other words, with a little bit of sweat and a pinch of luck, you'll muddle through!

If, however, you're still a bit unsure as to how you'll be keeping track of your network, the following rules of thumb should be of tremendous value.

Listen to Your Users

As your first line of defense against the evils of network trouble, be sure to listen to the troops. The guys and gals on the front line (the users) are probably more sensitive to network trouble than, say, a systems administrator, programmer, or network engineer, and they're almost sure to come a hollerin' at the first sign of trouble. This is the case simply because the network administrator and other related tech-types probably spend the majority of their time working directly at the server console, not at a remote workstation at the other end of the building where problems tend to be exceedingly more pronounced. Remember that when users gripe about seemingly insignificant problems, it's still worth a few minutes of your time to investigate. BTU notwithstanding, every once in a while a user will spot something that's indicative of a budding catastrophe, and you'll be thankful that you followed up on that trouble ticket!

Listen to Your Network Operating System

As important as it is to listen and respond to the requests, gripes, and queries that filter up from the unwashed masses, it's equally important to listen to what your network operating system(s) (NOS) has to say about its impressions of network performance. Posh! Silly author, you think. How can my operating systems talk to me? Two words: server logs.

As overwhelming as they can often be, server logs are, in fact, your friends. Depending on your operating system, you'll have anything from a handful of services intermittently reporting on their status to full-blown administrative tracking systems that record every error, warning, or informative flag that is raised by the many applications, services, and hardware that comprise your network. The key here is to know not only where to look, but also what to look for and how to respond to the specific warning(s). This issue is complex enough to easily demand its own book. What will (hopefully) save the day is your understanding of the various components of your network, how they interrelate, and the dependencies between each service and between particular services and specific hardware. Once again, it's experience over book-learning that will separate the men from the boys when it comes to monitoring networks via system logs.

The locations and contents of these logs vary from NOS to NOS, based partially on the operating system itself as well as administrator-defined configuration preferences (most often specified at installation). On servers running any of the many flavors of the UNIX operating system, you'll find logs scattered throughout the directory structure, reporting on everything from mail services and Web traffic to disk/volume status and security information. Windows NT networks, on the other hand, offer one of the most comprehensive and easy-to-use logging tools that is provided, at no additional charge, as part of the operating system itself. The Windows NT Event Viewer monitors and records status messages from all NT services, security data, and other applications, providing a consolidated viewer to review all logs.
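To make "what to look for" a little more concrete, here is a minimal Python sketch that tallies log lines by severity keyword. The keyword list and the sample lines are invented for illustration; real log formats differ from NOS to NOS, so treat this as a starting point rather than a parser for any particular system.

```python
from collections import Counter

# Hypothetical severity keywords, listed from most to least severe.
SEVERITIES = ("error", "warning", "info")

def tally_severities(lines):
    """Count how many lines mention each severity keyword (case-insensitive)."""
    counts = Counter()
    for line in lines:
        lowered = line.lower()
        for severity in SEVERITIES:
            if severity in lowered:
                counts[severity] += 1
                break  # count each line once, under the first (most severe) match
    return counts

# Invented sample lines, not taken from any real system.
sample = [
    "Jan 12 03:14:07 mailhost sendmail: warning: queue is growing",
    "Jan 12 03:15:10 mailhost kernel: ERROR: disk sd0 not responding",
    "Jan 12 03:16:00 mailhost httpd: info: daemon restarted",
]

if __name__ == "__main__":
    for severity, count in tally_severities(sample).items():
        print(severity, count)
```

A sudden jump in the error or warning tally from one day to the next is often a better early signal than any single message.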

Beware Obsolescence!

The third major defense that you wield against the scourge of network trouble is the patch. Thanks to the poking and prodding of users and administrators around the world, it isn't long before deficiencies or omissions in operating system releases are identified and publicized. This includes the entire gamut of operating system parts, from security features to file system eccentricities to resource-hungry graphics subsystems. So that their products continue to function as advertised (and don't lose market share), vendors such as Digital Equipment Corporation, Microsoft, Novell, Sun Microsystems, and many others offer frequent patches (upgrades) to their operating systems to remedy these bugs, both the obvious and the little-known. Monitoring the version numbers on your firmware, network operating system, and applications will allow you to be ahead of the curve when it comes to keeping the network running all day, every day. These patches are usually made available on the particular NOS manufacturer's Web site, generally at little or no charge. You can also keep up-to-date on the newest revisions for these products by monitoring the various Usenet news groups that are relevant to your particular hardware and software.
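One way to keep an eye on version numbers is to hold a small inventory and compare it against the latest releases you know about. The sketch below assumes plain dotted version strings; the component names and version numbers are made up, and real vendors often use schemes (service packs, letter suffixes) that would need extra handling.

```python
def parse_version(text):
    """Turn a dotted version string such as '4.0.10' into a tuple of ints."""
    return tuple(int(part) for part in text.split("."))

def outdated(installed, latest):
    """Return the names whose installed version is older than the latest known one."""
    return sorted(
        name
        for name, version in installed.items()
        if name in latest and parse_version(version) < parse_version(latest[name])
    )

# Hypothetical inventory: names and numbers are illustrative only.
installed = {"nos": "4.0.2", "router-firmware": "11.1", "mail-server": "8.8.5"}
latest = {"nos": "4.0.10", "router-firmware": "11.1", "mail-server": "8.8.8"}

if __name__ == "__main__":
    print(outdated(installed, latest))
```

Comparing tuples of integers rather than raw strings matters here: as text, "4.0.10" sorts before "4.0.2", which would hide exactly the update you are looking for.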

Use Your Operating System's Native Diagnostic Tools

A common mistake that network administrators often make is that they reject, out-of-hand, the suites of diagnostic tools that are frequently integrated into, or are occasionally bundled with, network operating systems. This is more often the case with administrators running Windows NT, who are generally less than impressed with the appearance of PerfMon, the freebie Windows NT diagnostic utility. However, don't judge a book by its cover.

The reality of the Windows NT performance monitor is altogether different. While some third-party packages will include a more attractive front-end or some flashy monitoring wizards, the NT performance monitor allows you to track just about every service, application, and byte of data that passes through or around your NT network. If you spend some time learning how PerfMon does its stuff, you'll find that it is an excellent tool that can handle perhaps 80 percent (sometimes more) of your network monitoring needs, depending entirely on the size and complexity of your internetworking environment. If you are using Windows NT as your network operating system, Bill Gates and his minions in Redmond, WA have been thoughtful enough to provide you with a second set of tools--command line options--that can be used to supplement the information gathered in a Performance Monitor session. These options (such as Net statistics, Net file, Net session, and so on) can be executed at the command line to give you an instant snapshot, and control, of a variety of internetworking-related systems running on your NT server.

UNIX also provides an exhaustive set of tools and other utilities designed to help you help yourself when it comes to monitoring your network. While UNIX diagnostics may not always be as attractive or as user-friendly as you'd like, you'll be hard-pressed to find a set of standard, affordable tools that can compete with the amount of information that the UNIX monitoring utilities will provide. If you'd like, you can skip right ahead to the UNIX section and read up on some more specific information relating to UNIX monitoring.

These four steps should provide you with just about every bit of information that you'll need in order to keep a watchful eye over your network. Should things start to get a little out of control, however, you can start bringing in the heavy artillery--LAN analyzers.

Using a LAN Analyzer

If, for one reason or another, the steps that we've previously covered haven't helped you solve your particular set of network problems, your next move will be to employ a dedicated LAN analyzer and cross your fingers! While LAN analyzers are most often quite expensive, they're usually more than worth their cost if you plan on maintaining anything more than a simple, one-protocol network. Because networks and internetworks are becoming increasingly complicated--running multiple protocols, supporting wider arrays of software and hardware, and providing for ultra-high bandwidth connections--sometimes the standard sets of monitoring and diagnostic tools that are included with network operating systems simply aren't up to the task. After all, these NOS-specific suites were designed to monitor a specific type of homogeneous (often single-protocol) network, not the complex, heterogeneous monstrosity that you have installed in your office.

LAN analyzers are basically network therapists. Once you've connected the device (anything from a full-blown PC to a hand-held unit) to your network, it listens to all of the information that passes between the various network nodes, taking note of a number of transmission statistics. In earlier days, LAN analyzers simply gathered this information and presented it in a predefined format via any of a number of display devices, just like the good therapist, without offering any concrete advice. At that point, it was up to the network administrator to figure out how best to interpret the data and what actions, if any, were needed in order to address the perceived problems.

Unfortunately, this passivity would only be acceptable for so long. As networks have become increasingly complex, so too have the feature-sets of LAN analysis devices grown beyond their early and limited incarnations. Today, most dedicated LAN analysis systems require high-end RISC or x86 processors, significant amounts of memory, and quite a good deal of money. A large portion of the analysis tool market consists of proprietary hardware and software combinations that are often designed as portable, hand-held devices that support a variety of protocols, architectures, and network types. What's more, the growing complexity of internetworks has forced these advanced monitoring devices to incorporate the latest developments in artificial intelligence and expert systems in order to streamline the process of monitoring and troubleshooting networking issues. Not only will the most recent LAN analysis tools monitor network traffic and build reports containing nearly every type of information available, they'll also use their on-board knowledge of IEEE standards, protocol specifics, and other networking issues to offer possible avenues of action based on what they see as the most serious network issues.

Operating System Analysis and Tuning Tools

Although conceptually the same from network operating system to network operating system, the actual implementation of monitoring and tuning tools is quite different. This is not necessarily the result of different developers working for different software vendors who are building competing products; rather, it stems from low-level differences in how each operating system implements networking. Additionally, these tools leverage other abilities of the operating system in their overall design. For example, most monitoring tools for Windows NT are full-screen applications that make use of the Windows GUI, while UNIX tools are primarily command-line tools that write to standard output or standard error. In this section, we will discuss the primary monitoring and tuning tools for both Windows NT and UNIX operating systems.

Windows NT

Windows NT networks can be fun and (relatively) easy to work with, especially when they're working according to spec. Unfortunately, when things go wrong they tend to go wrong in a big way.

THE BLUE SCREEN OF DEATH: If you ever show up to work and find one of your critical NT servers displaying some hexadecimal identifiers and a bunch of (seemingly) nonsensical letters and numbers on a blue background, be prepared for a long day. Whenever the NT kernel is critically unable to deal with hardware or software (kernel-specific) errors, Windows NT will come to a halt and display this blue screen and the associated characters, which actually are providing some clues as to the cause of the crash. Your first step is to reboot the machine via the power switch. You can ignore the specific characters on the screen if the problem goes away after your reboot. If the machine blue-screens again, however, take note of the characters and get in touch with your support representative.

A lot of times such problems will be related to NT as an operating system, exclusive of its role as a part of your greater network. This will include file system problems, interrupt conflicts, print failures, and other hardware hiccups. However, based on your system and network configuration, you may find that NT isn't performing as well as you'd like as the brain of your network, or is perhaps failing altogether. In cases where NT's networking functions are involved, there are a number of monitoring tools at your disposal that you can use to resolve network trouble.

Event Viewer

The Windows NT Event Viewer hasn't changed much over the years. This three-tiered system records System, Security, and Application event information for later analysis. The System log records events such as a driver that fails during operation or cannot load at boot time. The Security log tracks login attempts and other security information, allowing you to detect and repel possible attacks. Finally, the Application log captures information from NT-based applications, from third-party Web servers to database or communications software.

Network Monitor

As one of the most exciting additions to the tools suite of Windows NT 4.0, the Network Monitor provides what at one time was only available as a third-party application or as part of an expensive LAN analysis hardware package. The Network Monitor provides a host of network statistics, from frames, broadcasts, and multicasts detected to adapter-specific traffic data. You are able to run in either normal or dedicated capture modes and can save the data for later analysis.

Performance Monitor

The Performance Monitor is a graphical tool that allows you to measure the performance of nearly every aspect of your Windows NT network operating system, including the server and remote workstations. The Performance Monitor will also provide charting, reporting, and alerting services to inform you if and when counters such as %Processor Time, Packets Transmitted/sec, and so on reach predefined levels.
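The alerting idea is simple enough to sketch outside of PerfMon itself. The Python fragment below is not PerfMon's own mechanism; it only illustrates the principle of comparing sampled counter values against limits. The limits and the readings are invented for the example.

```python
def find_alerts(samples, thresholds):
    """Return (counter, value, limit) for every sample that exceeds its limit."""
    alerts = []
    for counter, value in samples:
        limit = thresholds.get(counter)
        if limit is not None and value > limit:
            alerts.append((counter, value, limit))
    return alerts

# Hypothetical limits and readings, chosen only for illustration.
thresholds = {"%Processor Time": 80.0, "Packets Transmitted/sec": 5000.0}
samples = [
    ("%Processor Time", 42.0),
    ("%Processor Time", 93.5),
    ("Packets Transmitted/sec", 1200.0),
]

if __name__ == "__main__":
    for counter, value, limit in find_alerts(samples, thresholds):
        print(f"ALERT: {counter} = {value} (limit {limit})")
```

Picking sensible limits is the hard part; they should come from a baseline of your own network under normal load, not from a generic table.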

Remote Access Administrator

If you are providing remote access (RAS) services for your network and trouble arises, your first stop should be the Remote Access Administrator. The Remote Access Administrator will provide a quick view of your overall RAS status, including available and busy ports; inbound and outbound byte and frame rates; CRC, time-out, framing, hardware, and buffer errors; and identification data for the remote workstation.

Server Manager

The Windows NT Server Manager is a resource that will allow you to administer Windows NT domains, as well as settings on specific computers. You can view a list of users connected to certain machines, view machine-specific resources (shared and open), control directory replication, configure services, and broadcast alert messages to connected users. Additionally, you can manage primary domain controllers (PDCs) and backup domain controllers (BDCs), synchronize servers with the PDC, and add or remove machines from the NT domain.

Built-In Diagnostics

The built-in suite of NT diagnostics named, aptly enough, Windows NT Diagnostics, provides a comprehensive look at the configuration and environment variables that your NT box is currently using, in any of nine categories: Version, System, Display, Drives, Memory, Services, Resources, Environment, and Network. The category tabs will provide you with the following information:

Version--OS version and build numbers, 20-digit registration number, and personalized registration data

System--HAL type, BIOS type and version, and types of processor(s)

Display--BIOS date and type, current settings, memory, chip and DAC types, and driver-specific data

Drives--Access to and information on all drives connected to the machine

Memory--Pagefile space, physical and kernel memory, handles, threads, processes, commit change totals

Services--All installed services and their current states (running, stopped, paused)

Resources--IRQs, bus number and type for all devices attached to the server

Environment--Processor type, architecture, level, and revision; Windows Path; OS

Network--Four categories of data: General (access rights, workgroup/domain name, LAN root), Transports (type, address), Settings (time-out values, read-ahead, buffers, and so on), and Statistics (bytes sent/received, cache, failed reads, and so on)

Internet Service Manager

With the recent addition of the Microsoft Internet Information Server as an integral part of the Windows NT network operating system, Microsoft has added the Internet Service Manager as a front-line tool for managing NT-based Internet services running on boxes located anywhere on your local area network.

The Internet Service Manager allows you to easily access any local Windows NT servers providing HTTP, FTP, or Gopher services, view their current status (running, stopped, paused), and reconfigure specific service properties in response to current or anticipated network trouble. While ISM monitoring functions are fairly limited, they can, when coupled with the ability to remotely start, stop, and configure net services, be a godsend for systems maintenance!

NT Networking Hotspots

There is an almost endless number of subsystems that work in concert to provide the robust networking environment that we have come to know as Windows NT. However, when it comes to NT networks, there are a number of very important features of which you need to be made aware. Once you know where they are and what to look for, you should have a much easier time trying to resolve latency issues and other performance bottlenecks.

Monitoring the NT hotspots will always require you to use the NT Performance Monitor (PerfMon). While you'll usually end up complementing your PerfMon analysis with other monitoring tools, PerfMon is generally the best and most enlightening place to begin your investigations. Depending on the services and other additional features that you have installed with your NT server or workstation, you'll have your choice of monitoring a number of PerfMon objects, including: Browser, Cache, FTP Server, HTTP Service, ICMP, IP, Memory, NetBEUI/NetBEUI Resource, Network Interface/Segment, Paging File, Physical Disk, Process, Processor, RAS Port/Total, Redirector, Server/Work Queues, System, TCP, Thread, and UDP. While each of these factors can contribute to the success or failure of your network at a particular moment, it is important to pay special attention to a minimum of four objects. While these four objects can vary, depending on your specific configuration, they can be broadly categorized into: default network protocol, interface hardware, memory, and server.

During the course of preventative or responsive troubleshooting, you should remember to monitor the default network protocol that connects your various nodes. In some cases this can be achieved through a combination of the TCP and IP objects, or perhaps as a function of the NetBEUI object. Collisions, byte and packet rates, time-out values, and other important information can be examined and evaluated based on ideal and/or anticipated performance levels.

Interface hardware deals with the state of your current host adapter and its overall performance in terms of sent and received packets, errors, queue length, and so on. You can (and should) rely on this object to help you determine when network outages are the result of an improperly configured NIC, adapter failure, or even bad CAT5.

Memory, quite simply, will provide performance data for the currently installed system memory, including read and write errors, faults, copies, available memory, and so on. You can use this object to determine whether network services are being degraded by faulty memory (indicated by high numbers of errors/faults), insufficient memory (large paging files and frequent disk access), or an improperly configured Paging File.

The server object is the last central object to which you'll immediately turn for signs of network trouble. You can measure everything from permission and system errors to additional data on the status of available physical memory. Also, the Bytes Total/Sec. value can provide invaluable data as to how busy a particular machine is under different circumstances (especially useful for file servers) and can be of great help when trying to perform load-balancing calculations.
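As a rough illustration of how Bytes Total/Sec. readings can feed a load-balancing estimate, the following Python sketch computes each machine's share of the combined traffic. The server names and figures are hypothetical, and a real calculation would average many samples taken under different conditions rather than rely on a single reading.

```python
def load_shares(bytes_per_sec):
    """Return each server's fraction of the combined Bytes Total/Sec. figure."""
    total = sum(bytes_per_sec.values())
    if total == 0:
        return {name: 0.0 for name in bytes_per_sec}
    return {name: rate / total for name, rate in bytes_per_sec.items()}

# Hypothetical readings for three file servers.
readings = {"fileserver1": 600_000, "fileserver2": 300_000, "fileserver3": 100_000}
shares = load_shares(readings)

if __name__ == "__main__":
    for name, share in shares.items():
        print(f"{name}: {share:.0%} of total traffic")
```

A machine carrying well over its fair share is a natural candidate for having some of its shares or services moved elsewhere.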

UNIX

For the most part, network troubleshooting in the UNIX world will focus on TCP/IP, which is a communications protocol, and NFS, which is a network file system. In addition, you must be constantly aware of the implementation of the rest of your network infrastructure, which is an additional source of potential problems. Every decision you make in building your infrastructure will affect the capacities and capabilities of your network. Your inability to send heavy traffic across your network can be detected using many different monitoring tools (not just those found on UNIX systems) but the source of your problem may lie elsewhere--not in the hardware of the systems that are dropping the data packets.

The majority of networks in the world today with UNIX systems attached are based on 10base-T, twisted-pair Ethernet technologies. Although 10base-T is usually thought of as a 10Mbps medium, the actual speed of the network that can be used by applications is usually significantly less. Even with the tightly integrated networking features within the UNIX operating system, it is not uncommon for UNIX hosts on 10base-T networks to see realized speeds of only 4Mbps. At first look, you might think that 4Mbps is still enough speed to function. However, that number can continue to shrink as your network grows. As actual demand begins to exceed capacity, all the users on your network will know it, even if they are running UNIX. In addition, it only takes one system on an Ethernet network to bring an entire network to a grinding halt. Continually transferring large files across the network is just one example of an activity most any user can perform that will have an impact on your network performance. Remember, as with any of the other resources on your systems, network capacity is a resource you may run out of.
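To see what the gap between nominal and realized speed means in practice, consider a back-of-the-envelope calculation. The sketch below assumes a 50-megabyte file (a figure picked purely for illustration), counts a megabyte as one million bytes, and ignores protocol overhead beyond what the realized rate already reflects.

```python
def transfer_seconds(size_megabytes, rate_mbps):
    """Seconds needed to move size_megabytes (10**6 bytes each) at rate_mbps (10**6 bits/sec)."""
    return size_megabytes * 8 / rate_mbps

FILE_MB = 50  # hypothetical file size

if __name__ == "__main__":
    for label, rate in (("nominal 10Mbps", 10), ("realized 4Mbps", 4)):
        print(f"{label}: {transfer_seconds(FILE_MB, rate):.0f} seconds")
```

Under these assumptions the same transfer takes two and a half times as long at the realized rate as the nominal figure would suggest, and that is before other users start competing for the same wire.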

UNIX users, however, have an additional advantage in the world of network monitoring. They have very powerful and informative tools built into their operating system. With a little instruction from you, users can quite easily detect capacity problems. A quick comparison of the execution times of simple commands executed locally versus the execution times of simple commands executed using rsh, for example, can indicate that your network has problems.
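The local-versus-remote comparison can be scripted. The Python sketch below times a trivial local command; the remote form is left commented out because it assumes rsh is installed and permitted on your systems, and the host name in it is only a placeholder.

```python
import subprocess
import sys
import time

def time_command(argv):
    """Run argv to completion and return the elapsed wall-clock time in seconds."""
    start = time.perf_counter()
    subprocess.run(argv, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL, check=False)
    return time.perf_counter() - start

# A trivial local command that exists wherever Python does.
local_cmd = [sys.executable, "-c", "pass"]

# Hypothetical remote equivalent; the host name is a placeholder.
# remote_cmd = ["rsh", "whitehouse", "date"]

if __name__ == "__main__":
    print(f"local: {time_command(local_cmd):.3f} seconds")
```

A single measurement proves little. Run both forms several times at different hours and compare the typical difference; a remote time that balloons during the busy part of the day points toward the network rather than the remote host.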

Before we begin talking about detailed network monitoring, let's do a quick test to make sure you are connected to a network. To do this, you need to know the name or IP address of a second machine on the network. You will use the ping command to send packets to a remote system and time how long it takes the remote system to send each one back to you. In its simplest form, ping (shown in Listing 28.1) accepts a machine name or IP address as its command line parameter. In the example in Listing 28.1, the host dole cannot seem to find whitehouse, another host on the network.

Listing 28.1. An unsuccessful ping.

dole % ping whitehouse
ping: whitehouse: Unknown host

In the event that resolution of the second system fails using a name, try using an IP address (see Listing 28.2).

In this example, dole has found whitehouse using its IP address and is sending and receiving data 100% error free. ping will keep sending data until you break the loop by pressing Ctrl+C.

Listing 28.2. A successful ping.

dole % ping 198.137.240.92
PING 198.137.240.92 (198.137.240.92): 56 data bytes
64 bytes from 198.137.240.92: icmp_seq=0 ttl=254 time=8 ms
64 bytes from 198.137.240.92: icmp_seq=1 ttl=254 time=2 ms
64 bytes from 198.137.240.92: icmp_seq=2 ttl=254 time=1 ms
64 bytes from 198.137.240.92: icmp_seq=3 ttl=254 time=2 ms
----198.137.240.92 PING Statistics----
4 packets transmitted, 4 packets received, 0% packet loss
round-trip min/avg/max = 1/3/8 ms
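If you want to check connectivity from a script rather than by eye, the summary lines at the end of the ping output are the part to look at. The following Python sketch parses them; it assumes the exact wording shown in Listing 28.2 (other versions of ping phrase the summary differently), and the function name parse_ping_summary is our own.

```python
import re


def parse_ping_summary(output):
    """Extract packet counts, loss, and round-trip times from ping's summary."""
    counts = re.search(
        r"(\d+) packets transmitted, (\d+) packets received, "
        r"(\d+(?:\.\d+)?)% packet loss",
        output,
    )
    if counts is None:
        return None  # e.g. "Unknown host", or a summary format we do not recognize
    result = {
        "sent": int(counts.group(1)),
        "received": int(counts.group(2)),
        "loss_pct": float(counts.group(3)),
    }
    rtt = re.search(r"min/avg/max = ([\d.]+)/([\d.]+)/([\d.]+)", output)
    if rtt:
        result["rtt_ms"] = tuple(float(value) for value in rtt.groups())
    return result


# The summary from Listing 28.2.
SAMPLE = """\
----198.137.240.92 PING Statistics----
4 packets transmitted, 4 packets received, 0% packet loss
round-trip min/avg/max = 1/3/8 ms
"""

summary = parse_ping_summary(SAMPLE)
print(summary)
```

A return value of None covers the failure in Listing 28.1, where ping never gets as far as printing statistics.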

Monitoring Using spray

spray is a UNIX command used to deliver a burst of data packets to another machine and report how many of the packets made the trip successfully and how long it took. Similar in scope to its little brother ping, spray can be used more effectively to monitor performance than ping because it can send more data. The results of the command, shown in Listing 28.3, will let you know whether the other machine was able to successfully receive all of the packets you sent.

In the example shown in Listing 28.3, a burst of data packets is being sent from the source machine (dole) to the destination machine (clinton).

Listing 28.3. Using spray to monitor your network.

dole % spray clinton
sending 1162 packets of lnth 86 to clinton ...
        no packets dropped by clinton
        5917 packets/sec, 508943 bytes/sec

In the example above, the destination machine (clinton) successfully returned all of the data sent to it by the source machine (dole). If clinton were under heavy load, caused either by network traffic or other intense activity, some of the data packets would not have been returned by clinton. By default, spray sends 1,162 packets of 86 bytes each.

spray supports several command line parameters, shown in Listing 28.4, that you can use to modify the count of packets sent, the length of each packet, and the delay between packets (which keeps the buffers on the source machine from being overrun). These parameters can be helpful in running tests that are more realistic.

Listing 28.4 shows the spray command used with the -c option, which delivers 1,000 packets, and the -l option, which sets each packet to 4096 bytes.

Listing 28.4. Using variable packet counts, spray can be used to more closely simulate realistic network traffic.

dole % spray -c 1000 -d 20 -l 4096 clinton
sending 1000 packets of lnth 4096 to clinton ...
        no packets dropped by clinton
        95 packets/sec, 392342 bytes/sec

Simulating network data transmissions with spray can be made more realistic by increasing the number of packets and the length of each packet. The -c option lets you increase the total count of packets sent, and the -l option lets you set the length of each packet. This can be helpful in mimicking certain transmission protocols. The -d option sets the delay between the transmission of each packet. This can be useful so that you do not overrun the network buffers on the source machine.

Listing 28.5 shows what a problem might look like. In this case, the source machine (dole) overwhelms the destination machine (carter) with data. This does not necessarily indicate that a networking problem exists; the destination machine (carter) might simply not have enough available processing power to handle the network requests.

Listing 28.5. spray can be used to isolate failing network hardware.

dole % spray -l 4096 carter
sending 1162 packets of lnth 4096 to carter ...
        415 packets (35.714%) dropped by carter
        73 packets/sec, 6312 bytes/sec

In the event that your tests with spray result in packet loss, your next step would be to take a closer look at the destination machine you have been testing. First, look for a heavy process load, memory shortage, or other CPU problems. Anything on that system that might be bogging it down can cause degraded network performance. In the event that you cannot find anything wrong with your test system that might be causing a delayed network response, sending a similar test back to your initial test machine might indicate a larger network problem. At that point, it is time to start checking your routing hardware and your infrastructure with the analysis hardware.
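When you run spray against many hosts, it can be handy to pull the drop figures out of the output automatically. The sketch below does that in Python. It assumes the output wording shown in Listings 28.3 through 28.5, and the function name parse_spray is our own.

```python
import re


def parse_spray(output):
    """Pull the host, packet count, packet length, and drop rate out of spray output."""
    sent = re.search(r"sending (\d+) packets of lnth (\d+) to (\S+)", output)
    if sent is None:
        raise ValueError("unrecognized spray output")
    # "no packets dropped by ..." simply fails to match, leaving zero drops.
    dropped = re.search(r"(\d+) packets \(([\d.]+)%\) dropped by", output)
    n_sent = int(sent.group(1))
    n_dropped = int(dropped.group(1)) if dropped else 0
    return {
        "host": sent.group(3),
        "sent": n_sent,
        "length": int(sent.group(2)),
        "dropped": n_dropped,
        "drop_pct": 100.0 * n_dropped / n_sent,
    }


# The output from Listing 28.5.
SAMPLE = """\
sending 1162 packets of lnth 4096 to carter ...
        415 packets (35.714%) dropped by carter
        73 packets/sec, 6312 bytes/sec
"""

report = parse_spray(SAMPLE)
print(f"{report['host']}: {report['drop_pct']:.3f}% of {report['sent']} packets dropped")
```

Recomputing the percentage from the raw counts, rather than trusting the printed figure, gives you a number you can compare across runs with different packet counts.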

Monitoring with netstat

The simplest way to check the network load on a particular system is to use the netstat command. When executed without any command-line parameters, the command displays a list of active sockets for each protocol. Listing 28.6 shows the output of netstat running on a Silicon Graphics Indy workstation. In this example, the host dole has several open connections. The only potentially problematic entry in the list is the connection from the system named dukakis. However, because it is a modem connection, the depth of the send queue is expected to be slightly higher.

Listing 28.6. Using netstat to monitor active network connections.

dole % netstat
Active Internet connections
Proto Recv-Q Send-Q  Local Address          Foreign Address        (state)
tcp        0      0  dole.telnet            reagan.431025          ESTABLISHED
tcp        0      0  dole.telnet            nixon.9031             ESTABLISHED
tcp        0      4  dole.telnet            dukakis.ppp.1036       ESTABLISHED
tcp       56      0  dole.1060              ftp.watergate.ftp      CLOSE_WAIT
tcp       52      0  dole.1799              news.washngtnpost.nntp CLOSE_WAIT

In the standard netstat report, the field that matters most for monitoring is Send-Q, which reports the depth of your network send queue, that is, the amount of data in bytes waiting to be sent (Recv-Q reports the same for data waiting to be received). If the send-queue numbers for connections across a particular network segment are already large and getting larger, that segment is probably inundated with excess traffic. If only single entries show high send-queue values, the problem may lie with that particular host.
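Scanning the Send-Q column is easy to automate. The following Python sketch flags connections whose send queue exceeds a threshold. It assumes the six-column layout shown in Listing 28.6; the function name busy_send_queues and the default threshold of zero are our own choices, so pick a threshold that suits your links.

```python
def busy_send_queues(netstat_output, threshold=0):
    """Return (foreign address, Send-Q) pairs for TCP connections above the threshold."""
    flagged = []
    for line in netstat_output.splitlines():
        fields = line.split()
        # Expect: Proto Recv-Q Send-Q Local-Address Foreign-Address (state)
        if len(fields) != 6 or fields[0] != "tcp":
            continue
        send_q = int(fields[2])
        if send_q > threshold:
            flagged.append((fields[4], send_q))
    return flagged


# Lines taken from Listing 28.6.
SAMPLE = """\
Active Internet connections
Proto Recv-Q Send-Q  Local Address          Foreign Address        (state)
tcp        0      0  dole.telnet            reagan.431025          ESTABLISHED
tcp        0      0  dole.telnet            nixon.9031             ESTABLISHED
tcp        0      4  dole.telnet            dukakis.ppp.1036       ESTABLISHED
tcp       56      0  dole.1060              ftp.watergate.ftp      CLOSE_WAIT
"""

flagged = busy_send_queues(SAMPLE)
print(flagged)
```

Run it a few times in succession: a queue that shows up once is rarely interesting, but one that keeps growing deserves a closer look.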

Perhaps the quickest way to determine the integrity of your network--are packets reaching their destination as quickly as they can--is by using netstat with the -i command line parameter. All of the systems attached to a particular segment of your network share it. When more than one client or server system attempts to utilize the network at the same time, a collision occurs when the packets from one machine meet the packets from another. This is actually not an uncommon condition on most networks; however, when the number of collisions becomes a significant percentage of all of the network traffic, you will start to see performance degradation. In addition to collisions caused by simultaneous transmission on the network, various other conditions might cause errors in transmission or reception of data. Faulty hubs, malfunctioning interfaces, or even electromagnetic fields from devices not physically connected to a network (such as motors for the elevators in a building) might be to blame for high rates of packet collisions. As the number of collisions and other errors on your network increases, the performance of the network degrades.

When used with the -i command-line parameter, netstat will report on how many packets each of the network interfaces in your system has sent and received and whether the number of errors and collisions is anything to be concerned about. Listing 28.7 shows a netstat report for all of the adapters in dole. As you can see, dole is on a very high-traffic network, and has seen many collisions since it was last rebooted. However, because the number of collisions is less than 2.25% of all packets sent, this network is not really saturated with traffic. If collisions were constantly a double-digit percentage of all packets sent, you would probably have a problem with network traffic.

Listing 28.7. Using netstat to monitor individual packet and error rates for each network interface.

dole % netstat -i
Name Mtu   Network     Address      Ipkts     Ierrs   Opkts      Oerrs  Coll
ec3  1500  207.221.40  dole         45223252  0       953351793  0      21402113
lo0  8304  loopback    localhost    2065169   0       2065169    0      0

Table 28.1 shows the values of the output from the netstat command.

Table 28.1. The information returned by netstat.

Value     Data
Name      The name of the network interface (naming conventions vary among UNIX)
MTU       The maximum packet size of the interface
Net/Dest  The network that the interface connects to
Address   The resolved Internet name of the interface
Ipkts     Number of incoming packets since the last time the system was rebooted
Ierrs     Number of incoming packet errors since the last time the system was rebooted
Opkts     Number of outgoing packets since the last time the system was rebooted
Oerrs     Number of outgoing packet errors since the last time the system was rebooted
Collis    Number of detected collisions

TIP: Cheat sheet for interpreting the results of a netstat -i command:

The number of input packet errors (Ierrs) should never be more than 0.25% of all input packets (Ipkts) for a particular adapter. If it is higher, you probably have a network that is extremely saturated with traffic.
If the number of output packet errors (Oerrs) is anything but 0, you may have a hardware problem with that particular adapter. You should monitor that interface a bit more carefully.
If the number of collisions (Collis) seen by a particular interface is continually greater than 5% of all the packets that it sends (Opkts), that is an indicator that your network may be on its way to becoming highly saturated with traffic. Obvious sluggishness in network performance is another indicator.

In the previous example, you see that the percentage of collisions detected on the network is less than 2.25% of all the packets sent. As discussed earlier, if collisions were continually in double digits, your network would probably be suffering from a capacity shortage. Also of note in the previous example are the ratios of input errors to input packets and of output errors to output packets. Both are insignificantly low, which is another indicator that the network that dole is attached to, as well as the system itself, is functioning normally.
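The cheat sheet translates directly into a few lines of Python. The sketch below applies its three rules to the counters of a single interface; the function name check_interface is our own, and the sample figures are those of the ec3 interface in Listing 28.7.

```python
def check_interface(ipkts, ierrs, opkts, oerrs, collis):
    """Apply the three netstat -i rules of thumb and return any warnings."""
    warnings = []
    if ipkts > 0 and 100.0 * ierrs / ipkts > 0.25:
        warnings.append("input errors above 0.25% of input packets")
    if oerrs != 0:
        warnings.append("output errors present; check the adapter hardware")
    if opkts > 0 and 100.0 * collis / opkts > 5.0:
        warnings.append("collisions above 5% of output packets")
    return warnings


# Counters for ec3, copied from Listing 28.7.
ec3 = dict(ipkts=45223252, ierrs=0, opkts=953351793, oerrs=0, collis=21402113)

collision_pct = 100.0 * ec3["collis"] / ec3["opkts"]
print(f"collision rate: {collision_pct:.2f}%")
print(check_interface(**ec3) or "no warnings")
```

Keep in mind that these counters are cumulative since the last reboot, so a single reading tells you about the long-term average rather than what is happening right now.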

If you suspect that network problems are to blame for your degraded system performance, you should repeat this command often. Active, healthy systems will have input and output packet counts that are continually on the rise. If Ipkts increases and Opkts does not, your system is probably not responding to all of the requests it receives, indicating that your system may be overloaded or is having transmission problems. If the number of input packets never increases, your system is not receiving any network data.
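Because the counters only ever grow, the interesting figure is the change between two readings. Here is a minimal Python sketch of that comparison. The function name compare_samples is our own, the first sample uses the ec3 counters from Listing 28.7, and the later samples are made-up values for illustration only.

```python
def compare_samples(before, after):
    """Classify an interface from two netstat -i readings taken some time apart."""
    delta_in = after["ipkts"] - before["ipkts"]
    delta_out = after["opkts"] - before["opkts"]
    if delta_in <= 0:
        return "not receiving any network data"
    if delta_out <= 0:
        return "receiving but not responding; possibly overloaded or unable to transmit"
    return "input and output both rising; looks healthy"


first = {"ipkts": 45223252, "opkts": 953351793}    # from Listing 28.7
second = {"ipkts": 45224100, "opkts": 953353000}   # hypothetical later reading
stalled = {"ipkts": 45224100, "opkts": 953351793}  # hypothetical: no new output

print(compare_samples(first, second))
print(compare_samples(first, stalled))
```

How long to wait between readings depends on how busy the host is; on a quiet workstation a few seconds may not be enough for either counter to move.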

When the -s command line parameter is used with the netstat command, statistical log information associated with each of the supported components of TCP/IP is displayed. In Listing 28.8, IP, ICMP, IGMP, TCP, and UDP are displayed in the output of netstat -s as it is executed on dole.

Listing 28.8. The output from the netstat -s command.

dole % netstat -s
ip:
        41887 total packets received
        0 bad header checksums
        0 with size smaller than minimum
        0 with data size < data length
        0 with header length < data size
        0 with data length < header length
        0 with bad options
        0 fragments received
        0 fragments dropped (dup or out of space)
        0 fragments dropped after timeout
        39820 packets for this host
        4240 packets recvd for unknown/unsupported protocol
        0 packets forwarded  (forwarding enabled)
        2067 packets not forwardable
        0 redirects sent
        84436 packets sent from this host
        0 output packets dropped due to no bufs, etc.
        0 output packets discarded due to no route
        36140 datagrams fragmented
        175925 fragments created
        0 datagrams that can't be fragmented
icmp:
        24 calls to icmp_error
        0 errors not generated `cuz old message was icmp
        Output histogram:
                echo reply      : 4
                destination unreachable : 24
        6 messages with bad code fields
        0 messages < minimum length
        0 bad checksums
        0 messages with bad length
        Input histogram:
                echo reply      : 8
                destination unreachable : 4230
                echo    : 4
                time exceeded   : 2
        4 message responses generated
igmp:
        0 messages received
        0 messages received with too few bytes
        0 messages received with bad checksum
        0 membership queries received
        0 membership queries received with invalid field(s)
        0 membership reports received
        0 membership reports received with invalid field(s)
        0 membership reports received for groups to which we belong
        6 membership reports sent
tcp:
        28744 packets sent
                14789 data packets (8274385 bytes)
                34 data packets (10276 bytes) retransmitted
                8351 ack-only packets (6314 delayed)
                0 URG only packets
                5 window probe packets
                3975 window update packets
                1591 control packets
        25848 packets received
                3973 pcb cache misses
                10891 acks (for 8261734 bytes)
                1048 ack predictions ok
                307 duplicate acks
                0 acks for unsent data
                20905 packets (8099738 bytes) received in-sequence
                12465 in-sequence predictions ok
                65 completely duplicate packets (37613 bytes)
                4 packets with some dup. data (174 bytes duped)
                628 out-of-order packets (423912 bytes)
                2 packets (2 bytes) of data after window
                2 window probes
                46 window update packets
                49 packets received after close
                0 discarded for bad checksums
                0 discarded for bad header offset fields
                0 discarded because packet too short
                0 discarded because of old timestamp
        826 connection requests
        50 connection accepts
        790 connections established (including accepts)
        871 connections closed (including 41 drops)
        82 embryonic connections dropped
        11120 segments updated rtt (of 11337 attempts)
        67 retransmit timeouts
                0 connections dropped by rexmit timeout
        0 persist timeouts
        387 keepalive timeouts
                0 keepalive probes sent
                1 connection dropped by keepalive
udp:
        9728 total datagrams received
        0 with incomplete header
        0 with bad data length field
        0 with bad checksum
        24 datagrams dropped due to no socket
        38 broadcast/multicast datagrams dropped due to no socket
        0 datagrams dropped due to full socket buffers
        9666 datagrams delivered
        55475 datagrams output

When looking at the overall statistics, you should pay close attention to the checksum fields. They should always show extremely small values. If they are large, that indicates that your network is handling extremely large amounts of traffic.
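With output this long, the checksum lines are easy to miss. The following Python sketch picks them out, grouped by protocol. It assumes the layout shown in Listing 28.8 (an unindented protocol name ending in a colon, followed by indented counters); the function name checksum_counters is our own, and the sample is an excerpt of that listing.

```python
def checksum_counters(netstat_s_output):
    """Map (protocol, description) to the count for every checksum-related line."""
    section = None
    found = {}
    for line in netstat_s_output.splitlines():
        stripped = line.strip()
        if not stripped:
            continue
        # Protocol headers such as "tcp:" start in the first column.
        if stripped.endswith(":") and not line[0].isspace():
            section = stripped[:-1]
            continue
        if "checksum" in stripped:
            count, description = stripped.split(None, 1)
            found[(section, description)] = int(count)
    return found


# An excerpt of Listing 28.8.
SAMPLE = """\
ip:
        41887 total packets received
        0 bad header checksums
tcp:
        25848 packets received
                0 discarded for bad checksums
udp:
        9728 total datagrams received
        0 with bad checksum
"""

counters = checksum_counters(SAMPLE)
for (protocol, description), count in counters.items():
    print(f"{protocol}: {count} {description}")
```

On dole every one of these counters is zero, which is exactly what you want to see.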

Running netstat -s on your original destination system (that is now targeted at your source system), in combination with spray on your source system, can help you determine whether data corruption caused by faulty hardware or network traffic caused by high demand is to blame for your degraded network performance. If you see similar numbers in the count of packets lost between systems, chances are you are faced with a network integrity problem. If you see packets being lost only on one system, you should begin to examine your hardware more thoroughly.

TIP: If your tests with netstat -s and spray lead you to believe that you have a network integrity problem, you can verify this by running netstat -i on your destination machine.

Monitoring with nfsstat -c

NFS is an extremely powerful tool. It gives users an easy way to share files among UNIX systems by mounting directories on remote volumes as local drives. With this power also must come increased responsibility and education. Users should be taught how to appropriately use NFS volumes. It is important to point out that things will be generally slower when accessing files across the network using NFS, especially if the file is large. Accessing the same file directly on the remote machine by using a remote login will result in an operation that will execute in a more timely fashion. However, certain functions, such as editing and copying files of a more reasonable size, are perfectly fine--that is what NFS was designed for. Your users should be very careful when using NFS and they should know how to use it appropriately. That is how you can best monitor NFS performance on your network.

For those who need an actual tool, nfsstat is just what the doctor ordered. nfsstat is used to display communication statistics between NFS clients and servers. The -c command line parameter is used when you wish to display client statistics and the -s command line parameter is used to display server statistics.

TIP: If you feel the need to reset the counters used by nfsstat, try executing the command with the -z command line parameter. This may be useful when attempting to determine when a particular error condition occurred. Remember that the NFS counters can only be reset by root.

NFS is based on synchronous remote procedure calls, commonly referred to as RPCs. With an RPC, the client making the call waits for successful completion of the operation on the server that is handling the request before continuing. If the server fails to respond to the RPC, the client will simply transmit the request again. Remember that, when dealing with network communication in general, as the communication between systems degrades due to packet collisions, traffic on the network will increase. In addition, as we all know, the more traffic that exists on the network, the slower the network will perform and the greater the possibility of even more collisions. So if the NFS re-transmission count is high, you should look for servers that are functioning under heavy loads, high collision rates that are delaying the packets sent between clients and servers, or Ethernet devices such as routers, hubs, or interfaces that are simply dropping packets. Listing 28.9 shows sample output from the nfsstat command when used with the -c option. The client data displayed shows a healthy use of NFS. Pay close attention to the ratio of time-outs to calls.

Listing 28.9. Sample output from the nfsstat command when executed with the -c parameter.

bush % nfsstat -c
Client rpc:
Calls     badcalls     retrans     badxid     timeout     wait     newcred     timers
231893    0            101         0          101         0        0           122
Client nfs:
Calls     badcalls     nclget      nclcreate
229114    0            229114      0
null      getattr      setattr     root      lookup       readlink     read
0  0%     16038  7%    1  0%       0  0%     2930  1%     0  0%        1456  0%
wrcache   write        create      remove    rename       link         symlink
0  0%     210784 92%   176  0%     1  0%     0  0%        0  0%        0  0%
mkdir     rmdir        readdir     statfs
   0%     0  0%        322  0%     12  0%

The output from the nfsstat command shows the following values:

Table 28.2. What nfsstat shows you.

Data returned by nfsstat  Description
calls                     Number of calls sent by a client
badcalls                  Number of calls rejected by the NFS service
retrans                   Number of re-transmissions made
badxid                    Number of duplicated acknowledgments received
timeout                   Number of service time-outs
wait                      Number of times no client handler was available on the server
newcred                   Number of authentications that were automatically refreshed
timers                    Number of times a time-out was reached
readlink                  Number of times a symbolic link was read/resolved on an NFS server

If the ratio of time-outs to calls is high, you may have found a problem with either an overworked NFS server or a larger problem with your network. Either the packets are delayed in reaching their destination or the NFS server is unavailable to handle the request being made by the client. In the example above of the system named bush, there are very few time-outs in relation to the number of calls being made by the client to the NFS server. As the number of time-outs grows toward 5% of all calls being made, it will become time for you to act. If the number of duplicated acknowledgments received by a client is approximately equal to the number of re-transmissions being made, the problem probably lies with an overburdened NFS server. If the number of duplicated acknowledgments received is much smaller than the number of re-transmissions being made and/or the number of time-out conditions being reached, it is logical to assume that you have a problem with your network that is causing the requests to be sent again.
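That reasoning can be captured in a short Python sketch. The function name diagnose_nfs_client is our own, and so is the 20% tolerance used to decide what "approximately equal" means; the text gives no exact figure, so treat it as an assumption and tune it to taste. Anything that is over the 5% line but not approximately equal is lumped in with network problems, which is a simplification.

```python
def diagnose_nfs_client(calls, retrans, badxid, timeout, tolerance=0.2):
    """Return (time-out percentage, verdict) from nfsstat -c client rpc counters."""
    timeout_pct = 100.0 * timeout / calls
    if timeout_pct < 5.0:
        verdict = "healthy"
    elif abs(retrans - badxid) <= tolerance * retrans:
        # Duplicate acknowledgments roughly match re-transmissions: the server
        # is answering, just too slowly.
        verdict = "server overloaded"
    else:
        # Few duplicate acknowledgments: requests or replies are being lost.
        verdict = "network problem"
    return timeout_pct, verdict


# Client rpc counters for bush, from Listing 28.9.
pct, verdict = diagnose_nfs_client(calls=231893, retrans=101, badxid=0, timeout=101)
print(f"{pct:.3f}% of calls timed out: {verdict}")
```

For bush the time-out rate is a small fraction of one percent, so the badxid comparison never comes into play.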

Summary

Monitoring your network from other operating systems can be a little bit more of a challenge. Networking is not as "integrated" into most other operating systems as it is under Windows NT and UNIX. The standard distributions of most other operating systems do not contain protocol stacks, nor do they contain analysis tools. Therefore, it may be more difficult for you to find software tools to deliver the analysis features you are looking for.

There are many commercial and public domain monitoring tools available for MS-DOS, Microsoft Windows, IBM OS/2, or the Apple MacOS that will track the number of packets sent and received. Additionally, there are tools that report on the percentage of output packets that result in a collision as well as tools to historically track the results of network performance.

America Online users can find many such network utilities in the Computing & Software channel, or you can search for them directly at Keyword: FileSearch. If you do not have access to America Online, or would rather find things on the Internet, c|net maintains a large database of links to many large, popular FTP sites around the world at http://www.shareware.com and http://www.download.com.
