High-Performance Networking Unleashed

- 9 -

Routers

by Martin Bligh

The simplest of networks can be imagined as a one-wire bus (see Figure 9.1), where each computer can talk to any other by sending a packet out onto that bus.

FIGURE 9.1. The simplest of networks.

But as you increase the number of computers on the network, this becomes impractical. There are several main problems:

1. You run out of bandwidth on the network.

2. Each computer wastes more time processing unnecessary broadcast traffic.

3. The network becomes unmanageable. Any fault can bring the whole network down.

4. Each computer can listen to any other computer's conversation.

Segmenting the network helps solve all of these problems. But if you break the network into separate segments, you must provide a mechanism for hosts on different segments to communicate. This normally involves selectively passing data between segments at some layer of the OSI network stack. Let's look again at the network stack (see Figure 9.2) to see where routers fit in.

Routers operate at the network layer. This chapter assumes that the network layer is IP (version 4), as this is by far the most popular protocol. The concepts involved are similar to those behind other network layer protocols.

Routing Versus Bridging

Routing is a higher-level concept than layer 2 switching/bridging--you are further removed from the physical details of the network. Any machine on a routed network has the same network layer address format (for example, an IP address) whether it is communicating over an Ethernet, Token Ring, FDDI, or WAN link.

Data link layer addresses (for example, MAC addresses) are just unique identifying tags for a particular network interface within a particular layer 3 network (they may also be globally unique--for example, Ethernet addresses). Network layer addresses usually hold more information than this--they consist of two parts: a network address and a host address. (For example, the IP address of my network interface card is 158.84.81.39, the network address is 158.84.81, and the host address is 39.)

FIGURE 9.2. Where connecting devices fit into the OSI stack.

NOTE: Here the network address is taken to mean the whole of the number specifying the network (that is, including any subnet address).

A bridge can only connect networks with the same (or very similar) data link layer protocols. A router transcends this problem. It can connect any two networks, provided that the hosts use the same network layer protocol.

Connecting the Network Layer to the Data Link Layer

Underlying the network layer is the data link layer. For the layers to interoperate, they need some "glue" protocols. ARP (Address Resolution Protocol) is used to map network layer (layer 3) addresses to data link layer (layer 2) addresses (see the description in the following section). RARP (Reverse Address Resolution Protocol) is used to map layer 2 addresses to layer 3 addresses.

The most common use of ARP is to resolve IP addresses, though the protocol is defined in such a way that it is independent of the network layer protocol. The most common data link layer is Ethernet. Accordingly, the examples in the ARP and RARP sections are based on IP and Ethernet, though the concepts are identical for use with other protocols.

Address Resolution Protocol

Network layer addresses are an abstract mapping defined by the network administrator--the network layer doesn't have to worry which data link layer it is running over. However, network interfaces can only communicate with each other according to the layer 2 address, which is dependent on the network type. These layer 2 (hardware) addresses are derived from the layer 3 address by the Address Resolution Protocol (ARP).

An ARP request is not necessary for every datagram sent--the responses are cached in the local ARP table, which keeps a list of <IP address, hardware address> pairs. This keeps the number of ARP packets on the network very low. ARP is a simple, low-maintenance protocol that raises few problems; trouble is normally seen only when there are conflicting layer 3 addresses on the network.

Overview

If interface A wants to send a datagram to interface B, and it only has the IP address for B (B-IP), it must first find the hardware address for B (B-hard). Interface A sends an ARP broadcast specifying B-IP and requesting B-hard. Interface B receives the broadcast, and replies with a unicast to A, giving the correct B-hard for B-IP (see Figure 9.3).

FIGURE 9.3. An ARP exchange.

Note that only interface B responds to the request, even though other interfaces on the network may have the relevant information. This ensures that responses are correct and do not provide out-of-date information.

It is important to understand that ARP requests are sent out for the next hop, which is not always the destination IP address. Thus, if interface A wants to send a datagram to interface B, but its routing table tells it that traffic must pass through router C, it sends out an ARP request for router C's address, not for interface B's address (see Figure 9.4).
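The choice of which address to ARP for can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: the host has a single local subnet and a single gateway, and the function name arp_target and the sample addresses are invented for the example.

```python
import ipaddress

def arp_target(dest_ip, local_network, gateway_ip):
    """Return the IP address that should be resolved with ARP for dest_ip."""
    dest = ipaddress.ip_address(dest_ip)
    network = ipaddress.ip_network(local_network)
    # A destination on the local subnet is resolved directly;
    # anything else is sent to the gateway, so the gateway is resolved instead.
    if dest in network:
        return str(dest)
    return gateway_ip

print(arp_target("201.66.37.9", "201.66.37.0/24", "201.66.37.254"))
print(arp_target("73.4.5.6", "201.66.37.0/24", "201.66.37.254"))
```

The first call resolves the destination itself; the second resolves the gateway, because the destination lies outside the local subnet.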

What Happens When an ARP Packet Is Received?

The flowchart in Figure 9.5 details the process followed when an ARP packet is received. Note that the <IP address, hardware address> pair of the sender is inserted into the local ARP table in addition to a reply being sent; if A wishes to talk to B, then it is very likely that B also needs to talk to A.
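The receive logic can be approximated in Python as follows. This is a sketch based on the packet reception algorithm in RFC 826, assuming the hardware type and protocol type checks have already passed; the ArpPacket class, the receive_arp function, and the dictionary used as the ARP table are names chosen for illustration only.

```python
from dataclasses import dataclass

REQUEST, REPLY = 1, 2  # ARP opcode values

@dataclass
class ArpPacket:
    op: int
    sender_ip: str
    sender_hw: str
    target_ip: str
    target_hw: str = ""

def receive_arp(packet, my_ip, my_hw, table):
    """Process one received ARP packet; return a reply packet or None."""
    merged = False
    # Any station that already knows the sender refreshes its entry.
    if packet.sender_ip in table:
        table[packet.sender_ip] = packet.sender_hw
        merged = True
    # Only the station that owns the target address goes any further.
    if packet.target_ip != my_ip:
        return None
    # The target learns the sender's pair even if it had no entry before.
    if not merged:
        table[packet.sender_ip] = packet.sender_hw
    if packet.op == REQUEST:
        # Unicast reply: our own pair goes in the sender fields.
        return ArpPacket(REPLY, my_ip, my_hw, packet.sender_ip, packet.sender_hw)
    return None
```

Note that a station which is not the target never creates a new entry; it only updates one it already holds.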

FIGURE 9.4. An ARP exchange through a router.

FIGURE 9.5. Receipt of an ARP packet (constructed from information in RFC826).

IP Address Conflicts

The most commonly seen error produced by ARP is caused by a conflicting IP address. This is where two different stations claim to own the same IP address--IP addresses must be unique on any connected set of networks.

IP address conflicts are apparent when two replies come in answer to an ARP request--each reply specifying a different hardware address. This is a serious error, with no easy solution--which hardware address do you send the datagrams to?

To avoid IP address conflicts, when interface A is first initialized it sends out an ARP request for its own IP address. If no response is sent back, interface A can assume that the IP address is not in use. However, suppose interface B is already using the IP address in question: B sends an ARP reply with the hardware address B-hard. Interface A now knows that the IP address is already in use--it must not use the address and must flag an error.

There is still a problem though. Suppose that host C had an entry for the disputed IP address, mapping it to B-hard. Looking at Figure 9.5, you see that on receipt of the ARP broadcast from interface A, host C updates its ARP table to map the address to A-hard. To correct such errors, interface B (the "defending" system) sends out an ARP request broadcast for the IP address again. Host C now updates its ARP entry for the disputed IP address to B-hard again. The network state is now back as before, but host C may have sent IP datagrams intended for host B to host A by mistake while the ARP table was (briefly) incorrect. This is unfortunate, but as IP does not guarantee delivery, this situation does not cause major problems.

Managing the ARP Cache Table

The ARP cache table is a list of <IP address, hardware address> pairs, indexed by IP address. The table can often be managed with the arp command. Common syntax for this command includes

Add a static entry to the cache table--arp -s <IP address> <hardware address>
Delete an entry from the cache table--arp -d <IP address>
Display all entries in the cache table--arp -a

Dynamic entries in the ARP cache table (that is, those that have not been manually added with arp -s) are normally deleted after a period of time. This period is determined by the specific TCP/IP implementation, but an entry would commonly be destroyed if unused for a fixed time period (for example, five minutes).
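The aging behavior can be modeled with a small Python class. This is a minimal sketch, not a description of any particular TCP/IP stack: the ArpCache name, the 300-second default, and the injectable clock parameter are assumptions made for the example.

```python
import time

class ArpCache:
    """<IP address, hardware address> pairs; dynamic entries expire when unused."""

    def __init__(self, timeout=300.0, clock=time.monotonic):
        self.timeout = timeout
        self.clock = clock
        # Maps IP -> (hardware address, last-used time); the time is None for static entries.
        self.entries = {}

    def add(self, ip, hw, static=False):
        self.entries[ip] = (hw, None if static else self.clock())

    def delete(self, ip):
        self.entries.pop(ip, None)

    def lookup(self, ip):
        entry = self.entries.get(ip)
        if entry is None:
            return None
        hw, last_used = entry
        if last_used is None:
            return hw  # static entries never expire
        if self.clock() - last_used > self.timeout:
            del self.entries[ip]  # unused for too long: destroy the entry
            return None
        self.entries[ip] = (hw, self.clock())  # using an entry resets its timer
        return hw
```

Passing a fake clock makes the expiry easy to exercise without waiting five minutes.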

Use of a Static ARP Address

One typical use of a static ARP entry is to set up a standalone print server. These small units can normally be configured by way of Telnet, but first they need an IP address. There is no obvious way to feed them this initial information, except by using the built-in serial port. However, it is often inconvenient to find an appropriate terminal and serial cable, set up baud rates, parity settings, and so on.

Suppose we wish to set up a print server, P, with an IP address of P-IP, and that we know the print server's hardware address to be P-hard. A static ARP entry is created on workstation A to map P-IP to P-hard. Any IP traffic from workstation A to P-IP is now sent to P-hard, though the print server does not yet know its IP address. We can now telnet to P-IP, which connects to the print server, and configure its IP address. Then we can tidy up by deleting the static ARP entry. (See Figure 9.6.)

FIGURE 9.6. Using a static ARP address to set up a print server.

It is often useful to configure the print server on one subnet, but use it on another. This is easy to achieve by a very similar process to that illustrated in the preceding figure. The IP address of the print server on the subnet it uses is P-IP. Allocate a temporary IP address, T-IP, on the subnet that you wish to configure the print server on, and attach the print server to that subnet. On a workstation (A) connected to the configuring subnet, create the static ARP entry mapping T-IP onto P-hard, and telnet to T-IP. Configure the print server to use IP address P-IP. Move the print server to the subnet it will be used on, and tidy up by deleting the static ARP entry. (See Figure 9.7.)

FIGURE 9.7. Setting up a print server using a temporary IP address.

Proxy ARP

It is possible to avoid configuring the routing tables on every host by using proxy ARP. This is particularly useful where subnetting is being used, but not all hosts are capable of understanding subnetting (see the section on subnetting later on in this chapter).

The basic idea is that a workstation sends out ARP requests even for machines that are not on its own subnet. The ARP proxy server (often the gateway) responds with the hardware address of the gateway. See Figure 9.8, where proxy ARP is used, and compare it to Figure 9.4, where routing tables are used.

Proxy ARP makes the management of host configurations much simpler. However, it increases network traffic (though not significantly) and potentially requires a much larger ARP cache. An entry for each IP address off the local subnet is created, all mapping to the gateway's hardware address.
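The decision made on the wire can be sketched as follows. This is a simplified model, assuming the gateway answers for every off-subnet address (a real proxy would normally check that it has a route first); the function name proxy_arp_reply and the sample addresses are invented for the example.

```python
import ipaddress

def proxy_arp_reply(target_ip, local_network, gateway_hw, local_hosts):
    """Return the hardware address given in answer to an ARP request, or None."""
    target = ipaddress.ip_address(target_ip)
    if target in ipaddress.ip_network(local_network):
        # On the local subnet only the real owner answers, if there is one.
        return local_hosts.get(target_ip)
    # Off the local subnet the gateway answers with its own hardware address.
    return gateway_hw

hosts = {"201.66.37.9": "00:00:00:00:00:09"}
print(proxy_arp_reply("201.66.37.9", "201.66.37.0/24", "00:00:00:00:00:fe", hosts))
print(proxy_arp_reply("73.4.5.6", "201.66.37.0/24", "00:00:00:00:00:fe", hosts))
```

Every off-subnet address resolved this way adds one more cache entry pointing at the gateway, which is why the ARP cache can grow large.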

FIGURE 9.8. A workstation using proxy ARP.

In the eyes of a workstation using proxy ARP, the world is just one large physical network, with no routers in sight!

IP Addressing

In a routable network layer protocol, the protocol address must hold two pieces of information: the network address and the host address. The most obvious way to store this information is in two separate fields. We must cope with the largest possible case in both fields, perhaps allocating 16 bits for each field. Some protocols (such as IPX) behave like this, and it works well for small- to medium-sized networks.

Another solution would be to keep the host address field small, perhaps allocating 24 bits for the network address and just 8 for the host address. This would allow plenty of networks, but not many hosts on each network. However, for networks with more than 256 (2⁸) hosts, you could allocate multiple addresses. The problem with this scheme is that the large number of networks created tends to place an intolerable load on the network's routers.

IP packs the network address and host address together into one 32-bit field. Sometimes the host address portion is short, sometimes it is long. This allows very efficient use of the address space, keeping IP addresses short, and the total number of networks fairly low. There are two different ways of splitting the address back into its two parts--class-based addressing and classless addressing. These are discussed in the following section.

Hosts Versus Gateways

The distinction between hosts and gateways often causes some confusion. This is because of a shift in the meaning of the term "host." As defined by the original RFCs (1122/3 and 1009):

A host is a device connected to one or more networks. It can send and receive traffic on any of these networks, but it never passes traffic from one network to another.
A gateway is a device connected to more than one network. It selectively forwards traffic from one network to another.

In other words, the terms host and gateway used to be mutually exclusive--computers were not generally powerful enough to act as both a host and a gateway. The host was a computer that a user did some work on, or that perhaps acted as a file server. Modern computers are powerful enough to both act as a gateway and do useful work for a user; therefore, a more modern definition of a host might be the following:

A host is a device connected to one or more networks. It can send and receive traffic on any of these networks. It may function as a gateway, but this is not its sole purpose.

A router is a dedicated gateway. The hardware is specially designed to allow the router to pass high volumes of traffic with little delay (latency) for each packet. However, a gateway can also be a standard computer with multiple network interfaces, where the operating system's network layer allows it to forward packets. Now that dedicated routing hardware is becoming less expensive, the use of computers as gateways is becoming much less common. At a very small site with only a cheap dial-up connection, a user's computer might be used as a nondedicated gateway.

Class-Based Addressing

When IP was first designed, the address was split into its component parts according to the first byte of the address:

0: Reserved (for the network address)

1-126: Class A (network: 1 byte, host: 3 bytes)

127: Reserved (for the loopback address)

128-191: Class B (network: 2 bytes, host: 2 bytes)

192-223: Class C (network: 3 bytes, host: 1 byte)

224-255: Reserved (see the note that follows)

NOTE: Part of this range is for multicast addresses, sometimes referred to as class D addresses. For the sake of simplicity, these are not discussed here.

If you needed a large network you were given a class A address, but if you only had a few hosts, you were given a class C address. A few examples:

IP address        Network address   Host address
56.81.38.28       56                81.38.28
137.89.15.88      137.89            15.88
200.77.32.61      200.77.32         61
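The class rules are mechanical enough to express as a short Python function. This is a sketch of the first-byte rule described above; the function name split_classful is an assumption, and reserved ranges are simply rejected.

```python
def split_classful(address):
    """Split a dotted IP address into (class, network part, host part)."""
    octets = address.split(".")
    first = int(octets[0])
    if 1 <= first <= 126:
        cls, network_bytes = "A", 1
    elif 128 <= first <= 191:
        cls, network_bytes = "B", 2
    elif 192 <= first <= 223:
        cls, network_bytes = "C", 3
    else:
        # 0, 127, and 224-255 are reserved and have no network/host split here.
        raise ValueError(f"{address} is in a reserved range")
    return cls, ".".join(octets[:network_bytes]), ".".join(octets[network_bytes:])

for ip in ("56.81.38.28", "137.89.15.88", "200.77.32.61"):
    print(ip, split_classful(ip))
```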

Subnetting

Although the class-based addressing system worked well for the Internet service provider, it was impossible to do any routing inside a network. The intention was that a network would use layer 2 (bridging/switching) to direct packets within a network. The lack of routing was a particular problem if you had a large class A network, as bridging/switching on a large network becomes very difficult to manage.

The logical solution is to break down some larger networks into smaller segments, but this was not possible within the original confines of the class-based addressing system. In the previous example, the network address 137.89 is treated as a class B address, so it is not possible to route different parts of this network to different sites.

To solve this problem, a new field called a subnet mask was introduced and associated with every address. The subnet mask indicated which portion of the address was the network address, and which was the host address (instead of deciding by the first byte).

In the subnet mask, binary 1 indicates a network address bit, and binary 0 indicates a host address bit. Thus for the 137.89.15.88 example given earlier, the format would be:

Address:          10001001 . 01011001 . 00001111 . 01011000 (137. 89. 15. 88)
Subnet mask:      11111111 . 11111111 . 00000000 . 00000000 (255.255.  0.  0)

The subnet mask given indicates that the first two bytes are the network address and the last two bytes are the host address. Thus the traditional address classes have the following subnet masks:

Class A (network: 8 bits, host: 24 bits): 255.0.0.0
Class B (network: 16 bits, host: 16 bits): 255.255.0.0
Class C (network: 24 bits, host: 8 bits): 255.255.255.0

If you wish to use the 137.89.0.0 network address as a set of distinct class C-sized networks, the following subnet mask would be used:

Address:          10001001 . 01011001 . 00001111 . 01011000 (137. 89. 15. 88)
Subnet mask:      11111111 . 11111111 . 11111111 . 00000000 (255.255.255.  0)

Breaking a network into subnetworks using a longer subnet mask (for example, 255.255.255.0 instead of 255.255.0.0) is called subnetting. Be aware that some very old software won't support subnetting, as it doesn't understand subnet masks. For instance UNIX's routed routing daemon normally uses a routing protocol called RIP version 1, which was designed before subnet masks.
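The split performed by a subnet mask is a bitwise AND, which can be tried out in Python. This is a minimal sketch using the standard ipaddress module only to convert between dotted notation and integers; the function name split_with_mask is invented for the example.

```python
import ipaddress

def split_with_mask(address, mask):
    """Return (network address, host part) of address under the given mask."""
    addr = int(ipaddress.IPv4Address(address))
    m = int(ipaddress.IPv4Address(mask))
    network = ipaddress.IPv4Address(addr & m)                # keep the bits where the mask is 1
    host = ipaddress.IPv4Address(addr & ~m & 0xFFFFFFFF)     # keep the bits where the mask is 0
    return str(network), str(host)

print(split_with_mask("137.89.15.88", "255.255.0.0"))
print(split_with_mask("137.89.15.88", "255.255.255.0"))
```

With the class B mask the network is 137.89.0.0; with the longer mask the same address falls in the subnet 137.89.15.0.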

Non-Byte-Aligned Subnetting

So far, I have only discussed subnet masks of 255.0.0.0, 255.255.0.0, and 255.255.255.0. These are referred to as byte-aligned subnet masks, because they split the network and host portions on a byte boundary. However, it is also possible (though slightly more difficult to work with) to split the address inside a byte (using non-byte-aligned subnet masks), for instance:

Address:          10001001 . 01011001 . 00001111 . 01011000 (137. 89. 15. 88)
Subnet mask:      11111111 . 11111111 . 11111111 . 11110000 (255.255.255.240)

Now you have only 4 bits for the host address, giving 16 possible addresses within the network. One address is reserved for the network itself, and one for the broadcast address, leaving a possible 14 hosts.

Address:          10001001 . 01011001 . 00001111 . 01011000 (137. 89. 15. 88)
Subnet mask:      11111111 . 11111111 . 11111111 . 11110000 (255.255.255.240)
Network addr:     10001001 . 01011001 . 00001111 . 01010000 (137. 89. 15. 80)
Broadcast addr:   10001001 . 01011001 . 00001111 . 01011111 (137. 89. 15. 95)

Note that the network address (137.89.15.80) no longer ends in the familiar 0 and the broadcast address (137.89.15.95) no longer ends in the familiar 255. Although they don't look like the familiar network and broadcast addresses, they are formed in exactly the same way (setting the host portion of the address to all 0s or all 1s, respectively). The subnet mask also looks unfamiliar, but it, too, is formed in exactly the same way.
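These values can be checked with Python's standard ipaddress module. The helper below is only a convenience wrapper for the example (the name subnet_details is an assumption), and the usable-host count assumes a subnet with at least two host bits.

```python
import ipaddress

def subnet_details(address, mask):
    """Return (network address, broadcast address, usable host count)."""
    network = ipaddress.ip_interface(f"{address}/{mask}").network
    # Subtract the all-0s (network) and all-1s (broadcast) host values.
    usable = network.num_addresses - 2
    return str(network.network_address), str(network.broadcast_address), usable

print(subnet_details("137.89.15.88", "255.255.255.240"))
```

For the address and mask used above this gives 137.89.15.80, 137.89.15.95, and 14 usable hosts.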

This extension of the subnetting technique makes new sizes of network possible, including tiny networks for point-to-point links (mask 255.255.255.252, giving 30 bits network, and 2 bits host), or medium-sized networks (for example, mask 255.255.240.0, giving 20 bits network, and 12 bits host--4,096 possible addresses).

Humans would find the subnet mask system a lot easier to understand (and to read on a day-to-day basis) if the mask was just represented as the number of bits that were allocated to the host portion of the address (for instance, in the preceding example the mask 255.255.240.0 would be 12). Unfortunately, history has handed us a system where the representation makes it easy for computers to do an AND operation, not for us to read. You can soon learn to think in binary though!

Writing subnet masks in hexadecimal (base 16) rather than decimal (base 10) can also help greatly--a subnet mask of FF.FF.F0.00 is easier to work with than 255.255.240.0, as it is easier to convert between hexadecimal and binary than between decimal and binary.
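Converting a mask between these notations is easy to automate. The following Python sketch shows the hexadecimal and binary forms together with the network and host bit counts; the function name mask_forms is invented for the example.

```python
import ipaddress

def mask_forms(mask):
    """Return (hex form, binary form, network bits, host bits) for a dotted mask."""
    packed = ipaddress.IPv4Address(mask).packed            # the four bytes of the mask
    hex_form = ".".join(f"{byte:02X}" for byte in packed)
    binary_form = " . ".join(f"{byte:08b}" for byte in packed)
    network_bits = bin(int(ipaddress.IPv4Address(mask))).count("1")
    return hex_form, binary_form, network_bits, 32 - network_bits

print(mask_forms("255.255.240.0"))
```

Each hexadecimal digit corresponds to exactly four binary digits, which is why F0 is easier to read as 11110000 than 240 is.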

If you are using DNS (domain name system) to map between host names and IP addresses, be slightly wary of non-byte-aligned subnet masks. They may prevent you from delegating control of the reverse lookup records, used to map from IP addresses to host names. DNS is designed to allow only a delegation split on an IP address byte boundary (inside the in-addr.arpa. domain).

Supernetting

Supernetting is a very similar concept to subnetting--the IP address is split into separate network address and host address portions according to the subnet mask. However, instead of breaking down larger networks into several smaller subnets, you group smaller networks together to make one larger supernet.

Imagine that I am given a bank of 16 class C networks, ranging from 201.66.32.0 to 201.66.47.0--my whole network can be addressed as 201.66.32.0 with a subnet mask of 255.255.240.0 (any address on the network has the same initial 20 bits as 201.66.32.0--to address the network, set the host address to 0).

Unfortunately it's not possible to allocate totally arbitrary groups of addresses--a range of 16 class C networks from 201.66.71.0 to 201.66.86.0 doesn't have a single network address (try to find one!). Why is this? Given the required subnet mask of 255.255.240.0, the host portion of the beginning of the address range is not 0:

Address           Subnet mask       Network address   Host address
201.66.32.0       255.255.240.0     201.66.32         0.0
201.66.71.0       255.255.240.0     201.66.64         7.0

Fortunately this isn't a real problem--given a sensible address allocation strategy, you can find suitable blocks of addresses.
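Whether a block of class C networks forms a single supernet can be tested with the standard ipaddress module, whose collapse_addresses function merges adjacent networks wherever the bit boundaries allow. The helper name aggregate_class_c and its parameters are assumptions made for this sketch.

```python
import ipaddress

def aggregate_class_c(first, count, prefix="201.66"):
    """Collapse `count` consecutive class C networks starting at prefix.first.0."""
    networks = [
        ipaddress.ip_network(f"{prefix}.{first + i}.0/24") for i in range(count)
    ]
    # collapse_addresses merges neighbors only when they share an aligned block.
    return [str(network) for network in ipaddress.collapse_addresses(networks)]

print(aggregate_class_c(32, 16))
print(aggregate_class_c(71, 16))
```

The range starting at 201.66.32.0 collapses to the single network 201.66.32.0/20 (mask 255.255.240.0), while the range starting at 201.66.71.0 cannot be reduced to one entry.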

Variable Length Subnet Masks (VLSM)

If you want to split your network into multiple subnetworks of unequal size, you can use a variable length subnet mask (VLSM). This slightly intimidating acronym just means that each of your subnetworks can have a different length subnet mask. If you were splitting a company's network by department, some networks might have 255.255.255.0 (for most departments), while others might have 255.255.252.0 (for a particularly large department).

Classless Addressing (CIDR)

As the Internet has taken off, the number of hosts attached to the network has grown beyond all expectations. Although there are still far fewer than 2³² hosts connected directly to the Internet, there is a shortage of addresses. RFC 1519, Classless Inter-Domain Routing (CIDR), was published in 1993 in an attempt to address inefficiencies in the allocation of the address space.

CIDR is an attempt to extend the life span of IP v4; it does not address the eventual exhaustion of the whole address space. IP v6 addresses the eventual exhaustion of the address space by using a 128-bit address rather than a 32-bit address. However, implementing IP v6 is a mammoth task, which the Internet is not yet ready for. CIDR gives us time to implement IP v6.

The class-based address system worked well. It provided a reasonable compromise between efficient address usage and a low number of networks for routers to cope with. However, two major problems were caused by the unexpected growth of the Internet:

The increased number of allocated networks meant that the number of entries in routing tables became unmanageably large, and slowed down routers considerably.
Much of the address space was being wasted by the allocation policy--allocating inflexible blocks of 256 = 2⁸ (class C), 65536 = 2¹⁶ (class B), or 16777216 = 2²⁴ (class A) resulted in many wasted addresses. This has resulted particularly in a shortage of class B addresses.

To solve the second problem, it is possible to allocate multiple smaller networks instead of one larger network: for instance, multiple class C networks instead of one class B. Although this results in much more efficient address allocation, it exacerbates the growth of routing tables (the first problem).

Under CIDR, addresses are assigned according to the topology of the network. This means that a consecutive group of network addresses would be allocated to a particular service provider, allowing the whole group to be covered by one (probably supernetted) network address.

For example, a service provider is given a bank of 256 class C networks, ranging from 213.79.0.0 to 213.79.255.0. The service provider allocates one class C address to each customer, but routing tables external to the service provider know all of these routes by just one entry in the routing table--213.79.0.0 with a network mask of 255.255.0.0.

This method obviously greatly reduces the growth of routing tables for each new address that is allocated. Estimates given by the authors of the CIDR RFC (1519) indicate that if 90% of service providers used CIDR, routing tables might grow by 54% over a 3-year period, as opposed to 776% growth without CIDR (these figures assume CIDR is not in place at the start of the period).

If renumbering the existing address space were possible, the number of advertised routes that the Internet backbone routers had to deal with could be massively reduced. Unfortunately, this is unlikely to be practical, due to the huge amount of administrative effort involved.

Routing Tables

If a host has several network interfaces, how does it decide which interface to use for packets to a particular IP address? The answer lies in the routing table. Consider the following routing table:

Destination       Subnet mask       Gateway           Flags   Interface
201.66.37.0       255.255.255.0     201.66.37.74      U       eth0
201.66.39.0       255.255.255.0     201.66.39.21      U       eth1

The host sends all traffic for hosts on network 201.66.37.0 (for example, host addresses 201.66.37.1-201.66.37.254) out through interface eth0 (which has IP address 201.66.37.74), and all traffic for hosts on network 201.66.39.0 out through interface eth1 (which has IP address 201.66.39.21). The flag U just means that the route is "up" (that is, active).

Note that, for directly attached networks, some software doesn't give the IP address of the interface in the gateway field as shown. The name of the interface alone is listed.

This example only covers hosts that are connected directly to you--what if the host in question is on a remote network? If you are connected to network 73.0.0.0 by way of a router with an IP address of 201.66.37.254, you can add an entry to the routing table:

Destination       Subnet mask       Gateway           Flags   Interface
73.0.0.0          255.0.0.0         201.66.37.254     UG      eth0

This tells the machine to route packets for any hosts on the 73.0.0.0 network through 201.66.37.254--note that there must be another entry in the table, telling the host how to send packets to 201.66.37.254! The G (gateway) flag just means that this routing entry directs traffic through an external gateway. Similarly, a route to a specific host through a gateway can be added, and it receives the H (host) flag:

Destination       Subnet mask       Gateway           Flags   Interface
91.32.74.21       255.255.255.255   201.66.37.254     UGH     eth0

This example covers all the basics of the routing table, apart from a few special entries:

Destination       Subnet mask       Gateway           Flags   Interface
127.0.0.1         255.255.255.255   127.0.0.1         UH      lo0
default           0.0.0.0           201.66.39.254     UG      eth1

The first of these is the loopback interface, for traffic from the host to itself. This is used for testing, and for communications for applications that are designed to operate over IP but that happen to be communicating locally. It is a host route to the special address 127.0.0.1 (the interface lo0 refers to a "fake" network card internal to the IP stack).

The second entry is more interesting. To save having a route defined on the host to every possible network on the Internet, a default route can be defined. If no other entry in the routing table matches the destination address, the packet is sent to the default gateway (given in the default route).

Most hosts in a simple setup are connected by way of one interface card to a LAN, which has only one router to other networks. This results in just three entries in the routing table: the loopback entry, the entry for the local subnet, and the default entry (pointing to the router).

Overlapping Routes

Suppose you have entries in the routing table that overlap:

Destination       Subnet mask       Gateway           Flags   Interface
1.2.3.4           255.255.255.255   201.66.37.253     UGH     eth0
1.2.3.0           255.255.255.0     201.66.37.254     UG      eth0
1.2.0.0           255.255.0.0       201.66.39.253     UG      eth1
default           0.0.0.0           201.66.39.254     UG      eth1

The routes are said to be overlapping because all four include the address 1.2.3.4. So if I send a packet to 1.2.3.4, which route is chosen? In this case, it is the first route, through gateway 201.66.37.253; the route with the longest (most specific) subnet mask is always chosen. Similarly, packets to 1.2.3.5 are sent by the second route in preference to the third.

IMPORTANT: This rule applies only to indirect routes (those routing packets through gateways). Having two interfaces defined on the same subnet is not legal in many implementations of software. The following setup is normally illegal (though some software will attempt to load-balance over the two interfaces):

Interface   IP address        Subnet mask
eth0        201.66.37.1       255.255.255.0
eth1        201.66.37.2       255.255.255.0

The policy on overlapping routes is extremely useful; it allows the default route to work as just a route with a destination of 0.0.0.0, and a subnet mask of 0.0.0.0, rather than having to implement it as a special case for the routing software.
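The longest-mask rule can be sketched in Python using the overlapping routes listed earlier in this section. This is an illustration rather than the way any particular IP stack is implemented; the ROUTES list and the lookup function are names chosen for the example, and the default route is written as destination 0.0.0.0 with mask 0.0.0.0 so that it needs no special handling.

```python
import ipaddress

ROUTES = [
    # (destination, subnet mask, gateway, interface)
    ("1.2.3.4", "255.255.255.255", "201.66.37.253", "eth0"),
    ("1.2.3.0", "255.255.255.0", "201.66.37.254", "eth0"),
    ("1.2.0.0", "255.255.0.0", "201.66.39.253", "eth1"),
    ("0.0.0.0", "0.0.0.0", "201.66.39.254", "eth1"),  # the default route
]

def to_int(dotted):
    return int(ipaddress.IPv4Address(dotted))

def lookup(dest_ip, routes=ROUTES):
    """Return (gateway, interface) of the matching route with the longest mask."""
    dest = to_int(dest_ip)
    # A route matches when the destination ANDed with the mask equals the route's destination.
    matches = [r for r in routes if (dest & to_int(r[1])) == to_int(r[0])]
    if not matches:
        raise LookupError(f"no route to {dest_ip}")
    # For contiguous masks, a longer mask is also a numerically larger one.
    best = max(matches, key=lambda r: to_int(r[1]))
    return best[2], best[3]

for ip in ("1.2.3.4", "1.2.3.5", "1.2.9.9", "8.8.8.8"):
    print(ip, lookup(ip))
```

Because the 0.0.0.0 mask matches every destination but is the shortest possible, the default route is chosen only when nothing more specific applies.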

Looking back at CIDR, let's take the preceding example where a service provider is given a bank of 256 class C networks, ranging from 213.79.0.0 to 213.79.255.0. Routing tables that are external to the service provider know all of these routes by just one entry in the routing table--213.79.0.0, with a network mask of 255.255.0.0.

But suppose that one customer moves to a different service provider. The customer had the network address 213.79.61.0, but must he now get a new network address from the new service provider's range of addresses? That would mean renumbering every machine in the organization, changing every DNS entry, and so on, and so on.

Fortunately, there is an easy solution. The old service provider keeps the route 213.79.0.0 (with subnet mask 255.255.0.0), while the new service provider advertises the route 213.79.61.0 (with subnet mask 255.255.255.0). As the new route has a longer subnet mask than the old service provider's route, it overrides the old route.

Static Routing

Looking back at the routing table that you have been building up, there are now six entries in it. These are listed next, and a diagram of the network is given in Figure 9.9:

Destination       Subnet mask       Gateway           Flags   Interface
127.0.0.1         255.255.255.255   127.0.0.1         UH      lo0
201.66.37.0       255.255.255.0     201.66.37.74      U       eth0
201.66.39.0       255.255.255.0     201.66.39.21      U       eth1
default           0.0.0.0           201.66.39.254     UG      eth1
73.0.0.0          255.0.0.0         201.66.37.254     UG      eth0
91.32.74.21       255.255.255.255   201.66.37.254     UGH     eth0

NOTE: A routing table can usually be listed like this by using the command netstat -rn. See your vendor's documentation for netstat.

How did these entries get there? The first one is added by the network software when the routing table is initialized. The second and third are created automatically when the network interface cards are bound to their IP addresses. However, the last three must be added specifically. On a UNIX system, this is done by issuing the route command, either manually by a user, or by the rc scripts upon bootup.

All these methods involve static routing. Routes are generally added on bootup, and the routing table remains unchanged, unless manual intervention occurs.

Routing Protocols

Both hosts and gateways can use a technique called dynamic routing. This allows the routing table to be automatically altered if, for instance, a router fails. Another router could be used instead, without user intervention, providing a much more resilient system.

Dynamic routing requires a routing protocol, which adds and deletes entries from the routing table. The routing table still works the same way as in static routing, but entries are added and removed automatically rather than manually.

There are two types of routing protocols: interior and exterior. Interior protocols route inside an autonomous system (AS), while exterior protocols route between autonomous systems. An autonomous system is a network normally under one administrative control, perhaps by a large company or a university. Small sites tend to be part of their Internet service provider's AS.

Only interior protocols are discussed here; few people ever have to deal with (or have even heard of!) exterior protocols. The most common exterior protocols are EGP (Exterior Gateway Protocol) and BGP (Border Gateway Protocol). BGP is the newer protocol, and it is slowly replacing EGP.

FIGURE 9.9. A more complex network.

ICMP Redirects

ICMP is not normally considered to be a routing protocol, but ICMP redirects act in much the same way as a routing protocol, so I'll discuss them here. Suppose that you have a routing table with the six entries given earlier, and a packet is sent to the host 201.66.43.33. This address matches no route in the table except the default route, which sends traffic by way of the router 201.66.39.254 (see trip 1 in Figure 9.10). However, this router has full knowledge of the network, and knows that all packets for the 201.66.43.0 subnet should go through a different router, 201.66.39.253. Accordingly, it forwards the packet to that router (trip 2 in Figure 9.10). But it would have been much more efficient if the host had sent the packet straight to 201.66.39.253 (trip 3 in Figure 9.10).

FIGURE 9.10. An ICMP redirect.

The router can instruct the host to use a different route by sending an ICMP redirect. The router knows that there is a better route, because it is sending the packet back out on the same interface it came in on. Though the router knows that the whole of the 201.66.43.0 subnet should go by way of 201.66.39.253, it normally only sends an ICMP redirect for a particular host (in this case 201.66.43.33). The host creates a new entry in the routing table:

Destination	Subnet mask	Gateway	Flags	Interface
`201.66.43.33`	`255.255.255.255`	`201.66.39.253`	`UGHD`	`eth1`

Notice the D (redirect) flag--this is set on all routes created by an ICMP redirect. In the future, all packets will be sent by the new route (trip 3 in Figure 9.10).
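The redirect mechanism can be modelled roughly as follows. This is a simplified sketch, not real ICMP handling: the router-side test ("the packet leaves on the interface it arrived on") and the host-side action (add a host route flagged UGHD) come from the description above, while the function names, the dictionary layout, and the router's interface name are assumptions made for the example. Only two of the host's routes are reproduced to keep it short.

```python
import ipaddress

def to_int(text):
    return int(ipaddress.IPv4Address(text))

def lookup(routes, destination):
    """Pick the matching route with the longest (numerically largest) mask."""
    dest = to_int(destination)
    matches = [r for r in routes
               if (dest & to_int(r["mask"])) == to_int(r["dest"])]
    return max(matches, key=lambda r: to_int(r["mask"]))

def router_should_redirect(in_iface, out_iface):
    """Router side: a better first hop exists if the packet goes back out
    on the same interface it came in on."""
    return in_iface == out_iface

def apply_redirect(routes, host, new_gateway, iface):
    """Host side: install a host route marked with the D (redirect) flag."""
    routes.append({"dest": host, "mask": "255.255.255.255",
                   "gateway": new_gateway, "flags": "UGHD", "iface": iface})

host_routes = [
    {"dest": "201.66.39.0", "mask": "255.255.255.0",
     "gateway": "201.66.39.21", "flags": "U", "iface": "eth1"},
    {"dest": "0.0.0.0", "mask": "0.0.0.0",
     "gateway": "201.66.39.254", "flags": "UG", "iface": "eth1"},
]

# Trip 1: before the redirect, the host uses the default router.
before = lookup(host_routes, "201.66.43.33")["gateway"]

# The router forwards the packet out of the interface it arrived on
# ("eth0" is a hypothetical name for that interface on the router).
if router_should_redirect("eth0", "eth0"):
    apply_redirect(host_routes, "201.66.43.33", "201.66.39.253", "eth1")

# Trip 3: later packets for this one host go straight to the better router.
after = lookup(host_routes, "201.66.43.33")["gateway"]
```

Note that a neighbouring address such as 201.66.43.34 would still use the default route, because the redirect installs a route for a single host only.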

Routing Information Protocol (RIP)

RIP is a simple interior routing protocol, which has been around for many years, and is widely implemented (UNIX routed uses RIP). It is a distance-vector algorithm, which means that its routing decisions are based purely upon the number of "hops" between two points. Traversing a router is considered to be one hop.

Both hosts and gateways can run RIP, although hosts only receive information; they do not send it. Information can be specifically requested from a gateway, but is also broadcast every 30 seconds in order to keep routing tables current. RIP messages are exchanged between hosts and gateways over UDP, using port 520.
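To make the exchange a little more concrete, here is a sketch that packs and unpacks a RIP v1 response message. The field layout (a 4-byte header followed by 20-byte route entries) is my reading of RFC 1058 and is not described in this chapter, so treat it as an assumption; the function names are invented, and nothing is actually sent on UDP port 520.

```python
import socket
import struct

RIP_PORT = 520       # UDP port used by RIP
CMD_RESPONSE = 2     # command code: 1 = request, 2 = response
AFI_IP = 2           # address family identifier for IP

HEADER = struct.Struct("!BBH")     # command, version, must-be-zero
ENTRY = struct.Struct("!HH4sIII")  # AFI, zero, IP address, zero, zero, metric

def build_response(routes, version=1):
    """Pack (address, hop_count) pairs into a RIP response payload."""
    packet = HEADER.pack(CMD_RESPONSE, version, 0)
    for address, metric in routes:
        packet += ENTRY.pack(AFI_IP, 0, socket.inet_aton(address), 0, 0, metric)
    return packet

def parse_response(packet):
    """Unpack a payload into (command, version, [(address, hop_count), ...])."""
    command, version, _ = HEADER.unpack_from(packet, 0)
    routes = []
    for offset in range(HEADER.size, len(packet), ENTRY.size):
        _afi, _, raw, _, _, metric = ENTRY.unpack_from(packet, offset)
        routes.append((socket.inet_ntoa(raw), metric))
    return command, version, routes

if __name__ == "__main__":
    payload = build_response([("201.66.43.0", 2), ("73.0.0.0", 5)])
    print(len(payload), parse_response(payload))
```

A gateway would hand such a payload to a UDP socket; the sketch stops short of that so it can be run without network privileges.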

The information passed between gateways is used to build up a routing table. The route chosen by RIP is always the one with the fewest hops to the destination.

RIP version 1 works reasonably well on simple, fairly small networks. However, it runs into several problems on larger networks. Some of these are rectified in RIP v2, but others are inherent in the design of the protocol. In the following discussion, points applicable to v1 and v2 are referred to simply as RIP, while RIP v1 or RIP v2 refer to the specific versions.

RIP doesn't have any concept of quality for links; all links are considered to be the same. Thus a low-speed serial line is considered to be as good as a high-speed fiber-optic link. RIP gives preference to the route with the fewest hops; thus, when given a choice between going across:

A 100Mbps fiber-optic link, then a router, then a 10Mbps Ethernet, or
A 9600bps serial link

RIP will choose the latter. RIP also has no concept of the traffic levels on a link; given a choice between two Ethernet links, one of which is very busy, and one of which has no traffic at all, RIP will quite happily use the busier link.
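The choice just described can be reproduced with a toy comparison. As a simplification (an assumption of this sketch, not part of RIP itself), each path is modelled as a list of link speeds, and the number of links stands in for the hop count; the path names are invented.

```python
# Link speeds in bits per second, taken from the two alternatives in the text.
paths = {
    "fiber then Ethernet": [100_000_000, 10_000_000],  # two links, one extra router
    "serial": [9_600],                                 # a single slow link
}

def rip_choice(candidates):
    """RIP-style choice: fewest links, with link speed ignored entirely."""
    return min(candidates, key=lambda name: len(candidates[name]))

def bandwidth_choice(candidates):
    """For contrast: pick the path whose slowest link is fastest."""
    return max(candidates, key=lambda name: min(candidates[name]))

if __name__ == "__main__":
    print("RIP picks:", rip_choice(paths))
    print("Bandwidth-aware pick:", bandwidth_choice(paths))
```

The two functions disagree on this input, which is exactly the weakness being described: the hop count says nothing about how fast or how busy a link is.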

The maximum number of hops that RIP can represent is 15; any destination further away than this is considered to be unreachable. Thus on very large autonomous systems, where the number of hops on any useful route may exceed 15, use of RIP is impractical.
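The hop limit falls out naturally from the way a distance-vector table is updated. The sketch below shows one plausible form of the update rule, assuming that a metric of 16 stands for "unreachable" (one more than the 15-hop maximum); the table layout and the name `merge_update` are made up for illustration, and timers and other refinements of real implementations are left out.

```python
INFINITY = 16  # one more than the 15-hop maximum; means "unreachable"

def merge_update(table, neighbor, advertised):
    """Merge a neighbor's {destination: hops} advertisement into our table.

    table maps destination -> (hops, next_hop). Returns True if it changed.
    """
    changed = False
    for dest, hops in advertised.items():
        cost = min(hops + 1, INFINITY)  # reaching the neighbor adds one hop
        current = table.get(dest)
        if current is None:
            if cost < INFINITY:         # never learn an unreachable route
                table[dest] = (cost, neighbor)
                changed = True
        elif current[1] == neighbor:
            # News from the gateway we already use always replaces our entry,
            # even if the route got worse or became unreachable.
            if current[0] != cost:
                table[dest] = (cost, neighbor)
                changed = True
        elif cost < current[0]:
            # A different gateway offers a strictly shorter route.
            table[dest] = (cost, neighbor)
            changed = True
    return changed

if __name__ == "__main__":
    table = {}
    merge_update(table, "A", {"net1": 1, "net2": 15})
    merge_update(table, "B", {"net1": 0})
    print(table)
```

In the usage lines, "net2" is advertised at 15 hops, so it would be 16 hops away from this router and is therefore never installed.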

RIP v1 does not support subnetting; the subnet mask is not transmitted with each route. The method for determining the subnet mask for each given route varies from implementation to implementation. RIP v2 corrects this shortcoming.
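Since the mask is not carried in a RIP v1 update, the receiver has to guess it. One possible fallback, shown here purely as an assumption (real implementations differ, as noted above), is to use the classful mask implied by the first byte of the address, following the class ranges given in this chapter. The function name is invented for the example.

```python
def classful_mask(address):
    """Guess a subnet mask from the address class (first byte of the address)."""
    first = int(address.split(".")[0])
    if 1 <= first <= 126:
        return "255.0.0.0"        # Class A: 1 network byte, 3 host bytes
    if 128 <= first <= 191:
        return "255.255.0.0"      # Class B: 2 network bytes, 2 host bytes
    if 192 <= first <= 223:
        return "255.255.255.0"    # Class C: 3 network bytes, 1 host byte
    raise ValueError("reserved first byte: " + str(first))

if __name__ == "__main__":
    for addr in ("56.81.38.28", "137.89.15.88", "200.77.32.61"):
        print(addr, classful_mask(addr))
```

A guess like this is wrong whenever a network has been subnetted, which is why RIP v2 transmits the mask explicitly.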

RIP updates are only sent every 30 seconds, so information about the failure of a link can take some time to propagate through a large network. The time for routing information to settle down to a stable state can be even longer, and routing loops can occur during this period of change.

We can conclude that RIP is a simple routing protocol, with some restrictive limitations, especially in version 1. However, it is often the only choice for particular operating systems.

Summary

ARP (Address Resolution Protocol) is used to map IP addresses to hardware addresses (MAC addresses). It is a transparent protocol that is normally only seen by the user when an IP address conflict occurs. In special situations, the ARP cache table can be controlled manually via the arp command.

An IP address splits into two parts: the network part and the host part. How the address is split used to depend on the network class, given by the first byte of the address. Modern implementations of IP hold an extra field called the subnet mask, which is used to determine how the address is split. This greatly enhances the functionality of IP routing, but also adds a lot of complexity.

Routing tables can be static or dynamic. Static routes are controlled manually, or by a sequence of commands in a bootup script. Dynamic routes are controlled by a daemon running a routing protocol, such as RIP or OSPF. Though, strictly speaking, ICMP isn't a routing protocol, it can still alter the routing tables in response to a redirect message.

`0`:	Reserved (for the network address)
`1-126`:	Class A (network: 1 byte, host: 3 bytes)
`127`:	Reserved (for the loopback address)
`128-191`:	Class B (network: 2 bytes, host: 2 bytes)
`192-223`:	Class C (network: 3 bytes, host: 1 byte)
`224-255`:	Reserved (see the note that follows)

IP address	Network address	Host address
`56.81.38.28`	`56`	`81.38.28`
`137.89.15.88`	`137.89`	`15.88`
`200.77.32.61`	`200.77.32`	`61`

Address	Subnet mask	Network address	Host address
`201.66.32.0`	`255.255.240.0`	`201.66.32`	`0.0`
`201.66.84.0`	`255.255.240.0`	`201.66.80`	`4.0`

Destination	Subnet mask	Gateway	Flags	Interface
`201.66.37.0`	`255.255.255.0`	`201.66.37.74`	`U`	`eth0`
`201.66.39.0`	`255.255.255.0`	`201.66.39.21`	`U`	`eth1`