Windows Server 2003 TCP/IP provides level 2 (full) support for IP multicasting and IGMP versions 1 through 3, as described in RFCs 1112, 2236, and 3376. The introduction to RFC 1112 provides a good overall summary of IP multicasting. The text reads:
“IP multicasting is the transmission of an IP datagram to a host group—a set of zero or more hosts identified by a single IP destination address. A multicast datagram is delivered to all members of its destination host group with the same ‘best-effort’ reliability as regular unicast IP datagrams; that is, the datagram is not guaranteed to arrive intact to all members of the destination group or in the same order relative to other datagrams.
“The membership of a host group is dynamic; that is, hosts may join and leave groups at any time. There is no restriction on the location or number of members in a host group. A host may be a member of more than one group at a time. A host need not be a member of a group to send datagrams to it.
“A host group may be permanent or transient. A permanent group has a well-known, administratively assigned IP address. It is the address—not the membership of the group—that is permanent; at any time a permanent group may have any number of members, even zero. Those IP multicast addresses that are not reserved for permanent groups are available for dynamic assignment to transient groups that exist only as long as they have members.
“Internetwork forwarding of IP multicast datagrams is handled by multicast routers that may be co-resident with, or separate from, Internet gateways. A host transmits an IP multicast datagram as a local network multicast that reaches all immediately-neighboring members of the destination host group. If the datagram has an IP time-to-live greater than 1, the multicast router(s) attached to the local network take responsibility for forwarding it towards all other networks that have members of the destination group. On those other member networks that are reachable within the IP time-to-live, an attached multicast router completes delivery by transmitting the datagram as a local multicast.”
IP/ARP Extensions for IP Multicasting
To support IP multicasting, an additional route is defined on the host. The route (added by default) specifies that if a datagram is being sent to a multicast host group, it should be sent to the IP address of the host group through the local NIC, and not forwarded to the default gateway. The following route (which you can discover using the route print command) illustrates this:
IP multicast addresses are easily identified, as they are from the class D range, 18.104.22.168 to 22.214.171.124 (126.96.36.199/4). These IP addresses all have 1110 as their high-order bits.
To send a packet to a host group, using the local interface, the IP address must be resolved to a media access control address. As stated in RFC 1112:
“An IP host group address is mapped to an Ethernet multicast address by placing the low-order 23 bits of the IP address into the low-order 23 bits of the Ethernet multicast address 01-00-5E-00-00-00 (hex). Because there are 28 significant bits in an IP host group address, more than one host group address may map to the same Ethernet multicast address.”
For example, a datagram addressed to the multicast address 188.8.131.52 would be sent to the (Ethernet) MAC address 01-00-5E-00-00-05. This MAC address is formed by the junction of 01-00-5E and the 23 low-order bits of 184.108.40.206 (00-00-05).
Because more than one host group address can map to the same Ethernet multicast address, the interface may indicate hand-up multicasts for a host group for which no local applications have a registered interest. These extra multicasts are discarded by the TCP/IP protocol.
Multicast Extensions to Windows Sockets
Internet Protocol multicasting is currently supported only on AF_INET sockets of type SOCK_DGRAM and SOCK_RAW. By default, IP multicast datagrams are sent with a Time to Live (TTL) of 1. Applications can use the setsockopt() function to specify a TTL. By convention, multicast routers use TTL thresholds to determine how far to forward datagrams. These TTL thresholds are defined as follows:
Multicast datagrams with initial TTL 0 are restricted to the same host.
Multicast datagrams with initial TTL 1 are restricted to the same subnet.
Multicast datagrams with initial TTL 32 are restricted to the same site.
Multicast datagrams with initial TTL 64 are restricted to the same region.
Multicast datagrams with initial TTL 128 are restricted to the same continent.
Multicast datagrams with initial TTL 255 are unrestricted in scope.
Use of Multicast and IGMP by Windows Components
Some Windows Server 2003 components use IP multicast traffic. For example, router discovery uses multicasts, by default. WINS servers use multicasting when attempting to locate replication partners. The Windows Server 2003 Routing and Remote Access service supports multicast forwarding and the configuration of interfaces to operate in IGMP router or IGMP proxy mode.
Transmission Control Protocol (TCP)
TCP provides a connection-based, reliable byte-stream service to applications. Microsoft networking relies upon TCP for logon, file and print sharing, replication of information between domain controllers, transfer of browse lists, and other common functions. It can only be used for one-to-one communications.
TCP uses a checksum on both the TCP header and payload of each segment to reduce the chance that network corruption will go undetected. NDIS 5.1 provides support for task offloading, and Windows Server 2003 TCP/IP takes advantage of this by allowing the NIC to perform the TCP checksum calculations if the NIC driver offers support for this function. Offloading the checksum calculations to hardware can result in performance improvements in very high-throughput environments.
Windows Server 2003 TCP/IP has also been strengthened against a variety of attacks that were published over the past couple of years and has been subject to an internal security review intended to reduce susceptibility to future attacks. For instance, the initial sequence number (ISN) algorithm has been modified so that ISNs increase in random increments, using an RC4-based random number generator initialized with a 2048-bit random key upon system startup.
TCP Receive Window Size Calculation and Window Scaling (RFC 1323)
The TCP receive window size is the amount of receive data (in bytes) that can be buffered at one time on a connection. The sending host can send only that amount of data before waiting for an acknowledgment and window update from the receiving host. The Windows Server 2003 TCP/IP stack was designed to tune itself in most environments and uses larger default window sizes than earlier versions. Instead of using a hard-coded default receive window size, TCP adjusts to even increments of the maximum segment size (MSS) negotiated during connection setup. Matching the receive window to even increments of the MSS increases the percentage of full-sized TCP segments used during bulk data transmission.
The default advertised TCP receive-window size in Windows Server 2003 depends on the following, in order of precedence:
The SO_RCVBUF WinSock option for the connection.
The per-interface TcpWindowSize registry parameter.
The global TcpWindowSize registry parameter.
Autodetermination based on the NDIS-reported bit rate of the media The stack also tunes itself based on the media speed:
Below 1 Mbps: 8 KB
1 Mbps – 100 Mbps: 17 KB
Greater than 100 Mbps: 64 KB
For 10 and 100 Mbps Ethernet, the window is normally set to 17,520 bytes (17 KB rounded up to twelve 1460-byte segments.) There are two methods for setting the receive window size to specific values:
The TcpWindowSize registry parameter (see Appendix A)
The setsockopt() Windows Sockets function (on a per-socket basis)
To improve performance on high-bandwidth, high-delay networks, Windows Server 2003 TCP/IP supports scalable windows support as described in RFC 1323. This RFC details a method for supporting scalable windows by allowing TCP to negotiate a scaling factor for the window size at connection establishment. This allows for an actual receive window of up to 1 gigabyte (GB). Section 2.2 of RFC 1323 provides a good description:
“The three-byte Window Scale option may be sent in a SYN segment by a TCP. It has two purposes: 1. indicate that the TCP is prepared to do both send and receive window scaling, and 2. communicate a scale factor to be applied to its receive window. Thus, a TCP that is prepared to scale windows should send the option, even if its own scale factor is 1. The scale factor is limited to a power of two and encoded logarithmically, so it may be implemented by binary shift operations.
“This option is an offer, not a promise; both sides must send Window Scale options in their SYN segments to enable window scaling in either direction. If window scaling is enabled, then the TCP that sent this option will right-shift its true receive-window values by 'shift.cnt' bits for transmission in SEG.WND. The value shift.cnt may be zero (offering to scale, while applying a scale factor of 1 to the receive window).
“This option may be sent in an initial segment (in other words, a segment with the SYN bit on and the ACK bit off). It may also be sent in a segment, but only if a Window Scale option was received in the initial segment. A Window Scale option in a segment without a SYN bit should be ignored.
“The Window field in a SYN (in other words, a or ) segment itself is never scaled.”
When you read network traces of a connection that was established by two computers that support scalable windows, keep in mind that the window sizes advertised in the TCP headers must be scaled by the negotiated scale factor. The scale factor can be observed in the connection establishment (three-way handshake) packets, as illustrated in the following Network Monitor capture:
The computer sending the packet above is offering the Window Scale option, with a scaling factor of 5. If the target computer responds, accepting the Window Scale option in the SYN-ACK, then it is understood that any TCP window advertised by this computer needs to be left-shifted 5 bits from this point onward (the SYN itself is not scaled). For example, if the computer advertised a 32 KB window in its first send of data, this value would need to be left-shifted (shifting in 0's from the right) 5 bits as shown below:
As a check, left-shifting a number 5 bits is equivalent to multiplying it by 25, or 32. Performing the calculation in decimal, we get 32767 * 32 = 1,048,544.
The scale factor is not necessarily symmetrical, so it may be different for each direction of data flow.
TCP window scaling is negotiated on-demand in Windows Server 2003, based on the value set for the SO_RCVBUF WinSock option when a connection is initiated. Additionally, the Window Scale option is used by default on a connection if the received SYN segment for a connection initiated by a TCP peer contains the Window Scale option. You can modify this default behavior with the Tcp1323Opts registry parameter (see Appendix A).
As specified in RFC 1122, TCP uses delayed acknowledgments (ACKs) to reduce the number of packets sent on the media. The Windows Server 2003 TCP/IP stack takes a common approach to implementing delayed ACKs. As data is received by TCP on a connection, it only sends an acknowledgment back if one of the following conditions is met:
No ACK was sent for the previous segment received.
A segment is received, but no other segment arrives within 200 milliseconds for that connection.
In summary, normally an ACK is sent for every other TCP segment received on a connection, unless the delayed ACK timer (200 milliseconds) expires. The delayed ACK timer can be adjusted through the TcpDelAckTicks registry parameter.
TCP Selective Acknowledgment (RFC 2018)
Windows Server 2003 TCP/IP includes support for an important performance feature known as Selective Acknowledgement (SACK). SACK is especially important for connections using large TCP window sizes. Prior to SACK, a receiver could only acknowledge the latest sequence number of contiguous data that had been received, or the left edge of the receive window. When SACK is enabled, the receiver continues to use the ACK number to acknowledge the left edge of the receive window, but it can also acknowledge other non-contiguous blocks of received data individually. SACK uses two different TCP header options:
The Sack-Permitted option is used to indicate that the sender can receive and interpret the SACK option. This option is sent in TCP segments in which the SYN flag is set.
The SACK option is used to convey extended acknowledgment information from the receiver to the sender over an established TCP connection. Within the SACK option are up to four pairs of TCP sequence numbers acknowledging up to four non-continuous blocks. Each TCP sequence pair contains the left edge (the sequence number of the first byte in the block) and the right edge (the sequence number of the byte immediately following the last byte in the block) of the block being acknowledged.
When SACK is enabled (the default), a packet or series of packets can be dropped, and the receiver can inform the sender of exactly which data has been received, and where the holes in the data are. The sender can then selectively retransmit the missing data without needing to retransmit blocks of data that have already been received successfully. SACK is controlled by the SackOpts registry parameter and is enabled by default. The Network Monitor capture below illustrates a host acknowledging all data up to sequence number 54857340, plus the data from sequence number 54858789-54861684.
FRAME: Base frame properties
ETHERNET: ETYPE = 0x0800 : Protocol = IP: DOD Internet
TCP: Acknowledgement Number = 54857341 (0x3450E7D)
TCP: Data Offset = 44 (0x2C)
TCP: Reserved = 0 (0x0000)
TCP: Flags = 0x10 : .A....
TCP: Window = 32722 (0x7FD2)
TCP: Checksum = 0x4A72
TCP: Urgent Pointer = 0 (0x0)
TCP: Option Nop = 1 (0x1)
TCP: Option Nop = 1 (0x1)
TCP: Timestamps Option
TCP: Option Nop = 1 (0x1)
TCP: Option Nop = 1 (0x1)
TCP: SACK Option
TCP: Option Type = 0x05
TCP: Option Length = 10 (0xA)
TCP: Left Edge of Block = 54858789 (0x3451425)
TCP: Right Edge of Block = 54861685 (0x3451F75)
TCP Timestamps (RFC 1323)
Windows Server 2003 TCP/IP supports TCP time stamps, as described in RFC 1323. Like SACK, time stamps are important for connections using large window sizes. Time stamps were conceived to assist TCP in accurately measuring round-trip time (RTT) to adjust retransmission time-outs.
The Timestamps option carries two four-byte time stamp fields. The time-stamp value field (TSval) contains the current value of the time-stamp clock of the TCP sending the option. The Timestamp Echo Reply field (TSecr) is only valid if the ACK bit is set in the TCP header; if it is valid, it echoes a timestamp value that was sent by the remote TCP peer in the TSval field of a Timestamps option. When TSecr is not valid, its value must be zero. The TSecr value will generally be from the most recent Timestamp option that was received; however, there are exceptions that are explained below.
A TCP peer may send the Timestamps option (TSopt) in an initial SYN segment (i.e., segment containing a SYN bit and no ACK bit), and may send a TSopt in other segments only if it received a TSopt in the initial SYN segment for the connection.
The Timestamps option field can be viewed in a Network Monitor capture by expanding the TCP options field, as shown below:
TCP: Timestamps Option
TCP: Option Type = Timestamps
TCP: Option Length = 10 (0xA)
TCP: Timestamp = 2525186 (0x268802)
TCP: Reply Timestamp = 1823192 (0x1BD1D8)
By default, the Timestamp option is only used on a connection if the received SYN segment for a connection initiated by a TCP peer contains the Timestamp option. This default behavior can be modified with the Tcp1323Opts registry parameter (see Appendix A).
Path Maximum Transmission Unit (PMTU) Discovery
PMTU discovery is described in RFC 1191. When a connection is established, the two hosts involved exchange their TCP maximum segment size (MSS) values. The smaller of the two MSS values is used for the connection. Historically, the MSS for a host has been the MTU at the link layer minus 40 bytes for the IP and TCP headers. However, support for additional TCP options, such as time stamps, has increased the typical TCP IP header to 52 or more bytes. Figure 2 shows the difference between MTU and MSS.
F igure 2. MTU versus MSS
When TCP segments are destined to a non-local network, the Don’t Fragment (DF) flag is set in the IP header. Any link along the path can have an MTU that differs from that of the two hosts. If a link has an MTU that is too small for the IP datagram being routed, the router attempts to fragment the datagram. However, the Don’t Fragment bit is set in the IP header. At this point, the router should inform the sending host that the datagram couldn't be forwarded further without fragmentation. This is done with anICMP Destination Unreachable-Fragmentation Needed and DF Set message. Most routers also specify the MTU for the next hop by putting the value for it in the low-order 16 bits of the ICMP header field that is unused in RFC 792. See RFC 1191, section 4, for the format of this message. Upon receiving this ICMP error message, TCP adjusts its MSS for the connection to the specified MTU minus the TCP and IP header size so that any further packets sent on the connection are no larger than the maximum size that can traverse the path without fragmentation.
Note The minimum MTU permitted is 88 bytes, and Windows Server 2003 TCP/IP enforces this limit.
Some noncompliant routers may silently drop IP datagrams that cannot be fragmented or may not correctly report their next-hop MTU. If this occurs, it may be necessary to make a configuration change to the PMTU detection algorithm. There are two registry changes that can be made to Windows Server 2003 TCP/IP stack to work around these problematic devices:
EnablePMTUBHDetect—Adjusts the PMTU discovery algorithm to attempt to detect PMTU black hole routers. PMTU black hole detection is disabled by default.
EnablePMTUDiscovery—Completely enables or disables the PMTU discovery mechanism. When PMTU discovery is disabled, an MSS of 536 bytes is used for all non-local destination addresses. PMTU discovery is enabled by default (this is the recommended setting).
These registry entries are described in more detail in Appendix A.
The PMTU between two hosts can be discovered manually using the ping tool with the -f (don’t fragment) switch, as follows:
ping -f -n-l
As shown in the example below, the size parameter can be varied until the MTU is found. The size parameter used by ping is the size of the data buffer to send, not including headers. The ICMP header consumes 8 bytes, and the IP header is normally 20 bytes. In the case below (Ethernet), the link layer MTU is the maximum-sized ping buffer plus 28, or 1500 bytes:
C:\>ping -f -n 1 -l 1472 10.99.99.10
Pinging 10.99.99.10 with 1472 bytes of data:
Reply from 10.99.99.10: bytes=1472 time<10ms TTL=128
Ping statistics for 10.99.99.10:
Packets: Sent = 1, Received = 1, Lost = 0 (0% loss),
Approximate round trip times in milli-seconds:
Minimum = 0ms, Maximum = 0ms, Average = 0ms
C:\>ping -f -n 1 -l 1473 10.99.99.10
Pinging 10.99.99.10 with 1473 bytes of data:
Packet needs to be fragmented but DF set.
Ping statistics for 10.99.99.10:
Packets: Sent = 1, Received = 0, Lost = 1 (100% loss),
Approximate round trip times in milliseconds:
Minimum = 0ms, Maximum = 0ms, Average = 0ms
In the example shown above, the IP layer returned an ICMP error message that ping interpreted. If the router had been a PMTU black hole router, ping would simply not be answered once its size exceeded the MTU that the router could handle. Ping can be used in this manner to detect such a router.
A sample ICMP Destination Unreachable-Fragmentation Needed and DF Set message is shown here:
10.99.99.10 10.99.99.9 ICMP Destination Unreachable:
10.99.99.10 See frame 3
FRAME: Base frame properties
ETHERNET: ETYPE = 0x0800 : Protocol = IP: DOD Internet Protocol
IP: ID = 0x4401; Proto = ICMP; Len: 56
ICMP: Destination Unreachable: 10.99.99.10 See frame 3
ICMP: Packet Type = Destination Unreachable
ICMP: Unreachable Code = Fragmentation Needed, DF Flag Set
ICMP: Checksum = 0xA05B
ICMP: Next Hop MTU = 576 (0x240)
ICMP: Data: Number of data bytes remaining = 28 (0x001C)
ICMP: Description of original IP frame
This error was generated by using ping-f –n 1 -l 1000 on an Ethernet-based host to send a large datagram across a router interface that only supports an MTU of 576 bytes. When the router tried to place the large frame onto the network with the smaller MTU, it found that fragmentation was not allowed. Therefore, it returned the error message indicating that the largest datagram that could be forwarded is 576 bytes.
Dead Gateway Detection
Dead gateway detection is used to allow TCP to detect failure of the default gateway and to adjust the IP route table to use another default gateway. Windows Server 2003 TCP/IP uses the triggered reselection method described in RFC 816, with slight modifications based upon customer experience and feedback.
When a TCP connection routed through the default gateway attempts to send a TCP packet to the destination a number of times (equal to one-half of the registry value TcpMaxDataRetransmissions) without receiving a response, the algorithm changes the Route Cache Entry (RCE) for that remote IP address to use the next default gateway in the list as its next-hop address. When 25 percent of the TCP connections have moved to the next default gateway, the algorithm advises IP to change the computer’s default gateway to the one that the connections are now using.
For example, assume that there are currently TCP connections to 11 different IP addresses that are being routed through the default gateway. Now assume that the default gateway fails, that there is a second default gateway configured, and that the value for TcpMaxDataRetransmissions is at the default of 5.
When the first TCP connection tries to send data, it does not receive any acknowledgments. After the third retransmission, the RCE for that remote IP address is switched to the next default gateway in the list. At this point, any TCP connections to that one remote IP address have switched over, but the remaining connections still try to use the original default gateway.
When the second TCP connection tries to send data, the same thing happens. Now, two of the 11 RCEs point to the new gateway.
When the third TCP connection tries to send data, after the third retransmission, three of 11 RCEs have been switched to the second default gateway. Because, at this point, over 25 percent of the RCEs have been moved, the default gateway for the whole computer is moved to the new one.
That default gateway remains the primary one for the computer until it experiences problems (causing the dead gateway algorithm to try the next one in the list again) or until the computer is restarted.
When the search reaches the last default gateway, it returns to the beginning of the list.
TCP Retransmission Behavior
TCP starts a retransmission timer when each outbound segment is handed down to IP. If no acknowledgment has been received for the data in a given segment before the timer expires, the segment is retransmitted. For new connection requests, the retransmission timer is initialized to 3 seconds (controllable using the TcpInitialRtt per-adapter registry parameter), and the request (SYN) is resent up to the value specified in TcpMaxConnectRetransmissions (the default for Windows Server 2003 is 2 times). On existing connections, the number of retransmissions is controlled by the TcpMaxDataRetransmissions registry parameter (5 by default). The retransmission time-out is adjusted on the fly to match the characteristics of the connection, using Smoothed Round Trip Time (SRTT) calculations as described in Van Jacobson’s paper named "Congestion Avoidance and Control." The timer for a given segment is doubled after each retransmission of that segment. Using this algorithm, TCP tunes itself to the normal delay of a connection. TCP connections over high-delay links take much longer to time-out than those over low-delay links.3
The following Network Monitor capture shows the retransmission algorithm for two hosts that are connected over Ethernet on the same subnet. An FTP file transfer was in progress when the receiving host was disconnected from the network. Because the SRTT for this connection was very small, the first retransmission was sent after about one-half second. The timer was then doubled for each of the retransmissions that followed. After the fifth retransmission, the timer was once again doubled. If no acknowledgment was received before it expired, the connection was aborted.
There are some circumstances under which TCP retransmits data prior to the time that the retransmission timer expires. The most common of these occurs due to a feature known as fast retransmit. When a receiver that supports fast retransmit receives data with a sequence number beyond the current expected one, it assumes that some data was dropped. To help make the sender aware of this event, the receiver immediately sends an ACK, with the ACK number set to the sequence number that it was expecting. It continues to do this for each additional TCP segment that arrives containing data subsequent to the missing data in the incoming stream. When the sender starts to receive a stream of ACKs that are acknowledging the same sequence number and that sequence number is earlier than the current sequence number being sent, it can infer that a segment (or more) must have been dropped. Senders that support the fast retransmit algorithm immediately resend the segment that the receiver is expecting to fill in the gap in the data, without waiting for the retransmission timer to expire for that segment. This optimization greatly improves performance in a busy network environment.
With fast retransmit, the sender retransmits the TCP segments to fill in the gap in the data before their retransmission timers expire. Because the retransmission timers did not expire for the missing TCP segments, the sender can more quickly send additional segments to the receiver. This is known as fast recovery. Fast retransmit and fast recovery are described in RFC 2581.
By default, Windows Server 2003 resends a segment if it receives three ACKs for the same sequence number and that sequence number lags the current one. This is controllable with the TcpMaxDupAcks registry parameter. See also the “TCP Selective Acknowledgment (RFC 2018)” section in this paper.
TCP Keep-Alive Messages
A TCP keep-alive packet is simply an ACK with the sequence number set to one less than the current sequence number for the connection. A host receiving one of these ACKs responds with an ACK for the current sequence number. Keep-alives can be used to verify that the computer at the remote end of a connection is still available. TCP keep-alives can be sent once every KeepAliveTime (defaults to 7,200,000 milliseconds or two hours) if no other data or higher-level keep-alives have been carried over the TCP connection. If there is no response to a keep-alive, it is repeated once every KeepAliveInterval seconds. KeepAliveInterval defaults to 1 second. NetBT connections, such as those used by other Microsoft networking components, send NetBIOS keep-alives more frequently, so normally no TCP keep-alives are sent on a NetBIOS connection. TCP keep-alives are disabled by default, but Windows Sockets applications can use the setsockopt() function to enable them.
Slow Start Algorithm and Congestion Avoidance
When a connection is established, TCP starts slowly at first to assess the bandwidth of the connection, and to avoid overflowing the receiving host or any other devices or links in the path. The send window is set to two TCP segments, and if that is acknowledged, it is incremented to three segments.4 If those are acknowledged, it is incremented again, and so on until the amount of data being sent per burst reaches the size of the receive window on the remote host. At that point, the slow start algorithm is no longer in use, and flow control is governed by the receive window. However, congestion could still occur on a connection at any time during transmission. If this happens (evidenced by the need to retransmit), a congestion-avoidance algorithm is used to reduce the send window size temporarily and to grow it back towards the receive window size. Slow start and congestion avoidance are described in RFCs 2581.
Silly Window Syndrome (SWS)
Silly Window Syndrome is described in RFC 1122 as follows:
“In brief, SWS is caused by the receiver advancing the right window edge whenever it has any new buffer space available to receive data and by the sender using any incremental window, no matter how small, to send more data [TCP:5]. The result can be a stable pattern of sending tiny data segments, even though both sender and receiver have a large total buffer space for the connection.”
Windows Server 2003 TCP/IP implements SWS avoidance, as specified in RFC 1122, by not sending more data until there is a sufficient window size advertised by the receiving end to send a full TCP segment. It also implements SWS avoidance on the receive end of a connection by not opening the receive window in increments of less than a TCP segment.
Windows Server 2003 TCP/IP implements the Nagle algorithm described in RFC 896. The purpose of this algorithm is to reduce the number of very small segments sent, especially on high-delay (remote) links. The Nagle algorithm allows only one small segment to be outstanding at a time without acknowledgment. If more small segments are generated while awaiting the ACK for the first one, these segments are coalesced into one larger segment. Any full-sized segment is always transmitted immediately, on the assumption that there is a sufficient receive window available. The Nagle algorithm is effective in reducing the number of packets sent by interactive applications, such as Telnet, especially over slow links.
The Nagle algorithm can be observed in the following Network Monitor capture. The trace was captured by using PPP to dial up an Internet service provider. A Telnet (character-mode) session was established, and then the Y key was held down on the computer. At all times, one segment was sent, and further Y characters were held by the stack until an acknowledgment was received for the previous segment. In this example, three to four Y characters were buffered each time and sent together in one segment. The Nagle algorithm resulted in a huge savings in the number of packets sent—the number of packets was reduced by a factor of about three.
Windows Sockets applications can disable the Nagle algorithm for their connections by setting the TCP_NODELAY socket option. However, this practice should be avoided unless it is absolutely necessary because it increases network utilization. Some network applications may not perform well if their design does not take into account the effects of transmitting large numbers of small packets and the Nagle algorithm. The Nagle algorithm is not applied to loopback TCP connections for performance reasons. Windows Server 2003 NetBT disables Nagling for NetBIOS over TCP connections as well as direct-hosted redirector/server connections, which can improve performance for applications issuing numerous small file manipulation commands. An example is an application that uses file locking/unlocking frequently.
TCP TIME-WAIT Delay
When a TCP connection is closed, the socket-pair is placed into a state known as TIME-WAIT. This is done so that a new connection does not use the same protocol, source IP address, destination IP address, source port, and destination port until enough time has passed to ensure that any segments that may have been misrouted or delayed are not delivered unexpectedly. RFC 793 specifies the length of time that the socket-pair should not be reused as two maximum segment lifetimes (2 MSL), or four minutes. This is the default setting for Windows Server 2003 TCP/IP. However, with this default setting, some network applications that perform many outbound connections in a short time may use up all available ports before the ports can be recycled.
Windows Server 2003 TCP/IP offers two methods of controlling this behavior. First, the TcpTimedWaitDelay registry parameter can be used to alter this value. Windows Server 2003 TCP/IP allows it to be set as low as 30 seconds, which should not cause problems in most environments. Second, the number of user-accessible ephemeral ports that can be used to source outbound connections is configurable using the MaxUserPort registry parameter. By default, when an application requests any socket from the system to use for an outbound call, a port between the values of 1024 and 5000 is supplied. The MaxUserPort parameter can be used to set the value of the uppermost port that the administrator chooses to allow for outbound connections. For instance, setting this value to 10,000 (decimal) would make approximately 9000 user ports available for outbound connections. For more details on this concept, see RFC 793. See also the MaxFreeTcbs and MaxHashTableSize registry parameters in Appendix A.
TCP/IP for Windows Server 2003 Service Pack 1 has implemented a smart TCP port allocation algorithm. When an application requests any available TCP port, TCP/IP first attempts to find an available port that does not correspond to a connection in the TIME-WAIT state. If a port cannot be found, then it picks any available port. For more information, see "Smart TCP Port Allocation" in this paper.
TCP Connections to and from Multihomed Computers
When TCP connections are made to a multihomed host, both the WINS client and the Domain Name Resolver (DNR) attempt to determine whether any of the destination IP addresses provided by the name server are on the same subnet as any of the interfaces in the local computer. If so, these addresses are sorted to the top of the list so that the application can try them prior to trying addresses that are not on the same subnet. If none of the addresses is on a common subnet with the local computer, behavior is different depending upon the name space. The PrioritizeRecordData TCP/IP registry parameter can be used to prevent the DNR component from sorting local subnet addresses to the top of the list.
In the WINS name space, the client is responsible for randomizing or load balancing between the provided addresses. The WINS server always returns the list of addresses in the same order, and the WINS client randomly picks one of them for each connection.
In the DNS name space, the DNS server is usually configured to provide the addresses in a round robin fashion. The DNR does not attempt to further randomize the addresses. In some situations, it is desirable to connect to a specific interface on a multihomed computer. The best way to accomplish this is to provide the interface with its own DNS entry. For example, a computer named raincity could have one DNS entry listing both IP addresses (actually two separate records in the DNS with the same name), and also records in the DNS for raincity1 and raincity2, each associated with just one of the IP addresses assigned to the computer.
When TCP connections are made from a multihomed host, things get a bit more complicated. If the connection is a Winsock connection using the DNS name space, once the target IP address for the connection is known, TCP attempts to connect from the best source IP address available. Again, the route table is used to make this determination. If there is an interface in the local computer that is on the same subnet as the target IP address, its IP address is used as the source in the connection request. If there is no best source IP address to use, the system chooses one randomly.
If the connection is a NetBIOS-based connection using the redirector, little routing information is available at the application level. The NetBIOS interface supports connections over various protocols and has no knowledge of IP. Instead, the redirector places calls on all of the transports that are bound to it. If there are two interfaces in the computer and one protocol installed, there are two transports available to the redirector. Calls are placed on both, and NetBT submits connection requests to the stack, using an IP address from each interface. It is possible that both calls succeed. If so, the redirector cancels one of them. The choice of which one to cancel depends upon the redirector ObeyBindingOrderregistry value5. If this is set to 0 (the default value), the primary transport (determined by binding order) is the preferred one, and the redirector waits for the primary transport to time out before accepting the connection on the secondary transport. If this value is set to 1, the binding order is ignored, and the redirector accepts the first connection that succeeds and cancels the other(s).
TCP was designed to provide optimum performance over varying link conditions, and Windows Server 2003 TCP/IP contains improvements such as those supporting RFC 1323. Actual throughput for a link depends on a number of variables, but the most important factors are:
Link speed (bits-per-second that can be transmitted)
Window size (amount of unacknowledged data that may be outstanding on a TCP connection)
Network and intermediate device congestion
TCP throughput calculation is discussed in detail in Chapters 20–24 of TCP/IP Illustrated, by W. Richard Stevens6. Some key considerations are listed below:
The capacity of a pipe is bandwidth multiplied by round-trip time. This is known as the bandwidth-delay product. If the link is reliable, for best performance the window size should be greater than or equal to the capacity of the pipe so that the sending stack can fill it. The largest window size that can be specified, due to its 16-bit field in the TCP header, is 65535, but larger windows can be negotiated by using window scaling as described earlier in this paper. See TcpWindowSize in Appendix A.
Throughput can never exceed window size divided by round-trip time.
If the link is unreliable or badly congested and packets are being dropped, using a larger window size may not improve throughput. Along with scaling windows support, Windows Server 2003 TCP/IP supports Selective Acknowledgments (SACK; described in RFC 2018) to improve performance in environments that are experiencing packet loss. It also includes support for timestamps (described in RFC 1323) for improved RTT estimation.
Propagation delay is dependent upon the speed of light, latencies in transmission equipment, and so on.
Transmission delay depends on the speed of the media.
For a specified path, propagation delay is fixed, but transmission delay depends upon the packet size.
At low speeds, transmission delay is the limiting factor. At high speeds, propagation delay may become the limiting factor.
To summarize, Windows Server 2003 TCP/IP can adapt to most network conditions and can dynamically provide the best throughput and reliability possible on a per-connection basis. Attempts at manual tuning are often counter-productive unless a qualified network engineer first performs a careful study of data flow.