NTttcp is a Winsock-based port of ttcp to Windows. It helps measure network driver performance and throughput on different network topologies and hardware setups. It provides the customer with a multithreaded, asynchronous performance workload for measuring an achievable data transfer rate on an existing network setup.
For more information, see How to Use NTttcp to Test Network Performance in the Windows Dev Center.
When setting up NTttcp, consider the following:
A single thread should be sufficient for optimal throughput.
Multiple threads are required only for single-to-many clients.
Posting enough user receive buffers (by increasing the value passed to the -a option) reduces TCP copying.
You should not excessively post user receive buffers because the first buffers that are posted would return before you need to use other buffers.
It is best to bind each set of threads to a logical processor (the second delimited parameter in the -m option).
Each thread creates a socket that connects to (or listens on) a different port.
Table 8. Example Syntax for NTttcp Sender and Receiver
Example Syntax for a Sender
NTttcps -m 1,0,10.1.2.3 -a 2
Bound to CPU 0.
Connects to a computer that uses IP 10.1.2.3.
Posts two send-overlapped buffers.
Default buffer size: 64 KB.
Default number of buffers to send: 20 K.
Example Syntax for a Receiver
NTttcpr -m 1,0,10.1.2.3 -a 6 -fr
Bound to CPU 0.
Binds on local computer to IP 10.1.2.3.
Posts six receive-overlapped buffers.
Default buffer size: 64 KB.
Default number of buffers to receive: 20 K.
Posts full-length (64 KB) receive buffers.
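The two command lines in Table 8 can also be assembled programmatically. The following is a minimal sketch that uses only the options shown in the table (-m, -a, and -fr). The helper name ntttcp_command and its parameters are hypothetical and are not part of NTttcp.

```python
def ntttcp_command(role, threads, cpu, ip, outstanding_buffers, full_receive=False):
    """Build an NTttcp command line from the options used in Table 8.

    Hypothetical helper for illustration; it only formats a string.
    """
    # Only the two executables named in Table 8 are supported here.
    executables = {"sender": "NTttcps", "receiver": "NTttcpr"}
    if role not in executables:
        raise ValueError("role must be 'sender' or 'receiver'")
    # -m takes a comma-delimited mapping; the second field is the logical processor.
    mapping = f"{threads},{cpu},{ip}"
    parts = [executables[role], "-m", mapping, "-a", str(outstanding_buffers)]
    if full_receive:
        # -fr posts full-length receive buffers (used by the receiver in Table 8).
        parts.append("-fr")
    return " ".join(parts)


print(ntttcp_command("sender", 1, 0, "10.1.2.3", 2))
print(ntttcp_command("receiver", 1, 0, "10.1.2.3", 6, full_receive=True))
```

Generating the strings this way keeps the sender and receiver mappings consistent when you test several CPU bindings in turn.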
Note: Make sure that you enable all offloading features on the network adapter.
TCP/IP Window Size
For 1 Gbps adapters, the settings shown in Table 8 should provide good throughput because NTttcp sets the default TCP window size to 64 KB through a specific socket option (SO_RCVBUF) for the connection. This provides good performance on a low-latency network. In contrast, for high-latency networks or for 10 Gbps adapters, the default TCP window size value for NTttcp yields less than optimal performance. In both cases, you must adjust the TCP window size to allow for the larger bandwidth-delay product. You can statically set the TCP window size to a large value by using the -rb option. This option disables TCP Window Auto-Tuning, and we recommend using it only if you fully understand the resultant change in TCP/IP behavior. By default, the TCP window size is set at a sufficient value and adjusts only under heavy load or over high-latency links.
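To see why a 64 KB window is adequate in one case and not in the other, it can help to compute the bandwidth-delay product directly. The following sketch is illustrative only: the round-trip times are assumed values chosen for the example, not measurements, and the function name is hypothetical.

```python
DEFAULT_WINDOW_BYTES = 64 * 1024  # the 64 KB default window discussed above


def bandwidth_delay_product_bytes(link_bits_per_second, round_trip_microseconds):
    """Return the number of bytes that must be in flight to keep the link full."""
    # bits/s * seconds = bits in flight; divide by 8 for bytes.
    # Integer arithmetic avoids floating-point rounding in the example.
    return link_bits_per_second * round_trip_microseconds // 8_000_000


# Assumed example links: (description, bits per second, round-trip time in microseconds)
examples = [
    ("1 Gbps, 0.5 ms RTT", 1_000_000_000, 500),
    ("10 Gbps, 0.5 ms RTT", 10_000_000_000, 500),
    ("1 Gbps, 50 ms RTT", 1_000_000_000, 50_000),
]

for description, bits_per_second, rtt_us in examples:
    bdp = bandwidth_delay_product_bytes(bits_per_second, rtt_us)
    verdict = "fits within" if bdp <= DEFAULT_WINDOW_BYTES else "exceeds"
    print(f"{description}: BDP = {bdp} bytes, {verdict} a 64 KB window")
```

Under these assumed round-trip times, only the low-latency 1 Gbps link has a bandwidth-delay product that fits within the default window.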
Server Performance Advisor 3.0
Microsoft Server Performance Advisor (SPA) 3.0 helps IT administrators collect metrics to identify, compare, and diagnose potential performance issues in a Windows Server 2012, Windows Server 2008 R2, or Windows Server 2008 deployment. SPA generates comprehensive diagnostic reports and charts, and it provides recommendations to help you quickly analyze issues and develop corrective actions.
For more information, see Server Performance Advisor 3.0 in the Windows Dev Center.
Performance Tuning for the Storage Subsystem
Decisions about how to design or configure storage software and hardware usually consider performance. Performance is improved or degraded as a result of trade-offs between multiple factors such as cost, reliability, availability, power, or ease-of-use. There are many components involved in handling storage requests as they work their way through the storage stack to the hardware, and trade-offs are made between such factors at each level. File cache management, file system architecture, and volume management translate application calls into individual storage access requests. These requests traverse the storage driver stack and generate streams of commands that are presented to the disk storage subsystem. The sequence and quantity of calls and the subsequent translation can improve or degrade performance.
Figure 4 shows the storage architecture, which includes many components in the driver stack.
The layered driver model in Windows sacrifices some performance for maintainability and ease-of-use (in terms of incorporating drivers of varying types into the stack). The following sections discuss tuning guidelines for storage workloads.
The most important considerations in choosing storage systems include:
Understanding the characteristics of current and future storage workloads.
Understanding that application behavior is essential for storage subsystem planning and performance analysis.
Providing necessary storage space, bandwidth, and latency characteristics for current and future needs.
Selecting a data layout scheme (such as striping), redundancy architecture (such as mirroring), and backup strategy.
Using a procedure that provides the required performance and data recovery capabilities.
Using power guidelines; that is, calculating the expected average power required in total and per-unit volume (such as watts per rack).
For example, when compared to 3.5-inch disks, 2.5-inch disks have greatly reduced power requirements, but they can also be packed more compactly into racks or servers, which can increase cooling requirements per rack or per server chassis.
The better you understand the workloads on a specific server or set of servers, the more accurately you can plan. The following are some important workload characteristics:
When you estimate how much data will be stored on a new server, consider these issues:
How much data you will move to the new server from existing servers
How much data you will store on the server in the future
A general guideline is to assume that growth will be faster in the future than it was in the past. Investigate whether your organization plans to hire many employees, whether any groups in your organization are planning large projects that will require additional storage, and so on.
You must also consider how much space is used by operating system files, applications, redundancy, log files, and other factors. Table 9 describes some factors that affect server storage capacity.
Table 9. Factors that Affect Server Storage Capacity
Factor: Required storage capacity
Operating system files: At least 15 GB. To provide space for optional components, future service packs, and other items, plan for an additional 3 to 5 GB for the operating system volume. A Windows Server installation can require even more space for temporary files.
Page file: For smaller servers, 1.5 times the amount of RAM, by default. For servers that have hundreds of gigabytes of memory, you might be able to eliminate the page file; otherwise, the page file might be limited because of space constraints (available disk capacity). The benefit of a page file larger than 50 GB is unclear.
Memory dump: Depending on the memory dump file option that you have chosen, use an amount as large as the physical memory plus 1 MB. On servers that have very large amounts of memory, full memory dumps become intractable because of the time that is required to create, transfer, and analyze the dump file.
Applications: Varies according to the application. Example applications include backup and disk quota software, database applications, and optional components.
Log files: Varies according to the applications that create the log file. Some applications let you configure a maximum log file size. You must make sure that you have enough free space to store the log files.
Data layout and redundancy: Varies depending on cost, performance, reliability, availability, and power goals. For more information, see Choosing the RAID Level later in this guide.
Shadow copies: 10 percent of the volume, by default, but we recommend increasing this size based on frequency of snapshots and rate of disk data updates.
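The fixed guidelines in Table 9 can be combined into a rough first-pass estimate. The sketch below is an assumption-laden illustration, not a sizing tool: it takes the upper end of the 3 to 5 GB operating system headroom, applies the 10 percent shadow copy default to the data volume only, and treats application and log file sizes as inputs that you supply. The function name and parameters are hypothetical.

```python
def estimate_capacity_gb(ram_gb, data_gb, applications_gb=0.0, log_files_gb=0.0):
    """Return a per-factor breakdown (in GB) based on the Table 9 guidelines."""
    breakdown = {
        # At least 15 GB, plus up to 5 GB of headroom for optional components.
        "operating_system": 15 + 5,
        # Default for smaller servers: 1.5 times the amount of RAM.
        "page_file": 1.5 * ram_gb,
        # Full memory dump: physical memory plus 1 MB.
        "memory_dump": ram_gb + 1 / 1024,
        # Supplied by the caller; these vary by application.
        "applications": applications_gb,
        "log_files": log_files_gb,
        "data": data_gb,
        # Assumption: the 10 percent default is applied to the data volume.
        "shadow_copies": data_gb / 10,
    }
    breakdown["total"] = sum(breakdown.values())
    return breakdown


estimate = estimate_capacity_gb(ram_gb=16, data_gb=500, applications_gb=40, log_files_gb=10)
for factor, size_gb in estimate.items():
    print(f"{factor}: {size_gb:.1f} GB")
```

The estimate excludes redundancy overhead, which depends on the data layout that you choose.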
Choosing a Storage Solution
There are many considerations in choosing a storage solution that matches the expected workload. The range of storage solutions that are available to enterprises is immense.
Some administrators will choose to deploy a traditional storage array, backed by SAS or SATA hard drives and directly attached or accessed through a separately managed Fibre Channel or iSCSI fabric. The storage array typically manages the redundancy and performance characteristics internally. Figure 5 illustrates some storage deployment models that are available in Windows Server 2012.
Figure 5. Storage deployment models
Alternatively, Windows Server 2012 introduces a new technology called Storage Spaces, which provides platform storage virtualization. This enables customers to deploy storage solutions that are cost-efficient, highly available, resilient, and performant by using commodity SAS/SATA hard drives and JBOD enclosures. For more information, see Storage Spaces later in this guide.
Table 10 describes some of the options and considerations for a traditional storage array solution.
Table 10. Options for Storage Array Selection
SAS or SATA: These serial protocols improve performance, reduce cable length limitations, and reduce cost. SAS and SATA drives are replacing much of the SCSI market. In general, SATA drives are built with higher capacity and lower cost targets than SAS drives. The premium benefit associated with SAS is typically attributed to performance.
Hardware RAID capabilities: For maximum performance and reliability, the enterprise storage controllers should offer resiliency capabilities. RAID levels 0, 1, 0+1, 5, and 6 are described in Table 11.
Maximum storage capacity: Total usable storage space.
Storage bandwidth: The maximum peak and sustained bandwidths at which storage can be accessed are determined by the number of physical disks in the array, the speed of the controllers, the type of bus protocol (such as SAS or SATA), whether RAID is hardware-managed or software-managed, and the adapters that are used to connect the storage array to the system. The more important values are the achievable bandwidths for the specific workloads to be run on servers that access the storage.
Hardware Array Capabilities
Most storage solutions provide some resiliency and performance-enhancing capabilities. In particular, storage arrays may contain varying types and capacities of caches that can serve to boost performance by servicing reads and writes at memory speeds rather than storage speeds. In some cases, the addition of uninterruptible power supplies or batteries is required to keep the additional performance from coming at a reliability cost.
A hardware-managed array is presented to the operating system as a single drive, which can be termed a logical unit number (LUN), virtual disk, or any number of other names for a single contiguously addressed block storage device.
Table 11 lists some common options for the storage arrays.
Table 11. Storage Array Performance and Resiliency Options (RAID levels)
Just a bunch of disks (JBOD)
This is not a RAID level. It provides a baseline for measuring the performance, reliability, availability, cost, capacity, and energy consumption of various resiliency and performance configurations. Individual disks are referenced separately, not as a combined entity.
In some scenarios, a JBOD configuration actually provides better performance than striped data layout schemes. For example, when serving multiple lengthy sequential streams, performance is best when a single disk services each stream. Also, workloads that are composed of small, random requests do not experience performance improvements when they are moved from a JBOD configuration to a striped data layout.
A JBOD configuration is susceptible to static and dynamic “hot spots” (frequently accessed ranges of disk blocks) that reduce available storage bandwidth due to the resulting load imbalance between the physical drives.
Any physical disk failure results in data loss in a JBOD configuration. However, the loss is limited to the failed drives. In some scenarios, a JBOD configuration provides a level of data isolation that can be interpreted as offering greater reliability than striped configurations.
Spanning
This is not a RAID level. It is the concatenation of multiple physical disks into a single logical disk. Each disk contains one continuous set of sequential logical blocks. Spanning has the same performance and reliability characteristics as a JBOD configuration.
Striping (RAID 0)
Striping is a data layout scheme in which sequential logical blocks of a specified size (the stripe unit) are distributed in a circular fashion across multiple disks. It presents a combined logical disk that stripes disk accesses over a set of physical disks. The overall storage load is balanced across all physical drives.
For most workloads, a striped data layout provides better performance than a JBOD configuration if the stripe unit is appropriately selected based on server workload and storage hardware characteristics.
This is the least expensive RAID configuration because all of the disk capacity is available for storing the single copy of data.
Because no capacity is allocated for redundant data, striping does not provide data recovery mechanisms such as those provided in the other resiliency schemes. Also, the loss of any disk results in data loss on a larger scale than a JBOD configuration because the entire file system or raw volume spread across n physical disks is disrupted; every nth block of data in the file system is missing.
Mirroring (RAID 1)
Mirroring is a data layout scheme in which each logical block exists on multiple physical disks (typically two, but sometimes three in mission-critical environments). It presents a virtual disk that consists of a set of two or more mirrored disks.
Mirroring often has worse bandwidth and latency for write operations when compared to striping or JBOD. This is because data from each write request must be written to a pair of physical disks. Request latency is based on the slowest of the two (or more) write operations that are necessary to update all copies of the updated data blocks. In more complex implementations, write latencies may be reduced by write logging or battery-backed write caching, or by relaxing the requirement for dual write completions before returning the I/O completion notification.
Mirroring has the potential to provide faster read operations than striping because it can (with a sufficiently intelligent controller) read from the least busy physical disk of the mirrored pair, or the disk that will experience the shortest mechanical positioning delays.
Mirroring is the most expensive resiliency scheme in terms of physical disks because half (or more) of the disk capacity stores redundant data copies. A mirrored array can survive the loss of any single physical disk. In larger configurations, it can survive multiple disk failures if the failures do not involve all the disks of a specific mirrored disk set.
Mirroring has greater power requirements than a non-mirrored storage configuration. It doubles the number of disks; therefore, it doubles the required amount of idle power. Also, mirroring performs duplicate write operations that require twice the power of non-mirrored write operations.
In the simplest implementations, mirroring is the fastest of the resiliency schemes in terms of recovery time after a physical disk failure. Only a single disk (the other part of the broken mirror pair) must participate in bringing up the replacement drive. The second disk is typically still available to service data requests throughout the rebuilding process. In more complex implementations, multiple drives may participate in the recovery phase to help spread out the load for the duration of the rebuild.
Striped mirroring (RAID 0+1 or 10)
The combination of striping and mirroring is intended to provide the performance benefits of striping and the redundancy benefits of mirroring.
The cost and power characteristics are similar to those of mirroring.
Rotated parity or parity disks (RAID 5)
An array with rotated parity (denoted as RAID 5 for expediency) presents a logical disk that is composed of multiple physical disks that have data striped across the disks in sequential blocks (stripe units) in a manner similar to simple striping (RAID 0). However, the underlying physical disks have parity information spread throughout the disk array, as in the example shown in Figure 6.
For read requests, RAID 5 has characteristics that resemble those of striping. However, small RAID 5 writes are much slower than those of other resiliency schemes because each parity block that corresponds to the modified data block(s) must also be updated. This process requires three additional disk requests in the simplest implementation, regardless of the size of the array. Each small write requires two reads (old data and old parity) and two writes (new data and new parity). Because multiple physical disk requests are generated for every logical write, bandwidth is reduced by up to 75 percent.
RAID 5 arrays provide data recovery capabilities because data can be reconstructed from the parity. Such arrays can survive the loss of any one physical disk, as opposed to mirroring, which can survive the loss of multiple disks if the mirrored pair (or triplet) is not lost.
RAID 5 requires additional time to recover from a lost physical disk compared to mirroring because the data and parity from the failed disk can be re-created only by reading all the other disks in their entirety. In a basic implementation, performance during the rebuilding period is severely reduced due to the rebuilding traffic and because the reads and writes that target the data that was stored on the failed disk must read all the disks (an entire “stripe”) to re-create the missing data. More complex implementations incorporating multiple arrays may take advantage of more parallelism from other disks to help speed up recovery time.
RAID 5 is more cost efficient than mirroring because it requires only an additional single disk per array, instead of double (or more) the total number of disks in an array.
Power guidelines: RAID 5 might consume more or less energy than a mirrored configuration, depending on the number of drives in the array, the characteristics of the drives, and the characteristics of the workload. RAID 5 might use less energy if it uses significantly fewer drives. The additional disk adds to the required amount of idle power as compared to a JBOD array, but it requires less additional idle power versus a full mirrored set of drives. However, RAID 5 requires four accesses for every random write request (in the basic implementation) to read the old data, read the old parity, compute the new parity, write the new data, and write the new parity.
This means that the power needed beyond idle to perform the write operations is up to four times that of a JBOD configuration or two times that of a mirrored configuration.
(Depending on the workload, there may be only two seeks, not four, that require moving the disk actuator.) Thus, although unlikely in most configurations, RAID 5 might have greater energy consumption. This might happen if a heavy workload is being serviced by a small array or an array of disks with idle power that is significantly lower than their active power.
Double rotated parity, or double parity disks (RAID 6)
Traditional RAID 6 is basically RAID 5 with additional redundancy built in. Instead of a single block of parity per stripe of data, two blocks of redundancy are included. The second block uses a different redundancy code (instead of parity), which enables data to be reconstructed after the loss of any two disks. More complex implementations may take advantage of algorithmic or hardware optimizations to reduce the overhead that is associated with maintaining the extra redundant data.
In terms of power and performance, the same general statements that were made for RAID 5 apply to RAID 6, but to a greater degree.
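The capacity and small-write figures in Table 11 can be summarized in a few lines of code. This is a simplified sketch of the basic implementations described above, not a model of any particular controller. The function names are hypothetical, and RAID 6 is omitted from the write costs because the table does not state a request count for it.

```python
# Physical disk requests generated by one small logical write, per Table 11:
# JBOD and striping write once, mirroring writes each copy (two by default),
# and basic RAID 5 performs two reads and two writes.
SMALL_WRITE_REQUESTS = {"jbod": 1, "raid0": 1, "raid1": 2, "raid10": 2, "raid5": 4}


def usable_disks(level, disks, mirror_copies=2):
    """Return how many disks' worth of capacity remains for data."""
    if level in ("jbod", "raid0"):
        return disks                      # no capacity spent on redundancy
    if level in ("raid1", "raid10"):
        return disks / mirror_copies      # half (or more) holds redundant copies
    if level == "raid5":
        return disks - 1                  # one disk's worth of parity per array
    if level == "raid6":
        return disks - 2                  # two redundancy blocks per stripe
    raise ValueError(f"unknown level: {level}")


def small_write_bandwidth_loss(level):
    """Return the worst-case fraction of bandwidth lost on small writes."""
    return 1 - 1 / SMALL_WRITE_REQUESTS[level]


for level in ("raid0", "raid10", "raid5", "raid6"):
    print(f"{level}: {usable_disks(level, 8)} of 8 disks usable")
print(f"raid5 small-write bandwidth loss: {small_write_bandwidth_loss('raid5'):.0%}")
```

The RAID 5 result corresponds to the "up to 75 percent" reduction described above, because each logical write generates four physical requests.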
Rotated redundancy schemes (such as RAID 5 and RAID 6) are the most difficult to understand and plan for. Figure 6 shows a RAID 5 example, where the sequence of logical blocks presented to the host is A0, B0, C0, D0, A1, B1, C1, E1, and so on.
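A short sketch can make the rotated layout easier to follow. It assumes five disks named A through E with the parity block starting on the last disk and rotating one disk to the left on each stripe. That rotation is an assumption chosen because it reproduces the logical block sequence quoted above; real controllers may rotate differently. The second function shows how a lost block is rebuilt from parity by using XOR.

```python
from functools import reduce


def raid5_logical_blocks(disk_names, stripes):
    """Return the data blocks in the order presented to the host."""
    disk_count = len(disk_names)
    sequence = []
    for stripe in range(stripes):
        # Assumed rotation: parity starts on the last disk and moves left.
        parity_index = (disk_count - 1 - stripe) % disk_count
        for index, name in enumerate(disk_names):
            if index != parity_index:
                sequence.append(f"{name}{stripe}")
    return sequence


def xor_blocks(blocks):
    """XOR equal-length byte blocks together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


print(raid5_logical_blocks("ABCDE", 2))

# Parity is the XOR of the data blocks in a stripe.
data = [b"AAAA", b"BBBB", b"CCCC", b"DDDD"]
parity = xor_blocks(data)

# If the third disk fails, XOR of the surviving blocks and parity restores it.
rebuilt = xor_blocks([data[0], data[1], data[3], parity])
print(rebuilt == data[2])
```

Rebuilding a block requires reading every surviving disk in the stripe, which is why RAID 5 recovery takes longer than mirror recovery.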
Each RAID level involves a trade-off between the following factors:
To determine the best array configuration for your servers, evaluate the read and write loads of all data types and then decide how much you can spend to achieve the performance, availability, and reliability that your organization requires. Table 12 describes common configurations and their relative performance, reliability, availability, cost, capacity, and energy consumption.
Striped mirroring (RAID 0+1): A general purpose combination of performance and reliability for critical data, workloads with hot spots, and high-concurrency workloads
Rotated parity or parity disks (RAID 5): Web pages, semicritical data, workloads without small writes, scenarios in which capital and operating costs are an overriding factor, and read-dominated workloads
Multiple rotated parity or double parity disks (RAID 6): Data mining, critical data (assuming quick replacement or hot spares), workloads without small writes, scenarios in which cost or power is a major factor, and read-dominated workloads. RAID 6 might also be appropriate for massive datasets, where the cost of mirroring is high and double-disk failure is a real concern (due to the time required to complete an array parity rebuild for disk drives greater than 1 TB).
If you use more than two disks, striped mirroring is usually a better solution than only mirroring.
To determine the number of physical disks that you should include in an array, consider the following information:
Bandwidth (and often response time) improves as you add disks.
Reliability (in terms of mean time to failure for the array) decreases as you add disks.
Usable storage capacity increases as you add disks, but so does cost.
For striped arrays, the trade-off is between data isolation (small arrays) and better load balancing (large arrays). For mirrored arrays, the trade-off is between better cost per capacity (for basic mirrors, which have a depth of two physical disks) and the ability to withstand multiple disk failures (for depths of three or four physical disks). Read and write performance issues can also affect mirrored array size. For arrays with rotated parity (RAID 5), the trade-off is between better data isolation and mean time between failures (MTBF) for small arrays, versus better cost, capacity, and power for large arrays.
Because hard disk failures are not independent, array sizes must be limited when the array is made up of actual physical disks (that is, a bottom-tier array). The exact amount of this limit is very difficult to determine.
The following array size guidelines apply when no hardware reliability data is available:
Bottom-tier RAID 5 arrays should not extend beyond a single desk-side storage tower or a single row in a rack-mount configuration. This means approximately 8 to 14 physical disks for 3.5-inch storage enclosures. Smaller 2.5-inch disks can be racked more densely; therefore, they might require being divided into multiple arrays per enclosure.
Bottom-tier mirrored arrays should not extend beyond two towers or rack-mount rows, with data being mirrored between towers or rows when possible. These guidelines help avoid or reduce the decrease in time between catastrophic failures that is caused by using multiple buses, power supplies, and so on from separate storage enclosures.
Selecting a Stripe Unit Size
Hardware-managed arrays allow stripe unit sizes ranging from 4 KB to more than 1 MB. The ideal stripe unit size maximizes the disk activity without unnecessarily breaking up requests by requiring multiple disks to service a single request. For example, consider the following:
One long stream of sequential requests on a JBOD configuration uses only one disk at a time. To keep all striped disks in use for such a workload, the stripe unit should be no larger than 1/n of the request size, where n is the number of disks.
For n streams of small serialized random requests, if n is significantly greater than the number of disks and if there are no hot spots, striping does not increase performance over a JBOD configuration. However, if hot spots exist, the stripe unit size must maximize the possibility that a request will not be split while it minimizes the possibility of a hot spot falling entirely within one or two stripe units. You might choose a low multiple of the typical request size, such as five times or ten times, especially if the requests are aligned on some boundary (for example, 4 KB or 8 KB).
If requests are large, and the average or peak number of outstanding requests is smaller than the number of disks, you might need to split some requests across disks so that all disks are being used. You can interpolate an appropriate stripe unit size from the previous two examples. For example, if you have 10 disks and 5 streams of requests, split each request in half (that is, use a stripe unit size equal to half the request size). Note that this assumes some consistency in alignment between the request boundaries and the stripe unit boundaries.
Optimal stripe unit size increases with concurrency and typical request sizes.
Optimal stripe unit size decreases with sequentiality and with good alignment between data boundaries and stripe unit boundaries.
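The examples above can be turned into a rough starting heuristic. The sketch below is a simplification under stated assumptions: it treats any workload with at least as many concurrent streams as disks as the "many small requests" case and uses five times the request size as the low multiple. The function name and the default multiple are hypothetical, and the result is a starting point for testing rather than a recommendation.

```python
def suggest_stripe_unit(request_bytes, disks, concurrent_streams, multiple=5):
    """Return a starting stripe unit size in bytes for the given workload."""
    if concurrent_streams >= disks:
        # Enough concurrency to keep every disk busy: avoid splitting requests
        # by choosing a low multiple of the typical request size.
        return request_bytes * multiple
    # Fewer streams than disks: split each request so that all disks are used.
    pieces_per_request = disks // concurrent_streams
    return request_bytes // pieces_per_request


# 10 disks and 5 streams: each request is split in half.
print(suggest_stripe_unit(64 * 1024, disks=10, concurrent_streams=5))

# One sequential stream of 1 MB requests on 8 disks: one eighth of the request.
print(suggest_stripe_unit(1024 * 1024, disks=8, concurrent_streams=1))

# Many small random streams on 4 disks: a low multiple of the request size.
print(suggest_stripe_unit(8 * 1024, disks=4, concurrent_streams=64))
```

The heuristic assumes that request boundaries are aligned with stripe unit boundaries, as noted above.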
Determining the Volume Layout
Placing individual workloads into separate volumes has advantages. For example, you can use one volume for the operating system or paging space and one or more volumes for shared user data, applications, and log files. The benefits include fault isolation, easier capacity planning, and easier performance analysis.
You can place different types of workloads into separate volumes on different physical disks. Using separate disks is especially important for any workload that creates heavy sequential loads (such as log files), where a single set of physical disks can be dedicated to handling the updates to the log files. Placing the page file on a separate virtual disk might provide some improvements in performance during periods of high paging.
There is also an advantage to combining workloads on the same physical disks, if the disks do not experience high activity over the same time period. This is basically the partnering of hot data with cold data on the same physical drives.
The “first” partition on a volume that is utilizing hard disks usually uses the outermost tracks of the underlying disks, and therefore it provides better performance. Obviously, this guidance does not apply to solid-state storage.
With the cost of solid-state devices dropping, it is important to consider including multiple tiers of devices in a storage deployment to achieve a better balance between performance, cost, and energy consumption. Traditional storage arrays offer the ability to aggregate and tier heterogeneous storage, but Storage Spaces provides a more robust implementation.