Performance Tuning Guidelines for Windows Server 2008 R2 April 12, 2013 Abstract

Download 0,59 Mb.
bet	7/174
Sana	21.03.2017
Hajmi	0,59 Mb.
	#1060

1 2 3 4 5 6 7 8 9 10 ... 174

Storage-Related Parameters

You can adjust registry parameters on Windows Server 2008 R2 for high-throughput scenarios.

I/O Priorities

Windows Server 2008 and Windows Server 2008 R2 can specify an internal priority level on individual I/Os. Windows primarily uses this ability to de-prioritize background I/O activity and to give precedence to response-sensitive I/Os (for example, multimedia). However, extensions to file system APIs let applications specify I/O priorities per handle. The storage stack logic to sort out and manage I/O priorities has overhead, so if some disks will be targeted by only a single priority of I/Os (such as a SQL database disks), you can improve performance by disabling the I/O priority management for those disks by setting the following registry entry to zero:

HKEY_LOCAL_MACHINE\System\CurrentControlSet\Control\DeviceClasses

\{Device GUID}\DeviceParameters\Classpnp\IdlePrioritySupported

Storage-Related Performance Counters

The following sections describe performance counters that you can use for workload characterization, capacity planning, and identifying potential bottlenecks.

Logical Disk and Physical Disk

On servers that have heavy I/O workloads, you should enable the disk counters on a sampling basis or in specific scenarios to diagnose storage-related performance issues. Continuously monitoring disk counters can incur up to a 1 percent CPU overhead penalty.

The same counters are valuable in both the logical and physical disk counter objects. Logical disk statistics are tracked by the volume manager (or managers), and physical disk statistics are tracked by the partition manager.

The following counters are exposed through volume and partition managers:

% Disk Read Time, % Disk Time, % Disk Write Time, % Idle Time

These counters are of little value when multiple physical drives are behind logical disks. Imagine a subsystem of 100 physical drives presented to the operating system as five disks, each backed by a 20-disk RAID 0 1 array. Now imagine that the administrator spans the five disks to create one logical disk, volume x. One can assume that any serious system that needs that many physical disks has at least one outstanding request to volume x at any given time. This makes the volume appear to be 100% busy and 0% idle, when in fact the 100-disk array could be up to 99% idle with only a single request outstanding.

Average Disk Bytes / { Read | Write | Transfer }

This counter collects average, minimum, and maximum request sizes. If possible, you should observe individual or sub-workloads separately. You cannot differentiate multimodal distributions by using average values if the request types are consistently interspersed.

Average Disk Queue Length, Average Disk { Read | Write } Queue Length

These counters collect concurrency data, including burstiness and peak loads. Guidelines for queue lengths are given later in this guide. These counters represent the number of requests in flight below the driver that takes the statistics. This means that the requests are not necessarily queued but could actually be in service or completed and on the way back up the path. Possible in-flight locations include the following:

Waiting in an ATAport queue or a Storport queue.

Waiting in a queue in a miniport driver.

Waiting in a disk controller queue.

Waiting in an array controller queue.

Waiting in a hard disk queue (that is, on board a physical disk).

Actively receiving service from a physical disk.

Completed, but not yet back up the stack to where the statistics are collected.
Average Disk second / {Read | Write | Transfer}

These counters collect disk request response time data and possibly extrapolate service time data. They are probably the most straightforward indicators of storage subsystem bottlenecks. Guidelines for response times are given later in this guide. If possible, you should observe individual or sub-workloads separately. You cannot differentiate multimodal distributions by using Perfmon if the requests are consistently interspersed.

Current Disk Queue Length

This counter instantly measures the number of requests in flight and therefore is subject to extreme variance. Therefore, this counter is not useful except to check for the existence of many short bursts of activity.

Disk Bytes / second, Disk {Read | Write } Bytes / second

This counter collects throughput data. If the sample time is long enough, a histogram of the array’s response to specific loads (queues, request sizes, and so on) can be analyzed. If possible, you should observe individual or sub-workloads separately.

Disk {Reads | Writes | Transfers } / second

Split I/O / second

This counter is useful only if the value is not in the noise. If it becomes significant, in terms of split I/Os per second per physical disk, further investigation could be needed to determine the size of the original requests that are being split and the workload that is generating them.

Note: If the Windows standard stacked driver’s scheme is circumvented for some controller, so-called “monolithic” drivers can assume the role of partition manager or volume manager. If so, the monolithic driver writer must supply the counters listed above through the Windows Management Instrumentation (WMI) interface, or the counters will not be available.

Processor Information

% DPC Time, % Interrupt Time, % Privileged Time

If interrupt time and DPC time are a large part of privileged time, the kernel is spending a long time processing I/Os. Sometimes, it is best to keep interrupts and DPCs affinitized to only a few CPUs on a multiprocessor system to improve cache locality. At other times, it is best to distribute the interrupts and DPCs among many CPUs to prevent the interrupt and DPC activity from becoming a bottleneck on individual CPUs.

DPCs Queued / second

This counter is another measurement of how DPCs are using CPU time and kernel resources.

Interrupts / second

This counter is another measurement of how interrupts are using CPU time and kernel resources. Modern disk controllers often combine or coalesce interrupts so that a single interrupt causes the processing of multiple I/O completions. Of course, there is a trade-off between delaying interrupts (and therefore completions) and amortizing CPU processing time.

Power Protection and Advanced Performance Option

There are two performance-related options for every disk under Disk > Properties > Policies:

Enable write caching.

Enable an “advanced performance” mode that assumes that the storage is protected against power failures.
Enabling write caching means that the storage hardware can indicate to the operating system that a write request is complete even though the data has not been flushed from the volatile intermediate hardware cache(s) to its final nonvolatile storage location. Note that with this action a period of time passes during which a power failure or other catastrophic event could result in data loss. However, this period is typically fairly short because write caches in the storage hardware are usually flushed during any period of idle activity. Cache flushes are also requested frequently by the operating system (or some applications) to explicitly force writes to be written to the final storage medium in a specific order. Alternately, hardware time-outs at the cache level might force dirty data out of the caches.

Other than cache flush requests, the only means of synchronizing writes is to tag them as “write-through.” Storage hardware is supposed to guarantee that a write-through request’s data has reached nonvolatile storage (such as magnetic media on a disk platter) before it indicates a successful request completion to the operating system. Some commodity disks or disk controllers may not honor write-through semantics. In particular, ATA/SATA/USB storage components may not support the ForceUnitAccess flag that is used to tag write-through requests in the hardware. Enterprise storage subsystems typically use battery-backed caches or use SAS/SCSI/FC hardware to correctly maintain write-through semantics.

The “advanced performance” disk policy option is available only when write caching is enabled. This option strips all write-through flags from disk requests and removes all flush-cache commands from the request stream. If you have power protection for all hardware write caches along the I/O path, you do not need to worry about those two pieces of functionality. By definition, any dirty data that resides in a power-protected write cache is safe and appears to have occurred “in-order” from the software’s viewpoint. If power is lost to the final storage location while the data is being flushed from a write cache, the cache manager can retry the write operation after power has been restored to the relevant storage components.

Block Alignment (DISKPART)

NTFS aligns its metadata and data clusters to partition boundaries in increments of the cluster size (which is selected during file system creation or set by default to 4 KB). In releases of Windows prior to Windows Server 2008, the partition boundary offset for a specific disk partition could be misaligned when it was compared to array disk stripe unit boundaries. This caused small requests to be unintentionally split across multiple disks. To force alignment, you were required to use Diskpar.exe or Diskpart.exe at the time that the partition was created.

In Windows Server 2008 and Windows Server 2008 R2, partitions are created by default with a 1-MB offset, which provides good alignment for the power-of-two stripe unit sizes that are typically found in hardware. If the stripe unit size is set to a size that is greater than 1 MB, the alignment issue is much less of a problem because small requests rarely cross large stripe unit boundaries. Note that Windows Server 2008 and Windows Server 2008 R2 default to a 64-KB offset if the disk is smaller than 4 GB.

If alignment is still a problem even with the default offset, you can use Diskpart.exe to force alternative alignments when you create a partition.

Solid-State and Hybrid Drives

Previously, the cost of large quantities of nonvolatile memory was prohibitive for most server configurations. Exceptions included aerospace or military applications in which the shock and vibration tolerance of flash memory is highly desirable.

As the cost of flash memory continues to decrease, it becomes more possible to use nonvolatile memory (NVM) to improve storage subsystem response time on servers. The typical vehicle for incorporating NVM in a server is the solid-state disk (SSD). One cost-effective strategy is to place only the hottest data of a workload onto nonvolatile memory. In Windows Server 2008 R2, as in previous versions of Windows Server, partitioning can be performed only by applications that store data on an SSD; Windows does not try to dynamically determine what data should optimally be stored on SSDs versus rotating media.

Response Times

You can use tools such as Perfmon to obtain data on disk request response times. Write requests that enter a write-back hardware cache often have very low response times (less than 1 ms) because completion depends on dynamic RAM (DRAM) speeds instead of rotating disk speeds. The data is lazily written to disk media in the background. As the workload begins to saturate the cache, response times increase until the write cache’s only potential benefit is a better ordering of requests to reduce positioning delays.

For JBOD arrays, reads and writes have approximately the same performance characteristics. Writes can be slightly longer due to additional mechanical “settling” delays. With modern hard disks, positioning delays for random requests are 5 to 15 ms. Smaller 2.5-inch drives have shorter positioning distances and lighter actuators, so they generally provide faster seek times than comparable larger 3.5inch drives. Positioning delays for sequential requests should be insignificant except for streams of write-through requests, where each positioning delay should approximate the required time for a complete disk rotation. (Write-through requests are typically identified by the ForceUnitAccess (FUA) flag on the disk request.)

Transfer times are usually less significant when they are compared to positioning delays, except for sequential requests and large requests (larger than 256 KB) that are instead dominated by disk media access speeds as the requests become larger or more sequential. Modern enterprise disks access their media at 50 to 150 MB/s depending on rotation speed and sectors per track, which varies across a range of blocks on a specific disk model. The outermost tracks can have up to twice the sequential throughput of innermost tracks.

If the stripe unit size of a striped array is well chosen, each request is serviced by a single disk—except for low-concurrency workloads. So, the same general positioning and transfer times still apply.

For mirrored arrays, a write completion must typically wait for both disks to complete the request. Depending on how the requests are scheduled, the two completions of the requests could take a long time. However, although writes for mirrored arrays generally should not take twice the time to complete, they are typically slower than for JBOD. Reads can experience a performance increase if the array controller is dynamically load balancing or factoring in spatial locality.

For RAID 5 arrays (rotated parity), small writes become four separate requests in the typical read-modify-write scenario. In the best case, this is approximately the equivalent of two mirrored reads plus a full rotation of the disks, if you assume that the read/write pairs continue in parallel. Traditional RAID 6 slows write performance even more because each RAID 6 write request becomes three reads plus three writes.

You must consider the performance effect of redundant arrays on read and write requests when you plan subsystems or analyze performance data. For example, Perfmon might show that 50 writes per second are being processed by volume x, but in reality this could mean 100 requests per second for a mirrored array, 200 requests per second for a RAID 5 array, or even more than 200 requests per second if the requests are split across stripe units.

Use the following are response-time guidelines if no workload details are available:

For a lightly loaded system, average write response times should be less than 25 ms on RAID 5 or RAID 6 and less than 15 ms on non-RAID 5 or non-RAID 6 disks. Average read response times should be less than 15 ms regardless.

For a heavily loaded system that is not saturated, average write response times should be less than 75 ms on RAID 5 or RAID 6 and less than 50 ms on non-RAID 5 or non-RAID 6 disks. Average read response times should be less than 50 ms.

Queue Lengths

Several opinions exist about what constitutes excessive disk request queuing. This guide assumes that the boundary between a busy disk subsystem and a saturated one is a persistent average of two requests per physical disk. A disk subsystem is near saturation when every physical disk is servicing a request and has at least one queued-up request to maintain maximum concurrency—that is, to keep the data pipeline flowing. Note that in this guideline, disk requests split into multiple requests (because of striping or redundancy maintenance) are considered multiple requests.

This rule has caveats, because most administrators do not want all physical disks constantly busy. Because disk workloads are often bursty, this rule is more likely applied over shorter periods of (peak) time. Requests are typically not uniformly spread among all hard disks at the same time, so the administrator must consider deviations between queues—especially for bursty workloads. Conversely, a longer queue provides more opportunity for disk request schedulers to reduce positioning delays or optimize for full-stripe RAID 5 writes or mirrored read selection.

Because hardware has an increased capability to queue up requests—either through multiple queuing agents along the path or merely agents with more queuing capability—increasing the multiplier threshold might allow more concurrency within the hardware. This creates a potential increase in response time variance, however. Ideally, the additional queuing time is balanced by increased concurrency and reduced mechanical positioning times.

Use the following queue length targets when few workload details are available:

For a lightly loaded system, the average queue length should be less than one per physical disk, with occasional spikes of 10 or less. If the workload is write heavy, the average queue length above a mirrored controller should be less than 0.6 per physical disk and the average queue length above a RAID 5 or RAID 6 controller should be less than 0.3 per physical disk.

For a heavily loaded system that is not saturated, the average queue length should be less than 2.5 per physical disk, with infrequent spikes up to 20. If the workload is write heavy, the average queue length above a mirrored controller should be less than 1.5 per physical disk and the average queue length above a RAID 5 or RAID 6 controller should be less than 1.0 per physical disk.

For workloads of sequential requests, larger queue lengths can be tolerated because services times and therefore response times are much shorter than those for a random workload.

For more details on Windows storage performance, see “Resources” later in this guide.

Download 0,59 Mb.

1 2 3 4 5 6 7 8 9 10 ... 174

Download 0,59 Mb.