Application I/O requests are processed in buffer memory. A finite amount of memory is presented to the application, and although it appears to the application as contiguous, its physical pages are generally scattered across multiple locations in memory. This buffer memory, available to programs running in the user mode of the operating system, is identified by its virtual address. When the application needs to transfer data between a physical disk device and the buffer, the hardware controller executing the transfer must know the actual physical addresses where the data is stored (the scatter/gather list previously described). This arrangement works well as long as only the hardware needs to know the physical location of the data and only the application needs to know the virtual address.
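The relationship between a virtually contiguous buffer and its scatter/gather list can be illustrated with a small sketch. It is a user-space model only, not driver code: the 4 KB page size, the page_table mapping, and the function name build_scatter_gather are assumptions made for the example.

```python
from typing import Dict, List, Tuple

PAGE_SIZE = 4096  # assumed page size for this sketch


def build_scatter_gather(virt_addr: int, length: int,
                         page_table: Dict[int, int]) -> List[Tuple[int, int]]:
    """Translate a virtually contiguous buffer into (physical_address, length) elements.

    page_table is a hypothetical mapping of virtual page number -> physical page number.
    Pieces that happen to be physically adjacent are merged into one element.
    """
    elements: List[Tuple[int, int]] = []
    offset = virt_addr
    remaining = length
    while remaining > 0:
        vpn, page_off = divmod(offset, PAGE_SIZE)
        chunk = min(PAGE_SIZE - page_off, remaining)  # stay within the current page
        phys = page_table[vpn] * PAGE_SIZE + page_off
        if elements and elements[-1][0] + elements[-1][1] == phys:
            prev_addr, prev_len = elements[-1]
            elements[-1] = (prev_addr, prev_len + chunk)  # extend the previous element
        else:
            elements.append((phys, chunk))
        offset += chunk
        remaining -= chunk
    return elements


# Virtual pages 16, 17, and 18 map to physical pages 7, 8, and 3 (made-up values).
page_table = {16: 7, 17: 8, 18: 3}
sg_list = build_scatter_gather(16 * PAGE_SIZE + 512, 10000, page_table)
```

In this example the first two virtual pages are physically adjacent and collapse into one element, while the third lands elsewhere, so the 10,000-byte buffer is described by two elements.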
In rare cases, the miniport driver itself (not just the hardware) needs to access data in the buffer before or after a transfer. This might happen when the miniport needs to translate discovery information, such as the list of logical unit numbers (LUNs) returned by a target device or the INQUIRY data from a logical unit. The driver cannot use the user virtual address of the I/O buffer, since this address is not available in kernel mode, in which all drivers run. Instead, the port driver must obtain that information by first allocating and then mapping system (kernel) memory for all I/O buffers. This process is enormously costly because it involves copying data to scarce buffers, which has a heavy impact on performance.
Storport allows the vendor the flexibility of selecting the setting necessary to maximize the performance of the storage miniport driver. A Storport miniport can map any of the following: all data buffers, no data buffers, or only those buffers that are not actual data intended for the application (such as discovery information). SCSIport does not allow this selective mapping; it is all or none.
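The three mapping choices can be sketched as a simple policy check. The enum members, the operation names, and the function needs_kernel_mapping are illustrative names for this sketch only; they are not the identifiers used by the actual driver interface.

```python
from enum import Enum


class MapPolicy(Enum):
    """Illustrative names for the three mapping choices described above."""
    ALL_BUFFERS = "all"
    NO_BUFFERS = "none"
    NON_READ_WRITE_BUFFERS = "non_read_write"


# Operations that carry application data (assumed labels for this sketch).
READ_WRITE_OPS = {"READ", "WRITE"}


def needs_kernel_mapping(policy: MapPolicy, operation: str) -> bool:
    """Decide whether the buffer for this request is mapped into kernel memory."""
    if policy is MapPolicy.ALL_BUFFERS:
        return True
    if policy is MapPolicy.NO_BUFFERS:
        return False
    # Selective mode: map only buffers that are not application data,
    # such as discovery information.
    return operation not in READ_WRITE_OPS
```

With the selective policy, an INQUIRY buffer is mapped but a READ buffer is not, which avoids the mapping cost on the application data path.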
Unlike SCSIport, which can queue a maximum of 254 outstanding I/O requests to an adapter, even when that adapter supports multiple storage devices, Storport does not limit the number of outstanding I/O requests that can be sent to an adapter. Instead, each logical storage unit (such as a virtual disk on a RAID array or a physical disk drive) can accept up to 254 outstanding requests, so the number of requests an adapter can handle is the number of logical units x 254. Since large storage arrays with hundreds of disks can have thousands of logical units, removing the queue limit from the adapter results in an enormous improvement in I/O throughput. This is especially important for organizations with high transaction processing needs and large numbers of physical and virtual disks.
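The difference in scale is simple arithmetic, as the following sketch shows. The function names and the LUN count of 1,000 are chosen only for illustration.

```python
MAX_OUTSTANDING = 254  # limit quoted above


def scsiport_adapter_limit(lun_count: int) -> int:
    """SCSIport: one limit for the whole adapter, regardless of the number of LUNs."""
    return MAX_OUTSTANDING


def storport_adapter_limit(lun_count: int) -> int:
    """Storport: the limit applies to each logical unit."""
    return lun_count * MAX_OUTSTANDING


lun_count = 1000  # hypothetical array size
scsiport_total = scsiport_adapter_limit(lun_count)
storport_total = storport_adapter_limit(lun_count)
```

For an adapter that presents 1,000 logical units, the per-adapter ceiling stays at 254 requests under SCSIport, while the per-LUN model allows up to 254,000.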
Storport also enables the miniport driver to implement basic queue management functions. These include pause/resume device, pause/resume adapter, busy/ready device, and busy/ready adapter, and the ability to control queue depth (the number of outstanding commands) on a per LUN basis, all of which can help ensure balanced throughput rather than overloaded I/O. Key scenarios that can take advantage of these capabilities include limited adapter resources, limited per LUN resources, or a busy LUN that prevents non-busy LUNs from receiving commands. Certain intermittent storage conditions, such as link disruptions or storage device upgrades, can be handled much more effectively when these controls are properly used in Storport miniports. (By contrast, a SCSI miniport cannot effectively control this queuing at all. A device may indicate a busy status; consequently commands will automatically be retried for a fixed amount of time; however, no controls are available on an adapter basis whatsoever.)
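A minimal model of these controls, assuming a per-LUN queue depth, a per-LUN busy flag, and an adapter-wide pause flag, might look like the following. The class and method names are hypothetical and do not correspond to Storport routines.

```python
from collections import deque
from typing import Dict, List


class LunQueue:
    """Pending requests for one LUN, limited by a per-LUN queue depth."""

    def __init__(self, depth: int):
        self.depth = depth
        self.pending = deque()
        self.outstanding = 0
        self.busy = False

    def submit(self, request: str) -> None:
        self.pending.append(request)

    def dispatch(self) -> List[str]:
        """Return the requests that may be sent to the device right now."""
        sent = []
        while self.pending and not self.busy and self.outstanding < self.depth:
            sent.append(self.pending.popleft())
            self.outstanding += 1
        return sent


class AdapterQueues:
    """All LUN queues behind one adapter, with an adapter-wide pause flag."""

    def __init__(self, lun_depths: Dict[int, int]):
        self.luns = {lun: LunQueue(depth) for lun, depth in lun_depths.items()}
        self.paused = False

    def dispatch_all(self) -> Dict[int, List[str]]:
        if self.paused:
            return {}
        return {lun: queue.dispatch() for lun, queue in self.luns.items()}


adapter = AdapterQueues({0: 2, 1: 4})
for i in range(3):
    adapter.luns[0].submit(f"a{i}")
    adapter.luns[1].submit(f"b{i}")

adapter.luns[0].busy = True           # LUN 0 reports busy
first_pass = adapter.dispatch_all()   # LUN 1 still receives its commands
adapter.luns[0].busy = False          # LUN 0 becomes ready again
second_pass = adapter.dispatch_all()  # LUN 0 is served up to its depth of 2
```

The point of the model is the isolation: the busy LUN holds back only its own requests, and the per-LUN depth keeps the third request for LUN 0 queued until an earlier one completes.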
Improved Error and Reset Handling
Errors during data transmission can be either permanent (so-called hard errors, such as those caused by broken interconnects) or transient (soft errors, including recoverable data errors, device Unit Attention conditions, or fabric “events” such as a state change notification caused by, for example, storage entering or leaving the fabric). Hard errors must be detected and the physically damaged equipment replaced. Soft errors are handled by error checking and correction, or by simply retrying the command.
When SCSIport detects certain interconnect or device errors or conditions, it will respond by using a SCSI bus reset. On parallel SCSI, there is an actual reset line; however, on serial interconnects and RAID adapters, there is no bus reset, so it must be emulated in the best way possible. Whichever way the bus reset is done, the code path always disrupts I/O to all devices and LUNs connected to the adapter, even if the problem is related to only a single device. Such disruption requires reissuing in-progress commands for all LUNs.
In contrast, Storport can instruct the HBA to reset only the afflicted LUN; no other device on that bus is affected. If the LUN reset does not accomplish the recovery, Storport attempts to reset the target device, and if that does not work, it emulates a bus reset. (In practice, the bus reset should not be seen except when Storport is used with parallel SCSI devices.) This advanced reset capability enables configurations that were not possible (or were unreliable) in the past with SCSIport.
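The escalation order described above (LUN, then target, then bus) can be sketched as a loop that stops at the first reset that succeeds. The handler callables and the recover function are hypothetical stand-ins for whatever the adapter actually does at each level.

```python
from typing import Callable, Dict, List, Tuple

RESET_ORDER = ["lun", "target", "bus"]  # narrowest scope first


def recover(reset_handlers: Dict[str, Callable[[], bool]]) -> Tuple[List[str], bool]:
    """Try each reset level in order; stop as soon as one reports success.

    Returns the levels attempted and whether recovery succeeded.
    """
    attempted: List[str] = []
    for level in RESET_ORDER:
        attempted.append(level)
        if reset_handlers[level]():
            return attempted, True
    return attempted, False


# Hypothetical outcomes: the LUN reset fails, the target reset succeeds.
handlers = {
    "lun": lambda: False,
    "target": lambda: True,
    "bus": lambda: True,
}
levels_tried, recovered = recover(handlers)
```

Because the loop returns at the first success, the disruptive bus-level reset is reached only when both narrower resets have failed.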
Improved Clustering by Using Hierarchical Resets
In environments with large storage arrays, or in clustering environments designed to keep applications highly available, a bus reset compromises the goal of high data availability. In the case of clustering, without a multipathing solution, none of the servers is able to access the shared storage while the bus is unavailable or the devices are being reset. Since clustered servers must use a reservation system to gain sole access to the shared storage, a bus reset will result in the loss of all reservations. This is a costly loss, since recovery of clustered systems takes considerable system resources. In other cases, the reset is actually issued by the cluster disk driver to break the reservation on devices that must be moved to a standby server. The same error recovery mechanism described earlier is used to clear the reservations on individual LUNs, as needed, during the failover process.
(An interesting side effect of the bus reset mechanism used by SCSIport is that any servers that are using shared connections to storage, such as you would expect on a SAN, can have their I/O operations cancelled. This, in turn, leads to further bus resets as the non-clustered servers have to clear and retry their missing I/Os.)
A further advantage to the hierarchical reset model is that cluster configurations that were not supported in the past can now work reliably. These include boot from SAN with the same adapter used for the cluster interconnect and supporting tape devices on the shared interconnects as well. With SCSIport, these configurations require additional HBAs.
Fibre Channel Link Handling
Critical to managing Fibre Channel SANs in both single and multipath configurations is ensuring that interconnects and links are monitored for problems. Errors that cannot be resolved by the hardware must be passed up the stack for resolution by the port driver. While it is possible to design the miniport driver to resolve such errors, many miniport solutions do not function predictably in a multipathing environment, and in some cases (such as when attempting to retrieve page information from a SAN) the miniport driver may not resolve interconnect errors correctly even in a single-path environment.
The Storport driver contains two new link status notifications, LinkDown and LinkUp. If a link is down, the miniport driver notifies the port driver and optionally identifies the outstanding I/O requests. The port driver pauses the adapter for a period of time; if the link comes back up, the port driver retries the outstanding I/O requests before sending any new ones. If the link does not come up during the specified period of time, the port driver must fail the I/O requests. In a multipath environment, the multipath driver then attempts to issue the failed commands on another available path.
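The decision the port driver makes after a LinkDown notification can be modeled as follows. This is a sketch under stated assumptions: the function names, the pause limit expressed in seconds, and the path names are invented for the example and are not part of the Storport interface.

```python
from typing import Dict, List, Optional


def resolve_link_down(outstanding: List[str],
                      link_up_after: Optional[float],
                      pause_limit: float) -> Dict[str, List[str]]:
    """Decide the fate of outstanding requests after a link-down notification.

    link_up_after is the time (in seconds) until the link returns, or None if it
    never does. If the link returns within pause_limit, the outstanding requests
    are retried before any new ones are sent; otherwise they are failed.
    """
    if link_up_after is not None and link_up_after <= pause_limit:
        return {"retry": list(outstanding), "fail": []}
    return {"retry": [], "fail": list(outstanding)}


def reissue_on_alternate_path(failed: List[str],
                              path_is_up: Dict[str, bool]) -> Dict[str, List[str]]:
    """Multipath step: send failed requests down the first path that is still up."""
    for path, is_up in path_is_up.items():
        if is_up:
            return {path: list(failed)}
    return {}  # no alternate path available


restored = resolve_link_down(["io1", "io2"], link_up_after=5.0, pause_limit=30.0)
timed_out = resolve_link_down(["io1", "io2"], link_up_after=None, pause_limit=30.0)
rerouted = reissue_on_alternate_path(timed_out["fail"], {"pathA": False, "pathB": True})
```

In the first call the link returns within the pause period and both requests are retried; in the second the requests are failed and then reissued on the surviving path.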
Ability to Run Deferred Procedure Calls (DPCs)
It is often desirable to perform extended processing after external events, particularly in the Fibre Channel environment. An example would be device rediscovery after receipt of a registered state change notification (RSCN). The SCSIport model prescribes “minimal” processing once an interrupt has been received from the hardware. Unfortunately, the extended processing is unavoidable, so a more efficient way to perform it at a lower priority level is necessary. Storport allows miniport drivers to use DPCs to accomplish this goal.
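The division of labor between interrupt-time work and deferred work can be illustrated with a small user-space model. It only mimics the ordering of events; the class, its methods, and the log strings are hypothetical and bear no relation to the kernel DPC routines themselves.

```python
from collections import deque
from typing import List


class DeferredWorkQueue:
    """Models minimal interrupt-time work followed by deferred extended processing."""

    def __init__(self) -> None:
        self._pending = deque()
        self.log: List[str] = []

    def interrupt(self, event: str) -> None:
        # Interrupt time: acknowledge the event and queue it, nothing more.
        self.log.append(f"isr:ack:{event}")
        self._pending.append(event)

    def run_deferred(self) -> None:
        # Later, at lower priority: do the lengthy work (for example, rediscovery).
        while self._pending:
            event = self._pending.popleft()
            self.log.append(f"dpc:rediscover:{event}")


work = DeferredWorkQueue()
work.interrupt("rscn-1")
work.interrupt("rscn-2")
work.run_deferred()
```

Both notifications are acknowledged immediately, and the rediscovery for each runs only afterward, which is the ordering a deferred procedure call is meant to provide.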
An important part of the Windows operating system design is the use of the “registry,” a configuration database. SCSIport does not allow free access to the registry from a miniport driver. A single string can be passed to the miniport driver, which must then parse that string to extract adapter-specific parameters. Furthermore, SCSIport cannot guarantee that multiple adapters using the same miniport will be able to use different sets of parameters. The total length of the parameter string passed is limited to 255 characters.
The Storport model allows registry access from the miniport in a much less restricted fashion. One routine can be used to query for specific parameters in any location in the system hive of the registry; writing back to the registry is also supported. This solves the problem of adapter-specific parameters, such as persistent binding information or queue depth limits.
Fibre Channel Management
The SNIA HBA Application Programming Interface is an important industry-led effort to allow management of FC adapters and switched fabrics. Although it is also fully supported for Fibre Channel adapters using SCSIport, the Microsoft implementation of the HBA API is a required component of any Storport implementation written for FC adapters. This ensures the greatest level of compatibility with Microsoft management initiatives and support tools. The Microsoft implementation eliminates vendor-supplied libraries and the complex registration process. Based on a WMI infrastructure in the miniports, this interface can also be used directly by tools and command line utilities. Important enhancements over the more common implementations include full support for true asynchronous eventing and remote operation (the ability to run utilities from a different host system or management console). Another important part of the WMI infrastructure is the ability to set security directly on individual operations: a user can be given monitoring rights but not the ability to set adapter or switch parameters (also known as role-based administration).
Easy Migration to Storport
Storport has been designed with a miniport interface similar to that of SCSIport, making the transition from SCSIport to Storport straightforward. The details of how to port from SCSIport to Storport are presented in the Microsoft Driver Development Kit.
Note that Storport supports Microsoft® Windows® Plug and Play compliant drivers. Legacy SCSIport miniports need more adaptation to work under the Storport model.
The performance of the port driver varies not only with the capabilities of the miniport driver, but also with the system and RAID configuration, the storage adapter cache settings, the I/O queue depth, and the type of I/O operation. The rest of this section provides a brief review of how these various factors affect performance.
System configuration. Adding more physical RAM helps ensure that the server accesses data in RAM cache, rather than from disk. With host-based data caching, I/O requests to the file are intercepted by the caching system. If the request is an “unbuffered” WRITE, data is sent directly to the disk device (without any increase in speed); if the request is a READ and the data is in memory cache, the response is very fast (no disk I/O is necessary).
RAID configuration. I/O processing performance depends both on the type of redundancy that is used, and the number of physical disks across which the I/O load is spread. (The greater the number of disks, the better the performance, since multiple disks can be accessed simultaneously.) Note that RAID-10 gives the fastest I/O performance while still supporting redundancy.
Controller cache settings. I/O performance is strongly impacted by whether or not the storage device can cache data, since caching gives better performance. Adding faster or more I/O controllers also improves performance. In the case of HBA RAID adapters, caching on the adapter also improves performance.
I/O queue depth. Past a certain threshold, the more I/O requests there are in queue for each device, the slower the performance. Below that threshold, performance may actually increase, as the storage device can effectively reorder operations for greatest efficiency. A subjective measure of device “stress” is I/O load. According to StorageReview, a light load is 4-16 I/Os, moderate is 16-64, and high is 64-256. Consult product documentation for optimal queue depth for specific storage devices.
File Type and Use. Files vary in their size and the extent to which they are used (as much as 95% of all I/O activity occurs with fewer than 5% of all files), both of which impact performance.
Type of I/O operation. There are four types of I/O operation: random writes (RW), sequential writes (SW), random reads (RR) and sequential reads (SR). I/O read requests can be processed very rapidly with host-based data caching. Read and write performance can be improved by caching on the storage controller. (Write requests can be written to fast cache memory before being permanently written to disk.) In many cases, caches can be tuned to perform better for the workload (read vs write, random vs sequential).
Disk Fragmentation. Just as with a single disk, files stored on RAID arrays can become fragmented, resulting in longer seek times during I/O operations.
Measuring Storport Performance
Benchmarking is one objective way to measure I/O performance. While standardized tests allow direct comparison of one configuration with another, it is also important to test workloads that represent the true data transfer patterns of the applications being considered. For example, if the intended application is Microsoft® Exchange, which has 4KB I/Os, measuring 64KB sequential I/O performance is not meaningful. The performance results presented in the following section are indicative of the changes that are possible when a miniport driver has been properly written to take advantage of Storport capabilities. In all cases, the adapters used also had SCSI miniport drivers available, so legitimate comparisons can be made.
Host-Based RAID Adapter
Using a RAID adapter on the host, random writes are improved by 10-15% and sequential writes by 20-30%, although, in both cases, this effect becomes less dramatic as the size of the transfer increases. Random reads see a 10-15% improvement and sequential reads a 20-40% improvement with Storport, although this effect lessens as the total size of the data transferred grows larger. (Typical Windows I/O transfer sizes include 512 bytes (for example, file system metadata), 4 KB (Exchange), 8 KB (SQL), and 64 KB (file system data).)
(Intel’s Iometer program, a common benchmark tool, was used for this case study to assess Storport performance. The tool measures the average number of I/Os per second, MB/s (megabytes per second, or equivalently, total I/O per second x unit size), and CPU effectiveness (I/O per % CPU used) for different I/O request types and for differing amounts of data.)
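The two derived metrics mentioned above follow directly from their definitions, as this sketch shows. The function names are made up, a megabyte is assumed here to be 1,048,576 bytes, and the sample figures are hypothetical rather than taken from the case study.

```python
BYTES_PER_MB = 1024 * 1024  # assumption: binary megabytes


def throughput_mb_per_s(iops: float, io_size_bytes: int) -> float:
    """MB/s = total I/Os per second x unit size."""
    return iops * io_size_bytes / BYTES_PER_MB


def cpu_effectiveness(iops: float, cpu_percent: float) -> float:
    """I/Os per second delivered per percent of CPU used."""
    return iops / cpu_percent


# Hypothetical measurement: 10,000 I/Os per second of 4 KB each at 25% CPU.
mb_per_s = throughput_mb_per_s(10_000, 4096)
io_per_cpu = cpu_effectiveness(10_000, 25.0)
```

The same I/O rate at a larger transfer size yields proportionally more MB/s, which is why results are reported separately for each request size.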
Figure 9 summarizes overall system efficiency as measured by I/O per second over percent of CPU.
Figure 9. Storport I/O Throughput Efficiency
Storport (triangles) is about 30-50% more efficient than SCSIport (diamonds), passing through more I/O per second than SCSIport and using less CPU to do so.
Storport is the new Microsoft port driver recommended for use with hardware RAID storage arrays and high performance Fibre Channel interconnects. Storport overcomes the limitations of the legacy SCSIport design, while preserving enough of the SCSIport framework that porting to the Storport model is straightforward for most developers. Storport enables bidirectional (full duplex) transport of I/O requests, more effective interactions with vendor miniport drivers, and improved management capabilities. Storport should be the port driver of choice when deploying SAN or hardware RAID storage arrays in a Windows Server 2003 environment.
For more information on Windows Drivers see the Microsoft Developer Network (MSDN) website at http://msdn.microsoft.com/.
For more information regarding the Driver Development Kit, see “Microsoft Windows Driver Development Kits” on the Microsoft Hardware and Driver Central website (http://go.microsoft.com/fwlink/?LinkId=19866).
To locate appropriate support contacts, see “WHQL Support Contacts” on the Microsoft Hardware and Driver Central website (http://go.microsoft.com/fwlink/?LinkId=22256).