This section describes the features added to Windows 2000 to decrease the number of planned reboots (see Figure 2. NT 4 cause of system reboots). Planned reboots are outages required for system management activities and, as such, can be scheduled for periods with minimum impact on the end user. While a planned outage gives the system manager control over its timing, it still impacts the end user and contributes to the overall cost of ownership of the system. More importantly, planned outages provide opportunities for errors, as the activity is performed outside the control of the operating system.
The type of planned outage that provides the greatest opportunity for errors (resulting in subsequent system reboots to correct any problems) is the re-configuration of the system through the installation or removal of hardware, operating system components or applications.
3.1 Software installation and configuration
The installation of software (both operating system and application) presents a number of problems that can significantly affect system dependability, specifically:
It can be complex and error prone.
Installation processes often assume the user has in-depth knowledge of both the operating system and the application. The installer, in answering a number of technical questions, can accidentally re-configure the system or application into an error-prone state.
It can corrupt other applications or the operating system itself.
Early releases of the Windows operating system allowed shared system files to be overwritten by applications; this has been termed "DLL hell".
The Windows 2000 development team focused on improving the installation process of both the operating system and applications. Previous releases of the Windows operating system allowed shared system files to be overwritten by non-OS installation programs. The possibility of a system file being overwritten required the system to reboot following an application installation, to allow the system to recognize potentially new system files. If system files were overwritten, users would often experience unpredictable system behaviour, ranging from application errors to operating system hangs or crashes. This problem affects several types of files, most commonly dynamic link libraries (DLLs) and executable files (EXEs).
A significant focus was placed on both simplification and protection of the installation process. Additional effort was placed on working with partners to improve the quality of device drivers and also applications to prevent corruptions occurring (see Section 5. Testing).
The Windows Installer (WI) has been improved and complemented by the development of Windows File Protection (WFP), an independent but related mechanism.
Windows Installer (WI) is a service that manages the installation and removal of applications, providing a basic disaster recovery mechanism through rollbacks. WI keeps a catalogue of all application install files. If, on execution of the application, a file no longer exists, WI automatically restores the missing file, allowing the application to run. This self-healing function reduces the need to re-install applications if they become corrupt.
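The self-healing flow can be sketched as follows. This is a minimal Python analogue of the idea only: the hash-based catalogue and the cache directory layout are my own illustration, not WI's actual on-disk format.

```python
import hashlib
import shutil
from pathlib import Path


def build_catalogue(install_dir):
    """Record a hash for every installed file (a sketch of the WI
    catalogue idea; this layout is hypothetical, not WI's format)."""
    root = Path(install_dir)
    return {p.relative_to(root): hashlib.sha256(p.read_bytes()).hexdigest()
            for p in root.rglob("*") if p.is_file()}


def self_heal(install_dir, cache_dir, catalogue):
    """Before launch, restore any missing or corrupt file from the
    install cache, mirroring WI's self-healing check."""
    for rel, digest in catalogue.items():
        target = Path(install_dir) / rel
        if (not target.exists()
                or hashlib.sha256(target.read_bytes()).hexdigest() != digest):
            target.parent.mkdir(parents=True, exist_ok=True)
            shutil.copy2(Path(cache_dir) / rel, target)
```

The design point is that the check happens at application launch, so a deleted or corrupted file is repaired without a full re-installation.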
Windows File Protection (WFP) significantly enhances the WI. WFP verifies the source and version of a system file before it is initially installed, preventing the replacement of protected system files. WFP additionally runs in the background and protects all files installed by the Windows 2000 set-up program. By default WFP is always enabled and only allows protected files to be replaced when installing the following:
Windows 2000 service packs using Update.exe.
Hotfix distributions using Hotfix.exe.
Windows Update.
Windows 2000 Device Manager/Class installer.
During the installation process WFP verifies the digital signature to validate the new files (see Section 5.3 for details of driver signing). If a file is corrupt, WFP retrieves the correct file either from a disk cache of frequently replaced DLLs or from the original installation media.
3.2 Operating system re-configuration
In previous versions of Windows NT many operating system configuration changes required a system reboot. The system would not reboot automatically but would request the system manager to perform the task. The system manager might bundle a number of actions prior to rebooting the system, which could have unforeseen impacts on system dependability. Reducing the number of system management activities that require a system reboot is therefore assumed to reduce both planned and unplanned outages.
Windows 2000 redesigned many of the subsystems removing the reboot requirement from 60 configuration change scenarios, some examples are:
Changing network settings, including:
Changing IP settings,
Changing IP addresses (if more than one network interface controller is present),
Changing IPX frame type.
Network addressing, including:
Resolving IP address conflict.
Switching between static and DHCP IP address selections.
Adding or removing network protocols, such as TCP/IP, IPX/SPX, NETBEUI, DLC and AppleTalk.
Adding or removing network services, such as SNMP, WINS, DHCP, and RAS
Enabling or disabling a network adapter.
Plug and play features for installation or removal of devices e.g.
Network interface controllers.
Modems.
Disk and tape storage.
Universal serial bus devices (e.g. mice, joysticks, keyboards, video capture and speakers).
Installation of applications e.g.
Internet Information Server.
Microsoft Transaction Services.
Microsoft SQL Server 7.0
Microsoft Exchange 5.5
Driver development kit (DDK).
Software developer’s kit (SDK).
Microsoft Connect Manager.
System management activities such as
Page file management (increasing initial and maximum size).
Extending and mirroring NTFS volumes.
Loading and using TAPI providers.
Docking and undocking laptop computers.
Changing performance optimisation between applications and background services.
3.3 Service Pack Slipstreaming
The Windows 2000 installation process allows users to create a single install image, on a network share, containing Windows 2000 and applicable services packs. Users installing from this image do not require a system reboot to install the service pack, as these will already have been "slipstreamed" into the initial install.
3.4 Hardware Install and Configuration
Windows 2000 has a Plug and Play hardware interface allowing users to add hardware, and the associated drivers, to a system without requiring a system reboot. When a compatible device is installed on Windows 2000, the Plug and Play manager automatically recognizes the hardware and loads the appropriate device drivers. While loading, Plug and Play allocates the resources the driver needs to operate correctly.
3.5 Removing Scheduled Reboots
Analysis of NT 4.0 data highlighted numerous customers who were regularly scheduling reboots on their computers (this practice is not unique to NT). Frequently, IT staff could not articulate why they were performing the reboots other than a 'belief' that the machine performs better. For a number of early NT 4.0 releases scheduled reboots did improve behaviour, e.g. by clearing resource leaks or corruption caused by applications. These issues were addressed in NT 4.0 through improvements in driver and application quality; Windows 2000 addressed them further through the following features.
The job object API allows application developers to manage a set of processes as a single unit. Developers can place constraints on the unit, such as maximum CPU and memory utilization. These constraints apply to processes explicitly included in the job object and to any child processes spawned by the included processes. If an application attempts an operation that would exceed one of the constraints, the call is silently ignored. The job object API also reports the resource usage of all of the processes and child processes in a job.
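The job object idea of constraining processes has no direct POSIX equivalent, but a rough per-process analogue can be sketched with resource limits. This is an illustrative sketch only (Python, POSIX-only; the helper name is my own, and a real job object constrains a whole group of processes, not one child):

```python
import resource
import subprocess
import sys


def run_with_memory_cap(cmd, max_bytes):
    """Run a child process under an address-space cap -- a loose,
    per-process analogue of a job object's memory constraint."""
    def set_limit():
        # applied in the child just before exec (POSIX only)
        resource.setrlimit(resource.RLIMIT_AS, (max_bytes, max_bytes))
    return subprocess.run(cmd, preexec_fn=set_limit).returncode


# a well-behaved child runs normally under a 1 GB cap; a child that
# tries to allocate 2 GB fails instead of exhausting system memory
ok = run_with_memory_cap([sys.executable, "-c", "pass"], 1 << 30)
```

As in the job object case, the constrained process itself decides nothing: the limit is imposed from outside and the offending allocation simply fails.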
IIS restart allows system administrators to restart the IIS service programmatically without rebooting the operating system. This was previously difficult because of the dependencies of the multiple servers that make up IIS, and the out of process applications that IIS spawns. IIS restart understands these dependencies and handles them transparently to the user.
4. Unplanned Reboot Reduction
Figure 4. Windows 2000 architecture.
Unplanned reboots (see Figure 3. NT4 cause of system failures) occur when the system becomes unstable; in the worst case this results in a blue screen or system hang. Two major causes are defects in the operating system itself, and defects in applications or drivers that overwrite operating system code running in the kernel.
The focus during Windows 2000 development was to place as many new features as possible into user mode and to improve the verification of software that resides in kernel mode.
4.1 Application Failures
Unplanned reboots can occur when applications either corrupt the in-memory operating system or stop responding to requests for service in a timely manner. Windows 2000 incorporates a number of features to address these problems without requiring an operating system reboot.
Corruption is addressed through kernel memory isolation: the kernel memory space is marked read-only to user-mode applications, giving the kernel greater isolation from errant applications.
An addition to the task manager called "end process tree" allows users to kill a process and all of its child processes. In Windows NT 4, if an application stopped responding, the system manager could either determine the set of processes associated with the application and kill each manually, or reboot the system. Frequently the simpler option of a system reboot was taken, with the side effect of reducing the availability of the system and of other applications running on the same machine.
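A POSIX sketch of the same "end process tree" idea uses a process group so that one signal reaches the parent and every child it spawned. The function name is mine, and this is an analogue of the concept, not the Task Manager mechanism:

```python
import os
import signal
import subprocess


def end_process_tree(proc):
    """Terminate a process and all of its descendants by signalling
    the whole process group (POSIX sketch of 'end process tree')."""
    os.killpg(os.getpgid(proc.pid), signal.SIGTERM)


# start_new_session=True gives the child its own process group, so the
# shell and the background sleep it spawns are killed together
proc = subprocess.Popen(["sh", "-c", "sleep 60 & wait"],
                        start_new_session=True)
end_process_tree(proc)
proc.wait()
```

The point mirrored here is the one in the text: without a grouping mechanism the administrator must hunt down each child individually, and rebooting becomes the tempting shortcut.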
The Internet Information Server (IIS) has become more robust with respect to application failures. By default, all web applications in Windows 2000 run in a separate process from the IIS. Users can elect to run each web application in its own process for further isolation. If any in-process application causes IIS to fail, there is a utility that will cleanly restart all of the IIS services.
4.2 Operating system failures
Microsoft has addressed the issue of operating system failures with a combination of testing, verification and certification. These topics are covered in detail in Section 5.
5. Testing, Verification and Certification
Microsoft invested over $160M in testing Windows 2000, to ensure its dependability, through process and tool development described in this section. The focus has not only been to provide tools and processes for Microsoft developed software but also to provide similar processes to our partners to help improve the quality of their software. A certification process has been put in place to enable customers to identify which applications have been verified using these processes.
5.1 Design Process Improvements
In an effort to improve the design process and code quality prior to testing, Microsoft uses an in-house source code-scanning tool called Prefix. Prefix scans source code and identifies potential problems, such as use of uninitialised variables or neglecting to free pointers. Prefix identified a large number of issues that were addressed prior to releasing Windows 2000.
A number of tools have been developed for both Microsoft developers and also independent software vendors. These tools focus on software development in kernel and also in user mode.
The following tools and processes have been developed to assist developers to write code in Kernel-Mode.
The Windows NT 4.0 kernel contains a fully shared pool of memory from which allocations are made and released back. Common driver errors are writing to memory outside of the driver's allocation, resulting in corruption, and failing to release memory on completion of a task, termed a memory leak. Both errors are difficult to identify. To assist the developer, pool tagging sets a guard page at the edge of the allocated memory; if the driver attempts to write into the guard page during testing, the developer is alerted.
The driver verifier is a tool that can monitor one or more kernel-mode drivers to verify that they are not making illegal function calls or causing system corruption. Driver Verifier performs extensive tests and checks on the target drivers. It ships as part of the Windows operating system and can be invoked by typing "verifier" at the command prompt. Its settings can be configured using a graphical user interface (GUI) tool provided as part of the DDK [8].
The device path exerciser (Devctl) [9] tests how drivers handle errors in code that uses the device. The driver can be invoked either synchronously or asynchronously through various I/O interfaces to validate that it manages mismatched requests.
The following tools and processes have been developed to assist developers to write code in User Mode.
A corresponding tool helps developers find memory access errors in non-kernel-mode software; it works in a similar manner to pool tagging.
5.2 Testing
Eight million machine hours of stress tests were performed in an effort to eliminate bugs from Windows 2000. These tests can be divided up into two categories: long haul stress and overnight stress.
Fifty machines were dedicated to running long haul stress tests for at least two weeks before upgrading to a newer system build. These tests were designed to find slow memory leaks and pool corruptions. The overnight stress tests covered 60 usage scenarios across a variety of Windows 2000 components. In the weeks before Windows 2000 was released to manufacturing, the Windows team ran stress tests on over 1,500 machines nightly.
Each pre-release version of Windows 2000 was installed onto Microsoft's production data centres; the Microsoft.com website was a very early adopter of Windows 2000. Having beta versions of Windows 2000 running on internal systems, in production environments, helped in the identification and correction of bugs unique to real-world applications under high load.
Along with the automated testing described above, Microsoft employed a full time team to review code in an effort to find potential software problems.
As can be seen in Figure 3: NT 4 cause of system failures, hardware drivers, software drivers and anti-virus software drivers cause 44% of NT4 OS crashes. Microsoft has created the following processes to help third parties improve the quality of their drivers.
Driver Development Kit [8] has been improved to include more production quality samples and extra interface documentation on usage.
Vendors of anti-virus and other software that includes file system filter drivers have been participating in special development labs. In these labs, vendors and Microsoft developers use some of the verification tools described below to identify problems and jointly work on solutions. In addition, gathering the vendors together allowed them to discover and address interoperability problems.
5.3 Verification and certification
Microsoft has developed a series of tests (Windows Hardware Quality Labs, WHQL) to verify that hardware meets the design specifications published in the Microsoft hardware design guidelines. Part of this process includes applying the Driver Verifier tool to the drivers associated with the hardware. Driver Verifier places drivers in a constrained memory space, injecting random faults and low-memory conditions to stress them. Drivers that exhibit problems when run with Driver Verifier are not certified. Drivers that pass the WHQL tests are signed by attaching an encrypted digital signature to the code file, which Windows 2000 recognizes.
Windows 2000 can be set to one of three modes governing how a driver's signing status is treated during installation; these are:
Warn: advises the user that the driver being installed has not been signed, but allows the user the option to install it anyway.
Block: prevents all unsigned drivers from being installed.
Ignore: allows any driver to be installed, irrespective of whether it is signed or not.
6. Availability improvements
Improving reliability will not necessarily improve availability, as availability is also affected by the system recovery time. The following tools/processes have been developed to minimize system down time.
The recovery console is a command line console utility available to system administrators. This is useful to allow files to be copied from floppy or CD-ROM to the hard drive or to reconfigure a service that is preventing the system from booting.
To assist the system administrators diagnose system problems Windows 2000 can be started using safe mode boot. In safe mode only the default hardware settings (mouse, monitor, keyboard, mass storage, base video, default system services, and no network connections) are used. In safe mode the user can request the system to boot under the last known good configuration.
If an application stops responding the end process tree kills the application and all child processes.
Automatic System and Service Restart.
The system can now be set to automatically reboot on system failure. The system can be configured to write the memory contents to a file to assist the system administrator to determine the cause of the failure.
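As I recall the mechanism, both behaviours are driven by values under the CrashControl registry key. The fragment below is illustrative of the shape of that configuration, not an authoritative listing; value data shown are assumptions:

```
HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\CrashControl
    AutoReboot        REG_DWORD      1                      ; reboot automatically after a stop error
    CrashDumpEnabled  REG_DWORD      1                      ; write a memory dump for post-mortem analysis
    DumpFile          REG_EXPAND_SZ  %SystemRoot%\MEMORY.DMP
```

The same settings are exposed through the System control panel's Startup and Recovery dialog, so administrators rarely need to edit the registry directly.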
Administrators can define a specific restart policy for each NT service. The policy allows for the specification of scripts for system recovery, notification of failures via e-mail, etc.
IIS restart provides a single step to restart the whole of the IIS process.
6.1 System Dumps
Windows 2000 has made system dumping mandatory, providing three system dump options: full memory dump, kernel dump and mini-dump. The full dump is the same as that found in Windows NT 4. The kernel dump does not record any memory in user space, significantly reducing the size of the dump on large-memory systems. The mini-dump is a 64KB file containing information including the stop code and the loaded-modules list.
6.2 Faster Disk Integrity checking
For system problems causing the disk volume to become corrupt, the chkdsk program runs automatically. On large disk volumes, checking the integrity of the volume can greatly increase the downtime of a system. In Windows 2000, the performance of the chkdsk program has increased by between 4x and 8x. For larger volumes, the performance gain is even greater.
7. System Dependability Research
During the course of the development and testing of Windows 2000, the development team spoke to a number of research institutes to investigate the relevance of their methodologies to the development and release of the operating system. The views on the major research areas were as follows.
7.1 Software Reliability Engineering (SRE [10])
SRE techniques are used within Microsoft (e.g. within Microsoft Office development), so their relevance to an operating system was investigated. A number of the SRE processes were viewed as capable of making a positive impact on the software development process; the limitations of the technique from an operating system perspective were:
Operational Profiles.
Attempting to develop an operational profile for an operating system is difficult if not impossible. While application usage is relatively bounded, operating system usage is not. Testing the operating system with Microsoft's applications does provide a limited operational profile, but not only are there countless other applications using particular Windows APIs in different ways, there is also an effectively infinite variety of hardware, and combinations of hardware, on which the operating system and applications are expected to perform.
Release process/criteria.
The system test development of Windows 2000 was based upon regression tests from NT 4, with new tests developed for additional Windows 2000 features. The tests were continually developed up to the release of the product, making the SRE release criteria process impossible to apply. Updating of testing was necessary because
Operating systems features are not necessarily used as planned (this is linked to the problems with operational profiles). Therefore the tests had to adjust to reflect the different usage and failure profiles.
The testing was focused not only on operating systems failures based on correct usage but also validating stability during incorrect usage. Again the beta testing kept highlighting new ways that customers configured their systems in unforeseen and incorrect ways.
New hardware and versions of drivers are continually being sent to Microsoft for testing. These occasionally required changes to the system test process.
7.2 Fault injection
A number of fault injection techniques were considered, none of which were found appropriate to the verification of Windows 2000 (although some are used for product testing within Microsoft). The issue is the large number of APIs in Windows 2000, each with multiple possible fault categories. Even given unlimited time for verification, the rate of change of new drivers and peripheral devices means the items under test change daily. Without an operational profile it is not possible to minimize the number of faults that must be applied.
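To make the combinatorics concrete, a toy fault-injection wrapper shows how a single API call is perturbed. This is my own illustration, not a Microsoft tool: the names, the fault probability and the choice of `OSError` are all assumptions. The paper's point is that such a wrapper must be multiplied across every API, every fault category and a daily-changing set of drivers:

```python
import random


def with_fault_injection(api, p_fail, rng,
                         error_factory=lambda: OSError("injected fault")):
    """Wrap an API so each call fails with probability p_fail --
    a toy sketch of per-API fault injection."""
    def wrapper(*args, **kwargs):
        if rng.random() < p_fail:
            raise error_factory()
        return api(*args, **kwargs)
    return wrapper


# exercise a trivial "API" under a 30% fault rate with a fixed seed
flaky = with_fault_injection(lambda x: x * 2, 0.3, random.Random(42))
failures = 0
for i in range(1000):
    try:
        flaky(i)
    except OSError:
        failures += 1
```

Even this single wrapped call needs a thousand exercises to characterize one fault category; scaled to thousands of APIs the test space becomes unmanageable without an operational profile to prune it, which is exactly the difficulty the text identifies.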
7.3 Research Opportunities
During the development of Windows 2000 Microsoft looked for information in the following areas and found little or none available. Additionally, where information was available, it was often focused on the fault-tolerant space rather than the high-availability area, which presents its own unique challenges. The specific areas that appear to have limited research focus are:
Characterizing System dependability for
Individual systems.
Multiple systems tightly coupled (e.g. clusters).
Multiple systems loosely coupled.
Applications (simple and complex).
Characterizing system (for all system types defined above) dependability drivers especially in the areas of
Operating system installations.
Application installations.
Configuration changes.
Setting release criteria for complex product development.
Testing methodologies for highly available products.
8. Conclusions
The initial feedback on the reliability of Windows 2000 in the field verifies the success of the testing and release process used for this product. The number of bugs submitted is proportionally the lowest for any operating system release from Microsoft. Setting aggressive dependability goals, based on behaviour at customer sites, required Microsoft to analyse and address total system dependability drivers.
Microsoft will continue to develop its development, test and release processes by analysing the effectiveness of the release process against the behaviour of the product in the field (i.e. how well the $162 million spent on testing was spent). A more worrying factor is the limited relevance that system dependability research has to the development and release of this complex, highly available product.
9. References
[1] IFIP WG 10.4 on Dependable Computing and Fault Tolerance, http://www.dependability.org/
[2] Microsoft’s Enterprise Framework, http://www.microsoft.com/msf.
[3] IT Infrastructure Library (ITIL), documented by the Central Computer and Telecommunications Agency (CCTA), see http://www.itil.co.uk/.
[4] Windows 2000 Event Log Analyser, http://win2kready/eventloganalyzer.htm.
[5] B. Murphy, T. Gent, “Measuring system and software reliability using an automated data collection process”. Q & R Engineering International Vol 11 341-353 (1995).
[6] Jim Gray. “A Census of Tandem Systems Availability between 1985 and 1990”. IEEE Transactions on Reliability, 39(4), October 1990.
[7] Microsoft, Guidelines for using Driver Verifier, http://www.microsoft.com/hwdev/driver/driververify.htm.
[8] Microsoft, Drivers development kits, http://www.microsoft.com/ddk/.
[9] Microsoft, Device Path Exerciser, http://www.microsoft.com/hwtest/testkits/.
[10] Musa J. D., "Software Reliability Engineering: More Reliable Software, Faster Development and Testing", McGraw-Hill, 1998.