Abstract Windows 2000 development had two apparently contradictory goals, to add functionality whilst improving dependability to the end customer. This paper describes the difficulty in developing the processes to characterizing the dependability goals for Windows 2000. The paper further describes how Windows 2000 met those goals through addition of dependability features and comprehensive testing in laboratories and production environments. These goals could not be achieved solely through improvements to the operating systems but also required addressing additional issues, such as improving the installation process and quality of driver software. The development of Windows 2000 highlighted a number of research opportunities, discussed later in the paper, to impact the development and release of highly dependable software.
One of the major focuses of Windows 2000 development was in the area of system dependability. To ensure developers focused on this area the company wanted to set exacting dependability goals. Unfortunately there are no industry standards for characterizing system dependability (an IFIP special interest group, of whom Microsoft is a member, is investigating the development of a dependability benchmark ); therefore any goal setting for a particular operating system has to be relative to other versions of operating systems. Microsoft set the Windows 2000 dependability goals to be more dependable at its release to systems running NT 4.0. This added an initial complication as the goal was set relative to a moving target as the dependability of NT 4.0 continued to improve through the release of service packs. A further complication in meeting any goals was the significant amount of functionality added to Windows 2000. Adding functionality, to an operating system, tends to decrease the customer’s perception of dependability through newly induced bugs and problems understanding and managing this new functionality.
Setting a user focused dependability goals required Microsoft to radically rethink what was required to achieve these goals. User focused goals require an understanding of how customers perceive system dependability and its drivers rather than solely focusing of technical issues such as bug counts.
Microsoft introduced a program to characterize the current status of Windows NT 4.0 dependability, focusing on the end user perception. A large number of customers and partners were interviewed to capture their perception of the problem areas and provide them an opportunity to request new features. These discussions occurred in parallel to a program that characterized Windows NT 4.0 dependability through monitoring and analysing the system behaviour on the end customer site.
These feedback processes highlighted the need for Microsoft to take a much more holistic view of dependability, tackling it from a system/solution perspective rather than being purely operating system centric.
Improving the dependability of the operating system was achieved through the introduction of new features to address both system failures and to reduce the number of planned system reboots. Where possible these ‘features’ were reverse engineered into previous versions of NT and windows, improving customer satisfaction but complicating the issue of goal setting. Defect removal through testing was a massive effort (costing $162 million) including stress testing in lab environments, running beta versions on production servers within Microsoft and on customers/ partners systems.
Analysis of Windows NT 4.0 data highlighted the biggest impact on availability is recovery time rather than reliability. New processes were developed to address system availability through tool improvements and providing access to knowledge-based systems.
On-site system management processes also have a big impact on system recovery and these are being addressed through publication of best system management practices. While Microsoft is continually attempting to simplify this system management it is also developing guidelines to assist customers to achieve mission critical production performance through Microsoft Operations Framework (MOF) , based upon ITIL .
This paper will discuss:
Benchmarking of Windows NT 4.0 to provide dependability goals.
Dependability feature added to Windows 2000.
Testing verification and certification.
Research opportunities identified during the development and testing of Windows 2000.
2. Benchmarking Dependability
Customer discussions yielded a wish list of features for improving the overall system dependability, not all solely focused on quality improvements in the operating system. Two common trends occurred within these discussions
Perception of Windows NT’s dependability varied significantly between customers (and for large customers between sites).
Lab measurements of Windows NT dependability differed significantly from the customer’s perception.
Microsoft set up a monitoring program to capture the dependability of customer systems through analysis of their system logs (event, application and security). Initial analysis proved difficult due to significant deficiencies in NT’s system logging process. More importantly these deficiencies complicated on-site problem diagnosis. Greater emphasis was subsequently placed into the fault logging architecture with some additional features being added to NT 4.0 Service Pack 4 specifically:
Information on the cause of system outages.
On a reboot the system logs the reason and time of the system shutdown, identifying if a clean shutdown (event type 6006) or a blue screen occurred (event type 6008). The time the system becomes usable by an application (event type 6005) is also logged.
Operating System version and upgrade information.
Following a system reboot the version of the operating system is recorded (event type 6009) including installations of service packs (event type 4353), if relevant.
Cause of system downtime.
Event type 1001 gives a brief description of the system dump recorded on operating system crash (called blue_screen). Any related Dr Watson diagnosis of the cause of application crash is additionally captured (event type 4097).
The system writes a regular timestamp/heartbeat to the system event log indicating its availability. This is required as occasionally the cause and time of the system shutdown is not captured in the error log (in common with other operating systems) making availability calculation error prone.
The monitoring tool Event Log Analyst (ELA)  was developed to capture the system error logs from hundreds of servers simultaneously. ELA was deployed extensively throughout the internal systems within Microsoft capturing the current and historic behaviour (the historical time was dependent on the data in the error logs). The tool was further developed and deployed on a small number of customer sites. A number of Microsoft partners had developed similar internal tools to ELA, providing Microsoft with NT reliability data captured using these tools.
The analysis of the behaviour of systems on customer sites highlighted great variability in both measurement and the causal factors.
Figure 1. System reliability. The reboot rate between monitored customer sites varies by as much as a factor of 5 (see Figure 1. System reliability). Additionally the reboot rate per system on a single customer site shows a much greater variability. Discussing the results with the customer highlighted two factors
The relative reboot rate did not reflect the customer satisfaction or dissatisfaction with the operating system.
A large number of site specific and non operating system specific factors affected the monitored reboot rate.
Analysis of the collected data highlighted the difficulty of having a single value for reliability; this is also true for system availability. While the factors affecting system behaviour were similar across the monitored customer sites the impact of the factors on sites and individual systems varied significantly.
A common factor affecting the rate of system reboots across the customer sites was the high percentage of outages that were planned. A breakdown of the cause of system outages (see Figure 2. NT 4 cause of system reboots) on a single customer site running 1,180 servers, highlighting that 65% of all outages are “planned” with only 35% occurring due to system failures.
Whilst planned outages are preferable to unplanned, as the system manager can control their impact on the end users, they still represent a loss of service to the end user and a cost to the customer. Of the remaining “unplanned” outages only 14% are induced system failures. Other reasons for addressing planned outages are that they can often induce additional defects to the system.
Figure 2. NT 4 cause of system reboots. Analysis of NT 4 failures reported to Microsoft (see Figure 3. NT4 cause of system failures), identifies that 43% of failures were due to core NT defects. This breakdown also highlights the significant impact drivers have on the rate of system failures (a fact well known to most customers and in common with most other operating systems).
A major difference between this breakdown and that observed on OpenVMS systems  (and to a lesser extent previously on Tandem systems ) is the lack of system management induced failure observed, whereas 50% of OpenVMS failures were due to system management induced defects. This can be explained by the differences in the data collection methods. The analysis of OpenVMS failures was based on detailed analysis of every failure that occurred on the systems. However the NT failure breakdown is based on bug reports. Defects induced by system management activity would be unlikely to raise a bug report to Microsoft as either the system manager would have solved the issue himself or herself or the defect would have been solved by other service providers.
While none of the data collection methods adopted to measure system behaviour highlight the problem of system management induced defects, Microsoft recognizes that NT like all other operating systems suffers a large proportion of these type of failures. A number of collection methods have been attempted to better characterize the cause of failures on the customer site none have been totally successful.
Taking a simplistic approach and combining the two sets of data, the maximum opportunities for improvements in the rate of system shutdowns, through operating system code quality improvements would be 6% of all outages (%System Failure * % Core NT Failures).
Figure 3. NT4 cause of system failures. Assuming system events are independent a 6% opportunity for improvement in the reboot rate appears low but is similar to that found analysis of OpenVMS performed 10 years earlier . Whilst the reality is more complex than the calculation suggests as a percentage of the planned shutdowns can be as a result of addressing bugs in NT. Nevertheless solely identifying and correcting system bugs, will not necessarily have a major impact on overall system behaviour. This analysis highlighted.
Improvements to system dependability required Microsoft to address all systems outages, irrespective of their cause.
Producing single metrics to characterize the system reliability/ availability/ dependability is very difficult if not impossible.