Many Web publishing shops make a basic distinction between the “Webmaster” and the “System Administrator.” The Webmaster has primary responsibility for organizing and placing content on the server, for coordinating content provider tasks, for connecting the server to external databases, and for analyzing server logs. The System Administrator is responsible for the administration of the hardware, the server operating system, the maintenance of authorized users, system backup, network management, and security.
Keep in mind that these are vague conventions, not hard rules. In some sites, the Webmaster and the System Administrator are the same person. In others, a team of a dozen people may be responsible for different aspects of site management. In fact, a site may not use these labels to describe staff members. In any event, the project manager will have to decide who on the team has what responsibilities, and who is granted what level of system administrator privileges on the various servers deployed. (See Chapter 2 for a further discussion of the many roles in a Web publishing project.)
Choosing Your Server Hardware
We’ve seen the wide variety of server hardware, operating system, and server software options. Now suppose you’ve decided to buy an Intel-based server. How powerful a system should you buy?
Many sites get along just fine by buying a commodity PC intended for desktop use (with an Intel processor or with a competitive processor such as an AMD K6) and running Windows NT or Linux on the box. Vendors will try to sell you boxes marketed specifically to run as servers. What are the differences between server boxes and desktop ones?
Server PCs tend to come with greater expansion capabilities: more PCI “slots” and more room for memory.
Server PCs tend to come with SCSI disk drives built in. Today’s desktop PCs come with Ultra ATA IDE drives, whose performance is impressive but still lags behind that of modern SCSI devices.
Server PCs may offer RAID (“Redundant Array of Inexpensive Disks”) with important advantages in reliability and performance (see below).
Server PCs tend to provide special disk drive bays that make it easy to swap defective drives with the system turned off (“cold swap”) or even with the system still running (“hot swap”).
Server PCs tend to have processors (such as the Intel Xeon) and internal memory caching tuned for transaction processing.
Server PCs tend to offer error correcting memory, which most of today’s commodity PCs do not. This can improve reliability.
Server PCs offer the option of multiple CPUs, which can be exploited by Windows NT or Unix for very high performance requirements.
Server PCs tend to offer high-performance networking interfaces.
Server PCs may come with built-in high-capacity tape drives and backup software.
Let’s consider one server option, RAID, in a little more detail. RAID allows you to make efficient use of multiple relatively inexpensive disk drives to achieve better performance, reliability, or both. RAID spreads data across multiple physical disk volumes to achieve these goals. The industry has defined these “levels” of RAID:
RAID 0 is also known as “data striping.” This level improves performance by spreading data across two or more drives. Because data can be retrieved in parallel from multiple physical disks, performance can improve dramatically.
RAID 1 is also known as “mirroring.” Here you run pairs of identical drives. All data written to disk is written as a mirror image on both drives. Here you are gaining total redundancy for your data at the expense of doubling the amount of disk you must buy for a given amount of content. If a mirrored drive dies, your server can continue running until you are able to install a replacement drive; then, the new drive is automatically brought back “in sync” with the remaining one. Mirroring offers some performance improvements when both drives are operational.
RAID 5 provides “data striping with parity.” This provides the performance benefit of striping along with redundant information spread across the disk volumes, so that your system can continue running if a single drive fails. RAID 5 isn’t as simple as mirroring but you buy reliability at a far lower cost.
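To see why parity lets a striped array survive the loss of one drive, here is a minimal Python sketch. It is an illustration only, not how any particular RAID controller is implemented: the block contents are made up, and the parity is kept in one fixed place for simplicity, whereas real RAID 5 rotates parity across all the drives. The parity block is the byte-wise XOR of the data blocks, so any one missing block can be recomputed from the others.

```python
from functools import reduce


def xor_blocks(blocks):
    """Return the byte-wise XOR of several equal-length byte strings."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


# Three data blocks that would sit on three different drives (made-up content).
data_blocks = [b"WEB1", b"SITE", b"DATA"]

# The parity block would be stored on a fourth drive.
parity = xor_blocks(data_blocks)

# Simulate losing the second drive, then rebuild its block from the survivors.
survivors = [data_blocks[0], data_blocks[2], parity]
rebuilt = xor_blocks(survivors)

print(rebuilt == data_blocks[1])
```

The cost of this protection is one drive's worth of capacity for the whole set, rather than the doubling that mirroring requires.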
The alternative to RAID is to use a single disk volume or a set of volumes in the conventional way – each disk holds its own data entirely, and no data within a single file is spread across drives. This is perfectly adequate for many Web sites, and probably adequate for the majority of CI projects.
Other considerations to look at when buying a server include the speed of the processor and the amount of memory. In recent years memory prices have declined so dramatically that it makes little sense to buy a server with less than 128 megabytes (or even 256 megabytes or more) of system memory. Buying more memory can mean a dramatic improvement in performance, especially if you have many concurrent users or are doing a great deal of live content work.
The case for buying the most powerful processor – or multiple processors – is not so clear. Vendors charge a premium for the very fastest CPUs. Most CI sites have no need to buy the very latest CPU at the very highest clock speed. Systems that support multiple CPUs cost more, and each additional CPU adds to the price. In addition, depending on your software environment, you may not be able to take full advantage of multiple CPUs. Think twice before you spend hundreds or thousands of dollars on additional CPU horsepower for your server.
Increasingly, vendor Web sites make it easy to comparison shop for systems – both within the server and desktop categories. For instance, the Dell site makes it particularly easy to choose options from a Web form, seeing how much various configurations would cost.
A very powerful commodity PC that would make a perfectly adequate server for many if not most CI projects can be had for under $2500 including tape backup. A server-class machine can easily cost $5000 to $10,000 or more. Similar ranges of prices can be found among proprietary Unix server systems or Macintosh systems intended to be used as servers.
There are two schools of thought when it comes to buying server hardware:
Buy a system that will handle your anticipated load for three to five years.
Which school you follow will depend on your current budget and your budget cycles. Some sites with grant funds or other “one time” money like to buy extra capacity while capital funds are available. However, since computing power for the same amount of money doubles every 18 months, it’s very expensive to buy capacity very far into the future.
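The arithmetic behind that last point can be sketched in a few lines of Python. The 18-month doubling period is the figure quoted above; the function name and the time horizons are chosen only for illustration, and the result is a rough rule of thumb rather than a price forecast.

```python
DOUBLING_PERIOD_MONTHS = 18  # doubling period quoted in the text


def power_per_dollar_multiplier(months_from_now):
    """Estimate how much more computing power a dollar buys after a delay."""
    return 2 ** (months_from_now / DOUBLING_PERIOD_MONTHS)


for years in (1.5, 3, 5):
    multiplier = power_per_dollar_multiplier(years * 12)
    print(f"In {years} years the same money buys about {multiplier:.1f} times the power")
```

Under this assumption, capacity you will not need for three years costs roughly four times as much today as it would if you bought it when you actually needed it.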
Installing Your Server Operating System
If you buy a system that comes with the operating system installed, and if you choose not to do a re-install for learning purposes, you are ready to configure and run your server when you take it out of the box.
If you buy a system that doesn’t have your desired operating system installed, you will need to install it from scratch. Typically you will work from a series of installation CD-ROMs. You may have to begin the process with a bootable floppy if the system arrives totally “bare.”
Because the Toolkit includes demonstration applications that run under Windows NT, we include complete instructions on installing Windows NT Server 4.0 software. This chapter is extremely comprehensive, including annotated screen shots of every step along the process of installation. However, because not all sites will install this operating system, we omit that material from the printed book. You will find this material on the Toolkit CD and Web site under Software.
System and Content Backup; Archiving
You will need to establish a regular program of backup for your system. Backups protect you from disk failures as well as human catastrophes such as accidental deletion of data. Here are some general guidelines:
Most shops back up disk drives to tape. Tapes with sufficient capacity to match today’s disk drives can be expensive; see discussion below.
A common approach calls for daily backups of all changed data files, and a weekly backup of all data on your system. You create “pools” of tapes for your daily and weekly dumps. You might have one tape series for every day of the week (or one each for Monday through Friday) and a separate series for your weekly dumps. The more “depth” to your tape pools, the more confidence you have that all your important files are backed up.
Some sites handle content backup separately from operating system and software backup; some even put software and data on separate disk drives. Because software tends to change at a different pace than content, this can provide important efficiencies. For instance if you make software changes infrequently, you might do a daily “change” dump of your content, and only back up your software weekly. Note that it’s important to capture important configuration files, which tend to reside in the same folders as software; the only sure-fire way to do this is to back up everything.
It’s very important to periodically store your most critical files off-site. You can do this by taking one of your full dump tape sets to an off-site storage location. A safe deposit box at a bank is a good choice. Increasingly, network backup is becoming an option; vendors provide ways to archive your most critical files at their site across the Internet.
If you employ RAID 1 or RAID 5, you may feel confident that your system is adequately backed up. Unfortunately, there have been cases in which RAID systems have failed in such a way that you are not protected. For instance, in some cases, a mirrored drive may fail, and you may not notice the failure. Eventually its mirrored partner fails, and now you have no data and no backup. Bottom line: it’s a good idea to back up all data periodically even when you have RAID protection.
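The daily and weekly tape pools described in the guidelines above can be sketched as a small Python function that says which tape to load on a given day. Everything specific here is an assumption made for illustration: the full dump on Friday, the pool of four weekly tapes, and the tape labels. Commercial backup software manages its pools in its own way.

```python
import datetime

WEEKLY_POOL_SIZE = 4  # hypothetical: four weekly tapes, reused in turn
DAILY_LABELS = ["MON", "TUE", "WED", "THU"]  # one daily tape per weekday


def tape_for(day):
    """Return (backup_type, tape_label) for a date, or (None, None) if no backup runs."""
    weekday = day.weekday()  # Monday is 0, Sunday is 6
    if weekday == 4:
        # Friday: full dump of everything, onto the next tape in the weekly pool.
        week_number = day.isocalendar()[1]
        return "full", f"WEEKLY-{week_number % WEEKLY_POOL_SIZE + 1}"
    if weekday < 4:
        # Monday to Thursday: only files changed since the last backup.
        return "changed", f"DAILY-{DAILY_LABELS[weekday]}"
    return None, None  # no backup at the weekend in this sketch


print(tape_for(datetime.date.today()))
```

A deeper weekly pool (a larger WEEKLY_POOL_SIZE) means a full dump survives longer before its tape is overwritten.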
The industry offers data backup drives based on formats created for other purposes. For instance, DAT, or Digital Audio Tape, is a popular format for data backup; it is probably the most commonly-used format for server backup applications. The major formats for server backup are:
DAT. These drives can hold from 8 gigabytes (8GB) of data up to 24GB, and cost from under $1000 to over $3000, depending on transfer speed and data capacity. Data is backed up onto a 4mm cassette that looks like an audio DAT tape. Media costs range from $5 to $25 per tape depending on capacity.
Exabyte, or 8mm. These tapes resemble those used in camcorders. Drives cost $1000 or more. Tapes hold from 2.5 to 7 GB and cost from $5 to $12 each.
DLT. These half-inch cartridges hold from 10GB to 75GB of data. Drives cost from $2000 to $5000 or more. Tapes cost about $50 each. These are very high-quality, high-performance backup devices.
Note that vendors may quote capacities of “12/24” or similar numbers. The first number is uncompressed; the second number assumes two-to-one compression. If your content is already in a compressed format such as JPEG, or if your backup software does data compression, you won’t achieve an additional two-to-one compression on tape.
In any event, if you buy a drive with sufficient capacity, you may be able to back up your entire Web site onto a single tape. This saves the manual effort of loading “continuation volumes” during the backup process and can be a great convenience.
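A rough way to estimate whether your site will fit on one tape is sketched below in Python. The function and its parameters are hypothetical, and the model is deliberately crude: it assumes data that is already compressed takes its full size on tape, while everything else shrinks by the vendor's quoted two-to-one ratio. Real compression results vary with the content.

```python
def tape_space_needed(site_gb, already_compressed_fraction, ratio=2.0):
    """Estimate the native tape capacity (in GB) a backup will consume."""
    incompressible_gb = site_gb * already_compressed_fraction
    compressible_gb = site_gb - incompressible_gb
    return incompressible_gb + compressible_gb / ratio


def fits_on_one_tape(site_gb, native_gb, already_compressed_fraction, ratio=2.0):
    """Return True if the estimated backup size fits the tape's native capacity."""
    return tape_space_needed(site_gb, already_compressed_fraction, ratio) <= native_gb


# A "12/24" drive has 12GB of native (uncompressed) capacity.
# Example: a 20GB site, half of which is JPEG images.
print(tape_space_needed(20, 0.5), fits_on_one_tape(20, 12, 0.5))
```

In this example the site would not fit on a single 12/24 tape, even though its 20GB is under the quoted 24GB figure.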
In order to perform backups, you will need backup software. Your NT, Unix, or Mac server will come with built-in backup software, but you may find such software to be limited. Commercial backup software tools allow you to schedule backups, manage multiple tape pools, and back up user desktop computers along with your server. In the Windows NT environment, ARCserve from Computer Associates is a popular tool. Seagate Software markets a competing tool, Backup Exec. These tools cost less than $500.
Archiving of content is a concept related to backup. When we speak of archiving, we generally think in terms of taking a snapshot of part or all of our content, with that snapshot kept indefinitely. Archiving can be done to the same tape media you use for backup. Alternatively, you may want to consider CD-R, CD-RW, or the new DVD-RAM as archive formats.
CD-R allows you to store about 650 megabytes of data on a single CD, which can be read by any PC with a CD-ROM drive. A CD-R probably could not hold an entire CI site including software, but in many cases it could hold most or all of a site’s content. Individual CD-R discs can be found for under $2 each.
CD-RW offers the advantages of CD-R at a higher media cost – about $12 per disc as of this writing. Unlike CD-R, CD-RW allows a single disc to be written on multiple times. The extra media expense of CD-RW would not be justified for archiving content; by definition, you want to write on an archive disc only once.
DVD-RAM offers several times the capacity of CD-R – up to 5.2 gigabytes. Thus DVD-RAM could in many cases back up on a single disc an entire CI site, or all of the content of a multimedia-rich site. As of this writing each DVD-RAM blank disc is expensive – $50 or so – but these prices will fall.
These formats can be useful for exchange of data as well. For instance, an off-site content provider may wish to deliver large amounts of content in CD-R form. Your webmaster can retain the CD-R disc or return it to the content provider; either way, another backup copy can be retained.
Other popular archive and exchange formats include Iomega’s ZIP and JAZ drives, and Imation’s Superdisk drives.
Most people who run Web sites want to know their “hit” counts – how often the site in general, and certain pages in particular, are visited. Your Web server generates log files that hold information about every HTTP transaction: every time a file is fetched from the server, a log entry documents the date and time as well as the host name of the user’s computer. Log analysis tools convert your logs from almost incomprehensible raw data into useful summaries of activity.
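To give a feel for what the raw data looks like, here is a minimal Python sketch that picks apart a single log entry. It assumes your server writes the widely used Common Log Format, optionally extended with referrer and browser fields; your server's format may differ, so check its documentation. The sample line, host name, and function name are invented for illustration.

```python
import re

# host, two identity fields, [timestamp], "request", status, size,
# then (optionally) "referrer" and "browser".
LOG_PATTERN = re.compile(
    r'(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+)[^"]*" (?P<status>\d{3}) (?P<size>\S+)'
    r'(?: "(?P<referrer>[^"]*)" "(?P<agent>[^"]*)")?'
)


def parse_line(line):
    """Return a dict of fields for one log entry, or None if it does not match."""
    match = LOG_PATTERN.match(line)
    return match.groupdict() if match else None


SAMPLE_LINE = (
    'pc12.example.org - - [10/Oct/1999:13:55:36 -0700] '
    '"GET /events/index.html HTTP/1.0" 200 2326 '
    '"http://www.example.com/search" "Mozilla/4.08"'
)

entry = parse_line(SAMPLE_LINE)
print(entry["host"], entry["time"], entry["path"], entry["status"])
```

A log analysis tool does essentially this for every line in the file, then summarizes the results.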
Log analysis tools allow you to summarize and analyze your traffic patterns across time, presenting tabular and graphical reports that greatly assist you in tuning your content and your site. Examples of the kinds of reports you can get include:
A list of the most popular pages on your site.
A list of the least popular pages on your site. If a page you think should be popular isn’t, you can tune your site’s layout and link structure to improve its visibility so more users will find it. Some tools offer “path analysis,” showing what hyperlinks users follow through your site.
A report showing how users arrive at your site. Your logs contain “referrer” information that says what Web site a user visited before they followed a link to your site. You can tell which directories, search engines, personal or other kinds of sites are the most popular starting points from which your users find your content.
A graph showing activity across days, weeks, or months.
A list of the most popular domains from which your users visit your site.
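Several of the reports listed above boil down to simple counting. The Python sketch below shows the idea using a handful of invented requests, each reduced to the page fetched and the referring URL. The data and variable names are hypothetical, and a real tool would work from thousands of parsed log entries and offer far more than this.

```python
from collections import Counter
from urllib.parse import urlparse

# (page requested, referring URL); "-" means no referrer was recorded.
requests = [
    ("/index.html", "http://search.example.com/results"),
    ("/events/index.html", "http://search.example.com/results"),
    ("/index.html", "http://directory.example.org/community"),
    ("/index.html", "-"),
    ("/library/hours.html", "http://directory.example.org/community"),
]

# Hits per page, for the most popular and least popular page reports.
page_hits = Counter(path for path, _ in requests)
fewest = min(page_hits.values())
least_visited = sorted(page for page, hits in page_hits.items() if hits == fewest)

# Hits per referring site, showing where visitors come from.
referring_sites = Counter(
    urlparse(referrer).netloc for _, referrer in requests if referrer != "-"
)

print("Most popular page:", page_hits.most_common(1))
print("Least visited pages:", least_visited)
print("Referring sites:", referring_sites.most_common())
```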
These tools range from free, public domain tools to commercial packages costing from $100 or so to $10,000. Extremely expensive log analysis tools feature real-time data analysis and sophisticated multi-server analysis functions that are required by only the busiest commercial sites on the planet. Most community information sites could choose to spend no more than $250 for an adequate solution.
Here is a sample graph from a popular tool, Webtrends.
Here is a list of some of the most popular log analysis tools:
Vendor Web Site
A very popular log analysis tool that presents reports in tabular and graphical form.
Site Server is a family of server support and administration tools, including log analysis, site mapping, and a search engine.
A high-end tool for Windows, Mac, and Unix servers.
Another log analysis tool.
You will want to establish a policy as to what log information is kept for how long. Because logs can be used to trace activity by IP address, it is possible to discern in some cases which users access which pages on your CI site. This can be a violation of privacy, especially if the log files are published on the site or otherwise become public. Libraries will be especially sensitive to this issue, as there are ethical considerations and, in some jurisdictions, legal ramifications to releasing individual patron access information.
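One way such a policy might be put into practice is sketched below in Python. This is only an illustration under stated assumptions: the 90-day retention period is an arbitrary example, not a recommendation, and the function names are hypothetical. The sketch drops entries older than the retention period and blanks the last part of numeric IP addresses in the entries it keeps. Note that it leaves host names untouched, and a host name can identify a user's computer just as well as an address can.

```python
import datetime

RETENTION_DAYS = 90  # hypothetical policy value; choose your own


def mask_host(host):
    """Replace the last part of a dotted numeric IP address with 0."""
    parts = host.split(".")
    if len(parts) == 4 and all(part.isdigit() for part in parts):
        return ".".join(parts[:3] + ["0"])
    return host  # host names are left as-is in this sketch


def apply_policy(entries, today):
    """Keep only recent entries, with numeric addresses masked."""
    cutoff = today - datetime.timedelta(days=RETENTION_DAYS)
    return [
        (when, mask_host(host), path)
        for when, host, path in entries
        if when >= cutoff
    ]


entries = [
    (datetime.date(1999, 6, 1), "192.168.10.57", "/index.html"),
    (datetime.date(1999, 1, 15), "192.168.10.58", "/old.html"),
]
print(apply_policy(entries, today=datetime.date(1999, 6, 30)))
```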