"If it works don't touch it" only applies if you have a good system administrator. This is because computers are never static: log files fill up, limits are exceeded, new stuff gets added without old stuff getting removed. Cruft happens.
Good computer administration requires a deep understanding of how the system behaves under normal conditions, so that troubleshooting under exceptional conditions can be done against a strong baseline. There is no magic to getting that baseline; it requires regular attention to all the small details. For example, know the typical profiles for internals like CPU usage, memory consumption, network traffic, disk I/O, and disk usage, and know the state of externals like computer room temperature, air filtration, equipment age, cabling fidelity, and power integrity.
Four essential tasks that are delegated to the admin layer are: real time monitoring, log inspection, backups, and baseline documentation.
Real time monitoring
Real time monitoring should be done at a frequency that matches how quickly the system changes. High traffic systems, and systems serving several disparate needs, require more frequent monitoring; low traffic systems, or systems dedicated to single purpose tasks (like a dedicated mail server, database server, or DNS server), need less frequent monitoring. Of course the nature of the monitoring is entirely dependent on the jobs running on the system, as each will have its own monitoring tools.
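As a rough illustration of monitoring against a baseline, the sketch below flags a reading that strays too far from a set of readings taken under normal conditions. The function names, the three-standard-deviation tolerance, and the use of disk usage as the sampled value are all assumptions made for this example, not recommendations from any particular monitoring tool.

```python
import shutil
from statistics import mean, stdev


def disk_used_percent(path):
    """Return the percentage of the filesystem holding `path` that is in use."""
    usage = shutil.disk_usage(path)
    if usage.total == 0:
        return 0.0
    return 100.0 * usage.used / usage.total


def is_outside_baseline(history, sample, tolerance=3.0):
    """Return True when `sample` is more than `tolerance` standard
    deviations away from the mean of the baseline readings in `history`."""
    if len(history) < 2:
        raise ValueError("need at least two baseline readings")
    center = mean(history)
    spread = stdev(history)
    if spread == 0:
        # A perfectly flat baseline: any change at all is a deviation.
        return sample != center
    return abs(sample - center) > tolerance * spread
```

A scheduled job could call disk_used_percent("/") at whatever interval suits the system and pass the result to is_outside_baseline together with readings collected during normal operation.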
Log inspection can be done less frequently than real time monitoring. But just as you do with backups, choose a time period that will enable you to catch errors, trap security violations, adjust database indexes, and throttle network hogs before they impact your users.
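A periodic log inspection can start as a simple count of suspicious lines. The sketch below is one possible shape for that; the pattern labels and the regular expressions are hypothetical and would need to be adapted to the log formats actually produced on the system.

```python
import re
from collections import Counter

# Hypothetical patterns; adjust to the messages your services really write.
PATTERNS = {
    "error": re.compile(r"\berror\b", re.IGNORECASE),
    "auth_failure": re.compile(
        r"authentication failure|failed password", re.IGNORECASE
    ),
}


def summarize_log(lines):
    """Count how many lines match each named pattern."""
    counts = Counter()
    for line in lines:
        for label, pattern in PATTERNS.items():
            if pattern.search(line):
                counts[label] += 1
    return counts
```

Passing an open file object works as well as a list, since both yield one line at a time. Comparing the counts from one inspection period to the next gives an early hint that something has changed.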
Backups should be done with explicit goals in mind. Meeting these goals will determine both the frequency of backups and the backup methodology.
Choose disk, tape or other media to match the size requirements of your backups, the cost of the media (consumable cost or electric power costs), and whether or not an attendant needs to be present to mount the media. Full and incremental backups are often the dividing line between tape and disk.
Choose the backup disk location and frequency based on which protection goal is sought. Use these definitions for the four backup disk locations described here:
- QuickSave. Backup from active directory structure to reserved backup location on the same disk.
- Near-line. Backup to another computer's disk.
- Off-line. Backup to removable media.
- Archive. Backup to permanent read-only media.
Here are some backup goals to consider:
- Recover from accidental deletion. Protect new documents, configuration files, and source code files that have been recently created or modified. Run daily and use the QuickSave location. Protect large binary files like photo collections, large documents like artwork files, and large directories like email message boxes on a weekly basis, but save to a segregated QuickSave location which will be subject to a different rotation policy.
- Archive. Save and protect finished artwork projects, finished source-code projects, and collections of photos for posterity. Back up to QuickSave as a simple first-stage operation, then use the regular safety check schedule to move the backup to safer off-line storage.
- Recover from potential corruption. Protect databases that are actively being updated. Run daily and use the QuickSave location.
- Rollback to working state. Protect configuration and source code files that are being changed from a well-known state to a new untested state, especially when they are running in production mode. Run daily and use the QuickSave location. This strategy is in lieu of a full revision control system like Subversion or Bazaar.
- Revision snapshot. Save working application software or operating systems prior to upgrades, to help troubleshoot faulty upgrades and to potentially allow full rollback. Run as needed prior to version upgrades and use the QuickSave location.
- Regular safety check. Copy the most recent QuickSave to near-line or off-line for protection from theft or fire. Run monthly or quarterly. (Greater frequency is not needed when RAID systems are functioning properly.)
- Emergency safety check. Save the contents of the entire working partition when one of the physical disks of a RAID or mirror fails or is about to fail. Run immediately upon fault discovery. Use near-line storage on a separate computer large enough to hold the entire partition.
- Restore damaged disk. Save the contents of all working partitions to near-line or off-line storage to allow restoration of failed hardware. Run a full backup soon after OS installation, and soon after a major work effort that results in new software being put into active production use. Use a near-line or off-line location.
- Rebuild computer. Save the contents of all working partitions to near-line or off-line storage to allow an entire setup to be rebuilt on a new set of identical hardware. Same frequency and location as the "restore damaged disk" goal.
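For the daily "recover from accidental deletion" goal, the first step is deciding which files count as recently created or modified. The following is a minimal sketch that selects files by modification time; the function name and the 24-hour default window are assumptions chosen for illustration.

```python
import time
from pathlib import Path


def recently_modified(root, max_age_seconds=86400, now=None):
    """Return the files under `root` modified within the last
    `max_age_seconds`, sorted by path."""
    now = time.time() if now is None else now
    cutoff = now - max_age_seconds
    return sorted(
        p
        for p in Path(root).rglob("*")
        if p.is_file() and p.stat().st_mtime >= cutoff
    )
```

The resulting list can then be handed to whatever copy tool writes to the QuickSave location.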
Frequency and rotation
Here is one example of a frequency and rotation policy that meets various goals.
- Daily. Back up configuration files for core software like Apache, DNS, and others that change often. Back up software source code files that are in active development. Store all of these in the QuickSave location in a directory named for the day of the week, from Sunday through Saturday. Copy the first backup of the month to a directory named for the month. With this rotation frequency, there will be seven daily backups and twelve monthly backups, resulting in a 1:19 ratio of active to backup storage capacity. This is practical only for text and source code files that are not excessively large.
- Daily. Back up databases that are actively being updated. Keep the current backup only, no rotation. This is for accidental deletion and potential corruption goals only.
- Weekly. Back up growing collections on a separate frequency. Back up large binary files like photo collections, large artwork files, and large directories like email message boxes on a weekly basis. Keep the current backup only, no rotation. This is for accidental deletion and potential corruption goals only.
- Monthly or Quarterly. Copy the most recent QuickSave to off-line storage for protection against theft or fire.
- On demand. Back up archive snapshots at the conclusion of a project.
- Before upgrade. Copy a revision snapshot to QuickSave.
- After reconfiguration. Back up using the "rebuild computer" strategy after an OS upgrade or major application software configuration.
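The daily rotation in the first item above (a directory per weekday plus a directory per month) can be sketched as follows. This assumes the job runs every day, so that the first backup of the month falls on day 1; the function names and the directory layout under the backup root are hypothetical.

```python
import shutil
from datetime import date
from pathlib import Path

# date.weekday() numbers Monday as 0.
DAYS = ["Monday", "Tuesday", "Wednesday", "Thursday",
        "Friday", "Saturday", "Sunday"]
MONTHS = ["January", "February", "March", "April", "May", "June",
          "July", "August", "September", "October", "November", "December"]


def rotation_slots(today):
    """Return the directory names that today's backup should be written to."""
    slots = [DAYS[today.weekday()]]
    if today.day == 1:
        # Assumption: a daily job makes day 1 the first backup of the month.
        slots.append(MONTHS[today.month - 1])
    return slots


def quicksave(source, backup_root, today):
    """Copy `source` into each rotation slot under `backup_root`,
    replacing whatever the slot held from the previous cycle."""
    written = []
    for slot in rotation_slots(today):
        target = Path(backup_root) / slot
        if target.exists():
            shutil.rmtree(target)
        shutil.copytree(source, target)
        written.append(target)
    return written
```

Because each slot is overwritten when its name comes around again, the backup area never grows beyond seven daily and twelve monthly copies, which is where the 1:19 ratio of active to backup storage comes from.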
In the course of good administration, exceptional events will occur. These should be recorded in the system administrator's notebook. This is especially useful when an extended period of time has been spent tracking down a problem and resolving it: document your solution using a wiki or similar tool so that you (or another administrator) can quickly recognize and solve the problem again in the future.
When problems occur, system administrators need to look to the four tools just outlined for help with troubleshooting. Problems are not solved by other people; they are solved by the system administrator, and the task is always easier when the baseline system is well understood, the logging systems are operational, the backups are up-to-date, and the documentation is in place.