User:Millosh/Internet and Programming for Linguists/The Big Picture/Disk storage

When you look^[1] at a hard disk under Unix, you see a tree of named directories and files. Normally you won't need to look any deeper than that, but it does become useful to know what's going on underneath if you have a disk crash and need to try to salvage files. Unfortunately, there's no good way to describe disk organization from the file level downwards, so I'll have to describe it from the hardware up.

Low-level disk and file system structure

The surface area of your disk, where it stores data, is divided up something like a dartboard — into circular tracks which are then pie-sliced into sectors. Because tracks near the outer edge have more area than those close to the spindle at the center of the disk, the outer tracks have more sector slices in them than the inner ones. Each sector (or disk block) has the same size, which under modern Unixes is generally 1 binary K (1024 8-bit words). Each disk block has a unique address or disk block number.

Unix divides the disk into disk partitions. Each partition is a continuous span of blocks that's used separately from any other partition, either as a file system or as swap space. The original reasons for partitions had to do with crash recovery in a world of much slower and more error-prone disks; the boundaries between them reduce the fraction of your disk likely to become inaccessible or corrupted by a random bad spot on the disk. Nowadays, it's more important that partitions can be declared read-only (preventing an intruder from modifying critical system files) or shared over a network through various means we won't discuss here. The lowest-numbered partition on a disk is often treated specially, as a boot partition where you can put a kernel to be booted.

Each partition is either swap space (used to implement virtual memory) or a file system used to hold files. Swap-space partitions are just treated as a linear sequence of blocks. File systems, on the other hand, need a way to map file names to sequences of disk blocks. Because files grow, shrink, and change over time, a file's data blocks will not be a linear sequence but may be scattered all over its partition (from wherever the operating system can find a free block when it needs one). This scattering effect is called fragmentation.

File names and directories

Within each file system, the mapping from names to blocks is handled through a structure called an i-node. There's a pool of these things near the "bottom" (lowest-numbered blocks) of each file system (the very lowest ones are used for housekeeping and labeling purposes we won't describe here). Each i-node describes one file. File data blocks (including directories) live above the i-nodes (in higher-numbered blocks).

Every i-node contains a list of the disk block numbers in the file it describes. (Actually this is a half-truth, only correct for small files, but the rest of the details aren't important here.) Note that the i-node does not contain the name of the file.

Names of files live in directory structures. A directory structure just maps names to i-node numbers. This is why, in Unix, a file can have multiple true names (or hard links); they're just multiple directory entries that happen to point to the same i-node.

Mount points

In the simplest case, your entire Unix file system lives in just one disk partition. While you'll see this arrangement on some small personal Unix systems, it's unusual. More typical is for it to be spread across several disk partitions, possibly on different physical disks. So, for example, your system may have one small partition where the kernel lives, a slightly larger one where OS utilities live, and a much bigger one where user home directories live.

The only partition you'll have access to immediately after system boot is your root partition, which is (almost always) the one you booted from. It holds the root directory of the file system, the top node from which everything else hangs.

The other partitions in the system have to be attached to this root in order for your entire, multiple-partition file system to be accessible. About midway through the boot process, your Unix will make these non-root partitions accessible. It will mount each one onto a directory on the root partition.

For example, if you have a Unix directory called /usr, it is probably a mount point to a partition that contains many programs installed with your Unix but not required during initial boot.

How a file gets looked up

Now we can look at the file system from the top down. When you open a file (such as, say, /home/esr/WWW/ldp/fundamentals.xml) here is what happens:

Your kernel starts at the root of your Unix file system (in the root partition). It looks for a directory there called ‘home’. Usually ‘home’ is a mount point to a large user partition elsewhere, so it will go there. In the top-level directory structure of that user partition, it will look for a entry called ‘esr’ and extract an i-node number. It will go to that i-node, notice that its associated file data blocks are a directory structure, and look up ‘WWW’. Extracting that i-node, it will go to the corresponding subdirectory and look up ‘ldp’. That will take it to yet another directory i-node. Opening that one, it will find an i-node number for ‘fundamentals.xml’. That i-node is not a directory, but instead holds the list of disk blocks associated with the file.

File ownership, permissions and security

To keep programs from accidentally or maliciously stepping on data they shouldn't, Unix has permission features. These were originally designed to support timesharing by protecting multiple users on the same machine from each other, back in the days when Unix ran mainly on expensive shared minicomputers.

In order to understand file permissions, you need to recall the description of users and groups in the section What happens when you log in?. Each file has an owning user and an owning group. These are initially those of the file's creator; they can be changed with the programs chown(1) and chgrp(1).

The basic permissions that can be associated with a file are ‘read’ (permission to read data from it), ‘write’ (permission to modify it) and ‘execute’ (permission to run it as a program). Each file has three sets of permissions; one for its owning user, one for any user in its owning group, and one for everyone else. The ‘privileges’ you get when you log in are just the ability to do read, write, and execute on those files for which the permission bits match your user ID or one of the groups you are in, or files that have been made accessible to the world.

To see how these may interact and how Unix displays them, let's look at some file listings on a hypothetical Unix system. Here's one:

snark:~$ ls -l notes
-rw-r--r-- 1 esr users 2993 Jun 17 11:00 notes

This is an ordinary data file. The listing tells us that it's owned by the user ‘esr’ and was created with the owning group ‘users’. Probably the machine we're on puts every ordinary user in this group by default; other groups you commonly see on timesharing machines are ‘staff’, ‘admin’, or ‘wheel’ (for obvious reasons, groups are not very important on single-user workstations or PCs). Your Unix may use a different default group, perhaps one named after your user ID.

The string ‘-rw-r--r--’ represents the permission bits for the file. The very first dash is the position for the directory bit; it would show ‘d’ if the file were a directory, or would show ‘l’ if the file were a symbolic link. After that, the first three places are user permissions, the second three group permissions, and the third are permissions for others (often called ‘world’ permissions). On this file, the owning user ‘esr’ may read or write the file, other people in the ‘users’ group may read it, and everybody else in the world may read it. This is a pretty typical set of permissions for an ordinary data file.

Now let's look at a file with very different permissions. This file is GCC, the GNU C compiler.

snark:~$ ls -l /usr/bin/gcc
-rwxr-xr-x 3 root bin 64796 Mar 21 16:41 /usr/bin/gcc

This file belongs to a user called ‘root’ and a group called ‘bin’; it can be written (modified) only by root, but read or executed by anyone. This is a typical ownership and set of permissions for a pre-installed system command. The ‘bin’ group exists on some Unixes to group together system commands (the name is a historical relic, short for ‘binary’). Your Unix might use a ‘root’ group instead (not quite the same as the ‘root' user!).

The ‘root’ user is the conventional name for numeric user ID 0, a special, privileged account that can override all privileges. Root access is useful but dangerous; a typing mistake while you're logged in as root can clobber critical system files that the same command executed from an ordinary user account could not touch.

Because the root account is so powerful, access to it should be guarded very carefully. Your root password is the single most critical piece of security information on your system, and it is what any crackers and intruders who ever come after you will be trying to get.

About passwords: Don't write them down — and don't pick a passwords that can easily be guessed, like the first name of your girlfriend/boyfriend/spouse. This is an astonishingly common bad practice that helps crackers no end. In general, don't pick any word in the dictionary; there are programs called dictionary crackers that look for likely passwords by running through word lists of common choices. A good technique is to pick a combination consisting of a word, a digit, and another word, such as ‘shark6cider’ or ‘jump3joy’; that will make the search space too large for a dictionary cracker. Don't use these examples, though — crackers might expect that after reading this document and put them in their dictionaries.

Now let's look at a third case:

snark:~$ ls -ld ~
drwxr-xr-x 89 esr users 9216 Jun 27 11:29 /home2/esr
snark:~$

This file is a directory (note the ‘d’ in the first permissions slot). We see that it can be written only by esr, but read and executed by anybody else.

Read permission gives you the ability to list the directory — that is, to see the names of files and directories it contains. Write permission gives you the ability to create and delete files in the directory. If you remember that the directory includes a list of the names of the files and subdirectories it contains, these rules will make sense.

Execute permission on a directory means you can get through the directory to open the files and directories below it. In effect, it gives you permission to access the i-nodes in the directory. A directory with execute completely turned off would be useless.

Occasionally you'll see a directory that is world-executable but not world-readable; this means a random user can get to files and directories beneath it, but only by knowing their exact names (the directory cannot be listed).

It's important to remember that read, write, or execute permission on a directory is independent of the permissions on the files and directories beneath. In particular, write access on a directory means you can create new files or delete existing files there, but does not automatically give you write access to existing files.

Finally, let's look at the permissions of the login program itself.

snark:~$ ls -l /bin/login
-rwsr-xr-x 1 root bin 20164 Apr 17 12:57 /bin/login

This has the permissions we'd expect for a system command — except for that ‘s’ where the owner-execute bit ought to be. This is the visible manifestation of a special permission called the ‘set-user-id’ or setuid bit.

The setuid bit is normally attached to programs that need to give ordinary users the privileges of root, but in a controlled way. When it is set on an executable program, you get the privileges of the owner of that program file while the program is running on your behalf, whether or not they match your own.

Like the root account itself, setuid programs are useful but dangerous. Anyone who can subvert or modify a setuid program owned by root can use it to spawn a shell with root privileges. For this reason, opening a file to write it automatically turns off its setuid bit on most Unixes. Many attacks on Unix security try to exploit bugs in setuid programs in order to subvert them. Security-conscious system administrators are therefore extra-careful about these programs and reluctant to install new ones.

There are a couple of important details we glossed over when discussing permissions above; namely, how the owning group and permissions are assigned when a file or directory is first created. The group is an issue because users can be members of multiple groups, but one of them (specified in the user's /etc/passwd entry) is the user's default group and will normally own files created by the user.

The story with initial permission bits is a little more complicated. A program that creates a file will normally specify the permissions it is to start with. But these will be modified by a variable in the user's environment called the umask. The umask specifies which permission bits to turn off when creating a file; the most common value, and the default on most systems, is -------w- or 002, which turns off the world-write bit. See the documentation of the umask command on your shell's manual page for details.

Initial directory group is also a bit complicated. On some Unixes a new directory gets the default group of the creating user (this in the System V convention); on others, it gets the owning group of the parent directory in which it's created (this is the BSD convention). On some modern Unixes, including Linux, the latter behavior can be selected by setting the set-group-ID on the directory (chmod g+s).

How things can go wrong

Earlier it was hinted that file systems can be fragile things. Now we know that to get to a file you have to hopscotch through what may be an arbitrarily long chain of directory and i-node references. Now suppose your hard disk develops a bad spot?

If you're lucky, it will only trash some file data. If you're unlucky, it could corrupt a directory structure or i-node number and leave an entire subtree of your system hanging in limbo — or, worse, result in a corrupted structure that points multiple ways at the same disk block or i-node. Such corruption can be spread by normal file operations, trashing data that was not in the original bad spot.

Fortunately, this kind of contingency has become quite uncommon as disk hardware has become more reliable. Still, it means that your Unix will want to integrity-check the file system periodically to make sure nothing is amiss. Modern Unixes do a fast integrity check on each partition at boot time, just before mounting it. Every few reboots they'll do a much more thorough check that takes a few minutes longer.

If all of this sounds like Unix is terribly complex and failure-prone, it may be reassuring to know that these boot-time checks typically catch and correct normal problems before they become really disastrous. Other operating systems don't have these facilities, which speeds up booting a bit but can leave you much more seriously screwed when attempting to recover by hand (and that's assuming you have a copy of Norton Utilities or whatever in the first place...).

One of the trends in current Unix designs is journalling file systems. These arrange traffic to the disk so that it's guaranteed to be in a consistent state that can be recovered when the system comes back up. This will speed up the boot-time integrity check a lot.

Notes and references

↑ This document contains material from The Unix and Internet Fundamentals HOWTO made by Eric S. Raymond, version 2.9 from 2004-03-03. According to the LDP manifesto, text is adopted under the terms of GNU Free Documentation License 1.2 or any later.

[1] This document contains material from The Unix and Internet Fundamentals HOWTO made by Eric S. Raymond, version 2.9 from 2004-03-03. According to the LDP manifesto, text is adopted under the terms of GNU Free Documentation License 1.2 or any later.

[1]