CS Ch13 File Structures
CS Ch13 File Structures
CS Ch13 File Structures
File
Structures
13.1
Objectives
After studying this chapter, the student should be able to:
Define two categories of access methods: sequential access and random
access.
Understand the structure of sequential files and how they are updated.
Understand the structure of indexed files and the relation between the index
and the data file.
Understand the idea behind hashed files and describe some hashing methods.
Describe address collisions and how they can be resolved.
Define directories and how they can be used to organize files.
Distinguish between text and binary files.
13.2
13.3
Sequential access
If we need to access a file sequentiallythat is, one record
after another, from beginning to end we use a sequential
file structure.
Random access
If we need to access a specific record without having to
retrieve all records before it, we use a file structure that
allows random access. Two file structures allow this:
indexed files and hashed files. This taxonomy of file
structures is shown in Figure 13.1.
13.4
13.5
13.8
13.9
13.12
Inverted files
One of the advantages of indexed files is that we can have
more than one index, each with a different key. For
example, an employee file can be retrieved based on either
social security number or last name. This type of indexed file
is usually called an inverted file.
13.14
13.15
Hashing methods
For key-address mapping, we can select one of several
hashing methods. We discuss a few of them here.
Direct hashing
In direct hashing, the key is the data file address without any
algorithmic manipulation. The file must therefore contain a
record for every possible key. Although situations suitable
for direct hashing are limited, it can be very powerful,
because it guarantees that there are no synonyms or
collisions (discussed later in this chapter), as with other
methods.
13.16
13.18
10
Collision
Generally, the population of keys for a hashed list is greater
than the number of records in the data file.
For example, if we have a file of 50 students for a class in
which the students are identified by the last four digits of
their social security number, then there are 200 possible keys
for each element in the file (10,000/50). Because there are
many keys for each address in the file, there is a possibility
that more than one key will hash to the same address in the
file. We call the set of keys that hash to the same address in
our list synonyms. The collision concept is illustrated in
Figure 13.10.
13.21
11
Collision resolution
With the exception of the direct method, none of the methods
we have discussed for hashing creates one-to-one mappings.
This means that when we hash a new key to an address, we
may create a collision. There are several methods for
handling collisions, each of them independent of the hashing
algorithm. That is, any hashing method can be used with any
collision resolution method. In this section, we discuss some
of these methods.
13.23
12
13
13-5 DIRECTORIES
Directories are provided by most operating systems for
organizing files. A directory performs the same
function as a folder in a filing cabinet. However, a
directory in most operating systems is represented as a
special type of file that holds information about other
files. A directory not only serves as a kind of index that
tells the operating system where files are located on an
auxiliary storage device, but can also contain other
information about the files it contains, such as who has
the access to each file, or the date when each file was
created, accessed or modified.
13.27
14
Special directories
There are four special types of directory that play an
important role in the directory structure in UNIX: the root
directory, home directories, working directories and
parent directories.
Paths and pathnames
The files path is specified by its absolute pathname, a list
of all directories separated by a slash character (/). UNIX
also provides a shorter pathname, known as a relative
pathname, which is the path relative to the working
directory.
13.29
15
Text files
A text file is a file of characters. It cannot contain integers,
floating-point numbers, or any other data structures in their
internal memory format. To store these data types, they must
be converted to their character equivalent formats. Some
files can only use character data types. Most notable are file
streams (input/output objects in some object-oriented
language like C++) for keyboards, monitors and printers.
This is why we need special functions to format data that is
input from or output to these devices.
13.31
Binary files
A binary file is a collection of data stored in the internal
format of the computer. In this definition, data can be an
integer (including other data types represented as unsigned
integers, such as image, audio, or video), a floating-point
number or any other structured data (except a file).
Unlike text files, binary files contain data that is meaningful
only if it is properly interpreted by a program. If the data is
textual, one byte is used to represent one character (in ASCII
encoding). But if the data is numeric, two or more bytes are
considered a data item.
13.32
16