Pascal Programming/Files
Ever wondered how to process large amounts of data? Files are the solution in Pascal. You were already acquainted with some basics in the input and output chapter. Here we will go into more detail, as far as the ISO standard 7185 “Pascal” defines it. The “Extended Pascal” ISO standard 10206 defines even more features, but these will be covered in the second part of this Wikibook.
File data types
So far we have only been handling text files, i. e. files possessing the data type `text`, but there are more file types.
Concept
Mathematically speaking, a file is a bounded finite sequence. That means,
- components are oriented along an axis (sequence),
- component values are chosen from one domain (bounded), and
- there is a certain number of components present (finite).
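To put this in fancy math symbols: a file with $n$ components is a sequence $(c_1, c_2, \dots, c_n)$ with $n \in \mathbb{N}_0$ and $c_i \in D$ for every $1 \le i \le n$, where $D$ denotes the domain.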
Declaration
In Pascal we can declare file data types by specifying `file of recordType`, where `recordType` names the data type of the file’s components (its “records”). Despite the name, it does not have to be a `record` data type.
A permissible `recordType` can be any data type, except another file data type (including `text`) or a data type containing such. That means an `array` of file data types, or a `record` having a `file` as a component, is not permitted.
Let’s see an example:
program fileDemo(output);
type
  integerFile = file of integer;
With a variable of the data type `integerFile` we can access a file containing only one kind of data, `integer` values (the domain restriction).
var
  temperatures: integerFile;
  i: integer;
Note, the variable `temperatures` is not a file by itself. This Pascal variable merely provides us with an abstract “handle”, something that permits us, the `program`, to get a hold of the actual file (as described in § Concept).
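As a hedged illustration of the declaration rules above, the following sketch uses a `record` as the component type of a file. The names `measurement`, `measurementFile`, and `readings` are made up for this example, and the sketch assumes an ISO 7185 compiler such as the GPC, where a file variable that is not listed in the program heading is a temporary file.

```pascal
program recordFileDemo(output);
type
  { Hypothetical component type: one reading per record. }
  measurement = record
    hour: integer;
    celsius: real;
  end;
  measurementFile = file of measurement;
  { Not permitted, the component type would itself be a file type:
    fileOfFiles = file of measurementFile; }
var
  readings: measurementFile;
  m: measurement;
begin
  { Generation mode: store two complete records. }
  rewrite(readings);
  m.hour := 6;
  m.celsius := 11.5;
  write(readings, m);
  m.hour := 12;
  m.celsius := 19.25;
  write(readings, m);

  { Inspection mode: read the records back one by one. }
  reset(readings);
  while not EOF(readings) do
  begin
    read(readings, m);
    writeLn(m.hour:2, ' h: ', m.celsius:6:2);
  end;
end.
```

The commented-out declaration would be rejected, because a file type may not have another file type as its component type.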
Modes
All files have a current mode. Upon declaration of a file variable, this mode is, as usual, undefined. In Standard Pascal as defined by the ISO standard 7185 you can choose between generation mode and inspection mode.
Generation mode
In order to write to a file you will need to call the standard built-in procedure named `rewrite`. `Rewrite` will attempt opening a file for writing from the start.
begin
  rewrite(temperatures);
The file immediately becomes empty, hence its name rewrite. Extended Pascal also has the non-destructive procedure `extend`.
Only after successfully opening a file for writing do all write routines become legal. Attempting to write to a file that has not been opened for writing constitutes a fatal error.
  write(temperatures, 70);
  write(temperatures, 74);
All parameters to `write` after the destination (here `temperatures`) have to be of the destination file’s `recordType`. There must be at least one. Only if the destination is a `text` file are various built-in data types permitted.
Note that the procedures `writeLn` and `readLn` can only be applied to `text` files. Other files do not “know” the notion of lines, therefore the `…Ln` procedures cannot be applied to them.
Inspection mode
In order to read a file you will need to call the standard built-in procedure named `reset`. `Reset` will attempt opening a file for reading from the start.
  reset(temperatures);
  while not EOF(temperatures) do
  begin
    read(temperatures, i);
    writeLn(i);
  end;
end.
Note that after `reset(temperatures)` you cannot write anything to that file anymore. Modes are exclusive: Either you are writing or reading.[fn 1]
Application
The main and most apparent “advantage” of a `file` might be: Unlike an `array` we do not need to specify a size in advance, in our source code. The `file` can be as large as needed. Yet an `array` can be copied with a `:=` assignment. Entire files cannot be copied this way.
The main “disadvantage” of a `file` might be: Access is only sequential. We have to start reading and writing a `file` from the start. If we want to have, say, the 94th record, we need to advance 93 times and also take account of the possibility that there might be fewer than 94 records available.[fn 2]
The words advantage and disadvantage were put between quotation marks, because a programming language cannot judge/rate what is “better” or “worse”. It is the programmer’s task to make the assessment. Files are especially suitable for I/O of unpredictable length, for instance user input.
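The cost of sequential access can be sketched as follows. This is a minimal example under assumptions, not a definitive implementation: the names `wanted`, `data`, `item`, and `skipped` are hypothetical, the file is filled with square numbers purely for demonstration, and an ISO 7185 compiler such as the GPC is assumed.

```pascal
program nthRecord(output);
const
  wanted = 94; { which record we are interested in }
type
  integerFile = file of integer;
var
  data: integerFile;
  item, skipped, i: integer;
begin
  { Demonstration data: the squares 1, 4, 9, ..., 10000. }
  rewrite(data);
  for i := 1 to 100 do
  begin
    write(data, i * i);
  end;

  { Advance past the first (wanted - 1) records, unless EOF comes first. }
  reset(data);
  skipped := 0;
  while (skipped < wanted - 1) and not EOF(data) do
  begin
    read(data, item);
    skipped := skipped + 1;
  end;

  if EOF(data) then
  begin
    writeLn('fewer than ', wanted:1, ' records available');
  end
  else
  begin
    read(data, item);
    writeLn(item:1);
  end;
end.
```

With the demonstration data the program prints 8836, the 94th square. If fewer than 94 records had been written, the `EOF` check would prevent an illegal `read`.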
Primitive routines
So far we have been using only `read`/`readLn` and `write`/`writeLn`. These procedures are convenient and perfect for everyday use. However, Pascal also gives you the opportunity to have comparatively “low-level” access to files: `get` and `put`.
Buffer
Every file variable is associated with a buffer. A buffer is a temporary storage space. Everything you read from and write to a `file` passes through this storage space before the actual read or write action is communicated to the OS.[fn 3] Buffered I/O is chosen for performance reasons.
In Pascal we can access one component of the buffer, the “current” one, by appending `↑` (typed as `^` in source code) to the variable name, just as if it were a pointer. The data type of this dereferenced value is the `recordType` as in our declaration. So if we have
var
  foobar: file of Boolean;
the expression `foobar↑` has the data type `Boolean`.
To put everything into relation to each other let’s take a look at a diagram. This diagram is about understanding and shows a very specific situation. Focus on the relationships:
The upper part is in the purview of the OS. The lower part is in the purview of the (our) `program`. The data of the file, here a sequence of 16 `integer` values in total, are exclusively managed by the OS. Any access of the data is done via the OS. Directly reading or writing is not possible. We ask the OS to copy the first 4 `integer` data values for us into our buffer. We do so, because copying 4 integers individually is slower than copying them all together in one go.[fn 4]
Sliding window
The three different storage locations – the actual data file, the internal buffer, and the buffer variable – work together in providing us a “view” of the file. If we overlay everything that contains the same information, we get the following image:
Here, the second quartet of integers was loaded into the internal buffer (green background). The file buffer points to the second component of the internal buffer. This is represented by a bluish hue over the sixth component of the entire file. Everything else is shaded, meaning we can view and manipulate only the sixth component.
Advancing the window
This sliding window can be advanced (in the rightwards direction, i. e. in the direction of EOF) with the routines `get` and `put`. Both advance the file buffer to point to the next item in the internal buffer. Once the internal buffer has been completely processed, the next batch of components is loaded or stored. Calling `get` is only legal while a file is in inspection mode; likewise, `put` is only legal while a file is in generation mode.
Using the window
`Get` and `put` take one non-optional parameter, a `file` (or `text`) variable. `Put` takes the current contents of the buffer variable and ensures they are written to the actual file. Let’s see this in action. Consider the following `program`:
program getPutDemo(output);
type
  realFile = file of real;
var
  score: realFile;
begin
The following table shows the state of `score` after each statement: the components written to the file so far, and the value of the buffer variable `score↑`, which marks where the sliding window is (always at the end of the file while writing).
source code | file contents after successful operation | buffer variable score↑
---|---|---
rewrite(score); | (empty) | undefined
score^ := 97.75; | (empty) | 97.75
put(score); | 97.75 | undefined
score^ := 98.38; | 97.75 | 98.38
put(score); | 97.75, 98.38 | undefined
score^ := 100.00; | 97.75, 98.38 | 100.00
{ For demonstration purposes: no `put(score)` here. } | 97.75, 98.38 | 100.00
Now let’s print the file `score` we just filled with some `real` values. For a change we use `get`. Like `read`/`readLn`, `get` is only allowed if not `EOF`:
  reset(score);
  while not EOF(score) do
  begin
    writeLn(score^);
    get(score);
  end;
end.
Note that this prints just two `real` values:
9.775000000000000E+01
9.838000000000000E+01
The third `real` value, although defined, was not written by a corresponding `put(score)`.
Requirements
As mentioned above, `get` may only be called when the specified file is in inspection mode, whereas `put` may only be called when the file is in generation mode. More specifically, calling `get(F)` is only allowed when `EOF(F)` is `false`, and calling `put(F)` is only allowed when `EOF(F)` is `true`. In other words, reading past the EOF is forbidden, while writing has to occur at the EOF.
After successfully calling `rewrite(F)` (or the EP procedure `extend(F)`) the value of `EOF(F)` becomes `true`. Any subsequent `put(F)` does not alter this value. After calling `reset(F)` the value of `EOF(F)` depends on whether the given file is empty. Any subsequent `get(F)` may change this value from `false` to `true` (never in the reverse direction).
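The rules for `EOF` stated above can be observed with a small sketch. It assumes an ISO 7185 compiler such as the GPC; the program name `eofDemo` and the variable `f` are chosen for illustration only. How a `Boolean` value is spelled in the output depends on the compiler, so the comments only name the value.

```pascal
program eofDemo(output);
var
  f: file of char;
begin
  rewrite(f);
  writeLn(EOF(f)); { true: generation mode always writes at the end }
  f^ := 'A';
  put(f);
  writeLn(EOF(f)); { still true: put does not alter EOF }
  reset(f);
  writeLn(EOF(f)); { false: the file holds one component }
  get(f);
  writeLn(EOF(f)); { true: the only component has been passed }
end.
```

After the last line a further `get(f)` would be an error, because `EOF(f)` is already `true`.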
As you know, it is forbidden to read a variable that was not previously defined (i. e. you have to assign a value beforehand). Because it involves reading the buffer value, writing a buffer is only allowed if it was previously defined. Consider the following faulty code snippet:
temperatures^ := 88;
put(temperatures); { ✔ Good. Will successfully write 88. }
put(temperatures); { ↯ Bad. temperatures^ is not defined. }
put(temperatures); { ↯ temperatures^ still not defined. }
`get` and `put` advance the sliding window. Only the first `put(temperatures)` reads the defined value `temperatures↑`. The second and every following `put(temperatures)` would, however, read an undefined `temperatures↑`.
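A corrected variant of the faulty snippet might look like the following sketch, which defines the buffer variable before every `put`. The values 91 and 85 are arbitrary sample data, and an ISO 7185 compiler such as the GPC is assumed.

```pascal
program putDemo(output);
type
  integerFile = file of integer;
var
  temperatures: integerFile;
begin
  rewrite(temperatures);
  { Every put is preceded by an assignment to the buffer variable. }
  temperatures^ := 88;
  put(temperatures);
  temperatures^ := 91;
  put(temperatures);
  temperatures^ := 85;
  put(temperatures);

  { Read the file back to confirm that all three values arrived. }
  reset(temperatures);
  while not EOF(temperatures) do
  begin
    writeLn(temperatures^:1);
    get(temperatures);
  end;
end.
```

This prints 88, 91 and 85, one value per line.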
Text buffer
The buffer value of a `text` has some special behavior. A `text` file is essentially a `file of char`. Everything presented in this chapter can be applied to a `text` file just as if it were a `file of char`. However, as repeatedly emphasized, a `text` file is structured into lines, each line consisting of a (possibly empty) sequence of `char` values. When `EOLn(input)` becomes `true`, the buffer variable `input↑` returns a space character (`' '`). Thus, when using buffer variables, the only way to distinguish between a space character as part of a line and a space character terminating a line is to call the function `EOLn`.
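To illustrate the distinction, here is a minimal sketch that counts genuine space characters and line ends separately. It writes its own two-line sample into a temporary `text` variable `t`, so it does not depend on `input`. The names `t`, `spaces`, and `lines` are hypothetical, and an ISO 7185 compiler such as the GPC is assumed.

```pascal
program eolnDemo(output);
var
  t: text;
  spaces, lines: integer;
begin
  { Sample data: two lines, the first one containing a single space. }
  rewrite(t);
  writeLn(t, 'a b');
  writeLn(t, 'c');

  reset(t);
  spaces := 0;
  lines := 0;
  while not EOF(t) do
  begin
    if EOLn(t) then
    begin
      { t^ is a space here too, but it stands for the end of the line. }
      lines := lines + 1;
    end
    else
    begin
      if t^ = ' ' then
      begin
        spaces := spaces + 1;
      end;
    end;
    get(t);
  end;
  writeLn('spaces: ', spaces:1, ', lines: ', lines:1);
end.
```

The program prints `spaces: 1, lines: 2`. Without the `EOLn` check both line ends would have been counted as spaces as well.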
Rationale: Various operating systems employ different methods of marking the end of a line. It has to be marked somehow, because this information cannot be magically deduced out of nowhere. However, there are multiple strategies out there. This is really inconvenient for the programmer who cannot take account of everything. Pascal has therefore chosen that, regardless of the specific EOL marker used, the buffer variable contains a simple space character at the end of a line. This is predictable, and predictable behavior is good.
Purpose
It is worth noting that all functionality of `read`/`readLn` and `write`/`writeLn` can at their heart be based on `get` and `put` respectively. Here are some basic relationships: If `f` refers to a `file of recordType` variable and `x` is a `recordType` variable, `read(f, x)` is equivalent to
x := f^;
get(f);
Similarly, `write(f, x)` is equivalent to
f^ := x;
put(f);
For `text` variables the relationships are not as straightforward. The behavior depends on the various destination/source variables’ data types. Nonetheless, one simple relationship is, if `f` refers to a `text` variable, `readLn(f)` is equivalent to
while not EOLn(f) do
begin
  get(f);
end;
get(f);
The latter `get(f)` actually “consumes” the newline marker.
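Building on these relationships, copying a file component by component can be sketched with buffer variables alone. The procedure name `copyFile` and its parameters `src` and `dst` are hypothetical and not part of any standard. The sketch assumes an ISO 7185 compiler such as the GPC.

```pascal
program copyDemo(output);
type
  integerFile = file of integer;
var
  source, destination: integerFile;
  i: integer;

{ Copies all components of src to dst. File parameters must be var parameters. }
procedure copyFile(var src, dst: integerFile);
begin
  reset(src);
  rewrite(dst);
  while not EOF(src) do
  begin
    dst^ := src^; { hand the current component over ... }
    put(dst);     { ... write it ... }
    get(src);     { ... and advance the source window }
  end;
end;

begin
  rewrite(source);
  for i := 1 to 3 do
  begin
    write(source, 10 * i);
  end;

  copyFile(source, destination);

  reset(destination);
  while not EOF(destination) do
  begin
    read(destination, i);
    writeLn(i:1);
  end;
end.
```

The program prints 10, 20 and 30, one value per line. No intermediate variable of the component type is needed, because the value travels directly from one buffer variable to the other.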
Support
Unfortunately, of the compilers presented in the opening chapter, Delphi and the FPC do not support all ISO 7185 functionality.
- Delphi and the FPC require files to be explicitly associated with file names before performing any operations. It is required to back any kind of `file` by a file in background memory (e. g. on disk). How this works will be explained in the second part of this book, since ISO standard 10206 “Extended Pascal” defines some means for that, too.
- The FPC provides the procedures `get` and `put`, and file variable buffers only in `{$mode ISO}` or `{$mode extendedPascal}`. Delphi does not support this at all.
Rest assured, everything works fine if you are using the GPC. The authors cannot make a statement regarding the Pascal‑P compiler since they have not tested it.
Tasks
`file` variable is initialized. That means a mode has to be selected by invoking `reset` or `rewrite` first.
Think of `reset`/`rewrite` as a special kind of `new` and the file variable as a pointer. You may only dereference the pointer (= append `↑`) if it was previously defined.
Write a filter `program` that merges repeating space characters `' '` into a single space character. (A filter program means, process `input` and write to `output` with the specified rule applied on the given input.) Extra credit: Write a solution that does not declare any additional variables (i. e. there is no `var`-section).
program mergeRepeatingSpace(input, output);
const
  { Choose any character, but ' ' (a single space). }
  nonSpaceCharacter = 'X';
begin
  output^ := nonSpaceCharacter;
  while not EOF do
  begin
Since `input↑` contains a space character when we are at the EOL, the only correct way of emitting a new line is using `writeLn`. `WriteLn` does not use the buffer variable. In other words, `output↑` may contain any value now.
    if EOLn then
    begin
      writeLn;
In this branch of the `if` statement, `input↑` holds a space character. However, this instance of a space character should not trigger the repeating space character detection. Therefore we assign a non-space character to `output↑` (now acting as a “previous character variable”).
      output^ := nonSpaceCharacter;
    end
    else
    begin
      if [output^, input^] <> [' '] then
In Extended Pascal using the `string`/`char` concatenation operator `+` you could write:
      if output^ + input^ <> '' then
Remember that the plain `=`‑comparison pads both operands to the same length using space characters.
      begin
        write(input^);
      end;
      output^ := input^;
      { The buffer variable (`output↑`) now contains the previous character. }
    end;
    get(input);
  end;
end.
Boolean
variable as a flag whether the preceding character was non-newline space character.
Write a `program` that reads from `input` and only writes the last input `char` value to `output`. On a standard Linux or FreeBSD system you can test your `program` with the command line `echo -n '123H' | ./printLastCharacter`. The `-n` option flag is important. Otherwise your `program` might just display a single space (`' '`) character. Alternatively, you may use `printf '123H' | ./printLastCharacter`. With either variant your `program` should write a line consisting of the single character `H`.
program printLastCharacter(input, output);
begin
  { We cannot output anything, unless there is at least one character. }
  if not EOF(input) then
  begin
    while not EOF(input) do
    begin
      { After `get(input)`, `input↑` becomes undefined once
        we reach `EOF(input)`. Therefore copy it beforehand. }
      output^ := input^;
      get(input);
    end;
    put(output);
    writeLn(output);
  end;
end.
By specifying `input` in the `program` parameter list, the post-assertions of `reset` become true. That means, there has been an implicit (= invisible) `get(input)` before our `begin` in the second line and only after that the value of `EOF(input)` becomes defined.
If you happen to have a compiler supporting Extended Pascal’s `halt` `procedure`, you could eliminate one indentation level:
  { We cannot output anything, unless there is at least one character. }
  if EOF(input) then
  begin
    halt;
  end;
  while not EOF(input) do
Notes:
- [fn 1] Extended Pascal, as defined by ISO standard 10206, also permits an update mode, i. e. reading and writing at the same time, yet this is only possible for “direct-access files” (files that are indexed).
- [fn 2] Extended Pascal (ISO 10206) knows “direct-access files”. Such a file type allows accessing the 94th record in an easy and fast manner, yet it cannot “grow” as needed.
- [fn 3] This is an implementation detail and not a requirement imposed by the programming language. Already the mere presence of an OS is beyond Pascal’s horizon. Nonetheless, this description is a common scheme.
- [fn 4] This is of course under the presumption that we actually need them. Unnecessarily copying data that will not be used later on is a waste of computing time.