A-level Computing/WJEC (Eduqas)/Component 2/Organisation and structure of data

A file is a collection of organised data about one particular person, company or topic. It is made up of records, each of which holds the data for one subject (for example, one customer). Each record is in turn made up of fields, which hold the individual items of data about that subject.

Types of Record

Fixed Length

In a fixed length record, every record is exactly the same size. When coding this, you need to be certain how many characters you will assign to each field, as this cannot be changed later on. Any unused space in a field is padded with whitespace, which wastes storage space. The advantage is that you can calculate exactly where a record starts, as you know how much space each field takes up.
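As a minimal sketch of the idea, the Python below pads two fields to fixed widths and then jumps straight to a record by multiplying its position by the record size. The field names, the widths (10 and 11 characters) and the use of an in-memory file are assumptions made for illustration only.

```python
import io

NAME_WIDTH = 10   # assumed field width in characters
PHONE_WIDTH = 11  # assumed field width in characters
RECORD_SIZE = NAME_WIDTH + PHONE_WIDTH  # every record is exactly 21 bytes


def pack_record(name: str, phone: str) -> bytes:
    """Pad each field with spaces to its fixed width (ASCII text assumed)."""
    if len(name) > NAME_WIDTH or len(phone) > PHONE_WIDTH:
        raise ValueError("field value is longer than its fixed width")
    return (name.ljust(NAME_WIDTH) + phone.ljust(PHONE_WIDTH)).encode("ascii")


def read_record(f, index: int):
    """Jump straight to record `index` (0-based) without reading earlier ones."""
    f.seek(index * RECORD_SIZE)
    raw = f.read(RECORD_SIZE).decode("ascii")
    # Strip the padding that was added when the record was written.
    return raw[:NAME_WIDTH].rstrip(), raw[NAME_WIDTH:].rstrip()


# An in-memory stand-in for a file on disk.
f = io.BytesIO()
for name, phone in [("Ann", "01234567890"), ("Bob", "0777"), ("Cerys", "02920")]:
    f.write(pack_record(name, phone))

print(read_record(f, 2))  # the third record: ('Cerys', '02920')
```

Note that "Bob" still occupies 10 bytes in the file even though only 3 are needed, which is the wasted space described above.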

Variable Length

In variable length records, only the actual data input is stored. They are used in both serial files and sequential files. No space is wasted on padding, but you cannot jump directly to a record, because its position cannot be calculated in advance.
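One possible way to store variable length records is to separate fields with commas and end each record with a newline. The sketch below assumes that layout (and that no field contains a comma or newline); the function names are hypothetical. It shows why reaching a given record means reading every record before it.

```python
import io


def write_record(f, fields):
    """Store only the actual data: fields joined by commas, ended by a newline."""
    f.write(",".join(fields) + "\n")


def read_nth_record(f, index):
    """Reach record `index` (0-based) by reading through the records before it."""
    f.seek(0)
    for position, line in enumerate(f):
        if position == index:
            return line.rstrip("\n").split(",")
    raise IndexError("no such record")


# An in-memory stand-in for a text file on disk.
f = io.StringIO()
write_record(f, ["7", "Ann", "Cardiff"])
write_record(f, ["12", "Bob", "Swansea"])
write_record(f, ["15", "Cerys", "Aberystwyth"])

print(read_nth_record(f, 1))  # ['12', 'Bob', 'Swansea']
```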

Types of File

Serial

A serial file utilises variable length records. New records are appended to the end of the file; they are not sorted in any manner, but are stored in chronological order. To code this, you would use the file facility in your programming language, passing in the append mode so that the data you want to add goes at the end of the file. This is the fastest type of file organisation for adding records and is used when the order should be chronological, for example a record of all outgoing phone calls at a mobile carrier (for a single working day).
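In Python, for instance, append mode is selected by passing "a" to the built-in open function. The sketch below logs phone numbers to a serial file; the file name, the phone numbers and the helper names are made up for the example, and a temporary folder is used so that nothing is left behind.

```python
import os
import tempfile


def log_call(path, number):
    """Append one record to the end of the file; existing records are untouched."""
    with open(path, "a", encoding="utf-8") as f:  # "a" = append mode
        f.write(number + "\n")


def read_all(path):
    """Read the records back in the order in which they were written."""
    with open(path, encoding="utf-8") as f:
        return [line.rstrip("\n") for line in f]


with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "calls.txt")
    for number in ["07700900123", "02920180456", "07700900001"]:
        log_call(path, number)
    calls = read_all(path)

print(calls)  # records come back in chronological order, not sorted
```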

Sequential

A sequential file uses fixed length records. Sequential files are much easier to search for a specific value, as everything stored in them is ordered in a logical manner, usually by a key field. For example, you could search a file of customers for the one with the ID of 7. But there is an added complexity when you want to insert a new record. First, the data in the current file must be copied into a temporary file. This is done record by record, until the place where the record needs to be inserted is reached. The new record is written in the correct place and then the rest of the records are copied over. The old file is deleted and the temporary file becomes the new file, with the record in the correct place. If you want to delete a record from a sequential file, it is simply not copied over when making the temporary file.
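The copy-to-a-temporary-file procedure can be sketched as follows. For readability this example stores each record as a line of text in the form "id,name", ordered by ID; that layout, the file names and the function names are assumptions for illustration rather than a prescribed format.

```python
import os
import tempfile


def insert_record(path, new_id, new_name):
    """Copy records to a temporary file, writing the new one in key order."""
    temp_path = path + ".tmp"
    inserted = False
    with open(path, encoding="utf-8") as old, \
            open(temp_path, "w", encoding="utf-8") as new:
        for line in old:
            record_id = int(line.split(",", 1)[0])
            if not inserted and record_id > new_id:
                new.write(f"{new_id},{new_name}\n")  # correct place reached
                inserted = True
            new.write(line)  # copy the existing record across
        if not inserted:
            new.write(f"{new_id},{new_name}\n")  # new key is the largest
    os.replace(temp_path, path)  # temporary file becomes the new file


def delete_record(path, target_id):
    """Copy every record except the one being deleted."""
    temp_path = path + ".tmp"
    with open(path, encoding="utf-8") as old, \
            open(temp_path, "w", encoding="utf-8") as new:
        for line in old:
            if int(line.split(",", 1)[0]) != target_id:
                new.write(line)
    os.replace(temp_path, path)


with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "customers.txt")
    with open(path, "w", encoding="utf-8") as f:
        f.write("3,Ann\n7,Bob\n9,Cerys\n")
    insert_record(path, 5, "Dai")
    delete_record(path, 7)
    with open(path, encoding="utf-8") as f:
        lines = f.read().splitlines()

print(lines)  # ['3,Ann', '5,Dai', '9,Cerys']
```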

Indexed Sequential

An indexed sequential file is where ordered data is split up into logical chunks. The data is split into groups, for example in alphabetical order. The chunks could be A-C, D-F, G-I, and so on. An index records where each chunk starts, so for the first chunk all names beginning with A, B and C would be found in that index range. This is used to improve search times, as the chunks eliminate the need to look through more records than necessary. However, it can be very complex to insert or remove records, as that may modify the indexes for every group. If there is a very large amount of data, the same concept can be applied again and multiple levels of indexes can be used to speed up searching through the records.
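A single-level index can be sketched in a few lines. In this example the records are held in a sorted list rather than a real file, the chunk size of 3 and the surnames are invented, and the index simply stores the first key of each chunk together with the position where that chunk starts.

```python
from bisect import bisect_right

# Records must already be in order for an index to work.
records = ["Adams", "Baker", "Clark", "Davies", "Evans",
           "Ford", "Green", "Hughes", "Iqbal"]
CHUNK_SIZE = 3  # assumed number of records per chunk

# The index holds (first key in chunk, position where the chunk starts).
index = [(records[i], i) for i in range(0, len(records), CHUNK_SIZE)]
index_keys = [key for key, _ in index]


def find(name):
    """Return (position, records examined); position is None if not found."""
    # Use the index to pick the only chunk that could contain the name.
    chunk = bisect_right(index_keys, name) - 1
    if chunk < 0:
        return None, 0  # name sorts before the first record
    start = index[chunk][1]
    examined = 0
    # Search sequentially, but only inside that one chunk.
    for position in range(start, min(start + CHUNK_SIZE, len(records))):
        examined += 1
        if records[position] == name:
            return position, examined
    return None, examined


print(find("Evans"))  # (4, 2): found after examining 2 records, not 5
```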

Random/Direct Access

In a random access file, the exact location of a record can be jumped to without searching through the other records to find the correct one. To find some data, you just need to know a value that identifies it in the file, usually a primary key. An algorithm, called a hashing algorithm, is then used to work out the exact location of the data from the key.

Alternatively, blocks can be used. Blocks are limited in size; for example, a block might hold 10 records. The hashing algorithm would take you to the start of the block, and that block would then be searched sequentially.
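The block approach can be sketched as below. The hashing algorithm used here (the key modulo the number of blocks) is just one simple choice made for the example, and the number of blocks, the names and the in-memory lists standing in for disk blocks are all assumptions.

```python
NUM_BLOCKS = 5        # assumed number of blocks in the file
BLOCK_CAPACITY = 10   # each block holds at most 10 records

# In-memory stand-in for the blocks of a random access file.
blocks = [[] for _ in range(NUM_BLOCKS)]


def block_for(key: int) -> int:
    """A very simple hashing algorithm: the same key always gives the same block."""
    return key % NUM_BLOCKS


def store(key, data):
    block = blocks[block_for(key)]
    if len(block) >= BLOCK_CAPACITY:
        # A real system would need an overflow strategy; this sketch just stops.
        raise OverflowError("block is full")
    block.append((key, data))


def fetch(key):
    """Jump to the right block, then search only that block sequentially."""
    for stored_key, data in blocks[block_for(key)]:
        if stored_key == key:
            return data
    return None


store(7, "Ann")
store(12, "Bob")    # 12 % 5 == 2, so this lands in the same block as key 7
store(20, "Cerys")

print(block_for(12), fetch(12))  # 2 Bob
```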

Hashing Algorithms

How efficient a hashing algorithm is depends on the data you have. They all share the same main characteristics: they must be deterministic (when given a key, always produce the same result) and uniform (keys spread evenly over the available block range), ensure data is normalised, offer continuity and not be invertible.

Master & Transaction Files

Master Files

Master files are very large and are accessed infrequently. They are stored in a logical way, based around a primary key. But they need to contain a store of everything that has ever happened, so they are updated with a batch of new information from time to time, ensuring they are up to date and can be referenced whenever needed.

Transaction Files

Transaction files store the day-to-day data for a company. They store data serially, i.e. chronologically, for a set time period. This may produce a lot of data, especially if the company is large, but it keeps everything in small, manageable sections. After the day is over, the data can be copied from the transaction file into the master file.

Updating the Master File

To update the master file, first the transaction file must be sorted in order of primary key. This allows the master file to be updated easily. For every record in the sorted transaction file, the master file is updated by comparing the new transaction with what is already in the master file. Once every transaction has been processed, the outputs are a new master file containing the updated data, error reports in case any update failed, and printed reports for use within the company, for example invoices.
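The update can be sketched as a single pass over both files. Here the master file and transaction file are represented as lists of (key, amount) pairs instead of real files, each transaction is assumed to add to a balance, and a transaction with no matching master record is treated as an error; all of these are illustrative assumptions.

```python
def update_master(master, transactions):
    """Return (new master, error report).

    master: list of (key, balance) already sorted by key.
    transactions: list of (key, amount) in chronological order.
    """
    # Step 1: sort the transaction file by primary key.
    transactions = sorted(transactions, key=lambda record: record[0])
    new_master, errors = [], []
    t = 0  # position in the sorted transaction file
    for key, balance in master:
        # Transactions with a smaller key match no master record.
        while t < len(transactions) and transactions[t][0] < key:
            errors.append(f"no master record for key {transactions[t][0]}")
            t += 1
        # Apply every transaction that matches this master record.
        while t < len(transactions) and transactions[t][0] == key:
            balance += transactions[t][1]
            t += 1
        new_master.append((key, balance))
    # Anything left over has a key larger than every master key.
    for key, _ in transactions[t:]:
        errors.append(f"no master record for key {key}")
    return new_master, errors


master = [(1, 100), (2, 50), (4, 0)]
transactions = [(4, 25), (1, -30), (3, 10), (1, 5)]

new_master, errors = update_master(master, transactions)
print(new_master)  # [(1, 75), (2, 50), (4, 25)]
print(errors)      # ['no master record for key 3']
```

Because both files are in key order, each one only needs to be read once from start to finish, which is why the transaction file is sorted first.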

Future-Proofing

Backups

Backing a file up is the process of copying it from one location to another. For example, copying a file from your HDD to an external HDD, and storing that in a fire-proof safe, would be a good method of backup. Companies create a backup policy which dictates how they will go about the process of backing up. A good backup policy will contain the following details: where the backup will be located, what storage medium it will be located on, how frequently backups will be made and how long each backup will be kept (a retention policy).
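At its simplest, a backup is a copy made to a second location. The sketch below copies a file into a backup folder and puts the date in the copy's name so that older backups are not overwritten. The function name, folder names and naming scheme are hypothetical, and a temporary folder stands in for the external drive.

```python
import os
import shutil
import tempfile
from datetime import date


def back_up(source_path, backup_folder, today=None):
    """Copy a file into the backup folder under a date-stamped name."""
    today = today or date.today()
    os.makedirs(backup_folder, exist_ok=True)
    name = f"{today.isoformat()}_{os.path.basename(source_path)}"
    destination = os.path.join(backup_folder, name)
    shutil.copy2(source_path, destination)  # the original file is left in place
    return destination


with tempfile.TemporaryDirectory() as folder:
    source = os.path.join(folder, "master.dat")
    with open(source, "w", encoding="utf-8") as f:
        f.write("important data")
    copy_path = back_up(source, os.path.join(folder, "backups"), date(2024, 1, 31))
    copy_name = os.path.basename(copy_path)
    with open(copy_path, encoding="utf-8") as f:
        copy_contents = f.read()

print(copy_name)  # 2024-01-31_master.dat
```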

Archiving

Archiving is the process of moving files, usually packed into an archive (.7z/.zip/.rar), from one location to another. It keeps important files that are accessed infrequently in case you need to refer to them. Removing them from your main storage medium can improve the speed of your system.