THE UNIVERSITY OF TEXAS AT AUSTIN
SCHOOL OF INFORMATION


LIS 384K.11 (known as INF 385M, beginning with the Fall Semester 2003)
DATABASE-MANAGEMENT PRINCIPLES AND APPLICATIONS
R. E. Wyllys

Variable-Length Record Structures


Introduction

This handout provides a summary view and examples of two ways of implementing a variable-length record structure. This kind of structure makes it possible to have variable-length fields and repeating, or multi-valued, fields. Part of the importance of variable-length records for the field of library and information science is that they are an integral part of the MARC system of the Library of Congress.

The MARC System

MARC stands for MAchine-Readable Cataloging. The MARC system originated at the Library of Congress (LC) in 1965, as a part--probably the single most important part--of the beginnings of the automation of libraries in the U.S. and elsewhere. These beginnings coincided with the arrival of business-oriented computers that could perform various tasks at costs low enough to attract a much wider audience of consumers than in years prior to the mid-1960s.

Building on a pilot project during 1965-1967, the Library of Congress settled in 1968 on a form of computerized recording of cataloging information, the MARC II record. This was the foundation for the current record, called MARC 21, which is essentially the MARC II record with some added features.

The MARC record is a computer-readable and -manipulable record of cataloging information for information-bearing entities (InBEs), such as books, serials, etc. The MARC system moves cataloging information among the institutions that participate in the system, which consists of:

MARC systems exist in several countries, with minor differences to accommodate local needs. What is used in the U.S. is often called USMARC to identify it with this country. Examples of other MARC systems are CAN/MARC and UKMARC, the Canadian and U.K. systems, respectively.

The Nature of the MARC Record

Fixed-Length Computer Records

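A typical computer file consists of a set of records of equal length: i.e., records such that (1) every record contains the same set of fields, and (2) a given type of field occupies the same number of bytes in every record.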

Note that these restrictions allow different types of fields to have different numbers of bytes. For example, suppose that each record contains a Social Security Number (SSN) field and a telephone-number field (plus other fields that we ignore here). Each SSN field will consist of 9 bytes; each telephone-number field, of 10 bytes. The kind of structure we have just described is called a "fixed-length" record.
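The SSN-and-telephone example can be sketched in a few lines of Python. This is only an illustrative sketch: the function name, the sample values, and the assumption that one character occupies one byte are ours, not part of any particular system.

```python
# Assumed layout: a 9-byte SSN followed by a 10-byte telephone number.
SSN_LEN = 9
PHONE_LEN = 10
RECORD_LEN = SSN_LEN + PHONE_LEN  # 19 bytes per record


def read_fixed_records(data: str) -> list[dict[str, str]]:
    """Split a string of back-to-back fixed-length records into fields."""
    if len(data) % RECORD_LEN != 0:
        raise ValueError("data length is not a multiple of the record length")
    records = []
    for start in range(0, len(data), RECORD_LEN):
        record = data[start:start + RECORD_LEN]
        records.append({
            "ssn": record[:SSN_LEN],      # first 9 characters
            "phone": record[SSN_LEN:],    # remaining 10 characters
        })
    return records


# Two made-up records stored with no separators between them.
sample = "1234567895125550100" + "9876543215125550199"
print(read_fixed_records(sample))
```

Because every record has the same length, the program can find any field by arithmetic alone; no character in the file needs to be inspected to locate a boundary.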

Fixed-length records work well for many applications. For example, consider a file used in a company's accounting department to contain the information needed to prepare employees' paychecks. The records in such a file would store information for the individual employees of the company. Each record would need fields that contain such information as SSN, hourly wage rate, number of income-tax withholding deductions claimed, number of hours worked in the current week, number of hours of overtime worked in the current week, total of wages paid to date in the calendar year, and total withheld to date in the calendar year. Each such field will have a fixed length appropriate to the nature of the information in the field; each record will contain the same number of fields; and, hence, each record will be of the same fixed length as every other record in the file.

However, it should be clear that there are types of information that do not fit neatly in fixed-length fields. For example, the title of a book can vary in length anywhere from one byte to hundreds (or even thousands) of bytes; the surname and the first name(s) of an author can vary ("Ann Lee" is much shorter than "Gustaf-Adolphus von Sachsen und Coburg"), and a book (in LC cataloging practice) can have 1, 2, or 3 authors (i.e., there can be a need for multiple fields for authors' names). As a moment's reflection will show, these examples indicate that there can be serious problems in using fixed-length records to handle certain types of data.

How could one design a title field for a fixed-length record for book data? Suppose we know that some book titles can be as long as, say, 1492 characters. If we decide to provide 1492 bytes in a fixed-length field for titles, then the vast majority of titles, being much shorter than 1492 bytes, will occupy only a small portion of the title field, the rest of which will have to be filled with space characters. For most records, this would be a great waste of computer-storage space and communications time. On the other hand, if we decide to provide fewer than 1492 bytes for the title field, say 100 bytes, then we encounter another problem: viz., although most titles will fit into a 100-byte field, there will still be some wasted space with many titles, and, worse, some titles will have to be truncated to their first 100 characters (including space characters). (Furthermore, even the space-wasting 1492-byte field might turn out to be too short for an extraordinary title.)
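The trade-off between wasted space and truncation can be made concrete with a small sketch. The function names are hypothetical, and the two titles are borrowed from the cataloging example later in this handout.

```python
def to_fixed_width(title: str, width: int) -> str:
    """Pad a title with spaces, or truncate it, to exactly `width` characters."""
    return title[:width].ljust(width)


def wasted_characters(titles: list[str], width: int) -> int:
    """Count the padding characters needed to store all titles at `width`."""
    return sum(max(width - len(t), 0) for t in titles)


titles = [
    "Teach Yourself Access 97 in 14 Days",
    "Database Systems: Design, Implementation, and Management",
]

print(wasted_characters(titles, 1492))  # padding for a 1492-character field
print(wasted_characters(titles, 100))   # padding for a 100-character field
print(to_fixed_width(titles[1], 20))    # truncation in a 20-character field
```

With a 1492-character field, almost all of the stored characters for these two titles are padding; with a narrower field, the padding shrinks but longer titles begin to lose their endings.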

The same problem, and a related one, arise with the author field. First, it is clear that the varying lengths of authors' names present the same problem as that of varying lengths of titles. But there is a second problem, which stems from the fact that there can be 1, 2, or 3 authors of a book. If we include 3 author fields in every fixed-length record for a book, then much of the time, there will be nothing in the 2nd author field and the 3rd author field but space characters.

Variable-Length Computer Records

When the staff of the MARC pilot project began, in 1965, to consider how to handle catalog data in computers, they immediately encountered the problems we have just outlined. Furthermore, at that time, almost all computer files that had ever been designed or used were of the fixed-length-record type. The MARC designers came up with a then-novel solution: the variable-length record.

There are two basic ways of designing a variable-length record for computer use. The first way is to mark, or delimit, the beginnings and endings (or, at a minimum, either the beginning or the ending) of fields and records by special characters that are reserved for that purpose. (Note: Almost all computer files, whether of fixed-length or variable-length type, employ a special character to mark the end of the file. And many fixed-length-record computer files use special end-of-record characters for convenience and as a safety measure against error.) In order for a computer program to use a file of variable-length records with variable-length fields (and, possibly, with varying numbers of occurrences of a given field), the program must, as it reads the file, examine each successive character to determine whether the character is one of the special end-of-field or end-of-record delimiters. Whenever a character is found to be a delimiter, the program knows it has finished inputting a field or a record, and the program must take steps to handle the field or record appropriately.
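The character-by-character examination described above might look like the following minimal sketch. The two delimiter characters and the sample data are assumptions chosen for illustration; any characters reserved for the purpose would serve.

```python
END_OF_FIELD = "^"    # assumed delimiter characters for this sketch
END_OF_RECORD = "§"


def scan(stream: str) -> list[list[str]]:
    """Examine each character in turn, cutting fields and records at delimiters."""
    records: list[list[str]] = []
    fields: list[str] = []
    current: list[str] = []
    for ch in stream:
        if ch == END_OF_FIELD:
            fields.append("".join(current))   # a field is complete
            current = []
        elif ch == END_OF_RECORD:
            records.append(fields)            # a record is complete
            fields = []
        else:
            current.append(ch)                # ordinary data character
    return records


print(scan("Ann Lee^Austin^§Joe Smith^^§"))
```

Note that the second record in this made-up input has an empty second field, which the scanner reports as an empty string.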

The second way of designing a variable-length record is to include, at the beginning of each record, a special field, of fixed length, in which the lengths of all the variable-length fields in the record are specified, but to use no special end-of-field or end-of-record characters. This special field, usually called the "header", must itself be of fixed length so that the program can quickly establish the nature of the structure of the whole record, including its variable-length parts, by examining the contents of the header. Often the header, since it is of fixed length, will also include certain fields that are known always to be of a fixed length (e.g., 4 digits for a year).

The MARC record uses both these ways of dealing with records of catalog information. Before we look at the MARC record format, however, we shall consider an example of each way of handling variable-length records.

Example of Variable-Length Record Structure Using Delimiters

Suppose we have some information about three companies, including their addresses, and our contacts in the companies. Here are the data as we might write them on pages in an address book.

          IBM Corporation
          11400 Burnet Road
          Building A1
          Austin, Texas 78758
          Contacts: Sam Robertson

          Big-Bang Startup Company
          10 W. Martin Luther King Jr. Boulevard
          Austin, Texas
          Contacts: Stephen Hawking

          ABC Company
          123 Main Street
          Pocahontas, Iowa 50574
          Contacts: Joe Smith, Jane Roe, Mary Fulano, John A. Doe

Next, suppose we decide to store these data in a computer file using a variable-length structure. First, we display the overall structure of each record, then the delimiters we shall use, and, finally, the foregoing data after being placed in the file.

Record Structure

          COMPANY_NAME    a variable-length field
          ADDRESS         a variable-length field that may be repeated as many times as necessary
          CITY            a variable-length field
          STATE           a variable-length field (state names are used, not their abbreviations)
          ZIP             a variable-length field (since it can be either 5 or 9 digits in length)
          CONTACT_NAME    a variable-length field that may be repeated as many times as needed

Delimiters

          «    beginning of file
          »    end of file
          ƒ    beginning of field
          ^    end of field
          ‡    beginning of subfield, i.e., beginning of one occurrence of a repeatable field
          †    end of subfield, i.e., end of one occurrence of a repeatable field
          ~    beginning of record
          §    end of record

Sample File of Data Stored as Variable-Length Records Using Both Beginning and Ending Delimiters

«~ƒIBM Corporation^ƒ‡11400 Burnet Road†‡Building A1†^ƒAustin^ƒTexas^ƒ78758^ƒ‡Sam Robertson†^§~ƒBig-Bang Startup Company^ƒ‡10 W. Martin Luther King Jr. Boulevard†^ƒAustin^ƒTexas^ƒ^ƒ‡Stephen Hawking†^§~ƒABC Company^ƒ‡123 Main Street†^ƒPocahontas^ƒIowa^ƒ50574^ƒ‡Joe Smith†‡Jane Roe†‡Mary Fulano†‡John A. Doe†^§»

Note: In the second record, that for Big-Bang Startup Company, there is no ZIP code. Its absence is shown by the use of adjacent beginning-of-field and end-of-field delimiters, "ƒ^".
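To show how such a file might be written, here is a hedged sketch of an encoder for the full set of delimiters. The function names are hypothetical. We also assume that ‡ and † mark the beginning and end of a subfield, and that every occurrence of a repeatable field, even a single one, is wrapped in both subfield delimiters.

```python
from typing import Union

BOF, EOF = "«", "»"          # beginning / end of file
BOR, EOR = "~", "§"          # beginning / end of record
BOFLD, EOFLD = "ƒ", "^"      # beginning / end of field
BOSUB, EOSUB = "‡", "†"      # beginning / end of subfield (assumed)


def encode_field(value: Union[str, list]) -> str:
    """Wrap one field; a list is treated as a repeatable field."""
    if isinstance(value, list):
        body = "".join(f"{BOSUB}{v}{EOSUB}" for v in value)
    else:
        body = value
    return f"{BOFLD}{body}{EOFLD}"


def encode_file(records: list) -> str:
    """Encode records (lists of field values) as one delimited string."""
    encoded = "".join(
        BOR + "".join(encode_field(f) for f in rec) + EOR for rec in records
    )
    return BOF + encoded + EOF


big_bang = [
    "Big-Bang Startup Company",
    ["10 W. Martin Luther King Jr. Boulevard"],
    "Austin", "Texas", "",            # empty ZIP field
    ["Stephen Hawking"],
]
print(encode_file([big_bang]))
```

The empty ZIP field is emitted as the adjacent pair "ƒ^", just as described in the note above.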

Next, we observe that there are actually some unnecessary delimiters in the above example. For instance, the physical beginning of a file will be identified by whatever computer operating system is being used, so that our use of an explicit beginning-of-file delimiter is superfluous, and we may omit it. But, of course, once a program starts looking at the contents of a file, it is important for the program to be able to identify the end of the file, so we will not omit the end-of-file delimiter.

In similar fashion, we can observe that it is really not necessary to mark both the beginning and the ending of each record. The beginning of the very first record in the file must coincide with the beginning of the file itself; and the beginnings of second and later records in the file must occur immediately after an end-of-record mark. Thus, we may omit the beginning-of-record delimiters provided that we retain the end-of-record delimiters.

Again in similar fashion, we can note that it is unnecessary to mark both the beginning and ending of each field. The beginning of the first field in a record must coincide with the beginning of the record itself, and the beginnings of second and later fields in the record must occur immediately after an end-of-field mark. Thus, we may omit the beginning-of-field delimiters provided that we retain the end-of-field marks.

Finally, in somewhat similar fashion, we can note that it is unnecessary to mark both the beginning and ending of each subfield. We could reason, in the fashion we have been using, that the beginning of the first subfield in a field must coincide with the beginning of the field itself, and that the beginnings of second and later subfields in the field must occur immediately after an end-of-subfield mark. However, we could also reason that the ending of the first subfield in a field must occur immediately before the beginning of the second subfield; that the ending of the second subfield in a field must occur immediately before the beginning of the third subfield; and so on for further subfields, with the ending of the last subfield coinciding with the end of the field itself. This indicates that it would be sufficient to use just beginning-of-subfield delimiters and to omit end-of-subfield delimiters. (In fact, this is what the MARC record format does.)

Here is the example we used above, except that this time, in keeping with the foregoing reasoning, we have omitted the beginning-of-file delimiters, beginning-of-record delimiters, beginning-of-field delimiters, and end-of-subfield delimiters, with the result shown below.

Minimal Set of Delimiters

          »    end of file
          ^    end of field
          ‡    beginning of subfield, i.e., beginning of one occurrence of a repeatable field
          §    end of record

Sample File of Data Stored as Variable-Length Records Using a Minimal Set of Delimiters

IBM Corporation^‡11400 Burnet Road‡Building A1^Austin^Texas^78758^‡Sam Robertson^§Big-Bang Startup Company^‡10 W. Martin Luther King Jr. Boulevard^Austin^Texas^^‡Stephen Hawking^§ABC Company^‡123 Main Street^Pocahontas^Iowa^50574^‡Joe Smith‡Jane Roe‡Mary Fulano‡John A. Doe^§»

The above example uses delimiters in a fashion quite similar to that of the MARC record format.
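A program that reads a file written with the minimal set of delimiters could be sketched as follows. The function names are hypothetical, and the sketch assumes that the four delimiter characters never occur inside the data themselves. The sample string is the ABC Company record from the file above.

```python
END_OF_FILE, END_OF_RECORD, END_OF_FIELD, SUBFIELD = "»", "§", "^", "‡"


def parse_field(text: str):
    """Return a list for a repeatable field, otherwise the plain string."""
    if text.startswith(SUBFIELD):
        return text[1:].split(SUBFIELD)
    return text


def parse_file(data: str) -> list:
    """Parse a string that uses the minimal delimiter set into records."""
    body, found, _ = data.partition(END_OF_FILE)
    if not found:
        raise ValueError("missing end-of-file delimiter")
    records = []
    # Every record ends with END_OF_RECORD, so the final split piece is empty.
    for raw in body.split(END_OF_RECORD)[:-1]:
        # Likewise, every field ends with END_OF_FIELD.
        fields = raw.split(END_OF_FIELD)[:-1]
        records.append([parse_field(f) for f in fields])
    return records


sample = (
    "ABC Company^‡123 Main Street^Pocahontas^Iowa^50574^"
    "‡Joe Smith‡Jane Roe‡Mary Fulano‡John A. Doe^§»"
)
for record in parse_file(sample):
    print(record)
```

Each repeatable field comes back as a list of its occurrences, and an empty field (such as a missing ZIP code) comes back as an empty string.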

Example of Variable-Length Record Structure Using Header Blocks

Suppose that we have (partial) cataloging data for two books.

          Rob, Peter; Coronel, Carlos. Database Systems: Design, Implementation, and Management. Course Technology; 1997. ISBN:0-7600-4904-1.

          Cassel, Paul. Teach Yourself Access 97 in 14 Days. Sams; 1996. ISBN:0-672-30969-6.

Next, suppose we decide to store these data in a computer file using a variable-length structure that employs the header-block approach. First, we display the overall structure of each record and then the foregoing data after being placed in the file.

Database Structure

          Header Block                         By design, known to be 29 characters long
            RECORD_ID                          The ISBN is used in this example.
            COPYRIGHT_DATE
            TITLE_LENGTH
            LENGTH_OF_FIRST_AUTHOR_FIELD       By LC design, no more than 3 authors
            LENGTH_OF_SECOND_AUTHOR_FIELD
            LENGTH_OF_THIRD_AUTHOR_FIELD
            LENGTH_OF_PUBLISHER_FIELD
          Data Block
            TITLE
            FIRST_AUTHOR
            SECOND_AUTHOR
            THIRD_AUTHOR
            PUBLISHER

Sample File of Data Stored Using a Header Block

07600490411997056009014000017Database Systems: Design, Implementation, and ManagementPeter RobCarlos CoronelCourse Technology§06723096961996035011000000004Teach Yourself Access 97 in 14 DaysPaul CasselSams§»

Translation of Sample for Humans

As an example, we take the header block of the first record. The header-data string is parsed as though it read:

0760049041 1997 056 009 014 000 017

where the first ten characters are the ISBN (0760049041); the next four characters, the copyright date (1997); the next three, the number of characters in the title (56); the next three, the number of characters in the first author's name (9); the next three, the number of characters in the second author's name (14); the next three, the number of characters in the third author's name (0); and the last three, the number of characters in the publisher's name (17). The second header-data string is parsed in an analogous way.
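The parsing just described can be expressed as a short sketch. The names used here (HEADER_LAYOUT, parse_record, and the dictionary keys) are hypothetical; only the field widths come from the description above.

```python
# Field widths of the 29-character header, in order.
HEADER_LAYOUT = [
    ("record_id", 10), ("copyright_date", 4), ("title_len", 3),
    ("author1_len", 3), ("author2_len", 3), ("author3_len", 3),
    ("publisher_len", 3),
]
HEADER_LEN = sum(width for _, width in HEADER_LAYOUT)  # 29
DATA_FIELDS = ["title", "author1", "author2", "author3", "publisher"]


def parse_record(data: str, start: int = 0) -> tuple[dict[str, str], int]:
    """Parse one record at `start`; return it and the offset just past it."""
    pos = start
    header = {}
    for name, width in HEADER_LAYOUT:
        header[name] = data[pos:pos + width]
        pos += width
    record = {"record_id": header["record_id"],
              "copyright_date": header["copyright_date"]}
    # The lengths stored in the header tell us where each data field ends.
    for name in DATA_FIELDS:
        length = int(header[name + "_len"])
        record[name] = data[pos:pos + length]
        pos += length
    return record, pos


sample = (
    "07600490411997056009014000017"
    "Database Systems: Design, Implementation, and Management"
    "Peter RobCarlos CoronelCourse Technology"
)
record, end = parse_record(sample)
print(record)
```

The returned offset marks where the record's data end, so a program can locate the record boundary without examining the data characters themselves.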

Note: This example is a simplified analog of the MARC record structure. It shows how, in principle, header blocks of a fixed length can furnish all the information needed for records of varying lengths. The actual MARC record structure combines the header-block structure with field delimiters. The resulting redundancy helps to reduce data errors.
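Going in the other direction, a record in the simplified header-block format of this handout could be assembled as in the sketch below. The function name and its argument checks are our assumptions; the sketch does not attempt to produce a real MARC record, and it omits the end-of-record character.

```python
def build_record(isbn: str, year: str, title: str,
                 authors: list[str], publisher: str) -> str:
    """Assemble the 29-character header followed by the data block."""
    if len(isbn) != 10 or len(year) != 4:
        raise ValueError("ISBN must be 10 characters and year 4 characters")
    if len(authors) > 3:
        raise ValueError("at most 3 authors are allowed")
    # Pad the author list with empty strings so there are always 3 slots.
    slots = list(authors) + [""] * (3 - len(authors))
    fields = [title, *slots, publisher]
    if any(len(f) > 999 for f in fields):
        raise ValueError("a field is too long for a 3-digit length")
    # Each length is written as exactly 3 digits, zero-padded on the left.
    header = isbn + year + "".join(f"{len(f):03d}" for f in fields)
    return header + "".join(fields)


print(build_record("0672309696", "1996",
                   "Teach Yourself Access 97 in 14 Days",
                   ["Paul Cassel"], "Sams"))
```

A missing author costs only the three characters "000" in the header, rather than a whole field of space characters as in a fixed-length record.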


Last revised 2004 Feb 23