Record-based File I/O

(a.k.a. record-oriented, structure-oriented)

Overview

There are two basic types of files: text and binary. Some operating systems don't distinguish between the two, leaving it up to people (or applications). Essentially, text-based files are meant to be read/written by humans. Binary files are meant for the computer.

Text files are generally unstructured and are used for things like:

Source code (C, C++, etc.)
Configuration files (e.g. key/value pairs)
Web pages (i.e. HTML files)
XML (structured text files with many applications)
Unformatted text meant for humans (e.g. readme files)

Binary files are generally (rigorously) structured and used for:

Compiled source code (e.g. object files, executable files)
Images (e.g. JPG, PNG, etc.)
Videos (e.g. MP4, WMV, etc.)
Audio (e.g. MP3, FLAC, etc.)
Office documents (e.g. word processing, spreadsheets, etc.)
Databases (e.g. student record systems)

There are reasons why you would choose one format over the other:

Convenience - If humans need to interact with the data, text is much easier.
Efficiency - Some data stored as binary is smaller than if it was stored as text and can be processed more efficiently.
Flexibility - Binary format can store virtually any data type. Text can become burdensome when trying to store certain data (e.g. videos).

As an example, we'll create a system that contains information about students. To keep it simple, we're just going to track 5 pieces of information:

A unique identifier (C-string)
A student's first name (C-string)
A student's last name (C-string)
A student's age (integer)
A student's GPA (double)

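To further restrict the data, each string field has a fixed size: the ID is stored in 8 bytes and each name in 20 bytes. Because these are C-strings, each array must also hold the terminating null character, so an ID can be at most 7 characters and each name at most 19.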

Our C structure to hold each student record looks like this:

#define MAX_ID_LEN    8
#define MAX_NAME_LEN 20

struct STUDENT
{
  char ID[MAX_ID_LEN];           /* e.g. 101001 */
  char last_name[MAX_NAME_LEN];  /* e.g. Smith  */
  char first_name[MAX_NAME_LEN]; /* e.g. John   */
  int age;                       /* e.g. 22     */
  double GPA;                    /* e.g. 3.14   */
};

Storing the Data as Text

Suppose we store the data in a text file. We'd still have to give it some kind of structure so that we could tell one record from the next. Here's a sample student:

        ID: 101001
 Last name: Faith
First name: Ian
       Age: 18
       GPA: 3.140000

Suppose we store each student as one line of text, with the fields separated by commas and ending with a newline (OS-dependent):

101001,Faith,Ian,18,3.140000<NL>
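A line in this format could be produced with a single formatted write. The sketch below is one possible way to do it, not the original code: format_student is a hypothetical helper name, and it assumes snprintf (C99) is available so the buffer can't overflow. The %f conversion is what yields the six digits after the decimal point (3.140000).

```c
#include <stdio.h>  /* snprintf */

#define MAX_ID_LEN    8
#define MAX_NAME_LEN 20

struct STUDENT
{
  char ID[MAX_ID_LEN];
  char last_name[MAX_NAME_LEN];
  char first_name[MAX_NAME_LEN];
  int age;
  double GPA;
};

/* Hypothetical helper: format one record as a comma-separated line.
   Returns the number of characters written (not counting the
   terminating NUL), or -1 if the buffer was too small. */
int format_student(char *buffer, size_t size, const struct STUDENT *s)
{
  int n = snprintf(buffer, size, "%s,%s,%s,%d,%f\n",
                   s->ID, s->last_name, s->first_name, s->age, s->GPA);

  return (n < 0 || (size_t) n >= size) ? -1 : n;
}
```

The same format string could be passed to fprintf to write directly to a file opened in text mode.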

Here's what a set of records would look like:

101001,Faith,Ian,18,3.140000<NL>
102001,Tufnel,Nigel,19,3.250000<NL>
103001,Savage,Viv,22,3.870000<NL>
104001,Shrimpton,Mick,25,2.610000<NL>
105001,Besser,Joe,19,2.180000<NL>
106001,Smalls,Derek,19,2.640000<NL>
107001,St.Hubbins,David,20,2.900000<NL>
108001,Fleckman,Bobbi,20,3.190000<NL>
109001,Eton-Hogg,Denis,21,3.830000<NL>
110001,Upham,Denny,18,3.310000<NL>
111001,McLochness,Ross,19,1.980000<NL>
112001,Pudding,Ronnie,20,2.890000<NL>
113001,Schindler,Danny,20,3.410000<NL>
114001,Pettibone,Jeanine,28,3.330000<NL>
115001,Fame,Duke,18,2.990000<NL>
116001,Fufkin,Artie,19,2.900000<NL>
117001,DiBergi,Marty,19,3.750000<NL>
118001,Floyd,Pink,20,3.840000<NL>
119001,Zeppelin,Led,19,3.810000<NL>
120001,Mason,Nick,18,2.710000<NL>
121001,Wright,Richard,19,2.940000<NL>
122001,Waters,Roger,19,3.090000<NL>
123001,Gilmore,David,20,3.500000<NL>

Let's look at the pros and cons of using text format:

Pros:

The data is easy to read and verify. This is very important when developing a system (for debugging).
We don't need any special "tools" to create/modify the data. Any text editor will work.
No special documentation is required to use the data. It's just a bunch of strings separated by commas.
Even though the name fields are 20 bytes long, we're only storing exactly what we need.
Since everything in the file is text, we don't have to worry about system dependencies like endianness for multibyte integers and floating point numbers. Everything is a bunch of bytes, which is likely the same on all systems.

Cons:

If we have a comma in a name (or other fields), it will complicate things because we're using the comma as a delimiter.
Storing large integers requires more space in text than binary. Although in this trivial example, the age is only 2 digits (requiring 2 bytes to store), if we have other integers (likely), they may require much more space, e.g. this string "1234567890" requires 10 characters to store, but an integer stored as binary only requires 4 bytes (on most systems).
Line length is a practical limit. Many editors and line-oriented tools cope poorly with very long lines, so if each record held thousands of bytes of data (likely for any real system), this would become a problem.
Moving a file between systems could be complicated because of the potential for different end-of-line characters.
Finding a field within the string requires us to walk the string byte-by-byte to find the comma separators. This will be very inefficient with large amounts of data.
Since everything is text, all numbers (floats, doubles, integers, etc.) would need to be converted between text and numerical values each time we read or wrote the data.
Probably other issues...
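To make the conversion cost in the list above concrete, here is a minimal sketch of reading one comma-separated line back into the structure. The helper name parse_student is an assumption for illustration. It relies on sscanf with field widths of 7 and 19 so that each C-string still has room for its terminating NUL, and it does not handle commas inside a field.

```c
#include <stdio.h>   /* sscanf */
#include <string.h>  /* memset */

#define MAX_ID_LEN    8
#define MAX_NAME_LEN 20

struct STUDENT
{
  char ID[MAX_ID_LEN];
  char last_name[MAX_NAME_LEN];
  char first_name[MAX_NAME_LEN];
  int age;
  double GPA;
};

/* Hypothetical helper: parse "ID,last,first,age,GPA" into *s.
   Returns 1 on success, 0 if the line doesn't match that layout. */
int parse_student(const char *line, struct STUDENT *s)
{
    /* Start from a zeroed record so unused bytes are predictable */
  memset(s, 0, sizeof(*s));

    /* Every field is scanned as text; age and GPA are then
       converted from text to int and double by sscanf */
  return sscanf(line, "%7[^,],%19[^,],%19[^,],%d,%lf",
                s->ID, s->last_name, s->first_name,
                &s->age, &s->GPA) == 5;
}
```

Every call walks the line character by character and converts two numbers, which is exactly the work a binary record avoids.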

It's already starting to look like text is going to be too limiting for a real world system, which it is. So, binary it is!

Keep in mind that, if the data is very limited (i.e. few records with few fields), storing the data as text is perfectly acceptable (and I prefer it, personally). However, we'd like to develop a system that can handle very large numbers of records with many fields in each record and we'd like to do this very efficiently. At some point, the textual representation will become a real pain to use. See uuencoding to understand the complexity.

Storing the Record in Binary

As a reminder, here's what our data looks like:

#define MAX_ID_LEN    8
#define MAX_NAME_LEN 20

struct STUDENT
{
  char ID[MAX_ID_LEN];           /* e.g. 101001 */
  char last_name[MAX_NAME_LEN];  /* e.g. Smith  */
  char first_name[MAX_NAME_LEN]; /* e.g. John   */
  int age;                       /* e.g. 22     */
  double GPA;                    /* e.g. 3.14   */
};

Here's some code to show how we might (inefficiently and incorrectly) write out the data in binary:

  /* Initialize a sample record */
struct STUDENT s = {"101001", "Faith", "Ian", 18, 3.14};

  /* Open file for binary/write */
FILE *outfile = fopen("student-record", "wb");

  /* Write all 5 fields of the record to the file */
fwrite(&s.ID, sizeof(char), MAX_ID_LEN, outfile);
fwrite(&s.last_name, sizeof(char), MAX_NAME_LEN, outfile);
fwrite(&s.first_name, sizeof(char), MAX_NAME_LEN, outfile);
fwrite(&s.age, sizeof(int), 1, outfile);
fwrite(&s.GPA, sizeof(double), 1, outfile);

  /* Close the file, flushing all buffers */
fclose(outfile);

In order to keep the sample code simple, very little error handling has been coded. In a Real World™ application, you would check that all of the I/O functions (fopen, fwrite, etc.) were successful. It is quite possible that they could fail (e.g. disk full, invalid filenames, etc.).
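As a rough illustration of the checks mentioned above, the following sketch writes a block of bytes to a file and reports failure from any of the three calls. The function name write_bytes and its return convention (0 for success, -1 for failure) are assumptions made for this example.

```c
#include <stdio.h>  /* FILE, fopen, fwrite, fclose, perror */

/* Hypothetical helper: write size bytes from data to filename.
   Returns 0 on success, -1 on any failure. */
int write_bytes(const char *filename, const void *data, size_t size)
{
  int ok;

    /* fopen returns NULL on failure (e.g. invalid filename) */
  FILE *outfile = fopen(filename, "wb");
  if (!outfile)
  {
    perror(filename);
    return -1;
  }

    /* fwrite returns the number of items written (1 expected) */
  ok = (fwrite(data, size, 1, outfile) == 1);

    /* fclose flushes the buffers, so it can fail too (e.g. disk full) */
  if (fclose(outfile) != 0)
    ok = 0;

  return ok ? 0 : -1;
}
```

A call such as write_bytes("student-record", &s.age, sizeof(s.age)) would then tell the caller whether the data actually reached the file.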

Now that our data is in a binary file, we can no longer simply view it with a text editor. We'll be looking at the files using a hex dump tool called dumpit (Windows, Mac, Linux). Example usage:

dumpit student-record

Output:

student-record:
       00 01 02 03 04 05 06 07  08 09 0A 0B 0C 0D 0E 0F
--------------------------------------------------------------------------
000000 31 30 31 30 30 31 00 00  46 61 69 74 68 00 00 00   101001..Faith...
000010 00 00 00 00 00 00 00 00  00 00 00 00 49 61 6E 00   ............Ian.
000020 00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00   ................
000030 12 00 00 00 1F 85 EB 51  B8 1E 09 40               .......Q...@

To make the fields easier to find in the dump: the ID occupies offsets 0x00-0x07, the age (12 00 00 00) occupies offsets 0x30-0x33, and the GPA (1F 85 EB 51 B8 1E 09 40) occupies offsets 0x34-0x3B.

The first thing we see is that the size of the file is 60 bytes. That's because of the sizes of the fields:

        ID:  8 bytes
 last_name: 20 bytes
first_name: 20 bytes
       age:  4 bytes
+      GPA:  8 bytes
--------------------
            60 bytes

Add them all up and you get 60 bytes. Compare that with the 29 or 30 bytes required to hold the data as text (28 characters plus a 1- or 2-byte line ending):

101001,Faith,Ian,18,3.140000<NL>

It seems that we are using more space than necessary and we are. However, this is only one of the cons of using binary data. And, as stated in the pros and cons above, this isn't always the case. It just happens to be the case for this small example. In the long run, the benefits of using binary data will outweigh the extra space required.

OK, that was... interesting. But remember I said this technique was inefficient AND incorrect? Let's make it more efficient (which will also make it correct at the same time.) This is where C structures really shine.

Instead of writing one field-at-a-time to the file, we can write the entire structure (record) at once. For a small structure like this, the benefits are not quite as significant. However, you can imagine a real world situation where you have hundreds of fields, with many of the fields being structures themselves. Reading/writing individual fields is not only tedious and inefficient, but very error prone.

Take a look at this structure and realize it would be nigh impossible to write out each field individually. You would have to know the exact layout of every field in all of the many nested structures. That's why we don't want to write individual fields!

Not only would it be very difficult and inefficient, but suppose later on you decide to change the type of one of the fields in the structure from, say, a 2-byte short integer to a 4-byte integer. You would have to find every single line in your program where you were reading or writing that field and change it. Good luck with that.

So, we can replace the 5 calls to fwrite above with a single call:

fwrite(&s, sizeof(struct STUDENT), 1, outfile);

And this is the dump of the file:

student-record2:
       00 01 02 03 04 05 06 07  08 09 0A 0B 0C 0D 0E 0F
--------------------------------------------------------------------------
000000 31 30 31 30 30 31 00 00  46 61 69 74 68 00 00 00   101001..Faith...
000010 00 00 00 00 00 00 00 00  00 00 00 00 49 61 6E 00   ............Ian.
000020 00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00   ................
000030 12 00 00 00 00 00 00 00  1F 85 EB 51 B8 1E 09 40   ...........Q...@

The first thing you will notice is that the file is larger. It's 4 bytes larger. It was 60 bytes, now it's 64 bytes. You will also notice that the 4 zero bytes at offsets 0x34-0x37, between the age and the GPA, are the reason. What gives?

Long story short: The extra space (padding) is for alignment. This is kind of an involved topic that you can read more about here: Structure Alignment. Briefly, for reasons of efficiency, members (fields) of a structure should be aligned on address boundaries that are multiples of the size of the data. This means that short integers should be on addresses that are evenly divisible by 2, integers and floats should be on addresses that are evenly divisible by 4, long integers (LP64 model), doubles and pointers (64-bit) should be on addresses that are evenly divisible by 8, etc. In order for the double in the structure above to be on the correct address, 4 extra bytes of "padding" are added after the integer so that the double "moves over" to the correct address.

That's why reading/writing individual fields of a structure is more difficult. The proper way is to always read the entire structure, which preserves this extra padding between fields. I showed you the "incorrect" way so that you would understand and appreciate why it was wrong and do it the correct way.
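You can ask the compiler where it put each field instead of reading the hex dump. This is a small sketch using the standard offsetof macro. The helper names padding_before_GPA and print_layout are made up for this example, and the values printed depend on the compiler and platform.

```c
#include <stddef.h>  /* offsetof, size_t */
#include <stdio.h>   /* printf           */

#define MAX_ID_LEN    8
#define MAX_NAME_LEN 20

struct STUDENT
{
  char ID[MAX_ID_LEN];
  char last_name[MAX_NAME_LEN];
  char first_name[MAX_NAME_LEN];
  int age;
  double GPA;
};

/* Number of padding bytes the compiler placed between age and GPA */
size_t padding_before_GPA(void)
{
  size_t end_of_age = offsetof(struct STUDENT, age) + sizeof(int);
  return offsetof(struct STUDENT, GPA) - end_of_age;
}

/* Print where each field starts and how big the whole structure is */
void print_layout(void)
{
  printf("        ID: offset %2lu\n",
         (unsigned long) offsetof(struct STUDENT, ID));
  printf(" last_name: offset %2lu\n",
         (unsigned long) offsetof(struct STUDENT, last_name));
  printf("first_name: offset %2lu\n",
         (unsigned long) offsetof(struct STUDENT, first_name));
  printf("       age: offset %2lu\n",
         (unsigned long) offsetof(struct STUDENT, age));
  printf("       GPA: offset %2lu\n",
         (unsigned long) offsetof(struct STUDENT, GPA));
  printf("   padding: %lu bytes before GPA\n",
         (unsigned long) padding_before_GPA());
  printf("    sizeof: %lu bytes\n",
         (unsigned long) sizeof(struct STUDENT));
}
```

On a system that aligns doubles on 8-byte boundaries, as in the dump above, the padding comes out as 4 bytes and the structure as 64 bytes. Other platforms may report different numbers, which is one reason binary files are not automatically portable.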

Storing Multiple Records

OK, so we now know how to store a structure in the file, but we want to store many such structures (records). Here's more sample data (23 records) that we are going to store in the file:

#define MAX_ID_LEN 8
#define MAX_NAME_LEN 20

struct STUDENT
{
  char ID[MAX_ID_LEN];           /* e.g. 101001 */
  char last_name[MAX_NAME_LEN];  /* e.g. Smith  */
  char first_name[MAX_NAME_LEN]; /* e.g. John   */
  int age;                       /* e.g. 22     */
  double GPA;                    /* e.g. 3.14   */
};

struct STUDENT Students[] = {
  {"101001", "Faith",      "Ian",     18, 3.14},
  {"102001", "Tufnel",     "Nigel",   19, 3.25},
  {"103001", "Savage",     "Viv",     22, 3.87},
  {"104001", "Shrimpton",  "Mick",    25, 2.61},
  {"105001", "Besser",     "Joe",     19, 2.18},
  {"106001", "Smalls",     "Derek",   19, 2.64},
  {"107001", "St.Hubbins", "David",   20, 2.90},
  {"108001", "Fleckman",   "Bobbi",   20, 3.19},
  {"109001", "Eton-Hogg",  "Denis",   21, 3.83},
  {"110001", "Upham",      "Denny",   18, 3.31},
  {"111001", "McLochness", "Ross",    19, 1.98},
  {"112001", "Pudding",    "Ronnie",  20, 2.89},
  {"113001", "Schindler",  "Danny",   20, 3.41},
  {"114001", "Pettibone",  "Jeanine", 28, 3.33},
  {"115001", "Fame",       "Duke",    18, 2.99},
  {"116001", "Fufkin",     "Artie",   19, 2.90},
  {"117001", "DiBergi",    "Marty",   19, 3.75},
  {"118001", "Floyd",      "Pink",    20, 3.84},
  {"119001", "Zeppelin",   "Led",     19, 3.81},
  {"120001", "Mason",      "Nick",    18, 2.71},
  {"121001", "Wright",     "Richard", 19, 2.94},
  {"122001", "Waters",     "Roger",   19, 3.09},
  {"123001", "Gilmore",    "David",   20, 3.50}
};

int Count = sizeof(Students) / sizeof(*Students);

Written one after another, the 23 records occupy 1,472 bytes: there are 23 records and each record is 64 bytes, and 23 * 64 is 1,472. We're going to make a slight addition to our file to help when reading back the information. As it stands now, the simplest way to find out how many records are in the file is to read them all one at a time. To make it more efficient, we're going to store that count as the first integer in the file, which brings the file size to 1,476 bytes.

To create the file, just use a loop to write each structure to the file. This is what the first few records in the file look like with the count stored:

student-records:
       00 01 02 03 04 05 06 07  08 09 0A 0B 0C 0D 0E 0F
--------------------------------------------------------------------------
000000 17 00 00 00 31 30 31 30  30 31 00 00 46 61 69 74   ....101001..Fait
000010 68 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00   h...............
000020 49 61 6E 00 00 00 00 00  00 00 00 00 00 00 00 00   Ian.............
000030 00 00 00 00 12 00 00 00  00 00 00 00 1F 85 EB 51   ...............Q
000040 B8 1E 09 40 31 30 32 30  30 31 00 00 54 75 66 6E   ...@102001..Tufn
000050 65 6C 00 00 00 00 00 00  00 00 00 00 00 00 00 00   el..............
000060 4E 69 67 65 6C 00 00 00  00 00 00 00 00 00 00 00   Nigel...........
000070 00 00 00 00 13 00 00 00  00 00 00 00 00 00 00 00   ................
000080 00 00 0A 40 31 30 33 30  30 31 00 00 53 61 76 61   ...@103001..Sava
000090 67 65 00 00 00 00 00 00  00 00 00 00 00 00 00 00   ge..............
0000A0 56 69 76 00 00 00 00 00  00 00 00 00 00 00 00 00   Viv.............
0000B0 00 00 00 00 16 00 00 00  00 00 00 00 F6 28 5C 8F   .............(\.
0000C0 C2 F5 0E 40 31 30 34 30  30 31 00 00 53 68 72 69   ...@104001..Shri
0000D0 6D 70 74 6F 6E 00 00 00  00 00 00 00 00 00 00 00   mpton...........

The first integer (4 bytes, at offsets 0x00-0x03) is the count. The value 17 is in hexadecimal (and little-endian), which is the value 23 in decimal, the exact number of records in the file. Knowing the count will make it easy to allocate an array large enough to hold all of the records when we read all of them in later.

The dump above shows the beginning of the binary file with the count field. This code shows how to create the file:

void write_students(void)
{
  int i;

    /* Open file to write all records */
  FILE *outfile = fopen("student-records", "wb");

    /* Write the count first */
  fwrite(&Count, sizeof(int), 1, outfile);

    /* Write each record to the file */
  for (i = 0; i < Count; i++)
    fwrite(&Students[i], sizeof(struct STUDENT), 1, outfile);

  fclose(outfile);
}

However, even at this early stage in our development, we can do better. Instead of writing each structure one-at-a-time, we can write the entire array of structures at once.

void write_students(void)
{
    /* Open file to write all records */
  FILE *outfile = fopen("student-records", "wb");

    /* Write the count first */
  fwrite(&Count, sizeof(int), 1, outfile);

    /* Write entire array of structures at once */
  fwrite(Students, sizeof(struct STUDENT), Count, outfile);

  fclose(outfile);
}

I'm sure you're beginning to see the power and elegance of dealing with entire structures (or arrays of structures) using binary files. Literally one line of code to write thousands of structures (which could be thousands of bytes in size, with hundreds of fields) to the file.

Here's the actual binary file for you to experiment with: student-records. You won't be able to view it properly in a browser because it is just binary data. You'll have to download it and use the code above to view it, or view it with some kind of hex editor/viewer like dumpit.

Reading Records from the File

Now that we have all of the data stored in the file, it won't be long until we need to read it and/or modify the data. Reading is just as simple as writing. We can either read one record-at-a-time or read in all records into an array. Sample code for both:

Reading individual records:

void read_records(void)
{
  int count;
  int i;

    /* Open the binary file for reading */
  FILE *infile = fopen("student-records", "rb");

    /* Get count of records in the file */
  fread(&count, sizeof(int), 1, infile);

    /* Read each record and do something with it */
  for (i = 0; i < count; i++)
  {
    struct STUDENT s;
    fread(&s, sizeof(struct STUDENT), 1, infile);
    /* Do something with the record... */
  }

  fclose(infile);
}

Reading entire file into an array:

void read_records(void)
{
  int count;
  int i;
  struct STUDENT *students;

    /* Open the binary file for reading */
  FILE *infile = fopen("student-records", "rb");

    /* Get count of records in the file */
  fread(&count, sizeof(int), 1, infile);

    /* Allocate room for all of the records */
  students = (struct STUDENT *) malloc(count * sizeof(struct STUDENT));

    /* Read all records at once */
  fread(students, sizeof(struct STUDENT), count, infile);

  /* Do something with the records... */

    /* Print out each student record */
  for (i = 0; i < count; i++)
    print_student(&students[i]);

  free(students);
  fclose(infile);
}

Reminder: There is no error checking being done in this code. In a real application you would need to check that all of the library functions succeeded. (e.g. fopen, malloc, etc.)
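For comparison, here is one way the read-everything version might look with the checks added. It is a sketch only: load_students is a hypothetical name, and returning NULL on every kind of failure is a simplification chosen for this example.

```c
#include <stdio.h>   /* FILE, fopen, fread, fclose */
#include <stdlib.h>  /* malloc, free               */

#define MAX_ID_LEN    8
#define MAX_NAME_LEN 20

struct STUDENT
{
  char ID[MAX_ID_LEN];
  char last_name[MAX_NAME_LEN];
  char first_name[MAX_NAME_LEN];
  int age;
  double GPA;
};

/* Hypothetical helper: read the count and all records from filename.
   Returns a malloc'd array (the caller frees it) and stores the
   number of records in *count, or returns NULL on any failure. */
struct STUDENT *load_students(const char *filename, int *count)
{
  struct STUDENT *students = NULL;

  FILE *infile = fopen(filename, "rb");
  if (!infile)
    return NULL;

    /* The count must be readable and positive */
  if (fread(count, sizeof(int), 1, infile) == 1 && *count > 0)
  {
    students = (struct STUDENT *) malloc(*count * sizeof(struct STUDENT));

      /* A short read means the file is truncated or the count is wrong */
    if (students &&
        fread(students, sizeof(struct STUDENT), *count, infile) != (size_t) *count)
    {
      free(students);
      students = NULL;
    }
  }

  fclose(infile);
  return students;
}
```

Comparing the return value of fread with the number of items requested is what catches a file that is shorter than its count claims.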

Let's do something that will need to be done on a regular basis: Update a student's GPA. These are the steps involved:

Open the file for update (read/write, binary).
Locate the student's record by ID. (We call this value the key.)
Read in the entire record.
Modify the GPA.
Write the entire record back out to the file.
Close the file.

This is pretty straightforward and how we would modify any field within a student's record. However, there's a subtle point that needs to be made. According to the algorithm above, we are reading and writing the same open file at the same time. There are at least a couple of ways we can deal with this. Briefly:

Open the file for read/binary
Read in the record
Close the file
Modify the record (in memory)
Open the file for write/binary
Write the modified record
Close the file

This method works, with one caveat: opening an existing file with mode "wb" truncates it, so the write step has to write the whole file back out, not just the one record. You already have all of the information to do that. But C has a better way to deal with this: open the file for update (i.e. reading and writing). This is what the first algorithm described above does.

For this example, let's change Artie Fufkin's GPA from 2.90 to 3.25. Artie Fufkin's ID is 116001. This function takes an ID and GPA and updates the record in the file. A call to the function would look like this:

update_GPA("116001", 3.25);

This is the function:

/* Find student record with ID and modify GPA */
void update_GPA(const char *ID, double newGPA)
{
  int count;

    /* Open the file for update (read/write) binary */
  FILE *inoutfile = fopen("student-records", "rb+");

    /* Get count of records in the file */
  fread(&count, sizeof(int), 1, inoutfile);

    /* Search for the specified record by ID */
  while (count--)
  {
    long position;    /* The current position in the file */
    struct STUDENT s; /* The record read/modified         */

      /* Get current position in the file so we can return to it */
    position = ftell(inoutfile);

      /* Get the next record */
    fread(&s, sizeof(struct STUDENT), 1, inoutfile);

      /* If the student's record was found, update it */
    if (!strcmp(ID, s.ID))
    {
        /* Update GPA */
      s.GPA = newGPA;
      
        /* Move back to correct position in the file */
      fseek(inoutfile, position, SEEK_SET);

        /* Write out the updated record */
      fwrite(&s, sizeof(struct STUDENT), 1, inoutfile);

        /* Done */
      fclose(inoutfile);

      return;
    }
  }

    /* Record wasn't found */
  printf("Student ID: %s not found.\n", ID);

    /* Clean up */
  fclose(inoutfile);
}

This is Artie Fufkin's original record. The current GPA (2.90) is the 8 bytes at offsets 0x3FC-0x403:

0003C0 85 EB 07 40 31 31 36 30  30 31 00 00 46 75 66 6B   ...@116001..Fufk
0003D0 69 6E 00 00 00 00 00 00  00 00 00 00 00 00 00 00   in..............
0003E0 41 72 74 69 65 00 00 00  00 00 00 00 00 00 00 00   Artie...........
0003F0 00 00 00 00 13 00 00 00  00 00 00 00 33 33 33 33   ............3333
000400 33 33 07 40 31 31 37 30  30 31 00 00 44 69 42 65   33.@117001..DiBe

This is Artie Fufkin's updated record. The new GPA (3.25) is in the same 8 bytes at offsets 0x3FC-0x403:

0003C0 85 EB 07 40 31 31 36 30  30 31 00 00 46 75 66 6B   ...@116001..Fufk
0003D0 69 6E 00 00 00 00 00 00  00 00 00 00 00 00 00 00   in..............
0003E0 41 72 74 69 65 00 00 00  00 00 00 00 00 00 00 00   Artie...........
0003F0 00 00 00 00 13 00 00 00  00 00 00 00 00 00 00 00   ................
000400 00 00 0A 40 31 31 37 30  30 31 00 00 44 69 42 65   ...@117001..DiBe

With hexadecimal numbers, the values are not obvious. Not only are the 64-bit doubles displayed in hexadecimal, but they are in little-endian order. Here are some conversions with the binary using IEEE-754 notation:

Original GPA:

Decimal: 2.90
 Binary: 0100000000000111001100110011001100110011001100110011001100110011
    Hex: 40 07 33 33 33 33 33 33

New GPA:

Decimal: 3.25
 Binary: 0100000000001010000000000000000000000000000000000000000000000000
    Hex: 40 0A 00 00 00 00 00 00
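If you'd like to check conversions like these yourself, the bit pattern of a double can be copied into an integer and printed in hex. This is a minimal sketch, assuming a 64-bit IEEE-754 double and a 64-bit unsigned long long (C99). The name double_to_hex is made up for this example.

```c
#include <stdio.h>   /* sprintf */
#include <string.h>  /* memcpy  */

/* Writes the 16 hex digits of d's bit pattern into out, most
   significant byte first. out must have room for 17 characters.
   Assumes double and unsigned long long are both 64 bits wide. */
void double_to_hex(double d, char *out)
{
  unsigned long long bits = 0;

    /* Copy the raw bytes; a cast would convert the value instead */
  memcpy(&bits, &d, sizeof(d));

  sprintf(out, "%016llX", bits);
}
```

Passing 3.25 produces 400A000000000000 and passing 2.90 produces 4007333333333333, which are the hex values above. Reading the bytes in the opposite order gives what the little-endian dump shows.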

In a nutshell, that's how record-based I/O works. I leave it as an exercise for the reader to add more functionality like modifying other fields, adding records, deleting records, etc. This brief tutorial has given you all you need to get started.

Notes:

The code reads one record-at-a-time. We know that we could have read the entire file in at once. However, in practice, this may not be possible because with very large files (not unlikely) like databases, having enough memory could prove difficult.
Related to the previous point, if we're only interested in one record, why do we need to read every record? This seems sub-optimal and it is.
It would be better if we had some kind of "random access" so that we could "jump" to the correct record without having to search through each and every one.
These "issues" will be dealt with next.

More Efficient File I/O

The previous examples worked just fine, but as the file grows with more data, it soon becomes inefficient to have to read every record each time we update a single record. Like many things, there are multiple ways to solve this "problem". The way we're going to address it is by using an index to the data.

By placing an index at the front of the file, we can more quickly locate where the record is further in the file. This is what the new format of the file looks like:

student-records-indexed:
       00 01 02 03 04 05 06 07  08 09 0A 0B 0C 0D 0E 0F
--------------------------------------------------------------------------
000000 17 00 00 00 31 30 31 30  30 31 00 00 31 30 32 30   ....101001..1020
000010 30 31 00 00 31 30 33 30  30 31 00 00 31 30 34 30   01..103001..1040
000020 30 31 00 00 31 30 35 30  30 31 00 00 31 30 36 30   01..105001..1060
000030 30 31 00 00 31 30 37 30  30 31 00 00 31 30 38 30   01..107001..1080
000040 30 31 00 00 31 30 39 30  30 31 00 00 31 31 30 30   01..109001..1100
000050 30 31 00 00 31 31 31 30  30 31 00 00 31 31 32 30   01..111001..1120
000060 30 31 00 00 31 31 33 30  30 31 00 00 31 31 34 30   01..113001..1140
000070 30 31 00 00 31 31 35 30  30 31 00 00 31 31 36 30   01..115001..1160
000080 30 31 00 00 31 31 37 30  30 31 00 00 31 31 38 30   01..117001..1180
000090 30 31 00 00 31 31 39 30  30 31 00 00 31 32 30 30   01..119001..1200
0000A0 30 31 00 00 31 32 31 30  30 31 00 00 31 32 32 30   01..121001..1220
0000B0 30 31 00 00 31 32 33 30  30 31 00 00 31 30 31 30   01..123001..1010
0000C0 30 31 00 00 46 61 69 74  68 00 00 00 00 00 00 00   01..Faith.......
0000D0 00 00 00 00 00 00 00 00  49 61 6E 00 00 00 00 00   ........Ian.....
0000E0 00 00 00 00 00 00 00 00  00 00 00 00 12 00 00 00   ................
0000F0 00 00 00 00 1F 85 EB 51  B8 1E 09 40 31 30 32 30   .......Q...@1020
000100 30 31 00 00 54 75 66 6E  65 6C 00 00 00 00 00 00   01..Tufnel......
000110 00 00 00 00 00 00 00 00  4E 69 67 65 6C 00 00 00   ........Nigel...

[rest of the file...]

Here is sample code that created the indexed file:

void write_students_indexed(void)
{
  int i;

    /* Open file to write all records */
  FILE *outfile = fopen("student-records-indexed", "wb");

    /* Write the count first */
  fwrite(&Count, sizeof(int), 1, outfile);

    /* Then write each ID to the file */
  for (i = 0; i < Count; i++)
    fwrite(&Students[i].ID, MAX_ID_LEN, 1, outfile);

    /* Finally, write all of the records to the file */
  fwrite(Students, sizeof(struct STUDENT), Count, outfile);

  fclose(outfile);
}

Now, when we look up a record, we just have to scan the index and then use that to locate the actual record. We still have to read in the entire index, but that is likely to be significantly less (by orders of magnitude) data than reading in the entire file.

Let's write a function that, given an ID, displays the student record. We'll write an entire program that will accept an ID on the command line and display that student record. Here it is in its entirety: (lookup-student.c)

#include <stdio.h>   /* FILE *, printf, fread, fopen, fclose */
#include <stdlib.h>  /* malloc                               */
#include <string.h>  /* strcmp                               */
#include "student.h" /* Student struct                       */

void print_student(const struct STUDENT *student)
{
  printf("%8s: %s, %s (Age: %i, GPA: %3.2f)\n", 
         student->ID,
         student->last_name,
         student->first_name,
         student->age,
         student->GPA);
}

void display_record(const char *ID)
{
  int i;       /* Loop counter                      */
  int count;   /* The number of records in the file */
  char *index; /* All of the student IDs            */

    /* Open the binary file for reading (hard-coded filename!) */
  FILE *infile = fopen("student-records-indexed", "rb");

    /* Get count of records in the file */
  fread(&count, sizeof(int), 1, infile);

    /* Allocate room for the index */
  index = (char *) malloc(count * sizeof(char) * MAX_ID_LEN);

    /* Read in the entire index */
  fread(index, MAX_ID_LEN, count, infile);

  for (i = 0; i < count; i++)
  {
      /* Does this ID match? */
    if (!strcmp(ID, index + i * MAX_ID_LEN))
    {
      struct STUDENT s;

        /* Calculate offset and move file pointer to that point */
      long position = sizeof(int) + (count * MAX_ID_LEN) + (i * sizeof(struct STUDENT));
      fseek(infile, position, SEEK_SET);

        /* Read record at current position */
      fread(&s, sizeof(struct STUDENT), 1, infile);

        /* Display record */
      print_student(&s);

        /* Clean up */
      fclose(infile);
      free(index);

        /* Done */
      return;
    }
  }

    /* Record wasn't found */
  printf("Student ID: %s not found.\n", ID);

    /* Clean up */
  free(index);
  fclose(infile);
}

int main(int argc, char **argv)
{
  const char *ID = "108001";
  if (argc > 1)
    ID = argv[1];

  display_record(ID);
  return 0;
}

Some points to make:

Running the program as such:

lookup-student 105001

produces this output:

  105001: Besser, Joe (Age: 19, GPA: 2.18)

The index is simply one large array of characters.
We are reading the entire index, which may be quite large. If it's too large to fit into memory, we could read a portion of it at a time.
We need to find (search for) the ID within the large character array. This is easy to do and we need to use pointer arithmetic to find out where each ID starts in the array. That's what is going on in the call to strcmp.
Once we find the ID, we know which record it is by the value of the loop counter, i.
These lines of code may need a little explanation:
```
  /* Calculate offset and move file pointer to that point */
long position = sizeof(int) + (count * MAX_ID_LEN) + (i * sizeof(struct STUDENT));
fseek(infile, position, SEEK_SET);
```
We need to skip over all of the bytes that come before the record we are looking for. This means we need to skip over the first 4 bytes, which is the count of records.
Then, we need to skip over the entire index. Since we know how large each ID is (MAX_ID_LEN, 8 bytes) and we know how many records there are (count), we multiply them together to get the length of the index which is 184 bytes in the example (8 * 23).
Then, we have to skip over all of the records that come before the one we're looking for. For example, if we are looking for the record with ID 108001, we find it as the 8th ID in the index (101001, 102001..108001). This means we skip over 7 records (64 bytes each, 448 bytes total) to get to the one we want.
This gives us the exact byte in the file where the record with ID 108001 begins. Realize that this calculation can be done in constant time, which means that it doesn't matter how many records are in the file (tens, hundreds, thousands, millions, billions!). It will take the same amount of time (very small) to find the record.
It may seem like this is a lot more work than simply reading each record in, but it isn't. Sure, there is more code but the amount of work being done is a lot less. This is simply because the disk is several orders of magnitude SLOWER than the CPU (on the order of 10,000 times or more, depending on the hardware). So, the fewer bytes we need to actually read from the disk, the better the performance will be. Sometimes this performance is several orders of magnitude better. Ever wonder why databases with GBs or TBs (or more) of data are so fast? This is one reason why.
Here's the actual binary file (with indexes) for you to experiment with: student-records-indexed. Again, you won't be able to view it properly in a browser because it is just binary data.
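The offset arithmetic described in the points above can be pulled out into a function of its own. In this sketch the sizes are passed in as parameters so the worked numbers from the text (4, 8, 64, 23) can be plugged in directly. The name record_offset is an assumption for illustration.

```c
/* Byte offset of record i (0-based) in a file laid out as:
     [count] [index: count keys] [records]
   All sizes are in bytes. */
long record_offset(long count_size, long key_size, long record_size,
                   long count, long i)
{
  long header = count_size;         /* skip the count         */
  long index  = count * key_size;   /* skip the entire index  */
  long before = i * record_size;    /* skip earlier records   */

  return header + index + before;
}
```

For ID 108001, the 8th entry (i is 7), this gives 4 + 184 + 448 = 636, or 0x27C. The first record (i is 0) starts at 188, or 0xBC, which matches where 101001 appears for the second time in the dump.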

Additional issues and possible changes:

Currently, the IDs are sorted from smallest to largest. (This is not always the case and you can't depend on it.) However, for these examples, that's just the way the file was created. This has some interesting properties:
1. We could use a binary search through the index instead of a linear search. This can be a significantly faster way to find a record.
2. The file now contains two copies of the ID. One is in the "index" and the other is in the "data". We could remove the second one (in the data), since it's already in the index. In practice, the data is going to be several orders of magnitude larger than the ID, so this isn't a huge deal. (However, it does violate the concept of data normalization, i.e. having only one copy of any data.)
3. The order of the IDs (keys) matches the order of the data records. This is OK for small data, but as we add and delete records, this synchronization will not last (unless we rewrite the entire file at certain points).
4. Sorting the IDs and rewriting the index isn't that much work. However, if we have to keep all of the data records in the same order, then there is a significant amount of file I/O required.
5. One approach is to store the offset of the data record along with the ID in the index. Now, if we insert an ID in the middle of the index (because we want to keep the IDs sorted), we can simply add the data record to the end of the file and update the record's index to include the position of the data. This is how many record-structured files are implemented.
6. Currently, every time we add a new record (at the end), we still have to rewrite the entire file because the index has to be re-written before we can start writing data records.
  - One solution is to have a lot of "extra" unused (reserved) slots in the index so we only re-write the entire file after adding many records. This isn't so bad since disk space is relatively inexpensive and the index is relatively small compared to the size of the data records.
  - Another "solution" is to have a separate file for the indexes. We could have a file of "indexes" and a file of "data". This means we would never have to rewrite the data file. Adding new records always appends to the end of the data file. This saves time because it is likely that the actual data is tens or hundreds or thousands of times larger than the index. This solution is used by many database systems.
7. Technically, every time we delete a record (not shown in these examples), the entire file needs to be re-written (index and data). However, we could just "mark" the index for the record as "unused". This would only require an update to the index. (No complete file rewrite.) We could then reuse the record later when adding new ones.
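Items 1 and 5 above can be combined in a short sketch: each index entry stores the ID together with the offset of its data record, and a binary search over the sorted index returns that offset. The INDEX_ENTRY structure and the find_offset function are hypothetical names used only for this illustration; they are not part of the example files.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define MAX_ID_LEN 8

/* Hypothetical index entry: the key plus where its data record lives. */
struct INDEX_ENTRY
{
  char ID[MAX_ID_LEN];  /* same fixed-width ID as in the data record */
  long offset;          /* byte offset of the data record in the file */
};

/* Binary search over an index sorted by ID (string comparison).
   Returns the data offset, or -1 if the ID is not present. */
long find_offset(const struct INDEX_ENTRY *index, size_t count, const char *id)
{
  size_t low = 0, high = count;  /* search the half-open range [low, high) */

  assert(index != NULL || count == 0);

  while (low < high)
  {
    size_t mid = low + (high - low) / 2;
    int cmp = strncmp(id, index[mid].ID, MAX_ID_LEN);

    if (cmp == 0)
      return index[mid].offset;

    if (cmp < 0)
      high = mid;     /* target sorts before the middle entry */
    else
      low = mid + 1;  /* target sorts after the middle entry */
  }

  return -1;
}
```

Because the offset travels with the ID, the data records no longer have to be stored in sorted order. Only the index needs to stay sorted.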
There are many other "features" that could easily be added to this scheme to make the file I/O even more efficient.
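As one illustration, the "mark as unused" idea from item 7 might look like the sketch below. It assumes a hypothetical index slot that carries an in_use flag and an index that starts at byte index_start in the file. Neither is part of the example files, and mark_deleted is an illustrative name.

```c
#include <assert.h>
#include <stdio.h>

#define MAX_ID_LEN 8

/* Hypothetical index slot with an "in use" flag. */
struct INDEX_SLOT
{
  char ID[MAX_ID_LEN];
  long offset;  /* byte offset of the data record */
  int  in_use;  /* 1 = live record, 0 = deleted and reusable */
};

/* Mark slot `slot` of the index as unused by rewriting only that slot.
   `index_start` is the byte offset of the first index slot in the file
   (an assumed layout). Returns 1 on success, 0 on failure. */
int mark_deleted(FILE *fp, long index_start, long slot)
{
  struct INDEX_SLOT entry;
  long pos;

  assert(index_start >= 0 && slot >= 0);
  pos = index_start + slot * (long)sizeof entry;

  if (fseek(fp, pos, SEEK_SET) != 0 || fread(&entry, sizeof entry, 1, fp) != 1)
    return 0;

  entry.in_use = 0;

  /* Seek back first: a write may not directly follow a read on a stream. */
  if (fseek(fp, pos, SEEK_SET) != 0 || fwrite(&entry, sizeof entry, 1, fp) != 1)
    return 0;

  return fflush(fp) == 0;
}
```

Only one index slot is read and written. The data records and the rest of the index are left untouched, so no complete file rewrite is needed.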
Finally:
There is another really major benefit to record-based files versus text files that needs to be mentioned, but it's generally considered advanced programming.

Suppose you want two (or more) processes (programs) or threads to modify records in the file simultaneously. This is basically not possible with text-based (unstructured) files. However, with binary records, as described above, this is trivial. Each process/thread modifies only its own record in the file. As long as each process/thread is working on a different record, there is no problem. (If necessary, you can lock individual records instead of the entire file, which also significantly improves performance.) In fact, you could have dozens (or hundreds) of processes/threads reading/writing the same file at the same time! This dramatically improves performance and is how the Real World™ works.

Multiprocessing and multithreading are beyond the scope of this basic introduction.

Summary and Other Notes

As long as we use the same compiled program (executable) to read and write the files, everything will work out fine. However, if we compile the code using a different compiler, we could have problems. Specifically, the alignment of the fields in the structures is dependent on the compiler.

This is not an insurmountable problem and is well-known to software developers who deal with binary files. The technique used to make sure that all compilers employ the same alignment is called structure packing. You can read my introduction to it here: Structure Alignment.

Also, check out this excellent article: The Lost Art of Structure Packing.
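To see what layout a particular compiler chose, you can print the field offsets with offsetof. The sketch below repeats the STUDENT structure. The field names after last_name (first_name, age, GPA) and their order are assumptions made for this illustration, as are the two helper function names.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

#define MAX_ID_LEN    8
#define MAX_NAME_LEN 20

/* Field names and order after last_name are assumed for this illustration. */
struct STUDENT
{
  char ID[MAX_ID_LEN];
  char last_name[MAX_NAME_LEN];
  char first_name[MAX_NAME_LEN];
  int age;
  double GPA;
};

/* Sum of the field sizes, i.e. the size with no padding at all. */
size_t student_unpadded_size(void)
{
  return MAX_ID_LEN + 2 * MAX_NAME_LEN + sizeof(int) + sizeof(double);
}

/* Print where this compiler placed each field and how much padding it added. */
void print_student_layout(void)
{
  size_t total = sizeof(struct STUDENT);
  size_t unpadded = student_unpadded_size();

  assert(total >= unpadded);  /* padding can only add bytes */

  printf("ID         at offset %lu\n", (unsigned long)offsetof(struct STUDENT, ID));
  printf("last_name  at offset %lu\n", (unsigned long)offsetof(struct STUDENT, last_name));
  printf("first_name at offset %lu\n", (unsigned long)offsetof(struct STUDENT, first_name));
  printf("age        at offset %lu\n", (unsigned long)offsetof(struct STUDENT, age));
  printf("GPA        at offset %lu\n", (unsigned long)offsetof(struct STUDENT, GPA));
  printf("sizeof(struct STUDENT) = %lu (%lu bytes of padding)\n",
         (unsigned long)total, (unsigned long)(total - unpadded));
}
```

On many 64-bit compilers you would typically see a few bytes of padding between age and GPA, so the structure is larger than the sum of its fields. A different compiler or packing setting may report different numbers, and that difference is exactly what breaks files shared between programs.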

One final note: Different CPUs order multibyte data types (e.g. integers, doubles) differently (little-endian vs. big-endian). This would also need to be taken into account if you were to use the program/code with different computer architectures.
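One common way to deal with byte order is to pick a fixed order for the file and convert explicitly when reading and writing. The sketch below stores a 32-bit value in little-endian order on any CPU. The function names put_u32_le and get_u32_le are hypothetical, and this approach is not what the example programs above do (they write the structure as-is).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Store `value` in `buf` as 4 bytes, least significant byte first,
   regardless of the byte order of the CPU running the code. */
void put_u32_le(unsigned char *buf, uint32_t value)
{
  assert(buf != NULL);
  buf[0] = (unsigned char)(value & 0xFF);
  buf[1] = (unsigned char)((value >> 8) & 0xFF);
  buf[2] = (unsigned char)((value >> 16) & 0xFF);
  buf[3] = (unsigned char)((value >> 24) & 0xFF);
}

/* Rebuild the value from 4 bytes stored least significant byte first. */
uint32_t get_u32_le(const unsigned char *buf)
{
  assert(buf != NULL);
  return (uint32_t)buf[0]
       | ((uint32_t)buf[1] << 8)
       | ((uint32_t)buf[2] << 16)
       | ((uint32_t)buf[3] << 24);
}
```

Because the code uses shifts instead of copying the integer's memory directly, the bytes written to the file are the same on little-endian and big-endian machines.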

Here are more links to the file I/O functions used:

fopen - Opens a file. (man page)
fclose - Closes a file. (man page)
fread - Read block of data from stream. (man page)
fwrite - Write block of data to stream. (man page)
ftell - Get current position in stream. (man page)
fseek - Reposition stream position indicator. (man page) (lseek)