Strings

Literal Strings and Pointers

Strictly speaking, it is not a good idea to use NULL when talking about the character that terminates a string. NULL is a macro that expands to a null pointer constant, not a character. NUL is the ASCII name for the zero character; it may not be defined as a macro, and if it is, it is likely to be an escaped zero: '\0'. However, you will see NULL, NUL, and null being used interchangeably when talking about null-terminated strings. See the comp.lang.c FAQ, specifically section 5.9.

There is a subtle difference between a string and an array of characters: a string is an array of characters that ends with a NUL (zero) character.
Literal strings are much like character arrays in that they can be used with pointers. In this example, p is a char pointer or pointer to char and it points to the first element in the string:
char *p = "Hello, world\n";
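Visually, p holds the address of the 'H', and the remaining characters follow it in memory, ending with the NUL terminator.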
We can print the string just as if it were a literal string:
printf(p);
Using the %s format specifier to print strings:
char *ph = "Hello";
char *pw = "world";
printf("%s, %s\n", ph, pw);
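These three strings would look something like this (not necessarily adjacent in memory), where 0 is the NUL terminator:
'H' 'e' 'l' 'l' 'o' 0              <-- ph points to the 'H'
'w' 'o' 'r' 'l' 'd' 0              <-- pw points to the 'w'
'%' 's' ',' ' ' '%' 's' '\n' 0     <-- the format string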
The terminating NUL (zero) character is very important when treating the array as a string. It is what tells printf when to stop:
char *ph = "Hello";
char w[] = {'H', 'e', 'l', 'l', 'o'};

printf("%s\n", ph); /* OK, a string      */
printf("%s\n", w);  /* Bad, not a string */
Output:
Hello
Hello¦¦¦¦¦¦¦¦¦¦¦<@B
Another attempt:
  /* Manually add the terminator to the array */
char w[] = {'H', 'e', 'l', 'l', 'o', 0};

  /* Ok, now it's a string */
printf("%s\n", w);
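One way to guard against the unterminated-array mistake is to check for the terminator before treating an array as a string. The helper below is only a sketch (the name has_terminator is hypothetical, not part of any library); it uses the standard memchr function to look for a zero byte within the first n bytes of the array:

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: returns 1 if the first n bytes of a contain */
/* a NUL (zero) character, otherwise returns 0.                     */
int has_terminator(const char *a, size_t n)
{
  return memchr(a, 0, n) != NULL;
}
```

Called as has_terminator(w, sizeof w), this would return 0 for the five-element array and 1 for the six-element array with the terminator added.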
We could print strings "the hard way", by printing one character at a time:
char *p = "Hello, world\n";
while (*p != 0)
  printf("%c", *p++); /* Compact pointer notation */
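After initialization, p points to the 'H' at the start of the string.
After the while loop, p points to the NUL terminator at the end of the string.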

Make sure that you fully understand the difference between the pointer and the value that the pointer is pointing to:

char *p = "Hello, world\n";

  /* This is the correct condition */
while (*p != 0)
  printf("%c", *p++);
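Compare that with this version, which tests the pointer itself instead of the character it points to:
char *p = "Hello, world\n";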

  /* INCORRECT */
while (p != 0) 
  printf("%c", *p++);
Output from the incorrect code (using gcc):
Hello, world
 %c Hello world %s, %s
 %s
 The value of i is %i
                                                                                                       ?
 a                                                                                                     @
@          hA  x@  l@          xA  ¤@                          ¬@  ,@  E@  O@  è@  ?A  ?A  ?A  $A      0
A           I a`y a¤%   a?~?aàS a&+     a,/     aZ1     aN5     a        I
Y|    5 __main    F?_impure_ptr   ·?calloc    ï?cygwin_internal   ??dll_crt0__FP11per_process e?free  K?
malloc    >?printf     ?realloc   O?GetModuleHandleA   @   @   @   @   @   @   @   @   @  cygwin1.dll ¶@
  KERNEL32.dll
118871 [main] a 1808 _cygtls::handle_exceptions: Exception: STATUS_ACCESS_VIOLATION
 118871 [main] a 1808 _cygtls::handle_exceptions: Exception: STATUS_ACCESS_VIOLATION
 119867 [main] a 1808 open_stackdumpfile: Dumping stack trace to a.exe.stackdump
 119867 [main] a 1808 open_stackdumpfile: Dumping stack trace to a.exe.stackdump
 810735 [main] a 1808 _cygtls::handle_exceptions: Exception: STATUS_ACCESS_VIOLATION
 841004 [main] a 1808 _cygtls::handle_exceptions: Error while dumping state (probably corrupted stack)
The output before it crashed and burned when using Microsoft's compiler.

Note: When using printf to print strings, only the first argument (the format string) is interpreted. For example, this code:

char *p1 = "%s%d";
printf("A string with %%: %s\n", p1);
will print this:
A string with %: %s%d
as none of the other arguments (p1 in this case) will have their % symbols evaluated. They will just be printed verbatim.
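The flip side of this note is that a string passed as the first argument is interpreted. A call such as printf(p) is only safe when p is known to contain no % characters; passing the text through "%s" prints it verbatim. The sketch below uses sprintf (which follows the same formatting rules as printf) so the result can be inspected. The name format_verbatim is hypothetical, and the caller is assumed to supply a large enough buffer:

```c
#include <stdio.h>

/* Hypothetical helper: writes text into out exactly as-is, even if it */
/* contains % characters. out must have room for text plus the NUL.    */
/* Returns the number of characters written (not counting the NUL).    */
int format_verbatim(char *out, const char *text)
{
    /* text is an argument to %s, so its % symbols are not interpreted */
  return sprintf(out, "%s", text);
}
```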

String Variables and Initialization

Initialization with character arrays:
char s1[] = {'H', 'e', 'l', 'l', 'o'};    /* array of 5 chars */
char s2[] = {'H', 'e', 'l', 'l', 'o', 0}; /* array of 6 chars */
Initializing with strings:
char s3[] = "Hello"; /* array of 6 chars; 5 + terminator                           */
char *s4 = "Hello";  /* pointer to a char; 6 chars in the "string"; 5 + terminator */
What is sizeof s1, s2, s3, s4? (Hint: What are the types?)
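One way to check your answer is to print the sizes. This sketch declares the four variables at file scope and prints each sizeof; print_sizes is a hypothetical name. The first three are arrays, so sizeof gives the number of chars in the array (5, 6, and 6). s4 is a pointer, so its size is the size of a char * on your machine (commonly 4 or 8), not the length of the string.

```c
#include <stdio.h>

char s1[] = {'H', 'e', 'l', 'l', 'o'};    /* array of 5 chars  */
char s2[] = {'H', 'e', 'l', 'l', 'o', 0}; /* array of 6 chars  */
char s3[] = "Hello";                      /* array of 6 chars  */
char *s4 = "Hello";                       /* pointer to a char */

/* Prints the size, in bytes, of each variable */
void print_sizes(void)
{
  printf("s1: %lu\n", (unsigned long)sizeof s1);
  printf("s2: %lu\n", (unsigned long)sizeof s2);
  printf("s3: %lu\n", (unsigned long)sizeof s3);
  printf("s4: %lu\n", (unsigned long)sizeof s4); /* size of the pointer */
}
```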


           

Initializing with fewer characters:
char s5[10] = {'H', 'e', 'l', 'l', 'o'};    /* array of 10 chars, 5 characters are 0 */
char s6[8] = "Hello";                       /* array of 8 chars; 3 characters are 0  */
           
Given these declarations:
char s[5]; /* array of 5 chars, undefined values */
char *p;   /* pointer to a char, undefined value */
Use a loop to set each character and then print them out (assume i is an integer):
  /* Set each character to A - E */
for (i = 0; i < 5; i++)
  s[i] = i + 'A';
  /* Print out the characters: ABCDE */
  /* Uses array notation             */
for (i = 0; i < 5; i++)
  printf("%c", s[i]);
printf("\n");
A different loop doing the same thing (assume c is an integer):
  /* Set each character to A - E */
for (c = 'A'; c < 'A' + 5; c++)
  s[c - 'A'] = c;
  /* Print out the characters: ABCDE */
  /* Uses pointer notation           */
for (i = 0; i < 5; i++)
  printf("%c", *(s + i));
Do something similar with p:
  /* Print out the character that p points to */
printf("%c", p[0]);
printf("%c", *p);
You may get garbage, or it may crash:
     65 [main] a 2020 _cygtls::handle_exceptions: Exception: STATUS_ACCESS_VIOLATION
  22906 [main] a 2020 open_stackdumpfile: Dumping stack trace to a.exe.stackdump
     65 [main] a 2020 _cygtls::handle_exceptions: Exception: STATUS_ACCESS_VIOLATION
  22906 [main] a 2020 open_stackdumpfile: Dumping stack trace to a.exe.stackdump
 686199 [main] a 2020 _cygtls::handle_exceptions: Exception: STATUS_ACCESS_VIOLATION
 707734 [main] a 2020 _cygtls::handle_exceptions: Error while dumping state (probably corrupted stack)
Set p to point at something first:
  /* Point p at s[0] */
p = s;
Now print out the value:
  /* Print out the character that p points to */
printf("%c", p[0]);
printf("%c", *p);
In a loop, print out all the characters that p points to: ABCDE. These are both the same (due to the Basic Rule):
for (i = 0; i < 5; i++)
  printf("%c", p[i]);
for (i = 0; i < 5; i++)
  printf("%c", *(p + i));
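Note that s in the loops above holds five characters and no terminator, so it is not a string and cannot be printed with %s. Here is a minimal sketch of the same fill written with a moving pointer, with a terminator added so the result is a string. The name fill_letters is hypothetical, and the code assumes a character set (such as ASCII) where the letters are consecutive:

```c
#include <stddef.h>

/* Hypothetical helper: stores count letters, starting at 'A', into */
/* buf and adds the terminator. buf must hold count + 1 chars.      */
void fill_letters(char *buf, size_t count)
{
  char *p = buf; /* p walks through the array */
  size_t i;

  for (i = 0; i < count; i++)
    *p++ = (char)('A' + i);

  *p = 0; /* now it's a string */
}
```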

String Input/Output

There's a convenient function for printing strings:
int puts(const char *string);
The puts function will print a newline automatically. Examples:
Sample code:
char *p1 = "Hello";
char p2[] = "Hello";

puts("Hello");  /* literal string  */
puts(p1);       /* string variable */
puts(p2);       /* string variable */
puts("%s%i%d"); /* literal string  */
Output:
Hello
Hello
Hello
%s%i%d
There's also a convenient function for printing a single character:
int putchar(int c);
Example:
Sample code:
char c = 'H';
char *p = "ello";

putchar(c);      /* outputs one char, no newline */
while (*p)
  putchar(*p++); /* outputs one char, no newline */

putchar('\n');   /* print new line               */
Output:
Hello
For input, we can use this:
char *gets(char *string);

Example:

char string[100]; /* 99 chars + NUL terminator */

puts("Type something: "); /* prompt the user */
gets(string);             /* read the string */
puts(string);             /* print it out    */
Output (the second line was typed by the user):
Type something:
I am not a great fool, so I can clearly not choose the wine in front of you.
I am not a great fool, so I can clearly not choose the wine in front of you.
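Be aware that gets has no way to know how large the array is, so input longer than 99 characters would overrun string. The function was removed from the language in the C11 standard. Here is a hedged sketch of the same idea using fgets, which takes the size of the array. The name read_line is hypothetical; it also strips the newline that fgets keeps at the end of the line:

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: reads one line from in into buf (at most */
/* size - 1 chars) and removes the trailing newline, if any.     */
/* Returns buf, or NULL if nothing could be read.                */
char *read_line(char *buf, int size, FILE *in)
{
  if (fgets(buf, size, in) == NULL)
    return NULL;

  buf[strcspn(buf, "\n")] = 0; /* fgets keeps the newline; drop it */
  return buf;
}
```

With the array from the example above, the call would be read_line(string, 100, stdin) in place of gets(string).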
We can also read a single character:
int getchar(void);
Example:
Sample code:
int c = 0;

while (c != 'a')
{
  c = getchar(); /* read in a character   */
  putchar(c);    /* print out a character */
}
Input (<NL> is the enter/return key):
This is a string <NL>
Output (no newline):
This is a
Notice how the loop only printed part of the phrase that was typed in. The getchar function did not return until the user pressed the enter/return key. (All of the characters are buffered.) Then, the loop continued.

In C, a literal string has type array of char, which is treated as char * in most expressions. In C++, the elements are const char, so it is treated as const char *. This helps prevent errors that may occur due to writing to the read-only string pool. More on this later.

String Functions

Although strings are not truly built into the language, there are many functions specifically for dealing with NUL-terminated strings. You will need to include this:
#include <string.h>
Here are four of the more popular ones. Familiarize yourself (i.e. practice) with them as you will be using them a lot in the near future.

Function prototypes and descriptions:

size_t strlen(const char *string);
  Returns the length of the string, which is the number of characters in the string. It does not include the terminating 0.

char *strcpy(char *destination, const char *source);
  Copies the string pointed to by source into the string pointed to by destination. Destination must have enough space to hold the string from source. The return is destination.

char *strcat(char *destination, const char *source);
  Concatenates (joins) two strings by appending the string in source to the end of the string in destination. Destination must have enough space to accommodate both strings. The return is destination.

int strcmp(const char *s1, const char *s2);
  Compares two strings lexicographically (i.e. alphabetically). If s1 is less than s2, the return value is negative. If s1 is greater than s2, the return value is positive. Otherwise the return is 0 (they are the same). UPPERCASE is considered different than lowercase.
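A short sketch that exercises all four functions together. Both helper names (join_words and compare_sign) are hypothetical. Note that only the sign of strcmp's return value is meaningful, so the second helper reduces it to -1, 0, or 1:

```c
#include <string.h>

/* Hypothetical helper: builds "first second" in dest.      */
/* dest must hold strlen(first) + strlen(second) + 2 chars. */
char *join_words(char *dest, const char *first, const char *second)
{
  strcpy(dest, first);  /* dest now holds a copy of first */
  strcat(dest, " ");    /* append a single space          */
  strcat(dest, second); /* append second after the space  */
  return dest;
}

/* Hypothetical helper: reduces the result of strcmp to -1, 0, or 1 */
int compare_sign(const char *s1, const char *s2)
{
  int result = strcmp(s1, s2);
  return (result > 0) - (result < 0);
}
```

For example, joining "Hello" and "world" gives "Hello world", and strlen of the result is 11 (5 + 1 + 5).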

Sample implementations of strlen:

size_t mystrlen1(const char *string)
{
  size_t len;

  for (len = 0; *string != 0; string++)
    len++;

  return len;
}
size_t mystrlen2(const char *string)
{
  size_t len = 0;

  while (*string++)
    len++;

  return len;
}
size_t mystrlen3(const char *string)
{
  const char *start = string;

  /* Leaves string pointing at NUL byte */
  while (*string)
    string++;

  return string - start;
}
size_t mystrlen4(const char *string)
{
  const char *start = string;

  /* Leaves string pointing at one past the NUL byte */
  while (*string++)
    ;

  return string - start - 1;
}
Most compilers/libraries will have a highly optimized version of strlen (and other string-related functions), possibly even written in assembly code, so you should never need to write your own. glibc (the GNU C Library) has such a version. From my simple tests, it's about 2.5 to 3 times faster than any of the ones shown above. Some of the optimizations may depend on the architecture of the CPU, e.g. SSE (Streaming SIMD Extensions) and vectorization, which are well beyond the scope of this course.

Self check: Using the above implementations of mystrlen as a guide, write your own version of mystrcpy and mystrcat.

The String Pool

Given the code below, the three variables p1, p2, and p3, live on the stack. The three (NUL-terminated) strings live in the string pool.
#include <stdio.h>

int main(void)
{
    /* p1, p2, p3 are on the stack */
  char *p1 = "Hello";
  char *p2 = "Hello";
  char *p3 = "Hello";

    /* Display the address of each string */
  printf("%p, %p, %p\n", p1, p2, p3);

  return 0;
}

The string pool is an area of memory that contains all of the constant literal strings in the program. It is generally a read-only area of memory that is protected from being overwritten.

Here's a possible layout in memory (with arbitrary addresses):
And here's the output of the program:
0x400652, 0x400652, 0x400652
What?!? All of the strings have the same address! That means that there is only one copy of "Hello" in the program. This is a more accurate diagram:
This is an optimization that most, if not all, compilers implement. Since they are literal constants, they can never change, so it is totally acceptable to do this. If you have a large program with many strings that are the same, this can save a lot of memory.

There is only one string pool that is shared by all functions and files in a program. So, if the word "Hello" exists in other functions, or even in other files (in the same program), they will all be merged into one string. Some compilers will provide a command line option to enable/disable this optimization.

For strings within a single file, GNU gcc will automatically merge identical strings and this cannot be disabled. For programs with multiple files, this is disabled by default. To enable it, you need the option:

-fmerge-constants
This tells the compiler and linker to remove any duplicate strings. Here's a larger example with multiple functions and multiple files:
The three files are merge1.c, merge2.c, and merge3.c, listed in that order:
/* merge1.c */
#include <stdio.h>

/* prototypes */
void f21(void);
void f22(void);
void f23(void);
void f24(void);
void f31(void);
void f32(void);
void f33(void);
void f34(void);

void f11(void)
{
  char *p = "Hello";
  printf("%p\n", p);
}

void f12(void)
{
  char *p = "Hello";
  printf("%p\n", p);
}

void f13(void)
{
  char *p = "Hello";
  printf("%p\n", p);
}

void f14(void)
{
  char *p = "Hello";
  printf("%p\n", p);
}

int main(void)
{
  char *p = "Hello";
  printf("%p\n", p);
  f11();
  f12();
  f13();
  f14();

  f21();
  f22();
  f23();
  f24();

  f31();
  f32();
  f33();
  f34();

  return 0;
}

/* merge2.c */
#include <stdio.h>

void f21(void)
{
  char *p = "Hello";
  printf("%p\n", p);
}

void f22(void)
{
  char *p = "Hello";
  printf("%p\n", p);
}

void f23(void)
{
  char *p = "Hello";
  printf("%p\n", p);
}

void f24(void)
{
  char *p = "Hello";
  printf("%p\n", p);
}

/* merge3.c */
#include <stdio.h>

void f31(void)
{
  char *p = "Hello";
  printf("%p\n", p);
}

void f32(void)
{
  char *p = "Hello";
  printf("%p\n", p);
}

void f33(void)
{
  char *p = "Hello";
  printf("%p\n", p);
}

void f34(void)
{
  char *p = "Hello";
  printf("%p\n", p);
}
This program has 13 occurrences of the string "Hello".

Build the program:

gcc -Wall -Wextra -ansi -pedantic merge1.c merge2.c merge3.c -o merge
There is a tool called strings (part of Cygwin on Windows, built into Linux and Mac) that displays the strings used in a program:
strings merge
This produces about 78 lines of output on my Linux computer. Since I'm only interested in the strings that are Hello, I can filter the output:
strings merge | grep Hello
and this is the output:
Hello
Hello
Hello
This tells me that there are three Hello strings in the program. The reason is that there is one for each of the three files. Now, if I execute the program, this is the output:
0x400814
0x400814
0x400814
0x400814
0x400814
0x40081e
0x40081e
0x40081e
0x40081e
0x400828
0x400828
0x400828
0x400828
You can see there are three different addresses. The first five are from merge1.c, the next four are from merge2.c, and the last four are from merge3.c.
If I build it like this (with the appropriate option):
gcc -Wall -Wextra -ansi -pedantic merge1.c merge2.c merge3.c -o merge -fmerge-constants
Then when I run the strings program I just get this:
Hello
and executing the program gives this output:
0x400814
0x400814
0x400814
0x400814
0x400814
0x400814
0x400814
0x400814
0x400814
0x400814
0x400814
0x400814
0x400814
Clearly, there is now only one copy of the string Hello in the entire program.

It should be obvious that it is the linker that is doing the merging since the compiler can only see one file at a time.
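One practical consequence: whether two identical literals share an address depends on the compiler and its options, so you should not compare strings with ==. That compares the addresses; strcmp compares the characters. A small sketch (same_text is a hypothetical helper):

```c
#include <string.h>

/* Hypothetical helper: returns 1 if the two strings hold the same */
/* characters, regardless of where they live in memory.            */
int same_text(const char *a, const char *b)
{
  return strcmp(a, b) == 0;
}
```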

Here's another example that shows how clever the compiler and linker can be at times:

#include <stdio.h>

int main(void)
{
  char *p1 = "123456";
  char *p2 = "23456";
  char *p3 = "3456";

  printf("%p, %p, %p\n", p1, p2, p3);

  return 0;
}
Building the program:
gcc -Wall -Wextra -ansi -pedantic pool2.c -o pool2
Then run strings on it:
strings pool2 | grep 3456
And the output:
123456
23456
3456
This is probably as expected, since there are three different strings. If we execute the program, we will see three distinct addresses:
0x4005e4, 0x4005eb, 0x4005f1
However, if I include the -fmerge-constants option and then run strings we get this:
123456
What happened to the other two strings (23456 and 3456)? Executing the program gives this output:
0x4005e4, 0x4005e5, 0x4005e6
There are still three distinct addresses, but what do you notice about them? This is another way that the compiler/linker can optimize for memory.

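This is what is happening: only "123456" is stored. p1 points to the '1', p2 points one character further along (at the '2'), and p3 points two characters further along (at the '3'). All three strings share the same characters and the same NUL terminator.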

Again, because these strings are literal constants, there is no way they can change, so doing this is fine. Also, realize that the compiler/linker can't help with this:
#include <stdio.h>

int main(void)
{
  char *p1 = "123456";
  char *p2 = "2345";
  char *p3 = "34";

  printf("%p, %p, %p\n", p1, p2, p3);

  return 0;
}
This is because the second and third strings don't include every character up to the NUL character.
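Put another way, a literal can share storage with a longer one only if it is a suffix of it, because a string must run all the way to the NUL character. A sketch of that test (is_suffix is a hypothetical helper written for illustration, not something the compiler or linker exposes):

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: returns 1 if part is identical to the tail of */
/* whole (so both could share one terminator), otherwise returns 0.   */
int is_suffix(const char *whole, const char *part)
{
  size_t whole_len = strlen(whole);
  size_t part_len = strlen(part);

  if (part_len > whole_len)
    return 0;

    /* Compare part with the last part_len characters of whole */
  return strcmp(whole + (whole_len - part_len), part) == 0;
}
```

With this helper, "23456" and "3456" are suffixes of "123456", but "2345" and "34" are not.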

You don't necessarily have to use the -fmerge-constants command line option. Any optimization option (e.g. -O, -O1, -O2, -O3, or -Os) will enable this feature. If you need to force the compiler/linker to NOT merge strings:

-fno-merge-constants
For more on -fmerge-constants, see the GNU gcc documentation: Options That Control Optimization

So what happens if you do attempt to modify a string in the pool?

int main(void)
{
  char *p1 = "Hello"; /* The "Hello" string is in the string pool.   */
  *p1 = 'C';          /* Change first char to 'C', now it's "Cello". */

  return 0;
}
Output:
Segmentation fault
This means that something bad happened. Essentially, you are trying to write to a read-only section of memory and the operating system is terminating the program immediately. Running it under a memory debugger (Valgrind) gives a little more information:
==26788== 
==26788== Process terminating with default action of signal 11 (SIGSEGV)
==26788==  Bad permissions for mapped region at address 0x4005C4
==26788==    at 0x400526: main (pool3.c:4)
Segmentation fault
The "Bad permissions" basically means that the area of memory is marked as read-only, but we are trying to write to it. Just as you can have read-only files on the disk, you can have read-only memory.

As a reminder:

      char *p1 = "Hello"; /* OK in C, Warning in C++.  */
const char *p2 = "Hello"; /* OK in both C and C++.     */
           *p1 = 'C';     /* Unsafe in both C and C++. */
           *p2 = 'C';     /* Error in both C and C++.  */
To be on the safe side, and to share code with C++, you should use the const keyword so the compiler can warn you if you do something potentially dangerous. (The const keyword wasn't present in the original C compilers, so that's why C accepts the dangerous code.)
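If you do need to change the characters, initialize an array from the literal instead of pointing at it. The array gets its own writable copy of the characters (as s3 did earlier), so modifying it is fine. A minimal sketch; make_cello is a hypothetical function, and the caller is assumed to pass a buffer of at least 6 chars:

```c
#include <string.h>

/* Hypothetical example: copies "Hello" into the caller's writable */
/* buffer (at least 6 chars) after changing the first character.   */
char *make_cello(char *buffer)
{
  char word[] = "Hello"; /* array: a private, writable copy */

  word[0] = 'C';         /* OK, this modifies the array     */
  strcpy(buffer, word);
  return buffer;
}
```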