4.10 Strings
Processing strings of characters is one of the oldest applications of mechanical computers, arguably
predating numerical computation by at least fifty years. Assuming you’ve already solved the problem
of how to represent characters in memory (e.g. as the C char type encoded in ASCII), there are two
standard ways to represent strings:
• As a delimited string, where the end of a string is marked by a special character. The
advantages of this method are that only one extra byte is needed to indicate the length of an
arbitrarily long string, that strings can be manipulated by simple pointer operations, and in some
cases that common string operations that involve processing the entire string can be performed
very quickly. The disadvantage is that the delimiter can’t appear inside any string, which limits
what kind of data you can store in a string.
• As a counted string, where the string data is prefixed or supplemented with an explicit count of
the number of characters in the string. The advantage of this representation is that a string can
hold arbitrary data (including delimiter characters) and that one can quickly jump to the end of
the string without having to scan its entire length. The disadvantage is that maintaining a
separate count typically requires more space than adding a one-byte delimiter (unless you limit
your string length to 255 characters) and that more care needs to be taken to make sure that the
count is correct.
4.10.1 C strings
Because delimited strings are simpler and take less space, C went for delimited strings. A string is a
sequence of characters terminated by a null character '\0'. Looking back from almost half a century
later, this choice may have been a mistake in the long run, but we are pretty much stuck with it.
Note that the null character is not the same as a null pointer, although both appear to have the value 0
when used in integer contexts. A string is represented by a variable of type char *, which points to
the zeroth character of the string. The programmer is responsible for allocating and managing space to
store strings, except for explicit string constants, which are stored in a special non-writable string
space by the compiler.
If you want to use counted strings instead, you can build your own using a struct. Most scripting
languages written in C (e.g. Perl, Python, PHP, etc.) use this approach
internally. (Tcl is an exception, which is one of many good reasons not to use Tcl.)
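As a rough illustration of the counted-string idea, here is a minimal sketch of such a struct. The names countedString, csCreate, and csDestroy are made up for this example; they are not part of any standard library and are not how any particular scripting language does it.

```c
#include <stdlib.h>
#include <string.h>

/* hypothetical counted string: the length is stored explicitly, */
/* so the data may contain any bytes, including '\0' */
struct countedString {
    size_t length;   /* number of bytes in data */
    char *data;      /* NOT null-terminated */
};

/* return a freshly-malloc'd counted string holding a copy of the */
/* first length bytes at src, or 0 if malloc fails */
struct countedString *
csCreate(const char *src, size_t length)
{
    struct countedString *s;

    s = malloc(sizeof(*s));
    if(s == 0) {
        return 0;
    }

    /* allocate at least one byte so an empty string still gets a valid pointer */
    s->data = malloc(length > 0 ? length : 1);
    if(s->data == 0) {
        free(s);
        return 0;
    }

    memcpy(s->data, src, length);
    s->length = length;

    return s;
}

/* free a counted string returned by csCreate */
void
csDestroy(struct countedString *s)
{
    free(s->data);
    free(s);
}
```

Because the length travels with the data, finding the end of the string is a single field lookup, and a null byte in the middle is just another character. The price is the extra bookkeeping in csCreate and csDestroy.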
#include <stdio.h>

int
main(int argc, char **argv)
{
    char hi[3];

    hi[0] = 'h';
    hi[1] = 'i';
    hi[2] = '\0';
    puts(hi);
    return 0;
}
examples/strings/hi.c
Note that the buffer needs to have size at least 3 in order to hold all three characters. A common error in
programming with C strings is to forget to leave space for the null at the end (or to forget to add the
null, which can have comical results depending on what you are using your surprisingly long string
for).
If bigString is too long to fit in smallBuffer, a call like strcpy(smallBuffer, bigString) will
happily keep copying characters across memory long after it has passed the
end of smallBuffer. While you can avoid this to a certain extent when you control where
bigString is coming from, the situation becomes particularly fraught if the string you are trying to
store comes from the input, where it might be supplied by anybody, including somebody who is trying
to execute a buffer overrun attack to seize control of your program.
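One defensive habit is to compare lengths before copying. The sketch below assumes you know how many bytes are available at the destination; copyIfFits is a hypothetical helper invented for this example, not a standard library function.

```c
#include <string.h>

/* hypothetical helper: copy src into dest only if it fits */
/* destSize is the total number of bytes available at dest */
/* returns 1 on success, 0 if src (plus its null) would not fit */
int
copyIfFits(char *dest, size_t destSize, const char *src)
{
    if(strlen(src) + 1 > destSize) {
        /* not enough room; leave dest alone */
        return 0;
    }

    strcpy(dest, src);
    return 1;
}
```

With char smallBuffer[4], the call copyIfFits(smallBuffer, sizeof(smallBuffer), "abc") succeeds, while the same call with "abcd" returns 0 and leaves smallBuffer untouched. This only helps if the size you pass in is right, so the burden of remembering it is still on you.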
If you do need to read a string from the input, you should allocate the receiving buffer using malloc
and expand it using realloc as needed. Below is a program that shows how to do this, with some
bad alternatives commented out:
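#include <stdio.h>
#include <stdlib.h>
#include <assert.h>

#define INITIAL_LINE_LENGTH (2)  /* deliberately tiny, to force the buffer to grow */
/* return a freshly-malloc'd line with the next line of input from stdin */
char *
getLine(void)
{
    char *line;
    int size;    /* how much space is allocated for line */
    int length;  /* how many characters are in use */
    int c;
    size = INITIAL_LINE_LENGTH;
    line = malloc(size);
    assert(line);
    length = 0;
    while((c = getchar()) != EOF && c != '\n') {
        if(length >= size - 1) {
            /* need more space, keeping one byte for the null */
            size *= 2;
            line = realloc(line, size);
            assert(line);
        }
        line[length++] = c;
    }
    line[length] = '\0';
    return line;
}

int
main(int argc, char **argv)
{
    int x = 12;
    /* char name[NAME_LENGTH]; */
    char *line;
    int y = 17;
    /* gets(name); */                       /* may overrun buffer */
    /* fgets(name, NAME_LENGTH, stdin); */  /* may truncate input */
    line = getLine();
    printf("Hi %s! Did you know that x == %d and y == %d?\n", line, x, y);
    free(line);
    return 0;
}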
examples/strings/getLine.c
The externally visible difference between strcpy2 and the original strcpy is that strcpy returns
a char * equal to its first argument. It is also likely that any implementation of strcpy found in a
recent C library takes advantage of the width of the memory data path to copy more than one character
at a time.
Most C programmers will recognize the while(*dest++ = *src++); from having seen it
before, although experienced C programmers will generally be able to figure out what such highly
abbreviated constructions mean. Exposure to such constructions is arguably a form of hazing.
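For reference, here is a minimal sketch of a strcpy2 built around this loop. It assumes the function returns nothing, which is the stated difference from strcpy. The loop is written with an explicit comparison and an empty braced body so that compilers don't complain; it does the same thing as while(*dest++ = *src++);.

```c
/* copy the null-terminated string at src to dest */
/* dest must have room for all of src, including the final null */
void
strcpy2(char *dest, const char *src)
{
    /* Each pass copies one character from *src to *dest and then */
    /* advances both pointers (++ binds tighter than *). */
    /* The value of the assignment is the character just copied, */
    /* so the loop stops right after copying the final '\0'. */
    while((*dest++ = *src++) != '\0') {
        /* nothing to do in the body */
    }
}
```

Note that dest no longer points to the start of the copy when the loop finishes, which is why this version has nothing convenient to return.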
Because C pointers act exactly like array names, you can also write strcpy2 using explicit array
indices. The result is longer but may be more readable if you aren’t a C fanatic.
char *
strcpy2a(char *dest, const char *src)
{
    int i;

    for(i = 0; src[i] != '\0'; i++) {
        dest[i] = src[i];
    }

    /* note that the final null in src is not copied by the loop */
    dest[i] = '\0';

    return dest;
}
An advantage of using a separate index in strcpy2a is that we don’t trash dest, so we can return it
just like strcpy does. (In fairness, strcpy2 could have saved a copy of the original location of
dest and done the same thing.)
Note that nothing in strcpy2, strcpy2a, or the original strcpy will save you if dest points to a
region of memory that isn’t big enough to hold the string at src, or if somebody forgets to tack a null
on the end of src (in which case strcpy will just keep going until it finds a null character
somewhere). As elsewhere, it’s your job as a programmer to make sure there is enough room. Since the
compiler has no idea what dest points to, this means that you have to remember how much room is
available there yourself.
If you are worried about overrunning dest, you could use strncpy instead. The strncpy function
takes a third argument that gives the maximum number of characters to copy; however, if src doesn’t
contain a null character in this range, the resulting string in dest won’t either. Usually the only
practical application of strncpy is to extract the first k characters of a string, as in
/* copy the substring of src consisting of characters at positions
   start..end-1 (inclusive) into dest */
/* dest must have room for end-start+1 characters */
/* If end-1 is past the end of src, copies only as many characters as
   available. */
/* If start is past the end of src, the results are unpredictable. */
/* Returns a pointer to dest */
char *
copySubstring(char *dest, const char *src, int start, int end)
{
    /* copy the substring */
    strncpy(dest, src + start, end - start);

    /* add a null, since strncpy won't if it ran out of count first */
    dest[end - start] = '\0';

    return dest;
}
Another quick and dirty way to extract a substring of a string you don’t care about (and can write to) is
to just drop a null character in the middle of the sacrificial string. This is generally a bad idea unless
you are certain you aren’t going to need the original string again, but it’s a surprisingly common
practice among C programmers of a certain age.
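A minimal sketch of the null-dropping trick, wrapped in a function so the bounds check is visible. The name truncateInPlace is invented for this example.

```c
#include <string.h>

/* hypothetical helper: chop s down to its first k characters in place */
/* s must point to writable memory (not a string constant) */
/* The characters after position k are still sitting in memory, */
/* but they are no longer part of the string. */
void
truncateInPlace(char *s, size_t k)
{
    if(k < strlen(s)) {
        s[k] = '\0';
    }
}
```

Given char s[] = "sacrificial", calling truncateInPlace(s, 4) leaves s holding "sacr". The declaration matters: with char *s = "sacrificial" the string lives in the non-writable string space and the assignment would be an error.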
A similar operation to strcpy is strcat. The difference is that strcat concatenates src on to the
end of dest, so that if dest previously pointed to "abc" and src to "def", dest will now point to
"abcdef". Like strcpy, strcat returns its first argument. A no-return-value version of strcat
is given below.
void
strcat2(char *dest, const char *src)
{
    while(*dest) dest++;
    while(*dest++ = *src++);
}
Decoding this abomination is left as an exercise for the reader. There is also a function strncat
which has the same relationship to strcat that strncpy has to strcpy.
As with strcpy, the actual implementation of strcat may be much more subtle, and is likely to be
faster than rolling your own.
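If you know how big the destination is, strncat can be used to append without running off the end. The sketch below is one way to do this; appendBounded is a hypothetical helper, not a library function. One detail worth knowing is that strncat, unlike strncpy, always writes a terminating null after the characters it copies, so the count you pass must leave room for it.

```c
#include <string.h>

/* hypothetical helper: append as much of src to dest as will fit */
/* destSize is the total number of bytes available at dest, */
/* and dest must already hold a null-terminated string */
char *
appendBounded(char *dest, size_t destSize, const char *src)
{
    size_t used;

    used = strlen(dest);

    if(used + 1 < destSize) {
        /* strncat copies at most n characters and then adds a null, */
        /* so n must leave one byte free for that null */
        strncat(dest, src, destSize - used - 1);
    }

    return dest;
}
```

Silently truncating src may or may not be what you want; a real program might prefer to report that the result didn't fit.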
return i;
}
Note the use of the comma operator in the increment step. The comma operator applied to two
expressions evaluates both of them and discards the value of the first; it is usually used only in for
loops where you want to initialize or advance more than one variable at once.
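A minimal sketch of a length function written this way, assuming a loop that advances a counter and a pointer together. The name strlen2 follows the naming pattern of the other examples, and the library's real strlen is likely implemented quite differently.

```c
/* hypothetical reimplementation of strlen */
/* returns the number of characters in s before the null */
int
strlen2(const char *s)
{
    int i;

    /* the increment step uses the comma operator to advance */
    /* both the count i and the pointer s on each iteration */
    for(i = 0; *s != '\0'; i++, s++) {
        /* nothing to do in the body */
    }

    return i;
}
```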
Like the other string routines, using strlen requires including string.h.
dest[j] = '\0';
return dest;
}
The problem is that strlen has to scan all of src every time the test is done, which adds time
proportional to the length of src to each iteration of the loop. So
copyEvenCharactersBadVersion takes time proportional to the square of the length of src.
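To make the cost concrete, here is a sketch of what such a function might look like, assuming (as the name suggests) that it copies the characters at even indices of src. The details are a guess at a typical version, and the loop test is the part to look at.

```c
#include <string.h>

/* like strcpy, but copies only the characters at indices 0, 2, 4, ... */
/* dest must have room for (strlen(src)+1)/2 + 1 characters */
char *
copyEvenCharactersBadVersion(char *dest, const char *src)
{
    size_t i;
    size_t j;

    /* BAD: strlen(src) is recomputed on every pass through the loop */
    for(i = 0, j = 0; i < strlen(src); i += 2, j++) {
        dest[j] = src[i];
    }

    dest[j] = '\0';

    return dest;
}
```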
len = strlen(src);
return dest;
}
Because it doesn’t call strlen all the time, this version of copyEvenCharacters will run much
faster than the original even on small strings, and several million times faster if src is megabytes long.
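A sketch of the faster approach, under the same assumption that the function copies the characters at even indices of src. The only real change from a version that tests against strlen(src) directly is that the length is computed once and saved in a variable.

```c
#include <string.h>

/* like strcpy, but copies only the characters at indices 0, 2, 4, ... */
/* dest must have room for (strlen(src)+1)/2 + 1 characters */
char *
copyEvenCharacters(char *dest, const char *src)
{
    size_t i;
    size_t j;
    size_t len;    /* length of src, computed once */

    len = strlen(src);

    /* GOOD: the loop test uses the saved length */
    for(i = 0, j = 0; i < len; i += 2, j++) {
        dest[j] = src[i];
    }

    dest[j] = '\0';

    return dest;
}
```

Now the whole function makes one pass to find the length and one pass to copy, so the running time is proportional to the length of src rather than its square.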
My own feeling is that the first version is more clear, since !strcmp always suggested to me that you
were testing for the negation of some property (e.g. not equal). But if you think of strcmp as telling
you when two strings are different rather than when they are equal, this may not be so confusing.
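The two styles at issue are presumably the explicit test strcmp(s1, s2) == 0 and the terser !strcmp(s1, s2). Here is a small sketch showing both side by side; sameString and sameStringTerse are hypothetical names used only for this comparison.

```c
#include <string.h>

/* hypothetical helper: returns 1 if s1 and s2 hold the same */
/* characters, 0 otherwise */
int
sameString(const char *s1, const char *s2)
{
    /* strcmp returns 0 when it finds no difference */
    return strcmp(s1, s2) == 0;
}

/* identical behavior, written with the negation idiom */
int
sameStringTerse(const char *s1, const char *s2)
{
    return !strcmp(s1, s2);
}
```

Both functions return the same value for every pair of strings, so the choice between them is purely about which one you find easier to read.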
4.10.7 Formatted output to strings
You can write formatted output to a string buffer with sprintf just like you can write it to stdout
with printf or to a file with fprintf. Make sure when you do so that there is enough room in the
buffer you are writing to, or the usual bad things will happen.
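One way to be sure there is enough room is to ask how much room is needed before writing anything. The sketch below uses snprintf rather than sprintf: snprintf is a C99 function that takes the buffer size as an extra argument and returns the number of characters the full output would require. The helper formatPair and its "name=value" format are invented for this example.

```c
#include <stdio.h>
#include <stdlib.h>

/* hypothetical helper: return a freshly-malloc'd string holding */
/* "name=value", or 0 on failure */
/* It is the caller's responsibility to free the result. */
char *
formatPair(const char *name, int value)
{
    int needed;
    char *buffer;

    /* with a null buffer and size 0, snprintf writes nothing and */
    /* returns the length the output would have (not counting the null) */
    needed = snprintf(0, 0, "%s=%d", name, value);
    if(needed < 0) {
        return 0;
    }

    buffer = malloc(needed + 1);    /* +1 for the null */
    if(buffer != 0) {
        snprintf(buffer, needed + 1, "%s=%d", name, value);
    }

    return buffer;
}
```

Formatting twice costs a little time, but it means the buffer is exactly the right size no matter how long name is or how many digits value has.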
Because allocating space for a copy of a string is such a common operation, many C libraries provide a
strdup function that does exactly this. If you don’t have one (it’s not required by the C standard), you
can write your own like this:
/* return a freshly-malloc'd copy of s */
/* or 0 if malloc fails */
/* It is the caller's responsibility to free the returned string when done. */
char *
strdup(const char *s)
{
    char *s2;

    s2 = malloc(strlen(s)+1);

    if(s2 != 0) {
        strcpy(s2, s);
    }

    return s2;
}
Exercise: Write a function strcatAlloc that returns a freshly-malloc’d string that concatenates its
two arguments. Exactly how many bytes do you need to allocate?
Recall that argv in main is declared as char **; this means that it is a pointer to a pointer to a
char, or in this case the base address of an array of pointers to char, where each such pointer
references a string. These strings correspond to the command-line arguments to your program, with the
program name itself appearing in argv[0].
The count argc counts all arguments including argv[0]; it is 1 if your program is called with no
arguments and larger otherwise.
Here is a program that prints its arguments. If you get confused about what argc and argv do, feel
free to compile this and play with it:
#include <stdio.h>

int
main(int argc, char **argv)
{
    int i;

    printf("argc = %d\n", argc);
    for(i = 0; i < argc; i++) {
        printf("argv[%d] = %s\n", i, argv[i]);
    }
    return 0;
}
examples/strings/printArgs.c
Like strings, C terminates argv with a null: the value of argv[argc] is always 0 (a null pointer to
char). In principle this allows you to recover argc if you lose it.
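Recovering the count is just a matter of walking the array until you hit the null pointer, much as strlen walks a string until it hits the null character. A minimal sketch, with countArgs being a hypothetical helper name:

```c
/* hypothetical helper: count the entries in a null-terminated */
/* array of strings such as argv */
int
countArgs(char **argv)
{
    int n;

    for(n = 0; argv[n] != 0; n++) {
        /* nothing to do in the body */
    }

    return n;
}
```

Called on the argv passed to main, this should return the same value as argc.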