Strings

The document discusses string processing in C, highlighting two main representations: delimited strings, which use a null character to indicate the end, and counted strings, which include an explicit count of characters. It details the implications of using C strings, string constants, and operations such as copying and concatenating strings, while also addressing common pitfalls like buffer overflows. Additionally, it covers string length determination and comparison methods, emphasizing the importance of proper memory management and efficient coding practices.

Uploaded by

hashtag00gamer
Copyright
© © All Rights Reserved

4.10 Strings
Processing strings of characters is one of the oldest applications of mechanical computers, arguably
predating numerical computation by at least fifty years. Assuming you’ve already solved the problem
of how to represent characters in memory (e.g. as the C char type encoded in ASCII), there are two
standard ways to represent strings:
• As a delimited string, where the end of a string is marked by a special character. The
advantages of this method are that only one extra byte is needed to indicate the length of an
arbitrarily long string, that strings can be manipulated by simple pointer operations, and in some
cases that common string operations that involve processing the entire string can be performed
very quickly. The disadvantage is that the delimiter can’t appear inside any string, which limits
what kind of data you can store in a string.
• As a counted string, where the string data is prefixed or supplemented with an explicit count of
the number of characters in the string. The advantage of this representation is that a string can
hold arbitrary data (including delimiter characters) and that one can quickly jump to the end of
the string without having to scan its entire length. The disadvantage is that maintaining a
separate count typically requires more space than adding a one-byte delimiter (unless you limit
your string length to 255 characters) and that more care needs to be taken to make sure that the
count is correct.

4.10.1 C strings
Because delimited strings are simpler and take less space, C went for delimited strings. A string is a
sequence of characters terminated by a null character '\0'. Looking back from almost half a century
later, this choice may have been a mistake in the long run, but we are pretty much stuck with it.
Note that the null character is not the same as a null pointer, although both appear to have the value 0
when used in integer contexts. A string is represented by a variable of type char *, which points to
the zeroth character of the string. The programmer is responsible for allocating and managing space to
store strings, except for explicit string constants, which are stored in a special non-writable string
space by the compiler.
If you want to use counted strings instead, you can build your own using a struct. Most scripting
languages written in C (e.g. Perl, Python, PHP, etc.) use this approach
internally. (Tcl is an exception, which is one of many good reasons not to use Tcl.)
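
As a rough sketch of what such a struct might look like (the names countedString, csCreate, and csDestroy are made up for this illustration and do not come from any standard library), we can store the length next to a pointer to the data. Because the length is explicit, the data may contain null characters:

```c
#include <stdlib.h>
#include <string.h>

/* A hypothetical counted string: the length is stored explicitly, */
/* so the data may contain '\0' bytes and is not null-terminated. */
struct countedString {
    size_t length;   /* number of bytes in data */
    char *data;      /* the bytes themselves */
};

/* Returns a freshly-malloc'd counted string holding a copy of */
/* the first length bytes at src, or 0 if allocation fails. */
struct countedString *
csCreate(const char *src, size_t length)
{
    struct countedString *s;

    s = malloc(sizeof(*s));
    if(s == 0) {
        return 0;
    }

    /* malloc(0) may return 0, so always ask for at least one byte */
    s->data = malloc(length > 0 ? length : 1);
    if(s->data == 0) {
        free(s);
        return 0;
    }

    memcpy(s->data, src, length);
    s->length = length;

    return s;
}

/* Frees a counted string returned by csCreate. */
void
csDestroy(struct countedString *s)
{
    free(s->data);
    free(s);
}
```

Finding the end of such a string costs one addition (data + length) instead of a scan, at the price of a few extra bytes per string.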

4.10.2 String constants


A string constant in C is represented by a sequence of characters within double quotes. Standard C
character escape sequences like \n (newline), \r (carriage return), \a (bell), \x17 (character with
hexadecimal code 0x17), \\ (backslash), and \" (double quote) can all be used inside string
constants. The value of a string constant should be treated as having type const char * (formally it
is an array of char, but attempting to modify it is undefined behavior), and can be assigned to variables
and passed as function arguments or return values of this type.
Two string constants separated only by whitespace will be concatenated by the compiler as a single
constant: "foo" "bar" is the same as "foobar". This feature is not much used in normal code,
but shows up sometimes in macros.
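
Here is a small sketch of how the concatenation feature tends to appear with macros. The macro PROGRAM_NAME and the function usageMessage are invented for this example:

```c
#define PROGRAM_NAME "myprog"   /* hypothetical name, for illustration only */

/* The three adjacent constants below are merged by the compiler */
/* into the single constant "usage: myprog [file]\n". */
const char *
usageMessage(void)
{
    return "usage: " PROGRAM_NAME " [file]\n";
}
```

The concatenation happens at compile time, so there is no run-time cost and no buffer to manage.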

4.10.2.1 String encodings


Standard C strings are assumed to be in ASCII, a 7-bit code developed in the 1960s to represent
English-language text. If you want to write text that includes any letters not in the usual 26-letter Latin
alphabet, you will need to use a different encoding. C does not provide very good support for this, but
for fixed strings, you can often get away with using Unicode as long as both your text editor and your
terminal are set to use the UTF-8 encoding.
The reason this works is that UTF-8 encodes each Unicode character as one or more 8-bit characters,
and does this in a way that guarantees that you never accidentally create a null or any other standard
ASCII character. So a C string containing UTF-8 characters looks like an ordinary C string to all the C
library routines. This also works if you include a Unicode string with a UTF-8 encoding in a comment
or printf format string, as illustrated in the file unicode.c. But this use of Unicode in C is very
limited.
Some issues you will quickly run into if you are trying to do something more sophisticated:
1. You cannot use non-ASCII letters anywhere outside a string constant or comment without
confusing the C compiler. So variable names that use non-ASCII characters are forbidden.
2. If you include a UTF-8 encoded string somewhere, even though both your text editor and
terminal are likely to display the multi-byte characters correctly, they are still spread across
multiple bytes in the encoded string. This can be trouble if you try to change some letter in a
UTF-8-encoded string to something else whose encoding has a different width.
3. You can’t generally put a multibyte character into a char variable, or write it as a char
constant.
4. You may find out that some other tools have their own ideas about what encodings to expect,
causing Unicode characters to turn into gibberish or cause other errors.
There exist libraries for working with Unicode strings in C, but they are clunky. If you need to handle
a lot of non-ASCII text, you may be better off working with a different language. However, even
moving away from C is not always a panacea, and Unicode support in other tools may be hit-or-miss.
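
To make the multi-byte issue concrete, here is a sketch of a function that counts Unicode characters (code points) rather than bytes. The name utf8Length is made up for this example, and the function assumes its input is valid UTF-8. It relies on the fact that every continuation byte in UTF-8 has the bit pattern 10xxxxxx:

```c
#include <stddef.h>

/* Counts code points in a valid UTF-8 string by skipping */
/* continuation bytes, which all look like 10xxxxxx. */
size_t
utf8Length(const char *s)
{
    size_t count = 0;

    for(; *s; s++) {
        if(((unsigned char) *s & 0xC0) != 0x80) {
            count++;
        }
    }

    return count;
}
```

For the six-byte string "na\xc3\xaf" "ve" (the word naïve in UTF-8), strlen reports 6 but utf8Length reports 5, because the ï occupies two bytes.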

4.10.3 String buffers


The problem with string constants is that you can’t modify them. If you want to build strings on the fly,
you will need to allocate space for them. The traditional approach is to use a buffer, an array of chars.
Here is a particularly painful hello-world program that builds a string by hand:
#include <stdio.h>

int
main(int argc, char **argv)
{
    char hi[3];
    hi[0] = 'h';
    hi[1] = 'i';
    hi[2] = '\0';

    puts(hi);

    return 0;
}

examples/strings/hi.c
Note that the buffer needs to have size at least 3 in order to hold all three characters. A common error in
programming with C strings is to forget to leave space for the null at the end (or to forget to add the
null, which can have comical results depending on what you are using your surprisingly long string
for).
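
One way to avoid miscounting is to let the compiler size the buffer from an initializer. The function hiBufferSize below is a made-up name for a sketch that reports how much space the compiler reserves:

```c
#include <stddef.h>

/* Returns the number of bytes the compiler reserves for a buffer */
/* initialized with "hi", including the terminating null. */
size_t
hiBufferSize(void)
{
    char hi[] = "hi";   /* compiler picks the size: 'h', 'i', '\0' */

    return sizeof(hi);
}
```

The result is 3, one more than the 2 that strlen would report for the same string.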

4.10.3.1 String buffers and the perils of gets


Fixed-size buffers are a common source of errors in older C programs, particularly ones written with
the library routine gets. The problem is that if you do something like
strcpy(smallBuffer, bigString);

the strcpy function will happily keep copying characters across memory long after it has passed the
end of smallBuffer. While you can avoid this to a certain extent when you control where
bigString is coming from, the situation becomes particularly fraught if the string you are trying to
store comes from the input, where it might be supplied by anybody, including somebody who is trying
to execute a buffer overrun attack to seize control of your program.
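
When the size of the destination is known, one defensive pattern is to check the length before copying. This is only a sketch, and copyIfFits is a hypothetical helper rather than a standard library function:

```c
#include <string.h>

/* Copies src into dest only if it fits, including the null. */
/* Returns 1 on success, 0 (leaving dest untouched) otherwise. */
int
copyIfFits(char *dest, size_t destSize, const char *src)
{
    if(strlen(src) + 1 > destSize) {
        return 0;
    }

    strcpy(dest, src);

    return 1;
}
```

This still depends on the caller passing the true size of dest, so it moves the bookkeeping problem rather than eliminating it.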
If you do need to read a string from the input, you should allocate the receiving buffer using malloc
and expand it using realloc as needed. Below is a program that shows how to do this, with some
bad alternatives commented out:
#include <stdio.h>
#include <stdlib.h>
#include <assert.h>

#define NAME_LENGTH (2)

#define INITIAL_LINE_LENGTH (2)

/* return a freshly-malloc'd line with next line of input from stdin */
char *
getLine(void)
{
    char *line;
    int size;   /* how much space do I have in line? */
    int length; /* how many characters have I used */
    int c;

    size = INITIAL_LINE_LENGTH;
    line = malloc(size);
    assert(line);
    length = 0;

    while((c = getchar()) != EOF && c != '\n') {
        if(length >= size-1) {
            /* need more space! */
            size *= 2;

            /* make length equal to new size */
            /* copy contents if necessary */
            line = realloc(line, size);
            assert(line);   /* realloc can fail too */
        }

        line[length++] = c;
    }

    line[length] = '\0';

    return line;
}

int
main(int argc, char **argv)
{
    int x = 12;
    /* char name[NAME_LENGTH]; */
    char *line;
    int y = 17;

    puts("What is your name?");

    /* gets(name); */                        /* may overrun buffer */
    /* scanf("%s\n", name); */               /* may overrun buffer */
    /* fgets(name, NAME_LENGTH, stdin); */   /* may truncate input */
    line = getLine();                        /* has none of these problems */

    printf("Hi %s! Did you know that x == %d and y == %d?\n", line, x, y);

    free(line);  /* but we do have to free line when we are done with it */

    return 0;
}

examples/strings/getLine.c

4.10.4 Operations on strings


Unlike many programming languages, C provides only a rudimentary string-processing library. The
reason is that many common string-processing tasks in C can be done very quickly by hand.
For example, suppose we want to copy a string from one buffer to another. The library function
strcpy declared in string.h will do this for us (and is usually the right thing to use), but if it
didn’t exist we could write something very close to it using a famous C idiom.
void
strcpy2(char *dest, const char *src)
{
    /* This line copies characters one at a time from *src to *dest. */
    /* The postincrements increment the pointers (++ binds tighter than *) */
    /* to get to the next locations on the next iteration through the loop. */
    /* The loop terminates when *src == '\0' == 0. */
    /* There is no loop body because there is nothing to do there. */
    while(*dest++ = *src++);
}

The externally visible difference between strcpy2 and the original strcpy is that strcpy returns
a char * equal to its first argument. It is also likely that any implementation of strcpy found in a
recent C library takes advantage of the width of the memory data path to copy more than one character
at a time.
Most C programmers will recognize the while(*dest++ = *src++); idiom from having seen it
before, and experienced C programmers will generally be able to figure out what such highly
abbreviated constructions mean even when they are new to them. Exposure to such constructions is
arguably a form of hazing.
Because a pointer can be indexed just like an array name, you can also write strcpy2 using explicit array
indices. The result is longer but may be more readable if you aren’t a C fanatic.
char *
strcpy2a(char *dest, const char *src)
{
    int i;

    for(i = 0; src[i] != '\0'; i++) {
        dest[i] = src[i];
    }

    /* note that the final null in src is not copied by the loop */
    dest[i] = '\0';

    return dest;
}

An advantage of using a separate index in strcpy2a is that we don’t trash dest, so we can return it
just like strcpy does. (In fairness, strcpy2 could have saved a copy of the original location of
dest and done the same thing.)
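
For comparison, here is a sketch of the pointer version with the saved copy. The name strcpy2b is made up for this example:

```c
/* Like strcpy2, but remembers where dest started so it can return it. */
char *
strcpy2b(char *dest, const char *src)
{
    char *start = dest;   /* dest itself gets advanced by the loop */

    while((*dest++ = *src++) != '\0');

    return start;
}
```

The explicit comparison against '\0' does the same thing as the bare assignment in strcpy2, but keeps compilers from warning about an assignment used as a test.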

Note that nothing in strcpy2, strcpy2a, or the original strcpy will save you if dest points to a
region of memory that isn’t big enough to hold the string at src, or if somebody forgot to tack a null
on the end of src (in which case strcpy will just keep going until it finds a null character
somewhere). As elsewhere, it’s your job as a programmer to make sure there is enough room. Since the
compiler has no idea what dest points to, this means that you have to remember how much room is
available there yourself.
If you are worried about overrunning dest, you could use strncpy instead. The strncpy function
takes a third argument that gives the maximum number of characters to copy; however, if src doesn’t
contain a null character in this range, the resulting string in dest won’t either. Usually the only
practical application of strncpy is to extract the first k characters of a string, as in
/* copy the substring of src consisting of characters at positions
   start..end-1 (inclusive) into dest */
/* If end-1 is past the end of src, copies only as many characters as
   available. */
/* If start is past the end of src, the results are unpredictable. */
/* Returns a pointer to dest */
char *
copySubstring(char *dest, const char *src, int start, int end)
{
    /* copy the substring */
    strncpy(dest, src + start, end - start);

    /* add null since strncpy probably didn't */
    dest[end - start] = '\0';

    return dest;
}

Another quick and dirty way to extract a substring of a string you don’t care about (and can write to) is
to just drop a null character in the middle of the sacrificial string. This is generally a bad idea unless
you are certain you aren’t going to need the original string again, but it’s a surprisingly common
practice among C programmers of a certain age.
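
A sketch of the trick, using a hypothetical helper truncateInPlace. It assumes s is writable (so not a string constant) and contains at least k characters:

```c
/* Truncates s in place so that only its first k characters remain. */
/* s must be writable and at least k characters long. */
char *
truncateInPlace(char *s, int k)
{
    s[k] = '\0';   /* everything after position k-1 is now cut off */

    return s;
}
```

The characters after the new null are still sitting in memory, but no string function that starts at s will ever reach them.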
A similar operation to strcpy is strcat. The difference is that strcat concatenates src onto the
end of dest, so that if dest previously pointed to "abc" and src to "def", dest will now point to
"abcdef". Like strcpy, strcat returns its first argument. A no-return-value version of strcat
is given below.
void
strcat2(char *dest, const char *src)
{
    while(*dest) dest++;
    while(*dest++ = *src++);
}

Decoding this abomination is left as an exercise for the reader. There is also a function strncat,
which appends at most a given number of characters from src; unlike strncpy, it always adds a
terminating null.

As with strcpy, the actual implementation of strcat may be much more subtle, and is likely to be
faster than rolling your own.
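
If the size of the destination buffer is known, the library function strncat can be used to append without overrunning it. The wrapper appendBounded below is a hypothetical name for a sketch of the usual calculation; it assumes dest already holds a null-terminated string shorter than destSize:

```c
#include <string.h>

/* Appends as much of src to dest as fits in a buffer of destSize bytes. */
/* dest must already hold a null-terminated string shorter than destSize. */
char *
appendBounded(char *dest, size_t destSize, const char *src)
{
    /* strncat's count excludes the null it always writes, hence the -1 */
    strncat(dest, src, destSize - strlen(dest) - 1);

    return dest;
}
```

With a 7-byte buffer containing "abc", appending "defgh" leaves "abcdef": three characters fit, plus the null.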

4.10.5 Finding the length of a string


Because the length of a string is of fundamental importance in C (e.g., when deciding if you can safely
copy it somewhere else), the standard C library provides a function strlen that counts the number of
non-null characters in a string. Note that if you are allocating space for a copy of a string, you will need
to add one to the value returned by strlen to account for the null.

Here’s a possible implementation:


int
strlen(const char *s)
{
    int i;

    for(i = 0; *s; i++, s++);

    return i;
}

Note the use of the comma operator in the increment step. The comma operator applied to two
expressions evaluates both of them and discards the value of the first; it is usually used only in for
loops where you want to initialize or advance more than one variable at once.
Like the other string routines, using strlen requires including string.h.

4.10.5.1 The strlen tarpit


A common mistake is to put a call to strlen in the header of a loop; for example:
/* like strcpy, but only copies characters at indices 0, 2, 4, ...
   from src to dest */
char *
copyEvenCharactersBadVersion(char *dest, const char *src)
{
    int i;
    int j;

    /* BAD: Calls strlen on every pass through the loop */
    for(i = 0, j = 0; i < strlen(src); i += 2, j++) {
        dest[j] = src[i];
    }

    dest[j] = '\0';

    return dest;
}

The problem is that strlen has to scan all of src every time the test is done, which adds time
proportional to the length of src to each iteration of the loop. So
copyEvenCharactersBadVersion takes time proportional to the square of the length of src.

Here’s a faster version:


/* like strcpy, but only copies characters at indices 0, 2, 4, ...
   from src to dest */
char *
copyEvenCharacters(char *dest, const char *src)
{
    int i;
    int j;
    int len;    /* length of src */

    len = strlen(src);

    /* GOOD: uses cached value of strlen(src) */
    for(i = 0, j = 0; i < len; i += 2, j++) {
        dest[j] = src[i];
    }
    dest[j] = '\0';

    return dest;
}

Because it doesn’t call strlen all the time, this version of copyEvenCharacters will run much
faster than the original even on small strings, and several million times faster if src is megabytes long.

4.10.6 Comparing strings


If you want to test if strings s1 and s2 contain the same characters, writing s1 == s2 won’t work,
since this tests instead whether s1 and s2 point to the same address. Instead, you should use strcmp,
declared in string.h. The strcmp function walks along both of its arguments until it either hits a
null on both and returns 0, or hits two different characters, and returns a positive integer if the first
string’s character is bigger and a negative integer if the second string’s character is bigger (a typical
implementation will just subtract the two characters). A straightforward implementation might look like
this:
int
strcmp(const char *s1, const char *s2)
{
    while(*s1 && *s2 && *s1 == *s2) {
        s1++;
        s2++;
    }

    return *s1 - *s2;
}

To use strcmp to test equality, test if the return value is 0:


if(strcmp(s1, s2) == 0) {
    /* strings are equal */
    ...
}

You may sometimes see this idiom instead:


if(!strcmp(s1, s2)) {
    /* strings are equal */
    ...
}

My own feeling is that the first version is clearer, since !strcmp always suggested to me that you
were testing for the negation of some property (e.g. not equal). But if you think of strcmp as telling
you when two strings are different rather than when they are equal, this may not be so confusing.
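
To see the difference between comparing pointers and comparing contents, here is a small sketch; sameContents is a made-up wrapper name, not a library function:

```c
#include <string.h>

/* Returns 1 if s1 and s2 hold the same characters, 0 otherwise. */
/* Unlike s1 == s2, this does not care where the strings are stored. */
int
sameContents(const char *s1, const char *s2)
{
    return strcmp(s1, s2) == 0;
}
```

Two separate buffers that both contain "abc" live at different addresses, so comparing the pointers with == reports them as different, while sameContents reports them as equal.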

4.10.7 Formatted output to strings

You can write formatted output to a string buffer with sprintf just like you can write it to stdout
with printf or to a file with fprintf. Make sure when you do so that there is enough room in the
buffer you are writing to, or the usual bad things will happen.
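
If your C library supports C99, snprintf takes the buffer size as an extra argument and will not write past it. The helper formatX below is a hypothetical example of how one might use its return value to detect truncation:

```c
#include <stdio.h>

/* Formats "x == <x>" into buf without writing more than bufSize bytes. */
/* Returns 1 if the whole string fit, 0 if it was truncated or failed. */
int
formatX(char *buf, size_t bufSize, int x)
{
    int needed;

    /* snprintf returns the length the full output would have had, */
    /* not counting the null, or a negative value on error */
    needed = snprintf(buf, bufSize, "x == %d", x);

    return needed >= 0 && (size_t) needed < bufSize;
}
```

Even when the output is truncated, snprintf still null-terminates whatever it managed to write, provided bufSize is not zero.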

4.10.8 Dynamic allocation of strings


When allocating space for a copy of a string s using malloc, the required space is strlen(s)+1.
Don’t forget the +1, or bad things may happen.

Because allocating space for a copy of a string is such a common operation, many C libraries provide a
strdup function that does exactly this. If you don’t have one (it was not required by the C standard
until C23), you can write your own like this:
/* return a freshly-malloc'd copy of s */
/* or 0 if malloc fails */
/* It is the caller's responsibility to free the returned string when done. */
char *
strdup(const char *s)
{
    char *s2;

    s2 = malloc(strlen(s)+1);

    if(s2 != 0) {
        strcpy(s2, s);
    }

    return s2;
}

Exercise: Write a function strcatAlloc that returns a freshly-malloc’d string that concatenates its
two arguments. Exactly how many bytes do you need to allocate?

4.10.9 Command-line arguments


Now that we know about strings, we can finally do something with argc and argv.

Recall that argv in main is declared as char **; this means that it is a pointer to a pointer to a
char, or in this case the base address of an array of pointers to char, where each such pointer
references a string. These strings correspond to the command-line arguments to your program, with the
program name itself appearing in argv[0].

The count argc counts all arguments including argv[0]; it is 1 if your program is called with no
arguments and larger otherwise.
Here is a program that prints its arguments. If you get confused about what argc and argv do, feel
free to compile this and play with it:
#include <stdio.h>

int
main(int argc, char **argv)
{
    int i;

    printf("argc = %d\n\n", argc);

    for(i = 0; i < argc; i++) {
        printf("argv[%d] = %s\n", i, argv[i]);
    }

    return 0;
}

examples/strings/printArgs.c
Just as a string is terminated with a null character, argv is terminated with a null pointer: the value
of argv[argc] is always 0 (a null pointer to char). In principle this allows you to recover argc if you
lose it.
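
A sketch of that recovery, using a made-up function name countArgs:

```c
/* Recovers argc by scanning argv for its terminating null pointer. */
int
countArgs(char **argv)
{
    int n;

    for(n = 0; argv[n] != 0; n++);

    return n;
}
```

Called as countArgs(argv) from main, this returns the same value as argc. The loop has the same shape as the strlen implementation, except that it walks over pointers instead of characters.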
