Intro To C - Module 5
16-22, 2024)
We can use structs and helper methods to make more advanced data structures and, during
this course, we will.
Let's start with generic byte arrays or, as they're sometimes called, bytestrings. Although
null-terminated "C strings" are the standard for text, we can't use them for generic data "blobs"
because 0x00 is a valid (and common!) value of a byte in generic data. Instead, to know where
these objects end, we need to track lengths—a common way to do this is to include a dedicated
field in a struct.
#include <stdlib.h>
#include <string.h>
void bytestring_delete(bytestring* b) {
    free(b->data); // @E
    free(b);
}
For brevity's sake, we are assuming allocations will be successful; in a real implementation,
we'd have to check those pointers and behave accordingly.
The arrow operator is shorthand: obj->field is another way of writing (*obj).field, used
commonly because heap-allocated objects are almost always addressed using pointers, just as
Java does so using references (which are, in truth, managed pointers.)
We use an "object-like" syntax by putting the type of object at the front of our method names:
bytestring_new is comparable to a constructor, and since the constructed (and returned)
object must outlive the call, we must (@A) allocate it on the heap. We observe that it (@B) returns
a pointer to what it has allocated. Our method does not assume ownership of the unsigned
char* that will be its source string; it could be a string literal, which we're not allowed to modify
at all, so we must copy it (@C) to a new array, a second allocation (@D) that the bytestring
will own and can therefore modify without consequences. We use memcpy rather than strcpy
because our string version is not null-terminated. Finally, when we are done with a string object,
we are responsible for deleting it as well as the data buffer (@E). Note that the ordering of
the two free calls cannot be changed. If we freed b first, then free(b->data) would have us
dereferencing a freed pointer, which is not allowed.
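The definitions that the markers (@A) through (@D) describe might look like the minimal sketch below. The field names length and data match the rest of this module; the parameter names, the const qualifier, and the NULL checks are assumptions added here rather than part of the original listing.

```c
#include <stdlib.h>
#include <string.h>

/* A length-tracked byte array; the data is NOT null-terminated. */
typedef struct bytestring {
    size_t length;
    unsigned char* data;
} bytestring;

/* Builds a heap-allocated bytestring holding a private copy of src. */
bytestring* bytestring_new(size_t length, const unsigned char* src) {
    bytestring* out = malloc(sizeof(bytestring));  /* @A: the object lives on the heap */
    if (out == NULL) {
        return NULL;                               /* assumption: report failure as NULL */
    }
    out->length = length;
    out->data = malloc(length > 0 ? length : 1);   /* @D: second allocation, owned by out */
    if (out->data == NULL) {
        free(out);
        return NULL;
    }
    memcpy(out->data, src, length);                /* @C: copy the source, never borrow it */
    return out;                                    /* @B: pointer to what was allocated */
}

/* Releases the buffer first, then the struct that points to it. */
void bytestring_delete(bytestring* b) {
    free(b->data);
    free(b);
}
```

Because the length is stored explicitly, a call such as bytestring_new(3, (const unsigned char*)"A\0B") keeps all three bytes, including the embedded 0x00.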
We might find ourselves, in the course of our work with bytestrings, wanting to copy them. When
doing so, we need to copy the data field as well—otherwise, we'll have two bytestring objects
that own the same underlying unsigned char*, which means that changes to one would affect
the other.
bytestring* bytestring_copy(bytestring* b) {
    bytestring* out = malloc(sizeof(bytestring));
    out->length = b->length;
    out->data = malloc(out->length);
    memcpy(out->data, b->data, out->length);
    return out;
}
Doing the full, or "deep," copy ensures that modification as well as deletion of the original one
do not interfere with the new one.
To complete our encapsulation, let's add some more methods to this string type.
size_t bytestring_length(bytestring* b) {
    return b->length;
}
void bytestring_print_ascii(bytestring* b) {
    // size_t matches the type of b->length and avoids a signed/unsigned comparison
    for (size_t i = 0; i < b->length; i++) {
        putchar(b->data[i]);
    }
}
void bytestring_print_hex(bytestring* b) {
    for (size_t i = 0; i < b->length; i++) {
        printf("%02x ", b->data[i]);
    }
    printf("\n");
}
These heap-allocated objects require the same discipline as (pointers to) raw blocks of memory
managed with malloc and free. Every bytestring_new, unless you intend for its object to live
forever, must be paired with a bytestring_delete when you are done with it. If you let a
bytestring become unreachable but don't delete it, this is objectively an error (a memory leak).
In higher-level languages, garbage collection frees unreachable data; in C, that's your job.
This, of course, means that parent objects that create bytestrings for their own use must
handle deletion of what they create. For example, imagine rearchitecting the person struct from
Module 4, like so:
void delete_person(person* p) {
    bytestring_delete(p->name);
    free(p->birthday);
    free(p);
}
The heap-allocated person object creates and owns two objects, which are also heap-allocated,
one in the name field and one in the birthday field. Since birthday is directly malloc'd, it can
be free'd; however, the bytestring_new requires a bytestring_delete to ensure proper
cleanup.
This manual allocation and deallocation is unfamiliar to users of high-level languages, in which
garbage collection handles it all. Is it difficult? In most cases, not at all, but it is tedious. The
downsides are obvious; the upside is that you, as a C programmer, have explicit control over
when objects are created and released—you will not face "stop the world" pauses due to
garbage collection.
We'll show, to give an example of a more flexible data structure, how to make a linked list. Since
we're just aiming for a proof of concept, we'll use unsigned char as its type.
uchar_list* uchar_list_new () {
    uchar_list* out = malloc(sizeof(uchar_list));
    out->data = 0;
    out->next = NULL;
    out->is_end = true;
    return out;
}
return 0;
}
... builds up and prints out the contents of the list; note the recursion in uchar_list_print.
Linked lists will be one implementation option when you build your Lisp interpreter for the
second half of this course.
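One way the rest of such a list might be filled in is sketched below. The struct layout follows the fields that uchar_list_new sets (data, next, is_end); uchar_list_cons and uchar_list_delete are hypothetical helper names, and this uchar_list_print is only a guess at the recursive printer mentioned above, not the original.

```c
#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>

typedef struct uchar_list {
    unsigned char data;
    struct uchar_list* next;
    bool is_end;               /* true only for the empty-list sentinel */
} uchar_list;

/* Creates the empty list: a single sentinel node. */
uchar_list* uchar_list_new(void) {
    uchar_list* out = malloc(sizeof(uchar_list));
    if (out == NULL) {
        return NULL;
    }
    out->data = 0;
    out->next = NULL;
    out->is_end = true;
    return out;
}

/* Hypothetical helper: prepends value and returns the new head. */
uchar_list* uchar_list_cons(unsigned char value, uchar_list* tail) {
    uchar_list* out = malloc(sizeof(uchar_list));
    if (out == NULL) {
        return NULL;
    }
    out->data = value;
    out->next = tail;
    out->is_end = false;
    return out;
}

/* Prints each element as a decimal number, recursing on the tail. */
void uchar_list_print(const uchar_list* list) {
    if (list->is_end) {
        printf("\n");
        return;
    }
    printf("%d ", list->data);
    uchar_list_print(list->next);
}

/* Hypothetical helper: frees every node, sentinel included. */
void uchar_list_delete(uchar_list* list) {
    while (list != NULL) {
        uchar_list* next = list->next;
        free(list);
        list = next;
    }
}
```

Prepending is the cheap operation here: each uchar_list_cons call allocates one node and never walks the list, and the sentinel's NULL next pointer is what lets uchar_list_delete stop.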
Files
Among the most important objects we'll ever work with are files, because on Linux, everything is
a file. Network socket? File. Random number generation? /dev/urandom, a file. The place
where Ayn Rand's work and ideology truly belong? /dev/null, a file. You can't do much if you
don't know how to use files, so let's get to it.
You'll use fopen and fclose to—surprising no one—open and close files on your system.
There are other ways to do it, but these are the simplest. The type signature of fopen is:
FILE* fopen(const char* filename, const char* mode);
The const specifier is a promise that fopen won't modify the strings you give it; the return type
tells us we should expect a pointer to a FILE object. In the event of failure (for example, you try
to read a file that doesn't exist), it will return a null pointer. For this reason, you should always
check it and see what kind of error occurred; we'll return to this in a later module.
The mode argument lets you specify what you intend to do with the file—the most common are
"r", for reading, "w", for writing, and "a", for appending. You can use "r+", "w+", and "a+" if
you intend to read and write to the file—these modes all differ on how they handle the case of
existing files. If the file doesn't exist and a read mode (e.g., "r", "r+") is used, then fopen will
fail. A write or append mode, on the other hand, creates the file if it doesn't exist. A write mode,
in addition, deletes the contents of the original file, whereas an append mode, if the file exists,
more cooperatively adds to what is already there (think of a running log). There's also a "wx"
mode (added in C11) that fails if the file already exists. Last of all, you should append
a b to your mode string (e.g., "wb" instead of "w") if you're working with binary data to ensure that
the file objects are opened in binary mode instead of text mode. On Linux and macOS, the two
modes are identical; on Windows, which uses a two-character line terminator (\r\n), they are
different.
When you're done with a FILE*, remember to fclose it. Otherwise, you will leak file descriptors
which, like leaking memory, may consign your program to an eventual doom.
Here is a function that, like Linux touch, makes the file exist if it does not, but does not alter it.
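A minimal sketch of such a function, assuming the name touch and an int return code (both chosen here for illustration), relies on append mode: "a" creates a missing file but never truncates an existing one.

```c
#include <stdio.h>

/* Makes sure filename exists without changing its contents.
   Returns 0 on success and -1 if the file could not be opened. */
int touch(const char* filename) {
    FILE* f = fopen(filename, "a");  /* append mode: create if missing, never truncate */
    if (f == NULL) {
        return -1;
    }
    fclose(f);                       /* nothing was written, so the contents are unchanged */
    return 0;
}
```

Unlike the real touch command, this sketch makes no attempt to update the file's timestamps.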
There are dozens of file methods and purposes to which files are put, and we should discuss
three very important "files"—not unique to C—that exist in all programs: stdin, stdout, and
stderr, corresponding to descriptors 0, 1, and 2. The first of these, unless something else is
redirected into (piped into) it, will be console input. The second and third are console output,
though they're separate streams and can be directed to different places.
#include <inttypes.h>
#include <limits.h>
#include <stdio.h>
int main() {
    int64_t n;
    fprintf(stdout, "Enter an integer: ");
    fflush(stdout);
    // fscanf returns the number of items it converted; n is unset if that is not 1
    if (fscanf(stdin, "%" SCNd64, &n) != 1) {
        fprintf(stderr, "ERROR: could not read an integer\n");
        return 1;
    }
    if (n < INT_MIN || n > INT_MAX) {
        fprintf(stderr, "WARNING: value too large -- overflow likely\n");
    }
    fprintf(stdout, "%" PRId64 " squared is %" PRId64 "\n", n, n * n);
    return 0;
}
Instead of printf, we use fprintf, which takes a file as its first argument; in this case, we
use the built-in files stdout and stderr. Because we work with an int64_t, we use the SCNd64 and
PRId64 format macros from <inttypes.h> rather than a bare %lld, which is correct only on platforms
where int64_t is long long. We also give a warning if we suspect its square may overflow.
$ ./console_square
Enter an integer: 137
137 squared is 18769
$ ./console_square
Enter an integer: 4000000000
WARNING: value too large -- overflow likely
4000000000 squared is -2446744073709551616
Below is an example where we redirect all of our std streams to regular files. As one sees,
we've separated the streams.
There are lots of ways to read and write files: fgetc/fputc do it one character at a time,
fgets/fputs handle strings in cases of reasonable line length (as you will need a char[] buffer
large enough to hold what is to be read or written), fscanf and fprintf handle formatted text
I/O, and fread and fwrite are suited to general binary data. You'll always want to read the
documentation when using these functions, because error reporting is probably not what you're
used to: C will not throw an exception. Errors are reported sometimes through function return
values and sometimes through errno; it's up to you to check.
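As a small illustration of checking both channels, the sketch below wraps fopen in a hypothetical helper, open_or_report. It assumes a POSIX-like system, where a failed fopen sets errno; ISO C alone does not promise that.

```c
#include <errno.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: opens a file, reporting any failure on stderr.
   Returns the FILE* on success and NULL on failure. */
FILE* open_or_report(const char* filename, const char* mode) {
    errno = 0;                        /* clear any stale error code first */
    FILE* f = fopen(filename, mode);
    if (f == NULL) {
        /* strerror turns the errno code into a readable message */
        fprintf(stderr, "could not open %s: %s\n", filename, strerror(errno));
    }
    return f;
}
```

The return value says that something went wrong; errno says what. The caller still has to test for NULL before using the result.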
Below is a program that takes a "dictionary" file—a list of words, one per line—in lower case and
converts them all to upper case, while also counting them.
#include <stdio.h>
#include <string.h>
Note that it assumes both the destination filename (<source filename> + .upcase) and each
word will fit in 160 characters, including a null terminator. These are reasonable assumptions for
this particular purpose—possibly not in general.
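The core of such a program might be sketched as follows. The function name upcase_file, the BUF_SIZE constant, and the choice to return the line count (or -1 on failure) are assumptions made for this illustration; the 160-character limit is the one described above.

```c
#include <ctype.h>
#include <stdio.h>
#include <string.h>

#define BUF_SIZE 160  /* room for 159 characters plus the null terminator */

/* Writes an upper-cased copy of src_name to "<src_name>.upcase".
   Returns the number of lines copied, or -1 on any failure. */
long upcase_file(const char* src_name) {
    char dst_name[BUF_SIZE];
    char word[BUF_SIZE];

    /* Refuse names that would not fit rather than overflow the buffer. */
    if (strlen(src_name) + strlen(".upcase") + 1 > sizeof dst_name) {
        return -1;
    }
    strcpy(dst_name, src_name);
    strcat(dst_name, ".upcase");

    FILE* in = fopen(src_name, "r");
    if (in == NULL) {
        return -1;
    }
    FILE* out = fopen(dst_name, "w");
    if (out == NULL) {
        fclose(in);
        return -1;
    }

    long count = 0;
    while (fgets(word, sizeof word, in) != NULL) {
        for (size_t i = 0; word[i] != '\0'; i++) {
            /* toupper expects an unsigned char value, hence the cast */
            word[i] = (char)toupper((unsigned char)word[i]);
        }
        fputs(word, out);
        count++;
    }

    fclose(in);
    fclose(out);
    return count;
}
```

A main function would pass argv[1] to upcase_file and print the count. Note that fgets splits any line longer than the buffer into several reads, so the count is only a word count while the length assumption holds.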
The random number generator in C's standard library is... not great. You won't need one for this
course, but you can get burned by rand's short period. A (system-dependent) way to get quality
random numbers, on a Linux-like system, is to use /dev/urandom, which is a source of random
bytes with a file-like interface. Below is a program that does exactly this:
#include <inttypes.h>
#include <stdio.h>
#include <stdlib.h>
This program still has a slight bias when n_sides does not exactly divide 2^32, but it's no more
biased than physical dice.
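A minimal sketch of the die-rolling core is shown below, assuming a Linux-like system where /dev/urandom exists. The name roll_die and the convention of returning 0 on failure are illustrative choices; n_sides is the name used in the text.

```c
#include <inttypes.h>
#include <stdio.h>
#include <stdlib.h>

/* Returns a roll in the range 1..n_sides, or 0 on failure. */
uint32_t roll_die(uint32_t n_sides) {
    if (n_sides == 0) {
        return 0;
    }
    FILE* f = fopen("/dev/urandom", "rb");  /* binary mode: we want raw bytes */
    if (f == NULL) {
        return 0;
    }
    uint32_t raw;
    size_t got = fread(&raw, sizeof raw, 1, f);  /* read one 32-bit value */
    fclose(f);
    if (got != 1) {
        return 0;
    }
    /* The modulo is the source of the slight bias when n_sides
       does not evenly divide 2^32. */
    return raw % n_sides + 1;
}
```

Opening and closing the file on every roll keeps the sketch short; a program that rolls many dice would more likely open /dev/urandom once and reuse the FILE*.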
About void*
C is a statically but weakly typed language—the compiler expects you to declare types, will track
them for you, and will warn you or throw an error if it thinks you're making a mistake, but it also
allows you to throw away type information in some circumstances. Pointers, after all, are just
addresses—code using them will behave identically regardless of what type of data they point
to. As C is a "I know what I'm doing" language, you're allowed to change pointer types into and
out of void*, which is a generic pointer.
This program allows you to investigate the bit representation of floating-point pi.
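#include <inttypes.h>
#include <math.h>
#include <stdio.h>
int main() {
    double d = M_PI; // M_PI is a POSIX addition to <math.h>, not part of ISO C
    double* p = &d;
    int64_t* q = (int64_t *)(void *) p;
    int64_t z = *q;
    printf("%" PRId64 "\n", z); // PRId64 is the portable specifier for int64_t
    return 0;
}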
Why is it illegal to cast directly from a double* to an int*, or vice versa? The reason, yet again,
is nasal demons (undefined behavior). Compilers like to know when pointers alias each
other—that is, when they point to the same thing. For example, this piece of code:
*b = 17;
*a = 34;
might be reordered by the compiler into:
*a = 34;
*b = 17;
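Can it be? Is this safe? Almost always. If a and b hold the same address, though, the answer is
no: the two versions become semantically different. So, how does the compiler know when two
pointers don't alias? This is nontrivial. Under C's "strict aliasing" rule, the compiler is allowed
to assume that a T* and a U* never alias (character types are the exception), and typecasting
throws this safety out the window. Therefore, if you're converting a pointer from one type to
another, make the conversion explicit by going through void*, and be aware that the conversion
alone does not lift the rule: the standard still treats reading a double through an int64_t* as
undefined behavior, so the compiler may assume the two pointers cannot alias and make
optimizations that turn out to be unsafe. When you need the bytes of an object under another
type, copying them with memcpy is the well-defined route.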
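A conservative way to inspect an object's bytes under another type is to copy them with memcpy, which is well defined for any object type and involves no mismatched pointer dereference. The sketch below assumes a double and an int64_t are both 64 bits wide (checked at compile time with C11's _Static_assert); double_bits is a hypothetical helper name.

```c
#include <inttypes.h>
#include <string.h>

_Static_assert(sizeof(double) == sizeof(int64_t),
               "this sketch assumes a 64-bit double");

/* Returns the 64 bits of d, reinterpreted as a signed integer. */
int64_t double_bits(double d) {
    int64_t z;
    memcpy(&z, &d, sizeof z);  /* copy the bytes; no pointer of the wrong type is dereferenced */
    return z;
}
```

Compilers generally turn a fixed-size memcpy like this into a plain register move, so the safer form usually costs nothing at runtime.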
How often do you convert from one type into a different one? Rarely. However, generic pointers
are all over the place in C, including in the standard libraries, so you'll need to get used to them.
For example, malloc returns a generic pointer, and free takes one.
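The standard library's qsort is another place where generic pointers appear: it sorts arrays of any element type, so the comparison function receives two const void* arguments and must convert them back itself. In the sketch below, compare_doubles and sort_doubles are hypothetical names chosen for illustration.

```c
#include <stdlib.h>

/* qsort hands the comparator two generic pointers to array elements. */
int compare_doubles(const void* a, const void* b) {
    double x = *(const double*)a;  /* we know the elements really are doubles */
    double y = *(const double*)b;
    return (x > y) - (x < y);      /* negative, zero, or positive */
}

/* Hypothetical helper: sorts an array of doubles in ascending order. */
void sort_doubles(double* values, size_t count) {
    qsort(values, count, sizeof(double), compare_doubles);
}
```

Here the pointers are converted back to the type the data actually has, which is the normal and safe use of void*. The compiler cannot check that the comparator matches the array, so keeping them consistent is the programmer's job.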
Dynamic Typing
Let's get more specific and assume we're implementing a dynamically typed language with three
types: number (we'll use double), bytestring, and boolean.
The dirty secret of dynamically typed languages is that they unify all their in-language types into
one. Or, to be pedantic, they tend to use one static (compile-time) type and handle
type-checking at runtime. When you ask a Python interpreter to evaluate 1 + 2, it first evaluates
the type of 1—int, which is a signed integer of unspecified size—to determine the specific
desired behavior for +—that is, it looks up int.__add__ rather than float.__add__ or
str.__add__. After that, it checks the type of 2 and throws an error at runtime if the type is
inappropriate; in this case, there is no error, so it returns an int of value 3. This is much more
flexible and safe than what C does, but it costs you at runtime.
To implement our own dynamic language, we'll use the following pattern:
dyn_value* dyn_of_string(bytestring* s) {
    dyn_value *out = malloc(sizeof(dyn_value));
    out->type = DString;
    out->data = s;
    return out;
}
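The type behind this pattern, and the other two constructors, might be sketched as follows. Only DString, dyn_of_bool, dyn_of_double, and the type and data fields appear in this module; the enumerator names DBool and DNumber, the enum name dyn_type, and the NULL checks are assumptions made for illustration.

```c
#include <stdbool.h>
#include <stdlib.h>

/* The runtime tag: one enumerator per in-language type. */
typedef enum { DBool, DNumber, DString } dyn_type;

/* The single static type that every dynamic value shares. */
typedef struct dyn_value {
    dyn_type type;  /* says what data really points to */
    void* data;     /* bool*, double*, or bytestring*, according to type */
} dyn_value;

/* Boxes a boolean: the value itself lives in its own heap allocation. */
dyn_value* dyn_of_bool(bool b) {
    dyn_value* out = malloc(sizeof(dyn_value));
    bool* box = malloc(sizeof(bool));
    if (out == NULL || box == NULL) {
        free(out);   /* free(NULL) is a harmless no-op */
        free(box);
        return NULL;
    }
    *box = b;
    out->type = DBool;
    out->data = box;
    return out;
}

/* Boxes a number in the same way. */
dyn_value* dyn_of_double(double d) {
    dyn_value* out = malloc(sizeof(dyn_value));
    double* box = malloc(sizeof(double));
    if (out == NULL || box == NULL) {
        free(out);
        free(box);
        return NULL;
    }
    *box = d;
    out->type = DNumber;
    out->data = box;
    return out;
}
```

The tag is the only record of what data points to, so every function that reads a dyn_value has to consult type before converting the void* back.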
We're not done yet! Since these dyn_values are heap allocated, we need to clean up after
ourselves by manually deleting them. We create this function for the purpose.
Notice that we free the void* in the first two cases; in the third, we're building the dyn_value
around a bytestring* that was already created for it, of which it takes ownership, so we must
call the proper destructor; freeing the pointer is insufficient, and will create unreachable (thus,
leaked) memory.
As with print_dyn, this function would behave in undesirable ways if the relationship between
dv->type's value and dv->data's real type were violated. Also notice that the dyn_value takes
ownership of the string it points to and therefore must delete it, allowing it in turn to free its
owned objects, before it can be safely deleted.
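The two functions discussed here could be sketched as below. The block repeats minimal type definitions so that it stands alone; DBool, DNumber, dyn_type, and the "#f" spelling for false are assumptions (only DString and the "#t" output appear in this module).

```c
#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>

typedef struct bytestring {
    size_t length;
    unsigned char* data;
} bytestring;

void bytestring_delete(bytestring* b) {
    free(b->data);
    free(b);
}

typedef enum { DBool, DNumber, DString } dyn_type;

typedef struct dyn_value {
    dyn_type type;
    void* data;
} dyn_value;

/* Prints a value according to its runtime tag. */
void print_dyn(const dyn_value* dv) {
    switch (dv->type) {
    case DBool:
        printf("%s\n", *(bool*)dv->data ? "#t" : "#f");
        break;
    case DNumber:
        printf("%f\n", *(double*)dv->data);
        break;
    case DString: {
        const bytestring* s = dv->data;
        putchar('"');
        fwrite(s->data, 1, s->length, stdout);  /* length-based: not null-terminated */
        printf("\"\n");
        break;
    }
    }
}

/* Frees the payload in the way its type requires, then the wrapper. */
void delete_dyn(dyn_value* dv) {
    switch (dv->type) {
    case DBool:
    case DNumber:
        free(dv->data);               /* a plain malloc'd box */
        break;
    case DString:
        bytestring_delete(dv->data);  /* frees the buffer, then the struct */
        break;
    }
    free(dv);
}
```

Both functions trust the tag completely: if type said DNumber while data held a bytestring*, print_dyn would print garbage and delete_dyn would leak the string's buffer.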
int main() {
    dyn_value* dv1 = dyn_of_bool(true);
    dyn_value* dv2 = dyn_of_double(7.25);
    bytestring* s3 = bytestring_new(3, "CAT");
    dyn_value* dv3 = dyn_of_string(s3);
    print_dyn(dv1);
    print_dyn(dv2);
    print_dyn(dv3);
    delete_dyn(dv1);
    delete_dyn(dv2);
    delete_dyn(dv3);
    return 0;
}
prints:
#t
7.250000
"CAT"
to the console, showing that our code accommodates all three included types.
Module 5 Assignment
with
printf("%s", b->data)? More specifically, what input might cause the latter to fail?
How would one use this data structure, instead of the void* based one? Is it legal to do this?
What advantages does it have? What disadvantages does it have?
5.3. Run your StackMan implementation on a file with the following contents:
0++0++00++<(?_0+++>0++<~0+++<(?_0++>+0++>)_0+++<)_0++<_0+++0++(?_0++<?_0++<)_
(?_0++>+0++>)_0+++++0++++++(?_0++>+0++>)_00++<(?_0+++>0++<~0+++<(?_0++>+0++>)
_0+++<)_0++<_0+++0++0+0++<(?_0+++>0++<~0+++<00++<(?_0+++>0++<~0+++<(?_0++>+0+
+>)_0+++<)_0++<_0+++<)_0++<_(?_0++<?_0++<)_0+++00++<(?_0+++>0++<~0+++<(?_0++>
+0++>)_0+++<)_0++<_00+++++++(?_0++<?_0++<)_(?_0++>+0++>)_0+++++0++++(?_0++<?_
0++<)_0+0+0++<(?_0+++>0++<~0+++<00++<(?_0+++>0++<~0+++<(?_0++>+0++>)_0+++<)_0
++<_0+++<)_0++<_(?_0++>+0++>)_
What is your result? Hint: there should only be one stack entry.
5.4 Include your StackMan implementation on separate pages. Your answers to 5.1-5.3 should
fit on one page; take as many pages as you want for 5.4.