2021-12-18

Fixing C

The C Programming Language remains one of the most popular ones in the history of computer science. All major operating systems use it, and if you design your own systems language you will have to make it talk to C if you want to accomplish anything. However, during my own journey through the weeds of systems programming over the last several years, i have experienced several annoyances with its “features” and lack thereof. Let’s fix that!

As a little foreword, if you are not into the whole reading stuff, i’ve made a fancy sample code file summarizing everything in this article and some more stuff i thought is neat here.

Integer Promotion Is Dumb

One of the (in my opinion) most stupid and pointless bug-inducing flaws in C’s design is integer promotion and how it is implemented. Let me demonstrate this with an example, which assumes the target architecture uses two’s complement for negative numbers (which is true for pretty much any platform):

int main(void)
{
	char c = -1;
	while (c >>= 1);
}

You would assume that the while loop will execute exactly CHAR_BIT times and then exit, but this is not the case due to the rules of integer promotion. One of those rules states that any operation on integer types narrower than int requires converting the operands to int first (including sign extension; you probably see where this is going) and only then executing the operation. Therefore, this is what’s actually happening:

int main(void)
{
	char c = -1;
	while (c)
		c = (char)((int)c >> 1);
}

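Assuming we are on x86, where char is signed, this converts c, whose value is 0xff, to an int representing the same number, which is 0xffffffff after sign extension. Right-shifting a negative signed integer is implementation-defined, and gcc uses an arithmetic shift, so the result is still 0xffffffff. Even a logical shift would not help, because 0x7fffffff still has all of its low bits set. Either way, the cast back to char keeps only the lowest byte (at least on gcc), and c ends up with the same value as before! And sure enough, if you feed gcc with the above code, it spits out an endless loop. That is, unless you compile with -O0, which makes the whole thing even nastier to debug.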
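To make the effect easier to poke at, here is a minimal sketch that writes out a single loop iteration with its implicit conversions and compares it to the unsigned variant, which is presumably what the loop was meant to do. The function names are made up for illustration, and the signed result relies on implementation-defined behavior (arithmetic right shift and truncating conversion, which is what gcc and clang do on x86):

```c
#include <limits.h>

/* One iteration of `c >>= 1` on a signed char, with the implicit
 * conversions written out. Shifting a negative int right is
 * implementation-defined; gcc and clang shift arithmetically. */
signed char shift_signed_once(signed char c)
{
	return (signed char)((int)c >> 1);
}

/* Number of right shifts needed to clear an unsigned char.
 * Promotion still happens here, but zero extension keeps the
 * upper bits of the int empty, so the value actually shrinks. */
int shifts_until_zero(unsigned char c)
{
	int n = 0;

	while (c) {
		c >>= 1;
		n++;
	}
	return n;
}
```

With these assumptions, shift_signed_once(-1) hands back -1 again, while shifts_until_zero(UCHAR_MAX) stops after CHAR_BIT shifts.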

An immediate fix that would break basically nothing that wasn’t already broken before is altering the rules for integer promotion: Bitwise operations always promote to unsigned types. It’s really as simple as that, at least as far as i can tell. And while we’re at it, we could also make 2’s complement the mandatory representation for all signed integers, because why wouldn’t we. Even the C23 people seem to agree.

More Metaprogramming

Since C, except for the, uh, interesting _Generic operator, completely lacks generics and puts all of that work into the preprocessor instead, it should at least have some useful tools for writing macros. However, ISO C is still very much lacking in that regard. The typeof operator is currently being discussed as a candidate for C23, which would be a very welcome addition to the language in my opinion.

When combined with a certain GNU extension, typeof would allow for very powerful metaprogramming. I’m talking of course about Statement Expressions! This duo lets you write macros that would otherwise introduce bugs:

#define max(a, b) ({		\
	typeof(a) _a = (a);	\
	typeof(b) _b = (b);	\
	_a > _b ? _a : _b;	\
})

There isn’t much else to say about this. I already rely on these extensions in pretty much all of my projects with over 1000 source lines of code, and a lot of others do too (including gigantic projects like Linux or FreeBSD).
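For anyone who hasn’t been bitten by it yet, this sketch shows the kind of bug the macro above avoids. NAIVE_MAX, SAFE_MAX, and the counter helpers are hypothetical names, and the code uses the __typeof__ spelling that gcc and clang accept regardless of the selected language standard. It needs the GNU statement expression extension, so it is not ISO C:

```c
/* Textbook macro: each argument may be evaluated twice. */
#define NAIVE_MAX(a, b) ((a) > (b) ? (a) : (b))

/* Statement expression version: each argument is evaluated once. */
#define SAFE_MAX(a, b) ({		\
	__typeof__(a) _a = (a);		\
	__typeof__(b) _b = (b);		\
	_a > _b ? _a : _b;		\
})

int calls;

/* Argument with a side effect: counts how often it gets evaluated. */
int next_value(void)
{
	return ++calls;
}

/* How many times does NAIVE_MAX evaluate its first argument? */
int naive_call_count(void)
{
	calls = 0;
	(void)NAIVE_MAX(next_value(), 0);
	return calls;
}

/* Same question for SAFE_MAX. */
int safe_call_count(void)
{
	calls = 0;
	(void)SAFE_MAX(next_value(), 0);
	return calls;
}
```

The naive macro ends up calling next_value() twice (once for the comparison, once for the result), while the statement expression version calls it exactly once.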

Integer Ranges

Now that’s where my take is starting to get interesting and significantly higher in temperature, but hear me out.

Suppose you have a function that takes a parameter which will be used as the count for a bitshift, or generally anything that has to be constrained to a certain range of values. Then, suppose you don’t trust yourself because you regularly write code at 3 AM and might pass some value that is greater than the width of the integer you’re shifting. So, because you’re a good boy/girl/enby who avoids undefined behavior, you write a debug assertion that checks whether the parameter is less than LONG_BIT or whatever:

vm_page_t alloc_pages(unsigned int order)
{
	assert(order < NR_ORDERS);
	struct pool *pool = &pools[order];
	/* blah */
	size_t size = ((size_t)1 << order) * PAGE_SIZE;
	/* blah */
}

Now, that can obviously become tedious if you have a lot of functions that require a constraint like this. But wait a minute, doesn’t the compiler already do these sorts of sanity checks for pointer types and such, so you don’t accidentally assign an int * to a char *? It sure does, so why wouldn’t we extend this capability to the actual values themselves, rather than just the types?

Optimizing compilers already do static analysis to track what values a variable may have at any given point in the program so they can do their fun little tricks that have absolutely never resulted in any bugs in the output binary whatsoever. To be perfectly clear about what i mean, here is another example:

void do_stuff(int x)
{
	/* x could have any value here */

	if (x >= 8 && x < 16) {
		/* x must be >= 8 and < 16 if
		 * we reach this scope (duh) */
	}

	/* x might be all sorts of things */
}

Now, my proposal is to write the range in angle brackets directly after the type, as in int<0,10> for any integer from 0 (inclusive) to 10 (exclusive). I don’t care what the exact syntax would look like, though, and considering that C is pretty much the queen of cursed syntax anyway it doesn’t really matter in my opinion. By the way, i’m consciously not writing that in a code block because it makes my syntax parser freak out a little, but i’m sure you can use your imagination. If not, just see the fancy sample code.
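Until a compiler understands something like this, the closest approximation in today’s C is probably a wrapper struct that can only be created through a checking function, so the range is validated once instead of in every callee. This is only a rough runtime stand-in for the proposed compile-time check. struct order, order_from_uint, order_size, and the values of NR_ORDERS and PAGE_SIZE are all assumptions made up for this sketch:

```c
#include <stdbool.h>
#include <stddef.h>

#define NR_ORDERS 10u	/* hypothetical exclusive upper bound */
#define PAGE_SIZE 4096u	/* hypothetical page size */

/* Stand-in for the proposed `unsigned int<0,NR_ORDERS>`. */
struct order {
	unsigned int value;
};

/* The only intended way to obtain a struct order: returns false
 * and leaves *out untouched if raw is outside [0, NR_ORDERS). */
bool order_from_uint(unsigned int raw, struct order *out)
{
	if (raw >= NR_ORDERS)
		return false;
	out->value = raw;
	return true;
}

/* Callees can rely on the range without asserting it again. */
size_t order_size(struct order order)
{
	return ((size_t)1 << order.value) * PAGE_SIZE;
}
```

Nothing stops a caller from filling in the struct by hand, of course, which is exactly the gap a real range type would close.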

Dynamic Size Annotated Arrays

This is already a proposal for C23 if i remember correctly, but it’s so useful that i wanted to include it anyway. It’s kind of related to the type range feature. Have a look at the signature of read(2):

ssize_t read(int fd, void *buf, size_t nbytes);

This has worked well for several decades. But it still has a fatal flaw: Literally nothing is stopping you from passing a buf that is smaller than nbytes. What if instead the signature looked like this:

ssize_t read(int fd, char buf[nbytes], size_t nbytes);

The compiler could easily figure out whether the buffer is sufficiently sized, and emit a warning if not (for example, because it knows how big a memory area returned from malloc() is). A drawback of this is of course that it would require casting any buffer to a char * before passing it to the function, which could be compensated for by making void arrays behave like char ones in this specific situation:

ssize_t read(int fd, void buf[nbytes], size_t nbytes);

The implementation of read would still need to perform some form of explicit or implicit type cast in order to write to the destination buffer, of course.
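For comparison, C99 already accepts a similar spelling as long as the size parameter is declared before the array that refers to it. The sketch below uses a hypothetical fill_buffer function to show that form. The array parameter is still adjusted to a plain char *, so the annotation is a hint that a compiler may or may not act on, not a guarantee, and it assumes a compiler that supports variably modified parameter types:

```c
#include <stddef.h>
#include <string.h>

/* nbytes has to come first so it is in scope for buf's declarator.
 * buf is still a char * inside the function; the [nbytes] part only
 * documents (and possibly lets the compiler check) the expected size. */
size_t fill_buffer(size_t nbytes, char buf[nbytes], char value)
{
	memset(buf, value, nbytes);
	return nbytes;
}
```

A call then looks like fill_buffer(sizeof(buf), buf, 'x'). The parameter order is the opposite of what read(2) uses, which is part of why the proposed syntax would be nicer.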

Better Type Obfuscation

This one is inspired by physicists insisting on always using the correct unit along with numbers. Let’s say you are writing a security critical function that checks whether a process is a member of a certain group. It’s probably not a good idea to accidentally mix up uid and gid, but since both are usually typedefed to int or something similar, it is pretty easy to do so:

typedef int uid_t;
typedef int gid_t;

bool has_gid(const struct task *task, gid_t gid);

Nothing is stopping you from passing a uid to this function. This is a bad thing. So, why not make uid_t and gid_t completely obfuscated types that are mutually incompatible? I propose the following syntax that makes values of type uid_t and gid_t incompatible with values of any other integral type, unless an explicit cast is used.

typedef int ~uid_t;
typedef int ~gid_t;

Of course, you also can’t assign a uid_t to a gid_t and vice versa. In effect, this would be the same as encapsulating the actual value into a struct, but without having to access the member every time you need the raw integral value.
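For reference, this is roughly what the struct workaround looks like in today’s C. The names user_id and group_id are made up to avoid clashing with the real uid_t and gid_t from the system headers, and struct task is reduced to the two fields needed here:

```c
#include <stdbool.h>

/* Two distinct struct types wrapping the same underlying int. */
typedef struct { int raw; } user_id;
typedef struct { int raw; } group_id;

/* Hypothetical, stripped-down task structure. */
struct task {
	user_id uid;
	group_id gid;
};

/* Passing a user_id here is a compile error, because the two
 * struct types are incompatible even though both hold an int. */
bool has_gid(const struct task *task, group_id gid)
{
	return task->gid.raw == gid.raw;
}
```

The price is the .raw member access everywhere the integer is needed, which is the noise the proposed syntax would get rid of.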

Putting It All Together

There is some minor stuff that i left out in this article but included in the fancy sample code file, but it’s mostly just a logical conclusion of the concepts declared herein. I hope you find these ideas as interesting as i do, and maybe someone (hopefully not me, since i am drowning in side projects already) will write a transpiler for this dialect of C.

The only thing that’s left now is to give it a name. How about … Type-C? Because it’s primarily an extension of the type system, and i like the idea of making the whole ambiguity around a certain serial bus standard even more confusing than it already is. I’m envisioning this to be kind of what TypeScript is to JavaScript, which is also just a transpiled language and a superset of the latter. Let me know what you think in the comments, and be sure to like and subscribe as well as hit the bell icon so you won’t miss any future videos.

tags: tech – c – programming