Wednesday, October 6, 2010

Did you know? Funny facts about C99

I am currently writing an implementation of the C99 programming language as a basic language for the MPS language workbench. I call this implementation c-base, because it is intended to be used as a target language for code generation, much like MPS' Java-like base-language.

I created the model of this MPS language according to the C99 standard grammar. Thus, I had to dive into its documentation, and I have found a number of surprising features and properties of the C99 programming language. Not all of them are important, but if you want to be the nitpicker in the next C discussion, here is your ammo ;-)

So, did you know, that

the size of a byte is implementation-defined? (3.6)

In contrast to an octet, a byte does not always consist of 8 bits. Instead, it is an "addressable unit of data storage large enough to hold any member of the basic character set of the execution environment".
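A minimal sketch of how a program can ask its implementation for the byte size instead of assuming 8. CHAR_BIT comes from limits.h; the function names bits_per_byte and int_storage_bits are made up for this illustration.

```c
#include <limits.h>
#include <stddef.h>

/* Number of bits in one byte on this implementation (the standard requires at least 8). */
int bits_per_byte(void) {
    return CHAR_BIT;
}

/* Bits of storage occupied by an int, including any padding bits. */
size_t int_storage_bits(void) {
    return sizeof(int) * CHAR_BIT;
}
```

On common desktop platforms bits_per_byte() is 8, but code that hard-codes 8 is relying on the implementation, not on the language.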

argv[argc] shall be a null pointer? (5.1.2.2.1.2)

Very convenient indeed. But I haven't seen code yet which relies on this.
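Here is a small sketch of what relying on it could look like: counting the arguments without ever looking at argc. The helper name count_args is hypothetical.

```c
#include <stddef.h>

/* Walks argv until the terminating null pointer; argc is not needed. */
int count_args(char **argv) {
    int n = 0;
    while (argv[n] != NULL) {
        n++;
    }
    return n;
}
```

Called with the argv passed to main, this should return the same value as argc.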

0 is not a decimal constant? (6.4.4.1)

There are three different types of integer constants, i.e. octal, decimal and hexadecimal integer constants. Octal integer constants are prefixed by 0, hexadecimal constants are prefixed by 0x, and for decimal constants there is no prefix. Thus a single 0 is an octal integer constant.
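For 0 itself the classification makes no practical difference, since zero has the same value in every base. The leading-zero rule does bite for other numbers, as this little sketch shows (the function names are only for illustration):

```c
/* A leading 0 makes the constant octal: 010 is eight, not ten. */
int octal_ten(void) {
    return 010;
}

/* Grammatically an octal constant, but its value is zero in any base. */
int plain_zero(void) {
    return 0;
}
```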

there are hexadecimal floating point constants? (6.4.4.2)

There are two different types of floating constants, i.e. decimal and hexadecimal floating constants. Octal floating constants are not supported.

As e is a valid hexadecimal digit, the delimiter for the exponent part of a hexadecimal floating constant is p. The base for the exponent is two and the exponent itself is a decimal number. So 0xap1 equals 20, and 0xap10 equals 10240.
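The two values above can be checked with a sketch like this one. The third constant, 0x1.8p0, is my own addition to show a hexadecimal fraction; the function names are hypothetical.

```c
/* 0xa is 10, scaled by 2^1. */
double hex_twenty(void) {
    return 0xap1;
}

/* 0xa is 10, scaled by 2^10 = 1024. */
double hex_ten_k(void) {
    return 0xap10;
}

/* Hex fraction: 1 + 8/16 = 1.5, scaled by 2^0. */
double hex_one_and_half(void) {
    return 0x1.8p0;
}
```

All three values are exactly representable in binary floating point, so comparing them with == is safe here.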

there are two different types of for-loops? (6.8.5)

We know that the header of a for-loop consists of three optional parts separated by semicolons. But in fact that's not true. Instead, only the second and the third header part are always optional expressions. For the first header part, though, things depend on the type of the for-loop.

There are two such types. In case of the first type, the first header part is also an optional expression. As this expression is evaluated once before starting the first iteration of the loop, it could likewise be moved in front of the loop. What the first part of a for-loop is most often used for, though, is the declaration of a variable which is scoped by the loop body. As declarations are not expressions, this is not possible with this type of for-loop.

That's why there is a second type of for-loop, where the first part is a declaration. The funny thing is, that in this case the first semicolon is not part of the for loop. Instead, it is the delimiter of the declaration. In order to disallow for-loops with only one semicolon, the first header part is not optional in this case. So for(; i<2; i++){} is a for-loop of the first type with an omitted expression as its first header part.
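To make the two types concrete, here is a sketch with one loop of each kind. The function names are made up; both loops run twice.

```c
/* First type: the first header part is an (omitted) expression,
   so i has to be declared before the loop. */
int count_expression_form(void) {
    int n = 0;
    int i = 0;
    for (; i < 2; i++) {
        n++;
    }
    return n;
}

/* Second type: the first header part is a declaration,
   and the semicolon after "int i = 0" belongs to that declaration. */
int count_declaration_form(void) {
    int n = 0;
    for (int i = 0; i < 2; i++) {
        n++;
    }
    return n;
}
```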

a typedef declaration does not introduce a new type? (6.7.7)

Instead, only a synonym for the type is specified. I have seen people ranting about C not being type safe, because the compiler allows a foo to be used where a bar is expected, if both identifiers are typedefs of the same type. d'oh
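A minimal sketch of the situation, using foo and bar as typedefs of int. The functions twice and demo_synonyms are hypothetical names.

```c
typedef int foo;
typedef int bar;

/* Expects a bar... */
static int twice(bar x) {
    return 2 * x;
}

/* ...and happily takes a foo, because both names denote the very same type int. */
int demo_synonyms(void) {
    foo f = 21;
    return twice(f);
}
```

If distinct types are wanted, wrapping the value in a struct is the usual way to get them.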

the typedef specifier is a storage-class specifier? (6.7.1.3)

The other storage-class specifiers are extern, static, auto, and register. They all deal with how the declared object is stored. typedef is a storage-class specifier, "for syntactic convenience only".

Storage-class specifiers precede declarations which specify "the interpretation and attributes for a set of identifiers". If the storage-class specifier is typedef, these identifiers become synonyms for the respective type. So, typedef int a, (*b)(); introduces in one swoop a as a synonym for int and b as a synonym for pointer to function returning int. Note that in C99 the empty parentheses say nothing about the parameters; only (void) declares a function with no parameters.
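Here is the declaration from above in use, as a small sketch. The functions forty_two and call_through are names I made up for the illustration.

```c
/* One typedef declaration, two synonyms:
   a is int, b is pointer to function returning int. */
typedef int a, (*b)();

static a forty_two(void) {
    return 42;
}

a call_through(void) {
    b fn = forty_two;   /* a function with no parameters is compatible with b */
    return fn();
}
```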

From my experience, I have never seen anybody using this feature. So I wonder, how convenient this grammatical hack really is.

the coupling of switch statements and case labels is rather loose? (6.8.4.2)

The body of a switch statement is a statement. Most often this is a block statement with case labels and/or a default label. But this is not required. Instead, switch(42) i++; is valid C99 code, where i++ is not executed, because there is no matching case constant expression and no default label in the body of the switch statement.

In general, possible jump targets of a switch statement are all case labels which are "in or on the switch body", excluding those which are part of an enclosed switch statement. Apparently, being "in the switch body" recursively includes any block statements, enabling things like Duff's device. As far as I can tell, though, this is not explicitly stated in the standard.
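Both points can be tried out with a sketch like the following. The function names are hypothetical, and GCC may warn about statements that will never be executed, which is rather the point.

```c
/* The switch body is a plain expression statement without any label,
   so it is simply skipped. */
int untouched(void) {
    int i = 0;
    switch (42) i++;
    return i;
}

/* Case labels nested inside the branches of an if statement. The switch
   jumps straight into the matching branch; the condition is never evaluated. */
int jump_into_branch(int n) {
    int r = 0;
    switch (n) {
        if (0) {
    case 1:
            r = 10;
        } else {
    default:
            r = 20;
        }
    }
    return r;
}
```

Here untouched() returns 0, jump_into_branch(1) returns 10, and any other argument yields 20.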

the grammar defines a rather large superset of the language?

A grammar describes the building blocks of a language and the rules of combining them. The C99 grammar is very permissive concerning the combination rules. Many things are later disallowed by textually written constraints. But some things are not, although they make no sense.

An example of additional constraints is that the grammar allows an arbitrary list of type qualifiers and type, function and storage-class specifiers preceding a declaration (6.7). Later, 6.7.1.2 states that "at most one storage-class specifier may be given" and 6.7.2 requires that "at least one type specifier shall be given". Furthermore, the set of allowed unordered sublists of all type specifiers is listed. For example, int static long volatile unsigned i; is a valid declaration, because static is a storage-class specifier, volatile is a type qualifier and unsigned long int is a valid sublist of type specifiers.
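A sketch to convince yourself that the shuffled declaration really means static volatile unsigned long. The variable name counter and the helper function are made up; some compilers warn about the storage-class specifier not coming first when extra warnings are enabled.

```c
/* Same as: static volatile unsigned long counter = 5; */
int static long volatile unsigned counter = 5;

/* Nonzero if counter has the size of unsigned long and wraps around
   like an unsigned type (5 - 6 becomes a huge positive value). */
int counter_is_unsigned_long(void) {
    return sizeof counter == sizeof(unsigned long) && counter - 6 > 0;
}
```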

An example of constraints that were, in my opinion, forgotten is that a struct definition is a struct specifier which is a type specifier which is a declaration specifier which is part of a parameter declaration (6.7.5). This means that void f(struct foo { int i; } j); is valid C99 code, although the syntax highlighting of my Vim goes crazy about this. GCC issues a warning that "‘struct foo’ declared inside parameter list" and "its scope is only this definition or declaration, which is probably not what you want". Indeed, 6.2.1.4 says that "if [...] the type specifier [...] appears within the list of parameter declarations [...] the identifier has function prototype scope, which terminates at the end of the function declarator." I can hardly imagine a case in which a type definition with function prototype scope is needed. So I wonder why this isn't forbidden.

there is a clear definition of side effect? (5.1.2.3.2)

"Accessing a volatile object, modifying an object, modifying a file, or calling a function that does any of those operations are all side effects". It can be so easy.

_Bool is a standard type? (6.2.5.2)

Before C99 there was no Boolean type. Instead, everybody had their own type synonyms and preprocessor macros. In order not to collide with existing code, C99 introduces the keyword Bool with an underscore prefix. But actually you are not supposed to use it directly, as there is a macro bool which expands to _Bool in stdbool.h (7.16). Furthermore, there are macros for true and false which expand to the integer constants 1 and 0, unless you redefine them. By the way, "an object declared as type _Bool is large enough to store the values 0 and 1".
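One thing that distinguishes _Bool from the old homemade int synonyms is the conversion rule: any nonzero value converts to 1 instead of being truncated. A small sketch, with hypothetical function names:

```c
#include <stdbool.h>

/* Conversion to bool yields 1 for every nonzero value. */
bool as_bool(double x) {
    return x;
}

/* Conversion to int truncates toward zero instead. */
int as_int(double x) {
    return (int)x;
}
```

So as_bool(0.5) is true, while as_int(0.5) is 0.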

_Complex is a keyword? (6.2.5.11)

The common floating types represent real numbers. If you prefix a floating type with _Complex, it becomes a complex type, whose values are stored as an array of two elements of the corresponding floating type (6.2.5.13).
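The array layout can be observed with a sketch like this, which copies a complex value into a plain array of two doubles. The macro I comes from complex.h; the function names are made up for this example.

```c
#include <complex.h>
#include <string.h>

/* Builds the complex number 3 + 4i. */
double _Complex make_z(void) {
    return 3.0 + 4.0 * I;
}

/* Copies the object representation into an array of two doubles and
   returns the requested element: 0 is the real part, 1 the imaginary part. */
double part_of(double _Complex z, int index) {
    double parts[2];
    memcpy(parts, &z, sizeof parts);
    return parts[index];
}
```

In real code, creal and cimag from complex.h are the normal way to get at the two parts.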

_Imaginary is a keyword? (6.4.1)

But it is only reserved for specifying imaginary types. It is not normative because "there is little existing practice to validate the design decisions" (G.1).

there are different character sets? (5.2.1)

The basic character set contains the common ASCII upper and lowercase letters, special characters and white spaces. The extended character set contains the basic character set and zero or more implementation-defined characters. This is meant to support characteristics of the various locales. Furthermore, C99 distinguishes between the source character set and the execution character set. The former is the set of characters which form source files. The latter is the set of characters available at execution time. Elements from the execution character set are represented by corresponding elements of the source character set or by escape sequences.

some special characters can be represented by trigraphs? (5.2.1.1)

The output of puts("wtf??!"); is "wtf|", because "??!" is a trigraph sequence which is replaced by "|" "before any other processing takes place". In total, there are nine different trigraph sequences, all starting with "??". So you should escape question marks in string literals. The output of puts("wtf\?\?!"); is "wtf??!".

Fortunately, GCC disables trigraphs by default and issues a warning if a source file contains trigraphs.
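Since the unescaped variant behaves differently depending on whether the compiler has trigraphs switched on, this sketch only shows the escaped form, which should give the same result either way. The function names are hypothetical.

```c
#include <stddef.h>
#include <string.h>

/* \? is an ordinary escape for '?', so the source never contains "??!". */
const char *escaped_text(void) {
    return "wtf\?\?!";
}

/* Six characters: w, t, f, ?, ?, ! */
size_t escaped_length(void) {
    return strlen(escaped_text());
}
```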

some punctuators can be represented by digraphs? (6.4.6.3)

In contrast to trigraphs, digraphs are replaced after tokenization. In particular, this means that digraphs within string literals are not replaced. In total, there are six different digraphs, each containing an angle bracket or a percent sign: <: and :> for the square brackets, <% and %> for the curly brackets, and %: and %:%: for the single and double hash.
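A sketch of a function written with digraphs instead of brackets and braces, next to a string literal that contains the same character pairs. The function names are made up, and this assumes a compiler with digraph support enabled, which is the default for current GCC and Clang as far as I know.

```c
/* <: :> stand for [ ], and <% %> stand for { }. */
int sum_three(void) <%
    int values<:3:> = <% 1, 2, 3 %>;
    return values<:0:> + values<:1:> + values<:2:>;
%>

/* Inside a string literal the very same characters are left untouched. */
const char *literal_text(void) {
    return "<:%>";
}
```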
