C*

From WiredWiki
Jump to navigationJump to search
C*
Cstar mono outline.png
Flavour image for the C* logo.
Paradigm imperative, procedural
First appeared December 2020
Typing discipline static, strong, manifest
Influenced by
Ada, C, Thinking Machines C*, D, Go

C* (pronounced C star) is an imperative, procedural system programming language created by Alexander Nicholi. It facilitates comprehensive compile time guarantees of fully arbitrary mutability of state.

Background

C*'s legacy begins in the earliest time I can remember, before I attended school. Our house got its first computer, and I fell into a deep love with it that I never shook away. Much fun there was with gaming, especially with old god games like SimCity and Caesar III. As I became an adolescent I began to want to command the computer to do things existing programs wouldn't do, and began learning how to program with Visual BASIC .NET. Over the next several years I learned many programming languages and paradigms, and learned baremetal programming through the Game Boy Advance when I was seventeen – but I began to notice that something was wrong with how things were going. Computers weren't always doing what they were supposed to, and I spent a few years figuring out why.

In my early twenties I figured out an explanation for the essence of this blog post by Peter Welch. The truth is, most computer scientists don't know what they're doing, and I'm not just referring to hustlers writing JavaScript and overselling themselves. Even our industry leaders are beginning to pale in comparison to their priors. The most crucial kind of programming, systems programming, is stagnating in a major way, and almost nobody is even noticing this, let alone understanding why, and let alone finding anything to do to stop it. It's partly because systems programming is hard. Also because it's relatively new. Also because universities are too busy making money to care. At the end of the day, the outcome is simple: systems are made generically, and the quality of those systems suffers. Peter Welch's bridge analogy has less hyperbole as the years wear on.

My own conviction led me to reject this, fuelled no doubt by the emergent work culture of cynicism and manager-blaming that has resulted from this pitiful state of affairs. Also motivating me to keep going is my conviction against the major false solutions to these problems of system misbehaviour, chief among them the Rust programming language and its false claim to "memory safety". The evangelism for it is the worst I have ever seen. I know there is an important problem trying to be solved by Rust, and it is in a sense that same problem that C* solves: system correctness made practical. Unfortunately for Rust folks, their error was made at the drawing board.

Addressing complexity in systems

Practicality of complexity before now has always been achieved through genericism. C* rejects this prescription, and instead capitalises on the explicit semantics of C. Genericism is anathema to systems programming, because it inherently obfuscates a program as a means to compartmentalise complexity. This does not address the complexity in a way that programmers can positively appreciate, rather trying to "do away" with it and let them pretend it is something abstract when it is not. The complexity itself is already more than enough for a human brain to handle – this abstract metaprogramming is surely a denial of the system in any real mind and makes for bad systems.

Instead, C* capitalises on C's communicativity to make obvious and clear the details of a system. It then provides a slew of new semantic mechanisms for constraining valid state, and a specification oriented around bits alone instead of abstract objects or bytes of any length.

My essay entitled Law & Order in Software Engineering with C* proselytises for C* through one of its key features: law & order. This is the part of C* that provides fully arbitrary mutability of state.

Laws

C* provides this through a few new keywords and a key concept. First among these is the law keyword, which defines and optionally names constraints to be used on data types. This is semantically accomplished through a series of boolean expressions, like so:

/* anonymous law applied to type */
law : s32
{
   _ >= 0;
   _ < 1000;
};

/* named law */
law leet
{
   _ == 1337;
};

/* applying previously declared law */
law leet : u16;

These laws are enforced upon the data types they apply to at compile time through an exhaustive program analysis. The compiler works backwards to create a control flow tree representing a transient variable lifetime, and exhaustively validates the initialisation and modifications of that transient variable against the laws enacted upon it. This is made practical by formalising the boundaries of the compilation unit as a border between "native" and "foreign" code, which in the essay is called the total system. Data which is confined to this total system gains the performance benefit of fully arbitrary validity checking at compile time.

Marshalling

To deal with foreign code, C* provides a mechanism called marshalling. This is a definition of marshalling expanded from its current meaning in computer science as a synonym for serialisation, to also include the act of validating data being serialised according to arbitrary schemas, or in the case of C*, arbitrary laws. All subroutines that are callable from outside the total system must provide marshalling blocks for validating their variables, like so:

typedef u8 mybyte;

law : mybyte
{
   _ < 255;
   _ != 0;
};

void foo( mybyte a, int c )
{
   /* marshalling happens one parameter at a time */
   marshal a
   {
      if( a == 0 )
      {
         /* a MUST be set to a valid value through marshalling
          * but, we can check around that, smartly */
         a = 1;
         break;
      }

      /* exit the function otherwise */
      return;
   }

   /* this is the minimum required
    * if any laws enacted upon s32, this will fail to compile */
   marshal c
   { }

   /* alternatively, this minimal marshalling will do law checks
    * and return upon any failures, since marshal blocks are only
    * entered when the runtime checks for the laws fail */
   marshal c
   {
      return;
   }
}

Marshal blocks can only reference the parameter they are marshalling. They may declare and modify local variables with automatic storage duration, and may only call pure functions with such parameters.

Fundamental model of data

C* models state in terms of bits. This is reflected in the type system: C* has only one fundamental primitive type, the bit. All other primitives are modelled with explicit and exact structs of bits coupled with various attributes that the compiler understands the meaning of.

Memory model

C* has a simpler and more natural memory model improved upon from the memory model of C. There are three types of memory and two jurisdictions of memory in the language.

Types of memory Jurisdictions of memory
Private memory
Same as in OpenCL parlance; in CUDA terms it may be called "registers" or "local memory"; in general-purpose CPU terms it is thread-local storage. Writable, and only accessible from a single execution context.
Shared memory
Same as in CUDA parlance; in OpenCL terms it is called "local memory"; in CPU terms it is typical, often heap-allocated memory. Writable and accessible from potentially multiple concurrent execution contexts; this is the only memory category that demands manual synchronisation.
Constant memory
Shared memory that is read-only for all execution contexts. This memory can be shared and used without need for synchronisation across multiple execution contexts, but is only modified at the moment it was declared.
Native memory
Memory for which its entire lifetime and all of its access is confined to the total system. Laws can be enacted upon types of objects residing in native memory. Variable declarations default to this jurisdiction.
Foreign memory
Memory that may be accessed, created and/or destroyed outside of the total system. All laws enacted on a type of object in foreign memory do not apply; instead, the user must apply pacts and marshalling code to validate such data. Use of either the extern or volatile storage modifiers declares a variable as residing in foreign memory.

New struct semantics

C* provides a new framing of structs to more precisely declare the anatomy of a data type. Whereas C's structs are a mechanism for declaring new complex data types, C*'s structs are a mechanism for declaring new complex and primitive data types. This is how it is possible for C* to have only one fundamental primitive built-in to the language (the bit): the other primitives programmers are used to having are declared in header files through structs.

structs of primitives are often comprised of statically-sized arrays of bits or other types. For example, a u8 is defined like so:

typedef struct
{
   bit _[8];
}
u8;

This uses a feature of C* called inline structs, which allows users to declare a struct type with a single member named _, and access its instances directly by name without any dot notation, like a primitive. It combines this with typedefs to make it fully look and act like a primitive data type.

Explicit padding

If a struct has more than one member, fields with the name _ are treated as padding bits for the struct. This is useful since C* makes all structs fully bit-packed, requiring programmers explicate padding. This accomplishes the same thing that C ABIs do while keeping the programmer empowered about the specifics, instead of leaving it up to the compiler implementation.

Attributes

An implementation will have a known list of attributes that convey certain information about a new primitive. These are construed through a braced list of string literals at the end of the typedef's body, before the name, like so:

typedef struct
{
   bit _[8] { "signed2" };
}
s8;

The attribute signed2 conveys that the number is signed using two's complement. This means the most significant bit of the type will be treated as a sign bit by the implementation using two's complement.

Some attributes and their provisions include:

Attribute Description
signed1 Signed integers using one's complement
signed2 Signed integers using two's complement
ieee-bin16 IEEE 754 floating point binary16
ieee-bin32 IEEE 754 floating point binary32
ieee-bin64 IEEE 754 floating point binary64
ieee-bin128 IEEE 754 floating point binary128
ieee-bin256 IEEE 754 floating point binary256
bigint Unlimited precision integer

Flex structs

One of the major limitations of C is its inability to parameterise the size of members of struct declarations as part of the variable's type signature. C has always been able to do this with constant expressions in its array syntax, and since C99 it can do this dynamically with VLAs. It just cannot do this with structs, but C* can. We call these flex structs.

Below is a really good example of how flex structs prove useful. It is an abbreviated implementation of a singly linked list node, first using the C-compatible pointer approach, and then using the C*-specific inline approach.

struct uni_1ll_node;

struct uni_1ll_node
{
   u8 * data;
   struct uni_1ll_node * next;
};

void foo( void *, struct uni_1ll_node );

struct uni_1ll_inode[_];
/* this is also valid and equivalent */
struct uni_1ll_inode[];

struct uni_1ll_inode[x]
{
   u8 data[x];
   struct uni_1ll_inode[x] * next;
};

void fooi( void *, struct uni_1ll_inode[64] );

The value of these semantics is obvious, as it makes it possible to avoid the indirection of having pointers, without requiring abuse of the preprocessor or template/macro metaprogramming. The number, being a compile-time constant, is simply worked into the final definition of the type as if it were present in the struct body itself, dictating its ultimate size and the resulting ABI requirements.

A comprehensive fluid ABI

C* strives to provide the maximum possible power to exactly specify the function of a system. If a programmer needs a certain number, they need only to declare as much specificity as they need, and the implementation has the responsibility of discerning the best way to map that specificity onto the targeted machine. When writing C, general-purpose "catch-all" primitives like int and long often prove to be brittle, as they are, in fact, boiled down into concrete types with more specificity than the programmer may have wanted. C* takes the spirit of these built-in C primitives, as in their original intended meaning from their names, and formalises that fluidity so it can be properly taken advantage of by the programmer. If a programmer only needs so many bits of information, they can cap it. If they need truly unbounded numbers, they can use an attribute to ask the implementation for a number that can be extended as large as needed.

The implementation has the responsibility of applying these definitions to data through hardware-provided mechanisms for the sake of performance. However, it is prudent for the system engineer to know the provisions and limits of the hardware they are targeting, as C* will transparently fall back to software emulation of unavailable features, in much the same way GCC has fallback implementations of IEEE 754 floating-point arithmetic. Ultimately, system engineers must fine tune their code if they desire the maximum possible performance on a given machine.

The benefit of doing this in C* is that they simply need to understand the hardware their code runs on; they do not always need to fall back to inline assembly, whether it is to work around ABI formalities, or to have their algorithms take advantage of important instructions like SSE or NEON when a compiler might not emit them. Since the C* compiler models the program–machine relationship so much better, it can be much more certain when things like these are appropriate.

Encoding data

C* revolves quite heavily around the spirit of data-oriented design. So, much thought and many tough decisions have been made regarding data in the design of the language. This includes data encoding mechanisms, rules and limits upon literals in source, and more.

Literals in source code

C* imposes several strong measures to help contain complexity in systems programming. Many of these show up in the specifics of construing data literally within source code using "literals".

The language requires all source code to be ASCII compliant in its raw form. No other encodings of source text are supported, although there is the doc comment exception.

C* does not provide binary literal notation, nor does it deviate from C in its syntax for octal literals. C* certainly dispels any argument against this on grounds of better code, and so the more historically prevalent opinion is chosen out of pragmatism.

There are literal notations for two kinds of "text": 7-bit "narrow" ASCII, and 21-bit "wide" Unicode. As in C, single quotes are used to construe character literals, and double quotes are used to construe string literals. C* uses the @ symbol prefixed to opening quotes to denote the literal as being Unicode instead of ASCII. Observe:

'a'; /* literal ASCII lowercase A (number 97) */
@'a'; /* literal Unicode lowercase A (U+0061) */
'\377'; /* ASCII DEL (number 127) */
@'\u2018'; /* Unicode opening single quote (U+2018) */
"Good morning, Vietnam!\n"; /* literal ASCII string */
@"Good morning, Vietnam!\n"; /* literal Unicode string */

Transmogrification

C*'s provisions for literals are usually not going to translate into their ideal storage medium as-is. Everything defaults to being bit-packed, including ASCII as 7-bit and Unicode as 21-bit, which is hostile to most CPU architectures. In order to help programmers work through such problems without the follies of metaprogramming, C* provides syntax for a kind of transmogrifier function that is worked through to transform literals into their final form within the program.

Transmogrifier functions are the sole context for C*'s third fundamental primitive type, the fifo, as well as two operators, <- and ->. Definitionally, a function is a transmogrifier function if its return "type" is fifo and its sole parameter is an anonymous fifo "type". The <- is the output operator, streaming bits as output as the function progresses; the -> is the input operator, streaming bits as input as the function progresses.

With these, it is possible, for example, to write a transmogrifier that takes a Unicode string literal and outputs it as UTF-8:

 
fifo ustr2utf8( fifo )
{
   bit c[21];
   u8 n;

   /* take in 21 bits from input FIFO
    * n reports how many bits were available */
   c, n <- 21;

   if(n < 21)
   {
      /* do something different, potentially */
   }

   /* ... implementation ... */

   /* send out 8 bits */
   c & 0xFF -> 8;
}

fifo str2utf8( fifo )
{
   bit c[7];
   u8 n;

   c, n <- 7;

   if( n == 0 )
   {
      return;
   }
   else if( n < 7 )
   {
      /* Houston... we have a problem */
   }

   /* 7-bit ASCII into 8-bit stream */
   0 -> 1;
   c -> 7;
}

const u8 * my_unicode = ustr2utf8@"\u201CBlah blah\u201D";
const u8 * my_ascii = str2utf8"Good morning, Vietnam!";

Optimisation

Conceptual work is still being done on how C* provides for optimisation, as it is so much later in the development process, requiring it to be modelled in that order to a great extent, too. Much of the explicit semantics of the language carry manual optimisation for the programmer's use, but there are still other model changes that will be made to help programmers make better programs while maintaining portability and avoiding genericism and preprocessing.

Inlining

C* models function inlining in reverse of C and most other languages. Instead of dictating this intent at the callee's declaration site, it is instead dictated at the caller's. As long as the function is within the total system, it can be inlined using this technique. Furthermore, it is desirable to have a feeble compiler that always inlines upon request, and never does so otherwise, the opposite of what most C compilers do with the inline keyword (ignore it). Experienced systems programmers know this all too well, and in the real world, profile-guided manual optimisation is the name of the game anyway. So, this is a tool for that kind of task.

To achieve this, C* uses the back tick symbol (`) to prefix the function identifier at the call site, like so:

void foo( void )
{
   /* ... */
}

void bar( void )
{
   /* this calls a proper separate function implemented elsewhere */
   foo( );

   /* this inlines foo( ) right here, always */
   `foo( );
}

Other features

C* provides some other more typical features that are often found in other languages. The language introduces keywords for function signatures, among them noreturn and pure. The noreturn keyword can be used as a return type in place of void, to signify a function not only does not return a value (as void suggests), but does not even return execution flow at all. The pure keyword can prefix the type in a function signature to invoke enforcement upon it as a pure function, that is, a function that does not read or write to any external state, and thus always gives the same output for a given set of inputs. Currently there are no plans to provide decorators that influence optimisation, such as those handling inlining, cold/hot spots, or leaf functions.

New operators

C* introduces several new operators:

  • the minimum operator <? and its assignment variant <?=
    • the assignment variant has short-circuit logic: if the destination variable is smaller, it is left unchanged; otherwise, it is set to the smaller incoming value
  • the maximum operator >? and its assignment variant >?=
    • the assignment variant has short-circuit logic: if the destination variable is larger, it is left unchanged; otherwise, it is set to the larger incoming value
  • the count leading zeroes unary operator ^?
  • the count trailing zeroes unary operator ?^
  • the population count unary operator ^^
  • the arithmetic (signed shift right) operator >>> and its assignment variant >>>=
  • the rotate left operator <<< and its assignment variant <<<=
  • the short-circuit logical AND assignment operator &&=
    • this kind of assignment statement only sets the left-hand side variable if its contents are nonzero (truthy)
  • the short-circuit logical OR assignment operator ||=
    • this kind of assignment statement only sets the left-hand side variable if its contents are zero (falsey)

Division and modulus

C* also overloads the meaning of both the division operator / and the modulus operator % in a way that maintains semantic compatibility with C. It uses the multiple return values feature of C* borrowed from Go to make the following semantic equivalences:

/* these all have the same effect */
a = x / y;
a, _ = x / y;

b = x % y;
b, _ = x % y;

a, b = x / y;
b, a = x % y;

This was done because it is virtually universal that division is performed as a single operation with two output values (the quotient and the remainder), and it is prudent to have the language reflect that mechanical reality.

Additionally, C* also irons out the semantics of division and modulus, so that integer division will always round to zero, and modulus will behave consistently so that the result always carries the sign of the second operand.

Doc comment exception

Normally, all C* source code must be in the form of an array or stream of octets, each carrying one ASCII-only character excluding NUL (1-127). For the sake of comments in non-Latin scripts, the language makes one exception to this requirement in the space of documentation comments, which are of the form /** ... */. Within these symbols, the compiler must not error out upon encountering any octets in the range 128-255 until the comment terminates. The language remains totally agnostic to further encodings, although it should be noted that encodings that are not supersets of ASCII (such as UTF-8) may create compatibility issues or cause unexpected compilation errors.