Writing My Own Programming Language
How Phase Started
Last summer, I was learning C; I liked the control it provided but I kept being frustrated by its lack of clarity, and I frequently found myself having to write many lines of code just to accomplish something that was much simpler in Python.
But then I'd switch back to Python for some scripting, and while the clarity of the syntax was refreshing, I became frustrated with the lack of control and I missed the certainty of static typing and performance awareness in C.
I was basically stuck between Python and C, which inspired me to imagine what I really wanted in a language. And so, I decided to create my own: Phase -- a statically-typed bytecode-interpreted programming language that combines the expressiveness of high-level languages (like Python) with the explicitness of lower-level languages (like C).
I chose the name 'Phase' because it represents the shift between language levels, and pretty much every other unique name was already taken.
Also, despite the existence and popularity of tools like LLVM that could've made this a lot easier, every single component of Phase was fully handwritten by me.
This is because I wanted to understand each and every part of the pipeline, which allowed me to to shape the language exactly how I envisioned it, and it challenged me to actually learn compiler theory.
Diagnostics
An additional yet important thing is that I put much thought into Phase's error system. Since this was, in some way, my ideal language, I realized that it had to also solve my frustrations with vague errors that I experienced in other languages. So I spent some time creating an informative and visually appealing diagnostics manager, partially inspired by Rust's errors.
Here's an example:
┏ Fatal Error [102]: Expected ')'.
┃ --> ../tests/missing_paren.phase:2:19-19
┃
┃ 2 | out("a, b, c:"
┃ | ^
┃
┣ Help: Add ')' here.
┃ Suggestion:
┃ - out("a, b, c:"
┃ + out("a, b, c:")
This message tells you what went wrong, where it went wrong, why it went wrong, and how it could be fixed -- essentially everything you need to solve the issue.
The Interpreter Itself
Phase's interpreter has several stages:
Lexer --> Parser --> Type Checker --> Bytecode Generator --> Virtual Machine
To demonstrate the process, we'll follow this basic line of Phase code:
out("Hello world!")
From beginning to end.
1. Lexer
The lexer (short for lexical analyzer) is the very first stage, taking in raw source code -- an arbitrary character string -- and breaking it down into tokens that represent keywords, operators, types, and literals.
Our previous line of code will get tokenized into this form:
OUT
LPAREN
STRING_LIT 'Hello world!'
RPAREN
NEWLINE
These tokens are now organized and separated -- much easier to process than the original line of code.
The lexer also handles details like skipping whitespace, distinguishing between keywords and identifiers, and handling string literals with escape sequences.
In concept, it's simple since it's just processing strings from a while, which is what I thought until I found it was probably the most tedious and boring aspect of the whole project to design and implement correctly.
2. Parser
Next, the parser takes the stream of tokens the lexer produced and constructs an Abstract Syntax Tree (AST) -- a hierarchy of the structure of the program.
I specifically implemented a recursive-descent parser, wherein each grammar rule is designated to a specific function. The parser starts at the top (program) level, and goes through each sublevel until there's none left.
So parsing our tokens gives us these nodes:
STATEMENT (OUT)
╰ EXPRESSION (STRING) ["Hello world!"]
And our list of tokens from before is now an organized linear structure that can be easily followed.
The parser is also where syntax errors are caught and raised. For example, if you forget a closing parenthesis in your code, the parser catches this based on the previous tokens it encounters -- in this case, an opening parenthesis, so it expects a closing one on the same line.
3. Type Checker
This is where Phase's static typing is enforced through semantic verification. The type checker walks along the AST and verifies that all operations are correct, so you can't mismatch variable types in assignment or arithmetic.
The type checker is separate from the lexer because semantics are different from syntax. Think of it like arranging words in a sentence (syntax) versus what the sentence actually means (semantics), so while the sentence "colourless green ideas sleep furiously" is correct with its word structure, it doesn't actually mean anything.
To demonstrate, this line of code is syntactically correct:
let x: int = "Hi"
But we still get an error:
┏ Fatal Error [108]: Type mismatch.
┃ --> ../tests/type_mismatch.phase:3:5-14
┃
┃ 3 | let x: int = "Hi"
┃ | ^^^^^^^^^^
┃
┣ Help: Variable 'x' expects int but got str.
┃ Suggestion:
┃ - let x: int = "Hi"
┃ + let x: int = 0
Because the code is semantically wrong due to a type mismatch. However, our 'Hello world!' code is a statement and accepts any expression as an argument because it will always be converted to a string, so the type checker emits it as is.
4. Bytecode Generator
The type-checked AST is now taken by the bytecode generator and compiled into bytecode: a custom, Assembly-like instruction set that's much simpler to execute than the AST itself.
For Phase, I specifically implemented a stack-based architecture, meaning that operations push and pop values from a storage 'stack' -- in the order of Last-In, First-Out (LIFO).
Inputing our AST now gives us this hexadecimal bytecode:
00 00 00
01
18
Which represents these instructions of opcodes and constant values:
OP_PUSH_CONST 0 ; Push 'Hello world!' onto the stack
OP_PRINT ; Print it
OP_HALT ; Stop the program
I designed Phase's instruction set to contain abou 25 different opcodes. Bytecode generation was surprisingly interesting, and, in fact, it was one of my favourite aspects of creating Phase because of the total low-level design control it provided.
5. Virtual Machine
Finally, the pipeline ends with the virtual machine, which directly executes the bytecode. It maintains an instruction pointer tracking the next instruction to execute, a stack for temporary values and computations, and a global environment for holding variables.
The VM functions like a very simple CPU running a fetch-decode-execute cycle: it reads an instruction, decides what operation to perform, runs it, and moves onto the next instruction.
So, we finally produce an output from our code:
Hello world!
Creating the VM was a pretty good learning experience for interpreter design, and it was also quite satisfying to get to program the actual outputs for my source code after working purely with debug info up to that point.
Design Decisions
I probably spent more time dealing with Phase's design and tradeoffs than actually programming it. Some of the more important decisions I went through were:
Interpreter versus Transpiler
Originally, Phase was meant to be a transpiler that converted source code to C code, before being built and executed -- which was actually the first functioning implementation of Phase that I wrote. However, as I tested my first Phase programs, I realized that the build-run pipeline was very tedious as I had to compile code twice in a row and only then execute it. Even though I felt that a transpiler would demonstrate more low-level knowledge, I decided to convert Phase to an interpreter by switching out the backend for a bytecode generator and a VM, while keeping the same lexer and parser.
Static Typing versus Dynamic Typing
Static typing makes things harder because you have to implement type checking and more sophisticated error handling, but I felt that the benefits outweighed dynamic typing with bug-catching and explicitness in code. I originally implemented C-style variable type declarations because they were simple and I was used to them, but after some feedback from a person online, I replaced them with Rust-style declarations to better support the updates that came soon after.
Stack-Based VM versus Register-Based VM
I chose a stack-based architecture over register-based because it's much simpler to implement, but it's also slower because of more stack operations. But either way, there wouldn't be any noticeable difference between the two architectures for a language this small.
Python versus C
I created the first prototype of Phase (called Luma at the time) in Python since I knew it much better than C, and I got it working in a few days with extremely basic features. However, writing it in Python made me feel as if I were cheating myself in a way, so I decided to challenge myself by writing it fully in C, which sped up my learning by forcing me to confront concepts I was unfamiliar with at the time.
Wrapping Things Up
Phase is my first fully complete long-term project. It's got with enough features as a programming lagnauge to write a variety of small practical programs, for example fibonacci:
func fibonacci(n: int): void {
let (a, b): int = (0, 1)
let (next, count): int = (b, 1)
while count <= n {
out(next)
count += 1
a = b
b = next
next = a + b
}
}
entry {
fibonacci(10)
}
But if I had to revisit it in the future, I would likely work on standard input, arena memory allocation, and JIT bytecode complication.
Overall though, Phase taught me many things like deep compiler/interpreter theory, how to properly design and implement a large project in a low-level language like C, and how to think in a systems-oriented way so I consider all my architectural design decisions. It also proves the sheer effectiveness of project-based learning -- something I really value -- since I everything I learnt had a direct purpose to achieve my goal; I couldn't have acquired all these skills if I hadn't tried to create something out of my reach at the time.
You can check out Phase's repository on GitHub.