Writing My Own Programming Language
How Phase Started
Last summer, I was deep into learning C. I loved the control it provided but I kept being frustrated by its lack of clarity; I frequently found myself having to write many lines of code just to accomplish something that was much simpler in Python.
But then I'd switch back to Python for some experiments. The clarity of the syntax was refreshing, but I became frustrated again — this time because of the lack of control; I missed the certainty of static typing and performance awareness in C.
I was stuck between Python's clarity and C's control.
So I decided to build my own middle ground: Phase — a statically-typed, bytecode-interpreted programming language that combines the expressiveness of high-level languages (like Python) with the explicitness of lower-level languages (like C).
I chose the name 'Phase' because it represents the shift between language levels, and pretty much every other distinct name was already taken.
Why Build From Scratch?
Despite the existence and popularity of tools like LLVM that could've made this a lot easier, every single part of Phase — from lexer to virtual machine — is fully handwritten.
This is because I wanted to understand each and every part of the pipeline; I wanted to control exactly how a line of source code gets converted into tokens, into AST nodes, into bytecode, and then executed by the VM. This allowed me to shape the language exactly how I envisioned it.
The Importance of Diagnostics
Something I put much thought into during Phase's planning was error messages. Since this was, in some way, my ideal language, I realized that it had to solve the issue of vague or antagonistic errors that I experienced in other languages.
I believe that errors must work alongside you, not against you. So I created a system that's both informative and visually appealing.
Here's an example:
┏ Fatal Error [102]: Expected ')'.
┃ --> ../tests/missing_paren.phase:2:19-19
┃
┃ 2 | out("a, b, c:"
┃ | ^
┃
┣ Help: Add ')' here.
┃ Suggestion:
┃ - out("a, b, c:"
┃ + out("a, b, c:")
This message tells us:
- What went wrong, in just a few words.
- Where the problem occurred, with visual markers too.
- Why it's a problem, including context for what was expected.
- How we could fix it, with a direct fix suggested.
That's everything we need to solve the issue.
The Interpreter Itself
Phase's interpreter is a multistage pipeline: lexer → parser → type checker → bytecode generator → virtual machine.
Let's go through each stage now, following this basic line of Phase code:
out("Hello world!")
From beginning to end.
1. Lexer (Tokenization)
The lexer (short for lexical analyzer) is the very first stage, taking in raw source code — an arbitrary character string — and breaking it down into tokens that represent keywords, operators, types, and literals.
Our previous line of code will get tokenized into this form:
OUT
LPAREN
STRING_LIT 'Hello world!'
RPAREN
NEWLINE
These tokens are now organised and therefore much easier to process than the original line of code.
The lexer also handles details like skipping whitespace, distinguishing between keywords and identifiers, and handling string literals with escape sequences.
Conceptually, it's simple. At least, that's what I thought until I realized it was actually one of the most tedious and boring parts of the whole project to implement correctly.
2. Parser (Syntax Analysis)
Next, the parser takes the stream of tokens the lexer produced and constructs an Abstract Syntax Tree (AST) — a hierarchy of the grammatical structure of the program.
I specifically implemented a recursive-descent parser, in which each grammar rule corresponds to a dedicated function. The parser starts at the top (program) level and recurses through each sublevel until none remain.
So parsing our tokens gives us these nodes:
STATEMENT (OUT)
╰ EXPRESSION (STRING) ["Hello world!"]
Our flat list of tokens from before now has a rigid hierarchical structure that can be easily traversed.
The parser is also where syntax errors are caught. If you, for instance, forget a parenthesis or put an operator where an operand belongs, the parser knows because the current token doesn't match the grammar it expects based on the previous tokens.
3. Type Checker (Semantic Analysis)
This is where Phase's static typing is enforced. The type checker walks along the AST and verifies that all operations are correct, so you can't mismatch variable types in assignment or arithmetic.
Semantics are different from syntax: think of it like arranging words in a sentence (syntax) versus what the sentence actually means (semantics).
To demonstrate, this line of code is syntactically correct:
let x: int = "Hi"
But we still get an error:
┏ Fatal Error [108]: Type mismatch.
┃ --> ../tests/type_mismatch.phase:3:5-14
┃
┃ 3 | let x: int = "Hi"
┃ | ^^^^^^^^^^
┃
┣ Help: Variable 'x' expects int but got str.
┃ Suggestion:
┃ - let x: int = "Hi"
┃ + let x: int = 0
That's because the code is semantically wrong due to a type mismatch. Our 'hello world' code, on the other hand, passes: out is a statement that accepts any expression as its argument, so the type checker lets it through unchanged.
4. Bytecode Generator (Compilation)
The type-checked AST is now taken by the bytecode generator and compiled into bytecode: a custom, Assembly-like instruction set that's much simpler to execute than the AST itself.
For Phase, I specifically implemented a stack-based architecture, meaning that operations push and pop values from a storage 'stack' — in the order of Last-In, First-Out (LIFO).
Our code's AST now compiles into this hexadecimal bytecode:
00 00 00
01
18
Which represents these opcodes:
OP_PUSH_CONST 0 ; Push 'Hello world!' onto the stack
OP_PRINT ; Print 'Hello world!'
OP_HALT ; Stop the program
I designed Phase's instruction set to be minimal rather than sprawling, with about 25 opcodes currently implemented. Bytecode generation was surprisingly interesting, and in fact one of my favourite aspects of creating Phase, due to the total design control it provided.
5. Virtual Machine (Execution)
The pipeline ends with the virtual machine, which directly executes the bytecode. It maintains:
- An instruction pointer tracking which instruction to execute next.
- A stack for temporary values and computations.
- A global environment for holding variables.
The VM functions like a very simple CPU running a fetch-decode-execute cycle: it reads an instruction, decides what operation to perform, runs it, and moves on to the next instruction.
So, we finally produce an output from our code:
Hello world!
Creating the VM was a great learning experience for interpreter design, and I don't regret writing it myself instead of using a ready-made tool. It was also quite satisfying to finally get tangible output from my source code after seeing only debug info up to that point.
Design Decisions and Tradeoffs
Every programming language has design tradeoffs. Here are some of the key decisions I made and why:
Interpreter vs Transpiler
Originally, Phase was meant to be a transpiler that converted source code to C code, which would then be built and executed. That was actually the first functioning implementation of Phase that I wrote.
However, as I tested my first Phase programs, I realised that the build-run pipeline was very tedious; I had to compile code twice in a row and then execute. Even though I felt that a transpiler would demonstrate more low-level knowledge, I decided to convert Phase to an interpreter by switching out the backend for a bytecode generator and a VM, while keeping the same lexer and parser.
Static Typing vs Dynamic Typing
Static typing adds complexity in the form of a type checker and more sophisticated error handling, but I felt its benefits, comprehensive bug catching and clear intent in code, outweighed the simplicity of dynamic typing.
I originally settled on simple C-style variable type declarations because I wanted the confidence of knowing exactly what the compiler knows, while retaining simplicity. However, after some feedback from a person online, I replaced this with Rust-style declarations to better support the updates that came soon after.
Stack-Based VM vs Register-Based VM
I chose a stack-based architecture over a register-based one because it's much simpler to implement, though it's also slower due to the extra stack operations.
For Phase, I prioritized modularity and extensibility over raw performance. Besides, Phase's small scope means there wouldn't be any noticeable difference between the two architectures.
Python vs C
I created the first prototype of Phase (called Luma at the time) in Python since I knew it much better than C, and got it working in a few days with extremely basic features.
However, I realized it felt almost too easy to implement it in Python, so I decided to challenge myself by writing it fully in C, which greatly accelerated my learning by forcing me to confront concepts I was unfamiliar with at the time.
What's Next
Phase is finished for now, with enough features to write a variety of small, useful programs. For example, here is a Fibonacci sequence program:
func fibonacci(n: int): void {
    let (a, b): int = (0, 1)
    let (next, count): int = (b, 1)
    while count <= n {
        out(next)
        count += 1
        a = b
        b = next
        next = a + b
    }
}

entry {
    fibonacci(10)
}
But if I were to revisit it in the future, these would be the next features on my roadmap:
- An arena memory allocator
- JIT bytecode compilation
- Standard input
- Aggregate data types
What I Learned
Building Phase taught me more about programming languages than any course or book could. I now understand:
- How compilers translate high-level code into executable instructions.
- Why certain language features exist and what tradeoffs they involve.
- The complexity of seemingly simple things like lexing or error messages.
- How to design systems that are both functional and easily extendable.
More importantly, it demonstrated the power of project-based learning; I couldn't have learned all these skills if I didn't actually create my language.
You can check out Phase on GitHub.