Writing My Own Programming Language
The Origin Story
Last summer, I was deep into learning C. I loved the control it provided — pointers, memory management, hardware access — but I kept being frustrated by one thing: the boilerplate. Every time I wanted to do something that seemed simple, I actually had to write dozens of lines of code.
Then I'd switch back to Python for some prototyping, and while the clarity of the syntax was refreshing, I immediately missed the control of C. There was no static typing, no pass-by-reference, no way of knowing if my code would work until I ran it. I felt stuck between Python's clarity and C's control.
So I decided to build my own middle ground: Phase, a statically-typed, bytecode-interpreted language that combines the expressiveness of high-level languages (like Python) with the explicitness of lower-level languages (like C).
Why Build From Scratch?
Despite the existence of many tools like ANTLR or LLVM that could've made this a lot easier, every part of Phase — from lexer to VM — is fully handwritten.
Why? Because I wanted to understand each and every part of the pipeline; I didn't want to treat any section like a black box. I wanted to control exactly how a line of source code gets converted into tokens, into AST nodes, into bytecode, and then executed by the VM. And the only way to truly understand and control something is by building it yourself.
The Importance of Diagnostics
Another thing I focused on while planning Phase was error messages: specifically, informative, helpful, and visually appealing ones. Too many languages treat errors as afterthoughts, giving you vague messages that don't tell you what went wrong or how to fix it.
I believe that good diagnostics are just as important as good semantics. Phase's error system reflects that.
Take a look at this example error message:
┏ Fatal Error [102]: Expected ')'.
┃ --> ../tests/missing_paren.phase:2:19-19
┃
┃ 2 | out("a, b, c:"
┃ | ^
┃
┣ Help: Add ')' here.
┃ Suggestion:
┃ - out("a, b, c:"
┃ + out("a, b, c:")
This tells us:
- What went wrong, in just a few words.
- Where the error occurred, with visual markers too.
- Why it's a problem, including context for what was expected.
- How we could fix it, with a direct code suggestion.
That's everything we need to solve the issue.
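As a rough illustration of how a report in this style could be assembled, here is a minimal Python sketch. The function name `render_diagnostic` and all of its parameters are hypothetical; this is not Phase's actual error system, and it leaves out the suggestion diff.

```python
def render_diagnostic(code, message, path, line_no, col, source_line, help_text):
    """Format a one-line diagnostic with a caret under column `col` (1-based).

    Every name here is an assumption made for illustration only.
    """
    gutter = f"{line_no} | "
    # Blank gutter of the same width, then enough spaces to reach `col`.
    caret_pad = " " * (len(gutter) - 2) + "| " + " " * (col - 1)
    return "\n".join([
        f"┏ Fatal Error [{code}]: {message}",
        f"┃ --> {path}:{line_no}:{col}",
        "┃",
        f"┃ {gutter}{source_line}",
        f"┃ {caret_pad}^",
        "┃",
        f"┣ Help: {help_text}",
    ])
```

The key detail is that the caret line reuses the gutter width of the source line, so the marker stays aligned however many digits the line number has.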
The Interpreter Pipeline
Phase's interpreter follows a classic multistage pipeline: lexer, parser, type checker, bytecode generator, and virtual machine.
Let's go through each component.
1. Lexer (Tokenization)
The lexer is the very first stage, taking in raw source code — an arbitrary string of characters — and breaking it down into tokens: meaningful units of the language, such as keywords, operators, and literals.
For example, this code:
out("Hello world!")
Gets tokenized into:
OUT
LPAREN
STRING_LIT 'Hello world!'
RPAREN
NEWLINE
The lexer also handles details like skipping whitespace, recognizing keywords and identifiers, and handling string literals with escape sequences. Conceptually it's simple, or so I thought, until I realized it was one of the most tedious parts of the whole interpreter to get right.
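To make the scanning loop concrete, here is a small hand-rolled lexer sketch in Python. It is not Phase's actual lexer: the token names are taken from the listing above, while the `(kind, value)` tuple shape, the `IDENT` token, and the set of supported escapes are assumptions.

```python
def tokenize(source):
    """Scan `source` character by character into (kind, value) pairs."""
    keywords = {"out": "OUT"}
    escapes = {"n": "\n", "t": "\t", '"': '"', "\\": "\\"}  # assumed escape set
    tokens, i = [], 0
    while i < len(source):
        ch = source[i]
        if ch in " \t":                      # skip insignificant whitespace
            i += 1
        elif ch == "\n":
            tokens.append(("NEWLINE", None))
            i += 1
        elif ch == "(":
            tokens.append(("LPAREN", None))
            i += 1
        elif ch == ")":
            tokens.append(("RPAREN", None))
            i += 1
        elif ch == '"':                      # string literal with escapes
            i += 1
            chars = []
            while i < len(source) and source[i] != '"':
                if source[i] == "\\" and i + 1 < len(source):
                    i += 1                   # step onto the escaped character
                    chars.append(escapes.get(source[i], source[i]))
                else:
                    chars.append(source[i])
                i += 1
            if i >= len(source):
                raise SyntaxError("Unterminated string literal")
            i += 1                           # consume the closing quote
            tokens.append(("STRING_LIT", "".join(chars)))
        elif ch.isalpha() or ch == "_":      # keyword or identifier
            start = i
            while i < len(source) and (source[i].isalnum() or source[i] == "_"):
                i += 1
            word = source[start:i]
            if word in keywords:
                tokens.append((keywords[word], None))
            else:
                tokens.append(("IDENT", word))
        else:
            raise SyntaxError(f"Unexpected character {ch!r} at offset {i}")
    return tokens
```

Most of the tedium lives in the string branch: escapes, unterminated literals, and keeping the offset correct are where the edge cases tend to pile up.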
2. Parser (Syntax Analysis)
The parser takes the stream of tokens the lexer produced and constructs a hierarchical Abstract Syntax Tree (AST), which represents the grammatical structure of the program.
I implemented a recursive-descent parser — meaning that each grammar rule maps to a specific function.
For example, parsing the tokens we created earlier gives us this tree:
STATEMENT (OUT)
╰ EXPRESSION (STRING) ["Hello world!"]
The parser is where syntax errors are caught. If you forget a parenthesis or misplace an operator, the parser knows because the token stream doesn't match the expected grammar.
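The "one function per grammar rule" idea can be sketched as follows. This is an illustrative Python sketch, not Phase's parser: the two-rule grammar, the tuple-based AST, and the names `Parser`, `expect`, and `ParseError` are all assumptions.

```python
class ParseError(Exception):
    pass


class Parser:
    """Recursive-descent sketch over a list of (kind, value) tokens.

    Hypothetical grammar:
        statement  := OUT LPAREN expression RPAREN NEWLINE
        expression := STRING_LIT
    """

    def __init__(self, tokens):
        self.tokens = tokens
        self.pos = 0

    def peek(self):
        # Kind of the next token, or "EOF" once the stream is exhausted.
        return self.tokens[self.pos][0] if self.pos < len(self.tokens) else "EOF"

    def expect(self, kind):
        # Consume the next token if it has the expected kind, else report it.
        if self.peek() != kind:
            raise ParseError(f"Expected {kind}, found {self.peek()}")
        token = self.tokens[self.pos]
        self.pos += 1
        return token

    def statement(self):
        self.expect("OUT")
        self.expect("LPAREN")
        argument = self.expression()
        self.expect("RPAREN")
        self.expect("NEWLINE")
        return ("STATEMENT", "OUT", argument)

    def expression(self):
        _, value = self.expect("STRING_LIT")
        return ("EXPRESSION", "STRING", value)
```

Dropping the `RPAREN` token from the input makes `statement()` raise `ParseError("Expected RPAREN, found NEWLINE")`, which is the kind of mismatch a syntax diagnostic is built from.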
3. Type Checker (Semantic Analysis)
This is where Phase's static typing is enforced. The type checker walks the AST and verifies that operations are correct: you can't assign a string to an integer variable, you can't use a variable before you declare it, etc.
So while this is syntactically correct:
int x = "hi"
It's semantically wrong.
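A minimal Python sketch of such a check is shown below. The tuple-based AST, the `"str"` type name, and the function names are assumptions for illustration; Phase's real type checker may be structured quite differently.

```python
class TypeCheckError(Exception):
    pass


def type_of(expr, env):
    """Return the static type of an expression node (hypothetical AST shape)."""
    kind = expr[0]
    if kind == "INT":
        return "int"
    if kind == "STRING":
        return "str"
    if kind == "VAR":
        name = expr[1]
        if name not in env:
            raise TypeCheckError(f"Variable '{name}' used before declaration")
        return env[name]
    raise TypeCheckError(f"Unknown expression kind {kind}")


def check(statements):
    """Walk ("DECLARE", type, name, expr) nodes in order, tracking declared types."""
    env = {}
    for _, declared, name, value in statements:
        actual = type_of(value, env)
        if actual != declared:
            raise TypeCheckError(
                f"Cannot assign {actual} to {declared} variable '{name}'"
            )
        env[name] = declared
    return env
```

Here `int x = "hi"` would be represented as `("DECLARE", "int", "x", ("STRING", "hi"))`, and `check` rejects it before any bytecode is generated.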
4. Bytecode Generator (Compilation)
The bytecode generator takes the AST (after it passes type checking) and compiles it into bytecode: a custom, Assembly-like instruction set that's much simpler to execute than the AST itself.
Phase uses a stack-based architecture, meaning operations push values onto and pop values off a stack in Last-In, First-Out (LIFO) order.
For example, our previous AST will compile into this hexadecimal bytecode:
00 00 00
01
24
Which represents these opcodes:
OP_PUSH_CONST 0 ; Push 'Hello world!' onto the stack
OP_PRINT ; Print 'Hello world!'
OP_HALT ; Stop the program
I designed Phase's instruction set to be intentionally minimal, with about 25 opcodes currently implemented. Bytecode generation was surprisingly interesting, and was one of my favourite aspects of creating Phase due to the total design control it provided.
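As a sketch of what this compilation step might look like, the Python below turns the example AST into the same five bytes. The opcode values are inferred from the hex listing above; the two-byte big-endian operand, the constant pool, and the tuple-based AST are assumptions rather than Phase's documented encoding.

```python
# Opcode values inferred from the hex listing; treat them as assumptions.
OP_PUSH_CONST, OP_PRINT, OP_HALT = 0x00, 0x01, 0x24


def compile_program(statements):
    """Compile ("STATEMENT", "OUT", ("EXPRESSION", "STRING", value)) nodes."""
    constants, code = [], bytearray()
    for _, kind, (_, _, value) in statements:
        if kind != "OUT":
            raise ValueError(f"Unsupported statement {kind}")
        if value not in constants:
            constants.append(value)          # each literal is stored once
        index = constants.index(value)
        code.append(OP_PUSH_CONST)
        code += index.to_bytes(2, "big")     # operand: constant-pool index
        code.append(OP_PRINT)
    code.append(OP_HALT)                     # every program ends with a halt
    return constants, bytes(code)
```

Storing literals in a constant pool keeps instructions fixed-width: the bytecode only carries a small index, and the string itself lives alongside it.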
5. Virtual Machine (Execution)
The final component of the pipeline is the VM, which directly executes the bytecode. It maintains:
- An instruction pointer tracking which instruction to execute next.
- A stack for temporary values and computations.
- A global environment for holding variables.
The VM functions like a very simple CPU, running a fetch-decode-execute loop: it reads an instruction, decides what operation to perform, runs it, and moves on to the next instruction.
So, for example, our bytecode is executed and this output is produced:
Hello world!
Creating the VM was a great learning experience for interpreter design, and I'm happy that I chose to write it myself instead of using a ready-made tool.
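For readers who want to see the fetch-decode-execute loop in code, here is a minimal Python sketch. It is not Phase's VM: the opcode values and two-byte big-endian operand are assumptions inferred from the earlier hex listing, the global environment is omitted, and the function returns the printed values only to make it easy to test.

```python
# Assumed opcode values, inferred from the earlier hex listing.
OP_PUSH_CONST, OP_PRINT, OP_HALT = 0x00, 0x01, 0x24


def run(code, constants):
    """Execute bytecode until OP_HALT; returns the list of printed values."""
    ip = 0          # instruction pointer: offset of the next byte to read
    stack = []      # operand stack (LIFO)
    output = []
    while True:
        opcode = code[ip]                    # fetch
        ip += 1
        if opcode == OP_PUSH_CONST:          # decode, then execute
            index = int.from_bytes(code[ip:ip + 2], "big")
            ip += 2                          # skip past the operand bytes
            stack.append(constants[index])
        elif opcode == OP_PRINT:
            value = stack.pop()
            print(value)
            output.append(value)
        elif opcode == OP_HALT:
            return output
        else:
            raise RuntimeError(f"Unknown opcode 0x{opcode:02x} at offset {ip - 1}")
```

Calling `run(bytes([0x00, 0x00, 0x00, 0x01, 0x24]), ["Hello world!"])` pushes constant 0, prints it, and halts.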
Design Decisions and Tradeoffs
Every language design involves tradeoffs. Here are some of the key decisions I made and why:
Interpreter vs Transpiler
Originally, Phase was meant to be a transpiler that converted source code to C code, which would then be built and executed. In fact, that was the first functioning implementation of Phase.
However, as I tested my first Phase programs, I realised that the build-run pipeline was very tedious. Even though I felt that a transpiler would demonstrate more low-level knowledge, I decided to convert Phase to an interpreter by switching out the backend for a bytecode generator and a VM, while keeping the same lexer and parser.
Static Typing vs Dynamic Typing
Static typing adds complexity: you need a type checker, explicit declarations, and more sophisticated error handling. But I felt that the benefits outweigh the costs: bugs caught before the program runs, better tooling, and clear intent in code.
I settled on simple, C-style type declarations because I wanted the confidence of knowing exactly what the compiler knows, while keeping simplicity.
Stack-Based VM vs Register-Based VM
I chose a stack-based architecture over a register-based one because it's much simpler to implement. However, it's also typically slower, because it executes more instructions: operands have to be pushed and popped rather than addressed directly.
For Phase, I prioritized modularity and extensibility over raw performance. Besides, Phase's small scope means there wouldn't be any noticeable difference between the two architectures.
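To give a feel for the instruction-count difference, the Python sketch below lowers the expression a + b * c both ways. The instruction names are hypothetical and are not Phase opcodes, and the register version assumes the variables already live in registers, which flatters it somewhat.

```python
def stack_code(node):
    """Post-order walk: push operands, then emit the operator."""
    if isinstance(node, str):
        return [("PUSH", node)]
    op, left, right = node
    return stack_code(left) + stack_code(right) + [(op,)]


def register_code(node, out=None):
    """Three-address code: each operator names its destination and sources."""
    out = [] if out is None else out
    if isinstance(node, str):
        return node, out                     # variable is already in a register
    op, left, right = node
    lhs, _ = register_code(left, out)
    rhs, _ = register_code(right, out)
    dest = f"t{len(out)}"                    # fresh temporary register
    out.append((op, dest, lhs, rhs))
    return dest, out


expr = ("ADD", "a", ("MUL", "b", "c"))       # a + b * c
stack_program = stack_code(expr)             # 5 instructions
_, register_program = register_code(expr)    # 2 instructions
```

The stack version needs five small instructions where the register version needs two wider ones, and the stack generator is the shorter and simpler of the two functions, which is the tradeoff in miniature.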
What's Next
Phase isn't finished, but it's fully functional and can run various small programs.
These are the next features on my roadmap:
- Declaration keywords + annotations
- Functions
- Conditionals
- Basic loops
- An arena memory allocator
What I Learned
Building Phase taught me more about programming languages than any course or book could. I now understand:
- How compilers translate high-level code into executable instructions.
- Why certain language features exist and what tradeoffs they involve.
- How type systems work behind the scenes.
- The complexity of seemingly simple things like lexing or error messages.
- How to design systems that are both functional and easily extendable.
More importantly, it demonstrated two important things:
The power of project-based learning: I couldn't have learned all these skills if I didn't actually create my language.
And my ethos of computer science: providing the agency to turn hopeful imagination into working reality — to build what wasn't there before.
If you're thinking about creating your own language, this is my advice: plan a bit, get some ideas about syntax and pipelines, but then just start. I made the mistake of spending a month only planning and reading; you'll learn far more quickly by building it, piece by piece.
You can check out Phase on GitHub. Feel free to contribute via an issue or PR if you wish.