Programming your own simple lexer in C#
Hey, dear reader, are you interested in programming? Have you ever wanted to create your own programming language but simply didn’t know where to start?
In this story I am going to explain how I created my own simple lexer in C# and how it works. You will be able to take a look at the source code on GitHub. Before we start, we have to understand what a lexer is.
What is a lexer?
A lexer is the first stage of a programming language implementation: it takes high-level source code and converts it into a sequence of tokens. That’s why a lexer is also called a tokenizer or a scanner. Let’s take an example, so we can easily imagine what I mean:
DEFAULT ARCTICC’S SIMPLE SOURCE CODE:
BECOMES:
As we can see, we get an ordered array of tokens, which the parser will later use to build a parse tree.
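To make the idea of an ordered array of tokens concrete, here is a minimal sketch. The input `number = 1 + 2.5;`, the `Token` type and the type names are all made up for illustration; they are not ArcticC's actual syntax or output.

```csharp
using System;

public static class TokenStreamDemo
{
    // Hypothetical token shape: the kind of token plus the exact text it covers.
    public readonly struct Token
    {
        public Token(string type, string text) { Type = type; Text = text; }
        public string Type { get; }
        public string Text { get; }
        public override string ToString() => $"{Type}({Text})";
    }

    // Hand-written token stream for the made-up input: number = 1 + 2.5;
    public static Token[] ExampleTokens() => new[]
    {
        new Token("Identifier", "number"),
        new Token("Operator", "="),
        new Token("Integer", "1"),
        new Token("Operator", "+"),
        new Token("Decimal", "2.5"),
        new Token("Separator", ";"),
    };

    public static void Main()
    {
        // Prints the tokens in source order, one per line.
        foreach (Token token in ExampleTokens())
            Console.WriteLine(token);
    }
}
```

The order matters: the parser reads the tokens from first to last, so the array has to follow the order of the source text.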
How does it work?
ArcticC’s lexer runs a while loop over the entire source code. It adds one character at a time to a string and compares that growing string with the keywords, integers, strings, booleans and characters already defined in the program.
ArcticC converts each string to a byte array and compares it with the predefined byte arrays of keywords and other known tokens. It also has a predefined 2D array in which it stores the lexed strings and their types, which are assigned using the already defined rules for keywords, integers, etc.
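A byte-array comparison like the one described here could look roughly like this. It is a minimal sketch, not ArcticC's actual code: the keyword list, the UTF-8 encoding and the name `ByteKeywordMatcher` are assumptions made for the example.

```csharp
using System;
using System.Linq;
using System.Text;

public static class ByteKeywordMatcher
{
    // Assumed keyword list, converted to byte arrays once up front.
    // ArcticC's real keywords may differ.
    private static readonly byte[][] Keywords =
        new[] { "break", "if", "while" }
            .Select(keyword => Encoding.UTF8.GetBytes(keyword))
            .ToArray();

    // Returns true when the candidate's bytes match one keyword exactly.
    public static bool IsKeyword(string candidate)
    {
        byte[] bytes = Encoding.UTF8.GetBytes(candidate);
        return Keywords.Any(keyword => keyword.SequenceEqual(bytes));
    }

    public static void Main()
    {
        Console.WriteLine(IsKeyword("break"));   // True
        Console.WriteLine(IsKeyword("breaker")); // False
    }
}
```

Comparing the strings directly with an ordinal comparison would give the same answer here; the byte arrays are used only to mirror the approach described in this story.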
At the end we just print the 2D array, and our lexed source code is stored. Later we can pass this array of lexed source code to the parser.
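The loop and the 2D array can be sketched as follows. This is a simplified illustration under my own assumptions, not the ArcticC source: it splits on whitespace only, the column layout (type, then text) is invented, and the classification rule is passed in as a hypothetical `classify` function so the sketch stays focused on the loop.

```csharp
using System;
using System.Text;

public static class LoopLexer
{
    // Walks the source with a while loop, collects characters into a lexeme,
    // and stores each finished lexeme with its type in a 2D array.
    // Column 0 holds the type, column 1 holds the text.
    public static string[,] Lex(string source, Func<string, string> classify, out int count)
    {
        // Worst case: every character is its own token.
        var table = new string[source.Length, 2];
        var lexeme = new StringBuilder();
        count = 0;
        int position = 0;

        while (position <= source.Length)
        {
            // Treat the end of the input like whitespace so the last lexeme is flushed.
            bool atEnd = position == source.Length;
            char current = atEnd ? ' ' : source[position];

            if (char.IsWhiteSpace(current))
            {
                if (lexeme.Length > 0)
                {
                    string text = lexeme.ToString();
                    table[count, 0] = classify(text);
                    table[count, 1] = text;
                    count++;
                    lexeme.Clear();
                }
            }
            else
            {
                lexeme.Append(current);
            }
            position++;
        }
        return table;
    }

    public static void Main()
    {
        // Placeholder rule: only tells numbers from everything else.
        Func<string, string> classify =
            text => int.TryParse(text, out _) ? "Integer" : "Word";

        string[,] table = Lex("count = 42", classify, out int count);
        for (int row = 0; row < count; row++)
            Console.WriteLine($"{table[row, 0]}: {table[row, 1]}");
    }
}
```

Because this sketch only splits on whitespace, an input like `count=42` would come out as a single lexeme. A real lexer would also end a lexeme at operators and separators.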
The image above shows how each identifier is assigned a specific type. In the first step, the value of Value is 1, which means it is an integer. When we add another character from the source code line to Value, it changes to a decimal, because the rule for decimal values is a dot inside an integer value.
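The integer-to-decimal rule can be sketched like this. It is only an illustration of the rule as described: the name `NumberClassifier` and the type labels are assumptions, and I treat a value with more than one dot as unknown, which the story does not specify.

```csharp
using System;

public static class NumberClassifier
{
    // Returns "Integer", "Decimal", or "Unknown" for the text collected so far.
    public static string Classify(string value)
    {
        if (value.Length == 0) return "Unknown";

        int dots = 0;
        int digits = 0;
        foreach (char c in value)
        {
            if (c == '.') dots++;
            else if (c >= '0' && c <= '9') digits++;
            else return "Unknown";
        }

        // Assumption: at least one digit and at most one dot.
        if (digits == 0 || dots > 1) return "Unknown";
        return dots == 1 ? "Decimal" : "Integer";
    }

    public static void Main()
    {
        // Grow the value one character at a time, as the lexer loop would.
        string value = "";
        foreach (char c in "1.5")
        {
            value += c;
            Console.WriteLine($"{value} -> {Classify(value)}");
        }
    }
}
```

Note that `1.` already counts as a decimal in this sketch, since the dot alone triggers the rule. A stricter lexer might wait for a digit after the dot before deciding.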
You should define other types the same way. Let’s say the lexer gets a string with the characters “break”: this shouldn’t be defined as an identifier but as a program keyword.
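One way to get this right is to check the keyword list before falling back to the identifier rule. The sketch below assumes a small keyword set and a common identifier rule (a letter or underscore followed by letters, digits or underscores); neither is taken from ArcticC.

```csharp
using System;
using System.Collections.Generic;

public static class WordClassifier
{
    // Assumed keyword set; replace with the language's real keywords.
    private static readonly HashSet<string> Keywords =
        new HashSet<string>(StringComparer.Ordinal) { "break", "if", "else", "while" };

    // Keywords and booleans are checked first, so "break" never becomes an identifier.
    public static string Classify(string word)
    {
        if (Keywords.Contains(word)) return "Keyword";
        if (word == "true" || word == "false") return "Boolean";
        return IsIdentifier(word) ? "Identifier" : "Unknown";
    }

    // Assumed identifier rule: a letter or underscore, then letters, digits, or underscores.
    private static bool IsIdentifier(string word)
    {
        if (word.Length == 0) return false;
        if (!char.IsLetter(word[0]) && word[0] != '_') return false;
        foreach (char c in word)
            if (!char.IsLetterOrDigit(c) && c != '_') return false;
        return true;
    }

    public static void Main()
    {
        Console.WriteLine(Classify("break"));     // Keyword
        Console.WriteLine(Classify("breakfast")); // Identifier
        Console.WriteLine(Classify("true"));      // Boolean
    }
}
```

The whole word is compared, not just its beginning, which is why `breakfast` stays an identifier even though it starts with `break`.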
Wikipedia explains more clearly how and why a lexer is used, and I highly recommend checking it out. I can guarantee that after reading it you will understand how to create a simple lexer, if you don’t already.
I hope that this story is kinda educational and that it helps you create your own simple or more advanced lexer in C#. Most importantly, happy programming!