Thursday, December 30, 2010

F* Yeah! 2 + 1 * 3 = 5

I spent some time lately just reading about compilers a little bit. Nothing too in depth but I wanted to explore the idea of a basic compiler. So I worked my way through some Wikipedia link jumping. Reading about lexical analysis, parsers, grammars. abstract syntax trees, stack machines, register machines, some of everything. I'm surprised I didn't end up on Kevin Bacon's page.

The Idea

Anyway, after all of that I realized something about an exercise we did in school. Take an arithmetic expression, something like 2 + (3 - 1), parse it into a tree data structure, then evaluate the tree. This was basically a simple compiler. We didn't have to describe the lexical and grammar rules, nor generate an intermediate or executable representation, but it was still the same idea. Given this, turn it into something else.

So that sounds like a perfect project, so lets do it in Erlang.

The Lexer


I know the tokens I want: +, -, *, /, number, (, )

After a bit of research I find that Erlang has a lex implementation call...wait for it... leex. Given some lexical rules (defined with a bit of regular expressions and Erlang code) it'll generate me the Erlang code for the lexical analyzer.

Well that's easy. Here's are rules to turn a string representation of an expression into a series of tokens:

[0-9]+ :
  {token, {number, TokenLine, list_to_integer(TokenChars)}}.

[0-9]+\.[0-9]+ :
  {token, {number, TokenLine, list_to_float(TokenChars)}}.

\+ : {token, {'+', TokenLine}}.
\- : {token, {'-', TokenLine}}.
/ : {token, {'/', TokenLine}}.
\* : {token, {'*', TokenLine}}.
\( : {token, {'(', TokenLine}}.
\) : {token, {')', TokenLine}}.

[\000-\s] : skip_token.
[\n] : {end_token, {'$end', TokenLine}}.



The Parser


Ok, I have my lexer. I can turn "2 + 1 * 3" into [{number, 1, 2}, {'+', 1}, {number, 1, 1}, ...
That first 1 on all the tokens is the line number. In my current situation I'm only worrying about a single line expression.

Working directly with that list of tokens would be about as fun as fighting a cool monkey. I want to be friends and he wants to look like a badass while throwing poo. This is where a grammar definition and a parser comes in.

And guess what... in the butt? No, Erlang has a yacc implementation called... wait for it.... yecc. So here's my grammar:

Nonterminals expression.
Terminals number '+' '-' '/' '*' '(' ')'.

Rootsymbol expression.
Endsymbol '$end'.

Left 300 '+'.
Left 300 '-'.
Left 400 '*'.
Left 400 '/'.


expression -> expression '+' expression : [ '$2', '$1', '$3' ].
expression -> expression '-' expression : [ '$2', '$1', '$3' ].
expression -> expression '/' expression : [ '$2', '$1', '$3' ].
expression -> expression '*' expression : [ '$2', '$1', '$3' ].
expression -> '(' expression ')' : '$2'.
expression -> number : '$1'.



Basically I'm saying that every list of tokens should result in an expression (a nonterminal). Which could be a number, an expression inside parentheses, or two expressions joined with one of my supported operators. Those "Left " is the associativity, precedence, and token to aid the parser in getting things correctly.


The Generator


I have a lexer and a parser, sweet. What does that mean? I can turn "2 + 1 * 3" into

     [{'+',1},
          {number,1,2},
          [{'*',1},
              {number,1,1},
              {number,1,3}]]}



I know that is a terrible representation, but's an abstract syntax tree represented as a list (brackets) and tuples (curly braces). The '+' token is the head with only a number to its "left" and a subtree to its "right". Since the tokens come out as a list of tuples, we need another method to represent the tree. Lists will do just as well. This probably looks strange to an OOP guy, but structured data is structured data when you are processing and translating.

Well I'm not done. I have an abstract syntax tree, but how do I use it. Lets turn it into a made up stack based machine language:

-module(generator).
-export([generate/1]).

generate([Root, Left, Right]) ->
  LeftExp = generate(Left),
  RightExp = generate(Right),
  RootExp = generate(Root),
  LeftExp ++ RightExp ++ RootExp;
     
generate({number, _Line, Value}) ->
  [{push, Value}];
    
generate({'+', _Line}) ->
  [{add}];

generate({'-', _Line}) ->
  [{subtract}];

generate({'*', _Line}) ->
  [{multiply}];

generate({'/', _Line}) ->
  [{divide}].

That's the whole generator. The top generator() function just visits each node of the tree while the others rely on Erlang's pattern matching to turn a token into an instruction ({number..} into {push}). You are probably thinking why didn't the lexer generate these instructions as the tokens? Well you don't want to combine these truly separate concerns do you? Also the straight list of tokens as they come out from the lexer wouldn't work with a stack machine where you need to push the data on a stack and pop it off to operator on it. Oh, my operator instructions do the popping.... anyway.


The Virtual Machine


Now I'm turning "2 + 1 * 3" into [{push, 2},{push,1},{push,3},{multiply},{add}]. I can go from a line of "code" to a list of machine instructions. Hot damn, but I still don't have the evaluated value.

At this point you might need to know how to read Erlang code, so I'll skip the gist embedding and provide a link to my git repository for the curious:

https://github.com/copenhas/mathexp

Ok, holy crap. What else do we have to do just to evaluate some math? You start the virtual machine (which runs in it's own process), you send the VM your program (list of instructions), then asynchronously it runs the program and finally will send you the return value in a message.

I'm afraid I'll also leave understanding the basics of a stack machine to the reader as well, but the example of a program above should give you a good idea. I don't think I'm strictly following it (I believe they are usually a 0 operand instruction set not a 0-1), but it works pretty well and the code is pretty straight forward.

Nothing impressive, but the title expresses my feeling when I saw it work.

The Absurdity

Here is the absurdity as a series of Erlang code to get that magical (high) 5.... he he he.

{ok, Tokens, _EndLine} = lexical:string("2 + 1 * 3").
{ok, Ast} = grammar:parse(Tokens).
Program = generator:generate(Ast).
vm:start().
vm:load(Program).
vm:stop().
receive {return, Value} => Value end.

Erlang's assignment does data unpackaging (pulls the data out and sets the variable) and pattern matching (the right side has to match the format of the left). That last line would actually print out the value in the Erlang interpreter. This last bit might have been confusing but I'm going to leave you hanging.

1 comment:

  1. Oh, I hope to add multi-line expressions, variables, more operators, better vm, but we'll see what happens.

    ReplyDelete