VOOZH about

URL: http://www.paulbatum.com/search/label/Compilers

⇱ Paul Batum: Compilers


skip to main | skip to sidebar
Showing posts with label Compilers. Show all posts
Showing posts with label Compilers. Show all posts

Sunday, 24 February 2008

Status update: compilers and ruby on rails

So you probably noticed I haven't been posting much about the parsing stuff lately and thats because I stopped working on it. I got to the point where wikipedia + lecture notes isn't quite cutting it and I've been unable to find a resource that tackles the subject matter in a way I am comfortable with. A good textbook could make a big difference and there are a few out there that look promising but my heart is not in it at the moment. I'm somewhat torn in between wanting to finish the projects I start and making sure that I am enjoying the projects. However the simple fact is that this plan of less video games and more programming/reading/learning is only sustainable if I am really enjoying the work. My intention is to come back to the compilers stuff in a few months and spend time looking at some of the programs out there that make compiler/interpreter development easier. Previously I avoided this because I wanted to learn the fundamentals and I found this to be worthwhile. It was never my intention to implement an actual compiler or interpreter completely from scratch, although I had hoped to get somewhat further than I did. In any case, the topic that holds my interest at the moment is that of a web development framework called Ruby on Rails. Rails interests me for a number of reasons but I think the primary driver is frustration with ASP.NET at work. I have read about the model-view-controller pattern but never actually worked with it first hand so I am curious about how it compares to developing with web forms. Rather than downloading and configuring Rails locally I am making use of Heroku - they provide free Rails hosting with an integrated development environment. I'm sure you will hear how I'm finding it soon enough.

Posted by Paul at 10:55 0 comments 👁 Image
Labels: Compilers, Rails, Ruby

Thursday, 14 February 2008

Fixing the grammar

The test that was failing was this one:

 def test_left_compound # "((500*20)+16)"
 tokens = [LeftParenToken.new(), LeftParenToken.new(), IntegerToken.new(500), MultiplicationToken.new(), IntegerToken.new(20), RightParenToken.new(), AdditionToken.new(), IntegerToken.new(16), RightParenToken.new()]
 assert(tokens_are_valid(tokens))
end
Why? Well, do what I did - try to build the expression ((500*20)+16) by hand using the rules. Here they are again:
E => (E) | nT T => +E | -E | * E | /E | ε
Give up yet? It can't be done! Also, for the same reason, this grammar cannot generate (1)+2. The simple fact is that I botched the job of making the grammar LL(1) compatible. So how to fix it? Well its simple really, especially when you take the (1)+2 example. Seems to me that a slight tweak to the production rules for E is necessary:
E => (E)T | nT
Great! Lets modify the lookup table accordingly:
lookup[:E][LeftParenToken] = [:lpar, :E, :rpar, :T]
The tests all pass! Now I can turn my attention to using the same algorithm for building the tree. However I have some more reading to do before I can proceed. Come on CS164 lectures notes, don't let me down!
Posted by Paul at 08:39 0 comments 👁 Image
Labels: Compilers

Monday, 11 February 2008

Syntax checking with a LL(1) parser

After reading about some different types of parsers, I decided to have a go at LL(1). This means the parser works Left-to-right, generates the Leftmost derivation and looks one token ahead. The basic idea of LL(1) is to build a stack of symbols that represent the derived expression by using a lookup table that tells you what rule to apply based on the current symbol and current token. Not all context free grammars can be parsed using LL(1) but I think the calculator's grammar can, with some modification. Lets break the work into some smaller steps:

  1. Modify the grammar I described last week to support LL(1) parsing.
  2. Modify the tokens_are_valid method to use LL(1) to verify whether the token stream is well formed. Given that I already have tests for tokens_are_valid, I will know that I am already on the right track once the existing tests are passing with the new implementation.
  3. Add syntax tests for parentheses. Watch them fail.
  4. Modify the tokens_are_valid implementation to make the new tests pass. Once this is happening we are done with using LL(1) for syntax verification and can move on to using LL(1) to actually do the parsing.
  5. Modify the build_tree method to use LL(1) and watch the operator precedence tests fail because my grammar doesn't deal with the precedence issue.
  6. Fix the grammar to support operator precedence, make the corresponding changes to the lookup table and watch the tests pass.
  7. Add evaluation tests for parentheses . I think these babies should automatically work if I've done the other steps right.
In this post I will tackle steps 1,2 and 3. Here is the grammar I described last week:
E => n | E+E | E-E | E*E | E/E | (E)
As it stands, this grammar will not work with LL(1) because to figure out which rule to apply, I have to look at least 2 tokens ahead. For example, starting with E and given two different inputs "1" and "1+1", its obvious that I need to use a different rule for each input, but both of them start with the token "1". This problem can be fixed by modifying the grammar. Here goes:
E => (E) | nT T => +E | -E | * E | /E | ε
The symbol 'ε' seems to represent null or empty. Basically it means that the symbol T can just disappear without producing anything. Resuming the above example, now there is only one rule to apply: E => nT. In the "1" case, the T disappears. In the "1+1" case the T is expanded to +E. Here is the lookup table for this grammar:
n + - * / ( ) $
E nT (E)
T +E -E *E /E ε ε
The '$' represents the end of the input string. The gaps in the table represent syntax errors. Writing the code There are a few issues that need addressing when it comes to implementation. The most obvious one is that I need some way to represent the symbols in the above table. Well it turns out Ruby has a symbol class, though it was not designed to represent grammar symbols - Ruby symbols are just named constants that have no intrinsic value(as far as I can tell). They use the :symbol_name syntax, so for example E and T are now represented as :E and :T. I amended the existing token classes to expose their corresponding symbol:
class AdditionToken < OperatorToken
def symbol
 return :plus
end
end
I need a way to determine if a symbol is terminal or not, and there is a rather nice way to do this - simply defining an extra method on the Symbol class:
class Symbol
def non_terminal?
 return (self == :T || self == :E)
end
end
Just to be clear, Symbol is not my class, I'm just adding an extra method to an existing class. Executing :T.non_terminal? will return true. Next on the agenda is the end token. We don't strictly need an end token but it does make the algorithm simpler.
class EndToken
def symbol
 return :end
end
end
The lookup table can be done as 2 dimensional hash:
def create_lookup_table()
lookup = {}
lookup[:E] = {}
lookup[:E][IntegerToken] = [:int,:T]
lookup[:E][LeftParenToken] = [:lpar, :E, :rpar]
lookup[:T] = {}
lookup[:T][AdditionToken] = [:plus, :E]
lookup[:T][SubtractionToken] = [:minus, :E]
lookup[:T][MultiplicationToken] = [:multiply,:E]
lookup[:T][DivisionToken] = [:divide, :E]
lookup[:T][RightParenToken] = []
lookup[:T][EndToken] = []
return lookup
end
As you can see, the table is first indexed with the non-terminal (:E or :T) and then with a token type. The result is an array of symbols, which may be empty. If an attempt to lookup an undefined entry is made then nil is returned and this is interpreted as a syntax error. Right, with that out of the way, I can begin work on tokens_are_valid. The goal is to replace the old implementation with an LL(1) algorithm and hopefully all the existing tests will pass:
def tokens_are_valid(tokens)
tokens = tokens.dup
symbols = [:E]
lookup = create_lookup_table()

# The symbol and token arrays can be used as stacks if the contents is first reversed.
# An end token/symbol is appended to each before they are reversed.
tokens.push(EndToken.new())
tokens.reverse!
symbols.push(:end)
symbols.reverse!

while(tokens.size > 0)
 current_symbol = symbols.pop() #The current symbol will always be consumed. Safe to pop.
 current_token = tokens.last #The current token won't necessarily be consumed, cannot pop it yet.
 if(current_symbol.non_terminal?)
 result = lookup[current_symbol][current_token.class]
 return false if result.nil? 
 symbols.push(result.reverse)
 symbols.flatten! 
 else
 return false if current_symbol != current_token.symbol
 tokens.pop()
 end 
end

return true
end
Hopefully this is relatively self-explanatory. The bit that may need some explanation is in bold. See, I am using arrays as stacks but to do so I need to reverse the contents - I do this once at the beginning for the tokens and the symbols. However its also necessary to reverse the result of the table lookup for the same reason. Yes, I could have coded the table with the symbols backwards but that is more confusing. Easiest just to push a reversed copy of the result onto the symbol stack. The only problem with doing this is that you end up with an array inside an array, which is why I call the flatten! method. In case you were wondering, Ruby has a convention to use the exclamation mark for methods that modify the instance you called the method on, rather than just returning a modified object. The following should demonstrate:
irb(main):001:0> x = [1,2,3]
=> [1, 2, 3]
irb(main):002:0> y = x.reverse
=> [3, 2, 1]
irb(main):003:0> x
=> [1, 2, 3]
irb(main):004:0> x.reverse!
=> [3, 2, 1]
irb(main):005:0> x
=> [3, 2, 1]
Well, my proposed change to tokens_are_valid works fine - all my tests are still passing! The next step is to write some tests that include parentheses. Here are a few (no need to bore you with the full set):
class ParenthesesSyntaxTests < Test::Unit::TestCase
def test_surround_one_number # "(1)"
 tokens = [LeftParenToken.new(),IntegerToken.new(1), RightParenToken.new()]
 assert(tokens_are_valid(tokens))
end 

def test_surround_operation # "(1+2)"
 tokens = [LeftParenToken.new(),IntegerToken.new(1), AdditionToken.new(), IntegerToken.new(2), RightParenToken.new()]
 assert(tokens_are_valid(tokens))
end

def test_left_compound # "((500*20)+16)"
 tokens = [LeftParenToken.new(), LeftParenToken.new(), IntegerToken.new(500), MultiplicationToken.new(), IntegerToken.new(20), RightParenToken.new(), AdditionToken.new(), IntegerToken.new(16), RightParenToken.new()]
 assert(tokens_are_valid(tokens))
end

def test_right_compound # "(16+(500*20))"
 tokens = [LeftParenToken.new(), IntegerToken.new(16), AdditionToken.new(), LeftParenToken.new(), IntegerToken.new(500), MultiplicationToken.new(), IntegerToken.new(20), RightParenToken.new() , RightParenToken.new()]
 assert(tokens_are_valid(tokens))
end

def test_empty_parens_error # "()"
 tokens = [LeftParenToken.new(), RightParenToken.new()]
 assert(!tokens_are_valid(tokens))
end

def test_wrong_order_error # ")2("
 tokens = [RightParenToken.new(), IntegerToken.new(1), LeftParenToken.new()]
 assert(!tokens_are_valid(tokens))
end

def test_no_operator_adjacent_to_paren_error # "1+2(3+4)"
 tokens = [IntegerToken.new(1), AdditionToken.new(), IntegerToken.new(2), LeftParenToken.new(), IntegerToken.new(3), AdditionToken.new(), IntegerToken.new(4), RightParenToken.new()]
 assert(!tokens_are_valid(tokens))
end

end
All but one pass! Can you guess which one? It turns out this is a problem with my grammar, not the implementation. Next post I will discuss the failing test and what I have to do to fix it.
Posted by Paul at 06:32 0 comments 👁 Image

Saturday, 9 February 2008

More on context free grammars

I started writing a post on the work I've done this week but soon realised that it would be useful to explain grammars a bit more before continuing. After all, it is possible that some of you went to the Wikipedia page for context-free grammars and were (like me) unable to curb the involuntary reflex to close the tab as soon as that stack of ugly math symbols came into view. Anyway, lets get to it. To make a grammar you need the following:

  1. One or more terminal symbols.
  2. One or more non-terminal symbols.
  3. A start symbol (must be non-terminal).
  4. Production rules that relate a non-terminal symbol to a set of symbols.
Terminal symbols are the output of the grammar and are usually written in lower case. Non-terminal symbols are temporary - they cannot appear in a generated string and are designed solely for consumption via production rules. They are usually written using upper case. Ok, so lets define a grammar: Terminal(s): a, b Non-terminal(s): S Start symbol: S The only thing left is the production rule(s):
S => aS | b
This rule says that the non-terminal S can be replaced with either aS or b. Given this fact, it should be apparent that this grammar produces strings that consist of zero or more a's and end with a b. Just to be clear, let me show you the 4 smallest strings that it can produce:
b ab aab aaab
And now some strings that this grammar cannot produce:
a ba baab abaab
So now that we have that clear, the question is how would you write a program that would return true for the top 4 and false for the bottom 4. Ok thats not the question - its trivial. But given an arbitrary set of production rules... now thats much more tricky. In my previous post I proposed a grammar for my calculator and discussed some valid and invalid strings. From what I've read so far, it looks like that getting an implementation that will return true for the valid strings and false for the invalid ones will actually take me 90% of the way to having an implementation for building a parse tree based on the grammar.
Posted by Paul at 09:19 0 comments 👁 Image
Labels: Compilers

Monday, 4 February 2008

Parentheses in the calculator: a first look at context-free grammars

This series on writing a program that can evaluate simple arithmetic expressions is all about baby steps. I have no idea what I'm doing in terms of the problem space and I'm similarly ignorant of my tool, Ruby. But I'm making progress and have a good idea of what is next:

  • expand number support to the rationals
  • report the position of a syntax error when found
  • implement support for parentheses
The first two shouldn't be a big deal, I could probably get them working pretty quickly with the existing code base. Parentheses however, are significantly more work. I know of only one way to deal with them: context-free grammars. Even if I was aware of another way to deal with the parens I would ignore it - this exercise is about learning compilers and I know that implementing a context-free grammar is necessary. I have found the lecture notes for CS 164 at Berkeley University to be quite useful. They read well without a lecturer and handle the topics with a degree of patience that allows me to comprehend it. I really wish I had been given the chance to do a course like this while I was at university. I think the context-free grammar I need can be defined as follows:
E => n | E+E | E-E | E*E | E/E | (E)
E stands for Expression and is the only non-terminal in this grammar. A non-terminal is a token that can be "expanded" further using the grammar rules. The grammar contains the following terminal tokens: 'n', '+', '-', '*', '/', '(' and ')'. Basically this grammar says that an expression can consist of either a number (n), or a combination of two expressions, or an expression in parentheses. Using this grammar we could derive the following strings:
120+16*2 6+((120-2)*4) (((1)))
But note that we couldn't derive:
1+*2 )2+3 4+(-)
Obviously this is the sort of behaviour we want. So now that I have a grammar, I want to implement a function that will take a string and tell me whether it is derivable from the grammar. There are a few ways to do this but I have to read more before I can figure out which one to try to implement first. Its also possible that I will have to modify my grammar to make it parse able.
Posted by Paul at 08:10 0 comments 👁 Image
Labels: Compilers

Saturday, 2 February 2008

The calculator, version one

There are a few loose ends to tie up and then the first pass of the calculator is complete. If you've forgotten where I was up to, here is a quick reminder:

  • The calculator falls over when the numbers get big because I implemented the tree evaluation in a educational but less than practical manner.
  • Whitespace should be ignored.
  • The codebase needs a general cleanup.
None of these are particularly involved - this should be a relatively short post! During the last post I discussed a few ways to implement the tree evaluation. I pointed out the most obvious solution (a switch statement), mentioned polymorphism and then ventured forth into using unbound methods. Upon discovering the issues with the latter approach it was a no brainer to go back to polymorphism:
class OperatorToken
 attr_reader :multitive, :value

 def evaluate(container_node)
 return do_operation(container_node.left_child.evaluate(), container_node.right_child.evaluate())
 end 
end

class AdditionToken < OperatorToken
 def initialize()
 @value = "+"
 @multitive = false
 end

 def do_operation(left_child_value, right_child_value)
 return left_child_value + right_child_value
 end 
end
I think you can guess what the other Tokens look like. The @value field isn't necessary to the implementation but it allowed me to keep a bunch of existing tests. The @multitive field is a replacement of the crappy named "major" field I used to determine whether the operator was higher precedence. I had to make a few minor changes to other sections of the program but it was all very easy. Whitespace is trivial to deal with. I could have easily handled it when implementing the tokenization but I guess it slipped my mind. Lets add a test for it:
 def test_spaces
 assert_equal([14,"+",62,"*",127], tokenize_for_value(" 14 +62* 127 ") ); end
The fix is just stripping whitespace from the string each time a token is generated. I added a few small tests and did my best to clean up the code. You can download it here. Some sample output:
C:\Users\Paul>irb
irb(main):001:0> require 'calculator_v1'
=> true
irb(main):002:0> calculate "10 + 2*6"
=> 22
irb(main):003:0> calculate "1000000000000000000000000000000000-1"
=> 999999999999999999999999999999999
irb(main):004:0> calculate "x-1"
=> "Syntax Error"
irb(main):005:0>
Next post I discuss the what I would like to achieve in the next version.
Posted by Paul at 10:32 0 comments 👁 Image
Labels: Compilers, Ruby

Monday, 28 January 2008

Evaluating the expression tree

Lets start off with some tests:

class EvaluateTests < Test::Unit::TestCase
 def test_1
 assert_equal(1, calculate("1"))
 end
 
 def test_add
 assert_equal(3, calculate("1+2"))
 end
 
 def test_subtract
 assert_equal(3, calculate("7-4"))
 end
 
 def test_multiply
 assert_equal(6, calculate("3*2"))
 end
 
 def test_divide
 assert_equal(3, calculate("6/2"))
 end
 
 def test_negative_result
 assert_equal(-1, calculate("1-2"))
 end
 
 def test_multitive_before_additive
 assert_equal(14, calculate("4*3+2"))
 assert_equal(14, calculate("2+3*4"))
 assert_equal(10, calculate("4+3*2"))
 end
 
 def test_left_to_right
 assert_equal(8, calculate("12-6+2"))
 assert_equal(4, calculate("12/6*2"))
 end
end
Not exactly comprehensive, but its a start. I need to write a calculate method to tie together the work I've done over the last few weeks:
def calculate(expression)
 tokens = tokenize(expression)
 root_node = build_tree(tokens)
 return root_node.evaluate()
end
Evaluating the tree is obviously a recursive function. Here is the first bit:
class BinaryTreeNode
 def evaluate()
 return @token.evaluate(self)
 end
end
I decided the easiest way to do this is have the evaluate method take an argument: the tree node that contains the token. When evaluating an IntegerToken the value is simply returned, no need to use the argument:
class IntegerToken < Token
 def evaluate(containerNode)
 return @value
 end
end
For the OperatorToken, we want: evaluate(left child) [operator] evaluate (right child). There are lots of different ways to do this - a case statement is the simplest:
class OperatorToken
 def evaluate(containerNode)
 case @value
 when "+"
 return containerNode.lhs.evaluate() + containerNode.rhs.evaluate()
 when "-"
 return containerNode.lhs.evaluate() - containerNode.rhs.evaluate()
 when "*"
 return containerNode.lhs.evaluate() * containerNode.rhs.evaluate()
 when "/"
 return containerNode.lhs.evaluate() / containerNode.rhs.evaluate()
 end
 end
 end
end
My my isn't that boring. Another option would be to implement subclasses for OperatorToken - I wouldn't learn much from doing it though. Lets try something different instead:
class OperatorToken < Token
 attr_reader :major
 
 def initialize(value)
 @value = value
 @eval_method = Fixnum.instance_method(value)
 case @value
 when "+","-" 
 @major = false
 when "*","/"
 @major = true
 end
 end
 
 def evaluate(containerNode)
 bound_method = @eval_method.bind(containerNode.lhs.evaluate()) 
 return bound_method.call(containerNode.rhs.evaluate())
 end
end
The three relevant lines of code are highlighted. But first, some background. Operators in Ruby are actually just methods. When I execute the line x = 2 + 3 I am calling the '+' method on the object '2' which is an instance of the class Fixnum. So what I am actually doing with that line is this: x = 2.+(3) . Classes in Ruby are just like normal objects, so you can call methods on them as I am doing in the first highlighted line. If the OperatorToken is constructed with "+", then I acquire an instance of the '+' method defined on the class Fixnum and store it as @eval_method. What I actually store is an instance of UnboundMethod - if you are a .NET developer, it might help to think of it as similar to MethodInfo as you can use it to invoke the method, but you will need to specify the instance you wish to invoke the method on. The second highlighted line does this - the left node is evaluated and the result is our bind target. The third line invokes the bound method and passes the result of evaluating the right node. The result is what we are after and is returned. Yes, this is much more complicated than a switch statement, and surely much slower. But I learned something, and hopefully so did you. Lets run my unit tests:
>ruby calculator.rb
Loaded suite calculator
Started
........................
Finished in 0.022 seconds.

24 tests, 37 assertions, 0 failures, 0 errors
>Exit code: 0
I think its worth mentioning that I have sort of covered up some mistakes I made while implementing the tree building. I didn't initially write the left to right unit tests and so for a while my calculator could do "2+9/3*4" but not "9/3*4" (it returned zero rather than 12). I didn't notice this problem until I ran the stress tester:
def stress_test()
 expressions = generate_expressions(100)
 rejects = []
 expressions.each do |exp|
 rejects << exp if (eval(exp) != calculate(exp)) 
 end
 return rejects
end
 
def generate_expressions(count)
 expressions = []
 operators = ["+","-","*","/"] 
 count.times do
 exp = (rand(100) + 1).to_s
 rand(10).times do
 exp << operators[rand(4)]
 exp << (rand(100) + 1).to_s
 end 
 expressions << exp
 end
 return expressions
end
This baby will generate 100 expressions, each containing up to 10 terms between 1 and 100. It does not generate zeros because I would then have to deal with division by zero problems (I'll let that wait for version two). The eval method is not mine - it is a built in method that uses Ruby's own expression evaluation engine. This is really useful because I can pass an expression string into my program and compare the result with what Ruby came up with. If there is a discrepancy then I add it to the array of rejects and return them when done. The stress test revealed the aforementioned problem with left to right evaluation which I fixed, but it also raised something else. Consider this test:
def test_really_big_number
 assert_equal(1000000000000000000001, calculate("1000000000000000000000+1"))
end
Unfortunately, it doesn't pass:
>ruby calculator.rb
Loaded suite calculator
Started
........E................
Finished in 0.02 seconds.

1) Error:
test_really_big_number(EvaluateTests):
TypeError: bind argument must be an instance of Fixnum
calculator.rb:55:in `bind'
calculator.rb:55:in `evaluate'
calculator.rb:12:in `evaluate'
calculator.rb:118:in `calculate'
calculator.rb:209:in `test_really_big_number'

25 tests, 37 assertions, 0 failures, 1 errors
>Exit code: 1
When a Fixnum gets larger than a certain size it changes into a Bignum. Unfortunately, this wreaks havoc with my evaluation code because my unbound method will only bind to a Fixnum, not a Bignum. There are some workarounds, a relatively simple one already springs to mind. However I don't think they are worth entertaining - going back to the case statement or something similar seems to be a better idea. Next post I clean up this issue and the code in general.
Posted by Paul at 05:45 0 comments 👁 Image

Sunday, 27 January 2008

More syntactic analysis: building an expression tree

Its been a few days since the last post in this series, mostly because this step was trickier than the previous few and partly because once I had the tree building working I couldn't resist adding the evaluation step. This was actually a good thing, because it helped me discover a mistake I made with the tree building and saved me from making a fool of myself by posting some broken crap (writing unit tests should have theoretically prevented this but I suck at them). Anyway, lets get on with it. I need to define the data structure the tree will be built from. I tried enhancing the tokens to act as the tree nodes themselves, but it made the tree manipulations more complicated. Instead I decide to keep it simple:

class BinaryTreeNode
attr_accessor :lhs, :rhs, :token

def initialize(token)
@token = token
end

def leaf?
return @lhs == nil && @rhs == nil
end
end
Pretty self explanatory I think. The tree node holds a reference to a token and its two child nodes. The 'leaf?' method isn't really necessary but it helps simplify the building algorithm. I considered trying to download some sort of diagramming tool to explain the process, but I knew it was going to suck. So instead, you get this: 👁 Image
Hopefully you can follow this without too much trouble. I build the tree by performing the same single operation over and over. I don't know what to call it ('create_parent' isn't particular explanatory), but this is the code:
def create_parent(target_node, operator, integer)
new_node = BinaryTreeNode.new(operator)
new_node.lhs = target_node
new_node.rhs = BinaryTreeNode.new(integer)
return new_node
end
This method takes an existing node and makes it the left child of a new node. It then sets up the operator and the value and returns the new node. In the diagram above, this method is called on the (6) node in step 1 and the result is the tree in step 2. The tree building algorithm only needs to use this operation on two nodes:
  • the root node if the operator is a + or - or if the root node is the only node or if the root operator is a * or /.
  • the root node's right hand child in other cases (the operator is a * or / and the other conditions don't apply).
Once you understand the create_parent operation, you will find there is not much else to it:
def build_tree(tokens)
tokens = tokens.dup
root_node = BinaryTreeNode.new(tokens.shift) 

while(tokens.size > 1)
 operator = tokens.shift
 integer = tokens.shift 
 if(root_node.leaf? || !operator.major || root_node.token.major)
 # If the root_node is the only node or if we have a '+' or '-'
 # OR if the root operator is '*' or '/' then append to root
 root_node = create_parent(root_node, operator, integer) 
 else
 # We have a * or /, append to root.rhs 
 root_node.rhs = create_parent(root_node.rhs, operator, integer) 
 end
end

 return root_node
end
The Operator.major method is a badly named way of determining whether the operator is addition or subtraction (minor) or multiplication or division (major). I'm sure there is some appropriately mathy term that I could use instead but it did not come to me. So while there are still tokens left to process, I take two off, and then do a create_parent with them. The logic I described above is used to figure out where the new node is created and thats it. I must admit I was pleased with the simplicity of this approach. If the next operator is addition or subtraction then you want the existing expression to be evaluated first, so simply creating a new root node and attaching the existing tree to the left does the job. If the next operator is multiplication or division, then you want it to take precedence before the root operator, UNLESS the root operator is also multiplication or division (to preserve left to right evaluation). You might be wondering what that first line of code is, 'tokens = tokens.dup'. This was a quick fix for a small issue I overlooked. I am using Array.shift to work my way through the tokens and as a result when the calling code returns, the array it passed in will be empty. The 'dup' method simply returns a clone of the array which I can modify without risk of breaking anything. I tried writing unit tests for the tree building but they end up being quite lame. Its fiddly work to verify via code that the nodes are arranged into the correct structure. I know it goes against the spirit of unit testing but I found it easier to test the tree structure simply by skipping forward to implementing evaluation and using the evaluation tests. Tomorrow I will post the wacky implementation I chose for evaluating the tree. Its probably not the best way to do it but it demonstrates some cool Ruby functionality.
Posted by Paul at 13:05 5 comments 👁 Image
Labels: Compilers, Ruby

Tuesday, 22 January 2008

Really simple syntactic analysis

(Continues from the previous post) So now I have an array of these token objects that I want to build into a binary tree. I said yesterday that I would not handle syntax errors in this first version but I have since realised that I was only talking about a particular type of syntax error: one that occurs at the lexical level. For example, the string "32+65-chocolate" is not an expression I expect my lexer to handle. Now if I was writing a lexer for a language that allows variable declarations, the aforementioned string might be valid because the previous line was "chocolate=6". That however, is outside the scope of this exercise so for now "chocolate" does not lex. So with that said, its time to discuss the other type of syntax error that springs to mind. Here are some examples:

  • 1+
  • 2*+3
  • -4
Uhoh. How is "-4" a syntax error? Its time to spill the beans: I lied when I said my lexer would support integers. This first version will only support non-negative integers. The important thing is to stay focused on getting something working, no matter how basic. I'll deal with those pesky negative numbers later. Ok, so I wrote units tests for the three examples above:
class SyntaxTests < Test::Unit::TestCase
def test_disallow_end_with_operator
tokens = [IntegerToken.new("1"), OperatorToken.new("+")]
assert(!tokens_are_valid(tokens))
end

def test_disallow_two_consecutive_operators
tokens = [IntegerToken.new("2"), OperatorToken.new("*"), OperatorToken.new("-"), IntegerToken.new("3")]
assert(!tokens_are_valid(tokens))
end

def test_disallow_start_with_operator
tokens = [OperatorToken.new("-"), IntegerToken.new("4")]
assert(!tokens_are_valid(tokens))
end
end
Ok now I need to make them pass. No problem!!
def tokens_are_valid(tokens)
return false
end
Ok that was some savage cheats. I obviously need to define some tests for valid syntax aswell as invalid syntax:
def test_allow_two_numbers
tokens = [IntegerToken.new("2"), OperatorToken.new("+"), IntegerToken.new("6")]
assert(tokens_are_valid(tokens))
end

def test_allow_three_numbers
tokens = [IntegerToken.new("73"), OperatorToken.new("*"), IntegerToken.new("56"), OperatorToken.new("-"), IntegerToken.new("94")]
assert(tokens_are_valid(tokens))
end
Ugh, those token constructors are taking up a lot of space aren't they? I might do something about that later. Also while I'm on the topic of token constructors, I also need to write tests to ensure that they raise exceptions if I do something silly like OperatorToken.new("6"). That will also have to wait. Getting back to the topic at hand, I now have some tests that fail:
>ruby calculator.rb
Loaded suite calculator
Started
FF..........
Finished in 0.009 seconds.

1) Failure:
test_allow_three_numbers(SyntaxTests) [calculator.rb:81]:
<false> is not true.

2) Failure:
test_allow_two_numbers(SyntaxTests) [calculator.rb:76]:
<false> is not true.

12 tests, 12 assertions, 2 failures, 0 errors
>Exit code: 1
Ok, now its time to flake out a bit. I figure I'm supposed to define some sort of context-free grammar but it looks kinda hard. Lets do something way easier:
def tokens_are_valid(tokens)
return false if(tokens.empty?)
return false if(tokens.first.class != IntegerToken)
return false if(tokens.last.class != IntegerToken)

expect_next_is_integer = true
tokens.each do |token|
 token_is_integer = (token.class == IntegerToken)
 return false if(expect_next_is_integer != token_is_integer) 
 expect_next_is_integer = !expect_next_is_integer
end

return true
end
If there are no tokens, or we do not start or end with an integer token, then return false. Then just loop through them and make sure the types alternate. I'm assuming that there are only two types of tokens - we will run into problems when it comes time to implement parentheses but that is to be expected. Ok lets give it a try:
>ruby calculator.rb
Loaded suite calculator
Started
............
Finished in 0.001 seconds.

12 tests, 12 assertions, 0 failures, 0 errors
>Exit code: 0
Good job. The process of building the binary tree will be much easier now that we can ensure that our array of tokens is valid.
Posted by Paul at 08:04 6 comments 👁 Image

Monday, 21 January 2008

Really simple lexical analysis

(Continues from the previous post)
If I ever develop my own programming language, you can bet I won't be writing the lexer by hand. Programs such as Flex will generate a lexer based on the regular expressions you feed it. However the purpose of this exercise is to learn the fundamentals of how compilers work, so today I will write the lexer. It should be easy because the strings I am parsing are so simple. I'm going to write a bunch of unit tests to define the behavior I am aiming for. Here is an example:
def test_integer_operator_integer
assert_equal([193,"*",671], tokenize_for_value("193*671") ); end
Basically this is saying that when passed the string "193*671", I expect the tokenize_for_value function to return an array that contains the values 193,"*" and 671. This is pretty simple and could be done in lots of different ways. I chose a mechanism that hopefully somewhat resembles what a proper lexer would do. First I define two types of token:
class IntegerToken
attr_reader :value
def initialize(value)
@value = value.to_i()
end
end

class OperatorToken
attr_reader :value
def initialize(value)
@value = value
end
end
The only difference between them at this point is that the IntegerToken converts its value to an integer (duh). I expect to extend these classes once I get to the next step but for now they will suffice. Next, I define an array of regular expressions that recognise integers and operators:
definitions = []
definitions << [/^\d+/, IntegerToken]
definitions << [/^\+/, OperatorToken]
definitions << [/^\-/, OperatorToken]
definitions << [/^\*/, OperatorToken]
definitions << [/^\//, OperatorToken]
Now we just loop our way through the string, matching the regular expressions against it and slicing out the characters that were matched successfully and build a token for them.
 tokens = []
lastSize = 0
# Track the number of remaining characters in the expression.
# If we manage to go through all the regular expressions without
# getting a match then we can parse no more.
while(expression.size != lastSize)
lastSize = expression.size
definitions.each do |regex, token_type|
# Do a match with the regex against the string, remove the
# characters that match and put them into lexmeme
lexmeme = expression.slice!(regex)
if (lexmeme)
 # We got a match, create a new token and break out of
 # the regex loop to preserve the regex ordering.
 tokens << token_type.new(lexmeme)
 break
end
end
end

return tokens
It would be fairly easy to add some code that will report if we had to finish parsing with characters still remaining - this would only happen in the case of a syntax error. I'll add that later once I have a basic version of the calculator working. Finally, I need a method that will take the string, call tokenize and give me an array of their values to make the unit tests easy to write:
def tokenize_for_value(expression)
tokens = tokenize(expression)
return tokens.map { |t| t.value }
end
When I run the ruby file (source) the unit tests execute. I wrote seven, so here is the output:
>ruby calculator_source.rb
Loaded suite calculator_source
Started
.......
Finished in 0.001 seconds.

7 tests, 7 assertions, 0 failures, 0 errors
>Exit code: 0
I'm pleased with how easy it is to write and run unit tests in ruby. There was very little friction. Next up: building a binary tree out of the tokens.
Posted by Paul at 08:51 0 comments 👁 Image
Labels: Compilers, Ruby

Sunday, 20 January 2008

Discovering compilers

My first exercise for learning how compilers work is to write a program that takes a string containing a mathematical expression and evaluates the result e.g. "2+3*7" => 23. I will be implementing it using Ruby and will write unit tests as I go. Since my goal is to learn the underlying concepts of compiler construction, I am avoiding any libraries/functionality provided by Ruby that would make it too easy. The first implementation will ignore whitespace and accept strings containing integers and the 4 standard operators (plus, minus, multiply, divide). The second implementation will report syntax errors and accept decimal numbers and parentheses. According to my research on Wikipedia, this exercise can be broken into the following steps:

  1. lexical analysis (identifying the tokens that appear in the string)
  2. syntactic analysis (assembling the tokens into a tree)
  3. semantic analysis (evaluating the tree for a result)
I will implement the lexical analysis step using regular expressions. They are well suited for the task and I need the practice with them. The syntactic analysis is the tricky bit because I have to follow order of operations. The semantic analysis will be trivial as it simply requires a recursive algorithm executed on the root node of the tree. Next up: Implementing lexical analysis.
Posted by Paul at 01:47 0 comments 👁 Image
Labels: Compilers
Subscribe to: Posts (Atom)