Not Happy, Happy Documentation (Compiler Series Part V)

After reading about monads in chapter 14 of Real World Haskell I thought I understood them to some extant. Then I read the section of the Happy user guide on threaded lexers and I was still confused on how to create a threaded lexer.

After looking at the threaded lexers section and not understanding it, I looked at the Haskell code that Happy generates when it runs. It turns out that the regular parser generator without a threaded lexer passes the token that caused the parse error as the first token in the token list to the “parseError” function. They DO NOT mention this in the documentation.

To grab the token, line and, column# that caused the parse error use the following function

parseError :: [Token] -> a
parseError tokenList = let pos = tokenPosn(head(tokenList)) 
  in 
  error ("parse error at " ++ show (head(tokenList))++ show(getLineNum(pos)) ++ ":" ++ show(getColumnNum(pos)))

The key point is the call to “head(tokenList)” this grabs the first token in the tokenList which is the token that caused the parse error.

You can download the source code and check it out here. To test the error reporting run “make” and then

bbell@bbell-desktop:~/NewL-Compiler/parser_with_error_reporting$ cat ../test_files/test1.txt  | ./newl 
newl: parse error at line 3 and column 9
Advertisements

Parsing and Monads

My next compiler series post will cover error reporting in the parser. For good error reporting in the parser we need to at least report the line and column number of the error. This implies more communication between the parser and the lexer. In particular we have to tell Happy to use a threaded lexer. This implies the usage of monads. I’ve gotten up to this point in Haskell with no real understanding of monads.

I have to split off my compiler series and read about and more importantly understand what monads are and what they do.

My first step is to read the chapter 14 of Real World Haskell on monads. I promise I’ll be back after my excursion. It might be a while since I’m not too bright and it takes me a while to grok abstractions. I’ll do my best and report back.

Having Fun with Happy (Compiler Series Part III)

The next step to writing a compiler is creating the parser. For the parser generator I’m using Happy which is the Haskell version of yacc.

To make following along with me easier I’ve uploaded the parser and other files for NewL to github.com at http://github.com/bjwbell/NewL-Compiler. The parser is in the parser directory. Please note that the parser contains no error checking. I’ll explain how to add error checking in a later blog post.

To play with the parser and test it go to http://github.com/bjwbell/NewL-Compiler/archives/master. It’ll ask you to download the source. Once you’ve downloaded the source and unzipped it, open a terminal and go to the parser directory.

To compile the parser you’ll need to have the Glasgow Haskell Compiler collection installed. You also need to have Happy and Alex installed. If you’re running Ubuntu all of them are available via the Ubuntu repositories. To compile the parser run

make

and to test the parser run

cat test1.newl | ./newl

It will output the parse tree representation of the test1.newl program.

Before going into how I wrote the parser I want to mention that to get the scanner and parser to play together you have to delete

main = do
  s <- getContents
  print (alexScanTokens s)

from scanner.x. Otherwise Alex would generate a standalone scanner.

For an explanation of how the parts to the parser fit together I strongly encourage you to read http://www.haskell.org/happy/doc/html/sec-using.html. It has a very well written explanation of writing a parser using Happy.

The first part of the parser is to declare the name and what modules we import.

{
module Main where
import Scanner
}

The important part is “import Scanner”. This imports the scanner that was declared in scanner.x. In particular notice that the name “Scanner” matches the module name declared in scanner.x. The second part is to declare the name of our parser, you can choose any name you like. We also declare the token type which must match the token type declared in scanner.x, finally we give the name of the error function.

%name newl
%tokentype { Token }
%error { parseError }

Now onto to the meat of the parser. We declare a list of tokens as I’ve done below.

%token
  "class"                      { TClass }
  "new"                        { TNew }
  "String"                     { TString }
  "static"                     { TStatic }
  "void"                       { TVoid }
  "main"                       { TMain }
  "public"                     { TPublic }
  "return"                     { TReturn }
  "extends"                    { TExtend }
  "int"	                       { TInt }
  "boolean"                    { TBool }
  "if"                         { TIf }
  "else"                       { TElse }
  "true"                       { TTrue }
  "false"                      { TFalse }
  "this"                       { TThis }
  "length"                     { TLength }
  "while"                      { TWhile }
  integer_literal              { TIntLiteral $$ }
  ident                        { TIdent $$ }
  "{"                          { TLeftBrace }
  "}"                          { TRightBrace }
  ","                          { TComma }
  "["                          { TLeftBrack }
  "]"				           { TRightBrack }
  op                           { TOp $$}
  comop                        { TComOp $$ }
  "("                          { TLeftParen }
  ")"                          { TRightParen }
  ";"                          { TSemiColon }
  "."                          { TPeriod }
  "!"                          { TNot }
  "="                          { TEquals }
  "System.out.println"         { TPrint }
%%

“The symbols on the left are the tokens as they will be referred to in the rest of the grammar, and to the right of each token enclosed in braces is a Haskell pattern that matches the token. The parser will expect to receive a stream of tokens, each of which will match one of the given patterns .” [1] (the definition of the Token datatype is given in scanner.x)

Now we declare the production rules to the grammar (to view the grammar in BNF notation see https://bjbell.wordpress.com/2010/01/20/writing-a-compiler-in-haskell-compiler-series-part-i/).

Program : 
        MainClass ClassDeclList { Program $1 $2 }
MainClass : 
          "class" ident "{" "public" "static" "void" "main" "(" "String" "[" "]" ident ")" "{" Statement "}" "}" { MClass $2 $12 $15 }


ClassDeclList :
          ClassDecl     { ClassDeclList $1 CEmpty }
          | ClassDecl ClassDeclList { ClassDeclList $1 $2 }
          |             { CEmpty }

ClassDecl : 
          "class" ident "{" VarDeclList MethodDeclList "}"                     { ClassDecl $2 "void" $4 $5 }
          | "class" ident "extends" ident "{" VarDeclList MethodDeclList "}"   { ClassDecl $2 $4 $6 $7 }


MethodDeclList :
     MethodDecl                   { MethodDeclList $1 MEmpty }
     | MethodDecl MethodDeclList  { MethodDeclList $1 $2 }
     |                            { MEmpty }

MethodDecl : 
     "public" Type ident "(" FormalList ")" "{" VarDeclList StatementList "return" Exp ";" "}" { MethodDecl $2 $3 $5 $8 $9 $11 }

VarDeclList :
     Type ident ";" { VarDeclList $1 $2 VEmpty }
     | Type ident ";" VarDeclList { VarDeclList $1 $2 $4 }
     |              { VEmpty }

FormalList :
     Type ident       { FormalList $1 $2 FEmpty }
     | Type ident FormalList { FormalList $1 $2 $3 }

Type :
     "int" "[" "]"    { TypeIntArray }
     | "boolean"      { TypeBoolean }
     | "int"          { TypeInt }
     | ident          { TypeIdent $1 }

Statement :
    "{" StatementList "}"                            { SList $2 }
    | "if" "(" Exp ")" Statement "else" Statement  { SIfElse $3 $5 $7 }
    | "while" "(" Exp ")" Statement                { SWhile $3 $5 }
    | "System.out.println" "(" Exp ")" ";"         { SPrint $3 }
    | ident "=" Exp ";"                              { SEqual $1 $3 }
    | ident "[" Exp "]" "=" Exp ";"                  { SArrayEqual $1 $3 $6 }

StatementList :
    Statement               { StatementList Empty $1 }
    | StatementList Statement   { StatementList $1 $2 }

Exp : 
    Exp op Exp                        { ExpOp $1 $2 $3}
    | Exp comop Exp                   { ExpComOp $1 $2 $3}
    | Exp "[" Exp "]"                 { ExpArray $1 $3}
    | Exp "." "length"                { ExpLength $1}
    | Exp "." ident "(" ExpList ")"   { ExpFCall $1 $3 $5}
    | integer_literal                 { ExpInt $1}
    | "true"                          { ExpBool True}
    | "false"                         { ExpBool False}
    | ident                           { ExpIdent $1}
    | "this"                          { ExpThis }
    | "new" "int" "[" Exp "]"         { ExpNewInt $4 }  
    | "new" ident "(" ")"             { ExpNewIdent $2}
    | "!" Exp                         { ExpNot $2}
    | "(" Exp ")"                     { ExpExp $2}

ExpList :
        Exp            { ExpListExp $1 }
        | Exp ExpRest  { ExpList $1 $2 }
        |              { ExpListEmpty }

ExpRest : 
     "," Exp      { ExpRest $2 }

“Each production consists of a non-terminal symbol on the left, followed by a colon, followed by one or more expansions on the right, separated by |. Each expansion has some Haskell code associated with it, enclosed in braces as usual.” [1]

“The way to think about a parser is with each symbol having a `value’: we defined the values of the tokens above, and the grammar defines the values of non-terminal symbols in terms of sequences of other symbols (either tokens or non-terminals).” [1]

With production rules declared we now declare the data types for them. Which I’ve listed below.

{
parseError :: [Token] -> a
parseError _ = error "Parse error"


data Program 
    = Program MainClass ClassDeclList
      deriving (Show, Eq)



data MainClass
    = MClass String String Statement
      deriving (Show, Eq)

data ClassDeclList
    = ClassDeclList ClassDecl ClassDeclList
    | CEmpty
  deriving (Show, Eq)

data ClassDecl = ClassDecl Ident Ident VarDeclList MethodDeclList
  deriving (Show, Eq)


data MethodDeclList
    = MethodDeclList MethodDecl MethodDeclList
    | MEmpty
    deriving (Show, Eq)
data MethodDecl
    = MethodDecl Type Ident FormalList VarDeclList StatementList Exp
    deriving (Show, Eq)

data VarDeclList =
    VarDeclList Type Ident VarDeclList
    | VEmpty
    deriving (Show, Eq)

data FormalList = 
    FormalList Type Ident FormalList
    | FEmpty
  deriving (Show, Eq)

data Type =
    TypeIntArray
    | TypeBoolean
    | TypeInt
    | TypeIdent Ident
    deriving (Show, Eq)

data Statement
    = Statement String
    | SList StatementList
    | SIfElse Exp Statement Statement
    | SWhile Exp Statement
    | SPrint Exp
    | SEqual Ident Exp
    | SArrayEqual Ident Exp Exp
    | StatementError
    deriving (Show, Eq)

data StatementList
    = StatementList StatementList Statement 
    | Empty
    deriving (Show, Eq)


data Exp
    = Exp String
    | ExpOp Exp Char Exp
    | ExpComOp Exp Char Exp
    | ExpArray Exp Exp -- "Exp [ Exp ]"
    | ExpFCall Exp Ident ExpList -- Exp . Ident ( ExpList )
    | ExpInt Int
    | ExpNewInt Exp
    | ExpBool Bool -- True or False
    | ExpIdent Ident
    | ExpNewIdent Ident -- new Ident ()
    | ExpExp Exp -- Exp ( Exp )
    | ExpThis
    | ExpNot Exp
    | ExpLength Exp
    | ExpError
    deriving (Show, Eq)

data Op
     = And
     | LessThan
     | Plus
     | Minus
     | Times
     deriving (Show, Eq)

type Ident = String
type Integer_Literal = Int
data ExpList
    = ExpList Exp ExpRest
    | ExpListEmpty
    | ExpListExp Exp
    deriving (Show, Eq)
data ExpRest
    = ExpRest Exp
    deriving (Show, Eq)



main = do 
  inStr <- getContents
  let parseTree = newl (alexScanTokens inStr)  
  putStrLn ("parseTree:" ++ show(parseTree))
  print "done"
}

I’ve also defined the “parseError” function and the main function.
The main function uses “getContents” to retrieve the input from stdin and then creates the parse tree using “newl (alexScanTokens inStr)”. Note the “newl” name is the same one we declared using “%name newl” at the beginning of the program.

With the above code we now have a working parser for the NewL language. At this point if there’s an error parsing it barfs out “parse error” and doesn’t tell us anything about the error. Also no type or semantic analysis is done. I’ll expand on both of these points in later posts.

References
[1] Chapter 2. Using Happy, http://www.haskell.org/happy/doc/html/sec-using.html