Playing With Alex Again (Compiler Series Part IV)

Now that we have a scanner and parser I want to go back and add error reporting to both. For this post I’m only covering error reporting for the scanner. Our goal is simple, we want to abort on the first error the scanner encounters and print out the line and column number of the error.

In my previous post on the scanner I mentioned that we used the “basic” wrapper for Alex which provided a simple interface to the scanner that took a String and outputted a list of tokens.

For error reporting we need something more sophisticated. In particular the “posn” wrapper is almost perfect for our purposes. To use the “posn” wrapper we need to modify our Token type by annotating it with one property of type AlexPosn. The “posn” wrapper defines AlexPosn to contain the following three numbers, the absolute offset of the token, the line number of the token, and the column number of the token. Specifically

data AlexPosn = AlexPn !Int  -- absolute character offset
                       !Int  -- line number
                       !Int  -- column number

For now you can ignore the “!” in the “!Int”, for explanation of what it means go here.

This is exactly what we need for error reporting. To use the “posn” wrapper we have to modify our Token type to include an AlexPosn for representing the token position. For example

data Token =
     	TLeftBrace       |

becomes

data Token =
     	TLeftBrace AlexPosn     |

We also have to modify the scanner actions to include the passed AlexPosn value on each action. I’ve included the modified actions below.

tokens :-

  $white+				;
  "class"				{ \p s -> TClass p }
  "new"					{ \p s -> TNew p }
  "String"				{ \p s -> TString p }
  "static"				{ \p s -> TStatic p }
  "void"				{ \p s -> TVoid p }
  "main"				{ \p s -> TMain p }
  "return"              { \p s -> TReturn p }
  "public"				{ \p s -> TPublic p }
  "extends"				{ \p s -> TExtend p }
  "int"					{ \p s -> TInt p }
  "boolean"				{ \p s -> TBool p }
  "if"					{ \p s -> TIf p }
  "else"				{ \p s -> TElse p }
  "true"				{ \p s -> TTrue p }
  "false"				{ \p s -> TFalse p }
  "this"				{ \p s -> TThis p }
  "length"				{ \p s -> TLength p }
  "while"				{ \p s -> TWhile p }
  $digit+				{ \p s -> TIntLiteral p (read s) }
  "."                   { \p s -> TPeriod p }
  "&&"			        { \p s -> TOp p (head s) }
  "!"					{ \p s -> TNot p }
  [\+\-\*\/]            { \p s -> TOp p (head s) }
  "<"                   { \p s -> TComOp p (head s) }
  "="					{ \p s -> TEquals p }
  ";" 					{ \p s -> TSemiColon p }
  "("					{ \p s -> TLeftParen p }
  ")"					{ \p s -> TRightParen p }
  $alpha[$alpha $digit \_ \']*	{ \p s -> TIdent p s }
  @string 	       	    { \p s -> TStringLiteral p (init (tail s)) -- remove the leading and trailing double quotes }
  "{"	 	 	   		{ \p s -> TLeftBrace p }
  "}"					{ \p s -> TRightBrace p }
  ","					{ \p s -> TComma p }
  "["					{ \p s -> TLeftBrack p }
  "]"					{ \p s -> TRightBrack p }
  "System.out.println"  { \p s -> TPrint p }
-- Each action has type ::AlexPosn -> String -> Token

Picking one specific example.

$digit+				{ \p s -> TIntLiteral p (read s) }

The “\p s -> TIntLiteral p (read s)” is an anonymous function that is passed two arguments, a String, “s”, and an AlexPosn, “p”. The function returns the token “TIntLiteral” with p as the token position and (read s) as the value of the actual integer.

The only issue with the “posn” wrapper is that the “alexScanTokens” function does not include any information when an error is encountered; it simply aborts with the message “lexical error” with no information on the error line or column number. To fix this we define a new function “alexScanTokens2” that includes better error reporting. The below code defines two helper functions and the modified “alexScanTokens2”.

getLineNum :: AlexPosn -> Int
getLineNum (AlexPn offset lineNum colNum) = lineNum 

getColumnNum :: AlexPosn -> Int
getColumnNum (AlexPn offset lineNum colNum) = colNum

--alexScanTokens :: String -> [token]
alexScanTokens2 str = go (alexStartPos,'\n',str)
  where go inp@(pos,_,str) =
          case alexScan inp 0 of
                AlexEOF -> []
                AlexError _ -> error ("lexical error @ line " ++ show (getLineNum(pos)) ++ " and column " ++ show (getColumnNum(pos)))
                AlexSkip  inp' len     -> go inp'
                AlexToken inp' len act -> act pos (take len str) : go inp'

main = do
  s <- getContents
  print (alexScanTokens2 s)

The definition of “alexScanTokens2” may look scary at first if you’re a new Haskell programmer like me but if we break it down, it’s not so complicated.
First the two lines

alexScanTokens2 str = go (alexStartPos,'\n',str)
  where go inp@(pos,_,str) =

define “alexScanTokens2” to take a String, “str”, as input and defines it to return the value of “go (alexStartPos, ‘\n’, str)”.
The where clause defines what “go (alexStartPost, ‘\n’, str)” is. Which in this case it is

go inp@(pos,_,str)

Please note that the “inp@” gives “inp” the value of “(pos, _, str)”. That is “inp” is defined as a three-tuple. Saying it another way, the “@” is syntactic sugar for defining the variable “inp” as the value of “(pos, _, str)”. To figure out the case expression I recommend looking up the definition of “alexScan”, here, it’s under the “Basic Interface” section.

The only portion of the case expression that I modified from the default “alexScanTokens” function provided by the “posn” wrapper was to change the AlexError case to

AlexError _ -> error ("lexical error @ line " ++ show (getLineNum(pos)) ++ " and column " ++ show (getColumnNum(pos)))

As the code self evidently shows, a simple message is printed giving the line and column number of the error and the program then aborts.

To try out our new scanner you can download it at http://github.com/bjwbell/NewL-Compiler under the “scanner_with_error_reporting” directory. There’s a small readme file with instructions on how to test the scanner.

Advertisements

Having Fun with Happy (Compiler Series Part III)

The next step to writing a compiler is creating the parser. For the parser generator I’m using Happy which is the Haskell version of yacc.

To make following along with me easier I’ve uploaded the parser and other files for NewL to github.com at http://github.com/bjwbell/NewL-Compiler. The parser is in the parser directory. Please note that the parser contains no error checking. I’ll explain how to add error checking in a later blog post.

To play with the parser and test it go to http://github.com/bjwbell/NewL-Compiler/archives/master. It’ll ask you to download the source. Once you’ve downloaded the source and unzipped it, open a terminal and go to the parser directory.

To compile the parser you’ll need to have the Glasgow Haskell Compiler collection installed. You also need to have Happy and Alex installed. If you’re running Ubuntu all of them are available via the Ubuntu repositories. To compile the parser run

make

and to test the parser run

cat test1.newl | ./newl

It will output the parse tree representation of the test1.newl program.

Before going into how I wrote the parser I want to mention that to get the scanner and parser to play together you have to delete

main = do
  s <- getContents
  print (alexScanTokens s)

from scanner.x. Otherwise Alex would generate a standalone scanner.

For an explanation of how the parts to the parser fit together I strongly encourage you to read http://www.haskell.org/happy/doc/html/sec-using.html. It has a very well written explanation of writing a parser using Happy.

The first part of the parser is to declare the name and what modules we import.

{
module Main where
import Scanner
}

The important part is “import Scanner”. This imports the scanner that was declared in scanner.x. In particular notice that the name “Scanner” matches the module name declared in scanner.x. The second part is to declare the name of our parser, you can choose any name you like. We also declare the token type which must match the token type declared in scanner.x, finally we give the name of the error function.

%name newl
%tokentype { Token }
%error { parseError }

Now onto to the meat of the parser. We declare a list of tokens as I’ve done below.

%token
  "class"                      { TClass }
  "new"                        { TNew }
  "String"                     { TString }
  "static"                     { TStatic }
  "void"                       { TVoid }
  "main"                       { TMain }
  "public"                     { TPublic }
  "return"                     { TReturn }
  "extends"                    { TExtend }
  "int"	                       { TInt }
  "boolean"                    { TBool }
  "if"                         { TIf }
  "else"                       { TElse }
  "true"                       { TTrue }
  "false"                      { TFalse }
  "this"                       { TThis }
  "length"                     { TLength }
  "while"                      { TWhile }
  integer_literal              { TIntLiteral $$ }
  ident                        { TIdent $$ }
  "{"                          { TLeftBrace }
  "}"                          { TRightBrace }
  ","                          { TComma }
  "["                          { TLeftBrack }
  "]"				           { TRightBrack }
  op                           { TOp $$}
  comop                        { TComOp $$ }
  "("                          { TLeftParen }
  ")"                          { TRightParen }
  ";"                          { TSemiColon }
  "."                          { TPeriod }
  "!"                          { TNot }
  "="                          { TEquals }
  "System.out.println"         { TPrint }
%%

“The symbols on the left are the tokens as they will be referred to in the rest of the grammar, and to the right of each token enclosed in braces is a Haskell pattern that matches the token. The parser will expect to receive a stream of tokens, each of which will match one of the given patterns .” [1] (the definition of the Token datatype is given in scanner.x)

Now we declare the production rules to the grammar (to view the grammar in BNF notation see https://bjbell.wordpress.com/2010/01/20/writing-a-compiler-in-haskell-compiler-series-part-i/).

Program : 
        MainClass ClassDeclList { Program $1 $2 }
MainClass : 
          "class" ident "{" "public" "static" "void" "main" "(" "String" "[" "]" ident ")" "{" Statement "}" "}" { MClass $2 $12 $15 }


ClassDeclList :
          ClassDecl     { ClassDeclList $1 CEmpty }
          | ClassDecl ClassDeclList { ClassDeclList $1 $2 }
          |             { CEmpty }

ClassDecl : 
          "class" ident "{" VarDeclList MethodDeclList "}"                     { ClassDecl $2 "void" $4 $5 }
          | "class" ident "extends" ident "{" VarDeclList MethodDeclList "}"   { ClassDecl $2 $4 $6 $7 }


MethodDeclList :
     MethodDecl                   { MethodDeclList $1 MEmpty }
     | MethodDecl MethodDeclList  { MethodDeclList $1 $2 }
     |                            { MEmpty }

MethodDecl : 
     "public" Type ident "(" FormalList ")" "{" VarDeclList StatementList "return" Exp ";" "}" { MethodDecl $2 $3 $5 $8 $9 $11 }

VarDeclList :
     Type ident ";" { VarDeclList $1 $2 VEmpty }
     | Type ident ";" VarDeclList { VarDeclList $1 $2 $4 }
     |              { VEmpty }

FormalList :
     Type ident       { FormalList $1 $2 FEmpty }
     | Type ident FormalList { FormalList $1 $2 $3 }

Type :
     "int" "[" "]"    { TypeIntArray }
     | "boolean"      { TypeBoolean }
     | "int"          { TypeInt }
     | ident          { TypeIdent $1 }

Statement :
    "{" StatementList "}"                            { SList $2 }
    | "if" "(" Exp ")" Statement "else" Statement  { SIfElse $3 $5 $7 }
    | "while" "(" Exp ")" Statement                { SWhile $3 $5 }
    | "System.out.println" "(" Exp ")" ";"         { SPrint $3 }
    | ident "=" Exp ";"                              { SEqual $1 $3 }
    | ident "[" Exp "]" "=" Exp ";"                  { SArrayEqual $1 $3 $6 }

StatementList :
    Statement               { StatementList Empty $1 }
    | StatementList Statement   { StatementList $1 $2 }

Exp : 
    Exp op Exp                        { ExpOp $1 $2 $3}
    | Exp comop Exp                   { ExpComOp $1 $2 $3}
    | Exp "[" Exp "]"                 { ExpArray $1 $3}
    | Exp "." "length"                { ExpLength $1}
    | Exp "." ident "(" ExpList ")"   { ExpFCall $1 $3 $5}
    | integer_literal                 { ExpInt $1}
    | "true"                          { ExpBool True}
    | "false"                         { ExpBool False}
    | ident                           { ExpIdent $1}
    | "this"                          { ExpThis }
    | "new" "int" "[" Exp "]"         { ExpNewInt $4 }  
    | "new" ident "(" ")"             { ExpNewIdent $2}
    | "!" Exp                         { ExpNot $2}
    | "(" Exp ")"                     { ExpExp $2}

ExpList :
        Exp            { ExpListExp $1 }
        | Exp ExpRest  { ExpList $1 $2 }
        |              { ExpListEmpty }

ExpRest : 
     "," Exp      { ExpRest $2 }

“Each production consists of a non-terminal symbol on the left, followed by a colon, followed by one or more expansions on the right, separated by |. Each expansion has some Haskell code associated with it, enclosed in braces as usual.” [1]

“The way to think about a parser is with each symbol having a `value’: we defined the values of the tokens above, and the grammar defines the values of non-terminal symbols in terms of sequences of other symbols (either tokens or non-terminals).” [1]

With production rules declared we now declare the data types for them. Which I’ve listed below.

{
parseError :: [Token] -> a
parseError _ = error "Parse error"


data Program 
    = Program MainClass ClassDeclList
      deriving (Show, Eq)



data MainClass
    = MClass String String Statement
      deriving (Show, Eq)

data ClassDeclList
    = ClassDeclList ClassDecl ClassDeclList
    | CEmpty
  deriving (Show, Eq)

data ClassDecl = ClassDecl Ident Ident VarDeclList MethodDeclList
  deriving (Show, Eq)


data MethodDeclList
    = MethodDeclList MethodDecl MethodDeclList
    | MEmpty
    deriving (Show, Eq)
data MethodDecl
    = MethodDecl Type Ident FormalList VarDeclList StatementList Exp
    deriving (Show, Eq)

data VarDeclList =
    VarDeclList Type Ident VarDeclList
    | VEmpty
    deriving (Show, Eq)

data FormalList = 
    FormalList Type Ident FormalList
    | FEmpty
  deriving (Show, Eq)

data Type =
    TypeIntArray
    | TypeBoolean
    | TypeInt
    | TypeIdent Ident
    deriving (Show, Eq)

data Statement
    = Statement String
    | SList StatementList
    | SIfElse Exp Statement Statement
    | SWhile Exp Statement
    | SPrint Exp
    | SEqual Ident Exp
    | SArrayEqual Ident Exp Exp
    | StatementError
    deriving (Show, Eq)

data StatementList
    = StatementList StatementList Statement 
    | Empty
    deriving (Show, Eq)


data Exp
    = Exp String
    | ExpOp Exp Char Exp
    | ExpComOp Exp Char Exp
    | ExpArray Exp Exp -- "Exp [ Exp ]"
    | ExpFCall Exp Ident ExpList -- Exp . Ident ( ExpList )
    | ExpInt Int
    | ExpNewInt Exp
    | ExpBool Bool -- True or False
    | ExpIdent Ident
    | ExpNewIdent Ident -- new Ident ()
    | ExpExp Exp -- Exp ( Exp )
    | ExpThis
    | ExpNot Exp
    | ExpLength Exp
    | ExpError
    deriving (Show, Eq)

data Op
     = And
     | LessThan
     | Plus
     | Minus
     | Times
     deriving (Show, Eq)

type Ident = String
type Integer_Literal = Int
data ExpList
    = ExpList Exp ExpRest
    | ExpListEmpty
    | ExpListExp Exp
    deriving (Show, Eq)
data ExpRest
    = ExpRest Exp
    deriving (Show, Eq)



main = do 
  inStr <- getContents
  let parseTree = newl (alexScanTokens inStr)  
  putStrLn ("parseTree:" ++ show(parseTree))
  print "done"
}

I’ve also defined the “parseError” function and the main function.
The main function uses “getContents” to retrieve the input from stdin and then creates the parse tree using “newl (alexScanTokens inStr)”. Note the “newl” name is the same one we declared using “%name newl” at the beginning of the program.

With the above code we now have a working parser for the NewL language. At this point if there’s an error parsing it barfs out “parse error” and doesn’t tell us anything about the error. Also no type or semantic analysis is done. I’ll expand on both of these points in later posts.

References
[1] Chapter 2. Using Happy, http://www.haskell.org/happy/doc/html/sec-using.html