LARA

Differences

This shows you the differences between two versions of the page.

Link to this comparison view

Both sides previous revision Previous revision
Next revision
Previous revision
Next revision Both sides next revision
cc19:scallion [2019/07/11 14:28]
romain
cc19:scallion [2019/07/25 10:05]
romain
Line 1: Line 1:
-====== Introduction to Parser Combinators ​using Scallion ======+ 
 +===== Introduction to Parser Combinators =====
  
 The next part of the compiler you will be working on is the parser. The goal of the parser is to convert the sequence of tokens generated by the lexer into an Amy //abstract syntax tree// (AST). The next part of the compiler you will be working on is the parser. The goal of the parser is to convert the sequence of tokens generated by the lexer into an Amy //abstract syntax tree// (AST).
Line 21: Line 22:
 ==== Documentation ==== ==== Documentation ====
  
-In this document, we will introduce parser combinators in Scallion and showcase how to use them. This document is not intended to be a complete reference to Scallion. Fortunately,​ the library comes with a [[https://​epfl-lara.github.io/​scallion/​scallion/​parsing/​index.html|comprehensive API]] which fulfills that role. Feel free to refer to it while working on your project.+In this document, we will introduce parser combinators in Scallion and showcase how to use them. This document is not intended to be a complete reference to Scallion. Fortunately,​ the library comes with a [[https://​epfl-lara.github.io/​scallion/​scallion/​syntactic/​index.html|comprehensive API]] which fulfills that role. Feel free to refer to it while working on your project!
  
 ==== Setup ==== ==== Setup ====
  
-Before we can dive into parser combinators,​ we +In Scallion, parsers are called ​defined within a trait called ''​%%Syntaxes%%''​. This trait takes as parameters two types:
- +
-In Scallion, parsers are defined within a trait called ''​%%Parsers%%''​. This trait takes as parameters two types:+
  
   * The type of tokens,   * The type of tokens,
-  * The type of //token kinds//.+  * The type of //token kinds//. Token kinds represent groups of tokens. They abstract away all the details found in the actual tokens, such as for instance positions or identifiers name. Each token has a unique kind.
  
-In our case, the tokens will be of type ''​%%Token%%''​ and the token kinds of type ''​%%TokenKind%%'' ​that we introduced previously. We implement the ''​%%getKind%%''​ method to inform Scallion of the kind associated to each of our tokens.+In our case, the tokens will be of type ''​%%Token%%'' ​that we introduced ​and used in the previous project. The token kinds will be ''​%%TokenKind%%''​, which we have already defined for you.
  
 <​code>​ <​code>​
 object Parser extends Pipeline[Iterator[Token],​ Program] object Parser extends Pipeline[Iterator[Token],​ Program]
-                 ​with ​Parsers[Token, TokenKind] {+                 ​with ​Syntaxes[Token, TokenKind] {
  
 +  // Indicates the kind of the various tokens.
   override def getKind(token:​ Token): TokenKind = TokenKind.of(token)   override def getKind(token:​ Token): TokenKind = TokenKind.of(token)
   ​   ​
-  // Rest of the parser here...+  // You parser ​implementation goes here.
 } }
 </​code>​ </​code>​
-The ''​%%Parsers%%''​ trait mixed-in the ''​%%Parser%%''​ object provides all functions and types you will use to write your parser.+The ''​%%Syntaxes%%''​ trait mixed-in the ''​%%Parser%%''​ object provides all functions and types you will use to write your parser.
  
 ==== Writing Parsers ==== ==== Writing Parsers ====
Line 49: Line 49:
 When writing a parser using parser combinators,​ one defines many smaller parsers and combine them together into more and more complex parsers. The top-level, most complex, of those parser then defines the entire syntax for the language. In our case, that top-level parser will be called ''​%%program%%''​. When writing a parser using parser combinators,​ one defines many smaller parsers and combine them together into more and more complex parsers. The top-level, most complex, of those parser then defines the entire syntax for the language. In our case, that top-level parser will be called ''​%%program%%''​.
  
-All those parsers are objects of the type ''​%%Parser[A]%%''​. The type parameter ''​%%A%%''​ indicates the type of values produced by that parser.((The type parameter ''​%%A%%''​ in ''​%%Parser[A]%%''​ is covariant, meaning that if ''​%%B%%''​ is a subtype of ''​%%A%%'',​ then ''​%%Parser[B]%%''​ is subtype of ''​%%Parser[A]%%''​. +All those parsers are objects of the type ''​%%Syntax[A]%%''​. The type parameter ''​%%A%%''​ indicates the type of values produced by the parser. For instance, a parser of type ''​%%Syntax[Int]%%''​ produces ''​%%Int%%''​s and a parser of type ''​%%Syntax[Expr]%%''​ produces ''​%%Expr%%''​s. Our top-level parser has the following signature:
-)) For instance, a parser of type ''​%%Parser[Int]%%''​ produces ''​%%Int%%''​s and a parser of type ''​%%Parser[Expr]%%''​ produces ''​%%Expr%%''​s. Our top-level parser has the following signature:+
  
 <​code>​ <​code>​
 lazy val program: Parser[Program] = ... lazy val program: Parser[Program] = ...
 </​code>​ </​code>​
-Contrarily to the types of tokens and token kinds – which are fixed – the type of value produced is a type parameter of the various ''​%%Parser%%''​s. This allows your different parsers to produce different type of values.+Contrarily to the types of tokens and token kindswhich are fixedthe type of value produced is a type parameter of the various ''​%%Syntax%%''​s. This allows your different parsers to produce different type of values.
  
 The various parsers are stored as ''​%%val%%''​ members of the ''​%%Parser%%''​ object. In the case of mutually dependent parsers, we use ''​%%lazy val%%''​ instead. The various parsers are stored as ''​%%val%%''​ members of the ''​%%Parser%%''​ object. In the case of mutually dependent parsers, we use ''​%%lazy val%%''​ instead.
  
 <​code>​ <​code>​
-lazy val definition: ​Parser[ClassOrFunDef] =+lazy val definition: ​Syntax[ClassOrFunDef] =
   functionDefinition | abstractClassDefinition | caseClassDefinition   functionDefinition | abstractClassDefinition | caseClassDefinition
    
-lazy val functionDefinition: ​Parser[FunDef] = ...+lazy val functionDefinition: ​Syntax[ClassOrFunDef] = ...
  
-lazy val abstractClassDefinition: ​Parser[AbstractClassDef] = ...+lazy val abstractClassDefinition: ​Syntax[ClassOrFunDef] = ...
  
-lazy val caseClassDefinition: ​Parser[CaseClassDef] = ...+lazy val caseClassDefinition: ​Syntax[ClassOrFunDef] = ...
 </​code>​ </​code>​
 ==== Running Parsers ==== ==== Running Parsers ====
  
-Parsers of type ''​%%Parser[A]%%''​ have an ''​%%apply%%''​ method which takes as parameter an iterator of tokens and returns a value of type ''​%%ParseResult[A]%%'',​ which can be one of three things:+Parsers of type ''​%%Syntax[A]%%''​ have an ''​%%apply%%''​ method which takes as parameter an iterator of tokens and returns a value of type ''​%%ParseResult[A]%%'',​ which can be one of three things:
  
   * A ''​%%Parsed(value,​ rest)%%'',​ which indicates that the parser was successful and produced the value ''​%%value%%''​. The entirety of the input iterator was consumed by the parser.   * A ''​%%Parsed(value,​ rest)%%'',​ which indicates that the parser was successful and produced the value ''​%%value%%''​. The entirety of the input iterator was consumed by the parser.
Line 77: Line 76:
   * An ''​%%UnexpectedEnd(rest)%%'',​ which indicates that the end of the iterator was reached and the parser could not finish at this point. The input iterator was completely consumed.   * An ''​%%UnexpectedEnd(rest)%%'',​ which indicates that the end of the iterator was reached and the parser could not finish at this point. The input iterator was completely consumed.
  
-In each case, the additional value ''​%%rest%%''​ is itself a ''​%%Parser[A]%%''​. That parser represents the parser after the successful parse or at the point of error. This parser could be used to provide useful error messages or even to resume parsing.+In each case, the additional value ''​%%rest%%''​ is itself a ''​%%Syntax[A]%%''​. That parser represents the parser after the successful parse or at the point of error. This parser could be used to provide useful error messages or even to resume parsing.
  
 <​code>​ <​code>​
Line 92: Line 91:
 ==== Parsers and Grammars ==== ==== Parsers and Grammars ====
  
-As you will see, parsers built using parser combinators will tend to look a bit like grammars. However, contrarily to grammars, parsers not only describe the syntax or your language, but also directly specify how to turn this syntax into a value such as an AST. Also, as we will see, parser combinators have a richer vocabulary than your usual //BNF// grammars.+As you will see, parsers built using parser combinators will look a lot like grammars. However, contrarily to grammars, parsers not only describe the syntax or your language, but also directly specify how to turn this syntax into a value. Also, as we will see, parser combinators have a richer vocabulary than your usual //BNF// grammars.
  
 Interestingly,​ a lot of concepts that you have seen on grammars, such as ''​%%FIRST%%''​ sets and nullability can be straightforwardly transposed to parsers. Interestingly,​ a lot of concepts that you have seen on grammars, such as ''​%%FIRST%%''​ sets and nullability can be straightforwardly transposed to parsers.
Line 117: Line 116:
 <​code>​ <​code>​
 val eof: Parser[Token] = elem(EOFKind) val eof: Parser[Token] = elem(EOFKind)
-</​code>​ 
-== Syntax for keywords and delimiters. == 
- 
-Since ''​%%elem%%''​ is used very frequentely,​ especially for keywords and punctuation,​ we added a small implicit conversion to your ''​%%Parser%%''​ object so you can write ''​%%","​%%''​ for ''​%%elem(PunctuationKind(","​))%%''​ and ''​%%"​object"​%%''​ for ''​%%elem(KeywordKind("​object"​))%%''​. This conversion is defined as: 
- 
-<​code>​ 
-implicit def symb(string:​ String): Parser[Token] = 
-  if (string.length == 1 && !string.head.isLetter) { 
-    elem(PunctuationKind(string.head)) 
-  } else { 
-    elem(KeywordKind(string)) 
-  } 
 </​code>​ </​code>​
 === Accept === === Accept ===
  
-The function ''​%%accept%%''​ is a variant of ''​%%elem%%''​ which applies a transformation to the matched token before ​it is produced.+The function ''​%%accept%%''​ is a variant of ''​%%elem%%''​ which directly ​applies a transformation to the matched token when it is produced.
  
 <​code>​ <​code>​
-val identifier: ​Parser[String] = accept(IdentifierKind) {+val identifier: ​Syntax[String] = accept(IdentifierKind) {
   case IdentifierToken(name) => name   case IdentifierToken(name) => name
 } }
Line 149: Line 136:
 === Disjunction === === Disjunction ===
  
-The first combinator we have is disjunction,​ that we write, for parsers ''​%%p1%%''​ and ''​%%p2%%'',​ simply ''​%%p1 | p2%%''​. When both ''​%%p1%%''​ and ''​%%p2%%''​ are of type ''​%%Parser[A]%%'',​ the disjunction ''​%%p1 | p2%%''​ is also of type ''​%%Parser[A]%%''​. The disjunction operator is associative and commutative.+The first combinator we have is disjunction,​ that we write, for parsers ''​%%p1%%''​ and ''​%%p2%%'',​ simply ''​%%p1 | p2%%''​. When both ''​%%p1%%''​ and ''​%%p2%%''​ are of type ''​%%Syntax[A]%%'',​ the disjunction ''​%%p1 | p2%%''​ is also of type ''​%%Syntax[A]%%''​. The disjunction operator is associative and commutative.
  
 Disjunction works just as you think it does. If either of the parsers ''​%%p1%%''​ or ''​%%p2%%''​ would accept the sequence of tokens, then the disjunction also accepts the tokens. The value produced is the one produced by either ''​%%p1%%''​ or ''​%%p2%%''​. Disjunction works just as you think it does. If either of the parsers ''​%%p1%%''​ or ''​%%p2%%''​ would accept the sequence of tokens, then the disjunction also accepts the tokens. The value produced is the one produced by either ''​%%p1%%''​ or ''​%%p2%%''​.
  
-Note that ''​%%p1%%''​ and ''​%%p2%%''​ must have disjoint ''​%%first%%''​ sets. This restriction ensures that no ambiguities can arise and that parsing can be done in linear time.((Scallion is not the only parser combinator library to exist, far from it! Many of those libraries do not have this restriction. Those libraries generally need to backtrack to try the different alternatives when a branch fails.+Note that ''​%%p1%%''​ and ''​%%p2%%''​ must have disjoint ''​%%first%%''​ sets. This restriction ensures that no ambiguities can arise and that parsing can be done efficiently.((Scallion is not the only parser combinator library to exist, far from it! Many of those libraries do not have this restriction. Those libraries generally need to backtrack to try the different alternatives when a branch fails.
 )) We will see later how to automatically detect when this is not the case and how fix the issue. )) We will see later how to automatically detect when this is not the case and how fix the issue.
  
Line 169: Line 156:
  
 <​code>​ <​code>​
-lazy val abstractClassDefinition: ​Parser[AbstractClassDef] = +lazy val abstractClassDefinition: ​Syntax[ClassOrFunDef] = 
-  ("​abstract"​ ~ "​class"​ ~ identifier).map {+  ​(kw("​abstract"​kw("​class"​~ identifier).map {
     case kw ~ _ ~ id => AbstractClassDef(id).setPos(kw)     case kw ~ _ ~ id => AbstractClassDef(id).setPos(kw)
   }   }
Line 181: Line 168:
  
 <​code>​ <​code>​
-lazy val expr: Parser[Expr] = recursive {+lazy val expr: Syntax[Expr] = recursive {
   ...   ...
 } }
Line 187: Line 174:
 If you were to omit it, a ''​%%StackOverflow%%''​ exception would be triggered during the initialisation of your ''​%%Parser%%''​ object. If you were to omit it, a ''​%%StackOverflow%%''​ exception would be triggered during the initialisation of your ''​%%Parser%%''​ object.
  
-The ''​%%recursive%%''​ combinator in itself does not change the behaviour of the underlying parser ​in any ways. It is just there to //tie the knot//((See [[https://​stackoverflow.com/​questions/​357956/​explanation-of-tying-the-knot|a good explanation of what tying the knot means in the context of lazy languages.]]+The ''​%%recursive%%''​ combinator in itself does not change the behaviour of the underlying parser. It is there to //tie the knot//((See [[https://​stackoverflow.com/​questions/​357956/​explanation-of-tying-the-knot|a good explanation of what //tying the knot// means in the context of lazy languages.]]
 )). )).
  
Line 210: Line 197:
  
 The combinator ''​%%repsep%%''​ returns a parser that accepts any number of repetitions of its argument parser, separated by an other parser, including 0. The variant ''​%%rep1sep%%''​ forces the parser to match at least once. The combinator ''​%%repsep%%''​ returns a parser that accepts any number of repetitions of its argument parser, separated by an other parser, including 0. The variant ''​%%rep1sep%%''​ forces the parser to match at least once.
 +
 +The separator parser is restricted to the type ''​%%Syntax[Unit]%%''​ to ensure that important values do not get ignored. You may use ''​%%unit()%%''​ to on a parser to turn its value to ''​%%Unit%%''​ if you explicitly want to ignore the values a parser produces.
  
 == Binary operators with operators == == Binary operators with operators ==
Line 216: Line 205:
  
 <​code>​ <​code>​
-val times: ​Parser[(Int, Int) => Int] =+val times: ​Syntax[String] =
   accept(OperatorKind("​*"​)) {   accept(OperatorKind("​*"​)) {
-    case _ => (x: Int, y: Int) => x y+    case _ => "*"
   }   }
  
 ... ...
  
-lazy val operation: ​Parser[Int] =+lazy val operation: ​Syntax[Int] =
   operators(number)(   operators(number)(
 +    // Defines the different operators, by decreasing priority.
     times | div   is LeftAssociative,​     times | div   is LeftAssociative,​
-    plus  | minus is LeftAssociative +    plus  | minus is LeftAssociative
-  )+    ... 
 +  ) 
 +    // Defines how to apply the various operators. 
 +    case (lhs, "​*",​ rhs) => Times(lhs, rhs).setPos(lhs) 
 +    ... 
 +  }
 </​code>​ </​code>​
-Documentation for ''​%%operators%%''​ is [[https://​epfl-lara.github.io/​scallion/​scallion/​parsing/​Operators.html|available on this page]].+Documentation for ''​%%operators%%''​ is [[https://​epfl-lara.github.io/​scallion/​scallion/​syntactic/​Operators.html|available on this page]].
  
 ==== LL(1) Checking ==== ==== LL(1) Checking ====
  
-In Scallion, non-LL(1) parsers can be written, but the result of applying such a parser is not specified. In practice, we therefore ​must restrict ourselves only to LL(1) parsers. The reason behind this is that LL(1) parsers are unambiguous and can be run in time linear in the input size.+In Scallion, non-LL(1) parsers can be written, but the result of applying such a parser is not specified. In practice, we therefore restrict ourselves only to LL(1) parsers. The reason behind this is that LL(1) parsers are unambiguous and can be run in time linear in the input size.
  
 Writing LL(1) parsers is non-trivial. However, some of the higher-level combinators of Scallion already alleviate part of this pain. In addition, LL(1) violations can be detected before the parser is run. Parsers have an ''​%%isLL1%%''​ method which returns ''​%%true%%''​ if the parser is LL(1) and ''​%%false%%''​ otherwise, and so without needing to see any tokens of input. Writing LL(1) parsers is non-trivial. However, some of the higher-level combinators of Scallion already alleviate part of this pain. In addition, LL(1) violations can be detected before the parser is run. Parsers have an ''​%%isLL1%%''​ method which returns ''​%%true%%''​ if the parser is LL(1) and ''​%%false%%''​ otherwise, and so without needing to see any tokens of input.
Line 243: Line 238:
   * ''​%%NullableConflict%%'',​ which indicates that two branches of a disjunction are nullable.   * ''​%%NullableConflict%%'',​ which indicates that two branches of a disjunction are nullable.
   * ''​%%FirstConflict%%'',​ which indicates that the ''​%%first%%''​ set of two branches of a disjunction are not disjoint.   * ''​%%FirstConflict%%'',​ which indicates that the ''​%%first%%''​ set of two branches of a disjunction are not disjoint.
-  * ''​%%FollowConflict%%'',​ which indicates that the ''​%%first%%''​ set of a nullable parser is not disjoint from the ''​%%first%%''​ set of a parser that directly follows ​in a sequence.+  * ''​%%FollowConflict%%'',​ which indicates that the ''​%%first%%''​ set of a nullable parser is not disjoint from the ''​%%first%%''​ set of a parser that directly follows ​it.
   * ''​%%LeftRecursiveConflict%%'',​ which indicates that a parser can recursively call itself without consuming any input token.   * ''​%%LeftRecursiveConflict%%'',​ which indicates that a parser can recursively call itself without consuming any input token.
  
-The ''​%%LL1Conflict%%''​s objects contain fields which can help you pinpoint the exact location of conflicts in your parser and hopefully help you fix those.+The ''​%%LL1Conflict%%''​s objects contain fields which can help you pinpoint the exact location of conflicts in your parser and hopefully help you fix those. For instance, check the ''​%%witnesses%%''​ method, which returns sequences of token kinds which lead to the problem.