LARA

Differences

This shows you the differences between two versions of the page.

Link to this comparison view

Both sides previous revision Previous revision
Next revision
Previous revision
cc19:scallion [2019/07/25 10:05]
romain
cc19:scallion [2019/10/09 16:36]
romain [Documentation]
Line 9: Line 9:
   * Writing the parser in a //domain specific language// (DSL) and using a parser generator (such as Bison) to produce the parser.   * Writing the parser in a //domain specific language// (DSL) and using a parser generator (such as Bison) to produce the parser.
  
-An other approach, which we will be using, is //parser combinators//​. The idea behind the approach is very simple:+Another ​approach, which we will be using, is //parser combinators//​. The idea behind the approach is very simple:
  
   * Have a set of simple primitive parsers, and   * Have a set of simple primitive parsers, and
   * Have ways to combine them together into more and more complex parsers. Hence the name //parser combinators//​.   * Have ways to combine them together into more and more complex parsers. Hence the name //parser combinators//​.
  
-Usually, those primitive parsers and combinators are provided as a library directly in the language used by the compiler. In our case, we will be working with **Scallion**,​ a Scala parser combinators library ​developped ​by //LARA//.+Usually, those primitive parsers and combinators are provided as a library directly in the language used by the compiler. In our case, we will be working with **Scallion**,​ a Scala parser combinators library ​developed ​by //LARA//.
  
 Parser combinators have many advantages – the main one being easy to write, read and maintain. Parser combinators have many advantages – the main one being easy to write, read and maintain.
Line 23: Line 23:
  
 In this document, we will introduce parser combinators in Scallion and showcase how to use them. This document is not intended to be a complete reference to Scallion. Fortunately,​ the library comes with a [[https://​epfl-lara.github.io/​scallion/​scallion/​syntactic/​index.html|comprehensive API]] which fulfills that role. Feel free to refer to it while working on your project! In this document, we will introduce parser combinators in Scallion and showcase how to use them. This document is not intended to be a complete reference to Scallion. Fortunately,​ the library comes with a [[https://​epfl-lara.github.io/​scallion/​scallion/​syntactic/​index.html|comprehensive API]] which fulfills that role. Feel free to refer to it while working on your project!
 +
 +==== Playground Project ====
 +
 +We have set up {{scallion-playground.zip|an example project}} that implements a lexer and parser for a simple expression language using Scallion. Feel free to experiment and play with it. The project showcases the API of Scallion and some of the more advanced combinators.
  
 ==== Setup ==== ==== Setup ====
  
-In Scallion, parsers are called ​defined within a trait called ''​%%Syntaxes%%''​. This trait takes as parameters two types:+In Scallion, parsers are defined within a trait called ''​%%Syntaxes%%''​. This trait takes as parameters two types:
  
   * The type of tokens,   * The type of tokens,
Line 47: Line 51:
 ==== Writing Parsers ==== ==== Writing Parsers ====
  
-When writing a parser using parser combinators,​ one defines many smaller parsers and combine ​them together into more and more complex parsers. The top-level, most complex, of those parser then defines the entire syntax for the language. In our case, that top-level parser will be called ''​%%program%%''​.+When writing a parser using parser combinators,​ one defines many smaller parsers and combines ​them together into more and more complex parsers. The top-level, most complex, of those parser then defines the entire syntax for the language. In our case, that top-level parser will be called ''​%%program%%''​.
  
 All those parsers are objects of the type ''​%%Syntax[A]%%''​. The type parameter ''​%%A%%''​ indicates the type of values produced by the parser. For instance, a parser of type ''​%%Syntax[Int]%%''​ produces ''​%%Int%%''​s and a parser of type ''​%%Syntax[Expr]%%''​ produces ''​%%Expr%%''​s. Our top-level parser has the following signature: All those parsers are objects of the type ''​%%Syntax[A]%%''​. The type parameter ''​%%A%%''​ indicates the type of values produced by the parser. For instance, a parser of type ''​%%Syntax[Int]%%''​ produces ''​%%Int%%''​s and a parser of type ''​%%Syntax[Expr]%%''​ produces ''​%%Expr%%''​s. Our top-level parser has the following signature:
Line 54: Line 58:
 lazy val program: Parser[Program] = ... lazy val program: Parser[Program] = ...
 </​code>​ </​code>​
-Contrarily ​to the types of tokens and token kinds, which are fixed, the type of value produced is a type parameter of the various ''​%%Syntax%%''​s. This allows your different parsers to produce different ​type of values.+Contrary ​to the types of tokens and token kinds, which are fixed, the type of values ​produced is a type parameter of the various ''​%%Syntax%%''​s. This allows your different parsers to produce different ​types of values.
  
 The various parsers are stored as ''​%%val%%''​ members of the ''​%%Parser%%''​ object. In the case of mutually dependent parsers, we use ''​%%lazy val%%''​ instead. The various parsers are stored as ''​%%val%%''​ members of the ''​%%Parser%%''​ object. In the case of mutually dependent parsers, we use ''​%%lazy val%%''​ instead.
Line 76: Line 80:
   * An ''​%%UnexpectedEnd(rest)%%'',​ which indicates that the end of the iterator was reached and the parser could not finish at this point. The input iterator was completely consumed.   * An ''​%%UnexpectedEnd(rest)%%'',​ which indicates that the end of the iterator was reached and the parser could not finish at this point. The input iterator was completely consumed.
  
-In each case, the additional value ''​%%rest%%''​ is itself a ''​%%Syntax[A]%%''​. That parser represents the parser after the successful parse or at the point of error. This parser could be used to provide useful error messages or even to resume parsing.+In each case, the additional value ''​%%rest%%''​ is itself ​some sort of a ''​%%Syntax[A]%%''​. That parser represents the parser after the successful parse or at the point of error. This parser could be used to provide useful error messages or even to resume parsing.
  
 <​code>​ <​code>​
Line 91: Line 95:
 ==== Parsers and Grammars ==== ==== Parsers and Grammars ====
  
-As you will see, parsers built using parser combinators will look a lot like grammars. However, ​contrarily to grammars, parsers not only describe the syntax ​or your language, but also directly specify how to turn this syntax into a value. Also, as we will see, parser combinators have a richer vocabulary than your usual //BNF// grammars.+As you will see, parsers built using parser combinators will look a lot like grammars. However, ​unlike ​grammars, parsers not only describe the syntax ​of your language, but also directly specify how to turn this syntax into a value. Also, as we will see, parser combinators have a richer vocabulary than your usual //BNF// grammars.
  
 Interestingly,​ a lot of concepts that you have seen on grammars, such as ''​%%FIRST%%''​ sets and nullability can be straightforwardly transposed to parsers. Interestingly,​ a lot of concepts that you have seen on grammars, such as ''​%%FIRST%%''​ sets and nullability can be straightforwardly transposed to parsers.
Line 102: Line 106:
 definition.first === Set(def, abstract, case) definition.first === Set(def, abstract, case)
 </​code>​ </​code>​
 +
 === Nullability === === Nullability ===
  
Line 128: Line 133:
 === Epsilon === === Epsilon ===
  
-The parser ''​%%epsilon(value)%%''​ is a parser that produces the ''​%%value%%''​ without consuming any input. It corresponds to the empty word //𝛆// in grammars.+The parser ''​%%epsilon(value)%%''​ is a parser that produces the ''​%%value%%''​ without consuming any input. It corresponds to the //​𝛆// ​found in grammars.
  
 ==== Parser Combinators ==== ==== Parser Combinators ====
Line 149: Line 154:
 If the parser ''​%%p1%%''​ accepts the prefix of a sequence of tokens and ''​%%p2%%''​ accepts the postfix, the parser ''​%%p1 ~ p2%%''​ accepts the entire sequence and produces the pair of values produced by ''​%%p1%%''​ and ''​%%p2%%''​. If the parser ''​%%p1%%''​ accepts the prefix of a sequence of tokens and ''​%%p2%%''​ accepts the postfix, the parser ''​%%p1 ~ p2%%''​ accepts the entire sequence and produces the pair of values produced by ''​%%p1%%''​ and ''​%%p2%%''​.
  
-Note that the ''​%%first%%''​ set of ''​%%p2%%''​ should be disjoint from the ''​%%first%%''​ set of all sub-parsers in ''​%%p1%%''​ that are //​nullable//​ and in trailing position (available via the ''​%%shouldNotFollow%%''​ method). This restriction ensures that the combinator does not introduce ambiguities.+Note that the ''​%%first%%''​ set of ''​%%p2%%''​ should be disjoint from the ''​%%first%%''​ set of all sub-parsers in ''​%%p1%%''​ that are //​nullable//​ and in trailing position (available via the ''​%%followLast%%''​ method). This restriction ensures that the combinator does not introduce ambiguities.
  
 === Transforming Values === === Transforming Values ===
  
-The method ''​%%map%%'' ​make it possible to apply a transformation to the values produced by a parser. Using ''​%%map%%''​ does not influence the sequence of tokens accepted or rejected by the parser, it merely modifies the value produced. Generally, you will use ''​%%map%%''​ on a sequence of parsers, as in:+The method ''​%%map%%'' ​makes it possible to apply a transformation to the values produced by a parser. Using ''​%%map%%''​ does not influence the sequence of tokens accepted or rejected by the parser, it merely modifies the value produced. Generally, you will use ''​%%map%%''​ on a sequence of parsers, as in:
  
 <​code>​ <​code>​
Line 161: Line 166:
   }   }
 </​code>​ </​code>​
-The above parser accepts ​abtract ​class definitions in Amy syntax. It does so by accepting the sequence of keywords ''​%%abstract%%''​ and ''​%%class%%'',​ followed by any identifier. The method ''​%%map%%''​ is used to convert the produced values into an ''​%%AbstractClassDef%%''​. The position of the keyword ''​%%abstract%%''​ is used as the position of the definition.+The above parser accepts ​abstract ​class definitions in Amy syntax. It does so by accepting the sequence of keywords ''​%%abstract%%''​ and ''​%%class%%'',​ followed by any identifier. The method ''​%%map%%''​ is used to convert the produced values into an ''​%%AbstractClassDef%%''​. The position of the keyword ''​%%abstract%%''​ is used as the position of the definition.
  
 === Recursive Parsers === === Recursive Parsers ===
Line 174: Line 179:
 If you were to omit it, a ''​%%StackOverflow%%''​ exception would be triggered during the initialisation of your ''​%%Parser%%''​ object. If you were to omit it, a ''​%%StackOverflow%%''​ exception would be triggered during the initialisation of your ''​%%Parser%%''​ object.
  
-The ''​%%recursive%%''​ combinator in itself does not change the behaviour of the underlying parser. It is there to //tie the knot//((See [[https://​stackoverflow.com/​questions/​357956/​explanation-of-tying-the-knot|a good explanation of what //tying the knot// means in the context of lazy languages.]]+The ''​%%recursive%%''​ combinator in itself does not change the behaviour of the underlying parser. It is there to //tie the knot//((See [[https://​stackoverflow.com/​questions/​357956/​explanation-of-tying-the-knot|a good explanation of what tying the knot means in the context of lazy languages.]]
 )). )).
  
-In practice, it is only required in very few places. In order to avoid ''​%%StackOverflow%%''​ exceptions during initialisation,​ you should make sure that all recursives ​parsers (stored in ''​%%lazy val%%''​s) must not be able to reenter themselves without going through a ''​%%recursive%%''​ combinator somewhere along the way.+In practice, it is only required in very few places. In order to avoid ''​%%StackOverflow%%''​ exceptions during initialisation,​ you should make sure that all recursive ​parsers (stored in ''​%%lazy val%%''​s) must not be able to reenter themselves without going through a ''​%%recursive%%''​ combinator somewhere along the way.
  
 === Other Combinators === === Other Combinators ===
Line 202: Line 207:
 == Binary operators with operators == == Binary operators with operators ==
  
-Scallion also contains combinators to easily build parsers for infix binary operators, with different associativities and priority levels. This combinator is defined in an additional trait called ''​%%Operators%%'',​ which you should mix-in to ''​%%Parsers%%''​ if you want to use the combinator.+Scallion also contains combinators to easily build parsers for infix binary operators, with different associativities and priority levels. This combinator is defined in an additional trait called ''​%%Operators%%'',​ which you should mix into ''​%%Parsers%%''​ if you want to use the combinator. By default, it should already be mixed-in.
  
 <​code>​ <​code>​
Line 212: Line 217:
 ... ...
  
-lazy val operation: Syntax[Int] =+lazy val operation: Syntax[Expr] =
   operators(number)(   operators(number)(
     // Defines the different operators, by decreasing priority.     // Defines the different operators, by decreasing priority.
Line 226: Line 231:
 Documentation for ''​%%operators%%''​ is [[https://​epfl-lara.github.io/​scallion/​scallion/​syntactic/​Operators.html|available on this page]]. Documentation for ''​%%operators%%''​ is [[https://​epfl-lara.github.io/​scallion/​scallion/​syntactic/​Operators.html|available on this page]].
  
 +== Upcasting ==
 +
 +In Scallion, the type ''​%%Syntax[A]%%''​ is invariant with ''​%%A%%'',​ meaning that, even when ''​%%A%%''​ is a (strict) subtype of some type ''​%%B%%'',​ we //​won'​t//​ have that ''​%%Syntax[A]%%''​ is a subtype of ''​%%Syntax[B]%%''​. To upcast a ''​%%Syntax[A]%%''​ to a syntax ''​%%Syntax[B]%%''​ (when ''​%%A%%''​ is a subtype of ''​%%B%%''​),​ you should use the ''​%%.up[B]%%''​ method.
 +
 +For instance, you may need to upcast a syntax of type ''​%%Syntax[Literal[_]]%%''​ to a ''​%%Syntax[Expr]%%''​ in your assignment. To do so, simply use ''​%%.up[Expr]%%''​. ​
 ==== LL(1) Checking ==== ==== LL(1) Checking ====
  
 In Scallion, non-LL(1) parsers can be written, but the result of applying such a parser is not specified. In practice, we therefore restrict ourselves only to LL(1) parsers. The reason behind this is that LL(1) parsers are unambiguous and can be run in time linear in the input size. In Scallion, non-LL(1) parsers can be written, but the result of applying such a parser is not specified. In practice, we therefore restrict ourselves only to LL(1) parsers. The reason behind this is that LL(1) parsers are unambiguous and can be run in time linear in the input size.
  
-Writing LL(1) parsers is non-trivial. However, some of the higher-level combinators of Scallion already alleviate part of this pain. In addition, LL(1) violations can be detected before the parser is run. Parsers ​have an ''​%%isLL1%%''​ method which returns ''​%%true%%''​ if the parser is LL(1) and ''​%%false%%''​ otherwise, and so without needing to see any tokens of input.+Writing LL(1) parsers is non-trivial. However, some of the higher-level combinators of Scallion already alleviate part of this pain. In addition, LL(1) violations can be detected before the parser is run. Syntaxes ​have an ''​%%isLL1%%''​ method which returns ''​%%true%%''​ if the parser is LL(1) and ''​%%false%%''​ otherwise, and so without needing to see any tokens of input.
  
 === Conflict Witnesses === === Conflict Witnesses ===
Line 239: Line 249:
   * ''​%%FirstConflict%%'',​ which indicates that the ''​%%first%%''​ set of two branches of a disjunction are not disjoint.   * ''​%%FirstConflict%%'',​ which indicates that the ''​%%first%%''​ set of two branches of a disjunction are not disjoint.
   * ''​%%FollowConflict%%'',​ which indicates that the ''​%%first%%''​ set of a nullable parser is not disjoint from the ''​%%first%%''​ set of a parser that directly follows it.   * ''​%%FollowConflict%%'',​ which indicates that the ''​%%first%%''​ set of a nullable parser is not disjoint from the ''​%%first%%''​ set of a parser that directly follows it.
-  * ''​%%LeftRecursiveConflict%%'',​ which indicates that a parser can recursively call itself without consuming any input token. 
  
-The ''​%%LL1Conflict%%''​s objects contain fields which can help you pinpoint the exact location of conflicts in your parser and hopefully help you fix those. ​For instance, check the ''​%%witnesses%%'' ​method, which returns sequences ​of token kinds which lead to the problem.+The ''​%%LL1Conflict%%''​s objects contain fields which can help you pinpoint the exact location of conflicts in your parser and hopefully help you fix those. ​ 
 + 
 +The helper method  ​''​%%debug%%'' ​prints a summary ​of the LL(1) conflicts of a parser. We added code in the handout skeleton so that, by default, a report is outputted in case of conflicts when you initialise your parser.