When Does Lexing End and Parsing Begin?
I had an interesting bug in my compiler: The parser would fail on blank lines. To a degree, this makes sense; the formal grammar of the language does not include blank lines, so this is invalid input! On the other hand, every programming language I know of simply ignores them. That sounds simple, but, as the author of the compiler, I have to ask: Where is the correct place to ignore a blank line?
Compiler textbooks and classes tend to divide compiler architecture into phases, and these generally begin with lexing (also called scanning) and parsing. However, this is misleading in several ways. Some compilers do not treat scanning as a separate phase at all. Of those that do, most run scanning and parsing in parallel, feeding tokens to the parser as they are recognized by the lexer, rather than completing lexing before beginning parsing. It is also common (if, admittedly, a hack) to feed information from the parser back into the lexer as scanning progresses. In practice, there is no strict distinction between the lexing and parsing "phases."
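To make the "in parallel" point concrete, here is a minimal sketch of the pull model (a toy example of my own, not any particular compiler's design): the lexer is a generator, the parser asks it for one token at a time, and so scanning and parsing are interleaved rather than run as separate passes.

```python
import re
from typing import Iterator, NamedTuple

class Token(NamedTuple):
    kind: str
    text: str

TOKEN_RE = re.compile(r"(?P<NUMBER>\d+)|(?P<PLUS>\+)|(?P<SKIP>\s+)")

def lex(source: str) -> Iterator[Token]:
    # Tokens are handed to the caller one at a time, as they are recognized.
    for m in TOKEN_RE.finditer(source):
        if m.lastgroup != "SKIP":          # whitespace never reaches the parser
            yield Token(m.lastgroup, m.group())

def parse_sum(tokens: Iterator[Token]) -> int:
    # Recognizes NUMBER (PLUS NUMBER)* and evaluates it; each next() call pulls
    # exactly one more token out of the lexer.
    total = int(next(tokens).text)
    for tok in tokens:
        assert tok.kind == "PLUS"
        total += int(next(tokens).text)
    return total

print(parse_sum(lex("1 + 2 + 40")))        # 43
```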
If a compiler exists to transform source code into an executable program, what is the rationale for having lexing as a separate "phase"? Partly, it's an implementation detail: Treating lexing separately can improve performance and makes certain features easier to implement. There is a theoretical justification as well: If you specify the programming language as a context-free grammar whose terminals are lexemes specified with regular expressions, then it makes sense to implement these two concerns separately.
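As a sketch of that split (the token names and grammar here are my own toy example, not any real language), the lexical level is a set of regular expressions, one per terminal, while the syntactic level is a context-free grammar written over those terminals:

```python
# Lexical specification: regular expressions -> terminals.
LEXICAL_SPEC = [
    ("IF",     r"if\b"),
    ("IDENT",  r"[A-Za-z_]\w*"),
    ("NUMBER", r"\d+"),
    ("LPAREN", r"\("),
    ("RPAREN", r"\)"),
]

# Syntactic specification: a context-free grammar whose terminals are the
# token kinds above. The lexer needs only the regular expressions; the parser
# needs only the grammar.
CONTEXT_FREE_GRAMMAR = """
    stmt -> IF LPAREN expr RPAREN stmt
    expr -> IDENT | NUMBER
"""
```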
Compiler textbooks do not always do a great job of explaining the border between lexing and parsing in a general sense. For example, "Compilers: Principles, Techniques, & Tools" ("the dragon book"), Second Edition, has this to say on p. 6: "Blanks separating the lexemes would be discarded by the lexical analyzer." That is true enough for the toy example in question on that page, but clearly wrong for a language with significant whitespace, like Python or Haskell. Significant whitespace is a reasonably common programming language feature, but the dragon book mostly ignores it.
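To see why, here is a heavily simplified sketch of Python-like indentation handling (illustrative only; real Python is more involved): the lexer cannot simply discard leading blanks, because it has to turn changes in indentation into INDENT and DEDENT tokens that the grammar treats as real terminals.

```python
def lex_indentation(lines):
    indents = [0]                      # stack of active indentation widths
    tokens = []
    for line in lines:
        stripped = line.lstrip(" ")
        if not stripped:               # blank line: emit no tokens at all
            continue
        width = len(line) - len(stripped)
        if width > indents[-1]:
            indents.append(width)
            tokens.append("INDENT")
        while width < indents[-1]:
            indents.pop()
            tokens.append("DEDENT")
        tokens.append(("LINE", stripped))
    tokens.extend("DEDENT" for _ in indents[1:])  # close open blocks at EOF
    return tokens

print(lex_indentation(["if x:", "    y = 1", "", "z = 2"]))
# [('LINE', 'if x:'), 'INDENT', ('LINE', 'y = 1'), 'DEDENT', ('LINE', 'z = 2')]
```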
Whitespace is not significant in C# or VB.NET, but Roslyn does not discard it. Instead, Roslyn syntax nodes store whitespace, comments, etc., in SyntaxTrivia fields. The intention is that the full, original source text can be rebuilt from the syntax tree.
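This is not Roslyn's actual API, just a sketch of the round-tripping idea in miniature: each token carries the trivia (whitespace, comments) that preceded it, so concatenating the tokens reproduces the original source exactly.

```python
import re
from typing import List, NamedTuple

class TriviaToken(NamedTuple):
    leading_trivia: str   # whitespace and comments before the token
    text: str             # the token itself

TRIVIA_RE = re.compile(r"(?P<trivia>(?:\s|//[^\n]*)*)(?P<tok>\w+|[^\w\s])")

def lex_with_trivia(source: str) -> List[TriviaToken]:
    return [TriviaToken(m.group("trivia"), m.group("tok"))
            for m in TRIVIA_RE.finditer(source)]

source = "x = 1;  // comment\ny = 2;"
tokens = lex_with_trivia(source)
assert "".join(t.leading_trivia + t.text for t in tokens) == source  # round-trips
```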
Lexers tend to treat keywords (reserved words) distinctly from identifiers such as variable or type names. The dragon book does suggest why this is the case, but, as Frans Bouma pointed out, does so in terms "requiring knowledge found hundreds of pages further on [in the book]." Other books have a similar problem: "Modern Compiler Implementation in ML" states, "A lexical token is a sequence of characters that can be treated as a unit in the grammar of a programming language." But it does this well before the notion of a "unit in the grammar of a programming language" is really explained.
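The usual technique, sketched below (this is illustrative, not any particular book's algorithm), is to scan every word with a single identifier rule and then check the lexeme against a keyword table, so that each reserved word becomes its own terminal rather than an IDENT:

```python
KEYWORDS = {"if", "else", "while", "return"}

def classify_word(lexeme: str) -> str:
    # One identifier rule plus a table lookup keeps the lexer simple and tells
    # the parser exactly which terminal (IF, WHILE, IDENT, ...) it is getting.
    return lexeme.upper() if lexeme in KEYWORDS else "IDENT"

print(classify_word("while"))   # WHILE: a reserved word is its own terminal
print(classify_word("whilst"))  # IDENT: merely an identifier
```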
The Coursera compilers course finally made this click for me: The point of a lexer is to provide input to a parser. Put differently, a lexer should supply a valid sequence of terminals to the parser, given valid (source code) input. So you need to understand parsers before you can really understand a lexer. Indeed, it may make more sense to teach compilers "backwards."
In the same way that lexers operate on a sequence of characters or code points, parsers operate on a sequence of terminals. Hence, the lexer should aim to produce a valid sequence of terminals from the parser's context-free grammar, given valid input.
Therefore: If the parser's terminals are characters, then you do not need a lexer and you are doing scannerless parsing. If the parser's terminals are "more than individual characters" (keywords, formatted numbers, identifiers, and the like), then the job of the lexer is to produce terminals, and only terminals, given correctly formatted source code.
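A toy illustration of the two options (my own example, not from any textbook): in the scannerless version the grammar's terminals are individual characters, so the parser consumes digits directly; in the tokenized version the terminal is the token kind NUMBER, and producing it is the lexer's job.

```python
import re

def scannerless_number(chars: str) -> bool:
    # digits -> digit digits | digit ; digit -> "0" | ... | "9"
    # The "parser" works directly on characters; there is no lexer.
    return len(chars) > 0 and all(c in "0123456789" for c in chars)

def lex_number(source: str):
    # The lexer collapses the character-level detail into a single terminal.
    m = re.fullmatch(r"\d+", source)
    return ("NUMBER", m.group()) if m else None

print(scannerless_number("42"))  # True: the parser itself consumed the digits
print(lex_number("42"))          # ('NUMBER', '42'): the parser only sees NUMBER
```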
This tells me that the correct place for me to remove blank lines in my compiler is in the lexer.
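Concretely, the idea looks like this (a sketch, not my compiler's actual code): a blank line produces no tokens at all, so the parser never needs to know that blank lines exist.

```python
def lex_lines(source: str):
    for line in source.splitlines():
        if not line.strip():
            continue                 # blank line: not a terminal, emit nothing
        yield ("LINE", line)         # real lines become terminals as usual

program = "let x = 1\n\n\nlet y = 2\n"
print(list(lex_lines(program)))
# [('LINE', 'let x = 1'), ('LINE', 'let y = 2')] -- the parser never sees the blanks
```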
