Since sometime around June last year, I’ve been working with ANTLR at work. I haven’t had any formal education in compiler theory, so this certainly threw me in at the deep end, and there was a lot to learn.

Everything went fine until we started to write a parser for PL/SQL sometime in August, and that brings me to my rant:
In PL/SQL, apparently, records can have columns named “type”. Variables can be named “ref”. Cursors can be named “count”. And the list goes on. I personally think that
(a). Any language that allows keywords to be used as variable names is fundamentally flawed and leads to confusing code.
(b). Any programmer who re-uses keywords as variable names violates basic software engineering principles.

Getting the actual parser running was not too difficult. ANTLR’s Grammar page had a few PL/SQL grammars, but none worked with my files. BNF Web was very interesting, and I spent a couple of days visiting all of their pages and copying their BNF, but that didn’t work either. Then I started going through my workplace files and created a grammar from scratch. That turned out to be highly ambiguous, so I re-did it, left-factoring the rules. This final one worked.

The real headache came when I started to parse the files for testing the grammar and saw all of the above “unconventional” uses. In the end, I gave up trying to fix my parser, and decided to try to make the parser context-sensitive.

Formal definitions aside, my idea of a context-sensitive language is one that, as I pointed out above, recognizes that “ref” is a keyword only when used as part of “REF CURSOR”, and so on. I still think that non-context-sensitive languages (and I like my languages strongly typed too) are easier to develop in. As can be imagined, I do not like JavaScript and I loathe PL/SQL.

I came across some article that suggested a few approaches, and I either failed to understand them properly, or I simply could not get the results they promised.

I ended up trying at least 5 different approaches to get the parser to recognize context. Using syntactic predicates to override testLiteralsTable() required tight integration between lexer and parser. Overriding testLiteralsTable() itself didn’t work, as it needed lookahead, and this advanced the lexer, overwriting the text to be resolved. Parsing optimistically as a keyword, then rewinding on a parser exception and trying again as an identifier felt promising, but there was no way to re-invoke the entire parser chain, and such a re-invocation could be needed at any point of the call stack.

Finally, something worked in a limited situation. I wrote an override for match() that sets a flag before matching and clears it immediately before returning. Now, when the grammar expects “BULK COLLECT”, the generated parser simply calls match(LITERAL_collect) immediately after matching BULK. The flag is set, the lexer checks the flag to know the word is a keyword, and the flag is cleared after matching. Anywhere else in the code, when a COLLECT is met, the flag is off, and therefore it must be an identifier.
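To make the flag idea concrete, here is a minimal sketch in Python rather than in ANTLR's generated code. The class names, the `expect_keyword` flag and the tiny keyword table are all made up for illustration; the real override lived in the parser's match() and the lexer's literals test, and this toy has no lookahead at all.

```python
KEYWORDS = {"bulk", "collect", "into"}  # hypothetical stand-in for the literals table


class Lexer:
    """Toy lexer: splits on whitespace and classifies each word on demand."""

    def __init__(self, text):
        self.words = text.split()
        self.pos = 0
        self.expect_keyword = False  # set by the parser just before it matches a keyword

    def next_token(self):
        word = self.words[self.pos]
        self.pos += 1
        # A word counts as a keyword only if it is in the table
        # AND the parser has announced that it expects a keyword here.
        if self.expect_keyword and word.lower() in KEYWORDS:
            return ("KEYWORD", word.lower())
        return ("IDENT", word)


class Parser:
    """Toy parser: only knows how to match a keyword or take an identifier."""

    def __init__(self, lexer):
        self.lexer = lexer

    def match_keyword(self, name):
        self.lexer.expect_keyword = True       # raise the flag
        try:
            token = self.lexer.next_token()
        finally:
            self.lexer.expect_keyword = False  # always clear it before returning
        if token != ("KEYWORD", name):
            raise SyntaxError(f"expected {name!r}, got {token!r}")
        return token

    def match_ident(self):
        # The flag is off here, so even a word from the table is an identifier.
        return self.lexer.next_token()
```

With the input "BULK COLLECT collect", calling match_keyword("bulk"), match_keyword("collect") and then match_ident() yields two keyword tokens followed by an identifier whose text happens to be "collect".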

That sounds nice, but didn’t work very well either. The reason is that many keywords are optional and the majority of matches are done after a lookahead. So the problem was resolving between a keyword and an identifier during lookahead and buffer fills.
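The lookahead problem can be reproduced with a toy model. This is a sketch under my own assumptions, not ANTLR's actual token buffer: `FlagLexer` and `LookaheadBuffer` are hypothetical names, and the buffer simply classifies tokens at the moment it is filled.

```python
KEYWORDS = {"bulk", "collect"}  # hypothetical literals table


class FlagLexer:
    """Classifies a word as a keyword only while expect_keyword is set."""

    def __init__(self, text):
        self.words = iter(text.split())
        self.expect_keyword = False

    def next_token(self):
        word = next(self.words)
        if self.expect_keyword and word.lower() in KEYWORDS:
            return ("KEYWORD", word.lower())
        return ("IDENT", word)


class LookaheadBuffer:
    """Fills itself from the lexer whenever the parser peeks ahead."""

    def __init__(self, lexer):
        self.lexer = lexer
        self.buffer = []

    def la(self, k):
        # Tokens are classified here, at fill time, not at match time.
        while len(self.buffer) < k:
            self.buffer.append(self.lexer.next_token())
        return self.buffer[k - 1]

    def consume(self):
        self.la(1)
        return self.buffer.pop(0)


lexer = FlagLexer("BULK COLLECT")
stream = LookaheadBuffer(lexer)

# The parser peeks two tokens ahead to decide whether an optional clause
# is present. The flag is still off, so both words are lexed as identifiers.
peeked = stream.la(2)

# Raising the flag now is too late: the tokens are already in the buffer.
lexer.expect_keyword = True
first = stream.consume()
second = stream.consume()
```

Both `first` and `second` come out as identifiers, because the classification happened during the peek, before the flag was ever raised.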

Overriding the filter (TokenStreamHiddenTokenFilter, used to preserve whitespace) to look ahead once again resulted in the lexer advancing. Buffering tokens was also very tricky.

Finally, I took the easy way out. I wrote a second very simple lexer/parser combination that would simply recognize a stream of keywords, identifiers and literals, and not try to recognize a complex structure in it. The token types are then routed to a list and subsequently to an array, and then the real parsing begins. The real parser now can look ahead as well as back to the boundaries of the file, and whenever we determine that a keyword token is actually an identifier, the array is updated at the corresponding location.
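A rough sketch of the two-pass idea, again in Python rather than ANTLR. Everything here is an assumption for illustration: the keyword table, the token kinds, and the two context rules ("ref" only before "cursor", "collect" only after "bulk"). In my real parser the array is updated from within the grammar rules as parsing proceeds, not by a fixed rule list like this, and the toy tokenizer ignores string literals and comments.

```python
import re

KEYWORDS = {"type", "ref", "cursor", "is", "bulk", "collect", "into"}  # hypothetical


def tokenize(text):
    """Pass 1: a flat token array with no structure, keywords marked everywhere."""
    tokens = []
    for m in re.finditer(r"[A-Za-z_]\w*|\d+|\S", text):
        word = m.group()
        if word.lower() in KEYWORDS:
            kind = "KEYWORD"
        elif word[0].isalpha() or word[0] == "_":
            kind = "IDENT"
        elif word.isdigit():
            kind = "NUMBER"
        else:
            kind = "SYMBOL"
        tokens.append([kind, word])  # a list, so the kind can be updated in place
    return tokens


def reclassify(tokens):
    """Pass 2 (simplified): look ahead and back in the array, demote false keywords."""
    for i, (kind, text) in enumerate(tokens):
        if kind != "KEYWORD":
            continue
        word = text.lower()
        prev_word = tokens[i - 1][1].lower() if i > 0 else None
        next_word = tokens[i + 1][1].lower() if i + 1 < len(tokens) else None
        if word == "ref" and next_word != "cursor":
            tokens[i][0] = "IDENT"   # "ref" is a keyword only in "REF CURSOR"
        elif word == "collect" and prev_word != "bulk":
            tokens[i][0] = "IDENT"   # "collect" is a keyword only after "BULK"
    return tokens


tokens = reclassify(tokenize("ref := 1; TYPE cur IS REF CURSOR;"))
```

Here the leading "ref" is demoted to an identifier while the "REF" in "REF CURSOR" stays a keyword. Because the whole file is already in an array, looking ahead or back never advances a lexer or overwrites anything.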

I have just scheduled a complete bulk parse of 5023 PL/SQL files, totalling 180MB in size. The parser has just chewed through 4892 of those (169MB), and that’s after only one day of testing and fixes. It has taken a little over an hour, compared to the roughly 2 1/2 hours it took with my initial “keywords everywhere” parser. The overheads include the repeated invocation of the parser executable and the double-pass parsing mechanism, but the results are certainly very promising.