Lexer basics

What is a lexer?

When you compile a program, it goes through several compilation steps before ending up as the final “machine code” or “binary code”. The very first step in this process is lexical analysis. Your source file is nothing more than a stream of characters. The lexer breaks this stream into separate “words”, which in computer science we call “tokens”. Each token belongs to some category, like a keyword, constant, variable, …

A lexer breaks the stream of characters up into separate tokens. A token is a conversion unit. Each token belongs to some category, like a keyword, constant, variable, …
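
To make the idea concrete, here is a minimal sketch of a lexer in plain Python. It assumes a tiny C-like language; the category names, the regular expressions and the `tokenize` function are made up for illustration and are not part of QScintilla.

```python
import re

# Hypothetical token categories for a tiny C-like language.
# Order matters: keywords must be tried before identifiers.
TOKEN_SPEC = [
    ("keyword",    r"\b(?:int|return|if|else|while)\b"),
    ("constant",   r"\d+"),
    ("identifier", r"[A-Za-z_]\w*"),
    ("operator",   r"[=+\-*/<>!]=?|[(){};,]"),
    ("whitespace", r"\s+"),
]
MASTER = re.compile(
    "|".join(f"(?P<{name}>{pattern})" for name, pattern in TOKEN_SPEC)
)


def tokenize(source):
    """Split the character stream into (category, text) pairs."""
    tokens = []
    pos = 0
    while pos < len(source):
        match = MASTER.match(source, pos)
        if match is None:
            raise ValueError(f"unexpected character {source[pos]!r} at {pos}")
        # Whitespace separates tokens but is not a token itself here.
        if match.lastgroup != "whitespace":
            tokens.append((match.lastgroup, match.group()))
        pos = match.end()
    return tokens


for category, text in tokenize("int x = 42;"):
    print(f"{category:<10} {text}")
```

Each printed line shows one token together with the category it was assigned to. A real lexer handles many more cases (comments, strings, floating-point numbers), but the principle stays the same.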

Once the lexer has split the stream into tokens, it can go a step further and actually color them. Of course, that’s not needed for compilation. But we’re not interested in compilation right now. We just want to use a lexer for syntax highlighting.

A parser goes one level deeper than the lexer. It takes the tokens produced by the lexer and tries to determine whether proper sentences have been formed. Herein lies the difference: parsers operate at the grammatical level, whereas lexers work at the word level. A lexer is generally sufficient to provide syntax highlighting in your IDE. And that’s exactly what we’re going to do in this chapter.
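
The coloring step can be sketched in a few lines of plain Python as well. This is only an illustration under assumptions: the categories, the colors and the `color_runs` function are hypothetical and do not reflect how QScintilla stores its styles. The point is that a color is chosen per token category, with no grammar involved.

```python
import re

# Hypothetical color per token category (hex RGB strings).
COLORS = {
    "keyword": "#0000ff",
    "constant": "#ff8000",
    "identifier": "#000000",
}

# One named group per category; keywords are tried before identifiers.
PATTERN = re.compile(
    r"(?P<keyword>\b(?:if|else|return)\b)"
    r"|(?P<constant>\d+)"
    r"|(?P<identifier>[A-Za-z_]\w*)"
)


def color_runs(source):
    """Return a (start, length, color) tuple for every recognized token."""
    runs = []
    for match in PATTERN.finditer(source):
        length = match.end() - match.start()
        runs.append((match.start(), length, COLORS[match.lastgroup]))
    return runs


for start, length, color in color_runs("return x1 + 42"):
    print(start, length, color)
```

An editor widget could walk over such a list and paint each character range in its color. Characters that match no category (here the `+` and the spaces) simply keep the default color.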


A lexer in QScintilla is an instance of the QsciLexer class – or one of its subclasses. QScintilla provides an extensive set of complete lexers for various languages like Python, C/C++, C#, … which can be used out of the box.
This is how you install a lexer on your editor:

# 1. Create a C++ lexer object
self.__lexer = QsciLexerCPP(self.__editor)
# 2. Install the lexer onto your editor
self.__editor.setLexer(self.__lexer)

Do you remember the class hierarchy from the introduction chapter? Here is a snippet from that figure, applied to this particular QsciLexerCPP:



Besides the QsciLexerCPP for C++, you’ve got tons of other lexers:

QScintilla predefined lexers

Class                  Description         Class                Description
QsciLexerAVS           AVS lexer           QsciLexerBash        Bash lexer
QsciLexerBatch         Batch lexer         QsciLexerCMake       CMake lexer
QsciLexerCoffeeScript  CoffeeScript lexer  QsciLexerCPP         C++ lexer
QsciLexerCSharp        C# lexer            QsciLexerIDL         IDL lexer
QsciLexerJava          Java lexer          QsciLexerJavaScript  JavaScript lexer
QsciLexerCSS           CSS lexer           QsciLexerD           D lexer
QsciLexerDiff          Diff lexer          QsciLexerFortran     Fortran lexer
QsciLexerHTML          HTML lexer          QsciLexerXML         XML lexer
QsciLexerJSON          JSON lexer          QsciLexerLua         Lua lexer
QsciLexerMakefile      Makefile lexer      QsciLexerMarkdown    Markdown lexer
QsciLexerMatlab        Matlab file lexer   QsciLexerOctave      Octave file lexer
QsciLexerPascal        Pascal lexer        QsciLexerPerl        Perl lexer
QsciLexerPO            PO lexer            QsciLexerPostScript  PostScript lexer
QsciLexerPOV           POV lexer           QsciLexerProperties  Properties lexer
QsciLexerPython        Python lexer        QsciLexerRuby        Ruby lexer
QsciLexerSpice         Spice lexer         QsciLexerSQL         SQL lexer
QsciLexerTCL           TCL lexer           QsciLexerTeX         TeX lexer
QsciLexerVerilog       Verilog lexer       QsciLexerVHDL        VHDL lexer
QsciLexerYAML          YAML lexer



If you want to create your own custom lexer, you need to subclass the QsciLexerCustom class. It requires more work than simply using a pre-cooked lexer, but you’ll get maximal flexibility and you’ll learn how to make some tokens clickable! Every modern IDE has clickable functions and variables that let you jump to the original definitions. That’s our goal.



This is how you install a custom lexer on your editor:

# 1a. Subclass QsciLexerCustom
[explained on next page...]

# 1b. Create a lexer object from your subclass
self.__lexer = MyLexer(self.__editor)
# 2. Install the lexer onto your editor
self.__editor.setLexer(self.__lexer)