Tokenize and highlight

Do you remember the Lexer basics chapter? The text in your editor is just “a stream of characters”. During lexical analysis, this stream is split into separate “words”, called “tokens” in computer science. Each token belongs to some category: keyword, constant, variable, … After tokenizing the source code, the actual syntax highlighting can begin.

Create a token_list

Let us use a very simple regex to split the text into tokens:
p = re.compile(r"\s+|\w+|\W")

There is a very useful website where you can immediately see the effect of your regex on a string. Take a look at https://regex101.com/ and try the regex above on a few lines of C code.

As you can see there, this regex makes useful splits in the text. There is a lot of room for improvement if you want to highlight your C code professionally, but that would lead us too far right now.

This is how you make the token_list in Python:
token_list = p.findall(text)

The result is:
token_list = ['#', 'include', ' ', '<', 'stdio', '.', 'h', '>', '\r\n\r\n', 'int', ' ', 'main', '(', ')', '\r\n', … ]
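
If you want to reproduce this outside the editor, here is a tiny standalone sketch (the C snippet is just a stand-in for whatever text your editor holds):

import re

# Stand-in for the text that normally comes from the editor widget
text = "#include <stdio.h>\r\n\r\nint main()\r\n"

p = re.compile(r"\s+|\w+|\W")
token_list = p.findall(text)
print(token_list)
# ['#', 'include', ' ', '<', 'stdio', '.', 'h', '>', '\r\n\r\n', 'int', ' ', 'main', '(', ')', '\r\n']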

But wait a minute. Remember how the setStyling(..) method for syntax highlighting worked? You feed it a number – how many characters (strictly speaking: bytes) it has to highlight, starting from the current internal position – and you also specify a style (which is again a number). To use this method, we need to know the length of each token:
token_list = [('#', 1), ('include', 7), (' ', 1), ('<', 1), ('stdio', 5), ('.', 1), ('h', 1), ('>', 1), … ]
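
Pairing each token with its length is a one-liner. Note that the length is taken in bytes here (via bytearray) rather than in characters – for plain ASCII the two are identical, but the styling engine counts bytes, so this matters as soon as non-ASCII characters show up:

# (token, token_length) pairs; the length is measured in utf-8 bytes
token_list = [(token, len(bytearray(token, "utf-8"))) for token in p.findall(text)]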


Highlight in a loop

Now we have the token_list – let’s do some simple syntax highlighting:

  • Keywords will be colored red – this is style 1
  • Parentheses and braces are colored blue – this is style 2
  • Everything else is just black – this is style 0

Remember, styles 0, 1 and 2 were defined on the All about styles page.
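
As a quick reminder, such style definitions for a QsciLexerCustom subclass might look roughly like this – a sketch assuming PyQt5, where the class name MyLexer, the font and the concrete colors are placeholders for your own definitions from that page:

from PyQt5.Qsci import QsciLexerCustom
from PyQt5.QtGui import QColor, QFont

class MyLexer(QsciLexerCustom):
    def __init__(self, parent):
        super().__init__(parent)
        self.setDefaultFont(QFont("Consolas", 12))
        self.setColor(QColor("#000000"), 0)  # style 0: black (default)
        self.setColor(QColor("#ff0000"), 1)  # style 1: red (keywords)
        self.setColor(QColor("#0000ff"), 2)  # style 2: blue (braces)

    def language(self):
        return "SimpleC"

    def description(self, style):
        # QsciLexerCustom requires a human-readable name for every style
        return f"style_{style}"

    # ... plus the styleText() method shown below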

def styleText(self, start, end):
    # 1. Initialize the styling procedure
    # ------------------------------------
    self.startStyling(start)

    # 2. Slice out a part from the text
    # ----------------------------------
    text = self.parent().text()[start:end]

    # 3. Tokenize the text (requires 'import re' at the top of the module)
    # ---------------------------------------------------------------------
    p = re.compile(r"\s+|\w+|\W")
    token_list = [(token, len(bytearray(token, "utf-8"))) for token in p.findall(text)]
    # -> 'token_list' is a list of tuples: (token_name, token_len)

    # 4. Style the text in a loop
    # ----------------------------
    # self.setStyling(number_of_chars, style_nr)
    #
    for token in token_list:
        if token[0] in ["for", "while", "return", "int", "include"]:
            # Red style
            self.setStyling(token[1], 1)

        elif token[0] in ["(", ")", "{", "}", "[", "]", "#"]:
            # Blue style
            self.setStyling(token[1], 2)

        else:
            # Default style
            self.setStyling(token[1], 0)
        ###
    ###
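
To actually see this in action, the lexer has to be attached to an editor widget. A minimal hookup – again assuming PyQt5 and the hypothetical MyLexer class sketched above – could be:

import sys
from PyQt5.QtWidgets import QApplication
from PyQt5.Qsci import QsciScintilla

app = QApplication(sys.argv)
editor = QsciScintilla()
editor.setText("#include <stdio.h>\r\n\r\nint main()\r\n")
editor.setLexer(MyLexer(editor))  # styleText() is now called automatically
editor.show()
sys.exit(app.exec_())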

Great! Let’s look at the result: keywords turn red, parentheses and braces turn blue, and everything else stays black – exactly as planned.

To keep the internal counter in sync, the token lengths in your token_list must always add up to the length (in bytes) of the text you’re handling! To keep an eye on that, you might introduce the following sanity check:

print(len(bytearray(text, "utf-8")))

total = 0
for token in token_list:
    total += token[1]
print(total)

If both print statements output the same number, you’re safe. If not, your internal counter got out of sync.
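
If you prefer to fail fast instead of comparing two printed numbers, the same check fits into an actual assertion – a small sketch using the built-in sum():

# Total styled bytes must equal the byte length of the styled text
assert sum(length for _, length in token_list) == len(bytearray(text, "utf-8")), \
    "token_list out of sync with the styled text"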