In which I try to write a tokenizer, and fail...

I reread something by Steve Yegge, which I think was his NBL thing. Anyway, he said something to the effect of 'writing a programming language will make you a better programmer'. And I thought 'Really? Well, why not. Practicing any sort of programming probably helps, but if you're also doing that kind of introspection maybe it helps more.'

Anyway. The thought process continued, and eventually I spent my evening writing a tokenizer. A buggy one. By the time I went to bed my brain was too fried to see the bugs in my code.

Tonight I was still having trouble seeing the bugs in my code, until I compared the input with the output more closely.

And this led me to find that the bugs were the kind that probably happen to people writing tokenizers all the time (at least people who don't get much practice).

They are:
1) Consuming too many chars of input when building the token. This goes unnoticed when the next char is meant to be ignored anyway, like spaces. But it suddenly becomes very noticeable when you are missing an expected '(' token.
2) Not outputting the final token, because that's the end of the input stream!

Now that I've identified my bugs, it's time to think about why they happened, and what to do better instead.

Why bug 1?

I decided that I would write my tokenizer as an implicit state machine, where the states are encoded in the execution flow through the code, i.e. the program counter. Which is just a fancy way of saying lots of if/switch statements and while loops, nested as deep as my tokens are complicated, which luckily isn't very. Now that, in itself, is not the cause of the bug. The cause is that the logic in the while loops has to be exactly right: peek at characters as you go along, then consume them only if they match, instead of pulling characters, trying to match them, and (bug) forgetting them otherwise.

So you have to do
while (peekC().isAlphaNumeric()) { token.append(nextC()); }

not
while ((c = nextC()).isAlphaNumeric()) { token.append(c); }   // the char that ends the loop has already been pulled, and is lost

Of course, once you've discovered the right pattern, you might as well codify it somehow as a helper function, so you don't keep forgetting and screwing it up.
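
Something like this is what I have in mind (a rough Java-flavoured sketch; Tokenizer, peekC, nextC and consumeWhile are just my made-up names, not any real API):

import java.util.ArrayList;
import java.util.List;
import java.util.function.IntPredicate;

class Tokenizer {
    private final String input;
    private int pos = 0;

    Tokenizer(String input) { this.input = input; }

    // peek returns -1 at end of input, so callers never read past the end
    private int peekC() { return pos < input.length() ? input.charAt(pos) : -1; }
    private char nextC() { return input.charAt(pos++); }

    // The helper: peek, test, and only consume on a match, so the
    // terminating character is left in the stream for the next token.
    private String consumeWhile(IntPredicate matches) {
        StringBuilder token = new StringBuilder();
        while (peekC() != -1 && matches.test(peekC())) {
            token.append(nextC());
        }
        return token.toString();
    }

    List<String> tokenize() {
        List<String> tokens = new ArrayList<>();
        while (peekC() != -1) {
            int c = peekC();
            if (Character.isWhitespace(c)) {
                nextC();                                    // skip spaces
            } else if (Character.isLetterOrDigit(c)) {
                tokens.add(consumeWhile(Character::isLetterOrDigit));
            } else {
                tokens.add(String.valueOf(nextC()));        // single-char tokens like '('
            }
        }
        return tokens;
    }
}

So new Tokenizer("foo (bar)").tokenize() should come out as [foo, (, bar, )], with the '(' and ')' surviving as their own tokens instead of being swallowed.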

Why bug 2?

I decided nextC() would throw at end of input, and I would catch it higher up. This might have worked, if I'd had a finally clause that returned the token being constructed...
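
What that fix might have looked like, roughly (again just a sketch with made-up names; I've used a catch that falls through rather than an actual return from finally, since returning out of a finally block tends to swallow other exceptions):

class EndOfInput extends RuntimeException {}

class ThrowingTokenizer {
    private final String input;
    private int pos = 0;

    ThrowingTokenizer(String input) { this.input = input; }

    private char peekC() {
        if (pos >= input.length()) throw new EndOfInput();
        return input.charAt(pos);
    }

    private char nextC() {
        if (pos >= input.length()) throw new EndOfInput();
        return input.charAt(pos++);
    }

    // Bug 2's fix: when EndOfInput escapes the loop, the token built so far
    // still gets returned instead of being silently dropped.
    String readWord() {
        StringBuilder token = new StringBuilder();
        try {
            while (Character.isLetterOrDigit(peekC())) {
                token.append(nextC());
            }
        } catch (EndOfInput eof) {
            // end of the stream: fall through and emit the partial token
        }
        return token.toString();
    }
}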

It may just be that exceptions are a really silly way of handling end of input. Still thinking about this one.