Coding for Kids #3. Bugs!

In the first post, we introduced the problem.  We’re writing a program that solves the problem of finding 100 point words, where each letter in the word corresponds to its position in the alphabet (A=1, B=2, Z=26).  In the second post, we coded the basic solution that allows the user to enter a word, and we calculate the result.

We have a bug, though, because we made the assumption that the letters passed in are lower case.  The word “automated” is a 100 point word, but “Automated” is displaying only 99 points, because the first “A” in the word isn’t getting recognized correctly.  Remember, in computer programming, we need to be precise.

There are a couple of ways to solve this.  The first way will be to simply tack on extra if statements to deal with capital letters, like so:

    if (c == 'a') return 1;
    if (c == 'A') return 1;

But, we already decided this function isn’t all that efficient, and this doubles the number of if statements.  Fortunately, C#, and more specifically, the .NET Runtime that hosts the application, has a rich set of functionality in the base class libraries (that is, what’s “in the box”) that can do some typical work for us.   In this case, the string object contains a number of useful methods to convert the entire string to either upper or lower case characters.

For example, we can modify the getWordValue function to something like this:

    1:  private int getWordValue(string theWord)
    2:  {
    3:      int wordValue = 0;
    4:   
    5:      foreach (char c in theWord.ToLower())
    6:      {
    7:          wordValue += getCharacterValue(c);
    8:      }
    9:   
   10:      return wordValue;
   11:  }

Notice on line 5, we’re calling ToLower() on theWord, which converts the entire word to lower case characters.  If we rerun the application (F5) and try the word “automated” with a mix of characters, we’ll see we get the correct value:

[image: the application showing the correct value for “automated” typed in mixed case]
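If you don't have the project from the previous post handy, here is a rough, self-contained sketch of the bug and the fix. The class name, the `normalizeCase` switch, and the `GetLowerCaseOnlyValue` helper are made up for this illustration; the helper just stands in for the original lower-case-only chain of if statements.

```csharp
using System;

public static class CaseBugDemo
{
    // Hypothetical stand-in for the original function: it only knows
    // about lower case letters, so anything else scores 0.
    public static int GetLowerCaseOnlyValue(char c)
    {
        const string alphabet = "abcdefghijklmnopqrstuvwxyz";

        // IndexOf returns -1 when the character isn't found, so
        // adding 1 gives 0 for anything that isn't a-z.
        return alphabet.IndexOf(c) + 1;
    }

    // Adds up the letter values, optionally lower-casing the word first.
    public static int GetWordValue(string theWord, bool normalizeCase)
    {
        string word = normalizeCase ? theWord.ToLower() : theWord;
        int wordValue = 0;

        foreach (char c in word)
        {
            wordValue += GetLowerCaseOnlyValue(c);
        }

        return wordValue;
    }

    public static void Main()
    {
        Console.WriteLine(GetWordValue("Automated", false)); // 99 (the bug)
        Console.WriteLine(GetWordValue("Automated", true));  // 100
        Console.WriteLine(GetWordValue("AuToMaTeD", true));  // 100
    }
}
```

The only difference between the buggy and the fixed result is the call to ToLower() before the loop.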

Even though we fixed the bug, our code is fragile.  What that means is that there are assumptions built into the code that, if not corrected, will cause errors or unexpected behavior down the road.  For example, the getCharacterValue function still assumes the character is lower case, and it only works because getWordValue now lower-cases the word first.  Sometimes, and especially in small projects, you just accept that the code is not ideal and move on.

Before we move on to reading files and finding 100 point words, let’s solve a couple of problems here in the code.  The first thing to understand is that computers store all data as 0’s and 1’s, called binary.  Each binary digit is called a bit, and 8 of those are called a byte.  The way computers translate those zeros and ones into letters and words is through encoding.  Encoding is a standardized way to convert a binary number into a character.  Going through the ins and outs of encoding is worthy of a number of blog posts, but think of it like a map.  The computer sees a binary number like 01100001, which happens to equal 97 decimal.  The computer has a character map that says the number 97 is equal to the letter ‘a’ (specifically the lower case ‘a’).  Without a character map, the computer has no way of knowing this is supposed to be an ‘a’.  As you might guess, the character map says the number 98 is equal to ‘b’, and so on, where ‘z’ is 122.  And while we’re at it, that character map also says ‘A’ (upper case) is 65, and ‘Z’ is 90.

These numbers have roots in a character encoding set known as ASCII, which was convenient because the English letters, digits, and common punctuation could each be represented in a single byte (more specifically, in only 7 bits).  Today, it’s common for applications, and our runtime, to use Unicode, a more modern way to map the binary data to characters using extensible code points.  By extensible, I mean that Unicode has room for code points that cover virtually any language.
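To see the character map in action, here is a small sketch that prints a few letters alongside their decimal and binary values. The class and the `ToBinary` helper are names invented for this example; `Convert.ToString(value, 2)` is the framework call that produces the binary text.

```csharp
using System;

public static class CharacterMapDemo
{
    // Returns the character's numeric value as 8 binary digits,
    // padded on the left with zeros.
    public static string ToBinary(char c)
    {
        return Convert.ToString((int)c, 2).PadLeft(8, '0');
    }

    public static void Main()
    {
        foreach (char c in new[] { 'a', 'z', 'A', 'Z' })
        {
            // For 'a' this prints: a = 97 = 01100001
            Console.WriteLine($"{c} = {(int)c} = {ToBinary(c)}");
        }

        // The map works in the other direction, too.
        Console.WriteLine((char)97); // a
    }
}
```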

The point of the above, though, is that for English use, Unicode maintains compatibility with ASCII.  So, we can rely on those character numbers (65-90 for upper, and 97-122 for lower case) to tell us if we have an upper or lower case letter.   Computers can process numbers _real_ fast, so instead of doing an evaluation of the character as we have been, let’s evaluate the number instead:

    1:  /// <summary>
    2:  /// Returns the value of a given character, where the
    3:  /// value is determined by its location in the alphabet.
    4:  /// (A or a = 1, B or b = 2, etc.).  Case insensitive.
    5:  /// </summary>
    6:  /// <param name="c">The character to be evaluated.</param>
    7:  /// <returns>The numerical value of the character, 1-26.</returns>
    8:  private int getCharacterValue(char c)
    9:  {
   10:      //get the numerical/ASCII value of the letter
   11:      int charValue = (int)c;
   12:   
   13:      //if the character is a lower case letter a-z
   14:      if (charValue >= 97 && charValue <= 122)
   15:      {
   16:          return charValue - 96;
   17:      }
   18:   
   19:      //if the character is an upper case letter A-Z
   20:      if (charValue >= 65 && charValue <= 90)
   21:      {
   22:          return charValue - 64;
   23:      }
   24:   
   25:      //not an A-Z or a-z character, return 0
   26:      return 0;
   27:  }

The first thing we’re doing on line 11 is creating a local variable, charValue, to hold the numerical value of the character, c, passed in.  (int)c is called a cast.   We know we have a character, but we want it represented as a number.  The cast allows us to do that.   We store the result in a variable (charValue) because it reads clearer than casting the char over and over.  A word of warning:  you need to know what you’re doing when you cast.   It’s safe to cast a character as an int to get the numerical value, but when getting into more complicated scenarios, failed casts raise an exception which needs to be handled (and exception handling is outside the scope of this series).
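For the curious, here is a minimal sketch of a safe cast next to one that fails. The class and the `TryUnboxToInt` helper are hypothetical names for this example, and the try/catch is only there so the sample can report the failure instead of crashing; we won't need it in our program.

```csharp
using System;

public static class CastDemo
{
    // Attempts to cast an object to int and reports whether it worked.
    public static bool TryUnboxToInt(object value, out int result)
    {
        try
        {
            result = (int)value; // throws if value isn't really an int
            return true;
        }
        catch (InvalidCastException)
        {
            result = 0;
            return false;
        }
    }

    public static void Main()
    {
        // Safe: a char always has a numerical value.
        char c = 'a';
        int charValue = (int)c;
        Console.WriteLine(charValue); // 97

        // Not safe: this object holds a string, not an int.
        object boxedText = "a";
        int n;
        Console.WriteLine(TryUnboxToInt(boxedText, out n)); // False
    }
}
```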

In general, I wouldn’t recommend working with character values directly unless you’re really sure of what is going on encoding-wise.  This is a good example of it being okay, because we have very specific rules and expectations, and the app is overall quite simple.

Our if statements on lines 14 and 20 look to see if the character’s numerical value falls in the given range.  The double ampersand (&&) is a logical AND operation, so both conditions must be true for the if to evaluate to true.   If it’s a lower case letter, we simply return that value minus 96, which will give us a number from 1 to 26.   Same for the upper case, although we subtract 64 to get its value.  The code above is more concise and runs FAR faster (though it is _still_ not the best it can be), but it’s not quite as readable or obvious as to what it’s doing.  That’s where effective commenting comes in.  Notice, too, we solved the upper/lower case issue in a far better way: it’s not as fragile.   We do lose some flexibility, however.  If the “game” changes and makes vowels worth double, for example, we’d have to go back to something else.  Also, notice we check for lower case characters first.  This is deliberate:  while it will handle either lower or upper case, if the characters passed in are more often than not lower case, we return the value and it’s one less if block that gets evaluated.   For the purposes of our program, it’s not significant except as an academic exercise in code optimization.
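On the readability point, one possible variation (not the version we're using in this series) is to compare against the letters themselves rather than the numbers 97, 122, 65, and 90. C# lets us compare and subtract chars directly, and the result of the subtraction is an int. The class name below is made up so the sketch can run on its own.

```csharp
using System;

public static class CharacterValueSketch
{
    // Same logic as getCharacterValue, written with character
    // literals instead of their numerical values.
    public static int GetCharacterValue(char c)
    {
        // lower case first, since that's the more common input
        if (c >= 'a' && c <= 'z')
        {
            return c - 'a' + 1; // 'a' - 'a' + 1 = 1 ... 'z' = 26
        }

        if (c >= 'A' && c <= 'Z')
        {
            return c - 'A' + 1;
        }

        // not an A-Z or a-z character
        return 0;
    }

    // No ToLower() needed here: each character handles its own case.
    public static int GetWordValue(string theWord)
    {
        int wordValue = 0;

        foreach (char c in theWord)
        {
            wordValue += GetCharacterValue(c);
        }

        return wordValue;
    }

    public static void Main()
    {
        Console.WriteLine(GetWordValue("automated")); // 100
        Console.WriteLine(GetWordValue("Automated")); // 100
    }
}
```

It does exactly the same comparisons as the numbered listing, so it's a matter of taste which one you find easier to read.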

For now, we’ll call this “good enough” and move on to the next challenge:  finding words!