Win32 Console Input: How to differentiate between 0xE0 as a Start-Of-Sequence and 0xE0 as the 1st byte of a UTF-8 sequence/as a codepage character?

RobinLe 0 Reputation points
2024-10-28T21:25:39.0133333+00:00

I use _getch() for reading input character by character, but there seems to be no way to tell whether the next character still belongs to the last read character or not.

It seems that there is no way to tell if a 0xE0 is the 1st of two bytes in a Windows-specific "extended key" or if it's a regular character; like the à in codepage 1252.

Using _kbhit() only partly helps. If you assume that _kbhit() returns 0 in between actual keypresses, some keyboards combine simultaneously pressed keys into one input, so _kbhit() is not reliable for telling separately sent input sequences apart.

Windows API - Win32
C++

1 answer

  1. Darran Rowe 1,031 Reputation points
    2024-10-31T15:23:56.9033333+00:00

    I spent a little more time looking at this, and it isn't possible to remove the ambiguity without analysing the input. As already mentioned, when the codepage is UTF-8 it is possible to look at the byte read after 0xE0 to figure out whether the 0xE0 starts a UTF-8 sequence, since 0xE0 on its own will never be sent for à in that codepage. However, there is no way to determine for certain whether a 0xE0 represents an à or an extended key in codepages like 1252. The best available option there is a heuristic: an à followed by something like K, L, M, or N is unlikely in real text, so such a pair can be treated as an extended key.
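The UTF-8 check described above can be sketched as a small portable helper. This is only an illustration: `classify_after_e0` and `E0Meaning` are hypothetical names, and the helper assumes that the second byte of an extended key never falls in the range 0xA0 to 0xBF, which should be verified against the keys you care about.

```cpp
#include <cstdint>

// What a 0xE0 byte returned by _getch() most plausibly starts.
enum class E0Meaning { ExtendedKey, Utf8Sequence };

// Hypothetical helper: classify 0xE0 from the byte that follows it.
// Assumes the console input codepage is UTF-8 (65001). In well-formed
// UTF-8, a 0xE0 lead byte must be followed by a byte in 0xA0..0xBF.
// Assumption to verify: extended-key codes never fall in that range.
constexpr E0Meaning classify_after_e0(std::uint8_t next) {
    return (next >= 0xA0 && next <= 0xBF) ? E0Meaning::Utf8Sequence
                                          : E0Meaning::ExtendedKey;
}
```

For example, 0xE0 followed by 0x4B (`K`) would be treated as an extended key, while 0xE0 followed by 0xA4 would be treated as the start of a three-byte UTF-8 sequence. In codepage 1252 this test does not apply, because 0xE0 followed by `K` could equally be the text "àK".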

    If the desire is to remove all ambiguity, then ReadConsoleInput is the only option. This function deals with INPUT_RECORD structures, meaning that it will give key states. It is also much easier to determine if the cursor keys have been pressed since the entire INPUT_RECORD is dedicated to the key press and INPUT_RECORD also indicates if there is a character to display.
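A minimal sketch of reading key events this way might look like the following. `KeyInfo`, `KeyKind`, `classify`, and `read_key` are hypothetical names introduced for illustration; only `ReadConsoleInputW`, `INPUT_RECORD`, and `KEY_EVENT_RECORD` come from the Windows API. The Windows-specific part is guarded so that the classification logic can be compiled anywhere.

```cpp
#ifdef _WIN32
#include <windows.h>
#endif

// Portable copy of the KEY_EVENT_RECORD fields used in this sketch.
struct KeyInfo {
    bool key_down;      // bKeyDown
    unsigned short vk;  // wVirtualKeyCode
    wchar_t ch;         // uChar.UnicodeChar (0 if there is no character)
};

enum class KeyKind { Release, Character, NoCharacter };

// A key-down record either carries a character or it does not.
KeyKind classify(const KeyInfo& k) {
    if (!k.key_down) return KeyKind::Release;
    return k.ch != 0 ? KeyKind::Character : KeyKind::NoCharacter;
}

#ifdef _WIN32
// Blocks until the next key event arrives; other event types
// (mouse, window resize, focus, menu) are skipped.
bool read_key(HANDLE in, KeyInfo& out) {
    INPUT_RECORD rec;
    DWORD count = 0;
    while (ReadConsoleInputW(in, &rec, 1, &count) && count == 1) {
        if (rec.EventType != KEY_EVENT) continue;
        const KEY_EVENT_RECORD& k = rec.Event.KeyEvent;
        out = KeyInfo{k.bKeyDown != FALSE, k.wVirtualKeyCode,
                      k.uChar.UnicodeChar};
        return true;
    }
    return false;
}
#endif
```

On Windows, `read_key(GetStdHandle(STD_INPUT_HANDLE), info)` would fill `info` for each key event. A left-arrow press should then show up with `vk == VK_LEFT` and `ch == 0`, while an à shows up with `ch == 0x00E0`, so the two can no longer be confused.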

    Just be aware that ReadConsoleInput will give all key presses. As an example, if à is being input using a dead key, then there will be the key down and key up records for the dead key first. Also, there may be multiple key down records for the same key press. If the codepage is UTF-8, ReadConsoleInputA will read two key down events for the a key press that completes the à. The first will contain 0xC3, the second will contain 0xA0. It is easier to distinguish between printable keys and non-printable keys because the uChar member of the KEY_EVENT_RECORD contained in INPUT_RECORD will contain a value when there is a printable character, but will be 0 when there is none. Finally, if ReadConsoleInputW is used, the function will return UTF-16 encoded values regardless of the console codepage.
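If ReadConsoleInputA is used with a UTF-8 codepage, the bytes spread across several key down records have to be joined again. The following is a rough sketch of that step, not a full decoder: `Utf8Accumulator` is a hypothetical class, it assumes one byte per key down record (taken from `uChar.AsciiChar`), and it does not reject overlong encodings or surrogate values.

```cpp
#include <cstdint>
#include <optional>

// Hypothetical accumulator: feed it one byte per key-down record
// (uChar.AsciiChar cast to unsigned char) when the codepage is UTF-8.
class Utf8Accumulator {
public:
    // Returns a code point once a sequence is complete, otherwise nullopt.
    std::optional<char32_t> push(std::uint8_t b) {
        if (remaining_ == 0) {
            if (b < 0x80) return static_cast<char32_t>(b);  // plain ASCII
            if ((b & 0xE0) == 0xC0)      { value_ = b & 0x1F; remaining_ = 1; }
            else if ((b & 0xF0) == 0xE0) { value_ = b & 0x0F; remaining_ = 2; }
            else if ((b & 0xF8) == 0xF0) { value_ = b & 0x07; remaining_ = 3; }
            // Anything else is a stray continuation or invalid lead: dropped.
            return std::nullopt;
        }
        if ((b & 0xC0) != 0x80) {  // expected a continuation byte
            remaining_ = 0;        // malformed input: reset and drop
            return std::nullopt;
        }
        value_ = (value_ << 6) | (b & 0x3F);
        if (--remaining_ == 0) return value_;
        return std::nullopt;
    }

private:
    char32_t value_ = 0;
    int remaining_ = 0;
};
```

Pushing 0xC3 returns nothing, and pushing 0xA0 afterwards completes the sequence and yields U+00E0 (à). With ReadConsoleInputW this step is unnecessary, since the character already arrives as a UTF-16 value.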
