Converting Win32 API results to std::wstring (or std::string)

Hmm. Just realized that this is a bit out of order and should have been published before the previous post. :)

It turns out that there are a significant number of Win32 APIs that share a similar calling pattern: you call the API once to find the size of the buffer needed for the result, allocate a buffer, then call the API again to fill in that buffer. Examples of APIs like this are GetCurrentDirectory, GetEnvironmentVariable, ExpandEnvironmentStrings, MultiByteToWideChar, etc. The pattern shows up over and over again in the Win32 API surface. It would be really useful if it were possible to define a fairly standard pattern that results in the creation of a std::wstring from one of these APIs (it’s probably not possible to create a template to handle all the possible APIs, but that’s OK; defining a working pattern is good enough). One of the key aspects of such a pattern is that I only ever want to perform one allocation during the call. Otherwise I could simply use malloc to allocate a buffer, fill in the buffer and then construct a new std::wstring from that buffer.

I did what every developer does in this situation and searched the internet and I found a solution that seemed to work correctly, but I wasn’t really happy with what I came up with (NOTE: DO NOT USE THIS CODE):

 std::wstring UnicodeStringFromAnsiString(_In_ const std::string &ansiString)
{
    std::wstring unicodeString;

    auto wideCharSize = MultiByteToWideChar(CP_ACP, MB_PRECOMPOSED, ansiString.c_str(), -1, nullptr, 0);
    if (wideCharSize == 0)
    {
        return L"";
    }
    unicodeString.reserve(wideCharSize);
    unicodeString.resize(wideCharSize-1);

    wideCharSize = MultiByteToWideChar(CP_ACP, MB_PRECOMPOSED, ansiString.c_str(), -1, &unicodeString[0], wideCharSize);

    return unicodeString;
}

There are a couple of issues with this code (ignore the fact that it doesn’t handle errors particularly well). The first is that it is inconsistent in its handling of the failure case: it returns L"" if the first call fails but never checks the result of the second call. I’m also not happy with the “reserve”/“resize” aspect of the result.

So I asked the owners of the C++ runtime library on the VC team what they would suggest and they pointed out a huge issue with my solution.

The code I found above worked, but it ignored at least one really important aspect of the implementation of the std::wstring type (and the std::string type). It turns out that the null terminating character in the std::wstring is owned by the STL, and code outside the STL is not allowed to mess with that null terminating character. That’s exactly what happens when I wrote .reserve (to allocate storage) and .resize (to resize the buffer so it doesn’t contain the trailing null): the second MultiByteToWideChar call writes wideCharSize characters, terminator included, into a string whose size is only wideCharSize-1, so the API’s null lands on the string’s own terminator. There’s a more significant problem with this code however. By returning both a named value AND a temporary, it has the side effect of disabling the named return value (NRV) optimization, which can result in significant performance degradation.

Instead of my example, they suggested that I change the code as follows (follow this pattern instead):

 std::wstring UnicodeStringFromAnsiString(_In_ const std::string &ansiString)
{
    std::wstring returnValue;
    auto wideCharSize = MultiByteToWideChar(CP_ACP, MB_PRECOMPOSED, ansiString.c_str(), -1, nullptr, 0);
    if (wideCharSize == 0)
    {
        return returnValue;
    }
    returnValue.resize(wideCharSize);
    wideCharSize = MultiByteToWideChar(CP_ACP, MB_PRECOMPOSED, ansiString.c_str(), -1, &returnValue[0], wideCharSize);
    if (wideCharSize == 0)
    {
        returnValue.resize(0);
        return returnValue;
    }
    returnValue.resize(wideCharSize-1);
    return returnValue;
}

The key difference between this version and the previous is that we only allocate the string’s buffer once we’ve calculated the size, and we size the string to the full buffer the API needs. We then fill the buffer (using the array operator [] to gain access to a non-const buffer for the string) and finally resize the string to the actual size of the string. By performing the operations in this order, we ensure that we never overwrite the STL’s null character – it always lies one character beyond the end of the buffer that we’re filling in. And because a resize that decreases the size of a buffer never allocates a new buffer, we preserve the desired behavior that we only ever allocate one buffer. And finally, this version only ever returns the same local variable (thus enabling NRVO).

One final note: This technique only works when the output of the Win32 call is relatively static. If you’re retrieving data whose size can change on the fly, it’s probably best to add a loop around the API call to ensure that the buffer is large enough.
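
To make that last point a bit more concrete, here is a rough sketch of what such a loop might look like. It is not from the VC team and it isn't Win32-specific: StringFromSizedApi and its fill parameter are hypothetical names invented for this illustration. The callable is assumed to behave like GetCurrentDirectoryW, meaning it returns the character count without the null on success, the required size including the null when the buffer is too small, and 0 on failure. APIs with a different return convention would need a different check inside the loop.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper (illustration only). `fill(buffer, capacity)` is assumed
// to follow the GetCurrentDirectoryW-style contract:
//   - buffer large enough: writes the text plus a terminating null and returns
//     the number of characters written, NOT counting the null;
//   - buffer too small: returns the required size, counting the null;
//   - failure: returns 0.
template <typename Fill>
std::wstring StringFromSizedApi(Fill fill)
{
    std::wstring result;

    // First call: ask how much room is needed (0 means failure).
    std::size_t needed = fill(nullptr, 0);
    while (needed != 0)
    {
        // Size the string to hold the text plus the API's null, so the API
        // never touches the string's own terminator.
        result.resize(needed);
        const std::size_t written = fill(&result[0], result.size());
        if (written < result.size())
        {
            // Success (or failure, when written == 0): shrink to the text
            // that was actually written, dropping the API's null.
            result.resize(written);
            break;
        }
        // The data grew between calls: retry with the new size, always
        // asking for more than last time so the loop makes progress.
        needed = (written > result.size()) ? written : result.size() + 1;
    }
    return result;   // single named return value, so NRVO still applies
}
```

With the real API, the callable would be a small lambda that adapts the argument order, along the lines of [](wchar_t *buf, std::size_t cap) { return GetCurrentDirectoryW(static_cast<DWORD>(cap), buf); }. In the common case where nothing changes between the two calls this still performs a single allocation; it only reallocates when the data actually grew.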

Comments

  • Anonymous
    November 25, 2015
    If you provide ansiString.size() instead of -1 as the argument for cbMultiByte, then MultiByteToWideChar won't write a null terminator, elegantly bypassing the problem you're trying to solve. This saves MultiByteToWideChar from computing the length that's already known. It eliminates the over-allocation for an extra null terminator (which means we're more likely to be able to take advantage of the small string optimization). It also eliminates the special cases and their associated early returns. The only trick is that ansiString.size() returns an unsigned type that may be of higher rank than the (signed) int that MultiByteToWideChar expects. Since you're not dealing with this problem in general, you can just cast away the compiler warning.

        std::wstring UnicodeStringFromAnsiString(_In_ const std::string &ansiString)
        {
            const std::string::size_type limit = std::numeric_limits<int>::max();
            assert(ansiString.size() < limit);
            const int ansiByteSize = static_cast<int>(ansiString.size());
            const auto wideCharSize =
                ::MultiByteToWideChar(CP_ACP, MB_PRECOMPOSED, ansiString.c_str(), ansiByteSize,
                                      nullptr, 0);
            std::wstring returnValue(wideCharSize, L'\0');
            const auto finalCharSize =
                ::MultiByteToWideChar(CP_ACP, MB_PRECOMPOSED, ansiString.c_str(), ansiByteSize,
                                      &returnValue[0], wideCharSize);
            assert(finalCharSize == wideCharSize);
            return returnValue;
        }

    Note that we now simply construct the buffer with the right size; there are no more resizes. We construct the buffer with initial L'\0' characters. These aren't really null terminators, just zeroes. We could have used L'*' or any other wide character, but I figured the optimizer has the best chance of doing the right thing with zeroes.
    Note that std::wstring::resize is going to fill the buffer as well, so using the constructor to do it isn't any less efficient, though it is unfortunate that we have to fill it at all since it's about to be filled anyway. If you think something can go wrong that causes the first call to succeed and the second call to fail, you can just replace the last assert statement with returnValue.resize(finalCharSize).

  • Anonymous
    November 25, 2015
    Interesting. I was unaware that MBTWC handles the case where it wouldn't drop in a trailing nul. But I suspect that is not the case for all the APIs that match this pattern (GetCurrentDirectory, etc). But a nice optimization for charset conversion.