Downloading content from the web using different encodings

The other day, somebody asked me: how do I download a webpage, or other content from a web server, when the content is stored in a specific encoding? They wanted to do this using, for example, System.Net.HttpWebRequest.

Why is this necessary?

Well, for starters, web servers around the world store their content in various encodings. For example, web admins in Japan often serve their pages in the Shift-JIS encoding to represent the Japanese characters in those pages.

If you just attach a StreamReader to the stream returned by HttpWebResponse.GetResponseStream(), you will most likely get bad characters in your data. Or your data might appear truncated in the middle. This is because StreamReader uses a default encoding (UTF-8), which might not match the encoding of the bytes you are reading.
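To make the mismatch concrete, here is a minimal sketch that decodes the same four bytes twice. The class and member names are made up for illustration, and ISO-8859-1 stands in for whatever encoding the server actually uses.

```csharp
using System;
using System.Text;

static class MismatchSketch
{
    // "café" as an ISO-8859-1 server would send it: é is the single byte 0xE9.
    public static readonly byte[] Latin1Bytes = { 0x63, 0x61, 0x66, 0xE9 };

    // Decoding with the wrong (default) encoding: 0xE9 alone is not valid UTF-8,
    // so the decoder substitutes the replacement character U+FFFD.
    public static string DecodeAsUtf8()
    {
        return Encoding.UTF8.GetString(Latin1Bytes);
    }

    // Decoding with the encoding the bytes were actually written in.
    public static string DecodeAsLatin1()
    {
        return Encoding.GetEncoding("iso-8859-1").GetString(Latin1Bytes);
    }
}
```

Calling MismatchSketch.DecodeAsLatin1() recovers the original word, while MismatchSketch.DecodeAsUtf8() keeps "caf" and damages the last character.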

So, let's get down to coding.

There are two places where a server can indicate the encoding of the entity in the response. The first is the response header. The second is the entity body itself, if the entity is an HTML page (this is indicated by the “Content-Type: text/html” response header).

The response header you need to look at is:

“Content-Type: foo/bar; charset=<charset encoding>”

If the Content-Type header exists, and the value for this header contains a charset=<value>, then the <value> portion gives the encoding of the response entity.
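As an alternative to searching the header string by hand, the framework's System.Net.Mime.ContentType class can parse the value and expose the charset parameter. The sketch below assumes that approach; the class and method names are hypothetical, and returning null for a missing or unparseable header is a choice made for this example.

```csharp
using System;
using System.Net.Mime;

static class HeaderCharsetSketch
{
    // Returns the charset parameter of a Content-Type header value,
    // or null when the header is missing, has no charset, or cannot be parsed.
    public static string CharsetFromHeader(string contentType)
    {
        if (string.IsNullOrEmpty(contentType))
        {
            return null;
        }
        try
        {
            return new ContentType(contentType).CharSet;
        }
        catch (FormatException)
        {
            // The header value was not a well-formed media type.
            return null;
        }
    }
}
```

For a response w, this would be called as HeaderCharsetSketch.CharsetFromHeader(w.Headers["content-type"]).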

If this header is not present, or if a “charset=” token is not present in the header value, then you need to look at the head of the HTML page (if the entity contains HTML). There may be a meta tag near the beginning of the entity that indicates its charset:

<META HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=iso-8859-1" />
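One way to pull the charset out of such a tag is a regular expression. This is only a sketch: the pattern is an assumption that covers the tag shown above and the shorter <meta charset="..."> form, not a full HTML parser, and the names are invented for the example.

```csharp
using System;
using System.Text.RegularExpressions;

static class MetaCharsetSketch
{
    // Finds "charset=" inside a <meta ...> tag, skips an optional opening quote,
    // and captures the run of characters that can appear in a charset name.
    private static readonly Regex MetaCharset = new Regex(
        @"<meta[^>]*?charset\s*=\s*[""']?([\w.:\-]+)",
        RegexOptions.IgnoreCase);

    // Returns the declared charset, or null if no meta tag declares one.
    public static string CharsetFromMeta(string html)
    {
        Match m = MetaCharset.Match(html);
        return m.Success ? m.Groups[1].Value : null;
    }
}
```

Restricting the match to text inside a meta tag avoids picking up a stray "charset=" that happens to appear in the page body.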

What you need to do is read the entity as ASCII into a string. This is safe for finding the meta tag, because the markup itself consists of ASCII characters. Then you extract the encoding information from the head of the entity. Once you know the encoding, you can reprocess the raw entity using the correct encoding. Of course, you should make sure to store the raw entity in a MemoryStream or other buffer, so that you can read it again using its actual encoding.
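The two passes can be tried without a network connection by working on a byte array directly. The following is a rough sketch under that assumption; TwoPassSketch and Decode are hypothetical names, and falling back to ASCII when no usable charset is found is a simplification.

```csharp
using System;
using System.Text;

static class TwoPassSketch
{
    public static string Decode(byte[] raw)
    {
        // Pass 1: ASCII is enough to read the markup; bytes above 0x7F become '?'.
        string probe = Encoding.ASCII.GetString(raw);

        Encoding enc = Encoding.ASCII; // fallback when no charset is declared
        int i = probe.IndexOf("charset=", StringComparison.OrdinalIgnoreCase);
        if (i != -1)
        {
            int start = i + "charset=".Length;
            int end = probe.IndexOfAny(new[] { '"', '\'', ' ', ';', '>' }, start);
            if (end > start)
            {
                try
                {
                    enc = Encoding.GetEncoding(probe.Substring(start, end - start));
                }
                catch (ArgumentException)
                {
                    // Unknown encoding name: keep the fallback.
                }
            }
        }

        // Pass 2: decode the same buffered bytes with the declared encoding.
        return enc.GetString(raw);
    }
}
```

Because the raw bytes are kept, nothing is lost by the lossy first pass; only the second decode produces the text that is returned.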

Here is the code which demonstrates this:


// Place this method inside a class. It needs:
//   using System; using System.IO; using System.Net; using System.Text;
private static String DecodeData(WebResponse w) {

    //
    // first see if the Content-Type header has a charset=value token
    //
    String charset = null;
    String ctype = w.Headers["content-type"];
    if(ctype != null) {
        int ind = ctype.IndexOf("charset=", StringComparison.OrdinalIgnoreCase);
        if(ind != -1) {
            // keep only the charset token: drop later parameters and quotes
            charset = ctype.Substring(ind + 8).Split(';')[0].Trim(' ', '"', '\'');
            if(charset.Length == 0) {
                charset = null;
            }
            Console.WriteLine("CT: charset=" + charset);
        }
    }

    //
    // save the raw bytes to a MemoryStream so they can be decoded twice
    //
    MemoryStream rawdata = new MemoryStream();
    byte[] buffer = new byte[1024];
    Stream rs = w.GetResponseStream();
    int read = rs.Read(buffer, 0, buffer.Length);
    while(read > 0) {
        rawdata.Write(buffer, 0, read);
        read = rs.Read(buffer, 0, buffer.Length);
    }

    rs.Close();

    //
    // if Content-Type is null, or did not contain a charset, we search in the body
    //
    if(charset == null) {
        rawdata.Seek(0, SeekOrigin.Begin);

        // do not dispose this reader: that would also close rawdata
        StreamReader srr = new StreamReader(rawdata, Encoding.ASCII);
        String meta = srr.ReadToEnd();

        int start_ind = meta.IndexOf("charset=", StringComparison.OrdinalIgnoreCase);
        if(start_ind != -1) {
            int start = start_ind + 8;
            // skip an opening quote, as in <meta charset="utf-8">
            if(start < meta.Length && (meta[start] == '"' || meta[start] == '\'')) {
                start++;
            }
            // the charset name ends at the next quote, space, semicolon, slash or '>'
            int end_ind = meta.IndexOfAny(new Char[] { '"', '\'', ' ', ';', '/', '>' }, start);
            if(end_ind > start) {
                charset = meta.Substring(start, end_ind - start);
                Console.WriteLine("META: charset=" + charset);
            }
        }
    }

    Encoding e = null;
    if(charset == null) {
        e = Encoding.ASCII; // default encoding
    } else {
        try {
            e = Encoding.GetEncoding(charset);
        } catch(Exception ee) {
            // unknown or unsupported charset name: fall back to the default
            Console.WriteLine("Exception: GetEncoding: " + charset);
            Console.WriteLine(ee.ToString());
            e = Encoding.ASCII;
        }
    }

    // decode the buffered bytes again, this time with the detected encoding
    rawdata.Seek(0, SeekOrigin.Begin);

    StreamReader sr = new StreamReader(rawdata, e);

    String s = sr.ReadToEnd();

    return s;
}


Comments

  • Anonymous
    March 30, 2004
    Nice one! I was trying to work out how to do this a couple of weeks ago :)
  • Anonymous
    May 04, 2004
    Hey, what is the w.rawdata;
    I can't find that in the WebResponse
  • Anonymous
    May 04, 2004
    The comment has been removed
  • Anonymous
    May 04, 2004
    That is correct. It should be just "rawdata"

    eferoze
  • Anonymous
    July 08, 2004
    The comment has been removed
  • Anonymous
    July 08, 2004
    Joachim,

    Thanks for your feedback. Can you tell me which version of the framework (and SP) that you got this behavior on ?
  • Anonymous
    July 08, 2004
    > Thanks for your feedback. Can you tell me which version of the framework (and SP) that you got this behavior on ?

    I'm using version 2.0.40607.16 (and got the same result in 1.1.4322.573).

    --Joachim
  • Anonymous
    July 09, 2004
    Joachim,

    I sent your question to a developer. This is his response:

    -----

    To get the behavior this user expects, he should use the CodePage property on Encoding, not the WindowsCodePage property. The WindowsCodePage property gives “the Windows operating system code page that most closely corresponds to this encoding”. In this case, ISO-8859-1 (Code Page 28591) is not a Windows code page, but ANSI – Latin 1 (Code Page 1252) is the closest Windows code page. The CodePage property will return the actual code page of the encoding, in this case 28591.

    I am not exactly sure when it would be beneficial to use the WindowsCodePage property, but I will talk with the developer and let you know.

    By the way, the MSDN documentation makes this distinction.
  • Anonymous
    July 12, 2004
    Thank you Feroze.

    I read the MSDN documentation for WindowsCodePage but didn't find any definition of "Windows code page". My (stupid) guess was that any code page listed under HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\Nls\CodePage was a "Windows code page"...

    Now, I realize that the encoding that I thought was broken most likely is correct.
    I was naive enough to use System.Console.WriteLine(string) in my code and redirected the output to a file. This probably triggered a conversion to the code page used by the console's TextWriter (Console.Out) --- which was 437 (corresponding to the registry name OEMCP) in my case.

    In other words, your code works just fine.
    (Well, the exception text will of course be unreadable if you run into the same encoding problem as I did.)

    Thanks again for your help.

    --Joachim
