Downloads and International Filenames
A few times a year, I get a question about Internet Explorer's behavior when it comes to downloading files that have non-ASCII characters in the filename, because different browsers have different behavior when handling such files.
The server can suggest the name for a file download in one of two ways:
- Explicitly, by including a filename token in the Content-Disposition response header
- Implicitly, by not including the filename token and instead simply making the path component of the download's URL contain the desired filename.
The challenge with approach #1 is that the HTTP specification doesn't permit non-ASCII characters to appear within HTTP headers. Early versions of Internet Explorer worked around this limitation by assuming that any non-ASCII characters within HTTP headers were encoded using the local system's Windows codepage—a not unreasonable assumption at that time. However, as time has passed and the Internet has grown increasingly multi-lingual, it becomes more and more likely that the user will encounter file downloads with names that are not represented using characters from their own local codepage.
For instance, consider the case of a file delivered with a filename specified using characters in the server's codepage (Windows-1251 Cyrillic):
Content-Disposition: attachment; filename="Текстовый документ док.doc"
When a user on a client configured to use that codepage attempts to download the file, the filename is displayed correctly.
When a user on a client configured for a different codepage (Windows-1252 Western European) attempts to download the file, the filename is corrupted.
If the server is reconfigured to send the filename using raw UTF-8 bytes, the filename remains corrupted, because the client interprets those bytes using the system codepage.
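The corruption described here can be reproduced outside the browser. The following Python sketch is illustrative only: it relies on Python's built-in `cp1251` and `cp1252` codecs to stand in for the server and client codepages, and says nothing about how Internet Explorer implements the decoding internally.

```python
# Illustrative sketch: reproduce the codepage mismatch with Python's codecs.
name = "Текстовый документ док.doc"

# Server side: the filename goes on the wire as raw Windows-1251 bytes.
wire_bytes = name.encode("cp1251")

# A client using the same codepage recovers the original name.
same_codepage = wire_bytes.decode("cp1251")

# A client using Windows-1252 maps the same bytes to Western European letters.
other_codepage = wire_bytes.decode("cp1252")

# Raw UTF-8 bytes fare no better under a Windows-1252 client. Some UTF-8 bytes
# have no Windows-1252 mapping, so errors="replace" substitutes U+FFFD for them.
utf8_misread = name.encode("utf-8").decode("cp1252", errors="replace")

print(same_codepage)
print(other_codepage)
print(utf8_misread)
```

Only the first decode yields the Cyrillic name; the other two produce accented Latin letters and symbols, which is the kind of garbling a user on a mismatched codepage would see.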
Internet Explorer permits use of UTF-8 in the filename token only if it is represented in %-escaped-hexadecimal:
Content-Disposition: attachment; filename="%d0%a2%d0%b5%d0%ba%d1%81%d1%82%d0%be%d0%b2%d1%8b%d0%b9 %d0%b4%d0%be%d0%ba%d1%83%d0%bc%d0%b5%d0%bd%d1%82 %d0%b4%d0%be%d0%ba.doc"
When sent this way, systems running in any codepage will display the filename correctly.
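As a rough sketch of how a server might produce that %-escaped form, Python's standard `urllib.parse.quote` encodes a string as UTF-8 and escapes the non-ASCII bytes. The variable names are illustrative, and `quote` emits uppercase hex digits, which are equivalent to the lowercase ones in the header above.

```python
from urllib.parse import quote, unquote

name = "Текстовый документ док.doc"

# quote() encodes the string as UTF-8 and %-escapes every non-ASCII byte.
# safe=" " leaves spaces literal, matching the header shown above.
escaped = quote(name, safe=" ")
header = f'Content-Disposition: attachment; filename="{escaped}"'
print(header)

# Unescaping and decoding as UTF-8 recovers the original name.
roundtrip = unquote(escaped)
```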
Unfortunately, however, while this syntax works in IE and Chrome, it doesn't work in Firefox, Opera, or Safari, which do not unescape the %-escaped characters.
RFC2231 proposed a mechanism whereby a server could specify the character set before the token value:
Content-Disposition: attachment; filename="LegacyFileйame.doc"; filename*=utf-8''%d0%a2%d0%b5%d0%ba%d1%81%d1%82%d0%be%d0%b2%d1%8b%d0%b9%20%d0%b4%d0%be%d0%ba%d1%83%d0%bc%d0%b5%d0%bd%d1%82%20%d0%b4%d0%be%d0%ba.doc
Unfortunately, Internet Explorer, Safari and Chrome do not support this syntax, and Firefox and Opera will only use the RFC2231 filename* token's value if it appears before the legacy filename token.
Update: IE9 now supports RFC5987/RFC2231 formatted tokens using the UTF-8 character encoding. IE9 prefers the filename* token over the filename token, although, for legacy compatibility, you should send the filename token before the filename* token.
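A minimal sketch of generating a header in that recommended order might look like the following. The function name `content_disposition` and the `ascii_fallback` parameter are hypothetical, chosen for illustration; the only library call is Python's standard `urllib.parse.quote`.

```python
from urllib.parse import quote

def content_disposition(filename: str, ascii_fallback: str) -> str:
    """Build an attachment header with a legacy filename and a starred filename*."""
    # Escape backslashes and quotes so the fallback stays a valid quoted-string.
    fallback = ascii_fallback.replace("\\", "\\\\").replace('"', '\\"')
    # safe="" percent-encodes everything except ASCII letters, digits and "_.-~".
    encoded = quote(filename, safe="")
    # Legacy token first, starred token second, per the compatibility advice above.
    return f"attachment; filename=\"{fallback}\"; filename*=UTF-8''{encoded}"

header_value = content_disposition("Текстовый документ док.doc", "document.doc")
print("Content-Disposition: " + header_value)
```

Clients that understand the starred token can pick up the UTF-8 name, while older clients fall back to the plain ASCII name that precedes it.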
Notably, if the Content-Disposition specifies that the file is an attachment without specifying a filename:
Content-Disposition: attachment;
… all browsers will attempt to derive the filename from the path component of the URL. So, if the file is downloaded from:
…without a filename token in the Content-Disposition header, the file will be named properly by all browsers.
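To illustrate the implicit approach, the sketch below derives a name from the last path segment of a URL using Python's standard library. It is a simplification of what browsers do, the function name `filename_from_url` is hypothetical, and the example.com URL is made up for the example.

```python
import posixpath
from urllib.parse import urlsplit, unquote

def filename_from_url(url: str) -> str:
    """Return the unescaped last path segment of a URL (simplified sketch)."""
    # urlsplit() separates the path from the query string and fragment.
    path = urlsplit(url).path
    # The last path segment is %-unescaped and decoded as UTF-8.
    return unquote(posixpath.basename(path))

# Hypothetical URL whose path ends in the %-escaped UTF-8 form of "док.doc".
print(filename_from_url("http://example.com/files/%D0%B4%D0%BE%D0%BA.doc?id=1"))
```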
I've posted a Meddler Script which demonstrates the various mechanisms for naming the file; download it here.
Ũńťīŀ Ņĕxţ Ŧĩmе,
Eric
Comments
Anonymous
July 28, 2010
Thanks!
Anonymous
September 22, 2010
The comment has been removed
Anonymous
September 22, 2010
Thanks for the update, Julian! In a private thread, I've suggested that renaming "filename*" to something else, e.g. "name*" will interoperate better with legacy parsers, such that the use of the new tokens will not break the hundreds of millions of shipped clients.
Anonymous
September 22, 2010
Hi Eric, do you have any evidence of "filename*" breaking existing clients, and in particular, "name*" being any better? The blog above doesn't seem to provide it. If there is, I'll add that to my test suite at greenbytes.de/.../tc2231.
Anonymous
September 22, 2010
@Julian: Yes, as you noticed in your test case greenbytes.de/.../attfnboth2.asis, if the filename* precedes filename, legacy versions of IE will not find the "unstarred" filename parameter and use that.
Anonymous
September 22, 2010
Hi Eric, so yes, you can't send "filename*" and "filename" in this order to IE. But you can send first "filename", then "filename*". Is the situation any better with a new name like "name*"? And even if it was, does this justify duplicating the functionality, and having to have three other browsers change?
Anonymous
September 22, 2010
I'm not aware of other features that impose ordering requirements on HTTP header tokens. The problem goes away if the "new" token does not contain a string match for the "old" token. I have no idea what "duplicating the functionality" means.
Anonymous
September 22, 2010
Eric, with "duplicating the functionality" I was referring to the fact that the IETF has recommended a way to do this for many years (to be precise, since August 1997), and three browser implementations implement that. Adding a new parameter "name*" that works exactly as "filename*" would duplicate functionality that's already implemented and deployed. It would be an alias (with the usual impact that we would need to think about what happens when both "filename*" and "name*" are present). It would certainly be easier to simply have a single approach. That ordering would become significant certainly is a drawback, but it would be only a temporary workaround. From that point of view, I don't think it's something to be concerned about. Best regards, Julian
Anonymous
September 22, 2010
While the IETF spec may be 13 years old, support for "filename*" has been extremely sparse, at best. My suggestion is merely that using a different token (e.g. "name*") is more likely to spur adoption than sticking with a known-problematic token. If by "temporary" you mean "however many years it takes for users to upgrade to newer browsers" then yes, I think we're agreed.
Anonymous
September 22, 2010
Eric, Firefox has been supporting "filename*" for over 6 years. Back then, it was the one usable alternative to IE, so I wouldn't call that "spotty" :-) With respect to calling "filename*" problematic: my test cases show that it can be sent to IE today, as long as it's preceded by "filename". You know, "same markup" (even though in headers, not HTML). Best regards, Julian
Anonymous
September 22, 2010
I'm not really interested in debating whether or not a browser with tiny (in 2004, at least) marketshare that partially supports a feature if-and-only-if you use it in a particular way (which AFAIK was never documented) constitutes "spotty" support. We can simply agree to disagree. As to the latter point, I've stated my opinion that imposing arbitrary ordering requirements on HTTP value parameters seems like a sub-optimal design. You've suggested that you disagree, and since you're writing the RFC, not me, I suspect I know how that's going to turn out.
Anonymous
September 23, 2010
Eric. peace :-). In 2004, FF already had a largish market share, and back then was the only serious contender to IE. This is not "spotty". And no, I don't want to impose ordering requirements. There are none of these in the proposed spec. What it does say is that if you choose a particular ordering, your chances of UAs doing the right thing are better. This is simply a fact, and only relevant until UAs have their parsing related bugs fixed. Best regards, Julian
Anonymous
September 23, 2010
"largish" is somewhere between 4 and 8%? Hrm... en.wikipedia.org/.../Usage_share_of_web_browsers
Anonymous
December 22, 2010
Recently, we found a system where download of files with non-ASCII characters in the filename consistently failed. It turns out that the system in question had "short filename generation" disabled on the volume where Temporary Internet Files are stored. This causes an internal failure of the long-filename handling code, and IE will not be able to download files with non-ASCII characters in the name. The quick way (on Win7) to see if this is the problem is to run the following command from a command prompt: fsutil 8dot3name query C: (Where C: is the drive containing your TIF). If the result is "Based on the above two settings, 8dot3 name creation is enabled on C:" then you should not encounter this problem.
Anonymous
September 12, 2011
In the meantime, the IETF has published RFC 5987 (for encoding parameters in HTTP header fields) and RFC 6266 (about Content-Disposition in HTTP). Also, Chrome (with version 9) and Internet Explorer (also with version 9, for UTF-8 only) support the encoding, and Firefox 5 fixed the problem of defaulting to the wrong variant when both are present. Opera and Konqueror have supported the notation for a long time. Thus, "filename*" (with UTF-8 encoding) can be used interoperably for all current browsers (except the one I didn't mention above). A revision to RFC 5987 is likely to remove the requirement to support ISO-8859-1, in which case IE9 will be "fully" compliant. Best regards, Julian
Anonymous
October 22, 2012
About the IE9 update you mention: "Update: IE9 now supports RFC5987/RFC2231 formatted tokens using the UTF-8 character encoding. IE9 prefers the filename* token over the filename token, although, for legacy compatibility, you should send the filename token before the filename* token." The issue still persists with IE9. Is it fixed in IE9, or in the next version you talk about?