What's the difference between an Encoding, Code Page, Character Set and Unicode?
Encoding, Code Page and Character Set are often used interchangeably, even when that isn't strictly correct. There are some distinctions though:
Characters are usually thought of as the smallest element of writing that has a meaning. It could be a punctuation mark, spacing character, letter, word, letter modifier or symbol.
A character set is a collection of characters that are useful together, usually for a particular script or scripts. Sometimes people use character set as a synonym for code page. A character set, however, can be a collection without any method of coding it. Similarly, a code page could contain multiple sets of characters.
Code Pages, also called Coded Character Sets, are character sets where each character has been assigned a numerical representation. This allows characters to be mapped to binary values and back to the same character. Often code pages are referenced by particular implementations, like Windows code page 1252.
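To make the "table" idea concrete, here is a small Python sketch (Python is my choice for illustration, not something from the original post). It looks up a few byte values in two Windows code pages using Python's built-in "cp1252" and "cp1251" codecs. The helper name lookup and the sample byte values are made up for the example.

```python
# Treat a code page as a lookup table from a byte value to a character.
# The sample values and the helper name `lookup` are illustrative only.
sample_bytes = (0x41, 0x80, 0xE9)


def lookup(code_page: str) -> dict[int, str]:
    """Return {byte value: character} for the sample bytes in one code page."""
    return {b: bytes([b]).decode(code_page) for b in sample_bytes}


cp1252_table = lookup("cp1252")  # Western European
cp1251_table = lookup("cp1251")  # Cyrillic

for b in sample_bytes:
    print(f"0x{b:02X}: cp1252={cp1252_table[b]!r}  cp1251={cp1251_table[b]!r}")

# The mapping works in both directions: character -> number -> same character.
round_trips = all(ch.encode("cp1252") == bytes([b]) for b, ch in cp1252_table.items())
print("round trip ok:", round_trips)
```

The same number can mean different characters in different code pages, which is why the code page has to be known before the bytes can be read back as text.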
Encodings are a way to express a character set as actual coded data. The term is often used interchangeably with code page, even within the MSDN documentation. We often use the term Encodings in the managed .NET classes and code pages in Windows APIs. Managed Encodings can even accept a code page number (although I'd recommend using the names when possible). Generally I think of a Code Page as being similar to a table that maps characters to numbers, and an Encoding as being how the characters get from character form to encoded byte form.
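As a rough sketch of the "character form to encoded byte form" step, the Python snippet below (again my own illustration, not the .NET Encoding class) encodes the same text with two different encodings and shows what happens when a character is not in the code page at all. The variable names are arbitrary.

```python
# The same text becomes different bytes depending on the encoding used.
text = "caf\u00e9"  # "café"

as_cp1252 = text.encode("cp1252")  # one byte per character in this code page
as_utf8 = text.encode("utf-8")     # the accented letter takes two bytes here

print(as_cp1252)
print(as_utf8)

# A code page only covers its own character set. A Cyrillic letter is not
# in Windows-1252, so trying to encode it raises an error.
try:
    "\u0439".encode("cp1252")
    cyrillic_fits = True
except UnicodeEncodeError:
    cyrillic_fits = False

print("Cyrillic letter fits in cp1252:", cyrillic_fits)
```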
Unicode is best described at www.unicode.org. The Unicode Standard says "The Unicode Standard is the universal character encoding scheme for written characters and text. It defines a consistent way of encoding multilingual text that enables the exchange of text data internationally..." Basically it’s an enormous character set that aims to encompass the characters of all of the other character sets. It’s encoded in UTF-8, UTF-16 or UTF-32. All 3 UTF encodings are representations of the same set of characters. Windows and .NET (and many other systems) use Unicode natively, and it’s the natural preferred encoding for .NET or Windows applications.
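The point that all three UTF encodings represent the same characters can be checked with a short Python sketch. The sample string is an arbitrary choice for illustration, and the little-endian variants are used only so that no byte order mark is added.

```python
# One string, three Unicode encodings. The code points stay the same;
# only the byte representation changes.
text = "A\u00e9\u20ac\U0001F600"  # A, e-acute, euro sign, an emoji outside the BMP

code_points = [ord(ch) for ch in text]
print([f"U+{cp:04X}" for cp in code_points])

encoded = {
    "utf-8": text.encode("utf-8"),
    "utf-16-le": text.encode("utf-16-le"),
    "utf-32-le": text.encode("utf-32-le"),
}

for name, data in encoded.items():
    print(f"{name}: {len(data)} bytes")

# Decoding each byte sequence with its own encoding gives back the same text.
all_same = all(data.decode(name) == text for name, data in encoded.items())
print("all decode to the same text:", all_same)
```

UTF-32 always uses four bytes per character, while UTF-8 and UTF-16 use a variable number, so the byte counts differ even though the text is identical.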
Comments
Anonymous
October 14, 2008
I feel that the definition of Unicode at unicode.org is misleading because Unicode is called an encoding instead of being called a coded character set.
Anonymous
January 09, 2019
Agree. The following should be clearly defined:
Character set = some set of graphical characters
Encoded character set = an isomorphic mapping from the character set to a set of nonnegative integers
Encoding = a representation of that set of nonnegative integers in a particular form using single bytes
(Maybe?) Default encoding = a special case of the Encoding for the small set of integers between 0 and 255, where each number is represented by its native single-byte form
Code page = Coded character set with an Encoding (e.g. it could be the Default encoding in many cases)