Mad for BOM

From Wikipedia:

"The byte order mark (BOM) is a Unicode character used to signal the endianness (byte order) of a text file or stream. It is encoded at U+FEFF byte order mark (BOM). BOM use is optional, and, if used, should appear at the start of the text stream. Beyond its specific use as a byte-order indicator, the BOM character may also indicate which of the several Unicode representations the text is encoded in."

The problem

I often run into situations where I have to convert between bytes and strings. Usually this involves converting byte arrays that come from very different sources, often by concatenating streams.

This often produces strings that start with a BOM, and when unit testing, strings that should be equal suddenly compare as different (even though they look perfectly identical in the debugger!). Only when the strings are inspected as char arrays do a couple of extra characters usually show up at the start.
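The effect is easy to reproduce. The following is a minimal sketch in Python (not the environment the original wrapper was written for), assuming UTF-8 input that starts with a BOM; the variable names are only illustrative.

```python
import codecs

# UTF-8 bytes that start with a BOM (EF BB BF).
raw = codecs.BOM_UTF8 + "hello".encode("utf-8")

# The plain "utf-8" codec keeps the BOM as the character U+FEFF.
decoded = raw.decode("utf-8")
expected = "hello"

# U+FEFF is invisible when printed, so both values look the same on screen.
print(decoded == expected)                  # False
print(len(decoded), len(expected))          # 6 5
print([hex(ord(c)) for c in decoded[:2]])   # ['0xfeff', '0x68']
```

Looking at the individual characters, as in the last line, is what reveals the hidden U+FEFF.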

Typical BOMs

This is a list of the typical BOMs you can find.

Encoding      Representation (hexadecimal)   Representation (decimal)
UTF-8         EF BB BF                       239 187 191
UTF-16 (BE)   FE FF                          254 255
UTF-16 (LE)   FF FE                          255 254
UTF-32 (BE)   00 00 FE FF                    0 0 254 255
UTF-32 (LE)   FF FE 00 00                    255 254 0 0
UTF-7         2B 2F 76 38                    43 47 118 56
              2B 2F 76 39                    43 47 118 57
              2B 2F 76 2B                    43 47 118 43
              2B 2F 76 2F                    43 47 118 47
              2B 2F 76 38 2D                 43 47 118 56 45
UTF-1         F7 64 4C                       247 100 76
UTF-EBCDIC    DD 73 66 73                    221 115 102 115
SCSU          0E FE FF                       14 254 255
BOCU-1        FB EE 28                       251 238 40
GB-18030      84 31 95 33                    132 49 149 51
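The table can be turned into a small lookup. The sketch below is in Python and covers only the UTF-8, UTF-16 and UTF-32 rows, using the BOM constants from the standard codecs module; the function name detect_bom and the list KNOWN_BOMS are hypothetical names chosen for this example.

```python
import codecs

# Longest signatures first: the UTF-32 (LE) BOM begins with the UTF-16 (LE) one.
KNOWN_BOMS = [
    ("UTF-32 (BE)", codecs.BOM_UTF32_BE),   # 00 00 FE FF
    ("UTF-32 (LE)", codecs.BOM_UTF32_LE),   # FF FE 00 00
    ("UTF-8", codecs.BOM_UTF8),             # EF BB BF
    ("UTF-16 (BE)", codecs.BOM_UTF16_BE),   # FE FF
    ("UTF-16 (LE)", codecs.BOM_UTF16_LE),   # FF FE
]

def detect_bom(data: bytes):
    """Return (encoding name, BOM length), or (None, 0) if no known BOM leads."""
    for name, bom in KNOWN_BOMS:
        if data.startswith(bom):
            return name, len(bom)
    return None, 0

print(detect_bom(b"\xef\xbb\xbfabc"))    # ('UTF-8', 3)
print(detect_bom(b"\xff\xfe\x00\x00"))   # ('UTF-32 (LE)', 4)
print(detect_bom(b"\xff\xfea\x00"))      # ('UTF-16 (LE)', 2)
print(detect_bom(b"abc"))                # (None, 0)
```

The order of the list matters: FF FE 00 00 would otherwise be reported as UTF-16 (LE). The match is still only a guess, because the same four bytes are also valid UTF-16 (LE) text that starts with a NUL character.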

Solution

A quick solution I found was to create a new "Encoding Wrapper".

The important part is the "GetPreamble" function: it is the one that resets the BOM.

Usage

It's pretty simple: instead of using the real encoding, you invoke the wrapper.

When converting from bytes to string the BOM will be trimmed.
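As an illustration of the idea, here is a rough Python sketch of such a wrapper. It is not the original wrapper code: the class BomFreeEncoding and its methods get_preamble, get_bytes and get_string are hypothetical names, and the empty preamble is an assumption about what "resetting the BOM" means.

```python
import codecs

class BomFreeEncoding:
    """Hypothetical wrapper around a codec name (illustration only)."""

    # BOMs the wrapper knows how to trim, keyed by the underlying codec.
    _BOMS = {
        "utf-8": codecs.BOM_UTF8,
        "utf-16-le": codecs.BOM_UTF16_LE,
        "utf-16-be": codecs.BOM_UTF16_BE,
    }

    def __init__(self, name: str):
        self.name = name

    def get_preamble(self) -> bytes:
        # Assumption: the wrapper never emits a BOM, so the preamble is empty.
        return b""

    def get_bytes(self, text: str) -> bytes:
        return self.get_preamble() + text.encode(self.name)

    def get_string(self, data: bytes) -> str:
        # Trim a leading BOM, if present, before decoding.
        bom = self._BOMS.get(self.name)
        if bom and data.startswith(bom):
            data = data[len(bom):]
        return data.decode(self.name)

enc = BomFreeEncoding("utf-8")
with_bom = codecs.BOM_UTF8 + b"hello"
print(enc.get_string(with_bom) == "hello")   # True
print(enc.get_string(b"hello") == "hello")   # True
print(enc.get_bytes("hello"))                # b'hello'

# Decoding each chunk before joining keeps BOMs out of the middle of the text.
chunks = [codecs.BOM_UTF8 + b"foo", codecs.BOM_UTF8 + b"bar"]
joined = "".join(enc.get_string(c) for c in chunks)
print(joined == "foobar")                    # True
```

For UTF-8 alone, Python's built-in "utf-8-sig" codec already skips a leading BOM when decoding; the wrapper form is shown only because it follows the approach described here.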

Download

See kendar.org for the latest changes.


Last modified on: June 07, 2013