From Wikipedia:
"The byte order mark (BOM) is a Unicode character used to signal the endianness (byte order) of a text file or stream. It is encoded at U+FEFF byte order mark (BOM). BOM use is optional, and, if used, should appear at the start of the text stream. Beyond its specific use as a byte-order indicator, the BOM character may also indicate which of the several Unicode representations the text is encoded in."
I often run into situations where I have to convert between bytes and strings. Usually this involves byte arrays coming from very different sources, often with streams being concatenated.
This often produces strings that start with the BOM, and when unit testing, suddenly all the strings seem different (even though in the debugger they look perfectly identical!). Checking the strings as char arrays, however, usually reveals a few extra leading values: the BOM.
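To make the symptom concrete, here is a minimal Python sketch (Python is used only for illustration; the byte string `raw` and the variable names are made up for this example). It decodes UTF-8 bytes that start with a BOM and compares the result with the string a test would expect:

```python
# Bytes as they might arrive from a file or stream that starts with a UTF-8 BOM.
raw = b"\xef\xbb\xbfhello"

# The plain "utf-8" codec keeps the BOM as the character U+FEFF.
decoded = raw.decode("utf-8")
expected = "hello"

strings_equal = decoded == expected        # False: the BOM is part of the string
lengths = (len(decoded), len(expected))    # (6, 5)
first_code_point = hex(ord(decoded[0]))    # '0xfeff'

print(strings_equal, lengths, first_code_point)
```

Both strings typically render the same on screen because U+FEFF has no visible glyph, which is why the difference only shows up in a comparison or when looking at the individual characters.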
This is a list of the typical BOMs you can find.
Encoding | Representation (hexadecimal) | Representation (decimal) |
---|---|---|
UTF-8 | EF BB BF | 239 187 191 |
UTF-16 (BE) | FE FF | 254 255 |
UTF-16 (LE) | FF FE | 255 254 |
UTF-32 (BE) | 00 00 FE FF | 0 0 254 255 |
UTF-32 (LE) | FF FE 00 00 | 255 254 0 0 |
UTF-7 | 2B 2F 76 38 | 43 47 118 56 |
UTF-7 | 2B 2F 76 39 | 43 47 118 57 |
UTF-7 | 2B 2F 76 2B | 43 47 118 43 |
UTF-7 | 2B 2F 76 2F | 43 47 118 47 |
UTF-7 | 2B 2F 76 38 2D | 43 47 118 56 45 |
UTF-1 | F7 64 4C | 247 100 76 |
UTF-EBCDIC | DD 73 66 73 | 221 115 102 115 |
SCSU | 0E FE FF | 14 254 255 |
BOCU-1 | FB EE 28 | 251 238 40 |
GB-18030 | 84 31 95 33 | 132 49 149 51 |
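The table can be turned into a small lookup. The following Python sketch is only an illustration: the function name `detect_bom` and the list `BOMS` are hypothetical, and only the UTF-8, UTF-16 and UTF-32 rows are included. The signatures are checked longest first, because the UTF-32 (LE) BOM begins with the same two bytes as the UTF-16 (LE) BOM.

```python
# BOM signatures from the table, longest first so that UTF-32 (LE)
# "FF FE 00 00" is not mistaken for UTF-16 (LE) "FF FE".
BOMS = [
    (b"\x00\x00\xfe\xff", "UTF-32 (BE)"),
    (b"\xff\xfe\x00\x00", "UTF-32 (LE)"),
    (b"\xef\xbb\xbf", "UTF-8"),
    (b"\xfe\xff", "UTF-16 (BE)"),
    (b"\xff\xfe", "UTF-16 (LE)"),
]


def detect_bom(data: bytes):
    """Return (encoding label, BOM length), or (None, 0) when no BOM matches."""
    for signature, label in BOMS:
        if data.startswith(signature):
            return label, len(signature)
    return None, 0


print(detect_bom(b"\xef\xbb\xbfabc"))   # UTF-8 BOM followed by "abc"
print(detect_bom(b"abc"))               # no BOM
```

One caveat: UTF-16 (LE) text whose first real character is U+0000 starts with the bytes FF FE 00 00 as well, so a prefix check like this is a heuristic rather than a guarantee.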
A quick solution I found was to create a new "Encoding Wrapper".
The important part is the "GetPreamble" function: it is the one that resets the BOM.
It's pretty simple: instead of using the real encoding, you invoke the wrapper.
When converting from bytes to string, the BOM will be trimmed.
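The wrapper's own code is not reproduced here, so the following is only a rough Python sketch of the same idea, not the original implementation. The class `NoBomEncoding` and its methods `get_preamble`, `get_bytes` and `get_string` are hypothetical names chosen for this example: the wrapper reports an empty preamble and trims a leading BOM before decoding.

```python
import codecs


class NoBomEncoding:
    """Hypothetical wrapper around a codec name; an illustration only."""

    def __init__(self, name: str, bom: bytes):
        self.name = name    # codec to delegate to, e.g. "utf-8"
        self._bom = bom     # BOM bytes to trim, e.g. codecs.BOM_UTF8

    def get_preamble(self) -> bytes:
        # The wrapper never advertises a BOM, so writers have nothing to add.
        return b""

    def get_bytes(self, text: str) -> bytes:
        return text.encode(self.name)

    def get_string(self, data: bytes) -> str:
        # Trim a leading BOM, if present, before delegating to the real codec.
        if self._bom and data.startswith(self._bom):
            data = data[len(self._bom):]
        return data.decode(self.name)


utf8 = NoBomEncoding("utf-8", codecs.BOM_UTF8)

# Two chunks from different sources, each carrying its own BOM.
chunks = [codecs.BOM_UTF8 + b"foo", codecs.BOM_UTF8 + b"bar"]
text = "".join(utf8.get_string(chunk) for chunk in chunks)
print(text)
```

Decoding each chunk through the wrapper before joining keeps stray BOMs out of the middle of the result as well as off its start. For UTF-8 alone, Python's built-in `utf-8-sig` codec also strips a leading BOM on decode.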
See kendar.org for the latest changes.