A common cause of software bugs is a program interpreting a bit pattern in the wrong way. Here are the basics of data encoding and decoding, with examples. The post is aligned with the Black Box Software Testing Foundations course (BBST) designed by Rebecca Fiedler, Cem Kaner, and James Bach.
If you could magnify the surface of a hard disk or RAM, you would notice a pattern of two states, zeros and ones. This is what the computer can “see.”
What a bit pattern in memory means depends on how the program reading it interprets it [BBST].
The same bit pattern could be an integer or a sequence of characters. It could also represent a floating-point number, a command, or an address.
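To make this concrete, here is a minimal Python sketch that reads the same four bytes three different ways. The byte values are arbitrary and chosen only for illustration; they do not come from the BBST material.

```python
import struct

# Four arbitrary bytes, picked for illustration only.
raw = bytes([0x41, 0x42, 0x43, 0x44])

as_integer = int.from_bytes(raw, byteorder="big")  # unsigned 32-bit integer
as_text = raw.decode("ascii")                      # four ASCII characters
as_float = struct.unpack(">f", raw)[0]             # IEEE 754 single-precision float

print(as_integer)  # 1094861636
print(as_text)     # ABCD
print(as_float)    # roughly 12.14
```

The bits never change; only the interpretation chosen by the reading program does.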
As humans mostly use sequences of characters for communication, software testers often find bugs related to character string representation. UTF-8 is a popular encoding standard. Let’s encode the Euro Sign € [Wikipedia] as a pattern of zeros and ones:
- The Unicode code point in hexadecimal is U+20AC.
- UTF-8 uses a scheme in which a code point is encoded using one to four bytes, where a byte has eight bits. This is due to historical reasons: as the character set grew, more bytes were needed. The Euro Sign code point falls in the three-byte range (U+0800 to U+FFFF). Hexadecimal 20AC in binary is 0010 0000 1010 1100. Remember that the base for hexadecimal numbers is 16, so each hexadecimal digit needs four bits.
- Because we will use three bytes (24 bits), the UTF-8 scheme defines the first half of the first byte as 1110, where the three leading ones indicate that three bytes are used.
- The four most significant bits of the code point, 0010, come next. So far, we have 1110 0010.
- 12 more bits to go. The following six bits, 0000 10, are encoded next. To get a full byte of eight bits, we prefix them with the two bits 10, which mark a continuation byte: 1000 0010. So far, we have 1110 0010 1000 0010.
- The last six bits, 10 1100, are encoded the same way as in the previous step: 1010 1100.
The final result is:
1110 0010 1000 0010 1010 1100
These are the three bytes of the Euro Sign in UTF-8, or E2 82 AC in hexadecimal.
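The steps above can be sketched in Python with a few shifts and masks. The function name `encode_three_byte_utf8` is hypothetical, and the sketch only covers the three-byte range; it does not handle the other lengths or reject surrogate code points. The last line checks the result against Python's built-in encoder.

```python
def encode_three_byte_utf8(code_point: int) -> bytes:
    """Encode a code point in the range U+0800..U+FFFF as three UTF-8 bytes."""
    if not 0x0800 <= code_point <= 0xFFFF:
        raise ValueError("this sketch only handles the three-byte range")
    first = 0b1110_0000 | (code_point >> 12)           # 1110 + top four bits
    second = 0b1000_0000 | ((code_point >> 6) & 0x3F)  # 10 + middle six bits
    third = 0b1000_0000 | (code_point & 0x3F)          # 10 + last six bits
    return bytes([first, second, third])

euro = encode_three_byte_utf8(0x20AC)
print(" ".join(f"{byte:08b}" for byte in euro))  # 11100010 10000010 10101100
print(euro == "€".encode("utf-8"))               # True
```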
If we decode (interpret) this pattern as 7-bit ASCII instead, reading seven bits at a time from the right, we get one character and three control codes:
, enquiry new_line bell
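A short sketch of that misreading follows. It assumes the 24 bits are split into seven-bit groups starting from the least significant end, which is the reading that reproduces the result above; `decode_7bit_groups` and the `ASCII_NAMES` table are hypothetical helpers, and the table lists only the control codes needed here.

```python
# Names for the control codes that appear in this example only.
ASCII_NAMES = {0x05: "enquiry", 0x07: "bell", 0x0A: "new_line"}

def decode_7bit_groups(data: bytes) -> list:
    """Split the bits into 7-bit ASCII codes, starting from the rightmost bit."""
    value = int.from_bytes(data, byteorder="big")
    names = []
    while value:
        code = value & 0x7F                       # take the lowest seven bits
        names.append(ASCII_NAMES.get(code, chr(code)))
        value >>= 7                               # move on to the next group
    return names

print(decode_7bit_groups(bytes([0xE2, 0x82, 0xAC])))
# [',', 'enquiry', 'new_line', 'bell']
```

The same three bytes that a UTF-8 reader shows as € become a comma followed by three control codes under this reading.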