Saturday, April 16, 2016

Interesting problem

Today I was writing some code that had some bitmap files embedded in it. Not wanting to waste space, I decided to use gzip to compress the bitmaps. After converting a compressed file to a byte array in C, I got something like this:

unsigned char bitmap[] =
{
    0x1f, 0x8b, 0x08, 0x08, 0x2d, 0x87, 0xf9, 0x56,
    0x02, 0x0b, 0x69, 0x63, 0x6e, 0x5f, 0x73, 0x79,
    0x6e, 0x74, 0x68, 0x2e, 0x62, 0x6d, 0x70, 0x00,
    0xed, 0x99, 0x31, 0x0e, 0x82, 0x30, 0x14, 0x86,
    0x45, 0xa4, 0x5f, 0xf6, 0x0f, 0x21, 0x93, 0x3c,
    0x23, 0x70, 0xbf, 0xab, 0x06, 0xff, 0x9e, 0x00,

    .
    .
    .

    0x00, 0x00, 0x00, 0x00, 0x00, 0x42, 0xe1, 0x05,
    0x5e, 0xd7, 0xb8, 0x1c, 0x36, 0x40, 0x00, 0x00
};


So, nothing terribly interesting there, but after compiling this into an executable, my virus scanner flagged my shiny new executable as containing a virus. If I comment out the bitmap part of the code, no virus.

Apparently the byte pattern produced by compressing this particular bitmap gets flagged by my virus scanner, so in order to debug this code I need to create an exception in the virus scanner for the executable I am producing. I have never had this happen before, but it is certainly an artifact of the world we now live in.

I ran into another interesting trap while unit testing this code. The basic problem at hand was to embed a series of bitmap files in executable code so that they could be reconstituted at will. To unit test this code, I created the list of bitmap files and used gzip to compress them into a series of filename.bmp.gz files. Each of the bitmaps is about 16 KB and they compressed down to about 1 KB on average, so the savings are significant.

I created a utility that takes a list of file names and generates a C source file similar to what is shown above, which was then included in a unit test program. The test program decompressed each embedded file using my decompression algorithm and wrote the result back to disk in a different folder. I expected to see a directory full of files identical to the original set. Instead, about a third of the files had 2 or 4 extra bytes. The rest were identical to the originals.
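For the curious, the core of such a generator can be quite small. What follows is only a minimal sketch, not the actual utility: the function name emit_c_array and its parameters are made up for illustration, and it assumes the compressed file has already been read into memory.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical helper (not the original utility): write `len` bytes from
 * `data` to `out` as a C array called `name`, eight bytes per line.
 * Returns 0 on success, -1 on a write error or an empty buffer. */
int emit_c_array(FILE *out, const char *name,
                 const unsigned char *data, size_t len)
{
    assert(out != NULL && name != NULL && data != NULL);

    if (len == 0)
        return -1;  /* an empty initializer list is not valid C */

    if (fprintf(out, "unsigned char %s[] =\n{\n", name) < 0)
        return -1;

    for (size_t i = 0; i < len; i++) {
        /* Indent the first byte of each row, separate the rest by a space. */
        const char *lead = (i % 8 == 0) ? "    " : " ";
        /* No comma after the last byte; break the row after every eighth. */
        const char *tail = (i + 1 == len) ? "\n"
                         : (i % 8 == 7)   ? ",\n"
                                          : ",";
        if (fprintf(out, "%s0x%02x%s", lead, data[i], tail) < 0)
            return -1;
    }

    return (fprintf(out, "};\n") < 0) ? -1 : 0;
}
```

Calling emit_c_array(stdout, "bitmap", buf, n) on a buffer read from a .gz file should print a listing shaped like the one near the top of this post. Note that the input file has to be read in binary mode as well.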

As you might guess, this had me more than a little concerned. My decompression algorithm reported no errors, and the CRC check of the decompressed image in memory passed, yet the files on disk differed from the originals in length and therefore in content.
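One cheap sanity check here: per the gzip format (RFC 1952), the last four bytes of a gzip file hold the uncompressed size, little-endian, modulo 2^32. In the listing above those bytes are 0x36, 0x40, 0x00, 0x00, which is 16438, consistent with a roughly 16k bitmap. A small sketch for reading that field follows; the function name gzip_isize is my own invention and it assumes the whole compressed file sits in one buffer.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: return the ISIZE field of a gzip member, i.e. the
 * uncompressed length modulo 2^32, stored little-endian in the final
 * four bytes of the stream (RFC 1952). */
uint32_t gzip_isize(const unsigned char *gz, size_t gz_len)
{
    /* The trailer alone (CRC32 + ISIZE) is eight bytes. */
    assert(gz != NULL && gz_len >= 8);

    const unsigned char *p = gz + gz_len - 4;
    return (uint32_t)p[0]
         | ((uint32_t)p[1] << 8)
         | ((uint32_t)p[2] << 16)
         | ((uint32_t)p[3] << 24);
}
```

Any file on disk whose length does not match this value cannot be a faithful copy of the original, no matter how happy the decompressor was with the in-memory image.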

Loading the bitmaps up into an image viewer showed very minor corruption of the image.  Rats...

To check this out, I did a binary diff of all the affected files and saw an interesting pattern emerge. Everywhere an original file contained a 0x0A byte, the corrupt file had the two-byte sequence (0x0D, 0x0A) in its place.
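That pattern is easy to test for mechanically. The sketch below is an illustration rather than the tool I actually used: the hypothetical function is_crlf_expansion reports whether one buffer is exactly another with every 0x0A replaced by 0x0D, 0x0A, assuming both files have been read into memory in binary mode.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: return 1 if `bad` is exactly `orig` with every
 * 0x0A byte replaced by the pair 0x0D, 0x0A; return 0 otherwise. */
int is_crlf_expansion(const unsigned char *orig, size_t orig_len,
                      const unsigned char *bad, size_t bad_len)
{
    assert(orig != NULL && bad != NULL);

    size_t j = 0;  /* read position in `bad` */
    for (size_t i = 0; i < orig_len; i++) {
        if (orig[i] == 0x0A) {
            /* A line feed in the original must be preceded by 0x0D. */
            if (j >= bad_len || bad[j++] != 0x0D)
                return 0;
        }
        /* Every original byte, line feeds included, must then appear. */
        if (j >= bad_len || bad[j++] != orig[i])
            return 0;
    }
    return j == bad_len;  /* no trailing bytes left over in `bad` */
}
```

If this returns 1 for every damaged file, the extra bytes are fully explained by newline translation and nothing else.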

Those who spend some time programming computers may recognize this as the Unix/Linux end-of-line convention being automatically replaced by the Windows end-of-line convention. Unix/Linux (typically) ends a line of text with a line-feed character only, whilst Windows (typically) uses a carriage-return/line-feed pair. (Thanks Microsoft...)

The cause was forgetting to open the output file in binary mode rather than text mode, which caused the file write operation to replace every line-feed byte with a carriage-return/line-feed pair. A simple change to the file open call to use binary mode, and the resulting files were identical to the originals. Don't forget to unit test your code.
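In standard C the difference comes down to a single character in the fopen mode string: "w" opens a text stream, which on Windows translates each '\n' into "\r\n", while "wb" opens a binary stream that writes bytes untouched. A minimal sketch of a binary-safe write follows; the helper name write_binary_file is mine, not taken from the original test program.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical helper: write `len` bytes from `data` to `path` without any
 * newline translation. Returns 0 on success, -1 on any failure. */
int write_binary_file(const char *path, const unsigned char *data, size_t len)
{
    assert(path != NULL && data != NULL);

    /* "wb", not "w": the 'b' requests a binary stream. */
    FILE *fp = fopen(path, "wb");
    if (fp == NULL)
        return -1;

    size_t written = fwrite(data, 1, len, fp);

    /* fclose flushes buffered data, so its result matters too. */
    int close_failed = (fclose(fp) != 0);

    return (written == len && !close_failed) ? 0 : -1;
}
```

On Unix/Linux the 'b' makes no difference, which is exactly why this kind of bug tends to show up only after the code lands on Windows.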
