Mixing fscanf for ASCII data and fread for binary data bug

Alec Jacobson

May 12, 2013


I've been using a simple format for dense matrices. The original idea was just to have a dead simple ASCII file format for reading and writing column-major matrices. The advantage over MATLAB's simple dlmread/dlmwrite files is that my file format contains the size of the matrix in a header at the beginning. This makes reading the files much, much faster. Later I realized this would be even faster if I also supported binary data. I still use the ASCII header to reveal the size of the binary data. So before the binary data I have a line that reads:
[cols] [rows]
Then the binary data immediately follows in 8-byte chunks (each dense matrix entry as a double). This turned out to be a bit of a mistake, since the notion of a "line" becomes ill-defined once it is followed directly by binary data. On my system I end lines with a '\n' line feed character. But if I try to parse out that line using:
fscanf(fp,"%d %d\n",&cols,&rows);
then fscanf might grab more than just the '\n' character, depending on what's sitting in the file next, i.e. what the first byte of the first double-precision float is. I found this out the hard way when I was suddenly reading junk for any matrix whose first entry was "-6.343889". The first byte (on my little-endian system) is 0x0B, which is a vertical tab. A whitespace character in an fscanf format string, such as the trailing '\n' above, does not match one literal newline: it matches any run of whitespace characters in the input, and vertical tab counts as whitespace. So the 0x0B byte gets eaten along with the real '\n', and every double after it is read from the wrong offset. My current, not-so-satisfying solution is to change the header reading to:
fscanf(fp,"%d %d",&cols,&rows);
without reading the '\n' character, and then explicitly read exactly one character (byte). After verifying that this character is indeed '\n', I know it's safe to read in the binary data. Unfortunately this means I can no longer really say the header is just a line reading as above, since some systems use multiple characters to denote the end of a line. For example, with my current description I have no way of knowing whether I should read '\n\r' as the line ending, or just '\n' because the '\r' is really the value of the first byte of the first double interpreted as a char. It seems like one right way to solve this would be to use a standard format. Maybe using XML's Base64 element.
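
For illustration, here is a minimal sketch of the workaround described above. It is not the actual reader, just one way it could look. The function names write_dmat and read_dmat are hypothetical, and the sketch assumes the header always ends in exactly one '\n' byte. To make that hold on every platform, both functions open the file in binary mode, so the C library never translates '\n' into something longer.

```c
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical writer: ASCII header "cols rows\n", then the raw doubles
 * in column-major order. Returns 0 on success, -1 on failure. */
int write_dmat(const char *path, const double *data, int rows, int cols)
{
    FILE *fp = fopen(path, "wb"); /* binary mode: '\n' stays a single byte */
    if (!fp) return -1;
    size_t n = (size_t)rows * (size_t)cols;
    int ok = fprintf(fp, "%d %d\n", cols, rows) > 0
          && fwrite(data, sizeof(double), n, fp) == n;
    return (fclose(fp) == 0 && ok) ? 0 : -1;
}

/* Hypothetical reader: returns a malloc'd array of rows*cols doubles that
 * the caller must free, or NULL on any failure. */
double *read_dmat(const char *path, int *rows, int *cols)
{
    FILE *fp = fopen(path, "rb");
    if (!fp) return NULL;
    double *data = NULL;
    /* No trailing whitespace in the format string: a whitespace directive
     * would skip every following whitespace byte, including data bytes
     * such as 0x0B. Instead, consume exactly one byte and check it. */
    if (fscanf(fp, "%d %d", cols, rows) == 2 && *rows > 0 && *cols > 0
        && fgetc(fp) == '\n') {
        size_t n = (size_t)*rows * (size_t)*cols;
        data = malloc(n * sizeof(double));
        if (data && fread(data, sizeof(double), n, fp) != n) {
            free(data); /* short read: treat the file as corrupt */
            data = NULL;
        }
    }
    fclose(fp);
    return data;
}
```

With this reader, a matrix whose first entry starts with the byte 0x0B should come back intact, because only the single '\n' is consumed before fread takes over. It still does nothing for files whose header was written with a multi-character line ending; those are rejected rather than guessed at.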