Software Engineering
data versioning configuration storage binary
Updated Fri, 17 Jun 2022 17:59:21 GMT

Binary data formats: how do you ensure you can read different format versions?


On our project we have a data format that we use to process and record data. Recently our application changed so that many of the data format's parameters have become obsolete. Currently we receive these data "packets" over the internet (UDP or TCP) and from saved binary files.

We want to create a new, more space-efficient format, removing things we don't need. Each format is divided into a header and a payload, where things like time-stamp information and some description of the payload are in the header.

To ensure that we can support multiple versions of a format, we decided that it made sense to put some sort of format version ID at the very start of every format we make. Unfortunately the previous format (created by people who are no longer on our team) does not follow this convention: at some point the decision was made to put the format version ID in the middle of the header, in between all the now useless junk data.

Reading this older format is an issue because we currently have gigabytes of data in that format, collected in the field, that we use as test data for our application.

How do we ensure that our application can still read both the old format, which does not put the version ID first, and any future format versions that we create?

We've considered the following:

  • Just moving on to the next format and ignoring old data. Not responsible, and prohibitively expensive.

  • Having the user somehow specify which format is which (formats that can be identified immediately from the header vs. old format types). Annoying, and hard on people who are not devs on this project but also contribute (of which there are many).

  • Having new format versions follow the old version up to the version ID portion. This loses many of the benefits of moving to the new version, and requires careful planning of where to place header bytes so that the version ID stays in the same location (harder on developers).

  • Converting old formats to versions with a version-ID-first header. This requires new tooling and maintaining a version converter, and requires everyone else's files to be updated as well. These recorded files are with people who are not devs and aren't using version control either, so it will be difficult to make sure already recorded data can be used correctly by everyone.

Here is an example of what the current header looks like:

* = marked for removal

size: 8 bytes
payload metadata: 8 bytes
payload metadata: 8 bytes
* non-standard timeformat: 8 bytes 
* non-standard timeformat: 8 bytes
* legacy undocumented data: 8 bytes
version number: 8 bytes
* source metadata: 8 bytes // may not want this all the time
sequence number: 8 bytes
short range time: 8 bytes
payload metadata: 8 bytes
* size data?: 8 bytes
* spare data: 8 bytes
payload: N bytes
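
To make the layout above concrete, here is a minimal sketch of a fixed-offset parse of this header. It assumes every field is a little-endian unsigned 64-bit integer (the byte order and signedness are not stated above), and the field names are hypothetical labels for the 13 entries in the order listed.

```python
import struct

# Assumption: all 13 header fields are little-endian unsigned 64-bit integers.
LEGACY_HEADER = struct.Struct("<13Q")

# Hypothetical names for the 13 fields, in the order listed in the layout.
LEGACY_FIELDS = (
    "size", "payload_meta_0", "payload_meta_1",
    "time_a", "time_b", "legacy_undocumented",
    "version", "source_meta", "sequence",
    "short_range_time", "payload_meta_2", "size_data", "spare",
)

# The version number is the 7th field, so it sits 48 bytes into the header.
VERSION_OFFSET = LEGACY_FIELDS.index("version") * 8


def parse_legacy_header(buf: bytes) -> dict:
    """Unpack the 104-byte legacy header into a dict keyed by field name."""
    if len(buf) < LEGACY_HEADER.size:
        raise ValueError("buffer is shorter than the legacy header")
    return dict(zip(LEGACY_FIELDS, LEGACY_HEADER.unpack_from(buf, 0)))
```

Under these assumptions the header is 104 bytes long, the version ID sits at byte offset 48, and the payload starts right after the header.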



Solution

It seems to me that the simplest solution is to make your version header unambiguous, so that the old format can never look like it has one. Then you simply look for it at the start of the data. If it's not there, you assume it's the old style and read the version ID from the middle. There might also be things at the beginning of the old format that can clue you in.

The key here is that you need to find a version preamble that the old format cannot produce. For example, let's assume the old version never starts with a 0 byte. You could then start your preamble with 0x00 0x00 0x00 0x00. When you start reading the data, you read the first 4 bytes, and if any of them is non-zero, you are looking at an old version (or a bad request). UTF-8 gets its backwards compatibility with ASCII in a similar way: every byte of a multi-byte sequence has its high bit set, which 7-bit ASCII never produces.
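
A rough sketch of that check, under stated assumptions: the four zero bytes are only the illustrative prefix from above, the new format is assumed to carry a 32-bit little-endian version right after the prefix, and the old header is assumed to hold its version as a little-endian 64-bit value at byte offset 48 (the 7th 8-byte field in the layout from the question). `detect_version` is a hypothetical name.

```python
import struct

MAGIC = b"\x00\x00\x00\x00"    # assumed prefix the old format never starts with
LEGACY_VERSION_OFFSET = 48     # 7th 8-byte field of the old header
LEGACY_HEADER_SIZE = 104       # 13 fields * 8 bytes


def detect_version(buf: bytes) -> tuple:
    """Return ("new", version) or ("legacy", version) for a header buffer."""
    if buf[:4] == MAGIC:
        # New style: the version ID follows the prefix immediately.
        if len(buf) < 8:
            raise ValueError("truncated new-style header")
        (version,) = struct.unpack_from("<I", buf, 4)
        return ("new", version)
    # No prefix: fall back to the old layout and read the version mid-header.
    if len(buf) < LEGACY_HEADER_SIZE:
        raise ValueError("truncated legacy header")
    (version,) = struct.unpack_from("<Q", buf, LEGACY_VERSION_OFFSET)
    return ("legacy", version)
```

Whether a zero prefix is actually safe depends on the real data. The old format begins with an 8-byte size field, and if that field is big-endian its leading bytes would usually be zero, so the prefix has to be chosen against what the recorded files really contain.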





Comments (5)

  • +0 – Use of some kind of GUID might help here but it doesn't have to be fancy. 0xCAFEBABE was good enough for Java. — Jan 11, 2018 at 15:27  
  • +0 – I do not believe this is possible, like at all, given the old format (which I've updated to include). I don't see anything that can "clue me in" unambiguously, especially seeing as so much stuff changes. — Jan 11, 2018 at 15:29  
  • +0 – @snb Really? Why not? — Jan 11, 2018 at 15:30  
  • +1 – For example, Let's say the old version never starts with a 0 byte. In the newer versions, you start the version header with 4 '0' bytes. Then your old version will never look like it has a header. You just need to find a prefix that the old version can never produce. Ralf gives a potential scheme. — Jan 11, 2018 at 16:04  
  • +2 – Ehm, starting with a BOM is making UTF-8 intentionally backwards-incompatible to ASCII. — Jan 11, 2018 at 21:20  

