Optimize and compress non-primary historic archive files
Currently, the same file format is used for both primary and non-primary historic archive files. This does not make much sense, since the two types are used very differently.
For the primary archive:
• Write operations should be fast since it is collecting data in real-time
• Read operations should be fast since the most recent data is usually the most relevant
• File size is not a concern since there is only one primary archive
For non-primary archives:
• Write speed is not a concern since write operations are infrequent
• Read operations should be fast because users may want to look for patterns over a long time range
• File size is very important because there are many non-primary archives and they can quickly fill up a drive
Please consider optimizing each type of archive file for its use. This likely means having 2 different file formats, and it may mean retiring the current archive file format. When an archive shift occurs, the primary archive should be reprocessed into a non-primary archive. In case anything goes wrong, an archive in the primary format would ideally be usable as a non-primary archive, but not necessarily the other way around.
I am aware that there are tradeoffs, but I also feel that OSIsoft has not found the sweet spot within those tradeoffs.
On one hand, a compressed archive means that additional work must be done to decompress data being read and compress data being written. On the other hand, compressed data would be faster to read from the disk because there is less data to read. I'm not sure which effect dominates read speed; it probably depends on the chosen compression algorithm.
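One rough way to test which effect dominates is to time reading the same data back both ways. Here is a sketch (not PI-specific; it uses synthetic, repetitive "archive" records and zlib, purely as an illustration of the tradeoff):

```python
# Rough benchmark sketch: does zlib decompression cost more time than it
# saves in disk reads? The data below is synthetic, not a real archive.
import os
import tempfile
import time
import zlib

# Repetitive timestamped records, which compress well.
data = b"".join(
    b"%012d,%08.3f\n" % (1_700_000_000 + i, (i % 500) / 7.0)
    for i in range(200_000)
)

raw_path = os.path.join(tempfile.gettempdir(), "bench_raw.dat")
zip_path = os.path.join(tempfile.gettempdir(), "bench_zip.dat")
with open(raw_path, "wb") as f:
    f.write(data)
with open(zip_path, "wb") as f:
    f.write(zlib.compress(data, level=6))

t0 = time.perf_counter()
with open(raw_path, "rb") as f:
    raw = f.read()
t_raw = time.perf_counter() - t0

t0 = time.perf_counter()
with open(zip_path, "rb") as f:
    unzipped = zlib.decompress(f.read())
t_zip = time.perf_counter() - t0

assert raw == unzipped == data
print(f"raw: {len(data)} bytes read in {t_raw * 1e3:.2f} ms")
print(f"compressed: {os.path.getsize(zip_path)} bytes read+decompressed "
      f"in {t_zip * 1e3:.2f} ms")
os.remove(raw_path)
os.remove(zip_path)
```

On a warm OS cache the raw read usually wins; on slow or cold storage the smaller compressed file can come out ahead, which is exactly the uncertainty described above.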
I understand the concern with out-of-order data, but I'd like to think that this wouldn't happen often enough to be a concern. I did forget to consider the effect on clients trying to read data while a non-primary archive is being written to.
Backwards compatibility is something that you may need to ditch soon for a different reason. In my previous comment, I linked a suggestion for the use of 64-bit Unix time, which will need to be implemented eventually to make PI usable in 2038 and beyond. If this change requires breaking backwards compatibility, it would be the perfect opportunity to implement other changes to the archive files that would also break backwards compatibility. OSIsoft can always release a utility that converts old archives to the new format.
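To illustrate the 2038 problem concretely (this is a generic demonstration of signed 32-bit timestamps, not a claim about PI's actual on-disk layout): a signed 32-bit field runs out exactly one second after 2038-01-19 03:14:07 UTC, while a 64-bit field stores the same instant with room to spare.

```python
# Generic 2038 demonstration: a signed 32-bit Unix timestamp overflows,
# a 64-bit one does not. Not a description of PI's actual file format.
import struct
from datetime import datetime, timezone

rollover = datetime(2038, 1, 19, 3, 14, 7, tzinfo=timezone.utc)
t = int(rollover.timestamp())      # 2147483647 == 2**31 - 1

struct.pack("<i", t)               # last second that fits in 32 bits...
try:
    struct.pack("<i", t + 1)       # ...one second later overflows
except struct.error as e:
    print("32-bit field:", e)

packed = struct.pack("<q", t + 1)  # 64-bit field handles it fine
print("64-bit field:", struct.unpack("<q", packed)[0])
```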
For the non-primary archives being compressed, I don't mean zipping or gzipping the whole file. The compression could be local within the file for chunks of time or tags. Something to make non-primary archives much smaller without too much of a penalty on read speed. Disk compression is a good example of this (at least, on small files).
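A minimal sketch of what "local" compression could look like: compress fixed-size chunks independently and keep an index, so a read only decompresses the chunks it touches. All names and the layout here are illustrative, not the actual archive format.

```python
# Sketch of chunk-local compression with an index for random access.
# Illustrative only; not the real archive layout.
import zlib

CHUNK = 64 * 1024  # chunk size in bytes (hypothetical tuning knob)

def write_chunked(records: bytes):
    """Return (index, blob): index[i] = (offset, length) of chunk i in blob."""
    index, parts, offset = [], [], 0
    for start in range(0, len(records), CHUNK):
        comp = zlib.compress(records[start:start + CHUNK], level=9)
        index.append((offset, len(comp)))
        parts.append(comp)
        offset += len(comp)
    return index, b"".join(parts)

def read_chunk(index, blob, i):
    """Decompress only chunk i -- the rest of the file is never touched."""
    off, length = index[i]
    return zlib.decompress(blob[off:off + length])

data = bytes(range(256)) * 2048            # 512 KiB of sample data
index, blob = write_chunked(data)
assert read_chunk(index, blob, 3) == data[3 * CHUNK:4 * CHUNK]
print(f"{len(data)} bytes -> {len(blob)} bytes in {len(index)} chunks")
```

Chunking by time range or by tag would work the same way; the point is that a query for a narrow window pays the decompression cost only for the chunks it actually reads.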
Also, it doesn't necessarily have to be that all archives except for the primary are compressed. There could be a tuning parameter to control when this compression occurs (e.g. based on the number of archive shifts since the archive was the primary or based on the age of the data that the archive holds). Then you can precisely choose when to take the performance hit while conserving disk space. The older an archive is, the less likely it is to be read from or written to, and the more disk space matters.
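The tuning rule above could be as simple as a threshold on shift count or data age. The parameter names below are hypothetical, just to make the idea concrete:

```python
# Sketch of a tuning rule for when to compress an archive. An archive
# becomes a candidate once it is N shifts away from primary OR its data
# is older than an age threshold. Parameter names are hypothetical.
from datetime import datetime, timedelta

SHIFTS_BEFORE_COMPRESS = 3               # hypothetical tuning parameter
MAX_UNCOMPRESSED_AGE = timedelta(days=90)  # hypothetical tuning parameter

def should_compress(shifts_since_primary: int, newest_data: datetime,
                    now: datetime) -> bool:
    return (shifts_since_primary >= SHIFTS_BEFORE_COMPRESS
            or now - newest_data >= MAX_UNCOMPRESSED_AGE)

now = datetime(2024, 6, 1)
assert not should_compress(1, datetime(2024, 5, 20), now)  # recent, few shifts
assert should_compress(3, datetime(2024, 5, 20), now)      # enough shifts
assert should_compress(0, datetime(2024, 1, 1), now)       # data old enough
```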
As with all things, there are trade-offs. Swinging door compression was used initially in the Data Archive because lossless compression was too slow for the hardware at that time. While hardware has caught up, on-the-fly compression still incurs some penalty. For most customers, performance is key and they are willing to trade it off against disk space usage. In addition, compressed files may be an issue when there is out-of-order data that requires inserts into the historical archives, or when multiple clients request searches or calculations that lead to decompression of the historical archives followed by re-compression. A good example would be analytics that include FindXX functions followed by recalculation of historical data. And then there's backwards compatibility to be aware of: we have to ensure that old servers/clients continue to work with new servers/clients and vice versa. While this is certainly technologically possible, it's currently low in priority.
A change in the format of the archive files would be the perfect chance to implement these as well:
When I use disk compression on an archive file, the size of the archive decreases significantly, which suggests that the current archive format leaves a lot of redundancy on disk. I find it strange that OSIsoft pushes for the use of exception and compression, which are forms of lossy compression, without worrying about lossless compression first.