pytorch/third_party/miniz-2.0.8
Zachary DeVito 170ff7764f Use a zip archive as our container format (#14521)
Summary:
After consulting with Owen, who pointed out the existence of the miniz library, I decided to take one last shot at using zip as our container format.
miniz makes this surprisingly feasible and I think the benefits of using zip are large enough that we should do it.

This replaces our custom container format with a zip archive, preserving all of the
desirable features of our custom format, such as append-oriented writing and
mmap'able tensor data, while adding a bunch of debugging advantages:

1. You can unzip and explore the container to debug what is going on with a model.
2. You can edit the model using a text editor (e.g. change the definition of a method,
   or edit the JSON-serialized metadata), re-zip the file using OSX's native 'Compress'
   option, and re-load the result into pytorch. Note: this enables you to, e.g., print-debug
   serialized models.
3. We can easily enable features like compression in the future.
4. Stock Python, without pytorch installed, and other programming languages
   can reasonably consume this format using the json and zipfile packages, which enables
   people to build tools like visualizers without those visualizers depending on pytorch.
   This will be especially useful if you want to, for instance, write a visualizer in javascript.
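
As a rough illustration of point 4, the sketch below reads an archive with nothing but Python's standard zipfile and json modules. The entry names ('example/version', 'example/model.json', 'example/tensors/0') and the JSON layout are invented for this example and are not the layout pytorch actually writes. The archive is built in memory so that the snippet runs on its own.

```python
import io
import json
import zipfile


def build_example_archive() -> bytes:
    """Write a tiny stand-in archive. All entry names here are hypothetical."""
    buf = io.BytesIO()
    # ZIP_STORED keeps every entry uncompressed.
    with zipfile.ZipFile(buf, "w", zipfile.ZIP_STORED) as zf:
        zf.writestr("example/version", "1\n")
        zf.writestr("example/model.json", json.dumps({"tensors": [{"key": "0"}]}))
        zf.writestr("example/tensors/0", bytes(range(16)))
    return buf.getvalue()


def inspect_archive(data: bytes) -> dict:
    """List entries, parse the JSON metadata, and take sizes from the zip directory."""
    with zipfile.ZipFile(io.BytesIO(data)) as zf:
        names = zf.namelist()
        meta_name = next(n for n in names if n.endswith(".json"))
        meta = json.loads(zf.read(meta_name))
        # Sizes come from the zip central directory, not from the metadata file.
        sizes = {n: zf.getinfo(n).file_size for n in names}
    return {"names": names, "meta": meta, "sizes": sizes}


archive_bytes = build_example_archive()
summary = inspect_archive(archive_bytes)
```

Because entry sizes are read from the zip directory itself, an entry can be edited and the file re-zipped without touching the metadata, which is the property that makes print-debugging practical.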

Notes:

*  This adds miniz (https://github.com/richgel999/miniz) as a dependency. miniz is a self-contained
   library for reading/writing zipfiles that, unlike other zip libraries, also includes
   libz-compatible compress/decompress support. It is a single header and a single C file without
   any other dependencies. Note that the instructions for miniz explicitly state:

   > Please use the files from the releases page in your projects. Do not use the git checkout directly!

   So we have checked in the 'release' source. Miniz supports zip64, and its API is amenable
   to doing zip-align style things to align data.

*  Removes 'size' from RecordRef. This allows you to edit files in the zip archive without
   editing the meta-data file. Very important if you want to print-debug serialized models.

*  PyTorchStreamReader/PyTorchStreamWriter keep mostly the same API (though keys become strings).
   However, their implementation is completely swapped out to use miniz.

*  Code exists to check for the old magic number to give a decent warning to our preview users
   after we change the format.

*  Container version information is now put in a stand-alone 'version' file in the archive
   and serves a similar purpose to the other container version info.

*  All files in the zip archive start at 64-byte boundaries, using an approach similar to
   zip-align. Tests check that this property remains true. While the writer does this,
   the reader doesn't depend on it, so user-created archives can use compression
   and do not have to align data.

*  Added test to check for > 4GB files and archives. Disabled by default because it takes
   almost 2 minutes to run.

*  torchscript files are now optional: if a submodule does not have methods, it will
   not be written.
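
The 64-byte alignment note above can be sketched in stock Python. This is one zip-align style approach, not the writer's actual code: it pads each local file header's extra field so that the entry data starts on a 64-byte boundary, then re-reads the local headers to check the offsets. The extra-field ID 0x4242 and the entry names are arbitrary choices made for this example.

```python
import io
import struct
import zipfile

ALIGNMENT = 64
LOCAL_HEADER_SIZE = 30   # fixed-size part of a zip local file header
PAD_FIELD_ID = 0x4242    # arbitrary extra-field ID, an assumption of this sketch


def padding_extra(header_offset: int, name: str) -> bytes:
    """Return extra-field bytes that move the entry's data to an aligned offset."""
    data_start = header_offset + LOCAL_HEADER_SIZE + len(name.encode("utf-8"))
    pad = -data_start % ALIGNMENT
    if pad == 0:
        return b""
    if pad < 4:
        # An extra-field record needs 4 bytes for its own ID and length.
        pad += ALIGNMENT
    return struct.pack("<HH", PAD_FIELD_ID, pad - 4) + b"\0" * (pad - 4)


def write_aligned(entries: dict) -> bytes:
    """Write uncompressed entries whose data all start at multiples of ALIGNMENT."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", zipfile.ZIP_STORED) as zf:
        for name, payload in entries.items():
            info = zipfile.ZipInfo(name)
            # buf.tell() is where the next local header will be written.
            info.extra = padding_extra(buf.tell(), name)
            zf.writestr(info, payload)
    return buf.getvalue()


def data_offsets(archive: bytes) -> dict:
    """Compute where each entry's data begins by parsing its local header."""
    offsets = {}
    with zipfile.ZipFile(io.BytesIO(archive)) as zf:
        for info in zf.infolist():
            # Name length and extra length sit at bytes 26..29 of the local header.
            name_len, extra_len = struct.unpack_from("<HH", archive, info.header_offset + 26)
            offsets[info.filename] = info.header_offset + LOCAL_HEADER_SIZE + name_len + extra_len
    return offsets


archive = write_aligned({"version": b"1\n", "tensors/0": bytes(100), "code/m.py": b"x = 1\n"})
offsets = data_offsets(archive)
```

A reader that goes through the zip directory never looks at these offsets, which is consistent with the note that only the writer, not the reader, relies on alignment.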

Pull Request resolved: https://github.com/pytorch/pytorch/pull/14521

Reviewed By: jamesr66a

Differential Revision: D13252945

Pulled By: zdevito

fbshipit-source-id: 01209294c0f6543d0fd716f85a38532249c52f8c
2018-11-30 19:19:29 -08:00

Miniz

Miniz is a lossless, high performance data compression library in a single source file that implements the zlib (RFC 1950) and Deflate (RFC 1951) compressed data format specification standards. It supports the most commonly used functions exported by the zlib library, but is a completely independent implementation so zlib's licensing requirements do not apply. Miniz also contains simple to use functions for writing .PNG format image files and reading/writing/appending .ZIP format archives. Miniz's compression speed has been tuned to be comparable to zlib's, and it also has a specialized real-time compressor function designed to compare well against fastlz/minilzo.
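
To make the relationship between the two formats concrete, here is a small sketch of how a zlib (RFC 1950) stream wraps a raw Deflate (RFC 1951) stream. It uses Python's built-in zlib module as a stand-in, since it produces the same stream formats; it does not call miniz itself.

```python
import struct
import zlib

payload = b"miniz and zlib share the same stream formats. " * 8

zlib_stream = zlib.compress(payload, 6)

# RFC 1950 header: two bytes, CMF and FLG. The low nibble of CMF is the
# compression method (8 means Deflate), and the 16-bit value formed by
# CMF and FLG is a multiple of 31.
cmf, flg = zlib_stream[0], zlib_stream[1]
method = cmf & 0x0F
header_ok = ((cmf << 8) | flg) % 31 == 0

# RFC 1950 trailer: big-endian Adler-32 of the uncompressed data.
(trailer,) = struct.unpack(">I", zlib_stream[-4:])
checksum_ok = trailer == zlib.adler32(payload)

# Everything between header and trailer is a raw Deflate stream.
# A negative wbits value tells zlib to expect no header or trailer.
raw_deflate = zlib_stream[2:-4]
restored = zlib.decompress(raw_deflate, -15)
```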

Usage

Please use the files from the releases page in your projects. Do not use the git checkout directly! The different source and header files are amalgamated into one miniz.c/miniz.h pair in a build step (amalgamate.sh). Include miniz.c and miniz.h in your project to use Miniz.

Features

  • MIT licensed
  • A portable, single source and header file library written in plain C. Tested with GCC, clang and Visual Studio.
  • Easily tuned and trimmed down by defines
  • A drop-in replacement for zlib's most used API's (tested in several open source projects that use zlib, such as libpng and libzip).
  • Fills a single threaded performance vs. compression ratio gap between several popular real-time compressors and zlib. For example, at level 1, miniz.c compresses around 5-9% better than minilzo, but is approx. 35% slower. At levels 2-9, miniz.c is designed to compare favorably against zlib's ratio and speed. See the miniz performance comparison page for example timings.
  • Not a block based compressor: miniz.c fully supports stream based processing using a coroutine-style implementation. The zlib-style API functions can be called a single byte at a time if that's all you've got.
  • Easy to use. The low-level compressor (tdefl) and decompressor (tinfl) have simple state structs which can be saved/restored as needed with simple memcpy's. The low-level codec API's don't use the heap in any way.
  • Entire inflater (including optional zlib header parsing and Adler-32 checking) is implemented in a single function as a coroutine, which is separately available in a small (~550 line) source file: miniz_tinfl.c
  • A fairly complete (but totally optional) set of .ZIP archive manipulation and extraction API's. The archive functionality is intended to solve common problems encountered in embedded, mobile, or game development situations. (The archive API's are purposely just powerful enough to write an entire archiver given a bit of additional higher-level logic.)
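
The streaming and save/restore points in the list above can be illustrated with any zlib-style API. The sketch below uses Python's zlib module as a stand-in rather than miniz's C functions: it feeds a decompressor one byte at a time, snapshots the decoder state halfway through the input, and later resumes from that snapshot. Python's copy() plays the role that a plain memcpy of the state struct plays for tinfl.

```python
import zlib

original = bytes(range(256)) * 64
compressed = zlib.compress(original)

decoder = zlib.decompressobj()
output = bytearray()
half = len(compressed) // 2
saved_decoder = None
saved_length = 0

# Feed the decoder a single byte per call.
for i in range(len(compressed)):
    if i == half:
        # Snapshot the decoder state and how much output exists so far.
        saved_decoder = decoder.copy()
        saved_length = len(output)
    output += decoder.decompress(compressed[i:i + 1])
output += decoder.flush()
streamed = bytes(output)

# Resume from the snapshot with the remaining input in one call.
tail = saved_decoder.decompress(compressed[half:]) + saved_decoder.flush()
resumed = streamed[:saved_length] + tail
```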

Known Problems

  • No support for encrypted archives. Not sure how useful this stuff is in practice.
  • Minimal documentation. The assumption is that the user is already familiar with the basic zlib API. I need to write an API wiki - for now I've tried to place key comments before each enum/API, and I've included 6 examples that demonstrate how to use the module's major features.

Special Thanks

Thanks to Alex Evans for the PNG writer function. Also, thanks to Paul Holden and Thorsten Scheuermann for feedback and testing, Matt Pritchard for all his encouragement, and Sean Barrett's various public domain libraries for inspiration (and encouraging me to write miniz.c in C, which was much more enjoyable and less painful than I thought it would be considering I've been programming in C++ for so long).

Thanks to Bruce Dawson for reporting a problem with the level_and_flags archive API parameter (which is fixed in v1.12) and general feedback, and Janez Zemva for indirectly encouraging me into writing more examples.

Patents

I was recently asked if miniz avoids patent issues. miniz purposely uses the same core algorithms as the ones used by zlib. The compressor uses vanilla hash chaining as described here. Also see the gzip FAQ. In my opinion, if miniz falls prey to a patent attack then zlib/gzip are likely to be at serious risk too.
