Child pages
  • WARC reader process

GZip compression is only supported for WARC files in which each record is compressed individually and the compressed records are concatenated into one file. It is not supported when the whole WARC file, with all its records, is GZip'ed as a single unit, mainly because that makes random access to individual records highly inefficient.
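The difference between the two layouts can be illustrated with a minimal Python sketch. It is not the reader's actual implementation; the helper names `write_members` and `read_member` are hypothetical and exist only for this example. Each record becomes its own GZip member, so a reader that knows a member's byte offset can decompress that record alone.

```python
import gzip
import zlib


def write_members(records):
    """Compress each record as its own GZip member and concatenate them.

    Returns the concatenated bytes and the start offset of every member.
    (Hypothetical helper, for illustration only.)
    """
    offsets, out = [], b""
    for rec in records:
        offsets.append(len(out))
        out += gzip.compress(rec)
    return out, offsets


def read_member(data, offset):
    """Decompress exactly one GZip member starting at `offset`.

    Returns the decompressed record and the number of compressed bytes
    the member occupied.
    """
    # wbits = MAX_WBITS | 16 tells zlib to expect a GZip header and trailer.
    d = zlib.decompressobj(wbits=zlib.MAX_WBITS | 16)
    record = d.decompress(data[offset:])
    # Bytes after the end of this member are left in d.unused_data.
    consumed = len(data) - offset - len(d.unused_data)
    return record, consumed
```

With a file compressed as one single GZip stream there are no member boundaries to seek to, so reaching the last record would mean decompressing everything before it.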

Pitfalls:

The payload is inaccessible when the "Content-Length" header is absent or invalid, because the parser then cannot tell where the payload ends.

The WARC-Payload-Digest header is computed only on defined record payloads where the leading header has been read. This makes it a requirement for the WARC parser to identify and always parse the HTTP response header, rather than treating that step as optional.
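Why the leading header must be parsed first can be seen in a small Python sketch. It assumes a SHA-1 digest in Base32 with a `sha1:` label, which is a common convention but is not stated on this page, and `payload_digest` is a hypothetical name. The digest covers only the bytes after the HTTP header, so the header boundary has to be located before hashing can start.

```python
import base64
import hashlib


def payload_digest(http_response: bytes) -> str:
    """Digest of the entity body that follows the HTTP response header.

    Assumption: SHA-1 encoded as Base32 with a "sha1:" label.
    """
    # The HTTP header ends at the first empty line (CRLF CRLF).
    header, separator, body = http_response.partition(b"\r\n\r\n")
    if not separator:
        raise ValueError("no end of HTTP header found")
    digest = hashlib.sha1(body).digest()
    return "sha1:" + base64.b32encode(digest).decode("ascii")
```

If the header were skipped instead of parsed, the digest would be computed over header and body together and would not match the payload.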

For GZip'ed records, a missing or invalid "Content-Length" is not a big problem, since we know the record ends where its GZip entry ends.

For uncompressed records, the payload input stream would have to look ahead for a valid WARC version line. At that point the payload stream should be closed, and the bytes read beyond that point pushed back onto the internal stream.
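The look-ahead approach for uncompressed records could be sketched as follows. This is an illustration under stated assumptions, not the parser's actual code: `PushbackReader`, `BOUNDARY`, and `read_payload` are hypothetical names, and the sketch assumes a record is followed by two CRLFs before the next version line.

```python
import io
import re


class PushbackReader:
    """Minimal stream wrapper that lets bytes be pushed back (hypothetical)."""

    def __init__(self, raw: io.BufferedIOBase):
        self._raw = raw
        self._pushed = b""

    def read(self, n: int) -> bytes:
        # Serve pushed-back bytes first, then fall through to the raw stream.
        if self._pushed:
            out, self._pushed = self._pushed[:n], self._pushed[n:]
            return out
        return self._raw.read(n)

    def unread(self, data: bytes) -> None:
        self._pushed = data + self._pushed


# Assumed boundary: two CRLFs ending a record, then the next version line.
BOUNDARY = re.compile(rb"\r\n\r\n(WARC/\d+\.\d+\r\n)")


def read_payload(reader: PushbackReader, chunk_size: int = 4096) -> bytes:
    """Read payload bytes until the next WARC version line is seen."""
    buf = b""
    while True:
        chunk = reader.read(chunk_size)
        if not chunk:
            return buf  # end of input: no further record follows
        buf += chunk
        match = BOUNDARY.search(buf)
        if match:
            # Push the version line and everything after it back, so the
            # next record can be parsed from its very first byte.
            reader.unread(buf[match.start(1):])
            return buf[:match.start()]
```

This is a heuristic: a payload that happens to contain a line resembling a WARC version line would be cut short, which is one reason a valid "Content-Length" is preferable.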
