Describes the steps taken to read and validate a WARC record.

Overview: 

The goal of this WARC library was to provide a small package for reading and validating WARC files.

The WARC parser was implemented on the premise that WARC data would be supplied in the form of streams and not files.

Parsing and validating a WARC file is therefore a sequential operation in which each record and its payload are read only once.

The same applies when parsing or validating compressed WARC files, where each record is GZip'ed: the compressed data can also only be processed sequentially.
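To illustrate the single-pass reading described above, here is a minimal Python sketch. It is not this library's implementation. It assumes the general WARC record layout (a "WARC/" version line, header lines, a blank line, then a payload of Content-Length bytes) and a buffered binary stream. The function name read_records is hypothetical, and header folding and validation are left out.

```python
import io


def read_records(stream):
    """Yield (headers, payload) per record, reading the stream once, front to back.

    Hypothetical helper for illustration only; `stream` is assumed to be a
    buffered binary stream, so read(n) returns n bytes unless the data ends.
    """
    while True:
        version = stream.readline()
        if not version:
            return  # end of stream
        if not version.strip():
            continue  # blank separator lines that follow the previous payload
        if not version.startswith(b"WARC/"):
            raise ValueError(f"expected WARC version line, got {version!r}")

        headers = {}
        while True:
            line = stream.readline()
            if line in (b"\r\n", b"\n", b""):
                break  # a blank line ends the header block
            name, _, value = line.partition(b":")
            headers[name.strip().decode("utf-8")] = value.strip().decode("utf-8")

        # The payload is consumed exactly once; nothing is re-read or seeked.
        length = int(headers["Content-Length"])
        payload = stream.read(length)
        if len(payload) != length:
            raise ValueError("truncated payload")
        yield headers, payload
```

Because the sketch only calls readline() and read(), it works on any forward-only stream, such as a socket or a pipe, and never needs to seek.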

It is, however, possible to access individual WARC records at random when working with the logical files and using a record's file offset.
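The random access described above can be sketched in Python as follows. The sketch assumes each record was compressed as its own GZip member and that the start offset of each member is known, for example from an index. Both function names are hypothetical and this is not the library's API.

```python
import gzip
import io
import zlib


def write_gzipped_records(out, records):
    """Write each record as a separate gzip member; return each member's start offset."""
    offsets = []
    for record in records:
        offsets.append(out.tell())
        out.write(gzip.compress(record))
    return offsets


def read_record_at(f, offset, chunk_size=4096):
    """Seek to a known offset and decompress exactly one gzip member."""
    f.seek(offset)
    # 16 + MAX_WBITS tells zlib to expect a gzip header and trailer.
    decompressor = zlib.decompressobj(16 + zlib.MAX_WBITS)
    parts = []
    while not decompressor.eof:
        chunk = f.read(chunk_size)
        if not chunk:
            raise ValueError("truncated gzip member")
        # Bytes past the end of this member land in unused_data and are ignored.
        parts.append(decompressor.decompress(chunk))
    return b"".join(parts)
```

Within the member the data is still decompressed sequentially; the offset only lets the reader skip the records that come before it.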

Steps to parsing a WARC record:

...