

...

One line is read at a time and compared to a valid WARC version line. The parser is fairly strict, but it will still accept a line consisting of "WARC/" followed by an invalid version string.

Accepted version strings are in the format "x.x[x.x]", even though the longer form will most likely never occur.

Warnings/Errors range from leading garbage before a valid WARC identifier line to invalid version information and missing CR-LF pairs.
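
The checks described above can be illustrated with a short sketch. It is not the reader's actual implementation. The function name `check_version_line`, the diagnostic strings, and the regular expression are assumptions made for illustration. In particular, "x.x[x.x]" is read here as two dot-separated numbers optionally followed by two more, which may differ from how the real parser interprets it.

```python
import re
from typing import List

# Assumed reading of "x.x[x.x]": two dot-separated numbers, optionally
# followed by two more. The real parser may interpret the format differently.
VERSION_RE = re.compile(r"\d+\.\d+(?:\.\d+\.\d+)?")
MAGIC = b"WARC/"


def check_version_line(line: bytes) -> List[str]:
    """Return diagnostics for one raw line; an empty list means it is clean."""
    diagnostics = []
    pos = line.find(MAGIC)
    if pos == -1:
        return ["no WARC identifier found"]
    if pos > 0:
        # Something precedes "WARC/" on the line.
        diagnostics.append("leading garbage before WARC identifier")
    rest = line[pos + len(MAGIC):]
    if rest.endswith(b"\r\n"):
        rest = rest[:-2]
    else:
        diagnostics.append("missing CR-LF pair")
        rest = rest.rstrip(b"\r\n")
    version = rest.decode("ascii", errors="replace")
    if not VERSION_RE.fullmatch(version):
        diagnostics.append("invalid version information")
    return diagnostics


if __name__ == "__main__":
    for sample in (b"WARC/1.0\r\n", b"junkWARC/1.0\r\n", b"WARC/abc\r\n"):
        print(sample, check_version_line(sample))
```

Returning a list of diagnostics instead of raising on the first problem mirrors the idea that a line can still be recognized as a version line while problems with it are reported.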

...

Warnings/Errors reported are restricted to the presence of more or fewer than two linefeeds.

Usage: 

The WARC reader can be used either to read all the records in a file sequentially or to read selected records in random order.

Both scenarios are supported by the various factory and reader methods.
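
The difference between the two scenarios can be sketched as follows. This is a simplified illustration and not the reader's own factory or reader API. The names `read_record`, `iter_records`, `read_record_at` and `make_record` are hypothetical, and the sample records carry only a minimal set of headers (a real WARC record has more mandatory fields). The sketch assumes a seekable stream and a correct Content-Length header.

```python
import io
from typing import Dict, Iterator, Tuple


def read_record(stream) -> Tuple[Dict[str, str], bytes]:
    """Read one record starting at the current stream position."""
    version = stream.readline()
    if not version.startswith(b"WARC/"):
        raise ValueError("expected a WARC version line")
    headers = {}
    while True:
        line = stream.readline()
        if line in (b"\r\n", b""):  # blank line ends the header block
            break
        name, _, value = line.decode("utf-8").partition(":")
        headers[name.strip()] = value.strip()
    payload = stream.read(int(headers["Content-Length"]))
    stream.read(4)  # skip the two CR-LF pairs that follow the payload
    return headers, payload


def iter_records(stream) -> Iterator[Tuple[int, Dict[str, str], bytes]]:
    """Sequential access: yield (offset, headers, payload) for each record."""
    while True:
        offset = stream.tell()
        if stream.read(1) == b"":  # end of file reached
            return
        stream.seek(offset)
        headers, payload = read_record(stream)
        yield offset, headers, payload


def read_record_at(stream, offset: int) -> Tuple[Dict[str, str], bytes]:
    """Random access: jump to a known record offset and read that record."""
    stream.seek(offset)
    return read_record(stream)


def make_record(payload: bytes) -> bytes:
    """Build a minimal sample record (illustration only)."""
    header = (
        b"WARC/1.0\r\n"
        b"WARC-Type: resource\r\n"
        b"Content-Length: " + str(len(payload)).encode("ascii") + b"\r\n"
    )
    return header + b"\r\n" + payload + b"\r\n\r\n"


if __name__ == "__main__":
    data = b"".join(make_record(p) for p in (b"first", b"second", b"third"))
    stream = io.BytesIO(data)
    offsets = [offset for offset, _, _ in iter_records(stream)]
    print(offsets)
    print(read_record_at(stream, offsets[2]))
```

Random access depends on knowing record offsets in advance, for example from an index built during an earlier sequential pass, as the usage lines at the end of the sketch do.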

Compression:

Besides uncompressed WARC files, GZip compressed files are also supported.

GZip compression is only supported for WARC files in which each record is compressed individually and the compressed records are concatenated into one file. Files in which the whole WARC file and all its records are GZip'ed as a single unit are not supported, mostly because this makes random access to individual records highly inefficient.
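
A minimal sketch of why per-record compression helps random access, using only Python's standard `gzip` and `zlib` modules. The function names `compress_per_record` and `read_member_at` are hypothetical, and the "records" are plain byte strings standing in for real WARC records.

```python
import gzip
import zlib
from typing import List, Tuple


def compress_per_record(records: List[bytes]) -> Tuple[bytes, List[int]]:
    """Compress each record as its own GZip member and concatenate them.

    Returns the concatenated file content and the offset of each member.
    """
    blob = bytearray()
    offsets = []
    for record in records:
        offsets.append(len(blob))
        blob += gzip.compress(record)
    return bytes(blob), offsets


def read_member_at(blob: bytes, offset: int) -> bytes:
    """Decompress only the GZip member that starts at the given offset."""
    # wbits = MAX_WBITS | 16 tells zlib to expect a GZip header and trailer.
    decompressor = zlib.decompressobj(wbits=zlib.MAX_WBITS | 16)
    # Decompression stops at the end of the first member; any following
    # bytes are left untouched in decompressor.unused_data.
    return decompressor.decompress(blob[offset:])


if __name__ == "__main__":
    records = [b"record one", b"record two", b"record three"]
    blob, offsets = compress_per_record(records)
    # Random access: only the third member is decompressed.
    print(read_member_at(blob, offsets[2]))
    # Sequential access: the concatenated members decompress as one stream.
    print(gzip.decompress(blob))
```

If the same records were compressed as one GZip stream instead, reaching the last record would require decompressing everything before it, which is the inefficiency the paragraph above refers to.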
