Describes the steps taken to read and validate a WARC file/record.
The goal of this WARC library was to make a small package to read and validate WARC files.
The WARC parser was implemented on the premise that WARC data would be supplied in the form of streams and not files.
So the basic operation of parsing and validating a WARC file is a sequential operation where each record and its payload is only read once.
This is also the case when parsing/validating compressed WARC files where each record is GZip'ed, in which case the compressed data can also only be processed sequentially.
It is, however, possible to access individual WARC records in random order when working with files rather than streams, by using a file offset.
Steps in parsing a WARC record:
The following steps are taken when parsing a WARC record:
- Parse and possibly skip lines until a valid "WARC/x.x" version line is identified. (Generates warnings if empty or unknown lines precede a version line)
- Parse and identify valid header lines until an empty line is encountered. (Generates warnings if a WARC header value is empty or invalid)
- Validate required and misplaced WARC headers based on the WARC-Type, if present. (Generates warnings if headers are missing or present contrary to the WARC specification)
- The payload is wrapped in an InputStream and made accessible through WARC record methods for payload processing. For http(s) payload content the HTTP headers are parsed and made accessible along with the actual HTTP response data. The record and the payload are digested if requested.
- When payload processing is done and the WARC record is closed the final step is to look for the mandatory trailing two linefeeds. (Generates warnings in case more or less than two lines are parsed)
Before a record can be parsed it must first be identified. So the first step is to look for the WARC version line in the stream.
One line is read at a time and compared to a valid WARC version line. The parser is fairly strict, but it will still accept "WARC/" followed by an invalid version string as a version line.
Accepted version strings are in the format of "x.x[x.x]" even though this will most likely never happen.
Warnings/Errors range from leading garbage before a valid WARC identifier line to invalid version information and missing CR-LF pairs.
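The version line check described above can be sketched roughly as follows. This is a minimal illustration, not the library's code: the class name `VersionLineSketch`, the `Result` enum and the `classify` method are hypothetical, and only the plain "x.x" form of the version string is checked.

```java
import java.util.regex.Pattern;

// Hypothetical sketch: classifies a single line read from the stream.
final class VersionLineSketch {
    enum Result { NOT_A_VERSION_LINE, INVALID_VERSION, VALID_VERSION }

    private static final String MAGIC = "WARC/";
    // Assumption: only the basic "x.x" form is treated as valid here.
    private static final Pattern MAJOR_MINOR = Pattern.compile("\\d+\\.\\d+");

    static Result classify(String line) {
        if (line == null || !line.startsWith(MAGIC)) {
            // Candidate for a "leading garbage" warning.
            return Result.NOT_A_VERSION_LINE;
        }
        String version = line.substring(MAGIC.length());
        // The line is still identified as a version line when the version is bad.
        return MAJOR_MINOR.matcher(version).matches()
                ? Result.VALID_VERSION
                : Result.INVALID_VERSION;
    }
}
```

A caller would skip lines classified as `NOT_A_VERSION_LINE` while recording a warning, and start header parsing on either of the other two results.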
This part of the record parser looks for valid header lines. This process is only terminated when an empty line is encountered.
Each possible header line is analyzed for correctness. Basic correctness is the presence of a ":" delimiter between the header-name and header-value.
Furthermore, header-names can only contain US-ASCII characters, excluding control characters, whitespace, etc.
Header-values are valid when they consist of US-ASCII characters, UTF-8, quoted strings or encoded words. The encodings can be used in sequence but not simultaneously.
Header-values can span multiple lines using FWS (Folding White Space).
Warnings/Errors range from invalid headers, missing or empty values and incorrect encoding to invalid URI/date/numeric/IP/content-type/digest formats.
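As a rough illustration of the header line handling, the sketch below splits each line on the first ":" and appends folded continuation lines to the previous value. The class `HeaderLineSketch` and its members are hypothetical names; character set and encoding checks are left out, and repeated header names simply overwrite each other.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: parses already-split lines (without their CR-LF).
final class HeaderLineSketch {
    final Map<String, String> headers = new LinkedHashMap<>();
    final List<String> warnings = new ArrayList<>();

    /** Parses header lines up to, but not including, the first empty line. */
    static HeaderLineSketch parse(List<String> lines) {
        HeaderLineSketch result = new HeaderLineSketch();
        String currentName = null;
        for (String line : lines) {
            if (line.isEmpty()) {
                break; // an empty line terminates the header block
            }
            char first = line.charAt(0);
            if (first == ' ' || first == '\t') {
                // Folded line: belongs to the value of the previous header.
                if (currentName == null) {
                    result.warnings.add("Continuation line without a header: " + line);
                } else {
                    result.headers.merge(currentName, line.trim(),
                            (a, b) -> (a + " " + b).trim());
                }
                continue;
            }
            int colon = line.indexOf(':');
            if (colon <= 0) {
                result.warnings.add("Invalid header line: " + line);
                currentName = null;
                continue;
            }
            String name = line.substring(0, colon);
            String value = line.substring(colon + 1).trim();
            if (value.isEmpty()) {
                result.warnings.add("Empty value for header: " + name);
            }
            result.headers.put(name, value);
            currentName = name;
        }
        return result;
    }
}
```

Collecting warnings in a list instead of throwing lets parsing continue past a bad line, which matches the warn-and-continue behaviour described above.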
This step is central in validating the WARC record header. Depending on the "WarcType-Id" the headers present are examined according to the profile for that type.
Warnings/Errors range from missing required fields to the presence of unwanted fields.
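A type-dependent check of this kind might look like the sketch below. The class `TypeProfileSketch` is hypothetical, and the rules shown (four headers required on every record, WARC-Target-URI required for "response" and unwanted for "warcinfo") are only a small illustrative subset of the profiles, not the complete rule set.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: validates parsed headers against a partial profile.
final class TypeProfileSketch {
    private static final List<String> ALWAYS_REQUIRED =
            Arrays.asList("WARC-Type", "WARC-Record-ID", "WARC-Date", "Content-Length");

    static List<String> validate(Map<String, String> headers) {
        List<String> warnings = new ArrayList<>();
        for (String name : ALWAYS_REQUIRED) {
            if (!headers.containsKey(name)) {
                warnings.add("Missing required header: " + name);
            }
        }
        String type = headers.get("WARC-Type");
        boolean hasTargetUri = headers.containsKey("WARC-Target-URI");
        // Illustrative subset of the per-type rules.
        if ("response".equals(type) && !hasTargetUri) {
            warnings.add("Missing required header for 'response': WARC-Target-URI");
        }
        if ("warcinfo".equals(type) && hasTargetUri) {
            warnings.add("Unwanted header for 'warcinfo': WARC-Target-URI");
        }
        return warnings;
    }
}
```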
Parsing of the record header is now done and the payload can now be processed. Payload processing is only possible when a valid "Content-Length" has been parsed.
The payload is made available through a fixed length input stream which is problematic without a valid length.
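To show why a valid length is needed, here is a minimal sketch of a fixed length stream built on `java.io.FilterInputStream`. The class name `FixedLengthInputStreamSketch` is hypothetical and the library's own stream class may differ; the point is that the wrapper stops after exactly `length` bytes and leaves the underlying stream open for the next record.

```java
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;

// Hypothetical sketch: exposes at most 'length' bytes of the wrapped stream.
final class FixedLengthInputStreamSketch extends FilterInputStream {
    private long remaining;

    FixedLengthInputStreamSketch(InputStream in, long length) {
        super(in);
        this.remaining = length;
    }

    @Override
    public int read() throws IOException {
        if (remaining <= 0) return -1;
        int b = in.read();
        if (b != -1) remaining--;
        return b;
    }

    @Override
    public int read(byte[] buf, int off, int len) throws IOException {
        if (len == 0) return 0;
        if (remaining <= 0) return -1;
        int n = in.read(buf, off, (int) Math.min(len, remaining));
        if (n > 0) remaining -= n;
        return n;
    }

    @Override
    public long skip(long n) throws IOException {
        long skipped = in.skip(Math.min(n, remaining));
        remaining -= skipped;
        return skipped;
    }

    @Override
    public int available() throws IOException {
        return (int) Math.min(in.available(), remaining);
    }

    @Override
    public boolean markSupported() {
        return false;
    }

    /** Drains the unread payload but does not close the underlying stream. */
    @Override
    public void close() throws IOException {
        while (remaining > 0) {
            if (skip(remaining) <= 0 && read() == -1) break;
        }
    }
}
```

Draining on `close()` is a design choice of this sketch: it leaves the underlying stream positioned right after the payload, so the trailing newlines can be checked next.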
If requested the record digest is computed. The payload digest is computed if requested but only in case the record has a defined payload with a leading header. (Currently only http response header)
No warnings/errors are generated on the payload stream itself; validating the payload content is left to whatever post-processing parses it.
If WARC digest headers are present in the record and digests have been computed while reading the payload they will be compared.
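Computing a digest while the payload is being read, and then comparing it with a header value, could be sketched as below using `java.security.DigestInputStream`. The class `DigestCompareSketch` and its methods are hypothetical. For brevity the sketch assumes the header value has the form "algorithm:hex"; digest values in real WARC files are often Base32 encoded, which would need an additional decoder.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.security.DigestInputStream;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical sketch: digest-on-read followed by a header comparison.
final class DigestCompareSketch {

    /** Reads the stream to its end and returns the digest as lowercase hex. */
    static String digestHex(InputStream payload, String algorithm) {
        try {
            MessageDigest md = MessageDigest.getInstance(algorithm);
            try (DigestInputStream in = new DigestInputStream(payload, md)) {
                byte[] buf = new byte[8192];
                while (in.read(buf) != -1) {
                    // Reading is enough; the digest is updated as a side effect.
                }
            }
            StringBuilder hex = new StringBuilder();
            for (byte b : md.digest()) {
                hex.append(String.format("%02x", b & 0xff));
            }
            return hex.toString();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalArgumentException(e);
        }
    }

    /** Assumption: header value looks like "sha1:<hex digest>". */
    static boolean matches(String headerValue, String computedHex) {
        int colon = headerValue.indexOf(':');
        if (colon < 0) return false;
        return headerValue.substring(colon + 1).equalsIgnoreCase(computedHex);
    }
}
```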
In accordance with the WARC specifications two trailing linefeeds are required after a record.
Warnings/Errors reported are restricted to the presence of more or fewer than two linefeeds.
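The trailing newline check can be illustrated with a `java.io.PushbackInputStream`, so that bytes belonging to the next record are not consumed. The class `TrailingNewlinesSketch` is hypothetical, and the sketch assumes that each newline is a CR-LF pair and that the pushback buffer holds at least two bytes.

```java
import java.io.IOException;
import java.io.PushbackInputStream;

// Hypothetical sketch: counts the newlines that follow a record's payload.
final class TrailingNewlinesSketch {

    /**
     * Counts consecutive CR-LF pairs at the current position and stops at the
     * first byte that does not continue the sequence, pushing it back.
     * The stream must have a pushback buffer of at least 2 bytes.
     */
    static int countTrailingNewlines(PushbackInputStream in) throws IOException {
        int count = 0;
        while (true) {
            int first = in.read();
            if (first == -1) return count;
            if (first != '\r') {
                in.unread(first);
                return count;
            }
            int second = in.read();
            if (second != '\n') {
                // Restore both bytes in reverse order so they are re-read in order.
                if (second != -1) in.unread(second);
                in.unread(first);
                return count;
            }
            count++;
        }
    }

    /** Returns a warning text, or null when exactly two newlines were found. */
    static String warningFor(int count) {
        return count == 2 ? null : "Expected 2 trailing newlines, found " + count;
    }
}
```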
The WARC reader can be used to read either all the records in a file sequentially or select records in random order.
Both scenarios are supported by the various factory and reader methods.
Besides uncompressed WARC files, GZip compressed files are also supported.
GZip compression is only supported on WARC files where each record is compressed individually and the results are concatenated into one file, not on files where the whole WARC file and all its records are GZip'ed as a whole. The latter is unsupported mostly because it makes random access to individual records highly inefficient.
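The benefit of per-record compression can be demonstrated with the JDK's `java.util.zip` classes alone. In the sketch below, which is not the library's reader, each record is written as its own GZip member and its start offset is remembered; a single record can then be decompressed from its offset without touching the records before it. The class `GzipPerRecordSketch` and its methods are hypothetical, and an in-memory buffer stands in for the file.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Hypothetical sketch: one GZip member per record, addressed by offset.
final class GzipPerRecordSketch {

    /** Appends one record as its own GZip member; returns its start offset. */
    static int appendRecord(ByteArrayOutputStream file, String record) {
        int offset = file.size();
        // Closing a ByteArrayOutputStream has no effect, so the buffer
        // can still be appended to after this member is finished.
        try (GZIPOutputStream gz = new GZIPOutputStream(file)) {
            gz.write(record.getBytes(StandardCharsets.UTF_8));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return offset;
    }

    /** Decompresses the bytes in [offset, offset + length). */
    static String readMembers(byte[] file, int offset, int length) {
        try (GZIPInputStream gz =
                     new GZIPInputStream(new ByteArrayInputStream(file, offset, length))) {
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            byte[] buf = new byte[4096];
            int n;
            while ((n = gz.read(buf)) != -1) {
                out.write(buf, 0, n);
            }
            return new String(out.toByteArray(), StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Passing the range of a single member yields just that record, while passing the whole buffer yields all records in sequence, because `GZIPInputStream` continues across concatenated members. With whole-file compression, reaching the last record would require decompressing everything before it.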
The payload is inaccessible when the "Content-Length" is absent or invalid.
The WARC-Payload-Digest value is computed only on defined record payloads where the leading header has been read. This requires the WARC parser to identify and always parse the http response rather than making this optional.
For GZip'ed records a missing "Content-Length" is not a big problem since we know the record ends when the GZip entry ends.
For uncompressed records the payload input stream would have to look ahead for a valid WARC version line at which point the payload stream should be closed and the bytes read beyond that pushed back onto the internal streams.
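One possible shape for such a look-ahead is sketched below, purely as an illustration of the idea and not as something the parser does. It probes the start of every line for "WARC/" and pushes the probed bytes back when a match is found. The class `VersionLookAheadSketch` is hypothetical. The approach has clear limits: the record's trailing newlines end up in the returned payload, and a payload that itself contains a line starting with "WARC/" would be cut short.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.PushbackInputStream;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical sketch: reads payload bytes until the next version line.
final class VersionLookAheadSketch {
    private static final byte[] MAGIC = "WARC/".getBytes(StandardCharsets.US_ASCII);

    /** The stream needs a pushback buffer of at least MAGIC.length bytes. */
    static byte[] readUntilVersionLine(PushbackInputStream in) throws IOException {
        ByteArrayOutputStream payload = new ByteArrayOutputStream();
        byte[] probe = new byte[MAGIC.length];
        boolean atLineStart = true;
        while (true) {
            if (atLineStart) {
                int n = readUpTo(in, probe);
                if (n == 0) break; // end of stream
                in.unread(probe, 0, n); // always restore the probed bytes
                if (n == MAGIC.length && Arrays.equals(probe, MAGIC)) {
                    break; // next record starts here
                }
                atLineStart = false;
            }
            int b = in.read();
            if (b == -1) break;
            payload.write(b);
            if (b == '\n') atLineStart = true;
        }
        return payload.toByteArray();
    }

    /** Reads until the buffer is full or the stream ends; returns the count. */
    private static int readUpTo(PushbackInputStream in, byte[] buf) throws IOException {
        int total = 0;
        while (total < buf.length) {
            int n = in.read(buf, total, buf.length - total);
            if (n == -1) break;
            total += n;
        }
        return total;
    }
}
```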