Appendix 2J – metadata per page ALTO
Revised version for second bid.
This appendix has been reworked; the changes have not been marked.
Requirements concerning OCR data
The requirements are inspired by the NDNP-ALTO standard.
The State and University Library OCR Profile
- OCR text shall be encoded using the ALTO (Analyzed Layout and Text Object) schema, Version 2.0, with the additional clarifications stated below.
- The value for MeasurementUnit will be "inch1200", which is 1/1200 of an inch.
- The use of the SourceImageInformation\fileName element is required. This should include the path if the path contains useful information (e.g., identifying the newspaper title and/or issue). The fileName value shall be the full path within the delivered package.
- The use of the OCRProcessing element is required. If the software does not have a commercial name, the name of the executable may be used. In addition to the name of the software, the software version and the software configuration, including the use of dictionaries, alphabets, etc., shall be specified. The settings can be specified as the command-line arguments given to the processing software.
- For all applicable elements, the use of STYLEREFS and language is encouraged.
- For the Page element, the use of PRINTED_IMG_NR, QUALITY, POSITION, and PROCESSING is encouraged.
- For the Page element, the use of HEIGHT and WIDTH is required.
- For the Page element, the entire page may be included in the PrintSpace. (Thus, use of TopMargin, LeftMargin, RightMargin, and BottomMargin is not required.)
- The use of Illustration, GraphicalElement, and ComposedBlock is not required.
- The use of non-rectangular blocks is not encouraged.
- The use of SP and HYP is encouraged.
- For a TextLine, the use of BASELINE is discouraged.
- For a String, the use of ALTERNATIVE, WC, and CC is encouraged if available. If alternatives are provided, ALTERNATIVE should be used rather than multiple Strings.
- For a String, the use of HEIGHT, WIDTH, HPOS, and VPOS is required.
- For a String, the CONTENT should be a word. A word should not be split at special characters such as "æ", "ø", or "å".
- Non-English text shall be encoded at the TextBlock level, using ISO 639-2 alpha-3 language codes. Note: a single ALTO document may have multiple languages encoded within individual TextBlocks (e.g. bilingual newspaper pages), but a single TextBlock may only have a single language. Language code "dan" from ISO 639-2 shall be used for Danish.
- If a hyphen splits a word at the end of a line, the OCR file should represent both fragments of the word, the hyphen, and the complete word. See the following example, where the word "experts" was split at the end of a line.
    <String ID="P5_ST00015" HPOS="5508" VPOS="24344" WIDTH="170" HEIGHT="61" CONTENT="ex" SUBS_TYPE="HypPart1" SUBS_CONTENT="experts" WC="0.96" CC="111"/>
    <HYP CONTENT="-"/>
    </TextLine>
    <TextLine ID="P5_TL00003" HPOS="3146" VPOS="24425" WIDTH="2532" HEIGHT="108">
    <String ID="P5_ST00016" HPOS="3146" VPOS="24439" WIDTH="288" HEIGHT="94" CONTENT="perts" SUBS_TYPE="HypPart2" SUBS_CONTENT="experts" WC="0.99" CC="00001"/>
- If the hyphenated word occurs in the middle of a line, the hyphen should be left in place. See the following example, where the word "re-examination" occurred in the middle of a line.
    <String ID="P2_ST03691" HPOS="11428" VPOS="15727" WIDTH="897" HEIGHT="89" CONTENT="re-examination" WC="0.97" CC="01001011010110" STYLEREFS="TXT_5"/>
- In the case of option B3, OCR text shall be in natural reading order. Thus, OCR text should reflect the columns of the original newspaper and be ordered column-by-column. In addition, the ordering of all elements should reflect the original newspaper. (That is, reading order should be indicated by the ordering of elements; for example, Strings should be in reading order.)
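
The "inch1200" measurement unit required above means that every HPOS, VPOS, WIDTH, and HEIGHT value is expressed in 1/1200 of an inch rather than in pixels. The following is a minimal sketch of the conversion, assuming the scan resolution in dots per inch is known; the function names and the dpi parameter are illustrative and are not part of this profile.

```python
from fractions import Fraction

# "inch1200" means 1/1200 of an inch, so one inch is 1200 units.
UNITS_PER_INCH = 1200


def pixels_to_inch1200(pixels: int, dpi: int) -> int:
    """Convert a pixel coordinate to inch1200 units (rounded to an integer).

    The dpi value is an assumption: it must come from the scan metadata.
    """
    if dpi <= 0:
        raise ValueError("dpi must be positive")
    return round(Fraction(pixels * UNITS_PER_INCH, dpi))


def inch1200_to_pixels(value: int, dpi: int) -> int:
    """Convert an inch1200 coordinate back to pixels (rounded to an integer)."""
    if dpi <= 0:
        raise ValueError("dpi must be positive")
    return round(Fraction(value * dpi, UNITS_PER_INCH))
```

For example, at a hypothetical resolution of 300 dpi, a word that is 100 pixels wide would get WIDTH="400".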
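
Several of the requirements above can be checked mechanically on a delivered file. The following is a rough sketch of such a check, not a complete validator: it covers only the MeasurementUnit value, the presence of the fileName and OCRProcessing elements, and the required attributes on Page and String. The function name check_alto and the message format are assumptions made for illustration, and element names are compared without their XML namespace.

```python
import xml.etree.ElementTree as ET
from typing import List

# Attributes this profile marks as required, per element.
REQUIRED_ATTRS = {
    "Page": ("HEIGHT", "WIDTH"),
    "String": ("HEIGHT", "WIDTH", "HPOS", "VPOS"),
}
# Elements this profile marks as required.
REQUIRED_ELEMENTS = ("fileName", "OCRProcessing")


def local_name(tag: str) -> str:
    """Strip the '{namespace}' prefix that ElementTree puts on tag names."""
    return tag.rsplit("}", 1)[-1]


def check_alto(xml_text: str) -> List[str]:
    """Return a list of human-readable problems; an empty list means none found."""
    root = ET.fromstring(xml_text)
    elements = list(root.iter())
    problems = []

    # MeasurementUnit must be exactly "inch1200".
    units = [(el.text or "").strip() for el in elements
             if local_name(el.tag) == "MeasurementUnit"]
    if units != ["inch1200"]:
        problems.append(f"MeasurementUnit is {units!r}, expected ['inch1200']")

    # Required elements must occur at least once.
    present = {local_name(el.tag) for el in elements}
    for name in REQUIRED_ELEMENTS:
        if name not in present:
            problems.append(f"missing element {name}")

    # Required attributes must be present on every Page and String.
    for el in elements:
        name = local_name(el.tag)
        for attr in REQUIRED_ATTRS.get(name, ()):
            if attr not in el.attrib:
                problems.append(f"{name} {el.get('ID', '?')}: missing {attr}")
    return problems
```

Such a check only reports missing items; it does not replace validation against the ALTO 2.0 schema itself.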
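
The hyphenation and reading-order rules above imply that a consumer can recover the running text by walking the String elements in document order and taking SUBS_CONTENT for words split across lines. The following is a minimal sketch under that assumption; the function name words_in_reading_order is hypothetical, and it assumes every HypPart1 fragment is followed by a matching HypPart2 fragment.

```python
import xml.etree.ElementTree as ET
from typing import List


def local_name(tag: str) -> str:
    """Strip the '{namespace}' prefix that ElementTree puts on tag names."""
    return tag.rsplit("}", 1)[-1]


def words_in_reading_order(xml_text: str) -> List[str]:
    """Return the words of an ALTO fragment, rejoining end-of-line hyphenation.

    Element order is taken as reading order, as this profile requires.
    """
    root = ET.fromstring(xml_text)
    words = []
    for el in root.iter():  # iter() walks elements in document order
        if local_name(el.tag) != "String":
            continue
        subs_type = el.get("SUBS_TYPE")
        if subs_type == "HypPart1":
            # First fragment: emit the complete word once.
            words.append(el.get("SUBS_CONTENT", el.get("CONTENT", "")))
        elif subs_type == "HypPart2":
            # Second fragment: the word was already emitted with HypPart1.
            continue
        else:
            # Ordinary word, including mid-line hyphens such as "re-examination".
            words.append(el.get("CONTENT", ""))
    return words
```

Applied to the "experts" example above, wrapped in a well-formed enclosing element, this yields the single word "experts" rather than the fragments "ex" and "perts".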