The Wayback installation under NetarchiveSuite has only been tested on a PC running Linux, in ProxyReplay mode. Other modes should work, but this is not guaranteed.
Wayback Indexer and Aggregator
Basic Concepts for the Indexer/Aggregator
There are two applications responsible for indexing an arcrepository. The
WaybackIndexerApplication checks a repository for any new files and issues batch jobs to index each new file individually. These unsorted index files are deposited in a local folder. The
AggregatorApplication sorts and merges these index files and then merges the result into the existing index files being used by your wayback instance. These applications may be configured and deployed using the NetarchiveSuite Deploy Tool.
The WaybackIndexerApplication uses a database to maintain a list of all files in a repository, together with information on whether or not each has been indexed. It uses a set of worker threads to issue batch jobs to index any new files found. Any arcfile whose name contains the string "metadata" is assumed to be a metadata file and is indexed with a tool that searches for deduplication records. Any other file is simply indexed as an arcfile, using code from the wayback open-source project.
The default settings for this application are
<settings>
  <wayback>
    <hibernate>
      <c3p0>
        <acquire_increment>1</acquire_increment>
        <idle_test_period>100</idle_test_period>
        <max_size>100</max_size>
        <max_statements>100</max_statements>
        <min_size>10</min_size>
        <timeout>100</timeout>
      </c3p0>
      <connection_url>jdbc:derby:derbyDB/wayback_indexer_db;create=true</connection_url>
      <db_driver_class>org.apache.derby.jdbc.ClientDriver</db_driver_class>
      <use_reflection_optimizer>false</use_reflection_optimizer>
      <transaction_factory>org.hibernate.transaction.JDBCTransactionFactory</transaction_factory>
      <dialect>org.hibernate.dialect.DerbyDialect</dialect>
      <show_sql>true</show_sql>
      <format_sql>true</format_sql>
      <hbm2ddl_auto>update</hbm2ddl_auto>
      <user></user>
      <password></password>
    </hibernate>
    <indexer>
      <replicaId>ONE</replicaId>
      <final_batch_output_dir>batchOutputDir</final_batch_output_dir>
      <temp_batch_output_dir>tempdir</temp_batch_output_dir>
      <maxFailedAttempts>3</maxFailedAttempts>
      <producerDelay>0</producerDelay>
      <recentProducerSince>86400000</recentProducerSince>
      <recentProducerInterval>1800000</recentProducerInterval>
      <producerInterval>86400000</producerInterval>
      <consumerThreads>5</consumerThreads>
      <initialFiles></initialFiles>
    </indexer>
  </wayback>
</settings>
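To illustrate how the nested elements are addressed, the following minimal sketch reads a few indexer values from a settings fragment using Python's standard xml.etree.ElementTree module. The fragment is a trimmed copy of the defaults above and the variable names are illustrative only. NetarchiveSuite reads its settings through its own Java classes, not through code like this.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the default indexer settings shown above.
SETTINGS_XML = """
<settings>
  <wayback>
    <indexer>
      <replicaId>ONE</replicaId>
      <maxFailedAttempts>3</maxFailedAttempts>
      <consumerThreads>5</consumerThreads>
      <initialFiles></initialFiles>
    </indexer>
  </wayback>
</settings>
"""

root = ET.fromstring(SETTINGS_XML)      # the root element is <settings>
indexer = root.find("wayback/indexer")  # path is relative to the root

replica_id = indexer.findtext("replicaId")
max_failed_attempts = int(indexer.findtext("maxFailedAttempts"))
consumer_threads = int(indexer.findtext("consumerThreads"))
# An empty element yields an empty string, i.e. no initial file list is set.
initial_files = indexer.findtext("initialFiles")

print(replica_id, max_failed_attempts, consumer_threads, repr(initial_files))
```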
As can be seen, the application uses a Hibernate object-relational mapping layer to communicate with a relational database, so it should be possible to plug in any RDBMS simply by changing the Hibernate settings. However, the code has only been tested with DerbyDB and PostgreSQL. The Hibernate settings are not described further here, as they are fully documented in the Hibernate documentation at http://www.hibernate.org.
The NetarchiveSuite-specific settings are as follows:
dk.netarkivet.wayback.settings.indexer.replicaId: The Id of the replica to be used for indexing. Since indexing is a relatively intensive operation, it is useful to be able to specify which replica is used by the indexer.
dk.netarkivet.wayback.settings.indexer.final_batch_output_dir: The directory where the unsorted index files are stored.
dk.netarkivet.wayback.settings.indexer.temp_batch_output_dir: A directory in which the output from partially finished batch jobs can be written.
dk.netarkivet.wayback.settings.indexer.maxFailedAttempts: The maximum number of failures allowed per file before the indexer permanently gives up attempting to index a given file. If a failed file needs to be retried then the
ResetFailedFiles utility can be used.
dk.netarkivet.wayback.settings.indexer.producerDelay: The delay in milliseconds after the system start before the indexing process begins.
dk.netarkivet.wayback.settings.indexer.producerInterval: The interval (in milliseconds) between successive reads of the latest complete filelist from the repository. The value of this parameter is a compromise between updating the index as quickly as possible and overburdening the repository with heavy-duty
dk.netarkivet.wayback.settings.indexer.recentProducerSince: The length (in milliseconds) of the look-back window, measured backwards from the current time. The indexer fetches all files added to the archive within this window, at the interval specified by:
dk.netarkivet.wayback.settings.indexer.recentProducerInterval: The interval (in milliseconds) between successive reads of newly-uploaded (or updated) files.
dk.netarkivet.wayback.settings.indexer.consumerThreads: The number of simultaneous indexing threads to be started and hence the maximum number of indexing batch jobs to be run simultaneously.
dk.netarkivet.wayback.settings.indexer.initialFiles: The path to a file containing a list of files in the archive which the indexer should ignore. This can be used when deploying the indexer to a legacy system to ensure that archive files which have already been indexed are not reindexed at unnecessary computational expense.
To summarise, the indexer behaviour is that it reads all newly archived files (based on their filesystem timestamp) at some specified short interval, and then at some much longer interval reads a list of all the files in the archive to check for any unexplained holes in the index coverage. Tuning of these parameters is a matter for the individual repository, but one possibility would be to fetch a list of files updated in the last 24 hours every half-hour and a list of every file in the archive once a week.
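Since all of these intervals are given in milliseconds, it can help to compute them rather than type them by hand. The sketch below derives the values for the tuning suggested above (a 24-hour look-back window read every half-hour, and a full file list once a week). The helper name to_millis is hypothetical. Note that the weekly value differs from the default producerInterval of 86400000, which corresponds to 24 hours.

```python
from datetime import timedelta

def to_millis(delta: timedelta) -> int:
    """Convert a timedelta to whole milliseconds, the unit used by the settings."""
    return delta // timedelta(milliseconds=1)

# Values for the tuning suggested in the text.
recent_producer_since = to_millis(timedelta(hours=24))       # look-back window
recent_producer_interval = to_millis(timedelta(minutes=30))  # short polling interval
producer_interval = to_millis(timedelta(weeks=1))            # full file list

for name, value in [
    ("recentProducerSince", recent_producer_since),
    ("recentProducerInterval", recent_producer_interval),
    ("producerInterval", producer_interval),
]:
    print(f"{name}: {value}")
```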
The indexer indexes both new data warc files and new metadata files. Metadata files are identified by the setting
settings.common.metadata.fileregexsuffix which has the default value
-metadata-[0-9]+.(w)?arc(.gz)?. This matches all metadata files generated by NetarchiveSuite with the normal setup.
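As a rough illustration of what this default matches, the sketch below tests a few file names against the pattern. It assumes that the setting is applied as a suffix, i.e. that the pattern must match at the end of the file name. How NetarchiveSuite applies it internally is not stated here, and the function name and example file names are hypothetical.

```python
import re

# Default value of settings.common.metadata.fileregexsuffix, as quoted above.
METADATA_SUFFIX = r"-metadata-[0-9]+.(w)?arc(.gz)?"

# Assumption: the suffix has to match at the end of the file name,
# with anything allowed in front of it.
_METADATA_RE = re.compile(".*" + METADATA_SUFFIX)

def is_metadata_file(filename: str) -> bool:
    """Return True if the file name ends with the metadata suffix pattern."""
    return _METADATA_RE.fullmatch(filename) is not None

# Hypothetical file names, for illustration only.
for name in ["42-metadata-1.arc",
             "42-metadata-1.warc.gz",
             "42-117-20120101120000-00001-host.warc.gz"]:
    print(name, is_metadata_file(name))
```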
The aggregator takes all files found in the indexer's output directory, sorts them, and merges them into an existing index file. Because it relies on the Unix sort command, this application runs only on Unix-like systems. At any given time, the active index files will consist of a list such as
wayback_intermediate.index
wayback.index
wayback.<yyyyMMdd-HHmm>.cdx
Whenever the aggregator runs (the interval between aggregator runs is determined by the parameter
dk.netarkivet.wayback.settings.aggregator.aggregationInterval in milliseconds) the new index files are sorted and merged into wayback_intermediate.index. If this file is now larger than
dk.netarkivet.wayback.settings.aggregator.maxIntermediateIndexFileSize (in KB) then this file is merged into wayback.index. (Merging is computationally less expensive than sorting - O(n) compared with O(n log(n)).) If this would cause
wayback.index to grow to larger than
dk.netarkivet.wayback.settings.aggregator.maxMainIndexFileSize then the
wayback.index file is renamed to incorporate the current timestamp and a new wayback.index file is started.
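The cascade described above can be sketched as a small in-memory model. This is not the aggregator's actual implementation: index files are modelled as lists of lines, sizes are counted in characters rather than KB, and the function and parameter names are hypothetical. The sketch also assumes that, after a rollover, the contents of the intermediate index become the start of the new wayback.index.

```python
import heapq

def size(lines):
    """Size of an index, modelled here as the total length of its lines."""
    return sum(len(line) for line in lines)

def aggregate(new_files, intermediate, main, rolled,
              max_intermediate, max_main, stamp):
    """Model one aggregator run; returns (intermediate, main, rolled)."""
    # Sort the new, unsorted index lines: O(n log n).
    fresh = sorted(line for f in new_files for line in f)
    # Merge two already-sorted sequences: O(n).
    intermediate = list(heapq.merge(intermediate, fresh))

    if size(intermediate) > max_intermediate:
        if size(main) + size(intermediate) > max_main:
            # Roll over: rename the main index with a timestamp, start afresh.
            rolled = rolled + [(f"wayback.{stamp}.cdx", main)]
            main = []
        main = list(heapq.merge(main, intermediate))
        intermediate = []
    return intermediate, main, rolled

# Example: the intermediate index overflows and forces a rollover.
state = aggregate([["b", "a"]], ["c"], ["x", "y"], [],
                  max_intermediate=2, max_main=4, stamp="20240101-0000")
print(state)
```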
In addition to the settings described above, the aggregator also uses
dk.netarkivet.wayback.settings.aggregator.indexFileOutputDir: the directory containing all the final active index files
dk.netarkivet.wayback.settings.aggregator.tempAggregatorDir: a temporary workspace directory. This directory should have storage space at least equal to
maxMainIndexFileSize and should ideally be on the same file system as
If a file has failed indexing more than maxFailedAttempts times then one can force the indexer to retry indexing it using a command line utility. For example, running it from inside the deploy directory on the same machine as the
[test@prod-way-001 TEST12]$ java -cp lib/dk.netarkivet.wayback.jar -Ddk.netarkivet.settings.file=conf/settings_WaybackIndexerApplication.xml -Dsettings.common.applicationInstanceId=RESET_APP dk.netarkivet.wayback.indexer.ResetFailedFiles file1 file2 file3 ...
The indexer will then attempt to index the named files again the next time its indexing thread runs.
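The interaction between maxFailedAttempts and the reset utility can be modelled as follows. This is only a sketch of the bookkeeping, not the indexer's database schema or the ResetFailedFiles code. The class and function names are hypothetical, and it assumes that a file is retried while its failure count is below maxFailedAttempts and that a reset sets the count back to zero.

```python
from dataclasses import dataclass

@dataclass
class FileRecord:
    """Hypothetical per-file bookkeeping kept by the indexer."""
    filename: str
    indexed: bool = False
    failed_attempts: int = 0

def is_eligible(record: FileRecord, max_failed_attempts: int = 3) -> bool:
    """Assumption: a file is retried until it has failed max_failed_attempts times."""
    return not record.indexed and record.failed_attempts < max_failed_attempts

def reset_failed_files(records: dict, filenames: list) -> None:
    """Assumption: resetting a file clears its failure count."""
    for name in filenames:
        if name in records:
            records[name].failed_attempts = 0

records = {"file1": FileRecord("file1", failed_attempts=3)}
before = is_eligible(records["file1"])  # given up on: no longer eligible
reset_failed_files(records, ["file1", "file2"])
after = is_eligible(records["file1"])   # eligible again on the next run
print(before, after)
```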