  • Our 2017 broad crawl was launched on 16 October. The settings are 1,500 URLs per domain, with a limit of three days per job. Our prediction of the overall volume, based on our tests, turns out to have been an underestimate: we had calculated around 77 TB with these settings, but after three weeks of crawling we now expect a final volume of around 97 TB. This is still within our overall storage budget, but we are keeping a close watch on the volume of data collected. So far we have encountered no major problems; both H3 and the new infrastructure are functioning correctly.
  • We are also continuing to update our full-text indexing process, with the aim of indexing our news crawls since 2016. We have been updating the indexing schema to follow recent developments in warc-indexer, and we will be working on the organisation of the index to improve query performance. The research project that will use this index to study neologisms starts this week, so we will be working closely with a research engineer over the next few weeks.
  • We are working on BCweb to integrate the KB developments in the 5.3 release and to fix some minor layout and redirection bugs.
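The revised 97 TB estimate for the broad crawl can be reproduced with a simple linear rate extrapolation. The sketch below is illustrative only: the observed volume and crawl duration figures are hypothetical assumptions, not numbers from this report.

```python
def extrapolate_total(observed_tb: float, elapsed_days: float, total_days: float) -> float:
    """Linearly extrapolate the final crawl volume from the rate observed so far."""
    return observed_tb / elapsed_days * total_days

# Hypothetical figures for illustration: ~41.6 TB collected after 21 days
# of a crawl expected to run ~49 days in total.
print(round(extrapolate_total(41.6, 21, 49), 1))  # ≈ 97.1 TB
```

A linear extrapolation like this tends to overshoot slightly in practice, since per-domain URL budgets exhaust over time and the crawl rate tapers off.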
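A typical query against a warc-indexer-style full-text index, restricted to captures since 2016 as the neologism project requires, can be sketched as follows. The Solr endpoint, core name, and field names (`content`, `crawl_date`) are assumptions based on common warc-indexer schema conventions; the actual index may differ.

```python
from urllib.parse import urlencode

# Hypothetical Solr endpoint -- replace with the real index URL.
SOLR_SELECT = "http://solr.example.org:8983/solr/news/select"

def neologism_query(term: str, since: str = "2016-01-01T00:00:00Z") -> str:
    """Build a Solr select URL for a term, restricted to captures since a date."""
    params = {
        "q": f'content:"{term}"',            # full-text field (schema-dependent)
        "fq": f"crawl_date:[{since} TO NOW]", # filter query on the capture date
        "sort": "crawl_date asc",
        "rows": 10,
    }
    return SOLR_SELECT + "?" + urlencode(params)

print(neologism_query("ubérisation"))
```

Putting the date restriction in `fq` rather than `q` lets Solr cache the date filter across queries, which matters when a research engineer runs many term lookups against the same time window.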


  • Our Domain Crawl for this year finished just a few days ago (with NAS 5.3 and all the expected problems: we often had to terminate jobs by calling the kill script, and a couple of times we had to stop NAS and clear the message queue because too many accumulated messages were causing NAS to become inactive).
  • Now we are doing some post-processing work (indexing, reporting).
  • The next step is a redeployment of 5.3.1 to reproduce the problems we had in the summer and to save the log files for further discussion. If that fails again, we will go back to 5.3, which works without problems for selective crawls only.



Next meetings

  • December 5th
  • January 9th, 2018