Child pages
  • 2017-04-04 Statusmeeting

...

  • On March 8 we started our first broad crawl for 2017. The first step ran with a budget limit of 10 MB per domain. We had many problems with this first broad crawl using Heritrix 3 and NAS 5.2.2. One of the problems was most likely the job scheduling: jobs changed their state, and there was a lot of manual "putting out fires" work. The crawl finished on March 26.
  • Under our new strategy for the selective crawls, we had stopped the front-page-only crawls that ran 6 times a day for news sites, because we were afraid of overloading the site owners' servers. A couple of weeks ago we resumed the 6 daily front-page crawls for the national news sites, so far without complaints from the site owners.
  • We selected 22 representative Facebook profiles and started harvesting them with Archive-It. This is our first Facebook crawl since last autumn.
  • We have NSF performance problems with the Wayback calendar display, and we still can't display pages that use the HTTPS protocol.

  • The free-text search index can lag 3-4 months behind because of the way it works. At the moment it is about 2 weeks behind.

...