- On March 8 we started our first broad crawl for 2017, beginning with a first step that uses a budget limit of 10 MB per domain. We had many problems with this first broad crawl using Heritrix 3 and NAS 5.2.2. Job scheduling was most likely one of them: jobs changed their state, and a lot of manual firefighting was needed. The crawl finished on March 26.
- With our new strategy for the selective crawls, we had stopped the front-page-only crawls that ran 6 times a day for news sites, because we were afraid of overloading the site owners' servers. A couple of weeks ago we resumed the 6 daily front page crawls for the national news sites, so far without complaints from the site owners.
- We selected 22 representative Facebook profiles and started harvesting them with Archive-It. This is our first Facebook crawl since last autumn.
We have NSF performance problems with the wayback calendar display, and we still can't display pages that use the https protocol.
The free text search index can lag 3-4 months behind because of the way it works. At the moment it is about 2 weeks behind.