Agenda for the joint KB, BNF, ONB and BNE NetarchiveSuite tele-conference 2017-03-07, 13:00-14:00.
- BNF: Sara
- ONB: Michaela, Andreas
- KB/DK - Copenhagen: Stephen, Tue, Nicholas
- KB/DK - Aarhus: Sabine, Colin
- BNE: Mar
NAS 5.3 Release
5.3 is out: NetarchiveSuite 5.3.x Release Notes
5.3.1 is opened: https://sbforge.org/jira/secure/RapidBoard.jspa?rapidView=8&view=detail
NAS workshop in Vienna
Any changes to the draft agenda? 2017 NAS workshop
Status of the production sites
- On March 8 we started our first broad crawl for 2017, beginning with a first step that has a budget limit of 10 MB per domain. We had a lot of problems with this first broad crawl using Heritrix 3 and NAS 5.2.2. Most likely one of the problems was the job scheduling: jobs changed their state unexpectedly, and a lot of manual firefighting was needed. The crawl finished on March 26.
- Under our new strategy for the selective crawls, we had stopped the front-page-only crawls that ran 6 times a day for news sites, because we were afraid of overloading the site owners' servers. A couple of weeks ago we resumed the 6 daily front-page crawls for the national news sites, so far without complaints from the site owners.
- We selected 22 representative Facebook profiles and started harvesting them with Archive-It. This is our first Facebook crawl since last autumn.
We have NSF performance problems with the Wayback calendar display, and we still can't display pages that use the HTTPS protocol.
The free-text search index can lag by 3-4 months because of the way it works. At the moment it is about 2 weeks behind.
- After completing our last tests on NetarchiveSuite 5.3 and Heritrix 3, we went into production and started our first crawls on March 20th!
- The beginning of the year is also the time for writing our annual report. In 2016, we crawled 125.47 TB of data, including the largest broad crawl in our collection (90.5 TB). This year we chose to study the top-level domains (TLDs) in the broad crawl to measure the impact of including new regional TLDs in the seed list. The use of each TLD varies from one region to another (commercial purposes, public purposes, personal websites...), and the number of active websites is not proportional to the geographical area. As we did last year, we also analysed EPUB files to see whether anything has changed: their number is about the same, but the number of domains hosting them is growing. Overall, we exceeded our predictions because of the increase in the average size of the harvested files.
We have been using NAS 5.3 in production for one week. No problems during selective crawls. Our domain crawl for 2017 will start soon.
- May 9th
- June 6th
- July 4th
- August 8th
Any other business?