Agenda for the joint BNF, ONB, SB, KB and BNE NetarchiveSuite tele-conference 2016-09-20, 13:00-14:00.
- BNF: Lam, Annick, Sara
- ONB: Michaela, Andreas
- KB/DK: Søren, Stephen, Nicholas
- SB: Sabine, Colin
- BNE: -
- KB/Sweden: Bengt
IIPC crawler hackathon in London
September 22-23. Søren, Colin, Bert will attend.
Topics, attendees: https://drive.google.com/drive/folders/0BwTi-qdD0KvdNEE4Qmpaa2dJeHM
Common questions/interests to bring?
NAS 5.2 Development Update
On the BnF side, some bugfixes:
NAS-2544
NAS-2545
NAS-2546
NAS-2553
Translation of new keys in French and German.
Considering the adoption of WARC revisit records for duplicates.
NAS workshop in Vienna
January 30th 2017 - February 1st 2017 - Vienna
NetarchiveSuite Curator Issues
Should we "reanimate" our curator roadmap/backlog, revise it and discuss it in Vienna?
Status of the production sites
- Last week we launched the third broad crawl of 2016. The crawl limit per domain will be at most 100 MB. There will be special crawls for ministries and government bodies, and for very large sites (e.g. dr.dk).
- We will try to get in touch with the website owners and web hosting providers who are blocking our crawler (about 11% are blocking us).
- The event collection for the Olympics in Rio 2016 will go on until the end of the Paralympics 2016
- We are working on the configuration of the regional/local news media crawls.
- We have test-crawled about 60 Danish Facebook profiles with Archive-It. We are analyzing how much we get from the profiles. We have to renew our account with Archive-It after the end of November and we are trying to negotiate a good price.
- We made a special crawl of Prime Minister Lars Løkke's Facebook profile on 2016.08.30, the day he published his 2025 plan.
Compression of the archive
- We are preparing for the compression, but this awaits NAS release 5.2
Last but not least
Last week we learned that the Ministry of Culture wants KB and SB to merge: from January 2017 we will be “Nationalbiblioteket”, with two locations, in Copenhagen and Aarhus.
We are continuing to work on this year's broad crawl. We are preparing nas-preload, the tool used to combine the different sources into a single list to be loaded into NAS. This step also includes a DNS check, so that the crawl is not slowed down by domains that do not have a DNS response. This year, in addition to excluding domains with no DNS, we are also excluding those that give an "unknown" response, as we know from previous years that there is generally no content on these domains. Overall the seed list will contain around 4.4 million active domains, and will have improved coverage of the regional TLDs (.alsace, .paris, and .bzh for Brittany) and of the French West Indies.
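The DNS pre-check described above can be illustrated with a minimal sketch. This is not the nas-preload code: the function names, the three status labels and the mapping from resolver errors to "no DNS" versus "unknown" are all assumptions made for illustration.

```python
import socket
from typing import Callable, Iterable, List, Tuple

# Hypothetical status labels; nas-preload may classify responses differently.
OK, NO_DNS, UNKNOWN = "ok", "no_dns", "unknown"


def dns_status(domain: str) -> str:
    """Classify a domain by whether the system resolver can resolve it."""
    try:
        socket.getaddrinfo(domain, None)
        return OK
    except socket.gaierror as err:
        # Assumption: "name not known" stands in for "no DNS";
        # any other resolver failure is treated as "unknown".
        return NO_DNS if err.errno == socket.EAI_NONAME else UNKNOWN
    except UnicodeError:
        # Names that cannot be encoded as valid host names.
        return UNKNOWN


def filter_seeds(
    domains: Iterable[str],
    status_of: Callable[[str], str] = dns_status,
) -> Tuple[List[str], List[Tuple[str, str]]]:
    """Merge, de-duplicate and split domains into kept and dropped lists."""
    kept: List[str] = []
    dropped: List[Tuple[str, str]] = []
    seen = set()
    for raw in domains:
        name = raw.strip().lower()
        if not name or name in seen:
            continue
        seen.add(name)
        status = status_of(name)
        if status == OK:
            kept.append(name)
        else:
            # Both "no DNS" and "unknown" responses are excluded from the seed list.
            dropped.append((name, status))
    return kept, dropped
```

Passing the resolver as a parameter keeps the filtering logic testable without network access; for a list of several million domains a real tool would presumably resolve in parallel rather than one name at a time.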
Turning to project crawls, the 2016 Olympiad is now over but our Olympics crawls are still running. The project, in line with the earlier collaborative collections documenting the 2014 Sochi Winter Games and the 2012 London Summer Games, involves seven curators from the Literature and Art department, who select content based on eight themes. Two crawls were planned, before and after the games, covering a list of 558 seeds. For social media, we focused only on Twitter, with 447 French accounts or hashtags collected twice a day from the 4th to the 24th of August. These crawls will be complemented by one for the Paralympic Games, to be launched on the 18th of September. We have also contributed our list of seeds to the worldwide collaborative collection led by the British Library for IIPC.
- We have already switched to NAS 5.2 because we had severe problems with HTTPS websites in the former version. These problems are now fixed by using H3, which runs under Java 1.8.0_77, with the following disabled jdk.tls algorithms set in /opt/jdk1.8.0_77/jre/lib/security/java.security:
jdk.tls.disabledAlgorithms=SSLv3, DHE, ECDHE, RC4, MD5withRSA, DH keySize < 768
It has gone smoothly so far. We are still using the ARC format, because we have to refactor all our tools before we switch to WARC.
- The crawl of our presidential elections is still running; we have a new election date at the beginning of December and hope to be able to finish the crawl soon.
- Apart from one small, additional thematic crawl we will only have ongoing crawls until the end of the year. Next domain crawl is scheduled for 2017.
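As a small aid for checking the jdk.tls.disabledAlgorithms setting mentioned above, the following sketch reads that property from the text of a java.security file and lists its entries. It is only an illustration: the function name is hypothetical, and the parsing covers just comments and backslash line continuations, not the full Java properties syntax.

```python
from typing import List

KEY = "jdk.tls.disabledAlgorithms"


def read_disabled_tls_algorithms(text: str) -> List[str]:
    """Return the entries of jdk.tls.disabledAlgorithms found in java.security text."""
    logical_lines: List[str] = []
    buffer = ""
    for raw in text.splitlines():
        line = raw.strip()
        # Skip blank lines and comments unless we are inside a continued line.
        if not buffer and (not line or line.startswith("#")):
            continue
        if line.endswith("\\"):
            # A trailing backslash continues the value on the next line.
            buffer += line[:-1].rstrip() + " "
            continue
        logical_lines.append(buffer + line)
        buffer = ""
    if buffer:
        logical_lines.append(buffer)

    value = None
    for entry in logical_lines:
        name, sep, rest = entry.partition("=")
        if sep and name.strip() == KEY:
            value = rest  # if the key appears more than once, keep the last one
    if value is None:
        return []
    return [item.strip() for item in value.split(",") if item.strip()]
```

Called on the contents of the java.security file, this returns one string per disabled algorithm or constraint, which makes it easier to compare the setting between crawler machines.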
- October 25
- November 29
- January 3, 2017
Any other business?