Agenda for the joint NetarchiveSuite tele-conference 2019-09-10, 13:00-14:00.
- BnF: Géraldine, Clara, Sara
Andreas (unable to attend)
- KB/DK - Copenhagen: Tue, Stephen, Anders, Kristian
- KB/DK - Aarhus: Colin, Sabine, Knud Åge
- BNE: Alicia
- KB/Sweden: Par, Thomas, Peter
Update on NAS latest tests and developments
NetarchiveSuite 5.6 release: see NetarchiveSuite 5.6 Release Notes
Using IIPC Heritrix 3 in NetarchiveSuite: feedback on BnF tests and analysis: see presentation
Status of the production sites
Summer activities and plans
• We ran a mini event harvest on Trump's plan to buy Greenland from the Danish Queen, focusing on Twitter activity and reactions from foreign media.
• After our 2nd broad crawl for 2019, which finished in May, we reworked the results and changed some configurations. We started our 3rd broad crawl on 1 September. Step 1 has a limit of 50 MB; step 2 will have a limit of 16 GB. Simultaneously with step 1, we started runs of "ultra big sites", "OAI-extraction (research databases)", "municipalities and regions", "ministries and administrative bodies", and YouTube videos.
• One of the most important problems we have observed with the selective crawls is js-lazy-load: images are not displayed or, worse, not even captured.
From our ongoing projects:
• We are looking forward to implementing the new features for BCWeb, so that we can continue building up an external community to help us with the collection work using BCWeb.
• We are going to rethink our collection strategy within the framework of the general collection strategy for the digital cultural heritage.
• We are investigating a solution with only one online copy of Netarchive.
• Hopefully we will soon be allocated more IT resources, so that we can continue implementing browser-based harvesting in our production system. Umbra is still not fully in place.
• There are still some issues to be solved before we can implement SolrWayback in our frontend, especially legal issues in connection with GDPR.
During the summer, we continued the preparation of our broad crawl. We ran an HTTP test on circa 5 million seed URLs and, from this test, identified 90 unwanted websites (hosting, ISP, parking, domain name registration websites), which will enable us to exclude 187 400 domains from our seed list.
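As an illustration only, the kind of analysis described above could be sketched as follows. This is a minimal sketch and not the BnF procedure: it assumes the HTTP test yields, for each seed domain, the host reached after redirects, and that a host reached by many different seeds is a candidate hosting, parking or registrar site. The function names, the `min_seeds` threshold and the sample domains are hypothetical.

```python
from collections import defaultdict


def find_shared_targets(final_host_by_seed, min_seeds=3):
    """Return hosts that at least `min_seeds` distinct seed domains redirect to.

    `final_host_by_seed` maps a seed domain to the host reached after
    following redirects (assumed to come from a prior HTTP test).
    """
    seeds_by_target = defaultdict(set)
    for seed, final_host in final_host_by_seed.items():
        # Only count seeds that ended up on a different host.
        if final_host and final_host != seed:
            seeds_by_target[final_host].add(seed)
    return {
        target: seeds
        for target, seeds in seeds_by_target.items()
        if len(seeds) >= min_seeds
    }


def domains_to_exclude(shared_targets):
    """Collect every seed domain that points at one of the shared targets."""
    excluded = set()
    for seeds in shared_targets.values():
        excluded |= seeds
    return excluded


# Hypothetical sample of HTTP test results.
observed = {
    "a.fr": "parking.example",
    "b.fr": "parking.example",
    "c.fr": "c.fr",
    "d.fr": "parking.example",
    "e.fr": "other.example",
}
candidates = find_shared_targets(observed, min_seeds=3)
print(sorted(candidates))                      # hosts to review by hand
print(sorted(domains_to_exclude(candidates)))  # seeds that would be dropped
```

In such a setup the candidate hosts would still be reviewed by a curator before any seed is excluded, since a legitimate site can also be the target of many redirects.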
Our bandwidth was increased to 1.5 GB along with a general increase of BnF bandwidth. We are running tests to find the best compromise with our infrastructure (CPU, memory).
We are still working on the new version of BCWeb and are now working on the administrator pages.
We upgraded OpenWayback to the latest version, 2.4.2, which was released in May 2019.
We are currently running the first stage of our yearly domain crawl and are in the last third of that stage. After exchanging our hardware (old PCs with weaker CPUs but more RAM), we are still experiencing the problem described in https://sbforge.org/jira/browse/NAS-2682, but not very often.
This year we also plan to run a second stage, which was not possible last year because of our storage limits. To make this possible every year, we need to negotiate for more storage. In preparation for our yearly budget discussion, we collected information about the last domain crawl in Denmark (the number of .dk domains is similar to that of .at). Thanks to Tue for providing this information, which will hopefully help us get more storage in the future.
We closed two of our event crawls: European Parliament elections and local elections. We are still working on the Spanish Government elections collection because the political situation is complicated and the government has not yet been formed.
It seems that we already have a date for our annual broad crawl: it will be carried out this month.
We are working on a pilot project with another national institution specialized in big data to index our archive and allow full-text search in it. We will keep you updated on this.
We are organizing a conference about the preservation of the Web archive, which will take place on November 29, the World Digital Preservation Day. With it, we will celebrate the tenth anniversary of the Spanish Web Archive.
Last but not least, this week two new librarians will join the team.
- October 8
- November 5
- December 3
- January 7, 2020
Any other business?