Agenda for the joint KB, BNF, ONB and BNE NetarchiveSuite tele-conference 2018-09-11, 13:00-14:00.
- BNF: Sara, Géraldine
- ONB: Andreas, Michaela
- KB/DK - Copenhagen: Tue, Stephen, Anders
- KB/DK - Aarhus: Colin, Sabine
- BNE: Mar
- KB/Sweden: Bengt
Update on NAS latest tests and developments
Nicholas Clarke has left his employment at the Danish Royal Library. The remaining NetarchiveSuite (and SOLRWayback) developers are now all situated in Aarhus, and so we expect that future development activities will be concentrated mostly in Aarhus.
Development is currently focused on integrating Umbra for browser-based harvesting into NetarchiveSuite. We have heavily leveraged the BnF-developed functionality for harvest-channels and mappings, so that the main effort is on creating a specialised "Umbra-enabled" HarvestController component. We are close to producing a release-candidate for beta-testing.
Status of the production sites
KB/DK
Broad crawl: the first step of our third broad crawl for 2018 started on August 25 and is still ongoing.
Selective crawl: September 5 is the official commemoration day for Danish soldiers who have been deployed in war or conflict zones. Together with partners from the Danish National Archives, we are running an event crawl on this commemoration. We used BCWeb to nominate the URLs. At first everything went fine: Stephen had hardcoded the required schedules, as we have no schedules with integrated hops with Heritrix 5. However, after the fourth crawl, all subsequent crawls failed even though nothing had been changed. Test crawls with the BCWeb schedules work fine, and we have still not found the cause of the problem. As a workaround, we created a “replacement” event harvest definition that does not use BCWeb.
OpenWayback: we are now able to display pages served over https, but by no means all of them. For instance, we are not able to display social media pages.
Blacklight (fulltext search): the facets to refine a search do not work.
SOLRWayback: we ran some tests in our production environment. The results are promising: we are able to display pages from Twitter and Facebook crawls made after those sites started using https. The most important task now is to resolve the problems with the proxy browser setup.
BNF
After the two workshops on crawling YouTube (covered in our June update), we were able in July to launch a production crawl using the process previously outlined. This first crawl lasted 20 days. The curators selected 42 channels and we crawled all the videos from these channels: 28 063 videos, with the exception of 10 videos that had been removed and one video excluded because of our filters. The crawl represents 1.8 TB and more than 3 000 hours of video. A second crawl is planned in November.
We have also finished work on giving access to these videos, as well as those crawled during the elections last year. To replay the videos within YouTube pages, we built on the system already used for Dailymotion. A specific rule is applied to pages for which videos have been collected, allowing us to replace the YouTube player with another called FLV Player, which is present in our archives. We use the metadata collected during the crawl to establish the link between the web page and the correct video file. As the page listing all the videos on a channel is not fully collected by Heritrix, we created pages within our access application with the full list of videos collected for each channel, and inserted a button within the YouTube page to link to this list. Finally, we created a "guided tour", similar to that which already exists for news sites, with a list of all the YouTube channels collected. This is also based on the metadata, with additional description added by curators.
In other news, we have just started our broad crawl for 2018. It will be the biggest broad crawl we have yet performed, with a budget of 110 TB and 4.7 million domains in the seed list. The budget per domain is 2 500 URLs (compared to 1 500 URLs last year). During this crawl, the total size of the BnF web archives is expected to exceed 1 Petabyte.
BNE
Our selective crawls are running as usual. We are considering launching a couple of new selective crawls on “Gastronomy” (which is an important topic in Spain) and “Folklore and popular traditions”.
We have large gaps in our collections in important fields such as Language and Literature, History, Social Science, Biology and Medicine, and Science and Technology, as we do not have specialised departments in charge of these kinds of collections. We have signed an agreement with the University Libraries Network in Spain (REBIUN), under which they will cooperate with us in selecting and managing web collections in these fields. In the meantime, we are laying the groundwork with a small set of seeds per subject.
We are still working on a new search interface that can provide access by collection and subject.
We are also considering opening access to our web archive over the internet, excluding content from the previous year, following the example of the Portuguese Web Archive. We still have to consult our legal staff to make sure that we avoid claims or complaints from web content providers.
Our IT team has installed version 5.4.2 in the pre-production environment. The job duplication problem that we experienced in the previous version is solved.
Next meetings
- October 9th
- November 6th
- December 4th
- January 8th 2019
Any other business?