
Agenda for the joint KB, BNF, ONB and BNE NetarchiveSuite tele-conference 2017-12-05, 13:00-14:00.

Participants

  • BNF: Sara
  • ONB: Michaela, Andreas
  • KB/DK - Copenhagen: Tue, Nicholas
  • KB/DK - Aarhus: Colin, Sabine
  • BNE: Mar
  • KB/Sweden: Bengt

Upcoming NAS developments

  • Release of NAS has been delayed because of issues with our compression project
  • Ongoing work on 5.3.2: https://sbforge.org/jira/secure/RapidBoard.jspa?projectKey=NAS&rapidView=8
  • Priorities for Netarkivet development team in 2018 (in addition to NAS bug-fixing, performance issues etc.)
    • Browse-access with OpenWayback and https support
    • Improved harvesting with Umbra
    • API harvesting of social media

Status of the production sites

Netarkivet

  • Our fourth broad crawl for 2017 with a budget of 10 MB per domain started on November 14 and finished on November 23. We captured a little less than four TB.
  • Our event harvest on the local and regional elections on November 21 is almost finished. We will give the different definitions one or two more crawls.
    Our election-related Facebook crawl will be run with Archive-It; we calculated that we could crawl about 1000 Facebook profiles within our account budget. Setting up the crawl takes quite some time. We are deliberately running the Facebook crawl after the elections, as we will be able to capture content retrospectively.
    As mentioned before, we also used BCWeb for the election harvest. As BCWeb was only accessible internally at KB, this is something of a pilot project for the use of BCWeb with a colleague outside Netarchive. In the next couple of weeks, we will evaluate the different elements of the event harvest.

BnF

Our first broad crawl with NAS5 and H3 is finished! We crawled 101.55 TB in 6 weeks. We encountered 4 problems during this crawl:
  • a storage saturation problem with our new infrastructure (we lost 16 jobs of the broad crawl and a few jobs from selective crawls)
  • an out of memory problem on the GUI and the broker (with no data loss)
  • the use of public_suffixes.dat introduced in NAS5 made H3 create a lot of queues by host for the domain blogspot.com instead of a single queue by domain
  • some second-level TLDs were also created as domains, which broadened the crawl scopes
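
The queue problem in the third point can be illustrated with a small sketch. It assumes that blogspot.com appears as a suffix entry in public_suffixes.dat, so that every blog host becomes its own "domain". The suffix sets and the queue_key helper below are hypothetical and hand-written for illustration; they are not the contents of the NAS file or the H3 queue assignment code.

```python
# Toy illustration only: these suffix sets are hand-written assumptions,
# not the contents of the public_suffixes.dat shipped with NAS5.
BASIC_SUFFIXES = frozenset({"com", "org", "fr", "co.uk"})
# Same set plus a private-style entry for blogspot.com.
EXTENDED_SUFFIXES = BASIC_SUFFIXES | {"blogspot.com"}


def queue_key(host, suffixes):
    """Hypothetical helper: longest matching suffix plus one extra label."""
    labels = host.lower().rstrip(".").split(".")
    # Candidates are tried from the longest (the full host) to the shortest.
    for i in range(len(labels)):
        if ".".join(labels[i:]) in suffixes:
            # Keep one label in front of the suffix, if there is one.
            return ".".join(labels[max(i - 1, 0):])
    # No suffix matched: fall back to the last two labels.
    return ".".join(labels[-2:])


hosts = ["a.blogspot.com", "b.blogspot.com", "www.example.com"]
queues_basic = {queue_key(h, BASIC_SUFFIXES) for h in hosts}
queues_extended = {queue_key(h, EXTENDED_SUFFIXES) for h in hosts}
print(sorted(queues_basic))
print(sorted(queues_extended))
```

With the basic set, both blogs share the single queue blogspot.com; with the extended set, each blog host gets its own queue, which is the multiplication of queues described above.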


We received only 5 complaints from web publishers compared to around 15 in 2016. During the coming weeks, we are going to analyse the crawl reports and the quality of the archives to produce a report on the crawl.

In parallel, we had scheduling issues: our daily news crawls stopped three times. Two jobs were submitted with the same ID and this changed the status of the selective harvest from active to inactive. 

ONB

  • We have finished our 2017 broad crawl. Due to the unusual storage consumption of NAS we could only do one stage with a limit of 10 MB. In total we have crawled approx. 5 TB. We will change the interval for broad crawls from two years to annual crawls.
  • We have a couple of local elections in 2018, which will be the focus of selective crawls.

BNE

Last month, we successfully migrated all our web collections to the production environment of NAS 5. We are reasonably happy with the new environment.

However, despite the tests we ran in the preproduction environment, we experienced some problems, mainly related to the configuration of templates in NAS 5.

Frontpage+1 and frontpage+2 didn't work as expected. We noticed that some of the crawls ran very fast, but they stopped as soon as they encountered any slight problem and didn't manage to finish.

Juan Carlos compared the NAS 5 templates with the ones in NAS 4 and adjusted some parameters. Everything now appears to be working properly: crawls finish faster than before and harvest more objects. However, the default template is not working yet, and my IT colleagues are studying its configuration.

We are waiting for the system to become more stable before running the .gal domain crawl. We hope to launch it before the end of the year.

The Library is mirroring its storage at another location of the Ministry of Education and Culture, so we will have a copy of our web archive there within the next few months.

The user access we enabled last midsummer is only available from the BNE and the regional libraries that asked for it: http://www.dl-e.es/openwayback/wayback/. Although we publicised this new service, we have not had many consultations so far, as access is only available on-site and the interface (the default OpenWayback) is not very user-friendly. We give open access on the internet to the calendar of captures we hold for a given website, but once you try to access the content, a message pops up noting that access is limited to on-site facilities for copyright reasons. The list of collections (in Excel files, so far) is here: http://www.bne.es/es/Colecciones/ArchivoWeb/Subcolecciones/selectivas.html

KB-Sweden


Next meetings

  • January 9th, 2018

Any other business?
