...

  • Our fourth broad crawl for 2017 with a budget of 10 MB per domain started on November 14 and finished on November 23. We captured a little less than four TB.
  • Our event harvest of the local and regional elections on November 21 is almost finished. We will give the different definitions one or two more crawls.
    Our election Facebook crawl will be run with Archive-It; we calculated that we could crawl about 1000 Facebook profiles within our account budget. Setting up the crawl takes quite some time. We will intentionally run the Facebook crawl after the elections, as we will be able to capture content retrospectively.
    As mentioned before, we also used BCWeb for the election harvest. As BCWeb was only accessible internally at KB, this is something of a pilot project for using BCWeb with a colleague outside Netarchive. In the next couple of weeks, we will evaluate the different elements of the event harvest.

...

Our first broad crawl with NAS5 and H3 is finished! We crawled 101.55 TB in 6 weeks. We encountered 4 problems during this crawl:
  • a storage saturation problem with our new infrastructure (we lost 16 jobs of the broad crawl and a few jobs from selective crawls)
  • an out of memory problem on the GUI and the broker (with no data loss)
  • the use of public_suffixes.dat introduced in NAS5 made H3 create a lot of queues by host for the domain blogspot.com instead of a single queue by domain
  • some second-level TLDs were also created as domains, which broadened the crawl scopes
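
The blogspot.com queue problem in the list above can be illustrated with a small sketch. It assumes, as a simplification, that the crawler's queue key is the registrable domain derived from a suffix list; the function name `queue_key` and the tiny rule sets are hypothetical and are not taken from NAS5 or H3. Since blogspot.com is itself an entry in the private section of the Public Suffix List, each blog then counts as a separate domain:

```python
# Toy subsets of suffix rules, for illustration only (assumption: the real
# public_suffixes.dat used by NAS5 contains far more entries).
ICANN_SUFFIXES = {"com", "fr", "co.uk"}
PRIVATE_SUFFIXES = {"blogspot.com"}  # listed in the private section of the Public Suffix List


def queue_key(host, suffixes):
    """Hypothetical queue key: longest matching suffix plus one more label."""
    labels = host.lower().rstrip(".").split(".")
    # Candidates go from the full host down to the last label,
    # so the first match is the longest matching suffix.
    for i in range(len(labels)):
        candidate = ".".join(labels[i:])
        if candidate in suffixes:
            if i == 0:
                return candidate  # the host itself is a suffix
            return ".".join(labels[i - 1:])
    # No rule matched: fall back to the last two labels.
    return ".".join(labels[-2:])


hosts = ["alice.blogspot.com", "bob.blogspot.com", "www.example.com"]

# Without the private rule, both blogs share the single key "blogspot.com".
keys_without = {queue_key(h, ICANN_SUFFIXES) for h in hosts}

# With the private rule, every blog host gets its own key.
keys_with = {queue_key(h, ICANN_SUFFIXES | PRIVATE_SUFFIXES) for h in hosts}

print(sorted(keys_without))
print(sorted(keys_with))
```

With millions of blogs under blogspot.com, the second behaviour would yield one queue per blog, which is consistent with the large number of queues observed during the crawl.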


We received only 5 complaints from web publishers compared to around 15 in 2016. During the coming weeks, we are going to analyse the crawl reports and the quality of the archives to produce a report on the crawl.

In parallel, we had scheduling issues: our daily news crawls stopped three times. Two jobs were submitted with the same ID and this changed the status of the selective harvest from active to inactive.

The access we enabled for users last midsummer is available only from the BNE and the regional libraries that asked for it. Although we publicized this new service, we have not had many consultations so far, as access is available only on-site and the interface (OpenWayback by default) is not very user-friendly. We give open access (on the internet) to the captures we have of a specific website

ONB


BNE


Dear colleagues,

Last month, we successfully migrated all our web collections to the production environment of NAS 5. We are reasonably happy with the new environment.

However, despite the tests we ran on the preproduction environment, we experienced some problems, mainly related to the configuration of templates in NAS 5.

Frontpage+1 and frontpage+2 didn't work as expected. We noticed that some of the crawls ran very fast, but they stopped whenever they encountered any slight problem and didn't manage to finish.

Juan Carlos compared the NAS 5 templates with the ones in NAS 4 and adjusted some parameters. Apparently everything is now working properly: crawls finish faster than before and harvest more objects. However, the default template is not working yet, and my IT colleagues are studying its configuration.

We are waiting for the system to become more stable before running the .gal domain crawl. We hope to launch it before the end of the year.

The Library is mirroring its storage at another location of the Ministry of Education and Culture, so we will have a copy of our web archive there within the next few months.

...