...

  • BNF: Sara, Géraldine
  • ONB: Andreas, Michaela
  • KB/DK - Copenhagen: Tue, Nicholas, Søren, Anders
  • KB/DK - Aarhus: Colin, Sabine
  • BNE: Mar
  • KB/Sweden: Bengt

Update on NAS 5.4.

...

NAS 5.4 is available for download here but we are awaiting completion of the acceptance test before making a formal announcement.

We have actually found a bug (memory leak) in NAS 5.4 (NAS-2751), which affects the new functionality to manage the number of jobs on the queue. The feature is, in fact, disabled by default, but we are working on a quick patch release, so there will be a 5.4.1 within days.

Harvesting YouTube with NAS and H3: feedback from BnF

In March and May, the team worked on defining an integrated workflow to harvest YouTube channels and videos. We will present the results of our work.

Status of the production sites

Netarkivet


Broad crawl:

  • Our second broad crawl of 2018 is ongoing. We started by running step 1 with a limit of 10 MB/domain from May 20 to May 28 (552 jobs). We relaunched a run with a limit of 10 MB/domain. We have a problem with too many -50 return codes.

Event crawls:

  • The collection on the collective negotiations on pay is almost finished, as all unions have accepted the results.
  • We just launched a new event crawl on a political and cultural meeting on the island of Bornholm, called “Folkemødet”, which means “people's meeting”. This is a pilot project for the use of BCWeb.

All technical and practical obstacles to the use of BCWeb are “surmounted”: we now have two registered external users.

We have installed OpenWayback in our test environment. Among other things, we are focusing on the replay of https pages. It works for quite a few sites, but not for all of them.

BnF

...

ONB

  • Our crawl of twoday.net is almost finished, but still running. The Austrian blogging platform wanted to shut down by the end of May, but they postponed the shutdown to the end of June. This gives us extra time to finish our jobs.
  • As soon as these jobs are finished, we will upgrade to NAS 5.4 or 5.4.1 and prepare our domain crawl.
  • We have a request to crawl a website regularly. It is the website of Vienna, wien.at. They want to support us with resources. Our management is also interested in offering such a service.

BNE


Our 2018 broad crawl finished a couple of weeks ago. Compared with the one launched in 2016, which lasted 3 months, this one was considerably shorter: only 42 days.

The number of .es domains is more than 1,900,000. The limit per domain was 150 MB, and around 50 TB were archived.

The event crawl on the Catalan elections has been closed. It lasted around 7 months and contains 1,800 seeds.

Recently we have been very busy with the National Politics collection, due to the many changes that have been taking place in relation to the change of government.

We have plans to upgrade to NAS version 5.4 soon.

We have also been designing a web archive interface for users that includes search by subject, collection and title along with the default URL search. The design is more or less ready and we are now in the development phase.

A couple of months ago we heard that Wikispaces will close by the end of July. Wikispaces is a free hosting service that hosts mainly academic and learning content. As there is no way to discriminate by language or country, we needed to rely on some help from outside our team. We launched a social media campaign (a press release on the Library website and a call on Twitter) calling for nominations from the academic and research community as well as from individuals who know some Spanish wikispaces. We received many nominations. We consider this collection “at-risk” and we have already crawled more than 300 Spanish wikispaces.

2 tests and latest developments

Colin

Status of the production sites

Netarkivet


  • We upgraded from the old Wayback to OpenWayback. Many images are still “lost” and https is only partly supported (the problem may be the different use of the dns-secure/dirty setup in Copenhagen/Aarhus). The https-based social media are still invisible.

The loss of images is surely a browser problem. We use IE, which is an old technology. When using the Edge browser, all images become visible. Integrating Edge into our Wayback setup requires an update of our Citrix platform.

  • We started testing SOLRWayback in our production environment, and the results look good. Our https-based Twitter and Facebook crawls are visible.

The great challenge is the proxy browser setup. A Firefox-based setup will not be supported on the Citrix platform by a national IT service. The IT service will take charge of the support of almost all our IT platforms, devices, software, … (a political decision to centralize all IT support for national institutions).

  • Our second broad crawl for 2018 is halfway through step 2 (with a limit of 14 GB/domain). We have problems with jobs hanging for long periods, and they need “manual help”.

  • We deployed a new version of H3 (supporting "srcset" responsive design tags) in our production environment. Images using these responsive design tags were not harvested from 2014 to June 2018. We still lack support for data-srcset tags.

  • We upgraded our Blacklight search front end to the newest version with support for the new SOLR index, but there are still problems with the graphic design.
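
To illustrate the srcset point in the Netarkivet status above: a responsive image lists several candidate URLs in one attribute, so a link extractor that only reads "src" misses them. The following is a minimal Python sketch of that idea, not the extractor used in H3. The function and class names are hypothetical, and splitting the attribute on commas is a simplifying assumption that fails for URLs which themselves contain commas.

```python
from html.parser import HTMLParser


def parse_srcset(value):
    """Return the candidate URLs from a srcset-style attribute value.

    Assumption: candidates are separated by commas and each one is a URL
    optionally followed by a descriptor such as "480w" or "2x".
    """
    urls = []
    for candidate in value.split(","):
        parts = candidate.split()
        if parts:
            urls.append(parts[0])
    return urls


class ImageUrlCollector(HTMLParser):
    """Hypothetical helper: collect URLs from src, srcset and data-srcset."""

    def __init__(self):
        super().__init__()
        self.urls = []

    def handle_starttag(self, tag, attrs):
        if tag not in ("img", "source"):
            return
        for name, value in attrs:
            if not value:
                continue
            if name == "src":
                self.urls.append(value)
            elif name in ("srcset", "data-srcset"):
                self.urls.extend(parse_srcset(value))


def extract_image_urls(html):
    """Return all image URLs found in the given HTML, in document order."""
    collector = ImageUrlCollector()
    collector.feed(html)
    return collector.urls
```

Reading only "src" on the first image in the test input would yield a single URL, while the sketch also returns the two srcset candidates and the two data-srcset candidates of the second image.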

BnF


Since 2011, the BnF has published each year a summary and statistics on national publishing output based on the collections made via legal deposit, including the web archives. In addition to the general analysis, each year has a focus, and for this year it was politics. We compared the web collections relating to three elections: 2007, 2012 and 2017. The three selections were based on the same categories and covered a lot of regions. The analysis confirms our hypothesis: since 2007, videos have become ubiquitous and the use of social networks from blogs to Twitter and Facebook has exploded.

We also measured the percentage of web sites still online. After 10 years, 26% of sites for elections are still online, 19% redirect to other websites and 55% have disappeared. After 5 years, 44% are still online, 22% redirect and 33% have disappeared. After 1 year, 81% are still online, 10% redirect and 6% have disappeared. The lifespan of a web site varies a lot but the results show that the other collections (outside electoral collections) are complementary.

During May, a placement student worked on the scope of the broad and focused crawls in view of the legal definition of the French domain. With 4.5 million sites, the BnF covers less than 60% of the French web. To be more representative, the BnF must extend its contacts to other registrars, especially those with generic TLDs. We hope to contact the company Gandi to obtain their list of sites and improve the coverage. For the focused crawl, it is suggested that selection must be related to how the web is used and not only to the traditional collections, and also that new ways of selection can help, such as more cooperation with librarians, researchers... In addition, more legal clarification is needed relating to the harvest of social networks: for the moment the BnF only crawls accounts and some hashtags of public figures or organisations. Finally, the study proposes to document the dynamics of sites through filmed tutorials, especially when there are technical difficulties for the crawler.

ONB


BNE


KB-Sweden


Next meetings

  • July 17th
  • September 11th
  • October 9th
  • November 6th
  • December 4th
  • January 8th 2019

...