...

  • BNF: Clara, Sara, Géraldine
  • ONB: Andreas
  • KB/DK - Copenhagen: Tue, Stephen, Anders 
  • KB/DK - Aarhus: Colin, Sabine, Kristian
  • BNE: Alicia
  • KB/Sweden: Pär, Thomas, Peter

Join from PC, Mac, Linux, iOS or Android:

...

We celebrated the 10th anniversary of the Spanish Web Archive. We organized a conference where we had the opportunity to share experiences with other colleagues.

New collections:

  • We are now working on two event crawls covering the elections in two regions: the Basque Country and Galicia
  • We want to launch a new collection about feminism this month

Serials broad crawl: We have been preparing a list of freely accessible serials URLs. There are almost 8,000 of them, and we are launching a kind of broad crawl to harvest them.

KB-Sweden

(At last we are writing something here. We apologize for not doing so earlier.)

Recap: We have run selective harvests successfully since mid-2018, but during 2019 we had many problems running NAS snapshots above a certain size (number of domains).
There was a bottleneck in the system, but it was hard to figure out where. At last we found it: the PostgreSQL database server was overwhelmed and database requests were queuing up, which made the system slow and hard to monitor (as the GUI updates were out of phase).

Eventually we realized what the bottleneck was, added some indexes to the database, and suddenly everything went like clockwork! At least technically.

So in December we were able to complete the first part of our broad crawl (with just a 500 kB limit). In January we started part 2, with around 500,000 domains remaining and limits of 2 GB and 50,000 objects. Over 90 % of the jobs have now run, so it will probably be done within this week. Very good!

While monitoring, we discovered a large number of sites that are the same kind of shop, displaying many thousands of products with a couple of images each, as well as sites related to sports activities with tons of match results and player statistics. This, combined with link errors that create looping URLs, can lead to millions of URLs in the queue. It takes some time before such jobs reach the 50,000-object limit, so we have been monitoring and deleting URLs from the queue now and then.
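
As an illustration of the looping-URL problem described above, here is a minimal Python sketch of one possible heuristic: flag a queued URL when the same path segment repeats several times. This is not how NetarchiveSuite or Heritrix handles it, and it is not what we actually ran; the function names (`looks_like_loop`, `split_queue`) and the threshold of 3 repeats are assumptions made for this example only.

```python
from collections import Counter
from urllib.parse import urlsplit


def looks_like_loop(url: str, max_repeats: int = 3) -> bool:
    """Return True if any path segment occurs at least max_repeats times.

    Hypothetical heuristic: looping links often produce paths such as
    /a/b/a/b/a/b/, where the same segments keep coming back.
    """
    segments = [s for s in urlsplit(url).path.split("/") if s]
    if not segments:
        return False
    return max(Counter(segments).values()) >= max_repeats


def split_queue(urls, max_repeats: int = 3):
    """Partition queued URLs into (keep, suspect) lists."""
    keep, suspect = [], []
    for url in urls:
        (suspect if looks_like_loop(url, max_repeats) else keep).append(url)
    return keep, suspect
```

A list of suspects produced this way would still need a manual look before anything is deleted from the queue, since legitimate sites can also repeat path segments.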

Next meetings

  • April 7, 2020
  • May 5, 2020
  • June 9, 2020
  • July 7, 2020
  • September 8, 2020
  • October 6, 2020
  • November 3, 2020
  • December 8, 2020
  • January 5, 2021

...