- BNF: Clara, Sara, Géraldine
- ONB: Andreas
- KB/DK - Copenhagen: Tue, Stephen, Anders
- KB/DK - Aarhus: Colin, Sabine, Kristian
- BNE: Alicia
- KB/Sweden: Pär, Thomas, Peter
We celebrated the 10th anniversary of the Spanish Web Archive. We organized a conference where we had the opportunity to share experiences with other colleagues.
Serials broad crawl: We have prepared a list of URLs for freely accessible serials. There are almost 8,000 of them, and we are launching a kind of broad crawl to harvest them.
(At last we are writing something here. We apologize for not doing so earlier.)
Recap: We have run selective harvests successfully since mid-2018, but during 2019 we had many problems running NAS snapshots above a certain size (number of domains).
Eventually we realized what the bottleneck was, added some indexes to the database, and suddenly everything went like clockwork! At least technically.
So in December we could complete the first part of our broad crawl (just a 500 kByte limit). In January we started part 2, with around 500,000 domains remaining and limits of 2 GByte and 50,000 objects. Over 90% of the jobs have now run, so it will probably be done within this week. Very good!
While monitoring, we discovered a large number of sites that are the same kind of shop, displaying many thousands of products with a couple of images each, as well as sites related to sports activities with tons of match results and player statistics. This, combined with link errors that create looping URLs, can lead to millions of URLs in the queue. It takes some time before such jobs reach the 50,000 objects limit, so we have been monitoring the queue and deleting URLs from it now and then.
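As an illustration of the kind of check that could support this manual monitoring, here is a minimal sketch that flags URLs whose path repeats the same segment many times, which is a common sign of a link loop. It is not what we actually ran: the function names `looks_like_loop` and `count_by_host`, the threshold `max_repeats`, and the example URLs are all assumptions made for this sketch.

```python
from collections import Counter
from urllib.parse import urlsplit


def looks_like_loop(url: str, max_repeats: int = 3) -> bool:
    """Return True if any path segment occurs more than max_repeats times.

    Hypothetical heuristic: a path such as /a/b/a/b/a/b/a/b/ usually comes
    from a broken relative link that keeps appending to itself.
    """
    segments = [s for s in urlsplit(url).path.split("/") if s]
    if not segments:
        return False
    return max(Counter(segments).values()) > max_repeats


def count_by_host(urls):
    """Count queued URLs per host, to spot hosts that dominate the queue."""
    return Counter(urlsplit(u).hostname for u in urls)


# Example queue content (made-up URLs for illustration only).
queue = [
    "http://shop.example.se/product/123",
    "http://shop.example.se/product/124",
    "http://club.example.se/results/results/results/results/2019",
]
suspicious = [u for u in queue if looks_like_loop(u)]
per_host = count_by_host(queue)
```

The threshold is deliberately loose, since legitimate paths sometimes repeat a segment once or twice. A list like `suspicious` would only be a starting point for review before deleting anything from a queue.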
- April 7, 2020
- May 5, 2020
- June 9, 2020
- July 7, 2020
- September 8, 2020
- October 6, 2020
- November 3, 2020
- December 8, 2020
- January 5, 2021