Agenda for the joint KB, BNF, ONB and BNE NetarchiveSuite tele-conference 2018-01-09, 13:00-14:00.
- BNF: Sara
- ONB: Michaela, Andreas
- KB/DK - Copenhagen: Tue, Nicholas, Soren
- KB/DK - Aarhus: Colin, Sabine
- BNE: Mar
- KB/Sweden: Bengt
Upcoming NAS developments
Status of the production sites
The steering committee resigned due to the ongoing reorganization of the Royal Danish Library. As the whole team is now employed by a single institution, the Library will rethink the organization and steering of Netarchive.
Webdanica has gone into production. Under our legal deposit law we have to collect “Danica”, that is to say content produced by Danes, in Danish or for a Danish audience. Webdanica automates the identification of Danica outside .dk. Outlinks from .dk domains are collected and filtered according to specific criteria to identify Danica. The occurrence of Danish geographical and personal names, for example, is one criterion for being Danica. The Danica seeds are inserted in NAS seed lists and harvested by Heritrix.
The event crawl on the local and regional elections ended on December 15.
First of all, we wish you a very happy new year and all the best for 2018! We have a change in the team: Ange Aniesa has left to take up a position in another department at the BnF. We wish him all the best.
In December, we organized a week-long workshop within the team on collecting Twitter, to build on last year's election crawls, where we used Heritrix 3 to collect more than 3,500 Twitter accounts or hashtags twice a day, with a depth of page + 1 click. This allowed us to crawl the timeline for each seed (i.e. 40 tweets per day per seed) and part of the context (the timelines of other accounts or hashtags mentioned in the seed). The goal of this workshop was to continue this specific crawl during the year by creating a new dedicated harvest definition, and to improve its quality. The quality of the crawl depends on the number of seeds. First we tested dividing the seed list between several jobs. Then we tested putting all the seeds in one job and dividing the twitter.com queue into 10 separate queues. The quality is better when the seed list is shared between several jobs than when it is split into several queues within one job, apparently because the division between queues is uneven: some queues crawled more than 15,000 URLs while others crawled fewer than 1,500 URLs. We need to continue the tests.
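The first approach above, sharing the seed list between several jobs, can be sketched as follows. This is only an illustration under stated assumptions: the function name `split_seeds`, the placeholder account URLs, the figure of exactly 3,500 seeds and the choice of 10 jobs are all hypothetical, and the sketch does not show how the seed lists are actually handled in NAS or BCWeb.

```python
from typing import List


def split_seeds(seeds: List[str], n_jobs: int) -> List[List[str]]:
    """Distribute seeds round-robin so that job sizes differ by at most one."""
    if n_jobs < 1:
        raise ValueError("n_jobs must be at least 1")
    # seeds[i::n_jobs] takes every n_jobs-th seed, starting at offset i.
    return [seeds[i::n_jobs] for i in range(n_jobs)]


# Hypothetical seed list: placeholder account names, not real seeds.
seeds = [f"https://twitter.com/account{i}" for i in range(3500)]

# Hypothetical choice of 10 jobs, one seed list per job.
jobs = split_seeds(seeds, 10)
sizes = [len(job) for job in jobs]
print(sizes)
```

A round-robin split keeps the number of seeds per job even by construction. It does not by itself guarantee an even number of crawled URLs per job, since some seeds lead to more content than others.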
During this workshop we also studied the API services. The free service allows us to collect less information than the crawl by Heritrix: fewer tweets, fewer images, less context and no links. It would also be more difficult to give access to these data afterwards and to preserve them. We therefore decided to abandon this approach. The new crawl will start at the beginning of the year and run twice a day, with only a small number of accounts at first, but the seed list will grow step by step thanks to the curators. This is the best way to cover current events, in addition to our existing crawls of news websites.
Finally, we are pleased to announce that we have published the seed lists for our focused crawls on the new BnF site dedicated to APIs and datasets. These lists are based on exports from BCWeb and include the crawl settings and descriptive elements added by the curators. We hope this will help researchers to make better use of our collections. There are two pages on the site, one for election crawls (http://api.bnf.fr/liste-des-adresses-URL-des-collectes-du-web-electoral-par-la-BnF) and one for other focused crawls (http://api.bnf.fr/liste-des-adresses-url-des-collectes-ciblees-du-web-francais-par-la-bnf).
We are currently selecting seeds for local elections taking place at the end of January.
After several tests, we finally installed NAS 5 in a production environment. We encountered some problems with the templates, which our IT team managed to solve by adapting and more or less merging the default NAS templates and the BnF templates.
We launched the .gal domain crawl on behalf of the Library of Galicia. It is the first time we have launched a domain crawl at the request of a regional library. The list contains about 5,200 .gal domains. Approximately 25% of them are empty, with no content, as they are presumably just reserved domains. The .gal top-level domain is quite recent, which might explain this percentage of empty pages.
We have an error rate of around 10%, with many of the errors due to forbidden access, which should be solved by the Galician web curators getting in touch with the website owners.
We are about to finish the Elections in Cataluña event crawl. The president is expected to be elected by the Catalan parliament by the end of this month.
The rest is business as usual (thematic, regional and event crawls).
- February 6th
- March 6th
- April 10th
- May 15th
- June 12th
- July 17th
- September 4th
Any other business?