Each year, the different sections of the BnF legal deposit department publish an overview of the documents they have received. L’Observatoire du dépôt légal : reflet de l’édition contemporaine is now available online (in French only).
It provides analysis and raw data from 2015 on seed domains (more than 900,000 have appeared since the previous year and more than 500,000 have disappeared), on formats, on HTTP response codes, on the biggest harvested domains…

This month we also have several project crawls on different themes.

Among these project crawls, the annual one dedicated to French Official Publications is still going on, with a few new aims. Launched in mid-June, it contains a sample of the social web presence of the central administration, following the decision to add the social media accounts of ministers and public bodies. Unfortunately, Facebook pages are not crawled because of the now well-known problem of captchas, but the goal is to reflect this type of official communication, which was previously not well covered in our selections. The frequency of the crawls of these specific channels for promoting official publications and administrative and political communication could be increased in the future. The traditional aim of collecting the "classic" online publications is still relevant, with more than 800 URL seeds of traditional websites, each crawled with a budget of 100,000 URLs.

Our annual crawl of auction houses has just finished. The scope of the collection is the same as in previous years, but last year one platform, which represents about a third of the crawl, blocked access by our robots. The librarian in charge of the selection contacted the site owner, who was happy to let us crawl the site, and the quality seems much better this year. We also have to be careful because the majority of the sites are hosted on two platforms (one of them Drouot), and their catalogues and images are stored on a small number of hosts. We have to increase the budget for these hosts to collect as much as possible.

We are also maintaining our "Solidarities" crawl with the same scope as last year, though we have also included sites that were selected for an emergency crawl on the refugee crisis.


  • Michaela has changed position within the library and has been head of the Digital Library since July 1st. Her post at the web archive will not be filled. At the moment ONB is not sure what its NAS contribution will look like in the future. We will work on a new concept and allocation of tasks. Michaela (and of course Andreas) will still be the contacts for web archiving.
  • Please complete the Doodle poll for the Vienna meeting by the end of July.
  • The crawl on the presidential election is still ongoing; the election will be rerun in October.


  • The first .es domain crawl, run with NAS at the Library, finished on July 6th. It started on April 4th, so it took 3 months. From a list of 1,800,000 registered domains, only around 800,000 are active. The result is around 20 TB and 460,000,000 objects. We set a limit of 100 MB per seed, and around 87% of the domains have been crawled entirely.
  • The General Elections took place for a second time in June, as the Parliament resulting from the December 2015 elections did not manage to designate a Prime Minister. As a result, our General Elections event crawl, launched at the beginning of December 2015, has not finished yet. So far, we have collected around 10.5 TB. The regional web curators who collaborate in the project have been nominating seeds for this event crawl.
  • The regional web curators are testing BCWeb in a preproduction environment and are starting to manage their own web collections using this application. For now, the production environment of BCWeb is managed only by the National Library team. We hope the curators will have enough training and knowledge to start using the production environment by next autumn. In the meantime, the National Library web archiving team is launching some regional web collections of limited scope.
  • A couple of weeks ago we welcomed two fellows to the team. They will be working with us for one year. Miriam is an information and documentation specialist and Elena is an engineer.