After the dreadful attacks that occurred in Paris on 7th and 9th January and the events that followed, we decided to launch an emergency crawl to harvest web resources (news articles, blog posts, social media reactions, institutional websites…) related or reacting to them. We appealed to IIPC members and to our BnF network of librarians, asking them to help us quickly gather references so as to make the seed list as complete and relevant as possible. Due to the exceptional nature of the event, the scope and criteria of the selection were extended to an international scale and aimed to cover the different forms and the diversity of the reactions. We received 2,480 URLs from eighteen different IIPC members and 1,740 URLs nominated by more than 70 BnF librarians. In addition to these selections, the already identified seed lists of French governmental, news, political, and activist websites were specially harvested. Finally, our regular daily and weekly harvests of the principal French news sites, particularly relevant during those days, ran as usual.

Technically, the crawls were performed from 8th to 16th January 2015, and each website was crawled at least once at a depth of page + 1 click. During the same period, selected Twitter accounts and popular hashtags (such as the now famous #JesuisCharlie) were crawled four times a day. A total of 15.9 million URLs were collected, amounting to 0.5 TB of data.




Broad Crawl:

  • We are currently preparing for our bi-annual broad crawl, covering 1.25 million .at domains and the new TLD .wien.
  • Before the broad crawl we will change from NAS 4.01 to 4.4.
  • We finished the database migration from MySQL to PostgreSQL.
  • We made a JDK change from 1.6_22 to 1.7_65 (this did not work; we need to switch back to a lower 1.7 version).
  • The JDK change caused problems with deduplication.
  • Our IT department is developing a new storage concept. Currently our storage is outsourced to the Federal Computing Centre; it would be less expensive to have the storage in-house. We are trying to negotiate a higher storage budget for the web archive.
  • We still experience technical difficulties with our cluster.

Selective Crawls:

  • Four regional elections take place in 2015. We will add the seeds to our politics collection.
  • The Eurovision Song Contest in May will be a rather small project for us. We could not get additional resources, so unfortunately it will not be comparable to the great effort made last year in Denmark.

Next meeting

7th April

Any other business?