Status of the production sites



Our main focus is the elections for the European Parliament and the Danish parliamentary elections, which take place today. As the national parliamentary elections were announced during the campaign for the European parliamentary elections, we decided to merge the two events into one event collection:

  • We do not collect Danish news media; the regular selective crawls cover this part.
  • We crawl the Twitter and Instagram accounts of political parties and candidates with Heritrix, and Facebook accounts with our Archive-It account.
  • We crawl YouTube videos and podcasts on the elections.

We will finish the crawls when the new government is in place.

We finished our second broad crawl for 2019 last Monday.



At the beginning of May we had a meeting to prepare the program for the 2019 broad crawl. We have contacted the registrars that answered positively in 2018. This year we will try to clean up the seed list even further, because we noticed that many registered domains are in fact parked websites or mere domain name reservations. We are also going to develop all the subjects raised during the NAS workshop. The launch of the broad crawl is scheduled for October.

On April 10th, the daily newspaper "Le Monde" announced that it would close its blog platform "les blogs abonnés du" at the very beginning of June. We contacted their product owner and harvested 6,250 blogs. To avoid any performance issues on the platform caused by Heritrix crawling, we chose to use a maximum of 3 threads in a single job.