We have almost finished the first full broad crawl of 2019 (only 3 slow jobs remain) without any software issues. We have never harvested as much as this time: more than 60 TB in 2½ months, not counting the -5003 return codes.
Our harvesters seem to discover far more content than the quota allows before they stop, so we still have many -5003 entries in the crawl logs.
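To get a feel for how large a share of a job's log entries are -5003, the crawl log can be tallied by return code. The following is only a minimal sketch: it assumes the logs follow the Heritrix crawl.log layout (whitespace-separated fields, with the fetch status code as the second field), and the function names are made up for illustration.

```python
from collections import Counter

# Return code for URIs that were blocked because a quota was exceeded.
QUOTA_BLOCKED = "-5003"


def count_status_codes(lines):
    """Tally log lines by return code (assumed to be the second field)."""
    counts = Counter()
    for line in lines:
        fields = line.split()
        if len(fields) < 2:
            continue  # skip blank or truncated lines
        counts[fields[1]] += 1
    return counts


def quota_blocked_share(lines):
    """Return the fraction of log lines carrying the -5003 return code."""
    counts = count_status_codes(lines)
    total = sum(counts.values())
    return counts[QUOTA_BLOCKED] / total if total else 0.0
```

For a real log, the lines can be streamed from the file, for example `with open("crawl.log") as f: print(quota_blocked_share(f))`, where the file name is a placeholder.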
We still focus on IP-validated access to content behind paywalls. We ran into problems with iframes, but they seem to be solved with Umbra. Another issue is how to keep site owners' contact information in compliance with GDPR.
We are preparing for the European Parliament elections and for the Danish parliamentary elections. The latter must take place in June 2019 at the latest.
Together with a colleague, a researcher, we ran a mini event crawl on April Fools' Day, using BCWeb for nominating URLs. Since you never know where April Fools' jokes will pop up, the researcher wanted us to crawl with 4 hops. As a result, the crawls are still ongoing. Part of the evaluation will be to consider crawling with fewer hops next time.
After the war in the 1860s, Denmark lost a part of Southern Jutland to Prussia/Germany. After WW1, Southern Jutland became Danish again. Next year we will celebrate the centennial of this reunification, and preparations are already appearing on the internet. We are therefore preparing an event crawl and have already collected about 40 URLs.
Most urgent technical issue
Our Citrix Wayback access platform is performing very badly: among other things, it may take over 5 minutes to load a page, and many images are not displayed.