We finished our third broad crawl of 2019 (with limits of 50 MB in step 1 and 16 GB in step 2) on 10 September. In 602 jobs we harvested a total of about 93 TB, or 187 million objects. Many sites are blocking us; we plan to address this by giving our new broad crawl harvesters new IP addresses and updating our throttling firewall rules. Simultaneously we ran the selective crawls connected to the broad crawls: Research databases, Municipalities and regions, Ministries and Government agencies, and YouTube.
We are now doing the “cleaning up” and improvements to prepare for the next broad crawl.
Getting IP-validated access to content behind paywalls is still a big issue; the difficulty is getting in touch with the right person at the website owners. We are also trying to solve another issue: we are not able to capture comments on news articles.
We started our annual broad crawl on September 23.
There are almost 2 million websites, which we divide into sets of 500 domains per job with a limit of 150 MB per domain.
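The partitioning described above can be sketched roughly as follows. This is an illustrative Python snippet, not the archive's actual tooling; the function name, job structure, and domain list are hypothetical, and only the batch size (500 domains per job) and per-domain budget (150 MB) come from the text.

```python
# Illustrative sketch (not the archive's real tooling): split a domain
# list into crawl jobs of 500 domains, each domain capped at 150 MB.

DOMAINS_PER_JOB = 500
BYTES_PER_DOMAIN = 150 * 1024 * 1024  # 150 MB cap per domain

def make_jobs(domains):
    """Group domains into fixed-size crawl jobs with a per-domain byte budget."""
    jobs = []
    for i in range(0, len(domains), DOMAINS_PER_JOB):
        batch = domains[i:i + DOMAINS_PER_JOB]
        jobs.append({"domains": batch, "budget_bytes": BYTES_PER_DOMAIN})
    return jobs

# About 2 million sites would yield roughly 4,000 jobs of 500 domains each.
jobs = make_jobs([f"site{n}.example" for n in range(2_000_000)])
print(len(jobs))  # 4000
```

At this scale the scheme keeps each job small enough to schedule and retry independently while the per-domain cap bounds the worst-case size of any single job.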
We use two dedicated networks (FTTH, Fibre to the Home) for the broad crawl, in order to leave the regular network free for our selective collections.
We have already collected 38% of the websites (26 TB of data) without significant problems.