Last week, we held a meeting to prepare the program for the 2022 broad crawl.
The Official Publications harvest was launched last week and will run until the end of June. This harvest covers the websites of ministries, public establishments, independent administrative authorities and local authorities. Nearly 900 websites have been selected.
Finally, our next Videos harvest is in preparation. We are encountering some difficulties because we have changed the metadata extraction tool. The new tool extracts far more metadata than the previous one, and therefore identifies many more videos to download, which raises budget issues.
The 2022 broad crawl of the .es domain ended on May 19th. It took 21 days (compared to 25 days last year), with a limit of 150 MB per domain and 71 crawlers. This year the harvest was carried out over the BNE internet line, which reduced the number of days needed. In terms of results, we crawled 69 TB. In terms of documents harvested, we saved 3.54% fewer than last year. This may be because we terminated earlier the jobs that were stuck due to poor site configuration. Taking both factors together (fewer but larger items), we assume that the collection is of higher quality.
The broad crawl of journals was completed in April. More than 12,000 websites containing electronic serials were collected, amounting to around 3.4 terabytes.