...

  • Step 2 of the first broad crawl will start
    • Focus on cleaning up regular expressions
    • Limits on domains
    • New harvester server up and running soon
  • Outsourcing of harvesting: harvesting will be kept in-house.
  • SolrWayback "live" QA is up and running.
  • The RSS-Heritrix module has been tested but still needs some attention.
  • The IIPC browser-based crawling project is proceeding. We have an update meeting tonight and received input during the IIPC GA sessions.
  • We are delivering a lot of data to researchers at the moment.
  • Working on an updated JWAT for validating WARC files (in communication with Nicholas, estimating the budget, and now finding a way to do this internally at KB).

BnF


Last week we held a meeting to prepare the program of the 2022 broad crawl.
As part of this, an overhaul of nas-preload and further developments concerning NAS are planned.
We have contacted the registrars and have already received most of the lists. We have also obtained two new TLDs from overseas departments and territories. The launch of the broad crawl is scheduled for October.

The Official Publications harvest was launched last week and will run until the end of June. This harvest covers the websites of ministries, public establishments, independent administrative authorities and local authorities. Nearly 900 websites have been selected.

Finally, our next Videos harvest is in preparation. We are encountering some difficulties because we have changed the metadata extraction tool. The new tool extracts far more metadata, and therefore identifies far more videos to download, than the previous one, which raises budget issues.

ONB

BNE

The 2022 broad crawl of the .es domain ended on May 19th. It took 21 days (compared with 25 days last year), with a limit of 150 MB per domain and 71 crawlers. This year the harvest was carried out over the BNE internet line, which reduced the number of days needed. In terms of results, we crawled 69 TB. In terms of documents harvested, we saved 3.54% fewer. This may be because we eliminated jobs that were stuck due to poor site configuration earlier than before. Combining both factors (fewer but larger items), we assume that the collection is of higher quality.

The broad crawl of journals was completed in April. More than 12,000 websites with electronic serials were collected, amounting to around 3.4 TB.

...