
Regional Daily Newspapers Example: Ouest-France

Ouest-France is the biggest regional daily newspaper title, with 47 editions. In the past, we tested the deposit of PDF files, without success.
There is no free online access to the PDF versions of the newspapers. A subscription is required, and the publications can then be accessed only with a password. We added the login and password to the Heritrix profile, but:

  • The login/password is valid for 3 months only
  • The crawler often gets disconnected
  • A large part of the site is implemented in JavaScript
  • Heritrix extracts many false URLs from the JavaScript
  • Any false URL causes a disconnect and redirects to the login page
  • Heritrix submits the password only once per job (the login page is then marked as “already seen” and is not collected again), so it cannot log in again after a disconnect
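
The interaction between the last three points can be illustrated with a toy model. The sketch below is not Heritrix code and does not use any Heritrix API. It is a minimal simulation written for this page, assuming a site that drops the session on any unknown URL and a crawler that never fetches the same URL twice. The names `ToySite`, `crawl` and all paths are hypothetical.

```python
LOGIN_URL = "/login"  # hypothetical path, for illustration only


class ToySite:
    """Hypothetical site: PDFs need a session, an unknown URL drops it."""

    def __init__(self, valid_paths):
        self.valid_paths = set(valid_paths)
        self.logged_in = False

    def fetch(self, path):
        if path == LOGIN_URL:
            self.logged_in = True  # credentials submitted
            return "login-ok"
        if path not in self.valid_paths:
            self.logged_in = False  # false URL: session is dropped
            return "redirect-to-login"
        return "content" if self.logged_in else "redirect-to-login"


def crawl(site, queue):
    """Fetch each URL at most once, like an "already seen" filter."""
    already_seen = set()
    results = []
    for path in queue:
        if path in already_seen:
            results.append((path, "skipped"))  # never collected again
            continue
        already_seen.add(path)
        results.append((path, site.fetch(path)))
    return results


site = ToySite(["/pdf/edition-1", "/pdf/edition-2"])
queue = [
    LOGIN_URL,         # first and only login of the job
    "/pdf/edition-1",  # collected while the session is valid
    "/js/false-url",   # false URL extracted from JavaScript
    LOGIN_URL,         # second login attempt is filtered out
    "/pdf/edition-2",  # lost: no session any more
]
results = crawl(site, queue)
for path, outcome in results:
    print(path, outcome)
```

In this model, every PDF queued after the first false URL is answered with the login page, because the second login attempt is skipped as already seen.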

So far, we crawl the articles but not the full PDF versions.
