In April, we organized an internal workshop on responsive websites. As a start, we selected a sample of websites. We first tried viewing the archived versions of these sites with more recent versions of Firefox and Chromium: half of the problems disappeared, which led us to conclude that many problems are in fact access issues rather than crawling issues. Because the crawls had been made with a Firefox user-agent, the archives rendered better in Firefox than in Chromium.
In a second step, we analysed the source code of the websites that had crawling issues. The conclusion of this analysis was that each site has its own peculiarities. To solve the crawling problems, we tried:
- to use various user-agents (e.g. a specific version of the Firefox user-agent, or Chrome). This did not significantly improve the quality of the crawl, and in any case the user-agent chosen for the crawl must be consistent with the browser used for access.
- to crawl the websites with the latest release of Umbra included in NAS. During the tests, Umbra crashed, as it had during our first tests in December. It is very efficient for social networks such as Instagram or Pinterest, especially for crawling images. But because the application is unstable, it cannot be put into production. We will probably test it again while preparing our broad crawl tests.
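
The user-agent test described above can be reproduced on a small scale outside the crawler. The following is a minimal sketch, not the procedure used in the workshop: it fetches the same URL under several user-agents and checks whether the server returns different content. The user-agent strings and the helper names (`build_request`, `fetch`, `summarize`, `same_content`) are assumptions chosen for illustration.

```python
import hashlib
import urllib.request

# Illustrative user-agent strings (assumptions, not the ones used in the workshop).
USER_AGENTS = {
    "firefox": "Mozilla/5.0 (X11; Linux x86_64; rv:60.0) Gecko/20100101 Firefox/60.0",
    "chrome": (
        "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 "
        "(KHTML, like Gecko) Chrome/66.0.3359.139 Safari/537.36"
    ),
}


def build_request(url, user_agent):
    """Return a GET request that announces the given user-agent."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


def fetch(url, user_agent, timeout=30):
    """Download the raw response body for one user-agent."""
    with urllib.request.urlopen(build_request(url, user_agent), timeout=timeout) as response:
        return response.read()


def summarize(bodies):
    """Map each label to (size in bytes, SHA-1 digest) of its body."""
    return {
        label: (len(body), hashlib.sha1(body).hexdigest())
        for label, body in bodies.items()
    }


def same_content(bodies):
    """True when every variant returned byte-identical content."""
    digests = {hashlib.sha1(body).hexdigest() for body in bodies.values()}
    return len(digests) <= 1
```

To use it, one would call `fetch(url, ua)` for each entry of `USER_AGENTS`, collect the results in a dictionary, and pass it to `summarize` and `same_content`. Identical bodies suggest that the server does not vary its response by user-agent, so changing it in the crawler is unlikely to help. Note that pages with dynamic content (timestamps, session tokens) may differ between requests for unrelated reasons.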