Status of the production sites
Since 2011, the BnF has published an annual summary with statistics on national publishing output, based on the collections acquired through legal deposit, including the web archives. In addition to the general analysis, each edition has a focus; this year it was politics. We compared the web collections relating to three elections: 2007, 2012 and 2017. The three selections were based on the same categories and covered many regions. The analysis confirms our hypothesis: since 2007, video has become ubiquitous and the use of social networks, from blogs to Twitter and Facebook, has exploded.
We also measured the percentage of websites still online. After 10 years, 26% of election sites are still online, 19% redirect to other websites and 55% have disappeared. After 5 years, 44% are still online, 22% redirect and 33% have disappeared. After 1 year, 81% are still online, 10% redirect and 6% have disappeared. The lifespan of a website varies widely, but the results show that the other collections (outside the electoral collections) are complementary.
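A minimal sketch of how such a survival measurement could be tabulated, assuming each archived site has already been re-checked and recorded as an HTTP status code plus the host finally reached. The function names `classify` and `shares` and the tuple layout are hypothetical illustrations, not the BnF's actual method.

```python
from collections import Counter

CATEGORIES = ("online", "redirected", "disappeared")

def classify(status_code, final_host, original_host):
    """Assign one re-checked site to a category (assumed rules, not the BnF's)."""
    # No response at all, or an HTTP error, counts as disappeared.
    if status_code is None or status_code >= 400:
        return "disappeared"
    # A successful response served from another host counts as a redirect.
    if final_host.lower() != original_host.lower():
        return "redirected"
    return "online"

def shares(observations):
    """Return the rounded percentage of sites in each category."""
    counts = Counter(classify(*obs) for obs in observations)
    total = sum(counts.values())
    if total == 0:
        return {name: 0 for name in CATEGORIES}
    return {name: round(100 * counts[name] / total) for name in CATEGORIES}

# Hypothetical observations: (status code, final host, original host)
sample = [
    (200, "a.example", "a.example"),
    (200, "portal.example", "b.example"),
    (404, "c.example", "c.example"),
    (None, None, "d.example"),
]
print(shares(sample))
```

Running the same tabulation separately on the 2007, 2012 and 2017 selections would give one row of percentages per election year.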
In May, a placement student examined the scope of the broad and focused crawls against the legal definition of the French domain. With 4.5 million sites, the BnF covers less than 60% of the French web. To be more representative, the BnF must extend its contacts to other registrars, especially those offering generic TLDs. We hope to contact the company Gandi to obtain its list of sites and improve coverage. For the focused crawl, the study suggests that selection should reflect how the web is used and not only the traditional collections, and that new approaches to selection could help, such as closer cooperation with librarians and researchers. In addition, more legal clarification is needed on the harvesting of social networks: for the moment, the BnF crawls only the accounts of public figures or organisations and some related hashtags. Finally, the study proposes documenting the dynamics of sites through filmed tutorials, especially where the crawler runs into technical difficulties.