Feedback and important information from GA
NAS workshop (Sara)
1) Share experience with NAS 5 and Heritrix 3
2) Discuss challenges with specific types of sites (news, social media)
3) Discuss collection strategies
4) Discuss features/a GUI to handle the harvester
5) Look into the possibility of integrating another crawler into NAS (Colin proposed to bring a prototype using a headless browser)
End of January 2017 - 2.5 days - in Vienna
Poll from Michaela http://doodle.com/poll/nk6dfc3kav4a4hs8
Status of the production sites
- We have moved our production site to NAS 5.1 H3
- We will start the second broad crawl of 2016 as soon as NAS 5.1 and Heritrix 3 are running “smoothly”
- The event crawl on the refugee crisis is still ongoing: as it is a supplement to our selective news media and social media crawls, it is a very small event crawl.
- We will participate in the IIPC collections “Olympics 2016” and “Online News around the World: A Snapshot in Time”
- We are preparing for a new event crawl on the European Capital of Culture project “Aarhus 2017”: we are looking at different scenarios for this event crawl
- We are still unable to harvest anything from Facebook.
- We are revising our collection strategy: there will be fewer broad crawls and more selective crawls. At the moment we are looking at the selective news media crawls. Given our resources, we need a more streamlined approach for an extended number of domains to be crawled.
- The social platform arto.com will be closed down on June 1st. We were offered a private crawl of the entire site (no WARC files, but likely WARC compatible). We decided to decline the offer and to do a last crawl of the entire site on our own.
- We are working on a business model (legal and financial issues) for providing corpora from Netarchive to research institutions. Our first customer will be the University of Southern Denmark.
- We are still running our ongoing selective crawls (the biggest one is an annual crawl focused on big hosts and domains, and on social movements).
- We installed Java 1.7.79 on some harvesters within a specific channel to solve HTTPS problems for specific crawls (news and official publications).
- We are still working on our Corpus project.
- The first .es domain crawl has been running since April 4th. Our engineers estimate it will last until the end of July or mid-August.
- We are trying to connect BCWeb to the NAS development environment to give access to the web curators from our regional libraries.
- As our General Elections are going to be repeated on June 26th, we have not yet closed the General Elections event crawl that started in December 2015. Web curators from the regional libraries are nominating seeds for this collection.
- At the moment, the web archiving team is even smaller than it was. There used to be two librarians (Sole and me), but I am now on my own, because Sole moved to another position at the Library. We are trying to recruit more people for the team, but so far the situation is even worse than it used to be.