Agenda for the joint BNF, ONB, SB and KB NetarchiveSuite tele-conference May the 14th 2013, 13:00-14:00.
- TDC tele-conference:
- Dial in number (+45) 70 26 50 45
- Dial in code 9064479#
- BridgeIT: The BridgeIT conference will be available about 5 minutes before the start of the meeting. The Bridgit URL is konf01.statsbiblioteket.dk. The Bridgit password is sbview.
- BNF: Sara
- ONB: Andreas
- KB: Tue, Søren and Nicholas
- SB: Colin, Mikis and Sabine
- Any other issues to be discussed on today's tele-conference?
Due to the low availability of development resources, I propose that we skip the 4.3 release and aim for a minor 4.4 release at the end of the year. We would like to minimize the amount of testing required by including only localized bug fixes, thus avoiding the need for major regression testing.
Decision: skip the 4.3 release and stick to a minor 4.4 release at the end of the year. The only new feature will be BnF's NAS-2212.
Wayback meetings at BnF
Recap from Sara, Nicolas and Colin.
Next NetarchiveSuite workshop
We normally gather for a meeting in the autumn to share our views on where NetarchiveSuite should go. Perhaps we should consider doing this in the spring next year?
Status of the production sites
- We started our third broad crawl for 2013 in the beginning of September
- We upgraded our test environment to NAS 4.2. It works fine. When the broad crawl is finished, we plan to upgrade the production system from NAS 4.0 to 4.2?
- We are working on improving our documentation, not only to facilitate the curators' work, but also at the request of researchers. We are testing how much of our documentation could be incorporated into NAS, among other things by creating extended fields on both the domain level and the harvest definition level.
- Our greatest barrier to giving access to our archive is the Danish personal data protection law. In a pilot project we extracted a corpus from our archive and screened it for personal data (especially for civil registration numbers). We used both automatic and manual screening.
- We intensified our work on capturing content behind paywalls on news sites.
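As an illustration only, the automatic part of the screening for civil registration numbers mentioned above could be sketched as follows. This is not the tool used in the pilot project; the pattern and the function name `find_cpr_candidates` are assumptions made for this example. It relies only on the general format of Danish civil registration numbers (six digits for the date of birth as DDMMYY, an optional hyphen, then four digits) and flags candidates for later manual review.

```python
import re
from datetime import date

# Assumed pattern: DDMMYY, an optional hyphen, then a four-digit serial number.
# The lookarounds prevent matching inside longer runs of digits.
CPR_PATTERN = re.compile(r"(?<!\d)(\d{2})(\d{2})(\d{2})-?(\d{4})(?!\d)")


def find_cpr_candidates(text):
    """Return substrings that look like civil registration numbers.

    Hypothetical helper: it only checks that the first six digits form a
    plausible calendar date, so every hit still needs manual screening.
    """
    candidates = []
    for match in CPR_PATTERN.finditer(text):
        day, month, year = (int(match.group(i)) for i in (1, 2, 3))
        try:
            # The century is not resolved here; using 2000 + year is a
            # deliberately permissive choice for the date check.
            date(2000 + year, month, day)
        except ValueError:
            continue  # Not a valid date, so not a candidate.
        candidates.append(match.group(0))
    return candidates
```

A filter like this would be expected to over-report (for example, on phone numbers that happen to start with a valid date), which is one reason to combine it with manual screening.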
Last summer, BnF tried a new type of harvest for blog platforms. We were satisfied with the result, except that we ended up with only a small sample of blogs: the volume of images for free.fr was very large and we had to stop the harvest after 15 days. So in 2013, we decided not to collect free.fr and to reduce the budget to 800 URLs per host. We had a list of 225,000 seeds, which we harvested over a period of 50 days. The problem this year is that, with a depth of "host", Heritrix generated an exponentially growing list of inactive queues: it seemed we would never finish the crawl! And so we have to think of yet another choice of parameters…
- 2nd stage of domain crawl 2013 is almost finished (just a few jobs to finish)
- We have parliamentary elections on September 29th. At the beginning of 2013 we started an ongoing politics collection, which also covers this event.
October 21st, 13:00-14:00??
Any other business?