Details

Type: New Feature
Resolution: Fixed
Priority: Critical
Fix Version/s: I46, 3.15.0
Affects Version/s: None
Component/s: Harvester Controller Server
Labels: None

Description

We have moved forward with the Netarchive harvesting of e-books from Museum Tusculanum Press (MTP, at Copenhagen University). The latest test harvest captured 389 MB in 104 objects:
http://kb-test-adm-001.kb.dk:8080/History/Harveststatus-jobdetails.jsp?jobID=734
Bjarne's comment:
"The next challenge for OAI harvesting is getting Heritrix to page through the OAI result. The way MTP have set up their OAI interface, we only get 100 books at a time, and at the bottom of the XML file there is a ResumptionToken, a kind of code that you need in order to generate the next link. This requires a special Heritrix setup. I believe we need to write a small script (BeanShellScript) for link extraction / link generation. A developer must write the actual code. I do not think anyone at Netarkivet has written a link-extraction script in BeanShellScript before. A mail to the Heritrix mailing list could tell us whether anyone has already created an OAI-target script for Heritrix; it would be nice if one already existed."
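
The link generation described in the comment can be sketched roughly as follows. This is only an illustration in Python, not the BeanShell script the issue asks for and not what was eventually implemented. It assumes the MTP endpoint follows standard OAI-PMH 2.0 flow control, where the next page is requested with only the verb and the resumptionToken. The function name next_page_url and the base URL http://example.org/oai are hypothetical.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

# Namespace defined by the OAI-PMH 2.0 specification.
OAI_NS = "{http://www.openarchives.org/OAI/2.0/}"


def next_page_url(base_url, response_xml, verb="ListRecords"):
    """Return the URL of the next result page, or None if there is none.

    base_url is the repository's OAI endpoint (hypothetical here);
    response_xml is the text of one OAI-PMH response page.
    """
    root = ET.fromstring(response_xml)
    token = root.find(f".//{OAI_NS}resumptionToken")
    # A missing or empty resumptionToken marks the last page of the list.
    if token is None or not (token.text or "").strip():
        return None
    # Per OAI-PMH, a resumption request carries only verb + resumptionToken.
    query = urlencode({"verb": verb, "resumptionToken": token.text.strip()})
    return f"{base_url}?{query}"
```

A crawler would call such a function on every fetched page and queue the returned URL until it gets None back. Inside Heritrix the same logic would have to live in a link-extraction step, which is the part the comment says still needs to be written.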

Trackbacks

2011-08-09 Netarkiv meeting DK meeting Time: 9 Aug 11:00-12:00 Brief information (Mikis) Workshop in December at BnF https://sbforge.org/display/NAS/2011DecemberworkshopatBnF. Application round for a full-time WARC developer in progress....

People

Assignee: Colin Rosenthal

Reporter: Claus Lomborg (Inactive)

Watchers: 0

Dates

Created: 06/Jan/11 3:30 PM

Updated: 16/Mar/11 6:18 PM

Resolved: 16/Mar/11 6:18 PM