Analysis of LiWA Rich Media Capture in the NetarchiveSuite project
The LiWA Rich Media Capture component consists of a plugin for Heritrix that detects streaming media entries on web pages. When a stream is detected, a message is generated and put on a queue for processing. The idea is that the stream can then be downloaded by recorders able to handle the different stream types. The result is a downloadable movie file, which can be archived (in a WARC file, for example). When the archived webpage containing the stream is requested through Wayback, it is rewritten on the fly, replacing the original stream URL with a reference to the harvested downloadable file. More details can be found on the LiWA Rich Media Capture home page.
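The flow described above (detect, queue, record, rewrite) can be illustrated with a minimal Python sketch. It is not the LiWA or Heritrix implementation: the names StreamJob, detect_streams and rewrite_page, the regular expression, the example URLs and the /archive/ path are all assumptions made for illustration, and the recording step is only simulated.

```python
import queue
import re
from dataclasses import dataclass

# Assumption: streams are recognised by URL scheme alone. Real detection is
# site-specific and considerably harder than this pattern suggests.
STREAM_URL = re.compile(r'(?:rtmp|rtsp|mms)://[^\s"\'<>]+')


@dataclass
class StreamJob:
    """Hypothetical queue message: where the stream was found and its URL."""
    page_url: str
    stream_url: str


def detect_streams(page_url, html, jobs):
    """Put one job on the queue for every stream URL found in the page."""
    for match in STREAM_URL.finditer(html):
        jobs.put(StreamJob(page_url, match.group(0)))


def rewrite_page(html, archived):
    """Replace each harvested stream URL with its archived file reference.

    Stream URLs that were never recorded are left unchanged.
    """
    return STREAM_URL.sub(lambda m: archived.get(m.group(0), m.group(0)), html)


html = '<embed src="rtmp://media.example.org/live/clip1">'
jobs = queue.Queue()
detect_streams("http://example.org/news", html, jobs)

# A real recorder would download the stream here. This sketch only records
# the mapping from stream URL to a made-up archived file path.
archived = {}
while not jobs.empty():
    job = jobs.get()
    archived[job.stream_url] = "/archive/" + job.stream_url.rsplit("/", 1)[-1] + ".flv"

rewritten = rewrite_page(html, archived)
```

Keeping detection, recording and rewriting as separate steps connected by a queue mirrors the description above, in which the Heritrix plugin only produces the messages and the remaining steps are handled elsewhere.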
This sounds like a promising concept, but it currently has a number of shortcomings:
- Detecting stream URLs on the original web pages is difficult to do consistently because stream providers make heavy use of redirects, caching and proxying. This means that stream URL harvesting must in most cases be customized for the individual web sites. This was a major point of concern for the participants at IWAW2010.
- The LiWA Rich Media Capture component only contains the Heritrix plugin for extracting the stream URLs. The recording of the streams, the archiving, and the rewriting of webpages when the archive is accessed all have to be implemented separately.
- Additional hardware needs to be added to handle the actual recording of the streams.
- The archived webpages will not expose the original stream, but will instead use more classical download technology. This means some of the features of streaming media are lost, most notably the ability to start viewing a video file before the entire file is retrieved.
- The current process for streaming media harvesting and viewing is quite fragile and often breaks, as we saw in the demos at IWAW2010.
Current state of the NetarchiveSuite project
The NetarchiveSuite project currently faces a number of fundamental problems:
- The available technical resources are very limited.
- The development process is not very efficient, so progress is slow.
- The currently deployed systems are rather complex.
- The wish-list for NetarchiveSuite functionality is already very large.
Introducing streaming media harvesting based on the LiWA component would add significantly to the NetarchiveSuite problems listed above, in return for partial and fragile streaming media harvesting functionality.
Postponing the functionality for capturing streaming media should therefore be seriously considered, until the concept is more mature and the NetarchiveSuite project is better equipped to handle the extra weight of including this feature.