[NAS-2496] Redirect for jp.dk fails in test wayback Created: 05/Feb/16  Updated: 21/Feb/17  Resolved: 21/Feb/17

Status: Resolved
Project: NetarchiveSuite
Component/s: Wayback
Affects Version/s: 5.0, 5.2
Fix Version/s: 5.3

Type: Bug Priority: Minor
Reporter: Colin Rosenthal Assignee: Unassigned
Resolution: Fixed  
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Implement
is implemented by NAS-2585 NetarchiveResourceStore doesn't handl... Resolved
Verification:

Believed solved as part of NAS-2585. See the description for the verification.


 Description   

In TEST12, using the standard set of warcfiles, I can find two "hits" for jp.dk, but clicking on either of them produces an error like this in the wayback tomcat log:

11:06:47.090 [Thread 41 proxy handling: ] INFO  d.n.wayback.NetarchiveResourceStore - Received request for resource from file '5-1-20130117172315-00000-kb-test-har-002.kb.dk.warc' at offset '2804'
11:06:47.090 [Thread 41 proxy handling: ] DEBUG d.n.a.a.d.JMSArcRepositoryClient - Requesting get of record '5-1-20130117172315-00000-kb-test-har-002.kb.dk.warc:2804'
11:06:47.113 [Thread 41 proxy handling: ] DEBUG d.n.common.distribute.Synchronizer - Received reply for message: ID:139-130.226.228.10(bf:2a:1f:8a:82:85)-43110-1454666807098: To TEST12_COMMON_THE_REPOS ReplyTo TEST12_COMMON_THIS_REPOS_CLIENT_130_226_228_10_WIA_WAYBACKWEBAPPTEST12 OK Arcfile: 5-1-20130117172315-00000-kb-test-har-002.kb.dk.warc Offset: 2804
11:06:47.113 [Thread 41 proxy handling: ] DEBUG d.n.a.a.d.JMSArcRepositoryClient - Reply received after 0 seconds
11:06:47.114 [Thread 41 proxy handling: ] INFO  d.n.wayback.NetarchiveResourceStore - Retrieved resource from file '5-1-20130117172315-00000-kb-test-har-002.kb.dk.warc' at offset '2804'
11:06:47.114 [Thread 41 proxy handling: ] DEBUG d.n.c.d.a.BitarchiveRecord - Reading 303 bytes from objectBuffer
11:06:47.114 [Thread 41 proxy handling: ] DEBUG d.n.wayback.NetarchiveResourceStore - Setting response code '301'
11:06:47.114 [Thread 41 proxy handling: ] INFO  d.n.wayback.NetarchiveResourceStore - Setting redirect Location header to 'http://jyllands-posten.dk/'
11:06:47.114 [Thread 41 proxy handling: ] DEBUG d.n.wayback.NetarchiveResourceStore - ARCRecord created with code '-1'
11:06:47.114 [Thread 41 proxy handling: ] INFO  d.n.wayback.NetarchiveResourceStore - Returning resource 'dk.netarkivet.wayback.NetarchiveResourceStore$1@24e1f49f'
WARNING Premature EOF before end-of-record: {statuscode=301, subject-uri=jp.dk/, ip-address=www.jp.dk, absolute-offset=2804, length=303, creation-date=Thu Jan 17 18:23:16 CET 2013, content-type=application/http, version=301, Location=http://jyllands-posten.dk/}

Searching directly for jyllands-posten.dk works fine.

The exact same behaviour is seen in both ia wayback and OpenWayback.



 Comments   
Comment by Colin Rosenthal [ 09/Feb/16 ]

I'm pretty sure this is not a new bug but a consequence of the fact that NetarchiveResourceStore has never been properly rewritten to support warcfiles. It only seems to affect the wayback webapp. Since we have no immediate plans to upgrade the webapp in production, this does not block the 5.1 release.

Comment by Colin Rosenthal [ 09/Feb/16 ]

So far it seems that all the problem records are redirect records. For example, the first one above appears to be this record:

WARC/1.0
WARC-Type: response
WARC-Target-URI: http://www.jp.dk/
WARC-Date: 2013-01-17T17:23:16Z
WARC-Payload-Digest: sha1:3I42H3S6NNFQ2MSVX7XZKYAYSCX5QBYJ
WARC-IP-Address: 91.214.20.52
WARC-Record-ID: <urn:uuid:bd02f38c-c95d-45b6-9a1c-7199357762d9>
Content-Type: application/http; msgtype=response
Content-Length: 303

HTTP/1.1 301 http://jyllands-posten.dk/
Server: nginx/0.8.55
Date: Thu, 17 Jan 2013 17:23:16 GMT
Connection: close
Location: http://jyllands-posten.dk/
Accept-Ranges: bytes
X-Varnish: 2788715357
Age: 0
Via: 1.1 varnish
X-Cache: MISS - bromine.jp-prod.lp.jppol.dk
X-Src-Nginx: bromine-nginx



WARC/1.0
...
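
To make the shape of this record easier to see, here is a minimal Java sketch that parses the HTTP block shown above. It is not NetarchiveSuite code: the class RedirectRecordSketch and its methods are hypothetical names used only for illustration. Assuming CRLF line endings, the status line and headers alone account for all 303 bytes of the Content-Length, so the response carries no body at all. Whether that empty body is what triggers the premature EOF is an assumption and has not been confirmed here.

```java
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

/** Hypothetical helper, not part of NetarchiveSuite or wayback. */
public class RedirectRecordSketch {

    /** Parsed view of an HTTP response stored as a WARC record payload. */
    public static final class Parsed {
        public final int status;
        public final Map<String, String> headers; // keys are lower-cased
        public final int bodyLength;              // bytes after the blank line

        Parsed(int status, Map<String, String> headers, int bodyLength) {
            this.status = status;
            this.headers = headers;
            this.bodyLength = bodyLength;
        }

        public boolean isRedirect() {
            return status >= 300 && status < 400 && headers.containsKey("location");
        }
    }

    public static Parsed parse(byte[] payload) {
        // ISO-8859-1 maps one byte to one char, so string offsets equal byte offsets.
        String text = new String(payload, StandardCharsets.ISO_8859_1);
        int headerEnd = text.indexOf("\r\n\r\n");
        if (headerEnd < 0) {
            throw new IllegalArgumentException("no end of HTTP headers found");
        }
        String[] lines = text.substring(0, headerEnd).split("\r\n");
        // Status line: HTTP-version SP status-code SP reason-phrase
        String[] statusParts = lines[0].split(" ", 3);
        if (statusParts.length < 2) {
            throw new IllegalArgumentException("malformed status line: " + lines[0]);
        }
        int status = Integer.parseInt(statusParts[1]);
        Map<String, String> headers = new LinkedHashMap<>();
        for (int i = 1; i < lines.length; i++) {
            int colon = lines[i].indexOf(':');
            if (colon > 0) {
                headers.put(lines[i].substring(0, colon).trim().toLowerCase(Locale.ROOT),
                            lines[i].substring(colon + 1).trim());
            }
        }
        return new Parsed(status, headers, payload.length - (headerEnd + 4));
    }

    /** The HTTP block of the jp.dk record quoted above, assuming CRLF line endings. */
    public static byte[] samplePayload() {
        String s = "HTTP/1.1 301 http://jyllands-posten.dk/\r\n"
                + "Server: nginx/0.8.55\r\n"
                + "Date: Thu, 17 Jan 2013 17:23:16 GMT\r\n"
                + "Connection: close\r\n"
                + "Location: http://jyllands-posten.dk/\r\n"
                + "Accept-Ranges: bytes\r\n"
                + "X-Varnish: 2788715357\r\n"
                + "Age: 0\r\n"
                + "Via: 1.1 varnish\r\n"
                + "X-Cache: MISS - bromine.jp-prod.lp.jppol.dk\r\n"
                + "X-Src-Nginx: bromine-nginx\r\n"
                + "\r\n";
        return s.getBytes(StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        byte[] payload = samplePayload();
        Parsed p = parse(payload);
        System.out.println("payload bytes: " + payload.length);
        System.out.println("status:        " + p.status);
        System.out.println("location:      " + p.headers.get("location"));
        System.out.println("body bytes:    " + p.bodyLength);
    }
}
```

Note also that the reason phrase on the status line is the target URL rather than the usual text, so the sketch takes only the second token as the status code and ignores the rest of the line.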
Comment by Søren Vejrup Carlsen (Inactive) [ 08/Feb/16 ]

What do these EOF records look like?

Comment by Colin Rosenthal [ 05/Feb/16 ]

But I finally got it to work with the combination of NetarchiveCacheResourceStore and OpenWayback.

Comment by Colin Rosenthal [ 05/Feb/16 ]

I tried with the NetarchiveCacheResourceStore but then got the error:

11:28:33.569 [Thread 19 proxy handling: ] INFO  d.n.w.NetarchiveCacheResourceStore - File '5-1-20130117172315-00000-kb-test-har-002.kb.dk.warc' downloaded from archive and put into the cache '/tmp'.
11:28:33.642 [Thread 19 proxy handling: ] ERROR d.n.w.NetarchiveCacheResourceStore - Error looking for non existing resource
java.util.zip.ZipException: Not in GZIP format
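
The file name ends in .warc rather than .warc.gz, so it may simply be uncompressed, which would explain a "Not in GZIP format" error from code that always wraps the file in a GZIPInputStream. As a hedged sketch of one way to guard against that, the snippet below checks for the gzip magic bytes (0x1f 0x8b) before deciding how to open the stream. The class GzipSniffSketch and its methods are hypothetical and are not taken from NetarchiveCacheResourceStore or wayback.

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

/** Hypothetical helper, not part of NetarchiveSuite or wayback. */
public class GzipSniffSketch {

    /** Peeks at the first two bytes and then restores the stream position. */
    public static boolean looksGzipped(BufferedInputStream in) throws IOException {
        in.mark(2);
        int b1 = in.read();
        int b2 = in.read();
        in.reset();
        return b1 == 0x1f && b2 == 0x8b; // gzip magic number
    }

    /** Wraps the stream in a GZIPInputStream only if it starts with the gzip magic. */
    public static InputStream openMaybeGzipped(InputStream raw) throws IOException {
        BufferedInputStream in = new BufferedInputStream(raw);
        return looksGzipped(in) ? new GZIPInputStream(in) : in;
    }

    /** Reads the (possibly gzipped) bytes fully and returns them as ASCII text. */
    public static String decodeToString(byte[] data) {
        try (InputStream in = openMaybeGzipped(new ByteArrayInputStream(data))) {
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            byte[] buf = new byte[4096];
            int n;
            while ((n = in.read(buf)) != -1) {
                out.write(buf, 0, n);
            }
            return new String(out.toByteArray(), StandardCharsets.US_ASCII);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Test helper: gzip-compresses a byte array in memory. */
    public static byte[] gzip(byte[] data) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(out)) {
            gz.write(data);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toByteArray();
    }

    public static void main(String[] args) {
        byte[] plain = "WARC/1.0\r\n".getBytes(StandardCharsets.US_ASCII);
        // Both the plain and the gzipped form should decode to the same text.
        System.out.println(decodeToString(plain).equals(decodeToString(gzip(plain))));
    }
}
```
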
Comment by Colin Rosenthal [ 05/Feb/16 ]

It's not the only EOF either. For example, the first hit for news.dk gives

11:17:57.248 [Thread 41 proxy handling: ] INFO  d.n.wayback.NetarchiveResourceStore - Received request for resource from file '5-1-20130117172315-00001-kb-test-har-002.kb.dk.warc' at offset '2938'
11:17:57.251 [Thread 41 proxy handling: ] DEBUG d.n.a.a.d.JMSArcRepositoryClient - Requesting get of record '5-1-20130117172315-00001-kb-test-har-002.kb.dk.warc:2938'
11:17:57.265 [Thread 41 proxy handling: ] DEBUG d.n.common.distribute.Synchronizer - Received reply for message: ID:184-130.226.228.10(bf:2a:1f:8a:82:85)-43110-1454667477254: To TEST12_COMMON_THE_REPOS ReplyTo TEST12_COMMON_THIS_REPOS_CLIENT_130_226_228_10_WIA_WAYBACKWEBAPPTEST12 OK Arcfile: 5-1-20130117172315-00001-kb-test-har-002.kb.dk.warc Offset: 2938
11:17:57.266 [Thread 41 proxy handling: ] DEBUG d.n.a.a.d.JMSArcRepositoryClient - Reply received after 0 seconds
11:17:57.266 [Thread 41 proxy handling: ] INFO  d.n.wayback.NetarchiveResourceStore - Retrieved resource from file '5-1-20130117172315-00001-kb-test-har-002.kb.dk.warc' at offset '2938'
11:17:57.266 [Thread 41 proxy handling: ] DEBUG d.n.c.d.a.BitarchiveRecord - Reading 486 bytes from objectBuffer
11:17:57.266 [Thread 41 proxy handling: ] DEBUG d.n.wayback.NetarchiveResourceStore - Setting response code '302'
11:17:57.266 [Thread 41 proxy handling: ] INFO  d.n.wayback.NetarchiveResourceStore - Setting Content-Type header to 'text/html; charset=utf-8'
11:17:57.266 [Thread 41 proxy handling: ] INFO  d.n.wayback.NetarchiveResourceStore - Setting redirect Location header to 'http://news.dk/'
11:17:57.266 [Thread 41 proxy handling: ] INFO  d.n.wayback.NetarchiveResourceStore - Setting length header to '146'
11:17:57.266 [Thread 41 proxy handling: ] DEBUG d.n.wayback.NetarchiveResourceStore - ARCRecord created with code '-1'
11:17:57.266 [Thread 41 proxy handling: ] INFO  d.n.wayback.NetarchiveResourceStore - Returning resource 'dk.netarkivet.wayback.NetarchiveResourceStore$1@470a4a2b'
WARNING Premature EOF before end-of-record: {statuscode=302, subject-uri=news.dk/, ip-address=www.news.dk, length=486, absolute-offset=2938, creation-date=Thu Jan 17 18:23:16 CET 2013, content-type=text/html, version=302, Location=http://news.dk/}
WARNING Premature EOF before end-of-record: {statuscode=302, subject-uri=news.dk/, ip-address=www.news.dk, length=486, absolute-offset=2938, creation-date=Thu Jan 17 18:23:16 CET 2013, content-type=text/html, version=302, Location=http://news.dk/}
WARNING Premature EOF before end-of-record: {statuscode=302, subject-uri=news.dk/, ip-address=www.news.dk, length=486, absolute-offset=2938, creation-date=Thu Jan 17 18:23:16 CET 2013, content-type=text/html, version=302, Location=http://news.dk/}
WARNING Premature EOF before end-of-record: {statuscode=302, subject-uri=news.dk/, ip-address=www.news.dk, length=486, absolute-offset=2938, creation-date=Thu Jan 17 18:23:16 CET 2013, content-type=text/html, version=302, Location=http://news.dk/}
Generated at Sat Apr 27 04:14:52 CEST 2024 using Jira 9.4.15#940015-sha1:bdaa9cbecfb6791ea579749728cab771f0dfe90b.