[NAS-2496] Redirect for jp.dk fails in test wayback Created: 05/Feb/16 Updated: 21/Feb/17 Resolved: 21/Feb/17 |
Status: | Resolved |
Project: | NetarchiveSuite |
Component/s: | Wayback |
Affects Version/s: | 5.0, 5.2 |
Fix Version/s: | 5.3 |
Type: | Bug | Priority: | Minor |
Reporter: | Colin Rosenthal | Assignee: | Unassigned |
Resolution: | Fixed | ||
Labels: | None | ||
Remaining Estimate: | Not Specified | ||
Time Spent: | Not Specified | ||
Original Estimate: | Not Specified |
Issue Links: |
Verification: | Believed solved as part of |
Description |
In TEST12, using the standard set of warcfiles, I can find two "hits" for jp.dk, but clicking on them produces an error like this in the wayback tomcat log:
11:06:47.090 [Thread 41 proxy handling: ] INFO d.n.wayback.NetarchiveResourceStore - Received request for resource from file '5-1-20130117172315-00000-kb-test-har-002.kb.dk.warc' at offset '2804'
11:06:47.090 [Thread 41 proxy handling: ] DEBUG d.n.a.a.d.JMSArcRepositoryClient - Requesting get of record '5-1-20130117172315-00000-kb-test-har-002.kb.dk.warc:2804'
11:06:47.113 [Thread 41 proxy handling: ] DEBUG d.n.common.distribute.Synchronizer - Received reply for message: ID:139-130.226.228.10(bf:2a:1f:8a:82:85)-43110-1454666807098: To TEST12_COMMON_THE_REPOS ReplyTo TEST12_COMMON_THIS_REPOS_CLIENT_130_226_228_10_WIA_WAYBACKWEBAPPTEST12 OK Arcfile: 5-1-20130117172315-00000-kb-test-har-002.kb.dk.warc Offset: 2804
11:06:47.113 [Thread 41 proxy handling: ] DEBUG d.n.a.a.d.JMSArcRepositoryClient - Reply received after 0 seconds
11:06:47.114 [Thread 41 proxy handling: ] INFO d.n.wayback.NetarchiveResourceStore - Retrieved resource from file '5-1-20130117172315-00000-kb-test-har-002.kb.dk.warc' at offset '2804'
11:06:47.114 [Thread 41 proxy handling: ] DEBUG d.n.c.d.a.BitarchiveRecord - Reading 303 bytes from objectBuffer
11:06:47.114 [Thread 41 proxy handling: ] DEBUG d.n.wayback.NetarchiveResourceStore - Setting response code '301'
11:06:47.114 [Thread 41 proxy handling: ] INFO d.n.wayback.NetarchiveResourceStore - Setting redirect Location header to 'http://jyllands-posten.dk/'
11:06:47.114 [Thread 41 proxy handling: ] DEBUG d.n.wayback.NetarchiveResourceStore - ARCRecord created with code '-1'
11:06:47.114 [Thread 41 proxy handling: ] INFO d.n.wayback.NetarchiveResourceStore - Returning resource 'dk.netarkivet.wayback.NetarchiveResourceStore$1@24e1f49f'
WARNING Premature EOF before end-of-record: {statuscode=301, subject-uri=jp.dk/, ip-address=www.jp.dk, absolute-offset=2804, length=303, creation-date=Thu Jan 17 18:23:16 CET 2013, content-type=application/http, version=301, Location=http://jyllands-posten.dk/}
Searching directly for jyllands-posten.dk works fine. The exact same behaviour is seen in both ia wayback and OpenWayback. |
Comments |
Comment by Colin Rosenthal [ 09/Feb/16 ] |
I'm pretty sure this is not a new bug but a consequence of the fact that NetarchiveResourceStore has never been properly rewritten to support warcfiles. It only seems to affect the wayback webapp. Since we have no immediate plans to upgrade the webapp in production, this is not blocking for the 5.1 release. |
Comment by Colin Rosenthal [ 09/Feb/16 ] |
So far it seems that all the problem records are redirect records. For example the first one above appears to be this record:
WARC/1.0
WARC-Type: response
WARC-Target-URI: http://www.jp.dk/
WARC-Date: 2013-01-17T17:23:16Z
WARC-Payload-Digest: sha1:3I42H3S6NNFQ2MSVX7XZKYAYSCX5QBYJ
WARC-IP-Address: 91.214.20.52
WARC-Record-ID: <urn:uuid:bd02f38c-c95d-45b6-9a1c-7199357762d9>
Content-Type: application/http; msgtype=response
Content-Length: 303

HTTP/1.1 301 http://jyllands-posten.dk/
Server: nginx/0.8.55
Date: Thu, 17 Jan 2013 17:23:16 GMT
Connection: close
Location: http://jyllands-posten.dk/
Accept-Ranges: bytes
X-Varnish: 2788715357
Age: 0
Via: 1.1 varnish
X-Cache: MISS - bromine.jp-prod.lp.jppol.dk
X-Src-Nginx: bromine-nginx

WARC/1.0
... |
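As a sanity check on the record quoted above, here is a small Python sketch that rebuilds the HTTP block from the listed headers and measures it. It assumes CRLF line endings and a terminating blank line, neither of which is visible in the pasted record; the helper name `split_http_block` is made up for illustration. Under those assumptions the block is exactly 303 bytes, matching both the WARC `Content-Length` and the "Reading 303 bytes" log line, so the record appears to hold HTTP headers only, with no entity body. Whether that empty body is what triggers the premature-EOF warning is not established here.

```python
# HTTP block of the jp.dk record, copied from the WARC record above.
HTTP_LINES = [
    "HTTP/1.1 301 http://jyllands-posten.dk/",
    "Server: nginx/0.8.55",
    "Date: Thu, 17 Jan 2013 17:23:16 GMT",
    "Connection: close",
    "Location: http://jyllands-posten.dk/",
    "Accept-Ranges: bytes",
    "X-Varnish: 2788715357",
    "Age: 0",
    "Via: 1.1 varnish",
    "X-Cache: MISS - bromine.jp-prod.lp.jppol.dk",
    "X-Src-Nginx: bromine-nginx",
]

# Assumption: CRLF line endings plus the blank line that ends the header section.
http_block = ("\r\n".join(HTTP_LINES) + "\r\n\r\n").encode("ascii")


def split_http_block(block):
    """Split a raw HTTP response into (header section incl. blank line, body)."""
    head, sep, body = block.partition(b"\r\n\r\n")
    return head + sep, body


headers, body = split_http_block(http_block)
print(len(http_block), len(body))  # prints "303 0"
```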
Comment by Søren Vejrup Carlsen (Inactive) [ 08/Feb/16 ] |
What do these EOF records look like? |
Comment by Colin Rosenthal [ 05/Feb/16 ] |
But I finally got it to work with the combination of NetarchiveCacheResourceStore and OpenWayback. |
Comment by Colin Rosenthal [ 05/Feb/16 ] |
I tried with the NetarchiveCacheResourceStore but then got the error:
11:28:33.569 [Thread 19 proxy handling: ] INFO d.n.w.NetarchiveCacheResourceStore - File '5-1-20130117172315-00000-kb-test-har-002.kb.dk.warc' downloaded from archive and put into the cache '/tmp'.
11:28:33.642 [Thread 19 proxy handling: ] ERROR d.n.w.NetarchiveCacheResourceStore - Error looking for non existing resource java.util.zip.ZipException: Not in GZIP format |
Comment by Colin Rosenthal [ 05/Feb/16 ] |
It's not the only EOF either. For example, the first hit for news.dk gives:
11:17:57.248 [Thread 41 proxy handling: ] INFO d.n.wayback.NetarchiveResourceStore - Received request for resource from file '5-1-20130117172315-00001-kb-test-har-002.kb.dk.warc' at offset '2938'
11:17:57.251 [Thread 41 proxy handling: ] DEBUG d.n.a.a.d.JMSArcRepositoryClient - Requesting get of record '5-1-20130117172315-00001-kb-test-har-002.kb.dk.warc:2938'
11:17:57.265 [Thread 41 proxy handling: ] DEBUG d.n.common.distribute.Synchronizer - Received reply for message: ID:184-130.226.228.10(bf:2a:1f:8a:82:85)-43110-1454667477254: To TEST12_COMMON_THE_REPOS ReplyTo TEST12_COMMON_THIS_REPOS_CLIENT_130_226_228_10_WIA_WAYBACKWEBAPPTEST12 OK Arcfile: 5-1-20130117172315-00001-kb-test-har-002.kb.dk.warc Offset: 2938
11:17:57.266 [Thread 41 proxy handling: ] DEBUG d.n.a.a.d.JMSArcRepositoryClient - Reply received after 0 seconds
11:17:57.266 [Thread 41 proxy handling: ] INFO d.n.wayback.NetarchiveResourceStore - Retrieved resource from file '5-1-20130117172315-00001-kb-test-har-002.kb.dk.warc' at offset '2938'
11:17:57.266 [Thread 41 proxy handling: ] DEBUG d.n.c.d.a.BitarchiveRecord - Reading 486 bytes from objectBuffer
11:17:57.266 [Thread 41 proxy handling: ] DEBUG d.n.wayback.NetarchiveResourceStore - Setting response code '302'
11:17:57.266 [Thread 41 proxy handling: ] INFO d.n.wayback.NetarchiveResourceStore - Setting Content-Type header to 'text/html; charset=utf-8'
11:17:57.266 [Thread 41 proxy handling: ] INFO d.n.wayback.NetarchiveResourceStore - Setting redirect Location header to 'http://news.dk/'
11:17:57.266 [Thread 41 proxy handling: ] INFO d.n.wayback.NetarchiveResourceStore - Setting length header to '146'
11:17:57.266 [Thread 41 proxy handling: ] DEBUG d.n.wayback.NetarchiveResourceStore - ARCRecord created with code '-1'
11:17:57.266 [Thread 41 proxy handling: ] INFO d.n.wayback.NetarchiveResourceStore - Returning resource 'dk.netarkivet.wayback.NetarchiveResourceStore$1@470a4a2b'
WARNING Premature EOF before end-of-record: {statuscode=302, subject-uri=news.dk/, ip-address=www.news.dk, length=486, absolute-offset=2938, creation-date=Thu Jan 17 18:23:16 CET 2013, content-type=text/html, version=302, Location=http://news.dk/}
WARNING Premature EOF before end-of-record: {statuscode=302, subject-uri=news.dk/, ip-address=www.news.dk, length=486, absolute-offset=2938, creation-date=Thu Jan 17 18:23:16 CET 2013, content-type=text/html, version=302, Location=http://news.dk/}
WARNING Premature EOF before end-of-record: {statuscode=302, subject-uri=news.dk/, ip-address=www.news.dk, length=486, absolute-offset=2938, creation-date=Thu Jan 17 18:23:16 CET 2013, content-type=text/html, version=302, Location=http://news.dk/}
WARNING Premature EOF before end-of-record: {statuscode=302, subject-uri=news.dk/, ip-address=www.news.dk, length=486, absolute-offset=2938, creation-date=Thu Jan 17 18:23:16 CET 2013, content-type=text/html, version=302, Location=http://news.dk/} |
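To check the observation that the failing records all seem to be redirects against a larger log, something like the following Python sketch could be used. It is not part of NetarchiveSuite; the function names are made up, and the regular expression assumes the warning format exactly as quoted in this issue (with `statuscode` and `subject-uri` as the first two fields). The sample text is the two distinct warnings from this issue, with the news.dk one repeated as in the log.

```python
import re

# Matches the "Premature EOF" warnings in the format quoted in this issue.
WARNING_RE = re.compile(
    r"Premature EOF before end-of-record: "
    r"\{statuscode=(\d+), subject-uri=([^,]+),"
)


def premature_eof_records(log_text):
    """Return the distinct (subject-uri, status code) pairs that hit a premature EOF."""
    return sorted({(uri, int(code)) for code, uri in WARNING_RE.findall(log_text)})


def is_redirect(status):
    """True for HTTP 3xx status codes."""
    return 300 <= status < 400


NEWS_DK_WARNING = (
    "WARNING Premature EOF before end-of-record: {statuscode=302, "
    "subject-uri=news.dk/, ip-address=www.news.dk, length=486, "
    "absolute-offset=2938, creation-date=Thu Jan 17 18:23:16 CET 2013, "
    "content-type=text/html, version=302, Location=http://news.dk/}\n"
)
SAMPLE_LOG = (
    "WARNING Premature EOF before end-of-record: {statuscode=301, "
    "subject-uri=jp.dk/, ip-address=www.jp.dk, absolute-offset=2804, "
    "length=303, creation-date=Thu Jan 17 18:23:16 CET 2013, "
    "content-type=application/http, version=301, "
    "Location=http://jyllands-posten.dk/}\n"
    + NEWS_DK_WARNING * 2  # repeated warnings collapse to one entry
)

for uri, status in premature_eof_records(SAMPLE_LOG):
    print(uri, status, "redirect" if is_redirect(status) else "not a redirect")
```

On the two records reported here, both entries come out as redirects (301 and 302); a non-redirect status in a larger log would contradict the hypothesis.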