  1. NetarchiveSuite
  2. NAS-1625

We need to normalize URLs when browsing data

Details

    • New Feature
    • Resolution: Unresolved
    • Major
    • 5.5.1
    • 2.0
    • Vienna 2017, Viewerproxy
    • None

    Description

      In the PLIGT system we have harvested a lot of data from www.bs.dk.
      None of it (except the front page) can be viewed in viewerproxy.
      Heritrix encodes URLs like
      http://www.bs.dk/showfile.aspx?IdGuid={BB0455A5-4BA9-4054-8EC3-4251813B96F4}
      to
      http://www.bs.dk/showfile.aspx?IdGuid=%7BBB0455A5-4BA9-4054-8EC3-4251813B96F4%7D
      but when browsing (with IE; other browsers have not been checked) the braces are
      not encoded by the browser, so viewerproxy finds nothing.
      A possible fix:
      have CDXReader.getKey(String uri) URL-encode the URI before calling BinSearch.
      It should be checked how this would affect URIs that are already URL-encoded;
      maybe only some characters should be encoded.
      BJA remembers doing a similar thing in the special RoyalWedding branch of viewerproxy.
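      A minimal sketch of the "encode only some characters" idea, not the actual
      CDXReader code. The class name UriKeyNormalizer and the method normalize are
      hypothetical, and the set of characters left untouched is an assumption that
      would need to be checked against what Heritrix really writes to the index.
      Because '%' is passed through unchanged, a URI that is already encoded comes
      out the same, so the method can be applied to both kinds of input.

```java
import java.nio.charset.StandardCharsets;

/** Hypothetical helper: percent-encodes only characters outside an assumed safe set. */
public class UriKeyNormalizer {
    // Assumption: RFC 3986 reserved and unreserved punctuation stays as-is.
    // '%' is included so that existing %XX escapes are not encoded a second time.
    private static final String SAFE = "-._~:/?#[]@!$&'()*+,;=%";

    public static String normalize(String uri) {
        StringBuilder out = new StringBuilder(uri.length() + 16);
        // Work on UTF-8 bytes so non-ASCII characters become %XX sequences.
        for (byte b : uri.getBytes(StandardCharsets.UTF_8)) {
            int c = b & 0xFF;
            boolean alnum = (c >= 'a' && c <= 'z')
                    || (c >= 'A' && c <= 'Z')
                    || (c >= '0' && c <= '9');
            if (alnum || (c < 0x80 && SAFE.indexOf(c) >= 0)) {
                out.append((char) c);
            } else {
                // Upper-case hex, matching the %7B / %7D form seen in the harvested URL.
                out.append('%').append(String.format("%02X", c));
            }
        }
        return out.toString();
    }
}
```

      With this sketch, getKey could call UriKeyNormalizer.normalize(uri) before
      handing the key to BinSearch. A stray '%' that is not part of a valid escape
      is left alone here; whether that matches the index would have to be verified.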
      NOTE: This bug is originally from Bugzilla bug_id=623.
      This bug was previously assigned to Unassigned.

          People

            csr Colin Rosenthal
            bja Bjarne Andersen