[NAS-2519] Remove tlds from settings Created: 28/Apr/16 Updated: 03/Nov/16 Resolved: 19/Oct/16 |
|
Status: | Resolved |
Project: | NetarchiveSuite |
Component/s: | Harvest Definition |
Affects Version/s: | None |
Fix Version/s: | 5.2 |
Type: | Bug | Priority: | Major |
Reporter: | Colin Rosenthal | Assignee: | Søren Vejrup Carlsen (Inactive) |
Resolution: | Fixed | ||
Labels: | None | ||
Remaining Estimate: | Not Specified | ||
Time Spent: | 13m | ||
Original Estimate: | Not Specified |
Sprint: | NAS 5.2 |
Verification: | Checked that we can now harvest bbc.co.uk, for example, with standard installation. |
Description |
Currently the list of allowable top-level domains is set in the settings file. This is no longer good enough. Consult with Tue to find out what a better solution would look like. |
Comments |
Comment by Søren Vejrup Carlsen (Inactive) [ 16/Sep/16 ] |
When harvesting with the new TLD system, I get tons of log-lines like this:
2016-09-16 16:24:51.866 DEBUG d.n.h.h.r.HarvestReportGenerator.getDomainNameFromURIString - Not possible to extract domainname from URL: www.netarkivet.dk
2016-09-16 16:24:51.866 DEBUG d.n.h.h.r.HarvestReportGenerator.getDomainNameFromURIString - Not possible to extract domainname from URL: www.netarkivet.dk
2016-09-16 16:24:51.866 DEBUG d.n.h.h.r.HarvestReportGenerator.getDomainNameFromURIString - Not possible to extract domainname from URL: www.netarkivet.dk
2016-09-16 16:24:51.867 DEBUG d.n.h.h.r.HarvestReportGenerator.getDomainNameFromURIString - Not possible to extract domainname from URL: www.netarkivet.dk
2016-09-16 16:24:51.867 DEBUG d.n.h.h.r.HarvestReportGenerator.getDomainNameFromURIString - Not possible to extract domainname from URL: www.netarkivet.dk
2016-09-16 16:24:51.867 DEBUG d.n.h.h.r.HarvestReportGenerator.getDomainNameFromURIString - Not possible to extract domainname from URL: www.netarkivet.dk
2016-09-16 16:24:51.867 DEBUG d.n.h.h.r.HarvestReportGenerator.getDomainNameFromURIString - Not possible to extract domainname from URL: www.netarkivet.dk |
Comment by Søren Vejrup Carlsen (Inactive) [ 13/Sep/16 ] |
See description of the use of wildcards and '!' in the suffix file at https://publicsuffix.org/list/ |
Comment by Søren Vejrup Carlsen (Inactive) [ 13/Sep/16 ] |
We get these warnings whenever we read the file:
17:50:29.664 [main] WARN dk.netarkivet.common.utils.TLD - Invalid tld '*.bd', ignoring
17:50:29.665 [main] WARN dk.netarkivet.common.utils.TLD - Invalid tld '*.bn', ignoring
17:50:29.665 [main] WARN dk.netarkivet.common.utils.TLD - Invalid tld '*.nom.br', ignoring
17:50:29.666 [main] WARN dk.netarkivet.common.utils.TLD - Invalid tld '*.ck', ignoring
17:50:29.666 [main] WARN dk.netarkivet.common.utils.TLD - Invalid tld '!www.ck', ignoring
17:50:29.668 [main] WARN dk.netarkivet.common.utils.TLD - Invalid tld '*.er', ignoring
17:50:29.668 [main] WARN dk.netarkivet.common.utils.TLD - Invalid tld '*.fj', ignoring
17:50:29.668 [main] WARN dk.netarkivet.common.utils.TLD - Invalid tld '*.fk', ignoring
17:50:29.673 [main] WARN dk.netarkivet.common.utils.TLD - Invalid tld '*.gu', ignoring
17:50:29.677 [main] WARN dk.netarkivet.common.utils.TLD - Invalid tld '*.jm', ignoring
17:50:29.678 [main] WARN dk.netarkivet.common.utils.TLD - Invalid tld '*.kawasaki.jp', ignoring
17:50:29.679 [main] WARN dk.netarkivet.common.utils.TLD - Invalid tld '*.kitakyushu.jp', ignoring
17:50:29.679 [main] WARN dk.netarkivet.common.utils.TLD - Invalid tld '*.kobe.jp', ignoring
17:50:29.679 [main] WARN dk.netarkivet.common.utils.TLD - Invalid tld '*.nagoya.jp', ignoring
17:50:29.679 [main] WARN dk.netarkivet.common.utils.TLD - Invalid tld '*.sapporo.jp', ignoring
17:50:29.679 [main] WARN dk.netarkivet.common.utils.TLD - Invalid tld '*.sendai.jp', ignoring
17:50:29.679 [main] WARN dk.netarkivet.common.utils.TLD - Invalid tld '*.yokohama.jp', ignoring
17:50:29.680 [main] WARN dk.netarkivet.common.utils.TLD - Invalid tld '!city.kawasaki.jp', ignoring
17:50:29.680 [main] WARN dk.netarkivet.common.utils.TLD - Invalid tld '!city.kitakyushu.jp', ignoring
17:50:29.680 [main] WARN dk.netarkivet.common.utils.TLD - Invalid tld '!city.kobe.jp', ignoring
17:50:29.680 [main] WARN dk.netarkivet.common.utils.TLD - Invalid tld '!city.nagoya.jp', ignoring
17:50:29.680 [main] WARN dk.netarkivet.common.utils.TLD - Invalid tld '!city.sapporo.jp', ignoring
17:50:29.681 [main] WARN dk.netarkivet.common.utils.TLD - Invalid tld '!city.sendai.jp', ignoring
17:50:29.681 [main] WARN dk.netarkivet.common.utils.TLD - Invalid tld '!city.yokohama.jp', ignoring
17:50:29.692 [main] WARN dk.netarkivet.common.utils.TLD - Invalid tld '*.ke', ignoring
17:50:29.693 [main] WARN dk.netarkivet.common.utils.TLD - Invalid tld '*.kh', ignoring
17:50:29.693 [main] WARN dk.netarkivet.common.utils.TLD - Invalid tld '*.kw', ignoring
17:50:29.694 [main] WARN dk.netarkivet.common.utils.TLD - Invalid tld '*.mm', ignoring
17:50:29.699 [main] WARN dk.netarkivet.common.utils.TLD - Invalid tld '*.mz', ignoring
17:50:29.699 [main] WARN dk.netarkivet.common.utils.TLD - Invalid tld '!teledata.mz', ignoring
17:50:29.704 [main] WARN dk.netarkivet.common.utils.TLD - Invalid tld '*.np', ignoring
17:50:29.704 [main] WARN dk.netarkivet.common.utils.TLD - Invalid tld '*.pg', ignoring
17:50:29.709 [main] WARN dk.netarkivet.common.utils.TLD - Invalid tld '*.sch.uk', ignoring
17:50:29.712 [main] WARN dk.netarkivet.common.utils.TLD - Invalid tld '*.ye', ignoring
17:50:29.712 [main] WARN dk.netarkivet.common.utils.TLD - Invalid tld '*.zw', ignoring
17:50:29.721 [main] WARN dk.netarkivet.common.utils.TLD - Invalid tld '*.compute.estate', ignoring
17:50:29.721 [main] WARN dk.netarkivet.common.utils.TLD - Invalid tld '*.alces.network', ignoring
17:50:29.722 [main] WARN dk.netarkivet.common.utils.TLD - Invalid tld '*.platform.sh', ignoring
17:50:29.722 [main] WARN dk.netarkivet.common.utils.TLD - Invalid tld '*.cryptonomic.net', ignoring
17:50:29.725 [main] WARN dk.netarkivet.common.utils.TLD - Invalid tld '*.api.githubcloud.com', ignoring
17:50:29.726 [main] WARN dk.netarkivet.common.utils.TLD - Invalid tld '*.ext.githubcloud.com', ignoring
17:50:29.726 [main] WARN dk.netarkivet.common.utils.TLD - Invalid tld '*.githubcloudusercontent.com', ignoring
17:50:29.726 [main] WARN dk.netarkivet.common.utils.TLD - Invalid tld '*.0emm.com', ignoring
17:50:29.727 [main] WARN dk.netarkivet.common.utils.TLD - Invalid tld '*.magentosite.cloud', ignoring |
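Every entry rejected in the warnings above is either a wildcard rule (`*.ck`) or an exception rule (`!www.ck`) of the Public Suffix List format described at https://publicsuffix.org/list/. The following is a minimal sketch of how such rules could be honoured instead of ignored. It is not the NetarchiveSuite `TLD` class: the class `PublicSuffixSketch` and its methods are hypothetical names chosen for illustration, and the sketch assumes wildcards only appear as the leftmost label.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

// Hypothetical illustration; not the NetarchiveSuite implementation.
class PublicSuffixSketch {
    private final Set<String> exact = new HashSet<>();     // "co.uk" stored as "co.uk"
    private final Set<String> wildcard = new HashSet<>();  // "*.ck" stored as "ck"
    private final Set<String> exception = new HashSet<>(); // "!www.ck" stored as "www.ck"

    PublicSuffixSketch(Iterable<String> rules) {
        for (String rule : rules) {
            if (rule.startsWith("!")) {
                exception.add(rule.substring(1));
            } else if (rule.startsWith("*.")) {
                wildcard.add(rule.substring(2));
            } else {
                exact.add(rule);
            }
        }
    }

    // Joins labels[from..] back into a dotted name.
    private static String tail(String[] labels, int from) {
        return String.join(".", Arrays.copyOfRange(labels, from, labels.length));
    }

    // Number of trailing labels of the host that form its public suffix.
    int suffixLabelCount(String[] labels) {
        // An exception rule prevails over any other match; the public
        // suffix is then the rule without its leftmost label.
        for (int i = 0; i < labels.length; i++) {
            if (exception.contains(tail(labels, i))) {
                return labels.length - i - 1;
            }
        }
        // Otherwise the matching rule with the most labels wins,
        // so candidates are tried from longest to shortest.
        for (int i = 0; i < labels.length; i++) {
            boolean exactMatch = exact.contains(tail(labels, i));
            boolean wildcardMatch = i + 1 < labels.length
                    && wildcard.contains(tail(labels, i + 1));
            if (exactMatch || wildcardMatch) {
                return labels.length - i;
            }
        }
        return 1; // no rule matched: the implicit default rule "*" applies
    }

    // Public suffix plus one more label, or null if the host is itself a suffix.
    String registrableDomain(String host) {
        String[] labels = host.toLowerCase(Locale.ROOT).split("\\.");
        int needed = suffixLabelCount(labels) + 1;
        return labels.length < needed ? null : tail(labels, labels.length - needed);
    }

    public static void main(String[] args) {
        PublicSuffixSketch psl = new PublicSuffixSketch(
                Arrays.asList("uk", "co.uk", "dk", "*.ck", "!www.ck"));
        for (String host : Arrays.asList(
                "www.bbc.co.uk", "www.netarkivet.dk", "shop.example.ck", "www.ck")) {
            System.out.println(host + " -> " + psl.registrableDomain(host));
        }
    }
}
```

With the five sample rules in `main`, the four hosts resolve to `bbc.co.uk`, `netarkivet.dk`, `shop.example.ck` and `www.ck` respectively. The last two show the wildcard making `example.ck` a public suffix and the exception carving `www.ck` back out of it.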
Comment by Søren Vejrup Carlsen (Inactive) [ 09/Sep/16 ] |
Have now committed this addition to the code. |
Comment by Søren Vejrup Carlsen (Inactive) [ 09/Sep/16 ] |
Our PROD supervisor at netarkivet.dk believes that adding extra TLDs through the settings would leave updating the list of acceptable TLDs just as difficult as it is today. |
Comment by Søren Vejrup Carlsen (Inactive) [ 15/Aug/16 ] |
Have now merged the code in special branch |
Comment by Søren Vejrup Carlsen (Inactive) [ 15/Aug/16 ] |
The list of TLDs in the settings files has now been fully replaced by the file common/common-core/src/main/resources/dk/netarkivet/common/utils/public_suffix_list.dat |
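As a rough sketch of how a bundled suffix file like this could be read, the example below follows the published file format: blank lines and lines starting with `//` are skipped, and each rule is read only up to the first whitespace. The class `SuffixListReaderSketch` and its methods are hypothetical, not NetarchiveSuite code. The classpath location is an assumption derived from the source path above under standard Maven resource conventions.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration; not the NetarchiveSuite implementation.
class SuffixListReaderSketch {
    // Assumed classpath location of the file named in the comment above.
    static final String RESOURCE = "dk/netarkivet/common/utils/public_suffix_list.dat";

    // Reads the rules from any character source, one rule per line.
    static List<String> readRules(Reader source) {
        List<String> rules = new ArrayList<>();
        try (BufferedReader in = new BufferedReader(source)) {
            String line;
            while ((line = in.readLine()) != null) {
                line = line.trim();
                if (line.isEmpty() || line.startsWith("//")) {
                    continue; // blank line or comment
                }
                // A rule ends at the first whitespace; the rest is ignored.
                rules.add(line.split("\\s+", 2)[0]);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return rules;
    }

    // Reads the bundled file from the classpath (the file is UTF-8 encoded).
    static List<String> readBundledRules() {
        InputStream in = SuffixListReaderSketch.class.getClassLoader()
                .getResourceAsStream(RESOURCE);
        if (in == null) {
            throw new IllegalStateException("Resource not found: " + RESOURCE);
        }
        return readRules(new InputStreamReader(in, StandardCharsets.UTF_8));
    }
}
```

Taking a `Reader` rather than a file path keeps the parsing testable with a `StringReader` and independent of where the list is packaged.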
Comment by Søren Vejrup Carlsen (Inactive) [ 15/Jul/16 ] |
Created |
Comment by Søren Vejrup Carlsen (Inactive) [ 24/May/16 ] |
We should embed the list https://www.publicsuffix.org/list/public_suffix_list.dat in NetarchiveSuite |