[NAS-2519] Remove tlds from settings Created: 28/Apr/16  Updated: 03/Nov/16  Resolved: 19/Oct/16

Status: Resolved
Project: NetarchiveSuite
Component/s: Harvest Definition
Affects Version/s: None
Fix Version/s: 5.2

Type: Bug Priority: Major
Reporter: Colin Rosenthal Assignee: Søren Vejrup Carlsen (Inactive)
Resolution: Fixed  
Labels: None
Remaining Estimate: Not Specified
Time Spent: 13m
Original Estimate: Not Specified

Sprint: NAS 5.2
Verification:

Checked that we can now harvest bbc.co.uk, for example, with standard installation.


 Description   

Currently the list of allowable top-level domains is set in the settings file. This is no longer good enough. Consult with Tue to find out what a better solution would look like.



 Comments   
Comment by Søren Vejrup Carlsen (Inactive) [ 16/Sep/16 ]

When harvesting with the new TLD system, I get tons of log-lines like this:

2016-09-16 16:24:51.866 DEBUG d.n.h.h.r.HarvestReportGenerator.getDomainNameFromURIString - Not possible to extract domainname from URL: www.netarkivet.dk
2016-09-16 16:24:51.866 DEBUG d.n.h.h.r.HarvestReportGenerator.getDomainNameFromURIString - Not possible to extract domainname from URL: www.netarkivet.dk
2016-09-16 16:24:51.866 DEBUG d.n.h.h.r.HarvestReportGenerator.getDomainNameFromURIString - Not possible to extract domainname from URL: www.netarkivet.dk
2016-09-16 16:24:51.867 DEBUG d.n.h.h.r.HarvestReportGenerator.getDomainNameFromURIString - Not possible to extract domainname from URL: www.netarkivet.dk
2016-09-16 16:24:51.867 DEBUG d.n.h.h.r.HarvestReportGenerator.getDomainNameFromURIString - Not possible to extract domainname from URL: www.netarkivet.dk
2016-09-16 16:24:51.867 DEBUG d.n.h.h.r.HarvestReportGenerator.getDomainNameFromURIString - Not possible to extract domainname from URL: www.netarkivet.dk
2016-09-16 16:24:51.867 DEBUG d.n.h.h.r.HarvestReportGenerator.getDomainNameFromURIString - Not possible to extract domainname from URL: www.netarkivet.dk
Comment by Søren Vejrup Carlsen (Inactive) [ 13/Sep/16 ]

See the description of wildcard ('*') and exception ('!') rules in the suffix file at https://publicsuffix.org/list/
Having read it, I think we should simply ignore those entries for the moment.
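
A minimal sketch of what "ignoring them" could look like when reading the suffix file. The class and method names (SuffixListFilter, plainRules) are hypothetical and not taken from the NetarchiveSuite code; the sketch only assumes the documented file format, where comments start with "//", wildcard rules contain '*', and exception rules start with '!'.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class SuffixListFilter {

    /**
     * Returns only the plain suffix rules from the given lines.
     * Blank lines, "//" comments, wildcard rules and '!' exception
     * rules are skipped silently instead of being logged as invalid.
     */
    public static List<String> plainRules(List<String> lines) {
        List<String> rules = new ArrayList<>();
        for (String raw : lines) {
            String line = raw.trim();
            if (line.isEmpty() || line.startsWith("//")) {
                continue; // comment or blank line
            }
            if (line.contains("*") || line.startsWith("!")) {
                continue; // wildcard or exception rule: ignored for now
            }
            rules.add(line);
        }
        return rules;
    }

    public static void main(String[] args) {
        List<String> sample = Arrays.asList(
                "// example comment", "dk", "co.uk", "*.ck", "!www.ck", "");
        System.out.println(plainRules(sample)); // prints [dk, co.uk]
    }
}
```

Skipping these rules means that domains under suffixes such as '*.ck' are not recognised; that is the trade-off accepted "for the moment".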

Comment by Søren Vejrup Carlsen (Inactive) [ 13/Sep/16 ]

We get these warnings whenever we read the file:

17:50:29.664 [main] WARN  dk.netarkivet.common.utils.TLD - Invalid tld '*.bd', ignoring
17:50:29.665 [main] WARN  dk.netarkivet.common.utils.TLD - Invalid tld '*.bn', ignoring
17:50:29.665 [main] WARN  dk.netarkivet.common.utils.TLD - Invalid tld '*.nom.br', ignoring
17:50:29.666 [main] WARN  dk.netarkivet.common.utils.TLD - Invalid tld '*.ck', ignoring
17:50:29.666 [main] WARN  dk.netarkivet.common.utils.TLD - Invalid tld '!www.ck', ignoring
17:50:29.668 [main] WARN  dk.netarkivet.common.utils.TLD - Invalid tld '*.er', ignoring
17:50:29.668 [main] WARN  dk.netarkivet.common.utils.TLD - Invalid tld '*.fj', ignoring
17:50:29.668 [main] WARN  dk.netarkivet.common.utils.TLD - Invalid tld '*.fk', ignoring
17:50:29.673 [main] WARN  dk.netarkivet.common.utils.TLD - Invalid tld '*.gu', ignoring
17:50:29.677 [main] WARN  dk.netarkivet.common.utils.TLD - Invalid tld '*.jm', ignoring
17:50:29.678 [main] WARN  dk.netarkivet.common.utils.TLD - Invalid tld '*.kawasaki.jp', ignoring
17:50:29.679 [main] WARN  dk.netarkivet.common.utils.TLD - Invalid tld '*.kitakyushu.jp', ignoring
17:50:29.679 [main] WARN  dk.netarkivet.common.utils.TLD - Invalid tld '*.kobe.jp', ignoring
17:50:29.679 [main] WARN  dk.netarkivet.common.utils.TLD - Invalid tld '*.nagoya.jp', ignoring
17:50:29.679 [main] WARN  dk.netarkivet.common.utils.TLD - Invalid tld '*.sapporo.jp', ignoring
17:50:29.679 [main] WARN  dk.netarkivet.common.utils.TLD - Invalid tld '*.sendai.jp', ignoring
17:50:29.679 [main] WARN  dk.netarkivet.common.utils.TLD - Invalid tld '*.yokohama.jp', ignoring
17:50:29.680 [main] WARN  dk.netarkivet.common.utils.TLD - Invalid tld '!city.kawasaki.jp', ignoring
17:50:29.680 [main] WARN  dk.netarkivet.common.utils.TLD - Invalid tld '!city.kitakyushu.jp', ignoring
17:50:29.680 [main] WARN  dk.netarkivet.common.utils.TLD - Invalid tld '!city.kobe.jp', ignoring
17:50:29.680 [main] WARN  dk.netarkivet.common.utils.TLD - Invalid tld '!city.nagoya.jp', ignoring
17:50:29.680 [main] WARN  dk.netarkivet.common.utils.TLD - Invalid tld '!city.sapporo.jp', ignoring
17:50:29.681 [main] WARN  dk.netarkivet.common.utils.TLD - Invalid tld '!city.sendai.jp', ignoring
17:50:29.681 [main] WARN  dk.netarkivet.common.utils.TLD - Invalid tld '!city.yokohama.jp', ignoring
17:50:29.692 [main] WARN  dk.netarkivet.common.utils.TLD - Invalid tld '*.ke', ignoring
17:50:29.693 [main] WARN  dk.netarkivet.common.utils.TLD - Invalid tld '*.kh', ignoring
17:50:29.693 [main] WARN  dk.netarkivet.common.utils.TLD - Invalid tld '*.kw', ignoring
17:50:29.694 [main] WARN  dk.netarkivet.common.utils.TLD - Invalid tld '*.mm', ignoring
17:50:29.699 [main] WARN  dk.netarkivet.common.utils.TLD - Invalid tld '*.mz', ignoring
17:50:29.699 [main] WARN  dk.netarkivet.common.utils.TLD - Invalid tld '!teledata.mz', ignoring
17:50:29.704 [main] WARN  dk.netarkivet.common.utils.TLD - Invalid tld '*.np', ignoring
17:50:29.704 [main] WARN  dk.netarkivet.common.utils.TLD - Invalid tld '*.pg', ignoring
17:50:29.709 [main] WARN  dk.netarkivet.common.utils.TLD - Invalid tld '*.sch.uk', ignoring
17:50:29.712 [main] WARN  dk.netarkivet.common.utils.TLD - Invalid tld '*.ye', ignoring
17:50:29.712 [main] WARN  dk.netarkivet.common.utils.TLD - Invalid tld '*.zw', ignoring
17:50:29.721 [main] WARN  dk.netarkivet.common.utils.TLD - Invalid tld '*.compute.estate', ignoring
17:50:29.721 [main] WARN  dk.netarkivet.common.utils.TLD - Invalid tld '*.alces.network', ignoring
17:50:29.722 [main] WARN  dk.netarkivet.common.utils.TLD - Invalid tld '*.platform.sh', ignoring
17:50:29.722 [main] WARN  dk.netarkivet.common.utils.TLD - Invalid tld '*.cryptonomic.net', ignoring
17:50:29.725 [main] WARN  dk.netarkivet.common.utils.TLD - Invalid tld '*.api.githubcloud.com', ignoring
17:50:29.726 [main] WARN  dk.netarkivet.common.utils.TLD - Invalid tld '*.ext.githubcloud.com', ignoring
17:50:29.726 [main] WARN  dk.netarkivet.common.utils.TLD - Invalid tld '*.githubcloudusercontent.com', ignoring
17:50:29.726 [main] WARN  dk.netarkivet.common.utils.TLD - Invalid tld '*.0emm.com', ignoring
17:50:29.727 [main] WARN  dk.netarkivet.common.utils.TLD - Invalid tld '*.magentosite.cloud', ignoring
Comment by Søren Vejrup Carlsen (Inactive) [ 09/Sep/16 ]

Have now committed this addition to the code.

Comment by Søren Vejrup Carlsen (Inactive) [ 09/Sep/16 ]

Our PROD supervisor at netarkivet.dk believes that adding extra TLDs to the settings would make updating the acceptable TLDs just as difficult as it is today.
Instead, I will look for a public_suffix_list.dat in the conf directory, just as the Settings class does today for settings.xml, which it looks for at conf/settings.xml in the Settings.getSettingsFiles() method.
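
The lookup described above could be sketched roughly as follows. This is an illustration only: SuffixListLocator and its open() method are hypothetical names, and the fallback to the embedded copy assumes the resource sits on the classpath under dk/netarkivet/common/utils/, as the 15/Aug/16 comment suggests.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;

public class SuffixListLocator {

    static final String FILE_NAME = "public_suffix_list.dat";
    // Assumed classpath location of the embedded copy.
    static final String EMBEDDED = "dk/netarkivet/common/utils/" + FILE_NAME;

    /**
     * Opens the suffix list, preferring a local copy in the given conf
     * directory and falling back to the copy embedded on the classpath.
     */
    public static InputStream open(Path confDir) throws IOException {
        Path override = confDir.resolve(FILE_NAME);
        if (Files.isReadable(override)) {
            return Files.newInputStream(override);
        }
        InputStream in = SuffixListLocator.class.getClassLoader()
                .getResourceAsStream(EMBEDDED);
        if (in == null) {
            throw new IOException("No " + FILE_NAME + " found in " + confDir
                    + " or on the classpath");
        }
        return in;
    }
}
```

With a lookup like this, an operator could update the acceptable TLDs by dropping a fresh public_suffix_list.dat into conf, without touching the settings or rebuilding.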

Comment by Søren Vejrup Carlsen (Inactive) [ 15/Aug/16 ]

Have now merged the code in the special branch NAS-2519 into master and deleted the branch.

Comment by Søren Vejrup Carlsen (Inactive) [ 15/Aug/16 ]

The list of TLDs in the settings files has now been fully replaced by the file common/common-core/src/main/resources/dk/netarkivet/common/utils/public_suffix_list.dat
Whenever this file needs updating, download it from https://www.publicsuffix.org/list/public_suffix_list.dat

Comment by Søren Vejrup Carlsen (Inactive) [ 15/Jul/16 ]

Created NAS-2519 branch for this

Comment by Søren Vejrup Carlsen (Inactive) [ 24/May/16 ]

We should embed the list https://www.publicsuffix.org/list/public_suffix_list.dat in NetarchiveSuite and read our TLDs from this list.
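
As a rough illustration of why a suffix list helps with hosts such as bbc.co.uk, the sketch below derives a domain name as "longest matching suffix plus one label". DomainFromHost and domainOf are hypothetical names, the three-entry suffix set stands in for the full list, and this is not the actual NetarchiveSuite implementation.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

public class DomainFromHost {

    /**
     * Returns the longest matching suffix plus the label in front of it,
     * or null if no suffix in the set matches the host.
     */
    public static String domainOf(String host, Set<String> suffixes) {
        String[] labels = host.toLowerCase(Locale.ROOT).split("\\.");
        // Try candidate suffixes from longest to shortest, always
        // leaving at least one label in front of the suffix.
        for (int i = 1; i < labels.length; i++) {
            String suffix = String.join(".",
                    Arrays.copyOfRange(labels, i, labels.length));
            if (suffixes.contains(suffix)) {
                return labels[i - 1] + "." + suffix;
            }
        }
        return null;
    }

    public static void main(String[] args) {
        Set<String> suffixes = new HashSet<>(Arrays.asList("dk", "uk", "co.uk"));
        System.out.println(domainOf("www.bbc.co.uk", suffixes));     // prints bbc.co.uk
        System.out.println(domainOf("www.netarkivet.dk", suffixes)); // prints netarkivet.dk
    }
}
```

Checking the longest suffix first is what makes "co.uk" take precedence over "uk", so www.bbc.co.uk resolves to bbc.co.uk rather than co.uk.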

Generated at Sat Apr 20 09:44:09 CEST 2024 using Jira 9.4.15#940015-sha1:bdaa9cbecfb6791ea579749728cab771f0dfe90b.