The Create domain page is used for creating new domains in the system. It is possible to create a single domain as well as a list of domains. It is also possible to import domains from a file.
To create one or more domains, enter the domain names in the text box and press Create.
To create domains in bulk from a file, select the file on your local computer with Browse and press Ingest. The file must be a simple list of domain names, one per line. The file must be UTF-8 encoded if it contains special characters.
New domains get a default configuration when created (with the defaultorderxml template and a default maximum number of bytes). New domains also get a defaultseedlist when created.
Existing domains in the system will be skipped by the ingest.
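The file format for bulk ingest described above (one domain name per line, UTF-8 encoded) can be produced with a short script. The following is only a sketch: the function names and the file name are hypothetical, and trimming whitespace, lowercasing and dropping duplicates are assumptions made for tidiness, not requirements stated by the system.

```python
from pathlib import Path


def format_domain_list(domains):
    """Return the file content: one domain name per line.

    Trimming, lowercasing and de-duplication are assumptions of this
    sketch, not documented behaviour of the ingest.
    """
    seen = set()
    lines = []
    for name in domains:
        cleaned = name.strip().lower()
        if cleaned and cleaned not in seen:
            seen.add(cleaned)
            lines.append(cleaned)
    return "".join(line + "\n" for line in lines)


def write_domain_list(domains, path):
    """Write the list as UTF-8 so special characters survive the ingest."""
    Path(path).write_text(format_domain_list(domains), encoding="utf-8")


if __name__ == "__main__":
    # "domains.txt" is a hypothetical file name to select with Browse.
    write_domain_list(["kb.dk", " KB.dk ", "blåbærgrød.dk"], "domains.txt")
```

The resulting file can then be selected with Browse and ingested.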
Find Domain(s) is used to find domains existing in the system.
Write a domain name in the box (e.g. kb.dk). Searching is done on the complete text string. Press Search.
Left and/or right wildcards can be given with *. Domains can also be searched by crawler traps and comments.
If there are several hits, a list of the found domains is displayed.
A link to the domain harvest history for each domain is available in the second column.
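The effect of left and right wildcards in the search can be illustrated with a small sketch. It is not the system's own implementation: the function names are hypothetical, and it assumes that * matches any run of characters while everything else must match the complete domain name.

```python
import re


def wildcard_to_regex(pattern):
    """Translate a search string where * is the only wildcard into a regex."""
    # Escape every literal part so that "." in "kb.dk" only matches a dot.
    parts = [re.escape(part) for part in pattern.split("*")]
    return re.compile(".*".join(parts))


def search_domains(pattern, domains):
    """Return the domains whose complete name matches the pattern."""
    regex = wildcard_to_regex(pattern)
    return [name for name in domains if regex.fullmatch(name)]


if __name__ == "__main__":
    known = ["kb.dk", "netarkivet.dk", "kb.se", "www-kb.dk"]
    print(search_domains("kb.dk", known))  # no wildcard: exact name only
    print(search_domains("*.dk", known))   # left wildcard
    print(search_domains("kb.*", known))   # right wildcard
```

Without a wildcard only the exact name is found, which corresponds to searching on the complete text string.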
Edit domain is an overview of a single domain where it is possible to edit the domain’s definition in the harvest system.
- Free commentary text box.
- ‘Alias of’: Here it can be stated if the domain is an alias of another domain – they are identical in content and only one of them should be harvested. Domains marked as an alias will not be harvested within the snapshot harvests. Alias is defined one year at a time and then has to be renewed.
- ‘Configurations’: New configuration and Edit open a new page: Enter/edit configuration (see below). Unused configurations can be hidden, which can be useful if there are many configurations. A configuration is considered used if it is either the default configuration or a configuration used in an active harvest.
- ‘Seed lists’: New seed list and Edit open a new page: Enter/edit seed list (see below). Unused seed lists can be hidden, which can be useful if there are many seed lists. A seed list is considered used if it is used in a used configuration.
- ‘Crawler traps’: Show crawler traps opens a new text box: Crawler traps (see below).
- Show historical harvest information for … opens a new page Harvest history for domain…. (see Harvest History).
The Enter/edit configuration page is used to define a new configuration or edit an existing one. A configuration contains information about which Harvest template and Seed lists are used (more than one Seed list can be used - hold down CTRL).
At the creation of a new configuration a name is given that cannot be changed afterwards.
Furthermore, it is possible to choose between different harvest templates and to set the maximum number of bytes to be harvested in each harvest of the configuration. At creation, the default maximum number of bytes is chosen for each domain. A default maximum number of objects is also set, but can be overwritten.
Editing seed lists
Enter/edit seed list is used to define a new Seed list or to edit an existing one.
At the creation of a new Seed list a name is given that cannot be changed afterwards.
In the ‘Seeds’ text box a list of seeds to be harvested is given. Seeds can be omitted by writing a # prefix, e.g. #http://www.kb.dk. This can also be used for comments inside the seed list, e.g. '#this seed is important'.
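How the # prefix affects a seed list can be shown with a minimal sketch. The function name is hypothetical, and skipping blank lines is an assumption of the sketch; the system's own parsing may differ in detail.

```python
def active_seeds(seed_text):
    """Return the seeds that would be harvested, skipping '#' lines."""
    seeds = []
    for line in seed_text.splitlines():
        stripped = line.strip()
        # A '#' prefix marks an omitted seed or a comment.
        if not stripped or stripped.startswith("#"):
            continue
        seeds.append(stripped)
    return seeds


if __name__ == "__main__":
    example = """\
#this seed is important
http://www.kb.dk/
#http://www.kb.dk/old/
http://www.kb.dk/en/
"""
    print(active_seeds(example))
```

Only the two lines without a # prefix are returned; the comment and the omitted seed are ignored.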
A crawlertrap is a path followed blindly by the harvester which in principle can continue forever. A typical crawlertrap is a calendar.
To avoid crawlertraps on a domain, the administrator can state parts of URLs that should never be harvested (in any configuration). Matching URLs are omitted in all harvests of the domain and also in other domains harvested in the same job. Be very careful not to give statements so general that they could omit content on other domains (it may be wise always to include the domain name itself in the statement).
The string of text must be stated as a 'regular expression'.
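The difference between a crawlertrap expression that includes the domain name and one that is too general can be sketched as follows. The expressions and URLs are made-up examples, Python's re module stands in for whatever regular expression dialect the harvester uses, and matching against the whole URL is an assumption of this sketch.

```python
import re

# Hypothetical trap for a calendar on kb.dk; the domain name is part of it.
TRAP = re.compile(r".*kb\.dk/calendar/.*")

# The same trap without the domain name: it also hits other domains.
TOO_GENERAL = re.compile(r".*/calendar/.*")


def is_trapped(url, trap):
    """Return True if the whole URL matches the crawlertrap expression."""
    return trap.fullmatch(url) is not None


if __name__ == "__main__":
    for url in ("http://www.kb.dk/calendar/2031/01",
                "http://www.example.org/calendar/2031/01"):
        print(url, is_trapped(url, TRAP), is_trapped(url, TOO_GENERAL))
```

The second URL belongs to another domain: the expression with the domain name leaves it alone, while the general one would omit it whenever both domains are harvested in the same job.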
Harvest history of a domain
The jobs of the finished harvests for a domain can be listed by clicking Show historical harvest information for domainxxx at the bottom of the domain page. The harvest history page includes information on why each harvest stopped. The 'Stopped due to' column shows whether a harvest was stopped unexpectedly, whether it hit the max-bytes limit for the chosen domain, or whether it was stopped because of an error on the harvester machine.
The domain statistics page gives information about the number of subdomains for each unique Top level domain known in the system. IP numbers are counted separately.
The number in the “Number of subdomains” column is clickable and will do a search for all domains matching that Top level domain. This is only applicable to Top level domains with a limited number of subdomains since the matching domains will be listed on one page – and that page will get very long if the system contains hundreds or thousands of domains.
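The kind of count shown on the statistics page can be approximated with a short sketch. It is an illustration only: the function name and the "IP address" label are hypothetical, and taking the text after the last dot as the Top level domain is an assumption about how the grouping is done.

```python
import ipaddress
from collections import Counter


def tld_statistics(domains):
    """Count domains per top level domain, with IP addresses kept apart."""
    counts = Counter()
    for name in domains:
        try:
            ipaddress.ip_address(name)
        except ValueError:
            # Not an IP address: group by the part after the last dot.
            counts[name.rsplit(".", 1)[-1]] += 1
        else:
            counts["IP address"] += 1
    return counts


if __name__ == "__main__":
    stats = tld_statistics(["kb.dk", "netarkivet.dk", "kb.se", "192.168.0.1"])
    for tld, number in sorted(stats.items()):
        print(tld, number)
```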
The alias summary page gives an overview of the domains marked as aliases of other domains in the system. Both domain names are clickable and will open the domain page for the clicked domain.
The “Expires” column shows when each alias expires (12 months after it was marked). The mark does not disappear from the database after 12 months, but the “Overview of Aliases” page will show the expired ones at the top.
To renew an alias for another 12 months, one currently has to open the domain page of the marked domain (the “Domain” column), select “renew alias” and press Save.
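The 12-month rule for aliases amounts to simple date arithmetic, sketched below. The function names are hypothetical, and two details are assumptions of the sketch rather than documented behaviour: an alias marked on 29 February expires on 28 February the following year, and an alias counts as expired from the expiry date itself.

```python
from datetime import date


def alias_expiry(marked_on):
    """Return the date 12 months after the alias was marked or renewed."""
    try:
        return marked_on.replace(year=marked_on.year + 1)
    except ValueError:
        # 29 February has no counterpart in a non-leap year (assumption:
        # fall back to 28 February).
        return marked_on.replace(year=marked_on.year + 1, day=28)


def is_expired(marked_on, today):
    """Return True if the alias needs to be renewed."""
    return today >= alias_expiry(marked_on)


if __name__ == "__main__":
    marked = date(2023, 5, 10)
    print(alias_expiry(marked), is_expired(marked, date.today()))
```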