Class DefaultJobGenerator

  • All Implemented Interfaces:
    JobGenerator

    public class DefaultJobGenerator
    extends Object
    The legacy job generator implementation. It aims to generate jobs that execute in a predictable time by using statistics from previous crawls.
    • Constructor Detail

      • DefaultJobGenerator

        public DefaultJobGenerator()
    • Method Detail

      • getInstance

        public static DefaultJobGenerator getInstance()
        Returns:
        the singleton instance, building it if necessary.
      • processDomainConfigurationSubset

        protected int processDomainConfigurationSubset​(HarvestDefinition harvest,
                                                       Iterator<DomainConfiguration> domainConfSubset)
        Creates new jobs from a collection of configurations. All configurations must use the same order.xml file.
        Parameters:
        harvest - the HarvestDefinition being processed.
        domainConfSubset - the configurations to use to create the jobs
        Returns:
        The number of jobs created
        Throws:
        ArgumentNotValid - if any of the parameters is null or if domainConfSubset does not contain any configurations
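        The grouping of configurations into jobs can be pictured as a greedy packing loop. The following is only a sketch under stated assumptions, not the actual implementation: PackingSketch, pack, and the single numeric capacity are hypothetical, and each configuration is reduced to one size estimate. It mirrors the documented contract in two respects: it consumes an Iterator, and it rejects an empty input.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

// Hypothetical illustration; not taken from the DefaultJobGenerator source.
final class PackingSketch {

    /**
     * Greedily groups size estimates into jobs. A new job is started whenever
     * adding the next estimate would exceed the assumed per-job capacity.
     */
    static List<List<Long>> pack(Iterator<Long> estimates, long capacity) {
        if (estimates == null || !estimates.hasNext()) {
            // Stands in for the documented ArgumentNotValid case.
            throw new IllegalArgumentException("no configurations given");
        }
        List<List<Long>> jobs = new ArrayList<>();
        List<Long> current = null;
        long used = 0;
        while (estimates.hasNext()) {
            long estimate = estimates.next();
            if (current == null || used + estimate > capacity) {
                // The current job cannot accept this estimate: open a new job.
                current = new ArrayList<>();
                jobs.add(current);
                used = 0;
            }
            current.add(estimate);
            used += estimate;
        }
        return jobs;
    }
}
```

        In this sketch the number of jobs created is the size of the returned list. An estimate larger than the capacity still gets a job of its own.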
      • reset

        public static void reset()
        Only to be used by unit tests.
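        For readers unfamiliar with the pattern behind getInstance() and reset(), here is a minimal sketch of a lazily built singleton with a test-only reset. SingletonSketch is a hypothetical stand-in. That reset() discards the shared instance is an assumption, since the documentation above only says it is for unit tests. The synchronized keyword is likewise just one way to make the lazy construction thread-safe.

```java
// Hypothetical stand-in; not the actual DefaultJobGenerator source.
final class SingletonSketch {

    // The shared instance, created on first request.
    private static SingletonSketch instance;

    private SingletonSketch() { }

    /** Returns the shared instance, building it if necessary. */
    static synchronized SingletonSketch getInstance() {
        if (instance == null) {
            instance = new SingletonSketch();
        }
        return instance;
    }

    /** Assumed behaviour: drop the shared instance so a test starts fresh. */
    static synchronized void reset() {
        instance = null;
    }
}
```

        Under these assumptions, a unit test would call reset() in its set-up or tear-down so that state from one test does not leak into the next.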
      • generateJobs

        public int generateJobs​(HarvestDefinition harvest)
        Description copied from interface: JobGenerator
        Generates a series of jobs for the given harvest definition. Note that a job generator is expected to follow the singleton pattern, so implementations of this method should be thread-safe.
        Specified by:
        generateJobs in interface JobGenerator
        Parameters:
        harvest - the harvest definition to process.
        Returns:
        the number of jobs that were generated.
      • canAccept

        public boolean canAccept​(Job job,
                                 DomainConfiguration cfg,
                                 DomainConfiguration previousCfg)
        Description copied from interface: JobGenerator
        Tests whether a configuration fits into this Job. First checks that the configuration uses the same order template and byte limit as the job. The Job limits are then compared against the configuration estimates: true is returned if no limit is exceeded, false otherwise.
        Specified by:
        canAccept in interface JobGenerator
        Parameters:
        job - the job being built.
        cfg - the configuration to check
        previousCfg - if not null, the configuration added to this job immediately prior
        Returns:
        true if adding the configuration to this Job does not exceed any of the Job limits.
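        The order of checks described above can be sketched as follows. This is an illustration under assumptions, not the real logic: AcceptConfigSketch and AcceptJobSketch are hypothetical, simplified stand-ins for DomainConfiguration and Job, and a single object-count limit stands in for whatever limits the real Job enforces. The previousCfg parameter is not modelled.

```java
// Hypothetical, simplified stand-in for DomainConfiguration.
final class AcceptConfigSketch {
    final String orderXmlName;   // name of the order template
    final long maxBytes;         // byte limit of the configuration
    final long expectedObjects;  // assumed size estimate

    AcceptConfigSketch(String orderXmlName, long maxBytes, long expectedObjects) {
        this.orderXmlName = orderXmlName;
        this.maxBytes = maxBytes;
        this.expectedObjects = expectedObjects;
    }
}

// Hypothetical, simplified stand-in for Job.
final class AcceptJobSketch {
    private final String orderXmlName;
    private final long maxBytes;
    private final long maxTotalObjects;  // assumed job-wide limit
    private long totalExpectedObjects;   // running sum of accepted estimates

    AcceptJobSketch(String orderXmlName, long maxBytes, long maxTotalObjects) {
        this.orderXmlName = orderXmlName;
        this.maxBytes = maxBytes;
        this.maxTotalObjects = maxTotalObjects;
    }

    /** Template and byte limit must match, then the estimate must fit. */
    boolean canAccept(AcceptConfigSketch cfg) {
        if (!orderXmlName.equals(cfg.orderXmlName)) {
            return false;
        }
        if (maxBytes != cfg.maxBytes) {
            return false;
        }
        return totalExpectedObjects + cfg.expectedObjects <= maxTotalObjects;
    }

    /** Records an accepted configuration's estimate against the job. */
    void add(AcceptConfigSketch cfg) {
        totalExpectedObjects += cfg.expectedObjects;
    }
}
```

        The cheap equality checks come first, so a mismatched template or byte limit is rejected before any estimate is added up.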
      • editJobOrderXml

        protected void editJobOrderXml​(Job job)
        Once the job has been filled with DomainConfigurations, performs the following operations:
        1. Edits the harvest template to add or remove the deduplicator configuration.
        Parameters:
        job - the job
      • ignoreConfiguration

        public boolean ignoreConfiguration​(DomainConfiguration cfg)
        Description copied from interface: JobGenerator
        Tests whether this configuration should be ignored.
        Specified by:
        ignoreConfiguration in interface JobGenerator
        Parameters:
        cfg - a domain configuration
        Returns:
        true if this configuration should be ignored (for example, because it is disabled in some way, or because all seeds are prefixed with a '#' and so there are no active seeds).
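        The "no active seeds" case can be illustrated with a small check over seed lines. This is a sketch only: SeedFilterSketch and hasNoActiveSeeds are hypothetical names, the seed list is modelled as plain strings, and treating blank lines as inactive is an assumption that the documentation above does not state.

```java
import java.util.List;

// Hypothetical helper; not part of the documented API.
final class SeedFilterSketch {

    /**
     * Returns true when no seed line is active, that is, when every line
     * is blank (an assumption) or starts with '#'.
     */
    static boolean hasNoActiveSeeds(List<String> seedLines) {
        for (String line : seedLines) {
            String trimmed = line.trim();
            if (!trimmed.isEmpty() && !trimmed.startsWith("#")) {
                return false;  // found at least one active seed
            }
        }
        return true;
    }
}
```

        A configuration for which this check returns true would contribute nothing to a crawl, which is consistent with the reason given above for ignoring it.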