Class CrawlDataItem


  • public class CrawlDataItem
    extends Object
    A base class for individual items of crawl data that should be added to the index.
    Author:
    Kristinn Sigurðsson
    • Constructor Detail

      • CrawlDataItem

        public CrawlDataItem()
        Constructor. Creates a new CrawlDataItem with all its String data initialized to null.
      • CrawlDataItem

        public CrawlDataItem(String URL,
                             String contentDigest,
                             String timestamp,
                             String etag,
                             String mimetype,
                             String origin,
                             boolean duplicate)
        Constructor. Creates a new CrawlDataItem with all its data initialized from the constructor arguments.
        Parameters:
        URL - The URL for this CrawlDataItem
        contentDigest - A content digest of the document found at the URL
        timestamp - Date and time at which the content digest was valid for that URL. Format: yyyyMMddHHmmssSSS
        etag - The etag for the URL
        mimetype - MIME type of the document found at the URL
        origin - The origin of the CrawlDataItem (the exact meaning of the origin is outside the scope of this class and it may be any String value)
        duplicate - True if this CrawlDataItem was marked as duplicate
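The timestamp parameter expects the pattern yyyyMMddHHmmssSSS. As a minimal sketch, such a string could be produced and read back with java.time as shown here. TimestampSketch and its two methods are hypothetical helpers, not part of CrawlDataItem, and the use of LocalDateTime is an assumption, because this page does not say which time zone the timestamp refers to.

```java
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;
import java.time.format.DateTimeFormatterBuilder;
import java.time.temporal.ChronoField;

// Hypothetical helper, not part of the documented API.
final class TimestampSketch {

    // yyyyMMddHHmmss followed by exactly three millisecond digits.
    // The milliseconds are appended as a fixed-width value instead of the
    // "SSS" pattern letters, because Java 8 has a known problem parsing
    // "SSS" when it directly follows other digits without a separator.
    private static final DateTimeFormatter FORMAT = new DateTimeFormatterBuilder()
            .appendPattern("yyyyMMddHHmmss")
            .appendValue(ChronoField.MILLI_OF_SECOND, 3)
            .toFormatter();

    // Formats a date-time as a 17-digit timestamp string.
    static String format(LocalDateTime time) {
        return FORMAT.format(time);
    }

    // Parses a 17-digit timestamp string back into a date-time.
    static LocalDateTime parse(String timestamp) {
        return LocalDateTime.parse(timestamp, FORMAT);
    }

    public static void main(String[] args) {
        // Made-up fetch time: 2024-01-02 03:04:05.678
        LocalDateTime fetched = LocalDateTime.of(2024, 1, 2, 3, 4, 5, 678_000_000);
        String timestamp = format(fetched);
        System.out.println(timestamp);                        // 20240102030405678
        System.out.println(parse(timestamp).equals(fetched)); // true
    }
}
```

The resulting string could then be passed as the timestamp argument of the constructor above, or to setTimestamp.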
    • Method Detail

      • getURL

        public String getURL()
        Returns the URL
        Returns:
        the URL
      • setURL

        public void setURL(String URL)
        Sets the URL.
        Parameters:
        URL - the new URL
      • getContentDigest

        public String getContentDigest()
        Returns the document's content digest
        Returns:
        the document's content digest
      • setContentDigest

        public void setContentDigest(String contentDigest)
        Sets the content digest.
        Parameters:
        contentDigest - The new value of the content digest
      • getTimestamp

        public String getTimestamp()
        Returns the timestamp of when the URL was fetched, in the format yyyyMMddHHmmssSSS
        Returns:
        the time at which the URL was fetched
      • setTimestamp

        public void setTimestamp(String timestamp)
        Sets a new timestamp.
        Parameters:
        timestamp - The new timestamp. It should be in the format: yyyyMMddHHmmssSSS
      • getEtag

        public String getEtag()
        Returns the etag that was associated with the document.

        If no etag is available, null is returned.

        Returns:
        the etag.
      • setEtag

        public void setEtag(String etag)
        Sets a new etag.
        Parameters:
        etag - The new etag
      • getMimeType

        public String getMimeType()
        Returns the MIME type that was associated with the document.
        Returns:
        the MIME type.
      • setMimeType

        public void setMimeType(String mimetype)
        Sets a new MIME type.
        Parameters:
        mimetype - The new MIME type
      • getOrigin

        public String getOrigin()
        Returns the "origin" that was associated with the document.
        Returns:
        the origin (may be null if none was provided for the document)
      • setOrigin

        public void setOrigin(String origin)
        Sets a new origin.
        Parameters:
        origin - The new origin
      • isDuplicate

        public boolean isDuplicate()
        Returns whether the CrawlDataItem was marked as duplicate.
        Returns:
        true if duplicate, false otherwise
      • setDuplicate

        public void setDuplicate(boolean duplicate)
        Sets whether this CrawlDataItem is marked as duplicate.
        Parameters:
        duplicate - true if duplicate, false otherwise
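Because getEtag and getOrigin are documented as possibly returning null, callers may want to guard those values before using them. The following is a sketch only: the nested CrawlDataItem is a hypothetical stand-in that mirrors the constructor and a few accessors documented above so that the example compiles on its own, and the real class may behave differently. The describe method, the "-" placeholder, and all sample values are made up for illustration.

```java
// Sketch only, not the library's implementation.
final class CrawlDataItemUsageSketch {

    // Hypothetical stand-in mirroring the documented signatures.
    static class CrawlDataItem {
        private String url;
        private String contentDigest;
        private String timestamp;
        private String etag;
        private String mimetype;
        private String origin;
        private boolean duplicate;

        CrawlDataItem() { }

        CrawlDataItem(String URL, String contentDigest, String timestamp,
                      String etag, String mimetype, String origin,
                      boolean duplicate) {
            this.url = URL;
            this.contentDigest = contentDigest;
            this.timestamp = timestamp;
            this.etag = etag;
            this.mimetype = mimetype;
            this.origin = origin;
            this.duplicate = duplicate;
        }

        String getURL() { return url; }
        String getEtag() { return etag; }
        String getOrigin() { return origin; }
        boolean isDuplicate() { return duplicate; }
        void setDuplicate(boolean duplicate) { this.duplicate = duplicate; }
    }

    // Builds a one-line summary, substituting "-" where a getter returns null.
    static String describe(CrawlDataItem item) {
        String etag = item.getEtag() != null ? item.getEtag() : "-";
        String origin = item.getOrigin() != null ? item.getOrigin() : "-";
        return item.getURL() + " etag=" + etag + " origin=" + origin
                + " duplicate=" + item.isDuplicate();
    }

    public static void main(String[] args) {
        // Sample values are invented; etag and origin are deliberately null.
        CrawlDataItem item = new CrawlDataItem(
                "http://example.org/", "sha1:EXAMPLEDIGEST", "20240102030405678",
                null, "text/html", null, false);
        System.out.println(describe(item));
        item.setDuplicate(true);
        System.out.println(describe(item));
    }
}
```

With the real class on the classpath, the nested stand-in would be dropped and describe would take the library's CrawlDataItem instead.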