Class CrawlDataItem


  • public class CrawlDataItem
    extends java.lang.Object
    A base class for individual items of crawl data that should be added to the index.
    Author:
    Kristinn Sigurðsson
    • Constructor Summary

      Constructors 
      Constructor Description
      CrawlDataItem()
      Constructor.
      CrawlDataItem(java.lang.String URL, java.lang.String contentDigest, java.lang.String timestamp, java.lang.String etag, java.lang.String mimetype, java.lang.String origin, boolean duplicate)
      Constructor.
    • Method Summary

      Modifier and Type Method Description
      java.lang.String getContentDigest()
      Returns the document's content digest.
      java.lang.String getEtag()
      Returns the etag that was associated with the document.
      java.lang.String getMimeType()
      Returns the mimetype that was associated with the document.
      java.lang.String getOrigin()
      Returns the "origin" that was associated with the document.
      java.lang.String getTimestamp()
      Returns the time at which the URL was fetched, in the format yyyyMMddHHmmssSSS.
      java.lang.String getURL()
      Returns the URL
      boolean isDuplicate()
      Returns whether the CrawlDataItem was marked as duplicate.
      void setContentDigest(java.lang.String contentDigest)
      Sets the content digest.
      void setDuplicate(boolean duplicate)
      Sets whether this CrawlDataItem is marked as a duplicate.
      void setEtag(java.lang.String etag)
      Sets a new etag.
      void setMimeType(java.lang.String mimetype)
      Sets a new MIME type.
      void setOrigin(java.lang.String origin)
      Sets a new origin.
      void setTimestamp(java.lang.String timestamp)
      Sets a new timestamp.
      void setURL(java.lang.String URL)
      Sets the URL.
      • Methods inherited from class java.lang.Object

        clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
    • Constructor Detail

      • CrawlDataItem

        public CrawlDataItem()
        Constructor. Creates a new CrawlDataItem with all its data initialized to null.
      • CrawlDataItem

        public CrawlDataItem(java.lang.String URL,
                             java.lang.String contentDigest,
                             java.lang.String timestamp,
                             java.lang.String etag,
                             java.lang.String mimetype,
                             java.lang.String origin,
                             boolean duplicate)
        Constructor. Creates a new CrawlDataItem with all its data initialized via the constructor.
        Parameters:
        URL - The URL for this CrawlDataItem
        contentDigest - A content digest of the document found at the URL
        timestamp - The time at which the content digest was valid for that URL. Format: yyyyMMddHHmmssSSS
        etag - Etag for the URL
        mimetype - MIME type of the document found at the URL
        origin - The origin of the CrawlDataItem (the exact meaning of the origin is outside the scope of this class and it may be any String value)
        duplicate - True if this CrawlDataItem was marked as duplicate
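        The two constructors described above can be compared with a short sketch. To keep the snippet compilable on its own, it declares a hypothetical stand-in class that only mirrors the documented signatures; it is not the library source, and in real use the library's CrawlDataItem would be imported instead. The URL, digest string, and MIME type are placeholder values, and the class and helper names (CrawlDataItemExample, viaConstructor, viaSetters) are made up for illustration.

```java
public class CrawlDataItemExample {

    /** Hypothetical stand-in mirroring the documented signatures; NOT the library source. */
    public static class CrawlDataItem {
        private String url, contentDigest, timestamp, etag, mimetype, origin;
        private boolean duplicate;

        public CrawlDataItem() { }

        public CrawlDataItem(String URL, String contentDigest, String timestamp,
                String etag, String mimetype, String origin, boolean duplicate) {
            this.url = URL;
            this.contentDigest = contentDigest;
            this.timestamp = timestamp;
            this.etag = etag;
            this.mimetype = mimetype;
            this.origin = origin;
            this.duplicate = duplicate;
        }

        public String getURL() { return url; }
        public String getContentDigest() { return contentDigest; }
        public String getTimestamp() { return timestamp; }
        public String getEtag() { return etag; }
        public String getMimeType() { return mimetype; }
        public String getOrigin() { return origin; }
        public boolean isDuplicate() { return duplicate; }

        public void setURL(String URL) { this.url = URL; }
        public void setContentDigest(String contentDigest) { this.contentDigest = contentDigest; }
        public void setTimestamp(String timestamp) { this.timestamp = timestamp; }
        public void setMimeType(String mimetype) { this.mimetype = mimetype; }
        public void setDuplicate(boolean duplicate) { this.duplicate = duplicate; }
    }

    /** Populates every field in one call; etag and origin are left as null here. */
    public static CrawlDataItem viaConstructor() {
        return new CrawlDataItem(
                "http://example.org/",   // placeholder URL
                "sha1:EXAMPLEDIGEST",    // placeholder digest; the real format is not specified here
                "20200315143045123",     // yyyyMMddHHmmssSSS
                null,                    // no etag
                "text/html",
                null,                    // no origin
                false);
    }

    /** Starts from the no-argument constructor and fills the same fields through setters. */
    public static CrawlDataItem viaSetters() {
        CrawlDataItem item = new CrawlDataItem();
        item.setURL("http://example.org/");
        item.setContentDigest("sha1:EXAMPLEDIGEST");
        item.setTimestamp("20200315143045123");
        item.setMimeType("text/html");
        item.setDuplicate(false);
        return item;
    }

    public static void main(String[] args) {
        CrawlDataItem a = viaConstructor();
        CrawlDataItem b = viaSetters();
        System.out.println(a.getURL().equals(b.getURL()));
        System.out.println(a.getTimestamp().equals(b.getTimestamp()));
    }
}
```

        Both routes should leave the item in the same state; the full constructor is simply more compact when every value is known up front.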
    • Method Detail

      • getURL

        public java.lang.String getURL()
        Returns the URL
        Returns:
        the URL
      • setURL

        public void setURL(java.lang.String URL)
        Sets the URL.
        Parameters:
        URL - the new URL
      • getContentDigest

        public java.lang.String getContentDigest()
        Returns the document's content digest.
        Returns:
        the document's content digest
      • setContentDigest

        public void setContentDigest(java.lang.String contentDigest)
        Sets the content digest.
        Parameters:
        contentDigest - The new value of the content digest
      • getTimestamp

        public java.lang.String getTimestamp()
        Returns the time at which the URL was fetched, in the format yyyyMMddHHmmssSSS.
        Returns:
        the time at which the URL was fetched
      • setTimestamp

        public void setTimestamp(java.lang.String timestamp)
        Sets a new timestamp.
        Parameters:
        timestamp - The new timestamp. It should be in the format: yyyyMMddHHmmssSSS
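        A string in the yyyyMMddHHmmssSSS format can be produced with java.time, as in the minimal sketch below. The class name CrawlTimestampExample and the helper toCrawlTimestamp are hypothetical, and the choice of LocalDateTime is an assumption: this page does not say which time zone the timestamp is expected to be in.

```java
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;

public class CrawlTimestampExample {

    // Pattern copied from the documentation: year, month, day, hour, minute,
    // second, then three digits of milliseconds, with no separators.
    private static final DateTimeFormatter FORMAT =
            DateTimeFormatter.ofPattern("yyyyMMddHHmmssSSS");

    /** Formats a fetch time as a 17-character timestamp string. */
    public static String toCrawlTimestamp(LocalDateTime fetchTime) {
        return fetchTime.format(FORMAT);
    }

    public static void main(String[] args) {
        // 15 March 2020, 14:30:45.123 (the last argument is nanoseconds)
        LocalDateTime fetchTime = LocalDateTime.of(2020, 3, 15, 14, 30, 45, 123_000_000);
        System.out.println(toCrawlTimestamp(fetchTime)); // prints 20200315143045123
    }
}
```

        The resulting string could then be passed to setTimestamp. Sub-millisecond precision is truncated by the SSS field.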
      • getEtag

        public java.lang.String getEtag()
        Returns the etag that was associated with the document.

        If no etag is available, null is returned.

        Returns:
        the etag.
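        Because the getter may return null, callers may want to substitute a default before using the value. The sketch below shows one way to do that with java.util.Optional; the class NullableFieldExample and the helper orDefault are hypothetical names, and the string literals stand in for whatever getEtag() returns.

```java
import java.util.Optional;

public class NullableFieldExample {

    /** Returns the value if present, otherwise the supplied fallback. */
    public static String orDefault(String value, String fallback) {
        return Optional.ofNullable(value).orElse(fallback);
    }

    public static void main(String[] args) {
        String missingEtag = null;          // what a getter returns when no etag was recorded
        String presentEtag = "\"abc123\"";  // placeholder etag value

        System.out.println(orDefault(missingEtag, "(no etag)")); // prints (no etag)
        System.out.println(orDefault(presentEtag, "(no etag)")); // prints "abc123"
    }
}
```

        The same pattern would apply to getOrigin(), which is also documented as possibly returning null.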
      • setEtag

        public void setEtag(java.lang.String etag)
        Sets a new etag.
        Parameters:
        etag - The new etag
      • getMimeType

        public java.lang.String getMimeType()
        Returns the mimetype that was associated with the document.
        Returns:
        the mimetype.
      • setMimeType

        public void setMimeType(java.lang.String mimetype)
        Sets a new MIME type.
        Parameters:
        mimetype - The new MIME type
      • getOrigin

        public java.lang.String getOrigin()
        Returns the "origin" that was associated with the document.
        Returns:
        the origin (may be null if none was provided for the document)
      • setOrigin

        public void setOrigin(java.lang.String origin)
        Sets a new origin.
        Parameters:
        origin - A new origin.
      • isDuplicate

        public boolean isDuplicate()
        Returns whether the CrawlDataItem was marked as duplicate.
        Returns:
        true if duplicate, false otherwise
      • setDuplicate

        public void setDuplicate(boolean duplicate)
        Sets whether this CrawlDataItem is marked as a duplicate.
        Parameters:
        duplicate - true if duplicate, false otherwise