Package is.hi.bok.deduplicator
Class CrawlDataItem
java.lang.Object
    is.hi.bok.deduplicator.CrawlDataItem
public class CrawlDataItem extends Object
A base class for individual items of crawl data that should be added to the index.
Author: Kristinn Sigurðsson
-
-
Field Summary
Fields
Modifier and Type    Field            Description
protected String     contentDigest
static String        dateFormat       The proper formatting of setURL(String) and getURL()
protected boolean    duplicate
protected String     etag
protected String     mimetype
protected String     origin
protected String     timestamp
protected String     URL
-
Constructor Summary
Constructors
Constructor          Description
CrawlDataItem()      Constructor.
CrawlDataItem(String URL, String contentDigest, String timestamp, String etag, String mimetype, String origin, boolean duplicate)
                     Constructor.
-
Method Summary
Methods
Modifier and Type    Method                                    Description
String               getContentDigest()                        Returns the document's content digest.
String               getEtag()                                 Returns the etag that was associated with the document.
String               getMimeType()                             Returns the MIME type that was associated with the document.
String               getOrigin()                               Returns the "origin" that was associated with the document.
String               getTimestamp()                            Returns a timestamp for when the URL was fetched, in the format yyyyMMddHHmmssSSS.
String               getURL()                                  Returns the URL.
boolean              isDuplicate()                             Returns whether the CrawlDataItem was marked as duplicate.
void                 setContentDigest(String contentDigest)    Set the content digest.
void                 setDuplicate(boolean duplicate)           Set whether duplicate or not.
void                 setEtag(String etag)                      Set a new etag.
void                 setMimeType(String mimetype)              Set a new MIME type.
void                 setOrigin(String origin)                  Set a new origin.
void                 setTimestamp(String timestamp)            Set a new timestamp.
void                 setURL(String URL)                        Set the URL.
-
-
-
Field Detail
-
dateFormat
public static final String dateFormat
The proper formatting of setURL(String) and getURL()
See Also: Constant Field Values
-
URL
protected String URL
-
contentDigest
protected String contentDigest
-
timestamp
protected String timestamp
-
etag
protected String etag
-
mimetype
protected String mimetype
-
origin
protected String origin
-
duplicate
protected boolean duplicate
-
-
Constructor Detail
-
CrawlDataItem
public CrawlDataItem()
Constructor. Creates a new CrawlDataItem with all its data initialized to null.
-
CrawlDataItem
public CrawlDataItem(String URL, String contentDigest, String timestamp, String etag, String mimetype, String origin, boolean duplicate)
Constructor. Creates a new CrawlDataItem with all its data initialized via the constructor.
Parameters:
URL - The URL for this CrawlDataItem
contentDigest - A content digest of the document found at the URL
timestamp - Date of when the content digest was valid for that URL. Format: yyyyMMddHHmmssSSS
etag - Etag for the URL
mimetype - MIME type of the document found at the URL
origin - The origin of the CrawlDataItem (the exact meaning of the origin is outside the scope of this class and it may be any String value)
duplicate - True if this CrawlDataItem was marked as duplicate
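The full constructor takes its arguments in the order listed above. The following is a minimal sketch of a call, not the library's own code: CrawlDataItemSketch is a hypothetical stand-in that mirrors only the documented constructor signature and a few getters so that the example compiles on its own, and every argument value is made up for illustration. With the real library on the classpath, is.hi.bok.deduplicator.CrawlDataItem would be used instead.

```java
// Hypothetical stand-in mirroring the documented constructor and some getters.
// It is NOT the real is.hi.bok.deduplicator.CrawlDataItem.
class CrawlDataItemSketch {
    protected String URL;
    protected String contentDigest;
    protected String timestamp;
    protected String etag;
    protected String mimetype;
    protected String origin;
    protected boolean duplicate;

    CrawlDataItemSketch(String URL, String contentDigest, String timestamp,
                        String etag, String mimetype, String origin,
                        boolean duplicate) {
        this.URL = URL;
        this.contentDigest = contentDigest;
        this.timestamp = timestamp;
        this.etag = etag;
        this.mimetype = mimetype;
        this.origin = origin;
        this.duplicate = duplicate;
    }

    String getURL() { return URL; }
    String getTimestamp() { return timestamp; }
    String getEtag() { return etag; }
    boolean isDuplicate() { return duplicate; }

    // Builds one item; all values below are illustrative assumptions.
    static CrawlDataItemSketch example() {
        return new CrawlDataItemSketch(
                "http://example.com/index.html", // URL
                "sha1:EXAMPLEDIGEST",            // contentDigest (made-up value)
                "20240131235959123",             // timestamp, yyyyMMddHHmmssSSS
                null,                            // etag: none available
                "text/html",                     // mimetype
                "example-origin",                // origin: any String value
                false);                          // duplicate
    }

    public static void main(String[] args) {
        CrawlDataItemSketch item = example();
        System.out.println(item.getURL() + " fetched at " + item.getTimestamp());
    }
}
```

Because six of the seven parameters are Strings, the compiler cannot catch two arguments swapped by mistake; labelling each argument with a comment, as above, is one way to keep call sites readable.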
-
-
Method Detail
-
getURL
public String getURL()
Returns the URL.
Returns: the URL
-
setURL
public void setURL(String URL)
Set the URL.
Parameters:
URL - the new URL
-
getContentDigest
public String getContentDigest()
Returns the document's content digest.
Returns: the document's content digest
-
setContentDigest
public void setContentDigest(String contentDigest)
Set the content digest.
Parameters:
contentDigest - The new value of the content digest
-
getTimestamp
public String getTimestamp()
Returns a timestamp for when the URL was fetched, in the format yyyyMMddHHmmssSSS.
Returns: the time at which the URL was fetched
-
setTimestamp
public void setTimestamp(String timestamp)
Set a new timestamp.
Parameters:
timestamp - The new timestamp. It should be in the format: yyyyMMddHHmmssSSS
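Timestamps are plain Strings, so the caller is responsible for producing the yyyyMMddHHmmssSSS layout. The sketch below shows one way to format and parse such values with java.text.SimpleDateFormat. The class name CrawlTimestampSketch and its helper methods are hypothetical, and the choice of UTC is an assumption made for reproducibility, since the documentation does not name a time zone.

```java
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.Locale;
import java.util.TimeZone;

// Hypothetical helper, not part of the deduplicator package.
class CrawlTimestampSketch {
    // Pattern quoted in the documentation for timestamp values.
    static final String PATTERN = "yyyyMMddHHmmssSSS";

    // SimpleDateFormat is not thread-safe, so a fresh instance is built per call.
    private static SimpleDateFormat newFormat() {
        SimpleDateFormat format = new SimpleDateFormat(PATTERN, Locale.ROOT);
        format.setTimeZone(TimeZone.getTimeZone("UTC")); // assumption: UTC
        format.setLenient(false); // reject out-of-range fields such as month 13
        return format;
    }

    // Converts milliseconds since the epoch to a 17-character timestamp.
    static String format(long epochMillis) {
        return newFormat().format(new Date(epochMillis));
    }

    // Converts a timestamp back to milliseconds since the epoch.
    static long parse(String timestamp) {
        try {
            return newFormat().parse(timestamp).getTime();
        } catch (ParseException e) {
            throw new IllegalArgumentException(
                    "Not a " + PATTERN + " timestamp: " + timestamp, e);
        }
    }

    public static void main(String[] args) {
        String ts = format(0L);
        System.out.println(ts);        // prints 19700101000000000
        System.out.println(parse(ts)); // prints 0
    }
}
```

A value produced this way could then be passed to setTimestamp(String). Validating with parse before calling the setter is optional; nothing in the documentation says the setter checks the format itself.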
-
getEtag
public String getEtag()
Returns the etag that was associated with the document. If the etag is unavailable, null is returned.
Returns: the etag
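Since getEtag() may return null, callers need a null check before using the value. The following sketch wraps a possibly-null String in java.util.Optional. NullableFieldSketch and describeEtag are hypothetical names introduced only for this illustration, and the methods take a plain String so that the example runs without the library.

```java
import java.util.Optional;

// Hypothetical helper, not part of the deduplicator package.
class NullableFieldSketch {
    // Wraps a possibly-null getter result, such as the value of getEtag().
    static Optional<String> optional(String valueOrNull) {
        return Optional.ofNullable(valueOrNull);
    }

    // Builds a display string, falling back to a placeholder when no etag exists.
    static String describeEtag(String etagOrNull) {
        return optional(etagOrNull)
                .map(etag -> "etag=" + etag)
                .orElse("etag unavailable");
    }

    public static void main(String[] args) {
        System.out.println(describeEtag("abc123")); // prints etag=abc123
        System.out.println(describeEtag(null));     // prints etag unavailable
    }
}
```

The same pattern applies to getOrigin(), which is also documented as possibly returning null.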
-
setEtag
public void setEtag(String etag)
Set a new etag.
Parameters:
etag - The new etag
-
getMimeType
public String getMimeType()
Returns the MIME type that was associated with the document.
Returns: the MIME type
-
setMimeType
public void setMimeType(String mimetype)
Set a new MIME type.
Parameters:
mimetype - The new MIME type
-
getOrigin
public String getOrigin()
Returns the "origin" that was associated with the document.
Returns: the origin (may be null if none was provided for the document)
-
setOrigin
public void setOrigin(String origin)
Set a new origin.
Parameters:
origin - A new origin
-
isDuplicate
public boolean isDuplicate()
Returns whether the CrawlDataItem was marked as duplicate.
Returns: true if duplicate, false otherwise
-
setDuplicate
public void setDuplicate(boolean duplicate)
Set whether duplicate or not.
Parameters:
duplicate - true if duplicate, false otherwise
-
-