Details
New Feature
Resolution: Unresolved
Minor
Description
This is an idea for a simple loadSeeds testing tool that neither checks for duplicates in HBase nor inserts anything into HBase. It does, however, read the blacklists from HBase.
The steps of the tool are as follows:
Argument: outlinks file with or without annotations
Should use PROD webdanica_settings.xml
pseudo-code:
for each line in the outlinks file do {
    if line is not an acceptable seed, skip seed
    if seed matches any of the enabled blacklists, skip seed
    print out seed
}
Duplicates are removed afterwards with the Unix command:
sort | uniq
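The pseudo-code above could be sketched roughly as follows. This is only an illustration in Python, not the actual loadSeeds code. It assumes that the seed URL is the first whitespace-separated field on each line (with any annotations after it), that a blacklist is a list of regular expressions, and that an "acceptable" seed is an http(s) URL with a host. The names is_acceptable_seed, filter_seeds and print_seeds are hypothetical, and reading the enabled blacklists from HBase is left out: the patterns are simply passed in.

```python
import re
from urllib.parse import urlsplit


def is_acceptable_seed(url):
    """Hypothetical acceptance check: an http(s) URL that has a host."""
    try:
        parts = urlsplit(url)
    except ValueError:
        # urlsplit rejects some malformed URLs, e.g. a broken IPv6 host
        return False
    return parts.scheme in ("http", "https") and bool(parts.netloc)


def filter_seeds(lines, blacklist_patterns):
    """Yield seeds that are acceptable and match none of the blacklists."""
    compiled = [re.compile(p) for p in blacklist_patterns]
    for line in lines:
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        seed = fields[0]  # assumption: URL first, annotations (if any) after
        if not is_acceptable_seed(seed):
            continue  # not an acceptable seed, skip it
        if any(rx.search(seed) for rx in compiled):
            continue  # matches an enabled blacklist, skip it
        yield seed


def print_seeds(path, blacklist_patterns):
    """Print every surviving seed of the outlinks file, one per line."""
    with open(path, encoding="utf-8") as fh:
        for seed in filter_seeds(fh, blacklist_patterns):
            print(seed)
```

No duplicate check is done here, in line with the idea above: the printed output is meant to be piped through sort | uniq.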