NetarchiveSuite-Github


With the new changes I pushed the HDFS URI as a setting, but I actually didn't use it for anything.
Setting 'fs.defaultFS' to e.g. 'hdfs://node1' is enough for paths to be resolved correctly, so I haven't used the URI-prefix. Should probably have just removed it.
But maybe that is a matter of how explicit we want to be.
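For illustration, a minimal sketch of the resolution described above, using the plain Hadoop API; the node name and paths are assumptions, not values from the branch:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class DefaultFsExample {
    public static void main(String[] args) throws Exception {
        // With fs.defaultFS set, plain paths resolve against that namenode,
        // so an explicit hdfs://node1 prefix on every path is redundant.
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://node1");
        FileSystem fs = FileSystem.get(conf);
        Path implicit = new Path("/nas_input/file.txt");             // resolves to hdfs://node1/nas_input/file.txt
        Path explicit = new Path("hdfs://node1/nas_input/file.txt"); // same file, spelled out
        System.out.println(fs.exists(implicit) == fs.exists(explicit)); // both refer to the same file
    }
}
```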

You're right, there shouldn't be any need.

I got into the habit of writing more explicit paths earlier on, when I was learning Hadoop.
It would get tedious when switching between starting jobs from inside the VM and starting them from the outside, and then having the same path be interpreted as either '/user/vagrant/XXX' or '/user/rbkr/XXX'.

I am actually a bit unsure about how best to handle this one right now.

Right now jobs are spawned for each individual ArchiveFile when indexing, so for each inputFile there will never be more than the one mapping output 'part-m-00000'.
Of course I could trivially write the code to iterate through all files matching the 'part-m-XXXXX'-pattern and do the copying to local file etc., but since we're only looking at a single file when interacting with the ArchiveFile-class, it doesn't really make sense to do so.
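For reference, a sketch of what that iteration could look like with the Hadoop FileSystem API; the directory names and class name are made up for illustration:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class PartFileCopier {
    // Copy every map output matching 'part-m-*' from a job's output dir to a local dir.
    public static void copyParts(Path hdfsOutputDir, String localDir) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        FileStatus[] parts = fs.globStatus(new Path(hdfsOutputDir, "part-m-*"));
        if (parts == null) {
            return; // output dir does not exist
        }
        for (FileStatus part : parts) {
            fs.copyToLocalFile(part.getPath(), new Path(localDir, part.getPath().getName()));
        }
    }
}
```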

The README file should be more informative.

Preconditions for running this code?

I agree

No, this part of the code is just copied from jolf/abr's repository 'Hadoopifications'. I didn't get around to looking at what else to do in this part, so I just left the comment as it was.

Is jolf/abr correcting this code or?

This works, but I do not recommend using pass-through references this way. It's a "bad smell".
It makes it very difficult to see if anything goes wrong.
Don't use this in production code!

I don't recommend deleting classes for testing, unless they are one-timers for getting the code running.

For testing, an MD5 checksum is reasonable, but for production use it's not that safe.
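If a stronger digest is ever wanted outside of tests, the JDK makes the algorithm a one-line choice; SHA-256 below is just an example alternative, not something this branch uses:

```java
import java.security.MessageDigest;

public class ChecksumExample {
    // Compute a checksum with a pluggable algorithm, e.g. "MD5" or "SHA-256".
    public static byte[] checksum(byte[] data, String algorithm) throws Exception {
        MessageDigest digest = MessageDigest.getInstance(algorithm);
        return digest.digest(data);
    }
}
```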

The local inputFile is actually deleted when copying it to HDFS on line 220 (first argument is a boolean called 'delSrc').
Should I also just clean all of the input in HDFS after completing the job?

But yeah, since the hadoopInputNameFile in HDFS doesn't have any use after running the job, I suppose I should at least mark that for deletion.
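For context, roughly what that copy looks like, with made-up paths; deleteOnExit is one way to "mark for deletion", though an explicit delete after the job would also work:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class InputUploader {
    public static void upload() throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        Path localInput = new Path("file:///tmp/hadoop_input.txt"); // assumed local input file
        Path hdfsInput = new Path("/nas_input/hadoop_input.txt");   // assumed HDFS destination
        // delSrc = true: the local file is removed once it has been copied into HDFS.
        fs.copyFromLocalFile(true, localInput, hdfsInput);
        // Mark the HDFS copy for deletion when this FileSystem is closed / the JVM exits.
        fs.deleteOnExit(hdfsInput);
    }
}
```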

When I deploy this code I get the following error:

2020-05-05 09:34:31.982 [ProducerThread] ERROR dk.netarkivet.wayback.indexer.HibernateUtil.initialiseFactory - Could not connect to hibernate object store - exiting
org.hibernate.PropertyNotFoundException: Could not find a setter for property newFileInWaybackTempDir in class dk.netarkivet.wayback.indexer.ArchiveFile
    at org.hibernate.property.BasicPropertyAccessor.createSetter(BasicPropertyAccessor.java:240)

The problem is that Hibernate interprets this method as a getter for a bean variable, and therefore complains that there is no corresponding field or setter. The solution is to change the name of the method.
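To illustrate the rename, a hypothetical fragment (the method body is a placeholder, not the real one): Hibernate treats any getXxx() method on the mapped class as a bean property, so a name that no longer matches the getter pattern avoids the lookup:

```java
import java.io.File;

public class ArchiveFileNamingExample {
    // A method named getNewFileInWaybackTempDir() would make Hibernate look for a
    // 'newFileInWaybackTempDir' field or setter. Renaming it breaks that pattern:
    public File createNewFileInWaybackTempDir() {
        return new File("/tmp/wayback", "example.cdx"); // placeholder body for illustration
    }
}
```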

Consider future-proofing by making this configurable instead of hard-coded.

I don't think I agree. We don't want to index a whole folder. A glob is possible - for indexing single files the glob is just the filename. But "file of files" is the most flexible solution.
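For illustration, the two options look roughly like this with the Hadoop job API; using NLineInputFormat for the "file of files" case is an assumption for this sketch, not necessarily what the branch does:

```java
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.NLineInputFormat;

public class JobInputExample {
    public static void configureInput(Job job, boolean useFileOfFiles) throws Exception {
        if (useFileOfFiles) {
            // "File of files": each line of hadoop_input.txt names one archive file to index.
            job.setInputFormatClass(NLineInputFormat.class);
            NLineInputFormat.addInputPath(job, new Path("/nas_input/hadoop_input.txt"));
        } else {
            // Glob: for indexing a single file the glob is simply that file's name.
            FileInputFormat.addInputPath(job, new Path("/nas_input/archive-00001.warc.gz"));
        }
    }
}
```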

Add a finally to delete the hdfs file.
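A minimal sketch of the suggestion, assuming the HDFS input path is known; the job submission itself is elided:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class JobCleanupExample {
    public static void runJobAndCleanUp(Path hdfsInput) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        try {
            // ... submit the Hadoop job that reads hdfsInput and wait for it to finish ...
        } finally {
            fs.delete(hdfsInput, false); // remove the HDFS file even if the job fails
        }
    }
}
```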

Do the TODO.

Are there files to be deleted here? The inputFile?

Better with a unique input file name for each job.

The more I think about it, the better I like this implementation. I still don't think it will scale to production use, but because it uses only bitmag and hadoop APIs it should work for any architecture - test, devel, stage - regardless of exactly how we share files between bitmag and hadoop. So when we reimplement with local file mounts we should keep this bitmag-api based access as a configurable option.

Delete? or not?

Better left blank as it doesn't really make sense to have a default value for the hdfs path.

Since these Reference & Repository settings are only for a specific test installation setup they shouldn't be in production code. Settings for our local installations are maintained in the docker project. Settings for the devel platform are in stash at https://sbprojects.statsbiblioteket.dk/stash/projects/NARK/repos/devel-config/browse/resources and https://sbprojects.statsbiblioteket.dk/stash/projects/NARK/repos/bitmag_test_config/browse

As a general rule we should only include default values for settings where a default value makes sense. So I would leave this setting empty, and add a value for <settingsDir> to the actual settings file you pass to the NetarchiveSuite application when you start it. For example in the dockerised NAS this is specified in the start script https://github.com/netarchivesuite/netarchivesuite-docker-compose/blob/bitmag/nasapp/start.sh.j2 which points to the settings file https://github.com/netarchivesuite/netarchivesuite-docker-compose/blob/bitmag/nasapp/settings.xml.j2 .

Thought I had changed that back, my bad. I made it non-final back when I started, because I didn't understand how the settings were loaded.

Agreed. Move it to a BitmagUtils class.

Is there any need for "/user/vagrant" in these paths anyway? Since these are directories under hdfs://, why not just call them "/nas_input" and "/nas_output" or something like that?

Yes, these three all need to be in settings files since these values will need to be set for devel/stage/production.

Also document HADOOP_INPUT_FOLDER_PATH and HADOOP_OUTPUT_FOLDER_PATH

Why is this no longer final?