Colin Rosenthal

When I deploy this code I get the following error:
2020-05-05 09:34:31.982 [ProducerThread] ERROR dk.netarkivet.wayback.indexer.HibernateUtil.initialiseFactory
Could not connect to hibernate object store - exiting
org.hibernate.PropertyNotFoundException: Could not find a setter for property newFileInWaybackTempDir in class dk.netarkivet.wayback.indexer.ArchiveFile
    at org.hibernate.property.BasicPropertyAccessor.createSetter(BasicPropertyAccessor.java:240)


The problem is that Hibernate interprets this method as a getter for a bean property, and therefore complains that there is no corresponding field or setter. The solution is to rename the method so it no longer follows the getter naming convention.
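The bean-property rule can be seen with plain JDK introspection; the method names below are invented for illustration, not taken from the actual ArchiveFile class:

```java
import java.beans.BeanInfo;
import java.beans.Introspector;
import java.beans.PropertyDescriptor;

// Any method named getXxx/isXxx is reported as a bean property, which is
// why Hibernate then demands a matching field or setter. A renamed method
// with no get/is prefix is simply ignored.
class ArchiveFileExample {
    // Looks like a getter: introspection reports a read-only property
    // "newFileInWaybackTempDir" even though no such field exists.
    public String getNewFileInWaybackTempDir() { return "/tmp/example"; }

    // Renamed version: not picked up as a property at all.
    public String createNewFileInWaybackTempDir() { return "/tmp/example"; }
}

public class BeanPropertyDemo {
    public static void main(String[] args) throws Exception {
        BeanInfo info = Introspector.getBeanInfo(ArchiveFileExample.class, Object.class);
        for (PropertyDescriptor pd : info.getPropertyDescriptors()) {
            System.out.println(pd.getName());
        }
    }
}
```

Alternatively the original name could be kept and the method annotated @Transient, so that Hibernate skips it when mapping the entity.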

Consider future-proofing by making this configurable instead of hard-coded.

I don't think I agree. We don't want to index a whole folder. A glob is possible - for indexing single files the glob is just the filename. But "file of files" is the most flexible solution.

Add a finally to delete the hdfs file.
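The shape of the fix, sketched with java.nio local files standing in for the HDFS calls (the real code would use the Hadoop FileSystem API instead):

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

public class HdfsCleanupSketch {
    // Sketch only: run the job, and delete the temporary input file in a
    // finally block so it is removed whether the job succeeds or throws.
    static String runJob(Path inputFile) throws IOException {
        try {
            // ... submit the hadoop job using inputFile ...
            return "job ran on " + inputFile.getFileName();
        } finally {
            Files.deleteIfExists(inputFile);  // cleanup happens on every exit path
        }
    }
}
```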

Do the TODO.

Are there files to be deleted here? The inputFile?

Better with a unique input file name for each job.
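For example (the prefix and naming scheme here are invented, just to show one way to guarantee uniqueness per job):

```java
import java.util.UUID;

public class InputNameSketch {
    // Combine the job id with a random UUID so concurrent or re-run jobs
    // can never collide on the same input file name.
    static String uniqueInputName(long jobId) {
        return "nas_input_" + jobId + "_" + UUID.randomUUID() + ".txt";
    }
}
```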

The more I think about it, the better I like this implementation. I still don't think it will scale to production use, but because it uses only bitmag and hadoop APIs it should work for any architecture - test, devel, stage - regardless of exactly how we share files between bitmag and hadoop. So when we reimplement with local file mounts we should keep this bitmag-api based access as a configurable option.
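The "configurable option" idea could be as simple as a strategy switch; all class and setting names below are invented for illustration, and the real code would wire this to the NetarchiveSuite settings machinery:

```java
// Sketch of keeping both access paths behind one setting.
interface FileAccessStrategy {
    String resolve(String filename);
}

class BitmagApiAccess implements FileAccessStrategy {
    public String resolve(String filename) {
        return "bitmag-api:" + filename;  // fetch via bitmag client, then insert into hdfs
    }
}

class LocalMountAccess implements FileAccessStrategy {
    public String resolve(String filename) {
        return "file:///mnt/bitmag/" + filename;  // already mounted as a local file
    }
}

class FileAccessFactory {
    // e.g. driven by a boolean setting such as "useBitmagApiAccess"
    static FileAccessStrategy forSetting(boolean useBitmagApi) {
        return useBitmagApi ? new BitmagApiAccess() : new LocalMountAccess();
    }
}
```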

Delete? or not?

Better left blank as it doesn't really make sense to have a default value for the hdfs path.

Since these Reference & Repository settings are only for a specific test installation setup they shouldn't be in production code. Settings for our local installations are maintained in the docker project. Settings for the devel platform are in stash at https://sbprojects.statsbiblioteket.dk/stash/projects/NARK/repos/devel-config/browse/resources and https://sbprojects.statsbiblioteket.dk/stash/projects/NARK/repos/bitmag_test_config/browse

As a general rule we should only include default values for settings where a default value makes sense. So I would leave this setting empty, and add a value for <settingsDir> to the actual settings file you pass to the NetarchiveSuite application when you start it. For example in the dockerised NAS this is specified in the start script https://github.com/netarchivesuite/netarchivesuite-docker-compose/blob/bitmag/nasapp/start.sh.j2 which points to the settings file https://github.com/netarchivesuite/netarchivesuite-docker-compose/blob/bitmag/nasapp/settings.xml.j2 .

Agreed. Move it to a BitmagUtils class.

Is there any need for "/user/vagrant" in these paths anyway? Since these are directories under hdfs:// why not just call them "/nas_input" and "nas_output" or something like that?

Yes, these three all need to be in settings files since these values will need to be set for devel/stage/production. Also document HADOOP_INPUT_FOLDER_PATH and HADOOP_OUTPUT_FOLDER_PATH

Why is this no longer final?

This code gets the file from bitmag and then inserts it into hdfs. This will work, but it is too inefficient to copy large harvested files around like this. We need to replace this with a solution where the bitmag files are all already mounted as local files inside hadoop. Then we add these local file paths to the hadoopInputNameFiles as file:// paths.
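A sketch of what writing the input file would look like under that scheme (paths and names are invented for illustration):

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;

public class LocalPathInputSketch {
    // Instead of copying each bitmag file into HDFS, list the locally
    // mounted paths as file:// URIs in the hadoop input name file.
    static void writeInputFile(Path inputFile, List<Path> mountedFiles) throws IOException {
        StringBuilder lines = new StringBuilder();
        for (Path p : mountedFiles) {
            lines.append(p.toUri()).append('\n');  // e.g. file:///mnt/bitmag/1234.warc
        }
        Files.writeString(inputFile, lines.toString());
    }
}
```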

I think this is a single commit aa5ff1dbdaefd04652a9c66506d20f1a6ae01dc3 which we could offer as a pull request.

This is already in ia/master

This is something we added because heritrix was treating inline image data as links. I think we should make a pull request for it.

This is already what is in ia/master.

This is already in ia/master

Merge issues NAS-heritrix/IIPC-heritrix

Obviously something weird here as "contrib" is there twice.

?? Where does this come from? What does it do?

?

The fallback is "false", meaning no match, meaning "accept this url". Is this the best choice? Does it matter? Should the behaviour be configurable?

This commit is mission-critical for us because we have had serious problems with pathological regexes. To get it accepted we probably should make the default behaviour backwards compatible, i.e. infinite timeout, even though that's probably a terrible idea. I'd like to persuade Andy to allow a sensible default like 20s.

(There's also a possibly better solution which is to use a 3rd party regex engine with guaranteed runtime complexity e.g. https://www.brics.dk/automaton/faq.html)
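A timeout can also be retrofitted onto java.util.regex itself with the well-known trick of wrapping the input in a CharSequence whose charAt checks a deadline; this is a sketch of the technique, not necessarily how the commit implements it:

```java
import java.util.regex.Pattern;

// The regex engine reads input exclusively through CharSequence.charAt,
// so a wrapper can enforce a deadline. A catastrophically backtracking
// match calls charAt constantly and hits the check almost immediately.
class InterruptibleCharSequence implements CharSequence {
    private final CharSequence inner;
    private final long deadlineNanos;

    private InterruptibleCharSequence(CharSequence inner, long deadlineNanos) {
        this.inner = inner;
        this.deadlineNanos = deadlineNanos;
    }

    static InterruptibleCharSequence withTimeout(CharSequence s, long timeoutMillis) {
        return new InterruptibleCharSequence(s, System.nanoTime() + timeoutMillis * 1_000_000L);
    }

    @Override public char charAt(int index) {
        if (System.nanoTime() > deadlineNanos) {
            throw new IllegalStateException("regex evaluation timed out");
        }
        return inner.charAt(index);
    }

    @Override public int length() { return inner.length(); }

    @Override public CharSequence subSequence(int start, int end) {
        return new InterruptibleCharSequence(inner.subSequence(start, end), deadlineNanos);
    }

    @Override public String toString() { return inner.toString(); }
}
```

Matching a pathological pattern like `(a+)+$` against a long run of a's ending in b normally backtracks for hours; wrapped like this, `Pattern.compile("(a+)+$").matcher(InterruptibleCharSequence.withTimeout(input, 20_000)).matches()` aborts once the deadline passes.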

This shouldn't be hardcoded. Why is this not just a bean-value that can be set in crawler beans?
