NetarchiveSuite-Github

How do you get Maven to name it? I might be misunderstanding what you mean.

I think it's fine to code it this way to start with and then optimise later. Also, you may not have had my "background knowledge" that some of our metadata files are very large - those from snapshot harvests, where we harvest thousands of domains in a single job, so the crawl logs are enormous.

We should see whether it's possible to turn that class into a streaming implementation that doesn't keep the whole index in memory - but that's perhaps a task for another day.
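
For illustration, a minimal sketch of the streaming idea - class and method names are hypothetical, not from the codebase:

import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

public class StreamingIndexWriter {
    /** Streams index entries from source to destination one line at a time,
     *  so memory use stays constant no matter how large the index is. */
    public static void stream(Path source, Path destination) throws IOException {
        try (BufferedReader reader = Files.newBufferedReader(source, StandardCharsets.UTF_8);
             BufferedWriter writer = Files.newBufferedWriter(destination, StandardCharsets.UTF_8)) {
            String line;
            while ((line = reader.readLine()) != null) {
                writer.write(line);   // transform or filter the entry here as needed
                writer.newLine();
            }
        }
    }
}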

Have to disagree. Don't catch Exceptions in unit tests. Fail early and fail hard.
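
For illustration, the preferred shape (JUnit 4 style; the test body is a placeholder):

import static org.junit.Assert.assertEquals;
import org.junit.Test;

public class FailFastExampleTest {
    // Don't wrap the body in try/catch: declare the exception instead,
    // so any unexpected failure fails the test with the full stack trace.
    @Test
    public void parsesRecord() throws Exception {
        assertEquals(42, Integer.parseInt("42"));  // placeholder assertion
    }
}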

Rename the variable to miniYarnCluster.

Thanks for your tip. It looks like the problems mentioned were isolated to my PC. (Also, pillar-frontend-1.3.3.zip was corrupt, so it showed 0 bytes.)

I think this is false by default, so we may not be testing this part of the code. But then again, I don't know if it's enabled in production either! It's basically a "historic" function for deduplicating against old harvests from before we harvested things compressed, so it's less and less relevant as time goes by.

Wouldn't it be a good idea to add the url-pattern and mime-pattern to the logging here?
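
Something along these lines, assuming SLF4J-style parameterized logging and hypothetical field names:

log.info("Filtering crawl-log lines with url-pattern '{}' and mime-pattern '{}'",
        urlPattern, mimePattern);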

You have a point. Not sure what my thought process was here, to be honest. I'm guessing I thought this was somehow a sensible way to enumerate the lines for logging.

I think the name/javadoc isn't very transparent. At least document that these directories are for requests from the IndexRequestServer.

Whoever wrote this code should probably add a class comment describing what this implementation actually does.

Not sure. I think in this case the calling routine has more of the information needed to handle the error. The caller can log what the job was trying to do, e.g. which harvest job it was trying to index. This class doesn't know anything about that.
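
A sketch of that division of labour - all names (IndexingCaller, HarvestIndexer, jobId) are made up for illustration:

import java.io.IOException;
import java.nio.file.Path;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

public class IndexingCaller {
    private static final Logger log = LoggerFactory.getLogger(IndexingCaller.class);

    interface HarvestIndexer {
        void index(Path metadataFile) throws IOException;
    }

    void indexJob(long jobId, Path metadataFile, HarvestIndexer indexer) {
        try {
            indexer.index(metadataFile);  // the low-level class just throws
        } catch (IOException e) {
            // Only the caller knows which job and file were involved.
            log.error("Indexing failed for harvest job {} (file {})", jobId, metadataFile, e);
        }
    }
}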

This should really be a fatal error - throw some sort of RuntimeException right away.
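
Something like this, i.e. wrap the checked exception in an unchecked one so the job aborts immediately (the project may already have a suitable unchecked exception type; a plain RuntimeException is shown here):

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

public class FailFast {
    /** Creates the directory or aborts right away; illustrative only. */
    static Path createOrDie(Path dir) {
        try {
            return Files.createDirectories(dir);
        } catch (IOException e) {
            // Fatal: don't limp on with a missing directory.
            throw new RuntimeException("Could not create directory " + dir, e);
        }
    }
}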

I suppose we should rename this setting if we're going to deploy all our mapred jobs in the same jar. Maybe get Maven to give it a sensible name as well while we're at it.
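
For the naming part, a finalName in the module's pom should be enough - the jar name below is just an example:

<!-- pom.xml of the module that builds the mapred jar -->
<build>
  <finalName>nas-hadoop-jobs</finalName>
</build>

That makes the build produce target/nas-hadoop-jobs.jar instead of the default artifactId-version name.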

Do you delete and recreate the top-level output directory for each job? The javadoc says you do, but I think the method doesn't - it only deletes it if it's occupied by a file of the same name. Surely you want to delete the input and output directories for each job when it is finished, but not the overall directories, in case there are multiple jobs running simultaneously.
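
A sketch of that cleanup, deleting only the per-job subdirectory and leaving the shared parent alone (paths and names hypothetical):

import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Comparator;
import java.util.stream.Stream;

public class JobDirCleanup {
    /** Deletes jobDir and everything under it, but not the shared parent. */
    static void deleteJobDir(Path jobDir) throws IOException {
        if (!Files.exists(jobDir)) {
            return;
        }
        try (Stream<Path> paths = Files.walk(jobDir)) {
            paths.sorted(Comparator.reverseOrder())  // children before parents
                 .forEach(p -> {
                     try {
                         Files.delete(p);
                     } catch (IOException e) {
                         throw new UncheckedIOException(e);
                     }
                 });
        }
    }
}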

I'm a bit worried that the names and documentation are confusing here. These methods don't really "initialize" files, do they? They initialize the directories and return a Path that points inside them.
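
For example, a name and javadoc that match what the method actually does (all names hypothetical):

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

public class HadoopJobPaths {
    private final Path inputRoot;

    HadoopJobPaths(Path inputRoot) {
        this.inputRoot = inputRoot;
    }

    /**
     * Ensures the shared input directory exists and returns a Path inside it
     * for the given job's input file. The file itself is not created.
     */
    Path prepareJobInputPath(long jobID) throws IOException {
        Files.createDirectories(inputRoot);
        return inputRoot.resolve("job_" + jobID + "_input.txt");
    }
}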

If you wrote directly to the context with each readLine() couldn't you save the memory used in the metadataLines collection?
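
A sketch of what that could look like in the mapper, assuming the standard Hadoop MapReduce API and illustrative class/field names:

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class MetadataLinesMapper extends Mapper<LongWritable, Text, NullWritable, Text> {
    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        Path file = new Path(value.toString());  // input value names a file to read
        FileSystem fs = file.getFileSystem(context.getConfiguration());
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(fs.open(file), StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                // Emit each line as it is read - no metadataLines collection.
                context.write(NullWritable.get(), new Text(line));
            }
        }
    }
}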

As stated in the Readme, the uber-jar is placed under 'NAS/wayback/wayback-indexer/target/wayback-indexer-5.7-IIPCH3-SNAPSHOT-withdeps.jar' so you copy it using

cp <NAS-project>/wayback/wayback-indexer/target/wayback-indexer-5.7-IIPCH3-SNAPSHOT-withdeps.jar <NAS-DC>/nasapp/wayback-uber-jar.jar

just as you would copy the distribution zip-files.

If the uber-jar is not built in your NAS-project when running 'mvn clean package', there is something wrong with your local files.

Might be better to catch the IOException, log the error, and return null.

Better to catch the IOException instead of throwing it to the caller, log it, and return null?
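
I.e. something like this (class and method names hypothetical), with the caveat that returning null pushes a null-check onto every caller:

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

public class SafeReader {
    private static final Logger log = LoggerFactory.getLogger(SafeReader.class);

    /** Returns the file's bytes, or null if reading fails. */
    static byte[] readOrNull(Path file) {
        try {
            return Files.readAllBytes(file);
        } catch (IOException e) {
            log.error("Could not read {}", file, e);
            return null;  // callers must handle null
        }
    }
}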

Ad 4.
The netarchivesuite 'NARK-1900-metadata' branch does not solve how the "uber.jar" is copied to NAS-DC. It is also unclear from this issue how this "uber.jar" is copied. The jar file is not found in this branch even after 'mvn clean package' has created the target directory, etc. Please clarify what is missing.

Merge remote-tracking branch 'origin/bitmag' into warc-record-api

# Conflicts:
# common/common-core/src/main/java/dk/netarkivet/common/CommonSettings.java

A few final edits.

The missing step is in the Readme, which step 4 says to follow, but yes, I'll write it out explicitly.

I believe the test description is missing a step. You need to build the uber-jar with "mvn -DskipTests clean package" and then make sure that the test you are running loads the jarfile you built. Rasmus Bohl Kristensen, can you confirm and add the necessary details to the test description?

Followed the described process, but got this error:

ERROR: Service 'nasidx' failed to build: COPY failed: stat /var/lib/docker/tmp/docker-builder112586924/wayback-uber-jar.jar: no such file or directory
pech@pech-ThinkPad-T590:~/netarchivesuite-docker-compose$

Corrected.

Latest changes in getFile etc.

Cleanup according to review.