Details

Type: Improvement
Resolution: Fixed
Priority: Critical
Fix Version/s: None
Affects Version/s: None
Component/s: HADOOP
Labels: None
Sprint: Sprint 2, Sprint 3 - webdanica, Sprint 4 - webdanica, Sprint 5 - webdanica

Description

Language identification must not be confused by multiple languages in the extracted text.

This was observed when using Tika:

 new org.apache.tika.language.LanguageIdentifier(String content).getLanguage()

In that case Tika was misled into reporting the language as Dutch.
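
One possible way to make the result less sensitive to mixed-language input is to run detection per paragraph and weight each result by paragraph length, rather than passing the whole extracted text in one call. The sketch below only illustrates that idea and is not the fix applied in this issue. The class `ChunkedLanguageVote`, its methods, and the stopword-based `toyDetect` are hypothetical stand-ins so that the example runs without Tika; with Tika on the classpath, a detector such as `s -> new LanguageIdentifier(s).getLanguage()` could presumably be passed in instead.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

// Hypothetical helper: per-paragraph language detection with a length-weighted vote.
class ChunkedLanguageVote {

    // Toy stand-in detector (NOT Tika): counts a few English and Dutch stopwords.
    static String toyDetect(String text) {
        String[] en = {"the", "and", "is", "of"};
        String[] nl = {"de", "het", "een", "en"};
        int enHits = 0;
        int nlHits = 0;
        for (String word : text.toLowerCase().split("\\W+")) {
            for (String s : en) if (s.equals(word)) enHits++;
            for (String s : nl) if (s.equals(word)) nlHits++;
        }
        if (enHits == 0 && nlHits == 0) return "unknown";
        return enHits >= nlHits ? "en" : "nl";
    }

    // Splits on blank lines, detects each paragraph separately, and sums
    // the character count per detected language.
    static Map<String, Integer> voteByParagraph(String text, Function<String, String> detector) {
        Map<String, Integer> charsPerLanguage = new HashMap<>();
        for (String paragraph : text.split("\\n\\s*\\n")) {
            String trimmed = paragraph.trim();
            if (trimmed.isEmpty()) continue;
            charsPerLanguage.merge(detector.apply(trimmed), trimmed.length(), Integer::sum);
        }
        return charsPerLanguage;
    }

    // Returns the language with the most characters, ignoring "unknown" paragraphs.
    static String dominant(Map<String, Integer> votes) {
        String best = "unknown";
        int bestChars = -1;
        for (Map.Entry<String, Integer> e : votes.entrySet()) {
            if (e.getKey().equals("unknown")) continue;
            if (e.getValue() > bestChars) {
                best = e.getKey();
                bestChars = e.getValue();
            }
        }
        return best;
    }

    public static void main(String[] args) {
        String text = "The cat is on the roof of the house and the dog is in the garden.\n\n"
                + "De kat zit op het dak.\n\n"
                + "The end of the story is near and the sun is low.";
        Map<String, Integer> votes = voteByParagraph(text, ChunkedLanguageVote::toyDetect);
        System.out.println(votes + " -> " + dominant(votes));
    }
}
```

The per-language character counts also expose minority languages in the text, which a single whole-document call would hide.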

Attachments

all-da.wikipedia.org-report.txt, 17 kB, 16/Jun/16 12:00 PM
da_wikipedia_check.zip, 444 kB, 25/May/16 12:57 PM
da.wikipedia.org-report.txt, 11 kB, 25/May/16 1:56 PM
download_wikipedia.zip, 227 kB, 19/May/16 3:23 PM
dutch_english.txt, 1 kB, 02/May/16 3:19 PM
dutch.txt, 0.5 kB, 02/May/16 3:19 PM
en_wikipedia_check.zip, 481 kB, 25/May/16 12:57 PM
en.wikipedia.org-report.txt, 11 kB, 25/May/16 1:56 PM
english.txt, 0.5 kB, 02/May/16 3:19 PM
french_english.txt, 0.7 kB, 02/May/16 3:54 PM
languages_wikipedia.csv, 8 kB, 19/May/16 3:47 PM
nl-urls-harvestlog.txt, 3 kB, 08/Sep/16 10:04 AM
nl-urls-harvestlog.txt.report.txt, 9 kB, 08/Sep/16 10:04 AM
spanish_english.txt, 1 kB, 02/May/16 4:33 PM
wikipedia_da_check.zip, 441 kB, 16/Jun/16 12:00 PM

People

Assignee: Stephen Hunt

Reporter: Søren Vejrup Carlsen (Inactive)

Watchers: 2

Dates

Created: 27/Apr/16 2:41 PM

Updated: 06/Oct/16 12:49 PM

Resolved: 06/Oct/16 12:49 PM