  1. WebDanica
  2. WEBDAN-41

Language identification should not be confused by multiple languages in the text extract

Details

    • Type: Improvement
    • Resolution: Fixed
    • Priority: Critical
    • None
    • None
    • HADOOP
    • None
    • Sprint: Sprint 2, Sprint 3 - webdanica, Sprint 4 - webdanica, Sprint 5 - webdanica

    Description

      Language identification should not be confused when the text extract contains multiple languages.

      The problem was seen when using Tika's language identifier:

       org.apache.tika.language.LanguageIdentifier.LanguageIdentifier(String content).getLanguage()
      

      Here Tika was misled into reporting the language as Dutch.
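
      One possible way to make the check less sensitive to mixed-language input is to identify the language per chunk of the text extract and report the share of each language, instead of asking for a single verdict on the whole string. The following is a minimal sketch under stated assumptions, not the fix applied in this issue: the class `MixedLanguageCheck`, its methods, and the toy stopword detector are all hypothetical, and the detector merely stands in for a real one such as the Tika call above.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.function.Function;

// Hypothetical helper, not part of WebDanica or Tika.
final class MixedLanguageCheck {

    // Toy stopword lists, for illustration only; a real setup would plug in a proper detector.
    private static final String[] LANGS = {"da", "nl", "en"};
    private static final List<Set<String>> STOPWORDS = List.of(
            Set.of("og", "det", "er", "ikke", "en", "af"),
            Set.of("de", "het", "een", "en", "van", "niet", "is"),
            Set.of("the", "and", "is", "of", "not", "a", "to"));

    // Returns the language whose stopwords occur most often in the chunk, or "unknown".
    static String detectChunk(String chunk) {
        int[] hits = new int[LANGS.length];
        for (String token : chunk.toLowerCase(Locale.ROOT).split("[^\\p{L}]+")) {
            for (int i = 0; i < LANGS.length; i++) {
                if (STOPWORDS.get(i).contains(token)) {
                    hits[i]++;
                }
            }
        }
        int best = -1;
        int bestHits = 0;
        for (int i = 0; i < LANGS.length; i++) {
            if (hits[i] > bestHits) { // strict comparison: ties keep the earlier language
                best = i;
                bestHits = hits[i];
            }
        }
        return best < 0 ? "unknown" : LANGS[best];
    }

    // Splits the text on blank lines, detects each chunk separately, and
    // returns each language's share of the total characters.
    static Map<String, Double> languageShares(String text, Function<String, String> detector) {
        Map<String, Integer> chars = new LinkedHashMap<>();
        int total = 0;
        for (String chunk : text.split("\\n\\s*\\n")) {
            String trimmed = chunk.trim();
            if (trimmed.isEmpty()) {
                continue;
            }
            chars.merge(detector.apply(trimmed), trimmed.length(), Integer::sum);
            total += trimmed.length();
        }
        Map<String, Double> shares = new LinkedHashMap<>();
        for (Map.Entry<String, Integer> e : chars.entrySet()) {
            shares.put(e.getKey(), e.getValue() / (double) total);
        }
        return shares;
    }

    public static void main(String[] args) {
        String text = "Det er ikke en fejl, og det er en dansk tekst.\n\n"
                + "Het is niet een fout van de software.";
        // Prints the per-language character shares for the mixed sample.
        System.out.println(languageShares(text, MixedLanguageCheck::detectChunk));
    }
}
```

      With per-chunk shares, a page that is mostly Danish with a Dutch or English passage can still be recognised as containing Danish, and a caller can apply whatever threshold suits the use case. Passing the detector as a `Function<String, String>` keeps the aggregation independent of the detection library.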

      Attachments

        1. all-da.wikipedia.org-report.txt
          17 kB
        2. da_wikipedia_check.zip
          444 kB
        3. da.wikipedia.org-report.txt
          11 kB
        4. download_wikipedia.zip
          227 kB
        5. dutch_english.txt
          1 kB
        6. dutch.txt
          0.5 kB
        7. en_wikipedia_check.zip
          481 kB
        8. en.wikipedia.org-report.txt
          11 kB
        9. english.txt
          0.5 kB
        10. french_english.txt
          0.7 kB
        11. languages_wikipedia.csv
          8 kB
        12. nl-urls-harvestlog.txt
          3 kB
        13. nl-urls-harvestlog.txt.report.txt
          9 kB
        14. spanish_english.txt
          1 kB
        15. wikipedia_da_check.zip
          441 kB

        Activity

          People

            sthu Stephen Hunt
            svc Søren Vejrup Carlsen (Inactive)
            Watchers:
            2
