Commit 221877e0 authored by Atrax Nicolas

Add PDFtoTSV to Streamlit

parent 0637e805
locale,key,value
fr,title,"# PDF To TSV"
en,title,"# PDF To TSV"
fr,text,"Convertit un fichier PDF en fichier TSV compatible avec Gargantext"
en,text,"Convert a PDF file into a TSV file compatible with GarganText"
fr,file,"Choisir un fichier"
en,file,"Choose a file"
fr,new_file,"Télécharge ton fichier TSV :"
en,new_file,"Download your TSV file : "
fr,author,"Auteur(s) : "
en,author,"Author(s) : "
fr,titlePDF,"Titre : "
en,titlePDF,"Title : "
fr,watermark,"Filigrane : "
en,watermark,"Watermark : "
fr,submit," Soumettre "
en,submit,"Submit "
fr,warning,"Attention ! Plusieurs langues ont été détectées pour la source : "
fr,warning2,"Les langues suivantes ont été détectées : "
en,warning,"Warning ! Multiple languages have been detected at the source : "
en,warning2,"The following languages have been detected : "
fr,globalWarning, "Attention ! Plusieurs langues ont été détectées entre vos pdf ! Les langues suivantes ont été détectées : "
en,globalWarning,"Warning ! Multiple languages have been detected for your pdfs file ! The following languages have been detected : "
fr,advice,"Cela pourrait affecter massivement l'analyse de GarganText. Vous pouvez régler ça en traduisant avec l'outil TsvTranslator."
en,advice,"This could massively affect the analysis of Gargantext.\nYou can correct this by translation with the TsvTranslator tool."
Copyright 2014-2015 Michal "Mimino" Danilak
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
include README.md
include LICENSE
include NOTICE
include MANIFEST.in
include requirements.txt
include langdetect/utils/messages.properties
recursive-include langdetect/profiles *
language-detection license
==========================
Copyright (c) 2010-2014 Cybozu Labs, Inc. All rights reserved.
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.
Metadata-Version: 2.1
Name: langdetect
Version: 1.0.9
Summary: Language detection library ported from Google's language-detection.
Home-page: https://github.com/Mimino666/langdetect
Author: Michal Mimino Danilak
Author-email: michal.danilak@gmail.com
License: MIT
Description: langdetect
==========
[![Build Status](https://travis-ci.org/Mimino666/langdetect.svg?branch=master)](https://travis-ci.org/Mimino666/langdetect)
Port of Nakatani Shuyo's [language-detection](https://github.com/shuyo/language-detection) library (version from 03/03/2014) to Python.
Installation
============
$ pip install langdetect
Supported Python versions: 2.7 and 3.4+.
Languages
=========
``langdetect`` supports 55 languages out of the box ([ISO 639-1 codes](https://en.wikipedia.org/wiki/List_of_ISO_639-1_codes)):
af, ar, bg, bn, ca, cs, cy, da, de, el, en, es, et, fa, fi, fr, gu, he,
hi, hr, hu, id, it, ja, kn, ko, lt, lv, mk, ml, mr, ne, nl, no, pa, pl,
pt, ro, ru, sk, sl, so, sq, sv, sw, ta, te, th, tl, tr, uk, ur, vi, zh-cn, zh-tw
Basic usage
===========
To detect the language of the text:
```python
>>> from langdetect import detect
>>> detect("War doesn't show who's right, just who's left.")
'en'
>>> detect("Ein, zwei, drei, vier")
'de'
```
To find out the probabilities for the top languages:
```python
>>> from langdetect import detect_langs
>>> detect_langs("Otec matka syn.")
[sk:0.572770823327, pl:0.292872522702, cs:0.134356653968]
```
**NOTE**
The language detection algorithm is non-deterministic: if you run it on a text which is too short or too ambiguous, you may get different results every time.
To enforce consistent results, call the following code before the first language detection:
```python
from langdetect import DetectorFactory
DetectorFactory.seed = 0
```
How to add a new language?
========================
You need to create a new language profile. The easiest way to do it is to use the [langdetect.jar](https://github.com/shuyo/language-detection/raw/master/lib/langdetect.jar) tool, which can generate language profiles from Wikipedia abstract database files or plain text.
Wikipedia abstract database files can be retrieved from "Wikipedia Downloads" ([http://download.wikimedia.org/](http://download.wikimedia.org/)). They are named '(language code)wiki-(version)-abstract.xml' (e.g. 'enwiki-20101004-abstract.xml').
usage: ``java -jar langdetect.jar --genprofile -d [directory path] [language codes]``
- Specify the directory containing the abstract databases with the -d option.
- This tool can handle gzip-compressed files.
Remark: the Chinese database filenames are 'zhwiki-(version)-abstract-zh-cn.xml' and 'zhwiki-(version)-abstract-zh-tw.xml', so they must be renamed to 'zh-cnwiki-(version)-abstract.xml' and 'zh-twwiki-(version)-abstract.xml'.
To generate a language profile from plain text, use the genprofile-text command.
usage: ``java -jar langdetect.jar --genprofile-text -l [language code] [text file path]``
For more details see [language-detection Wiki](https://code.google.com/archive/p/language-detection/wikis/Tools.wiki).
Original project
================
This library is a direct port of Google's [language-detection](https://code.google.com/p/language-detection/) library from Java to Python. All the classes and methods are unchanged, so for more information see the project's website or wiki.
Presentation of the language detection algorithm: [http://www.slideshare.net/shuyo/language-detection-library-for-java](http://www.slideshare.net/shuyo/language-detection-library-for-java).
Keywords: language detection library
Platform: UNKNOWN
Classifier: Development Status :: 5 - Production/Stable
Classifier: Intended Audience :: Developers
Classifier: License :: OSI Approved :: Apache Software License
Classifier: Operating System :: OS Independent
Classifier: Programming Language :: Python :: 2
Classifier: Programming Language :: Python :: 2.7
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.4
Classifier: Programming Language :: Python :: 3.5
Classifier: Programming Language :: Python :: 3.6
Classifier: Programming Language :: Python :: 3.7
Classifier: Programming Language :: Python :: 3.8
Description-Content-Type: text/markdown
langdetect
==========
[![Build Status](https://travis-ci.org/Mimino666/langdetect.svg?branch=master)](https://travis-ci.org/Mimino666/langdetect)
Port of Nakatani Shuyo's [language-detection](https://github.com/shuyo/language-detection) library (version from 03/03/2014) to Python.
Installation
============
$ pip install langdetect
Supported Python versions: 2.7 and 3.4+.
Languages
=========
``langdetect`` supports 55 languages out of the box ([ISO 639-1 codes](https://en.wikipedia.org/wiki/List_of_ISO_639-1_codes)):
af, ar, bg, bn, ca, cs, cy, da, de, el, en, es, et, fa, fi, fr, gu, he,
hi, hr, hu, id, it, ja, kn, ko, lt, lv, mk, ml, mr, ne, nl, no, pa, pl,
pt, ro, ru, sk, sl, so, sq, sv, sw, ta, te, th, tl, tr, uk, ur, vi, zh-cn, zh-tw
Basic usage
===========
To detect the language of the text:
```python
>>> from langdetect import detect
>>> detect("War doesn't show who's right, just who's left.")
'en'
>>> detect("Ein, zwei, drei, vier")
'de'
```
To find out the probabilities for the top languages:
```python
>>> from langdetect import detect_langs
>>> detect_langs("Otec matka syn.")
[sk:0.572770823327, pl:0.292872522702, cs:0.134356653968]
```
**NOTE**
The language detection algorithm is non-deterministic: if you run it on a text which is too short or too ambiguous, you may get different results every time.
To enforce consistent results, call the following code before the first language detection:
```python
from langdetect import DetectorFactory
DetectorFactory.seed = 0
```
How to add a new language?
========================
You need to create a new language profile. The easiest way to do it is to use the [langdetect.jar](https://github.com/shuyo/language-detection/raw/master/lib/langdetect.jar) tool, which can generate language profiles from Wikipedia abstract database files or plain text.
Wikipedia abstract database files can be retrieved from "Wikipedia Downloads" ([http://download.wikimedia.org/](http://download.wikimedia.org/)). They are named '(language code)wiki-(version)-abstract.xml' (e.g. 'enwiki-20101004-abstract.xml').
usage: ``java -jar langdetect.jar --genprofile -d [directory path] [language codes]``
- Specify the directory containing the abstract databases with the -d option.
- This tool can handle gzip-compressed files.
Remark: the Chinese database filenames are 'zhwiki-(version)-abstract-zh-cn.xml' and 'zhwiki-(version)-abstract-zh-tw.xml', so they must be renamed to 'zh-cnwiki-(version)-abstract.xml' and 'zh-twwiki-(version)-abstract.xml'.
To generate a language profile from plain text, use the genprofile-text command.
usage: ``java -jar langdetect.jar --genprofile-text -l [language code] [text file path]``
For more details see [language-detection Wiki](https://code.google.com/archive/p/language-detection/wikis/Tools.wiki).
Original project
================
This library is a direct port of Google's [language-detection](https://code.google.com/p/language-detection/) library from Java to Python. All the classes and methods are unchanged, so for more information see the project's website or wiki.
Presentation of the language detection algorithm: [http://www.slideshare.net/shuyo/language-detection-library-for-java](http://www.slideshare.net/shuyo/language-detection-library-for-java).
LICENSE
MANIFEST.in
NOTICE
README.md
requirements.txt
setup.py
langdetect/__init__.py
langdetect/detector.py
langdetect/detector_factory.py
langdetect/lang_detect_exception.py
langdetect/language.py
langdetect.egg-info/PKG-INFO
langdetect.egg-info/SOURCES.txt
langdetect.egg-info/dependency_links.txt
langdetect.egg-info/requires.txt
langdetect.egg-info/top_level.txt
langdetect/profiles/af
langdetect/profiles/ar
langdetect/profiles/bg
langdetect/profiles/bn
langdetect/profiles/ca
langdetect/profiles/cs
langdetect/profiles/cy
langdetect/profiles/da
langdetect/profiles/de
langdetect/profiles/el
langdetect/profiles/en
langdetect/profiles/es
langdetect/profiles/et
langdetect/profiles/fa
langdetect/profiles/fi
langdetect/profiles/fr
langdetect/profiles/gu
langdetect/profiles/he
langdetect/profiles/hi
langdetect/profiles/hr
langdetect/profiles/hu
langdetect/profiles/id
langdetect/profiles/it
langdetect/profiles/ja
langdetect/profiles/kn
langdetect/profiles/ko
langdetect/profiles/lt
langdetect/profiles/lv
langdetect/profiles/mk
langdetect/profiles/ml
langdetect/profiles/mr
langdetect/profiles/ne
langdetect/profiles/nl
langdetect/profiles/no
langdetect/profiles/pa
langdetect/profiles/pl
langdetect/profiles/pt
langdetect/profiles/ro
langdetect/profiles/ru
langdetect/profiles/sk
langdetect/profiles/sl
langdetect/profiles/so
langdetect/profiles/sq
langdetect/profiles/sv
langdetect/profiles/sw
langdetect/profiles/ta
langdetect/profiles/te
langdetect/profiles/th
langdetect/profiles/tl
langdetect/profiles/tr
langdetect/profiles/uk
langdetect/profiles/ur
langdetect/profiles/vi
langdetect/profiles/zh-cn
langdetect/profiles/zh-tw
langdetect/tests/__init__.py
langdetect/tests/test_detector.py
langdetect/tests/test_language.py
langdetect/utils/__init__.py
langdetect/utils/lang_profile.py
langdetect/utils/messages.properties
langdetect/utils/messages.py
langdetect/utils/ngram.py
langdetect/utils/unicode_block.py
from .detector_factory import DetectorFactory, PROFILES_DIRECTORY, detect, detect_langs
from .lang_detect_exception import LangDetectException
import random
import re
import six
from six.moves import zip, xrange
from .lang_detect_exception import ErrorCode, LangDetectException
from .language import Language
from .utils.ngram import NGram
from .utils.unicode_block import unicode_block
class Detector(object):
'''
Detector detects the language of a given text.
Instances are constructed via the factory class DetectorFactory.
After appending target text to a Detector instance with .append(string),
the detector provides language detection results via .detect() or .get_probabilities().
The .detect() method returns the single language name with the highest probability.
The .get_probabilities() method returns a list of languages and their probabilities.
The detector has several tunable parameters:
see .set_alpha(double), .set_max_text_length(int) and .set_prior_map(dict).
Example:
from langdetect.detector_factory import DetectorFactory
factory = DetectorFactory()
factory.load_profile('/path/to/profile/directory')
def detect(text):
detector = factory.create()
detector.append(text)
return detector.detect()
def detect_langs(text):
detector = factory.create()
detector.append(text)
return detector.get_probabilities()
'''
ALPHA_DEFAULT = 0.5
ALPHA_WIDTH = 0.05
ITERATION_LIMIT = 1000
PROB_THRESHOLD = 0.1
CONV_THRESHOLD = 0.99999
BASE_FREQ = 10000
UNKNOWN_LANG = 'unknown'
URL_RE = re.compile(r'https?://[-_.?&~;+=/#0-9A-Za-z]{1,2076}')
MAIL_RE = re.compile(r'[-_.0-9A-Za-z]{1,64}@[-_0-9A-Za-z]{1,255}[-_.0-9A-Za-z]{1,255}')
def __init__(self, factory):
self.word_lang_prob_map = factory.word_lang_prob_map
self.langlist = factory.langlist
self.seed = factory.seed
self.random = random.Random()
self.text = ''
self.langprob = None
self.alpha = self.ALPHA_DEFAULT
self.n_trial = 7
self.max_text_length = 10000
self.prior_map = None
self.verbose = False
def set_verbose(self):
self.verbose = True
def set_alpha(self, alpha):
self.alpha = alpha
def set_prior_map(self, prior_map):
'''Set prior information about language probabilities.'''
self.prior_map = [0.0] * len(self.langlist)
sump = 0.0
for i in xrange(len(self.prior_map)):
lang = self.langlist[i]
if lang in prior_map:
p = prior_map[lang]
if p < 0:
raise LangDetectException(ErrorCode.InitParamError, 'Prior probability must be non-negative.')
self.prior_map[i] = p
sump += p
if sump <= 0.0:
raise LangDetectException(ErrorCode.InitParamError, 'At least one prior probability must be non-zero.')
for i in xrange(len(self.prior_map)):
self.prior_map[i] /= sump
def set_max_text_length(self, max_text_length):
'''Specify the max size of target text to use for language detection.
The default value is 10000 (10 KB).
'''
self.max_text_length = max_text_length
def append(self, text):
'''Append the target text for language detection.
If the total size of target text exceeds the limit size specified by
Detector.set_max_text_length(int), the rest is cut down.
'''
text = self.URL_RE.sub(' ', text)
text = self.MAIL_RE.sub(' ', text)
text = NGram.normalize_vi(text)
pre = 0
for i in xrange(min(len(text), self.max_text_length)):
ch = text[i]
if ch != ' ' or pre != ' ':
self.text += ch
pre = ch
def cleaning_text(self):
'''Clean the text before detection
(eliminate URLs, e-mail addresses, and Latin characters when the text is mostly non-Latin).
'''
latin_count, non_latin_count = 0, 0
for ch in self.text:
if 'A' <= ch <= 'z':
latin_count += 1
elif ch >= six.u('\u0300') and unicode_block(ch) != 'Latin Extended Additional':
non_latin_count += 1
if latin_count * 2 < non_latin_count:
text_without_latin = ''
for ch in self.text:
if ch < 'A' or 'z' < ch:
text_without_latin += ch
self.text = text_without_latin
def detect(self):
'''Detect language of the target text and return the language name
which has the highest probability.
'''
probabilities = self.get_probabilities()
if probabilities:
return probabilities[0].lang
return self.UNKNOWN_LANG
def get_probabilities(self):
if self.langprob is None:
self._detect_block()
return self._sort_probability(self.langprob)
def _detect_block(self):
self.cleaning_text()
ngrams = self._extract_ngrams()
if not ngrams:
raise LangDetectException(ErrorCode.CantDetectError, 'No features in text.')
self.langprob = [0.0] * len(self.langlist)
self.random.seed(self.seed)
for t in xrange(self.n_trial):
prob = self._init_probability()
alpha = self.alpha + self.random.gauss(0.0, 1.0) * self.ALPHA_WIDTH
i = 0
while True:
self._update_lang_prob(prob, self.random.choice(ngrams), alpha)
if i % 5 == 0:
if self._normalize_prob(prob) > self.CONV_THRESHOLD or i >= self.ITERATION_LIMIT:
break
if self.verbose:
six.print_('>', self._sort_probability(prob))
i += 1
for j in xrange(len(self.langprob)):
self.langprob[j] += prob[j] / self.n_trial
if self.verbose:
six.print_('==>', self._sort_probability(prob))
def _init_probability(self):
'''Initialize the map of language probabilities.
If a prior map was specified, use it as the initial map.
'''
if self.prior_map is not None:
return list(self.prior_map)
else:
return [1.0 / len(self.langlist)] * len(self.langlist)
def _extract_ngrams(self):
'''Extract n-grams from target text.'''
RANGE = list(xrange(1, NGram.N_GRAM + 1))
result = []
ngram = NGram()
for ch in self.text:
ngram.add_char(ch)
if ngram.capitalword:
continue
for n in RANGE:
# optimized w = ngram.get(n)
if len(ngram.grams) < n:
break
w = ngram.grams[-n:]
if w and w != ' ' and w in self.word_lang_prob_map:
result.append(w)
return result
def _update_lang_prob(self, prob, word, alpha):
'''Update language probabilities with an N-gram string (N=1,2,3).'''
if word is None or word not in self.word_lang_prob_map:
return False
lang_prob_map = self.word_lang_prob_map[word]
if self.verbose:
six.print_('%s(%s): %s' % (word, self._unicode_encode(word), self._word_prob_to_string(lang_prob_map)))
weight = alpha / self.BASE_FREQ
for i in xrange(len(prob)):
prob[i] *= weight + lang_prob_map[i]
return True
def _word_prob_to_string(self, prob):
result = ''
for j in xrange(len(prob)):
p = prob[j]
if p >= 0.00001:
result += ' %s:%.5f' % (self.langlist[j], p)
return result
def _normalize_prob(self, prob):
'''Normalize probabilities and check convergence by the maximum probability.
'''
maxp, sump = 0.0, sum(prob)
for i in xrange(len(prob)):
p = prob[i] / sump
if maxp < p:
maxp = p
prob[i] = p
return maxp
def _sort_probability(self, prob):
result = [Language(lang, p) for (lang, p) in zip(self.langlist, prob) if p > self.PROB_THRESHOLD]
result.sort(reverse=True)
return result
def _unicode_encode(self, word):
buf = ''
for ch in word:
if ch >= six.u('\u0080'):
st = hex(0x10000 + ord(ch))[2:]
while len(st) < 4:
st = '0' + st
buf += r'\u' + st[1:5]
else:
buf += ch
return buf
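The tunable parameters documented above (alpha, seed, prior map) are never exercised elsewhere in this diff; here is a minimal sketch using only APIs defined in this file and in detector_factory.py below, with illustrative prior weights:
```python
from langdetect.detector_factory import DetectorFactory, PROFILES_DIRECTORY

factory = DetectorFactory()
factory.load_profile(PROFILES_DIRECTORY)
factory.set_seed(0)  # fixed seed -> the trials in _detect_block() are reproducible

detector = factory.create(alpha=0.5)  # smoothing parameter
# Restrict detection to French and English; languages absent from the prior
# map keep a prior of 0.0, and the weights are normalized internally.
detector.set_prior_map({'fr': 0.6, 'en': 0.4})
detector.append("Bonjour tout le monde")
print(detector.detect())             # highest-probability language, e.g. 'fr'
print(detector.get_probabilities())  # list of Language objects, e.g. [fr:0.99...]
```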
import os
from os import path
import sys
try:
import simplejson as json
except ImportError:
import json
from .detector import Detector
from .lang_detect_exception import ErrorCode, LangDetectException
from .utils.lang_profile import LangProfile
class DetectorFactory(object):
'''
Language Detector Factory Class.
This class manages the initialization and construction of Detector instances.
Before using the language detection library,
load profiles with DetectorFactory.load_profile(str)
and set initialization parameters.
To run language detection,
construct a Detector instance via DetectorFactory.create().
See also Detector's sample code.
'''
seed = None
def __init__(self):
self.word_lang_prob_map = {}
self.langlist = []
def load_profile(self, profile_directory):
list_files = os.listdir(profile_directory)
if not list_files:
raise LangDetectException(ErrorCode.NeedLoadProfileError, 'No language profiles found in: ' + profile_directory)
langsize, index = len(list_files), 0
for filename in list_files:
if filename.startswith('.'):
continue
filename = path.join(profile_directory, filename)
if not path.isfile(filename):
continue
f = None
try:
if sys.version_info[0] < 3:
f = open(filename, 'r')
else:
f = open(filename, 'r', encoding='utf-8')
json_data = json.load(f)
profile = LangProfile(**json_data)
self.add_profile(profile, index, langsize)
index += 1
except IOError:
raise LangDetectException(ErrorCode.FileLoadError, 'Cannot open "%s"' % filename)
except:
raise LangDetectException(ErrorCode.FormatError, 'Profile format error in "%s"' % filename)
finally:
if f:
f.close()
def load_json_profile(self, json_profiles):
langsize, index = len(json_profiles), 0
if langsize < 2:
raise LangDetectException(ErrorCode.NeedLoadProfileError, 'Need at least 2 profiles.')
for json_profile in json_profiles:
try:
json_data = json.loads(json_profile)
profile = LangProfile(**json_data)
self.add_profile(profile, index, langsize)
index += 1
except:
raise LangDetectException(ErrorCode.FormatError, 'Profile format error.')
def add_profile(self, profile, index, langsize):
lang = profile.name
if lang in self.langlist:
raise LangDetectException(ErrorCode.DuplicateLangError, 'Duplicate language profile.')
self.langlist.append(lang)
for word in profile.freq:
if word not in self.word_lang_prob_map:
self.word_lang_prob_map[word] = [0.0] * langsize
length = len(word)
if 1 <= length <= 3:
prob = 1.0 * profile.freq.get(word) / profile.n_words[length - 1]
self.word_lang_prob_map[word][index] = prob
def clear(self):
self.langlist = []
self.word_lang_prob_map = {}
def create(self, alpha=None):
'''Construct Detector instance with smoothing parameter.'''
detector = self._create_detector()
if alpha is not None:
detector.set_alpha(alpha)
return detector
def _create_detector(self):
if not self.langlist:
raise LangDetectException(ErrorCode.NeedLoadProfileError, 'Need to load profiles.')
return Detector(self)
def set_seed(self, seed):
self.seed = seed
def get_lang_list(self):
return list(self.langlist)
PROFILES_DIRECTORY = path.join(path.dirname(__file__), 'profiles')
_factory = None
def init_factory():
global _factory
if _factory is None:
_factory = DetectorFactory()
_factory.load_profile(PROFILES_DIRECTORY)
def detect(text):
init_factory()
detector = _factory.create()
detector.append(text)
return detector.detect()
def detect_langs(text):
init_factory()
detector = _factory.create()
detector.append(text)
return detector.get_probabilities()
_error_codes = {
'NoTextError': 0,
'FormatError': 1,
'FileLoadError': 2,
'DuplicateLangError': 3,
'NeedLoadProfileError': 4,
'CantDetectError': 5,
'CantOpenTrainData': 6,
'TrainDataFormatError': 7,
'InitParamError': 8,
}
ErrorCode = type('ErrorCode', (), _error_codes)
class LangDetectException(Exception):
def __init__(self, code, message):
super(LangDetectException, self).__init__(message)
self.code = code
def get_code(self):
return self.code
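Since detect() raises CantDetectError when a text yields no features (see _detect_block above), a caller such as the PDFtoTSV page would likely want a defensive wrapper; a small sketch (safe_detect is a hypothetical helper, not part of the library):
```python
from langdetect import detect
from langdetect.lang_detect_exception import ErrorCode, LangDetectException

def safe_detect(text):
    """Return a language code, or None when the text has no usable features
    (e.g. an empty or purely numeric PDF page)."""
    try:
        return detect(text)
    except LangDetectException as e:
        if e.get_code() == ErrorCode.CantDetectError:
            return None
        raise
```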
class Language(object):
'''
Language stores a detected language and its probability.
Detector.get_probabilities() returns a list of Language objects.
'''
def __init__(self, lang, prob):
self.lang = lang
self.prob = prob
def __repr__(self):
if self.lang is None:
return ''
return '%s:%s' % (self.lang, self.prob)
def __lt__(self, other):
return self.prob < other.prob
import unittest
import six
from langdetect.detector_factory import DetectorFactory
from langdetect.utils.lang_profile import LangProfile
class DetectorTest(unittest.TestCase):
TRAINING_EN = 'a a a b b c c d e'
TRAINING_FR = 'a b b c c c d d d'
TRAINING_JA = six.u('\u3042 \u3042 \u3042 \u3044 \u3046 \u3048 \u3048')
JSON_LANG1 = '{"freq":{"A":3,"B":6,"C":3,"AB":2,"BC":1,"ABC":2,"BBC":1,"CBA":1},"n_words":[12,3,4],"name":"lang1"}'
JSON_LANG2 = '{"freq":{"A":6,"B":3,"C":3,"AA":3,"AB":2,"ABC":1,"ABA":1,"CAA":1},"n_words":[12,5,3],"name":"lang2"}'
def setUp(self):
self.factory = DetectorFactory()
profile_en = LangProfile('en')
for w in self.TRAINING_EN.split():
profile_en.add(w)
self.factory.add_profile(profile_en, 0, 3)
profile_fr = LangProfile('fr')
for w in self.TRAINING_FR.split():
profile_fr.add(w)
self.factory.add_profile(profile_fr, 1, 3)
profile_ja = LangProfile('ja')
for w in self.TRAINING_JA.split():
profile_ja.add(w)
self.factory.add_profile(profile_ja, 2, 3)
def test_detector1(self):
detect = self.factory.create()
detect.append('a')
self.assertEqual(detect.detect(), 'en')
def test_detector2(self):
detect = self.factory.create()
detect.append('b d')
self.assertEqual(detect.detect(), 'fr')
def test_detector3(self):
detect = self.factory.create()
detect.append('d e')
self.assertEqual(detect.detect(), 'en')
def test_detector4(self):
detect = self.factory.create()
detect.append(six.u('\u3042\u3042\u3042\u3042a'))
self.assertEqual(detect.detect(), 'ja')
def test_lang_list(self):
langlist = self.factory.get_lang_list()
self.assertEqual(len(langlist), 3)
self.assertEqual(langlist[0], 'en')
self.assertEqual(langlist[1], 'fr')
self.assertEqual(langlist[2], 'ja')
def test_factory_from_json_string(self):
self.factory.clear()
profiles = [self.JSON_LANG1, self.JSON_LANG2]
self.factory.load_json_profile(profiles)
langlist = self.factory.get_lang_list()
self.assertEqual(len(langlist), 2)
self.assertEqual(langlist[0], 'lang1')
self.assertEqual(langlist[1], 'lang2')