[LANG] detect sample files.

Commit 0f3ee316 authored Sep 30, 2018 by Alexandre Delanoë
6 changed files
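
Reviewer sketch, not part of the commit: in `Stop.hs`, `detectLangs` changes from returning a `LangProba` map to returning a list of `(Lang, Double)` pairs ordered from most to least probable. The standalone example below reproduces only that ranking step. The `Lang` type defined here and the name `rankLangs` are stand-ins assumed for illustration; the real `detect`, `wordsToBook` and `testEL` are not reproduced, and the probabilities are made up.

```haskell
import qualified Data.List as DL
import qualified Data.Map.Strict as DM
import Data.Map.Strict (Map)

-- Hypothetical stand-in for the project's Lang type (assumption for this sketch).
data Lang = EN | FR | DE | SP | CH
  deriving (Show, Eq, Ord)

type LangProba = Map Lang Double

-- Same pipeline as the new detectLangs body:
-- flatten the map, sort ascending on the probability, then reverse.
rankLangs :: LangProba -> [(Lang, Double)]
rankLangs = DL.reverse . DL.sortOn snd . DM.toList

main :: IO ()
main = print (rankLangs (DM.fromList [(EN, 0.2), (FR, 0.7), (DE, 0.1)]))
-- prints [(FR,0.7),(EN,0.2),(DE,0.1)]
```

With this shape, callers can take the head of the list to get the most probable language. `DL.sortOn (Down . snd)` with `Data.Ord (Down)` would give the same descending order in one pass, differing only in how entries with equal probability are ordered.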
--- a/src/Gargantext/Text/Samples/CH.hs
+++ b/src/Gargantext/Text/Samples/CH.hs
+module Gargantext.Text.Samples.CH where
+
+import Data.String (String)
+
+textMining :: String
+textMining = "文本挖掘有时也被称为文字探勘、文本数据挖掘等，大致相当于文字分析，一般指文本处理过程中产生高质量的信息。高质量的信息通常通过分类和预测来产生，如模式识别。文本挖掘通常涉及输入文本的处理过程（通常进行分析，同时加上一些衍生语言特征以及消除杂音，随后插入到数据库中） ，产生结构化数据，并最终评价和解释输出。'高品质'的文本挖掘通常是指某种组合的相关性，新颖性和趣味性。典型的文本挖掘方法包括文本分类，文本聚类，概念/实体挖掘，生产精确分类，观点分析，文档摘要和实体关系模型（即，学习已命名实体之间的关系） 。 文本分析包括了信息检索、词典分析来研究词语的频数分布、模式识别、标签 注释、信息抽取，数据挖掘技术包括链接和关联分析、可视化和预测分析。本质上，首要的任务是，通过自然语言处理和分析方法，将文本转化为数据进行分析"
+
--- a/src/Gargantext/Text/Samples/DE.hs
+++ b/src/Gargantext/Text/Samples/DE.hs
+module Gargantext.Text.Samples.DE where
+
+import Data.String (String)
+
+textMining :: String
+textMining = "Text Mining, seltener auch Textmining, Text Data Mining oder Textual Data Mining, ist ein Bündel von Algorithmus-basierten Analyseverfahren zur Entdeckung von Bedeutungsstrukturen aus un- oder schwachstrukturierten Textdaten. Mit statistischen und linguistischen Mitteln erschließt Text-Mining-Software aus Texten Strukturen, die die Benutzer in die Lage versetzen sollen, Kerninformationen der verarbeiteten Texte schnell zu erkennen. Im Optimalfall liefern Text-Mining-Systeme Informationen, von denen die Benutzer zuvor nicht wissen, ob und dass sie in den verarbeiteten Texten enthalten sind. Bei zielgerichteter Anwendung sind Werkzeuge des Text Mining außerdem in der Lage, Hypothesen zu generieren, diese zu überprüfen und schrittweise zu verfeinern."
--- a/src/Gargantext/Text/Samples/EN.hs
+++ b/src/Gargantext/Text/Samples/EN.hs
+module Gargantext.Text.Samples.EN where
+
+import Data.String (String)
+
+textMining :: String
+textMining = "Text mining, also referred to as text data mining, roughly equivalent to text analytics, is the process of deriving high-quality information from text. High-quality information is typically derived through the devising of patterns and trends through means such as statistical pattern learning. Text mining usually involves the process of structuring the input text (usually parsing, along with the addition of some derived linguistic features and the removal of others, and subsequent insertion into a database), deriving patterns within the structured data, and finally evaluation and interpretation of the output. 'High quality' in text mining usually refers to some combination of relevance, novelty, and interestingness. Typical text mining tasks include text categorization, text clustering, concept/entity extraction, production of granular taxonomies, sentiment analysis, document summarization, and entity relation modeling (i.e., learning relations between named entities). Text analysis involves information retrieval, lexical analysis to study word frequency distributions, pattern recognition, tagging/annotation, information extraction, data mining techniques including link and association analysis, visualization, and predictive analytics. The overarching goal is, essentially, to turn text into data for analysis, via application of natural language processing (NLP) and analytical methods. A typical application is to scan a set of documents written in a natural language and either model the document set for predictive classification purposes or populate a database or search index with the information extracted."
+
+
--- a/src/Gargantext/Text/Samples/FR.hs
+++ b/src/Gargantext/Text/Samples/FR.hs
+module Gargantext.Text.Samples.FR where
+
+import Data.String (String)
+
+textMining :: String
+textMining = "La fouille de textes ou « l'extraction de connaissances » dans les textes est une spécialisation de la fouille de données et fait partie du domaine de l'intelligence artificielle. Cette technique est souvent désignée sous l'anglicisme text mining. Elle désigne un ensemble de traitements informatiques consistant à extraire des connaissances selon un critère de nouveauté ou de similarité dans des textes produits par des humains pour des humains. Dans la pratique, cela revient à mettre en algorithme un modèle simplifié des théories linguistiques dans des systèmes informatiques d'apprentissage et de statistiques. Les disciplines impliquées sont donc la linguistique calculatoire, l'ingénierie des langues, l'apprentissage artificiel, les statistiques et l'informatique."
+
+
--- a/src/Gargantext/Text/Samples/SP.hs
+++ b/src/Gargantext/Text/Samples/SP.hs
+module Gargantext.Text.Samples.SP where
+
+import Data.String (String)
+
+textMining :: String
+textMining = "La minería de textos se refiere al proceso de derivar información nueva de textos. A comienzos de los años ochenta surgieron los primeros esfuerzos de minería de textos que necesitaban una gran cantidad de esfuerzo humano, pero los avances tecnológicos han permitido que esta área progrese de manera rápida en la última década. La minería de textos es un área multidisciplinar basada en la recuperación de información, minería de datos, aprendizaje automático, estadísticas y la lingüística computacional. Como la mayor parte de la información (más de un 80%) se encuentra actualmente almacenada como texto, se cree que la minería de textos tiene un gran valor comercial."
--- a/src/Gargantext/Text/Terms/Stop.hs
+++ b/src/Gargantext/Text/Terms/Stop.hs
@@ -25,7 +25,7 @@ import Data.Char (toLower)
 import qualified Data.List as DL

 import Data.Maybe (maybe)
-import Data.Map.Strict (Map)
+import Data.Map.Strict (Map, toList)
 import qualified Data.Map.Strict as DM

 import Data.String (String)
@@ -83,10 +83,10 @@ data LangWord = LangWord Lang Word
 type LangProba = Map Lang Double

 ------------------------------------------------------------------------
-
-
-detectLangs :: String -> LangProba
-detectLangs s = detect (wordsToBook [0..2] s) testEL
+detectLangs :: String -> [(Lang, Double)]
+detectLangs s = DL.reverse $ DL.sortOn snd
+                           $ toList
+                           $ detect (wordsToBook [0..2] s) testEL

 testEL :: EventLang
 testEL = toEventLangs [0..2] [ LangWord EN EN.textMining