[FEAT] adding RAKE algorithm to extract multi-terms (keywords) in context of texts.

Commit 0f0205a3 authored Sep 13, 2018 by Alexandre Delanoë

Showing with 120 additions and 1 deletion

package.yaml +1 -0

src/Gargantext/Text/Terms/Multi.hs +3 -1

src/Gargantext/Text/Terms/Multi/RAKE.hs +116 -0

--- a/package.yaml
+++ b/package.yaml
@@ -56,6 +56,7 @@ library:
  - Gargantext.Text.Terms.Mono
  - Gargantext.Text.Terms.Multi.Lang.En
  - Gargantext.Text.Terms.Multi.Lang.Fr
+  - Gargantext.Text.Terms.Multi.RAKE
  - Gargantext.Text.Terms.WithList
  - Gargantext.TextFlow
  - Gargantext.Viz.Graph

--- a/src/Gargantext/Text/Terms/Multi.hs
+++ b/src/Gargantext/Text/Terms/Multi.hs
@@ -13,7 +13,7 @@ Multi-terms are ngrams where n > 1.

 {-# LANGUAGE NoImplicitPrelude #-}

-module Gargantext.Text.Terms.Multi (multiterms)
+module Gargantext.Text.Terms.Multi (multiterms, multiterms_rake)
  where

 import Data.Text hiding (map, group, filter, concat)
@@ -29,6 +29,8 @@ import Gargantext.Text.Terms.Mono.Stem (stem)
 import qualified Gargantext.Text.Terms.Multi.Lang.En as En
 import qualified Gargantext.Text.Terms.Multi.Lang.Fr as Fr

+import Gargantext.Text.Terms.Multi.RAKE (multiterms_rake)
+
 multiterms :: Lang -> Text -> IO [Terms]
 multiterms lang txt = concat
                   <$> map (map (tokenTag2terms lang))

--- a/src/Gargantext/Text/Terms/Multi/RAKE.hs
+++ b/src/Gargantext/Text/Terms/Multi/RAKE.hs
+{-|
+Module      : Gargantext.Text.Terms.Multi.RAKE
+Description : Rapid automatic keyword extraction (RAKE)
+Copyright   : (c) CNRS, 2017
+License     : AGPL + CECILL v3
+Maintainer  : team@gargantext.org
+Stability   : experimental
+Portability : POSIX
+
+Personal notes on the integration of RAKE in Gargantext.
+
+RAKE is a simple, fast and effective algorithm to extract keywords,
+but it is very sensitive to the quality of the stop word list.
+
+Indeed, the very first step uses the stop word list to cut the text
+into candidate keywords. The context is the sentence level: it is used
+to count the co-occurrences and occurrences of each word, and the
+co-occurrences are divided by the occurrences to get the metric of one
+word. The metric of a multi-word term is the sum of the metrics of its words.
+
+Finally, the metrics favour longer keywords, so the result depends on the
+quality of the cut, which itself depends on the quality of the stop word list.
+
+As a consequence, to improve the effectiveness of the RAKE algorithm, I am
+wondering whether some Bayesian features could be added to improve the
+quality of the stop word list over time.
+
+-}
+
+{-# LANGUAGE NoImplicitPrelude #-}
+
+module Gargantext.Text.Terms.Multi.RAKE (multiterms_rake)
+  where
+
+import Data.Text (Text)
+import NLP.RAKE.Text
+import Gargantext.Prelude
+
+multiterms_rake :: Text -> [WordScore]
+multiterms_rake = candidates hardStopList
+                        defaultNosplit
+                        defaultNolist   . pSplitter
+
+-- | StopList
+hardStopList :: StopwordsMap
+hardStopList =   mkStopwordsStr [
+    "a","a's","able","about","above","apply","according","accordingly",
+    "across","actually","after","afterwards","again","against",
+    "ain't","all","allow","allows","almost","alone","along",
+    "already","also","although","always","am","among","amongst",
+    "an","and","another","any","anybody","anyhow","anyone","anything",
+    "anyway","anyways","anywhere","analyze","apart","appear","appreciate","appropriate",
+    "are","aren't","around","as","aside","ask","asking","associated","at",
+    "available","away","awfully","based", "b","be","became","because","become",
+    "becomes","becoming","been","before","beforehand","behind","being",
+    "believe","below","beside","besides","best","better","between","beyond",
+    "both","brief","but","by","c","c'mon","c's","came","can","can't","cannot",
+    "cant","cause","causes","certain","certainly","changes","clearly","co",
+    "com","come","comes","common","concerning","consequently","consider","considering",
+    "contain","containing","contains","corresponding","could","couldn't","course",
+    "currently","d","definitely","described","detects","detecting","despite","did","didn't","different",
+    "do","does","doesn't","doing","don't","done","down","downwards","during","e",
+    "each","edu","eg","eight","either","else","elsewhere","enough","entirely",
+    "especially","et","etc","even","ever","every","everybody","everyone",
+    "everything","everywhere","ex","exactly","example","except","f","far",
+    "few","find","fifth","first","five","followed","following","follows","for",
+    "former","formerly","forth","four","from","further","furthermore","g",
+    "get","gets","getting","given","gives","go","goes","going","gone","got",
+    "gotten","greetings","h","had","hadn't","happens","hardly","has","hasn't",
+    "have","haven't","having","he","he's","hello","help","hence","her","here",
+    "here's","hereafter","hereby","herein","hereupon","hers","herself","hi",
+    "him","himself","his","hither","hopefully","how","howbeit","however","i",
+    "i'd","identify","i'll","i'm","i've","ie","if","ignored","immediate","in","inasmuch",
+    "inc","indeed","indicate","indicated","indicates","inner","insofar",
+    "instead","into","inward","is","isn't","it","it'd","it'll","it's","its",
+    "itself","j","just","k","keep","keeps","kept","know","known","knows","l",
+    "last","lately","later","latter","latterly","least","less","lest","let",
+    "let's","like","liked","likely","little","look","looking","looks","ltd",
+    "m","mainly","many","may","maybe","me","mean","meanwhile","merely","might",
+    "more","moreover","most","mostly","much","must","my","myself","n",
+    "name","namely","nd","near","nearly","necessary","need","needs","neither",
+    "never","nevertheless","new","next","nine","no","nobody","non","none",
+    "noone","nor","normally","not","nothing","novel","now","nowhere","o",
+    "obviously","of","off","often","oh","ok","okay","old","on","once","one",
+    "ones","only","onto","or","other","others","otherwise","ought","our",
+    "ours","ourselves","out","outside","over","overall","own","p","particular",
+    "particularly","per","perhaps","placed","please","plus","possible",
+    "presents","presumably","probably","provides","q","que","quite","qv","r","rather",
+    "rd","re","really","reasonably","regarding","regardless","regards",
+    "relatively","respectively","right","s","said","same","saw","say",
+    "saying","says","second","secondly","see","seeing","seem","seemed",
+    "seeming","seems","seen","self","selves","sensible","sent","serious",
+    "seriously","seven","several","shall","she","should","shouldn't","since",
+    "six","so","some","somebody","somehow","someone","something","sometime",
+    "sometimes","somewhat","somewhere","soon","sorry","specified","specify",
+    "specifying","still","sub","such","sup","sure","t","t's","take","taken",
+    "tell","tends","th","than","thank","thanks","thanx","that","that's",
+    "thats","the","their","theirs","them","themselves","then","thence","there",
+    "there's","thereafter","thereby","therefore","therein","theres",
+    "thereupon","these","they","they'd","they'll","they're","they've",
+    "think","third","this","thorough","thoroughly","those","though","three",
+    "through","throughout","thru","thus","to","together","too","took","toward",
+    "towards","tried","tries","truly","try","trying","twice","two","u","un",
+    "under","unfortunately","unless","unlikely","until","unto","up","upon",
+    "us","use","used","useful","uses","using","usually","uucp","v","value",
+    "various","very","via","viz","vs","w","want","wants","was","wasn't","way",
+    "we","we'd","we'll","we're","we've","welcome","well","went","were",
+    "weren't","what","what's","whatever","when","whence","whenever","where",
+    "where's","whereafter","whereas","whereby","wherein","whereupon",
+    "wherever","whether","which","while","whither","who","who's","whoever",
+    "whole","whom","whose","why","will","willing","wish","with","within",
+    "without","won't","wonder","would","wouldn't","x","y","yes","yet","you",
+    "you'd","you'll","you're","you've","your","yours","yourself","yourselves",
+    "z","zero"]
+
+
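
The scoring described in the module header of RAKE.hs can be illustrated with a small standalone sketch. It is not the code of the `rake` package used by `multiterms_rake`, and it is not part of this commit. It assumes the usual RAKE formulation: stop words and punctuation cut the text into candidate phrases, each word gets the score degree / frequency (where the degree counts the words of every phrase the word appears in), and a phrase scores the sum of its words. The names `toyStopWords`, `tokenize`, `splitWhen`, `candidatePhrases`, `wordScores` and `rakeSketch` are hypothetical and exist only for this example.

```haskell
import           Data.Char       (isAlpha, isSpace, toLower)
import           Data.List       (sortBy)
import qualified Data.Map.Strict as Map
import           Data.Ord        (Down (..), comparing)
import qualified Data.Set        as Set

-- | Hypothetical, deliberately tiny stop list (not hardStopList).
toyStopWords :: Set.Set String
toyStopWords = Set.fromList ["a", "and", "for", "is", "of", "the"]

-- | Lowercase the text and split it into words. Anything that is neither
-- a letter, an apostrophe nor a space becomes the boundary token "|".
tokenize :: String -> [String]
tokenize = words . concatMap normalise
  where
    normalise c
      | isAlpha c || c == '\'' = [toLower c]
      | isSpace c              = " "
      | otherwise              = " | "

-- | Split a list on every element satisfying the predicate.
splitWhen :: (a -> Bool) -> [a] -> [[a]]
splitWhen p xs = case break p xs of
  (chunk, [])       -> [chunk]
  (chunk, _ : rest) -> chunk : splitWhen p rest

-- | Cut the text at stop words and boundaries; keep non-empty phrases.
candidatePhrases :: Set.Set String -> String -> [[String]]
candidatePhrases stops = filter (not . null) . splitWhen isCut . tokenize
  where
    isCut w = w == "|" || w `Set.member` stops

-- | Word score = degree / frequency, computed over the candidate phrases.
wordScores :: [[String]] -> Map.Map String Double
wordScores phrases = Map.intersectionWith (/) degree frequency
  where
    frequency = Map.fromListWith (+)
      [ (w, 1 :: Double) | p <- phrases, w <- p ]
    degree = Map.fromListWith (+)
      [ (w, fromIntegral (length p)) | p <- phrases, w <- p ]

-- | Phrase score = sum of its word scores; best phrases first.
rakeSketch :: Set.Set String -> String -> [(String, Double)]
rakeSketch stops txt = sortBy (comparing (Down . snd)) (Map.toList scored)
  where
    phrases = candidatePhrases stops txt
    scores  = wordScores phrases
    scored  = Map.fromList
      [ (unwords p, sum [ Map.findWithDefault 0 w scores | w <- p ])
      | p <- phrases ]
```

For example, `rakeSketch toyStopWords "The quality of the stop word list is the key for keyword extraction."` ranks "stop word list" first with a score of 9.0 and "keyword extraction" second with 4.0, which shows how the sum favours longer phrases. Dropping a word such as "is" from the stop list would merge "stop word list" and "key" into one longer candidate, which is the sensitivity to the stop list that the notes point out.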