Commit 0f0205a3 authored by Alexandre Delanoë's avatar Alexandre Delanoë

[FEAT] adding RAKE algorithm to extract multi-terms (keywords) in context of texts.

parent 99e95bb1
......@@ -56,6 +56,7 @@ library:
- Gargantext.Text.Terms.Mono
- Gargantext.Text.Terms.Multi.Lang.En
- Gargantext.Text.Terms.Multi.Lang.Fr
- Gargantext.Text.Terms.Multi.RAKE
- Gargantext.Text.Terms.WithList
- Gargantext.TextFlow
- Gargantext.Viz.Graph
......
......@@ -13,7 +13,7 @@ Multi-terms are ngrams where n > 1.
{-# LANGUAGE NoImplicitPrelude #-}
module Gargantext.Text.Terms.Multi (multiterms)
module Gargantext.Text.Terms.Multi (multiterms, multiterms_rake)
where
import Data.Text hiding (map, group, filter, concat)
......@@ -29,6 +29,8 @@ import Gargantext.Text.Terms.Mono.Stem (stem)
import qualified Gargantext.Text.Terms.Multi.Lang.En as En
import qualified Gargantext.Text.Terms.Multi.Lang.Fr as Fr
import Gargantext.Text.Terms.Multi.RAKE (multiterms_rake)
multiterms :: Lang -> Text -> IO [Terms]
multiterms lang txt = concat
<$> map (map (tokenTag2terms lang))
......
{-|
Module : Gargantext.Text.Terms.Multi.RAKE
Description : Rapid automatic keyword extraction (RAKE)
Copyright : (c) CNRS, 2017
License : AGPL + CECILL v3
Maintainer : team@gargantext.org
Stability : experimental
Portability : POSIX
Personal notes for the integration of RAKE in Gargantext.
RAKE algorithm is a simple, rapid and effective algorithm to extract
keywords that is very sensitive to the quality of the stop word list.
Indeed, the very first step starts from the stop words list to cut the
text towards keywords extraction. The conTexT is the sentence level to
compute the coccurrences and occurrences which are divided to compute
the metric of one word. Multi-words metrics is equal to the sum of the
metrics of each word.
Finally The metrics highlight longer keywords which highly depends of
quality of the cut which depends on the quality of the stop word list.
As a consequence, to improve the effectiveness of RAKE algorithm, I am
wondering if some bayesian features could be added to increase stop word
list quality in time.
-}
{-# LANGUAGE NoImplicitPrelude #-}
module Gargantext.Text.Terms.Multi.RAKE (multiterms_rake)
where
import Data.Text (Text)
import NLP.RAKE.Text
import Gargantext.Prelude
multiterms_rake :: Text -> [WordScore]
multiterms_rake = candidates hardStopList
defaultNosplit
defaultNolist . pSplitter
-- | StopList
hardStopList :: StopwordsMap
hardStopList = mkStopwordsStr [
"a","a's","able","about","above","apply","according","accordingly",
"across","actually","after","afterwards","again","against",
"ain't","all","allow","allows","almost","alone","along",
"already","also","although","always","am","among","amongst",
"an","and","another","any","anybody","anyhow","anyone","anything",
"anyway","anyways","anywhere","analyze","apart","appear","appreciate","appropriate",
"are","aren't","around","as","aside","ask","asking","associated","at",
"available","away","awfully","based", "b","be","became","because","become",
"becomes","becoming","been","before","beforehand","behind","being",
"believe","below","beside","besides","best","better","between","beyond",
"both","brief","but","by","c","c'mon","c's","came","can","can't","cannot",
"cant","cause","causes","certain","certainly","changes","clearly","co",
"com","come","comes","common","concerning","consequently","consider","considering",
"contain","containing","contains","corresponding","could","couldn't","course",
"currently","d","definitely","described","detects","detecting","despite","did","didn't","different",
"do","does","doesn't","doing","don't","done","down","downwards","during","e",
"each","edu","eg","eight","either","else","elsewhere","enough","entirely",
"especially","et","etc","even","ever","every","everybody","everyone",
"everything","everywhere","ex","exactly","example","except","f","far",
"few","find","fifth","first","five","followed","following","follows","for",
"former","formerly","forth","four","from","further","furthermore","g",
"get","gets","getting","given","gives","go","goes","going","gone","got",
"gotten","greetings","h","had","hadn't","happens","hardly","has","hasn't",
"have","haven't","having","he","he's","hello","help","hence","her","here",
"here's","hereafter","hereby","herein","hereupon","hers","herself","hi",
"him","himself","his","hither","hopefully","how","howbeit","however","i",
"i'd","identify","i'll","i'm","i've","ie","if","ignored","immediate","in","inasmuch",
"inc","indeed","indicate","indicated","indicates","inner","insofar",
"instead","into","inward","is","isn't","it","it'd","it'll","it's","its",
"itself","j","just","k","keep","keeps","kept","know","known","knows","l",
"last","lately","later","latter","latterly","least","less","lest","let",
"let's","like","liked","likely","little","look","looking","looks","ltd",
"m","mainly","many","may","maybe","me","mean","meanwhile","merely","might",
"more","moreover","most","mostly","much","must","my","myself","n",
"name","namely","nd","near","nearly","necessary","need","needs","neither",
"never","nevertheless","new","next","nine","no","nobody","non","none",
"noone","nor","normally","not","nothing","novel","now","nowhere","o",
"obviously","of","off","often","oh","ok","okay","old","on","once","one",
"ones","only","onto","or","other","others","otherwise","ought","our",
"ours","ourselves","out","outside","over","overall","own","p","particular",
"particularly","per","perhaps","placed","please","plus","possible",
"presents","presumably","probably","provides","q","que","quite","qv","r","rather",
"rd","re","really","reasonably","regarding","regardless","regards",
"relatively","respectively","right","s","said","same","saw","say",
"saying","says","second","secondly","see","seeing","seem","seemed",
"seeming","seems","seen","self","selves","sensible","sent","serious",
"seriously","seven","several","shall","she","should","shouldn't","since",
"six","so","some","somebody","somehow","someone","something","sometime",
"sometimes","somewhat","somewhere","soon","sorry","specified","specify",
"specifying","still","sub","such","sup","sure","t","t's","take","taken",
"tell","tends","th","than","thank","thanks","thanx","that","that's",
"thats","the","their","theirs","them","themselves","then","thence","there",
"there's","thereafter","thereby","therefore","therein","theres",
"thereupon","these","they","they'd","they'll","they're","they've",
"think","third","this","thorough","thoroughly","those","though","three",
"through","throughout","thru","thus","to","together","too","took","toward",
"towards","tried","tries","truly","try","trying","twice","two","u","un",
"under","unfortunately","unless","unlikely","until","unto","up","upon",
"us","use","used","useful","uses","using","usually","uucp","v","value",
"various","very","via","viz","vs","w","want","wants","was","wasn't","way",
"we","we'd","we'll","we're","we've","welcome","well","went","were",
"weren't","what","what's","whatever","when","whence","whenever","where",
"where's","whereafter","whereas","whereby","wherein","whereupon",
"wherever","whether","which","while","whither","who","who's","whoever",
"whole","whom","whose","why","will","willing","wish","with","within",
"without","won't","wonder","would","wouldn't","x","y","yes","yet","you",
"you'd","you'll","you're","you've","your","yours","yourself","yourselves",
"z","zero"]
Markdown is supported
0% or
You are about to add 0 people to the discussion. Proceed with caution.
Finish editing this message first!
Please register or to comment