{-|
Module      : Gargantext.Core.Text.Terms.Mono.Stem
Description : Stemming of mono (i.e. single word) terms.
Copyright   : (c) CNRS, 2017-Present
License     : AGPL + CECILL v3
Maintainer  : team@gargantext.org
Stability   : experimental
Portability : POSIX

In linguistic morphology and information retrieval, stemming is the
process of reducing inflected (or sometimes derived) words to their word
stem, base or root form, generally a written word form. The @stem@ need
not be identical to the morphological root of the word; it is usually
sufficient that related words map to the same stem, even if this stem is
not in itself a valid root.
Source : https://en.wikipedia.org/wiki/Stemming

A stemmer for English, for example, should identify the string "cats"
(and possibly "catlike", "catty" etc.) as based on the root "cat", and
"stems", "stemmer", "stemming", "stemmed" as based on "stem". A stemming
algorithm reduces the words "fishing", "fished", and "fisher" to the
root word, "fish". On the other hand, "argue", "argued", "argues",
"arguing", and "argus" reduce to the stem "argu" (illustrating the case
where the stem is not itself a word or root) but "argument" and
"arguments" reduce to the stem "argument".
-}

module Gargantext.Core.Text.Terms.Mono.Stem (

  -- * Types
    StemmingAlgorithm(..),

  -- * Universal stemming function
    stem,

  -- * Handy re-exports
    Lang(..)

  ) where

import Gargantext.Core.Text.Terms.Mono.Stem.Internal.Porter qualified as Porter
import Gargantext.Core.Text.Terms.Mono.Stem.Internal.Lancaster qualified as Lancaster
import Gargantext.Core.Text.Terms.Mono.Stem.Internal.GargPorter qualified as GargPorter

import Gargantext.Core (Lang(..))
import Gargantext.Prelude

-- | A stemming algorithm. There are different stemming algorithms,
-- each with its own tradeoffs, strengths and weaknesses. Typically
-- one picks one or another depending on the task at hand.
data StemmingAlgorithm
  = -- | The Porter algorithm is the classic stemming algorithm, possibly
    -- one of the most widely used.
    PorterAlgorithm
    -- | Slight variation of the Porter algorithm; it's more aggressive with
    -- stemming, which might or might not be what you want. It also makes some
    -- subtle changes to the stem; for example, the stemming of \"dancer\" using
    -- Porter is simply \"dancer\" (i.e. it cannot be further stemmed). Using
    -- Lancaster we would get \"dant\", which is no longer a prefix of the initial word.
  | LancasterAlgorithm
    -- | A variation of the Porter algorithm tailored for Gargantext.
  | GargPorterAlgorithm
  deriving (Show, Eq, Ord)

-- | Stems the input 'Text' based on the input 'Lang' and using the
-- given 'StemmingAlgorithm'. 'LancasterAlgorithm' and 'GargPorterAlgorithm'
-- only support English ('EN'); for any other language the input is
-- returned unchanged.
stem :: Lang -> StemmingAlgorithm -> Text -> Text
stem lang algo unstemmed = case algo of
  PorterAlgorithm
    -> Porter.stem lang unstemmed
  LancasterAlgorithm
    | EN <- lang
    -> Lancaster.stem unstemmed
    | otherwise
    -> unstemmed  -- Lancaster only supports English.
  GargPorterAlgorithm
    | EN <- lang
    -> GargPorter.stem unstemmed
    | otherwise
    -> unstemmed  -- Our GargPorter only supports English.
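
{- Usage sketch (illustrative only; not part of the module's API).

The 'otherwise' branches of 'stem' imply that 'LancasterAlgorithm' and
'GargPorterAlgorithm' return non-English input untouched. The lines below
are a minimal sketch of how that could be checked. They assume that 'FR' is
a constructor of 'Lang' and that string literals are read as 'Text'
(e.g. via OverloadedStrings); both are assumptions, not facts stated in
this file.

    stem FR LancasterAlgorithm  "chats" == "chats"  -- unsupported language: input returned as-is
    stem FR GargPorterAlgorithm "chats" == "chats"  -- same fallback branch
    stem EN PorterAlgorithm     "fishing"           -- delegated to Porter.stem; the result depends on that module

No expected stem is given for the last call, since it is determined
entirely by the internal Porter implementation.
-}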