Skip to content
Projects
Groups
Snippets
Help
Loading...
Help
Submit feedback
Contribute to GitLab
Sign in
Toggle navigation
haskell-gargantext
Project
Project
Details
Activity
Releases
Cycle Analytics
Repository
Repository
Files
Commits
Branches
Tags
Contributors
Graph
Compare
Charts
Issues
199
Issues
199
List
Board
Labels
Milestones
Merge Requests
12
Merge Requests
12
CI / CD
CI / CD
Pipelines
Jobs
Schedules
Charts
Wiki
Wiki
Snippets
Snippets
Members
Members
Collapse sidebar
Close sidebar
Activity
Graph
Charts
Create a new issue
Jobs
Commits
Issue Boards
Open sidebar
gargantext
haskell-gargantext
Commits
965531a2
Commit
965531a2
authored
Jul 15, 2019
by
Alexandre Delanoë
Browse files
Options
Browse Files
Download
Email Patches
Plain Diff
[METRICS] FACTO
parent
a22e2dc6
Changes
5
Hide whitespace changes
Inline
Side-by-side
Showing
5 changed files
with
30 additions
and
17 deletions
+30
-17
Main.hs
bin/gargantext-import/Main.hs
+2
-1
List.hs
src/Gargantext/Text/List.hs
+2
-2
Metrics.hs
src/Gargantext/Text/Metrics.hs
+15
-2
Terms.hs
src/Gargantext/Text/Terms.hs
+3
-4
En.hs
src/Gargantext/Text/Terms/Mono/Token/En.hs
+8
-8
No files found.
bin/gargantext-import/Main.hs
View file @
965531a2
...
...
@@ -48,8 +48,9 @@ main = do
let
--tt = (Unsupervised EN 6 0 Nothing)
tt
=
(
Multi
EN
)
format
=
WOS
-- CsvGargV3
cmd
::
forall
m
.
FlowCmdM
DevEnv
GargError
m
=>
m
CorpusId
cmd
=
flowCorpusFile
(
cs
user
)
(
cs
name
)
(
read
limit
::
Int
)
tt
CsvGargV3
corpusPath
cmd
=
flowCorpusFile
(
cs
user
)
(
cs
name
)
(
read
limit
::
Int
)
tt
format
corpusPath
{-
let debatCorpus :: forall m. FlowCmdM DevEnv GargError m => m CorpusId
debatCorpus = do
...
...
src/Gargantext/Text/List.hs
View file @
965531a2
...
...
@@ -60,8 +60,8 @@ data StopSize = StopSize {unStopSize :: Int}
buildNgramsLists
::
Lang
->
Int
->
Int
->
StopSize
->
UserCorpusId
->
MasterCorpusId
->
Cmd
err
(
Map
NgramsType
[
NgramsElement
])
buildNgramsLists
l
n
m
s
uCid
mCid
=
do
ngTerms
<-
buildNgramsTermsList
l
n
m
s
uCid
mCid
--
ngTerms <- buildNgramsTermsList' uCid (ngramsGroup l n m) (isStopTerm s . fst) 500 50
--
ngTerms <- buildNgramsTermsList l n m s uCid mCid
ngTerms
<-
buildNgramsTermsList'
uCid
(
ngramsGroup
l
n
m
)
(
isStopTerm
s
.
fst
)
500
50
othersTerms
<-
mapM
(
buildNgramsOthersList
uCid
identity
)
[
Authors
,
Sources
,
Institutes
]
pure
$
Map
.
unions
$
othersTerms
<>
[
ngTerms
]
...
...
src/Gargantext/Text/Metrics.hs
View file @
965531a2
...
...
@@ -102,13 +102,26 @@ takeScored listSize incSize = both (map _scored_terms)
linearTakes
::
(
Ord
b1
,
Ord
b2
)
=>
GraphListSize
->
InclusionSize
->
(
a
->
b2
)
->
(
a
->
b1
)
->
[
a
]
->
([
a
],[
a
])
linearTakes
gls
incSize
speGen
incExc
=
(
List
.
splitAt
g
ls
)
linearTakes
mls
incSize
speGen
incExc
=
(
List
.
splitAt
m
ls
)
.
List
.
concat
.
map
(
take
$
round
$
(
fromIntegral
g
ls
::
Double
)
$
(
fromIntegral
m
ls
::
Double
)
/
(
fromIntegral
incSize
::
Double
)
)
.
map
(
sortOn
speGen
)
.
splitEvery
incSize
.
take
5000
.
takePercent
(
0.70
)
.
sortOn
incExc
takePercent
::
Double
->
[
a
]
->
[
a
]
takePercent
l
xs
=
List
.
take
l'
xs
where
l'
=
round
$
l
*
(
fromIntegral
$
List
.
length
xs
)
splitTake
::
(
Int
,
a
->
Bool
)
->
(
Int
,
a
->
Bool
)
->
[
a
]
->
([
a
],
[
a
])
splitTake
(
a
,
af
)
(
b
,
bf
)
xs
=
(
mpa
<>
mpb
,
ca
<>
cb
)
where
(
mpa
,
ca
)
=
List
.
splitAt
a
$
List
.
filter
af
xs
(
mpb
,
cb
)
=
List
.
splitAt
b
$
List
.
filter
bf
xs
src/Gargantext/Text/Terms.hs
View file @
965531a2
...
...
@@ -129,11 +129,10 @@ termsUnsupervised _ = undefined
newTries
::
Int
->
Text
->
Tries
Token
()
newTries
n
t
=
buildTries
n
(
fmap
toToken
$
uniText
t
)
-- | TODO removing long terms > 24
uniText
::
Text
->
[[
Text
]]
uniText
=
-- map (map (Text.toLower))
map
(
List
.
filter
(
not
.
isPunctuation
))
uniText
=
map
(
List
.
filter
(
not
.
isPunctuation
))
.
map
tokenize
.
sentences
-- | TODO get sentences according to lang
.
sentences
-- | TODO get sentences according to lang
.
Text
.
toLower
src/Gargantext/Text/Terms/Mono/Token/En.hs
View file @
965531a2
{-|
Module : Gargantext.Text.Ngrams.Token.Text
Description :
Description :
Tokenizer main functions
Copyright : (c) Grzegorz Chrupała first, after: CNRS, 2018-Present
License : AGPL + CECILL v3
Maintainer : team@gargantext.org
...
...
@@ -77,20 +77,20 @@ run :: Tokenizer -> (Text -> [Text])
run
f
=
\
txt
->
map
T
.
copy
$
(
map
unwrap
.
unE
.
f
)
txt
defaultTokenizer
::
Tokenizer
defaultTokenizer
=
whitespace
>=>
uris
>=>
punctuation
>=>
contractions
>=>
negatives
defaultTokenizer
=
whitespace
>=>
uris
>=>
punctuation
>=>
contractions
>=>
negatives
-- | Detect common uris and freeze them
uris
::
Tokenizer
uris
x
|
isUri
x
=
E
[
Left
x
]
|
True
=
E
[
Right
x
]
where
isUri
u
=
any
(`
T
.
isPrefixOf
`
u
)
[
"http://"
,
"ftp://"
,
"mailto:"
]
where
isUri
u
=
any
(`
T
.
isPrefixOf
`
u
)
[
"http://"
,
"ftp://"
,
"mailto:"
,
"https://"
]
-- | Split off initial and final punctuation
punctuation
::
Tokenizer
punctuation
::
Tokenizer
punctuation
=
finalPunctuation
>=>
initialPunctuation
--hyphens :: Tokenizer
...
...
Write
Preview
Markdown
is supported
0%
Try again
or
attach a new file
Attach a file
Cancel
You are about to add
0
people
to the discussion. Proceed with caution.
Finish editing this message first!
Cancel
Please
register
or
sign in
to comment