Skip to content
Projects
Groups
Snippets
Help
Loading...
Help
Submit feedback
Contribute to GitLab
Sign in
Toggle navigation
haskell-gargantext
Project
Project
Details
Activity
Releases
Cycle Analytics
Repository
Repository
Files
Commits
Branches
Tags
Contributors
Graph
Compare
Charts
Issues
0
Issues
0
List
Board
Labels
Milestones
Merge Requests
0
Merge Requests
0
CI / CD
CI / CD
Pipelines
Jobs
Schedules
Charts
Wiki
Wiki
Snippets
Snippets
Members
Members
Collapse sidebar
Close sidebar
Activity
Graph
Charts
Create a new issue
Jobs
Commits
Issue Boards
Open sidebar
Julien Moutinho
haskell-gargantext
Commits
965531a2
Commit
965531a2
authored
Jul 15, 2019
by
Alexandre Delanoë
Browse files
Options
Browse Files
Download
Email Patches
Plain Diff
[METRICS] FACTO
parent
a22e2dc6
Changes
5
Show whitespace changes
Inline
Side-by-side
Showing
5 changed files
with
30 additions
and
17 deletions
+30
-17
Main.hs
bin/gargantext-import/Main.hs
+2
-1
List.hs
src/Gargantext/Text/List.hs
+2
-2
Metrics.hs
src/Gargantext/Text/Metrics.hs
+15
-2
Terms.hs
src/Gargantext/Text/Terms.hs
+3
-4
En.hs
src/Gargantext/Text/Terms/Mono/Token/En.hs
+8
-8
No files found.
bin/gargantext-import/Main.hs
View file @
965531a2
...
...
@@ -48,8 +48,9 @@ main = do
let
--tt = (Unsupervised EN 6 0 Nothing)
tt
=
(
Multi
EN
)
format
=
WOS
-- CsvGargV3
cmd
::
forall
m
.
FlowCmdM
DevEnv
GargError
m
=>
m
CorpusId
cmd
=
flowCorpusFile
(
cs
user
)
(
cs
name
)
(
read
limit
::
Int
)
tt
CsvGargV3
corpusPath
cmd
=
flowCorpusFile
(
cs
user
)
(
cs
name
)
(
read
limit
::
Int
)
tt
format
corpusPath
{-
let debatCorpus :: forall m. FlowCmdM DevEnv GargError m => m CorpusId
debatCorpus = do
...
...
src/Gargantext/Text/List.hs
View file @
965531a2
...
...
@@ -60,8 +60,8 @@ data StopSize = StopSize {unStopSize :: Int}
buildNgramsLists
::
Lang
->
Int
->
Int
->
StopSize
->
UserCorpusId
->
MasterCorpusId
->
Cmd
err
(
Map
NgramsType
[
NgramsElement
])
buildNgramsLists
l
n
m
s
uCid
mCid
=
do
ngTerms
<-
buildNgramsTermsList
l
n
m
s
uCid
mCid
--
ngTerms <- buildNgramsTermsList' uCid (ngramsGroup l n m) (isStopTerm s . fst) 500 50
--
ngTerms <- buildNgramsTermsList l n m s uCid mCid
ngTerms
<-
buildNgramsTermsList'
uCid
(
ngramsGroup
l
n
m
)
(
isStopTerm
s
.
fst
)
500
50
othersTerms
<-
mapM
(
buildNgramsOthersList
uCid
identity
)
[
Authors
,
Sources
,
Institutes
]
pure
$
Map
.
unions
$
othersTerms
<>
[
ngTerms
]
...
...
src/Gargantext/Text/Metrics.hs
View file @
965531a2
...
...
@@ -102,13 +102,26 @@ takeScored listSize incSize = both (map _scored_terms)
linearTakes
::
(
Ord
b1
,
Ord
b2
)
=>
GraphListSize
->
InclusionSize
->
(
a
->
b2
)
->
(
a
->
b1
)
->
[
a
]
->
([
a
],[
a
])
linearTakes
gls
incSize
speGen
incExc
=
(
List
.
splitAt
g
ls
)
linearTakes
mls
incSize
speGen
incExc
=
(
List
.
splitAt
m
ls
)
.
List
.
concat
.
map
(
take
$
round
$
(
fromIntegral
g
ls
::
Double
)
$
(
fromIntegral
m
ls
::
Double
)
/
(
fromIntegral
incSize
::
Double
)
)
.
map
(
sortOn
speGen
)
.
splitEvery
incSize
.
take
5000
.
takePercent
(
0.70
)
.
sortOn
incExc
takePercent
::
Double
->
[
a
]
->
[
a
]
takePercent
l
xs
=
List
.
take
l'
xs
where
l'
=
round
$
l
*
(
fromIntegral
$
List
.
length
xs
)
splitTake
::
(
Int
,
a
->
Bool
)
->
(
Int
,
a
->
Bool
)
->
[
a
]
->
([
a
],
[
a
])
splitTake
(
a
,
af
)
(
b
,
bf
)
xs
=
(
mpa
<>
mpb
,
ca
<>
cb
)
where
(
mpa
,
ca
)
=
List
.
splitAt
a
$
List
.
filter
af
xs
(
mpb
,
cb
)
=
List
.
splitAt
b
$
List
.
filter
bf
xs
src/Gargantext/Text/Terms.hs
View file @
965531a2
...
...
@@ -129,10 +129,9 @@ termsUnsupervised _ = undefined
newTries
::
Int
->
Text
->
Tries
Token
()
newTries
n
t
=
buildTries
n
(
fmap
toToken
$
uniText
t
)
-- | TODO removing long terms > 24
uniText
::
Text
->
[[
Text
]]
uniText
=
-- map (map (Text.toLower))
map
(
List
.
filter
(
not
.
isPunctuation
))
uniText
=
map
(
List
.
filter
(
not
.
isPunctuation
))
.
map
tokenize
.
sentences
-- | TODO get sentences according to lang
.
Text
.
toLower
...
...
src/Gargantext/Text/Terms/Mono/Token/En.hs
View file @
965531a2
{-|
Module : Gargantext.Text.Ngrams.Token.Text
Description :
Description :
Tokenizer main functions
Copyright : (c) Grzegorz Chrupała first, after: CNRS, 2018-Present
License : AGPL + CECILL v3
Maintainer : team@gargantext.org
...
...
@@ -87,7 +87,7 @@ defaultTokenizer = whitespace
uris
::
Tokenizer
uris
x
|
isUri
x
=
E
[
Left
x
]
|
True
=
E
[
Right
x
]
where
isUri
u
=
any
(`
T
.
isPrefixOf
`
u
)
[
"http://"
,
"ftp://"
,
"mailto:"
]
where
isUri
u
=
any
(`
T
.
isPrefixOf
`
u
)
[
"http://"
,
"ftp://"
,
"mailto:"
,
"https://"
]
-- | Split off initial and final punctuation
punctuation
::
Tokenizer
...
...
Write
Preview
Markdown
is supported
0%
Try again
or
attach a new file
Attach a file
Cancel
You are about to add
0
people
to the discussion. Proceed with caution.
Finish editing this message first!
Cancel
Please
register
or
sign in
to comment