gargantext / haskell-gargantext
Commit 7fc403c9, authored Sep 03, 2024 by Yoelis Acourt
fix(cleanTextForNLP): removes transformation for hyphenated words
Parent: 5bb981be
Pipeline #6585 failed
Showing 1 changed file with 3 additions and 5 deletions
src/Gargantext/Core/Text/Terms/Multi.hs @ 7fc403c9
@@ -11,10 +11,10 @@ Multi-terms are ngrams where n > 1.
 -}
-module Gargantext.Core.Text.Terms.Multi (multiterms, multiterms_rake, tokenTagsWith, tokenTags, cleanTextForNLP)
+module Gargantext.Core.Text.Terms.Multi (multiterms, Terms(..), tokenTag2terms, multiterms_rake, tokenTagsWith, tokenTags, cleanTextForNLP)
   where
-import Data.Attoparsec.Text as DAT (digit, space, notChar, string)
+import Data.Attoparsec.Text as DAT (space, notChar, string)
 import Gargantext.Core (Lang(..), NLPServerConfig(..), PosTagAlgo(..))
 import Gargantext.Core.Text.Terms.Multi.Lang.En qualified as En
 import Gargantext.Core.Text.Terms.Multi.Lang.Fr qualified as Fr
@@ -82,12 +82,10 @@ groupTokens _ = Fr.groupTokens
 -- TODO: make tests here
 cleanTextForNLP :: Text -> Text
-cleanTextForNLP = unifySpaces . removeDigitsWith "-" . removeUrls
+cleanTextForNLP = unifySpaces . removeUrls
   where
     remove x = RAT.streamEdit x (const "")
     unifySpaces = RAT.streamEdit (many DAT.space) (const " ")
-    removeDigitsWith x = remove (many DAT.digit *> DAT.string x <* many DAT.digit)
     removeUrls = removeUrlsWith "http" . removeUrlsWith "www"
     removeUrlsWith w = remove (DAT.string w *> many (DAT.notChar ' ') <* many DAT.space)
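Note on the removed pattern: `many DAT.digit` also succeeds on zero digits, so `many DAT.digit *> DAT.string "-" <* many DAT.digit` appears to match a bare "-" as well as digit ranges. That would explain why hyphenated words lost their hyphens before this commit. The sketch below illustrates this under stated assumptions: it uses `ReadP` from `base` rather than attoparsec, and `digitsWith` and `removeAll` are hypothetical stand-ins for the deleted `removeDigitsWith` pattern and for `RAT.streamEdit p (const "")`. It is not the project's code.

```haskell
-- Sketch only: ReadP (base) stands in for attoparsec / replace-attoparsec.
import Data.Char (isDigit)
import Text.ParserCombinators.ReadP (ReadP, munch, readP_to_S, string)

-- Hypothetical stand-in for the removed pattern:
--   many DAT.digit *> DAT.string x <* many DAT.digit
-- `munch isDigit` accepts zero or more digits, like `many DAT.digit`.
digitsWith :: String -> ReadP ()
digitsWith x = munch isDigit *> string x *> munch isDigit *> pure ()

-- Hypothetical stand-in for `RAT.streamEdit p (const "")`:
-- scan left to right and delete every non-empty match of p.
removeAll :: ReadP () -> String -> String
removeAll _ [] = []
removeAll p s@(c:cs) =
  case readP_to_S p s of
    ((_, rest):_) | length rest < length s -> removeAll p rest
    _ -> c : removeAll p cs

main :: IO ()
main =
  -- Hyphens inside words are deleted along with the digit range.
  putStrLn (removeAll (digitsWith "-") "state-of-the-art in 2019-2020")
  -- prints "stateoftheart in " (with a trailing space)
```

A case like this could also serve as a starting point for the `-- TODO: make tests here` comment above `cleanTextForNLP`.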
Przemyslaw Kaminski (@cgenie) mentioned in commit 5660aec07ec5a0a0a5468f440092c1a8f57a864e · Oct 08, 2024