Skip to content
Projects
Groups
Snippets
Help
Loading...
Help
Submit feedback
Contribute to GitLab
Sign in
Toggle navigation
haskell-gargantext
Project
Project
Details
Activity
Releases
Cycle Analytics
Repository
Repository
Files
Commits
Branches
Tags
Contributors
Graph
Compare
Charts
Issues
154
Issues
154
List
Board
Labels
Milestones
Merge Requests
10
Merge Requests
10
CI / CD
CI / CD
Pipelines
Jobs
Schedules
Charts
Wiki
Wiki
Snippets
Snippets
Members
Members
Collapse sidebar
Close sidebar
Activity
Graph
Charts
Create a new issue
Jobs
Commits
Issue Boards
Open sidebar
gargantext
haskell-gargantext
Commits
58ad4a3e
Commit
58ad4a3e
authored
2 years ago
by
Alexandre Delanoë
Browse files
Options
Browse Files
Download
Email Patches
Plain Diff
[FEAT] Prepare text from Gutemberg
parent
916be24b
Pipeline
#3132
failed with stage
in 43 minutes and 35 seconds
Changes
3
Pipelines
1
Hide whitespace changes
Inline
Side-by-side
Showing
3 changed files
with
103 additions
and
0 deletions
+103
-0
gargantext.cabal
gargantext.cabal
+1
-0
package.yaml
package.yaml
+1
-0
Prepare.hs
src/Gargantext/Core/Text/Prepare.hs
+101
-0
No files found.
gargantext.cabal
View file @
58ad4a3e
...
...
@@ -76,6 +76,7 @@ library
Gargantext.Core.Text.Metrics.TFICF
Gargantext.Core.Text.Metrics.CharByChar
Gargantext.Core.Text.Metrics.Count
Gargantext.Core.Text.Prepare
Gargantext.Core.Text.Search
Gargantext.Core.Text.Terms
Gargantext.Core.Text.Terms.Mono
...
...
This diff is collapsed.
Click to expand it.
package.yaml
View file @
58ad4a3e
...
...
@@ -100,6 +100,7 @@ library:
-
Gargantext.Core.Text.Metrics.TFICF
-
Gargantext.Core.Text.Metrics.CharByChar
-
Gargantext.Core.Text.Metrics.Count
-
Gargantext.Core.Text.Prepare
-
Gargantext.Core.Text.Search
-
Gargantext.Core.Text.Terms
-
Gargantext.Core.Text.Terms.Mono
...
...
This diff is collapsed.
Click to expand it.
src/Gargantext/Core/Text/
Clean
.hs
→
src/Gargantext/Core/Text/
Prepare
.hs
View file @
58ad4a3e
...
...
@@ -20,22 +20,63 @@ that could be the incarnation of the mythic Gargantua.
module
Gargantext.Core.Text.Clean
where
import
Gargantext.Prelude
import
Data.Text
(
Text
)
import
qualified
Data.Text
as
Text
import
Gargantext.Core.Text
(
sentences
)
import
Gargantext.Prelude
import
qualified
Data.List
as
List
import
qualified
Data.Text
as
Text
groupLines
::
[
Text
]
->
[
Text
]
groupLines
(
a
:
x
:
xs
)
=
undefined
cleanText
::
Text
->
[
Text
]
cleanText
txt
=
List
.
filter
(
/=
""
)
$
toParagraphs
$
Text
.
lines
$
Text
.
replace
"--"
""
-- removing bullets like of dialogs
$
Text
.
replace
"
\xd
"
""
txt
---------------------------------------------------------------------
prepareText
::
Paragraph
->
Text
->
[
Text
]
prepareText
p
txt
=
groupText
p
$
List
.
filter
(
/=
""
)
$
toParagraphs
$
Text
.
lines
$
Text
.
replace
"_"
" "
-- some texts seem to be underlined
$
Text
.
replace
"--"
""
-- removing bullets like of dialogs
$
Text
.
replace
"
\xd
"
""
txt
---------------------------------------------------------------------
groupText
::
Paragraph
->
[
Text
]
->
[
Text
]
groupText
(
Uniform
g
s
)
=
groupUniform
g
s
groupText
AuthorLike
=
groupLines
---------------------------------------------------------------------
data
Paragraph
=
Uniform
Grain
Step
|
AuthorLike
-- Uniform does not preserve the paragraphs of the author but length of paragraphs is uniform
-- Author Like preserve the paragraphs of the Author but length of paragraphs is not uniform
-- Grain: number of Sentences by block of Text
-- Step : overlap of sentence between connex block of Text
groupUniform
::
Grain
->
Step
->
[
Text
]
->
[
Text
]
groupUniform
g
s
ts
=
map
(
Text
.
intercalate
" "
)
$
chunkAlong
g
s
$
sentences
$
Text
.
concat
ts
groupLines
::
[
Text
]
->
[
Text
]
groupLines
xxx
@
(
a
:
b
:
xs
)
=
if
Text
.
length
a
>
moyenne
then
[
a
]
<>
(
groupLines
(
b
:
xs
))
else
let
ab
=
a
<>
" "
<>
b
in
if
Text
.
length
ab
>
moyenne
then
[
ab
]
<>
(
groupLines
xs
)
else
groupLines
([
ab
]
<>
xs
)
where
moyenne
=
round
$
mean
$
(
map
(
fromIntegral
.
Text
.
length
)
xxx
::
[
Double
])
groupLines
[
a
]
=
[
a
]
groupLines
[]
=
[]
groupLines_test
::
[
Text
]
groupLines_test
=
groupLines
theData
where
theData
=
[
"abxxxx"
,
"bc"
,
"cxxx"
,
"d"
]
---------------------------------------------------------------------
toParagraphs
::
[
Text
]
->
[
Text
]
toParagraphs
(
a
:
x
:
xs
)
=
if
a
==
""
...
...
@@ -46,4 +87,15 @@ toParagraphs (a:x:xs) =
toParagraphs
[
a
]
=
[
a
]
toParagraphs
[]
=
[]
-- Tests
-- TODO for internships: Property tests
toParagraphs_test
::
Bool
toParagraphs_test
=
toParagraphs
[
"a"
,
"b"
,
""
,
"c"
,
"d"
,
"d"
,
""
,
"e"
,
"f"
,
""
,
"g"
,
"h"
,
""
]
==
[
"a b"
,
""
,
"c d d"
,
""
,
"e f"
,
""
,
"g h"
,
""
]
This diff is collapsed.
Click to expand it.
Write
Preview
Markdown
is supported
0%
Try again
or
attach a new file
Attach a file
Cancel
You are about to add
0
people
to the discussion. Proceed with caution.
Finish editing this message first!
Cancel
Please
register
or
sign in
to comment