Skip to content
Projects
Groups
Snippets
Help
Loading...
Help
Submit feedback
Contribute to GitLab
Sign in
Toggle navigation
H
haskell-gargantext
Project
Project
Details
Activity
Releases
Cycle Analytics
Repository
Repository
Files
Commits
Branches
Tags
Contributors
Graph
Compare
Charts
Issues
0
Issues
0
List
Board
Labels
Milestones
Merge Requests
0
Merge Requests
0
CI / CD
CI / CD
Pipelines
Jobs
Schedules
Charts
Wiki
Wiki
Snippets
Snippets
Members
Members
Collapse sidebar
Close sidebar
Activity
Graph
Charts
Create a new issue
Jobs
Commits
Issue Boards
Open sidebar
Allen Lee
haskell-gargantext
Commits
3b01b815
Commit
3b01b815
authored
Nov 13, 2018
by
Alexandre Delanoë
Browse files
Options
Browse Files
Download
Email Patches
Plain Diff
[TEXT][PARSER][PUBMED] parser fix for dates.
parent
681674f6
Changes
1
Hide whitespace changes
Inline
Side-by-side
Showing
1 changed file
with
49 additions
and
47 deletions
+49
-47
PubMed.hs
src/Gargantext/Text/Parsers/PubMed.hs
+49
-47
No files found.
src/Gargantext/Text/Parsers/PubMed.hs
View file @
3b01b815
...
...
@@ -7,7 +7,16 @@ Maintainer : team@gargantext.org
Stability : experimental
Portability : POSIX
This version of Parsers fixes the Date of publication in Gargantext
(V3) parser of PubMed. Indeed, we can not rely neither on Journal
Publication Date neither on Article publication date, which are
incomplete structurally but for its interpretation too. Then, to
simplify and uniformize data, date of publication of database insertion
is used.
TODO:
- Add authors
- factorize
-}
{-# LANGUAGE OverloadedStrings #-}
...
...
@@ -16,68 +25,78 @@ Portability : POSIX
module
Gargantext.Text.Parsers.PubMed
where
import
Control.Monad
(
void
)
import
Data.Conduit.List
as
CL
hiding
(
catMaybes
)
import
Data.Conduit.List
as
CL
hiding
(
catMaybes
,
head
)
import
Control.Monad
(
join
)
import
GHC.IO
(
FilePath
)
import
Prelude
(
read
,
print
)
import
Prelude
(
read
)
import
Gargantext.Prelude
import
Control.Applicative
((
<*
))
import
Control.Monad.Catch
(
MonadThrow
)
import
Data.Maybe
(
Maybe
,
catMaybes
)
import
Data.Maybe
(
Maybe
)
import
Data.Monoid
(
mconcat
)
import
Text.XML.Stream.Parse
import
Data.Conduit
(
runConduit
,
(
.|
),
ConduitT
)
import
Data.Text
(
Text
,
unpack
,
concat
)
import
Data.Text
(
Text
,
unpack
)
import
Data.XML.Types
(
Event
)
import
Data.ByteString
(
ByteString
)
import
Data.Time.Segment
(
jour
)
import
Data.Time
(
UTCTime
(
..
))
import
qualified
Data.ByteString.Lazy
as
DBL
import
Gargantext.Text.Parsers.Wikimedia
data
PubMed
=
PubMed
{
pubmed_article
::
PubMedArticle
,
pubmed_date
::
PubMedData
}
deriving
Show
data
PubMedArticle
=
PubMedArticle
{
pubmed_title
::
Maybe
Text
,
pubmed_journal
::
Maybe
Text
,
pubmed_abstract
::
Maybe
[
Text
]
,
pubmed_date
::
UTCTime
,
pubmed_year
::
Integer
,
pubmed_month
::
Int
,
pubmed_day
::
Int
}
deriving
(
Show
)
deriving
(
Show
)
readPubMedFile
::
FilePath
->
IO
()
data
PubMedData
=
PubMedData
{
pubmedData_date
::
UTCTime
,
pubmedData_year
::
Integer
,
pubmedData_month
::
Int
,
pubmedData_day
::
Int
}
deriving
(
Show
)
readPubMedFile
::
FilePath
->
IO
[
PubMed
]
readPubMedFile
fp
=
do
input
<-
DBL
.
readFile
fp
pubMedParser
input
pubMedParser
::
DBL
.
ByteString
->
IO
()
pubMedParser
::
DBL
.
ByteString
->
IO
[
PubMed
]
pubMedParser
bstring
=
runConduit
$
parseLBS
def
bstring
.|
parseArticleSet
.|
CL
.
mapM_
print
.|
CL
.
consume
--parseArticleSet :: MonadThrow m => ConduitT Event o m [PubMedArticle]
parseArticleSet
::
MonadThrow
m
=>
ConduitT
Event
PubMed
m
()
parseArticleSet
=
do
as
<-
force
"force"
$
tagIgnoreAttrs
"PubmedArticleSet"
$
manyYield
parsePubMedArticle
-- _ <- many $ ignoreAnyTreeContent
return
as
parsePubMedArticle
::
MonadThrow
m
=>
ConduitT
Event
o
m
(
Maybe
PubMed
Article
)
parsePubMedArticle
::
MonadThrow
m
=>
ConduitT
Event
o
m
(
Maybe
PubMed
)
parsePubMedArticle
=
do
articles
<-
force
"PubmedArticle"
$
tagIgnoreAttrs
"PubmedArticle"
parsePubMedArticle'
--
_
<-
many
$
ignoreAnyTreeContent
articles
<-
tagIgnoreAttrs
"PubmedArticle"
parsePubMedArticle'
return
articles
parsePubMedArticle'
::
MonadThrow
m
=>
ConduitT
Event
o
m
(
Maybe
PubMedArticle
)
parsePubMedArticle'
::
MonadThrow
m
=>
ConduitT
Event
o
m
(
PubMed
)
parsePubMedArticle'
=
do
pubmed_article
<-
tagIgnoreAttrs
"MedlineCitation"
parseMedlineCitation
--
_
<-
tagIgnoreAttrs
"PubmedData"
content
article
<-
force
"MedlineCitation"
$
tagIgnoreAttrs
"MedlineCitation"
parseMedlineCitation
dates
<-
tagIgnoreAttrs
"PubmedData"
$
do
dates'
<-
tagIgnoreAttrs
"History"
$
many
$
tagIgnoreAttrs
"PubMedPubDate"
$
do
y'
<-
force
"Year"
$
tagIgnoreAttrs
"Year"
content
m'
<-
force
"Month"
$
tagIgnoreAttrs
"Month"
content
d'
<-
force
"Day"
$
tagIgnoreAttrs
"Day"
content
_
<-
many
$
ignoreAnyTreeContent
return
(
read
$
unpack
y'
,
read
$
unpack
m'
,
read
$
unpack
d'
)
_
<-
many
$
ignoreAnyTreeContent
return
dates'
_
<-
many
$
ignoreAnyTreeContent
return
pubmed_article
let
(
y
,
m
,
d
)
=
maybe
(
1
,
1
,
1
)
identity
$
join
$
fmap
head
$
reverse
<$>
join
dates
return
$
PubMed
(
article
)
(
PubMedData
(
jour
y
m
d
)
y
m
d
)
parseMedlineCitation
::
MonadThrow
m
=>
ConduitT
Event
o
m
PubMedArticle
parseMedlineCitation
=
do
...
...
@@ -87,18 +106,10 @@ parseMedlineCitation = do
parseArticle
::
MonadThrow
m
=>
ConduitT
Event
o
m
PubMedArticle
parseArticle
=
do
(
journal
,
maybePubDate
)
<-
force
"journal"
$
manyTagsUntil
"Journal"
$
do
maybePubDate'
<-
manyTagsUntil
"JournalIssue"
$
do
maybePubDate''
<-
manyTagsUntil
"PubDate"
$
do
y
<-
tagIgnoreAttrs
"Year"
content
m
<-
tagIgnoreAttrs
"Month"
content
d
<-
tagIgnoreAttrs
"Day"
content
return
(
y
,
m
,
d
)
return
maybePubDate''
journal
<-
force
"journal"
$
manyTagsUntil
"Journal"
$
do
j
<-
manyTagsUntil
"Title"
content
_
<-
many
$
ignoreAnyTreeContent
return
(
j
,
join
maybePubDate'
)
return
j
title
<-
do
t
<-
manyTagsUntil
"ArticleTitle"
content
...
...
@@ -115,17 +126,8 @@ parseArticle = do
return
as
-- TODO add authos
(
year
,
month
,
day
)
<-
case
maybePubDate
of
Nothing
->
force
"ArticleDate"
$
manyTagsUntil
"ArticleDate"
$
do
y
<-
force
"Year"
$
tagIgnoreAttrs
"Year"
content
m
<-
force
"Month"
$
tagIgnoreAttrs
"Month"
content
d
<-
force
"Day"
$
tagIgnoreAttrs
"Day"
content
return
(
read
$
unpack
y
,
read
$
unpack
m
,
read
$
unpack
d
)
Just
(
Just
y
,
Just
m
,
Just
d
)
->
return
(
read
$
unpack
"1"
,
read
$
unpack
"3"
,
read
$
unpack
"3"
)
_
->
panic
"error date"
_
<-
many
$
ignoreAnyTreeContent
return
$
PubMedArticle
title
journal
abstracts
(
jour
year
month
day
)
year
month
day
return
$
PubMedArticle
title
journal
abstracts
pubMedData
::
DBL
.
ByteString
...
...
Write
Preview
Markdown
is supported
0%
Try again
or
attach a new file
Attach a file
Cancel
You are about to add
0
people
to the discussion. Proceed with caution.
Finish editing this message first!
Cancel
Please
register
or
sign in
to comment