gargantext / crawlers / pubmed
Commit dcaa0f5d, authored May 17, 2019 by Alexandre Delanoë
[Master] crawler function (without parameters, TODO).
Parent: ace45a17

Showing 6 changed files with 163 additions and 58 deletions:

app/Main.hs            +2    -34
crawlerPubMed.cabal    +105  -0
package.yaml           +10   -10
src/Lib.hs             +0    -6
src/PUBMED.hs          +38   -0
src/PUBMED/Parser.hs   +8    -8

app/Main.hs
@@ -2,39 +2,7 @@
 module Main where

-import PUBMED.Client
-import PUBMED.Parser
-import Network.HTTP.Client (newManager)
-import Network.HTTP.Client.TLS (tlsManagerSettings)
-import Servant.Client (runClientM, mkClientEnv, BaseUrl(..), Scheme(..))
-import Text.XML (parseLBS_, def)
-import Text.XML.Cursor (fromDocument, Cursor)
-import qualified Data.ByteString.Lazy as LBS
-import qualified Data.Text as T
-
-runParser :: Show res => (Cursor -> res) -> LBS.ByteString -> res
-runParser parser = parser . fromDocument . parseLBS_ def
-
-runSimpleFindPubmedAbstractRequest :: T.Text -> IO [PubMed]
-runSimpleFindPubmedAbstractRequest rq = do
-  manager' <- newManager tlsManagerSettings
-  res <- runClientM
-    (search (Just rq))
-    (mkClientEnv manager' $ BaseUrl Https "eutils.ncbi.nlm.nih.gov" 443 "entrez/eutils")
-  case res of
-    (Left err) -> return []
-    (Right (BsXml docs)) -> do
-      let docIds = runParser parseDocId docs
-      res' <- runClientM
-        (fetch (Just "pubmed") (Just "abstract") docIds)
-        (mkClientEnv manager' $ BaseUrl Https "eutils.ncbi.nlm.nih.gov" 443 "entrez/eutils")
-      case res' of
-        (Left err) -> return []
-        (Right (BsXml abstracts)) -> pubMedParser abstracts
+import qualified PUBMED as PubMed

 main :: IO ()
-main = do
-  pubmeds <- runSimpleFindPubmedAbstractRequest "organ"
-  print pubmeds
+main = PubMed.crawler "organ" >>= print

crawlerPubMed.cabal (new file, 0 → 100644)
-- This file has been generated from package.yaml by hpack version 0.28.2.
--
-- see: https://github.com/sol/hpack
--
-- hash: 2d262a275f0e9e59092e1b7e005b90c2018a7b9484371aad24175b7a30116e60

name:           crawlerPubMed
version:        0.1.0.0
description:    Please see the README on GitHub at <https://gitlab.iscpif.fr/gargantext/crawlers/pubmed/blob/dev/README.md>
homepage:       https://github.com/gitlab/crawlerPubMed#readme
bug-reports:    https://github.com/gitlab/crawlerPubMed/issues
author:         CNRS Gargantext
maintainer:     contact@gargantext.org
copyright:      2019 CNRS/IMT
license:        BSD3
license-file:   LICENSE
build-type:     Simple
cabal-version:  >= 1.10
extra-source-files:
    ChangeLog.md
    README.md

source-repository head
  type: git
  location: https://github.com/gitlab/crawlerPubMed

library
  exposed-modules:
      PUBMED
      PUBMED.Client
      PUBMED.Parser
  other-modules:
      Paths_crawlerPubMed
  hs-source-dirs:
      src
  build-depends:
      base >=4.7 && <5
    , bytestring
    , conduit
    , data-time-segment
    , exceptions
    , http-client
    , http-client-tls
    , http-media
    , protolude
    , servant
    , servant-client
    , text
    , time
    , xml-conduit
    , xml-types
  default-language: Haskell2010

executable crawlerPubMed-exe
  main-is: Main.hs
  other-modules:
      Paths_crawlerPubMed
  hs-source-dirs:
      app
  ghc-options: -threaded -rtsopts -with-rtsopts=-N
  build-depends:
      base >=4.7 && <5
    , bytestring
    , conduit
    , crawlerPubMed
    , data-time-segment
    , exceptions
    , http-client
    , http-client-tls
    , http-media
    , protolude
    , servant
    , servant-client
    , text
    , time
    , xml-conduit
    , xml-types
  default-language: Haskell2010

test-suite crawlerPubMed-test
  type: exitcode-stdio-1.0
  main-is: Spec.hs
  other-modules:
      Paths_crawlerPubMed
  hs-source-dirs:
      test
  ghc-options: -threaded -rtsopts -with-rtsopts=-N
  build-depends:
      base >=4.7 && <5
    , bytestring
    , conduit
    , crawlerPubMed
    , data-time-segment
    , exceptions
    , http-client
    , http-client-tls
    , http-media
    , protolude
    , servant
    , servant-client
    , text
    , time
    , xml-conduit
    , xml-types
  default-language: Haskell2010

package.yaml
-name:                pubMedCrawler
+name:                crawlerPubMed
 version:             0.1.0.0
-github:              "githubuser/pubMedCrawler"
+github:              "gitlab/crawlerPubMed"
 license:             BSD3
-author:              "Author name here"
-maintainer:          "example@example.com"
-copyright:           "2019 Author name here"
+author:              "CNRS Gargantext"
+maintainer:          "contact@gargantext.org"
+copyright:           "2019 CNRS/IMT"

 extra-source-files:
 - README.md
@@ -17,7 +17,7 @@ extra-source-files:
 # To avoid duplicated efforts in documentation and dealing with the
 # complications of embedding Haddock markup inside cabal files, it is
 # common to point users to the README.md file.
-description:         Please see the README on GitHub at <https://github.com/githubuser/pubMedCrawler#readme>
+description:         Please see the README on GitHub at <https://gitlab.iscpif.fr/gargantext/crawlers/pubmed/blob/dev/README.md>

 dependencies:
 - base >= 4.7 && < 5
@@ -40,7 +40,7 @@ library:
   source-dirs: src

 executables:
-  pubMedCrawler-exe:
+  crawlerPubMed-exe:
     main:                Main.hs
     source-dirs:         app
     ghc-options:
@@ -48,10 +48,10 @@ executables:
     - -rtsopts
     - -with-rtsopts=-N
     dependencies:
-    - pubMedCrawler
+    - crawlerPubMed

 tests:
-  pubMedCrawler-test:
+  crawlerPubMed-test:
     main:                Spec.hs
     source-dirs:         test
     ghc-options:
@@ -59,4 +59,4 @@ tests:
     - -rtsopts
     - -with-rtsopts=-N
     dependencies:
-    - pubMedCrawler
+    - crawlerPubMed

src/Lib.hs (deleted, 100644 → 0, last present at ace45a17)
-module Lib
-    ( someFunc
-    ) where
-
-someFunc :: IO ()
-someFunc = putStrLn "someFunc"

src/PUBMED.hs
+{-# LANGUAGE OverloadedStrings #-}
+
 module PUBMED where

 import PUBMED.Client
 import PUBMED.Parser
+import Network.HTTP.Client (newManager)
+import Network.HTTP.Client.TLS (tlsManagerSettings)
+import Servant.Client (runClientM, mkClientEnv, BaseUrl(..), Scheme(..))
+import Text.XML (parseLBS_, def)
+import Text.XML.Cursor (fromDocument, Cursor)
+import qualified Data.ByteString.Lazy as LBS
+import qualified Data.Text as T
+
+runParser :: Show res => (Cursor -> res) -> LBS.ByteString -> res
+runParser parser = parser . fromDocument . parseLBS_ def
+
+crawler :: T.Text -> IO [PubMed]
+crawler rq = do
+  manager' <- newManager tlsManagerSettings
+  res <- runClientM
+    (search (Just rq))
+    (mkClientEnv manager' $ BaseUrl Https "eutils.ncbi.nlm.nih.gov" 443 "entrez/eutils")
+  case res of
+    (Left err) -> return []
+    (Right (BsXml docs)) -> do
+      let docIds = runParser parseDocId docs
+      res' <- runClientM
+        (fetch (Just "pubmed") (Just "abstract") docIds)
+        (mkClientEnv manager' $ BaseUrl Https "eutils.ncbi.nlm.nih.gov" 443 "entrez/eutils")
+      case res' of
+        (Left err) -> return []
+        (Right (BsXml abstracts)) -> pubMedParser abstracts
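
The `crawler` added in this file maps both `Left err` branches to `return []`, so a failed request and an empty result set look identical to the caller. Below is a minimal sketch of one possible variation that keeps the error instead. It uses only `base`, and `searchIds`, `fetchAbstracts`, `crawlEither` and the `Err` alias are hypothetical stand-ins invented for illustration; they are not the servant clients `search` and `fetch` from `PUBMED.Client`.

```haskell
module Main where

-- Assumption: errors are plain strings here; the real code would carry
-- whatever error type servant-client returns.
type Err = String

-- Hypothetical stand-in for the search step (query -> document ids).
searchIds :: String -> IO (Either Err [Int])
searchIds "" = pure (Left "empty query")
searchIds _  = pure (Right [1, 2, 3])

-- Hypothetical stand-in for the fetch step (ids -> abstracts).
fetchAbstracts :: [Int] -> IO (Either Err [String])
fetchAbstracts ids = pure (Right (map (\i -> "abstract " ++ show i) ids))

-- Same two-step shape as `crawler`: search first, then fetch.
-- A failure in the first step is returned rather than replaced by [].
crawlEither :: String -> IO (Either Err [String])
crawlEither q = do
  found <- searchIds q
  case found of
    Left err  -> pure (Left err)
    Right ids -> fetchAbstracts ids

main :: IO ()
main = do
  crawlEither "organ" >>= print
  crawlEither "" >>= print
```

With this shape the caller decides whether an error should become an empty list, a log message, or a retry.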

src/PUBMED/Parser.hs
@@ -59,7 +59,7 @@ manyTagsUntil_' = many_ . ignoreEmptyTag . tagUntil

 data PubMed =
   PubMed { pubmed_article :: PubMedArticle
-         , pubmed_date    :: PubMedData
+         , pubmed_date    :: PubMedDate
          } deriving Show

 data PubMedArticle =
@@ -78,11 +78,11 @@ data Author =
          }
     deriving (Show)

-data PubMedData =
-  PubMedData { pubmedData_date  :: UTCTime
-             , pubmedData_year  :: Integer
-             , pubmedData_month :: Int
-             , pubmedData_day   :: Int
+data PubMedDate =
+  PubMedDate { pubmedDate_date  :: UTCTime
+             , pubmedDate_year  :: Integer
+             , pubmedDate_month :: Int
+             , pubmedDate_day   :: Int
              } deriving (Show)

 readPubMedFile :: FilePath -> IO [PubMed]
@@ -106,7 +106,7 @@ parsePubMedArticle =
 parsePubMedArticle' :: MonadThrow m => ConduitT Event o m PubMed
 parsePubMedArticle' = do
   article <- force "MedlineCitation" $ tagIgnoreAttrs "MedlineCitation" parseMedlineCitation
-  dates <- tagIgnoreAttrs "PubmedData" $ do
+  dates <- tagIgnoreAttrs "PubmedDate" $ do
     dates' <- tagIgnoreAttrs "History" $ many $ tagIgnoreAttrs "PubMedPubDate" $ do
       y' <- force "Year" $ tagIgnoreAttrs "Year" content
       m' <- force "Month" $ tagIgnoreAttrs "Month" content
@@ -117,7 +117,7 @@ parsePubMedArticle' = do
     return dates'
     _ <- many ignoreAnyTreeContent
   let (y,m,d) = maybe (1,1,1) identity $ join $ fmap head $ reverse <$> join dates
-  return $ PubMed article (PubMedData (jour y m d) y m d)
+  return $ PubMed article (PubMedDate (jour y m d) y m d)

 parseMedlineCitation :: MonadThrow m => ConduitT Event o m PubMedArticle
 parseMedlineCitation = do
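
The `let (y,m,d) = ...` line in the last hunk takes the final `PubMedPubDate` entry and falls back to `(1,1,1)` when no date was parsed. Below is a minimal sketch of that selection using only `base`. The source relies on Protolude's `head` (which returns a `Maybe`) and `identity`; here `listToMaybe` and `fromMaybe` play those roles. The name `lastDateOrDefault`, the `YMD` alias, and the assumed shape of `dates` as `Maybe (Maybe [YMD])` are illustrative assumptions inferred from the diff, not code from the repository.

```haskell
module Main where

import Control.Monad (join)
import Data.Maybe (fromMaybe, listToMaybe)

-- Assumed shape of one parsed date: (year, month, day).
type YMD = (Integer, Int, Int)

-- Outer Maybe: the enclosing tag was present; inner Maybe: "History" was present.
-- Picks the last date in the list, or (1, 1, 1) when nothing is available.
lastDateOrDefault :: Maybe (Maybe [YMD]) -> YMD
lastDateOrDefault dates =
  fromMaybe (1, 1, 1) (join dates >>= listToMaybe . reverse)

main :: IO ()
main = do
  print (lastDateOrDefault (Just (Just [(2018, 5, 1), (2019, 5, 17)])))
  print (lastDateOrDefault (Just (Just [])))
  print (lastDateOrDefault Nothing)
```

The three calls cover the cases the original expression handles: a non-empty history, an empty history, and a missing enclosing tag.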