GarganTexternal tools
Anne-Laure Thomas Derepas
Commit be50ef3c
Authored Sep 07, 2023 by Nicolas Atrax
Update PDF_to_TSV.py
Parent: 450d9bcf
Showing 1 changed file with 12 additions and 4 deletions

Streamlit/pages/PDF_to_TSV.py (+12 -4)
@@ -65,7 +65,7 @@ def estimateLanguagesPercentage(languages):
     for l in languages:
         total += int(languages[l])
     for l in languages:
-        tmp = int(languages[l]) / total * 100
+        tmp = round(int(languages[l]) / total * 100, 1)
         if tmp > max:
             max = tmp
             principal = l
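The rounding change in this hunk can be tried in isolation. The sketch below is a hypothetical standalone helper (`language_percentages` is not a name from the repository). It assumes `languages` maps a language code to a count that may be stored as a string, as the `int(...)` calls in the hunk suggest, and it adds a zero-total guard that the hunk does not show.

```python
def language_percentages(languages):
    """Return ({language: percentage}, principal_language).

    Hypothetical helper, not the repository's function: it mirrors the
    loop in the hunk but avoids shadowing the built-in `max`.
    """
    total = sum(int(count) for count in languages.values())
    if total == 0:
        # Assumption: with nothing counted there is no principal language.
        return {}, None
    # Same expression as the new line in the hunk: one decimal place.
    percentages = {
        lang: round(int(count) / total * 100, 1)
        for lang, count in languages.items()
    }
    # Pick the language with the largest share (first one wins on a tie).
    principal = max(percentages, key=percentages.get)
    return percentages, principal


percentages, principal = language_percentages({"fr": "2", "en": "1"})
print(percentages, principal)
```

Rounding happens before the comparison, as in the commit, so two languages whose shares differ by less than 0.05 points can tie.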
@@ -167,15 +167,20 @@ def correctedSequence(text, last):
 def getInfo():
-    return st.session_state.author, st.session_state.title, st.session_state.watermark
+    title = st.session_state.title
+    if title == "":
+        title = st.session_state.fileName.replace(".pdf", "")
+    return st.session_state.author, title, st.session_state.watermark
 def txtToTSV(fileName, fileAddress, pdfDir):
     st.session_state.page = 1
     author, title, watermark = getInfo()
     tsv = "authors\tsource\tpublication_year\tpublication_month\tpublication_day\ttitle\tabstract\n"
-    tsv, languages = segmentAbstract(fileName, fileAddress, tsv, author, title, str(date.today().year), "1", "1", watermark)
+    with st.spinner(st.session_state.general_text_dict['loading']):
+        tsv, languages = segmentAbstract(fileName, fileAddress, tsv, author, title, str(date.today().year), "1", "1", watermark)
     if '/' in fileName:
         fileName = fileName.split('/')[1]
     with open(pdfDir + "/" + fileName.replace(".pdf", ".tsv"), "w", encoding="utf-8-sig") as file:
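The title fallback added to `getInfo` can be illustrated without Streamlit. This is a minimal sketch, assuming the intent is "use the file name without its extension when no title was entered". `resolve_title` is a hypothetical name, and `os.path.splitext` is swapped in for the commit's `replace(".pdf", "")`. The two differ: `replace` removes every `.pdf` occurrence in the name, while `splitext` drops only the final extension, whatever it is.

```python
import os


def resolve_title(title, file_name):
    """Return `title`, or the file name without its extension when empty.

    Hypothetical helper, not part of the repository.
    """
    if title == "":
        # splitext splits off only the last extension, e.g. ".pdf".
        stem, _extension = os.path.splitext(file_name)
        return stem
    return title


print(resolve_title("", "report.pdf"))
print(resolve_title("My title", "report.pdf"))
```

In the page itself the two arguments would come from `st.session_state.title` and `st.session_state.fileName`, as in the hunk above.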
@@ -224,6 +229,7 @@ def upPage():
 def uploadZip():
     st.write(st.session_state.general_text_dict['text'])
     st.write(st.session_state.general_text_dict['text2'])
     return st.file_uploader(st.session_state.general_text_dict['file'], type=["zip"], key='file')
@@ -231,6 +237,7 @@ def uploadZip():
 def askPDF(fileName):
     with st.form("Submit"):
         st.write(fileName)
         st.write(st.session_state.general_text_dict['text3'])
         col1, col2 = st.columns(2)
         st.session_state.author = ""
         st.session_state.title = ""
@@ -279,6 +286,7 @@ if st.session_state.page == 2:
 if st.session_state.page == 1:
     fileName = os.listdir(st.session_state.zipDir.name)[st.session_state.nbDoc]
     st.session_state.fileName = fileName
     if '/' in fileName:
         fileName = fileName.split('/')[1]
     askPDF(fileName)
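The `fileName.split('/')[1]` step returns the second `/`-separated component, which is the file name only when the path has exactly one directory level. A minimal sketch of an alternative, assuming the intent is to show the final path component; `display_name` is a hypothetical helper, not a function from the repository:

```python
import posixpath


def display_name(file_name):
    """Return the last '/'-separated component of `file_name`.

    Hypothetical helper: posixpath.basename always splits on '/',
    regardless of the operating system the app runs on.
    """
    return posixpath.basename(file_name)


print(display_name("docs/report.pdf"))
print(display_name("a/b/report.pdf"))
```

For a single-level path both approaches agree; for `a/b/report.pdf`, `split('/')[1]` would give `b` instead of `report.pdf`.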