Julien Moutinho / GarganTexternal tools
Commit fffc1c1b authored Sep 12, 2023 by Nicolas Atrax
Update pages and add corrections for CleanCSVtoTSV and PDFtoTSV
parent a8323c73
Showing 19 changed files with 108 additions and 153 deletions
Streamlit/lang/text_CSVHarzingToTSV.csv +1 -1
Streamlit/lang/text_CleanCSVtoTSV.csv +1 -1
Streamlit/lang/text_GarganTextJsonToTSV.csv +1 -1
Streamlit/lang/text_HALToGarganText.csv +7 -7
Streamlit/lang/text_IsidoreToGarganText.csv +8 -8
Streamlit/lang/text_MergeTermGarganText.csv +1 -1
Streamlit/lang/text_PDFtoTXT.csv +1 -1
Streamlit/lang/text_PubMedToGarganText.csv +3 -3
Streamlit/lang/text_RisToGarganText.csv +1 -1
Streamlit/lang/text_TXTtoTSV.csv +1 -1
Streamlit/lang/text_TsvTranslator.csv +1 -1
Streamlit/lang/text_YTBtoTSV.csv +1 -1
Streamlit/lang/text_ZoteroToGarganText.csv +6 -6
Streamlit/pages/Clean_CSV_to_TSV.py +57 -92
Streamlit/pages/HAL_To_GarganText.py +2 -4
Streamlit/pages/Isidore_To_GarganText.py +6 -17
Streamlit/pages/Istex_To_GarganText.py +5 -3
Streamlit/pages/PDF_to_TSV.py +3 -1
Streamlit/pages/Zotero_To_GarganText.py +2 -3
Streamlit/lang/text_CSVHarzingToTSV.csv
@@ -8,7 +8,7 @@ en,text,"Convert a CSV Harzing file into a TSV file compatible with GarganText"
fr,file,"Choisir un fichier"
en,file,"Choose a file"
fr,new_file,"Télécharge ton fichier TSV :"
fr,new_file,"Téléchargez votre fichier TSV :"
en,new_file,"Download your TSV file :"
fr,submit," Soumettre "
Streamlit/lang/text_CleanCSVtoTSV.csv
@@ -8,7 +8,7 @@ en,text,"Inspect a CSV file to check if it is compatible with GarganText."
fr,file,"Choisir un fichier"
en,file,"Choose a file"
fr,new_file,"Télécharge ton fichier TSV :"
fr,new_file,"Téléchargez votre fichier TSV :"
en,new_file,"Download your TSV file : "
fr,error,"Erreur : le fichier n'est pas compatible avec GarganText"
Streamlit/lang/text_GarganTextJsonToTSV.csv
@@ -8,7 +8,7 @@ en,text,"Transform a Json corpus from GarganText to a TSV file for GarganText"
fr,file,"Choisir un fichier"
en,file,"Choose a file"
fr,new_file,"Télécharge ton fichier TSV :"
fr,new_file,"Téléchargez votre fichier TSV :"
en,new_file,"Download your TSV file:"
fr,error,"Erreur : le fichier n'est pas valide"
Streamlit/lang/text_HALToGarganText.csv
@@ -17,19 +17,19 @@ en,submit,"Submit"
fr,load_api,"Chargement de l'api..."
en,load_api,"Loading API..."
fr,overload_api,"L'API est surchargé, relancer la requête dans quelques secondes"
en,overload'api,"The API is overloaded, please retry the request in a few seconds"
fr,overload_api,"L'API est surchargé, relancez la requête dans quelques secondes."
en,overload'api,"The API is overloaded, please retry the request in a few seconds."
fr,nb_doc,"Nombres de documents : "
en,nb_doc,"Numbers of documents : "
fr,perform1,"Pour des raisons de performence, on limit à "
fr,perform2," le nombre de document maximum"
fr,perform1,"Pour des raisons de performance, on limite à "
fr,perform2," le nombre maximum de documents."
en,perform1,"For performance reasons, we limit to "
en,perform2," the maximum number of documents"
en,perform2," the maximum number of documents."
fr,nb_taken,"Nombres de documents à prendre"
en,nb_taken,"Number of documents to take"
fr,createTSV,"Création du fichier TSV (Cela peut prendre quelques minutes)"
en,createTSV,"Creation of the TSV file (It may take a while)"
fr,createTSV,"Création du fichier TSV (Cela peut prendre quelque minutes) ..."
en,createTSV,"Creation of the TSV file (It may take a while) ..."
Streamlit/lang/text_IsidoreToGarganText.csv
@@ -2,7 +2,7 @@ locale,key,value
fr,title,"# Isidore vers GarganText"
en,title,"# Isidore To GarganText"
fr,text,"Effectue une recherche Isidore de documents scientifiques et les convertit en un fichier TSV."
fr,text,"Effectue une recherche Isidore de documents scientifiques et les convertir en un fichier TSV."
en,text,"Do a Isidore scientific documents research and convert it into a TSV file."
fr,keyword,"Mots clés"
@@ -17,24 +17,24 @@ en,submit,"Submit"
fr,load_api,"Chargement de l'api..."
en,load_api,"Loading API..."
fr,overload_api,"L'API est surchargé, relancer la requête dans quelques secondes."
en,overload'api,"The API is overloaded, please retry the request in a few seconds."
fr,overload_api,"L'API est surchargée, relancez la requête dans quelques secondes"
en,overload'api,"The API is overloaded, please retry the request in a few seconds"
fr,nb_doc,"Nombres de documents : "
en,nb_doc,"Numbers of documents : "
fr,perform1,"Pour des raisons de performence, on limite à "
fr,perform2," le nombre maximum de documents."
fr,perform1,"Pour des raisons de performances, on limite à "
fr,perform2," le nombre de documents maximums"
en,perform1,"For performance reasons, we limit to "
en,perform2,",the maximum number of documents."
en,perform2," the maximum number of documents"
fr,nb_taken,"Nombres de documents à prendre"
en,nb_taken,"Number of documents to take"
fr,createTSV,"Création du fichier TSV (Cela peut prendre quelques minutes)"
fr,createTSV,"Création du fichier TSV (Cela peut prendre quelque minutes)"
en,createTSV,"Creation of the TSV file (It may take a while)"
fr,doc_abstract1,"Il y a "
fr,doc_abstract2," documents qui peuvent ne pas avoir de description."
fr,doc_abstract2," documents qui peuvent ne pas avoir de descriptions."
en,doc_abstract1,"There are "
en,doc_abstract2," documents who may not have an abstract"
\ No newline at end of file
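The language files in this commit share a `locale,key,value` layout, and the `en` rows for the overload message use the key `overload'api` where the `fr` rows use `overload_api`. The following is a minimal sketch of how such a mismatch could be detected. It is not the application's own loader (which is not shown in this diff); the function names `load_texts` and `unmatched_keys` and the sample rows are assumptions made for illustration.

```python
import csv
import io
from collections import defaultdict


def load_texts(csv_text: str) -> dict:
    """Group rows of a locale,key,value file as {locale: {key: value}}."""
    table = defaultdict(dict)
    for row in csv.DictReader(io.StringIO(csv_text)):
        table[row["locale"]][row["key"]] = row["value"]
    return dict(table)


def unmatched_keys(table: dict, a: str = "fr", b: str = "en") -> set:
    """Return the keys that exist in only one of the two locales."""
    return set(table.get(a, {})) ^ set(table.get(b, {}))


# Hypothetical sample, shortened from the rows shown in the diff.
sample = '''locale,key,value
fr,load_api,"Chargement de l'api..."
en,load_api,"Loading API..."
fr,overload_api,"L'API est surchargé"
en,overload'api,"The API is overloaded"
'''

print(sorted(unmatched_keys(load_texts(sample))))
```

Running a check of this kind over each file in `Streamlit/lang/` would flag keys that a lookup by locale could miss.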
Streamlit/lang/text_MergeTermGarganText.csv
@@ -9,5 +9,5 @@ en,text,"Input 2 term files from GarganText."
fr,file," Choisir un fichier "
en,file," Choose a file "
fr,new_file," Télécharge ton fichier fusionné "
fr,new_file," Téléchargez la fusion de vos fichiers "
en,new_file," Download your merge file "
\ No newline at end of file
Streamlit/lang/text_PDFtoTXT.csv
@@ -8,7 +8,7 @@ en,text,"Convert a PDF file into a TXT file"
fr,file,"Choisir un fichier"
en,file,"Choose a file"
fr,new_file,"Télécharge ton fichier TXT :"
fr,new_file,"Téléchargez votre fichier TXT :"
en,new_file,"Download your TXT file: "
fr,watermark,"Filigrane : "
Streamlit/lang/text_PubMedToGarganText.csv
@@ -8,8 +8,8 @@ en,text,"Transform a pubmed corpus to a TSV file for GarganText"
fr,file,"Choisir un fichier"
en,file,"Choose a file"
fr,new_file,"Télécharge ton fichier TSV :"
fr,new_file,"Télécharger votre fichier TSV :"
en,new_file,"Download your TSV file:"
fr,error,"Erreur : le fichier n'est pas valide"
en,error,"Error : the file isn't valid"
\ No newline at end of file
fr,error,"Erreur : le fichier n'est pas valide !"
en,error,"Error : the file isn't valid !"
\ No newline at end of file
Streamlit/lang/text_RisToGarganText.csv
@@ -8,7 +8,7 @@ en,text,"Transform a RIS corpus to a TSV file for GarganText"
fr,file,"Choisir un fichier"
en,file,"Choose a file"
fr,new_file,"Télécharge ton fichier TSV :"
fr,new_file,"Téléchargez votre fichier TSV :"
en,new_file,"Download your TSV file:"
fr,error,"Erreur : le fichier n'est pas valide"
Streamlit/lang/text_TXTtoTSV.csv
@@ -14,7 +14,7 @@ en,text3,"You can choose the title and the author(s) for this TXT."
fr,file,"Choisir un fichier"
en,file,"Choose a file"
fr,new_file,"Télécharge ton fichier TSV :"
fr,new_file,"Téléchargez votre fichier TSV :"
en,new_file,"Download your TSV file : "
fr,author,"Auteur(s) : "
Streamlit/lang/text_TsvTranslator.csv
@@ -11,7 +11,7 @@ en,text2,"This tool detect automatically the languages of the PDF and translate
fr,file,"Choisir un fichier"
en,file,"Choose a file"
fr,new_file,"Télécharge ton fichier TSV traduit:"
fr,new_file,"Téléchargez votre fichier TSV traduit:"
en,new_file,"Download your translated TSV file : "
fr,submit," Soumettre "
Streamlit/lang/text_YTBtoTSV.csv
@@ -35,5 +35,5 @@ en,loading,"Videos processing : "
fr,quantity," sur "
en,quantity," out of "
fr,new_file,"Télécharge ton fichier TSV :"
fr,new_file,"Téléchargez votre fichier TSV :"
en,new_file,"Download your TSV file :"
Streamlit/lang/text_ZoteroToGarganText.csv
@@ -11,11 +11,11 @@ en,help,"Find your user ID here: https://www.zotero.org/settings/keys"
fr,submit,"Suivant"
en,submit,"Submit"
fr,denied,"L'acèss au compte n'est pas publique, pour la mettre publique : https://www.zotero.org/settings/privacy"
fr,denied,"L'acèss au compte n'est pas public, pour le mettre en public : https://www.zotero.org/settings/privacy"
en,denied,"Account access is not public, to make it public: https://www.zotero.org/settings/privacy"
fr,add_doc,"*Ajouter les documents que vous voulez mettre dans le TSV*"
en,add_doc,"*Add the document that tou want in the TSV*"
fr,add_doc,"*Ajoutez les documents que vous voulez mettre dans le TSV*"
en,add_doc,"*Add the document that you want in the TSV*"
fr,select_all,"Select All"
en,select_all,"Select All"
@@ -29,11 +29,11 @@ en,p_page,"Previous Page"
fr,n_page,"Page Suivante"
en,n_page,"Next Page"
fr,add_collect,"**Selectionner une collection** vous pouvez en choisir plusieurs"
fr,add_collect,"**Sélectionnez une collection** vous pouvez en choisir plusieurs"
en,add_collect,"**Chose a collection** you can choose multiple one"
fr,chose_collect,"Choisie une collection"
en,chose_collect,"Chose a collection"
fr,chose_collect,"Choisir une collection"
en,chose_collect,"Choose a collection"
fr,fileTSV1,"Le TSV contient "
fr,fileTSV2," documents"
Streamlit/pages/Clean_CSV_to_TSV.py
@@ -39,54 +39,29 @@ def getSeparator(file):
    return '\t', False
def checkPublicationCase(tmp, split, success):
    if split:
        if tmp[0][0].isupper() or tmp[1][0].isupper():
            return False
        else:
            return success
    if not tmp[0][0].isupper() or not tmp[1][0].isupper():
        return False
    return success
def checkPublication(name, registeredNames, errorMessage):
    tmpName = name
def lowerName(name):
    tmp = name
    if re.search('[a-zA-Z0-9]', name[0]) == None:
        tmpName = name[1:]
    tmp = tmpName.split(' ')
    success = True
        tmp = name[1:]
    if len(tmp) < 9:
        return tmp.lower()
    tmp = name.split(' ')
    split = False
    first = ""
    second = ""
    if "_" in tmp[0] and len(tmp) == 1:
    if len(tmp) == 1 and "_" in tmp[0]:
        tmp = tmp[0].split('_')
        split = True
    if len(tmp) != 2:
        success = False
        return name.lower()
    else:
        success = checkPublicationCase(tmp, split, success)
        first = tmp[0][0].lower() + tmp[0][1:]
        second = tmp[1][0].lower() + tmp[1][1:]
        if first != "publication" or second not in ["day", "month", "year"]:
            success = False
    if not success:
        errorMessage += "Error at line 1 ! Wrong name : " + \
            name + " is not appropriated !\n"
    else:
        registeredNames.append(first + "_" + second)
    return success, errorMessage
        return first + "_" + second
def checkNameValidity(name, columnNames, registeredNames, errorMessage):
    tmpName = name
    if re.search('[a-zA-Z0-9]', name[0]) == None:
        tmpName = name[1:]
    if tmpName not in columnNames:
        errorMessage += "Error at line 1 ! Wrong name : " + \
            name + " is not appropriated !\n"
        return False, errorMessage
    if tmpName in registeredNames:
    if name in registeredNames:
        errorMessage += "Error at line 1 ! Same name for 2 differents columns!\n"
        return False, errorMessage
    return True, errorMessage
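The `lowerName` helper shown in this hunk maps a header such as `Publication Year` or `Publication_Day` onto the `publication_year` style that GarganText expects. As a minimal sketch of that idea, assuming only the behaviour visible in the diff (one optional leading non-alphanumeric character, two words joined by a space or an underscore), a standalone version could look like the following. The name `normalize_column_name` is hypothetical and this is not the project's implementation.

```python
def normalize_column_name(name: str) -> str:
    """Hypothetical helper: map 'Publication Year' to 'publication_year'."""
    # Drop one leading non-alphanumeric character, for example a byte-order mark.
    if name and not name[0].isalnum():
        name = name[1:]
    # Accept either a space or an underscore between the two words.
    parts = name.split(" ")
    if len(parts) == 1 and "_" in parts[0]:
        parts = parts[0].split("_")
    # Anything that is not exactly two words is simply lower-cased.
    if len(parts) != 2:
        return name.lower()
    # Lower-case only the first letter of each word, as the diffed code does.
    return "_".join(part[:1].lower() + part[1:] for part in parts)


print(normalize_column_name("Publication Year"))
print(normalize_column_name("\ufeffTitle"))
```

Returning the normalised string, instead of a success flag plus an error message, lets the caller decide whether the result belongs to the list of known column names.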
@@ -105,23 +80,30 @@ def checkColumnExistence(registeredNames, errorMessage):
    return True, errorMessage
def checkColumnNames(name, errorMessage, registeredNames, success):
    columnNames = ["authors", "title", "publication_year", "publication_month", "publication_day", "abstract", "source"]
def checkColumnNames(name, errorMessage, registeredNames, otherColumns, success):
    columnNames = ["authors", "title", "source", "publication_year", "publication_month", "publication_day", "abstract"]
    name = name.replace("\n", "")
    if len(name) > 9:
        tmpSuccess, errorMessage = checkPublication(name, registeredNames, errorMessage)
    else:
        name = name.replace(" ", "")
        tmpSuccess, errorMessage = checkNameValidity(name[0].lower() + name[1:], columnNames, registeredNames, errorMessage)
        if tmpSuccess:
            registeredNames.append(name[0].lower() + name[1:])
    if success:
    tmpSuccess, errorMessage = checkNameValidity(name, columnNames, registeredNames, errorMessage)
    if tmpSuccess:
        if lowerName(name) in columnNames:
            registeredNames.append(name)
        else:
            otherColumns.append(name)
    if success:
        success = tmpSuccess
    return success, errorMessage, registeredNames
    return errorMessage, registeredNames, otherColumns, success
def addColumnsNamestoTSV(data, registeredNames, otherColumns):
    for name in registeredNames:
        if data != "":
            data += "\t"
        data += name
    for name in otherColumns:
        data += "\t"
        data += name
    return data
def getColumnsNames(file, separator, errorMessage):
    data = ""
@@ -130,40 +112,21 @@ def getColumnsNames(file, separator, errorMessage):
    success = True
    reader = csv.DictReader(codecs.iterdecode(file, 'utf-8'), delimiter=separator)
    columnsNames = []
    othersColumns = []
    for row in reader:
        for name, value in row.items():
            columnName = name.replace("\ufeff", "")
            if (columnNb < 7):
                success, errorMessage, registeredNames = checkColumnNames(name, errorMessage, registeredNames, success)
            if data != "":
                data += "\t"
            data += columnName
            columnNb += 1
            errorMessage, registeredNames, otherColumns, success = checkColumnNames(name, errorMessage, registeredNames, othersColumns, success)
        success, errorMessage = checkColumnExistence(registeredNames, errorMessage)
        if success:
            data = addColumnsNamestoTSV(data, registeredNames, otherColumns)
        break
    data += "\n"
    return data, success, errorMessage
def lowerName(name):
    tmp = name.split(' ')
    split = False
    first = ""
    second = ""
    if len(tmp) == 1 and "_" in tmp[0]:
        tmp = tmp[0].split('_')
        split = True
    if len(tmp) != 2:
        return name.lower()
    else:
        first = tmp[0][0].lower() + tmp[0][1:]
        second = tmp[1][0].lower() + tmp[1][1:]
        return first + "_" + second
def checkDate(name, value, success, fill, csvLine, errorMessage):
    if name in ["publication_year", "publication_month", "publication_day"]:
        if value == "" or value == "\n":
@@ -210,43 +173,45 @@ def correctedSequence(text):
    tmp = "\"" + tmp + "\""
    return tmp
def getContent(file, separator, data, success, fill, errorMessage):
    reader = csv.DictReader(codecs.iterdecode(file, 'utf-8'), delimiter=separator)
    columnNames = ["authors", "title", "source", "publication_year", "publication_month", "publication_day", "abstract"]
    csvLine = 2
    columnNb = 0
    reader = csv.DictReader(codecs.iterdecode(file, 'utf-8'), delimiter=separator)
    for row in reader:
        tmp = ""
        first = True
        tsv1 = ""
        tsv2 = ""
        for name, value in row.items():
            tmpFill = ""
            if not first:
                tmp += "\t"
            else:
                first = False
            if (columnNb < 7):
            if lowerName(name) in columnNames:
                if not first:
                    tsv1 += "\t"
                success, tmpFill, errorMessage = checkMissing(lowerName(name), value, success, fill, csvLine, errorMessage)
                if tmpFill != "":
                    tmp += tmpFill
                    tsv1 += tmpFill
                else:
                    success, tmpFill, errorMessage = checkDate(lowerName(name), value, success, fill, csvLine, errorMessage)
                    tmp += correctedSequence(value)
            else:
                tmp += correctedSequence(value)
            columnNb += 1
        columnNb = 0
                    tsv1 += correctedSequence(value)
            else:
                success, tmpFill, errorMessage = checkMissing(lowerName(name), value, success, fill, csvLine, errorMessage)
                if tmpFill != "":
                    tsv2 += "\t" + tmpFill
                else:
                    tsv2 += "\t" + correctedSequence(value)
            if first:
                first = False
        csvLine += 1
        data += tmp + "\n"
        data += tsv1 + tsv2 + "\n"
    return data[:-1], success, errorMessage
# Code End
st.write(st.session_state.general_text_dict['text'])
st.session_state.fill = st.checkbox(st.session_state.general_text_dict['fill'])
st.session_state.fill = st.checkbox(value=True, label=st.session_state.general_text_dict['fill'])
file = st.file_uploader(st.session_state.general_text_dict['file'], type=["tsv", "csv"], key='file')
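The `tsv1` / `tsv2` variables in `getContent` collect the values of the recognised GarganText columns first and append every other column after them. The sketch below illustrates that ordering in isolation, under stated assumptions: it reads from a string instead of an uploaded file, it matches header names exactly instead of going through `lowerName`, and it skips the `checkMissing`, `checkDate` and `correctedSequence` steps. `reorder_rows` and `KNOWN` are hypothetical names.

```python
import csv
import io

# Column names taken from the columnNames list in the diff.
KNOWN = ["authors", "title", "source", "publication_year",
         "publication_month", "publication_day", "abstract"]


def reorder_rows(text: str, delimiter: str = ",") -> str:
    """Emit one TSV line per row: known columns first, other columns after."""
    reader = csv.DictReader(io.StringIO(text), delimiter=delimiter, restval="")
    lines = []
    for row in reader:
        known = [value for name, value in row.items() if name in KNOWN]
        other = [value for name, value in row.items() if name not in KNOWN]
        lines.append("\t".join(known + other))
    return "\n".join(lines)


print(reorder_rows("title,note,authors\nA,x,B\n"))
```

Building two lists and joining them once avoids tracking a `first` flag to decide where a tab separator is needed.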
Streamlit/pages/HAL_To_GarganText.py
@@ -190,7 +190,5 @@ if st.session_state.stage_isidore > 1:
    print(st.session_state.nb_wanted)
    st.session_state.output = create_output(st.session_state.search, lang[st.session_state.language], st.session_state.nb_wanted)
    fileName = "HALOutput_" + str(datetime.now().strftime("%Y-%m-%d_%H:%M:%S")) + '.csv'
    st.download_button('Download TSV', st.session_state.output, fileName)
    st.download_button('Download TSV', st.session_state.output, 'output.csv')
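This hunk shows two ways of naming the downloaded file: a fixed `'output.csv'` and a name built from `datetime.now().strftime("%Y-%m-%d_%H:%M:%S")`. A timestamped name keeps successive downloads apart, but colons are not allowed in file names on Windows. The sketch below is one possible variant that uses hyphens in the time part; `timestamped_name` is a hypothetical helper, not code from the repository.

```python
from datetime import datetime


def timestamped_name(prefix: str, when: datetime, ext: str = ".csv") -> str:
    """Build a name such as 'HALOutput_2023-09-12_14-05-09.csv' (no colons)."""
    return prefix + when.strftime("%Y-%m-%d_%H-%M-%S") + ext


print(timestamped_name("HALOutput_", datetime(2023, 9, 12, 14, 5, 9)))
```

Passing the timestamp in as an argument, rather than calling `datetime.now()` inside the helper, keeps the function easy to test.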
Streamlit/pages/Isidore_To_GarganText.py
@@ -7,7 +7,6 @@ import streamlit as st
import requests as req
import json
import time
from datetime import datetime
from json import JSONDecodeError
import src.basic as tmp
@@ -65,16 +64,11 @@ def create_output(search, language, nb_doc):
                break
            time.sleep(retryTime)
            print('Retry')
        tmp, nb_tmp = createFile(txt, numberReplies, language)
        tmp, nb_tmp = createFile(txt, nb_doc % numberReplies, language)
        output += tmp
        nb += nb_tmp
    if nb_doc % numberReplies != 0:
        while (True):
            txt = loadApiIsidorePage(search, language, nb_doc // numberReplies + 1)
            if txt != 0:
                break
            time.sleep(retryTime)
            print('Retry')
        txt = loadApiIsidorePage(search, language, nb_doc // numberReplies + 1)
        tmp, nb_tmp = createFile(txt, nb_doc % numberReplies, language)
        output += tmp
        nb += nb_tmp
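`create_output` fetches results page by page: `nb_doc // numberReplies` full pages, then one partial page of `nb_doc % numberReplies` documents when the remainder is not zero. The arithmetic can be sketched on its own as follows. `page_plan` is a hypothetical helper and the page size of 100 in the example is an assumption, since the value of `numberReplies` is not visible in this diff.

```python
from typing import List


def page_plan(nb_doc: int, per_page: int) -> List[int]:
    """Return how many documents to take from each successive page."""
    full_pages, remainder = divmod(nb_doc, per_page)
    # Full pages first, then one shorter page if documents are left over.
    return [per_page] * full_pages + ([remainder] if remainder else [])


print(page_plan(250, 100))
```

With such a plan, passing `per_page` as the limit for a full page and the remainder only for the last page keeps the two cases apart.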
@@ -145,16 +139,12 @@ def createFile(docs, limit, language):
            else:
                abstract = tmp
        else:
            if '$' in abstract.keys():
                abstract = abstract['$']
            else:
                abstract = ''
            abstract = abstract['$']
        if 'types' in doc['isidore'].keys():
            print(i)
            if type(doc['isidore']['types']['type']) == str and doc['isidore']['types']['type'] in ['Books', 'text']:
            if type(doc['isidore']['types']['type'] == str) and doc['isidore']['types']['type'] in ['Books', 'text']:
                nb += 1
            elif type(doc['isidore']['types']['type']) == dict and doc['isidore']['types']['type']['$'] in ['Books', 'text']:
            elif type(doc['isidore']['types']['type'] == dict) and doc['isidore']['types']['type'][1] in ['Books', 'text']:
                nb += 1
            else:
                print(title)
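The two spellings of the type test in this hunk differ only in where the closing parenthesis sits, but they do not mean the same thing: `type(x) == str` compares the type of `x` with `str`, whereas `type(x == str)` is the type of a boolean, which is always truthy. The sketch below shows the difference and one way to read the label with `isinstance`. It assumes, as the diff suggests, that the API returns either a plain string or a mapping with a `'$'` key; `type_label` is a hypothetical helper.

```python
def type_label(value) -> str:
    """Return the label for a plain string or a {'$': label} mapping."""
    if isinstance(value, str):
        return value
    if isinstance(value, dict):
        return value.get("$", "")
    return ""


sample = {"$": "Books"}
# type(sample == str) is the class bool, which is truthy whatever sample is.
print(bool(type(sample == str)), type(sample) == str)
print(type_label(sample) in ["Books", "text"])
```

Using `isinstance` also accepts subclasses of `str` and `dict`, which `type(...) ==` would reject.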
@@ -290,5 +280,4 @@ if st.session_state.stage_isidore > 1:
    st.write(st.session_state.general_text_dict['doc_abstract1'] + str(st.session_state.nb_bad_file) + st.session_state.general_text_dict['doc_abstract2'])
    fileName = "isidoreOutput_" + str(datetime.now().strftime("%Y-%m-%d_%H:%M:%S")) + '.csv'
    st.download_button('Download TSV', st.session_state.output, fileName)
    st.download_button('Download TSV', st.session_state.output, 'output.csv')
Streamlit/pages/Istex_To_GarganText.py
@@ -5,7 +5,7 @@ Loïc Chapron
import json
import pandas as pd
from datetime import datetime
import datetime
import zipfile
import streamlit as st
import src.basic as tmp
@@ -60,6 +60,8 @@ def read_zip(zip_file):
                temp["publication_year"] = article["publicationDate"][0]
            except:
                temp["publication_year"] = datetime.date.today().year
            temp["publication_year"] = article.get("publicationDate", datetime.date.today().year)[0]
            temp["publication_month"] = 1
            temp["publication_day"] = 1
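One line in this hunk indexes the result of `article.get("publicationDate", datetime.date.today().year)` with `[0]`. When the key is missing, `get` returns the integer default, and subscripting an integer raises `TypeError`. The sketch below shows a fallback that avoids this. It assumes that `publicationDate`, when present, is a non-empty sequence whose first element is the year; that shape is inferred from the `[0]` in the diff and not confirmed. `publication_year` is a hypothetical helper.

```python
import datetime


def publication_year(article: dict):
    """Return the first publication date, or the current year if absent."""
    dates = article.get("publicationDate")
    if dates:
        return dates[0]
    return datetime.date.today().year


print(publication_year({"publicationDate": ["2019"]}))
print(publication_year({}))
```

Note that `datetime.date.today()` requires the module-level `import datetime` shown in the first hunk of this file.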
@@ -89,13 +91,13 @@ file = st.file_uploader(
if file:
    try:
        fileName = "istexOutput_" + str(datetime.now().strftime("%Y-%m-%d_%H:%M:%S")) + '.csv'
        name = file.name.split('.')[0] + '.csv'
        res, nb_dup = read_zip(file)
        if nb_dup:
            st.write(st.session_state.general_text_dict['dup1'] + str(nb_dup) + st.session_state.general_text_dict['dup2'])
        st.write(st.session_state.general_text_dict['new_file'])
        st.download_button('Download TSV', res, fileName)
        st.download_button(name, res, name)
    except Exception as e:
        st.write(st.session_state.general_text_dict['error'])
        print(e)
Streamlit/pages/PDF_to_TSV.py
@@ -13,6 +13,8 @@ import re
import chardet
import pandas as pd
import streamlit as st
import lib.tika.tika as tika
tika.initVM()
from lib.tika.tika import parser
from lib.langdetect.langdetect import detect
from lib.langdetect.langdetect.lang_detect_exception import LangDetectException
@@ -136,7 +138,7 @@ def segmentAbstract(fileName, fileAddress, tsv, author, source, year, month, day
    count = 1
    languages = {}
    while n < nbLines - 2:
        doc = "\n".join(abstract[n:n+9]).replace("�", "")
        doc = "\n".join(abstract[n:n+9]).replace("�", "").replace("", "")
        title = source + " : Part " + str(count)
        tsv += correctedSequence(author, False) + "\t" + correctedSequence(source, False) + "\t" + year + "\t" + month + "\t" + day + "\t"
Streamlit/pages/Zotero_To_GarganText.py
@@ -6,7 +6,7 @@ Loïc Chapron
import streamlit as st
import requests as req
import json
from datetime import date, datetime
from datetime import date
import src.basic as tmp
@@ -308,8 +308,7 @@ if st.session_state.stage == 2 and st.session_state.format == 'collections':
    output = createTSVfromCollections()
    st.write(st.session_state.general_text_dict['fileTSV1'] + str(len(output.split('\n')) - 2) + st.session_state.general_text_dict['fileTSV2'])
    fileName = "zoteroOutput_" + str(datetime.now().strftime("%Y-%m-%d_%H:%M:%S")) + '.csv'
    st.download_button('Download TSV', output, fileName)
    st.download_button('Download TSV', output, 'output.csv')
if st.session_state.stage > 0: