Commit 5866a043 authored by c24b

Prepare a method for checking content file = is it the right parser?

parent 3374d428
@@ -50,35 +50,38 @@ exposé dans `/templates/pages/projects/project.html`
## reference your parser script ## reference your parser script
## add your script into gargantext/util/ ## add your parser script into folder gargantext/util/parser/
here filename is Cern.py here my filename was Cern.py
##declare it into gargantext/util/parser/__init__.py ##declare it into gargantext/util/parser/__init__.py
from .Cern import CernParser from .Cern import CernParser
At this step, you will be able to see your parser and add a file with the form
but nothing will occur
## add your parser script into gargantext/util/parser/ ## the good way to write the scrapper script
At this step, you will be able to see your parser and add a file with the form Three main and only requirements:
it will send the job to toolchain * your parser class should inherit from the base class _Parser()
## * your parser class must have a parse method that take a **filename** as input
parse_extract_indexhyperdata(corpus) * you parser must structure and store data into **hyperdata_list** variable name
to be properly indexed by toolchain
# Adding a scrapper API to offer search option:
* Add pop up question Do you have a corpus * Add pop up question Do you have a corpus
option search in /templates/pages/projects/project.html line 181 option search in /templates/pages/projects/project.html line 181
adding
# Some changes # Some changes
* adding accepted_formats in constants * adding accepted_formats in constants
* adding check_file routine in Form check * adding check_file routine in Form check ==> but should inherit from utils/files.py
that also have implmented the size upload limit check
# Suggestion next step: # Suggestion next step:
* XML parser MARC21 UNIMARC ... * XML parser MARC21 UNIMARC ...
* A project type is qualified by the first element add i.e: * A project type is qualified by the first element add i.e:
the first element determine the type of corpus of all the corpora within the project the first element determine the type of corpus of all the corpora within the project
@@ -246,7 +246,8 @@ from .settings import BASE_DIR
# uploads/.gitignore prevents corpora indexing # uploads/.gitignore prevents corpora indexing
# copora can be either a folder or symlink towards specific partition # copora can be either a folder or symlink towards specific partition
UPLOAD_DIRECTORY = os.path.join(BASE_DIR, 'uploads/corpora') UPLOAD_DIRECTORY = os.path.join(BASE_DIR, 'uploads/corpora')
UPLOAD_LIMIT = 1024 * 1024 * 1024 UPLOAD_LIMIT = 1024
#* 1024 * 1024
DOWNLOAD_DIRECTORY = UPLOAD_DIRECTORY DOWNLOAD_DIRECTORY = UPLOAD_DIRECTORY
......
@@ -25,11 +25,13 @@ def download(url, name=''):
def upload(uploaded): def upload(uploaded):
print(repr(uploaded))
if uploaded.size > UPLOAD_LIMIT: if uploaded.size > UPLOAD_LIMIT:
raise IOError('Uploaded file is bigger than allowed: %d > %d' % ( raise IOError('Uploaded file is bigger than allowed: %d > %d' % (
uploaded.size, uploaded.size,
UPLOAD_LIMIT, UPLOAD_LIMIT,
)) ))
return save( return save(
contents = uploaded.file.read(), contents = uploaded.file.read(),
name = uploaded.name, name = uploaded.name,
......
@@ -23,6 +23,9 @@ class Parser:
def __del__(self): def __del__(self):
self._file.close() self._file.close()
def detect_format(self, accepted_format):
print(self._file[:1000])
def detect_encoding(self, string): def detect_encoding(self, string):
"""Useful method to detect the encoding of a document. """Useful method to detect the encoding of a document.
""" """
......
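
The `detect_format` stub added in this hunk only prints a slice of `self._file`. If `self._file` is an ordinary file object, slicing it would raise a `TypeError`, so one possible direction is to read the first bytes and rewind. The sketch below assumes a binary, seekable file-like object; the `SIGNATURES` table, the format names, and the constructor signature are hypothetical and not taken from the repository.

```python
import io

# Hypothetical table: leading bytes that hint at a given format.
SIGNATURES = {
    "zip": b"PK\x03\x04",
    "xml": b"<?xml",
}


class Parser:
    def __init__(self, file):
        # Assumption: a binary, seekable file-like object.
        self._file = file

    def detect_format(self, accepted_formats):
        """Return the first accepted format whose signature matches the
        start of the file, or None if nothing matches."""
        position = self._file.tell()
        head = self._file.read(1000)   # peek at the first 1000 bytes
        self._file.seek(position)      # rewind so later parsing is unaffected
        for name in accepted_formats:
            signature = SIGNATURES.get(name)
            if signature is not None and head.lstrip().startswith(signature):
                return name
        return None
```

A caller could pass the `accepted_formats` value mentioned in the notes above and reject the upload when the method returns `None`. Matching on leading bytes is only a heuristic, so a real check would likely need format-specific validation as well.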