Commit db988a97 authored by delanoe's avatar delanoe

[STABLE] Update from current unstable.

parents 9dd09cc0 e4963d82
......@@ -28,6 +28,5 @@ see [install procedure](install.md)
2. Create a new branch <username>-refactoring
3. Run the gargantext-box
4. Code
5.Test
5. Test
6. Commit
......@@ -26,7 +26,7 @@ git clone ssh://gitolite@delanoe.org:1979/gargantext /srv/gargantext \
## Install
``` bash
```bash
# go into the directory
user@computer: cd /srv/gargantext/
#git inside installation folder
......@@ -34,20 +34,31 @@ git clone ssh://gitolite@delanoe.org:1979/gargantext /srv/gargantext \
#execute the installation
user@computer: ./install
```
During installation an admin account for gargantext will be created by asking you a username and a password
Remember it to accès to the Gargantext plateform
The installation requires creating a user for gargantext; you will be asked for:
```bash
Username (leave blank to use 'gargantua'):
#email is not mandatory
Email address:
Password:
Password (again):
```
If this step completes successfully, you should see:
```bash
Superuser created successfully.
[ ok ] Stopping PostgreSQL 9.5 database server: main.
```
## Run
Once the installation is done, the Gargantext platform will be available at localhost:8000.
Run the start executable file to launch the platform:
``` bash
# go into the directory
user@computer: cd /srv/gargantext/
# get inside the installation folder
user@computer: cd /install
# run the start script
user@computer: ./run
#type ctrl+d to exit or exit; command
user@computer: ./start
#type ctrl+d to exit or simply type exit in terminal;
```
Then open up a chromium browser and go to localhost:8000
......@@ -55,7 +66,3 @@ Click on "Enter Gargantext"
Log in with the username and password you created
Enjoy! ;)
# Resources
Adding a new source to Gargantext requires first declaring the source inside constants.py
```python
RESOURCETYPES = [
    {   "type": 9,                            # give a unique type int
        "name": 'SCOAP [XML]',                # resource name as shown in the "add corpus" form [generic format]
        "parser": "CernParser",               # name of the new parser class inside a CERN.py file (set to None if not implemented)
        "format": 'MARC21',                   # specific format
        'file_formats': ["zip", "xml"],       # accepted file formats
        "crawler": "CernCrawler",             # name of the new crawler class inside a CERN.py file (set to None if no crawler implemented)
        'default_languages': ['en', 'fr'],    # supported default languages of the source
    },
    ...
]
```
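At runtime this declaration is looked up through the constants helpers. A minimal sketch, assuming `get_resource` takes the unique "type" integer as it is used elsewhere in this commit:
```python
from gargantext.constants import get_resource

source = get_resource(9)              # the unique "type" int declared above
print(source["name"])                 # 'SCOAP [XML]'
print(source["parser"])               # "CernParser"
print(source["crawler"])              # "CernCrawler"
```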
## Adding a new parser
Once you have declared your new parser inside constants.py,
add your new parser file into /srv/gargantext/util/parsers/
following this naming convention:
* The filename must be the parser name in uppercase, without the "Parser" suffix,
e.g. MailParser => MAIL.py
* Inside this file, the parser class must carry the exact name declared as "parser" in constants.py
* Your new parser must inherit from the base class Parser and provide a parse(filebuffer) method
```python
#!/usr/bin/env python3
# filename: /srv/gargantext/util/parsers/MAIL.py
from ._Parser import Parser

class MailParser(Parser):
    def parse(self, file):
        ...
```
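For illustration only, a slightly fuller (hypothetical) parser body is sketched below; the mail headers and hyperdata keys are assumptions, the only real contract being that parse() returns the list of hyperdata dicts consumed by the toolchain:
```python
#!/usr/bin/env python3
# hypothetical sketch of /srv/gargantext/util/parsers/MAIL.py
from ._Parser import Parser

class MailParser(Parser):
    def parse(self, file):
        hyperdata_list = []
        # assumption: the file buffer holds raw mails separated by blank lines
        for raw_mail in file.read().decode("utf-8").split("\n\n"):
            hyperdata = {}
            for line in raw_mail.splitlines():
                if line.startswith("Subject:"):
                    hyperdata["title"] = line[len("Subject:"):].strip()
                elif line.startswith("Date:"):
                    hyperdata["publication_date"] = line[len("Date:"):].strip()
            if hyperdata:
                hyperdata_list.append(hyperdata)
        return hyperdata_list
```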
## Adding a new crawler
Once you have declared your new crawler inside constants.py,
add your new crawler file into /srv/gargantext/util/crawlers/
following this naming convention:
* The filename must be the crawler name in uppercase, without the "Crawler" suffix,
e.g. MailCrawler => MAIL.py
* Inside this file, the crawler class must carry the exact name declared as "crawler" in constants.py
* Your new crawler must inherit from the base class Crawler and provide three methods (a usage sketch follows the skeleton below):
    * scan_results => ids
    * sample => yes/no
    * fetch
```python
#!/usr/bin/env python3
# filename: /srv/gargantext/util/crawlers/MAIL.py
from ._Crawler import Crawler

class MailCrawler(Crawler):
    def scan_results(self, query):
        ...
        self.ids = set()

    def sample(self, results_nb):
        ...

    def fetch(self, ids):
        ...
```
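A hedged usage sketch (not part of the codebase) of the three-step workflow; the record keys mirror what the base Crawler reads in its constructor, and the "source" value is assumed to be the "type" int declared in constants.py:
```python
record = {
    "corpus_name": "mail corpus",     # illustrative values only
    "project_id": 1,
    "user_id": 1,
    "source": 9,                      # assumed: the "type" int from constants.py
    "query": "gargantext",
    "count": 0,
}

crawler = MailCrawler(record)
results_nb = crawler.scan_results(crawler.query)   # collects self.ids
if crawler.sample(results_nb):                     # yes/no on the sample size
    corpus_id = crawler.fetch(crawler.ids)         # download + create_corpus
```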
......@@ -14,6 +14,7 @@ djangorestframework==3.3.2
html5lib==0.9999999
jdatetime==1.7.2
kombu==3.0.33
langdetect==1.0.6
lxml==3.5.0
networkx==1.11
nltk==3.1
......
#!/usr/bin/env python
# -*- coding: utf-8 -*-
# ****************************
# ***** CERN Scraper *****
# ****************************
# Author:c24b
# Date: 27/05/2015
from ._Crawler import Crawler
import hmac, hashlib
import requests
import os
import random
import urllib.parse as uparse
from lxml import etree
from gargantext.settings import API_TOKENS
#from gargantext.util.files import build_corpus_path
from gargantext.util.db import session
from gargantext.models import Node
class CernCrawler(Crawler):
'''CERN SCOAP3 API Interaction'''
def __generate_signature__(self, url):
'''create the signature'''
#hmac-sha1 salted with secret
return hmac.new(self.secret,url, hashlib.sha1).hexdigest()
def __format_query__(self, query, of="xm", fields= None):
''' for query filters params
see doc https://scoap3.org/scoap3-repository/xml-api/
'''
#dict_q = uparse.parse_qs(query)
dict_q = {}
#by default: search by pattern
dict_q["p"] = query
if fields is not None and isinstance(fields, list):
fields = ",".join(fields)
dict_q["f"] = fields
#outputformat: "xm", "xmt", "h", "html"
dict_q["of"]= of
return dict_q
def __format_url__(self, dict_q):
'''format the url with encoded query'''
#add the apikey
dict_q["apikey"] = [self.apikey]
params = "&".join([(str(k)+"="+str(uparse.quote(v[0]))) for k,v in sorted(dict_q.items())])
return self.BASE_URL+params
def sign_url(self, dict_q):
'''add signature'''
API = API_TOKENS["CERN"]
self.apikey = API["APIKEY"]
self.secret = API["APISECRET"].encode("utf-8")
self.BASE_URL = u"http://api.scoap3.org/search?"
url = self.__format_url__(dict_q)
return url+"&signature="+self.__generate_signature__(url.encode("utf-8"))
def create_corpus(self):
#create a corpus
corpus = Node(
name = self.query,
#user_id = self.user_id,
parent_id = self.project_id,
typename = 'CORPUS',
hyperdata = { "action" : "Scrapping data"
, "language_id" : self.type["default_language"]
}
)
#add the resource
corpus.add_resource(
type = self.type["type"],
name = self.type["name"],
path = self.path)
try:
print("PARSING")
# p = eval(self.type["parser"])()
session.add(corpus)
session.commit()
self.corpus_id = corpus.id
parse_extract_indexhyperdata(corpus.id)
return self
except Exception as error:
print('WORKFLOW ERROR')
print(error)
session.rollback()
return self
def download(self):
import time
self.path = "/tmp/results.xml"
query = self.__format_query__(self.query)
url = self.sign_url(query)
start = time.time()
r = requests.get(url, stream=True)
downloaded = False
#the long part
with open(self.path, 'wb') as f:
print("Downloading file")
for chunk in r.iter_content(chunk_size=1024):
if chunk: # filter out keep-alive new chunks
#print("===")
f.write(chunk)
downloaded = True
end = time.time()
#print (">>>>>>>>>>LOAD results", end-start)
return downloaded
def scan_results(self):
'''scan the number of results by fetching only 1 result
showing only the author of page 1;
the total is read from the comment at the top of the page
'''
import time
self.results_nb = 0
query = self.__format_query__(self.query, of="hb")
query["ot"] = "100"
query["jrec"]='1'
query["rg"]='1'
url = self.sign_url(query)
print(url)
#start = time.time()
r = requests.get(url)
#end = time.time()
#print (">>>>>>>>>>LOAD results_nb", end-start)
if r.status_code == 200:
self.results_nb = int(r.text.split("-->")[0].split(': ')[-1][:-1])
return self.results_nb
else:
raise ValueError(r.status_code)
from ._Crawler import *
import json
class ISTexCrawler(Crawler):
"""
ISTEX Crawler
"""
def __format_query__(self,query=None):
'''formating query urlquote instead'''
if query is not None:
query = query.replace(" ","+")
return query
else:
self.query = self.query.replace(" ","+")
return self.query
def scan_results(self):
#get the number of results
self.results_nb = 0
self.query = self.__format_query__()
_url = "http://api.istex.fr/document/?q="+self.query+"&size=0"
#"&output=id,title,abstract,pubdate,corpusName,authors,language"
r = requests.get(_url)
print(r)
if r.status_code == 200:
self.results_nb = int(r.json()["total"])
self.status.append("fetching results")
return self.results_nb
else:
self.status.append("error")
raise ValueError(r.status_code)
def download(self):
'''fetching items'''
downloaded = False
def get_hits(future):
'''here we directly get the result hits'''
response = future.result()
if response.status_code == 200:
return response.json()["hits"]
else:
return None
#session = FuturesSession()
#self.path = "/tmp/results.json"
self.status.append("fetching results")
paging = 100
self.query_max = self.results_nb
if self.query_max > QUERY_SIZE_N_MAX:
msg = "Invalid sample size N = %i (max = %i)" % (self.query_max, QUERY_SIZE_N_MAX)
print("ERROR (scrap: istex d/l ): ",msg)
self.query_max = QUERY_SIZE_N_MAX
#urlreqs = []
with open(self.path, 'wb') as f:
for i in range(0, self.query_max, paging):
url_base = "http://api.istex.fr/document/?q="+self.query+"&output=*&from=%i&size=%i" %(i, paging)
r = requests.get(url_base)
if r.status_code == 200:
downloaded = True
f.write(r.text.encode("utf-8"))
else:
downloaded = False
self.status.insert(0, "error fetching ISTEX " + str(r.status_code))
break
return downloaded
# Scrapers config
QUERY_SIZE_N_MAX = 1000
from gargantext.constants import get_resource
from gargantext.util.scheduling import scheduled
from gargantext.util.db import session
from requests_futures.sessions import FuturesSession
from gargantext.util.db import session
import requests
from gargantext.models.nodes import Node
#from gargantext.util.toolchain import parse_extract_indexhyperdata
from datetime import date
class Crawler:
"""Base class for performing search and add corpus file depending on the type
"""
def __init__(self, record):
#the name of corpus
#that will be built in case of internal fileparsing
self.record = record
self.name = record["corpus_name"]
self.project_id = record["project_id"]
self.user_id = record["user_id"]
self.resource = record["source"]
self.type = get_resource(self.resource)
self.query = record["query"]
#format the sampling
self.n_last_years = 5
self.YEAR = date.today().year
#not great
# but the easy version
self.MONTH = str(date.today().month)
if len(self.MONTH) == 1:
self.MONTH = "0"+self.MONTH
self.MAX_RESULTS = 1000
try:
self.results_nb = int(record["count"])
except KeyError:
#does not exist yet
self.results_nb = 0
try:
self.webEnv = record["webEnv"]
self.queryKey = record["queryKey"]
self.retMax = record["retMax"]
except KeyError:
#does not exist yet
self.queryKey = None
self.webEnv = None
self.retMax = 1
self.status = [None]
self.path = "/tmp/results.txt"
def tmp_file(self):
'''the results should be stored here,
depending on the format type'''
raise NotImplementedError
def parse_query(self):
'''the query parameters should be parsed here,
depending on the type, to retrieve the set of activated search options
'''
raise NotImplementedError
def fetch(self):
if self.download():
self.create_corpus()
return self.corpus_id
def get_sampling_dates(self):
'''Create a sample list of min and max dates based on Y and M
for N_LAST_YEARS results'''
dates = []
for i in range(self.n_last_years):
maxyear = self.YEAR -i
mindate = str(maxyear-1)+"/"+str(self.MONTH)
maxdate = str(maxyear)+"/"+str(self.MONTH)
print(mindate,"-",maxdate)
dates.append((mindate, maxdate))
return dates
def create_corpus(self):
#create a corpus
corpus = Node(
name = self.query,
user_id = self.user_id,
parent_id = self.project_id,
typename = 'CORPUS',
hyperdata = { "action" : "Scrapping data",
"language_id" : self.type["default_language"],
}
)
self.corpus_id = corpus.id
if len(self.paths) > 0:
for path in self.paths:
#add the resource
corpus.add_resource(
type = self.type["type"],
name = self.type["name"],
path = path
)
session.add(corpus)
session.commit()
scheduled(parse_extract_indexhyperdata(corpus.id))
else:
#add the resource
corpus.add_resource(
type = self.type["type"],
name = self.type["name"],
path = self.path
)
session.add(corpus)
session.commit()
scheduled(parse_extract_indexhyperdata(corpus.id))
return corpus
import importlib
from gargantext.constants import RESOURCETYPES
from gargantext.settings import DEBUG
#if DEBUG: print("Loading available Crawlers")
base_parser = "gargantext.util.crawlers"
for resource in RESOURCETYPES:
if resource["crawler"] is not None:
try:
name =resource["crawler"]
#crawler is type basename+"Crawler"
filename = name.replace("Crawler", "").lower()
module = base_parser+".%s" %(filename)
importlib.import_module(module)
#if DEBUG: print("\t-", name)
except Exception as e:
print("Check constants.py RESOURCETYPES declaration %s \nCRAWLER %s is not available for %s" %(str(e), resource["crawler"], resource["name"]))
#initial import
#from .cern import CernCrawler
#from .istex import ISTexCrawler
#from .pubmed import PubmedCrawler
from gargantext.constants import *
from langdetect import detect, DetectorFactory
class Language:
def __init__(self, iso2=None, iso3=None, name=None):
def __init__(self, iso2=None, iso3=None,full_name=None, name=None):
self.iso2 = iso2
self.iso3 = iso3
self.name = name
self.implemented = iso2 in LANGUAGES
def __str__(self):
result = '<Language'
for key, value in self.__dict__.items():
......@@ -16,6 +16,7 @@ class Language:
return result
__repr__ = __str__
class Languages(dict):
def __missing__(self, key):
key = key.lower()
......@@ -25,6 +26,10 @@ class Languages(dict):
languages = Languages()
def detect_lang(text):
DetectorFactory.seed = 0
return languages[detect(text)].iso2
import pycountry
pycountry_keys = (
('iso639_3_code', 'iso3', ),
......@@ -49,3 +54,4 @@ languages['fre'] = languages['fr']
languages['ger'] = languages['de']
languages['Français'] = languages['fr']
languages['en_US'] = languages['en']
languages['english'] = languages['en']
......@@ -2,6 +2,8 @@ from ._Parser import Parser
from datetime import datetime
from bs4 import BeautifulSoup
from lxml import etree
#import asyncio
#q = asyncio.Queue(maxsize=0)
class CernParser(Parser):
#mapping MARC21 ==> hyperdata
......@@ -38,24 +40,34 @@ class CernParser(Parser):
"856": {"u":"pdf_source"},
}
def format_date(self, hyperdata):
'''formatting pubdate'''
prefix = "publication"
date = datetime.strptime(hyperdata[prefix + "_date"], "%Y-%m-%d")
#hyperdata[prefix + "_year"] = date.strftime('%Y')
hyperdata[prefix + "_month"] = date.strftime("%m")
hyperdata[prefix + "_day"] = date.strftime("%d")
hyperdata[prefix + "_hour"] = date.strftime("%H")
hyperdata[prefix + "_minute"] = date.strftime("%M")
hyperdata[prefix + "_second"] = date.strftime("%S")
hyperdata[prefix + "_date"] = date.strftime("%Y-%m-%d %H:%M:%S")
print("Date", hyperdata["publication_date"])
return hyperdata
# def format_date(self, hyperdata):
# '''formatting pubdate'''
# prefix = "publication"
# try:
# date = datetime.strptime(hyperdata[prefix + "_date"], "%Y-%m-%d")
# except ValueError:
# date = datetime.strptime(hyperdata[prefix + "_date"], "%Y-%m")
# date.day = "01"
# hyperdata[prefix + "_year"] = date.strftime('%Y')
# hyperdata[prefix + "_month"] = date.strftime("%m")
# hyperdata[prefix + "_day"] = date.strftime("%d")
#
# hyperdata[prefix + "_hour"] = date.strftime("%H")
# hyperdata[prefix + "_minute"] = date.strftime("%M")
# hyperdata[prefix + "_second"] = date.strftime("%S")
# hyperdata[prefix + "_date"] = date.strftime("%Y-%m-%d %H:%M:%S")
# #print("Date", hyperdata["publication_date"])
# return hyperdata
#@asyncio.coroutine
def parse(self, file):
#print("PARSING")
hyperdata_list = []
doc = file.read()
soup = BeautifulSoup(doc.decode("utf-8"), "lxml")
#print(doc[:35])
soup = BeautifulSoup(doc, "lxml")
#print(soup.find("record"))
for record in soup.find_all("record"):
hyperdata = {v:[] for v in self.MARC21["100"].values()}
hyperdata["uid"] = soup.find("controlfield").text
......@@ -86,8 +98,8 @@ class CernParser(Parser):
hyperdata["authors_affiliations"] = (",").join(hyperdata["authors_affiliations"])
hyperdata["authors"] = (",").join(hyperdata["authors"])
hyperdata["authors_mails"] = (",").join(hyperdata["authors_mails"])
hyperdata = self.format_date(hyperdata)
#hyperdata = self.format_date(hyperdata)
hyperdata = self.format_hyperdata_languages(hyperdata)
hyperdata = self.format_hyperdata_dates(hyperdata)
hyperdata_list.append(hyperdata)
return hyperdata_list
from .Ris import RISParser
from .RIS import RISParser
class ISIParser(RISParser):
_begin = 3
_parameters = {
b"ER": {"type": "delimiter"},
b"TI": {"type": "hyperdata", "key": "title", "separator": " "},
......@@ -17,4 +17,3 @@ class ISIParser(RISParser):
b"AB": {"type": "hyperdata", "key": "abstract", "separator": " "},
b"WC": {"type": "hyperdata", "key": "fields"},
}
......@@ -31,6 +31,7 @@ class PubmedParser(Parser):
if isinstance(file, bytes):
file = BytesIO(file)
xml = etree.parse(file, parser=self.xml_parser)
#print(xml.find("PubmedArticle"))
xml_articles = xml.findall('PubmedArticle')
# initialize the list of hyperdata
hyperdata_list = []
......
......@@ -36,6 +36,7 @@ class RISParser(Parser):
last_values = []
# browse every line of the file
for line in file:
if len(line) > 2 :
# extract the parameter key
parameter_key = line[:2]
......
......@@ -20,14 +20,9 @@ class Parser:
self._file = file
def __del__(self):
self._file.close()
if hasattr(self, '_file'):
self._file.close()
def detect_format(self, afile, a_formats):
#import magic
print("Detecting format")
#print(magic.from_file(afile))
return
def detect_encoding(self, string):
"""Useful method to detect the encoding of a document.
......@@ -167,6 +162,8 @@ class Parser:
def __iter__(self, file=None):
"""Parse the file, and its children files found in the file.
C24B comment: the file storage/extraction should be done upstream,
and this method is a bit obscure
"""
if file is None:
file = self._file
......
from .Ris import RISParser
from .Ris_repec import RepecParser
from .Isi import ISIParser
# from .Jstor import JstorParser
# from .Zotero import ZoteroParser
from .Pubmed import PubmedParser
# # 2015-12-08: parser 2 en 1
from .Europress import EuropressParser
from .ISTex import ISTexParser
from .CSV import CSVParser
from .Cern import CernParser
import importlib
from gargantext.constants import RESOURCETYPES
from gargantext.settings import DEBUG
if DEBUG:
print("Loading available PARSERS:")
base_parser = "gargantext.util.parsers"
for resource in RESOURCETYPES:
if resource["parser"] is not None:
#parser file is without Parser
fname = resource["parser"].replace("Parser", "")
#parser file is formatted as a title
module = base_parser+".%s" %(fname.upper())
#parser module as declared in constants
parser = importlib.import_module(module)
if DEBUG:
print("\t-", resource["parser"])
getattr(parser,resource["parser"])
......@@ -3,9 +3,9 @@ When started, it initiates the parser;
when passed text, the text is piped to the parser.
When ended, the parser is closed and the tagged word returned as a tuple.
"""
from gargantext.constants import RULE_JJNN, DEFAULT_MAX_NGRAM_LEN
import re
import nltk
class Tagger:
......@@ -19,7 +19,28 @@ class Tagger:
| [][.,;"'?!():-_`] # these are separate tokens
''', re.UNICODE | re.MULTILINE | re.DOTALL)
self.buffer = []
self.start()
#self.start()
def clean_text(self, text):
"""Clean the text for better POS tagging.
For now, only removes (short) XML tags.
"""
return re.sub(r'<[^>]{0,45}>', '', text)
def extract(self, text, rule=RULE_JJNN, label='NP', max_n_words=DEFAULT_MAX_NGRAM_LEN):
self.text = self.clean_text(text)
grammar = nltk.RegexpParser(label + ': ' + rule)
tagged_tokens = list(self.tag_text(self.text))
if len(tagged_tokens):
grammar_parsed = grammar.parse(tagged_tokens)
for subtree in grammar_parsed.subtrees():
if subtree.label() == label:
if len(subtree) < max_n_words:
yield subtree.leaves()
# ex: [('wild', 'JJ'), ('pollinators', 'NNS')]
def __del__(self):
self.stop()
......@@ -29,6 +50,8 @@ class Tagger:
This method is called by the constructor, and can be overridden by
inherited classes.
"""
print("START")
self.extract(self.text)
def stop(self):
"""Ends the tagger.
......
from .TurboTagger import TurboTagger
from .NltkTagger import NltkTagger
from .TreeTagger import TreeTagger
from .MeltTagger import EnglishMeltTagger, FrenchMeltTagger
#version2
#imported as needed
#Version 1
#~ import importlib
#~ from gargantext.constants import LANGUAGES
#~ from gargantext.settings import DEBUG
#~ if DEBUG:
#~ print("Loading available Taggers:")
#~ for lang, tagger in LANGUAGES.items():
#~ tagger = tagger["tagger"]
#~ filename = "gargantext.util.taggers.%s" %(tagger)
#~ if DEBUG:
#~ print("\t-%s (%s)" %(tagger, lang))
#~ getattr(importlib.import_module(filename), tagger)()
#VERSION 0
#~ #initally a manual import declaration
#~ from .TurboTagger import TurboTagger
#~ from .NltkTagger import NltkTagger
#~ from .TreeTagger import TreeTagger
#~ from .MeltTagger import EnglishMeltTagger, FrenchMeltTagger
......@@ -102,7 +102,7 @@ def do_maplist(corpus,
if n_ngrams == 0:
raise ValueError("No ngrams in cooc table ?")
#return
# results, with same structure as quotas
chosen_ngrams = {
'topgen':{'monograms':[], 'multigrams':[]},
......
......@@ -82,6 +82,7 @@ def parse_extract_indexhyperdata(corpus):
favs = corpus.add_child(
typename='FAVORITES', name='favorite docs in "%s"' % corpus.name
)
session.add(favs)
session.commit()
print('CORPUS #%d: [%s] new favorites node #%i' % (corpus.id, t(), favs.id))
......@@ -265,7 +266,7 @@ def recount(corpus):
# -> specclusion/genclusion: compute + write (=> NodeNodeNgram)
(spec_id, gen_id) = compute_specgen(corpus, cooc_matrix = coocs,
spec_overwrite_id = old_spec_id,
spec_overwrite_id = old_spec_id,
gen_overwrite_id = old_gen_id)
print('RECOUNT #%d: [%s] updated spec-clusion node #%i' % (corpus.id, t(), spec_id))
......
#!/usr/bin/python3 env
"""
For initial ngram groups via stemming
Example:
......@@ -21,16 +22,13 @@ def prepare_stemmers(corpus):
"""
Returns *several* stemmers (one for each language in the corpus)
(as a dict of stemmers with key = language_iso2)
languages have been previously filtered by the supported source languages
and formatted
"""
stemmers_by_lg = {
# always get a generic stemmer in case language code unknown
'__unknown__' : SnowballStemmer("english")
}
for lgiso2 in corpus.hyperdata['languages'].keys():
if (lgiso2 != '__skipped__'):
lgname = languages[lgiso2].name.lower()
stemmers_by_lg[lgiso2] = SnowballStemmer(lgname)
return stemmers_by_lg
stemmers = {lang:SnowballStemmer(languages[lang].name.lower()) for lang \
in corpus.languages.keys() if lang !="__skipped__"}
stemmers['__unknown__'] = SnowballStemmer("english")
return stemmers
def compute_groups(corpus, stoplist_id = None, overwrite_id = None):
"""
......@@ -57,16 +55,17 @@ def compute_groups(corpus, stoplist_id = None, overwrite_id = None):
my_groups = defaultdict(Counter)
# preloop per doc to sort ngrams by language
for doc in corpus.children():
if ('language_iso2' in doc.hyperdata):
lgid = doc.hyperdata['language_iso2']
else:
lgid = "__unknown__"
# doc.ngrams is an sql query (ugly but useful intermediate step)
# FIXME: move the counting and stoplist filtering up here
for ngram_pack in doc.ngrams.all():
todo_ngrams_per_lg[lgid].add(ngram_pack)
for doc in corpus.children('DOCUMENT'):
if doc.id not in corpus.skipped_docs:
if ('language_iso2' in doc.hyperdata):
lgid = doc.hyperdata['language_iso2']
else:
lgid = "__unknown__"
# doc.ngrams is an sql query (ugly but useful intermediate step)
# FIXME: move the counting and stoplist filtering up here
for ngram_pack in doc.ngrams.all():
todo_ngrams_per_lg[lgid].add(ngram_pack)
# --------------------
# long loop per ngrams
......
from gargantext.util.db import *
from gargantext.models import *
from gargantext.constants import *
from gargantext.util.ngramsextractors import ngramsextractors
from collections import defaultdict
from re import sub
from gargantext.util.scheduling import scheduled
def _integrate_associations(nodes_ngrams_count, ngrams_data, db, cursor):
......@@ -36,7 +33,7 @@ def _integrate_associations(nodes_ngrams_count, ngrams_data, db, cursor):
db.commit()
def extract_ngrams(corpus, keys=('title', 'abstract', ), do_subngrams = DEFAULT_INDEX_SUBGRAMS):
def extract_ngrams(corpus, keys=DEFAULT_INDEX_FIELDS, do_subngrams = DEFAULT_INDEX_SUBGRAMS):
"""Extract ngrams for every document below the given corpus.
Default language is given by the resource type.
The result is then inserted into database.
......@@ -46,57 +43,50 @@ def extract_ngrams(corpus, keys=('title', 'abstract', ), do_subngrams = DEFAULT_
db, cursor = get_cursor()
nodes_ngrams_count = defaultdict(int)
ngrams_data = set()
# extract ngrams
resource_type_index = corpus.resources()[0]['type']
#1 corpus = 1 resource
resource = corpus.resources()[0]
documents_count = 0
resource_type = RESOURCETYPES[resource_type_index]
default_language_iso2 = resource_type['default_language']
for documents_count, document in enumerate(corpus.children('DOCUMENT')):
# get ngrams extractor for the current document
language_iso2 = document.hyperdata.get('language_iso2', default_language_iso2)
try:
# this looks for a parser in constants.LANGUAGES
ngramsextractor = ngramsextractors[language_iso2]
except KeyError:
# skip document
print('Unsupported language: `%s` (doc #%i)' % (language_iso2, document.id))
# and remember that for later processes (eg stemming)
document.hyperdata['__skipped__'] = 'ngrams_extraction'
document.save_hyperdata()
session.commit()
if language_iso2 in corpus.hyperdata['languages']:
skipped_lg_infos = corpus.hyperdata['languages'].pop(language_iso2)
corpus.hyperdata['languages']['__skipped__'][language_iso2] = skipped_lg_infos
corpus.save_hyperdata()
session.commit()
continue
# extract ngrams on each of the considered keys
for key in keys:
value = document.hyperdata.get(key, None)
if not isinstance(value, str):
continue
# get ngrams
for ngram in ngramsextractor.extract(value):
tokens = tuple(normalize_forms(token[0]) for token in ngram)
if do_subngrams:
# ex tokens = ["very", "cool", "exemple"]
# subterms = [['very', 'cool'],
# ['very', 'cool', 'exemple'],
# ['cool', 'exemple']]
subterms = subsequences(tokens)
else:
subterms = [tokens]
for seqterm in subterms:
ngram = ' '.join(seqterm)
if len(ngram) > 1:
# doc <=> ngram index
nodes_ngrams_count[(document.id, ngram)] += 1
# add fields : terms n
ngrams_data.add((ngram[:255], len(seqterm), ))
source = get_resource(resource["type"])
#load available taggers for the source's default language
docs = [doc for doc in corpus.children('DOCUMENT') if doc.id not in corpus.skipped_docs]
tagger_bots = {lang: load_tagger(lang)() for lang in corpus.languages if lang != "__skipped__"}
#sort docs by lang?
for lang, tagger in tagger_bots.items():
for documents_count, document in enumerate(docs):
language_iso2 = document.hyperdata.get('language_iso2', lang)
#print(language_iso2)
for key in keys:
try:
value = document[str(key)]
if not isinstance(value, str):
continue
# get ngrams
for ngram in tagger.extract(value):
tokens = tuple(normalize_forms(token[0]) for token in ngram)
if do_subngrams:
# ex tokens = ["very", "cool", "exemple"]
# subterms = [['very', 'cool'],
# ['very', 'cool', 'exemple'],
# ['cool', 'exemple']]
subterms = subsequences(tokens)
else:
subterms = [tokens]
for seqterm in subterms:
ngram = ' '.join(seqterm)
if len(ngram) > 1:
# doc <=> ngram index
nodes_ngrams_count[(document.id, ngram)] += 1
# add fields : terms n
ngrams_data.add((ngram[:255], len(seqterm), ))
except:
#value not in doc
pass
# except AttributeError:
# print("ERROR NO language_iso2")
# document.status("NGRAMS", error="No lang detected skipped Ngrams")
# corpus.skipped_docs.append(document.id)
# integrate ngrams and nodes-ngrams
if len(nodes_ngrams_count) >= BATCH_NGRAMSEXTRACTION_SIZE:
_integrate_associations(nodes_ngrams_count, ngrams_data, db, cursor)
......@@ -105,12 +95,14 @@ def extract_ngrams(corpus, keys=('title', 'abstract', ), do_subngrams = DEFAULT_
if documents_count % BATCH_NGRAMSEXTRACTION_SIZE == 0:
corpus.status('Ngrams', progress=documents_count+1)
corpus.save_hyperdata()
session.add(corpus)
session.commit()
# integrate ngrams and nodes-ngrams
_integrate_associations(nodes_ngrams_count, ngrams_data, db, cursor)
corpus.status('Ngrams', progress=documents_count+1, complete=True)
corpus.save_hyperdata()
session.commit()
else:
# integrate ngrams and nodes-ngrams
_integrate_associations(nodes_ngrams_count, ngrams_data, db, cursor)
corpus.status('Ngrams', progress=documents_count+1, complete=True)
corpus.save_hyperdata()
session.commit()
except Exception as error:
corpus.status('Ngrams', error=error)
corpus.save_hyperdata()
......
from gargantext.util.db import *
from gargantext.models import *
from gargantext.constants import *
from collections import defaultdict
#from gargantext.util.parsers import *
from collections import defaultdict, Counter
from re import sub
from gargantext.util.languages import languages, detect_lang
def parse(corpus):
try:
documents_count = 0
corpus.status('Docs', progress=0)
# will gather info about languages
observed_languages = defaultdict(int)
# retrieve resource information
for resource in corpus.resources():
# information about the resource
if resource['extracted']:
continue
resource_parser = RESOURCETYPES[resource['type']]['parser']
resource_path = resource['path']
# extract and insert documents from corpus resource into database
for hyperdata in resource_parser(resource_path):
# uniformize the text values for easier POStagging and processing
for k in ['abstract', 'title']:
if k in hyperdata:
try :
hyperdata[k] = normalize_chars(hyperdata[k])
except Exception as error :
print("Error normalize_chars", error)
# save as DB child
# ----------------
document = corpus.add_child(
typename = 'DOCUMENT',
name = hyperdata.get('title', '')[:255],
hyperdata = hyperdata,
)
session.add(document)
# a simple census to raise language info at corpus level
if "language_iso2" in hyperdata:
observed_languages[hyperdata["language_iso2"]] += 1
#1 corpus => 1 resource
resources = corpus.resources()
#get the sources capabilities for a given corpus resource
sources = [get_resource(resource["type"]) for resource in corpus.resources() if resource["extracted"] is False]
if len(sources) == 0:
#>>> documents have already been parsed?????
return
if len(sources) > 0:
#>>> necessarily 1 corpus = 1 source in the current architecture
source = sources[0]
resource = resources[0]
#source.extend(resource)
if source["parser"] is None:
#corpus.status(error)
raise ValueError("Resource '%s' has no Parser" %resource["name"])
else:
#observed languages in corpus docs
corpus.languages = defaultdict.fromkeys(source["default_languages"], 0)
#remember the skipped docs in parsing
skipped_languages = []
corpus.skipped_docs = []
session.add(corpus)
session.commit()
#load the corresponding parser
parserbot = load_parser(source)
# extract and insert documents from resource.path into database
for hyperdata in parserbot(resource["path"]):
# indexed text fields defined in CONSTANTS
for k in DEFAULT_INDEX_FIELDS:
if k in hyperdata.keys():
try:
hyperdata[k] = normalize_chars(hyperdata[k])
except Exception as error :
hyperdata["error"] = "Error normalize_chars"
indexed = False
# a simple census to raise language info at corpus level
for l in ["iso2", "iso3", "full_name"]:
if indexed is True:
break
lang_field = "language_"+l
if lang_field in hyperdata.keys():
if l == "iso2":
try:
corpus.languages[hyperdata["language_iso2"]] += 1
indexed = True
except KeyError:
hyperdata["error"] = "Error: unsupported language"
skipped_languages.append(hyperdata["language_iso2"])
else:
lang = languages(hyperdata[lang_field].lower()).iso2
try:
corpus.languages[lang] += 1
indexed = True
except KeyError:
hyperdata["error"] = "Error: unsupported language"
skipped_languages.append(lang)
if indexed is False:
#no language have been indexed
#detectlang by index_fields
for k in DEFAULT_INDEX_FIELDS:
if indexed is True:
break
if k in hyperdata.keys():
try:
if len(hyperdata[k]) > 10:
#print("> detected on",k, ":", detect_lang(hyperdata[k]))
hyperdata["language_iso2"] = detect_lang(hyperdata[k])
corpus.languages[hyperdata["language_iso2"]] += 1
indexed = True
break
except KeyError:
hyperdata["error"] = "Error: unsupported language"
skipped_languages.append(hyperdata["language_iso2"])
indexed = True
except Exception as error :
print(error)
pass
# save as DB child
# ----------------
document = corpus.add_child(
typename = 'DOCUMENT',
name = hyperdata.get('title', '')[:255],
hyperdata = hyperdata,
)
session.add(document)
if "error" in hyperdata.keys():
#document.status("error")
document.status('Parsing', error= document.hyperdata["error"])
document.save_hyperdata()
session.commit()
#adding skipped_docs for later processing
corpus.skipped_docs.append(document.id)
documents_count += 1
# logging
if documents_count % BATCH_PARSING_SIZE == 0:
corpus.status('Docs', progress=documents_count)
corpus.save_hyperdata()
session.add(corpus)
session.commit()
documents_count += 1
# update info about the resource
resource['extracted'] = True
# add a corpus-level info about languages...
corpus.hyperdata['languages'] = observed_languages
# ...with a special key inside for skipped languages at ngrams_extraction
corpus.hyperdata['languages']['__skipped__'] = {}
# add a corpus-level info about languages adding a __skipped__ info
corpus.languages['__skipped__'] = Counter(skipped_languages)
for n in corpus.languages.items():
print(n)
# commit all changes
corpus.status('Docs', progress=documents_count, complete=True)
corpus.save_hyperdata()
session.add(corpus)
session.commit()
except Exception as error:
corpus.status('Docs', error=error)
......
......@@ -37,7 +37,7 @@ def docs_by_titles(request, project_id, corpus_id):
'date': datetime.now(),
'project': project,
'corpus': corpus,
'resourcename' : resourcename(corpus),
'resourcename' : get_resource_by_name(corpus.resources()[0]),
'view': 'titles',
'user': request.user
},
......@@ -65,7 +65,7 @@ def docs_by_journals(request, project_id, corpus_id):
'date': datetime.now(),
'project': project,
'corpus' : corpus,
'resourcename' : resourcename(corpus),
'resourcename' : get_resource_by_name(corpus.resources()[0]),
'view': 'journals'
},
)
......@@ -84,11 +84,8 @@ def analytics(request, project_id, corpus_id):
'date': datetime.now(),
'project': project,
'corpus': corpus,
'resourcename' : resourcename(corpus),
'resourcename' : get_resource_by_name(corpus.resources()[0]),
'view': 'analytics',
'user': request.user
},
)
......@@ -59,12 +59,17 @@ def overview(request):
class NewCorpusForm(forms.Form):
#mapping choices based on ressource.type
source_list = [(resource["type"], resource["name"]) for resource in RESOURCETYPES]
source_list.insert(0, (0,"Select a database below"))
type = forms.ChoiceField(
choices = enumerate(resource_type['name'] for resource_type in RESOURCETYPES),
choices = source_list,
widget = forms.Select(attrs={ 'onchange' :'CustomForSelect( $("option:selected", this).text() );'})
)
name = forms.CharField( label='Name', max_length=199 , widget=forms.TextInput(attrs={ 'required': 'true' }))
file = forms.FileField()
def clean_resource(self):
file_ = self.cleaned_data.get('file')
def clean_file(self):
file_ = self.cleaned_data.get('file')
if len(file_) > 1024 ** 3 : # we don't accept more than 1GB
......@@ -117,7 +122,8 @@ def project(request, project_id):
resources = corpus.resources()
if len(resources):
resource = resources[0]
resource_type_name = RESOURCETYPES[resource['type']]['name']
#resource_type_name = RESOURCETYPES[resource['type']]['name']
resource_type_name = get_resource(resource["type"])["name"]
else:
print("(WARNING) PROJECT view: no listed resource")
# add some data for the viewer
......@@ -172,5 +178,3 @@ def project(request, project_id):
'query_size': QUERY_SIZE_N_DEFAULT,
},
)
......@@ -2,7 +2,7 @@ from gargantext.util.http import requires_auth, render, settings
from gargantext.util.db import session
from gargantext.util.db_cache import cache
from gargantext.models import Node
from gargantext.constants import resourcename
from gargantext.constants import get_resource_by_name
from datetime import datetime
@requires_auth
......@@ -42,7 +42,7 @@ def ngramtable(request, project_id, corpus_id):
'date': datetime.now(),
'project': project,
'corpus' : corpus,
'resourcename' : resourcename(corpus),
'resourcename' : get_resource_by_name(corpus),
'view': 'terms',
# for the CSV import modal
......
......@@ -55,8 +55,13 @@ def notify_user(username, email, password):
La nouvelle version de Gargantext sort en septembre prochain.
Vous êtes actuellement sur la version de développement, vos retours
seront précieux pour stabiliser la plateforme; merci d'avance!
seront précieux pour stabiliser la plateforme: merci d'avance!
Foire aux questions de Gargantext:
https://gogs.iscpif.fr/humanities/faq_gargantext/wiki/FAQ
Rapporter un bogue:
https://gogs.iscpif.fr/humanities/faq_gargantext/issues
Nous restons à votre disposition pour tout complément d'information.
Cordialement
......
#Install Instructions for Gargamelle:
# Install Instructions for Gargamelle
Gargamelle is the gargantext plateforme toolbox it is a full plateform system
with minimal modules
**Gargamelle** is the gargantext platform toolbox: it installs a full gargantext system with minimal modules inside a **docker** container.
First you need to get the source code to install it
The folder will be /srv/gargantext:
* docs containes all informations on gargantext
/srv/gargantext/docs/
* install contains all the installation files
/srv/gargantext/install/
First you need to get the source code to install it
The destination folder will be `/srv/gargantext`:
* docs contains all information on gargantext
(`/srv/gargantext/docs/`)
* install contains all the installation files
`/srv/gargantext/install/`
Help needed ?
Help needed ?
See [http://gargantext.org/about](http://gargantext.org/about) and [tools](./contribution_guide.md) for the community
## Get the source code
......@@ -27,36 +26,30 @@ git clone ssh://gitolite@delanoe.org:1979/gargantext /srv/gargantext \
## Install
``` bash
# go into the directory
user@computer: cd /srv/gargantext/
#git inside installation folder
user@computer: cd /install
#execute the installation
user@computer: ./install
# go into the directory
user@computer: cd /srv/gargantext/
# get inside installation folder
user@computer: cd install
# execute the installation script
user@computer: ./install
```
During installation an admin account for gargantext will be created by asking you a username and a password
Remember it to accès to the Gargantext plateform
During installation an admin account for gargantext will be created; you will be asked for a username and a password.
Remember them to access the Gargantext platform.
## Run
Once you proceed to installation Gargantext plateforme will be available at localhost:8000
by running the run executable file
``` bash
# go into the directory
user@computer: cd /srv/gargantext/
#git inside installation folder
user@computer: cd /install
#execute the installation
user@computer: ./run
#type ctrl+d to exit or exit; command
```
Once you're done with the installation, the **Gargantext** platform will be available at `http://localhost:8000`
simply by running the `start` executable file
``` bash
# go into the directory
user@computer: cd /srv/gargantext/
# run the start command
user@computer: ./start
# type ctrl+d or "exit" command to exit
```
Then open up a chromium browser and go to localhost:8000
Click on "Enter Gargantext"
Login in with you created username and pasword
Then open up a chromium browser and go to `http://localhost:8000`
Click on "Enter Gargantext"
Log in with the username and password you created
Enjoy! ;)
......@@ -9,6 +9,7 @@ MAINTAINER ISCPIF <gargantext@iscpif.fr>
USER root
### Update and install base dependencies
RUN echo "############ DEBIAN LIBS ###############"
RUN apt-get update && \
apt-get install -y \
apt-utils ca-certificates locales \
......@@ -19,33 +20,37 @@ RUN apt-get update && \
postgresql-9.5 postgresql-client-9.5 postgresql-contrib-9.5
RUN echo "############ DEBIAN LIBS ###############"
### Configure timezone and locale
RUN echo "Europe/Paris" > /etc/timezone && \
dpkg-reconfigure -f noninteractive tzdata && \
sed -i -e 's/# en_GB.UTF-8 UTF-8/en_GB.UTF-8 UTF-8/' /etc/locale.gen && \
RUN echo "########### LOCALES & TZ #################"
RUN echo "Europe/Paris" > /etc/timezone
ENV TZ "Europe/Paris"
RUN sed -i -e 's/# en_GB.UTF-8 UTF-8/en_GB.UTF-8 UTF-8/' /etc/locale.gen && \
sed -i -e 's/# fr_FR.UTF-8 UTF-8/fr_FR.UTF-8 UTF-8/' /etc/locale.gen && \
echo 'LANG="fr_FR.UTF-8"' > /etc/default/locale && \
dpkg-reconfigure --frontend=noninteractive locales && \
update-locale LANG=fr_FR.UTF-8
echo 'LANG="fr_FR.UTF-8"' > /etc/default/locale
ENV LANG fr_FR.UTF-8
ENV LANGUAGE fr_FR.UTF-8
ENV LC_ALL fr_FR.UTF-8
RUN echo "########### LOCALES & TZ #################"
### Install main dependencies and python packages based on Debian distrib
RUN echo "############# PYTHON DEPENDENCIES ###############"
RUN apt-get update && apt-get install -y \
libxml2-dev xml-core libgfortran-5-dev \
libpq-dev \
python3.5 \
python3-dev \
# for numpy, pandas and numpyperf
python3-six python3-numpy python3-setuptools \
# ^for numpy, pandas and numpyperf
python3-numexpr \
#python dependencies
# python dependencies
python3-pip \
# for lxml
libxml2-dev libxslt-dev
#libxslt1-dev zlib1g-dev
RUN echo "############# PYTHON DEPENDENCIES ###############"
#UPDATE AND CLEAN
# UPDATE AND CLEAN
RUN apt-get update && apt-get autoclean &&\
rm -rf /var/lib/apt/lists/*
# NB: removing /var/lib/apt/lists keeps /var/ from filling up significantly on your native system
......@@ -65,9 +70,8 @@ ADD psql_configure.sh /
ADD django_configure.sh /
RUN . /env_3-5/bin/activate && pip3 install -r requirements.txt && \
pip3 install git+https://github.com/zzzeek/sqlalchemy.git@rel_1_1 &&\
python3 -m nltk.downloader averaged_perceptron_tagger -d /usr/local/share/nltk_data;
# nltk.data.path.append('path_to_nltk_data')
pip3 install git+https://github.com/zzzeek/sqlalchemy.git@rel_1_1 && \
python3 -m nltk.downloader averaged_perceptron_tagger -d /usr/local/share/nltk_data
RUN chown gargantua:gargantua -R /env_3-5
......@@ -81,6 +85,4 @@ RUN echo "listen_addresses='*'" >> /etc/postgresql/9.5/main/postgresql.conf
EXPOSE 5432 8000
VOLUME ["/srv/",]
# VOLUME ["/srv/",]
#!/bin/bash
# opens a console + virtualenv inside the already active docker container
# (to use after start)
sudo docker exec -it gargamelle_box bash --rcfile 'env_3-5/bin/activate'
#!/bin/bash
sudo docker run \
-v /srv/:/srv/\
--name=gargamelle_box \
-v /srv/gargantext:/srv/gargantext \
-v /srv/gargandata:/srv/gargandata \
-v /srv/gargantext_lib:/srv/gargantext_lib \
-p 8000:8000 \
-p 5432 \
-it gargamelle:latest \
/bin/bash -c "service postgresql start; /bin/su gargantua -c 'source /env_3-5/bin/activate && /srv/gargantext/manage.py runserver 0.0.0.0:8000' && bin/bash"
sudo docker rm -f `docker ps -a | grep -v CONTAINER | awk '{print $1 }'`
sudo docker rm gargamelle_box
......@@ -16,13 +16,12 @@ echo "::::: DJANGO :::::"
/bin/su gargantua -c 'source /env_3-5/bin/activate &&\
echo "Activated env" &&\
./srv/gargantext/manage.py makemigrations &&\
./srv/gargantext/manage.py migrate && \
/srv/gargantext/manage.py makemigrations &&\
/srv/gargantext/manage.py migrate && \
echo "migrations ok" &&\
./srv/gargantext/dbmigrate.py && \
./srv/gargantext/dbmigrate.py && \
./srv/gargantext/dbmigrate.py && \
./srv/gargantext/manage.py createsuperuser'
/srv/gargantext/dbmigrate.py && \
/srv/gargantext/dbmigrate.py && \
/srv/gargantext/dbmigrate.py && \
/srv/gargantext/manage.py createsuperuser'
/usr/sbin/service postgresql stop
......@@ -14,6 +14,7 @@ html5lib==0.9999999
python-igraph>=0.7.1
jdatetime==1.7.2
kombu==3.0.33 # messaging
langdetect==1.0.6 # language detection
nltk==3.1
numpy==1.10.4
psycopg2==2.6.1
......
......@@ -43,6 +43,22 @@ function uncompress_lib {
#~ esac
echo "::: CREATE GROUP :::";
if grep -q 'gargantua' /etc/group
then
echo "Using existing group 'gargantua'"
else
sudo groupadd gargantua
fi
# adding the users to the group
current_user=$(who -m | cut -d' ' -f1)
sudo usermod -aG gargantua $current_user
sudo usermod -aG gargantua gargantua
# changing the group of the sourcedir
sudo chown -R :gargantua /srv/gargantext
echo "::: SETUP ENV :::";
for dir in "/srv/gargantext_lib" "/srv/gargantext_static" "/srv/gargantext_media"; do
sudo mkdir -p $dir ;
......@@ -59,12 +75,17 @@ sudo docker build -t gargamelle:latest ./gargamelle
echo ':::: CONFIGURE ::::'
sudo docker run \
-v /srv/:/srv/ \
--name=gargamelle_box \
-v /srv/gargantext:/srv/gargantext \
-v /srv/gargandata:/srv/gargandata \
-v /srv/gargantext_lib:/srv/gargantext_lib \
-p 8000:8000 \
-p 5432 \
-it gargamelle:latest \
/bin/bash -c "./psql_configure.sh; ./django_configure.sh ; exit"
sudo docker rm -f `docker ps -a | grep -v CONTAINER | awk '{print $1 }'`
sudo docker rm gargamelle_box
# creating the "start" copy + giving it normal ownership (because we're probably sudo)
cp ./run /srv/gargantext/start
chown $current_user:gargantua /srv/gargantext/start
#!/bin/bash
sudo docker run \
-v /srv/:/srv/\
--name=gargamelle_box \
-v /srv/gargantext:/srv/gargantext \
-v /srv/gargandata:/srv/gargandata \
-v /srv/gargantext_lib:/srv/gargantext_lib \
-p 8000:8000 \
-p 5432 \
-it gargamelle:latest \
/bin/bash -c "service postgresql start; /bin/su gargantua -c 'source /env_3-5/bin/activate && /srv/gargantext/manage.py runserver 0.0.0.0:8000'"
sudo docker rm -f `docker ps -a | grep -v CONTAINER | awk '{print $1 }'`
sudo docker rm gargamelle_box
......@@ -8,7 +8,7 @@ from traceback import print_tb
from django.shortcuts import redirect, render
from django.http import Http404, HttpResponseRedirect, HttpResponseForbidden
from gargantext.constants import resourcetype, QUERY_SIZE_N_MAX
from gargantext.constants import get_resource_by_name, QUERY_SIZE_N_MAX
from gargantext.models.nodes import Node
from gargantext.util.db import session
from gargantext.util.http import JsonHttpResponse
......@@ -133,7 +133,7 @@ def save(request , project_id):
if filename!=False:
# add the uploaded resource to the corpus
corpus.add_resource(
type = resourcetype('ISTex')
type = get_resource_by_name('ISTex [ISI]')["type"]
, path = filename
)
dwnldsOK+=1
......
......@@ -18,7 +18,7 @@ from traceback import print_tb
from django.shortcuts import redirect
from django.http import Http404, HttpResponseRedirect, HttpResponseForbidden
from gargantext.constants import resourcetype, QUERY_SIZE_N_MAX
from gargantext.constants import get_resource, QUERY_SIZE_N_MAX
from gargantext.models.nodes import Node
from gargantext.util.db import session
from gargantext.util.db_cache import cache
......@@ -134,7 +134,7 @@ def save( request , project_id ) :
print(filename)
if filename != False:
# add the uploaded resource to the corpus
corpus.add_resource( type = resourcetype('Pubmed (XML format)')
corpus.add_resource( type = get_resource_by_name('Pubmed [XML]')["type"]
, path = filename
, url = None
)
......
......@@ -174,12 +174,28 @@
title="Export terms table in CSV">
Export terms table &nbsp; <span class="glyphicon glyphicon-download" aria-hidden="true"></span>
</a>
{% elif view == 'titles' %}
<a href="https://gogs.iscpif.fr/humanities/faq_gargantext/wiki/FAQ#import--export-a-dataset" class="pull-right btn btn-lg">
<span class="glyphicon glyphicon-question-sign" aria-hidden="true"></span>
</a>
<a class="btn btn-primary exportbtn pull-right" role="button"
href="/api/nodes?parent_id={{corpus.id}}&types[]=DOCUMENT&pagination_limit=100000&formated=csv"
title="Export full corpus in CSV">
Export corpus &nbsp; <span class="glyphicon glyphicon-download" aria-hidden="true"></span>
Export corpus &nbsp;
<span class="glyphicon glyphicon-download" aria-hidden="true"></span>
</a>
{% else %}
<!-- TODO export journal table -->
{% endif %}
......@@ -187,6 +203,7 @@
</div>
<div class="row">
<div class="col-md-1">
</span>
</div>
<div class="col-md-6">
<h3>
......
......@@ -212,12 +212,19 @@
<div class="modal-content">
<div class="modal-header">
<button type="button" class="close" data-dismiss="modal" aria-hidden="true">×</button>
<h3>Add a Corpus</h3>
<h3>Add a Corpus <a href="https://gogs.iscpif.fr/humanities/faq_gargantext/wiki/FAQ#import--export-a-dataset">
<span class="glyphicon glyphicon-question-sign" aria-hidden="true"></span>
</a>
</h3>
</div>
<div class="modal-body">
<!-- FAQ -->
<form id="id_form" enctype="multipart/form-data" action="/projects/{{project.id}}/" method="post">
{% csrf_token %}
<table cellpadding="5">
{% for field in form %}
<tr>
<th>{{field.label_tag}}</th>
......