Commit 0bd9237e authored by delanoe

[STABLE] Update from current unstable.

parents 515aa18b 9087e23c
...@@ -28,6 +28,5 @@ see [install procedure](install.md)
2. Create a new branch <username>-refactoring
3. Run the gargantext-box
4. Code
5. Test
6. Commit
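A minimal sketch of steps 2 to 6 with standard git commands (the branch name follows the <username>-refactoring convention above; the test command is only a placeholder, adapt it to the project's actual test runner):
```bash
# 2. create and switch to your working branch
git checkout -b <username>-refactoring
# 3./4. run the gargantext-box and edit the code, then
# 5. test your changes (placeholder command)
./manage.py test
# 6. commit and push the branch
git add -p
git commit -m "short description of the change"
git push origin <username>-refactoring
```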
...@@ -26,7 +26,7 @@ git clone ssh://gitolite@delanoe.org:1979/gargantext /srv/gargantext \
## Install
```bash
# go into the directory
user@computer: cd /srv/gargantext/
# get inside installation folder
...@@ -34,20 +34,31 @@ git clone ssh://gitolite@delanoe.org:1979/gargantext /srv/gargantext \
# execute the installation
user@computer: ./install
```
The installation creates an admin account for Gargantext; you will be prompted for:
```bash
Username (leave blank to use 'gargantua'):
#email is not mandatory
Email address:
Password:
Password (again):
```
If this step completed successfully, you should see:
```bash
Superuser created successfully.
[ ok ] Stopping PostgreSQL 9.5 database server: main.
```
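If the password is lost later, it can be reset with Django's standard changepassword command; a hedged example, assuming the manage.py entry point and the /env_3-5 virtualenv used by the install scripts further down in this commit:
```bash
# inside the gargantext box, activate the virtualenv and reset the admin password
source /env_3-5/bin/activate
/srv/gargantext/manage.py changepassword gargantua
```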
## Run
Once the installation is done, the Gargantext platform will be available at localhost:8000
by running the start executable file:
```bash
# go into the directory
user@computer: cd /srv/gargantext/
# run the start command
user@computer: ./start
# type ctrl+d or simply type exit in the terminal to leave
```
Then open up a chromium browser and go to localhost:8000
...@@ -55,7 +66,3 @@ Click on "Enter Gargantext"
Log in with the username and password you created
Enjoy! ;)
# Resources
Adding a new source into Gargantext first requires declaring the source
inside constants.py:
```python
RESOURCETYPES = [
    {   "type": 9,                           # give a unique type int
        "name": 'SCOAP [XML]',               # resource name as proposed in the add corpus FORM [generic format]
        "parser": "CernParser",              # name of the new parser class inside a CERN.py file (set to None if not implemented)
        "format": 'MARC21',                  # specific format
        'file_formats': ["zip", "xml"],      # accepted file formats
        "crawler": "CernCrawler",            # name of the new crawler class inside a CERN.py file (set to None if no crawler implemented)
        'default_languages': ['en', 'fr'],   # supported default languages of the source
    },
    ...
]
```
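The rest of the code base retrieves such a declaration through helpers like get_resource (imported from gargantext.constants by the crawlers and the toolchain below). A minimal sketch of what that lookup can look like, assuming the RESOURCETYPES list above; the actual implementation in constants.py may differ:
```python
from gargantext.constants import RESOURCETYPES

def get_resource(resource_type):
    """Return the RESOURCETYPES entry whose unique 'type' int matches resource_type."""
    for resource in RESOURCETYPES:
        if resource["type"] == resource_type:
            return resource
    return None

# example: fetch the SCOAP declaration added above
scoap = get_resource(9)   # -> {"type": 9, "name": 'SCOAP [XML]', ...}
```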
## Adding a new parser
Once you have declared your new parser inside constants.py,
add your new parser file into /srv/gargantext/util/parsers/
following this naming convention:
* Filename must be in uppercase without the Parser mention,
e.g. MailParser => MAIL.py
* Inside this file the Parser class must be named exactly as declared in the parser field of constants.py
* Your new parser shall inherit from the base class Parser and provide a parse(filebuffer) method
```python
#!/usr/bin/python3 env
# filename: /srv/gargantext/util/parsers/MAIL.py
from ._Parser import Parser

class MailParser(Parser):
    def parse(self, file):
        ...
```
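Parsers declared this way are loaded dynamically from RESOURCETYPES (the new parsers __init__ and the load_parser call in the parsing toolchain, both further down in this commit, do exactly that). A hedged sketch of that loading step, assuming the naming convention above:
```python
import importlib

def load_parser(resource):
    """Return the parser class declared for a resource, e.g. MailParser defined in MAIL.py."""
    name = resource["parser"]                      # e.g. "MailParser"
    filename = name.replace("Parser", "").upper()  # e.g. "MAIL"
    module = importlib.import_module("gargantext.util.parsers.%s" % filename)
    return getattr(module, name)

# in the toolchain: parserbot = load_parser(source)
# then: for hyperdata in parserbot(resource["path"]): ...
```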
## Adding a new crawler
Once you have declared your new crawler inside constants.py,
add your new crawler file into /srv/gargantext/util/crawlers/
following this naming convention:
* Filename must be in uppercase without the Crawler mention,
e.g. MailCrawler => MAIL.py
* Inside this file the Crawler class must be named exactly as declared in the crawler field of constants.py
* Your new crawler shall inherit from the base class Crawler and provide three methods:
    * scan_results => ids
    * sample => yes/no
    * fetch
```python
#!/usr/bin/python3 env
# filename: /srv/gargantext/util/crawlers/MAIL.py
from ._Crawler import Crawler

class MailCrawler(Crawler):
    def scan_results(self, query):
        ...
        self.ids = set()

    def sample(self, results_nb):
        ...

    def fetch(self, ids):
        ...
```
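For reference, the base Crawler class changed in this commit drives those three methods roughly as sketched below when a search is launched; this is illustrative only (the record keys are the ones read by Crawler.__init__ further down, and the real entry point is the project form and the scheduler):
```python
# hypothetical driver code, for illustration only
record = {
    "corpus_name": "my mail corpus",
    "project_id": 1,
    "user_id": 1,
    "source": 9,             # the RESOURCETYPES "type" int declared above
    "query": "cool example",
    "count": 0,
}
crawler = MailCrawler(record)
crawler.scan_results(record["query"])   # how many hits does the source report?
crawler.sample(100)                     # optionally restrict to a sample
corpus_id = crawler.fetch(crawler.ids)  # download the results and build the corpus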
...@@ -14,6 +14,7 @@ djangorestframework==3.3.2
html5lib==0.9999999
jdatetime==1.7.2
kombu==3.0.33
langdetect==1.0.6
lxml==3.5.0
networkx==1.11
nltk==3.1
......
#!/usr/bin/env python
# -*- coding: utf-8 -*-
# ****************************
# *****  CERN Scrapper  *****
# ****************************
# Author: c24b
# Date: 27/05/2015
from ._Crawler import Crawler
import hmac, hashlib
import requests
import os
import random
import urllib.parse as uparse
from lxml import etree
from gargantext.settings import API_TOKENS
#from gargantext.util.files import build_corpus_path
from gargantext.util.db import session
from gargantext.models import Node


class CernCrawler(Crawler):
    '''CERN SCOAP3 API Interaction'''

    def __generate_signature__(self, url):
        '''create the signature'''
        #hmac-sha1 salted with secret
        return hmac.new(self.secret, url, hashlib.sha1).hexdigest()

    def __format_query__(self, query, of="xm", fields=None):
        '''for query filters params
        see doc https://scoap3.org/scoap3-repository/xml-api/
        '''
        #dict_q = uparse.parse_qs(query)
        dict_q = {}
        #by default: search by pattern
        dict_q["p"] = query
        if fields is not None and isinstance(fields, list):
            fields = ",".join(fields)
            dict_q["f"] = fields
        #output format: "xm", "xmt", "h", "html"
        dict_q["of"] = of
        return dict_q

    def __format_url__(self, dict_q):
        '''format the url with encoded query'''
        #add the apikey
        dict_q["apikey"] = [self.apikey]
        params = "&".join([(str(k)+"="+str(uparse.quote(v[0]))) for k, v in sorted(dict_q.items())])
        return self.BASE_URL+params

    def sign_url(self, dict_q):
        '''add signature'''
        API = API_TOKENS["CERN"]
        self.apikey = API["APIKEY"]
        self.secret = API["APISECRET"].encode("utf-8")
        self.BASE_URL = u"http://api.scoap3.org/search?"
        url = self.__format_url__(dict_q)
        return url+"&signature="+self.__generate_signature__(url.encode("utf-8"))

    def create_corpus(self):
        #create a corpus
        corpus = Node(
            name = self.query,
            #user_id = self.user_id,
            parent_id = self.project_id,
            typename = 'CORPUS',
            hyperdata = { "action"      : "Scrapping data"
                        , "language_id" : self.type["default_language"]
                        }
        )
        #add the resource
        corpus.add_resource(
            type = self.type["type"],
            name = self.type["name"],
            path = self.path)
        try:
            print("PARSING")
            # p = eval(self.type["parser"])()
            session.add(corpus)
            session.commit()
            self.corpus_id = corpus.id
            parse_extract_indexhyperdata(corpus.id)
            return self
        except Exception as error:
            print('WORKFLOW ERROR')
            print(error)
            session.rollback()
            return self

    def download(self):
        import time
        self.path = "/tmp/results.xml"
        query = self.__format_query__(self.query)
        url = self.sign_url(query)
        start = time.time()
        r = requests.get(url, stream=True)
        downloaded = False
        #the long part
        with open(self.path, 'wb') as f:
            print("Downloading file")
            for chunk in r.iter_content(chunk_size=1024):
                if chunk:  # filter out keep-alive new chunks
                    #print("===")
                    f.write(chunk)
                    downloaded = True
        end = time.time()
        #print (">>>>>>>>>>LOAD results", end-start)
        return downloaded

    def scan_results(self):
        '''scan the number of results by fetching a single result
        (only the author of page 1)
        and reading the comment at the top of the page
        '''
        import time
        self.results_nb = 0
        query = self.__format_query__(self.query, of="hb")
        query["ot"] = "100"
        query["jrec"] = '1'
        query["rg"] = '1'
        url = self.sign_url(query)
        print(url)
        #start = time.time()
        r = requests.get(url)
        #end = time.time()
        #print (">>>>>>>>>>LOAD results_nb", end-start)
        if r.status_code == 200:
            self.results_nb = int(r.text.split("-->")[0].split(': ')[-1][:-1])
            return self.results_nb
        else:
            raise ValueError(r.status_code)
from ._Crawler import *
import json


class ISTexCrawler(Crawler):
    """
    ISTEX Crawler
    """
    def __format_query__(self, query=None):
        '''formatting query: urlquote instead'''
        if query is not None:
            query = query.replace(" ", "+")
            return query
        else:
            self.query = self.query.replace(" ", "+")
            return self.query

    def scan_results(self):
        #get the number of results
        self.results_nb = 0
        self.query = self.__format_query__()
        _url = "http://api.istex.fr/document/?q="+self.query+"&size=0"
        #"&output=id,title,abstract,pubdate,corpusName,authors,language"
        r = requests.get(_url)
        print(r)
        if r.status_code == 200:
            self.results_nb = int(r.json()["total"])
            self.status.append("fetching results")
            return self.results_nb
        else:
            self.status.append("error")
            raise ValueError(r.status_code)

    def download(self):
        '''fetching items'''
        downloaded = False

        def get_hits(future):
            '''here we directly get the result hits'''
            response = future.result()
            if response.status_code == 200:
                return response.json()["hits"]
            else:
                return None

        #session = FuturesSession()
        #self.path = "/tmp/results.json"
        self.status.append("fetching results")
        paging = 100
        self.query_max = self.results_nb
        if self.query_max > QUERY_SIZE_N_MAX:
            msg = "Invalid sample size N = %i (max = %i)" % (self.query_max, QUERY_SIZE_N_MAX)
            print("ERROR (scrap: istex d/l ): ", msg)
            self.query_max = QUERY_SIZE_N_MAX
        #urlreqs = []
        with open(self.path, 'wb') as f:
            for i in range(0, self.query_max, paging):
                url_base = "http://api.istex.fr/document/?q="+self.query+"&output=*&from=%i&size=%i" % (i, paging)
                r = requests.get(url_base)
                if r.status_code == 200:
                    downloaded = True
                    f.write(r.text.encode("utf-8"))
                else:
                    downloaded = False
                    self.status.insert(0, "error fetching ISTEX " + str(r.status_code))
                    break
        return downloaded
# Scrapers config
QUERY_SIZE_N_MAX = 1000

from gargantext.constants import get_resource
from gargantext.util.scheduling import scheduled
from gargantext.util.db import session
from requests_futures.sessions import FuturesSession
from gargantext.util.db import session
import requests
from gargantext.models.nodes import Node
#from gargantext.util.toolchain import parse_extract_indexhyperdata
from datetime import date


class Crawler:
    """Base class for performing a search and adding a corpus file depending on the type
    """
    def __init__(self, record):
        #the name of the corpus
        #that will be built in case of internal fileparsing
        self.record = record
        self.name = record["corpus_name"]
        self.project_id = record["project_id"]
        self.user_id = record["user_id"]
        self.resource = record["source"]
        self.type = get_resource(self.resource)
        self.query = record["query"]
        #format the sampling
        self.n_last_years = 5
        self.YEAR = date.today().year
        #not great, but the easy version
        self.MONTH = str(date.today().month)
        if len(self.MONTH) == 1:
            self.MONTH = "0"+self.MONTH
        self.MAX_RESULTS = 1000
        try:
            self.results_nb = int(record["count"])
        except KeyError:
            #does not exist yet
            self.results_nb = 0
        try:
            self.webEnv = record["webEnv"]
            self.queryKey = record["queryKey"]
            self.retMax = record["retMax"]
        except KeyError:
            #does not exist yet
            self.queryKey = None
            self.webEnv = None
            self.retMax = 1
        self.status = [None]
        self.path = "/tmp/results.txt"

    def tmp_file(self):
        '''here should be stored the results
        depending on the type of format'''
        raise NotImplementedError

    def parse_query(self):
        '''here should be parsed the parameters of the query
        depending on the type, to retrieve the set of activated search options
        '''
        raise NotImplementedError

    def fetch(self):
        if self.download():
            self.create_corpus()
            return self.corpus_id

    def get_sampling_dates(self):
        '''Create a sample list of min and max dates based on Y and M
        for N_LAST_YEARS results'''
        dates = []
        for i in range(self.n_last_years):
            maxyear = self.YEAR - i
            mindate = str(maxyear-1)+"/"+str(self.MONTH)
            maxdate = str(maxyear)+"/"+str(self.MONTH)
            print(mindate, "-", maxdate)
            dates.append((mindate, maxdate))
        return dates

    def create_corpus(self):
        #create a corpus
        corpus = Node(
            name = self.query,
            user_id = self.user_id,
            parent_id = self.project_id,
            typename = 'CORPUS',
            hyperdata = { "action"      : "Scrapping data",
                          "language_id" : self.type["default_language"],
                        }
        )
        self.corpus_id = corpus.id
        if len(self.paths) > 0:
            for path in self.paths:
                #add the resource
                corpus.add_resource(
                    type = self.type["type"],
                    name = self.type["name"],
                    path = path
                )
            session.add(corpus)
            session.commit()
            scheduled(parse_extract_indexhyperdata(corpus.id))
        else:
            #add the resource
            corpus.add_resource(
                type = self.type["type"],
                name = self.type["name"],
                path = self.path
            )
            session.add(corpus)
            session.commit()
            scheduled(parse_extract_indexhyperdata(corpus.id))
        return corpus
import importlib
from gargantext.constants import RESOURCETYPES
from gargantext.settings import DEBUG

#if DEBUG: print("Loading available Crawlers")

base_parser = "gargantext.util.crawlers"
for resource in RESOURCETYPES:
    if resource["crawler"] is not None:
        try:
            name = resource["crawler"]
            #crawler is type basename+"Crawler"
            filename = name.replace("Crawler", "").lower()
            module = base_parser+".%s" % (filename)
            importlib.import_module(module)
            #if DEBUG: print("\t-", name)
        except Exception as e:
            print("Check constants.py RESOURCETYPES declaration %s \nCRAWLER %s is not available for %s" % (str(e), resource["crawler"], resource["name"]))

#initial import
#from .cern import CernCrawler
#from .istex import ISTexCrawler
#from .pubmed import PubmedCrawler
from gargantext.constants import *
from langdetect import detect, DetectorFactory


class Language:
    def __init__(self, iso2=None, iso3=None, full_name=None, name=None):
        self.iso2 = iso2
        self.iso3 = iso3
        self.name = name
        self.implemented = iso2 in LANGUAGES

    def __str__(self):
        result = '<Language'
        for key, value in self.__dict__.items():
...@@ -16,6 +16,7 @@ class Language:
        return result

    __repr__ = __str__


class Languages(dict):
    def __missing__(self, key):
        key = key.lower()
...@@ -25,6 +26,10 @@ class Languages(dict):

languages = Languages()


def detect_lang(text):
    DetectorFactory.seed = 0
    return languages[detect(text)].iso2


import pycountry
pycountry_keys = (
    ('iso639_3_code', 'iso3', ),
...@@ -49,3 +54,4 @@ languages['fre'] = languages['fr']
languages['ger'] = languages['de']
languages['Français'] = languages['fr']
languages['en_US'] = languages['en']
languages['english'] = languages['en']
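The detect_lang helper added above wraps langdetect with a fixed seed and maps the detected code back to an iso2 entry of the languages table; a hedged usage sketch (the import path is the one used by the parsing toolchain later in this commit):
```python
from gargantext.util.languages import detect_lang

# returns the iso2 code of the detected language, 'fr' for this sentence
detect_lang("Les abeilles sauvages améliorent la pollinisation des cultures")
```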
...@@ -2,6 +2,8 @@ from ._Parser import Parser
from datetime import datetime
from bs4 import BeautifulSoup
from lxml import etree
#import asyncio
#q = asyncio.Queue(maxsize=0)


class CernParser(Parser):
    #mapping MARC21 ==> hyperdata
...@@ -38,24 +40,34 @@ class CernParser(Parser):
        "856": {"u": "pdf_source"},
    }

    # def format_date(self, hyperdata):
    #     '''formatting pubdate'''
    #     prefix = "publication"
    #     try:
    #         date = datetime.strptime(hyperdata[prefix + "_date"], "%Y-%m-%d")
    #     except ValueError:
    #         date = datetime.strptime(hyperdata[prefix + "_date"], "%Y-%m")
    #         date.day = "01"
    #     hyperdata[prefix + "_year"] = date.strftime('%Y')
    #     hyperdata[prefix + "_month"] = date.strftime("%m")
    #     hyperdata[prefix + "_day"] = date.strftime("%d")
    #
    #     hyperdata[prefix + "_hour"] = date.strftime("%H")
    #     hyperdata[prefix + "_minute"] = date.strftime("%M")
    #     hyperdata[prefix + "_second"] = date.strftime("%S")
    #     hyperdata[prefix + "_date"] = date.strftime("%Y-%m-%d %H:%M:%S")
    #     #print("Date", hyperdata["publication_date"])
    #     return hyperdata

    #@asyncio.coroutine
    def parse(self, file):
        #print("PARSING")
        hyperdata_list = []
        doc = file.read()
        #print(doc[:35])
        soup = BeautifulSoup(doc, "lxml")
        #print(soup.find("record"))
        for record in soup.find_all("record"):
            hyperdata = {v: [] for v in self.MARC21["100"].values()}
            hyperdata["uid"] = soup.find("controlfield").text
...@@ -86,8 +98,8 @@ class CernParser(Parser):
            hyperdata["authors_affiliations"] = (",").join(hyperdata["authors_affiliations"])
            hyperdata["authors"] = (",").join(hyperdata["authors"])
            hyperdata["authors_mails"] = (",").join(hyperdata["authors_mails"])
            #hyperdata = self.format_date(hyperdata)
            hyperdata = self.format_hyperdata_languages(hyperdata)
            hyperdata = self.format_hyperdata_dates(hyperdata)
            hyperdata_list.append(hyperdata)
        return hyperdata_list
from .RIS import RISParser


class ISIParser(RISParser):

    _begin = 3

    _parameters = {
        b"ER": {"type": "delimiter"},
        b"TI": {"type": "hyperdata", "key": "title", "separator": " "},
...@@ -17,4 +17,3 @@ class ISIParser(RISParser):
        b"AB": {"type": "hyperdata", "key": "abstract", "separator": " "},
        b"WC": {"type": "hyperdata", "key": "fields"},
    }
...@@ -31,6 +31,7 @@ class PubmedParser(Parser):
        if isinstance(file, bytes):
            file = BytesIO(file)
        xml = etree.parse(file, parser=self.xml_parser)
        #print(xml.find("PubmedArticle"))
        xml_articles = xml.findall('PubmedArticle')
        # initialize the list of hyperdata
        hyperdata_list = []
......
...@@ -36,6 +36,7 @@ class RISParser(Parser):
        last_values = []
        # browse every line of the file
        for line in file:

            if len(line) > 2:
                # extract the parameter key
                parameter_key = line[:2]
......
...@@ -20,14 +20,9 @@ class Parser:
        self._file = file

    def __del__(self):
        if hasattr(self, '_file'):
            self._file.close()

    def detect_encoding(self, string):
        """Useful method to detect the encoding of a document.
...@@ -167,6 +162,8 @@ class Parser:
    def __iter__(self, file=None):
        """Parse the file, and its children files found in the file.
        C24B comment: the storage/extraction of the file should be done upstream,
        and this method is a bit obscure
        """
        if file is None:
            file = self._file
......
import importlib
from gargantext.constants import RESOURCETYPES
from gargantext.settings import DEBUG

if DEBUG:
    print("Loading available PARSERS:")

base_parser = "gargantext.util.parsers"
for resource in RESOURCETYPES:
    if resource["parser"] is not None:
        #parser file is without the Parser mention
        fname = resource["parser"].replace("Parser", "")
        #parser file is formatted as a title
        module = base_parser+".%s" % (fname.upper())
        #parser module is as shown in constants
        parser = importlib.import_module(module)
        if DEBUG:
            print("\t-", resource["parser"])
        getattr(parser, resource["parser"])
...@@ -3,9 +3,9 @@ When started, it initiates the parser;
when passed text, the text is piped to the parser.
When ended, the parser is closed and the tagged word returned as a tuple.
"""
from gargantext.constants import RULE_JJNN, DEFAULT_MAX_NGRAM_LEN

import re
import nltk


class Tagger:
...@@ -19,7 +19,28 @@ class Tagger:
            | [][.,;"'?!():-_`]     # these are separate tokens
            ''', re.UNICODE | re.MULTILINE | re.DOTALL)
        self.buffer = []

        #self.start()

    def clean_text(self, text):
        """Clean the text for better POS tagging.
        For now, only removes (short) XML tags.
        """
        return re.sub(r'<[^>]{0,45}>', '', text)

    def extract(self, text, rule=RULE_JJNN, label='NP', max_n_words=DEFAULT_MAX_NGRAM_LEN):
        self.text = self.clean_text(text)
        grammar = nltk.RegexpParser(label + ': ' + rule)
        tagged_tokens = list(self.tag_text(self.text))
        if len(tagged_tokens):
            grammar_parsed = grammar.parse(tagged_tokens)
            for subtree in grammar_parsed.subtrees():
                if subtree.label() == label:
                    if len(subtree) < max_n_words:
                        yield subtree.leaves()
                        # ex: [('wild', 'JJ'), ('pollinators', 'NNS')]

    def __del__(self):
        self.stop()
...@@ -29,6 +50,8 @@ class Tagger:
        This method is called by the constructor, and can be overriden by
        inherited classes.
        """
        print("START")
        self.extract(self.text)

    def stop(self):
        """Ends the tagger.
......
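The new clean_text/extract pair above turns raw text into POS-tagged noun-phrase ngrams filtered by the RULE_JJNN grammar. A hedged usage sketch with one of the concrete taggers shipped in util/taggers (it assumes NltkTagger keeps its current constructor and tag_text method):
```python
from gargantext.util.taggers.NltkTagger import NltkTagger

tagger = NltkTagger()
for ngram in tagger.extract("Wild pollinators enhance fruit set of crops"):
    print(ngram)   # e.g. [('wild', 'JJ'), ('pollinators', 'NNS')]
```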
#version2
#imported as needed

#Version 1
#~ import importlib
#~ from gargantext.constants import LANGUAGES
#~ from gargantext.settings import DEBUG
#~ if DEBUG:
#~ print("Loading available Taggers:")
#~ for lang, tagger in LANGUAGES.items():
#~ tagger = tagger["tagger"]
#~ filename = "gargantext.util.taggers.%s" %(tagger)
#~ if DEBUG:
#~ print("\t-%s (%s)" %(tagger, lang))
#~ getattr(importlib.import_module(filename), tagger)()

#VERSION 0
#~ #initially a manual import declaration
#~ from .TurboTagger import TurboTagger
#~ from .NltkTagger import NltkTagger
#~ from .TreeTagger import TreeTagger
#~ from .MeltTagger import EnglishMeltTagger, FrenchMeltTagger
...@@ -102,7 +102,7 @@ def do_maplist(corpus,
    if n_ngrams == 0:
        raise ValueError("No ngrams in cooc table ?")
    #return
    # results, with same structure as quotas
    chosen_ngrams = {
        'topgen': {'monograms': [], 'multigrams': []},
......
...@@ -82,6 +82,7 @@ def parse_extract_indexhyperdata(corpus):
    favs = corpus.add_child(
        typename='FAVORITES', name='favorite docs in "%s"' % corpus.name
    )

    session.add(favs)
    session.commit()
    print('CORPUS #%d: [%s] new favorites node #%i' % (corpus.id, t(), favs.id))
...@@ -265,7 +266,7 @@ def recount(corpus):
    # -> specclusion/genclusion: compute + write (=> NodeNodeNgram)
    (spec_id, gen_id) = compute_specgen(corpus, cooc_matrix = coocs,
                                        spec_overwrite_id = old_spec_id,
                                        gen_overwrite_id = old_gen_id)
    print('RECOUNT #%d: [%s] updated spec-clusion node #%i' % (corpus.id, t(), spec_id))
......
#!/usr/bin/python3 env
"""
For initial ngram groups via stemming
Example:
...@@ -21,16 +22,13 @@ def prepare_stemmers(corpus):
    """
    Returns *several* stemmers (one for each language in the corpus)
    (as a dict of stemmers with key = language_iso2)
    languages has been previously filtered by supported source languages
    and formatted
    """
    stemmers = {lang: SnowballStemmer(languages[lang].name.lower()) for lang \
                in corpus.languages.keys() if lang != "__skipped__"}
    stemmers['__unknown__'] = SnowballStemmer("english")
    return stemmers


def compute_groups(corpus, stoplist_id = None, overwrite_id = None):
    """
...@@ -57,16 +55,17 @@ def compute_groups(corpus, stoplist_id = None, overwrite_id = None):
    my_groups = defaultdict(Counter)

    # preloop per doc to sort ngrams by language
    for doc in corpus.children('DOCUMENT'):
        if doc.id not in corpus.skipped_docs:
            if ('language_iso2' in doc.hyperdata):
                lgid = doc.hyperdata['language_iso2']
            else:
                lgid = "__unknown__"

            # doc.ngrams is an sql query (ugly but useful intermediate step)
            # FIXME: move the counting and stoplist filtering up here
            for ngram_pack in doc.ngrams.all():
                todo_ngrams_per_lg[lgid].add(ngram_pack)

    # --------------------
    # long loop per ngrams
......
from gargantext.util.db import *
from gargantext.models import *
from gargantext.constants import *
from collections import defaultdict
from re import sub
from gargantext.util.scheduling import scheduled


def _integrate_associations(nodes_ngrams_count, ngrams_data, db, cursor):
...@@ -36,7 +33,7 @@ def _integrate_associations(nodes_ngrams_count, ngrams_data, db, cursor):
    db.commit()


def extract_ngrams(corpus, keys=DEFAULT_INDEX_FIELDS, do_subngrams = DEFAULT_INDEX_SUBGRAMS):
    """Extract ngrams for every document below the given corpus.
    Default language is given by the resource type.
    The result is then inserted into database.
...@@ -46,57 +43,50 @@ def extract_ngrams(corpus, keys=DEFAULT_INDEX_FIELDS, do_subngrams = DEFAULT_
        db, cursor = get_cursor()
        nodes_ngrams_count = defaultdict(int)
        ngrams_data = set()
        #1 corpus = 1 resource
        resource = corpus.resources()[0]
        documents_count = 0
        source = get_resource(resource["type"])
        #load available taggers for source default langage
        docs = [doc for doc in corpus.children('DOCUMENT') if doc.id not in corpus.skipped_docs]
        tagger_bots = {lang: load_tagger(lang)() for lang in corpus.languages if lang != "__skipped__"}
        #sort docs by lang?
        for lang, tagger in tagger_bots.items():
            for documents_count, document in enumerate(docs):
                language_iso2 = document.hyperdata.get('language_iso2', lang)
                #print(language_iso2)
                for key in keys:
                    try:
                        value = document[str(key)]
                        if not isinstance(value, str):
                            continue
                        # get ngrams
                        for ngram in tagger.extract(value):
                            tokens = tuple(normalize_forms(token[0]) for token in ngram)
                            if do_subngrams:
                                # ex tokens = ["very", "cool", "exemple"]
                                #    subterms = [['very', 'cool'],
                                #                ['very', 'cool', 'exemple'],
                                #                ['cool', 'exemple']]
                                subterms = subsequences(tokens)
                            else:
                                subterms = [tokens]

                            for seqterm in subterms:
                                ngram = ' '.join(seqterm)
                                if len(ngram) > 1:
                                    # doc <=> ngram index
                                    nodes_ngrams_count[(document.id, ngram)] += 1
                                    # add fields : terms n
                                    ngrams_data.add((ngram[:255], len(seqterm), ))
                    except:
                        #value not in doc
                        pass
                # except AttributeError:
                #     print("ERROR NO language_iso2")
                #     document.status("NGRAMS", error="No lang detected skipped Ngrams")
                #     corpus.skipped_docs.append(document.id)

                # integrate ngrams and nodes-ngrams
                if len(nodes_ngrams_count) >= BATCH_NGRAMSEXTRACTION_SIZE:
                    _integrate_associations(nodes_ngrams_count, ngrams_data, db, cursor)
...@@ -105,12 +95,14 @@ def extract_ngrams(corpus, keys=DEFAULT_INDEX_FIELDS, do_subngrams = DEFAULT_
                if documents_count % BATCH_NGRAMSEXTRACTION_SIZE == 0:
                    corpus.status('Ngrams', progress=documents_count+1)
                    corpus.save_hyperdata()
                    session.add(corpus)
                    session.commit()
        else:
            # integrate ngrams and nodes-ngrams
            _integrate_associations(nodes_ngrams_count, ngrams_data, db, cursor)
            corpus.status('Ngrams', progress=documents_count+1, complete=True)
            corpus.save_hyperdata()
            session.commit()
    except Exception as error:
        corpus.status('Ngrams', error=error)
        corpus.save_hyperdata()
......
from gargantext.util.db import *
from gargantext.models import *
from gargantext.constants import *
#from gargantext.util.parsers import *
from collections import defaultdict, Counter
from re import sub
from gargantext.util.languages import languages, detect_lang


def parse(corpus):
    try:
        documents_count = 0
        corpus.status('Docs', progress=0)
        #1 corpus => 1 resource
        resources = corpus.resources()
        #get the sources capabilities for a given corpus resource
        sources = [get_resource(resource["type"]) for resource in corpus.resources() if resource["extracted"] is False]
        if len(sources) == 0:
            #>>> documents have already been parsed?????
            return
        if len(sources) > 0:
            #>>> necessarily 1 corpus = 1 source in the current architecture
            source = sources[0]
            resource = resources[0]
            #source.extend(resource)
        if source["parser"] is None:
            #corpus.status(error)
            raise ValueError("Resource '%s' has no Parser" % resource["name"])
        else:
            #observed languages in corpus docs
            corpus.languages = defaultdict.fromkeys(source["default_languages"], 0)
            #remember the skipped docs in parsing
            skipped_languages = []
            corpus.skipped_docs = []
            session.add(corpus)
            session.commit()
            #load the corresponding parser
            parserbot = load_parser(source)
            # extract and insert documents from resource.path into database
            for hyperdata in parserbot(resource["path"]):
                # indexed text fields defined in CONSTANTS
                for k in DEFAULT_INDEX_FIELDS:
                    if k in hyperdata.keys():
                        try:
                            hyperdata[k] = normalize_chars(hyperdata[k])
                        except Exception as error:
                            hyperdata["error"] = "Error normalize_chars"

                indexed = False
                # a simple census to raise language info at corpus level
                for l in ["iso2", "iso3", "full_name"]:
                    if indexed is True:
                        break
                    lang_field = "language_"+l
                    if lang_field in hyperdata.keys():
                        if l == "iso2":
                            try:
                                corpus.languages[hyperdata["language_iso2"]] += 1
                                indexed = True
                            except KeyError:
                                hyperdata["error"] = "Error: unsupported language"
                                skipped_languages.append(hyperdata["language_iso2"])
                        else:
                            lang = languages(hyperdata[lang_field].lower()).iso2
                            try:
                                corpus.languages[lang] += 1
                                indexed = True
                            except KeyError:
                                hyperdata["error"] = "Error: unsupported language"
                                skipped_languages.append(lang)

                if indexed is False:
                    #no language has been indexed
                    #detect lang on the index_fields
                    for k in DEFAULT_INDEX_FIELDS:
                        if indexed is True:
                            break
                        if k in hyperdata.keys():
                            try:
                                if len(hyperdata[k]) > 10:
                                    #print("> detected on",k, ":", detect_lang(hyperdata[k]))
                                    hyperdata["language_iso2"] = detect_lang(hyperdata[k])
                                    corpus.languages[hyperdata["language_iso2"]] += 1
                                    indexed = True
                                    break
                            except KeyError:
                                hyperdata["error"] = "Error: unsupported language"
                                skipped_languages.append(hyperdata["language_iso2"])
                                indexed = True
                            except Exception as error:
                                print(error)
                                pass

                # save as DB child
                # ----------------
                document = corpus.add_child(
                    typename = 'DOCUMENT',
                    name = hyperdata.get('title', '')[:255],
                    hyperdata = hyperdata,
                )
                session.add(document)

                if "error" in hyperdata.keys():
                    #document.status("error")
                    document.status('Parsing', error= document.hyperdata["error"])
                    document.save_hyperdata()
                    session.commit()
                    #adding skipped_docs for later processing
                    corpus.skipped_docs.append(document.id)
                documents_count += 1

                # logging
                if documents_count % BATCH_PARSING_SIZE == 0:
                    corpus.status('Docs', progress=documents_count)
                    corpus.save_hyperdata()
                    session.add(corpus)
                    session.commit()

            # update info about the resource
            resource['extracted'] = True
            # add a corpus-level info about languages, adding a __skipped__ info
            corpus.languages['__skipped__'] = Counter(skipped_languages)
            for n in corpus.languages.items():
                print(n)
        # commit all changes
        corpus.status('Docs', progress=documents_count, complete=True)
        corpus.save_hyperdata()
        session.add(corpus)
        session.commit()
    except Exception as error:
        corpus.status('Docs', error=error)
......
...@@ -37,7 +37,7 @@ def docs_by_titles(request, project_id, corpus_id):
            'date': datetime.now(),
            'project': project,
            'corpus': corpus,
            'resourcename' : get_resource_by_name(corpus.resources()[0]),
            'view': 'titles',
            'user': request.user
        },
...@@ -65,7 +65,7 @@ def docs_by_journals(request, project_id, corpus_id):
            'date': datetime.now(),
            'project': project,
            'corpus' : corpus,
            'resourcename' : get_resource_by_name(corpus.resources()[0]),
            'view': 'journals'
        },
    )
...@@ -84,11 +84,8 @@ def analytics(request, project_id, corpus_id):
            'date': datetime.now(),
            'project': project,
            'corpus': corpus,
            'resourcename' : get_resource_by_name(corpus.resources()[0]),
            'view': 'analytics',
            'user': request.user
        },
    )
...@@ -59,12 +59,17 @@ def overview(request):

class NewCorpusForm(forms.Form):
    #mapping choices based on ressource.type
    source_list = [(resource["type"], resource["name"]) for resource in RESOURCETYPES]
    source_list.insert(0, (0, "Select a database below"))
    type = forms.ChoiceField(
        choices = source_list,
        widget = forms.Select(attrs={ 'onchange' :'CustomForSelect( $("option:selected", this).text() );'})
    )
    name = forms.CharField( label='Name', max_length=199 , widget=forms.TextInput(attrs={ 'required': 'true' }))
    file = forms.FileField()

    def clean_resource(self):
        file_ = self.cleaned_data.get('file')

    def clean_file(self):
        file_ = self.cleaned_data.get('file')
        if len(file_) > 1024 ** 3 : # we don't accept more than 1GB
...@@ -117,7 +122,8 @@ def project(request, project_id):
        resources = corpus.resources()
        if len(resources):
            resource = resources[0]
            #resource_type_name = RESOURCETYPES[resource['type']]['name']
            resource_type_name = get_resource(resource["type"])["name"]
        else:
            print("(WARNING) PROJECT view: no listed resource")
    # add some data for the viewer
...@@ -172,5 +178,3 @@ def project(request, project_id):
        'query_size': QUERY_SIZE_N_DEFAULT,
        },
    )
...@@ -2,7 +2,7 @@ from gargantext.util.http import requires_auth, render, settings
from gargantext.util.db import session
from gargantext.util.db_cache import cache
from gargantext.models import Node
from gargantext.constants import get_resource_by_name
from datetime import datetime

@requires_auth
...@@ -42,7 +42,7 @@ def ngramtable(request, project_id, corpus_id):
            'date': datetime.now(),
            'project': project,
            'corpus' : corpus,
            'resourcename' : get_resource_by_name(corpus),
            'view': 'terms',
            # for the CSV import modal
......
...@@ -55,8 +55,13 @@ def notify_user(username, email, password):
La nouvelle version de Gargantext sort en septembre prochain.
Vous êtes actuellement sur la version de développement, vos retours
seront précieux pour stabiliser la plateforme: merci d'avance!

Foire aux questions de Gargantext:
https://gogs.iscpif.fr/humanities/faq_gargantext/wiki/FAQ

Rapporter un bogue:
https://gogs.iscpif.fr/humanities/faq_gargantext/issues

Nous restons à votre disposition pour tout complément d'information.

Cordialement
......
# Install Instructions for Gargamelle

**Gargamelle** is the gargantext platform toolbox: it installs a full gargantext system with minimal modules inside a **docker** container.

First you need to get the source code to install it
The destination folder will be `/srv/gargantext`:
* docs contains all information on gargantext
(`/srv/gargantext/docs/`)
* install contains all the installation files
`/srv/gargantext/install/`

Help needed ?
See [http://gargantext.org/about](http://gargantext.org/about) and [tools](./contribution_guide.md) for the community

## Get the source code
...@@ -27,36 +26,30 @@ git clone ssh://gitolite@delanoe.org:1979/gargantext /srv/gargantext \

## Install
``` bash
# go into the directory
user@computer: cd /srv/gargantext/
# get inside installation folder
user@computer: cd install
# execute the installation script
user@computer: ./install
```

During installation an admin account for gargantext will be created by asking you a username and a password
Remember it to access the Gargantext platform

## Run
Once you're done with the installation, the **Gargantext** platform will be available at `http://localhost:8000`
simply by running the `start` executable file
``` bash
# go into the directory
user@computer: cd /srv/gargantext/
# run the start command
user@computer: ./start
# type ctrl+d or "exit" command to exit
```

Then open up a chromium browser and go to `http://localhost:8000`
Click on "Enter Gargantext"
Log in with your created username and password
Enjoy! ;)
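If you need a shell inside the running container (for instance to run manage.py commands or inspect logs), the repository ships a small console helper; it boils down to the following line, using the gargamelle_box container name set by the run scripts in this commit:
``` bash
# open a console + virtualenv inside the already running docker container
sudo docker exec -it gargamelle_box bash --rcfile 'env_3-5/bin/activate'
```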
...@@ -9,6 +9,7 @@ MAINTAINER ISCPIF <gargantext@iscpif.fr>
USER root

### Update and install base dependencies
RUN echo "############ DEBIAN LIBS ###############"
RUN apt-get update && \
    apt-get install -y \
    apt-utils ca-certificates locales \
...@@ -19,33 +20,37 @@ RUN apt-get update && \
    postgresql-9.5 postgresql-client-9.5 postgresql-contrib-9.5

### Configure timezone and locale
RUN echo "########### LOCALES & TZ #################"
RUN echo "Europe/Paris" > /etc/timezone
ENV TZ "Europe/Paris"
RUN sed -i -e 's/# en_GB.UTF-8 UTF-8/en_GB.UTF-8 UTF-8/' /etc/locale.gen && \
    sed -i -e 's/# fr_FR.UTF-8 UTF-8/fr_FR.UTF-8 UTF-8/' /etc/locale.gen && \
    dpkg-reconfigure --frontend=noninteractive locales && \
    echo 'LANG="fr_FR.UTF-8"' > /etc/default/locale
ENV LANG fr_FR.UTF-8
ENV LANGUAGE fr_FR.UTF-8
ENV LC_ALL fr_FR.UTF-8

### Install main dependencies and python packages based on Debian distrib
RUN echo "############# PYTHON DEPENDENCIES ###############"
RUN apt-get update && apt-get install -y \
    libxml2-dev xml-core libgfortran-5-dev \
    libpq-dev \
    python3.5 \
    python3-dev \
    python3-six python3-numpy python3-setuptools \
    # ^for numpy, pandas and numpyperf
    python3-numexpr \
    # python dependencies
    python3-pip \
    # for lxml
    libxml2-dev libxslt-dev
    #libxslt1-dev zlib1g-dev

# UPDATE AND CLEAN
RUN apt-get update && apt-get autoclean &&\
    rm -rf /var/lib/apt/lists/*
#NB: removing /var/lib will avoid to significantly fill up your /var/ folder on your native system
...@@ -65,9 +70,8 @@ ADD psql_configure.sh /
ADD django_configure.sh /

RUN . /env_3-5/bin/activate && pip3 install -r requirements.txt && \
    pip3 install git+https://github.com/zzzeek/sqlalchemy.git@rel_1_1 && \
    python3 -m nltk.downloader averaged_perceptron_tagger -d /usr/local/share/nltk_data

RUN chown gargantua:gargantua -R /env_3-5
...@@ -81,6 +85,4 @@ RUN echo "listen_addresses='*'" >> /etc/postgresql/9.5/main/postgresql.conf
EXPOSE 5432 8000

# VOLUME ["/srv/",]
#!/bin/bash
# opens a console + virtualenv inside the already active docker container
# (to use after start)
sudo docker exec -it gargamelle_box bash --rcfile 'env_3-5/bin/activate'
#!/bin/bash
sudo docker run \
    --name=gargamelle_box \
    -v /srv/gargantext:/srv/gargantext \
    -v /srv/gargandata:/srv/gargandata \
    -v /srv/gargantext_lib:/srv/gargantext_lib \
    -p 8000:8000 \
    -p 5432 \
    -it gargamelle:latest \
    /bin/bash -c "service postgresql start; /bin/su gargantua -c 'source /env_3-5/bin/activate && /srv/gargantext/manage.py runserver 0.0.0.0:8000' && bin/bash"

sudo docker rm gargamelle_box
...@@ -16,13 +16,12 @@ echo "::::: DJANGO :::::"
/bin/su gargantua -c 'source /env_3-5/bin/activate &&\
    echo "Activated env" &&\
    /srv/gargantext/manage.py makemigrations &&\
    /srv/gargantext/manage.py migrate && \
    echo "migrations ok" &&\
    /srv/gargantext/dbmigrate.py && \
    /srv/gargantext/dbmigrate.py && \
    /srv/gargantext/dbmigrate.py && \
    /srv/gargantext/manage.py createsuperuser'

/usr/sbin/service postgresql stop
...@@ -14,6 +14,7 @@ html5lib==0.9999999
python-igraph>=0.7.1
jdatetime==1.7.2
kombu==3.0.33 # messaging
langdetect==1.0.6 # detecting language
nltk==3.1
numpy==1.10.4
psycopg2==2.6.1
......
...@@ -43,6 +43,22 @@ function uncompress_lib {
#~ esac

echo "::: CREATE GROUP :::";
if grep -q 'gargantua' /etc/group
then
    echo "Using existing group 'gargantua'"
else
    sudo groupadd gargantua
fi
# adding the users to the group
current_user=$(who -m | cut -d' ' -f1)
sudo usermod -G gargantua $current_user
sudo usermod -G gargantua gargantua
# changing the group of the sourcedir
sudo chown -R :gargantua /srv/gargantext

echo "::: SETUP ENV :::";
for dir in "/srv/gargantext_lib" "/srv/gargantext_static" "/srv/gargantext_media"; do
    sudo mkdir -p $dir ;
...@@ -59,12 +75,17 @@ sudo docker build -t gargamelle:latest ./gargamelle

echo ':::: CONFIGURE ::::'
sudo docker run \
    --name=gargamelle_box \
    -v /srv/gargantext:/srv/gargantext \
    -v /srv/gargandata:/srv/gargandata \
    -v /srv/gargantext_lib:/srv/gargantext_lib \
    -p 8000:8000 \
    -p 5432 \
    -it gargamelle:latest \
    /bin/bash -c "./psql_configure.sh; ./django_configure.sh ; exit"

sudo docker rm gargamelle_box

# creating the "start" copy + giving it normal ownership (because we're probably sudo)
cp ./run /srv/gargantext/start
chown $current_user:gargantua /srv/gargantext/start
#!/bin/bash
sudo docker run \
    --name=gargamelle_box \
    -v /srv/gargantext:/srv/gargantext \
    -v /srv/gargandata:/srv/gargandata \
    -v /srv/gargantext_lib:/srv/gargantext_lib \
    -p 8000:8000 \
    -p 5432 \
    -it gargamelle:latest \
    /bin/bash -c "service postgresql start; /bin/su gargantua -c 'source /env_3-5/bin/activate && /srv/gargantext/manage.py runserver 0.0.0.0:8000'"

sudo docker rm gargamelle_box
...@@ -8,7 +8,7 @@ from traceback import print_tb
from django.shortcuts import redirect, render
from django.http import Http404, HttpResponseRedirect, HttpResponseForbidden

from gargantext.constants import get_resource_by_name, QUERY_SIZE_N_MAX
from gargantext.models.nodes import Node
from gargantext.util.db import session
from gargantext.util.http import JsonHttpResponse
...@@ -133,7 +133,7 @@ def save(request , project_id):
        if filename != False:
            # add the uploaded resource to the corpus
            corpus.add_resource(
                type = get_resource_by_name('ISTex [ISI]')["type"]
                , path = filename
            )
            dwnldsOK += 1
......
...@@ -18,7 +18,7 @@ from traceback import print_tb
from django.shortcuts import redirect
from django.http import Http404, HttpResponseRedirect, HttpResponseForbidden

from gargantext.constants import get_resource, QUERY_SIZE_N_MAX
from gargantext.models.nodes import Node
from gargantext.util.db import session
from gargantext.util.db_cache import cache
...@@ -134,7 +134,7 @@ def save( request , project_id ) :
        print(filename)
        if filename != False:
            # add the uploaded resource to the corpus
            corpus.add_resource( type = get_resource_by_name('Pubmed [XML]')["type"]
                               , path = filename
                               , url = None
                               )
......
...@@ -174,12 +174,28 @@
       title="Export terms table in CSV">
        Export terms table &nbsp; <span class="glyphicon glyphicon-download" aria-hidden="true"></span>
    </a>

{% elif view == 'titles' %}
    <a href="https://gogs.iscpif.fr/humanities/faq_gargantext/wiki/FAQ#import--export-a-dataset" class="pull-right btn btn-lg">
        <span class="glyphicon glyphicon-question-sign" aria-hidden="true"></span>
    </a>
    <a class="btn btn-primary exportbtn pull-right" role="button"
       href="/api/nodes?parent_id={{corpus.id}}&types[]=DOCUMENT&pagination_limit=100000&formated=csv"
       title="Export full corpus in CSV">
        Export corpus &nbsp;
        <span class="glyphicon glyphicon-download" aria-hidden="true"></span>
    </a>
{% else %}
    <!-- TODO export journal table -->
{% endif %}
...@@ -187,6 +203,7 @@
</div>
<div class="row">
    <div class="col-md-1">
        </span>
    </div>
    <div class="col-md-6">
        <h3>
......
...@@ -212,12 +212,19 @@
<div class="modal-content">
    <div class="modal-header">
        <button type="button" class="close" data-dismiss="modal" aria-hidden="true">×</button>
        <h3>Add a Corpus <a href="https://gogs.iscpif.fr/humanities/faq_gargantext/wiki/FAQ#import--export-a-dataset">
            <span class="glyphicon glyphicon-question-sign" aria-hidden="true"></span>
            </a>
        </h3>
    </div>
    <div class="modal-body">
        <!-- FAQ -->
        <form id="id_form" enctype="multipart/form-data" action="/projects/{{project.id}}/" method="post">
            {% csrf_token %}
            <table cellpadding="5">
                {% for field in form %}
                <tr>
                    <th>{{field.label_tag}}</th>
......