Commit 29942ecd authored by Alexandre Delanoë

[FIX]

parents 32de2d3f 742b8194
@@ -2,6 +2,11 @@
* Guided Tour
* Sources form highlighting crawlers
## Version 3.0.8.1
* WOS parser date FIX
* EUROPRESS parser author/text article FIX
* Backend: each project has a user node as parent
## Version 3.0.7
* Alembic implemented to manage database migrations
...
@@ -6,6 +6,21 @@ Keep in mind that Alembic only handles SQLAlchemy models: tables created from
Django ORM must be put out of Alembic sight. See [alembic:exclude] section in
alembic.ini.
To bootstrap Alembic when a gargantext database already exists, see
below: TELL ALEMBIC TO NOT START FROM SCRATCH.
USUAL WORKFLOW WITH ALEMBIC
1. Make changes to models in gargantext/models
2. Autogenerate revision (see below GENERATE A REVISION)
3. Manually check and edit revision file in alembic/versions
4. Commit alembic revision (it should never be reverted)
5. Commit changes in models (it can be reverted if needed)
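
For instance, one round of this workflow might look like the following
(the model change and revision message are hypothetical):

alembic revision --autogenerate -m "Add notes column to nodes"
# review alembic/versions/<revision_id>_add_notes_column_to_nodes.py, then
# commit it along with the model changes and apply it with:
alembic upgrade head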
To create, drop or modify views, schemas, roles, stored procedures, triggers
or policies, see below: REPLACEABLE OBJECTS.
TELL ALEMBIC TO NOT START FROM SCRATCH
@@ -29,25 +44,76 @@ DOWNGRADE TO INITIAL DATABASE STATE
alembic downgrade base
GENERATE A REVISION

alembic revision --autogenerate -m "Message for this migration"

# A migration script is then created in alembic/versions directory. For
# example alembic/versions/3adcc9a56557_message_for_this_migration.py
# where 3adcc9a56557 is the revision id generated by Alembic.
#
# Alembic should generate a script reflecting changes already made in
# models or database. However it is always a good idea to check it and edit
# it manually; Alembic is not always accurate and can't see all alterations.
# It should work with basic changes such as model or column creation. See
# http://alembic.zzzcomputing.com/en/latest/autogenerate.html#what-does-autogenerate-detect-and-what-does-it-not-detect
GENERATE AN EMPTY REVISION
alembic revision -m "Message for this migration"
# This script must be edited to write the migration itself, mainly
# in `upgrade` and `downgrade` functions. See Alembic documentation for
# further details.
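
For instance, a hand-written revision might fill in both functions like this
(a sketch; the column name is hypothetical):

import sqlalchemy as sa
from alembic import op

def upgrade():
    op.add_column('nodes', sa.Column('notes', sa.Text()))

def downgrade():
    op.drop_column('nodes', 'notes')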
REPLACEABLE OBJECTS

There is no specific way to handle views, schemas, roles, stored procedures,
triggers or policies with Alembic. To ease revisions of such objects, avoid
boilerplate code and too many op.execute calls, we use an enhanced version of
the ReplaceableObject recipe (see Alembic documentation).

To create, drop or modify such an object you need to make a ReplaceableObject
instance, and then use the create_*, drop_* or replace_* methods of alembic.op.
Conversion between ReplaceableObject and SQL is implemented in
gargantext/util/alembic.py.
* Views: create_view(ReplaceableObject(<name>, <query>))
* Roles: create_role(ReplaceableObject(<name>, <options>))
* Schemas: create_schema(ReplaceableObject(<name>))
* Stored procedures: create_sp(ReplaceableObject(<name(arguments)>, <body>))
* Triggers: create_trigger(ReplaceableObject(<name>, <when>, <table>, <body>))
* Policies: create_policy(ReplaceableObject(<name>, <table>, <body>))
Here is an example with a stored procedure:
...
from gargantext.util.alembic import ReplaceableObject
revision = '08230100f512'
...
my_function_sp = ReplaceableObject(
"my_function()", "RETURNS integer AS $$ SELECT 42 $$ LANGUAGE sql")
def upgrade():
    op.create_sp(my_function_sp)

def downgrade():
    op.drop_sp(my_function_sp)
To modify this stored procedure in a later revision:
...
from gargantext.util.alembic import ReplaceableObject
my_function_sp = ReplaceableObject(
    "my_function()", "RETURNS integer AS $$ SELECT 43 $$ LANGUAGE sql")

def upgrade():
    op.replace_sp(my_function_sp, replaces="08230100f512.my_function_sp")

def downgrade():
    op.replace_sp(my_function_sp, replace_with="08230100f512.my_function_sp")
@@ -18,7 +18,8 @@ from gargantext import settings, models
# this is the Alembic Config object, which provides
# access to the values within the .ini file in use.
config = context.config
config.set_main_option("sqlalchemy.url",
                       settings.DATABASES['default']['SECRET_URL'])
# Interpret the config file for Python logging.
# This line sets up loggers basically.
@@ -52,6 +53,14 @@ def include_object(obj, name, typ, reflected, compare_to):
    return True
context_opts = dict(
    target_metadata=target_metadata,
    include_object=include_object,
    compare_server_default=True,
    compare_type=True,
)
def run_migrations_offline():
    """Run migrations in 'offline' mode.
@@ -65,9 +74,7 @@ def run_migrations_offline():
    """
    url = config.get_main_option("sqlalchemy.url")
    context.configure(url=url, literal_binds=True, **context_opts)

    with context.begin_transaction():
        context.run_migrations()
@@ -86,11 +93,7 @@ def run_migrations_online():
                              poolclass=pool.NullPool)

    with connectable.connect() as connection:
        context.configure(connection=connection, **context_opts)

        with context.begin_transaction():
            context.run_migrations()
...
"""Fix bug in title_abstract indexation
Revision ID: 159a5154362b
Revises: 73112a361617
Create Date: 2017-09-18 18:00:26.055335
"""
from alembic import op
import sqlalchemy as sa
from gargantext.util.alembic import ReplaceableObject
# revision identifiers, used by Alembic.
revision = '159a5154362b'
down_revision = '73112a361617'
branch_labels = None
depends_on = None
title_abstract_insert = ReplaceableObject(
'title_abstract_insert',
'BEFORE INSERT',
'nodes',
"""FOR EACH ROW
WHEN (NEW.hyperdata::text <> '{}'::text)
EXECUTE PROCEDURE title_abstract_update_trigger()"""
)
title_abstract_update = ReplaceableObject(
'title_abstract_update',
'BEFORE UPDATE OF hyperdata',
'nodes',
"""FOR EACH ROW
WHEN ((OLD.hyperdata ->> 'title', OLD.hyperdata ->> 'abstract')
IS DISTINCT FROM
(NEW.hyperdata ->> 'title', NEW.hyperdata ->> 'abstract'))
EXECUTE PROCEDURE title_abstract_update_trigger()"""
)
def upgrade():
    op.replace_trigger(title_abstract_insert, replaces="73112a361617.title_abstract_insert")
    op.replace_trigger(title_abstract_update, replaces="73112a361617.title_abstract_update")

    # Manually re-build index
    op.execute("UPDATE nodes SET title_abstract = to_tsvector('english', (hyperdata ->> 'title') || ' ' || (hyperdata ->> 'abstract')) WHERE typename=4")

def downgrade():
    # Won't unfix the bug!
    pass
"""Add english fulltext index on Nodes.hyperdata for abstract and title
Revision ID: 1fb4405b59e1
Revises: bedce47c9e34
Create Date: 2017-09-13 16:31:36.926692
"""
from alembic import op
import sqlalchemy as sa
from sqlalchemy_utils.types import TSVectorType
from gargantext.util.alembic import ReplaceableObject
# revision identifiers, used by Alembic.
revision = '1fb4405b59e1'
down_revision = 'bedce47c9e34'
branch_labels = None
depends_on = None
title_abstract_update_trigger = ReplaceableObject(
'title_abstract_update_trigger()',
"""
RETURNS trigger AS $$
begin
new.title_abstract := to_tsvector('english', (new.hyperdata ->> 'title') || ' ' || (new.hyperdata ->> 'abstract'));
return new;
end
$$ LANGUAGE plpgsql;
"""
)
title_abstract_update = ReplaceableObject(
'title_abstract_update',
'BEFORE INSERT OR UPDATE',
'nodes',
'FOR EACH ROW EXECUTE PROCEDURE title_abstract_update_trigger()'
)
def upgrade():
    op.add_column('nodes', sa.Column('title_abstract', TSVectorType))
    op.create_sp(title_abstract_update_trigger)
    op.create_trigger(title_abstract_update)

    # Initialize index with already existing data
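    # (the no-op UPDATE below fires the BEFORE INSERT OR UPDATE trigger on
    # every row, computing title_abstract from the existing hyperdata)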
    op.execute('UPDATE nodes SET hyperdata = hyperdata')

def downgrade():
    op.drop_trigger(title_abstract_update)
    op.drop_sp(title_abstract_update_trigger)
    op.drop_column('nodes', 'title_abstract')
"""Optimize title_abstract indexation
Revision ID: 73112a361617
Revises: 1fb4405b59e1
Create Date: 2017-09-15 14:14:51.737963
"""
from alembic import op
import sqlalchemy as sa
from gargantext.util.alembic import ReplaceableObject
# revision identifiers, used by Alembic.
revision = '73112a361617'
down_revision = '1fb4405b59e1'
branch_labels = None
depends_on = None
title_abstract_insert = ReplaceableObject(
'title_abstract_insert',
'AFTER INSERT',
'nodes',
"""FOR EACH ROW
WHEN (NEW.hyperdata::text <> '{}'::text)
EXECUTE PROCEDURE title_abstract_update_trigger()"""
)
title_abstract_update = ReplaceableObject(
'title_abstract_update',
'AFTER UPDATE OF hyperdata',
'nodes',
"""FOR EACH ROW
WHEN ((OLD.hyperdata ->> 'title', OLD.hyperdata ->> 'abstract')
IS DISTINCT FROM
(NEW.hyperdata ->> 'title', NEW.hyperdata ->> 'abstract'))
EXECUTE PROCEDURE title_abstract_update_trigger()"""
)
def upgrade():
    op.replace_trigger(title_abstract_update, replaces="1fb4405b59e1.title_abstract_update")
    op.create_trigger(title_abstract_insert)

def downgrade():
    op.drop_trigger(title_abstract_insert)
    op.replace_trigger(title_abstract_update, replace_with="1fb4405b59e1.title_abstract_update")
"""Add server side sensible defaults for nodes
Revision ID: 73304ae9f1fb
Revises: 159a5154362b
Create Date: 2017-10-05 14:17:58.326646
"""
from alembic import op
import sqlalchemy as sa
import gargantext
from sqlalchemy.dialects import postgresql
# revision identifiers, used by Alembic.
revision = '73304ae9f1fb'
down_revision = '159a5154362b'
branch_labels = None
depends_on = None
def upgrade():
    op.alter_column('nodes', 'date',
                    existing_type=postgresql.TIMESTAMP(timezone=True),
                    server_default=sa.text('CURRENT_TIMESTAMP'),
                    nullable=False)
    op.alter_column('nodes', 'hyperdata',
                    existing_type=postgresql.JSONB(astext_type=sa.Text()),
                    server_default=sa.text("'{}'::jsonb"),
                    nullable=False)
    op.alter_column('nodes', 'name',
                    existing_type=sa.VARCHAR(length=255),
                    server_default='',
                    nullable=False)
    op.alter_column('nodes', 'typename',
                    existing_type=sa.INTEGER(),
                    nullable=False)
    op.alter_column('nodes', 'user_id',
                    existing_type=sa.INTEGER(),
                    nullable=False)

def downgrade():
    op.alter_column('nodes', 'user_id',
                    existing_type=sa.INTEGER(),
                    nullable=True)
    op.alter_column('nodes', 'typename',
                    existing_type=sa.INTEGER(),
                    nullable=True)
    op.alter_column('nodes', 'name',
                    existing_type=sa.VARCHAR(length=255),
                    server_default=None,
                    nullable=True)
    op.alter_column('nodes', 'hyperdata',
                    existing_type=postgresql.JSONB(astext_type=sa.Text()),
                    server_default=None,
                    nullable=True)
    op.alter_column('nodes', 'date',
                    existing_type=postgresql.TIMESTAMP(timezone=True),
                    server_default=None,
                    nullable=True)
@@ -7,6 +7,12 @@
$httpProvider.defaults.xsrfHeaderName = 'X-CSRFToken';
$httpProvider.defaults.xsrfCookieName = 'csrftoken';
}]);
function url(path) {
    // adding explicit "http[s]://" -- for cross origin requests
    return location.protocol + '//' + window.GARG_ROOT_URL + path;
}
/*
 * DocumentHttpService: Read Document
 * ===================
@@ -98,9 +104,7 @@
 */
http.factory('MainApiAddNgramHttpService', function($resource) {
    return $resource(
        url("/api/ngrams?text=:ngramStr&corpus=:corpusId&testgroup"),
        {
            ngramStr: '@ngramStr',
            corpusId: '@corpusId',
@@ -131,9 +135,7 @@
http.factory('MainApiChangeNgramHttpService', function($resource) {
    return $resource(
        url("/api/ngramlists/change?list=:listId&ngrams=:ngramIdList"),
        {
            listId: '@listId',
            ngramIdList: '@ngramIdList' // list in str form (sep=","): "12,25,30"
@@ -171,8 +173,7 @@
 */
http.factory('MainApiFavoritesHttpService', function($resource) {
    return $resource(
        url("/api/nodes/:corpusId/favorites?docs=:docId"),
        {
            corpusId: '@corpusId',
            docId: '@docId'
...
@@ -89,9 +89,9 @@
</div>
<div class="row-fluid">
    <ul class="list-group clearfix">
        <li class="list-group-item small"><span class="badge">source</span>{[{source || '&nbsp;'}]}</li>
        <li class="list-group-item small"><span class="badge">authors</span>{[{authors || '&nbsp;'}]}</li>
        <li class="list-group-item small"><span class="badge">date</span>{[{publication_date || '&nbsp;'}]}</li>
    </ul>
</div>
...
@@ -2,8 +2,6 @@
[uwsgi]

env = DJANGO_SETTINGS_MODULE=gargantext.settings
#module = django.core.handlers.wsgi:WSGIHandler()
@@ -44,7 +42,7 @@ touch-reload = /tmp/gargantext.reload
# respawn processes taking more than 1200 seconds
harakiri = 1200

post-buffering=8192

# limit the project to 128 MB
@@ -55,7 +53,18 @@ max-requests = 5000

# background the process & log
#daemonize = /var/log/uwsgi/gargantext.log
daemonize = /var/log/gargantext/uwsgi/@(exec://date +%%Y-%%m-%%d_%%H%%M).log
log-reopen = true
#uid = 1000
#gid = 1000
#
show-config=true
disable-logging=false
logfile-chmod=644
#logfile-chown=false
log-maxsize=500000000
##logto=%(chdir)logs/uwsgi_access.log
#logger = longquery file:%(chdir)logs/uwsgi_long.log
#log-route = longquery msec
#
from sqlalchemy.schema import Column, ForeignKey, UniqueConstraint, Index
from sqlalchemy.orm import relationship, validates
from sqlalchemy.types import TypeDecorator, \
    Integer, REAL, Boolean, DateTime, String, Text
from sqlalchemy_utils.types import TSVectorType
from sqlalchemy.dialects.postgresql import JSONB, DOUBLE_PRECISION as Double
from sqlalchemy.ext.mutable import MutableDict, MutableList
from sqlalchemy.ext.declarative import declarative_base
from sqlalchemy import text

__all__ = ["Column", "ForeignKey", "UniqueConstraint", "Index", "relationship",
           "text",
           "validates", "ValidatorMixin",
           "Integer", "Float", "Boolean", "DateTime", "String", "Text",
           "TSVectorType",
           "TypeDecorator",
           "JSONB", "Double",
           "MutableDict", "MutableList",
@@ -25,6 +29,16 @@ Base = declarative_base()
DjangoBase = declarative_base()
class Float(REAL):
    """Reflect exact REAL type for PostgreSQL in order to avoid confusion
    within Alembic type comparison"""
    def __init__(self, *args, **kwargs):
        if kwargs.get('precision') == 24:
            kwargs.pop('precision')
        super(Float, self).__init__(*args, **kwargs)
class ValidatorMixin(object):
    def enforce_length(self, key, value):
        """Truncate a string according to its column length
...
@@ -2,14 +2,11 @@ from gargantext.util.db import session
from gargantext.util.files import upload
from gargantext.constants import *

from datetime import datetime

from .base import Base, Column, ForeignKey, relationship, TypeDecorator, Index, \
                  Integer, Float, String, DateTime, JSONB, TSVectorType, \
                  MutableList, MutableDict, validates, ValidatorMixin, text
from .users import User

__all__ = ['Node', 'NodeNode', 'CorpusNode']
@@ -47,7 +44,7 @@ class Node(ValidatorMixin, Base):
    >>> session.query(Node).filter_by(typename='USER').first() # doctest: +ELLIPSIS
    <UserNode(...)>

    But beware, there are some pitfalls with bulk queries. In this case typename
    MUST be specified manually.

    >>> session.query(UserNode).delete() # doctest: +SKIP
@@ -60,28 +57,33 @@ class Node(ValidatorMixin, Base):
             Index('nodes_user_id_typename_parent_id_idx', 'user_id', 'typename', 'parent_id'),
             Index('nodes_hyperdata_idx', 'hyperdata', postgresql_using='gin'))

    id = Column(Integer, primary_key=True)

    typename = Column(NodeType, index=True, nullable=False)
    __mapper_args__ = { 'polymorphic_on': typename }

    # foreign keys
    user_id = Column(Integer, ForeignKey(User.id, ondelete='CASCADE'),
                     nullable=False)
    user = relationship(User)

    parent_id = Column(Integer, ForeignKey('nodes.id', ondelete='CASCADE'))
    parent = relationship('Node', remote_side=[id])

    name = Column(String(255), nullable=False, server_default='')
    date = Column(DateTime(timezone=True), nullable=False,
                  server_default=text('CURRENT_TIMESTAMP'))
    hyperdata = Column(JSONB, default=dict, nullable=False,
                       server_default=text("'{}'::jsonb"))

    # Create a TSVECTOR column to use fulltext search feature of PostgreSQL.
    # We need to create a trigger to update this column on update and insert,
    # it's created in alembic/version/1fb4405b59e1_add_english_fulltext_index_on_nodes_.py
    #
    # To use this column: session.query(DocumentNode) \
    #                            .filter(Node.title_abstract.match('keyword'))
    title_abstract = Column(TSVectorType(regconfig='english'))

    def __new__(cls, *args, **kwargs):
        if cls is Node and kwargs.get('typename'):
...
"""
Django settings for gargantext project.
Generated by 'django-admin startproject' using Django 1.9.2.
For more information on this file, see
https://docs.djangoproject.com/en/1.9/topics/settings/
For the full list of settings and their values, see
https://docs.djangoproject.com/en/1.9/ref/settings/
"""
import os
# Build paths inside the project like this: os.path.join(BASE_DIR, ...)
BASE_DIR = os.path.dirname(os.path.dirname(os.path.abspath(__file__)))
# Quick-start development settings - unsuitable for production
# See https://docs.djangoproject.com/en/1.9/howto/deployment/checklist/
# SECURITY WARNING: keep the secret key used in production secret!
SECRET_KEY = '!%ktkh981)piil1%t5r0g4$^0=uvdafk!=f2x8djxy7_gq(n5%'
# SECURITY WARNING: don't run with debug turned on in production!
DEBUG = True
MAINTENANCE = False
BASE_URL = "testing.gargantext.org"
ALLOWED_HOSTS = ["localhost", ".gargantext.org", ".iscpif.fr",]
# Asynchronous tasks
import djcelery
djcelery.setup_loader()
BROKER_URL = 'amqp://guest:guest@localhost:5672/'
CELERY_ACCEPT_CONTENT = ['pickle', 'json', 'msgpack', 'yaml']
CELERY_TIMEZONE = 'Europe/Paris'
CELERYBEAT_SCHEDULER = 'djcelery.schedulers.DatabaseScheduler'
CELERY_IMPORTS = (
"gargantext.util.toolchain",
"gargantext.util.crawlers",
"graph.graph",
"moissonneurs.pubmed",
"moissonneurs.istex",
"gargantext.util.ngramlists_tools",
)
# garg's custom unittests runner (adapted to our db models)
TEST_RUNNER = 'unittests.framework.GargTestRunner'
# Application definition
INSTALLED_APPS = [
'django.contrib.admin',
'django.contrib.auth',
'django.contrib.contenttypes',
'django.contrib.sessions',
'django.contrib.messages',
'django.contrib.staticfiles',
'rest_framework',
'djcelery',
'annotations',
'graph',
'moissonneurs',
'gargantext',
]
MIDDLEWARE_CLASSES = [
'django.middleware.security.SecurityMiddleware',
'django.contrib.sessions.middleware.SessionMiddleware',
'django.middleware.common.CommonMiddleware',
'django.middleware.csrf.CsrfViewMiddleware',
'django.contrib.auth.middleware.AuthenticationMiddleware',
'django.contrib.auth.middleware.SessionAuthenticationMiddleware',
'django.contrib.messages.middleware.MessageMiddleware',
'django.middleware.clickjacking.XFrameOptionsMiddleware',
]
ROOT_URLCONF = 'gargantext.urls'
TEMPLATES = [
{
'BACKEND': 'django.template.backends.django.DjangoTemplates',
'DIRS': [
os.path.join(BASE_DIR, 'templates'),
#'./templates'
],
'APP_DIRS': True,
'OPTIONS': {
'context_processors': [
'django.template.context_processors.debug',
'django.template.context_processors.request',
'django.contrib.auth.context_processors.auth',
'django.contrib.messages.context_processors.messages',
],
},
},
]
WSGI_APPLICATION = 'gargantext.wsgi.application'
# http://getblimp.github.io/django-rest-framework-jwt/#additional-settings
REST_FRAMEWORK = {
'DEFAULT_PERMISSION_CLASSES': (
'rest_framework.permissions.IsAuthenticated',
),
'DEFAULT_AUTHENTICATION_CLASSES': (
'rest_framework_jwt.authentication.JSONWebTokenAuthentication',
'rest_framework.authentication.SessionAuthentication',
'rest_framework.authentication.BasicAuthentication',
),
}
JWT_AUTH = {
'JWT_VERIFY_EXPIRATION': False,
'JWT_SECRET_KEY': SECRET_KEY,
'JWT_AUTH_HEADER_PREFIX': 'Bearer',
}
# Static files (CSS, JavaScript, Images)
# https://docs.djangoproject.com/en/1.9/howto/static-files/
STATIC_ROOT = '/srv/gargantext_static/'
STATIC_URL = '/static/'
STATICFILES_DIRS = (
os.path.join(BASE_DIR, 'static'),
)
STATICFILES_FINDERS = (
'django.contrib.staticfiles.finders.AppDirectoriesFinder',
'django.contrib.staticfiles.finders.FileSystemFinder',
)
MEDIA_ROOT = '/srv/gargantext_media'
#MEDIA_ROOT = os.path.join(PROJECT_PATH, 'media')
MEDIA_URL = '/media/'
# Database
# https://docs.djangoproject.com/en/1.9/ref/settings/#databases
DATABASES = {
'default': {
'ENGINE': 'django.db.backends.postgresql_psycopg2',
'NAME': 'gargandb',
'USER': 'gargantua',
'PASSWORD': 'C8kdcUrAQy66U',
'HOST': '127.0.0.1',
'PORT': '5432',
'TEST': {
'NAME': 'test_gargandb',
},
}
}
DATABASES['default']['SECRET_URL'] = \
'postgresql+psycopg2://{USER}:{PASSWORD}@{HOST}:{PORT}/{NAME}'.format(
**DATABASES['default']
)
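
# For illustration, with the defaults above SECRET_URL expands to:
# postgresql+psycopg2://gargantua:C8kdcUrAQy66U@127.0.0.1:5432/gargandb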
# Password validation
# https://docs.djangoproject.com/en/1.9/ref/settings/#auth-password-validators
AUTH_PASSWORD_VALIDATORS = [
{
'NAME': 'django.contrib.auth.password_validation.UserAttributeSimilarityValidator',
},
{
'NAME': 'django.contrib.auth.password_validation.MinimumLengthValidator',
},
{
'NAME': 'django.contrib.auth.password_validation.CommonPasswordValidator',
},
{
'NAME': 'django.contrib.auth.password_validation.NumericPasswordValidator',
},
]
API_TOKENS = {
"CERN": {
"APIKEY":'',
"APISECRET":'',
},
"MULTIVAC": {
"APIKEY": ""
}
}
# Internationalization
# https://docs.djangoproject.com/en/1.9/topics/i18n/
LANGUAGE_CODE = 'en-us'
TIME_ZONE = None
USE_I18N = True
USE_L10N = True
USE_TZ = True
# BOOL Interpreter
BOOL_TOOLS_PATH="/srv/gargantext/gargantext/util/crawlers/sparql"
@@ -7,6 +7,8 @@ the name of "ReplaceableObject" class.
This recipe is directly borrowed from Alembic documentation, see
http://alembic.zzzcomputing.com/en/latest/cookbook.html#replaceable-objects
**2017-10-09** ReversibleOp.define has been added to reduce boilerplate code.
""" """
from alembic.operations import Operations, MigrateOperation from alembic.operations import Operations, MigrateOperation
...@@ -16,9 +18,9 @@ __all__ = ['ReplaceableObject'] ...@@ -16,9 +18,9 @@ __all__ = ['ReplaceableObject']
class ReplaceableObject(object): class ReplaceableObject(object):
def __init__(self, name, sqltext): def __init__(self, name, *args):
self.name = name self.name = name
self.sqltext = sqltext self.args = args
class ReversibleOp(MigrateOperation):
@@ -58,38 +60,41 @@ class ReversibleOp(MigrateOperation):
        operations.invoke(drop_old)
        operations.invoke(create_new)
    @classmethod
    def define(cls, name, cname=None, register=Operations.register_operation):
        def create(self):
            return CreateOp(self.target)

        def drop(self):
            return DropOp(self.target)

        name = name.lower()
        cname = cname or name.capitalize()

        CreateOp = type('Create%sOp' % cname, (ReversibleOp,), {'reverse': drop})
        DropOp = type('Drop%sOp' % cname, (ReversibleOp,), {'reverse': create})

        CreateOp = register('create_' + name, 'invoke_for_target')(CreateOp)
        CreateOp = register('replace_' + name, 'replace')(CreateOp)
        DropOp = register('drop_' + name, 'invoke_for_target')(DropOp)

        return (CreateOp, DropOp)


CreateViewOp, DropViewOp = ReversibleOp.define('view')
CreateRoleOp, DropRoleOp = ReversibleOp.define('role')
CreateSchemaOp, DropSchemaOp = ReversibleOp.define('schema')
CreateSPOp, DropSPOp = ReversibleOp.define('sp', 'SP')
CreateTriggerOp, DropTriggerOp = ReversibleOp.define('trigger')
CreatePolicyOp, DropPolicyOp = ReversibleOp.define('policy')
@Operations.implementation_for(CreateViewOp)
def create_view(operations, operation):
    operations.execute("CREATE VIEW %s AS %s" % (
        operation.target.name,
        operation.target.args[0]
    ))
@@ -98,11 +103,37 @@ def drop_view(operations, operation):
    operations.execute("DROP VIEW %s" % operation.target.name)
@Operations.implementation_for(CreateRoleOp)
def create_role(operations, operation):
    args = operation.target.args
    operations.execute(
        "CREATE ROLE %s WITH %s" % (
            operation.target.name,
            args[0] if len(args) else 'NOLOGIN'
        )
    )

@Operations.implementation_for(DropRoleOp)
def drop_role(operations, operation):
    operations.execute("DROP ROLE %s" % operation.target.name)

@Operations.implementation_for(CreateSchemaOp)
def create_schema(operations, operation):
    operations.execute("CREATE SCHEMA %s" % operation.target.name)

@Operations.implementation_for(DropSchemaOp)
def drop_schema(operations, operation):
    operations.execute("DROP SCHEMA %s" % operation.target.name)
@Operations.implementation_for(CreateSPOp)
def create_sp(operations, operation):
    operations.execute(
        "CREATE FUNCTION %s %s" % (
            operation.target.name, operation.target.args[0]
        )
    )
@@ -110,3 +141,44 @@ def create_sp(operations, operation):
@Operations.implementation_for(DropSPOp)
def drop_sp(operations, operation):
    operations.execute("DROP FUNCTION %s" % operation.target.name)
@Operations.implementation_for(CreateTriggerOp)
def create_trigger(operations, operation):
    args = operation.target.args
    operations.execute(
        "CREATE TRIGGER %s %s ON %s %s" % (
            operation.target.name, args[0], args[1], args[2]
        )
    )

@Operations.implementation_for(DropTriggerOp)
def drop_trigger(operations, operation):
    operations.execute(
        "DROP TRIGGER %s ON %s" % (
            operation.target.name,
            operation.target.args[1]
        )
    )

@Operations.implementation_for(CreatePolicyOp)
def create_policy(operations, operation):
    operations.execute(
        "CREATE POLICY %s ON %s %s" % (
            operation.target.name,
            operation.target.args[0],
            operation.target.args[1],
        )
    )

@Operations.implementation_for(DropPolicyOp)
def drop_policy(operations, operation):
    operations.execute(
        "DROP POLICY %s ON %s" % (
            operation.target.name,
            operation.target.args[0],
        )
    )
@@ -2,6 +2,7 @@ import os
from gargantext.settings import MEDIA_ROOT
from datetime import MINYEAR
from dateutil.parser import parse as parse_datetime_flexible
from django.utils.dateparse import parse_datetime
from django.utils.timezone import datetime as _datetime, utc as UTC, now as utcnow
@@ -19,7 +20,8 @@ class datetime(_datetime):
    @staticmethod
    def parse(s):
        dt = parse_datetime(s) or \
             parse_datetime_flexible(s, default=datetime(MINYEAR, 1, 1))
        return dt.astimezone(UTC) if dt.tzinfo else dt.replace(tzinfo=UTC)
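        # A sketch of the fallback behavior (hypothetical inputs):
        #   parse('2017-09-18 18:00:26')  # ISO-like, handled by parse_datetime
        #   parse('September 2017')       # loose format, handled by dateutil,
        #                                 # missing fields come from MINYEAR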
...
@@ -10,7 +10,7 @@ from sqlalchemy import delete
def get_engine():
    from sqlalchemy import create_engine
    return create_engine( settings.DATABASES['default']['SECRET_URL']
                        , use_native_hstore = True
                        , json_serializer = json_dumps
                        , pool_size=20, max_overflow=0
@@ -25,6 +25,7 @@ session = scoped_session(sessionmaker(bind=engine))
########################################################################
from sqlalchemy.orm import aliased
from sqlalchemy import func, desc
from sqlalchemy.sql.expression import case
########################################################################
# bulk insertions
...
@@ -14,17 +14,6 @@ _members = [
      'picture' : 'david.jpg',
      'role':'principal investigator'},
    { 'first_name' : 'Alexandre', 'last_name' : 'Delanoë',
      'mail' : 'alexandre+gargantextATdelanoe.org',
@@ -59,12 +48,19 @@ _membersPast = [
      'mail' : '',
      'website' : 'https://github.com/elishowk',
      'picture' : '', 'role' : 'developer'},
    { 'first_name' : 'Samuel', 'last_name' : 'Castillo J.',
      'mail' : 'kaisleanATgmail.com',
      'website' : 'http://www.pksm3.droppages.com',
      'picture' : 'samuel.jpg',
      'role' : 'developer'},
]

_institutions = [
    { 'name' : 'Institut Mines Telecom', 'website' : 'https://www.imt.fr', 'picture' : 'IMT.jpg', 'funds':''},
    { 'name' : 'EHESS', 'website' : 'http://www.ehess.fr', 'picture' : 'ehess.png', 'funds':''},
    { 'name' : 'Mines ParisTech', 'website' : 'http://mines-paristech.fr', 'picture' : 'mines.png', 'funds':''},
    #{ 'name' : '', 'website' : '', 'picture' : '', 'funds':''},
    # copy paste the line above and write your information please
]
@@ -76,10 +72,11 @@ _labs = [
]
_grants = [
    { 'name' : 'Institut Mines Telecom', 'website' : 'https://www.imt.fr', 'picture' : 'IMT.jpg', 'funds':''},
    { 'name' : 'Forccast', 'website' : 'http://forccast.hypotheses.org/', 'picture' : 'forccast.png', 'funds':''},
    { 'name' : 'Mastodons', 'website' : 'http://www.cnrs.fr/mi/spip.php?article53&lang=fr', 'picture' : 'mastodons.png', 'funds':''},
    { 'name' : 'ADEME', 'website' : 'http://www.ademe.fr', 'picture' : 'ademe.png', 'funds':''},
    { 'name' : 'Institut Pasteur', 'website' : 'http://www.pasteur.fr', 'picture' : 'pasteur.png', 'funds':''},
    { 'name' : 'Scoap 3', 'website' : 'https://scoap3.org/', 'picture' : 'cern.png', 'funds':''},
    #{ 'name' : '', 'website' : '', 'picture' : '', 'funds':''},
    # copy paste the line above and write your information please
]
...
@@ -8,7 +8,7 @@ Tools to work with ngramlists (MAINLIST, MAPLIST, STOPLIST)
"""
from gargantext.util.group_tools import query_groups, group_union
from gargantext.util.db import session, bulk_insert_ifnotexists, desc
from gargantext.models import Ngram, NodeNgram, NodeNodeNgram, \
                              NodeNgramNgram, Node
...
from ._Parser import Parser

import pandas
import io


class CSVParser(Parser):
    ENCODING = "utf-8"

    def open(self, file):
        f = super(CSVParser, self).open(file)

        if isinstance(file, str) and file.endswith('.zip'):
            return f

        return io.TextIOWrapper(f, encoding=self.ENCODING)

    def parse(self, fp=None):
        fp = fp or self._file
        df = pandas.read_csv(fp, dtype=object, engine='python',
                             skip_blank_lines=True, sep=None,
                             na_values=[], keep_default_na=False)

        # Return a generator of dictionaries with column labels as keys,
        # filtering out empty rows
        for i, fields in enumerate(df.itertuples(index=False)):
            if i % 500 == 0:
                print("CSV: parsing row #%s..." % (i+1))

            # See https://docs.python.org/3/library/collections.html#collections.somenamedtuple._asdict
            yield fields._asdict()
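
    # Note: engine='python' with sep=None lets pandas sniff the delimiter
    # (csv.Sniffer), replacing the old hand-rolled detection. Usage sketch
    # (the file name is hypothetical):
    #   for row in CSVParser('docs.csv').parse():
    #       print(row.get('title'))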
@@ -81,8 +81,9 @@ class EuropresseParser(Parser):
        # "./header/div/p[@class='titreArticleVisu grandTitre']"
        #
        # title_xpath (more generic path)
        title_xpath = "./header//*[contains(@class,'titreArticleVisu rdp__articletitle')]"
        authors_xpath = "./header//*[contains(@class,'docAuthors')]"
        text_xpath = "./section/div[@class='DocText clearfix']//p"
        entire_header_xpath = "./header"

        # diagnosed during date retrieval and used for rubrique
@@ -144,6 +145,15 @@ class EuropresseParser(Parser):
                yield(hyperdata)
                continue

            # Authors
            # --------
            try:
                authors = scrap_text(html_article.xpath(authors_xpath))
                hyperdata['authors'] = '; '.join([author for author in authors])
            except:
                pass

            # FULLTEXT
            # --------
@@ -154,6 +164,7 @@ class EuropresseParser(Parser):
            except:
                pass
            # PUBLICATIONNAME
            # ----------------
            try:
...
@@ -12,12 +12,12 @@ import json
class HalParser(Parser):

    def _parse(self, json_docs):
        hyperdata_list = []

        hyperdata_path = { "id"       : "docid"
                         , "title"    : ["en_title_s", "title_s"]
                         , "abstract" : ["en_abstract_s", "abstract_s"]
                         , "source"   : "journalTitle_s"
                         , "url"      : "uri_s"
                         , "authors"  : "authFullName_s"
@@ -29,8 +29,8 @@ class HalParser(Parser):
                         , "instStructId_i"  : "instStructId_i"
                         , "deptStructId_i"  : "deptStructId_i"
                         , "labStructId_i"   : "labStructId_i"
                         , "rteamStructId_i" : "rteamStructId_i"
                         , "docType_s"       : "docType_s"
                         }

        uris = set()
@@ -38,29 +38,32 @@ class HalParser(Parser):
        for doc in json_docs:
            hyperdata = {}

            for key, path in hyperdata_path.items():
                # A path can be a field name or a sequence of field names
                if isinstance(path, (list, tuple)):
                    # Get first non-empty value of fields in path sequence, or None
                    field = next((x for x in (doc.get(p) for p in path) if x), None)
                else:
                    # Get field value
                    field = doc.get(path)

                if field is None:
                    field = "NOT FOUND"

                if isinstance(field, list):
                    hyperdata[key] = ", ".join(map(str, field))
                else:
                    hyperdata[key] = str(field)

            if hyperdata["url"] in uris:
                print("Document already parsed")
            else:
                uris.add(hyperdata["url"])

            maybeDate = doc.get("submittedDate_s", None)

            if maybeDate is not None:
                date = datetime.strptime(maybeDate, "%Y-%m-%d %H:%M:%S")
            else:
@@ -70,9 +73,9 @@ class HalParser(Parser):
            hyperdata["publication_year"]  = str(date.year)
            hyperdata["publication_month"] = str(date.month)
            hyperdata["publication_day"]   = str(date.day)

            hyperdata_list.append(hyperdata)

        return hyperdata_list

    def parse(self, filebuf):
...
import re
from .RIS import RISParser
@@ -12,8 +14,39 @@ class ISIParser(RISParser):
        "DI": {"type": "hyperdata", "key": "doi"},
        "SO": {"type": "hyperdata", "key": "source"},
        "PY": {"type": "hyperdata", "key": "publication_year"},
        "PD": {"type": "hyperdata", "key": "publication_date_to_parse"},
        "LA": {"type": "hyperdata", "key": "language_fullname"},
        "AB": {"type": "hyperdata", "key": "abstract", "separator": " "},
        "WC": {"type": "hyperdata", "key": "fields"},
    }
    _year = re.compile(r'\b\d{4}\b')
    _season = re.compile(r'\b(SPR|SUM|FAL|WIN)\b', re.I)
    _month_interval = re.compile(r'\b([A-Z]{3})-([A-Z]{3})\b', re.I)
    _day_interval = re.compile(r'\b(\d{1,2})-(\d{1,2})\b')

    def _preprocess_PD(self, PD, PY):
        # Add a year to date if applicable
        if PY and self._year.search(PY) and not self._year.search(PD):
            PD = PY + " " + PD

        # Drop season if any
        PD = self._season.sub('', PD).strip()

        # If a month interval is present, keep only the first month
        PD = self._month_interval.sub(r'\1', PD)

        # If a day interval is present, keep only the first day
        PD = self._day_interval.sub(r'\1', PD)

        return PD

    def parse(self, file):
        PD = self._parameters["PD"]["key"]
        PY = self._parameters["PY"]["key"]

        for entry in super().parse(file):
            if PD in entry:
                entry[PD] = self._preprocess_PD(entry[PD], entry[PY])
            yield entry
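
    # Illustration of _preprocess_PD (hypothetical values):
    #   PD='SEP-OCT',  PY='2017' -> '2017 SEP'    (year added, interval cut)
    #   PD='FAL 2016', PY='2016' -> '2016'        (season dropped)
    #   PD='12-14 JUN 2015'      -> '12 JUN 2015' (day interval cut)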
@@ -14,12 +14,12 @@ class PubmedParser(Parser):
        "language_iso3"    : 'MedlineCitation/Article/Language',
        "doi"              : 'PubmedData/ArticleIdList/ArticleId[@type=doi]',
        "realdate_full_"   : 'MedlineCitation/Article/Journal/JournalIssue/PubDate/MedlineDate',
        "realdate_year_"   : 'PubmedData/History/PubMedPubDate/Year',
        "realdate_month_"  : 'PubmedData/History/PubMedPubDate/Month',
        "realdate_day_"    : 'PubmedData/History/PubMedPubDate/Day',
        "publication_year" : 'MedlineCitation/Article/ArticleDate/Year',
        "publication_month": 'MedlineCitation/Article/ArticleDate/Month',
        "publication_day"  : 'MedlineCitation/Article/ArticleDate/Day',
        "authors"          : 'MedlineCitation/Article/AuthorList',
    }
...
@@ -3,10 +3,7 @@ import zipfile
import re
import dateparser as date_parser
from gargantext.util.languages import languages
from gargantext.util import datetime, convert_to_datetime

class Parser:
@@ -14,15 +11,14 @@ class Parser:
    """
    def __init__(self, file):
        self._file = self.open(file)

    def __del__(self):
        if hasattr(self, '_file'):
            self._file.close()

    def open(self, file):
        return open(file, 'rb') if isinstance(file, str) else file
    def detect_encoding(self, string):
        """Useful method to detect the encoding of a document.
@@ -47,10 +43,7 @@ class Parser:
        if date_string is not None:
            date_string = re.sub(r'\/\/+(\w*|\d*)', '', date_string)
            try:
                hyperdata['publication_date'] = datetime.parse(date_string)
            except Exception as error:
                print(error, 'Date not parsed for:', date_string)
                hyperdata['publication_date'] = datetime.now()
@@ -93,6 +86,9 @@ class Parser:
            print("WARNING: Date unknown at _Parser level, using now()")
            hyperdata['publication_date'] = datetime.now()
        # XXX Handling prefixes is most likely useless: there seems to be only
        # one prefix which is "publication" (like in "publication_date").

        # ...then parse all the "date" fields, to parse it into separate elements
        prefixes = [key[:-5] for key in hyperdata.keys() if key[-5:] == "_date"]
        for prefix in prefixes:
@@ -165,11 +161,10 @@ class Parser:
        file = self._file
        # if the file is a ZIP archive, recurse on each of its files...
        if zipfile.is_zipfile(file):
            with zipfile.ZipFile(file) as zf:
                for filename in zf.namelist():
                    with zf.open(filename) as df, self.open(df) as f:
                        yield from self.__iter__(f)
        # ...otherwise, let's parse it directly!
        else:
            try:
...
@@ -61,12 +61,14 @@ def nodes(parent=None, group_by='typename', order_by='typename', has_child='chec
def node_show(node, prefix='', maxlen=60):
    if node.children > 0 or node.cnt == 1:
        node_id = '<{}> '.format(node.id)
        name = node.name[:maxlen] + '..' if len(node.name) > maxlen else node.name
        label = node_id + Fore.CYAN + name + Fore.RESET
    else:
        label = Fore.MAGENTA + str(node.cnt) + Fore.RESET

    typename = Fore.GREEN + node.typename + Fore.RESET
    print(prefix, '%s %s' % (typename, label), sep='')

def tree_show(node, pos=FIRST|LAST, level=0, prefix='', maxlen=60, compact=True):
...
@@ -6,7 +6,7 @@ from gargantext.settings import BASE_URL
drafts = {
    'workflowEnd' : '''
Bonjour,
votre analyse sur Gargantext vient de se terminer.
@@ -42,18 +42,33 @@ drafts = {
''',
'recountDone': '''
Bonjour,
le recalcul que vous avez lancé est terminé.
Vous pouvez accéder à votre corpus intitulé
\"%s\"
à l'adresse:
http://%s/projects/%d/corpora/%d
Nous restons à votre disposition pour tout complément d'information.
Cordialement
--
L'équipe de Gargantext (CNRS)
'''
}
def notification(corpus, draft, subject='Update'):
    user = session.query(User).filter(User.id == corpus.user_id).first()
    message = draft % (corpus.name, BASE_URL, corpus.parent_id, corpus.id)

    if user.email != "":
        send_mail('[Gargantext] %s' % subject
                 , message
                 , 'contact@gargantext.org'
                 , [user.email], fail_silently=False )
@@ -63,11 +78,12 @@ def notification(corpus, draft, subject='Update'):
def notify_owner(corpus):
    notification(corpus, drafts['workflowEnd'], 'Corpus updated')

def notify_listMerged(corpus):
    notification(corpus, drafts['listMerged'], 'List merged')
def notify_recount(corpus):
    notification(corpus, drafts['recountDone'], 'Recount done')
@@ -13,7 +13,7 @@ from .ngram_coocs import compute_coocs
#from .ngram_coocs_old_sqlalchemy_version import compute_coocs
from .metric_specgen import compute_specgen
from .list_map import do_maplist
from .mail_notification import notify_owner, notify_recount
from gargantext.util.db import session
from gargantext.models import Node
@@ -62,12 +62,12 @@ def parse_extract_indexhyperdata(corpus):
# apply actions # apply actions
print('CORPUS #%d' % (corpus.id)) print('CORPUS #%d' % (corpus.id))
corpus.status('Docs', progress=1) corpus.status('Docs', progress=1)
corpus.save_hyperdata() corpus.save_hyperdata()
session.commit() session.commit()
parse(corpus) parse(corpus)
docs = corpus.children("DOCUMENT").count() docs = corpus.children("DOCUMENT").count()
print('CORPUS #%d: parsed %d' % (corpus.id, docs)) print('CORPUS #%d: parsed %d' % (corpus.id, docs))
extract_ngrams(corpus) extract_ngrams(corpus)
...@@ -242,6 +242,19 @@ def recount(corpus_id): ...@@ -242,6 +242,19 @@ def recount(corpus_id):
corpus.save_hyperdata() corpus.save_hyperdata()
session.commit() session.commit()
# START OF KLUDGE...
from gargantext.models import NodeNgram, DocumentNode
from .ngrams_addition import index_new_ngrams
maplist_id = corpus.children("MAPLIST").first().id
ngram_ids = session.query(NodeNgram.ngram_id.distinct())
indexed_ngrams = ngram_ids.join(DocumentNode).filter(DocumentNode.parent_id==corpus.id)
not_indexed_ngrams = ngram_ids.filter(NodeNgram.node_id==maplist_id,
~NodeNgram.ngram_id.in_(indexed_ngrams))
not_indexed_ngrams = [x[0] for x in not_indexed_ngrams]
added = index_new_ngrams(not_indexed_ngrams, corpus)
print('RECOUNT #%d: [%s] indexed %s ngrams' % (corpus.id, t(), added))
# ...END OF KLUDGE
# -> overwrite occurrences (=> NodeNodeNgram) # -> overwrite occurrences (=> NodeNodeNgram)
occ_id = compute_occs(corpus, occ_id = compute_occs(corpus,
groupings_id = group_id, groupings_id = group_id,
...@@ -286,5 +299,10 @@ def recount(corpus_id): ...@@ -286,5 +299,10 @@ def recount(corpus_id):
corpus.save_hyperdata() corpus.save_hyperdata()
session.commit() session.commit()
if not DEBUG:
print('RECOUNT #%d: [%s] FINISHED Sending email notification' % (corpus.id, t()))
notify_recount(corpus)
def t(): def t():
return datetime.now().strftime("%Y-%m-%d_%H:%M:%S") return datetime.now().strftime("%Y-%m-%d_%H:%M:%S")
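Note: the kludge above amounts to a set difference: take the ngram ids present in the MAPLIST, subtract the ones already indexed in the corpus documents, and index the rest. Reduced to plain sets with hypothetical ids:

    indexed_ngrams = {1, 2, 3}      # ngram ids already linked to documents
    maplist_ngrams = {2, 3, 5, 8}   # ngram ids present in the MAPLIST
    not_indexed = sorted(maplist_ngrams - indexed_ngrams)
    print(not_indexed)              # [5, 8] -> would go to index_new_ngrams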
@@ -23,7 +23,7 @@ from datetime import datetime
 def t():
     return datetime.now().strftime("%Y-%m-%d_%H:%M:%S")

-def compute_occs(corpus, overwrite_id = None, groupings_id = None,):
+def compute_occs(corpus, overwrite_id = None, groupings_id = None, year=None, start=None, end=None, interactiv=False):
     """
     Calculates sum of occs per ngram (or per mainform if groups) within corpus
     (used as info in the ngrams table view)
@@ -61,6 +61,8 @@ def compute_occs(corpus, overwrite_id = None, groupings_id = None,):
             .group_by(NodeNgram.ngram_id)
            )
+        if year is not None:
+            occs_q = occs_q.filter(Node.hyperdata["publication_year"].astext == str(year))

     # difficult case: with groups
     # ------------
@@ -108,6 +110,10 @@ def compute_occs(corpus, overwrite_id = None, groupings_id = None,):
             # for the sum
             .group_by("counted_form")
            )
+        if year is not None:
+            occs_q = occs_q.filter(Node.hyperdata["publication_year"].astext == str(year))

     #print(str(occs_q.all()))
     occ_sums = occs_q.all()
@@ -134,13 +140,17 @@ def compute_occs(corpus, overwrite_id = None, groupings_id = None,):
     # £TODO make it NodeNgram instead NodeNodeNgram ! and rebase :/
     #       (idem ti_ranking)
-    bulk_insert(
-        NodeNodeNgram,
-        ('node1_id' , 'node2_id', 'ngram_id', 'score'),
-        ((the_id, corpus.id, res[0], res[1]) for res in occ_sums)
-    )
-    return the_id
+    if interactiv is False :
+        bulk_insert(
+            NodeNodeNgram,
+            ('node1_id' , 'node2_id', 'ngram_id', 'score'),
+            ((the_id, corpus.id, res[0], res[1]) for res in occ_sums)
+        )
+        return the_id
+    else :
+        return [(res[0], res[1]) for res in occ_sums]

 def compute_ti_ranking(corpus,
...
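Note: compute_occs now either persists the scores (default) or hands them back to the caller when interactiv=True, for use from a notebook. A database-free toy sketch of that dual-return pattern (names made up; persisted stands in for the NodeNodeNgram table):

    persisted = []

    def compute(occ_sums, interactiv=False):
        if interactiv is False:
            persisted.extend(occ_sums)   # stands in for bulk_insert(NodeNodeNgram, ...)
            return 'node-id'             # stands in for the_id
        else:
            return [(res[0], res[1]) for res in occ_sums]

    print(compute([(1, 3.0), (2, 1.5)]))          # 'node-id', rows "written"
    print(compute([(1, 3.0)], interactiv=True))   # [(1, 3.0)]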
@@ -20,6 +20,7 @@ def compute_coocs( corpus,
                    stoplist_id     = None,
                    start           = None,
                    end             = None,
+                   year            = None,
                    symmetry_filter = False,
                    diagonal_filter = True):
     """
@@ -97,14 +98,21 @@ def compute_coocs( corpus,
         WHERE
             n.typename  = {nodetype_id}
         AND n.parent_id = {corpus_id}
+    """.format( nodetype_id = NODETYPES.index('DOCUMENT')
+              , corpus_id=corpus.id
+              )
+    if year :
+        cooc_filter_sql += """
+        AND n.hyperdata -> 'publication_year' = '{year}'
+        """.format( year=str(year))
+    cooc_filter_sql += """
         GROUP BY 1,2
         -- ==
         -- GROUP BY ngA, ngB
        )
-    """.format( nodetype_id = NODETYPES.index('DOCUMENT')
-              , corpus_id=corpus.id
-              )
+    """

     # 3) taking the cooccurrences of ngram x2
     ngram_filter_A_sql += """
     -- STEP 1: X axis of the matrix
...
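Note: the filter is now assembled from three concatenated pieces so the optional year clause can slot in between. A standalone sketch of that assembly, with literal ids replacing NODETYPES.index('DOCUMENT') and corpus.id:

    year = 2015
    cooc_filter_sql = """
    WHERE
        n.typename  = {nodetype_id}
    AND n.parent_id = {corpus_id}
    """.format(nodetype_id=4, corpus_id=1234)
    if year:
        cooc_filter_sql += """
    AND n.hyperdata -> 'publication_year' = '{year}'
    """.format(year=str(year))
    cooc_filter_sql += """
    GROUP BY 1,2
    """
    print(cooc_filter_sql)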
@@ -56,18 +56,15 @@ def extract_ngrams(corpus, keys=DEFAULT_INDEX_FIELDS, do_subngrams = DEFAULT_IND
     tagger_bots = {lang: load_tagger(lang) for lang in corpus.hyperdata["languages"] \
                    if lang != "__unknown__"}
     tagger_bots["__unknown__"] = load_tagger("en")
-    # print("#TAGGERS LOADED: ", tagger_bots)
+    print("#TAGGERS LOADED: ", tagger_bots)
     supported_taggers_lang = tagger_bots.keys()
-    # print("#SUPPORTED TAGGER LANGS", supported_taggers_lang)
+    print("#SUPPORTED TAGGER LANGS", list(supported_taggers_lang))

     for documents_count, document in enumerate(corpus.children('DOCUMENT')):
         #load only the docs that have passed the parsing without error
         if document.id not in corpus.hyperdata["skipped_docs"]:
-            if 'language_iso2' in document.hyperdata:
-                language_iso2 = document.hyperdata['language_iso2']
-            else:
-                language_iso2 = "__unknown__"
+            language_iso2 = document.hyperdata.get('language_iso2', '__unknown__')

             # debug
             # print(language_iso2)
...
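Note: the four-line membership test collapses to dict.get with a default, which reads the key and falls back in one step:

    hyperdata = {'title': 'a doc parsed without language detection'}
    language_iso2 = hyperdata.get('language_iso2', '__unknown__')
    print(language_iso2)   # __unknown__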
@@ -5,6 +5,7 @@ from collections import defaultdict
 from gargantext.util.toolchain import *
 import copy
 from gargantext.util.db import session
+from gargantext.models import UserNode

 class ProjectList(APIView):
     '''API endpoint that represents a list of projects owned by a user'''
@@ -36,10 +37,16 @@ class ProjectList(APIView):
             return Response({"detail":"Project with this name already exists", "url":"/projects/%s" %str(project.id)}, status = HTTP_409_CONFLICT)
         else:
+            user_node = session.query(UserNode).filter_by(user_id=request.user.id).one_or_none()
+            if user_node is None:
+                print("??? Can't find UserNode for %r to create ProjectNode with name %r ???" % (request.user, name))
             new_project = Node(
                 user_id = request.user.id,
                 typename = 'PROJECT',
                 name = name,
+                parent_id = user_node and user_node.id,
             )
             session.add(new_project)
...
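Note: `user_node and user_node.id` relies on short-circuit evaluation: `and` returns its left operand when that is falsy (None here), otherwise its right operand, so project creation still works when no UserNode is found. A minimal illustration with a hypothetical stand-in class:

    class FakeUserNode:          # stand-in for the UserNode model
        id = 7

    for user_node in (FakeUserNode(), None):
        print(user_node and user_node.id)   # 7, then None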
@@ -230,6 +230,7 @@ def countCooccurrences( corpus_id=None , cooc_id=None
         session.commit()
         #data = cooc2graph(coocNode.id, cooc, distance=distance, bridgeness=bridgeness)
-        #return data
+    else:
+        return cooc

     return(coocNode.id, cooc)
@@ -70,15 +70,12 @@ drafts = {
     Your feedback will be valuable for further development
     of the platform, do not hesitate to contact us and
-    to contribute!
+    to contribute on our forum:
+    https://discourse.iscpif.fr/c/gargantext

-    If you want to access your old corpuses,
-    access codes remain valid until June 30
-    2017 midnight:
-
-    http://old.gargantext.org
-
-    We remain at your disposal for any further information.
+    We remain at your disposal for any further information and you will be
+    kept updated if you subscribe to our mailing-list:
+    https://phplist.iscpif.fr/?p=subscribe&id=4

     With our best regards,
     --
@@ -127,15 +124,14 @@ drafts = {
     Vos retours seront précieux pour poursuivre le développement
     de la plateforme, n'hésitez pas à nous contacter et
-    contribuer!
+    contribuer sur notre forum:
+    https://discourse.iscpif.fr/c/gargantext

-    Si vous souhaitez accéder à vos anciens corpus, vos anciens
-    codes d'accès restent valides à cette adresse jusqu'au 30 juin
-    2017 minuit:
-
-    http://old.gargantext.org
-
-    Nous restons à votre disposition pour tout complément d'information.
+    Nous restons à votre disposition pour tout complément
+    d'information et vous serez informés de l'évolution des
+    services en vous inscrivant à la liste:
+    https://phplist.iscpif.fr/?p=subscribe&id=4

     Cordialement
     --
     L'équipe de Gargantext (CNRS)

@@ -195,9 +191,14 @@ drafts = {
     Vos retours seront précieux pour poursuivre le développement
     de la plateforme, n'hésitez pas à nous contacter et
-    contribuer!
+    contribuer sur notre forum:
+    https://discourse.iscpif.fr/c/gargantext
+
+    Nous restons à votre disposition pour tout complément
+    d'information et vous serez informés de l'évolution des
+    services en vous inscrivant à la liste:
+    https://phplist.iscpif.fr/?p=subscribe&id=4

-    Nous restons à votre disposition pour tout complément d'information.

     Cordialement
     --
     L'équipe de Gargantext (CNRS)
...
@@ -28,11 +28,13 @@ RandomWords==0.1.12
 ujson==1.35
 umalqurra==0.2 # arabic calendars (?? why use ??)
 networkx==1.11
-pandas==0.18.0
+pandas==0.21.0
 six==1.10.0
 lxml==3.5.0
 requests-futures==0.9.7
 bs4==0.0.1
 requests==2.10.0
 alembic>=0.9.2
-# SQLAlchemy-Searchable==0.10.4
+SQLAlchemy==1.1.14
+SQLAlchemy-Searchable==0.10.4
+SQLAlchemy-Utils==0.32.16
@@ -106,17 +106,17 @@ RUN apt-get update && apt-get install -y \
         libblas-dev \
         liblapack-dev

-USER notebooks
-
-RUN cd /home/notebooks \
-    && curl -sSL https://get.haskellstack.org/ | sh \
-    && stack setup \
-    && git clone https://github.com/gibiansky/IHaskell \
-    && . /env_3-5/bin/activate \
-    && cd IHaskell \
-    && stack install gtk2hs-buildtools \
-    && stack install --fast \
-    && /root/.local/bin/ihaskell install --stack
+#USER notebooks
+#
+#RUN cd /home/notebooks \
+#    && curl -sSL https://get.haskellstack.org/ | sh \
+#    && stack setup \
+#    && git clone https://github.com/gibiansky/IHaskell \
+#    && . /env_3-5/bin/activate \
+#    && cd IHaskell \
+#    && stack install gtk2hs-buildtools \
+#    && stack install --fast \
+#    && /root/.local/bin/ihaskell install --stack
+#
@@ -15,12 +15,17 @@ os.environ.setdefault('DJANGO_SETTINGS_MODULE', 'gargantext.settings')
 django.setup()

 from gargantext.constants import QUERY_SIZE_N_MAX, get_resource, get_resource_by_name
-from gargantext.models import ProjectNode, DocumentNode, UserNode, User
-from gargantext.util.db import session, get_engine
+from gargantext.models import (Node, ProjectNode, DocumentNode,
+                               Ngram, NodeNgram, NodeNgramNgram, NodeNodeNgram)
+from gargantext.util.db import session, get_engine, func, aliased, case
 from collections import Counter
 import importlib
 from django.http import Http404

+# Import those to be available by notebook user
+from langdetect import detect as detect_lang
+from gargantext.models import UserNode, User
+import functools

 class NotebookError(Exception):
     pass
@@ -35,8 +40,11 @@ def documents(corpus_id):
 #import seaborn as sns
 import pandas as pd

+def countByField(docs, field):
+    return list(Counter([doc.hyperdata[field] for doc in docs]).items())

 def chart(docs, field):
-    year_publis = list(Counter([doc.hyperdata[field] for doc in docs]).items())
+    year_publis = countByField(docs, field)
     frame0 = pd.DataFrame(year_publis, columns=['Date', 'DateValue'])
     frame1 = pd.DataFrame(year_publis, columns=['Date', 'DateValue'], index=frame0.Date)
     return frame1
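Note: countByField is just a Counter over one hyperdata field, which chart() now reuses. A database-free sketch with fake documents (the .hyperdata shape mimics DocumentNode; the class is made up):

    from collections import Counter
    import pandas as pd

    docs = [type('Doc', (), {'hyperdata': {'publication_year': y}})()
            for y in (2014, 2015, 2015, 2016)]

    def countByField(docs, field):
        return list(Counter([doc.hyperdata[field] for doc in docs]).items())

    year_publis = countByField(docs, 'publication_year')
    print(pd.DataFrame(year_publis, columns=['Date', 'DateValue']))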
@@ -49,15 +57,35 @@ def scan_hal(request):
     return hal.scan_results(request)

-def scan_gargantext(corpus_id, lang, request):
-    connection = get_engine().connect()
-    # TODO add some sugar the request (ideally request should be the same for hal and garg)
-    query = """select count(n.id) from nodes n
-               where to_tsvector('%s', hyperdata ->> 'abstract' || 'title')
-               @@ to_tsquery('%s')
-               AND n.parent_id = %s;""" % (lang, request, corpus_id)
-    return [i for i in connection.execute(query)][0][0]
-    connection.close()
+def _search_docs(corpus_id, request, fast=False):
+    q = session.query(DocumentNode).filter_by(parent_id=corpus_id)
+
+    # Search ngram <request> in hyperdata <field>
+    H = lambda field, request: Node.hyperdata[field].astext.op('~*')(request)
+
+    if not fast:
+        # Only match <request> starting and ending with word boundary
+        # Sequence of spaces will match any sequence of spaces
+        request = '\s+'.join(filter(None, r'\m{}\M'.format(request).split(' ')))
+
+    return q.filter(Node.title_abstract.match(request)) if fast else \
+           q.filter(H('title', request) | H('abstract', request))
+
+def scan_gargantext(corpus_id, request, fast=False, documents=False):
+    query = _search_docs(corpus_id, request, fast)
+
+    if documents:
+        return query.all()
+
+    return query.with_entities(func.count(DocumentNode.id.distinct())).one()[0]
+
+def scan_gargantext_and_delete(corpus_id, request, fast=False):
+    r = _search_docs(corpus_id, request, fast).delete(synchronize_session='fetch')
+    session.commit()
+
+    return r

 def myProject_fromUrl(url):
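Note: in the slow path (fast=False) the request is rewritten with PostgreSQL word-boundary anchors (\m, \M), and any run of spaces in the request is allowed to match any run of spaces in the text. The rewrite itself is plain Python and can be tested standalone:

    request = 'neural  networks'   # note the double space
    pattern = r'\s+'.join(filter(None, r'\m{}\M'.format(request).split(' ')))
    print(pattern)                 # \mneural\s+networks\M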
@@ -179,3 +207,80 @@ def run_moissonneur(moissonneur, project, name, query):
         session.commit()

     return corpus

+ALL_LIST_TYPES = ['main', 'map', 'stop']
+
+def _ngrams(corpus_id, list_types, entities):
+    list_types = (list_types,) if isinstance(list_types, str) else list_types
+    list_typenames = [
+        '{}LIST'.format(t.upper()) for t in list_types if t in ALL_LIST_TYPES]
+
+    # `Node` is our list, ie. MAINLIST and/or MAPLIST and/or STOPLIST
+    return (session.query(*entities)
+                   .select_from(Ngram)
+                   .filter(NodeNgram.ngram_id==Ngram.id,
+                           NodeNgram.node_id==Node.id,
+                           Node.parent_id==corpus_id,
+                           Node.typename.in_(list_typenames)))
+
+def corpus_list(corpus_id, list_types=ALL_LIST_TYPES, with_synonyms=False,
+                with_count=False):
+    # Link between a GROUPLIST, a normal form (ngram1), and a synonym (ngram2)
+    NNN = NodeNgramNgram
+
+    # Get the list type from the Node type -- as in CSV export
+    list_type = (case([(Node.typename=='MAINLIST', 'main'),
+                       (Node.typename=='MAPLIST', 'map'),
+                       (Node.typename=='STOPLIST', 'stop')])
+                 .label('type'))
+
+    # We will retrieve each ngram as the following tuple:
+    entities = (list_type, Ngram.terms.label('ng'))
+
+    if with_count:
+        entities += (Ngram.id.label('id'),)
+
+    # First, get ngrams from wanted lists
+    ngrams = _ngrams(corpus_id, list_types, entities)
+
+    # Secondly, exclude "synonyms" (grouped ngrams that are not normal forms).
+    # We have to exclude synonyms first because data is inconsistent and some
+    # of them can be both in GROUPLIST and in MAIN/MAP/STOP lists. We want to
+    # take synonyms from GROUPLIST only -- see below.
+    Groups = aliased(Node, name='groups')
+    query = (ngrams.outerjoin(Groups, (Groups.parent_id==corpus_id) & (Groups.typename=='GROUPLIST'))
+                   .outerjoin(NNN, (NNN.node_id==Groups.id) & (NNN.ngram2_id==Ngram.id))
+                   .filter(NNN.ngram1_id==None))
+
+    # If `with_synonyms` is True, add them from GROUPLIST: this is the reliable
+    # source for them
+    if with_synonyms:
+        Synonym = aliased(Ngram)
+        ent = (list_type, Synonym.terms.label('ng'), Synonym.id.label('id'))
+        synonyms = (ngrams.with_entities(*ent)
+                          .filter(NNN.ngram1_id==Ngram.id,
+                                  NNN.ngram2_id==Synonym.id,
+                                  NNN.node_id==Groups.id,
+                                  Groups.parent_id==corpus_id,
+                                  Groups.typename=='GROUPLIST'))
+        query = query.union(synonyms)
+
+    # Again, data is inconsistent: MAINLIST may intersect with MAPLIST and
+    # we don't want that
+    if 'main' in list_types and 'map' not in list_types:
+        # Exclude MAPLIST ngrams from MAINLIST
+        query = query.except_(_ngrams(corpus_id, 'map', entities))
+
+    if with_count:
+        N = query.subquery()
+        return (session.query(N.c.type, N.c.ng, NodeNodeNgram.score)
+                       .join(Node, (Node.parent_id==corpus_id) & (Node.typename=='OCCURRENCES'))
+                       .outerjoin(NodeNodeNgram, (NodeNodeNgram.ngram_id==N.c.id) &
+                                                 (NodeNodeNgram.node1_id==Node.id) &
+                                                 (NodeNodeNgram.node2_id==corpus_id)))
+
+    # Return found ngrams sorted by list type, and then alphabetically
+    return query.order_by('type', 'ng')
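Note: a hedged usage sketch of these notebook helpers. It assumes a configured gargantext session and an existing corpus; the corpus id is made up, and none of this runs outside the notebook environment:

    corpus_id = 123                                   # hypothetical corpus

    # Count documents matching a term, using the fast tsvector index
    n_docs = scan_gargantext(corpus_id, 'climate', fast=True)

    # List MAPLIST terms, with their GROUPLIST synonyms included
    for row in corpus_list(corpus_id, 'map', with_synonyms=True):
        print(row.type, row.ng)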
@@ -42,3 +42,7 @@ ipython==5.2.0
 ipython-genutils==0.1.0
 ipywidgets
 matplotlib==2.0.2
+alembic>=0.9.2
+SQLAlchemy==1.1.14
+SQLAlchemy-Searchable==0.10.4
+SQLAlchemy-Utils==0.32.16
This source diff could not be displayed because it is too large.
This diff is collapsed.
 #!/bin/bash
-FILE="/var/log/gargantext/uwsgi/$(date +%Y%m%d-%H:%M:%S).log"
-#touch /var/log/gargantext/uwsgi/$FILE &&
-sudo uwsgi gargantext.ini --logto $FILE
-echo "To reload UWSGI: touch /tmp/gargantext.reload"
+# Script to start uwsgi
+sudo uwsgi /srv/gargantext/gargantext.ini
This diff is collapsed.
@@ -30,7 +30,7 @@
     </a>

-    <a class="btn btn-success btn-lg" target="blank" href="https://iscpif.fr/gargantext/your-first-map/" title="Fill the form to sign up">
+    <a class="btn btn-success btn-lg" target="blank" href="https://iscpif.fr/gargantext/" title="Fill the form to sign up">
         <span class="glyphicon glyphicon-hand-right" aria-hidden="true"></span>
         Documentation
     </a>
...
@@ -368,7 +368,7 @@
 <p>
     Gargantext
     <span class="glyphicon glyphicon-registration-mark" aria-hidden="true"></span>
-    , version 3.0.7,
+    , version 3.0.8.1,
     <a href="http://www.cnrs.fr" target="blank" title="Institution that enables this project.">
         Copyrights
         <span class="glyphicon glyphicon-copyright-mark" aria-hidden="true"></span>
...
@@ -203,6 +203,7 @@
         // do something…
         resetStatusForm("#createForm");
     })
+    return false;
 })
...
@@ -1170,10 +1170,10 @@
     // REST and callback
     garganrest.metrics.update(corpusId, function(){
-        statusDiv.innerHTML = '<div class="statusinfo">Corpus updated</div>'
+        statusDiv.innerHTML = '<div class="statusinfo">Recount has started, please wait: you will receive a notification email.</div>'
         // revert visual
-        setTimeout(function(){ statusDiv.innerHTML = previousStatus }, 2000);
+        //setTimeout(function(){ statusDiv.innerHTML = previousStatus }, 2000);
     })
 }
...