
#resources

Adding a new source to Gargantext requires first declaring the source in constants.py:

RESOURCETYPES = [
    {
        "type": 9,  # unique integer identifying the resource type
        "name": 'SCOAP [XML]',  # resource name as shown in the "add corpus" form [generic format]
        "parser": "CernParser",  # name of the new parser class inside a CERN.py file (set to None if not implemented)
        "format": 'MARC21',  # specific format
        'file_formats': ["zip", "xml"],  # accepted file formats
        "crawler": "CernCrawler",  # name of the new crawler class inside a CERN.py file (set to None if no crawler is implemented)
        'default_languages': ['en', 'fr'],  # default languages supported by the source
    },
    ...
]
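
As a rough illustration of how such a declaration might be consumed, the sketch below looks up an entry by its type integer and checks that type values are unique. Both helpers, `get_resource` and `check_unique_types`, are hypothetical names introduced here and are not taken from Gargantext; the list is a trimmed copy of the declaration above.

```python
# Trimmed copy of the declaration above, repeated so the sketch is self-contained.
RESOURCETYPES = [
    {
        "type": 9,
        "name": "SCOAP [XML]",
        "parser": "CernParser",
        "format": "MARC21",
        "file_formats": ["zip", "xml"],
        "crawler": "CernCrawler",
        "default_languages": ["en", "fr"],
    },
]


def get_resource(resource_type):
    """Hypothetical helper: return the declaration whose "type" matches, or None."""
    for resource in RESOURCETYPES:
        if resource["type"] == resource_type:
            return resource
    return None


def check_unique_types(resources):
    """Hypothetical check: raise ValueError if two declarations share a "type"."""
    seen = set()
    for resource in resources:
        if resource["type"] in seen:
            raise ValueError("duplicate resource type: %r" % resource["type"])
        seen.add(resource["type"])
```

Running a check like `check_unique_types(RESOURCETYPES)` after editing constants.py is one way to catch a reused type integer early.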

adding a new parser

Once you have declared your new parser inside constants.py, add your new parser file into /srv/gargantext/utils/parsers/ following this naming convention:

  • The filename must be in uppercase, without the Parser suffix, e.g. MailParser => MAIL.py
  • Inside this file, the parser class must be named exactly as declared under "parser" in constants.py
  • Your new parser must inherit from the base class Parser and provide a parse(filebuffer) method
  #!/usr/bin/env python3
  # filename: /srv/gargantext/util/parser/MAIL.py
  from ._Parser import Parser

  class MailParser(Parser):
      def parse(self, filebuffer):
          ...
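
The naming convention and the parse contract can be sketched as follows. This is a self-contained illustration under stated assumptions: the `Parser` class here is a stand-in for the real base class imported from `._Parser`, the helper `module_filename` is hypothetical, and the idea that `parse` yields one dictionary per document is an assumption rather than something this page specifies.

```python
import io


class Parser:
    """Stand-in for the real base class in _Parser.py (assumption, illustration only)."""


def module_filename(class_name, suffix="Parser"):
    """Hypothetical helper: MailParser -> MAIL.py, per the naming convention."""
    if not class_name.endswith(suffix) or class_name == suffix:
        raise ValueError("%r does not end with %r" % (class_name, suffix))
    return class_name[:-len(suffix)].upper() + ".py"


class MailParser(Parser):
    def parse(self, filebuffer):
        # Assumed contract: yield one dict per document found in the buffer.
        # Here every non-empty line is treated as one document title.
        for line in filebuffer:
            title = line.strip()
            if title:
                yield {"title": title}


# Usage: parse an in-memory buffer instead of a real uploaded file.
documents = list(MailParser().parse(io.StringIO("first mail\n\nsecond mail\n")))
```

The same helper covers the crawler convention when called with `suffix="Crawler"`.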

adding a new crawler

Once you have declared your new crawler inside constants.py, add your new crawler file into /srv/gargantext/utils/parsers/ following this naming convention:

  • The filename must be in uppercase, without the Crawler suffix, e.g. MailCrawler => MAIL.py
  • Inside this file, the crawler class must be named exactly as declared under "crawler" in constants.py
  • Your new crawler must inherit from the base class Crawler and provide three methods:
    • scan_results => ids
    • sample => yes/no
    • fetch
  #!/usr/bin/env python3
  # filename: /srv/gargantext/util/crawler/MAIL.py
  from ._Crawler import Crawler

  class MailCrawler(Crawler):
      def scan_results(self, query):
          ...
          self.ids = set()

      def sample(self, results_nb):
          ...

      def fetch(self, ids):
          ...
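
To make the three-method contract more concrete, here is a minimal runnable sketch. It rests on several assumptions: the `Crawler` class is a stand-in for the real base class in `._Crawler`, `FAKE_MAILBOX` is an invented in-memory source replacing any network access, and the reading of `sample` as "can this many results be served, yes or no" is a guess at what the yes/no above means.

```python
class Crawler:
    """Stand-in for the real base class in _Crawler.py (assumption, illustration only)."""


# Invented in-memory "remote source" so the sketch runs without network access.
FAKE_MAILBOX = {
    1: "budget meeting notes",
    2: "holiday schedule",
    3: "meeting room booking",
}


class MailCrawler(Crawler):
    def scan_results(self, query):
        # Collect the ids of all records whose text contains the query.
        self.ids = {i for i, text in FAKE_MAILBOX.items() if query in text}
        return self.ids

    def sample(self, results_nb):
        # Assumed meaning of "yes/no": can results_nb records be served?
        return results_nb <= len(self.ids)

    def fetch(self, ids):
        # Return the records for the given ids, in a stable order.
        return [FAKE_MAILBOX[i] for i in sorted(ids)]


# Usage: scan first, then decide whether to fetch.
crawler = MailCrawler()
found = crawler.scan_results("meeting")
```

Calling `scan_results` before `sample` matters in this sketch, since `sample` relies on `self.ids` having been set.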