Unverified Commit 47b313d5 authored by Neel Kamath, committed by GitHub

Fix #21 (#22)

* Conditionally enable sense2vec for performance improvements

* Disable sense2vec in unrelated pipeline uses

* Test HTTP exceptions

* Update Docker image tagging convention

* Conditionally disable sense2vec
parent ca48e810
@@ -2,7 +2,7 @@
[![Built with spaCy](https://img.shields.io/badge/built%20with-spaCy-09a3d5.svg)](https://spacy.io)
-This project provides industrial-strength NLP for multiple languages via [spaCy](https://spacy.io/) and [sense2vec](https://github.com/explosion/sense2vec) over a containerized HTTP API.
+This project provides industrial-strength NLP via [spaCy](https://spacy.io/) and [sense2vec](https://github.com/explosion/sense2vec) over a containerized HTTP API.
## Installation
@@ -10,18 +10,11 @@
Install [Docker](https://hub.docker.com/search/?type=edition&offering=community).
You can find specific tags (e.g., for a French model) on the [Docker Hub repository](https://hub.docker.com/repository/docker/neelkamath/spacy-server/tags?page=1).
For example, to run an English model at `http://localhost:8000`, run:
```
docker run --rm -p 8000:8000 neelkamath/spacy-server:2-en_core_web_sm
```
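For a quick smoke test once the container is up, something like the following should work (a sketch; `/health_check` and `/ner` are endpoints defined in the OpenAPI spec):
```
curl http://localhost:8000/health_check
curl -X POST http://localhost:8000/ner \
    -H 'Content-Type: application/json' \
    -d '{"sections": ["Google is a big company."]}'
```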
### Generating an SDK
You can generate a wrapper for the HTTP API using [OpenAPI Generator](https://openapi-generator.tech/) on the file [`https://raw.githubusercontent.com/neelkamath/spacy-server/master/docs/openapi.yaml`](https://raw.githubusercontent.com/neelkamath/spacy-server/master/docs/openapi.yaml).
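For example, a Python client could be generated like so (a sketch assuming the npm distribution of OpenAPI Generator; any installation method works):
```
npm install -g @openapitools/openapi-generator-cli
openapi-generator-cli generate \
    -i https://raw.githubusercontent.com/neelkamath/spacy-server/master/docs/openapi.yaml \
    -g python -o spacy-server-client
```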
-## [Usage](https://neelkamath.gitlab.io/spacy-server/)
+## [Usage](https://hub.docker.com/r/neelkamath/spacy-server)
## [Contributing](docs/CONTRIBUTING.md)
-FROM python:3.8
+FROM python:3.8 AS base
WORKDIR /app
-ENV PYTHONUNBUFFERED 1
+ARG SPACY_MODEL
+ENV PYTHONUNBUFFERED=1 SENSE2VEC=0 SPACY_MODEL=$SPACY_MODEL
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
-ARG SPACY_MODEL
RUN python -m spacy download $SPACY_MODEL
COPY src/main.py .
-COPY src/s2v_old/ s2v_old/
EXPOSE 8000
HEALTHCHECK --timeout=2s --start-period=2s --retries=1 \
CMD curl -f http://localhost:8000/health_check
RUN useradd user
USER user
CMD ["uvicorn", "main:app", "--host", "0.0.0.0"]
+FROM base
+ENV SENSE2VEC 1
+COPY src/s2v_old/ src/s2v_old/
\ No newline at end of file
version: '3.7'
services:
  app:
-    command: sh scripts/setup.sh 'uvicorn src.main:app --host 0.0.0.0 --reload'
+    command: sh -c '. scripts/setup.sh && uvicorn src.main:app --host 0.0.0.0 --reload'
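+    # setup.sh is sourced (not run as a child script) so the virtual environment it activates stays active for uvicorn.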
    ports: ['8000:8000']
    environment:
      SPACY_MODEL:
+      SENSE2VEC:
\ No newline at end of file
version: '3.7'
services:
  app:
-    command: sh scripts/setup.sh pytest
+    command: sh -c '. scripts/setup.sh && pytest'
    environment:
+      SENSE2VEC: 1
# Since any model will do, tests have been written only for the en_core_web_sm model because of its combination of
# speed, features, and accuracy.
-# It is not possible to use a Docker volume to cache the dependencies because subsequent usage of the volume
-# occasionally gets corrupted for an unknown reason. Hence, a virtual environment is to be used instead. It is known
-# that virtual environments aren't needed in Docker because isolation is already provided; we use it as a cache instead.
+# A virtual environment caches dependencies instead of a Docker volume because the volume randomly gets corrupted.
version: '3.7'
services:
  app:
    image: python:3.8
    working_dir: /app
    environment:
      SPACY_MODEL:
    volumes:
      - type: bind
        source: .
@@ -9,6 +9,6 @@
1. Clone the repository using one of the following methods.
- SSH: `git clone git@github.com:neelkamath/spacy-server.git`
- HTTPS: `git clone https://github.com/neelkamath/spacy-server.git`
-1. Download the [pretrained vectors](https://github.com/explosion/sense2vec/releases/download/v1.0.0/s2v_reddit_2015_md.tar.gz). Extract it into the project's `src` directory.
+1. If you are not going to use sense2vec, skip this step. Otherwise, download the [pretrained vectors](https://github.com/explosion/sense2vec/releases/download/v1.0.0/s2v_reddit_2015_md.tar.gz) and extract the archive into the project's `src` directory, for example as shown below.
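    One way to do this (a sketch using curl and tar; it assumes the archive's root directory is `s2v_old`, the path the code expects):

    ```
    curl -LO https://github.com/explosion/sense2vec/releases/download/v1.0.0/s2v_reddit_2015_md.tar.gz
    tar -xzf s2v_reddit_2015_md.tar.gz -C src
    ```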
## [Developing](developing.md)
\ No newline at end of file
@@ -2,12 +2,12 @@
## Server
-Replace `<MODEL>` with the name of the [spaCy model](https://spacy.io/models) (e.g., `en_core_web_sm`, `fr_core_news_md`). The model must be compatible with the spaCy version specified in [requirements.txt](../requirements.txt).
+Replace `<MODEL>` with the name of the [spaCy model](https://spacy.io/models) (e.g., `en_core_web_sm`, `fr_core_news_md`). The model must be compatible with the spaCy version specified in [requirements.txt](../requirements.txt). Replace `<ENABLED>` with `1` or `0` to enable or disable sense2vec, respectively.
### Development
```
-SPACY_MODEL=<MODEL> docker-compose -p dev --project-directory . \
+SPACY_MODEL=<MODEL> SENSE2VEC=<ENABLED> docker-compose -p dev --project-directory . \
-f docker/docker-compose.yml -f docker/docker-compose.override.yml up --build
```
@@ -15,19 +15,29 @@
### Testing
-```
-docker-compose -p test --project-directory . -f docker/docker-compose.yml -f docker/docker-compose.test.yml \
-    up --build --abort-on-container-exit --exit-code-from app
-```
+- For noninteractive environments (e.g., CI pipelines), you can run all the tests with a single command:
+    ```
+    docker-compose -p test --project-directory . -f docker/docker-compose.yml -f docker/docker-compose.test.yml \
+        up --build --abort-on-container-exit --exit-code-from app
+    ```
+- For faster iteration (e.g., while developing), you can run the tests interactively. Changes to the source code are automatically mirrored in the container.
+    1. Run:
+        ```
+        docker-compose -p test --project-directory . -f docker/docker-compose.yml -f docker/docker-compose.test.yml \
+            run --service-ports app bash
+        ```
+    1. Run `. scripts/setup.sh` (rerun this every time you update `requirements.txt`).
+    1. Run the tests as many times as you want (e.g., `pytest`).
+    1. Once you're done testing, exit the container by running `exit`.
### Production
```
-docker build --build-arg SPACY_MODEL=<MODEL> -t spacy-server -f docker/Dockerfile .
-docker run --rm -e SPACY_MODEL=<MODEL> -p 8000:8000 spacy-server
+docker build <TARGET> --build-arg SPACY_MODEL=<MODEL> -t spacy-server -f docker/Dockerfile .
```
+Replace `<TARGET>` with `--target base` if you want to disable sense2vec, or with an empty string otherwise.
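For instance, to build an English image both ways (mirroring the two builds the Docker Hub upload script performs):
```
# Without sense2vec (smaller image, faster responses):
docker build --target base --build-arg SPACY_MODEL=en_core_web_sm -t spacy-server -f docker/Dockerfile .
# With sense2vec bundled:
docker build --build-arg SPACY_MODEL=en_core_web_sm -t spacy-server -f docker/Dockerfile .
```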
-The container `EXPOSE`s port `8000`.
+The container `EXPOSE`s port `8000`. Run using `docker run --rm -p 8000:8000 spacy-server`.
## Specification
@@ -61,7 +71,10 @@ Open `redoc-static.html` in your browser.
## Releases
-- Create a GitHub release (this will automatically create the git tag). If you bumped the version in `docs/openapi.yaml`, then create a new release. If you haven't bumped the version but have updated the HTTP API's functionality, delete the existing GitHub release and git tag, and create a new one. Otherwise, skip this step. The release's title should be the features included (e.g., `NER, POS tagging, sentencizer, tokenizer, and sense2vec`). The tag should be the HTTP API's version (e.g., `v1`). The release's body should be ```Download and open the release asset, `redoc-static.html`, in your browser to view the HTTP API documentation.```. Upload the asset named `redoc-static.html` which contains the HTTP API docs.
+- If you haven't updated the HTTP API's functionality, skip this step. Otherwise:
+    1. If you haven't bumped the version in the OpenAPI spec, delete the corresponding GitHub release and git tag.
+    1. Generate `redoc-static.html`: `npx redoc-cli bundle docs/openapi.yaml -o redoc-static.html --title 'spaCy Server'`
+    1. Create a GitHub release. The release's body should be ```Download and open the release asset, `redoc-static.html`, in your browser to view the HTTP API documentation.```. Upload `redoc-static.html` as an asset.
- If required, update the [Docker Hub repository](https://hub.docker.com/r/neelkamath/spacy-server)'s **Overview**.
- For every commit to the `master` branch in which the tests have passed, the following will automatically be done.
- The new images will be uploaded to Docker Hub.
openapi: 3.0.2
info:
  title: spaCy Server
-  version: '1'
+  version: '2'
  description: |
    Industrial-strength NLP via [spaCy](https://spacy.io) and [sense2vec](https://github.com/explosion/sense2vec). No
    knowledge of spaCy or sense2vec is required to use this service.
@@ -26,8 +26,8 @@ paths:
  /ner:
    post:
      tags: [nlp]
-      description: Named entity recognition. Similar phrases will also be provided via sense2vec. The pretrained model
-        must have the `ner` and `parser` pipeline components to use this endpoint.
+      description: Named entity recognition. The pretrained model must have the `ner` and `parser` pipeline components
+        to use this endpoint. If a sense2vec model was bundled with the service, similar phrases can also be provided.
      operationId: ner
      requestBody:
        required: true
@@ -39,7 +39,7 @@ paths:
              - Net income was $9.4 million compared to the prior year of $2.7 million. Google is a big company.
              - Revenue exceeded twelve billion dollars, with a loss of $1b.
            schema:
-              $ref: '#/components/schemas/Sections'
+              $ref: '#/components/schemas/NERRequest'
      responses:
        '200':
          description: Labeled text, with phrases similar to each entity
@@ -74,13 +74,21 @@ paths:
                      text_with_ws: Sundar Pichai
                  text: Google is headed by Sundar Pichai.
            schema:
-              $ref: '#/components/schemas/NamedEntities'
+              $ref: '#/components/schemas/NERResponse'
        '400':
          description: The pretrained model lacks the `ner` or `parser` pipeline components.
          content:
            application/json:
-              example:
+              examples:
+                invalid_model:
+                  summary: The spaCy model lacks the required pipeline components.
+                  value:
                    detail: The pretrained model (en_trf_bertbaseuncased_lg) doesn't support named entity recognition.
+                sense2vec_disabled:
+                  summary: Similar phrases via sense2vec were requested, but a sense2vec model wasn't bundled with
+                    the service.
+                  value:
+                    detail: There is no sense2vec model bundled with this service.
              schema:
                $ref: '#/components/schemas/InvalidModel'
  /pos:
@@ -225,18 +233,22 @@ paths:
          description: All systems are operational
components:
  schemas:
-    Sections:
+    NERRequest:
      type: object
      properties:
        sections:
          description:
-            Although you could pass the full text as a single array item, it would be faster to split large text
-            into multiple items. Each item needn't be semantically related.
+            Although you could pass the full text as a single array item, it would be faster to split large text into
+            multiple items. Each item needn't be semantically related.
          type: array
          items:
            type: string
+        sense2vec:
+          description: Whether to also compute similar phrases using sense2vec (significantly slower)
+          type: boolean
+          default: false
      required: [sections]
-    NamedEntities:
+    NERResponse:
      type: object
      properties:
        data:
@@ -268,7 +280,7 @@ components:
              description: The entity’s lemma.
            sense2vec:
              type: array
-              description: Phrases similar to the entity
+              description: Phrases similar to the entity (empty if sense2vec was disabled)
              items:
                type: object
                properties:
@@ -2,6 +2,7 @@
# particular versions.
spacy==2.2.3
sense2vec==1.0.2
fastapi==0.45.0
uvicorn==0.10.8
pytest>=4.6.7,<5
\ No newline at end of file
#!/usr/bin/env sh
-# Builds and uploads every image (e.g., neelkamath/spacy-server:v1-en_core_web_sm) to Docker Hub.
+# Builds and uploads every image (e.g., neelkamath/spacy-server:2-en_core_web_sm-sense2vec) to Docker Hub.
# Get the HTTP API version.
version=$(grep version docs/openapi.yaml -m 1)
version=${version#*: }
-version=v$(echo "$version" | cut -d "'" -f 2)
+version=$(echo "$version" | cut -d "'" -f 2)
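+# e.g., the spec line "version: '2'" yields "2", so tags look like 2-en_core_web_sm.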
# Log in.
echo "$DOCKER_HUB_PASSWORD" | docker login -u "$DOCKER_HUB_USER" --password-stdin https://index.docker.io/v1/
# Build and upload the images.
-while IFS='' read -r model || [ -n "$model" ]; do
-    tag="$DOCKER_HUB_USER"/spacy-server:"$version"-"$model"
-    docker build --build-arg SPACY_MODEL="$model" -t "$tag" -f docker/Dockerfile .
-    docker push "$tag"
-    docker rmi "$tag" # Delete the image to prevent the device (e.g., CI runner) from running out of space and crashing.
+while IFS='' read -r spacy_model || [ -n "$spacy_model" ]; do
+    base_tag="$DOCKER_HUB_USER"/spacy-server:"$version"-"$spacy_model"
+    sense2vec_tag="$base_tag"-sense2vec
+    docker build --target base --build-arg SPACY_MODEL="$spacy_model" -t "$base_tag" -f docker/Dockerfile .
+    docker build --build-arg SPACY_MODEL="$spacy_model" -t "$sense2vec_tag" -f docker/Dockerfile .
+    docker push "$base_tag"
+    docker push "$sense2vec_tag"
+    docker rmi "$base_tag" "$sense2vec_tag" # Prevent the device (e.g., CI runner) from running out of space and crashing.
done <scripts/models.txt
#!/usr/bin/env sh
-# Executes a command in a virtual environment (e.g., <sh setup.sh 'uvicorn main:app --reload'>).
+# Sets up the development environment.
python -m venv venv
. venv/bin/activate
pip install -r requirements.txt
python -m spacy download "$SPACY_MODEL"
-$1
@@ -11,34 +11,46 @@ import starlette.status
app = fastapi.FastAPI()
model = os.getenv('SPACY_MODEL')
-pipeline_error = 'The pretrained model ({})'.format(model) + " doesn't support {}."
+pipeline_error = 'The pretrained model ({})'.format(model) \
+    + " doesn't support {}."
nlp = spacy.load(model)
-nlp.add_pipe(sense2vec.Sense2VecComponent(nlp.vocab).from_disk('src/s2v_old'))
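+# Loading the sense2vec vectors is expensive, so the component is attached only
+# when it's explicitly enabled via the SENSE2VEC environment variable.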
+if os.getenv('SENSE2VEC') == '1':
+    nlp.add_pipe(
+        sense2vec.Sense2VecComponent(nlp.vocab).from_disk('src/s2v_old')
+    )

-class SectionsModel(pydantic.BaseModel):
+class NERRequest(pydantic.BaseModel):
    sections: typing.List[str]
+    sense2vec: bool = False
@app.post('/ner')
-async def recognize_named_entities(request: SectionsModel):
+async def recognize_named_entities(request: NERRequest):
    if not nlp.has_pipe('ner') or not nlp.has_pipe('parser'):
        raise fastapi.HTTPException(
            status_code=400,
            detail=pipeline_error.format('named entity recognition')
        )
+    if request.sense2vec and not nlp.has_pipe('sense2vec'):
+        raise fastapi.HTTPException(
+            status_code=400,
+            detail='There is no sense2vec model bundled with this service.'
+        )
    response = {'data': []}
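    # The tagger isn't needed for NER, so it's disabled for speed.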
    for doc in nlp.pipe(request.sections, disable=['tagger']):
        for sent in doc.sents:
-            entities = [build_entity(ent) for ent in sent.ents]
+            entities = [
+                build_entity(ent, request.sense2vec) for ent in sent.ents
+            ]
            data = {'text': sent.text, 'entities': entities}
            response['data'].append(data)
    return response
-def build_entity(ent):
+def build_entity(ent, use_sense2vec):
    similar = []
-    if ent._.in_s2v:
+    if use_sense2vec and ent._.in_s2v:
        for data in ent._.s2v_most_similar():
            similar.append(
                {'phrase': data[0][0], 'similarity': float(data[1])}
@@ -69,7 +81,8 @@ async def tag_parts_of_speech(request: TextModel):
            detail=pipeline_error.format('part-of-speech tagging')
        )
    data = []
-    for token in [build_token(token) for token in nlp(request.text)]:
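+    # sense2vec contributes nothing to POS tagging, so it's skipped when present.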
+    doc = nlp(request.text, disable=['sense2vec'])
+    for token in [build_token(token) for token in doc]:
        text = token['sent']
        del token['sent']
        if text in [obj['text'] for obj in data]:
@@ -126,7 +139,7 @@ def build_token(token):
@app.post('/tokenizer')
async def tokenize(request: TextModel):
-    doc = nlp(request.text, disable=['tagger', 'parser', 'ner'])
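+    # Only the tokenizer is needed; every other component is disabled for speed.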
+    doc = nlp(request.text, disable=['tagger', 'parser', 'ner', 'sense2vec'])
    return {'tokens': [token.text for token in doc]}
@@ -137,7 +150,7 @@ async def sentencize(request: TextModel):
            status_code=400,
            detail=pipeline_error.format('sentence segmentation')
        )
-    doc = nlp(request.text, disable=['tagger', 'ner'])
+    doc = nlp(request.text, disable=['tagger', 'ner', 'sense2vec'])
    return {'sentences': [sent.text for sent in doc.sents]}
{
"data": [
{
"text": "Net income was $9.4 million compared to the prior year of $2.7 million.",
"entities": [
{
"text": "$9.4 million",
"label": "MONEY",
"start_char": 15,
"end_char": 27,
"lemma": "$ 9.4 million",
"start": 3,
"end": 6,
"text_with_ws": "$9.4 million ",
"sense2vec": []
},
{
"text": "the prior year",
"label": "DATE",
"start_char": 40,
"end_char": 54,
"lemma": "the prior year",
"start": 8,
"end": 11,
"text_with_ws": "the prior year ",
"sense2vec": []
},
{
"text": "$2.7 million",
"label": "MONEY",
"start_char": 58,
"end_char": 70,
"lemma": "$ 2.7 million",
"start": 12,
"end": 15,
"text_with_ws": "$2.7 million",
"sense2vec": []
}
]
},
{
"text": "Google is a big company.",
"entities": [
{
"text": "Google",
"label": "ORG",
"start_char": 72,
"end_char": 78,
"lemma": "Google",
"start": 16,
"end": 17,
"text_with_ws": "Google ",
"sense2vec": []
}
]
},
{
"text": "Revenue exceeded twelve billion dollars, with a loss of $1b.",
"entities": [
{
"text": "twelve billion dollars",
"label": "MONEY",
"start_char": 17,
"end_char": 39,
"lemma": "twelve billion dollar",
"start": 2,
"end": 5,
"text_with_ws": "twelve billion dollars",
"sense2vec": []
},
{
"text": "1b",
"label": "MONEY",
"start_char": 57,
"end_char": 59,
"lemma": "1b",
"start": 11,
"end": 12,
"text_with_ws": "1b",
"sense2vec": []
}
]
}
]
}
\ No newline at end of file
{
"data": [
{
"text": "Net income was $9.4 million compared to the prior year of $2.7 million.",
"entities": [
{
"end": 6,
"end_char": 27,
"text": "$9.4 million",
"label": "MONEY",
"start_char": 15,
"end_char": 27,
"lemma": "$ 9.4 million",
"sense2vec": [],
"start": 3,
"start_char": 15,
"text": "$9.4 million",
"text_with_ws": "$9.4 million "
"end": 6,
"text_with_ws": "$9.4 million ",
"sense2vec": []
},
{
"end": 11,
"end_char": 54,
"text": "the prior year",
"label": "DATE",
"start_char": 40,
"end_char": 54,
"lemma": "the prior year",
"start": 8,
"end": 11,
"text_with_ws": "the prior year ",
"sense2vec": [
{
"phrase": "the previous year",
@@ -59,17 +64,17 @@
"phrase": "the entire year",
"similarity": 0.6915000081062317
}
-],
-"start": 8,
-"start_char": 40,
-"text": "the prior year",
-"text_with_ws": "the prior year "
+]
},
{
"end": 15,
"end_char": 70,
"text": "$2.7 million",
"label": "MONEY",
"start_char": 58,
"end_char": 70,
"lemma": "$ 2.7 million",
"start": 12,
"end": 15,
"text_with_ws": "$2.7 million",
"sense2vec": [
{
"phrase": "$1 million",
@@ -111,22 +116,22 @@
"phrase": "$2 million",
"similarity": 0.7371000051498413
}
-],
-"start": 12,
-"start_char": 58,
-"text": "$2.7 million",
-"text_with_ws": "$2.7 million"
+]
}
-],
-"text": "Net income was $9.4 million compared to the prior year of $2.7 million."
+]
},
{
"text": "Google is a big company.",
"entities": [
{
"end": 17,
"end_char": 78,
"text": "Google",
"label": "ORG",
"start_char": 72,
"end_char": 78,
"lemma": "Google",
"start": 16,
"end": 17,
"text_with_ws": "Google ",
"sense2vec": [
{
"phrase": " Google",
@@ -168,33 +173,33 @@
"phrase": "Yahoo",
"similarity": 0.8037999868392944
}
-],
-"start": 16,
-"start_char": 72,
-"text": "Google",
-"text_with_ws": "Google "
+]
}
-],
-"text": "Google is a big company."
+]
},
{
"text": "Revenue exceeded twelve billion dollars, with a loss of $1b.",
"entities": [
{
"end": 5,
"end_char": 39,
"text": "twelve billion dollars",
"label": "MONEY",
"start_char": 17,
"end_char": 39,
"lemma": "twelve billion dollar",
"sense2vec": [],
"start": 2,
"start_char": 17,
"text": "twelve billion dollars",
"text_with_ws": "twelve billion dollars"
"end": 5,
"text_with_ws": "twelve billion dollars",
"sense2vec": []
},
{
"end": 12,
"end_char": 59,
"text": "1b",
"label": "MONEY",
"start_char": 57,
"end_char": 59,
"lemma": "1b",
"start": 11,
"end": 12,
"text_with_ws": "1b",
"sense2vec": [
{
"phrase": "100m",
@@ -236,14 +241,9 @@
"phrase": "100B",
"similarity": 0.8209999799728394
}
-],
-"start": 11,
-"start_char": 57,
-"text": "1b",
-"text_with_ws": "1b"
+]
}
-],
-"text": "Revenue exceeded twelve billion dollars, with a loss of $1b."
+]
}
]
}
\ No newline at end of file
@@ -5,29 +5,51 @@ import starlette.testclient
client = starlette.testclient.TestClient(main.app)
-def test_ner():
-    body = {
+ner_body = {
    'sections': [
        'Net income was $9.4 million compared to the prior year of $2.7 '
        + 'million. Google is a big company.',
        'Revenue exceeded twelve billion dollars, with a loss of $1b.'
    ]
-    }
-    response = client.post('/ner', json=body)
+}
+ner_sense2vec_body = {**ner_body, 'sense2vec': True}
+def test_ner_sense2vec_enabled():
+    response = client.post('/ner', json=ner_sense2vec_body)
    assert response.status_code == 200
-    with open('src/outputs/ner.json') as f:
+    with open('src/outputs/ner/sense2vec_enabled.json') as f:
        assert response.json() == json.load(f)
+def test_ner_sense2vec_disabled():
+    response = client.post('/ner', json=ner_body)
+    with open('src/outputs/ner/sense2vec_disabled.json') as f:
+        assert response.json() == json.load(f)

+def test_ner_spacy_fail():
+    fail('/ner', ner_body, 'ner')

+def test_ner_sense2vec_fail():
+    fail('/ner', ner_sense2vec_body, 'sense2vec')

+pos_body = {'text': 'Apple is looking at buying U.K. startup for $1 billion'}
def test_pos():
-    text = {'text': 'Apple is looking at buying U.K. startup for $1 billion'}
-    response = client.post('/pos', json=text)
+    response = client.post('/pos', json=pos_body)
    assert response.status_code == 200
    with open('src/outputs/pos.json') as f:
        assert response.json() == json.load(f)

+def test_pos_fail():
+    fail('/pos', pos_body, 'parser')
def test_tokenizer():
    text = {'text': 'Apple is looking at buying U.K. startup for $1 billion'}
    response = client.post('/tokenizer', json=text)
@@ -36,16 +58,29 @@ def test_tokenizer():
        assert response.json() == json.load(f)

+sentencizer_body = {
+    'text': 'Apple is looking at buying U.K. startup for $1 billion. Another '
+    + 'sentence.'
+}
def test_sentencizer():
-    body = {
-        'text': 'Apple is looking at buying U.K. startup for $1 billion. '
-        + 'Another sentence.'
-    }
-    response = client.post('/sentencizer', json=body)
+    response = client.post('/sentencizer', json=sentencizer_body)
    assert response.status_code == 200
    with open('src/outputs/sentencizer.json') as f:
        assert response.json() == json.load(f)
+def test_sentencizer_fail():
+    fail('/sentencizer', sentencizer_body, 'parser')

def test_health_check():
    assert client.get('/health_check').status_code == 204
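+# Asserts that the endpoint responds with HTTP 400 when the given pipeline component is unavailable.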
+def fail(endpoint, body, pipe):
+    with main.nlp.disable_pipes(pipe):
+        response = client.post(endpoint, json=body)
+    assert response.status_code == 400
+    assert 'detail' in response.json()