Unverified Commit 47b313d5 authored by Neel Kamath's avatar Neel Kamath Committed by GitHub

Fix #21 (#22)

* Conditionally enable sense2vec for performance improvements

* Disable sense2vec in unrelated pipeline uses

* Test HTTP exceptions

* Update Docker image tagging convention

* Conditionally disable sense2vec
parent ca48e810
@@ -2,7 +2,7 @@
 [![Built with spaCy](https://img.shields.io/badge/built%20with-spaCy-09a3d5.svg)](https://spacy.io)
-This project provides industrial-strength NLP for multiple languages via [spaCy](https://spacy.io/) and [sense2vec](https://github.com/explosion/sense2vec) over a containerized HTTP API.
+This project provides industrial-strength NLP via [spaCy](https://spacy.io/) and [sense2vec](https://github.com/explosion/sense2vec) over a containerized HTTP API.
 ## Installation
@@ -10,18 +10,11 @@ This project provides industrial-strength NLP for multiple languages via [spaCy]
 Install [Docker](https://hub.docker.com/search/?type=edition&offering=community).
-You can find specific tags (say for example, a French model) on the [Docker Hub repository](https://hub.docker.com/repository/docker/neelkamath/spacy-server/tags?page=1).
-For example, to run an English model at `http://localhost:8000`, run:
-```
-docker run --rm -e SPACY_MODEL=en_core_web_sm -p 8000:8000 neelkamath/spacy-server:v1-en_core_web_sm
-```
 ### Generating an SDK
 You can generate a wrapper for the HTTP API using [OpenAPI Generator](https://openapi-generator.tech/) on the file [`https://raw.githubusercontent.com/neelkamath/spacy-server/master/docs/openapi.yaml`](https://raw.githubusercontent.com/neelkamath/spacy-server/master/docs/openapi.yaml).
-## [Usage](https://neelkamath.gitlab.io/spacy-server/)
+## [Usage](https://hub.docker.com/r/neelkamath/spacy-server)
 ## [Contributing](docs/CONTRIBUTING.md)
...
-FROM python:3.8
+FROM python:3.8 AS base
 WORKDIR /app
-ENV PYTHONUNBUFFERED 1
+ARG SPACY_MODEL
+ENV PYTHONUNBUFFERED=1 SENSE2VEC=0 SPACY_MODEL=$SPACY_MODEL
 COPY requirements.txt .
 RUN pip install --no-cache-dir -r requirements.txt
-ARG SPACY_MODEL
 RUN python -m spacy download $SPACY_MODEL
 COPY src/main.py .
-COPY src/s2v_old/ s2v_old/
 EXPOSE 8000
 HEALTHCHECK --timeout=2s --start-period=2s --retries=1 \
     CMD curl -f http://localhost:8000/health_check
 RUN useradd user
 USER user
 CMD ["uvicorn", "main:app", "--host", "0.0.0.0"]
+
+FROM base
+ENV SENSE2VEC 1
+COPY src/s2v_old/ src/s2v_old/
\ No newline at end of file
 version: '3.7'
 services:
   app:
-    command: sh scripts/setup.sh 'uvicorn src.main:app --host 0.0.0.0 --reload'
+    command: sh -c '. scripts/setup.sh && uvicorn src.main:app --host 0.0.0.0 --reload'
     ports: ['8000:8000']
+    environment:
+      SPACY_MODEL:
+      SENSE2VEC:
\ No newline at end of file
 version: '3.7'
 services:
   app:
-    command: sh scripts/setup.sh pytest
+    command: sh -c '. scripts/setup.sh && pytest'
     environment:
+      SENSE2VEC: 1
       # Since any model will do, tests have been written only for the en_core_web_sm model because of its combination of
       # speed, features, and accuracy.
...
-# It is not possible to use a Docker volume to cache the dependencies because subsequent usage of the volume
-# occasionally gets corrupted for an unknown reason. Hence, a virtual environment is to be used instead. It is known
-# that virtual environments aren't needed in Docker because isolation is already provided; we use it as a cache instead.
+# A virtual environment caches dependencies instead of a Docker volume because the volume randomly gets corrupted.
 version: '3.7'
 services:
   app:
     image: python:3.8
     working_dir: /app
+    environment:
+      SPACY_MODEL:
     volumes:
       - type: bind
         source: .
...
@@ -9,6 +9,6 @@ If you're forking the repo to develop the project as your own and not just to se
 1. Clone the repository using one of the following methods.
     - SSH: `git clone git@github.com:neelkamath/spacy-server.git`
     - HTTPS: `git clone https://github.com/neelkamath/spacy-server.git`
-1. Download the [pretrained vectors](https://github.com/explosion/sense2vec/releases/download/v1.0.0/s2v_reddit_2015_md.tar.gz). Extract it into the project's `src` directory.
+1. If you are not going to use sense2vec, skip this step. Download the [pretrained vectors](https://github.com/explosion/sense2vec/releases/download/v1.0.0/s2v_reddit_2015_md.tar.gz). Extract it into the project's `src` directory.
 ## [Developing](developing.md)
\ No newline at end of file
@@ -2,12 +2,12 @@
 ## Server
-Replace `<MODEL>` with the name of the [spaCy model](https://spacy.io/models) (e.g., `en_core_web_sm`, `fr_core_news_md`). The model must be compatible with the spaCy version specified in [requirements.txt](../requirements.txt).
+Replace `<MODEL>` with the name of the [spaCy model](https://spacy.io/models) (e.g., `en_core_web_sm`, `fr_core_news_md`). The model must be compatible with the spaCy version specified in [requirements.txt](../requirements.txt). Replace `<ENABLED>` with `1` or `0` to enable or disable sense2vec respectively.
 ### Development
 ```
-SPACY_MODEL=<MODEL> docker-compose -p dev --project-directory . \
+SPACY_MODEL=<MODEL> SENSE2VEC=<ENABLED> docker-compose -p dev --project-directory . \
     -f docker/docker-compose.yml -f docker/docker-compose.override.yml up --build
 ```
@@ -15,19 +15,29 @@ The server will be running on `http://localhost:8000`, and has automatic reload
 ### Testing
-```
-docker-compose -p test --project-directory . -f docker/docker-compose.yml -f docker/docker-compose.test.yml \
-    up --build --abort-on-container-exit --exit-code-from app
-```
+- For noninteractive environments (e.g., CI pipelines), you can run all the tests with a single command:
+    ```
+    docker-compose -p test --project-directory . -f docker/docker-compose.yml -f docker/docker-compose.test.yml \
+        up --build --abort-on-container-exit --exit-code-from app
+    ```
+- For faster iterations (e.g., while developing), you can run the tests interactively. Changes to the source code will automatically be mirrored in the container.
+    1. Run:
+        ```
+        docker-compose -p test --project-directory . -f docker/docker-compose.yml -f docker/docker-compose.test.yml \
+            run --service-ports app bash
+        ```
+    1. `. scripts/setup.sh` (run this command every time you update `requirements.txt`)
+    1. Execute the tests any number of times with pytest (e.g., `pytest`).
+    1. After you're done testing, exit the container by running `exit`.
 ### Production
 ```
-docker build --build-arg SPACY_MODEL=<MODEL> -t spacy-server -f docker/Dockerfile .
-docker run --rm -e SPACY_MODEL=<MODEL> -p 8000:8000 spacy-server
+docker build <TARGET> --build-arg SPACY_MODEL=<MODEL> -t spacy-server -f docker/Dockerfile .
 ```
+Replace `<TARGET>` with `--target base` if you want to disable sense2vec, and with an empty string otherwise.
-The container `EXPOSE`s port `8000`.
+The container `EXPOSE`s port `8000`. Run using `docker run --rm -p 8000:8000 spacy-server`.
 ## Specification
@@ -61,7 +71,10 @@ Open `redoc-static.html` in your browser.
 ## Releases
-- Create a GitHub release (this will automatically create the git tag). If you bumped the version in `docs/openapi.yaml`, then create a new release. If you haven't bumped the version but have updated the HTTP API's functionality, delete the existing GitHub release and git tag, and create a new one. Otherwise, skip this step. The release's title should be the features included (e.g., `NER, POS tagging, sentencizer, tokenizer, and sense2vec`). The tag should be the HTTP API's version (e.g., `v1`). The release's body should be ```Download and open the release asset, `redoc-static.html`, in your browser to view the HTTP API documentation.```. Upload the asset named `redoc-static.html` which contains the HTTP API docs.
+- If you haven't updated the HTTP API's functionality, skip this step.
+    1. If you haven't bumped the version in the OpenAPI spec, delete the corresponding GitHub release and git tag.
+    1. Generate `redoc-static.html`: `npx redoc-cli bundle docs/openapi.yaml -o redoc-static.html --title 'spaCy Server'`
+    1. Create a GitHub release. The release's body should be ```Download and open the release asset, `redoc-static.html`, in your browser to view the HTTP API documentation.```. Upload `redoc-static.html` as an asset.
 - If required, update the [Docker Hub repository](https://hub.docker.com/r/neelkamath/spacy-server)'s **Overview**.
 - For every commit to the `master` branch in which the tests have passed, the following will automatically be done.
     - The new images will be uploaded to Docker Hub.
...
 openapi: 3.0.2
 info:
   title: spaCy Server
-  version: '1'
+  version: '2'
   description: |
     Industrial-strength NLP via [spaCy](https://spacy.io) and [sense2vec](https://github.com/explosion/sense2vec). No
     knowledge of spaCy or sense2vec is required to use this service.
@@ -26,8 +26,8 @@ paths:
   /ner:
     post:
       tags: [nlp]
-      description: Named entity recognition. Similar phrases will also be provided via sense2vec. The pretrained model
-        must have the `ner` and `parser` pipeline components to use this endpoint.
+      description: Named entity recognition. The pretrained model must have the `ner` and `parser` pipeline components
+        to use this endpoint. If a sense2vec model was bundled with the service, similar phrases can also be provided.
       operationId: ner
       requestBody:
         required: true
@@ -39,7 +39,7 @@ paths:
               - Net income was $9.4 million compared to the prior year of $2.7 million. Google is a big company.
              - Revenue exceeded twelve billion dollars, with a loss of $1b.
            schema:
-              $ref: '#/components/schemas/Sections'
+              $ref: '#/components/schemas/NERRequest'
       responses:
         '200':
           description: Labeled text, with phrases similar to each entity
@@ -74,13 +74,21 @@ paths:
                     text_with_ws: Sundar Pichai
                 text: Google is headed by Sundar Pichai.
               schema:
-                $ref: '#/components/schemas/NamedEntities'
+                $ref: '#/components/schemas/NERResponse'
         '400':
           description: The pretrained model lacks the `ner` or `parser` pipeline components.
           content:
             application/json:
-              example:
-                detail: The pretrained model (en_trf_bertbaseuncased_lg) doesn't support named entity recognition.
+              examples:
+                invalid_model:
+                  summary: The spaCy model lacks the required pipeline components.
+                  value:
+                    detail: The pretrained model (en_trf_bertbaseuncased_lg) doesn't support named entity recognition.
+                sense2vec_disabled:
+                  summary: Similar phrases via sense2vec were requested, but a sense2vec model wasn't bundled with the
+                    service.
+                  value:
+                    detail: There is no sense2vec model bundled with this service.
               schema:
                 $ref: '#/components/schemas/InvalidModel'
   /pos:
@@ -225,18 +233,22 @@ paths:
           description: All systems are operational
 components:
   schemas:
-    Sections:
+    NERRequest:
       type: object
       properties:
         sections:
           description:
-            Although you could pass the full text as a single array item, it would be faster to split large text
-            into multiple items. Each item needn't be semantically related.
+            Although you could pass the full text as a single array item, it would be faster to split large text into
+            multiple items. Each item needn't be semantically related.
           type: array
           items:
             type: string
+        sense2vec:
+          description: Whether to also compute similar phrases using sense2vec (significantly slower)
+          type: boolean
+          default: false
       required: [sections]
-    NamedEntities:
+    NERResponse:
       type: object
       properties:
         data:
@@ -268,7 +280,7 @@ components:
           description: The entity’s lemma.
         sense2vec:
           type: array
-          description: Phrases similar to the entity
+          description: Phrases similar to the entity (empty if sense2vec was disabled)
           items:
             type: object
             properties:
...
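The request schema above boils down to a required `sections` array of strings plus an optional `sense2vec` boolean that defaults to `false`. A minimal sketch of building a valid body in Python (the helper name is hypothetical, not part of the project):

```python
# Hypothetical helper that assembles a body matching the NERRequest schema:
# "sections" is required; "sense2vec" is optional and defaults to false.
def build_ner_request(sections, sense2vec=False):
    if not all(isinstance(section, str) for section in sections):
        raise TypeError('each section must be a string')
    return {'sections': list(sections), 'sense2vec': bool(sense2vec)}

# Per the schema's description, splitting large text into multiple
# sections is faster than passing it as a single array item.
body = build_ner_request(
    ['Net income was $9.4 million.', 'Google is a big company.'],
    sense2vec=True,
)
```

Note that sending `sense2vec: true` against a container built with `--target base` yields the `sense2vec_disabled` 400 response shown above.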
@@ -2,6 +2,7 @@
 # particular versions.
 spacy==2.2.3
 sense2vec==1.0.2
 fastapi==0.45.0
 uvicorn==0.10.8
 pytest>=4.6.7,<5
\ No newline at end of file
 #!/usr/bin/env sh
-# Builds and uploads every image (e.g., neelkamath/spacy-server:v1-en_core_web_sm) to Docker Hub.
+# Builds and uploads every image (e.g., neelkamath/spacy-server:2-en_core_web_sm-sense2vec) to Docker Hub.
 # Get the HTTP API version.
 version=$(grep version docs/openapi.yaml -m 1)
 version=${version#*: }
-version=v$(echo "$version" | cut -d "'" -f 2)
+version=$(echo "$version" | cut -d "'" -f 2)
 # Log in.
 echo "$DOCKER_HUB_PASSWORD" | docker login -u "$DOCKER_HUB_USER" --password-stdin https://index.docker.io/v1/
 # Build and upload the images.
-while IFS='' read -r model || [ -n "$model" ]; do
-    tag="$DOCKER_HUB_USER"/spacy-server:"$version"-"$model"
-    docker build --build-arg SPACY_MODEL="$model" -t "$tag" -f docker/Dockerfile .
-    docker push "$tag"
-    docker rmi "$tag" # Delete the image to prevent the device (e.g., CI runner) from running out of space and crashing.
+while IFS='' read -r spacy_model || [ -n "$spacy_model" ]; do
+    base_tag="$DOCKER_HUB_USER"/spacy-server:"$version"-"$spacy_model"
+    sense2vec_tag="$base_tag"-sense2vec
+    docker build --target base --build-arg SPACY_MODEL="$spacy_model" -t "$base_tag" -f docker/Dockerfile .
+    docker build --build-arg SPACY_MODEL="$spacy_model" -t "$sense2vec_tag" -f docker/Dockerfile .
+    docker push "$base_tag"
+    docker push "$sense2vec_tag"
+    docker rmi "$base_tag" "$sense2vec_tag" # Prevent the device (e.g., CI runner) from running out of space and crashing.
 done <scripts/models.txt
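The tag's version component comes from the `version:` line of `docs/openapi.yaml`. The extraction can be exercised standalone against a throwaway copy of the spec (the `/tmp` path is illustrative; the real script greps `docs/openapi.yaml`):

```shell
# Stand-in spec file with the same shape as docs/openapi.yaml.
cat > /tmp/openapi_demo.yaml <<'EOF'
openapi: 3.0.2
info:
  title: spaCy Server
  version: '2'
EOF
# Same three steps as the script: take the first line containing "version",
# strip everything through ": ", then keep the value between the single quotes.
version=$(grep version /tmp/openapi_demo.yaml -m 1)
version=${version#*: }
version=$(echo "$version" | cut -d "'" -f 2)
echo "$version" # prints: 2
```

Dropping the old `v` prefix here is what changes the tags from `v1-en_core_web_sm` to `2-en_core_web_sm`.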
 #!/usr/bin/env sh
-# Executes a command in a virtual environment (e.g., <sh setup.sh 'uvicorn main:app --reload'>).
+# Sets up the development environment.
 python -m venv venv
 . venv/bin/activate
 pip install -r requirements.txt
 python -m spacy download "$SPACY_MODEL"
-$1
@@ -11,34 +11,46 @@ import starlette.status
 app = fastapi.FastAPI()
 model = os.getenv('SPACY_MODEL')
-pipeline_error = 'The pretrained model ({})'.format(model) + " doesn't support {}."
+pipeline_error = 'The pretrained model ({})'.format(model) \
+    + " doesn't support {}."
 nlp = spacy.load(model)
-nlp.add_pipe(sense2vec.Sense2VecComponent(nlp.vocab).from_disk('src/s2v_old'))
+if os.getenv('SENSE2VEC') == '1':
+    nlp.add_pipe(
+        sense2vec.Sense2VecComponent(nlp.vocab).from_disk('src/s2v_old')
+    )
 
-class SectionsModel(pydantic.BaseModel):
+class NERRequest(pydantic.BaseModel):
     sections: typing.List[str]
+    sense2vec: bool = False
 
 @app.post('/ner')
-async def recognize_named_entities(request: SectionsModel):
+async def recognize_named_entities(request: NERRequest):
     if not nlp.has_pipe('ner') or not nlp.has_pipe('parser'):
         raise fastapi.HTTPException(
             status_code=400,
             detail=pipeline_error.format('named entity recognition')
         )
+    if request.sense2vec and not nlp.has_pipe('sense2vec'):
+        raise fastapi.HTTPException(
+            status_code=400,
+            detail='There is no sense2vec model bundled with this service.'
+        )
     response = {'data': []}
     for doc in nlp.pipe(request.sections, disable=['tagger']):
         for sent in doc.sents:
-            entities = [build_entity(ent) for ent in sent.ents]
+            entities = [
+                build_entity(ent, request.sense2vec) for ent in sent.ents
+            ]
             data = {'text': sent.text, 'entities': entities}
             response['data'].append(data)
     return response
 
-def build_entity(ent):
+def build_entity(ent, use_sense2vec):
     similar = []
-    if ent._.in_s2v:
+    if use_sense2vec and ent._.in_s2v:
         for data in ent._.s2v_most_similar():
             similar.append(
                 {'phrase': data[0][0], 'similarity': float(data[1])}
@@ -69,7 +81,8 @@ async def tag_parts_of_speech(request: TextModel):
             detail=pipeline_error.format('part-of-speech tagging')
         )
     data = []
-    for token in [build_token(token) for token in nlp(request.text)]:
+    doc = nlp(request.text, disable=['sense2vec'])
+    for token in [build_token(token) for token in doc]:
         text = token['sent']
         del token['sent']
         if text in [obj['text'] for obj in data]:
@@ -126,7 +139,7 @@ def build_token(token):
 @app.post('/tokenizer')
 async def tokenize(request: TextModel):
-    doc = nlp(request.text, disable=['tagger', 'parser', 'ner'])
+    doc = nlp(request.text, disable=['tagger', 'parser', 'ner', 'sense2vec'])
     return {'tokens': [token.text for token in doc]}
@@ -137,7 +150,7 @@ async def sentencize(request: TextModel):
         raise fastapi.HTTPException(
             status_code=400,
             detail=pipeline_error.format('sentence segmentation')
         )
-    doc = nlp(request.text, disable=['tagger', 'ner'])
+    doc = nlp(request.text, disable=['tagger', 'ner', 'sense2vec'])
     return {'sentences': [sent.text for sent in doc.sents]}
...
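The control flow added above can be sketched without spaCy or FastAPI: the sense2vec component joins the pipeline only when `SENSE2VEC=1`, and a request asking for similar phrases is rejected with a 400 when the component is absent. A stdlib-only sketch with hypothetical names (`load_pipes`, `handle_ner` are stand-ins, not the project's API):

```python
def load_pipes(sense2vec_env):
    """Mirror startup: sense2vec is registered only when SENSE2VEC=1."""
    pipes = {'tagger', 'parser', 'ner'}
    if sense2vec_env == '1':
        pipes.add('sense2vec')
    return pipes

def handle_ner(pipes, sections, sense2vec=False):
    """Mirror the endpoint's guards; returns (status_code, body)."""
    if 'ner' not in pipes or 'parser' not in pipes:
        return 400, {'detail': "The pretrained model doesn't support named entity recognition."}
    # New guard from this commit: similar phrases were requested,
    # but no sense2vec model is bundled with the service.
    if sense2vec and 'sense2vec' not in pipes:
        return 400, {'detail': 'There is no sense2vec model bundled with this service.'}
    return 200, {'data': [{'text': section, 'entities': []} for section in sections]}
```

For example, `handle_ner(load_pipes('0'), ['x'], sense2vec=True)` produces the 400 case documented as `sense2vec_disabled` in the OpenAPI spec, while the same request against `load_pipes('1')` succeeds.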
{
"data": [
{
"text": "Net income was $9.4 million compared to the prior year of $2.7 million.",
"entities": [
{
"text": "$9.4 million",
"label": "MONEY",
"start_char": 15,
"end_char": 27,
"lemma": "$ 9.4 million",
"start": 3,
"end": 6,
"text_with_ws": "$9.4 million ",
"sense2vec": []
},
{
"text": "the prior year",
"label": "DATE",
"start_char": 40,
"end_char": 54,
"lemma": "the prior year",
"start": 8,
"end": 11,
"text_with_ws": "the prior year ",
"sense2vec": []
},
{
"text": "$2.7 million",
"label": "MONEY",
"start_char": 58,
"end_char": 70,
"lemma": "$ 2.7 million",
"start": 12,
"end": 15,
"text_with_ws": "$2.7 million",
"sense2vec": []
}
]
},
{
"text": "Google is a big company.",
"entities": [
{
"text": "Google",
"label": "ORG",
"start_char": 72,
"end_char": 78,
"lemma": "Google",
"start": 16,
"end": 17,
"text_with_ws": "Google ",
"sense2vec": []
}
]
},
{
"text": "Revenue exceeded twelve billion dollars, with a loss of $1b.",
"entities": [
{
"text": "twelve billion dollars",
"label": "MONEY",
"start_char": 17,
"end_char": 39,
"lemma": "twelve billion dollar",
"start": 2,
"end": 5,
"text_with_ws": "twelve billion dollars",
"sense2vec": []
},
{
"text": "1b",
"label": "MONEY",
"start_char": 57,
"end_char": 59,
"lemma": "1b",
"start": 11,
"end": 12,
"text_with_ws": "1b",
"sense2vec": []
}
]
}
]
}
\ No newline at end of file
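The fixture above encodes the disabled-mode contract: every entity still carries a `sense2vec` key, but the array is always empty. A stdlib-only check (the abbreviated sample mirrors the file's shape, not its full contents):

```python
import json

# Abbreviated sample with the same shape as the sense2vec-disabled fixture.
sample = json.loads('''
{
  "data": [
    {
      "text": "Google is a big company.",
      "entities": [
        {"text": "Google", "label": "ORG", "sense2vec": []}
      ]
    }
  ]
}
''')

def all_sense2vec_empty(doc):
    # True when no entity in any section carries similar phrases.
    return all(
        entity['sense2vec'] == []
        for section in doc['data']
        for entity in section['entities']
    )
```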
 {
   "data": [
     {
+      "text": "Net income was $9.4 million compared to the prior year of $2.7 million.",
       "entities": [
         {
-          "end": 6,
-          "end_char": 27,
+          "text": "$9.4 million",
           "label": "MONEY",
+          "start_char": 15,
+          "end_char": 27,
           "lemma": "$ 9.4 million",
-          "sense2vec": [],
           "start": 3,
-          "start_char": 15,
-          "text": "$9.4 million",
-          "text_with_ws": "$9.4 million "
+          "end": 6,
+          "text_with_ws": "$9.4 million ",
+          "sense2vec": []
         },
         {
-          "end": 11,
-          "end_char": 54,
+          "text": "the prior year",
           "label": "DATE",
+          "start_char": 40,
+          "end_char": 54,
           "lemma": "the prior year",
+          "start": 8,
+          "end": 11,
+          "text_with_ws": "the prior year ",
           "sense2vec": [
             {
               "phrase": "the previous year",
@@ -59,17 +64,17 @@
               "phrase": "the entire year",
               "similarity": 0.6915000081062317
             }
-          ],
-          "start": 8,
-          "start_char": 40,
-          "text": "the prior year",
-          "text_with_ws": "the prior year "
+          ]
         },
         {
-          "end": 15,
-          "end_char": 70,
+          "text": "$2.7 million",
           "label": "MONEY",
+          "start_char": 58,
+          "end_char": 70,
           "lemma": "$ 2.7 million",
+          "start": 12,
+          "end": 15,
+          "text_with_ws": "$2.7 million",
           "sense2vec": [
             {
               "phrase": "$1 million",
@@ -111,22 +116,22 @@
               "phrase": "$2 million",
               "similarity": 0.7371000051498413
             }
-          ],
-          "start": 12,
-          "start_char": 58,
-          "text": "$2.7 million",
-          "text_with_ws": "$2.7 million"
+          ]
         }
-      ],
-      "text": "Net income was $9.4 million compared to the prior year of $2.7 million."
+      ]
     },
     {
+      "text": "Google is a big company.",
       "entities": [
         {
-          "end": 17,
-          "end_char": 78,
+          "text": "Google",
           "label": "ORG",
+          "start_char": 72,
+          "end_char": 78,
           "lemma": "Google",
+          "start": 16,
+          "end": 17,
+          "text_with_ws": "Google ",
           "sense2vec": [
             {
               "phrase": " Google",
@@ -168,33 +173,33 @@
               "phrase": "Yahoo",
               "similarity": 0.8037999868392944
             }
-          ],
-          "start": 16,
-          "start_char": 72,
-          "text": "Google",
-          "text_with_ws": "Google "
+          ]
         }
-      ],
-      "text": "Google is a big company."
+      ]
     },
     {
+      "text": "Revenue exceeded twelve billion dollars, with a loss of $1b.",
       "entities": [
         {
-          "end": 5,
-          "end_char": 39,
+          "text": "twelve billion dollars",
           "label": "MONEY",
+          "start_char": 17,
+          "end_char": 39,
           "lemma": "twelve billion dollar",
-          "sense2vec": [],
           "start": 2,
-          "start_char": 17,
-          "text": "twelve billion dollars",
-          "text_with_ws": "twelve billion dollars"
+          "end": 5,
+          "text_with_ws": "twelve billion dollars",
+          "sense2vec": []
         },
         {
-          "end": 12,
-          "end_char": 59,
+          "text": "1b",
           "label": "MONEY",
+          "start_char": 57,
+          "end_char": 59,
           "lemma": "1b",
+          "start": 11,
+          "end": 12,
+          "text_with_ws": "1b",
           "sense2vec": [
             {
               "phrase": "100m",
@@ -236,14 +241,9 @@
               "phrase": "100B",
               "similarity": 0.8209999799728394
             }
-          ],
-          "start": 11,
-          "start_char": 57,
-          "text": "1b",
-          "text_with_ws": "1b"
+          ]
         }
-      ],
-      "text": "Revenue exceeded twelve billion dollars, with a loss of $1b."
+      ]
     }
   ]
 }
\ No newline at end of file
@@ -5,29 +5,51 @@ import starlette.testclient
 client = starlette.testclient.TestClient(main.app)
 
+ner_body = {
+    'sections': [
+        'Net income was $9.4 million compared to the prior year of $2.7 '
+        + 'million. Google is a big company.',
+        'Revenue exceeded twelve billion dollars, with a loss of $1b.'
+    ]
+}
+ner_sense2vec_body = {**ner_body, 'sense2vec': True}
+
-def test_ner():
-    body = {
-        'sections': [
-            'Net income was $9.4 million compared to the prior year of $2.7 '
-            + 'million. Google is a big company.',
-            'Revenue exceeded twelve billion dollars, with a loss of $1b.'
-        ]
-    }
-    response = client.post('/ner', json=body)
+def test_ner_sense2vec_enabled():
+    response = client.post('/ner', json=ner_sense2vec_body)
     assert response.status_code == 200
-    with open('src/outputs/ner.json') as f:
+    with open('src/outputs/ner/sense2vec_enabled.json') as f:
         assert response.json() == json.load(f)
 
+def test_ner_sense2vec_disabled():
+    response = client.post('/ner', json=ner_body)
+    with open('src/outputs/ner/sense2vec_disabled.json') as f:
+        assert response.json() == json.load(f)
+
+def test_ner_spacy_fail():
+    fail('/ner', ner_body, 'ner')
+
+def test_ner_sense2vec_fail():
+    fail('/ner', ner_sense2vec_body, 'sense2vec')
+
+pos_body = {'text': 'Apple is looking at buying U.K. startup for $1 billion'}
+
 def test_pos():
-    text = {'text': 'Apple is looking at buying U.K. startup for $1 billion'}
-    response = client.post('/pos', json=text)
+    response = client.post('/pos', json=pos_body)
     assert response.status_code == 200
     with open('src/outputs/pos.json') as f:
         assert response.json() == json.load(f)
 
+def test_pos_fail():
+    fail('/pos', pos_body, 'parser')
+
 def test_tokenizer():
     text = {'text': 'Apple is looking at buying U.K. startup for $1 billion'}
     response = client.post('/tokenizer', json=text)
@@ -36,16 +58,29 @@ def test_tokenizer():
         assert response.json() == json.load(f)
 
+sentencizer_body = {
+    'text': 'Apple is looking at buying U.K. startup for $1 billion. Another '
+    + 'sentence.'
+}
+
 def test_sentencizer():
-    body = {
-        'text': 'Apple is looking at buying U.K. startup for $1 billion. '
-        + 'Another sentence.'
-    }
-    response = client.post('/sentencizer', json=body)
+    response = client.post('/sentencizer', json=sentencizer_body)
     assert response.status_code == 200
     with open('src/outputs/sentencizer.json') as f:
         assert response.json() == json.load(f)
 
+def test_sentencizer_fail():
+    fail('/sentencizer', sentencizer_body, 'parser')
+
 def test_health_check():
     assert client.get('/health_check').status_code == 204
+
+def fail(endpoint, body, pipe):
+    with main.nlp.disable_pipes(pipe):
+        response = client.post(endpoint, json=body)
+    assert response.status_code == 400
+    assert 'detail' in response.json()
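The new `fail` helper relies on `nlp.disable_pipes` acting as a context manager that restores the pipe on exit, so one test's failure simulation can't leak into the next. A toy stand-in of that restore-on-exit pattern (the `Pipeline` class is hypothetical, not spaCy's API):

```python
from contextlib import contextmanager

class Pipeline:
    """Toy stand-in for the relevant bits of a spaCy-style pipeline."""
    def __init__(self, pipes):
        self.pipes = set(pipes)

    def has_pipe(self, name):
        return name in self.pipes

    @contextmanager
    def disable_pipes(self, name):
        # Remove the pipe for the duration of the block, then put it back
        # even if the body raises.
        had = name in self.pipes
        self.pipes.discard(name)
        try:
            yield
        finally:
            if had:
                self.pipes.add(name)

nlp = Pipeline({'tagger', 'parser', 'ner'})
with nlp.disable_pipes('ner'):
    inside = nlp.has_pipe('ner')  # disabled within the block
after = nlp.has_pipe('ner')       # restored once the block exits
```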