Named Entity Recognition

The SSSLM software package contains a submodule, ssslm.ner, for named entity normalization (NEN) and named entity recognition (NER). It provides a standard class API and a data model encoded with pydantic models.

By default, SSSLM wraps the NEN/NER system implemented in gilda because of its speed and lack of heavy dependencies. SSSLM also wraps the more powerful spacy and gliner NER systems, though they require more complex installation, setup, and configuration.

The following NEN systems have been directly wrapped by SSSLM:

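==========  ======================  =================
NEN System  Class                   Implementation
==========  ======================  =================
Gilda       ssslm.ner.GildaMatcher  Dictionary lookup
==========  ======================  =================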

The following NER systems have been directly wrapped by SSSLM:

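==========  ========================  =================================
NER System  Class                     Implementation
==========  ========================  =================================
Gilda       ssslm.ner.GildaGrounder   Dictionary lookup
SpaCy       ssslm.ner.SpacyGrounder   Transition-based sequence model
GLiNER      ssslm.ner.GLiNERGrounder  Bi-directional transformer (BERT)
==========  ========================  =================================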

SSSLM can be extended to other NER/NEN systems by subclassing ssslm.ner.Matcher (for NEN), ssslm.ner.Annotator (for NER), or ssslm.ner.Grounder (for combined NEN/NER).
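To make the NEN/NER distinction concrete, here is a minimal, standard-library-only sketch of dictionary lookup. It does not import ssslm or gilda, and it is far simpler than either. The names LEXICON, normalize, match, annotate, and Span are hypothetical and only loosely mirror the roles of a matcher (NEN) and an annotator (NER); the identifiers are taken from the tables on this page.

```python
from __future__ import annotations

import re
from dataclasses import dataclass

# Hypothetical toy lexicon: normalized label -> CURIE
LEXICON = {
    "lung": "uberon:0002048",
    "nose": "uberon:0000004",
    "adrenal cortex": "uberon:0001235",
}


def normalize(text: str) -> str:
    """Case-fold and collapse whitespace so 'ADRENAL  CORTEX' matches 'adrenal cortex'."""
    return " ".join(text.casefold().split())


def match(text: str) -> str | None:
    """NEN: treat the whole string as one mention and look it up."""
    return LEXICON.get(normalize(text))


@dataclass(frozen=True)
class Span:
    """A grounded mention found inside a longer text."""

    start: int
    end: int
    text: str
    curie: str


def annotate(text: str) -> list[Span]:
    """NER plus NEN: find lexicon labels inside free text and ground each one."""
    spans = []
    for label, curie in LEXICON.items():
        # allow any run of whitespace between the words of a multi-word label
        pattern = r"\b" + r"\s+".join(map(re.escape, label.split())) + r"\b"
        for m in re.finditer(pattern, text, flags=re.IGNORECASE):
            spans.append(Span(m.start(), m.end(), m.group(), curie))
    return sorted(spans, key=lambda span: span.start)


print(match("ADRENAL CORTEX"))
for span in annotate("Necrosis in the adrenal cortex and the LUNG"):
    print(span)
```

The matcher assumes the input is already a single entity label, which suits table cells like the ones in the case study below; the annotator is what you would reach for when the labels are buried in sentences.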

Case Study

The following examples ground the labels for organs and diseases that appear in the example table from the OBO Academy’s tutorial From Tables to Linked Data.

The initial table looks like this:

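=======  =======  ==============  ==============
species  strain   organ           disease
=======  =======  ==============  ==============
RAT      F 344/N  LUNG            ADENOCARCINOMA
MOUSE    B6C3F1   NOSE            INFLAMMATION
RAT      F 344/N  ADRENAL CORTEX  NECROSIS
=======  =======  ==============  ==============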

Our goal is to look up the best possible ontology/database identifiers, first for organs in the second-to-last column and then for diseases in the last column. The example code shows two flavors of grounding for each entity type (a CURIE string and a structured reference), which yields up to four new columns:

  1. By adding a column organ_curie that contains string representations of the references to external ontologies like the Uber Anatomy Ontology (UBERON) and Brenda Tissue Ontology (BTO).

  2. By adding a column organ_reference that contains a data structure with the prefix, identifier, and name of the references.

  3. By adding a column disease_curie that contains string representations of the references to external ontologies like the Disease Ontology (DOID) and Symptom Ontology (SYMP).

  4. By adding a column disease_reference that contains a data structure with the prefix, identifier, and name of the references.
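The difference between the two flavors can be sketched with a small stand-in class. This is not SSSLM's data model (which is built on pydantic); the Reference class, its curie property, and from_curie below are hypothetical names used only to show how a CURIE string and a structured reference carry the same prefix and identifier.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Reference:
    """Hypothetical stand-in for a reference with a prefix, identifier, and name."""

    prefix: str
    identifier: str
    name: str | None = None

    @property
    def curie(self) -> str:
        """Render the compact string form, e.g. 'uberon:0002048'."""
        return f"{self.prefix}:{self.identifier}"

    @classmethod
    def from_curie(cls, curie: str, name: str | None = None) -> Reference:
        """Split a CURIE on its first colon into prefix and identifier."""
        prefix, sep, identifier = curie.partition(":")
        if not sep or not prefix or not identifier:
            raise ValueError(f"not a CURIE: {curie!r}")
        return cls(prefix, identifier, name)


ref = Reference.from_curie("uberon:0002048", name="lung")
print(ref)
print(ref.curie)
```

The string form is convenient for joins and exports, while the structured form saves re-parsing when you need the prefix or the name on its own.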

Single vocabulary

If you’re looking to ground a column to a single ontology/database, you can load an SSSLM grounder via PyOBO’s pyobo.get_grounder(), as in the following script.

Before running the following, make sure you do pip install pandas "pyobo[gilda-slim]>=0.12.0" (the quotes keep the shell from interpreting >= as an output redirect).

# /// script
# requires-python = ">=3.10"
# dependencies = [
#     "pandas",
#     "pyobo[gilda-slim]>=0.12.0",
# ]
# ///

import pandas as pd
import pyobo

uberon_grounder = pyobo.get_grounder("uberon")

data_url = "https://raw.githubusercontent.com/OBOAcademy/obook/master/docs/tutorial/linking_data/data.csv"
df = pd.read_csv(data_url)
df = df[["species", "strain", "organ"]]

# this adds a new column `organ_curie` that has strings
# for the Bioregistry-standardized CURIEs
uberon_grounder.ground_df(df, "organ", target_column="organ_curie")

# this adds a new column `organ_reference` that has reference objects
# for Bioregistry-standardized references (e.g., pre-parsed prefix, identifier, and name)
uberon_grounder.ground_df(
    df, "organ", target_column="organ_reference", target_type="reference"
)

# print the final dataframe to show below
print(df.to_markdown(tablefmt="rst", index=False))

This returns the following:

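=======  =======  ==============  ==============  ==========================================================
species  strain   organ           organ_curie     organ_reference
=======  =======  ==============  ==============  ==========================================================
RAT      F 344/N  LUNG            uberon:0002048  prefix='uberon' identifier='0002048' name='lung'
MOUSE    B6C3F1   NOSE            uberon:0000004  prefix='uberon' identifier='0000004' name='nose'
RAT      F 344/N  ADRENAL CORTEX  uberon:0001235  prefix='uberon' identifier='0001235' name='adrenal cortex'
=======  =======  ==============  ==============  ==========================================================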

Pre-constructed lexica

In the following example, we load two pre-constructed lexica for diseases/phenotypes and for anatomical terms from the Biolexica project. These lexica are the union of multiple ontologies/databases that have been deduplicated using mappings assembled by SeMRA.

These lexica are a good choice when you’re not sure which vocabulary is best for your entity type.
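As a rough illustration of what "union, then deduplicate using mappings" can mean, the sketch below merges terms from two vocabularies and keeps one canonical identifier per group of mapped terms. It is not the procedure used by Biolexica or SeMRA. The function build_lexicon, the equivalence in MAPPINGS, and the order in PRIORITY are all assumptions made for this example; only the identifiers and labels come from the tables on this page.

```python
from __future__ import annotations

# Terms pooled from two vocabularies (CURIE -> label)
TERMS = {
    "uberon:0002048": "lung",
    "bto:0000763": "lung",
    "uberon:0001235": "adrenal cortex",  # left unmapped in this toy example
}
# Assumed equivalence between the two "lung" terms, for illustration only
MAPPINGS = [("uberon:0002048", "bto:0000763")]
# Assumed prefix priority: earlier prefixes win when terms are merged
PRIORITY = ["bto", "uberon"]


def build_lexicon(terms, mappings, priority):
    """Return label -> set of canonical CURIEs after merging mapped terms."""
    rank = {prefix: i for i, prefix in enumerate(priority)}
    parent = {curie: curie for curie in terms}

    def find(curie):
        # follow parent links to the canonical CURIE of this group
        while parent[curie] != curie:
            curie = parent[curie]
        return curie

    def sort_key(curie):
        prefix = curie.split(":", 1)[0]
        return rank.get(prefix, len(rank)), curie

    for left, right in mappings:
        for curie in (left, right):
            parent.setdefault(curie, curie)
        a, b = find(left), find(right)
        if a != b:
            # the higher-priority root becomes canonical for the merged group
            winner = min(a, b, key=sort_key)
            loser = b if winner == a else a
            parent[loser] = winner

    lexicon = {}
    for curie, label in terms.items():
        lexicon.setdefault(label.casefold(), set()).add(find(curie))
    return lexicon


print(build_lexicon(TERMS, MAPPINGS, PRIORITY))
```

With a different priority order, the same label would resolve to a different prefix, which is one reason a merged lexicon can return identifiers from a different vocabulary than a single-ontology grounder does.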

Warning

Pre-construction of lexica in the Biolexica project is part of ongoing research, and is subject to change.

# /// script
# requires-python = ">=3.10"
# dependencies = [
#     "pandas",
#     "ssslm[gilda-slim]",
# ]
# ///

import pandas as pd
import ssslm

mappings_fmt = "https://github.com/biopragmatics/biolexica/raw/main/lexica/{key}/{key}.ssslm.tsv.gz"

phenotype_grounder = ssslm.make_grounder(mappings_fmt.format(key="phenotype"))
anatomy_grounder = ssslm.make_grounder(mappings_fmt.format(key="anatomy"))

# you can also do the following, if you `pip install biolexica`:
# import biolexica
# phenotype_grounder = biolexica.load_grounder("phenotype")
# anatomy_grounder = biolexica.load_grounder("anatomy")

data_url = "https://raw.githubusercontent.com/OBOAcademy/obook/master/docs/tutorial/linking_data/data.csv"
df = pd.read_csv(data_url)
df = df[["species", "strain", "organ", "disease"]]
print(df.to_markdown(tablefmt="rst", index=False))

# this adds a new column `organ_curie` that has strings
# for the Bioregistry-standardized CURIEs
anatomy_grounder.ground_df(df, "organ", target_column="organ_curie")

# this adds a new column `organ_reference` that has reference objects
# for Bioregistry-standardized references (e.g., pre-parsed prefix, identifier, and name)
anatomy_grounder.ground_df(
    df, "organ", target_column="organ_reference", target_type="reference"
)

# this adds a new column `disease_curie` that has strings
# for the Bioregistry-standardized CURIEs
phenotype_grounder.ground_df(df, "disease", target_column="disease_curie")

# this adds a new column `disease_reference` that has reference objects
# for Bioregistry-standardized references (e.g., pre-parsed prefix, identifier, and name)
phenotype_grounder.ground_df(
    df, "disease", target_column="disease_reference", target_type="reference"
)

# print the final dataframe to show below
print(df.to_markdown(tablefmt="rst", index=False))

Here’s what it looks like in the end:

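=======  =======  ==============  ==============  ===========  =======================================================  =============  ======================================================
species  strain   organ           disease         organ_curie  organ_reference                                          disease_curie  disease_reference
=======  =======  ==============  ==============  ===========  =======================================================  =============  ======================================================
RAT      F 344/N  LUNG            ADENOCARCINOMA  bto:0000763  prefix='bto' identifier='0000763' name='lung'            doid:299       prefix='doid' identifier='299' name='adenocarcinoma'
MOUSE    B6C3F1   NOSE            INFLAMMATION    bto:0000840  prefix='bto' identifier='0000840' name='nose'            symp:0000061   prefix='symp' identifier='0000061' name='inflammation'
RAT      F 344/N  ADRENAL CORTEX  NECROSIS        bto:0000045  prefix='bto' identifier='0000045' name='adrenal cortex'  symp:0000132   prefix='symp' identifier='0000132' name='necrosis'
=======  =======  ==============  ==============  ===========  =======================================================  =============  ======================================================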