Named Entity Recognition
The SSSLM software package contains a submodule ssslm.ner for named entity
normalization (NEN) and named entity recognition (NER) and that provides a standard
class API and data model encoded with pydantic models.
By default, SSSLM wraps the NEN/NER system implemented in gilda because of its
speed and lack of heavy dependencies. SSSLM also wraps the more powerful spacy
and gliner NER systems, though they require more complex installation, setup, and
configuration.
The following NEN systems have been directly wrapped by SSSLM:
NEN System |
Class |
Implementation |
|---|---|---|
Dictionary lookup |
The following NER systems have been directly wrapped by SSSLM:
NER System |
Class |
Implementation |
|---|---|---|
Dictionary lookup |
||
transition-based sequence model |
||
Bi-directional transformer (BERT) |
SSSLM can be extended to other NER/NEN systems by subclassing ssslm.ner.Matcher
(for NEN), ssslm.ner.Annotator (for NER), or ssslm.ner.Grounder (for
combine NEN/NER).
Case Study
The following examples are about grounding the labels for diseases and organs appearing in the example table from the OBO Academy’s tutorial From Tables to Linked Data to demonstrate grounding.
The initial table looks like this:
species |
strain |
organ |
disease |
|---|---|---|---|
RAT |
F 344/N |
LUNG |
ADENOCARCINOMA |
MOUSE |
B6C3F1 |
NOSE |
INFLAMMATION |
RAT |
F 344/N |
ADRENAL CORTEX |
NECROSIS |
Our goal is to look up the best possible ontology/database identifiers first for organs int the second-to-last column then for diseases in the last column. The example code shows two different flavors of grounding for both entity types:
By adding a column
organ_curiethat contains string representations of the references to external ontologies like the Uber Anatomy Ontology (UBERON) and Brenda Tissue Ontology (BTO).By adding a column
organ_referencethat contains a data structure with the prefix, identifier, and name of the references.By adding a column
disease_curiethat contains string representations of the references to external ontologies like the Disease Ontology (DOID) and Symptom Ontology (SYMP).By adding a column
disease_referencethat contains a data structure with the prefix, identifier, and name of the references.
Single vocabulary
If you’re looking to ground a column to a single ontology/database, you can load a SSSLM
grounder via PyOBO’s pyobo.get_grounder() like:
Before running the following, make sure you do pip install pandas
pyobo[gilda-slim]>=0.12.0.
# /// script
# requires-python = ">=3.10"
# dependencies = [
# "pandas",
# "pyobo[gilda-slim]>=0.12.0",
# ]
# ///
import pandas as pd
import pyobo
uberon_grounder = pyobo.get_grounder("uberon")
data_url = "https://raw.githubusercontent.com/OBOAcademy/obook/master/docs/tutorial/linking_data/data.csv"
df = pd.read_csv(data_url)
df = df[["species", "strain", "organ"]]
# this adds a new column `organ_curie` that has strings
# for the Bioregistry-standardized CURIEs
uberon_grounder.ground_df(df, "organ", target_column="organ_curie")
# this adds a new column `organ_reference` that has reference objects
# for Bioregistry-standardized references (e.g., pre-parsed prefix, identifier, and name)
uberon_grounder.ground_df(
df, "organ", target_column="organ_reference", target_type="reference"
)
# print the final dataframe to show below
print(df.to_markdown(tablefmt="rst", index=False))
This returns the following:
species |
strain |
organ |
organ_curie |
organ_reference |
|---|---|---|---|---|
RAT |
F 344/N |
LUNG |
uberon:0002048 |
prefix=’uberon’ identifier=’0002048’ name=’lung’ |
MOUSE |
B6C3F1 |
NOSE |
uberon:0000004 |
prefix=’uberon’ identifier=’0000004’ name=’nose’ |
RAT |
F 344/N |
ADRENAL CORTEX |
uberon:0001235 |
prefix=’uberon’ identifier=’0001235’ name=’adrenal cortex’ |
Pre-constructed lexica
In the following example, we load two pre-constructed lexica for diseases/phenotypes and for anatomical terms from the Biolexica project. These lexica are the union of multiple ontologies/databases that have been deduplicated using mappings assembled by SeMRA.
These lexica are good when you’re not sure what’s the best vocabulary for your given entity type.
Warning
Pre-construction of lexica in the Biolexica project is part of ongoing research, and is subject to change.
# /// script
# requires-python = ">=3.10"
# dependencies = [
# "pandas",
# "ssslm[gilda-slim]",
# ]
# ///
import pandas as pd
import ssslm
mappings_fmt = "https://github.com/biopragmatics/biolexica/raw/main/lexica/{key}/{key}.ssslm.tsv.gz"
phenotype_grounder = ssslm.make_grounder(mappings_fmt.format(key="phenotype"))
anatomy_grounder = ssslm.make_grounder(mappings_fmt.format(key="anatomy"))
# you can also do the following, if you `pip install biolexica`:
# import biolexica
# phenotype_grounder = biolexica.load_grounder("phenotype")
# anatomy_grounder = biolexica.load_grounder("anatomy")
data_url = "https://raw.githubusercontent.com/OBOAcademy/obook/master/docs/tutorial/linking_data/data.csv"
df = pd.read_csv(data_url)
df = df[["species", "strain", "organ", "disease"]]
print(df.to_markdown(tablefmt="rst", index=False))
# this adds a new column `organ_curie` that has strings
# for the Bioregistry-standardized CURIEs
anatomy_grounder.ground_df(df, "organ", target_column="organ_curie")
# this adds a new column `organ_reference` that has reference objects
# for Bioregistry-standardized references (e.g., pre-parsed prefix, identifier, and name)
anatomy_grounder.ground_df(
df, "organ", target_column="organ_reference", target_type="reference"
)
# this adds a new column `disease_curie` that has strings
# for the Bioregistry-standardized CURIEs
phenotype_grounder.ground_df(df, "disease", target_column="disease_curie")
# this adds a new column `disease_curie` that has reference objects
# for Bioregistry-standardized references (e.g., pre-parsed prefix, identifier, and name)
phenotype_grounder.ground_df(
df, "disease", target_column="disease_reference", target_type="reference"
)
# print the final dataframe to show below
print(df.to_markdown(tablefmt="rst", index=False))
Here’s what it looks like in the end:
species |
strain |
organ |
disease |
organ_curie |
organ_reference |
disease_curie |
disease_reference |
|---|---|---|---|---|---|---|---|
RAT |
F 344/N |
LUNG |
ADENOCARCINOMA |
bto:0000763 |
prefix=’bto’ identifier=’0000763’ name=’lung’ |
doid:299 |
prefix=’doid’ identifier=’299’ name=’adenocarcinoma’ |
MOUSE |
B6C3F1 |
NOSE |
INFLAMMATION |
bto:0000840 |
prefix=’bto’ identifier=’0000840’ name=’nose’ |
symp:0000061 |
prefix=’symp’ identifier=’0000061’ name=’inflammation’ |
RAT |
F 344/N |
ADRENAL CORTEX |
NECROSIS |
bto:0000045 |
prefix=’bto’ identifier=’0000045’ name=’adrenal cortex’ |
symp:0000132 |
prefix=’symp’ identifier=’0000132’ name=’necrosis’ |