Presentation

PathLlama: A Language Model for Automated Cancer Surveillance
Description: Transforming unstructured information into structured common
data models (CDMs) is a critical step for enabling cancer
surveillance and advancing precision medicine. CDMs standardize
the structure and content of oncologic data extracted
from electronic health records. Unfortunately, traditional extract,
transform, load (ETL) processes for electronic health data capture
are generally rule-based, error-prone, and produce static
datasets unsuitable for near real-time information retrieval.

The Modeling Outcomes using Surveillance Data and Scalable
AI for Cancer (MOSSAIC) project developed and deployed
a hierarchical self-attention (HiSAN) model capable
of autocoding approximately 30% of National Cancer Institute
Surveillance, Epidemiology, and End Results (SEER) registry
cancer pathology reports [1], [2]. While a significant step
forward, this falls short of the broader goal of automatically
coding all pathology reports. Fully automating CDM conversion
would facilitate clinical trial matching, decision support
dashboards, real-time case ascertainment, and population
health surveillance.

The distribution of cancer phenotypes in real-world data is
highly imbalanced. While HiSAN performs well on classes
well-represented during training, its accuracy and confidence
degrade substantially for less common categories. Large language
models (LLMs) offer a promising solution for underrepresented
oncological entities, owing to their ability to
leverage context and pretraining. Rather than relying solely
on general-purpose models, domain adaptation or continual
pretraining of LLMs may further improve performance by
helping models learn the specialized vocabulary, abbreviations,
and context typical of clinical text. In this study, we finetune
LLMs for SEER pathology report classification, with and
without additional domain-adaptive pretraining, and compare
the results to the HiSAN baseline [2].

PathLlama was developed by finetuning Llama 3 8B
for cancer pathology report classification, with and without
domain adaptation. The domain adaptation task was next-token
prediction, and the pretraining dataset was composed of a
large corpus of approximately 10M cancer pathology reports
and abstracts from SEER and about 500k clinical notes and
radiology reports from MIMIC [3]. The PathLlama models
were finetuned to classify site (70 categories), subsite (330),
laterality (7), histology (677), and behavior (4). The finetuning
dataset comprised 4,052,951 reports from six SEER registries:
Kentucky, Louisiana, New Jersey, New Mexico, Seattle/Puget
Sound, and Utah. The finetuning dataset was randomly split
into 80%/10%/10% for training, test, and validation, ensuring
all reports associated with a single case belong to the same
split.
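
The case-level split described above can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the function name `split_by_case`, the idea that each report carries a case ID, and the fixed seed are all assumptions made for the example. Because the 80%/10%/10% fractions are applied to cases rather than to reports, the report-level proportions are only approximate.

```python
import random
from collections import defaultdict


def split_by_case(report_case_ids, fractions=(0.8, 0.1, 0.1), seed=0):
    """Assign reports to train/test/val so all reports of a case share a split.

    report_case_ids: list where item i is the (hypothetical) case ID of report i.
    Returns a dict mapping split name -> list of report indices.
    """
    # Shuffle the unique case IDs, not the reports themselves.
    cases = sorted(set(report_case_ids))
    random.Random(seed).shuffle(cases)

    n_train = int(fractions[0] * len(cases))
    n_test = int(fractions[1] * len(cases))

    # Every case gets exactly one split; the remainder goes to validation.
    case_to_split = {}
    for i, case in enumerate(cases):
        if i < n_train:
            case_to_split[case] = "train"
        elif i < n_train + n_test:
            case_to_split[case] = "test"
        else:
            case_to_split[case] = "val"

    # Each report inherits the split of its case.
    splits = defaultdict(list)
    for idx, case in enumerate(report_case_ids):
        splits[case_to_split[case]].append(idx)
    return {name: splits[name] for name in ("train", "test", "val")}
```

Splitting on case IDs rather than on individual reports keeps reports from the same case out of both the training and evaluation sets, which would otherwise inflate the measured scores.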

Finetuning results are shown in Table I. We observe that
the micro F1 scores, dominated by majority classes due to the
imbalance in the dataset, improve only slightly from the
HiSAN to either of the PathLlama models. The most notable
improvements in micro F1 come from the domain-adapted
PathLlama for subsite and laterality. In contrast, more significant
improvements occur for macro F1, particularly for subsite,
laterality, and histology. For these three tasks, the domain-adapted
PathLlama model also substantially outperforms the
PathLlama base model. From these macro F1 results, we find
that the contextual and pretraining advantages of Llama itself
are indeed sufficient to markedly improve classification performance
on underrepresented classes. Domain adaptation offers
additional benefit, and the further gains justify the increased
computational cost associated with extended pretraining.
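
To make the distinction between the two metrics concrete, the sketch below computes micro and macro F1 for a single-label classification task in plain Python. It is an illustration only: the function name `f1_scores` and the toy labels are assumptions, and the numbers are not taken from Table I.

```python
from collections import Counter


def f1_scores(y_true, y_pred):
    """Return (micro_f1, macro_f1) for single-label multiclass predictions."""
    labels = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1  # predicted class gains a false positive
            fn[t] += 1  # true class gains a false negative

    def f1(tp_, fp_, fn_):
        denom = 2 * tp_ + fp_ + fn_
        return 2 * tp_ / denom if denom else 0.0

    # Micro: pool counts over all classes, so frequent classes dominate.
    micro = f1(sum(tp.values()), sum(fp.values()), sum(fn.values()))
    # Macro: average per-class F1, so every class counts equally.
    macro = sum(f1(tp[c], fp[c], fn[c]) for c in labels) / len(labels)
    return micro, macro


# Toy imbalanced example: the classifier never predicts the rare class.
y_true = ["common"] * 95 + ["rare"] * 5
y_pred = ["common"] * 100
micro, macro = f1_scores(y_true, y_pred)
print(round(micro, 3), round(macro, 3))  # prints 0.95 0.487
```

In this toy case the micro F1 stays high even though the rare class is never recognized, while the macro F1 drops sharply. This is why gains on underrepresented classes show up mainly in macro F1.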