Research Paper | Open access | Published: 2026

Find My Trial: A Patient-Centered NLP System for Clinical Trial Matching

NSRI Research Papers 1, 1-8 (2026)

Abstract

Clinical trial discovery is a major challenge for patients and clinicians due to the scale of existing registries and the complexity of eligibility criteria written in biomedical language. This work presents Find My Trial, a novel, interpretable natural language processing (NLP) system designed to perform real-time matching between patient free-text descriptions and more than 500,000 clinical trials. The pipeline consists of preprocessing, TF-IDF vectorization, cosine-similarity ranking, and optional biomedical language model embeddings, which allows scalable retrieval of trials without sacrificing transparency. Qualitative evaluations indicate that the system tolerates noisy patient input, efficiently identifies semantically relevant trials, and supports use cases ranging from patient self-navigation to clinical decision support. By prioritizing computational efficiency, interpretability, and accessibility, Find My Trial offers a practical framework for improving clinical trial discoverability and supporting more informed treatment decisions across diverse healthcare environments.

Introduction

Clinical trials are necessary for advancing medical knowledge and providing patients with access to new treatments. However, for many patients, especially those managing late-stage cancer or rare, chronic conditions, identifying an appropriate clinical trial can be challenging.

Patients and their families often struggle to search trial databases, understand complicated eligibility criteria, and determine which studies are most aligned with their conditions and treatment history. These barriers exist even as the number of clinical trials available globally has expanded. ClinicalTrials.gov alone now contains hundreds of thousands of studies, and the volume of information returned from queries makes manual search nearly impossible for many users.

Compounding this challenge is the complexity of clinical trial text. Eligibility criteria are written using biomedical terminology and dense clinical language, which many patients cannot easily interpret. These barriers disproportionately affect people with limited clinical knowledge and can delay entry into therapies, especially in late-stage disease.

Find My Trial responds to these challenges by combining lightweight NLP methods with modern biomedical language models. The system enables patient-centered, real-time matching across more than 500,000 clinical trials and demonstrates how accessible computational tools can help bridge long-standing gaps in clinical trial discoverability.

Background

Efforts to improve clinical trial accessibility have focused on decision-support systems, yet significant limitations remain across existing platforms. Although ClinicalTrials.gov is the world's largest registry of clinical research, studies have documented inconsistencies in reporting, variability in data quality, and structural limitations that hinder effective search and retrieval.

Academic work on computational trial-matching systems has attempted to address these barriers through structured retrieval methods and rule-based matching. However, many early systems lacked robustness, struggled with diverse eligibility language, or required highly structured input that most patients could not provide.

Recent improvements in NLP algorithms have made it easier to identify biomedical entities, symptoms, and illnesses in unstructured language. BioBERT and ClinicalBERT have performed well on clinical classification and entity extraction tasks, while TF-IDF weighting and vector-space similarity remain fast, interpretable, and effective for large-scale retrieval.

Despite these advances, existing clinical trial search systems rarely combine patient-centered free-text input with large-scale NLP-driven semantic matching. Most lack integrated preprocessing pipelines that normalize patient language or align it with the terminology used in trial registries.

Framework Design

The Find My Trial system is built as an adjustable NLP and information-retrieval pipeline designed for large-scale clinical trial matching. Its architecture follows a linear flow: load and structure the ClinicalTrials.gov dataset, preprocess trial text, generate vector representations, encode the patient query using the same pipeline, compute similarity scores, and rank more than 500,000 trials in real time.

Data loading and matrix construction are performed with pandas, while numerical operations such as sparse data handling, large-matrix manipulation, score normalization, and vector arithmetic rely on NumPy. The text for each trial is built by combining the study title, summary, conditions, interventions, and locations into a single searchable representation.
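
As a minimal sketch of the field-combination step, the snippet below joins the five descriptive fields into one string per trial with pandas. The column names, the `build_trial_text` helper, and the two sample rows are assumptions made for illustration; the actual ClinicalTrials.gov export and the system's own code may differ.

```python
import pandas as pd

# Hypothetical column names; the real dataset export may label these differently.
TEXT_COLUMNS = ["title", "summary", "conditions", "interventions", "locations"]


def build_trial_text(trials: pd.DataFrame) -> pd.Series:
    """Join the descriptive fields of each trial into one searchable string."""
    # Missing fields become empty strings so they do not break the join.
    fields = trials[TEXT_COLUMNS].fillna("").astype(str)
    combined = fields.agg(" ".join, axis=1)
    # Collapse the extra whitespace that empty fields leave behind.
    return combined.str.replace(r"\s+", " ", regex=True).str.strip()


# Made-up sample rows, not real registry records.
trials = pd.DataFrame({
    "trial_id": ["T1", "T2"],
    "title": ["Immunotherapy in advanced melanoma",
              "Exercise program for chronic back pain"],
    "summary": ["Tests a checkpoint inhibitor.", None],
    "conditions": ["Melanoma", "Back Pain"],
    "interventions": ["Drug", "Behavioral"],
    "locations": ["Boston", None],
})
trials["text"] = build_trial_text(trials)
```

Filling missing values before joining keeps sparsely described trials in the index instead of dropping them.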

Because raw clinical trial text contains punctuation, stopwords, inconsistent capitalization, and domain-specific phrasing, NLP preprocessing is essential. The system uses NLTK for English stopword filtering and token normalization, and spaCy for tokenization and lemmatization. These preprocessing steps reduce noise, collapse linguistic variation, and improve the stability of similarity scores across diverse patient inputs.
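
The shape of this preprocessing step can be illustrated with a standard-library stand-in. The sketch below is not the NLTK and spaCy pipeline described above: the short stopword set and the `crude_lemma` plural-stripping rule are simplified assumptions that only mimic what stopword filtering and lemmatization do.

```python
import re

# Tiny illustrative stopword set (assumption); NLTK's English list is much longer.
STOPWORDS = {
    "a", "an", "and", "the", "of", "with", "for", "in", "on", "to",
    "i", "my", "have", "has", "been", "is", "was",
}

TOKEN_RE = re.compile(r"[a-z0-9]+")


def crude_lemma(token: str) -> str:
    """Rough stand-in for lemmatization: strip a trailing plural 's'."""
    if len(token) > 3 and token.endswith("s") and not token.endswith("ss"):
        return token[:-1]
    return token


def preprocess(text: str) -> list[str]:
    """Lowercase, tokenize, drop stopwords, and reduce simple plurals."""
    tokens = TOKEN_RE.findall(text.lower())
    return [crude_lemma(t) for t in tokens if t not in STOPWORDS]


print(preprocess("I have Stage IV lung cancer, with tumors in the liver!"))
# ['stage', 'iv', 'lung', 'cancer', 'tumor', 'liver']
```

A real lemmatizer handles irregular forms that this plural rule gets wrong, which is one reason to use a trained pipeline in practice.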

For document representation, the system primarily uses TF-IDF vectorization through scikit-learn. TF-IDF is chosen for its speed, interpretability, and ability to scale to hundreds of thousands of documents. Cosine similarity then compares patient input with trial records and returns ranked matches.
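
The following sketch shows how TF-IDF vectorization and cosine-similarity ranking fit together in scikit-learn. The three trial descriptions, the `rank_trials` helper, and the vectorizer settings are assumptions chosen for illustration, not the system's actual configuration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Made-up trial descriptions standing in for the registry text.
trial_texts = [
    "Immunotherapy for advanced non-small cell lung cancer",
    "Insulin dosing in adults with type 2 diabetes",
    "Chemotherapy combination for metastatic lung cancer",
]

# Fit once on the trial corpus; the same vectorizer later encodes queries.
vectorizer = TfidfVectorizer(stop_words="english")
trial_matrix = vectorizer.fit_transform(trial_texts)


def rank_trials(query: str, top_k: int = 3) -> list[tuple[int, float]]:
    """Return (trial index, cosine score) pairs, best match first."""
    query_vec = vectorizer.transform([query])
    scores = cosine_similarity(query_vec, trial_matrix).ravel()
    order = scores.argsort()[::-1][:top_k]
    return [(int(i), float(scores[i])) for i in order]


for index, score in rank_trials("lung cancer that has spread"):
    print(f"{score:.3f}  {trial_texts[index]}")
```

Encoding the query with the already fitted vectorizer places patient text and trial text in the same vocabulary space, which is what makes the cosine scores comparable.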

Behavior and Performance

Across repeated tests with varied patient inputs, the system demonstrated stable behavior even when queries contained misspellings, fragmented symptom descriptions, or mixed lay and clinical terminology. The preprocessing pipeline, including tokenization and lemmatization, plays an important role in this resilience.

Performance is influenced by the scale of the vectorized corpus. Because the system maintains TF-IDF representations for more than 500,000 trials, it supports near-instantaneous cosine similarity lookup after the initial vectorizer fitting step. Optional BERT-based embeddings are computationally heavier but can be cached to avoid repeated computation.
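
One way to cache embeddings is an in-memory store keyed by a hash of the input text, sketched below. The `EmbeddingCache` class is hypothetical, and `fake_encode` is a placeholder that returns a dummy vector; a real deployment would call a biomedical language model there instead.

```python
import hashlib
from typing import Callable

Vector = list[float]


class EmbeddingCache:
    """Store each text's embedding so the encoder runs once per distinct text."""

    def __init__(self, encode: Callable[[str], Vector]) -> None:
        self._encode = encode
        self._store: dict[str, Vector] = {}
        self.misses = 0  # number of times the encoder was actually called

    def get(self, text: str) -> Vector:
        key = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if key not in self._store:
            self.misses += 1
            self._store[key] = self._encode(text)
        return self._store[key]


def fake_encode(text: str) -> Vector:
    """Placeholder for a model call; returns a deterministic dummy vector."""
    digest = hashlib.sha256(text.encode("utf-8")).digest()
    return [byte / 255.0 for byte in digest[:8]]


cache = EmbeddingCache(fake_encode)
first = cache.get("metastatic lung cancer")
second = cache.get("metastatic lung cancer")  # served from the cache
print(cache.misses)  # 1
```

Trial embeddings change only when the registry data changes, so they could also be computed offline and persisted to disk, leaving only the patient query to encode at request time.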

During testing, the top-ranked trials consistently reflected logical relevance to the provided inputs. Trials mentioning conditions and symptoms in the input text often surfaced near the top of the results, while less relevant trials naturally fell toward the bottom.

Design Advantages

One of the system's major strengths is transparency. Unlike black-box transformer architectures that require complex explanation strategies, the TF-IDF and cosine similarity framework provides understandable reasoning. Users can trace which terms in a query influenced the score, and developers can debug or adjust pipeline components more easily.

Scalability is another core advantage. Sparse TF-IDF matrices are memory-efficient, and cosine similarity can be computed rapidly using optimized linear algebra operations. This makes it feasible to match across more than 500,000 trials without specialized hardware.
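
A sketch of why this scales: with L2-normalized sparse vectors, scoring every trial is a single sparse matrix product, and the best matches can be selected without sorting the full score array. The small hand-built matrix and the `top_k_matches` helper below are assumptions for illustration only.

```python
import numpy as np
from scipy import sparse


def top_k_matches(query_vec, trial_matrix, k: int) -> np.ndarray:
    """Return indices of the k highest-scoring trials, best first."""
    # For L2-normalized rows, this sparse product equals cosine similarity.
    scores = (trial_matrix @ query_vec.T).toarray().ravel()
    k = min(k, scores.size)
    # argpartition selects the k largest scores without a full sort.
    candidates = np.argpartition(scores, -k)[-k:]
    return candidates[np.argsort(scores[candidates])[::-1]]


# Four unit-length "trial" rows over a five-term vocabulary (made-up values).
trial_matrix = sparse.csr_matrix(np.array([
    [1.0, 0.0, 0.0, 0.0, 0.0],
    [0.6, 0.8, 0.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 0.0, 0.0],
    [0.0, 0.8, 0.6, 0.0, 0.0],
]))
query_vec = sparse.csr_matrix(np.array([[0.6, 0.8, 0.0, 0.0, 0.0]]))

print(top_k_matches(query_vec, trial_matrix, k=2))  # [1 3]
# Only nonzero weights are stored, which keeps large matrices compact.
print(trial_matrix.nnz, "stored values out of", trial_matrix.shape[0] * trial_matrix.shape[1])
```

Only the nonzero entries take part in the product, so the cost grows with the number of stored weights and not with the full vocabulary size.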

The system's flexibility with patient input is also important. Unlike current trial search platforms that require structured queries or strict keyword matching, Find My Trial accepts free-text input as patients might naturally describe symptoms or medical history. This reduces cognitive load and removes the burden of knowing exact clinical terminology.

Limitations

Several limitations remain. First, the quality and structure of ClinicalTrials.gov data vary widely. Some trials contain rich descriptive summaries while others provide minimal information, reducing matching precision. Inconsistent terminology across conditions or interventions can also weaken semantic alignment.

Second, although the preprocessing pipeline handles common variations and noisy patient input, more advanced clinical concept normalization is not yet implemented. Linking terms such as heart attack, myocardial infarction, and MI to a consistent medical concept would improve retrieval precision.
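
A minimal dictionary-based sketch of such normalization follows. The concept label, the synonym table, and the `normalize_concepts` helper are hypothetical; a production system would more plausibly map terms to a standard medical vocabulary and would need to disambiguate short abbreviations such as MI.

```python
import re

# Hypothetical synonym table: concept label -> surface forms seen in text.
CONCEPT_SYNONYMS = {
    "myocardial_infarction": ["heart attack", "myocardial infarction", "mi"],
}


def _compile(synonyms: dict[str, list[str]]) -> list[tuple[re.Pattern, str]]:
    """Build whole-word, case-insensitive patterns, longest form first."""
    pairs = [(form, concept) for concept, forms in synonyms.items() for form in forms]
    pairs.sort(key=lambda pair: len(pair[0]), reverse=True)
    return [
        (re.compile(rf"\b{re.escape(form)}\b", re.IGNORECASE), concept)
        for form, concept in pairs
    ]


PATTERNS = _compile(CONCEPT_SYNONYMS)


def normalize_concepts(text: str) -> str:
    """Replace each known surface form with its shared concept label."""
    for pattern, concept in PATTERNS:
        text = pattern.sub(concept, text)
    return text


print(normalize_concepts("History of MI in 2019"))
# History of myocardial_infarction in 2019
```

Applying the same mapping to trial text and patient queries before vectorization would let all three phrasings share a single TF-IDF term.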

Third, although the system can include BioBERT and ClinicalBERT, running these models in real time can introduce latency. Caching reduces this issue, but inference cost remains higher than TF-IDF-based scoring. Finally, the system still requires physician validation, user testing, and standardized quantitative evaluation before clinical deployment.

Conclusions

Find My Trial demonstrates that an interpretable, scalable, and computationally efficient NLP framework can identify clinical trials from patient-provided descriptions. By combining preprocessing, lightweight vectorization, and transparent similarity scoring, the system offers a practical alternative to traditional keyword-based search tools.

The system's commitment to transparency is especially important in clinical contexts, where patients and clinicians must be able to trust and examine the basis of recommendations. Optional biomedical language model integration further demonstrates that interpretability and modern NLP capabilities do not have to be mutually exclusive.

Ultimately, Find My Trial provides a foundation for patient-centered trial discovery that is accessible, efficient, and adaptable to diverse clinical scenarios. While further evaluation and refinement are needed, the current implementation illustrates the potential for intelligent matching tools to bridge long-standing gaps in clinical trial navigation.

Future Directions

Clinical trials remain the cornerstone of evidence-based medicine, but access and availability are still major challenges, especially for patients treated in community settings. Enrollment is limited outside university settings even though many patients receive care in the community.

Future work should enable community practitioners to participate more directly in the research process and enroll patients more efficiently. This could increase diversity in trial populations, expedite enrollment, and improve access to promising treatment options.

Declaration of Competing Interests

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Funding

This research was fully funded by the National Student Research Institution (NSRI).
