In 2026, precision oncology increasingly relies on multi-omics data coupled with clinical context. Merging single-cell RNA-Seq (scRNA-Seq) with electronic health record (EHR) data creates a holistic view of tumor biology and patient trajectories. This article outlines a reproducible pipeline that automates this integration with Python scripts and Docker containers, enabling researchers to generate actionable insights while maintaining compliance and scalability.
1. Define the Clinical Question and Data Scope
Before building the pipeline, clarify the scientific goal—e.g., identifying immune cell signatures predictive of immunotherapy response or mapping spatial heterogeneity to treatment resistance. This determines which EHR tables (diagnosis, medication, lab results) and which scRNA-Seq layers (raw counts, annotated cell types) are required.
Key Considerations
- Patient consent and de-identification requirements (HIPAA, GDPR).
- Temporal alignment: match scRNA-Seq sample dates to clinical events.
- Granularity: cell barcodes versus bulk summaries.
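The temporal-alignment consideration above can be sketched with pandas' `merge_asof`, which pairs each scRNA-Seq sample with the most recent preceding clinical event. Table and column names here are illustrative, not part of any standard schema:

```python
import pandas as pd

# Illustrative sample and encounter tables (names are hypothetical)
samples = pd.DataFrame({
    "patient_id": ["P1", "P1", "P2"],
    "sample_date": pd.to_datetime(["2026-01-10", "2026-03-05", "2026-02-20"]),
})
encounters = pd.DataFrame({
    "patient_id": ["P1", "P1", "P2"],
    "encounter_date": pd.to_datetime(["2026-01-02", "2026-02-28", "2026-02-01"]),
    "event": ["diagnosis", "chemo_start", "diagnosis"],
})

# For each sample, attach the latest encounter on or before the sample date
aligned = pd.merge_asof(
    samples.sort_values("sample_date"),
    encounters.sort_values("encounter_date"),
    by="patient_id",
    left_on="sample_date",
    right_on="encounter_date",
    direction="backward",
)
```

Because `merge_asof` matches backward in time per patient, a sample taken mid-treatment is linked to the treatment-start event rather than to a later encounter.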
2. Acquire and Prepare the Data
2.1. EHR Extraction
Use a FHIR (Fast Healthcare Interoperability Resources) server or an OMOP Common Data Model (CDM) database to pull standardized records. Python libraries such as fhirclient simplify FHIR access; OMOP CDM tables can be queried directly with SQL (for example, through sqlalchemy).
2.2. scRNA-Seq Retrieval
Obtain raw FASTQ files or processed count matrices from sequencing centers or repositories like ArrayExpress. Convert to an AnnData object using scanpy for consistent downstream analysis.
2.3. Harmonize Identifiers
Link samples to patients via a secure mapping table that maps sequencing batch IDs to de-identified EHR patient IDs. Store the mapping in an encrypted SQLite database accessible to the pipeline.
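A minimal sketch of such a mapping table using Python's built-in `sqlite3` (the encryption layer, e.g. SQLCipher, is assumed to be added separately; table and column names are illustrative):

```python
import sqlite3

# In production this would be an encrypted on-disk database, not :memory:
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE id_map (batch_id TEXT PRIMARY KEY, patient_id TEXT NOT NULL)"
)
conn.executemany(
    "INSERT INTO id_map VALUES (?, ?)",
    [("BATCH_001", "P1"), ("BATCH_002", "P2")],
)
conn.commit()

def resolve_patient(batch_id: str):
    """Look up the de-identified EHR patient ID for a sequencing batch."""
    row = conn.execute(
        "SELECT patient_id FROM id_map WHERE batch_id = ?", (batch_id,)
    ).fetchone()
    return row[0] if row else None
```

Keeping the mapping in its own database, separate from both the EHR extract and the count matrices, means the link between identities and omics data can be access-controlled and audited independently.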
3. Build the Integration Core with Python
3.1. Data Ingestion Layer
```python
import sqlite3

import pandas as pd
import scanpy as sc

# Connect to the de-identified EHR store (see Section 2.3)
conn = sqlite3.connect("ehr_mapping.db")

# Load EHR cohort
ehr_df = pd.read_sql_query(
    "SELECT * FROM patients WHERE cohort = 'oncology'", con=conn
)

# Load scRNA-Seq data as an AnnData object
adata = sc.read_h5ad("patient_scRNA.h5ad")
```
3.2. Feature Alignment
Map EHR phenotypes (e.g., Charlson score) to scRNA-Seq derived metrics (e.g., cell-type proportions). Use pyarrow for efficient in-memory joins.
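One concrete alignment is to collapse per-cell annotations into per-patient cell-type proportions and join them onto the EHR feature table. A pandas sketch with hypothetical values (in practice the annotations would come from `adata.obs`):

```python
import pandas as pd

# Per-cell annotations, as they might appear in adata.obs (hypothetical)
cells = pd.DataFrame({
    "patient_id": ["P1"] * 4 + ["P2"] * 2,
    "cell_type": ["T_cell", "T_cell", "B_cell", "NK", "T_cell", "B_cell"],
})

# Collapse to per-patient cell-type proportions
props = (
    cells.groupby("patient_id")["cell_type"]
    .value_counts(normalize=True)
    .unstack(fill_value=0)
)

# Join onto an EHR-derived feature such as a comorbidity score
ehr = pd.DataFrame(
    {"charlson_score": [3, 1]}, index=pd.Index(["P1", "P2"], name="patient_id")
)
features = ehr.join(props)
```

The resulting one-row-per-patient table is the natural input for the integrative models in Section 3.4.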
3.3. Data Quality Checks
- Drop cells with n_genes < 200 or percent.mt > 5%.
- Impute missing EHR fields with median or model-based imputation.
- Validate temporal consistency: ensure no scRNA-Seq sample predates the first clinical encounter.
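The cell-level filter above can be sketched on a per-cell QC metrics table. In scanpy these metrics are typically computed by `sc.pp.calculate_qc_metrics` (as `n_genes_by_counts` and `pct_counts_mt`); the frame below is illustrative:

```python
import pandas as pd

# Illustrative per-cell QC metrics
qc = pd.DataFrame({
    "barcode": ["AAAC", "AACT", "AGGT", "TTTG"],
    "n_genes": [150, 2500, 1800, 900],
    "percent_mt": [1.2, 4.0, 7.5, 2.1],
})

# Apply the thresholds from the checklist above:
# keep cells with at least 200 genes and at most 5% mitochondrial reads
keep = (qc["n_genes"] >= 200) & (qc["percent_mt"] <= 5.0)
filtered = qc[keep]
```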
3.4. Modeling and Analysis
Implement integrative models such as Multi-Omics Factor Analysis (MOFA+) or joint embeddings using scVI. Train the model inside a Docker container to guarantee reproducibility.
```python
# Example scVI training script
import scvi

# Register the AnnData object with scvi-tools before model construction
scvi.model.SCVI.setup_anndata(adata, labels_key="cell_type")

model = scvi.model.SCVI(adata)
model.train(max_epochs=200)
latent = model.get_latent_representation()
```
4. Containerize the Pipeline with Docker
4.1. Dockerfile Blueprint
```dockerfile
FROM python:3.11-slim

# Install system dependencies
RUN apt-get update && apt-get install -y build-essential libpq-dev

# Create working directory
WORKDIR /app

# Copy requirements and install Python packages
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# Copy source code
COPY . .

# Define entrypoint
ENTRYPOINT ["python", "run_pipeline.py"]
```
4.2. Multi-Stage Build for Size Optimization
Use a build stage to compile heavy libraries (e.g., scanpy and its numpy dependencies) and a runtime stage that strips development tools. This can shrink the final image considerably, often to a few hundred megabytes.
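A sketch of that two-stage layout, building wheels in the first stage and installing only the prebuilt wheels in the second:

```dockerfile
# Build stage: compile wheels with the full toolchain
FROM python:3.11-slim AS builder
RUN apt-get update && apt-get install -y build-essential
COPY requirements.txt .
RUN pip wheel --no-cache-dir -r requirements.txt -w /wheels

# Runtime stage: install prebuilt wheels only, no compilers
FROM python:3.11-slim
COPY --from=builder /wheels /wheels
RUN pip install --no-cache-dir /wheels/* && rm -rf /wheels
WORKDIR /app
COPY . .
ENTRYPOINT ["python", "run_pipeline.py"]
```

Because `build-essential` and the intermediate build artifacts live only in the `builder` stage, they never reach the runtime image.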
4.3. Docker Compose for Orchestration
```yaml
version: '3.8'
services:
  pipeline:
    build: .
    volumes:
      - ./data:/app/data
    environment:
      - DATABASE_URL=postgres://user:pass@db:5432/oncology
  db:
    image: postgres:15
    environment:
      POSTGRES_PASSWORD: pass
    volumes:
      - db-data:/var/lib/postgresql/data
volumes:
  db-data:
```
5. Automate Execution with CI/CD and Scheduler
5.1. CI/CD with GitHub Actions
Set up a workflow that triggers on pushes to the main branch, builds the Docker image, runs unit tests, and pushes the image to a container registry.
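A minimal workflow along those lines might look like the following (the registry path and the assumption that pytest is installed in the image are placeholders to adapt):

```yaml
name: pipeline-ci
on:
  push:
    branches: [main]
jobs:
  build-test-push:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Build image
        run: docker build -t ghcr.io/${{ github.repository }}/pipeline:latest .
      - name: Run unit tests inside the image
        run: docker run --rm --entrypoint pytest ghcr.io/${{ github.repository }}/pipeline:latest
      - name: Log in to registry
        run: echo "${{ secrets.GITHUB_TOKEN }}" | docker login ghcr.io -u ${{ github.actor }} --password-stdin
      - name: Push image
        run: docker push ghcr.io/${{ github.repository }}/pipeline:latest
```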
5.2. Scheduler for Data Ingestion
Use cron or Airflow to schedule nightly runs that pull fresh EHR updates and new sequencing data, then re-execute the pipeline.
6. Governance, Compliance, and Reproducibility
6.1. Audit Trails
Store container logs, pipeline timestamps, and dataset hashes in a separate audit table. This supports reproducibility and audit compliance.
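The dataset hashes can be computed with the standard-library hashlib and written to the audit store; a sketch with an illustrative audit schema:

```python
import hashlib
import sqlite3
from datetime import datetime, timezone
from pathlib import Path

def sha256_of(path: Path) -> str:
    """Stream a file through SHA-256 so large matrices never load fully into memory."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

# In production this is the separate audit database, not :memory:
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE audit (run_ts TEXT, dataset TEXT, sha256 TEXT)")

def record(path: Path) -> None:
    """Append a timestamped dataset hash to the audit table."""
    conn.execute(
        "INSERT INTO audit VALUES (?, ?, ?)",
        (datetime.now(timezone.utc).isoformat(), str(path), sha256_of(path)),
    )
    conn.commit()
```

Re-running the pipeline against the same inputs should reproduce the same hashes, which makes silent upstream data changes detectable.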
6.2. Data Provenance with W3C PROV
Embed provenance metadata into the final dataset using prov-python so that each record references its source files and processing steps.
6.3. Container Signing
Sign Docker images with Docker Content Trust to ensure that only authorized, untampered images are deployed in production environments.
7. Extending the Pipeline for Real-Time Clinical Decision Support
Integrate the model output into an EHR-based dashboard using FastAPI and a lightweight React front end. The API can expose risk scores and suggested biomarkers, enabling oncologists to access multi-omic insights at the point of care.
Conclusion
By systematically aligning single-cell transcriptomics with patient-level EHR data, and encapsulating the workflow in Docker containers orchestrated by automated pipelines, researchers can rapidly generate reproducible, clinically relevant insights. This modular approach ensures scalability, compliance, and the flexibility to incorporate emerging data types—paving the way for truly integrated precision oncology in 2026 and beyond.
