Research · Yueran (Hannah) Sun

UW · Lab for Computing Cultural Heritage

Graduate Student Researcher

Nov 2025 – Present

Examined clusters from over 1.5 million Library of Congress archival newspaper images to identify large-scale patterns of visual reuse and circulation in early 20th-century print media.
Applied CLIP embeddings with DBSCAN clustering to generate groupings of visually similar images, then computed cluster statistics to support systematic interpretation.
Built robust embedding-image lookup pipelines linking CLIP vectors, image IDs, filenames, and metadata, enabling reliable retrieval and scalable downstream analysis.
Performed zero-shot concept labeling on clusters by embedding candidate semantic categories and assigning top concepts using cosine similarity.
Quantified image circulation and reuse dynamics by measuring cluster frequency distributions, cross-newspaper diversity, and temporal spread.
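A minimal sketch of the zero-shot labeling and cluster-statistics steps above, using plain Python lists in place of real CLIP vectors. The function names, the toy three-dimensional embeddings, the concept labels, and the (newspaper, year) record layout are all assumptions made for illustration, not the project's actual code or data.

```python
import math
from collections import Counter


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def centroid(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def top_concepts(cluster_vectors, concept_vectors, k=1):
    """Rank candidate concept names by cosine similarity to the cluster centroid."""
    c = centroid(cluster_vectors)
    ranked = sorted(
        concept_vectors,
        key=lambda name: cosine(c, concept_vectors[name]),
        reverse=True,
    )
    return ranked[:k]


def cluster_stats(records):
    """Summarize one cluster given hypothetical (newspaper, year) records."""
    papers = Counter(paper for paper, _ in records)
    years = [year for _, year in records]
    return {
        "size": len(records),              # how many images fell in the cluster
        "n_newspapers": len(papers),       # cross-newspaper diversity
        "year_span": max(years) - min(years),  # temporal spread in years
    }


# Toy data (assumed): two image embeddings and three concept embeddings.
image_vecs = [[0.9, 0.1, 0.0], [0.8, 0.2, 0.1]]
concepts = {
    "portrait": [1.0, 0.0, 0.0],
    "map": [0.0, 1.0, 0.0],
    "cartoon": [0.0, 0.0, 1.0],
}
records = [("Paper A", 1905), ("Paper B", 1907), ("Paper A", 1912)]

print(top_concepts(image_vecs, concepts, k=2))
print(cluster_stats(records))
```

In practice the concept vectors would come from the same CLIP text encoder that is paired with the image encoder, so that image and text embeddings share one space.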

Research Collaborator

Sep 2025 – Present

Contributed to NeuroAdapt, a framework for personalizing content across neurodivergent profiles, by developing the core segmentation, alignment, and post-processing pipelines that compare original texts with their plain-language counterparts.
Built a semantic alignment pipeline using LASER-3 embeddings and VecAlign to support complex alignment types and capture sentence splitting, merging, expansions, and omissions.
Conducted quantitative evaluation using linguistic and readability metrics, complemented by qualitative analysis, to assess lexical substitutions, structural changes, semantic restorations, and tone and style adjustments.
Reviewed related work on adaptive reading systems, multi-mode reading interfaces, and accessibility-focused evaluation to identify transferable design and methodological insights for an adaptive text system in development.
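The alignment types mentioned above (splitting, merging, expansions, omissions) can be illustrated with a short sketch that labels each aligned pair by how many source and target sentences it contains. The representation of an alignment as a pair of index lists, the function name, and the label names are assumptions for illustration; this is not the NeuroAdapt code itself.

```python
from collections import Counter


def alignment_type(src_ids, tgt_ids):
    """Label one alignment by the number of sentences on each side."""
    if not src_ids and not tgt_ids:
        raise ValueError("an alignment needs at least one sentence")
    if not src_ids:
        return "expansion"   # plain-language text with no source sentence
    if not tgt_ids:
        return "omission"    # source sentence dropped in the plain version
    if len(src_ids) == 1 and len(tgt_ids) == 1:
        return "one-to-one"
    if len(src_ids) == 1:
        return "split"       # one source sentence became several
    if len(tgt_ids) == 1:
        return "merge"       # several source sentences became one
    return "many-to-many"


# Toy alignments (assumed): (original sentence indices, plain-language indices).
alignments = [
    ([0], [0]),
    ([1], [1, 2]),
    ([2, 3], [3]),
    ([4], []),
    ([], [4]),
]

counts = Counter(alignment_type(src, tgt) for src, tgt in alignments)
print(counts)
```

Counting the labels across a corpus gives a quick profile of how heavily a plain-language version restructures its source.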

Bioinformatics Research Assistant

Nov 2023 – May 2024 · Remote (Shanghai, China)

Pretrained scGPT, a Transformer-based model for gene expression prediction, on single-cell RNA sequencing data.
Curated and preprocessed a dataset of 300,000 human blood cells from the CELLxGENE repository, preserving organism-level structure while performing normalization, binning, and tokenization of highly variable genes.
Modified the Transformer encoder architecture to jointly embed gene identities and binned expression values.
Leveraged GPU-accelerated training to scale pretraining experiments, writing Bash scripts to automate job submission, checkpointing, and experiment tracking.
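As an illustration of the binning step in the preprocessing above, the sketch below maps one cell's non-zero expression values to integer bins using quantile cut points, leaving zeros in bin 0. It is a simplified, assumed version written for clarity; the function name, the toy values, and the exact cut-point rule are hypothetical and are not taken from the scGPT codebase.

```python
import bisect


def bin_expression(values, n_bins=5):
    """Map one cell's expression values to bins 0..n_bins (0 means not expressed)."""
    nonzero = sorted(v for v in values if v > 0)
    if not nonzero:
        return [0] * len(values)
    n = len(nonzero)
    # Cut points taken at evenly spaced positions in the sorted non-zero values.
    edges = [nonzero[min(n - 1, (i * n) // n_bins)] for i in range(1, n_bins)]
    # Non-zero values land in bins 1..n_bins; zeros keep their own bin 0.
    return [0 if v <= 0 else 1 + bisect.bisect_right(edges, v) for v in values]


# Toy cell (assumed): raw counts for seven genes.
cell = [0, 1, 2, 3, 4, 0, 8]
print(bin_expression(cell, n_bins=3))
```

Binning per cell rather than across the whole dataset keeps the token values comparable between cells with different sequencing depths.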