code | quinn

quinn's code showcase!

I started coding in Python in 2017, back when Codecademy had a functioning free tier and Stack Overflow was alive. As a result, I now speak Python more fluently than I speak Chinese (one of my two native languages), and I was a TA for several introductory Python courses at NYU [1].

While studying data science at NYU, I developed a strong foundation in statistical analysis and machine learning, and picked up frontend development as a hobby in my spare time. My work in data analysis and ML took me to the NYC Mayor's Office of Management and Budget, where I did just that, and my frontend explorations somehow landed me collaborations with Dan Toomey of YouTube's Good Work, Michelladonna of TikTok's Shop Cats, and Neal of neal.fun.

The following are three projects that showcase my work in data analysis, statistical testing, and machine learning. The repositories are hyperlinked in the project titles. To see more of my frontend work, you can have a browse here.

⇱ Folio: A Personalized arXiv Paper Recommendation Engine

Python · Machine Learning · NLP · Information Retrieval · Streamlit · Full Stack

Background: For our senior ML project, our group built a personalized arXiv paper recommendation engine. Folio generates a daily feed of papers tailored to your research interests, updates it in real time from your feedback, and runs entirely as a local Streamlit app.

At onboarding, your interests (free text, curated tags, or imported from Google Scholar) are decomposed into 1–3 "research thread" vectors. Those vectors drive daily retrieval from a k-means-clustered SPECTER2 embedding index, with a custom scoring function balancing relevance, recency, and feed diversity. Likes, saves, and skips update your interest vectors via exponential moving average.
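To make the feedback loop concrete, here is a minimal sketch of an exponential-moving-average update on research-thread vectors. It is an illustration under assumptions, not Folio's actual code: the function names, the feedback weights, and the smoothing factor `alpha` are all hypothetical, and routing feedback to the single closest thread is my guess at a reasonable design.

```python
import numpy as np

# Hypothetical feedback weights (not Folio's real values): positive signals
# pull a thread toward the paper, a skip pushes it slightly away.
FEEDBACK_WEIGHT = {"like": 1.0, "save": 1.5, "skip": -0.5}


def normalize(v):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    n = np.linalg.norm(v)
    return v / n if n > 0 else v


def ema_update(threads, paper_vec, action, alpha=0.1):
    """Apply one EMA step to the research thread closest to the paper.

    new_thread = (1 - alpha) * old_thread + alpha * (weight * paper_vec),
    then re-normalized so dot products stay comparable.
    Returns the updated thread matrix and the index of the thread that moved.
    """
    threads = np.asarray(threads, dtype=float)
    paper_vec = normalize(np.asarray(paper_vec, dtype=float))
    k = int(np.argmax(threads @ paper_vec))  # thread most similar to the paper
    target = FEEDBACK_WEIGHT[action] * paper_vec
    updated = threads.copy()
    updated[k] = normalize((1 - alpha) * threads[k] + alpha * target)
    return updated, k


# Toy 2-D example: two threads and one paper that sits closer to thread 1.
demo_threads = np.array([[1.0, 0.0], [0.0, 1.0]])
demo_paper = np.array([0.6, 0.8])
liked, moved = ema_update(demo_threads, demo_paper, "like")
skipped, _ = ema_update(demo_threads, demo_paper, "skip")
print(moved, liked[moved] @ demo_paper, skipped[moved] @ demo_paper)
```

In the toy example a like raises the moved thread's similarity to the paper and a skip lowers it, while the other thread is left untouched. A small `alpha` keeps any single click from hijacking a thread.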

I helped prototype and implement the recommendation system, including the k-means clustering and diversity scoring components. I also developed Folio's research lab view and worked on the user interface.

Architecture highlights:

Offline pipeline builds a memory-mapped SPECTER2 embedding matrix (~6 GB) and k-means cluster index; retrieval is dot-product search within candidate clusters, not a full scan.
Diversity scoring uses greedy selection with a tunable diversity index (δ) to prevent the feed from collapsing into whichever research thread dominates by similarity.
Query search expands short queries into scientific retrieval text, then ranks candidates by query similarity, profile similarity, recency, and lexical signals from title/abstract.
Workspace tab supports paper saving, AI synthesis and connection graphs (via OpenAI API), and PDF annotation in a Research Lab view.
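The first two highlights can be sketched in a few lines. This is a rough illustration, assuming an MMR-style greedy rule for the diversity step; the exact scoring function Folio uses is not reproduced here, and the names `retrieve_candidates`, `greedy_diverse`, `n_probe`, and the toy vectors are all hypothetical.

```python
import numpy as np


def retrieve_candidates(query, centroids, labels, embeddings, n_probe=1):
    """Dot-product search restricted to the n_probe clusters nearest the query."""
    nearest = np.argsort(centroids @ query)[::-1][:n_probe]
    idx = np.flatnonzero(np.isin(labels, nearest))  # rows in candidate clusters
    scores = embeddings[idx] @ query                # no full scan of the matrix
    return idx, scores


def greedy_diverse(idx, scores, embeddings, k=2, delta=0.5):
    """Greedily pick k papers, trading relevance against redundancy.

    Each step maximizes (1 - delta) * relevance - delta * max similarity
    to anything already picked. delta = 0 is plain top-k by relevance.
    """
    chosen, remaining = [], list(range(len(idx)))
    while remaining and len(chosen) < k:
        best, best_val = None, -np.inf
        for r in remaining:
            redundancy = max(
                (float(embeddings[idx[r]] @ embeddings[idx[c]]) for c in chosen),
                default=0.0,
            )
            val = (1 - delta) * scores[r] - delta * redundancy
            if val > best_val:
                best, best_val = r, val
        chosen.append(best)
        remaining.remove(best)
    return [int(idx[c]) for c in chosen]


# Toy data: three near-duplicate directions in cluster 0, two papers in cluster 1.
emb = np.array([[1.0, 0.0], [0.99, 0.14], [0.8, 0.6], [-1.0, 0.0], [-0.8, -0.6]])
labels = np.array([0, 0, 0, 1, 1])
centroids = np.array([[0.95, 0.25], [-0.9, -0.3]])
query = np.array([0.96, 0.28])

cand_idx, cand_scores = retrieve_candidates(query, centroids, labels, emb)
top_k = greedy_diverse(cand_idx, cand_scores, emb, k=2, delta=0.0)
diverse = greedy_diverse(cand_idx, cand_scores, emb, k=2, delta=0.5)
print(top_k, diverse)
```

With `delta=0.0` the toy feed takes the two most relevant papers, which are near-duplicates of each other. With `delta=0.5` the second slot goes to the less similar paper instead, which is the collapse-prevention behaviour described above.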

Key skills: vector search and embedding index design, k-means clustering for retrieval, recommendation system design (EMA centroid updates, diversity scoring), UMAP dimensionality reduction, full-stack Streamlit architecture, SQLite schema design.

⇱ The 411 on 311: NYC Service Requests, Mapped and Analyzed

Python · Data Analysis · Data Viz · HTML/CSS · JavaScript · Frontend

Background: My last apartment building had a tenant group chat that fired off roughly ten 311 requests a week — mice, roaches, leaks, heat, noise. NYC is just that, but hundreds of thousands of times over. The city's 311 system gets millions of requests a year routed across 17 agencies, but not every complaint gets the same response. I looked at whether ZIP code or complaint type predicts how long a New Yorker waits for the city to act, and mapped it out with interactive tooltips.

Data sources:

15.35 million 311 service requests (Oct 2020 – Oct 2025) from NYC OpenData, covering 202 ZIP codes, 248 complaint types, and 17 agencies.
NYC Modified Zip Code Tabulation Areas (MODZCTA) shapefile for geographic boundary mapping.

Findings: The average wait is ~20 days, but that masks a range from 4 hours to nearly 2 years depending on complaint type and location. Complaints routed to the NYPD (noise, parking) close in hours; those sent to regulatory agencies (housing, food inspection, tree requests) can sit open for months. Full findings and methods are here [2], and you can read the full story here; it has pretty maps!
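For a sense of how a per-agency wait time can be computed, here is a small standard-library sketch. It assumes each request is a dict with `created_date`, `closed_date`, and `agency` keys holding ISO timestamps; those field names and the sample rows are assumptions for illustration, not the project's data or pipeline (at 15 million rows, the real analysis would more plausibly use a dataframe library).

```python
from collections import defaultdict
from datetime import datetime
from statistics import mean, median


def wait_hours_by_agency(requests):
    """Return {agency: (mean_hours, median_hours)} for closed requests.

    Requests without a closing date are skipped, as are rows whose closing
    date precedes the creation date (treated here as data-entry errors).
    """
    waits = defaultdict(list)
    for r in requests:
        if not r.get("closed_date"):
            continue  # still open, no wait time to measure yet
        created = datetime.fromisoformat(r["created_date"])
        closed = datetime.fromisoformat(r["closed_date"])
        hours = (closed - created).total_seconds() / 3600
        if hours >= 0:
            waits[r["agency"]].append(hours)
    return {a: (mean(v), median(v)) for a, v in waits.items()}


# Made-up rows, only to exercise the function.
sample = [
    {"agency": "NYPD", "created_date": "2024-01-01T00:00:00", "closed_date": "2024-01-01T04:00:00"},
    {"agency": "NYPD", "created_date": "2024-01-02T10:00:00", "closed_date": "2024-01-02T12:00:00"},
    {"agency": "HPD", "created_date": "2024-01-01T00:00:00", "closed_date": "2024-01-11T00:00:00"},
    {"agency": "HPD", "created_date": "2024-01-05T00:00:00", "closed_date": None},
]
summary = wait_hours_by_agency(sample)
print(summary)
```

Reporting the median next to the mean is a deliberate choice in this sketch: with a tail that stretches to nearly two years, a handful of very old requests can drag the mean far from the typical wait.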

Key skills: large-scale data cleaning, geospatial analysis (GeoPandas/GeoJSON), D3.js choropleth mapping, scroll-driven data storytelling (scrollytelling).

⇱ NYC Congestion Pricing Air Quality Analysis

Python · Data Analysis · Data Viz · Causal Inference

Background: While working on the Policy & Operations task force at NYC OMB, I focused on one of congestion pricing's hoped-for side effects: air quality improvement. Car exhaust is a major PM2.5 source in high-traffic areas, so when the Congestion Relief Zone (CRZ) took effect on January 5, 2025, we set out to measure whether it moved the needle.

Data sources:

Hourly PM2.5 readings from 18 NYC air quality monitoring stations across all boroughs (Jan 2024 – May 2025), from the NYC DOT/NYSDEC sensor network.
Boston daily PM2.5 data as an external control city.
Station location/metadata CSV for geographic filtering.

Findings: PM2.5 was meaningfully lower across all five boroughs in spring 2025 vs. spring 2024 — but the same drop appears at control sites, making the CRZ hard to isolate. A difference-in-differences model using Boston as a control estimates a ~6% Manhattan-specific reduction, though it just misses the conventional significance threshold (p = 0.063). Directionally consistent with the CRZ working, but not conclusive yet.
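The difference-in-differences idea fits in one small regression. The sketch below assumes a log-PM2.5 outcome (so the interaction coefficient reads as an approximate percent change) and uses made-up numbers; neither the specification nor the data is taken from the actual analysis, and `did_estimate` is a hypothetical helper.

```python
import numpy as np


def did_estimate(y, treated, post):
    """Fit y = b0 + b1*treated + b2*post + b3*treated*post by OLS.

    b3 is the difference-in-differences estimate: the change in the treated
    group minus the change in the control group.
    """
    y = np.asarray(y, dtype=float)
    treated = np.asarray(treated, dtype=float)
    post = np.asarray(post, dtype=float)
    X = np.column_stack([np.ones_like(y), treated, post, treated * post])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return float(coef[3])


# Made-up PM2.5 values: the treated city falls 20%, the control city 10%.
pm25 = np.array([10.0, 12.0, 8.0, 9.6,   # treated city: before, before, after, after
                 9.0, 11.0, 8.1, 9.9])   # control city: before, before, after, after
treated = np.array([1, 1, 1, 1, 0, 0, 0, 0])
post = np.array([0, 0, 1, 1, 0, 0, 1, 1])

effect = did_estimate(np.log(pm25), treated, post)
pct_change = np.exp(effect) - 1  # convert the log-point effect to a percent
print(effect, pct_change)
```

The point of the control city is visible in the toy numbers: both cities get cleaner, and only the gap between their two drops is attributed to the policy. Judging significance, as with the p = 0.063 above, additionally needs standard errors, which this sketch leaves out.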

Key skills: multi-source data ingestion, time series aggregation, causal inference (DiD), statistical hypothesis testing, data visualization.

[1] believe me, my dual fluency in Python and Chinese came in handy for that!
[2] i also unearthed a years-long Sisyphean grudge between a building off the corner of Central Park and a food vendor; more details in the README.