Building The Lenny Lens: A Week-Long Journey from Treasure Trove to Shipped Product

How I learned RAG, worked with Claude Code as a thought partner, and shipped an AI app for $8/month

The Treasure Hunt Begins

Lenny Rachitsky dropped a treasure trove: 303 podcast episodes with full transcripts. Every conversation with Brian Chesky, Reid Hoffman, Julie Zhuo, and hundreds of other world-class operators. All the PM wisdom you could want.

Plus, I'd been fascinated by RAG (Retrieval-Augmented Generation) since hearing about it at a company hackathon a year ago. I'd used NotebookLM, read about embeddings and vectors, but never actually built something from scratch.

This was my chance: learn RAG hands-on AND solve a real problem.

Seven days later, The Lenny Lens was live.


What I Built: A Technical Overview

Here's what happens when you search The Lenny Lens:

User Query: "What did Elena Verna say about pricing?"
Step 1: QUERY PREPROCESSING
• Detect query type: "guest_specific"
• Extract guest name: "Elena Verna"
• Extract topic: "pricing"
• Rewrite search query to just "pricing"
Step 2: GENERATE QUERY EMBEDDING
• Call OpenAI API directly
• POST to https://api.openai.com/v1/embeddings
• Model: text-embedding-3-small
• Convert "pricing" → 1,536 numbers
• Cost: $0.000002
Step 3: VECTOR SEARCH (Database)
• Search 35,268 chunks in PostgreSQL
• Filter: episode_guest ILIKE '%Elena Verna%'
• Use cosine distance (<=> operator)
• Return: Top 10 similar chunks about pricing
• Time: ~2-3 seconds
Step 4: QUALITY FILTER
• Keep only chunks with >35% similarity
• If fewer than 3 pass, keep the top 3 anyway
Step 5: BUILD CONTEXT & SELECT PROMPT
• Format top 7 chunks with metadata
• Detect query type for prompt selection
• Choose adaptive template based on type
Step 6: GPT SYNTHESIS
• Send to GPT-4o-mini with adaptive prompt
• Include conversation context if follow-up
• Cost: $0.01-0.03 | Time: 3-5 seconds
Step 7: RETURN ANSWER
• Structured response with citations
• Total time: 5-8 seconds | Cost: ~$0.02
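Step 3 leans on pgvector's `<=>` operator, which computes cosine distance. A minimal pure-Python sketch of the same ranking logic (not the app's actual code, just the math the operator performs):

```python
import math

def cosine_distance(a, b):
    # pgvector's <=> operator: 1 - cosine similarity (lower = closer)
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return 1.0 - dot / (norm_a * norm_b)

def top_k(query_vec, chunks, k=10):
    # chunks: list of (chunk_id, embedding) pairs
    ranked = sorted(chunks, key=lambda c: cosine_distance(query_vec, c[1]))
    return ranked[:k]
```

In production the database does this across all 35,268 vectors; the point is that "similar meaning" literally means "small cosine distance."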

The Stack

  • 303 episodes → 35,268 semantic chunks
  • PostgreSQL + pgvector (vector database)
  • FastAPI backend (query processing + retrieval)
  • React frontend (conversational UI)
  • Deployed: Render + Neon | Total: $8/month

The Build Journey

Step 0: Planning & Prototyping

Before writing code, I used Claude web with Artifacts to brainstorm features and prototype UIs. What makes this different from "just another Lenny chatbot"?

I explored comparison modes, framework extraction, episode guides, trending questions. Claude generated interactive prototypes—I could click through UIs and see how features might work. I even shared them with fellow PMs to get feedback on the features.

Once I had clarity, Claude helped structure a PRD: Core feature (semantic search + GPT synthesis), differentiators (episode guides, trending questions), success metrics, and technical constraints (keep costs under $10/month).

This PRD became my guide for implementation.

Step 1: Project Setup

I fed the PRD to Claude Code and asked it to create an implementation plan. Claude Code broke it down into 9 tasks and proposed discussing each before implementing.

First: project structure. Claude Code created standard boilerplate—FastAPI backend with organized folders, React frontend, scripts for data processing, and data storage. Nothing surprising here. We moved on.

Step 2: Data Preparation

Downloaded the GitHub repo from https://github.com/ChatPRD/lennys-podcast-transcripts: 303 markdown transcripts with YAML metadata. Claude Code wrote a parser to read YAML and markdown, extract guest name, title, date, keywords, and combine into structured JSON.

Ran it. 303 JSON files generated. Simple step. No decisions needed.
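A minimal sketch of that parsing step, assuming flat `key: value` frontmatter between `---` fences (the real parser likely uses a YAML library and handles more edge cases):

```python
import re

def parse_transcript(text):
    """Split '---' YAML frontmatter from the markdown body, return a dict.

    Minimal sketch: assumes flat `key: value` frontmatter lines
    (guest, title, date, keywords) followed by the transcript body."""
    match = re.match(r"^---\n(.*?)\n---\n(.*)$", text, re.DOTALL)
    if not match:
        raise ValueError("no YAML frontmatter found")
    meta = {}
    for line in match.group(1).splitlines():
        key, _, value = line.partition(":")
        meta[key.strip()] = value.strip()
    meta["transcript"] = match.group(2).strip()
    return meta
```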

Step 3: Chunking Strategy

This is where real decisions started. Based on the PRD, Claude Code proposed implementing Q&A pair detection for chunking. I probed: what other strategies exist? Why Q&A pairs specifically?

Claude Code explained five approaches—fixed-size chunks, sentence-based, Q&A detection, logical sections, and speaker turns—with pros and cons for each. For podcast transcripts, it recommended combining Q&A pairs, logical sections, and speaker turns hierarchically.

We tested Q&A detection on 5 episodes. Caught ~40% of content. Implemented the hybrid strategy. Result: 35,268 chunks with preserved context.
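Here's roughly what Q&A pair detection looks like, assuming `Speaker: text` turn formatting; the host name and speaker labels are assumptions, and the real pipeline layers this with logical-section and speaker-turn fallbacks:

```python
import re

SPEAKER = re.compile(r"^(\w[\w\s]*?):\s*(.*)$")

def qa_chunks(transcript_lines, host="Lenny"):
    """Group a host question with the answer turns that follow it."""
    chunks, current = [], None
    for line in transcript_lines:
        m = SPEAKER.match(line)
        if not m:
            continue
        speaker, text = m.groups()
        if speaker == host and text.rstrip().endswith("?"):
            # A new host question closes the previous Q&A pair
            if current:
                chunks.append(current)
            current = {"question": text, "answer": []}
        elif current:
            current["answer"].append(text)
    if current:
        chunks.append(current)
    return chunks
```

This kind of detector is exactly why Q&A alone caught only ~40%: plenty of great content doesn't follow a tidy question-then-answer shape.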

The Library Analogy: Understanding RAG

The analogy that finally made RAG click for me: imagine a massive library with thousands of books (in our case, the 303 transcripts).

Chunks = Individual chapters or sections ripped out of books and filed separately. Instead of searching whole books, you search chapter-by-chapter.

Vectors = A magical filing system where each chapter gets a "meaning coordinate." Chapters about similar topics get placed physically near each other on the shelves—even if the words are different. So a chapter titled "How to bake bread" sits right next to "Making sourdough at home" because their meaning coordinates are close.

Semantic search = When someone asks "how do I make bread rise?", you convert their question into coordinates, walk to that spot in the library, and grab the nearest chapters.

What this conversation taught me:

Chunking isn't just splitting text. It's preserving meaning. Good chunks = good answers. Discussing trade-offs with Claude Code helped me understand WHY the strategy matters, not just WHAT to implement.

Step 4: Generate Embeddings

Claude Code proposed using OpenAI's text-embedding-3-small to convert chunks into 1,536-dimensional vectors. I asked about the large version. It explained the trade-off: text-embedding-3-large is 5x more expensive with minimal quality difference for conversational content. Cost: $0.07 vs $0.35 for 35K chunks.

Went with small. Could regenerate if quality was an issue. Wrote batch processing script to avoid rate limits. Took 1 hour. Cost: $0.07.
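The batching pattern is the interesting part. A sketch with the actual API call abstracted behind `embed_fn` (a hypothetical wrapper around the POST to /v1/embeddings with model=text-embedding-3-small):

```python
import time

def batched(items, size=100):
    # Yield fixed-size batches so each embeddings call stays under limits
    for i in range(0, len(items), size):
        yield items[i:i + size]

def embed_all(chunks, embed_fn, size=100, pause=0.5):
    """Embed chunks in batches, pausing between calls to dodge rate limits.

    `embed_fn` takes a list of texts and returns a list of vectors."""
    vectors = []
    for batch in batched(chunks, size):
        vectors.extend(embed_fn(batch))
        time.sleep(pause)
    return vectors
```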

Step 5: Vector Database Setup

Claude Code recommended PostgreSQL with pgvector extension for 35K vectors. I asked about Pinecone since every RAG tutorial mentions it.

The comparison was stark: Pinecone at $70/month minimum vs PostgreSQL at $0.50/month on Neon. For 35K vectors, PostgreSQL gives 2-3 second queries—perfectly acceptable—and costs 140x less. Pinecone makes sense at massive scale (>100K vectors, <100ms requirements).

This wasn't about picking the "best" technology. It was about sustainability. At $70/month, I'd kill this project in 3 months. At $8/month total, it runs forever.

Claude Code set up Neon database, created tables, loaded embeddings, configured pgvector.
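A sketch of what that setup looks like; the table and column names here are assumptions, not the app's actual schema:

```python
# Hypothetical schema for the chunks table (pgvector stores 1,536-dim vectors)
DDL = """
CREATE EXTENSION IF NOT EXISTS vector;
CREATE TABLE IF NOT EXISTS chunks (
    id BIGSERIAL PRIMARY KEY,
    episode_guest TEXT,
    episode_title TEXT,
    content TEXT,
    embedding VECTOR(1536)
);
"""

def search_sql(guest=None, limit=10):
    """Build the pgvector query: <=> is cosine distance, lower = closer."""
    where = "WHERE episode_guest ILIKE %(guest)s" if guest else ""
    return (
        "SELECT id, content, 1 - (embedding <=> %(query)s::vector) AS similarity "
        f"FROM chunks {where} "
        f"ORDER BY embedding <=> %(query)s::vector LIMIT {limit}"
    )
```

The guest filter is what makes "What did Elena Verna say about pricing?" search only Elena's chunks instead of the whole corpus.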

The $70 decision: Technical decisions are product decisions. Cost determines survival.

Step 6: Backend API & Query Intelligence

The API needed to handle different types of questions intelligently. A guest-specific query like "What did Elena say about pricing?" needs different handling than "How do I measure PMF?"

Query type detection happens through regex pattern matching: guest-specific queries, comparisons, how-to questions, or general queries. But detection serves two purposes:

Purpose 1: Optimize retrieval. For "What did Elena Verna say about pricing?", extract guest name ("Elena Verna") and topic ("pricing"), rewrite search query to just "pricing", then filter database to Elena's content only. This finds topic-specific chunks instead of generic Elena mentions.

Purpose 2: Choose GPT prompt template. Different query types get different prompt structures. Guest-specific queries focus on that guest's insights. Comparisons synthesize multiple perspectives. How-to questions provide actionable steps. General queries use adaptive structure.
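A minimal sketch of regex-based detection; these patterns are illustrative stand-ins, not the app's actual ones:

```python
import re

# Illustrative patterns only; real queries need more robust matching.
GUEST_RE = re.compile(
    r"[Ww]hat (?:did|does) (?P<guest>[A-Z][a-z]+(?: [A-Z][a-z]+)?) "
    r"(?:say|think) about (?P<topic>.+?)\??$"
)
COMPARISON_RE = re.compile(r"\b(?:compare|versus|vs\.?|differ)\b", re.I)
HOW_TO_RE = re.compile(r"^how (?:do|can|should) (?:i|we|you)\b", re.I)

def classify(query):
    """Detect query type and rewrite the retrieval query accordingly."""
    m = GUEST_RE.search(query)
    if m:
        # Search only the topic; the guest name becomes a DB filter instead
        return {"type": "guest_specific", "guest": m.group("guest"),
                "search_query": m.group("topic")}
    if COMPARISON_RE.search(query):
        return {"type": "comparison", "search_query": query}
    if HOW_TO_RE.search(query):
        return {"type": "how_to", "search_query": query}
    return {"type": "general", "search_query": query}
```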

Protection layers: Rate limiting (10 queries/day per IP) prevents API abuse. Without this, someone hits the API 1000 times → $200 OpenAI bill → project dies. Cloudflare Turnstile provides free bot protection—browser completes a cryptographic challenge in <200ms (invisible to user). Bots can't pass. Result: zero bot traffic, zero abuse, predictable costs.
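The rate limiter can be as simple as an in-memory sliding window. A sketch (a production deployment would back this with Redis or the database so limits survive restarts):

```python
import time
from collections import defaultdict

class DailyRateLimiter:
    """In-memory sketch of the 10-queries-per-day-per-IP guard."""

    def __init__(self, limit=10, window=86400):
        self.limit = limit      # max queries per window
        self.window = window    # window length in seconds (one day)
        self.hits = defaultdict(list)  # ip -> request timestamps

    def allow(self, ip, now=None):
        # `now` is injectable for testing; defaults to wall-clock time
        now = time.time() if now is None else now
        recent = [t for t in self.hits[ip] if now - t < self.window]
        self.hits[ip] = recent
        if len(recent) >= self.limit:
            return False
        recent.append(now)
        return True
```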

Smart model choices: text-embedding-3-small (5x cheaper than large), gpt-4o-mini (10x cheaper than gpt-4). The open internet is hostile. Cost controls aren't features—they're survival insurance.

Step 7: Frontend & Conversation Context

Built React UI with conversational interface. The tricky part: managing context across follow-up questions without exploding token costs.

The token growth problem: Message 1 costs 100 tokens. Message 2 costs 200 tokens (100 new + 100 previous). Message 5 costs 500 tokens (100 new + 400 previous). Message 10 costs 1,000 tokens (100 new + 900 previous). At 10 messages, 90% of tokens are context, 10% are the actual question.

Solution: 5-message limit per conversation. Deep enough to explore a topic. Controlled token costs. Session storage tracks last 5 Q&A pairs per IP. Send last 2 full exchanges as context for follow-ups.

Also added short query expansion: if follow-up is <5 words, append it to previous query for better retrieval. "What about B2B?" becomes "How do I measure PMF? What about B2B?"
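A sketch of that context management, with the limits above as constants (names and structure are mine, not the app's):

```python
from collections import deque

MAX_TURNS = 5          # hard cap per conversation
CONTEXT_TURNS = 2      # full Q&A pairs sent to GPT on follow-ups
SHORT_QUERY_WORDS = 5  # expand terse follow-ups for retrieval

class Conversation:
    def __init__(self):
        self.turns = deque(maxlen=MAX_TURNS)  # (question, answer) pairs

    def is_full(self):
        return len(self.turns) >= MAX_TURNS

    def retrieval_query(self, query):
        # "What about B2B?" -> "How do I measure PMF? What about B2B?"
        if self.turns and len(query.split()) < SHORT_QUERY_WORDS:
            return f"{self.turns[-1][0]} {query}"
        return query

    def gpt_context(self):
        return list(self.turns)[-CONTEXT_TURNS:]

    def record(self, question, answer):
        self.turns.append((question, answer))
```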

Step 8: Differentiators

Two features to make this more than just search:

Episode Action Guides: AI-generated playbooks for all 300 episodes. Each guide has TLDR, key frameworks, action items, when it applies, and decision filters (listen if / skip if). Cost: $0.40 for 300 episodes. Validated with 10 PM friends first—they wanted more. Then generated all 300.

Why action guides work: Regular summaries say "This episode discusses pricing." Action guides say: Key insight (Elena's value-based pricing framework), when to use (B2B SaaS with clear value metrics), steps (identify outcome, measure value, tie pricing), common mistakes (pricing too low, not anchoring). Summaries inform. Action guides enable doing.

Trending Questions: Track all searches, show top 10 from last 7 days. View counts on episode guides. Creates network effects—more users mean better trending data. Shows what PMs care about RIGHT NOW. Not just individual search—community intelligence.

Step 9: Deploy to Production

Deployment stack: Render ($7/month), Neon PostgreSQL ($0.50/month), static frontend hosting (free). Total: $7.50/month. Added environment variables for API keys. Pushed to GitHub. Live in 10 minutes at lennylens.rachithasuresh.com.

Final costs: One-time $0.47 (embeddings + episode guides), monthly $7.50, per query ~$0.02. From PRD to production: 7 days. The project survives at these costs. That was the whole point.


What I Actually Learned

1. Cost Controls = Survival Insurance

The open internet is hostile. Without protection, any public service is at risk. My protection layers: rate limiting (prevents $100+ surprise bills), Cloudflare Turnstile (free bot protection, <200ms latency), and smart model choices (5-10x cheaper models with sufficient quality).

The lesson: Without protection → $200 bill → shut down. With protection → $8/month → survives. Cost controls are survival decisions, not technical features.

2. Claude Code as Thought Partner

Most valuable conversations were "why this approach?" not "write this code." Examples: "Pinecone or PostgreSQL?" → "PostgreSQL is 140x cheaper with acceptable latency." "Why does chunking matter?" → "Good chunks preserve context. Bad chunks split thoughts mid-sentence."

These taught me principles, not just implementations. The insight: Don't just ask Claude Code to build. Ask it to explain trade-offs. You'll learn the 'why' behind decisions.

3. Prototyping in Artifacts Prevents Wrong Features

Before backend code, I prototyped in Claude web: Comparison Mode GUI (realized natural language handles it, deleted), Framework Extractor (generated generic fluff, shelved), Episode Action Guides (users loved them, generated all 300).

Cost of prototyping: 30 minutes per feature. Cost of wrong feature: 2 days coding + tech debt. Artifacts let me invalidate ideas before writing production code.


What's New in RAG (2024 Developments)

Hybrid Search: Combine semantic search with keyword matching (BM25) + reranking for better accuracy.
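One common, tuning-free way to merge the semantic and BM25 result lists is Reciprocal Rank Fusion. A sketch (The Lenny Lens doesn't implement this; it's one of the techniques above):

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Merge multiple ranked ID lists with Reciprocal Rank Fusion.

    `rankings` is a list of ranked-ID lists, e.g. [vector_ids, bm25_ids].
    A document scores 1/(k + rank) per list it appears in; k=60 is the
    conventional default that damps the influence of any single list."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)
```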

Small Language Models for Routing: Use Phi-3 or Llama 3.2 for classification → GPT only for synthesis. 4x cost reduction.

Contextual Retrieval (Anthropic): Prepend context to chunks before embedding. 49% reduction in retrieval failures.

Structured Outputs (OpenAI): Guaranteed valid JSON outputs. Reliable query detection and entity extraction.

GraphRAG (Microsoft): Build knowledge graph + vector search. Better for multi-hop questions. Overkill for <100K chunks.

Late Chunking (Jina AI): Embed full document then split. Better for long-form content. Traditional chunking works for podcasts.

My takeaway: RAG fundamentals remain the same. New techniques improve accuracy and reduce costs. Start simple. Add complexity when you hit limitations.


Grab the Chunking Strategy Cheatsheet

I created a practical guide covering 6 common use cases: podcast transcripts, documentation, customer support tickets, research papers, legal documents, and code repositories.

Each includes bad vs good chunking examples with specific implementation strategies.

Download Chunking Strategy Guide (PDF)


Try It Yourself

The Lenny Lens is live: lennylens.rachithasuresh.com

GitHub is public: github.com/rachithasuresh/lenny-lens

The pattern: chunk intelligently → generate embeddings → store in PostgreSQL + pgvector → search, retrieve, synthesize → protect with rate limits + Turnstile → ship it.

Cost to learn: Under $1. Time: One week. Value: You'll understand how AI products work.


Final Thoughts

This started as "I want to learn RAG." It became "I want expert PM advice on-demand."

The technical learning was valuable. But the product thinking—what to build, what to delete, how to make it sustainable—that's what I'll carry forward.

Building something people use is way more fun than localhost demos.

What surprised me: Claude Code conversations taught more than tutorials. Prototyping in Artifacts saved weeks. Cost controls matter more than technical elegance. RAG is accessible—barrier is starting, not complexity.

P.S. Shoutout to Lenny Rachitsky for releasing this treasure trove. This app is my attempt to build something cool with it. 🚀

Questions? Feedback? I'm still iterating based on what users find valuable.

