Streaming AI Responses in Real-Time with SSE in Next.js & NestJS
"The first token in 120 ms can double user trust" – from a Google 2020 study
by Ahmed Megahd
Software Development Head at Soki AG

Spinner vs Typewriter – How Response Style Shapes User Perception
Time to first token = 3.8 s → Frustration
Perceived slowness, no trust building
Time to first token = 120 ms → +30% Retention
User sees progress instantly, feels alive
Zeigarnik effect: Incomplete, visible progress keeps users engaged
Google UX team 2020
Latency & Delivery Protocols – Why SSE Wins for AI Streaming
Choosing the right protocol is critical for a responsive AI experience, especially when dealing with real-time streaming.
TTFT Comparison:

💡 Polling kills battery 🔋
💡 WebSocket is great for 2-way chat, not ideal for lightweight push-only streaming
💡 SSE = Perfect Push-only for AI
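Under the hood, an SSE stream is nothing exotic: it is one long-lived HTTP response with Content-Type: text/event-stream, where every push is a "data:" line followed by a blank line, and the browser's EventSource parses that framing for you. A rough illustration of what travels over the wire (the content here is made up):

HTTP/1.1 200 OK
Content-Type: text/event-stream
X-Accel-Buffering: no

data: Hel

data: lo, wor

data: ld!

Because it is plain HTTP, it typically traverses proxies, load balancers, and CDNs without any special upgrade handling, which is a big part of why it fits push-only AI streaming so well.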
Fullstack Architecture – Real-Time AI Streaming Flow
Frontend (Next.js)
Initiates EventSource connection to backend for AI response.
Backend (NestJS API)
Exposes @Sse() endpoints for real-time data push.
Verifies JWT Auth: Secures endpoints using NestJS guards.
Validates Rate-Limit: Prevents abuse with @Throttle(3, 60).
AI Models
OpenAI: Returns responses token-by-token (delta.content).
Gemini: Returns responses chunk-by-chunk (streamPart.content).
User opens chat → UI calls /api/ai/openai or /api/ai/gemini via EventSource (stream:true) →
NestJS forwards to the AI model → as tokens/chunks arrive, NestJS emits via observer.next({ data }) → UI renders live.
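A minimal sketch of that backend step, assuming an OpenAI v4 client and NestJS's @Sse() decorator; the controller and route names mirror the flow above, but the code is illustrative rather than the exact demo source:

import { Controller, Query, Sse, MessageEvent } from '@nestjs/common';
import { Observable } from 'rxjs';
import OpenAI from 'openai';

const openai = new OpenAI(); // reads OPENAI_API_KEY from the environment

@Controller('api/ai')
export class AiStreamController {
  @Sse('openai')
  stream(@Query('q') q: string): Observable<MessageEvent> {
    return new Observable<MessageEvent>((observer) => {
      (async () => {
        const stream = await openai.chat.completions.create({
          model: 'gpt-3.5-turbo',
          stream: true,
          messages: [{ role: 'user', content: q }],
        });
        for await (const chunk of stream) {
          observer.next({ data: chunk.choices[0]?.delta?.content ?? '' }); // push each token into the SSE pipe
        }
        observer.complete();
      })().catch((err) => observer.error(err));
    });
  }
}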

💡 Security Note: JWT from cookies is validated on the backend.
API keys are NEVER exposed on the frontend.
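For the guards mentioned above, here is a hedged sketch: JwtAuthGuard is assumed to be your own cookie-based passport-jwt guard, and @Throttle(3, 60) uses the positional signature of older @nestjs/throttler versions (newer versions take an options object instead).

import { Controller, Sse, UseGuards, MessageEvent } from '@nestjs/common';
import { Throttle, ThrottlerGuard } from '@nestjs/throttler';
import { Observable, of } from 'rxjs';
import { JwtAuthGuard } from './jwt-auth.guard'; // hypothetical guard that reads the JWT cookie

@Controller('api/ai')
@UseGuards(JwtAuthGuard, ThrottlerGuard) // reject unauthenticated or abusive clients before streaming starts
export class SecureAiController {
  @Throttle(3, 60) // at most 3 streams per client per 60 s
  @Sse('openai')
  stream(): Observable<MessageEvent> {
    return of({ data: 'ok' }); // placeholder: plug the provider stream in here; API keys stay server-side
  }
}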
🧪 Live Demo: Real-Time AI Responses (Token-by-Token / Chunk-by-Chunk)
Demo Scope Overview:
  • Use NestJS @Sse() to stream tokens from OpenAI and chunks from Gemini.
  • Connect from Next.js using EventSource() with auto-reconnect.
  • Render responses live in a "typewriter" style.
  • Show abort (Stop button) and metrics (tokens/sec; see the metrics sketch below).
Tech Focus:
  • Backend: NestJS SSE endpoint + RxJS retry() and interval().
  • Frontend: Next.js component using EventSource + state buffering.
  • AI: OpenAI's delta.content, Gemini's streamPart.
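For the tokens/sec metric mentioned in the demo scope, one possible approach (an illustration, not the demo's exact code) is to count SSE messages against elapsed time; it assumes your hook exposes its EventSource instance:

import { useEffect, useRef, useState } from 'react';

export function useTokenRate(es: EventSource | null) {
  const [rate, setRate] = useState(0);
  const count = useRef(0);
  const start = useRef(Date.now());

  useEffect(() => {
    if (!es) return;
    count.current = 0;
    start.current = Date.now();
    const onMsg = () => {
      count.current += 1;
      const secs = Math.max((Date.now() - start.current) / 1000, 0.001);
      setRate(count.current / secs); // events (tokens or chunks) per second
    };
    es.addEventListener('message', onMsg);
    return () => es.removeEventListener('message', onMsg);
  }, [es]);

  return rate;
}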
🔥 Tokens flow live from server to UI

πŸ’‘ "In this demo, you'll see how real-time feedback makes AI feel alive β€” with just 40 lines of extra code."
🔗 View on GitHub →
🔹 Part A – Backend: NestJS SSE in 6 Lines
NestJS – 6-Line SSE Endpoint (OpenAI / Gemini)
@Get('ai/stream')
async stream(@Query() { provider: p, q }, @Res() res) {
  // Keep the connection open and tell proxies (e.g. nginx) not to buffer the stream
  res.writeHead(200, { 'Content-Type': 'text/event-stream', 'X-Accel-Buffering': 'no' });
  if (p === 'openai')  // openai = OpenAI v4 client; the stream is an async iterable of deltas
    for await (const c of await openai.chat.completions.create({ model: 'gpt-3.5-turbo', stream: true, messages: [{ role: 'user', content: q }] }))
      res.write(`data: ${c.choices[0]?.delta?.content ?? ''}\n\n`);
  if (p === 'gemini')  // gemini = @google/generative-ai GenerativeModel; each chunk exposes text()
    for await (const part of (await gemini.generateContentStream(q)).stream)
      res.write(`data: ${part.text()}\n\n`);
  res.end();
}
  • One endpoint → any provider (switch by the "provider" param)
  • Writes token or chunk directly into SSE pipe
  • Same headers keep RAM low & bypass proxy buffering
🔹 Part B – Frontend: Next.js Writer in 5 Lines (+ Typewriter)
Next.js – 5-Line Hook
import { useEffect, useState } from 'react';

export function useAiWriter(prompt, provider = 'openai') {
  const [txt, setTxt] = useState('');
  useEffect(() => {
    if (!prompt) return;
    setTxt(''); // start fresh for each new prompt
    const es = new EventSource(`/ai/stream?provider=${provider}&q=${encodeURIComponent(prompt)}`);
    es.onmessage = e => setTxt(t => t + e.data); // append each token/chunk as it arrives
    es.onerror = () => es.close();
    return () => es.close();
  }, [prompt, provider]);
  return txt;
}
Tiny Typewriter Component
import { useEffect, useState } from 'react';

function Typewriter({ streamText, speed = 20 }) {
  const [display, setDisplay] = useState('');
  useEffect(() => {
    let i = display.length; // continue from what is already on screen when new text arrives
    const id = setInterval(() => {
      const ch = streamText[i++] ?? '';
      if (ch) setDisplay(p => p + ch);
      if (i >= streamText.length) clearInterval(id);
    }, speed);
    return () => clearInterval(id);
  }, [streamText, speed]);
  return <pre className="whitespace-pre-wrap leading-6">{display}</pre>;
}
  • Same hook handles tokens (OpenAI) & chunks (Gemini)
  • Typewriter adds 20 ms per character ⇒ ChatGPT-like feel
  • Stop button → es.close() to save tokens & cost (a stoppable variant of the hook is sketched below)
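One way to wire that Stop button, assuming the useAiWriter hook above: keep the EventSource in a ref and expose a stop() that closes it (the backend should also abort its upstream call when the client disconnects).

import { useEffect, useRef, useState } from 'react';

export function useAiWriterWithStop(prompt: string, provider = 'openai') {
  const [txt, setTxt] = useState('');
  const esRef = useRef<EventSource | null>(null);

  useEffect(() => {
    if (!prompt) return;
    setTxt('');
    const es = new EventSource(`/ai/stream?provider=${provider}&q=${encodeURIComponent(prompt)}`);
    esRef.current = es;
    es.onmessage = (e) => setTxt((t) => t + e.data);
    es.onerror = () => es.close();
    return () => es.close();
  }, [prompt, provider]);

  const stop = () => esRef.current?.close(); // client stops receiving; the server should abort upstream on disconnect

  return { txt, stop };
}

// Usage: const { txt, stop } = useAiWriterWithStop(prompt);  then <button onClick={stop}>Stop</button>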
📊 Live Metrics: Token-by-Token vs Chunk-by-Chunk vs REST
Time to First Token (TTFT) Comparison
Test Context: Simulated 3G throttled network + Vercel Edge + NestJS

Detailed Performance Benchmarks
Key Observations:
  • Gemini sends fewer but bigger chunks, leading to high tokens/sec with fewer events.
  • OpenAI delivers a very smooth typing effect due to more frequent token delivery.
  • Traditional REST makes the user wait for the entire response, which kills engagement.
  • Both SSE options support reconnect & resume for robust streaming (a Last-Event-ID sketch follows below).
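Reconnect comes for free with EventSource; resume needs a little help from the server. A hedged sketch: tag every event with an id, and on reconnect read the Last-Event-ID header the browser sends automatically. The in-memory buffer below is a placeholder assumption, not a library feature.

import { Controller, Headers, Sse, MessageEvent } from '@nestjs/common';
import { from, map, Observable } from 'rxjs';

@Controller('api/ai')
export class ResumableAiController {
  private buffer: string[] = []; // tokens already produced; a real app would keep this per conversation

  @Sse('resume')
  stream(@Headers('last-event-id') lastId?: string): Observable<MessageEvent> {
    const startAt = lastId ? Number(lastId) + 1 : 0; // EventSource re-sends the last id it saw
    return from(this.buffer.slice(startAt)).pipe(
      map((token, i) => ({ id: String(startAt + i), data: token })), // the id field makes each event resumable
    );
  }
}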

💡 UX Insight
TTFT under 400 ms increases retention by 30% (Google UX Study, 2020)

πŸ’‘ "Streaming feels instant β€” first token in 120ms is the new bar."
πŸ›‘οΈ Production Guardrails: Scaling Streaming Safely
Running real-time AI streaming in production requires robust safeguards for performance, security, and cost control.

πŸ’‘ "SSE is lightweightβ€”but not free. Guardrails make it reliable & affordable."
SSE Delivery: Direct vs Hub
Choosing the right architecture for Server-Sent Events (SSE) is crucial for scaling real-time AI responses.

🧠 Why Use a Hub Layer?
  • Reduces load on upstream logic (NestJS calls OpenAI once, not per user).
  • Distributes a single message to thousands of clients via efficient push.
  • Handles auto-reconnects and buffering cleanly.
  • Makes scaling horizontally much easier.

Without a hub, SSE is powerful but not production-grade at large scale.
[Clients] → [Mercure Hub] ← [NestJS API] → [OpenAI / Gemini]
SSE is just the protocol – hub architecture turns it into scalable infrastructure (a minimal publish sketch follows below).
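A minimal publish sketch with Mercure as the hub, assuming HUB_URL and a publisher JWT are configured for your deployment: NestJS calls the model once and POSTs each token to the hub, which fans it out to every subscribed EventSource.

// HUB_URL and the publisher JWT are deployment-specific assumptions.
const HUB_URL = 'https://example.com/.well-known/mercure';
const HUB_JWT = process.env.MERCURE_PUBLISHER_JWT ?? '';

// Publish one token to a conversation topic; the hub pushes it to every subscriber.
async function publishToken(conversationId: string, token: string): Promise<void> {
  await fetch(HUB_URL, {
    method: 'POST',
    headers: {
      Authorization: `Bearer ${HUB_JWT}`,
      'Content-Type': 'application/x-www-form-urlencoded',
    },
    body: new URLSearchParams({
      topic: `/conversations/${conversationId}`,
      data: token,
    }),
  });
}

// Browser side: one EventSource per conversation topic, served by the hub:
// const es = new EventSource(`${HUB_URL}?topic=${encodeURIComponent('/conversations/' + id)}`);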
🎯 AI Providers Supporting Real-Time Streaming
For building responsive, real-time AI applications, choosing a provider that offers streaming capabilities via Server-Sent Events (SSE) is paramount. Here's a comparison of leading AI models and their streaming support.

📌 Key Notes:
  • Groq leads in speed; ideal for instant UX.
  • Gemini and OpenAI support SSE natively.
  • Perplexity AI can be integrated via proxies like OpenRouter.
  • All providers support text-based output suitable for SSE piping.

πŸ’‘ "Choose your provider based on latency and granularity: token streams (OpenAI/Groq) give max interactivity; chunk streams (Gemini) reduce overhead."
Token-by-Token vs Chunk-by-Chunk – UX Matters
Token-by-Token Streaming
  • Fine-grained control: Text appears letter by letter.
  • Feels more "alive" and interactive.
  • Better user engagement for long outputs.
  • Higher event frequency (~300+ per response).
  • Slightly more CPU/network overhead (a buffering mitigation is sketched after this comparison).
  • Examples: OpenAI, Groq, Perplexity.
Chunk-by-Chunk Streaming
  • Smoother bursts of text (e.g., whole sentence or paragraph).
  • Fewer events (~50 per response).
  • Easier to implement on low-resource backends.
  • More suited for summarization or QA tasks.
  • Lower granularity β†’ less perceived interactivity.
  • Examples: Gemini, Claude (Anthropic), Cohere.

πŸ’‘ "For maximum immersion in chat UIs, token streaming wins. For structured answers or mobile devices, chunk streaming may suffice."
AI Features Enabled by Real-Time Streaming
Streaming powers real-world features across SaaS, DevTools, and even IoT – it's not just for chatbots!
📲 Advanced GPT Action Integration – SSE to Browser Extension
Leveraging SSE for real-time actions, bridging AI capabilities with local user tools.
GPT Action Trigger
A custom GPT receives a request (e.g., "Create and save a note...") and sends an action POST request to your backend.
Backend Streams Event
Your NestJS backend processes the action, then pushes a real-time event via Server-Sent Events (SSE) to connected clients.
Extension Acts Locally
A browser extension on the user's laptop, connected via EventSource, receives the SSE event and triggers a local note creation or system action (an illustrative listener sketch follows below).

πŸ’‘ "SSE bridges OpenAI GPTs, custom backends, and local user tools for zero polling, zero delays, and real-time execution."
SSE Limitations & When to Use What
Common Limitations & Solutions

Choosing Your Streaming Protocol
Need push only? → Use SSE
Need full-duplex chat? → Use WebSocket
Need binary/voice? → Use WebRTC
Internal service mesh? → Use gRPC Streaming
πŸ’‘ "SSE is ideal for lightweight push. Choose your streaming strategy based on product needs."
Streaming Makes AI Feel Alive
Real-time AI responses mean less latency, more user trust, and better retention. SSE is the simplest, most efficient way to achieve this.
Dynamic UX
Reduces perceived latency, making AI interactions feel instant and alive.
Simplified Streaming
Leverage SSE for easy token-by-token or chunk-by-chunk LLM output with minimal setup.
Scalable & Versatile
Works seamlessly across browsers, extensions, and mobile; integrates with major AI providers like OpenAI, Gemini, and Claude; easily scalable and secure.
Find a ready-made template for streaming with NestJS + Next.js on our GitHub repo.
Thank You & Let's Connect
πŸ™ Thanks for your time & energy!
I'm Ahmed Megahd β€” CTO

Feel free to reach out to me through any of the following channels:
I'm always happy to chat about web development, AI, or any other topics you're interested in. Looking forward to connecting!
Questions?
Thank you for your attention. I'd now like to open the floor for questions. Feel free to ask about anything we've covered today or other topics related to real-time AI streaming.