Text Extractor for TTS

How it works

This tool calls the Firecrawl /v2/scrape API with LLM-powered JSON extraction and a prompt tuned for text-to-speech. An AI model reads the page and returns clean plain text that sounds natural when read aloud: no markdown, no URLs, no footnote markers, no formatting symbols.

Why TTS-optimized AI extraction?

Universal: needs no per-site CSS selectors, so the same request can be pointed at different websites
TTS-ready: strips markdown formatting, footnote markers, URLs, and figure references that would sound unnatural when read aloud
Smart filtering: the AI removes UI junk (share buttons, subscribe prompts, ads, navigation) and keeps only the article content meant to be heard
Structured output: separates the title from the body so TTS apps can announce the headline differently
Natural speech: expands common abbreviations and keeps direct quotes and dialogue intact
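
The properties listed above can be spot-checked in code. The tool itself relies on the LLM prompt, not on regular expressions, so the following is only a hypothetical post-check: the name findTtsArtifacts and every pattern in it are assumptions made for illustration, not part of Firecrawl. It reports which kinds of leftovers (URLs, markdown, footnote markers, figure references) still appear in a piece of extracted text.

```javascript
// Hypothetical post-check: report artifacts that would sound wrong in TTS.
// The patterns are illustrative assumptions, not an exhaustive list.
const ARTIFACT_PATTERNS = {
  url: /https?:\/\/\S+/,
  markdownLink: /\[[^\]]*\]\([^)]*\)/,
  markdownEmphasis: /\*\*|__/,
  markdownHeading: /^#{1,6}\s/m,
  footnoteMarker: /\[\d+\]|†/,
  figureReference: /\((?:Fig\.|Figure)\s*\d+\)/i,
};

function findTtsArtifacts(text) {
  // Return the name of every pattern that still matches the text.
  return Object.entries(ARTIFACT_PATTERNS)
    .filter(([, pattern]) => pattern.test(text))
    .map(([name]) => name);
}
```

An empty result suggests the text is safe to hand to a speech engine; a non-empty one could trigger a retry or a manual review.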

Firecrawl API features used

{type: "json", prompt: "...", schema: {...}} — LLM reads the full page and extracts structured data (title + article_text) based on a TTS-optimized prompt
{type: "screenshot", fullPage: true} — full-page screenshot rendered server-side via headless browser (works even on sites that block iframes)

Actual API request (the entire backend)

const prompt = `Extract the main article content as clean plain
text optimized for text-to-speech.

Rules:
- Return ONLY the article body text that should be read aloud
- No markdown formatting: no **, no ##, no [](), no bullet symbols
- No URLs or links — just the link text if part of a sentence
- No footnote markers like [1], [2], (1), *, †
- No figure/image references like (Fig. 1), (see image above)
- No image captions or alt text
- No author names, bylines, dates, or publication info
- No share/subscribe/comment prompts or UI elements
- No ads, related articles, or navigation
- Expand common abbreviations for natural speech
- Keep direct quotes and dialogue naturally
- Separate paragraphs with double newlines
- For the title field: extract just the article headline`;

const resp = await fetch('https://api.firecrawl.dev/v2/scrape', {
  method: 'POST',
  headers: {
    'Content-Type': 'application/json',
    // FIRECRAWL_API_KEY is a variable holding your key (for example, loaded from the environment)
    'Authorization': `Bearer ${FIRECRAWL_API_KEY}`,
  },
  body: JSON.stringify({
    url,
    formats: [
      { type: 'screenshot', fullPage: true },
      { type: 'json',
        prompt,
        schema: {
          type: 'object',
          properties: {
            title: {
              type: 'string',
              description: 'The article headline as a clean sentence'
            },
            article_text: {
              type: 'string',
              description: 'Full article body as plain text for TTS. ' +
                'No markdown, no URLs, no footnotes.'
            }
          },
          required: ['title', 'article_text']
        }
      }
    ],
  }),
});

The prompt tells the LLM to extract TTS-optimized plain text. The schema requires two fields, title and article_text; title is listed first so TTS apps can announce it before the body.
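
A minimal sketch of consuming the result, assuming the parsed response body has the shape { success, data: { json, screenshot } } (this shape is an assumption here and should be checked against the Firecrawl documentation). The helper name toSpeechSegments is hypothetical. It puts the title first and splits the body on the double newlines the prompt asks for, so a TTS engine can speak one paragraph at a time.

```javascript
// Hypothetical helper: turn a parsed scrape response into an ordered list
// of strings to speak. The response shape is an assumption, see above.
function toSpeechSegments(payload) {
  if (!payload || payload.success === false) {
    throw new Error('Scrape failed');
  }
  const extracted = payload.data && payload.data.json;
  if (!extracted || typeof extracted.article_text !== 'string') {
    throw new Error('Response has no extracted article_text');
  }
  // The prompt asks for paragraphs separated by double newlines.
  const paragraphs = extracted.article_text
    .split(/\n{2,}/)
    .map((p) => p.trim())
    .filter((p) => p.length > 0);
  const title = typeof extracted.title === 'string' ? extracted.title.trim() : '';
  // Announce the headline first, then the body paragraphs in order.
  return title ? [title, ...paragraphs] : paragraphs;
}

// Usage with the request above:
//   const segments = toSpeechSegments(await resp.json());
```

Returning separate segments rather than one long string leaves it to the caller to decide how to voice the title and how long to pause between paragraphs.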