Before your content can be searched, it has to be split into pieces — *chunks* — small enough to embed and retrieve precisely. How you split matters more than people expect.
Why not just embed whole pages?
Embed a 3,000-word page as one vector and you get one blurry average of everything on it. A question about shipping retrieves the whole page, most of which is noise. Smaller chunks mean sharper matches.
Why not split every sentence?
Go too small and you lose context. A sentence like "It ships in 3 days" is useless if the chunk doesn't know *what* ships.
The sweet spot
- ~500 tokens per chunk, with a small overlap (say 50 tokens) so a short sentence that falls on a chunk boundary still appears whole in one of the chunks.
- Respect headings. Keep a section together and prefix each chunk with its heading path (for example, "Shipping > Delivery times") so the embedding captures context.
- Split FAQs per question. Each Q&A is its own high-precision chunk.
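To make the first two guidelines concrete, here is a minimal sketch of a sliding-window chunker with overlap and a heading-path prefix. It is an illustration under stated assumptions, not how any particular tool does it: the function names `chunk_tokens` and `chunk_section` are hypothetical, the `" > "` separator is an arbitrary choice, and whitespace-separated words stand in for real tokens (a model's tokenizer would give different counts).

```python
def chunk_tokens(tokens, size=500, overlap=50):
    """Split a token list into windows of `size` that share `overlap` tokens."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be non-negative and smaller than size")
    step = size - overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        # Stop once a window reaches the end, so no redundant tail chunk is emitted.
        if start + size >= len(tokens):
            break
    return chunks


def chunk_section(heading_path, text, size=500, overlap=50):
    """Chunk one section and prefix every chunk with its heading path."""
    # Assumption: whitespace-separated words approximate tokens.
    tokens = text.split()
    prefix = " > ".join(heading_path)
    return [
        f"{prefix}\n{' '.join(window)}"
        for window in chunk_tokens(tokens, size, overlap)
    ]
```

With the defaults, a 1,200-token section yields three chunks of 500, 500, and 300 tokens, and the last 50 tokens of each chunk repeat as the first 50 of the next. Calling `chunk_section(["Shipping", "Delivery times"], "It ships in 3 days")` returns a single chunk whose first line is `Shipping > Delivery times`, so the embedded text says what ships.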
Curious how a page breaks down? The Token & Chunk Estimator shows tokens, chunk count, and cost for any text you paste.