Why Your LLM Output Feels Off: A Dev’s Guide to top_k and top_p Parameters
If your completions are too flat or too wild, you're probably using sampling settings wrong. Here's how top_k and top_p really affect output quality and what to do about it.
Most Devs Use LLM APIs Wrong
You're building with OpenAI, Anthropic, Cohere, or any of the newer models out there. You call generate() or chat(), tweak the temperature, maybe adjust max_tokens, and ship it.
But the real power — and real control over generation quality — lives in the sampling parameters: top_k and top_p.
If you’ve ever asked:
Why is the output so repetitive even with temperature 0.8?
Why did it suddenly go off-topic?
Why do creative responses sometimes feel dead?
The answer usually lies in your top_k and top_p settings.
What is top_k?
top_k controls how many of the most probable tokens the model can pick from at each step.
If top_k = 1, the model is deterministic. It always picks the most likely token.
If top_k = 10, it samples randomly from the top 10 tokens.
If top_k = 100, the pool of candidate tokens is larger. You get more diversity, but also more risk.
Think of it like this: you're telling the model to consider only the top k options, and discard everything else. Useful for consistent tone or logic, but too low a value can lead to repetitive outputs.
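To make that concrete, here's a rough sketch of top_k filtering in plain Python. The token list and probabilities are made up for illustration; real samplers do this on logits inside the model, but the idea is the same:

```python
import numpy as np

def top_k_filter(tokens, probs, k):
    """Keep only the k most probable tokens, then renormalize so they sum to 1."""
    order = np.argsort(probs)[::-1][:k]         # indices of the k most likely tokens
    kept_probs = np.array(probs)[order]
    kept_probs = kept_probs / kept_probs.sum()  # renormalize over the survivors
    kept_tokens = [tokens[i] for i in order]
    return kept_tokens, kept_probs

tokens = ["mat", "floor", "couch", "table", "chair"]
probs = [0.40, 0.30, 0.15, 0.10, 0.05]

kept_tokens, kept_probs = top_k_filter(tokens, probs, k=3)
print(kept_tokens)                                  # ['mat', 'floor', 'couch']
print(np.random.choice(kept_tokens, p=kept_probs))  # sample the next token from the survivors
```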
What is top_p?
top_p, also called nucleus sampling, is more adaptive.
Instead of choosing a fixed number of tokens, it accumulates token probabilities from the most likely downward until the running total reaches a threshold p.
If top_p = 0.9, the model samples from the smallest set of tokens whose combined probability is at least 90 percent. The number of tokens selected may vary depending on how confident the model is.
This makes it well-suited for situations where you want a balance of creativity and coherence.
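A matching sketch of nucleus sampling, with the same made-up numbers, shows how the cutoff adapts to the distribution instead of being a fixed count:

```python
import numpy as np

def top_p_filter(tokens, probs, p):
    """Keep the smallest set of most-likely tokens whose combined probability is at least p."""
    order = np.argsort(probs)[::-1]              # token indices sorted by probability, descending
    cumulative = np.cumsum(np.array(probs)[order])
    cutoff = np.searchsorted(cumulative, p) + 1  # first position where the running total reaches p
    keep = order[:cutoff]
    kept_probs = np.array(probs)[keep]
    kept_probs = kept_probs / kept_probs.sum()   # renormalize over the nucleus
    return [tokens[i] for i in keep], kept_probs

tokens = ["mat", "floor", "couch", "table", "chair"]
probs = [0.40, 0.30, 0.15, 0.10, 0.05]

print(top_p_filter(tokens, probs, p=0.80))  # nucleus is ['mat', 'floor', 'couch']: 0.40 + 0.30 + 0.15 >= 0.80
```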
Can You Use Both Together?
Yes, and most APIs support it. Here's what happens under the hood:
First, the model filters to the top k tokens. Then it applies top_p filtering to that subset, summing from the top until the combined probability exceeds p.
But here's the catch:
If the combined probability of the top k tokens never reaches the threshold p, the top_p filter is effectively ignored. The model then samples from the full top k list with no further filtering.
This happens silently, and can lead to outputs that behave differently than expected.
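To see the interaction concretely, here's a minimal sketch of that order of operations: the top_k cap first, then top_p summed over the original, unrenormalized probabilities of the survivors, which is the behavior described above. Exact details vary between libraries, so treat this as an illustration rather than a reference implementation:

```python
import numpy as np

def combined_filter(tokens, probs, k, p):
    """top_k cap first, then top_p over the original probabilities of the survivors."""
    order = np.argsort(probs)[::-1][:k]    # step 1: hard cap at the k most likely tokens
    capped = np.array(probs)[order]

    cumulative = np.cumsum(capped)         # step 2: sum from the top toward the threshold p
    if cumulative[-1] < p:
        keep = order                       # threshold never reached: top_p silently does nothing
    else:
        keep = order[:np.searchsorted(cumulative, p) + 1]

    kept_probs = np.array(probs)[keep]
    return [tokens[i] for i in keep], kept_probs / kept_probs.sum()  # renormalize before sampling

tokens = ["mat", "floor", "couch", "table"]
probs = [0.5, 0.3, 0.1, 0.1]
print(combined_filter(tokens, probs, k=3, p=0.75))  # top_k keeps 3 tokens, top_p then trims to ['mat', 'floor']
```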
Real-World Example
Let’s say your prompt is:
"The cat sat on the..."
The model generates the following top 10 next-token options with associated probabilities:
mat (0.35)
floor (0.25)
couch (0.15)
table (0.08)
chair (0.06)
bed (0.05)
roof (0.002)
moon (0.001)
pizza (0.0007)
unicorn (0.0003)
Together, these sum to about 0.94 probability mass.
Now suppose you call the API with top_k = 10 and top_p = 0.95.
Here’s what happens:
The model first keeps the top 10 tokens (top_k = 10).
Then it starts summing token probabilities from the top: mat + floor + couch + table + chair = 0.89.
bed adds another 0.05, reaching 0.94. Still below 0.95.
The remaining tokens (roof, moon, pizza, unicorn) add only about 0.004 more, so the running total tops out near 0.944 and top_p filtering never kicks in.
Because the total probability mass of the top 10 doesn’t meet the top_p = 0.95 threshold, the top_p filter has no effect. The model ends up sampling from all 10 tokens.
In practice, this means your top_p setting didn’t do anything, and you wouldn’t know unless you looked closely at the distribution.
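You can check the arithmetic yourself. Using the made-up probabilities from the example, the cumulative mass of the top 10 tokens never reaches 0.95, so the nucleus filter has nothing to cut:

```python
import numpy as np

probs = [0.35, 0.25, 0.15, 0.08, 0.06, 0.05, 0.002, 0.001, 0.0007, 0.0003]
cumulative = np.cumsum(probs)
print(cumulative[-1])              # ~0.944: the whole top-10 pool stays under 0.95
print((cumulative >= 0.95).any())  # False: the top_p = 0.95 cutoff is never reached, all 10 tokens survive
```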
How to Avoid This Trap
If you're using both, make sure top_k is large enough to give top_p room to work.
Good practice:
Use top_k between 40 and 100.
Use top_p between 0.85 and 0.95.
This creates a balance between a hard cap (top_k) and a dynamic filter (top_p), allowing you to control diversity without letting things get too random.
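As a concrete example, here's what that looks like with Hugging Face transformers, which exposes both knobs on generate(). Hosted APIs vary, and some only expose top_p, so check your provider's docs:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

inputs = tokenizer("The cat sat on the", return_tensors="pt")
outputs = model.generate(
    **inputs,
    do_sample=True,     # sampling must be enabled, otherwise top_k/top_p are irrelevant
    top_k=50,           # hard cap, wide enough to give top_p room to work
    top_p=0.9,          # dynamic filter inside that cap
    max_new_tokens=20,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```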
When to Use What
For deterministic Q&A: use top_k = 1 and top_p = 1.0.
For chat or assistants: use top_k = 50, top_p = 0.9.
For creative writing: skip top_k and just use top_p = 0.92.
For coding tasks: try top_k = 10, top_p = 0.85.
For controlled randomness: use top_k = 80, top_p = 0.95.
Remember: temperature scales the probabilities, while top_k and top_p shape the candidate list.
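If you switch between these modes often, it can help to keep them as named presets instead of scattering magic numbers through your code. A small sketch, with names and values that simply mirror the list above:

```python
# Illustrative presets mirroring the list above; tune them per model and task.
SAMPLING_PRESETS = {
    "deterministic_qa":      {"top_k": 1,    "top_p": 1.0},
    "chat_assistant":        {"top_k": 50,   "top_p": 0.9},
    "creative_writing":      {"top_k": None, "top_p": 0.92},  # None means: skip the top_k cap
    "coding":                {"top_k": 10,   "top_p": 0.85},
    "controlled_randomness": {"top_k": 80,   "top_p": 0.95},
}

def sampling_kwargs(task: str) -> dict:
    """Return the sampling parameters for a task, dropping any knob that is left unset."""
    return {name: value for name, value in SAMPLING_PRESETS[task].items() if value is not None}

print(sampling_kwargs("creative_writing"))  # {'top_p': 0.92}
```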
Wrap up
Calling an LLM API with just a prompt and default settings works, but you're not in control. Sampling isn't a detail. It's your first line of control over what the model says and how it behaves.
Poor sampling config is often the root cause of confusing or low-quality output. Know what each parameter does, and use it deliberately.
Treat top_k and top_p as core parts of your product's intelligence, not just more knobs to tweak.


