When Sam Altman took the stage at DevDay last November, one of the announcements that mostly flew under the radar was the re-introduction of logprobs to the OpenAI API. At the time, I didn't know what that meant, but it drew a round of cheers from the audience of developers.
A few weeks ago, OpenAI delivered on that promise - logprobs were now available for GPT-3.5 and GPT-4. But I still didn't know what they were, or why they were useful. So I decided to find out.
To make a log story short
As we've discussed before, large language models are - to grossly oversimplify - autocomplete on steroids. A more technical way of saying it would be a "token prediction machine." At each step in its output, it comes up with a list of potential next words (or "tokens"), and assigns a probability to each.
For example: "Taylor Swift's best album is" could continue in a few ways. "Midnights" might have a high probability, as would "1989". But "Yeezus" would have a pretty low probability - likely close to 0%.
In practice, we don't actually use the raw percentage when working with model outputs. Instead, for math reasons, we use the logarithm of the probability - roughly "-2.3" instead of "10%" (OpenAI uses the natural log). That's a logprob.
Here's what GPT-4's logprobs look like for the sentence above:
Prompt: "You are a Taylor Swift superfan. You have strong opinions about her work. Taylor Swift's best album is"
Next token:
{
  "token": "Red",
  "logprob": -0.024693936,
  "top_logprobs": [
    {"token": "Red", "logprob": -0.024693936},   # 97.5% confident
    {"token": "198", "logprob": -4.618444},      # 1% confident
    {"token": "Speak", "logprob": -4.696569},    # 1% confident
    {"token": "RED", "logprob": -5.259069},      # 0.5% confident
    {"token": "folk", "logprob": -8.6496935}     # 0% confident
  ]
}
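By the way, getting this data is just a matter of passing two extra parameters to the chat completions endpoint. Here's a minimal sketch using the OpenAI Python SDK (v1.x) - the exact numbers you get back will vary from run to run:

import math
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in your environment

response = client.chat.completions.create(
    model="gpt-4",
    messages=[{
        "role": "user",
        "content": "You are a Taylor Swift superfan. You have strong opinions "
                   "about her work. Taylor Swift's best album is",
    }],
    max_tokens=1,
    logprobs=True,    # return the logprob of each output token
    top_logprobs=5,   # also return the 5 most likely alternatives
)

# The first (and only) output token, plus its top-5 alternatives
first_token = response.choices[0].logprobs.content[0]
for candidate in first_token.top_logprobs:
    probability = math.exp(candidate.logprob)  # logprob -> probability
    print(f"{candidate.token!r}: {probability:.1%}")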
Side note: If you're wondering about the math reasons, there are a few:
Efficiency: It's cheaper for a computer to do addition than multiplication. When dealing with multiple token probabilities (i.e. evaluating the likelihood of a sentence as a whole), it's more efficient to add their logprobs rather than multiply their probabilities.
Rounding: The product of many probabilities can become extremely small - too small for a computer to represent accurately, so it gets rounded down to zero. Logprobs, on the other hand, range from negative infinity to 0 (where 0 means 100% probability), and adding them keeps the result well within the range a computer can handle (there's a quick demo of this below).
Optimization: When training a model, it can be more effective to optimize the log probability rather than the probability itself, as the gradient of the logprob is generally smoother and easier to optimize.
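To make the rounding problem concrete, here's a tiny, made-up illustration: multiplying a few hundred modest probabilities underflows to zero, while summing their logprobs stays perfectly representable:

import math

# A made-up sequence: 400 tokens, each predicted with 10% probability
probs = [0.1] * 400

product = 1.0
for p in probs:
    product *= p
print(product)  # 0.0 -- the true value (1e-400) is too small for a float

total_logprob = sum(math.log(p) for p in probs)
print(total_logprob)  # about -921.0 -- no problem representing this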
What can we do with logprobs?
Evaluating models
Before we entered our generative AI moment, logprobs were mostly used for evaluating models. As you train a model, you can benchmark its output with test cases - if it's able to predict the next token(s) with sufficient confidence, that's a strong signal that the training is working. And with ever-growing context windows, we can also check the model's confidence in entire sentences or paragraphs.
Researchers also use logprobs to calibrate models. If you have a suite of evals that you've tested on existing models, you can train new models and see how far off your new model is from your existing ones - it may become more accurate in some areas, but less helpful in others.
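A standard way to roll per-token logprobs up into a single score for a whole sentence or paragraph is perplexity - the exponential of the negative average logprob. Here's a small helper (the logprobs in the usage example are made up):

import math

def sequence_confidence(token_logprobs: list[float]) -> dict:
    """Summarize how confident a model was in a whole sequence,
    given the logprob it assigned to each token."""
    total = sum(token_logprobs)
    avg = total / len(token_logprobs)
    return {
        "total_logprob": total,        # log-likelihood of the full sequence
        "avg_logprob": avg,            # normalized for length
        "perplexity": math.exp(-avg),  # lower is better (1.0 = total certainty)
    }

# Per-token logprobs for a hypothetical model output
print(sequence_confidence([-0.02, -0.31, -1.20, -0.05]))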
These examples, however, are mostly relevant to internal model training - and until very recently, the most capable hosted models didn't expose logprobs. But with OpenAI now offering them as an option, we're able to easily build new features and applications that take advantage of them.
Classifying content
A very common way to use logprobs is to classify content. Say we prompt ChatGPT with the following:
You will be given a noun. Classify the noun into one of the following categories: Animal, Vegetable, and Mineral. Return only the name of the category, and nothing else. MAKE SURE your output is one of the three categories stated.
If we just ask for the chat completion, we'll hopefully get back one of the three categories as our answer. But what if we want to know how confident the model was in its decision? The logprobs can tell us. Here are the results for a few different nouns (including one that's a little bit of a curveball):
Noun: Obsidian
Category: Mineral
Top 3 Logprobs:
- "Min": -5.5122365e-07 (100.0%)
- "min": -14.703126 (0.0%)
- " Mineral": -17.609375 (0.0%)
Noun: Bear
Category: Animal
Top 3 Logprobs:
- "Animal": -3.1281633e-07 (100.0%)
- "animal": -16.109375 (0.0%)
- "Anim": -16.265625 (0.0%)
Noun: Apple
Category: Vegetable
Top 3 Logprobs:
- "Ve": -3.1281633e-07 (100.0%)
- "veget": -15.78125 (0.0%)
- " Vegetable": -15.90625 (0.0%)
Noun: Goldbug
Category: Animal
Top 3 Logprobs:
- "Animal": -0.0025567575 (99.74%)
- "Min": -5.971307 (0.26%)
- "Anim": -13.252557 (0.0%)
We can see that GPT-4 is 100% confident except on "Goldbug" - it's only 99.74% confident, with 0.26% reserved for "Mineral".
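Here's roughly how you'd pull those confidences out in code - a sketch assuming the OpenAI Python SDK, using the classifier prompt from above and the logprob of the first output token as the confidence:

import math
from openai import OpenAI

client = OpenAI()

SYSTEM_PROMPT = (
    "You will be given a noun. Classify the noun into one of the following "
    "categories: Animal, Vegetable, and Mineral. Return only the name of the "
    "category, and nothing else. MAKE SURE your output is one of the three "
    "categories stated."
)

def classify(noun: str) -> tuple[str, float]:
    """Return (category, confidence), where confidence is derived from
    the logprob of the first output token."""
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": noun},
        ],
        logprobs=True,
        max_tokens=5,
    )
    category = response.choices[0].message.content
    first_token = response.choices[0].logprobs.content[0]
    return category, math.exp(first_token.logprob)

print(classify("Goldbug"))  # e.g. ("Animal", 0.9974)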
Another version of this use case is content moderation. There are many ways to try to identify and flag harmful content, but one approach is to monitor the potential next tokens of the model's output.
For example, you can have an internal list of words that correspond to different flags (hate speech, obscenities, etc.). For each model response, you can check the logprobs to see if any of those words came up as a potential next token. If they did (and if the probability was above a certain threshold) you can flag or outright delete the response. This isn’t a foolproof solution, but it can offer an extra layer of protection.
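As a sketch of that idea (the flag list and threshold here are placeholders of mine), you could scan the top candidates at every position of a response and flag anything that crosses your threshold:

import math

# Placeholder flag list -- in practice this would be much longer and
# probably organized by category (hate speech, obscenities, etc.)
FLAGGED_WORDS = {"badword1", "badword2", "badword3"}
THRESHOLD = 0.10  # flag if a listed word had a >10% chance of being emitted

def flag_response(logprob_content) -> list[dict]:
    """Scan each output token's top candidates (as returned by the chat
    completions API with logprobs=True, top_logprobs=N) for flagged words."""
    flags = []
    for position, token_info in enumerate(logprob_content):
        for candidate in token_info.top_logprobs:
            word = candidate.token.strip().lower()
            probability = math.exp(candidate.logprob)
            if word in FLAGGED_WORDS and probability > THRESHOLD:
                flags.append({"position": position, "token": word,
                              "probability": probability})
    return flags

# Usage: flags = flag_response(response.choices[0].logprobs.content)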
Reducing hallucinations
Hallucinations are still a major problem when working with LLMs. One approach to solving them is RAG, or retrieval augmented generation - basically checking relevant sources before answering the question.
But even when performing RAG, we still don't know whether the answer we're looking for is present in the provided sources at all. It could be missing from our data, or just plain unanswerable. In those situations, we can ask our model to check whether the answer is present in our sources before deciding whether or not to respond.
Let's look at the following example (borrowed from OpenAI's docs): we're going to ask GPT-4 questions about an article we're providing. Here's the article:
Augusta Ada King, Countess of Lovelace (née Byron; 10 December 1815 – 27 November 1852) was an English mathematician and writer, chiefly known for her work on Charles Babbage's proposed mechanical general-purpose computer, the Analytical Engine. She was the first to recognise that the machine had applications beyond pure calculation.
Ada Byron was the only legitimate child of poet Lord Byron and reformer Lady Byron. All Lovelace's half-siblings, Lord Byron's other children, were born out of wedlock to other women. Byron separated from his wife a month after Ada was born and left England forever. He died in Greece when Ada was eight. Her mother was anxious about her upbringing and promoted Ada's interest in mathematics and logic in an effort to prevent her from developing her father's perceived insanity. Despite this, Ada remained interested in him, naming her two sons Byron and Gordon. Upon her death, she was buried next to him at her request. Although often ill in her childhood, Ada pursued her studies assiduously. She married William King in 1835. King was made Earl of Lovelace in 1838, Ada thereby becoming Countess of Lovelace.
Her educational and social exploits brought her into contact with scientists such as Andrew Crosse, Charles Babbage, Sir David Brewster, Charles Wheatstone, Michael Faraday, and the author Charles Dickens, contacts which she used to further her education. Ada described her approach as "poetical science" and herself as an "Analyst (& Metaphysician)".
When she was eighteen, her mathematical talents led her to a long working relationship and friendship with fellow British mathematician Charles Babbage, who is known as "the father of computers". She was in particular interested in Babbage's work on the Analytical Engine. Lovelace first met him in June 1833, through their mutual friend, and her private tutor, Mary Somerville.
Between 1842 and 1843, Ada translated an article by the military engineer Luigi Menabrea (later Prime Minister of Italy) about the Analytical Engine, supplementing it with an elaborate set of seven notes, simply called "Notes".
Lovelace's notes are important in the early history of computers, especially since the seventh one contained what many consider to be the first computer program—that is, an algorithm designed to be carried out by a machine. Other historians reject this perspective and point out that Babbage's personal notes from the years 1836/1837 contain the first programs for the engine. She also developed a vision of the capability of computers to go beyond mere calculating or number-crunching, while many others, including Babbage himself, focused only on those capabilities. Her mindset of "poetical science" led her to ask questions about the Analytical Engine (as shown in her notes) examining how individuals and society relate to technology as a collaborative tool.
And here are the questions we want answered - the LLM's goal is to check whether each question can be answered based on the information in the provided article:
Easy questions:
"What nationality was Ada Lovelace?"
"What was an important finding from Lovelace's seventh note?"
Hard questions:
"Did Lovelace collaborate with Charles Dickens",
"What concepts did Lovelace build with Charles Babbage"
Take a second and see if you can figure them out yourself! Here's how GPT-4 did:
Question: What nationality was Ada Lovelace?
Article Contains Answer: True
- True: -3.1281633e-07 (100.0%)
- False: -15.125 (0.0%)
Question: What was an important finding from Lovelace's seventh note?
Article Contains Answer: True
- True: -5.5122365e-07 (100.0%)
- False: -14.562501 (0.0%)
Question: Did Lovelace collaborate with Charles Dickens?
Article Contains Answer: True
- True: -0.035777908 (96.49%)
- False: -3.3482778 (3.51%)
Question: What concepts did Lovelace build with Charles Babbage?
Article Contains Answer: True
- True: -0.36724457 (69.26%)
- False: -1.1797446 (30.74%)
You can see that GPT-4 is actually not that confident about the last question! It thinks it has enough context, but it's not 100% sure. In practice, you would want to decide on a confidence threshold before using GPT-4 to write an answer for the user.
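Here's what that check could look like in code - a sketch assuming the OpenAI Python SDK; the prompt wording and the 0.9 threshold are my own placeholders:

import math
from openai import OpenAI

client = OpenAI()
CONFIDENCE_THRESHOLD = 0.90  # placeholder -- tune this for your application

def article_contains_answer(article: str, question: str) -> tuple[bool, float]:
    """Ask the model whether the article can answer the question, and return
    (verdict, confidence) based on the logprob of the True/False token."""
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[
            {"role": "system", "content": (
                "You will be given an article and a question. Respond with only "
                "the word True if the article contains enough information to "
                "answer the question, and only the word False otherwise."
            )},
            {"role": "user", "content": f"Article:\n{article}\n\nQuestion: {question}"},
        ],
        logprobs=True,
        max_tokens=1,
    )
    token_info = response.choices[0].logprobs.content[0]
    return token_info.token.strip() == "True", math.exp(token_info.logprob)

# Only write an answer for the user if we're confident the source covers it:
# has_answer, confidence = article_contains_answer(article_text, question)
# if has_answer and confidence >= CONFIDENCE_THRESHOLD:
#     ...  # generate the answer; otherwise, say it isn't in the sources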
Token healing
Token healing is a pretty cool concept that I'd never come across before researching this article (though it gets in the weeds a bit). Consider the following prompt:
The link is <a href="http:
As a human, you know that the next two characters should almost undoubtedly be "//". But many LLMs will, out of the box, generate bad output here. In the examples I saw, the output started with a space: " //google.com/search".
That's because LLMs often see a standalone colon (":") as a separate token from a colon-with-slashes ("://"). And when processing input prompts, models assume that we're feeding in whole tokens - they don't account for the possibility that the input ends with a partial token. That leads to subtle bugs in the output - in this example, you're not going to get a properly formatted URL back.
Microsoft has provided one approach to this problem - when processing inputs, remove the last token but constrain the first output token to be one that starts with the removed text. So in our example, the model input would become:
The link is <a href="http
But we would make sure that the first output token starts with a colon, leading to the correct output - "://" (with no spaces).
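Here's a rough approximation of that idea - not the full constrained-decoding implementation (Microsoft's guidance library handles this properly at the decoder level), but enough to see the mechanics. This sketch assumes the legacy Completions endpoint (which takes a raw text prompt and returns top_logprobs as token-to-logprob mappings) and uses tiktoken to find the trailing partial token:

import tiktoken
from openai import OpenAI

client = OpenAI()
enc = tiktoken.get_encoding("cl100k_base")  # tokenizer used by GPT-3.5/GPT-4

prompt = 'The link is <a href="http:'

# Step 1: trim the trailing (possibly partial) token from the prompt
tokens = enc.encode(prompt)
removed = enc.decode(tokens[-1:])  # the trailing partial token, e.g. ':'
trimmed = enc.decode(tokens[:-1])

# Step 2: ask for candidate next tokens at the trimmed position
# (logprobs=5 returns the top 5 candidates for each position
# as a {token: logprob} mapping)
resp = client.completions.create(
    model="gpt-3.5-turbo-instruct",
    prompt=trimmed,
    max_tokens=1,
    logprobs=5,
)
candidates = resp.choices[0].logprobs.top_logprobs[0]

# Step 3: constrain the choice to candidates that start with the removed
# text, so the "healed" token still honors what was actually typed
viable = {t: lp for t, lp in candidates.items() if t.startswith(removed)}
if viable:
    print(trimmed + max(viable, key=viable.get))  # e.g. ...<a href="http://
else:
    print(prompt)  # nothing viable in the top 5 -- fall back to the original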
Autocomplete
And last but not least, logprobs make it pretty straightforward to build autocomplete features when working with LLMs. If we send the top logprobs directly back to the user, we can build a UI where users see the next most likely words based on the sentence so far (there would be some work involved to make sure we're displaying full words and not just individual tokens).
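Here's a bare-bones sketch of that (the "Continue this text" prompt wrapper is a placeholder of mine, and a real implementation would still need that full-word handling):

import math
from openai import OpenAI

client = OpenAI()

def suggest_next(text_so_far: str, n: int = 3) -> list[tuple[str, float]]:
    """Return the n most likely next tokens, with probabilities,
    to display as autocomplete suggestions."""
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user",
                   "content": f"Continue this text naturally: {text_so_far}"}],
        max_tokens=1,
        logprobs=True,
        top_logprobs=n,
    )
    candidates = response.choices[0].logprobs.content[0].top_logprobs
    return [(c.token, math.exp(c.logprob)) for c in candidates]

print(suggest_next("I'm heading to the store to pick up some"))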
Final thoughts
As I mentioned above, logprobs were available with GPT-3, and have now made it into GPT-3.5 and GPT-4. They're also available from at least some of the top open-source models (Llama 2 and Mistral, last I checked).
But it takes some additional work to process and store logprobs, and they can become unwieldy as LLM outputs get longer and longer. And as LLM API features go, they tend to be a lower priority than improvements to speed or cost. As a result, it's difficult to find hosted models that return logprobs alongside prompt results.
As a concept, I think logprobs are uniquely interesting. When we communicate with other people, we have no clue how confident they are - but we can know that with LLMs. What types of UX could we build if we knew what other people might say, and how confident they were in their conclusions? My hunch is we've barely scratched the surface of what we could do with logprobs specifically (and latent space in general).
But that’s a topic for another time.