- Rayhan Memon
#5 - OpenAI's o1 & Chain of Thought Reasoning
What's different, why you should care, and how to use it.
Welcome back to MAJOR.MINOR.PATCH.
Each edition, we cover one notable upgrade to the world of computing, why you should care about it, and how you can use it.
This week…
OpenAI's o1.
OpenAI revealed its new model, o1, on September 12, 2024.
Unlike the models that came before it, o1 uses Chain of Thought (CoT) reasoning during inference. When asked a complicated question, it breaks the problem into simpler steps and works through them, revising its tactics along the way if need be.
You can see some demos of o1 reasoning through problems in OpenAI’s press release.
Why you should care about o1.
o1’s performance is astounding.
It ranks in the 89th percentile on competitive programming questions (Codeforces),
it places among the top 500 students in the US in a qualifier for the USA Math Olympiad (AIME), and
it exceeds human PhD-level accuracy on a benchmark of physics, biology, and chemistry problems (GPQA).
And it’s a major improvement over GPT-4o in several other ML benchmarks as well:
For the LLMs on the market right now, the performance gains on each successive release are tapering off. At least it feels that way as a user.
A big cause for optimism with o1 (and the competing models that will surely clone this "reinforcement learning on chains of thought" technique) is that initial results show performance scaling significantly with more training and inference compute, potentially leading to new breakthroughs in reasoning and planning.
Its performance keeps improving with more compute. And as we know from previous editions of MAJOR.MINOR.PATCH., there’s a lot of innovation going on in the hardware acceleration space right now. Accuracy may creep up to the 95th percentile for various benchmarks faster than we expect.
Another notable advancement is in the AI Safety department. Detractors have long complained that “AI models are a black box!” and “you can’t tell what they’re thinking!”
Well… now you can.
The OpenAI team believes that using a chain of thought offers significant advances for safety and alignment because:
It enables us to observe the model thinking in a legible way, and
the model reasoning about safety rules is more robust to out-of-distribution scenarios.
The truth of (1) is obvious. For (2), the proof is in the pudding.
o1’s applications.
What keeps humans in the loop and limits the applications of generative AI today is its lack of accuracy. If your AI is only right 70% of the time, you’re not going to let it do any meaningful work without human supervision.
But what about when the error rate is less than 1%? That kind of performance may usher in a golden era of startup activity, where the applications are truly making an economic difference.
o1 is best suited for applications that require a multi-step approach — such as coding, math, and science problems — and that can stomach significant latency.
Because o1 "stops to think," running through an internal chain of thought before answering, its wait time is highly variable: it can take upwards of 10 seconds to generate its first token, depending on the complexity of the problem.
Therefore, in the short term, we likely won't see novel consumer applications built on top of o1.
However, these models may finally have something of value to offer the scientific research community beyond proofreading papers. That's a world of applications that is still largely unexplored.
Using o1.
Right now, ChatGPT subscribers have access to o1-preview, and trusted OpenAI developers have access to o1 through the API.
The API currently offers access to two variants of the o1 model:
o1-preview: This is the early preview of the full o1 model, designed to tackle complex problems requiring broad general knowledge.
o1-mini: A faster and more cost-effective version of o1, well-suited for tasks in coding, math, and science where extensive general knowledge might not be necessary.
Both o1-preview and o1-mini are accessible via the chat completions endpoint, making it easy to incorporate them into existing projects. The process involves selecting the desired model (e.g., model="o1-preview") when making API calls.
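A call might look like the following sketch. It assumes the official `openai` Python package and an `OPENAI_API_KEY` environment variable; the prompt text and helper names are illustrative, not from OpenAI's documentation.

```python
# Minimal sketch of calling o1 through the chat completions endpoint.
# Assumes `pip install openai` and an OPENAI_API_KEY environment variable;
# the helper names and prompt are illustrative.

def build_o1_request(prompt: str, model: str = "o1-preview") -> dict:
    # o1 models take a plain user message. At launch they do not accept
    # some familiar parameters (e.g. a system message or temperature),
    # so the payload stays deliberately simple.
    return {"model": model, "messages": [{"role": "user", "content": prompt}]}

def ask_o1(prompt: str, model: str = "o1-preview") -> str:
    from openai import OpenAI  # imported here so the sketch loads without the package
    client = OpenAI()
    response = client.chat.completions.create(**build_o1_request(prompt, model))
    return response.choices[0].message.content
```

Swapping in model="o1-mini" targets the cheaper variant; everything else stays identical, since both models share the chat completions interface.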
Sources
OpenAI o1: ChatGPT Supercharged! (Two Minute Papers)
OpenAI o1 Guide: How It Works, Use Cases, API & More (Datacamp)
Learning to Reason with LLMs (OpenAI)
Some Non-Obvious Points About OpenAI’s o1 (TheSequence)