Benchmarking R1 1776: A Post-Trained DeepSeek R1 671B Model

Purpose

This benchmark measures the real-world inference performance of Perplexity AI's R1 1776, a post-trained version of DeepSeek R1 671B intended to remove censorship and deliver unbiased information. Performance was measured under controlled conversational-growth conditions.

The focus is on gradual context expansion, realistic outputs, and streaming performance at different context sizes.

The test was designed to simulate natural conversational use cases rather than synthetic token stuffing or artificial padding.
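As a rough illustration of this gradual-expansion approach, each round appends the previous question and answer to the running message history, so the context the model sees grows organically. This is a minimal sketch; the question texts, the `next_round` helper, and the answer stub are hypothetical placeholders, not taken from the actual benchmark script.

```python
# Sketch of gradual context expansion: each round appends the previous
# question and answer to the running message history, so the context grows
# naturally rather than via synthetic token stuffing or padding.

def next_round(history, question, answer):
    """Return a new message history with one more Q/A exchange appended."""
    return history + [
        {"role": "user", "content": question},
        {"role": "assistant", "content": answer},
    ]

# Hypothetical rocket-engine questions in the spirit of the benchmark.
questions = [
    "How does a staged-combustion rocket engine work?",
    "What distinguishes full-flow staged combustion?",
    "Why do some engines use film cooling?",
]

history = []
for q in questions:
    history = next_round(history, q, "<model reply streamed here>")

print(len(history))  # two messages (user + assistant) per round
```

Each round's prompt therefore carries the full prior conversation, which is what drives the steady context growth seen in the results.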

Testing Setup

Model: r1-1776:671b (customized from deepseek-r1:671b)

Inference Backend: Ollama

Hardware: 2025 Apple Mac Studio M3 Ultra with 512 GB Shared Memory

Prompting Style: Progressive, real questions about rocket engines to simulate natural conversation.

Metrics Measured:

Time to First Token (TTFT)

Total Tokens Streamed

Tokens Per Second (TPS) Including TTFT

Tokens Per Second (TPS) Streaming Only

Context Length Growth
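For concreteness, the two TPS figures can be derived from three raw measurements per round: TTFT, total wall-clock time, and tokens streamed. This is an illustrative sketch (the actual benchmark script may compute them differently), checked here against round 1 of the results table.

```python
# Deriving the reported metrics from raw per-round measurements.
# Illustrative only; the real script may differ in detail.

def round_metrics(ttft_s, total_time_s, tokens_streamed):
    streaming_time_s = total_time_s - ttft_s
    return {
        "streaming_time_s": streaming_time_s,
        "tps_incl_ttft": tokens_streamed / total_time_s,
        "tps_streaming": tokens_streamed / streaming_time_s,
    }

# Round 1 from the results table: TTFT 0.40 s, total 9.75 s, 163 tokens.
m = round_metrics(0.40, 9.75, 163)
print(round(m["tps_incl_ttft"], 2))  # 16.72, matching the table
print(round(m["tps_streaming"], 2))  # ~17.43 (table shows 17.44, likely from unrounded timings)
```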

Why a Custom Modelfile Was Needed

The base R1-1776 671B model naturally produces long, detailed outputs due to its chain-of-thought (reasoning) optimization. I wanted to test the impact of context length incrementally, but without constraints the second iteration's context jumped to nearly 1,000 tokens.
For controlled benchmarking, it was therefore necessary to limit both reasoning and response length so the context could grow in granular steps.

A custom Modelfile was created to achieve this:

FROM r1-1776:671b

PARAMETER num_predict 250
PARAMETER temperature 0.7

SYSTEM "You are a helpful assistant. Think briefly if necessary, but keep your internal reasoning very short. Focus on delivering direct and concise answers within two sentences whenever possible."

The Modelfile constraints served several purposes:

| Constraint | Reason |
| --- | --- |
| `num_predict = 250` | Hard limit on maximum output tokens, allowing incremental context-length results. |
| `temperature = 0.7` | Realistic variability for production-like conversational behavior. |
| System prompt | Constrains the model's reasoning to be brief and direct without suppressing it entirely. |
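The effect of the `num_predict` cap on context growth can be sketched with a simplified model: each round adds at most the prompt length plus 250 generated tokens to the context. The per-round prompt and generation counts below are illustrative, not measured; real tokenization and `<think>` handling will differ.

```python
# Simplified model of per-round context growth under a num_predict cap.
# Prompt/generation token counts are illustrative, not from the benchmark.

NUM_PREDICT = 250  # hard cap on generated tokens per round

def grown_context(context, prompt_tokens, generated_tokens):
    """Context after one round: prior context + prompt + capped output."""
    return context + prompt_tokens + min(generated_tokens, NUM_PREDICT)

ctx = 17  # round-1 context size from the results table
for prompt_tokens, generated_tokens in [(30, 200), (25, 400), (20, 180)]:
    ctx = grown_context(ctx, prompt_tokens, generated_tokens)

# The 400-token generation in round 2 is clamped to 250, which is what
# keeps the per-round context increments predictable.
print(ctx)
```

With Ollama, a Modelfile like the one above is typically registered with `ollama create <name> -f Modelfile` before running the benchmark.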

Key Observations

| Aspect | Notes |
| --- | --- |
| Time to First Token | Increased smoothly with context size. |
| Tokens Per Second | Decreased gently from ~16–17 TPS to ~13 TPS. |
| Streaming Stability | Smooth and stable across growing contexts. |
| Context Growth | Steady and predictable without sudden jumps. |
| Reasoning Size | Controlled; brief `<think>` sections that did not dominate output. |
| Realism | High; model outputs resembled real-world chat applications without robotic behavior. |

Detailed Benchmark Results

What you came to see....

| Round | Context Tokens | TTFT (s) | Total Time (s) | Streaming Time (s) | Tokens Streamed | TPS incl. TTFT | TPS (Streaming Only) |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | 17 | 0.40 | 9.75 | 9.35 | 163 | 16.72 | 17.44 |
| 2 | 240 | 6.10 | 16.26 | 10.17 | 171 | 10.51 | 16.82 |
| 3 | 476 | 6.24 | 15.17 | 8.93 | 145 | 9.56 | 16.25 |
| 4 | 688 | 5.07 | 15.27 | 10.21 | 160 | 10.48 | 15.67 |
| 5 | 923 | 6.00 | 19.24 | 13.24 | 200 | 10.40 | 15.11 |
| 6 | 1193 | 8.03 | 21.86 | 13.84 | 200 | 9.15 | 14.45 |
| 7 | 1485 | 7.99 | 21.02 | 13.03 | 180 | 8.56 | 13.81 |
| 8 | 1744 | 7.36 | 22.33 | 14.97 | 200 | 8.96 | 13.36 |
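The "decreased gently" observation can be quantified directly from the table: comparing the first and last rounds gives the average streaming-TPS loss per 1,000 tokens of added context. A quick sketch using the table's own numbers:

```python
# Streaming-only TPS and context sizes, transcribed from the results table.
tps_streaming = [17.44, 16.82, 16.25, 15.67, 15.11, 14.45, 13.81, 13.36]
contexts = [17, 240, 476, 688, 923, 1193, 1485, 1744]

# Average TPS lost per additional 1,000 context tokens (endpoint slope).
decline = (tps_streaming[0] - tps_streaming[-1]) / (contexts[-1] - contexts[0]) * 1000
print(round(decline, 2))  # roughly 2.4 TPS lost per 1,000 context tokens
```

A fuller analysis would fit all eight points rather than just the endpoints, but the gentle, near-linear slope is visible either way.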

Resources

The benchmark Python script and Modelfile used in this testing: