Benchmarking R1 1776: A Post-Trained DeepSeek R1 671B Model

Purpose
This benchmark measures the real-world inference performance of Perplexity AI’s R1 1776 model, a post-trained version of DeepSeek R1 671B designed to eliminate censorship and deliver unbiased information, under controlled conversational-growth conditions.
The focus is on gradual context expansion, realistic outputs, and streaming performance at different context sizes.
The test was designed to simulate natural conversational use cases rather than synthetic token stuffing or artificial padding.
Testing Setup
• Model: r1-1776:671b (customized from deepseek-r1:671b)
• Inference Backend: Ollama
• Hardware: 2025 Apple Mac Studio M3 Ultra with 512 GB Shared Memory
• Prompting Style: A progressive series of real questions about rocket engines, simulating a natural multi-turn conversation.
Metrics Measured:
• Time to First Token (TTFT)
• Total Tokens Streamed
• Tokens Per Second (TPS) Including TTFT
• Tokens Per Second (TPS) Streaming Only
• Context Length Growth
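Each of these metrics can be derived from three raw measurements per round: the time to first token, the total wall-clock time, and the number of tokens streamed. The following is a minimal sketch of that arithmetic (the class and property names are my own, not taken from the benchmark script):

```python
from dataclasses import dataclass

@dataclass
class RoundMetrics:
    ttft: float           # seconds until the first streamed token arrives
    total_time: float     # seconds from request start to last token
    tokens_streamed: int  # tokens received over the stream

    @property
    def streaming_time(self) -> float:
        # Time spent actually streaming: total time minus the wait for the first token.
        return self.total_time - self.ttft

    @property
    def tps_incl_ttft(self) -> float:
        return self.tokens_streamed / self.total_time

    @property
    def tps_streaming_only(self) -> float:
        return self.tokens_streamed / self.streaming_time

# Example: round 1 from the results table further below
r1 = RoundMetrics(ttft=0.40, total_time=9.75, tokens_streamed=163)
print(round(r1.tps_incl_ttft, 2))       # ~16.72
print(round(r1.tps_streaming_only, 2))  # ~17.43 (the table shows 17.44, from unrounded raw timings)
```

Separating the two TPS figures matters because TTFT grows with context size while streaming throughput degrades much more slowly, so a single blended number would hide where the slowdown actually comes from.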
Why a Custom Modelfile Was Needed
The base R1-1776 671B model naturally produces long, detailed outputs because of its chain-of-thought (reasoning) optimization. I wanted to test the impact of context length incrementally, but left unconstrained, the second iteration alone pushed the context to nearly 1,000 tokens.
For controlled benchmarking, it was necessary to constrain both reasoning and response length so that the context could grow in small, predictable increments.
A custom Modelfile was created to achieve this:
```
FROM r1-1776:671b
PARAMETER num_predict 250
PARAMETER temperature 0.7
SYSTEM "You are a helpful assistant. Think briefly if necessary, but keep your internal reasoning very short. Focus on delivering direct and concise answers within two sentences whenever possible."
```
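With the Modelfile saved locally, a constrained variant can be built and run using Ollama's standard CLI (the model tag r1-1776-bench is my own placeholder, not a name from the original setup):

```shell
# Build a new model tag from the Modelfile in the current directory
ollama create r1-1776-bench -f Modelfile

# Quick smoke test of the constrained model
ollama run r1-1776-bench "In one or two sentences, how does a staged-combustion rocket engine work?"
```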
The Modelfile constraints served several purposes:
Constraint | Reason |
---|---|
num_predict = 250 | Hard limit on maximum output tokens, allowing incremental context-length results. |
temperature = 0.7 | Realistic variability for production-like conversational behavior. |
System Prompt | Constrains the model’s reasoning to be brief and direct without suppressing it entirely. |
Key Observations
Aspect | Notes |
---|---|
Time To First Token | Increased smoothly with context size. |
Tokens Per Second | Decreased gently from ~16–17 TPS to ~13 TPS. |
Streaming Stability | Smooth and stable across growing contexts. |
Context Growth | Steady and predictable without sudden jumps. |
Reasoning Size | Controlled; brief <think> sections that did not dominate output. |
Realism | High; model outputs resembled real-world chat applications without robotic behavior. |
Detailed Benchmark Results
What you came to see....
Round | Context Tokens | TTFT (s) | Total Time (s) | Streaming Time (s) | Tokens Streamed | TPS incl TTFT | TPS Streaming Only |
---|---|---|---|---|---|---|---|
1 | 17 | 0.40 | 9.75 | 9.35 | 163 | 16.72 | 17.44 |
2 | 240 | 6.10 | 16.26 | 10.17 | 171 | 10.51 | 16.82 |
3 | 476 | 6.24 | 15.17 | 8.93 | 145 | 9.56 | 16.25 |
4 | 688 | 5.07 | 15.27 | 10.21 | 160 | 10.48 | 15.67 |
5 | 923 | 6.00 | 19.24 | 13.24 | 200 | 10.40 | 15.11 |
6 | 1193 | 8.03 | 21.86 | 13.84 | 200 | 9.15 | 14.45 |
7 | 1485 | 7.99 | 21.02 | 13.03 | 180 | 8.56 | 13.81 |
8 | 1744 | 7.36 | 22.33 | 14.97 | 200 | 8.96 | 13.36 |
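As a sanity check, the throughput columns in the table are internally consistent: streaming time is total time minus TTFT, and each TPS column is tokens streamed divided by the corresponding time. A quick recomputation over the first and last rows (discrepancies in the last digit come from the script's unrounded raw timings):

```python
# (round, context, ttft, total, streaming, tokens, tps_incl, tps_stream)
rows = [
    (1, 17,   0.40, 9.75,  9.35,  163, 16.72, 17.44),
    (8, 1744, 7.36, 22.33, 14.97, 200, 8.96,  13.36),
]

for rnd, _ctx, ttft, total, streaming, tokens, tps_incl, tps_stream in rows:
    # Streaming time is the remainder of total time after the first-token wait.
    assert abs((total - ttft) - streaming) < 0.01
    # Both TPS columns are simple token-count-over-time ratios.
    assert abs(tokens / total - tps_incl) < 0.02
    assert abs(tokens / streaming - tps_stream) < 0.02
    print(f"round {rnd}: OK")
```

The same check passes for the intermediate rounds; the trend worth noting is that streaming-only TPS declines only from 17.44 to 13.36 while TPS including TTFT drops almost by half, confirming that the growing first-token latency dominates the slowdown.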
Resources
The benchmark Python script and Modelfile used in this testing: