Benchmarking R1 1776: A Post-Trained DeepSeek R1 671B Model

Purpose

This benchmark measures the real-world inference performance of Perplexity AI's R1 1776, a post-trained version of DeepSeek R1 671B intended to remove censorship and deliver unbiased information. Performance was measured under controlled conversational-growth conditions.

The focus is on gradual context expansion, realistic outputs, and streaming performance at different context sizes.

The test was designed to simulate natural conversational use cases rather than synthetic token stuffing or artificial padding.
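As a rough illustration of this gradual-expansion approach, each round appends the previous question and answer to the running message history, so the context the model sees grows organically. This is a minimal sketch; the question texts, the `next_round` helper, and the answer stub are hypothetical placeholders, not taken from the actual benchmark script.

```python
# Sketch of gradual context expansion: each round appends the previous
# question and answer to the running message history, so the context grows
# naturally rather than via synthetic token stuffing or padding.

def next_round(history, question, answer):
    """Return a new message history with one more Q/A exchange appended."""
    return history + [
        {"role": "user", "content": question},
        {"role": "assistant", "content": answer},
    ]

# Hypothetical rocket-engine questions in the spirit of the benchmark.
questions = [
    "How does a staged-combustion rocket engine work?",
    "What distinguishes full-flow staged combustion?",
    "Why do some engines use film cooling?",
]

history = []
for q in questions:
    history = next_round(history, q, "<model reply streamed here>")

print(len(history))  # two messages (user + assistant) per round
```

Each round's prompt therefore carries the full prior conversation, which is what drives the steady context growth seen in the results.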

Testing Setup

Model: r1-1776:671b (customized from deepseek-r1:671b)

Inference Backend: Ollama

Hardware: 2025 Apple Mac Studio M3 Ultra with 512 GB Shared Memory

Prompting Style: Progressive, real questions about rocket engines to simulate natural conversation.

Metrics Measured:

Time to First Token (TTFT)

Total Tokens Streamed

Tokens Per Second (TPS) Including TTFT

Tokens Per Second (TPS) Streaming Only

Context Length Growth
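For concreteness, the two TPS figures can be derived from three raw measurements per round: TTFT, total wall-clock time, and tokens streamed. This is an illustrative sketch (the actual benchmark script may compute them differently), checked here against round 1 of the results table.

```python
# Deriving the reported metrics from raw per-round measurements.
# Illustrative only; the real script may differ in detail.

def round_metrics(ttft_s, total_time_s, tokens_streamed):
    streaming_time_s = total_time_s - ttft_s
    return {
        "streaming_time_s": streaming_time_s,
        "tps_incl_ttft": tokens_streamed / total_time_s,
        "tps_streaming": tokens_streamed / streaming_time_s,
    }

# Round 1 from the results table: TTFT 0.40 s, total 9.75 s, 163 tokens.
m = round_metrics(0.40, 9.75, 163)
print(round(m["tps_incl_ttft"], 2))  # 16.72, matching the table
print(round(m["tps_streaming"], 2))  # ~17.43 (table shows 17.44, likely from unrounded timings)
```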

Why a Custom Modelfile Was Needed

The base R1-1776 671B model naturally produces long, detailed outputs due to its chain-of-thought (reasoning) optimization. I wanted to test the impact of context length incrementally, but without constraints the second iteration's context jumped to nearly 1,000 tokens.
For controlled benchmarking, it was therefore necessary to limit both reasoning and response length so the context could grow in granular steps.

A custom Modelfile was created to achieve this:

FROM r1-1776:671b

PARAMETER num_predict 250
PARAMETER temperature 0.7

SYSTEM "You are a helpful assistant. Think briefly if necessary, but keep your internal reasoning very short. Focus on delivering direct and concise answers within two sentences whenever possible."

The Modelfile constraints served several purposes:

| Constraint | Reason |
| --- | --- |
| `num_predict = 250` | Hard limit on maximum output tokens, allowing incremental context-length results. |
| `temperature = 0.7` | Realistic variability for production-like conversational behavior. |
| System prompt | Constrains the model's reasoning to be brief and direct without suppressing it entirely. |
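The effect of the `num_predict` cap on context growth can be sketched with a simplified model: each round adds at most the prompt length plus 250 generated tokens to the context. The per-round prompt and generation counts below are illustrative, not measured; real tokenization and `<think>` handling will differ.

```python
# Simplified model of per-round context growth under a num_predict cap.
# Prompt/generation token counts are illustrative, not from the benchmark.

NUM_PREDICT = 250  # hard cap on generated tokens per round

def grown_context(context, prompt_tokens, generated_tokens):
    """Context after one round: prior context + prompt + capped output."""
    return context + prompt_tokens + min(generated_tokens, NUM_PREDICT)

ctx = 17  # round-1 context size from the results table
for prompt_tokens, generated_tokens in [(30, 200), (25, 400), (20, 180)]:
    ctx = grown_context(ctx, prompt_tokens, generated_tokens)

# The 400-token generation in round 2 is clamped to 250, which is what
# keeps the per-round context increments predictable.
print(ctx)
```

With Ollama, a Modelfile like the one above is typically registered with `ollama create <name> -f Modelfile` before running the benchmark.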

Key Observations

| Aspect | Notes |
| --- | --- |
| Time to First Token | Increased smoothly with context size. |
| Tokens Per Second | Decreased gently from ~16–17 TPS to ~13 TPS. |
| Streaming Stability | Smooth and stable across growing contexts. |
| Context Growth | Steady and predictable without sudden jumps. |
| Reasoning Size | Controlled; brief `<think>` sections that did not dominate output. |
| Realism | High; model outputs resembled real-world chat applications without robotic behavior. |

Detailed Benchmark Results

What you came to see....

| Round | Context Tokens | TTFT (s) | Total Time (s) | Streaming Time (s) | Tokens Streamed | TPS incl. TTFT | TPS (Streaming Only) |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | 17 | 0.40 | 9.75 | 9.35 | 163 | 16.72 | 17.44 |
| 2 | 240 | 6.10 | 16.26 | 10.17 | 171 | 10.51 | 16.82 |
| 3 | 476 | 6.24 | 15.17 | 8.93 | 145 | 9.56 | 16.25 |
| 4 | 688 | 5.07 | 15.27 | 10.21 | 160 | 10.48 | 15.67 |
| 5 | 923 | 6.00 | 19.24 | 13.24 | 200 | 10.40 | 15.11 |
| 6 | 1193 | 8.03 | 21.86 | 13.84 | 200 | 9.15 | 14.45 |
| 7 | 1485 | 7.99 | 21.02 | 13.03 | 180 | 8.56 | 13.81 |
| 8 | 1744 | 7.36 | 22.33 | 14.97 | 200 | 8.96 | 13.36 |
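The "decreased gently" observation can be quantified directly from the table: comparing the first and last rounds gives the average streaming-TPS loss per 1,000 tokens of added context. A quick sketch using the table's own numbers:

```python
# Streaming-only TPS and context sizes, transcribed from the results table.
tps_streaming = [17.44, 16.82, 16.25, 15.67, 15.11, 14.45, 13.81, 13.36]
contexts = [17, 240, 476, 688, 923, 1193, 1485, 1744]

# Average TPS lost per additional 1,000 context tokens (endpoint slope).
decline = (tps_streaming[0] - tps_streaming[-1]) / (contexts[-1] - contexts[0]) * 1000
print(round(decline, 2))  # roughly 2.4 TPS lost per 1,000 context tokens
```

A fuller analysis would fit all eight points rather than just the endpoints, but the gentle, near-linear slope is visible either way.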

Resources

The benchmark Python script and Modelfile used in this testing: