
Gemini 3.5 Flash: a detailed benchmark and capability review

A detailed look at Gemini 3.5 Flash: what shipped at Google I/O 2026, pricing, Google's own benchmark table, Artificial Analysis numbers, and how it scores on the Appwrite Arena benchmark.

Gemini 3.5 Flash shipped on May 19, 2026 at Google I/O. Google positions it as "Pro-level reasoning at Flash-class latency," with the claim that a mid-tier model can carry agentic and coding workloads previously handled by the Pro tier.

This post evaluates that claim against three data sources: Google's published model card, Artificial Analysis, and Appwrite Arena, an open-source benchmark covering 191 questions across nine Appwrite service categories.

Model overview

Gemini 3.5 Flash is built on the Gemini 3 Flash reasoning foundation with explicit thinking levels that control quality, cost, and latency. The variant on the Artificial Analysis leaderboard and in most of Google's published numbers is the "high" thinking configuration.

Model specifications:

  • Inputs. Text, images, audio, video, and PDFs, up to a 1M token context window.
  • Output. Text only, with a 64K token output cap.
  • Knowledge cutoff. January 2025.
  • Tooling. Function calling, structured output, code execution, and search-as-a-tool are all first-party.
  • Distribution. Gemini app, Gemini API, Google AI Studio, Gemini Enterprise, the Gemini Enterprise Agent Platform, Google Search AI Mode, Google Antigravity, and Android Studio.
  • Status. Public preview at launch, free in the consumer Gemini app and Search AI Mode.

Pricing

API pricing per million tokens:

  • Input: $1.50
  • Output: $9.00
  • Cached input: $0.15 (90% discount)

How it compares:

  • vs Gemini 3 Flash ($0.50 / $3.00): 3x more on both input and output.
  • vs Gemini 3.1 Pro ($2.00 / $12.00): 25% cheaper per token on both input and output.
  • Within the Flash tier: the most expensive Flash-tier model Google has released.
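To make the per-token rates concrete, here is a minimal sketch of a per-request cost calculation using only the prices listed above. The dictionary keys are illustrative labels rather than official API model IDs, the workload sizes are hypothetical, and the code assumes cached tokens are billed at the cached rate with no other fees. Because no cached rate is listed above for the other two models, the sketch bills their cached tokens at the full input rate.

```python
# Prices in USD per million tokens, taken from the lists above.
# Keys are illustrative labels, not official API model IDs.
PRICES = {
    "gemini-3.5-flash": {"input": 1.50, "cached_input": 0.15, "output": 9.00},
    "gemini-3-flash": {"input": 0.50, "output": 3.00},
    "gemini-3.1-pro": {"input": 2.00, "output": 12.00},
}


def request_cost(model, input_tokens, output_tokens, cached_tokens=0):
    """Return the USD cost of one request.

    cached_tokens is the portion of input_tokens billed at the cached rate.
    Models without a listed cached rate fall back to the full input rate.
    """
    price = PRICES[model]
    cached_rate = price.get("cached_input", price["input"])
    fresh_tokens = input_tokens - cached_tokens
    total = (
        fresh_tokens * price["input"]
        + cached_tokens * cached_rate
        + output_tokens * price["output"]
    )
    return total / 1_000_000


# Hypothetical workload: 200k input tokens (150k of them cached), 8k output tokens.
for name in PRICES:
    cost = request_cost(name, 200_000, 8_000, cached_tokens=150_000)
    print(f"{name:18s} ${cost:.4f}")
```

On a cache-heavy workload like this one, the cached-input discount narrows the gap to Gemini 3 Flash considerably, while output tokens still cost 3x as much.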

Google's published benchmark table

The model card lists head-to-head numbers against Gemini 3 Flash, Gemini 3.1 Pro, Claude Sonnet 4.6, Claude Opus 4.7, and GPT-5.5. The full table:

| Category | Benchmark | Gemini 3.5 Flash | Gemini 3 Flash | Gemini 3.1 Pro | Claude Sonnet 4.6 | Claude Opus 4.7 | GPT-5.5 |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Coding | Terminal-Bench 2.1 (Terminus-2 harness) | 76.2% | 58.0% | 70.3% | n/a | 66.1% | 78.2% |
| Coding | SWE-Bench Pro (Public, single attempt) | 55.1% | 49.6% | 54.2% | n/a | 64.3% | 58.6% |
| Agentic | MCP Atlas (multi-step MCP workflows) | 83.6% | 62.0% | 78.2% | 69.5% | 79.1% | 75.3% |
| Agentic | Toolathlon (real-world tool use) | 56.5% | 49.4% | n/a | n/a | n/a | 55.6% |
| UI Control | OSWorld-Verified | 78.4% | 65.1% | 76.2% | 72.5% | 78.0% | 78.7% |
| Expert tasks | Finance Agent v2 | 57.9% | 42.6% | 43.0% | 51.0% | 51.5% | 51.8% |
| Expert tasks | GDPval-AA (Elo) | 1656 | 1204 | 1314 | 1676 | 1753 | 1769 |
| Multimodal | CharXiv Reasoning (no tools) | 84.2% | 80.3% | 83.3% | 72.4% | 82.1% | 84.1% |
| Multimodal | MMMU-Pro (no tools) | 83.6% | 81.2% | 80.5% | 74.5% | 75.2% | 81.2% |
| Multimodal | Blueprint-Bench 2 (normalized) | 33.6% | 0.0% | 26.5% | 6.7% | 24.5% | 36.2% |
| Long context | MRCR v2 (8-needle, 128k average) | 77.3% | 67.2% | 84.9% | 84.9% | 59.3% | 94.8% |
| Long context | MRCR v2 (1M, pointwise) | 26.6% | 22.1% | 26.3% | n/a | n/a | n/a |
| Reasoning | Humanity's Last Exam (full set) | 40.2% | 33.7% | 44.4% | 33.2% | 46.9% | 41.4% |
| Reasoning | ARC-AGI-2 | 72.1% | 33.6% | 77.1% | 58.3% | 75.8% | 84.6% |

Gemini 3.5 Flash leads Pro-class models on agentic tasks (MCP Atlas, Toolathlon, Finance Agent v2) and on multimodal reasoning (CharXiv, MMMU-Pro). It trails on academic reasoning (Humanity's Last Exam, ARC-AGI-2). For coding, it lands between 3.1 Pro and GPT-5.5 on both benchmarks, while Claude Opus 4.7 leads SWE-Bench Pro at 64.3%.

The largest agentic gain is MCP Atlas: a 21.6-point increase over Gemini 3 Flash and 5.4 points over 3.1 Pro. On MCP tool-call workloads, 3.5 Flash is Google's strongest model in the Gemini 3 series.
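The lead-and-trail summary above can be checked with a few lines of arithmetic. The sketch below copies seven rows from Google's table and computes the margin between 3.5 Flash and the best other model that reports a score. The function name and the choice of rows are illustrative, and "n/a" cells are treated as missing rather than as zero.

```python
# Scores (%) copied from the table above; None marks "n/a".
# Column order: 3.5 Flash, 3 Flash, 3.1 Pro, Sonnet 4.6, Opus 4.7, GPT-5.5.
SCORES = {
    "MCP Atlas": (83.6, 62.0, 78.2, 69.5, 79.1, 75.3),
    "Toolathlon": (56.5, 49.4, None, None, None, 55.6),
    "Finance Agent v2": (57.9, 42.6, 43.0, 51.0, 51.5, 51.8),
    "CharXiv Reasoning": (84.2, 80.3, 83.3, 72.4, 82.1, 84.1),
    "MMMU-Pro": (83.6, 81.2, 80.5, 74.5, 75.2, 81.2),
    "Humanity's Last Exam": (40.2, 33.7, 44.4, 33.2, 46.9, 41.4),
    "ARC-AGI-2": (72.1, 33.6, 77.1, 58.3, 75.8, 84.6),
}


def margin_over_best_rival(row):
    """Points by which 3.5 Flash leads (positive) or trails (negative)
    the best-scoring other model that has a reported result."""
    flash, *rivals = row
    best = max(score for score in rivals if score is not None)
    return round(flash - best, 1)


for benchmark, row in SCORES.items():
    print(f"{benchmark:22s} {margin_over_best_rival(row):+.1f}")
```

Positive margins fall on the agentic and multimodal rows and negative margins on the two reasoning rows. Some leads are narrow: 0.1 points on CharXiv and 0.9 on Toolathlon, where only GPT-5.5 reports a comparison score.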

Artificial Analysis

Artificial Analysis runs an independent evaluation suite and ranks models by Intelligence Index, a composite of 10 evaluations: GDPval-AA, 𝜏²-Bench Telecom, Terminal-Bench Hard, SciCode, AA-LCR, AA-Omniscience, IFBench, Humanity's Last Exam, GPQA Diamond, and CritPt.

Gemini 3.5 Flash on Artificial Analysis:

  • Intelligence Index: 55.3 (rank #7 of 147). Top three: GPT-5.5 (xhigh) at 60.2, GPT-5.5 (high) at 58.9, Claude Opus 4.7 (max) at 57.3.
  • Speed: 278 output tokens per second (rank #2 of 147 in its AA price class). The closest frontier peer is gpt-oss-120b (high) at 246. Other frontier-class models are well behind: Gemini 3.1 Pro Preview at 123, GPT-5.5 (xhigh) at 65, Claude Opus 4.7 (max) at 50.
  • Verbosity: 73M tokens generated across the Intelligence Index suite, against a leaderboard average of 36M. Verbosity counts how many output tokens the model produced to complete the eval suite. Higher means the model spent more reasoning tokens per answer, which raises latency and bill size even when the per-token price is low.
  • Cost to evaluate the Intelligence Index: $1,552. This is the total dollar cost to run the full Intelligence Index once, combining per-token pricing and token volume, and it serves as a proxy for what the model costs on heavy reasoning workloads in production. The figure is 5.5x Gemini 3 Flash and about 74% more than Gemini 3.1 Pro ($892), despite the lower per-token rate.
  • Hallucination rate: 61% on the AA hallucination measure, 31 points lower than Gemini 3 Flash. The hallucination measure is the share of responses on a fabrication-probing prompt set where the model produces incorrect or invented content. Lower is better, and a 31-point drop versus the predecessor indicates a material gain in factual reliability.

On the intelligence-versus-speed axis, Artificial Analysis ranks Gemini 3.5 Flash as the Pareto leader. No model in the same intelligence bracket runs near 278 tokens per second.
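As an illustration of what sitting on a Pareto frontier means, the sketch below applies the standard dominance test to the five models for which this post gives both an Intelligence Index and a speed figure. It is not Artificial Analysis's own computation, and the full leaderboard has 147 models, so this subset only shows the mechanics.

```python
# (Intelligence Index, output tokens per second) as listed in this post.
MODELS = {
    "GPT-5.5 (xhigh)": (60.2, 65),
    "Claude Opus 4.7 (max)": (57.3, 50),
    "Gemini 3.1 Pro Preview": (57.2, 123),
    "Gemini 3.5 Flash (high)": (55.3, 278),
    "Kimi K2.6": (53.9, 98),
}


def dominates(a, b):
    """True if a is at least as good as b on both axes and better on one."""
    return a[0] >= b[0] and a[1] >= b[1] and a != b


def pareto_frontier(models):
    """Return the names of models that no other model dominates."""
    return [
        name
        for name, point in models.items()
        if not any(dominates(other, point) for other in models.values())
    ]


print(pareto_frontier(MODELS))
```

On this subset three models are undominated: GPT-5.5 (xhigh) at the high-intelligence end, Gemini 3.1 Pro Preview in the middle, and Gemini 3.5 Flash at the high-speed end.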

Intelligence per token against SOTA peers

Per-model summaries from Artificial Analysis:

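| Model | AA Intelligence Index | Output tokens (full Index) | Total eval cost | Speed (tok/s) | Input $/Mtok | Output $/Mtok |
| --- | --- | --- | --- | --- | --- | --- |
| GPT-5.5 (xhigh) | 60.2 | 75M | $3,357 | 65 | $5.00 | $30.00 |
| Claude Opus 4.7 (max) | 57.3 | 110M | $5,117 | 50 | $6.25 | $25.00 |
| Gemini 3.1 Pro Preview | 57.2 | 57M | $892 | 123 | $2.00 | $12.00 |
| Gemini 3.5 Flash (high) | 55.3 | 73M | $1,552 | 278 | $1.50 | $9.00 |
| Kimi K2.6 | 53.9 | 170M | $948 | 98 | $0.95 | $4.00 |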

Two points are worth calling out.

GPT-5.5 is more intelligent on a similar token budget. GPT-5.5 (xhigh) generates 75M tokens for the full Intelligence Index against 3.5 Flash's 73M, a 3% difference. For roughly the same output token count, GPT-5.5 scores 60.2 versus 55.3. The reason GPT-5.5's eval cost lands at $3,357 against 3.5 Flash's $1,552 is per-token pricing ($5/$30 vs $1.50/$9), not token efficiency. On quality per token, GPT-5.5 leads.

Gemini 3.1 Pro is the sharper internal comparison. 3.1 Pro Preview generates 57M tokens, 22% fewer than 3.5 Flash, and scores 57.2 on the Intelligence Index, 1.9 points higher. Total eval cost is $892, 42% lower than 3.5 Flash. Apart from the lower per-token price, the only axis where 3.5 Flash leads is speed: 278 tokens per second versus 3.1 Pro's 123. Google's "Pro-level reasoning at Flash-class latency" claim holds on latency. On the Intelligence Index itself, 3.5 Flash is the second-best Gemini and uses more tokens than 3.1 Pro to reach a lower score.
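The two comparisons above reduce to simple ratios. The sketch below derives Index points per million output tokens and dollars per Index point from the table. Both ratios are illustrative conveniences for this post, not metrics that Artificial Analysis publishes.

```python
# (Intelligence Index, output tokens in millions, total eval cost in USD),
# copied from the table above.
AA = {
    "GPT-5.5 (xhigh)": (60.2, 75, 3357),
    "Claude Opus 4.7 (max)": (57.3, 110, 5117),
    "Gemini 3.1 Pro Preview": (57.2, 57, 892),
    "Gemini 3.5 Flash (high)": (55.3, 73, 1552),
    "Kimi K2.6": (53.9, 170, 948),
}


def efficiency(score, tokens_m, cost_usd):
    """Return two crude efficiency ratios for one model."""
    return {
        "points_per_mtok": score / tokens_m,  # higher is better
        "usd_per_point": cost_usd / score,    # lower is better
    }


for name, row in AA.items():
    e = efficiency(*row)
    print(f"{name:26s} {e['points_per_mtok']:.2f} pts/Mtok  ${e['usd_per_point']:.2f}/pt")
```

By these ratios Gemini 3.1 Pro Preview is ahead on both counts, GPT-5.5 is ahead of 3.5 Flash per token but costs roughly twice as much per point, and 3.5 Flash sits mid-pack on each.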

Appwrite Arena: backend SDK and API performance

Public leaderboards measure general capability, not whether a model can drive an SDK without hallucinating method names. Appwrite Arena is an open-source benchmark covering 191 questions across nine Appwrite service categories: Foundation, Auth, Databases, Functions, Storage, Sites, Messaging, Realtime, and CLI. Each model is evaluated twice: once with the relevant Appwrite Skill loaded into context, and once without. Results are published on GitHub.

Top finishers on the May 20, 2026 run:

With Skills loaded (Skill files in context, 191 questions):

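| Model | Overall | MCQ | Freeform | Cost (USD) | Duration |
| --- | --- | --- | --- | --- | --- |
| GPT 5.5 | 97.70 | 98.20 | 94.80 | $4.51 | 33m |
| Claude Opus 4.7 | 97.10 | 97.60 | 94.20 | $3.07 | 53m |
| Qwen 3.6 Plus | 96.50 | 97.60 | 89.80 | $0.58 | 54m |
| Kimi K2.6 | 96.30 | 97.00 | 91.90 | $1.64 | 135m |
| Gemini 3.5 Flash | 96.20 | 96.90 | 91.90 | $3.78 | 20m |
| DeepSeek V4 Flash | 96.10 | 96.40 | 94.20 | $0.37 | 125m |
| Gemini 3.1 Pro (Preview) | 92.70 | 93.30 | 88.80 | $4.44 | 45m |
| Gemini 3.1 Flash Lite (Preview) | 88.30 | 89.70 | 79.40 | $0.59 | 19m |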

Without Skills (model's built-in knowledge only):

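| Model | Overall | MCQ | Freeform | Cost (USD) | Duration |
| --- | --- | --- | --- | --- | --- |
| Claude Opus 4.7 | 96.20 | 96.40 | 94.80 | $1.89 | 25m |
| GPT 5.5 | 94.20 | 94.50 | 90.00 | $2.19 | 27m |
| Kimi K2.6 | 93.60 | 95.20 | 83.50 | $0.48 | 103m |
| Gemini 3.1 Pro (Preview) | 92.50 | 95.30 | 76.90 | $1.34 | 26m |
| Gemini 3.5 Flash | 90.70 | 92.90 | 77.50 | $1.14 | 13m |
| GLM 5.1 | 90.20 | 91.50 | 81.90 | $0.30 | 45m |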

Three observations from the Arena data.

Gemini 3.5 Flash is the fastest model in the top tier. Its runs of 20 minutes with Skills and 13 minutes without are shorter than those of every other model scoring above 90. The only model in the with-Skills table with a shorter run is Gemini 3.1 Flash Lite at 19 minutes, but it scores 88.3, below the 90-point top tier.

Skills materially improve the freeform score. Without Skills, freeform scores 77.5%. With Skills, freeform reaches 91.9%, a 14.4-point increase. The same delta for GPT 5.5 is +4.8 points (90.0 to 94.8), and for Claude Opus 4.7 is −0.6 points (94.8 to 94.2), where Skills slightly lowered the score because the model's built-in Appwrite knowledge is already near the ceiling. 3.5 Flash relies more on in-context documentation than its frontier peers, consistent with the January 2025 knowledge cutoff.
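The Skills uplift quoted above is a simple difference of freeform scores. The sketch below reproduces it for the five models that appear in both Arena tables; the function name is illustrative.

```python
# Freeform scores as (with Skills, without Skills), copied from the two
# Arena tables above for the models that appear in both.
FREEFORM = {
    "Gemini 3.5 Flash": (91.9, 77.5),
    "Gemini 3.1 Pro (Preview)": (88.8, 76.9),
    "Kimi K2.6": (91.9, 83.5),
    "GPT 5.5": (94.8, 90.0),
    "Claude Opus 4.7": (94.2, 94.8),
}


def skill_uplift(with_skills, without_skills):
    """Freeform points gained (or lost) by loading the Skill into context."""
    return round(with_skills - without_skills, 1)


uplift = {model: skill_uplift(*scores) for model, scores in FREEFORM.items()}

# Print models from largest to smallest uplift.
for model, delta in sorted(uplift.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{model:26s} {delta:+.1f}")
```

Of these five, 3.5 Flash gains the most from Skills, followed by Gemini 3.1 Pro (Preview) and Kimi K2.6, while the two frontier models move by fewer than five points in either direction.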

Category profile. With Skills, 3.5 Flash scores 100% on Messaging, MCQ Foundation, MCQ Auth, MCQ Functions, and MCQ Sites, and 94.1% on MCQ Realtime. The weakest categories are TablesDB (89.1% with Skills, 77.8% without) and CLI (95.0% with Skills, 73.3% without). Both require the most current API surface, which the knowledge cutoff does not cover.

Workloads where 3.5 Flash is the right choice

  • MCP-driven agents. MCP Atlas at 83.6% is the highest result Google has published on the benchmark. For agents driving an MCP server such as Appwrite's API MCP, 3.5 Flash is the most cost-efficient frontier option.
  • Throughput-bound multimodal pipelines. CharXiv at 84.2% and MMMU-Pro at 83.6% at 278 tokens per second is a combination no other top-ten Intelligence Index model provides. Document ingestion with charts, audio and video reasoning, and pipelines with many small multimodal calls benefit directly.
  • Iterative coding agents on bounded scope. Terminal-Bench 2.1 at 76.2%, a 1M context window, and the highest throughput in the top ten allow more iterations per wall-clock minute than any frontier alternative. The reasoning gap to Opus 4.7 and GPT-5.5 only becomes a constraint on research-grade tasks.

Model selection for Appwrite projects

Appwrite provides the primitives an agent needs to operate on a project: typed tables, scoped API keys, an API MCP server, a Docs MCP server, and Agent Skills for every major SDK. The Arena results above show how each model performs against this surface.

Speed is the column where Gemini 3.5 Flash dominates, but speed is not coding intelligence. On the Arena freeform scores, GPT 5.5 and Claude Opus 4.7 lead 3.5 Flash on the same Appwrite coding tasks: by 2.9 and 2.3 points with Skills, and by 12.5 and 17.3 points without. Both also score higher on the Artificial Analysis Intelligence Index.

Two recommended defaults:

  1. For interactive workloads where a developer waits on the response, Gemini 3.5 Flash with the Appwrite Skill loaded is the fastest top-tier option. Use it when iteration speed beats per-response correctness.
  2. For coding work where correctness matters more than wall-clock latency, GPT 5.5 or Claude Opus 4.7 lead. Both produce higher quality code on the same Appwrite tasks, with or without Skills loaded.

For other cases, weigh price against throughput. 3.5 Flash sits on that frontier.

Next steps

Select Gemini 3.5 Flash inside a tool that supports it: Cursor, Google AI Studio, Google Antigravity, or the Gemini API directly. To connect Appwrite to the model, follow the Cursor plugin docs for Cursor, or the Antigravity MCP setup docs for Antigravity. Both walk through adding the Appwrite API MCP and Docs MCP servers so the model can act on your project.
