How We Rate Models
This page documents the research, sources, and decisions behind every model rating in the Bearing registry. We believe that a recommendation tool should be transparent about its own methodology — so here it all is.
Last updated: 15 April 2026 (Registry v0.5.0)
Scoring overview
Every model is scored from 0.0 to 1.0 across seven factors. When you rank your priorities, Bearing converts your ranking into weights and produces a single match score for each model.
| Factor | What it measures | How it's scored |
|---|---|---|
| Quality | How well the model performs on your task type | Per-task fitness scores from benchmarks and evaluation data |
| Capability | Whether the model can do what you need | Binary pass/fail for hard requirements (vision, code, tools), graduated for soft matches |
| Cost | How much it costs relative to other models | Inverse log-scale against the cheapest model in the registry |
| Speed | How fast responses come back | Based on latency tiers and inference benchmarks. 1.0 = sub-second |
| Privacy | How your data is handled | Based on data retention policies. 1.0 = no retention, no training on inputs |
| Sustainability | Environmental footprint | Composite of inference energy, training footprint, and provider infrastructure |
| Transparency | How open the model and provider are | Composite of open weights, training data, methodology, licence, and provider disclosure |
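The exact weighting scheme is internal to Bearing, but the idea can be illustrated with a short sketch: ranked priorities become normalised weights, hard requirements act as a pass/fail filter, and the per-factor scores are combined into a single match score. The rank-reciprocal weighting and the example registry entry below are assumptions for illustration, not the registry's actual code.

```python
# Illustrative sketch only: Bearing's real weighting scheme may differ.
FACTORS = ["quality", "capability", "cost", "speed",
           "privacy", "sustainability", "transparency"]

def ranking_to_weights(ranking: list[str]) -> dict[str, float]:
    """Turn an ordered priority list (most important first) into weights summing to 1."""
    assert set(ranking) <= set(FACTORS), "unknown factor in ranking"
    raw = {factor: 1.0 / (position + 1) for position, factor in enumerate(ranking)}
    total = sum(raw.values())
    return {factor: value / total for factor, value in raw.items()}

def match_score(model: dict, ranking: list[str], hard_requirements: set[str]) -> float | None:
    """Weighted sum of factor scores; None if any hard capability requirement fails."""
    if not hard_requirements <= model["capabilities"]:
        return None  # binary pass/fail for hard requirements
    weights = ranking_to_weights(ranking)
    return sum(weights[factor] * model["scores"][factor] for factor in ranking)

# Hypothetical registry entry, for illustration only.
example_model = {
    "capabilities": {"code", "tools"},
    "scores": {"quality": 0.86, "capability": 1.0, "cost": 0.62, "speed": 0.82,
               "privacy": 0.70, "sustainability": 0.40, "transparency": 0.30},
}
print(match_score(example_model, ["quality", "cost", "privacy"], {"code"}))  # ~0.77
```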
Sustainability methodology
Sustainability is scored across three sub-dimensions, each 0.0 to 1.0. The composite score is the mean of all available values (nulls are excluded, not treated as zero).
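A minimal sketch of that null-excluding mean (illustrative, not the registry code; the key names are shorthand for the three sub-dimensions described below):

```python
def composite(sub_scores: dict[str, float | None]) -> float | None:
    """Mean of the available sub-dimension scores; nulls are dropped rather than zeroed."""
    available = [value for value in sub_scores.values() if value is not None]
    return sum(available) / len(available) if available else None

# A model with no inference-energy data, published training data, and a mostly
# renewable provider is scored on the two known values only:
print(composite({
    "inference_energy": None,
    "training_footprint": 0.7,
    "provider_infrastructure": 0.8,
}))  # -> 0.75, not the 0.5 that treating the null as zero would give
```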
Inference energy
How efficient is the model at inference time?
| Score | Meaning |
|---|---|
| 1.0 | Published per-query energy metrics |
| 0.8 | Very small or efficient model |
| 0.6 | Mixture-of-Experts architecture |
| 0.4 | Standard large model |
| null | No data available |
Training footprint
How much is known about the environmental cost of training?
| Score | Meaning |
|---|---|
| 1.0 | Published with carbon offsets |
| 0.7 | Published training energy/compute data |
| 0.4 | Estimated from model size and architecture |
| 0.0 | No information at all |
| null | No data available |
Provider infrastructure
How clean is the energy powering the model?
| Score | Meaning |
|---|---|
| 1.0 | 100% renewable with heat recovery or additional efficiency measures |
| 0.8 | 100% renewable energy |
| 0.6 | Significant commitment with reporting |
| 0.4 | Some renewable energy |
| 0.2 | No commitment |
| null | No data available |
Transparency methodology
Transparency is scored across five sub-dimensions, each 0.0 to 1.0. The composite score is the mean of all five.
We reference the Stanford CRFM Foundation Model Transparency Index (FMTI) 2025, which evaluates 100 indicators across 13 companies. We also consider the criticism from EleutherAI that FMTI measures commercial documentation rather than genuine openness — so we weight actual accessibility alongside documentation.
FMTI company scores (2025)
These inform the provider_disclosure dimension:
| Company | FMTI Score | Our provider_disclosure |
|---|---|---|
| IBM | 95 | 1.0 |
| AI21 Labs | 72 | 0.7 |
| Writer | 67 | 0.7 |
| Anthropic | 37 | 0.4 |
| Google | 36 | 0.4 |
| Amazon | 36 | 0.4 |
| OpenAI | 35 | 0.4 |
| DeepSeek | 30 | 0.2 |
| Meta | 30 | 0.3 |
| Alibaba | 30 | 0.2 |
| Mistral | 15 | 0.2 |
| xAI | 14 | 0.2 |
| Midjourney | 14 | 0.2 |
Sub-dimension buckets
| Dimension | 1.0 | 0.7 | 0.3 | 0.0 |
|---|---|---|---|---|
| Open weights | Fully downloadable | With registration required | API only | Closed |
| Open training data | Documented and accessible | Documented but not accessible | Vague description | None |
| Open methodology | Full reproducible paper | Substantial blog/model card | Marketing only | None |
| Licence openness | MIT / Apache 2.0 | Open with restrictions | Proprietary, clear terms | Restrictive |
| Provider disclosure | FMTI 80+ | FMTI 50–79 | FMTI 30–49 | FMTI <15 |
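Read literally, the provider_disclosure bands map to bucket values roughly as sketched below. The company table above shows the assignment is not purely mechanical (Meta and DeepSeek both sit at 30 but receive 0.3 and 0.2 respectively), so the published scores also reflect judgement about actual accessibility; treat this as an illustration of the bands, not the rating code.

```python
def provider_disclosure_bucket(fmti_score: int) -> float:
    """Illustrative mapping from FMTI score to the nearest provider_disclosure bucket."""
    if fmti_score >= 80:
        return 1.0
    if fmti_score >= 50:
        return 0.7
    if fmti_score >= 30:
        return 0.3
    if fmti_score >= 15:
        return 0.2  # the 15-29 range sits between the published 0.3 and 0.0 bands
    return 0.0

print(provider_disclosure_bucket(95))  # IBM -> 1.0
print(provider_disclosure_bucket(15))  # Mistral -> 0.2, matching the table above
```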
Provider sustainability research
Research conducted April 2026. All scores are based on publicly available information and may change as providers update their commitments and disclosures.
Google (Gemini models)
Provider infrastructure: 0.8 — Google has maintained a 100% renewable energy match on a global basis every year since 2017 and is targeting 24/7 carbon-free energy on every grid where they operate by 2030. Google is the only major cloud provider to publish per-query energy data: 0.24 watt-hours per median text prompt for Gemini (August 2025).
Sources:
- Google Clean Energy
- 5 years of 100% renewable energy — Google Cloud Blog
- Hannah Ritchie — AI carbon footprint update (August 2025)
Anthropic (Claude models)
Provider infrastructure: 0.4 — Anthropic has committed to covering 100% of the grid upgrade costs their data centres require, to water-efficient cooling, and to offsetting electricity price increases for ratepayers. However, they have not published a renewable energy target or a sustainability report. Their data centres are located in Texas, New York, and Louisiana.
Sources:
- Anthropic — Covering electricity price increases
- Anthropic Emissions Breakdown — DitchCarbon
- Sustainability Magazine — Why is Anthropic pledging to offset its AI energy costs?
OpenAI (GPT models)
Provider infrastructure: 0.4 — OpenAI's CEO stated that ChatGPT uses approximately 0.34 watt-hours per average query. OpenAI claims it will reach 100% renewable energy by 2027, but that claim currently rests on Microsoft Azure's Renewable Energy Certificates (RECs) rather than on direct power purchase agreements. As of March 2026, OpenAI has published no verified Scope 1, 2, or 3 emissions figures.
Sources:
- MIT Technology Review — AI energy usage
- The Sustainable Innovation — Open AI Sustainability
- OpenAI, Oracle and Vantage — Green Energy Partnership
Meta (Llama models)
Provider infrastructure: 0.8 — Meta claims to run its global operations on 100% renewable energy and aims for net-zero emissions across its value chain by 2030. They also target being water-positive by 2030 and use dry cooling at new facilities (no water demand for cooling). However, for API users accessing Llama through third-party providers, sustainability depends on the hosting provider's infrastructure.
Sources:
- Meta — Delivering AI, Creating Community Benefits
- Fortune — Big tech climate goals and data centres
Mistral (Codestral, Devstral, Medium 3, OCR)
Provider infrastructure: 0.55 — Mistral invested €1.2 billion in a partnership with EcoDataCenter to build an AI data centre in Borlänge, Sweden, running on renewable energy with advanced cooling. Their Paris facility runs on France's low-carbon nuclear grid. Mistral has also launched a sustainability auditing tool for industry-wide transparency. Scored between the "significant commitment" (0.6) and "some renewable" (0.4) buckets.
Sources:
- Mistral AI and EcoDataCenter — Sweden AI data centre
- CNBC — Mistral secures $830 million for Paris data centre
- Euronews — Mistral raises $830m for European AI infrastructure
xAI (Grok)
Provider infrastructure: 0.15 — xAI's Memphis Colossus data centre has been operating gas turbines without emission permits since August 2025. Plans call for 2 gigawatts of capacity (enough to power 1.5 million homes). An 88-acre solar array has been proposed but not yet built, and 168 Tesla Megapacks have been installed for battery storage. Even with these mitigations, the overall sustainability profile is among the worst of any major AI provider.
Sources:
- Tennessee Lookout — Data centre battle along Mississippi-Tennessee line
- Time — Elon Musk's Memphis AI Data Center raises pollution concerns
- TBA — Tech, Toxins, and Memphis: Environmental Footprint of the xAI Facility
GreenPT (GreenL, GreenR)
Provider infrastructure: 1.0 — GreenPT runs 100% on renewable energy with heat recovery (server waste heat is used for community heating). They publish real-time energy metrics per query (mWh per 100 tokens). Their Power Usage Effectiveness (PUE) is 1.37 versus an industry average of 1.57. Their Water Usage Effectiveness (WUE) is 0.067 versus an industry average of 1.8 — a 96% improvement. They achieve 20–30% compute reduction through model compression and quantization. EU-hosted with full GDPR compliance.
The training of the underlying models (Mistral Small 3.2 for GreenL, GPT-OSS 120B for GreenR) was not done on GreenPT's green infrastructure, which is why the training footprint sub-score is lower.
Sources:
- GreenPT — AI impact on the environment
- GreenPT — New API integrations: Scraper, OCR
- EcoCompute — GreenPT presentation
DeepSeek (R1, R1 0528, V3.1, V3.2)
Provider infrastructure: 0.2 — DeepSeek's MoE architecture is inherently efficient, activating only 37 billion of 671 billion parameters during inference and using techniques that cut memory usage by 93.3%. However, China's energy grid is heavily coal-based, and DeepSeek publishes no sustainability data or commitments. Reasoning models (R1, R1 0528) consume significantly more energy per query due to long chains of thought — HuggingFace found that reasoning models use 30x more energy on average.
Sources:
- S&P Global — Potential impacts of DeepSeek on datacenters and energy demand
- SingularityHub — Reasoning models use 30x more energy
- ScienceDirect — Does DeepSeek curb data centre energy surge?
IBM (Granite)
Provider infrastructure: 0.55 — IBM has the highest FMTI score (95/100), which includes disclosure on compute infrastructure. They publish training compute and energy data and provide smaller models that run on consumer hardware. Scored between "significant commitment" and "some renewable" buckets, weighted upward by their exceptional disclosure practices.
Sources:
- IBM — Granite 3.3 announcement
- Signal65 — IBM Granite Benchmarking and Enterprise Readiness
Moonshot AI (Kimi K2, K2.5) and MiniMax (M2.5, M2.7)
Provider infrastructure: 0.2 — Both are China-based providers with no published sustainability data or renewable energy commitments. Scored at 0.2, the "No commitment" floor of the provider infrastructure scale.
General energy context
Current estimates put AI's global electricity consumption at approximately 100 terawatt-hours for 2025. The HuggingFace AI Energy Score project measures energy consumption across models using standardised hardware (NVIDIA H100 GPUs), finding that reasoning models consume 30x more energy on average than non-reasoning models.
Sources:
- HuggingFace AI Energy Score
- Earth911 — Your AI Carbon Footprint
- Innovating with AI — How much energy does AI consume?
Model capability decisions
DeepSeek V3.2 — vision removed
DeepSeek V3.2 was previously listed with vision capability, but research confirmed it does not have native vision support. Vision is expected in DeepSeek V4, which was in limited testing as of April 2026. The vision capability and all vision task fitness scores have been removed.
Sources:
- DeepSeek API Docs — V3.2 Release
- TechNode — DeepSeek V4 test interface suggests Vision mode
GreenPT GreenL — vision added
GreenPT GreenL is powered by Mistral Small 3.2 (24B), which has native vision support — it can process and reason over both text and image inputs. The vision task fitness is set to 0.68 based on the underlying model's capabilities. GreenPT also offers dedicated OCR and Scraper APIs for document processing.
Sources:
- Mistral Small 3.2 — HuggingFace
- GreenPT — OCR and Scraper API integrations
GreenPT GreenR — extended thinking added
GPT-OSS 120B (the model powering GreenR) has full chain-of-thought reasoning and adjustable reasoning effort, qualifying it for the extended_thinking capability.
Sources:
- OpenAI — Introducing GPT-OSS
- OpenAI — GPT-OSS Model Card
Kimi K2.5 — video added
Kimi K2.5 demonstrates strong video understanding capabilities, scoring 86.6% on VideoMMMU. It also supports agent swarm execution for parallel sub-task processing.
Sources:
- Moonshot AI — Kimi K2.5 Tech Blog
- InfoQ — Kimi K2.5 with Vision and Agent Swarm
Qwen 3 235B — vision not included
The base Qwen3-235B-A22B is a text-only model. The vision-language variant (Qwen3-VL-235B-A22B) is a separate model. We represent the base text model in the registry. Note that Qwen 3.5 (the newer generation) is natively multimodal — vision is built into the base architecture.
Sources:
- Qwen3-235B-A22B — HuggingFace
- Qwen3-VL Technical Report
IBM Granite 3.3 — vision not in base, family noted
The Granite 3.3 8B Instruct model in our registry is a text-only model. IBM has separate vision models in the Granite family (granite-vision-3.3-2b, granite-4.0-3b-vision) designed for document understanding, OCR, and chart analysis. We note the family in the model's strengths but represent the 8B text model as the primary entry.
Sources:
- IBM — Granite 3.3 8B Instruct — HuggingFace
- IBM — Granite Vision models
- IBM — Granite 4.0 3B Vision — HuggingFace
Task fitness research
Task fitness scores represent how well a model performs on specific task types, from 0.0 to 1.0. These are informed by benchmark data, public evaluations, and comparative analysis. Here are the key adjustments made in the v0.5.0 audit, with sources.
MiniMax M2.7 — code: 0.78 → 0.86
MiniMax M2.7 achieved 56.22% on SWE-Pro (matching GPT-5.3-Codex), 78% on SWE-bench Verified (significantly outperforming Opus at 55%), and 86.2% on PinchBench (within 1.2 points of Opus). The model was also open-sourced in April 2026. Pricing was corrected from $1.00/$4.00 to $0.30/$1.20 per million tokens.
Sources:
- MiniMax — M2.7 announcement
- VentureBeat — MiniMax M2.7 is self-evolving
- MarkTechPost — MiniMax M2.7 open-sourced
MiniMax M2.5 — speed: 0.70 → 0.82
M2.5 is served natively at 100 tokens per second — nearly twice the rate of other frontier models. End-to-end runtime decreased by 37% through improvements like parallel tool calling.
Sources:
- MiniMax — M2.5 announcement
Kimi K2.5 — vision: 0.73 → 0.82, code: 0.84 → 0.86
MMMU Pro: 78.5%. MathVision: 84.2%. VideoMMMU: 86.6%. BrowseComp: 74.9% (78.4% with agent swarm). These are competitive with frontier multimodal models.
Sources:
- Moonshot AI — Kimi K2.5 Tech Blog
- Complete Performance Analysis — Kimi K2.5 vs GPT, Claude, Gemini
Llama 4 Maverick — vision: 0.75 → 0.82, code: 0.80 → 0.84
MMMU: 73.4% (beats GPT-4o at 69.1%). MathVista: 73.7% (beats GPT-4o at 63.8%). LiveCodeBench: 43.4% (beats GPT-4o at 32.3%). Natively multimodal with early fusion, supporting up to 8 images per prompt.
Sources:
- Llama 4 — Meta AI Blog
- Llama 4 — Official site
- Artificial Analysis — Llama 4 Maverick
IBM Granite 3.3 — code: 0.75 → 0.78, extract: 0.78 → 0.80
Granite 3.3 8B's MATH500 performance puts it ahead of Claude 3.5 Haiku (64.2%) and Llama 3.1 8B (44.4%), roughly in line with the 24B-parameter Mistral Small 3. Strong function calling, fill-in-the-middle coding, and RAG capabilities with dedicated LoRA adapters.
Sources:
- IBM — Granite 3.3 8B Instruct — HuggingFace
- IBM — Granite 3.3 announcement
- Signal65 — Granite Benchmarking
Gemini 3 Flash — code: 0.85 → 0.88
Gemini 3 Flash scored 78% on SWE-bench Verified — actually outperforming Gemini 3 Pro (76.2%) on coding tasks.
Sources:
- Google — Gemini 3 Flash Preview
Gemini 3.1 Pro — analyse: 0.90 → 0.93
GPQA Diamond: 94.3%. ARC-AGI-2: 77.1% (more than double Gemini 3 Pro). This justifies a very high analysis score.
Sources:
- Google — Gemini 3.1 Pro announcement
- Google DeepMind — Gemini 3.1 Pro Model Card
Qwen 3.5 397B — code: 0.80 → 0.85
LiveCodeBench v6: 83.6%. SWE-bench Verified: 76.4%. AIME 2026: 91.3%. Natively multimodal with early fusion. 201 languages supported (up from 119 in Qwen 3). 60% cheaper to run than its predecessor.
Sources:
- VentureBeat — Alibaba's Qwen 3.5
- Qwen — Qwen3.5 Blog
What we don't know
We try to be honest about the limits of our data:
- Quality scores are estimates — task fitness is our weakest signal. It's informed by benchmarks, but benchmarks don't capture everything. As we collect outcome data from real Bearing users, these scores will improve.
- Sustainability data is sparse — most providers don't publish per-query energy consumption. Google and GreenPT are exceptions. For many models, we're making informed estimates from model architecture and provider commitments.
- Pricing changes frequently — we sync from OpenRouter where possible, but some models are priced outside that ecosystem. Check the provider's own pricing page for the latest.
- Benchmarks have limitations — SWE-bench, MMMU, GPQA, and others are useful signals but don't perfectly predict real-world performance on your specific task. That's why Bearing also includes Compare mode for direct testing.
How to suggest corrections
If you spot an error in our ratings or have a better source for a specific data point, please open an issue on GitHub with the model name, the score you think is wrong, and a link to the evidence. We review every suggestion.