Environments for the next generation of AI capabilities

We help AI labs achieve SOTA performance on tool-using verticals through high-quality, hand-crafted environments that expose capability gaps. We also work on scaling synthetic environment creation for coding and tool use. We work continuously with labs to reveal model gaps and support iterative performance improvement.

Agentic Environments

PDFBench

Terminal · 200 Tasks

A rigorous, sandboxed environment for testing economically critical PDF workflows. Also available in computer use format.

Available Now

The Problem

Over 90% of government, enterprise, and trade documents use PDF as the authoritative format. Current models still can't reliably fill out these forms.

Evaluation Focus

Competitive Edge

Frontier models still fail ~35% of tasks

Regulatory Compliance

Tasks mirror real requirements for sensitive data

Generalist Inference

Tests ability to infer form intent from layout

Read the Report

EMRBench

Computer Use · 200 Tasks

A realistic EMR environment for validating clinical workflow automation.

Available Now

The Problem

Electronic Medical Records (EMR) are the backbone of modern healthcare, yet they remain notoriously difficult to automate due to complex interfaces and high stakes for patient safety. Standard agents fail to navigate these safely.

Evaluation Focus

Clinical Accuracy

Tests real medical task completion rates

Safety Validation

Validates safe handling of patient data

Workflow Fidelity

Mirrors actual EMR navigation complexity

Read the Report

Bespoke Synthetic

Terminal · 100+ Tasks

Scalable, domain-specific environments built on demand for specialized verticals.

Available Now

The Problem

Scaling simulation engines is generally human-capital intensive. Using a weakly supervised approach, we can synthetically generate and validate hundreds or thousands of environments on demand. Ask us how.

Evaluation Focus

Domain Expertise

Regulatory Workflows, Pharma, Chemical Synthesis

Scalable

Methodology follows the DeepSeek 3.2 'General Agent' approach

Iterative

Continuous refinement based on model performance

Foundational Datasets

Gauntlet

SFT Dataset · 2K Repos · 79K Examples

An SFT dataset of real developer workflows from 2K repositories with pre-resolved dependencies and Docker containers.

Available Now

The Problem

Training data quality directly impacts model performance. Most code datasets are synthetic or lack verification. Gauntlet provides real developer workflows where every example is pytest-verifiable.

Evaluation Focus

Pytest Verifiable

Every example validated against real tests

Dependencies Resolved

Pre-built Docker containers, no setup required

Production Scale

2K repositories, 79K real workflows

Read the Report

Let's work together

Interested in our environments or datasets? We'd love to discuss how we can help improve your model's capabilities.

contact@refresh.dev
