Model Customization as Moat

Why custom models will become your only real moat

We’re approaching a fascinating inflection point in software development. As AI continues to automate more and more software engineering tasks, a paradox emerges: if everyone is using the same models from OpenAI, Anthropic, or Google, how do you differentiate? The answer lies in something that sounds intimidating but is becoming increasingly accessible: model customization and training.

The API Wrapper Problem

Right now, most AI startups are essentially wrappers around lab APIs. You build a clever interface, add some workflow orchestration, maybe inject some domain-specific prompts, and you’re off to the races. This works today. But as Sudharshan Devaraj points out, this model has a shelf life.

Consider what happened with DeepSeek in early 2025: they reportedly trained a model for just $6 million that matched o1-level reasoning, only two months after OpenAI’s release. What took billions of dollars and years of research can now be replicated for the cost of a seed round. The barrier to training models isn’t just falling; it’s collapsing.

When Everyone Can Code, What Matters?

Let’s think ahead to a world where most programming tasks under 30 minutes are fully automated. At that point, software development starts to look less like a technical moat and more like a commodity, similar to how cloud infrastructure became commoditized. In that world, what actually matters?

Data. And lots of it.

Specifically, the right kind of data: traces of how users actually solve problems in your domain, the patterns of successful workflows, and the subtle decisions experts make that generic models can’t learn from public internet data. This proprietary knowledge is where model customization becomes your competitive advantage.

The Pattern Emerging

We’re already seeing a clear pattern with successful AI companies:

  1. Start with an API wrapper to find product-market fit and collect data
  2. Fine-tune small specialized models for specific features
  3. Train your own models using your proprietary data moat
  4. Increase Token Factor Productivity to retain users

Take Cursor as an example. They began as a VSCode wrapper around GPT-4. But today, they run proprietary models for features like “Fast Apply.” Why? Because they’re sitting on billions of coding traces: every edit, every acceptance, every rejection. That’s not just data; that’s a reinforcement learning goldmine.

At some point, the exact backend model becomes secondary. What matters is that Cursor controls it and continuously improves it with their unique dataset.

Your App Is an RL Environment

Here’s the key insight: every application you build is potentially a reinforcement learning environment. Every user interaction generates trajectory data. Every workflow completion creates a training signal.

OpenAI understands this: they bought Statsig specifically to capture billions of trajectory replays from their “Session Replays” product. The era of experience has arrived, and the companies collecting the richest interaction data are building asymmetric advantages.

Think about the professional world: we spend most of our lives in front of computers. That’s hours upon hours of unlabeled, unrecorded data every single day. The applications that capture and learn from these workflows will compound their advantages over time.

Token Factor Productivity: The New Metric

As models become factors of productivity rather than just measures of intelligence, we need new ways to measure value. Enter Token Factor Productivity (TFP):

TFP = Economic Value of Output / Tokens Consumed

This metric mirrors Total Factor Productivity in traditional economics but applies to AI-driven work. Here’s why it matters for your competitive positioning:

If you’re building an AI wrapper that pays full API pricing but delivers 10x value to customers, your margins look great today. But what happens when a competitor trains their own model with 80% of the capability at 20% of the cost? They can undercut your pricing while maintaining similar margins.

However, if you’ve been collecting domain-specific trajectory data and training specialized models, you can achieve higher TFP than any generic model. Your model becomes more efficient at your specific use case: fewer tokens to achieve the same outcome, or better outcomes for the same tokens.
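To make the comparison concrete, here is a toy TFP calculation. The dollar values and token counts are hypothetical illustrations, not benchmarks:

```python
# Illustrative Token Factor Productivity (TFP) comparison,
# per the definition above: TFP = economic value of output / tokens consumed.
# All numbers here are made up for the sake of the example.

def tfp(value_of_output: float, tokens_consumed: int) -> float:
    """Economic value produced per token spent."""
    return value_of_output / tokens_consumed

# A wrapper on a generic API: delivers real value, but is token-hungry.
wrapper_tfp = tfp(value_of_output=10_000, tokens_consumed=5_000_000)

# A specialized model fine-tuned on domain data: the same value
# delivered with far fewer tokens for this specific use case.
specialized_tfp = tfp(value_of_output=10_000, tokens_consumed=1_000_000)

print(f"wrapper TFP:     ${wrapper_tfp:.4f}/token")
print(f"specialized TFP: ${specialized_tfp:.4f}/token")
print(f"efficiency gain: {specialized_tfp / wrapper_tfp:.1f}x")
```

Under these assumed numbers, the specialized model earns five times more value per token, which is exactly the headroom that lets it undercut a wrapper on price while keeping its margins.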

The Wedge Isn’t Architecture, It’s Efficiency

You don’t need to invent a new Transformer architecture to compete. The wedge is data efficiency and reinforcement learning from your specific domain.

Recent research on distillation shows that a distilled 1B-parameter model can match a 7B model trained from scratch. Phi-4 and Gemma demonstrate this trend clearly. What this means practically:

  • You can start smaller than you think
  • Domain-specific fine-tuning beats general-purpose scale
  • Your data quality matters more than your parameter count
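For intuition, the core of the distillation objective, matching a student’s output distribution to a temperature-softened teacher distribution, can be sketched in a few lines of plain Python. A real training loop would use a framework like PyTorch and typically blend this with a standard hard-label loss:

```python
import math

def softmax(logits: list[float], temperature: float = 1.0) -> list[float]:
    # Temperature > 1 softens the distribution, exposing the teacher's
    # "dark knowledge" about the relative likelihood of wrong answers.
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(teacher_logits: list[float],
                      student_logits: list[float],
                      temperature: float = 2.0) -> float:
    """KL(teacher || student) over temperature-softened distributions."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

teacher = [2.0, 0.5, -1.0]   # hypothetical logits over 3 classes
student = [0.0, 2.0, 0.0]
print(distillation_loss(teacher, student))  # positive: distributions differ
print(distillation_loss(teacher, teacher))  # 0.0: a perfect match
```

The loss goes to zero as the student’s distribution converges on the teacher’s, which is why a small student can inherit much of a large teacher’s behavior without retracing its training.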

Why This Becomes Critical

As software engineering becomes increasingly automated, three dynamics converge:

1. Commoditization of Generic Code Generation
When everyone can generate decent code with the same API, code generation itself becomes table stakes. Differentiation moves to higher-level concerns.

2. Distribution Becomes Everything
Just like consumer brands, when the underlying product becomes similar, distribution and brand become the moat. But for AI applications, there’s an additional moat: your specialized models.

3. Data Compounds
Unlike traditional software moats that can be copied, a data moat compounds over time. Every user interaction makes your models better, which attracts more users, which generates more data. This is the flywheel.

The Practical Path Forward

You don’t need to compete with OpenAI’s scale. Here’s a more realistic approach:

Start Small
Begin with fine-tuning open models (Llama, Mistral) on your specific domain. This is achievable with modest compute budgets.

Instrument Everything
Build your application with the assumption that every interaction is potential training data. Track what works, what doesn’t, and why.
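A minimal sketch of what that instrumentation might look like. The event names and fields here are hypothetical placeholders, not a production pipeline:

```python
import json
import time
import uuid
from dataclasses import dataclass, asdict, field

@dataclass
class InteractionEvent:
    """One step of a user trajectory, captured for future training."""
    session_id: str
    event_type: str   # e.g. "suggestion_shown", "accepted", "rejected", "edited"
    payload: dict
    timestamp: float

@dataclass
class TrajectoryLogger:
    """Append-only interaction log; an in-memory sketch only."""
    events: list[InteractionEvent] = field(default_factory=list)

    def log(self, session_id: str, event_type: str, **payload) -> None:
        self.events.append(
            InteractionEvent(session_id, event_type, payload, time.time()))

    def export_jsonl(self) -> str:
        # JSONL is a common interchange format for fine-tuning datasets.
        return "\n".join(json.dumps(asdict(e)) for e in self.events)

# Example: record a suggestion and the user's verdict on it.
logger = TrajectoryLogger()
sid = str(uuid.uuid4())
logger.log(sid, "suggestion_shown", text="SELECT id FROM users;")
logger.log(sid, "accepted", latency_ms=840)
```

The point is not the storage mechanism but the habit: every suggestion, verdict, and correction is written down in a form a training job can consume later.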

Identify Your Wedge
What unique data do you have access to? Medical records? Legal documents? Financial workflows? Find the data source that competitors can’t easily replicate.

Build Your RL Loop
Create feedback mechanisms where user actions improve your models. Acceptances and rejections are signals. Time-to-completion is a signal. User corrections are golden.
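One way to sketch the mapping from logged events to a scalar training signal. The event names and weights below are hypothetical and would need tuning against real business value:

```python
def reward_from_events(events: list[dict]) -> float:
    """Collapse a trajectory of interaction events into one scalar reward.
    Weights are illustrative placeholders, not calibrated values."""
    reward = 0.0
    for e in events:
        if e["type"] == "accepted":
            reward += 1.0
        elif e["type"] == "rejected":
            reward -= 1.0
        elif e["type"] == "user_correction":
            # Corrections are golden: penalize the model's output here,
            # but keep the corrected text as a supervised target elsewhere.
            reward -= 0.5
        elif e["type"] == "completed":
            # Faster completions earn a small efficiency bonus.
            reward += max(0.0, 1.0 - e.get("seconds", 0) / 600)
    return reward

trajectory = [
    {"type": "accepted"},
    {"type": "user_correction"},
    {"type": "completed", "seconds": 120},
]
print(reward_from_events(trajectory))  # 1.0 - 0.5 + 0.8 = 1.3
```

Once trajectories reduce to rewards like this, your application’s logs become exactly the kind of dataset that policy-optimization methods consume.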

Own Your Inference
Eventually, move inference in-house. This gives you control over costs, latency, and continuous improvement.

The Coming Bifurcation

The AI landscape will likely bifurcate into two camps:

The ASI Players: A few companies (OpenAI, Anthropic, DeepMind, possibly Meta) pushing toward artificial superintelligence with massive capital and compute.

The Specialized Players: Hundreds or thousands of companies with deeply customized models for specific domains, verticals, or workflows.

The mistake would be trying to compete with the first group. The opportunity is in the second, building models so specialized and data-efficient that the generic APIs can’t compete on your specific use case.

Conclusion

Model customization isn’t just a nice-to-have anymore; it’s becoming a strategic imperative. As the barriers to training collapse and software engineering becomes increasingly automated, the companies that survive will be those that:

  1. Collect proprietary trajectory data from their applications
  2. Train specialized models that understand their specific domain
  3. Achieve superior Token Factor Productivity in their niche
  4. Build compounding data flywheels

The era of API wrappers isn’t over, but its expiration date is visible on the horizon. The question isn’t whether you’ll need to train models; it’s when you’ll start collecting the data to do it.

The training imperative isn’t just for AI labs anymore. It’s for anyone building serious AI applications.


This post was inspired by Sudharshan Devaraj’s excellent article “The Training Imperative” which explores these dynamics in depth. If you’re serious about AI strategy, it’s required reading.