The Chicken and Egg of AI Development: Framework or Data First?
When embarking on an AI project, particularly for something like a RAG (Retrieval-Augmented Generation) solution, teams often face a key dilemma: Should you prioritize building the framework or preparing the data? This question might seem simple, but the answer can significantly impact the project’s success—or failure.
I’ve lived this problem firsthand. While building my own RAG solution, I focused heavily on connecting different components—FAISS for similarity search, Redis for caching, and vector databases for storage—before preparing the data. It was a powerful way to learn the architecture and its technical challenges, but it also came with risks. When I tested the early prototype, the results were underwhelming, and I initially questioned its value. Ultimately, though, that prototype surfaced missed steps and data requirements that had been there all along but were far from obvious at the start.
So, how do you strike a balance between building infrastructure and generating early business value? Let’s break it down.
Option 1: Framework First (The Engineer’s Path)
Building the framework first involves setting up the full pipeline: data retrieval, vector searches, ranking, caching, and integration with LLMs. The advantage here is that you gain early insight into scalability, system architecture, and performance bottlenecks.
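To make the framework-first idea concrete, here is a minimal sketch of that kind of pipeline, with each stage as a swappable component. All names and interfaces here are hypothetical: the keyword-overlap retriever stands in for FAISS-style vector search, and the in-memory cache stands in for Redis.

```python
from dataclasses import dataclass

@dataclass
class Document:
    doc_id: str
    text: str

class Retriever:
    """Stub retriever; a real system would back this with FAISS or a vector DB."""
    def __init__(self, corpus):
        self.corpus = corpus

    def retrieve(self, query, k=2):
        # Naive keyword overlap stands in for vector similarity search.
        q_tokens = set(query.lower().split())
        scored = sorted(
            self.corpus,
            key=lambda d: len(q_tokens & set(d.text.lower().split())),
            reverse=True,
        )
        return scored[:k]

class Cache:
    """Stub cache; Redis would play this role in production."""
    def __init__(self):
        self._store = {}
    def get(self, key):
        return self._store.get(key)
    def put(self, key, value):
        self._store[key] = value

class Pipeline:
    """Wires retrieval, caching, and (eventually) generation together."""
    def __init__(self, retriever, cache):
        self.retriever = retriever
        self.cache = cache

    def answer(self, query):
        cached = self.cache.get(query)
        if cached is not None:
            return cached
        docs = self.retriever.retrieve(query)
        # An LLM call would go here; for now we just return the retrieved IDs.
        response = " | ".join(d.doc_id for d in docs)
        self.cache.put(query, response)
        return response

corpus = [Document("faq-1", "how to reset a password"),
          Document("faq-2", "billing and invoices"),
          Document("faq-3", "password strength rules")]
pipeline = Pipeline(Retriever(corpus), Cache())
print(pipeline.answer("reset my password"))  # → faq-1 | faq-3
```

The point of the skeleton is that each component can later be replaced with the real thing (FAISS index, Redis client, LLM call) without changing the overall flow—which is exactly the modularity argument below.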
Pros:
- Iterative Testing: You can continuously test how components (e.g., similarity search, retrieval speed) interact and identify issues early.
- Modularity and Scalability: By focusing on infrastructure, you build a system that can handle new data without major redesigns.
- Understanding the Stack: Prototyping the framework helps engineers understand what’s feasible and where the real challenges are.
Cons:
- Underwhelming Results: Without high-quality data, the system may not produce useful outputs, leaving business users unimpressed.
- Risk of Rework: If the data requirements change later, parts of the framework might need significant revisions.
Option 2: Data First (The Business Path)
Alternatively, you could start with data: collecting, cleaning, and embedding your datasets before worrying about infrastructure. This approach allows for quick wins by generating immediate, useful outputs—at least in small-scale tests.
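A data-first start looks roughly like this: clean the raw records, then embed them. The embedding below is a toy bag-of-words vector (a real project would call an embedding provider), but the clean-then-embed order is the point.

```python
import math
import re
from collections import Counter

def clean(record: str) -> str:
    # Normalize: drop HTML remnants, collapse whitespace, lowercase.
    record = re.sub(r"<[^>]+>", " ", record)
    return re.sub(r"\s+", " ", record).strip().lower()

def embed(text: str, vocab: list) -> list:
    # Toy bag-of-words embedding, unit-normalized for cosine similarity.
    counts = Counter(text.split())
    vec = [float(counts[w]) for w in vocab]
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]

raw = ["  Reset <b>Password</b> steps ", "Billing   FAQ and invoices"]
cleaned = [clean(r) for r in raw]
vocab = sorted({w for c in cleaned for w in c.split()})
vectors = [embed(c, vocab) for c in cleaned]
print(cleaned[0])  # → reset password steps
```

Even at this toy scale, quick wins show up: cleaned, embedded records can answer small-scale similarity queries before any infrastructure exists.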
Pros:
- Quick Business Value: Teams can generate insights and show tangible results early on, building confidence with stakeholders.
- Model Optimization: You can optimize your LLMs and retrieval models based on real-world data rather than assumptions.
Cons:
- Delayed Integration: Without an early pipeline in place, scaling and integration issues might be discovered too late in the process.
- Overfitting: The system may become too tailored to the quirks of early datasets, limiting its ability to generalize.
The Reality: You Need Both
In most AI projects, success requires a hybrid approach. Starting with both a small, well-defined data set and a minimal pipeline allows teams to iterate without getting bogged down by either infrastructure or data preparation.
Here’s a recommended strategy:
- Define a Clear Use Case: Focus on one business problem that can drive development and provide measurable value.
- Build a Minimal Pipeline: Implement core components (retrieval, ranking, response generation) using a small, representative dataset.
- Iterate and Improve: As data and business feedback evolve, refine both the pipeline and the data to balance scalability and results.
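One way to keep that iteration honest is a tiny retrieval evaluation: check whether the expected document lands in the top-k results for a handful of representative queries. The dataset and scoring below are hypothetical stand-ins, but the hit rate gives each iteration a measurable number to improve.

```python
def overlap_score(query: str, doc: str) -> int:
    # Keyword overlap stands in for real vector similarity.
    return len(set(query.lower().split()) & set(doc.lower().split()))

def top_k(query: str, docs: dict, k: int = 1) -> list:
    ranked = sorted(docs, key=lambda d: overlap_score(query, docs[d]), reverse=True)
    return ranked[:k]

docs = {
    "pw": "reset your password from the login page",
    "bill": "download invoices from the billing tab",
}
# Each case pairs a user query with the doc it should retrieve.
cases = [("forgot my password", "pw"), ("where are my invoices", "bill")]

hits = sum(expected in top_k(q, docs) for q, expected in cases)
hit_rate = hits / len(cases)
print(f"top-1 hit rate: {hit_rate:.0%}")  # → top-1 hit rate: 100%
```

As data and business feedback evolve, the test cases grow with them, and a drop in hit rate flags regressions before stakeholders see them.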
In my case, I already had the foundations of structured intelligence—data sets that defined categories and attributes—which helped guide my development process. But even with that advantage, I found that technical prototyping was key to understanding system performance and limitations.
A Warning About Expectations
When business stakeholders see early prototypes, they often expect polished results. It’s important to set realistic expectations ahead of time: AI solutions take time to optimize. Early-stage prototypes should be viewed as a learning process, not a final product.
This is where organizations often stumble: the significant up-front effort puts heavy pressure on delivering measurable wins. Teams may wire together complex technologies—FAISS, Redis, embedding providers, vector DBs, LLMs—only to face pushback when business users don’t see immediate value. That disconnect can derail projects unless both sides understand the need for technical exploration and iteration.
The Takeaway: Focus and Flexibility
The question of whether to prioritize framework or data first is a bit of a “chicken or egg” dilemma, but the answer lies in balance. By focusing on a single, high-value use case and starting small, you can test key components without overwhelming the system—or the business.
Organizations also need to adopt a flexible mindset around AI prototyping. Supporting freedom to fail safely is critical. Teams must have the time and space to experiment, connect technologies, and iterate without fear of immediate dismissal from stakeholders.
In AI projects, the long game is the content. You can’t get to the valuable content unless you have the piping connected.
With the right approach, you’ll move from early underwhelming prototypes to scalable, impactful AI solutions—one iteration at a time.