## The Questions That Reveal Actual Capability
These questions separate firms with real implementation experience from those without it:
"Walk me through a recent implementation that did not go as planned. What happened and how did you resolve it?" Every real implementation project hits unexpected problems: a document format the ingestion pipeline could not handle, a rate limit that surfaced at scale, a prompt that drifted after a model update, an integration endpoint that changed its auth scheme mid-project. An agency that claims everything goes smoothly has limited experience or poor memory. An agency that can describe a specific problem, their diagnosis, what they rolled back, and how they fixed it has real field experience.
"How do you handle AI errors and failures in production systems?" AI outputs are probabilistic, not deterministic. They produce errors, hallucinations, and off-spec outputs. How does the agency design for this? What monitoring do they build in? What minimum confidence thresholds do they enforce? What review steps exist before AI outputs are acted upon? A thoughtful answer here indicates production-grade experience. You should hear specifics: observability and tracing tools (Datadog, Honeycomb, Langfuse, or a custom solution), evaluation frameworks, regression test suites, and an on-call escalation path.
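To make the "confidence threshold plus review step" idea concrete, here is a minimal Python sketch of a gate that decides whether an AI output can be acted on automatically or must go to a person. The `ModelOutput` type, the `confidence` score, and the 0.8 default are all assumptions made for illustration; a real system would define its own scoring method and threshold.

```python
from dataclasses import dataclass


@dataclass
class ModelOutput:
    text: str
    confidence: float  # assumed: a 0.0-1.0 score supplied by the system


def route_output(output: ModelOutput, min_confidence: float = 0.8) -> str:
    """Return 'auto' if the output may be acted on, else 'human_review'."""
    if not output.text.strip():
        # Empty or whitespace-only output is treated as a failure.
        return "human_review"
    if output.confidence < min_confidence:
        # Below the agreed threshold: a person checks it before any action.
        return "human_review"
    return "auto"


if __name__ == "__main__":
    print(route_output(ModelOutput("Refund approved per policy", 0.93)))
    print(route_output(ModelOutput("Refund approved per policy", 0.41)))
```

An agency with production experience should be able to say where a gate like this sits in their systems, who receives the items routed to review, and how the threshold was chosen.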
"What AI systems have you built that are still running for clients after 12 months?" Many vendors can stand up a demo. Fewer build systems that are maintained, updated, and delivering value a year later. Ask specifically about long-running implementations, including how maintenance costs have trended (they usually decrease after month four as prompt drift patterns are tamed) and how the systems have been extended since launch. A vendor with a graveyard of pilots and no systems still running after a year is not a partner for production work.
"What would you not use AI for in a business like mine?" The best AI agencies are direct about where AI does not make sense. A vendor who positions AI as the answer to every problem is a vendor who does not understand AI well enough or is prioritizing their sales goal over your outcome. Specific "do not automate" examples to listen for: final pricing decisions on large contracts, anything requiring physical human presence, workflows with fewer than 50 monthly executions (the build cost will never pay back), and anything where a single wrong answer creates legal or safety exposure without a human checkpoint.
"What models do you default to, and when do you use something else?" You want to hear a model-agnostic answer, for example: Claude for long-context reasoning, GPT-4 for tool use and speed, open-weight models for high-volume low-risk workloads, and smaller models such as Claude Haiku or GPT-4o mini for classification tasks. An agency that uses one model for everything is optimizing for its own convenience, not your cost or quality.
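A model-agnostic approach often comes down to a routing table keyed by task type. The sketch below is illustrative only: the task categories mirror the examples in this section, and the model labels are placeholders, not real API identifiers.

```python
# Hypothetical routing table. The labels on the right are placeholders
# standing in for whichever models the agency has actually benchmarked.
ROUTES = {
    "long_context_reasoning": "large-long-context-model",
    "tool_use": "general-purpose-model",
    "bulk_low_risk": "open-weight-model",
    "classification": "small-fast-model",
}
DEFAULT_MODEL = "general-purpose-model"


def pick_model(task_type: str) -> str:
    """Choose a model label for a task, falling back to the default."""
    return ROUTES.get(task_type, DEFAULT_MODEL)


if __name__ == "__main__":
    for task in ("classification", "long_context_reasoning", "something_new"):
        print(task, "->", pick_model(task))
```

The useful follow-up question is how the agency filled in its own version of this table: by benchmarking on client data, or by habit.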
## What to Look for in Their Approach
Discovery before recommendations. Any credible agency should want to understand your operations before recommending solutions. A firm that proposes solutions in the first meeting without meaningful discovery is selling a product, not solving your problem. Expect a discovery phase of at least two weeks on engagements above $40,000, including stakeholder interviews, process mapping, data audit, and a written findings document.
Measurable outcomes, not just deliverables. "We will deliver an AI chatbot" is a deliverable. "We will reduce support ticket volume by 30% over 90 days, measured against a baseline of 4,200 tickets per month, with a scope review triggered if the reduction falls below an 18% fallback target" is an outcome with accountability. The agency's focus should be on outcomes you care about, with a written method for measuring them and a defined response if targets are missed.
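The outcome statement above can be reduced to a small check. This Python sketch assumes one reading of the example: a 30% reduction meets the target, a reduction between 18% and 30% counts as below target, and anything under 18% triggers a scope review. The function name and the status labels are hypothetical.

```python
def assess_outcome(baseline: int, observed: int,
                   target: float = 0.30, fallback: float = 0.18) -> str:
    """Classify a measured result against the target and fallback reductions."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    # Fractional reduction relative to the agreed baseline.
    reduction = (baseline - observed) / baseline
    if reduction >= target:
        return "target_met"
    if reduction >= fallback:
        return "below_target"
    return "scope_review"


if __name__ == "__main__":
    # Baseline of 4,200 tickets per month, as in the example above.
    for observed in (2900, 3300, 3600):
        print(observed, assess_outcome(4200, observed))
```

Whatever form the contract takes, the baseline, the measurement window, and the response at each level should be written down before work starts.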
Clear ownership of results. If a system underperforms, what happens? Does the agency commit to improvement? What does support look like after launch? Many agencies disappear after delivery; the best ones treat post-launch support as a structured engagement with defined SLAs, a named account contact, and a predictable monthly retainer. A handoff-and-ghost pattern is one of the clearest predictors of a project that will need to be rebuilt within 18 months.
A realistic timeline and cost estimate. Chatbot implementations for a defined use case: 4 to 12 weeks, $8,000 to $40,000. Custom AI agent implementations: 8 to 20 weeks, $20,000 to $100,000. RAG-backed knowledge systems: 6 to 14 weeks, $18,000 to $80,000. Large-scale enterprise AI programs: months to years, six to seven figures. Be skeptical of unusually fast timelines or unusually low prices. A $4,000 custom agent quote means the vendor is either dumping a generic template on you or underestimating the scope so aggressively that change orders will double the final bill.
Adjacent competence where it matters. AI systems rarely ship in isolation. They usually need to connect to a website, a CRM, an email system, and often a brand layer. An agency that can coordinate with your website design and brand identity work, or handle it themselves, reduces integration friction.
## Red Flags
No technical staff. Some AI "agencies" are primarily resellers who know how to configure standard platforms (Voiceflow, Botpress, Intercom Fin, a no-code RAG tool) but do not have engineers who can build custom systems. If your use case requires anything beyond a standard tool configuration, you need an agency with actual technical staff on the team, not subcontractors pulled in per-project from Upwork. Ask how many full-time engineers work on implementations and how long they have been with the firm.
Proprietary platform lock-in. Some agencies only work with one platform or tool, and their proposal is always to implement that tool. This is not always wrong (sometimes the right tool is the right tool), but an agency that only knows one approach cannot give you objective advice about whether it is actually the best approach for your situation. If they get a referral fee or reseller margin on a specific platform, they should disclose it.
Guarantees on AI outcomes. AI system performance depends on many factors outside the agency's control, including the quality and consistency of your data, how well the AI is integrated into existing workflows, and how much your team adopts it. Credible agencies set realistic targets and define a process for improving toward them; they do not guarantee specific outcomes in advance. Guaranteed savings claims (often marketed as "we will save you $X or you pay nothing") usually come with fine print that makes the guarantee unenforceable.
Vague explanations of how it works. If you ask how the system handles a specific edge case and the answer is "our AI handles that" without any specifics, the agency does not have a clear picture of their own implementation. You should understand, at a high level, what the system does, what it does not do, and what happens in the edge case you just described.
No evidence of domain expertise in your industry. This is a softer flag, but meaningful. An agency that has worked in your industry knows the vocabulary, the compliance constraints (HIPAA, FINRA, SOC 2, PCI), the integration landscape, and the specific pain points. They are faster, cheaper, and less error-prone than an agency that is learning your domain from scratch. For regulated industries this is not soft at all; it is usually a hard requirement.
No evaluation or testing framework. Ask how they validate that a system meets the agreed-upon quality bar before go-live. If the answer is "we test it manually" or "it looks good in our demos," walk away. You want to hear about eval datasets, benchmark test sets, automated regression suites, and defined accuracy thresholds.
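For a sense of what an automated regression suite with an accuracy threshold looks like in its simplest form, here is a Python sketch. The eval cases, the 0.9 threshold, and `keyword_classifier` (a stand-in for the real system under test) are all invented for illustration.

```python
from typing import Callable, List, Tuple

# Hypothetical eval set: (input text, expected label).
EVAL_SET: List[Tuple[str, str]] = [
    ("Where is my order?", "shipping"),
    ("I was charged twice", "billing"),
    ("Cancel my subscription", "account"),
    ("My invoice is wrong", "billing"),
]


def accuracy(predict: Callable[[str], str],
             cases: List[Tuple[str, str]]) -> float:
    """Fraction of cases where the prediction matches the expected label."""
    if not cases:
        raise ValueError("eval set is empty")
    correct = sum(1 for text, expected in cases if predict(text) == expected)
    return correct / len(cases)


def passes_gate(predict: Callable[[str], str],
                cases: List[Tuple[str, str]],
                threshold: float = 0.9) -> bool:
    """Go-live gate: True only if accuracy meets the agreed threshold."""
    return accuracy(predict, cases) >= threshold


def keyword_classifier(text: str) -> str:
    """Stand-in for the real system under test."""
    lowered = text.lower()
    if "charged" in lowered or "invoice" in lowered:
        return "billing"
    if "order" in lowered:
        return "shipping"
    return "account"


if __name__ == "__main__":
    print("accuracy:", accuracy(keyword_classifier, EVAL_SET))
    print("passes gate:", passes_gate(keyword_classifier, EVAL_SET))
```

A real suite would hold hundreds of cases drawn from your own data and would be rerun on every prompt or model change, but the shape is the same: fixed cases, a measured score, and a threshold agreed in advance.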
## How to Evaluate Your Options
For a significant AI implementation ($25,000 and up), a reasonable evaluation process includes:
1. Shortlist three to five agencies based on portfolio review, referrals, and case study depth
2. Send a written brief describing your problem, current workflow, data situation, and goals
3. Conduct a technical discovery call where you ask the diagnostic questions above
4. Request a proposal with scope, timeline, cost, measurement plan, and post-launch support terms
5. Call at least two references for each finalist, with specific questions prepared
6. Run a paid discovery sprint with your top one or two finalists (typically $3,000 to $8,000) before committing to the full build
7. Make a selection decision based on capability, fit, pricing, and contractual terms
Do not skip the references, and do not skip the paid discovery sprint on larger engagements. The gap between how agencies present themselves and how they actually perform is often revealed only in client conversations and in the quality of their discovery output. A two-week paid discovery that produces a weak requirements document is a $6,000 lesson worth learning before you commit $80,000 to a build.
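One way to keep the final selection step honest is a simple weighted score across the four criteria named in the process: capability, fit, pricing, and contractual terms. The weights, the 1-to-5 scale, and the agency names below are hypothetical; set your own before you see any proposals.

```python
from typing import Dict, List

# Hypothetical weights; they should sum to 1.0 and reflect your priorities.
WEIGHTS: Dict[str, float] = {
    "capability": 0.40,
    "fit": 0.25,
    "pricing": 0.20,
    "contract_terms": 0.15,
}


def weighted_score(scores: Dict[str, float]) -> float:
    """Combine 1-5 criterion scores into a single weighted number."""
    missing = WEIGHTS.keys() - scores.keys()
    if missing:
        raise ValueError(f"missing criteria: {sorted(missing)}")
    return sum(WEIGHTS[name] * scores[name] for name in WEIGHTS)


def rank(agencies: Dict[str, Dict[str, float]]) -> List[str]:
    """Return agency names ordered from highest to lowest weighted score."""
    return sorted(agencies, key=lambda name: weighted_score(agencies[name]),
                  reverse=True)


if __name__ == "__main__":
    finalists = {
        "Agency A": {"capability": 5, "fit": 4, "pricing": 2, "contract_terms": 3},
        "Agency B": {"capability": 3, "fit": 3, "pricing": 5, "contract_terms": 4},
    }
    print(rank(finalists))
```

The score does not make the decision for you. Its value is that it forces the weighting conversation to happen before a polished sales presentation can influence it.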
Running Start Digital builds custom AI systems for business process automation and AI-assisted workflows, with documented implementations and available references. We also offer a two-week paid discovery option that delivers a written technical spec regardless of whether you proceed with us on the build.
## Frequently Asked Questions
### Is it better to hire an AI agency or build internal AI capability?

Most businesses need both over time, but in different sequences. An external agency can implement specific, high-value AI systems faster than you can build that capability internally, typically 60 to 80% faster on a first project. Once those systems are running, the operational knowledge for maintaining and evolving them can transfer to internal staff through a structured handoff that usually takes 90 to 180 days. Building internal AI capability from scratch before you have proven use cases is expensive and slow; senior AI engineers are commanding $280,000 to $450,000 base salaries in most US metros in 2026. Most businesses are best served by using external expertise to prove and launch, then building internal capability around what is already running.
### What should a basic AI implementation proposal include?

A credible proposal includes: a summary of your stated problem and goals (showing they listened); a recommended approach with technical specifics (models, frameworks, data sources, integration points); scope of work with specific deliverables; timeline with milestones and decision gates; cost with what is included, what is not, and change-order pricing; how success will be measured and what happens if metrics are missed; post-launch support terms with SLA numbers; and what they need from you to execute, including named stakeholders and data access. If any of these are missing, ask for them. A proposal that is all vision and no specifics is a proposal that will expand in scope and cost after you have signed.
### How do we evaluate AI agencies if we do not have technical expertise internally?

The diagnostic questions in this article do not require technical expertise. They test for experience and judgment, not specific knowledge. You can also hire a fractional CTO or an independent technical advisor (typical rate $250 to $450 per hour) for four to eight hours to review technical proposals if you want a second opinion. The clearest signal that you are not dealing with a credible technical team is when they cannot answer concrete questions about how their systems work. Technical credibility shows up in specificity, not in jargon. A vendor who throws around "transformer architecture" and "neural networks" without explaining how their specific implementation handles your specific use case is performing, not delivering.
### What is a reasonable budget range for working with an AI agency?

Ranges vary significantly by scope. Focused single-workflow chatbot or automation implementations typically run $8,000 to $40,000. Custom multi-step AI agent systems: $25,000 to $100,000. RAG-backed internal knowledge systems: $18,000 to $80,000. Ongoing AI program management with multiple workstreams: $5,000 to $20,000 per month. These ranges assume US-based or equivalent-quality agencies. Offshore teams are cheaper but typically require more management overhead and produce more revision cycles, and compliance-sensitive industries often cannot use them at all. The right investment depends on the value of the problem you are solving: a workflow that costs $500,000 per year in staff time justifies a $100,000 implementation; a workflow that costs $20,000 per year does not.
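The closing comparison in this answer is a payback calculation. The Python sketch below shows the arithmetic under one loudly stated assumption: that the implementation recovers half of the workflow's annual staff cost. That 50% figure is illustrative, not a benchmark; substitute your own estimate.

```python
def payback_months(build_cost: float, annual_workflow_cost: float,
                   savings_fraction: float) -> float:
    """Months until cumulative savings equal the build cost."""
    # Monthly savings = the share of annual cost recovered, spread over 12 months.
    monthly_savings = annual_workflow_cost * savings_fraction / 12
    if monthly_savings <= 0:
        return float("inf")  # no savings means the build never pays back
    return build_cost / monthly_savings


if __name__ == "__main__":
    assumed_savings = 0.5  # illustrative assumption, not a benchmark
    for annual_cost in (500_000, 20_000):
        months = payback_months(100_000, annual_cost, assumed_savings)
        print(f"${annual_cost:,} per year workflow: "
              f"payback in {months:.1f} months")
```

Under that assumption the $500,000 workflow pays back a $100,000 build in under five months, while the $20,000 workflow would take ten years, which is why the second project does not justify the spend.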
### How do I structure the contract to protect myself?

Key contract protections: milestone-based payments (no more than 30% upfront), defined acceptance criteria per milestone, a written change-order process with per-hour pricing, IP ownership on custom code and prompts, source code and prompt file deliverables as part of final acceptance (not behind a vendor platform), a data handling addendum covering where your data is stored and how it is used, and a defined handoff package including documentation, runbooks, and access credentials. Net-30 payment terms and a 30 to 60 day warranty on critical defects post-launch are reasonable to ask for and worth negotiating.
### How does agency selection connect to our broader marketing and technology stack?

AI implementations land in a context. The chatbot talks to customers who arrived through SEO services; the content engine produces material published through the web hosting and maintenance stack; the brand voice needs to match the brand identity system that governs everything else. Agencies that understand this context and can coordinate across it produce better results than specialists who optimize their piece in isolation. If your AI engagement will touch the public-facing brand, coordinate the selection of your AI partner with the firms running your marketing and web infrastructure rather than treating them as separate procurements.
