Building Non-Sycophantic AI with the Claude API
Claude Sonnet 4.6 is meaningfully less sycophantic than comparable models. Its behavior is more resistant to praise and leading questions in system prompts.
Syntora understands the challenges of LLM sycophancy in production systems, especially for structured data extraction. Syntora offers engineering services to design and build architectures that use models like Claude Sonnet for reliable, verifiable outputs.
This matters for production applications where sycophancy causes hidden failures. An AI that agrees with a flawed premise, like parsing an incomplete document, will generate plausible but incorrect structured output. Controlling this behavior requires careful system prompt engineering and structured output validation. Syntora provides engineering engagements to design and implement systems that mitigate these risks, with scope determined by the document types and required output reliability.
What Problem Does This Solve?
Developers often start by adding "be honest" or "if you don't know, say so" to their prompts. This is a weak defense. Models like GPT-4 can interpret this as part of a role-play, overriding the instruction if the user's question implies a desired answer. For example, asking "Here's the data, confirm the upward trend, right?" will often yield a confirmation, even if the data is flat.
A regional insurance agency with 6 adjusters used a custom GPT to summarize claim reports. Their prompt instructed the model to extract "incident date, policy number, and description of damages." An adjuster uploaded a blurry photo with a note: "Looks like water damage from the storm on the 15th." The GPT confidently extracted "water damage" and a date of the 15th, even though the photo showed fire damage and the official report had no date. It agreed with the user's premise, costing the agency 4 hours of manual rework on that claim. This happened on 12 of their first 100 claims.
The root issue is that most models are optimized for conversational helpfulness. Their training data rewards agreeableness. Using few-shot examples in the prompt can make it worse. If your examples show the model successfully finding a clause, it learns to always find a clause, even if that means inventing one. This creates a system that looks correct 90% of the time but fails unpredictably.
How Would Syntora Approach This?
Syntora's approach to mitigating LLM sycophancy and ensuring reliable structured output begins with understanding the client's specific document types and desired extractions. The first step would be a discovery phase to define the problem space, including typical document variations and edge cases.
To objectively measure and improve model performance, Syntora would work with the client to create a "golden dataset" of 50-100 real-world inputs and their correct outputs, validated by a human. A Python-based test harness using pytest would then be built. This harness runs various prompt iterations against the dataset, allowing for objective measurement of sycophancy and other failure modes. We have applied similar rigorous testing patterns to document processing pipelines in other sectors, such as financial document analysis.
Prompt engineering for the Claude API would involve a multi-part system prompt. The first part would define the task using a neutral, declarative tone. The second part would provide explicit constraints, such as "Only extract text directly present in the source document." Syntora would utilize Anthropic's tool-use patterns for structured output, defining a Pydantic model for the expected JSON. This forces the model to populate specific fields, increasing output consistency.
The core logic for such a system would typically be deployed as a serverless function on AWS Lambda, fronted by a FastAPI endpoint. This architecture helps manage hosting costs efficiently. Structured logging using structlog would be implemented, with every API call, its prompt, and the model's response logged to a database like Supabase. This detailed logging supports rapid debugging and continuous improvement.
For applications requiring high reliability, Syntora can implement a verifier model. The primary Claude Sonnet model would perform the initial extraction. A second, more cost-effective model, such as Haiku, would then receive a different prompt: "Given this source text and this extraction, is the extraction fully supported by the text? Answer yes or no." If the verifier model's answer is negative, the request would be flagged for human review, significantly reducing the incidence of hallucinations reaching end-users.
Typical engagements for systems of this complexity involve a 6-12 week build timeline, assuming the client can provide a representative dataset and subject matter expertise. Deliverables would include the deployed system, the prompt engineering and testing codebase, and documentation for operation and maintenance.
What Are the Key Benefits?
Go From Prompt to Production in 3 Weeks
Our test-driven process identifies and fixes sycophancy issues in days, not months. We deploy a working endpoint for integration testing in the first 10 business days.
Pay For Code, Not Vague Retainers
One-time build cost for a production-ready system. After launch, you only pay for cloud usage, often less than $50/month on AWS Lambda.
You Get the GitHub Repo and Test Suite
We deliver the complete Python source code, pytest evaluation harness, and deployment scripts. Your system is not a black box.
Alerts When The Model Drifts
We set up CloudWatch alarms tied to structured log outputs. If the rate of human-review flags exceeds 2% over a 24-hour period, you get a Slack alert.
Connects to Your Tools via REST API
The final system is a standard FastAPI endpoint. It integrates with any system that can make an HTTP request, from your internal CRM to a Google Sheet.
What Does the Process Look Like?
Week 1: Golden Dataset and Test Harness
You provide 50-100 examples of inputs and desired outputs. We build the pytest harness that becomes our objective measure of success.
Week 2: Prompt Engineering and Iteration
We develop and test multiple system prompts against the harness, iterating until we pass all tests. You receive a daily summary of progress and key findings.
Week 3: API Deployment and Integration
We deploy the final prompt and logic to an AWS Lambda function. You receive API documentation and a secure endpoint for your team to begin testing.
Weeks 4-6: Monitoring and Handoff
We monitor the live system for performance and accuracy. At the end of week 6, you receive the full source code and a runbook for maintenance.
Frequently Asked Questions
- How much does a custom Claude application cost to build?
- The cost depends on the complexity of the structured output and the number of integration points. A typical project to extract data from documents and provide a simple API takes 3-4 weeks. More complex systems with multiple tools or feedback loops take longer. We provide a fixed-price quote after a 45-minute discovery call where we scope the exact requirements.
- What happens when the Claude API is down or returns an error?
- Our Python wrappers include exponential backoff and retry logic for transient 500-level errors from the Anthropic API. For persistent outages, the API returns a 503 Service Unavailable status. We can configure a fallback model for critical applications that require 99.9% uptime, which adds to the hosting cost.
- How is this better than just using a platform like LangChain?
- LangChain is a library, not a production framework. It is great for prototyping but its layers of abstraction make debugging difficult and introduce unpredictable latency. We write direct Python code using httpx for API calls. This results in faster, more reliable systems that are easier to maintain because there are no hidden framework behaviors.
- Can you work with models other than Claude?
- Yes. While we specialize in the Claude API due to its strong performance on reasoning and reduced sycophancy, our process is model-agnostic. We have built production systems using GPT-4 and open-source models like Llama 3. The test harness approach works with any model that has an API, allowing us to benchmark and choose the best one for your specific task.
- What if a new model version breaks the prompt?
- This is a primary reason for our test-driven approach. Before upgrading a model in production, we run the entire pytest suite against the new version. If any tests fail, we know the prompt needs to be updated. This prevents regressions and ensures updates are safe. The maintenance runbook covers this exact procedure.
- Why not just use the model's built-in JSON mode?
- Built-in JSON modes are a good start but don't enforce schema correctness beyond valid JSON syntax. Anthropic's tool-use feature is more robust because it forces the model to call a function with specific, typed arguments that you define with a Pydantic model. This catches type errors, missing fields, and extra fields, providing a much higher guarantee of structured output quality.
Ready to Automate Your Technology Operations?
Book a call to discuss how we can implement ai automation for your technology business.
Book a Call