After months of integrating AI tools into production development workflows, I've learned that effective AI-assisted development looks nothing like the hype. This is a reality check on what actually works—and what produces maintainable, high-quality code.
Setting Realistic Expectations
You may have read about developers launching dozens of agents in parallel, generating thousands of lines of code in minutes, or completing features in a fraction of the usual time. That approach produces poor code.
The reality of effective AI-assisted development mirrors the principles that have worked for decades:
- Clearly define what you want — vague prompts generate vague code
- Provide focused, relevant context — LLMs have no memory; they need explicit knowledge for each task
- Break work into discrete todos — one well-scoped task at a time, not monolithic generation
- Establish feedback mechanisms — unit tests and human checkpoints catch errors early
- Iterate deliberately — generate, review, test, refine; never blindly merge generated code
Large language models must be treated like newly hired engineers: highly capable, fast learners, but with no persistent memory. Each task requires a clean onboarding process that provides context, scope, and constraints.
What This Workflow Optimizes For
- Code quality — tested, reviewed, maintainable implementations
- Developer productivity — focus on architecture and design while AI handles boilerplate
- Reproducibility — deterministic workflows that produce consistent results
- Knowledge preservation — all context, decisions, and iterations captured in version control
What it does not optimize for:
- Raw speed — generating 10,000 lines of untested code in 5 minutes is worthless
- Massive parallelism — running 20 agents simultaneously creates coordination chaos
- Autonomous operation — human checkpoints are mandatory, not optional
- "AI does everything" — developers remain responsible for architecture, review, and integration
The time savings come from eliminating repetitive work (boilerplate, test scaffolding, routine refactoring), not from bypassing proper software engineering discipline. If you follow this workflow rigorously, expect 20-40% productivity gains on well-defined tasks, not mythical 10x breakthroughs.
Preserving Developer Learning and Growth
This workflow ensures developers continue to learn and gain experience. This is critical because humans need to stay in the loop—and will for the foreseeable future.
Other AI-assisted workflows that emphasize full automation and "AI does everything" approaches have a dangerous side effect: they make developers dumber and software more error-prone. When developers become passive consumers of AI-generated code without understanding what's being produced, several problems emerge:
- Skill atrophy — Developers lose the ability to debug, architect, or reason about code they didn't write and don't understand
- Blind trust — Without reviewing and testing each increment, subtle bugs and architectural flaws accumulate undetected
- Context loss — When AI generates thousands of lines without human checkpoints, no one understands the system anymore
- Inability to maintain — Code that no human has read or understood becomes unmaintainable the moment something breaks
- Loss of judgment — Developers who don't practice making design decisions lose the ability to make good ones
The Core Workflow Components
I use three tools and one principle:
- Claude — code generation, refactoring, test scaffolding
- Codex — reasoning, architectural review, second opinions (via MCP)
- GitHub — version control, CI/CD, review integration
- Unit tests — the operational checkpoints that validate every step
Limiting the stack to these tools keeps maintenance, compliance, and security overhead predictable. There's no need for Code Rabbit, AutoGen, Crew AI, or similar tools: the foundation models perform well when given proper workflow structure.
Roles and Responsibilities
Claude — Code Generation Engine
- Acts as the primary code generation tool for all implementation work
- Accepts high-level input and directly generates code, tests, and documentation
- Performs refactoring, scaffolding, debugging, and iterative development
- Maintains short-term conversational state for the current task only
- Optionally consults Codex through MCP when a second opinion is needed
Codex — Architectural Advisor (via MCP)
- Connected to Claude through Model Context Protocol as an external endpoint
- Not used for code generation — only for review, validation, and advisory opinions
- Invoked selectively when Claude or the developer needs architectural feasibility assessment, design trade-off analysis, or validation that the proposed approach aligns with project standards
- Never writes code directly — only provides opinions that inform Claude's generation work
The Developer
- Defines and owns the task scope and acceptance criteria
- Provides the explicit onboarding context to the LLMs
- Reviews AI output and integrates validated results
- Maintains test discipline — each functional increment must be validated by a unit test checkpoint
The Task Lifecycle
Every feature or bug fix follows a deterministic lifecycle: define → onboard → iterate → implement → test → review → merge.
1. Task Definition
Every task begins as a Markdown file in sessions/tasks/. This file is the single source of truth for the task scope and serves as onboarding material for both humans and LLMs.
Example task file:
---
status: pending
branch: feature/status-endpoint
created: 2025-10-22
success_criteria:
- Returns HTTP 200 with correct JSON payload
- Unit tests cover success and DB-failure scenarios
- Endpoint appears in OpenAPI schema
---
# Add `/api/status` Endpoint
## Context
Our monitoring tools require a lightweight backend endpoint that returns the current service status, including database connectivity and version metadata.
## Scope
Implement a new GET endpoint `/api/status` in the FastAPI backend.
## Acceptance Criteria
- Returns HTTP 200 and JSON payload
- Unit test covers positive and simulated DB-down scenarios
- Endpoint included in OpenAPI schema
2. Implementation by Claude
Claude analyzes the relevant codebase files, proposes an implementation approach, and after approval generates:
- The feature code
- Corresponding unit tests
- Any necessary documentation updates
The developer reviews the generated code before proceeding.
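As a concrete illustration, here is a minimal sketch of the kind of code Claude might produce for the status-endpoint task above. The module layout and helper names (`check_db_connection`, `APP_VERSION`) are assumptions for illustration, not the project's actual code.

```python
# Hypothetical first pass at the /api/status endpoint (illustrative names only).
from fastapi import FastAPI

app = FastAPI()

APP_VERSION = "0.1.0"  # placeholder version metadata


def check_db_connection() -> bool:
    """Probe the database; the real project would issue a lightweight query."""
    return True  # stubbed out for the sketch


@app.get("/api/status")
def get_status() -> dict:
    """Return service status: DB connectivity and version metadata."""
    return {
        "status": "ok",
        "db_connected": check_db_connection(),
        "version": APP_VERSION,
    }
```

At this point the code handles only the happy path; a database outage would surface as an unhandled error, which is exactly the kind of gap the next step can catch.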
3. Optional: Request Codex Second Opinion
If architectural validation is needed before committing, the developer can ask Claude to consult Codex via MCP.
Example:
"Before we commit this, can you get Codex's opinion on whether this status endpoint design aligns with our monitoring architecture?"
Claude sends the context and code to Codex, which responds with architectural feedback:
✅ Design is sound.
⚠️ Consider adding response time measurement for the DB check.
💡 Suggest wrapping DB call in try/except to return db_connected: false
rather than error on failure.
Claude can then regenerate the implementation incorporating Codex's suggestions.
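Continuing the sketch above, the regenerated handler might incorporate both suggestions, the try/except and the response-time measurement, roughly like this (still illustrative, not the project's actual code):

```python
# Revised handler reflecting Codex's feedback: a DB failure now yields
# db_connected: false instead of an error, and the probe duration is reported.
import time


@app.get("/api/status")
def get_status() -> dict:
    start = time.perf_counter()
    try:
        db_connected = check_db_connection()
    except Exception:
        db_connected = False
    db_response_ms = round((time.perf_counter() - start) * 1000, 1)
    return {
        "status": "ok",
        "db_connected": db_connected,
        "db_response_ms": db_response_ms,
        "version": APP_VERSION,
    }
```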
4. Unit Test Checkpoint
Each implemented step requires at least one unit test confirming correctness. Tests are the feedback mechanism that validates LLM output.
Critical principle: Tests are not optional or "nice to have"—they're the only way to verify that generated code actually works. The same rule that applied before AI still applies: untested code is broken code until proven otherwise.
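Assuming the hypothetical endpoint sketched earlier lives in `app/main.py`, the checkpoint tests for this task might look like the following (the module path and names are again illustrative):

```python
# Checkpoint tests for the hypothetical status endpoint: one success case and one
# simulated DB outage, matching the task's acceptance criteria.
from fastapi.testclient import TestClient

from app.main import app  # illustrative module path

client = TestClient(app)


def _raise_db_error():
    raise ConnectionError("simulated database outage")


def test_status_success():
    response = client.get("/api/status")
    assert response.status_code == 200
    body = response.json()
    assert body["db_connected"] is True
    assert "version" in body


def test_status_db_down(monkeypatch):
    # Patch the DB probe so it fails, exercising the try/except path.
    monkeypatch.setattr("app.main.check_db_connection", _raise_db_error)
    response = client.get("/api/status")
    assert response.status_code == 200
    assert response.json()["db_connected"] is False
```

Running the suite is then the checkpoint: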
pytest -q --disable-warnings
..
2 passed in 0.45s
If tests pass, the task is considered functionally complete. If tests fail, the implementation is incomplete regardless of how much code was generated. Never proceed to the next todo until the current one has passing tests.
This is where velocity gains come from: Claude generates the test scaffolding and initial implementation quickly, but the human developer must verify correctness before moving forward.
5. Pull Request and Automated Review
After pushing, the Claude Code GitHub Action runs automatically, performing an independent AI review that checks for:
- Code style compliance
- Missing or insufficient tests
- Architectural and security issues
- Ambiguous or incomplete documentation
This serves as the final automated gate before human approval and merge.
When to Use Claude vs. Codex
Use Claude for all code generation. Consult Codex when you need architectural validation, design trade-off analysis, or a second opinion on complex decisions.
Decision Matrix
| Situation | Use Claude | Consult Codex |
|---|---|---|
| Feature implementation with clear requirements | ✅ Generate code | ❌ Not needed |
| Bug fix with known solution | ✅ Generate fix | ❌ Not needed |
| Refactoring with unclear impact | ✅ Generate code | ✅ Review design |
| Multiple architectural approaches exist | ✅ Implement chosen approach | ✅ Evaluate trade-offs first |
| Performance optimization strategy | ✅ Implement optimization | ✅ Validate approach |
| Simple CRUD endpoint | ✅ Generate code | ❌ Not needed |
| New system integration | ✅ Generate integration code | ✅ Review integration pattern |
| Database schema migration | ✅ Generate migration | ✅ Review if major schema change |
What I Do Differently Now
Mandatory Test Checkpoints
Every LLM-generated code change must be accompanied by a new or updated test. This forms the validation baseline before merge. The same discipline that prevented bugs in the decades before AI remains essential: write the test, verify it passes, then move to the next todo.
Structured Task Files
All project standards—structure, conventions, and code guidelines—live in version-controlled Markdown files in /docs/. These serve as canonical references for both human developers and AI tools.
Incremental Implementation
Break complex tasks into discrete todos. Implement one at a time. Review and test each increment before proceeding. This prevents the "black box" problem where thousands of lines of code appear without anyone understanding it.
Human Checkpoints
Reviews and integration decisions remain with humans. AI accelerates execution on tasks you already understand, freeing mental bandwidth to tackle harder architectural problems.
Deterministic Workflows
Each workflow run is reproducible from the same inputs. Session logs, task references, and resulting commits are linked for traceability. This ensures transparency, compliance, and reproducibility.
The Bottom Line
AI-assisted development is a force multiplier, not a replacement for engineering discipline. The real productivity gain comes from eliminating boilerplate while maintaining the same rigor that made software work before AI existed.
Measure system performance end to end. Users don't care whether a human or an AI wrote the code; they care about whether it works, is maintainable, and solves their problem.
The workflows that work long-term are the ones that treat AI as a highly capable assistant that still needs context, guidance, and verification—not as a magic code factory that can operate autonomously.