Software developers

How software developers use AI agents

Plenty of AI tools will write you code that looks right. The gap that costs you hours is the one between "looks right" and "runs". A snippet compiles in your head, you paste it in, and the off-by-one or the wrong default surfaces ten minutes later. The whole point of this agent is to close that gap before the code reaches your editor.

It runs real Python in a sandbox. So when you ask it to write a function, it can write the function, write a few test cases, execute them, and show you the output. If the edge case breaks, it sees the traceback and fixes it, then runs it again, in the same turn. You are not reading a plausible suggestion. You are reading code that already passed the cases you cared about, with the proof attached. That changes debugging too: paste the failing function and the input that breaks it, and it reproduces the bug instead of theorizing about it.

Then there is the part only Keimodel does. Hard problems do not have one right answer, and no single model is best at all of them. So for the calls that matter, ask the same question of Claude, GPT, and Gemini side by side and read three real solutions instead of trusting one. One model writes the cleanest recursion, another catches the concurrency bug the first one missed, a third explains the tradeoff better. You pick the winner. Keep your stack in Memory so every example comes back in your actual language, framework, and conventions, not generic pseudocode you have to translate.

10 min read

Capabilities this leans on

Web search, Python code execution, File upload, Memory, Session memories, Skills

Set up Memory once

Do this first. With your stack saved, every snippet, review, and example comes back in the language, framework, and style you actually ship, so you stop translating generic answers.

Remember my stack and conventions so all code you give me fits without edits. Backend is Python 3.12 with FastAPI and SQLAlchemy 2.0 against Postgres 16. Frontend is TypeScript with React 19 and Vite, styled with Tailwind. We use pytest for tests, ruff for linting, and type hints everywhere. House style: small pure functions, early returns over nested ifs, descriptive names over comments, no clever one-liners. When you show code, include the imports and a runnable example, and prefer the standard library before adding a dependency.

1. Write a function and prove it runs

Ask for the code and the evidence in the same breath, so you are reviewing something that already passed.

Write a Python function that merges overlapping intervals given a list of [start, end] pairs. Then write pytest cases for the obvious ones plus empty input, single interval, fully nested intervals, and touching-but-not-overlapping. Run them and show me the output.

Add a case where intervals are unsorted and one has start equal to end. Run again and confirm it still passes.

Now give me the final version with type hints and a one-line docstring, no test code, ready to paste.

What you get: A function you have already watched pass its own edge cases, not a guess you still have to validate yourself.
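As a rough illustration of where the third prompt might land, here is a minimal sketch of the final function. It is not the agent's actual output. The name `merge_intervals` is hypothetical, and the sketch assumes that touching intervals such as [1, 2] and [2, 3] count as not overlapping and stay separate.

```python
def merge_intervals(intervals: list[list[int]]) -> list[list[int]]:
    """Merge overlapping [start, end] intervals; touching intervals stay separate."""
    merged: list[list[int]] = []
    # Sorting by start (then end) means each interval only needs to be
    # compared with the most recently merged one.
    for start, end in sorted(intervals):
        # Assumption: strict "<" keeps touching intervals apart.
        if merged and start < merged[-1][1]:
            merged[-1][1] = max(merged[-1][1], end)
            continue
        merged.append([start, end])
    return merged
```

If your definition of overlap includes touching endpoints, changing the `<` to `<=` is the only edit needed, and it is exactly the kind of decision the touching-but-not-overlapping test case forces you to make explicit.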

2. Compare how three models solve the same problem

For a problem with real tradeoffs, read Claude, GPT, and Gemini side by side and keep the best one.

Implement a rate limiter for a FastAPI endpoint: 100 requests per user per minute, sliding window, backed by Redis. I want correctness under concurrent requests. Write it with the Redis calls and explain the race condition you are guarding against.

Ask the same question of Claude, GPT, and Gemini and put the three answers next to each other. Tell me where they disagree, especially on atomicity and the Redis approach, and which one actually handles concurrent requests correctly.

Take the strongest answer, harden it against the weakness the others exposed, and give me the final version with a test that hammers it with concurrent calls.

What you get: Three real solutions to a hard problem instead of one, the disagreements made explicit, and a final version that borrows the best of all of them.

3. Review a diff before it ships

Paste the change and get a focused review, not a vague thumbs up.

Here is a git diff. Review it for correctness bugs, missing error handling, and anything that breaks our conventions from Memory. Skip style nits ruff would catch. List findings by severity, each with the line and a concrete fix.

For the highest-severity finding, write the failing test that would have caught it, run it to show it fails on the old code, and confirm it passes on your fix.

Summarize the review as a PR comment I can paste: what is solid, what must change before merge, what is optional.

What you get: A diff review that finds the real bug and proves it with a failing test, instead of a paragraph of generic praise.

4. Generate the regex or SQL you would rather not hand-write

Describe what you need in plain language and have it test the result against examples.

Write a regex that matches US phone numbers in these formats: (555) 123-4567, 555-123-4567, 555.123.4567, and +1 555 123 4567, but not partial or 11-digit junk. Then run it against a list I give you of 8 should-match and 6 should-not-match strings and show which it got right.

Here are three tables: users, orders, order_items. Write a Postgres query for the top 10 customers by total spend in the last 90 days, including their order count, and explain the join. Match my SQLAlchemy 2.0 style if I ask for the ORM version.

Now give me the SQLAlchemy 2.0 select() version of that query using my conventions from Memory.

What you get: A regex you have seen pass and fail the right strings, and SQL in both raw and ORM form that fits your stack.
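For a sense of what the regex prompt could produce, here is a minimal sketch covering the four formats named in it. It is one possible answer, not the agent's output. The names `PHONE_RE` and `is_us_phone` are hypothetical, and the sketch assumes the strictest reading: only those four exact layouts match, and mixed separators are rejected.

```python
import re

# One alternative per accepted layout; anything else fails the full match.
PHONE_RE = re.compile(
    r"""
    (?:
        \(\d{3}\)[ ]\d{3}-\d{4}       # (555) 123-4567
      | \d{3}-\d{3}-\d{4}             # 555-123-4567
      | \d{3}\.\d{3}\.\d{4}           # 555.123.4567
      | \+1[ ]\d{3}[ ]\d{3}[ ]\d{4}   # +1 555 123 4567
    )
    """,
    re.VERBOSE | re.ASCII,
)


def is_us_phone(text: str) -> bool:
    """Return True if text is exactly one of the four accepted formats."""
    # fullmatch rejects partial numbers and trailing extra digits.
    return PHONE_RE.fullmatch(text) is not None
```

Spelling out each layout as its own alternative is more verbose than one pattern with optional separators, but it makes the rejected cases (such as `555-123.4567`) easy to reason about when you run it against your own should-match and should-not-match lists.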

5. Trace a bug instead of guessing at it

Hand it the failing code and the input that breaks it, and let it reproduce the failure.

This function throws a KeyError on some inputs but not others. Here is the function and an input that triggers it. Reproduce the error by running it, tell me the exact line and why, then fix it and run the same input to confirm it is gone.

Are there other inputs that would hit the same class of bug? Generate a few, run them against your fix, and show the results.

Save the working fix plus the regression test as a single snippet I can drop into the file.

What you get: A reproduced failure, a verified fix, and a regression test, all confirmed by actually running the code rather than reasoning about it.
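The reproduce, fix, and regression-test loop described above can be sketched as follows. Everything here is invented for illustration: the `SHIPPING_RATES` table, the order shape, the function names, and the choice to fall back to a default rate are all assumptions, not taken from any real codebase.

```python
# Hypothetical lookup table, invented for this example.
SHIPPING_RATES = {"US": 5.0, "CA": 7.5}


def shipping_cost_buggy(order: dict) -> float:
    # Raises KeyError for any country missing from SHIPPING_RATES,
    # which is why it fails on some inputs but not others.
    return SHIPPING_RATES[order["country"]] * order["items"]


def shipping_cost(order: dict, default_rate: float = 10.0) -> float:
    """Return the shipping cost, using default_rate for unknown countries."""
    rate = SHIPPING_RATES.get(order["country"], default_rate)
    return rate * order["items"]


def test_unknown_country_uses_default_rate() -> None:
    breaking_input = {"country": "MX", "items": 2}
    # Step 1: reproduce the failure on the old code.
    try:
        shipping_cost_buggy(breaking_input)
    except KeyError:
        reproduced = True
    else:
        reproduced = False
    assert reproduced
    # Step 2: confirm the same input passes on the fix.
    assert shipping_cost(breaking_input) == 20.0
```

Whether a silent default is the right fix depends on the domain; raising a clearer error for unknown countries is an equally reasonable choice, and the regression test would change to match.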

6. Draft the docs and README you keep putting off

Point it at the code and have it write the documentation in your voice.

Here is a module. Write a README section for it: what it does, install and setup, a runnable usage example using my stack, the public functions with their signatures, and a short gotchas list. Plain and direct, no marketing tone.

Generate Google-style docstrings for every public function in this file, returning the file with them inserted.

Turn 'write a README in this format' into a reusable Skill called 'Module README' so I can run it on any new module.

What you get: Documentation that matches your code and your tone, plus a saved Skill so the next module gets one without retyping the brief.
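For reference, a Google-style docstring on a small function might look like the sketch below. The function `chunk` is hypothetical and exists only to show the Args, Returns, and Raises sections the second prompt asks for.

```python
def chunk(items: list[int], size: int) -> list[list[int]]:
    """Split a list into consecutive chunks of at most ``size`` items.

    Args:
        items: The values to split, in their original order.
        size: Maximum number of items per chunk. Must be at least 1.

    Returns:
        A list of chunks. The last chunk may be shorter than ``size``.

    Raises:
        ValueError: If ``size`` is less than 1.
    """
    if size < 1:
        raise ValueError("size must be at least 1")
    return [items[i : i + size] for i in range(0, len(items), size)]
```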

Related playbooks

Data & analytics

Upload a CSV and get real computed numbers, charts, and a weekly report, not guessed math.


SaaS founders

Landing copy, changelogs, lifecycle emails, support macros, and churn analysis.


Customer support

On-brand replies, reusable macros, thread summaries, and a daily triage digest.


Run your first prompt

Open the Agent, paste any prompt above, and change the details to fit your business.
