Prompts Are Refactoring Targets: Logging an AI Workflow to Find What to Automate

The move I trust most in my own work is this: take something experience flags as improvable, then prove the why and how with real data — rather than acting on the hunch. The intuition tells you where to look; the data tells you whether you were right and what to do about it. What's changed lately is that an LLM can do the looking, at a scale that used to make this kind of proof too expensive to bother with.

I leaned on exactly this at Google. On the Legal Content Policy and Standards team, as a Policy Lead, I had a hunch about which kinds of legal removal cases were clogging the backlog and which were routine enough to automate. Rather than argue it from experience, I had an LLM classify and summarize cases at scale — thousands of them — until the data either confirmed the hypothesis or corrected it. In Privacy Sandbox, as a TPM, it was the same shape: a feeling about which documents headed to legal review actually warranted deeper scrutiny, turned into a pipeline that had an LLM review and classify each one so the expensive human attention landed where it mattered. In both cases the intuition was the starting point, not the verdict. The LLM made the verdict affordable.

So a recent engineering post from LayerX caught my eye. It is by Maeda Yōhei (前田洋平), an engineering manager on the mobile-app team for LayerX's Bakuraku finance-ops product, who writes as id:myohei11, and it runs that move on a target I hadn't thought to point it at: the AI workflow itself. Engineers who work with a coding agent all day keep typing the same things: "commit, push, open a PR," "address the review comments," "fix the failing CI." It feels repetitive, but a feeling is a weak basis for deciding what to automate. So the setup described in the post logs every prompt sent to the agent, analyzes the log on a schedule, and lets the patterns that actually recur say what to build. The phrase that stuck with me is that prompts become refactoring targets: instructions to an AI, like code, are something to observe, measure, and reshape into reusable pieces. Saying the same thing a dozen times is duplication, and duplication is a smell precisely because it marks a unit worth extracting.

How it works

The mechanics are deliberately modest. A UserPromptSubmit hook appends each prompt to a local prompts.jsonl — one JSON line with a timestamp, a session id, and the text — skipping trivially short inputs and keeping the file owner-readable only. Logs are masked, kept locally, and deleted after seven days. (That last part isn't an afterthought: a prompt log is a record of everything you've asked, so the privacy guardrails are load-bearing, not optional.)
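
To make the logging step concrete, here is a minimal sketch of such a hook in Python. It assumes the agent hands the hook its event as JSON on stdin with `prompt` and `session_id` fields. The log location, the 8-character cutoff, and the token-masking regex are my own placeholders, since the post does not spell them out.

```python
import json
import os
import re
import sys
import time
from pathlib import Path
from typing import Optional

MIN_LENGTH = 8        # assumed cutoff for "trivially short" inputs
RETENTION_DAYS = 7    # from the post: logs are deleted after seven days
LOG_PATH = Path.home() / ".prompt-log" / "prompts.jsonl"  # hypothetical location

# Hypothetical masking rule: redact long token-like strings (keys, hashes).
TOKEN_RE = re.compile(r"[A-Za-z0-9_-]{32,}")


def mask(text: str) -> str:
    return TOKEN_RE.sub("[MASKED]", text)


def log_prompt(payload: dict, log_path: Path, now: Optional[float] = None) -> bool:
    """Append one prompt as a JSON line; return False if it was skipped."""
    text = str(payload.get("prompt", "")).strip()
    if len(text) < MIN_LENGTH:
        return False
    record = {
        "ts": time.time() if now is None else now,
        "session_id": payload.get("session_id", ""),
        "prompt": mask(text),
    }
    log_path.parent.mkdir(parents=True, exist_ok=True)
    # Mode 0o600 applies when the file is first created: owner read/write only.
    fd = os.open(log_path, os.O_WRONLY | os.O_APPEND | os.O_CREAT, 0o600)
    with os.fdopen(fd, "a", encoding="utf-8") as f:
        f.write(json.dumps(record, ensure_ascii=False) + "\n")
    return True


def prune(log_path: Path, now: Optional[float] = None) -> int:
    """Drop records older than RETENTION_DAYS; return how many were kept."""
    if not log_path.exists():
        return 0
    cutoff = (time.time() if now is None else now) - RETENTION_DAYS * 86400
    lines = log_path.read_text(encoding="utf-8").splitlines()
    kept = [line for line in lines if line and json.loads(line)["ts"] >= cutoff]
    log_path.write_text("".join(line + "\n" for line in kept), encoding="utf-8")
    return len(kept)


def main() -> None:
    """Hook entry point: read the event JSON from stdin, log it, prune old lines."""
    log_prompt(json.load(sys.stdin), LOG_PATH)
    prune(LOG_PATH)
```

Calling `main()` from whatever script the hook runs would wire this up. Pruning on every write keeps the seven-day limit enforced without a separate cleanup job, though a scheduled job would work just as well.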

Two detection paths run on top of the log. One works inside a live session, noticing in real time when the same operation comes up again. The other is a scheduled daily job that reads the past week of prompts and groups them by intent. Running both catches more than either alone: the real-time path sees bursts within a conversation, the batch path sees the same request scattered across days. When a pattern crosses a low threshold — three or more occurrences — it posts to Slack, with the pattern summary and a ready-made prompt for building the corresponding Skill or command attached.
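
The two paths and the threshold can be sketched as follows. This assumes each log record has already been given an `intent` label by the grouping step. The field names, function names, and message wording are hypothetical, and the sketch only builds the notification text rather than calling Slack.

```python
from collections import Counter
from typing import Dict, List

WINDOW_SECONDS = 7 * 24 * 3600   # "the past week of prompts"
THRESHOLD = 3                    # "three or more occurrences"


def recurring_intents(records: List[dict], now: float) -> Dict[str, int]:
    """Batch path: count intents over the past week, keep those at or above THRESHOLD."""
    recent = [r for r in records if 0 <= now - r["ts"] <= WINDOW_SECONDS]
    counts = Counter(r["intent"] for r in recent)
    return {intent: n for intent, n in counts.items() if n >= THRESHOLD}


def recurring_in_session(records: List[dict], session_id: str) -> Dict[str, int]:
    """Real-time path: the same count, restricted to one live session."""
    counts = Counter(r["intent"] for r in records if r["session_id"] == session_id)
    return {intent: n for intent, n in counts.items() if n >= THRESHOLD}


def notification_text(intent: str, count: int, examples: List[str]) -> str:
    """Build the message body: a pattern summary plus a starter prompt for a Skill."""
    sample = "\n".join(f"- {e}" for e in examples[:3])
    return (
        f"Recurring pattern: {intent} ({count} occurrences)\n"
        f"Examples:\n{sample}\n"
        f"Suggested prompt: Create a Skill or command that handles '{intent}'."
    )
```

A request that appears once in each of three sessions on different days passes `recurring_intents` but never passes `recurring_in_session`, which is the case the batch path exists to catch.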

Two design choices make this more than a logging gimmick. First, the daily analysis uses an LLM to group prompts semantically rather than clustering embeddings — so "make a PR," "commit, push, and open a PR," and "commit the fix and get a PR up" all collapse into one pattern, because they share an intent regardless of wording. That's exactly the fuzzy, judgment-laden grouping language models are good at and rigid similarity metrics are bad at — the same reason the LLM, not a keyword filter, was the right tool for triaging those legal cases. Second, the automation stops at proposing. It never creates or deletes a Skill on its own; a human approves each one. That keeps the system from quietly breeding a thicket of half-useful rules — the failure mode that makes most "self-improving" automation a liability.
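
Both choices can be sketched in a few lines. The post does not show its prompt or its model call, so the prompt wording, the JSON reply format, and the `complete` callable (a stand-in for whatever LLM client is in use) are assumptions of mine. The point is the shape: the model proposes groups, the code validates them, and nothing leaves the "pending" state without a person.

```python
import json
from typing import Callable, Dict, List


def build_grouping_prompt(prompts: List[str]) -> str:
    """Ask the model to group prompts by intent and answer in JSON."""
    numbered = "\n".join(f"{i}: {p}" for i, p in enumerate(prompts))
    return (
        "Group the following developer prompts by intent, ignoring wording.\n"
        'Reply with JSON only: {"<intent label>": [<prompt numbers>]}\n\n'
        + numbered
    )


def group_by_intent(prompts: List[str], complete: Callable[[str], str]) -> Dict[str, List[str]]:
    """Send the prompts to the model and map its reply back onto the originals."""
    raw = json.loads(complete(build_grouping_prompt(prompts)))
    groups: Dict[str, List[str]] = {}
    seen = set()
    for label, indices in raw.items():
        members = []
        for i in indices:
            # Ignore indices the model invented or assigned twice.
            if isinstance(i, int) and 0 <= i < len(prompts) and i not in seen:
                seen.add(i)
                members.append(prompts[i])
        if members:
            groups[label] = members
    return groups


def propose(groups: Dict[str, List[str]], threshold: int = 3) -> List[dict]:
    """Turn frequent groups into proposals. Nothing here creates or deletes a Skill."""
    return [
        {"intent": label, "count": len(members), "status": "pending"}
        for label, members in groups.items()
        if len(members) >= threshold
    ]
```

Validating the indices matters because the model's reply is untrusted input: a hallucinated or duplicated index would otherwise inflate a count past the threshold.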

The interesting twist: AI improving how we work with AI

What makes this more than a tidy housekeeping trick is where it points. The same log that surfaces new automations also surfaces broken ones: if people keep invoking an existing Skill and then immediately tacking on more instructions — "also update the Notion page," "also post to Slack," "you didn't finish" — that trailing chatter is evidence the Skill is underspecified, with unclear completion criteria or a missing step. So the system turns on itself, and a flywheel appears: better instrumentation surfaces better automations, which sharpen the very system that finds the next one.
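
One rough way to measure that trailing chatter, as a sketch: count how often a Skill invocation is followed by a free-text prompt in the same session shortly afterwards. This assumes Skills are invoked as slash commands such as `/ship`. The five-minute window and the record fields are my own choices, not something the post specifies.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

FOLLOW_UP_WINDOW = 300  # seconds; assumed meaning of "immediately"


def follow_up_rates(records: List[dict]) -> Dict[str, Tuple[int, int]]:
    """Return {skill: (invocations, invocations followed by extra instructions)}."""
    by_session = defaultdict(list)
    for r in sorted(records, key=lambda r: r["ts"]):
        by_session[r["session_id"]].append(r)

    stats = defaultdict(lambda: [0, 0])
    for prompts in by_session.values():
        # Pair each prompt with the next one in the same session (None at the end).
        for cur, nxt in zip(prompts, prompts[1:] + [None]):
            text = cur["prompt"].strip()
            if not text.startswith("/"):
                continue
            skill = text.split()[0]
            stats[skill][0] += 1
            if (
                nxt is not None
                and not nxt["prompt"].lstrip().startswith("/")
                and nxt["ts"] - cur["ts"] <= FOLLOW_UP_WINDOW
            ):
                stats[skill][1] += 1
    return {skill: (total, followed) for skill, (total, followed) in stats.items()}
```

A Skill with a high followed-to-invoked ratio is a candidate for review. The heuristic cannot tell a correction from an unrelated new request, so it is a pointer for a human, not a verdict.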

That flywheel is why I think the target matters more than the trick. Agentic coding is becoming the default working model for software engineering — which makes "how a person and an agent actually interact" a new, and largely unmeasured, surface for improvement. Most teams are flying blind on it. The reported results here are deliberately unglamorous — 29 patterns across 691 occurrences, 22 of them (76%) converted into Skills or commands, 652 repeated prompts eliminated; the commit-push-PR cycle alone showed up 94 times before it became a single /ship command — and that's the point. This is the boring, high-frequency toil that's invisible until you count it, and counting it is now cheap.
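
The arithmetic behind those figures is easy to check. The conversion rate comes straight from the reported numbers. The /ship share assumes its 94 occurrences are counted within the 691 total, which the post implies but I have not verified.

```python
# Figures as reported in the post.
patterns_found = 29
patterns_converted = 22
occurrences = 691
ship_occurrences = 94

conversion_rate = patterns_converted / patterns_found
mean_per_pattern = occurrences / patterns_found
# Assumption: the 94 /ship occurrences are part of the 691 total.
ship_share = ship_occurrences / occurrences

print(f"conversion rate: {conversion_rate:.0%}")  # 76%, matching the post
print(f"mean occurrences per pattern: {mean_per_pattern:.1f}")
print(f"/ship share of all occurrences: {ship_share:.1%}")
```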

Frequency is the signal; leverage is the criterion

Two caveats keep me honest about this approach. The first is that frequency isn't importance. Just because something happens a lot doesn't mean it's worth abstracting; there's little real value in saving an engineer from typing a few words. The value comes when a whole process becomes automated, repeatable, and scalable. The count only tells you where to look for that, not that you've found it. The second is cost and coverage: a prompt log is data you now have to govern, prompts capture only one slice of a workflow, and a pile of marginal Skills can clog the environment until the agent can't tell which one applies. None of that is fatal, but it's real technical debt, and it's the reason the human approval gate isn't optional.

The judgment, in other words, stays where it belongs — and the biggest prize isn't any single automated workflow. It's improving the efficiency of the human–AI interaction itself, because that's the thing now sitting on the critical path of nearly all development. Speed that up and you speed up everything downstream of it. The hunch that something is repetitive is still just a hunch; what's new is that proving it, and acting on it, has dropped from a research project to a daily habit. I'm going to start logging mine.