Agentic Systems in Practice: Capability Is No Longer the Constraint

Alex Del Castillo

March 24, 2026

Core Idea

The bottleneck has moved: the decisive question is no longer whether a system can be built, but whether it is reliable and economical enough to run.

There is a point at which a system stops feeling like a demonstration and begins to feel like infrastructure. It does not announce itself. There is no moment where something clearly "works" for the first time. Instead, the transition happens gradually, as processes that once required attention begin to run without it. Information starts to arrive already structured. Priorities surface without being explicitly requested. Over time, the system ceases to be something you interact with and becomes something that operates alongside you.

I have reached that point with an agentic system running on OpenClaw (v2026.3.23-2), configured locally through Docker, with a Gateway process running in a terminal and a browser-based interface layered on top. The system is connected across Slack, WhatsApp, Telegram, Discord, multiple email accounts, calendar, and Google Drive. One of the agents operates with its own email address and workspace, capable of drafting communications, organizing documents, and maintaining calendar context. Each morning at approximately 7:30, the system runs a full briefing. It ingests inboxes, calendars, and task lists, synthesizes what matters, and produces a structured view of the day ahead. A separate agent maintains a rolling to-do list, continuously updated as new information arrives. A third is dedicated to research. A fourth, focused on financial planning and resource allocation, is likely the next logical step.

What becomes immediately clear when operating a system like this is that capability is no longer the primary variable. The models can do the work. The question is how that capability behaves when extended across time, connected to tools, and required to maintain coherence across multiple steps. The differences between models are not particularly visible in a single response. They emerge in sequences, especially when those sequences involve tool use, memory, and dependency across tasks.

I began by attempting to run the system using locally hosted models through Ollama, primarily Qwen variants. In isolation, they appeared functional. They could respond to prompts, follow instructions, and maintain a basic conversational thread. Under orchestration, however, the limitations became obvious. Tool calls were inconsistent, outputs were rigid, and tasks that appeared to complete often failed in execution. The system would produce something that looked correct at the surface level while quietly breaking underneath. The issue was not that these models were unusable. It was that they were not reliable enough to support a system expected to run continuously.

Moving to frontier models changed the behavior, but not uniformly. GPT-5.1 was the first version that produced something approaching stability. It could run workflows, though with noticeable friction. GPT-5.2, despite expectations, performed worse in this environment. Context would degrade across steps, tool interactions became inconsistent, and the system required constant intervention to maintain alignment. The shift to GPT-5.4 was materially different. Iterations decreased, execution became more consistent, and multi-step tasks began to complete without supervision. The system did not become perfect. It became operable. That distinction matters more than incremental improvements in output quality. A system that requires five attempts to complete a task is not meaningfully equivalent to one that completes it in two. The difference compounds across every workflow.
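One rough way to see why the attempt count matters is to treat each attempt at a step as an independent try with a fixed success rate. The sketch below is illustrative only: the independence assumption, the function names, and the four-step workflow are hypothetical and not measurements from this system. Under that model, "five attempts on average" corresponds to a 20% per-attempt success rate and "two attempts" to 50%.

```python
def expected_attempts(success_rate: float) -> float:
    """Mean number of tries until the first success (geometric model)."""
    if not 0 < success_rate <= 1:
        raise ValueError("success_rate must be in (0, 1]")
    return 1 / success_rate


def expected_total_attempts(success_rate: float, steps: int) -> float:
    """Expected model calls for a workflow of `steps` sequential steps."""
    return steps * expected_attempts(success_rate)


def clean_run_probability(success_rate: float, steps: int) -> float:
    """Chance that every step succeeds on its first try (no retries at all)."""
    return success_rate ** steps


if __name__ == "__main__":
    STEPS = 4  # hypothetical workflow length
    for label, rate in (("five attempts per step", 0.2),
                        ("two attempts per step", 0.5)):
        calls = expected_total_attempts(rate, STEPS)
        clean = clean_run_probability(rate, STEPS)
        print(f"{label}: ~{calls:.0f} calls, {clean:.2%} chance of a clean run")
```

Under these assumptions a four-step workflow needs about 20 model calls at five attempts per step versus 8 at two, and the chance of a run with no retries at all rises from well under 1% to about 6%. The gap widens with every step added to the workflow.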

In steady state, the economics are surprisingly favorable. Running daily briefings, maintaining task lists, and performing light research costs approximately $0.50 to $0.80 per day. Over the last seven days, with relatively light usage, total API spend was $21. At that level, the system is not just viable. It is compelling. It functions as a persistent layer of organization, reducing cognitive load in a way that feels disproportionate to its cost.

The dynamic changes the moment the system is pushed. During debugging, iteration, or extended workflows, costs increase rapidly. A single day of development cost me $20 to $30, driven by increased request volume, expanded context windows, and repeated execution cycles. Earlier in the process, these spikes were frequent. They were not the result of inefficiency. They were the cost of making the system work. That experience changes behavior. You become aware, very quickly, that each additional iteration carries a marginal cost. Exploration becomes something that must be justified rather than assumed. The system remains capable, but the willingness to use it at full capacity becomes conditional.
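The effect of a few heavy days on a monthly bill can be made concrete with a small calculation. The sketch below uses the per-day ranges quoted above ($0.50 to $0.80 in steady state, $20 to $30 on a development day); the split of days within a month and the function names are hypothetical, chosen only for illustration.

```python
# Per-day cost ranges in USD, taken from the figures quoted in the text.
STEADY_RATE = (0.50, 0.80)   # (low, high) for a steady-state day
DEV_RATE = (20.0, 30.0)      # (low, high) for a development/debugging day


def monthly_cost(steady_days: int, dev_days: int,
                 steady_rate: float, dev_rate: float) -> float:
    """Total spend for a month made up of steady days and development days."""
    return steady_days * steady_rate + dev_days * dev_rate


def monthly_range(steady_days: int, dev_days: int) -> tuple[float, float]:
    """Low and high estimates using the low and high ends of both ranges."""
    low = monthly_cost(steady_days, dev_days, STEADY_RATE[0], DEV_RATE[0])
    high = monthly_cost(steady_days, dev_days, STEADY_RATE[1], DEV_RATE[1])
    return low, high


if __name__ == "__main__":
    # Hypothetical 30-day months: one purely steady, one with four dev days.
    for steady, dev in ((30, 0), (26, 4)):
        low, high = monthly_range(steady, dev)
        print(f"{steady} steady + {dev} dev days: ${low:.2f} to ${high:.2f}")
```

On these numbers a 30-day month with no development days comes to about $15 to $24, while four development days take it to roughly $93 to $141. The heavy days, not the routine ones, set the bill.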

This constraint becomes more visible when compared to development workflows. Over the same period, I have built multiple applications, a game, and two websites using GPT-5.3 (Codex) and GPT-5.4 through Cursor. In that environment, the role shifts. I am not writing code in the traditional sense. I am coordinating between models, translating intent into structured prompts, and guiding execution. The process works remarkably well. It is not difficult to see that an agentic system could assume much of that coordination. I have not enabled that, not because it is infeasible, but because the cost profile changes meaningfully when these systems are allowed to operate continuously at that level. Sustained coding workflows through frontier models would plausibly push my monthly costs into the $1,000 to $2,000 range. For an individual, that is a real constraint. For a funded organization, it is not.

There is a second constraint that operates alongside cost, and it is more subtle. Trust in these systems is not binary. It is layered. I trust the system to summarize information, prioritize my day, and maintain situational awareness. I do not allow it to send communications or execute actions without review. Not because it fails frequently, but because when it fails, it can do so without immediate visibility. That is sufficient to limit autonomy in contexts where errors carry consequences.
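Layered trust of this kind can be expressed as a simple gate between an agent and the outside world. The sketch below is a generic illustration, not OpenClaw's API or the configuration described here: the `Risk`, `Action`, and `ReviewGate` names are hypothetical. Read-only work such as summarizing or prioritizing runs immediately, while anything with side effects waits for explicit approval.

```python
from dataclasses import dataclass, field
from enum import Enum


class Risk(Enum):
    READ_ONLY = "read_only"      # summarize, prioritize, look things up
    SIDE_EFFECT = "side_effect"  # send a message, change a calendar, etc.


@dataclass
class Action:
    name: str
    risk: Risk
    payload: str = ""


@dataclass
class ReviewGate:
    """Holds side-effecting actions until a human approves them."""
    pending: list[Action] = field(default_factory=list)
    executed: list[Action] = field(default_factory=list)

    def submit(self, action: Action) -> bool:
        """Run read-only actions at once; queue everything else for review."""
        if action.risk is Risk.READ_ONLY:
            self.executed.append(action)
            return True
        self.pending.append(action)
        return False

    def approve(self, name: str) -> bool:
        """Release one pending action by name; return False if not found."""
        for action in self.pending:
            if action.name == name:
                self.pending.remove(action)
                self.executed.append(action)
                return True
        return False


if __name__ == "__main__":
    gate = ReviewGate()
    gate.submit(Action("morning_briefing", Risk.READ_ONLY))
    gate.submit(Action("send_reply", Risk.SIDE_EFFECT, "Draft reply text"))
    print([a.name for a in gate.pending])   # drafts waiting for review
    gate.approve("send_reply")
    print([a.name for a in gate.executed])  # everything that actually ran
```

The design choice is that the default is to hold: an action runs unattended only if it has been classified as read-only, so a silent failure in a draft stays in the review queue rather than reaching a recipient.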

What is often missed in current discussions is that these systems are not speculative. They are already functional. A single operator can now configure a system that ingests, interprets, and organizes information across multiple domains, producing structured outputs on a recurring basis. The barrier is not access to capability. It is the ability to configure, sustain, and afford its operation. The setup process is non-trivial. It requires familiarity with models, routing, orchestration, and memory design. Once established, however, the system becomes straightforward to use. The constraint shifts from technical difficulty to practical judgment.

If there is a pattern that emerges from working with these systems, it is that progress is driven by control rather than scale. Building one agent and allowing it to run reliably produces more value than attempting to construct a fully integrated system from the outset. Complexity can be layered in once stability is achieved. Attempting to accelerate this process tends to produce cost without output. The principle is not new. The context is.

What remains unresolved is the relationship between capability and cost. If open-source models close the gap in reliability, or if inference costs decline materially, the economics of these systems will shift. At that point, the constraint described here weakens. Until then, the balance remains. The systems are capable. They are increasingly reliable. They are not yet economically invisible.

The most consequential shift is not what these systems might become. It is what they already enable. The distance between intent and execution has narrowed to the point where a single individual can design and operate systems that previously required coordination across multiple roles. That does not eliminate the need for judgment. It changes where judgment is applied.

The bottleneck has moved.

It is no longer whether something can be built.

It is whether it is worth running.