- Our Current Approach: History and Network-level Pre-Defined Learning Strategies
- A Potential Future: Action Reasoning Gyms
- Conclusion: Agents are just software that learns how to improve themselves
At Moonsong Labs, we’ve been deeply immersed in the possibilities of agent autonomy and collaborative action learning, key drivers of a thesis we are calling “The Agent Economy.” This new paradigm envisions AI agents as autonomous economic participants, transforming digital services by replacing rigid, human-centric workflows with flexible, self-improving systems.
As part of our research and development for a future Venture Studio company, we have been incubating a platform designed to embody the primitives for this shift, merging the latest advancements in code agents (agents that use executable code to interact with their environment) with blockchains’ ability to bootstrap and sustain open collaborative networks.
Our view is that when these code agents are networked via a shared protocol, they can exchange meaningful signals and observations that help them improve over time.
We think an open protocol that solves this can create compounding effects similar to those seen in open source, and that it should prevail over proprietary and siloed solutions in the long run.
Networked agents will learn not just from their own experience but from a broader pool of peer actions, driving a convergence toward more effective and robust code and, in turn, improved task performance. This network effect could mirror the unstoppable rise of open source, lifting the collective capability of all agents involved.
In this post, we’ll outline some of the thinking behind our current learning approach and explore a thought experiment on the potential for reinforcement learning in these settings.
Our Current Approach: History and Network-level Pre-Defined Learning Strategies
Our current effort rests on a core hypothesis, namely that agents can deliver substantial performance gains today by modeling their interactions with the environment as executable code. We use Python as a universal language for agents to plan and execute actions, making their behavior both learnable and generalizable. By coupling this with a robust memory system that captures individual histories, plus a rich action registry for network-wide signals, agents can optimize their strategies over time.
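To make this more concrete, here is a rough sketch, with hypothetical names rather than our actual protocol, of what an action modeled as executable Python might look like once it is recorded with the signals a memory system and a shared registry could track:

```python
from dataclasses import dataclass, field
import time

@dataclass
class ActionRecord:
    """One executable action plus the signals a memory system might keep."""
    goal: str          # what the agent was trying to achieve
    code: str          # the Python the agent wrote and ran
    success: bool      # did it meet the goal?
    cost: float        # e.g. runtime, LLM calls, or fees
    created_at: float = field(default_factory=time.time)

class ActionRegistry:
    """Toy stand-in for a shared, network-wide registry of actions."""
    def __init__(self) -> None:
        self._records: list[ActionRecord] = []

    def publish(self, record: ActionRecord) -> None:
        self._records.append(record)

    def lookup(self, goal: str) -> list[ActionRecord]:
        # Naive retrieval by exact goal; a real system would use semantic search.
        return [r for r in self._records if r.goal == goal]

# An agent records a successful code action so peers can reuse it later.
registry = ActionRegistry()
registry.publish(ActionRecord(
    goal="reroute shipment around closed highway",
    code="route = plan_route(origin, destination, avoid=['I-95'])",
    success=True,
    cost=0.02,
))
```

In practice the registry would be backed by the network protocol rather than an in-memory list, but the shape of the data is the point: the action, its code, and its outcome signals travel together.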
This learning approach, which refines actions rather than intrinsic model weights, offers a key advantage: it thrives in domains where developers can’t easily predefine every possible action trajectory. We call that category of problems “goal-oriented”: scenarios in which it is much easier and cheaper to clearly articulate the challenge and the definition of success (the goal) than it is to define the solution or workflow.
Consider a logistics network adapting to real-time disruptions—traffic jams or weather delays—where hardcoding every contingency is impractical or even intractable. An agent can discover efficient rerouting strategies by testing code-based actions, learning from outcomes, and converging on a more efficient solution over time. Similarly, in drug discovery, agents could explore molecular combinations via code, adapting based on simulation feedback where exhaustive manual scripting falls short. Finally, consider a research workflow, such as those in Deep Research or similar tools. The type of research—whether market analysis, science, or policy—shapes the path to an insightful report. An agent can adapt by testing code-based strategies and service integrations, learning from the results, and refining its approach to fit each category.
In all cases above, the process of automated discovery—left to the agent—yields robust action sequences and thus better outcomes than the pre-determined and static approaches that are prevalent today.
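To illustrate the goal-oriented framing with the logistics example, here is a simplified, self-contained sketch. It is a toy simulation, not our production system: the LLM’s proposal step is faked with random sampling, and the environment is a two-line travel-time model. We only specify the goal; the agent discovers a route by proposing candidate actions, executing them, and keeping the best outcome:

```python
import random

def simulate_route(route: list, disruptions: set) -> float:
    """Toy environment: travel time in hours, penalizing disrupted roads."""
    return sum(2.0 if road in disruptions else 1.0 for road in route)

def propose_routes(n: int) -> list:
    """Stand-in for an LLM proposing candidate code actions (here, routes)."""
    roads = ["A1", "A2", "B7", "C3", "D9"]
    return [random.sample(roads, k=3) for _ in range(n)]

def discover_route(goal_max_hours: float, disruptions: set):
    """Goal-oriented search: we state the goal, not the workflow."""
    best_route, best_time = None, float("inf")
    for route in propose_routes(n=20):
        hours = simulate_route(route, disruptions)   # execute the candidate action
        if hours < best_time:                        # learn from the outcome
            best_route, best_time = route, hours
    return best_route if best_time <= goal_max_hours else None

print(discover_route(goal_max_hours=4.0, disruptions={"B7"}))
```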
The power lies in combining strong objective decomposition with code quality, deep memory signals, and network scale. A single agent might recall a past success (e.g., rerouting around a storm) and refine it, while the network registry shares peer solutions (e.g., a shortcut another agent found). A feedback loop drives convergence toward more effective code and superior task performance over time. Our early experiments suggest this outpaces traditional, human-defined workflows, especially in cases where pre-planning is difficult and adaptability to changing conditions is needed. Before even exploring distributed or federated learning, this code-driven, networked approach lays a strong foundation for the Agent Economy, proving that a shared action representation can already unlock significant agent efficiency.
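A rough sketch of how that selection step could work, assuming records that carry an aggregated success rate and cost (slightly richer signals than the single success flag above, and illustrative weightings only), might look like this:

```python
def score(record) -> float:
    """Weight a past action by its observed success rate and cost."""
    return record.success_rate - 0.1 * record.cost

def select_action(goal: str, own_memory: list, network_registry: list):
    """Prefer the best-scoring action across the agent's own history and peers'."""
    candidates = [r for r in own_memory + network_registry if r.goal == goal]
    if not candidates:
        return None  # nothing relevant yet: fall back to fresh exploration
    return max(candidates, key=score)
```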
A Potential Future: Action Reasoning Gyms
Having established how our networked agents learn over time, using code, memory, and shared insights to optimize actions, let’s take it a step further with reinforcement learning (RL). RL is a simple yet powerful idea: agents figure out the best actions by trying things out, learning from both successes and mistakes to maximize rewards.
This learning paradigm could naturally extend our network approach, evolving it into a decentralized RL “gym” where agents actively adapt to real-world feedback, moving beyond pre-set learning rules. (Note: A “gym” in RL refers to a simulated or real-world environment where agents hone their skills by tackling tasks, getting feedback, and earning rewards.)
The key difference here is that we would move away from predefined learning rules and let the environment and rewards alone guide agents toward smarter, more adaptable behaviors.
In the logistics example, an agent could proactively explore new routes to earn efficiency rewards, updating its own knowledge so it converges on better paths under similar conditions next time. This shift, from us defining how agents learn to letting the environment and rewards drive their behavior, feels like a natural next step.
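As a purely illustrative thought experiment, here is a minimal, library-free sketch of that loop: a toy routing environment hands out efficiency rewards, and a simple tabular learner (a one-step, bandit-style value update) converges on the better route for each condition:

```python
import random
from collections import defaultdict

ROUTES = ["highway", "backroads", "toll_road"]

def efficiency_reward(route: str, storm: bool) -> float:
    """Toy environment: faster routes earn higher rewards; storms flip the ranking."""
    base = {"highway": 1.0, "backroads": 0.6, "toll_road": 0.8}[route]
    return 0.2 if (route == "highway" and storm) else base

# Tabular value estimates: Q[condition][route] approximates the expected reward.
Q = defaultdict(lambda: {r: 0.0 for r in ROUTES})
alpha, epsilon = 0.1, 0.2  # learning rate and exploration rate

for episode in range(2000):
    storm = random.random() < 0.5                 # observed condition (the "state")
    state = "storm" if storm else "clear"
    if random.random() < epsilon:                 # explore a new route...
        route = random.choice(ROUTES)
    else:                                         # ...or exploit the best known one
        route = max(Q[state], key=Q[state].get)
    reward = efficiency_reward(route, storm)      # environment feedback
    Q[state][route] += alpha * (reward - Q[state][route])

print({s: max(q, key=q.get) for s, q in Q.items()})  # learned route per condition
```

No learning strategy is hardcoded here: the agent converges on “take the highway when it’s clear, avoid it in a storm” purely from rewards, which is the shift described above.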
The prospect of these action networks evolving into gym environments for reinforcement learning is captivating. It opens the door to a future where emergent intelligence, fueled by distributed networks, could be more equitable and widely accessible.
With groundbreaking advancements like DeepSeek R1 and the snowball effect driving the research community forward, we’re optimistic that meaningful breakthroughs using RL techniques at scale are within reach.
While these developments unfold, we’ve put together some ideas, more as a thought experiment, exploring how such a network might take shape.
Here are our observations:
- Federated RL for Embedded Learning: Today, we achieve network learning with predefined learning strategies through processes like generalization and retrieval of actions, weighting them based on signals like success rates or efficiency. Imagine replacing this with federated RL: agents share experiences (states, actions, rewards) via a registry, and these are aggregated into policy updates, i.e., gradient adjustments (tweaks to how agents decide what’s best) distributed across the network. Rather than relying on predefined learning strategies alone, successful trajectories are treated as “fitter” and rewarded so that similar trajectories are promoted in the future. For example, a breakthrough discovered by one agent, such as optimizing a task pipeline, could enhance all agents, creating a self-improving ecosystem, since they are all part of the “same collective system”. (A minimal aggregation sketch follows after this list.)
- Self-play as an Infinite Learning Engine: Code is a goldmine for RL, since every action generates new code, reasoning patterns, and signals. With self-play, agents could challenge their past successes (say, trimming an 8-step task to 6), guided by self-referential rewards (e.g., fewer LLM calls, lower costs). Unlike predefined learning strategies, these improvements would emerge through reward-driven discovery, not be prescribed by a protocol.
- Dual Rewards to Balance Self and Network: To sustain this RL “gym,” we need a reward system that mirrors the dynamics of open source software: companies share code not just out of altruism but because they gain from a stronger, more robust ecosystem of contributors to their codebase. In our network, agents pursuing their company’s goals benefit similarly by sharing code and experiential feedback. When others adopt and refine their strategies, the original agent gets a faster, cheaper, and more secure version back, amplified in a system where every agent acts as a developer, iterating at an even larger scale and speed than traditional OSS. A dual reward system, sketched after this list, could capture this dynamic:
- Self-relative rewards push agents to beat their own past performance, like programmers refining their own code to cut costs and boost efficiency.
- Peer-relative rewards encourage sharing: agents craft strategies tailored to their needs, enticing others to use and enhance them—selfishly improving their own outcomes while naturally benefiting all.
- Critic Agent for Long Trajectories: Complex tasks, like multi-step workflows, pose a challenge: which action contributed to the task’s success or failure? Building on our existing task decomposition used for predefined learning strategies, sequential/temporal RL could assign relative rewards to each step, using a “Critic Agent” to evaluate the broader context. This critic clarifies relative credit across long trajectories, distinguishing a good step in a flawed plan from a misstep in a winning strategy, and accelerates the learning the network can draw from any given scenario. (See the credit-assignment sketch below.)
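For the federated RL idea above, a minimal aggregation sketch, assuming each agent can already estimate a policy gradient from its own (state, action, reward) experience, might look like this; the weighting signal is an assumption, shown only to give the idea a shape:

```python
import numpy as np

def federated_policy_update(local_gradients: list, quality_weights: list) -> np.ndarray:
    """Aggregate per-agent gradient estimates into one shared policy update.

    Plain federated averaging: each agent's contribution is weighted by a
    quality signal such as its recent success rate.
    """
    w = np.asarray(quality_weights, dtype=float)
    w = w / w.sum()                                # normalize the weights
    stacked = np.stack(local_gradients)            # shape: (n_agents, n_params)
    return (w[:, None] * stacked).sum(axis=0)      # weighted average update

# Example: three agents report gradient estimates; more reliable agents count more.
grads = [np.array([0.20, -0.10]), np.array([0.05, 0.00]), np.array([0.40, -0.30])]
shared_update = federated_policy_update(grads, quality_weights=[0.9, 0.5, 0.7])
```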
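The dual reward system could be sketched as two simple terms. All names, fields, and weightings here are illustrative assumptions rather than a specification:

```python
from dataclasses import dataclass

@dataclass
class Outcome:
    """Result of one task attempt (hypothetical record)."""
    agent_id: str
    cost: float      # e.g. LLM calls, latency, or fees spent on the task
    success: bool

def self_relative_reward(current: Outcome, best_past_cost: float) -> float:
    """Reward an agent for beating its own best past performance on this task."""
    if not current.success:
        return -1.0
    return max(0.0, (best_past_cost - current.cost) / best_past_cost)

def peer_relative_reward(author_id: str, adoptions: list) -> float:
    """Reward the strategy's original author when peers adopt it and succeed."""
    successful = [o for o in adoptions if o.success and o.agent_id != author_id]
    return 0.1 * len(successful)   # illustrative weighting only

# Example: agent "a1" beats its previous cost and two peers reuse its strategy.
attempt = Outcome(agent_id="a1", cost=0.8, success=True)
peers = [Outcome("a2", 0.7, True), Outcome("a3", 0.9, True)]
reward = self_relative_reward(attempt, best_past_cost=1.0) + peer_relative_reward("a1", peers)
```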
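And for the Critic Agent, the credit-assignment step could reduce to distributing a trajectory-level reward across steps according to the critic’s judgments; how those judgments are produced (e.g., an LLM scoring each step in context) is left open here:

```python
def assign_step_credit(steps: list, final_reward: float, critic_scores: list) -> list:
    """Distribute a trajectory-level reward across steps using critic scores.

    `critic_scores` are relative contributions in [0, 1] from a Critic Agent;
    they are normalized so the per-step rewards sum to the final reward.
    """
    total = sum(critic_scores) or 1.0
    return [final_reward * s / total for s in critic_scores]

# Example: a 4-step workflow that succeeded (reward = 1.0), where the critic
# judged step 3 ("choose route") to be the decisive one.
steps = ["fetch data", "clean data", "choose route", "execute plan"]
print(assign_step_credit(steps, 1.0, [0.1, 0.1, 0.6, 0.2]))
```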
Conclusion: Agents are just software that learns how to improve themselves
We’re imagining a network where code, as a universal, learnable language, lets agents adapt and refine actions dynamically.
We covered how our networked approach—merging shared memory, novel learning strategies, and blockchain’s open protocols—could pave the way for a decentralized reinforcement learning gym, where agents evolve together, tackling real-world challenges with insights from both individual and peer efforts.
Web3’s incentive-aligned networks power this, creating sustainable, collaborative marketplaces that rival closed-source systems. Agents share not just altruistically, but because they gain from collective progress—a system where selfish optimization benefits all.
We see parallels with open-source software in the era of code agents: if agents are going to write much of tomorrow’s software, we should build the incentives to make it open and lift everyone up.
We’re starting small, but the trajectory is bold—software development as a living, agent-driven process. For the latest developments and to join the conversation, follow us on X to stay tuned as we build this future.