
Smart observability: building one of the industry’s most efficient RTB infrastructures

by Federico Soave, Head of Infrastructure, Onetag

Real-time bidding (RTB) is one of the most demanding environments in technology. Millions of auctions per second must be served under strict latency requirements measured in milliseconds.

At this scale, performance and efficiency depend on more than infrastructure choices alone. Over time, we’ve found that sustained operational excellence comes from the ability to continuously observe how systems behave, understand where friction emerges, and improve them with confidence.

When issues occur in RTB, they can appear quickly as latency spikes, missed bidding opportunities, or inefficiencies that impact economics. The ability to identify root causes early is essential for maintaining stability and delivering consistent performance. This is why observability sits at the centre of how we operate. Or, as we like to say:

“What you can’t measure, you can’t improve.”

For us, observability is how we understand and manage a complex ecosystem in real time. It’s not just a matter of collecting logs or tracking a handful of high-level metrics. If something matters, we measure it. And when something changes, we want the data trail that explains what happened, where, and why.

The three pillars of observability

Like many teams operating distributed systems, we rely on three core pillars:

• Logs for detailed debugging and shared visibility

• Metrics for continuous measurement and improvement

• Traces for understanding the behaviour of individual requests

Centralised logs: the foundation

We’ve invested in centralising logs across Onetag’s services, making them accessible and searchable for our engineering teams.

We also enrich logs with structured metadata, which enables aggregation and analysis across services. This makes it easier to identify recurring patterns and correlate anomalies with system-level events.
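As a sketch of what this enrichment can look like in practice (the field names `service` and `request_id` are illustrative, not Onetag's actual schema), a custom formatter can emit each record as one JSON line so that logs aggregate cleanly across services:

```python
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON line with structured metadata."""

    def format(self, record):
        payload = {
            "level": record.levelname,
            "message": record.getMessage(),
            # Per-record structured metadata; field names are
            # illustrative, not a real Onetag schema.
            "service": getattr(record, "service", None),
            "request_id": getattr(record, "request_id", None),
        }
        return json.dumps(payload)


logger = logging.getLogger("bidder")
handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger.addHandler(handler)
logger.setLevel(logging.INFO)

# Metadata is attached at the call site via `extra`.
logger.info("auction timed out", extra={"service": "bidder", "request_id": "abc123"})
```

Because every record carries the same machine-readable fields, a central log store can aggregate by service or correlate a single request id across components.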

Traces: understanding behaviour, one request at a time

At RTB scale, tracing every request is both unnecessary and prohibitively expensive.

Instead, we use smart sampling and data modelling to trace a representative subset, helping us understand where time is spent and where behaviour diverges without introducing excessive overhead.
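One common way to implement such sampling (a generic sketch, not necessarily Onetag's exact approach) is deterministic head sampling: hash the request id into a bucket, so every service makes the same keep-or-drop decision and sampled traces stay complete end to end:

```python
import hashlib


def should_trace(request_id: str, sample_rate: float = 0.01) -> bool:
    """Map the request id to a uniform value in [0, 1) via a hash,
    and trace the request only if it falls below the sample rate.
    The same id always yields the same decision, so a sampled
    request is traced consistently across all services."""
    digest = hashlib.sha256(request_id.encode()).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return bucket < sample_rate


# With a 1% rate, roughly 1 in 100 requests is traced.
sampled = sum(should_trace(f"req-{i}", 0.01) for i in range(10_000))
```

The trade-off is that the sampling decision is made before the outcome is known; systems that need to keep every anomalous trace typically layer tail-based sampling on top.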

Forensic analysis

When a bid request behaves unexpectedly, traces allow us to reconstruct the sequence of events: which services were involved, how long each step took, and where latency accumulated.
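In simplified form, that reconstruction amounts to walking the spans of a single trace and finding where latency accumulated (real spans carry far richer context; the services and timings below are invented for illustration):

```python
from dataclasses import dataclass


@dataclass
class Span:
    """One step of a traced bid request: which service ran, and when."""
    service: str
    start_ms: float
    end_ms: float

    @property
    def duration_ms(self) -> float:
        return self.end_ms - self.start_ms


def slowest_step(spans: list[Span]) -> Span:
    """Return the span where the most latency accumulated."""
    return max(spans, key=lambda s: s.duration_ms)


# A made-up trace for one bid request.
trace = [
    Span("gateway", 0.0, 2.0),
    Span("bidder", 2.0, 9.0),
    Span("cache", 9.0, 10.0),
]
```

Sorting or ranking spans this way answers the forensic questions directly: which services were involved, how long each step took, and where the time went.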

Aggregated insights

When more complex issues emerge, collections of sampled traces can reveal patterns that are difficult to detect through logs alone.

Metrics: our most powerful operational tool

We aim to generate measurable signals across every layer of our stack, because each layer influences the next.

• At the request layer, we track responsiveness and reliability to meet strict timing requirements and avoid inefficiencies upstream.

• At the application layer, we monitor workload patterns, processing backlogs, and caching behaviour.

• Across data flows, we measure throughput and handling efficiency to optimise speed, aggregation quality, and transfer overhead.

• Within the runtime environment, we observe memory usage, scheduling activity, and pause events to detect early indicators of reduced performance.

• At the system layer, we analyse resource distribution and OS-level behaviour to understand process interactions.

• At the hardware layer, we monitor compute, storage, and network pressure to identify the most impactful optimisation points.
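A toy version of this layered instrumentation, assuming a simple in-process registry (production systems would use something like Prometheus; the metric names and the timeout budget below are illustrative, not Onetag's):

```python
from collections import defaultdict


class Metrics:
    """Tiny in-process metrics sketch: counters plus raw latency
    samples per metric name. Illustrative only; a real deployment
    would export to a time-series backend."""

    def __init__(self):
        self.counters = defaultdict(int)
        self.samples = defaultdict(list)

    def inc(self, name, by=1):
        self.counters[name] += by

    def observe(self, name, value_ms):
        self.samples[name].append(value_ms)

    def p99(self, name):
        """Nearest-rank 99th percentile of the recorded samples."""
        ordered = sorted(self.samples[name])
        idx = min(len(ordered) - 1, int(0.99 * len(ordered)))
        return ordered[idx]


metrics = Metrics()
# Request layer: track responsiveness against a (made-up) bid timeout budget.
for latency in (4.0, 5.5, 3.2, 48.0, 4.4):
    metrics.observe("bid_response_ms", latency)
    if latency > 40.0:  # illustrative timeout budget, not a real figure
        metrics.inc("bid_timeouts")
```

The same pattern repeats at every layer: record a signal where the work happens, then let percentiles and counters reveal where the budget is being spent.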

Metrics support many of our operational decisions — including scaling, capacity planning, tuning, performance optimisation, troubleshooting, and long-term architecture work.

Closing the loop on smart observability

Observability also supports automation across our platform.

Metrics feed into our:

• Traffic shaping logic

• Scaling policies

• Load shedding behaviour

• Queue backpressure mechanisms

• Capacity planning

This allows us to be more proactive, preventing certain issues before they become incidents.
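As one concrete example of metrics driving automation, load shedding can ramp up probabilistically as a queue-depth metric climbs between a soft and a hard limit (a generic sketch; the thresholds are invented for illustration):

```python
import random


def should_shed(queue_depth: int, soft_limit: int, hard_limit: int) -> bool:
    """Probabilistic load shedding driven by a live queue-depth metric.
    Below the soft limit nothing is shed; at or above the hard limit
    everything is; in between, the drop probability ramps linearly.
    Thresholds are illustrative, not production values."""
    if queue_depth <= soft_limit:
        return False
    if queue_depth >= hard_limit:
        return True
    drop_probability = (queue_depth - soft_limit) / (hard_limit - soft_limit)
    return random.random() < drop_probability
```

Ramping gradually instead of flipping a hard switch keeps the system degrading smoothly under pressure rather than oscillating between accepting everything and rejecting everything.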

Over time, a deeper understanding of system behaviour also helps eliminate waste at multiple layers, reducing cost and lowering operational burden for technical teams.

Most importantly, greater efficiency supports better performance for our partners — and a stronger experience for end users.

Final thoughts

Operational excellence is rarely the result of a single breakthrough. It comes from consistent discipline, shared visibility, and a willingness to learn from real data.

Smart observability helps us improve continuously, respond quickly, automate intelligently, and keep our RTB infrastructure performant at massive scale.

If you’re building or operating distributed systems, investing in observability early is one of the most effective steps you can take.

Everything else improves from there.

Originally Published on: LinkedIn