by Martino Salbego, Head of Big Data, Onetag
Staying ahead of the technological curve is no longer just an option—it is a strategic imperative. We all know that technical stagnation leads to debt, a cost that eventually far outweighs the price of a planned upgrade. However, the real challenge isn’t simply adopting “the new.” The true test of engineering is integrating innovation into existing, mission-critical platforms without disrupting operations. This balance is the core of our strategy. In this article, I want to take you behind the scenes of how we handle complex refactoring on established systems.
We will dive into a high-impact use case that we touched upon at the AWS Summit in Milan: upgrading our Elastic MapReduce (EMR) clusters to Graviton processors. Through this case study, we will demonstrate our approach to planning and executing a delicate migration—proving that you can achieve a net increase in operational efficiency and significant cost reduction, without ever sacrificing system stability.
Why did we choose the new Graviton processors? The numbers speak for themselves. As industry benchmarks highlight, this new generation delivers up to a 30% increase in compute performance alongside markedly better energy efficiency. The advantage extends beyond the CPU, however: the move to DDR5 memory and the increased memory bandwidth are true game-changers for intensive workloads, drastically accelerating shuffle operations and data I/O and removing long-standing bottlenecks in processing pipelines.
Migrating EMR nodes to Graviton4 instances required an update to the EMR release, which in turn meant a significant version jump for Spark and Flink. This demanded a deep evaluation of new features and, crucially, the management of breaking changes within existing pipelines. To mitigate these risks, we leveraged our existing MSK multi-VPC architecture, which allows Development consumers to read directly from Production MSK topics over private multi-VPC connectivity governed by dedicated IAM policies. This strategy was pivotal: it allowed us to execute authentic load tests by replaying real production traffic and to perform precise discrepancy checks. All of this took place in an isolated environment, without impacting producers (zero double-writing) and without duplicating infrastructure costs.
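To illustrate the pattern, here is a minimal PySpark sketch of a Sandbox consumer reading a Production MSK topic over multi-VPC connectivity with IAM authentication. The bootstrap string, topic name, and S3 paths are placeholders, and the job assumes the aws-msk-iam-auth library is available on the cluster classpath.

```python
# Minimal sketch: a Sandbox consumer reading a Production MSK topic over
# multi-VPC private connectivity with IAM authentication.
# Bootstrap string, topic and bucket names below are hypothetical placeholders.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("graviton-load-test").getOrCreate()

# The multi-VPC bootstrap string (from the MSK console) resolves inside the
# Sandbox VPC; access is granted by a dedicated IAM policy, so producers are
# untouched and no double-writing is needed.
events = (
    spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "<multi-vpc-bootstrap-brokers>")
    .option("subscribe", "prod-events-topic")          # hypothetical topic name
    .option("startingOffsets", "latest")
    .option("kafka.security.protocol", "SASL_SSL")
    .option("kafka.sasl.mechanism", "AWS_MSK_IAM")
    .option("kafka.sasl.jaas.config",
            "software.amazon.msk.auth.iam.IAMLoginModule required;")
    .option("kafka.sasl.client.callback.handler.class",
            "software.amazon.msk.auth.iam.IAMClientCallbackHandler")
    .load()
)

# Downstream processing runs as in Production but writes only to Sandbox sinks,
# so throughput and results can be compared without touching live outputs.
query = (
    events.selectExpr("CAST(value AS STRING) AS payload")
    .writeStream
    .format("parquet")
    .option("path", "s3://sandbox-bucket/graviton-load-test/output/")
    .option("checkpointLocation", "s3://sandbox-bucket/graviton-load-test/checkpoints/")
    .start()
)
```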
Post-Migration: Real-World Findings
Leveraging this authentic load testing capability, we were able to quantify the performance gains of the Graviton4 instances. While the reference AWS article cites improvements of up to 30%, our real-world tests demonstrated a solid 20% gain for our use case. As shown in the graph below, the processing rate saw a clear uplift, rising from 1,120,000 to 1,350,000 records/s, an increase of roughly 20%. This confirmed that the architectural upgrade delivered superior throughput.
After replicating the infrastructure and changes in the Sandbox environment, we executed a comprehensive suite of discrepancy checks to validate the behavior of the new Spark and Flink versions. Validation was conducted on multiple levels: we verified schema consistency to catch any structural format alterations, ran targeted record-level checks to guarantee the correctness of calculated values, and compared checksums to confirm the binary integrity of processed records. This rigorous approach enabled us to validate the effectiveness of the changes and rule out regressions before the production release.
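As an illustration, the three levels of validation can be expressed in a few lines of PySpark. This is a minimal sketch under assumptions: the S3 paths, the join key, and the metric column are hypothetical, and the real suite covers far more cases.

```python
# Minimal sketch of the three discrepancy checks between the baseline output
# (current Production versions) and the candidate output (new Spark on Graviton).
# Paths, key and metric columns are hypothetical.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("discrepancy-checks").getOrCreate()

baseline = spark.read.parquet("s3://sandbox-bucket/output-baseline/")
candidate = spark.read.parquet("s3://sandbox-bucket/output-candidate/")

# 1) Schema consistency: catch any structural format alteration.
assert baseline.schema == candidate.schema, "schema drift detected"

# 2) Targeted record-level checks: join on the business key and compare
#    calculated values (null-safe comparison).
key = ["event_id"]  # hypothetical key column
diffs = (
    baseline.alias("b").join(candidate.alias("c"), on=key, how="inner")
    .where(~F.col("b.revenue").eqNullSafe(F.col("c.revenue")))  # hypothetical metric
)
assert diffs.count() == 0, "value mismatch on joined records"

# 3) Checksum comparison: hash every row on both sides and diff the hash sets
#    to confirm record-level binary integrity.
def row_hashes(df):
    return df.select(
        F.sha2(F.concat_ws("|", *[F.col(c).cast("string") for c in df.columns]), 256)
         .alias("row_hash")
    )

mismatched = row_hashes(baseline).exceptAll(row_hashes(candidate))
assert mismatched.count() == 0, "checksum mismatch between baseline and candidate"
```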
The true value of using AWS Cloud Development Kit (CDK) became clear during the production release. Because we treated our infrastructure as software, the complexity of the migration was managed upfront in the code, not during the deployment window. This meant the actual update in production was surprisingly simple: a deterministic deployment of a configuration that had already been fully defined and proven. By removing manual steps and potential human error, CDK allowed us to promote the Graviton architecture to production with absolute confidence, knowing it was an exact replica of our validated environment.
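For readers less familiar with this workflow, the sketch below shows what such a cluster definition can look like with CDK in Python. Construct names, IAM roles, and the exact release label are illustrative assumptions, not our actual stack; the point is that the Graviton instance type and EMR release live in code as plain parameters.

```python
# Minimal CDK (Python) sketch: the EMR cluster as code, with instance type and
# release label passed in as parameters. Names, roles and labels are illustrative.
from aws_cdk import Stack, aws_emr as emr
from constructs import Construct

class AnalyticsEmrStack(Stack):
    def __init__(self, scope: Construct, construct_id: str, *,
                 instance_type: str, release_label: str, subnet_id: str, **kwargs) -> None:
        super().__init__(scope, construct_id, **kwargs)

        # The same template is deployed to Sandbox and Production; only the
        # context (account, subnet, sizing) changes, so the production rollout
        # replays a configuration already validated in Sandbox.
        emr.CfnCluster(
            self, "GravitonCluster",
            name="analytics-graviton",
            release_label=release_label,          # e.g. an EMR 7.x release with Graviton4 support
            applications=[
                emr.CfnCluster.ApplicationProperty(name="Spark"),
                emr.CfnCluster.ApplicationProperty(name="Flink"),
            ],
            job_flow_role="EMR_EC2_DefaultRole",  # placeholder roles
            service_role="EMR_DefaultRole",
            instances=emr.CfnCluster.JobFlowInstancesConfigProperty(
                ec2_subnet_id=subnet_id,
                master_instance_group=emr.CfnCluster.InstanceGroupConfigProperty(
                    instance_count=1, instance_type=instance_type,
                    market="ON_DEMAND", name="Primary"),
                core_instance_group=emr.CfnCluster.InstanceGroupConfigProperty(
                    instance_count=4, instance_type=instance_type,
                    market="ON_DEMAND", name="Core"),
            ),
        )
```

Promoting to production then amounts to running cdk deploy with the production context, shipping the exact configuration already exercised in Sandbox.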
Measured Change Can Lead to Massive Gains
In summary, even a seemingly minor change, done the right way (adjusting a few parameters or updating a single component), can unlock substantial optimization, as evidenced by the 20% gain we achieved. This is the true power of continuous research paired with rigorous validation: ensuring that performance enhancements never come at the expense of stability. Ultimately, it is these "small but significant" details that fuel a virtuous cycle of innovation. Maintaining this balance requires constant commitment, but the return in operational efficiency and infrastructure cost savings makes it worthwhile. This efficiency also extends beyond the bottom line: by drastically reducing processing time, we not only accelerate our models to the benefit of our partners, we also lower our carbon footprint through reduced energy consumption, proving that high performance and environmental sustainability can go hand in hand.
If you want to explore the technologies and architectural patterns mentioned in this article further, here are some useful resources:
AWS Summit Milan: https://youtu.be/kmPYdl0okh4?si=uaJ_n0W41_ZUtuWj
AWS Graviton4: https://aws.amazon.com/it/blogs/aws/aws-graviton4-based-amazon-ec2-r8g-instances-best-price-performance-in-amazon-ec2/
MSK Multi-VPC: https://docs.aws.amazon.com/msk/latest/developerguide/mvpc-getting-started.html
AWS CDK: https://docs.aws.amazon.com/cdk/v2/guide/home.html
Originally published on LinkedIn.