Learning to Love your Legacy Codebase

I was chatting with a friend a few days ago, and we were discussing the news that a certain high-profile legacy system was being rewritten from scratch on a very short schedule. The particulars of the system aren’t important, but as I mentioned to my friend, if I had a nickel for every time one of these rewrites of legacy codebases failed, I would have five or six nickels by now. My initial idea for this blog post was to talk about the pitfalls of rewriting from scratch, but countless books and blog posts have warned about such things, and I felt like I was just screaming into the void of the blogosphere along with everyone else. Instead, I want to talk about the dreaded legacy codebase, argue that it’s probably not as bad as you think it is, and show how you can learn from and improve on these battle-hardened codebases.

Legacy Code Gets a Bad Rap

When I first started working as a software developer, it was on a project supporting a legacy component that consisted of over 250,000 lines of C++ mixed with C. Everyone on the team hated everything about it, and at some point the team decided to replace it with something brand new. Since I was the noob, the entire legacy codebase was dropped in my lap and I was tasked with fixing bugs and adding features while the rest of the team went off to work on the shiny new thing. I had no other choice but to really get to know this codebase. After about a year, I effortlessly chased down bugs and implemented numerous new features, while the rest of the team was still arguing over the minutiae of the architecture of its replacement. By the time I left the team three years later, more folks were contributing to the legacy code and there still had been no substantial deliveries of the new system. I’m not sure if it was ever completed. During that time, I gained a lot of respect for the old system. A lot of aspects of it weren’t pretty, but at its core it was an elegant system that had accumulated a lot of scars from being used every day by millions of computers all over the world.

Not all legacy codebases are the same; some just can’t be saved, and some are built on obsolete tech that is no longer supported. But I personally think that a lot of old codebases are thrown away because people didn’t take the time to understand them.

Understanding Legacy Systems

So how do we even start to really get a deep understanding of these legacy systems? Well, the most valuable advice I ever received related to understanding a large old codebase was to break out the debugger and start stepping through the code. Even if your favorite debugger is the print statement, an actual debugger is going to be the best at this, and every major language has one. A software system is the combination of the logic of the code and the data that flows through it, and a debugger is the best way to see both in action. You can start pretty much anywhere. The main entrypoint will teach you a lot about what an application needs to set up and how it starts and runs. Starting with one of the most used API calls can teach you about how various components are connected. If there are tests, I find that they are always great places to dive into a new codebase. A test serves as the narrator and can describe how it expects a piece of code to behave. You can also ask your favorite generative AI tool to explain a particular piece of code to you, but be sure to always follow up by stepping through it in a debugger. There’s really no substitute for actually observing the code’s behavior.

Beyond the debugger, other artifacts can shed a lot of light on how a legacy codebase works, what constraints it was built under, and what problems it has encountered over its lifetime. Some additional areas that can really help paint a picture are:

The original requirements.
Tests, if they exist. Tests can sometimes help you piece together both the original requirements and the history of problems encountered.
Comments sprinkled throughout the code. Comments can provide insight into what was going on during development, and sometimes an unvarnished look at the mindset of the developers during that time period.
Readmes, design documents, and even marketing information.

There is a sense of discovery when you embark on this archaeological process, and every piece of information you find is helpful for gaining a better understanding of the codebase. Once you understand it, you’ll become more fearless about adding new features. You’ll also be less likely to break existing functionality every time you touch the code. Knowing where to integrate new features and the risk of breaking existing ones are probably the two biggest things that make working with legacy code such a challenge. It’s unreasonable to expect to understand 100% of an old application, but you don’t really need to reach that level of familiarity to be able to effectively contribute and maintain it.
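One way to make that understanding stick is to write down what you observed in the debugger as a small test that pins the current behavior in place. What follows is only a minimal sketch: `legacy_format_account_id`, its zero-padding quirk, and the recorded cases are all hypothetical stand-ins for whatever routine you happen to be studying. The expected values are copied from what the code actually does, not from a spec.

```python
def legacy_format_account_id(raw: str) -> str:
    """Hypothetical stand-in for an undocumented legacy routine."""
    cleaned = raw.strip().upper()
    # A quirk someone bolted on years ago: short ids are left-padded with zeros
    # *before* the dashes are removed, so the result can end up under 6 chars.
    if len(cleaned) < 6:
        cleaned = cleaned.rjust(6, "0")
    return cleaned.replace("-", "")


# Input/output pairs recorded by running the legacy code (for example while
# stepping through it in a debugger). They describe what the code does today.
RECORDED_BEHAVIOR = {
    " ab-12 ": "0AB12",
    "abc-defg": "ABCDEFG",
    "": "000000",
}


def find_regressions(func, recorded):
    """Return (input, expected, actual) for every case whose output changed."""
    regressions = []
    for argument, expected in recorded.items():
        actual = func(argument)
        if actual != expected:
            regressions.append((argument, expected, actual))
    return regressions


# An empty list means the routine still behaves the way it was observed to.
print(find_regressions(legacy_format_account_id, RECORDED_BEHAVIOR))
```

Run something like this before and after every change. If the list is no longer empty, you have either broken something or found a quirk worth asking about.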

Learning New Things

Whether it’s the most comically large bowl of spaghetti code or the most meticulously engineered piece of software in history, legacy code encapsulates a ton of wisdom. The longer a piece of code has been running, the more problems it has encountered and solved. It may be full of clever performance optimizations that were common at the time, but are now largely forgotten. Legacy code can teach you about algorithms you’ve never seen implemented before, and it exposes you to how other people approached and solved problems. The 250,000-line legacy codebase that was dropped in my lap was laden with every type of pointer arithmetic imaginable. It provided me with the opportunity to learn all about how pointers work, and a lot of times I learned about them the hard way. Nowadays I would never willingly do any kind of pointer arithmetic, but that exposure gave me a deeper understanding of pointers and memory management than I was ever able to get from a programming class.

Legacy code contains a lot of information about the domain and industry it operates in. You can learn a lot about how the application solves problems for its users, and how its domain may have changed through the application’s lifetime. The fixes and features that have been bolted onto it over the years are the scars of hard lessons learned. The accumulation of these fixes and changes is simultaneously why rewrites tend to fail and why the codebase can become unmanageable. Over time, urgent issues require small changes to be made to the code. They’re rarely documented well, and they accumulate to become a significant part of an application’s logic. These changes also tend to be written in a way that doesn’t really integrate well into the original architecture. The end result is a lot of functionality that was implemented in a hurry. It’s what makes us grow to hate working with the codebase, but it is also an important piece of how the application provides value to its users. With some effort, you can start to understand these pieces, and start working on ways to make the codebase just a little bit better.

Improve Instead of Rewrite

Now that we’re in love with our legacy code, it’s unthinkable to just throw it in the trash and start over, but let’s be honest: it’s still pretty gross. How do you go about making it better? The most successful approach that I’ve seen consists of purposefully replacing components with new pieces of functionality. By methodically replacing these pieces, you can focus on making sure that one small area retains the legacy functionality while including some new and improved features. This does wonders to avoid the common pitfalls of rewriting from scratch. It gives you the opportunity to include new tests, and it can be done in a way that is nearly transparent to the end user. This process is commonly referred to as the Ship of Theseus or Strangler Fig pattern. This approach can take a long time, but it’s unlikely to take longer than a rewrite. It also has the added benefit of keeping the existing application running while the new code is being written and integrated.
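To make the piece-by-piece idea a bit more concrete, here is a rough sketch of one common way to set it up: callers go through a thin facade, and each operation is routed either to the old code or to its replacement. Every name in it (`StranglerFacade`, `legacy_total`, `legacy_label`, `new_total`) is made up for illustration, and a real system would more likely do this routing at a module, service, or network boundary than in a dictionary.

```python
from typing import Callable, Dict


class StranglerFacade:
    """Routes each operation to its replacement if one exists, else to legacy code."""

    def __init__(self, legacy_handlers: Dict[str, Callable]):
        self._legacy = dict(legacy_handlers)
        self._replacements: Dict[str, Callable] = {}

    def replace(self, operation: str, new_handler: Callable) -> None:
        # Only operations the legacy system already provides can be swapped out.
        if operation not in self._legacy:
            raise KeyError(f"unknown operation: {operation}")
        self._replacements[operation] = new_handler

    def call(self, operation: str, *args):
        handler = self._replacements.get(operation, self._legacy[operation])
        return handler(*args)

    def migrated(self) -> int:
        """Number of operations that no longer run legacy code."""
        return len(self._replacements)


# Hypothetical legacy routines that callers depend on today.
def legacy_total(prices):
    total = 0
    for price in prices:
        total = total + price
    return total


def legacy_label(name):
    return "Item: " + name


# Hypothetical replacement for one routine; it must keep the legacy behavior.
def new_total(prices):
    return sum(prices)


facade = StranglerFacade({"total": legacy_total, "label": legacy_label})
facade.replace("total", new_total)

print(facade.call("total", [1, 2, 3]))  # served by the new code
print(facade.call("label", "bolt"))     # still served by the legacy code
```

The point of the facade is that callers never need to know which side answered, so the old application keeps running while pieces are swapped out one at a time.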

After all of the advocating that I just did for really understanding your legacy code, I have to admit that there are some occasions where all hope is lost, and you have no choice but to start over. This can be due to a major change in platforms, or because the technology is completely obsolete and no longer maintained. It’s always going to be more fun to start fresh, but it’s important that you can feel confident in saying that starting over is truly necessary. It’s also critical that you know what challenges lie ahead when you abandon your old system and build a new one. There’s a good chance that you’ve just created the next generation of legacy code that future developers will despise for years to come.

As software developers, we’re likely going to have to work with at least one legacy codebase, and probably several. They’re big and messy, and adding new functionality ranges from difficult to a complete disaster. If you put the effort into really digging in and understanding the codebase, you may find that it’s not really that terrible. There are ways to cleanly add new features, and there’s lots to learn about the application domain and the technologies used to build it. Before you decide to throw away that old code, take the time to get to know it a little better, and you may just come out of it with a newfound respect for the code and the people who originally built it.
