Tech Trained Monkey

Everyday Problem Solvers

Tag Archives: Good Practices

Analysing success

As the 1940’s air war in Europe intensified, the Allies faced a major problem. Their bombers would leave England by the hundreds, but too many of them didn’t return, brought down by extremely heavy enemy flak. The Allies desperately needed to beef up the armor on their planes to provide protection, but armoring an entire plane, or even an entire cockpit, involved far too much weight. How could they choose the few especially vulnerable places to be armored?

A couple of clever engineers solved this problem with a counter-intuitive analysis. After comprehensively logging the locations of flak damage inflicted around the fuselages, engines, and cockpits of planes returning from hundreds of bombing runs, they calculated Read more of this post

I’ve inherited 200K lines of spaghetti code—what now?

kmote asks: I am newly employed as the sole “SW Engineer” in a fairly small shop of scientists who have spent the last 10-20 years cobbling together a vast code base. (It was written in a virtually obsolete language: G2—think Pascal with graphics). The program itself is a physical model of a complex chemical processing plant; the team that wrote it has incredibly deep domain knowledge but little or no formal training in programming fundamentals. They’ve recently learned some hard lessons about the consequences of nonexistent configuration management. Their maintenance efforts are also greatly hampered by the vast accumulation of undocumented “sludge” in the code itself. I will spare you the “politics” of the situation (there’s always politics!), but suffice it to say, there is not a consensus of opinion about what is needed for the path ahead.  Read more of this post

Logging and software development maturity

According to Wikipedia maturity is:

a psychological term used to indicate how a person responds to the circumstances or environment in an appropriate manner. This response is generally learned rather than instinctive, and is not determined by one’s age.

It is a widely know and accepted fact that write logs is a very good thing, since it helps you find out what happened at a given moment of time. It is rule number one on the universal developer book, if it existed… But again and again I find a lot of programs that just don’t do it.

I’m used to clients calling and say that something is not correct with the product, but it makes me cry when I ask for the logs and he says that there are none and I want to kill the dev who didn’t write log. I always find the bastard. If by any chance it is not a intern, oh help me lord, some blood will be shed!

Logging came to me as a very instinctive thing. My firsts programs didn’t write any log, but of course, I was the only one using them. But as I got better and better i felt the need to better control my “world”. I was scared of the fact that a problem would occur and I wouldn’t be able to know about it. In order to really evolve and grow as developer one needs to make more complex and bigger programs and, more importantly, learn from ones own mistakes. Logging is perfect for the learning part, because you don’t need to bother the users with questions and other stuff (if you logged all the information you needed).

One might start to wonder what does it have to do with maturity. Just logging is not enough. Logging for the sake of it does not help. You need to have a consistent, useful, categorized and, most of all (I think), easy to track log message. It takes real-maturity to know how to log. You should log everything that happens but each event must be logged in a particular way. Logging user interactions that same way you log Null Reference Exceptions is not very helpful. Remember that logging provides information and information is gold!

A lot of people do write a bunch of logs. They actually log the entire thing. Take World of Warcraft for example: Everything you do in the entire game can be traced! They can trace every little critter that you killed. They know which weapon you used. They log everything! But, as i said before, just write messages is not enough. You have to write them in a understandable, easy-to-track way. If a developer in your team can’t trace back, for example, the users actions, such as buttons clicked, radio button choices, in-screen info, check-boxes selected, etc, your log is not good. It might help you find and solve a problem, but it could do a lot better. I recently learned a important lesson.

I must say that I have never had a problem with production environment software (Yes, my programs bugs, but always very soon after deploy, so I was always very close by to help). Thank god all programs that I wrote never gave me serious headaches. Until a couple weeks back. Deploy went fine and everything was peachy. So one beautiful morning the client called me and reported a issue. As always I asked for the logs, he sent me and then the my world felt apart. I could not trace what was going on. Everything was there, but i could not set a time-line to the events. Due to multiple front-ends and multiple web services servers it was very, VERY hard to track what happened to a user. I had a special hard time figuring out what messages on the web services logs belonged to whom. It was a nightmare. I wanted to cry. But I managed to do it and I matured a bit… no, I matured a byte… sorry for the pun…

So now I’m developing a new logging lib and a log-reader tool (feel free to “borrow”) that is planned to solve the multi user/thread/server scenarios. The current log lib we use today is perfect to developer and single-user/thread/server scenarios. You can easily “follow” up what and when it happened, what came after what and so on. But when multi-thread/multi-users/async-operations come-out to play things get ugly. Since the current lib only logs the time and severity of the event its very hard to track continuous actions of a user, or even the path of a particular thread.

The lib itself is not enough though. The real magic is the log reader. The reader creates a visual path of the massages. It literally creates a “fork” for multithread visually showing parallel operations and stuff! Its getting very cool when its ready ill post the source code… but now back to maturity…

I must disclose that I’m not the most seasoned developer out there… for christ sake I’m only 24, but this much I can say: Write easy to read and understand log messages. Your kids will be thankful! Ok maybe not, but you will when you need to know whats really going on!

Is “crashing” the worst thing it could happen?

Here’s an interesting thought question from Mike Stall: what’s worse than crashing?

Mike provides the following list of crash scenarios, in order from best to worst:

  1. Application works as expected and never crashes.
  2. Application crashes due to rare bugs that nobody notices or cares about.
  3. Application crashes due to a commonly encountered bug.
  4. Application deadlocks and stops responding due to a common bug.
  5. Application crashes long after the original bug.
  6. Application causes data loss and/or corruption.

Mike points out that there’s a natural tension between…

  • failing immediately when your program encounters a problem, eg “fail fast”
  • attempting to recover from the failure state and proceed normally

The philosophy behind “fail fast” is best explained in Jim Shore’s article (pdf).

Some people recommend making your software robust by working around problems automatically. This results in the software “failing slowly.” The program continues working right after an error but fails in strange ways later on. A system that fails fast does exactly the opposite: when a problem occurs, it fails immediately and visibly. Failing fast is a nonintuitive technique: “failing immediately and visibly” sounds like it would make your software more fragile, but it actually makes it more robust. Bugs are easier to find and fix, so fewer go into production.

Fail fast is reasonable advice– if you’re a developer. What could possibly be easier than calling everything to a screeching halt the minute you get a byte of data you don’t like? Computers are spectacularly unforgiving, so it’s only natural for developers to reflect that masochism directly back on users.

But from the user’s perspective, failing fast isn’t helpful. To them, it’s just another meaningless error dialog preventing them from getting their work done. The best software never pesters users with meaningless, trivial errors– it’s more considerate than that. Unfortunately, attempting to help the user by fixing the error could make things worse by leading to subtle and catastrophic failures down the road. As you work your way down Mike’s list, the pain grows exponentially. For both developers and users. Troubleshooting #5 is a brutal death march, and by the time you get to #6– you’ve lost or corrupted user data– you’ll be lucky to have any users left to fix bugs for.

What’s interesting to me is that despite causing more than my share of software crashes and hardware bluescreens, I’ve never lost data, or had my data corrupted. You’d figure Murphy’s Law would force the worst possible outcome at least once a year, but it’s exceedingly rare in my experience. Maybe this is an encouraging sign for the current state of software engineering. Or maybe I’ve just been lucky.

So what can we, as software developers, do about this? If we adopt a “fail as often and as obnoxiously as possible” strategy, we’ve clearly failed our users. But if we corrupt or lose our users’ data through misguided attempts to prevent error messages– if we fail to treat our users’ data as sacrosanct– we’ve also failed our users. You have to do both at once:

  1. If you can safely fix the problem, you should. Take responsibility for your program. Don’t slouch through the easy way out by placing the burden for dealing with every problem squarely on your users.
  2. If you can’t safely fix the problem, always err on the side of protecting the user’s data. Protecting the user’s data is a sacred trust. If you harm that basic contract of trust between the user and your program, you’re hurting not only your credibility– but the credibility of the entire software industry as a whole. Once they’ve been burned by data loss or corruption, users don’t soon forgive.

The guiding principle here, as always, should be to respect your users. Do the right thing.

Top Project Manager Practice: Project Postmortem

You may think you’ve completed a software project, but you aren’t truly finished until you’ve conducted a project postmortem. Mike Gunderloy calls the postmortem an essential tool for the savvy developer:

The difference between average programmers and excellent developers is not a matter of knowing the latest language or buzzword-laden technique. Rather, it can boil down to something as simple as not making the same mistakes over and over again. Fortunately, there’s a powerful tool that any developer can use to help learn from the past: the project postmortem.

There’s no shortage of checklists out there offering guidance on conducting your project postmortem. My advice is a bit more sanguine: I don’t think it matters how you conduct the postmortem, as long as you do it.Most shops are far too busy rushing ahead to the next project to spend any time thinking about how they could improve and refine their software development process. And then they wonder why their new project suffers from all the same problems as their previous project.

Steve Pavlina offers a developer’s perspective on postmortems:

The goal of a postmortem is to draw meaningful conclusions to help you learn from your past successes and failures. Despite its grim-sounding name, a postmortem can be an extremely productive method of improving your development practices.

Game development is some of the most difficult software development on the planet. It’s a veritable pressure cooker, which also makes it a gold mine of project postmortem knowledge. I’m fascinated with the Gamasutra postmortems, but I didn’t realize that all the Gamasutra postmortems had been consolidated into a book: Postmortems from Game Developer: Insights from the Developers of Unreal Tournament, Black and White, Age of Empires, and Other Top-Selling Games (Paperback) . Ordered. Also, if you’re too lazy for all that pesky reading, Noel Llopis condensed all the commonalities from the Game Developer magazine postmortems.

Geoff Keighley’s Behind the Games series, while not quite postmortems, are in the same vein. The early entries in the series are amazing pieces of investigative reporting on some of the most notorious software development projects in the game industry. Here are a few of my favorites:

Most of the marquee games highlighted here suffered massive schedule slips and development delays. It’s testament to the difficulty of writing A-list games. I can’t wait to read The Final Hours of Duke Nukem Forever, which was in development for over 15 years (so it must be a massive doc). Its vaporware status is legendary— here’s a list of notable world events that have occurred since DNF began development.

Don’t make the mistake of omitting the project postmortem from your project. If you don’t conduct project postmortems, then how can you possibly know what you’re doing right– and more importantly, how to avoid making the same exact mistakes on your next project?

How about an hourly build?

The benefits of a daily build are well understood by now.  McConnell even cites the book Showstopper! as an extreme example circa 1993:

Who can benefit from this process? Some developers protest that it is impractical to build every day because their projects are too large. But what was perhaps the most complex software project in recent history used daily builds successfully. By the time it was released, Microsoft Windows NT 3.0 consisted of 5.6 million lines of code spread across 40,000 source files. A complete build took as many as 19 hours on several machines, but the NT development team still managed to build every day. Far from being a nuisance, the NT team attributed much of its success on that huge project to their daily builds. Those of us who work on projects of less staggering proportions will have a hard time explaining why we aren’t also reaping the benefits of this practice.

I think the main argument against daily builds was, frankly, a technological one– it simply took too long. In this age of blazingly fast 64-bit processors– and even cool distributed build stuff like IncrediBuild—  time should no longer be a factor.

Most of us will never work on a project as large as Windows NT, so our builds should be near-instantaneous. And we should do better than a daily build. I believe, for all but the largest projects, you should be building multiple times per day. This is also known as continuous integration:

Developers should be integrating and releasing code into the code repository every few hours, whenever possible. In any case never hold onto changes for more than a day. Continuous integration often avoids diverging or fragmented development efforts, where developers are not communicating with each other about what can be re-used, or what could be shared. Everyone needs to work with the latest version. Changes should not be made to obsolete code causing integration head aches.

I’m at the point now that I get aggravated when other developers leave files checked out overnight. Perhaps it’s a question of programming style, but I believe your code should almost always be in a compilable state. It may not do much, but it should successfully compile. I strongly believe that an aggressive checkin policy leads to better code. So does Martin Fowler:

One of the hardest things to express about continuous integration is that makes a fundamental shift to the whole development pattern, one that isn’t easy to see if you’ve never worked in an environment that practices it. In fact most people do see this atmosphere if they are working solo – because then they only integrate with themself. For many people team development just comes with certain problems that are part of the territory. Continuous integration reduces these problems, in exchange for a certain amount of discipline.

The fundamental benefit of continuous integration is that it removes sessions where people spend time hunting bugs where one person’s work has stepped on someone else’s work without either person realizing what happened. These bugs are hard to find because the problem isn’t in one person’s area, it is in the interaction between two pieces of work. This problem is exacerbated by time. Often integration bugs can be inserted weeks or months before they first manifest themselves. As a result they take a lot of finding.

With continuous integration the vast majority of such bugs manifest themselves the same day they were introduced. Furthermore it’s immediately obvious where at least half of the interaction lies. This greatly reduces the scope of the search for the bug. And if you can’t find the bug, you can avoid putting the offending code into the product, so the worst that happens is that you don’t add the feature that also adds the bug. (Of course you may want the feature more than you hate the bug, but at least this way it’s an informed choice.)

Now there’s no guarantee that you get all the integration bugs. The technique relies on testing, and as we all know testing does not prove the absence of errors. The key point is that continuous integration catches enough bugs to be worth the cost.

The net result of all this is increased productivity by reducing the time spent chasing down integration bugs. While we don’t know of anyone who’s given this anything approaching a scientific study the anecdotal evidence is pretty strong. Continuous Integration can slash the amount of time spent in integration hell, in fact it can turn hell into a non-event.

I’m not making any conscious effort to use so-called Extreme Programming; my concern is a more practical one. I simply can’t think of any justifiable reason for a developer to hold on to code that long. If you’re writing code that is so broken you can’t check any of it in for more than a day– you might have some bad programming habits.  I believe it’s far healthier to grow or accrete your software from a small, functional base and use aggressive daily checkins to checkpoint those growth stages.