Friday, April 18, 2014

Computer Science fundamentals still hold true

There are some discussions out there about what software developer should really know nowadays. Arguments are raised that most of contemporary software developer work is no longer a computer science - which meant creating new stuff out of nothing - but is "just" an engineering work - using already built and verified components and approaches, gluing and mixing it together to create stuff out of other stuff. Questions are raised whether deep understanding of algorithms and data structures or other fundamentals of computer science is crucial in putting together existing libraries and frameworks, which is what most of us basically do every day.

The need for a good software developer to hold a Computer Science degree is often questioned and there are multiple advices out there how to be successful in a field without a degree. Indeed, I've never actually implemented sorting or tree search on my own for professional needs, only as an academic exercise. But I think the knowledge I've gained during my CS studies gave me a lot of insight how things work and gives me a sort of confidence in what I do. Without the knowledge about time and memory complexities, I'd be wandering in the darkness.

Right now, working in a self-sufficient Agile team, with strong desire to avoid knowledge and responsibility silos, everyone is encouraged to take on every task needed to reach the goal. Of course there are still (and always be) people more skilled in database stuff and other more skilled in HTML, and that's fine. But with no code ownership it's also fine when more front-end inclined developers do some back-end tasks and conversely.

But unless someone is doing something purely declarative in its nature, like plain HTML or CSS, the code is still code, regardless it's low-level C or JavaScript at the client. That means understanding the mechanics of how things works and knowing the fundamentals of data structures and algorithms is still crucial for all software developers, no matter where in the stack they fit best. Of course not having a degree cannot disqualify anyone from being a good software engineer, but the theoretical gaps need to be filled properly.

Recently I've stumbled upon a simple piece of code we already had on production, working well enough so that it haven't brought any attention until the data quantity was small enough. The code goes like this:

foreach (var foo in foos)
{
    var matchingBars = bars.Where(x => x.Foo == foo);
    foreach (var bar in matchingBars)
    {
        DoSomethingWith(foo, bar);
    }
}

Simple enough, isn't it? But there is more and more data. When we reached more than 20k foos and more than 20k bars, this is what happened:

That simple piece of code hit us badly with quite an obvious O(n^2) number of comparisons. Foos and bars are both plain lists, finding matching elements requires traversing plain bars collection for each and every foo element. Each comparison is insignificantly small, but doing it 425 million times takes more than a minute!

I've changed the code to use Lookup, which is a basic hashed structure that allows quick access to the elements by the key. The code now looks like this:

var barsLookup = bars.ToLookup(x => x.Foo);
foreach (var foo in foos)
{
    foreach (var bar in barsLookup[foo])
    {
        DoSomethingWith(foo, bar);
    }
}

That simple change replaced >400 millions of comparisons with only 20k needed to build up the lookup + 20k cheap lookup reads. The result? Total execution time fall down to just 115 ms.

That's 538 times faster, just by one simple data structure change.

I've found a great algorithms and data structures complexity cheat sheet. I think one may not call himself a software developer if he doesn't understand what at least the basic stuff in those tables mean.

6 comments:

  1. Is it really arrogant to expect that *engineers* know their job and are professionalists? I'd argue.

    Fundamentals are not necessary only if someone is not writing any production code at all.

    ReplyDelete
  2. You say "But unless someone is doing something purely declarative in its nature, like plain HTML or CSS, the code is still code"

    You realize that HTML+CSS is turing complete, right?

    ReplyDelete
    Replies
    1. So are a bunch of stones on a checkers board together with some rules on how to set them.
      Like conways game of life.
      Just because it's turing complete doesn't mean it's programming language.

      Delete
  3. @ntcoding: I think it's absolutely fair to expect that a software developer can talk about runtime complexity -- even simple CRUD apps aspire to be important enough that scale begins to matter. Even if you've never run into big-theta, you should have an intuition about whether your nested loops are going to cause your program an issue.

    Anon: HTML+CSS is only turing complete with something else to pump its state.

    ReplyDelete
  4. Which profiler are you using?

    ReplyDelete

Note: Only a member of this blog may post a comment.