Friday, April 18, 2014

Computer Science fundamentals still hold true

There are some discussions out there about what software developer should really know nowadays. Arguments are raised that most of contemporary software developer work is no longer a computer science - which meant creating new stuff out of nothing - but is "just" an engineering work - using already built and verified components and approaches, gluing and mixing it together to create stuff out of other stuff. Questions are raised whether deep understanding of algorithms and data structures or other fundamentals of computer science is crucial in putting together existing libraries and frameworks, which is what most of us basically do every day.

The need for a good software developer to hold a Computer Science degree is often questioned and there are multiple advices out there how to be successful in a field without a degree. Indeed, I've never actually implemented sorting or tree search on my own for professional needs, only as an academic exercise. But I think the knowledge I've gained during my CS studies gave me a lot of insight how things work and gives me a sort of confidence in what I do. Without the knowledge about time and memory complexities, I'd be wandering in the darkness.

Right now, working in a self-sufficient Agile team, with strong desire to avoid knowledge and responsibility silos, everyone is encouraged to take on every task needed to reach the goal. Of course there are still (and always be) people more skilled in database stuff and other more skilled in HTML, and that's fine. But with no code ownership it's also fine when more front-end inclined developers do some back-end tasks and conversely.

But unless someone is doing something purely declarative in its nature, like plain HTML or CSS, the code is still code, regardless it's low-level C or JavaScript at the client. That means understanding the mechanics of how things works and knowing the fundamentals of data structures and algorithms is still crucial for all software developers, no matter where in the stack they fit best. Of course not having a degree cannot disqualify anyone from being a good software engineer, but the theoretical gaps need to be filled properly.

Recently I've stumbled upon a simple piece of code we already had on production, working well enough so that it haven't brought any attention until the data quantity was small enough. The code goes like this:

foreach (var foo in foos)
    var matchingBars = bars.Where(x => x.Foo == foo);
    foreach (var bar in matchingBars)
        DoSomethingWith(foo, bar);

Simple enough, isn't it? But there is more and more data. When we reached more than 20k foos and more than 20k bars, this is what happened:

That simple piece of code hit us badly with quite an obvious O(n^2) number of comparisons. Foos and bars are both plain lists, finding matching elements requires traversing plain bars collection for each and every foo element. Each comparison is insignificantly small, but doing it 425 million times takes more than a minute!

I've changed the code to use Lookup, which is a basic hashed structure that allows quick access to the elements by the key. The code now looks like this:

var barsLookup = bars.ToLookup(x => x.Foo);
foreach (var foo in foos)
    foreach (var bar in barsLookup[foo])
        DoSomethingWith(foo, bar);

That simple change replaced >400 millions of comparisons with only 20k needed to build up the lookup + 20k cheap lookup reads. The result? Total execution time fall down to just 115 ms.

That's 538 times faster, just by one simple data structure change.

I've found a great algorithms and data structures complexity cheat sheet. I think one may not call himself a software developer if he doesn't understand what at least the basic stuff in those tables mean.

Thursday, April 10, 2014

Waiting screen - doing bad things right - is it even possible?

Sometimes in large web applications, there is a necessity to make a client wait, before the server is able to provide any content. There may be some heavy calculations to be performed, caches refreshed etc. In most cases it probably can - and should - be avoided using background workers not being a part of the actual web request or some kind of asynchronous AJAX calls. Those approaches give the possibility to have either completely undisturbed user experience or at least to reduce the fuss and eliminate the need to have a blocking wait.

But there's a chance the task we are to accomplish cannot be offloaded onto the background thread at the server nor performed asynchronously with AJAX during normal site operation. I had that kind of situation in a project where authorization was handled by an external, awfully enterprisey component, available through the web service which was slow as a snail and completely out of our control. It was not possible to show anything more than the waiting screen on the first log on, as everything of any value had to be authorized first.

In that hopeless situation, we've decided we need a waiting screen, so that users at least see something that lets them know the service is being prepared for them. You know, if it's slow, it's a clear indication there's a lot of value inside and it's worth paying big money for it. In case it loaded quickly, there seems to have no real value (in case my English is not so fluent to express my thoughts accurately - yes, this was meant to be sarcastic).

Anyway, I was looking for a solution for the waiting screen, that ideally holds all of the features listed here in my subjective order of importance.

  1. it needs to be available not just as a JS throbber to be put somewhere on the site - it needs to be a response for the initial request to the website;
  2. it needs to draw something on screen while waiting (obvious to say, but not so obvious to implement);
  3. it should be HTTP-compliant in terms of status codes etc. so that it doesn't confuse any browsers or web crawlers;
  4. it should not break "back" button;
  5. while waiting, it should indicate "waiting" status in a browser's status bar, to convince the user something is really going on.

Serving a simple wait screen with 302 Redirect to the target request that will do the actual work is not an option, as it will fail on requirement no. 2 - the browser will issue a redirected request without rendering our wait screen. But in order to have point 3 and 4 fulfilled, we need 302 Redirect - serving 200 OK without the actual content will harm the protocol and browsers' history badly - it's tightly coupled. So there's a big contradiction here. We can try going the unwanted HTML Redirect route or use JavaScript redirection - it satisfies point 2, but it's no better about protocol compliance nor browser history obedience.

Well, I got stuck here. I've asked for help on StackOverflow, but no interest at all. Let me know if I'm missing something.

By now, I've sacrificed protocol compliance and proper "back" button behavior in order to have anything shown on screen - points 1 and 2 are the crucial ones for the general user experience. It's still quite weak experience, but without those, there is no experience at all. I've chosen the JavaScript redirect route called on window.load jQuery event (note that document.ready event is raised when the DOM is ready, but possibly not yet rendered - a bit too early for us to be sure something is already drawn).

I can't see a way to have the waiting screen done right. Well, maybe there is no right way to do bad things?

Monday, January 27, 2014

Introducing NOtherLookup - even better lookups

Recently I've been discussing how awesome is the ILookup data structure provided by .NET Framework as a part of LINQ library. I've looked through some of its key features, but I've also mentioned its few imperfections, like no way to create it directly and no easy way to do any operations on a Lookup as a whole, promising to take a look how to fix it.

So, how to fix it? By using NOtherLookup library I've just pushed to NuGet. It is a set of extension methods and utility classes designed to make Lookup-based operations as easy and as natural as in plain LINQ-to-objects.

Consider for example merging two Lookup instances into a single one that has the keys from both instances and for each key that is present in both Lookups, the union of sets should be produced. When doing it in plain LINQ, we probably need to convert both Lookups to IDictionary<TKey, List<TValue>>, then merge corresponding keys and finally produce a new Lookup manually. With NOtherLookup, we can just use LINQ in the same way as we do on IEnumerable:

var result = firstLookup.Union(secondLookup)

In this example, result is still typed as ILookup<TKey, TValue>, with no more conversions needed. And here is what it does:

"a" => 1,2
"b" => 3,4

"b" => 4,5
"c" => 6,7

"a" => 1,2
"b" => 3,4,5
"c" => 6,7

NOtherLookup offers support for several operators known from LINQ. There are some operators that combine two Lookups - besides Union shown above, there is Concat, Except, Intersect, Join and Zip. There are some that manipulate a single Lookup - Select and Where. There is also quite universal OnEachKey method that allow running arbitrary LINQ query on each lookup element, maintaining ILookup type.

ILookup<int, string> transformed = lookup.OnEachKey(g => g.Select(x => x + g.Key).Reverse());


"a" => 1,2
"b" => 3,4

"a" => "2a", "1a"
"b" => "4b", "3b"

NOtherLookup contains also a few features that make obtaining ILookups easier. The most important one is Lookup.Builder - a class that allows creating lookup from the scratch (as a solution for the lack of Lookup public constructor), with support for custom comparators, if needed.

ILookup<int, string> lookup = Lookup.Builder
    .WithKey(1, new[] { "a", "b" })
    .WithKey(2, new[] { "c", "d" }) 
    .WithComparer(new CustomIntComparer()) // can be omitted

There is also empty lookup available for your convenience.

Last but not least, there are easy conversions between ILookup and IDictionary available - it's a simple method call in both ways:

ILookup<int, string> lookup = dict.ToLookup();
Dictionary<int, List<string>> backToDict = lookup.ToDictionary();

For the detailed descriptions with examples, see README on GitHub project page, where the code lives. NOtherLookup is licenced with very liberal MIT licence, so feel free to grab the code and use it any way you want.

Sunday, January 12, 2014

The downsides of .NET's Lookups

In the previous post I was admiring the wonderful features of .NET's Lookup - its immutability, null-safeness and clean and readable presence. I've also promised to look for its downsides or pitfalls one may encounter when using it.

Are there any? Well, to be honest, I can't see any serious problems with using Lookups. If you need a collection to be used as a lookup table by some identifier, with no modifications support, with multiple values per key - you won't find anything better suited than ILookup.

There are just two things I find to be a slight impediments in ideal Lookup's world.

The first is that the only way to create an ILookup instance is via LINQ's ToLookup method. Although the framework provides the default public ILookup implementation - a Lookup class - it has no public constructor available. It means that in order to create our lookup, we need to prepopulate some other collection just to convert it later using ToLookup().

The second thing is that the beautiful clean API provided by ILookup<TKey, TValue> goes away immediately if we need to do anything other than just read our lookup. If we need to have a lookup created from existing lookup but with one more value, or have our lookup filtered, or joined with another, or whatever, it breaks into IEnumerable<IGrouping<TKey, TValue>> - and the readability hurts a lot. IGrouping represents a single lookup item which is a values collection with its associated key. Still nice to access the data, but not nice to operate on. There's no public implementation of IGrouping available in the framework - the implementation is internal to LINQ. There's no way to modify the grouping, what is understandable as it is part of immutable ILookup, but there's also no way to create new grouping based on existing one - the workaround is to convert our whole lookup back to some mutable collection like Dictionary<TKey, TValue>, mutate it and then convert to new lookup using the convoluted LINQ projections first, something like this:

var temporaryDictionary = lookup.ToDictionary(x => x.Key, x => x.ToArray());
temporaryDictionary.Add("new key", new[] { "new value" });
var newLookup = temporaryDictionary
    .SelectMany(kv => kv.Value, (kv, v) => new { kv.Key, Value = v })
    .ToLookup(x => x.Key, x => x.Value);

Or alternatively, if we don't mind losing existing key-values mappings and we are able to recreate it once again, we can flatten our lookup to List<TValue>, modify the list and then use the simplest ToLookup() overload.

var temporaryList = lookup.SelectMany(x => x).ToList();
temporaryList.Add("new value");
var newLookup = temporaryList.ToLookup(x => SomethingThatRecreatesKey(x));

No matter which way we choose, it's quite a lot of work for such a simple thing.

Next time I'll try to propose a solution for both the issues. Stay tuned!

Monday, January 6, 2014

Lookup - .NET's hidden gem

.NET Framework has a lot to offer when comes to generic data structures, especially collections. We have several lists, arrays, sets, dictionaries - each of them has its performance and functional characteristics and scenarios where it fits much better than the others. As I'm observing in the project I work in, one of the most often forgotten yet powerful type of collection is Lookup. I think it needs more attention.

ILookup was introduced to the framework as a part of LINQ and is exclusively used in LINQ context, as a result of ToLookup calls. Logically, it is a key-value collection like IDictionary, but it is designed to hold multiple values in a single key. Also, unlike Dictionary<K, V>, it is immutable, what means that after creating the lookup, we can only read it - no modifications are possible.

Generally one may say that ILookup<K, V> is no better than IDictionary<K, IEnumerable<V>> and as such is redundant. But I'd strongly disagree. First of all, let's think how are dictionaries of collections used. From my experience, it's most often used to keep predefined static "dictionaries" that are loaded once and live long to provide easy access to some rarely changing values like what counties are there in each country etc. Well, these are even often called "lookup tables". Note that in most cases we don't ever change the stored values - it is immutable. We replace it as a whole periodically or when something changes. Doesn't it sound exactly like ILookup<K, V> key features discussed above? We should always be using the tools that are best suited for the job to avoid reinventing the wheel and avoid the problems others already solved.

Moreover, ILookup's definition is clearly more readable than nested generics in its dictionary-based equivalent. When I look at the code never seen before and I see ILookup<K, V>, I instantly know that it is the lookup table and I feel its read-only, one-way nature. When I see IDictionary<K, IEnumerable<V>> I need to consider the possibility that it is used as a read-write store for some values and I, as a user of this code, can (or should?) modify that collection. Using Lookup for lookups just makes sense.

IDictionary-based lookups, as mutable data structures may kick us badly. I've once experienced quite an ugly error caused by passing around the plain List<T> read from what was intended to be static lookup shared between users and was based on Dictionary<K, List<V>>. The intention was to pass around the immutable lookup data, but the receiving code actually got mutable list being exactly the same reference that was sitting in the global lookup. The value passed was eventually customized for each user and there were miracles happening to the global lookup. All that would be just impossible with lookups.

With dictionaries that are open for modification and not thread-safe by default, we are also subject to multiple kinds of failures in multi-threaded scenarios. All of it is gone with immutable data structure like ILookup.

And last but not least - null checks. Who likes to litter his code with all that boring null checks? When using IDictionary as a lookup table, every piece of code that consumes its data need to check for key existence first or have a conditional statement with TryGetValue method, because calling dictionary[nonExistingKey] would throw dreadful KeyNotFoundException. Moreover, even if the key exists, the value may still legally be null, so to read the collection contained under the key we need to have two checks:

if (dictionary.ContainsKey(theKey))
     var collectionOrPossiblyNull = dictionary[theKey];
     if (collectionOrPossiblyNull != null)

or, with more compactness but same complexity:

T collection;
if (dictionary.TryGetValue(theKey, out collection) && collection != null)

ILookup<K, V> was designed totally differently around key existence and null values. It just returns an empty collection for non-existent keys. And thanks to its strictly controlled way of creating (by LINQ's ToLookup method only), we can also be sure that no null collection is possible. And the code above just looks as it should like:


Isn't it beautiful?

Next time I'll try to find some pitfalls or inconveniences of using lookups.

Sunday, December 29, 2013

Me & Open Source - a New Year's Resolution

I am working as a software developer full-time for several years now. I've already written many thousands of lines of code, but nearly none of it is publicly available. Most of it remains a property of my employers for who the code was created, but there are always some pieces that are already open source or side projects that don't need to adhere to strict ownership rules and can be generalized enough to be shared with others.

I always wanted to share and I had hundreds of ideas for full-size projects that I could do when there's enough time. But in the reality, there was never enough time and motivation for anything big. Full-time job takes a lot of time, the rest goes to family and other duties. But there is always a few late evening hours each week that can be saved.

All the things I've shared with the community up to date were mentioned here on my blog. These were mostly short gists or snippets. The only larger thing was an HTML building API based on the "loquacious" interface idea, published on GitHub as NOtherHtml. I still love the design! But I admit, it was rather one-timer, not a major contribution to the open source world.

So, here is my New Year's Resolution for 2014: try spending those few late evening hours on contributing to the open source projects. Yeah, I know no one cares, but I'm sharing it as posting it publicily will motivate me more than leaving it just as my internal thought. There will be probably no biggies for the reasons mentioned, I think I'll rather focus on bug fixes or small features for the projects I use. It's easiest, it's beneficial, it's fair.

As a start, yesterday I've submitted my first pull request to the "real" open source project I use - JP. Boodhoo's developwithpassion.specifications. I've fixed a bug in the extension method that helps assert collections elements equality, but failed when the collections were of different lengths. I feel I'm already over the hump!

Sunday, October 6, 2013

ASP.NET MVC: Current route values implicitly used by Url.Action

I really don't like when I work with statically-typed language like C# and still need to rely on strings or other loosely-typed constructs for things that may be strongly-typed. ASP.NET MVC went its way towards strongly-typing and now offers things like strongly-typed view models, but still lacks official built-in support for building internal links and URLs without using plain controller and action strings.

Fortunately, here is where MVC Futures comes in (available on NuGet for MVC 3 or 4). It offers a nice expression-based extension methods to build links (on HtmlHelper):

Html.ActionLink<TheController>(c => c.TheAction(theParam))

and URLs (on UrlHelper):

Url.Action<TheController>(c => c.TheAction(theParam))

It's working fine and it protects me from working with those ugly string-based methods and it is also a very convenient way to specify the target action parameters - in definitely nicer way then manually building the RouteValuesCollection required by those traditional methods.

But there is one not well-known feature of MVC (at least I haven't heard a lot about it) that for me seems to be disturbing here - MVC is reusing the route values from the current request when building URLs and no other value specified. This maybe makes some sense for the string-based methods (you don't need to build those RouteValuesCollections over and over again), but it makes no sense for the expression-based approach as the compiler enforces specifying all the parameters explicitly in order to build a lambda expression that compiles.

And here comes the nasty bug I've recently spent some hours on - passing null value is like not passing a value at all. MVC puts null under the appropriate key in the RouteValuesCollection, it then goes down to Route.GetVirtualPath method, which also takes the current route values from the RequestContext. And then the evil happens - the low-level System.Web.Routing's method ParsedRoute.Bind ignores the null value and - bang - it takes the value from the current request, if any accidentally matches by the parameter name.

It means that when trying to build an URL that passes parameter param equal to null and the current request accidentally have parameter named also param (regardless of its type or existence of any logical connection), the request's param value will be passed instead our explicitly demanded null value. And I can see no way to pass null in this case.

Actually, the bug (or feature?) exists in case of string-based URL builder methods, too. But here it is much more visible and obviously wrong.

The only way to fix that strange implicit parameter inheritance by name I know is to work around it - either by removing the name collision by renaming one of the parameters (yuck!) or by using own extension method.

I've created my own extension method to generate URLs/links that has the same signature as the buggy ones and placed it in the namespace more visible than those replaced (you'd better check your usings carefully). Here is the UrlHelper extension I have - HtmlHelper implementation generally calls UrlHelper and wraps its result in a link, so I'll omit it here. The method calls the same methods as the original method being replaced, but it tweaks the RequestContext instance: instead of passing the instance available in UrlHelper (which contains those conflicting route values from the current request), I'm creating new RequestContext reusing Route and IRouteHandler instances from the current request, leaving the actual route values empty. This way there's no possibility for the current values to "infect" our URL building process anymore. Interesting is that RouteData constructor doesn't enforce the values being set, anyway.

public static string Action<TController>(this UrlHelper helper, Expression<Action<TController>> action)
    where TController : Controller
    // we need to recreate RequestContext without values, to override the MVC "feature"
    // that replaces null specified in action with value inherited from current request
    var currentRouteData = helper.RequestContext.RouteData;
    var fixedRouteData = new RouteData(currentRouteData.Route, currentRouteData.RouteHandler);
    var fixedRequestContext = new RequestContext(helper.RequestContext.HttpContext, fixedRouteData);

    var valuesFromExpr = Microsoft.Web.Mvc.Internal.ExpressionHelper.GetRouteValuesFromExpression(action);
    var vpd = helper.RouteCollection.GetVirtualPathForArea(fixedRequestContext, valuesFromExpr);
    return vpd == null ? null : vpd.VirtualPath;

Well, it seems to be yet another example of the fact that null is rather vague and problematic thing. It can represent many different notions, depending on the context and implementation - like no value, empty value, inherited value etc. Althouth ASP.NET MVC treats null as "inherit the value" not only in this case, I'd always prefer being explicit about that kind of behaviors.