Friday, June 10, 2011

Looping the Right Way

Many programmers write their loops by starting with a loop over whatever data they happen to have and then working out what needs to happen on each iteration. The result is often code that is hard to read. The underlying problem is that the algorithm they are describing is frequently more natural as a loop over something else, and the code becomes awkward from having to compensate for that mismatch inside a straightforward iteration.


Consider an example of a run length encoder:


import struct

last = None
count = 0
for letter in input_text:
    if letter == last:
        count += 1
    else:
        # a new letter starts: flush the previous run, if any
        if count != 0:
            packed = struct.pack(">Bc", count, last)
            output.write(packed)
        count = 1
    if count == 255:
        # the count field is a single byte, so flush long runs early
        packed = struct.pack(">Bc", count, letter)
        output.write(packed)
        count = 0
    last = letter

# flush whatever run is left at the end of the input
if count != 0:
    packed = struct.pack(">Bc", count, last)
    output.write(packed)

Following the logic of this code is somewhat difficult, and it repeats important elements in multiple places. The problem is that the code considers the input letter by letter, while the run length encoding algorithm really works on blocks of the same character. As a result, the looping logic has become intertwined with the output logic, which makes both harder to read. However, we can easily separate the two.


blocks = []
last = None
count = 0
for letter in input_text:
    if letter == last:
        count += 1
    else:
        if count != 0:
            blocks.append((last, count))
        count = 1
    if count == 255:
        blocks.append((letter, count))
        count = 0
    last = letter

if count != 0:
    blocks.append((last, count))


for letter, count in blocks:
    packed = struct.pack(">Bc", count, letter)
    output.write(packed)

Now we've clearly separated the output and looping logic. The drawback of this technique is that it creates a temporary list of blocks; if the input is large enough, that list becomes a problem in its own right. Python solves this with generators. A generator acts like a list, except that it produces its data as it is asked for rather than all at once, which also means you can only iterate through it once.
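To see what that means, here is a tiny, self-contained illustration (the squares generator below is just an example of mine, not part of the encoder):

def squares(limit):
    # each value is produced only when the consumer asks for it
    for i in range(limit):
        yield i * i

gen = squares(3)
assert list(gen) == [0, 1, 4]   # the first pass consumes the generator
assert list(gen) == []          # a second pass finds it already exhausted

Here is the encoder rewritten to take advantage of generators: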


def blocks():
    last = None
    count = 0
    for letter in input_text:
        if letter == last:
            count += 1
        else:
            if count != 0:
                yield last, count
            count = 1
        if count == 255:
            yield letter, count
            count = 0
        last = letter

    if count != 0:
        yield last, count


for letter, count in blocks():
    packed = struct.pack(">Bc", count, letter)
    output.write(packed)

A closer look at the blocks function reveals that it does two things: it combines runs of repeated letters, and it also restricts the size of each block to 255. We can split these into two functions:


def whole_blocks():
    last = None
    count = 0
    for letter in input_text:
        if letter == last:
            count += 1
        else:
            if count != 0:
                yield last, count
            count = 1
        last = letter

    if count != 0:
        yield last, count

def blocks():
    for letter, count in whole_blocks():
        # split runs longer than 255 into byte-sized blocks
        while count > 255:
            yield letter, 255
            count -= 255
        yield letter, count

Now we can actually see the logic more clearly. Another advantage is that such small pieces are often reusable, and sometimes already written for us. In fact, whole_blocks can be replaced by itertools.groupby, which divides an iterable up into groups of consecutive equal elements.
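For example (this little snippet is only an illustration, using a made-up string):

import itertools

runs = [(letter, len(list(group)))
        for letter, group in itertools.groupby("aaabbbbc")]
assert runs == [('a', 3), ('b', 4), ('c', 1)]

If we recast the solution in terms of groupby we get: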


def blocks(count, maxsize=255):
    # split a run length into pieces that fit in a single byte
    while count > maxsize:
        yield maxsize
        count -= maxsize
    yield count

for letter, items in itertools.groupby(input_text):
    for block_size in blocks(len(list(items))):
        packed = struct.pack(">Bc", block_size, letter)
        output.write(packed)

This code is easier to read and much less error-prone than the original.


The take-home lessons:

  1. Don't iterate over what you have; decide what your algorithm naturally wants to iterate over, then figure out how to produce that.

  2. Divide complex tasks into a series of simple transformations that can each be applied to a collection.

  3. If you find yourself adding flags or other variables that change from iteration to iteration, look for an alternative.



Monday, June 6, 2011

Efficiency

It's been often repeated that "premature optimization is the root of all evil." However, an often missed aspect of this issue is that it is surprisingly common to be able to have your cake and eat it too. The most straightforward algorithm to solve a problem is often not the most efficient. However, since it takes additional time to both write and debug the more efficient algorithm, the benefit is often not worth the cost. The admonishment therefore is to avoid worrying about optimizing code until actual performance measurements indicate that the code in question takes a noticeable fraction of the processing time.

However, there is a way to keep your code simple and easy to write while still taking advantage of faster algorithms: call previously written and debugged code that performs the necessary function. Your code becomes simpler because it no longer implements any algorithm at all, and you get both simplicity and speed.

Let's consider the example of a word count program: it reads in text and counts the number of occurrences of each word. Here we'll worry only about the part that keeps track of the word counts.

Firstly, consider the necessary function written in C. The simplest way to implement the necessary behavior would seem to be a linked list. The following code implements such a function.

#include <stdlib.h>
#include <string.h>

typedef struct Node
{
    const char * word;
    int count;
    struct Node * next;
} Node;

void increment_word_count( Node * node, const char * word)
{
    // If the current node holds the correct word, increment the count
    if( strcmp(node->word, word) == 0 )
    {
        node->count++;
    }
    // if there is a next node, keep looking
    else if( node->next )
    {
        increment_word_count(node->next, word);
    }
    // if we reach the end of the linked list without finding the word,
    // create a new node
    else
    {
        node->next = malloc(sizeof(Node));
        node->next->word = word;
        node->next->count = 1;
        node->next->next = NULL;
    }
}

This algorithm takes O(n) time for each word, where n is the number of distinct words seen so far. The recursion is also a liability: with a large enough vocabulary it will run out of stack space.

On the other hand, consider how a C++ programmer would write this function. C++ has a richer standard library than C, and thus a considerable amount of already written code to draw on. This particular case works well with the std::map template.

#include <map>
#include <string>

void increment_word_count( std::map<std::string, int> & word_count, const std::string & word)
{
    word_count[word] += 1;
}

Note that this code is much simpler than the C code, yet it implements a more sophisticated algorithm. Underneath, the STL uses a balanced tree to ensure that the lookup and update are done in O(log n) time. Thus, by using already written code we get both speed and simplicity.
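The same point holds outside of C++. As a sketch of my own (not part of the original comparison), a Python version would lean on the standard library's collections.Counter in just the same way:

from collections import Counter

def count_words(words):
    # Counter is a dict subclass: the hash table and the counting
    # logic are already written and debugged for us
    return Counter(words)

assert count_words("the cat sat on the mat".split())["the"] == 2

Again, the simple-looking code is backed by an efficient, already tested data structure.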

Don't start by attempting to write the simplest algorithm that will get the job done. Start by seeing whether you can avoid writing any algorithm at all, by locating an already written, efficient version.

Tuesday, May 17, 2011

Reworking Code

Over on http://codereview.stackexchange.com, I recently reworked some Python code that somebody posted for review. Afterwards, they asked how I managed to see so many things that could be improved. I didn't know, so I'm going to try to redo the rework, keeping notes as I go.

See the request here: http://codereview.stackexchange.com/questions/2449/parsing-huge-xml-file-with-lxml-etree-iterparse-in-python/2454#2454

Firstly we have this line:

def fast_iter2(context, cursor):

Whenever a name appears in the code, I ask: is it a good name? A good name follows three rules:

  1. It shouldn't be an abbreviation unless the abbreviation is really common
  2. It should indicate what it is naming
  3. It shouldn't be overly generic

elements = set(['article', 'inproceedings', 'proceedings', 'book', 'incollection', 'phdthesis', "mastersthesis", "www"])

This line has a problematic name because elements is too generic. Additionally, it stores the set as a local variable when it's actually a constant; it's a constant by virtue of never being modified. Constants should generally be put at the module level rather than inside the function.
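A sketch of what that might look like (the name WANTED_RECORD_TYPES is my own suggestion):

# at module level, outside the function
WANTED_RECORD_TYPES = frozenset([
    'article', 'inproceedings', 'proceedings', 'book', 'incollection',
    'phdthesis', 'mastersthesis', 'www',
])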

paper = {} # represents a paper with all its tags.

Somehow it seems innocuous, but this line is evil. Well, the line isn't actually evil; the problem is all of the places within the function that assign values into paper. Most problematically, paper is reset within the loop.

But why is that a problem? Code should look like what it's trying to do, and at first glance the assignments into paper seem randomly distributed through the loops. The question is: what do we really want to loop over? Really, we want to loop over one paper at a time. But how do we do that?

A quick look at the documentation shows that by the time iterparse hands us an element, the element and all of its children have been fully parsed. This means we can safely use the ElementTree functions to access the information. One code rearrangement later we've massively simplified the code.
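A rough sketch of the resulting shape (the file name and the papers helper are placeholders of mine, not the poster's actual code, and the memory cleanup is discussed further down):

from lxml import etree

def papers(xml_file):
    # by the time an "end" event fires, the element and all of its
    # children have been fully parsed, so it is safe to read them here
    for event, element in etree.iterparse(xml_file, events=("end",)):
        if element.tag in WANTED_RECORD_TYPES:
            yield element

for paper in papers("dblp.xml"):
    print paper.tag   # stand-in for the real per-record processing

Now the loop really does run once per paper, and everything that belongs to a single paper is handled in one place.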

if paper["element"] in ['phdthesis', "mastersthesis", "www"]:
    pass # throw away "unwanted" records.
else:
    populate_database(paper, authors, cursor)

The if block does nothing. As a general rule, never include code that does nothing; it makes you look silly, and it is easy to avoid here by using "not in" rather than "in."
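In other words, the same check collapses to:

if paper["element"] not in ('phdthesis', 'mastersthesis', 'www'):
    populate_database(paper, authors, cursor)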

paperCounter += 1
print paperCounter

This bothers me even though I didn't fix it. Variables that count things are usually unnecessary in Python. It's possible to eliminate the explicit counting, but I'll leave that as an exercise for the reader.

element.clear()
while element.getprevious() is not None:
    del element.getparent()[0]

Different people have different perspectives on when to break something out into a function. I think this should be its own function because it's a break from what the rest of the code is doing; it's just too different, so I want to push it off into a function so the interruption is limited.
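For example, something along these lines (the function name is my own):

def clear_processed_elements(element):
    # free the element we just handled, plus any earlier siblings,
    # so the partially built tree doesn't keep growing in memory
    element.clear()
    while element.getprevious() is not None:
        del element.getparent()[0]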

del context

Using del on the names in a function is rarely helpful. Just don't do it. It has no effect here, but even aside from that it's not a good plan.

Monday, May 16, 2011

Should Exceptions be Exceptional?

It is frequently stated that exceptions should be exceptional. The idea behind this statement is that one should not throw (or catch) exceptions during the normal course of program operation: standard conditional logic should handle those cases, and exceptions should be reserved for situations that are truly exceptional.

For example consider a simple ParseInt function. This function converts a string such as "42" into a number such as 42. However, the string can contain non-numeric characters or perhaps be too large, causing the function to fail. There are three strategies to deal with this:
  1. Throw an exception if the function fails.
  2. Provide a CanParseInt function which checks whether the string can be parsed.
  3. Provide an additional boolean output from ParseInt indicating whether it has succeeded.
Advocates of "exceptions should be exceptional" prefer options 2 or 3 to option 1. They argue that you shouldn't throw an exception when it's sensible to use ordinary conditional logic, i.e., if blocks, and attempting to parse a string that doesn't actually contain an integer is a fairly regular occurrence.

The CanX/DoX method of handling this problem is a bad idea for a number of reasons. The idea behind it is that you test whether an action will succeed before actually attempting it. The pattern looks something like:

if( CanParseInt(string) )
{
    value = ParseInt(string);
}
else
{
    return false; // indicate failure
}

The first problem is that the code specifies what it intends to do twice. Both CanParseInt and ParseInt convey the same information, and having them both there just increases the clutter in the code. Additionally, the two functions duplicate effort: CanParseInt has to spend all the effort necessary to ensure that every character in the string is a valid digit, and then ParseInt has to do all of that work over again.

Another scenario has an additional problem.

if( CanOpenFile(filename) )
{
    File file = OpenFile(filename);
    // play with file
}

The problem is that anything can happen between CanOpenFile and OpenFile. The file could be deleted, opened by another program, etc. The fact that CanOpenFile returns true does not guarantee that OpenFile will succeed.

The CanX/DoX method of handling errors is broken: it complicates the code, wastes processor time, and is sometimes just plain wrong. That leaves the two remaining methods: either we throw an exception, or we indicate through a return value (or similar) that the function failed. The two are alike in that you try to do something and then get a report back if it fails.

The question is which is the better way of indicating failure. Let's look at some code.

Firstly, exception based:

try
{
    this->value = ParseInt(string);
}
catch( NumberParseError )
{
    throw new UserInputError("Expected something else");
}

Secondly, return value based:

validNumber = ParseInt(string, this->value);
if(!validNumber)
{
    return false;
}

The return value based method makes use of an out parameter, one that modifies its argument. This is undesirable because it's not obvious that that is what is happening. That particular point could be solved by a language which allows multiple return values, and it's also not a problem when there is a suitable sentinel value (such as null).
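As a sketch of the multiple-return-values shape (written in Python purely for illustration; parse_int is a made-up helper):

def parse_int(text):
    # returns (succeeded, value); value is None when parsing fails
    # (int itself reports failure with an exception, so we translate here)
    try:
        return True, int(text)
    except ValueError:
        return False, None

ok, value = parse_int("42")
assert ok and value == 42

ok, value = parse_int("forty-two")
assert not ok and value is None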

Because no exception object is involved, the return value based method cannot easily carry any information about why the call failed. This can be handled by another technique, such as a global variable that stores additional error information.

I think there is a slight readability win for the exception based method here, but both techniques come off fairly similarly. The real difference arises if we consider what happens when we don't explicitly handle the case of an invalid string.

this->value = ParseInt(string);

The exception based version will raise an uncaught exception, which falls into whatever uncaught-exception strategy I am using. In my case, that means the exception will probably take down the program and then send me an e-mail with a stack trace and any other relevant information.

ParseInt(string, this->value);

The return value version will simply try to continue as if nothing had happened. this->value will probably be left with whatever value was in it before. Nobody will notice that parsing did not happen as it should, and who knows what will result from that.

I think it is much better to die in a flaming crash when something like this happens than to try to continue on. If the program keeps running with invalid values, there is no telling what the result will be.

One might be inclined to argue that one should always check for this error. In some cases, however, the error is impossible to trigger: the numbers could come from a capture group in a regular expression, for example, and trying to handle the error there would be silly. But if a bug in my code leads to non-numeric data being pushed in anyway, I want to know.

The other item which advocates of "exceptions should be exceptional" bring up is that exceptions are slow. This isn't quite true. The implementations of exceptions in many popular languages are slow: they are designed to be used infrequently, so there is not much concern for making them fast. Other languages, such as Python, have an efficient implementation of exceptions. In other words, exceptions are slow because people use them infrequently, not the other way around; there is nothing inherently problematic about using exceptions frequently.
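Idiomatic Python leans on this all the time. A tiny illustration (my own example):

counts = {"spam": 3}

# handling the missing key through the exception path is an everyday
# Python idiom, not something reserved for rare failures
try:
    eggs = counts["eggs"]
except KeyError:
    eggs = 0

assert eggs == 0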

However, you have to code within the idioms of your particular language. You shouldn't write Python as if it were Java, and you shouldn't write Java as if it were Python. In languages where exceptions are discouraged during normal execution, don't throw them there; doing so makes your code unidiomatic for the language, and thus slow and possibly confusing.