Tuesday, May 17, 2011

Reworking Code

Over on http://codereview.stackexchange.com, I recently reworked some python that code that somebody posted for review. Afterwards, they asked how I managed to see so many things to could be improved. I didn't know, so I'm going to try and rework the process keeping notes as I go.

See the request here: http://codereview.stackexchange.com/questions/2449/parsing-huge-xml-file-with-lxml-etree-iterparse-in-python/2454#2454

Firstly we have this line:

def fast_iter2(context, cursor):

Whenever a name appears in the code, I ask: is it a good name? A good name follows three rules:

  1. It shouldn't be an abbrevation unless the abbreviation is really common
  2. It should indicate what it is naming
  3. It shouldn't be overly generic
elements = set(['article', 'inproceedings', 'proceedings', 'book', 'incollection', 'phdthesis', "mastersthesis", "www"])

This line has a problematic name because elements is too generic. Additionally is storing the set as a local variable when its actually a constant. Its a constant by virtue of the fact that its never modified. Constants should generally be put at the module level rather then inside the function.

paper = {} # represents a paper with all its tags.

Somehow it seems innocuous. But this line is evil. Well, the line isn't actually evil. The problem is all of the places within the function that assign values into paper. Most problematically, paper is reset within the loop.
But why is that a problem? Code should look like what its trying to do. On first glance, assignments into or of paper seem randomly distributed through the loops. The question is what do we really want to loop over? Really, we want to loop over one paper at a time. But how do we do that?
A quick look at the documentation shows that when we see an element in iterparse, the element and all of its children have been parsed. This means that we can safely use the xml elementtree functions to access the information. One code rearrangement later we've massively simplified the code.

if paper["element"] in ['phdthesis', "mastersthesis", "www"]:
    pass # throw away "unwanted" records.
    populate_database(paper, authors, cursor)

The if block does nothing. As a general rule one should never include code that does nothing. It makes you look silly. It is easy to avoid by using "not in" rather then "in."

paperCounter += 1
print paperCounter

This bothers me even though I didn't fix it. Variables that count things are usually unnecessary in python. Its possible to eliminate the explicit counting, but I'll leave that as an exercise to the reader.

while element.getprevious() is not None:
    del element.getparent()[0]

Different people have different perspective on when one should break out a function. I think this should be broken into its own function because its a break from what the rest of the code is doing. Its just too different, and therefore I want to push it off into a function so the interruption is limited.

del context

Using del on the names in a function is rarely helpful. Just don't do it. It has no effect here, but even aside from that its not a good plan.

Monday, May 16, 2011

Should Exceptions be Exceptional?

It is frequently stated that exceptions should be exceptional. The idea behind this statement is that one should not throw (or catch) exceptions which will be thrown during the normal course of program operation. The idea is that one should use standard conditional logic for that case and only use exceptions to deal with situations that are truly exceptional.

For example consider a simple ParseInt function. This function converts a string such as "42" into a number such as 42. However, the string can contain non-numeric characters or perhaps be too large, causing the function to fail. There are three strategies to deal with this:
  1. Throw an exception if the function fails.
  2. Provide a CanParseInt function which checks whether the string can be parsed.
  3. Provide an additional boolean output from ParseInt indicating whether it has succeeded.
 Advocates of "exceptions should be exceptional" prefer options 2 or 3 to option 1. They argue that you shouldn't throw an exception when its sensible to use typical conditional logic i.e. if blocks. It is a fairly regular occurrence to attempt to parse a string which doesn't actually contain an integer.

The CanX, DoX method of handling this problem is a bad idea for a number of reasons. The idea behind this method is that you test whether an action will succeed before actually attempting it. The pattern looks something like:

if( CanParseInt(string) )
    value = ParseInt(string);
    return false; // indicate failure

The first problem is that the code specifies what it intends to do twice. Both CanParseInt and ParseInt indicate the same information. Having them both there just increases the clutter in the code. Additionally, the two functions will duplicate effort. CanParseInt has to spend all the effort necessary to ensure that all the characters in string are valid digits. Parse is just going to have to do all of that work over again.

Another scenario has an additional problem.

if( CanOpenFile(filename) )
    File file = OpenFile(filename);
    // play with file

The problem is that anything can happen between CanOpenfile and OpenFile. The file could be deleted, opened by another program, etc. The fact that CanOpenFile returns true does not guarantee that OpenFile will succeed.

Using the CanX/DoX method of handling errors is broken. It complicates the code, waste processor time, and is sometimes just plain wrong. Clearly, that is not a good strategy. This leaves the two remaining methods. Either we throw an exception or somehow indicate via return value or similar that the function failed. Both are somewhat similar in that you try to do something, and then get a report back if it fails.

The question is, which is a better way of indicating failure. Let's look at some code

Firstly, exception based:
    this->value = ParseInt(string;)
catch( NumberParseError )
    throw new UserInputError("Expected something else");

Secondly, return value based:

validNumber = ParseInt(string, this->value);
    return false;

The return value based method makes use a out parameter, one that modifies its argument. This is undesirable because its not obvious that is what is happening. However, that particular point could be solved by using a language which allowed multiple return values. Also, its not a problem if there is a suitable sentinel value (such as null)

The fact that exceptions are not being used prevents the return value based method from including any information about why it failed. This can be handled by making use of another technique such global variables or similar to store additional error information.

I think there is a slight readability win for the exception based method here. However, both techniques come off fairly similarly. However, difference arise if we consider what happens if we don't explicitly handle the case of an invalid string.

this->value = ParseInt(string;)

The exception based version will raise an uncaught exception which will fall into whatever uncaught exception strategy that I am using. In my case that means the exception will probably take down the program and then send me an e-mail with a stack trace and any other information.

ParseInt(string, this->value);

The return value version will simply try to continue as if nothing has happened. this->value will probably be left with whatever value was in it before. Nobody will notice that parsing did not happen as it should and who knows what will result from that.

I think it much better to die in a flaming crash when something like this happens rather then to try and continue on. If one continue one using invalid options there is not telling what the result will be.

One might be inclined to argue that one should always check for this error. However, in some cases it will be impossible to get the error. For example, the numbers could be derived from a capture expression in a regular expression. In that case trying to handle the error would be silly. However, if a bug in my code leads to non-numeric data being pushed into there, I want to know.

The other item which advocates of "exceptions are exceptional" will bring up is that exceptions are slow. This isn't quite true. The implementation of exceptions in many popular languages are slow. They are designed to be used infrequently, and thus there is not a great deal of concern for making them fast. On the other hand, some languages such as Python have an efficient implementation of exceptions. The exceptions are slow because people use them infrequently. There is nothing inherently problematic about using exceptions frequently.

However, you have to code within the idioms of a particular language. You shouldn't write python as if it was Java, and you shouldn't write Java as if it was python. In languages where exceptions are discouraged during normal execution, you shouldn't throw exceptions. Doing so makes your unidiomatic for your language and thus slow and possible confusing.