Tuesday, May 17, 2011

Reworking Code






Over on http://codereview.stackexchange.com, I recently reworked some python that code that somebody posted for review. Afterwards, they asked how I managed to see so many things to could be improved. I didn't know, so I'm going to try and rework the process keeping notes as I go.

See the request here: http://codereview.stackexchange.com/questions/2449/parsing-huge-xml-file-with-lxml-etree-iterparse-in-python/2454#2454

Firstly we have this line:

def fast_iter2(context, cursor):

Whenever a name appears in the code, I ask: is it a good name? A good name follows three rules:

  1. It shouldn't be an abbrevation unless the abbreviation is really common
  2. It should indicate what it is naming
  3. It shouldn't be overly generic
elements = set(['article', 'inproceedings', 'proceedings', 'book', 'incollection', 'phdthesis', "mastersthesis", "www"])

This line has a problematic name because elements is too generic. Additionally is storing the set as a local variable when its actually a constant. Its a constant by virtue of the fact that its never modified. Constants should generally be put at the module level rather then inside the function.

paper = {} # represents a paper with all its tags.

Somehow it seems innocuous. But this line is evil. Well, the line isn't actually evil. The problem is all of the places within the function that assign values into paper. Most problematically, paper is reset within the loop.
But why is that a problem? Code should look like what its trying to do. On first glance, assignments into or of paper seem randomly distributed through the loops. The question is what do we really want to loop over? Really, we want to loop over one paper at a time. But how do we do that?
A quick look at the documentation shows that when we see an element in iterparse, the element and all of its children have been parsed. This means that we can safely use the xml elementtree functions to access the information. One code rearrangement later we've massively simplified the code.

if paper["element"] in ['phdthesis', "mastersthesis", "www"]:
    pass # throw away "unwanted" records.
else:
    populate_database(paper, authors, cursor)

The if block does nothing. As a general rule one should never include code that does nothing. It makes you look silly. It is easy to avoid by using "not in" rather then "in."

paperCounter += 1
print paperCounter

This bothers me even though I didn't fix it. Variables that count things are usually unnecessary in python. Its possible to eliminate the explicit counting, but I'll leave that as an exercise to the reader.

element.clear()
while element.getprevious() is not None:
    del element.getparent()[0]

Different people have different perspective on when one should break out a function. I think this should be broken into its own function because its a break from what the rest of the code is doing. Its just too different, and therefore I want to push it off into a function so the interruption is limited.

del context

Using del on the names in a function is rarely helpful. Just don't do it. It has no effect here, but even aside from that its not a good plan.

No comments:

Post a Comment