views:

257

answers:

3

I have a question about whether or not a specific way of applying of the DRY principle is considered a good practice in Haskell.I'm going to present an example, and then ask whether the approach I'm taking is considered good Haskell style. In a nutshell, the question is this: when you have a long formula, and then you find yourself needing to repeat some small subsets of that formula elsewhere, do you always put that repeated subset of the formula into a variable so you can stay DRY? Why or why not?

The Example: Imagine we're taking a string of digits, and converting that string into its corresponding Int value. (BTW, this is an exercise from "Real World Haskell").

Here's a solution that works except that it ignores edge cases:

asInt_fold string = fst (foldr helper (0,0) string)
  where
    helper char (sum,place) = (newValue, newPlace)
      where 
        newValue = (10 ^ place) * (digitToInt char) + sum
        newPlace = place + 1

It uses foldr, and the accumulator is a tuple of the next place value and the sum so far.

So far so good. Now, when I went to implement the edge case checks, I found that I needed little portions of the "newValue" formula in different places to check for errors. For example, on my machine, there would be an Int overflow if the input was larger than (2^31 - 1), so the max value I could handle is 2,147,483,647. Therefore, I put in 2 checks:

  1. If the place value 9 (the billions place) and the digit value is > 2, there's an error.
  2. If sum + (10 ^ place) * (digitToInt char) > maxInt, there's an error.

Those 2 checks caused me to repeat part of the formula, so I introduced the following new variables:

  • digitValue = digitToInt char
  • newPlaceComponent = (10^place) * digitValue

The reason I introduced those variables is merely an automatic application of the DRY principle: I found myself repeating those portions of the formula, so I defined them once and only once.

However, I wonder if this is considered good Haskell style. There are obvious advantages, but I see disadvantages as well. It definitely makes the code longer, whereas much of the Haskell code I've seen is pretty terse.

So, do you consider this good Haskell style, and do you follow this practice, or not? Why / why not?

And for what it's worth, here's my final solution that deals with a number of edge cases and therefore has quite a large where block. You can see how large the block became due to my application of the DRY principle.

Thanks.

asInt_fold "" = error "You can't be giving me an empty string now"
asInt_fold "-" = error "I need a little more than just a dash"
asInt_fold string | isInfixOf "." string = error "I can't handle decimal points"
asInt_fold ('-':xs) = -1 * (asInt_fold xs) 
asInt_fold string = fst (foldr helper (0,0) string)
  where
    helper char (sum,place) | place == 9 && digitValue > 2 = throwMaxIntError
            | maxInt - sum < newPlaceComponent      = throwMaxIntError
                   | otherwise                             = (newValue, newPlace)
      where
        digitValue =  (digitToInt char)
        placeMultiplier = (10 ^ place)
        newPlaceComponent = placeMultiplier * digitValue
        newValue = newPlaceComponent + sum
        newPlace = place + 1
        maxInt = 2147483647
        throwMaxIntError = 
                        error "The value is larger than max, which is 2147483647"
+2  A: 

I think the way you've done it makes sense.

You should certainly always break repeated computations out into separately defined values if avoiding repeated computation is important, but in this case that doesn't look necessary. Nevertheless, the broken out values have easy to understand names, so they make your code easier to follow. I don't think the fact that your code is a bit longer as a result is a bad thing.

BTW, instead hardcoding the maximum Int, you can use (maxBound :: Int) which avoids the risk of you making a mistake or another implementation with a different maximum Int breaking your code.

Ganesh Sittampalam
Cool, thanks for the response. I knew there was probably something like maxBound :: Int, but the book I'm using hasn't covered it yet so I just used the interpreter to find the breaking point and hardcoded the max.
Charlie Flowers
+7  A: 

DRY is just as good of a principle in Haskell as it is anywhere else :) A lot of the reason behind the terseness you speak of in haskell is that many idioms are lifted out into libraries, and that often those examples you look at have been considered very carefully to make them terse :)

For example, here's an alternate way to implement your digit-to-string algorithm:

asInt_fold ('-':n) = negate (asInt_fold n)
asInt_fold "" = error "Need some actual digits!"
asInt_fold str = foldl' step 0 str
    where
        step _ x
            | x < '0' || x > '9'
            = error "Bad character somewhere!"
        step sum dig =
            case sum * 10 + digitToInt dig of
                n | n < 0 -> error "Overflow!"
                n -> n

A few things to note:

  1. We detect overflow when it happens, not by deciding arbitrary-ish limits on what digits we allow. This signifigantly simplifies the overflow detection logic - and makes it work on any integer type from Int8 to Integer [as long as overflow results in wraparound, doesn't occur, or results in an assertion from the addition operator itself]
  2. By using a different fold, we don't need two seperate states.
  3. No repeating ourselves, even without going out of our way to lift things out - it falls naturally out of re-stating what we're trying to say.

Now, it's not always possible to just reword the algorithm and make the duplication go away, but it's always useful to take a step back and reconsider how you've been thinking about the problem :)

bdonlan
Very helpful. I had already found out from other people's comments online that foldl would be better, because then the accumulator could be a simple Int. I did not know that you could "name" the result of a case expression the way you did with "n". Is there a case where the overflow is so large that you would not get a negative number (even though the number would still be incorrect)?
Charlie Flowers
In twos-complement, if the overflow was so large you wouldn't get a negative number, then newPlaceComponent would also wraparound. You really need to check for overflow whilst doing the exponentiation to guard against that. Or use Integer for the intermediate computations and do an overflow check before converting to Int. Obviously there are efficiency trade-offs here.
Ganesh Sittampalam
+2  A: 

As noted by bdonlan, your algorithm could be cleaner---it's especially useful that the language itself detects overflow. As for your code itself and the style, I think the main tradeoff is that each new name imposes a small cognitive burden on the reader. When to name an intermediate result becomes a judgment call.

I personally would not have chosen to name placeMultiplier, as I think the intent of place ^ 10 is much clearer. And I would look for maxInt in the Prelude, as you run the risk of being terribly wrong if run on 64-bit hardware. Otherwise, the only thing I find objectionable in your code are the redundant parentheses. So what you have is an acceptable style.

(My credentials: At this point I have written on the order of 10,000 to 20,000 lines of Haskell code, and I have read perhaps two or three times that. I also have ten times that much experience with the ML family of languages, which require the programmer to make similar decisions.)

Norman Ramsey
Thanks. I was surprised that so far no one argued for the other side. Clearly there are drawbacks to large numbers of variables. Haskell in particular seems to value small code size, and I think for good reason. Thanks very much for your feedback.
Charlie Flowers
So, in this example, you see a kind of trade-off between DRY and terseness, yes? Both good values, and in this particular case, they begin to compete with each other. Is that a fair summary of your point of view?
Charlie Flowers