If you modify the Horowitz-Sahni algorithm in the right way, then it's hardly slower than original Horowitz-Sahni.  Recall that Horowitz-Sahni works two lists of subset sums: Sums of subsets in the left half of the original list, and sums of subsets in the right half.  Call these two lists of sums L and R.  To obtain subsets that sum to some fixed value A, you can sort R, and then look up a number in R that matches each number in L using a binary search.  However, the algorithm is asymmetric only to save a constant factor in space and time.  It's a good idea for this problem to sort both L and R.  
In my code below I also reverse L.  Then you can keep two pointers into R, updated for each entry in L:  A pointer to the last entry in R that's too low, and a pointer to the first entry in R that's too high.  When you advance to the next entry in L, each pointer might either move forward or stay put, but they won't have to move backwards.  Thus, the second stage of the Horowitz-Sahni algorithm only takes linear time in the data generated in the first stage, plus linear time in the length of the output.  Up to a constant factor, you can't do better than that (once you have committed to this meet-in-the-middle algorithm).
Here is a Python code with example input:
# Input
terms = [29371, 108810, 124019, 267363, 298330, 368607,
    438140, 453243, 515250, 575143, 695146, 840979, 868052, 999760]
(A,B) = (500000,600000)
# Subset iterator stolen from Sage
def subsets(X):
    yield []; pairs = []
    for x in X:
        pairs.append((2**len(pairs),x))
        for w in xrange(2**(len(pairs)-1), 2**(len(pairs))):
            yield [x for m, x in pairs if m & w]
# Modified Horowitz-Sahni with toolow and toohigh indices
L = sorted([(sum(S),S) for S in subsets(terms[:len(terms)/2])])
R = sorted([(sum(S),S) for S in subsets(terms[len(terms)/2:])])
(toolow,toohigh) = (-1,0)
for (Lsum,S) in reversed(L):
    while R[toolow+1][0] < A-Lsum and toolow < len(R)-1: toolow += 1
    while R[toohigh][0] <= B-Lsum and toohigh < len(R): toohigh += 1
    for n in xrange(toolow+1,toohigh):
        print '+'.join(map(str,S+R[n][1])),'=',sum(S+R[n][1])
"Moron" (I think he should change his user name) raises the reasonable issue of optimizing the algorithm a little further by skipping one of the sorts.  Actually, because each list L and R is a list of sizes of subsets, you can do a combined generate and sort of each one in linear time!  (That is, linear in the lengths of the lists.)  L is the union of two lists of sums, those that include the first term, term[0], and those that don't.  So actually you should just make one of these halves in sorted form, add a constant, and then do a merge of the two sorted lists.  If you apply this idea recursively, you save a logarithmic factor in the time to make a sorted L, i.e., a factor of N in the original variable of the problem.  This gives a good reason to sort both lists as you generate them.  If you only sort one list, you have some binary searches that could reintroduce that factor of N; at best you have to optimize them somehow.
At first glance, a factor of O(N) could still be there for a different reason:  If you want not just the subset sum, but the subset that makes the sum, then it looks like O(N) time and space to store each subset in L and in R.  However, there is a data-sharing trick that also gets rid of that factor of O(N).   The first step of the trick is to store each subset of the left or right half as a linked list of bits (1 if a term is included, 0 if it is not included).  Then, when the list L is doubled in size as in the previous paragraph, the two linked lists for a subset and its partner can be shared, except at the head:
     0
     |
     v
1 -> 1 -> 0 -> ...
Actually, this linked list trick is an artifact of the cost model and never truly helpful.  Because, in order to have pointers in a RAM architecture with O(1) cost, you have to define data words with O(log(memory)) bits.  But if you have data words of this size, you might as well store each word as a single bit vector rather than with this pointer structure.  I.e., if you need less than a gigaword of memory, then you can store each subset in a 32-bit word.  If you need more than a gigaword, then you have a 64-bit architecture or an emulation of it (or maybe 48 bits), and you can still store each subset in one word.  If you patch the RAM cost model to take account of word size, then this factor of N was never really there anyway.
So, interestingly, the time complexity for the original Horowitz-Sahni algorithm isn't O(N*2^(N/2)), it's O(2^(N/2)).  Likewise the time complexity for this problem is O(K+2^(N/2)), where K is the length of the output.