The first three columns exist. I am trying to create a formula for the fourth (HH_ANALYSIS_FLAG).

ACCOUNT_NUMBER   HOUSEHOLD_NUMBER   ACCOUNT_ANALYSIS_FLAG   HH_ANALYSIS_FLAG
1001             1                  1                       0
1002             2                  0                       0
1003             3                  1                       0
1004             3                  0                       0
1005             3                  0                       0
1006             2                  0                       0
1007             4                  0                       0
1008             1                  1                       0

I have 50,000 accounts. Each is flagged as being under analysis or not by the ACCOUNT_ANALYSIS_FLAG column (1 or 0). Every account belongs to a household, and multiple accounts can belong to the same household. I need the HH_ANALYSIS_FLAG column to evaluate to 1 if any account in the same household is under analysis, and to 0 otherwise. With the above data and a working formula, my spreadsheet would look like this:

ACCOUNT_NUMBER   HOUSEHOLD_NUMBER   ACCOUNT_ANALYSIS_FLAG   HH_ANALYSIS_FLAG
1001             1                  1                       1
1002             2                  0                       0
1003             3                  1                       1
1004             3                  0                       1
1005             3                  0                       1
1006             2                  0                       0
1007             4                  0                       0
1008             1                  1                       1
A: 

Insert another column D (you can hide it later), which is equal to the household number if it is being analyzed, and zero if it is not. The formula for D2 can be =B2*C2. Fill column D with this formula.

Then for your HH_ANALYSIS_FLAG column, you can count the number of values in column D which match the household in column B. The formula would be something like =IF(COUNTIF(D:D,"="&B2)>0,1,0).

I'm not sure whether this approach is fast enough for the 50,000 accounts, though.

          A                B                    C                    D                 E
1   ACCOUNT_NUMBER  HOUSEHOLD_NUMBER  ACCOUNT_ANALYSIS_FLAG  HH_UNDER_ANALYSIS HH_ANALYSIS_FLAG
2   1001            1                 1                      1 (=B2*C2)        =IF(COUNTIF(D:D,"="&B2)>0,1,0)  
3   1002            2                 0                      0 (=B3*C3)        =IF(COUNTIF(D:D,"="&B3)>0,1,0)
4   1003            3                 1                      3 (=B4*C4)        =IF(COUNTIF(D:D,"="&B4)>0,1,0)        
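
To make the helper-column logic concrete, here is a minimal Python sketch that mirrors the two formulas on the question's sample data. The `rows`, `helper`, and `hh_flags` names are illustrative assumptions and not part of the spreadsheet.

```python
# Hypothetical in-memory copy of the question's sample rows:
# (ACCOUNT_NUMBER, HOUSEHOLD_NUMBER, ACCOUNT_ANALYSIS_FLAG)
rows = [
    (1001, 1, 1), (1002, 2, 0), (1003, 3, 1), (1004, 3, 0),
    (1005, 3, 0), (1006, 2, 0), (1007, 4, 0), (1008, 1, 1),
]

# Column D (=B*C): the household number if the account is under analysis, else 0.
helper = [hh * flag for _, hh, flag in rows]

# Column E: 1 if the row's household number appears anywhere in column D.
# list.count scans the whole list, much like COUNTIF over D:D.
hh_flags = [1 if helper.count(hh) > 0 else 0 for _, hh, _ in rows]

print(hh_flags)  # [1, 0, 1, 1, 1, 0, 0, 1]
```

Note that this approach relies on no household being numbered 0, since a household number of 0 would match the zeros written for unflagged accounts.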
Justin
+3  A: 

The following formula should do the trick. In fact, it will give you the total number of accounts being analysed per household.

    A        B       C                  D
1   ACC_NUM  HH_NUM  ACC_ANALYSIS_FLAG  HH_ANALYSIS_FLAG      
2   1001     1       1                  =SUMIF(B$2:B$50001, B2, C$2:C$50001)
3   1002     2       0                  =SUMIF(B$2:B$50001, B3, C$2:C$50001)
4   1003     3       1                  =SUMIF(B$2:B$50001, B4, C$2:C$50001)

For each row, this selects the set of rows that share the value in the HH_NUM column (based on the row containing the formula) and sums the values in the corresponding ACC_ANALYSIS_FLAG cells. This gives you the total number of accounts under analysis for the given household. Compare the result to 0 if you only need a boolean value, for example =IF(SUMIF(B$2:B$50001, B2, C$2:C$50001)>0,1,0).
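
As a cross-check of what the per-household sum should produce, the following Python sketch computes the same totals for the question's sample data. It is only an illustration of the result, not of how SUMIF works internally. The `household_counts` function and the `rows` list are hypothetical names introduced here.

```python
from collections import defaultdict

# Hypothetical copy of the sample rows:
# (ACCOUNT_NUMBER, HOUSEHOLD_NUMBER, ACCOUNT_ANALYSIS_FLAG)
rows = [
    (1001, 1, 1), (1002, 2, 0), (1003, 3, 1), (1004, 3, 0),
    (1005, 3, 0), (1006, 2, 0), (1007, 4, 0), (1008, 1, 1),
]

def household_counts(rows):
    """Return, for each row, the number of flagged accounts in its household."""
    totals = defaultdict(int)
    for _, hh, flag in rows:
        totals[hh] += flag          # one pass to total the flags per household
    return [totals[hh] for _, hh, _ in rows]

counts = household_counts(rows)
flags = [1 if c > 0 else 0 for c in counts]  # compare to 0 for a boolean flag

print(counts)  # [2, 0, 1, 1, 1, 0, 0, 2]
print(flags)   # [1, 0, 1, 1, 1, 0, 0, 1]
```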

EDIT:

Apparently the performance of this isn't up to snuff. However, assuming that the rows for each household are colocated (adjacent to one another), it should be possible to speed things up significantly by changing to something like the following.

2    1001     1       1                  =SUMIF(B2:B5,  B2,  C2:C5)
3    1002     2       0                  =SUMIF(B2:B6,  B3,  C2:C6)
4    1003     2       0                  =SUMIF(B2:B7,  B4,  C2:C7)
5    1004     2       0                  =SUMIF(B2:B8,  B5,  C2:C8)
6    1005     2       0                  =SUMIF(B3:B9,  B6,  C3:C9)
7    1006     2       0                  =SUMIF(B4:B10, B7,  C4:C10)
8    1007     2       0                  =SUMIF(B5:B11, B8,  C5:C11)
9    1008     2       0                  =SUMIF(B6:B12, B9,  C6:C12)
10   1009     2       0                  =SUMIF(B7:B13, B10, C7:C13)

This assumes that there are at most 4 accounts per household, and thus limits the range of the SUMIF to the current cell +/- 3 rows.

To avoid referencing invalid cells, the first and last rows have to be treated as special cases. If you need a single formula for all of these cells, I think it should be possible to use OFFSET in combination with MAX, MIN and ROW to generate the appropriate ranges with just a little arithmetic.
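
The range arithmetic for the special cases can be sketched outside the spreadsheet. The Python snippet below is a minimal illustration that generates the per-row formula text with the window clamped to the data rows. The `window_formula` name and its defaults (data in rows 2 to 50001, a window of 3 rows either side) are assumptions based on the example above.

```python
def window_formula(row, first_row=2, last_row=50001, radius=3):
    """Build the SUMIF text for one row, clamping the window to the data rows."""
    top = max(first_row, row - radius)      # never reach above the first data row
    bottom = min(last_row, row + radius)    # never reach below the last data row
    return f"=SUMIF(B{top}:B{bottom}, B{row}, C{top}:C{bottom})"

print(window_formula(2))      # =SUMIF(B2:B5, B2, C2:C5)
print(window_formula(6))      # =SUMIF(B3:B9, B6, C3:C9)
print(window_formula(50001))  # =SUMIF(B49998:B50001, B50001, C49998:C50001)
```

The same MAX and MIN clamping is what an OFFSET-based formula would need to reproduce inside the sheet.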

torak
I hesitantly accept this as it was the first that performed what I needed. I can't actually use it, though, as the performance is too poor. Thanks all for the help, though!
Kenneth
@Kenneth: Not sure, but you could probably speed things up. Edited above to suggest how.
torak
Unfortunately not. With this many records I have no way of knowing the quantity or location of the unique HHNUMs. Thanks though.
Kenneth
@Kenneth: You might want to consider sorting them by household number, then. Quicksort is on average O(n log(n)), which, combined with the smaller SUMIF range, might yield a speed-up on the order of n/log(n) compared to the original formula I suggested, which is O(n^2). If my math is right, then for 50,000 rows that works out to something like 3,000 times faster. Egads! Is that right?
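
The estimate can be checked with a few lines of Python. This is only a back-of-the-envelope sketch: it assumes a base-2 logarithm and ignores constant factors and the cost of the small windowed SUMIFs, so the figure is a rough order of magnitude rather than a measured speed-up.

```python
import math

n = 50_000

full_scan_cost = n * n             # every row scans the whole column: O(n^2)
sorted_cost = n * math.log2(n)     # dominated by the sort: O(n log n)

speedup = full_scan_cost / sorted_cost   # simplifies to n / log2(n)
print(round(speedup))                    # roughly 3,200
```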
torak
A: 

Presuming your HOUSEHOLD_NUMBER column is column B and your ACCOUNT_ANALYSIS_FLAG column is column C, the following formula in row 2, filled down:

=IF(SUMIF(B:B,B2,C:C)>0,1,0)

should do it.

Tom