views:

1485

answers:

5

I'm porting a process which creates a MASSIVE CROSS JOIN of two tables. The resulting table contains 15m records (looks like the process makes a 30m cross join with a 2600 row table and a 12000 row table and then does some grouping which must split it in half). The rows are relatively narrow - just 6 columns. It's been running for 5 hours with no sign of completion. I only just noticed the count discrepancy between the known good and what I would expect for the cross join, so my output doesn't have the grouping or deduping which will halve the final table - but this still seems like it's not going to complete any time soon.

First I'm going to look to eliminate this table from the process if at all possible - obviously it could be replaced by joining to both tables individually, but right now I do not have visibility into everywhere else it is used.

But given that the existing process does it (in less time, on a less powerful machine, using the FOCUS language), are there any options for improving the performance of large CROSS JOINs in SQL Server (2005) (hardware is not really an option, this box is a 64-bit 8-way with 32-GB of RAM)?

Details:

It's written this way in FOCUS (I'm trying to produce the same output, which is a CROSS JOIN in SQL):

JOIN CLEAR *
DEFINE FILE COSTCENT
  WBLANK/A1 = ' ';
  END
TABLE FILE COSTCENT
  BY WBLANK BY CC_COSTCENT
  ON TABLE HOLD AS TEMPCC FORMAT FOCUS
  END

DEFINE FILE JOINGLAC
  WBLANK/A1 = ' ';
  END
TABLE FILE JOINGLAC
  BY WBLANK BY ACCOUNT_NO BY LI_LNTM
  ON TABLE HOLD AS TEMPAC FORMAT FOCUS INDEX WBLANK

JOIN CLEAR *
JOIN WBLANK IN TEMPCC TO ALL WBLANK IN TEMPAC
DEFINE FILE TEMPCC
  CA_JCCAC/A16=EDIT(CC_COSTCENT)|EDIT(ACCOUNT_NO);
  END
TABLE FILE TEMPCC
  BY CA_JCCAC BY CC_COSTCENT AS COST CENTER BY ACCOUNT_NO
  BY LI_LNTM
  ON TABLE HOLD AS TEMPCCAC
  END

So the required output really is a CROSS JOIN (it's joining a blank column from each side).

In SQL:

CREATE TABLE [COSTCENT](
       [COST_CTR_NUM] [int] NOT NULL,
       [CC_CNM] [varchar](40) NULL,
       [CC_DEPT] [varchar](7) NULL,
       [CC_ALSRC] [varchar](6) NULL,
       [CC_HIER_CODE] [varchar](20) NULL,
 CONSTRAINT [PK_LOOKUP_GL_COST_CTR] PRIMARY KEY NONCLUSTERED
(
       [ID] ASC
)WITH (PAD_INDEX  = OFF, STATISTICS_NORECOMPUTE  = OFF, IGNORE_DUP_KEY
= OFF, ALLOW_ROW_LOCKS  = ON, ALLOW_PAGE_LOCKS  = ON) ON [PRIMARY]
) ON [PRIMARY]

CREATE TABLE [JOINGLAC](
       [ACCOUNT_NO] [int] NULL,
       [LI_LNTM] [int] NULL,
       [PR_PRODUCT] [varchar](5) NULL,
       [PR_GROUP] [varchar](1) NULL,
       [AC_NAME_LONG] [varchar](40) NULL,
       [LI_NM_LONG] [varchar](30) NULL,
       [LI_INC] [int] NULL,
       [LI_MULT] [int] NULL,
       [LI_ANLZ] [int] NULL,
       [LI_TYPE] [varchar](2) NULL,
       [PR_SORT] [varchar](2) NULL,
       [PR_NM] [varchar](26) NULL,
       [PZ_SORT] [varchar](2) NULL,
       [PZNAME] [varchar](26) NULL,
       [WANLZ] [varchar](3) NULL,
       [OPMLNTM] [int] NULL,
       [PS_GROUP] [varchar](5) NULL,
       [PS_SORT] [varchar](2) NULL,
       [PS_NAME] [varchar](26) NULL,
       [PT_GROUP] [varchar](5) NULL,
       [PT_SORT] [varchar](2) NULL,
       [PT_NAME] [varchar](26) NULL
) ON [PRIMARY]

CREATE TABLE [JOINCCAC](
       [CA_JCCAC] [varchar](16) NOT NULL,
       [CA_COSTCENT] [int] NOT NULL,
       [CA_GLACCOUNT] [int] NOT NULL,
       [CA_LNTM] [int] NOT NULL,
       [CA_UNIT] [varchar](6) NOT NULL,
 CONSTRAINT [PK_JOINCCAC_KNOWN_GOOD] PRIMARY KEY CLUSTERED
(
       [CA_JCCAC] ASC
)WITH (PAD_INDEX  = OFF, STATISTICS_NORECOMPUTE  = OFF, IGNORE_DUP_KEY
= OFF, ALLOW_ROW_LOCKS  = ON, ALLOW_PAGE_LOCKS  = ON) ON [PRIMARY]
) ON [PRIMARY]

With the SQL Code:

INSERT  INTO [JOINCCAC]
       (
        [CA_JCCAC]
       ,[CA_COSTCENT]
       ,[CA_GLACCOUNT]
       ,[CA_LNTM]
       ,[CA_UNIT]
       )
       SELECT  Util.PADLEFT(CONVERT(varchar, CC.COST_CTR_NUM), '0',
                                     7)
               + Util.PADLEFT(CONVERT(varchar, GL.ACCOUNT_NO), '0',
                                       9) AS CC_JCCAC
              ,CC.COST_CTR_NUM AS CA_COSTCENT
              ,GL.ACCOUNT_NO % 900000000 AS CA_GLACCOUNT
              ,GL.LI_LNTM AS CA_LNTM
              ,udf_BUPDEF(GL.ACCOUNT_NO, CC.COST_CTR_NUM, GL.LI_LNTM, 'N') AS CA_UNIT
       FROM   JOINGLAC AS GL
       CROSS JOIN COSTCENT AS CC

Depending on how this table is subsequently used, it should be able to be eliminated from the process, by simply joining to both the original tables used to build it. However, this is an extremely large porting effort, and I might not find the usage of the table for some time, so I was wondering if there were any tricks to CROSS JOINing big tables like that in a timely fashion (especially given that the existing process in FOCUS is able to do it more speedily). That way I could validate the correctness of my building of the replacement query and then later factor it out with views or whatever.

I am also considering factoring out the UDFs and string manipulation and performing the CROSS JOIN first to break the process up a bit.

RESULTS SO FAR:

It turns out that the UDFs do contribute a lot (negatively) to the performance. But there also appears to be a big difference between a 15m row cross join and a 30m row cross join. I do not have SHOWPLAN rights (boo hoo), so I can't tell whether the plan it is using is better or worse after changing indexes. I have not refactored it yet, but am expecting the entire table to go away shortly.

A: 

Could you give the sample table schema & why is the cross join being done? Example output?

That will help understand the real issue and a way in which query can be re-written.

shahkalpesh
Added the original code I'm porting to the question.
Cade Roux
shahkalpesh
Added the SQL code, too.
Cade Roux
+1  A: 

Break down the query to make it a plain simple cross join.


   SELECT  CC.COST_CTR_NUM, GL.ACCOUNT_NO
              ,CC.COST_CTR_NUM AS CA_COSTCENT
              ,GL.ACCOUNT_NO AS CA_GLACCOUNT
              ,GL.LI_LNTM AS CA_LNTM
-- I don't know what is BUPDEF doing? but remove it from the query for time being
--              ,udf_BUPDEF(GL.ACCOUNT_NO, CC.COST_CTR_NUM, GL.LI_LNTM, 'N') AS CA_UNIT
       FROM   JOINGLAC AS GL
       CROSS JOIN COSTCENT AS CC

See how good is the simple cross join? (without any functions applied on it)

shahkalpesh
If that works fast, try just doing SELECT (with the functions applied) and see, if that is still OK?
shahkalpesh
posted results so far
Cade Roux
The UDF performance (they are scalar and do not access tables) was the key and is horrible - the two UDFs can only process around 300 rows per second. I am currently engaged in finding workarounds.
Cade Roux
+2  A: 

Examining that query shows only one column used from one table, and only two columns used from the other table. Due to the very low numbers of columns used, this query can be easily enhanced with covering indexes:

CREATE INDEX COSTCENTCoverCross ON COSTCENT(COST_CTR_NUM)
CREATE INDEX JOINGLACCoverCross ON JOINGLAC(ACCOUNT_NO, LI_LNTM)

Here are my questions for further optimization:

When you put the query in query analyzer and whack the "show estimated execution plan" button, it will show a graphical representation of what it's going to do.

Join Type: There should be a nested loop join in there. (the other options are merge join and hash join). If you see nested loop, then ok. If you see merge join or hash join, let us know.

Order of table access: Go all the way to the top and scroll all the way to the right. The first step should be accessing a table. Which table is that and what method is used(index scan, clustered index scan)? What method is used to access the other table?

Parallelism: You should see the little jaggedy arrows on almost all icons in the plan indicating that parallelism is being used. If you don't see this, there is a major problem!

That udf_BUPDEF concerns me. Does it read from additional tables? Util.PADLEFT concerns me less, but still.. what is it? If it isn't a Database Object, then consider using this instead:

RIGHT('z00000000000000000000000000' + columnName, 7)

Are there any triggers on JOINCCAC? How about indexes? With an insert this large, you'll want to drop all triggers and indexes on that table.

David B
BUPDEF is a huge business logic function which was ported. Hopefully it will go away (like this table), but there are no lookups in it (or the other UDF I left out for clarity)
Cade Roux
The UDF performance was the key and is horrible - the two UDFs can only process around 300 rows per second. I am currently engaged in finding workarounds.
Cade Roux
+2  A: 

Continuing on what others a saying, DB functions that contained queries which are used in a select always made my queries extremely slow. Off the top of my head, I believe i had a query run in 45 seconds, then I removed the function, and then result was 0 seconds :)

So check udf_BUPDEF is not doing any queries.

Ben Dempsey
The UDF performance (they are scalar and do not access tables) was the key and is horrible - the two UDFs can only process around 300 rows per second. I am currently engaged in finding workarounds.
Cade Roux
A: 

So, what was the real issue?

It will be great, if you could share your findings.

shahkalpesh
I have not been able to re-test it, but the 30m row problem was due to me not filtering a table on date, so it came down to 2 hours and completed the 15m rows). I suspect it's actually outperforming Focus, which is really my #1 concern. Then optimize the hell out of it or optimize it away.
Cade Roux
The UDF performance (they are scalar and do not access tables) was the key and is horrible - the two UDFs can only process around 300 rows per second. I am currently engaged in finding workarounds.
Cade Roux