views: 1086

answers: 6

Hi All,

I have a very large table (8 GB) with information about files, and I need to run a report against it that would look something like this:

(select * from fs_walk_scan where file_path like '\\\\server1\\groot$\\%' order by file_size desc limit 0,30)
UNION ALL
(select * from fs_walk_scan where file_path like '\\\\server1\\hroot$\\%' order by file_size desc limit 0,30)
UNION ALL
(select * from fs_walk_scan where file_path like '\\\\server1\\iroot$\\%' order by file_size desc limit 0,30)
UNION ALL
(select * from fs_walk_scan where file_path like '\\\\server2\\froot$\\%' order by file_size desc limit 0,30)
UNION ALL
(select * from fs_walk_scan where file_path like '\\\\server2\\groot$\\%' order by file_size desc limit 0,30)
UNION ALL
(select * from fs_walk_scan where file_path like '\\\\server3\\hroot$\\%' order by file_size desc limit 0,30)
UNION ALL
(select * from fs_walk_scan where file_path like '\\\\server4\\iroot$\\%' order by file_size desc limit 0,30)
UNION ALL
(select * from fs_walk_scan where file_path like '\\\\server5\\iroot$\\%' order by file_size desc limit 0,30)
[...]
order by substring_index(file_path,'\\',4), file_size desc

This method accomplishes what I need: it gets a list of the 30 biggest files for each volume. However, it is deathly slow, and the 'like' patterns are hardcoded even though they live in another table and could be pulled from there.

What I'm looking for is a way to do this without going through the huge table several times. Anyone have any ideas?

Thanks.

P.S. I can't change the structure of the huge source table in any way.

Update: There are indexes on file_path and file_size, but each one of those subqueries still takes about 10 minutes, and I have to run at least 22 of them.

+2  A: 

What kind of indexes do you have on that table? This index:

CREATE INDEX fs_search_idx ON fs_walk_scan(file_path, file_size desc)

would speed this query up significantly... if you don't already have one like it.

Update:

You said there are already indexes on file_path and file_size... are they individual indexes? Or is there one single index with both columns indexed together? The difference would be huge for this query. Even with 22 subqueries, if indexed right, this should be blazing fast.

bobwienholt
I think this would be impossible with the condition that changes to the structure of the db are not allowed
warren
They are individual, but I don't see how combining them into one would help.
Zenshai
Essentially, you are searching on file_path and then within that you are picking the 30 largest files (using LIMIT and ORDER BY). So you are searching on two columns at the same time. If both pieces of data you are searching on are available in the index together, searching is fast.
bobwienholt
A: 

How about something like this (haven't tested it, but looks close):

select * from fs_walk_scan where file_path like '\\\\server%' and file_path like '%root$\\%' order by file_size desc

This way you're doing a pair of comparisons on the single field, which will generically match what you've described. It may also be possible to use a regex, but I've not tried it.

warren
Are you suggesting doing this in each UNIONed query to improve performance (cause that could help) or did I misunderstand?
Zenshai
+2  A: 

You could use a regexp:

select * from fs_walk_scan
  where file_path regexp '^\\\\server(1\\[ghi]|2\\[fg]|3\\h|[45]\\i)root\\$\\'

Otherwise if you can modify your table structure, add two columns to hold the server name and base path (and index them), so that you can create a simpler query:

select * from fs_walk_scan
  where server = 'server1' and base_path in ('groot$', 'hroot$', 'iroot$')
     or server = 'server2' and base_path in ('froot$', 'groot$')

You can either set up a trigger to initialise the fields when you insert the record, or else do a bulk update afterwards to fill in the two extra columns.

dland
A: 

Thanks, these are helpful. However this doesn't solve my main problem.

If I do (for example):

select * from fs_walk_scan
  where file_path regexp '^\\\\server(1\\[ghi]|2\\[fg]|3\\h|[45]\\i)root\\$\\'
  order by file_size desc
  LIMIT 0,30

I will get the top 30 files across all 8 volumes, 30 rows in total. What I am trying to get is the top 30 per volume, 240 rows in total.
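
To make it concrete, here's a toy sketch (in Python, with made-up paths, not the real table) of the result I'm after, where the volume is the same prefix that substring_index(file_path, '\\', 4) pulls out:

```python
from collections import defaultdict

# Toy rows (file_path, file_size) -- made-up data, not the real table.
rows = [
    ('\\\\server1\\groot$\\a.txt', 500),
    ('\\\\server1\\groot$\\b.txt', 900),
    ('\\\\server1\\groot$\\c.txt', 100),
    ('\\\\server2\\froot$\\d.txt', 700),
    ('\\\\server2\\froot$\\e.txt', 300),
]

def volume(path):
    # Same idea as MySQL's substring_index(file_path, '\\', 4):
    # keep everything before the 4th backslash, e.g. \\server1\groot$
    return '\\'.join(path.split('\\')[:4])

def top_n_per_volume(rows, n):
    groups = defaultdict(list)
    for path, size in rows:
        groups[volume(path)].append((path, size))
    # largest files first within each volume, truncated to n
    return {vol: sorted(files, key=lambda f: f[1], reverse=True)[:n]
            for vol, files in groups.items()}
```

With n=30 and 8 volumes this is the 240-row result the UNION ALL query produces, just stated in plain code.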

I think this is going to need some looping and variables. Unfortunately, I don't know how to use that stuff in MySQL.

Zenshai
+1  A: 

You could do something like this... assuming fs_list has a list of your "LIKE" searches:

DELIMITER $$

DROP PROCEDURE IF EXISTS `test`.`proc_fs_search` $$
CREATE PROCEDURE `test`.`proc_fs_search` ()
BEGIN

DECLARE cur_path VARCHAR(255);
DECLARE done INT DEFAULT 0;


DECLARE list_cursor CURSOR FOR select file_path from fs_list;

DECLARE CONTINUE HANDLER FOR NOT FOUND SET done = 1;

SET @sql_query = '';

OPEN list_cursor;

REPEAT
  FETCH list_cursor INTO cur_path;

  IF NOT done THEN
    IF @sql_query <> '' THEN
      SET @sql_query = CONCAT(@sql_query, ' UNION ALL ');
    END IF;

    SET @sql_query = CONCAT(@sql_query, ' (select * from fs_walk_scan where file_path like ''', cur_path , ''' order by file_size desc limit 0,30)');
  END IF;

UNTIL done END REPEAT;

SET @sql_query = CONCAT(@sql_query, ' order by file_path, file_size desc');

PREPARE stmt FROM @sql_query;
EXECUTE stmt;
DEALLOCATE PREPARE stmt;

END $$

DELIMITER ;
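
If it's easier to see outside of SQL, here's the same string-building logic the cursor loop does, sketched in Python (the patterns list is a stand-in for whatever is in fs_list):

```python
def build_union_query(patterns, limit=30):
    # Mirrors the cursor loop above: one parenthesised subquery per
    # LIKE pattern, glued together with UNION ALL.
    subqueries = [
        "(select * from fs_walk_scan"
        " where file_path like '{0}'"
        " order by file_size desc limit 0,{1})".format(p, limit)
        for p in patterns
    ]
    return " UNION ALL ".join(subqueries) + " order by file_path, file_size desc"
```

As with the CONCAT in the procedure, the patterns get spliced straight into the SQL as strings, so this assumes fs_list contains trusted values.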
bobwienholt
+1  A: 

Try this.
You want every record for which there are fewer than 30 records on the same volume with a greater file size. The volume is the file_path prefix, which you can extract with substring_index just as in your ORDER BY clause:

SELECT * 
FROM   fs_walk_scan a
WHERE  ( SELECT COUNT(*) 
         FROM   fs_walk_scan b 
         WHERE  b.file_size > a.file_size 
         AND    substring_index(b.file_path, '\\', 4) = substring_index(a.file_path, '\\', 4)
       ) < 30
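
If you want to sanity-check the counting trick on toy data first, here's a quick Python/sqlite3 version. Note that it compares a precomputed volume column rather than the full file_path: every full path is unique, so comparing full paths would never eliminate anything.

```python
import sqlite3

con = sqlite3.connect(':memory:')
con.execute('CREATE TABLE fs_walk_scan (file_path TEXT, vol TEXT, file_size INT)')
con.executemany('INSERT INTO fs_walk_scan VALUES (?,?,?)', [
    ('\\\\server1\\groot$\\a.txt', '\\\\server1\\groot$', 500),
    ('\\\\server1\\groot$\\b.txt', '\\\\server1\\groot$', 900),
    ('\\\\server1\\groot$\\c.txt', '\\\\server1\\groot$', 100),
    ('\\\\server2\\froot$\\d.txt', '\\\\server2\\froot$', 700),
    ('\\\\server2\\froot$\\e.txt', '\\\\server2\\froot$', 300),
])

# "fewer than 2 rows on the same volume with a bigger size" == top 2 per volume
top2 = con.execute('''
    SELECT file_path, file_size
    FROM   fs_walk_scan a
    WHERE  (SELECT COUNT(*) FROM fs_walk_scan b
            WHERE b.file_size > a.file_size AND b.vol = a.vol) < 2
    ORDER  BY a.vol, a.file_size DESC
''').fetchall()
```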

Edit:

Apparently this performs like a dog. So... How about this looping syntax?

DELIMITER $$

CREATE PROCEDURE proc_top30_per_volume()
BEGIN
    DECLARE path VARCHAR(255);

    CREATE TEMPORARY TABLE tmp1 AS
        SELECT DISTINCT substring_index(file_path, '\\', 4) AS vol
        FROM   fs_walk_scan;

    SELECT MIN(vol) INTO path FROM tmp1;

    WHILE path IS NOT NULL DO
        SELECT *
        FROM   fs_walk_scan
        WHERE  substring_index(file_path, '\\', 4) = path
        ORDER BY file_size DESC
        LIMIT 0,30;

        SELECT MIN(vol) INTO path
        FROM   tmp1
        WHERE  vol > path;
    END WHILE;

    DROP TEMPORARY TABLE tmp1;
END $$

DELIMITER ;

The idea here is to: 1. get the list of volumes, then 2. loop, running one query per volume to pull back its 30 largest files.

(I did look up the syntax, but I'm not very hot on MySQL, so apologies if it's not quite there. Feel free to edit/comment)

AJ
Tried the above. Left the query running for 48 hours (by accident), and it still hadn't come back with a result.
Zenshai
ouch. let me go away and think about that, then! If I come up with something better, I'll edit my post.
AJ
Thanks for helping, I really appreciate it. However, I don't think this updated method will change anything performance-wise. The problem is that we are still doing a "select * from fs_walk_scan" in every iteration of the while loop, and this is very expensive because the table is huge.
Zenshai
What can I say? The table _is_ huge. It's going to be slow. Another strategy would be to maintain a "top 30" list using a trigger, but that wouldn't really be more efficient, and it would slow down inserts.
AJ
You're right, it'll be slow no matter what. However, I think there may be a way to do it right. What I'm thinking of right now is using a Perl (but eventually native) script to go through the table one row at a time, and test the file size of each row against the lowest entry in a top-30 list for its path location.
Zenshai
So there will be a top-30 array for each of the volumes. Upon encountering each row, I'll find the array it can potentially go into and test whether its file size is greater than the lowest in the list; if it is, I'll insert it in the right place and discard the lowest file size to make room.
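
Here's a rough Python sketch of that single-pass idea (toy data; the real script would stream rows from MySQL rather than a list). A size-n min-heap per volume makes the "compare against the lowest kept" test cheap:

```python
import heapq
from collections import defaultdict

def top_n_single_pass(rows, n=30):
    heaps = defaultdict(list)                  # volume -> min-heap of (size, path)
    for path, size in rows:
        vol = '\\'.join(path.split('\\')[:4])  # like substring_index(file_path,'\\',4)
        h = heaps[vol]
        if len(h) < n:
            heapq.heappush(h, (size, path))
        elif size > h[0][0]:                   # bigger than the smallest we kept?
            heapq.heapreplace(h, (size, path))
    # biggest first within each volume
    return {vol: sorted(h, reverse=True) for vol, h in heaps.items()}
```

One scan of the table, about log(30) work per row, and only 30 entries per volume held in memory.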
Zenshai