Hello,

I have the following 3 tables in a MySQL 4.x DB:

  • hosts: (300,000 records)
    • id (UNSIGNED INT) PRIMARY KEY
    • name (VARCHAR 100)
  • paths: (6,000,000 records)
    • id (UNSIGNED INT) PRIMARY KEY
    • name (VARCHAR 100)
  • urls: (7,000,000 records)
    • host (UNSIGNED INT) PRIMARY KEY <--- links to hosts.id
    • path (UNSIGNED INT) PRIMARY KEY <--- links to paths.id

As you can see, the schema is really simple but the problem is the amount of data in these tables.

Here is the query I'm running:

SELECT CONCAT(H.name, P.name)
FROM hosts AS H
INNER JOIN urls AS U ON H.id = U.host
INNER JOIN paths AS P ON U.path = P.id;

This query works perfectly fine, but takes 50 minutes to run. Does anyone have any idea about how I could speed up that query?

Thanks in advance. Nicolas

+3  A: 

Perhaps you should include a WHERE clause? Or do you really need ALL the data?
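For example, restricting the query to a single host (the host name here is purely hypothetical):

SELECT CONCAT(H.name, P.name)
FROM hosts AS H
INNER JOIN urls AS U ON H.id = U.host
INNER JOIN paths AS P ON U.path = P.id
WHERE H.name = 'www.example.com';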

Mitch Wheat
+1  A: 

Have you already declared indexes on the join attributes?

PS: See here for indexes on MySQL 4.x
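If not, a secondary index on urls.path would be the first thing to try (assuming the two PRIMARY KEY columns on urls form one composite key, which should already cover lookups by host; the index name is illustrative):

CREATE INDEX idx_urls_path ON urls (path);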

Leonidas
Actually, if he is really looking to get all rows returned, indexes might not be helpful. Doing an index lookup for each value in the table is probably slower than full-scanning the tables and hashing or merging them together.
Dave Costa
I see several hundred megabytes of data. If it all fits into memory, you are right. But a proper DBMS (and I guess even MySQL 4.x is proper enough) will then ignore existing indexes by itself.
Leonidas
+1  A: 

For one thing I wouldn't do the CONCAT in the query. Do it outside.

But really, your query runs slowly because you're retrieving millions of rows.
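That is, fetch the raw columns and concatenate client-side:

SELECT H.name, P.name
FROM hosts AS H
INNER JOIN urls AS U ON H.id = U.host
INNER JOIN paths AS P ON U.path = P.id;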

cletus
+1  A: 

Try optimizing your tables before you run the query:

optimize table hosts, paths, urls;

It might save you some time, especially if rows have been deleted from the tables. (see http://dev.mysql.com/doc/refman/4.1/en/optimize-table.html for more information on OPTIMIZE)

tehvan
A: 

The CONCAT is definitely slowing you down. Can we see the results of a MySQL EXPLAIN on this?

The biggest thing, though, is to try to pull only the data you need. If you can pull fewer records, that will speed you up as much as anything. But a MySQL EXPLAIN should help us see whether any indexes would help.
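Getting the plan is just a matter of prefixing the query:

EXPLAIN SELECT CONCAT(H.name, P.name)
FROM hosts AS H
INNER JOIN urls AS U ON H.id = U.host
INNER JOIN paths AS P ON U.path = P.id;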

Ryan Guill
+1  A: 

I'd try creating a new table with the data you want to get. Doing this means you lose some real-time accuracy, but you win in speed. The idea is similar to OLAP, or something like that.

Of course, you have to refresh this table periodically (daily or whatever).
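A minimal sketch of that idea (the table name is made up; refresh it on whatever schedule suits you):

CREATE TABLE urls_denormalized AS
SELECT CONCAT(H.name, P.name) AS url
FROM hosts AS H
INNER JOIN urls AS U ON H.id = U.host
INNER JOIN paths AS P ON U.path = P.id;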

jaloplo
Yes, a "materialized view" would be recommendable, if he does not need the latest data all the time.
Leonidas
+4  A: 
Tony Andrews
I was just about to add an answer saying the same as the "On Second Thought" part.
James Curran
Okay, just out of interest, if that's your point of view, why use a relational database at all? Your intent is the exact opposite of what I advise all of my clients. All I can say is *ARGH!!!*
Dems
Dems, I pity your clients if you insist on surrogate keys on EVERY table. Relational databases work just as well with natural keys - sometimes even better. "ARGH!!!" indeed!
Tony Andrews
A: 

I understand that you want a complete list of urls - which is 7 million records. Perhaps, as suggested by Mitch, you should consider using a WHERE clause to filter your results. Also, the timing may be mainly related to the delay in displaying the records.

Check the time for this query:

SELECT COUNT(*)
FROM hosts AS H
INNER JOIN urls AS U ON H.id = U.host
INNER JOIN paths AS P ON U.path = P.id

If this is still slow, I would go and check the timing for SELECT COUNT(*) FROM urls

then

SELECT COUNT(*)
FROM urls u
INNER JOIN hosts h ON u.host = h.id

then

SELECT COUNT(*)
FROM urls u
INNER JOIN hosts h ON u.host = h.id
INNER JOIN paths p ON u.path = p.id

just to locate the source of the slowdown.

Also, sometimes reordering your query can help:

SELECT CONCAT(h.name, p.name)
FROM urls u
INNER JOIN hosts h ON u.host = h.id
INNER JOIN paths p ON u.path = p.id
kristof
A: 

I can't say for sure about MySQL, but I know that in SQL Server primary keys create an index automatically while foreign keys do not. Make sure to check that there is an index on your foreign key fields.
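In MySQL you can verify which indexes exist with:

SHOW INDEX FROM urls;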

HLGEM
+1  A: 

I'm no MySQL expert, but it looks like MySQL primary keys are clustered -- you'll want to make sure that's the case with your primary keys; clustered indexes will definitely help speed things up.

One thing, though -- I don't believe you can have two "primary" keys on any table; your urls table looks rather suspect to me for that reason. Above all, you should make absolutely sure those two columns in the urls table are indexed to the hilt -- a single numeric index on each one should be fine -- because you're joining on them, so the DBMS needs to know how to find them quickly; that could be what's going on in your case. If you're full-table-scanning that many rows, then yes, you could be sitting there for quite some time while the server tries to find everything you asked for.

I'd also suggest removing that CONCAT function from the select statement, and seeing how that affects your results. I'd be amazed if that weren't a contributing factor somehow. Just retrieve both columns and handle the concatenation afterward, and see how that goes.

Lastly, have you figured out where the bottleneck is? Just joining three several-million-row tables shouldn't take much time at all (I'd expect maybe a second or so, just eyeballing your tables and query), provided the tables are properly indexed. But if you're pushing those rows over a slow or already-pegged NIC, to a memory-starved app server, etc., the slowness could have nothing to do with your query at all, but rather with what happens after the query. Seven million rows is quite a bit of data to assemble and move around, regardless of how long the finding of those rows happens to take. Try selecting just one row instead, rather than all seven million, and see how that compares. If that's fast, then the problem isn't the query; it's the result set.
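For that last experiment, a LIMIT clause is the simplest way to take the result-set size out of the equation:

SELECT CONCAT(H.name, P.name)
FROM hosts AS H
INNER JOIN urls AS U ON H.id = U.host
INNER JOIN paths AS P ON U.path = P.id
LIMIT 1;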

Christian Nunciato
MySQL only allows 1 primary key per table, but that key can be made up of multiple columns from the table. So in Nicolas's example, the `urls` table has a single primary key made up of `host` + `path`.
Manzabar
Sure, that makes sense -- I neglected to ask whether the keys are actually one composite one (which I don't think is necessarily made clear anyway). Mainly, though, I just wanted to point out the importance of those two columns being indexed explicitly somehow.
Christian Nunciato
+1  A: 

Overall, the best advice is to trace and profile to see what is really taking up time. But here are my thoughts about specific things to look at.

(1) I would say that you want to ensure that indexes are NOT used in the execution of this query. Since you have no filtering conditions, it should be more efficient to full-scan all the tables and then join them together with a sort-merge or hash operation.

(2) The string concatenation is surely taking some time, but I don't understand why people are recommending to remove it. You would presumably then need to do the concatenation in another piece of code, where it would still take about the same amount of time (unless MySQL's string concatenation is particularly slow for some reason).

(3) The data transfer from the server to the client is probably taking significant time, quite possibly more than the time the server needs to fetch the data. If you have tools to trace this sort of thing, use them. If you can increase the fetch array size in your client, experiment with different sizes (e.g. in JDBC, use Statement.setFetchSize()). This can be significant even if the client and server are on the same host.

Dave Costa
+1  A: 

As your result set returns all the data, there is very little optimisation that can be done at all. You're scanning the whole urls table, then joining to other tables that have indexes.

Are the primary keys clustered? That ensures the data is stored on disk in index order, avoiding bouncing around different parts of the disk.

Also, you can spread the data over multiple disks. If you have urls on PRIMARY and paths/hosts on SECONDARY, then you'll get better throughput from the drives.

Dems
+1  A: 

You need to look at your server configuration. The default memory parameters for MySQL will cripple performance on tables that size. If you are using the defaults, you should raise key_buffer_size and join_buffer_size by at least a factor of 4, perhaps much more. Look in the documentation; there are other memory parameters you can tweak.
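For example (the values are purely illustrative, and the same settings can go in my.cnf instead):

SET GLOBAL key_buffer_size = 268435456;  -- 256 MB, illustrative only
SET GLOBAL join_buffer_size = 4194304;   -- 4 MB, illustrative only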

MySQL has a funny performance quirk: if your tables go over a certain size and your queries would return most of the data, performance goes into the toilet. Unfortunately, it has no way of telling you when that threshold is reached. It looks to me like you have reached it, though.

staticsan
A: 

Since I am not a big MySQL fan, I would ask if you have tried PostgreSQL. In that DB, you would want to make sure your work_mem setting was quite high, and you can set it per DB connection with SET work_mem = '64MB', for example.

Another suggestion is to look into sharing duplicate path entries, since many URLs have the same path.

Another thing that might or might not help is using fixed-length text fields (CHAR) instead of VARCHARs. It used to make a speed difference, but I'm not sure about current DB engines.

If you do use PostgreSQL, it will let you use JOIN ... USING, but even on MySQL I like this convention: name your id field the same in every table. Instead of id in hosts and host in urls, name it host_id in both places.
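A sketch with that naming convention (host_id and path_id being the hypothetical renamed columns):

SELECT CONCAT(h.name, p.name)
FROM urls u
INNER JOIN hosts h USING (host_id)
INNER JOIN paths p USING (path_id);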

Now some more commentary. :) The data layout you have here is very useful when you are selecting a small set of rows, perhaps every URL from the same domain. It can also help a lot if your queries often need to do sequential scans of the urls table for other data stored there, because the scan can skip over the large text fields (unless it doesn't matter because your DB stores text via pointers to a linked table anyway).

However, if you almost always select all the domain and path data, then it makes more sense to store it in one table.

Zan Lynx