views:

206

answers:

6

Hello

I am developing a script on my localhost using PHP and MySQL, and I am dealing with large data (about 2 million records, for scientific research).

Some queries I only need to run once in a lifetime (to analyse the data and prepare some of it); however, they take a very long time. For example, my script has now been analysing some data for more than 4 hours.

I suspect I have some problems with the optimization of my database; I am not an expert.

For example, I just figured out that indexing can be useful to speed up queries; however, even after indexing some columns my script is still very slow.

Any idea how to speed up my script (in PHP and MySQL)?

I am using XAMPP as a server package.

Thanks a lot for the help

best regards

update 1:

part of my slow script which takes more than 4 hours to process

$sql = "select * from urls"; // 10,000 records of cached HTML documents
$result = $DB->query($sql);
while ($row = $DB->fetch_array($result)) {
    $url_id  = $row["id"];
    $content = $row["content"];

    $dom = new DOMDocument();
    @$dom->loadHTML($content);
    $xpath = new DOMXPath($dom);
    $links = $xpath->evaluate("/html/body//a"); // renamed from $row, which shadowed the loop variable

    for ($i = 0; $i < $links->length; $i++) {
        // lots of code here to deal with the HTML documents, plus some
        // UPDATE, INSERT and SELECT queries against another table which
        // has 1 million records
    }
}


update 2:

I do not have "JOIN" in my queries, or even "IN"

they are very simple queries

and I don't know how to find out what causes the slowness:

is it PHP or MySQL?

+8  A: 

First of all, to be able to optimize efficiently, you need to know what is taking time:

  • is PHP doing too many calculations?
  • do you have too many SQL queries?
  • do you have SQL queries that take too much time?
    • if yes, which ones?
  • where is your script spending its time?

With that information, you can then try to figure out:

  • whether you can reduce the number of SQL queries
    • for instance, if you are doing the exact same query over and over again, you are obviously wasting time
    • another idea is to "regroup" queries, if that is possible; for instance, use only one query to get 10 rows, instead of 10 queries which each get one row back
  • whether you can optimize the queries that take too long
    • either using indexes -- which ones are useful generally depends on the joins and conditions you are using
    • or re-writing the queries, if they are "bad"
    • about optimization of SELECT statements, you can take a look at: 7.2. Optimizing SELECT and Other Statements
  • if PHP is doing too many calculations, can you have it do fewer?
    • maybe by not recalculating similar stuff again and again?
    • or by using more efficient queries?
  • if PHP is taking the time, and the SQL server is not overloaded, using parallelism (launching several calculations at the same time) might also help speed up the whole thing
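As a sketch of the "regroup" idea, using the $DB wrapper from the question (the table, column, and values are made up for illustration): instead of looping and issuing one SELECT per value, collect the values and fetch all the rows in a single query.

```php
<?php
// Instead of: foreach ($ids as $id) { $DB->query("select * from a where x = $id"); }
// build one IN (...) query. The actual query call is commented out so the
// snippet stands alone; $DB is the wrapper from the question.
$ids = array(12, 14, 20);                       // values you'd otherwise query one by one
$in  = implode(',', array_map('intval', $ids)); // ints only, so interpolation is safe
$sql = "SELECT * FROM a WHERE x IN ($in)";      // one round-trip instead of three
// $result = $DB->query($sql);
echo $sql; // SELECT * FROM a WHERE x IN (12,14,20)
```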

Still: this is quite a specific question, and the answers will probably be pretty specific too -- which means more information might be necessary if you want more than a general answer...


Edit after your edits

As you only have simple queries, things might be a bit easier... Maybe.

  • First of all: you need to identify the kinds of queries you are doing.
    • I'm guessing that, out of all your queries, you can identify some "types" of queries.
    • for instance: "select * from a where x = 12" and "select * from a where x = 14" are of the same type: same select, same table, same where clause -- only the value changes
  • once you know which queries are used the most, you'll need to check whether they are optimized: using EXPLAIN will help
    • (if needed, I'm sure some people will be able to help you understand its output, if you provide it alongside the schema of your DB (tables + indexes))
    • if needed: create the right indexes -- that's kind of the hard/specific part ^^
    • it is also for those queries that reducing the number of queries might prove useful...
  • when you're done with the queries used most often, it's time to look at the queries that take too long; using microtime from PHP will help you find out which ones those are
    • another solution is to use 5.2.4. The Slow Query Log
    • when you have identified those queries, same as before: optimize.
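For reference, the slow query log is enabled in the MySQL configuration file (my.ini under XAMPP); a fragment along these lines, where the path and threshold are only example values:

```ini
# my.ini / my.cnf fragment -- log every query that takes longer than 2 seconds
[mysqld]
log-slow-queries = C:/xampp/mysql/data/slow-queries.log
long_query_time  = 2
```

After restarting MySQL, the log file will list the offending queries together with their execution times.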


Before that, to find out whether it's PHP or MySQL that is working too hard, a simple way is to use the "top" command on Linux, or the Task Manager on Windows.

If PHP is eating 100% of the CPU, you have your culprit. If MySQL is eating all the CPU, you have your culprit too.

Knowing which one of those is working too hard is a first step: you know what to optimize first.


I see from your portion of code that you are:

  • going through 10,000 elements one by one -- it should be easy to split those into 2 or more slices
  • using DOM and XPath, which might eat some CPU on the PHP side

If you have a multi-core CPU, an idea (that I would try if I saw that PHP was eating lots of CPU) would be to parallelize.

For instance, you could have two instances of the PHP script running at the same time :

  • one that will deal with the first half of the URLs
    • the SQL query for this one will be like "select * from urls where id < 5000"
  • and the other one that will deal with the second half of the URLs
    • its query will be like "select * from urls where id >= 5000"

You will get a bit more concurrency on the network (probably not a problem) and on the database (a database knows how to deal with concurrency, and 2 scripts using it will generally not be too much), but you'll be able to process almost twice as many documents in the same time.

If you have 4 CPU, splitting the urls-list in 4 (or even more ; find out by trial and error) parts would do too.
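As a sketch, assuming the script were adapted to take the id range as command-line arguments (the script name and its arguments here are made up), the two halves could be launched from a shell like this:

```shell
# Run both halves in the background, then wait for both to finish.
php process_urls.php 0    4999 &
php process_urls.php 5000 9999 &
wait   # blocks until both background jobs have exited
```

The same pattern extends to 4 or more slices; each process opens its own database connection.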

Pascal MARTIN
Worse than "general answers", you'll get suggestions to "try things" -- which is optimization hell.
le dorfier
A: 

Inner joins are quicker than left or right joins.

Going through my queries afterwards and thinking specifically about the joins has always sped them up.

Also have a look in your MySQL config for settings you can adjust or turn off.

Question Mark
A: 

If you are not using indexes, that can be the main problem. There are many more optimization hints and tricks, but it would be better to show us, for example, your slowest query -- it's not possible to help without any input data. Indexes and correct joins can speed things up a lot.

If some queries always return the same data, you can store the results in a file or in memory and run them just once.

2 million records is not that much.
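A minimal sketch of that caching idea, keyed on the SQL string (cached_query is a hypothetical helper; $DB is the wrapper from the question):

```php
<?php
// Hypothetical helper: run each distinct SQL string only once, and
// remember the result in an array keyed by the query text.
function cached_query($DB, $sql, array &$cache) {
    if (!array_key_exists($sql, $cache)) {
        $cache[$sql] = $DB->query($sql); // first time: really hit the database
    }
    return $cache[$sql];                 // afterwards: served from memory
}
```

This only makes sense for queries whose results do not change while the script runs, and the cache must fit in PHP's memory limit.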

martin.malek
A: 

Knowing the queries and table structures, it would be easier to help.

If you can't share them, check whether you use the IN operator; MySQL tends to slow down a lot with it. Also try running

EXPLAIN yourquery;

and see how the query is executed. Sometimes sorting takes too much time; try to avoid sorting on non-indexed columns.
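For instance, against the urls table from the question (the output columns to look at are type, key, and rows; the index name below is made up):

```sql
EXPLAIN SELECT * FROM urls WHERE id < 5000;
-- if "key" is NULL and "type" is ALL, MySQL is scanning the whole table;
-- an index on the filtered column turns that into a range scan:
ALTER TABLE urls ADD INDEX idx_urls_id (id);
```

If id is already the primary key, EXPLAIN should show it being used; the same check applies to whatever columns appear in your WHERE and ORDER BY clauses.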

Cem Kalyoncu
A: 

Before you can optimise, you need to find out where the bottleneck is. Can you run the script on a smaller dataset, for testing purposes?

In that case, you should set such a test up, and then profile the code. You can either use a dedicated profiler such as Xdebug, or, if you find that too daunting to configure (it's not that complicated really, but you sound like you're a bit in at the deep end already), you may feel more comfortable with a manual approach. That means starting a timer before parts of your code and stopping it after, then printing the result. You can then narrow down which part is slowest.
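The manual approach can be as simple as this (a sketch; the usleep call merely stands in for the suspect section of your code):

```php
<?php
// Wrap a suspect section with two microtime(true) calls and print the
// difference to see how long that section takes.
$start = microtime(true);
usleep(50000);                     // stand-in for the suspect code (50 ms)
$elapsed = microtime(true) - $start;
printf("section took %.3f seconds\n", $elapsed);
```

Moving the timer pair around lets you bisect the script until the slow part is isolated.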

Once you've got that, we can give more specific answers, or perhaps it will be apparent to you what to do.

troelskn
A: 

Since your query is on one table and has no grouping or ordering, it is unlikely that the query itself is slow. I expect the issue is the size and number of the content fields. It appears that you are storing the entire HTML of a web page in your database and then pulling it out every time you want to change a couple of values on the page. This is a situation to avoid if at all possible.

Most scientific web apps (like BLAST, for example) have the option to export the data as a delimited text file such as a CSV. If this is the case for you, you might consider restructuring your urls table so that you have one column per data field in the CSV. Your update queries will then be significantly faster, as you will be able to do them entirely in SQL instead of pulling the entire urls table into PHP, accessing and pulling one or more other records for each url record, and then updating your table.


Presumably you have stored your data as web pages so you can dump the content easily to a browser. If you change your database schema as I've suggested, you'll need to write a web page template that you can plug the data into when you wish to output it.
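A sketch of that restructuring (all table and column names are invented for illustration; the real ones would come from the fields of your CSV export):

```sql
-- one column per data field instead of a single HTML blob
CREATE TABLE url_data (
    id        INT PRIMARY KEY,
    title     VARCHAR(255),
    abstract  TEXT,
    hit_count INT
);

-- updates can then stay entirely inside SQL:
UPDATE url_data SET hit_count = hit_count + 1 WHERE id = 42;
```

With this layout, the 4-hour PHP loop over full HTML documents becomes a handful of set-based SQL statements.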

dnagirl