Our real estate website sends email notifications when properties come on the market matching users' saved search criteria.

I have a PHP script that runs as a cron job to process the searches.

The system now has over 40,000 users. I ran into an issue where I exhausted the PHP memory limit. As the system grows I can keep increasing the memory limit in php.ini, but I would like to find a more robust way to do this.

Is there a more scalable way to accomplish this? Should I build something in a more robust language? Perhaps a threaded application in Java or Python?

+1  A: 

Use some identifier in the saved search tables to partition the tasks across the number of machines dedicated to them.

Divide and conquer: farm out sections of the search processing to many machines, each running some partition of the tasks.

Each script can do

SELECT * FROM saved_search_tbl WHERE ssid IN CALCRANGE(searchid, node_id)

where CALCRANGE is some partitioning logic based on the amount of work and the machine's node number. When new searches come in, you could load-balance them across machines by assigning primary and secondary compute nodes.
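As a rough sketch of what CALCRANGE might do (the function name and the even split over contiguous id ranges are assumptions, not a real API), each node could compute its own slice of the search ids like this, in Python since the asker mentioned it:

```python
def calc_range(max_search_id, num_nodes, node_id):
    """Return the [start, end) slice of saved-search ids this node handles.

    Assumes ids are roughly contiguous; each node gets an equal-sized chunk.
    """
    per_node = -(-max_search_id // num_nodes)  # ceiling division
    start = node_id * per_node
    end = min(start + per_node, max_search_id)
    return start, end

# Example: 40,000 saved searches spread across 4 worker machines.
for node in range(4):
    print(node, calc_range(40000, 4, node))
```

Each worker would then run the equivalent of `SELECT * FROM saved_search_tbl WHERE ssid >= start AND ssid < end` for its own range.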

A few ideas, hope they help!

good luck

Aiden Bell
Thanks Aiden. Great idea!
andrew
The only downside to this is refactoring the partition cache every n new search entries to ensure an even balance. But if you expect large growth in queries, then do something like this earlier rather than later :)
Aiden Bell
You could run the process on X number of machines, and just divide the total number of processes by X and have each machine perform a specific region of the searches every time. As long as the machines are using the same logic to divide the searches, there shouldn't be an issue with missing anything. You avoid adding fields to the tables and load-balancing is inherently automatic this way.
epalla
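A minimal sketch of epalla's scheme, assuming a hypothetical `NUM_MACHINES` setting shared by all nodes: because every machine applies the same modulo rule, no extra columns or coordinator are needed, and each search id lands on exactly one node.

```python
NUM_MACHINES = 4  # assumed cluster size, identical on every node

def belongs_to(search_id, node_id, num_machines=NUM_MACHINES):
    """True if this node is responsible for the given saved search."""
    return search_id % num_machines == node_id

# Each node could equivalently push the filter into SQL:
#   SELECT * FROM saved_search_tbl WHERE MOD(ssid, 4) = :node_id
print([i for i in range(10) if belongs_to(i, 0)])
```

The trade-off is that adding or removing a machine reshuffles which node owns which search, so all nodes must switch to the new `NUM_MACHINES` value at the same time.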
@epalla - I agree. I was a bit unclear; I was hinting more at caching that calculation and refactoring it when the number of machines changes. But this is only relevant at large scale, as a side note. Also, adding primary/secondary fields and a 'been-done' bit means that you can build in redundancy in case some machine dies.
Aiden Bell