Hello,

I have two tables, each with a list of URLs fetched from a different source.

I want to find the common entries and put them in a separate table.

This is what I'm doing:

  1. Compute the MD5 hash of each URL while fetching it.
  2. Store the hash in a column next to the URL.
  3. Fetch one table as an array, loop through it, and insert the rows from the other table whose MD5 hash is the same.

EDIT: Should I strip "http://" and "www." from the URLs?

Is there a better and faster way to do this?

I am using PHP + MySQL

A: 

Try something like:

INSERT INTO table3 (url) SELECT table1.url FROM table1 INNER JOIN table2 ON table1.hash = table2.hash

Treat this as a sketch and adjust the table and column names to your schema. A query like this reads the URLs from table1 whose hash also appears in table2 and puts them in table3, so the matching happens inside MySQL instead of in a PHP loop.
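To make the idea concrete, here is a minimal sketch of running such an INSERT ... SELECT from PHP through PDO. It is an illustration under assumptions, not the poster's actual setup: an in-memory SQLite database stands in for MySQL so the snippet runs on its own (it needs the pdo_sqlite extension), and the column names `url` and `hash` as well as the sample URLs are made up for the demo.

```php
<?php
// Demo only: an in-memory SQLite database stands in for MySQL here.
$pdo = new PDO('sqlite::memory:');
$pdo->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);

// Assumed schema: each source table stores the URL and its MD5 hash.
$pdo->exec('CREATE TABLE table1 (url TEXT, hash CHAR(32))');
$pdo->exec('CREATE TABLE table2 (url TEXT, hash CHAR(32))');
$pdo->exec('CREATE TABLE table3 (url TEXT)');

// Made-up sample data: only http://b.example/ appears in both tables.
$insert1 = $pdo->prepare('INSERT INTO table1 (url, hash) VALUES (?, ?)');
$insert2 = $pdo->prepare('INSERT INTO table2 (url, hash) VALUES (?, ?)');
foreach (['http://a.example/', 'http://b.example/'] as $url) {
    $insert1->execute([$url, md5($url)]);
}
foreach (['http://b.example/', 'http://c.example/'] as $url) {
    $insert2->execute([$url, md5($url)]);
}

// One statement does the matching inside the database; no PHP loop needed.
$pdo->exec(
    'INSERT INTO table3 (url)
     SELECT t1.url FROM table1 t1
     INNER JOIN table2 t2 ON t1.hash = t2.hash'
);

// Read back the common URLs.
$common = $pdo->query('SELECT url FROM table3')->fetchAll(PDO::FETCH_COLUMN);
print_r($common);
```

Against a real MySQL server only the DSN and credentials passed to `new PDO(...)` would change. An index on the hash column of both tables would likely help the join on large tables.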

EDIT: If you want to sanitize your input URLs (e.g. by removing GET variables), I'd do that before saving them to table1 and table2. I wouldn't remove "http://" and "www.", because "https://somesite" and "http://somesite", as well as "www.somesite.com" and "somesite.com", may serve different content.

Select0r
A: 
SELECT * FROM table1 WHERE hash IN (SELECT hash FROM table2)

You may also want to have a look at the concept of table joins.

Greets, Philip

Philip
+3  A: 

MD5 is a little slow if you need real speed. Try MurmurHash.

You should do the following transformations before hash calculation:

  • Strip "http://" and "www."
  • Strip the trailing slash
  • Normalize the URL (urlencode it)
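The first two transformations can be sketched in PHP roughly as follows. The function names `normalizeUrl()` and `urlHash()` are hypothetical helpers, not part of any library. Only "http://" is stripped, as listed above, so "https://" URLs stay distinct. The encoding step is left out because how to apply `urlencode()` (to the whole URL or only to path and query parts) depends on your data.

```php
<?php
// Hypothetical helper: applies the stripping steps before hashing.
function normalizeUrl(string $url): string
{
    $url = trim($url);
    // Strip a leading "http://" (case-insensitive); "https://" is left alone.
    $url = preg_replace('#^http://#i', '', $url);
    // Strip a leading "www.".
    $url = preg_replace('#^www\.#i', '', $url);
    // Strip trailing slashes.
    return rtrim($url, '/');
}

// Hypothetical helper: hash of the normalized URL, to store in the hash column.
function urlHash(string $url): string
{
    return md5(normalizeUrl($url));
}

// Both spellings normalize to "example.com/page" and so get the same hash.
var_dump(urlHash('http://www.example.com/page/') === urlHash('example.com/page'));
```

Whether stripping "www." is safe is a judgment call for your data set: two hosts that differ only by that prefix usually serve the same page, but nothing guarantees it.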
FractalizeR
+1 for normalizing the URL.
Martin Wickman
BTW, I'm not sure a PHP implementation of MurmurHash will be faster than the built-in md5 function. That needs testing. Anyway, for REAL speed, you can write a PHP extension.
FractalizeR
I guess implementing MurmurHash in PHP will be tough. Are there any other, faster hashing methods?
Jagira
You can use mhash extension, but I doubt you will get any speed improvement: http://develobert.blogspot.com/2007/09/php5-hash-benchmarks.html
FractalizeR