views:

55

answers:

2

I've been using a Drupal module called FeedAPI De-Dupe to remove duplicate records from my site's (MySQL) database. The module does a great job, but I'm still left with some "duplicates" when the URLs sometimes contain "www" and sometimes don't. The code handling the duplicate detection is:

case 0: //url-based
    $sql = "SELECT n.nid
  FROM {node} n
  INNER JOIN {feedapi_node_item} i ON i.nid=n.nid
  WHERE type='%s' AND url = '%s'
  LIMIT 1";

  $dnid = db_result(db_query($sql, 
          variable_get('feedapi_dedupe_content_type', feedapi_dedupe_content_type()),
          $url));
break;

Is there a way to modify the SQL to also handle the www/no-www case? If not, is there a way to handle it in a separate SQL query instead?

+1  A: 

Maybe:

case 0: //url-based
  $sql = "SELECT n.nid
  FROM {node} n
  INNER JOIN {feedapi_node_item} i ON i.nid=n.nid
  WHERE type='%s' AND (url = '%s' OR url = '%s')
  LIMIT 1";

  $dnid = db_result(
    db_query(
      $sql, 
      variable_get('feedapi_dedupe_content_type', feedapi_dedupe_content_type()),
      $url,
      preg_replace('@//www.@', '//' $url)));
break;

However, I cannot test it myself right now.

Johan
I don't think that will work because the url field contains the full url (ie, "http://example.com" not "example.com").
Matt V.
Ok, I modified the code according to that. Still untested, and probably need some modifications, but you get the idea.
Johan
+1  A: 

Supposing that $url contains the full URL (including the http:// part), the following code should do what you are asking for:

preg_match('@^(?:http://)?([^/]+)@i', $url, $matches);
$sql = "SELECT n.nid
  FROM {node} n
  INNER JOIN {feedapi_node_item} i ON i.nid=n.nid
  WHERE type='%s' AND (url = '%s' OR url = '%s')";

$dnid = db_result(db_query_range($sql, 
  variable_get('feedapi_dedupe_content_type', feedapi_dedupe_content_type()),
   'http://www.' . $matches[1], 'http://' . $matches[1], 0, 1)
);

If then the URL contained in the database contains values like `http://example.com/directory, then you can use the following code:

preg_match('@^(?:http://)?([^/]+)@i', $url, $matches);
$sql = "SELECT n.nid
  FROM {node} n
  INNER JOIN {feedapi_node_item} i ON i.nid=n.nid
  WHERE type='%s' AND (url LIKE '%s' OR url LIKE '%s')";

$dnid = db_result(db_query_range($sql, 
  variable_get('feedapi_dedupe_content_type', feedapi_dedupe_content_type()),
   'http://www.' . $matches[1] . '/%', 'http://' . $matches[1] . '/%', 0, 1)
);

In Drupal, if you want to limit the number of rows obtained from a query, you need to use db_query_range().

kiamlaluno