I have the following problem:

  • I have a feed capturer that captures news from different sources every half an hour.

  • I only insert entries whose URLs are not already in the database (the URL is used to check whether the record already exists).

    Even so, I get some repeated entries, because some sites report the same news (usually taken from a news source like Reuters). I could look for these repeated entries during insertion, but I think this would slow the insertion down even more.

    So, I can later find these repeated entries by the title. But I think this search is slow. Then, my idea is to generate a numeric field from the title and then search by this number for repeated titles.

  • What kind of encoding could I use to encode the titles (I thought of something like the reverse of base64)?

  • I'm supposing that searching for repeated numbers is a lot faster than searching for repeated words. Is that true or not?
  • Do you suggest a better solution for this problem?

Well, I don't mind having the repeated entries in the database, I just don't want to show them to the user. Like Google, which filters out the repeated results but shows them if you ask.

I hope I explained it well. Thanks in advance.

+2  A: 

Store the MD5 hashes of the URL and the title in their own columns and build a UNIQUE index on them:

CREATE UNIQUE INDEX ux_mytable_title_url ON mytable (title_hash, url_hash)

INSERT
INTO    mytable (url, title, url_hash, title_hash)
VALUES  ('url', 'title', MD5('url'), MD5('title'))

To select like Google (one result per title), use this query:

SELECT  *
FROM    (
        SELECT  DISTINCT title_hash
        FROM    mytable
        ) md
JOIN    mytable mo
ON      mo.title_hash = md.title_hash
        AND mo.url_hash =
        (
        SELECT  url_hash
        FROM    mytable mi
        WHERE   mi.title_hash = md.title_hash
        ORDER BY
                mi.title_hash, mi.url_hash
        LIMIT 1
        )
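
As a rough, self-contained sketch of the same idea, the snippet below uses Python's built-in sqlite3 module in place of MySQL. SQLite has no MD5() function, so the hashes are computed with hashlib before the insert. The sample URLs and titles are made up for illustration.

```python
import hashlib
import sqlite3

def md5_hex(text):
    """Return the MD5 digest of text as a 32-character hex string."""
    return hashlib.md5(text.encode("utf-8")).hexdigest()

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE mytable (url TEXT, title TEXT, url_hash TEXT, title_hash TEXT)"
)
conn.execute(
    "CREATE UNIQUE INDEX ux_mytable_title_url ON mytable (title_hash, url_hash)"
)

# Hypothetical sample data: two sites carry the same story.
rows = [
    ("http://a.example/1", "Markets rally"),
    ("http://b.example/9", "Markets rally"),
    ("http://a.example/2", "Storm hits coast"),
]
for url, title in rows:
    conn.execute(
        "INSERT INTO mytable (url, title, url_hash, title_hash) VALUES (?, ?, ?, ?)",
        (url, title, md5_hex(url), md5_hex(title)),
    )

# One row per title: for each distinct title_hash keep the row whose
# url_hash sorts first.
one_per_title = conn.execute(
    """
    SELECT mo.url, mo.title
    FROM (SELECT DISTINCT title_hash FROM mytable) md
    JOIN mytable mo
      ON mo.title_hash = md.title_hash
     AND mo.url_hash = (
            SELECT mi.url_hash
            FROM mytable mi
            WHERE mi.title_hash = md.title_hash
            ORDER BY mi.title_hash, mi.url_hash
            LIMIT 1
         )
    """
).fetchall()
```

Which of the two "Markets rally" rows survives depends only on how the URL hashes happen to sort, so the choice is stable but arbitrary.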
Quassnoi
A: 

You can use a new table that contains only the hashed keys built from the title and URL, and then add a key on it to speed up the search. But I don't think there is an efficient algorithm for transforming strings into numbers.

For the hashing, use

SELECT MD5(CONCAT('title', 'url'));

Then, before every insertion, check whether the hash of the concatenated title and URL already exists in this table.
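
A minimal sketch of that check-before-insert flow, assuming SQLite through Python's sqlite3 module rather than MySQL. The table names entry_keys and entries and the helper insert_if_new are hypothetical names chosen for this example.

```python
import hashlib
import sqlite3

conn = sqlite3.connect(":memory:")
# Side table holding only the hashed keys; the primary key gives it an index.
conn.execute("CREATE TABLE entry_keys (entry_hash TEXT PRIMARY KEY)")
conn.execute("CREATE TABLE entries (url TEXT, title TEXT)")

def entry_hash(title, url):
    """MD5 of the concatenated title and URL, as a hex string."""
    return hashlib.md5((title + url).encode("utf-8")).hexdigest()

def insert_if_new(title, url):
    """Insert the entry unless its hash is already known. Returns True if inserted."""
    key = entry_hash(title, url)
    exists = conn.execute(
        "SELECT 1 FROM entry_keys WHERE entry_hash = ?", (key,)
    ).fetchone()
    if exists:
        return False
    conn.execute("INSERT INTO entry_keys (entry_hash) VALUES (?)", (key,))
    conn.execute("INSERT INTO entries (url, title) VALUES (?, ?)", (url, title))
    return True

first = insert_if_new("Markets rally", "http://a.example/1")       # new entry
second = insert_if_new("Markets rally", "http://a.example/1")      # exact repeat
other_site = insert_if_new("Markets rally", "http://b.example/9")  # same title, other URL
```

Because the URL is part of the hashed string, the same title published at a different URL still counts as new. Hashing the title alone would be needed to catch syndicated copies.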

Houssem
A: 

@Quassnoi can explain this better than I can, but I think there is no visible difference in performance between a VARCHAR/CHAR and an INT column in an index that is later used for GROUP BY or some other method of finding the duplicates. So you could use his solution with a normal INDEX instead of a UNIQUE index, keep the duplicates in the database, and filter them out only when showing results to users.
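
A small sketch of that variant, again assuming SQLite through Python's sqlite3 module instead of MySQL. The id column, the index name ix_mytable_title_hash and the sample rows are assumptions made for this example.

```python
import hashlib
import sqlite3

def md5_hex(text):
    """Return the MD5 digest of text as a 32-character hex string."""
    return hashlib.md5(text.encode("utf-8")).hexdigest()

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE mytable (id INTEGER PRIMARY KEY, url TEXT, title TEXT, title_hash TEXT)"
)
# Plain (non-unique) index: duplicates are allowed to stay in the table.
conn.execute("CREATE INDEX ix_mytable_title_hash ON mytable (title_hash)")

rows = [
    ("http://a.example/1", "Markets rally"),
    ("http://b.example/9", "Markets rally"),
    ("http://a.example/2", "Storm hits coast"),
]
for url, title in rows:
    conn.execute(
        "INSERT INTO mytable (url, title, title_hash) VALUES (?, ?, ?)",
        (url, title, md5_hex(title)),
    )

# At display time, collapse duplicates: keep the earliest row per title hash
# and report how many copies were folded into it.
shown = conn.execute(
    """
    SELECT MIN(id) AS first_id, title_hash, COUNT(*) AS copies
    FROM mytable
    GROUP BY title_hash
    ORDER BY first_id
    """
).fetchall()
```

The copies count could drive a "show similar results" link, while the full set of rows remains available in the table.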

Leonel Martins
In fact, indexes on the title have two drawbacks. First, titles are long enough to increase the index size significantly, so fewer pages fit into memory (which raises the probability of cache misses). Second, indexes on titles are less balanced, since natural-language titles are not distributed evenly. These effects are unnoticeable when selecting a single record, but for large joins they can matter.
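
To make the size argument concrete, here is a rough Python sketch that compares the byte length of a title with that of its MD5 digest, and also derives a 64-bit integer key from the digest in the spirit of the numeric field the question asks about. The sample title and the helper name title_key are made up, and truncating the digest to 8 bytes is an assumption of this sketch that raises the (still small) chance of collisions.

```python
import hashlib

def title_key(title):
    """Derive an unsigned 64-bit integer from the first 8 bytes of the MD5 digest."""
    digest = hashlib.md5(title.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")

title = "Stocks close higher as investors weigh earnings and central bank remarks"

title_bytes = len(title.encode("utf-8"))                        # grows with the title
digest_bytes = len(hashlib.md5(title.encode("utf-8")).digest())  # always 16
key = title_key(title)                                           # fits in 8 bytes
```

Whatever the title length, the raw digest stays at 16 bytes and the derived key at 8, so index entries built on them have a fixed, small size.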
Quassnoi
I meant an index on MD5(title), not on the title directly (it would be distributed better, wouldn't it?), but your edited solution is even better than I thought.
Leonel Martins