views:

45

answers:

1

Hi.

There's a site containing dozens of text snippets, each of them may have links and I need to build connections between snippets. Snippets are stored in database, database layout is following:

snippet: id primary key, title varchar(255), url varchar(400), body text references: snippet_id, crc32_sum backlinks: snippet_id, links_to

process of adding new snippet:

  1. sanitize url, generate its crc32 sum and find all entries in references table having crc32_sum = new urls crc32 sum. if there are such records add them to backlinks where snippet_id is entry having new urls crc32 sum and links_to - new snippets id

  2. grab all links from new snippets text, sanitize them, calculate their crc32 sum and put it all into references where snippet_id is id of new snippet and crc32 sum is sum of the found link

sanitizing url and link grabbing are done with PHP, crc32 works in MySQL.

Can I somehow optimize this process? I've added indexes in MySQL, tried to put as many crc32 in request as possible, but still its dead slow :(

A: 

I am not sure what exactly is slow here, so I just suggest another way of storing data.

snippets: id, title, url, text

links: fromid, tourl, toid

I don't emulate hash indexing here to simplify scheme and also because MySQL can index strings well. If needed, hashes can be emulated with md5 index (in binary form).

So, when you insert a new snippet you do the following:

Insert a new snipped to DB and get last_insert_id(). Search links table for entries, where toid=0 and tourl=*snippet.url*. Update found entires in the links table with new toid value. These are backlinks for our snippet.

Extract links from the snippet and iterate snippets table where url=*extracted_link*. Insert needed entries into links table.

That's all. If this will be slow, we will need more data: what part of this is slow exacty, what is "slow" etc.

In my schematic links table contain both links and backlinks. Is it sufficient for your needs?

FractalizeR
it became a bit better, but still its pretty slow :(
Daniel
What exactly part is slow? Did you do profiling?
FractalizeR