Suppose it is a long article (say 100,000 words), and I need to write a PHP file to display page 1, page 2, or page 38 of the article via

display.php?page=38

but the number of words per page can change over time (for example, right now it is 500 words per page, but next month we may change it to 300 words per page). What is a good way to divide the long article and store it in the database?

P.S. The design may be further complicated if we want to display 500 words but always include whole paragraphs. That is, if we are already showing word 480 but the current paragraph has 100 more words remaining, then show those 100 words anyway even though they exceed the 500-word limit (and the next page shouldn't show those 100 words again).

+2  A: 

You could of course output exactly 500 words per page, but a better way would be to put some kind of break markers into your article at places where a break would be good (end of sentence, end of paragraph). This way each page won't have exactly X words, but about or up to X, and it won't tear sentences or paragraphs apart. Of course, when displaying the pages, don't display the break markers.
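A minimal sketch of this idea, assuming the author inserts a marker string (here `<!--pagebreak-->`, an arbitrary choice) at good break points while writing; the function name and marker are illustrative, not from the answer:

```php
<?php
// Return one page of an article that contains explicit break markers.
// Pages are 1-based; an out-of-range page yields an empty string.
function get_page(string $article, int $page, string $marker = '<!--pagebreak-->'): string
{
    $pages = explode($marker, $article);
    return $pages[$page - 1] ?? '';
}
```

The marker never reaches the reader because `explode()` removes it while splitting.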

schnaader
+1  A: 

You might want to start by breaking the article up into an array of paragraphs. The split() function (http://www.php.net/split) does this, but it is deprecated as of PHP 5.3; explode() does the same job for a fixed delimiter:

$array = explode("\n", $articleText);
Travis
so how do you decide which paragraphs to show when it is page 38?
動靜能量
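One hedged sketch of answering that: accumulate whole paragraphs into pages of roughly N words (which also satisfies the P.S. about never splitting a paragraph), then page 38 is simply index 37. The function and parameter names here are illustrative, not from the answer:

```php
<?php
// Group an array of paragraphs into pages of roughly $wordsPerPage words,
// always keeping whole paragraphs together.
function paginate(array $paragraphs, int $wordsPerPage): array
{
    $pages = [];
    $current = [];
    $count = 0;
    foreach ($paragraphs as $p) {
        $current[] = $p;
        $count += str_word_count($p);
        // Close the page once the budget is met or exceeded.
        if ($count >= $wordsPerPage) {
            $pages[] = implode("\n", $current);
            $current = [];
            $count = 0;
        }
    }
    if ($current) {
        $pages[] = implode("\n", $current); // trailing partial page
    }
    return $pages;
}
```

Re-splitting at 300 words per page is then just a matter of calling the same function with a different budget.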
+2  A: 

I would do it by splitting articles into chunks when saving them. The save script would split the article using whatever rules you design and save each chunk into a table like this:

CREATE TABLE article_chunks (
    article_id int not null,
    chunk_no int not null,
    body text
)

Then, when you load a page of an article:

$sql = "select body from article_chunks where article_id = "
    .$article_id." and chunk_no=".$page;
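Concatenating request values into SQL like this invites SQL injection; a safer variant of the same query uses a prepared statement. This sketch assumes a PDO connection and the table from above:

```php
<?php
// Load one chunk (page) of an article via a prepared statement.
// Returns null when the article/page combination does not exist.
function load_chunk(PDO $pdo, int $article_id, int $page): ?string
{
    $stmt = $pdo->prepare(
        'SELECT body FROM article_chunks WHERE article_id = ? AND chunk_no = ?'
    );
    $stmt->execute([$article_id, $page]);
    $body = $stmt->fetchColumn();
    return $body === false ? null : $body;
}
```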

Whenever you want to change the logic of splitting articles into pages, you run a script that pulls all the chunks together and re-splits them.
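Such a re-splitting script could look like the sketch below. `$split_into_chunks` stands in for whatever splitting rule the save script uses; it is not defined in the answer. Wrapping the rebuild in a transaction keeps readers from ever seeing a half-rebuilt article:

```php
<?php
// Rebuild the chunks of one article with a new splitting rule.
// $split_into_chunks: callable that maps the full text to an array of chunks.
function resplit_article(PDO $pdo, int $article_id, callable $split_into_chunks): void
{
    $pdo->beginTransaction();

    // Reassemble the full article from its current chunks.
    $stmt = $pdo->prepare(
        'SELECT body FROM article_chunks WHERE article_id = ? ORDER BY chunk_no'
    );
    $stmt->execute([$article_id]);
    $full = implode('', $stmt->fetchAll(PDO::FETCH_COLUMN));

    // Replace the old chunks with the newly split ones.
    $del = $pdo->prepare('DELETE FROM article_chunks WHERE article_id = ?');
    $del->execute([$article_id]);

    $ins = $pdo->prepare(
        'INSERT INTO article_chunks (article_id, chunk_no, body) VALUES (?, ?, ?)'
    );
    foreach ($split_into_chunks($full) as $i => $chunk) {
        $ins->execute([$article_id, $i + 1, $chunk]);
    }

    $pdo->commit();
}
```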

UPDATE: In giving this advice I assume your application is more read-intensive than write-intensive, meaning that articles are read more often than they are written.

artemb
what if there are a few hundred long articles and re-splitting them requires stopping the site for maintenance... and if there is a bug in the re-splitting script, the content can be corrupted?
動靜能量
Well, if there is a bug in any code that works with data, content can be damaged. You can avoid the need for stopping the site by starting and committing a transaction around saving each article. But stopping a site for maintenance once in a while is a common thing.
artemb
You wouldn't need to stop the site! You could rebuild each article while it's online. I'd also suggest adding an article table (with article_id as an identity/autoincrement column, and body text) holding the original text that gets split into chunks. In the algorithm, I'd set a trigger to update the text of the chunks online: add new chunks that weren't there, and delete unneeded ones.
Osama ALASSIRY
+1  A: 

It's better to cut the text manually, because it's not a good idea to leave it to a program to determine where to cut. Sometimes it will cut just after an h2 tag and continue with the text on the next page.

This is a simple database structure for that:
article(id, title, time, ...)
article_body(id, article_id, page, body, ...)

The SQL query:

SELECT a.*, ab.body, ab.page
FROM article a
INNER JOIN article_body ab
    ON ab.article_id = a.id
WHERE a.id = $article_id AND ab.page = $page
LIMIT 1;

In the application you can use jQuery to simply add a new textarea for each additional page...

sasa
but if there are a few hundred such articles, manually splitting them could take too long. Also, if it is decided to be 300 words per page next month, we can't re-split them all by hand again.
動靜能量
+1  A: 

Your table could be something like

CREATE TABLE ArticleText (
  artId INTEGER,
  wordNum INTEGER,
  wordId INTEGER,
  PRIMARY KEY (artId, wordNum),
  FOREIGN KEY (artId) REFERENCES Articles,
  FOREIGN KEY (wordId) REFERENCES Words
)

this of course may be very space-expensive, or slow, etc, but you'll need some measurements to determine that (as so much depends on your DB engine). BTW, I hope it's clear that the Articles table is simply a table with metadata on articles keyed by artId, and the Words table a table of all words in every article keyed by wordId (trying to save some space there by identifying already-known words when an article is entered, if that's feasible...). One special word must be the "end of paragraph" marker, easily identifiable as such and distinct from every real word.

If you do structure your data like this you gain lots of flexibility in retrieving by page, and page length can be changed in a snap, even query by query if you wish. To get a page:

SELECT wordText
FROM Articles
  JOIN ArticleText USING (artId)
  JOIN Words USING (wordId)
WHERE wordNum BETWEEN (@pagenum-1)*@pagelength AND @pagenum * @pagelength + @extras
  AND Articles.artId = @articleid
ORDER BY wordNum

parameters @pagenum, @pagelength, @extras, @articleid are to be inserted in the prepared query at query time (use whatever syntax your DB and language like, such as :extras or numbered parameters or whatever).

So we get @extras words beyond the expected end-of-page, and then on the client side we check those extra words to make sure one of them is the end-of-paragraph marker; otherwise we do another query (with different BETWEEN values) to get yet more.
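That client-side check could be sketched as follows. The word list is what the query above returned; "¶" as the end-of-paragraph marker word is an assumption for illustration, as is the function name:

```php
<?php
// Trim a fetched word list at the first end-of-paragraph marker found among
// the extra words past the nominal page end ($pagelength words).
// If no marker appears, the caller must re-query for more words.
function trim_page(array $words, int $pagelength, string $marker = '¶'): array
{
    for ($i = $pagelength; $i < count($words); $i++) {
        if ($words[$i] === $marker) {
            return array_slice($words, 0, $i + 1); // keep through the marker
        }
    }
    return $words; // no marker in the extras: fetch more
}
```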

Far from ideal, but, given all the issues you've highlighted, worth considering. If you can count on the page length always being e.g. a multiple of 100, you can adopt a slight variation of this based on 100-word chunks (and no Words table, just text stored directly per row).

Alex Martelli
+1  A: 

Let the author divide the article into parts themselves.

Writers know how to make an article interesting and readable by dividing it into logical parts, like "Part 1—Installation", "Part 2—Configuration" etc. Having an algorithm do it is a bad decision, imho.

Chopping an article in the wrong place just makes the reader annoyed. Don't do it.

my 2¢

/0
0scar