Okay,

I'm not an expert in PHP (I'm really just beginning to grasp the thing) and I need a little help customizing a script that extracts titles from Tumblr posts (and I insist, it's a small problem; I'm not asking for the whole script to be written for me).

For those who don't know: Tumblr is a microblogging platform that offers pre-formatted post types (quote, photo, etc.). Only one of those types includes a title.

I'll do my best and try to make this question as specific as possible so I can get an answer as specific as possible. I'll make it brief.

I'm starting with Ben Ward's script Tumblr2Wordpress (available on GitHub). Here's the part that acts as a title extractor:

# Try to extract a sane, single line blog title from input text, and
# (optionally) remove it from the entry body to avoid duplication.
function formatEntryTitle(&$text, $strip=true) {
$lines = explode("\n", $text);
$block_count = 0; # How far into the entry are we?
for($i=0; $l = $lines[$i]; $i++) {

    if(empty($l)) {
        # Ignoring empty lines
        continue;
    }
    elseif(preg_match('/^\s*(#+|<[hH][1-6]>).*$/', $l, $match)) {
        # Matches a heading in Markdown or HTML

        # Now we need to see if the title embeds any links. If it does,
        # we want to strip out the link mark-up…

        # If raw input:
        if($markdown) {
            # Run markdown:
            $l = Markdown($l);
        }
        # Crudely check for <a>
        $contains_link = !(false === stripos('<a', $l));

        if( true === $strip
            && false === $contains_link) {
            # If there has been no other content so far (allowing one block
            # for quote attribution), and we're stripping titles out of the
            # text to avoid duplication, do it:
            array_splice($lines, $i, 1);
            $text = implode("\n", $lines);
        }

        # In the final return, strip not-inline HTML tags.
        return str_replace('\n', '', strip_tags($l));
        #'<abbr><acronym><i><b><strong><em><code><kbd><samp><span><q>
        # <cite><dfn><ins><del><mark><meter><rp><rt><ruby><sub><sup>
        # <time><var>'
    }
    else {
        $block_count++;
    }

    if($block_count > 2) {
        # Too far into the post. Give up.
        break;
    }
}
return '';
}    

Now, here's how I understand it works (for those who do not read PHP: I don't but I can guess):

1) It creates a function called formatEntryTitle that is later used to format the data fetched from the API of a Tumblr blog.
2) Then there's some code that counts lines in order to bound the search. Ben's idea is that the title extractor searches the post content for a specific item. If the search fails after a certain number of lines (or blocks), it ends and returns nothing.
3) But what to search for in order to create a title? Ben's idea is simple and, in some cases, very efficient: the code searches for an HTML header (h1, h2, h3, etc.) or, according to the comment in the code, a Markdown heading. That's the role of the preg_match call.
4) If it does find a header, it strips it of a) any link, if it has one; b) any other HTML tags. Keep in mind that this extractor was built to create titles for Wordpress. In Wordpress, the link in the title of a post is usually its permalink.
5) There's also a part that strips the words used for the title out of the content, so the content of the post doesn't duplicate its title (maybe it's an aesthetic decision, or something related to SEO: I don't know).

Later, when formatting the data retrieved from Tumblr's API, the script uses a call like this to create the title:

<title><?php echo htmlspecialchars(formatEntryTitle(&$post_content)) ?></title>

htmlspecialchars is a PHP string function designed to convert special characters to HTML entities. Then formatEntryTitle is applied to $post_content (which is defined individually for every post type according to the structure of the Tumblr API), that is, it extracts a title from the content of any post type.

So far so good. Except for two things:

1) This title extractor will work if and only if you made systematic use of HTML headers in your blog. Otherwise, the extractor won't find any h1, h2, h3, etc. and will return nothing. Basically, none of your posts will get a title after the export process is done.
2) I didn't use headers on my blog. But I set up a test blog with headers to see how it works, and I was never able to extract any title. Maybe it's me, the way I used headers in the body of the post. I don't know. It doesn't matter to me: I don't want to fix Ben's script (I'm not even sure it's broken), I want to customize it instead.
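For what it's worth, the heading check can be probed in isolation. The sketch below copies the pattern and the loop condition from the function quoted above and runs them on a few made-up lines. The helper names isHeadingLine and countLinesVisited and the sample lines are hypothetical, and I have not checked what the Tumblr API actually returns, so this only shows which shapes of input the quoted code accepts.

```php
<?php
// Pattern copied from formatEntryTitle() as quoted above.
// Hypothetical helper: true when a single line would be treated as a heading.
function isHeadingLine($line) {
    return preg_match('/^\s*(#+|<[hH][1-6]>).*$/', $line) === 1;
}

// Hypothetical helper mimicking the original loop condition `$l = $lines[$i]`.
// An empty string is falsy in PHP, so the loop stops at the first blank line.
// isset() is added here only to avoid reading past the end of the array.
function countLinesVisited($text) {
    $lines = explode("\n", $text);
    $visited = 0;
    for ($i = 0; isset($lines[$i]) && ($l = $lines[$i]); $i++) {
        $visited++;
    }
    return $visited;
}

// Made-up sample lines, not taken from a real Tumblr export.
$samples = array(
    '<h2>My title</h2>',              // match: bare tag at the start of the line
    '# My title',                     // match: Markdown heading
    '<h2 class="big">My title</h2>',  // no match: the tag carries an attribute
    '<p>Intro</p><h2>My title</h2>',  // no match: heading is not at the start
);

foreach ($samples as $line) {
    echo (isHeadingLine($line) ? 'match    ' : 'no match ') . $line . "\n";
}

// A blank line before the heading means the heading is never examined.
echo countLinesVisited("Intro\n\n<h2>My title</h2>") . "\n"; // prints 1
```

So a header that carries attributes, that shares its line with other markup, or that comes after a blank line would be missed by the code as quoted. Whether that is what happened on my test blog, I can't say.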

(That's where Stack Overflow comes into play.)

It should be simple. As I said, I don't know much about PHP, but I'm already halfway there. My idea is the following: the extractor could take ALL the post content... and truncate it. That's it. Simply get the first few words of the post and use them as a title. I know it's not a perfect solution for everyone (titles may not always be relevant to the content, and they will partially duplicate the content of the post), but 1) at least it would work for those who do not use HTML headers; 2) in my case, it's great because the first few words of each of my posts are content attribution: creator's name, name of the photo or book, year, etc.

I found this little PHP code by Chirp Internet. It's a simple truncating function. It can be customized in many ways. Moreover: it works. Here's how I initially tried to use it.

I kept the name of the function formatEntryTitle but emptied it of its content and replaced it with Chirp's code:

function formatEntryTitle($string, $limit, $break=".", $pad="...")
{
    // return with no change if string is shorter than $limit
    if(strlen($string) <= $limit) return $string;

    // is $break present between $limit and the end of the string?
    if(false !== ($breakpoint = strpos($string, $break, $limit))) {
        if($breakpoint < strlen($string) - 1) {
            $string = substr($string, 0, $breakpoint) . $pad;
        }
    }

    return $string;
}

Then, for each post type, I first define a string $title like so: $title = formatEntryTitle($post_content, 40, " "); (where 40 is the number of characters I want the post content truncated to and the blank space is the criterion that ends the truncation; in plain English: truncate at the first blank space after 40 characters) and then use the following line to output the title itself: <title><?php echo htmlspecialchars($title) ?></title>
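To make the call concrete, here is a small self-contained sketch that applies the function to a made-up caption. The function body is the one quoted above; the sample text is hypothetical and contains no HTML.

```php
<?php
// Truncation function as quoted above (Chirp Internet's code, renamed).
function formatEntryTitle($string, $limit, $break = ".", $pad = "...")
{
    // Return unchanged if the string is no longer than $limit.
    if (strlen($string) <= $limit) return $string;

    // Look for $break at or after position $limit and cut there.
    if (false !== ($breakpoint = strpos($string, $break, $limit))) {
        if ($breakpoint < strlen($string) - 1) {
            $string = substr($string, 0, $breakpoint) . $pad;
        }
    }

    return $string;
}

// Hypothetical post content, plain text only.
$post_content = 'Jane Doe, Untitled Photograph, 2010. Found in an old shoebox at a flea market.';

// The first blank space at or after character 40 follows the word "Found".
$title = formatEntryTitle($post_content, 40, " ");
echo htmlspecialchars($title) . "\n";
// prints: Jane Doe, Untitled Photograph, 2010. Found...
```

Note that strlen, strpos and substr count bytes, so with accented characters the cut may land a little earlier than 40 visible characters.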

And it works. It really does.

Except for one point. And that's where I need some help: my titles are... full of HTML tags. I need to remove every HTML tag from the truncated part so I can get a plain, clear, English-only title.

I've tried to make use of Ben's code (his title extractor strips HTML tags from the title) but it's beyond what I'm capable of doing. I think it's a problem related to the structure of the extractor: I don't know where to put the strip function.
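A minimal sketch of one possible arrangement, assuming the tags should be removed before the text is truncated, so that the 40-character limit counts visible text and no tag gets cut in half. The function name plainTextTitle and the sample content are hypothetical; strip_tags, html_entity_decode and preg_replace are standard PHP functions.

```php
<?php
// Hypothetical helper: turn HTML post content into a short plain-text title.
function plainTextTitle($html, $limit = 40, $pad = '...')
{
    // 1. Remove every HTML tag.
    $text = strip_tags($html);
    // 2. Decode entities (&amp; becomes &), since htmlspecialchars()
    //    will encode them again when the title is written out.
    $text = html_entity_decode($text, ENT_QUOTES, 'UTF-8');
    // 3. Collapse newlines and runs of whitespace into single spaces.
    $text = trim(preg_replace('/\s+/', ' ', $text));

    // 4. Truncate at the first blank space at or after $limit characters.
    if (strlen($text) <= $limit) {
        return $text;
    }
    $breakpoint = strpos($text, ' ', $limit);
    if ($breakpoint === false) {
        return $text; // no space after $limit: keep the whole text
    }
    return substr($text, 0, $breakpoint) . $pad;
}

// Made-up post content with markup, two paragraphs.
$post_content = '<p><em>Jane Doe</em>, <a href="http://example.com/">Untitled Photograph</a>, 2010.</p>'
              . "\n"
              . '<p>Found in an old shoebox at a flea market.</p>';

$title = plainTextTitle($post_content);
echo htmlspecialchars($title) . "\n";
// prints: Jane Doe, Untitled Photograph, 2010. Found...
```

The same ordering can be had without a new function by stripping first and truncating second: $title = formatEntryTitle(strip_tags($post_content), 40, " ");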

I know it's a long post, but hopefully someone will see the solution in a flash.

Thanks a lot!

P.