views: 74
answers: 2
This question has less to do with actual code, and more to do with the underlying methods.

My 'boss' at my pseudo-internship has requested that I write him a script that will scrape a list of links from a user's tweet (the list comes 'round once per week, and it's always the same user) and then publish said list to the company's Tumblr account.

Currently, I am thinking about this structure: the base will be a bash script that first calls some script that uses the Twitter API to find the post given a hashtag and parse the list (current candidates for the language being Perl, PHP and Ruby, in no particular order). Then, the script will store the parsed list (with some markup) in a text file, from which another script that uses the Tumblr API will format the list and then post it.

Is this a sensible way to go about doing this? So far in planning I'm only up to getting the Twitter post, but I'm already stuck between using the API to grab the post or just grabbing the feed they provide and attempting to parse it. I know it's not really a big project, but it's certainly the largest one I've ever started, so I'm paralyzed with fear when it comes to making decisions!

+1  A: 

Your approach seems appropriate.

  • Use the user_timeline Twitter API method to fetch the tweets posted by the user.
  • Parse the fetched list (perhaps using a regex) to extract the links from the tweets and store them in an external file.
  • Post those links to the Tumblr account using the Tumblr write API.

You may also want to track the last fetched tweet ID so that on the next run you can continue extraction from that point (the user_timeline method accepts a since_id parameter for exactly this).
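The parsing and since_id bookkeeping steps above can be sketched in a few lines of Python. The tweet data here is a hypothetical stand-in for the JSON the user_timeline call would return (only the id and text fields are used), and the URL regex is deliberately simple:

```python
import re

# Hypothetical sample data standing in for the JSON returned by a
# user_timeline call; field names mirror the real Twitter response.
tweets = [
    {"id": 101, "text": "This week's links: http://example.com/a and http://example.com/b"},
    {"id": 99,  "text": "Old tweet, already processed: http://example.com/old"},
]

# Only process tweets newer than the last one handled (the since_id idea).
last_seen_id = 99
URL_RE = re.compile(r"https?://\S+")

links = []
for tweet in tweets:
    if tweet["id"] > last_seen_id:
        links.extend(URL_RE.findall(tweet["text"]))

# Remember the newest ID for the next weekly run.
last_seen_id = max(t["id"] for t in tweets)

print(links)         # ['http://example.com/a', 'http://example.com/b']
print(last_seen_id)  # 101
```

In a real script you would persist last_seen_id to a file between runs, or simply pass it as since_id in the API request so Twitter filters the tweets for you.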

Harsha Hulageri
+1  A: 

From your description, there's no reason you shouldn't be able to do it all in one script, which would simplify things unless there's a good reason to ferry the data between two scripts. And before you go opening connections manually, there are libraries written for many languages for both Tumblr and Twitter that can make your job much easier. You should definitely not try to parse the RSS feed - they provide an API for a reason.*

I'd personally go with Python, as it is quick to get up and running and has great libraries for such things. But if you're not familiar with it, there are libraries available for Ruby or Perl too (PHP less so). Just Google "{platform} library {language}" - a quick search gave me python-tumblr, WWW::Tumblr, and ruby-tumblr, as well as python-twitter, Net::Twitter, and a Ruby gem "twitter".

Any of these libraries should make it easy to connect to Twitter and pull down the tweets for a particular user or hashtag via the API. You can then step through them, parsing them as needed, and use the Tumblr library to post the result to Tumblr in whatever format you want.
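The "format them for Tumblr" step is just string building. A minimal sketch, assuming the links have already been extracted (the list below is hypothetical) and that the Tumblr post body accepts HTML, which regular Tumblr text posts do:

```python
# Hypothetical links extracted from the tweet in the earlier step.
links = ["http://example.com/a", "http://example.com/b"]

# Build an HTML bullet list suitable for a Tumblr text-post body.
items = "\n".join(
    '  <li><a href="{0}">{0}</a></li>'.format(url) for url in links
)
post_body = "<ul>\n{}\n</ul>".format(items)

print(post_body)
```

You would then hand post_body to whichever Tumblr library you picked (e.g. python-tumblr's post-creation call) rather than opening the HTTP connection yourself.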

You can do it manually - opening and reading connections or, even worse, screen scraping, but there's really no sense in doing that if you have a good library available - which you do - and it's more prone to problems, quirks, and bugs that go unnoticed. And as I said, unless there's a good reason to use the intermediate bash script, it would be much easier to just keep the data within one script, in an array or some other data structure. If you need it in a file too, you can just write it out when you're done, from the same script.
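Writing the file out from the same script, if you want it at all, is a couple of lines at the end (the filename here is just an example):

```python
# Hypothetical link list carried in memory through the whole script.
links = ["http://example.com/a", "http://example.com/b"]

# Optionally dump it to disk at the end, one URL per line.
with open("weekly_links.txt", "w") as f:
    for url in links:
        f.write(url + "\n")
```

This keeps the file as a by-product of one script rather than the hand-off point between two.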

*The only possible complication here is if you need to authenticate to Twitter - which I don't think you do, if you're just getting a user timeline - they will be discontinuing basic authentication very soon, so you'll have to set up an OAuth account (see "What is OAuth" over at dev.twitter.com). This isn't really a problem, but makes things a bit more complicated. The API should still be easier than parsing the RSS feed.

cincodenada