Is there a way to embed only a section of a website in another HTML page?

Example: I see an answer I want to blog about, so I grab the HTML content, and splat it in somewhere, and show only that, styled like it is on stackoverflow. Basically, I want to blockquote the section of the page with original styling, if that makes sense. Is that something the site itself has to provide, or can I use an iframe and tell it to show only a certain element or something crazy? Open to all options, but I want it to show up as HTML, not as an image (that's really a last resort).

If this is even possible, are there security concerns I need to be aware of?

A: 

That sounds like something that IE8's Web Slices would be perfect for. However, it's only available in IE8, and the site of origin would have to implement it for you to be able to take advantage of it.

Jeremy Sullivan
+3  A: 

I don't think an image should really be a last resort. You have no control over the HTML/CSS of the source page, so even if you craft a solution (probably by using JavaScript to parse out the desired snippet), there is no guarantee that the site won't change its layout tomorrow.

Even Jeff, who has control over the layout of stackoverflow.com, still prefers to screen-capture the site, rather than pull in the contents live.

Now if your goal was to have the contents auto-update, that would be a different story. But still, unless you use some agreed-upon method of sharing content, such as RSS, your solution would be very fragile.

Eugene Katz
Images also have the advantage of hack-free HTML support, total security, and not leeching bandwidth from the target site, which is unlikely to thank you for it.
annakata
A: 

I'd recommend using a server-side solution with Python: use urllib2 to request the page, then use BeautifulSoup to parse out the bit that you need. BeautifulSoup has a very flexible selection API with which you can craft heuristics for the section you are interested in.

To illustrate:

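from bs4 import BeautifulSoup  # current package name for BeautifulSoup
# html holds the page source fetched earlier (e.g. with urllib2)
soup = BeautifulSoup(html, "html.parser")
text = soup.find(text="Some text on the page that is unlikely to change")
# Print the element that contains the matched text, not the whole document
print(text.parent.prettify())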

That way if the webmaster later changes the markup on the page, your scraping script should still work.

EoghanM
+1  A: 

The concept you are describing is roughly what is called a "purple include" or "transclusion". There is a library out there for it, but it's not exactly actively developed. Here are a couple of Ajaxian articles on it.

Russell Leggett
A: 

On the client side, an <iframe> is the only practical option. It is possible to scroll it to the part you want, but that might not work in the long term, because it's technically close to a clickjacking attack.

There's also cross-site XHR, but it requires opt-in from the destination site, and today it works only in a few of the latest browsers.
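As a rough sketch of what that opt-in looks like on the destination site: the server has to answer with an Access-Control-Allow-Origin header naming the page that wants to read the response. The handler, the allowlist, the cors_headers helper and the blog.example origin below are all made up for illustration; only the header itself is the real mechanism.

```python
from http.server import BaseHTTPRequestHandler

# Hypothetical allowlist of pages permitted to read this site's responses.
ALLOWED_ORIGINS = {"https://blog.example"}


def cors_headers(request_origin):
    """Return the extra response headers that opt in to a cross-site read."""
    if request_origin in ALLOWED_ORIGINS:
        return [("Access-Control-Allow-Origin", request_origin),
                ("Vary", "Origin")]
    # No header means no opt-in: the browser refuses to expose the response.
    return []


class SnippetHandler(BaseHTTPRequestHandler):
    """Illustrative handler that serves one fragment with the opt-in headers."""

    def do_GET(self):
        body = b"<blockquote>An answer worth quoting</blockquote>"
        self.send_response(200)
        self.send_header("Content-Type", "text/html; charset=utf-8")
        self.send_header("Content-Length", str(len(body)))
        for name, value in cors_headers(self.headers.get("Origin")):
            self.send_header(name, value)
        self.end_headers()
        self.wfile.write(body)
```

Since you don't control the destination site, this only helps if its owner is willing to add your origin.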

Getting the HTML on the server side is easy: every decent web framework can download a page and parse the HTML, and you can use XPath/XSLT or the DOM to extract the bit you want.
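A minimal sketch of the extraction step, using only Python's standard html.parser so it runs without extra packages. It assumes the bit you want is wrapped in an element with a known id (answer-42 below is made up) and that the tags inside it are properly closed. FragmentExtractor and extract_fragment are names invented for this example, and a real XPath or DOM library would cope better with sloppy markup.

```python
from html.parser import HTMLParser

# Elements that never take a closing tag, so they must not change the depth.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class FragmentExtractor(HTMLParser):
    """Collect the markup of the first element whose id equals target_id."""

    def __init__(self, target_id):
        super().__init__(convert_charrefs=False)
        self.target_id = target_id
        self.depth = 0      # nesting depth inside the target, 0 means outside
        self.done = False   # set once the target element has been closed
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if self.done:
            return
        if self.depth == 0 and dict(attrs).get("id") != self.target_id:
            return
        self.parts.append(self.get_starttag_text())
        if tag in VOID_TAGS:
            self.done = self.depth == 0  # a void target is complete already
        else:
            self.depth += 1

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <br/> are copied only inside the target.
        if self.depth > 0:
            self.parts.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if self.depth == 0 or tag in VOID_TAGS:
            return
        self.parts.append("</%s>" % tag)
        self.depth -= 1
        if self.depth == 0:
            self.done = True

    def handle_data(self, data):
        if self.depth > 0:
            self.parts.append(data)

    def handle_entityref(self, name):
        if self.depth > 0:
            self.parts.append("&%s;" % name)

    def handle_charref(self, name):
        if self.depth > 0:
            self.parts.append("&#%s;" % name)


def extract_fragment(html_text, target_id):
    """Return the target element's markup, or "" if the id is not found."""
    parser = FragmentExtractor(target_id)
    parser.feed(html_text)
    parser.close()
    return "".join(parser.parts)
```

Calling extract_fragment(page_source, "answer-42") gives back just that element as a string. Comments are dropped, and the result still has to be filtered before you embed it.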

Getting the styles, however, is going to be tricky: CSS rules may not work with an HTML fragment taken out of context. You'd have to parse the CSS and extract and transform the rules, or use a browser and read the currentStyle of every node.

Obviously you have to heavily filter the HTML you extract to avoid XSS. It's harder than it seems.
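To give an idea of the shape of such a filter, here is a minimal allowlist sketch built on Python's standard html.parser. The tag list, the attribute list and the names Sanitizer and sanitize are assumptions made for this example. It drops everything not explicitly allowed, but it does not balance tags or cover every trick, so treat it as an illustration of the approach rather than a complete defence.

```python
from html import escape
from html.parser import HTMLParser

# Assumed allowlists for illustration; tighten or extend them as needed.
ALLOWED_TAGS = {"p", "b", "i", "em", "strong", "code", "pre", "blockquote",
                "ul", "ol", "li", "a", "br"}
ALLOWED_ATTRS = {"a": {"href"}}
SAFE_SCHEMES = ("http://", "https://")
DROP_CONTENT = {"script", "style"}  # remove these together with their content


class Sanitizer(HTMLParser):
    """Rebuild the markup from allowlisted tags and attributes only."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []
        self.skip_depth = 0  # greater than 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in DROP_CONTENT:
            self.skip_depth += 1
            return
        if self.skip_depth or tag not in ALLOWED_TAGS:
            return
        kept = []
        for name, value in attrs:
            if value is None or name not in ALLOWED_ATTRS.get(tag, ()):
                continue  # drops onclick, style, and anything else unlisted
            value = value.strip()
            if name == "href" and not value.lower().startswith(SAFE_SCHEMES):
                continue  # drops javascript: and other unexpected schemes
            kept.append(' %s="%s"' % (name, escape(value, quote=True)))
        self.out.append("<%s%s>" % (tag, "".join(kept)))

    def handle_endtag(self, tag):
        if tag in DROP_CONTENT:
            self.skip_depth = max(0, self.skip_depth - 1)
        elif not self.skip_depth and tag in ALLOWED_TAGS and tag != "br":
            self.out.append("</%s>" % tag)

    def handle_data(self, data):
        # Text arrives unescaped, so escape it again on the way out.
        if not self.skip_depth:
            self.out.append(escape(data, quote=False))


def sanitize(html_text):
    """Return html_text reduced to the allowlisted subset of HTML."""
    parser = Sanitizer()
    parser.feed(html_text)
    parser.close()
    return "".join(parser.out)
```

The point of rebuilding the output from scratch, instead of deleting known-bad pieces from the input, is that anything the filter does not recognise is dropped by default.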

If you don't need to automate this, a good HTML+CSS WYSIWYG editor might be able to extract a content fragment together with its styles.

porneL