html-content-extraction

(Experimental) wrapper induction libraries for Java. Do any exist?

I'd like to know if any (experimental or not) wrapper induction libraries for Java exist. Given a website of choice, I would like to be able to point my code at the product pages of that website. The wrapper induction library should be able to: - infer the 'wrapper' or schema of the product pages from a couple of examples. - have ...

Getting BeautifulSoup to find a specific <p>

I'm trying to put together a basic HTML scraper for a variety of scientific journal websites, specifically trying to get the abstract or introductory paragraph. The current journal I'm working on is Nature, and the article I've been using as my sample can be seen at http://www.nature.com/nature/journal/v463/n7284/abs/nature08715.html....

Extract news links from news website

Is there a reliable method to find the set of links that lead to detailed news pages? In other words, after visiting the first page of a website, I want only those links that refer to a news item. Any solution? ...

How does Facebook extract the right thumbnail for a link?

Hi, I'm wondering how Facebook extracts the right picture for an article from a link. It ignores icons, ad images, and other unrelated images and gives you the right image. What technique or method does it use? I've tried extracting all images with a PHP regex, but how do I find the right one? Thanks ...

How do I extract HTML content using Regex in PHP

I know, I know... regex is not the best way to extract HTML text. But I need to extract article text from a lot of pages, and I can store regexes in the database for each website. I'm not sure how XML parsers would work with multiple websites. You'd need a separate function for each website. In any case, I don't know much about regexes, so ...

Is there anything for Python that is like readability.js?

Hi, I'm looking for a package / module / function etc. that is approximately the Python equivalent of Arc90's readability.js http://lab.arc90.com/experiments/readability http://lab.arc90.com/experiments/readability/js/readability.js so that I can give it some input.html and the result is a cleaned-up version of that HTML page's "main t...

Get the rendered text from HTML (Delphi)

I have some HTML and I need to extract the actual written text from the page. So far I have tried using a web browser and rendering the page, then going to the document property and grabbing the text. This works, but only where the browser is supported (IE COM object). The problem is I want this to be able to run under Wine also, so...

Xquery parsing text with <a> tags

I am using XQuery to extract content from html pages. The html body structure is of this kind: <td> <a href ="hw1">xyz </a> Hello world 1 <a href="hw2">Helloworld 2</a> Helloworld 3 </td> My XQuery expression for extracting the text is as follows: //a[starts-with(@href,'hw1')]/following...

XQuery extract between two tags

I am currently working on extracting data from html. I would like to extract the text between two tags. <p class="xfHeading"><b>XYZ:</b></p> <p>asdfghjk</p> <p>sdsdsd</p> <p>asdvcvcfghjk</p> <p class="xfHeading"><b>ABC:</b></p> <P>fvgbhnjm</P> <p cl...
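
The kind of extraction described in this excerpt can be sketched outside XQuery as well. The following is a minimal sketch in Python using only the standard library's `html.parser`, not the asker's XQuery approach. It assumes that each section starts with a `<p class="xfHeading">` and runs until the next one; the names `SectionParser` and `extract_sections` are hypothetical and chosen for illustration.

```python
from html.parser import HTMLParser


class SectionParser(HTMLParser):
    """Groups plain <p> texts under the preceding heading paragraph."""

    def __init__(self, heading_class="xfHeading"):
        super().__init__()
        self.heading_class = heading_class  # assumed marker class for headings
        self.sections = {}      # heading text -> list of paragraph texts
        self.current = None     # heading the parser is currently under
        self.in_p = False       # True while inside any <p>
        self.in_heading = False  # True while inside a heading <p>
        self.buf = []

    def handle_starttag(self, tag, attrs):
        if tag != "p":          # tag names arrive lowercased, so <P> matches
            return
        self._flush()           # a new <p> implicitly closes an open one
        classes = (dict(attrs).get("class") or "").split()
        self.in_heading = self.heading_class in classes
        self.in_p = True

    def handle_endtag(self, tag):
        if tag == "p":
            self._flush()

    def handle_data(self, data):
        if self.in_p:
            self.buf.append(data)

    def _flush(self):
        """Store the buffered paragraph as a heading or as section content."""
        if not self.in_p:
            return
        text = "".join(self.buf).strip()
        self.buf = []
        self.in_p = False
        if self.in_heading:
            self.in_heading = False
            self.current = text
            self.sections.setdefault(text, [])
        elif self.current is not None and text:
            self.sections[self.current].append(text)


def extract_sections(html_text, heading_class="xfHeading"):
    parser = SectionParser(heading_class)
    parser.feed(html_text)
    parser.close()
    parser._flush()             # handle a trailing unclosed <p>
    return parser.sections
```

Calling `extract_sections(page)["XYZ:"]` would then give the paragraphs between the `XYZ:` heading and the next heading, provided the page follows the assumed structure.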

Get element content from a variable containing html

How do I use the DOM parser to extract the content of an HTML element held in a variable? More exactly: I have a form where the user inputs HTML in a text area. I want to extract the content of the first paragraph. I know there are many tutorials on this, but I could not find any on extracting from a variable rather than a file (page). Thanks ...
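
For illustration, parsing HTML held in a string rather than a file can look like the sketch below. It is written in Python with the standard library's `html.parser`; the excerpt does not say which language or DOM parser the asker uses, so this is an assumption, and the names `FirstParagraphParser` and `first_paragraph` are hypothetical.

```python
from html.parser import HTMLParser


class FirstParagraphParser(HTMLParser):
    """Collects the text content of the first <p> element only."""

    def __init__(self):
        super().__init__()
        self.inside = False   # True while inside the first <p>
        self.done = False     # True once the first <p> has ended
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag != "p":
            return
        if self.inside:
            # A second <p> implicitly ends an unclosed first paragraph.
            self.inside = False
            self.done = True
        elif not self.done:
            self.inside = True

    def handle_endtag(self, tag):
        if tag == "p" and self.inside:
            self.inside = False
            self.done = True

    def handle_data(self, data):
        if self.inside:
            self.parts.append(data)


def first_paragraph(html_text):
    """Return the text of the first <p> in an HTML string ('' if none)."""
    parser = FirstParagraphParser()
    parser.feed(html_text)    # the input is a plain string, not a file
    parser.close()
    return "".join(parser.parts).strip()
```

The string passed to `first_paragraph` would be whatever the form's text area submitted. Nested inline tags such as `<b>` are dropped and only their text is kept.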

How to extract HTML code for a website using an iframe and Silverlight

I need to load a specific webpage from a site that contains multiple images. I need to extract these images, but I can't do this manually because the image names follow no pattern and there will be hundreds of sites. I have a Silverlight application that loads the webpage in an iframe, and I intended to extract the HTML for th...

Extract Data from HTML using PHP

Here is what I am looking for: I have a link which displays some data in HTML format: http://www.118.com/people-search.mvc...0&pageNumber=1 The data comes in the format below: <div class="searchResult regular"> Bird John 56 Leathwaite Road London SW11 6RS 020 7228 5576 I want my PHP page to execute the above URL and Ex...

How to extract meaningful text from HTML

Hi, I would like to parse an HTML page and extract the meaningful text from it. Does anyone know some good algorithms to do this? I develop my applications on Rails, but I think Ruby is a bit slow for this, so a good C library would be appropriate if one exists. Thanks!! P.S.: Please do not recommend anything with Java ...