html-content-extraction

Using BeautifulSoup to find an HTML tag that contains certain text

I'm trying to get the elements in an HTML doc that contain the following pattern of text: #\S{11} <h2> this is cool #12345678901 </h2> So, the previous would match by using: soup('h2',text=re.compile(r' #\S{11}')) And the results would be something like: [u'blahblah #223409823523', u'thisisinteresting #293845023984'] I'm able to...
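As a point of comparison, the same kind of match can be sketched with only the Python standard library. This is a minimal sketch, not the BeautifulSoup approach from the question: the class `TagTextMatcher` and the helper `find_matching_text` are hypothetical names introduced here, and the sample input reuses the `<h2>` example quoted above.

```python
import re
from html.parser import HTMLParser


class TagTextMatcher(HTMLParser):
    """Collect the text of every `tag` element whose text matches `pattern`."""

    def __init__(self, tag, pattern):
        super().__init__()
        self.tag = tag
        self.pattern = pattern
        self.matches = []
        self._buffer = None  # None while the parser is outside the target tag

    def handle_starttag(self, tag, attrs):
        if tag == self.tag:
            self._buffer = []

    def handle_data(self, data):
        if self._buffer is not None:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == self.tag and self._buffer is not None:
            text = "".join(self._buffer).strip()
            if self.pattern.search(text):
                self.matches.append(text)
            self._buffer = None


def find_matching_text(html, tag, pattern):
    """Hypothetical helper: return the matching element texts as a list."""
    matcher = TagTextMatcher(tag, re.compile(pattern))
    matcher.feed(html)
    matcher.close()
    return matcher.matches


sample = "<h2> this is cool #12345678901 </h2><h2>no tag here</h2>"
print(find_matching_text(sample, "h2", r"#\S{11}"))
```

Because the text is buffered until the closing tag, nested markup inside the `<h2>` does not prevent a match, which may or may not be the behaviour wanted.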

Create Great Parser - Extract Relevant Text From HTML/Blogs

I'm trying to create a generalized HTML parser that works well on Blog Posts. I want to point my parser at the specific entry's URL and get back clean text of the post itself. My basic approach (from python) has been to use a combination of BeautifulSoup / Urllib2, which is okay, but it assumes you know the proper tags for the blog entr...

"Smart" way of parsing and using website data?

How does one intelligently parse data returned by search results on a page? For example, let's say that I would like to create a web service that searches for online books by parsing the search results of many book providers' websites. I could get the raw HTML data of the page, and do some regexes to make the data work for my web service,...

Extracting text fragment from a HTML body (in .NET)

I have HTML content that is entered by a user via a rich-text editor, so it can be almost anything (except elements that are not supposed to be inside the body tag; no worries about "head", doctype, etc.). An example of this content: <h1>Header 1</h1> <p>Some text here</p><p>Some more text here</p> <div align=right><a href="x">A link here</a></div><...

Parse a .Net Page with Postbacks

Hello, I need to read data from an online database that's displayed using an aspx page from the UN. I've done HTML parsing before, but it was always by manipulating query-string values. In this case, the site uses asp.net postbacks. So, you click on a value in box one, then box two shows, click on a value in box 2 and click a button to ...

python extract contents of regex

Hello, I want a regular expression to extract the title from an HTML page. Currently I have this: title = re.search('<title>.*</title>', html, re.IGNORECASE).group() if title: title = title.replace('<title>', '').replace('</title>', '') Is there a regular expression that will extract just the contents of <title> so I don't have to remov...
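One way to get only the contents is a capture group. The following is a minimal sketch under stated assumptions: the function name `extract_title` and the sample string are hypothetical, and a regular expression of this kind will not cope with every real-world page. It also checks for `None` before calling `group`, since `re.search` returns `None` when nothing matches.

```python
import re

# Hypothetical sample input; any HTML string would do.
sample = "<html><head><TITLE>  Example page </TITLE></head><body></body></html>"


def extract_title(html):
    """Return the text inside the first <title> element, or None if absent."""
    # The parentheses form a capture group, so group(1) holds only the contents.
    # .*? is non-greedy, and re.DOTALL lets the title span several lines.
    match = re.search(r"<title[^>]*>(.*?)</title>", html, re.IGNORECASE | re.DOTALL)
    if match is None:
        return None
    return match.group(1).strip()


print(extract_title(sample))
```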

HTML comment scraping in PHP

Hi there, I've been looking around but have yet to find a solution. I'm trying to scrape an HTML document and get the text between two comments, but so far I have been unable to do this successfully. I'm using PHP and have tried the PHP Simple DOM parser recommended here many times but can't seem to get it to do what I want. Here's (...

Text Extraction from HTML Java

Hi. I'm working on a program that downloads HTML pages and then selects some of the information and writes it to another file. I want to extract the information which is in between the paragraph tags, but I can only get one line of the paragraph. My code is as follows: FileReader fileReader = new FileReader(file); BufferedReader buffR...

How can I extract HTML content efficiently with Perl?

I am writing a crawler in Perl, which has to extract contents of web pages that reside on the same server. I am currently using the HTML::Extract module to do the job, but I found the module a bit slow, so I looked into its source code and found out it does not use any connection cache for LWP::UserAgent. My last resort is to grab HTML...

Possible to parse an HTML document and build a DOM tree (Java)

Is it possible, and what tools could be used, to parse an HTML document as a string or from a file and then to construct a DOM tree so that a developer can walk the tree through some API? For example: DomRoot = parse("myhtml.html"); for (tags : DomRoot) { } Note: this is an HTML document, not XHTML. ...

Looking for an information retrieval / text mining application or library

We extract various information from e-mails - flights, car rentals, hotels and more. The method is to extract the body of the mail, usually in HTML form, but sometimes it's text or we use the information in a PDF/Word/RTF attachment. We then apply regular expressions (sometimes in several steps) in order to get information, which is provid...

How to extract all text from an HTML file using PHP?

Hello, how can I extract all text from an HTML file? I want to extract all text, in the alt attributes, <p> tags, etc. However, I don't want to extract the text between style and script tags. Thanks. Right now I have the following code: <?PHP $string = trim(clean(strtolower(strip_tags($html_content)))); $arr = explode(" ", $str...
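The requirement described above (keep visible text and alt attributes, drop script and style contents) can be sketched as follows. This is a Python illustration of the idea rather than a PHP answer, and it uses only the standard library; `TextExtractor` and `extract_text` are hypothetical names introduced for the example.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text and alt attributes, skipping <script> and <style>."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0  # greater than 0 while inside a skipped element

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1
            return
        alt = dict(attrs).get("alt")
        if alt:
            self.parts.append(alt)

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-blank text that is outside script/style elements.
        if self._skip_depth == 0 and data.strip():
            self.parts.append(data.strip())


def extract_text(html):
    """Hypothetical helper: return the collected pieces joined by single spaces."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.parts)


sample = (
    "<html><head><style>p {color: red}</style><script>var x = 1;</script></head>"
    '<body><p>Hello <b>world</b></p><img src="a.png" alt="A logo"></body></html>'
)
print(extract_text(sample))
```

Stripping tags alone, as `strip_tags` does in the quoted snippet, leaves the script and style text in place, which is why the sketch tracks whether the parser is currently inside one of those elements.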

How to parse a < sign with htmlparser.Parse(sr)?

Hi, I am trying to export HTML to PDF using iTextSharp. In the HTML table, one value appears like this: <34. While parsing, this gives an error (closing tag, i.e. >, is required ...). Please tell me how to get through this. Thanks in advance, rsd ...

Python strategy for extracting text from malformed html pages

I'm trying to extract text from arbitrary html pages. Some of the pages (which I have no control over) have malformed html or scripts which make this difficult. Also I'm on a shared hosting environment, so I can install any python lib, but I can't just install anything I want on the server. pyparsing and html2text.py also did not seem ...

Extracting pure content / text from HTML Pages by excluding navigation and chrome content

Hi, I am crawling news websites and want to extract News Title, News Abstract (First Paragraph), etc I plugged into the webkit parser code to easily navigate webpage as a tree. To eliminate navigation and other non news content I take the text version of the article (minus the html tags, webkit provides api for the same). Then I run th...

What's the best way to write a maintainable web scraping app?

I wrote a perl script a while ago which logged into my online banking and emailed me my balance and a mini-statement every day. I found it very useful for keeping track of my finances. The only problem is that I wrote it just using perl and curl and it was quite complicated and hard to maintain. After a few instances of my bank changi...

How do I save a web page, programmatically?

I would like to save a web page programmatically. I don't mean merely save the HTML. I would also like to automatically store all associated files (images, CSS files, maybe embedded SWF, etc), and hopefully rewrite the links for local browsing. The intended usage is a personal bookmarks application, in which link content is cached in c...

Scraping from wsj.com or finance.yahoo.com

I want to display on a WordPress page the total volume of shares traded on the NYSE stock exchange over the last 2 weeks that it's been open. What is the best way to go about doing this? ...

BeautifulSoup - easy way to obtain HTML-free contents

I'm using this code to find all interesting links in a page: soup.findAll('a', href=re.compile('^notizia.php\?idn=\d+')) And it does its job pretty well. Unfortunately inside that a tag there are a lot of nested tags, like font, b and different things... I'd like to get just the text content, without any other html tag. Example of l...

Python HTML scraping

Hey, It's not really scraping, I'm just trying to find the URLs in a web page where the class has a specific value. For example: <a class="myClass" href="/url/7df028f508c4685ddf65987a0bd6f22e"> I want to get the href value. Any ideas on how to do this? Maybe regex? Could you post some example code? I'm guessing html scraping libs, su...
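A regex is probably not needed here. As a minimal sketch using only the Python standard library (a scraping library would likely be shorter), the href of every anchor carrying a given class can be collected like this; `ClassLinkCollector` and `hrefs_with_class` are hypothetical names, and the sample tag is the one quoted in the question.

```python
from html.parser import HTMLParser


class ClassLinkCollector(HTMLParser):
    """Collect href values of <a> tags whose class list contains `wanted_class`."""

    def __init__(self, wanted_class):
        super().__init__()
        self.wanted_class = wanted_class
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attr_map = dict(attrs)
        # A class attribute may hold several space-separated names.
        classes = (attr_map.get("class") or "").split()
        href = attr_map.get("href")
        if self.wanted_class in classes and href:
            self.hrefs.append(href)


def hrefs_with_class(html, wanted_class):
    """Hypothetical helper: return the matching href values in document order."""
    collector = ClassLinkCollector(wanted_class)
    collector.feed(html)
    collector.close()
    return collector.hrefs


sample = (
    '<a class="myClass" href="/url/7df028f508c4685ddf65987a0bd6f22e">one</a>'
    '<a class="other" href="/url/ignored">two</a>'
)
print(hrefs_with_class(sample, "myClass"))
```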