html-parsing

Beautify HTML code using Ruby or Java?

I'm looking for pure Ruby (or Java) solutions for beautifying HTML code. I'm currently using Hpricot to parse HTML, since Nokogiri and other HTML parsers require external C programs. I assume that I can use Hpricot to clean up HTML if I can come up with a good algorithm. I'd prefer not to reinvent if this has already been done. ...

Parse html with ajax json inside

Hi, I have files like these to parse (from scraping) with Python: some HTML and JS here... SomeValue = { 'calendar': [ { 's0Date': new Date(2010, 9, 12), 'values': [ { 's1Date': new Date(2010, 9, 17), 'price': 9900 }, { 's1Date': new Date(2010, 9, 18), 'price': 9900 }, ...

Best methods to parse HTML

I'm working on a system that requires parsing HTML documents in PHP. My question is simply this: what's the best method of parsing content for relevant information? When I parse a site I don't want random content; I want to find relevant content such as blocks of text, images, links, etc., but obviously I don't want header links...

libxml2 - insert child node before parent node's content

I'm using libxml2 to parse HTML. The HTML might look like this: <div> Some very very long text here. </div> I want to insert a child node, e.g. a header, before the text, like this: <div> <h3> Some header here </h3> Some very very long text here. </div> Unfortunately, libxml2 always adds my header after t...

Posting to Facebook through Graph API - HTML Blog Entry with Inline Images

Let's say there is a blog entry for which you have the HTML. It looks like this: <h1>Hi</h1> <img src="http://thesource.com/someImage.gif"/> <p>And just a little more text, with a &nbsp;</p> If you use the Graph API to send this to Facebook, the message will look exactly as it appears above. I'm using HTMLCleaner in order to clea...

Python 2.7, ValueError when dealing with HTMLParser

This is my first time working with the HTMLParser module. I'm trying to use standard string formatting on the output, but it's giving me an error. The following code: import urllib2 from HTMLParser import HTMLParser class LinksParser(HTMLParser): def __init__(self, url): HTMLParser.__init__(self) req = urllib2.urlopen(url) ...

Python: Check the value of a variable passed as a parameter in another method?

Somewhat related to my earlier question. I'm making a simple HTML parser to play around with in Python 2.7. I would like to have multiple parse types, i.e. it can parse for links, script tags, images, etc. I'm using the HTMLParser module, so my initial thought was to just make a separate class for each thing I want to parse. But that seemed ra...

formatting date string for html table

Can someone show me how to change this date stamp and print this in an html table? I have an input file with this time stamp format: 4-Start=20100901180002 This time format is stored like this in an array. I print out the array like so to create an html table: foreach ($data as $row){ $counter ++; ...
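The excerpt is cut off before the PHP loop body, so the following is only a Python sketch of the conversion step, not the asker's code. It assumes the stamp is laid out as YYYYMMDDHHMMSS (which matches 20100901180002) and that each entry looks like `label=stamp`. The names `format_stamp` and `table_row` are hypothetical.

```python
from datetime import datetime
from html import escape


def format_stamp(entry):
    """Turn 'label=YYYYMMDDHHMMSS' into a readable date string.

    Assumption: the part after '=' is a 14-digit timestamp.
    """
    _, _, raw = entry.partition("=")
    parsed = datetime.strptime(raw.strip(), "%Y%m%d%H%M%S")
    return parsed.strftime("%Y-%m-%d %H:%M:%S")


def table_row(entry):
    """Build one HTML table row: the label in one cell, the date in the next."""
    label, _, _ = entry.partition("=")
    return "<tr><td>{}</td><td>{}</td></tr>".format(
        escape(label), escape(format_stamp(entry))
    )


print(table_row("4-Start=20100901180002"))
```

The same idea carries over to PHP: parse the 14-digit string with an explicit format, then print the reformatted value inside the table cell.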

HTML Parsing, iterating over a Dictionary<>, no results returning when expected. C#

Working with HTML Agility Pack in C#. Running the following code on a site I know should return some values keeps coming up blank. Can anyone see what I'm doing wrong here? public Dictionary<string, string> linkMiner(string site) { Dictionary<string, string> links = new Dictionary<string, string>(); url = site; ...

Get a list of all the urls in a web page

What's the best way to get an array of all the URLs in a web page, and how would I do it? ...
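The question does not name a language, so here is one possible approach sketched in Python with only the standard library. It collects the `href` of every `<a>` tag and resolves relative links against a base URL. The class name `LinkCollector`, the helper `extract_urls`, and the sample markup are illustrative assumptions.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collects the href of every <a> tag, resolved against a base URL."""

    def __init__(self, base_url=""):
        super().__init__()
        self.base_url = base_url
        self.urls = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            # Skip anchors without an href or with an empty one.
            if name == "href" and value:
                self.urls.append(urljoin(self.base_url, value))


def extract_urls(html, base_url=""):
    """Return all link targets found in the given HTML string, in document order."""
    parser = LinkCollector(base_url)
    parser.feed(html)
    parser.close()
    return parser.urls


sample = '<p><a href="/about">About</a> <a href="http://example.com/x">X</a></p>'
print(extract_urls(sample, "http://example.com/"))
```

This only looks at `<a href>`; URLs in `<img src>`, `<link>`, or scripts would need extra branches in `handle_starttag`.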

Classify a table in lxml

I am working with a large set of HTML documents. One of my tasks is to extract all text from the documents. I have gotten pretty far, but now I am stumped by the use of tables as containers / formatting structures for information that is not numeric in nature. My goal is to ignore - leave behind - not extract the 'table' if it i...

Avoiding a space leak reading an HTML document with HXT

Link to truncated version of example document I'm trying to extract the large chunk of text in the last "pre", process it, and output it. For the purposes of argument, let's say I want to apply concatMap (unwords . take 62 . drop 11) . lines to the text and output it. This takes over 400M of space on a 4M html document when I do it....

Can any of Ruby's HTML Parsers do Javascript to see the resulting DOM?

With Hpricot and Nokogiri, the HTML can be fetched and parsed, but can they also execute the JavaScript so that the resulting content shows up in the DOM? I ask because some pages won't show the info unless the JavaScript interpreter has run. ...

parsing html to get data

Hi, I am having a problem parsing HTML from which I would like to get data: <td id="Company" style="border-bottom-width: 0px; padding-left: 5px"> <strong>ABC</strong> </td> So the data I need is of course "ABC" only. I have tried the following parsing rule but it does not work: /<td id=\"Company\" style=\"border-bottom-width: ...
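The regex in the excerpt is truncated, so it cannot be repaired here. As an alternative that avoids matching the `style` attribute at all, this Python sketch (standard library only) reads the text of the cell whose `id` is `Company`. The names `CellTextParser` and `cell_text` are hypothetical, and the sketch assumes the target cell contains no nested `<td>`.

```python
from html.parser import HTMLParser


class CellTextParser(HTMLParser):
    """Gathers the text inside the <td> whose id matches cell_id."""

    def __init__(self, cell_id):
        super().__init__()
        self.cell_id = cell_id
        self.inside = False
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == "td" and dict(attrs).get("id") == self.cell_id:
            self.inside = True

    def handle_endtag(self, tag):
        # Assumption: no nested tables, so any </td> closes the target cell.
        if tag == "td":
            self.inside = False

    def handle_data(self, data):
        if self.inside:
            self.chunks.append(data)


def cell_text(html, cell_id):
    """Return the stripped text content of the matching cell, or '' if absent."""
    parser = CellTextParser(cell_id)
    parser.feed(html)
    parser.close()
    return "".join(parser.chunks).strip()


html = ('<td id="Company" style="border-bottom-width: 0px; padding-left: 5px"> '
        '<strong>ABC</strong> </td>')
print(cell_text(html, "Company"))
```

Because the lookup keys on the `id` alone, changes to the inline style or to whitespace around `<strong>` do not break it.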

Extracting everything but tags from a web page without a parser - using scanner and regex?

Working on the Android SDK; it's Java minus some things. I have a solution that pulls out two regex patterns from web pages. The problem I'm having is that it's finding things inside HTML tags. I tried jTidy, but it was just too slow on Android. Not sure why, but my Scanner regex match solution whips it many times over. Currently, I g...

Node.setTextContent("STRING") replaces all the children of the current node.

Hi everyone. I have a problem with the Node class. I'm parsing XHTML with the NekoHTML library to translate each string from a web page. My problem arises when a tag includes other tags, for example divs inside divs. I need to extract only the text, translate it, and replace it, but when I use setTextContent ...

Render HTML Webpage to text in Java

I would like to get the text representation of a website in a human-readable form, for example hyperlink locations or input fields. Is there any library that does this? (I've checked Jericho Renderer but it does not show input fields) For example <div> <form action="example.php"> Name: <input type="text" name="name_field"> <input type="...

Getting non-contiguous text with lxml / ElementTree

Suppose I have this sort of HTML from which I need to select "text2" using lxml / ElementTree: <div>text1<span>childtext1</span>text2<span>childtext2</span>text3</div> If I already have the div element as mydiv, then mydiv.text returns just "text1". Using itertext() seems problematic or cumbersome at best since it walks the entire tr...
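In the ElementTree data model, text that follows a child element is stored on that child as its `tail`, not on the parent. A minimal sketch using the standard-library `xml.etree.ElementTree` (lxml exposes the same `text` and `tail` attributes) follows; the helper `direct_text` is a hypothetical name.

```python
import xml.etree.ElementTree as ET

mydiv = ET.fromstring(
    "<div>text1<span>childtext1</span>text2<span>childtext2</span>text3</div>"
)

# "text2" follows the first <span>, so it is that span's tail.
first_span = mydiv[0]
text2 = first_span.tail


def direct_text(element):
    """Return the element's own text pieces, skipping text inside children."""
    parts = [element.text or ""]
    parts.extend(child.tail or "" for child in element)
    return parts


print(text2)
print(direct_text(mydiv))
```

Iterating over the direct children only touches one level of the tree, which sidesteps the full traversal that `itertext()` performs.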

Fetch <td> text while using WWW::Mechanize to fetch <a> within that <td> tag

Hi experts, I'm new to Perl-HTML things. I'm trying to fetch both the text and the links from an HTML table. Here is the HTML structure: <td>Td-Text <br> <a href="Link-I-Want" title="title-I-Want">A-Text</a> </td> I've figured out that WWW::Mechanize is the easiest module to fetch things I need from the <a> part, but I'm not su...

Help me rewrite this regex to not match tags with attributes?

========================================================================= EDIT: I'm using node.js, so I don't have access to the DOM, and parsing with an HTML parser is not an option (it's not efficient enough to justify parsing through such a small amount of text) =======================================================================...