html-parsing

How to get HTML tables using XPath in C?

I'm using libxml2 in my C project. I was wondering how I could grab all tables in an HTML file using XPath. Sample code will do the trick. I need to parse the data in an HTML table. Thanks. EDIT: This is a row of the table: <tr class="report-data-row-even"> <td class="NormalTxt report-data-cell report-data-column-even"><nobr>0.0285</nob...

How can I parse only part of an HTML file and ignore the rest?

In each of 5,000 HTML files I have to get only one line of text, which is line 999. How can I tell the HTML::Parser that I only have to get line 999? </p><h1>dataset 1:</h1> &nbsp;<table border="0" bgcolor="#EFEFEF" leftmargin="15" topmargin="5"><tr> <td><strong>name:</strong>&nbsp;</td> <td width=500> myname one </td></tr>...

How can I extract the contents of a specific table from HTML source using Perl?

I have to parse 5000 files, which look nearly identical. I like using HTML::TokeParser::Simple and DBI to do the parsing and store the results. I have little experience with HTML::TokeParser::Simple, but this task is over my head. Note: I have also had a look at the ideas; that also seems to be an appropriate way. But at...

Fetch pages [LWP], parse them [HTML::TokeParser] and store results [DBI]

Hello, good evening, dear Stack Overflow friends. A triple job: I have a job with three tasks: fetch pages, parse HTML, store data... And yes, this is a true Perl job! I have to run a parser over all 6000 sub-pages of a site in Switzerland (a governmental site, which has very good servers). See http://www.e...

Help extracting text from HTML table using XPath

I'm trying to pull the text between the nobr tags. This is part of the table: <table class="report-main-table dirLTR NormalTxt" width="100%" border="0" cellspacing="0" cellpadding="0"> <thead> <tr> <td class="report-data-title-cell report-data-column-odd"><nobr><b>&#1505;&#1492;"&#1499; &#1506;&#1500;&#1493;&#1514; &#1489;&#1...

Seeking HTML editor with visual tag matching

Is there an editor or IDE which will show HTML code with some visual indication of matching open/close tags? Kompozer sort of helps, but I would prefer something like .---><div> | | <h1>xxx</h1> | | .---><frameset> | | | | .---><div> | | | | | | <p>Lorem ipsum dolor sit amet</p> | | | | | .---></div> | | | .---...

HTML batch save into folders: any ideas?

I am transferring some old in-house HTML sites to a new system. The current folder structure is that all HTML files of all sites are in one folder, and all the images of all those sites are in an /images folder. Of course, I need to have separate folders for each HTML file and its images. Just before writing some code to do the job: Is anyone famil...

How to extract meaningful text from HTML

Hi, I would like to parse an HTML page and extract the meaningful text from it. Does anyone know some good algorithms to do this? I develop my applications on Rails, but I think Ruby is a bit slow at this, so if a good C library exists for this, it would be appropriate. Thanks! PS: Please do not recommend anything with Java ...

How to normalize HTML in JavaScript or jQuery?

Tags can have multiple attributes. The order in which attributes appear in the code does not matter. For example: <a href="#" title="#"> <a title="#" href="#"> How can I "normalize" the HTML in JavaScript, so the order of the attributes is always the same? I don't care which order is chosen, as long as it is always the same. UPDATE:...

Parse an HTML combo box in C#

Hi, I need to parse a select value in an HTML file. I have this HTML file: <html> <head></head> <body> <select id="region" name="region"> <option value="0" selected>Všetky regiony</option> <optgroup>Banskobystrický kraj</optgroup> <option value="k_1">Banskobystrický kraj</option> <option value="1">Banská ...

Parsing HTML to return CSS rules from id and class attributes with PHP

Hi, I hate having to write down a lot of CSS rules and then enter my styles in them, so I'd like to develop a tiny PHP script that would parse the HTML I pass to it and then return empty CSS rules. I decided to use PHP's DomDocument. The question is: How could I loop through the whole structure? (I saw that for example DomDocument on...

How can I use Python to get the contents inside of this span tag?

I'm trying to scrape the information from Google Translate as a learning exercise and I can't figure out how to reach the content of this span tag. <span title="Hello" onmouseover="this.style.backgroundColor='#ebeff9'" onmouseout="this.style.backgroundColor='#fff'"> Hallo </span> How would I...
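One possible way to reach that text, as a minimal sketch using only Python's standard-library `html.parser`: the class `SpanTextParser` and the helper `span_texts` are hypothetical names made up for this illustration, and the sample markup is the span quoted in the question. It collects the text of each outermost `<span>` together with its `title` attribute; it is not a robust scraper for a full page.

```python
from html.parser import HTMLParser


class SpanTextParser(HTMLParser):
    """Collects (title, text) for each outermost <span> element."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._depth = 0   # how many <span> elements are currently open
        self.spans = []   # list of [title, text] pairs

    def handle_starttag(self, tag, attrs):
        if tag != "span":
            return
        if self._depth == 0:
            # Start a new record for an outermost span.
            self.spans.append([dict(attrs).get("title"), ""])
        self._depth += 1

    def handle_endtag(self, tag):
        if tag == "span" and self._depth > 0:
            self._depth -= 1

    def handle_data(self, data):
        # Only keep text that appears inside a span.
        if self._depth > 0:
            self.spans[-1][1] += data


def span_texts(html):
    """Return a list of (title, stripped text) tuples, one per outermost span."""
    parser = SpanTextParser()
    parser.feed(html)
    parser.close()
    return [(title, text.strip()) for title, text in parser.spans]


sample = """<span title="Hello" onmouseover="this.style.backgroundColor='#ebeff9'" onmouseout="this.style.backgroundColor='#fff'"> Hallo </span>"""
print(span_texts(sample))
```

A third-party parser would likely be more convenient for real pages; the standard library is used here only to keep the sketch self-contained.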

HTML::TableExtract: applying the right attribs to specify the attributes of interest

I tried to run the following Perl script on the HTML further below. My problem is how to define the correct hash reference, with attribs that specify attributes of interest within my HTML <table> tag itself. #!/usr/bin/perl use strict; use warnings; use HTML::TableExtract; use YAML; my $table = HTML::TableExtract->new(keep_html=>0, d...

HTML code processing

Hello, I want to process some HTML code and remove the tags, as in this example: "<p><b>This</b> is a very interesting paragraph.</p>" results in "This is a very interesting paragraph." I'm using Python; do you know of any framework I can use to remove the HTML tags? Thanks! ...
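A minimal sketch of one approach, assuming the standard-library `html.parser` module is acceptable instead of a framework: `TagStripper` and `strip_tags` are hypothetical names chosen for this example. The parser ignores every tag and keeps only the text nodes.

```python
from html.parser import HTMLParser


class TagStripper(HTMLParser):
    """Keeps only the text nodes of an HTML fragment."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.chunks = []

    def handle_data(self, data):
        # Called for text between tags; the tags themselves are ignored.
        self.chunks.append(data)


def strip_tags(html):
    """Return the text content of `html` with all tags removed."""
    parser = TagStripper()
    parser.feed(html)
    parser.close()
    return "".join(parser.chunks)


print(strip_tags("<p><b>This</b> is a very interesting paragraph.</p>"))
# prints: This is a very interesting paragraph.
```

Note that this sketch also keeps the text inside `<script>` and `<style>` elements, so real pages may need extra filtering.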

How to get raw XML back from lxml?

I'm using the following code to locate a div: parser = etree.HTMLParser() tree = etree.parse(StringIO(page), parser) div = tree.xpath("//div[@class='content']")[0] My only problem is, that after doing this I do not want to rely on lxml to extract the contents of said div: I just want to get back the raw XML the div contains. Is this ...

Get data from specific HTML table cells using PHP

I need to get the data out of all of the table cells in the 4th row of the 4th table on an HTML page. After researching for a while, it seems that using DOMXPath is the best way to parse the HTML file. However, no IDs or classes are used anywhere in the file. What would be the best way to get the data out of these cells? Thanks in advan...

Delete HTML comment tags using regexp

Hi! This is what my text (HTML) file looks like <!-- | | | This is a dummy comment | | please delete me | | asap | | | ________________________________ | --> this is another line i...
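The question does not name a language, so the following is only a sketch in Python, assuming the comments never appear inside `<script>` blocks or other places where `<!--` has a different meaning. The name `remove_comments` and the sample text are made up for illustration. The non-greedy `.*?` combined with `re.DOTALL` lets a single match span several lines while stopping at the first `-->`.

```python
import re

# Non-greedy match from "<!--" to the nearest "-->", across line breaks.
COMMENT_RE = re.compile(r"<!--.*?-->", re.DOTALL)


def remove_comments(text):
    """Return `text` with every HTML comment removed."""
    return COMMENT_RE.sub("", text)


# Hypothetical sample, loosely modeled on the file described above.
sample = (
    "<!--\n"
    "| This is a dummy comment |\n"
    "| please delete me |\n"
    "-->\n"
    "this is another line\n"
)
print(remove_comments(sample))
```

A regular expression is adequate for simple files like this one; for arbitrary HTML, a real parser is the safer choice.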

Java String Manipulating HTML Tags

I have a Java string with some text and HTML: <title>test title</title> blabla bla more text What I am trying to achieve is two-fold: 1) Retrieve the content of <title></title> and save it in a separate string. 2) Remove that part of the original string: <title>test title</title> So the end result would be something like originalS...

Using Python lxml.html how can I find images within link tags?

Hi there. I am using lxml.html to parse some HTML to get links. However, when it hits a link which contains an image, it just returns blank. What I'd really like is to be able to detect if it's an image, and then try to return the image alt text. So it looks like this... from lxml.html import parse, fromstring doc = fromstring('<a hr...

Is it possible to use jQuery for HTML parsing?

Just out of curiosity, I am trying to see if it is possible to use jQuery to read an HTML file so that I can output the values of some HTML elements. I am looking for functionality like what Firebug provides, i.e. Firebug lets me use $() on any web page. What I am trying to achieve is: I have a bunch of HTML files I ...