questions about html-parsing | ansaurus

html-parsing

How to replace text not within a specific-Tag in JavaScript

I have a string (partly HTML) in which I want to replace :-) with the BBCode :wink:. The replacement should not happen inside <pre>, but it should happen inside any other tag (or outside of tags altogether). For example, I want to turn :-)<pre>:-)</pre><blockquote>:-)</blockquote> into :wink:<pre>:-)</pre><blockquote>:wink:</blockquote> I alrea...
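
The question asks about JavaScript, but the underlying technique can be sketched in Python: split the string on <pre>...</pre> blocks and run the replacement only on the pieces in between. The function name replace_outside_pre and the pattern below are illustrative assumptions, not taken from the question. The sketch does not handle nested <pre> elements, and it would also replace a :-) that appears inside an attribute value.

```python
import re

# One capturing group, so re.split keeps the <pre> blocks in the result list.
PRE_BLOCK = re.compile(r"(<pre\b[^>]*>.*?</pre>)", re.IGNORECASE | re.DOTALL)

def replace_outside_pre(html, old=":-)", new=":wink:"):
    """Replace `old` with `new` everywhere except inside <pre> blocks."""
    parts = PRE_BLOCK.split(html)
    # With a single capturing group, odd-indexed parts are the <pre> blocks.
    return "".join(
        part if index % 2 == 1 else part.replace(old, new)
        for index, part in enumerate(parts)
    )

print(replace_outside_pre(":-)<pre>:-)</pre><blockquote>:-)</blockquote>"))
# prints :wink:<pre>:-)</pre><blockquote>:wink:</blockquote>
```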

Java: I have a big string of html and need to extract the href="..." text...

I have this string containing a large chunk of html and am trying to extract the link from href="..." portion of the string. The href could be in one of the following forms: <a href="..." /> <a class="..." href="..." /> I don't really have a problem with regex but for some reason when I use the following code: String innerHTM...
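
The question is about Java and its code is cut off above, so the following is only a Python sketch of a parser-based alternative to a regex: walk the start tags and collect every href on an <a> element. HrefCollector and extract_hrefs are hypothetical names, and the sample input is made up to cover the two forms mentioned in the question.

```python
from html.parser import HTMLParser

class HrefCollector(HTMLParser):
    """Collects the href value of every <a> start tag it sees."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        # Self-closing tags such as <a href="..." /> also end up here,
        # because the default handle_startendtag delegates to this method.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value is not None:
                    self.hrefs.append(value)

def extract_hrefs(html):
    parser = HrefCollector()
    parser.feed(html)
    parser.close()
    return parser.hrefs

sample = '<a href="/one" /> <a class="x" href="/two">two</a>'
print(extract_hrefs(sample))  # prints ['/one', '/two']
```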

Parse HTML Page For Links With Regex Using Perl

Possible Duplicate: How can I remove external links from HTML using Perl? Alright, I'm working on a job for a client right now who just switched his language choice to Perl. I'm not the best at Perl, but I've done stuff like this with it before, albeit a while ago. There are lots of links like this: <a href="/en/subtitles/35...

How can I find the contents of a div using Perl's HTML modules, if I know a tag inside of it?

Ever since I asked how to parse HTML with regex and got bashed a bit (rightfully so), I've been studying the HTML::TreeBuilder, HTML::Parser, HTML::TokeParser, and HTML::Element Perl modules. I have HTML like this: <div id="listSubtitlesFilm"> <dt id="a1"> <a href="/45/subtitles-67624.aspx"> .45 (2006) </a> </dt> </div> ...

Matching pair tag with regex

I'm trying to retrieve specific tags with their content out of an xhtml document, but it's matching the wrong ending tags. In the following content: <cache_namespace name="content"> <content_block id="15"> some content here <cache_namespace name="user"> <content_block id="welcome"> Welcome Apiko...
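
A single regex tends to pair an opening tag with the wrong closing tag when the same element is nested. One common workaround, sketched here in Python under the assumption that the elements are never self-closing, is to scan the open and close tags in order and keep a depth counter. The function find_balanced and the shortened sample are illustrative, since the original markup is truncated above.

```python
import re

def find_balanced(text, tag):
    """Return each outermost <tag>...</tag> span, nested occurrences included."""
    token = re.compile(r"<(/?)%s\b[^>]*>" % re.escape(tag))
    depth = 0
    start = None
    spans = []
    for match in token.finditer(text):
        if match.group(1) == "":      # opening tag
            if depth == 0:
                start = match.start()
            depth += 1
        elif depth > 0:               # closing tag that has a matching opener
            depth -= 1
            if depth == 0:
                spans.append(text[start:match.end()])
    return spans

sample = (
    '<cache_namespace name="content">outer '
    '<cache_namespace name="user">inner</cache_namespace>'
    ' tail</cache_namespace>'
)
print(len(find_balanced(sample, "cache_namespace")))  # prints 1
```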

Getting elements by type in malformed HTML

What's the easiest way in Java to retrieve all elements with a certain type in a malformed HTML page? So I want to do something like this: public static void main(String[] args) { // Read in an HTML file from disk // Retrieve all INPUT elements regardless of whether the HTML is well-formed // Loop through all elements and r...
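
The question targets Java, so this is only a Python sketch of the same idea: a tolerant, event-based parser reports start tags as it encounters them, so elements can be collected without the document being well-formed. InputCollector and find_inputs are hypothetical names, and the malformed sample is invented for illustration.

```python
from html.parser import HTMLParser

class InputCollector(HTMLParser):
    """Records the attributes of every <input> start tag."""

    def __init__(self):
        super().__init__()
        self.inputs = []

    def handle_starttag(self, tag, attrs):
        # Tag and attribute names arrive lowercased, so <INPUT> matches too.
        if tag == "input":
            self.inputs.append(dict(attrs))

def find_inputs(html):
    parser = InputCollector()
    parser.feed(html)
    parser.close()
    return parser.inputs

# Unclosed <div> and <p>, uppercase tag names, unquoted attribute values.
malformed = (
    '<form><div><INPUT TYPE="text" NAME="q">'
    '<p>unclosed<input type=submit name=go></form>'
)
for attributes in find_inputs(malformed):
    print(attributes)
```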

Help with parsing XML document using 'Reader' and Nokogiri

Hi. I am a newbie when it comes to using the Nokogiri Reader to parse an XML file. Here is the XML file I want to parse and sample code: <?xml version='1.0' encoding='UTF-8'?> <inventory> <tire name="super slick racing tire" /> <tire name="all weather tire" /> </inventory> -------------------------------------------------------------...

Making BeautifulSoup ignore contents inside script tags

I have been trying to get BeautifulSoup (3.1.0.1) to parse an HTML page that has a lot of JavaScript that generates HTML inside tags. One example fragment looks like this: <html><head><body><div> <script type='text/javascript'> if(ii > 0) { html += '<span id="hoverMenuPosSepId" class="hoverMenuPosSep">|</span>' } html += '<div class=...

BeautifulSoup - easy way to obtain HTML-free contents.

I'm using this code to find all interesting links in a page: soup.findAll('a', href=re.compile('^notizia.php\?idn=\d+')) And it does its job pretty well. Unfortunately inside that a tag there are a lot of nested tags, like font, b and different things... I'd like to get just the text content, without any other html tag. Example of l...
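
As a rough illustration of getting just the text out of a link with nested tags, here is a sketch that uses Python's standard html.parser in place of BeautifulSoup (a deliberate swap, so that no BeautifulSoup API is assumed). TextExtractor and strip_tags are hypothetical names, and the sample link is invented because the original example is cut off.

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Keeps only the character data and drops every tag."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)

def strip_tags(fragment):
    parser = TextExtractor()
    parser.feed(fragment)
    parser.close()
    # Join the pieces, then collapse runs of whitespace into single spaces.
    return " ".join("".join(parser.chunks).split())

link = '<a href="notizia.php?idn=1"><font size="2"><b>Some</b> headline</font></a>'
print(strip_tags(link))  # prints Some headline
```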

html-content-extraction

How to parse a rendered web page containing javascript.

How can one extract data from a rendered web page in which JavaScript updates the data over time? Is it possible to write a user script that can access variables from the page's JavaScript? Please suggest a possible way to achieve this. ...

information-extraction

What is the best way to crawl login-based sites?

I have to automate a file download from a website (similar to, let's say, yahoomail.com). To reach the page that has the file download link, I have to log in, jump from page to page to provide some parameters such as dates, and finally click on the download link. I am thinking of three approaches: Using WatIN and develop a windows s...

C# Regex - How to parse string for Swedish letters åäöÅÄÖ?

I'm trying to parse an HTML file for strings in this format: <a href="/userinfo/userinfo.aspx?ID=305157" target="main">MyUsername</a> O22</td> I want to retrieve "305157", "MyUsername" and the first letter in "O22" (which can be either T, K or O). I'm using this regex: <a href="/userinfo/userinfo\.aspx\?ID=\d*"...
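
The original regex is truncated, so the pattern below is a hypothetical reconstruction of the intent, written in Python rather than C#. It assumes the page has already been decoded into a Unicode string, and it captures the username with [^<]+ so that letters such as åäöÅÄÖ need no special character class. It does not attempt to diagnose the issue in the original C# code.

```python
import re

# Hypothetical pattern: three groups for the ID, the username and the letter.
ROW = re.compile(
    r'<a href="/userinfo/userinfo\.aspx\?ID=(\d+)" target="main">([^<]+)</a> ([TKO])'
)

def parse_user(html):
    """Return (id, username, letter), or None when the pattern does not match."""
    match = ROW.search(html)
    return match.groups() if match else None

line = '<a href="/userinfo/userinfo.aspx?ID=305157" target="main">MyUsername</a> O22</td>'
print(parse_user(line))  # prints ('305157', 'MyUsername', 'O')
```

The same call on a made-up row whose username is Åsa Öberg returns that name unchanged in the second position.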

Parsing web pages

I have a question about parsing HTML pages, specifically forums. I want to parse a forum or thread containing posts that match certain criteria. I haven't defined the algorithm yet, since I have only parsed structured text formats before. A use case might be to copy and paste each thread into the program by hand, or to insert a URL like http://www.forums....

What is parsing?

Parsing is something I come across a lot in development, but as a junior it's one of those things I assume I will get the hang of at some point, when it's needed. In my current project I've been told to find and use an HTML parser for a certain function. I have found a couple on the web, but what does an HTML parser actually do? And what do...

Find all CSS styles used on website

I have a DotNetNuke skin that has a single CSS file over 3,500 lines long. It contains styles for YUI, Telerik, Cluetip as well as the actual customisation of the site. The old developers just kept adding styles and never cleaned up the old unused ones. I want to clean up the file and get it to a more manageable size. I first thought abou...

Need some help with XPath expression. One works, the other doesn't...

I'm using the COBRA HTMLParser but haven't had luck parsing one particular tag. Here's the source: <li id="eta" class="hentry"> <span class="body"> <span class="actions"> </span> <span class="content"> </span> <span class="meta entry">Content here </span> <span class="meta entry stub">Content here <span...

C# library to clean up html

Hi, I was wondering if there is a library in .NET to clean up and remove unclosed tags in an HTML document? ...

split a comma separated list with links in with beautifulsoup

I've got a comma-separated list in a table cell in an HTML document, but some of the items in the list are linked: <table> <tr> <td>Names</td> <td>Fred, John, Barry, <a href="http://www.example.com/">Roger</a>, James</td> </tr> </table> I've been using Beautiful Soup to parse the HTML, and I can get to the table, but ...
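
One way to approach the cell above is to flatten it to plain text first, so that the linked name sits in the same string as the others, and only then split on commas. The sketch below uses Python's standard html.parser instead of Beautiful Soup (an assumption made to keep it self-contained), and the names CellText and split_cell are hypothetical.

```python
from html.parser import HTMLParser

class CellText(HTMLParser):
    """Gathers the text of a fragment while ignoring the tags around it."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)

def split_cell(cell_html):
    parser = CellText()
    parser.feed(cell_html)
    parser.close()
    text = "".join(parser.chunks)
    # Split on commas and drop empty items left by stray separators.
    return [item.strip() for item in text.split(",") if item.strip()]

cell = '<td>Fred, John, Barry, <a href="http://www.example.com/">Roger</a>, James</td>'
print(split_cell(cell))  # prints ['Fred', 'John', 'Barry', 'Roger', 'James']
```

This discards the link targets; keeping them would require tracking the current <a> element while collecting text.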

Can I build my own dictionary application using online data?

Hi there. Because I'm not a native English speaker, I use a dictionary a lot. Now I'm learning C# and I was wondering whether I'm allowed to build an application that runs on my machine but uses the Google/Babel Fish translation service, or any other online translation/dictionary tool. It takes time to go on the browser each time ...

How to find/replace text in html while preserving html tags/structure

I use regexps to transform text as I want, but I want to preserve the HTML tags. e.g. if I want to replace "stack overflow" with "stack underflow", this should work as expected: if the input is stack <sometag>overflow</sometag>, I must obtain stack <sometag>underflow</sometag> (i.e. the string substitution is done, but the tags are sti...
