html-parsing

Creating EMML markup from HTML

I'm building a Mashup application which will generate output in EMML. The application is JavaScript-based, with an interface like Yahoo Pipes. Once the Mashup is designed in the editor, it needs to convert the same HTML markup to correct EMML markup. I need to know if there is any JavaScript framework or plugin available which convert...

Parsing string and making list of required data

Hi, I am using **<style>..{data inside}..</style>**, which is in the following code. I have taken all the data between the style tags into one string, say string tempStyle, and all operations are to be done on that string only. I am looking for a function which will make a list of all "style" data, i.e. only style1, style2, style15, style20 i...

How do I get a list of all parent tags in BeautifulSoup?

Let's say I have a structure like this: <folder name="folder1"> <folder name="folder2"> <bookmark href="link.html"> </folder> </folder> If I point to bookmark, what would be the command to just extract all of the folder lines? For example, bookmarks = soup.findAll('bookmark') then beautifulsoupcommand(bookmarks[...
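
One way to picture the "all parent folders" lookup described above is to keep a stack of the `folder` tags that are open when a `bookmark` is reached. The sketch below uses Python's standard `html.parser` rather than BeautifulSoup, so it is a swapped-in technique and not the library the question asks about (in BeautifulSoup 4, `find_parents` is the usual starting point). The names `FolderAncestors` and `folder_ancestors` are made up for illustration.

```python
from html.parser import HTMLParser


class FolderAncestors(HTMLParser):
    """Record the enclosing <folder> names for every <bookmark> start tag."""

    def __init__(self):
        super().__init__()
        self.open_folders = []  # stack of names of currently open folders
        self.bookmarks = []     # (href, [outermost .. innermost folder names])

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "folder":
            self.open_folders.append(attrs.get("name"))
        elif tag == "bookmark":
            # Copy the stack so later pushes/pops do not change this record.
            self.bookmarks.append((attrs.get("href"), list(self.open_folders)))

    def handle_endtag(self, tag):
        if tag == "folder" and self.open_folders:
            self.open_folders.pop()


def folder_ancestors(markup):
    """Hypothetical helper: return (href, ancestor folder names) per bookmark."""
    parser = FolderAncestors()
    parser.feed(markup)
    parser.close()
    return parser.bookmarks


sample = (
    '<folder name="folder1"> <folder name="folder2"> '
    '<bookmark href="link.html"> </folder> </folder>'
)
print(folder_ancestors(sample))
```

The stack approach assumes folders are closed in the order they were opened; badly nested input would need extra handling.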

Display images on Android using TextView and Html.ImageGetter asynchronously?

I want to set a TextView with SpannableString which is from the method below: Html.fromHtml(String source, Html.ImageGetter imageGetter, Html.TagHandler tagHandler) But the ImageGetter here needs to override the method below: public abstract Drawable getDrawable (String source) Because I need to get the drawable from the inte...

Parsing Random Web Pages

Hi, I need to parse a bunch of random pages and add them to a DB. I am thinking of using regular expressions but I was wondering if there are any 'special' techniques (other than looking for content between known text/tags). The content is more (not always) like: Some Title Text related to Title I guess I don't need to extract complet...

Should I use regex to parse this string of html table data?

What's the best way to parse this data? Should I use regex or something else? The data is in html, but I found it from a website and will be parsing this and only this string (note: string is much longer - over 1,300 instances - only two below) - note I use php & jquery for most web programming. I only need to extract the data in t...

Find all tags with a specific attribute value

How can I iterate over all tags which have a specific attribute with a specific value? For instance, let's say we need the data1, data2 etc... only. <html> <body> <invalid html here/> <dont care> ... </dont care> <invalid html here too/> <interesting attrib1="naah, it is not this"> ... </interesting t...

awk return parent HTML tag value if its child tag content is matched - possible?

Hello, I've been searching for a solution to this problem for quite some time, but I can't figure it out on my own. I have a bunch of HTML blocks of code, and I want to search for a specific string that is contained in one of the inner tags and, if there's a match, return its parent tag value. Here's an example: <li rel="Returns this val...

Which Perl modules are good for data munging?

Nine years ago, when I started parsing HTML and free text with Perl, I read the classic Data Munging with Perl. Does someone know if David is planning to update the book or if there are similar books or web pages where the new parsing modules like XML-Twig, Regexp-Grammars, etc, are explained? I assume that in the last nine years some ...

C# HTMLAgilityPack HTML to Text - Parse Errors

I need to extract text from an HTML file using C#. I am trying to use HTMLAgilityPack but I am seeing some parse errors (tags not closed). I am using these two options: htmlDoc.OptionFixNestedTags = true; htmlDoc.OptionAutoCloseOnEnd = true; Is there any "Fix all" type option? I don't care about the errors, I just wan...

How do I determine if there are two or one numbers at the start of my string?

I need a function or some code to find out what the number at the start of my string is. The number can be 1 or 2 digits long: 35|http:\/\/v10.lscache3.c.youtube.com\/videoplayback?ip=0.0.0.0&sparams=id%2Cexpire%2Cip%2Cipbits%2Citag%2Calgorithm%2Cburst%2Cfactor%2Coc%3AU0dXSlZRTl9FSkNNN19OS1JJ&fexp...
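
A minimal sketch of the check described above, assuming (as in the sample string) that the leading number is always followed by a `|` delimiter. The function name `leading_number` is hypothetical, and the sketch is in Python rather than whatever language the asker is using.

```python
import re


def leading_number(text):
    """Return the 1- or 2-digit number before the first '|', or None."""
    # Assumption: the number is immediately followed by a pipe, as in "35|http...".
    match = re.match(r"(\d{1,2})\|", text)
    return int(match.group(1)) if match else None


print(leading_number("35|http://example.invalid/videoplayback"))  # 35
print(leading_number("5|http://example.invalid/videoplayback"))   # 5
```

`len(str(leading_number(s)))` then tells you whether one or two digits were present; strings with three or more leading digits return `None` under this pattern.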

xpath query to parse html tags

I need to parse the following sample HTML using an XPath query: <td id="msgcontents"> <div class="user-data">Just seeing if I can post a link... please ignore post <a href="http://finance.yahoo.com">http://finance.yahoo.com</a> </div> </td> <td id="msgcontents"> <div class="user-data">some text2... <a href="http://abc.com...

Android - Options for pulling data from a website? (HTML)

I was wondering what the best approach is on Android to retrieve information from an HTML page hosted on the internet? For example I'd like to be able to get the text from the following page at the start of each day: http://www.met.ie/forecasts/sea-area.asp I have been downloading and parsing XML files but I have never tried to parse i...

Parse HTML Content in POI

Hi all, I am using POI to create a spreadsheet report. I have HTML content with <p>, <b/>, &nbsp; etc. How do I parse these HTML tags in POI? Is there any function in POI which can parse HTML content? This is a sample of my POI code: HSSFCell cell = getHSSFCell(mysheet, 5, 1); cell.setCellValue(new HSSFRichTextString(htmlCont...

zend_mm_heap error with simple_html_dom

I'm trying to parse an HTML file with simplehtmldom and I'm getting this error: zend_mm_heap corrupted after about 4 seconds of execution on an 8231-line HTML file. Could this be a bug or just excessive memory usage? ...

How to get IMG tag code from HTML document?

Hi, how do I get the img code from a text? Currently I get the code and URL if the tag looks like: text text <img src = "image.gif" />, but if the code is <img src = "image.gif" target = _blank />, then I get the URL: "image.gif" target = _blank. So, how do I correctly find the full img code and URL? Thanks preg_match_all('/\<img src = (.*?)\/>/', $inp...
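
The symptom described above comes from the pattern capturing everything up to `/>`, including any extra attributes. One alternative, sketched here in Python with the standard `html.parser` instead of the PHP regex from the question, is to let a parser split the attributes. `ImgCollector` and `find_images` are illustrative names, not part of any library.

```python
from html.parser import HTMLParser


class ImgCollector(HTMLParser):
    """Collect the raw tag text and the src attribute of every <img> tag."""

    def __init__(self):
        super().__init__()
        self.images = []  # list of (raw tag text, src value)

    def handle_starttag(self, tag, attrs):
        # Self-closing tags such as <img ... /> are routed here by default.
        if tag == "img":
            self.images.append((self.get_starttag_text(), dict(attrs).get("src")))


def find_images(markup):
    """Hypothetical helper: return (full tag, src) for each img in markup."""
    parser = ImgCollector()
    parser.feed(markup)
    parser.close()
    return parser.images


sample = 'text text <img src = "image.gif" target = _blank /> more text'
for raw_tag, src in find_images(sample):
    print(raw_tag, "->", src)
```

Because the parser handles attribute boundaries, the extra `target` attribute no longer leaks into the URL.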

Find h3 and h4 tags beneath it

This is my HTML: <h3>test 1</h3> <p>blah</p> <h4>subheading 1</h4> <p>blah</p> <h4>subheading 2</h4> <h3>test 2</h3> <h4>subheading 3</h4> <p>blah</p> <h3>test 3</h3> I am trying to build an array of the h3 tags, with the h4 tags nested within them. An example of the array would look like: Array ( [test1] => Array ( ...
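
The grouping described above (each h3 owning the h4 headings that follow it, up to the next h3) can be sketched with a single pass that remembers the most recent h3. This is a Python illustration using the standard `html.parser`, not the PHP the array notation suggests. The class name `HeadingOutline` and the choice to key the result by the heading text exactly as written are assumptions.

```python
from html.parser import HTMLParser


class HeadingOutline(HTMLParser):
    """Map each <h3> text to the list of <h4> texts that follow it."""

    def __init__(self):
        super().__init__()
        self.outline = {}        # h3 text -> [h4 texts], in document order
        self._current_h3 = None  # text of the most recent h3
        self._open = None        # "h3" or "h4" while inside such a heading
        self._buffer = []        # text pieces of the heading being read

    def handle_starttag(self, tag, attrs):
        if tag in ("h3", "h4"):
            self._open = tag
            self._buffer = []

    def handle_data(self, data):
        if self._open:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag != self._open:
            return
        text = "".join(self._buffer).strip()
        if tag == "h3":
            self._current_h3 = text
            self.outline.setdefault(text, [])
        elif self._current_h3 is not None:
            # An h4 that appears before any h3 is ignored in this sketch.
            self.outline[self._current_h3].append(text)
        self._open = None


def build_outline(markup):
    """Hypothetical helper: return the h3 -> [h4, ...] mapping for markup."""
    parser = HeadingOutline()
    parser.feed(markup)
    parser.close()
    return parser.outline


sample = (
    "<h3>test 1</h3> <p>blah</p> <h4>subheading 1</h4> <p>blah</p> "
    "<h4>subheading 2</h4> <h3>test 2</h3> <h4>subheading 3</h4> "
    "<p>blah</p> <h3>test 3</h3>"
)
print(build_outline(sample))
```

Two h3 headings with identical text would share one entry here; a list of pairs would avoid that if it matters.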

Fix uneven Divs with php

Hello I have a problem that looks like this: My string of text looks like so: <div> content <div> <div> content <div> </div> </div> If you notice I'm missing some divs and this risks breaking my theme when I use this content elsewhere. What would be the best way to go about solv...

How to match the brother of a certain XML element in ruby?

I played around with nokogiri in ruby and the XML searching feature, e.g.: a = Nokogiri.XML(open 'a.xml') x = a.search('//div[@class="foo"]').text which works quite nicely. But how can I specify to match the next (brother) element on the same level (and only the next)? For example for this input: <div> <div>...</div> <div>...</di...

How can I find the contents of the first h3 tag?

Hi, I am looking for a regex to find the contents of the first <h3> tag. What can I use there? ...
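
The question asks for a regex, but a small parser-based sketch may be more robust when the heading contains nested tags or attributes. The Python example below uses the standard `html.parser`, which is a swapped-in technique and not a regex. `FirstH3` and `first_h3_text` are hypothetical names.

```python
from html.parser import HTMLParser


class FirstH3(HTMLParser):
    """Capture the text content of the first <h3> element only."""

    def __init__(self):
        super().__init__()
        self.text = None    # final text of the first h3, once seen
        self._parts = None  # list of text pieces while inside the first h3

    def handle_starttag(self, tag, attrs):
        if tag == "h3" and self.text is None and self._parts is None:
            self._parts = []

    def handle_data(self, data):
        if self._parts is not None:
            self._parts.append(data)

    def handle_endtag(self, tag):
        if tag == "h3" and self._parts is not None:
            self.text = "".join(self._parts).strip()
            self._parts = None


def first_h3_text(markup):
    """Hypothetical helper: return the first h3's text, or None if absent."""
    parser = FirstH3()
    parser.feed(markup)
    parser.close()
    return parser.text


print(first_h3_text("<p>intro</p><h3>First <em>one</em></h3><h3>Second</h3>"))
```

Nested markup inside the heading is flattened to plain text; keeping the inner tags would need a different collection step.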