html-parsing

Matching an unknown number of occurrences on a page using Perl?

I am parsing an HTML page. Let's say this page lists all the players on a football team, and those who are seniors are shown in bold. I can't parse the file line by line and look for the strong tag, because in my real example the pattern is much more complex and spans multiple lines. Something like this: <strong>Senior:</strong> John Smith Junio...

[PHP] - DOMDocument to extract part of a webpage (Any encoding)?

What's the code to store in a string the whole of a webpage's content between the <body></body> tags? The page can be any HTML/XHTML page, can be in any encoding (ISOx, UTF-8, Asian-something), and can have attributes in the <body> (which may trick the parser). I've heard about DOMDocument, but I'm a big rookie, so a code sample would help! ...

importing and mapping data from source file to HTML tables

Hi, PHP newbie here. I need some PHP ideas/examples on how to import data from a delimited text file and map it into HTML tables. The data should be populated and mapped under its proper header. There are also instances where a record doesn't have all the values; if there is no data, we can leave it null (see sample records). I w...

regex: put text outside <p> inside <p>

I have some broken HTML code that I would like to fix with regex. The HTML might be something like this: <p>text1</p> <p>text2</p> text3 <p>text4</p> <p>text5</p> But there can be many more paragraphs and other HTML elements too. I want to turn it into: <p>text1</p> <p>text2</p> <p>text3</p> <p>text4</p> <p>text5</p> Is this poss...
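
A minimal sketch of one way to approach the example above, written in Python rather than any language named in the question. It assumes the input contains only flat, attribute-free <p>...</p> blocks with stray text between them; the name wrap_bare_text is hypothetical. It would not cope with nested or other HTML elements, for which a real HTML parser is likely the safer choice.

```python
import re

# Matches one complete, attribute-free <p>...</p> block (an assumption about the input).
P_BLOCK = re.compile(r"(<p>.*?</p>)", re.DOTALL)

def wrap_bare_text(html):
    """Wrap any text found outside <p>...</p> blocks in its own <p> element."""
    # Splitting on a capturing group keeps the matched blocks at the odd indices.
    parts = P_BLOCK.split(html)
    out = []
    for i, part in enumerate(parts):
        piece = part.strip()
        if not piece:
            continue  # only whitespace between two paragraphs
        if i % 2 == 1:
            out.append(piece)  # already a paragraph
        else:
            out.append("<p>" + piece + "</p>")  # bare text, so wrap it
    return " ".join(out)

broken = "<p>text1</p> <p>text2</p> text3 <p>text4</p> <p>text5</p>"
fixed = wrap_bare_text(broken)
print(fixed)
```

Splitting on the paragraph pattern, rather than trying to match the bare text directly, avoids having to describe "text that is not inside a paragraph" in a single expression.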

How to get all pieces from regular expression (Python)

Hello! I want to get all matches from this expression: import re def my_handler(matches): return str(matches.groups()) text = "<a href='#' title='Title here'>" print re.sub("<[a-zA-Z]+( [a-zA-Z]+=[\#a-zA-Z0-9_.'\" ]+)*>", my_handler, text) Actual result: (" title='Title here'",) Expected result: ("a", " href='#'", " title=...
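
A repeated capturing group keeps only its last repetition, which is why only the final attribute shows up in the actual result. One possible workaround, sketched here in Python 3 (the question's code uses the Python 2 print statement), is to capture the tag name and the whole attribute run first and then split the run with re.findall. The names TAG_RE, ATTR_RE and tag_pieces are hypothetical, and the sketch assumes every attribute value is quoted.

```python
import re

# Group 1: tag name. Group 2: the whole run of attributes, captured as one string.
TAG_RE = re.compile(r"<([a-zA-Z]+)((?: [a-zA-Z]+=[#a-zA-Z0-9_.'\" ]+)*)>")
# One attribute with a single- or double-quoted value (assumption: values are quoted).
ATTR_RE = re.compile(r" [a-zA-Z]+=(?:'[^']*'|\"[^\"]*\")")

def tag_pieces(text):
    """Return the tag name followed by each attribute of the first tag found."""
    m = TAG_RE.search(text)
    if m is None:
        return ()
    name, attrs = m.group(1), m.group(2)
    # findall returns every attribute, not just the last repetition.
    return (name, *ATTR_RE.findall(attrs))

text = "<a href='#' title='Title here'>"
pieces = tag_pieces(text)
print(pieces)
```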

Getting target _parent or _top to recognise a URL with a page anchor

I apologise if the terminology (e.g. 'page anchor') is not quite correct, but I shall endeavour to explain what I am attempting. I have an iframe with links (to the same domain) that I would like to have shown in the parent. <a href="foo.html" target="_parent">bar link</a> works as expected. However, I am attempting to use a URL of the form;...

libxml2 HTML chunk parsing

I'm downloading HTML from a website. The file can be quite large, so while the file is downloading I want to parse the chunks of HTML that are already available, so that the process appears faster to the end user of my program. I don't have control over how the chunks are generated, so a chunk can begin in the middle of a word, e.g. like so: chunk...

C# read HTML/PHP files code

I want to write an application in C# that takes a URL as a parameter/input, gets the source code of the page, and extracts some URLs and some text based on given criteria ... ...

libxml2 HTML parsing problems

I'm using libxml2 to parse HTML: static htmlSAXHandler simpleSAXHandlerStruct = { NULL, /* internalSubset */ NULL, /* isStandalone */ NULL, /* hasInternalSubset */ NULL, /* hasExternalSubset */ NULL, /* res...

Xpath and HTML Cleaner problem, no data returned.

Hi, new to the community. I've been up all night trying to flesh out the underlying HTML reading system that's at the core of my app's functionality. I could really use a fresh pair of eyes on this one. Problem: While trying to return a string to be displayed on my app's home activity, I've run into an issue where I'm almost certain that th...

python beautifulsoup adding extra end tags

I'm using BeautifulSoup to parse a website: request = urllib2.Request(url) response = urllib2.urlopen(request) soup = BeautifulSoup.BeautifulSoup(response) I am using it to traverse a table. The problem I am running into is that BS is adding an extra end tag for the table to the HTML, one that doesn't exist in the source, which I verified with...

Search block of text, return MP3 links using PHP

Hi guys, I've just run into a little bit of trouble with some PHP on my latest project. Basically I have a block of text ($text) and I would like to search through that text and return all of the MP3 links. I know it has something to do with regular expressions but I just cannot get it working. Here's my current code: if(preg_match...
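
A rough sketch of the idea in Python (not the PHP the question asks about), using the standard library's html.parser instead of a regular expression. It assumes the MP3 links appear as href attributes on <a> tags inside the text; the names Mp3LinkCollector and find_mp3_links are hypothetical.

```python
from html.parser import HTMLParser

class Mp3LinkCollector(HTMLParser):
    """Collects href values of <a> tags that end in .mp3 (case-insensitive)."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            # value is None for attributes written without a value
            if name == "href" and value and value.lower().endswith(".mp3"):
                self.links.append(value)

def find_mp3_links(text):
    collector = Mp3LinkCollector()
    collector.feed(text)
    collector.close()
    return collector.links

sample = ('<p>Listen: <a href="http://example.com/a.mp3">A</a> and '
          '<a href="/b.MP3">B</a>, not <a href="c.html">C</a></p>')
mp3_links = find_mp3_links(sample)
print(mp3_links)
```

If the links appear as bare URLs in plain text rather than in anchors, this approach would miss them, and a pattern match over the text would be needed instead.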

Parsing HTML: Adult Classification Systems

I'm researching the different (and sometimes obsolete) ratings/classification standards used on the web, e.g. PICS, POWDER, ICRA. Which standard is the most popular (by number of sites using it)? Is there a C# library which will handle any (or all) of these? ...

Removing broken tags and poorly formatted html from some text

I have a huge database of scraped forum posts that I am inserting into a website. However, a lot of people try to use HTML in their forum posts and often do it wrong. Because of this, there are always stray <strike> <b> </strike> </div> </b> tags in the posts, which end up messing up the webpage format when I add say 15 forum po...

IE 8 Quirks vs Standards retrieving offsetHeight/offsetWidth

I am in the process of converting my application to use XHTML strict mode (it didn't have a DOCTYPE before). However, I noticed a significant performance degradation when getting offsetHeight/offsetWidth. This is very noticeable on pages with a large number of DOM elements, let's say a table with 1 column by 800 rows, where the cells only have a piece of te...

How to parse an HTML page using PHP?

Parsing HTML / JS codes to get info using PHP. www.asos.com/Asos/Little-Asos-Union-Jack-T-Shirt/Prod/pgeproduct.aspx?iid=1273626 Take a look at this page, it's a clothes shop for kids. This is one of their items and I want to point out the size section. What we need to do here is to get all the sizes for this item and check whether the...

libxml2 HTML parsing

I'm parsing HTML with libxml2, using XPath to find elements. Once I've found the element I'm looking for, how can I get the HTML as a string from that element (keeping in mind that this element will have many child elements)? Given a document: <html> <header> <title>Some document</title> </header> <body> <p id="...

HTML parser that is compatible with JRuby?

I'm having a difficult time locating an HTML parser that works with JRuby. I've become fond of using Nokogiri for HTML parsing, but Nokogiri requires libxml2.dll, which I don't have available on my machine, and I'm not sure I can ensure that it is available on all users' machines. I attempted to use another favorite, Scruby...

How do I use libcurl to log in to a secure website and get at the HTML behind the login?

Hey guys, I was wondering if you could help me work through accessing the HTML behind a login page using C and libcurl. Specific example: the website I'm trying to access is https://onlineservices.ubs.com/olsauth/ex/pbl/ubso/dl Is it possible to do something like this? The problem is that we have a lot of clients each of which h...

Character corruption going from BufferedReader to BufferedWriter in Java

In Java, I am trying to parse an HTML file that contains complex text such as Greek symbols. I encounter a known problem when the text contains a left-facing quotation mark. Text such as mutations to particular “hotspot” regions becomes mutations to particular “hotspot�? regions I have isolated the problem by writing a simple text...