parsing

HTML parsing with libxml

In another thread I was persuaded to use an HTML parser instead of regexps for HTML parsing. I thought of using libxml (it has an HTML parser built in), but failed to find any useful tutorial. I also found this site, which says it should cope even with severely broken HTML. Could you give me some examples of HTML parsing wit...

Reading data from Excel files prior to version 95

Apparently Excel 4.0 files are still in use and I have to read them in Java. Neither poi nor jExcelAPI, as great as they are, can parse them. I can't find anything on the format, especially for Java. Any help? Thank you. ...

Why do on-line parsers seem to stop at regexps?

I've long wondered why there don't seem to be any parsers for, say, BNF, that behave like regexps in various libraries. Sure, there are things like ANTLR, Yacc and many others that generate code which, in turn, can parse a CFG, but there doesn't seem to be a library that can do that without the intermediate step. I'm interest...

Is the same file tokenized every time I include it?

This question is about the PHP parsing engine. When I include a file multiple times in a single runtime, does PHP tokenize it every time or does it keep a cache and just run the compiled code on subsequent inclusions? EDIT: More details: I am not using an external caching mechanism and I am dealing with the same file being included mul...

XSD Schemas allowing special/reserved characters in string element tag

In a string element tag the XML parser will get confused if it finds the following characters: ' " < > & (e.g. let's say the name of a company has been retrieved from a database field, and it looks like this: "Smith & Sons"). The question is: how can you design your XSD to ignore these characters if found within an element? ...
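For illustration, a minimal sketch of the usual alternative to changing the schema: escaping the reserved characters when the value is written, so the parser never sees them raw. The element name `company` and the helper names are hypothetical; `escape` comes from Python's standard `xml.sax.saxutils` module.

```python
from xml.sax.saxutils import escape
import xml.etree.ElementTree as ET


def escape_text(value):
    # escape() handles &, < and > by default; the extra mapping
    # also covers the two quote characters.
    return escape(value, {'"': "&quot;", "'": "&apos;"})


def build_company_element(name):
    # "company" is a hypothetical element name used only for this sketch.
    return "<company>" + escape_text(name) + "</company>"


xml_doc = build_company_element("Smith & Sons")
print(xml_doc)

# Parsing the escaped document gives back the original string.
print(ET.fromstring(xml_doc).text)
```

The schema can then keep declaring the element as a plain string, since the escaping happens at serialization time rather than in the XSD.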

Getting HTML attributes from DOM trees (libxml)

I'm using this program to display a list of all html tags in a given file: #include <cstdio> #include <libxml/HTMLparser.h> #include <libxml/tree.h> #include <iostream> #include <cstring> using namespace std; static void print_element_names(htmlNodePtr a_node) { htmlNodePtr cur_node = NULL; for (cur_node = a_node; cur_node!=N...

Parsing files with Python

What type of Python objects should I use to parse files with a specific syntax? Also, what sort of loop should be used to work through the file? Should one pass be sufficient? Two, three? ...

JavaScript for-loop in BNF

Hi I'm writing BNF for JavaScript which will be used to generate a lexer and a parser for the language. However, I'd like some ideas on how to design the for-loop. Here is the simplified version of my current BNF: [...] VarDecl. Statement ::= "var" Identifier "=" Expr ";" ForLoop. Statement ::= "for" "(" Expr ";" Expr ";" Expr ")" [......

Haskell parsing tools - yacc:lex :: happy:?

So, it seems like Happy is a robust replacement for yacc in Haskell. Is there an equally robust lexer generator to replace lex/flex? ...

Python lxml screen scraping?

I need to do some HTML parsing with Python. After some research, lxml seems to be my best choice, but I am having a hard time finding examples that help me with what I am trying to do. This is why I am here. I need to scrape a page for all of its viewable text: strip out all tags and JavaScript. I need it to leave me with what text is vi...
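As a rough sketch of the idea of keeping only viewable text, the following uses the standard library's `html.parser` rather than lxml (a deliberate swap, so that the example runs without extra packages). The class and function names are hypothetical, and it assumes that skipping `script` and `style` is enough to define "visible".

```python
from html.parser import HTMLParser


class VisibleText(HTMLParser):
    """Collects text nodes, ignoring anything inside <script> or <style>."""

    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep non-empty text that is not inside a skipped element.
        if not self._skip_depth and data.strip():
            self.chunks.append(data.strip())


def visible_text(html):
    parser = VisibleText()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


page = ("<html><head><script>var x = 1;</script><style>p{}</style></head>"
        "<body><p>Hello <b>world</b></p></body></html>")
print(visible_text(page))
```

Text hidden by CSS or inserted by scripts is outside what this sketch can detect.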

How do you remove html tags using Universal Feed Parser?

The documentation lists the tags that are allowed/removed by default: http://www.feedparser.org/docs/html-sanitization.html But it doesn't say anything about how you can specify which additional tags you want removed. Is there a way to do this using Universal Feed Parser or do you have to do further processing using your own regex and...

What's a good way to mix RSS feeds using Python?

SimplePie lets you merge feeds together: http://simplepie.org/wiki/tutorial/sort_multiple_feeds_by_time_and_date Is there anything like this in the Python world? The Universal Feed Parser documentation doesn't say anything about merging multiple feeds together. ...

What are alternatives to regexes for syntax highlighting?

While editing this and that in Vim, I often find that its syntax highlighting (for some filetypes) has some defects. I can't remember any examples at the moment, but someone surely will. Usually, it consists of strings badly highlighted in some cases, some things with arithmetic and boolean operators and a few other small things as well....

Parse a .txt file

Hello everybody, I have a .txt file like: Symbols from __ctype_tab.o: Name Value Class Type Size Line Section __ctype |00000000| D | OBJECT |00000004| |.data __ctype_tab |00000000| r | OBJECT |00000101| |.rodata Symbols from _ashldi3.o: Name ...
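A minimal sketch of one way to read such a listing, assuming each symbol sits on its own line with seven pipe-separated fields and each object file is introduced by a `Symbols from <file>:` line. The function name and dictionary keys are hypothetical, and the sample text only reproduces the rows quoted above.

```python
def parse_symbols(text):
    """Return one dict per symbol row, tagged with its object file."""
    symbols = []
    current_object = None
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith("Symbols from ") and line.endswith(":"):
            # Remember which object file the following rows belong to.
            current_object = line[len("Symbols from "):-1]
        elif "|" in line:
            fields = [field.strip() for field in line.split("|")]
            if len(fields) != 7:
                continue  # not a symbol row in the assumed layout
            name, value, cls, typ, size, lineno, section = fields
            symbols.append({
                "object": current_object,
                "name": name,
                "value": int(value, 16) if value else None,
                "class": cls,
                "type": typ,
                "size": int(size, 16) if size else None,
                "line": lineno,
                "section": section,
            })
    return symbols


sample = """Symbols from __ctype_tab.o:
Name Value Class Type Size Line Section
__ctype |00000000| D | OBJECT |00000004| |.data
__ctype_tab |00000000| r | OBJECT |00000101| |.rodata
"""

for symbol in parse_symbols(sample):
    print(symbol["object"], symbol["name"], symbol["section"], symbol["size"])
```

The header row contains no pipe characters, so it is skipped without special handling.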

How can I make a console application that lets me enter LINQ expressions and my program will execute them?

How can I write a console application that prompts me and lets me enter LINQ expressions and it will spit out the results of that LINQ query? What would be the easiest way to parse/evaluate an incoming string as a LINQ expression? ...

Nokogiri: Select content between element A and B

What's the smartest way to have Nokogiri select all content between the start and the stop element (including start-/stop-element)? Check example code below to understand what I'm looking for: require 'rubygems' require 'nokogiri' value = Nokogiri::HTML.parse(<<-HTML_END) "<html> <body> <p id='para-1'>A</p> <div clas...

Parser generator for Java with the following requirements...

I am looking for a parser generator for Java that does the following: My language project is pretty simple and only contains a small set of tokens. Output in pure READABLE Java code so that I can modify it (this is why I wouldn't use ANTLR) Mature library, that will run and work with at least Java 1.4 I have looked at the following and t...

Parsing a WSDL to extract Service / Port elements

I want to automatically process a WSDL file to discover defined Service / Port elements. Is this possible, using Java or some sort of Ant utility? If so, how? ...

Parsing XML with AS3

Here is my entire script, as I can't seem to figure out where the problem is. The symptom is that the place where I call addChild(book) is not the appropriate place for it to be added properly and in sequence with the thumbs. As a result, and to my surprise, the only way I can get these to appear so far is by writing a faulty trace state...

Need to construct an XML representation for C# code

Hi, I need to convert C# code to an equivalent XML representation. I plan to convert the C# code (C# 2.0 code snippets, no generics or nullable types) to an AST and then convert the AST to XML. Looking for a simple lexer/parser for C# which outputs an AST. Any pointers on converting C# code to an XML representation (which can be convert...