parsing

How to parse (text only) web sites while crawling

I can successfully run the crawl command via Cygwin on Windows XP, and I can also run web searches using Tomcat. But I also want to save the parsed pages during the crawl: when I start crawling with a command like bin/nutch crawl urls -dir crawled -depth 3, I also want to save the parsed HTML files to text files. I mean during this period which ...

Library to parse ERB files

I am attempting to parse, not evaluate, Rails ERB files in a Hpricot/Nokogiri-type manner. The files I am attempting to parse contain HTML fragments intermixed with dynamic content generated using ERB (standard Rails view files). I am looking for a library that will not only parse the surrounding content, much the way that Hpricot or No...

Getting the value of an xml attribute with Applescript

I want to parse the Yahoo! Weather API and I want to save these element attributes to variables for later use: <yweather:location city="Zebulon" region="NC" country="US"/> <yweather:astronomy sunrise="6:52 am" sunset="7:39 pm"/> <yweather:forecast day="Wed" date="7 Apr 2010" low="61" high="96" text="Partly Cloudy" code="29" /> H...
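A minimal Python sketch of pulling those attributes into variables, offered as an illustration of the idea rather than as AppleScript. It assumes the three elements sit under a root that declares the yweather prefix; the wrapper element rss, the helper name read_attributes and the namespace URI below are assumptions for illustration, not taken from the question.

```python
import xml.etree.ElementTree as ET

# Assumed namespace URI for the yweather prefix; use whatever the real feed declares.
YWEATHER_NS = "http://xml.weather.yahoo.com/ns/rss/1.0"

# The three elements from the question, wrapped in a hypothetical root
# that declares the prefix so the fragment is well-formed XML.
SAMPLE = f"""<rss xmlns:yweather="{YWEATHER_NS}">
  <yweather:location city="Zebulon" region="NC" country="US"/>
  <yweather:astronomy sunrise="6:52 am" sunset="7:39 pm"/>
  <yweather:forecast day="Wed" date="7 Apr 2010" low="61" high="96" text="Partly Cloudy" code="29" />
</rss>"""


def read_attributes(xml_text):
    """Return {local element name: attribute dict} for each yweather child."""
    root = ET.fromstring(xml_text)
    prefix = "{" + YWEATHER_NS + "}"
    result = {}
    for child in root:
        if child.tag.startswith(prefix):
            result[child.tag[len(prefix):]] = dict(child.attrib)
    return result


weather = read_attributes(SAMPLE)
city = weather["location"]["city"]
sunrise = weather["astronomy"]["sunrise"]
high = weather["forecast"]["high"]
```

The point carried over to any XML library is that the prefixed elements have to be looked up through their namespace, after which each attribute is an ordinary string.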

Parsing multibyte string in PHP

I would like to write an (HTML) parser based on a state machine, but I have doubts about how to actually read and use the input. I decided to load the whole input into one string, then work with it as an array and hold an index as the current parsing position. There would be no problem with a single-byte encoding, but in a multi-byte encoding each ...
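One way to keep the "string plus index" design is to decode the bytes into characters once, up front, so the parsing position counts characters and not bytes. A small Python sketch of that idea follows; the function name scan_characters and the sample input are made up for illustration, and the input is assumed to be UTF-8.

```python
def scan_characters(raw_bytes, encoding="utf-8"):
    """Decode once, then walk the text by character index, not byte index."""
    text = raw_bytes.decode(encoding)
    position = 0
    seen = []
    while position < len(text):
        # text[position] is one whole character, however many bytes it used
        seen.append(text[position])
        position += 1
    return seen


# "\u00e9" is a single character that takes two bytes in UTF-8.
raw = "<b>caf\u00e9</b>".encode("utf-8")
chars = scan_characters(raw)
```

Here the byte string is one element longer than the character list, which is exactly the mismatch a byte-indexed state machine would trip over.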

Extract / Parse Tags from Mixed Content String

Hello, I want to parse tags from a mixed-content string. The string goes like this: "<PERSON>yasir arafat</PERSON> , the president of the <LOCATION>palestinian authority</LOCATION> , on the defensive , mr . sharon believes , a government official" I only want to use JAXP. Does anybody have an idea for this? Maybe there is an easy way with expressions....
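A sketch of one possible approach, shown in Python rather than JAXP: wrap the fragment in a throwaway root element so it becomes a single well-formed document, then read the child elements. The wrapper name root and the helper extract_tags are hypothetical, and the sketch assumes the tags are properly closed and the text contains no bare & or < characters.

```python
import xml.etree.ElementTree as ET

TEXT = ("<PERSON>yasir arafat</PERSON> , the president of the "
        "<LOCATION>palestinian authority</LOCATION> , on the defensive , "
        "mr . sharon believes , a government official")


def extract_tags(mixed):
    """Wrap the fragment in a throwaway root so it parses as one XML document."""
    root = ET.fromstring("<root>" + mixed + "</root>")
    # Each direct child is one tagged span; its .text is the tagged words.
    return [(child.tag, child.text) for child in root]


tags = extract_tags(TEXT)
```

The same wrapping trick should carry over to a JAXP DocumentBuilder, since the only requirement is a single root element.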

How to decode an array of JSON objects

I have an array of JSON objects like so: [{"a":"b"},{"c":"d"},{"e":"f"}] What is the best way to turn this into a PHP array? json_decode will not handle the array part and returns NULL for this string. ...
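As a sanity check, the string exactly as quoted is valid JSON and decodes to a list of three objects in a standard parser. The Python sketch below shows that; it does not diagnose the PHP behavior, which may depend on details outside the excerpt (for example extra escaping or encoding of the real input).

```python
import json

TEXT = '[{"a":"b"},{"c":"d"},{"e":"f"}]'

# The top-level JSON array becomes a list; each object becomes a dict.
decoded = json.loads(TEXT)

# Optional: flatten the single-key objects into one mapping.
merged = {}
for item in decoded:
    merged.update(item)
```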

Matching math expression with regular expression?

For example, these are valid math expressions: a * b + c -a * (b / 1.50) (apple + (-0.5)) * (boy - 1) And these are invalid math expressions: --a *+ b @ 1.5.0 // two consecutive signs, two consecutive operators, invalid operator, invalid number -a * b + 1) // unmatched parentheses a) * (b + c) / (d // unmatched parentheses I hav...
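Balanced parentheses of arbitrary depth are beyond a classic regular expression, so one common alternative is a regex tokenizer plus a small recursive-descent check. The Python sketch below is one reading of the rules implied by the examples: a single optional sign is allowed only at the start of an expression or right after an opening parenthesis, and only + - * / are operators. The grammar and the names tokenize and is_valid are assumptions, not the asker's specification.

```python
import re

# number | identifier | single-character operator or parenthesis
TOKEN = re.compile(r"\s*(?:(\d+(?:\.\d+)?)|([A-Za-z_]\w*)|([-+*/()]))")


def tokenize(text):
    """Return a list of token kinds, or None if an unknown character appears."""
    text = text.rstrip()
    tokens, pos = [], 0
    while pos < len(text):
        match = TOKEN.match(text, pos)
        if match is None:
            return None  # e.g. '@', or the stray '.' left over from "1.5.0"
        number, name, symbol = match.groups()
        if number is not None:
            tokens.append("num")
        elif name is not None:
            tokens.append("id")
        else:
            tokens.append(symbol)
        pos = match.end()
    return tokens


def is_valid(text):
    """Check text against: expr := [sign] term ((+|-) term)*, term := primary ((*|/) primary)*."""
    tokens = tokenize(text)
    if tokens is None:
        return False
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def advance():
        nonlocal pos
        pos += 1

    def primary():
        if peek() in ("num", "id"):
            advance()
            return True
        if peek() == "(":
            advance()
            if not expr() or peek() != ")":
                return False
            advance()
            return True
        return False

    def term():
        if not primary():
            return False
        while peek() in ("*", "/"):
            advance()
            if not primary():
                return False
        return True

    def expr():
        if peek() in ("+", "-"):  # one optional leading sign
            advance()
        if not term():
            return False
        while peek() in ("+", "-"):
            advance()
            if not term():
                return False
        return True

    # Valid only if one expression consumes every token.
    return expr() and pos == len(tokens)
```

Under this grammar a sign directly after an operator, as in a * -b, is rejected as well; allowing it would mean moving the optional sign from expr into primary.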

Textile parsing in Objective-C?

Are there any libraries to parse Textile (Textile to HTML) that will work in an Objective-C iPhone app? C libraries will work too. Update: I couldn't find any sufficiently developed libraries in C/Obj-C, but I did find one written in JavaScript, which I used through an invisible UIWebView. Link: Javascript textile parser ...

I Need a Human Readable, Yet Parse-able Document Format

I'm working on one of those projects where there are a million better ways to accomplish what I need, but I have no choice and have to do it this way. Here it is: there is a web form; when the user fills it out and hits submit, a human-readable text file is created from the form data. It looks like this: field_1: value for field one...

Why can’t DateTime.ParseExact() parse the AM/PM in “4/4/2010 4:20:00 PM” using “M'/'d'/'yyyy H':'mm':'ss' 'tt”

I'm using C#, and if I do DateTime.ParseExact("4/4/2010 4:20:00 PM", "M'/'d'/'yyyy H':'mm':'ss' 'tt", null) the return value is always 4:20 AM. What am I doing wrong with tt? Thanks! ...
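In .NET custom format strings, H is the 24-hour field and h is the 12-hour field, so a likely culprit is that the AM/PM designator has no 12-hour field to act on. Python's strptime has the same split between %H and %I, which makes for a compact illustration. The sketch below is an analogy in Python, not C# code, and the variable names are made up.

```python
from datetime import datetime

TEXT = "4/4/2010 4:20:00 PM"

# %H is the 24-hour field, so the trailing PM marker does not shift the hour.
as_24_hour = datetime.strptime(TEXT, "%m/%d/%Y %H:%M:%S %p")

# %I is the 12-hour field, which %p can then move into the afternoon.
as_12_hour = datetime.strptime(TEXT, "%m/%d/%Y %I:%M:%S %p")
```

If the same reasoning applies to the question's format string, swapping H for h would be the first thing to try.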

Parsing Adobe Kuler RSS feed

I have been trying to parse the XML file below (the Kuler RSS feed). I have read the various posts on this site but am unable to piece them together. I specifically want to extract the child (or sibling) nodes of the element <kuler:themeItem>. However, I am getting an exception: Namespace Manager or XsltContext needed. This query has a prefi...

Regarding XML Parsing

I am using IXMLDOMNodeListPtr, IXMLDOMNodePtr, IXMLDOMElementPtr and IXMLDOMDocPtr. I am a little confused here: do I have to call Release() on these pointers before they go out of scope? Thanks. ...

How do I get Bison/YACC to not recognize a command until it parses the whole string?

I have some bison grammar: input: /* empty */ | input command ; command: builtin | external ; builtin: CD { printf("Changing to home directory...\n"); } | CD WORD { printf("Changing to directory %s\n", $2); } ; I'm wondering how I get Bison to not accept (YYACCEPT?) something as a command until...

Really fast C++ HTML parser

Hello to all, I'm writing an HTML text feature extractor in C++. The program needs to be REALLY fast: I need to extract these features in milliseconds per HTML page, memory usage needs to be good, and Unicode support would be nice. I know how difficult it is to have all of these things, but I want a parser that at least comes close. ...

Parse HTML in PHP

I've read the other posts here about this topic, but I can't seem to get what I want. This is the original HTML: <div class="add-to-cart"><form class=" ajax-cart-form ajax-cart-form-kit" id="uc-product-add-to-cart-form-20" method="post" accept-charset="UTF-8" action="/product/rainbox-river-lodge-guides-salomon-selection"> <div><div cl...

Why does 12:20 PM parse to 0:20 on the next day?

I'm using java.text.SimpleDateFormat to parse string representations of date/time values inside an XML document. I'm seeing all times that have an hour value of 12 shifted by 12 hours into the future, i.e. 20 minutes past noon gets parsed to mean 20 minutes past midnight the following day. I wrote a unit test which seems to confirm tha...

Why can't this SimpleDateFormat parse this date string?

The SimpleDateFormat: SimpleDateFormat pdf = new SimpleDateFormat("MM dd yyyy hh:mm:ss:SSSaa"); The exception thrown by pdf.parse("Mar 30 2010 5:27:40:140PM");: java.text.ParseException: Unparseable date: "Mar 30 2010 5:27:40:140PM" Any ideas? Edit: thanks for the fast answers. You were all correct, I just missed that one key se...
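The edit in the excerpt is cut off, so what the asker actually missed is not shown. One visible mismatch is that the pattern starts with MM, a numeric month in SimpleDateFormat, while the input starts with the month name "Mar", which would call for MMM. The Python sketch below illustrates the same numeric-versus-named distinction with strptime's %m and %b; it is an analogy rather than Java code, the helper try_parse is made up, and %b assumes an English locale.

```python
from datetime import datetime

TEXT = "Mar 30 2010 5:27:40:140PM"


def try_parse(pattern):
    """Return the parsed datetime, or None when the pattern does not fit."""
    try:
        return datetime.strptime(TEXT, pattern)
    except ValueError:
        return None


# %m expects a numeric month such as "03", so "Mar" does not match.
numeric_month = try_parse("%m %d %Y %I:%M:%S:%f%p")

# %b expects an abbreviated month name such as "Mar".
named_month = try_parse("%b %d %Y %I:%M:%S:%f%p")
```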

string parse and replace in for loop

I have a string that looks like this: "1AL||9CA||34CO||196WY||...". I want to use a for loop or while loop that, given an integer, parses this string and deletes the value matching that integer. Example for the above string: string = "1AL||9CA||34CO||196WY||..." integer = 34 for ... loop new string = "1AL||9CA||196WY||......
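A short Python sketch of that loop, assuming each item is a run of digits followed by letters and that "||" is the only separator. The function name remove_entry is made up, and the sample uses only the four items visible in the question.

```python
import re


def remove_entry(data, number, sep="||"):
    """Drop every item whose leading digits equal `number`."""
    kept = []
    for item in data.split(sep):
        match = re.match(r"\d+", item)
        if match and int(match.group()) == number:
            continue  # skip the matching item
        kept.append(item)
    return sep.join(kept)


result = remove_entry("1AL||9CA||34CO||196WY", 34)
```

Comparing the whole leading number, instead of searching for "34" as a substring, keeps an item such as 134XX from being removed by accident.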

ANTLR lexer mismatches tokens

I have a simple ANTLR grammar, which I have stripped down to its bare essentials to demonstrate this problem I'm having. I am using ANTLRworks 1.3.1. grammar sample; assignment : IDENT ':=' NUM ';' ; IDENT : ('a'..'z')+ ; NUM : ('0'..'9')+ ; WS : (' '|'\n'|'\t'|'\r')+ {$channel=HIDDEN;} ; Obviously, thi...

Accessing items separated by -componentsSeparatedByString

Hi, I have an array gathered by componentsSeparatedByString: that looks like the following when I use po in the GDB after the array has gone through componentsSeparatedByString: "\n\t\t <b>Suburb, </b> BAIRNSDALE", "\n\t\t <b>Address, </b> 15K NW BAIRNSDALE", "\n\t\t <b>Reference, </b> MELWOOD/SCHO...