parsing

Problem parsing HTML with Python's win32com

Hello, I'm new to Python. I want to extract some text from the CNN website using Python's win32com module. EDIT (why win32com): because of the JavaScript on the website, I thought of using win32com; I have looked for other solutions, but none met my requirements. In fact, I wanted to use mechanize or a similar solut...

Python script to print all function definitions of a C/C++ file

I want a Python script that prints a list of all functions defined in a C/C++ file. For example, abc.c defines two functions: void func1() { } int func2(int i) { printf("%d", i); return 1; } I just want to search the file (abc.c) and print all the functions defined in it (function names only). In the example above, I would like to print func1,...
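A minimal sketch of one way to approach this, assuming definitions look like "type name(args) {". The pattern is a heuristic, not a C parser: macros, function pointers, templates, and code inside comments or strings are not handled. The names FUNC_DEF, KEYWORDS and function_names are made up for this illustration.

```python
import re

# Heuristic: a return type, a name, a parenthesised parameter list,
# then an opening brace. A trailing ';' (a prototype) does not match.
FUNC_DEF = re.compile(
    r"\b[A-Za-z_][\w\s\*]*?\s+\*?([A-Za-z_]\w*)\s*\([^;{}()]*\)\s*\{"
)

# Control-flow keywords can look like "name(...) {" (e.g. "else if (x) {").
KEYWORDS = {"if", "for", "while", "switch", "return", "sizeof"}


def function_names(source):
    """Return the names of functions that appear to be defined in source."""
    return [
        match.group(1)
        for match in FUNC_DEF.finditer(source)
        if match.group(1) not in KEYWORDS
    ]


if __name__ == "__main__":
    sample = 'void func1() { } int func2(int i) { printf("%d", i); return 1; }'
    print(", ".join(function_names(sample)))
```

For real code bases, a proper C/C++ front end would be more reliable than a regular expression.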

Eliminating left recursion for E := EE+|EE-|id

How do I eliminate left recursion for the following grammar? E := EE+|EE-|id Using the common procedure, A := Aa|b translates to: A := b|A' A' := ϵ| Aa Applying this to the original grammar, we get: A = E, a = (E+|E-) and b = id Therefore: E := id|E' E' := ϵ|E(E+|E-) But this grammar seems incorrect because ϵE+ -> ϵ id + w...
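For comparison, the textbook form of the rule rewrites A := Aa|b as A := bA', A' := aA'|ϵ. Taking a = (E+|E-) and b = id, that would give E := id E' and E' := E+E' | E-E' | ϵ. The sketch below is a small recursive-descent recognizer for that rewritten grammar; the function name accepts and the token spelling ("id", "+", "-") are assumptions made for the illustration.

```python
def accepts(tokens):
    """Return True if tokens form a sentence of E := EE+ | EE- | id."""
    pos = 0

    def peek():
        # Next token, or None at end of input.
        return tokens[pos] if pos < len(tokens) else None

    def parse_e():
        # E := id E'
        nonlocal pos
        if peek() != "id":
            return False
        pos += 1
        return parse_e_prime()

    def parse_e_prime():
        # E' := E (+|-) E' | epsilon
        # An "id" selects the first alternative; anything else means epsilon.
        nonlocal pos
        while peek() == "id":
            if not parse_e():
                return False
            if peek() not in ("+", "-"):
                return False
            pos += 1
        return True

    return parse_e() and pos == len(tokens)


if __name__ == "__main__":
    for text in ("id", "id id +", "id id id + -", "id +", "id id"):
        print(text, "->", accepts(text.split()))
```

One token of lookahead is enough here because the non-empty alternatives of E' start with id, while only +, - or end of input can follow E'.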

How to iterate over plain text segments with the Jericho HTML parser

Hi, for a Jericho Element, I am trying to find out how to loop over all child nodes, whether they are elements or plain text. There is Element.getNodeIterator(), but this references ALL descendants within the Element, not just the immediate children. I need the equivalent of Element.getChildSegments(). Any ideas? Thanks ...

How to extract some text using lxml?

Hello. I want to extract some text from a certain website in order to build a scraper. Here is the web address: http://news.search.naver.com/search.naver?sm=tab%5Fhty&where=news&query=times&x=0&y=0 On this page, I want to extract the subject and content fields separately. For example, if you open tha...

What widespread languages are LL(k)?

Alrighty, by LL(k) languages, I mean programming languages whose parsers can be described by grammars which are LL(k). These are my guesses: Pascal, Lisp, XML and friends ...

Why isn't this PHP parsing XML correctly?

I am attempting to parse Yahoo's weather XML feed via this script. The parsing itself works; I am just struggling to get the days to correspond with today, tomorrow and the day after. The final HTML output, which can be seen at http://www.wdmadvertising.com.au/preview/cfs/index.shtml, looks like this: todayMon______________19 ...

Best way to deserialize a long string (response of an external web service)

I am querying a web service that was built by another developer. It returns a result set in a JSON-like format. I get three column values (I already know what the ordinal position of each column means): [["Boston","142","JJK"],["Miami","111","QLA"],["Sacramento","042","PPT"]] In reality, this result set can be thousands of records ...
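A minimal sketch in Python, assuming the response really is valid JSON, as the three-row sample happens to be. The field names city, number and code are hypothetical labels for the three ordinal positions; the question does not say what the columns mean.

```python
import json


def parse_rows(raw):
    """Parse a JSON array of three-element arrays into a list of dicts."""
    rows = json.loads(raw)
    # Hypothetical column names; only the ordinal positions come from the source.
    return [{"city": city, "number": number, "code": code}
            for city, number, code in rows]


if __name__ == "__main__":
    raw = '[["Boston","142","JJK"],["Miami","111","QLA"],["Sacramento","042","PPT"]]'
    for record in parse_rows(raw):
        print(record)
```

Values stay strings, so a leading zero such as "042" is preserved. If the real feed deviates from JSON, a standard JSON parser would reject it and a custom tokenizer would be needed instead.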

Which ISO8601 date/time incl. timezone format to use for maximum success across languages?

The ISO8601 format for date/time representations supports many variations of format to express the same information. I know that not all languages have libraries that support the range of the standard - for example, I've had problems parsing the different possible formats of the timezone using Java's SimpleDateFormat API. Given the cho...
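As an illustration of one candidate format, the sketch below writes and reads the extended form with an explicit numeric offset (for example 2010-03-01T12:30:00+02:00) using Python's standard library. Whether this is the most widely parseable variant is an assumption, not something the question establishes; the helper names to_iso8601 and from_iso8601 are made up here.

```python
from datetime import datetime, timedelta, timezone


def to_iso8601(moment):
    """Format an aware datetime as YYYY-MM-DDTHH:MM:SS+HH:MM."""
    if moment.tzinfo is None:
        raise ValueError("expected a timezone-aware datetime")
    return moment.isoformat(timespec="seconds")


def from_iso8601(text):
    """Parse the format produced by to_iso8601 (Python 3.7 or later)."""
    return datetime.fromisoformat(text)


if __name__ == "__main__":
    stamp = datetime(2010, 3, 1, 12, 30, 0, tzinfo=timezone(timedelta(hours=2)))
    text = to_iso8601(stamp)
    print(text)
    print(from_iso8601(text) == stamp)
```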

Disabled/Custom params_parser per action

Hi, I have a create action that handles XML requests. Rather than using the built-in params hash, I use Nokogiri to validate the XML against an XML schema. If this validation passes, the raw XML is stored for later processing. As far as I understand, the XML is parsed twice: first Rails creates the params hash, then the Nokogiri pa...

Getting displayed text only from HTML

Is there a simple way, using C#, to open an arbitrary URL, read in the text, and reduce it down to that which would be displayed in a web page? I suppose I could get the <body> content, and iterate char by char over that content, ripping out anything that is in between < and > (inclusive). I looked briefly at HTML Agility Pack, and tha...
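The question is about C#, but the idea can be sketched in Python with the standard-library HTML parser: collect text nodes and skip the contents of script and style elements. This is a rough approximation of "displayed text" under those assumptions (it ignores CSS visibility, for instance); fetching the URL is left out, and the names VisibleText and visible_text are invented for the example.

```python
from html.parser import HTMLParser


class VisibleText(HTMLParser):
    """Collect text nodes, ignoring anything inside <script> or <style>."""

    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.parts.append(data)


def visible_text(html):
    """Return the text content of html with whitespace collapsed."""
    parser = VisibleText()
    parser.feed(html)
    parser.close()
    return " ".join(" ".join(parser.parts).split())


if __name__ == "__main__":
    page = ("<html><head><style>p{}</style></head><body>"
            "<p>Hello <b>world</b></p><script>var x = 1;</script></body></html>")
    print(visible_text(page))
```

Using a parser rather than scanning for < and > character by character avoids tripping over angle brackets inside scripts, comments, or attribute values.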

PHP parsing with simple_html_dom: search problem

Hello, I am using simple_html_dom to find every link in the HTML document that is of the class "new". Ordinarily I would use: $html->find('a[class=new]'); This would obtain links such as <a class="new" ... blah blah ... /> However, the problem this time is that the HTML document contains links with classes such as <a class="to...
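The excerpt is cut off, so the following rests on an assumption: that the troublesome links carry several space-separated class names, one of which is "new". Under that assumption, the usual approach is to split the class attribute into tokens instead of comparing the whole string. The sketch uses Python's standard-library parser rather than simple_html_dom, and the class name LinksWithClass is made up.

```python
from html.parser import HTMLParser


class LinksWithClass(HTMLParser):
    """Collect href values of <a> tags whose class list contains `wanted`."""

    def __init__(self, wanted):
        super().__init__()
        self.wanted = wanted
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attributes = dict(attrs)
        # Split on whitespace so "external new" matches but "newer" does not.
        classes = (attributes.get("class") or "").split()
        if self.wanted in classes:
            self.hrefs.append(attributes.get("href"))


if __name__ == "__main__":
    # Hypothetical markup, not taken from the question.
    page = ('<a class="new" href="/a">A</a>'
            '<a class="external new" href="/b">B</a>'
            '<a class="newer" href="/c">C</a>')
    finder = LinksWithClass("new")
    finder.feed(page)
    print(finder.hrefs)
```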

Mathematica function foo that can distinguish foo[.2] from foo[.20]

Suppose I want a function that takes a number and returns it as a string, exactly as it was given. The following doesn't work: SetAttributes[foo, HoldAllComplete]; foo[x_] := ToString[Unevaluated@x] The output for foo[.2] and foo[.20] is identical. The reason I want to do this is that I want a function that can understand dates with ...

How to use LINQ to get method name and namespace in a soap request?

Hi, how do I extract the method name and namespace from this XML using LINQ to XML? <?xml version="1.0" encoding="utf-8"?><soap:Envelope xmlns:soap="http://schemas.xmlsoap.org/soap/envelope/" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xsd="http://www.w3.org/2001/XMLSchema" xmlns:wsa="http://schemas.xmlsoap.org/ws/2004/08/add...
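The question asks for LINQ to XML; the sketch below shows the same idea in Python's xml.etree.ElementTree instead. It assumes the method is the first child element of soap:Body, which is the common layout but is not visible in the truncated envelope above. The GetQuote element and the http://example.com/stock namespace in the usage example are hypothetical.

```python
import xml.etree.ElementTree as ET

SOAP_NS = "http://schemas.xmlsoap.org/soap/envelope/"


def soap_method(envelope_xml):
    """Return (method_name, namespace) of the first element in soap:Body."""
    root = ET.fromstring(envelope_xml)
    body = root.find(f"{{{SOAP_NS}}}Body")
    if body is None or len(body) == 0:
        raise ValueError("no soap:Body payload found")
    tag = body[0].tag
    # ElementTree spells qualified names as "{namespace}local".
    if tag.startswith("{"):
        namespace, name = tag[1:].split("}", 1)
    else:
        namespace, name = "", tag
    return name, namespace


if __name__ == "__main__":
    # Hypothetical envelope for illustration only.
    envelope = (
        '<soap:Envelope xmlns:soap="http://schemas.xmlsoap.org/soap/envelope/">'
        '<soap:Body><GetQuote xmlns="http://example.com/stock">'
        '<Symbol>X</Symbol></GetQuote></soap:Body></soap:Envelope>'
    )
    print(soap_method(envelope))
```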

Parse ASP.NET pages to an object model

Hi SO, I need to parse aspx, ascx and master files in an MVC project into an object model so that I can allow people to change particular parts and save the file back: a content-management type of thing. Is there anything in the framework to help me? What I have tried: XDocument.Load: cannot load the directives and inline code blocks. Ge...

NSXMLParser foundCharacters method!?

Hey! I'm using NSXMLParser to fetch a string from XML. I've created a class to store the data with synchronized variables. To get the text between the elementName tags I use the foundCharacters method, and to store the strings I use a MutableString *. When I find the string and print it, everything is correct, but when I'm done the two differ...

Simple Regex help for C#

I have an unfinished binary file that has some info that I can recover using regex. The contents are: G $12.Angry.Men.1957.720p.HDTV.x264-HDLH Lhttp://site.com/forum/f89/12-angry-men-1957-720p-hdtv-x264-hdl-538403/ L I Š M ,ABBA.The.Movie.1977.720p.BluRay.DTS.x264-iONN Phttp://site.com/forum/f89/abba-movie-1977-...

Parsing a .txt or html file into a string in the iPhone SDK

What I'm trying to do is take a .txt or HTML file, search through it, grab a piece of text from the file, place it into a string, and finally add it to a textView. The pieces of text will be divided like this: 001:001 Text1 001:002 Text2 001:003 Text3 002:001 Text1a 002:002 Text1b...
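The question targets the iPhone SDK, but the lookup logic can be sketched in Python. The sketch assumes each entry sits on its own line in the form "NNN:NNN text"; the excerpt shows the entries flattened, so that layout is a guess. ENTRY and parse_entries are names invented for the example.

```python
import re

# Three digits, a colon, three digits, whitespace, then the text.
ENTRY = re.compile(r"^(\d{3}):(\d{3})\s+(.*)$")


def parse_entries(text):
    """Map (section, item) keys such as ("001", "002") to their text."""
    entries = {}
    for line in text.splitlines():
        match = ENTRY.match(line.strip())
        if match:
            entries[(match.group(1), match.group(2))] = match.group(3)
    return entries


if __name__ == "__main__":
    sample = "001:001 Text1\n001:002 Text2\n001:003 Text3\n002:001 Text1a\n002:002 Text1b"
    entries = parse_entries(sample)
    print(entries[("002", "001")])
```

Keeping the keys as strings preserves the leading zeros, so a lookup can use the same "001:002" spelling that appears in the file.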

Scraping and Parsing a Wikipedia Page

Hey guys. I'm wondering if there are any existing libraries in or accessible from Objective-C that would allow me to scrape pages formatted like this one. Specifically, all of the dates and all of the text next to each date. If not, what would be the best way to go about doing this? Regular expressions? I heard that NSString might alread...

Get the type of an element in Hpricot

I want to go through the children of an element and keep only the ones that are text or span, something like: element.children.select {|child| child.class == String || child.element_type == 'span' } but I can't find a way to test which type a certain element is. How do I test that? I'd like to know that regardless of whether there's a bet...