parsing

How do you get the text from an HTML 'datacell' using BeautifulSoup?

I have been trying to strip out some data from HTML files. I have the logic coded to get the right cells. Now I am struggling to get the actual contents of the 'cell'. Here is my HTML snippet: headerRows[0][10].contents [<font size="+0"><font face="serif" size="1"><b>Apples Produced</b><font size="3"> </font></font></font>] ...

Parsing different date formats from feedparser in Python?

I'm trying to get the dates from entries in two different RSS feeds through feedparser. Here is what I'm doing: import feedparser as fp reddit = fp.parse("http://www.reddit.com/.rss") cc = fp.parse("http://contentconsumer.com/feed") print reddit.entries[0].date print cc.entries[0].date And here's how they come out: 2008-10-21T22:23:...

How can I parse a Certificate Signing Request with Perl?

I want to use Perl to extract information from a Certificate Signing Request, preferably without launching an external openssl process. Since a CSR is stored in a base64-encoded ASN.1 format, I tried the Convert::PEM module. But it requires an ASN.1 description of the content, which I haven't been able to put together (ASN.1 being the be...

Parse Phone Number into component parts

I need a well-tested regular expression (.NET style preferred), or some other simple bit of code that will parse a USA/CA phone number into component parts, so: 3035551234122 1-303-555-1234x122 (303)555-1234-122 1 (303) 555 -1234-122 etc... all parse into: AreaCode: 303 Exchange: 555 Suffix: 1234 Extension: 122 ...

Date time parsing that accepts 05/05/1999 and 5/5/1999, etc...

Is there a simple way to parse a date that may be in MM/DD/yyyy, or M/D/yyyy, or some combination? i.e. the zero is optional before a single digit day or month. To do it manually, one could use: String[] dateFields = dateString.split("/"); int month = Integer.parseInt(dateFields[0]); int day = Integer.parseInt(dateFields[1]); int year ...

C# HTML Font Tag Parsing

Hi, I need to parse a large amount of text that uses HTML font tags for formatting. For example: <font face="fontname" ...>Some text</font> Specifically, I need to determine which characters would be rendered using each font used in the text. I need to be able to handle stuff like font tags inside another font tag. I need to use C#...

Converting XML-RPC to JSON in JavaScript

Can anyone recommend a lightweight JavaScript XML-RPC library? After researching this a while ago, I couldn't find anything I was comfortable with, so I kinda ended up writing my own. However, maybe that was stupid, as there must be something suitable out there!? My own pseudo-library is mainly missing a way to turn an XML-RPC response...

A "regex for words" (semantic replacement) - any example syntax and libraries?

I'm looking for syntactic examples or common techniques for doing regular-expression-style transformations on words instead of characters, given a procedural language. For example, to trace copying, one would want to create a document with similar meaning but with different word choices. I'd like to be able to concisely define these po...

Format/parse a string in VB

I am working with some input that is in the possible forms $1,200 20 cents/ inch $10 Is there a way to parse these to numbers in VB? Also printing these numbers? EDIT: Regular expressions would be great. EDIT: VB 6 in particular ...

Suggestions on how to make a configurable parser.

I want to build a parser for a C-like language. The interesting aspect about it is that I want to build it in such a way that someone who has access to the source can easily modify it to extend the language (a new expression type, for instance), with the extensions being runtime configurable (they can be turned on and off). My current in...

Splitting strings in Python

I have a string which is like this: this is [bracket test] "and quotes test " I'm trying to write something in Python to split it up by space while ignoring spaces within square braces and quotes. The result I'm looking for is: ['this','is','bracket test','and quotes test '] ...
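One possible approach to the split described above, as a minimal sketch: a single regular expression with three alternatives (bracketed text, quoted text, or a run of non-space characters). The function name `split_tokens` is hypothetical, and the sketch assumes brackets and quotes are balanced and not nested.

```python
import re

# Three alternatives, tried in order at each position:
#   1. [ ... ]  -> capture the text inside the square brackets
#   2. " ... "  -> capture the text inside the double quotes
#   3. a run of non-whitespace characters
TOKEN_RE = re.compile(r'\[([^\]]*)\]|"([^"]*)"|(\S+)')


def split_tokens(text):
    """Split on spaces, keeping bracketed and quoted runs together.

    Hypothetical helper; assumes balanced, non-nested delimiters.
    """
    # Exactly one group participates in each match; lastindex names it.
    return [m.group(m.lastindex) for m in TOKEN_RE.finditer(text)]


print(split_tokens('this is [bracket test] "and quotes test "'))
```

The delimiters themselves are dropped and inner whitespace is preserved, which matches the result shown in the question.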

Parsing C++ to generate unit test stubs

I've recently been trying to create unit tests for some legacy code. I've been taking the approach of using the linker to show me which functions cause link errors, grepping the source to find the definition and creating a stub from that. Is there an easier way? Is there some kind of C++ parser that can give me class definitions, in ...

Any Python libs for parsing Apache config files?

Are there any Python libs for parsing Apache config files? If not in Python, is anyone aware of such a thing in other languages (Perl, PHP, Java, C#)? I would be able to rewrite them in Python. ...

How do I parse a listing of files to get just the filenames in Python?

So let's say I'm using Python's ftplib to retrieve a list of log files from an FTP server. How would I parse that list of files to get just the file names (the last column) inside a list? See the link above for example output. ...
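A minimal sketch of the "last column" idea, assuming the server returns Unix `ls -l` style lines with nine whitespace-separated columns (the example output linked in the question is not reproduced here, so the sample lines below are made up). The helper name `filenames_from_listing` is hypothetical. If only the names are needed, `ftplib.FTP.nlst()` returns them directly and may make parsing unnecessary.

```python
def filenames_from_listing(lines):
    """Return the file name column from `ls -l` style listing lines.

    Hypothetical helper; assumes nine columns with the name last.
    """
    names = []
    for line in lines:
        # Split into at most nine fields so that spaces inside
        # the file name stay part of the final field.
        fields = line.split(None, 8)
        if len(fields) == 9:  # skips "total N" and blank lines
            names.append(fields[8])
    return names


# Made-up sample lines in the assumed format.
sample = [
    "total 2",
    "-rw-r--r--   1 ftp  ftp   1024 Oct 21 22:23 access.log",
    "-rw-r--r--   1 ftp  ftp   2048 Oct 22 01:10 error log.txt",
]
print(filenames_from_listing(sample))
```

Listing formats vary between FTP servers, so the column count is the main assumption to check against real output.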

Java HTML Parsing

Hello everyone. I'm working on an app which scrapes data from a website and I was wondering how I should go about getting the data. Specifically I need data contained in a number of div tags which use a specific CSS class - Currently (for testing purposes) I'm just checking for "div class = "classname"" in each line of HTML - This wor...

How can I support wildcards in user-defined search strings in Python?

Is there a simple way to support wildcards ("*") when searching strings - without using RegEx? Users are supposed to enter search terms using wildcards, but should not have to deal with the complexity of RegEx: "foo*" => str.startswith("foo") "*foo" => str.endswith("foo") "*foo*" => "foo" in str (it gets more complicated when...
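The three cases listed above generalize to patterns with any number of `*` characters. The following is a minimal sketch that uses only string methods, with no regular expressions; the function name `wildcard_match` is hypothetical, and `*` is assumed to be the only special character.

```python
def wildcard_match(pattern, text):
    """Return True if text matches pattern, where '*' matches any run.

    Hypothetical helper; '*' is the only wildcard supported.
    """
    parts = pattern.split("*")
    if len(parts) == 1:
        # No wildcard at all: require an exact match.
        return pattern == text

    first, last = parts[0], parts[-1]
    # The text must start with the piece before the first '*'
    # and end with the piece after the last '*', without overlap.
    if not text.startswith(first) or not text.endswith(last):
        return False
    pos = len(first)
    end = len(text) - len(last)
    if end < pos:
        return False

    # The middle pieces must appear in order between those two ends.
    for part in parts[1:-1]:
        idx = text.find(part, pos, end)
        if idx == -1:
            return False
        pos = idx + len(part)
    return True


print(wildcard_match("foo*", "foobar"))
print(wildcard_match("*foo", "barfoo"))
print(wildcard_match("*foo*", "a foo b"))
```

The standard library's `fnmatch` module offers similar matching, but it also gives `?` and `[...]` special meaning, which may or may not be wanted for user-entered search terms.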

Removing HTML from a Java String

Is there a good way to remove HTML from a Java string? A simple regex like replaceAll("\\<.*?>","") will work, but things like &amp; won't be converted correctly and non-HTML between the two angle brackets will be removed (i.e. the .*? in the regex will disappear). ...

Parse DateTime with timezone of form PST/CEST/UTC/etc

I'm trying to parse an international datetime string similar to: 24-okt-08 21:09:06 CEST So far I've got something like: CultureInfo culture = CultureInfo.CreateSpecificCulture("nl-BE"); DateTime dt = DateTime.ParseExact("24-okt-08 21:09:06 CEST", "dd-MMM-yy HH:mm:ss ...", culture); The problem is what should I use for the '......

Why can't C++ be parsed with an LR(1) parser?

I was reading about parsers and parser generators when I hit upon this statement on Wikipedia's LR parsing page: "Many programming languages can be parsed using some variation of an LR parser. One notable exception is C++." Why is it so? What particular property in C++ causes it to be impossible to parse with LR parsers? I first tri...

Writing a parser - In need of guides and research papers.

My knowledge about implementing a parser is a bit rusty. I have no idea about the current state of research in the area, and could use some links regarding recent advances and their impact on performance. General resources about writing a parser are also welcome (tutorials, guides, etc.), since much of what I had learned at college I ...