text-parsing

Code Golf: Quickly Build List of Keywords from Text, Including # of Instances

I've already worked out this solution for myself with PHP, but I'm curious how it could be done differently - better even. The two languages I'm primarily interested in are PHP and JavaScript, but I'd be interested in seeing how quickly this could be done in any other major language today as well (mostly C#, Java, etc.). Return only wor...

Textual analysis of large documents

I have a project where I need to compare multi-chapter documents to a second document to determine their similarity. The issue is I have no idea how to go about doing this, what approaches exist, or if there are any libraries available. My first question is... what is similar? The number of words that match, the number of consecutive wo...

Create Great Parser - Extract Relevant Text From HTML/Blogs

I'm trying to create a generalized HTML parser that works well on blog posts. I want to point my parser at the specific entry's URL and get back clean text of the post itself. My basic approach (from Python) has been to use a combination of BeautifulSoup / urllib2, which is okay, but it assumes you know the proper tags for the blog entr...

library to parse a relative date (like google calendar can) in c#

Hi, I'm asking the same question as this: http://stackoverflow.com/questions/296738/how-can-i-parse-relative-dates-with-perl but in C#. Sorry if this is a duplicate; I'll delete it if so. Does such a library exist? Thanks ...

How can I split out individual column values from each line in a text file?

I have lines in an ASCII text file that I need to parse. The columns are separated by a variable number of spaces, for instance: column1 column2 column3 How would I split this line to return an array of only the values? Thanks ...
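A minimal sketch of one way to approach the column-splitting question above. The question does not name a language, so Python is an assumption here, as is the helper name `split_columns`. It relies on `str.split()` with no arguments, which splits on any run of whitespace and discards leading and trailing blanks:

```python
def split_columns(line):
    """Split a line into column values separated by one or more whitespace characters.

    Hypothetical helper for illustration; str.split() with no separator
    treats any run of whitespace as a single delimiter.
    """
    return line.split()


if __name__ == "__main__":
    # Columns separated by a variable number of spaces.
    print(split_columns("column1   column2      column3"))
    # ['column1', 'column2', 'column3']
```

Note that this treats tabs as separators too and cannot represent empty columns; fixed-width data would need slicing by position instead.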

Character strings in Fortran: Portable LEN_TRIM and LNBLNK?

I need a portable function/subroutine to locate the position of the last non-blank character in a string. I've found two options: LEN_TRIM and LNBLNK. However, different compilers seem to have different standards. The official documentation for the following compilers suggests that LEN_TRIM is part of the Fortran 95 standard on the f...

Parse 'family' names into people + last name with regex

Given the following string, I'd like to parse into a list of first names + a last name: Peter-Paul, Mary & Joël Van der Winkel (and the simpler versions) I'm trying to work out if I can do this with a regex. I've got this far (?:([^, &]+))[, &]*(?:([^, &]+)) But the problem here is that I'd like the last name to be captured in ...

How to find Title case phrases from a passage or bunch of paragraphs

How do I parse sentence case phrases from a passage? For example, from this passage: Conan Doyle said that the character of Holmes was inspired by Dr. Joseph Bell, for whom Doyle had worked as a clerk at the Edinburgh Royal Infirmary. Like Holmes, Bell was noted for drawing large conclusions from the smallest observations.[1] Michael Har...

What is a Surefire way to get a string Word Count in C#

I am not sure how to go about this. Right now I am counting the spaces to get the word count of my string, but if there is a double space, the word count will be inaccurate. Is there a better way to do this? ...

C# - Trimming string from first null terminator and onwards

I have a C# string "RIP-1234-STOP\0\0\0\b\0\0\0???|B?Mp?\0\0\0" returned from a call to a native driver. How can I trim all characters from the first null terminator '\0' onwards? In this case, I would just like to have "RIP-1234-STOP". Thanks. ...

NWS LSR Documented Format or Retrieval (PHP)

I am attempting to find documentation on how Local Storm Reports (LSR) issued by the National Weather Service are formatted. I am also aware of the public FTP directory where these text files are stored, but I was wondering if anyone knows whether the NWS or other sources provide these reports via a web service instead of having to manually write a par...

Simple get string (ignore numbers at end) in C#

I figure regex is overkill; also, it takes me some time to write the code (I guess I should learn, now that I know some regex). What's the simplest way to separate out the letter part of an alphanumeric string? It will always be LLLLDDDDD. I only want the letters (the L's); typically it's only 1 or 2 letters. ...

SimpleParse non-deterministic grammar until runtime

Hi, I'm working on a basic networking protocol in Python, which should be able to transfer both ASCII strings (read: EOL-terminated) and binary data. For the latter to be possible, I chose to create the grammar such that it contains the number of bytes to come which are going to be binary. For SimpleParse, the grammar would look like th...

Format ParseException with JavaCC

I was wondering how a ParseException thrown by JavaCC could be formatted in a human-readable way. It includes fields such as beginLine, beginColumn, endColumn, and endLine in the token reference of the exception, but not a reference to the source parsed. Thanks! :) ...

Replacing text function in php

Hello, I want to clean up some parsed text such as \n the said \r\n\r\n\r\n I look in your eyes my dear\r\n\r\nI see green rolling Forests\r\n\r\nI see the far away Sky\r\n\r\nThey turn into the rain\r\n\r\n\r\nI see high soaring eagles... more\n So I want to get rid of the "\n", "\r\n", "\r\n\r\n", "\r\n\r\n\r\n", "\r\n\r\n\r\n\r\n" a...

Regular expression capture groups in Oracle PL/SQL

I'm trying to turn free-form text into something more structured. I have a complex pattern that matches the great majority (well above the minimum acceptable limit) of the data available, and I'd like to use that to assist in structuring the data, rather than parsing the text character-by-character. The problem that I've just run into is...

Java String parsing - {k1=v1,k2=v2,...}

I have the following string which will probably contain ~100 entries: String foo = "{k1=v1,k2=v2,...}" and am looking to write the following function: String getValue(String key){ // return the value associated with this key } I would like to do this without using any parsing library. Any ideas for something speedy? ...

Python parsing bracketed blocks

What would be the best way in Python to parse out chunks of text contained in matching brackets? "{ { a } { b } { { { c } } } }" should initially return: [ "{ a } { b } { { { c } } }" ] putting that as an input should return: [ "a", "b", "{ { c } }" ] which should return: [ "{ c }" ] [ "c" ] [] ...
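The behaviour described in the question above can be sketched with a simple depth counter, without a parsing library. This is only one possible approach, not an answer from the original thread; the function name `top_level_blocks` and the choice to raise `ValueError` on unbalanced input are assumptions made for illustration:

```python
def top_level_blocks(text, open_ch="{", close_ch="}"):
    """Return the stripped contents of each outermost bracketed block in text.

    Hypothetical helper: tracks nesting depth and records a block each time
    the depth returns to zero. Raises ValueError if brackets are unbalanced.
    """
    blocks = []
    depth = 0
    start = 0
    for i, ch in enumerate(text):
        if ch == open_ch:
            if depth == 0:
                start = i + 1  # contents begin just after the opening bracket
            depth += 1
        elif ch == close_ch:
            if depth == 0:
                raise ValueError("unexpected closing bracket at index %d" % i)
            depth -= 1
            if depth == 0:
                blocks.append(text[start:i].strip())
    if depth != 0:
        raise ValueError("missing closing bracket")
    return blocks


if __name__ == "__main__":
    print(top_level_blocks("{ { a } { b } { { { c } } } }"))
    # ['{ a } { b } { { { c } } }']
    print(top_level_blocks("{ a } { b } { { { c } } }"))
    # ['a', 'b', '{ { c } }']
    print(top_level_blocks("c"))
    # []
```

Calling the function repeatedly on each returned chunk peels off one nesting level at a time, which matches the sequence of results listed in the question.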

How to do Erlang pattern matching using regular expressions?

When I write Erlang programs which do text parsing, I frequently run into situations where I would love to do a pattern match using a regular expression. For example, I wish I could do something like this, where ~ is a "made up" regular expression matching operator: my_function(String ~ ["^[A-Za-z]+[A-Za-z0-9]*$"]) -> .... I know...

Determine locations mentioned in shortish (500 to 1000 words) piece of text using PHP

I'd like to find a way to take a piece of user-supplied text and determine what addresses on the map are mentioned within the text. I'd be happy to use a free web service if it exists or use a script which will not consume too many resources. One way I can imagine doing this is taking a gigantic database of addresses and searching for...