text-processing

Data extraction from source with lots of white space

I'm trying to extract data from http://www.phillysheriff.com/old_site/properties.html and, ideally, get a CSV file with the address, ward, price, and square feet. Is there an easy way to do this? ...

Please suggest a good text processing project

Hello, Lately I've realized that one must be good at handling (parsing) text. The task may be as simple as interpreting an HTTP response or reading a settings file (*.ini, *.xml, or *.json), or as hard as writing a compiler or regex engine. I agree that we now have library functions/methods for interpreting popular text formats. But...

Reading email content

Hi, Hope someone may be able to help. What I am looking to do is create a small WinForms app in C# to read the content of an email from a POP account and upload key values to SQL automatically. The email format is always the same for each email, e.g. First name : Last name : Phone number : etc... Currently the emails are being store...

Extract URLs from text using Ruby while handling matched parens

URI.extract claims to do this, but it doesn't handle matched parens: >> URI.extract("text here (http://foo.example.org/bla) and here") => ["http://foo.example.org/bla)"] What's the best way to extract URLs from text without breaking parenthesized URLs (which users like to use)? ...
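One possible approach, sketched here in Python rather than Ruby, is to match URL candidates greedily and then trim trailing punctuation and any closing parenthesis that has no matching opener inside the URL. The pattern `URL_RE` and the helper `extract_urls` are illustrative names, and the regex is a deliberately loose assumption, not a full URI grammar.

```python
import re

# Loose candidate pattern (an assumption): scheme, then anything up to whitespace.
URL_RE = re.compile(r"https?://[^\s<>\"]+")

def extract_urls(text):
    """Return URLs found in text, dropping unbalanced trailing ')'."""
    urls = []
    for match in URL_RE.finditer(text):
        url = match.group(0)
        # Trim trailing sentence punctuation, and a ')' only when the URL
        # contains more closing than opening parentheses.
        while url and (
            url[-1] in ".,;:!?"
            or (url[-1] == ")" and url.count("(") < url.count(")"))
        ):
            url = url[:-1]
        urls.append(url)
    return urls

print(extract_urls("text here (http://foo.example.org/bla) and here"))
```

A URL whose own parentheses are balanced, such as one ending in `Foo_(bar)`, is left intact by this rule.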

Script to fix broken lines in a .txt file?

I'd love to read books properly on my Kindle. To achieve my dream, I need a script to fix broken lines in a txt file. For example, if the txt file has this line: He watched Kahlan as she walked with her shoulders slumped down. ... then it should fix it by deleting the newline before the word "down": He watched Kahlan as she wa...

Parse items from text file

I have a text file that includes data inside {[]} tags. What would be the suggested way to parse that data so I can just use the data inside the tags? Example text file would look like this: 'this is a bunch of text that is not {[really]} useful in any {[way]}. I need to {[get]} some items {[from]} it.' I would like to end up with '...
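A minimal sketch of one way to pull out the tagged items, assuming the tags never nest and using Python's `re` module; `extract_tagged` is a hypothetical helper name.

```python
import re

def extract_tagged(text):
    """Return the contents of every {[...]} pair, in order of appearance."""
    # Non-greedy match so each item stops at the first closing ']}'.
    return re.findall(r"\{\[(.*?)\]\}", text)

sample = ("this is a bunch of text that is not {[really]} useful in any "
          "{[way]}. I need to {[get]} some items {[from]} it.")
items = extract_tagged(sample)
print(items)
```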

Simple tool to find the most recurrent terms in a text

Hi. I have a text and I would like to extract the most recurrent terms, even if they are made up of more than one word (e.g. managing director, position, salary, web developer). I need a library or an installable executable rather than a web service. I came across some complex tools (such as Topia's Term Extraction, MAUI) that require tr...

XSLT 2.0 regex question (opening and closing elements on different matches)

I've simplified the problem somewhat, but I hope I've still captured the essence of my problem. Let's say I have the following simple XML file: <main> outside1 ===BEGIN=== inside1 ====END==== outside2 =BEGIN= inside2 ==END== outside3 </main> Then I can use the following XSLT 2.0: <?xml version="1.0" encodin...

Identifying frequent formulas in a codebase

My company maintains a domain-specific language that syntactically resembles the Excel formula language. We're considering adding new builtins to the language. One way to do this is to identify verbose commands that are repeatedly used in our codebase. For example, if we see people always write the same 100-character command to trim whit...

Splitting words in running text using Python?

I am writing a piece of code which will extract words from running text. The text can contain delimiters such as \r and \n. I want to discard all these delimiters and extract only full words. How can I do this with Python? Is there any library available for crunching text in Python? ...
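As a rough sketch using only the standard library, `re.findall` with `\w+` returns runs of word characters and skips whitespace such as `\r` and `\n` along with punctuation. Note that `\w` also matches digits and underscores, so treating those as part of a word is an assumption here; `split_words` is an illustrative name.

```python
import re

def split_words(text):
    """Return the words in text, ignoring whitespace and punctuation."""
    # \w+ matches runs of letters, digits, and underscores.
    return re.findall(r"\w+", text)

print(split_words("Hello,\r\nworld! foo\tbar"))
```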

How to calculate the percentage of similarity or difference between two texts / strings?

To explain further: assume I have two strings like the ones below: I am a super boy who can Fly! Really . I am super boy who can Break walls! Really . So some characters are shared: I am super boy who can and Really . Is there anything ready to use to find the percentage similarity/difference between those two strings? ...
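One ready-made option in Python is `difflib.SequenceMatcher` from the standard library, whose `ratio()` returns a similarity score between 0 and 1. The sketch below scales it to a percentage; `similarity_percent` is a hypothetical wrapper, and this is only one of several reasonable similarity measures.

```python
from difflib import SequenceMatcher

def similarity_percent(a, b):
    """Return a 0-100 similarity score based on matching character blocks."""
    return SequenceMatcher(None, a, b).ratio() * 100

first = "I am a super boy who can Fly! Really ."
second = "I am super boy who can Break walls! Really ."
print(f"{similarity_percent(first, second):.1f}% similar")
```

The difference percentage, if needed, is simply 100 minus this value.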

Parse log files programmatically in .NET

We have a large number (read: 50,000) of relatively small (read: under 500K, typically under 50K) log files created using log4net from our client application. A typical log looks like: Start Painless log Framework:8.1.7.0 Application:8.1.7.0 2010-05-05 19:26:07,678 [Login ] INFO Application.App.OnShowLoginMessage(194) - Validating Crede...

Exploding UpperCasedCamelCase to Upper Cased Camel Case in PHP

The title says it all. Right now, I am implementing this with a split, slice, and implosion: $exploded = implode(' ',array_slice(preg_split('/(?=[A-Z])/','ThisIsATest'),1)); //$exploded = "This Is A Test" Prettier version: $capital_split = preg_split('/(?=[A-Z])/','ThisIsATest'); $blank_first_ignored = array_slice($capital_split,1);...
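For comparison, the same lookahead idea can be sketched in Python (not the PHP used in the question) as a single substitution: insert a space before every uppercase letter except one at the start of the string. The helper name `explode_camel_case` is illustrative, and the sketch assumes Python 3.7 or later, where `re.sub` replaces empty matches.

```python
import re

def explode_camel_case(name):
    """Insert a space before each capital letter that is not at position 0."""
    # (?<!^) skips the start of the string; (?=[A-Z]) is a zero-width lookahead.
    return re.sub(r"(?<!^)(?=[A-Z])", " ", name)

print(explode_camel_case("ThisIsATest"))
```

Because the match is zero-width, no characters are consumed, so no slicing or re-joining step is needed.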

Extract text with java

If I have the string below, how can I extract the EDITORS PREFACE text with java? Thanks. <div class='chapter'><a href='page.php?page=1&filename=SomeFile&chapter=EDITORS PREFACE'>EDITORS PREFACE</a></div> ...

Fast Text Preprocessing

In my project I work with text in general. I found that preprocessing can be very slow, so I would like to ask whether you know how to optimize my code. The flow is like this: get HTML page -> (To plain text -> stemming -> remove stop words) -> further text processing The steps in parentheses are the preprocessing steps. The application runs in ab...

How to define syntax

Hi, I am new to language processing and I want to create a parser with Irony for the following syntax: name1:value1 name2:value2 name3:value ... where name1 is the name of an XML element and value is the value of the element, which can also include spaces. I have tried to modify the included samples like this: public TestGrammar() ...

Extract snippet out of HTML with Ruby?

I need to show the first 100 characters of an HTML text, which means I have to pick the first 100 characters that are not tags and then close any open tags, leaving balanced HTML. Is there any library that can do it? Or is there any trivial way to do it that I am missing? The text is originally written in Textile which can and does co...

Techniques for probabilistic clustering of similar looking text data?

I have 20,000 company addresses on various documents, which are all formatted differently. For example: Company A 12345 street US CompanyA, Inc box2, 12345 street WA, US The Company B company Ltd 123 happy street UK company B, Ltd 123, happy street, london, S1 1AA I'd like to be able to combine the records for each company (i.e. sepe...

Estimating the word count of a file without reading the full file

I have a program to process very large files. Now I need to show a progress bar to show the progress of the processing. The program works on a word level, reading one line at a time, splitting it into words and processing the words one by one. So while the program runs, it knows the count of the words processed. If somehow it knows the wor...
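One rough way to estimate the total, sketched under the assumption that word density is fairly uniform across the file: count the words in a sample from the start of the file and scale by the file size. The function name `estimate_word_count` and the default `sample_bytes` value are illustrative choices.

```python
import os

def estimate_word_count(path, sample_bytes=64 * 1024):
    """Estimate words in a file from the word density of its first bytes."""
    size = os.path.getsize(path)
    if size == 0:
        return 0
    with open(path, "rb") as handle:
        sample = handle.read(sample_bytes)
    # bytes.split() with no argument splits on runs of ASCII whitespace.
    words_in_sample = len(sample.split())
    # Scale the sample's word count up to the whole file.
    return round(words_in_sample * size / len(sample))
```

The estimate can be off when the sample ends mid-word or when the file's content is uneven, so a progress bar built on it should clamp at 100%.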

Extract a specific set of lines from files

Hello, I have many large (~30 MB apiece) tab-delimited text files with variable-width lines. I want to extract the 2nd field from the nth (here, n=4) and next-to-last lines (the last line is empty). I can get them separately using awk: awk 'NR==4{print $2}' filename.dat and (I don't comprehend this entirely but) awk '{y=x "\n" $2};EN...
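A single-pass sketch in Python, as an alternative to the two awk commands: it records the requested field from line n and keeps only the last two lines seen, so the next-to-last line is available at the end without loading the file into memory. It assumes fields are separated by tabs, as described; `pick_fields` and `_field` are hypothetical helper names.

```python
from collections import deque

def _field(line, index):
    """Return the 1-based tab-separated field, or None if it is missing."""
    parts = line.rstrip("\r\n").split("\t")
    return parts[index - 1] if len(parts) >= index else None

def pick_fields(lines, n=4, field=2):
    """Return (field of line n, field of the next-to-last line)."""
    nth_value = None
    last_two = deque(maxlen=2)  # holds only the two most recent lines
    for number, line in enumerate(lines, start=1):
        if number == n:
            nth_value = _field(line, field)
        last_two.append(line)
    penultimate = _field(last_two[0], field) if len(last_two) == 2 else None
    return nth_value, penultimate

rows = ["a\t1\n", "b\t2\n", "c\t3\n", "d\t4\n", "e\t5\n", "\n"]
print(pick_fields(rows))
```

Passing an open file object instead of the `rows` list works the same way, since both yield one line at a time.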