text-processing

How to find out if a sentence is a question (interrogative)?

Is there an open-source Java library or algorithm for determining whether a particular piece of text is a question? I am working on a question-answering system that needs to analyze whether the text input by the user is a question. I think the problem can probably be solved using open-source NLP libraries, but it's obviously more complicated than s...

Efficiently parsing a large text file in C#

I need to read a large space-separated text file and count the number of instances of each code in the file. Essentially, these are the results of running some experiments hundreds of thousands of times. The system spits out a text file that looks kind of like this: A7PS A8PN A6PP23 ... And there are literally hundreds of thousands of...
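
The question asks about C#; as a language-neutral illustration, here is a minimal Python sketch of the streaming count it describes. It assumes the codes are separated by arbitrary whitespace and that the input can be iterated line by line; the function name `count_codes` and the sample lines are hypothetical.

```python
from collections import Counter


def count_codes(lines):
    """Count whitespace-separated codes, reading one line at a time."""
    counts = Counter()
    for line in lines:
        # split() with no argument splits on any run of whitespace
        counts.update(line.split())
    return counts


# Hypothetical sample in the style of the excerpt; a real run would pass
# an open file object instead, so the whole file never sits in memory.
sample = ["A7PS A8PN A6PP23", "A7PS A6PP23 A7PS"]
counts = count_codes(sample)
```

Because a file object yields lines lazily, the same function could be called as `count_codes(open(path))` on a large file, with memory use bounded by the number of distinct codes.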

Code for identifying programming language in a text file

Hi all, I'm supposed to write code that, given a text file (source code) as input, outputs which programming language it is. This is the most basic definition of the problem. More constraints follow: I must write this in C++. A wide variety of languages should be recognized - HTML, PHP, Perl, Ruby, C, C++, Java, C#... Amount o...

Identifying keywords of a (programming) language

Hi all, this is a follow-up to my recent question (Code for identifying programming language in a text file). I'm really thankful for all the answers I got; they helped me very much. My code for this task is complete and it works fairly well - quick and reasonably accurate. The method I used is the following: I have a "learning" Perl s...

Parsing Random Web Pages

Hi, I need to parse a bunch of random pages and add them to a DB. I am thinking of using regular expressions, but I was wondering if there are any 'special' techniques (other than looking for content between known text/tags). The content is mostly (not always) like: Some Title Text related to Title I guess I don't need to extract complet...

UNIX shell: how do you tail up to a searchable expression?

The end of git status looks like this: # Untracked files: # (use "git add <file>..." to include in what will be committed) # # Classes/Default.png # Classes/[email protected] ... Since you might have any number of untracked files, I'm trying to tail from the end of the file to "Untracked files" and save it to a temp file, s...
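
The question is about a UNIX shell pipeline; purely as an illustration of the underlying idea (keep everything from the last line matching a marker to the end), here is a small Python sketch. The function name `tail_from_marker` and the "# On branch master" line in the sample are hypothetical; the remaining sample lines follow the excerpt.

```python
def tail_from_marker(lines, marker):
    """Return the lines from the last one containing `marker` to the end."""
    start = None
    for index, line in enumerate(lines):
        if marker in line:
            start = index  # remember the most recent match
    return [] if start is None else lines[start:]


# Sample modeled on the excerpt; the first line is invented for illustration.
status = [
    "# On branch master",
    "# Untracked files:",
    '#   (use "git add <file>..." to include in what will be committed)',
    "#",
    "#\tClasses/Default.png",
]
tail = tail_from_marker(status, "Untracked files")
```

If the marker never appears, the sketch returns an empty list rather than the whole input, which is a design choice and not something the question specifies.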

Parse string into a tree structure?

I'm trying to figure out how to parse a string in this format into a tree-like data structure of arbitrary depth. "{{Hello big|Hi|Hey} {world|earth}|{Goodbye|farewell} {planet|rock|globe{.|!}}}" [[["Hello big" "Hi" "Hey"] ["world" "earth"]] [["Goodbye" "farewell"] ["planet" "rock" "globe" ["." "!"]]]] ...

joining two tab-delimited files by column with same identifiers in ONE step (command)?

Very often I want to join two ASCII files, both of which are tables in the sense that they consist of tab-separated columns, like this: file 1 FRUIT ID apple alpha banana beta cherry gamma file 2 ID FOOBAR alpha cat beta dog delta airplane and I want to join them like this with an inner join: FRUIT ID FOOBAR appl...
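
The question asks for a one-step shell command; as an illustration of what the inner join itself does, here is a minimal Python sketch using the two tables from the excerpt. It assumes each file has a header row and that the key column has unique values in the second file; the function name `inner_join` is hypothetical.

```python
import csv
import io


def inner_join(left_text, right_text, key):
    """Inner-join two tab-delimited tables (with header rows) on `key`."""
    left = csv.DictReader(io.StringIO(left_text), delimiter="\t")
    right = csv.DictReader(io.StringIO(right_text), delimiter="\t")
    # Index the right table by key; assumes keys are unique there.
    right_by_key = {row[key]: row for row in right}
    joined = []
    for row in left:
        match = right_by_key.get(row[key])
        if match is not None:  # keep only rows present in both tables
            merged = dict(row)
            merged.update(match)
            joined.append(merged)
    return joined


# The two tables from the excerpt, written out with explicit tabs.
file1 = "FRUIT\tID\napple\talpha\nbanana\tbeta\ncherry\tgamma\n"
file2 = "ID\tFOOBAR\nalpha\tcat\nbeta\tdog\ndelta\tairplane\n"
rows = inner_join(file1, file2, "ID")
```

Rows whose ID appears in only one table (cherry/gamma and delta/airplane here) are dropped, which is what distinguishes an inner join from an outer one.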

How can I trim the contents of a file in Perl?

I would like to remove the contents of a file from one given character to another in Perl. How do I do that using a script? The file has this: Syslog logging: enabled (11 messages dropped, 2 messages rate-limited, 0 flushes, 0 overruns, xml disabled, filtering disabled) Console logging: level inf...

Algorithm for text classification

Hello. I have millions of short (up to 30 words) documents which I need to split into several known categories. It's possible that a document matches several of the categories (seldom, but possible). It's also possible that a document doesn't match any of the categories (also seldom). I also have millions of documents which have already...

Python: How to loop through blocks of lines

How to go through blocks of lines separated by an empty line? The file looks like the following: ID: 1 Name: X FamilyN: Y Age: 20 ID: 2 Name: H FamilyN: F Age: 23 ID: 3 Name: S FamilyN: Y Age: 13 ID: 4 Name: M FamilyN: Z Age: 25 I want to loop through the blocks and grab the fields Name, Family name and Age in a list of 3 columns: ...
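
One possible way to do this, sketched under the assumption that each "Key: value" field sits on its own line and blocks are separated by blank lines (the excerpt shows the file flattened, so the exact layout is a guess). The helper names `iter_blocks` and `parse_block` are hypothetical.

```python
from itertools import groupby


def iter_blocks(lines):
    """Yield each run of consecutive non-blank lines as a list."""
    for is_blank, group in groupby(lines, key=lambda line: not line.strip()):
        if not is_blank:
            yield [line.strip() for line in group]


def parse_block(block):
    """Turn 'Key: value' lines into a dict of stripped strings."""
    fields = {}
    for line in block:
        name, _, value = line.partition(":")
        fields[name.strip()] = value.strip()
    return fields


# First two records from the excerpt, laid out one field per line (assumed).
sample = """ID: 1
Name: X
FamilyN: Y
Age: 20

ID: 2
Name: H
FamilyN: F
Age: 23
"""
rows = [
    [rec["Name"], rec["FamilyN"], rec["Age"]]
    for rec in map(parse_block, iter_blocks(sample.splitlines()))
]
```

With a real file, `iter_blocks(open(path))` would work the same way, since `groupby` only needs an iterable of lines.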

Recommendations for text-processing software

I have a need to process text files to extract relevant information for later input into R for statistical analysis. The text file content typically looks like the example extract shown below. Can the board make any recommendations as to what software/programming language I should be looking to use for this purpose? The critical requirem...