text-processing

bash: filter away consecutive lines from text file

I want to delete each instance of a paragraph from many files. By paragraph I mean a sequence of lines. For example: my first line my second line my third line the fourth 5th and last. The problem is that I only want to delete them when they appear as a group. For example, if "my first line" appears alone, I don't want to delete it. ...
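One way to approach this, sketched in Python rather than bash: compare a sliding window of the file's lines against the paragraph and skip the window only on an exact match. The function name `remove_block` and the sample data in the usage note are assumptions made for illustration, not part of the question.

```python
def remove_block(lines, block):
    """Return lines with every exact, consecutive occurrence of block removed.

    A line that belongs to the block but appears on its own is kept.
    """
    out = []
    i = 0
    n = len(block)
    while i < len(lines):
        # Drop the window only when all n lines match the block in order.
        if n and lines[i:i + n] == block:
            i += n
        else:
            out.append(lines[i])
            i += 1
    return out
```

For a file, `lines` could come from `open(path).read().splitlines()`, with the result joined by newlines and written back.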

Log parser/analyzer in Unix

What's the popular tool people use in Unix to parse and analyze log files? I need to count, find unique entries, and select or copy lines that match certain patterns. Please suggest some tools or keywords. I believe similar questions must have been asked before, but I don't have any idea about the keywords. Thanks. ...

What's the best tool to do text processing in Linux or Mac?

I generally need to do a fair amount of text processing for my research, such as removing the last token from all lines, extracting the first two tokens from each line, splitting each line into tokens, etc. What is the best way to perform this? Should I learn Perl for this? Or should I learn some kind of shell commands? The main concern...
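As a rough illustration of the operations named in the question, here is a minimal Python sketch. It assumes tokens are separated by whitespace, and the function names are hypothetical. Note that joining tokens back together normalizes runs of whitespace to single spaces.

```python
def split_tokens(line):
    # Split a line into whitespace-separated tokens.
    return line.split()

def drop_last_token(line):
    # Remove the last token and rejoin the rest with single spaces.
    return " ".join(line.split()[:-1])

def first_two_tokens(line):
    # Keep only the first two tokens (fewer if the line is shorter).
    return line.split()[:2]
```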

tfidf, am I understanding it right?

Hey everyone, I am interested in doing some document clustering, and right now I am considering using TF-IDF for this. If I am not wrong, TF-IDF is mainly used for evaluating the relevance of a document given a query. If I do not have a particular query, how can I apply TF-IDF to clustering? ...

Forcing a mixed ISO-8859-1 and UTF-8 multi-line string into UTF-8 in Perl

Consider the following problem: A multi-line string $junk contains some lines which are encoded in UTF-8 and some in ISO-8859-1. I don't know a priori which lines are in which encoding, so heuristics will be needed. I want to turn $junk into pure UTF-8 with proper re-encoding of the ISO-8859-1 lines. Also, in the event of errors in th...

C# Combining lines

Hey everybody, this is what I have going on. I have two text files; let's call them A.txt and B.txt. A.txt is a config file that contains a bunch of folder names, with only one listing per folder. B.txt is a directory listing that contains folder names and sizes, but B contains a bunch of listings per folder, not just one entry. What I need is if B, ...

Using regex to extract variables from a plain-text form letter?

Hi - I'm looking for a good example of using Regular Expressions in PHP to "reverse engineer" a form letter (with a known format, of course) that has been pasted into a multiline textbox and sent to a script for processing. So, for example, let's assume this is the original plain-text input (taken from a USDA press release): WASHING...

Algorithm for Negating Sentences

I was wondering if anyone was familiar with any attempts at algorithmic sentence negation. For example, given a sentence like "This book is good" provide any number of alternative sentences meaning the opposite like "This book is not good" or even "This book is bad". Obviously, accomplishing this with a high degree of accuracy would pr...

Details on the following natural language processing terms?

Named entity extraction (extract people, cities, organizations); content tagging (extract topic tags by scanning a document); structured data extraction; topic categorization (taxonomy classification by scanning a document, e.g. Bayesian); text extraction (HTML page cleaning). Are there libraries that I can use to do any of the above NLP functions? Don't ...

List of uninteresting words

[Caveat] This is not directly a programming question, but it is something that comes up so often in language processing that I'm sure it's of some use to the community. Does anyone have a good list of uninteresting (English) words that has been tested with more than a casual look? This would include all prepositions, conjunctions, etc... ...

Resources for character and text processing (encoding, regular expressions, NLP)

I'd like to learn the foundations of encodings, characters and text. Understanding these is important for dealing with large sets of text, whether they are log files or source text for building algorithms for collective intelligence. My current knowledge is pretty basic: something like "As long as I use UTF-8, I'm okay." I don't say I need ...

Java text classification problem

Hello, I have a set of Book objects. The class Book is defined as follows: class Book { String title; ArrayList<tags> taglist; } where title is the title of the book, for example: Javascript for dummies, and taglist is a list of tags, for our example: Javascript, jquery, "web dev", .. As I said, I have a set of books talking about di...

Given a document, select a relevant snippet.

When I ask a question here, the tooltips for the questions returned by the auto search give the first little bit of each question, but a decent percentage of them don't give any text that is more useful for understanding the question than the title. Does anyone have an idea about how to make a filter to trim out useless bits of a que...

What's the fastest way to strip and replace a document of high unicode characters using Python?

I am looking to replace from a large document all high unicode characters, such as accented Es, left and right quotes, etc., with "normal" counterparts in the low range, such as a regular 'E', and straight quotes. I need to perform this on a very large document rather often. I see an example of this in what I think might be perl here: ht...
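One common approach, sketched here under stated assumptions: map typographic quotes to straight quotes with a translation table, then use Unicode NFKD decomposition so that accented letters split into a base letter plus combining marks, and drop everything that is still outside ASCII. The function name `to_ascii` and the small quote table are illustrative choices. Characters with no ASCII decomposition are silently removed, which may or may not be acceptable for a given document.

```python
import unicodedata

# Typographic quotes mapped to their straight ASCII counterparts.
QUOTES = str.maketrans({
    "\u2018": "'",   # left single quote
    "\u2019": "'",   # right single quote
    "\u201c": '"',   # left double quote
    "\u201d": '"',   # right double quote
})

def to_ascii(text):
    # Replace curly quotes first, since NFKD does not decompose them.
    text = text.translate(QUOTES)
    # Split accented letters into base letter + combining mark.
    decomposed = unicodedata.normalize("NFKD", text)
    # Drop anything that is still not ASCII (combining marks included).
    return decomposed.encode("ascii", "ignore").decode("ascii")
```

Both `str.translate` and `unicodedata.normalize` run in C, so this tends to be reasonably fast on large inputs, though that has not been measured here.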

Bash: any command to replace strings in text files?

I have a hierarchy of directories containing many text files. I would like to search for a particular text string every time it comes up in one of the files, and replace it with another string. For example, I may want to replace every occurrence of the string "Coke" with "Pepsi". Does anyone know how to do this? I am wondering if there i...
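The question asks about Bash. As an alternative sketch in Python, the directory tree can be walked with `pathlib` and each readable text file rewritten in place. The function name `replace_in_tree` is hypothetical, files are assumed to be UTF-8, and files that fail to decode are skipped rather than modified.

```python
from pathlib import Path

def replace_in_tree(root, old, new):
    """Replace old with new in every UTF-8 text file under root.

    Returns the list of paths that were modified.
    """
    changed = []
    for path in Path(root).rglob("*"):
        if not path.is_file():
            continue
        try:
            text = path.read_text(encoding="utf-8")
        except UnicodeDecodeError:
            # Assumed to be binary or in another encoding: leave untouched.
            continue
        if old in text:
            path.write_text(text.replace(old, new), encoding="utf-8")
            changed.append(path)
    return changed
```

Calling `replace_in_tree("some/dir", "Coke", "Pepsi")` would cover the example in the question. Since it overwrites files, trying it on a copy of the directory first is advisable.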

problem in extracting the data from text file

Hello, I am new to Python, and I want to extract the data from this format: FBpp0143497 5 151 5 157 PF00339.22 Arrestin_N Domain 1 135 149 83.4 1.1e-23 1 CL0135 FBpp0143497 183 323 183 324 PF02752.15 Arrestin_C Domain 1 137 138 58.5 6e-16 1 CL0135 FBpp0131987 60 280 51 280 PF00089.19 Trypsin Domain 14 219 219 127.7 3.7e-37 1 CL0124 t...

Regex for finding an unterminated string

I need to search for lines in a CSV file that end in an unterminated, double-quoted string. For example: 1,2,a,b,"dog","rabbit would match whereas 1,2,a,b,"dog","rabbit","cat bird" 1,2,a,b,"dog",rabbit would not. I have very limited experience with regular expressions, and the only thing I could think of is something like "[^"]*...
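A possible sketch, assuming that "unterminated" means the line contains an odd number of double-quote characters and that quotes are never backslash-escaped: the pattern below consumes quotes in pairs and then requires exactly one more. The names `UNTERMINATED` and `has_unterminated_quote` are illustrative.

```python
import re

# Zero or more balanced pairs of quotes, then one extra quote that is
# never closed before the end of the line.
UNTERMINATED = re.compile(r'^(?:[^"]*"[^"]*")*[^"]*"[^"]*$')

def has_unterminated_quote(line):
    return UNTERMINATED.match(line) is not None
```

Applied to the three sample lines from the question, this matches only the first one. The same parity check could be written without a regex as `line.count('"') % 2 == 1`.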

Inlining the LaTeX \input Command

I'm looking for a program to recursively inline all \input{} commands in a LaTeX file. By "recursively", I mean doing the inlining iteratively until no \input{} command remains in the final LaTeX file. I've already come across the flatten package. But, for some reason, my TeXLive distribution doesn't install it. When I execute the command s...

reading job requirements

I'd like to read a job advertisement with my program. Initially I am working with the "Job Description" templates provided by Microsoft Word. Basically I have to extract the requirements of jobs, such as required education, skills, or any development tools. I'd store these requirements in the database and then further us...

Are there any well known algorithms to detect the presence of names?

For example, given a string: "Bob went fishing with his friend Jim Smith." Bob and Jim Smith are both names, but bob and smith are both words. Were it not for them being capitalized, there would be less indication of this outside of our knowledge of the sentence. Are there any well known algorithms for detecting the presence of names, at lea...