I want to delete from many files each instance of a paragraph. I call paragraph a sequence of lines.
For example:
my first line
my second line
my third line
the fourth
5th and last
the problem is that I only want to delete them when they appear as a group. For example, if my first line appears alone I don't want to delete it.
...
What's the popular tool people use in Unix to parse/analyze log files? Doing counting, find unique, select/copy certain line which have certain patterns. Please advise some tools or some keyword. Since I believe there must be similar questions asked before, but I don't any idea about the keywords. Thanks.
...
I generally need to do a fair amount of text processing for my research, such as removing the last token from all lines, extracting the first two tokens from each line, splitting each line into tokens, etc.
What is the best way to perform this? Should I learn Perl for this? Or should I learn some kind of shell commands? The main concern...
Hey everyone,
I am interested in doing some document clustering, and right now I am considering using TF-IDF for this.
If I am not wrong, TFIDF is particularly used for evaluating the relevance of a document given a query. If I do not have a particular query, how can I apply tfidf to clustering?
...
Consider the following problem:
A multi-line string $junk contains some lines which are encoded in UTF-8 and some in ISO-8859-1. I don't know a priori which lines are in which encoding, so heuristics will be needed.
I want to turn $junk into pure UTF-8 with proper re-encoding of the ISO-8859-1 lines. Also, in the event of errors in th...
Hey everybody, this is what I have going on. I have two text files. Umm lets call one A.txt and B.txt.
A.txt is a config file that contains a bunch of folder names, only 1 listing per folder.
B.txt is a directory listing that contains folders names and sizes. But B contains a bunch of listing not just 1 entry.
What I need is if B, ...
Hi - I'm looking for a good example of using Regular Expressions in PHP to "reverse engineer" a form letter (with a known format, of course) that has been pasted into a multiline textbox and sent to a script for processing.
So, for example, let's assume this is the original plain-text input (taken from a USDA press release):
WASHING...
I was wondering if anyone was familiar with any attempts at algorithmic sentence negation.
For example, given a sentence like "This book is good" provide any number of alternative sentences meaning the opposite like "This book is not good" or even "This book is bad".
Obviously, accomplishing this with a high degree of accuracy would pr...
Named Entity Extraction (extract ppl, cities, organizations)
Content Tagging (extract topic tags by scanning doc)
Structured Data Extraction
Topic Categorization (taxonomy classification by scanning doc....bayesian )
Text extraction (HTML page cleaning)
are there libraries that i can use to do any of the above functions of NLP ?
dont ...
[Caveat] This is not directly a programing question, but it is something that comes up so often in language processing that I'm sure it's of some use to the community.
Does anyone have a good list of uninteresting (English) words that have been tested by more then a casual look? This would include all prepositions, conjunctions, etc... ...
I'd like to learn foundations of encodings, characters and text. Understanding these is important for dealing with a large set of text whether that are log files or text source for building algorithms for collective intelligence. My current knowledge is pretty basic: something like "As long as I use UTF-8, I'm okay."
I don't say I need ...
Hello,
I have a set of Books objects, classs Book is defined as following :
Class Book{
String title;
ArrayList<tags> taglist;
}
Where title is the title of the book, example : Javascript for dummies.
and taglist is a list of tags for our example : Javascript, jquery, "web dev", ..
As I said a have a set of books talking about di...
When I ask a question here, the tool tips for the question returned by the auto search given the first little bit of the question, but a decent percentage of them don't give any text that is any more useful for understanding the question than the title. Does anyone have an idea about how to make a filter to trim out useless bits of a que...
I am looking to replace from a large document all high unicode characters, such as accented Es, left and right quotes, etc., with "normal" counterparts in the low range, such as a regular 'E', and straight quotes. I need to perform this on a very large document rather often. I see an example of this in what I think might be perl here: ht...
I have a hierarchy of directories containing many text files. I would like to search for a particular text string every time it comes up in one of the files, and replace it with another string. For example, I may want to replace every occurrence of the string "Coke" with "Pepsi". Does anyone know how to do this? I am wondering if there i...
hello , i am new to python , and I want to extract the data from this format
FBpp0143497 5 151 5 157 PF00339.22 Arrestin_N Domain 1 135 149 83.4 1.1e-23 1 CL0135
FBpp0143497 183 323 183 324 PF02752.15 Arrestin_C Domain 1 137 138 58.5 6e-16 1 CL0135
FBpp0131987 60 280 51 280 PF00089.19 Trypsin Domain 14 219 219 127.7 3.7e-37 1 CL0124
t...
I need to search for lines in a CSV file that end in an unterminated, double-quoted string.
For example:
1,2,a,b,"dog","rabbit
would match whereas
1,2,a,b,"dog","rabbit","cat bird"
1,2,a,b,"dog",rabbit
would not.
I have very limited experience with regular expressions, and the only thing I could think of is something like
"[^"]*...
I'm looking a program to recursively inline all \input{} commands in a LaTeX file. By "recursively", I mean doing the inlining iteratively until no \input{} command remains in the final LaTeX file.
I've already come across the flatten package. But, for some reason, my TeXLive distribution doesn't install it. When I execute the command s...
I'd like to read an advertisement for the job through my program. Initially i am working on the templates provided by the microsoft word as "Job Description".
Basically I have to extract the requirements of jobs like required education, skills or any development tools etc. I'd store these requirements in the database and then further us...
For example, given a string:
"Bob went fishing with his friend Jim Smith."
Bob and Jim Smith are both names, but bob and smith are both words. Weren't for them being uppercase, there would be less indication of this outside of our knowledge of the sentence. Are there any well known algorithms for detecting the presence of names, at lea...