text-processing

Is there a Python module for regex matching in zip files?

I have over a million text files compressed into 40 zip files. I also have a list of about 500 phone model names. I want to find out how many times a particular model is mentioned in the text files. Is there a Python module that can run a regex match on the files without unzipping them? Is there a simple way to solve this pro...
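One possible approach, sketched under assumptions: the standard-library zipfile module can read each archive member into memory, so a regex can be run over the text without extracting anything to disk. The function name count_mentions, the UTF-8 encoding, and the case-insensitive matching are illustrative choices, not part of the question.

```python
import re
import zipfile
from collections import Counter


def count_mentions(zip_paths, model_names):
    """Count mentions of each model name in the text files inside zip archives.

    Hypothetical helper: assumes members are UTF-8 text and that matching
    should ignore case.
    """
    # Longer names first, so "X10 Pro" is preferred over "X10" at the same position.
    names = sorted(model_names, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(n) for n in names), re.IGNORECASE)
    # Map lower-cased matches back to the spelling used in the input list.
    canonical = {n.lower(): n for n in model_names}

    counts = Counter()
    for path in zip_paths:
        with zipfile.ZipFile(path) as archive:
            for info in archive.infolist():
                if info.is_dir():
                    continue
                # The member is decompressed in memory; nothing is written to disk.
                text = archive.read(info).decode("utf-8", errors="replace")
                for match in pattern.finditer(text):
                    counts[canonical[match.group(0).lower()]] += 1
    return counts
```

Compiling the 500 names into a single alternation means each file is scanned once rather than 500 times. With a million files, reading one member at a time also keeps memory use bounded by the largest single file.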

Can you really build a fast word processor with GoF Design Patterns?

The Gang of Four's Design Patterns uses a word processor as an example for at least a few of their patterns, particularly Composite and Flyweight. Other than by using C or C++, could you really use those patterns and the object-oriented overhead they entail to write a high-performing fully featured word processor? I know that Eclipse ...

How can I extract a range of lines from a text file on unix?

I have a ~23000 line SQL dump containing several databases' worth of data. I need to extract a certain section of this file (i.e. the data for a single database) and place it in a new file. I know both the start and end line numbers of the data that I want. Does anyone know a unix command (or series of commands) to extract all lines from...
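The question asks for a unix command; as an alternative in Python, a minimal sketch using itertools.islice copies a known line range without loading the whole dump. The name extract_lines and the 1-based inclusive convention are assumptions made for illustration.

```python
from itertools import islice


def extract_lines(src_path, dst_path, start, end):
    """Copy lines start..end (1-based, inclusive) from src_path to dst_path."""
    with open(src_path, encoding="utf-8") as src, \
            open(dst_path, "w", encoding="utf-8") as dst:
        # islice uses 0-based, end-exclusive indices, hence start - 1 and end.
        dst.writelines(islice(src, start - 1, end))
```

Because islice stops at the end line, the remainder of the file is never read.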

Is there still any reason to learn AWK?

I am constantly learning new tools, even old-fashioned ones, because I like to use the right solution for the problem. Nevertheless, I wonder if there is still any reason to learn some of them. AWK, for example, is interesting to me, but for simple text processing I can use grep / cut / sed / whatever, while for complex ones, I'll go fo...

Algorithm to estimate number of English translation words from Japanese source

I'm trying to come up with a way to estimate the number of English words a translation from Japanese will turn into. Japanese has three main scripts -- Kanji, Hiragana, and Katakana -- and each has a different average character-to-word ratio (Kanji being the lowest, Katakana the highest). Examples: computer: コンピュータ (Katakana - 6 chara...

How to use sed to replace only the first occurrence in a file?

I want to update a large number of C++ source files with an extra include directive before any existing #includes. For this sort of task I normally use a small bash script with sed to re-write the file. How do I get sed to replace just the first occurrence of a string in a file rather than replacing every occurrence? If I use se...

Skip file lines until a match is found, then output the rest.

I can write a trivial script to do this, but in my ongoing quest to get more familiar with unix I'd like to learn efficient methods using built-in commands instead. I need to deal with very large files that have a variable number of header lines. The last header line consists of the text 'LastHeaderLine'. I wish to output everything aft...
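The question is about built-in unix commands; for comparison, here is a small Python sketch of the same idea using itertools.dropwhile. It assumes the marker line itself should be dropped and that output starts on the following line; lines_after_header is a hypothetical name.

```python
from itertools import dropwhile


def lines_after_header(lines, marker="LastHeaderLine"):
    """Yield the lines that follow the first line containing marker."""
    # Skip every line until one contains the marker.
    remaining = dropwhile(lambda line: marker not in line, lines)
    # Discard the marker line itself; the default avoids StopIteration
    # when the marker never appears.
    next(remaining, None)
    return remaining
```

Passing an open file object as lines keeps this streaming, which matters for very large files: only one line is held in memory at a time.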

Reading text values into matlab variables from ASCII files

Consider the following file var1 var2 variable3 1 2 3 11 22 33 I would like to load the numbers into a matrix, and the column titles into a variable that would be equivalent to: variable_names = char('var1', 'var2', 'variable3'); I don't mind to split the names and the numbers in two files, however preparing matlab code...

Count Duplicate URLs, fastest method possible

Hi Guys, I'm still working with this huge list of URLs, all the help I have received has been great. At the moment I have the list looking like this (17000 URLs though): http://www.domain.com/page?CONTENT_ITEM_ID=1 http://www.domain.com/page?CONTENT_ITEM_ID=3 http://www.domain.com/page?CONTENT_ITEM_ID=2 http://www.domain.com/page?CONT...
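The excerpt is cut off, so the exact output wanted is unknown. Assuming the goal is simply to count how often each URL appears in the list, a sketch with collections.Counter might look like this (duplicate_counts is an illustrative name):

```python
from collections import Counter


def duplicate_counts(urls):
    """Return {url: count} for every URL that appears more than once."""
    # Strip whitespace and ignore blank lines before counting.
    counts = Counter(url.strip() for url in urls if url.strip())
    return {url: n for url, n in counts.items() if n > 1}
```

Counter does a single pass with hash lookups, so a list of 17000 URLs is processed almost instantly.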

Automatic spell checking of words in a text

[EDIT] In short: How would you write an automatic spell checker? The idea is that the checker builds a list of words from a known good source (a dictionary) and automatically adds new words when they are used often enough. Words which haven't been used in a while should be phased out. So if I delete part of a scene which contains "Mungrohype...

How to replace ${} variables in a *nix text file

I want to pipe the output of a "template" file into mysql, the file having variables like ${dbName} interspersed. What is the command-line utility to replace these instances and dump the output to stdout? ...
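The question asks for a command-line utility; as one alternative, Python's string.Template happens to use the same ${name} placeholder syntax. The sketch below is a hypothetical filter, and the variable names and values are assumptions for illustration.

```python
from string import Template


def render(template_text, variables):
    """Replace ${name} placeholders in template_text with values from variables."""
    # safe_substitute leaves unknown placeholders untouched instead of raising.
    return Template(template_text).safe_substitute(variables)


if __name__ == "__main__":
    import sys
    # Assumed usage: template on stdin, result on stdout, values hard-coded here.
    sys.stdout.write(render(sys.stdin.read(), {"dbName": "example_db"}))
```

Using substitute instead of safe_substitute would make a missing variable an error, which may be preferable before piping SQL into a live database.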

How do I identify language of a text document in Java?

Is there an existing Java library that could tell me whether a String contains English language text or not (e.g. I need to be able to distinguish French or Italian text -- the function needs to return false for French and Italian, and true for English)? ...

Extracting info from large structured text files

I need to read some large files (from 50k to 100k lines), structured in groups separated by empty lines. Each group starts with the same pattern "No.999999999 dd/mm/yyyy ZZZ". Here's some sample data. No.813829461 16/09/1987 270 Tit.SUZANO PAPEL E CELULOSE S.A. (BR/BA) C.N.P.J./C.I.C./N INPI : 16404287000155 Procurador: MARCEL...

"Absolute" string metric

I have a huge (but finite) set of natural language strings. I need a way to convert each string to a numeric value. For any given string the value must be the same every time. The more "different" two given strings are, the more different two corresponding values should be. The more "similar" they are, the less different values should ...

Delete Chars in Python

Does anybody know how to delete all characters after a specific character? Like this: http://google.com/translate_t into http://google.com ...
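Two hedged sketches of how this might be done. For the URL in the example, cutting at the first "/" would stop inside "http://", so parsing the URL is safer; for a generic string, str.partition cuts at the first occurrence of a separator. Both function names are illustrative.

```python
from urllib.parse import urlsplit


def strip_path(url):
    """Keep only the scheme and host of a URL, dropping path, query and fragment."""
    parts = urlsplit(url)
    return f"{parts.scheme}://{parts.netloc}"


def cut_after(text, sep):
    """Return the part of text before the first occurrence of sep."""
    # partition returns the whole string as head when sep is absent.
    head, _, _ = text.partition(sep)
    return head
```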

Swap key and array value pair

I have a text file laid out like this: 1 a, b, c 2 c, b, c 2.5 a, c I would like to reverse the keys (the number) and values (CSV) (they are separated by a tab character) to produce this: a 1, 2.5 b 1, 2 c 1, 2, 2.5 (Notice how 2 isn't duplicated for c.) I do not need this exact output. The numbers in the input are ord...
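A minimal Python sketch of the inversion described, assuming each line is a key, a tab, then comma-separated values. The name invert is hypothetical, and keys are kept as strings in the order they are first seen.

```python
def invert(lines):
    """Map each value to the list of keys it appears under, without duplicates."""
    inverted = {}
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        # The key and the CSV values are separated by a tab.
        key, _, values = line.partition("\t")
        for value in values.split(","):
            value = value.strip()
            if not value:
                continue
            keys = inverted.setdefault(value, [])
            # A value repeated on one line (c, b, c) must add its key only once.
            if key not in keys:
                keys.append(key)
    return inverted
```

Each entry can then be printed with something like `print(value + "\t" + ", ".join(keys))` to get the layout shown in the question.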

Perl: Looping over input lines with an index-based approach

This is a beginner best-practice question in Perl. I'm new to this language. The question is: If I want to process the output lines from a program, how can I format THE FIRST LINE in a special way? I can think of two possibilities: 1) A flag variable that is set once the loop has executed the first time. But it will be evaluated for each cycle. BA...

Is there a tool to clean the output of the script(1) tool?

script(1) is a tool for keeping a record of an interactive terminal session; by default it writes to the file transcript. My problem is that I use ksh93, which has readline features, and so the transcript is mucked up with all sorts of terminal escape sequences and it can be very difficult to reconstruct the command that was actually e...

String Problem in C++

I have a problem with string manipulation in C++. The rule: if the same 'word' is repeated in a sentence or paragraph, I want it to become an integer. Please help! Example input: we prefer questions that can be answered, not just we discussed that. Output: 1 prefer questions 2 can be answered, not just 1 discussed 2. 1 we 2 th...

Awk/etc.: Extract Matches from File

I have an HTML file and would like to extract the text between <li> and </li> tags. There are of course a million ways to do this, but I figured it would be useful to get more into the habit of doing this in simple shell commands: awk '/<li[^>]+><a[^>]+>([^>]+)<\/a>/m' cities.html The problem is, this prints everything whereas I simpl...