text-processing

Folder searching algorithm

Not sure if this is the usual sort of question that gets asked around here, or if I'll get any answers to this one, but I'm looking for a pseudo-code approach to generating DB linking records from a folder structure containing image files. I have a set of folders, structured as follows: +-make_1/ | +--model_1/ | +-default_versi...

Text manipulation while keeping original position offsets.

Hi. I need to manipulate large strings in Java (deleting and adding the deleted chars again, moving chars around), but still want to remember the original position offsets. E.g. the word "computer" starts at offset 133 in the original text and is then moved to position 244, I still want the info that it was originally at position 133. T...

How to extract a single function from a source file

Hi, I'm working on a small academic research project about extremely long and complicated functions in the Linux kernel. I'm trying to figure out if there is a good reason to write functions that are 600 or 800 lines long. For that purpose, I would like to find a tool that can extract a function from a .c file, so I can run some automated tests on the ...

How was the Google Books' Popular passages feature developed?

I'm curious if anyone understands, knows or can point me to comprehensive literature or source code on how Google created their popular passage blocks feature. However, if you know of any other application that can do the same, please post your answer too. If you do not know what I am writing about, here is a link to an example of Popular...

How to move part of file to its end

Hi, rpm automatically places a newly installed kernel as the first option. However, I want to move it to the last position, at the end of the file. The GRUB configuration file looks like this: default=0 timeout=5 splashimage=(hd0,0)/grub/splash.xpm.gz hiddenmenu title Fedora (2.6.29.6-217.2.7.fc11.x86_64) root (hd0,0) kernel /vmlinuz-2.6.29....

Finding dictionary words

I have a lot of compound strings that are a combination of two or three English words, e.g. "Spicejet" is a combination of the words "spice" and "jet". I need to separate these individual English words from such compound strings. My dictionary is going to consist of around 100000 words. What would be the most efficient by which I...
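
One possible way to approach the question above, sketched in Python: keep the dictionary in a set for constant-time lookups and try every prefix recursively, caching intermediate results. The function name `split_compound`, the `max_parts` limit of three (taken from "two or three English words"), and the tiny sample dictionary are assumptions made for illustration only.

```python
from functools import lru_cache


def split_compound(compound, words, max_parts=3):
    """Return every split of `compound` into at most `max_parts` dictionary words."""
    text = compound.lower()

    @lru_cache(maxsize=None)
    def splits(start, parts_left):
        # Reached the end of the string: one valid (empty) continuation.
        if start == len(text):
            return [[]]
        # Ran out of allowed parts before consuming the whole string.
        if parts_left == 0:
            return []
        results = []
        for end in range(start + 1, len(text) + 1):
            piece = text[start:end]
            if piece in words:  # set membership test
                for rest in splits(end, parts_left - 1):
                    results.append([piece] + rest)
        return results

    return splits(0, max_parts)


# Hypothetical sample dictionary; a real one would hold ~100000 words.
sample_words = {"spice", "jet"}
print(split_compound("Spicejet", sample_words))
```

With a larger dictionary the function may return several candidate splits, so some ranking step (for example, preferring fewer or longer words) would still be needed.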

Natural language processing / text structure analysis starting point

I need to parse & process a big set of semi-structured text (basically, legal documents - law texts, addendums to them, treaties, judge's decisions, ...). The most fundamental thing I'm trying to do is extract information on how subparts are structured - chapters, articles, subheadings, ... plus some metadata. My question is if anyone ca...

Using Awk to process a file where each record has different fixed-width fields.

I have some data files from a legacy system that I would like to process using Awk. Each file consists of a list of records. There are several different record types and each record type has a different set of fixed-width fields (there is no field separator character). The first two characters of the record indicate the type, from thi...

Vim: how to count the number of times a word appears in a file or in some range

Sometimes, I want to see how many times a certain function is called in a file or a code block. How do you do that? I am using vi 7.2. I presume you have to use !wc or some such. Thanks ...

How can I remove all non-word characters except the newline?

I have a file like this: my line - some words & text oh lóok i've got some characters I want to 'normalize' it and remove all the non-word characters. I want to end up with something like this: mylinesomewordstext ohlóokivegotsomecharacters I'm using Linux on the command line at the moment, and I'm hoping there's some one-liner I c...
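
A minimal Python sketch of one way to do this, rather than the command-line one-liner the question asks for: a single regular expression that deletes every character that is neither a word character nor a newline. The helper name `strip_non_word` is hypothetical. Note that `\w` on Python 3 strings is Unicode-aware, so accented letters such as "ó" are kept, and that it also keeps digits and underscores.

```python
import re


def strip_non_word(text):
    """Remove every character that is neither a word character nor a newline."""
    return re.sub(r"[^\w\n]", "", text)


sample = "my line - some words & text\noh lóok i've got some characters"
print(strip_non_word(sample))
```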

Text Processing with Program Instead of Perl

I have a .plist file that looks like this: <plist version="1.0"> <array> <dict> <key>name</key> <string>Alabama</string> <key>abreviation</key> <string>AL</string> <key>date</key> <string>1819</string> <key>population</key> <string>4,627,851</string> <key>capital</key> <string>Montgomery</string> ...

Classifying Text Based on Groups of Keywords?

I have a list of requirements for a software project, assembled from the remains of its predecessor. Each requirement should map to one or more categories. Each of the categories consists of a group of keywords. What I'm trying to do is find an algorithm that would give me a score ranking which of the categories each requirement is likel...
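
As a rough starting point for this kind of scoring, the sketch below counts how many of each category's keywords occur in a requirement and ranks the categories by that count. It is a simple keyword-overlap baseline, not a tuned classifier; the function name `rank_categories` and the sample categories are made up for illustration.

```python
import re


def rank_categories(requirement, categories):
    """Rank categories by how many of their keywords appear in the requirement."""
    tokens = set(re.findall(r"\w+", requirement.lower()))
    scores = {
        name: len(tokens & {keyword.lower() for keyword in keywords})
        for name, keywords in categories.items()
    }
    # Highest score first; ties broken alphabetically by category name.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


# Hypothetical categories and keywords.
sample_categories = {
    "security": ["password", "login", "encrypt"],
    "reporting": ["report", "export", "chart"],
}
print(rank_categories("Users must login with a password", sample_categories))
```

Exact token matching misses inflected forms ("reports", "encrypted"), so stemming or weighting keywords by rarity would be natural next steps.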

Put bar at the end of every line that includes foo

Hi, I have a list with a large number of lines, each taking the subject-verb-object form, e.g.: Jane likes Fred Chris dislikes Joe Nate knows Jill To plot a network graph that expresses the different relationships between the nodes in directed color-coded edges, I will need to replace the verb with an arrow and place a color code at ...

How to get similar texts from a lot of pages?

I want to get the x texts most similar to a given text from a large collection of texts. Converting the pages to plain text first may be better. The text should not be compared against every other text, because that is too slow. ...

Algorithms to detect phrases and keywords from text

I have around 100 megabytes of text, without any markup, divided into approximately 10,000 entries. I would like to automatically generate a 'tag' list. The problem is that there are word groups (i.e. phrases) that only make sense when they are grouped together. If I just count the words, I get a large number of really common words (is, t...

How to read, in a line, all characters from column A to B

Hi, is it possible in Python, given a file with 10000 lines, where all of them have this structure: 1, 2, xvfrt ert5a fsfs4 df f fdfd56 , 234 or similar, to read the whole string, and then to store in another string all characters from column 7 to column 17, including spaces, so the new string would be "xvfrt ert5a" ? Thanks a...
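
A short sketch of how this could look with plain string slicing, assuming the columns in the question are 1-based and inclusive (which is what the example "xvfrt ert5a" implies for columns 7 to 17). The helper name `slice_columns` is hypothetical.

```python
def slice_columns(line, first, last):
    """Return the characters from 1-based column `first` to `last`, inclusive."""
    # Python slices are 0-based and exclude the end index,
    # so column 7..17 becomes line[6:17].
    return line[first - 1:last]


sample_line = "1, 2, xvfrt ert5a fsfs4 df f fdfd56 , 234"
print(repr(slice_columns(sample_line, 7, 17)))
```

For a whole file, the same call can be applied to each line while iterating over the open file object, which avoids loading all 10000 lines at once.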

How can I delete all lines that do not begin with certain characters?

I need to figure out a regular expression to delete all lines that do not begin with either "+" or "-". I want to print a paper copy of a large diff file, but it shows 5 or so lines before and after the actual diff. ...
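
Instead of a delete-by-regex, the same effect can be had by keeping only the lines that do start with "+" or "-". The Python sketch below shows that approach; the function name `keep_changed_lines` is an assumption for illustration.

```python
def keep_changed_lines(lines):
    """Yield only the lines that begin with '+' or '-'."""
    for line in lines:
        # str.startswith accepts a tuple of alternative prefixes.
        if line.startswith(("+", "-")):
            yield line


sample_diff = [
    " unchanged context\n",
    "-old line\n",
    "+new line\n",
    "@@ -1,3 +1,3 @@\n",
]
print("".join(keep_changed_lines(sample_diff)), end="")
```

In a unified diff the file header lines (`---` and `+++`) also begin with these characters, so they survive the filter as well.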

What is the best language to process ebooks in different formats?

I have a collection of ebooks in different formats (e.g. PDF, LIT, CHM, and others). I would like to extract the first page of each book and have it in plain text. What would be the best language to do so? A portable language between Linux and XP would be a big plus. My prime candidates at the moment are Java and Ruby, mostly be...

Getting word count for all files within a folder

I need to find word count for all of the files within a folder. Here is the code I've come up with so far: $f="../mts/sites/default/files/test.doc"; // count words $numWords = str_word_count($str)/11; echo "This file have ". $numWords . " words"; This will count the words within a single file, how would I go about counting the words...

How can I filter a large file into two separate files?

I've got a huge file (500 MB) that is organized like this: <link type="1-1" xtargets="1;1"> <s1>bunch of text here</s1> <s2>some more here</s2> </link> <link type="1-1" xtargets="1;1"> <s1>bunch of text here</s1> <s2>some more here</s2> </link> <link type="1-1" xtargets="1;1"> <s1>bunch of text here</s1> <s2>some...