text-processing

generic text reading ....

hey guys.. I am working on a project where I need to read some generic text...I am looking for any api by I can read generic text and also can convert it to .csv file... Can any one plz help... using java on windows os... --------------------------MORE Detail------------------------------------------------------------------------------...

Which OSS can extract a synopsis from a text?

Is there an OSS which can compress a text to a synopsis? My goal is to build an editor for SciFi novels which can either automatically create a synopsizes for chapters or at least make a suggestion for one. ...

Ruby: How do I selectively remove line wrapping from a text file?

I want to edit the following text so that every line begins with Dealer:. This means no wrapping/line breaks. For lines starting with System, wrapping is fine. What would a solution in ruby look like? Thanks This is located in a .txt file Dealer: 5 seconds left to act Dealer: hitman2714 wins the pot (9) Dealer: Hand #1684326626D Deale...

Extract key sentences from a text

Hi, do you know about an effective method for extracting key sentences from a text with their frequency parameters, etc and that can also do "stemming" (search also for similar sentences) ? I wonder also if there is some software implementation Thanks a lot ...

auto-tokenize user agents strings for statistics?

We keep track of user agent strings in our website. I want to do some statistics on them, to see how many IE6 users we have ( so we know what we have to develop against), and also how many mobile users we have. So we have log entires like this: Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; FunWebProducts) Mozilla/4.0 (compatible; ...

Unicode Strings in Ruby 1.9

I've written a Ruby script that is reading a file (File.read()) that contains unicode characters, and it works fine from the command line. However, when I try to put it into an Automator Workflow (Mac OS X), I get this error; 2009-12-23 17:55:15 -0500: /Users/jeffreyaylesworth/bin/symbols:19:in `split': invalid byte sequence in US-ASCI...

term clustering library?

Hi, Does anybody know an open-source\free library that does term clustering? Thanks, yaniv ...

delete a line based on logic

Hi everyone, I have a file where i have multiple records with such data F00DY4302B8JRQ rank=0000030 x=800.0 y=1412.0 length=89 now i want to search for the line where if i find length<=50 then delete this line and the next line in the file and write to another file. Thanks everyone ...

Split a text file in PHP

How can I split a large text file into separate files by character count using PHP? So a 10,000 character file split every 1000 characters would be split into 10 files. Further, can I split only after a full stop is found? Thanks. UPDATE 1: I like zombats code and I removed some errors and have come up with the following, but does anyo...

Output text file with line breaks in PHP

I'm trying to open a text file and output its contents with the code below. The text file includes line breaks but when I echo the file its unformatted. How do I fix this? Thanks. <html> <head> </head> <body> $fh = fopen("filename.txt", 'r'); $pageText = fread($fh, 25000); echo $pageText; </body> </html...

Optimize a list of text additions and deletions

Hello, I've got a list containing positions of text additions and deletions, like this: Type Position Text/Length 1. + 2 ab // 'ab' was added at position 2 2. + 1 cde // 'cde' was added at position 1 3. - 4 1 // a character was deleted at position 4 T...

Modifying Text Column Wise with Sed/AWK

I have an input data with three columns (tab separated) like this: a mrna_185598_SGL 463 b mrna_9210_DLT 463 c mrna_9210_IND 463 d mrna_9210_INS 463 e mrna_9210_SGL 463 How can I use sed/awk to modify it into four columns data that looks like this: a mrna_185598 SGL 463 b mrna_9210 DLT 463 c mrna_9210 ...

MySQL truncation command - unicode characters

Hi all, I am currently trying to adjust the values stored in a table in MySQL. The values stored contain a series of Unicode characters. I need to truncate to 40 bytes worth of storage, but when I try: UPDATE `MYTABLE` SET `MYCOLUMN` = LEFT(`MYCOLUMN`, 40); MySQL is overly helpful and retains 40 characters, rather than 40 bytes. Is t...

Replacing all GUIDs in a file with new GUIDs from the command line

I have a file containing a large number of occurrences of the string Guid="GUID HERE" (where GUID HERE is a unique GUID at each occurrence) and I want to replace every existing GUID with a new unique GUID. This is on a Windows development machine, so I can generate unique GUIDs with uuidgen.exe (which produces a GUID on stdout every tim...

Sed or Ruby for mundane text processing / dev-productivity tools creation.

I want to learn Sed. Can you please point me to good references so that I can fully utilize it. I want to learn it to perform more of the do-once-then-forget type administrative or dev-tools like tasks. So, I don't really care about performance or modularity or object orientedness etc when writing this type of code. Do you think it wou...

What is the best way to select a text portion to cut based on keywords?

When you search something in Stackoverflow it cuts the portion of the question description that best matches your criteria and after that it marks the criteria words. I wonder the best way to do this manually in C#, meaning without the help of a full-text search engine. The main problem is how to select the best text portion in a fast ...

processing text from a non-flat file (to extract information as if it *were* a flat file)

Hi, I have a longitudinal data set generated by a computer simulation that can be represented by the following tables ('var' are variables): time subject var1 var2 var3 t1 subjectA ... t2 subjectB ... and subject name subjectA nameA subjectB nameB However, the file generated writes a data file in a format similar to the f...

How can I sum values in column based on the value in another column?

I have a text file which is: ABC 50 DEF 70 XYZ 20 DEF 100 MNP 60 ABC 30 I want an output which sums up individual values and shows them as a result. For example, total of all ABC values in the file are (50 + 30 = 80) and DEF is (100 + 70 = 170). So the output should sum up all unique 1st column names as - ABC 80 DEF 170 XYZ 20 MNP 60...

Sed script to edit csv file Or Python

In our project we need to import the csv file to postgres. There are multiple types of files meaning the length of the file changes as some files are with fewer columns and some with all of them. We need a fast way to import this file to postgres. I want to use COPY FROM of the postgres since the speed requirement of the processing are ...

sed: toupper a character on a location

This sed "s/public \(.*\) get\(.*\)()/\1 \2/g" will transfor this public class ChallengeTO extends AbstractTransferObject { public AuthAlgorithm getAlgorithm(); public long getCode(); public int getIteration(); public String getPublicKey(); public String getSelt(); }; into this public class ...