text-parsing

Elegant structured text file parsing

I need to parse a transcript of a live chat conversation. My first thought on seeing the file was to throw regular expressions at the problem, but I was wondering what other approaches people have used. I put "elegant" in the title as I've previously found that this type of task has a danger of getting hard to maintain just relying on reg...

Split string containing command-line parameters into string[] in C#

I have a single string that contains the command-line parameters to be passed to another executable, and I need to extract the string[] containing the individual parameters in the same way that C# would if the commands had been specified on the command line. The string[] will be used when executing another assembly's entry point via refle...

.NET 2.0 - Tokenizing space separated text

Suppose you have output like this: Word1 Word2 Word3 Word4 Where the number of spaces between words is arbitrary. I want to break it into an array of words. I used the following code: string[] tokens = new List<String>(input.Split(' ')) .FindAll ( delegate(string t...

Parsing a File Format

I'm working with QuickBooks' IIF file format and I need to write a parser to read and write IIF files, but I'm running into some issues reading the files. The files are simple: they're tab-delimited. Every line is either a table definition or a row. Definitions begin with '!' and the table name, and rows begin with just the table name. ...

Parsing Ambiguous Dates (Language Independent)

I am curious what would be the best way to handle an ambiguous date string in any given language. When pre-validating your user input isn't an option, how should MM/dd/YYYY dates be parsed? How would you parse the following ambiguous date and for what reason (statistical, cultural, etc)? '1111900' as Jan 11, 1900 [M/dd/YYYY] or Nov 1,...
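One way to make the ambiguity concrete is to enumerate every calendar-valid reading of the digit string and let a separate policy (cultural default, statistics, asking the user) choose among them. The sketch below is only an illustration, assuming an all-digit string whose last four digits are the year and whose month and day are each one or two digits; `candidate_dates` is a hypothetical helper name, not something from the question.

```python
from datetime import date


def candidate_dates(digits):
    """Return every valid date reading of an M[M]d[d]YYYY digit string."""
    if not digits.isdecimal() or not 6 <= len(digits) <= 8:
        raise ValueError(f"expected 6 to 8 digits, got {digits!r}")
    year = int(digits[-4:])   # assumption: the year is always the last 4 digits
    prefix = digits[:-4]      # remaining 2 to 4 digits hold month and day
    results = []
    for month_len in (1, 2):
        month_part, day_part = prefix[:month_len], prefix[month_len:]
        if not 1 <= len(day_part) <= 2:
            continue          # this split leaves an impossible day width
        try:
            results.append(date(year, int(month_part), int(day_part)))
        except ValueError:
            continue          # e.g. month 13 or day 32: not a real date
    return results
```

For '1111900' this yields two candidates, one per split of the leading three digits, so the caller still has to decide which reading to prefer.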

What's the best way (error-proof / foolproof) to parse a file using Python with the following format?

######################################## # some comment # other comment ######################################## block1 { value=data some_value=some other kind of data othervalue=032423432 } block2 { value=data some_value=some other kind of data othervalue=032423432 } ...
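A small line-oriented state machine is one possible approach for a format like this. The sketch below assumes that each `name {`, each `key=value` pair, and each closing `}` sits on its own line, that lines starting with `#` are comments, and that blocks do not nest; none of that is confirmed by the excerpt, and `parse_blocks` is a hypothetical name.

```python
def parse_blocks(text):
    """Parse 'name { key=value ... }' blocks into a dict of dicts."""
    blocks = {}
    current = None  # name of the block being filled, or None outside a block
    for lineno, raw in enumerate(text.splitlines(), start=1):
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        if line.endswith("{"):
            if current is not None:
                raise ValueError(f"line {lineno}: nested block")
            current = line[:-1].strip()
            if not current:
                raise ValueError(f"line {lineno}: block has no name")
            blocks[current] = {}
        elif line == "}":
            if current is None:
                raise ValueError(f"line {lineno}: unmatched '}}'")
            current = None
        else:
            if current is None or "=" not in line:
                raise ValueError(f"line {lineno}: unexpected {line!r}")
            # Split on the first '=' only, so values may contain '='.
            key, value = line.split("=", 1)
            blocks[current][key.strip()] = value.strip()
    if current is not None:
        raise ValueError(f"unterminated block {current!r}")
    return blocks
```

Values are kept as strings, so something like `032423432` keeps its leading zero. Raising on anything unexpected, with the line number in the message, is what makes this reasonably error-proof: malformed input fails loudly instead of being silently skipped.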

Textual Irregularities

Does anybody know of a library or piece of software out there that will locate irregularities in text? For example, let's say I have... 1. Name 1, Comment 2. Name 2, Comment 3. Name 3 , Comment 5. Name 10, Comment This software or library would first cut up portions of text that it would find similar (much like a piece of compression...

Please help me create a regular expression to parse my SQL statement

I want to extract FROM codes WHERE FieldName='ContactMethod' and IsNull(Deactived,'') != 'T' from SELECT FieldDescription,FieldValue FROM codes WHERE FieldName='ContactMethod' and IsNull(Deactived,'') != 'T' order by fielddescription using a regular expression. I have a regex like this: \FROM.*\order which extracts FROM cod...

Python parsing

I'm trying to parse the title tag in an RSS 2.0 feed into three different variables for each entry in that feed. Using ElementTree I've already parsed the RSS so that I can print each title [minus the trailing )] with the code below: feed = getfeed("http://www.tourfilter.com/dallas/rss/by_concert_date") for item in feed: print rep...

Parsing a string in C++

I have a huge set of log lines and I need to parse each line (so efficiency is very important). Each log line is of the form cust_name time_start time_end (IP or URL )* So customer name, time, time and a possibly empty list of IP addresses or URLs separated by semicolons. If there is only one IP or URL in the last list there is no separato...

Can I transpose a file in vim?

I know I can use awk, but I am on a Windows box and I am making a function for others who may not have awk. I also know I can write a C program, but I would love not to have to create, maintain, and compile something for a little Vim utility I am making. The original file might be THE DAY WAS LONG THE WAY WAS FAST and it would become TT H...

How Do I Tokenize This String in Ruby?

I have this string: %{Children^10 Health "sanitation management"^5} And I want to tokenize it into an array of hashes: [{:keywords=>"children", :boost=>10}, {:keywords=>"health", :boost=>nil}, {:keywords=>"sanitation management", :boost=>5}] I'm aware of StringScanner and the Syntax gem (http://syntax.rubyforge.org/) ...
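A single regular expression with one alternative for quoted phrases and one for bare words may be enough here. The following is a Python sketch rather than Ruby; it assumes the input is the text inside the `%{...}` literal, that boosts are non-negative integers written as `^N`, and that keywords should be lowercased as in the expected output. `tokenize` and `_TOKEN` are hypothetical names.

```python
import re

# Either a "quoted phrase" or a bare word, optionally followed by ^<digits>.
_TOKEN = re.compile(r'(?:"([^"]*)"|([^\s"^]+))(?:\^(\d+))?')


def tokenize(query):
    """Split a query into dicts with 'keywords' and an optional 'boost'."""
    tokens = []
    for quoted, bare, boost in _TOKEN.findall(query):
        tokens.append({
            "keywords": (quoted or bare).lower(),
            "boost": int(boost) if boost else None,
        })
    return tokens
```

The same pattern should carry over to Ruby's `String#scan` with little change, since it uses only basic groups and character classes.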

Regex and the "war" on XSS

I've always been interested in writing web software like forums or blogs, things which take a limited markup and rewrite it into HTML. But lately, I've noticed more and more that for PHP (try googling "PHP BBCode parser -PEAR" and test a few out), you either get an inefficient mess, or you get poor code with XSS holes here and there. Taking...

String parsing, extracting numbers and letters

What's the easiest way to parse a string and extract a number and a letter? I have a string that can be in either of the following formats (number|letter or letter|number), e.g. "10A", "B5", "C10", "1G", etc. I need to extract the 2 parts, e.g. "10A" -> "10" and "A". Update: Thanks to everyone for all the excellent answers ...
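A regular expression with one alternative per ordering handles both layouts in a single match. This is a minimal Python sketch, assuming exactly one ASCII letter and one run of ASCII digits with nothing else in the string; `split_number_letter` is a hypothetical name.

```python
import re

# number-then-letter, or letter-then-number
_PATTERN = re.compile(r"([0-9]+)([A-Za-z])|([A-Za-z])([0-9]+)")


def split_number_letter(s):
    """Return (number, letter) as strings for inputs like '10A' or 'B5'."""
    m = _PATTERN.fullmatch(s.strip())
    if m is None:
        raise ValueError(f"not in number|letter or letter|number form: {s!r}")
    number = m.group(1) or m.group(4)
    letter = m.group(2) or m.group(3)
    return number, letter
```

The number is returned as a string to match the "10A" -> "10" and "A" example; wrap it in `int()` if a numeric value is needed.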

How should I detect which delimiter is used in a text file?

I need to be able to parse both CSV and TSV files. I can't rely on the users to know the difference, so I would like to avoid asking the user to select the type. Is there a simple way to detect which delimiter is in use? One way would be to read in every line and count both tabs and commas and find out which is most consistently used in...
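The counting idea described above can be sketched as follows: for each candidate delimiter, count its occurrences on every non-blank line and score it by how many lines share the most common non-zero count. This is a rough heuristic under stated assumptions (it ignores quoting, so a comma inside a quoted field can skew the counts), and `detect_delimiter` is a hypothetical name.

```python
from collections import Counter


def detect_delimiter(lines, candidates=(",", "\t")):
    """Guess the delimiter whose per-line count is most consistent."""
    best, best_score = None, 0.0
    for delim in candidates:
        counts = [line.count(delim) for line in lines if line.strip()]
        if not counts:
            continue
        typical_count, frequency = Counter(counts).most_common(1)[0]
        if typical_count == 0:
            continue  # most lines do not contain this delimiter at all
        score = frequency / len(counts)  # share of lines that agree
        if score > best_score:           # ties keep the earlier candidate
            best, best_score = delim, score
    return best
```

Sampling only the first few hundred lines is usually enough and avoids reading a large file twice. Python's standard library also offers `csv.Sniffer().sniff(sample, delimiters=",\t")`, which takes quoting into account and may be the better choice when it fits.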

PDF Text Extraction Approach Using OCR

Has anybody attempted to extract text from a PDF using an OCR library and Java? What did you find to be the most reliable library for text extraction? Most of the approaches I've seen (Tesseract, GOCR) are C libraries that would require some JNI code to be written. I'm familiar with PDFBox, which is now an Apache incubator project ...

Parsing a text file in C# while skipping some contents

Hi, I'm trying to parse a text file that has a heading and a body. In the heading of this file, there are line number references to sections of the body. For example: SECTION_A 256 SECTION_B 344 SECTION_C 556 This means that SECTION_A starts at line 256. What would be the best way to parse this heading into a dictionary and then...
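The heading-to-dictionary step, plus one way of using it, might look like the Python sketch below; the same structure maps onto a `Dictionary<string, int>` in C#. It assumes each heading line is a name followed by a line number, that line numbers are 1-based and count from the top of the file, and that a section runs until the next section starts. `parse_header` and `slice_sections` are hypothetical names.

```python
def parse_header(header_lines):
    """Map section names to their 1-based starting line numbers."""
    sections = {}
    for line in header_lines:
        parts = line.split()
        if len(parts) == 2 and parts[1].isdecimal():
            sections[parts[0]] = int(parts[1])
    return sections


def slice_sections(all_lines, sections):
    """Return each section's lines, ending where the next section begins."""
    ordered = sorted(sections.items(), key=lambda item: item[1])
    result = {}
    for i, (name, start) in enumerate(ordered):
        # The last section runs to the end of the file.
        end = ordered[i + 1][1] if i + 1 < len(ordered) else len(all_lines) + 1
        result[name] = all_lines[start - 1:end - 1]
    return result
```

For files too large to hold in memory, the same dictionary can drive a single forward pass that skips lines until the next starting line number is reached.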

Having trouble with str.find()

I'm trying to use str.find() and it keeps raising an error. What am I doing wrong? import codecs def countLOC(inFile): """ Receives a file and then returns the amount of actual lines of code by not counting commented or blank lines """ LOC = 0 for line in inFile: if...

Code Golf: Evaluating Mathematical Expressions

Challenge Here is the challenge (of my own invention, though I wouldn't be surprised if it has previously appeared elsewhere on the web). Write a function that takes a single argument that is a string representation of a simple mathematical expression and evaluates it as a floating point value. A "simple expression" may in...

Is there a clever way to parse plain-text lists into HTML?

Question: Is there a clever way to parse plain-text lists into HTML? Or, must we resort to esoteric recursive methods, or sheer brute force? I've been wondering this for a while now. In my own ruminations I have come back again and again to the brute-force, and odd recursive, methods ... but it always seems so clunky. There must be a...