views:

7295

answers:

11

I have a csv file of 1.2 million records of text. The alphanumeric fields are wrapped in quotation marks, the date/time or numeric fields are not.

For example "Fred","Smith",01/07/1967,2,"7, The High Street","Anytown","Anycounty","LS1 7AA"

What I want do is write some VBA in Excel (more or less the only tool available to me that I am reasonably proficient in the use of) that reads the CSV record by record, performs a check (as it happens on the last field, the post code) and then outputs a small subset of the 1.2m records to a new output file.

I understand how to open the two files, read the record, do what I need to do with the data and write it out (I will just output the input record with a prefix denoting an exception type)

What I don't know is how to parse the CSV in VBA properly. I can't do a simple text scan and search for commas as the text sometimes has commas in (hence why the text fields are text delimited)

Is there a fantastic command that would let me quicky get the data from the nth field in my record?

What I want is s_work = field(s_input_record,5) where 5 is the field number in my CSV....

Many thanks, C

+3  A: 

How about VBScript, though this would also work in Excel:

Set cn = CreateObject("ADODB.Connection")

'Note HDR=Yes, that is, first row contains field names '
'and FMT delimted, ie CSV '

strCon="Provider=Microsoft.Jet.OLEDB.4.0;Data Source=c:\Docs\;" _
& "Extended Properties=""text;HDR=Yes;FMT=Delimited"";"

cn.open strcon

'You would not need delimiters ('') if last field is numeric: '    
strSQL="SELECT FieldName1, FieldName2 INTO New.csv FROM Old.csv " _
& " WHERE LastFieldName='SomeTextValue'"

'Creates new csv file
cn.Execute strSQL
Remou
I've used a similiar technique to accomplish the same thing in the past
Mitch Wheat
A: 

This doesn't directly answer your question, but grep (or one of the Windows equivalents) would really shine for this, e.g.,

grep -e <regex_filter> foo.csv > bar.csv
Hank Gay
Why unix for a windows question?
Remou
Windows variants of grep: WinGrep(http://www.wingrep.com/), PowerGrep(http://www.powergrep.com/), and GNU Grep for Windows(http://gnuwin32.sourceforge.net/packages/grep.htm).
Hank Gay
Given that grep is available in some form, and it's entire purpose is to print lines matching a regex, which seemed sufficient for the filtering mentioned in the question, it seemed like a natural fit.
Hank Gay
A: 

Hank Gay - (I tried to post this as a comment but keep getting errors, so posted it as a new answer...)

My problem is my condition that I need to meet to copy the record from input to output is quite convoluted so I kind of need to get a record at a time, perform some code on it and then decide on the basis of that code if the record needs to be output or not.

My example is simplified but I need to look at >1 field at once and cross compare them etc.

Will def have a look at grep though, it looks great for handling great chunks of data but don't think it solves the problem here.

If you say exactly what the comparison you need to do is, it should be possible to build a suitable SQL string. Furthermore, you can use a recordset through ADO and go through the records one by one. ADO seems suitable for all the problems you mentioned.
Remou
I think grep is for Unix only, whereas you seem to be using windows.
Remou
A: 

@C J

In that case, you could probably still solve it on a *nix command line with utilities like cut, etc., but I wouldn't want to try it personally, and I'm not sure there are Windows equivalents for all of them. If VBA is your best language, then I'd have to get behind Remou's suggestion.

Hank Gay
+4  A: 

The following code should do the trick. I don't have Excel in front of me, so I haven't tested it, but the concept is sound.

If this ends up being too slow, we can look at ways to improve the efficiency.

Sub SelectSomeRecords()
    Dim testLine As String

    Open inputFileName For Input As #1
    Open outputFileName For Output As #2

    While Not EOF(1)
        Line Input #1, testLine
        If RecordIsInteresting(testLine) Then
            Print #2, testLine
        End If
    Wend

    Close #1
    Close #2
End Sub

Function RecordIsInteresting(recordLine As String) As Boolean
    Dim lineItems(1 to 8) As String

    GetRecordItems(lineItems(), recordLine)

    RecordIsInteresting = lineItems(8) = "LS1 7AA" //'do your custom checking here
End Function

Sub GetRecordItems(items() As String, recordLine as String)
    Dim finishString as Boolean
    Dim itemString as String
    Dim itemIndex as Integer
    Dim charIndex as Long
    Dim inQuote as Boolean
    Dim testChar as String

    inQuote = False
    charIndex = 1
    itemIndex = 1
    itemString = ""
    finishString = False

    While charIndex <= Len(recordLine)
        testChar = Mid$(recordLine, charIndex, 1)

        finishString = False

        If inQuote Then
            If testChar = Chr$(34) Then
                inQuote = False
                finishString = True
                charIndex = charIndex + 1 //'ignore the next comma
            Else
                itemString = itemString + testChar
            End If
        Else
            If testChar = Chr$(34) Then
                inQuote = True
            ElseIf testChar = "," Then
                finishString = True
            Else
                itemString = itemString + testChar
            End If
        End If

        If finishString Then
            items(itemIndex) = itemString
            itemString = ""
            itemIndex = itemIndex + 1
        End If

        charIndex = charIndex + 1
    Wend
End Sub
e.James
A: 

Cheers both - will experiment

Thanks again C

Both? You do not seem to find my suggestion of any use?
Remou
Sorry Remou, when I typed it there was only your suggestion and Hank Gay's suggestion on my screen (note to self, press refresh now and again) so actually it was eJames I rudely but inadvertantly missed off. Many thanks one and all. C
A: 

I'd suggest taking a look at the Regular Expression library (you should see it in "Tools...References" as "Microsoft VBScript Regular Expressions 5.5" or something very similar.

There are samples of both the Reg Exp and a fairly comprehensive character-by-character at this location: http://www.xbeat.net/vbspeed/c_ParseCSV.php. Note that the Regexp version is waaaay shorter!

Have fun...

Mike Woodhouse
+1  A: 

Look at the Input # statement in the Excel help

Sample usage would be:

Input #fnInput, s_Forename, s_Surname, dt_DOB, i_Something, s_Street, s_Town, s_County, s_Postcode

and then use the Write # statement to write matching records out again

The only issue might be that the date format in the output will end up as #1967-07-01# but this format is unambiguous unlike 01/07/1967 which would represent 1st July in the UK and 7th January in the US. If you need to preserve the formatting of the date then write it out as a string:

s_DOB = Format(dt_DOB, "dd/mm/yyyy")
barrowc
+1  A: 

Anything you can do a-row-at-a-time with vba in excel, you can do in access with vba; plus a lot more because it's a database rather than a spreadsheet. Is access unavailable to you?

It's a lot easier to deal with logical tables, records, and fields than logical worksheets, rows, and columns.

For input, why does the "/Data/Import External Data/Text/csv" not work? Is the input not truly portable csv?

le dorfier
1.2 million rows is not going to work in Excel, however, data can be manipulated with ADO using the Jet engine, that is, the engine on which Access is based. Hence my reply.
Remou
ok - so to confirm - Access is not available? Anyway, if you can access Jet with ADO, you can also open an Access mdb database from within Excel with Excel VBA and store the data there - you wouldn't even need Access installed to do that.
le dorfier
A: 

I used the following derivative of the code given above to successfully open an arbitrary csv file from VBA in Excel.

Option Explicit
Public cn As Connection
Public Sub DoIt()
Dim strcon As String
Dim strsql As String
Dim rs As Recordset

Set cn = CreateObject("ADODB.Connection")

strcon = "Provider=Microsoft.Jet.OLEDB.4.0;Data Source=C:\bin\HomePlanet\;" _
& "Extended Properties=""text;HDR=Yes;FMT=Delimited"";"

cn.Open strcon

strsql = "SELECT * FROM astuname.csv "
Set rs = New ADODB.Recordset
rs.Open strsql, cn
DoEvents ' pause here to inspect objects and properties rs.Close
End Sub

The rs (recordset) has a collection of fields, with a Count property. Each field as a Type property.

You can reference the fields by sequence number ...

Debug.Print rs.Fields(rs.Fields.Count - 1).Type

Is this sufficient?

If not, post the first several rows of the input file and I'll take it the rest of the way.

le dorfier
A: 

Use the split function

giuliano