Hi,

I have a large .csv file (~26000 rows) that I want to read into MATLAB. An added problem is that one of the fields contains a collection of strings that are themselves delimited by commas.

I'm having trouble reading it. I tried functions like tdfread, which won't work here. Are there any tricks with textscan I should be aware of?

Is there any other way?

A: 

You have a problem because you're reading it in as a .csv, and you have commas within your data. You can open it in Excel and manipulate the data, possibly removing the unwanted commas with Excel formulas. I work with .csv files for DB imports quite a bit. I imagine MATLAB has a similar rule: no commas in your data.

Can you tell us more about your data? Are there commas throughout, or just in one column? Maybe you can read it in as tab delimited?

cinqoTimo
This won't help, because Excel will treat every comma in a row as a delimiter and so will come up with extra columns.
drlouie - louierd
Actually, if you have it in an .xls you can have commas within your cells. At this point, you can run your functions to extract the commas, and then save as a .csv
cinqoTimo
I tried converting into xls, but the number of rows far exceeds the max limit. It's actually ~263000. I think xls has a max limit of 65536 rows by 256 columns. I managed to read it using xlsread on the csv file itself. Thank you
Excel can only have commas in the field because they are setting a text qualifier (") for that field. Without one, even Excel can't figure out that it should be a single field containing commas (side note: Excel 2007 allows over a million rows)
Gabriel McAdams
+1  A: 

I'm not sure what is generating your CSV file but that is your problem.

The point of a CSV file is that the file itself designates the separation of fields. If the text of the CSV contains commas, then nothing you can do will help you. How would ANY program know whether a comma is part of the text in a single field or a field delimiter?

Proper CSV would have a text qualifier. Some generators/readers give you the option to use one. The standard text qualifier is a " (double quote). It's changeable, though, because your text may contain those, too.

Again, it's all about generating proper CSV content.
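As a rough illustration of why a text qualifier matters, here is a minimal MATLAB sketch. The sample line and the variable names are made up for this example; it only assumes that the comma-containing field is wrapped in double quotes, which is not the case for the file in the question.

```matlab
% Hypothetical sample line: the second field is wrapped in double quotes.
sampleLine = '17,"red,green,blue",3.5';

% %q reads a text field and strips a surrounding pair of double quotes,
% so the commas inside the quotes are not treated as field delimiters.
c = textscan(sampleLine, '%f %q %f', 'Delimiter', ',');

id     = c{1};      % numeric first field
colors = c{2}{1};   % quoted text field, kept as a single string
value  = c{3};      % numeric last field
```

Without the quotes around the second field, the same call has no way to tell the embedded commas from delimiters.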

Gabriel McAdams
The CSV file is provided to me; I have no control over how it is generated. Good point. I have heard that it's possible to design a context-based lexical analyzer that runs through the file and changes the commas separating the collection of strings to another character.
You can't differentiate field delimiters from commas in the text when there are commas in the fields and no text qualifiers. Are you able to talk to those who generate this CSV and get them to use a text qualifier?
Gabriel McAdams
A: 

Hi

Since, as others have observed, your file is CSV with commas inside what you think of as a single field, it's going to be hard to persuade Matlab that that really is only one field. I think your best strategy is going to be to read one line at a time, into a string acting as a buffer, and to translate it, field-by-field, into the variables or other data structures that you want. Since Matlab has in-built regular expression capabilities this shouldn't be too hard.
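A minimal sketch of that field-by-field translation for a single line might look like the following. The row layout is hypothetical (the real file was never posted): it assumes a known number of fields before and after the embedded list, so that everything in between can be glued back together.

```matlab
% Hypothetical row: id, name, comma-separated tag list, score.
% Assumption: only the third field may contain extra commas.
rowText = '42,widget,red,green,blue,7.5';

nLead  = 2;   % number of fields before the list (assumed known)
nTrail = 1;   % number of fields after the list (assumed known)

% Split on every comma, then treat the middle pieces as one field.
parts = regexp(rowText, ',', 'split');
lead  = parts(1:nLead);
tags  = parts(nLead+1:end-nTrail);     % pieces of the embedded list
trail = parts(end-nTrail+1:end);

id       = str2double(lead{1});
name     = lead{2};
score    = str2double(trail{1});
tagField = strjoin(tags, ',');         % the original field, restored
```

This only works if the surrounding fields never contain commas themselves; if that assumption fails, the file is genuinely ambiguous.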

And, as others have already suggested, posting a sample of your data would help us to help you.

Regards

Mark

High Performance Mark
I managed to read the file using xlsread in MATLAB. I used the option that returns the numeric, text and raw data in separate matrices.
+1  A: 

There's a chance that xlsread won't give you the answer you expect -- do the strings always appear in the same columns, for example? I think (as everyone else seems to :-) that it would be more robust to just use

fid = fopen('yourfile.csv');

and then either textscan

t = textscan(fid, '%s', 'delimiter', sprintf('\n'));
t = t{1};

or just fgetl (the example in the help is perfect).

After that you can do some line-by-line processing -- using textscan again on the text content of each line, for example, is a nice, quick way to get a cell-array that will allow fast analysis of each line.
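The fgetl route with per-line textscan could be sketched roughly as follows. The file contents and variable names here are invented so the example is self-contained; substitute your own file name and whatever per-line logic your data needs.

```matlab
% Write a small hypothetical file so the sketch is self-contained.
fname = [tempname '.csv'];
fid = fopen(fname, 'w');
fprintf(fid, '1,alpha,x,y,2.5\n2,beta,z,4.0\n');
fclose(fid);

fid = fopen(fname, 'r');
if fid == -1
    error('Could not open %s', fname);
end

rows = {};                       % one cell array of fields per line
tline = fgetl(fid);
while ischar(tline)              % fgetl returns -1 at end of file
    % Split this line on commas into a column cell array of strings.
    fields = textscan(tline, '%s', 'Delimiter', ',');
    rows{end+1} = fields{1};
    tline = fgetl(fid);
end
fclose(fid);
delete(fname);                   % remove the temporary sample file
```

Each entry of rows can have a different number of fields, which is exactly what you need to inspect when some lines contain extra commas.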

Nivag
+1 I've found that MATLAB's "auto-loading" features are not very robust when your data is not just well-behaved numbers. I've even had issues with buggy undocumented features (loading hexadecimal numbers). It sucks, but when in doubt, it's better to implement the parsing yourself.
Kena
A: 

Are you using a Unix system? The reason I ask is that you could use a command-line tool such as sed, with regular expressions, to clean those data files before you pass them into MATLAB. Here is a link that explains how to do exactly what you are looking for.

John Bellone