views:

9266

answers:

9

I'm importing some data from a CSV file, and numbers that are larger 1000 get turned into "1,100" etc. What's a good way to remove both the quotes and the comma from this so I can put it into an int field?

Edit: The data is actually already in a MySQL table, so I need to be able to this using SQL. Sorry for the mixup.

A: 

You could use this perl command.

Perl -lne 's/[,|"]//; print' file.txt > newfile.txt

You may need to play around with it a bit, but it should do the trick.

Eldila
A: 

Here's the PHP way:

$stripped = str_replace(array(',', '"'), '', $value);

Link to W3Schools page

Daniel F. Hanson
A: 

Daniel's and Eldila's answer have one problem: They remove all quotes and commas in the whole file.

What I usually do when I have to do something like this is to first replace all separating quotes and (usually) semicolons by tabs.

  • Search: ";"
  • Replace: \t

Since I know in which column my affected values will be I then do another search and replace:

  • Search: ^([\t]+)\t([\t]+)\t([0-9]+),([0-9]+)\t
  • Replace: \1\t\2\t\3\4\t

... given the value with the comma is in the third column.

You need to start with an "^" to make sure that it starts at the beginning of a line. Then you repeat ([0-9]+)\t as often as there are columns that you just want to leave as they are.

([0-9]+),([0-9]+) searches for values where there is a number, then a comma and then another number.

In the replace string we use \1 and \2 to just keep the values from the edited line, separating them with \t (tab). Then we put \3\4 (no tab between) to put the two components of the number without the comma right after each other. All values after that will be left alone.

If you need your file to have semicolon to separate the elements, you then can go on and replace the tabs with semicolons. However then - if you leave out the quotes - you'll have to make sure that the text values do not contain any semicolons themselves. That's why I prefer to use TAB as column separator.

I usually do that in an ordinary text editor (EditPlus) that supports RegExp, but the same regexps can be used in any programming language.

BlaM
A: 

Thanks for the correction Blam. My command does remove all ',' and '"'.

In order to convert the sting "1,000" more strictly, you will need the following command.

Perl -lne 's/"(\d+),(\d+)"/$1$2/; print' file.txt > newfile.txt
Eldila
A: 

Actually nlucaroni, your case isn't quite right. Your example doesn't include double-quotes, so

id,age,name,...
1,23,phil,

won't match my regex. It requires the format "XXX,XXX". I can't think of an example of when it will match incorrectly.

All the following example won't include the deliminator in the regex:

"111,111",234
234,"111,111"
"111,111","111,111"

Please let me know if you can think of a counter-example.

Cheers!

Eldila
A: 

Here is a good case for regular expressions. You can run a find and replace on the data either before you import (easier) or later on if the SQL import accepted those characters (not nearly as easy). But in either case, you have any number of methods to do a find and replace, be it editors, scripting languages, GUI programs, etc. Remember that you're going to want to find and replace all of the bad characters.

A typical regular expression to find the comma and quotes (assuming just double quotes) is: (Blacklist)

/[,"]/

Or, if you find something might change in the future, this regular expression, matches anything except a number or decimal point. (Whitelist)

/[^0-9\.]/

What has been discussed by the people above is that we don't know all of the data in your CSV file. It sounds like you want to remove the commas and quotes from all of the numbers in the CSV file. But because we don't know what else is in the CSV file we want to make sure that we don't corrupt other data. Just blindly doing a find/replace could affect other portions of the file.

Joseph Pecoraro
A: 

The solution to the changed question is basically the same.

You will have to run select query with the regex where clause.

Somthing like

Select *
  FROM SOMETABLE
  WHERE SOMEFIELD REGEXP '"(\d+),(\d+)"'

Foreach of these rows, you want to do the following regex substitution s/"(\d+),(\d+)"/$1$2/ and then update the field with the new value.

Please Joseph Pecoraro seriously and have a backup before doing mass changes to any files or databases. Because whenever you do regex, you can seriously mess up data if there are cases that you have missed.

Eldila
+5  A: 

My guess here is that because the data was able to import that the field is actually a varchar or some character field, because importing to a numeric field might have failed. Here was a test case I ran purely a MySQL, SQL solution.

  1. The table is just a single column (alpha) that is a varchar.

    mysql> desc t;
    
    
    +-------+-------------+------+-----+---------+-------+
    | Field | Type        | Null | Key | Default | Extra |
    +-------+-------------+------+-----+---------+-------+
    | alpha | varchar(15) | YES  |     | NULL    |       | 
    +-------+-------------+------+-----+---------+-------+
    
  2. Add a record

    mysql> insert into t values('"1,000,000"');
    Query OK, 1 row affected (0.00 sec)
    
    
    mysql> select * from t;
    
    
    +-------------+
    | alpha       |
    +-------------+
    | "1,000,000" | 
    +-------------+
    
  3. Update statement.

    mysql> update t set alpha = replace( replace(alpha, ',', ''), '"', '' );
    Query OK, 1 row affected (0.00 sec)
    Rows matched: 1  Changed: 1  Warnings: 0
    
    
    mysql> select * from t;
    
    
    +---------+
    | alpha   |
    +---------+
    | 1000000 | 
    +---------+
    

So in the end the statement I used was:

UPDATE table
   SET field_name = replace( replace(field_name, ',', ''), '"', '' );

I looked at the MySQL Documentation and it didn't look like I could do the regular expressions find and replace. Although you could, like Eldila, use a regular expression for a find and then an alternative solution for replace.


Also be careful with s/"(\d+),(\d+)"/$1$2/ because what if the number has more then just a single comma, for instance "1,000,000" you're going to want to do a global replace (in perl that is s///g). But even with a global replace the replacement starts where you last left off (unless perl is different), and would miss the every other comma separated group. A possible solution would be to make the first (\d+) optional like so s/(\d+)?,(\d+)/$1$2/g and in this case I would need a second find and replace to strip the quotes.

Here are some ruby examples of the regular expressions acting on just the string "1,000,000", notice there are NOT double quote inside the string, this is just a string of the number itself.

>> "1,000,000".sub( /(\d+),(\d+)/, '\1\2' )
# => "1000,000"  
>> "1,000,000".gsub( /(\d+),(\d+)/, '\1\2' )
# => "1000,000"  
>> "1,000,000".gsub( /(\d+)?,(\d+)/, '\1\2' )
# => "1000000"  
>> "1,000,000".gsub( /[,"]/, '' )
# => "1000000"  
>> "1,000,000".gsub( /[^0-9]/, '' )
# => "1000000"
Joseph Pecoraro
A: 

@Joseph Pecoraro

Yeah, I did mean "Please take joe seriously". Anyways, nice solution. I didn't know mysql implemented a replace function. That is good information to remember.

Cheers!

Eldila