(NOTE: This question is not about escaping queries, it's about escaping results)

I'm using GROUP_CONCAT to combine multiple rows into a comma delimited list. For example, assume I have the two (example) tables:

CREATE TABLE IF NOT EXISTS `Comment` (
`id` int(11) unsigned NOT NULL auto_increment,
`post_id` int(11) unsigned NOT NULL,
`name` varchar(255) collate utf8_unicode_ci NOT NULL,
`comment` varchar(255) collate utf8_unicode_ci NOT NULL,
PRIMARY KEY  (`id`),
KEY `post_id` (`post_id`)
) ENGINE=MyISAM  DEFAULT CHARSET=utf8 COLLATE=utf8_unicode_ci AUTO_INCREMENT=6 ;

INSERT INTO `Comment` (`id`, `post_id`, `name`, `comment`) VALUES
(1, 1, 'bill', 'some comment'),
(2, 1, 'john', 'another comment'),
(3, 2, 'bill', 'blah'),
(4, 3, 'john', 'asdf'),
(5, 4, 'x', 'asdf');


CREATE TABLE IF NOT EXISTS `Post` (
`id` int(11) NOT NULL auto_increment,
`title` varchar(255) collate utf8_unicode_ci NOT NULL,
PRIMARY KEY  (`id`)
) ENGINE=InnoDB  DEFAULT CHARSET=utf8 COLLATE=utf8_unicode_ci AUTO_INCREMENT=7 ;

INSERT INTO `Post` (`id`, `title`) VALUES
(1, 'first post'),
(2, 'second post'),
(3, 'third post'),
(4, 'fourth post'),
(5, 'fifth post'),
(6, 'sixth post');

I want to list all posts, each with a list of the usernames that commented on it:

SELECT
Post.id as post_id, Post.title as title, GROUP_CONCAT(name) 
FROM Post 
LEFT JOIN Comment on Comment.post_id = Post.id
GROUP BY Post.id

gives me:

post_id  title  GROUP_CONCAT(name)
1   first post  bill,john
2   second post  bill
3   third post  john
4   fourth post  x
5   fifth post  NULL
6   sixth post  NULL

This works great, except that if a username contains a comma it will ruin the list of users. Does MySQL have a function that will let me escape these characters? (Please assume usernames can contain any characters, since this is only an example schema)

+1  A: 

REPLACE()

Example:

... GROUP_CONCAT(REPLACE(name, ',', '\\,'))

Note you have to use a double-backslash (if you escape the comma with backslash) because backslash itself is magic, and \, becomes simply ,.

Bill Karwin
A: 

Bill:

If the values of two usernames were

abc,\

and

def

...that replacement would result in

abc\,\,def

I suppose I could do:

REPLACE(REPLACE(name, '\\', '\\\\'), ',', '\\,')
Bill Zeller
No, the replacement doesn't do that. You should try it. It performs the substitution on each `name` individually, *before* it's combined into a list by the GROUP_CONCAT().
Bill Karwin
Right, it would replace "abc,\" with "abc\,\" and replace "def" with "def". When GROUP_CONCAT combined the two results, it would end up with "abc\,\,def", which I can't distinguish from a single name with the value "abc,,def".
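
To make the ambiguity concrete, here is a small Python sketch that mimics the SQL on the client side. The helper names (`escape_commas_only`, `escape_backslash_then_comma`, `group_concat`) are hypothetical stand-ins for the REPLACE() and GROUP_CONCAT() calls above, not MySQL functions.

```python
def escape_commas_only(name):
    # Stand-in for REPLACE(name, ',', '\\,')
    return name.replace(",", "\\,")

def escape_backslash_then_comma(name):
    # Stand-in for REPLACE(REPLACE(name, '\\', '\\\\'), ',', '\\,')
    return name.replace("\\", "\\\\").replace(",", "\\,")

def group_concat(values, sep=","):
    # Stand-in for GROUP_CONCAT with its default comma separator
    return sep.join(values)

# Two names, "abc,\" and "def", versus one name, "abc,,def"
two_names = ["abc,\\", "def"]
one_name = ["abc,,def"]

# Escaping only the comma: both inputs produce the same string.
commas_only_two = group_concat(escape_commas_only(n) for n in two_names)
commas_only_one = group_concat(escape_commas_only(n) for n in one_name)

# Escaping backslashes first: the two inputs stay distinguishable.
full_two = group_concat(escape_backslash_then_comma(n) for n in two_names)
full_one = group_concat(escape_backslash_then_comma(n) for n in one_name)
```

With comma-only escaping the two results compare equal; with the nested replacement they differ.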
Bill Zeller
+4  A: 

If there's some other character that's illegal in usernames, you can specify a different separator character using a little-known syntax:

...GROUP_CONCAT(name SEPARATOR '|')...

... You want to allow pipes? or any character?

Escape the separator character, perhaps with backslash, but before doing that escape backslashes themselves:

group_concat(replace(replace(name, '\\', '\\\\'), '|', '\\|') SEPARATOR '|')

This will:

  1. escape any backslashes with another backslash
  2. escape the separator character with a backslash
  3. concatenate the results with the separator character

To get the unescaped results, do the same thing in the reverse order:

  1. split the results by the separator character where not preceded by a backslash. Actually, it's a little tricky: you want to split where it isn't preceded by an odd number of backslashes. This regex will match that (note that the match also includes any escaped backslashes directly before the separator, so keep those with the preceding item when you split):
    (?<!\\)(?:\\\\)*\|
  2. replace all escaped separator chars with literals, i.e. replace \| with |
  3. replace all double backslashes with single backslashes, i.e. replace \\ with \
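
As a rough illustration of the decoding side, here is a Python sketch that walks the string once and does the split and the unescaping together, instead of using the regex. The function names `escape` and `split_escaped` are made up for this example, and it assumes a single-character separator and that the column value arrives as a Python string (or None when the group is empty).

```python
def escape(name, sep="|"):
    # Mirror of the SQL above: escape backslashes first, then the separator.
    return name.replace("\\", "\\\\").replace(sep, "\\" + sep)

def split_escaped(joined, sep="|"):
    """Split on unescaped separators and unescape each item in one pass."""
    if joined is None:  # GROUP_CONCAT returns NULL when there are no rows
        return []
    names, current, i = [], [], 0
    while i < len(joined):
        ch = joined[i]
        if ch == "\\" and i + 1 < len(joined):
            # A backslash always escapes the next character: keep it literally.
            current.append(joined[i + 1])
            i += 2
        elif ch == sep:
            # An unescaped separator ends the current item.
            names.append("".join(current))
            current = []
            i += 1
        else:
            current.append(ch)
            i += 1
    names.append("".join(current))
    return names

# Round trip with awkward names: one containing the separator, one ending in a backslash.
original = ["a|b", "c\\", "d"]
joined = "|".join(escape(n) for n in original)
decoded = split_escaped(joined)
```

Because a backslash always consumes the character after it, the "odd number of backslashes" case is handled without counting.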
ʞɔıu
I ended up doing something slightly different, but very close to this. Thanks!
Bill Zeller
A: 

What nick said, really, with one enhancement: the separator can be more than one character too.

I've often used

GROUP_CONCAT(name SEPARATOR '"|"')

Chances of a username containing "|" (with the quotes) are fairly low, I'd say.

benlumley
A: 

You're getting into that gray area where it might be better to postprocess this outside the world of SQL.

At least that's what I'd do: I'd just ORDER BY instead of GROUP BY, and loop through the results to handle the grouping as a filter done in the client language:

  1. Start by initializing last_id to NULL
  2. Fetch the next row of the resultset (if there aren't more rows go to step 6)
  3. If the id of the row is different than last_id start a new output row:

    a. if last_id isn't NULL then output the grouped row

    b. set the new grouped row = the input row, but store the name as a single element array

    c. set last_id to the value of the current ID

  4. Otherwise (id is the same as last_id) append the row name onto the existing grouped row.

  5. Go back to step 2
  6. Otherwise you have finished; if the last_id isn't NULL then output the existing group row.

Then your output ends up including the names organized as an array, and you can decide how you want to handle/escape/format them at that point.
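
In Python, for instance, the loop above could be sketched roughly like this. It uses `itertools.groupby` to do the last_id bookkeeping rather than spelling out the steps by hand. The function name `group_names` and the row layout `(post_id, title, name)` are assumptions for the example; the rows are expected to come from the same LEFT JOIN, ordered by Post.id, with no GROUP BY.

```python
from itertools import groupby
from operator import itemgetter

def group_names(rows):
    """Collapse consecutive rows with the same post into one grouped row.

    rows: iterable of (post_id, title, name) tuples, already ordered by post_id.
    Returns a list of (post_id, title, [names]) tuples.
    """
    grouped = []
    for (post_id, title), group in groupby(rows, key=itemgetter(0, 1)):
        # A post without comments yields one row whose name is NULL (None here).
        names = [name for _, _, name in group if name is not None]
        grouped.append((post_id, title, names))
    return grouped

# Rows as the ordered join would return them for the example data.
rows = [
    (1, "first post", "bill"),
    (1, "first post", "john"),
    (2, "second post", "bill"),
    (3, "third post", "john"),
    (4, "fourth post", "x"),
    (5, "fifth post", None),
    (6, "sixth post", None),
]
result = group_names(rows)
```

Since the names never get joined into one string, no separator or escaping is involved at all.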

What language/system are you using? PHP? Perl? Java?

Jason S
A: 

If you're going to be doing the decoding in your application, maybe just use hex:

SELECT GROUP_CONCAT(HEX(foo)) ...

or you could also put the length in them:

SELECT GROUP_CONCAT(CONCAT(LENGTH(foo), ':', foo)) ...

Not that I tested either :-D
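
For what it's worth, decoding either format in the application could look something like the Python sketch below. The function names are hypothetical. It assumes the default comma separator, that the column and connection both use UTF-8, and it relies on LENGTH() in MySQL counting bytes rather than characters, which is why the length-prefixed decoder works on bytes.

```python
def decode_hex_list(joined, encoding="utf-8"):
    """Decode the result of GROUP_CONCAT(HEX(foo))."""
    if joined is None:  # NULL when the group has no rows
        return []
    return [bytes.fromhex(h).decode(encoding) for h in joined.split(",")]

def decode_length_prefixed(joined, encoding="utf-8"):
    """Decode the result of GROUP_CONCAT(CONCAT(LENGTH(foo), ':', foo))."""
    if joined is None:
        return []
    data = joined.encode(encoding)  # work in bytes, since LENGTH() is a byte count
    items, pos = [], 0
    while pos < len(data):
        colon = data.index(b":", pos)
        length = int(data[pos:colon].decode("ascii"))
        start = colon + 1
        items.append(data[start:start + length].decode(encoding))
        pos = start + length + 1  # skip the value and the comma after it
    return items

hex_names = decode_hex_list("62696C6C,6A6F686E")
prefixed_names = decode_length_prefixed("3:a,b,4:john")
```

The hex form roughly doubles the size of the data, while the length-prefixed form stays readable but needs the small parser above.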

derobert
A: 

Jason S: This is exactly the issue I'm dealing with. I'm using a PHP MVC framework and was processing the results like you describe (multiple rows per result and code to group the results together). However, I've been working on two functions for my models to implement. One returns a list of all the fields needed to recreate the object, and the other, given a row with the fields from the first function, instantiates a new object. This lets me request a row from the database and easily turn it back into the object without knowing the internals of the data needed by the model. This doesn't work quite as well when multiple rows represent one object, so I was trying to use GROUP_CONCAT to get around that problem.

Bill Zeller
A: 

Right now I'm allowing any character. I realize a pipe would be unlikely to show up, but I'd like to allow it.

How about a control character, which you should be stripping out of application input anyway? I doubt you need, e.g., a tab or a newline in a name field.

bobince
A: 

I'd suggest GROUP_CONCAT(name SEPARATOR '\n'), since \n usually does not occur in names. This might be a little simpler, since you don't need to escape anything, but it could lead to unexpected problems if a newline ever does get in. The encoding/regexp decoding approach proposed by nick is of course nice too.