If you have a string, how would you detect its delimiter?

PHP can split a string with explode(), which requires a delimiter parameter.

But is there a way to detect the delimiter before passing the string to explode()?

Right now I just output the string to the user and they enter the delimiter. That works, but I would like the application to recognize the pattern for me.

Should I look to regular expressions for this type of pattern recognition in a string?

EDIT: I failed to specify initially that there is an expected set of likely delimiters: any delimiter that is commonly used in a CSV. Technically anyone could use any character to delimit a CSV file, but the most probable ones are the comma, the semicolon, the vertical bar and the space.

EDIT 2: Here is the workable solution I came up with for a "determined delimiter".

    $get_images = "86236058.jpg 86236134.jpg 86236134.jpg";

    // Detect the delimiter of the image filenames.
    $probable_delimiters = array(",", " ", "|", ";");

    // Count how often each candidate occurs in the string.
    $delimiter_count_array = array();

    foreach ($probable_delimiters as $probable_delimiter) {
        $delimiter_count_array[$probable_delimiter] = substr_count($get_images, $probable_delimiter);
    }

    // Collect every candidate that reaches the highest count (ties are possible).
    $max_value = max($delimiter_count_array);
    $determined_delimiter_array = array_keys($delimiter_count_array, $max_value);

    // The original each() loop ended up with the last tied candidate;
    // each() was removed in PHP 8.0, so take the last element directly.
    $determined_delimiter = end($determined_delimiter_array);

    $images = explode($determined_delimiter, $get_images);
+2  A: 

Determine which delimiters you consider probable (like `,`, `;` and `|`) and, for each one, count how often it occurs in the string (`substr_count()`). Then choose the one with the most occurrences as the delimiter and explode.

Even though that might not be fail-safe, it should work in most cases ;)
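
Since the approach is not fail-safe, one possible refinement is to report "don't know" instead of guessing, so the application can fall back to asking the user. The following is only a sketch under assumptions: the function name `detect_delimiter()` and the default candidate list are made up for illustration, and it needs PHP 7.1 or later for the nullable return type.

```php
<?php
/**
 * Hypothetical helper: returns the most frequent candidate delimiter,
 * or null when no candidate occurs or the highest count is tied.
 */
function detect_delimiter(string $text, array $candidates = [',', ';', '|', ' ']): ?string
{
    $counts = [];
    foreach ($candidates as $candidate) {
        $counts[$candidate] = substr_count($text, $candidate);
    }

    // Sort by count, highest first, keeping the delimiter as the key.
    arsort($counts);
    $delimiters = array_keys($counts);
    $values = array_values($counts);

    // No candidate occurs at all: nothing to detect.
    if ($values[0] === 0) {
        return null;
    }
    // Two candidates share the highest count: ambiguous.
    if (count($values) > 1 && $values[0] === $values[1]) {
        return null;
    }
    return (string) $delimiters[0];
}
```

A caller could then write `$delimiter = detect_delimiter($get_images) ?? $user_supplied_delimiter;`, where `$user_supplied_delimiter` stands for whatever the user typed in.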

nikic
This is bound to fail too often. What if I have content that contains heaps of `,,,,, ;;;;; ||||||`?
Pekka
If it's for experimental purposes, it can be a start. Otherwise such constructs are going to be your system's downfall.
tharkun
One option: if the counts are high for more than one candidate, or close together, go line by line through the file and count occurrences per line. The candidate with the most stable count (differing by at most 1 between lines) is likely the delimiter...
ircmaxell
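
The line-by-line idea from the comment above could look roughly like this. It is a minimal sketch, not a tested CSV sniffer: the function name `detect_stable_delimiter()` and the candidate list are assumptions, and quoted fields are not handled.

```php
<?php
/**
 * Hypothetical helper: picks the candidate whose per-line count is stable
 * (differs by at most 1 between lines) and present on every line,
 * preferring the candidate with the higher count. Returns null if none fits.
 */
function detect_stable_delimiter(string $text, array $candidates = [',', ';', '|', ' ']): ?string
{
    // Split into non-empty lines; \R matches \n, \r\n and \r.
    $lines = preg_split('/\R/', $text, -1, PREG_SPLIT_NO_EMPTY);
    if (!$lines) {
        return null;
    }

    $best = null;
    $bestCount = 0;
    foreach ($candidates as $candidate) {
        $perLine = [];
        foreach ($lines as $line) {
            $perLine[] = substr_count($line, $candidate);
        }
        $min = min($perLine);
        $max = max($perLine);

        // Must appear on every line and vary by at most 1 between lines.
        if ($min > 0 && $max - $min <= 1 && $min > $bestCount) {
            $best = $candidate;
            $bestCount = $min;
        }
    }
    return $best;
}

$csv = "name;note\nAnn;likes a, b, c, d\nBob;none";
var_dump(detect_stable_delimiter($csv)); // string(1) ";"
```

In this sample a plain total count would favour the space (4 occurrences) over the semicolon (3), but the space and the comma are missing from two of the three lines, so only the semicolon passes the stability check. For single-line input the function degrades to a plain count.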
@Pekka: Everything depends on the data expected to be entered. For example, if you are supposed to input tags, it is pretty improbable that a tag name will contain a `,` or a `;`.
nikic
I do think that this is the best attempt at doing this so far.
Alex
What if I choose `;` as the delimiter, but enter a string like: `Barbados,Belarus,Brazil;Canada,China,Congo,Cuba`? There's only one instance of the actual delimiter `;` but five instances of `,` which is another likely choice. In this case, choosing the one with the most occurrences will give the wrong result.
stevelove
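
To make the counter-example above concrete, here is a small sketch that runs the "most occurrences wins" rule on that exact string. The variable names are illustrative only.

```php
<?php
$input = 'Barbados,Belarus,Brazil;Canada,China,Congo,Cuba';

// Count each candidate delimiter in the string.
$counts = [];
foreach ([',', ';', '|'] as $candidate) {
    $counts[$candidate] = substr_count($input, $candidate);
}
// $counts is now [',' => 5, ';' => 1, '|' => 0]

// Pick the candidate with the highest count.
$guess = array_search(max($counts), $counts, true);
// $guess is ',' although the intended delimiter was ';'

$fields = explode($guess, $input);
// 6 fields instead of the intended 2; one of them is 'Brazil;Canada'
```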
@stevelove You've got a point. That throws off this approach slightly.
Alex
@stevelove: Well, as I already said, it depends on the expected data. If this is the kind of data you expect, then this is the wrong approach for sure.
nikic
@stevelove: For any system like this you can generate a delimited string/file that will lead to the wrong choice (hence why this is a heuristic, it *tends* to lead you to the correct answer, but not directly). But that doesn't mean that it's bad to build just because there exist edge cases (which may not occur at all, depending on the use case)...
ircmaxell
A: 

Either you have imploded or built the string yourself and know the delimiter, or you do NOT use explode. If you're not sure about the delimiter, there is no way you're going to find it out with probability 1. As some comments suggest, you could build a heuristic approach, but that would be a scientific venture.

tharkun
Well, I guess it depends on your assumptions. If you're talking about parsing arbitrary delimited text, then yes, it's not going to work too well in the long run. But in most cases with delimited text, you have some hints about the format. For example, Excel uses `,` by default, some other apps use `;`, and some programmers use `|`. So if you're handling user input from a spreadsheet program, you can probably get away with looking at those 3 and should be able to detect the delimiter pretty reliably (I'd say for > 99% of regular user input, if defined intelligently)...
ircmaxell
@ircmaxell I agree
Alex