I'm in a situation where I need to parse arguments from a string in the same way that they would be parsed if provided on the command-line to a Java/Clojure application.

For example, I need to turn "foo \"bar baz\" 'fooy barish' foo" into ("foo" "bar baz" "fooy barish" "foo").

I'm curious if there is a way to use the parser that Java or Clojure uses to do this. I'm not opposed to using a regex, but I suck at regexes, and I'd fail hard if I tried to write one for this.

Any ideas?

A: 

I ended up doing this:

(filter seq
        (flatten
         (map #(%1 %2)
              (cycle [#(s/split % #" ") identity])
              (s/split (read-line) #"(?<!\\)(?:'|\")"))))
Rayne
I'm afraid this breaks with, say, `'asdf"asdf'`.
Michał Marczyk
Also, a backslash may itself be escaped... Just pointing things out in case you want to fix them; if I figure out an alternative solution, I'll post that as an answer.
Michał Marczyk
Indeed. I knew it wasn't quite right, but I was taking whatever I could get at that point.
Rayne
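To make the failure mentioned in the comments concrete, here is a small sketch that wraps the snippet above in a function. The name `naive-split` is hypothetical, it takes the line as an argument instead of calling `(read-line)`, and it assumes `clojure.string` is what the `s` alias refers to.

```clojure
(require '[clojure.string :as s])

(defn naive-split
  "Same approach as the snippet above: split on unescaped quotes, then
  alternate between splitting on spaces and keeping the piece whole."
  [line]
  (filter seq
          (flatten
           (map #(%1 %2)
                (cycle [#(s/split % #" ") identity])
                (s/split line #"(?<!\\)(?:'|\")")))))

;; Works for the example in the question:
(naive-split "foo \"bar baz\" 'fooy barish' foo")
;; => ("foo" "bar baz" "fooy barish" "foo")

;; Breaks when one kind of quote appears inside the other:
(naive-split "'asdf\"asdf'")
;; => ("asdf" "asdf")
```

The regex treats every unescaped quote character as a delimiter, whichever quote opened the argument, so the single argument `asdf"asdf` comes back as two arguments with the inner quote dropped.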
+1  A: 

Updated with a new, even more convoluted version. This is officially ridiculous; the next iteration will use a proper parser (or c.c.monads and a little bit of Parsec-like logic on top of that). See the revision history on this answer for the original.

This convoluted bunch of functions seems to do the trick (not at my DRYest with this one, sorry!):

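;; Assumes clojure.string is aliased as str (the post does not show its require).
(require '[clojure.string :as str])

(defn initial-state [input]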
  {:expecting nil
   :blocks (mapcat #(str/split % #"(?<=\s)|(?=\s)")
                   (str/split input #"(?<=(?:'|\"|\\))|(?=(?:'|\"|\\))"))
   :arg-blocks []})

(defn arg-parser-step [s]
  (if-let [bs (seq (:blocks s))]
    (if-let [d (:expecting s)]
      (loop [bs bs]
        (cond (= (first bs) d)
              [nil (-> s
                       (assoc-in [:expecting] nil)
                       (update-in [:blocks] next))]
              (= (first bs) "\\")
              [nil (-> s
                       (update-in [:blocks] nnext)
                       (update-in [:arg-blocks]
                                  #(conj (pop %)
                                         (conj (peek %) (second bs)))))]
              :else
              [nil (-> s
                       (update-in [:blocks] next)
                       (update-in [:arg-blocks]
                                  #(conj (pop %) (conj (peek %) (first bs)))))]))
      (cond (#{"\"" "'"} (first bs))
            [nil (-> s
                     (assoc-in [:expecting] (first bs))
                     (update-in [:blocks] next)
                     (update-in [:arg-blocks] conj []))]
            (str/blank? (first bs))
            [nil (-> s (update-in [:blocks] next))]
            :else
            [nil (-> s
                     (update-in [:blocks] next)
                     (update-in [:arg-blocks] conj [(.trim (first bs))]))]))
    [(->> (:arg-blocks s)
          (map (partial apply str)))
     nil]))

(defn split-args [input]
  (loop [s (initial-state input)]
    (let [[result new-s] (arg-parser-step s)]
      (if result result (recur new-s)))))

Somewhat encouragingly, the following yields true:

(= (split-args "asdf 'asdf \" asdf' \"asdf ' asdf\" asdf")
   '("asdf" "asdf \" asdf" "asdf ' asdf" "asdf"))

So does this:

(= (split-args "asdf asdf '  asdf \" asdf ' \" foo bar ' baz \" \" foo bar \\\" baz \"")
   '("asdf" "asdf" "  asdf \" asdf " " foo bar ' baz " " foo bar \" baz "))

Hopefully this trims regular arguments but not ones surrounded with quotes, and handles both double and single quotes, including escaped double quotes inside a double-quoted argument. Note that it currently treats escaped single quotes inside a single-quoted argument in the same way, which is apparently at variance with the *nix shell way... argh. It's basically a computation in an ad-hoc state monad, just written in a particularly ugly way and in dire need of DRYing up. :-P
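For comparison only, the same state-threading idea can be sketched one character at a time with `reduce`. This is not the code above: the name `tokenize-args` and the state keys `:open`, `:escaped?`, `:current` and `:args` are hypothetical. Like the version above, it assumes a backslash escapes the next character everywhere, including inside single quotes.

```clojure
(defn tokenize-args
  "Hypothetical sketch: split `input` into arguments by folding a small
  state map over its characters."
  [input]
  (let [step (fn [{:keys [open escaped? current] :as st} ch]
               (cond
                 ;; The previous character was a backslash: keep this one literally.
                 escaped?
                 (assoc st :escaped? false :current (str current ch))

                 ;; A backslash escapes whatever comes next.
                 (= ch \\)
                 (assoc st :escaped? true :current (or current ""))

                 ;; Inside quotes, only the matching quote character closes them.
                 open
                 (if (= ch open)
                   (assoc st :open nil)
                   (assoc st :current (str current ch)))

                 ;; An opening quote starts (or continues) an argument.
                 (or (= ch \") (= ch \'))
                 (assoc st :open ch :current (or current ""))

                 ;; Unquoted whitespace ends the current argument, if there is one.
                 (Character/isWhitespace (char ch))
                 (if current
                   (-> st (update :args conj current) (assoc :current nil))
                   st)

                 :else
                 (assoc st :current (str current ch))))
        {:keys [args current]}
        (reduce step {:open nil :escaped? false :current nil :args []} input)]
    ;; Flush the last argument; an unterminated quote is simply accepted.
    (if current (conj args current) args)))

(tokenize-args "foo \"bar baz\" 'fooy barish' foo")
;; => ["foo" "bar baz" "fooy barish" "foo"]
```

Working on characters rather than pre-split blocks means no regex is needed to find the boundaries, at the cost of building each argument up with `str`.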

Michał Marczyk
Jesus. I'm horrified that I have to put that thing in my code. This should be a lot easier than it actually is. :\ Thanks a lot! :D
Rayne
You know, you might want to consider putting this into contrib or a small library or something. Seriously, this could be useful to more than just me.
Rayne
Shouldn't this be true? `(= (split-args "foo bar baz") '("foo" "bar" "baz"))` returns `false`.
Rayne
Ah, right, will fix in a sec. (Might make it a bit DRYer too.)
Michał Marczyk
Well, this is simple enough to fix -- wrap the `str/split` form with `(mapcat #(str/split % #"(?<=\s)|(?=\s)") ...)`. I have however found another bug to do with escaping quotes... will post an updated version once I've got that fixed.
Michał Marczyk
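A small illustration of what the lookaround split from the comment above does, assuming `clojure.string` aliased as `str`; the names `whitespace-boundary` and `split-keeping-whitespace` are hypothetical. Both alternatives in the regex are zero-width, so the split happens at the positions just before and just after each whitespace character, and the whitespace itself survives as a separate block.

```clojure
(require '[clojure.string :as str])

;; Zero-width positions right after or right before a whitespace character.
(def whitespace-boundary #"(?<=\s)|(?=\s)")

(defn split-keeping-whitespace
  "Split s into words and single whitespace characters, losing nothing."
  [s]
  (str/split s whitespace-boundary))

(split-keeping-whitespace "foo bar baz")
;; => ["foo" " " "bar" " " "baz"]
```

Without this step an unquoted run such as `foo bar baz` would reach the parser as a single block, which is why the comparison in the earlier comment returned false. With it, each word is a block of its own and the blank blocks can be skipped outside quotes while still being preserved inside them.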
Yikes. Maybe using a proper parser-generator library would be in order. There's always fnparse, and there are plenty of Java tools. (ANTLR seems pretty good.)
Brian Carper
I'd consider fnparse, but I've never used a parser (in any language), and the documentation is kind of "wut". And there is also the fact that fnparse's project.clj is completely insane and doesn't use the right Clojure and contrib jars...
Rayne
"Yikes" sounds just about right. :-) I took the special-purpose function approach because I thought it would be very simple... then I thought the fix would be very simple... which I guess is true only insofar as it is "simpler" not to have an extra dependency. I should have a bit of time later today to fix this, but I guess I'll also edit in a version based on some parsing library (or simply `c.c.monads`) at some point. No promises about the timing this weekend though...
Michał Marczyk
What sucks about all of this is that it feels like such a simple problem, but turns out to be a horrid problem. By the way, what is the bug you need to fix? You mentioned it, but I can't find it. :<
Rayne
Right. Try `(split-args "\"asdf\\\"asdf\"")`; the `\\\"` should mean "an escaped double quote", but the current version fails to catch this. :-( I've got a cunning idea about how to make the function more convoluted (by a factor of ten) in such a way as to cause it to become robust in the face of such inputs. :-P
Michał Marczyk
I think I might have established a personal record in the "ugliest code snippet" category with the new edit to this answer.
Michał Marczyk
+1  A: 

This bugged me, so I got it working in ANTLR. The grammar below should give you an idea of how to do it. It includes rudimentary support for backslash escape sequences.

Getting ANTLR working in Clojure is too much to write in this text box. I wrote a blog entry about it though.

grammar Cmd;

options {
    output=AST;
    ASTLabelType=CommonTree;
}

tokens {
    DQ = '"';
    SQ = '\'';
    BS = '\\';
}

@lexer::members {
    String strip(String s) {
        return s.substring(1, s.length() - 1);
    }
}

args: arg (sep! arg)* ;
arg : BAREARG
    | DQARG 
    | SQARG
    ;
sep :   WS+ ;

DQARG  : DQ (BS . | ~(BS | DQ))+ DQ
        {setText( strip(getText()) );};
SQARG  : SQ (BS . | ~(BS | SQ))+ SQ
        {setText( strip(getText()) );} ;
BAREARG: (BS . | ~(BS | WS | DQ | SQ))+ ;

WS  :   ( ' ' | '\t' | '\r' | '\n');
Brian Carper