tags:

views:

5869

answers:

4

I'm trying to parse an array of JSON objects into an array of strings in C#. I can extract the array from the JSON object, but I can't split the array string into an array of individual objects.

What I have is this test string:

string json = "{items:[{id:0,name:\"Lorem Ipsum\"},{id:1,name" 
            + ":\"Lorem Ipsum\"},{id:2,name:\"Lorem Ipsum\"}]}";

Right now I'm using the following regular expressions right now to split the items into individual objects. For now they're 2 separate regular expressions until I fix the problem with the second one:

Regex arrayFinder = new Regex(@"\{items:\[(?<items>[^\]]*)\]\}"
                                 , RegexOptions.ExplicitCapture);
Regex arrayParser = new Regex(@"((?<items>\{[^\}]\}),?)+"
                                 , RegexOptions.ExplicitCapture);

The arrayFinder regex works the way I'd expect it but, for reasons I don't understand, the arrayParser regex doesn't work at all. All I want it to do is split the individual items into their own strings so I get a list like this:

{id:0,name:"Lorem Ipsum"}
{id:1,name:"Lorem Ipsum"}
{id:2,name:"Lorem Ipsum"}

Whether this list is a string[] array or a Group or Match collection doesn't matter, but I'm stumped as to how to get the objects split. Using the arrayParser and the json string declared above, I've tried this code which I assumed would work with no luck:

string json = "{items:[{id:0,name:\"Lorem Ipsum\"},{id:1,name" 
            + ":\"Lorem Ipsum\"},{id:2,name:\"Lorem Ipsum\"}]}";

Regex arrayFinder = new Regex(@"\{items:\[(?<items>[^\]]*)\]\}"
                                 , RegexOptions.ExplicitCapture);
Regex arrayParser = new Regex(@"((?<items>\{[^\}]\}),?)+"
                                 , RegexOptions.ExplicitCapture);

string array = arrayFinder.Match(json).Groups["items"].Value;
// At this point the 'array' variable contains: 
// {id:0,name:"Lorem Ipsum"},{id:1,name:"Lorem Ipsum"},{id:2,name:"Lorem Ipsum"}

// I would have expected one of these 2 lines to return 
// the array of matches I'm looking for
CaptureCollection c = arrayParser.Match(array).Captures;
GroupCollection g = arrayParser.Match(array).Groups;

Can anybody see what it is I'm doing wrong? I'm totally stuck on this.

+3  A: 

Are you using .NET 3.5? If so, you can use the DataContractJsonSerializer to parse this out. There is no reason to do this yourself.

If you are not using .NET 3.5, you can use Jayrock:

http://jayrock.berlios.de/

casperOne
Or JavaScriptSerializer (3.5 SP1)
Marc Gravell
JavaScriptSerializer can also be used in .NET 2.0 by installing the ASP.NET 2.0 AJAX Extensions 1.0, using the System.Web.Extensions assembly and System.Web.Script.Serialization namespace
Russ Cam
+18  A: 

Balanced parentheses are literally a textbook example of a language that cannot be processed with regular expressions. JSON is essentially balanced parentheses plus a bunch of other stuff, with the braces replaced by parens. In the hierarchy of formal languages, JSON is a context-free language. Regular expressions can't parse context-free languages.

Some systems offer extensions to regular expressions that kinda-sorta handle balanced expressions. However they're all ugly hacks, they're all unportable, and they're all ultimately the wrong tool for the job.

In professional work, you would almost always use an existing JSON parser. If you want to roll your own for educational purposes then I'd suggest starting with a simple arithmetic grammar that supports + - * / ( ). (JSON has some escaping rules which, while not complex, will make your first attempt harder than it needs to be.) Basically, you'll need to:

  1. Decompose the language into an alphabet of symbols
  2. Write a context-free grammar in terms of those symbols thatrecognizes the language
  3. Convert the grammar into Chomsky normal form, or near enough to make step 5 easy
  4. Write a lexer that converts raw text into your input alphabet
  5. Write a recursive descent parser that takes your lexer's output, parses it, and produces some kind of output

This is a typical third-year CS assignment at just about any university.

The next step is to find out how complex a JSON string you need to trigger a stack overflow in your recursive parser. Then look at the other types of parsers that can be written, and you'll understand why anyone who has to parse a context-free language in the real world uses a tool like yacc or antlr instead of writing a parser by hand.

If that's more learning than you were looking for then you should feel free to go use an off-the-shelf JSON parser, satisified that you learned something important and useful: the limits of regular expressions.

I'm familiar with writing parsers, I was just hoping there would be a way to write a regular expression to parse the string I was working with since it was in a very simple format that never changed. Oh well, I guess using an existing JSON parser is better than trying to roll my own in this case.
Dan Herbert
Perl regular expressions can process balanced parentheses, Perl 5.10 and Perl6 make it easier to do this. http://perldoc.perl.org/perlretut.html#Recursive-patterns
Brad Gilbert
+2  A: 

JSON cannot typically be parsed with regular expressions (certain extremely simplified variants of JSON can, but then they are not JSON but something else).

You need an actual parser to properly parse JSON.

And anyway, why are you trying to parse JSON at all? There are numerous libraries out there which can do it for you, and much better than your code would. Why reinvent the wheel, when there's a wheel factory around the corner with the words FOSS over the door?

Justice
+3  A: 

Balanced parentheses are literally a textbook example of a language that cannot be processed with regular expressions

bla bla bla ... check this out:

arrayParser = "(?<Key>[\w]+)":"?(?<Value>([\s\w\d\.\\\-/:_]+(,[,\s\w\d\.\\\-/:_]+)?)+)"?

this works for me

if you want to match empty values change last '+' to '*'

it's great but it doesn't cover UTF chars.
Serkan Yersen
I had to edit this to get it to work for .NET objects I serialised with DataContractJsonSerializer. Specifically I added quotes for the key and optional quotes for the value:(?<Key>["\w"]+):?(?<Value>(["\s\w\d\.\\\-/:_]+(,[,\s\w\d\.\\\-/:_]+)?)+)
rohancragg