I'm gluing together a number of system calls using the Amazon Elastic MapReduce command line tools. These commands return JSON text which has already been (partially?) escaped. Then, when the system call captures the output as an R character vector (intern=T), it appears to get escaped again. I need to clean this up so it will parse with the rjson package.

I do the system call this way:

system("~/EMR/elastic-mapreduce --describe --jobflow j-2H9P770Z4B8GG", intern=T)

which returns:

 [1] "{"                                                                                             
 [2] "  \"JobFlows\": ["                                                                             
 [3] "    {"                                                                                         
 [4] "      \"LogUri\": \"s3n:\\/\\/emrlogs\\/\","                                                   
 [5] "      \"Name\": \"emrFromR\","                                                                 
 [6] "      \"BootstrapActions\": [" 
...

but the same command from the command line returns:

{
  "JobFlows": [
    {
      "LogUri": "s3n:\/\/emrlogs\/",
      "Name": "emrFromR",
      "BootstrapActions": [
        {
          "BootstrapActionConfig": {
...

If I try to run the results of the system call through rjson, I get this error:

Error: '\/' is an unrecognized escape in character string starting "s3n:\/"

I believe this is because of the double escaping in the s3n line. I'm struggling to get this text massaged into something that will parse.

It might be as simple as replacing "\\" with "\" but since I kinda struggle with regex and escaping, I can't get that done properly.

So how do I take a vector of strings and replace any occurrence of "\\" with "\"? (even to type this question I had to use three back slashes to represent two) Any other tips related to this specific use case?

Here's my code in more detail:

> library(rjson)
> emrJson <- paste(system("~/EMR/elastic-mapreduce --describe --jobflow j-2H9P770Z4B8GG", intern=T))
> 
>     parser <- newJSONParser()
>     for (i in 1:length(emrJson)){
+       parser$addData(emrJson[i])
+     }
> 
> parser$getObject()
Error: '\/' is an unrecognized escape in character string starting "s3n:\/"

and if you're itching to recreate the emrJson object, here's the dput() output:

> dput(emrJson)
c("{", "  \"JobFlows\": [", "    {", "      \"LogUri\": \"s3n:\\/\\/emrlogs\\/\",", 
"      \"Name\": \"emrFromR\",", "      \"BootstrapActions\": [", 
"        {", "          \"BootstrapActionConfig\": {", "            \"Name\": \"Bootstrap 0\",", 
"            \"ScriptBootstrapAction\": {", "              \"Path\": \"s3:\\/\\/rtmpfwblrx\\/bootstrap.sh\",", 
"              \"Args\": []", "            }", "          }", 
"        }", "      ],", "      \"ExecutionStatusDetail\": {", 
"        \"EndDateTime\": 1278124414.0,", "        \"CreationDateTime\": 1278123795.0,", 
"        \"LastStateChangeReason\": \"Steps completed\",", "        \"State\": \"COMPLETED\",", 
"        \"StartDateTime\": 1278124000.0,", "        \"ReadyDateTime\": 1278124237.0", 
"      },", "      \"Steps\": [", "        {", "          \"StepConfig\": {", 
"            \"ActionOnFailure\": \"CANCEL_AND_WAIT\",", "            \"Name\": \"Example Streaming Step\",", 
"            \"HadoopJarStep\": {", "              \"MainClass\": null,", 
"              \"Jar\": \"\\/home\\/hadoop\\/contrib\\/streaming\\/hadoop-0.18-streaming.jar\",", 
"              \"Args\": [", "                \"-input\",", "                \"s3n:\\/\\/rtmpfwblrx\\/stream.txt\",", 
"                \"-output\",", "                \"s3n:\\/\\/rtmpfwblrxout\\/\",", 
"                \"-mapper\",", "                \"s3n:\\/\\/rtmpfwblrx\\/mapper.R\",", 
"                \"-reducer\",", "                \"cat\",", 
"                \"-cacheFile\",", "                \"s3n:\\/\\/rtmpfwblrx\\/emrData.RData#emrData.RData\"", 
"              ],", "              \"Properties\": []", "            }", 
"          },", "          \"ExecutionStatusDetail\": {", "            \"EndDateTime\": 1278124322.0,", 
"            \"CreationDateTime\": 1278123795.0,", "            \"LastStateChangeReason\": null,", 
"            \"State\": \"COMPLETED\",", "            \"StartDateTime\": 1278124232.0", 
"          }", "        }", "      ],", "      \"JobFlowId\": \"j-2H9P770Z4B8GG\",", 
"      \"Instances\": {", "        \"Ec2KeyName\": \"JL 09282009\",", 
"        \"InstanceCount\": 2,", "        \"Placement\": {", 
"          \"AvailabilityZone\": \"us-east-1d\"", "        },", 
"        \"KeepJobFlowAliveWhenNoSteps\": false,", "        \"SlaveInstanceType\": \"m1.small\",", 
"        \"MasterInstanceType\": \"m1.small\",", "        \"MasterPublicDnsName\": \"ec2-174-129-70-89.compute-1.amazonaws.com\",", 
"        \"MasterInstanceId\": \"i-2147b84b\",", "        \"InstanceGroups\": null,", 
"        \"HadoopVersion\": \"0.18\"", "      }", "    }", "  ]", 
"}")
+1  A: 

The general rule seems to be to use twice as many backslashes as you think you need (can't find the source now): one level of escaping is consumed by the R string literal and another by the regular expression.

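library(rjson)
# The regex "\\\\" matches one literal backslash. A lone trailing backslash in
# the replacement is dropped, so this effectively deletes every backslash,
# turning \/ into /.
emrJson <- gsub("\\\\", "\\", emrJson)
parser <- newJSONParser()
# Feed the parser one line at a time.
for (i in seq_along(emrJson)){
    parser$addData(emrJson[i])
}
parser$getObject()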

worked here with your dput output.
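
A narrower alternative, sketched here as an assumption rather than as part of the answer above: in this output the only backslashes sit in front of forward slashes, so a literal (fixed = TRUE) replacement of \/ with / sidesteps the regex escaping rules altogether. The vector sampleJson below is a shortened, hypothetical stand-in for emrJson.

```r
# Hypothetical, shortened stand-in for the emrJson vector from the question.
sampleJson <- c("{", "  \"LogUri\": \"s3n:\\/\\/emrlogs\\/\"", "}")

# fixed = TRUE treats the pattern as a plain string, not a regex, so "\\/"
# (the two characters \ and /) is matched literally and replaced by "/".
cleaned <- gsub("\\/", "/", sampleJson, fixed = TRUE)

# cat() shows the actual characters rather than the escaped representation.
cat(cleaned, sep = "\n")
```

The cleaned vector can then be fed to the parser line by line in the same way as above. Unlike deleting every backslash, this leaves other JSON escapes such as \" or \n untouched, should they ever appear in the output.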

Eduardo Leoni
Thank you very much. In retrospect the answer is pretty straightforward, but it gave me fits. Thanks.
JD Long
A: 

I'm not sure that it is double escaped. Remember that you need to use 'cat' to see what the string is, as opposed to the representation of the string.
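
To make the distinction concrete, here is a small sketch using one value from the question's dput output (the variable name x is just for illustration). Printing shows the escaped representation, while cat and nchar show what the string really contains.

```r
# One LogUri value, typed the way dput shows it.
x <- "s3n:\\/\\/emrlogs\\/"

print(x)      # representation: [1] "s3n:\\/\\/emrlogs\\/"
cat(x, "\n")  # actual characters: s3n:\/\/emrlogs\/
nchar(x)      # 17, so each \/ is two characters, not three
```

So the string holds a single backslash before each slash, exactly as the command line shows it.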

hadley
The question is about parsing the string. Don't you want to parse whatever dput writes out?
Eduardo Leoni
No, because dput gives you the representation of the string, not the string itself.
hadley
Now I see that it says so right there in the help file! However, the opposite command, dget, just parses the (dput created) file directly: eval(parse(file = file)). So I don't know when the distinction comes into play.
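
A rough illustration of where the distinction shows up, assuming deparse() yields the same text that dput() prints for a simple string: the representation is longer than the string, and parsing it as R code removes the extra escaping again. That is why the dput/dget round trip is lossless, while a JSON parser should be given the string itself, not its R representation.

```r
x <- "s3n:\\/\\/emrlogs\\/"

# Text of the representation, as dput would print it for this string.
repr <- deparse(x)

nchar(x)     # 17 characters in the string itself
nchar(repr)  # 22: two surrounding quotes plus one extra backslash per \/

# Parsing the representation as R code gives the original string back.
y <- eval(parse(text = repr))
identical(x, y)  # TRUE
```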
Eduardo Leoni
I think this is a bug in the rjson library - the string looks fine to me.
hadley