views:

72

answers:

1

Hi all,

I want to use R to do string parsing that (I think) is like a simplistic HTML parsing.

For example, let's say we have the following two variables:

Seq <- "GCCTCGATAGCTCAGTTGGGAGAGCGTACGACTGAAGATCGTAAGGtCACCAGTTCGATCCTGGTTCGGGGCA"
Str <- ">>>>>>>..>>>>........<<<<.>>>>>.......<<<<<.....>>>>>.......<<<<<<<<<<<<."

Say that I want to parse "Seq" According to "Str", by using the legend here

Seq: GCCTCGATAGCTCAGTTGGGAGAGCGTACGACTGAAGATCGTAAGGtCACCAGTTCGATCCTGGTTCGGGGCA
Str: >>>>>>>..>>>>........<<<<.>>>>>.......<<<<<.....>>>>>.......<<<<<<<<<<<<.
     |     |  |              | |               |     |               ||     |
     +-----+  +--------------+ +---------------+     +---------------++-----+
        |        Stem 1            Stem 2                 Stem 3         |
        |                                                                |
        +----------------------------------------------------------------+
                                Stem 0

Assume that we always have 4 stems (0 to 3), but that the length of letters before and after each of them can very.

The output should be something like the following list structure:

list(
    "Stem 0 opening" = "GCCTCGA",
    "before Stem 1" = "TA",
    "Stem 1" = list(opening = "GCTC",
                inside = "AGTTGGGA",
                closing = "GAGC"
            ),
    "between Stem 1 and 2" = "G",
    "Stem 2" = list(opening = "TACGA",
                inside = "CTGAAGA",
                closing = "TCGTA"
            ),
    "between Stem 2 and 3" = "AGGtC",
    "Stem 3" = list(opening = "ACCAG",
                inside = "TTCGATC",
                closing = "CTGGT"
            ),
    "After Stem 3" = "",
    "Stem 0 closing" = "TCGGGGC"
)

I don't have any experience with programming a parser, and would like advices as to what strategy to use when programming something like this (and any recommended R commands to use).

What I was thinking of is to first get rid of the "Stem 0", then go through the inner string with a recursive function (let's call it "seperate.stem") that each time will split the string into: 1. before stem 2. opening stem 3. inside stem 4. closing stem 5. after stem

Where the "after stem" will then be recursively entered into the same function ("seperate.stem")

The thing is that I am not sure how to try and do this coding without using a loop.

Any advices will be most welcomed.

Update: someone sent me a bunch of question, here they are.

Q: Does each sequence have the same number of ">>>>" for the opening sequence as it does for "<<<<" on the ending sequence?
A: Yes

Q: Does the parsing always start with a partial stem 0 as your example shows? A: No. Sometimes it will start with a few "."

Q: Is there a way of making sure you have the right sequences when you start? A: I am not sure I understand what you mean.

Q: Is there a chance of error in the middle of the string that you have to restart from? A: Sadly, yes. In which case, I'll need to ignore one of the inner stems...

Q: How long are these strings that you want to parse? A: Each string has between 60 to 150 characters (and I have tens of thousands of them...)

Q: Is each one a self contained sequence like you show in your example, or do they go on for thousands of characters? A: each sequence is self contained.

Q: Is there always at least one '.' between stems?
A: No.

Q: A full set of rules as to how the parsing should be done would be useful. A: I agree. But since I don't have even a basic idea on how to start coding this, I thought first to have some help on the beginning and try to tweak with the other cases that will come up before turning back for help.

Q: Do you have the BNF syntax for parsing? A: No. Your e-mail is the first time I came across it (http://en.wikipedia.org/wiki/Backus–Naur_Form).

+2  A: 

You can simplify the task by using run length encoding.

First, convert Str to be a vector of individual characters, then call rle.

split_Str <- strsplit(Str, "")[[1]]
rle_Str <- rle(split_Str)

Run Length Encoding
  lengths: int [1:14] 7 2 4 8 4 1 5 7 5 5 ...
  values : chr [1:14] ">" "." ">" "." "<" "." ">" "." "<" "." ">" "." "<" "."

Now you just need to parse rle_Str$values, which is perhaps simpler. For instance, an inner stem will always look like ">" "." "<".

I think the main thing that you need to think about is the structure of the data. Does a "." always have to come between ">" and "<", or is it optional? Can you have a "." at the start? Do you need to be able to generalise to stems within stems within stems, or even more complex structures?

Once you have this solved, contructing your list output should be straightforward.

Also, don't worry about using loops, they are in the language because they are useful. Get the thing working first, then worry about speed optimisations (if you really have to) afterwards.

Richie Cotton