edit:

I need advice on best way to search with regex in vim and extract any matches that are discovered.

end edit.

I have a csv file that looks something like this:

two fields: id and description


0g98932,"long description sometimes containing numbers like 1234567, or 0000012345 and even BR00012345 but always containing text"

I need to search the description field on each row. If a number matching \d{10} exists in the second field, I want to pull it out.

Doing something like :% s/(\d{10})/^$1/g gives me a "Pattern not found: (\d{10})" error.

I've never learned how to grab and reference a match from a regex search in vim - so that's part of the problem.

the other part:

I would really like to either

A) delete everything other than the first 7 digit id and the matches

or

B) copy the id and the matches to another file - or to the top of the current file (somewhere - anywhere just to separate the matches from the unfiltered data)

A: 

To capture a match, you have to wrap the pattern in escaped parentheses:

\(pattern\)

To delete everything around the match, use a substitution that keeps only the captured group:

:%s/not_pattern\(pattern\)another_not_pattern/\1/
Mykola Golubyev
+1  A: 

Maybe something like this: s/([^,]+)(?:\D*(\d{10}))+/\1,\2,\3/g

I.e., capture non-comma characters, then capture groups of 10 numbers that may or may not be preceded by non-numeric characters. Replace with captured values.

Ordinarily, I would probably write a script (outside of vim) to loop through captures so I could be sure of the count in a given line.
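
As a rough sketch, such a script might look like this in Python. It assumes the two-column layout from the question (id, then a quoted description). The function name extract_ten_digit_numbers is hypothetical, the second sample row is made up for illustration, and the lookarounds that reject runs longer than ten digits are an assumption, since the question only says \d{10}.

```python
import csv
import io
import re

# Exactly ten digits, not part of a longer digit run (an assumption;
# the question itself only specifies \d{10}).
TEN_DIGITS = re.compile(r"(?<!\d)\d{10}(?!\d)")


def extract_ten_digit_numbers(csv_text):
    """Return [id, match1, match2, ...] for every row with at least one match."""
    results = []
    for row in csv.reader(io.StringIO(csv_text)):
        if len(row) < 2:
            continue  # skip blank or malformed rows
        record_id, description = row[0], row[1]
        matches = TEN_DIGITS.findall(description)
        if matches:
            results.append([record_id] + matches)
    return results


# First row is the example from the question; the second row is invented.
sample = (
    '0g98932,"long description sometimes containing numbers like 1234567, '
    'or 0000012345 and even BR00012345 but always containing text"\n'
    '0g98932,"long description no numbers"\n'
)
for record in extract_ten_digit_numbers(sample):
    print(",".join(record))
```

Because findall returns every match in the row, the number of captures per line no longer has to be known in advance, which is the part that is awkward to express in a single substitution.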

Questions:
--Is there ever more than one \d{10} match in the description? Should they be comma-separated in the output?
--Your example shows BR00012345 as a possible number, yet BR is clearly non-numeric. Is there a finite list of prefixes, or are they always \D{2}\d{8}, or what?

steamer25
+2  A: 

The important thing to know about vim regexes is that different levels of escaping are required (as opposed to, say, regexes in Perl or Ruby).

From :help /\m

after:    \v     \m       \M        \V    matches
                 'magic'  'nomagic'
          $      $        $         \$    matches end-of-line
          .      .        \.        \.    matches any character
          *      *        \*        \*    any number of the previous atom
          ()     \(\)     \(\)      \(\)  grouping into an atom
          |      \|       \|        \|    separating alternatives
          \a     \a       \a        \a    alphabetic character
          \\     \\       \\        \\    literal backslash
          \.     \.       .         .     literal dot
          \{     {        {         {     literal '{'
          a      a        a         a     literal 'a'

The default setting is 'magic', so to make the regex you gave work, you'd have to use:

:%s/".*\(\d\{10}\).*"/\1/

If you want to delete everything other than the first 7 digit id and the matches (by which I assume you mean that you want to delete lines without any match)

:v/^\([[:alnum:]]\{7}\),\s*".*\(\d\{10}\).*/d
:%s//\1,\2/

The :v/<pattern>/ command allows you to run a command on each line that doesn't match the given pattern, so this just deletes the non-matches. :s// reuses the prior pattern, so we don't have to specify it.

This transforms the following:

0g98932,"long description sometimes containing numbers like 0123456789"
0g98932,"long description no numbers"
0g98932,"long description no numbers"
0g98932,"long description sometimes containing numbers like 0123456789"
0g98932,"long description no numbers"
0g98932,"long description no numbers"
0g98932,"long description no numbers"
0g98932,"long description no numbers"
0g98932,"long description sometimes containing numbers like 0123456789"
0g98932,"long description no numbers"
0g98932,"long description no numbers"
0g98932,"long description sometimes containing numbers like 0123456789"

into this:

0g98932,0123456789
0g98932,0123456789
0g98932,0123456789
0g98932,0123456789
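
To sanity-check the pattern outside Vim, here is a rough Python equivalent of the :v and :s commands above. The translation into Python's re syntax is an assumption on my part (with [[:alnum:]] written as [0-9A-Za-z]), filter_and_reduce is a hypothetical name, and the third sample line is invented to show one edge case: because .* is greedy, a line with several ten-digit runs keeps only the last one.

```python
import re

# Python spelling of the Vim pattern ^\([[:alnum:]]\{7}\),\s*".*\(\d\{10}\).*
PATTERN = re.compile(r'^([0-9A-Za-z]{7}),\s*".*(\d{10}).*')


def filter_and_reduce(lines):
    kept = []
    for line in lines:
        # Like :v/pattern/d, drop every line that does not match.
        if PATTERN.search(line):
            # Like :%s//\1,\2/, replace the matched text with the two groups.
            kept.append(PATTERN.sub(r"\1,\2", line))
    return kept


lines = [
    '0g98932,"long description sometimes containing numbers like 0123456789"',
    '0g98932,"long description no numbers"',
    # Invented line: two ten-digit runs, only the last survives the greedy .*
    '0g98932,"numbers like 1111111111 and 2222222222"',
]
for line in filter_and_reduce(lines):
    print(line)
```

If every match on a line is needed rather than just the last, a single substitution like this is not enough and looping over the matches is the simpler route.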
rampion
Thanks rampion. Your explanations are very helpful. I did not know about escaping the leading { in \d\{10}, yet knew something must be out of order. I had tried \d\{10\}, which didn't work to match the simplest pattern. I'll take a look at your examples and start working them into my solution.
42