I have a requirement to grep a string or pattern (together with around 200 characters before and after it) from a file that consists of one extremely long line. The file contains streams of data (market trading data) coming from a remote server and being appended onto this single line.

I know that I can match lines containing a specific pattern using grep (or other tools), but once I have such lines, how can I extract a portion of the line? I want to grab the part of the line with the pattern plus roughly 200 characters before and after the pattern. I would be especially interested in answers using...(supply tools or languages you're comfortable with here).

+1  A: 

(.{0,200}(pattern).{0,200}), or something?

Svish
+5  A: 

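If what you need is the 200 characters before and after the expression plus the expression itself, then you are looking at the following (it only matches when at least 200 characters exist on each side; write .{0,200} to also match closer to either end of the line):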

/.{200}aaa.{200}/

If you need captures for each (allowing you to extract each part as a unit), then you use this regexp:

/(.{200})(aaa)(.{200})/
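
A minimal sketch of the same idea in Python, assuming the whole line fits in memory. The function name `grep_context` and the sample data are made up for illustration. It uses `.{0,N}` rather than `.{N}` so that matches closer than N characters to either end of the line are still found.

```python
import re

def grep_context(text, pattern, width=200):
    """Return each match of `pattern` with up to `width` chars on both sides."""
    # re.DOTALL lets '.' cross newlines; harmless for a single-line file.
    rx = re.compile(r".{0,%d}(?:%s).{0,%d}" % (width, pattern, width), re.DOTALL)
    return [m.group(0) for m in rx.finditer(text)]

line = "abc def ghi jkl mno pqr"
print(grep_context(line, "ghi", width=4))  # ['def ghi jkl']
```

Matches are non-overlapping, so two occurrences that sit within `width` characters of each other may come back as one chunk.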
Pinochle
Yep, that looks pretty good
xxxxxxx
+1  A: 

Is this what you want (in C)?
If it is, feel free to adapt to your specific needs.

#include <stdio.h>
#include <string.h>

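/* Print the first occurrence of needle plus up to `padding` chars on each side. */
void prt_grep(const char *haystack, const char *needle, int padding) {
  const char *ptr, *start, *finish, *end;
  ptr = strstr(haystack, needle);
  if (!ptr) return;
  /* Clamp by offset so we never form a pointer before the buffer (undefined behavior). */
  start = (ptr - haystack > padding) ? ptr - padding : haystack;
  end = haystack + strlen(haystack);
  finish = ptr + strlen(needle);
  finish = (end - finish > padding) ? finish + padding : end;
  for (ptr = start; ptr < finish; ptr++) putchar(*ptr);
  putchar('\n');
}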

int main(void) {
  const char *longline = "123456789 ASDF 123456789";
  const char *pattern = "ASDF";

  prt_grep(longline, pattern, 5); /* you want 200 */
  return 0;
}
pmg
Congratulations, you have just reinvented the wheel!
xxxxxxx
Is there a function to do this in the Standard C Library? In the POSIX C library? If there is, the best option is to ignore my answer and use the solution provided by the library. Also, [perl] [bash] [python] [php] [c] is an awful lot of language tags.
pmg
+3  A: 

If your grep has -o then that will output only the matched part.

 echo "abc def ghi jkl mno pqr" | grep -oE ".{4}ghi.{4}"

produces:

def ghi jkl
Dennis Williamson
Nice and direct.
Telemachus
A: 

I think I might approach the problem by matching the part of the string I need, then using the match position as the starting point for the substring extraction. In Perl, once a regex with the /g flag succeeds in scalar context, the pos built-in tells you where the match left off:

 if( $long_string =~ m/$regex/g ) {
      # pos() is the offset just past the match, so this grabs the 200 characters after it
      $substring = substr( $long_string, pos( $long_string ), 200 );
      }

I tend to write my programs in Perl instead of doing everything in the regular expression. There's nothing particularly special about Perl in this case.
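
The match-then-slice approach carries over to other languages. Here is a rough Python sketch, assuming the line is already in memory; `context_by_position` is a hypothetical helper name, and unlike the Perl snippet it takes context on both sides of the match by using the match's start and end offsets.

```python
import re

def context_by_position(text, pattern, width=200):
    """Slice `width` chars of context around the first match, or return None."""
    m = re.search(pattern, text)
    if m is None:
        return None
    start = max(m.start() - width, 0)       # clamp at the beginning of the line
    end = min(m.end() + width, len(text))   # clamp at the end of the line
    return text[start:end]

print(context_by_position("123456789 ASDF 123456789", "ASDF", width=5))
# 6789 ASDF 1234
```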

brian d foy
A: 

I think this may be more basic than everybody is thinking, correct me if I'm wrong... Do you want to print before and after the string excluding the string?

awk -F "ASDF" '{print "Before ASDF " $1 "\n" "After ASDF " $2}' "$FILE"

This will print something like:

Before ASDF blablabla

After ASDF blablablabla

Change it to match your needs: remove the "\n" and/or the "Before..." and "After..." labels.

Do you want to suppress the string from the file? This will replace the first occurrence of the string on each line with a blank space (append the g flag to replace every occurrence); again, change it to whatever you need.
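
For comparison, a small Python sketch of the same "text before and after, without the string itself" split, trimmed to a fixed width on each side. `split_around` is a made-up name, and the sketch assumes a literal (non-regex) search string and a `width` greater than zero.

```python
def split_around(text, needle, width=200):
    """Return (before, after) around the first `needle`, excluding it, or None."""
    before, found, after = text.partition(needle)
    if not found:
        return None
    # Keep at most `width` characters on each side (width must be > 0).
    return before[-width:], after[:width]

print(split_around("blablabla ASDF blablablabla", "ASDF", width=5))
```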

sed -i 's/ASDF/\ /' longstring.txt

HTH