I want to print

userId = 1234
userid = 12345
timestamp = 88888888
js = abc

from my data

messssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssss
<input name="userId" value="1234" type="hidden"> messsssssssssssssssssss
<input name="userid" value="12345" type="hidden"> messssssssssssssssssss
<input name="timestamp" value="88888888" type="hidden"> messssssssssssss
<input name="js" value="abc" type="hidden"> messssssssssssssssssssssssss
messssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssss

How can I do this with AWK (or whatever)? Assume that my data is stored in the "$info" variable (single-line data).

Edit: by "single-line data" I mean that all of the data is on one line, like this:

messss...<input name="userId" value="1234" type="hidden">messsss...<input ....>messssssss

So I can't use grep to extract the section of interest.

+4  A: 

I'm not sure I understand your "single line data" comment but if this is in a file, you can just do something like:

cat file
    | grep '^<input '
    | sed 's/^<input name="//'
    | sed 's/" value="/ = /'
    | sed 's/".*$//'

Here's the cut'n'paste version:

cat file | grep '^<input ' | sed 's/^<input name="//' | sed 's/" value="/ = /' | sed 's/".*$//'

This turns:

messssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssss
<input name="userId" value="1234" type="hidden"> messsssssssssssssssssss
<input name="userid" value="12345" type="hidden"> messssssssssssssssssss
<input name="timestamp" value="88888888" type="hidden"> messssssssssssss
<input name="js" value="abc" type="hidden"> messssssssssssssssssssssssss
messssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssss

quite happily into:

userId = 1234
userid = 12345
timestamp = 88888888
js = abc

The grep simply extracts the lines you want, while the sed commands respectively:

  • strip off up to the first quote.
  • replace the section between the name and value with an "=".
  • remove everything following the value closing quote (including that quote).
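
The three steps above can also be folded into a single sed call that does the line selection itself. This is only a minimal sketch, not part of the original answer: the function name extract_pairs and the shortened sample data are made up for illustration, and it still assumes one tag per line, each starting with `<input name=`.

```shell
# Shortened sample data, one tag per line (hypothetical, modelled on the question).
data='messsss
<input name="userId" value="1234" type="hidden"> messsss
<input name="js" value="abc" type="hidden"> messsss
messsss'

# extract_pairs is a hypothetical helper name.
# -n suppresses automatic printing, so only lines that reach "p" are shown.
extract_pairs() {
    sed -n '/^<input /{
s/^<input name="//
s/" value="/ = /
s/".*$//
p
}'
}

result=$(printf '%s\n' "$data" | extract_pairs)
printf '%s\n' "$result"
# prints:
# userId = 1234
# js = abc
```

The address /^<input / takes over the job of grep, and the braces group the three substitutions so they run only on matching lines.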
paxdiablo
+1 You can perform multiple operations with a single call to sed using the `-e` switch, i.e. `sed -e 's/^<input name="//' -e ...`
soulmerge
Formatting your code like that makes it really hard to cut and paste. Put the "|" at the end of each line; then the shell knows to continue the pipeline.
glenn jackman
@glenn, I tend to write for readability but I'll take that on board.
paxdiablo
+2  A: 

To process variables that contain more than one line, you need to put the variable name in double quotes:

echo "$info"|sed 's/^\(<input\( \)name\(=\)"\([^"]*\)" value="\([^"]*\)"\)\?.*/\4\2\3\2\5/'
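
To see why the quotes matter, here is a small sketch assuming a POSIX-style shell such as sh or bash with the default IFS. The variable contents are made up for illustration.

```shell
# A two-line value (hypothetical sample).
info='first line
second line'

# Unquoted: the shell splits the value into words, and echo joins them
# with single spaces, so the newline is lost.
unquoted=$(echo $info)

# Quoted: the value is passed as one argument and the newline survives.
quoted=$(echo "$info")

printf '%s\n' "$unquoted"
# prints:
# first line second line

printf '%s\n' "$quoted"
# prints:
# first line
# second line
```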
soulmerge
+3  A: 

This part should probably be a comment on Pax's answer, but it got a bit long for that little box. I'm thinking 'single line data' means you don't have any newlines in your variable at all? Then this will work:

echo "$info" | sed -n -r '/<input/s/<input +name="([^"]+)" +value="([^"]+)"[^>]*>[^<]*/\1 = \2\n/gp'

Notes on interesting bits:

  • -n means don't print by default - we'll say when to print with that p at the end.

  • -r means extended regex

  • /<input/ at the beginning makes sure we don't even bother to work on lines that don't contain the desired pattern

  • That \n at the end is there to ensure all records end up on separate lines - any original newlines will still be there, and the fastest way to get rid of them is to tack on a '| grep .' on the end - you could use some sed magic but you wouldn't be able to understand it thirty seconds after you typed it in.

I can think of ways to do this in awk, but this is really a job for sed (or perl!).
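
If the data really is one long line, another option is to isolate each tag before sed ever sees it. The following is a rough sketch rather than a drop-in replacement for the command above: it assumes a grep that supports -o (GNU and BSD grep both do, but the option is not in POSIX), that name comes directly before value inside each tag, and the sample value of info is shortened and made up.

```shell
# Single-line sample data (hypothetical, shortened from the question).
info='messss<input name="userId" value="1234" type="hidden">messsss<input name="js" value="abc" type="hidden">messss'

# grep -o prints every match on its own line, so the surrounding
# "messss" text is dropped before sed runs.
result=$(printf '%s\n' "$info" |
    grep -o '<input [^>]*>' |
    sed -n 's/^<input  *name="\([^"]*\)"  *value="\([^"]*\)".*/\1 = \2/p')

printf '%s\n' "$result"
# prints:
# userId = 1234
# js = abc
```

Because every tag arrives on its own line, the sed expression needs neither the g flag nor the \n in the replacement.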

Jefromi
+2  A: 

Using Perl:

cat file | perl -ne 'print($1 . "=" . $2 . "\n") if(/name="(.*?)".*value="(.*?)"/);'
johnB
I got only userId = 1234
bugbug
+1  A: 

IMO, parsing HTML should be done with a proper HTML/XML parser. For example, Ruby has an excellent package, Nokogiri, for parsing HTML/XML:

ruby -e '
    require "rubygems"
    require "nokogiri"
    doc = Nokogiri::HTML.parse(ARGF.read)
    doc.search("//input").each do |node|
        atts = node.attributes
        puts "%s = %s" % [atts["name"], atts["value"]]
    end
' mess.html

produces the output you're after

glenn jackman
A: 

AWK:

BEGIN {
  # Use record separator "<", instead of "\n".
  RS = "<"
  first = 1
}

# Skip the first record, as that begins before the first tag
first {
  first = 0
  next
}

/^input[^>]*>/ { #/
  # make sure we don't match outside of the tag
  end = match($0,/>/)

  # locate the name attribute
  pos = match($0,/name="[^"]*"/)
  if (pos == 0 || pos > end) { next }
  name = substr($0,RSTART+6,RLENGTH-7)

  # locate the value attribute
  pos = match($0,/value="[^"]*"/)
  if (pos == 0 || pos > end) { next }
  value = substr($0,RSTART+7,RLENGTH-8)

  # print out the result
  print name " = " value
}
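
To try the same idea directly on the single-line "$info" variable from the question, a condensed variant can be run inline from the shell. This is a sketch under assumptions, not the script above verbatim: the function name extract_inputs and the sample data are made up, and it trims each record to the text inside the tag instead of comparing match positions.

```shell
# Single-line sample data (hypothetical, shortened from the question).
info='messss<input name="userId" value="1234" type="hidden">messsss<input name="js" value="abc" type="hidden">messss'

# extract_inputs is a hypothetical helper name.
extract_inputs() {
    awk '
    BEGIN { RS = "<" }                 # each record starts just after a "<"
    /^input[ \t]/ {
        tag = $0
        sub(/>.*/, "", tag)            # keep only the text inside the tag
        if (match(tag, /name="[^"]*"/) == 0) next
        name = substr(tag, RSTART + 6, RLENGTH - 7)
        if (match(tag, /value="[^"]*"/) == 0) next
        value = substr(tag, RSTART + 7, RLENGTH - 8)
        print name " = " value
    }'
}

result=$(printf '%s\n' "$info" | extract_inputs)
printf '%s\n' "$result"
# prints:
# userId = 1234
# js = abc
```

Since the record separator is "<", the text before the first tag becomes a record that simply fails the /^input/ test, so no explicit skip of the first record is needed in this variant.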
MizardX
A: 

Tools like awk and sed can be used together with XMLStarlet and HTML Tidy to parse HTML.

Mark Edgar