I want to extract https://www.sth.com/yment/Paymentform.aspx from the string below:

<form id='paymentUTLfrm' action='https://www.sth.com/yment/Paymentform.aspx' method='post'>

How can I do it with a regex or something similar?

+3  A: 

Use Html Agility Pack. It will save you a lot of trouble in the long run.

using HtmlAgilityPack;
var doc = new HtmlDocument();
doc.LoadHtml("<form id='paymentUTLfrm' action='https://www.sth.com/yment/Paymentform.aspx' method='post'>");
// The XPath id() function selects the element whose id is 'paymentUTLfrm'.
var form = doc.DocumentNode.SelectSingleNode("id('paymentUTLfrm')");
string action = form.Attributes["action"].Value;

It supports loading pages directly from the web, as well as XPath (used above). The HTML does not have to be valid.

EDIT: If you want to use the name:

doc.DocumentNode.SelectSingleNode("//*[@name='paymentUTLfrm']");
Matthew Flaschen
Great, thanks. If the tag has no `id`, how can we do that? For example, if we only have a `name` attribute instead.
Mohammad
Is `SelectSingleNode` case-sensitive?
Mohammad
@Mohammad, in HtmlAgilityPack, the `id` is case-insensitive. To make the `@name` match case-insensitive, you can use a hack [like this](http://stackoverflow.com/questions/2279513/how-can-i-create-a-nokogiri-case-insensitive-xpath-selector/2279611#2279611).
Matthew Flaschen
A: 

While I would agree that general HTML parsing is best done with Html Agility Pack (etc.) rather than with regex, this is a pretty simple requirement and a regex would be appropriate. I am no regex expert, but this one works:

action=["'](.*)["']

The (.*) will capture the URL.

Maybe some expert can add a comment to refine this...
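
To check how the `(.*)` group behaves on the sample string, here is a minimal C# sketch that compares it with a lazy `(.*?)` variant. The class and method names (`GreedyVsLazy`, `Greedy`, `Lazy`) are made up for illustration, and the lazy pattern is an assumption about one possible refinement, not a complete fix.

```csharp
using System;
using System.Text.RegularExpressions;

public static class GreedyVsLazy
{
    // The sample string from the question.
    public const string Html =
        "<form id='paymentUTLfrm' action='https://www.sth.com/yment/Paymentform.aspx' method='post'>";

    // Greedy: (.*) extends to the last quote character in the input.
    public static string Greedy(string input) =>
        Regex.Match(input, @"action=[""'](.*)[""']").Groups[1].Value;

    // Lazy: (.*?) stops at the first quote character after the URL starts.
    public static string Lazy(string input) =>
        Regex.Match(input, @"action=[""'](.*?)[""']").Groups[1].Value;

    public static void Main()
    {
        Console.WriteLine(Greedy(Html));
        Console.WriteLine(Lazy(Html));
    }
}
```

On the sample string, the greedy version captures everything up to the quote after `post`, while the lazy version captures only the URL.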

Ray
Your regex is greedy and will cause problems.
NullUserException
Specifically, for this example it will match `action='https://www.sth.com/yment/Paymentform.aspx' method='post'`
Matthew Flaschen
Hence the disclaimer and request for refinement - I don't think providing an idea for an alternate approach, even if not perfect, is cause for a downvote. @Matthew - I really think Html Agility Pack is overkill for the OP's need. @Null - your regex actually works (as compared to mine) - I learned something - thanks.
Ray
+4  A: 

While I don't encourage using regex to parse HTML, this is simple enough that a regex will suffice. For more complex operations, do use a proper (X)HTML parser like HtmlAgilityPack.

This regex should work:

<\s*form[^>]*\s+action=(["'])(.*?)\1

EDIT:

Updated regex so it will work with apostrophes in URLs. Note that the URL is now in the 2nd capture group.

See it on rubular
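
For use from C#, a minimal sketch of reading the second capture group might look like this. The class and method names (`FormActionExtractor`, `TryExtractAction`) are hypothetical, and `RegexOptions.IgnoreCase` is an added assumption so that `<FORM ACTION=...>` would also match.

```csharp
using System;
using System.Text.RegularExpressions;

public static class FormActionExtractor
{
    // Group 1 captures the opening quote, \1 requires the same quote to close,
    // and group 2 holds the URL between them.
    private static readonly Regex ActionPattern = new Regex(
        @"<\s*form[^>]*\s+action=([""'])(.*?)\1",
        RegexOptions.IgnoreCase);

    // Returns true and sets 'action' when a form tag with an action attribute is found.
    public static bool TryExtractAction(string html, out string action)
    {
        Match match = ActionPattern.Match(html);
        action = match.Success ? match.Groups[2].Value : string.Empty;
        return match.Success;
    }

    public static void Main()
    {
        const string html =
            "<form id='paymentUTLfrm' action='https://www.sth.com/yment/Paymentform.aspx' method='post'>";

        if (TryExtractAction(html, out string action))
        {
            Console.WriteLine(action);
        }
    }
}
```

The pattern is written as a verbatim string (`@"..."`) so that regex escapes reach the regex engine unchanged. In a regular C# string literal, `\b` would be read as a backspace character rather than a word boundary.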

NullUserException
Your expression selects the entire form tag! What about this one: `action="[^"]*"`?
Mohammad
Your regex is wrong for two reasons: ① Consider `<form action="blah'blah">` (yes, apostrophes are valid in URLs). ② Consider `<form action='x' m:special-action='x'>`.
Timwi
@Mohammad That's what the capture group is there for. See it on [rubular](http://rubular.com/r/vNFS2AK0Ta): on the right-hand side there is a div with "match captures." Use those capture groups.
NullUserException
Yeah, but not the right x (bad example). See [this example](http://rubular.com/r/aPYkNs29Wz).
Matthew Flaschen
@Matthew Flaschen: Thanks!
Timwi
@Timwi Fixed, although I've *never* seen either of those "in the wild."
NullUserException
OK, what other machinations do you have for my regex?
NullUserException
@NullUserException How can I extract the capture group with C#?
Mohammad
When you have a match, access `.Groups[2]`.
Matthew Flaschen
@Moha See it in action [here](http://ideone.com/HDNxt)
NullUserException
@Matthew Flaschen, I've used this code `var action = Regex.Match(form, "<\\s*form[^>]*\baction=[\"']([^'\"]+)['\"]", RegexOptions.IgnoreCase).Groups; foreach (var item in action) {Console.Write(item);}` but it returns nothing!
Mohammad