ansaurus

Question

Extract action attribute in a Form tag with Regex in C#?

Answer 1

+3 A:

Use Html Agility Pack. It will save you a lot of trouble in the long run.

using HtmlAgilityPack;

var doc = new HtmlDocument();
doc.LoadHtml("<form id='paymentUTLfrm' action='https://www.sth.com/yment/Paymentform.aspx' method='post'>");
// The XPath id() function selects the element whose id attribute is 'paymentUTLfrm'.
var form = doc.DocumentNode.SelectSingleNode("id('paymentUTLfrm')");
string action = form.Attributes["action"].Value;

It supports loading pages directly from the web, as well as XPath (used above). The HTML does not have to be valid.

EDIT: If you want to use the name:

doc.DocumentNode.SelectSingleNode("//*[@name='paymentUTLfrm']");

Matthew Flaschen 2010-08-21 17:35:45

Great, thanks. If the tag has no `id`, how can we do that? For example, if we only have a `name` attribute instead.

Mohammad 2010-08-21 17:55:32

Is `SelectSingleNode` case-sensitive?

Mohammad 2010-08-21 19:03:17

@Mohammad, in HtmlAgilityPack, the `id` is case-insensitive. To make the `@name` match case-insensitive, you can use a hack [like this](http://stackoverflow.com/questions/2279513/how-can-i-create-a-nokogiri-case-insensitive-xpath-selector/2279611#2279611).

Matthew Flaschen 2010-08-21 19:54:23

Answer 2

A:

While I agree that general HTML parsing is best done with Html Agility Pack (or similar) rather than with regex, this is a pretty simple requirement and a regex would be appropriate. I am no regex expert, but this one works:

action=["'](.*)["']

The `(.*)` will capture the URL.

Maybe some expert can add a comment to refine this...

Ray 2010-08-21 17:43:58

Your regex is greedy and will cause problems.

NullUserException 2010-08-21 17:46:50

Specifically, for this example it will match `action='https://www.sth.com/yment/Paymentform.aspx' method='post'`

Matthew Flaschen 2010-08-21 17:48:14
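To make the greediness concrete, here is a minimal sketch using .NET's `System.Text.RegularExpressions`. The class and method names (`GreedyDemo`, `Capture`) are made up for illustration. The first pattern is the one from this answer; the second is a lazy variant of it, shown only for comparison.

```csharp
using System;
using System.Text.RegularExpressions;

public static class GreedyDemo
{
    // Returns the first capture group of the pattern, or null when nothing matches.
    public static string Capture(string html, string pattern)
    {
        Match m = Regex.Match(html, pattern);
        return m.Success ? m.Groups[1].Value : null;
    }

    public static void Main()
    {
        string html = "<form id='paymentUTLfrm' action='https://www.sth.com/yment/Paymentform.aspx' method='post'>";

        // Greedy: (.*) runs on to the last quote in the input.
        // Prints: https://www.sth.com/yment/Paymentform.aspx' method='post
        Console.WriteLine(Capture(html, @"action=[""'](.*)[""']"));

        // Lazy: (.*?) stops at the first quote after the value starts.
        // Prints: https://www.sth.com/yment/Paymentform.aspx
        Console.WriteLine(Capture(html, @"action=[""'](.*?)[""']"));
    }
}
```

The lazy variant still accepts mismatched quotes (for example `action="x'`), so it is a comparison point rather than a recommended pattern.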

Hence the disclaimer and request for refinement. I don't think providing an idea for an alternate approach, even if not perfect, is cause for a downvote. @Matthew - I really think Html Agility Pack is overkill for the OP's need. @Null - your regex actually works (as compared to mine). I learned something, thanks.

Ray 2010-08-21 17:57:43

Answer 3

+4 A:

While I don't encourage using regex to parse HTML, this is simple enough that a regex will suffice. For more complex operations, do use a proper (X)HTML parser like HtmlAgilityPack.

This regex should work:

<\s*form[^>]*\s+action=(["'])(.*?)\1

EDIT:

Updated regex so it will work with apostrophes in URLs. Note that the URL is now in the 2nd capture group.

See it on rubular

NullUserException 2010-08-21 17:50:33

Your expression selects the entire form tag! What about this one: `action="[^"]*"`

Mohammad 2010-08-21 18:03:47

Your regex is wrong for two reasons: ① Consider `<form action="blah'blah">` (yes, apostrophes are valid in URLs). ② Consider `<form action='x' m:special-action='x'>`.

Timwi 2010-08-21 18:06:31

@Mohammad That's what that capture group is there for. See on [rubular](http://rubular.com/r/vNFS2AK0Ta), on the right-hand side there is a div with "match captures." Use those capture groups.

NullUserException 2010-08-21 18:06:54

Yeah, but not the right x (bad example). See [this example](http://rubular.com/r/aPYkNs29Wz).

Matthew Flaschen 2010-08-21 18:09:55

@Matthew Flaschen: Thanks!

Timwi 2010-08-21 18:11:13

@Timwi Fixed, although I've *never* seen either of those "in the wild."

NullUserException 2010-08-21 18:19:34

OK, what other machinations do you have for my regex?

NullUserException 2010-08-21 18:20:12

@NullUserException, how can I extract the capture group with C#?

Mohammad 2010-08-21 18:21:13

When you have a match, access `.Groups[2]`.

Matthew Flaschen 2010-08-21 18:25:59
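As a rough sketch of what accessing `.Groups[2]` might look like in C#, the snippet below wraps the regex from this answer in a small helper. The names `FormActionExtractor` and `ExtractAction` are hypothetical and chosen only for this example.

```csharp
using System;
using System.Text.RegularExpressions;

public static class FormActionExtractor
{
    // Pattern from this answer. Group 1 is the opening quote, group 2 is the URL.
    // A verbatim string (@"...") keeps the backslashes intact; "" is an escaped quote.
    private static readonly Regex ActionPattern = new Regex(
        @"<\s*form[^>]*\s+action=([""'])(.*?)\1",
        RegexOptions.IgnoreCase);

    // Returns the action URL, or null if no form tag with a quoted action is found.
    public static string ExtractAction(string html)
    {
        Match match = ActionPattern.Match(html);
        return match.Success ? match.Groups[2].Value : null;
    }

    public static void Main()
    {
        // Prints: https://www.sth.com/yment/Paymentform.aspx
        Console.WriteLine(ExtractAction(
            "<form id='paymentUTLfrm' action='https://www.sth.com/yment/Paymentform.aspx' method='post'>"));

        // An apostrophe inside a double-quoted value is kept. Prints: blah'blah
        Console.WriteLine(ExtractAction("<form action=\"blah'blah\">"));
    }
}
```

Checking `match.Success` before reading the group avoids silently treating an empty string as a URL when the input contains no matching form tag.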

@Moha See it in action [here](http://ideone.com/HDNxt)

NullUserException 2010-08-21 18:26:28

@Matthew Flaschen, I've used this code `var action = Regex.Match(form, "<\\s*form[^>]*\baction=[\"']([^'\"]+)['\"]", RegexOptions.IgnoreCase).Groups; foreach (var item in action) {Console.Write(item);}` but it returns nothing!

Mohammad 2010-08-21 18:30:51
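A likely cause, offered as a guess since the input string is not shown: in a regular C# string literal, `"\b"` is the backspace escape (U+0008), so the regex engine receives a literal backspace character rather than the word-boundary assertion `\b`. Writing `\\b`, or using a verbatim string, should avoid that. The sketch below compares the two; the class and member names are invented for this example.

```csharp
using System;
using System.Text.RegularExpressions;

public static class WordBoundaryDemo
{
    // Pattern as written in the comment: C# compiles "\b" into a backspace character.
    public const string BrokenPattern = "<\\s*form[^>]*\baction=[\"']([^'\"]+)['\"]";

    // Same pattern as a verbatim string, so the regex engine sees \b (word boundary).
    public const string FixedPattern = @"<\s*form[^>]*\baction=[""']([^'""]+)['""]";

    // Returns capture group 1 (the URL), or null when the pattern does not match.
    public static string ExtractAction(string html, string pattern)
    {
        Match match = Regex.Match(html, pattern, RegexOptions.IgnoreCase);
        return match.Success ? match.Groups[1].Value : null;
    }

    public static void Main()
    {
        string form = "<form id='paymentUTLfrm' action='https://www.sth.com/yment/Paymentform.aspx' method='post'>";

        // No match, because the input contains no backspace character.
        Console.WriteLine(ExtractAction(form, BrokenPattern) ?? "(no match)");

        // Prints: https://www.sth.com/yment/Paymentform.aspx
        Console.WriteLine(ExtractAction(form, FixedPattern));
    }
}
```

Note that this pattern keeps the URL in group 1, not group 2, and its `[^'"]+` class rejects values that contain either kind of quote.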
