I receive rather long XML strings as output from a third party and some of the fields represented in the XML may contain credit card numbers. I do not know the node/element/attribute names ahead of time. What would be the simplest method for finding and replacing card numbers with a placeholder in C#? String functions? Regex?

Edit: I think I'm going to do something like this:

Match m = Regex.Match(xml, ">[0-9]{16}<");
// m.Value is empty when nothing matched, and String.Replace throws on an empty search string.
if (m.Success)
    xml = xml.Replace(m.Value, ">FOOBAR<");

Checking for the case where the string doesn't exist, of course. I think this, possibly combined with a checksum algorithm, will be sufficient for my needs.

Thank you for the replies.

+2  A: 

Hi. Since we don't know what kind of credit card it is (MasterCard, Visa, ...), there are several regular expressions for that.

Here:

* Visa: ^4[0-9]{12}(?:[0-9]{3})?$ All Visa card numbers start with a 4. New cards have 16 digits. Old cards have 13.
* MasterCard: ^5[1-5][0-9]{14}$ All MasterCard numbers start with the numbers 51 through 55. All have 16 digits.
* American Express: ^3[47][0-9]{13}$ American Express card numbers start with 34 or 37 and have 15 digits.
* Diners Club: ^3(?:0[0-5]|[68][0-9])[0-9]{11}$ Diners Club card numbers begin with 300 through 305, 36 or 38. All have 14 digits. There are Diners Club cards that begin with 5 and have 16 digits. These are a joint venture between Diners Club and MasterCard, and should be processed like a MasterCard.
* Discover: ^6(?:011|5[0-9]{2})[0-9]{12}$ Discover card numbers begin with 6011 or 65. All have 16 digits.
* JCB: ^(?:2131|1800|35\d{3})\d{11}$ JCB cards beginning with 2131 or 1800 have 15 digits. JCB cards beginning with 35 have 16 digits.
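
The patterns above are anchored with ^ and $, so each one only matches a string that is nothing but a card number. To scan inside a longer XML string, one option is to drop the anchors and require that no digit touches the match on either side. A rough sketch, assuming that approach (the class name CardPatternMasker and the method Mask are made up for illustration):

```csharp
using System;
using System.Text.RegularExpressions;

public static class CardPatternMasker
{
    // The brand patterns from the list above, joined into one alternation.
    // Assumption: ^ and $ are swapped for lookarounds so a match can sit
    // anywhere in the input, provided no digit is adjacent to it.
    private static readonly Regex CardPattern = new Regex(
        @"(?<![0-9])(?:" +
        @"4[0-9]{12}(?:[0-9]{3})?" +            // Visa
        @"|5[1-5][0-9]{14}" +                   // MasterCard
        @"|3[47][0-9]{13}" +                    // American Express
        @"|3(?:0[0-5]|[68][0-9])[0-9]{11}" +    // Diners Club
        @"|6(?:011|5[0-9]{2})[0-9]{12}" +       // Discover
        @"|(?:2131|1800|35[0-9]{3})[0-9]{11}" + // JCB
        @")(?![0-9])");

    // Replaces every match with the placeholder; the lambda keeps characters
    // such as '$' in the placeholder from being read as substitution syntax.
    public static string Mask(string input, string placeholder)
    {
        return CardPattern.Replace(input, m => placeholder);
    }
}
```

Calling CardPatternMasker.Mask(xml, "FOOBAR") would leave shorter numbers such as 12345 alone, since none of the alternatives can match them.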

and to replace the content you can write something like:

content = Regex.Replace(content, regExpression, something);
Braveyard
Yeah, I found that out in Googleland, too. I'm not sure that really answers my question though.
Of course there are better ways to do that; to me this method is really naive.
Braveyard
What would you consider a non-naive better way?
A: 

I'm not sure it's going to be possible. To be able to manipulate an XML document, you'd need to know the structure either through DTD or the Schema. If you look around, I believe the third party should have their API.

Only when you know the XML structure can you manipulate it.

Helen Neely
I have their API. I don't necessarily want to write code to specifically scan each element in the doc and do a replace. And they periodically add new tags, so my code would need constant revision. To keep it simple, I was hoping I could just load the doc into a string and scan it for anything resembling a credit card, replacing it with String.Empty for instance. I am just looking for suggestions on the easiest way to accomplish this. Thanks.
+1  A: 

Considering the XML just as a string, you could step through it, identify each sequence of digits, and if that sequence passes the Luhn checksum, replace it.
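
A minimal sketch of that idea, assuming a card number shows up as an unbroken run of 13 to 16 digits (the names LuhnMasker, PassesLuhn and Mask are hypothetical):

```csharp
using System;
using System.Text.RegularExpressions;

public static class LuhnMasker
{
    // Assumption: candidates are runs of 13 to 16 ASCII digits that are
    // not part of a longer digit sequence.
    private static readonly Regex DigitRun =
        new Regex(@"(?<![0-9])[0-9]{13,16}(?![0-9])");

    // Luhn checksum: walking from the rightmost digit, double every second
    // digit, subtract 9 from results above 9, and sum everything.
    public static bool PassesLuhn(string digits)
    {
        int sum = 0;
        bool doubleIt = false;
        for (int i = digits.Length - 1; i >= 0; i--)
        {
            int d = digits[i] - '0';
            if (doubleIt)
            {
                d *= 2;
                if (d > 9) d -= 9;
            }
            sum += d;
            doubleIt = !doubleIt;
        }
        return sum % 10 == 0;
    }

    // Replaces only the digit runs that pass the checksum.
    public static string Mask(string xml, string placeholder)
    {
        return DigitRun.Replace(xml, m => PassesLuhn(m.Value) ? placeholder : m.Value);
    }
}
```

The checksum cuts down on false positives such as order numbers or timestamps, but roughly one in ten random digit runs will still pass it.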

R Ubben
I don't think there's a simple regex way to implement a checksum algorithm but maybe this combined with regex would do the trick.
A: 

Something like (\d[\s-]?){13,16}. This is a rather permissive regex, but it accounts for the fact that random people often enter credit cards with dashes and spaces all over the place. It accepts any set of 13-16 consecutive digits, allowing one space or dash between any pair of adjacent digits.
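
Here is roughly how that could look in code, assuming the separator between digits is whitespace or a dash (written as [\s-]). The pattern is rearranged slightly so a match starts and ends on a digit and does not swallow a trailing separator; SeparatedCardMasker is a made-up name:

```csharp
using System;
using System.Text.RegularExpressions;

public static class SeparatedCardMasker
{
    // One digit, then 12 to 15 more digits, each optionally preceded by a
    // single whitespace character or dash: 13 to 16 digits in total.
    private static readonly Regex Candidate =
        new Regex(@"(?<![0-9])[0-9](?:[\s-]?[0-9]){12,15}(?![0-9])");

    public static string Mask(string text, string placeholder)
    {
        return Candidate.Replace(text, m => placeholder);
    }
}
```

Note that \s also matches line breaks, so digits spread over adjacent lines can be joined into one match. That is part of what makes this pattern permissive.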

Brian
A: 

Please, please don't use regular expressions to parse XML. You will regret it. Do they handle entity references? What about schema-defined entities? Tags in comments? Nested tags? CDATA sections?

Load the XML into a parser/DOM, find the data you're looking for and replace it (you could apply a regular expression at that point), then stream out the modified XML.

TrueWill
Do they handle entity references? No. What about schema-defined entities? No. Tags in comments? No. Nested tags? No. CDATA sections? No.

It is pretty vanilla, but there are no friendly node names. Just a bunch of numbered elements that have to be interpreted from an API manual.

I love the XML parser idea, but again, they add new tags from time to time and I'd have to continually tweak my code or build some sort of fancy dynamic parser. I'm really looking for quick and dirty here.
If you're dealing with numbered elements, I'd guess you're dealing with ANSI X12 EDI XML or something similar. The API should give you clear information on which elements are going to contain credit card data. To avoid constant maintenance on your parsing code, set it up to scan for these elements. If an element matches, you do a Regex search-and-replace on the contents. If an element doesn't match, you leave it alone. The only time you have to change your code is if they add another element for credit card data. You could move the list of elements to scan into a DB or config file.
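
A sketch of what that might look like with LINQ to XML, assuming the list of element names comes from a config file or DB and is passed in as a set. The element names used in the usage note (f7, f8) are invented, as are the class and method names:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;
using System.Xml.Linq;

public static class ElementListMasker
{
    // Assumption: a card number is a run of 13 to 16 digits.
    private static readonly Regex Digits = new Regex(@"[0-9]{13,16}");

    public static string Mask(string xml, ISet<string> cardElementNames, string placeholder)
    {
        XDocument doc = XDocument.Parse(xml);

        // Take a snapshot of the matching leaf elements before modifying them.
        List<XElement> targets = doc.Descendants()
            .Where(e => !e.HasElements && cardElementNames.Contains(e.Name.LocalName))
            .ToList();

        foreach (XElement e in targets)
        {
            e.Value = Digits.Replace(e.Value, m => placeholder);
        }

        // Returns the markup without an XML declaration or added indentation.
        return doc.ToString(SaveOptions.DisableFormatting);
    }
}
```

For example, ElementListMasker.Mask(xml, new HashSet<string> { "f7" }, "FOOBAR") would touch only f7 elements and leave a number inside f8 as it is. New card-bearing elements then mean a config change rather than a code change.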
TrueWill
+2  A: 

Don't use regular expressions to process XML. There will be bugs in your code that you don't know about.

Here's a way of doing this that requires no knowledge whatsoever of the structure of the XML document:

// XmlCharacterData covers both the plain text and the CDATA nodes that text() selects.
foreach (XmlCharacterData t in myXmlDocument.SelectNodes("//text()"))
{
   t.Value = myRegex.Replace(t.Value, replacement);
}

This won't find degenerate situations like text nodes with comments in the middle of them, but all of the issues of encoding, CDATA, etc. go away if you let the DOM manage the text for you.

You can do the same thing with an XmlReader, too, if you don't want to parse the whole document before processing it.
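
A rough sketch of the XmlReader variant, copying nodes to an XmlWriter and running the regex only on text content. This is a simplification, not a full node copier: it assumes comments, processing instructions, the XML declaration and whitespace-only nodes can be dropped, and it copies attribute values unmasked. StreamingMasker is a made-up name:

```csharp
using System;
using System.IO;
using System.Text.RegularExpressions;
using System.Xml;

public static class StreamingMasker
{
    public static string Mask(string xml, Regex pattern, string placeholder)
    {
        var output = new StringWriter();
        var settings = new XmlWriterSettings { OmitXmlDeclaration = true };

        using (XmlReader reader = XmlReader.Create(new StringReader(xml)))
        using (XmlWriter writer = XmlWriter.Create(output, settings))
        {
            while (reader.Read())
            {
                switch (reader.NodeType)
                {
                    case XmlNodeType.Element:
                    {
                        // Remember this before the attributes are copied.
                        bool isEmpty = reader.IsEmptyElement;
                        writer.WriteStartElement(reader.Prefix, reader.LocalName, reader.NamespaceURI);
                        writer.WriteAttributes(reader, true);
                        if (isEmpty) writer.WriteEndElement();
                        break;
                    }
                    case XmlNodeType.Text:
                        writer.WriteString(pattern.Replace(reader.Value, m => placeholder));
                        break;
                    case XmlNodeType.CDATA:
                        writer.WriteCData(pattern.Replace(reader.Value, m => placeholder));
                        break;
                    case XmlNodeType.EndElement:
                        writer.WriteFullEndElement();
                        break;
                    // Assumption: every other node type is skipped here.
                }
            }
        }

        // The writer has been disposed, so everything is flushed to output.
        return output.ToString();
    }
}
```

With real input you would read from and write to streams instead of strings, so the whole document never has to sit in memory at once.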

Robert Rossney
A: 

TrueWill: "I'd guess you're dealing with ANSI X12 EDI XML or something similar."

No, it is not ANSI X12. Just oddly-named elements.

Robert Rossney: "This won't find degenerate situations like text nodes with comments in the middle of them, but all of the issues of encoding, CDATA, etc..."

None of this exists in the XML. I'm sorry, I should have clarified that from the start. But this suggestion is helpful for something else, thanks.