I have a requirement: I have some text which can contain any characters.

a) I have to retain only alphanumeric characters.

b) If the word "The" is found with a space before or after it, that word needs to be removed.

e.g.

CASE 1:

 Input:  The Company Pvt Ltd. 

 Output: Company Pvt Ltd

But 

     Input:  TheCompany Pvt Ltd. 

     Output: TheCompany Pvt Ltd

because there is no space between The & Company words.

CASE 2:

Similarly, Input:  Company Pvt Ltd.  The 

     Output: Company Pvt Ltd

But Input:  Company Pvt Ltd.The 

     Output: Company Pvt Ltd

CASE 3:

Input: Company@234 Pvt; Ltd.

Output: Company234 Pvt Ltd

No commas, periods, or any other special characters should remain.

I am basically assigning the data to a variable like this:

    _company.ShortName = _company.CompanyName.ToUpper();

I cannot do anything at the time of saving. The filter has to be applied only when I read the data from the database. The data arrives in _company.CompanyName, and I have to apply the filter to that.

So far I have done this:

    public string ReplaceCharacters(string words)
    {
        words = words.Replace(",", " ");
        words = words.Replace(";", " ");
        words = words.Replace(".", " ");
        words = words.Replace("THE ", " ");
        words = words.Replace(" THE", " ");
        return words;
    }

    private void button1_Click(object sender, EventArgs e)
    {
        MessageBox.Show(ReplaceCharacters(textBox1.Text.ToUpper()));
    }

Thanks in advance. I am using C#.

+10  A: 

Here is a basic regex that matches your supplied cases. One caveat: as Kobi says, your supplied cases are inconsistent, so I've taken the periods out of the first four tests. If you need both, please add a comment.

This handles all the cases you require, but the rapid proliferation of edge cases makes me think that maybe you should reconsider the initial problem?

    [TestMethod]
    public void RegexTest()
    {
        Assert.AreEqual("Company Pvt Ltd", RegexMethod("The Company Pvt Ltd"));
        Assert.AreEqual("TheCompany Pvt Ltd", RegexMethod("TheCompany Pvt Ltd"));
        Assert.AreEqual("Company Pvt Ltd", RegexMethod("Company Pvt Ltd. The"));
        Assert.AreEqual("Company Pvt LtdThe", RegexMethod("Company Pvt Ltd.The"));
        Assert.AreEqual("Company234 Pvt Ltd", RegexMethod("Company@234 Pvt; Ltd."));
        // Two new tests for new requirements
        Assert.AreEqual("CompanyThe Ltd", RegexMethod("CompanyThe Ltd."));
        Assert.AreEqual("theasdasdatheapple", RegexMethod("the theasdasdathe the the the ....apple,,,, the"));
        // And the case where you have THETHE at the start
        Assert.AreEqual("CCC", RegexMethod("THETHE CCC"));
    }

    public string RegexMethod(string input)
    {   
        // Old method before new requirement          
        //return Regex.Replace(input, @"The | The|[^A-Z0-9\s]", string.Empty, RegexOptions.IgnoreCase);  
        // New method that anchors the first the          
        //return Regex.Replace(input, @"^The | The|[^A-Z0-9\s]", string.Empty, RegexOptions.IgnoreCase);            
        // And a third method that does look behind and ahead for the last test
        return Regex.Replace(input, @"^(The)+\s|\s(?<![A-Z0-9])[\s]*The[\s]*(?![A-Z0-9])| The$|[^A-Z0-9\s]", string.Empty, RegexOptions.IgnoreCase);
    }

I've also added a test method to my example that exercises RegexMethod, which contains the regular expression. To use this in your code, you only need the second method.
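
As a rough sketch of how the second method might be wired into the read path described in the question: the `Company` class, the `CompanyNameFilter` wrapper, and `ApplyShortName` are hypothetical names made up for illustration, since the question only shows `_company.CompanyName` and `_company.ShortName`. The pattern itself is unchanged from RegexMethod.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical stand-in for the asker's entity; only the two properties
// named in the question are modelled.
public class Company
{
    public string CompanyName { get; set; }
    public string ShortName { get; set; }
}

public static class CompanyNameFilter
{
    // Same pattern as RegexMethod, wrapped in a static helper (assumed name).
    public static string RegexMethod(string input)
    {
        return Regex.Replace(
            input,
            @"^(The)+\s|\s(?<![A-Z0-9])[\s]*The[\s]*(?![A-Z0-9])| The$|[^A-Z0-9\s]",
            string.Empty,
            RegexOptions.IgnoreCase);
    }

    // Filter the name as it is read, then upper-case it as in the question.
    public static void ApplyShortName(Company company)
    {
        company.ShortName = RegexMethod(company.CompanyName).ToUpper();
    }
}
```

With a `Company` whose `CompanyName` is "The Company Pvt Ltd.", calling `CompanyNameFilter.ApplyShortName` should leave `ShortName` as "COMPANY PVT LTD".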

David Hall
Looking at what I provided - it meets what you have asked for, but there are DOZENS of possible edge cases. For example, when "The " comes in the middle of the company name - should it be removed? There are ways of making a regex cater for most requirements, but you need to be clear on those requirements first.
David Hall
+1 for test case, which I assume was written before the actual method.
byte
Not to be snotty, but I think this is a nicer piece of code than a bunch of *.Replace() calls. On the other hand, by showing the test case first, the answer becomes less understandable and accessible to those who aren't used to this methodology.
Jeff Meatball Yang
+1 for beating me by many minutes, and for giving a note about edge cases the OP didn't mention, thus saving me from having to write that myself :)
Kobi
Thanks for the feedback, Jeff - I've added a note that should hopefully explain the test-first methodology a little.
David Hall
It fails in this case: "the theasdasdathe the the the ....apple,,,, the". Output is: "theasdasdaapple"; expected output: "theasdasdatheapple".
priyanka.sarkar
+1 Very nice code example with a test! Thanks.
BillW
I've added two more tests for the failing case. One fixes the problem with the "The " space case being part of the company name (as alluded to in my first comment). The second I'm still thinking about.
David Hall
It fails for this: "THETHE CCC". Output: "THECCC"; expected: "CCC".
priyanka.sarkar
I've added a third regex that deals with your 'apple' case - I actually found this SO post: http://stackoverflow.com/questions/889045/substituting-a-regex-only-when-it-doesnt-match-another-regex-python. I'd never seen look-ahead and look-behind before; it's nifty. As for that last case, I think you need to look at your requirements, they are getting a little bit silly. Perhaps you can cleanse the last few edge cases with a plain old replace?
David Hall
Though, after a bit of thought, this regex catches that last case: return Regex.Replace(input, @"^(The)+\s|\s(?<![A-Z0-9])[\s]*The[\s]*(?![A-Z0-9])| The$|[^A-Z0-9\s]", string.Empty, RegexOptions.IgnoreCase);
David Hall
The previous case got solved, but it failed again for this input: "thethe c thethethe theapple". Output: "cthethethetheapple"; expected: "ctheapple".
priyanka.sarkar
@priyanka - Honestly, this is impossible. You keep changing the specs, or inventing new ones. Try to edit your question, and define *clear rules*. `thethe` should be removed? Why?
Kobi
Sir, please don't mistake me. It is not me who keeps changing the requirement; it is the original requirement. I know that you are all busy, and many thanks for paying proper attention to my problem in spite of that. I honestly don't intend to disturb you all. The requirement is: if THE is appended to a text, e.g. THESTACKOVERFLOW or STACKOVERFLOWTHE or STACKTHEOVERFLOW, then I should consider it, else not. However, the aforementioned constraints will still apply.
priyanka.sarkar
But many, many thanks for your help. I have a limited purview, otherwise I would have given you more points. Thanks a lot (:
priyanka.sarkar
+2  A: 
    string company = "Company; PvtThe Ltd.The  . The the.the";
    company = Regex.Replace(company, @"\bthe\b", "", RegexOptions.IgnoreCase);
    company = Regex.Replace(company, @"[^\w ]", "");
    company = Regex.Replace(company, @"\s+", " ");
    company = company.Trim();
    // company == "Company PvtThe Ltd"

These are the steps. 1 and 2 can be combined, but this is clearer.

  1. Remove "the" as a whole word (also works for ".the").
  2. Remove anything that isn't a word character (letter, digit, or underscore) or a space.
  3. Collapse each run of whitespace into a single space.
  4. Remove spaces from the edges.
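
If you do want steps 1 and 2 combined, a minimal sketch might look like the following. The class name `CompanyNameCleaner` and method name `Clean` are made up for illustration. Note that `\w` also lets underscores through; if underscores must go as well, a class such as `[^A-Za-z0-9 ]` could be swapped in instead.

```csharp
using System;
using System.Text.RegularExpressions;

public static class CompanyNameCleaner
{
    // Hypothetical helper wrapping the four steps listed in this answer.
    public static string Clean(string company)
    {
        // Steps 1 and 2 in one pass: drop "the" as a whole word, and drop
        // any character that is neither a word character nor a space.
        company = Regex.Replace(company, @"\bthe\b|[^\w ]", "", RegexOptions.IgnoreCase);
        // Step 3: collapse each run of whitespace into a single space.
        company = Regex.Replace(company, @"\s+", " ");
        // Step 4: remove spaces from the edges.
        return company.Trim();
    }
}
```

For the sample string above, `CompanyNameCleaner.Clean` should give the same "Company PvtThe Ltd" as the step-by-step version.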
Kobi
Kobi, that was a mistake while typing. I will edit it. There should not be any special characters. Thanks for pointing that out.
priyanka.sarkar