tickettypepat = (r'MIS Notes:.*(//p//)?.*')
retype = re.search(tickettypepat,line)
if retype:
  print retype.group(0)
  print retype.group(1)

Given the input.

MIS Notes: //p//

Can anyone tell me why group(0) is

MIS Notes: //p// 

and group(1) is returning as None?

I was originally using regex because, before I ran into problems, the matching was more complex than just matching //p//. Here's the full code. I'm fairly new at this, so forgive my noobness; I'm sure there are better ways of accomplishing much of this, and if anyone feels like pointing those out, that would be awesome. Aside from the regex for //[pewPEW]// being too greedy, it seems to be functional. I appreciate the help.


Takes Text and cleans up / converts some things.

filename = (r'.\4-12_4-26.txt')
import re
import sys
#Clean up output from the web to ensure that you have one category per line
f = open(filename)
w = open('cleantext.txt','w')

origdatepat = (r'(Ticket Date: )([0-9]+/[0-9]+/[0-9]+),( [0-9]+:[0-9]+ [PA]M)')
tickettypepat = (r'MIS Notes:.*(//[pewPEW]//)?.*')

print 'Beginning Blank Line Removal'
for line in f:
    redate = re.search(origdatepat,line)
    retype = re.search(tickettypepat,line)
    if line == ' \n':
        line = ''
        print 'Removing blank Line'
#remove ',' from time and date line    
    elif redate:
        line = redate.group(1) + redate.group(2)+ redate.group(3)+'\n'
        print 'Redating... ' + line

    elif retype:
        print retype.group(0)
        print retype.group(1)

        if retype.group(1) == '//p//':
            line = line + 'Type: Phone\n'
            print 'Setting type for... ' + line
        elif retype.group(1) == '//e//':
            line = line + 'Type: Email\n'
            print 'Setting type for... ' + line
        elif retype.group(1) == '//w//':
            line = line + 'Type: Walk-in\n'
            print 'Setting type for... ' + line
        elif retype.group(1) == ('' or None):
            line = line + 'Type: Ticket\n'
            print 'Setting type for... ' + line

    w.write(line)

print 'Closing Files'                 
f.close()
w.close()

And here's some sample input.

Ticket No.: 20100426132 
Ticket Date: 04/26/10, 10:22 AM 
Close Date:  
Primary User: XXX
Branch: XXX
Help Tech: XXX
Status: Pending  
Priority: Medium  
Application: xxx
Description: some issue
Resolution: some resolution
MIS Notes: some random stuff //p// followed by more stuff
Key Words:  

Ticket No.: 20100426132 
Ticket Date: 04/26/10, 10:22 AM 
Close Date:  
Primary User: XXX
Branch: XXX
Help Tech: XXX
Status: Pending  
Priority: Medium  
Application: xxx
Description: some issue
Resolution: some resolution
MIS Notes: //p//
Key Words:  

Ticket No.: 20100426132 
Ticket Date: 04/26/10, 10:22 AM 
Close Date:  
Primary User: XXX
Branch: XXX
Help Tech: XXX
Status: Pending  
Priority: Medium  
Application: xxx
Description: some issue
Resolution: some resolution
MIS Notes: //e// stuff....
Key Words:  


Ticket No.: 20100426132 
Ticket Date: 04/26/10, 10:22 AM 
Close Date:  
Primary User: XXX
Branch: XXX
Help Tech: XXX
Status: Pending  
Priority: Medium  
Application: xxx
Description: some issue
Resolution: some resolution
MIS Notes:
Key Words:  
+1  A: 

Regexes are greedy by default, which means that .* matches as much as it can, here the entire rest of the string. That leaves nothing for the optional group to match. group(0) is always the entire matched string.

From your comment, why do you even want a regex? Isn't something like this enough:

if line.startswith('MIS Notes:'): # starts with that string
    data = line[len('MIS Notes:'):] # the rest is the interesting part
    if '//p//' in data:
        stuff, sep, rest = data.partition('//p//') # or something like that
    else:
        pass #other stuff
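
To make the `partition` step concrete, here is a minimal sketch. The helper name `split_notes` and its return shape are made up for illustration; it assumes a single fixed tag per call.

```python
# Hypothetical helper: split an "MIS Notes:" line around a tag without regex.
def split_notes(line, tag='//p//'):
    prefix = 'MIS Notes:'
    if not line.startswith(prefix):
        return None  # not an MIS Notes line at all
    # partition returns (before, separator, after); the separator is ''
    # when the tag does not occur in the string.
    before, sep, after = line[len(prefix):].partition(tag)
    return before, bool(sep), after


tagged = split_notes('MIS Notes: some random stuff //p// followed by more stuff')
# tagged == (' some random stuff ', True, ' followed by more stuff')
empty = split_notes('MIS Notes:')
# empty == ('', False, '')
```

The boolean in the middle tells you whether the tag was present, which is the piece of information the optional regex group was meant to provide.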
THC4k
Thanks for the quick comment. Ultimately I'm trying to match any string that has the pattern MIS Notes: followed by anything or nothing before the string //p//, and anything or nothing after it. Naturally, //p// is optional as well. So I guess I need a way to prevent the greediness at the beginning. Sorry if this description isn't clear enough; let me know and I can try to clarify.
AaronG
Edited question to try and clarify purpose / regex reasoning. Thanks.
AaronG
A: 

The pattern is ambiguous for your purposes. It would be good to group the text by prefix or suffix. In the example here, I've chosen prefix grouping. Basically, if //p// occurs in the line, then the prefix group is set (possibly to an empty string); otherwise it is None. The suffix will be everything after the //p// item, or everything after "MIS Notes: " if the item doesn't exist.

import re
lines = ['MIS Notes: //p//',
    'MIS Notes: prefix//p//suffix']

tickettypepat = (r'MIS Notes: (?:(.*)//p//)?(.*)')
for line in lines:
    m = re.search(tickettypepat,line)
    print 'line:', line
    if m: print 'groups:', m.groups()
    else: print 'groups:', m

results:

line: MIS Notes: //p//
groups: ('', '')
line: MIS Notes: prefix//p//suffix
groups: ('prefix', 'suffix')
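
For comparison, a small sketch of what the same pattern returns when the tag is absent: the prefix group is then `None` rather than an empty string. The sample line without a tag is made up for illustration.

```python
import re

pat = r'MIS Notes: (?:(.*)//p//)?(.*)'

with_tag = re.search(pat, 'MIS Notes: prefix//p//suffix').groups()
bare_tag = re.search(pat, 'MIS Notes: //p//').groups()
no_tag = re.search(pat, 'MIS Notes: just some text').groups()

# with_tag == ('prefix', 'suffix')
# bare_tag == ('', '')                   tag present, empty prefix
# no_tag   == (None, 'just some text')   tag absent, prefix group unset
```

So testing `m.group(1) is not None` is one way to tell whether //p// was present.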
Shane Holloway
Unless I'm misunderstanding your solution, I'm not sure that does what I want. Ideally the regex would always return //p//, //e// or //w// in a specific group if it was present in the input.
AaronG
+1  A: 

MIS Notes:.*(//p//)?.* works like this, on the example of "MIS Notes: //p//" as the target:

  1. MIS Notes: matches "MIS Notes:", no surprises here.
  2. .* immediately runs to the end of the string (match so far "MIS Notes: //p//")
  3. (//p//)? is optional. Nothing happens.
  4. .* has nothing left to match; we are at the end of the string already. Since the star allows zero matches for the preceding atom, the regex engine stops, reporting the entire string as a match and the sub-group as empty (None in Python).
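
The four steps above can be checked with a small sketch. The only change to the pattern is that each `.*` is wrapped in its own group (an addition for illustration) so you can see what it consumed.

```python
import re

line = 'MIS Notes: //p//'

# Same structure as the original pattern, but both .* are captured too.
m = re.search(r'MIS Notes:(.*)(//p//)?(.*)', line)
first_dotstar, tag, second_dotstar = m.groups()

# first_dotstar  == ' //p//'  (the greedy .* swallowed the tag)
# tag            is None      (the optional group matched zero times)
# second_dotstar == ''        (nothing left at the end of the string)
```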

Now when you change the regex to MIS Notes:.*(//p//).*, the behavior changes:

  1. MIS Notes: matches "MIS Notes:", still no surprises here.
  2. .* immediately runs to the end of the string (match so far "MIS Notes: //p//")
  3. (//p//) is necessary. The engine starts to backtrack character by character in order to fulfill this requirement. (Match so far "MIS Notes: ")
  4. (//p//) can match. Sub-group one is saved and contains "//p//".
  5. .* runs to the end of the string. Hint: If you are not interested in what it matches, it is superfluous and you can remove it.
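
A quick sketch of this second variant, using two of the sample lines from the question. Note the trade-off: because the group is now required, a line without a tag does not match at all.

```python
import re

pat = r'MIS Notes:.*(//p//).*'

tagged = re.search(pat, 'MIS Notes: some random stuff //p// followed by more stuff')
untagged = re.search(pat, 'MIS Notes:')

# tagged.group(1) == '//p//'
# untagged is None, so there is no match object to call .group() on
```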

Now when you change the regex to MIS Notes:.*?//(p)//, the behavior changes again:

  1. MIS Notes: matches "MIS Notes:", and still no surprises here.
  2. .*? is non-greedy and checks the following atom before it proceeds (match so far "MIS Notes: ")
  3. //(p)// can match. Sub-group one is saved and contains "p".
  4. Done. Note that the engine never runs to the end of the string and backs up from there, which saves time.
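
As a sketch of the non-greedy variant: the `.*?` is wrapped in a group here (an addition for illustration) to show that it stops right before the tag instead of running to the end of the line.

```python
import re

line = 'MIS Notes: some random stuff //p// followed by more stuff'
m = re.search(r'MIS Notes:(.*?)//(p)//', line)

# m.group(1) == ' some random stuff '   what the lazy .*? consumed
# m.group(2) == 'p'
# m.group(0) == 'MIS Notes: some random stuff //p//'   the match ends at the tag
```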

Now if you know that there can be no / before the //p//, you can use: MIS Notes:[^/]*//(p)//:

  1. MIS Notes: matches "MIS Notes:", you get the idea.
  2. [^/]* can fast-forward to the first slash (this is faster than .*?)
  3. //(p)// can match. Sub-group one is saved and contains "p".
  4. Done. Note that no backtracking occurs, this saves time. This should be faster than version #3.
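
A short sketch of the negated-class variant, including what happens when the "no / before the tag" assumption is violated. The second input line is made up for illustration.

```python
import re

pat = r'MIS Notes:[^/]*//(p)//'

ok = re.search(pat, 'MIS Notes: some random stuff //p// followed by more stuff')
stray_slash = re.search(pat, 'MIS Notes: see a/b //p//')

# ok.group(1) == 'p'
# stray_slash is None: [^/]* stops at the single slash in "a/b",
# and "//" cannot match there.
```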
Tomalak
Thanks for the thorough explanation. I think I understand this now. I have to run to a meeting for a tic, but I will test upon return.
AaronG
I tried it and I think we are close. `MIS Notes:.*?(//[pewPEW])` matches all the cases where there's a //*// tag of some sort, but it breaks on the `elif retype.group(1) == ('' or None):` branch that sets `'Type: Ticket\n'`. Whereas `MIS Notes:.*?(//[pewPEW]//)?` seems to give null for group(1) no matter what. Same thing with `MIS Notes:.*?(//[pewPEW]//)*`.
AaronG
I obviously missed the trailing // in the first example in my last comment. It should have read `MIS Notes:.*?(//[pewPEW]//)`.
AaronG
@Aaron: As indicated, both `*` and `?` make the group optional. My case #1 applies to both. I'm not sure what you mean by "breaks"?
Tomalak
Sorry about the ambiguity on "breaks". If I run with `MIS Notes:.*?(//[pewPEW]//)`, the test on line 23 in my post above (`elif retype:`) fails, thus skipping the logic chain below it and ultimately doing nothing for lines that don't have //*// in them, where it should be marking those as Ticket.
AaronG
@Aaron: This works: `MIS Notes:(?:(?!//[pewPEW]//).)*(?://([pewPEW])//)?` but it's ugly and I'm trying to come up with a nicer version.
Tomalak
@Aaron: The explanation why something of the form `.*?(x)?` won't capture anything is that `.*?` has no reason to move forward if the rest of the expression is optional. And since the rest of the expression is optional, the engine finishes immediately.
Tomalak
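
A minimal sketch of that effect, using one of the sample lines from the question: the overall match succeeds, but it stops right after the literal prefix.

```python
import re

line = 'MIS Notes: some random stuff //p// followed by more stuff'
m = re.search(r'MIS Notes:.*?(//[pewPEW]//)?', line)

# m.group(0) == 'MIS Notes:'  (the lazy .*? matched nothing)
# m.group(1) is None          (the optional group was skipped)
```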
MIS Notes:(?:(?!//[pewPEW]//).)*(?://([pewPEW])//)? Confirmed working with my data. Thanks, again. Pretty sure I would have never come up with that.
AaronG
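
For reference, a small sketch running that pattern over the four MIS Notes lines from the sample input in the question. The `(?:(?!//[pewPEW]//).)*` part consumes one character at a time as long as a tag does not start at the current position.

```python
import re

pat = re.compile(r'MIS Notes:(?:(?!//[pewPEW]//).)*(?://([pewPEW])//)?')

samples = [
    'MIS Notes: some random stuff //p// followed by more stuff',
    'MIS Notes: //p//',
    'MIS Notes: //e// stuff....',
    'MIS Notes:',
]

# Every line matches; group(1) is the tag letter, or None when there is no tag.
groups = [pat.search(s).group(1) for s in samples]
# groups == ['p', 'p', 'e', None]
```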
@AaronG: Having thought about this some more, I don't think there is a much simpler regex to accomplish what you want in one step. Since *"Simple is better than complex"*, I would recommend working in two steps, matching `MIS Notes:` to find the line, and then matching `//([pewPEW])//` in a second step. This would be much more obvious and maintainable.
Tomalak
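
A rough sketch of what the two-step approach could look like. The names `TAG`, `TYPES` and `ticket_type` are made up for illustration; the type labels are taken from the code in the question, and the sketch assumes at most one tag per line.

```python
import re

# Step 2 pattern: only the tag itself, no leading or trailing wildcards.
TAG = re.compile(r'//([pewPEW])//')
TYPES = {'p': 'Phone', 'e': 'Email', 'w': 'Walk-in'}


def ticket_type(line):
    """Return the ticket type for an 'MIS Notes:' line, or None for other lines."""
    # Step 1: find the line.
    if not line.startswith('MIS Notes:'):
        return None
    # Step 2: look for a tag anywhere in it.
    m = TAG.search(line)
    if m is None:
        return 'Ticket'  # MIS Notes line without a tag
    return TYPES[m.group(1).lower()]
```

In the loop from the question, the result could then be appended as `'Type: %s\n' % ticket_type(line)` whenever it is not `None`.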