This pyparsing solution follows logic similar to that of your posted answer. Every tag is matched and then checked against the set of known valid tags; valid terms are removed from that match's results, and only matches that still have terms left over are reported.
from pyparsing import *
# define the pattern of a tag, setting internal results names for easy validation
AT,LPAR,RPAR = map(Suppress,"@()")
term = Word(alphas,alphanums).setResultsName("terms",listAllMatches=True)
sphxTerm = AT + ~White() + ( term | LPAR + delimitedList(term) + RPAR )
# define tags we consider to be valid
valid = set("cat mouse dog".split())
# define a parse action to filter out valid terms, and attach to the sphxTerm
def filterValid(tokens):
    tokens = [t for t in tokens.terms if t not in valid]
    # raising ParseException makes this match fail, so all-valid tags are skipped
    if not tokens:
        raise ParseException("",0,"")
    return tokens
sphxTerm.setParseAction(filterValid)
##### Test out the parser #####
test = """@cat search terms @ house
    @(cat) search terms
    @(cat, dog) search term @(goat)
    @cat searchterm1 @dog searchterm2 @(cat, doggerel)
    @(cat, dog) searchterm1 @mouse searchterm2
    @caterpillar"""
# scan for invalid terms, and print out the terms and their locations
for t,s,e in sphxTerm.scanString(test):
    print("Terms:%s Line: %d Col: %d" % (t, lineno(s, test), col(s, test)))
    print(line(s, test))
    print(" "*(col(s,test)-1)+"^")
    print("")
With these lovely results:
Terms:['goat'] Line: 3 Col: 29
    @(cat, dog) search term @(goat)
                            ^

Terms:['doggerel'] Line: 4 Col: 39
    @cat searchterm1 @dog searchterm2 @(cat, doggerel)
                                      ^

Terms:['caterpillar'] Line: 6 Col: 5
    @caterpillar
    ^
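If it helps to see where the Line and Col numbers come from, here is a rough plain-Python sketch of the arithmetic behind a character offset like the `s` yielded by scanString. The function names `line_number`, `column`, and `line_text` are made up for this illustration; they mirror the idea of pyparsing's lineno, col, and line helpers, not necessarily their exact edge-case behavior.

```python
def line_number(loc, text):
    """1-based line number of the character offset `loc`."""
    return text.count("\n", 0, loc) + 1

def column(loc, text):
    """1-based column of `loc` within its own line."""
    # rfind returns -1 on the first line, which conveniently yields loc + 1
    return loc - text.rfind("\n", 0, loc)

def line_text(loc, text):
    """The full line of `text` that contains offset `loc`."""
    start = text.rfind("\n", 0, loc) + 1
    end = text.find("\n", loc)
    return text[start:] if end == -1 else text[start:end]

# hypothetical two-line sample; the second line is indented by 4 spaces
sample = "@cat ok\n    @(goat) bad"
loc = sample.index("@(goat)")

print(line_number(loc, sample), column(loc, sample))  # 2 5
print(line_text(loc, sample))
print(" " * (column(loc, sample) - 1) + "^")
```

Because the column counts leading whitespace, the caret lands under the `@` only if the line is printed with its original indentation.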
Finally, this snippet does all the scanning for you and gives you just the list of invalid tags found:
# print out all of the invalid terms found, with duplicates removed
print(list(set(sum(sphxTerm.searchString(test), ParseResults([])))))
Prints (the order may vary, since a set is unordered):
['caterpillar', 'goat', 'doggerel']
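As a sanity check on the same match-then-filter logic, here is a minimal sketch using only the standard `re` module instead of pyparsing. It is a swapped-in technique, not what the parser above does internally; the names `VALID`, `TAG`, and `find_invalid_tags` are invented for this example, and the pattern assumes a term is a letter followed by letters or digits. Unlike the set-based one-liner, it keeps the tags in order of first appearance.

```python
import re

# tags assumed to be valid, mirroring the `valid` set above
VALID = {"cat", "mouse", "dog"}

# "@" immediately followed by a bare term, or by a parenthesized,
# comma-separated list of terms ("@ house" does not match)
TERM = r"[A-Za-z][A-Za-z0-9]*"
TAG = re.compile(r"@(?:(%s)|\(\s*(%s(?:\s*,\s*%s)*)\s*\))" % (TERM, TERM, TERM))

def find_invalid_tags(text):
    """Return the invalid tag names, de-duplicated, in order of first appearance."""
    found = []
    for m in TAG.finditer(text):
        single, group = m.groups()
        names = [single] if single else re.split(r"\s*,\s*", group)
        for name in names:
            if name not in VALID and name not in found:
                found.append(name)
    return found

sample = """@cat search terms @ house
    @(cat) search terms
    @(cat, dog) search term @(goat)
    @cat searchterm1 @dog searchterm2 @(cat, doggerel)
    @(cat, dog) searchterm1 @mouse searchterm2
    @caterpillar"""

print(find_invalid_tags(sample))  # ['goat', 'doggerel', 'caterpillar']
```

The regex version gets harder to maintain as the tag syntax grows, which is where the pyparsing grammar earns its keep.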