I am trying to tokenize the following snippets by type of number:
"(0-22) 222-33-44, 222-555-666, tel./.fax (111-222-333) 22-33-44 UK, TEL/faks: 000-333-444, fax: 333-444-555, tel: 555-666-888"
and
"tel: 555-666-888, tel./fax (111-222-333) 22-33-44 UK"
and
"fax (111-222-333) 22-33-44 UK, TEL/faks: 000-333-444, fax: 333-444-555"
and so on.
The idea is that a string can be any combination of type labels like "tel/faks" followed by their tel/fax numbers, or it can start with just a tel/fax number without a label.
I came up with this:
"(?:.(?!((tel|fax|faks)[ /:.]+)+))++"
and ran it on example 1, but after find() it returns (the '_' characters were added by me to mark the boundaries):
-
_(0-22) 222-33-44, 222-555-666,_
_TEL./_
_FAX (111-222-333) 22-33-44 UK,_
_TEL_
_FAKS: 000-333-444,_
_FAX: 333-444-555_
It seems that I am losing one character in every group, and combined types like "TEL/faks" are split. I also need to capture the type for future processing (if it exists; if not, the number defaults to tel).
How can I fix this?
PS: I am using the case-insensitive flag.
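
To make the goal concrete, here is a minimal sketch of the output I am after, not a definitive solution. It assumes the only labels are tel, fax and faks, that the separators are the ones from my character class, and that trailing commas and whitespace can simply be trimmed. The class name PhoneTokenizer and the method tokenize are made up for this example. As far as I can tell, my pattern stops one character before every label (that character is followed by a label, so the lookahead rejects it), which would explain both the lost character and the split of "TEL/faks". The sketch instead matches the label run first as an optional group and then consumes characters that do not start a label.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class PhoneTokenizer {
    // Label alternatives taken from the question; extend as needed.
    private static final String LABEL = "(?:tel|fax|faks)";

    private static final Pattern TOKEN = Pattern.compile(
            // group 1: optional label, possibly combined ("tel./.fax", "TEL/faks")
            "(" + LABEL + "(?:[ /:.]+" + LABEL + ")*)?"
            // separators between the label and the numbers
            + "[ /:.]*"
            // group 2: everything up to (not including) the next label
            + "((?:(?!" + LABEL + ").)+)",
            Pattern.CASE_INSENSITIVE);

    // Returns {type, numbers} pairs; the type defaults to "tel" (assumption from the question).
    static List<String[]> tokenize(String input) {
        List<String[]> tokens = new ArrayList<>();
        Matcher m = TOKEN.matcher(input);
        while (m.find()) {
            String type = m.group(1) != null ? m.group(1) : "tel";
            // Drop the trailing ", " left over before the next label.
            String numbers = m.group(2).replaceAll("[,\\s]+$", "").trim();
            if (!numbers.isEmpty()) {
                tokens.add(new String[] {type, numbers});
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        String input = "(0-22) 222-33-44, 222-555-666, tel./.fax (111-222-333) 22-33-44 UK, "
                + "TEL/faks: 000-333-444, fax: 333-444-555, tel: 555-666-888";
        for (String[] t : tokenize(input)) {
            System.out.println(t[0] + " -> " + t[1]);
        }
    }
}
```

On example 1 this should give five tokens: the unlabeled numbers at the start with the default type tel, then "tel./.fax", "TEL/faks", "fax" and "tel", each kept whole together with its numbers. The labels are not word-bounded, so a word that merely contains "tel" or "fax" would also be treated as a label.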