Hello there

I have a table with about 130,000 records containing telephone numbers. The numbers are all formatted like this: +4311234567. They always include the international country code, the local area code and then the phone number, and sometimes an extension.

There is a web service which checks for the caller's number in the table. That service already works. But now the client wants the service to also return a result if someone calls from a company whose number is already in the database but whose extension is not.

Example for table.

|  **id**  | **telephonenumber** | **name**               |
|    1     | +431234567          | company A              |
|    2     | +431234567890       | employee in company A  |
|    3     | +4398765432         | company b              |

Now if somebody from company A calls with a different extension, for example +43123456777, then it should return id 1. The problem is that I don't know how many digits the extensions have. It could be 3, 4 or more digits.

Are there any patterns for this kind of string matching?

The data is stored in a SQL Server 2005 database.

Thanks

EDIT:
I am getting the telephone numbers from a CRM system. I've talked with the admin of the CRM and he is trying to send me the data in a different format.

|  **id**  | **telephonenumber** | **extension** | **name**               |
|    1     | +431234567          |               | company A              |
|    2     | +431234567          | 890           | employee in company A  |
|    3     | +4398765432         |               | company b              |
+1  A: 

The number of digits in an extension is PBX-specific. The number of digits in an area code + phone number is country/carrier-specific.

One way to do it would be to define additional rules, for example ...

+43123 | 12

... to say that anything beginning with +43123 is a 12-digit number, and that anything beyond that is an extension: this lets you use (configurable instead of hard-coded) data to specify where an extension would begin.
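A rough sketch of how such a rule table might be applied, in Python. The `RULES` dict, the function name, and the convention that "12-digit" counts digits only (not the leading "+") are my assumptions for illustration, not something from the question:

```python
# Hypothetical rule table: number prefix -> digit count of the base number.
# Assumption: the count covers digits only, the leading "+" is not counted.
RULES = {"+43123": 12}


def split_extension(number, rules=RULES):
    """Split `number` into (base, extension) using the longest matching rule.

    Returns None when no rule prefix matches the number.
    """
    matches = [prefix for prefix in rules if number.startswith(prefix)]
    if not matches:
        return None
    prefix = max(matches, key=len)   # the most specific rule wins
    base_len = rules[prefix] + 1     # +1 for the leading "+"
    return number[:base_len], number[base_len:]
```

With the single rule above, `split_extension("+431234567890123")` would give `("+431234567890", "123")`. Since the rules live in data, they could just as well be loaded from a table that support staff maintain.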

Another way might be to insist that for any number-with-extension entry there should also be a corresponding number-without-extension entry, as shown in your example of "company A".

ChrisW
+1, but needs access to these numbers. Might be hard (I know that this would be awful for Germany at least, with valid numbers ranging from 3 (rare) to 7 or possibly 8 digits per area code, depending on the region)
Benjamin Podszun
Yes, that's a good idea, but the problem is that there are also some special numbers which are shorter than normal telephone numbers.
nWorx
Aren't 'special' numbers likely to look like a different area code? For more complicated scenarios you can specify that the match string is a regular expression, so that you can specify things like "+43[2-8]23"
ChrisW
True, but maintaining the database of country+area codes would be an absolute nightmare.
Mark Booth
Just not as bad as maintaining the same thing as code; it can push the problem/maintenance off to the customer (or tech support or the sales staff).
ChrisW
+4  A: 

Is there a way to determine exactly which part of the stored number is an extension? Or are the "base" numbers without extension stored as well? If so, you could just check whether a number in your database (without extension) is a prefix of the number you are checking. A prefix is a substring that starts at the beginning of the string.
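A minimal sketch of that prefix check in Python, assuming the base numbers are already available as a plain list (the function name is made up for illustration):

```python
def find_base_number(incoming, base_numbers):
    """Return the longest stored base number that is a prefix of `incoming`.

    Returns None when no stored number is a prefix of the incoming number.
    """
    candidates = [n for n in base_numbers if incoming.startswith(n)]
    return max(candidates, key=len) if candidates else None
```

Taking the longest candidate is a design choice: if one stored number happens to be a prefix of another, the more specific one is preferred.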

But if your database only contains numbers with extensions and there is no way to find out how many digits belong to the extension, I believe you cannot find an exact solution.

HansDampf
You were faster ;)
Benjamin Podszun
@Hans, Can you think of any reason why my suggested algorithm wouldn't work? I know it would probably result in multiple lookups, but the OP doesn't say how expensive a lookup is, or what the performance requirements are.
Mark Booth
+1  A: 

Well, my understanding of the phone number system is that no two valid, complete numbers can exist where one is a prefix of the other. A common prank over here is to give out your number as 11 05 32 or something similar, where 110 is the German police emergency number.

So - if you can change the database structure and preprocess the data, you could look for numbers that share a prefix (sort them first; if a longer number starts with a shorter one, the longer one is an extension). Every match is one of:

  • A base number (the shortest one)
  • A direct number plus extension (all longer ones)

I'd mark those in the database for faster lookup, if possible.
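The preprocessing step could look roughly like this in Python. It is a sketch under the assumption that every company's base number is present in the data; `classify_numbers` is a name I made up, and the nested scan is fine for a one-off job but not tuned for speed:

```python
def classify_numbers(numbers):
    """Map each stored number to its base number.

    The base number is the shortest stored number that is a prefix of it.
    A number that maps to itself is a base number; every other number is
    treated as base number plus extension.
    """
    ordered = sorted(set(numbers), key=len)  # shortest numbers first
    bases = []
    result = {}
    for number in ordered:
        # Look for an already known base that this number starts with.
        base = next((b for b in bases if number.startswith(b)), None)
        if base is None:
            bases.append(number)  # nothing shorter prefixes it: new base
            base = number
        result[number] = base
    return result
```

The resulting mapping could then be written back as a flag or a base-number column so the lookup at call time stays cheap.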

This approach falls short where a company uses a common default extension. Over here lots of companies give out something like 1234567-0 as their external number, where the 0 can be replaced with a 2-4 digit extension. For your example data it should work, though.

Benjamin Podszun
The numbers are always formatted properly and the number matching comes from caller ID so it is not spoofed. Your approach should work.
Unreason
A: 

That will be impossible without further information: if your table is structured as above, the system has no means of knowing which part is the base number and which part is the extension. So it would return "company b" for any (unknown) number starting with "+439".

EDIT (@MarkBooth)

I stand by my claim that it's impossible without additional information. Just to make it clearer: say we have the following information in our database

...
+43316852132 - ....
+433168731 - Company A (reception)
+433168739999 - Company A, Mr. X
+433168911321 - ....
...

The structure of these numbers is +43 (316) 873 - 1, which the program doesn't know. So if a number +43316872133 (+43 (316) 87 21 33 with structure) is calling (which is not in the database), you (and therefore your software :)) cannot tell if it belongs to company A or not without further information.

The only solution would be to maintain "base numbers" for companies against which you can do a simple prefix search.

MartinStettner
@MartinStettner, Do you stand by your assertion that this is impossible? I would be interested to know if there are algorithmic flaws (as opposed to efficiency flaws) in my proposed solution.
Mark Booth
The structure is completely irrelevant. Just as you don't key in the structure when you dial the number, you strip it out in the database. The database in your edit doesn't follow the examples given by the OP: there should be a `+43316873 - Company A`, in which case my algorithm would return `Company A` and `extension 23` for `+4331687323`. Knowing the base numbers of each company doesn't actually help you look them up if you don't know what the base part of your incoming number is; that is the problem my search algorithm fixes. If you can fault that algorithm, please do.
Mark Booth
I think the problem arises if you only have "extension" numbers of some company. In my example, this would be the +433168731 number: all numbers of Company A start with +43316873, but this "base" number is not in the database. I think this is a common case with (especially hand-entered) telephone number data. And I have to stand by my claim that in this case every possible algorithm would falsely report any number starting with +4331687... as belonging to company A, simply because there is no way anyone could tell what is the base number and what is the extension.
MartinStettner
Of course, if you can guarantee that for every company in your directory you'll always have the base number in your data, there are a couple of ways to solve the problem. Also please note that what I'm saying is more or less the same as HansDampf's response, second case. Perhaps he put it in better words than I could.
MartinStettner
+1  A: 

If you are dealing with phone numbers from different countries it will be almost impossible. The length often changes, even within the same country. If you know what the lengths will be (or you want to maintain a list like ChrisW said), you can use the LEFT(field, x) function to truncate the phone number before searching for the company's phone number. Note that if you are doing a join, it will probably run much slower because it has to run the function on every row.

Nelson
Not only that, but in the UK, if I remember correctly, you can buy a whole direct dial number range and just use the digits common to all numbers in the range, i.e. if you have 0123 456700 to 0123 456799 you can use 0123 4567 as your presented caller ID for all numbers. Plus, this presented ID can be independent of whether that number is a valid one to accept calls. I have certainly had calls from direct marketers whose caller ID was not a valid number to call them back on.
Mark Booth
I don't know how it is in the UK, but in the US caller ID is considered a "convenience" only. With the right access to the telco, you can present any number you wish. If you block caller ID, it is still sent through from carrier to carrier with a flag and in theory the last carrier will not show it to the final recipient (such as a home user). I think there are some new laws that prevent you from hiding the true origin, but there is no technical limit.
Nelson
+1  A: 

Assuming you get a phone number such as +431234567891 from caller ID

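-- [Table] is a placeholder name; string literals need single quotes in T-SQL
-- every stored number starts with '+', so a match can only occur at position 1
SELECT name, id
FROM [Table]
WHERE CHARINDEX(telephonenumber, '+431234567891') > 0;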

would return the company, and in case of +431234567890 would return 2 records

  • company
  • actual extension

If you can deal with two rows returned from the client side you should be fine with the above.
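To try the idea without a SQL Server at hand, here is a small Python sketch against an in-memory SQLite database. The table name `phonebook` and both function names are made up, and `substr`/`length` stand in for the T-SQL functions, so treat it as an illustration of the reversed lookup rather than the production query:

```python
import sqlite3


def make_demo_db():
    """Build an in-memory table holding the example rows from the question."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE phonebook "
        "(id INTEGER PRIMARY KEY, telephonenumber TEXT, name TEXT)"
    )
    conn.executemany(
        "INSERT INTO phonebook VALUES (?, ?, ?)",
        [
            (1, "+431234567", "company A"),
            (2, "+431234567890", "employee in company A"),
            (3, "+4398765432", "company b"),
        ],
    )
    return conn


def lookup(conn, incoming):
    """Return (id, name) rows whose stored number equals or prefixes `incoming`.

    Rows are ordered longest stored number first, so the most specific
    match (the actual extension, if known) comes before the company.
    """
    sql = """
        SELECT id, name
        FROM phonebook
        WHERE substr(?, 1, length(telephonenumber)) = telephonenumber
        ORDER BY length(telephonenumber) DESC
    """
    return conn.execute(sql, (incoming,)).fetchall()
```

Ordering by length means the client can simply take the first row when it only wants the best match.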

Preprocessing the data is better (performance-wise), but for that you need to describe the data in more detail, for example:

  • are extensions only 3 and 4 digits,
  • is the base number always 9 or 10 digits,
  • do you always have at least one extension number for companies with extensions, etc...
Unreason
Extensions have different lengths; they differ from company to company. Base numbers also depend on the country and on whether they are special numbers or normal telephone numbers.
nWorx
OK, as your data will be in better shape, do you still need assistance writing the query? Also, did you try my original suggestion? It should have been enough (it can also be used with the new structure).
Unreason
@Unreason +1 Now I see what you're doing - you reverse the problem. Instead of looking for the telephone number in the database, you check every number in the database to see if it either matches or prefixes the incoming number. Very nice. I'm not sure if this will end up more or less efficient than my binary search, but it is certainly more elegant.
Mark Booth
A: 

Given that the number of digits in the extension can be different for each company and the number of digits in the number could be different for each country and area code, this is a tricky problem to do efficiently.

Even if you get the data table split into base number and extension, you still have to split the incoming number into base number and extension, which I actually think complicates things.

What I would be inclined to try is:

Original format

  1. Try to match the incoming number with the database.
    • If it matches one record, you have your answer - a specific person.
    • If it matches more than one record, something has gone wrong, so fail.
    • Otherwise, you have to find the company:
  2. Strip off the trailing digit from the incoming number and try to match this with the database again.
    • If the number of digits drops below a threshold (probably 6 digits) then your search should probably fail. This is just to limit the number of database searches performed when the number isn't going to be found.
    • If it matches no records, then you need to try this step again.
    • If it matches more than one record, something has gone wrong, so fail.
    • If it matches exactly one record, you have your next best answer - the company.

For example, searching for "+43123456777":

  • +43123456777 matches 0 entries.
  • +4312345677 matches 0 entries.
  • +431234567 matches 1 entry: "Company A"
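The original-format steps above could be sketched in Python like this. A dict stands in for the database, so the "more than one record" failure case cannot occur here; the function name and the `min_digits` parameter are assumptions for illustration:

```python
def lookup_original(incoming, records, min_digits=6):
    """Strip trailing digits from `incoming` until a stored number matches.

    `records` maps stored number -> name and stands in for the database.
    Returns (matched_number, name), or None once fewer than `min_digits`
    digits remain without any match.
    """
    candidate = incoming
    while len(candidate.lstrip("+")) >= min_digits:
        if candidate in records:
            return candidate, records[candidate]
        candidate = candidate[:-1]  # drop the trailing digit and retry
    return None
```

Each loop iteration corresponds to one database lookup, which is where the linear cost in the length of the number comes from.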

The main failure mode of this approach is if a company has variable length extension numbers. For instance consider what happens if both 431234567890 and 43123456789 are valid numbers but only the second one is in the database. If the incoming number is 431234567890, then 43123456789 will be matched in error.

Split format

This is a little more complex, but more robust.

  1. Try to match the incoming number with the database.
    • If it matches one record, you have your answer - the company.
    • If it matches more than one record, match the entry without an extension and you have found the company.
    • Otherwise, you have to find the base company number and extension:
  2. Strip off the trailing digit from the incoming number and try to match this with the database again.
    • If the number of digits drops below a threshold (probably 6 digits) then your search should probably fail. This is just to limit the number of database searches performed when the number isn't going to be found.
    • If it matches no records, then you need to try this step again.
    • If it matches one record, then you have found your answer - the company.
    • If it matches more than one record, then you have found the base number of the company and thus now know the extension, so can try to look up the specific person:
  3. Strip the base number from the start of the original incoming number and use this to search the extensions of the records with that base number.
    • If it matches exactly one record, you have found a specific person.
    • If it doesn't match a specific person, match the entry without an extension and you have found the company.

For example, searching for "+43123456777":

  • +43123456777 matches 0 entries.
  • +4312345677 matches 0 entries.
  • +431234567 matches 2 entries: "empty:Company A" & "890:employee in company A"
  • Within these two matches "77" matches nothing, so return the empty extension: "Company A".
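A possible Python sketch of the split-format steps, folding steps 1 and 2 into one loop. The rows are assumed to be `(base_number, extension, name)` tuples with an empty string for "no extension"; the function name and `min_digits` are made up, and the index is rebuilt on every call only to keep the example short:

```python
from collections import defaultdict


def lookup_split(incoming, rows, min_digits=6):
    """Find the best matching name for `incoming` in split-format rows.

    `rows` is an iterable of (base_number, extension, name), where the
    extension is "" for the company entry itself.
    """
    by_base = defaultdict(dict)
    for base, ext, name in rows:
        by_base[base][ext] = name

    candidate = incoming
    while len(candidate.lstrip("+")) >= min_digits:
        entries = by_base.get(candidate)
        if entries:
            if len(entries) == 1:
                # Exactly one record for this base number: that is the answer.
                return next(iter(entries.values()))
            # Several records: the rest of the incoming number is the extension.
            ext = incoming[len(candidate):]
            # Prefer the exact extension, else fall back to the company entry.
            return entries.get(ext, entries.get(""))
        candidate = candidate[:-1]  # drop the trailing digit and retry
    return None
```

For the example rows, `+43123456777` falls back to the company entry because extension `77` is unknown, while `+431234567890` resolves to the employee.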

Implementation notes

This algorithm, as noted above, does have some efficiency problems. If the database lookup is expensive, the cost grows linearly with the length of the telephone number, especially in the case where no similar numbers exist in the database (for example, if the incoming number is from Kazakhstan, but there are no Kazakhstan numbers in the database).

You could add some optimisations relatively easily though. If most of the companies you deal with use 3 or 4 digit extensions, you could start by stripping, say, 4 digits off the end and then doing a binary chop until you reach an answer. This would reduce the number of lookups for a 15 digit number to 4 or 5 in many cases, and at most 6.

Also, every time you narrow the selection, you could select only within the previous selection rather than having to select within the whole database.

Additional implementation notes

Having finally worked out how Unreason's answer works, I can see that it is a much simpler, more elegant solution. I wish I'd thought of simply looking for the database number in the incoming number rather than the other way around.

My only concern is that performing this on every telephone number in the database might impose excessive demands on the server. I would suggest benchmarking that solution under maximum stress to see if it causes problems. If not, fine - use that. If it does, consider implementing the simple form of my algorithm and doing the stress tests again. If the performance is still too low, try my binary search suggestion.

Mark Booth
@Mark: I added an example to my response to show why (imo) it's still impossible to solve this problem in its original form (i.e. without further information)
MartinStettner
You will put every number whose first x digits are equal into one company, even if that isn't the case.
HansDampf
@MartinStettner, I think either you have misunderstood the problem or I have. I cannot see why my suggested algorithms would not solve nWorx's problem.
Mark Booth
@HansDampf, the implication from the question is that, in the original format, a company base number will always be a prefix of the base number plus its extension. In the split format, the base number will be explicit anyway, so `+43123456666` will never accidentally match `Company A` because even when it gets down to `+43123456` it won't match `+431234567`.
Mark Booth