views:

1004

answers:

7

Hey everyone, great community you got here. I'm an Electrical Engineer doing some "programming" work on the side to help pay for bills. I say this because I want you to take into consideration that I don't have proper Computer Science training, but I have been coding for the past 7 years.

I have several excel tables with information (all numeric), basically it is "dialed phone numbers" in one column and number of minutes to each of those numbers on another. Separately I have a list of "carrier prefix code numbers" for the different carriers in my country. What I want to do is separate all the "traffic" per carrier. Here is the scenario:

First dialed number row: 123456789ABCD,100 <-- That would be a 13 digit phone number and 100 minutes.

I have a list of 12,000+ prefix codes for carrier 1, these codes vary in length, and I need to check everyone of them:

Prefix Code 1: 1234567 <-- this code is 7 digits long.

I need to check the first 7 digits for the dialed number an compare it to the dialed number, if a match is found, I would add the number of minutes to a subtotal for later use. Please consider that not all prefix codes are the same length, some times they are shorter or longer.

Most of this should be a piece of cake, and I could should be able to do it, but I'm getting kind of scared with the massive amount of data; Some times the dialed number lists consists of up to 30,000 numbers, and the "carrier prefix code" lists around 13,000 rows long, and I usually check 3 carriers, that means I have to do a lot of "matches".

Does anyone have an idea of how to do this efficiently using C#? Or any other language to be kind honest. I need to do this quite often and designing a tool to do it would make much more sense. I need a good perspective from someone that does have that "Computer Scientist" background.

Lists don't need to be in excel worksheets, I can export to csv file and work from there, I don't need an "MS Office" interface.

Thanks for your help.

Update:

Thank you all for your time on answering my question. I guess in my ignorance I over exaggerated the word "efficient". I don't perform this task every few seconds. It's something I have to do once per day and I hate to do with with Excel and VLOOKUPs, etc.

I've learned about new concepts from you guys and I hope I can build a solution(s) using your ideas.

A: 

Maybe it would be simpler (not necessarily more efficient) to do it in a database instead of C#.

You could insert the rows on the database and on insert determine the carrier and include it in the record (maybe in an insert trigger).

Then your report would be a sum query on the table.

jvanderh
+8  A: 

It sounds to me like you need to build a trie from the carrier prefixes. You'll end up with a single trie, where the terminating nodes tell you the carrier for that prefix.

Then create a dictionary from carrier to an int or long (the total).

Then for each dialed number row, just work your way down the trie until you find the carrier. Find the total number of minutes so far for the carrier, and add the current row - then move on.

Jon Skeet
It's a good (and relatively efficient) solution, but I wonder about the need that he has to generate the trie itself. Obviously with 12k prefix codes he's not generating it by himself, but would need to do so with a tool. So either he's making an automated code generator to create the trie that he then runs against his data, or he uses a more conventional method.
Kevin
I'm imagining the prefix codes coming from a file somewhere. It's fairly easy to build the trie as you read that file. It would be nice to use the same trie for *millions* of lines of dialed numbers though, rather than just 30,000.
Jon Skeet
@John Skeet: Thanks for your answer, I am very interested in it, but unfortunately I don't posses all the necessary knowledge to proceed with it, could you elaborate a little bit more on your answer? If not, I will continue "googling". Thanks.
gus
Well, if you search for "trie C#" you'll find a few implementations which you could adapt, I'm sure. Which bit do you need more elaboration on?
Jon Skeet
(Downvoters: care to comment?)
Jon Skeet
Thanks Jon, I will look for it.
gus
A bit overkill compared to other suggestions is all Jon, for the average programmer like myself anyway.
Marc
If "average" means "mediocore" then, yea. Jon gave what is the most efficient solution in the whole thread. If you can't implement a trie and a utilize a dictionary then you might want to go look into a different profession. Programming is hard, lets go shopping!
Simucal
@Jon Skeet, +1, great answer.
Simucal
@Marc: Tries aren't really hard. You may not have implemented one before, but I suspect if you read through the Wikipedia article you'd be able to do so quite easily.
Jon Skeet
+1 for the new concept. Thanks!
Cameron MacFarland
+1  A: 

The easiest data structure that would do this fairly efficiently would be a list of sets. Make a Set for each carrier to contain all the prefixes.

Now, to associate a call with a carrier:

foreach (Carrier carrier in carriers)
{
    bool found = false;

    for (int length = 1; length <= 7; length++)
    {
        int prefix = ExtractDigits(callNumber, length);

        if (carrier.Prefixes.Contains(prefix))
        {
            carrier.Calls.Add(callNumber);
            found = true;
            break;
        }
    }

    if (found)
        break;
}

If you have 10 carriers, there will be 70 lookups in the set per call. But a lookup in a set isn't too slow (much faster than a linear search). So this should give you quite a big speed up over a brute force linear search.

You can go a step further and group the prefixes for each carrier according to the length. That way, if a carrier has only prefixes of length 7 and 4, you'd know to only bother to extract and look up those lengths, each time looking in the set of prefixes of that length.

Daniel Earwicker
A: 

I would probably just put the entries in a List, sort it, then use a binary search to look for matches. Tailor the binary search match criteria to return the first item that matches then iterate along the list until you find one that doesn't match. A binary search takes only around 15 comparisons to search a list of 30,000 items.

Dolphin
A: 

You may want to use a HashTable in C#.

This way you have key-value pairs, and your keys could be the phone numbers, and your value the total minutes. If a match is found in the key set, then modify the total minutes, else, add a new key.

You would then just need to modify your searching algorithm, to not look at the entire key, but only the first 7 digits of it.

Sev
+10  A: 

UPDATE

You can do a simple trick - group the prefixes by their first digits into a dictionary and match the numbers only against the correct subset. I tested it with the following two LINQ statements assuming every prefix has at least three digis.

const Int32 minimumPrefixLength = 3;

var groupedPefixes = prefixes
    .GroupBy(p => p.Substring(0, minimumPrefixLength))
    .ToDictionary(g => g.Key, g => g);

var numberPrefixes = numbers
    .Select(n => groupedPefixes[n.Substring(0, minimumPrefixLength)]
        .First(n.StartsWith))
    .ToList();

So how fast is this? 15.000 prefixes and 50.000 numbers took less than 250 milliseconds. Fast enough for two lines of code?

Note that the performance heavily depends on the minimum prefix length (MPL), hence on the number of prefix groups you can construct.

     MPL    Runtime
    -----------------
     1     10.198 ms
     2      1.179 ms
     3        205 ms
     4        130 ms
     5        107 ms

Just to give an rough idea - I did just one run and have a lot of other stuff going on.

Original answer

I wouldn't care much about performance - an average desktop pc can quiete easily deal with database tables with 100 million rows. Maybe it takes five minutes but I assume you don't want to perform the task every other second.

I just made a test. I generated a list with 15.000 unique prefixes with 5 to 10 digits. From this prefixes I generated 50.000 numbers with a prefix and additional 5 to 10 digits.

List<String> prefixes = GeneratePrefixes();
List<String> numbers = GenerateNumbers(prefixes);

Then I used the following LINQ to Object query to find the prefix of each number.

var numberPrefixes = numbers.Select(n => prefixes.First(n.StartsWith)).ToList();

Well, it took about a minute on my Core 2 Duo laptop with 2.0 GHz. So if one minute processing time is acceptable, maybe two or three if you include aggregation, I would not try to optimize anything. Of course, it would be realy nice if the programm could do the task in a second or two, but this will add quite a bit of complexity and many things to get wrong. And it takes time to design, write, and test. The LINQ statement took my only seconds.

Test application

Note that generating many prefixes is really slow and might take a minute or two.

using System;
using System.Collections.Generic;
using System.Diagnostics;
using System.Linq;
using System.Text;

namespace Test
{
    static class Program
    {
        static void Main()
        {
            // Set number of prefixes and calls to not more than 50 to get results
            // printed to the console.

            Console.Write("Generating prefixes");
            List<String> prefixes = Program.GeneratePrefixes(5, 10, 15);
            Console.WriteLine();

            Console.Write("Generating calls");
            List<Call> calls = Program.GenerateCalls(prefixes, 5, 10, 50);
            Console.WriteLine();

            Console.WriteLine("Processing started.");

            Stopwatch stopwatch = new Stopwatch();

            const Int32 minimumPrefixLength = 5;

            stopwatch.Start();

            var groupedPefixes = prefixes
                .GroupBy(p => p.Substring(0, minimumPrefixLength))
                .ToDictionary(g => g.Key, g => g);

            var result = calls
                .GroupBy(c => groupedPefixes[c.Number.Substring(0, minimumPrefixLength)]
                    .First(c.Number.StartsWith))
                .Select(g => new Call(g.Key, g.Sum(i => i.Duration)))
                .ToList();

            stopwatch.Stop();

            Console.WriteLine("Processing finished.");
            Console.WriteLine(stopwatch.Elapsed);

            if ((prefixes.Count <= 50) && (calls.Count <= 50))
            {
                Console.WriteLine("Prefixes");
                foreach (String prefix in prefixes.OrderBy(p => p))
                {
                    Console.WriteLine(String.Format("  prefix={0}", prefix));
                }

                Console.WriteLine("Calls");
                foreach (Call call in calls.OrderBy(c => c.Number).ThenBy(c => c.Duration))
                {
                    Console.WriteLine(String.Format("  number={0} duration={1}", call.Number, call.Duration));
                }

                Console.WriteLine("Result");
                foreach (Call call in result.OrderBy(c => c.Number))
                {
                    Console.WriteLine(String.Format("  prefix={0} accumulated duration={1}", call.Number, call.Duration));
                }
            }

            Console.ReadLine();
        }

        private static List<String> GeneratePrefixes(Int32 minimumLength, Int32 maximumLength, Int32 count)
        {
            Random random = new Random();
            List<String> prefixes = new List<String>(count);
            StringBuilder stringBuilder = new StringBuilder(maximumLength);

            while (prefixes.Count < count)
            {
                stringBuilder.Length = 0;

                for (int i = 0; i < random.Next(minimumLength, maximumLength + 1); i++)
                {
                    stringBuilder.Append(random.Next(10));
                }

                String prefix = stringBuilder.ToString();

                if (prefixes.Count % 1000 == 0)
                {
                    Console.Write(".");
                }

                if (prefixes.All(p => !p.StartsWith(prefix) && !prefix.StartsWith(p)))
                {
                    prefixes.Add(stringBuilder.ToString());
                }
            }

            return prefixes;
        }

        private static List<Call> GenerateCalls(List<String> prefixes, Int32 minimumLength, Int32 maximumLength, Int32 count)
        {
            Random random = new Random();
            List<Call> calls = new List<Call>(count);
            StringBuilder stringBuilder = new StringBuilder();

            while (calls.Count < count)
            {
                stringBuilder.Length = 0;

                stringBuilder.Append(prefixes[random.Next(prefixes.Count)]);

                for (int i = 0; i < random.Next(minimumLength, maximumLength + 1); i++)
                {
                    stringBuilder.Append(random.Next(10));
                }

                if (calls.Count % 1000 == 0)
                {
                    Console.Write(".");
                }

                calls.Add(new Call(stringBuilder.ToString(), random.Next(1000)));
            }

            return calls;
        }

        private class Call
        {
            public Call (String number, Decimal duration)
            {
                this.Number = number;
                this.Duration = duration;
            }

            public String Number { get; private set; }
            public Decimal Duration { get; private set; }
        }
    }
}
Daniel Brückner
@Daniel: Thank you very much, I will try your approach also.
gus
@Daniel: How could I incorporate the "minutes" SUM into your solution?
gus
I added aggregating the call durations to my test application - hope it helps.
Daniel Brückner
+1: excellent answer :)
Juliet
+1  A: 

How about dumping your data into a couple of database tables and then query them using SQL? Easy!

CREATE TABLE dbo.dialled_numbers ( number VARCHAR(100), minutes INT )

CREATE TABLE dbo.prefixes ( prefix VARCHAR(100) )

-- now populate the tables, create indexes etc
-- and then just run your query...

SELECT p.prefix,
    SUM(n.minutes) AS total_minutes
FROM dbo.dialled_numbers AS n
    INNER JOIN dbo.prefixes AS p
        ON n.number LIKE p.prefix + '%'
GROUP BY p.prefix

(This was written for SQL Server, but should be very simple to translate for any other DBMS.)

LukeH
@Luke: Another great idea I did not consider before. As practice for my "coding" skills I want to try all of them.
gus