tags:

views:

337

answers:

9

I am currently working on a project. In this project I have set of data which follows particular algorithm. I have to find the pattern.

 1    355138022809833    RUPQ730562P    247001    20578330    70175500    
 2    355138022809841    RUPQ730563D    247001    72754950    71957850    
 3    355138023475287    RVSQ831978E    247001    39374170    25101090    
 4    355138023475295    RVSQ831979F    247001    06260280    87190670    
 5    355138023475303    RVSQ831980L    247001    05025410    26440510    
 6    355138023475352    RVSQ831985Y    247001    96637700    48209200    
 7    355138023475360    RVSQ831986A    247001    27362620    70790740    
 8    355138023475378    RVSQ831987P    247001    16576600    30002180    
 9    355138023475386    RVSQ831988D    247001    74778020    98010580      
10    355138023475402    RVSQ831990M    247001    25716170    97946520

First column is the serial number. Next 3 columns are the input which will be given. Next 2 are the output of the algorithm.

So basically

I have 3 variables x, y, z (2nd, 3rd, 4th column of above data)

And

y1 = f1(x, y, z)

y2 = f2(x, y, z)

y1 is 5th column of above data

y2 is 6th column of above data

I have above data. Now I need to find function f1 and f2.

What procedure should I follow? What steps must be taken?

EDIT 1 by Krishna Kant Sharma

I posted this question to not to ask for the answering algorithm. I just asked for the steps necessary to take, to solve these kinds of problem, when we have alphabets in variables too. This is the first time, in my experience, some small portion of stackoverflow community have acted like closed minded people. Whats the whole point of stackoverflow? We are here to help each other understand and resolve problems. To lend a helping hand when some of us need it. So why dont we stop beating around some technical pureness (like alphabets are not alphabetical characters), and solve the main problem.

More Data

11   355138023475436  RVSQ831993L   247001   07481830  49057990 
12   355138023475444  RVSQ831994T   247001   65090950  87729430 
13   355138023475451  RVSQ831995B   247001   06689330  60021180 
14   355138023475469  RVSQ831996K   247001   05784310  69836640 
15   355138023475477  RVSQ831997Z   247001   13157740  35850670 
16   355138023475485  RVSQ831998Y   247001   68658020  77311320 
17   355138023475501  RVSQ832000N   247001   01567780  26994970 
18   355138023475519  RVSQ832001E   247001   43775370  58120770 
19   355138023475527  RVSQ832002F   247001   42463550  55145190 
20   355138023475535  RVSQ832003R   247001   85766840  15491950
+2  A: 

Look at rows 5 and 6. All 3 input parameters are the same and yet the output is different. I don't think this is possible to solve having only the data you gave us.

DmitryK
Nicely spotted! Unless the rownumber is also an input variable, this will probably be impossible to solve.
Cloud
Good eye Dmitry. In fact, it goes against CS principles to get different results given the same input. Is there a parameter missing? Is there a temporal component perhaps?
Paul Sasik
Sorry fellows. There was an error in the data. I have updated it now.
Krishna Kant Sharma
Your formatting went away...
Paul Sasik
+5  A: 

Sorry, but I think what you are asking is impossible (from a computational point of view).

The system this data comes from could be doing, say,

SELECT Y1, Y2 FROM my_secret_data WHERE Col1 = x AND Col2 = Y and Col3 = Z;

Where my_secret_data contains values that are not computationally derived.

So unless you had the underlying table, you could never find an algorithm that solves it (unless you had every possible combination of inputs and outputs - which would then mean reconstructing the entire table)

Outside computation, all I think you can do is look for patterns, and try to find out what the input / output values represent, and see where that takes you.

Edit:

All is not lost in certain situations; things would be different if the inputs, outputs and any functions the algorithm uses were continuous(given the inputs are alphanumeric this does not look the case here)

If they were, you could (probably) find an algorithm via interpolating (perhaps a neural network), but under these circumstances, given the values, I think you would need a lot more sample data.

DanSingerman
Thanks for your help, DanSingerman.
Krishna Kant Sharma
They look all but contiguous here... the entries are fairly ressemblant yet the outputs seems spread among the whole space!
Matthieu M.
+5  A: 

First step you should undertake is to get to know what the context of this input data is. Then you would have the option to make assumptions on what the result columns might be and what algorithms/function are typically used in that context.

The next point is to analyze yourself the input data looking for patterns and matchings with real world things (e.g. zip codes, serial numbers, dates, etc.) Therefor you should look at different parts of the input, but also at similar blocks of input.

If you don't succeed at the points before, you'll have no other choice than trial and error. You can still sort out some functions or algorithms by looking at the input data (e.g. letters will render typical mathematic functions useless, so maybe it's some hash functions.)

To shed some light on your input data:

  • the last character of x (and also y) is looking like a checksum char, so if you look for patterns check the number/text without the last char
  • the letters in y might be some common abbreviation of currencies, business processes or something else
  • the last 0 in the result columns might also be some checksum char or depending on the z column (not enough data to tell)

I'd try some (common) hash functions on some combinations of the input data which produce results of 8 digits and look for results.

MicSim
Thanks for the help, MicSim.
Krishna Kant Sharma
A: 

You can always define a piecewise function:

f1(355138022809833,RUPQ730562P,247001)=20578330 f1(355138022809841,RUPQ730563D,247001)=72754950

etc. as you don't require the continuity.

Jaska
No that would not work. We would not be able to find the unknown with them.
Krishna Kant Sharma
No matter what values you function takes on fixed points, it may take arbitrary values at the other points.
Jaska
A: 

Given that the z value is not changing in your sample data, it can be removed. There's really no general approach to solve this if you only have data. On the other hand, if you have the ability to test the function with arbitrary inputs of your own devising, you can use techniques similar to differential cryptanalysis.

Cade Roux
A: 

This seems like a problem that would be best-solved by a neural network. Hopefully you can get a bigger data set to train with though!

Ryan Fox
A: 

If you are sure there is a pattern you could try machine learning techniques. But your data-set to set up and train the "machine" is quite small (only 10 each). Further more you need a prediction, so several algorithms will not work for you like clustering, classification, association mining. Neural networks would be such a kind of technique. It's an option you can actually try out. Unfortunately I'm no machine learning/data mining expert and can't tell you the way. For Java have a look at WEKA.

Peter Kofler
This is almost certainly unproductive. Machine learning techniques generally give approximate solutions. They generally work by juggling parameters to predefined computations (in a neural net, these would usually be weights), and if the original functions aren't in the same form there will be no way for the net to duplicate them.
David Thornley
just a try ;-) ...
Peter Kofler
+2  A: 

Where are you getting these, and how are they being used?

If the entity that is using the real f1 and f2 is using them for authentication, and they're at all intelligent, f1 and f2 are going to be cryptographically secure functions, and you have essentially no chance of breaking them. If they're just checksums, you could try common checksum algorithms. I'd start with Wikipedia.

How much data can you get? Can you get the values of f1 and f2 for arbitrary inputs, or are you limited to what you can observe? If you can observe the values for inputs differing in one character, you can see how much of a change this makes. If the results are mostly the same, it's not a cryptographic hash, and you've got a better chance.

How important is this to you and your company? I'd say you have very little chance of succeeding, unless there's more than I'm seeing right now. It's extremely likely that any solution would involve a large number of brute-force searches.

Incidentally, in your solving process, don't use all the data. You can come up with some sort of function to fit any data points. Keep a little data back for tests to see if your derived function works on outside data.

Finally, is this even ethical? You haven't told us where these numbers come from, and it seems plausible that they're something another company generates for purposes of security, and their intentions may be good or bad. That's something to think about, if only because a company that behaves unethically to others might well behave unethically to its employees.

David Thornley
Well said. Thank you.
Krishna Kant Sharma
A: 

Look at the data from various angles:

  • What values occur, which digits.

  • Look at differences. This exposes that x and y are sequential and that there is a close relation when you strip off the last digit/letter.

  • Look at patterns. y1 begins with 06 then 05 in rows 4, 5 and 13, 14. The difference of the serial numbers minus the check digit is 16. It might be a coincidence, or it might not.

  • Run statistical tests (not much data to go by here).

  • Look at the data in various number systems (hex, binary).

  • Look at the prime factorization of numbers.

  • Look at the effect of small differences in the data.

  • You may initially want to exclude the first two rows, because their serial numbers are far off the others, which may obscure a possible pattern.

Try to find out as much background about the computation as you can.

Some knowledge of cryptanalysis wouldn't be bad either.

Then make up some working hypothesis how the y1 and y2 values are computed and test it. For example, the first I'd check would be some bit twiddling with shift and xor (maybe a CRC), or some linear function of the serial number modulo 10000000, disregarding the trailing zeros.

Rinse and repeat. If you have enough patience and it is not too hard you could find it.

starblue