views: 423
answers: 11

The general idea of the problem is that the data is arranged in the following three columns of a table:

"Entity" "parent entity" "value"
A001     B001  .10
A001     B002   .15
A001     B003  .2
A001     B004     .3
A002     B002    .34
A002     B003  .13
..
..
..
A002     B111  .56

This data is a graph of entities, and each value can be seen as the weight of a directed edge from parent entity to entity. I have to calculate how many different subsets of the parent entities of a particular entity have values that sum to more than .5 (say), in order to calculate something further (that later part is easy, not computationally complex).
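For concreteness, here is a rough sketch in C of the computation I mean for a single entity (the brute-force loop over all 2^k subsets is only for illustration; it is only feasible while the number of parents k stays small):

/* Count the subsets of one entity's parent values whose sum exceeds a threshold.
   Brute force over all 2^k non-empty subsets -- only feasible for small k. */
#include <stdio.h>

long count_subsets_above(const double *values, int k, double threshold)
{
    long count = 0;
    for (unsigned long mask = 1; mask < (1UL << k); mask++) {  /* each bit picks one parent */
        double sum = 0.0;
        for (int i = 0; i < k; i++)
            if (mask & (1UL << i))
                sum += values[i];
        if (sum > threshold)
            count++;
    }
    return count;
}

int main(void)
{
    /* values of the parent entities of A001 from the table above */
    double a001[] = { .10, .15, .2, .3 };
    printf("%ld subsets exceed .5\n", count_subsets_above(a001, 4, 0.5));
    return 0;
}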

The point is that the data is huge (Excel says data was lost when I open the file :( ). Which language or tool can I use? Some people have suggested SAS or Stata.

Thanks in advance

+1  A: 

SAS is an excellent language for quickly processing huge datasets (hundreds of millions of records in which each record has hundreds of variables). It is used in academia and in many industries (we use it for warranty claims analysis; many clinical trials use it for statistical analysis & reporting).

However, there are some caveats: in my opinion the language has several deficiencies which make it difficult to write modular, reusable code (there is a very rich macro facility, but no user-defined functions until version 9.2). Probably a bigger caveat is that a SAS license is very expensive; it probably wouldn't be practical for a single individual to purchase a license for their own experimentation, though the price may not be prohibitive for a large company. Still, I believe SAS sells a learning edition, which is likely less expensive.

If you're interested in learning SAS, here are some excellent resources:

There are also regional and local SAS users groups, from which you can learn a lot (for example in my area there is a MWSUG (Midwest SAS Users Group) and MISUG (Michigan SAS User's Group)).

Matt Nizol
+2  A: 

I'm guessing that the table you refer to is actually in a file, and that the file is too big for Excel to handle. I'd suggest that you use a language that you know well. Of those you know, select the one with these characteristics:

-- able to read files line by line;

-- supports data structures of the type that you want to use in memory;

-- has good maths facilities.

Regards

Mark

High Performance Mark
A: 

Perl would be a good place to start; it is very efficient at handling file input and string parsing. You could then hold the whole set in memory, or only the subsets.

David Waters
Depending on how large this file is, you may need to consider holding only parts of it in memory.
David Waters
+3  A: 

If you're considering SAS, you could take a look at R, a free language/environment used for data mining.

Tom
R does not handle large datasets gracefully. It has memory fragmentation issues that don't crop up until your data file is larger than Excel's limit.
Karl
+4  A: 

You can do this in SQL. Two options for the desktop (without having to install a SQL server of some kind) are MS Access or OpenOffice Database. Both can read CSV files into a database.

In there, you can run SQL queries. The syntax is a bit odd but this should get you started:

select Entity, sum(Value)
from Data
group by Entity
having sum(Value) > .5

Data is the name of the table into which you loaded the data; Entity and Value are the names of columns in the Data table.

Aaron Digulla
At least a simple SQL statement won't work (please read the problem carefully): I need to find the sum of every subset and check whether the sum of the elements of the set is > .5 or not. Thanks.
asin
You can do grouping and sums in SQL. See my edits. I didn't actually try that but it should get you started.
Aaron Digulla
+1  A: 

If you don't mind really getting into a language and using some operating system specific calls, C with memory-mapped files is very fast.

You would first need to write a converter that translates the textual data you have into a binary file suitable for memory mapping, and then a second program that maps that file into memory and scans through the data.
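As a rough illustration only (POSIX calls; the file name and the packed record layout are assumptions about what the converter might produce), the scanning program could look something like this:

/* Minimal sketch: map a binary file of fixed-width records and scan it.
   The file name and struct layout are assumptions for the sketch. */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

struct record {            /* hypothetical layout written by the converter */
    char   entity[8];
    char   parent[8];
    double value;
};

int main(void)
{
    int fd = open("data.bin", O_RDONLY);
    if (fd < 0) { perror("open"); return 1; }

    struct stat st;
    if (fstat(fd, &st) < 0) { perror("fstat"); return 1; }

    /* Map the whole file read-only; the OS pages the data in on demand. */
    struct record *recs = mmap(NULL, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    if (recs == MAP_FAILED) { perror("mmap"); return 1; }

    size_t n = st.st_size / sizeof(struct record);
    double total = 0.0;
    for (size_t i = 0; i < n; i++)        /* sequential scan over the mapped records */
        total += recs[i].value;
    printf("%zu records, total value %f\n", n, total);

    munmap(recs, st.st_size);
    close(fd);
    return 0;
}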

Jonathan
A: 

SQL is a good option. Database servers are designed to manage huge amounts of data, and they are optimized to use every resource available on the machine efficiently to gain performance.

Notably, Oracle 10 is optimized for multi-processor machines, automatically splitting requests across processors if possible (with the correct configuration, search for "oracle request parallelization" on your favorite search engine).

This solution is particularly efficient if you are in a big organization with good database servers already available.

Mathieu Garstecki
+2  A: 

I hate to do this, but I would recommend simply C. What you need is to figure out your problem in the language of math, then implement it in C. How to store a graph in memory is a large research area: you could use an adjacency matrix if the graph is dense (highly connected), or an adjacency list if it is not. Each of the subtree searches will be some fancy code, and it might be a hard problem.
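For illustration only, a bare-bones adjacency-list layout might look like the following sketch (struct names and field sizes are made up, not anything prescribed):

/* Minimal sketch of an adjacency-list representation of the parent graph.
   Each entity keeps a linked list of its incoming (parent -> entity) edges. */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

struct edge {
    char         parent[8];   /* name of the parent entity, e.g. "B001" */
    double       value;       /* weight of the edge parent -> entity */
    struct edge *next;        /* next parent of the same entity */
};

struct entity {
    char         name[8];     /* e.g. "A001" */
    struct edge *parents;     /* head of the list of parent edges */
};

static void add_edge(struct entity *e, const char *parent, double value)
{
    struct edge *ed = malloc(sizeof *ed);
    strncpy(ed->parent, parent, sizeof ed->parent - 1);
    ed->parent[sizeof ed->parent - 1] = '\0';
    ed->value = value;
    ed->next = e->parents;     /* push onto the front of the list */
    e->parents = ed;
}

int main(void)
{
    struct entity a001 = { "A001", NULL };
    add_edge(&a001, "B001", .10);
    add_edge(&a001, "B002", .15);

    for (struct edge *ed = a001.parents; ed; ed = ed->next)
        printf("%s -> %s  %.2f\n", ed->parent, a001.name, ed->value);
    return 0;                  /* freeing the list is omitted for brevity */
}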

As others have said, SQL can do it, and the code has even been posted. If you need help putting the data from a text file into a SQL database, that's a different question. Look up bulk data inserts.

The problem with SQL is that even though it is a wonderfully succinct language, it is parsed by the database engine and the underlying code might not be the best method. For most data access routines, the SQL database engine will produce some amazing code efficiencies, but for graphs and very large computations like this, I would not trust it. That's why you go to C. Some lower level language that makes you do it yourself will be the most efficient.

I assume you will need efficient code due to the bulk of the data.

All of this assumes the dataset fits into memory. If your graph is larger than your workstation's RAM (get one with 24 GB if you can), then you should find a way to partition the data so that it does fit.

Karl
A: 

Mathematica is quite good in my experience...

Paxinum
A: 

At least a simple SQL statement won't work (please read the problem carefully): I need to find the sum of every subset and check whether the sum of the elements of the set is > .5 or not. Thanks – asin Aug 18 at 7:36

Since your data is in Stata, here is the code to do what you ask in Stata (paste this code into your do-file editor):

//input the data
clear
input str10 entity str10 parent_entity value
A001 B001 .10
A001 B002 .15
A001 B003 .2
A001 B004 .3
A002 B002 .34
A002 B003 .13
A002 B111 .56
end

//create a var. holding the sum of each entity's values
bysort entity : egen sum_subset = total(value)

//flag the sets that sum > .5
bysort entity : gen indicator = 1 if sum_subset>.5
recode indicator (.=0)
lab def yn 1 "YES", modify
lab def yn 0 "No", modify
lab val indicator yn
li *, clean

Keep in mind that when using Stata, your data is kept in memory, so you are limited only by your system's memory resources. If you try to open your .dta file and it says 'op. sys. refuses to provide memory', then you need to use the command -set mem- to increase the memory available before loading the data.

Ultimately, StefanWoe's question:

May you give us an idea of HOW huge the data set is? Millions? Billions of records? Also an important question: Do you have to do this only once? Or every day in the future? Or hundreds of times each hour? – StefanWoe Aug 18 at 13:15

really drives your question more than which software to use... Automating this in Stata, even on an immense amount of data, wouldn't be difficult, but you could max out your resource limits quickly.


Eric A. Booth | [email protected]

eric.a.booth
A: 

I would use Java's BigInteger library and something functional, like say Hadoop.

aaa