large-data-volumes

How to build a wordlist

So now I want to make an Estonian wordlist of roughly 20 million unique words, all lowercase. As input for the wordlist, a corpus of Estonian can be used; the corpus files are in Text Encoding Initiative (TEI) format. I tried using a regex to find the words. This is what I made: it's inefficient, mcv is all messed up, and it breaks if the hashset of words can't fit i...
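A minimal sketch of one memory-bounded way to do this, assuming the TEI files have already been reduced to plain text: hash-partition the words into bucket files on disk, then deduplicate one bucket at a time, so no single hashset ever has to hold all 20M words. The directory name, bucket count, and word regex are all illustrative.

    // Phase 1 streams every file and shards words into bucket files on disk.
    // Phase 2 dedupes one bucket at a time, so memory use stays bounded.
    using System;
    using System.Collections.Generic;
    using System.IO;
    using System.Text.RegularExpressions;

    class WordlistBuilder
    {
        const int Buckets = 256;
        static readonly Regex Word = new Regex(@"\p{L}+", RegexOptions.Compiled);

        static void Main()
        {
            var writers = new StreamWriter[Buckets];
            for (int i = 0; i < Buckets; i++)
                writers[i] = new StreamWriter($"bucket_{i}.txt");

            // Stream line by line; the corpus is never held in memory.
            foreach (var file in Directory.EnumerateFiles("corpus_txt", "*.txt"))
                foreach (var line in File.ReadLines(file))
                    foreach (Match m in Word.Matches(line.ToLowerInvariant()))
                    {
                        string w = m.Value;
                        int bucket = (w.GetHashCode() & 0x7fffffff) % Buckets;
                        writers[bucket].WriteLine(w);
                    }
            foreach (var wtr in writers) wtr.Dispose();

            // Each bucket's distinct words fit in RAM even if the whole set does not.
            using var outFile = new StreamWriter("wordlist.txt");
            for (int i = 0; i < Buckets; i++)
            {
                var seen = new HashSet<string>(File.ReadLines($"bucket_{i}.txt"));
                foreach (var w in seen) outFile.WriteLine(w);
            }
        }
    }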

Copy one column to another for over a billion rows in SQL Server database

Database: SQL Server 2005. Problem: copy values from one column to another column in the same table, which has a billion+ rows: test_table (int id, bigint bigid). Attempt 1: a plain update query, update test_table set bigid = id, fills up the transaction log and rolls back due to lack of transaction log space. Attempt 2: a procedure on fol...
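The standard workaround is to break the copy into batches so each batch commits separately and the log space can be reused (under SIMPLE recovery, or with frequent log backups otherwise). A sketch in SQL Server 2005 syntax, assuming bigid is NULL before the copy so each pass can find the remaining rows; the batch size is illustrative.

    -- Each UPDATE is its own transaction, keeping log growth bounded.
    DECLARE @rows INT;
    SET @rows = 1;
    WHILE @rows > 0
    BEGIN
        UPDATE TOP (100000) test_table
           SET bigid = id
         WHERE bigid IS NULL;      -- only rows not yet copied
        SET @rows = @@ROWCOUNT;    -- 0 rows touched => done
    END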

SQLite view across multiple databases. Is this okay? Is there a better way?

Using SQLite, I have a large database split into years: DB_2006_thru_2007.sq3 DB_2008_thru_2009.sq3 DB_current.sq3. They all have a single table called hist_tbl with two columns (key, data). The requirements are: 1. to be able to access all the data at once; 2. inserts only go to the current version; 3. the data will continue to be...
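One common way to satisfy requirement 1 without merging the files is ATTACH plus a UNION ALL view, sketched here with the file names from the question. The view is read-only, so inserts still go straight at main.hist_tbl in the current database (requirement 2); note a TEMP view has to be recreated on each connection.

    -- Open DB_current.sq3 as the main database, then attach the archives.
    ATTACH DATABASE 'DB_2006_thru_2007.sq3' AS y0607;
    ATTACH DATABASE 'DB_2008_thru_2009.sq3' AS y0809;

    -- One view spanning every year; "key" is quoted since it is also a keyword.
    CREATE TEMP VIEW hist_all AS
        SELECT "key", data FROM y0607.hist_tbl
        UNION ALL
        SELECT "key", data FROM y0809.hist_tbl
        UNION ALL
        SELECT "key", data FROM main.hist_tbl;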

Adding an autonumber column to a SQL table which has more than 15 million records

I need to add an autonumber column to an existing table which has about 15 million records, in SQL Server 2005. How long do you think it will take, and what is the best way to do it? ...
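For reference, the statement itself is a one-liner; the open question is its cost. A hedged sketch: on SQL Server 2005, adding an IDENTITY column rewrites every existing row inside one transaction, so the runtime is dominated by I/O and transaction log growth on those 15M rows. Table and column names here are illustrative.

    -- Numbers all existing rows as it adds the column; size-of-data operation.
    ALTER TABLE dbo.my_table
        ADD row_no INT IDENTITY(1, 1) NOT NULL;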

Parallel.ForEach throws exception when processing extremely large sets of data

My question centers on some Parallel.ForEach code that used to work without fail; now that our database has grown to five times its former size, it breaks regularly. Parallel.ForEach<Stock_ListAllResult>( lbStockList.SelectedItems.Cast<Stock_ListAllResult>(), SelectedStock => { ComputeTipDown( SelectedStock.Symbol ); } ); The Com...
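A common failure mode at larger scale is unbounded parallelism exhausting a shared resource (database connections, memory), with the individual errors surfacing as a single AggregateException. A hedged sketch that caps the degree of parallelism and unwraps the failures, reusing the question's own identifiers (lbStockList, Stock_ListAllResult, ComputeTipDown):

    using System;
    using System.Linq;
    using System.Threading.Tasks;

    // Cap concurrency so a 5x larger data set cannot exhaust the pool.
    var options = new ParallelOptions { MaxDegreeOfParallelism = Environment.ProcessorCount };
    try
    {
        Parallel.ForEach(
            lbStockList.SelectedItems.Cast<Stock_ListAllResult>(),
            options,
            selectedStock => ComputeTipDown(selectedStock.Symbol));
    }
    catch (AggregateException ex)
    {
        // Parallel.ForEach wraps every failed iteration in one exception.
        foreach (var inner in ex.Flatten().InnerExceptions)
            Console.Error.WriteLine(inner);
    }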

Transfer large messages with Apache CXF

I'm writing a CXF web service to upload some large files - up to 1 GB. In most cases they won't be larger than 10-15 MB, but the problem is that it is inefficient to load the whole file and send it as a regular byte[] using the standard binding. A custom interceptor might be needed for that, but I'm not sure it's the only option, or how to write one. Thanks...
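CXF supports MTOM, which carries binary content as a streamed attachment instead of base64-encoded byte[] text, so a custom interceptor may not be necessary. A minimal JAX-WS sketch; the service names and the /tmp target path are illustrative.

    import java.io.InputStream;
    import java.nio.file.Files;
    import java.nio.file.Paths;
    import javax.activation.DataHandler;
    import javax.jws.WebService;
    import javax.xml.bind.annotation.XmlMimeType;
    import javax.xml.ws.soap.MTOM;

    @WebService
    interface FileUpload {
        void upload(String fileName,
                    @XmlMimeType("application/octet-stream") DataHandler content);
    }

    @MTOM(threshold = 4096)  // anything above 4 KB travels as a streamed attachment
    @WebService
    public class FileUploadImpl implements FileUpload {
        @Override
        public void upload(String fileName, DataHandler content) {
            try (InputStream in = content.getInputStream()) {
                // Stream straight to disk; the 1 GB file is never fully in memory.
                Files.copy(in, Paths.get("/tmp", fileName));
            } catch (java.io.IOException e) {
                throw new RuntimeException(e);
            }
        }
    }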

Cost of serialization in web service

My next project involves creating a data API within an enterprise framework. The data will be consumed by several applications running on different software platforms. While my colleagues generally favour SOAP, I would like to use a RESTful architecture. Most of the applications will need only a few objects per call. Othe...

ORM usage with potentially billions of records

I was thinking about this the other day: apps like Twitter deal with millions of users. I was wondering how the 'following' functionality would work, given that every user in the database can follow the maximum number of users less one (themselves). If this were a ManyToMany bidirectional mapping, it would cre...
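The question doesn't name the ORM, but the ManyToMany wording suggests JPA. A common way out is to promote the join table to a first-class entity and query it directly, so no object graph ever tries to materialize millions of follower rows. A hedged sketch; entity and field names are illustrative, and em stands for an assumed EntityManager.

    import javax.persistence.*;

    @Entity
    class User {
        @Id @GeneratedValue Long id;
    }

    // The follow relation as its own entity instead of a @ManyToMany collection.
    @Entity
    public class Follow {
        @Id @GeneratedValue Long id;
        @ManyToOne(fetch = FetchType.LAZY) User follower;
        @ManyToOne(fetch = FetchType.LAZY) User followee;
    }

    // Counting and paging then happen in the database, never in memory:
    //   em.createQuery("select count(f) from Follow f where f.followee = :u", Long.class)
    //     .setParameter("u", user).getSingleResult();
    //   em.createQuery("select f.follower from Follow f where f.followee = :u", User.class)
    //     .setParameter("u", user).setFirstResult(0).setMaxResults(50).getResultList();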

Computing token counters on a huge dataset

I need to go over a huge amount of text (> 2 TB, a full Wikipedia dump) and keep two counters for each token seen (each counter is incremented depending on the current event). The only operation I will need on these counters is increment. In a second phase, I should calculate two floats based on these counters and store them. It sh...
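A sketch of one streaming pass, under the assumption that the token vocabulary (unlike the 2 TB of text) fits in memory; since the counters are increment-only, lock-free Interlocked updates suffice even if the pass is later parallelized. The file name, the whitespace tokenization, and the SomeEvent placeholder for the question's "current event" are all illustrative.

    using System;
    using System.Collections.Concurrent;
    using System.IO;
    using System.Threading;

    class TokenCounters
    {
        // counts[token] = long[2]; indices 0 and 1 are the two event counters.
        static readonly ConcurrentDictionary<string, long[]> counts = new();

        static void Increment(string token, int which)
        {
            var pair = counts.GetOrAdd(token, _ => new long[2]);
            Interlocked.Increment(ref pair[which]);   // increment is the only operation
        }

        static void Main()
        {
            // ReadLines streams; the dump is never loaded whole.
            foreach (var line in File.ReadLines("dump.txt"))
                foreach (var token in line.Split(' ', StringSplitOptions.RemoveEmptyEntries))
                    Increment(token, which: SomeEvent(token) ? 0 : 1);
        }

        // Stand-in for whatever "the current event" means in the question.
        static bool SomeEvent(string token) => token.Length % 2 == 0;
    }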

Fastest way to fetch a subset (200M rows) from a very large table (600M rows) in SQL Server

We are facing the following problem and are trying to come up with the best possible solution. We are using SQL Server 2008. We have a table with more than 600 million records and about 25 columns. One of the columns is an ID, and it is indexed. We need to get a subset of records from this table. There are two main cases: ...
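The question is cut off before the two cases are described, but with an indexed ID column the usual shapes are a contiguous range scan or a join against a pre-loaded key table; both are sketched below with illustrative names. At 200M rows, a single set-based join beats 200M individual lookups or an enormous IN list.

    -- Case 1: the subset is a contiguous ID range -> one range scan on the index.
    DECLARE @first INT = 1, @last INT = 200000000;
    SELECT *
    FROM   big_table
    WHERE  id BETWEEN @first AND @last;

    -- Case 2: an arbitrary ID list -> stage the wanted keys in an indexed work
    -- table, then join. The keys would be bulk-loaded here (e.g. BULK INSERT).
    CREATE TABLE #wanted (id INT PRIMARY KEY);
    SELECT t.*
    FROM   big_table AS t
    JOIN   #wanted   AS w ON w.id = t.id;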