Referencing and assigning a subset of a MATLAB dataset appears to be extremely inefficient, and the time taken possibly scales like rows^2.
Example:
alldata is a large dataset of mixed data - say 150,000 rows by 25 columns (integer, boolean and string).
The format for the dataset is:
'format', '%s%u%u%u%u%u%s%s%s%s%s%s%s%u%u%u%u%s%u%s%s%u%s%s%s%s%u%s%u%s%s%s%u%s'
I then convert two of the integer columns to boolean (logical).
The following subset assignment:
somedata = alldata(1:m,:)
takes more than 7 seconds for m = 10,000 and impractically long for larger values of m. Plotting time against m shows a roughly m^2 relationship, which is strange given that copying alldata is nearly instantaneous, as is using functions like sortrows and find. In fact, reading the original .csv data file is faster than the above assignment for large values of m.
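For anyone who wants to reproduce the scaling, here is a minimal timing sketch. It is not my exact script: the values in ms are arbitrary, and the small stand-in dataset (with made-up variable names id and name) is only built if alldata is not already in the workspace, so it will not match my real data.

```matlab
% Timing sketch: how does the subset assignment scale with m?
% Assumption: if alldata is not in the workspace, build a small stand-in
% (hypothetical variables id and name; requires the Statistics Toolbox).
if ~exist('alldata', 'var')
    n    = 20000;
    id   = (1:n)';
    name = strtrim(cellstr(num2str(id)));   % many unique strings
    alldata = dataset(id, name);
end

ms = [1000 2000 5000 10000];   % arbitrary values of m to try
t  = zeros(size(ms));          % elapsed seconds for each m
for k = 1:numel(ms)
    tic;
    somedata = alldata(1:ms(k), :);   % the subset assignment being timed
    t(k) = toc;
end

% Fit a line on log-log axes; a slope near 2 suggests m^2 scaling.
p = polyfit(log(ms), log(t), 1);
fprintf('estimated scaling exponent: %.2f\n', p(1));
loglog(ms, t, 'o-');
xlabel('m (rows extracted)');
ylabel('time (s)');
```

The fitted slope p(1) is only a rough indicator, since single tic/toc measurements are noisy for small m.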
Using the profiler, it appears that the function subsref contains a very slow line that performs string comparisons to determine the unique values within the dataset. Is this related to how the dataset type is stored (i.e. as a reference table)? The dataset includes a large number of unique string values.
Are there any solutions for extracting a subset of a dataset in MATLAB? For example preallocation (how?), or copying the dataset and deleting rows rather than assigning an extract/subset.
I am using a dual-core machine with 1.5 GB of RAM, but Task Manager reports less than 1 GB of RAM in use.
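To make the copy-and-delete idea mentioned above concrete, this is roughly what I have in mind. It is only a sketch: it assumes alldata is in the workspace, the value of m is just an example, and I do not know whether row deletion avoids the slow path inside the dataset indexing code.

```matlab
% Copy-and-delete sketch (assumes alldata exists; m is an example value).
m = 10000;

tic;
somedata = alldata;                          % whole copy, nearly instantaneous
somedata(m+1:size(somedata, 1), :) = [];     % delete the rows that are not wanted
tDelete = toc;

tic;
somedata2 = alldata(1:m, :);                 % the original subset assignment
tSubset = toc;

fprintf('copy+delete: %.2f s, subset: %.2f s\n', tDelete, tSubset);
```

Both approaches should leave m rows, so comparing tDelete with tSubset would show whether deletion is any cheaper.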