python 3.x - Discrepancy between rpy2 and standard R matrices
My goal
I'm trying to call the normalize.quantiles function from the preprocessCore R package (R-3.2.1) within a Python 3 script using the rpy2 package, on an enormous matrix (10 GB+ files). I have virtually unlimited memory. When I run the following through R, I am able to complete the normalization and print the output:
require(preprocessCore);
all <- data.matrix(read.table("data_table.txt", sep="\t", header=TRUE));
all[,6:57] <- normalize.quantiles(all[,6:57]);
write.table(all, "qn_data_table.txt", sep="\t", row.names=FALSE);
I'm trying to build a Python script that does some other things with this data using the rpy2 package, but I'm having trouble with the way it builds matrices. An example of my current approach is below:
matrix = sample_list  # A 2-D Python array containing the data.
v = robjects.FloatVector([element for col in matrix for element in col])
m = robjects.r['matrix'](v, ncol=len(matrix), byrow=False)
print("Performing quantile normalization.")
rnormalized_matrix = preprocessCore.normalize_quantiles(m)
norm_matrix = np.array(rnormalized_matrix)
return header, pos_list, norm_matrix
The issue
This works fine for smaller files, but when I run it on huge files it dies with the error:
rpy2.rinterface.RRuntimeError: Error: cannot allocate vector of size 9.7 Gb
I know that the maximum size of a vector in R is 8 GB, which explains why the above error is being thrown. The rpy2 docs say:
"a matrix special case of array. arrays, 1 must remember vector dimension attributes (number of rows, number of columns)."
I sort of wondered how strictly rpy2 adhered to this, so I changed my code to initialize a matrix of the size I wanted and then iterate through it, assigning each element its value:
matrix = sample_list  # A 2-D Python array of the data.
m_count = 1
m = robjects.r['matrix'](0.0, ncol=len(matrix), nrow=len(matrix[0]))
for samp in matrix:
    i_count = 1
    for entry in samp:
        m.rx[i_count, m_count] = entry  # Assign the data element.
        i_count += 1
    m_count += 1
print("Performing quantile normalization.")
rnormalized_matrix = preprocessCore.normalize_quantiles(m)
norm_matrix = np.array(rnormalized_matrix)
return header, pos_list, norm_matrix
Again, this works for smaller files, but crashes with the same error as before.
So my question is: what is the underlying difference that allows the assignment of huge matrices in R but causes issues in rpy2? Is there a different way I need to approach this? Should I just suck it up and do it in R? Or is there a way to circumvent the issues I'm having?
At its root, R is a functional language. This means that when doing the following in R:
m[i, j] <- 123
what is actually happening is something like:
assign_to_cell <- `[<-`
m <- assign_to_cell(m, i, j, 123)
where the arguments are passed by value.

This would mean that a new matrix m, with the cell at (i, j) containing the new value, should be returned. Making a copy of m with each assignment would be rather inefficient, particularly with larger objects such as the ones you are working with, so the R interpreter has a nifty trick (see R's C source for the details): the left side of the expression is compared with the right side of the expression, and if the objects are the same, the interpreter knows that the copy is unnecessary and the modification can be done "in place".
Now with rpy2 there are two additional points to consider: while Python is (mostly) passing arguments by reference, the embedded R engine has no way of knowing what is happening on the left side of a Python expression.
The expression
m.rx[i_count, m_count] = entry
is faithfully building an R call like
m <- assign_to_cell(m, i, j, entry)
but the ability for R to look ahead at the left side of the expression is lost, and as a result a copy of m is made with each modification.
However, rpy2 makes it easy to move vectors, matrices, and arrays defined in R into Python's pass-by-reference world. For example, these R objects can be aliased to corresponding numpy objects (using asarray - see http://rpy.sourceforge.net/rpy2/doc-2.0/html/numpy.html#low-level-interface). Remembering that R arrays are stored in column-major order, one can also compute the index, skip the aliasing to a numpy array, and modify the cells in-place with:
m[idx] = entry
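As a rough sketch of that idea (assuming sample_list is the column-wise 2-D Python list from the question, and that preprocessCore has been loaded through importr as in your code), filling the pre-allocated R matrix through its 0-based, column-major linear index avoids the per-assignment copies:

import numpy as np
import rpy2.robjects as robjects
from rpy2.robjects.packages import importr

preprocessCore = importr('preprocessCore')

def quantile_normalize(sample_list):
    # sample_list: hypothetical 2-D Python list, one inner list per column.
    ncol = len(sample_list)
    nrow = len(sample_list[0])
    # Allocate the R matrix once; it is then filled in place, so no copy
    # of m is made for each assignment.
    m = robjects.r['matrix'](0.0, nrow=nrow, ncol=ncol)
    for j, col in enumerate(sample_list):
        for i, entry in enumerate(col):
            # R matrices are column-major: element (i, j) sits at the
            # 0-based linear index j * nrow + i of the underlying vector.
            m[j * nrow + i] = entry
    rnormalized_matrix = preprocessCore.normalize_quantiles(m)
    return np.array(rnormalized_matrix)

The point of the design is that only one R-side copy of the data ever exists while the matrix is being filled; the only full copy happens at the end, when the normalized result is converted to a numpy array.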
Note:

I think the limitation at 8 GB, caused (if I remember correctly) by R indices being 32-bit integers, was lifted a couple of years ago. You might also have less unlimited memory than you believe: having a given amount of physical memory on the system does not mean that one can allocate all of it in one contiguous block.