python 3.x - Discrepancy between rpy2 and standard R matrices
My goal
I'm trying to call the normalize.quantiles function from the preprocessCore R package (R-3.2.1) within a Python 3 script using the rpy2 package, on an enormous matrix (10 GB+ files). I have virtually unlimited memory. When I run the following through R, I am able to complete the normalization and print the output:
require(preprocessCore);
all <- data.matrix(read.table("data_table.txt", sep="\t", header=TRUE));
all[,6:57] <- normalize.quantiles(all[,6:57]);
write.table(all, "qn_data_table.txt", sep="\t", row.names=FALSE);
I'm trying to build a Python script that does some other things with this data using the rpy2 package, but I'm having trouble with the way it builds matrices. An example of my current approach is below:
matrix = sample_list  # A 2-D Python array containing the data.
v = robjects.FloatVector([element for col in matrix for element in col])
m = robjects.r['matrix'](v, ncol=len(matrix), byrow=False)
print("Performing quantile normalization.")
rnormalized_matrix = preprocessCore.normalize_quantiles(m)
norm_matrix = np.array(rnormalized_matrix)
return header, pos_list, norm_matrix
The issue
This works fine for smaller files, but when I run it on huge files it dies with the error:
rpy2.rinterface.RRuntimeError: Error: cannot allocate vector of size 9.7 Gb
I know that the maximum size of a vector in R is 8 GB, which explains why the above error is being thrown. The rpy2 docs say:
"a matrix special case of array. arrays, 1 must remember vector dimension attributes (number of rows, number of columns)."
I sort of wondered how strictly rpy2 adhered to this, so I changed my code to initialize a matrix of the size I wanted and then iterate through it, assigning each element its value:
matrix = sample_list  # A 2-D Python array of the data.
m_count = 1
m = robjects.r['matrix'](0.0, ncol=len(matrix), nrow=len(matrix[0]))
for samp in matrix:
    i_count = 1
    for entry in samp:
        m.rx[i_count, m_count] = entry  # Assign the data element.
        i_count += 1
    m_count += 1
print("Performing quantile normalization.")
rnormalized_matrix = preprocessCore.normalize_quantiles(m)
norm_matrix = np.array(rnormalized_matrix)
return header, pos_list, norm_matrix
Again, this works for smaller files, but crashes with the same error as before.
So my question is: what is the underlying difference that allows the assignment of huge matrices in R but causes issues in rpy2? Is there a different way I need to approach this? Should I just suck it up and do it in R? Or is there a way to circumvent the issues I'm having?
At its root, R is a functional language. This means that when doing the following in R:
m[i, j] <- 123
what is actually happening is something like:
assign_to_cell <- `[<-`
m <- assign_to_cell(m, i, j, 123)
where the arguments are passed by value.

This would mean that a new matrix m, with the cell at (i, j) containing the new value, should be returned. Making a copy of m with each assignment would be rather inefficient, particularly with larger objects such as the ones you are working with, so the R interpreter has a nifty trick (see R's C source for the details): the left side of the expression is compared with the right side of the expression, and if the objects are the same, the interpreter knows that the copy is unnecessary and the modification can be done "in place".
Now with rpy2 there are two additional points to consider: while Python is (mostly) passing arguments by reference, the embedded R engine has no way of knowing what is happening on the left side of a Python expression.
The expression
m.rx[i_count, m_count] = entry
is faithfully building an R call like
m <- assign_to_cell(m, i, j, entry)
but the ability for R to look ahead at the left side of the expression is lost, and as a result a copy of m is made with each modification.
However, rpy2 makes it easy to move vectors, matrices, and arrays defined in R into Python's pass-by-reference world. For example, these R objects can be aliased to corresponding numpy objects (using asarray - see http://rpy.sourceforge.net/rpy2/doc-2.0/html/numpy.html#low-level-interface). Remembering that R arrays are stored in column-major order, one can also compute the index, skip the aliasing to a numpy array, and modify the cells in-place with:
m[idx] = entry
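As a rough sketch of that idea (assuming sample_list is the column-wise 2-D Python list from the question, and that preprocessCore has been loaded through importr as in your code), filling the pre-allocated R matrix through its 0-based, column-major linear index avoids the per-assignment copies:

import numpy as np
import rpy2.robjects as robjects
from rpy2.robjects.packages import importr

preprocessCore = importr('preprocessCore')

def quantile_normalize(sample_list):
    # sample_list: hypothetical 2-D Python list, one inner list per column.
    ncol = len(sample_list)
    nrow = len(sample_list[0])
    # Allocate the R matrix once; it is then filled in place, so no copy
    # of m is made for each assignment.
    m = robjects.r['matrix'](0.0, nrow=nrow, ncol=ncol)
    for j, col in enumerate(sample_list):
        for i, entry in enumerate(col):
            # R matrices are column-major: element (i, j) sits at the
            # 0-based linear index j * nrow + i of the underlying vector.
            m[j * nrow + i] = entry
    rnormalized_matrix = preprocessCore.normalize_quantiles(m)
    return np.array(rnormalized_matrix)

The point of the design is that only one R-side copy of the data ever exists while the matrix is being filled; the only full copy happens at the end, when the normalized result is converted to a numpy array.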
Note:

I think the limitation at 8 GB, caused (if I remember correctly) by R indices being 32-bit integers, was lifted a couple of years ago. You might also have less unlimited memory than you believe: having a given amount of physical memory on the system does not mean that one can allocate all of it in one contiguous block.