python - pandas df.loc[z,x]=y how to improve speed? -

- March 15, 2014

i have identified 1 pandas command

timeseries.loc[z, x] = y

to responsible of time spent in iteration. , looking better approaches accelerate it. loop covers not 50k elements (and production goal ~250k or more), needs sad 20 seconds.

here code (ignore top half, timing helper)

def populatetimeseriestable(df, observable, timeseries):     """     go through rows of df ,      put observable timeseries      @ correct row (symbol), column (tsmean).     """      print "len(df.index)=", len(df.index)  # show number of rows      global bf, t     bf = time.time()                       # set 'before'     t = dict([(i,0) in range(5)])    # fill category timing zeros      def t(i):         """         timing helper: add passed time category 'i'. set 'before' now.         """         global bf, t          t[i] = t[i] + (time.time()-bf)         bf = time.time()              in df.index:             # slow loop         bf = time.time()          sym = df["symbol"][i]         t(0)          tsmean = df["tsmean"][i]         t(1)          tsmean = tsformatter(tsmean)         t(2)          o = df[observable][i]         t(3)          timeseries.loc[sym, tsmean] = o         t(4)      pprint import pprint     print "times needed (total = %.1f seconds) each command:" % sum(t.values())     pprint (t)      return timeseries

with (not important, not slow)

def tsformatter(ts):     "as human readable string, whole seconds"     return time.strftime("%y-%m-%d %h:%m:%s", time.gmtime(ts))

. .

--> the to-be-optimized code in for-loop.

(t, , t helper function & dict, timing.)

i have timed every step. vast majority of time:

len(df.index)= 47160 times needed (total = 20.2 seconds) each command: {0: 1.102,  1: 0.741,  2: 0.243,  3: 0.792,  4: 17.371}

is spent in last step

timeseries.loc[sym, tsmean] = o

i have downloaded , install pypy - sadly, doesn't support pandas yet.

any ideas how speed populating 2d array?

thanks!

edit: sorry, hadn't mentioned - 'timeseries' dataframe too:

timeseries = pd.dataframe({"name": titles}, index=index)

update: starting pandas 0.20.1 the .ix indexer deprecated, in favor of more strict .iloc , .loc indexers.

=====================================================================

@jezrael has provided interesting comparison , decided repeat using more indexing methods , against 10m rows df (actually size doesn't matter in particular case):

setup:

in [15]: df = pd.dataframe(np.random.rand(10**7, 5), columns=list('abcde'))  in [16]: df.info() <class 'pandas.core.frame.dataframe'> rangeindex: 10000000 entries, 0 9999999 data columns (total 5 columns):    float64 b    float64 c    float64 d    float64 e    float64 dtypes: float64(5) memory usage: 381.5 mb  in [17]: df.shape out[17]: (10000000, 5)

timing:

in [37]: %timeit df.loc[random.randint(0, 10**7), 'b'] 1000 loops, best of 3: 502 µs per loop  in [38]: %timeit df.iloc[random.randint(0, 10**7), 1] 1000 loops, best of 3: 394 µs per loop  in [39]: %timeit df.at[random.randint(0, 10**7), 'b'] 10000 loops, best of 3: 66.8 µs per loop  in [41]: %timeit df.iat[random.randint(0, 10**7), 1] 10000 loops, best of 3: 32.9 µs per loop  in [42]: %timeit df.ix[random.randint(0, 10**7), 'b'] 10000 loops, best of 3: 64.8 µs per loop  in [43]: %timeit df.ix[random.randint(0, 10**7), 1] 1000 loops, best of 3: 503 µs per loop

results bar plot:

timing data df:

in [88]: r out[88]:        method  timing 0         loc   502.0 1        iloc   394.0 2          @    66.8 3         iat    32.9 4    ix_label    64.8 5  ix_integer   503.0  in [89]: r.to_dict() out[89]: {'method': {0: 'loc',   1: 'iloc',   2: 'at',   3: 'iat',   4: 'ix_label',   5: 'ix_integer'},  'timing': {0: 502.0,   1: 394.0,   2: 66.799999999999997,   3: 32.899999999999999,   4: 64.799999999999997,   5: 503.0}}

plotting

ax = sns.barplot(data=r, x='method', y='timing') ax.tick_params(labelsize=16) [ax.annotate(str(round(p.get_height(),2)), (p.get_x() + 0.2, p.get_height() + 5)) p in ax.patches] ax.set_xlabel('indexing method', size=20) ax.set_ylabel('timing (microseconds)', size=20)

Search This Blog

HTPPS

python - pandas df.loc[z,x]=y how to improve speed? -

Comments

Post a Comment

Popular posts from this blog

wordpress - (T_ENDFOREACH) php error -

Export Excel workseet into txt file using vba - (text and numbers with formulas) -

Using django-mptt to get only the categories that have items -