python - Best way to join two large datasets in Pandas


I'm downloading two datasets from two different databases that need to be joined. Each of them separately is around 500MB when I store them as CSV. Separately they fit into memory, but when I load both I get a memory error. I get into trouble when I try to merge them with pandas.

What is the best way to do an outer join on them so that I don't get a memory error? I don't have any database servers at hand, but I can install any kind of open source software on my computer if that helps. Ideally I would still like to solve it in pandas, but I'm not sure if that's possible at all.

To clarify: by merging I mean an outer join. Each table has two columns: product and version. I want to check which products and versions are in the left table only, in the right table only, and in both tables.

pd.merge(df1, df2, left_on=['product','version'], right_on=['product','version'], how='outer')
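As an aside, pandas' merge indicator gives exactly that three-way classification. A minimal sketch, assuming df1 and df2 are the two loaded tables from the question (this doesn't fix the memory problem by itself):

import pandas as pd

# indicator=True adds a '_merge' column marking each row as
# 'left_only', 'right_only', or 'both'.
df = pd.merge(df1, df2, on=['product', 'version'], how='outer', indicator=True)

left_only  = df[df['_merge'] == 'left_only']
right_only = df[df['_merge'] == 'right_only']
in_both    = df[df['_merge'] == 'both']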

This seems like the task that dask was designed for. Essentially, dask can do pandas operations out-of-core, so it can work with datasets that don't fit into memory. The dask.dataframe API is a subset of the pandas API, so there shouldn't be much of a learning curve. See the Dask DataFrame Overview page for some additional DataFrame-specific details.

import dask.dataframe as dd

# Read in the CSV files.
df1 = dd.read_csv('file1.csv')
df2 = dd.read_csv('file2.csv')

# Merge the CSV files.
df = dd.merge(df1, df2, how='outer', on=['product','version'])

# Write the output.
df.to_csv('file3.csv', index=False)
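Worth noting: dask evaluates lazily, so read_csv and merge above only build a task graph, and it is to_csv (or another action) that actually triggers the work. Also, dask writes one CSV per partition by default, so the output name usually takes a glob pattern. A small sketch of both points (recent dask versions also accept single_file=True if one output file is needed, which you'd want to verify against your install):

# dask is lazy: nothing is read until an action runs.
print(df.head())  # computes only enough partitions for a preview

# One CSV per partition; '*' is replaced by the partition number.
df.to_csv('file3-*.csv', index=False)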

Assuming 'product' and 'version' are the only columns, it may be more efficient to replace the merge with:

df = dd.concat([df1, df2]).drop_duplicates() 

I'm not entirely sure if that will be better, but apparently merges that aren't done on the index are "slow-ish" in dask, so it could be worth a try.
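One caveat with the concat route: it loses track of which table a pair came from. A hedged way to keep that information, so the left-only / right-only / both question can still be answered, is to tag each frame before concatenating. A minimal sketch, assuming the two-column layout from the question and that the set of distinct pairs fits in memory once duplicates are dropped:

import dask.dataframe as dd

left = dd.read_csv('file1.csv').drop_duplicates().assign(side=1)
right = dd.read_csv('file2.csv').drop_duplicates().assign(side=2)

# Each (product, version) pair contributes at most one row per side,
# so summing 'side' per group distinguishes the three cases:
# 1 = left only, 2 = right only, 3 = both.
summary = dd.concat([left, right]).groupby(['product', 'version'])['side'].sum()
result = summary.compute()  # a pandas Series indexed by (product, version)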

