python - Best way to join two large datasets in Pandas
I'm downloading two datasets from two different databases that need to be joined. Each of them is around 500MB separately when I store it as CSV. Separately they fit into memory, but when I load both of them I get a memory error, and I get into trouble when I try to merge them with pandas.
What is the best way to do an outer join on them so that I don't get a memory error? I don't have any database servers at hand, but I can install any kind of open source software on my computer if that helps. Ideally I would still like to solve it in pandas, but I'm not sure if that is possible at all.
To clarify: with merging I mean an outer join. Each table has two columns: product and version. I want to check which products and versions are in the left table only, in the right table only, and in both tables.
    pd.merge(df1, df2, left_on=['product','version'], right_on=['product','version'], how='outer')
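For reference, the breakdown I'm after (left table only, right table only, both) is what merge's indicator flag reports; a minimal sketch with toy data standing in for my real tables:

    import pandas as pd

    # Toy frames standing in for the two downloads (same column names).
    df1 = pd.DataFrame({'product': ['a', 'b'], 'version': [1, 1]})
    df2 = pd.DataFrame({'product': ['b', 'c'], 'version': [1, 2]})

    # indicator=True adds a '_merge' column with the values
    # 'left_only', 'right_only' and 'both'.
    merged = pd.merge(df1, df2, on=['product', 'version'],
                      how='outer', indicator=True)
    print(merged['_merge'].value_counts())

This works fine on small data; the problem is memory once the tables are full size.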
This seems like a task that dask was designed for. Essentially, dask can do pandas operations out-of-core, so you can work with datasets that don't fit into memory. The dask.dataframe API is a subset of the pandas API, so there shouldn't be much of a learning curve. See the Dask DataFrame Overview page for some additional DataFrame-specific details.
    import dask.dataframe as dd

    # Read in the csv files.
    df1 = dd.read_csv('file1.csv')
    df2 = dd.read_csv('file2.csv')

    # Merge the csv files.
    df = dd.merge(df1, df2, how='outer', on=['product', 'version'])

    # Write the output.
    df.to_csv('file3.csv', index=False)
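One thing worth noting (my addition, not part of the original answer): dask is lazy, so read_csv and merge only build a task graph, and the actual work happens when to_csv (or .compute()) runs; depending on your dask version, to_csv may also write one file per partition rather than a single file. You can sanity-check the join before writing everything out:

    # Preview the merged frame without materialising the whole result.
    print(df.columns)  # column metadata, no computation needed
    print(df.head())   # computes only enough partitions for a few rows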
Assuming 'product' and 'version' are the only columns, it may be more efficient to replace the merge with:

    df = dd.concat([df1, df2]).drop_duplicates()

I'm not entirely sure if that will be better, but apparently merges that aren't done on the index are "slow-ish" in dask, so it could be worth a try.
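Since the question only needs the (product, version) pairs, here is another sketch (my own, not tested on the real data): deduplicate just the key columns in dask, pull the now much smaller frames into pandas, and let merge's indicator argument do the left-only/right-only/both classification:

    import dask.dataframe as dd
    import pandas as pd

    # Keep only the key columns and drop duplicates out-of-core; the
    # computed results should be far smaller than the raw 500MB files
    # (assuming product/version pairs repeat a lot in the raw data).
    left = dd.read_csv('file1.csv')[['product', 'version']].drop_duplicates().compute()
    right = dd.read_csv('file2.csv')[['product', 'version']].drop_duplicates().compute()

    # '_merge' will be 'left_only', 'right_only' or 'both' for each pair.
    result = pd.merge(left, right, on=['product', 'version'],
                      how='outer', indicator=True)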