python - Can we use a Pandas function in a Spark DataFrame column? If so, how?
I have a pandas DataFrame called "pd_df".

I want to modify one of its columns, like this:

```python
import pandas as pd

pd_df['notification_dt'] = pd.to_datetime(pd_df['notification_dt'], format="%y-%m-%d")
```

It works.
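For reference, a minimal reproducible version of this step (the sample date strings are made up; "%y" matches two-digit years):

```python
import pandas as pd

# Hypothetical sample data mirroring the question's column name.
pd_df = pd.DataFrame({'notification_dt': ['21-03-15', '21-04-01']})
pd_df['notification_dt'] = pd.to_datetime(pd_df['notification_dt'], format="%y-%m-%d")
print(pd_df['notification_dt'].dtype)  # datetime64[ns]
```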
On the same database, I created a Spark DataFrame called "spark_df", and I want to apply the same function (pd.to_datetime) to its column to perform the same operation. I did this:

```python
from pyspark.sql.functions import UserDefinedFunction
from pyspark.sql.types import TimestampType

udf = UserDefinedFunction(lambda x: pd.to_datetime(x, format="%y-%m-%d"), TimestampType())
spark_df2 = spark_df.withColumn("notification_dt1", udf(spark_df["notification_dt"]))
```

As far as I can tell, it should work. However, on

```python
spark_df2.show()
```

I encounter an error after a minute or so.
So, I got it fixed.

```python
udf = UserDefinedFunction(lambda x: pd.to_datetime(x, format="%y-%m-%d"), TimestampType())
```

should be

```python
udf = UserDefinedFunction(lambda x: str(pd.to_datetime(x, format="%y-%m-%d")), TimestampType())
```

It was failing to convert the result (a pandas Timestamp object) to TimestampType().
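A side note: wrapping the result in str() works around the serialization failure, but two other variants may be cleaner. A UDF that returns an actual Python datetime matches TimestampType() directly, and on PySpark 3.x with PyArrow installed, a vectorized pandas_udf lets pd.to_datetime operate on whole column chunks at once. The following is only a sketch under those version assumptions, reusing the question's spark_df and column names:

```python
import pandas as pd
from pyspark.sql.functions import udf, pandas_udf, to_timestamp
from pyspark.sql.types import TimestampType

# Variant 1: return a plain Python datetime so it maps cleanly to TimestampType().
ts_udf = udf(lambda x: pd.to_datetime(x, format="%y-%m-%d").to_pydatetime(),
             TimestampType())

# Variant 2 (PySpark 3.x + PyArrow): a vectorized pandas UDF; pd.to_datetime
# receives an entire pandas Series and returns a datetime64 Series.
@pandas_udf(TimestampType())
def to_ts(s: pd.Series) -> pd.Series:
    return pd.to_datetime(s, format="%y-%m-%d")

spark_df2 = spark_df.withColumn("notification_dt1", to_ts("notification_dt"))

# Variant 3: if no pandas function is actually required, Spark's built-in
# to_timestamp avoids Python UDFs entirely ("yy-MM-dd" ~ "%y-%m-%d").
spark_df3 = spark_df.withColumn("notification_dt1",
                                to_timestamp("notification_dt", "yy-MM-dd"))
```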