python - Can we use a Pandas function in a Spark DataFrame column? If so, how?
I have a pandas DataFrame called "pd_df".
I want to modify its column, like this:
import pandas as pd

pd_df['notification_dt'] = pd.to_datetime(pd_df['notification_dt'], format="%Y-%m-%d")

It works.
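For illustration, here is a minimal self-contained version of that pandas step (the sample dates are made up):

import pandas as pd

# Hypothetical sample data: date strings in "YYYY-MM-DD" form
pd_df = pd.DataFrame({"notification_dt": ["2017-01-15", "2017-02-20"]})

# Parse the strings into datetime64 values
pd_df['notification_dt'] = pd.to_datetime(pd_df['notification_dt'], format="%Y-%m-%d")

print(pd_df.dtypes)  # notification_dt: datetime64[ns]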
On the same database, I created a Spark DataFrame called "spark_df".
I want to apply the same function (pd.to_datetime) to its column to perform the same operation. So I did this:
from pyspark.sql.functions import UserDefinedFunction
from pyspark.sql.types import TimestampType

udf = UserDefinedFunction(lambda x: pd.to_datetime(x, format="%Y-%m-%d"), TimestampType())
spark_df2 = spark_df.withColumn("notification_dt1", udf(spark_df["notification_dt"]))
It should work, according to me. But on

spark_df2.show()

I encounter the following error after a minute or so:
So, I got it fixed.
udf = UserDefinedFunction(lambda x: pd.to_datetime(x, format="%Y-%m-%d"), TimestampType())
should be
udf = UserDefinedFunction(lambda x: str(pd.to_datetime(x, format="%Y-%m-%d")), TimestampType())
It was failing to convert the result to TimestampType().
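For reference, here is a runnable sketch of the same idea on a recent PySpark version. The session setup and sample rows are my own assumptions, and instead of returning a string the UDF returns a plain datetime via .to_pydatetime(), since TimestampType expects datetime objects in current PySpark:

import pandas as pd
from pyspark.sql import SparkSession
from pyspark.sql.functions import udf
from pyspark.sql.types import TimestampType

spark = SparkSession.builder.master("local[1]").appName("demo").getOrCreate()

# Hypothetical sample data matching the question's column name
spark_df = spark.createDataFrame([("2017-01-15",), ("2017-02-20",)], ["notification_dt"])

# Wrap pd.to_datetime in a UDF; convert the pandas Timestamp to a
# standard datetime so Spark can store it as TimestampType
parse_dt = udf(lambda x: pd.to_datetime(x, format="%Y-%m-%d").to_pydatetime(), TimestampType())

spark_df2 = spark_df.withColumn("notification_dt1", parse_dt(spark_df["notification_dt"]))
spark_df2.show()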