
I would like to convert the timestamp in a specific column to a specific date format.

Here is my input:

+----------+
| timestamp|
+----------+
|1532383202|
+----------+

What I would expect:

+------------------+
|      date        |
+------------------+
|24/7/2018 1:00:00 |
+------------------+

If possible, I would like to set the minutes and seconds to 0 even when they are not.

For example, if I have this :

+------------------+
|      date        |
+------------------+
|24/7/2018 1:06:32 |
+------------------+

I would like this :

+------------------+
|      date        |
+------------------+
|24/7/2018 1:00:00 |
+------------------+

What I tried:

from pyspark.sql.functions import date_format, unix_timestamp

table = table.withColumn(
    'timestamp',
    unix_timestamp(date_format('timestamp', 'yyyy-MM-dd HH:MM:SS'))
)

But I get NULL.


2 Answers


Update

Inspired by @Tony Pellerin's answer, I realized you can go directly to the :00:00 without having to use regexp_replace(), since characters in the format string that are not pattern letters (like 0 and :) are copied into the output verbatim:

import pyspark.sql.functions as f

table = table.withColumn("date", f.from_unixtime("timestamp", "dd/MM/yyyy HH:00:00"))
table.show()
#+----------+-------------------+
#| timestamp|               date|
#+----------+-------------------+
#|1532383202|23/07/2018 18:00:00|
#+----------+-------------------+

Your code doesn't work because pyspark.sql.functions.unix_timestamp() will:

Convert time string with given pattern (‘yyyy-MM-dd HH:mm:ss’, by default) to Unix time stamp (in seconds), using the default timezone and the default locale, return null if fail.
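For example (a minimal sketch, assuming a SparkSession named spark is available), giving it a string that does not match the default pattern yields null:

import pyspark.sql.functions as f

# dd/MM/yyyy order does not match the default 'yyyy-MM-dd HH:mm:ss' pattern
df = spark.createDataFrame([("24/07/2018 01:00:00",)], ["s"])
df.select(f.unix_timestamp("s").alias("parsed")).show()
#+------+
#|parsed|
#+------+
#|  null|
#+------+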

You actually want to do the inverse of this operation, which is convert from an integer timestamp to a string. For this you can use pyspark.sql.functions.from_unixtime():

import pyspark.sql.functions as f

table = table.withColumn("date", f.from_unixtime("timestamp", "dd/MM/yyyy HH:mm:ss"))
table.show()
#+----------+-------------------+
#| timestamp|               date|
#+----------+-------------------+
#|1532383202|23/07/2018 18:00:02|
#+----------+-------------------+

Now the date column is a string:

table.printSchema()
#root
# |-- timestamp: long (nullable = true)
# |-- date: string (nullable = true)

So you can use pyspark.sql.functions.regexp_replace() to make the minutes and seconds zero:

table.withColumn("date", f.regexp_replace("date", r":\d{2}:\d{2}", ":00:00")).show()
#+----------+-------------------+
#| timestamp|               date|
#+----------+-------------------+
#|1532383202|23/07/2018 18:00:00|
#+----------+-------------------+

The regex pattern r":\d{2}:\d{2}" matches a literal : followed by exactly 2 digits, twice in a row, i.e. the minutes and seconds portion of the string.
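If you would rather keep a true timestamp column instead of a string, a possible alternative (a sketch, assuming Spark 2.3+, where date_trunc() is available) is to truncate to the hour; the date_trunc() call returns a proper timestamp, and it is formatted back to a string here only for display:

# Cast the epoch seconds to a timestamp, truncate it to the hour,
# then render it in the requested format
table.withColumn(
    "date",
    f.date_format(
        f.date_trunc("hour", f.col("timestamp").cast("timestamp")),
        "dd/MM/yyyy HH:mm:ss"
    )
).show()
#+----------+-------------------+
#| timestamp|               date|
#+----------+-------------------+
#|1532383202|23/07/2018 18:00:00|
#+----------+-------------------+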


Maybe you could use the datetime library to convert the timestamps to your desired format. You can use a user-defined function (UDF) to apply plain Python code to a Spark DataFrame column. Here's what I would do:

# Import the libraries
from pyspark.sql.functions import udf
from datetime import datetime

# Create a function that returns the desired string from a timestamp
def format_timestamp(ts):
    return datetime.fromtimestamp(ts).strftime('%Y-%m-%d %H:00:00')

# Create the UDF (the function can be passed directly; no lambda needed)
format_timestamp_udf = udf(format_timestamp)

# Finally, apply the function to each element of the 'timestamp' column
table = table.withColumn('timestamp', format_timestamp_udf(table['timestamp']))
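One small refinement to the sketch above: udf() defaults to a StringType return type, so it works as written, but declaring the type explicitly is generally considered good practice:

from pyspark.sql.types import StringType

# Same UDF as above, with the return type stated explicitly
format_timestamp_udf = udf(format_timestamp, StringType())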

Hope this helps.

    Good edit there. +1 for showing that you can directly convert to :00:00 without having to use regex. I've added it to my answer as well. Your answer works but it's preferable to avoid using udf when possible.
    – pault
    Commented Aug 23, 2018 at 15:18
