
I would like to convert the timestamp in a specific column to a specific date format.

Here is my input:

+----------+
| timestamp|
+----------+
|1532383202|
+----------+

What I would expect:

+------------------+
|      date        |
+------------------+
|24/7/2018 1:00:00 |
+------------------+

If possible, I would like to set the minutes and seconds to 0 even when they are not.

For example, if I have this :

+------------------+
|      date        |
+------------------+
|24/7/2018 1:06:32 |
+------------------+

I would like this :

+------------------+
|      date        |
+------------------+
|24/7/2018 1:00:00 |
+------------------+

What I tried:

from pyspark.sql.functions import date_format, unix_timestamp

table = table.withColumn(
    'timestamp',
    unix_timestamp(date_format('timestamp', 'yyyy-MM-dd HH:MM:SS'))
)

But I get NULL.


2 Answers


Update

Inspired by @Tony Pellerin's answer, I realized you can go directly to the :00:00 without having to use regexp_replace(), since characters in the format string that are not pattern letters (like 0 and :) are copied into the output verbatim:

import pyspark.sql.functions as f

table = table.withColumn("date", f.from_unixtime("timestamp", "dd/MM/yyyy HH:00:00"))
table.show()
#+----------+-------------------+
#| timestamp|               date|
#+----------+-------------------+
#|1532383202|23/07/2018 18:00:00|
#+----------+-------------------+

Your code doesn't work because pyspark.sql.functions.unix_timestamp() will:

Convert time string with given pattern (‘yyyy-MM-dd HH:mm:ss’, by default) to Unix time stamp (in seconds), using the default timezone and the default locale, return null if fail.
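For example (a minimal sketch, assuming a SparkSession named spark is available), giving it a string that does not match the default pattern yields null:

import pyspark.sql.functions as f

# dd/MM/yyyy order does not match the default 'yyyy-MM-dd HH:mm:ss' pattern
df = spark.createDataFrame([("24/07/2018 01:00:00",)], ["s"])
df.select(f.unix_timestamp("s").alias("parsed")).show()
#+------+
#|parsed|
#+------+
#|  null|
#+------+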

You actually want to do the inverse of this operation, which is convert from an integer timestamp to a string. For this you can use pyspark.sql.functions.from_unixtime():

import pyspark.sql.functions as f

table = table.withColumn("date", f.from_unixtime("timestamp", "dd/MM/yyyy HH:mm:ss"))
table.show()
#+----------+-------------------+
#| timestamp|               date|
#+----------+-------------------+
#|1532383202|23/07/2018 18:00:02|
#+----------+-------------------+

Now the date column is a string:

table.printSchema()
#root
# |-- timestamp: long (nullable = true)
# |-- date: string (nullable = true)

So you can use pyspark.sql.functions.regexp_replace() to make the minutes and seconds zero:

table.withColumn("date", f.regexp_replace("date", r":\d{2}:\d{2}", ":00:00")).show()
#+----------+-------------------+
#| timestamp|               date|
#+----------+-------------------+
#|1532383202|23/07/2018 18:00:00|
#+----------+-------------------+

The regex pattern r":\d{2}:\d{2}" matches a literal : followed by exactly 2 digits, twice in a row, i.e. the minutes and seconds portion of the string.
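If you would rather keep a true timestamp column instead of a string, a possible alternative (a sketch, assuming Spark 2.3+, where date_trunc() is available) is to truncate to the hour; the date_trunc() call returns a proper timestamp, and it is formatted back to a string here only for display:

# Cast the epoch seconds to a timestamp, truncate it to the hour,
# then render it in the requested format
table.withColumn(
    "date",
    f.date_format(
        f.date_trunc("hour", f.col("timestamp").cast("timestamp")),
        "dd/MM/yyyy HH:mm:ss"
    )
).show()
#+----------+-------------------+
#| timestamp|               date|
#+----------+-------------------+
#|1532383202|23/07/2018 18:00:00|
#+----------+-------------------+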


Maybe you could use the datetime library to convert the timestamps to your desired format. You can use a user-defined function (UDF) to apply plain Python code to a Spark DataFrame column. Here's what I would do:

# Import the libraries
from pyspark.sql.functions import udf
from datetime import datetime

# Create a function that returns the desired string from a timestamp
def format_timestamp(ts):
    return datetime.fromtimestamp(ts).strftime('%Y-%m-%d %H:00:00')

# Create the UDF (the function can be passed directly; no lambda needed)
format_timestamp_udf = udf(format_timestamp)

# Finally, apply the function to each element of the 'timestamp' column
table = table.withColumn('timestamp', format_timestamp_udf(table['timestamp']))
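One small refinement to the sketch above: udf() defaults to a StringType return type, so it works as written, but declaring the type explicitly is generally considered good practice:

from pyspark.sql.types import StringType

# Same UDF as above, with the return type stated explicitly
format_timestamp_udf = udf(format_timestamp, StringType())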

Hope this helps.

    Good edit there. +1 for showing that you can directly convert to :00:00 without having to use regex. I've added it to my answer as well. Your answer works but it's preferable to avoid using udf when possible.
    – pault
    Commented Aug 23, 2018 at 15:18
