Python is powerful, concise, and robust. Simply great. Except…when you work with time. Coping with mysterious errors in transforming dates and timestamps took me hours and days of frustration. I was like, ‘why is Python doing it to me’?
I found that much of the online Data Science material on time and time series analysis in Python silently glosses over the amount of preparatory work needed before you even start.
Because I work a lot with time series in the Sopra Steria analytics team, I had no choice but to understand all this in detail. I created the summary below so that others have it easier. Enjoy.
In general, I think the following 9 areas are problematic for working with time in Python:
- Understanding timestamp-like data types (this article)
- Parsing timestamps from text
- Converting between timestamps (this article)
- Collections of timestamps (lists, ndarrays and DataFrames)
- Time deltas
- Histograms of timestamps
- Fixing histogram bins and edges
- Histograms of time deltas
- Time deltas in logarithmic scale
I will cover two of the subjects below. The remaining ones, time permitting, will be covered in follow-up articles. The title of this article was inspired by this stackoverflow thread. Andy Hayden asked how to convert from datetime to Timestamp. Wes McKinney (the author of the pandas library) starts his response with Welcome to Hell. I liked it.
What is a timestamp
A timestamp is a sequence of characters that tells you when something occurred (Wikipedia). In other words, a timestamp uniquely identifies a certain moment in time. Irritatingly, this most basic term is the source of confusion in the Python world. In the Python literature, the term has at least four meanings:
- the generic timestamp meaning, as above
- Python native timestamp
- Unix timestamp
- Pandas timestamp
In this article, we will use the Wikipedia definition. This will save us from constant confusion. Walk with me through the first circle of hell to reach the enlightenment. No kidding, it will be fun. Note: I am using numpy version ‘1.19.2’ and pandas version ‘1.2.3’.
The first circle of hell: five ways of storing a timestamp
Below are some Python time-related data types which you may encounter. Yes, it looks scary. In this section I will explain how they relate.
string
datetime.datetime
numpy.datetime64
dtype: datetime64[ns]
dtype: datetime64[h]
dtype: datetime64[D]
pandas.Timestamp
pandas._libs.tslibs.timestamps.Timestamp
dtype('<M8[s]')
dtype('>M8[s]')
dtype=object
int64
dtype=np.int64
1. string
Timestamps are often stored as strings. This is simple, but not very useful for data transformation.
stamp = '2021-04-13 15:33'
variable content: 2021-04-13 15:33
variable type: <class 'str'>
2. datetime.datetime
Python’s native data format is datetime.datetime. I think it is getting less popular these days. Among pandas Data Science aficionados, it is now rarely used.
from datetime import datetime
stamp_datetime = datetime(year=2021, month=4, day=13)
variable content: 2021-04-13 00:00:00
variable type: <class 'datetime.datetime'>
3. numpy.datetime64
The NumPy package introduced numpy.datetime64, which is like datetime on steroids. Example:
import numpy as np
stamp_np = np.datetime64('2021-04-13 15:33')
variable content: 2021-04-13T15:33
variable type: <class 'numpy.datetime64'>
The important point about datetime64 is the unit of internal storage (the fundamental time unit, nicely explained here in Python Data Science Handbook by Jake VanderPlas), which may differ between variables. This is the granularity at which the data is stored internally. For instance, a nanosecond or a second can be picked as the unit. Importantly, you must know for sure what unit you are using inside your datetime64 variable, because it will impact the calculations, as we will soon see. It is good practice to create each datetime64 with an explicit unit (nanosecond, hour, day). The unit information is then stored in the variable's dtype. Here is how:
# show_np() is a small display helper (definition not shown) that prints content, type and dtype
show_np(np.datetime64('2021-04-13 15:33', 'ns'))
show_np(np.datetime64('2021-04-13 15:33', 'h'))
show_np(np.datetime64('2021-04-13 15:33', 'D'))
Here is how those variables look internally:
When unit is nanosecond:
variable content: 2021-04-13T15:33:00.000000000
variable type: <class 'numpy.datetime64'>
variable dtype: datetime64[ns]
When unit is hour:
variable content: 2021-04-13T15
variable type: <class 'numpy.datetime64'>
variable dtype: datetime64[h]
When unit is day:
variable content: 2021-04-13
variable type: <class 'numpy.datetime64'>
variable dtype: datetime64[D]
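To make the role of the unit more tangible, here is a minimal sketch (not from the original notebook) that re-expresses one timestamp in several units and prints the raw integer stored behind each. The variable names are mine and only illustrative.

```python
import numpy as np

# One moment in time, created at minute resolution.
stamp = np.datetime64('2021-04-13T15:33')

# Re-express it in several units and look at the raw integer behind each.
for unit in ('D', 'h', 'm', 's', 'ns'):
    converted = stamp.astype(f'datetime64[{unit}]')
    raw = int(converted.astype('int64'))
    print(f'{unit:>2}: {converted}  ->  {raw}')

# Coarser units silently drop information: the day-level value
# no longer knows about 15:33, so it is a different moment.
as_day = stamp.astype('datetime64[D]')
print(as_day == stamp)
```

The same moment is a very different integer depending on the unit, and converting to a coarser unit truncates. This is why arithmetic on the raw integers only makes sense once you know the unit.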
datetime64[m] versus <M8[m]?
At this point you understand the meaning of dtype datetime64[ns]. It is datetime64 with the unit nanosecond. Similarly, you may encounter datetime64[s], datetime64[D] and so on. To complete our understanding of numpy.datetime64, try to run this in a Jupyter Notebook cell:
print(stamp_np.dtype)
stamp_np.dtype
datetime64[m]
dtype('<M8[m]')
Surprisingly, those two statements show different results. As explained here, both notations are equivalent. The ‘<’ prefix in ‘<M8’ simply means that your machine is little-endian (like most personal computers). If in doubt, you may verify that the following statement returns True:
np.dtype('<M8[s]') == np.dtype('datetime64[s]')
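As a small illustration of where the two spellings come from, the sketch below prints the different textual forms of one and the same dtype. It assumes nothing beyond NumPy itself; the byte-order character you see depends on your machine.

```python
import numpy as np

dt = np.dtype('datetime64[m]')

print(str(dt))        # human-readable name, what print() shows
print(repr(dt))       # constructor-style form, what a notebook cell echoes
print(dt.str)         # array-protocol code with a byte-order prefix
print(dt.byteorder)   # '=' means "native to this machine"

# 'M8' is shorthand for an 8-byte datetime, so these spellings
# all describe the same dtype.
same_short = dt == np.dtype('M8[m]')
same_native = dt == np.dtype('=M8[m]')
print(same_short, same_native)
```

So print() uses the readable name, while the notebook echo uses the compact array-protocol code. Neither tells you anything new about the data.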
4. pandas timestamp
Pandas introduces yet another date type: pandas.Timestamp. It uses np.datetime64 under the hood. You create a pd.Timestamp using pd.to_datetime():
import pandas as pd
stamp_pandas = pd.to_datetime('2021-04-13 15:33')
variable content: 2021-04-13 15:33:00
variable type: <class 'pandas._libs.tslibs.timestamps.Timestamp'>
Jake VanderPlas, quoted earlier, refers to pd.Timestamp as the best of both worlds (as compared to datetime.datetime and numpy.datetime64). This is true in terms of performance, which I hope to cover later. But regarding the API design, I am not so sure. In my view this data type has some design problems. It shows up sometimes as pd.Timestamp, sometimes as ‘pandas._libs.tslibs.timestamps.Timestamp', as seen above, and sometimes as np.datetime64. For developers, this causes a lot of confusion when designing and debugging code.
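To untangle these names, the following sketch checks how the pandas type relates to the other two. It is an illustration rather than part of the original notebook, and it uses only the public pandas methods to_pydatetime() and to_datetime64().

```python
from datetime import datetime

import numpy as np
import pandas as pd

stamp_pandas = pd.to_datetime('2021-04-13 15:33')

# The long private path and pd.Timestamp are the same class.
print(type(stamp_pandas) is pd.Timestamp)

# pd.Timestamp subclasses the standard-library datetime ...
print(isinstance(stamp_pandas, datetime))

# ... and can be converted explicitly to the other two types.
as_native = stamp_pandas.to_pydatetime()
as_numpy = stamp_pandas.to_datetime64()
print(type(as_native), type(as_numpy))
```

When a debugger shows an unexpected type, an explicit conversion like the two above is usually the quickest way back to the type you meant to work with.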
5. Unix (POSIX) time
Finally, a timestamp can be stored as an integer. This is the least readable form for humans, but very efficient for machines. By convention, that integer represents the number of seconds that have elapsed since the so-called Epoch, defined as midnight UTC on January 1st, 1970. Such a representation of a timestamp is referred to as Unix time, or POSIX time.
# let's calculate POSIX time for 1st January 2021 (midnight UTC)
# 51 years of 365 days, plus the 13 leap days between 1970 and 2020
stamp_posix = ((2021 - 1970) * 365 + 13) * 24 * 60 * 60
variable content: 1609459200
variable type: <class 'int'>
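Counting seconds by hand is easy to get subtly wrong, because years differ in length. As a cross-check, here is a minimal sketch that lets the standard library do the counting. It assumes the moment of interest is midnight UTC.

```python
from datetime import datetime, timezone

# Midnight on 1 January 2021, explicitly in UTC.
new_year = datetime(2021, 1, 1, tzinfo=timezone.utc)
stamp_posix = int(new_year.timestamp())

# Cross-check by counting whole days since the Epoch; the date
# subtraction accounts for leap years.
epoch = datetime(1970, 1, 1, tzinfo=timezone.utc)
days = (new_year - epoch).days
print(stamp_posix, days, days * 24 * 60 * 60)
```

Passing tzinfo explicitly matters here: without it, datetime.timestamp() interprets the value in the machine's local time zone.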
To summarize the material so far, there are five ways to represent a timestamp in Python, of which numpy.datetime64 is probably the most versatile. The cause of many problems is that in practice you cannot just pick your favourite timestamp format and stick with it. Rather, you will often need to convert between those formats.
The second circle of hell: converting the timestamps
Don’t stop reading, the fun hasn’t started yet.
In my practice, I often need to represent my timestamps as POSIX timestamps (integers). For instance, numpy.histogram() expects plain numbers rather than datetime objects. So if you store your timestamps in numpy or pandas formats, you need to convert them. Here is how.
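Before the conversions themselves, here is a quick sketch of the motivation. It bins a handful of made-up event times by first turning them into integer seconds. The event times and the number of bins are assumptions chosen purely for illustration.

```python
import numpy as np

# A few hypothetical event times on one day, stored with second resolution.
events = np.array(
    ['2021-04-13T00:00', '2021-04-13T06:00',
     '2021-04-13T12:00', '2021-04-13T23:00'],
    dtype='datetime64[s]',
)

# View the timestamps as plain integers: seconds since the Epoch.
seconds = events.astype('int64')

counts, edges = np.histogram(seconds, bins=2)

# Bin edges come back as floats; turn them into timestamps again for display.
edge_stamps = edges.astype('int64').astype('datetime64[s]')
print(counts)
print(edge_stamps)
```

The histogram itself only ever sees integers. The timestamps reappear at the very end, when the bin edges are converted back for display.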
From pandas to Unix time
Here is how to convert from pandas.Timestamp to Unix time:
stamp_pandas.timestamp()
1618327980.0
Quite simple, and it works. Then, what’s wrong with it? In my view, the naming is wrong and confusing. You have to remember that:
- the pandas timestamp type is called pd.Timestamp
- but pd.Timestamp.timestamp() returns a Unix timestamp (a number of seconds, as a float)
- the method pd.to_datetime() does not return datetime.datetime, as the name suggests. It returns pd.Timestamp.
In my view, the naming of those three is confusing. Otherwise, they work nicely.
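The points above can be checked in a few lines. This is a minimal sketch: the round trip through pd.Timestamp(..., unit='s') is my addition, included only to show that the number really identifies the same moment.

```python
import pandas as pd

stamp_pandas = pd.to_datetime('2021-04-13 15:33')

unix_float = stamp_pandas.timestamp()   # float, seconds since the Epoch
unix_int = int(unix_float)              # drop the (empty) fractional part

# Round trip: build a pd.Timestamp again from the integer seconds.
round_trip = pd.Timestamp(unix_int, unit='s')
print(type(unix_float).__name__, unix_int, round_trip == stamp_pandas)
```

Note that a pandas timestamp without a time zone is treated as UTC here, which is why the result matches the value shown above regardless of where the code runs.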
From numpy to Unix time
In contrast, to convert numpy.datetime64 to a Unix timestamp you only need simple arithmetic. First, remember the base unit I explained earlier. The first step is to check what base unit is being used under the hood:
# converting from datetime64 to POSIX time means checking the unit and then doing simple arithmetic
stamp_np.dtype
dtype('<M8[m]')
All right, so we have minutes as the base unit. Our timestamp is represented as minutes from the Epoch. In fact, it is an integer already. And here is how to convert it to Unix time: simply multiply the minutes by 60.
posix_from_numpy_minutes = stamp_np.astype('uint64') * np.uint64(60)
variable content: 1618327980
variable type: <class 'numpy.uint64'>
variable dtype: uint64
And here’s an idea to make things easier, so in the future you don’t even need that arithmetic:
# It is better to enforce the unit as seconds upfront.
# This results in a dtype that directly stores the seconds since the Epoch:
stamp_in_seconds = np.datetime64(stamp, 's')
# now the conversion is straightforward
stamp_in_seconds.astype('uint64')
Try to run this code in a Jupyter notebook cell. The result should be the same as above:
1618327980
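If the unit of an incoming datetime64 is not under your control, the same idea can be wrapped in a small function. The sketch below is one possible way to do it. The name to_posix_seconds is hypothetical, and I use int64 rather than uint64 as a personal choice.

```python
import numpy as np


def to_posix_seconds(stamp):
    """Return whole seconds since the Epoch for a np.datetime64 of any unit.

    Hypothetical helper for illustration; sub-second detail is truncated.
    """
    return int(stamp.astype('datetime64[s]').astype('int64'))


print(to_posix_seconds(np.datetime64('2021-04-13T15:33')))        # minutes
print(to_posix_seconds(np.datetime64('2021-04-13T15:33', 'ns')))  # nanoseconds
print(to_posix_seconds(np.datetime64('2021-04-13', 'D')))         # days
```

Going through datetime64[s] first means the caller never has to look at the unit. A signed integer also keeps dates before 1970 meaningful, since they come out as negative numbers.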
So far, so good? The warm-up is over. The easy part is behind us. We will now learn to work with lists of timestamps. I’ll cover this in the next article.
And while you wait till I publish it, enjoy the diagram of Dante’s nine circles of hell, adapted to the misery of a Python time series analyst’s life. We have covered two, and there are seven circles to go. Time permitting.
Did you spot any errors in the above? Contact me or post your thoughts here or on Facebook.