Pandas, historical dates and formatting error

Programming with dates and times is always fun. I had hoped that, after some time working with them at JISCmail on a central service, I could be spared their gory details. Sadly some work with pandas bit me again in this area…

I have been updating my Digital Humanities Summer School talk on Reproducibility for this year: GitHub repository here. Having seen a great talk on Binder and Jupyter as part of an Analytic Workflows talk, I was using Binder to update the talk. That would deserve a separate post but, for now, it simplifies a task I was trying to achieve. The task is deliberately simple – load the Early English Books Online catalogue as csv, filter for an author and then print a graph – as it has to be taught in 40 minutes and the assumption is that I am going to explain most things.

Having played with Jupyter for another task, I knew that I wanted to scale the X axis on the graph to tidy it up a little. The last time that I did this, I converted the date column in the DataFrame into numpy int64 numbers and then provided them to scale_x_axis() for formatting. This time I got an OutOfBoundsDatetime error:

“Out of bounds nanosecond timestamp:”

(https://github.com/pandas-dev/pandas/blob/master/pandas/_libs/tslibs/np_datetime.pyx)

Intriguing. So I had a quick search and got hints that I could run pandas.Timestamp.min and the corresponding pandas.Timestamp.max to determine the minimum and maximum times in pandas. The earliest timestamp for me was 1677-09-21 00:12:43.145225 and the latest was 2262-04-11 23:47:16.854775807.
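For anyone who wants to repeat the check, a minimal sketch follows. It only assumes that pandas is installed; the exact fractional seconds printed for the minimum may differ slightly between pandas versions, so the code compares years rather than full timestamps.

```python
import pandas as pd

# The bounds of the default nanosecond-resolution Timestamp type
earliest = pd.Timestamp.min
latest = pd.Timestamp.max

print(earliest)
print(latest)

# Martin Luther died in 1546, well before the earliest representable year
luther_died = 1546
print(luther_died < earliest.year)
```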

This is a slight issue as Martin Luther, the subject of the graph, was dead a long time before 1677. In fact, quite a lot of EEBO was written before then.

According to the documentation, the pandas date range is limited by its use of nanosecond-resolution 64-bit integers. Time, as represented by computers, does have limits in various systems, as well as arbitrary beginnings such as epoch time and the time systems behind Excel. If I hadn’t wanted to tidy up that graph axis, I would never have known about the issue.
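The arithmetic behind those two dates can be checked with nothing but the Python standard library. This is only a sketch of the reasoning, not how pandas itself computes the bounds: it assumes the count of nanoseconds is stored as a signed 64-bit integer measured from the Unix epoch, and it truncates to microseconds because the standard datetime type goes no finer.

```python
from datetime import datetime, timedelta

# Assumption: time is a signed 64-bit count of nanoseconds since the Unix epoch
EPOCH = datetime(1970, 1, 1)
MAX_NS = 2**63 - 1

# datetime only resolves microseconds, so drop the last three digits
span = timedelta(microseconds=MAX_NS // 1000)

latest = EPOCH + span
earliest = EPOCH - span

# Roughly how many years fit on either side of 1970
approx_years = MAX_NS / (365.25 * 24 * 3600 * 1e9)

print(earliest)  # 1677-09-21 00:12:43.145225
print(latest)    # 2262-04-11 23:47:16.854775
print(round(approx_years, 1))
```

So the limit is about 292 years either side of 1970, which is where 1677 and 2262 come from.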

I did some digging around the source code, hence the discovery of the resolution, and found that the year 1677 is hard-coded into the np_datetime.pyx file. So I had a look at the datetime/np_datetime.c file, and it also has these dates hard-coded into it.

This intrigues me a little as it suggests that the library, although it can be used for what I want, has a community that values it more for contemporary machine use than historical time. So programmers may be limited by the wider ideals and means of a community with whom they may never have had a conversation.

The underlying computational reasoning, that one is looking for very small time resolution, suggests that this is where the development has focussed and seems sound. Yet it has a limit that is computationally determined and is limiting the software itself.

The bug only shows up with the formatting, so the graph can still be built, suggesting that the bounds check is not applied everywhere. This means that we can still use pandas but just need to be aware of its limits. It suggests how the computational can still be limiting for the Humanities, even at small levels.
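One way of living within those limits, sketched here rather than taken from the talk itself, is to keep early dates out of the Timestamp type altogether. The rows and the column names "author" and "date" below are hypothetical stand-ins for the EEBO catalogue, and the sketch assumes the dates are plain years held as text.

```python
import pandas as pd

# Hypothetical rows standing in for the EEBO catalogue; column names are assumptions
catalogue = pd.DataFrame({
    "author": ["Luther, Martin", "Luther, Martin", "Luther, Martin"],
    "date": ["1520", "1546", "1579"],
})

# Treat the years as plain numbers instead of nanosecond timestamps;
# anything that is not a number becomes NaN rather than raising an error
catalogue["year"] = pd.to_numeric(catalogue["date"], errors="coerce")

# Counts per year can then be plotted on an ordinary numeric axis
titles_per_year = catalogue.groupby("year").size()
print(titles_per_year)

# A Period can hold a single day that falls outside the Timestamp range
luther_died = pd.Period("1546-02-18", freq="D")
print(luther_died.year)
```

Plain integers lose the calendar arithmetic that timestamps offer, but for a graph of titles per year that is rarely needed.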