Pandas, historical dates and a formatting error

Programming with dates and times is always fun. I had hoped that, after some time working with them on a central service at JISCmail, I could be spared their gory details. Sadly, some work with pandas bit me again in this area…

I have been updating my Digital Humanities Summer School talk on Reproducibility for this year: Github repository here. Having seen a great presentation on Binder and Jupyter as part of an Analytic Workflows talk, I was using Binder to update my own. That deserves a separate post but, for now, it simplifies a task I was trying to achieve. The task is deliberately simple – load the Early English Books Online (EEBO) catalogue as CSV, filter for an author and then plot a graph – as it has to be taught in 40 minutes and the assumption is that I will be explaining most things.

Having played with Jupyter for another task, I knew that I wanted to scale the X axis on the graph to tidy it up a little. The last time I did this, I converted the date column in the DataFrame into NumPy int64 numbers and then provided them to scale_x_axis() for formatting. This time, I got an OutOfBoundsDatetime error:

“Out of bounds nanosecond timestamp:”
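The failure is easy to reproduce in a notebook. A minimal sketch, using an illustrative pre-1677 date rather than the actual EEBO column:

```python
import pandas as pd

# Converting a date before pandas' minimum representable timestamp
# raises OutOfBoundsDatetime when coerced to datetime64[ns].
try:
    pd.to_datetime(["1546-02-18"])  # Martin Luther died in 1546
except pd.errors.OutOfBoundsDatetime as err:
    print(err)  # "Out of bounds nanosecond timestamp: ..."
```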


Intriguing. So I had a quick search and got hints that I could run pandas.Timestamp.min and the corresponding pandas.Timestamp.max to determine the minimum and maximum times that pandas can represent. The earliest timestamp for me was 1677-09-21 00:12:43.145224193 and the latest was 2262-04-11 23:47:16.854775807.
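Checking those bounds directly is a one-liner:

```python
import pandas as pd

# The limits of the nanosecond-resolution int64 representation.
print(pd.Timestamp.min)  # 1677-09-21 00:12:43.145224193
print(pd.Timestamp.max)  # 2262-04-11 23:47:16.854775807
```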

This is a slight issue as Martin Luther, the subject of the graph, was dead a long time before 1677. In fact, quite a lot of EEBO was written before then.

According to the documentation, the pandas date range is limited by its use of nanosecond-resolution 64-bit integers. Time, as represented by computers, has various limits and arbitrary beginnings, such as the Unix epoch and the date system behind Excel. If I hadn’t wanted to tidy up that graph axis, I would never have known about the issue.
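For year-level data like a catalogue, there is a way around the limit: annual Periods are stored as ordinals rather than nanosecond timestamps, so pre-1677 years are fine. A sketch with made-up counts standing in for the EEBO data (the column names and values here are illustrative, not from the real catalogue):

```python
import pandas as pd

# Made-up year/title counts standing in for the EEBO catalogue.
df = pd.DataFrame({"year": [1518, 1520, 1546], "titles": [3, 7, 1]})

# Annual Periods are ordinal-based, so 16th-century years work fine.
df.index = pd.PeriodIndex(df["year"], freq="Y")
print(df.index.min())  # prints 1518
```

From there, plotting the counts against the Period index keeps the axis readable without ever touching nanosecond timestamps.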

I did some digging around the source code – hence the discovery of the resolution – and found that the 1677 year is hard-coded into the np_datetime.pyx file. I then had a look at the datetime/np_datetime.c file, and it also has these dates hard-coded into it.

This intrigues me a little, as it suggests that the library, although it can be used for what I want, has a community that values it more for contemporary machine use than for historical time. So programmers may be limited by the wider ideals and means of a community with whom they may never have had a conversation.

The underlying computational reasoning – that one is usually looking for very fine time resolution – suggests that this is where development has focused, and it seems sound. Yet it imposes a limit that is computationally determined and that constrains the software itself.

The bug only shows up with the formatting, so the graph can still be built, suggesting that the timestamp bounds check is not applied everywhere. This means that we can still use pandas but need to be aware of its limits. It shows how the computational can still be limiting for the Humanities, even at this small scale.
