Delegating Dataframes


Pandas dataframes are used for data engineering, analytics, machine learning and other use cases that require general-purpose data containers. In this post we will look at adding functionality to pandas dataframes to make them more useful in applications. The first way to add functionality is by subclassing.

Subclassing the DataFrame class is initially straightforward, and the subclass will behave for the most part like the DataFrame class.

import pandas as pd

class CustomDataFrame(pd.DataFrame):
    pass

To achieve full functionality, however, you will need to follow the official guide to subclassing pandas dataframes. Even then, you will run into a number of issues:

  1. Your class will have all the internals of the dataframe, which will clutter autocomplete
  2. As seen in the guide, there are quite a few steps required just to achieve full functionality
  3. Your custom class will be bound to the pandas dataframe code in non-obvious ways, leading to breakages in the future.
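The first two points can be seen in a small sketch. The class names PlainSubclass and PreservingSubclass are hypothetical, chosen only for illustration; the `_constructor` property is the hook described in the pandas subclassing guide. Without it, ordinary methods such as `head()` hand back a plain DataFrame rather than the subclass.

```python
import pandas as pd


class PlainSubclass(pd.DataFrame):
    """Naive subclass with no extra hooks (hypothetical name)."""


class PreservingSubclass(pd.DataFrame):
    """Subclass that overrides _constructor so results keep the subclass type."""

    @property
    def _constructor(self):
        return PreservingSubclass


plain = PlainSubclass({"a": [1, 2, 3]})
preserving = PreservingSubclass({"a": [1, 2, 3]})

# Type of the object returned by an ordinary DataFrame method
plain_result_type = type(plain.head(2))
preserving_result_type = type(preserving.head(2))

# Public attributes that autocomplete would offer on the subclass
public_names = [name for name in dir(plain) if not name.startswith("_")]

print(plain_result_type.__name__, preserving_result_type.__name__)
print(len(public_names), "public attributes")
```

The exact attribute count varies between pandas versions, but it is large enough that any methods you add yourself are hard to find among the inherited ones.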

With these factors in mind, it’s best to explore the second option - delegation.

DataFrameHolder

The delegate is a class that is constructed from a dataframe and keeps that dataframe as an instance variable.

class DataFrameHolder:

    def __init__(self, data: pd.DataFrame, title=''):
        self.data = data
        self.title = title

Length and Get Item

The first delegate functions that we need are __len__() and __getitem__().

    def __len__(self):
        return len(self.data)

The get item function delegates to the dataframe's indexers. Dataframes can be accessed in multiple ways, but for simplicity we use [] if the item is a list (typically a list of column names) and iloc otherwise (selection by position).

    def __getitem__(self, item):
        if isinstance(item, list):
            return self.data[item]
        return self.data.iloc[item]
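To make the two access paths concrete, here is a self-contained sketch that repeats the class so far and exercises it on a small, made-up dataframe. The sample data and variable names are assumptions for illustration only.

```python
import pandas as pd


class DataFrameHolder:

    def __init__(self, data: pd.DataFrame, title=''):
        self.data = data
        self.title = title

    def __len__(self):
        return len(self.data)

    def __getitem__(self, item):
        # A list selects columns; anything else is treated as a position
        if isinstance(item, list):
            return self.data[item]
        return self.data.iloc[item]


# Hypothetical sample data, not the Wikipedia table used later
frame = pd.DataFrame({
    "Sport": ["Archery", "Cricket", "Golf"],
    "Years": ["1900", "1900", "1904"],
})
holder = DataFrameHolder(frame, title="Sample")

row_count = len(holder)      # number of rows in the wrapped dataframe
columns = holder[["Sport"]]  # list -> DataFrame with the chosen columns
first_row = holder[0]        # int -> Series for the row at position 0
first_two = holder[0:2]      # slice -> DataFrame with the first two rows
```

One consequence of this design is that a bare column name such as holder["Sport"] is passed to iloc and raises an error, so a single column has to be requested as a one-element list.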

Repr Functions

Repr functions allow for an object to be displayed in an output stream - which could be the console, log or even a web page. Python objects have a __repr__() function, which you override to provide a representation of the object as a string. For our DataFrameHolder we want to represent the dataframe itself, and so we override __repr__() and delegate to the dataframe’s __repr__() function.

The second repr function, _repr_html_(), is more interesting. It allows you to display your custom object as HTML in a notebook, which, given that we are delegating to a dataframe, will render the dataframe as a table.

    def __repr__(self):
        return self.data.__repr__()

    def _repr_html_(self):
        return self.data._repr_html_()

However, you can tailor the output however you want, including adding a custom header to the dataframe output.

    def _repr_html_(self):
        html = f"<h3>{self.title}</h3>"
        html = html + self.data._repr_html_()
        return html
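As a possible variation, not part of the original class, the title can be passed through the standard library's html.escape so that characters such as & or < in a title do not break the markup. The sketch below assumes the default pandas display options, under which the dataframe's _repr_html_() returns an HTML string.

```python
import html

import pandas as pd


class DataFrameHolder:

    def __init__(self, data: pd.DataFrame, title=''):
        self.data = data
        self.title = title

    def _repr_html_(self):
        # Escape the title so special characters render as text
        header = f"<h3>{html.escape(self.title)}</h3>"
        return header + self.data._repr_html_()


# Hypothetical data and title, chosen to include a character that needs escaping
holder = DataFrameHolder(pd.DataFrame({"Sport": ["Golf"]}), title="Track & Field")
rendered = holder._repr_html_()
```

In a notebook, evaluating holder as the last expression of a cell calls _repr_html_() automatically, so there is normally no need to call it by hand.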

Query

Query is a simple delegate that passes a query string to the dataframe's query function. Instead of returning the raw dataframe, however, it wraps the result in a new DataFrameHolder.

    def query(self, query_str: str):
        result = self.data.query(query_str)
        return DataFrameHolder(result, title=self.title)
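Because the result is itself a DataFrameHolder, it keeps the title and can be queried again. A minimal offline sketch, using made-up data rather than the Wikipedia table from the next section:

```python
import pandas as pd


class DataFrameHolder:

    def __init__(self, data: pd.DataFrame, title=''):
        self.data = data
        self.title = title

    def query(self, query_str: str):
        # Delegate to DataFrame.query and re-wrap the result
        result = self.data.query(query_str)
        return DataFrameHolder(result, title=self.title)


# Hypothetical sample data for illustration
frame = pd.DataFrame({
    "Sport": ["Archery", "Cricket", "Golf"],
    "Years": ["1900", "1900", "1904"],
})
sports = DataFrameHolder(frame, title="Sample Sports")

in_1900 = sports.query("Years == '1900'")       # still a DataFrameHolder
cricket = in_1900.query("Sport == 'Cricket'")   # so queries can be chained
```

The original holder is left untouched; each call produces a new holder around the filtered dataframe.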

Usage

Let’s use one example where we read some data from a Wikipedia page and create a DataFrameHolder with the results. In this case we will read a table containing a list of Olympic Sports.

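# read_html returns a list with one dataframe per HTML table on the page
tables = pd.read_html('https://en.wikipedia.org/wiki/Summer_Olympic_Games', header=1)
# The position of the sports table depends on the page's current layout
data = tables[2]
sports = DataFrameHolder(data, title="Olympic Sports")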

If we display the sports DataFrameHolder in a notebook, it will render with the title we specified, "Olympic Sports".

We can query our DataFrameHolder and it will delegate the query to the dataframe underneath.

sports.query("Years=='1900'")

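[Rendered output: the query result, displayed as a table under the heading "Olympic Sports"]

Because query returns a new DataFrameHolder, queries can also be chained.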

sports.query("Years=='1900'").query("Sport=='Cricket'")

[Rendered output: the chained query result, displayed as a table under the heading "Olympic Sports"]

Gist

This code is available as a gist.

Conclusion

Hopefully this post leaves you with a useful way to extend and use pandas dataframes as general-purpose data containers.