Reading HTML with Pandas


In this post we will look at a somewhat uncommon use of the pandas dataframe library: reading data from web pages. Pandas has a function, read_html(), that scrapes the tables on an HTML page into a list of dataframes. It is a quick and easy way to read data from a web page into dataframes for later analysis, and it works particularly well on Wikipedia pages, which tend to contain tables with useful reference data.

Example

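The following code extracts a dataframe containing a list of sovereign states from Wikipedia.

import pandas as pd

url = 'https://en.wikipedia.org/wiki/List_of_sovereign_states'
# read_html returns a list with one dataframe per table found on the page
tables = pd.read_html(url)
# Take the first table, drop its first row and replace missing values with empty strings
states = tables[0].drop([0]).fillna('')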

[Image: list of sovereign states]

A couple of things to note:

  1. If the returned list contains more than one table, you will need to know which table(s) contain the data you are looking for.
  2. Pandas will apply headers if they are within <th> elements. If pandas cannot detect the headers, you can pass the row number to use for them with the header parameter.
    tables = pd.read_html(url, header=0)
    
  3. read_html will raise a ValueError if no tables are found on the page.
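
The first and third points can be handled together. The sketch below is one possible approach, not the only one: it uses the match parameter of read_html to keep only tables containing a given text, and catches the ValueError raised when nothing matches. The inline HTML and the find_tables helper are made up for illustration, so that the example runs without a network connection; it assumes an HTML parser that pandas supports (such as lxml) is installed.

```python
from io import StringIO

import pandas as pd

# Hypothetical page with two tables, standing in for a downloaded web page.
html = """
<table>
  <tr><th>Country</th><th>Capital</th></tr>
  <tr><td>France</td><td>Paris</td></tr>
  <tr><td>Japan</td><td>Tokyo</td></tr>
</table>
<table>
  <tr><th>City</th><th>Population</th></tr>
  <tr><td>Paris</td><td>2100000</td></tr>
</table>
"""


def find_tables(page, text):
    """Return the tables on `page` that contain `text`, or an empty list."""
    try:
        # match keeps only the tables whose text matches the given string or regex
        return pd.read_html(StringIO(page), match=text)
    except ValueError:
        # read_html raises ValueError when no table matches
        return []


capitals = find_tables(html, "Capital")    # only the first table mentions "Capital"
missing = find_tables(html, "Continent")   # no table mentions "Continent"

print(len(capitals), "matching table(s)")
print(capitals[0])
print(len(missing), "table(s) for the missing text")
```

Filtering by text is usually more robust than a hard-coded index such as tables[0], since the position of a table can change when the page is edited.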

More Complicated Pages

Using read_html works for fairly simple parsing jobs. If you need to grab data that is not within tables, or element attributes such as an anchor's href, then you are better off using the requests and bs4 libraries. But given how easily you can pull data from web pages into dataframes in a few lines of code, it's a useful tool to have in your toolbox.
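
As a rough illustration of the bs4 route, the sketch below pulls the text and href of each link out of a small, made-up HTML snippet. The snippet and the extract_links helper are assumptions for the sake of the example; for a real page you would first fetch the markup, for instance with requests.get(url).text, and pass that string in instead.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, standing in for the text of a downloaded page.
html = """
<ul>
  <li><a href="/wiki/France">France</a></li>
  <li><a href="/wiki/Japan">Japan</a></li>
  <li><a>No link here</a></li>
</ul>
"""


def extract_links(page):
    """Return (text, href) pairs for every anchor that has an href attribute."""
    soup = BeautifulSoup(page, "html.parser")
    # href=True skips anchors without an href attribute
    return [(a.get_text(strip=True), a["href"]) for a in soup.find_all("a", href=True)]


links = extract_links(html)
for text, href in links:
    print(text, href)
```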

More detailed examples of using pandas to read HTML can be found in the blog post "Reading HTML with Pandas". Interestingly, you can also read data from web pages that are built using JavaScript. Enjoy!