In this post we will look at a somewhat uncommon use of the pandas library: reading data from web pages.
Pandas has a function, read_html(), that scrapes the tables on an HTML page into a list of dataframes. It is a quick and easy way to read data from a web page for later analysis, and it works particularly well on Wikipedia pages, which tend to contain tables of useful reference data.
The following code extracts a dataframe containing a list of sovereign states from Wikipedia.
import pandas as pd

url = 'https://en.wikipedia.org/wiki/List_of_sovereign_states'
tables = pd.read_html(url)

# Get the first table, drop the first row and fill NAs
df = tables[0].drop(0).fillna('')
A couple of things to note:
- If there is more than one table in the returned list, you will need to work out which table(s) contain the data you are looking for.
- Pandas will apply headers if they are within <th> elements. If pandas cannot detect headers, you can pass the row number to use as the header:
tables = pd.read_html(url, header=0)
- read_html raises a ValueError if no tables are found on a page.
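The last two points can be handled together. Here is a minimal sketch: the match parameter filters the returned list to tables whose text matches a string or regex, and wrapping the call in try/except covers the no-tables case. The inline HTML snippet is made up for illustration; in practice you would pass a URL.

```python
from io import StringIO

import pandas as pd

# A tiny page with two tables, standing in for a real URL.
html = """
<table><tr><th>Ignore</th></tr><tr><td>x</td></tr></table>
<table>
  <tr><th>Country</th><th>Capital</th></tr>
  <tr><td>France</td><td>Paris</td></tr>
</table>
"""

try:
    # match keeps only tables containing text that matches the
    # given string/regex -- handy when a page has many tables.
    tables = pd.read_html(StringIO(html), match='Paris')
except ValueError:
    # read_html raises ValueError when no tables are found
    tables = []

df = tables[0]
```

Here only the second table survives the match filter, so tables has length 1 and df has the Country/Capital headers picked up from the <th> elements.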
More Complicated Pages
Using read_html works well for fairly simple parsing jobs. If you need to grab data that is not within tables, or element attributes such as an anchor's href, you are better off using the requests and bs4 (Beautiful Soup) libraries. But given how easily you can pull data from web pages into dataframes in a few lines of code, read_html is a useful tool to have in your toolbox.
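As a sketch of that approach, here is how you might pull anchor hrefs into a dataframe with bs4. The HTML is inlined so the example is self-contained; in practice you would fetch it with requests.get(url).text.

```python
import pandas as pd
from bs4 import BeautifulSoup

# In practice: html = requests.get(url).text
html = """
<ul>
  <li><a href="/wiki/France">France</a></li>
  <li><a href="/wiki/Japan">Japan</a></li>
</ul>
"""

soup = BeautifulSoup(html, 'html.parser')

# Collect each anchor's text and href attribute -- data that
# read_html cannot see, because it only parses tables.
links = pd.DataFrame(
    [{'text': a.get_text(), 'href': a['href']} for a in soup.find_all('a')]
)
```

From here you are back in dataframe land and can filter, join, or export the links however you like.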