Lxml website scraping

Pandas makes it easy to scrape a table (the <table> tag) from a web page. Once you have it as a DataFrame, you can of course process it further and save it as an Excel or CSV file.

Requests is a well-known library that most Python developers use as a fundamental tool to fetch raw HTML. Beyond that, there are a lot of Python libraries that can help you with web scraping: lxml, BeautifulSoup, and a full-fledged framework called Scrapy. Most tutorials discuss BeautifulSoup and Scrapy, so I decided to go with what powers both of them: the lxml library.

In this article you’ll learn how to extract a table from any web page. Sometimes there are multiple tables on a page, in which case you can select the one you need.

Related course: Data Analysis with Python Pandas

Pandas web scraping

Install modules

You need the modules lxml, html5lib, and beautifulsoup4. You can install them with pip, for example:
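```
# pandas itself is also required; the other packages are the parsers read_html() relies on
pip install pandas lxml html5lib beautifulsoup4
```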

pandas.read_html()

You can use the function read_html(url) to get the tables from a web page.

The table we’ll get is from Wikipedia: the version history table from the Wikipedia page about Python:
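A minimal sketch, assuming the Wikipedia “History of Python” page as the URL (the original URL isn’t shown here; any page containing a <table> works the same way):

```python
import pandas as pd

# assumed URL: the Wikipedia "History of Python" page holds a version history table
url = 'https://en.wikipedia.org/wiki/History_of_Python'

# read_html() fetches the page and returns a list of DataFrames,
# one for every <table> element it finds
dfs = pd.read_html(url)
print(len(dfs))
```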

This outputs the number of tables pandas found. Because there is one table on the page, the list holds a single DataFrame; if you change the URL, the output will differ.
To output the table:
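Continuing the sketch, the first entry of the list is the table itself:

```python
# take the first (and here only) DataFrame from the list and print it
df = dfs[0]
print(df)
```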

You can access columns like this:
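The column labels come from the table’s header row; “Version” below is an assumed name, so check df.columns first:

```python
# show the column labels pandas detected
print(df.columns)

# access a single column by name ("Version" is assumed to exist in this table)
print(df['Version'])
```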

Once you have it as a DataFrame, it’s easy to post-process. If the table has many columns, you can select only the columns you want, as in the sketch below:
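(The column names here are assumptions; substitute the names your table actually has.)

```python
# keep only the columns of interest (column names are assumptions)
df = df[['Version', 'Release date']]
print(df)
```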

Then you can write it to Excel or do other things:
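(A sketch; the file names are arbitrary, and to_excel() needs an Excel engine such as openpyxl installed.)

```python
# write the scraped table to disk; index=False drops the DataFrame's row index
df.to_excel('python_versions.xlsx', index=False)
df.to_csv('python_versions.csv', index=False)
```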

Related course: Data Analysis with Python Pandas