Lxml website scraping

Pandas makes it easy to scrape a table (the <table> tag) from a web page. Once you have it as a DataFrame, you can of course process it further and save it as an Excel or CSV file.

Requests is a well-known library that most Python developers use as a fundamental tool to fetch raw HTML. Beyond that, there are a lot of Python libraries that can help you with web scraping: lxml, BeautifulSoup, and a full-fledged framework called Scrapy. Most tutorials discuss BeautifulSoup and Scrapy, so I decided to go with what powers both of them: the lxml library.

In this article you’ll learn how to extract a table from any web page. Sometimes there are multiple tables on a page, in which case you can select the one you need.

Related course: Data Analysis with Python Pandas

Pandas web scraping

Install modules

You need the modules lxml, html5lib, and beautifulsoup4. You can install them with pip, for example:
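```
# pandas itself is also required; the other packages are the parsers read_html() relies on
pip install pandas lxml html5lib beautifulsoup4
```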

pandas.read_html()

You can use the function read_html(url) to get the tables from a web page.

The table we’ll get is from Wikipedia: the version history table from the Wikipedia page about Python:
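A minimal sketch, assuming the Wikipedia “History of Python” page as the URL (the original URL isn’t shown here; any page containing a <table> works the same way):

```python
import pandas as pd

# assumed URL: the Wikipedia "History of Python" page holds a version history table
url = 'https://en.wikipedia.org/wiki/History_of_Python'

# read_html() fetches the page and returns a list of DataFrames,
# one for every <table> element it finds
dfs = pd.read_html(url)
print(len(dfs))
```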

This outputs the number of tables pandas found. Because there is one table on the page, the list holds a single DataFrame; if you change the URL, the output will differ.
To output the table:
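Continuing the sketch, the first entry of the list is the table itself:

```python
# take the first (and here only) DataFrame from the list and print it
df = dfs[0]
print(df)
```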

You can access columns like this:
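The column labels come from the table’s header row; “Version” below is an assumed name, so check df.columns first:

```python
# show the column labels pandas detected
print(df.columns)

# access a single column by name ("Version" is assumed to exist in this table)
print(df['Version'])
```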

Once you have it as a DataFrame, it’s easy to post-process. If the table has many columns, you can select only the columns you want, as in the sketch below:
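(The column names here are assumptions; substitute the names your table actually has.)

```python
# keep only the columns of interest (column names are assumptions)
df = df[['Version', 'Release date']]
print(df)
```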

Then you can write it to Excel or do other things:
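(A sketch; the file names are arbitrary, and to_excel() needs an Excel engine such as openpyxl installed.)

```python
# write the scraped table to disk; index=False drops the DataFrame's row index
df.to_excel('python_versions.xlsx', index=False)
df.to_csv('python_versions.csv', index=False)
```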

Related course: Data Analysis with Python Pandas