Scraping Data from Wikipedia Tables

[Image: the Historical Population table from the Demographics section of the Houston Wikipedia page]

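First, we load our packages. The scraping functions used below come from rvest, which is installed with the tidyverse but isn't attached by library(tidyverse), so a minimal setup looks something like this:

```r
# Load the tidyverse (dplyr, stringr, etc.) plus rvest for web scraping
library(tidyverse)
library(rvest)
```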


Next, we need to give R the URL of the webpage we're interested in:

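A minimal version of this step (the object name url is our choice; the address points to the main Houston article):

```r
# Store the webpage address as a string
url <- "https://en.wikipedia.org/wiki/Houston"
```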


We then use the read_html() function to parse the webpage into an object that R can work with, followed by the html_nodes() function to focus exclusively on the table objects contained within the page:

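Sketched out, those two calls look something like this (the object names are our own):

```r
# Parse the page's HTML, then keep only the <table> nodes it contains
houston_html <- read_html(url)

tables <- houston_html %>%
  html_nodes("table")
```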


It looks like the Houston Wikipedia page contains 19 tables, although some of these class descriptions are more informative than others:

[Output: an xml_nodeset of 19 table nodes, each previewed with its class attribute]


Next, we pull out our table of interest from these available tables. The nth() function specifies that we want the 4th table from the list above. Determining the right table to specify here may take some trial and error when a webpage contains multiple tables: you can make an educated guess from the table's position on the page, or simply view different tables until you find the one you're looking for:

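A sketch of this extraction; nth() comes from dplyr, and html_table(), assumed here, converts the selected node into a data frame:

```r
# Pull out the 4th table node and convert it to a data frame
pop_table <- tables %>%
  nth(4) %>%
  html_table()
```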


We get the following output, and the Wikipedia table is now in R! As often happens with web scraping, however, this table isn’t really usable yet. All four columns have the same name, the first and last rows don’t contain data, and there is an extra column in the middle of our data frame:

[Output: the raw Historical Population table, with four identically named columns, non-data first and last rows, and a blank middle column]


Let's do some quick clean-up to make this table more usable. We can't do much of anything until our columns have unique names, and we also need to restrict the table to its 2nd through 19th rows:

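One way to write this clean-up; the column names, including the placeholder name for the blank column, are our own choices:

```r
# Assign unique names to the four columns
colnames(pop_table) <- c("year", "blank", "population", "percent_change")

# Keep only the rows that contain data
pop_table <- pop_table %>%
  slice(2:19)
```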


We’re not quite there yet, but the output is looking much better:

[Output: the renamed table, one row per year, still containing the blank column and character-typed values]


Let’s do some final cleaning. First, we’ll get rid of the blank column. All columns are also currently stored as character variables, whereas year should be a date and population and percent_change should be numeric. We remove unnecessary strings from the percent_change and population columns, then convert all columns to their appropriate formats:

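A sketch of that final pass; the exact strings to strip (commas and percent signs here) are assumptions about the table's formatting:

```r
pop_table <- pop_table %>%
  # Drop the empty middle column
  select(-blank) %>%
  mutate(
    # Strip formatting characters, then convert to numeric;
    # entries with no digits become NA
    population = as.numeric(str_remove_all(population, ",")),
    percent_change = as.numeric(str_remove_all(percent_change, "%")),
    # Store year as a date by anchoring each year to January 1
    year = as.Date(paste0(year, "-01-01"))
  )
```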


It’s as simple as that. Everything in the table now looks as we would expect it to:

[Output: the cleaned table, with year stored as a date and population and percent_change stored as numeric]


The population data is now fully usable and ready to be analyzed. Web scraping is a great tool for accessing a wide range of data sources, and it is far preferable to manually copying values from online tables, given its reproducibility and reduced likelihood of human error. The code in this article can also be built upon to scrape numerous tables at once, allowing for even greater efficiency and access to even more data.
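For example, purrr's map() can apply html_table() to every node at once; a minimal sketch using the tables object from earlier:

```r
# Convert every table on the page into a list of data frames
all_tables <- tables %>%
  map(html_table)
```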
