Introducing the Python web scraping package, Beautiful Soup.
Additional Resources
- pipenv Workshop
- Beautiful Soup web site
- Alice's Adventures in Wonderland by Lewis Carroll
- Parsers
Beautiful soup, so rich and
green, waiting in a hot tureen.
0:00
Who for such dainties would not stoop?
0:04
Soup of the evening, beautiful soup.
0:06
Soup of the evening, beautiful soup.
0:09
This is the start of a song the Mock
Turtle sings in Lewis Carroll's book,
0:12
Alice's Adventures in Wonderland.
0:16
I couldn't beat Gene Wilder's
singing of it, so I didn't even try.
0:19
It's also where the HTML parsing
package Beautiful Soup gets its name.
0:23
It's designed to scrape web pages,
and provides a tool kit for
0:28
dissecting a document, and
extracting what you need.
0:31
One of the features that Beautiful Soup
provides is the ability to utilize
0:36
different parsers to create the Python
object version of the page.
0:39
There are some that
are faster than others, and
0:44
some that are better at transforming
the messy HTML pages we looked at earlier.
0:46
We'll be using a good middle-of-the-road parser right now, but
0:51
check the teacher's notes for
other popular options.
0:54
Let's see how to get Beautiful Soup
set up and use it to parse a webpage.
0:58
Let's head into our IDE and get started.
1:03
I'm using PyCharm.
1:06
So first,
we'll need to install Beautiful Soup.
1:08
We'll go to Preferences.
1:12
We want to install a package.
1:15
We want to look for beautifulsoup4
and install the package.
1:18
If you're in a different IDE, you can
also use tools such as pip or pipenv.
1:28
If you aren't familiar with pipenv,
it's similar to pip but
1:34
offers some additional features.
1:38
Check the teacher's notes for
more information.
1:40
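However you install it, one quick way to confirm that the package is available is to print its version from Python. This is just a sketch of a sanity check, not something shown in the video; it assumes beautifulsoup4 was installed with pip, pipenv, or PyCharm's package tool.

```python
# Confirm that Beautiful Soup is installed and importable.
import bs4

print(bs4.__version__)  # prints the installed beautifulsoup4 version
```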
And now we want to put
it to use in a new file.
1:42
Let's call it scraper.py.
1:45
scraper.py, and we'll do our imports.
1:50
So from urllib.request,
we'll import urlopen.
1:53
This will handle the server
request to our URL.
2:00
From bs4, we want to import BeautifulSoup.
2:06
Next, we pass the URL of the page we
want to scrape into the urlopen method.
2:13
We'll assign it to a variable called html
2:21
= urlopen, and our site's URL, which
2:27
is https://treehouse-projects.github.io/horse-land/index.html.
2:40
Then we create our Beautiful Soup object.
2:44
Call it soup = BeautifulSoup.
2:49
In here, we pass in the HTML,
and call read to read it.
2:52
So html.read, then we pass in our parser.
2:58
In our case, we're using the html.parser
that's included with Python 3,
3:03
html.parser.
3:09
And giddy up, we're set to read our page.
3:13
Let's print it out to see what we get.
3:15
We'll print(soup) and run our script.
3:19
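Here's a minimal sketch of scraper.py as built so far, assuming the horse-land URL and the built-in html.parser mentioned above:

```python
# scraper.py -- minimal sketch of the script so far.
# If you're not using PyCharm's package tool, install the package first,
# e.g. with `pip install beautifulsoup4` or `pipenv install beautifulsoup4`.
from urllib.request import urlopen

from bs4 import BeautifulSoup

# Request the page; urlopen returns a response object we can read.
html = urlopen("https://treehouse-projects.github.io/horse-land/index.html")

# Parse the raw HTML with Python's built-in html.parser.
soup = BeautifulSoup(html.read(), "html.parser")

# Print the parsed document.
print(soup)
```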
Great, that works, but nothing is indented
like it's supposed to be in HTML.
3:26
We can do better with the prettify method.
3:32
We do soup.prettify, and rerun it.
Prettify, and rerun it.
3:40
That's much better and easier to read,
which is what prettify does.
3:45
It simply makes things easier to read.
3:48
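For reference, the call looks like this; prettify returns the parsed document as a single indented string, one tag per line:

```python
# prettify() returns an indented, easier-to-read version of the document.
print(soup.prettify())
```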
Now, did you notice something in here?
3:53
Our image gallery isn't being displayed.
3:54
We have our unordered list
with our ID of image gallery,
4:01
which from a previous video we know
contains all of the images of our horses.
4:04
Here in the HTML, though,
we're not seeing the list items;
4:10
our images are being populated
using some JavaScript.
4:14
Beautiful Soup doesn't wait for JavaScript
to run before it scrapes a page.
4:17
We'll see how to handle these
situations in a little bit.
4:23
For now, let's look at additional
Beautiful Soup features.
4:26
We can drill down to get specific pieces
of the site, like the page title.
4:30
We'll do soup.title. There it is.
4:36
How about an element on the page, like a div?
4:42
soup.div.
4:45
Well, shucks, that only gets
us the first div on the page.
4:48
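As a quick sketch, drilling down by tag name looks like this; attribute-style access only ever returns the first matching tag:

```python
# Drill down by tag name. Attribute access returns only the first match.
print(soup.title)  # the page's <title> element
print(soup.div)    # just the first <div> on the page
```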
Let's get them all and
loop through them to print them out.
4:52
There is a find_all method that
allows us to easily do that.
4:55
So come up here.
4:59
Say divs = soup.find_all, and
5:00
we want div elements, for div in divs.
5:05
We want to print our div.
5:15
And run it again, there's our divs.
5:22
We can filter some of these out
by passing in class values.
5:26
Let's just get the one that
has this featured class name.
5:30
We come back here to our website and
open the developer tools.
5:34
The featured section here is the one
with the horse of the month.
5:38
So here, we'll pass in
class, and we want featured.
5:41
There, now we only have the divs
whose class includes featured.
5:53
This narrows down a specific area for
us to scrape.
5:58
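One way to write that filter is with the class_ keyword argument; Beautiful Soup uses the trailing underscore because class is a reserved word in Python, and passing attrs={"class": "featured"} works as well. A sketch, assuming the soup object from earlier:

```python
# Narrow the search to divs whose class attribute includes "featured".
featured_divs = soup.find_all("div", class_="featured")
for div in featured_divs:
    print(div)
```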
We'll explore more of this find_all and
its related method, find, in the next video.
6:01