We've seen how to scrape data from a single page. Now let's see how we can capture links on one page and follow them to process additional pages.
Additional Resources
- Regular Expressions in Python
[MUSIC]
We've seen how we can scrape data from a single page and isolate all the links on that page. We can utilize that and start moving off a single page and onto multiple pages, or crawling the web.
The internet contains over 4.5 billion pages connected together by hyperlinks. Web crawling is, for our purposes, the practice of moving between these connected web pages along the paths of hyperlinks. This is where the power of automation comes into play. We can write our application to look at a page, scrape the data, then follow the links, if any, on that page and scrape the next page, and so on. Most web pages have both internal and external links on them.
Before we saddle up again and get going in our code, let's think about web crawling at a high level. We need to scrape a given page and generate a list of links to follow. It's often a good idea to determine whether a link is internal or external and keep track of them separately, so we'll go through the list of links and separate them into internal and external lists. We'll check to see if we already have a link recorded, and if so, it will be ignored; if we don't have a record of seeing a particular link, we'll add it to our list. We'll also look at how to leverage the power of regular expressions to account for things like URLs.
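Here's a minimal sketch of that bookkeeping; the helper names, the sample links, and the base host are mine for illustration, not code from the video:

```python
from urllib.parse import urlparse

# Hypothetical sketch of the plan described above: separate lists for
# internal and external links, recording each link only once.
internal = []
external = []

def is_internal(link, base_host='treehouse-projects.github.io'):
    """Treat relative URLs and URLs on our own host as internal."""
    host = urlparse(link).netloc
    return host == '' or host == base_host

def record(link):
    """File a link in the right list, ignoring any we've already seen."""
    links = internal if is_internal(link) else external
    if link not in links:
        links.append(link)

record('mustang.html')                           # relative, so internal
record('https://en.wikipedia.org/wiki/Mustang')  # another host, so external
print(internal, external)
```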
If you need a refresher on regular expressions in Python, and I know I occasionally do, check the teacher's notes. When we last looked at scraper.py, we were getting all of the links from our Horse Land main page. Let's see how we can round up these links and put them to use. Looking at the output from our previous run of scraper.py, we're getting this internal link here for mustang.html and then all of these external links. We can separate those out and follow them.
First, let's make a new Python file; let's call it soup_follow_scraper. I told you I'm bad at naming things. We can minimize this. And we'll bring in our imports: from urllib.request we want urlopen, and from bs4 we import BeautifulSoup. We'll also be using regular expressions, so let's import re to take care of that.
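In the editor, the top of soup_follow_scraper.py ends up as these three imports:

```python
from urllib.request import urlopen  # fetches each page for us
from bs4 import BeautifulSoup       # parses the HTML we get back
import re                           # matches link patterns
```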
Let's make an internal links function that will take a link URL: internal_links(linkURL). We'll need to open our URL to define our html with urlopen. Inside, we'll pass in the start of the URL and format it with the internal URL we scraped from the page. In our case, the URL is treehouse-projects.github.io/horse-land, plus our string formatter, and we'll format it with linkURL. Next, we create our Beautiful Soup object: soup is BeautifulSoup, passing in our html and the same parser we've been using, html.parser. And we want to return the link from the soup object with soup.find. We'll look for anchor tags and use the href attribute of the find method with a regular expression, re.compile, to get just the links that, in our case, end in .html.
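Put together, the function reads roughly like this (I've escaped the dot and anchored the pattern with `$` so it only matches hrefs that actually end in .html):

```python
def internal_links(linkURL):
    # Build the full address from the site root plus the internal
    # link we scraped, e.g. 'index.html' or 'mustang.html'
    html = urlopen(
        'https://treehouse-projects.github.io/horse-land/{}'.format(linkURL))
    soup = BeautifulSoup(html, 'html.parser')
    # Return the first anchor tag whose href ends in .html
    return soup.find('a', href=re.compile(r'\.html$'))
```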
Let's put it to use. So if __name__ equals '__main__', we want our urls to come from internal_links, and we'll pass our starting URL to the internal_links function; in our case, it's index.html. Then we'll do a while loop: while the length of our urls is greater than 0, we want to capture the URL's href. Now, we could do a lot of processing here, but for now let's just print out the page information we get, print(page), and then add a little bit of formatting, a couple of new lines in there. Then we'll call our internal_links function again for the next link, internal_links(page). Let's run it and see it in action.
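At this point the main block looks roughly like this, and as we're about to see, it runs forever:

```python
if __name__ == '__main__':
    urls = internal_links('index.html')    # our starting page
    while len(urls) > 0:
        page = urls.attrs['href']          # capture the href we found
        print(page)
        print('\n==================\n')    # a little formatting between pages
        urls = internal_links(page)        # follow the link to the next page
```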
Well, there we have it. It's doing what we asked, but it's in an infinite loop: index.html is finding the link to mustang.html, which is finding the link back to index.html, which is, well, you get the point. Let's add in a list to keep track of our pages; call it site_links. Then we'll adjust our while loop: if page not in site_links, we'll add the page to our list with site_links.append(page). We can indent all of that, give us some more space, and otherwise we'll just break. Let's run it again. Page is not defined, so let's pull that out; I started my if statement too soon. There we go, and we get the links that we were expecting.
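With the list in place and the if statement starting where it should, the finished loop looks roughly like this:

```python
site_links = []    # every page we've already visited

if __name__ == '__main__':
    urls = internal_links('index.html')
    while len(urls) > 0:
        page = urls.attrs['href']
        if page not in site_links:         # only process pages we haven't seen
            site_links.append(page)
            print(page)
            print('\n==================\n')
            urls = internal_links(page)
        else:
            break                          # already seen it, so stop crawling
```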
External links are handled in a similar fashion: you define the base URL path, and then with regex define the pattern you're looking for and follow the links. I'll saddle you with the responsibility to give it a try and post your solution in the community. Don't worry, I'm sure you can rein it in.
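If you'd like a nudge before trying it yourself, here is one possible shape for it; the function name and the regex are my own guesses at a solution, not the course's answer:

```python
def external_links(linkURL):
    html = urlopen(linkURL)
    soup = BeautifulSoup(html, 'html.parser')
    # Absolute http(s) URLs point at other hosts, so treat those as external
    return soup.find_all('a', href=re.compile(r'^https?://'))
```

From there, you can loop over the results and follow each href, much as we did with the internal links.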