Let's further explore how to crawl the web.
With our first spider, Ike, we saw how to process a static list of URLs. This is great if you know all the URLs of the pages you want to scrape. What happens, though, when you want to start following links that are included on the page itself? Scrapy has some helpful methods for handling these situations, with the LinkExtractor and CrawlSpider classes.
A word of caution here: before we crawl down this path, we need to be aware of the overwhelming amount of data and sites that are connected on the web. Writing a spider that gets and follows all the links on each followed web page can lead to a program that never ends. Also, with the idea that any given site is only six clicks away from any other site, sending a spider on a massive crawling task can potentially lead to some sites that are way off our originally intended topic. We should look at setting up some rules for our spider to follow as well.
The CrawlSpider class from Scrapy is set up a bit differently than the spider we wrote in the last video. It has the same overall concept, but instead of a start_requests method, we define allowed_domains and start_urls. Then we'll define a set of rules for the spider to follow. This lets us tell the spider which links to match or not, whether to follow them or not, and how to parse the information. Let's take a look at how to implement these concepts in a new spider.
Let's create a new file in our spiders folder and call it crawler.py. We need a few imports: from scrapy.linkextractors, import LinkExtractor, and from scrapy.spiders, we want to import CrawlSpider and Rule.
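At this point the top of crawler.py might look roughly like this, a sketch covering just the two imports mentioned above:

# crawler.py -- imports for a link-following spider
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule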
Next, we define our class, this time inheriting from CrawlSpider. We'll name this one after another famous horse, Whirlaway. Perhaps not quite the same as Ike from Charlotte's Web, but a winner in his own right.
When using the CrawlSpider class, we can set a few parameters for it to follow. Let's start with an allowed_domains limit, to prevent our spider from getting too far out of control. allowed_domains takes a list; for ours we'll just use treehouse-projects.github.io. Next, we define a place to start, so we set start_urls to treehouse-projects.github.io/horse-land.
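So far the class might look something like this. The WhirlawaySpider class name, the lowercase 'whirlaway' name attribute, and the https scheme are illustrative assumptions rather than exactly what appears on screen:

class WhirlawaySpider(CrawlSpider):
    name = 'whirlaway'  # assumed spider name, used later with the crawl command
    # Keep the spider on our demo site only
    allowed_domains = ['treehouse-projects.github.io']
    # Where the crawl begins
    start_urls = ['https://treehouse-projects.github.io/horse-land/']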
Now we can define our rules. We'll use the LinkExtractor class and pass in a regular expression of links to follow or ignore. So our rules will be a Rule with a LinkExtractor and our regular expression. Then we tell our rule how to parse the information by assigning the callback parameter to the method name; let's use parse_horses, so callback='parse_horses'. Then we tell the rule it's okay to follow the links with follow=True. And let's clean this up a little bit by dropping these arguments down onto new lines.
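Inside the class, the rules attribute might look roughly like this; the regular expression below is only a placeholder for whatever pattern of links you want matched, since the exact one isn't spelled out here:

    rules = (
        Rule(
            LinkExtractor(allow=r'.*\.html'),  # placeholder pattern of links to follow
            callback='parse_horses',           # method that parses each matched page
            follow=True,                       # keep following links found on matched pages
        ),
    )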
Now we can define our parsing method. parse_horses will take self and the response; we'll grab the page URL and the page title. We can use CSS to select specific page elements. The result of running response.css('title') is a list-like object called a SelectorList, which represents a list of Selector objects that wrap around XML or HTML elements and allow you to run further queries to fine-tune the selection or extract the data. For this example, let's just print out the URL and title: we'll print the page URL with format(url), and we'll print the page title.
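As a sketch, the method might look like this; the exact print formatting in the video may differ a little:

    def parse_horses(self, response):
        # The URL of the page this response came from
        url = response.url
        # response.css('title') returns a SelectorList; extract() pulls out the matches
        title = response.css('title').extract()
        print('Page URL: {}'.format(url))
        print('Page title: {}'.format(title))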
Go to the terminal, and we'll ask Scrapy to crawl our site with the crawl Whirlaway command. We need to be in the right directory first, then run the crawl.
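From the project directory, the command is the spider's name passed to scrapy crawl; assuming the name attribute is 'whirlaway', that would be:

scrapy crawl whirlaway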
And there's our information. Scrolling up, we see again that we got a 404 when it was looking for robots.txt. The page URL and page title are there, but it's kinda messy; we can clean that up a little bit.
We only want to extract the text elements directly inside the title element, so let's change that: on the title selector, we want the text, and we want to extract it.
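That change amounts to adding the ::text pseudo-selector before extracting, something like:

        title = response.css('title::text').extract()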
And we'll run it again. There's our page title; that's much better. Also note in the output that Scrapy found those external links but filtered them out. Thanks, Scrapy.
Well done, you've written two different spiders now: one that follows links that we provide, and one that extracts links from a site and follows them based on rules we set. These are both very powerful tools for scraping data from the web. Being able to get the information is a major task, and we've seen how easy scraping makes it. In the next stage, let's take a look at how to handle some other common tasks, such as handling forms and interacting with APIs.