Let's use the Python library Scrapy to create a spider to crawl the web.
Additional Resources
- Python List Comprehensions
- Scrapy Response object
Inside the spiders folder, let's create
a new file to crawl our sample horse site.
0:00
So Spiders > File > New File > Python,
horse.py.
0:06
To start with, we need to import scrapy.
0:14
And then we'll write a class,
we'll call it HorseSpider,
0:20
which will inherit from scrapy.Spider.
0:26
Now, we need to give our HorseSpider
a name, let's call it ike,
0:30
after the horse in Charlotte's Web.
0:34
Spider names must be unique
within a scrapy project.
0:39
That's how scrapy knows which spider
to run in the project.
0:42
We'll use it to run our
spider in just a bit.
0:46
There are two functions
we need to write in here.
0:48
start_requests, which defines
the initial requests to be made and,
0:51
if applicable, how to follow links.
0:55
So we'll define start_requests.
0:59
For now, we'll just pass.
1:01
The other function is parse, which will
tell the spider how extracted data is to
1:03
be parsed.
1:08
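At this point, horse.py looks something like this, a minimal sketch of the skeleton so far:

```python
import scrapy


class HorseSpider(scrapy.Spider):
    # Spider names must be unique within a Scrapy project
    name = "ike"

    def start_requests(self):
        pass

    def parse(self, response):
        pass
```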
Inside start_requests, we provide
a list of URLs that we want to process.
1:11
So urls, and we pass in a list.
1:17
So we'll pass in our index and
our mustang.html pages.
1:21
The whole URL is treehouse-projects.
1:26
github.io/horseland/index.html.
1:33
We'll paste that in and
change it to mustang.
1:43
Then we need to return a scrapy.Request.
1:49
This is a list comprehension.
1:52
It's going to create a new list of
requests by looping through each of our URLs.
1:54
More in the teacher's notes.
2:00
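As a quick refresher, a list comprehension builds a new list from a loop in a single expression, for example:

```python
# Builds [0, 1, 4, 9, 16] by squaring each number in the range
squares = [n * n for n in range(5)]
```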
So we want to return a list
of scrapy.Request objects.
2:01
We want our url to be url,
and our callback is going to be self.parse.
2:10
We want that for each url in urls.
2:17
This line loops through our urls list
and creates a request for each one, with the parse method as the callback.
2:21
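Put together, start_requests looks something like this inside our HorseSpider class (the https:// scheme is assumed here, since the video only shows the bare domain):

```python
def start_requests(self):
    urls = [
        "https://treehouse-projects.github.io/horseland/index.html",
        "https://treehouse-projects.github.io/horseland/mustang.html",
    ]
    # One Request per URL; Scrapy will call self.parse with each response
    return [scrapy.Request(url=url, callback=self.parse) for url in urls]
```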
Let's update that method to do something.
2:26
We could do a lot of
things inside this method.
2:31
How you parse the data on
a site will be highly dependent
2:33
on the purpose of your project, since
every use case can be a little different.
2:37
For now,
let's just save the entire HTML file.
2:42
So we'll define a url, which is response.url.
2:46
This response object
represents an HTTP response
2:50
from the request we
made in start_requests.
2:54
It's usually downloaded by the downloader
and fed to the spiders for processing.
2:57
See the teacher's notes for additional
documentation on scrapy's response object.
3:03
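For reference, a few of the Response attributes available inside parse:

```python
response.url     # the URL this response came from
response.status  # the HTTP status code, e.g. 200
response.body    # the raw bytes of the page
```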
So with our url,
we want to get the specific page.
3:08
We'll split it on the last slash there,
3:13
and for our file name, we'll call it horses.
3:20
We'll format that with our page and
we'll print out what the URL is.
3:25
And then we'll save our page.
3:36
I'm going to just write
the entire response body.
3:44
Then we'll print out the saved file name.
3:49
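Here's a sketch of the finished parse method; the exact "horses-" file naming is an assumption based on the narration:

```python
def parse(self, response):
    url = response.url
    page = url.split("/")[-1]            # e.g. "index.html"
    filename = "horses-{}".format(page)  # assumed naming pattern
    print("URL is {}".format(url))
    # Save the entire HTML body of the response to a local file
    with open(filename, "wb") as f:
        f.write(response.body)
    print("Saved file {}".format(filename))
```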
Nice, now in a terminal window,
we navigate to our spider's directory.
3:57
And tell scrapy to crawl
using our spider name.
4:11
So we do scrapy crawl ike.
4:15
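That is, at the prompt:

```
scrapy crawl ike
```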
If we look at the output in our terminal,
we can find it,
4:19
coming up here a little bit,
right in here.
4:23
We see that the spider looked for
4:29
our robots.txt file, which it didn't
find since the site doesn't have one.
4:30
See this 404 code here?
4:35
After the robots.txt check, the pages
4:37
that we included in our urls list were
found and saved by the parse method.
4:41
There's the URLs, there's the file names,
4:46
we'll come back up here,
there they are, very nice.
4:50
Great work on writing your first spider.
4:55
We saw the two methods that a scrapy
spider needs, start_requests and parse.
4:58
We put a list of URLs in
the start_requests method and
5:05
had it loop through that list and
process each URL with the parse method.
5:08
We could have our parse method
do something more powerful
5:13
than just saving the entire file.
5:16
But this is a nice start.
5:18
Next up though,
5:20
let's see how to write a spider that will
crawl more URLs than what we give it.
5:21