Getting up and going with the Scrapy library.
Additional Resources
- Scrapy website
- Scrapy installation guide
The script we built to read the horse website is a basic web crawling bot to scrape data from a site. Python has a great library available that provides a more full-featured way to quickly extract the data you need from websites. With Scrapy, we write the rules for the data we want extracted and let it do the rest.
Let's get Scrapy installed and then set up our first spider project. Looking at the Scrapy installation guide, we see that it runs on Python 2.7 and on Python 3.4 and higher, and that it can be installed using conda or from PyPI with pip.
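For reference, the two documented install routes look roughly like this from a terminal (the conda command pulls from the conda-forge channel, where Scrapy publishes its packages):

```
# Install from PyPI with pip
pip install Scrapy

# Or install with conda from the conda-forge channel
conda install -c conda-forge scrapy
```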
Let's add this package to our project in PyCharm. We want Scrapy, and we'll install the package. If you find there are issues with your installation, check the platform-specific installation notes in the Scrapy documentation for additional information. Once it's finished installing, you can come out of here.
Go to a Terminal window, and let's create a new spider project. We'll call it AraneaSpider. Aranea is the name of one of Charlotte's children in the classic children's book Charlotte's Web. It's also the genus name of one of my personal favorite spiders, the orb weaver. So if we run scrapy startproject AraneaSpider, it creates our project for us. Running this command handles creating the directory structure and setup for a Scrapy project.
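As a rough sketch, the command and the layout it typically generates look like this (exact contents can vary a little by Scrapy version):

```
scrapy startproject AraneaSpider

AraneaSpider/            # project root
    scrapy.cfg           # deployment configuration
    AraneaSpider/        # the project's Python module
        __init__.py
        items.py         # item definitions
        middlewares.py   # spider/downloader middlewares
        pipelines.py     # item pipelines
        settings.py      # project settings
        spiders/         # where our spiders will live
            __init__.py
```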
Let's see what Scrapy has provided for us. We'll minimize this. Here, under our folder, there's a scrapy.cfg file, which handles deployment configuration, and a project Python module from which we'll import our code. There are also some generated stub files. Their names are pretty descriptive: items, middlewares, pipelines, and settings each hold their respective configuration. Next is the spiders directory. This is where we'll put our spiders.
Let's talk a little bit about what a couple of these files are used for. items.py is used to define a model of the data for scraped items. Scrapy spiders can return scraped data as Python dicts, but as you know, dicts lack structure. We can use items.py to create containers where we can put the data we get from a site.
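As a minimal sketch, an item definition in items.py might look like this (the HorseItem name and its fields are hypothetical, continuing our horse-website example):

```python
import scrapy


class HorseItem(scrapy.Item):
    # Each Field is a named container for one piece of scraped data;
    # these particular field names are hypothetical examples.
    name = scrapy.Field()
    url = scrapy.Field()
```

A spider can then yield HorseItem instances instead of plain dicts, so the data it returns has a known structure.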
Middlewares allow for custom functionality to be built to customize the responses that are sent to spiders.
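For instance, a bare-bones downloader middleware in middlewares.py could be sketched like this (LoggingMiddleware is a hypothetical name; process_response is the hook Scrapy calls with each response on its way to a spider):

```python
class LoggingMiddleware:
    # Called for every response before it reaches a spider; we can
    # inspect or modify it here, then return it to pass it along.
    def process_response(self, request, response, spider):
        spider.logger.info('Fetched %s (%s)', response.url, response.status)
        return response
```

Middleware classes also need to be enabled in settings.py (via the DOWNLOADER_MIDDLEWARES setting) before Scrapy will use them.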
pipelines.py is used to customize the processing of data. For example, you could write a pipeline step that cleanses the HTML; the data would then move down the processing pipeline to be validated, and then the information would be stored in a database. Steps along the data processing path can be put into the pipeline.
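A bare-bones pipeline step in pipelines.py might look like this (CleanHtmlPipeline and the name field are hypothetical; process_item is the method Scrapy calls once for each scraped item):

```python
class CleanHtmlPipeline:
    # Called for each item a spider yields; returning the item
    # passes it along to the next pipeline step.
    def process_item(self, item, spider):
        item['name'] = item['name'].strip()  # e.g., trim stray whitespace
        return item
```

Additional steps, like validation and database storage, would each get their own pipeline class, ordered via settings.py.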
settings.py allows for the behavior of Scrapy components to be customized.
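To give a flavor, a few common knobs in settings.py look like this (the pipeline entry assumes the hypothetical CleanHtmlPipeline sketched above; the number sets its order among pipeline steps, with lower numbers running first):

```python
BOT_NAME = 'AraneaSpider'

# Respect each site's robots.txt rules
ROBOTSTXT_OBEY = True

# Wait between requests so we don't hammer the site
DOWNLOAD_DELAY = 1

# Enable our hypothetical pipeline step
ITEM_PIPELINES = {
    'AraneaSpider.pipelines.CleanHtmlPipeline': 300,
}
```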
In our next video, let's write our first spider. I'll see you shortly.