You have completed Scraping Data From the Web!
Web scraping doesn't have to be entirely about gathering data for processing. Web scraping tools can be used to test websites as well.
- Testing in Python
- Introduction to Selenium
Using web scraping tools doesn't
just have to be for gathering data.
0:00
It can be used to test a site as well.
0:04
Testing your code is a great
development practice to get into.
0:07
Writing unit tests, and
0:11
combining them with a web scraper,
can be a powerful tool for testing a site.
0:12
You can check to make sure that
a page's title is as expected,
0:17
or that all of the content resides in
an element with a specific CSS class.
0:20
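The two checks just mentioned can be sketched as follows, assuming BeautifulSoup is installed. The HTML string, the title text, and the `card` class are hypothetical stand-ins for illustration, not taken from the course site.

```python
from bs4 import BeautifulSoup

# Hypothetical page used only for illustration.
SAMPLE_HTML = """
<html>
  <head><title>Horse Land</title></head>
  <body>
    <div class="card"><p>Mustang</p></div>
    <div class="card"><p>Clydesdale</p></div>
  </body>
</html>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# Check 1: the page title matches what we expect.
title_ok = soup.title.get_text() == "Horse Land"

# Check 2: every <p> sits inside an element with the expected CSS class.
paragraphs = soup.find_all("p")
content_ok = all(p.find_parent(class_="card") is not None for p in paragraphs)

print(title_ok, content_ok)
```

In a real test, each check would become an assertion inside a unittest test method rather than a printed value.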
If you need a refresher
on testing in Python,
0:26
check the teacher's notes for
some great resources.
0:29
Let's head back to our sample site, and
0:33
use unit tests to make sure it has
the elements that we expected it to have.
0:34
Let's go back to our horse site.
0:40
We'll check to see if
it's a stable version
0:41
of what we're expecting it to be.
0:44
Go over, let's create a new file,
new Python file.
0:46
We'll call it horse_test.py,
and we'll bring in our imports.
0:52
We need urllib.request here to bring in urlopen.
0:59
We'll bring in BeautifulSoup, and
1:06
since we're running the unit test
we'll need to import unittest.
1:08
Next, we define our class and
setup information.
1:14
So we'll call the class TestHorseLand,
which inherits from unittest.TestCase.
1:18
We'll set our soup, to start with, equal
to None, and then we define a setUpClass.
1:26
And in this case, it won't take self.
1:34
We'll pass in our url,
1:37
treehouse-projects.github.io/horse-land/index.html.
1:41
Then we define our soup object.
1:54
It's going to be BeautifulSoup,
urlopen, pass in the URL,
1:58
and we want the html.parser again.
2:03
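Roughly, the imports and class setup described so far might look like the sketch below. Two things are assumptions rather than what the video necessarily shows: the `https://` scheme on the URL, and the `@classmethod` decorator with a `cls` parameter, which is the form the unittest documentation uses for `setUpClass`.

```python
import unittest
from urllib.request import urlopen

from bs4 import BeautifulSoup

# Assumption: the transcript gives no scheme, so https:// is a guess.
URL = "https://treehouse-projects.github.io/horse-land/index.html"


class TestHorseLand(unittest.TestCase):
    # Shared by every test in the class; filled in once by setUpClass.
    soup = None

    @classmethod
    def setUpClass(cls):
        # Runs once before the tests, so the page is downloaded one time.
        cls.soup = BeautifulSoup(urlopen(URL), "html.parser")
```

Fetching the page once in `setUpClass`, instead of in every test, keeps the test run fast and avoids hitting the site repeatedly.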
Now, let's test that the h1 text
is what we're expecting it to be.
2:08
So we'll define a test for header1,
2:13
We want header1 to be equal to
our TestHorseLand.soup.find.
2:19
We want to grab the h1, and get_text.
2:27
Next, we want to make sure that header1,
that we're capturing here,
2:32
is equal to what our string should be.
2:36
In our case, Horse Land.
2:39
So we would do self.assertEqual,
pass in our string
2:41
that we want, Horse Land,
equal to header1.
2:46
Then we add our dunder main check here,
and we'll run unittest.main.
2:51
And when we run this, we get an OK,
and the test passed, very nice.
3:00
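To try the header assertion without network access, the sketch below swaps the live page for a small HTML string. `SAMPLE_HTML` is a hypothetical stand-in; the test method and the dunder main check follow the steps described above. The `argv` and `exit=False` arguments are an addition here so the file also behaves in a notebook or REPL.

```python
import unittest

from bs4 import BeautifulSoup

# Hypothetical stand-in for the downloaded page.
SAMPLE_HTML = "<html><body><h1>Horse Land</h1></body></html>"


class TestHorseLand(unittest.TestCase):
    soup = None

    @classmethod
    def setUpClass(cls):
        cls.soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

    def test_header1(self):
        # Grab the first <h1> and compare its text to the expected string.
        header1 = TestHorseLand.soup.find("h1").get_text()
        self.assertEqual("Horse Land", header1)


if __name__ == "__main__":
    # From the command line, a plain unittest.main() is enough.
    # argv and exit=False are assumptions that keep this usable elsewhere.
    unittest.main(argv=["horse_test"], exit=False)
```

If the heading text on the page ever changes, `assertEqual` fails and reports both the expected and the actual string.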
Another method to test sites is
with a package called selenium,
3:06
which is designed specifically for
website testing.
3:10
It can be installed in PyCharm,
the same as BeautifulSoup, or
3:14
it can be installed with Pipenv.
3:17
I've included a link to
the installation information
3:19
in the teacher's notes, as well.
3:22
One additional step you'll need is
the driver for your preferred browser.
3:24
Follow the instructions on
the page to get it set up.
3:28
Let's create a new file
to show off selenium.
3:31
So we can close this, and create another new Python
3:34
file, horse_test_selenium.
3:40
So we'll be using BeautifulSoup again.
3:48
And from selenium,
we want to import webdriver.
3:52
We'll also want to import the time module,
to allow the page to fully load.
3:59
So next, we want to tell our
webdriver which browser to use.
4:05
I'm using Chrome, so I'll set that up,
4:09
Then we tell the driver
to go get our page.
4:16
Horse-land, back to index.html.
4:25
Let's have our script wait a few seconds,
before we process anything.
4:30
Just to give the JavaScript time to run,
and load the horse images on the page.
4:33
We do time.sleep, pass in 5,
that should give us plenty of time.
4:39
Now, we can utilize
BeautifulSoup to parse the page.
4:44
Let's just print out the HTML,
to see if we get the images.
4:47
Recall from an earlier video,
when we did this,
4:51
we just got an empty, unordered list.
4:54
Because BeautifulSoup doesn't wait for
JavaScript.
4:56
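A small sketch of why the list came back empty: BeautifulSoup parses the markup it is handed and never executes scripts. The markup and the `imageGallery` id below are hypothetical, chosen only to mimic a list that JavaScript fills in later.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the list stays empty until a browser runs the script.
STATIC_HTML = """
<ul id="imageGallery"></ul>
<script>
  // In a browser, code here would add list items with images.
</script>
"""

soup = BeautifulSoup(STATIC_HTML, "html.parser")
gallery = soup.find("ul", id="imageGallery")

# The script was never run, so no images exist in the parsed tree.
images = gallery.find_all("img")
print(len(images))
```

A browser driven by Selenium does run the script, which is why its page source contains the images.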
The driver object has
a property called page_source,
4:59
which gets us the source of
the page at the time it was read.
5:03
So we'll say page_html,
driver.page_source, and
5:07
we can use that with BeautifulSoup.
5:11
We'll pass in the page_html,
we'll use our html.parser again,
5:16
and we'll pretty-print our soup.
5:22
Then, we want to make
sure we close the driver.
5:28
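Put together, the Selenium steps above might look something like this sketch. It assumes Selenium and a Chrome driver are set up as described. The function names `fetch_rendered_html` and `pretty_html` are hypothetical helpers, not from the video, and `driver.quit()` is used here to end the whole browser session, whereas `driver.close()` only closes the current window.

```python
import time

from bs4 import BeautifulSoup


def fetch_rendered_html(url, wait_seconds=5):
    """Load url in Chrome, wait for scripts, and return the rendered HTML."""
    # Imported here so pretty_html below works without Selenium installed.
    from selenium import webdriver

    driver = webdriver.Chrome()
    try:
        driver.get(url)
        # Crude fixed wait that gives the page's JavaScript time to run.
        time.sleep(wait_seconds)
        return driver.page_source
    finally:
        # Always release the browser, even if loading the page fails.
        driver.quit()


def pretty_html(page_html):
    """Parse HTML text with BeautifulSoup and return it pretty-printed."""
    return BeautifulSoup(page_html, "html.parser").prettify()
```

Calling `print(pretty_html(fetch_rendered_html(url)))` with the Horse Land URL should mirror the steps in the video. The `try`/`finally` wrapper is a design choice that keeps stray browser windows from piling up when something goes wrong.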
And let's run our script, and there we go!
5:32
We see all of our images and page content.
5:42
We could now put our scraping skills
to use in many productive ways.
5:45