You have completed Scraping Data From the Web!
Let's take a brief look at how an HTML page is structured so we can better understand how to navigate a page for web scraping.
- Horse Land web site
- Horse Land site source code
Before we jump into Python and
start wrangling data from a web page,
0:00
I think it will be helpful to revisit
what a web page looks like in code.
0:03
How is a web page structured,
or more specifically,
0:07
how should a web page be structured?
0:10
In your journey with web scraping,
you'll likely come across a site or
0:13
two where you ask, hold your horses,
why aren't any of the tags closed?
0:17
Or, seriously,
there are five h1 tags on this page?
0:21
If we look at how an HTML page
should be structured, it starts and
0:26
ends with an opening and closing html tag.
0:30
Inside the html tag,
we have a head section which has tags for
0:33
metadata about the page and other
essential information for the document.
0:38
The page title will be
found in here as well.
0:41
Next, we have the body section where
the content of the page is found.
0:44
Inside here is where we'll do
the majority of our scraping.
0:48
Things like heading tags,
div, paragraph, anchor, and
0:51
form elements will reside inside here.
0:56
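As a rough sketch of that layout, the snippet below parses a tiny page and lists which tags open inside the head and which open inside the body. The page is a made-up example, not the Horse Land source, and the parser is Python's standard-library html.parser rather than Beautiful Soup, which comes later in the course. The names SAMPLE_PAGE and OutlineParser are hypothetical.

```python
from html.parser import HTMLParser

# Hypothetical minimal page (an assumption for illustration only).
SAMPLE_PAGE = """<!DOCTYPE html>
<html>
  <head>
    <meta charset="utf-8">
    <title>Sample Page</title>
  </head>
  <body>
    <h1>Welcome</h1>
    <div>
      <p>Some text with a <a href="/more">link</a>.</p>
    </div>
  </body>
</html>"""


class OutlineParser(HTMLParser):
    """Records which tags open inside <head> and inside <body>."""

    def __init__(self):
        super().__init__()
        self.section = None  # "head", "body", or None when outside both
        self.tags = {"head": [], "body": []}

    def handle_starttag(self, tag, attrs):
        if tag in ("head", "body"):
            self.section = tag
        elif self.section is not None:
            self.tags[self.section].append(tag)

    def handle_endtag(self, tag):
        if tag in ("head", "body"):
            self.section = None


outline = OutlineParser()
outline.feed(SAMPLE_PAGE)
outline.close()
print(outline.tags["head"])  # ['meta', 'title']
print(outline.tags["body"])  # ['h1', 'div', 'p', 'a']
```

The metadata and title land in the head list, while the heading, div, paragraph, and anchor tags land in the body list.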
I mentioned that structure
is how a page should look,
0:59
but sometimes reality is different.
1:02
Let's take a look at how
leniently HTML can be written and
1:04
still look good in the browser.
1:07
This will point out some
of the challenges and
1:09
benefits that we can run into
when attempting to scrape a site.
1:11
Let's take a look at a sample
website that the amazing design team
1:16
here at Treehouse put together.
1:19
It's hosted on GitHub Pages,
which is great
1:21
because it allows us to view the site and
easily see the HTML code.
1:24
Check the teacher's notes for the link.
1:29
I'm using the Chrome browser, and
1:32
if we open up the developer tools
with Option+Cmd+I on a Mac, or
1:33
Ctrl+Shift+I on Windows,
we can examine the structure of our page.
1:37
Here at the top,
we see the head section, and
1:43
can expand that to see that
it contains a few things.
1:45
There's some metadata, there are links
to our style sheet and fonts, and
1:48
there it is, our page title.
1:52
We'll see how to scrape that
information in code here shortly.
1:54
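Here is one possible sketch of pulling a page title out of a head section. It uses the standard-library html.parser as a stand-in for the Beautiful Soup approach the course covers later. The markup, stylesheet path, and title text in HEAD_SNIPPET are invented for illustration, and TitleParser is a hypothetical helper.

```python
from html.parser import HTMLParser

# Hypothetical head section; the stylesheet path and title are assumptions.
HEAD_SNIPPET = (
    '<head>'
    '<meta charset="utf-8">'
    '<link rel="stylesheet" href="css/main.css">'
    '<title>Example Title</title>'
    '</head>'
)


class TitleParser(HTMLParser):
    """Collects the text found between <title> and </title>."""

    def __init__(self):
        super().__init__()
        self.in_title = False
        self.title = ""

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self.in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self.in_title = False

    def handle_data(self, data):
        # Only keep text that appears while we are inside the title tag.
        if self.in_title:
            self.title += data


title_parser = TitleParser()
title_parser.feed(HEAD_SNIPPET)
title_parser.close()
print(title_parser.title)  # Example Title
```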
The body section is where, as I mentioned,
1:59
we'll find most of the interesting
items we'll want to scrape.
2:01
We see that we have a few different
div elements that separate the page
2:04
into different logical components.
2:08
These include the graphical header,
our featured image,
2:11
and, down here,
the links at the bottom of the page.
2:15
The main portion of this particular
webpage is the list of horses with
2:17
the images.
2:22
We see here in the HTML that they all
reside here in this unordered list
2:23
section, with the imageGallery ID and
card-wrap class.
2:28
If we expand this section,
we see a bunch of list items.
2:33
These look like potential
scraping targets, and
2:37
we'll explore them more specifically
later in the course.
2:39
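To make the idea of a scraping target concrete, the sketch below finds an unordered list by its ID and collects something from each list item. Only the imageGallery ID and card-wrap class come from the page described above. The list item contents, image paths, alt text, and the second list are assumptions, and GalleryParser is a hypothetical helper built on the standard-library html.parser.

```python
from html.parser import HTMLParser

# Hypothetical markup: only the id and class on the first <ul> come from
# the transcript; everything inside the lists is made up.
GALLERY_SNIPPET = """
<ul id="imageGallery" class="card-wrap">
  <li><img src="img/horse-1.jpg" alt="First horse"></li>
  <li><img src="img/horse-2.jpg" alt="Second horse"></li>
</ul>
<ul class="footer-links">
  <li><img src="img/logo.png" alt="Not a horse"></li>
</ul>
"""


class GalleryParser(HTMLParser):
    """Collects alt text for images inside the <ul> with a given id."""

    def __init__(self, gallery_id):
        super().__init__()
        self.gallery_id = gallery_id
        self.in_gallery = False
        self.image_alts = []

    def handle_starttag(self, tag, attrs):
        attr_map = dict(attrs)
        if tag == "ul" and attr_map.get("id") == self.gallery_id:
            self.in_gallery = True
        elif tag == "img" and self.in_gallery:
            self.image_alts.append(attr_map.get("alt"))

    def handle_endtag(self, tag):
        # Simplification: assumes the gallery contains no nested lists.
        if tag == "ul":
            self.in_gallery = False


gallery = GalleryParser("imageGallery")
gallery.feed(GALLERY_SNIPPET)
gallery.close()
print(gallery.image_alts)  # ['First horse', 'Second horse']
```

Matching on the ID keeps the image in the second list out of the results, which is why IDs and classes make convenient anchors when scraping.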
One thing I do want to mention here is
that modern web browsers can hide a lot of
2:43
HTML errors for us.
2:48
Inline elements such as span, and some
block-level elements such as paragraph
2:49
tags may not be closed in the actual HTML,
but the browser closes them for us.
2:55
If we take a look here,
we see this paragraph here at the bottom.
3:01
We see that it has a class of credits,
and there's an opening and closing p tag.
3:07
However, if we look at the source code for
this file on GitHub,
3:11
that's down here under index.html.
3:15
So in here,
we scroll down to the bottom of the page.
3:19
We see the opening p tag on line 43, but
3:23
there isn't a closing tag when
this paragraph ends on line 46.
3:25
In this case, the browser helps us out for
web scraping tasks.
3:30
Fortunately, HTML doesn't
have to be perfect.
3:34
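A small sketch can show what an unclosed paragraph looks like to a parser. The snippet below feeds a paragraph with a credits class and no closing p tag to the standard-library html.parser and compares the opened and closed tags. The paragraph text and the TagBalance helper are hypothetical, and Beautiful Soup may report things differently depending on the parser it is configured with.

```python
from html.parser import HTMLParser

# Hypothetical fragment: the credits class comes from the transcript,
# the paragraph text is made up. Note the missing </p>.
UNCLOSED_SNIPPET = '<body><p class="credits">Photos are placeholders.</body>'


class TagBalance(HTMLParser):
    """Tracks opened tags, closed tags, and any non-blank text."""

    def __init__(self):
        super().__init__()
        self.opened = []
        self.closed = []
        self.text = []

    def handle_starttag(self, tag, attrs):
        self.opened.append(tag)

    def handle_endtag(self, tag):
        self.closed.append(tag)

    def handle_data(self, data):
        if data.strip():
            self.text.append(data.strip())


balance = TagBalance()
balance.feed(UNCLOSED_SNIPPET)
balance.close()

unclosed = [tag for tag in balance.opened if tag not in balance.closed]
print(unclosed)      # ['p']
print(balance.text)  # ['Photos are placeholders.']
```

The parser never sees a closing p tag, yet the paragraph text still comes through, so the content remains reachable even though the markup is imperfect.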
With some web page anatomy under our
belts, let's take a quick pit stop before
3:37
we get started with some scraping tasks
with the Python package, Beautiful Soup.
3:42