You have completed Preparing Data for Analysis!
Learn the different types of bad data and what they mean.
Terms
- Duplicates - repeated data
- Missing data - data labeled as unknown, NaN, or empty
- Formatting - misspellings, extra whitespace, differences after combining multiple datasets
- Type - data that is a different type than expected
- Nonsensical - data that does not make sense
- Saturated - data that is at the extremes of the measurement
- Confidential - personally identifiable information
- Individual Error - errors that affect a single value
- Systematic Error - errors that affect all or large portions of the data

### Further PII Resources:
- DOL PII
- EU GDPR
Hello again, let's take a look at this
dirty version of the Pokemon dataset.
0:00
You might be able to pick out
a few of the issues already, but
0:05
let's give them a name.
0:09
These are in no particular order.
0:10
First, duplicates.
0:13
There are two Blastoise
entries in the data set.
0:16
Duplicates can bias your results and
0:20
can also make the dataset take
up more space than it needs to.
0:23
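As a rough sketch of how repeated rows like the two Blastoise entries could be removed, the snippet below keeps the first copy of each identical row. The rows and column names are hypothetical, since the transcript does not show the dataset's actual columns, and it uses plain Python rather than any particular data library.

```python
# Hypothetical rows; the real dataset's columns are not shown in the transcript.
rows = [
    {"name": "Squirtle", "type": "Water"},
    {"name": "Blastoise", "type": "Water"},
    {"name": "Blastoise", "type": "Water"},
]


def drop_duplicates(records):
    """Keep the first copy of each identical row, preserving order."""
    seen = set()
    unique = []
    for record in records:
        # A sorted tuple of (column, value) pairs is a hashable fingerprint of the row.
        key = tuple(sorted(record.items()))
        if key not in seen:
            seen.add(key)
            unique.append(record)
    return unique


deduped = drop_duplicates(rows)
print(len(rows), "->", len(deduped))
```

This only catches rows that match exactly. Two entries that differ by a stray space or a misspelling would need the formatting fixes first.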
Next is missing data. On line 44,
0:28
Golbat has some values that are missing.
0:32
Instead of values there are empty cells or
even NaN,
0:37
or not a number,
like the one down here on line 150.
0:42
This is an instance where you
will need to make a decision.
0:48
Fortunately, Pokemon is pretty popular and
0:53
there's reference material that you can
consult to find the correct values.
0:55
The next might be a bit harder.
1:00
If a name were missing or unknown,
it might be more difficult
1:02
to find the matching Pokemon name using
the data from the rest of the columns.
1:08
Instead, you might need to exclude
this row in some cases because
1:14
of the missing information.
1:19
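Before deciding whether to look up a value or exclude a row, it helps to list which fields are missing. The sketch below is one possible check, assuming missing values show up as `None`, an empty cell, the text "NaN" or "unknown", or a float NaN. The Golbat values are made up for illustration.

```python
import math


def is_missing(value):
    """Treat None, blank strings, 'NaN'/'unknown' text, and float NaN as missing."""
    if value is None:
        return True
    if isinstance(value, float) and math.isnan(value):
        return True
    if isinstance(value, str) and value.strip().lower() in {"", "nan", "unknown"}:
        return True
    return False


def missing_fields(record):
    """Return the names of the fields in a row that have no usable value."""
    return [field for field, value in record.items() if is_missing(value)]


# Hypothetical row: the transcript does not say which Golbat values are missing.
golbat = {"name": "Golbat", "height": "", "weight": float("nan")}
print(missing_fields(golbat))  # ['height', 'weight']
```

A row where `name` itself appears in the result is a candidate for exclusion, because the remaining columns may not be enough to identify the Pokemon.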
Following that, formatting.
1:23
There are quite a few
examples in this dataset.
1:25
This includes misspellings,
excess whitespace, and
1:28
differences in how something was formatted
when two different data sets are combined.
1:31
In the first row here, ice is misspelled.
1:37
On line 57,
the weaknesses are separated by
1:40
a dash instead of a comma
like the rest of the rows.
1:45
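The separator and whitespace problems described above can be handled with a small normalizing function. This is a minimal sketch, assuming the weaknesses are stored as one text field and that none of the values legitimately contains a dash. The sample string is hypothetical.

```python
import re


def normalize_weaknesses(text):
    """Split on commas or dashes, trim whitespace, and rejoin with ', '."""
    parts = re.split(r"[,-]", text)
    return ", ".join(part.strip() for part in parts if part.strip())


print(normalize_weaknesses("Fire - Flying-Rock "))  # Fire, Flying, Rock
print(normalize_weaknesses("Fire, Flying"))         # Fire, Flying
```

Misspellings are harder to fix automatically. A common approach is to compare each value against a list of known valid values and review anything that does not match.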
After that, type. In row 38,
1:51
the height is written out as a string
1:59
instead of numerically. Rows 31,
2:04
26, and 10
2:11
all include the notation for
inches or pounds,
2:19
which makes these
entries non-numerical.
2:23
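One way to recover a number from an entry that carries a unit is to pull out the first numeric part of the text. The sketch below assumes entries look like a single number followed by a unit marker. The sample values are hypothetical, not taken from the dataset.

```python
import re

# Matches an optional minus sign, digits, and an optional decimal part.
NUMBER = re.compile(r"-?\d+(?:\.\d+)?")


def to_number(value):
    """Return the first number found in value as a float, or None if there is none."""
    if isinstance(value, (int, float)):
        return float(value)
    match = NUMBER.search(str(value))
    return float(match.group()) if match else None


print(to_number('28"'))       # 28.0
print(to_number("19.8 lbs"))  # 19.8
print(to_number("tall"))      # None
```

Because only the first number is kept, a combined entry such as feet and inches in one cell would lose the inches and needs its own handling. A height written out in words returns `None` and has to be fixed by hand.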
The last example in this
data set is nonsensical.
2:28
This includes data that
doesn't make sense.
2:33
On line 68, there's
2:36
a negative value for weight,
but weight cannot be negative.
2:41
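Nonsensical values can often be caught with a simple rule about what is physically possible. The sketch below flags rows whose weight is not a positive number. The column name `weight`, the sample rows, and the choice to treat zero as invalid are all assumptions.

```python
def invalid_weights(records):
    """Return (index, record) pairs whose weight is not a positive number."""
    flagged = []
    for index, record in enumerate(records):
        weight = record.get("weight")
        # NaN also fails the `> 0` comparison, so it is flagged too.
        if not (isinstance(weight, (int, float)) and weight > 0):
            flagged.append((index, record))
    return flagged


# Hypothetical rows for illustration.
sample = [
    {"name": "Pokemon A", "weight": 6.9},
    {"name": "Pokemon B", "weight": -13.0},
]
print(invalid_weights(sample))
```

The returned index makes it easy to go back to the source row and decide whether to correct or exclude it.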
The last two types of errors you may
encounter are called saturated and
2:47
confidential.
2:52
Saturated data are where the values are at
the extreme limits of the measurement.
2:54
For example, let's say there's
one of those "Your Speed (blank)"
3:00
signs that light up with your
car's speed as you go by.
3:05
But this sign only shows speeds
between 20 and 40 miles per hour.
3:09
If a car drove by faster
than 40 miles per hour,
3:16
the sign just flashes "Slow Down" instead
of showing the car's actual speed.
3:20
As a driver, you may not be receiving
any new reliable information
3:26
from the speed sign as you slow down or
speed up.
3:31
Similarly, when forming or
sharing your analysis,
3:35
saturated data can introduce
instability in your data sets.
3:38
In our Pokemon data set,
saturated data could be something
3:43
like measuring the Pokemon's height,
but with only a ruler, so
3:48
anything larger than 12 inches would just
have 12 inches listed as their height.
3:53
So all of these would be 12, 12, 12, etc.
3:59
You can imagine how this would skew
your analysis and your graphs.
4:06
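A quick way to spot possible saturation is to measure how many values sit exactly at the limit of the measuring tool. The sketch below uses the 12-inch ruler from the example. The list of heights is made up, and what share counts as suspicious is a judgment call.

```python
def saturation_share(values, upper_limit):
    """Fraction of values sitting at or above the instrument's upper limit."""
    if not values:
        return 0.0
    at_limit = sum(1 for value in values if value >= upper_limit)
    return at_limit / len(values)


# Hypothetical heights in inches, measured with a 12-inch ruler.
heights_in = [8, 12, 12, 12, 10, 12]
share = saturation_share(heights_in, upper_limit=12)
print(f"{share:.0%} of heights are at the ruler's limit")
```

A large share of values piled up at one extreme suggests the true values were cut off, so averages and charts based on them will understate the real range.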
Lastly, confidential data. Here, Pokemon
have addresses and credit card numbers.
4:13
I know it's silly,
just go with me on this.
4:19
Credit card information should never be
included as it is highly confidential
4:22
information.
4:26
You may also need to remove other
personally identifiable information or
4:28
PII from your table depending on who will
be viewing your analysis or data set.
4:34
PII may be necessary for a company's
internal use but you have an obligation
4:41
to protect this information if sharing
a data set or analysis publicly.
4:47
I'll post some resources in the teacher's
notes to dive into PII even more.
4:52
Sharing personally identifiable
information publicly and
4:58
without consent can have
legal ramifications.
5:02
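Before sharing a dataset, confidential columns can be dropped so they never leave the original file. This is a minimal sketch. The column names `address` and `credit_card` are assumptions, and which fields count as PII depends on the data and on the rules that apply to it.

```python
# Hypothetical column names; adjust to the fields that are confidential in your data.
PII_FIELDS = {"address", "credit_card"}


def strip_pii(records, pii_fields=PII_FIELDS):
    """Return copies of the rows with the confidential columns left out."""
    return [
        {field: value for field, value in record.items() if field not in pii_fields}
        for record in records
    ]


# Made-up row for illustration only.
private_rows = [
    {"name": "Pokemon A", "height": 16, "address": "1 Example Rd", "credit_card": "0000"},
]
public_rows = strip_pii(private_rows)
print(public_rows)  # [{'name': 'Pokemon A', 'height': 16}]
```

The function builds new rows instead of editing the originals, so the internal copy keeps its data while the shared copy does not.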
Another aspect to consider
when reviewing your data set's
5:07
errors is if they are individual errors or
systematic errors.
5:12
A single misspelling would
be an individual error,
5:18
it affects a single row or value.
5:22
A systematic error is one that affects
a large portion or even all of a data set.
5:25
For instance, if the ruler used to measure
the height was only 11 inches long instead
5:32
of 12, all of these heights would then be
incorrect, causing a systematic error.
5:39
This is why it's important to
understand your data set and
5:47
how the data was collected
as much as possible.
5:50