You have completed Introduction to Big Data!
Let's discuss the problems we are trying to solve with all these data needs.
Terms
- Sentiment Analysis -- The analysis of text (typically unstructured, such as tweets) to determine the emotion behind it.
- Cluster -- A group of computers arranged together logically to work more efficiently on tasks in parallel.
With an understanding of the importance
of big data, and a desire to learn
0:00
the paradigms and major tools for dealing
with it, you're now ready to tackle many
0:04
new problems you may have never dealt with
before across many different domains.
0:08
Let's take a moment to look in more detail
at some of these problems, just to get you
0:13
thinking about how great of a tool
big data can be in your solutions.
0:17
We often want to store large amounts
of data for various reasons.
0:21
For instance, maybe your company processes
large amounts of credit card transactions,
0:25
and you want to store them for
fraud detection.
0:29
Maybe your side project requires you to
store tweets for sentiment analysis like,
0:33
is this a happy tweet or an angry one?
0:37
Or perhaps your school project
requires that you find and
0:39
rank related articles from
Wikipedia based on relevancy.
0:42
Storing large amounts of data is hard
to do in memory on one machine;
0:47
you can't just store a 100 gigabyte data
set in RAM on your typical laptop.
0:51
Keeping that much data on your hard disk
0:56
means the code that you write
has to process all that data.
0:58
It also has to be able to read
it efficiently and all at once.
1:02
As you might imagine,
1:05
this is something that is hard to
write well, especially from scratch.
1:06
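To make the memory problem concrete, here is a minimal sketch of one common workaround on a single machine: reading records one at a time instead of loading the whole data set into RAM. The column name `amount`, the function `total_amount`, and the tiny in-memory sample are all assumptions made for illustration; a real transactions file would be opened from disk and would be far larger.

```python
import csv
import io
from typing import TextIO


def total_amount(lines: TextIO, column: str = "amount") -> float:
    """Sum one numeric column while holding only one row in memory at a time."""
    reader = csv.DictReader(lines)  # reads lazily, row by row
    total = 0.0
    for row in reader:
        total += float(row[column])
    return total


# Hypothetical stand-in for a very large transactions file on disk.
sample = io.StringIO("id,amount\n1,10.50\n2,4.25\n3,100.00\n")
result = total_amount(sample)
print(result)
```

This keeps memory use flat, but it still runs on one machine and reads the data sequentially, which is the limitation that big data tools are meant to remove.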
Searching through lots of data
introduces several problems.
1:11
Think a minute here about search
bars on your most used applications,
1:15
like LinkedIn, Facebook, or Twitter.
1:18
There is a lot of data in
those tiny little search
1:21
bars that you need to search through.
1:24
So first, you have to index
the data into search terms, and
1:26
then surface it quickly enough for
users, so
1:29
that they don't notice too much latency or
delay in their request.
1:32
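As a rough illustration of what "indexing the data into search terms" can mean, the sketch below builds a tiny inverted index: a mapping from each term to the set of documents that contain it. The function names `build_index` and `search`, and the sample profiles, are hypothetical. Real search systems add tokenization, ranking, and distribution across machines on top of this idea.

```python
from collections import defaultdict


def build_index(docs: dict[int, str]) -> dict[str, set[int]]:
    """Map each lowercase term to the ids of the documents containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index


def search(index: dict[str, set[int]], query: str) -> set[int]:
    """Return ids of documents that contain every term in the query."""
    terms = query.lower().split()
    if not terms:
        return set()
    results = set(index.get(terms[0], set()))
    for term in terms[1:]:
        results &= index.get(term, set())
    return results


# Hypothetical sample data standing in for user profiles.
profiles = {1: "data engineer", 2: "data scientist", 3: "software engineer"}
index = build_index(profiles)
print(sorted(search(index, "data engineer")))
```

Because the lookup only touches the entries for the query terms, it avoids scanning every document on every request, which is what keeps latency low.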
You also need to make sure that
your data is stored consistently,
1:36
otherwise the results will be wrong for
each different request.
1:39
Now, searching is typically
spread across many machines.
1:43
So you need tools to ingest the new data
that will update the search indexes, so
1:46
that your query systems get
the most up to date data.
1:51
Another common problem is that we need
to process large amounts of incoming or
1:55
streaming data.
1:59
Now, for instance, imagine a power company
that has thousands of sensors in their
2:00
power stations, distributed
across large geographic regions.
2:04
They need to be able to
ingest all that new data,
2:08
which could be in any number of different
units, as well as different formats.
2:11
They will use that data
to detect anomalies
2:15
that could indicate failures or surges.
2:18
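The sensor scenario can be sketched in a few lines. The example below assumes readings arrive as `(value, unit)` pairs, converts them all to kilowatts, and flags any reading that is far from the average of the last few readings. The unit table, the window size, and the three-standard-deviation threshold are illustrative assumptions, not how any particular power company does it.

```python
from collections import deque
from statistics import mean, stdev


def to_kilowatts(value: float, unit: str) -> float:
    """Normalize a reading to kilowatts (assumed unit table)."""
    factors = {"W": 0.001, "kW": 1.0, "MW": 1000.0}
    return value * factors[unit]


def detect_anomalies(readings, window: int = 5, threshold: float = 3.0):
    """Flag readings more than `threshold` standard deviations away from
    the mean of the previous `window` readings."""
    recent = deque(maxlen=window)  # only the last `window` values are kept
    anomalies = []
    for value, unit in readings:
        kw = to_kilowatts(value, unit)
        if len(recent) == window:
            mu = mean(recent)
            sigma = stdev(recent)
            if sigma > 0 and abs(kw - mu) > threshold * sigma:
                anomalies.append(kw)
        recent.append(kw)
    return anomalies


# Hypothetical stream of readings in mixed units, with one surge.
readings = [
    (100.0, "kW"), (101000.0, "W"), (99.0, "kW"), (0.1, "MW"),
    (102.0, "kW"), (500.0, "kW"), (100.0, "kW"),
]
print(detect_anomalies(readings))
```

The function looks at each reading once and keeps only a small window in memory, so it can run on data as it arrives rather than waiting for a complete data set.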
Social media applications like
Facebook need to be able to process
2:21
actions from users quickly,
and send out notifications.
2:24
As a Facebook user, you need to know
immediately when you get that like.
2:28
I mean, it's like,
why you posted it, right?
2:32
They don't want you feeling like,
no one likes me?
2:35
That validation needs
to be almost immediate.
2:38
Netflix, Amazon, and Hulu all want to be
able to process your movie choices and
2:41
provide specific
recommendations in real time,
2:46
when you need them, with the latest
versions of their video catalog.
2:49
Cyber security companies want to be
able to ingest customers' logs, and
2:53
tell the customer whether they've
been potentially compromised.
2:57
Minutes matter here, and the wrong
tools will provide answers far too
3:01
slowly to limit the magnitude
of the possible attack.
3:04
To solve the problems,
we've referenced the need to use many
3:08
machines to do both the data
processing and the storage.
3:11
Now, in general, this is another problem
presented to us in the realm of big data.
3:14
To store, process, and recall information
from large and complex data sets,
3:19
it's almost always a necessity to
have more than one computer, or
3:24
relatively small size server,
to handle the data.
3:28
When you start to have data spread across,
potentially,
3:32
many machines, you need to have tools
that abstract away the management and
3:34
workflow needed to use multiple machines.
3:38
As we'll learn about,
almost all big data tools and
3:42
systems are built for running across
large groups, or clusters, of machines.
3:44
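One small piece of what these tools abstract away is deciding which machine handles which record. A minimal sketch, assuming records are `(key, value)` pairs and machines are simply numbered from 0: hash each key and take the remainder. The names `assign_machine` and `partition` are hypothetical, and real cluster tools also handle machine failures, rebalancing, and network transfer.

```python
import zlib
from collections import defaultdict


def assign_machine(key: str, num_machines: int) -> int:
    """Pick a machine for a key using a stable hash, so the same key
    always lands on the same machine."""
    return zlib.crc32(key.encode("utf-8")) % num_machines


def partition(records, num_machines: int) -> dict[int, list]:
    """Group (key, value) records by the machine assigned to each key."""
    shards = defaultdict(list)
    for key, value in records:
        shards[assign_machine(key, num_machines)].append((key, value))
    return dict(shards)


# Hypothetical records keyed by user name.
records = [("alice", 1), ("bob", 2), ("alice", 3), ("carol", 4)]
shards = partition(records, num_machines=3)
for machine, items in sorted(shards.items()):
    print(machine, items)
```

Keeping all records for one key on one machine means work such as per-user totals can be done locally, without machines having to exchange data.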
Now that we have an idea of the new
problems for big data, let's take a look
3:50
at how they are being solved by some
of the most popular tools out there.
3:54