0% found this document useful (0 votes)
5 views3 pages

Video 3 What Is Data

Download as docx, pdf, or txt
Download as docx, pdf, or txt
Download as docx, pdf, or txt
You are on page 1/ 3

Let's take a look. Let's look at an example of a table of data, which we also call a data set.

If you're
trying to figure out how to price houses that you're trying to buy or sell, you might collect a data set
like this. And can be just like an Excel spreadsheet of data, where one column is the size of the
house, say in square feet or square meters, and the second column is the price of the house. And
so, if you're trying to build an AI system or machine learning system to help you set prices for
houses, or figure out if a house is priced appropriately. You might decide that the size of the house
is A and the price of the house is B, and have an AI system learn this input to output or A to B
mapping. Now, rather than just pricing a house based on the size, you might say, well, let's also
collect data on the number of bedrooms of this house. In that case, A can be both of these first two
columns, and B can be just the price of the house. So given a table of data, given a data set, it's
actually up to you, up to your business use case to decide what is A and what is B. Data is often
unique to your business. And this is an example of a data set that a real estate agency might have
if they're trying to help price houses. And it's up to you to decide, what is A and what is B, and how
to choose these definitions of A and B to make it valuable for your business. As another example, if
you have a certain budget and you want to decide what is the size of house you can afford, then
you might decide that the input A is how much does someone spend, and B is just the size of the
hulls in square feet. And that would be a totally different choice of A and B that tells you, given a
certain budget, what's the size of the hulls you should be maybe looking at. Here's another example
of a data set. Let's say that you want to build an AI system to recognize cats in pictures, I'm not
sure why you might want to do that, but maybe the fun mobile app and you want to tag all the
pictures of cats. So you might collect a data set where the input A is set of different images and the
output B are labels that says, first picture is a cat, there's not a cat, there's a cat, there's not a cat.
And have an AI input a picture A and output B, is it a cat or not? So you can tag all the cat pictures
on your photo feed or on your mobile app. In machine learning tradition, there's actually a lot of cats
in machine learning. I think some of this started when I was leading the Google Brain team and we
published the results with a somewhat infamous Google cat, where an AI system learned to detect
cats from watching YouTube videos. But since then, there's been a tradition of using cats as a
running example when talking about machine learning. With apologies to all the dog lovers out
there, I love dogs too. So data is important, but how do you get data? How do you acquire data?
Well, one way to get data is manual labeling. For example, you might collect a set of pictures like
these over here, and then you might either yourself or have someone else go through these pictures
and label each of them. So the first one is a cat, second one is not a cat, third one is a cat, fourth
one is not a cat. And by manually labeling each of these images, you now have a data set for
building a cat detector. To do that, you actually need more than four pictures, you might need
hundreds of thousands of pictures. But manual labeling is a tried and true way of getting a data set
where you have both A and B. Another way to get a data set is from observing user behaviors or
other types of behaviors. So, for example, let's say you run a website that sells things online. So an
ecommerce or an electronic commerce website, where you offer things to users at different prices,
and you can just observe if they buy your product or not. So just through the act of either buying or
not buying your product, you may be able to collect a data set like this. We can store the user ID,
the time the user visit your website, the price you offer the product to the users, as well as whether
or not they purchased it. And so, just by using your website, users can generate this data from you.
This was an example of observing user behaviors, we can also observe behaviors of other things,
such as machines. If you run a large machine in a factory, and you want to predict if a machine is
about to fail or have a fault, then just by observing the behavior of a machine, you can then record a
data set like this. There's a machine ID, there's the temperature of the machine, there's a pressure
within the machine, and then did the machine fail or not? And if your application prevents the
maintenance, say you want to figure out if a machine is about to fail. Then you could, for example,
choose this as the input A and choose that as the output B to try to figure out if a machine is about
to fail, in which case, you might do preventative maintenance on the machine. The third and very
common way of acquiring data is to download it from a website or to get it from a partner. Thanks
to the open Internet, there's just so many data sets that you can download freely, ranging from
computer vision or image data sets, to self-driving car data sets, to speech recognition data sets, to
medical imaging data sets, to many, many more. And so, if your application needs a type of data
you just download off the web, keeping in mind licensing and copyright, then that could be a great
way to get started on an application. And finally, if you work with a partner, say you're working with a
factory, then they may already have collected a big data set of machines and temperatures and
pressure. And did the machine stay or not that they could give to you? Data is important, but is also
a little bit overhyped and sometimes misused. Let me describe to you two of the most common
misuses or the bad ways of thinking about data. When I speak with CEOs of large companies, a
few of them have actually said to me, hey, Andrew, give me three years to build up my IT team,
we're collecting so much data. And then after three years, I'll have this perfect data set and then
we'll do AI then, it turns out that's a really bad strategy. Instead, what I recommend to every
company is, once you started collecting some data, go ahead and start showing it or feeding it to an
AI team. Because often the AI team can give feedback to your IT team on what types of data to
collect and what types of IT infrastructure to keep on building. For example, maybe an AI team can
look at your factory data and say, hey, you know what? If you can collect data from this big
manufacturing machine, not just once every ten minutes, but instead once every one minute, then
we could do a much better job building a preventative maintenance systems for you. So there's
often this interplay of this back and forth between IT and AI teams. And my advice is usually, try to
get feedback from AI earlier because it can help you guide the development of your IT
infrastructure. Second, misuse of data. Unfortunately, I've seen some CEOs read about the
importance of data in the news and then say, hey, I have so much data, surely an AI team can
make it valuable. And unfortunately, this doesn't always work out. More data is usually better than
less data, but I wouldn't take it for granted just because you have many terabytes or gigabytes of
data that an AI team can magically make that valuable. So my advice is, don't throw data in AI team
and assume it will be valuable. In fact, in one extreme case, I saw one company go and acquire a
whole string of other companies in medicine on the thesis on the hypothesis that their data would be
very valuable. And now, a couple of years later, as far as I know, the engineers have not yet figured
out how to take all this data and actually create value out of it. So sometimes it works and
sometimes it doesn't, but I would not over invest in just acquiring data for the sake of data until
unless you also get an AI team to take a look at it. Because they can help guide you to think through
what is the data that is actually the most valuable. Finally, data is messy, you may have heard the
phrase garbage in, garbage out. And if you have bad data, then the AI will learn inaccurate things.
Here are some examples of data problems. Let's say you have this data set of sizes of houses,
number of bedrooms, and the price, you can have incorrect labels or just incorrect data. For
example, this house is probably not going to sell for $0.001 thousand, just for $1. Or data can also
have missing values, such as, we have here, a whole bunch of unknown values. And so, your AI
team will need to figure out how to clean up the data, or how to deal with these incorrect labels and
or missing values. And there are also multiple types of data, for example, sometimes you hear
about images, audio, and text. These are types of data that humans find it very easy to interpret.
There's a term for this, this is called unstructured data. And there's certain types of AI techniques
that could work with images to recognize cats, or audio to recognize speech, or text to understand if
an email is spam. And then there are also data sets like the one on the right. This is an example of
structured data, and that basically means data that lives in a giant spreadsheet. And the techniques
for dealing with unstructured data are a little bit different than the techniques for dealing with
structured data. Germs of AI today is used primarily to generate unstructured data, such as text,
images, and audio, rather than structured data. In contrast, supervised learning can work very well
for both of these types of data, unstructured data and structured data. In this video, you learned
what is data, and you also saw how not to misuse data, for example, by overinvesting in an IT
infrastructure in the hope that it will be useful for AI in the future. But without actually checking that it
really will be useful for the AI applications you want to build. And finally, you saw data is messy,
but a good AI team will be able to help you deal with all of these problems. Now, AI has a
complicated terminology where people throw around terms like AI, machine learning, data science.
What I want to do in the next video is share with you what these terms actually mean, so that you'll
be able to confidently and accurately talk about these concepts with others. Let's go on to the next
video.

You might also like