ELK 1 5 - Mapping Index Data
This lecture is designed to be one of the final lectures of our discussion on ES specifically, but it's really something that's going to bridge the gap from ES to Logstash, which is the next section.
Now I’m going to begin this section with a little bit of a hot take, something that may be a little controversial, and it relates to the proliferation of what we would call NoSQL databases.
So of course with SQL or relational databases you have to explicitly define your data
structure ahead of time, before you start putting data into it.
The flip side of that is a NoSQL database, maybe something like MongoDB, where you're actually just jamming data into the database in an unstructured way and sorting things out on the application end.
Now there are some benefits to using these NoSQL databases and especially during
development or testing, they certainly make sense. And we've been kind of treating ES
this way thus far, because we've just been dynamically inserting data into it, without a
lot of specific structure.
However, I don't necessarily believe that's the appropriate path forward for our application, or for the applications of DFIR and NSM. As a matter of fact, I would go so far as to say (and this is the hot take) that not knowing your data is lazy, and there's not really an excuse for it. Most of the time, the evidence-based data we're indexing has some sort of definable structure that we can parse.
Now, that parsing may be a little difficult from time to time, but we should be able to identify clear fields, and identifying those fields and mapping them to specific data types actually opens up some additional windows for how we search that data on the investigative side of things.
We saw an example of this in a previous section when I mapped a field to the IP type, which allowed us to perform searches on it using CIDR notation. That opened up many more doors than would have been opened had we just indexed IP addresses as text, where we couldn't perform the appropriate operations on them.
So knowing your data is important, and it paves an important road forward. For the rest of this course we're really going to focus on making sure our data is structured, and that involves coordination between the Logstash end of things, which we haven't gotten to yet, and ES, which is receiving the data that will be indexed.
So I want to start with the ES side here, of course. We're going to talk about how to properly structure your data and how to map specific data types to the specific feature sets you may want.
Thus far, everything we've done with ES has been in its default configuration, where our indexes are configured for dynamic mapping. What that means is, for example, I could feed it a piece of data; let's call it an IP address field and say the value is 192.168.1.100.
When I send that data to ES to be indexed, it's going to dynamically map it: it's going to use a series of internal configurations to decide what it thinks that data is, and by default it's going to map an IP address to text. Of course, when it's mapped as text, we can't perform all those advanced operations we'd like to: we can't search it based upon CIDR notation, we can't perform mathematical operations against it, and so on.
So that's a limitation, but this is the default mode. Now, this is of course a lot more flexible; it's a lot easier to get data in because we don't have to define mappings ahead of time. But this is maybe not the best choice for our purposes in DFIR, where we actually know the data types we want.
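As a quick sketch of what dynamic mapping looks like in the console (the index, type, and field names here are illustrative, and the request syntax follows the type-based ES version used in this course):

```json
PUT test/logs/1
{
  "SRCIP": "192.168.1.100"
}

GET test/_mapping
```

Even though the value is clearly an IP address, the returned mapping will show that dynamic mapping typed SRCIP as text, which is exactly the limitation described above.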
Instead of letting ES make the decision for us on how things are mapped, we can actually use explicit mapping, and that's what's shown here.
In this case we've fed 192.168.1.100 into our index, and we have a mapping we've already set up that says: for this field name, map this to the IP address data type.
So that is mapped as an IP, and we can do the things we want to with it. The big difference here: with dynamic mapping, which is the default, I can feed in whatever data I want and ES is going to take it and index it. With explicit mapping (which is a configuration variable we have to change, and I’ll show you how to do that), we have to define the data mapping ahead of time, and any data that does not conform to it gets rejected.
We kind of have a sense of filtering our data based upon meeting a certain set of criteria, which I think is more ideal for our use case here.
There are a lot of different types available in ES, and a lot of them will be familiar if you've done any type of coding or programming.
We have string-based types, the text and keyword types; we have numeric types such as integer, long, short, and float; and then we have a few more specific types: the date type, the boolean (yes or no, true or false), binary for binary data that might be inserted into the ES database, geo_point for referencing geographic points on a map by latitude and longitude, and of course the IP address data type we've talked about thus far.
So there are a whole lot of these. The ES documentation lists every one available, and there are some more complex ones too: you can store arrays as well (an array is a more complex data type), there's a concept of metadata fields, and so on.
So we're going to go through several of these here, especially some of the more basic and common ones, and you'll see us experiment with more of them as we go along in the course.
I don't want to go through every single one of them one by one, because that would be fairly boring, so we'll cover a lot of them in the context we're actually using them in throughout the rest of the course.
Now let's take a look at mapping some structured data within the ES console we're
already using.
Now, in this case I’ve gone ahead and deleted any existing indexes we've been using prior; I want to start from scratch, and we're going to use a DFIR-specific example here.
Now what I’m going to do is issue a PUT request for the DFIR index and the proxy type. We're going to be using proxy data here, and we're going to assign an ID to this.
Then I’m going to index this data, which looks like pretty standard proxy stuff: we have a date field, a source IP, a destination IP, an HTTP response code, an HTTP method, a URI, and a content type.
So it looks like this is a pretty standard record for liveupdate.symantecliveupdate.com/minitri.flg, which is text content that Symantec, the antivirus, uses to test whether it needs to update its antivirus definitions. Pretty straightforward.
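Roughly, the request looks like this. The URI value comes from the lecture, but the other field values are placeholders I've filled in for illustration; note that the response code is sent as a string here, which matters for how ES dynamically maps it:

```json
PUT dfir/proxy/1
{
  "date": "2017-04-03T10:15:00",
  "SRCIP": "192.168.1.100",
  "DSTIP": "203.0.113.10",
  "response": "200",
  "method": "GET",
  "URI": "liveupdate.symantecliveupdate.com/minitri.flg",
  "contenttype": "text/plain"
}
```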
So I’m going to go ahead and hit play and index this data; it looks like it was successful.
Now I’m going to get rid of some of this extra stuff here, change this to a GET, and get the structure of our DFIR index. This is not a screen we've spent a ton of time looking at, but I do want to look at it now.
Now what we've got here at the top level is just our DFIR index and I can collapse that
and it collapses everything.
We've got a couple of other sections here I want to point out. This is really focused in two sections: we have mappings and we have settings. Of course, settings are those things related to the specific settings of the index, things we haven't really talked about and aren't going to talk about much in this course, like shards and replicas.
However, under the mappings section we do have very relevant data, and I want to look at that a little bit here. First of all we have proxy, which is the type we've specified; mappings are often defined on a per-type basis.
And then from there we have our actual mappings, and they're not in the order we put them in; here they're actually alphabetical, starting with contenttype, and it looks like the type it is mapped to is text. Below this we have date, which is a field we used, and the type for that is simply date.
Those are probably expected: the content type is going to be text, the date is going to be date. Below here, however, we have DSTIP, and notice DSTIP is mapped to ‘type: text’. Well, we don't want it to be mapped to text; we want to be able to search for it based upon CIDR notation, perform math on it, and so on.
We actually want that to be mapped to the IP type, but unfortunately ES's dynamic mapping didn't do that, so that's something we want to correct. That's DSTIP; of course, if we go to SRCIP we have the same thing: it was mapped to the type text.
Otherwise, things are mostly what we would expect to see. Method is mapped to text. Response is mapped to text, but we probably want to map that as an integer, because those response codes are three-digit numbers, an HTTP 200 or a 404, and we might want to perform integer-based operations on them.
And then finally we have URI, which was mapped as text as well. Now, we have a couple of other configuration parameters here, ignore_above and so on; I don't want to get into those just yet, but you get the gist of this: we're performing very basic mappings wherein we specify the field name and map it to a specific data type.
And this was all done automatically by ES. Just by indexing the data, it made a best-guess effort at dynamic mapping, setting the data up how it thinks we want to look at it.
However, there are a few things we want to change, so that's what we're going to do
next.
So I’m actually going to go ahead and delete the DFIR index, and we're going to start from scratch. Now, you might ask yourself: can't we just update the mappings or change them? You can, but not easily. Updating existing mappings requires re-indexing. Think of it this way: the data has already been indexed as a specific type, and if it's text-based and you want to convert it to an integer, all of that data has to be re-examined.
So the re-indexing has to occur and there's a little bit of trickery you can do, where you
can re-index data without downtime, and there are some great guides and things on
that out there.
We're not going to cover that here; we're going to start with the simpler scenario, where we're just defining our mappings from the outset, as we create a brand new index with specific mappings already defined.
Previously, to create a new index, we've just indexed data and that automatically created the index for us. We're not going to do that here; we're going to create the index in earnest on its own, with mappings but no particular data.
To do that, we're going to get ready to create our index, and we're going to start with the ‘mappings’ object.
So we're using the same JSON notation that we've been using all along.
We're going to focus here on this mappings object. From here, I’m going to specify the type for which I want to create mappings; remember, this is based on specific types within the index.
I’m going to specify the ‘proxy’ type, since that's the type we're dealing with here. Then I’m going to input ‘properties’, which is what we're actually configuring: the properties of that type. That's where we're going to define the specific fields we want. So we should be ready to go, and with this structure we can now start defining those properties.
So I’m going to start with the ‘date’ field, which gets its own object here, and we're going to say ‘type’ for ‘date’ and set that to date; pretty straightforward. Then we'll close that object.
So that's pretty much it; that's all we're going to input. There are more parameters you can put in here, but we're simply saying that the field named date has a type of date.
So let's go ahead and try a couple more of these. We're just going to put a comma to separate our objects, and next we're going to do SRCIP, which we want to map to the type IP.
We can do that again with DSTIP; we're going to map it to IP as well. We have a couple more of these, so let's go ahead and knock those out: we have response, method, URI, and ‘contenttype’.
Let's put our commas in so we're properly formatted, and see what we've got. We've got response, that's the response code; I’m going to map that to an integer, since it's generally just a three-digit number. Method is going to be text, URI is also going to be text, and contenttype we'll make text as well.
So everything looks fine there. Again, we're creating a new index called DFIR, and we're specifying that we're going to define the mappings when we create the index: for the proxy type, and for the properties of that type, we're simply defining our fields with our mappings. Pretty straightforward, pretty easy. I hit play, it was acknowledged, and it looks like everything was created as we expected.
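Assembled in one place, the create-index request we just built looks like this (using the type-based mapping syntax of the ES version in this course):

```json
PUT dfir
{
  "mappings": {
    "proxy": {
      "properties": {
        "date":        { "type": "date" },
        "SRCIP":       { "type": "ip" },
        "DSTIP":       { "type": "ip" },
        "response":    { "type": "integer" },
        "method":      { "type": "text" },
        "URI":         { "type": "text" },
        "contenttype": { "type": "text" }
      }
    }
  }
}
```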
So let's go ahead and GET the DFIR index again and hit play, and sure enough, we see our mappings listed right here. We have mappings, then proxy, and this is really just a repeat of the same thing we just input.
It's broken out a little differently (these objects are not all on one line), but this is exactly what we entered into the properties of the proxy type within the DFIR index.
So we should be good to go, good to start indexing data based upon the mappings we've created. Let's try to index data now into this index that's already created.
I’m going to paste in the same data we looked at earlier; this is the exact same data we indexed before, but now we're indexing it into an already existing index with mappings, as opposed to letting ES do a dynamic mapping.
Now, when I hit play here, it looks like this was successful and everything's fine, and that makes sense, because everything matches up. But I want to show you something of interest. Let's take this SRCIP field and enter a new record: we'll change the ID, and we'll change the SRCIP field to some text, just text and some digits, so it's not compatible with the IP type.
Look what happens when I hit play: we actually get an error, and that error tells us it failed to parse the SRCIP field, that there was a ‘mapper_parsing_exception’.
Down here, it says that the value input ‘is not an IP string literal’. In this case, because we've defined the type as IP, anything we put in here that doesn't map to an IP address isn't going to work; it's not compatible with the type we've specified.
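The error response looks roughly like this; I've abridged it, the offending value is a stand-in, and the exact wording varies between ES versions:

```json
{
  "error": {
    "type": "mapper_parsing_exception",
    "reason": "failed to parse [SRCIP]",
    "caused_by": {
      "type": "illegal_argument_exception",
      "reason": "'text123' is not an IP string literal."
    }
  },
  "status": 400
}
```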
The same thing can happen in other places. It looks like response is set to integer, so let's change that to some text. If I hit play, sure enough, this was supposed to be an integer and we've given it text, so that isn't going to work either.
Now let's try one more: method. Let's change that GET text to a number. Now this one works, and you might ask yourself why, since that's definitely a number and not text. Well, the reason is you can actually index numbers and have ES treat them as text.
If you remember, when we let ES decide the mappings dynamically for an IP address field, it treated that as text. It's a fairly basic computer science principle: a number can be text, but text cannot be a number.
So that's a little bit confusing, but hopefully you see what's going on there, and what we have available once we've strictly, or explicitly, defined the mappings we're using.
Now, I want to try one more thing here. Let's go ahead and change method back to GET,
and we'll enter a new ID here and let's try to add a field.
So I’m going to hit a comma here, and let's just add a field called BYTES, and we'll specify
that 500 bytes were transferred.
Now I’m going to hit play. Before I do, what do you think will happen here? Do you think
I’ll get an error or do you think it'll be accepted? Let's try it. Okay, it looks like it was
accepted.
So, ‘ID 3’ here was indexed without errors, and you may be asking yourself: why is that? We've specified mappings, and this is data that does not conform to our mapping, so it shouldn't work, should it? Well, it actually will, and there's a reason for that: remember, we specified mappings, but we haven't configured ES to reject everything that doesn't correspond to the data as we have it mapped.
It's already doing type checking: like we saw with the destination IP, if we put a non-IP value into a field that's been mapped to a specific type, that will fail, because that field name has been specified with a specific type it has to match for things to work correctly.
However, we haven't done anything that tells ES it can't dynamically index new fields. To do that, we have to disable ES's ability to dynamically map data, so let's change that configuration setting right now.
To change this setting, we're going to hit a completely different endpoint. I’m going to use the PUT HTTP method, specify the index I’m working with, and specify the _mapping endpoint.
Then, within that, there's an option I have to enter, which is the specific type; in this case I’m going to specify the proxy type. From here we just input the option we want to change.
I’m going to specify the ‘dynamic’ option and change it to ‘strict’. Here's what this is doing: ‘dynamic’ is set to true by default; by changing it to strict, we're saying that any time ES indexes data, that data must conform to the mappings we've already defined, and if not, ES should reject it and not allow it to be indexed.
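The request itself is short; note that it's a PUT, since we're changing a setting, and the endpoint includes the type, matching the type-based ES version used here:

```json
PUT dfir/_mapping/proxy
{
  "dynamic": "strict"
}
```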
So I’m going to hit play here, and it says ‘acknowledged: true’; it looks like that setting was successful. I want to show you that, so let's do a GET for the DFIR index structure, and notice that under our proxy type we now have ‘dynamic: strict’. You might ask yourself: can we just define this when we create the index in the first place? And you can! You absolutely can. I didn't do that here for the sake of this demonstration.
But you can 100 percent define the dynamic property when you create your index, and for the most part you probably should. I did want to show you how to change it in case you need to.
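If you do want to set it up front, the ‘dynamic’ property sits alongside ‘properties’ in the create-index request; here's a sketch with an abridged field list:

```json
PUT dfir
{
  "mappings": {
    "proxy": {
      "dynamic": "strict",
      "properties": {
        "date":  { "type": "date" },
        "SRCIP": { "type": "ip" },
        "DSTIP": { "type": "ip" }
      }
    }
  }
}
```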
Now that we've changed this property of our index, let's go ahead and try to paste in some data. This is the same data we were looking at just a moment ago, where I added the bytes field, but in this case I’m going to change the ID to four, and let's try this one more time.
In this case it looks like it allowed it, and the reason is we already created the ‘bytes’ field; it was already dynamically indexed, so as far as ES is concerned, it's not a new field.
So let's try this again. Let's say we have a user-aware proxy, and we'll add a ‘user’ field with the value ‘chris’. Let's hit play on that. Now this one fails, and it should! It's a new field that did not already exist in the mapping. So it says ‘strict_dynamic_mapping_exception’: mapping is set to strict, and dynamic introduction of ‘user’ within the proxy type is not allowed. A very straightforward error message; simply put, the user field does not exist in the mapping we already have.
Since this is set to strict, ES is going to reject the data; it's not going to allow us to add that new field.
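The rejection looks roughly like this (abridged; wording varies by ES version):

```json
{
  "error": {
    "type": "strict_dynamic_mapping_exception",
    "reason": "mapping set to strict, dynamic introduction of [user] within [proxy] is not allowed"
  },
  "status": 400
}
```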
Now again, you might ask yourself: why is this important? Why do I need to do this? Consider a real-world example: oftentimes the source fields we're parsing from can change. Vendors do change these from time to time, and it's often unexpected. I worked at a job at one point where we had a dedicated parsing team whose whole job was nothing more than to take vendor-specific logs from various evidence sources and parse them into common fields, and you wouldn't believe how often these things change without warning.
It can be a massive change, like completely changing a field name, or even a subtle change, like putting a space in a certain location, that breaks the regular expressions doing the parsing.
So these things change all the time, and especially when you talk about changing field names, that can really mess stuff up, especially if you have a developer who simply typed the wrong thing.
For instance, let's say we have a developer writing code, maybe making a specific tool that inputs data into your ES database, or maybe even doing this himself.
What if, instead of putting SRCIP, he put SOURCEIP and indexed that data into your ES database? What you would have is some records with SRCIP and some with the full spelling, SOURCEIP.
You'd have disparity among your data that would be really hard to track down, and if you didn't have strict mapping enabled, this would be indexed just like a perfectly fine document when in fact it's not correct.
This is just a simple example, but this type of database creep is something I see in very nearly every instance of ES I find that doesn't have strict mapping enabled. So again, knowing your data is so important, and being able to perform these mappings is an important part of knowing your data, defining it, and setting a strict structure that says: hey, if you want to put data into my ES index, it absolutely has to be what I’m expecting.
We wouldn't accept unexpected data in any other walk of life in security, so we don't want to accept it into our database either.
Now, one thing I’m going to encourage you to do as you go along is to learn a little bit more about these data types. We looked at a few different ones here: the IP data type, the text data type, the date data type, and so on.
But go to the ES documentation, and every time you use one, read a little more about it and its capabilities. The documentation is really robust, and there's a lot of detail about these data types and what they mean. Particularly if you don't have a programming or computer science background, you really want to understand what choices you're making before you make them.
You can learn a lot here. One thing you can get into, and it's quite a large subject, is the analysis of particular types of data, particularly string data. ES provides the ability to analyze data based on some things we talked about earlier, where you take words and lowercase them or stem them, making them a little more neutral, to really affect relevancy in your search.
You can apply analyzers to your data when you're defining these mappings. In this example, notice we have text-based data and we're defining the analyzer ‘standard’, which is the standard analyzer ES uses. But there are other analyzers that have different effects, and there are endpoints that allow you to test these analyzers, which will really impact quite a bit how you search for data.
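As a sketch, an analyzer is assigned in the mapping like this (the index, type, and field names here are illustrative), and the _analyze endpoint lets you test what an analyzer does to a string:

```json
PUT articles
{
  "mappings": {
    "doc": {
      "properties": {
        "body": { "type": "text", "analyzer": "standard" }
      }
    }
  }
}

POST _analyze
{
  "analyzer": "standard",
  "text": "The Quick Brown Foxes"
}
```

The _analyze call returns the individual lowercased tokens ES would actually index, which is a handy way to see an analyzer's effect before committing to it in a mapping.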
We'll probably cover a couple of other examples of this later on, but I did want to highlight this analysis section and the great documentation about these data types that you can use as you continue learning how to map your data and define your data structure.
That's going to do it for this section about mapping your indexed data.
We talked first about my hot take: knowing your data is important, and if you don't know your data, it's a little bit lazy. In a field where evidence is really everything, you must understand where your evidence comes from, what data types are associated with it, how to access it, and how to search it.
This will become exceedingly important later on, as we start using Logstash to get data
from one point to another, and when we use Kibana to start searching through our data
as we would in a real investigation.
ES defaults to dynamic mapping of data; that's what we've been doing to this point in the course, before we got to this particular lesson. When you use dynamic mapping, you can index pretty much any data you want, but that comes with some problems of its own that can manifest much later on and be quite a headache.
ES can be configured to require explicit data mapping. And I showed you how to
configure that, and what the results look like when you try to index data that is not in
your mapping when you're configured to use explicit mapping.
Of course, data can be mapped to several different data types. We have text-based and number-based data types, and we also have special data types such as IP addresses, dates, and so on. You really want to become better at understanding how and when to use those data types if you want to understand your data and use it properly within the ELK stack ecosystem.
Finally, I showed the ES documentation for data types and mentioned that text-based data can be further analyzed using a variety of mechanisms that we're not going to cover in depth here, but they are certainly worth a little of your time if you're going to be indexing a lot of basic text-based data. That doesn't cover a lot of NSM data types, but it is something we do have to be concerned with a little bit.
So definitely spend some time thinking on that. In general, the summary of this is really the first bullet point.
Knowing your data is important and ES provides the capability for you to express that
knowledge of your data via mappings which we've discussed here.