

Chapter 2

Data formatting: the input file . . .


Clearly, the first step in any analysis is gathering and collating your data. We'll assume that at the minimum, you have records for the individually marked individuals in your study, and from these records, can determine whether or not an individual was encountered (in one fashion or another) on a particular sampling occasion. Most typically, your data will be stored in what we refer to as a vertical file - where each line in the file is a record of when a particular individual was seen. For example, consider the following table, consisting of some individually identifying mark (ring or tag number), and the year. Each line in the file (or, row in the matrix) corresponds to the animal being seen in a particular year.

tag number      year
1147-38951      73
1147-38951      75
1147-38951      76
1147-38951      82
1147-45453      74
1147-45453      78

However, while it is easy and efficient to record the observation histories of individually marked animals this way, the vertical format is not at all useful for capture-mark-recapture analysis. The preferred format is the encounter history. The encounter history is a contiguous series of specific dummy variables, each of which indicates something concerning the encounter of that individual - for example, whether or not it was encountered on a particular sampling occasion, how it was encountered, where it was encountered, and so forth. The particular encounter history will reflect the underlying model type you are working with (e.g., recaptures of live individuals, recoveries of dead individuals). Consider for example, the encounter history for a typical mark-recapture analysis (the encounter history for a mark-recapture analysis is often referred to as a capture history, since it implies physical capture of the individual). In most cases, the encounter history consists of a contiguous series of 1s and 0s, where 1 indicates that an animal was recaptured (or otherwise known to be alive and in the sampling area), and 0 indicates the animal was not recaptured (or otherwise seen). Consider the individual in the preceding table with tag number 1147-38951. Suppose that 1973 is the first year of the study, and that 1985 is the last year of the study. Examining the table, we see that this individual was captured and marked during the first year of the study, and was seen periodically until 1982, when it was seen for the last time. The corresponding encounter history for this individual would be:

1011000001000
© Cooch & White (2012) 07.22.2012


In other words, the individual was seen in 1973 (the starting 1), not seen in 1974 (0), seen in 1975 and 1976 (11), not seen for the next 5 years (00000), seen again in 1982 (1), and then not seen again (000). While this is easy enough in principle, you surely don't want to have to construct capture histories manually. Of course, this is precisely the sort of thing that computers are good for - large-scale data manipulation and formatting. MARK does not do the data formatting itself - no doubt you have your own preferred data manipulation environment (dBASE, Excel, Paradox, SAS). Thus, in general, you'll have to write your own program to convert the typical vertical file (where each line represents the encounter information for a given individual on a given sampling occasion; see the example above) into encounter histories (where the encounter history is a horizontal string). In fact, if you think about it a bit, you realize that in effect what you need to do is take a vertical file, and transpose it into a horizontal file - where fields to the right of the individual tag number represent when an individual was recaptured or resighted. However, while the idea of a matrix transpose seems simple enough, there is one rather important thing that needs to be done - your program must insert the 0 value whenever an individual was not seen. We'll assume for the purposes of this book that you will have some facility to put your data into the proper encounter-history format. Of course, you could always do it by hand, if absolutely necessary!
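The transpose-and-fill step just described can be sketched in a few lines of Python. This is a hypothetical helper, not part of MARK; the record format (one `(tag, year)` tuple per sighting) is an assumption:

```python
from collections import defaultdict

def build_histories(records, first_year, last_year):
    """Collapse a vertical (tag, year) file into encounter-history strings.

    records    : iterable of (tag, year) tuples, one per sighting
    first_year : first sampling occasion of the study
    last_year  : last sampling occasion of the study
    """
    seen = defaultdict(set)
    for tag, year in records:
        seen[tag].add(year)
    histories = {}
    for tag, years in seen.items():
        # Insert a '1' for every year the animal was encountered,
        # and a '0' for every year it was not - the crucial zero-fill.
        histories[tag] = "".join(
            "1" if y in years else "0"
            for y in range(first_year, last_year + 1)
        )
    return histories

# The individual from the table above, for a 1973-1985 study:
recs = [("1147-38951", y) for y in (1973, 1975, 1976, 1982)]
print(build_histories(recs, 1973, 1985)["1147-38951"])  # 1011000001000
```

The same approach extends to any sighting database: the only study-specific parts are reading the vertical records and choosing the first and last occasions.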
begin sidebar

editing the .INP file Many of the problems people have getting started with MARK can ultimately be traced back to problems with the .INP file. One common issue relates to the choice of editor used to make changes/additions to the .INP file. You are strongly urged to avoid - as in like the plague - using Windows Notepad (or, even worse, Word) to do much of anything related to building/editing .INP files. Do yourself a favor and get yourself a real ASCII editor - there are a number of very good free applications you can (and should) use instead of Notepad (e.g., Notepad++, EditPad Lite, jEdit, and so on...)
end sidebar

2.1. Encounter histories formats


Now we'll look at the formatting of the encounter histories file in detail. It is probably easiest to show you a typical encounter history file, and then explain it piece by piece. The encounter history reflects a mark-recapture experiment.

Superficially, the encounter histories file is structurally quite simple. It consists of an ASCII (text) file, consisting of the encounter history itself (the contiguous string of dummy variables), followed by
one or more additional columns of information pertaining to that history. Each record (i.e., each line) in the encounter histories file ends with a semi-colon. Each history (i.e., each line, or record) must be the same length (i.e., have the same number of elements - the encounter history itself must be the same length over all records, and the number of elements to the right of the encounter history must also be the same) - this is true regardless of the data type. The encounter histories file should have a .INP suffix (for example, EXAMPLE1.INP). Generally, there are no other control statements or PROC statements required in a MARK input file. However, you can optionally add comments to the INP file using the slash-asterisk, asterisk-slash convention common to many programming environments; we have included a comment at the top of the example input file (shown at the bottom of the preceding page). The only thing to remember about comments is that they do not end with a semi-colon. Let's look at each record (i.e., each line) a bit more closely. In this example, each encounter history is followed by a number. This number is the frequency of all individuals having a particular encounter history. This is not required (and in fact isn't what you want to do if you're going to consider individual covariates - more on that later), but is often more convenient for large data sets. For example, the summary encounter history
110000101 4;

could also be entered in the INP file as


110000101 1;
110000101 1;
110000101 1;
110000101 1;
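Going back the other way - collapsing one-per-individual lines into summary frequencies - is a one-liner with a counter. A sketch, assuming each line is `history frequency;` as above:

```python
from collections import Counter

def summarize(individual_records):
    """Collapse one-per-individual lines like '110000101 1;' into
    summary lines like '110000101 4;'."""
    # The history is the first whitespace-separated token on each line.
    counts = Counter(line.split()[0] for line in individual_records)
    return [f"{history} {n};" for history, n in counts.items()]

lines = ["110000101 1;"] * 4
print(summarize(lines))  # ['110000101 4;']
```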

Note again that each line - each encounter history record - ends in a semi-colon. How would you handle multiple groups? For example, suppose you had encounter data from males and females? In fact, it is relatively straightforward to format the INP file for multiple groups - very easy for summary encounter histories, a bit less so for individual encounter histories. In the case of summary encounter histories, you simply add a second column of frequencies to the encounter histories to correspond to the other sex. For example,
110100111 23 17;
110000101 4 2;
101100011 1 3;

In other words, 23 of one sex and 17 of the other have history 110100111 (the ordering of the sexes - which column of frequencies corresponds to which sex - is entirely up to you). If you are using individual records, rather than summary frequencies, you need to indicate group association in a slightly less-obvious way - you will have to use a 0 or 1 within a group column to indicate the frequency - but obviously for one group only. We'll demonstrate the idea here. Suppose we had the following summary history, with frequencies for males and females (respectively):
110000101 4 2;

In other words, 4 males, and 2 females with this encounter history (note: the fact that males come before females in this example is completely arbitrary. You can put whichever sex - or group - you want in any column you want - all you'll need to do is remember which columns in the INP file correspond to which groups). To code individual encounter histories, the INP file would be modified to look like:

110000101 1 0;
110000101 1 0;
110000101 1 0;
110000101 1 0;
110000101 0 1;
110000101 0 1;

In this example, the coding 1 0 indicates that the individual is a male (frequency of 1 in the male column, frequency of 0 in the female column), and 0 1 indicates the individual is a female (frequency of 0 in the male column, and frequency of 1 in the female column). The use of one record per individual is only necessary if you're planning on using individual covariates in your analysis.

2.1.1. Groups within groups...

In the preceding example, we had 2 groups: males and females. The frequency of encounters for each sex is coded by adding the frequency for each sex to the right of the encounter history. But, what if you had something like males and females (i.e., data from both sexes) and good colony and poor colony (i.e., data were sampled for both sexes from each of 2 different colonies - one classified as good, and the other as poor)? How do you handle this in the INP file? Well, all you need to do is have a frequency column for each (sex.colony) combination: one frequency column for females from the good colony, one frequency column for females from the poor colony, one frequency column for males from the good colony, and finally, one frequency column for males from the poor colony. An example of such an INP file is shown below:
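The original example file is not reproduced here; the following is a minimal sketch of what such a four-frequency-column summary .INP could look like. The histories and counts are made up purely for illustration - only the column layout (one frequency per sex.colony combination) comes from the text:

```
/* history    f-good  f-poor  m-good  m-poor */
110100111     10      7       12      5;
110000101      3      1        4      2;
101100011      2      0        1      3;
```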

As we will see in subsequent chapters, building models to test for differences between and among groups, and for interactions among groups (e.g., an interaction of sex and colony in this example), is relatively straightforward in MARK - all you'll really need to do is remember which frequency column codes for which grouping (hence the utility of adding comments to your INP file, as we've done in this example).

2.2. Removing individuals from the sample


Occasionally, you may choose to remove individuals from the data set at a particular sampling occasion. For example, because your experiment requires you to remove the individual after its first recapture, or because it is injured, or for some other reason. The standard encounter history we have looked at so far records presence or absence only. How do we accommodate removals in the INP file? Actually, it's very easy - all you do is change the sign on the frequencies from positive to
negative. Negative frequencies indicate that that many individuals with a given encounter history were removed from the study. For example,
100100 1500 1678;
100100  -23  -25;

In this example, we have 2 groups, and 6 sampling occasions. In the first record, we see that there were 1500 individuals and 1678 individuals in each group marked on the first occasion, not encountered on the next 2 occasions, seen on the fourth occasion, and not seen again. In the second line, we see the same encounter history, but with the frequencies -23 and -25. The negative values indicate to MARK that 23 and 25 individuals in the two groups, respectively, were marked on the first occasion, not seen on the next 2 occasions, and were encountered on the fourth occasion, at which time they were removed from the study. Clearly, if they were removed, they cannot have been seen again.
begin sidebar

uneven time-intervals between sampling occasions? In the preceding, we have implicitly assumed that the sampling interval between sampling occasions is identical throughout the course of the study (e.g., sampling every 12 months, or every month, or every week). But, in practice, it is not uncommon for the time interval between occasions to vary - either by design, or because of logistical constraints. This has clear implications for how you analyze your data. For example, suppose you sample a population each October, and again each May (i.e., two samples within a year, with different time intervals between samples; October to May (7 months), and May to October (5 months)). Suppose the true monthly survival rate is constant over all months, and is equal to 0.9. As such, the estimated survival for October to May will be 0.9^7 = 0.4783, while the estimated survival rate for May to October will be 0.9^5 = 0.5905. Thus, if you fit a model without accounting for these differences in time intervals, it is clear that there would appear to be differences in survival between successive samples, when in fact the monthly survival does not change over time. So, how do you tell MARK that the interval between samples may vary over time? You might think that you need to code this interval information in the INP file in some fashion. In fact, you don't - you specify the time intervals when you are specifying the data type in MARK, and not in the INP file. In the INP file, you simply enter the encounter histories as contiguous strings, regardless of the true interval between sampling occasions. We will discuss handling uneven time-intervals in more detail in a later chapter.
end sidebar
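The sidebar's arithmetic is easy to check directly: with a constant monthly survival of 0.9, survival over an interval is just 0.9 raised to the interval length in months.

```python
monthly_s = 0.9
oct_to_may = monthly_s ** 7   # 7-month interval (October to May)
may_to_oct = monthly_s ** 5   # 5-month interval (May to October)

# Same monthly survival, different interval lengths, so the two
# interval-level estimates differ even though nothing "changed".
print(round(oct_to_may, 4), round(may_to_oct, 4))  # 0.4783 0.5905
```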

2.3. Different encounter history formats


Up until now, we've more or less used typical mark-recapture encounter histories (i.e., capture histories) to illustrate the basic principles of constructing an INP file. However, MARK can be applied to far more than mark-recapture analyses, and as such, there are a number of slight permutations on the encounter history that you need to be aware of in order to use MARK to analyze your particular data type. First, we summarize in table form (below) the different data types MARK can handle, and the corresponding encounter history format. Each data type in MARK requires a primary form of data entry provided by the encounter history. Encounter histories can consist of information on only live encounters (LLLL) or information on both live and dead (LDLDLDLD). In addition, some types allow a summary format (e.g., recovery
data type                                       encounter history format
recaptures only                                 LLLL
recoveries only                                 LDLDLDLD
both                                            LDLDLDLD
known fate                                      LDLDLDLD
closed captures                                 LLLL
BTO ring recoveries                             LDLDLDLD
robust design                                   LLLL
both (Barker model)                             LDLDLDLD
multi-strata                                    LLLL
Brownie recoveries                              LDLDLDLD
Jolly-Seber                                     LLLL
Huggins closed captures                         LLLL
robust design (Huggins)                         LLLL
Pradel recruitment                              LLLL
Pradel survival & seniority                     LLLL
Pradel survival & lambda                        LLLL
Pradel survival & recruitment                   LLLL
POPAN                                           LLLL
multi-strata - live and dead encounters         LDLDLDLD
closed captures with heterogeneity              LLLL
full closed captures with heterogeneity         LLLL
nest survival                                   LDLDLDLD
occupancy estimation                            LLLL
robust design occupancy estimation              LLLL
open robust design multi-strata                 LLLL
closed robust design multi-strata               LLLL

matrix) which reduces the amount of input. The second column of the table shows the basic structure for a 4 occasion encounter history. There are, in fact, two broad types: live encounters only, and mixed live and dead (or known fate) encounters. For example, for a recaptures only study (i.e., live encounters), the structure of the encounter history would be LLLL - where L indicates information on encountered/not encountered status. As such, each L in the history would be replaced by the corresponding coding variable to indicate encountered or not encountered status (usually 1 or 0 for the recaptures only history). So, for example, the encounter history 1011 indicates seen and marked alive at occasion 1, not seen on occasion 2, and seen again at both occasion 3 and occasion 4. For data types including both live and dead individuals, the encounter history for the 4 occasion study is effectively doubled - taking the format LDLDLDLD, where the L refers to the live encountered or not encountered status, and the D refers to the dead encountered or not encountered status. At each sampling occasion, either event is possible - an individual could be both seen alive at occasion (i) and then found dead at occasion (i), or during the interval between (i) and (i+1). Since both potential events need to be coded at each occasion, this effectively doubles the length of the encounter history from a 4 character string to an 8 character string. For example, suppose you record the following encounter history for an individual over 4 occasions


10001100

- where the encounters consist of both live encounters and dead recoveries. Thus, the history 10001100 reflects an individual seen and marked alive on the first occasion, not recovered during the first interval, not seen alive at the second occasion and not recovered during the second interval, seen alive on the third occasion and then recovered dead during the third interval, and not seen or recovered thereafter (obviously, since the individual was found dead during the preceding interval).
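A small helper makes the LDLD pairing explicit - this is a sketch for reading such histories, not MARK syntax:

```python
def decode_ld(history):
    """Split an LDLD... string into per-occasion (live, dead) pairs."""
    assert len(history) % 2 == 0, "LDLD histories have an even length"
    return [(history[i], history[i + 1]) for i in range(0, len(history), 2)]

# The example above: marked alive at occasion 1, seen alive and then
# recovered dead at occasion 3, never encountered otherwise.
print(decode_ld("10001100"))
# [('1', '0'), ('0', '0'), ('1', '1'), ('0', '0')]
```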

2.4. Some more examples


The MARK help files contain a number of different examples of encounter formats. We list only a few of them here. For example, suppose you are working with dead recoveries only. If you look at the table above, you see that it has a format of LDLDLDLD. Why not just LLLL, using 1 for live, and 0 for recovered dead? The answer is because you need to differentiate between known dead (which is a known fate), and simply not seen. A 0 alone could ambiguously mean either dead, or not seen (or both!).

2.4.1. Dead recoveries only

The following is an example of dead recoveries only, because a live animal is never captured alive after its initial capture. That is, none of the encounter histories have more than one 1 in an L column. This example has 15 encounter occasions and 1 group. If you study this example, you will see that 500 animals were banded each banding occasion.
000000000000000000000000000010 465;
000000000000000000000000000011 35;
000000000000000000000000001000 418;
000000000000000000000000001001 15;
000000000000000000000000001100 67;
000000000000000000000000100000 395;
000000000000000000000000100001 3;
000000000000000000000000100100 25;
000000000000000000000000110000 77;
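The claim that 500 animals were banded on each occasion can be verified by summing frequencies over histories that share the same banding occasion (the position of the first 1). A sketch, with the records embedded directly:

```python
from collections import defaultdict

records = [
    ("000000000000000000000000000010", 465),
    ("000000000000000000000000000011", 35),
    ("000000000000000000000000001000", 418),
    ("000000000000000000000000001001", 15),
    ("000000000000000000000000001100", 67),
    ("000000000000000000000000100000", 395),
    ("000000000000000000000000100001", 3),
    ("000000000000000000000000100100", 25),
    ("000000000000000000000000110000", 77),
]

banded = defaultdict(int)
for history, freq in records:
    # The position of the first '1' identifies the banding cohort.
    banded[history.index("1")] += freq

print(sorted(banded.values()))  # [500, 500, 500]
```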

Traditionally, recoveries only data sets were summarized into what are known as recovery tables. MARK accommodates recovery tables, which have a triangular matrix form, where time goes from left to right (shown below). This format is similar to that used by Brownie et al. (1985).
7 4 1 1 0;
8 5 1 0;
10 4 2;
16 3;
12;
99 88 153 114 123;

Following each matrix is the number of individuals marked each year. So, 99 individuals were marked on the first occasion, of which 7 were recovered dead during the first interval, 4 during the second, 1 during the third, and so on.

2.4.2. Individual covariates

Finally, an example of known fate data, where individual covariates are included. Comments are given at the start of each line to identify the individual (this is optional, but often very helpful in keeping track of things). Then comes the capture history for this individual, in an LDLDLD... sequence. Thus the first capture history is for an animal that was released on occasion 1, and died during the interval. The second animal was released on occasion 1, survived the interval, was released again on occasion 2, and died during this second interval. Following the capture history is the count of animals with this history (always 1 in this example). Then, 4 covariates are provided. The first is a dummy variable representing age (0 = subadult, 1 = adult), then a condition index, wing length, and body weight.
/* 01 */ 1100000000000000 1 1 1.16 27.7 4.19;
/* 04 */ 1011000000000000 1 0 1.16 26.4 4.39;
/* 05 */ 1011000000000000 1 1 1.08 26.7 4.04;
/* 06 */ 1010000000000000 1 0 1.12 26.2 4.27;
/* 07 */ 1010000000000000 1 1 1.14 27.7 4.11;
/* 08 */ 1010110000000000 1 1 1.20 28.3 4.24;
/* 09 */ 1010000000000000 1 1 1.10 26.4 4.17;

What if you have multiple groups, such that individuals are assigned to (or part of) a given group, and where you also have individual covariates? There are a couple of ways you could handle this sort of situation. You can either code for the groups explicitly in the .inp file, or use an individual covariate for the groups. There are pros and cons to either approach (this issue is discussed in Chapter 11). Here is a snippet from a data set with 2 groups coded explicitly, and an individual covariate. In this data fragment, the first 8 contiguous values represent the encounter history, followed by 2 columns representing the frequencies depending on group: 1 0 indicating group 1, and 0 1 indicating group 2, followed by the value of the covariate:
11111111 1 0 123.211;
11111111 0 1 92.856;
11111110 1 0 122.115;
11111110 1 0 136.460;

So, the first record, with an encounter history of 11111111, is in group 1, and has a covariate value of 123.211. The second individual, also with an encounter history of 11111111, is in group 2, and has a covariate value of 92.856. The third individual has an encounter history of 11111110, and is in group 1, with a covariate value of 122.115. And so on. If you wanted to code the group as an individual covariate, this same input file snippet would look like:
11111111 1 1 123.211;
11111111 1 0 92.856;
11111110 1 1 122.115;
11111110 1 1 136.460;

In this case, following the encounter history is a column of 1s, indicating the frequency for each individual, followed by a column containing a 0/1 dummy code to indicate group (in this example, we've used a 1 to indicate group 1, and 0 to indicate group 2), followed by the value of the covariate.
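Converting a record from the explicit two-column group coding to the covariate-style coding is mechanical. A sketch, assuming exactly two groups and the 1-for-group-1 / 0-for-group-2 dummy convention used above:

```python
def explicit_to_covariate(history, freq_g1, freq_g2, covariate):
    """Turn an explicit-groups record ('history f1 f2 cov;') into a
    covariate-coded record ('history 1 g cov;'), where g = 1 for
    group 1 and g = 0 for group 2."""
    group_dummy = 1 if freq_g1 == 1 else 0
    # The frequency column becomes a constant 1 (one line per individual).
    return f"{history} 1 {group_dummy} {covariate};"

print(explicit_to_covariate("11111111", 1, 0, 123.211))  # 11111111 1 1 123.211;
print(explicit_to_covariate("11111111", 0, 1, 92.856))   # 11111111 1 0 92.856;
```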

As a final example, for three groups where we code for each group explicitly (such that each group has its own dummy column in the input file), an encounter history with individual covariates might look like:
11111 1 0 0 123.5;
11110 0 1 0 99.8;
11111 0 0 1 115.2;

where the first individual, with encounter history 11111, is in group 1 (dummy value of 1 in the first column after the encounter history, and 0s in the next two columns) and has a covariate value of 123.5; the second individual, with encounter history 11110, is in group 2 (dummy code of 0 in the first column, 1 in the second, and 0 in the third) and has a covariate value of 99.8; and the third individual, with encounter history 11111, is in group 3 (0 in the first two columns, and a 1 in the third column), with a covariate value of 115.2. As is noted in the help file (and discussed at length in Chapter 11), it is helpful to scale the values of covariates to have a mean in the interval [0, 1] to ensure that the numerical optimization algorithm finds the correct parameter estimates. For example, suppose the individual covariate weight is used, with a range from 1000 g to 5000 g. In this case, you should scale the values of weight to be from 0.1 to 0.5 by multiplying each weight value by 0.0001. In fact, MARK defaults to doing this sort of scaling for you automatically (without you even being aware of it). This automatic scaling is done by determining the maximum absolute value of the covariates, and then dividing each covariate by this value. This results in each column being scaled to between -1 and 1. This internal scaling is purely for purposes of ensuring the success of the numerical optimization - the parameter values reported by MARK (i.e., in the output that you see) are back-transformed to the original scale. Alternatively, if you prefer that the scaled covariates have a mean of 0, and unit variance (this has some advantages in some cases), you can use the Standardize Individual Covariates option of the Run Window in place of the default scaling method (more on these in subsequent chapters). More details on how to handle individual covariates in the input file are given in Chapter 11.
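The automatic internal scaling described above - divide each covariate column by its maximum absolute value - can be sketched as:

```python
def scale_column(values):
    """Scale a covariate column to [-1, 1] by dividing every value by
    the maximum absolute value, mimicking the automatic scaling
    described in the text."""
    max_abs = max(abs(v) for v in values)
    return [v / max_abs for v in values]

weights = [1000, 2500, 5000]  # weights in grams
print(scale_column(weights))  # [0.2, 0.5, 1.0]
```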

Summary
That's it! You're now ready to learn how to use MARK. Before you leap into the first major chapter (Chapter 3), take some time to consider that MARK will always do its best to analyze the data you feed into it. However, it assumes that you will have taken the time to make sure your data are correct. If not, you'll be the unwitting victim of perhaps the most telling comment in data analysis: garbage in... garbage out. Take some time at this stage to make sure you are confident in how to properly create and format your files.

