Lab 1 - Statistical Distributions
COMPUTER LAB 1: AN INTRODUCTION TO STATISTICAL DISTRIBUTIONS USING PROGRAM R/R STUDIO
The list of statistical distributions can seem endless: dozens of distributions compete for your attention, with little or no intuitive basis for differentiating between them. The descriptions tend to be abstract and emphasize statistical properties such as the moments, characteristic functions and cumulative distributions. To help clarify how distributions fit into your research, we will focus on the aspects of distributions that are most useful when analyzing raw data and trying to fit the right distribution to those data. For supplemental information, I highly recommend reviewing this document (it was given to me during one of my graduate courses).
R-Studio and Program R
For this course we will use Program R to perform most of our statistical analyses. I have found the free R code editor R-Studio to be a great tool for developing code and models. R-Studio provides an IDE (Integrated Development Environment) in which the screen is split into four sections: 1) the Source frame (where you write code), 2) the Console frame (where R is actually running and code is executed), 3) the Workspace frame (where you can see which objects currently exist in memory for the R session), and 4) the Help/Packages/Plots frame (where you can view plots and R help, and install additional packages).
With R-Studio you can submit code one line at a time or submit an entire file, which gives you a flexible way to interact with and test your code in the R environment. To use R-Studio you need Program R installed first, because R is what actually runs your code. You can use any version of Program R, but I typically have the most recent version installed (NOTE: You can have more than one version of R installed without consequence and can tell R-Studio which version to use for that session). You can view some instructions for installing R and R-Studio onto your own computer, with some spatial packages, in this document: R_Instal_Instructions_Albeke. Our classroom computers already have this software installed.
If you would like to perform some very basic R scripting, you can work through this short set of R functions: SimpleIntro.r
There are also a couple of books that you can review to provide you with some basic R skills:
When confronted with data that needs to be characterized by a distribution, it is best to start with the raw data and answer four basic questions about the data that can help in the characterization:
- Are the data discrete or continuous values? (For example, presence/absence is a discrete value, but the biomass at each sample location is a continuous variable.)
- Are the data symmetric? If there is asymmetry, which direction does it lie in? In other words, are positive and negative outliers equally likely, or is one more likely than the other?
- Are there upper or lower limits on the data? Some data items, like revenues, cannot be lower than zero, whereas others, like operating margins, cannot exceed a value (100%).
- How likely are extreme values? In some data the extreme values occur very infrequently, whereas in others they occur more often.
One way to decide the best distribution is to view the shape of the distribution using a histogram. Using some data stored on the server, create an R object/data.frame named 'otter'. Then generate a histogram of the data by executing the following in R Studio:
otter <- read.table("http://piney.uwyo.edu/salbeke/AdvSpat/Spr2014/lab1/otterfeces.txt", header=TRUE, sep="\t")
The histogram should appear approximately normally distributed, with little or no skewness.
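The column names in the otter file are not shown here, so the following is only a sketch of the read-then-plot workflow using simulated data. The data frame `sim` and its column `value` are hypothetical stand-ins, not the contents of the otter file.

```r
# Simulated stand-in for a data frame read from a file.
# 'sim' and 'value' are hypothetical names, not the real otter data.
set.seed(42)
sim <- data.frame(value = rnorm(200, mean = 10, sd = 2))

# Inspect the structure first, as you would with any newly read table.
str(sim)

# Draw the histogram and keep the returned object to inspect the bin counts.
h <- hist(sim$value, breaks = 20,
          main = "Histogram of simulated values", xlab = "value")
h$counts
```

With the real data you would replace `sim$value` with the column of interest from `otter`; `names(otter)` lists the available columns.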
Gaussian Distribution (aka: Normal)
This is the famous "bell-shaped" distribution. The normal distribution has several features that make it popular. First, it can be fully characterized by just two parameters, the mean and the standard deviation, and thus reduces estimation pain. Second, the probability of any value occurring can be obtained simply by knowing how many standard deviations separate the value from the mean; the probability that a value will fall within 2 standard deviations of the mean is roughly 95%. The normal distribution is best suited for data that, at the minimum, meet the following conditions:
- There is a strong tendency for the data to take on a central value.
- Positive and negative deviations from this central value are equally likely
- The frequency of the deviations falls off rapidly as we move further away from the central value.
The last two conditions show up when we compute the parameters of the normal distribution: the symmetry of deviations leads to zero skewness, and the low probability of large deviations from the central value reveals itself as zero excess kurtosis.
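To make skewness and excess kurtosis concrete, here is a minimal sketch that computes both from a simulated normal sample. The helper functions `skewness` and `excess_kurtosis` are written here for illustration (they are not base R functions), and they use the simple moment-based definitions.

```r
set.seed(1)
x <- rnorm(10000)

# Sample skewness: third central moment divided by the cubed standard deviation.
skewness <- function(v) {
  m <- mean(v)
  mean((v - m)^3) / (mean((v - m)^2))^1.5
}

# Excess kurtosis: standardized fourth central moment minus 3,
# where 3 is the kurtosis of a normal distribution.
excess_kurtosis <- function(v) {
  m <- mean(v)
  mean((v - m)^4) / (mean((v - m)^2))^2 - 3
}

skewness(x)          # close to 0 for a normal sample
excess_kurtosis(x)   # close to 0 for a normal sample
```

Trying the same functions on a skewed sample, such as `rexp(10000)`, should give clearly positive values for both.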
One very important result related to the normal distribution is the Central Limit Theorem, which essentially states that as the size of each sample increases, the distribution of the sample mean more closely follows a Gaussian distribution, even when the underlying data are not normal. To put the idea to the test, open CentralLimit.r in R-Studio and submit all of the lines. The resulting graphic demonstrates how increasing the sample size produces a more normal distribution around the mean.
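If you do not have CentralLimit.r at hand, the idea can be sketched as follows. This is not that script; it is a small simulation under my own assumptions (an exponential population, 2000 repetitions, and the hypothetical helper `sample_means`).

```r
set.seed(123)

# Draw many samples of size n from a skewed (exponential) population
# and record the mean of each sample.
sample_means <- function(n, reps = 2000) {
  replicate(reps, mean(rexp(n, rate = 1)))
}

means_small <- sample_means(2)    # each mean is based on 2 values
means_large <- sample_means(50)   # each mean is based on 50 values

# Compare the two distributions of sample means side by side.
par(mfrow = c(1, 2))
hist(means_small, breaks = 30, main = "n = 2", xlab = "sample mean")
hist(means_large, breaks = 30, main = "n = 50", xlab = "sample mean")
par(mfrow = c(1, 1))
```

The histogram for n = 2 should still look skewed, while the one for n = 50 should look much closer to a bell shape centered near the population mean of 1.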
Now let's look at how we can use Program R to plot a normal distribution given known parameters of mean and standard deviation. Using NormalDist.r and R-Studio, plot the histogram of the randomly generated data. Notice the three parameters at the top of the script (sample size, mean, sd). Make adjustments to each of the parameters and see how the histogram changes. For example, leave the mean set to mu=5, but change the standard deviation to s=1. What happens to the distribution? Now change it to s=20. How does the distribution change? Also of note is the density (i.e. the y-axis and the red line). To interpret, the y-axis shows the probability density at a value (x): taller regions mark values that are more likely to occur within the population, given the sample distribution. It is very important to know how the mean and standard deviation can affect the shape of the distribution, for this 'sample of the population' is what you are using to make estimates of your system.
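The contents of NormalDist.r are not reproduced here, so the following is only a sketch of the same kind of plot. The names `mu` and `s` follow the text above; `n`, the seed, and the plotting choices are my own assumptions.

```r
set.seed(7)
n  <- 1000   # sample size (assumed value)
mu <- 5      # mean
s  <- 1      # standard deviation; try s <- 20 and compare

x <- rnorm(n, mean = mu, sd = s)

# freq = FALSE puts density, not counts, on the y-axis,
# so the theoretical curve is on the same scale as the bars.
hist(x, freq = FALSE, breaks = 30, main = "Normal sample with density curve")
curve(dnorm(x, mean = mu, sd = s), add = TRUE, col = "red")

# Probability of falling within 2 standard deviations of the mean.
p_within_2sd <- pnorm(mu + 2 * s, mu, s) - pnorm(mu - 2 * s, mu, s)
p_within_2sd   # roughly 0.95
```

Changing `s` stretches or squeezes the histogram along the x-axis, but `p_within_2sd` stays the same because it is measured in units of standard deviations.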
Finally, to wrap up our discussion of the Normal distribution, below are a few items that should hopefully be review:
- 50% Confidence Interval: mean ± 0.67 * (sd/sqrt(n))
- 95% Confidence Interval: mean ± 1.96 * (sd/sqrt(n))
- 99% Confidence Interval: mean ± 2.58 * (sd/sqrt(n))
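The three intervals above can be computed with a short function. This is a sketch using simulated data; the function name `ci_mean` is hypothetical, and it uses normal quantiles from `qnorm()` to match the multipliers listed above (for small samples a t quantile from `qt()` is usually preferred).

```r
set.seed(10)
x <- rnorm(30, mean = 5, sd = 2)   # simulated sample

# Confidence interval for the mean using normal quantiles.
ci_mean <- function(v, level = 0.95) {
  z  <- qnorm(1 - (1 - level) / 2)   # about 0.67, 1.96, 2.58 for 50%, 95%, 99%
  se <- sd(v) / sqrt(length(v))      # standard error of the mean
  c(lower = mean(v) - z * se, upper = mean(v) + z * se)
}

ci_mean(x, 0.50)
ci_mean(x, 0.95)
ci_mean(x, 0.99)
```

Notice that the interval gets wider as the confidence level increases.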
The Uniform distribution is quite useful. It has two parameters, a minimum and a maximum, each of which can be any value from negative to positive infinity (as long as the minimum is below the maximum), and every value it produces lies between them. Using UniformDist.r, notice how the distribution is flat, indicating an equal chance for any value within the parameters. Feel free to tweak the sample size or min/max values.
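A minimal sketch of a flat distribution, in case you want to experiment outside of UniformDist.r. The variable names and the values of `n`, `lo` and `hi` are my own assumptions, not taken from that script.

```r
set.seed(3)
n  <- 5000   # sample size
lo <- -2     # minimum
hi <- 8      # maximum

x <- runif(n, min = lo, max = hi)

# The bars should hover around the constant density 1 / (hi - lo).
hist(x, freq = FALSE, breaks = 20, main = "Uniform sample")
abline(h = dunif(0, min = lo, max = hi), col = "red")

range(x)   # never outside lo and hi
```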
Beta, Gamma and Exponential Distributions
These distributions are often used when the data are skewed, making the Normal distribution a poor fit. Use BGEDist.r to explore the shapes of these distributions. Note: This section will be expanded in the near future.
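Until this section is expanded, here is a small sketch for comparing the three shapes. It is not BGEDist.r; the parameter values (Beta(2, 5), Gamma with shape 2 and rate 1, Exponential with rate 1) are arbitrary choices for illustration.

```r
x01  <- seq(0.001, 0.999, length.out = 200)   # Beta lives on (0, 1)
xpos <- seq(0.01, 10, length.out = 200)       # Gamma and Exponential live on x > 0

d_beta  <- dbeta(x01, shape1 = 2, shape2 = 5)
d_gamma <- dgamma(xpos, shape = 2, rate = 1)
d_exp   <- dexp(xpos, rate = 1)

# Plot the three density curves side by side.
par(mfrow = c(1, 3))
plot(x01, d_beta, type = "l", main = "Beta(2, 5)", xlab = "x", ylab = "density")
plot(xpos, d_gamma, type = "l", main = "Gamma(shape 2, rate 1)", xlab = "x", ylab = "density")
plot(xpos, d_exp, type = "l", main = "Exponential(rate 1)", xlab = "x", ylab = "density")
par(mfrow = c(1, 1))
```

All three curves have a long right tail with these settings. Swapping the two Beta shape parameters (shape1 = 5, shape2 = 2) flips the skew to the left.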
The most important discrete probability distributions are the Bernoulli, Binomial and Poisson distributions.
Tossing a fair coin is equivalent to examining a random variable following a Bernoulli distribution of parameter 0.5. If the coin has been tampered with so that "heads" appears with probability p, it is a Bernoulli distribution of parameter p. Thus a Bernoulli trial is a single event with a given probability of success. The Binomial distribution with parameters (n, p) describes the sum of n independent Bernoulli trials of parameter p (the probability of success). Run the code in BinomialDist.r. What do you think this distribution is telling you? Here the y-axis is the probability and the x-axis is the number of successful trials (coin flips) out of n, given the probability (p). So as you might expect, given p = 0.5, the likelihood of all 10 trials succeeding during one observation is extremely low. Adjust the values of p and n and assess the changes to the distribution.
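The binomial probabilities described above can be sketched directly with `dbinom()`. This is not BinomialDist.r; the values n = 10 and p = 0.5 follow the coin-flip example in the text, and the variable names are my own.

```r
n <- 10    # number of Bernoulli trials (coin flips)
p <- 0.5   # probability of success on each trial

k     <- 0:n
probs <- dbinom(k, size = n, prob = p)   # P(exactly k successes)

barplot(probs, names.arg = k,
        xlab = "number of successes", ylab = "probability")

# Probability that all 10 trials succeed: 0.5^10, about 0.001.
p_all <- dbinom(n, size = n, prob = p)
p_all

# One binomial observation is the sum of n Bernoulli (0/1) draws.
set.seed(5)
bernoulli <- rbinom(n, size = 1, prob = p)
sum(bernoulli)
```

Setting `p` away from 0.5 shifts the peak of the bar plot and makes the distribution asymmetric.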
The Poisson distribution measures the likelihood of a number of events occurring within a given time interval. Its key parameter is the average number of events in the given interval (lambda). The resulting distribution looks similar to the binomial, with positive skewness that decreases as lambda increases.
More formally, it is a probability distribution such that: (1) the probability of observing an event in a "small" interval is proportional to the size of this interval (in particular, it does not depend on the position of this interval on the time-axis); (2) the probability that an event occurs in a given interval is independent of the probability that an event occurs in any other disjoint interval; (3) the events are never simultaneous.
One can show that these conditions uniquely define a probability distribution, where lambda is the average number of events per unit of time: PoissonDist.r
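As a sketch of how the shape changes with lambda (this is not PoissonDist.r, and the lambda values 1, 4 and 10 are arbitrary choices), the code below plots the Poisson probabilities with `dpois()` and checks one case against the standard formula P(X = k) = lambda^k * exp(-lambda) / k!.

```r
k       <- 0:20
lambdas <- c(1, 4, 10)

# One column of probabilities per value of lambda.
pois <- sapply(lambdas, function(l) dpois(k, lambda = l))

matplot(k, pois, type = "b", pch = 1:3, lty = 1, col = 1:3,
        xlab = "number of events", ylab = "probability")
legend("topright", legend = paste("lambda =", lambdas),
       col = 1:3, pch = 1:3, lty = 1)

# dpois() agrees with the formula lambda^k * exp(-lambda) / k!.
manual <- 4^k * exp(-4) / factorial(k)
all.equal(manual, dpois(k, lambda = 4))

# Theoretical skewness of a Poisson distribution is 1 / sqrt(lambda).
skew <- 1 / sqrt(lambdas)
skew
```

The curve for lambda = 1 is strongly right-skewed, while the curve for lambda = 10 is already fairly symmetric, which matches the decreasing values in `skew`.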
COMPUTER EXERCISE ASSIGNMENT
There isn't a homework assignment!