\chapter{Probability theory and parameter estimation}
\section{Basic Probability Theory}
\vspace{-.1in}
\subsection{Some definitions}
\begin{description}
\item[sample space:] the set of all possible sample outcomes, denoted
  $S$
%\footnote{Sometimes the sample space is denoted by $\Omega$.}
\item[event:] a collection of sample outcomes; any subset of $S$ ($A
  \subseteq S$)
% \item[simple event] an event consisting of a single sample outcome; an
%  event that cannot be broken down into subevents
\item[probability:] a number assigned to each event $A$ in $S$, denoted $P(A)$
\end{description}

\vspace{-.2in}
\subsection{Axioms}
\vspace{-.1in}
Probabilities obey the following axioms:
\vspace{-.1in}
\begin{enumerate}
\item $P(A) \ge 0$ for all $A \subseteq S$
\item $P(S) = 1$
\item For mutually exclusive events $A$ and $B$: $P(A \cup B) = P(A) +
  P(B)$ 
\end{enumerate}
\vspace{-.1in}
\small For a countably infinite collection of mutually exclusive events,
axiom 3 must be extended to cover the infinite union (countable
additivity). \normalsize

\vspace{-.1in}
\subsection{Conditional probability, etc.}
\vspace{-.1in}
Definition of {\em conditional probability} (``probability of $A$
given $B$''):
\begin{equation}\label{condprob}
  P(A | B) = \frac{P(A \cap B)}{P(B)}  
\end{equation}
``Probability of $A$ {\em and} $B$'' (conjunction):
\begin{equation} \label{andprob}
  P(A \cap B) = P(B) P(A|B) = P(A) P(B|A)
\end{equation}
If $A$ and $B$ are {\em independent} then  $P(A|B) = P(A)$ and $P(B|A)
= P(B)$ and Equation~\ref{andprob} reduces to the familiar
``multiplication rule'':
\begin{equation} \label{indepandprob}
      P(A \cap B) =  P(A) P(B)
\end{equation}

``Probability of $A$ or $B$'' (disjunction):
\begin{equation} \label{orprob}
  P(A \cup B) = P(A) + P(B) - P(A \cap B)
\end{equation}
If $A$ and $B$ are {\em mutually exclusive\/} or {\em disjoint\/},
then $P(A \cap B) = 0$ and Equation~\ref{orprob} reduces to the
familiar ``addition rule'':
\begin{equation}
  P(A \cup B) = P(A) + P(B) 
\end{equation}
Note that the familiar addition and multiplication rules depend on
important assumptions, namely, mutual exclusivity and independence,
respectively.
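
As a quick check of these rules, consider an illustrative example (the
die and the events are assumptions chosen only for concreteness): roll
a fair six-sided die, and let $A$ be ``the roll is even'' and $B$ be
``the roll is greater than 3.''  Then $P(A) = P(B) = \frac{1}{2}$ and
$P(A \cap B) = P(\{4,6\}) = \frac{1}{3}$, so:
\begin{equation}
  P(A \cup B) = \frac{1}{2} + \frac{1}{2} - \frac{1}{3} = \frac{2}{3}
  \;\;\;\;\;\;\;
  P(A | B) = \frac{1/3}{1/2} = \frac{2}{3}
\end{equation}
Because $P(A|B) \ne P(A)$, the events are not independent, and the
simple multiplication rule would give the wrong answer
($\frac{1}{2} \cdot \frac{1}{2} = \frac{1}{4} \ne \frac{1}{3}$).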
\vspace{-.1in}
\subsection{Bayes's Rule}
\vspace{-.1in}
From Equation~\ref{andprob} we can easily derive Bayes's Rule:
\begin{equation}  \label{bayes1}
  P(B|A) = \frac{P(A|B) P(B)}{P(A)}
\end{equation}
Another expression of Bayes's Rule uses an ``unpacked'' version of
$P(A)$, using the Law of Total Probability:
\begin{equation} \label{totprob}
  P(A) = \sum_{i=1}^{n} P(A \cap B_i)  = \sum_{i=1}^{n} P(A|B_i) P(B_i)
\end{equation}
where all $B_i$ are mutually exclusive and $B_1 \cup B_2 \cup \cdots
\cup B_n = S$.
Substituting the unpacked expression for $P(A)$ into
Equation~\ref{bayes1} we get:
\begin{equation}  \label{bayes2}
  P(B_j|A) = \frac{P(A|B_j) P(B_j)}{\sum_{i=1}^{n} P(A|B_i) P(B_i)}
\end{equation}
Bayes's Rule is useful because it allows one to ``switch''
conditional probabilities.  In particular, it can be used as a method
of ``updating'' probabilities in light of new data/evidence.  Let $H$
denote a hypothesis, and $D$ denote some observed data.
\begin{equation}  \label{bayes3}
  P(H|D) = P(D|H) \frac{P(H)}{P(D)}
\end{equation}
The term on the left represents the ``new'' belief about $H$ in light
of the data $D$.  $P(H)$ is the ``old'' belief about $H$. $P(D|H)$ is
the probability of getting data $D$ under hypothesis $H$; it is
closely related to a $p$-value, which is the probability under $H$ of
data at least as extreme as $D$.  $P(D)$ is the overall (marginal)
probability of the data.  We can break it up into more interpretable
pieces using the law of total probability.
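
To illustrate with made-up numbers, suppose a coin is either fair
($H_1$: probability of heads $0.5$) or biased ($H_2$: probability of
heads $0.8$), with prior beliefs $P(H_1) = P(H_2) = 0.5$, and suppose
the data $D$ are three heads in three independent flips.  Then
$P(D|H_1) = 0.5^3 = 0.125$ and $P(D|H_2) = 0.8^3 = 0.512$, and
$P(D) = (0.125)(0.5) + (0.512)(0.5) = 0.3185$, so:
\begin{equation}
  P(H_2|D) = \frac{(0.512)(0.5)}{0.3185} \approx 0.80
  \;\;\;\;\;\;\;
  P(H_1|D) = \frac{(0.125)(0.5)}{0.3185} \approx 0.20
\end{equation}
The data shift belief toward the biased coin, but do not settle the
matter.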

% An example: Updating hypotheses about biased coins.

Using Bayes's Rule, we can (in very simple situations) go from
$p$-values to the more interesting probabilities of the form $P(H|D)$.
Sometimes people will talk about $p$-values as if they were $P(H|D)$
probabilities.  Don't be fooled.

\vspace{-.2in}
\section{Random Variables}
\vspace{-.1in}
\begin{description}
\item[random variable:]  a real-valued function for which the domain
  is the sample space % (i.e., a mapping from $S \rightarrow \cal R$) 
\end{description}
A random variable is a way of assigning a real number to each outcome
in $S$.  More informally, it is a ``numerical event'' whose value will
vary from experiment to experiment.

\vspace{-.2in}
\subsection{Discrete and Continuous Random Variables}
\vspace{-.1in}
If a random variable takes on a finite or countably infinite number of
values (say, $1,2,3,\ldots$) it is {\em discrete\/}.
Otherwise, it is {\em continuous\/}.

Examples of discrete random variables: 
\vspace{-.1in}
\begin{itemize}
% \item the number of spots showing after a die is rolled
\item the number of coin flips made before the first
  head appears
\item the number of words remembered out of a list of $n$ words
\item the number of people waiting in line at the post office at time
  $t$
\item the number circled on a 7-point rating scale measuring mood
\end{itemize}
\vspace{-.1in}
Examples of continuous RVs:
\vspace{-.1in}
\begin{itemize}
\item the height of a randomly chosen individual
\item a subject's response latency in naming the color of a word
\item the length of time between arrival at the post office and being
  served
\end{itemize}
\vspace{-.1in}
Our goal is to understand the ``behavior'' of random variables, and to
learn some tools for dealing with them.

\section{Probability Functions}
\vspace{-.1in}
\subsection{Discrete random variables}
\vspace{-.1in}
\subsubsection{Probability density functions}
\vspace{-.1in} 
The behavior of a discrete random variable can be
described by its probability density function (pdf\footnote{This function
  is often referred to simply as the ``probability function'' or
  ``probability mass function.''  However, for parallelism with the
  continuous case, we will use the phrase ``probability density
  function'' or simply ``pdf.''}).
For a discrete RV $Y$, the pdf $f_Y(y)$ describes the probability that
the RV will take on each of its possible values:
\begin{equation}
    f_Y(y) = P(Y=y) 
\end{equation}
for all possible values of $y$.

For example, let $Y$ be the number of heads obtained in one flip of a
fair coin:
\begin{equation}
  f_Y(y) = \left\{ 
  \begin{array}{cl} 
    \frac{1}{2} & \mbox{$y=0,1$} \\ 
    0 & \mbox{elsewhere} 
  \end{array}
\right.
\end{equation}
Notes:
\vspace{-.1in}
\begin{enumerate}
\item Capital letters will generally denote random variables; lower
  case letters will denote possible values/outcomes/realizations of
  random variables.
\item MWS uses the notation $p(y)$ for discrete pdfs.
\item Often the subscript $Y$ will be dropped, and the pdf will be
  denoted simply $f(y)$.
\item The probability density function can be specified by a table, a
  graph, or a mathematical expression.
\end{enumerate}

Key properties of a discrete pdf:
\vspace{-.1in}
\begin{itemize}
\item $0 \le f(y) \le 1$ for all $y$
\item $\sum_y f(y) = 1$ (summing over all
  values of $y$ that have positive probability)
\end{itemize}

\vspace{-.1in}
\subsubsection{Cumulative distribution functions} 
\vspace{-.1in}
Another way of describing a random variable's distribution is with a
{\em cumulative distribution function\/} (or cdf).

The cdf for a random variable gives the probability that the RV will
take on a value less than or equal to some possible value.  That is,
\begin{equation}
  F_Y(y) = P(Y \le y) 
\end{equation}

So for the coin example:
\begin{equation}
  F(y) =\left\{ 
  \begin{array}{cl} 
    0 & y<0 \\
    \frac{1}{2} & 0 \le y < 1 \\
    1 & y \ge 1
  \end{array}
  \right.
\end{equation}

Properties of a cdf:
\vspace{-.1in}
\begin{itemize}
\item ${\displaystyle \lim_{y \rightarrow -\infty}} F(y) = F(-\infty) = 0$
\item ${\displaystyle \lim_{y \rightarrow \infty}} F(y) = F(\infty) = 1$
\item $F(y)$ is a non-decreasing function: $F(y_b) \ge F(y_a)$ if $y_b
  > y_a$ 
\end{itemize}
\vspace{-.1in}

For discrete cdfs only:
\vspace{-.1in}
\begin{itemize}
\item $F(y)$ is a step function (i.e., it has ``jumps'').
\end{itemize}

\vspace{-.1in}
\subsection{Continuous random variables}
\vspace{-.1in}
Continuous random variables can take on ``too many'' values
for us to be able to list the probability for each.  Thus, the
definition of pdf used before, $f(y) = P(Y = y)$, will not work for
continuous RVs.  In fact, for any specific value $y$, $P(Y=y) = 0$.

However, we can define the pdf for a continuous RV to be the
nonnegative function $f(y)$ such that, for any $a \le b$:
\begin{equation} \label{contcdf1}
  P(a \le Y \le b) = \int_a^b f(y) dy
\end{equation}
The cdf can be defined as before: $F(y)=P(Y \le y)$. Expressing it
in terms of Equation~\ref{contcdf1}:
\begin{equation} \label{contcdf2}
  F(y)=P(Y \le y) = \int_{-\infty}^y f(t) dt
\end{equation}
(\small Note: $t$ is a dummy variable used because $y$ is used in the
limits of the integral. \normalsize)

It follows from Equation~\ref{contcdf2} and the good old fundamental
theorem of calculus that $f(y)$ is the derivative of $F(y)$ (with
respect to $y$):  
\begin{equation}
  f(y) = F'(y) = \frac{dF(y)}{dy}
\end{equation}
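
For a concrete (purely illustrative) case, take the density $f(y) = 2y$
for $0 \le y \le 1$ and $f(y) = 0$ elsewhere.  For $0 \le y \le 1$:
\begin{equation}
  F(y) = \int_0^y 2t \, dt = y^2
\end{equation}
with $F(y) = 0$ for $y < 0$ and $F(y) = 1$ for $y > 1$.  Differentiating
recovers $F'(y) = 2y = f(y)$ on $(0,1)$, and interval probabilities
follow from the cdf; e.g., $P(\frac{1}{2} \le Y \le 1) = F(1) -
F(\frac{1}{2}) = 1 - \frac{1}{4} = \frac{3}{4}$.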
\vspace{-.1in}
\subsection{Summary of pdf, cdf properties}
\vspace{-.1in}
\begin{center}
  \begin{tabular}{l|c|c|}
 & Discrete & Continuous \\
\hline
 & & \\
pdfs & $0 \le f(y) \le 1$ & $0 \le f(y)$ \\
 & & \\
cdfs  & $0 \le F(y) \le 1$ & $0 \le F(y) \le 1$ \\
 & & \\
cdf limits & $F(-\infty)=0, F(\infty)=1$ & $F(-\infty)=0,
F(\infty)=1$ \\
 & & \\
pdfs sum/integrate to 1 & ${\displaystyle \sum_y} f(y) = 1$ &
${\displaystyle \int_{-\infty}^{\infty}} f(y) dy = 1$ \\
 & & \\
$P(a \le Y \le b)$ & ${\displaystyle\sum_{y \in [a,b]}} f(y) $ &
${\displaystyle \int_a^b} f(y) dy$ \\
 & & \\
$P(a < Y \le b)$ & $F(b) - F(a)$ & $F(b) - F(a)$ \\
 & & \\
\hline
  \end{tabular}
\end{center}

\section{Expected Value and Variance}
The pdf (or cdf) of a random variable can be cumbersome to work
with, so we will often want summary measures of an RV's
distribution.

One popular summary measure is the RV's {\em expected value\/}, or
{\em mean\/}, or {\em expectation\/}.  The expected value of an RV
$Y$, denoted $E(Y)$, is the ``average'' value of $Y$.

In calculating the EV, we weight each possible value of the RV by its
pdf, and sum (or integrate) over all possible values.  In symbols, for
the discrete case:
\begin{equation}
  E(Y) = \sum_{y} y f(y) 
  \label{evdisc}
\end{equation}
For the continuous case, we replace the summation with an integral:
\begin{equation}
  E(Y) = \int_{-\infty}^{\infty} y f(y) dy 
  \label{evcont}
\end{equation}
Notes:
\vspace{-.1in}
\begin{itemize}
\item The expected value need not be a value that the RV can ever
  realize.
% (consider the expected value of the number of heads appearing
%   on a single coin flip). 
\item Some distributions do not have defined expected values, because
  the sum or integral does not converge to any finite value.
\end{itemize}
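To illustrate the first note with an assumed example, let $Y$ be the
number of spots showing on a fair six-sided die, so that $f(y) =
\frac{1}{6}$ for $y = 1, \ldots, 6$.  Then:
\begin{equation}
  E(Y) = \sum_{y=1}^{6} y \cdot \frac{1}{6} = \frac{21}{6} = 3.5
\end{equation}
which is not a value the die can ever show.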

\subsection{Expected value of a function of $Y$}
To determine the expected value of a function $g(Y)$ of a random variable
$Y$, we simply replace the $y$ in the expressions above with $g(y)$:
\begin{equation} \label{expfunc}
    E[g(Y)] = \int_{-\infty}^{\infty} g(y) f(y) dy 
\end{equation}
The discrete case is similar.  The logic of this formula is that
instead of weighting each value $y$ by $f(y)$, we now weight
each value $g(y)$ by $f(y)$, the pdf of $Y$ evaluated at $y$.

\subsection{Properties of expected values}
The following are very useful properties of expected values:
\begin{itemize}
\item Expected value of a constant: $E(c) = c$
\item Expected value of a linear transformation of an RV: $E(aY + b) =
  a E(Y) + b$ 
\item Expected value of a sum of RVs: $E(Y_1 + Y_2 + \cdots + Y_n) =
  E(Y_1) + E(Y_2) + \cdots + E(Y_n)$ 
\end{itemize}
These properties hold for both discrete and continuous RVs.
\vspace{-.1in}
\subsection{Median}
Another measure of ``location'' of a random variable's distribution is
the {\em median\/}, which is the value(s) of $y$ such that $F(y) =
\frac{1}{2}$. The median is the 50th percentile of the distribution.
Other percentiles are defined similarly in terms of the cdf; i.e., the
$p$th percentile is the value of $y$ such that $F(y) = \frac{p}{100}$.

\subsection{Variance}
The expected value is a measure of a random variable's location.  The
variance is a measure of a random variable's dispersion or spread.
The variance is defined as the expected value of a function of a
random variable: specifically, the {\em mean squared deviation\/} of an RV
from its expected value:
\begin{equation}
  Var(Y) = E[(Y - E(Y))^2] = E[(Y-\mu)^2]
\end{equation}
where $\mu$ denotes the expected value of $Y$.
Variances can be computed using Equation~\ref{expfunc}.  The following
relations can be useful:
\begin{equation}
  Var(Y) = E(Y^2) - \mu^2 = E[Y (Y-1)] + \mu - \mu^2
\end{equation}
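
As a check on these relations, consider an assumed example: let $Y$ be
the number of spots on a fair six-sided die, so that $\mu = \frac{7}{2}$,
$E(Y^2) = \frac{1}{6}(1 + 4 + 9 + 16 + 25 + 36) = \frac{91}{6}$, and
$E[Y(Y-1)] = \frac{1}{6}(0 + 2 + 6 + 12 + 20 + 30) = \frac{35}{3}$.
Both routes give the same variance:
\begin{equation}
  Var(Y) = \frac{91}{6} - \frac{49}{4} = \frac{35}{12}
  \;\;\;\;\;\;\;
  Var(Y) = \frac{35}{3} + \frac{7}{2} - \frac{49}{4} = \frac{35}{12}
\end{equation}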
\vspace{-.3in}
\subsubsection{Properties of the variance}
\vspace{-.1in}
\begin{itemize}
\item Variance of a constant: $Var(c) = 0$
\item Variance of a linear transformation of $Y$: $Var(aY + b) = a^2
  Var(Y)$
\item For independent random variables only: $Var(Y_1 + \cdots + Y_n) = 
  Var(Y_1) + \cdots + Var( Y_n)$
\end{itemize}
\vspace{-.1in}
\section{Some Useful Distributions}
\vspace{-.1in}
Probability distributions are simple models for data we might observe
in ``the real world.'' We now turn to some examples of useful distributions.

Each of the distributions discussed below is actually a {\em family}
of distributions.  The probability functions involve unknown values
({\em parameters\/}) that specify a particular member of the family.
In practice, if a distribution family has been chosen as a plausible
model for some data, the main task is to estimate the parameters (or
functions thereof) of the distribution, and also get estimates of the
variability of these parameter estimates.  We will turn to these estimation
tasks later.
\vspace{-.1in}
\subsection{Discrete distributions}
\vspace{-.1in}
\subsubsection{Bernoulli distribution}
\vspace{-.1in}
If a random variable can take on only two possible values, the
appropriate model is the Bernoulli distribution. Imagine a biased coin
that has an unknown probability of turning up heads (we'll call that
probability $p$).  Let $X=1$ if heads turn up, $X=0$ if tails turn up.
$X$ has a Bernoulli distribution with parameter $p$ ($0 \le p \le 1$).
The probability density function (pdf) for a
Bernoulli($p$) random variable is:
\begin{equation} \label{Bernpdf}
  f(x) = \left\{ 
  \begin{array}{cl} p & \mbox{if $x=1$} \\ 
    1-p & \mbox{if $x=0$} \\
    0 & \mbox{elsewhere}
  \end{array}
\right.   
\end{equation}
(Note: $1-p$ is often denoted $q$.)
\paragraph{Properties}
\vspace{-.2in}
\begin{itemize} \setlength{\itemsep}{.02in}
\vspace{-.2in}
\item $E(X) = p$
\item $Var(X) = pq = p - p^2$
\end{itemize}
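Both properties follow directly from the definitions of expected value
and variance; a short sketch of the calculation:
\begin{equation}
  E(X) = 1 \cdot p + 0 \cdot (1-p) = p
  \;\;\;\;\;\;\;
  E(X^2) = 1^2 \cdot p + 0^2 \cdot (1-p) = p
\end{equation}
so that $Var(X) = E(X^2) - [E(X)]^2 = p - p^2 = p(1-p) = pq$.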
\vspace{-.1in}
\subsubsection{Binomial distribution}
\vspace{-.1in}
Imagine flipping the biased coin $n$ times.  Let $Y$ be the number
of heads occurring in the $n$ flips.  $Y$ can be considered to be a
sum of $n$ Bernoulli variables (one for each flip).  A sum of $n$ independent
Bernoulli($p$) variables has a Binomial($n,p$) distribution:
\begin{equation}
  \label{binpdf}
  f(y) = \left(\begin{array}{c} n \\ y \end{array}\right) 
  p^y q^{n-y} = \frac{n!}{y! (n-y)!} p^y q^{n-y} \;\;\;\;\; y=0, 1,
  2, \ldots, n
\end{equation}
% \small Note: The exclamation point (!) is the factorial operator.
% Recall that $y! = y (y-1) (y-2) \cdots (2) (1)$ \normalsize 
\vspace{-.1in}
\paragraph{Properties}
\vspace{-.2in}
\begin{itemize} \setlength{\itemsep}{.02in}
\vspace{-.2in}
\item $E(Y) = np $
\item $Var(Y) = npq$
\item Bernoulli($p$) is equivalent to Binomial(1, $p$)
\end{itemize}
\vspace{-.2in}
\subsubsection{Poisson distribution}
\vspace{-.1in}
We can consider a limiting case of the binomial where $n$ gets very
large while $p$ gets very small and the product $np$ is constant.
Imagine that calls come in to a 1--900 number ``at random'' over a
time period $T$.  Let's call the average number of calls coming in
(during period $T$) $\lambda$.  Imagine breaking up the interval $T$
into $n$ very small subintervals of length $T/n$.  The probability
that a call comes in during the subinterval is $p = \lambda/n$.  The
total number of calls coming in during the entire interval $T$ can
then be considered a Binomial($n,\lambda/n$) random variable.  If we
let $n \rightarrow \infty$ though, we get a new distribution: the
Poisson distribution. The pdf of a Poisson random variable $Y$ is:
\begin{equation}
  \label{poispdf}
  f(y) = \frac{e^{-\lambda} \lambda^y}{y!} \;\;\;\;\;\; y=0, 1, 2, \ldots
\end{equation}
\paragraph{Properties}
\vspace{-.2in}
\begin{itemize} \setlength{\itemsep}{.02in}
\vspace{-.2in}
\item $E(Y) = \lambda $
\item $Var(Y) = \lambda $
\item Mean and variance are equal.
\item Limiting case of Binomial($n,p$) as $n \rightarrow \infty$ and
  $np \rightarrow \lambda$
\item Applications:
%  \vspace{-.1in}
  \begin{itemize}
    \setlength{\itemsep}{.02in}
  \item number of rare events happening over time or space
  \item approximation to binomial for large $n$ and small $p$
  \end{itemize}
\end{itemize}
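The limiting argument can be sketched as follows (an outline, not a
rigorous proof).  Substituting $p = \lambda/n$ into the binomial pdf
and regrouping terms:
\begin{eqnarray*}
  f(y) & = & \frac{n!}{y! (n-y)!} \left(\frac{\lambda}{n}\right)^y
             \left(1 - \frac{\lambda}{n}\right)^{n-y} \\
       & = & \frac{\lambda^y}{y!} \cdot
             \frac{n (n-1) \cdots (n-y+1)}{n^y} \cdot
             \left(1 - \frac{\lambda}{n}\right)^{n} \cdot
             \left(1 - \frac{\lambda}{n}\right)^{-y}
\end{eqnarray*}
As $n \rightarrow \infty$ with $y$ fixed, the second and fourth factors
tend to 1 and the third tends to $e^{-\lambda}$, leaving
$e^{-\lambda} \lambda^y / y!$.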
\vspace{-.2in}
\subsubsection{Geometric distribution}
\vspace{-.1in}
Consider performing a series of Bernoulli trials (or coin flips) until
the first success (or head) appears.  The number of trials required
for the first success is a random variable following the geometric
distribution.  The geometric pdf is:
\begin{equation}
  f(y) = q^{y-1} p \;\;\;\;\; y=1,2,3,\ldots
  \label{geompdf}
\end{equation}
\paragraph{Properties}
\vspace{-.2in}
\begin{itemize}
\setlength{\itemsep}{.02in}
\vspace{-.2in}
\item $E(Y) = \frac{1}{p}$
\item $Var(Y) = \frac{1-p}{p^2}$
\item Memoryless property: $P(Y > a + b \; | \; Y > a) = P(Y > b)$
\item Applications: discrete-interval waiting times
\end{itemize}
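The memoryless property can be verified in one line.  The event $Y > k$
means that the first $k$ trials are all failures, so $P(Y > k) = q^k$
for any nonnegative integer $k$.  Hence, for nonnegative integers $a$
and $b$:
\begin{equation}
  P(Y > a + b \; | \; Y > a) = \frac{P(Y > a + b)}{P(Y > a)} =
  \frac{q^{a+b}}{q^a} = q^b = P(Y > b)
\end{equation}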
\vspace{-.1in}
\subsection{Continuous Distributions}
\vspace{-.1in}
\subsubsection{Uniform distribution}
A uniform RV has a density that is indeed uniform over some range of
values; the pdf is constant on that range:
\begin{equation} \label{unifpdf}
  f(y) = \left\{ 
  \begin{array}{cl} 
    \frac{1}{\theta_2-\theta_1} & \mbox{$\theta_1 \leq y \leq
      \theta_2$} \\  
    0 & \mbox{elsewhere} 
  \end{array}
\right.
\end{equation}
\paragraph{Properties}
\vspace{-.2in}
\begin{itemize} \setlength{\itemsep}{.02in}
\vspace{-.2in}
\item $E(Y) = \frac{\theta_1 + \theta_2}{2}$
\item $Var(Y) = \frac{(\theta_2 - \theta_1)^2}{12}$
\end{itemize}
\vspace{-.1in}
\subsubsection{Exponential distribution}
\vspace{-.1in}
The exponential distribution is a special case of a two-parameter
distribution known as the gamma distribution.  The pdf for the
exponential is:
\begin{equation} \label{expopdf}
  f(y) = \left\{ 
  \begin{array}{cl} 
    \frac{1}{\beta} e^{-y/\beta} & \mbox{$0 \leq y < \infty$} \\
    0 & \mbox{elsewhere} 
  \end{array}
\right.
\end{equation}
\paragraph{Properties}
\vspace{-.2in}
\begin{itemize} \setlength{\itemsep}{.02in}
\vspace{-.2in}
\item $E(Y) = \beta$
\item $Var(Y) = \beta^2$
\item Memoryless property: $P(Y > a + b \; | \; Y > a) = P(Y > b)$
\item Applications: failure times, arrival times, times between random
  (Poisson) events
\end{itemize}
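As a sketch of why the memoryless property holds, first integrate the
pdf to get the upper-tail probability for $y \ge 0$:
\begin{equation}
  P(Y > y) = \int_y^{\infty} \frac{1}{\beta} e^{-t/\beta} \, dt =
  e^{-y/\beta}
\end{equation}
so the cdf is $F(y) = 1 - e^{-y/\beta}$.  Then, for $a, b \ge 0$:
\begin{equation}
  P(Y > a + b \; | \; Y > a) = \frac{e^{-(a+b)/\beta}}{e^{-a/\beta}} =
  e^{-b/\beta} = P(Y > b)
\end{equation}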
\vspace{-.1in}
\subsubsection{Normal/Gaussian distribution}
\vspace{-.1in}
This is the granddaddy of 'em all.  We'll discuss this distribution in
detail later, but just for completeness, here's its pdf:
\begin{equation}
f(y) = \frac{1}{\sigma \sqrt{2 \pi}} e^{-(y-\mu)^2/(2\sigma^2)}
  \;\;\;\;\; -\infty < y < \infty
  \label{normpdf}
\end{equation}
\paragraph{Properties}
\vspace{-.2in}
\begin{itemize} \setlength{\itemsep}{.02in}
\vspace{-.2in}
\item $E(Y) = \mu$
\item $Var(Y) = \sigma^2$
\item The cdf has no closed-form expression in terms of elementary
  functions; that's why there are so many normal-area tables around.
\item The pdf is symmetric around $\mu$ (so $\mu$ is also the median).
\end{itemize}

\section{Bivariate Distributions}
The joint pdf $f(x,y)$ is a density function defined over two dimensions.  
For the continuous case, probabilities can be obtained by integrating
over the two dimensions:
\begin{equation}
  P(a \le X \le b \; {\rm and} \; c \le Y \le d) = 
  \int_a^b \int_c^d  f(x,y) dy dx = 
  \int_c^d \int_a^b  f(x,y) dx dy 
\end{equation}

\subsection{Marginal and conditional densities}
The marginal pdf for $X$ can be obtained by integrating the joint pdf
over $y$:
\begin{equation}
  f_X(x) = \int f(x,y) dy
\end{equation}
Similarly for $Y$:
\begin{equation}
  f_Y(y) = \int f(x,y) dx
\end{equation}
{\small (Note that these relations are analogous to the law of total
  probability.)}

Conditional densities can be defined, analogous to the definition of
conditional probability, as follows:
\begin{equation}
  f(y|x) = \frac{f(x,y)}{f_X(x)}
\end{equation}
\begin{equation}
  f(x|y) = \frac{f(x,y)}{f_Y(y)}
\end{equation}
{\small (Note that the conditional density is defined to be 0 when the
  denominator density is 0.)}

\subsection{Independent random variables}
Two random variables are independent if and only if their joint
density is a product of their marginal densities:
\begin{equation}
  f(x,y) = f_X(x) f_Y(y)
\end{equation}
A similar relation applies to cdfs under independence:
\begin{equation}
  F(x,y) = F_X(x) F_Y(y)
\end{equation}
Under independence, then, it is clear that:
\begin{equation}
  f(x|y) = f_X(x) \;\;\;\;\;\;\; f(y|x) = f_Y(y)
\end{equation}

\begin{center}
\begin{tabular}{r|c|c|}
 & Probabilities & Densities \\
\hline
conjunction & $P(A \cap B)$ & $f(x,y)$ \\
 & & \\
conditional & $P(A | B)$ & $f(x|y)$ \\
 & & \\
independence & $P(A \cap B) = P(A) P(B)$ & $f(x,y) = f_X(x) f_Y(y)$ \\
 & & \\
\end{tabular}
\end{center}
Bayes's Rule (with densities):
\begin{equation}
  f(y|x) = \frac{f(x|y) f_Y(y)}{\int f(x|y) f_Y(y) dy}
\end{equation}

\subsection{Expectation of a function of random variables}

For the continuous case:
\begin{equation}
  E[g(X,Y)] = \int \int g(x,y) f(x,y) dy dx
\end{equation}

For instance:
\begin{equation}
  E(XY)= \int \int xy f(x,y) dy dx
\end{equation}

Note that for independent random variables $X$ and $Y$, $E(XY)=E(X)
E(Y).$

\subsection{Covariance}
The covariance is defined as:
\begin{equation}
  Cov(X,Y) = E[(X - E(X)) (Y-E(Y))] = E(XY) - E(X)E(Y)
\end{equation}

Some properties of the covariance:
\begin{itemize}
\item $Cov(X,Y) = Cov(Y,X)$
\item $Cov(X,X) = Var(X)$
\item $Cov(aX+b,cY+d) = ac \, Cov(X,Y)$
\item $Cov(W+X,Y+Z) = Cov(W,Y) + Cov(W,Z) + Cov(X,Y) + Cov(X,Z)$
\item $Var(X+Y) = Var(X) + Var(Y) + 2Cov(X,Y)$
\item For independent RVs, $Cov(X,Y) = 0$.  (Note that the converse is
  {\bf not} true, in general.)
\end{itemize}
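A small counterexample (assumed here for illustration) shows why zero
covariance does not imply independence.  Let $X$ take the values $-1$,
$0$, and $1$ with probability $\frac{1}{3}$ each, and let $Y = X^2$.
Then $E(X) = 0$ and $E(XY) = E(X^3) = 0$, so:
\begin{equation}
  Cov(X,Y) = E(XY) - E(X) E(Y) = 0 - 0 \cdot \frac{2}{3} = 0
\end{equation}
Yet $Y$ is completely determined by $X$: $P(X=0 \; {\rm and} \; Y=0) =
\frac{1}{3}$, whereas $P(X=0) P(Y=0) = \frac{1}{3} \cdot \frac{1}{3} =
\frac{1}{9}$.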

\subsection{Correlation}
The population correlation coefficient is defined to be the
standardized covariance of the two RVs:
\begin{equation}
  \rho(X,Y) = \frac{Cov(X,Y)}{\sigma_X \sigma_Y} =
  \frac{Cov(X,Y)}{\sqrt{Var(X) Var(Y)}}
\end{equation}

\section{Estimation}

\begin{description}
\item[parameters:] constants that specify a particular member of a
  family of distributions 
\item[statistic:] a function of random variables; usually a function
  of a random sample ($Y_1, \ldots, Y_n$) from some probability
  distribution. 
\item[estimator:] an expression for estimating an unknown parameter; a
  function of random variables that estimates a parameter (e.g.,
  $\frac{1}{n} \sum_i Y_i$)
\item[estimate:] numerical quantity resulting from inserting
  sample values into estimator (e.g., $\frac{1}{n} \sum_i
  y_i$) 
\end{description}

\subsection{Unbiasedness}
An estimator $\hat\theta$ of a parameter $\theta$ is {\em unbiased}
if: 
\begin{equation} 
  E(\hat\theta) = \theta 
\end{equation}
(i.e., if the expected value of the estimator is the desired parameter.)

The {\em bias\/} of an estimator is the difference between the
estimator's expected value and the desired parameter:
\begin{equation}
  B = E(\hat\theta) - \theta
\end{equation}

\subsection{Mean Square Error}
The mean square error, or MSE, of an estimator $\hat\theta$ is the
expected squared deviation of the estimator from the target parameter:
\begin{equation}
  MSE(\hat\theta) = E[(\hat\theta - \theta)^2] = Var(\hat\theta) + B^2
\end{equation}
Sometimes minimizing an estimator's MSE involves a tradeoff between
variance and bias (this is the idea behind ``ridge
regression,'' for instance).
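
The decomposition of the MSE can be sketched by adding and subtracting
$E(\hat\theta)$ inside the square and expanding:
\begin{eqnarray*}
  E[(\hat\theta - \theta)^2]
  & = & E[\{(\hat\theta - E(\hat\theta)) + B\}^2] \\
  & = & E[(\hat\theta - E(\hat\theta))^2]
        + 2 B \, E[\hat\theta - E(\hat\theta)] + B^2 \\
  & = & Var(\hat\theta) + B^2
\end{eqnarray*}
The cross term vanishes because $E[\hat\theta - E(\hat\theta)] = 0$ and
$B$ is a constant.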

\subsection{Relative Efficiency}
Given two unbiased estimators $\hat\theta_1, \hat\theta_2$ of
parameter $\theta$, the relative efficiency of $\hat\theta_1$ with
respect to $\hat\theta_2$ is: 
\begin{equation}
  \frac{Var(\hat\theta_2)}{Var(\hat\theta_1)}
\end{equation}
The relative efficiency is greater than 1 if $\hat\theta_1$ is less
variable than $\hat\theta_2$.
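
For instance (an assumed example), let $Y_1, \ldots, Y_n$ be a random
sample from a distribution with mean $\mu$ and variance $\sigma^2$, and
compare $\hat\theta_1 = \bar{Y}$ with $\hat\theta_2 = Y_1$ (the first
observation alone).  Both are unbiased for $\mu$, and the relative
efficiency of $\bar{Y}$ with respect to $Y_1$ is:
\begin{equation}
  \frac{Var(Y_1)}{Var(\bar{Y})} = \frac{\sigma^2}{\sigma^2 / n} = n
\end{equation}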

\subsection{Consistency}
An estimator $\hat\theta$ is a {\em consistent} estimator of $\theta$ 
if:  
\begin{equation}
 \lim_{n \rightarrow \infty} P(| \hat\theta-\theta | \ge \epsilon)
= 0 \; \; \; {\rm for \; any \;}  \epsilon > 0 
\end{equation}
(i.e., as the sample size approaches infinity, the estimator
converges in probability to the desired parameter).
Two conditions that together are sufficient for consistency:
\begin{itemize}
\item Asymptotic unbiasedness
\item Estimator's variance $\rightarrow 0$ as $n \rightarrow \infty$ 
\end{itemize}

\section{Method of moments}
Define the $k$th moment (about the origin) of a random variable $X$ to
be: 
\begin{equation}
  \mu_k = E(X^k)
\end{equation}
If $X_1, \ldots, X_n$ are i.i.d. RVs, we can define {\bf sample} moments
similarly as:
\begin{equation}
  m_k = \frac{1}{n} \sum_{i=1}^n X_i^k
\end{equation}
{\bf Central moments} (or moments about the mean) are defined as:
\begin{equation}
  \mu_k^{\prime} = E[(X-\mu_1)^k]
\end{equation}
for the population and:
\begin{equation}
  m_k^{\prime} = \frac{1}{n} \sum_{i=1}^n (X_i - \bar{X})^k
\end{equation}
for the sample.

(Note that the mean is the first moment about the origin, and the
variance is the second central moment. The first central moment is
zero.) 

The idea of the method of moments is to use the sample moments ($m_1,
m_2, \ldots$)  as estimators of the population moments. Since the
population moments are functions of the parameters we usually want to
estimate, we can solve for the parameter(s) in terms of the sample
moments, and bingo, we have estimator(s) for the parameters.

In the case of a single parameter, we need only estimate the first
population moment with the first sample moment and solve for the
desired parameter.  In the case of $k$ parameters, we estimate the
first $k$ population moments with their sample analogues.  Then
solving for the parameters is an exercise in some fun algebra, using
$k$ equations to solve for $k$ unknowns.

Sometimes it'll be easier to deal with central moments rather than
moments about the origin. 
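
Three standard cases, sketched briefly, show the recipe at work.  For
Bernoulli($p$), $\mu_1 = p$, and for Uniform(0,$\theta$), $\mu_1 =
\theta/2$; setting $\mu_1 = m_1 = \bar{X}$ and solving gives:
\begin{equation}
  \hat{p} = \bar{X} \;\;\;\;\;\;\; \hat\theta = 2 \bar{X}
\end{equation}
For Normal($\mu,\sigma^2$) there are two parameters, with $\mu_1 = \mu$
and $\mu_2 = \sigma^2 + \mu^2$.  Equating these to $m_1$ and $m_2$ and
solving:
\begin{equation}
  \hat\mu = \bar{X} \;\;\;\;\;\;\;
  \hat\sigma^2 = m_2 - m_1^2 = \frac{1}{n} \sum_{i=1}^n (X_i - \bar{X})^2
\end{equation}
Note that this variance estimator is the second sample central moment;
it divides by $n$ rather than $n-1$.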

% Examples: Bernoulli($p$), Uniform(0,$\theta$), Normal($\mu,\sigma^2$)

\section{Method of maximum likelihood}
Let  $X_1, \ldots, X_n$ be a random sample from some distribution with
density function $f(x|\theta)$.  Given the observed values
$x_1,\ldots,x_n$, we define the {\bf likelihood function}: 

\begin{equation}
  L(\theta) = f(x_1, \ldots, x_n | \theta) = f(x_1 | \theta) \cdots
  f(x_n|\theta) = \prod_{i=1}^n f(x_i|\theta)
\end{equation}

The likelihood function gives the ``probability'' of observing the observed
data as a function of one or more unknown parameters.  It is a joint
density function in which roles are reversed; the data are considered
fixed and the parameter(s) are considered variable.

The idea of the method of maximum likelihood is to find the value of
$\theta$ that maximizes $L$, and use that value as an estimator/estimate of
$\theta$.  In effect, we're asking the question, ``for what value of
$\theta$ would the data be most probable?''

Often, it will be much easier to maximize the natural log of the
likelihood function rather than the likelihood function itself.
Because natural log is a monotonic transformation, $L$ and $\ln L$
will be maximized at the same values of the parameters.

\begin{equation}
  \ln L(\theta) = \sum_{i=1}^n \ln f(x_i|\theta)
\end{equation}

To maximize $L$ or $\ln L$, we will usually differentiate the function
with respect to the parameter(s) of interest, set the derivatives
equal to 0, and solve for the parameters.  Provided the solution is a
maximum (rather than a minimum or saddle point) and lies in the
interior of the parameter space, these solutions are the
maximum likelihood estimators (MLEs) of the parameters.
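
As a sketch of the procedure, take the Bernoulli($p$) model as an
assumed example, with $f(x|p) = p^x (1-p)^{1-x}$ for $x = 0, 1$.
Writing $s = \sum_{i=1}^n x_i$ for the number of successes:
\begin{equation}
  \ln L(p) = s \ln p + (n - s) \ln (1-p)
\end{equation}
Differentiating and setting the result equal to 0:
\begin{equation}
  \frac{d \ln L(p)}{dp} = \frac{s}{p} - \frac{n-s}{1-p} = 0
  \;\;\; \Longrightarrow \;\;\;
  \hat{p} = \frac{s}{n} = \bar{x}
\end{equation}
The second derivative, $-s/p^2 - (n-s)/(1-p)^2$, is negative for
$0 < p < 1$, so this solution is a maximum (when $0 < s < n$).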

\subsection{Properties of MLEs}

Maximum likelihood estimators have some nice properties. Under very
general conditions on $f$, MLEs are:
\begin{itemize}
\item asymptotically unbiased

\item consistent

\item asymptotically normal
\end{itemize}

The asymptotic variance of an MLE achieves the {\bf Cram\'er--Rao lower bound}:
\begin{equation}
  {\rm CRLB} = \frac{1}{n E[(\frac{\partial \ln f(x|\theta)}{\partial \theta})^2]}
  = \frac{-1}{n E[\frac{\partial^2 \ln f(x|\theta)}{\partial \theta^2}]}
\end{equation}

To summarize, for large $n$, the MLE $\hat\theta$ of $\theta$ is
approximately distributed as:
\begin{equation}
  \hat\theta \sim N \left(\theta, \frac{-1}{n E[\frac{\partial^2 \ln
      f(x|\theta)}{\partial \theta^2}]} \right)
\end{equation}
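
As an illustration (the Bernoulli model is assumed here as an example),
take $\ln f(x|p) = x \ln p + (1-x) \ln (1-p)$.  Then:
\begin{equation}
  \frac{\partial^2 \ln f(x|p)}{\partial p^2} =
  -\frac{x}{p^2} - \frac{1-x}{(1-p)^2}
\end{equation}
Taking the expectation over $X$ (using $E(X) = p$) gives $-\frac{1}{p} -
\frac{1}{1-p} = -\frac{1}{p(1-p)}$, so:
\begin{equation}
  {\rm CRLB} = \frac{p(1-p)}{n}
\end{equation}
This equals $Var(\bar{X})$ for a Bernoulli sample, and $\bar{X}$ is the
MLE of $p$, so in this case the bound is attained exactly rather than
only asymptotically.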

\subsubsection{Invariance property}
If $\hat\theta$ is the MLE of $\theta$ and $u(\theta)$ is a function
of $\theta$, then $u(\hat\theta)$ is the MLE of $u(\theta)$.

That is, to get the MLE of a function of a parameter, use the function
of the MLE of the parameter.


\section{Distributions Related to the Normal Distribution}

\subsection{Some properties of the normal distribution}

Let $X_1, \ldots, X_n \stackrel{\scriptscriptstyle iid}{\sim} N\left(\mu,\sigma^2 \right)$.
\begin{itemize}
\item Linear transformation: $aX_i + b \sim N\left(a\mu + b, a^2
  \sigma^2 \right)$ 
\item Standardization: $Z_i=\frac{X_i - \mu}{\sigma} 
  \sim N\left( 0,1 \right)$
\item Sum of normals: If $X$ and $Y$ are jointly normal with
  $X \sim N\left(\mu_X, \sigma_X^2 \right)$,
  $Y \sim N\left(\mu_Y, \sigma_Y^2 \right)$, and
  correlation $\rho$:
  \begin{equation}
    X + Y \sim N\left(\mu_X + \mu_Y, \sigma_X^2 + \sigma_Y^2 + 2 \rho
    \sigma_X \sigma_Y \right)
  \end{equation}
\item $\bar{X} \sim N\left(\mu,\frac{\sigma^2}{n} \right) $
\end{itemize}

\subsection{The $\chi^2$ distribution}

Let $Z_1, \ldots, Z_n \stackrel{\scriptscriptstyle iid}{\sim} N\left(0,1
\right)$
% be independent standard normal random variables.
\begin{itemize}
\item $Z_i^2$ follows a $\chi^2$ distribution with 1 degree of
  freedom:
  \begin{equation}
    Z_i^2 \sim \chi_1^2
  \end{equation}

\item $E(Z_i^2) = 1$

\item $Var(Z_i^2) = 2$

\item Sum of $n$ independent squared standard normals follows a $\chi_n^2$
  distribution:
  \begin{equation}
    V = \left(\sum_{i=1}^n Z_i^2 \right) \sim \chi_n^2
  \end{equation}
\item $E(V) = n$
\item $Var(V) = 2n$
\item $V$ follows a Gamma distribution with shape parameter
  $\frac{n}{2}$ and scale parameter 2.
\item Let $X_1, \ldots, X_n \stackrel{\scriptscriptstyle iid}{\sim} N\left( \mu, \sigma^2
  \right)$, and let $S^2 =
  \frac{1}{n-1}\sum_{i=1}^{n} (X_i - \bar{X})^2$.  Then:
  \begin{equation}
    \frac{(n-1) S^2}{\sigma^2} \sim \chi_{n-1}^2
  \end{equation}
\end{itemize}
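One consequence of the last result, sketched here using $E(V) = n$ and
$Var(V) = 2n$ for a $\chi_n^2$ variable: since $(n-1) S^2 / \sigma^2$
has expected value $n-1$ and variance $2(n-1)$,
\begin{equation}
  E(S^2) = \frac{\sigma^2}{n-1} \cdot (n-1) = \sigma^2
  \;\;\;\;\;\;\;
  Var(S^2) = \frac{\sigma^4}{(n-1)^2} \cdot 2(n-1) = \frac{2 \sigma^4}{n-1}
\end{equation}
so, for normal samples, $S^2$ is unbiased for $\sigma^2$ and its
variance shrinks to 0 as $n \rightarrow \infty$.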

\subsection{The $t$ distribution}


\begin{itemize}
\item If $Z$ is a standard normal random variable, and $V$ is a
  $\chi_n^2$ random variable, and $Z$ and $V$ are independent, then:
  \begin{equation}
    T = \frac{Z}{\sqrt{V/n}} \sim t_n
  \end{equation}

\item $E(T) = 0$ for $n > 1$

\item $Var(T) = \frac{n}{n-2}$ for $n > 2$

\item The $t_1$ distribution is identical to the ``standard''
  Cauchy distribution (location 0, scale 1).

\item $\frac{\bar{X} - \mu}{\sigma / \sqrt{n}} \sim N\left( 0,1
\right)$

\item $\frac{\bar{X} - \mu}{S / \sqrt{n}} \sim t_{n-1}$

\end{itemize}

\subsection{The $F$ distribution}

\begin{itemize}
\item If $U$ and $V$ are independent $\chi^2$ random variables with
  $m$ and $n$ degrees of freedom, respectively, then:
  \begin{equation}
    F =  \frac{U/m}{V/n} \sim F_{m,n}
  \end{equation}

\item $E(F) =\frac{n}{n-2}$ for $n > 2$
\item The variance of $F$ is pretty messy.

\item If $S_1^2$ and $S_2^2$ are sample variances based on independent
  samples of sizes $n_1$ and $n_2$ drawn from normal distributions
  with variances $\sigma_1^2$ and $\sigma_2^2$, respectively, then:
  \begin{equation}
    \frac{S_1^2 / \sigma_1^2}{S_2^2 / \sigma_2^2} \sim F_{n_1-1, n_2-1}
  \end{equation}
\end{itemize}

\subsection{Big summary table}
\begin{center}
  \begin{tabular}{r|c|c|c|}
    & $\chi_n^2$ & $t_n$ & $F_{m,n}$ \\
    \hline
     & & & \\
    Expected value & $n$ & 0 $ \;\;\; n > 1$ & $\frac{n}{n-2} \;\;\; n>2$ \\
     & & & \\
    Variance & $2n$ & $\frac{n}{n-2} \;\;\; n>2$ & --- \\
     & & & \\
    Dist'n Family & Gamma($\frac{n}{2},2$) & --- & ``Beta-like'' \\
     & & & \\
    Relation to others & $\sum_{i=1}^n Z_i^2$ & $\frac{Z}{\sqrt{V/n}}$ &
    $\frac{U/m}{V/n}$ \\
     & & & \\
    Example Statistic & $\frac{(n-1) S^2}{\sigma^2} \sim \chi_{n-1}^2$ & 
    $\frac{\bar{X} - \mu}{S / \sqrt{n}} \sim t_{n-1}$ &
    $\frac{S_1^2 / \sigma_1^2}{S_2^2 / \sigma_2^2} \sim F_{n_1-1, n_2-1}$
    \\
     & & & \\
     \hline
   \end{tabular}
 \end{center}