## STAT2011 Lecture Notes

### Shane Leviton

### 6-6-2016


### Contents

Table of discrete distributions
Introduction
Mathematical theory of probability
Sample space
Sample point
Examples
Events
Probability as a measure
Sigma algebra
Example of sigma algebra
Probability space
Eg: model of a fair die
Venn diagrams
De Morgan's laws (for set theory)
Eg
Equiprobable spaces
Claim: P(A) = |A|/|Ω|
Conditional probability
Independence
Law of total probability
Distributive law for set theory
Bayes' rule
Random variables
Discrete RV
Probability mass function of discrete RV
Examples of random variables
Joint probability mass function
Independence
Poisson distribution
Assumptions of Poisson
Distribution of a sum of random variables (arithmetic of RVs)
Convolution
Expectation
Definition
Examples of particular kinds
Bernoulli
Poisson
Binomial
Geometric
Negative binomial
Hypergeometric
Negative expectation
Expectation of a function of a RV
Proof
Proof of 2
Variance
Definition
Variance and expectation computation
Examples of variance
Table of discrete distributions
Expectation as a linear operator
Expected value of a sum X + Y
Random vector
Expectation multiplication
Claim
L1 and L2
Example of X ∈ L1 but not ∈ L2
What about the opposite?
So is V(X + Y) ∈ L2?
Claim
Claim: shifting V(X)
Covariance
Definition
Claim
So, going back to V(X + Y)
By induction: variance of X_i's
Chebyshev's inequality
Lemma: Markov's inequality
Law of large numbers
Convergence in probability
Theorem: weak law of large numbers
Strong law of large numbers
The multinomial distribution
Multinomial random vector
Goal: pmf for multinomial random vector
Estimation
The method of moments
Method of moments
For estimating
Maximum likelihood estimation (MLE)
Likelihood function
Hardy-Weinberg equilibrium
Theory
Example
So how do we rigorously compare different estimators?
Back to HW
Hardy-Weinberg equilibrium
Delta method
Parametric bootstrap method
Example: binomial
Example 2: Hardy-Weinberg equilibrium
Conditional expectation/variance
Conditional expectation
Example: die
Claim: L1
Random sums
Examples
Conditional expectation (the RV) (not a number, a random variable)
Definition
Theorem: expectation of conditional expectation (total expectation law)
What about the variance of conditional expectation? Is V(E(Y|X)) = V(Y)?
Conditional variance
Analogously
So, conditional variance
Continuous random variables
CDF (cumulative distribution function)
Definition
Continuous random variable definition
PDF: f
Examples of PDF/CDF distributions
Uniform distribution
Exponential distribution
Gamma distribution
Normal distribution
Quantiles
Quantile definition
Special quantiles
pth quantile
pth quantile definition
Quantile function: definition
Functions of random variables for continuous RV
Claim: uniform distribution
Claim: sampling
Sampling
Claim
Theorem: random variable functions
Examples of samples and functions
Joint distribution (continuous analogue of the joint pmf)
Joint density definition
Comments
Marginal distribution
Proof
Examples
Gamma/exponential joint
Uniform distribution on a region in a plane
Bivariate normal (special case)
Independent RVs (continuous)
For continuous
Definition
Examples
Conditional distributions
Continuous
Construction
Conditional density
Examples
Sampling in 2D
Discrete
Continuous
Sum of random variables
Continuous
Density of a sum of continuous random variables Z = X + Y (convolution)
Quotients of RVs
Formula for quotients
For independent RVs
Functions of jointly continuous RVs
Mapping
Blow-up factor
Jacobian: generally
Box-Muller sampling
Visual comparison of distributions
QQ plot
Examples
More general QQ plot
Empirical data
Extrema and order statistics
Sorted values: order statistics
Claim
Expectation of order statistics
Expectation of a continuous RV
Definition
Looking at examples
Expectation of a function for continuous RV
In the continuous case: expectation of a function
Variance for continuous
Equation to calculate
Variance of a linear sum
Covariance
Claim
Variance of a sum of independent RVs
Standard deviation
Correlation coefficient
Claim: correlation coefficient
Examples of variance etc.
Standard normal
General normal: X ∼ N(μ, σ²)
Gamma
Bivariate standard normal
Markov's inequality in continuous
Continuous
Chebyshev's inequality
Weak law of large numbers
Strong law of large numbers
Estimation for continuous
Examples
Standard normal
Normal
Exponential
Scaled Cauchy
Method of moments
Examples
Maximum likelihood estimation (MLE)
For continuous
Conditional expectation
Conditional expectation definition
More generally
Mixed distribution
Examples
Uniform and binomial
Natural-valued with iid sum
Expectation of mixed distributions
Eg 1
Eg 2
Variance of a random sum
Law of total probability
We can now prove this
Prediction
Predictor
Claim
Back to the prediction problem
Example of prediction
Moment generating function (MGF)
Definition
Calculating the MGF
So why are we doing this?
Theorem 1
Theorem 2
Weak convergence/convergence in distribution
Weakly converges
Characteristic function (CF)
Confidence intervals
Estimation
We want to know: how close are we to θ?
Confidence interval set-up
Constructing confidence intervals (CIs)
Examples of confidence intervals
Constructing approximate CIs based on the MLE
Confidence intervals for Bernoulli θ

### PROBABILITY AND STATISTICAL MODELS STAT2911 Notes

### Table of discrete distributions

| Distribution | Model | pmf | $E(X)$ | $V(X)$ |
|---|---|---|---|---|
| $\mathrm{Bernoulli}(p)$ | Success or failure, with success probability $p$. Eg: coin toss | $P(X=1)=p$, $P(X=0)=1-p$ | $p$ | $p(1-p)$ |
| $\mathrm{Geometric}(p)$ | Number of trials, each with success probability $p$, until the first success. Eg: coin tosses till first head | $(1-p)^{k-1}p$ | $\frac{1}{p}$ | $\frac{1-p}{p^2}$ |
| $\mathrm{Binomial}(n,p)$ | $n$ iid Bernoulli trials. Eg: $n$ coin tosses | $\binom{n}{k}p^k(1-p)^{n-k}$ | $np$ | $np(1-p)$ |
| $\mathrm{Poisson}(\lambda)$ | Number of events in a certain time interval. Eg: number of radioactive particles emitted in a certain time | $\frac{e^{-\lambda}\lambda^k}{k!}$ | $\lambda$ | $\lambda$ |
| $\mathrm{Hypergeometric}(r,n,m)$ | Number of red balls in a sample of $m$ balls drawn **without** replacement from an urn with $r$ red balls and $n-r$ black balls | $\frac{\binom{r}{k}\binom{n-r}{m-k}}{\binom{n}{m}}$ | $\frac{mr}{n}$ | $\frac{mr}{n}\cdot\frac{n-r}{n}\cdot\frac{n-m}{n-1}$ |
| $\text{Negative binomial}(r,p)$ | Number of Bernoulli trials till the $r$th success | $\binom{k-1}{r-1}(1-p)^{k-r}p^r$ | $\frac{r}{p}$ | $\frac{r(1-p)}{p^2}$ |

Uri Keich; Carslaw 821; Monday 5-6

### Introduction

- Probability in general, and this course in particular, has two components:
  - Mathematical theory of probability
    - Definitions, theorems and proofs
    - An abstraction of experiments whose outcome is random
  - Modelling: argued rather than proved; applications can be confusing as well as ill-defined
    - Useful for describing/summarising the data and making accurate predictions

### Mathematical theory of probability

### Sample space

The set of all possible outcomes, denoted Ω, is the sample space

### Sample point

A point 𝜔 ∈ Ω is a sample point

Examples:

Die:

Ω = {1,2,3,4,5,6}

Coin:

Ω = {𝐻, 𝑇}

- Complication: do we model the possibility that the coin can land on its side?

### Events:

- Events are subsets of Ω, for which we can assign a probability.

- We say an event 𝐴 occurred if the outcome, or sample point 𝜔 ∈ Ω satisfies 𝜔 ∈ 𝐴

10 | P a g e

### Probability as a measure

𝑃(𝐴) is the probability of the event 𝐴, which intuitively is the rate at which 𝐴 occurs if we repeat the experiment many times.

Mathematically, $P$ is a probability measure if:

1. $P(\Omega) = 1$

2. $P(A) \geq 0$ for any event $A$

3. If $A_1, A_2, \ldots$ are mutually disjoint ($A_i \cap A_j = \emptyset$ for $i \neq j$), then $P\left(\bigcup_{n=1}^{\infty} A_n\right) = \sum_{n=1}^{\infty} P(A_n)$.
For this to make sense, we need to know that a countable union of events is ALSO an event.

- Why do we bother with determining events? Why can't any subset of $\Omega$ be an event?
  - Imagine choosing a point at random in a large cube $C \subset \mathbb{R}^3$, where the probability of the point lying in any set $A \subset C$ is proportional to its volume $|A|$.
  - Clearly, if $A, B \subset C$ are related through a rigid motion (translation + rotation), then $|A| = |B|$, so their probabilities are the same.
  - Similarly, if $A \subset C$ can be split into disjoint pieces $A_1 \cup A_2$, then $P(A) = P(A_1) + P(A_2)$.
  - Banach-Tarski paradox: the unit ball $B$ in $\mathbb{R}^3$ can be decomposed into 5 pieces which can be reassembled, using only rigid motions, into two balls of the same size. But then
    $$P(B) = P(B_1) + P(B_2) + \cdots + P(B_5) = P(B_1') + \cdots + P(B_5') = 2P(B),$$
    a contradiction.

This is why we cannot assign probabilities to EVERY subset of $\Omega$: probabilities can't be assigned to arbitrary sets, only to sets which are "measurable".

The collection of measurable sets is captured by the notion of a $\sigma$-algebra ($\sigma$-field).

### Sigma algebra

Definition:

A collection $F$ of subsets of $\Omega$ is a $\sigma$-algebra if:

1. $\Omega \in F$

2. $A \in F \Rightarrow A^c \in F$

3. $A_1, A_2, \ldots \in F \Rightarrow \bigcup_{n=1}^{\infty} A_n \in F$

Property 2 says that $F$ is closed under taking complements; property 3 says that $F$ is closed under countable unions.

- (Countable means you can enumerate it.) $A = \{1,3,5\}$ is a finite countable set; $\mathbb{N}$, $\mathbb{Z}$ and $\mathbb{Q}$ are infinite countable sets; $\mathbb{R}$ is not countable.

Example of sigma algebra

$\Omega = \{1, 2, \ldots, 6\}$

$F$ = the power set of $\Omega$ = the set of all subsets of $\Omega$ = $\{\emptyset, \{\text{all singletons}\}, \{\text{all pairs}\}, \ldots, \Omega\}$, denoted $2^{\Omega}$.

*Questions:*

- What is the cardinality of $F$, i.e. how many subsets of $\Omega$ does it contain?
- Is $2^{\Omega}$ ALWAYS a $\sigma$-algebra?
- What is the smallest $\sigma$-algebra, regardless of the sample space?

*Answers:*

- $2^{|\Omega|}$
- Yes
- $\{\emptyset, \Omega\}$

### Probability space:

A probability space consists of:

1. A sample space $\Omega$

2. A $\sigma$-algebra $F$ of subsets of $\Omega$

3. A probability measure $P: F \to \mathbb{R}$

Eg: model of a fair die

$\Omega = \{1, 2, \ldots, 6\}$, $F = 2^{\Omega}$

$$P(\{i\}) = \frac{1}{6}, \quad i = 1, \ldots, 6$$

$$P(A) = \frac{|A|}{6} \quad (\text{for } A \in F)$$

### Venn diagrams:

Useful to visualise relations between sets. They help plan proofs, but don't use them as PROOFS in this course.

12 | P a g e

### De Morgan's laws (for set theory):

$$(A \cup B)^c = A^c \cap B^c$$
$$(A \cap B)^c = A^c \cup B^c$$

Eg:

$$P(A \cup B) = P(A) + P(B) - P(A \cap B)$$

- Why are $A \cup B$ and $A \cap B \in F$? (why are they events)

$A \cup B = A \mathbin{\dot\cup} (B \setminus A)$, where $\dot\cup$ denotes a disjoint union and $B \setminus A = B \cap A^c \in F$.

$$\therefore P(A \cup B) = P(A) + P(B \setminus A)$$

(using $P\left(\bigcup_{n=1}^{\infty} A_n\right) = \sum_{n=1}^{\infty} P(A_n)$ applied to $A$, $B \setminus A$ and infinitely many $\emptyset$'s)

*(The probability of the empty set is 0:*
$$P(\emptyset) = P\left(\bigcup_{n=1}^{\infty} \emptyset\right) = \sum_{n=1}^{\infty} P(\emptyset) \Rightarrow P(\emptyset) = 0.)$$

Since $P(B) = P(B \setminus A) + P(B \cap A)$:

$$\Rightarrow P(A \cup B) = P(A) + P(B) - P(A \cap B)$$

### Equiprobable spaces

- An equiprobable space consists of a finite sample space such that $\forall \omega \in \Omega$, $P(\{\omega\}) = c > 0$ (all outcomes equally likely).

Note: $\omega \in \Omega$ is a sample point; $\{\omega\}$ is an event in $F$.

- Why does it have to be finite?
  - Otherwise, pick a sequence of distinct points $\{\omega_n\}_{n=1}^{\infty} \subset \Omega$; then
    $$P\left(\bigcup_n \{\omega_n\}\right) = \sum_n P(\{\omega_n\}) = \sum_{n=1}^{\infty} c = \infty$$
    but $\forall A \in F$, $P(A) \in [0,1]$ (why? → show).

Claim: $P(A) = \frac{|A|}{|\Omega|}$

Proof:

$$A = \bigcup_{\omega \in A} \{\omega\}$$

$$\therefore P(A) = P\left(\bigcup_{\omega \in A} \{\omega\}\right) = \sum_{\omega \in A} P(\{\omega\}) = |A| \times c$$

Taking $A = \Omega$: $P(\Omega) = 1$ (definition) $= |\Omega|c$

$$\Rightarrow c = \frac{1}{|\Omega|} \Rightarrow P(A) = \frac{|A|}{|\Omega|}$$

This means that probability is just combinatorics!

*Examples of probability as combinatorics*

1. A fair die is rolled:
$$P(\{\omega\}) = \frac{1}{6}$$

2. A group of $n$ people meet at a party; what is the probability that at least 2 of them share a birthday?

   - Model: no leap years, no association between birthdays.

$$\Omega = \{(i_1, \ldots, i_n) \mid i_k \in \{1, \ldots, 365\}\} \Rightarrow \frac{1}{|\Omega|} = 365^{-n}$$

$$A = \{(i_1, \ldots, i_n) \in \Omega \mid \exists j \neq k,\ i_j = i_k\}$$

Let's rather compute the complement:

$$A^c = \{(i_1, \ldots, i_n) \in \Omega \mid \forall j \neq k,\ i_j \neq i_k\}$$

$$P(A^c) = \frac{|A^c|}{|\Omega|} = \frac{365 \times 364 \times \cdots \times (365 - n + 1)}{365^n} = \prod_{i=1}^{n} \frac{365 - (i-1)}{365} = \prod_{i=1}^{n} \left(1 - \frac{i-1}{365}\right) \quad (n \leq 365)$$

$$P(A) = 1 - \frac{365 \times 364 \times \cdots \times (365 - n + 1)}{365^n}$$

Sidenote: for $n = 23$, $P(A) \approx \frac{1}{2}$.

3. What is the probability that one of the guests shares YOUR birthday?

$$B = \{(i_1, \ldots, i_n) \in \Omega \mid \exists j,\ i_j = x\} \quad (x = \text{your birthday})$$

$$\therefore B^c = \{(i_1, \ldots, i_n) \in \Omega \mid \forall j,\ i_j \neq x\} \Rightarrow |B^c| = 364^n$$

$$\Rightarrow P(B^c) = \frac{364^n}{365^n} = \left(1 - \frac{1}{365}\right)^n \approx e^{-n/365}$$

Sidenote: for $n = 253$, $P(B) \approx \frac{1}{2}$.
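Both birthday formulas above are easy to evaluate numerically; this is a small sketch (the function names are mine, not from the notes):

```python
from math import prod

def p_shared_birthday(n):
    """P(at least two of n people share a birthday): 1 - prod(1 - (i-1)/365)."""
    return 1 - prod(1 - (i - 1) / 365 for i in range(1, n + 1))

def p_shares_yours(n):
    """P(at least one of n guests shares YOUR birthday): 1 - (364/365)^n."""
    return 1 - (364 / 365) ** n

print(round(p_shared_birthday(23), 3))  # 0.507
print(round(p_shares_yours(253), 3))    # 0.5
```

This reproduces both sidenotes: 23 people already push the shared-birthday probability past one half, while matching a *fixed* birthday needs about 253 guests.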

### Conditional probability:

If $A, B \in F$ and $P(B) > 0$ we can define the probability of $A$ given $B$:

$$P(A \mid B) = \frac{P(A \cap B)}{P(B)}$$

Example:

$\Omega = \{1, \ldots, 6\}$, $A = \{1,2\}$, $B = \{1,3,5\}$

$$P(A \mid B) = \frac{P(\{1\})}{P(\{1,3,5\})} = \frac{1/6}{3/6} = \frac{1}{3}$$

### Independence

Event $A \in F$ is independent of $B \in F$ if knowing whether or not $A$ occurred gives us no information on whether or not $B$ occurred:

$$P(B \mid A) = P(B) \Rightarrow \frac{P(B \cap A)}{P(A)} = P(B) \quad \therefore P(B \cap A) = P(A)P(B)$$

- This last form is taken as the definition: it is more robust (symmetric in $A$ and $B$), needs no assumption $P(A) > 0$, and is easier to generalise. In general:

$$P(A_1 \cap A_2 \cap \cdots \cap A_n) = \prod_{i=1}^{n} P(A_i)$$

(required for every sub-collection of the $A_i$'s) - stronger than pairwise independence!

NOTE: independence is NOT the same as being disjoint (disjointness gives you total knowledge: if one event occurred, the other cannot have).

### Law of total probability

- The probability of an event may be computed by summing over all eventualities.

Three machines:

| | A | B | C |
|---|---|---|---|
| Production rate | .05 | .2 | .3 |
| Failure rate | .01 | .02 | .005 |

$$P(\text{failed product}) = .05(.01) + .2(.02) + .3(.005)$$

Generally: if $B_j \in F$ form a partition of $\Omega$, i.e. $\Omega = \dot\bigcup_j B_j$, then

$$P(A) = P(A \cap \Omega) = P\left(A \cap \left(\dot\bigcup_j B_j\right)\right) = P\left(\dot\bigcup_j (A \cap B_j)\right) = \sum_j P(A \cap B_j) = \sum_j P(B_j)P(A \mid B_j)$$
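The machine example can be computed directly; a small sketch, taking the table's figures at face value (the dictionary names are mine):

```python
# Law of total probability: P(failed) = sum_j P(B_j) * P(failed | B_j)
production = {"A": 0.05, "B": 0.2, "C": 0.3}    # P(B_j), from the table
failure = {"A": 0.01, "B": 0.02, "C": 0.005}    # P(failed | B_j)

p_failed = sum(production[m] * failure[m] for m in production)
print(round(p_failed, 4))  # 0.006
```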

### Distributive law for set theory:

$$A \cup (B \cap C) = (A \cup B) \cap (A \cup C)$$
$$A \cap (B \cup C) = (A \cap B) \cup (A \cap C)$$

### Bayes' rule:

Diagnostic: which of the events $B_j$ triggered the event $A$? (Which machine caused the failed product?)

$$P(B_j \mid A) = \frac{P(A \cap B_j)}{P(A)} = \frac{P(A \mid B_j)P(B_j)}{\sum_i P(A \mid B_i)P(B_i)}$$

(using the definition of conditional probability and the law of total probability)
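Applied to the machine example above, Bayes' rule gives the posterior probability that each machine produced a failed item; a sketch with the same (illustrative) figures:

```python
# Bayes' rule: P(B_j | failed) = P(failed | B_j) P(B_j) / sum_i P(failed | B_i) P(B_i)
production = {"A": 0.05, "B": 0.2, "C": 0.3}
failure = {"A": 0.01, "B": 0.02, "C": 0.005}

denom = sum(production[m] * failure[m] for m in production)      # P(failed)
posterior = {m: production[m] * failure[m] / denom for m in production}
print({m: round(p, 3) for m, p in posterior.items()})
# Machine B is the most likely culprit: 0.004 / 0.006 = 2/3
```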

### Random Variables

A RV is a measurable function 𝑋: Ω → ℝ

### Discrete RV

A RV is discrete if its range

$$X(\Omega) = \{X(\omega) : \omega \in \Omega\} \subset \mathbb{R}$$

is a countable set (finite or infinite).

Probability mass function of Discrete RV:

The pmf of $X$ is defined as:

$$p_X(x) = P(X = x) = P(\{\omega : X(\omega) = x\}) \quad (x \in \mathbb{R})$$

- Why is $\{\omega : X(\omega) = x\} \in F$?

*Properties of pmf*

Claim:

1. $\forall x \in \mathbb{R} \setminus X(\Omega)$, $p_X(x) = 0$ (the pmf of an unattainable value is 0)
2. With $\{x_i\} = X(\Omega)$, $\sum_i p_X(x_i) = 1$ (the pmf sums to 1 over all attainable values)

The distribution of a discrete RV is completely specified by its pmf. Indeed, $\forall A \subset \mathbb{R}$:

$$P(X \in A) = \sum_{i : x_i \in A} p_X(x_i)$$

(this gives the probability of any event defined through $X$)

- We can thus specify the distribution of a RV $X$ by specifying its pmf $p_X$.

*Question:*

Can any function $p : \mathbb{R} \to [0,1]$ with a countable support $\{x : p(x) > 0\}$ such that $\sum_{i : p(x_i) > 0} p(x_i) = 1$ be the pmf of some random variable?

Claim: if $p$ is as above, then there exists a probability space $(\Omega, F, P)$ and a RV $X : \Omega \to \mathbb{R}$ such that $p_X = p$.

Proof: take $\Omega = \{x : p(x) > 0\}$, $F = 2^{\Omega}$, $P(A) = \sum_{x \in A} p(x)$, and $X(\omega) = \omega$.
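The construction in this proof is concrete enough to sketch in a few lines; the particular pmf below is a made-up example, not from the notes:

```python
# Sketch of the claim: a pmf p with countable support defines a probability
# space (Omega = support of p, F = 2^Omega, P(A) = sum of p over A), X(w) = w.
p = {1: 0.2, 2: 0.5, 7: 0.3}  # an arbitrary example pmf (hypothetical values)

def P(A):
    """Probability measure: P(A) = sum of p(x) over x in A (within the support)."""
    return sum(p[x] for x in A if x in p)

assert abs(P(set(p)) - 1) < 1e-12   # P(Omega) = 1
print(round(P({2, 7}), 3))          # P(X in {2, 7}) = 0.8
```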
Examples of random variables

*Bernoulli random variable*

Defined by the pmf:

$$p(x) = \begin{cases} 1-p, & x = 0 \\ p, & x = 1 \end{cases} \quad (\text{failure and success})$$

Eg: $\Omega = \{H, T\}$ with $P(H) = p$, and

$$X(\omega) = \begin{cases} 1, & \omega = H \\ 0, & \omega = T \end{cases}$$
*Binomial random variable*

$S_n$ models the number of successes in $n$ iid (independent and identically distributed) Bernoulli trials.
A 2-parameter family of distributions: $(n, p)$.

Note: if $X_i = 1\{\text{success of trial } i\} = \begin{cases} 0 & \text{if failure} \\ 1 & \text{if success} \end{cases}$, then $X_i \sim \mathrm{Bernoulli}(p)$ and $S_n = \sum_{i=1}^{n} X_i$.

Generally, for an event $A$, the random variable $1_A$ is the indicator function of $A$:

$$1_A(\omega) = \begin{cases} 1 & \text{if } \omega \in A \\ 0 & \text{if } \omega \notin A \end{cases}$$

$$S_n(\Omega) = \{0, 1, 2, \ldots, n\}$$

For $k \in S_n(\Omega)$, what is $P(S_n = k)$?

- Consider one configuration of the Bernoulli trials with $k$ successes: $s, s, \ldots, s, f, f, \ldots, f$
- Its probability is $p^k(1-p)^{n-k}$, and there are $\binom{n}{k}$ such configurations.

Binomial RV pmf:

$$\therefore P(S_n = k) = \binom{n}{k} p^k (1-p)^{n-k}$$
*Geometric random variable:*

Models the number of iid $\mathrm{Bernoulli}(p)$ trials it takes until the first success.

$$X(\Omega) = \{1, 2, 3, \ldots\} \cup \{\infty\}$$

A 1-parameter family: $p$.

Pmf of geometric random variable:

$$P(X = k) = (1-p)^{k-1}p$$

$$P(X > k) = (1-p)^k$$

$$P(X > n + k \mid X > k) = \frac{P(X > n+k,\ X > k)}{P(X > k)} \quad (\text{conditional probability})$$

$$= \frac{P(X > n+k)}{P(X > k)} = \frac{(1-p)^{n+k}}{(1-p)^k} = (1-p)^n = P(X > n)$$

This property is called "memorylessness" (the distribution does not remember what has happened before).

- The geometric is the ONLY discrete memoryless distribution.

If $X_i = 1\{\text{success of trial } i\}$, then $X = \min\{i : X_i = 1\}$.
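The memoryless identity can be verified numerically from the pmf alone; a sketch (the values of p, k, n are arbitrary choices):

```python
# Numerical check of memorylessness for Geometric(p):
# P(X > n + k | X > k) should equal P(X > n) = (1 - p)^n.
p, k, n = 0.3, 4, 5

def tail(m, terms=10_000):
    """Approximate P(X > m) by summing the pmf (1-p)^(j-1) * p for j > m."""
    return sum((1 - p) ** (j - 1) * p for j in range(m + 1, m + 1 + terms))

lhs = tail(n + k) / tail(k)   # P(X > n + k | X > k), via conditional probability
rhs = (1 - p) ** n            # P(X > n)
print(abs(lhs - rhs) < 1e-9)  # True
```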

*Negative binomial RV*

Models the number of iid Bernoulli trials till the $r$th success, where $r \in \mathbb{N}$.

- A 2-parameter distribution: $(r, p)$.
- Note: for $r = 1$, $X^1$ is a geometric random variable.

$$X^r(\Omega) = \{r, r+1, \ldots\} \cup \{\infty\}$$

Pmf:

For $k \in X^r(\Omega)$, what is $P(X^r = k)$?

- Consider one configuration of its Bernoulli trials, e.g. $(f, f, f, \ldots, s, s, s, s)$: the $k$th trial must be the $r$th success, and the first $k-1$ trials contain exactly $r-1$ successes.

$$P(X^r = k) = \binom{k-1}{r-1}(1-p)^{k-r}p^r$$

$$X^r = \min\left\{m : \sum_{i=1}^{m} X_i = r\right\}$$

*Hypergeometric RV*

$X$ models the number of red balls in a sample of $m$ balls drawn without replacement from an urn with $r$ red balls and $n - r$ black balls.

$$X(\Omega) \subset \{0, 1, \ldots, r\}$$

Pmf: for $k \in X(\Omega)$,

$$P(X = k) = \frac{\binom{r}{k}\binom{n-r}{m-k}}{\binom{n}{m}}$$

A 3-parameter family: $(r, n, m)$. (It is not phrased in terms of iid Bernoulli RVs.)

- A famous application is Fisher's exact test.

Fisher's exact test:

Each of 30 convicted criminals had a same-sex twin; 13 of the twins were identical and 17 were non-identical.

- Is there evidence of a genetic link?

| | Convicted twin | Not convicted | Total |
|---|---|---|---|
| Identical | 10 | 3 | 13 |
| Non-identical | 2 | 15 | 17 |
| Total | 12 | 18 | 30 |

- Assuming that whether or not the twin of the criminal is also convicted does not depend on the biological type of the twin, we have a sample from a hypergeometric distribution. Indeed: there are 13 red (monozygotic) balls and 17 black (dizygotic) balls. We randomly sample 12 balls (the convicted twins); what is the probability of seeing 10 or more red balls in the sample?

Let $X \sim \mathrm{Hypergeometric}(r = 13, n = 30, m = 12)$.

$$P(X \geq 10) = \sum_{k=10}^{12} \frac{\binom{13}{k}\binom{17}{12-k}}{\binom{30}{12}} \approx 0.000465$$
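The tail probability above follows directly from the hypergeometric pmf; a short sketch using only the standard library:

```python
from math import comb

# Hypergeometric upper tail for the twin data: r = 13 red (identical twins),
# n = 30 total, m = 12 drawn (convicted twins); compute P(X >= 10).
r, n, m = 13, 30, 12
p_tail = sum(comb(r, k) * comb(n - r, m - k) for k in range(10, m + 1)) / comb(n, m)
print(round(p_tail, 6))  # 0.000465
```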

So it seems very unlikely that there is no relation between the conviction of the twin and the twin's type. But:

- Did we establish that a criminal mind is inherited? There are problems with this statement:
  - Conviction vs truth ("that face is up to no good"?)
  - Sampling/ascertainment bias
  - Do identical twins have a tighter connection?

### Joint probability mass function:

The joint pmf of the RVs $X$ and $Y$ specifies their interaction:

$$p_{XY}(x, y) = P(X = x, Y = y) \quad (x, y \in \mathbb{R})$$

Note: $\{X = x, Y = y\} = \{\omega \in \Omega \mid X(\omega) = x, Y(\omega) = y\}$.

- If $X$ and $Y$ are discrete random variables, with $\{x_i\} = X(\Omega)$ and $\{y_j\} = Y(\Omega)$, then (keeping $x$ fixed and summing over the values of $Y$):

$$p_X(x) = P(X = x) = \sum_j P(X = x, Y = y_j) = \sum_j p_{XY}(x, y_j)$$

Similarly:

$$p_Y(y) = \sum_i p_{XY}(x_i, y)$$

$p_X$ and $p_Y$ are referred to as the **marginal pmf's**.

Example:

A fair coin is tossed 3 times.

$X = 1\{H \text{ on first toss}\}$ = number of heads on the first toss; $Y$ = total number of heads.

$\Omega = \{HHH, HHT, \ldots\}$, $|\Omega| = 8$

| | $Y=0$ | $Y=1$ | $Y=2$ | $Y=3$ | $p_X$ |
|---|---|---|---|---|---|
| $X=0$ | 1/8 | 2/8 | 1/8 | 0 | 1/2 |
| $X=1$ | 0 | 1/8 | 2/8 | 1/8 | 1/2 |
| $p_Y$ | 1/8 | 3/8 | 3/8 | 1/8 | |

So, e.g., $P(X = 0) = \frac{1}{2}$ and $P(Y = 0) = \frac{1}{8}$.

(Note: called marginal, as the marginal pmf's appear in the margins of the table.)

NOTE:

$$X \sim \mathrm{Bernoulli}\left(\frac{1}{2}\right), \quad Y \sim \mathrm{Binomial}\left(3, \frac{1}{2}\right)$$

Are $X$ and $Y$ independent RVs?

- The random variable $X$ is independent of $Y$ if "knowing the value of $Y$ does not change the distribution of $X$". So NO, they are not independent: e.g. $Y = 3$ forces $X = 1$.
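The joint table and its margins can be built by enumerating the 8 equally likely outcomes; a small sketch using exact fractions:

```python
from itertools import product
from fractions import Fraction

# Joint pmf of X = 1{H on first toss} and Y = total heads in 3 fair tosses.
joint = {}
for outcome in product("HT", repeat=3):
    x = 1 if outcome[0] == "H" else 0
    y = outcome.count("H")
    joint[(x, y)] = joint.get((x, y), Fraction(0)) + Fraction(1, 8)

# Marginals: sum the joint pmf over the other variable.
p_X = {x: sum(v for (a, _), v in joint.items() if a == x) for x in (0, 1)}
p_Y = {y: sum(v for (_, b), v in joint.items() if b == y) for y in range(4)}

print(p_X[0])  # 1/2
print(p_Y[0])  # 1/8
# Not independent: the joint does not factor, e.g. at (x, y) = (0, 3):
print(joint.get((0, 3), 0) == p_X[0] * p_Y[3])  # False
```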

Independence:

$X$ is independent of $Y$ when $P(X = x_i \mid Y = y_j) = P(X = x_i)$, i.e.:

$$P(X = x_i, Y = y_j) = P(X = x_i)P(Y = y_j) \quad (\text{definition})$$

$$p_{XY}(x, y) = p_X(x)p_Y(y)$$

The joint pmf factors into the product of the marginals.

More generally, the random variables $X_1, \ldots, X_n$ are independent if:

$$P(X_1 = x_1, \ldots, X_n = x_n) = p_{X_1, \ldots, X_n}(x_1, \ldots, x_n) = \prod_{i=1}^{n} p_{X_i}(x_i) \quad (\forall x_i \in \mathbb{R})$$

Note: take caution that PAIRWISE INDEPENDENCE DOES NOT IMPLY INDEPENDENCE.

### Poisson Distribution:

Recall: $X \sim \mathrm{Poisson}(\lambda)$ if

$$p_X(k) = e^{-\lambda}\frac{\lambda^k}{k!} \quad (k \in \mathbb{N}_0)$$

- A 1-parameter family, $\lambda > 0$.
- Where does it come from?
  - It models the number of "events" registered in a certain time interval, eg the number of:
    - Particles emitted by a radioactive source in an hour
    - Incoming calls to a service centre between 1-2 pm
    - Light bulbs burnt out in a year
    - Fatalities from horse kicks in the Prussian cavalry over 200 corps-years (Bortkiewicz 1898): 20 corps × 10 years = 200 corps-years

| Number of deaths | Observed count | Frequency | Poisson approximation |
|---|---|---|---|
| 0 | 109 | .545 | .543 |
| 1 | 65 | .325 | .331 |
| 2 | 22 | .110 | .101 |
| 3 | 3 | .015 | .021 |
| 4 | 1 | .005 | .003 |

So, how did we come up with the Poisson distribution?

Assumptions of Poisson:

1. The distribution of the number of events in any time interval depends only on its length (eg: the number of horse kicks depends only on the interval being a day, not on which particular day).

2. The numbers of events recorded in two disjoint time intervals are independent of one another (the number of horse kicks today is independent of yesterday's).

3. No two events are recorded at exactly the same time point (you can't have 2 horse kicks at exactly the same instant; they must differ slightly in time).

So:

Let $X_{t,s}$ denote the number of events in the time interval $(t, s]$, and write $X_t = X_{0,t}$.

Our goal is to find the distribution of $X_1$.

- Let $f(t) = P(X_t = 0)$; then

$$f(t + s) = P(X_{t+s} = 0) = P(X_t = 0,\ X_{t,t+s} = 0)$$

$$= P(X_t = 0)P(X_{t,t+s} = 0) \quad (\text{disjoint time intervals, assumption 2})$$

$$= P(X_t = 0)P(X_s = 0) = f(t)f(s) \quad (\text{assumption 1})$$

$$\Rightarrow f(t+s) = f(t)f(s) \quad \forall t, s > 0$$

One type of solution of this functional equation is:

$$f(t) = e^{\alpha t} \quad (\alpha \in \mathbb{R})$$

(Other solutions exist, but they are a mess: they are unbounded, whereas $f(t) \in [0,1]$, so they cannot arise for probabilities. For the same reason we need $\alpha < 0$, so write $\alpha = -\lambda$ for some $\lambda > 0$.)

$$\Rightarrow P(X_t = 0) = e^{-\lambda t}$$

In particular, for $t = 1$:

$$p_{X_1}(0) = P(X_1 = 0) = e^{-\lambda}$$

Let $Y_n$ be the number of intervals $\left(\frac{k-1}{n}, \frac{k}{n}\right]$, $k = 1, \ldots, n$, of $(0, 1]$ in which an event occurred:

$$Y_n = \sum_{k=1}^{n} 1\left\{X_{\frac{k-1}{n}, \frac{k}{n}} \geq 1\right\}$$

This is a sum of $n$ iid $\mathrm{Bernoulli}(p_n)$ indicators (identically distributed by assumption 1; independent by assumption 2, as the intervals are disjoint), where

$$p_n = P\left(X_{\frac{k-1}{n}, \frac{k}{n}} \geq 1\right) = P\left(X_{\frac{1}{n}} \geq 1\right) = 1 - P\left(X_{\frac{1}{n}} = 0\right) = 1 - e^{-\lambda/n}$$

$$\Rightarrow Y_n \sim \mathrm{Binomial}\left(n,\ p_n = 1 - e^{-\lambda/n}\right)$$

Note: $Y_n \leq X_1$, with $Y_n < X_1$ if two events occur in the same interval. However:

$$\lim_{n \to \infty} Y_n = X_1$$

($X_1$ counts ALL events, while $Y_n$ counts at most one event per subinterval; but since no two events can occur at the same time (assumption 3), for large enough $n$ every subinterval contains at most one event.)

The expected value of a $\mathrm{Binomial}(n, p)$ RV is $np$. For $Y_n$ this is $np_n = n\left(1 - e^{-\lambda/n}\right)$. What is $\lim_{n \to \infty} np_n$?

$$np_n = n\left(1 - \left(1 - \frac{\lambda}{n} + R_1\left(-\frac{\lambda}{n}\right)\right)\right) \quad (\text{Taylor expansion of } e^{-\lambda/n})$$

$$= \lambda - nR_1\left(-\frac{\lambda}{n}\right) = \lambda + \frac{R_1\left(-\frac{\lambda}{n}\right)}{-\lambda/n}\,\lambda \to \lambda$$

since the Taylor remainder satisfies $R_1(x)/x \to 0$ as $x \to 0$.

Therefore, the limit of the expected value of the binomial is $\lambda$.

Claim: if $Y_n \sim \mathrm{Binomial}(n, p_n)$ with $np_n \to \lambda$ as $n \to \infty$, then for any fixed $k \in \mathbb{Z}^+$:

$$P(Y_n = k) \xrightarrow{n \to \infty} e^{-\lambda}\frac{\lambda^k}{k!}$$

(the binomial pmf converges to the Poisson pmf)

Corollary: $X = X_1 \sim \mathrm{Poisson}(\lambda)$, since $P(X_1 = k) = \lim_{n \to \infty} P(Y_n = k)$.
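Before the proof, the claim is easy to check numerically; a quick sketch (the values of λ and k are arbitrary illustrative choices):

```python
from math import comb, exp, factorial

# If n*p_n -> lam, the Binomial(n, p_n) pmf should approach the Poisson(lam) pmf.
lam, k = 2.0, 3

def binom_pmf(n, p, j):
    return comb(n, j) * p**j * (1 - p) ** (n - j)

def poisson_pmf(lam, j):
    return exp(-lam) * lam**j / factorial(j)

for n in (10, 100, 10_000):
    p_n = lam / n  # chosen so that n * p_n = lam exactly
    print(n, abs(binom_pmf(n, p_n, k) - poisson_pmf(lam, k)))
# the gap shrinks toward 0 as n grows
```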

Proof of claim:

$$P(Y_n = k) = \binom{n}{k}p_n^k(1 - p_n)^{n-k} = \frac{n(n-1)\cdots(n-k+1)}{k!}\,p_n^k(1 - p_n)^{n-k}$$