STAT2011 Lecture Notes
Shane Leviton
6-6-2016
1 | P a g e
Contents
Table of discrete distributions ... 8
Introduction ... 9
Mathematical theory of probability ... 9
Sample space ... 9
Sample point ... 9
Examples: ... 9
Events: ... 9
Probability as a measure ... 10
Sigma algebra ... 10
Example of sigma algebra ... 11
Probability space: ... 11
Eg: model of fair dice ... 11
Venn diagrams: ... 11
De Morgan’s laws (for set theory): ... 12
Eg: ... 12
Equiprobable spaces ... 12
Claim: P(A) = |A|/|Ω| ... 13
Conditional probability: ... 14
Independence ... 15
Law of total probability ... 15
Distributive law for set theory: ... 16
Bayes’ rule: ... 16
Random Variables ... 16
Discrete RV ... 16
Probability mass function of Discrete RV: ... 16
Examples of random variables ... 17
Joint probability mass function: ... 20
Independence: ... 21
Poisson Distribution: ... 22
Assumptions of Poisson: ... 22
Distribution of a sum of random variables (arithmetic of RVs) ... 26
Convolution: ... 27
Expectation: ... 28
Definition ... 28
Examples of particular kinds ... 28
Bernoulli: ... 28
Poisson ... 28
Binomial ... 29
Geometric: ... 29
Negative binomial ... 29
Hypergeometric ... 30
Negative expectation: ... 30
Expectation of a function of a RV: ... 30
Proof:... 30
Proof of 2: ... 31
Variance ... 31
Definition: ... 32
Variance and expectation computation... 32
Examples of variance ... 32
Table of discrete distributions ... 34
Expectation as a linear operator ... 35
Expected value of sum of 𝑋 + 𝑌 ... 35
Random vector ... 35
Expectation multiplication ... 37
Claim: ... 37
L1 and L2 ... 38
Example of 𝑋 ∈ 𝐿1 but not ∈ 𝐿2 ... 38
What about the opposite? ... 38
So is 𝑉𝑋 + 𝑌 ∈ 𝐿2? ... 38
Claim: ... 38
Claim: shifting 𝑉(𝑋) ... 39
Covariance... 40
Definition ... 40
Claim: ... 40
So; going back to: 𝑉𝑋 + 𝑌 ... 41
By induction: variance of 𝑋𝑖 ′𝑠 ... 42
Chebyshev’s inequality ... 44
Lemma: Markov’s Inequality... 45
Law of large numbers: ... 46
Convergence probability ... 46
Theorem: Weak law of large numbers ... 47
Strong law of large numbers: ... 47
The multinomial distribution: ... 47
Multinomial random vector: ... 48
Goal: pmf for multinomial random vector ... 48
Estimation: ... 49
The method of moments ... 49
Method of moments ... 49
For estimating ... 51
Maximum likelihood estimation (MLE) ... 53
Likelihood function ... 53
Hardy-Weinberg equilibrium ... 59
theory ... 59
Example: ... 59
So how do we rigorously compare different estimators? ... 61
Back to HW: ... 63
Hardy – Weinberg equilibrium ... 65
Delta method: ... 66
Parametric Bootstrap method ... 72
Example: Binomial ... 72
Example 2: Hardy-Weinberg equilibrium ... 73
Conditional Expectation/Variance ... 74
Conditional expectation: ... 74
Example: die ... 75
Claim: L1 ... 75
Random Sums: ... 75
Examples: ... 75
Conditional Expectation (the RV; not a number, a random variable) ... 77
Definition: ... 77
Theorem: Expectation of conditional expectation (total expectation law) ... 79
What about the variance of conditional expectation? Is V(E(Y|X)) equal to V(Y)? ... 80
Conditional Variance: ... 81
Analogously: ... 81
So, conditional variance: ... 81
Continuous Random Variables ... 83
CDF (Cumulative distribution function) ... 83
Definition: ... 83
Continuous random variable definition: ... 86
PDF: 𝑓 ... 86
Examples of PDF/CDF distributions: ... 88
Uniform distribution: ... 88
Exponential distribution ... 89
Gamma distribution: ... 91
Normal distribution ... 93
Quantiles: ... 94
Quantile definition: ... 94
Special quantiles ... 95
Pth quantile ... 95
𝑝th quantile definition: ... 95
Quantile function: Definition ... 96
Functions of random variables for continuous RV... 97
Claim: uniform distribution ... 97
Claim: sampling ... 98
SAMPLING‼!... 99
Claim: ... 99
Theorem: Random variable functions ... 99
Examples of samples and functions: ... 100
Joint distribution (discrete analogue of joint pmf) ... 102
Joint density definition: ... 103
Comments: ... 103
Marginal distribution: ... 105
Proof: ... 105
Examples: ... 106
Gamma/exponential joint ... 106
Uniform distribution of region in a plane ... 107
Bivariate normal (special case) ... 109
Independent RV’s (continuous) ... 110
For continuous: ... 110
Definition: ... 110
Examples: ... 111
Conditional distributions: ... 113
Continuous: ... 113
Construction: ... 113
Conditional density: ... 114
Examples: ... 115
Sampling in 2D ... 116
Discrete: ... 116
Continuous ... 116
Sum of Random Variables ... 116
Continuous: ... 116
Density of Sum of continuous random variables 𝑍 = 𝑋 + 𝑌: (convolution) ... 117
Quotients of RV’s ... 119
Formula for quotients: ... 122
For independent RV ... 122
Functions of jointly continuous RV ... 122
Mapping: ... 122
Blow up factor: ... 123
Jacobian: generally ... 124
Box-Muller Sampling ... 128
Visual comparison of distributions ... 129
QQ plot ... 129
Examples: ... 129
More general QQ plot: ... 130
Empirical data: ... 130
Extrema and order statistics ... 137
Sorted values: order statistics ... 137
Claim: ... 137
Expectation of Order statistics ... 139
Expectation of a continuous RV ... 139
Definition: ... 139
Looking at examples: ... 139
Expectation of a function for continuous RV ... 142
In continuous case analogue: Expectation of function ... 142
Variance for continuous: ... 143
Equation to calculate: ... 144
Variance of linear sum: ... 144
Covariance: ... 144
Claim: ... 144
Variance of sum of RVs independent:... 144
Standard deviation: ... 144
Correlation coefficient: ... 145
Claim: Correlation coefficient ... 145
Examples of variance etc: ... 145
Standard normal: ... 145
General normal: X ∼ N(μ, σ²) ... 146
Gamma ... 146
Bivariate standard normal: ... 147
Markov’s inequality in Continuous: ... 148
continuous ... 148
Chebyshev’s inequality: ... 148
Weak law of large numbers: ... 149
Strong law of large numbers: ... 149
Estimation for continuous ... 149
Examples: ... 149
Standard normal: ... 149
Normal: ... 149
Exponential: ... 149
Scaled Cauchy: ... 149
Method of moments: ... 149
Examples: ... 150
Maximum likelihood estimation (MLE) ... 150
For continuous: ... 151
Conditional expectation: ... 152
Conditional expectation def: ... 152
More generally: ... 152
Mixed distribution ... 153
Examples ... 154
Uniform and binomial ... 154
Natural valued with iid sum ... 154
Expectation of mixed distributions: ... 154
Eg 1: ... 154
Eg 2: ... 155
Variance of random sum: ... 155
Law of total probability ... 155
We can now prove this: ... 155
Prediction ... 156
Predictor ... 156
Claim: ... 157
Back to prediction problem: ... 157
Example of prediction: ... 158
Moment generating function (MGF) ... 160
Definition: ... 160
Calculating MGF: ... 160
So why are we doing this? ... 161
Theorem 1: ... 161
Theorem 2: ... 163
Weak convergence/converges in distribution ... 164
Weakly converges ... 164
Characteristic function (CF)... 167
Confidence intervals ... 168
Estimation: ... 168
We want to know- how close are we to 𝜃? ... 168
Confidence interval set up: ... 169
Constructing confidence intervals (CIs) ... 172
Examples of confidence intervals: ... 172
Constructing approximate CIs based on the MLE ... 180
Confidence intervals for Bernoulli θ: ... 181
PROBABILITY AND STATISTICAL MODELS STAT2911 Notes
Table of discrete distributions
Distribution | Model | pmf | E(X) | V(X)

Bernoulli(p): success or failure, with probability p (eg: coin toss)
  pmf: p(x) = p for x = 1, 1 − p for x = 0;  E(X) = p;  V(X) = p(1 − p)

Geometric(p): number of trials to first success (eg: coin tosses till first head)
  pmf: (1 − p)^(k−1) p;  E(X) = 1/p;  V(X) = (1 − p)/p²

Binomial(n, p): n Bernoulli trials (eg: n coin tosses)
  pmf: (n choose k) p^k (1 − p)^(n−k);  E(X) = np;  V(X) = np(1 − p)

Poisson(λ): number of events in a certain time interval (eg: number of radioactive particles emitted in a certain time)
  pmf: e^(−λ) λ^k / k!;  E(X) = λ;  V(X) = λ

Hypergeometric(r, n, m): number of red balls in a sample of m balls drawn without replacement from an urn with r red balls and n − r black balls
  pmf: (r choose k)(n − r choose m − k)/(n choose m);  E(X) = mr/n;  V(X) = (mr/n)((n − r)/n)((n − m)/(n − 1))

Negative binomial(r, p): number of Bernoulli trials till the r-th success
  pmf: (k − 1 choose r − 1)(1 − p)^(k−r) p^r;  E(X) = r/p;  V(X) = r(1 − p)/p²
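These pmfs translate directly into code. A minimal Python sketch (function names are mine, not from the course) implementing the table and sanity-checking one of the E(X) columns:

```python
from math import comb, exp, factorial

# pmfs from the table above (a sketch; function names are mine)
def bernoulli_pmf(x, p):
    return p if x == 1 else 1 - p

def geometric_pmf(k, p):              # k = 1, 2, ...
    return (1 - p) ** (k - 1) * p

def binomial_pmf(k, n, p):            # k = 0, ..., n
    return comb(n, k) * p ** k * (1 - p) ** (n - k)

def poisson_pmf(k, lam):              # k = 0, 1, 2, ...
    return exp(-lam) * lam ** k / factorial(k)

def hypergeometric_pmf(k, r, n, m):   # k red balls in m draws from n balls, r red
    return comb(r, k) * comb(n - r, m - k) / comb(n, m)

def neg_binomial_pmf(k, r, p):        # k = r, r + 1, ...
    return comb(k - 1, r - 1) * (1 - p) ** (k - r) * p ** r

# sanity check against the E(X) column: Binomial(10, 0.3) has mean np = 3
mean = sum(k * binomial_pmf(k, 10, 0.3) for k in range(11))
assert abs(mean - 3.0) < 1e-9
```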
Uri Keich; Carslaw 821; Monday 5-6
Introduction
- Probability in general and this course has 2 components:
o Mathematical theory of probability
Definitions, theorems and proofs
Abstraction of experiments whose outcome is random
o Modelling: argued rather than proved; applications can be confusing, as well as ill-defined
Useful for describing/summarising the data and making accurate predictions
Mathematical theory of probability
Sample space
The set of all possible outcomes, denoted Ω, is the sample space
Sample point
A point 𝜔 ∈ Ω is a sample point
Examples:
Die:
Ω = {1,2,3,4,5,6}
Coin:
Ω = {𝐻, 𝑇}
- Complication: do we model the possibility that the coin can land on its side?
Events:
- Events are subsets of Ω, for which we can assign a probability.
- We say an event 𝐴 occurred if the outcome, or sample point 𝜔 ∈ Ω satisfies 𝜔 ∈ 𝐴
Probability as a measure
𝑃(𝐴) is the probability of the event 𝐴, which intuitively is the rate at which 𝐴 occurs if we repeat the experiment many times.
Mathematically: 𝑃 is a probability measure function if:
1. 𝑃(Ω) = 1
2. 𝑃(𝐴) ≥ 0 for any event 𝐴
3. If A1, A2, … are mutually disjoint (Ai ∩ Aj = ∅ for i ≠ j) then P(∪_{n=1}^∞ An) = ∑_{n=1}^∞ P(An). For this to make sense, we need to know that a union of a sequence of events is ALSO an event.
- Why do we bother with determining events? Why can’t any subset of Ω be an event?
o Imagine choosing a point at random in a large cube C ⊂ ℝ³, where the probability of the point lying in any set A ⊂ C is proportional to its volume |A|
o Clearly, if A, B ⊂ C are related through a rigid motion (translation + rotation), then |A| = |B|, so their probabilities are the same
o Similarly, if A ⊂ C can be split into a disjoint union A1 ∪ A2, then P(A) = P(A1) + P(A2)
o Banach-Tarski Paradox:
The unit ball B in ℝ³ can be decomposed into 5 pieces, which can be reassembled using only rigid motions into two balls of the same size.
But then: P(B) = P(B1) + P(B2) + ⋯ + P(B5) = P(B1′) + ⋯ + P(B5′) = 2P(B)
This is why we cannot assign probabilities to EVERY subset of Ω. Probabilities can’t be assigned to arbitrary sets, rather to sets which are “measurable”.
The collection of measurable sets is captured by the notion of 𝜎 algebra (𝜎 −field)
Sigma algebra
Definition:
A collection F of subsets of Ω is a σ-algebra if:
1. Ω ∈ F
2. A ∈ F ⟹ Aᶜ ∈ F
3. A1, A2, … ∈ F ⟹ ∪_{n=1}^∞ An ∈ F
2. says that F is closed with respect to taking complements; 3. says that F is closed with respect to countable unions
- (countable means you can count it)
- A = {1,3,5} is a finite countable set
- ℕ, ℤ and ℚ are infinite countable sets
- ℝ is not countable
Example of sigma algebra
Ω = {1,2, … 6}
F = the power set of Ω = the set of all subsets of Ω = {∅, {all singles}, {all doubles}, … , Ω}, denoted 2^Ω
Questions:
- What is the cardinality of F, i.e. how many subsets of Ω does it contain?
- Is 2^Ω ALWAYS a σ-algebra?
- What is the smallest σ-algebra, regardless of the sample space?
Answers:
- |F| = 2^|Ω| (here 2⁶ = 64)
- Yes
- {∅, Ω}
Probability space:
A probability space consists of:
1. A sample space Ω
2. A σ-algebra F of subsets of Ω
3. A probability measure P: F → ℝ
Eg: model of a fair die
Ω = {1,2, … 6}
F = 2^Ω
P({i}) = 1/6 for i = 1, … ,6
P(A) = |A|/6 (for A ∈ F)
Venn diagrams:
Useful to visualise relations between sets. They help plan proofs, but don’t use them as PROOFS in a course
De Morgan’s laws (for set theory):
(A ∪ B)ᶜ = Aᶜ ∩ Bᶜ
(A ∩ B)ᶜ = Aᶜ ∪ Bᶜ
Eg:
P(A ∪ B) = P(A) + P(B) − P(A ∩ B)
- Why are A ∪ B and A ∩ B ∈ F? (why are they events)
A ∪ B = A ∪̇ (B\A), where ∪̇ denotes a disjoint union and B\A = B ∩ Aᶜ
∴ P(A ∪ B) = P(A) + P(B\A)
(using P(∪_{n=1}^∞ An) = ∑_{n=1}^∞ P(An) with A, B and infinitely many ∅’s)
(the probability of the empty set is 0: P(∅) = P(∪ ∅) = ∑_{n=1}^∞ P(∅) ⟹ P(∅) = 0)
P(B) = P(B\A) + P(B ∩ A)
⟹ P(A ∪ B) = P(A) + P(B) − P(A ∩ B)
Equiprobable spaces
- An equiprobable space consists of a finite sample space such that ∀ω ∈ Ω, P({ω}) = c > 0
- (all outcomes equally likely)
Note: ω ∈ Ω is a sample point; {ω} is an event in F
- Why does it have to be finite?
o Otherwise, take {ωn}_{n=1}^∞ ⊂ Ω:
P(∪_n {ωn}) = ∑_n P({ωn}) = ∑_{n=1}^∞ c = ∞
[but ∀A ∈ F, P(A) ∈ [0,1] (why? → show)]
Claim: P(A) = |A|/|Ω|
Proof:
A = ∪_{ω∈A} {ω}
∴ P(A) = P(∪_{ω∈A} {ω}) = ∑_{ω∈A} P({ω}) = |A| × c
Take A = Ω:
P(Ω) = 1 (definition) = |Ω|c
⟹ c = 1/|Ω|
⟹ P(A) = |A|/|Ω|
This means that probability (on equiprobable spaces) is just combinatorics!
Examples of probability as combinatorics
1. A fair die is rolled:
P({ω}) = 1/6
2. A group of n people meet at a party; what is the probability that at least 2 of them share a birthday?
o Model: no leap years, no association between birthdays
Ω = {(i1, … in) | ik ∈ {1, … 365}}
⟹ 1/|Ω| = 365^(−n)
A = {(i1, … in) ∈ Ω | ∃j ≠ k, ij = ik}
Let’s rather compute the complement:
Aᶜ = {(i1, … in) ∈ Ω | ∀j ≠ k, ij ≠ ik}
P(Aᶜ) = |Aᶜ|/|Ω| = (365 × 364 × ⋯ × (365 − n + 1))/365^n
= ∏_{i=1}^n (365 − (i − 1))/365   {n ≤ 365}
= ∏_{i=1}^n (1 − (i − 1)/365)
P(A) = 1 − (365 × 364 × ⋯ × (365 − n + 1))/365^n
Sidenote: for n = 23, P(A) ≈ 1/2
3. What is the probability that one of the guests shares YOUR birthday?
B = {(i1, … in) ∈ Ω | ∃j, ij = x} (x = your birthday)
∴ Bᶜ = {(i1, … in) ∈ Ω | ∀j, ij ≠ x}
|Bᶜ| = 364^n
⟹ P(Bᶜ) = 364^n/365^n = (1 − 1/365)^n ≈ e^(−n/365)
Sidenote: for n = 253, P(B) ≈ 1/2
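Both sidenotes are easy to verify numerically. A small Python check (function names are mine):

```python
from math import prod

def p_shared(n):
    # P(at least two of n people share a birthday) = 1 - prod_{i=1}^{n} (1 - (i-1)/365)
    return 1 - prod(1 - (i - 1) / 365 for i in range(1, n + 1))

def p_shares_yours(n):
    # P(at least one of n guests shares YOUR birthday) = 1 - (364/365)^n
    return 1 - (364 / 365) ** n

print(round(p_shared(23), 3))         # 0.507: about a 50% chance with only 23 people
print(round(p_shares_yours(253), 4))  # just over 0.5 with 253 guests
```

Note how many more people are needed for the "your birthday" variant: 253 guests rather than 23, since only one specific date has to be hit.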
Conditional probability:
If A, B ∈ F and P(B) > 0 we can define the probability of A given B:
P(A|B) = P(A ∩ B)/P(B)
Example:
Ω = {1, … ,6}; A = {1,2}; B = {1,3,5}
P(A|B) = P(A ∩ B)/P(B) = (1/6)/(3/6) = 1/3
Independence
Event B ∈ F is independent (ind) of A ∈ F if knowing whether or not A occurred does not give us any information on whether or not B occurred:
P(B|A) = P(B)
⟹ P(B ∩ A)/P(A) = P(B)
∴ P(B ∩ A) = P(A)P(B)
- this definition is more robust (symmetric in A and B, with no need to assume P(A) > 0), and easier to generalise.
In general:
P(A1 ∩ A2 ∩ … ∩ An) = ∏_{i=1}^n P(Ai)
- stronger than pairwise independence!
NOTE: independence is NOT the same as being DISJOINT (disjoint events give you total knowledge: if one occurred, the other could not have)
Law of total probability
- a probability of an event may be computed by summing over all eventualities
Example: 3 machines
Machine           A     B     C
Production rate   .5    .2    .3
Failure rate      .01   .02   .005
P(failed product) = .5(.01) + .2(.02) + .3(.005) = .0105
Generally:
If Bj ∈ F form a partition of Ω, i.e. Ω = ∪̇_j Bj, then
P(A) = P(A ∩ Ω) = P(A ∩ (∪̇_j Bj)) = P(∪̇_j (A ∩ Bj)) = ∑_j P(A ∩ Bj) = ∑_j P(Bj)P(A|Bj)
16 | P a g e Distributive law for set theory:
𝐴 ∪ (𝐵 ∩ 𝐶) = (𝐴 ∪ 𝐵) ∩ (𝐴 ∪ 𝐶) 𝐴 ∩ (𝐵 ∪ 𝐶) = (𝐴 ∩ 𝐵) ∪ (𝐴 ∩ 𝐶)
Bayes’ rule:
Diagnostic: which of the events Bj triggered the event A? (e.g. which machine caused the failed product?)
P(Bj|A) = P(A ∩ Bj)/P(A)
= P(A|Bj)P(Bj) / ∑_i P(A|Bi)P(Bi)
(using the definitions of conditional probability and the law of total probability)
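The machine example can be run end-to-end. A sketch in Python, taking production rates .5, .2, .3 (an assumption on my part: the shares must sum to 1) and the failure rates above:

```python
# Three machines; production shares are assumed values chosen to sum to 1,
# failure rates as given in the notes
prior = {"A": 0.5, "B": 0.2, "C": 0.3}       # P(B_j): which machine made the item
fail = {"A": 0.01, "B": 0.02, "C": 0.005}    # P(failure | B_j)

# law of total probability: P(failure) = sum_j P(failure | B_j) P(B_j)
p_fail = sum(fail[m] * prior[m] for m in prior)

# Bayes' rule: P(B_j | failure) = P(failure | B_j) P(B_j) / P(failure)
posterior = {m: fail[m] * prior[m] / p_fail for m in prior}

print(round(p_fail, 4))                                # 0.0105
print({m: round(v, 3) for m, v in posterior.items()})  # A is the most likely culprit
```

With these numbers, machine A is the most likely source of a failed product, despite its low failure rate, because it produces half the items.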
Random Variables
A RV is a measurable function X: Ω → ℝ
Discrete RV
A RV is discrete if its range
X(Ω) = {X(ω): ω ∈ Ω} ⊂ ℝ
is a countable set (finite or infinite)
Probability mass function of Discrete RV:
The pmf of X is defined as:
pX(x) = P(X = x) = P({ω: X(ω) = x}) [with x ∈ ℝ]
- why is {ω: X(ω) = x} ∈ F?
Properties of pmf
Claim:
1. ∀x ∈ ℝ\X(Ω), pX(x) = 0 (the pmf of an unattainable value is 0)
2. With {xi} = X(Ω), ∑_i pX(xi) = 1 (the pmf sums to 1 over all outcomes)
The distribution of a discrete RV is completely specified by its pmf. Indeed, ∀A ⊂ ℝ:
P(X ∈ A) = ∑_{i: xi ∈ A} pX(xi)
(the probability that the outcome lands in A)
- We can thus specify the distribution of a RV X by specifying its pmf pX
Question:
Can any function p: ℝ → [0,1] with a countable support {x: p(x) > 0} such that ∑_{i: p(xi)>0} p(xi) = 1 be a pmf for some random variable?
Claim: if p is as above, then there exists a probability space (Ω, F, P) and a RV X: Ω → ℝ such that pX = p
Proof: Ω = {x: p(x) > 0}; F = 2^Ω; P(A) = ∑_{x∈A} p(x); X(ω) = ω
Examples of random variables
Bernoulli random variable
Defined by the pmf
p(x) = p for x = 1 (success), 1 − p for x = 0 (failure)
Eg:
Ω = {H, T}; P(H) = p
X(ω) = 1 if ω = H, 0 if ω = T
Binomial random variable
Sn models the number of successes in n iid (independent and identically distributed) Bernoulli trials. A 2-parameter family of distributions: (n, p)
Note: if Xi = 1{success of trial i} = 1 if success, 0 if failure, then Xi ∼ Bernoulli(p) and Sn = ∑_{i=1}^n Xi
Generally, for an event A, the random variable 1A is the indicator function of A:
1A(ω) = 1 if ω ∈ A, 0 if ω ∉ A
Sn(Ω) = {0,1,2, … n}
For k ∈ Sn(Ω), P(Sn = k) is:
- Consider one configuration of the n Bernoulli trials with k successes: s, s, … s, f, f, … f
- Its probability is p^k(1 − p)^(n−k), and there are (n choose k) such configurations
Binomial RV pmf
∴ P(Sn = k) = (n choose k) p^k (1 − p)^(n−k)
Geometric random variable:
Models the number of iid Bernoulli(p) trials it takes till the first success
X(Ω) = {1,2,3, … } ∪ {∞}
A 1-parameter family: p
Pmf of geometric random variable
P(X = k) = (1 − p)^(k−1) p
P(X > k) = (1 − p)^k
P(X > n + k | X > k) = P(X > n + k, X > k)/P(X > k) (using conditional probability)
= P(X > n + k)/P(X > k) = (1 − p)^(n+k)/(1 − p)^k = (1 − p)^n = P(X > n)
This property is called "memorylessness" (the distribution does not remember what has happened before)
- Geometric is the ONLY discrete memoryless distribution
If Xi = 1{success of trial i}, then X = min{i: Xi = 1}
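The memorylessness identity can be checked numerically. A quick sketch (parameter values arbitrary):

```python
def tail(k, p):
    # P(X > k) for X ~ Geometric(p): the first k trials all fail
    return (1 - p) ** k

p, n, k = 0.3, 4, 7
lhs = tail(n + k, p) / tail(k, p)  # P(X > n + k | X > k)
rhs = tail(n, p)                   # P(X > n)
assert abs(lhs - rhs) < 1e-12      # memorylessness holds
```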
Negative binomial RV
Models the number of iid Bernoulli trials till the r-th success, where r ∈ ℕ
- A 2-parameter distribution
- Note: for r = 1 this is a geometric random variable
Xr(Ω) = {r, r + 1, … } ∪ {∞}
Pmf:
For k ∈ Xr(Ω):
- Consider the configuration of its Bernoulli trials (f, f, … , s, … , s): trial k must be the r-th success, and the other r − 1 successes can fall anywhere among the first k − 1 trials
P(Xr = k) = (k − 1 choose r − 1)(1 − p)^(k−r) p^r
Xr = min{m: ∑_{i=1}^m Xi = r}
Hypergeometric RV
X models the number of red balls in a sample of m balls drawn without replacement from an urn with r red balls and n − r black balls
X(Ω) ⊂ {0,1, … r}
Pmf:
For k ∈ X(Ω):
P(X = k) = (r choose k)(n − r choose m − k)/(n choose m)
A 3-parameter family: (r, n, m)
(not naturally phrased in terms of iid Bernoulli RVs)
- A famous use of this distribution is Fisher’s Exact Test
Fisher Exact Test:
30 convicted criminals had a same-sex twin; of these, 13 were identical twins and 17 were non-identical twins.
- Is there evidence of a genetic link?

                 Twin convicted   Twin not convicted   Total
Identical             10                  3              13
Non-identical          2                 15              17
Total                 12                 18              30

- Assuming that whether or not the twin of the criminal is also convicted does not depend on the biological type of the twin, we have a sample from a hypergeometric distribution. Indeed: there are 13 red (monozygotic) balls and 17 black (dizygotic) balls. We randomly sample 12 balls (the convicted twins); what is the probability that we will see 10 or more red balls in the sample?
Let X ∼ Hyper(r = 13, n = 30, m = 12)
P(X ≥ 10) = ∑_{k=10}^{12} (13 choose k)(17 choose 12 − k)/(30 choose 12) ≈ 0.000465
So it seems very unlikely that there is no relation between the conviction of the twin and its type, but:
- Did we establish that a criminal mind is inherited?
o Problems with this statement:
o Conviction vs truth ("that face is up to no good"?)
o Sampling/ascertainment bias
o Identical twins may have a tighter connection
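The tail probability above can be reproduced with a few lines of Python (the helper name is mine):

```python
from math import comb

def hyper_pmf(k, r, n, m):
    # P(X = k): k red balls in a sample of m from n balls, r of them red
    return comb(r, k) * comb(n - r, m - k) / comb(n, m)

# 13 identical-twin (red) balls, 17 fraternal (black); 12 convicted twins sampled
p_value = sum(hyper_pmf(k, r=13, n=30, m=12) for k in range(10, 13))
print(round(p_value, 6))  # 0.000465
```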
Joint probability mass function:
The joint pmf of the RVs X and Y specifies their interaction:
PXY(x, y) = P(X = x, Y = y) (x, y ∈ ℝ)
Note: {X = x, Y = y} = {ω ∈ Ω | X(ω) = x, Y(ω) = y}
- If X and Y are discrete random variables, with {xi} = X(Ω) and {yj} = Y(Ω), then (keeping x fixed and summing over the values of Y):
PX(x) = P(X = x) = ∑_j P(X = x, Y = yj) = ∑_j PXY(x, yj)
Similarly:
PY(y) = ∑_i PXY(xi, y)
PX and PY are referred to as the marginal pmfs
Example:
A fair coin is tossed 3 times
X = 1{H on first toss} = number of heads on the first toss
Y = total number of heads
Ω = {HHH, HHT, … }; |Ω| = 8

         Y = 0   Y = 1   Y = 2   Y = 3   PX
X = 0     1/8     2/8     1/8      0     1/2
X = 1      0      1/8     2/8     1/8    1/2
PY        1/8     3/8     3/8     1/8

So, for example, P(X = 0, Y = 0) = 1/8, while the marginal P(X = 0) = 1/2
(note: called marginal, as it appears in the margins of the table)
NOTE:
X ∼ Bernoulli(1/2); Y ∼ Binomial(3, 1/2)
Are X and Y independent RVs?
- The random variable X is independent of Y if "knowing the value of Y does not change the distribution of X" (so NO, they are not independent: e.g. P(X = 1, Y = 0) = 0 ≠ P(X = 1)P(Y = 0))
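The joint table can be rebuilt by brute-force enumeration of the 8 outcomes, which also confirms the marginals and the lack of independence:

```python
from itertools import product

# enumerate the 8 equally likely outcomes of 3 fair coin tosses
joint = {}
for toss in product("HT", repeat=3):
    x = 1 if toss[0] == "H" else 0   # X: head on the first toss
    y = toss.count("H")              # Y: total number of heads
    joint[(x, y)] = joint.get((x, y), 0) + 1 / 8

p_x = {x: sum(v for (xx, _), v in joint.items() if xx == x) for x in (0, 1)}
p_y = {y: sum(v for (_, yy), v in joint.items() if yy == y) for y in range(4)}

# not independent: the joint pmf does not factor, e.g. at (x, y) = (1, 0)
assert joint.get((1, 0), 0) == 0
assert p_x[1] * p_y[0] == 1 / 16
```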
Independence:
X is independent of Y if ∀i, j: P(X = xi | Y = yj) = P(X = xi)
⟹ P(X = xi, Y = yj) = P(X = xi)P(Y = yj) (definition)
PXY(x, y) = PX(x)PY(y)
The joint pmf factors into the product of the marginals.
More generally, the random variables X1, … Xn are independent if:
P(X1 = x1, … Xn = xn) = PX1,…Xn(x1, … xn) = ∏_{i=1}^n PXi(xi) (∀xi ∈ ℝ)
Note: take caution that PAIRWISE INDEPENDENCE DOES NOT IMPLY INDEPENDENCE
Poisson Distribution:
Recall:
X ∼ Poisson(λ) if
PX(k) = e^(−λ) λ^k / k! (k = 0, 1, 2, …)
- A 1-parameter family, λ > 0
- Where does it come from?
o Models the number of "events" registered in a certain time interval, eg the number of:
Particles emitted by a radioactive source in an hour
Incoming calls to a service centre between 1-2 pm
Light bulbs burnt out in a year
Fatalities from horse kicks in the Prussian cavalry over 200 corps-years (Bortkiewicz, 1898)
20 corps × 10 years = 200 corps-years

Number of deaths   Observed count   Frequency   Poisson approximation
0                       109            .545            .543
1                        65            .325            .331
2                        22            .110            .101
3                         3            .015            .021
4                         1            .005            .003
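The Poisson column comes from fitting λ by the sample mean (122 deaths over 200 corps-years gives λ = 0.61). A sketch reproducing the table:

```python
from math import exp, factorial

deaths = {0: 109, 1: 65, 2: 22, 3: 3, 4: 1}      # Bortkiewicz's 200 corps-years
n = sum(deaths.values())                          # 200
lam = sum(k * c for k, c in deaths.items()) / n   # sample mean: 122/200 = 0.61

for k, count in deaths.items():
    observed = count / n
    poisson = exp(-lam) * lam ** k / factorial(k)
    print(k, round(observed, 3), round(poisson, 3))
```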
So, how did we come up with the Poisson distribution?
Assumptions of Poisson:
1. The distribution of the number of events in any time interval depends only on its length or duration (eg the number of horse kicks depends only on the fact that it is a day, not on the particular day)
2. The numbers of events recorded in two disjoint time intervals are independent of one another (the number of horse kicks today is independent of yesterday’s)
3. No two events are recorded at exactly the same time point (you can’t have 2 horse kicks at exactly the same time; they must be slightly separated in time)
So:
Let Xt,s denote the number of events in the time interval (t, s]
- Denote by Xt = X0,t
Our goal is to find the distribution of X1:
- Let f(t) = P(Xt = 0); then
f(t + s) = P(Xt+s = 0)
= P(Xt = 0, Xt,t+s = 0) (splitting into disjoint time intervals)
= P(Xt = 0)P(Xt,t+s = 0) (property 2: independence)
= P(Xt = 0)P(Xs = 0) (property 1: only the length matters)
= f(t)f(s)
⟹ f(t + s) = f(t)f(s) ∀t, s > 0
One type of solution of this is:
f(t) = e^(αt) (α ∈ ℝ)
(note: other solutions exist, but they are unbounded, while f(t) ∈ [0,1], so they can’t be probabilities)
For the same reason α < 0 (to remain bounded), so write α = −λ for λ > 0:
⟹ P(Xt = 0) = e^(−λt); in particular (for t = 1), PX1(0) = P(X1 = 0) = e^(−λ)
Let Yn = the number of intervals ((k−1)/n, k/n], k = 1, … n, of (0,1] in which an event occurred:
Yn = ∑_{k=1}^n 1{X(k−1)/n, k/n ≥ 1}
(a sum of n iid Bernoulli(pn) indicators: the intervals are disjoint, hence independent, and the distribution depends only on the interval length)
∴ pn = P(X(k−1)/n, k/n ≥ 1) = P(X1/n ≥ 1) (at least 1 event)
= 1 − P(X1/n = 0) (complement)
= 1 − e^(−λ/n)
⟹ Yn ∼ Binomial(n, pn = 1 − e^(−λ/n))
Note: Yn ≤ X1, and Yn < X1 if two events occur in the same interval.
However:
lim_{n→∞} Yn = X1
(X1 counts ALL events, but Yn only counts intervals containing at least one event; the two agree in the limit because no two events can occur at the same time, property 3)
The expected value of a Binomial(n, p) RV is np. For Yn this is npn = n(1 − e^(−λ/n)); what is lim_{n→∞} npn?
npn = n(1 − (1 − λ/n + R1(−λ/n))) (using the Taylor expansion e^x = 1 + x + R1(x))
= λ − nR1(−λ/n) = λ + λ · R1(−λ/n)/(−λ/n) → λ (since R1(x)/x → 0 as x → 0)
Therefore, the limit of the expected value of the binomial is λ
Claim: if Yn ∼ Binom(n, pn) such that npn → λ (as n → ∞), then for any fixed k ∈ ℤ⁺:
P(Yn = k) → e^(−λ)λ^k/k! as n → ∞
(the binomial pmf converges to the Poisson pmf)
Corollary:
X = X1 ∼ Poisson(λ), since lim_{n→∞} P(Yn = k) = P(X1 = k)
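Before the proof, the claim is easy to check numerically with the pn = 1 − e^(−λ/n) of our construction (λ and k below are arbitrary choices):

```python
from math import comb, exp, factorial

lam, k = 0.61, 2   # arbitrary rate and count

def binom_pmf_n(n):
    p_n = 1 - exp(-lam / n)   # p_n from the construction above, so n*p_n -> lam
    return comb(n, k) * p_n ** k * (1 - p_n) ** (n - k)

poisson = exp(-lam) * lam ** k / factorial(k)
errors = [abs(binom_pmf_n(n) - poisson) for n in (10, 100, 1000)]
assert errors[0] > errors[1] > errors[2]   # the approximation improves with n
```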
Proof of claim:
P(Yn = k) = (n choose k) pn^k (1 − pn)^(n−k)
= (n(n − 1) ⋯ (n − k + 1)/k!) pn^k