From 35644c21a442dab21044a4b6a5e50d9dbccf3810 Mon Sep 17 00:00:00 2001
From: AaravVishal1
Date: Wed, 21 May 2025 16:11:27 +0530
Subject: [PATCH] Structured: 01 - Statistics and Probability.qmd
---
 .../01 - Statistics and Probability.qmd | 182 ++++++++++++++++++
 1 file changed, 182 insertions(+)

diff --git a/book/chapters/04 Drawing Conclusions From Data/01 - Statistics - Essentials/01 - Statistics and Probability.qmd b/book/chapters/04 Drawing Conclusions From Data/01 - Statistics - Essentials/01 - Statistics and Probability.qmd
index e69de29..ec46730 100644
--- a/book/chapters/04 Drawing Conclusions From Data/01 - Statistics - Essentials/01 - Statistics and Probability.qmd
+++ b/book/chapters/04 Drawing Conclusions From Data/01 - Statistics - Essentials/01 - Statistics and Probability.qmd
@@ -0,0 +1,182 @@

# Statistics and Probability

---

## 1. Motivation: Why Statistics?

A **statistician** or **scientist** is often interested in a particular aspect of a group of items.
For example, the group could be *all eligible voters in the state of Massachusetts*, and the *attribute* could be the **average age** of these voters, or the **percentage** who intend to vote.

Even though it might be important to know the true value of such an attribute, measuring it for every voter is typically impractical.

---

## 2. The Challenge of Measuring Everything

The number of eligible voters is large, numbering in the millions. It would be extremely expensive and time-consuming to ask every single voter about their age or voting intent.

Luckily, we don’t have to.

We can instead **select a small, representative subset** (a sample) and use it to make an informed estimate about the population.

---

## 3. Populations and Samples

Let’s introduce the key terms:

- **Population**: The entire group of interest (e.g., all eligible MA voters)
- **Sample**: A subset of the population that we can actually measure (e.g., 10,000 voters)
- **Attribute**: A measurable feature of each item (e.g., age)

---

## 4. Attribute-Values and Notation

Let the population have $N$ items.
Each item has an attribute-value, denoted:

$$
a_1, a_2, \dots, a_N
$$

We select a sample of size $n$ and measure the attribute for each sampled item:

$$
X_1, X_2, \dots, X_n
$$

Because of the **randomized nature of sampling**, each $X_i$ is treated as a **random variable**.

---

## 5. Sampling and Randomness

Since our sample is drawn at random, the act of sampling is like running a random experiment.

Hence:
$$
X_1, X_2, \dots, X_n \quad \text{are random variables}
$$

They are drawn from an unknown population distribution, which we discuss next.

---

## 6. The Population Distribution

Even though the population has a finite size $N$, we often **approximate** the distribution of the attribute-values with a well-known distribution such as:

- Bernoulli
- Gaussian
- Poisson

We call this assumed distribution the **population distribution**, denoted $\mathcal{D}$.

> Under simple random sampling:
> $$
> X_1, \dots, X_n \overset{\text{iid}}{\sim} \mathcal{D}
> $$

While we do not know the exact parameters of $\mathcal{D}$, we usually assume we know its family.

---

## 7. Defining a Statistic

Once we have sampled values $X_1, \dots, X_n$, we compute a **statistic**.

A statistic is any function of the sample:

$$
f(X_1, \dots, X_n)
$$

This function summarizes the data.
Examples include:

- Mean
- Mode
- Median
- 99th percentile
- Variance

Some functions are **valid but useless** (e.g., the sum of the 10th digit of each $X_i$), while others are both valid and useful.
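To make “a statistic is any function of the sample” concrete, here is a minimal sketch in Python (assuming NumPy is available; the sample values below are invented purely for illustration). It computes several of the statistics listed above from a single sample.

```python
import numpy as np

# A hypothetical sample of n = 10 measured ages (invented values, for illustration only).
sample = np.array([34, 52, 41, 29, 63, 47, 38, 55, 45, 60])

# Each quantity below is a statistic: a function f(X_1, ..., X_n) of the sampled values.
sample_mean = sample.mean()               # the sample mean, X-bar
sample_median = np.median(sample)         # the sample median
sample_var = sample.var(ddof=1)           # the sample variance (ddof=1 divides by n - 1)
sample_p99 = np.percentile(sample, 99)    # the 99th percentile of the sample

print(sample_mean, sample_median, sample_var, sample_p99)
```

Any other function of these ten numbers, useful or not, would also qualify as a statistic; usefulness is a separate question from validity.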
---

## 8. Sampling Distributions

Since $X_1, \dots, X_n$ are random variables, any statistic $f(X_1, \dots, X_n)$ is also a **random variable**.

Its distribution is called the **sampling distribution** of the statistic.

For example, the sample mean:

$$
\bar{X} = \frac{1}{n} \sum_{i=1}^n X_i
$$

is a statistic whose sampling distribution is approximately **Normal** for large $n$ (by the Central Limit Theorem).

> **Important:** The population is always assumed to be fixed.
> The randomness comes from the **sampling procedure**, which leads to variation in the statistic.

---

## 9. Sampling Repeatedly

If we repeat the sampling process $k$ times, we get:

$$
(X_1^{(1)}, \dots, X_n^{(1)}), \quad (X_1^{(2)}, \dots, X_n^{(2)}), \quad \dots, \quad (X_1^{(k)}, \dots, X_n^{(k)})
$$

In general, each repetition yields a different value of the sample mean:
$$
\bar{X}^{(1)}, \bar{X}^{(2)}, \dots, \bar{X}^{(k)}
$$

This variation is exactly why the statistic $\bar{X}$ has a **sampling distribution**.

---

## 10. The Population Parameter

Now imagine we somehow measured the attribute-values for all $N$ items in the population:

$$
a_1, a_2, \dots, a_N
$$

We compute the same function $f$ on this full dataset:

$$
\Theta = f(a_1, \dots, a_N)
$$

This value $\Theta$ is called the **population parameter**.

It is not random; it is simply a calculation on fixed numbers.

---

## 11. Example: True Mean Age

If the attribute is Age, then the true mean age across the entire population is:

$$
\Theta = \mu_{\text{Age}} = \frac{1}{N} \sum_{i=1}^N a_i
$$

This is the **population mean**: a fixed number we’re trying to estimate using a statistic (like $\bar{X}$).

---

## 12. Summary

- Populations are large; samples are small.
- Attribute-values from the sample are treated as **random variables**.
- A **statistic** is any function of these random variables.
- A **parameter** is a fixed (but unknown) value based on the whole population.
- The statistic has a **sampling distribution** because the sample is random.
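---

To tie Sections 8 through 11 together, here is a minimal simulation sketch in Python (assuming NumPy is available; the population size, age range, sample size, and number of repetitions are invented purely for illustration). It fixes a population, computes the population parameter once, then draws many samples and records the sample mean each time, giving an empirical picture of the sampling distribution of $\bar{X}$.

```python
import numpy as np

rng = np.random.default_rng(0)

# A hypothetical fixed population of N ages (values invented for illustration).
N = 100_000
population = rng.integers(low=18, high=95, size=N)

# The population parameter: the true mean age, a fixed (non-random) number.
population_mean = population.mean()

# Draw k independent samples of size n (simple random sampling, without replacement)
# and compute the statistic X-bar for each sample.
n, k = 1_000, 500
sample_means = np.array([
    rng.choice(population, size=n, replace=False).mean()
    for _ in range(k)
])

print("population mean (parameter):", population_mean)
print("mean of the sample means   :", sample_means.mean())
print("spread of the sample means :", sample_means.std())
```

Every entry of `sample_means` is a different realization of the statistic $\bar{X}$, while `population_mean` never changes between runs. A histogram of `sample_means` is an empirical view of the sampling distribution, and for a sample size this large it looks approximately Normal, as Section 8 predicts.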