Glossary
Word | Definition |
---|---|
Population | The collection of all items of interest to our study. Its size is denoted N. |
Sample | A subset of the population. Its size is denoted n. |
Parameter | A value that describes a population. It is the population counterpart of a statistic. |
Statistic | A value that describes a sample. It is the sample counterpart of a parameter. |
Random sample | A random sample is collected when each member of the sample is chosen from the population strictly by chance. |
Representative sample | A representative sample is a subset of the population that accurately reflects the members of the entire population. |
Variable | A characteristic of a person, object, event, etc. that can vary from case to case. For example, 'height' is a variable that describes a characteristic of a person; it varies from person to person. |
Frequency distribution table | A table that shows the frequency of each value (or category) of a variable. |
Frequency | Measures how often a value occurs in the dataset. |
Absolute frequency | Measures the NUMBER of occurrences of a value. |
Relative frequency | Measures the RELATIVE NUMBER (proportion) of occurrences of a value. Usually expressed as a percentage. |
Cumulative frequency | The sum of the relative frequencies up to and including the current value. The cumulative frequency of the last member is 100% or 1. |
Pareto diagram | A special type of bar chart, where frequencies are shown in descending order. There is an additional line on the chart, showing the cumulative frequency. |
Histogram | A type of bar chart that represents numerical data. It is divided into intervals (or bins) that are not overlapping and span from the first observation to the last. The intervals (bins) are adjacent - where one stops, the other starts. |
Bins (histogram) | The intervals that are represented in a histogram. |
Cross table / Contingency table | A table which represents categorical data. On one axis we have the categories, and on the other, their frequencies. It can be built with absolute or relative frequencies. |
Scatter plot | A plot that represents numerical data. Graphically, each observation looks like a point on the scatter plot. |
Measures of central tendency | Measures that describe the data through the so-called 'averages'. The most common are the mean, median and mode. There are also the geometric mean, harmonic mean, weighted mean, etc. (The first code sketch after this table computes the common ones.) |
Mean | The simple average of the dataset. Denoted μ for a population and x̄ for a sample. |
Median | The middle number in an ordered dataset; with an even number of observations, it is the average of the two middle numbers. |
Mode | The value that occurs most often. A dataset can have 0, 1 or multiple modes. |
Measures of asymmetry | Measures that describe the data through the level of symmetry that is observed. The most common are skewness and kurtosis. |
Skewness | A measure that describes the symmetry of the dataset around its mean. |
Sample formula | A formula that is calculated on a sample. The value obtained is a statistic. |
Population formula | A formula that is calculated on a population. The value obtained is a parameter. |
Measures of variability | Measures that describe the data through the level of dispersion (variability). The most common ones are variance and standard deviation. |
Variance | Measures the dispersion of the dataset around its mean. It is measured in units squared. Denoted σ² for a population and s² for a sample. |
Standard deviation | Measures the dispersion of the dataset around its mean. It is measured in original units. It is equal to the square root of the variance. Denoted σ for a population and s for a sample. |
Coefficient of variation | Measures the dispersion of the dataset relative to its mean: it equals the standard deviation divided by the mean. It is also called 'relative standard deviation'. It is useful for comparing the variability of datasets with different units or scales. |
Univariate measure | A measure which refers to a single variable. |
Multivariate measure | A measure which refers to multiple variables. |
Covariance | A measure of relationship between two variables. Usually, because of its scale of measurement, covariance is not directly interpretable. Denoted σxy for a population and sxy for a sample. |
Linear correlation coefficient | A measure of relationship between two variables. Very useful for direct interpretation, as it takes values in [-1,1]. Denoted ρxy for a population and rxy for a sample. (A code sketch after this table computes both covariance and correlation.) |
Correlation | A measure of the relationship between two variables. There are several ways to compute it, the most common being the linear correlation coefficient. |
Distribution | A distribution is a function that shows the possible values for a variable and the probability of their occurrence. |
Bell curve | A common name for the normal distribution. |
Gaussian distribution | The original name of the normal distribution. Named after the mathematician Carl Friedrich Gauss, who studied it extensively through his work on the Gaussian function. |
To control for the mean/std/etc. | Holding this particular value constant, we change the other variables and observe the effect. |
Standard normal distribution | A normal distribution with a mean of 0 and a standard deviation of 1. |
z-statistic | The statistic associated with the normal distribution. |
Standardized variable | In statistics, we usually standardize a variable using the z-score formula: first subtract the mean, then divide by the standard deviation. |
Central limit theorem | No matter the distribution of the underlying dataset, the sampling distribution of its sample means approximates a normal distribution as the sample size grows. (Illustrated in a code sketch after this table.) |
Sampling distribution | The distribution of a statistic (such as the sample mean) computed over many samples drawn from the same population. |
Standard error | The standard deviation of the sampling distribution. It takes the sample size into account; for the sample mean it equals σ/√n. |
Estimator | A function or a rule, according to which we make estimations. |
Estimate | A particular value that was estimated through an estimator. |
Bias | An unbiased estimator has an expected value equal to the population parameter; a biased one has an expected value different from it. The bias is the deviation from the true value. |
Efficiency (in estimators) | In the context of estimators, efficiency loosely refers to 'lack of variability': the most efficient estimator is the one with the least variability. It is a comparative measure, e.g. one estimator is more efficient than another. |
Point estimator | A function or a rule, according to which we make estimations that will result in a single number. |
Point estimate | A single number that was derived from a certain point estimator. |
Interval estimator | A function or a rule, according to which we make estimations that result in an interval. In this course, we only consider confidence intervals. Another kind, which we do not discuss, is the credible interval (Bayesian statistics). |
Interval estimate | A particular result that was obtained from an interval estimator. It is an interval. |
Confidence interval | The range within which we expect the population parameter to be. We have a certain probability of it being correct, equal to the level of confidence (1 - α). (A code sketch after this table builds one.) |
Reliability factor | A value from a z-table, t-table, etc. that is associated with our test. |
Level of confidence | Shows in what % of the cases we expect the population parameter to fall into the confidence interval we obtained. Denoted 1 - α. Example: 95% confidence level means that in 95% of the cases, the population parameter will fall into the specified interval. |
Critical value | A value coming from a table for a specific statistic (z, t, F, etc.) associated with the probability α that the researcher has chosen. |
z-table | A table associated with the Z-statistic, where given a probability (α), we can see the value of the standardized variable, following the standard normal distribution. |
t-statistic | A statistic that is generally associated with the Student's T distribution, in the same way the z-statistic is associated with the normal distribution. |
A rule of thumb | A principle, which is approximately true, but is widely used in practice due to its simplicity. |
t-table | A table associated with the t-statistic, where given a probability (α), and certain degrees of freedom, we can check the reliability factor. |
Degrees of freedom | The number of values in the final calculation of a statistic that are free to vary. |
Margin of error | Half the width of a confidence interval. It drives the width of the interval. |
Hypothesis | Loosely, a hypothesis is 'an idea that can be tested'. |
Hypothesis test | A test that is conducted in order to verify if a hypothesis is true or false. |
Null hypothesis | The null hypothesis is the one to be tested. Whenever we are conducting a test, we are trying to reject the null hypothesis. |
Alternative hypothesis | The alternative hypothesis is the opposite of the null. It is usually the opinion of the researcher, who tries to reject the null hypothesis and thus accept the alternative one. |
To accept a hypothesis | The statistical evidence shows that the hypothesis is likely to be true. |
To reject a hypothesis | The statistical evidence shows that the hypothesis is likely to be false. |
One-tailed (one-sided) test | Tests that determine whether a value is lower than (or equal to) or higher than (or equal to) a certain value are one-sided, because the null hypothesis can be rejected on one side only. |
Two-tailed (two-sided) test | Tests that determine whether a value is equal to (or different from) a certain value are two-sided, because the null hypothesis can be rejected on two sides: when the parameter is too big or too small. |
Significance level | The probability of rejecting the null hypothesis, if it is true. Denoted α. You choose the significance level. All else equal, the lower the level, the better the test. |
Rejection region | The part of the distribution, for which we would reject the null hypothesis. |
Type I error (false positive) | This error consists of rejecting a null hypothesis that is true. The probability of committing it is α, the significance level. |
Type II error (false negative) | This error consists of accepting a null hypothesis that is false. The probability of committing it is β. |
Power of the test | The probability of rejecting a null hypothesis that is false (the researcher's goal). Denoted 1 - β. |
z-score | The standardized value of the sample statistic we are testing. It is compared with the critical value from the z-table at the chosen significance level α. |
μ₀ | The hypothesized population mean. |
p-value | The smallest level of significance at which we can still reject the null hypothesis, given the observed sample statistic. (The final code sketch after this table computes one.) |
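
The sketches below illustrate several of the terms above in code. First, the measures of central tendency and variability, using only Python's standard library; the dataset is made up for illustration.

```python
import statistics

data = [2, 4, 4, 4, 5, 5, 7, 9]

mean = statistics.mean(data)         # simple average
median = statistics.median(data)     # middle number of the ordered data
mode = statistics.mode(data)         # most frequently occurring value

# Sample formulas divide by n - 1; population formulas divide by N
s2 = statistics.variance(data)       # sample variance, s²
sigma2 = statistics.pvariance(data)  # population variance, σ²
s = statistics.stdev(data)           # sample standard deviation, s
sigma = statistics.pstdev(data)      # population standard deviation, σ

cv = s / mean                        # coefficient of variation (relative std. dev.)
print(mean, median, mode, s2, sigma2, s, sigma, cv)
```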
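
Next, covariance and the linear correlation coefficient, computed from their sample formulas on hypothetical paired data.

```python
import statistics

x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 8.1, 9.8]

n = len(x)
mx, my = statistics.mean(x), statistics.mean(y)

# Sample covariance: s_xy = Σ(x_i - x̄)(y_i - ȳ) / (n - 1)
s_xy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / (n - 1)

# Correlation coefficient: r_xy = s_xy / (s_x * s_y), always in [-1, 1]
r_xy = s_xy / (statistics.stdev(x) * statistics.stdev(y))
print(s_xy, r_xy)
```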
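
The following sketch illustrates standardization, the standard error, and the central limit theorem: sample means drawn from a clearly non-normal (uniform) population still cluster around the population mean, with spread close to σ/√n. The population and sample sizes here are arbitrary.

```python
import random
import statistics

random.seed(0)
# A clearly non-normal (uniform) population
population = [random.uniform(0, 10) for _ in range(100_000)]

mu = statistics.mean(population)       # population mean
sigma = statistics.pstdev(population)  # population standard deviation

# Standardizing a single value: z = (x - μ) / σ
x = population[0]
z = (x - mu) / sigma

# Sampling distribution of the mean: draw many samples of size n
n = 50
sample_means = [statistics.mean(random.sample(population, n)) for _ in range(1_000)]

# The spread of the sample means should match the standard error σ / √n
se_theory = sigma / n ** 0.5
se_observed = statistics.stdev(sample_means)
print(z, mu, se_theory, se_observed)
```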
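
A sketch of a 95% confidence interval for a population mean with unknown variance, using the reliability factor from the t-table. It assumes scipy is available; the data is invented.

```python
import statistics
from scipy import stats  # assumed available for the t critical value

data = [14.2, 15.1, 13.8, 15.6, 14.9, 14.4, 15.3, 14.7]
n = len(data)

x_bar = statistics.mean(data)  # point estimate of the population mean
s = statistics.stdev(data)     # sample standard deviation

alpha = 0.05                                   # 1 - α = 95% confidence level
t_crit = stats.t.ppf(1 - alpha / 2, df=n - 1)  # reliability factor (t-table)
margin = t_crit * s / n ** 0.5                 # margin of error: half the width

print((x_bar - margin, x_bar + margin))        # the interval estimate
```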
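
Finally, a two-sided z-test sketch showing the z-score, the p-value, and the decision at significance level α. All numbers are hypothetical, and the population standard deviation is assumed known.

```python
from scipy import stats  # assumed available for the normal CDF

x_bar = 104.0  # observed sample mean (hypothetical)
mu_0 = 100.0   # hypothesized population mean under the null
sigma = 12.0   # population standard deviation, assumed known
n = 36         # sample size

# z-score of the sample mean under the null hypothesis
z = (x_bar - mu_0) / (sigma / n ** 0.5)

# Two-sided p-value: probability of a result at least this extreme under H0
p_value = 2 * (1 - stats.norm.cdf(abs(z)))

alpha = 0.05  # significance level
print(z, p_value, "reject H0" if p_value < alpha else "fail to reject H0")
```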