Stratified Random Sampling is a technique used in Machine Learning and Data Science to select random samples from a large population for training and test datasets. When the population is not large enough, random sampling can introduce bias and sampling errors. Stratified Random Sampling ensures that the samples adequately represent the entire population.
Stratified Random Sampling reduces this bias in the sample dataset by dividing the population into smaller sub-groups and randomly picking samples from each of them. In this article, we will dive deep into the world of Random Sampling and see how Stratified Random Sampling improves on traditional Random Sampling.
Unlike the traditional Random Sampling method, in which values are picked randomly from a population without considering any factor or feature, Stratified Random Sampling first splits the entire population into smaller subsets known as strata (a single subgroup is called a stratum; all the subgroups together are the strata). The split is based on a particular characteristic present in the data. In simpler terms, the members of the population are grouped according to a shared feature.
After dividing the entire population into sub-groups based on the feature, random sampling takes place within each stratum. With this approach, every characteristic used for stratification is reflected in the sample dataset, which reduces the bias in it. With simple Random Sampling, there is always a chance of sampling error, because a group may be over- or under-represented purely by chance. With the Stratified approach, each group is represented in the sample dataset according to the chosen allocation, which can help make the Machine Learning model more accurate.
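To make the contrast concrete, here is a minimal sketch using only Python's standard library. The population of 90 "A" members and 10 "B" members, the sample size of 10, and all variable names are assumptions made up for illustration; they do not come from a real dataset.

```python
import random
from collections import Counter, defaultdict

# Hypothetical population: 90 members of group "A" and 10 of group "B".
population = [("A", i) for i in range(90)] + [("B", i) for i in range(10)]

rng = random.Random(0)  # fixed seed so the run is repeatable

# Simple random sampling: group membership is ignored.
simple_sample = rng.sample(population, 10)

# Stratified random sampling: split by group, then sample within each group.
strata = defaultdict(list)
for member in population:
    strata[member[0]].append(member)

stratified_sample = []
for group, members in strata.items():
    k = round(10 * len(members) / len(population))  # proportional share
    stratified_sample.extend(rng.sample(members, k))

simple_counts = Counter(group for group, _ in simple_sample)
stratified_counts = Counter(group for group, _ in stratified_sample)
print("simple:    ", dict(simple_counts))
print("stratified:", dict(stratified_counts))
```

The stratified sample always contains 9 members of "A" and 1 of "B", whatever the seed. The simple random sample may contain more or fewer "B" members, or none at all, depending on chance.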
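In simple terms, Stratified Random Sampling consists of two main steps: dividing the population into strata based on a chosen characteristic, and then randomly sampling from each stratum.
There are two main types of Stratified Random Sampling: proportionate and disproportionate.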
In proportionate stratified random sampling, the number of samples drawn from each stratum depends solely on how large that stratum is relative to the whole population. In other words, the fraction of the sample taken from a stratum matches the fraction of the population that the stratum makes up.
For example, if a stratum represents 20% of the population, then 20% of the sample should be selected from that stratum.
This type of stratified random sampling is most commonly used when the strata are relatively homogeneous in size. It ensures that the sample is representative of the entire population, but it may not be as efficient as other sampling techniques if some strata are much smaller than others.
Example: Surveying student satisfaction in a university with freshmen, sophomores, juniors, and seniors.
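The allocation for the university example could be worked out as in the sketch below. The enrolment figures, the total sample size of 200, and the helper name `proportional_allocation` are all hypothetical. Largest-remainder rounding is one reasonable way, among others, to keep the per-stratum counts summing to the requested total.

```python
def proportional_allocation(stratum_sizes, sample_size):
    """Split sample_size across strata in proportion to their sizes.

    Uses largest-remainder rounding so the parts always sum to sample_size.
    """
    total = sum(stratum_sizes.values())
    exact = {name: sample_size * size / total for name, size in stratum_sizes.items()}
    allocation = {name: int(share) for name, share in exact.items()}
    leftover = sample_size - sum(allocation.values())
    # Hand the remaining units to the strata with the largest fractional parts.
    by_remainder = sorted(exact, key=lambda name: exact[name] - allocation[name], reverse=True)
    for name in by_remainder[:leftover]:
        allocation[name] += 1
    return allocation


# Hypothetical enrolment figures for the university example.
enrolment = {"freshmen": 4000, "sophomores": 3000, "juniors": 2000, "seniors": 1000}
allocation = proportional_allocation(enrolment, 200)
print(allocation)
# {'freshmen': 80, 'sophomores': 60, 'juniors': 40, 'seniors': 20}
```

Freshmen make up 40% of the assumed population, so they receive 40% of the 200 sampled students, and so on for the other classes.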
In disproportionate stratified random sampling, the sample size for each stratum is not proportional to the stratum's size in the population. This means that a stratum that is considered more important for the analysis may be oversampled, while a stratum that is less important may be undersampled.
This type of stratified random sampling is most commonly used when the strata are heterogeneous in size or when some strata are considered more important than others. It can be more efficient than proportionate stratified random sampling, but it may not be as representative of the entire population.
In the simplest form of disproportionate sampling, we specify a fixed number of samples to draw from each stratum, without considering its proportion in the population or any other factor.
Example: Surveying residents' opinions on a public transportation system in three districts with different population sizes.
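A minimal sketch of the district example follows, assuming three districts with made-up names and population sizes, and a fixed quota of 100 residents per district chosen by the analyst. None of these numbers come from real data.

```python
import random

# Hypothetical resident IDs for three districts of very different sizes.
districts = {
    "north": [f"N{i}" for i in range(5000)],
    "central": [f"C{i}" for i in range(1200)],
    "south": [f"S{i}" for i in range(300)],
}

# Disproportionate allocation: the analyst fixes the count per stratum,
# here the same number for every district regardless of its size.
per_district = {"north": 100, "central": 100, "south": 100}

rng = random.Random(42)
sample = {
    name: rng.sample(residents, per_district[name])
    for name, residents in districts.items()
}

for name, chosen in sample.items():
    fraction = len(chosen) / len(districts[name])
    print(f"{name}: {len(chosen)} sampled, sampling fraction {fraction:.1%}")
```

Every district contributes 100 residents, but the sampling fraction ranges from 2.0% in the largest district to 33.3% in the smallest. Because the small district is oversampled, population-wide estimates would normally weight each district's results by its true share of the population.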
The benefits of Stratified Random Sampling are:
Now we will see how to perform Stratified Random Sampling, step by step.
The first step of any sampling process is to define the population from which we will collect our samples. The main task is then to select the characteristic by which we want to divide the population into sub-groups, i.e. strata. This is a very important step. It is recommended to choose a clear feature that differentiates the members unambiguously, so that each member falls into exactly one stratum. If the feature values overlap, forming the strata becomes difficult.
It is also possible to use multiple columns/features to stratify the dataset and create sub-groups, as long as each combination of values clearly identifies a single sub-group.
Now, consider every member of the population and assign it to a stratum based on its value of the chosen characteristic. The collection of all these groups forms the strata.
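Forming the strata can be sketched as a simple grouping step. The records, the field names `gender` and `age_band`, and the helper `build_strata` are hypothetical. Passing more than one key illustrates stratifying on a combination of features.

```python
from collections import defaultdict


def build_strata(records, keys):
    """Group records into strata keyed by the values of the chosen features."""
    strata = defaultdict(list)
    for record in records:
        label = tuple(record[key] for key in keys)
        strata[label].append(record)
    return dict(strata)


# Hypothetical records; the field names are made up for illustration.
people = [
    {"id": 1, "gender": "F", "age_band": "18-30"},
    {"id": 2, "gender": "M", "age_band": "18-30"},
    {"id": 3, "gender": "F", "age_band": "31-50"},
    {"id": 4, "gender": "F", "age_band": "18-30"},
]

by_gender = build_strata(people, ["gender"])
by_gender_and_age = build_strata(people, ["gender", "age_band"])

for label, members in by_gender_and_age.items():
    print(label, [m["id"] for m in members])
```

Each record lands in exactly one stratum, so the strata do not overlap and together they cover the whole population.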
Before deciding the sample size of each stratum, we need to decide which type of Stratified Random Sampling to use: proportionate or disproportionate. In proportionate sampling, the size of the sample from each stratum is in proportion to the share of the population that the stratum makes up. If the stratum is a big part of the population, we take a larger sample from it, and a smaller sample if it is a small part.
In disproportionate sampling, there is no need to consider the proportion of the stratum in the population.
After deciding which method to use, it is time to decide the sample size. It should be large enough that every stratum is adequately represented in the sample dataset and statistical analysis can be carried out properly on it.
Now we use a random sampling method to collect data randomly from each stratum and form our sample dataset. Once we have sampled from each stratum, we combine all of the samples into one representative sample. This can be done by simply concatenating the samples together.
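The sampling and combining step might look like the following sketch. The strata, the per-stratum sizes, and the function name `stratified_sample` are assumptions for illustration. Shuffling after concatenation is an optional extra, not something the steps above require.

```python
import random


def stratified_sample(strata, sizes, seed=None):
    """Draw sizes[name] members at random from each stratum, then combine."""
    rng = random.Random(seed)
    combined = []
    for name, members in strata.items():
        k = sizes[name]
        if k > len(members):
            raise ValueError(f"stratum {name!r} has only {len(members)} members")
        combined.extend(rng.sample(members, k))  # sampling without replacement
    rng.shuffle(combined)  # avoid leaving the sample ordered by stratum
    return combined


# Hypothetical strata and per-stratum sample sizes.
strata = {
    "low": [("low", i) for i in range(50)],
    "medium": [("medium", i) for i in range(30)],
    "high": [("high", i) for i in range(20)],
}
sizes = {"low": 5, "medium": 3, "high": 2}

final_sample = stratified_sample(strata, sizes, seed=7)
print(len(final_sample))  # 10
```

The sizes here happen to be proportionate (10% of each stratum), but the same function works for a disproportionate allocation, since it only needs a count per stratum.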
Stratified Random Sampling is commonly used in numerous research and data collection scenarios, including:
There are certain scenarios in which Stratified Random Sampling works better than simple Random Sampling. Some of them are listed below:
Criteria | Simple Random Sampling | Systematic Sampling | Stratified Random Sampling | Cluster Sampling |
|---|---|---|---|---|
Definition | Every member has an equal chance of being included | A pre-defined, fixed sampling interval is used | The population is divided into strata and data are collected randomly from each stratum | The population is divided into clusters and a subset of those clusters is used for analysis |
Advantage | Simple and easy to implement | Can be more efficient than Random Sampling if there is an order in the population | Ensures each feature is present in the sample dataset | Efficient for vast and geographically dispersed populations |
Disadvantage | May not represent all the features in the sample dataset | Sensitive to periodicities present in the population | Complex to implement; if done wrongly, the model will be erroneous | Can introduce bias if the clusters are not representative of the population |
When to Use | The population is homogeneous | The population possesses a certain pattern | The population has distinct, unique features | The clusters are similar and capable of representing the entire dataset |
Efficiency Consideration | Less efficient when the dataset is heterogeneous | Efficient for ordered populations | Most efficient for heterogeneous populations | Efficient for vast and geographically dispersed populations |
Complexity of Implementation | Low | Moderate | Moderate | Moderate |
Even though Stratified Random Sampling has its own advantages, it comes with several disadvantages too.
In conclusion, Stratified Random Sampling holds an important position among statistical sampling methods. By considering the diversity of the population and dividing it into smaller groups called "strata", this approach increases the representativeness of a sample. Thoughtful creation of the strata ensures that the distinct features of the population are all represented in the sample dataset, which leads to more accurate and reliable results. Stratified Random Sampling is most useful when dealing with a heterogeneous population. Its potential to improve precision and reduce sampling error makes it a preferred choice for research and analysis where comprehensive and well-structured sampling is needed.