Introduction
Many people regard statistics as merely an extension of mathematics and overlook the standing it has gained as an independent branch of science. Statistics has contributed to the development of many branches of science in general and data science in particular. Statistical concepts are used throughout data science because statistics forms its quantitative foundation. That foundation is laid by concepts such as sampling, probability and surveys.
In this article, we examine the concepts of sampling, reliability and validity.
Importance of sampling in data science
Sampling is broadly classified into four major types.
The first type is called simple random sampling. In simple random sampling, members of a population are selected purely at random, so every member has an equal chance of being chosen and no other selection rule is applied.
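As a rough illustration of simple random sampling, the sketch below draws 10 members from a made-up population of 100 IDs using Python's standard library. The population, the sample size and the seed are all assumptions chosen for the example.

```python
import random

# Hypothetical population: 100 member IDs (illustrative data only).
population = list(range(1, 101))

# Fixed seed so the draw is reproducible; any seed would do.
rng = random.Random(42)

# Draw 10 distinct members; each member is equally likely to be chosen.
simple_sample = rng.sample(population, k=10)
print(simple_sample)
```

Because `sample` draws without replacement, no member appears twice in the result.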
The second type of sampling is called systematic random sampling. In systematic random sampling, members are selected according to a fixed rule. For instance, a sampling interval n may be chosen so that every nth member of the group is selected, usually starting from a randomly chosen position.
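A minimal sketch of the "every n turns" idea, assuming a made-up population of 100 IDs and an interval of n = 10. The random starting offset is a common convention rather than something the text above prescribes.

```python
import random

# Hypothetical population: 100 member IDs (illustrative data only).
population = list(range(1, 101))

n = 10  # sampling interval: take every 10th member

# Pick a random starting position within the first interval.
rng = random.Random(0)
start = rng.randrange(n)

# Slice with a step of n to select every nth member from the start.
systematic_sample = population[start::n]
print(systematic_sample)
```

With 100 members and an interval of 10, the sample always contains 10 members, whatever the starting position.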
The third type of sampling is called stratified sampling. In this type of sampling, the population is first divided into discrete groups (strata) on the basis of certain characteristics. After this, members are selected from each group, so that every group is represented and the sample is much more uniform.
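The group-then-sample procedure described above (commonly known as stratified sampling) might be sketched as follows. The records, the `age_group` characteristic and the helper name `stratified_sample` are hypothetical, and sampling the same fraction from every group is just one possible allocation rule.

```python
import random
from collections import defaultdict

# Hypothetical records: (person_id, age_group). Illustrative data only.
population = [
    (i, "under_30" if i % 3 == 0 else "30_and_over") for i in range(90)
]

def stratified_sample(records, key, fraction, rng):
    """Group records by key, then sample the same fraction from each group."""
    strata = defaultdict(list)
    for rec in records:
        strata[key(rec)].append(rec)

    sample = []
    for members in strata.values():
        # Take at least one member so that no group is left out.
        k = max(1, round(len(members) * fraction))
        sample.extend(rng.sample(members, k))
    return sample

rng = random.Random(1)
stratified = stratified_sample(
    population, key=lambda rec: rec[1], fraction=0.1, rng=rng
)
print(stratified)
```

Here the 30 "under_30" records contribute 3 members and the 60 "30_and_over" records contribute 6, so the sample keeps the population's proportions.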
The fourth type of sampling is called cluster sampling. In cluster sampling, the first step is the formation of clusters, but no particular characteristic is used to classify them. Clusters are then selected at random, and samples are drawn at random from within the selected clusters, so that each selected cluster is represented in the final sample.
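A two-stage sketch of cluster sampling, under the assumption that the clusters are eight hypothetical stores with 25 customers each. The choice of three clusters and five customers per cluster is arbitrary.

```python
import random

# Hypothetical clusters: 8 stores, each with 25 customer IDs.
clusters = {
    f"store_{s}": [f"s{s}_c{c}" for c in range(25)] for s in range(8)
}

rng = random.Random(7)

# Stage 1: choose 3 clusters at random.
chosen = rng.sample(sorted(clusters), k=3)

# Stage 2: draw 5 customers at random from each chosen cluster.
cluster_sample = []
for name in chosen:
    cluster_sample.extend(rng.sample(clusters[name], k=5))

print(chosen)
print(cluster_sample)
```

Only the chosen clusters contribute to the sample, which is what distinguishes this approach from sampling within every group.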
In addition to these, there are other types of sampling, such as convenience sampling, which is also called accidental sampling. Another type is purposive sampling, which is divided into quota sampling and snowball sampling.
Checking the validity and reliability of data
Statistics is very important for checking the reliability and validity of data sets in data science. Reliability refers to how consistently an experiment produces the same output when it is repeated. Reliability can broadly be divided into two types.
The first type is called temporal reliability. Temporal reliability is about whether the same results are obtained when a measurement is repeated at different points in time.
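One common way to put a number on consistency over time is to correlate the results of the same measurement taken on two occasions. The sketch below computes a Pearson correlation by hand. The scores are made up, and the use of Pearson's r here is an assumption for illustration, not a method the article prescribes.

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson correlation between two equal-length lists of numbers."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)

# Made-up scores for the same six subjects, measured three weeks apart.
week_1 = [12, 15, 9, 20, 17, 11]
week_4 = [13, 14, 10, 19, 18, 11]

r = pearson_r(week_1, week_4)
print(round(r, 3))
```

A value close to 1 suggests that the measurement gives similar results at both points in time.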
The second type of reliability is called comparative reliability. Comparative reliability is about whether results stay consistent when the observer or the testing methodology changes.
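For the case where the observer changes, agreement between two observers can be summarized with Cohen's kappa, which corrects raw agreement for agreement expected by chance. This is one possible measure, chosen here as an assumption. The labels are made up and `cohens_kappa` is a hypothetical helper.

```python
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters labelling the same items.

    Undefined (division by zero) when chance agreement is exactly 1.
    """
    n = len(rater_a)
    # Share of items on which the two raters agree.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Agreement expected by chance, from each rater's label frequencies.
    counts_a = Counter(rater_a)
    counts_b = Counter(rater_b)
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    return (observed - expected) / (1 - expected)

# Made-up labels from two observers rating the same eight items.
observer_1 = ["pos", "pos", "neg", "neg", "pos", "neg", "pos", "neg"]
observer_2 = ["pos", "pos", "neg", "pos", "pos", "neg", "pos", "neg"]

kappa = cohens_kappa(observer_1, observer_2)
print(kappa)
```

The two observers agree on seven of the eight items, and chance agreement is 0.5, so kappa is 0.75 for this data.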
Reliability is relatively easy to establish for quantitative results. It becomes harder to establish when the input is qualitative.
Validity of an experiment
Validity is about hitting the bull’s eye, that is, whether an experiment actually measures or achieves what it set out to. Validity can further be classified into four types.
The first type is called internal validity and is used for establishing causal relationships between the variables in an experiment.
The second type of validity is called external validity and is used to check if the results obtained in an experiment apply to a larger group.
The third type of validity is called measurement validity. This type of validity is used to check if the results obtained measure exactly the quantity that was originally intended.
The fourth type is called ecological validity. It establishes the relationship between a particular research study and the natural experience of people, and checks whether the results of an experiment can be applied to natural settings.
Factors affecting validity
There are three major factors that affect the validity of an experiment.
The first is the history of an event. Variables are bound to take different values at different points in time, so the moment at which a measurement is taken can affect the result.
The second factor that influences validity is methodology. When different methods are adopted, the overall validity of an experiment is likely to change.
The third factor that influences validity is selection bias. Selection bias occurs when the samples that are selected are not representative of the entire population.
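The effect of selection bias can be seen in a small simulation. The sketch below compares a random sample with a sample drawn only from the upper half of a synthetic population. The distribution, the sizes and the seed are all assumptions chosen for illustration.

```python
import random
from statistics import mean

rng = random.Random(3)

# Synthetic population: 10,000 values around 50 (illustrative only).
population = [rng.gauss(50, 10) for _ in range(10_000)]

# Representative sample: drawn at random from the whole population.
random_sample = rng.sample(population, 500)

# Biased sample: drawn only from the upper half of the population.
upper_half = sorted(population)[5000:]
biased_sample = rng.sample(upper_half, 500)

print("population mean:   ", round(mean(population), 2))
print("random sample mean:", round(mean(random_sample), 2))
print("biased sample mean:", round(mean(biased_sample), 2))
```

The random sample's mean should land close to the population mean, while the biased sample overestimates it, because part of the population never had a chance of being selected.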
Concluding remarks
Sampling, reliability and validity are among the most fundamental concepts for understanding the various areas of data science. These concepts find application in a large number of domains and across disciplines.