Analysis of Variance


  1. Introduction
    1. The analysis of variance is a very powerful statistical tool for tests of significance. The test of significance based on the t-distribution is a satisfactory procedure only for testing the significance of the difference between two sample means. In a situation where we have three or more samples to consider at a time, an alternative procedure is needed for testing the hypothesis that all the samples are drawn from the same population, i.e., that they have the same mean. For example, if five fertilisers are applied to four plots each of wheat and the yield of wheat on each of the plots is given, we may be interested in finding out whether the effect of these fertilisers on the yields is significantly different or, in other words, whether the samples have come from the same normal population. The answer to this problem is provided by the analysis of variance. The basic purpose of the analysis of variance is to test the homogeneity of several means.
    2. The term ‘Analysis of Variance’ was introduced by the father of statistics, Prof. R.A. Fisher, in the 1920’s to deal with problems in the analysis of agronomical data. No doubt, variation is deep-rooted in nature. The total variation in any set of numerical data is due to a number of causes which may be classified as:
      1. Assignable causes, and
      2. Chance causes.
    3. The variation due to assignable causes can be detected and measured, whereas the variation due to chance causes is beyond human control and cannot be traced separately.

Definition. According to Prof. R.A. Fisher, Analysis of Variance (ANOVA) is the “Separation of variance ascribable to one group of causes from the variance ascribable to other group”.

  1. By this technique the total variation in the sample data is expressed as the sum of non-negative components, where each component is a measure of the variation due to some specific independent source, factor or cause. ANOVA consists of estimating the amount of variation due to each of the independent factors (causes) separately and then comparing these estimates for the assignable factors (causes) with the estimate for the chance factor (causes), the latter being known as ‘experimental error’ or simply ‘error’.
    1. Assumptions for the ANOVA test. The ANOVA test is based on the test statistic F (or Variance Ratio).

For the validity of F-test in analysis of variance, the following assumptions are made:

  1. The observations are independent,
  2. The parent population from which the samples are taken is normal, and
  3. Various treatment and environmental effects are additive in nature.

In the following sequences, we will discuss the analysis of variance for:

  1. One-way classification, and
  2. Two-way classification

 

  1. One-way classification
    1. Let us suppose that N observations yij, (i=1, 2, …., k; j=1, 2, …., ni) of a random variable Y are grouped, on some basis, into k classes of sizes n1, n2, ….., nk respectively, where N=n1+n2+n3+……+nk.
    2. The total variation in the observation yij can be split into the following two components:
      1. The variation between the classes, i.e., the variation due to the different bases of classification, commonly known as treatments.
      2. The variation within the classes, i.e., the inherent variation of the random variable among the observations of a class.
    3. The first type of variation is due to assignable causes, which can be detected and controlled by human endeavour, while the second type is due to chance causes, which are beyond human control.
    4. The main objective of analysis of variance technique is to examine if there is significant difference between the class means in view of the inherent variability within the separate classes.
    5. In particular, let us consider the effect of k different rations on the yield of milk of N cows (of the same breed and stock) divided into k classes of sizes n1, n2, n3, …., nk respectively, N=n1+n2+n3+…..+nk. Here the sources of variation are:
      1. Effect of the rations (treatments): ti; i=1, 2, ….., k.
      2. Error (ɛ) produced by numerous causes of such magnitude that they cannot be detected and identified with the knowledge we have; together they produce a variation of random nature obeying the Gaussian (Normal) law of errors.
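The split of the total variation into between-class (treatment) and within-class (error) components can be sketched numerically. The milk-yield figures below are hypothetical, purely for illustration; the sums of squares and the variance ratio F follow the standard one-way decomposition described above:

```python
# Hypothetical data: milk yield (litres) for cows fed k = 3 different rations.
groups = [
    [14.0, 15.0, 16.0, 15.5],   # ration 1
    [12.0, 13.0, 12.5, 13.5],   # ration 2
    [16.0, 17.0, 16.5, 18.0],   # ration 3
]

N = sum(len(g) for g in groups)                      # total observations
k = len(groups)                                      # number of classes (treatments)
grand_mean = sum(sum(g) for g in groups) / N

# Between-class (treatment) sum of squares: sum of n_i * (class mean - grand mean)^2
ss_between = sum(len(g) * (sum(g) / len(g) - grand_mean) ** 2 for g in groups)

# Within-class (error) sum of squares: deviations of each observation from its class mean
ss_within = sum(sum((y - sum(g) / len(g)) ** 2 for y in g) for g in groups)

ms_between = ss_between / (k - 1)                    # treatment mean square
ms_within = ss_within / (N - k)                      # error mean square
F = ms_between / ms_within                           # variance ratio
```

A large F relative to the critical value of the F-distribution with (k-1, N-k) degrees of freedom would lead to rejecting the hypothesis that the class means are homogeneous. Note that ss_between + ss_within reproduces the total sum of squares exactly, which is the decomposition the technique rests on.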

 

NOTE: Fixed Effect Model v/s. Random Effect Model

  1. Fixed Effect Model. Suppose the k levels of the factor (treatments) under consideration are the only levels of interest and all of them are included in the experiment by the investigator; or, out of a large number of classes, the k classes (treatments) in the model have been specifically chosen by the experimenter.
  2. Random Effect Model. Suppose we have a large number of classes (levels of factor under consideration) and we want to test through an experiment if all these class effects are equal or not. Due to consideration of time, money or administrative convenience, it may not be possible to include all the factor-levels in the experiment. In such a situation, we take only a random sample of classes (factor levels) in the experiment and after studying and analysing the sample data, we want to draw conclusions which would be valid for all the classes (factor levels), whether included in the experiment or not.

In the random effect model, if the null hypothesis of the homogeneity of class (treatment) effects is rejected, then to test the significance of the difference between two class (treatment) effects we cannot apply the t-test, because all the treatments are not included in the experiment.

 

 

  1. Two-way classification (one observation per cell)
    1. Suppose n observations are classified into k categories (or classes), say A1, A2, ….., Ak, according to some criterion A, and into h categories, say B1, B2, ….., Bh, according to some criterion B, giving hk combinations (Ai, Bj), i=1, 2, ….., k; j=1, 2, …..h, often called cells. This scheme of classification according to two factors or criteria is called ‘two-way classification’ and its analysis is called ‘two-way analysis of variance’. The number of observations in each cell may be equal or different, but we shall consider the case of one observation per cell, so that the total number of observations is n=hk.
    2. In the two-way classification, the values of the response variable are affected by two factors.

For example, the yield of milk may be affected by differences in treatments, i.e., rations as well as the differences in variety, i.e., breed and stock of the cows. Let us now suppose that the n cows are divided into h different groups or classes according to their breed and stock, each group containing k cows and then let us consider the effect of k treatments (i.e., rations given at random to cows in each group) in the yield of milk.
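The two-way decomposition for one observation per cell can be sketched in the same way. The table below is hypothetical (rows = k rations, columns = h breed groups); the total sum of squares splits into a row (ration) component, a column (breed) component and a residual, and each factor is tested against the residual mean square:

```python
# Hypothetical two-way table: y[i][j] is the milk yield of the cow in
# breed group j fed ration i (one observation per cell).
y = [
    [14.0, 15.5, 13.0],
    [16.0, 17.5, 15.0],
    [12.0, 13.5, 11.5],
    [15.0, 16.0, 14.0],
]
k = len(y)                                   # levels of factor A (rations)
h = len(y[0])                                # levels of factor B (breed groups)
n = k * h                                    # one observation per cell

grand = sum(sum(row) for row in y) / n
row_means = [sum(row) / h for row in y]
col_means = [sum(y[i][j] for i in range(k)) / k for j in range(h)]

ss_rows = h * sum((m - grand) ** 2 for m in row_means)    # ration (treatment) SS
ss_cols = k * sum((m - grand) ** 2 for m in col_means)    # breed (block) SS
ss_total = sum((y[i][j] - grand) ** 2 for i in range(k) for j in range(h))
ss_error = ss_total - ss_rows - ss_cols                   # residual SS

df_error = (k - 1) * (h - 1)
F_rows = (ss_rows / (k - 1)) / (ss_error / df_error)      # test for ration effect
F_cols = (ss_cols / (h - 1)) / (ss_error / df_error)      # test for breed effect
```

Here F_rows is referred to the F-distribution with (k-1, (k-1)(h-1)) degrees of freedom and F_cols to the one with (h-1, (k-1)(h-1)); the residual carries the degrees of freedom left over once both classifications are accounted for.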

 
