Concepts and terminologies

Concept

Data analysis is a process, within which several phases can be distinguished:

Data cleaning

Data cleaning is an important procedure during which the data are inspected and erroneous data are corrected where this is necessary, preferable, and possible. Data cleaning can be done during the stage of data entry. If this is done, it is important that no subjective decisions are made. The guiding principle provided by Adèr (ref) is: during subsequent manipulations of the data, information should always be cumulatively retrievable. In other words, it should always be possible to undo any data set alterations. Therefore, it is important not to throw information away at any stage in the data cleaning phase. All information should be saved (i.e., when altering variables, both the original values and the new values should be kept, either in a duplicate data set or under a different variable name), and all alterations to the data set should be carefully and clearly documented, for instance in a syntax or a log.
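The principle of keeping both the original and the new values, and of logging every alteration, can be sketched in Python as follows. This is only an illustration: the function name `recode_value`, the `_orig` suffix, the record layout, and the rule that 999 codes a missing age are all assumptions, not part of any prescribed procedure.

```python
import copy

def recode_value(records, variable, old, new, log):
    """Recode `old` to `new` in `variable` without losing information.

    The original value is kept under `<variable>_orig`, and every change
    is appended to `log`, so each alteration can be traced and undone.
    """
    cleaned = copy.deepcopy(records)  # the raw data set is never modified
    for row_index, row in enumerate(cleaned):
        # Keep the untouched original next to the (possibly) cleaned value.
        row.setdefault(f"{variable}_orig", row[variable])
        if row[variable] == old:
            row[variable] = new
            log.append({"row": row_index, "variable": variable,
                        "old": old, "new": new})
    return cleaned

raw = [{"id": 1, "age": 34}, {"id": 2, "age": 999}, {"id": 3, "age": 51}]
change_log = []

# Hypothetical rule: 999 was used at data entry to code "age not recorded".
cleaned = recode_value(raw, "age", 999, None, change_log)

print(cleaned[1])
print(change_log)
```

Because the raw records are copied rather than edited and the log records the old value, the alteration can be reversed at any later stage.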

Initial data analysis

The most important distinction between the initial data analysis phase and the main analysis phase is that during initial data analysis one refrains from any analyses that are aimed at answering the original research question. The initial data analysis phase is guided by the following four questions:

Quality of data

The quality of the data should be checked as early as possible. Data quality can be assessed in several ways, using different types of analyses: frequency counts, descriptive statistics (mean, standard deviation, median), normality (skewness, kurtosis, frequency histograms, normal probability plots), associations (correlations, scatter plots).
Other initial data quality checks are:

Checks on data cleaning: have decisions influenced the distribution of the variables? The distribution of the variables before data cleaning is compared to the distribution of the variables after data cleaning to see whether data cleaning has had unwanted effects on the data.
Analysis of missing observations: are there many missing values, and are the values missing at random? The missing observations in the data are analyzed to see whether more than 25% of the values are missing, whether they are missing at random (MAR), and whether some form of imputation is needed.
Analysis of extreme observations: outlying observations in the data are analyzed to see if they seem to disturb the distribution.
Comparison and correction of differences in coding schemes: variables are compared with coding schemes of variables external to the data set, and possibly corrected if coding schemes are not comparable.
Test for common-method variance.

The choice of analyses to assess the data quality during the initial data analysis phase depends on the analyses that will be conducted in the main analysis phase.
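As a minimal sketch of the missing-observations check described above, the following Python snippet computes the share of missing values in one variable and compares it with the 25% threshold. The variable name `income`, its values, and the use of `None` to mark a missing value are assumptions made for illustration. The snippet does not test whether values are missing at random.

```python
def missing_fraction(values):
    """Return the share of entries that are missing (coded as None)."""
    if not values:
        raise ValueError("values must not be empty")
    return sum(v is None for v in values) / len(values)

# Hypothetical variable with two missing observations out of eight.
income = [2100, None, 1850, 3020, None, 2790, 2400, 1990]

share = missing_fraction(income)
above_threshold = share > 0.25  # the 25% threshold mentioned in the text

print(f"{share:.1%} missing, above 25%: {above_threshold}")
```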

Quality of measurements

The quality of the measurement instruments should only be checked during the initial data analysis phase when this is not the focus or research question of the study. One should check whether the structure of the measurement instruments corresponds to the structure reported in the literature.
There are two ways to assess measurement quality:

Confirmatory factor analysis
Analysis of homogeneity (internal consistency), which gives an indication of the reliability of a measurement instrument. During this analysis, one inspects the variances of the items and the scales, the Cronbach's α of the scales, and the change in Cronbach's α if an item were deleted from a scale.
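The homogeneity analysis can be illustrated with a short Python sketch. It uses the usual formula α = k/(k − 1) × (1 − Σ item variances / variance of the total score), where k is the number of items. The item scores are invented for illustration, and the function names are hypothetical.

```python
from statistics import variance

def cronbach_alpha(items):
    """Cronbach's alpha for a scale.

    `items` is a list of items; each item is a list with one score per
    respondent, and all items have the same length.
    """
    k = len(items)
    if k < 2:
        raise ValueError("need at least two items")
    totals = [sum(scores) for scores in zip(*items)]  # scale score per respondent
    item_variance_sum = sum(variance(item) for item in items)
    return k / (k - 1) * (1 - item_variance_sum / variance(totals))

def alpha_if_item_deleted(items):
    """Alpha of the scale recomputed with each item left out in turn."""
    return [cronbach_alpha(items[:i] + items[i + 1:]) for i in range(len(items))]

# Hypothetical scores of five respondents on four items;
# the fourth item behaves differently from the other three.
items = [
    [2, 3, 4, 5, 4],
    [3, 3, 5, 5, 4],
    [2, 4, 4, 5, 5],
    [5, 1, 3, 2, 4],
]

print(round(cronbach_alpha(items), 3))
print([round(a, 3) for a in alpha_if_item_deleted(items)])
```

In this invented data set, alpha rises clearly when the fourth item is left out, which is the kind of pattern one looks for when deciding whether an item fits the scale.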

Initial transformations

After assessing the quality of the data and of the measurements, one might decide to impute missing data, or to perform initial transformations of one or more variables, although this can also be done during the main analysis phase.
Possible transformations of variables are:

Square root transformation (if the distribution differs moderately from normal)
Log-transformation (if the distribution differs substantially from normal)
Inverse transformation (if the distribution differs severely from normal)
Make categorical (ordinal / dichotomous) (if the distribution differs severely from normal, and no transformations help)
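The three numerical transformations listed above can be sketched in Python as follows. The sketch assumes strictly positive values, since the logarithm and the inverse are undefined at zero; the function name `transform` and the example data are illustrative only.

```python
import math

def transform(values, kind):
    """Apply a square root, log, or inverse transformation to positive data."""
    if any(v <= 0 for v in values):
        raise ValueError("this sketch assumes strictly positive values")
    functions = {
        "sqrt": math.sqrt,
        "log": math.log,              # natural logarithm
        "inverse": lambda v: 1 / v,
    }
    return [functions[kind](v) for v in values]

skewed = [1, 4, 9, 100]               # hypothetical right-skewed variable
print(transform(skewed, "sqrt"))      # [1.0, 2.0, 3.0, 10.0]
print(transform(skewed, "log"))
print(transform(skewed, "inverse"))
```

Note that the inverse transformation reverses the order of the values, which matters when interpreting the direction of effects afterwards.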

Did the implementation of the study fulfill the intentions of the research design?

One should check the success of the randomization procedure, for instance by checking whether background and substantive variables are equally distributed within and across groups.
If the study did not need and/or use a randomization procedure, one should check the success of the non-random sampling, for instance by checking whether all subgroups of the population of interest are represented in the sample.
Other possible data distortions that should be checked are:

Dropout (this should be identified during the initial data analysis phase)
Item nonresponse (whether this is random or not should be assessed during the initial data analysis phase)
Treatment quality (using manipulation checks).

Characteristics of data sample

In any report or article, the structure of the sample must be accurately described. It is especially important to exactly determine the structure of the sample (and specifically the size of the subgroups) when subgroup analyses will be performed during the main analysis phase.
The characteristics of the data sample can be assessed by looking at:

Basic statistics of important variables
Scatter plots
Correlations
Cross-tabulations

Final stage of the initial data analysis

During the final stage, the findings of the initial data analysis are documented, and necessary, preferable, and possible corrective actions are taken.
Also, the original plan for the main data analyses can and should be specified in more detail and/or rewritten.
In order to do this, several decisions about the main data analyses can and should be made:

In the case of non-normality: should one transform variables; make variables categorical (ordinal/dichotomous); adapt the analysis method?
In the case of missing data: should one neglect or impute the missing data; which imputation technique should be used?
In the case of outliers: should one use robust analysis techniques?
In case items do not fit the scale: should one adapt the measurement instrument by omitting items, or rather ensure comparability with other (uses of the) measurement instrument(s)?
In the case of (too) small subgroups: should one drop the hypothesis about inter-group differences, or use small sample techniques, like exact tests or bootstrapping?
In case the randomization procedure seems to be defective: can and should one calculate propensity scores and include them as covariates in the main analyses?

Analyses

Several analyses can be used during the initial data analysis phase:

Univariate statistics
Bivariate associations (correlations)
Graphical techniques (scatter plots)

It is important to take the measurement levels of the variables into account for the analyses, as special statistical techniques are available for each level:

Nominal and ordinal variables
- Frequency counts (numbers and percentages)
- Associations
  - crosstabulations
  - hierarchical loglinear analysis (restricted to a maximum of 8 variables)
  - loglinear analysis (to identify relevant/important variables and possible confounders)
- Exact tests or bootstrapping (in case subgroups are small)
- Computation of new variables

Continuous variables
- Distribution
  - Statistics (M, SD, variance, skewness, kurtosis)
  - Stem-and-leaf displays
  - Box plots
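The distribution statistics listed for continuous variables can be computed with Python's standard library. The sketch below uses population (moment-based) formulas for skewness and excess kurtosis; statistical packages often apply small-sample corrections, so their output may differ slightly. The data values are hypothetical.

```python
from statistics import mean, pstdev, stdev, variance

def skewness(values):
    """Moment-based skewness: 0 for a perfectly symmetric sample."""
    m, s = mean(values), pstdev(values)
    if s == 0:
        raise ValueError("skewness is undefined for constant data")
    return sum(((v - m) / s) ** 3 for v in values) / len(values)

def excess_kurtosis(values):
    """Moment-based kurtosis minus 3, so a normal distribution scores 0."""
    m, s = mean(values), pstdev(values)
    if s == 0:
        raise ValueError("kurtosis is undefined for constant data")
    return sum(((v - m) / s) ** 4 for v in values) / len(values) - 3

scores = [1, 1, 1, 2, 10]  # hypothetical right-skewed variable

print("M =", mean(scores))
print("SD =", round(stdev(scores), 3))
print("variance =", round(variance(scores), 3))
print("skewness =", round(skewness(scores), 3))
print("excess kurtosis =", round(excess_kurtosis(scores), 3))
```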

Main data analysis

In the main analysis phase, analyses aimed at answering the research question are performed, as well as any other relevant analyses needed to write the first draft of the research report.

Exploratory and confirmatory approaches

In the main analysis phase either an exploratory or confirmatory approach can be adopted. Usually the approach is decided before data is collected. In an exploratory analysis no clear hypothesis is stated before analysing the data, and the data is searched for models that describe the data well. In a confirmatory analysis clear hypotheses about the data are tested.

Exploratory data analysis should be interpreted carefully. When testing multiple models at once there is a high chance of finding at least one of them to be significant, but this can be due to a type I error. It is important to always adjust the significance level when testing multiple models, for example with a Bonferroni correction. Also, one should not follow up an exploratory analysis with a confirmatory analysis in the same dataset. An exploratory analysis is used to find ideas for a theory, but not to test that theory as well. When a model is found through exploratory analysis in a dataset, following up that analysis with a confirmatory analysis in the same dataset could simply mean that the results of the confirmatory analysis are due to the same type I error that resulted in the exploratory model in the first place. The confirmatory analysis therefore will not be more informative than the original exploratory analysis.
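The effect of multiple testing, and the Bonferroni adjustment of the significance level, can be sketched in Python. The p-values below are invented, and the family-wise error rate formula assumes that the tests are independent, which is a simplification.

```python
def familywise_error_rate(n_tests, alpha=0.05):
    """Chance of at least one type I error across independent tests."""
    return 1 - (1 - alpha) ** n_tests

def bonferroni(p_values, alpha=0.05):
    """Flag which p-values remain significant at the level alpha / m."""
    threshold = alpha / len(p_values)
    return [p < threshold for p in p_values]

p_values = [0.003, 0.020, 0.040, 0.300]  # hypothetical results of four tests

print(round(familywise_error_rate(len(p_values)), 3))  # 0.185
print(bonferroni(p_values))  # [True, False, False, False]
```

Three of the four invented results fall below the unadjusted level of 0.05, but only one survives the adjusted threshold of 0.0125.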

Stability of results

It is important to obtain some indication about how generalizable the results are. While this is hard to check, one can look at the stability of the results. Are the results reliable and reproducible? There are two main ways of doing this:

Cross-validation: By splitting the data into multiple parts we can check if analyses (like a fitted model) based on one part of the data generalize to another part of the data as well.
Sensitivity analysis: A procedure to study the behavior of a system or model when global parameters are (systematically) varied. One way to do this is with bootstrapping.
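Both stability checks can be illustrated with a short Python sketch. The first function splits row indices into k parts for cross-validation; the second resamples the data with replacement (bootstrapping) to see how much a statistic, here the mean, varies across resamples. The function names, the seed values, and the data are assumptions made for illustration.

```python
import random
from statistics import mean, stdev

def k_fold_indices(n, k, seed=0):
    """Shuffle the row indices 0..n-1 and deal them into k folds."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]

def bootstrap_means(values, n_resamples=1000, seed=0):
    """Mean of each bootstrap resample (same size, drawn with replacement)."""
    rng = random.Random(seed)
    return [mean(rng.choices(values, k=len(values)))
            for _ in range(n_resamples)]

data = [12, 15, 11, 19, 14, 13, 17, 16, 18, 10]  # hypothetical observations

# Cross-validation: fit on all folds but one, evaluate on the held-out fold.
for held_out in k_fold_indices(len(data), k=3):
    training = [data[i] for i in range(len(data)) if i not in held_out]
    print("held-out rows:", sorted(held_out), "training mean:", mean(training))

# Bootstrapping: the spread of the resampled means indicates stability.
means = bootstrap_means(data)
print("bootstrap SD of the mean:", round(stdev(means), 3))
```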

Statistical methods

Many statistical methods have been used for statistical analyses. A very brief list of four of the more popular methods is:

General linear model: A widely used model on which various statistical methods are based (e.g. t test, ANOVA, ANCOVA, MANOVA). Usable for assessing the effect of several predictors on one or more continuous dependent variables.
Generalized linear model: An extension of the general linear model for discrete dependent variables.
Structural equation modelling: Usable for assessing latent structures from measured manifest variables.
Item response theory: Models for (mostly) assessing one latent variable from several binary measured variables (e.g. an exam).

Terminologies used in Data Analytics

Alternative hypothesis
A precise statement relating to the research question to be tested, expressed in terms which assume a relationship (association) or difference between variables. Used in conjunction with a suitable null hypothesis. (See the issues of quantitative analysis for more information.)

Audit trail
Confirms the quality of qualitative research

Auditable
Able to audit the qualitative research process

Average
A general term given to a descriptive statistic, which gives a measure of central tendency of sample data, e.g. mean, median, mode.

Bias
An unplanned effect on the data collection in research, which may influence results. For example 'non response bias' in the return of postal questionnaires.

Categorical data
Data at non-measurement level, grouped into categories. For example, nominal – gender, or ordinal – income band

Confidence interval
A range of values, within which we are fairly sure the true value of the parameter being investigated lies. A common choice is the 95% confidence interval (CI). Thus, for example, we can be 95% certain that the true population mean lies approximately within the interval calculated from the sample mean ± 2 × standard error of the mean. The multiplier 2 is an approximation (about 1.96 for large samples); the exact value depends on sample size.
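The approximate interval described in this entry, the sample mean ± 2 × standard error of the mean, can be computed as in the Python sketch below. The sample values are hypothetical, and the fixed multiplier 2 is the approximation mentioned in the entry rather than an exact critical value.

```python
import math
from statistics import mean, stdev

def approx_ci_95(sample):
    """Approximate 95% CI for the mean: mean +/- 2 x standard error."""
    m = mean(sample)
    sem = stdev(sample) / math.sqrt(len(sample))  # standard error of the mean
    return m - 2 * sem, m + 2 * sem

sample = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical measurements
low, high = approx_ci_95(sample)
print(f"mean = {mean(sample)}, approximate 95% CI = ({low:.2f}, {high:.2f})")
```

For small samples such as this one, a multiplier taken from the t distribution would give a somewhat wider interval.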

Contingency table
A contingency table is a two-dimensional table of counts, usually showing frequencies of two variables, displayed in rows and columns respectively.

Continuous data
This is data measured at least at interval level. It is as precise as measuring instruments will allow.

Critical appraisal
Interpreting the strengths and weaknesses of the research process and applying judgements to practice

Data
Information gathered in the course of a research study. It may be quantitative or qualitative.

Data analysis
Processing, interpretation and analysis of findings

Deductive paradigm
The testing/application of theories

Dependent variable
The variable which is assumed to respond to the values of the independent (explanatory) variable. For example, blood pressure could be deemed to respond to changes in age.

Discrete data
Data measured at least at interval level, but only as whole numbers (integers). For example, household size, or number of siblings.

Ethical committee
A committee of members who judge the appropriateness and merit of proposed research

Focus groups
A group composed of between six and twelve individuals who meet to discuss a research problem

Hypothesis
A precise statement relating to the research question to be tested.

Independent variable
The variable which is assumed to determine the values of the dependent (response) variable. For example, age could be deemed to determine changes in blood pressure; age is then the independent variable.

Inductive paradigm
The development of theories from observation

Interquartile range (IQR)
This is the difference between the upper (Q₃) and lower (Q₁) quartiles. It is less sensitive to extreme outliers than the range, as a measure of spread of data.
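The lower sensitivity of the IQR to outliers can be shown with a small Python sketch. It relies on `statistics.quantiles` from the standard library (Python 3.8 or later) with the "inclusive" method; other quartile conventions exist and can give slightly different values. The data are invented.

```python
from statistics import quantiles

def iqr(values):
    """Interquartile range: upper quartile (Q3) minus lower quartile (Q1)."""
    q1, _median, q3 = quantiles(values, n=4, method="inclusive")
    return q3 - q1

def value_range(values):
    """Range: maximum minus minimum."""
    return max(values) - min(values)

clean = [1, 2, 3, 4, 5, 6, 7, 8, 9]
with_outlier = [1, 2, 3, 4, 5, 6, 7, 8, 90]  # last value replaced by an outlier

print(value_range(clean), iqr(clean))                # 8 4.0
print(value_range(with_outlier), iqr(with_outlier))  # 89 4.0
```

The single extreme value increases the range more than tenfold, while the IQR is unchanged.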

Interval/ratio data
This is data recorded on a scale with equal distances between points. Data can be continuous or discrete. Data at ratio level has the additional quality of an 'absolute zero'. Thus temperature (in °C or °F) is measured at interval level. Weight, height, etc. are at ratio level.

Linearity
To calculate Pearson's coefficient of correlation to measure the level of association between 2 variables, it is necessary for the data to be related following a straight line. Thus a check for linearity is obtained by plotting a scatter diagram of the two variables.
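Why linearity matters can be seen in a short Python sketch that computes Pearson's coefficient by hand. The data are invented: one variable is an exact linear function of x and the other an exact quadratic function of x.

```python
import math
from statistics import mean

def pearson_r(x, y):
    """Pearson's coefficient of correlation between two equal-length lists."""
    mx, my = mean(x), mean(y)
    cross = sum((a - mx) * (b - my) for a, b in zip(x, y))
    ss_x = sum((a - mx) ** 2 for a in x)
    ss_y = sum((b - my) ** 2 for b in y)
    return cross / math.sqrt(ss_x * ss_y)

x = [-2, -1, 0, 1, 2]
linear = [2 * v + 1 for v in x]   # straight-line relationship
curved = [v ** 2 for v in x]      # perfect, but non-linear, relationship

print(pearson_r(x, linear))
print(pearson_r(x, curved))
```

The quadratic variable is fully determined by x, yet its coefficient is zero, which is why a scatter diagram should be inspected before Pearson's coefficient is interpreted.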

Literature review
Appraisal of previous research or literature on a subject

Mean
The arithmetic mean is a descriptive statistic, which is a measure of central tendency, or average, around which the data clusters. All data in a sample is used. It is appropriate for data measured at least at interval level.

Median
The median is a descriptive statistic, which is a measure of central tendency, or average, around which the data clusters. It is the middle value when data in a sample is arranged in order. It is appropriate for data measured at least at ordinal level.

Mode
The mode is a descriptive statistic, which is a measure of central tendency, or average, around which the data clusters. It is the most frequently occurring value in a sample. It is appropriate for categorical data.

Nominal data
Categorical data gathered into groups, with no order attached to them. For example, ethnicity.

Non-probability sampling
Sampling techniques that do not use random selection

Normal distribution
Data that follow a normal distribution form a symmetric, bell-shaped curve when their frequencies are plotted.

Null hypothesis
A precise statement relating to the research question to be tested, expressed in terms which assume no relationship (association) or difference between variables. Used in conjunction with a suitable alternative hypothesis. (See the issues of quantitative analysis for more information.)

Ordinal data
Categorical data gathered into groups, with order attached to them. For example, job grade, age group.

Outlier
An extreme, or atypical, data value (or values) in a sample. Outliers should be considered carefully before exclusion from analysis. For example, data values may be recorded erroneously, and hence they may be corrected. However, in other cases they may just be surprisingly different, but not necessarily 'wrong'.

Parameter
This is a property of a population, e.g. the mean or standard deviation, which is being estimated from the sample data.

Percentiles
Percentiles split the sample data into hundredths. For example, the 25th percentile is equivalent to the lower quartile, and the 50th percentile is the same as the median.

Population
The group of individuals, or items, to be studied is called the population.
For example, men aged 21 and over; pregnant women; households in Bristol; houses in Bristol. The subset of this population that is measured or observed is called the sample.

Probability sampling
Sampling techniques that use random selection in obtaining the sample

P-value
The probability of obtaining a result at least as extreme as the one observed, assuming the null hypothesis is true.

Qualitative analysis
Interpretation of words and text

Quantitative analysis
Interpretation of numerical data

Quartiles
The lower (Q₁) quartile is the value below which the bottom 25% of the sample data lie, and the upper (Q₃) quartile is the value above which the upper 25% lie.
NB. The middle quartile (Q₂) corresponds to the median.

Random sample
Every member of a population has an equally likely chance of being in a random sample. Random selection makes the sample more likely to be representative of the population being studied, although it does not guarantee this.

Range
A descriptive statistic equal to the maximum less the minimum value in a data set. It is a crude measure of variation (spread) of the data.

Rank
Sample values are ordered or ranked.

Raw data
Rows and columns containing numbers representing the collected data. Numbers may be values or category codes. Rows relate to subjects/cases, and columns relate to individual variables.

Representative
The extent to which sample data or members reflect accurately the characteristics of the population from which they are drawn.

Research process
The process undertaken by researchers to answer research questions/hypotheses.

Research question
A specific question that guides the research process

Sample
A subset (n) of the entire population (N). Those people, objects or events selected from the population for inclusion in the study.

Scatter
A scatter plot is drawn between 2 variables at interval/ratio level to check for a linear relationship, prior to calculating Pearson's coefficient of correlation. The independent (explanatory) variable is plotted on the horizontal (x) axis, and the dependent (response) variable is plotted on the vertical (y) axis.

Semi-interquartile range (SIQR)
This is half the interquartile range (IQR).

Significance level
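The threshold, typically 0.05, against which the p-value is compared. A result is called statistically significant when its p-value falls below this threshold.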

Standard deviation
The standard deviation is a descriptive statistic, which is a measure of dispersion, or spread, of sample data around the mean. All data in a sample is used. It is appropriate for data measured at least at interval level.

Standard error of the mean
A measure of the accuracy of the sample mean as an estimate of the population mean.
Equal to sample standard deviation divided by the square root of the sample size

Statistically significant
If a result is 'statistically significant', it implies a statistical test has been carried out, and the probability of obtaining the observed data (or more extreme) under the null hypothesis is small, typically less than 0.05.

Summary or descriptive statistics
These are a set of calculated terms from the sample data to describe the sample data to the reader. They include sample size, maximum & minimum values, averages (mean, median, mode), measures of variability (range, interquartile range, standard deviation).
Note that it is important to use the correct statistic depending on the level of measurement of the data.

Systematic review
Review of the literature based on a scientific design

Triangulation
A research design that includes two or more approaches to data collection or analysis

Trustworthiness
A description of the credibility, transferability, dependability and confirmability of qualitative research

Variable
A term ascribed to the characteristic(s) being investigated, which can take any value in a reasonable range. For example, blood group, blood pressure, age of patients being studied.

Verbatim extracts
Exact word for word and punctuated account of speech
