Data Types, Collection and Accuracy

Process improvement must be measurable, and data collection is what makes it measurable. Data collection is therefore critical for any improvement effort.

Types of data and Measurement scales

Data is objective information. The types of data and the measurement scales are discussed next.

Types of data

Data are of two types: discrete and continuous.

Attribute or discrete data – It is based on counting, like the number of processing errors or the count of customer complaints. Discrete data values can only be non-negative integers such as 0, 1, 2, 3, etc., and can be expressed as a proportion or percent (e.g., percent good, percent bad). Discrete data describes attributes and cannot be broken down into smaller units. It is often easier to collect and interpret because it is readable and understandable, but it requires large samples to be effective. Collection can also become subjective: you may run into trouble over what people are supposed to count and how they are supposed to record it. It includes:
Count or percentage – Counts of errors or the percentage of output with errors.
Binomial data – Data that can have only one of two values, like yes/no or pass/fail.
Attribute-Nominal – The “data” are names or labels, like Dept A, Dept B, Dept C in a company or Machine 1, Machine 2, Machine 3 in a shop.
Attribute-Ordinal – The names or labels represent some value inherent in the object or item, so there is an order to the labels, like performance (excellent, very good, good, fair, poor) or taste (mild, hot, very hot).
Variable or continuous data – It is measured on a continuum or scale that can be infinitely divided. Values can be any real number: 2, 3.4691, -14.21, etc. Continuous data can be recorded at many different points and is typically a physical measurement like volume, length, size, width, time, temperature or cost. Its subcategories include physical property data for the item being measured and resource data (time and money). For example, with value stream mapping at the step level, you might be interested in the low, median, and high amount of time it takes to do various tasks, or in the amount of variability in the result of a step across the value stream. Continuous data is more powerful than attribute data: the decimal places carry more detail, it is easier to analyze, and it shows how far from the target you are at the time of capture. It requires some kind of measuring instrument. You will rely on continuous data to create precision, and on discrete data to create order and make comparisons.

Data are said to be discrete when they take on only a finite number of points that can be represented by the non-negative integers. An example of discrete data is the number of defects in a sample. Data are said to be continuous when they exist on an interval, or on several intervals. An example of continuous data is the measurement of pH. Quality methods exist based on probability functions for both discrete and continuous data.

The same characteristic can often be recorded as either type. For example, 10 scratches (attribute data) could instead be reported as a total scratch length of 8.37 inches (variables data). The ultimate purpose of the data collection and the type of data available are the most significant factors in the decision to collect attribute or variables data.

Converting Data Types

Continuous data tends to be more precise because of its decimal places, and it can be converted into discrete data. Because continuous data contains more information than discrete data, information is lost during the conversion.

Discrete data cannot be converted to continuous data. Instead of measuring how much deviation from a standard exists, the user may choose to retain discrete data because it is easier to use. Converting variable data to attribute data may allow a quicker assessment, but the risk is that information will be lost when the conversion is made.
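The loss of information during conversion can be illustrated with a minimal Python sketch. The diameters and the specification limits below are made-up values chosen only for illustration; they are not taken from any real process.

```python
# Hypothetical shaft diameters in millimetres (illustrative values only).
diameters_mm = [9.98, 10.02, 10.11, 9.87, 10.00, 10.04]

# Assumed specification limits; these are not given in the text.
LOWER_SPEC_MM = 9.90
UPPER_SPEC_MM = 10.10


def to_pass_fail(value, lower, upper):
    """Collapse one continuous measurement into a binary attribute."""
    return "pass" if lower <= value <= upper else "fail"


# Continuous (variables) data converted to discrete (attribute) data.
attribute_data = [
    to_pass_fail(d, LOWER_SPEC_MM, UPPER_SPEC_MM) for d in diameters_mm
]

# Attribute data is naturally summarized as a percentage.
percent_good = 100 * attribute_data.count("pass") / len(attribute_data)

print(attribute_data)
print(f"percent good: {percent_good:.1f}%")
```

After the conversion, the two failing parts look identical, although one was above the upper limit and the other below the lower limit. The direction and size of each deviation cannot be recovered from the pass/fail record.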

Measurement

Measurement is the assignment of a numerical value to something, usually a continuous element. It is a mapping from an empirical system to a selected numerical system. The numerical system is manipulated, and the results of the manipulation are studied to help the manager better understand the empirical system. Measured data is regarded as being better than counted data because it is more precise and contains more information. Sometimes, data will only occur as counted data. If the information can be obtained as either attribute or variables data, it is generally preferable to collect variables data.

The nature of a scale is that it inherently has to relate to a standard. For example, temperature readings might use Fahrenheit or Celsius, and length readings might use inches or millimeters. The hierarchy of measurement scales is defined as nominal, ordinal, interval and ratio. In Six Sigma, this is referred to with the acronym NOIR.

This hierarchy can be viewed in the form of an upside-down pyramid to understand the relationship between the four levels of measurement. You start with nominal and then build on that with ordinal. Ordinal is then built on by interval, and finally by ratio. This interaction is important to the flow and examination of data going forward. Each level keeps the qualities of the level before it and adds its own.

The information content of a number depends on the scale of measurement used, which also determines the types of statistical analyses that can be applied. Hence, the validity of an analysis also depends on the scale of measurement. The four measurement scales employed are nominal, ordinal, interval, and ratio, and are summarized as follows:

Scale	Definition	Example	Statistics
Nominal	Only the presence/absence of an attribute. It can only count items. Data consists of names or categories only. No ordering scheme is possible. It has central location at mode and only information for dispersion.	go/no-go, success/fail, accept/reject	percent, proportion, chi-square tests
Ordinal	Data is arranged in some order but differences between values cannot be determined or are meaningless. It can say that one item has more or less of an attribute than another item. It can order a set of items. It has central location at median and percentages for dispersion.	taste, attractiveness	rank-order correlation, sign or run test
Interval	Data is arranged in order and differences can be found. However, there is no inherent starting point and ratios are meaningless. The difference between any two successive points is equal; often treated as a ratio scale even if assumption of equal intervals is incorrect. It can add, subtract and order objects. It has central location at arithmetic mean and standard deviation for dispersion.	calendar time, temperature	correlations, t-tests, F-tests, multiple regression
Ratio	An extension of the interval level that includes an inherent zero starting point. Both differences and ratios are meaningful. True zero point indicates absence of an attribute. It can add, subtract, multiply and divide. It has central location at geometric mean and percent variation for dispersion.	elapsed time, distance, weight	t-test, F-test, correlations, multiple regression
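The measures of central location listed in the table can be sketched with Python's standard `statistics` module. All data values and the label ordering below are assumptions made for illustration, and `statistics.geometric_mean` requires Python 3.8 or later.

```python
import statistics

# Illustrative data for each scale (all values are made up).
nominal = ["accept", "reject", "accept", "accept", "reject"]
ordinal = ["poor", "fair", "good", "good", "excellent"]
interval_celsius = [18.0, 20.0, 22.0, 24.0]
ratio_seconds = [2.0, 4.0, 8.0]

# Nominal: items can only be counted, so the mode is the central location.
nominal_mode = statistics.mode(nominal)

# Ordinal: labels can be ranked, so the median is meaningful.
# ORDER is an assumed ranking of the labels, lowest to highest.
ORDER = ["poor", "fair", "good", "very good", "excellent"]
ranks = [ORDER.index(label) for label in ordinal]
# median_low always returns an actual data value, so it maps back to a label.
ordinal_median = ORDER[statistics.median_low(ranks)]

# Interval: differences are meaningful, so mean and standard deviation apply.
interval_mean = statistics.mean(interval_celsius)
interval_stdev = statistics.stdev(interval_celsius)

# Ratio: a true zero makes ratios meaningful, so the geometric mean applies.
ratio_geometric_mean = statistics.geometric_mean(ratio_seconds)

print(nominal_mode, ordinal_median)
print(interval_mean, round(interval_stdev, 3))
print(round(ratio_geometric_mean, 3))
```

Each statistic is applied only to a scale that supports it. Taking an arithmetic mean of the nominal labels, for instance, would have no meaning.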

Data collection methods

Data collection is based on three crucial aspects: what to know, from whom to know it, and what to do with the data. Factors which ensure that the data is relevant to the project include

Person collecting data like team member, associate, subject matter expert, etc.
Type of Data to collect like cost, errors, ratings etc.
Time Duration like hourly, daily, batch-wise etc.
Data source like reports, observations, surveys etc.
Cost of collection

A few types of data collection methods include

Manual data collection – It refers to simple tools like tally sheets and tick sheets. Its advantages are that it is very simple to execute and uses very simple tools. It can also be very helpful in determining which issues you ought to pursue inside a process, without an exhaustive round of data collection.
Direct observation – Direct observation is always very powerful because you get to watch the actual interaction of the customer. Direct observation allows you to gather information about what that customer experience looks like, including the negative and positive aspects of the observation and nonverbal signals. You can also identify issues that may not otherwise be obvious or even things that the customer may not even see themselves.
Interviews – Interviews are a powerful technique and a great way to gather individual perspectives versus the perspective of a group. However, there are some limitations to interviews. Interviews are one of the most expensive ways to get an idea of what customers think. It is costly to record the responses, transcribe them, and then convert them to a form that you can analyze using Lean Six Sigma tools. One of the core ideas for in-person interviews is metatalk. Metatalk involves the things that are not said with words. Here you want to pay attention to the kind of rapport you are developing, identify visual clues, and make eye contact. Phone interviews allow you to gain access to a wide variety of people and can be done at a low cost because you don’t have to travel anywhere or provide a place to meet.
Focus groups – Focus groups are another method where you can gather feedback from a lot of customers at the same time. They’re good for dynamic brainstorming and gathering a number of ideas and inputs to the problem you’re trying to address. They are not nearly as restrictive as surveys or individual interviews, which require the use of a script, and you can use visual aids to promote the conversation.
Surveys – Surveys are very useful, and are arguably the most inexpensive way to gather quantifiable data. You can use surveys in conjunction with other data collection methods. Surveys are very good for basic issues and are a great way to collect a lot of data in a short time.
Check sheets – A check sheet is a structured, well-prepared form for collecting and analyzing data. It consists of a list of items and some indication of how often each item occurs. There are several types of check sheets: confirmation check sheets for confirming whether all steps in a process have been completed, process check sheets to record the frequency of observations within a range of measurement, defect check sheets to record the observed frequency of defects, and stratified check sheets to record the observed frequency of defects by defect type and one other criterion. A check sheet is easy to use, provides a choice of observations and is good for determining frequency over time. It should be used to collect observable data from a process when the collection is managed by the same person or at the same location.
Coded data – It is used when too many digits would have to be recorded in small blocks, when long sequences of digits are captured from a single observation, or when rounding errors are observed while recording numbers with many digits. It is also used if numeric data is used to represent attribute data or if the quantity of data is not enough for statistical significance in the sample size. Various types of coded data collection are
Truncation coding – It stores only the digits that vary, like 3, 2 and 9 for 1.0003, 1.0002 and 1.0009.
Substitution coding – It stores fractional observations as integers, like expressing 32-3/8 inches as 259 with 1/8 inch as the base (32 × 8 + 3 = 259).
Category coding – Using a code for category like “S” for scratch
Adding/subtracting a constant or multiplying/dividing by a factor – It is usually used for encoding or decoding
Automatic measurements – A computer or electronic equipment gathers the data without human intervention, like the radioactivity level in a nuclear reactor. The equipment observes and records data for analysis and action.
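The coded data techniques in the list above can be sketched in Python as follows. The function names, the base value of 1.0, the scale of 10,000 and the defect codes other than "S" are assumptions made for this illustration. Note that 32-3/8 inches expressed in eighths of an inch is 32 × 8 + 3 = 259.

```python
from collections import Counter
from fractions import Fraction


def truncation_encode(value, base=1.0, scale=10_000):
    """Keep only the varying trailing digits, e.g. 1.0003 -> 3."""
    return round((value - base) * scale)


def truncation_decode(code, base=1.0, scale=10_000):
    """Restore the original reading from its truncated code."""
    return base + code / scale


def substitution_encode(inches, unit=Fraction(1, 8)):
    """Express a fractional length as an integer count of `unit`."""
    count = Fraction(inches) / unit
    if count.denominator != 1:
        raise ValueError("length is not a whole multiple of the unit")
    return int(count)


# Truncation coding: only the digits that vary are stored.
codes = [truncation_encode(v) for v in (1.0003, 1.0002, 1.0009)]

# Substitution coding: 32-3/8 inches stored as a count of 1/8 inch units.
eighths = substitution_encode(Fraction(32) + Fraction(3, 8))

# Category coding on a defect check sheet. Only "S" for scratch comes
# from the text; "D" and "C" are hypothetical codes.
DEFECT_CODES = {"S": "scratch", "D": "dent", "C": "crack"}
observations = ["S", "D", "S", "C", "S"]
tally = Counter(observations)

print(codes, eighths)
for code, count in tally.most_common():
    print(DEFECT_CODES[code], count)
```

Using exact fractions for substitution coding avoids the rounding errors that coded data is meant to prevent in the first place.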

Techniques for Assuring Data Accuracy and Integrity

Data integrity and accuracy play a crucial role in the data collection process, as they ensure the usefulness of the data being collected. Data integrity determines whether the information being measured truly represents the desired attribute. Data accuracy determines the degree to which individual or average measurements agree with an accepted standard or reference value.

Data integrity is doubtful if the data collected does not fulfill its purpose. For example, data on finished goods departure should be gathered from truck departures; if it is instead recorded on a computing device in the warehouse, its integrity is doubtful. Similarly, data accuracy is doubtful if the measurement device does not conform to the laid-down device standards.

Bad data can be avoided by following a few precautions, like avoiding emotional bias relative to tolerances, avoiding unnecessary rounding, and screening data to detect and remove data entry errors.
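The accuracy definition and the screening precaution above can be sketched in Python. The reference value, the readings and the plausible range are all assumed values for illustration, and the range check is only one simple way to screen for data entry errors.

```python
import statistics

# Assumed certified value of a reference standard, in millimetres.
REFERENCE_MM = 25.000

# Hypothetical repeated readings of that standard.
readings_mm = [25.012, 25.008, 25.011, 25.009, 25.010]

# Accuracy: how far the average measurement is from the reference value.
bias_mm = statistics.mean(readings_mm) - REFERENCE_MM


def screen(values, low, high):
    """Split values into plausible ones and suspected entry errors."""
    kept = [v for v in values if low <= v <= high]
    flagged = [v for v in values if not (low <= v <= high)]
    return kept, flagged


# Raw entries with a misplaced decimal point and a stray minus sign.
raw_mm = [25.012, 25.008, 250.11, 25.009, -25.010]
kept, flagged = screen(raw_mm, low=24.0, high=26.0)

print(f"bias: {bias_mm:+.3f} mm")
print("kept:", kept)
print("flagged for review:", flagged)
```

Flagged values are returned rather than silently dropped, so that someone can confirm they are entry errors before they are removed.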

Reliability and validity are essential for quality data collection. If your data is consistent, stable, and repeatable, you know you can rely on the results. For validity, you also need to know the method used to collect the data and whether what is actually being measured is what is meant to be measured. Effective data collection relies on reliability, validity and margin of error.

Sampling

In practice, not every item in a population can be measured because of cost or impracticality. Hence, sampling is used to get a representative group of items to measure. Various sampling strategies are

Random Sampling – The use of a sampling plan requires randomness in sample selection, which means giving every part an equal chance of being selected for the sample. The sampling sequence must be based on an independent random plan. It is the least biased of all sampling techniques: there is no subjectivity, as each member of the total population has an equal chance of being selected. Random samples can also be obtained using random number tables.
Sequential or Systematic Sampling – Every nth record is selected from a list of the population. Usually, these plans are ended after the number inspected has exceeded the sample size of a sampling plan. It is used for costly or destructive testing. If the list does not contain any hidden order, this strategy is just as random as random sampling.
Stratified Sampling – It selects random samples from each group or process that is different. If the population has identifiable categories, or strata, that have a common characteristic, random sampling is used to select a sufficient number of units from each stratum. Stratified sampling is often used to reduce sampling error. The resulting mix of samples can be biased if the proportion of the samples does not reflect the relative frequency of the groups.
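The three strategies above can be sketched with Python's standard `random` module. The population of 100 numbered units, the two strata names and the 10% sampling fraction are assumptions made for illustration. The stratified version uses proportional allocation, which is one way to keep the sample mix in line with the relative frequency of the groups.

```python
import random


def simple_random_sample(population, n, rng):
    """Every unit has an equal chance of being selected."""
    return rng.sample(population, n)


def systematic_sample(population, step, rng):
    """Select every `step`-th unit, starting from a random offset."""
    start = rng.randrange(step)
    return population[start::step]


def stratified_sample(strata, fraction, rng):
    """Draw a random sample from each stratum, proportional to its size."""
    sample = {}
    for name, units in strata.items():
        k = max(1, round(len(units) * fraction))
        sample[name] = rng.sample(units, k)
    return sample


rng = random.Random(42)  # fixed seed so the sketch is repeatable
population = list(range(1, 101))  # hypothetical units numbered 1..100

random_units = simple_random_sample(population, 10, rng)
systematic_units = systematic_sample(population, 10, rng)

# Two hypothetical production lines of unequal size.
strata = {"line_A": list(range(1, 61)), "line_B": list(range(61, 101))}
stratified_units = stratified_sample(strata, 0.10, rng)

print(sorted(random_units))
print(systematic_units)
print({name: sorted(units) for name, units in stratified_units.items()})
```

With proportional allocation, line_A contributes 6 units and line_B contributes 4, matching their 60/40 share of the population.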

Sample Homogeneity

Sample homogeneity occurs when the data chosen for a sample have similar characteristics. It focuses on how similar the data are in a given sample. If data come from a variety of sources, such as several production streams or several geographical areas, the results will reflect these combined sources. The aim is homogeneous data, with data related to a single source as far as possible, so that the influence of an input of concern on the data can be evaluated. Non-homogeneous data result in errors. A lack of homogeneity in the data will hide the sources of variation and make root cause analysis difficult.
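A small Python sketch shows how combining sources can hide what is happening. The fill weights for the two production streams below are made-up values for illustration only.

```python
import statistics

# Made-up fill weights (grams) from two production streams.
stream_a = [100.1, 99.9, 100.0, 100.2, 99.8]
stream_b = [102.1, 101.9, 102.0, 102.2, 101.8]

# A non-homogeneous sample that mixes both streams.
combined = stream_a + stream_b

stdev_a = statistics.stdev(stream_a)
stdev_b = statistics.stdev(stream_b)
stdev_combined = statistics.stdev(combined)

print(f"stream A:  mean={statistics.mean(stream_a):.2f}  stdev={stdev_a:.3f}")
print(f"stream B:  mean={statistics.mean(stream_b):.2f}  stdev={stdev_b:.3f}")
print(f"combined:  mean={statistics.mean(combined):.2f}  stdev={stdev_combined:.3f}")
```

Each stream on its own is tightly controlled, but the combined sample shows a much larger spread. Viewed only as a mixed sample, the data would suggest a highly variable process and would not reveal that the real issue is an offset between the two streams.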

Sampling Distribution of Means

If the means of all possible samples of a given size are obtained and organized, we can derive the sampling distribution of the means. The mean of the sampling distribution of the mean is the mean of the population from which the scores were sampled. Therefore, if a population has a mean μ, then the mean of the sampling distribution of the mean is also μ.
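This property can be checked on a tiny population by listing every possible sample. The sketch below assumes a made-up population of five values and samples of size 2 drawn without replacement; exact fractions are used so the comparison is not affected by floating-point rounding.

```python
from fractions import Fraction
from itertools import combinations

# A small made-up population, so every possible sample can be listed.
population = [2, 4, 6, 8, 10]
SAMPLE_SIZE = 2

# Mean of every possible sample of SAMPLE_SIZE units (without replacement).
sample_means = [
    Fraction(sum(sample), SAMPLE_SIZE)
    for sample in combinations(population, SAMPLE_SIZE)
]

population_mean = Fraction(sum(population), len(population))
mean_of_sample_means = sum(sample_means) / len(sample_means)

print("number of samples:", len(sample_means))
print("population mean:", population_mean)
print("mean of sample means:", mean_of_sample_means)
```

Individual sample means range from 3 to 9, but their average equals the population mean of 6.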