Data is raw fact. It is measured, collected, reported, and analyzed to create information suitable for making decisions.
Types of data
Discrete and Continuous
- Attribute or discrete data – It is based on counting, such as the number of processing errors or the count of customer complaints. Discrete data values can only be non-negative integers (1, 2, 3, etc.) and can be expressed as a proportion or percent (e.g., percent good, percent bad). It includes:
  - Count or percentage – Counts of errors or the percent of output with errors.
  - Binomial data – Data that can take only one of two values, like yes/no or pass/fail.
  - Attribute-Nominal – The “data” are names or labels with no inherent order, such as Dept A, Dept B, Dept C in a company, or Machine 1, Machine 2, Machine 3 in a shop.
  - Attribute-Ordinal – The names or labels represent some value inherent in the object or item, so there is an order to the labels: performance rated excellent, very good, good, fair, poor; or tastes rated mild, hot, very hot.
- Variable or continuous data – Data measured on a continuum or scale that can be infinitely divided. Data values for continuous data can be any real number: 2, 3.4691, -14.21, etc. Continuous data can be recorded at many different points and are typically physical measurements such as volume, length, width, time, temperature, or cost. Continuous data is more powerful than attribute data because it is more precise: the decimal places capture levels of accuracy and specificity.
Data are said to be discrete when they take on only a finite number of points that can be represented by the non-negative integers. An example of discrete data is the number of defects in a sample. Data are said to be continuous when they exist on an interval, or on several intervals. An example of continuous data is the measurement of pH. Quality methods exist based on probability functions for both discrete and continuous data.
Attribute data can often be re-expressed as variables data; for example, 10 scratches could instead be reported as a total scratch length of 8.37 inches. The ultimate purpose of the data collection and the type of data are the most significant factors in the decision to collect attribute or variables data.
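The attribute-versus-variables distinction can be sketched in a few lines of Python; the scratch measurements and inspection count below are invented for illustration:

```python
# Sketch: the same defect can be recorded as attribute (discrete) or
# variable (continuous) data. All values here are illustrative.
scratch_lengths = [0.8, 1.2, 0.5, 2.1, 0.9, 1.4, 0.7, 0.81]  # inches (continuous)

# Attribute view: just count the scratches and compute percent defective
units_inspected = 50
scratch_count = len(scratch_lengths)            # discrete: 8 scratches
percent_scratched = 100 * scratch_count / units_inspected

# Variables view: the same inspection retains the actual measurements
total_length = sum(scratch_lengths)
mean_length = total_length / scratch_count

print(scratch_count, percent_scratched)   # 8 16.0
print(round(total_length, 2))             # 8.41
```

Note how the variables view keeps more information: the counts alone cannot tell you that one scratch was 2.1 inches long.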
Cross-sectional and Time series data – Often financial analysts are interested in particular types of data such as time-series data or cross-sectional data.
- Time-series data is a set of observations collected at usually discrete and equally spaced time intervals. For example, the daily closing price of a certain stock recorded over the last six weeks is time-series data. Note that too long or too short a time period may lead to time-period bias. Other examples of time-series data are staff numbers at a particular institution taken monthly to assess staff turnover rates, weekly sales figures of ice cream sold during a holiday period at a seaside resort, and the number of students registered for a particular course on a yearly basis. All of these can be used to forecast likely data patterns in the future.
- Cross-sectional data – Observations that come from different individuals or groups at a single point in time. For example, the closing prices of a group of 20 different tech stocks on December 15, 1986 are cross-sectional data. Note that the underlying population should consist of members with similar characteristics. For example, suppose you are interested in how much companies spend on research and development (R&D). Firms in some industries, such as retail, spend little on R&D, while firms in industries such as technology spend heavily on it. It is therefore inappropriate to summarize R&D data across all companies; rather, analysts should summarize R&D data by industry and then analyze the data within each industry group. Other examples of cross-sectional data are an inventory of all ice creams in stock at a particular store, or a list of grades obtained by a class of students on a specific test.
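The by-industry summary described above can be sketched as follows; the firms, industries, and R&D figures are made up for illustration:

```python
# Sketch: summarizing cross-sectional data by group rather than overall.
# Company names, industries, and R&D spend (in millions) are illustrative.
observations = [
    {"firm": "RetailCo",  "industry": "retail",     "rd_spend": 2.0},
    {"firm": "ShopMart",  "industry": "retail",     "rd_spend": 1.5},
    {"firm": "ChipWorks", "industry": "technology", "rd_spend": 120.0},
    {"firm": "SoftLabs",  "industry": "technology", "rd_spend": 95.0},
]

# Group the single-point-in-time observations by industry
by_industry = {}
for obs in observations:
    by_industry.setdefault(obs["industry"], []).append(obs["rd_spend"])

# Averaging within each industry avoids mixing dissimilar populations
for industry, values in sorted(by_industry.items()):
    print(industry, sum(values) / len(values))
# retail 1.75
# technology 107.5
```

An overall average (54.6) would describe neither group well, which is the point the text makes about summarizing by industry.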
Population and Sample Data
When we think of the term “population,” we usually think of people in our town, region, state or country and their respective characteristics such as gender, age, marital status, ethnic membership, religion and so forth. In statistics the term “population” takes on a slightly different meaning. The “population” in statistics includes all members of a defined group that we are studying or collecting information on for data driven decisions.
A part of the population is called a sample: a proportion, or slice, of the population that shares its characteristics. A sample that is drawn randomly and scientifically actually possesses the same characteristics as the population. (This may be hard to believe, but it is true!)
A population includes all of the elements from a set of data. A sample consists of one or more observations from the population.
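A minimal sketch of the population/sample relationship, using simulated measurements (the distribution parameters and sizes are illustrative):

```python
import random
import statistics

# Sketch: a "population" of 1000 measurements and a random sample from it.
random.seed(42)
population = [random.gauss(100, 15) for _ in range(1000)]

# A sample is one or more observations drawn from the population
sample = random.sample(population, 50)

pop_mean = statistics.mean(population)
sample_mean = statistics.mean(sample)

# A randomly drawn sample's mean lands close to the population mean
print(round(pop_mean, 1), round(sample_mean, 1))
```

This is the sense in which a randomly drawn sample "possesses the same characteristics" as the population: its summary statistics estimate the population's.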
Converting Data Types – Continuous data tends to be more precise due to its decimal places, but it sometimes needs to be converted into discrete data. Because continuous data contains more information than discrete data, information is lost in the conversion.
Discrete data cannot be converted to continuous data; instead of measuring how much deviation from a standard exists, the user may choose to retain the discrete data because it is easier to use. Converting variable data to attribute data may allow a quicker assessment, but the risk is that information is lost when the conversion is made.
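The loss of information in a variables-to-attribute conversion can be sketched as follows; the readings and specification limits are illustrative:

```python
# Sketch: converting variable (continuous) data to attribute (discrete) data.
# The 9.5–10.5 specification limits are invented for illustration.
measurements = [9.8, 10.1, 10.6, 9.4, 10.0, 10.2]  # continuous readings
lower, upper = 9.5, 10.5

# Attribute conversion: each reading collapses to pass/fail
results = ["pass" if lower <= x <= upper else "fail" for x in measurements]
print(results.count("fail"))  # 2
```

The attribute view gives a quick assessment (2 failures), but how far out of specification each failing part was (10.6 vs. 9.4) is lost in the conversion.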
Data Structuring – It refers to the structuring of data elements, which is classified as
- Structured data – Any data that resides in a fixed field within a record or file. This includes data contained in relational databases and spreadsheets. Structured data first depends on creating a data model – a model of the types of business data that will be recorded and how they will be stored, processed and accessed. Structured data has the advantage of being easily entered, stored, queried and analyzed.
- Semi-structured data – A form of structured data that does not conform to the formal structure of data models associated with relational databases or other data tables, but nonetheless contains tags or other markers to separate semantic elements and enforce hierarchies of records and fields within the data. It is therefore also known as having a self-describing structure; XML and JSON are examples.
- Unstructured data – Information that doesn’t reside in a traditional row-column database. Examples include e-mail messages, word processing documents, videos, photos, audio files, presentations, web pages and many other kinds of business documents.
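A minimal sketch of why semi-structured data is called self-describing: its tags let a parser recover fields and hierarchy without a predefined relational schema (the JSON record below is invented for illustration):

```python
import json

# Sketch: a semi-structured JSON record. The tags ("dept", "machines",
# "status") describe the data itself, so no table schema is needed.
record = '{"dept": "A", "machines": [{"id": 1, "status": "ok"}, {"id": 2, "status": "down"}]}'

data = json.loads(record)
print(data["dept"])                   # A
print(len(data["machines"]))          # 2
print(data["machines"][1]["status"])  # down
```

Contrast this with unstructured data such as an e-mail body, where no markers separate the semantic elements at all.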
Data collection methods
Data collection is based on crucial aspects of what to know, from whom to know it, and what to do with the data. Factors which ensure that data is relevant to the project include
- Person collecting data like team member, associate, subject matter expert, etc.
- Type of Data to collect like cost, errors, ratings etc.
- Time Duration like hourly, daily, batch-wise etc.
- Data source like reports, observations, surveys etc.
- Cost of collection
A few types of data collection methods include
- Check sheets – A structured, well-prepared form for collecting and analyzing data, consisting of a list of items and some indication of how often each item occurs. There are several types of check sheets: confirmation check sheets for confirming whether all steps in a process have been completed, process check sheets to record the frequency of observations within a range of measurement, defect check sheets to record the observed frequency of defects, and stratified check sheets to record the observed frequency of defects by defect type and one other criterion. A check sheet is easy to use, provides a choice of observations, and is good for determining frequency over time. It should be used to collect observable data when the collection is managed by the same person or at the same location in a process.
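A defect check sheet tally can be sketched in a few lines; the defect names and observations below are illustrative:

```python
from collections import Counter

# Sketch of a defect check sheet: tally how often each defect type is
# observed over a collection period. Defect names are invented.
observations = ["scratch", "dent", "scratch", "chip", "scratch", "dent"]

check_sheet = Counter(observations)
for defect, count in check_sheet.most_common():
    print(defect, count)
# scratch 3
# dent 2
# chip 1
```

Sorting by frequency, as `most_common` does, is also the first step toward a Pareto-style view of which defect types dominate.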
- Coded data – It is used when too many digits would otherwise have to be recorded in small blocks, when capturing large sequences of digits from a single observation, or when rounding-off errors are observed while recording large-digit numbers. It is also used when numeric data is used to represent attribute data, or when the data quantity is not enough for statistical significance at the sample size. Various types of coded data collection are
  - Truncation coding – Storing only the varying final digits, such as 3, 2 or 9 for 1.0003, 1.0002 and 1.0009.
  - Substitution coding – Storing fractional observations as integers, such as expressing 32-3/8 inches as a count of units with 1/8 inch as the base.
  - Category coding – Using a code for a category, such as “S” for scratch.
  - Adding/subtracting a constant or multiplying/dividing by a factor – Usually used for encoding or decoding.
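The coding schemes above can be sketched as follows; the readings, bases, and constants are invented for illustration:

```python
# Sketch of coded data collection; all values are illustrative.

# Truncation coding: store only the varying final digit of 1.0003, 1.0002, ...
readings = [1.0003, 1.0002, 1.0009]
codes = [round((x - 1.0000) * 10000) for x in readings]
print(codes)  # [3, 2, 9]

# Substitution coding: record 32-3/8 inches as an integer count of 1/8-inch units
eighths = 32 * 8 + 3   # 259 eighths of an inch
print(eighths / 8)     # 32.375 — decodes back to 32-3/8

# Adding/subtracting a constant: store the offset from a base, decode by adding it back
encoded = 10.0012 - 10.0
print(round(encoded + 10.0, 4))  # 10.0012
```

In each case the coded form is shorter to record, and the original observation can be recovered exactly from the code plus the known base.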
- Automatic measurements – A computer or electronic equipment performs data gathering without human intervention, such as monitoring the radioactive level in a nuclear reactor. The equipment observes and records data for analysis and action.
A few important data management terms are
- Data quality – It refers to the level of quality of data. Data is generally considered high quality if it is fit for its intended uses in operations, decision making and planning.
- Data cleansing – Data cleaning or data scrubbing is the process of detecting and correcting (or removing) corrupt or inaccurate records from a record set, table, or database. Used mainly in databases, the term refers to identifying incomplete, incorrect, inaccurate, irrelevant, etc. parts of the data and then replacing, modifying, or deleting this dirty data or coarse data.
- Data validation – It is the process of ensuring that a program operates on clean, correct and useful data. It uses routines, often called “validation rules” “validation constraints” or “check routines”, that check for correctness, meaningfulness, and security of data that are input to the system.
- Data integrity – It refers to maintaining and assuring the accuracy and consistency of data over its entire life-cycle, and is a critical aspect to the design, implementation and usage of any system which stores, processes, or retrieves data.
- Data governance – It is a control that ensures that the data entry by an operations team member or by an automated process meets precise standards, such as a business rule, a data definition and data integrity constraints in the data model. It is a set of processes that ensures that important data assets are formally managed throughout the enterprise. Data governance ensures that data can be trusted and that people can be made accountable for any adverse event that happens because of low data quality.
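As a sketch of how validation rules and cleansing fit together (the field names, rules, and records below are invented for illustration):

```python
# Sketch: validation rules checked before records enter a dataset, then
# cleansing by removing the records that fail. All names are illustrative.
def is_valid(record):
    age = record.get("age")
    return (
        record.get("id") is not None   # completeness rule
        and isinstance(age, int)       # correctness rule: right type
        and 0 <= age <= 120            # meaningfulness rule: plausible range
    )

raw = [
    {"id": 1, "age": 34},
    {"id": 2, "age": -5},     # inaccurate: fails the range rule
    {"id": None, "age": 40},  # incomplete: missing id
    {"id": 4, "age": 28},
]

# Data cleansing: detect and remove the dirty records
clean = [r for r in raw if is_valid(r)]
print([r["id"] for r in clean])  # [1, 4]
```

In practice a cleansing step might correct rather than delete records, but the validate-then-act pattern is the same.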
Techniques for Assuring Data Accuracy and Integrity
Data integrity and accuracy play a crucial role in the data collection process, as they ensure the usefulness of the data being collected. Data integrity determines whether the information being measured truly represents the desired attribute, and data accuracy determines the degree to which individual or average measurements agree with an accepted standard or reference value.
Data integrity is doubtful if the data collected does not fulfill its purpose: for example, if data on finished-goods departures is supposed to be gathered at truck departure but is actually recorded on a computing device inside the warehouse, integrity is doubtful. Similarly, data accuracy is doubtful if the measurement device does not conform to the laid-down device standards.
Bad data can be avoided by taking a few precautions, such as avoiding emotional bias relative to tolerances, avoiding unnecessary rounding, and screening data to detect and remove data entry errors.
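Screening for entry errors can be as simple as a plausible-range check; the weights and limits below are invented for illustration:

```python
# Sketch: screening a data set for likely entry errors before analysis.
# A misplaced decimal point turned 10.4 into 104.0 in this invented set.
weights = [10.2, 10.4, 9.9, 10.1, 104.0, 10.3, 10.0]

# Range screen: flag values outside the physically plausible window
plausible = (5.0, 20.0)   # illustrative limits for this part
suspects = [w for w in weights if not plausible[0] <= w <= plausible[1]]
print(suspects)  # [104.0]
```

Flagged values should be investigated against the source records rather than silently deleted, since some may be genuine.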
With the change and spread of technology, companies are moving towards digital marketing as consumers move towards e-commerce and mobile commerce. The availability of low-cost internet access and devices has also spurred this shift among consumers. Digital data, like the HTML footprints consumers leave behind when they visit a website, or social media data, has significant value over traditional analytics tools in multiple ways. To begin with, by analyzing digital data you are ‘listening in’ to natural, honest conversations rather than forced ones. Second, the sample size is enormous: where a traditional survey might look at 2,000 consumers, digital data can cover over 200,000. Finally, the analysis is less expensive than traditional research and fast, so it can be conducted multiple times a year to answer different questions or hypotheses.
Big data is an all-encompassing term for any collection of data sets so large and complex that it becomes difficult to process using on-hand data management tools or traditional data processing applications.
Big data is large-volume, loosely structured data that cannot be handled by standard database management systems like DBMS, RDBMS or ORDBMS, and that defies traditional storage. A few examples:
- Facebook: 40 PB of data, captures 100 TB/day
- Yahoo: 60 PB of data
- Twitter: 8 TB/day
- EBay: 40 PB of data, captures 50 TB/day
In defining big data, it’s also important to understand the mix of unstructured and multi-structured data that comprises the volume of information.
- Unstructured data comes from information that is not organized or easily interpreted by traditional databases or data models, and typically, it’s text-heavy. Metadata, Twitter tweets, and other social media posts are good examples of unstructured data.
- Multi-structured data refers to a variety of data formats and types and can be derived from interactions between people and machines, such as web applications or social networks. A great example is web log data, which includes a combination of text and visual images along with structured data like form or transactional information. As digital disruption transforms communication and interaction channels—and as marketers enhance the customer experience across devices, web properties, face-to-face interactions and social platforms—multi-structured data will continue to evolve.
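As a sketch of extracting structured fields from web log data (the log line and regular expression follow the common Apache access-log layout, used here purely as an illustration):

```python
import re

# Sketch: web log data mixes free-form text with structured fields.
# A regular expression recovers the structured parts of one log line.
line = '192.168.1.5 - - [15/Dec/2023:10:01:22 +0000] "GET /index.html HTTP/1.1" 200 5120'

pattern = r'(\S+) \S+ \S+ \[([^\]]+)\] "(\S+) (\S+) \S+" (\d{3}) (\d+)'
m = re.match(pattern, line)
ip, timestamp, method, path, status, size = m.groups()
print(ip, method, path, status)  # 192.168.1.5 GET /index.html 200
```

This mix of parseable structure (status code, byte count) embedded in loosely formatted text is what makes log data "multi-structured" rather than purely structured or unstructured.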
Big Data is usually characterized by the following “V” attributes
- Volume – The data being handled is so voluminous that it frequently exceeds a server’s storage and processing capacity; it grows too quickly over time and may be needed at all times, ruling out downtime. When vertically scalable solutions (adding more storage or faster processors) are not acceptable options due to cost or zero-downtime requirements, horizontally scalable solutions (adding cheaper servers without shutting down the existing ones, as in Hadoop MapReduce) are needed.
- Variety – Data from different sources is aggregated: from online, mobile, and social media, and from ubiquitous sensors. The sensors can be in stores and in linked devices that form part of the Internet of Things (IoT).
- Veracity – It refers to the lack of clarity or certainty in the data. The data is not well-structured relational data such as transactions; hence, companies must be able to store any data in a form that can be analyzed.
- Velocity – It refers to the speed needed to analyze and make decisions in tandem with the data being generated. The speed is often measured in fractions of a second, as in real time, or by how long it takes for a customer to click to leave your site or ignore your location-based mobile offer.
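The horizontal-scaling idea behind MapReduce, mentioned under Volume, can be sketched in miniature; the word chunks below stand in for data split across cheap nodes:

```python
from collections import defaultdict

# Sketch of the map/reduce pattern: work is split across nodes, each
# maps its own chunk, then partial results are reduced into totals.
chunks = [
    "error ok ok",
    "ok error error",
    "ok ok ok",
]

# Map phase: each node independently counts words in its own chunk
mapped = [{} for _ in chunks]
for node, chunk in enumerate(chunks):
    for word in chunk.split():
        mapped[node][word] = mapped[node].get(word, 0) + 1

# Reduce phase: partial counts from all nodes merge into final totals
totals = defaultdict(int)
for partial in mapped:
    for word, count in partial.items():
        totals[word] += count
print(dict(totals))  # {'error': 3, 'ok': 6}
```

Because each map step touches only its own chunk, adding more cheap nodes adds capacity without any node needing to hold the whole data set.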
Big data can come from multiple sources, such as
- Web Data – The web itself remains a major source of big data
- Click stream data – When users navigate a website, the clicks are logged for further analysis (like navigation patterns). Click stream data is important in online advertising and e-commerce
- Sensor Data – Sensors embedded in roads to monitor traffic, and miscellaneous other applications, generate a large volume of data
- Connected Devices – Smart phones are a great example. For example when you use a navigation application like Google Maps, the phone sends pings back reporting its location and speed (this information is used for calculating traffic hotspots). Just imagine hundreds of millions (or even billions) of devices consuming data and generating data.
- Social network profiles or social media data – Sites like Facebook, Twitter and LinkedIn generate a large amount of data. User profiles can be tapped from Facebook, LinkedIn, Yahoo, Google, and specific-interest social or travel sites to cull individuals’ profiles and demographic information, and extended to capture their hopefully like-minded networks.
- Social influencers — Editor, analyst and subject-matter expert blog comments, user forums, Twitter & Facebook “likes,” Yelp-style catalog and review sites, and other review-centric sites like Apple’s App Store, Amazon, etc.
- Activity-generated data—Computer and mobile device log files, aka “The Internet of Things.” This category includes web site tracking information, application logs, and sensor data – such as check-ins and other location tracking – among other machine-generated content. But consider also the data generated by the processors found within vehicles, video games, cable boxes or, soon, household appliances.
- Public—Microsoft Azure Market Place/ Data Market, The World Bank, SEC/Edgar, Wikipedia, IMDb, etc. – data that is publicly available on the Web which may enhance the types of analysis able to be performed.