Data Mining and Warehousing

If you are looking for a job in Data Mining and Warehousing, then check out our expert-created interview questions to help you prepare for your job interview.

Q.1 Explain the purpose of cluster analysis in Data Warehousing.
Some of the defined purposes (requirements) of cluster analysis are listed below; a minimal clustering sketch follows the list.
1. Scalability
2. Ability to deal with different kinds of attributes
3. Discovery of clusters with arbitrary shape
4. Ability to handle high dimensionality
5. Ability to deal with noisy data
6. Interpretability
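A minimal clustering sketch, assuming scikit-learn and numpy are installed; the toy dataset and the choice of two clusters are illustrative, not taken from the original answer.

```python
import numpy as np
from sklearn.cluster import KMeans

# Toy customer data: [annual_spend, visits_per_month]
X = np.array([
    [200, 2], [220, 3], [250, 2],       # low-spend group
    [1200, 12], [1300, 15], [1250, 14]  # high-spend group
])

# Group the rows into two clusters and inspect the assignment.
model = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print("Cluster labels:", model.labels_)
print("Cluster centers:", model.cluster_centers_)
```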
Q.2 What do you understand by Data Mining?
Data Mining is defined as the process of analyzing data in different dimensions or perspectives and summarizing it into useful information. The mined data can then be queried and retrieved from the database in the required format.
Q.3 What do you understand by OLAP?
OLAP, short for Online Analytical Processing, is defined as a system that collects, manages, and processes multi-dimensional data for analysis and management purposes.
Q.4 How would you differentiate between View and Materialized View?
A view is nothing but a virtual table that takes the output of a query and can be used in place of a table. A materialized view, on the other hand, provides indirect access to table data by storing the results of a query in a separate schema object. A small sketch of the difference follows.
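A minimal sketch using Python's built-in sqlite3 module. SQLite has no native materialized views, so the "materialized" version is simulated with CREATE TABLE ... AS SELECT; the table names are hypothetical.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE sales (region TEXT, amount REAL);
INSERT INTO sales VALUES ('North', 100), ('North', 250), ('South', 300);

-- A view: no data is stored, the query runs every time the view is read.
CREATE VIEW v_sales_by_region AS
  SELECT region, SUM(amount) AS total FROM sales GROUP BY region;

-- A "materialized" view: the query results are stored once in a separate table.
CREATE TABLE mv_sales_by_region AS
  SELECT region, SUM(amount) AS total FROM sales GROUP BY region;
""")

print(con.execute("SELECT * FROM v_sales_by_region").fetchall())
print(con.execute("SELECT * FROM mv_sales_by_region").fetchall())
```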
Q.5 How would you define real-time data warehousing?
Real-time data warehousing is primarily used to capture business data as it occurs. Whenever a business activity completes, that data flows into the warehouse and becomes available for use instantly.
Q.6 State some of the advantages of data mining.
Some of the advantages of Data mining are -
1. We use data mining in banks and financial institutions to find probable defaulters. This is done based on past transactions, user behaviour and data patterns.
2. It assists advertisers in pushing the right advertisements to users as they browse web pages, based on machine learning algorithms. Therefore, data mining benefits both potential buyers and sellers of various products.
3. Retailers use it to arrange and keep the most sellable items in the most attention-grabbing positions.
Q.7 What according to you are the disadvantages of data mining?
The disadvantages of data mining are -
1. Many data mining tools do not have adequate security systems in place to protect user data.
2. Data mining analytics rely on specialized software, which can be difficult to operate and requires knowledge-based training.
3. The techniques of data mining are not 100% accurate, so the results must be interpreted with care.
Q.8 State the required technological drivers in data mining.
The required technological drivers in data mining include -
1. Database size: Powerful systems are required to maintain and process the huge amounts of data.
2. Query complexity: Analyzing a large number of complex queries also requires a powerful system.
Q.9 What is data mining query language?
Data Mining Query Language (DMQL) was proposed for the DBMiner data mining system and is based on the Structured Query Language (SQL). These query languages are designed to support ad hoc and interactive data mining. DMQL provides commands for specifying primitives. We can use DMQL to work with databases and data warehouses as well, and to define data mining tasks. In particular, we examine how to define data warehouses and data marts in DMQL.
Q.10 What do you understand by Syntax for Specifying the Kind of Knowledge?
It refers to the DMQL syntax for specifying the kind of knowledge to be mined: characterization, discrimination, association, classification, and prediction.
Q.11 What is the purpose of Data Mining Languages Standardization?
The purposes of data mining language standardization are:
1. It helps the systematic development of data mining solutions.
2. It improves interoperability among multiple data mining systems and functions.
3. It helps in promoting education and rapid learning.
4. It promotes the use of data mining systems in industry and society.
Q.12 What do you understand by Active Data Warehousing?
An active data warehouse is a data warehouse that permits decision-makers within a company or organization to manage customer relationships effectively and efficiently.
Q.13 Define the key columns in Fact and dimension tables?
Foreign keys of dimension tables are primary keys of entity tables. Foreign keys of fact tables are the primary keys of the dimension tables.
Q.14 What do you understand by loops in Data Warehousing?
In data warehousing, if there is a loop between the tables, query generation takes more time and creates ambiguity. It is therefore advisable to avoid loops between tables.
Q.15 How do you define Dimensional Modelling?
Dimensional modeling is defined as a concept used by data warehouse designers to build their own data warehouses. The model is stored in two kinds of tables: fact tables and dimension tables. The fact table holds the facts and measurements of the business, while the dimension table contains the context of those measurements. A minimal sketch follows.
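A minimal dimensional-model sketch using Python's built-in sqlite3; the table and column names are hypothetical, chosen only to show that the fact table stores measurements and references the dimension table by key (this also illustrates the key relationships described in Q.13).

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE dim_product (
    product_key  INTEGER PRIMARY KEY,  -- primary key of the dimension
    product_name TEXT,
    category     TEXT                  -- descriptive context
);
CREATE TABLE fact_sales (
    product_key INTEGER REFERENCES dim_product(product_key),  -- foreign key
    date_key    INTEGER,
    units_sold  INTEGER,               -- measurements of the business
    revenue     REAL
);
INSERT INTO dim_product VALUES (1, 'Laptop', 'Electronics');
INSERT INTO fact_sales VALUES (1, 20240101, 3, 2400.0);
""")

# Join the fact to its dimension to add context to the measures.
print(con.execute("""
    SELECT p.product_name, f.units_sold, f.revenue
    FROM fact_sales f JOIN dim_product p USING (product_key)
""").fetchall())
```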
Q.16 What do you understand by snapshot with reference to data warehouse?
A snapshot is defined as a complete visualization of the data at the time of extraction. It occupies less space and can be used to back up and restore data quickly. In other words, a snapshot is a record of the activities performed, stored in report format from a specific catalog; the report is generated soon after the catalog is disconnected.
Q.17 Define surrogate key.
A surrogate key is a substitute for the natural primary key. It is a unique identifier for each row that can be used as the primary key of a table. A small sketch of generating surrogate keys follows.
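A minimal sketch of assigning surrogate keys while loading a dimension; the source records and the natural key name (customer_code) are hypothetical.

```python
from itertools import count

source_rows = [
    {"customer_code": "C-1001", "name": "Alice"},
    {"customer_code": "C-2002", "name": "Bob"},
    {"customer_code": "C-1001", "name": "Alice"},  # same natural key seen again
]

next_key = count(start=1)
surrogate_by_natural = {}   # natural key -> surrogate key
dim_customer = []

for row in source_rows:
    nk = row["customer_code"]
    if nk not in surrogate_by_natural:
        # Assign a new system-generated key only the first time the natural key appears.
        surrogate_by_natural[nk] = next(next_key)
        dim_customer.append({"customer_key": surrogate_by_natural[nk], **row})

print(dim_customer)
# [{'customer_key': 1, 'customer_code': 'C-1001', 'name': 'Alice'},
#  {'customer_key': 2, 'customer_code': 'C-2002', 'name': 'Bob'}]
```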
Q.18 What is the goal objective of Optimizer?
The goal of the optimizer is to find the most efficient way to execute the SQL statements.
Q.19 What do you understand by Hybrid SCD?
A Hybrid SCD is a combination of SCD Type 1 and SCD Type 2. In a table, some columns may be important enough that we need to track their changes, i.e. capture their historical data, whereas for other columns we do not need to bother even if the data changes. For such tables we implement Hybrid SCDs, in which some columns are Type 1 and some are Type 2.
Q.20 Why do we override the execute() method in the Struts framework?
In the Struts framework we develop Action classes (classes that extend the Action class), ActionForm classes (classes that extend the ActionForm class), and other supporting classes.

In the ActionForm class we can override the validate() method, which returns an ActionErrors object; the validation code is written inside it. If validate() returns null or an ActionErrors object of size 0, the web container calls the execute() method of the Action class.

If it returns a size greater than 0, execute() is not called. Instead, control is forwarded to the JSP, servlet, or HTML file given as the value of the input attribute of the action mapping in the struts-config.xml file.
Q.21 What does metadata consists of?
Metadata consists of:
1. Structure of the data.
2. Algorithm used for summarization.
3. Mapping from the operational environment to the data warehouse.
Q.22 What is the first step in acquisition of data for the warehouse?
The first step in acquisition of data for the warehouse is to extract data from multiple, heterogeneous sources.
Q.23 What does crashing a project imply?
Project cost and project schedule tradeoffs occur to achieve the maximum schedule compression for the least cost to the project without compromising the intended scope of the project.
Q.24 What does the unequal partition and join key, results in?
The unequal partition and join key, results in reduced join performance.
Q.25 What is learning rate?
The learning rate is the degree of adjustment applied to the network's weights on each learning trial in neural networks. A minimal sketch follows.
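A minimal sketch of how the learning rate scales each weight adjustment in gradient descent; the loss function and the starting weight are illustrative assumptions.

```python
def gradient(w):
    # Derivative of the loss (w - 3)^2, whose minimum lies at w = 3.
    return 2 * (w - 3)

w, learning_rate = 0.0, 0.1
for step in range(50):
    w -= learning_rate * gradient(w)   # adjustment = learning_rate * gradient

print(round(w, 4))  # close to 3.0; a larger rate adjusts faster but may overshoot
```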
Q.26 What do you understand by Data warehousing?
A data warehouse refers to the repository of data used for management decision support systems. It consists of a wide variety of data that presents a coherent picture of business conditions at a single point in time.
Q.27 Name the various stages of Data warehousing.
The stages of data warehousing are:
1. Offline Operational Database
2. Offline Data Warehouse
3. Real-Time Data Warehouse
4. Integrated Data Warehouse
Q.28 Define OLTP and ODS.
OLTP stands for On-Line Transaction Processing and refers to applications that modify data the moment it is received, typically serving a large number of simultaneous users. ODS stands for Operational Data Store; it is a repository of real-time operational data rather than long-term trend data.
Q.29 Define ETL.
ETL refers to Extract, Transform, and Load. ETL software reads data from a defined data source and extracts a desired subset of it, then transforms the data using rules and lookup tables to convert it to the desired state before loading it into the target. A minimal sketch follows.
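A minimal ETL sketch: extract a subset from a source, transform it with a lookup table, and load it into a target list. All names and the lookup table are hypothetical.

```python
source = [
    {"id": 1, "country_code": "US", "amount": "100.5"},
    {"id": 2, "country_code": "IN", "amount": "250.0"},
    {"id": 3, "country_code": "US", "amount": None},      # bad record, will be dropped
]
country_lookup = {"US": "United States", "IN": "India"}

def extract(rows):
    # Keep only the desired subset: rows that actually have an amount.
    return [r for r in rows if r["amount"] is not None]

def transform(rows):
    # Apply rules and the lookup table to reach the desired state.
    return [
        {"id": r["id"],
         "country": country_lookup[r["country_code"]],
         "amount": float(r["amount"])}
        for r in rows
    ]

def load(rows, target):
    target.extend(rows)

warehouse = []
load(transform(extract(source)), warehouse)
print(warehouse)
```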
Q.30 What do you understand about Junk Dimension?
A junk dimension can be considered a single dimension used for storing small, low-cardinality attributes known as junk attributes. These junk attributes are a set of text attributes and flags that are moved into a separate dimension table known as the junk dimension. It is a dimension table consisting of attributes that do not fit in either the fact table or the existing dimension tables. These attributes are typically free text or flags, such as non-generic comments or simple yes/no or true/false indicators.
Q.31 Define Subject-Oriented Data Warehouse.
A subject-oriented data warehouse can be defined as one that stores data around a specific subject such as product, customer, or sales. The subject-oriented property states that the data in a data warehouse is grouped around the major subjects of interest to an organization, for example customers, brands, prices, and vendors. This property allows DW users to perform in-depth analyses on each subject for operational and strategic decision-making.
Q.32 What is ETL Testing?
ETL Testing is carried out for validating data extracted from various sources and destinations. This is used for: Firstly, validating the components of the ETL Data Warehouse. Secondly, putting ETL software through its paces. Thirdly, running the test in the background. Then, identifying and resolving problems. Next, developing the concept and carrying out the test harness and events. Lastly, accepting design criteria and standards.
Q.33 Define XMLA.
XMLA stands for XML for Analysis and is an effective method for accessing data from OLAP, data mining, and other internet-based sources. XMLA is a Simple Object Access Protocol (SOAP)-based protocol that uses two methods: Discover and Execute. The Discover method fetches information from the data source, and the Execute method lets you run commands against the data source.
Q.34 Define Snowflake Schema.
A data warehouse snowflake schema is a logical arrangement of tables in which the ER diagram resembles a snowflake shape. It is an extension of a star schema that adds further depth: the dimension tables are normalized, so the data is divided into additional tables. A minimal sketch follows.
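A minimal snowflake sketch using Python's built-in sqlite3: the product dimension is normalized into a further category table, so queries traverse an extra branch. Table and column names are hypothetical.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE dim_category (category_key INTEGER PRIMARY KEY, category_name TEXT);
CREATE TABLE dim_product  (product_key INTEGER PRIMARY KEY, product_name TEXT,
                           category_key INTEGER REFERENCES dim_category(category_key));
CREATE TABLE fact_sales   (product_key INTEGER REFERENCES dim_product(product_key),
                           revenue REAL);

INSERT INTO dim_category VALUES (10, 'Electronics');
INSERT INTO dim_product  VALUES (1, 'Laptop', 10);
INSERT INTO fact_sales   VALUES (1, 2400.0);
""")

# A query must now join through the normalized dimension to reach the category.
print(con.execute("""
    SELECT c.category_name, SUM(f.revenue)
    FROM fact_sales f
    JOIN dim_product  p USING (product_key)
    JOIN dim_category c USING (category_key)
    GROUP BY c.category_name
""").fetchall())
```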
Q.35 Explain the term Cube in Data Warehousing.
Cubes are used for explaining multidimensional data logically. The dimension members are located on the cube’s edge, and the data values are located on the cube’s body.
Q.36 What do you know about Business Intelligence?
Business intelligence (BI) can be defined as a decision support system (DSS), and it relates to the technologies, applications, and practices for collecting, integrating, and analyzing business-related knowledge or data. Data warehousing and business intelligence together describe the method of storing all of a company's data in internal or external databases from different sources and analyzing it to create actionable information through online BI software.
Q.37 Define Aggregate Tables.
Aggregate tables contain existing warehouse data that has been grouped (rolled up) to a certain level of the dimensions. Data is easier to obtain from aggregate tables than from the original table with its larger number of records, because aggregate tables roll data up to a level higher than the base or derived table. Such a table lowers the load on the database server and improves query performance. A small sketch follows.
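A minimal sketch of building an aggregate table with Python's built-in sqlite3; the base table and the rollup level (month by region) are hypothetical.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE fact_sales (sale_date TEXT, region TEXT, amount REAL);
INSERT INTO fact_sales VALUES
  ('2024-01-05', 'North', 100), ('2024-01-20', 'North', 250),
  ('2024-02-02', 'South', 300);

-- Roll the detailed facts up to month/region so queries read far fewer rows.
CREATE TABLE agg_sales_monthly AS
  SELECT substr(sale_date, 1, 7) AS month, region, SUM(amount) AS total
  FROM fact_sales
  GROUP BY substr(sale_date, 1, 7), region;
""")

print(con.execute("SELECT * FROM agg_sales_monthly ORDER BY month").fetchall())
```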
Q.38 What is an ER Diagram?
Entity-Relationship Diagram (ER diagram) depicts the interrelationships between the entities in a database. This diagram demonstrates the arrangement of each table as well as the relationships between them. An ER Diagram is a flowchart that displays how entities in a structure interact with one another.
Q.39 Define the term BUS Schema.
The BUS schema consists of a suite of conformed dimensions and standardized fact definitions. It controls dimension identification across all business processes, giving ETL a standardized definition of facts as well as conformed dimensions.
Q.40 What is a data lake?
A data lake refers to a centralized repository for storing both structured and unstructured data. It can store raw data as-is, without any predefined schema, and there is no need to perform any ETL or transformation job on it first. A data lake can hold any type of data, such as images, text, files, and videos, and it can even store machine learning model artifacts, real-time data, analytics output, and so on.
Q.41 Define structured and unstructured data.
Structured data has a known schema and fits in fixed tables. It uses DBMS storage, and scaling its schema is very difficult. Some of the associated protocols and interfaces are ODBC, SQL, and ADO.NET. Unstructured data, by contrast, has no schema or fixed structure. It is mostly unmanaged, very easy to scale at runtime, and can store any type of data. Some of the associated formats and protocols are XML, CSV, SMS, SMTP, and JSON.
Q.42 What do you know about data purging?
Data purging refers to the process of erasing data permanently from storage; several techniques and strategies can be used for it. Purging is often contrasted with data deletion: purging permanently removes the data in order to free up storage and memory space that can then be used for other purposes. The purging process also allows the data to be archived even though it is permanently removed from the main source, giving us an option to recover it later.
Q.43 Name some of the data warehouse solutions used in the industry.
Some of the major solutions are:
1. Snowflake
2. Oracle Exadata
3. Apache Hadoop
4. SAP BW/4HANA
5. Micro Focus Vertica
6. Teradata
7. AWS Redshift
8. GCP BigQuery
Q.44 What do you know about slowly changing dimensions?
A slowly changing dimension (SCD) is one that appropriately manages modifications of dimension members over time. It applies when business entity value changes over time and in an ad-hoc manner.
Q.45 Name the types of SCD.
There are six types of Slowly Changing Dimension; a minimal Type 2 sketch follows the list.
Type 0: The dimension never changes; it is fixed, and no changes are permissible.
Type 1: There is no record of historical values, only the current state. A Type 1 SCD always reflects the newest values: when changes in the source data are detected, the dimension table is overwritten.
Type 2: Row versioning. Changes are tracked as version records, identified by a current flag, active dates, and other metadata. If the source system does not store versions, the data warehouse load process detects changes and manages them appropriately in the dimension table.
Type 3: Previous-value column. To track a change to a selected attribute, a column is added to show the previous value, which is updated as further changes occur.
Type 4: History table. The dimension table holds the current value, while a separate history table stores the changes.
Hybrid SCD: A Hybrid SCD uses techniques from SCD Types 1, 2, and 3 to track changes.
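A minimal Type 2 (row versioning) sketch: when a tracked attribute changes, the current row is expired and a new version is inserted. The dimension rows, dates, and the simplistic key assignment are hypothetical.

```python
from datetime import date

dim_customer = [
    {"customer_key": 1, "customer_code": "C-1001", "city": "Pune",
     "start_date": date(2023, 1, 1), "end_date": None, "is_current": True},
]

def apply_scd2(dim, customer_code, new_city, change_date):
    for row in dim:
        if row["customer_code"] == customer_code and row["is_current"]:
            if row["city"] == new_city:
                return                       # nothing changed, keep current version
            row["end_date"] = change_date    # expire the old version
            row["is_current"] = False
    # Insert a new current version (also covers a brand-new customer).
    dim.append({"customer_key": len(dim) + 1, "customer_code": customer_code,
                "city": new_city, "start_date": change_date,
                "end_date": None, "is_current": True})

apply_scd2(dim_customer, "C-1001", "Mumbai", date(2024, 6, 1))
for row in dim_customer:
    print(row)
```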
Q.46 What is a Factless fact table?
A factless fact table is a fact table with no measures. Such a table contains only keys from different dimension tables.
Q.47 Define the Three-Layer Architecture of ETL.
1. Staging Layer: It stores the data collected from the various source systems.
2. Data Integration Layer: It imports data from the staging layer into the database. The information may be classified as facts, aggregates, or dimensions, and dimensions and facts may be combined to form the schema.
3. Access Layer: End users obtain the information for analysis and reporting from this layer.
Q.48 Define virtual data warehousing.
A virtual data warehouse provides a collective view of the completed data. It holds no historical data and is often considered a logical data model of the given metadata. It can be seen as the simplest way of translating data and presenting it in a form that can be employed by decision-makers. It also provides a semantic map that allows end users to view the data as if it were virtualized in one place.
Q.49 Explain the term Star Schema.
A star schema can be defined as a way of organizing the tables so that results can be retrieved from the database quickly in the data warehouse environment: a central fact table references a set of denormalized dimension tables, forming a star-like shape.
Q.50 Describe the steps to build the data warehouse.
The steps to be followed for building the data warehouse are: Firstly, collecting the business requirements. Secondly, identifying the necessary sources. Thirdly, identifying the facts. Then, specifying the dimensions and the attributes. Next, redefining the dimensions and attributes if required. After that, organizing the attribute hierarchy and defining the relationships. Lastly, allocating unique identifiers.
Q.51 Define the term conformed fact.
A conformed fact is a fact (measure) whose definition is shared across multiple data marts and multiple fact tables.
Q.52 What is the core dimension?
The core dimension is a dimension table that is dedicated to a single fact table or data mart.
Q.53 Define degenerate dimension.
In a data warehouse, a degenerate dimension refers to a dimension key in the fact table that does not have its own dimension table. Degenerate dimensions usually take place when the fact table’s grain is a single transaction (or transaction line).
Q.54 What is normalization?
Normalization can be considered a multi-step process that puts data into tabular form and eliminates duplicated data from the relational tables. This helps reduce data redundancy, saves physical database space, and keeps write-operation costs minimal. A small sketch follows.
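A minimal normalization sketch: duplicated customer details are pulled out of the order rows into their own table and referenced by key. The data and column names are hypothetical.

```python
orders_denormalized = [
    {"order_id": 1, "customer_name": "Alice", "customer_city": "Pune",  "total": 100},
    {"order_id": 2, "customer_name": "Alice", "customer_city": "Pune",  "total": 250},
    {"order_id": 3, "customer_name": "Bob",   "customer_city": "Delhi", "total": 300},
]

customers, orders = {}, []
for row in orders_denormalized:
    key = (row["customer_name"], row["customer_city"])
    if key not in customers:
        customers[key] = len(customers) + 1          # assign a customer_id once
    orders.append({"order_id": row["order_id"],
                   "customer_id": customers[key],    # reference instead of duplication
                   "total": row["total"]})

print({v: k for k, v in customers.items()})  # customer table
print(orders)                                # order table now holds only keys
```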
Q.55 Define Data Pipeline.
A data pipeline refers to any set of processing elements that move data from one system to another. Data pipelines are often created for applications that use data to deliver value; they can also integrate information across applications, build data-driven web products, and complete data mining activities. Data engineers build data pipelines. A minimal sketch follows.
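A minimal data-pipeline sketch: small processing elements chained together so records move from a source to a target. The stage names and sample records are hypothetical.

```python
def read_source():
    # Pretend these strings arrive from a source system.
    yield from [" 42 ", "17", " not-a-number ", "8"]

def clean(records):
    for r in records:
        yield r.strip()

def to_numbers(records):
    for r in records:
        try:
            yield int(r)
        except ValueError:
            pass                     # drop records that cannot be parsed

def write_target(records, target):
    target.extend(records)

sink = []
write_target(to_numbers(clean(read_source())), sink)
print(sink)   # [42, 17, 8]
```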