Top 10 Data Analysis with Python Interview Questions

Data Analysis with Python

Q.1

Explain the difference between Series.str and apply for string operations.

Series.str → vectorized string operations (faster)

apply → can apply any function but slower for strings
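A minimal sketch of the contrast, using a small made-up Series (the variable names are illustrative). It also shows a practical difference: the .str accessor skips missing values, while a function passed to apply() has to guard against them itself.

```python
import pandas as pd

names = pd.Series(["alice", None, "Bob"])

# Vectorized accessor: missing values are skipped and stay missing.
upper_vec = names.str.upper()

# apply() calls the Python function once per element, so the
# function has to handle missing values on its own.
upper_apply = names.apply(lambda s: s.upper() if isinstance(s, str) else s)

print(upper_vec.tolist())
print(upper_apply.tolist())
```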

Q.2 Explain the difference between stack() and unstack() in Pandas.

stack() → converts columns into row indices (wide → long)

unstack() → converts row indices into columns (long → wide)
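As an illustration, the round trip can be sketched on a tiny made-up table (region and quarter labels are assumptions for the example):

```python
import pandas as pd

wide = pd.DataFrame(
    {"Q1": [10, 30], "Q2": [20, 40]},
    index=pd.Index(["North", "South"], name="region"),
)

# stack(): column labels become the innermost row index level.
long = wide.stack()

# unstack(): the innermost row index level moves back to the columns.
back = long.unstack()

print(long)
print(back)
```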

Q.3

How do you detect and remove duplicate rows based on specific columns?

df.drop_duplicates(subset=['col1', 'col2'], keep='first', inplace=True)
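The line above only removes duplicates. A short sketch of the detection step, on made-up data, could look like this; it returns a new DataFrame instead of using inplace=True, which is a style choice rather than a requirement.

```python
import pandas as pd

df = pd.DataFrame({
    "col1": ["a", "a", "b", "a"],
    "col2": [1, 1, 2, 1],
    "note": ["first", "second", "third", "fourth"],
})

# Detect: True marks rows whose (col1, col2) pair has already appeared.
dup_mask = df.duplicated(subset=["col1", "col2"], keep="first")

# Remove: keep only the first row of each (col1, col2) pair.
deduped = df.drop_duplicates(subset=["col1", "col2"], keep="first")

print(dup_mask.tolist())
print(deduped)
```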

Q.4

What are some common aggregation functions used in Pandas?

sum(), mean(), count(), median(), min(), max(), std()

Can be applied with groupby() or agg()
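For example, several of these functions can be combined in one agg() call. The sketch below uses named aggregation on made-up data; the column and result names are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "team": ["x", "x", "y", "y"],
    "score": [1, 3, 10, 20],
})

# One output column per (source column, aggregation) pair.
summary = df.groupby("team").agg(
    total=("score", "sum"),
    average=("score", "mean"),
    best=("score", "max"),
)

print(summary)
```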

Q.5

Explain how to merge time-series data with different frequencies.

Use resample() to standardize frequency

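Use merge_asof() to join each row to the nearest key in the other frame (by default the last key at or before it, direction='backward'); both frames must be sorted by the key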
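The two steps can be sketched together as follows. The data is made up (irregular readings and an occasionally changing rate), and casting both keys to one datetime resolution is a defensive assumption so that their dtypes match.

```python
import pandas as pd

readings = pd.DataFrame({
    "time": pd.to_datetime(
        ["2024-01-01 06:00", "2024-01-01 18:00", "2024-01-02 12:00"]
    ),
    "value": [1.0, 3.0, 5.0],
})
rates = pd.DataFrame({
    "time": pd.to_datetime(["2023-12-31", "2024-01-02"]),
    "rate": [0.5, 0.7],
})

# Give both keys the same explicit resolution so their dtypes match.
readings["time"] = readings["time"].astype("datetime64[ns]")
rates["time"] = rates["time"].astype("datetime64[ns]")

# Step 1: bring the irregular readings to a daily frequency.
daily = readings.set_index("time").resample("D").mean().reset_index()

# Step 2: attach the most recent rate at or before each day.
merged = pd.merge_asof(daily, rates, on="time")

print(merged)
```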

Q.6

How do you profile a dataset quickly in Python?

Use df.describe() for summary statistics

Use df.info() for data types and null counts

Use the ydata-profiling library (formerly pandas-profiling) for automated, detailed profiling

Q.7

How can you efficiently filter large datasets in Pandas?

Use boolean indexing or query(); on large frames query() can be faster when the numexpr package is installed:

df_filtered = df.query('col1 > 100 & col2 == "Yes"')
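A small sketch showing that the two approaches select the same rows (made-up data; which one is faster depends on the frame size and the installed packages):

```python
import pandas as pd

df = pd.DataFrame({
    "col1": [50, 150, 200],
    "col2": ["Yes", "No", "Yes"],
})

# Boolean indexing: each condition needs its own parentheses.
mask_result = df[(df["col1"] > 100) & (df["col2"] == "Yes")]

# query(): the same filter written as an expression string.
query_result = df.query('col1 > 100 and col2 == "Yes"')

print(mask_result)
```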

Q.8

How do you calculate a moving average with exponential weighting?

df['ewma'] = df['col'].ewm(span=3, adjust=False).mean()
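To make the weighting explicit, the sketch below compares ewm() with a hand-written recursion on a made-up series. With adjust=False the smoothing factor is alpha = 2 / (span + 1), and each value is (1 - alpha) times the previous average plus alpha times the new observation.

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, 2.0, 3.0, 4.0])
span = 3
alpha = 2 / (span + 1)  # 0.5 for span=3

ewma = s.ewm(span=span, adjust=False).mean()

# Manual recursion: y[0] = x[0]; y[t] = (1 - alpha) * y[t-1] + alpha * x[t]
manual = [s.iloc[0]]
for x in s.iloc[1:]:
    manual.append((1 - alpha) * manual[-1] + alpha * x)

print(ewma.tolist())
```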

Q.9 What are hierarchical indexes, and how are they used?

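A hierarchical index (MultiIndex) stores two or more index levels on one axis, so higher-dimensional data fits in a 2D DataFrame and can be selected level by level:

df.set_index(['Region', 'Product'], inplace=True)
df.loc[('North', 'ProductA')]  # pass a tuple so both values are read as row labels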

Q.10 How do you handle large datasets that exceed memory using Python?

Use chunksize in read_csv() for batch processing

Use Dask, Vaex, or PySpark for out-of-core computations

Downcast numeric types to reduce memory footprint
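The first and third points can be sketched together. An in-memory string stands in for a large file here, and the column names are made up; in practice a file path would be passed to read_csv().

```python
import io

import pandas as pd

# Stand-in for a large CSV file on disk.
csv_data = io.StringIO("id,amount\n1,10\n2,20\n3,30\n4,40\n5,50\n")

total = 0
n_chunks = 0
for chunk in pd.read_csv(csv_data, chunksize=2):
    # Downcast to the smallest integer type that can hold the values.
    chunk["amount"] = pd.to_numeric(chunk["amount"], downcast="integer")
    total += int(chunk["amount"].sum())
    n_chunks += 1

print(total, n_chunks)
```

Only one chunk is held in memory at a time, so the running total is the only state that grows with the file.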

Q.11

How do you perform a correlation analysis in Python, and what does it indicate?

Use df.corr(), which computes Pearson correlation by default (method='spearman' and method='kendall' are also available)

Positive correlation → variables move together; negative → move inversely

Useful for feature selection and multicollinearity detection

Q.12

How do you create custom aggregation functions in Pandas?

def range_func(x):
    return x.max() - x.min()
df.groupby('col1')['col2'].agg(range_func)

Q.13

Explain the difference between pivot_table() and groupby() in Pandas.

groupby() → aggregates data based on one or more columns; returns Series/DataFrame

pivot_table() → reshapes data with aggregation, handles missing values, and can have multiple indices/columns
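A brief sketch of the same aggregation done both ways, on made-up sales data with one missing region/product combination:

```python
import pandas as pd

df = pd.DataFrame({
    "region": ["N", "S", "S", "S"],
    "product": ["A", "A", "A", "B"],
    "sales": [10, 30, 40, 50],
})

# groupby(): long result, one row per combination that actually occurs.
grouped = df.groupby(["region", "product"])["sales"].sum()

# pivot_table(): wide result, with the missing combination filled in.
table = df.pivot_table(
    index="region", columns="product", values="sales",
    aggfunc="sum", fill_value=0,
)

print(grouped)
print(table)
```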

Q.14

How can you detect seasonality in a time-series dataset using Python?

Use seasonal_decompose() from statsmodels.tsa.seasonal to decompose the series into trend, seasonal, and residual components

from statsmodels.tsa.seasonal import seasonal_decompose
decompose = seasonal_decompose(df['sales'], model='additive', period=12)
decompose.plot()

Q.15

How do you handle imbalanced datasets in Python?

Techniques include:

Resampling: oversampling the minority class (e.g., SMOTE from imbalanced-learn) or undersampling the majority class
Using the class_weight parameter available in many scikit-learn models
Ensemble methods like BalancedRandomForestClassifier (imbalanced-learn)

Q.16

How do you detect seasonality and trend in time-series data using Python?

Use seasonal_decompose() from statsmodels
Plot rolling mean/median to observe trends
Use autocorrelation (ACF) and partial autocorrelation (PACF) plots, e.g., plot_acf() and plot_pacf() from statsmodels

Q.17

Explain the difference between wide and long data formats.

Wide: multiple variables in columns for each observation
Long: each observation-variable pair is a separate row
melt() and pivot() in Pandas can convert between formats
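A minimal round trip between the two formats, using made-up measurements (the column names are illustrative):

```python
import pandas as pd

wide = pd.DataFrame({
    "id": [1, 2],
    "height": [170, 180],
    "weight": [65, 80],
})

# Wide -> long: one row per (id, variable) pair.
long = wide.melt(id_vars="id", var_name="variable", value_name="value")

# Long -> wide: the variable names become columns again.
back = long.pivot(index="id", columns="variable", values="value")

print(long)
print(back)
```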

Q.18 How do you perform feature selection using correlation in Python?

Calculate correlation matrix: df.corr()
Drop one feature from each highly correlated pair (absolute correlation above a chosen threshold, e.g., 0.8) to reduce multicollinearity
Use heatmaps for visualization: seaborn.heatmap()
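One common way to implement the dropping step is sketched below on made-up data. The 0.8 threshold and the rule of dropping the later column of each pair are assumptions; which feature to keep is usually a judgment call.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0, 5.0],
    "b": [2.0, 4.1, 6.0, 8.2, 10.0],  # nearly a multiple of "a"
    "c": [5.0, 1.0, 4.0, 2.0, 3.0],   # only weakly related to "a"
})

corr = df.corr().abs()

# Keep only the upper triangle so each pair is checked once.
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))

# Drop any column that is highly correlated with an earlier column.
to_drop = [col for col in upper.columns if (upper[col] > 0.8).any()]
reduced = df.drop(columns=to_drop)

print(to_drop)
```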

Q.19 What is the difference between concat and append in Pandas?

concat() → flexible, can concatenate along rows or columns, handles multiple DataFrames
append() → added rows to a DataFrame; deprecated in Pandas 1.4 and removed in 2.0, so use concat() instead
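A short sketch of concat() covering both directions, including the row-wise case that append() used to handle (made-up frames):

```python
import pandas as pd

df = pd.DataFrame({"x": [1, 2]})
new_rows = pd.DataFrame({"x": [3]})
extra_cols = pd.DataFrame({"y": ["a", "b"]})

# Row-wise: the replacement for the old df.append(new_rows).
rows = pd.concat([df, new_rows], ignore_index=True)

# Column-wise: align on the index and place the frames side by side.
cols = pd.concat([df, extra_cols], axis=1)

print(rows)
print(cols)
```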

Q.20 How do you apply a lambda function to a DataFrame column?

df['new_col'] = df['col'].apply(lambda x: x**2 if x > 0 else 0)

Q.21 Explain one-hot encoding and when it is preferred over label encoding.

Converts categorical values into binary columns
Preferred when categories are nominal (no ordinal relationship) to avoid introducing artificial order
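As a small illustration, pd.get_dummies() produces one indicator column per category (the color column is a made-up example):

```python
import pandas as pd

df = pd.DataFrame({"color": ["red", "green", "red"]})

# One indicator column per category; no ordering is implied.
encoded = pd.get_dummies(df, columns=["color"])

print(encoded)
```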

Q.22 How can you optimize groupby operations on large datasets?

Use categorical data types for grouping columns
Reduce memory usage by downcasting numeric columns
Consider Dask DataFrames (dask.dataframe), which provide a Pandas-like groupby() for out-of-core computation

Q.23 How do you visualize the distribution of a numeric variable?

Use histograms: df['col'].hist()
KDE plots: sns.kdeplot(df['col'])
Boxplots for outliers: sns.boxplot(x=df['col'])

Q.24 How do you handle multi-index DataFrames for selection and slicing?

Use xs() for cross-section selection: df.xs('index_level_value', level='level_name')
Use loc with tuples for multi-index slicing: df.loc[('Region1','ProductA')]
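Both selection styles can be sketched on a small made-up sales table (the level names and values are illustrative):

```python
import pandas as pd

df = pd.DataFrame({
    "Region": ["North", "North", "South", "South"],
    "Product": ["ProductA", "ProductB", "ProductA", "ProductB"],
    "Sales": [100, 150, 80, 120],
}).set_index(["Region", "Product"])

# Tuple with loc: one fully specified row, returned as a Series.
one_row = df.loc[("North", "ProductA")]

# xs(): fix one level and keep the rest, here every region for ProductA.
product_a = df.xs("ProductA", level="Product")

# A single outer label selects all rows under it.
north = df.loc["North"]

print(product_a)
```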

Q.25 What is data analysis, and why is Python used for it?

Data analysis involves inspecting, cleaning, and modeling data to extract useful insights. Python is widely used because of its simplicity, rich libraries (Pandas, NumPy, Matplotlib, Seaborn), and strong community support.

Q.26 What is Pandas in Python?

Pandas is a Python library used for data manipulation and analysis. It provides data structures like Series (1D) and DataFrame (2D) to handle structured data efficiently.

Q.27 How do you handle missing data in a DataFrame?

Missing data can be handled using:

- df.dropna() → remove missing values
- df.fillna(value) → fill missing values with a specific value
- Imputation using mean, median, or mode
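The three options can be sketched on a tiny made-up frame. Using the median for the numeric column and the mode for the text column is an illustrative choice, not a general rule.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "age": [20.0, np.nan, 40.0],
    "city": ["A", None, "A"],
})

# Option 1: drop every row that has a missing value.
dropped = df.dropna()

# Options 2 and 3: fill per column with a computed value.
filled = df.fillna({
    "age": df["age"].median(),
    "city": df["city"].mode()[0],
})

print(filled)
```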

Q.28 Explain the difference between Series and DataFrame.

Series: 1-dimensional labeled array, similar to a single column.
DataFrame: 2-dimensional labeled data structure, similar to a table with rows and columns.

Q.29 What is NumPy, and why is it important in data analysis?

NumPy is a library for numerical computing in Python. It provides arrays for efficient data storage and fast operations, along with mathematical functions for calculations and analysis.

Q.30

How can you merge or join two DataFrames in Pandas?

You can use:

pd.concat([df1, df2]) → for concatenation
pd.merge(df1, df2, on='column_name', how='inner') → for database-like joins; how can be 'inner', 'left', 'right', or 'outer'

Q.31

What are some common data visualization libraries in Python?

Popular libraries include:

Matplotlib → basic plotting
Seaborn → statistical plots with attractive designs
Plotly → interactive visualizations

Q.32 How do you group data in Pandas?

Use groupby() to aggregate data:

df.groupby('column_name')['another_column'].sum()

This groups data by a column and performs aggregation functions like sum, mean, or count.

Q.33 How do you detect and remove duplicates in a dataset?

- Detect duplicates: df.duplicated()
- Remove duplicates: df.drop_duplicates()

Q.34

Explain the difference between loc and iloc in Pandas.

- loc → label-based indexing (row/column names)
- iloc → integer-based indexing (row/column positions)
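A quick sketch on a made-up frame with string labels. It also shows a detail that often comes up: label slices with loc include the end label, while position slices with iloc exclude the end position.

```python
import pandas as pd

df = pd.DataFrame({"score": [90, 80, 70]}, index=["x", "y", "z"])

by_label = df.loc["y", "score"]  # row label "y", column label "score"
by_position = df.iloc[1, 0]      # second row, first column

label_slice = df.loc["x":"y"]    # end label is included
position_slice = df.iloc[0:1]    # end position is excluded

print(by_label, by_position)
```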

Q.35

How do you handle outliers in a dataset using Python?

Outliers can be handled using:

Statistical methods (e.g., Z-score, IQR)
Python example using IQR:

Q1 = df['col'].quantile(0.25)
Q3 = df['col'].quantile(0.75)
IQR = Q3 - Q1
df = df[~((df['col'] < Q1 - 1.5*IQR) | (df['col'] > Q3 + 1.5*IQR))]

Q.36

How do you optimize large datasets in Pandas to reduce memory usage?

- Use appropriate data types (category for categorical data)
- Use df.astype() to convert types
- Load only required columns with usecols in read_csv()
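The effect of the type conversions can be checked with memory_usage(). The sketch below uses made-up data with a repetitive text column and small integers; actual savings depend on the data.

```python
import pandas as pd

df = pd.DataFrame({
    "city": ["Paris", "Tokyo"] * 5000,
    "count": [1, 2] * 5000,
})

before = df.memory_usage(deep=True).sum()

# Repeated strings are stored once as categories plus small integer codes.
df["city"] = df["city"].astype("category")
# Small integers do not need the default 64-bit type.
df["count"] = pd.to_numeric(df["count"], downcast="integer")

after = df.memory_usage(deep=True).sum()

print(before, after)
```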

Q.37

What is vectorization in Python, and why is it important?

Vectorization means performing an operation on entire arrays at once rather than in explicit Python loops over elements. It improves performance in numerical computations, mainly through NumPy arrays, because the loop runs in compiled code.
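A minimal sketch of the same calculation written both ways (the array contents are arbitrary):

```python
import numpy as np

values = np.arange(1_000)

# Explicit Python loop: one interpreted operation per element.
loop_result = [v * 2 + 1 for v in values]

# Vectorized: a single expression applied to the whole array.
vec_result = values * 2 + 1

print(vec_result[:5])
```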

Q.38

How do you pivot a DataFrame, and when would you use it?

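Pivot reshapes long data into wide form by turning the unique values of one column into new columns. Use it when each index/column pair occurs only once; for duplicate pairs, use pivot_table() with an aggregation function: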

df.pivot(index='Date', columns='Category', values='Sales')

Q.39

Explain time-series analysis using Python.

Time-series analysis involves data indexed over time. Python libraries like Pandas (pd.to_datetime, resample) and statsmodels can be used for trend analysis, seasonal decomposition, and forecasting.

Q.40

What is the difference between merge and join in Pandas?

- merge() → flexible, SQL-style joins using a key
- join() → simpler, joins on index by default
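A short sketch of both calls on made-up frames. Note that the default join types differ: merge() defaults to an inner join and join() to a left join.

```python
import pandas as pd

left = pd.DataFrame({"key": ["a", "b"], "l": [1, 2]})
right = pd.DataFrame({"key": ["b", "c"], "r": [3, 4]})

# merge(): join on a column, here keeping only keys present in both.
merged = left.merge(right, on="key", how="inner")

# join(): aligns on the index and keeps every row of the left frame.
joined = left.set_index("key").join(right.set_index("key"))

print(merged)
print(joined)
```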

Q.41

How do you detect multicollinearity in a dataset?

- Use correlation matrix (df.corr())
- Variance Inflation Factor (VIF) calculation using statsmodels
- A high pairwise correlation, or a VIF above roughly 5 to 10 (a rule of thumb), indicates multicollinearity

Q.42

How do you normalize or standardize data in Python?

- Normalization: scale values between 0 and 1 using MinMaxScaler
- Standardization: scale to zero mean and unit variance using StandardScaler from sklearn
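To show what the two scalers compute, the sketch below applies the underlying formulas directly with Pandas instead of calling scikit-learn. That substitution is an assumption made to keep the example dependency-free; the population standard deviation (ddof=0) is used because that is what StandardScaler uses.

```python
import numpy as np
import pandas as pd

s = pd.Series([10.0, 20.0, 30.0])

# Min-max normalization: (x - min) / (max - min), giving values in [0, 1].
normalized = (s - s.min()) / (s.max() - s.min())

# Standardization: (x - mean) / std, giving zero mean and unit variance.
standardized = (s - s.mean()) / s.std(ddof=0)

print(normalized.tolist())
```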

Q.43

How do you handle categorical variables for machine learning?

- Label Encoding → integer representation
- One-Hot Encoding → binary columns
- Pandas: pd.get_dummies()

Q.44

Explain the difference between deep copy and shallow copy in Pandas.

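- Shallow copy (df.copy(deep=False)): shares the underlying data with the original, so in Pandas versions without Copy-on-Write enabled, changes to the data affect both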
- Deep copy: independent copy, changes don’t affect original (df.copy(deep=True))

Q.45

Explain the difference between apply(), map(), and applymap() in Pandas.

- map() → element-wise operation on Series
- apply() → applies a function along Series or DataFrame axis
- applymap() → element-wise operation on entire DataFrame (deprecated since Pandas 2.1 in favor of DataFrame.map())
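A brief sketch of Series.map() and DataFrame.apply() on a made-up frame. The element-wise DataFrame call is left out because its name depends on the installed Pandas version.

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2], "b": [3, 4]})

# map(): the function receives one element of the Series at a time.
mapped = df["a"].map(lambda x: x * 10)

# apply() with the default axis=0: the function receives each column.
col_range = df.apply(lambda col: col.max() - col.min())

# apply() with axis=1: the function receives each row.
row_total = df.apply(lambda row: row["a"] + row["b"], axis=1)

print(mapped.tolist(), col_range.tolist(), row_total.tolist())
```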

Q.46

How do you perform a rolling window operation in Pandas?

df['rolling_mean'] = df['col'].rolling(window=3).mean()

Q.47 How do you handle large CSV files that cannot fit in memory?

Use chunking in read_csv() with chunksize

Process data in smaller batches

Use Dask or PySpark for distributed processing

Q.48

How do you create a multi-index DataFrame?

df.set_index(['col1', 'col2'], inplace=True)

Used for hierarchical data representation and advanced grouping operations.
