Data Analysis with Python

Q.1

Explain the difference between Series.str and apply for string operations.

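  • Series.str → built-in string methods applied to the whole Series at once; concise, propagates missing values, and usually at least as fast as apply

  • apply → runs any Python function on each element; more flexible, but typically slower for strings and needs explicit handling of missing values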
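A minimal sketch of the two approaches on the same task. The `names` Series is hypothetical sample data, not part of the original question.

```python
import pandas as pd

# Hypothetical sample data with one missing value.
names = pd.Series(["  alice ", "BOB", None, "carol"])

# Series.str: each method runs over the whole Series and passes missing values through.
cleaned_str = names.str.strip().str.title()

# apply: any Python function works, but missing values must be handled by hand.
cleaned_apply = names.apply(lambda s: s.strip().title() if isinstance(s, str) else s)

print(cleaned_str.tolist())
print(cleaned_apply.tolist())
```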

Q.2 Explain the difference between stack() and unstack() in Pandas.
  • stack() → converts columns into row indices (wide → long)

  • unstack() → converts row indices into columns (long → wide)
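A small round trip, assuming a hypothetical table of quarterly figures per region, shows the two directions:

```python
import pandas as pd

# Hypothetical wide table: one row per region, one column per quarter.
wide = pd.DataFrame(
    {"Q1": [10, 30], "Q2": [20, 40]},
    index=pd.Index(["North", "South"], name="region"),
)

# stack(): column labels become the inner level of the row index (wide -> long).
long = wide.stack()

# unstack(): the inner row-index level moves back to the columns (long -> wide).
back = long.unstack()

print(long)
print(back)
```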

    Q.3

    How do you detect and remove duplicate rows based on specific columns?

    df.drop_duplicates(subset=['col1', 'col2'], keep='first', inplace=True)
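The line above only removes duplicates. A sketch of the detection step, using hypothetical sample data and without inplace so the original frame stays available for comparison:

```python
import pandas as pd

# Hypothetical data: rows 1 and 3 repeat the (col1, col2) pair of row 0.
df = pd.DataFrame({
    "col1": [1, 1, 2, 1],
    "col2": ["a", "a", "b", "a"],
    "value": [10, 11, 12, 13],
})

# Detect: True marks every repeat of a (col1, col2) pair after its first occurrence.
dup_mask = df.duplicated(subset=["col1", "col2"], keep="first")

# Remove: keep only the first row for each (col1, col2) pair.
deduped = df.drop_duplicates(subset=["col1", "col2"], keep="first")

print(dup_mask.tolist())
print(deduped)
```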
    

    Q.4

    What are some common aggregation functions used in Pandas?

  • sum(), mean(), count(), median(), min(), max(), std()

  • Can be applied with groupby() or agg()
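For illustration, several of these functions applied at once through agg(), on a hypothetical sales table:

```python
import pandas as pd

# Hypothetical sales data.
sales = pd.DataFrame({
    "region": ["North", "North", "South", "South", "South"],
    "amount": [100, 150, 80, 120, 100],
})

# Several aggregations per group in one call.
summary = sales.groupby("region")["amount"].agg(["sum", "mean", "count", "min", "max"])

# The same functions also work on a whole column without grouping.
overall = sales["amount"].agg(["median", "std"])

print(summary)
print(overall)
```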

    Q.5

    Explain how to merge time-series data with different frequencies.

  • Use resample() to standardize frequency

  • Use merge_asof() to join each row to the nearest key in the other table (by default the last key at or before it); both tables must be sorted by the key
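A sketch of both steps, assuming hypothetical one-minute sensor readings and irregular events:

```python
import pandas as pd

# Hypothetical one-minute readings.
readings = pd.DataFrame({
    "time": pd.date_range("2024-01-01 09:00", periods=6, freq="min"),
    "temp": [20.0, 21.0, 22.0, 23.0, 24.0, 25.0],
})

# resample(): bring the readings to a common 2-minute frequency by averaging.
two_min = readings.set_index("time")["temp"].resample("2min").mean()

# Hypothetical events that fall between readings.
events = pd.DataFrame({
    "time": pd.date_range("2024-01-01 09:01:30", periods=2, freq="3min"),
    "event": ["door_open", "door_close"],
})

# merge_asof(): attach the most recent reading at or before each event.
# Both frames are already sorted by "time", which merge_asof requires.
merged = pd.merge_asof(events, readings, on="time")

print(two_min)
print(merged)
```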

    Q.6

    How do you profile a dataset quickly in Python?

  • Use df.describe() for summary statistics

  • Use df.info() for data types and null counts

  • Use the ydata-profiling library (formerly pandas-profiling) for automated, detailed profiling

    Q.7

    How can you efficiently filter large datasets in Pandas?

    Use boolean indexing or query() for faster filtering:

    df_filtered = df.query('col1 > 100 & col2 == "Yes"')
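The same filter written both ways, on a hypothetical frame, to show that the two forms select the same rows:

```python
import pandas as pd

# Hypothetical data using the column names from the example above.
df = pd.DataFrame({
    "col1": [50, 150, 200, 300],
    "col2": ["Yes", "No", "Yes", "Yes"],
})

# Boolean indexing: each condition needs its own parentheses.
mask_result = df[(df["col1"] > 100) & (df["col2"] == "Yes")]

# query(): the same condition as a string expression.
query_result = df.query('col1 > 100 and col2 == "Yes"')

print(mask_result)
```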
    Q.8

    How do you calculate a moving average with exponential weighting?

    df['ewma'] = df['col'].ewm(span=3, adjust=False).mean()
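With adjust=False, each value is computed recursively as y[t] = (1 - alpha) * y[t-1] + alpha * x[t], where alpha = 2 / (span + 1). A sketch that checks this by hand on hypothetical values:

```python
import pandas as pd

values = pd.Series([10.0, 20.0, 30.0, 40.0])  # hypothetical data
span = 3
alpha = 2 / (span + 1)  # 0.5 for span=3

ewma = values.ewm(span=span, adjust=False).mean()

# Manual recursion: start from the first value, then blend in each new one.
manual = [values.iloc[0]]
for x in values.iloc[1:]:
    manual.append((1 - alpha) * manual[-1] + alpha * x)

print(ewma.tolist())
print(manual)
```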
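    Q.9 What are hierarchical indexes, and how are they used?
    A hierarchical index (MultiIndex) stores two or more index levels on one axis, so rows can be selected by the outer level alone or by a full key.
    df.set_index(['Region', 'Product'], inplace=True)
    df.loc[('North', 'ProductA')]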
    Q.10 How do you handle large datasets that exceed memory using Python?

      Use chunksize in read_csv() for batch processing

      Use Dask, Vaex, or PySpark for out-of-core computations

      Downcast numeric types to reduce memory footprint
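A sketch of chunked processing. An in-memory buffer stands in for a large file here (an assumption for the example); a file path is passed to read_csv() the same way.

```python
import io
import pandas as pd

# Stand-in for a large CSV file on disk (hypothetical data).
csv_data = io.StringIO(
    "id,amount\n" + "\n".join(f"{i},{i * 10}" for i in range(1, 11))
)

total = 0
rows_seen = 0

# With chunksize, read_csv yields DataFrames of up to 4 rows each,
# so only one chunk is held in memory at a time.
for chunk in pd.read_csv(csv_data, chunksize=4):
    total += int(chunk["amount"].sum())
    rows_seen += len(chunk)

print(rows_seen, total)
```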

      Q.11

      How do you perform a correlation analysis in Python, and what does it indicate?

    • Use df.corr(), which computes Pearson correlation by default; pass method='spearman' or method='kendall' for rank-based alternatives

    • Positive correlation → variables move together; negative → move inversely

    • Useful for feature selection and multicollinearity detection

      Q.12

      How do you create custom aggregation functions in Pandas?

      def range_func(x):
          return x.max() - x.min()
      df.groupby('col1')['col2'].agg(range_func)
      Q.13

      Explain the difference between pivot_table() and groupby() in Pandas.

    • groupby() → aggregates data based on one or more columns; returns Series/DataFrame

    • pivot_table() → reshapes data with aggregation, handles missing values, and can have multiple indices/columns
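The same aggregation in both forms, on hypothetical sales data, to show how the output shape differs:

```python
import pandas as pd

# Hypothetical sales data; the South region has no product B.
sales = pd.DataFrame({
    "region": ["North", "North", "South", "South"],
    "product": ["A", "B", "A", "A"],
    "amount": [100, 200, 50, 70],
})

# groupby(): long result, one row per (region, product) pair that exists.
grouped = sales.groupby(["region", "product"])["amount"].sum()

# pivot_table(): regions as rows, products as columns, missing pairs filled with 0.
table = sales.pivot_table(
    index="region", columns="product", values="amount",
    aggfunc="sum", fill_value=0,
)

print(grouped)
print(table)
```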

      Q.14

      How can you detect seasonality in a time-series dataset using Python?

      Use seasonal_decompose() from statsmodels.tsa.seasonal to decompose a series into trend, seasonal, and residual components:

      from statsmodels.tsa.seasonal import seasonal_decompose
      decompose = seasonal_decompose(df['sales'], model='additive', period=12)
      decompose.plot()
      Q.15

      How do you handle imbalanced datasets in Python?

      Techniques include:

      • Resampling: oversampling the minority class (e.g., SMOTE from imbalanced-learn) or undersampling the majority class

      • Setting the class_weight parameter in scikit-learn models (e.g., class_weight='balanced')

      • Ensemble methods like BalancedRandomForestClassifier (imbalanced-learn)

      Q.16

      How do you detect seasonality and trend in time-series data using Python?


      • Use seasonal_decompose() from statsmodels

      • Plot rolling mean/median to observe trends

      • Use autocorrelation (acf) and partial autocorrelation (pacf) plots

      Q.17

      Explain the difference between wide and long data formats.


      • Wide: multiple variables in columns for each observation

      • Long: each observation-variable pair is a separate row

      • melt() and pivot() in Pandas can convert between formats
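A round trip between the two formats, using a hypothetical table of measurements:

```python
import pandas as pd

# Hypothetical wide data: one row per id, one column per variable.
wide = pd.DataFrame({"id": [1, 2], "height": [170, 180], "weight": [65, 80]})

# melt(): wide -> long, one row per (id, variable) pair.
long = wide.melt(id_vars="id", var_name="variable", value_name="value")

# pivot(): long -> wide again.
restored = long.pivot(index="id", columns="variable", values="value").reset_index()

print(long)
print(restored)
```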

      Q.18 How do you perform feature selection using correlation in Python?


      • Calculate correlation matrix: df.corr()

      • Drop one feature from each highly correlated pair (e.g., absolute correlation above 0.8) to reduce multicollinearity

      • Use heatmaps for visualization: seaborn.heatmap()
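One common way to carry out these steps is sketched below. The data is hypothetical, and the 0.8 threshold is a rule of thumb rather than a fixed standard.

```python
import numpy as np
import pandas as pd

# Hypothetical features: "b" is almost exactly 2 * "a"; "c" is unrelated.
features = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    "b": [2.0, 4.0, 6.0, 8.0, 10.0, 13.0],
    "c": [5.0, 1.0, 4.0, 2.0, 6.0, 3.0],
})

# Absolute correlations, keeping only the upper triangle so each pair is seen once.
corr = features.corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))

# Drop the later column of every pair whose correlation exceeds the threshold.
threshold = 0.8
to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
reduced = features.drop(columns=to_drop)

print(to_drop)
```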

      Q.19 What is the difference between concat and append in Pandas?
      • concat() → flexible, can concatenate along rows or columns, handles multiple DataFrames

      • append() → added rows to a DataFrame; deprecated in Pandas 1.4 and removed in 2.0, so use concat() instead
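A sketch of concat() covering both directions, with hypothetical frames:

```python
import pandas as pd

# Hypothetical frames.
df1 = pd.DataFrame({"id": [1, 2], "score": [90, 85]})
df2 = pd.DataFrame({"id": [3], "score": [70]})
extra = pd.DataFrame({"grade": ["A", "B", "C"]})

# Row-wise: the replacement for the old df1.append(df2).
rows = pd.concat([df1, df2], ignore_index=True)

# Column-wise: frames are placed side by side, aligned on the index.
cols = pd.concat([rows, extra], axis=1)

print(cols)
```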

      Q.20 How do you apply a lambda function to a DataFrame column?
      df['new_col'] = df['col'].apply(lambda x: x**2 if x > 0 else 0)
      

      Q.21 Explain one-hot encoding and when it is preferred over label encoding.
      • Converts categorical values into binary columns

      • Preferred when categories are nominal (no ordinal relationship) to avoid introducing artificial order
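To make the contrast concrete, here is a sketch on a hypothetical nominal column. Label encoding is done with pandas category codes here, which is one of several ways to produce integer labels.

```python
import pandas as pd

# Hypothetical nominal column: the colors have no natural order.
df = pd.DataFrame({"color": ["red", "green", "blue", "green"]})

# One-hot encoding: one 0/1 column per category.
one_hot = pd.get_dummies(df["color"], prefix="color", dtype=int)

# Label encoding: a single integer column, which implies blue < green < red.
label = df["color"].astype("category").cat.codes

print(one_hot)
print(label.tolist())
```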

      Q.22 How can you optimize groupby operations on large datasets?
      • Use categorical data types for grouping columns

      • Reduce memory usage by downcasting numeric columns

      • Consider Dask DataFrames (dask.dataframe), which provide a pandas-like groupby() for out-of-core computation

      Q.23 How do you visualize the distribution of a numeric variable?
      • Use histograms: df['col'].hist()

      • KDE plots: sns.kdeplot(df['col'])

      • Boxplots for outliers: sns.boxplot(x=df['col'])

      Q.24 How do you handle multi-index DataFrames for selection and slicing?
      • Use xs() for cross-section selection: df.xs('index_level_value', level='level_name')

      • Use loc with tuples for multi-index slicing: df.loc[('Region1','ProductA')]
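A sketch of these selections on a hypothetical two-level index (the Region and Product names are sample values):

```python
import pandas as pd

# Hypothetical data indexed by (Region, Product).
df = pd.DataFrame({
    "Region": ["North", "North", "South", "South"],
    "Product": ["ProductA", "ProductB", "ProductA", "ProductB"],
    "Sales": [100, 150, 80, 120],
}).set_index(["Region", "Product"]).sort_index()

# Outer level only: every product in the North region.
north = df.loc["North"]

# Full key as a tuple, plus a column label: a single value.
one = df.loc[("North", "ProductA"), "Sales"]

# xs(): select on an inner level across all regions.
product_a = df.xs("ProductA", level="Product")

print(north)
print(one)
print(product_a)
```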

      Q.25 What is data analysis, and why is Python used for it?
      Data analysis involves inspecting, cleaning, and modeling data to extract useful insights. Python is widely used because of its simplicity, rich libraries (Pandas, NumPy, Matplotlib, Seaborn), and strong community support.
      Q.26 What is Pandas in Python?
      Pandas is a Python library used for data manipulation and analysis. It provides data structures like Series (1D) and DataFrame (2D) to handle structured data efficiently.
      Q.27 How do you handle missing data in a DataFrame?

      Missing data can be handled using:

        • df.dropna() → remove missing values
        • df.fillna(value) → fill missing values with a specific value
        • Imputation using mean, median, or mode
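A sketch of the three options on a hypothetical frame with one missing value per column:

```python
import numpy as np
import pandas as pd

# Hypothetical data with one missing value in each column.
df = pd.DataFrame({
    "age": [25.0, np.nan, 35.0],
    "city": ["Pune", "Delhi", None],
})

# Count missing values per column before deciding how to handle them.
missing_counts = df.isna().sum()

# Option 1: drop every row that contains a missing value.
dropped = df.dropna()

# Options 2 and 3: impute the numeric column with its mean, fill the text column with a constant.
filled = df.fillna({"age": df["age"].mean(), "city": "Unknown"})

print(missing_counts)
print(filled)
```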
      Q.28 Explain the difference between Series and DataFrame.
      • Series: 1-dimensional labeled array, similar to a single column.
      • DataFrame: 2-dimensional labeled data structure, similar to a table with rows and columns.
      Q.29 What is NumPy, and why is it important in data analysis?
      NumPy is a library for numerical computing in Python. It provides arrays for efficient data storage and fast operations, along with mathematical functions for calculations and analysis.
      Q.30

      How can you merge or join two DataFrames in Pandas?

      You can use:

      • pd.concat([df1, df2]) → for concatenation

      • pd.merge(df1, df2, on='column_name', how='inner') → for database-like joins; how can be 'inner', 'left', 'right', or 'outer'

      Q.31

      What are some common data visualization libraries in Python?

      Popular libraries include:

      • Matplotlib → basic plotting

      • Seaborn → statistical plots with attractive designs

      • Plotly → interactive visualizations

      Q.32 How do you group data in Pandas?

      Use groupby() to aggregate data:

      df.groupby('column_name')['another_column'].sum()

      This groups data by a column and performs aggregation functions like sum, mean, or count.

      Q.33 How do you detect and remove duplicates in a dataset?
        • Detect duplicates: df.duplicated()
        • Remove duplicates: df.drop_duplicates()
      Q.34

      Explain the difference between loc and iloc in Pandas.

        • loc → label-based indexing (row/column names)
        • iloc → integer-based indexing (row/column positions)
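A short sketch on a hypothetical frame. Note that label slices include the end label, while position slices exclude the end position.

```python
import pandas as pd

# Hypothetical frame with string row labels.
df = pd.DataFrame({"score": [90, 85, 70]}, index=["amy", "bob", "cal"])

by_label = df.loc["bob", "score"]   # row label and column name
by_position = df.iloc[1, 0]         # second row, first column

label_slice = df.loc["amy":"bob"]   # end label included: 2 rows
pos_slice = df.iloc[0:1]            # end position excluded: 1 row

print(by_label, by_position, len(label_slice), len(pos_slice))
```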
      Q.35

      How do you handle outliers in a dataset using Python?

      Outliers can be handled using:

      • Statistical methods (e.g., Z-score, IQR)

      • Python example using IQR:

      Q1 = df['col'].quantile(0.25)
      Q3 = df['col'].quantile(0.75)
      IQR = Q3 - Q1
      df = df[~((df['col'] < Q1 - 1.5*IQR) | (df['col'] > Q3 + 1.5*IQR))]
      

      Q.36

      How do you optimize large datasets in Pandas to reduce memory usage?

        • Use appropriate data types (category for categorical data)
        • Use df.astype() to convert types
        • Load only required columns with usecols in read_csv()
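A sketch of the dtype changes and their effect on memory, using hypothetical data. Downcasting with pd.to_numeric is one way to choose a smaller integer type; actual savings depend on the data.

```python
import numpy as np
import pandas as pd

# Hypothetical data: a repetitive text column and small integers stored as int64.
df = pd.DataFrame({
    "city": ["Pune", "Delhi", "Pune", "Delhi"] * 250,
    "count": np.arange(1000, dtype="int64") % 100,
})

before = df.memory_usage(deep=True).sum()

# Repeated strings are stored once when the column is categorical.
df["city"] = df["city"].astype("category")

# Values 0..99 fit in the smallest integer type.
df["count"] = pd.to_numeric(df["count"], downcast="integer")

after = df.memory_usage(deep=True).sum()

print(before, after, df.dtypes.to_dict())
```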
      Q.37

      What is vectorization in Python, and why is it important?

      Vectorization means applying an operation to an entire array at once instead of looping over its elements in Python. Because the loop runs in compiled code, typically inside NumPy, numerical computations run much faster.
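The same calculation written as a loop and as a vectorized expression, on hypothetical arrays:

```python
import numpy as np

# Hypothetical data.
prices = np.array([10.0, 20.0, 30.0])
qty = np.array([1, 2, 3])

# Element-wise Python loop.
loop_total = 0.0
for p, q in zip(prices, qty):
    loop_total += p * q

# Vectorized: one multiplication and one sum over whole arrays.
vector_total = float((prices * qty).sum())

print(loop_total, vector_total)
```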

      Q.38

      How do you pivot a DataFrame, and when would you use it?

      Pivot is used to reshape data:

      df.pivot(index='Date', columns='Category', values='Sales')
      Q.39

      Explain time-series analysis using Python.

      Time-series analysis involves data indexed over time. Python libraries like Pandas (pd.to_datetime, resample) and statsmodels can be used for trend analysis, seasonal decomposition, and forecasting.

      Q.40

      What is the difference between merge and join in Pandas?

        • merge() → flexible, SQL-style joins using a key
        • join() → simpler, joins on index by default
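A sketch contrasting the two, with hypothetical customer and order tables:

```python
import pandas as pd

# Hypothetical tables.
customers = pd.DataFrame({"cust_id": [1, 2, 3], "name": ["Ann", "Ben", "Cy"]})
orders = pd.DataFrame({"cust_id": [1, 1, 3], "amount": [50, 20, 70]})

# merge(): SQL-style join on a key column; a left join keeps customers without orders.
merged = customers.merge(orders, on="cust_id", how="left")

# join(): aligns on the index and is a left join by default.
totals = orders.groupby("cust_id")["amount"].sum().rename("total")
joined = customers.set_index("cust_id").join(totals)

print(merged)
print(joined)
```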
      Q.41

      How do you detect multicollinearity in a dataset?

        • Use correlation matrix (df.corr())
        • Variance Inflation Factor (VIF) calculation using statsmodels
        • High pairwise correlation, or a VIF above roughly 5 to 10 (a rule of thumb), indicates multicollinearity
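The VIF of feature i is 1 / (1 - R_i^2), where R_i^2 comes from regressing that feature on the others; it also equals the i-th diagonal entry of the inverse correlation matrix. The sketch below uses that identity with NumPy instead of the statsmodels helper, on hypothetical data.

```python
import numpy as np
import pandas as pd

# Hypothetical features: x2 is roughly 2 * x1; x3 is unrelated.
X = pd.DataFrame({
    "x1": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    "x2": [2.1, 3.9, 6.2, 7.8, 10.1, 12.2],
    "x3": [5.0, 1.0, 4.0, 2.0, 6.0, 3.0],
})

# VIF values are the diagonal of the inverse correlation matrix.
corr = X.corr().to_numpy()
vif = pd.Series(np.diag(np.linalg.inv(corr)), index=X.columns)

print(vif)
```

Here x1 and x2 should show very large values, while x3 stays well below the usual cutoff.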
      Q.42

      How do you normalize or standardize data in Python?

        • Normalization: scale values between 0 and 1 using MinMaxScaler
        • Standardization: scale to zero mean and unit variance using StandardScaler from sklearn
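The formulas behind the two scalers can be written directly in pandas. This is a sketch of the arithmetic on hypothetical values, not a replacement for the scikit-learn classes; ddof=0 is used because StandardScaler divides by the population standard deviation.

```python
import pandas as pd

col = pd.Series([10.0, 20.0, 30.0, 40.0])  # hypothetical values

# Normalization (min-max): (x - min) / (max - min), giving values in [0, 1].
normalized = (col - col.min()) / (col.max() - col.min())

# Standardization (z-score): (x - mean) / std, giving mean 0 and unit variance.
standardized = (col - col.mean()) / col.std(ddof=0)

print(normalized.tolist())
print(standardized.tolist())
```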
      Q.43

      How do you handle categorical variables for machine learning?

        • Label Encoding → integer representation
        • One-Hot Encoding → binary columns
        • Pandas: pd.get_dummies()
      Q.44

      Explain the difference between deep copy and shallow copy in Pandas.

        • Shallow copy: shares the original's data instead of duplicating it (df.copy(deep=False)); without Copy-on-Write, changes to the data affect the original
        • Deep copy: independent copy, changes don’t affect original (df.copy(deep=True))
      Q.45

      Explain the difference between apply(), map(), and applymap() in Pandas.

        • map() → element-wise operation on Series
        • apply() → applies a function along Series or DataFrame axis
        • applymap() → element-wise operation on an entire DataFrame (deprecated since Pandas 2.1 in favor of DataFrame.map())
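A sketch of the three on a hypothetical frame. The hasattr check is only there so the example runs on Pandas versions before and after DataFrame.map() was introduced.

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3], "b": [10, 20, 30]})  # hypothetical data

# map(): element-wise on a Series, here with a lookup dictionary.
mapped = df["a"].map({1: "low", 2: "mid", 3: "high"})

# apply(): the function receives a whole column (axis=0) or a whole row (axis=1).
col_range = df.apply(lambda col: col.max() - col.min())
row_sums = df.apply(lambda row: row["a"] + row["b"], axis=1)

# Element-wise over the whole DataFrame: DataFrame.map() where available, else applymap().
if hasattr(pd.DataFrame, "map"):
    doubled = df.map(lambda v: v * 2)
else:
    doubled = df.applymap(lambda v: v * 2)

print(mapped.tolist(), col_range.tolist(), row_sums.tolist())
```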
      Q.46

      How do you perform a rolling window operation in Pandas?

      df['rolling_mean'] = df['col'].rolling(window=3).mean()
      Q.47 How do you handle large CSV files that cannot fit in memory?
    • Use chunking in read_csv() with chunksize

    • Process data in smaller batches

    • Use Dask or PySpark for distributed processing

      Q.48

      How do you create a multi-index DataFrame?

      df.set_index(['col1', 'col2'], inplace=True)
      Used for hierarchical data representation and advanced grouping operations.