Pydata-visualizer

PyPI version Python versions License: MIT

A powerful and intuitive Python library for exploratory data analysis and data profiling. Pydata-visualizer automatically analyzes your dataset, generates interactive visualizations, and provides detailed statistical insights with minimal code.

Features

  • Comprehensive Data Profiling: Analyze numerical, categorical, boolean, and string data types

  • Automated Data Quality Checks: Detect missing values, outliers, skewed distributions, duplicate rows, and more

  • Interactive Visualizations: Generate distribution plots, correlation heatmaps, word clouds, and statistical charts

  • Text Analysis: Automatic word frequency analysis and word cloud generation for text columns

  • Rich HTML Reports: Export analysis to visually appealing and shareable HTML reports

  • Performance Optimized: Fast analysis even on large datasets

  • Correlation Analysis: Calculate Pearson, Spearman, and Cramér’s V correlations between variables

  • Flexible Configuration: Customize analysis thresholds and options via the Settings class

Installation

pip install pydata-visualizer

Quick Start

import pandas as pd
from data_visualizer.profiler import AnalysisReport, Settings

# Load your dataset
df = pd.read_csv("your_dataset.csv")

# Create a report with default settings
report = AnalysisReport(df)
report.to_html("report.html")

Advanced Usage

Customizing Analysis Settings

from data_visualizer.profiler import AnalysisReport, Settings

# Configure analysis settings
report_settings = Settings(
    minimal=False,              # Set to True for faster, minimal analysis
    top_n_values=5,             # Show top 5 values in categorical columns
    skewness_threshold=2.0,     # Tolerance for skewness alerts
    outlier_method='iqr',       # Outlier detection method: 'iqr' or 'zscore'
    outlier_threshold=1.5,      # IQR multiplier for outlier detection
    duplicate_threshold=5.0,    # Percentage threshold for duplicate alerts
    text_analysis=True          # Enable word frequency analysis for text columns
)

# Create report with custom settings
report = AnalysisReport(df, settings=report_settings)

# Perform analysis and get results dictionary
results = report.analyse()

# Generate HTML report
report.to_html("custom_report.html")

Report Structure

The generated report includes:

  • Overview: Dataset dimensions, missing values, duplicate rows, and duplicate percentage

  • Variable Analysis: Detailed per-column statistics and visualizations including:

    • Distribution plots for numeric data

    • Bar charts for categorical data

    • Word clouds and frequency analysis for text data

    • Outlier detection and highlighting

  • Sample Data: Head and tail samples of the dataset

  • Correlations: Correlation matrices and heatmaps (Pearson, Spearman, Cramér’s V)

  • Data Quality Alerts: Automated detection of data quality issues

API Reference

AnalysisReport Class

class AnalysisReport:
    def __init__(self, data, settings=None):
        """
        Initialize the analysis report object.
        
        Parameters:
        -----------
        data : pandas.DataFrame
            The dataset to analyze
        settings : Settings, optional
            Configuration settings for the analysis
        """
        
    def analyse(self):
        """
        Perform the data analysis.
        
        Returns:
        --------
        dict
            A dictionary containing all analysis results
        """
        
    def to_html(self, filename="report.html"):
        """
        Generate an HTML report from the analysis.
        
        Parameters:
        -----------
        filename : str, optional
            Path to save the HTML report (default: "report.html")
        """

Settings Class

class Settings(pydantic.BaseModel):
    """
    Settings for the analysis report.
    
    Attributes:
    -----------
    minimal : bool, default=False
        Whether to perform minimal analysis (skips type-specific analysis and visualizations)
    
    top_n_values : int, default=10
        Number of top values to show for categorical columns (must be >= 1)
    
    skewness_threshold : float, default=1.0
        Threshold for skewness alerts (must be >= 0.0)
    
    outlier_method : str, default='iqr'
        Outlier detection method: 'iqr' (Interquartile Range) or 'zscore'
    
    outlier_threshold : float, default=1.5
        IQR multiplier for outlier detection (must be >= 0.0)
        Standard: 1.5 for moderate outliers, 3.0 for extreme outliers
    
    duplicate_threshold : float, default=5.0
        Percentage of duplicate rows to trigger an alert (must be >= 0.0)
    
    text_analysis : bool, default=True
        Enable word frequency analysis and word cloud generation for text columns
    """

Type Analyzers

The library automatically detects and applies the appropriate analysis for different data types:

  • Numeric (Integer/Float): Statistical measures (mean, std, quartiles), distribution plots, skewness, kurtosis, outlier detection

  • Categorical/Object: Value counts, cardinality analysis, frequency distributions, top N values

  • String: Unique value counts, cardinality, top N values, word frequency analysis, word cloud generation

  • Boolean: Value counts and proportions

  • Generic: Basic analysis for unrecognized types

Correlation Analysis

Three correlation methods are calculated when applicable:

  • Pearson: Linear correlation between numerical variables (range: -1 to 1)

  • Spearman: Rank correlation capturing monotonic relationships (range: -1 to 1)

  • Cramér’s V: Measure of association between categorical variables (range: 0 to 1)

Data Quality Alerts

The library automatically detects potential issues in your data:

  • High Missing Values: Columns with more than 20% missing data

  • Skewness: Distributions exceeding the configured skewness threshold

  • Outliers: Data points detected using IQR or Z-score methods

  • High Duplicates: Duplicate rows exceeding the configured threshold percentage

Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

License

This project is licensed under the MIT License - see the LICENSE file for details.

Credits

Created by Aditya Deshmukh (adideshmukh2005@gmail.com)

GitHub: https://github.com/Adi-Deshmukh/Pydata-visualizer