Pydata-visualizer
A powerful and intuitive Python library for exploratory data analysis and data profiling. Pydata-visualizer automatically analyzes your dataset, generates interactive visualizations, and provides detailed statistical insights with minimal code.
Features
Comprehensive Data Profiling: Analyze numerical, categorical, boolean, and string data types
Automated Data Quality Checks: Detect missing values, outliers, skewed distributions, duplicate rows, and more
Interactive Visualizations: Generate distribution plots, correlation heatmaps, word clouds, and statistical charts
Text Analysis: Automatic word frequency analysis and word cloud generation for text columns
Rich HTML Reports: Export analysis to visually appealing and shareable HTML reports
Performance Optimized: Fast analysis even on large datasets
Correlation Analysis: Calculate Pearson, Spearman, and Cramér’s V correlations between variables
Flexible Configuration: Customize analysis thresholds and options via the Settings class
Installation
pip install pydata-visualizer
Quick Start
import pandas as pd
from data_visualizer.profiler import AnalysisReport, Settings
# Load your dataset
df = pd.read_csv("your_dataset.csv")
# Create a report with default settings
report = AnalysisReport(df)
report.to_html("report.html")
Advanced Usage
Customizing Analysis Settings
from data_visualizer.profiler import AnalysisReport, Settings
# Configure analysis settings
report_settings = Settings(
minimal=False, # Set to True for faster, minimal analysis
top_n_values=5, # Show top 5 values in categorical columns
skewness_threshold=2.0, # Tolerance for skewness alerts
outlier_method='iqr', # Outlier detection method: 'iqr' or 'zscore'
outlier_threshold=1.5, # IQR multiplier for outlier detection
duplicate_threshold=5.0, # Percentage threshold for duplicate alerts
text_analysis=True # Enable word frequency analysis for text columns
)
# Create report with custom settings
report = AnalysisReport(df, settings=report_settings)
# Perform analysis and get results dictionary
results = report.analyse()
# Generate HTML report
report.to_html("custom_report.html")
Report Structure
The generated report includes:
Overview: Dataset dimensions, missing values, duplicate rows, and duplicate percentage
Variable Analysis: Detailed per-column statistics and visualizations including:
Distribution plots for numeric data
Bar charts for categorical data
Word clouds and frequency analysis for text data
Outlier detection and highlighting
Sample Data: Head and tail samples of the dataset
Correlations: Correlation matrices and heatmaps (Pearson, Spearman, Cramér’s V)
Data Quality Alerts: Automated detection of data quality issues
API Reference
AnalysisReport Class
class AnalysisReport:
def __init__(self, data, settings=None):
"""
Initialize the analysis report object.
Parameters:
-----------
data : pandas.DataFrame
The dataset to analyze
settings : Settings, optional
Configuration settings for the analysis
"""
def analyse(self):
"""
Perform the data analysis.
Returns:
--------
dict
A dictionary containing all analysis results
"""
def to_html(self, filename="report.html"):
"""
Generate an HTML report from the analysis.
Parameters:
-----------
filename : str, optional
Path to save the HTML report (default: "report.html")
"""
Settings Class
class Settings(pydantic.BaseModel):
"""
Settings for the analysis report.
Attributes:
-----------
minimal : bool, default=False
Whether to perform minimal analysis (skips type-specific analysis and visualizations)
top_n_values : int, default=10
Number of top values to show for categorical columns (must be >= 1)
skewness_threshold : float, default=1.0
Threshold for skewness alerts (must be >= 0.0)
outlier_method : str, default='iqr'
Outlier detection method: 'iqr' (Interquartile Range) or 'zscore'
outlier_threshold : float, default=1.5
IQR multiplier for outlier detection (must be >= 0.0)
Standard: 1.5 for moderate outliers, 3.0 for extreme outliers
duplicate_threshold : float, default=5.0
Percentage of duplicate rows to trigger an alert (must be >= 0.0)
text_analysis : bool, default=True
Enable word frequency analysis and word cloud generation for text columns
"""
Type Analyzers
The library automatically detects and applies the appropriate analysis for different data types:
Numeric (Integer/Float): Statistical measures (mean, std, quartiles), distribution plots, skewness, kurtosis, outlier detection
Categorical/Object: Value counts, cardinality analysis, frequency distributions, top N values
String: Unique value counts, cardinality, top N values, word frequency analysis, word cloud generation
Boolean: Value counts and proportions
Generic: Basic analysis for unrecognized types
Correlation Analysis
Three correlation methods are calculated when applicable:
Pearson: Linear correlation between numerical variables (range: -1 to 1)
Spearman: Rank correlation capturing monotonic relationships (range: -1 to 1)
Cramér’s V: Measure of association between categorical variables (range: 0 to 1)
Data Quality Alerts
The library automatically detects potential issues in your data:
High Missing Values: Columns with more than 20% missing data
Skewness: Distributions exceeding the configured skewness threshold
Outliers: Data points detected using IQR or Z-score methods
High Duplicates: Duplicate rows exceeding the configured threshold percentage
Contributing
Contributions are welcome! Please feel free to submit a Pull Request.
License
This project is licensed under the MIT License - see the LICENSE file for details.
Credits
Created by Aditya Deshmukh (adideshmukh2005@gmail.com)