# Pydata-visualizer [![PyPI version](https://img.shields.io/pypi/v/pydata-visualizer.svg)](https://pypi.org/project/pydata-visualizer/) [![Python versions](https://img.shields.io/pypi/pyversions/pydata-visualizer.svg)](https://pypi.org/project/pydata-visualizer/) [![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT) A powerful and intuitive Python library for exploratory data analysis and data profiling. Pydata-visualizer automatically analyzes your dataset, generates interactive visualizations, and provides detailed statistical insights with minimal code. ## Features - Comprehensive Data Profiling: Analyze numerical, categorical, boolean, and string data types - Automated Data Quality Checks: Detect missing values, outliers, skewed distributions, duplicate rows, and more - Interactive Visualizations: Generate distribution plots, correlation heatmaps, word clouds, and statistical charts using Plotly or Seaborn - Dual Rendering Modes: Choose between interactive Plotly charts or static Seaborn/Matplotlib visualizations - Text Analysis: Automatic word frequency analysis and word cloud generation for text columns - Rich HTML Reports: Export analysis to visually appealing and shareable HTML reports with interactive or static charts - Performance Optimized: Fast analysis even on large datasets - Correlation Analysis: Calculate Pearson, Spearman, and Cramér's V correlations between variables - Flexible Configuration: Customize analysis thresholds and options via the Settings class ## Installation ```bash pip install pydata-visualizer ``` ## Quick Start ```python import pandas as pd from data_visualizer.profiler import AnalysisReport, Settings # Load your dataset df = pd.read_csv("your_dataset.csv") # Create a report with default settings report = AnalysisReport(df) report.to_html("report.html") ``` ## Advanced Usage ### Customizing Analysis Settings ```python from data_visualizer.profiler import AnalysisReport, Settings # Configure analysis settings report_settings = Settings( minimal=False, # Set to True for faster, minimal analysis top_n_values=5, # Show top 5 values in categorical columns skewness_threshold=2.0, # Tolerance for skewness alerts outlier_method='iqr', # Outlier detection method: 'iqr' or 'zscore' outlier_threshold=1.5, # IQR multiplier for outlier detection duplicate_threshold=5.0, # Percentage threshold for duplicate alerts text_analysis=True # Enable word frequency analysis for text columns ) # Create report with custom settings report = AnalysisReport(df, settings=report_settings) # Perform analysis and get results dictionary results = report.analyse() # Generate HTML report report.to_html("custom_report.html") ``` ### Report Structure The generated report includes: - **Overview**: Dataset dimensions, missing values, duplicate rows, and duplicate percentage - **Variable Analysis**: Detailed per-column statistics and visualizations including: - Distribution plots for numeric data - Bar charts for categorical data - Word clouds and frequency analysis for text data - Outlier detection and highlighting - **Sample Data**: Head and tail samples of the dataset - **Correlations**: Correlation matrices and heatmaps (Pearson, Spearman, Cramér's V) - **Data Quality Alerts**: Automated detection of data quality issues ## API Reference ### `AnalysisReport` Class ```python class AnalysisReport: def __init__(self, data, settings=None): """ Initialize the analysis report object. Parameters: ----------- data : pandas.DataFrame The dataset to analyze settings : Settings, optional Configuration settings for the analysis """ def analyse(self): """ Perform the data analysis. Returns: -------- dict A dictionary containing all analysis results """ def to_html(self, filename="report.html"): """ Generate an HTML report from the analysis. Parameters: ----------- filename : str, optional Path to save the HTML report (default: "report.html") """ ``` ### `Settings` Class ```python class Settings(pydantic.BaseModel): """ Settings for the analysis report. Attributes: ----------- minimal : bool, default=False Whether to perform minimal analysis (skips type-specific analysis and visualizations) top_n_values : int, default=10 Number of top values to show for categorical columns (must be >= 1) skewness_threshold : float, default=1.0 Threshold for skewness alerts (must be >= 0.0) outlier_method : str, default='iqr' Outlier detection method: 'iqr' (Interquartile Range) or 'zscore' outlier_threshold : float, default=1.5 IQR multiplier for outlier detection (must be >= 0.0) Standard: 1.5 for moderate outliers, 3.0 for extreme outliers duplicate_threshold : float, default=5.0 Percentage of duplicate rows to trigger an alert (must be >= 0.0) text_analysis : bool, default=True Enable word frequency analysis and word cloud generation for text columns """ ``` ## Type Analyzers The library automatically detects and applies the appropriate analysis for different data types: - **Numeric (Integer/Float)**: Statistical measures (mean, std, quartiles), distribution plots, skewness, kurtosis, outlier detection - **Categorical/Object**: Value counts, cardinality analysis, frequency distributions, top N values - **String**: Unique value counts, cardinality, top N values, word frequency analysis, word cloud generation - **Boolean**: Value counts and proportions - **Generic**: Basic analysis for unrecognized types ## Correlation Analysis Three correlation methods are calculated when applicable: - **Pearson**: Linear correlation between numerical variables (range: -1 to 1) - **Spearman**: Rank correlation capturing monotonic relationships (range: -1 to 1) - **Cramér's V**: Measure of association between categorical variables (range: 0 to 1) ## Data Quality Alerts The library automatically detects potential issues in your data: - **High Missing Values**: Columns with more than 20% missing data - **Skewness**: Distributions exceeding the configured skewness threshold - **Outliers**: Data points detected using IQR or Z-score methods - **High Duplicates**: Duplicate rows exceeding the configured threshold percentage ## Contributing Contributions are welcome! Please feel free to submit a Pull Request. ## License This project is licensed under the MIT License - see the LICENSE file for details. ## Credits Created by Aditya Deshmukh (adideshmukh2005@gmail.com) GitHub: [https://github.com/Adi-Deshmukh/Pydata-visualizer](https://github.com/Adi-Deshmukh/Pydata-visualizer)