# Pydata-visualizer

[![PyPI version](https://img.shields.io/pypi/v/pydata-visualizer.svg)](https://pypi.org/project/pydata-visualizer/)
[![Python versions](https://img.shields.io/pypi/pyversions/pydata-visualizer.svg)](https://pypi.org/project/pydata-visualizer/)
[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)

A powerful and intuitive Python library for exploratory data analysis and data profiling. Pydata-visualizer automatically analyzes your dataset, generates interactive visualizations, and provides detailed statistical insights with minimal code.

## Features

- Comprehensive Data Profiling: Analyze numerical, categorical, boolean, and string data types
- Automated Data Quality Checks: Detect missing values, outliers, skewed distributions, duplicate rows, and more
- Interactive Visualizations: Generate distribution plots, correlation heatmaps, word clouds, and statistical charts using Plotly or Seaborn
- Dual Rendering Modes: Choose between interactive Plotly charts or static Seaborn/Matplotlib visualizations
- Text Analysis: Automatic word frequency analysis and word cloud generation for text columns
- Rich HTML Reports: Export analysis to visually appealing and shareable HTML reports with interactive or static charts
- Performance Optimized: A `minimal` mode skips type-specific analysis and visualizations for faster runs on large datasets
- Correlation Analysis: Calculate Pearson, Spearman, and Cramér's V correlations between variables
- Flexible Configuration: Customize analysis thresholds and options via the Settings class

## Installation

```bash
pip install pydata-visualizer
```

## Quick Start

```python
import pandas as pd
from data_visualizer.profiler import AnalysisReport

# Load your dataset
df = pd.read_csv("your_dataset.csv")

# Create a report with default settings
report = AnalysisReport(df)
report.to_html("report.html")
```

## Advanced Usage

### Customizing Analysis Settings

```python
import pandas as pd
from data_visualizer.profiler import AnalysisReport, Settings

# Load your dataset
df = pd.read_csv("your_dataset.csv")

# Configure analysis settings
report_settings = Settings(
    minimal=False,              # Set to True for faster, minimal analysis
    top_n_values=5,             # Show top 5 values in categorical columns
    skewness_threshold=2.0,     # Tolerance for skewness alerts
    outlier_method='iqr',       # Outlier detection method: 'iqr' or 'zscore'
    outlier_threshold=1.5,      # IQR multiplier for outlier detection
    duplicate_threshold=5.0,    # Percentage threshold for duplicate alerts
    text_analysis=True          # Enable word frequency analysis for text columns
)

# Create report with custom settings
report = AnalysisReport(df, settings=report_settings)

# Perform analysis and get results dictionary
results = report.analyse()

# Generate HTML report
report.to_html("custom_report.html")
```

### Report Structure

The generated report includes:

- **Overview**: Dataset dimensions, missing values, duplicate rows, and duplicate percentage
- **Variable Analysis**: Detailed per-column statistics and visualizations including:
  - Distribution plots for numeric data
  - Bar charts for categorical data
  - Word clouds and frequency analysis for text data
  - Outlier detection and highlighting
- **Sample Data**: Head and tail samples of the dataset
- **Correlations**: Correlation matrices and heatmaps (Pearson, Spearman, Cramér's V)
- **Data Quality Alerts**: Automated detection of data quality issues

## API Reference

### `AnalysisReport` Class

```python
class AnalysisReport:
    def __init__(self, data, settings=None):
        """
        Initialize the analysis report object.
        
        Parameters:
        -----------
        data : pandas.DataFrame
            The dataset to analyze
        settings : Settings, optional
            Configuration settings for the analysis
        """
        
    def analyse(self):
        """
        Perform the data analysis.
        
        Returns:
        --------
        dict
            A dictionary containing all analysis results
        """
        
    def to_html(self, filename="report.html"):
        """
        Generate an HTML report from the analysis.
        
        Parameters:
        -----------
        filename : str, optional
            Path to save the HTML report (default: "report.html")
        """
```

### `Settings` Class

```python
class Settings(pydantic.BaseModel):
    """
    Settings for the analysis report.
    
    Attributes:
    -----------
    minimal : bool, default=False
        Whether to perform minimal analysis (skips type-specific analysis and visualizations)
    
    top_n_values : int, default=10
        Number of top values to show for categorical columns (must be >= 1)
    
    skewness_threshold : float, default=1.0
        Threshold for skewness alerts (must be >= 0.0)
    
    outlier_method : str, default='iqr'
        Outlier detection method: 'iqr' (Interquartile Range) or 'zscore'
    
    outlier_threshold : float, default=1.5
        IQR multiplier for outlier detection (must be >= 0.0)
        Common values: 1.5 flags moderate outliers, 3.0 flags only extreme ones
    
    duplicate_threshold : float, default=5.0
        Percentage of duplicate rows to trigger an alert (must be >= 0.0)
    
    text_analysis : bool, default=True
        Enable word frequency analysis and word cloud generation for text columns
    """
```

## Type Analyzers

The library automatically detects and applies the appropriate analysis for different data types:

- **Numeric (Integer/Float)**: Statistical measures (mean, std, quartiles), distribution plots, skewness, kurtosis, outlier detection
- **Categorical/Object**: Value counts, cardinality analysis, frequency distributions, top N values
- **String**: Unique value counts, cardinality, top N values, word frequency analysis, word cloud generation
- **Boolean**: Value counts and proportions
- **Generic**: Basic analysis for unrecognized types
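
To make the categorical and string analyses above more concrete, here is a minimal sketch of "top N values" and word frequency counting in plain Python. The function names `top_n_values` and `word_frequencies` and the tokenization rule are illustrative assumptions, not the library's internals, which may differ.

```python
import re
from collections import Counter


def top_n_values(values, n=10):
    """Return the n most frequent non-missing values with their counts."""
    counts = Counter(v for v in values if v is not None)
    return counts.most_common(n)


def word_frequencies(texts, n=10):
    """Count lowercase word tokens across all non-missing strings."""
    counts = Counter()
    for text in texts:
        if text is None:
            continue
        # Assumed tokenization: runs of letters, digits, and apostrophes.
        counts.update(re.findall(r"[a-z0-9']+", text.lower()))
    return counts.most_common(n)


colors = ["red", "blue", "red", None, "green", "red", "blue"]
reviews = ["Great product, great price", "Poor packaging", None, "Great value"]

print(top_n_values(colors, n=2))      # [('red', 3), ('blue', 2)]
print(word_frequencies(reviews, n=1))  # [('great', 3)]
```

The `n` argument plays the same role as the `top_n_values` setting: it caps how many of the most frequent entries are reported.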

## Correlation Analysis

Three correlation methods are calculated when applicable:

- **Pearson**: Linear correlation between numerical variables (range: -1 to 1)
- **Spearman**: Rank correlation capturing monotonic relationships (range: -1 to 1)
- **Cramér's V**: Measure of association between categorical variables (range: 0 to 1)
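
For readers who want to see what each coefficient measures, the sketch below computes all three in plain Python. It is an illustration under assumptions: the helper names are hypothetical, Cramér's V is shown without bias correction, and the library's own implementation may differ.

```python
import math
from collections import Counter


def pearson(x, y):
    """Pearson correlation: covariance scaled by both standard deviations."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def average_ranks(values):
    """Rank values from 1 to n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def cramers_v(a, b):
    """Cramér's V from the chi-squared statistic (no bias correction)."""
    n = len(a)
    joint = Counter(zip(a, b))
    count_a, count_b = Counter(a), Counter(b)
    k = min(len(count_a), len(count_b)) - 1
    if k == 0:
        return float("nan")  # undefined when a variable has one category
    chi2 = 0.0
    for va, na in count_a.items():
        for vb, nb in count_b.items():
            expected = na * nb / n
            chi2 += (joint[(va, vb)] - expected) ** 2 / expected
    return math.sqrt(chi2 / (n * k))


x = [1, 2, 3, 4, 5]
y = [1, 4, 9, 16, 25]  # monotonic but not linear in x

print(pearson(x, y))   # slightly below 1: the relationship is not linear
print(spearman(x, y))  # 1.0: the relationship is perfectly monotonic
print(cramers_v(["a", "a", "b", "b"], ["x", "x", "y", "y"]))  # 1.0
```

The `x` and `y` pair shows why both Pearson and Spearman are reported: a curved but strictly increasing relationship gets a perfect rank correlation while the linear correlation stays below 1.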

## Data Quality Alerts

The library automatically detects potential issues in your data:

- **High Missing Values**: Columns with more than 20% missing data
- **Skewness**: Distributions exceeding the configured skewness threshold
- **Outliers**: Data points detected using IQR or Z-score methods
- **High Duplicates**: Duplicate rows exceeding the configured threshold percentage
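
The checks above can be sketched in plain Python as follows. This is an approximation under stated assumptions, not the library's code: the function names are hypothetical, skewness uses the population formula (pandas uses a bias-adjusted variant), duplicates are counted as repeats of an earlier row, and applying `outlier_threshold` to Z-scores is a guess.

```python
import statistics


def missing_percent(values):
    """Percentage of entries that are None."""
    return 100.0 * sum(v is None for v in values) / len(values)


def duplicate_percent(rows):
    """Percentage of rows that repeat an earlier row."""
    seen, duplicates = set(), 0
    for row in rows:
        key = tuple(row)
        if key in seen:
            duplicates += 1
        seen.add(key)
    return 100.0 * duplicates / len(rows)


def skewness(values):
    """Population skewness: third central moment over variance ** 1.5."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5


def iqr_outliers(values, threshold=1.5):
    """Values further than threshold * IQR outside the quartiles."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low, high = q1 - threshold * iqr, q3 + threshold * iqr
    return [v for v in values if v < low or v > high]


def zscore_outliers(values, threshold=3.0):
    """Values more than `threshold` standard deviations from the mean."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    return [v for v in values if abs(v - mean) / std > threshold]


values = [1, 2, 3, 4, 5, 100]

print(missing_percent([1, None, 3, None]) > 20)  # True: would raise an alert
print(abs(skewness(values)) > 1.0)               # True: exceeds a 1.0 threshold
print(iqr_outliers(values))                      # [100]
print(zscore_outliers(values, threshold=3.0))    # []
```

On this small sample the IQR rule flags `100` while the Z-score rule at 3.0 does not, because the extreme value inflates the standard deviation it is measured against. That difference is one reason to offer both methods.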

## Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

## License

This project is licensed under the MIT License - see the LICENSE file for details.

## Credits

Created by Aditya Deshmukh (adideshmukh2005@gmail.com)

GitHub: [https://github.com/Adi-Deshmukh/Pydata-visualizer](https://github.com/Adi-Deshmukh/Pydata-visualizer)