Pydata-visualizer Documentation
Introduction
Pydata-visualizer is a Python library designed for exploratory data analysis and data profiling. It automatically analyzes datasets, providing detailed statistical insights, visualizations, and interactive HTML reports with minimal code requirements.
The library aims to streamline the initial data exploration phase of the data science workflow, helping users quickly understand their data’s structure, distribution, and quality.
Installation
Requirements
Python 3.8 or higher
Core dependencies: pandas, numpy, matplotlib, seaborn, scipy, jinja2
Install from PyPI
pip install pydata-visualizer
Install from Source
git clone https://github.com/Adi-Deshmukh/Pydata-visualizer.git
cd Pydata-visualizer
pip install -e .
Basic Usage
import pandas as pd
from data_visualizer.profiler import AnalysisReport
# Load your dataset
df = pd.read_csv("your_dataset.csv")
# Create and generate a report
report = AnalysisReport(df)
report.to_html("report.html")
This simple code will:
Analyze all columns in your dataset
Generate appropriate visualizations for each column
Calculate correlations between columns
Create an interactive HTML report
Core Components
Main Modules
profiler.py: Core analysis functionality, including the AnalysisReport and Settings classes
visualizer.py: Creates visualizations for different data types
type_analyzers.py: Type-specific analysis functions
correlations.py: Correlation calculation between variables
alerts.py: Data quality alert generation
report.py: HTML report generation
Data Flow
Data Input: User provides pandas DataFrame
Type Detection: Library detects column data types
Analysis: Type-specific analysis is performed
Visualization: Appropriate plots are generated
Correlation: Relationships between variables are calculated
Report Generation: Results are compiled into HTML report
Settings Configuration
The Settings class allows customization of the analysis process:
from data_visualizer.profiler import Settings, AnalysisReport
# Custom settings
settings = Settings(
    minimal=False,            # Full analysis (True for faster, minimal analysis)
    top_n_values=5,           # Show top 5 values in categorical columns
    skewness_threshold=2.0,   # Alert threshold for skewness
    outlier_method='iqr',     # Outlier detection method: 'iqr' or 'zscore'
    outlier_threshold=1.5,    # IQR multiplier for outlier detection
    duplicate_threshold=5.0,  # Alert if duplicates exceed 5% of dataset
    text_analysis=True        # Enable word frequency analysis for text columns
)
# Create report with custom settings
report = AnalysisReport(df, settings=settings)
report.to_html("custom_report.html")
Available Settings
| Parameter | Type | Default | Description |
|---|---|---|---|
| minimal | bool | False | If True, performs basic analysis only (skips visualizations) |
| top_n_values | int | 10 | Number of top values to show in categorical analysis (>= 1) |
| skewness_threshold | float | 1.0 | Threshold for skewness alerts (>= 0.0) |
| outlier_method | str | 'iqr' | Outlier detection method: 'iqr' or 'zscore' |
| outlier_threshold | float | 1.5 | IQR multiplier for outlier detection (>= 0.0) |
| duplicate_threshold | float | 5.0 | Percentage of duplicates to trigger alert (>= 0.0) |
| text_analysis | bool | True | Enable word frequency analysis and word clouds for text |
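The constraints listed in the table can be illustrated with a small stand-in class. This is only a sketch: the library's Settings is a pydantic model, whereas the hypothetical SettingsSketch below uses a standard-library dataclass, and the exact error behaviour of the real class (including whether it rejects unknown outlier_method values) is an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SettingsSketch:
    """Hypothetical stand-in mirroring the documented defaults and bounds."""

    minimal: bool = False
    top_n_values: int = 10
    skewness_threshold: float = 1.0
    outlier_method: str = "iqr"
    outlier_threshold: float = 1.5
    duplicate_threshold: float = 5.0
    text_analysis: bool = True

    def __post_init__(self):
        # Bounds taken from the "Available Settings" table.
        if self.top_n_values < 1:
            raise ValueError("top_n_values must be >= 1")
        if self.outlier_method not in ("iqr", "zscore"):
            raise ValueError("outlier_method must be 'iqr' or 'zscore'")
        for name in ("skewness_threshold", "outlier_threshold", "duplicate_threshold"):
            if getattr(self, name) < 0.0:
                raise ValueError(f"{name} must be >= 0.0")


defaults = SettingsSketch()

# An out-of-range value is rejected at construction time.
try:
    SettingsSketch(top_n_values=0)
    rejected = False
except ValueError:
    rejected = True
```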
Analysis Methods
Overview Statistics
Row count
Column count
Duplicate rows count
Missing value count and percentage
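As a rough illustration of the overview statistics listed above, the sketch below computes them for a table held as a list of tuples, with None marking a missing cell. It is dependency-free and not the library's implementation: the function name overview_stats and every dictionary key except missing_percentage are assumptions, and duplicates are counted the way pandas' duplicated() does by default (each repeat of an earlier row).

```python
from collections import Counter


def overview_stats(rows, n_columns):
    """Summarize a table given as a list of equal-length tuples (None = missing)."""
    n_rows = len(rows)
    # A row counts as a duplicate each time it repeats an earlier row.
    duplicate_rows = sum(count - 1 for count in Counter(rows).values())
    missing_cells = sum(value is None for row in rows for value in row)
    total_cells = n_rows * n_columns
    return {
        "rows": n_rows,
        "columns": n_columns,
        "duplicate_rows": duplicate_rows,
        "duplicate_percentage": 100.0 * duplicate_rows / n_rows if n_rows else 0.0,
        "missing_cells": missing_cells,
        "missing_percentage": 100.0 * missing_cells / total_cells if total_cells else 0.0,
    }


rows = [(1, "a"), (2, "b"), (1, "a"), (3, None)]
stats = overview_stats(rows, n_columns=2)
```

With the default duplicate_threshold of 5.0, a duplicate percentage like the one in this toy table would be high enough to trigger a duplicate alert.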
Numeric Column Analysis
Standard statistics (min, max, mean, median, std, quartiles)
Skewness and kurtosis
Outlier detection using IQR or Z-score methods
Outlier count and percentage
Distribution histograms with KDE and outlier highlighting
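The two outlier rules named above can be sketched in a few lines of standard-library Python. This is an illustration of the general techniques, not the library's code: the function names are hypothetical, quartiles use linear interpolation (the "inclusive" method), and the z-score uses the population standard deviation, both of which are assumptions about details this document does not specify.

```python
import statistics


def iqr_outliers(values, multiplier=1.5):
    """Values outside [Q1 - multiplier*IQR, Q3 + multiplier*IQR]."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - multiplier * iqr, q3 + multiplier * iqr
    return [v for v in values if v < lower or v > upper]


def zscore_outliers(values, threshold=3.0):
    """Values whose absolute z-score exceeds the threshold."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)  # population standard deviation
    if std == 0:
        return []  # a constant column has no outliers
    return [v for v in values if abs(v - mean) / std > threshold]


values = [10, 12, 11, 13, 12, 11, 10, 95]
iqr_flagged = iqr_outliers(values)
z_flagged = zscore_outliers(values, threshold=2.0)
outlier_percentage = 100.0 * len(iqr_flagged) / len(values)
```

On this eight-value sample the extreme value has a z-score of about 2.6, so a z-score threshold of 3.0 would not flag it. The outlier itself inflates the standard deviation, which is one reason the IQR rule is often preferred for small or heavily skewed columns.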
Categorical Column Analysis
Unique value count
Most frequent value
Cardinality assessment (High/Low)
Top N value counts (configurable)
Bar charts for frequency distribution
String/Text Column Analysis
Unique value count
Most frequent value
Cardinality assessment (High/Low)
Top N value counts (configurable)
Word frequency analysis (when text_analysis is enabled)
Word cloud generation (when text_analysis is enabled)
Bar charts for value distribution
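Top-N value counts and word frequencies are both simple counting problems. The sketch below shows one way to compute them with collections.Counter; the function names and the tokenization rule (lowercase, letters, digits and apostrophes) are assumptions, and the library may tokenize or filter stop words differently.

```python
import re
from collections import Counter


def top_values(values, n=10):
    """Most frequent non-missing values, as (value, count) pairs."""
    return Counter(v for v in values if v is not None).most_common(n)


def word_frequencies(texts, n=10):
    """Most frequent lowercase words across all non-empty texts."""
    words = []
    for text in texts:
        if text:
            words.extend(re.findall(r"[a-z0-9']+", text.lower()))
    return Counter(words).most_common(n)


reviews = ["Great product", "great price", "Slow delivery", None, "Great product"]
top = top_values(reviews, n=2)
words = word_frequencies(reviews, n=1)
```

Value counts treat each cell as a whole ("Great product" and "great price" are different values), while word frequencies split cells into tokens, which is why the two views can rank things differently.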
Boolean Column Analysis
Value counts and proportions
Distribution visualization
Visualization Features
The library automatically generates appropriate visualizations based on data type:
Numeric columns: Histograms with kernel density estimation, outliers highlighted in red
Categorical columns: Bar charts for top values
Text columns: Word clouds showing word frequency and bar charts for value counts
Correlation matrices: Heatmap visualizations
Visualizations are embedded directly in the HTML report as base64-encoded images.
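Embedding an image as base64 means placing the encoded bytes in a data URI inside the img tag, so the report needs no external image files. A minimal sketch, assuming PNG output; the helper name png_to_img_tag is hypothetical, and in practice the bytes would come from a rendered figure (for example via matplotlib's fig.savefig(buffer, format="png") with an io.BytesIO buffer) rather than the placeholder used here.

```python
import base64


def png_to_img_tag(png_bytes, alt="plot"):
    """Wrap raw PNG bytes in an <img> tag that uses a base64 data URI."""
    encoded = base64.b64encode(png_bytes).decode("ascii")
    return f'<img src="data:image/png;base64,{encoded}" alt="{alt}">'


# Placeholder input: the 8-byte PNG file signature, not a complete image.
png_bytes = b"\x89PNG\r\n\x1a\n"
tag = png_to_img_tag(png_bytes, alt="age histogram")
```

The trade-off is file size: base64 encoding grows each image by roughly a third, so reports with many plots become larger single files.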
Correlation Analysis
Three correlation methods are calculated:
Pearson Correlation
Measures linear relationships between numerical variables
Values range from -1 (perfect negative) to +1 (perfect positive)
Visualized as a heatmap
Spearman Correlation
Measures monotonic relationships between numerical variables
Less sensitive to outliers than Pearson
Visualized as a heatmap
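The difference between the two coefficients is easiest to see on data that is monotonic but not linear. The sketch below implements both from their textbook definitions (Spearman as Pearson applied to average ranks); it is a dependency-free illustration with hypothetical function names, not the library's code.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def average_ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared_rank = (i + j) / 2 + 1  # mean of positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = shared_rank
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the rank-transformed data."""
    return pearson(average_ranks(x), average_ranks(y))


x = [1, 2, 3, 4, 5]
y = [1, 4, 9, 16, 100]  # strictly increasing, with one extreme value
r_pearson = pearson(x, y)
r_spearman = spearman(x, y)
```

Because y always increases with x, the Spearman coefficient is 1, while the extreme last value pulls the Pearson coefficient down to roughly 0.8.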
Cramér’s V
Measures association between categorical variables
Values range from 0 (no association) to 1 (perfect association)
Visualized as a heatmap
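Cramér's V is derived from the chi-squared statistic of the contingency table: V = sqrt(chi2 / (n * (min(r, c) - 1))), where n is the number of observations and r and c are the numbers of distinct values in each column. The sketch below follows that uncorrected formula in plain Python; whether the library applies a bias correction is not stated in this document, so treat the function as an illustration only.

```python
import math
from collections import Counter


def cramers_v(a, b):
    """Uncorrected Cramér's V for two equal-length categorical sequences."""
    n = len(a)
    a_counts, b_counts = Counter(a), Counter(b)
    min_dim = min(len(a_counts), len(b_counts)) - 1
    if n == 0 or min_dim <= 0:
        return 0.0  # empty or constant columns carry no association
    pair_counts = Counter(zip(a, b))
    chi2 = 0.0
    for a_value, a_total in a_counts.items():
        for b_value, b_total in b_counts.items():
            expected = a_total * b_total / n
            observed = pair_counts.get((a_value, b_value), 0)
            chi2 += (observed - expected) ** 2 / expected
    return math.sqrt(chi2 / (n * min_dim))


perfect = cramers_v(["x", "x", "y", "y"], ["p", "p", "q", "q"])
independent = cramers_v(["x", "x", "y", "y"], ["p", "q", "p", "q"])
```

In the first call each value of one column determines the other, giving the maximum of 1; in the second the columns are unrelated, giving 0.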
Data Quality Alerts
The library automatically detects potential issues in your data:
Missing Values: Warns when columns have significant missing data (>20%)
Skewness: Flags highly skewed distributions based on threshold
Alerts are displayed prominently in the HTML report with warning icons and explanatory messages.
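The two alert rules described above can be sketched as simple threshold checks. The code below is an assumption-laden illustration: the function names and message format are hypothetical, missing cells are represented by None, and skewness is computed as the population (Fisher-Pearson) coefficient m3 / m2^1.5, whereas pandas' skew() applies a sample-size correction and so returns somewhat different values on small samples.

```python
def missing_percentage(values):
    """Share of None entries, in percent."""
    return 100.0 * sum(v is None for v in values) / len(values)


def skewness(values):
    """Population (Fisher-Pearson) skewness: m3 / m2**1.5."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5 if m2 > 0 else 0.0


def column_alerts(name, values, missing_limit=20.0, skewness_threshold=1.0):
    """Return alert messages for one numeric column."""
    alerts = []
    pct = missing_percentage(values)
    if pct > missing_limit:
        alerts.append(f"{name}: {pct:.1f}% missing values")
    present = [v for v in values if v is not None]
    if len(present) >= 3:  # skewness is not meaningful on tiny samples
        skew = skewness(present)
        if abs(skew) > skewness_threshold:
            alerts.append(f"{name}: skewness {skew:.2f}")
    return alerts


alerts = column_alerts("income", [1, 1, 1, 1, 10, None, None])
```

Here two of seven values are missing (about 28.6%) and the remaining values have a skewness of 1.5, so both rules fire with the default limits.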
HTML Report Generation
The HTML report contains:
Overview panel: Dataset dimensions and summary
Variables panel: Detailed per-column analysis
Type information
Statistical measures
Visualizations
Data quality alerts
Sample Data: Head and tail sample rows
Correlations: Correlation matrices and heatmaps
The report is fully interactive with Bootstrap styling, allowing for:
Collapsible sections
Sortable tables
Interactive visualizations
API Reference
AnalysisReport
class AnalysisReport:
    """
    Main class for dataset analysis.
    """

    def __init__(self, data, settings=None):
        """
        Initialize the analysis report object.

        Parameters:
        -----------
        data : pandas.DataFrame
            The dataset to analyze
        settings : Settings, optional
            Configuration settings for the analysis
        """

    def analyse(self):
        """
        Perform the data analysis.

        Returns:
        --------
        dict
            A dictionary containing all analysis results with keys:
            - 'overview': Dataset statistics
            - 'variables': Per-column analysis
            - 'Sample_data': DataFrame head and tail
            - 'Correlations_Plots': Visualization of correlations
            - 'Correlations_JSON': Raw correlation data
        """

    def to_html(self, filename="report.html"):
        """
        Generate an HTML report from the analysis.

        Parameters:
        -----------
        filename : str, optional
            Path to save the HTML report (default: "report.html")
        """

    def _analyze_column(self, column_data, column_name):
        """
        Analyze a single column of data.

        Parameters:
        -----------
        column_data : pandas.Series
            The column data to analyze
        column_name : str
            The name of the column

        Returns:
        --------
        dict
            Dictionary of analysis results for the column
        """

    def _data_sample(self):
        """
        Create HTML samples of the dataset head and tail.

        Returns:
        --------
        dict
            Dictionary with HTML representations of head and tail
        """
Settings
class Settings(pydantic.BaseModel):
    """
    Settings for the analysis report.

    Attributes:
    -----------
    minimal : bool, default=False
        Whether to perform minimal analysis (skips type-specific analysis and visualizations)
    top_n_values : int, default=10
        Number of top values to show for categorical columns (must be >= 1)
    skewness_threshold : float, default=1.0
        Threshold for skewness alerts (must be >= 0.0)
    outlier_method : str, default='iqr'
        Outlier detection method: 'iqr' (Interquartile Range) or 'zscore'
    outlier_threshold : float, default=1.5
        IQR multiplier for outlier detection (must be >= 0.0)
        Standard: 1.5 for moderate outliers, 3.0 for extreme outliers
    duplicate_threshold : float, default=5.0
        Percentage of duplicate rows to trigger an alert (must be >= 0.0)
    text_analysis : bool, default=True
        Enable word frequency analysis and word cloud generation for text columns
    """
Examples
Basic Analysis
import pandas as pd
from data_visualizer.profiler import AnalysisReport
# Load dataset
df = pd.read_csv("customer_data.csv")
# Create and generate report
report = AnalysisReport(df)
report.to_html("customer_analysis.html")
Using Analysis Results Programmatically
import pandas as pd
from data_visualizer.profiler import AnalysisReport, Settings
# Load dataset
df = pd.read_csv("financial_data.csv")
# Configure settings for financial data
settings = Settings(skewness_threshold=3.0, top_n_values=3)
# Create report
report = AnalysisReport(df, settings=settings)
# Run analysis and get results dictionary
results = report.analyse()
# Access specific insights
overview = results['overview']
missing_percentage = overview['missing_percentage']
print(f"Dataset has {missing_percentage:.2f}% missing values")
# Check skewness of a specific column
column_stats = results['variables']['income']
if 'skewness' in column_stats:
    print(f"Income skewness: {column_stats['skewness']:.2f}")
# Generate report
report.to_html("financial_analysis.html")
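Once the results dictionary is available, ordinary dictionary code is enough to filter it. The helper below is a hypothetical example, not part of the library: it assumes that results['variables'] maps column names to per-column dictionaries and that numeric columns carry a 'skewness' entry, as the snippet above suggests. It is exercised on a hand-built stand-in dictionary so that it runs without the library installed.

```python
def skewed_columns(results, threshold=1.0):
    """Return {column: skewness} for columns whose |skewness| exceeds threshold."""
    flagged = {}
    for name, stats in results.get("variables", {}).items():
        skew = stats.get("skewness")
        if skew is not None and abs(skew) > threshold:
            flagged[name] = skew
    return flagged


# Hand-built stand-in for the dictionary returned by report.analyse();
# the values are made up for illustration.
sample_results = {
    "overview": {"missing_percentage": 2.5},
    "variables": {
        "income": {"skewness": 3.4},
        "age": {"skewness": 0.2},
        "city": {},  # non-numeric columns may carry no skewness entry
    },
}
flagged = skewed_columns(sample_results, threshold=1.0)
```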
Extending the Library
Registering Custom Type Analyzers
You can extend the library by registering custom analyzers for specific data types:
from data_visualizer.type_registry import register_analyzer
from visions.types import DateTime
@register_analyzer(DateTime)
def _analyze_datetime(report_object, column_data):
    """Custom analyzer for datetime columns"""
    datetime_stats = {
        'min_date': column_data.min(),
        'max_date': column_data.max(),
        'range_days': (column_data.max() - column_data.min()).days,
        'weekday_counts': column_data.dt.day_name().value_counts().to_dict()
    }
    return datetime_stats
Troubleshooting
Common Issues
Missing Dependencies
If you encounter import errors, ensure all dependencies are installed:
pip install "pydata-visualizer[complete]"
Memory Issues with Large Datasets
For large datasets, use the minimal setting:
from data_visualizer.profiler import AnalysisReport, Settings
# Memory-efficient settings
settings = Settings(minimal=True)
report = AnalysisReport(large_df, settings=settings)
Visualization Errors
If visualizations fail to generate, check the matplotlib backend:
import matplotlib
matplotlib.use('Agg') # Use non-interactive backend
Character Encoding Issues
For datasets with special characters, ensure proper encoding:
df = pd.read_csv("international_data.csv", encoding="utf-8")
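When the encoding of a file is unknown, one common approach is to try a short list of candidates and use the first that decodes without error. The helper below is a hypothetical sketch of that idea; the candidate list is an assumption, and a successful decode does not prove the guess is right (latin-1 in particular accepts any byte sequence, so it only makes sense as a last resort).

```python
def detect_encoding(raw_bytes, candidates=("utf-8", "cp1252", "latin-1")):
    """Return the first candidate encoding that decodes raw_bytes without error."""
    for encoding in candidates:
        try:
            raw_bytes.decode(encoding)
        except UnicodeDecodeError:
            continue
        return encoding
    raise ValueError("none of the candidate encodings could decode the data")


# Bytes as they would appear in a file saved with Windows-1252.
sample = "café,São Paulo\n".encode("cp1252")
encoding = detect_encoding(sample)
```

The detected name can then be passed on, for example pd.read_csv(path, encoding=encoding), after reading the file's bytes with open(path, "rb").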