Pydata-visualizer User Guide

This guide provides a step-by-step introduction to using the Pydata-visualizer library for data analysis and profiling.

Table of Contents

  1. Installation

  2. Basic Usage

  3. Understanding the Report

  4. Advanced Configuration

  5. Working with Large Datasets

  6. Common Use Cases

  7. Customization Options

1. Installation

Using pip

pip install pydata-visualizer

Verifying Installation

To verify that the installation succeeded, import the library. Note that the import name (data_visualizer) differs from the package name used with pip (pydata-visualizer):

from data_visualizer.profiler import AnalysisReport
print("Pydata-visualizer is installed successfully!")
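
If you also want to know which version was installed, the standard library can report it without importing the package. This is a minimal sketch: it assumes the distribution name is the one used in the pip command above, and installed_version is a helper name introduced here for illustration.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(dist_name: str):
    """Return the installed version string of a distribution, or None if absent."""
    try:
        return version(dist_name)
    except PackageNotFoundError:
        return None


# Assumption: the distribution name is the one used with pip above.
found = installed_version("pydata-visualizer")
print(found if found is not None else "pydata-visualizer is not installed")
```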

2. Basic Usage

Step 1: Import the library

import pandas as pd
from data_visualizer.profiler import AnalysisReport

Step 2: Load your data

# Load data from a CSV file
df = pd.read_csv("your_dataset.csv")

# Or from any other source supported by pandas
# df = pd.read_excel("your_data.xlsx")
# df = pd.read_sql_query("SELECT * FROM your_table", connection)
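
If you do not have a file at hand, a small in-memory DataFrame is enough to try the steps that follow. The sketch below builds one with numeric, categorical and boolean columns, two missing values and one duplicated row, so that most report sections have something to show. The column names and values are made up for illustration.

```python
import pandas as pd

# Made-up demo data: mixed dtypes, one missing value in each numeric column,
# and a fully duplicated row (the last row repeats the first).
df = pd.DataFrame(
    {
        "age": [34, 45, None, 23, 34],
        "income": [52000.0, 61000.0, 58000.0, None, 52000.0],
        "city": ["Paris", "Berlin", "Paris", "Madrid", "Paris"],
        "is_member": [True, False, True, False, True],
    }
)

print(df.dtypes)
print(df.shape)
```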

Step 3: Create an analysis report

# Initialize the report
report = AnalysisReport(df)

Step 4: Generate the HTML report

# Create HTML report
report.to_html("output_report.html")

Step 5: Open the HTML report

Open the generated HTML file in your web browser to view the complete data profile.

3. Understanding the Report

The HTML report contains several sections:

Overview Section

  • Dataset Overview: Shows basic dataset information

    • Number of rows

    • Number of columns

    • Number of duplicate rows with percentage

    • Duplicate row indices (list of row positions)

    • Duplicate samples (first 5 duplicate row groups shown)

    • Missing values count and percentage (across entire dataset)

    • Dataset-level alerts (e.g., a high duplicate rate when the percentage of duplicate rows exceeds duplicate_threshold)
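
The overview figures can be cross-checked with plain pandas. The sketch below is one way to compute them and is not necessarily how the library does it internally. In particular, whether the library counts the first occurrence of a repeated row as a duplicate is an assumption here (this version does not). overview_numbers is a hypothetical helper name.

```python
import pandas as pd


def overview_numbers(df: pd.DataFrame) -> dict:
    """Compute row/column counts plus duplicate and missing-value shares."""
    n_rows, n_cols = df.shape
    # duplicated() marks every repeat of an earlier row, not the first copy.
    dup_mask = df.duplicated()
    n_dup = int(dup_mask.sum())
    n_missing = int(df.isna().sum().sum())
    n_cells = n_rows * n_cols
    return {
        "rows": n_rows,
        "columns": n_cols,
        "duplicate_rows": n_dup,
        "duplicate_pct": 100.0 * n_dup / n_rows if n_rows else 0.0,
        "duplicate_indices": df.index[dup_mask].tolist(),
        "missing_cells": n_missing,
        "missing_pct": 100.0 * n_missing / n_cells if n_cells else 0.0,
    }


demo = pd.DataFrame({"a": [1, 1, 2, None], "b": ["x", "x", "y", "z"]})
print(overview_numbers(demo))
```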

Variables Section

For each column in your dataset:

  • Data type: Detected data type (pandas dtype)

  • Missing values: Count and percentage

  • Type-specific statistics:

    • For numeric: min, max, mean, median, std, quartiles (25%, 50%, 75%), skewness, kurtosis, outlier detection (count, percentage, and indices)

    • For categorical: unique values count, most frequent value, cardinality (High if >50 unique values, Low otherwise), top N value counts (configurable via the top_n_values setting)

    • For string/text: unique values count, most frequent value, cardinality (High/Low), top N value counts, word frequency analysis (when text_analysis is enabled)

    • For boolean: value counts and proportions

  • Visualizations:

    • Distribution histograms with KDE for numeric data, with outliers highlighted in red (when outliers are detected)

    • Bar charts for categorical data showing the 10 most frequent values

    • Word clouds for text data (when text_analysis is enabled), plus bar charts for value distribution

    • In Plotly mode (use_plotly=True): interactive charts with zoom, pan, and hover tooltips

    • In Seaborn mode (use_plotly=False, the default): static, publication-ready images

  • Alerts: Warnings about potential data issues

    • Missing values alert when >20% of data is missing

    • Outliers alert showing count and percentage

    • Skewness alert when absolute skewness exceeds configured threshold
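
To make the column alerts concrete, here is a sketch of the missing-value and skewness checks in pandas. It uses the 20% missing-value limit described above and the sample skewness returned by Series.skew. The library's own estimator and message format may differ, and column_alerts and the demo data are illustrative.

```python
import pandas as pd


def column_alerts(col: pd.Series, skewness_threshold: float = 1.0) -> list:
    """Collect alert messages for one column."""
    alerts = []
    missing_pct = 100.0 * col.isna().mean() if len(col) else 0.0
    if missing_pct > 20.0:
        alerts.append(f"missing values: {missing_pct:.1f}%")
    is_numeric = pd.api.types.is_numeric_dtype(col)
    if is_numeric and not pd.api.types.is_bool_dtype(col):
        skew = col.skew()
        # skew is NaN for very short or constant columns; skip those.
        if pd.notna(skew) and abs(skew) > skewness_threshold:
            alerts.append(f"skewness: {skew:.2f}")
    return alerts


demo = pd.DataFrame(
    {
        "price": [10, 12, 11, 13, 12, 11, 10, 95],
        "note": ["ok", None, None, "late", None, "ok", "ok", "ok"],
    }
)
for name in demo.columns:
    print(name, column_alerts(demo[name]))
```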

Sample Data

  • Shows the first 10 rows (head) of your dataset in HTML table format

  • Shows the last 10 rows (tail) of your dataset in HTML table format

Correlations

  • Pearson correlation: For linear relationships between numerical variables (range: -1 to +1)

  • Spearman correlation: For monotonic relationships between numerical variables (range: -1 to +1)

  • Cramér’s V: For relationships between categorical variables (range: 0 to 1)

  • Heatmaps: Visual representation of all correlation matrices (when include_correlations_plots is True)

  • JSON data: Raw correlation matrices in JSON format (when include_correlations_json is True)
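
The three measures can be reproduced on a small scale with pandas and NumPy. Pearson and Spearman come directly from DataFrame.corr. Cramér's V is computed here from the chi-squared statistic of a contingency table, without the bias correction that some implementations apply, so values may differ slightly from those in the report. cramers_v and the demo data are illustrative.

```python
import numpy as np
import pandas as pd


def cramers_v(x: pd.Series, y: pd.Series) -> float:
    """Uncorrected Cramér's V between two categorical series (0 to 1)."""
    observed = pd.crosstab(x, y).to_numpy(dtype=float)
    n = observed.sum()
    min_dim = min(observed.shape) - 1
    if n == 0 or min_dim == 0:
        return 0.0  # V is undefined for a constant column; report 0 here
    expected = np.outer(observed.sum(axis=1), observed.sum(axis=0)) / n
    chi2 = ((observed - expected) ** 2 / expected).sum()
    return float(np.sqrt(chi2 / (n * min_dim)))


df = pd.DataFrame(
    {
        "x": [1, 2, 3, 4, 5, 6],
        "y": [1, 4, 9, 16, 25, 36],  # y = x**2: monotonic but not linear
        "colour": ["red", "red", "blue", "blue", "green", "green"],
        "size": ["S", "S", "M", "M", "L", "L"],
    }
)

numeric = df[["x", "y"]]
pearson_xy = numeric.corr(method="pearson").loc["x", "y"]    # high, but below 1
spearman_xy = numeric.corr(method="spearman").loc["x", "y"]  # ranks agree exactly
v_colour_size = cramers_v(df["colour"], df["size"])          # perfectly associated
print(pearson_xy, spearman_xy, v_colour_size)
```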

4. Advanced Configuration

You can customize the analysis with the Settings class:

from data_visualizer.profiler import AnalysisReport, Settings

# Create custom settings
settings = Settings(
    minimal=False,                      # Full analysis with all features
    top_n_values=5,                     # Show top 5 values in categorical columns
    skewness_threshold=2.0,             # Alert threshold for skewness
    outlier_method='iqr',               # Outlier detection method: 'iqr' or 'zscore'
    outlier_threshold=1.5,              # IQR multiplier for outlier detection
    duplicate_threshold=5.0,            # Alert if duplicates exceed 5% of dataset
    text_analysis=True,                 # Enable word frequency and word cloud for text
    use_plotly=False,                   # Use static Seaborn/Matplotlib plots
    include_plots=True,                 # Include visualizations
    include_correlations=True,          # Include correlation analysis
    include_correlations_plots=True,    # Include correlation heatmaps
    include_correlations_json=False,    # Don't include raw correlation JSON
    include_alerts=True,                # Include data quality alerts
    include_sample_data=True,           # Include head/tail samples
    include_overview=True               # Include overview statistics
)

# Apply settings to report
report = AnalysisReport(df, settings=settings)

# Generate the report
report.to_html("custom_report.html")

Settings Options

  • minimal (bool): If True, performs minimal analysis (faster, skips visualizations and type-specific analysis). Default: False

  • top_n_values (int): Number of top values to show for categorical variables (must be >= 1). Default: 10

  • skewness_threshold (float): Threshold for flagging skewed distributions (must be >= 0.0). Default: 1.0

  • outlier_method (str): Method for outlier detection - 'iqr' (Interquartile Range) or 'zscore'. Default: 'iqr'

  • outlier_threshold (float): IQR multiplier for outlier detection (must be >= 0.0). Default: 1.5 (use 3.0 for extreme outliers only)

  • duplicate_threshold (float): Percentage of duplicate rows to trigger an alert (must be >= 0.0). Default: 5.0

  • text_analysis (bool): Enable word frequency analysis and word cloud generation for text columns. Default: True

  • use_plotly (bool): Use Plotly for interactive visualizations instead of Seaborn/Matplotlib static plots. Default: False

  • include_plots (bool): Include visualizations/plots in the analysis. Default: True

  • include_correlations (bool): Include correlation analysis. Default: True

  • include_correlations_plots (bool): Include correlation heatmaps. Default: True

  • include_correlations_json (bool): Include correlation data in JSON format. Default: False

  • include_alerts (bool): Include data quality alerts (column and dataset-level). Default: True

  • include_sample_data (bool): Include head/tail data samples (first and last 10 rows). Default: True

  • include_overview (bool): Include dataset overview statistics. Default: True
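
The two outlier_method values correspond to two familiar rules, sketched below with NumPy. This assumes the usual definitions: IQR fences at a multiple of the interquartile range, and a z-score cutoff on the distance from the mean in standard deviations. How the library interprets outlier_threshold when outlier_method='zscore' is not stated in this guide, so the cutoff of 3.0 used here is an assumption, as are the function names.

```python
import numpy as np


def iqr_mask(values: np.ndarray, multiplier: float = 1.5) -> np.ndarray:
    """True where a value lies beyond the IQR fences."""
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    return (values < q1 - multiplier * iqr) | (values > q3 + multiplier * iqr)


def zscore_mask(values: np.ndarray, cutoff: float = 3.0) -> np.ndarray:
    """True where a value is more than `cutoff` standard deviations from the mean."""
    std = values.std()
    if std == 0:
        return np.zeros(values.shape, dtype=bool)
    return np.abs((values - values.mean()) / std) > cutoff


data = np.array([10.0, 12.0, 11.0, 13.0, 12.0, 11.0, 10.0, 95.0])
print("IQR rule flags:    ", data[iqr_mask(data)])
print("z-score rule flags:", data[zscore_mask(data)])
```

On this tiny sample the IQR rule flags 95 while the z-score rule at 3.0 flags nothing, because a single extreme value inflates the standard deviation it is measured against. Quartile-based fences are less affected by the outliers they are trying to find.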

5. Working with Large Datasets

For large datasets, consider these approaches:

Use minimal analysis

settings = Settings(
    minimal=True,             # Skip type-specific analysis and visualizations
    include_plots=False       # Also disable plots for fastest processing
)
report = AnalysisReport(large_df, settings=settings)

Sample your data

# Sample up to 10,000 rows randomly (capped so that smaller datasets do not raise an error)
sampled_df = large_df.sample(n=min(10_000, len(large_df)), random_state=42)
report = AnalysisReport(sampled_df)
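
When the file itself is too large to load in one go, sampling can happen while reading. The sketch below uses the chunksize option of pandas.read_csv to keep a random fraction of each chunk. The helper name sample_csv, the fraction and the chunk size are illustrative choices, and an in-memory CSV stands in for a real file.

```python
import io

import pandas as pd


def sample_csv(source, fraction: float, chunksize: int = 100_000, seed: int = 42) -> pd.DataFrame:
    """Read a CSV in chunks and keep roughly `fraction` of the rows of each chunk."""
    parts = []
    for i, chunk in enumerate(pd.read_csv(source, chunksize=chunksize)):
        # Vary the seed per chunk so chunks do not all pick the same positions.
        parts.append(chunk.sample(frac=fraction, random_state=seed + i))
    return pd.concat(parts, ignore_index=True)


# Stand-in for a large file: 1,000 rows of generated CSV text.
csv_text = "id,value\n" + "\n".join(f"{i},{i * 2}" for i in range(1000))
sampled_df = sample_csv(io.StringIO(csv_text), fraction=0.1, chunksize=250)
print(len(sampled_df))
```

The resulting sampled_df can be passed to AnalysisReport in the same way as any other DataFrame.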

Analyze specific columns only

# Select only specific columns for analysis
subset_df = large_df[['important_column_1', 'important_column_2', 'important_column_3']]
report = AnalysisReport(subset_df)
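
Columns can also be chosen by type rather than by name. This sketch uses DataFrame.select_dtypes, and additionally drops columns that are almost entirely unique (such as identifiers), which rarely benefit from profiling. The 0.95 uniqueness cutoff and the helper name drop_id_like are arbitrary illustrative choices.

```python
import pandas as pd


def drop_id_like(df: pd.DataFrame, max_unique_ratio: float = 0.95) -> pd.DataFrame:
    """Drop columns whose share of distinct values exceeds `max_unique_ratio`."""
    if df.empty:
        return df
    ratios = df.nunique() / len(df)
    return df.loc[:, ratios <= max_unique_ratio]


large_df = pd.DataFrame(
    {
        "row_id": range(100),
        "score": [i % 7 for i in range(100)],
        "group": [("a", "b", "c")[i % 3] for i in range(100)],
    }
)

numeric_df = large_df.select_dtypes(include="number")  # keeps row_id and score
trimmed_df = drop_id_like(large_df)                    # keeps score and group
print(list(numeric_df.columns), list(trimmed_df.columns))
```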

6. Common Use Cases

Exploratory Data Analysis (EDA)

import pandas as pd
from data_visualizer.profiler import AnalysisReport

# Load your dataset
df = pd.read_csv("new_dataset.csv")

# Generate comprehensive EDA report
report = AnalysisReport(df)
report.to_html("eda_report.html")

Data Quality Assessment

import pandas as pd
from data_visualizer.profiler import AnalysisReport, Settings

# Load your dataset
df = pd.read_csv("dataset_to_check.csv")

# Set explicit alert thresholds for the quality check
settings = Settings(
    skewness_threshold=1.5,             # Flag columns with |skewness| above 1.5 (default: 1.0)
    duplicate_threshold=3.0,            # Flag duplicate rates above 3% (default: 5.0)
    outlier_threshold=1.5,              # Standard IQR multiplier (the default)
    include_alerts=True                 # Ensure alerts are included (the default)
)

# Generate data quality report
report = AnalysisReport(df, settings=settings)
report.to_html("quality_report.html")

Correlation Discovery

import pandas as pd
from data_visualizer.profiler import AnalysisReport, Settings

# Load your dataset
df = pd.read_csv("features.csv")

# Enable correlation JSON output to access correlation data programmatically
settings = Settings(include_correlations_json=True)

# Generate report with focus on correlations
report = AnalysisReport(df, settings=settings)
results = report.analyse()

# Access correlation matrices programmatically
pearson_corr = results['Correlations_JSON']['pearson']
spearman_corr = results['Correlations_JSON']['spearman']

# Find strongly correlated feature pairs (absolute correlation > 0.7),
# listing each pair once rather than as both (a, b) and (b, a)
columns = list(pearson_corr)
strong_correlations = [(col1, col2)
                       for i, col1 in enumerate(columns)
                       for col2 in columns[i + 1:]
                       if abs(pearson_corr[col1][col2]) > 0.7]

print("Strongly correlated features:", strong_correlations)

# Generate complete report (will include correlation heatmaps)
report.to_html("correlations_report.html")
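
A ranked list is often easier to read than an unordered one. The sketch below sorts unique column pairs by absolute correlation. It assumes, as the example above does, that each matrix is a nested mapping of the form {column: {column: value}}, and it skips entries that are missing or NaN. rank_pairs and the sample matrix are illustrative.

```python
import math
from itertools import combinations


def rank_pairs(corr: dict, min_abs: float = 0.0) -> list:
    """Return (col_a, col_b, value) tuples sorted by descending |value|."""
    ranked = []
    for a, b in combinations(corr, 2):
        value = corr.get(a, {}).get(b)
        if value is None or math.isnan(value):
            continue  # undefined correlation, e.g. a constant column
        if abs(value) >= min_abs:
            ranked.append((a, b, value))
    return sorted(ranked, key=lambda item: abs(item[2]), reverse=True)


# Hand-written sample standing in for results['Correlations_JSON']['pearson'].
sample = {
    "height": {"height": 1.0, "weight": 0.82, "age": -0.10},
    "weight": {"height": 0.82, "weight": 1.0, "age": float("nan")},
    "age": {"height": -0.10, "weight": float("nan"), "age": 1.0},
}
for a, b, value in rank_pairs(sample, min_abs=0.05):
    print(f"{a} ~ {b}: {value:+.2f}")
```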

7. Customization Options

Accessing Analysis Results Programmatically

You can access and manipulate the analysis results directly:

import pandas as pd
from data_visualizer.profiler import AnalysisReport

# Load your dataset
df = pd.read_csv("your_data.csv")

# Run analysis
report = AnalysisReport(df)
results = report.analyse()

# Access specific components
overview = results['overview']
column_stats = results['variables']

# Print summary information
print(f"Dataset has {overview['num_Row']} rows and {overview['num_Columns']} columns")
print(f"Missing values: {overview['missing_percentage']:.2f}%")

# Check specific column statistics
if 'age' in column_stats:
    age_stats = column_stats['age']
    print(f"Age statistics: Min={age_stats.get('min')}, Max={age_stats.get('max')}, Mean={age_stats.get('mean')}")
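
Because the results are returned as a dictionary, they can be saved for later comparison. Whether every value in that dictionary is JSON-serialisable is not stated in this guide, so the sketch below falls back to .tolist(), .item() or str() for objects that the json module rejects (NumPy arrays and scalars expose the first two). dumps_results is a hypothetical helper, and the sample dictionary only mimics the shape used above.

```python
import json


def _fallback(obj):
    """Convert objects that json cannot handle natively."""
    for attr in ("tolist", "item"):
        method = getattr(obj, attr, None)
        if callable(method):
            return method()
    return str(obj)


def dumps_results(results: dict) -> str:
    """Serialise an analysis-results dictionary to a JSON string."""
    return json.dumps(results, default=_fallback, indent=2)


class FakeScalar:
    """Stand-in for a NumPy scalar so the example needs no extra packages."""

    def item(self):
        return 3.5


sample_results = {"overview": {"num_Row": 100, "missing_percentage": FakeScalar()}}
print(dumps_results(sample_results))
```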

Customizing Report Output

You can generate the HTML report to a specific location:

# Generate report to specific path
report.to_html("/path/to/reports/analysis_report.html")
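
Whether to_html creates missing directories is not stated in this guide, so it is safer to create them first. The sketch below builds a timestamped output path with pathlib. prepare_report_path, the directory names and the file-name pattern are illustrative.

```python
import tempfile
from datetime import datetime
from pathlib import Path


def prepare_report_path(directory, stem: str = "analysis_report") -> Path:
    """Create `directory` if needed and return a timestamped .html path inside it."""
    out_dir = Path(directory)
    out_dir.mkdir(parents=True, exist_ok=True)
    stamp = datetime.now().strftime("%Y%m%d_%H%M%S")
    return out_dir / f"{stem}_{stamp}.html"


# A temporary directory keeps the demonstration self-contained.
base = Path(tempfile.mkdtemp())
report_path = prepare_report_path(base / "reports" / "weekly")
print(report_path)
# The result can then be passed on, e.g. report.to_html(str(report_path)).
```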

Integrating with Other Tools

import pandas as pd
from data_visualizer.profiler import AnalysisReport
import webbrowser
from pathlib import Path

# Load your dataset
df = pd.read_csv("your_data.csv")

# Create and generate report
report_path = "analysis_report.html"
report = AnalysisReport(df)
report.to_html(report_path)

# Automatically open the report in the default browser.
# as_uri() builds a well-formed file:// URL on every platform, including Windows.
webbrowser.open(Path(report_path).resolve().as_uri())

Conclusion

This guide covered the essential aspects of using Pydata-visualizer for data analysis and profiling. For more detailed information, refer to the full documentation or explore the source code.