# Pydata-visualizer User Guide

This guide provides a step-by-step introduction to using the Pydata-visualizer library for data analysis and profiling.

## Table of Contents

1. [Installation](#1-installation)
2. [Basic Usage](#2-basic-usage)
3. [Understanding the Report](#3-understanding-the-report)
4. [Advanced Configuration](#4-advanced-configuration)
5. [Working with Large Datasets](#5-working-with-large-datasets)
6. [Common Use Cases](#6-common-use-cases)
7. [Customization Options](#7-customization-options)

## 1. Installation

### Using pip

```bash
pip install pydata-visualizer
```

### Verifying Installation

You can verify that the installation succeeded by importing the library:

```python
from data_visualizer.profiler import AnalysisReport
print("Pydata-visualizer is installed successfully!")
```

## 2. Basic Usage

### Step 1: Import the library

```python
import pandas as pd
from data_visualizer.profiler import AnalysisReport
```

### Step 2: Load your data

```python
# Load data from a CSV file
df = pd.read_csv("your_dataset.csv")

# Or from any other source supported by pandas
# df = pd.read_excel("your_data.xlsx")
# df = pd.read_sql_query("SELECT * FROM your_table", connection)
```

### Step 3: Create an analysis report

```python
# Initialize the report
report = AnalysisReport(df)
```

### Step 4: Generate the HTML report

```python
# Create HTML report
report.to_html("output_report.html")
```

### Step 5: Open the HTML report

Open the generated HTML file in your web browser to view the complete data profile.

## 3. Understanding the Report

The HTML report contains several sections:

### Overview Section

- **Dataset Overview**: Shows basic dataset information
  - Number of rows
  - Number of columns
  - Number of duplicate rows with percentage
  - Duplicate row indices (list of row positions)
  - Duplicate samples (first 5 duplicate row groups shown)
  - Missing values count and percentage (across the entire dataset)
  - Dataset-level alerts (e.g., a high duplicate rate when it exceeds the configured threshold)

### Variables Section

For each column in your dataset:

- **Data type**: Detected data type (pandas dtype)
- **Missing values**: Count and percentage
- **Type-specific statistics**:
  - For numeric: min, max, mean, median, std, quartiles (25%, 50%, 75%), skewness, kurtosis, outlier detection (count, percentage, and indices)
  - For categorical: unique values count, most frequent value, cardinality (High if >50 unique values, Low otherwise), top N value counts (configurable via the `top_n_values` setting)
  - For string/text: unique values count, most frequent value, cardinality (High/Low), top N value counts, word frequency analysis (when `text_analysis` is enabled)
  - For boolean: value counts and proportions
- **Visualizations**:
  - Distribution histograms with KDE for numeric data, with outliers highlighted in red (when outliers are detected)
  - Bar charts for categorical data showing the top 10 most frequent values
  - Word clouds for text data (when `text_analysis` is enabled), plus bar charts for value distribution
  - For Plotly mode: interactive charts with zoom, pan, and hover tooltips
  - For Seaborn mode: static, publication-ready images
- **Alerts**: Warnings about potential data issues
  - Missing values alert when >20% of data is missing
  - Outliers alert showing count and percentage
  - Skewness alert when absolute skewness exceeds the configured threshold

### Sample Data

- Shows the first 10 rows (head) of your dataset in HTML table format
- Shows the last 10 rows (tail) of your dataset in HTML table format

### Correlations

- **Pearson correlation**: For linear relationships between numerical variables (range: -1 to +1)
- **Spearman correlation**: For monotonic relationships between numerical variables (range: -1 to +1)
- **Cramér's V**: For relationships between categorical variables (range: 0 to 1)
- **Heatmaps**: Visual representation of all correlation matrices (when `include_correlations_plots` is True)
- **JSON data**: Raw correlation matrices in JSON format (when `include_correlations_json` is True)

## 4. Advanced Configuration

You can customize the analysis with the `Settings` class:

```python
from data_visualizer.profiler import AnalysisReport, Settings

# Create custom settings
settings = Settings(
    minimal=False,                    # Full analysis with all features
    top_n_values=5,                   # Show top 5 values in categorical columns
    skewness_threshold=2.0,           # Alert threshold for skewness
    outlier_method='iqr',             # Outlier detection method: 'iqr' or 'zscore'
    outlier_threshold=1.5,            # IQR multiplier for outlier detection
    duplicate_threshold=5.0,          # Alert if duplicates exceed 5% of dataset
    text_analysis=True,               # Enable word frequency and word cloud for text
    use_plotly=False,                 # Use static Seaborn/Matplotlib plots
    include_plots=True,               # Include visualizations
    include_correlations=True,        # Include correlation analysis
    include_correlations_plots=True,  # Include correlation heatmaps
    include_correlations_json=False,  # Don't include raw correlation JSON
    include_alerts=True,              # Include data quality alerts
    include_sample_data=True,         # Include head/tail samples
    include_overview=True             # Include overview statistics
)

# Apply settings to report
report = AnalysisReport(df, settings=settings)

# Generate the report
report.to_html("custom_report.html")
```

### Settings Options

- **minimal** (bool): If True, performs minimal analysis (faster, skips visualizations and type-specific analysis). Default: False
- **top_n_values** (int): Number of top values to show for categorical variables (must be >= 1). Default: 10
- **skewness_threshold** (float): Threshold for flagging skewed distributions (must be >= 0.0). Default: 1.0
- **outlier_method** (str): Method for outlier detection, either 'iqr' (Interquartile Range) or 'zscore'. Default: 'iqr'
- **outlier_threshold** (float): IQR multiplier for outlier detection (must be >= 0.0). Default: 1.5 (use 3.0 to flag extreme outliers only)
- **duplicate_threshold** (float): Percentage of duplicate rows that triggers an alert (must be >= 0.0). Default: 5.0
- **text_analysis** (bool): Enable word frequency analysis and word cloud generation for text columns. Default: True
- **use_plotly** (bool): Use Plotly for interactive visualizations instead of static Seaborn/Matplotlib plots. Default: False
- **include_plots** (bool): Include visualizations/plots in the analysis. Default: True
- **include_correlations** (bool): Include correlation analysis. Default: True
- **include_correlations_plots** (bool): Include correlation heatmaps. Default: True
- **include_correlations_json** (bool): Include correlation data in JSON format. Default: False
- **include_alerts** (bool): Include data quality alerts (column and dataset-level). Default: True
- **include_sample_data** (bool): Include head/tail data samples (first and last 10 rows). Default: True
- **include_overview** (bool): Include dataset overview statistics. Default: True

## 5. Working with Large Datasets

For large datasets, consider these approaches:

### Use minimal analysis

```python
settings = Settings(
    minimal=True,        # Skip type-specific analysis and visualizations
    include_plots=False  # Also disable plots for fastest processing
)
report = AnalysisReport(large_df, settings=settings)
```

### Sample your data

```python
# Sample 10,000 rows randomly
sampled_df = large_df.sample(10000, random_state=42)
report = AnalysisReport(sampled_df)
```

### Analyze specific columns only

```python
# Select only specific columns for analysis
subset_df = large_df[['important_column_1', 'important_column_2', 'important_column_3']]
report = AnalysisReport(subset_df)
```

## 6. Common Use Cases

### Exploratory Data Analysis (EDA)

```python
import pandas as pd
from data_visualizer.profiler import AnalysisReport

# Load your dataset
df = pd.read_csv("new_dataset.csv")

# Generate comprehensive EDA report
report = AnalysisReport(df)
report.to_html("eda_report.html")
```

### Data Quality Assessment

```python
import pandas as pd
from data_visualizer.profiler import AnalysisReport, Settings

# Load your dataset
df = pd.read_csv("dataset_to_check.csv")

# Set explicit thresholds for the data quality alerts
settings = Settings(
    skewness_threshold=1.5,   # Alert when absolute skewness exceeds 1.5 (default: 1.0)
    duplicate_threshold=3.0,  # Lower than the 5.0 default, so duplicates are flagged sooner
    outlier_threshold=1.5,    # Standard IQR multiplier
    include_alerts=True       # Ensure alerts are included
)

# Generate data quality report
report = AnalysisReport(df, settings=settings)
report.to_html("quality_report.html")
```

### Correlation Discovery

```python
import pandas as pd
from data_visualizer.profiler import AnalysisReport, Settings

# Load your dataset
df = pd.read_csv("features.csv")

# Enable correlation JSON output to access correlation data programmatically
settings = Settings(include_correlations_json=True)

# Generate report with focus on correlations
report = AnalysisReport(df, settings=settings)
results = report.analyse()

# Access correlation matrices programmatically
pearson_corr = results['Correlations_JSON']['pearson']
spearman_corr = results['Correlations_JSON']['spearman']

# Find strongly correlated features (absolute correlation > 0.7).
# Each pair appears twice, once as (a, b) and once as (b, a).
strong_correlations = [(col1, col2) for col1 in pearson_corr
                       for col2 in pearson_corr
                       if col1 != col2 and abs(pearson_corr[col1][col2]) > 0.7]

print("Strongly correlated features:", strong_correlations)

# Generate complete report (will include correlation heatmaps)
report.to_html("correlations_report.html")
```

## 7. Customization Options

### Accessing Analysis Results Programmatically

You can access and manipulate the analysis results directly:

```python
import pandas as pd
from data_visualizer.profiler import AnalysisReport

# Load your dataset
df = pd.read_csv("your_data.csv")

# Run analysis
report = AnalysisReport(df)
results = report.analyse()

# Access specific components
overview = results['overview']
column_stats = results['variables']

# Print summary information
print(f"Dataset has {overview['num_Row']} rows and {overview['num_Columns']} columns")
print(f"Missing values: {overview['missing_percentage']:.2f}%")

# Check specific column statistics
if 'age' in column_stats:
    age_stats = column_stats['age']
    print(f"Age statistics: Min={age_stats.get('min')}, Max={age_stats.get('max')}, Mean={age_stats.get('mean')}")
```

### Customizing Report Output

You can write the HTML report to a specific location:

```python
# Generate report to specific path
report.to_html("/path/to/reports/analysis_report.html")
```

### Integrating with Other Tools

```python
import pandas as pd
from data_visualizer.profiler import AnalysisReport
import webbrowser
import os

# Load your dataset
df = pd.read_csv("your_data.csv")

# Create and generate report
report_path = "analysis_report.html"
report = AnalysisReport(df)
report.to_html(report_path)

# Automatically open the report in the default browser
webbrowser.open('file://' + os.path.abspath(report_path))
```

## Conclusion

This guide covered the essential aspects of using Pydata-visualizer for data analysis and profiling. For more detailed information, refer to the [full documentation](https://github.com/Adi-Deshmukh/Pydata-visualizer) or explore the source code.
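
As a supplementary note on the `outlier_method='iqr'` and `outlier_threshold` settings from section 4, the sketch below shows what an IQR multiplier means in practice. It is an illustration of the general IQR rule using only the Python standard library, not the library's own implementation. The function name `iqr_outlier_indices` is hypothetical, and the quartile convention used here (`method="inclusive"`) is an assumption, so Pydata-visualizer may report slightly different bounds on small samples.

```python
import statistics


def iqr_outlier_indices(values, multiplier=1.5):
    """Return positions of values outside [Q1 - k*IQR, Q3 + k*IQR].

    Hypothetical helper for illustration only; `multiplier` plays the
    role that the guide describes for `outlier_threshold`.
    """
    # Quartiles with linear interpolation between data points (assumed convention)
    q1, _median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower = q1 - multiplier * iqr
    upper = q3 + multiplier * iqr
    return [i for i, v in enumerate(values) if v < lower or v > upper]


sample = [10, 12, 11, 13, 12, 11, 100]
print(iqr_outlier_indices(sample))                  # default multiplier of 1.5
print(iqr_outlier_indices(sample, multiplier=3.0))  # stricter: extreme outliers only
```

A larger multiplier widens the accepted range, which is why the settings list suggests 3.0 when only extreme outliers should be flagged.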