scikit-learn

Machine learning library for Python. Provides a wide range of algorithms for classification, regression, clustering, and dimensionality reduction. Features a simple, consistent API and excellent documentation, making it well suited to educational use. The standard implementation of traditional ML methods.

Python, Machine Learning, Data Science, Classification, Regression, Clustering

GitHub Overview

scikit-learn/scikit-learn

scikit-learn: machine learning in Python

Stars: 59,842
Watchers: 2,876
Forks: 25,123
Created: August 17, 2010
Language: Python
License: BSD 3-Clause License

Topics

scikit-learn, machine-learning, python, data-science, classification, regression, clustering, statistics


Framework

scikit-learn

Overview

scikit-learn is the de facto standard machine learning library for Python.

Details

scikit-learn is an open-source machine learning library for Python that began development in 2007, providing simple and efficient tools for data mining and data analysis. It implements a rich collection of classical machine learning algorithms for classification, regression, clustering, and dimensionality reduction behind a consistent, user-friendly API. Through its integration with NumPy, SciPy, and matplotlib, it serves as a core component of Python's data science ecosystem. Widely used in both education and industry, it is relied on by everyone from machine learning beginners to experts. It supports the entire machine learning workflow, including preprocessing, feature selection, model selection, and evaluation, covering everything from prototyping to small- and medium-scale production systems. With comprehensive documentation, abundant sample code, and an active community, it has become the standard choice for machine learning education.
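
The consistent API described above means that virtually every estimator follows the same fit/predict/score pattern. A minimal sketch of that pattern (the choice of LogisticRegression and the built-in iris dataset here is purely illustrative):

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

# Load a small built-in dataset and split it
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Every estimator exposes the same fit/predict/score interface
clf = LogisticRegression(max_iter=1000)
clf.fit(X_train, y_train)
print(f"Predictions: {clf.predict(X_test[:5])}")
print(f"Test accuracy: {clf.score(X_test, y_test):.4f}")

The same three calls work unchanged if the estimator is swapped for, say, a RandomForestClassifier, which is what makes model comparison loops like the ones below so compact.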

Pros and Cons

Pros

  • Consistent API: Unified interface across all algorithms
  • Rich Algorithms: Comprehensive implementation of classification, regression, clustering, and dimensionality reduction
  • Excellent Documentation: Detailed explanations and abundant sample code
  • Stability: Mature, well-tested codebase refined over more than a decade of development
  • Easy to Learn: Approachable design that makes it a natural starting point for machine learning beginners
  • Pipeline Features: Consistent processing from preprocessing to model evaluation
  • Active Community: Large user base and continuous development

Cons

  • Limited Deep Learning Support: Only basic multilayer perceptrons; dedicated frameworks are needed for deep learning
  • Scalability: Performance limitations with large datasets
  • No GPU Support: GPU acceleration not supported as standard
  • Online Learning: Incremental (real-time) learning is supported only by a subset of estimators via partial_fit (see the sketch after this list)
  • Flexibility: Difficult to implement custom algorithms
  • Memory Usage: Memory efficiency can be challenging with large datasets
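
As noted above, incremental learning is limited but not absent: a subset of estimators (e.g. SGDClassifier, MultinomialNB, MiniBatchKMeans) expose partial_fit for batch-wise updates. A minimal sketch, assuming data arrives in batches rather than as a continuous stream:

import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier

# Simulate a data stream as 5 batches of 200 samples
X, y = make_classification(n_samples=1000, n_features=20, random_state=42)
batches = np.array_split(np.arange(len(X)), 5)

clf = SGDClassifier(random_state=42)
classes = np.unique(y)  # all classes must be declared on the first partial_fit call

for idx in batches:
    # Update the model incrementally, one batch at a time
    clf.partial_fit(X[idx], y[idx], classes=classes)

print(f"Accuracy on all data after incremental training: {clf.score(X, y):.4f}")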


Code Examples

Hello World

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report

# Check scikit-learn version
import sklearn
print(f"scikit-learn version: {sklearn.__version__}")

# Create sample data
from sklearn.datasets import make_classification
X, y = make_classification(
    n_samples=1000, 
    n_features=20, 
    n_informative=15, 
    n_redundant=5, 
    random_state=42
)

# Split data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Create and train model
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

# Predict and evaluate
y_pred = model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)

print(f"Accuracy: {accuracy:.4f}")
print("\nDetailed Report:")
print(classification_report(y_test, y_pred))

Data Preprocessing

import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler, LabelEncoder, OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline

# Create sample data
data = {
    'age': [25, 30, 35, np.nan, 45],
    'income': [50000, 60000, np.nan, 80000, 90000],
    'education': ['High School', 'College', 'Graduate', 'High School', 'College'],
    'city': ['Tokyo', 'Osaka', 'Tokyo', 'Nagoya', 'Osaka']
}
df = pd.DataFrame(data)
print("Original data:")
print(df)

# Separate numeric and categorical columns
numeric_features = ['age', 'income']
categorical_features = ['education', 'city']

# Numeric data preprocessing pipeline
numeric_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())
])

# Categorical data preprocessing pipeline
categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='constant', fill_value='unknown')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))
])

# Combine preprocessing pipelines
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numeric_features),
        ('cat', categorical_transformer, categorical_features)
    ]
)

# Execute preprocessing
X_processed = preprocessor.fit_transform(df)
print(f"\nProcessed data shape: {X_processed.shape}")
print(f"Feature names: {preprocessor.get_feature_names_out()}")

Classification Tasks

import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Load dataset
data = load_breast_cancer()
X, y = data.data, data.target

# Split data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Compare multiple models
models = {
    'Logistic Regression': LogisticRegression(random_state=42, max_iter=1000),  # higher max_iter avoids convergence warnings on unscaled features
    'SVM': SVC(random_state=42, probability=True),
    'Random Forest': RandomForestClassifier(n_estimators=100, random_state=42),
    'Gradient Boosting': GradientBoostingClassifier(random_state=42)
}

results = {}

for name, model in models.items():
    # Train
    model.fit(X_train, y_train)
    
    # Predict
    y_pred = model.predict(X_test)
    
    # Calculate metrics
    accuracy = accuracy_score(y_test, y_pred)
    precision = precision_score(y_test, y_pred)
    recall = recall_score(y_test, y_pred)
    f1 = f1_score(y_test, y_pred)
    
    # Cross-validation
    cv_scores = cross_val_score(model, X_train, y_train, cv=5)
    
    results[name] = {
        'Accuracy': accuracy,
        'Precision': precision,
        'Recall': recall,
        'F1-Score': f1,
        'CV Mean': cv_scores.mean(),
        'CV Std': cv_scores.std()
    }

# Display results
results_df = pd.DataFrame(results).T
print("Model comparison results:")
print(results_df.round(4))

Regression Tasks

import numpy as np
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score, mean_absolute_error
from sklearn.preprocessing import StandardScaler

# Create regression sample data
X, y = make_regression(
    n_samples=500, 
    n_features=10, 
    n_informative=5, 
    noise=0.1, 
    random_state=42
)

# Split data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Standardize features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Define regression models
models = {
    'Linear Regression': LinearRegression(),
    'Ridge': Ridge(alpha=1.0),
    'Lasso': Lasso(alpha=1.0),
    'Random Forest': RandomForestRegressor(n_estimators=100, random_state=42)
}

results = {}

for name, model in models.items():
    # Use standardized data for linear models
    if name in ['Linear Regression', 'Ridge', 'Lasso']:
        model.fit(X_train_scaled, y_train)
        y_pred = model.predict(X_test_scaled)
    else:
        model.fit(X_train, y_train)
        y_pred = model.predict(X_test)
    
    # Calculate metrics
    mse = mean_squared_error(y_test, y_pred)
    rmse = np.sqrt(mse)
    mae = mean_absolute_error(y_test, y_pred)
    r2 = r2_score(y_test, y_pred)
    
    results[name] = {
        'MSE': mse,
        'RMSE': rmse,
        'MAE': mae,
        'R2 Score': r2
    }

# Display results
results_df = pd.DataFrame(results).T
print("Regression model comparison results:")
print(results_df.round(4))

Clustering

import pandas as pd
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans, DBSCAN, AgglomerativeClustering
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import silhouette_score, adjusted_rand_score
import matplotlib.pyplot as plt

# Generate clustering data
X, y_true = make_blobs(
    n_samples=300, 
    centers=4, 
    cluster_std=0.60, 
    random_state=0
)

# Standardize data
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Apply clustering algorithms
clustering_algorithms = {
    'K-Means': KMeans(n_clusters=4, n_init=10, random_state=42),
    'DBSCAN': DBSCAN(eps=0.3, min_samples=5),
    'Agglomerative': AgglomerativeClustering(n_clusters=4)
}

results = {}

plt.figure(figsize=(15, 5))

for i, (name, algorithm) in enumerate(clustering_algorithms.items()):
    # Execute clustering
    cluster_labels = algorithm.fit_predict(X_scaled)
    
    # Calculate silhouette score
    if len(set(cluster_labels)) > 1:  # Only if number of clusters > 1
        silhouette_avg = silhouette_score(X_scaled, cluster_labels)
        ari = adjusted_rand_score(y_true, cluster_labels)
    else:
        silhouette_avg = -1
        ari = -1
    
    results[name] = {
        'Silhouette Score': silhouette_avg,
        'Adjusted Rand Index': ari,
        'Number of Clusters': len(set(cluster_labels)) - (1 if -1 in cluster_labels else 0)  # exclude DBSCAN noise label (-1)
    }
    
    # Visualization
    plt.subplot(1, 3, i + 1)
    plt.scatter(X[:, 0], X[:, 1], c=cluster_labels, cmap='viridis', alpha=0.6)
    plt.title(f'{name}\nSilhouette: {silhouette_avg:.3f}')
    plt.xlabel('Feature 1')
    plt.ylabel('Feature 2')

plt.tight_layout()
plt.show()

# Display results
results_df = pd.DataFrame(results).T
print("Clustering results:")
print(results_df.round(4))

Pipeline and Grid Search

from sklearn.datasets import load_digits
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report

# Load dataset
digits = load_digits()
X, y = digits.data, digits.target

# Split data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# SVM pipeline
svm_pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('classifier', SVC())
])

# Random Forest pipeline
rf_pipeline = Pipeline([
    ('classifier', RandomForestClassifier())
])

# Grid search parameters
svm_params = {
    'classifier__C': [0.1, 1, 10],
    'classifier__gamma': ['scale', 'auto'],
    'classifier__kernel': ['rbf', 'linear']
}

rf_params = {
    'classifier__n_estimators': [50, 100, 200],
    'classifier__max_depth': [10, 20, None],
    'classifier__min_samples_split': [2, 5, 10]
}

# Execute grid search
print("SVM grid search in progress...")
svm_grid = GridSearchCV(
    svm_pipeline, 
    svm_params, 
    cv=5, 
    scoring='accuracy',
    n_jobs=-1
)
svm_grid.fit(X_train, y_train)

print("Random Forest grid search in progress...")
rf_grid = GridSearchCV(
    rf_pipeline, 
    rf_params, 
    cv=5, 
    scoring='accuracy',
    n_jobs=-1
)
rf_grid.fit(X_train, y_train)

# Compare results
models = {
    'SVM': svm_grid,
    'Random Forest': rf_grid
}

for name, model in models.items():
    y_pred = model.predict(X_test)
    print(f"\n{name} Results:")
    print(f"Best parameters: {model.best_params_}")
    print(f"Cross-validation score: {model.best_score_:.4f}")
    print(f"Test score: {model.score(X_test, y_test):.4f}")

Model Saving and Loading

import joblib
import pickle
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline

# Prepare data
X, y = make_classification(n_samples=1000, n_features=20, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Create and train pipeline
pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('classifier', RandomForestClassifier(n_estimators=100, random_state=42))
])

pipeline.fit(X_train, y_train)
original_score = pipeline.score(X_test, y_test)
print(f"Original model accuracy: {original_score:.4f}")

# 1. Save/load using joblib (recommended)
joblib.dump(pipeline, 'model_pipeline.joblib')
loaded_pipeline_joblib = joblib.load('model_pipeline.joblib')
joblib_score = loaded_pipeline_joblib.score(X_test, y_test)
print(f"Accuracy after joblib loading: {joblib_score:.4f}")

# 2. Save/load using pickle
with open('model_pipeline.pkl', 'wb') as f:
    pickle.dump(pipeline, f)

with open('model_pipeline.pkl', 'rb') as f:
    loaded_pipeline_pickle = pickle.load(f)

pickle_score = loaded_pipeline_pickle.score(X_test, y_test)
print(f"Accuracy after pickle loading: {pickle_score:.4f}")

# 3. Save model metadata
model_info = {
    'model_type': 'RandomForestClassifier',
    'parameters': pipeline.named_steps['classifier'].get_params(),
    'feature_names': [f'feature_{i}' for i in range(X.shape[1])],
    'classes': pipeline.classes_,
    'accuracy': original_score
}

# Save metadata
joblib.dump(model_info, 'model_metadata.joblib')
loaded_metadata = joblib.load('model_metadata.joblib')

print("\nModel metadata:")
for key, value in loaded_metadata.items():
    print(f"{key}: {value}")

# Predict on new data
new_data = np.random.randn(5, 20)
predictions = loaded_pipeline_joblib.predict(new_data)
probabilities = loaded_pipeline_joblib.predict_proba(new_data)

print(f"\nPredictions on new data: {predictions}")
print(f"Prediction probabilities: {probabilities}")