MLJ.jl

Unified machine learning framework for Julia. Provides a wide range of algorithms for classification, regression, and clustering through a scikit-learn-like interface. Supports model evaluation, hyperparameter tuning, and pipeline construction.

Julia, machine learning, ML, data science, statistical learning, predictive modeling, algorithms, unified API

Overview

MLJ.jl (Machine Learning in Julia) is a comprehensive machine learning framework for the Julia language. It provides over 180 machine learning algorithms through a unified API, enabling consistent workflows from data preprocessing to model evaluation.

Details

MLJ.jl is an integrated platform for machine learning in the Julia language. Development was initiated at the Alan Turing Institute in 2019, with the goal of leveraging Julia's speed and expressiveness for machine learning applications.

Its key feature is a unified API through which models from many different libraries can be used with the same interface: over 180 models are available, drawn from packages such as the scikit-learn wrappers, XGBoost, LightGBM, GLMnet, and MLJFlux (deep learning). MLJ's scientific types catch data type mismatches before training, allowing the construction of robust machine learning pipelines, and automatic differentiation (in the Flux-based MLJFlux models) together with Julia's high performance make efficient learning possible even with large-scale data. Model composition, training, evaluation, and tuning run in one consistent workflow, supporting reproducible machine learning research. Through the Tables.jl interface MLJ integrates closely with DataFrames.jl, enabling seamless execution from data analysis to machine learning.
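
As a minimal sketch of this unified API, the model registry can be queried for models whose data requirements match a given dataset; @load_iris, schema, and matching are all part of MLJ, and the loop simply prints each match:

using MLJ

# Built-in demo dataset; returns a feature table and a label vector
X, y = @load_iris

# Inspect the scientific types MLJ assigns to each column
schema(X)

# List every registered model compatible with this data
for m in models(matching(X, y))
    println(m.name, " (", m.package_name, ")")
end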

Pros and Cons

Pros

  • Unified API: Use 180+ algorithms through the same interface
  • High Performance: Fast training and inference leveraging Julia's performance
  • Type Safety: Robustness through compile-time type checking
  • Rich Algorithms: Classification, regression, clustering, dimensionality reduction, etc.
  • Automatic Differentiation: Automated gradient computation
  • Reproducibility: Consistent workflows for research reproducibility

Cons

  • Learning Curve: Requires mastering the Julia language
  • Ecosystem: Fewer peripheral tools compared to Python
  • Documentation: Resources in languages other than English are limited for some algorithms
  • Stability: Relatively new framework with frequent changes

Main Use Cases

  • Predictive modeling
  • Classification and regression analysis
  • Anomaly detection
  • Recommendation systems
  • Natural language processing
  • Computer vision
  • Time series analysis
  • Deep learning

Basic Usage

Installation

using Pkg
Pkg.add("MLJ")
Pkg.add("DataFrames")
Pkg.add("RDatasets")   # used by the examples below
using MLJ, DataFrames

Basic Classification Problem

# Data loading
using RDatasets
iris = dataset("datasets", "iris")

# Feature and target separation
X = select(iris, [:SepalLength, :SepalWidth, :PetalLength, :PetalWidth])
y = iris.Species

# Data splitting
train, test = partition(eachindex(y), 0.8, shuffle=true)

# Model selection
Tree = @load DecisionTreeClassifier pkg=DecisionTree
model = Tree(max_depth=5)

# Model training
mach = machine(model, X, y)
fit!(mach, rows=train)

# Prediction
predictions = predict(mach, X[test, :])
predictions_mode = predict_mode(mach, X[test, :])

# Evaluation
accuracy_score = accuracy(predictions_mode, y[test])
println("Accuracy: ", accuracy_score)

Regression Problem

# Boston housing data (bundled with MLJ)
X, y = @load_boston

# Train/test split
train, test = partition(eachindex(y), 0.8, shuffle=true)

# Model selection
LinearRegressor = @load LinearRegressor pkg=GLM
model = LinearRegressor()

# Training
mach = machine(model, X, y)
fit!(mach, rows=train)

# Prediction (GLM's LinearRegressor is probabilistic, so take the mean)
predictions = predict_mean(mach, rows=test)

# Evaluation
rmse_score = rms(predictions, y[test])
println("RMSE: ", rmse_score)

Pipeline Processing

# Pipeline combining preprocessing and model
Standardizer = @load Standardizer pkg=MLJModels
PCA = @load PCA pkg=MultivariateStats
RandomForestClassifier = @load RandomForestClassifier pkg=DecisionTree

# Pipeline construction (the |> syntax replaces the deprecated @pipeline macro)
pipe = Standardizer() |> PCA(maxoutdim=3) |> RandomForestClassifier(n_trees=100)

# Training
mach = machine(pipe, X, y)
fit!(mach, rows=train)

# Prediction
predictions = predict_mode(mach, X[test, :])
accuracy_score = accuracy(predictions, y[test])
println("Pipeline Accuracy: ", accuracy_score)

Cross-Validation

# k-fold cross-validation
cv = CV(nfolds=5, shuffle=true)

# Model evaluation
# f1score, precision, and recall are binary measures; the iris target
# is multiclass, so use the macro-averaged variant instead
evaluation = evaluate!(mach, resampling=cv,
                       measures=[accuracy, macro_f1score])

print(evaluation)
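
For quick comparisons there is also a non-mutating evaluate that takes the model directly and builds a machine internally (rng fixes the shuffle for reproducibility):

evaluate(pipe, X, y,
         resampling=CV(nfolds=5, shuffle=true, rng=123),
         measures=[accuracy, log_loss])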

Hyperparameter Tuning

# Random Forest hyperparameter tuning
RandomForestClassifier = @load RandomForestClassifier pkg=DecisionTree
model = RandomForestClassifier()

# Tuning range setting
r1 = range(model, :n_trees, lower=10, upper=100)
r2 = range(model, :max_depth, lower=1, upper=10)

# Grid search
tuned_model = TunedModel(model=model,
                         tuning=Grid(resolution=10),
                         resampling=CV(nfolds=5),
                         range=[r1, r2],
                         measure=accuracy)

# Tuning execution
mach = machine(tuned_model, X, y)
fit!(mach, rows=train)

# Best model found by the search
best_model = fitted_params(mach).best_model
println("Best parameters: ", best_model)

Deep Learning (MLJFlux)

using MLJFlux
using Flux

# Neural Network Classifier
NeuralNetworkClassifier = @load NeuralNetworkClassifier pkg=MLJFlux

# Network architecture definition
builder = MLJFlux.Short(
    n_hidden=32,
    dropout=0.1,
    σ=Flux.relu
)

# Model creation
model = NeuralNetworkClassifier(
    builder=builder,
    epochs=100,
    batch_size=32,
    lambda=0.01
)

# Training
mach = machine(model, X, y)
fit!(mach, rows=train)

# Prediction
predictions = predict_mode(mach, X[test, :])
accuracy_score = accuracy(predictions, y[test])
println("Neural Network Accuracy: ", accuracy_score)

Feature Engineering

# Feature selection: FeatureSelector (bundled with MLJ) keeps an
# explicitly named subset of features
selector = FeatureSelector(
    features=[:PetalLength, :PetalWidth]
)

# Feature scaling
Standardizer = @load Standardizer pkg=MLJModels
scaler = Standardizer()

# Complete pipeline: scaling, then selection, then the classifier
full_pipeline = scaler |> selector |> RandomForestClassifier(n_trees=50)

# Training and evaluation
mach = machine(full_pipeline, X, y)
fit!(mach, rows=train)
predictions = predict_mode(mach, X[test, :])

Model Saving and Loading

# Model saving (standard Julia serialization; .jls is the conventional extension)
MLJ.save("my_model.jls", mach)

# Model loading (restores a machine ready for prediction)
mach_loaded = machine("my_model.jls")

# Prediction with loaded model
predictions = predict_mode(mach_loaded, X[test, :])

Latest Trends (2025)

  • GPU Acceleration: Fast processing of large-scale data with CUDA support (see the sketch after this list)
  • Distributed Learning: Parallel learning support across multiple nodes
  • AutoML: Enhanced automated machine learning pipelines
  • Deep Learning Integration: Further integration with Flux.jl
  • Explainable AI: Improved model interpretability features
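
A sketch of GPU training with MLJFlux, assuming a CUDA-capable device with the CUDA.jl package installed; CUDALibs comes from ComputationalResources and is re-exported by MLJ:

using MLJ, MLJFlux

NeuralNetworkClassifier = @load NeuralNetworkClassifier pkg=MLJFlux

# acceleration=CUDALibs() moves training to the GPU
model = NeuralNetworkClassifier(epochs=100, acceleration=CUDALibs())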

Summary

MLJ.jl has matured into the standard framework for machine learning in Julia as of 2025. Its unified API enables consistent use of diverse algorithms, and the framework leverages Julia's speed and type safety for efficient machine learning. It is gaining attention as an alternative to Python's scikit-learn in fields that require scientific and high-performance computing.