(B29) CatBoostClassifier #2

060420202152

https://github.com/catboost/tutorials/blob/master/python_tutorial.ipynb

CatBoostClassifier encodes text categorical variables into numeric form on its own. If we do the encoding ourselves and convert the categorical variables to numbers, the results of our models will be the same (at least in my experience). To run the experiment and test CatBoostClassifier both without pointing to the categorical variables (cat_features) and with pointing to them, we have to encode the text categorical variables into numeric form ourselves. Otherwise, if the data contains text variables and we do not tell CatBoostClassifier that they are categorical, an error is raised.
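As a minimal sketch of the manual encoding described above, the snippet below maps text categories to integer codes with `pandas.factorize`. The toy DataFrame, the `text_columns` list and the `mappings` dictionary are hypothetical stand-ins (this is not the Titanic data), and factorization is only one of several possible encodings.

```python
import pandas as pd

# Hypothetical toy frame standing in for a dataset with text categorical columns
toy = pd.DataFrame({
    "Sex": ["male", "female", "female", "male"],
    "Embarked": ["S", "C", "S", "Q"],
    "Fare": [7.25, 71.28, 7.92, 8.05],
})

text_columns = ["Sex", "Embarked"]  # assumption: we already know which columns hold text
encoded = toy.copy()
mappings = {}

for col in text_columns:
    # factorize assigns codes 0, 1, 2, ... in order of first appearance
    codes, uniques = pd.factorize(toy[col])
    encoded[col] = codes
    mappings[col] = dict(enumerate(uniques))  # code -> original label

print(encoded)
print(mappings)
```

After this step the frame contains only numbers, so it can be passed to a model either with or without a list of categorical columns, which is the comparison the experiment needs.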

In [1]:
##  colorful prints
# Each helper wraps the text in an ANSI color escape (\033[<code>m) and resets with \033[0m
def black(text):
    print('\033[30m', text, '\033[0m', sep='')
def red(text):
    print('\033[31m', text, '\033[0m', sep='')
def green(text):
    print('\033[32m', text, '\033[0m', sep='')
def yellow(text):
    print('\033[33m', text, '\033[0m', sep='')
def blue(text):
    print('\033[34m', text, '\033[0m', sep='')
def magenta(text):
    print('\033[35m', text, '\033[0m', sep='')
def cyan(text):
    print('\033[36m', text, '\033[0m', sep='')
def gray(text):
    print('\033[90m', text, '\033[0m', sep='')

1.2 Loading the data

Another way to load the same Titanic data.

In [2]:
from catboost.datasets import titanic
import numpy as np
import pandas as pd

train_df, test_df = titanic()

train_df.head()
Out[2]:
   PassengerId  Survived  Pclass  Name                                             Sex     Age   SibSp  Parch  Ticket            Fare     Cabin  Embarked
0  1            0         3       Braund, Mr. Owen Harris                          male    22.0  1      0      A/5 21171         7.2500   NaN    S
1  2            1         1       Cumings, Mrs. John Bradley (Florence Briggs Th…  female  38.0  1      0      PC 17599          71.2833  C85    C
2  3            1         3       Heikkinen, Miss. Laina                           female  26.0  0      0      STON/O2. 3101282  7.9250   NaN    S
3  4            1         1       Futrelle, Mrs. Jacques Heath (Lily May Peel)     female  35.0  1      0      113803            53.1000  C123   S
4  5            0         3       Allen, Mr. William Henry                         male    35.0  0      0      373450            8.0500   NaN    S

Checking the completeness of the dataset

This method shows only the variables that have missing data.

In [3]:
null_value_stats = train_df.isnull().sum(axis=0)
null_value_stats[null_value_stats != 0]
Out[3]:
Age         177
Cabin       687
Embarked      2
dtype: int64

Wherever a record was empty, the value -777 is inserted.

In [4]:
train_df.fillna(-777, inplace=True)
test_df.fillna(-777, inplace=True)

Splitting into feature variables and the target variable

In [5]:
X = train_df.drop('Survived', axis=1)
y = train_df.Survived

Finding the categorical variables

Every column whose dtype is not float is selected as a categorical variable column.

In [6]:
print(X.dtypes)

# np.float was removed in NumPy 1.24; the builtin float is equivalent here
categorical_features_indices = np.where(X.dtypes != float)[0]
PassengerId      int64
Pclass           int64
Name            object
Sex             object
Age            float64
SibSp            int64
Parch            int64
Ticket          object
Fare           float64
Cabin           object
Embarked        object
dtype: object
In [7]:
categorical_features_indices
Out[7]:
array([ 0,  1,  2,  3,  5,  6,  7,  9, 10])

Displaying which columns these are

In [8]:
PPS = categorical_features_indices

# Pair each selected index with its column name in X (not train_df, which still contains Survived)
KOT_MIC = dict(zip(X.columns[PPS], PPS))
KOT_sorted_keys_MIC = sorted(KOT_MIC, key=KOT_MIC.get, reverse=True)

for r in KOT_sorted_keys_MIC:
    print(r, KOT_MIC[r])
Embarked 10
Cabin 9
Ticket 7
Parch 6
SibSp 5
Sex 3
Name 2
Pclass 1
PassengerId 0

You can also use my own way of identifying categorical variables: treat every column with fewer than 8 unique values as categorical. Here we have names and cabins, which take many distinct values, so this way of identifying categorical variables is not appropriate.

In [9]:
import numpy as np

# Note: these indices refer to train_df, which still contains the Survived column
categorical_fuX = np.where(train_df.nunique() < 8)[0]
categorical_fuX
Out[9]:
array([ 1,  2,  4,  6,  7, 11])
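To illustrate why a unique-value threshold misses columns such as names or cabins, here is a small sketch on made-up data. The toy DataFrame and the `threshold` variable are assumptions for illustration only; the threshold of 8 mirrors the cell above.

```python
import numpy as np
import pandas as pd

# Made-up data: "Name" is text but has as many unique values as rows
toy = pd.DataFrame({
    "Name": ["Ann", "Bob", "Cid", "Dee", "Eve", "Fay", "Gus", "Hal", "Ida", "Jon"],
    "Sex": ["f", "m", "m", "f", "f", "f", "m", "m", "f", "m"],
    "Pclass": [1, 3, 3, 2, 1, 3, 2, 3, 1, 2],
    "Fare": [71.3, 7.3, 8.1, 13.0, 53.1, 7.9, 26.0, 8.5, 30.0, 10.5],
})

threshold = 8  # same cut-off as in the cell above
unique_counts = toy.nunique()
low_cardinality = np.where(unique_counts < threshold)[0]
selected = list(toy.columns[low_cardinality])

print(unique_counts.to_dict())
print(selected)
```

The rule picks up Sex and Pclass but skips Name, even though Name is a text column that a model would still need to have declared as categorical.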

Splitting the data into training and validation sets

In [10]:
from sklearn.model_selection import train_test_split

X_train, X_validation, y_train, y_validation = train_test_split(X, y, train_size=0.75, random_state=42)
In [11]:
X_train.head(3)
Out[11]:
     PassengerId  Pclass  Name                             Sex     Age     SibSp  Parch  Ticket           Fare   Cabin  Embarked
298  299          1       Saalfeld, Mr. Adolphe            male    -777.0  0      0      19988            30.50  C106   S
884  885          3       Sutehall, Mr. Henry Jr           male    25.0    0      0      SOTON/OQ 392076  7.05   -777   S
247  248          2       Hamalainen, Mrs. William (Anna)  female  24.0    0      2      250649           14.50  -777   S

Class balance of the target variable

In [12]:
y_train.value_counts(dropna = False, normalize=True).plot(kind='pie')
Out[12]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f84617d24d0>

2.1 Model training

Now let's create the model itself. We go with the default parameters here (they provide a really good baseline almost all the time). The only thing we specify is the custom_loss parameter, because it lets us see what is going on in terms of the competition metric, accuracy, while also watching the log loss, which is smoother on a dataset of this size.

  • custom_loss: metrics computed and reported during training in addition to the optimized loss; chosen here: ['Accuracy'] https://catboost.ai/docs/search/?query=%27Accuracy%27
  • random_seed = 42: the random seed used for training. With a fixed seed the random values are the same on every run.
  • logging_level = 'Silent': the logging level for standard output. 'Silent': do not output any logging information. 'Verbose': output training progress (the optimized metric and the elapsed and remaining training time). 'Info' or 'Debug': output additional information and the number of trees.
In [13]:
from catboost import CatBoostClassifier, Pool, cv
from sklearn.metrics import accuracy_score

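Defining the model (the categorical variables are not declared here; they are passed to fit below)

The model optimizes Logloss, the default loss of CatBoostClassifier for binary classification; Accuracy is only tracked through custom_loss.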

In [14]:
model = CatBoostClassifier(
    custom_loss=['Accuracy'],
    random_seed=42,
    logging_level='Silent'
)
In [15]:
model.fit(
    X_train, y_train,
    cat_features=categorical_features_indices,
    eval_set=(X_validation, y_validation),
#     logging_level='Verbose',  # you can uncomment this for text output
    plot=True
);

[Interactive CatBoost training plot: Learn and Eval curves per iteration]


As you can see, it is possible to watch the model learn either through verbose output or through nice plots (personally I would definitely go with the second option; just look at those plots: you can, for example, zoom in on areas of interest!).

With this we can see that the best accuracy value of 0.8340 (on the validation set) was reached at boosting step 157.

To see it, click on Accuracy and hover the mouse over the solid line (the test data), not the dashed line (the training data). In my run the accuracy of 0.834 is reached at iteration 451. That is the place where the dot is!
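The dot on the plot marks the iteration with the best validation value. A minimal sketch of that lookup on a plain list follows; the accuracy values are made up for illustration, and `best_iteration` is a hypothetical helper, not a CatBoost function.

```python
def best_iteration(values, higher_is_better=True):
    """Return (index, value) of the best entry; the earliest index wins ties."""
    if not values:
        raise ValueError("values must not be empty")
    pick = max if higher_is_better else min
    best = pick(values)
    return values.index(best), best

# Made-up validation accuracy per boosting iteration
val_accuracy = [0.78, 0.80, 0.83, 0.82, 0.83, 0.81]
idx, acc = best_iteration(val_accuracy)
print(idx, acc)
```

For a metric such as log loss, where lower is better, the same helper can be called with `higher_is_better=False`.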

What is log loss?

If you only predict the probability of the positive class, the log loss function can be computed for a single binary classification prediction (yhat) compared with the expected probability (y) as follows:

LogLoss = -((1 - y) * log(1 - yhat) + y * log(yhat))
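The formula can be checked numerically with a few lines of Python. This is a sketch under two assumptions that the text does not state: the logarithm is the natural logarithm, and the probability is clipped by a small `eps` so that log(0) never occurs. The name `log_loss_single` is hypothetical.

```python
import math

def log_loss_single(y, yhat, eps=1e-15):
    """Log loss of one binary prediction; y is 0 or 1, yhat is P(class 1)."""
    # Assumption: clip the probability so that log(0) cannot occur
    yhat = min(max(yhat, eps), 1 - eps)
    return -((1 - y) * math.log(1 - yhat) + y * math.log(yhat))

confident_right = log_loss_single(1, 0.9)  # true class 1, predicted 0.9
confident_wrong = log_loss_single(1, 0.1)  # true class 1, predicted 0.1

print(round(confident_right, 4))
print(round(confident_wrong, 4))
```

A confident correct prediction costs about 0.105, while an equally confident wrong one costs about 2.303, which is why log loss punishes overconfident mistakes so heavily.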

Computing the model's predictions

In [16]:
yhatA = model.predict(X_validation)
print(yhatA[:12])
[0 0 0 1 1 1 1 0 1 1 0 0]

y_validation[4]

In [17]:
y_train[12]
Out[17]:
0

Evaluating this classification model

In [18]:
# Classification Assessment
def Classification_Assessment(model ,Xtrain, ytrain, Xtest, ytest, y_pred):
    import numpy as np
    import matplotlib.pyplot as plt
    from sklearn import metrics
    from sklearn.metrics import classification_report, confusion_matrix
    from sklearn.metrics import confusion_matrix, log_loss, auc, roc_curve, roc_auc_score, recall_score, precision_recall_curve
    from sklearn.metrics import make_scorer, precision_score, fbeta_score, f1_score, classification_report
    def green(text):
        print('33[32m', text, '33[0m', sep='')  
    def blue(text):
        print('33[34m', text, '33[0m', sep='')         
    
    print("Recall Training data:     ", np.round(recall_score(ytrain, model.predict(Xtrain)), decimals=4))
    print("Precision Training data:  ", np.round(precision_score(ytrain, model.predict(Xtrain)), decimals=4))
    print("----------------------------------------------------------------------")
    print("Recall Test data:         ", np.round(recall_score(ytest, model.predict(Xtest)), decimals=4)) 
    print("Precision Test data:      ", np.round(precision_score(ytest, model.predict(Xtest)), decimals=4))
    print("----------------------------------------------------------------------")
    print("Confusion Matrix Test data")
    print(confusion_matrix(ytest, model.predict(Xtest)))
    print("----------------------------------------------------------------------")
    green('Valuation for test data only:')
    print(classification_report(ytest, model.predict(Xtest)))
      
    green('Valuation for test data only:')
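    # Note: y_pred_proba is computed but not used below. The ROC curve and roc_auc are built
    # from y_pred (the hard 0/1 labels passed in), so roc_auc differs from AUC_test further down.
    y_pred_proba = model.predict_proba(Xtest)[::,1]
    fpr, tpr, _ = metrics.roc_curve(ytest,  y_pred)
    auc = metrics.roc_auc_score(ytest, y_pred)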
    plt.plot(fpr, tpr, label='ROC (roc_auc = %0.2f)' % auc)
    plt.xlabel('False Positive Rate',color='grey', fontsize = 13)
    plt.ylabel('True Positive Rate',color='grey', fontsize = 13)
    plt.title('Receiver operating characteristic')
    plt.legend(loc="lower right")
    plt.legend(loc=4)
    plt.plot([0, 1], [0, 1],'r--')
    plt.show()
    print('roc_auc %.3f' % auc)
    
   
    blue('---------------------') 
    AUC_train_1 = metrics.roc_auc_score(ytrain,model.predict_proba(Xtrain)[:,1])
    blue('AUC_train: %.3f' % AUC_train_1)
    AUC_test_1 = metrics.roc_auc_score(ytest,model.predict_proba(Xtest)[:,1])
    blue('AUC_test:  %.3f' % AUC_test_1)
    blue('---------------------')    

      
    print("Accuracy Training data:     ", np.round(accuracy_score(ytrain, model.predict(Xtrain)), decimals=4))
    green("----------------------------------------------------------------------")
    print("Accuracy Test data:         ", np.round(accuracy_score(ytest, model.predict(Xtest)), decimals=4)) 
    green("----------------------------------------------------------------------")
In [19]:
##  colorful prints
def black(text):
     print('\033[30m', text, '\033[0m', sep='')
def red(text):
     print('\033[31m', text, '\033[0m', sep='')
def green(text):
     print('\033[32m', text, '\033[0m', sep='')
def yellow(text):
     print('\033[33m', text, '\033[0m', sep='')
def blue(text):
     print('\033[34m', text, '\033[0m', sep='')
def magenta(text):
     print('\033[35m', text, '\033[0m', sep='')
def cyan(text):
     print('\033[36m', text, '\033[0m', sep='')
def gray(text):
     print('\033[90m', text, '\033[0m', sep='')
In [20]:
blue(X_train.shape)
green(y_train.shape)
blue(X_validation.shape)
green(y_validation.shape)
(668, 11)
(668,)
(223, 11)
(223,)
In [21]:
Classification_Assessment(model,X_train, y_train, X_validation, y_validation, yhatA)
Recall Training data:      0.7391
Precision Training data:   0.9791
----------------------------------------------------------------------
Recall Test data:          0.6629
Precision Test data:       0.8429
----------------------------------------------------------------------
Confusion Matrix Test data
[[123  11]
 [ 30  59]]
----------------------------------------------------------------------
Valuation for test data only:
              precision    recall  f1-score   support

           0       0.80      0.92      0.86       134
           1       0.84      0.66      0.74        89

    accuracy                           0.82       223
   macro avg       0.82      0.79      0.80       223
weighted avg       0.82      0.82      0.81       223

Valuation for test data only:
roc_auc 0.790
---------------------
AUC_train: 0.968
AUC_test:  0.905
---------------------
Accuracy Training data:      0.8952
----------------------------------------------------------------------
Accuracy Test data:          0.8161
----------------------------------------------------------------------
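
The numbers in this report can be checked by hand from the confusion matrix. Here is a minimal sketch, assuming scikit-learn's convention that rows are true classes and columns are predicted classes. The variable names are mine. It also shows why `roc_auc` (0.790) is lower than `AUC_test` (0.905): when the ROC curve is built from hard 0/1 predictions it has a single interior point, and its area reduces to the mean of recall and specificity.

```python
# Confusion matrix from the output above: [[TN, FP], [FN, TP]]
tn, fp = 123, 11
fn, tp = 30, 59

recall = tp / (tp + fn)                      # share of actual positives that were found
precision = tp / (tp + fp)                   # share of predicted positives that are correct
accuracy = (tp + tn) / (tp + tn + fp + fn)   # share of all predictions that are correct
specificity = tn / (tn + fp)                 # share of actual negatives that were found

# Area under a ROC curve with one interior point (hard labels, no probabilities)
auc_from_labels = (recall + specificity) / 2

print("recall:   ", round(recall, 4))
print("precision:", round(precision, 4))
print("accuracy: ", round(accuracy, 4))
print("roc_auc from hard labels:", round(auc_from_labels, 3))
```

The first three values reproduce the Recall, Precision and Accuracy lines for the test data, and the last one reproduces `roc_auc 0.790`.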

2.2 Model cross-validation

Validating your model is good, but cross-validating it is even better. And with plots, too! Without further ado:

This shows the model parameters

In [22]:
cv_params = model.get_params()
cv_params
Out[22]:
{'random_seed': 42, 'logging_level': 'Silent', 'custom_loss': ['Accuracy']}

This adds one more parameter to the model

In [23]:
cv_params.update({'loss_function': 'Logloss'})

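I am not sure what this step does (it appears to run CatBoost's built-in cross-validation, cv, on the full dataset wrapped in a Pool)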

In [24]:
cv_data = cv(
    Pool(X, y, cat_features=categorical_features_indices),
    cv_params,
    plot=True)