Supervised Learning: A Worked Example of the K-Nearest Neighbors Algorithm

小青CAE

May 18, 2021, 17:32


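Problem Description

Use customer feature data to predict whether a customer is likely to churn.

The data file is named Orange_Telecom_Churn_Data.csv.

Analyze how different parameter settings of the KNN algorithm affect the model's prediction accuracy.

Data source: https://software.intel.com/content/www/cn/zh/develop/training/course-machine-learning.html

The data file, source code, and related materials can be downloaded from QQ group 517718332.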

Step 1: Get the Data

Import the data.

Inspect the data format.

Do some initial processing, such as dropping features that carry no useful information and replacing missing values.
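As a rough illustration of the inspection and missing-value handling mentioned above, here is a minimal sketch on a small made-up table. The values are invented, and median filling is only one possible choice; it is not necessarily what the original analysis does.

```python
import numpy as np
import pandas as pd

# Hypothetical toy table standing in for the real CSV; the values are invented.
toy = pd.DataFrame({
    "account_length": [128, 107, np.nan, 84],
    "intl_plan": ["no", "yes", "no", "no"],
    "churned": [False, False, True, False],
})

# Column types show which features are numeric and which are categorical.
print(toy.dtypes)

# Count missing values per column before deciding how to handle them.
missing_counts = toy.isna().sum()
print(missing_counts)

# One simple option: replace a missing numeric value with the column median.
filled = toy.copy()
filled["account_length"] = filled["account_length"].fillna(
    filled["account_length"].median()
)
print(filled)
```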

# Import modules
import pandas as pd
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelBinarizer, MinMaxScaler
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
import numpy as np
import matplotlib.pyplot as plt

# Load the data
file_path = "./data/Orange_Telecom_Churn_Data.csv"
raw_data = pd.read_csv(file_path)
raw_data.head()

The table is too large to display properly on the WeChat official account; you can view it at http://www.xiaoqing-cae.com/

raw_data.describe()

The table is too large to display properly on the WeChat official account; you can view it at http://www.xiaoqing-cae.com/

# Check whether the features can separate the target classes, i.e. whether the features are informative
# sns.pairplot(raw_data, hue='churned', height=3);

g = sns.JointGrid(data=raw_data, x="account_length", y="number_vmail_messages", hue="churned")
g.plot(sns.scatterplot, sns.histplot);

Figure 1: joint plot of account_length against number_vmail_messages, colored by churned
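A numeric check can complement the plot. The sketch below compares the mean of one feature between the two classes, using invented toy values rather than the real dataset; a large gap hints that the feature helps separate the classes.

```python
import pandas as pd

# Hypothetical toy data; the real dataset has many more rows and columns.
toy = pd.DataFrame({
    "number_vmail_messages": [0, 25, 0, 30, 0, 0],
    "churned": [True, False, True, False, False, True],
})

# Mean of the feature within each class.
mean_churned = toy.loc[toy["churned"], "number_vmail_messages"].mean()
mean_stayed = toy.loc[~toy["churned"], "number_vmail_messages"].mean()

print("churned:", mean_churned, "stayed:", mean_stayed)
```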

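Step 2: Process the Data

Inspecting the raw data shows that:

The data contains 5000 samples.
Each sample has 20 feature values and 1 target value.
The target has only two classes, so the samples only need to be split into two classes.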

# Drop features that carry no useful information

raw_data.drop(['state', 'area_code', 'phone_number'], axis=1, inplace=True)
raw_data.head()

The table is too large to display properly on the WeChat official account; you can view it at http://www.xiaoqing-cae.com/

# Some columns are floating-point numbers and some are categories such as yes or no; the categories need to be converted to numbers

lb = LabelBinarizer()

for col in ['intl_plan', 'voice_mail_plan', 'churned']:
    raw_data[col] = lb.fit_transform(raw_data[col])
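To make the encoding step concrete, here is a small sketch of what LabelBinarizer produces for a two-class column. The input list is a made-up example, not data from the file.

```python
from sklearn.preprocessing import LabelBinarizer

# Hypothetical yes/no column, similar in spirit to intl_plan.
plan = ["no", "yes", "no", "yes"]

lb = LabelBinarizer()
encoded = lb.fit_transform(plan)

# classes_ is sorted, so "no" maps to 0 and "yes" maps to 1.
print(lb.classes_)

# For two classes, fit_transform returns a column vector of shape
# (n_samples, 1); ravel() flattens it to one dimension.
print(encoded.shape)
print(encoded.ravel())
```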

# Split the dataset into a training set and a test set

x_cols = [x for x in raw_data.columns if x != 'churned']
x_data = raw_data[x_cols]
y_data = raw_data['churned']
x_train, x_test, y_train, y_test = train_test_split(x_data,y_data, test_size=0.2, random_state=20)
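When the two classes are imbalanced, a plain random split can leave the test set with a different class ratio than the training set. The sketch below shows the optional stratify argument of train_test_split on made-up labels. This is an assumption about what might help here; the split above does not use it.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced labels: 80 negatives and 20 positives.
y = np.array([0] * 80 + [1] * 20)
X = np.arange(100).reshape(-1, 1)

# stratify=y keeps the class ratio the same in both splits.
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, random_state=20, stratify=y
)

# Share of positive samples in each split.
print(y_tr.mean(), y_te.mean())
```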

Feature Engineering

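# Scale all feature values into the 0-1 range (min-max normalization)

msc = MinMaxScaler()

# Fit the scaler on the training set only, then apply the same scaling to the test set
x_train = msc.fit_transform(x_train)
x_test  = msc.transform(x_test)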
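The usual practice is to learn the minimum and maximum from the training data and reuse them on the test data, so both sets share one scale. A minimal sketch with made-up one-column arrays:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical one-feature arrays.
train = np.array([[0.0], [10.0]])
test = np.array([[5.0], [20.0]])

scaler = MinMaxScaler()
train_scaled = scaler.fit_transform(train)  # learns min=0 and max=10
test_scaled = scaler.transform(test)        # reuses the training min and max

print(train_scaled.ravel())
print(test_scaled.ravel())
```

Test values outside the training range map outside 0-1, which is expected: the scaler describes the training data, not the test data.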

Model Training and Evaluation

# k from 1 to 29; p = 1 and p = 2; plot accuracy against k

accuracy_s1 = []
accuracy_s2 = []

for i in range(1, 30):
    # Train the model with p = 1 (Manhattan distance)
    knn = KNeighborsClassifier(n_neighbors=i, p=1)
    knn = knn.fit(x_train, y_train)
    # Evaluate the model's accuracy on the test set
    score = knn.score(x_test, y_test)
    accuracy_s1.append([i, score])

    # Train the model with p = 2 (Euclidean distance)
    knn = KNeighborsClassifier(n_neighbors=i, p=2)
    knn = knn.fit(x_train, y_train)
    # Evaluate the model's accuracy on the test set
    score = knn.score(x_test, y_test)
    accuracy_s2.append([i, score])
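The parameter p selects the order of the Minkowski distance that KNeighborsClassifier uses by default: p = 1 gives the Manhattan distance and p = 2 gives the Euclidean distance. The helper below, minkowski, is a hypothetical illustration written for this note and is not part of scikit-learn.

```python
import numpy as np

def minkowski(a, b, p):
    """Minkowski distance of order p between two points."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(np.sum(np.abs(a - b) ** p) ** (1.0 / p))

d1 = minkowski([0, 0], [3, 4], p=1)  # Manhattan: |3| + |4|
d2 = minkowski([0, 0], [3, 4], p=2)  # Euclidean: sqrt(3^2 + 4^2)
print(d1, d2)
```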

# Plot the curves
accuracy_s1 = np.array(accuracy_s1)
accuracy_s2 = np.array(accuracy_s2)

plt.figure()
# Font settings for displaying Chinese text (not required for the English labels below)
plt.rcParams['font.sans-serif'] = ['KaiTi']  # set the default font
plt.rcParams['axes.unicode_minus'] = False  # keep the minus sign '-' from rendering as a box in saved figures
plt.xlabel("k")
plt.ylabel("Accuracy")
plt.plot(accuracy_s1[:, 0], accuracy_s1[:, 1], label='p = 1')
plt.plot(accuracy_s2[:, 0], accuracy_s2[:, 1], label='p = 2')
plt.legend()

plt.show()
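GridSearchCV is imported at the top of the script but never used. As a possible alternative to the manual loop, the sketch below searches over n_neighbors and p with 5-fold cross-validation. It runs on synthetic data from make_classification, so the result says nothing about the churn dataset; the grid range and the number of folds are assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data; in practice this would be the scaled training set.
X, y = make_classification(n_samples=200, n_features=5, random_state=0)

# Candidate values: 10 settings of k times 2 settings of p.
param_grid = {"n_neighbors": list(range(1, 11)), "p": [1, 2]}

search = GridSearchCV(
    KNeighborsClassifier(), param_grid, cv=5, scoring="accuracy"
)
search.fit(X, y)

# Best parameter combination and its mean cross-validated accuracy.
print(search.best_params_, round(search.best_score_, 3))
```

Because the score comes from cross-validation on the training data, the test set stays untouched for a final check.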