Single Variable Linear Regression

Model
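
For a single input feature $x$, the model is:

$$f_{w,b}(x) = wx + b$$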

Cost Function
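
$$J(w,b) = \frac{1}{2m} \sum_{i=1}^{m} \left( f_{w,b}(x_i) - y_i \right)^2$$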

$(x_i, y_i)$ are the true values in the dataset, which has $m$ samples.

Gradient Descent
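
Repeat until convergence:

$$w := w - \alpha \frac{\partial J(w,b)}{\partial w}, \qquad b := b - \alpha \frac{\partial J(w,b)}{\partial b}$$

where $\alpha$ is the learning rate and $w$, $b$ are updated simultaneously.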

And we can calculate the partial derivatives with the following equations:
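
$$\frac{\partial J(w,b)}{\partial w} = \frac{1}{m} \sum_{i=1}^{m} \left( f_{w,b}(x_i) - y_i \right) x_i$$

$$\frac{\partial J(w,b)}{\partial b} = \frac{1}{m} \sum_{i=1}^{m} \left( f_{w,b}(x_i) - y_i \right)$$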

Code in Python

def compute_cost(x, y, w, b):
    """
    Computes the cost function for linear regression.
    Args:
      x (ndarray): Data, m examples
      y (ndarray): target values
      w,b (scalar): model parameters
    Returns
      total_cost (float): The cost of using w,b as the parameters for linear regression
                          to fit the data points in x and y
    """
    # number of training examples
    m = x.shape[0]  # using numpy

    cost_sum = 0
    for i in range(m):
        f_wb = w * x[i] + b
        cost = (f_wb - y[i]) ** 2
        cost_sum = cost_sum + cost
    total_cost = (1 / (2 * m)) * cost_sum

    return total_cost
def compute_gradient(x, y, w, b):
    """
    Computes the gradient for linear regression
    Args:
      x (ndarray): Data, m examples
      y (ndarray): target values
      w,b (scalar): model parameters
    Returns
      dj_dw (scalar): The gradient of the cost w.r.t. the parameter w
      dj_db (scalar): The gradient of the cost w.r.t. the parameter b
    """

    # Number of training examples
    m = x.shape[0]
    dj_dw = 0
    dj_db = 0

    for i in range(m):
        f_wb = w * x[i] + b
        dj_dw_i = (f_wb - y[i]) * x[i]
        dj_db_i = f_wb - y[i]
        dj_db += dj_db_i
        dj_dw += dj_dw_i
    dj_dw = dj_dw / m
    dj_db = dj_db / m

    return dj_dw, dj_db
import math
import copy

def gradient_descent(x, y, w_0, b_0, alpha, num_iters):
    """
    Performs gradient descent to fit w,b. Updates w,b by taking
    num_iters gradient steps with learning rate alpha

    Args:
      x (ndarray): Data, m examples
      y (ndarray): target values
      w_0,b_0 (scalar): initial values of model parameters
      alpha (float): Learning rate
      num_iters (int): number of iterations to run gradient descent

    Returns:
      w (scalar): Updated value of parameter after running gradient descent
      b (scalar): Updated value of parameter after running gradient descent
      J_history (list): History of cost values
      p_history (list): History of parameters [w,b]
    """

    # Arrays to store cost J and parameters at each iteration, primarily for graphing later
    J_history = []
    p_history = []
    b = b_0
    w = copy.deepcopy(w_0)  # avoid modifying global w_0

    for i in range(num_iters):
        # Calculate the gradient using compute_gradient
        dj_dw, dj_db = compute_gradient(x, y, w, b)

        # Update parameters
        b = b - alpha * dj_db
        w = w - alpha * dj_dw

        # Save cost J at each iteration
        if i < 100000:  # prevent resource exhaustion
            J_history.append(compute_cost(x, y, w, b))
            p_history.append([w, b])

        # Print the cost 10 times over the run, or every iteration if num_iters < 10
        if (i % math.ceil(num_iters / 10)) == 0:
            print(f"Iteration {i:4}: Cost {J_history[-1]:0.2e} ",
                  f"dj_dw: {dj_dw: 0.3e}, dj_db: {dj_db: 0.3e} ",
                  f"w: {w: 0.3e}, b:{b: 0.5e}")

    return w, b, J_history, p_history  # return w,b and J,p history for graphing

Multiple Variable Linear Regression

Model

The model’s prediction with multiple variables is given by the linear model:
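
$$f_{\mathbf{w},b}(\mathbf{x}) = w_0 x_0 + w_1 x_1 + \cdots + w_{n-1} x_{n-1} + b$$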

or in vector notation:
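
$$f_{\mathbf{w},b}(\mathbf{x}) = \mathbf{w} \cdot \mathbf{x} + b$$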

where $\cdot$ is a vector dot product

import numpy as np
w = np.array([0.5, 0.5, 0.5])  # example parameter vector (illustrative values)
b = 1.0                        # example bias (illustrative value)
x = np.array([1, 2, 3])
y = np.array([4, 5, 6])
f = np.dot(w, x) + b  # we can use numpy dot to calculate the prediction

Examples are stored in a NumPy matrix $\mathbf{X}$. Each row of the matrix represents one example. When you have $m$ training examples and $n$ features, $\mathbf{X}$ is a matrix with dimensions ($m$, $n$) ($m$ rows, $n$ columns).

notation:

  • $\mathbf{x}^{(i)}$ is the vector containing example $i$: $\mathbf{x}^{(i)} = (x^{(i)}_0, x^{(i)}_1, \cdots,x^{(i)}_{n-1})$
  • $x^{(i)}_j$ is element $j$ in example $i$. The superscript in parentheses indicates the example number, while the subscript represents an element.
  • $\mathbf{w}$ is a vector with $n$ elements.
    • Each element contains the parameter associated with one feature.
    • notionally, we draw this as a column vector
  • $b$ is a scalar parameter.

Cost Function

The equation for the cost function with multiple variables $J(\mathbf{w},b)$ is:
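
$$J(\mathbf{w},b) = \frac{1}{2m} \sum_{i=1}^{m} \left( f_{\mathbf{w},b}(\mathbf{x}^{(i)}) - y^{(i)} \right)^2$$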

where:
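
$$f_{\mathbf{w},b}(\mathbf{x}^{(i)}) = \mathbf{w} \cdot \mathbf{x}^{(i)} + b$$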

Gradient Descent

Gradient descent for multiple variables:
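
Repeat until convergence:

$$w_j := w_j - \alpha \frac{\partial J(\mathbf{w},b)}{\partial w_j}, \qquad b := b - \alpha \frac{\partial J(\mathbf{w},b)}{\partial b}$$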

for $j = 0 \ldots n-1$, where $n$ is the number of features; the parameters $w_j$ and $b$ are updated simultaneously, and where
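
$$\frac{\partial J(\mathbf{w},b)}{\partial w_j} = \frac{1}{m} \sum_{i=1}^{m} \left( f_{\mathbf{w},b}(\mathbf{x}^{(i)}) - y^{(i)} \right) x^{(i)}_j$$

$$\frac{\partial J(\mathbf{w},b)}{\partial b} = \frac{1}{m} \sum_{i=1}^{m} \left( f_{\mathbf{w},b}(\mathbf{x}^{(i)}) - y^{(i)} \right)$$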

  • m is the number of training examples in the data set

  • $f_{\mathbf{w},b}(\mathbf{x}^{(i)})$ is the model’s prediction, while $y^{(i)}$ is the target value

  • For a linear regression model, $f_{\mathbf{w},b}(\mathbf{x}^{(i)}) = \mathbf{w} \cdot \mathbf{x}^{(i)} + b$

Feature Scaling

Mean Normalization
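
$$x_j := \frac{x_j - \mu_j}{\max(x_j) - \min(x_j)}$$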

Z-Score Normalization
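
$$x^{(i)}_j := \frac{x^{(i)}_j - \mu_j}{\sigma_j}, \qquad \mu_j = \frac{1}{m} \sum_{i=1}^{m} x^{(i)}_j, \qquad \sigma^2_j = \frac{1}{m} \sum_{i=1}^{m} \left( x^{(i)}_j - \mu_j \right)^2$$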

where $j$ selects a feature or a column in the X matrix. $\mu_j$ is the mean of all the values for feature (j) and $\sigma_j$ is the standard deviation of feature (j).

Code in Python

def compute_cost(X, y, w, b):
    """
    Args:
      X (ndarray (m,n)): Data, m examples with n features
      y (ndarray (m,)) : target values
      w (ndarray (n,)) : model parameters
      b (scalar)       : model parameter
    Returns:
      cost (scalar): cost
    """
    m = X.shape[0]
    cost = 0.0
    for i in range(m):
        f_wb_i = np.dot(X[i], w) + b
        cost = cost + (f_wb_i - y[i])**2
    cost = cost / (2 * m)
    return cost
def compute_gradient(X, y, w, b):
    """
    Args:
      X (ndarray (m,n)): Data, m examples with n features
      y (ndarray (m,)) : target values
      w (ndarray (n,)) : model parameters
      b (scalar)       : model parameter
    Returns:
      dj_db (scalar)      : The gradient of the cost w.r.t. the parameter b.
      dj_dw (ndarray (n,)): The gradient of the cost w.r.t. the parameters w.
    """
    m, n = X.shape  # (number of examples, number of features)
    dj_dw = np.zeros((n,))  # create a new array
    dj_db = 0.0

    for i in range(m):
        err = (np.dot(X[i], w) + b) - y[i]
        for j in range(n):
            dj_dw[j] = dj_dw[j] + err * X[i, j]
        dj_db = dj_db + err
    dj_dw = dj_dw / m
    dj_db = dj_db / m

    return dj_db, dj_dw
def gradient_descent(X, y, w_in, b_in, alpha, num_iters):
    """
    Args:
      X (ndarray (m,n))   : Data, m examples with n features
      y (ndarray (m,))    : target values
      w_in (ndarray (n,)) : initial model parameters
      b_in (scalar)       : initial model parameter
      alpha (float)       : Learning rate
      num_iters (int)     : number of iterations to run gradient descent
    Returns:
      w (ndarray (n,)) : Updated values of parameters
      b (scalar)       : Updated value of parameter
      J_history (list) : History of cost values
    """

    # An array to store cost J at each iteration, primarily for graphing later
    J_history = []
    w = copy.deepcopy(w_in)  # avoid modifying global w within function
    b = b_in

    for i in range(num_iters):
        # Calculate the gradient and update the parameters
        dj_db, dj_dw = compute_gradient(X, y, w, b)

        # Update parameters using w, b, alpha and gradient
        w = w - alpha * dj_dw
        b = b - alpha * dj_db

        # Save cost J at each iteration
        if i < 100000:  # prevent resource exhaustion
            J_history.append(compute_cost(X, y, w, b))

        # Print the cost 10 times over the run, or every iteration if num_iters < 10
        if (i % math.ceil(num_iters / 10)) == 0:
            print(f"Iteration {i:4d}: Cost {J_history[-1]:8.2f}")

    return w, b, J_history  # return final w,b and J history
import numpy as np

def zscore_normalize_features(X):
    """
    computes X, z-score normalized by column
    Args:
      X (ndarray): Shape (m,n) input data, m examples, n features
    Returns:
      X_norm (ndarray): Shape (m,n) input normalized by column
      mu (ndarray): Shape (n,) mean of each feature
      sigma (ndarray): Shape (n,) standard deviation of each feature
    """
    mu = np.mean(X, axis=0)
    sigma = np.std(X, axis=0)
    X_norm = (X - mu) / sigma
    return (X_norm, mu, sigma)

Logistic Regression

Sigmoid Function
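
$$g(z) = \frac{1}{1+e^{-z}}$$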

A logistic regression model applies the sigmoid to the linear regression model as shown below:
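
$$f_{\mathbf{w},b}(\mathbf{x}) = g(\mathbf{w} \cdot \mathbf{x} + b)$$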

where $g(z)$ is shown above.

  • Therefore, to get a final prediction ($y=0$ or $y=1$) from the logistic regression model, we can use the following

    if $f_{\mathbf{w},b}(x) \ge 0.5$, predict $y=1$

    if $f_{\mathbf{w},b}(x) < 0.5$, predict $y=0$

  • $\mathbf{w} \cdot \mathbf{x} + b = 0$ is the decision boundary. It can be linear or non-linear, depending on what kinds of features you use. If you use only linear features, it will be a straight line; if you include features like $x^2, x^3, \ldots$, it can be a more complex curve.

Loss Function

$loss(f_{\mathbf{w},b}(\mathbf{x}^{(i)}), y^{(i)})$ is the cost for a single data point, which is:
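
$$loss(f_{\mathbf{w},b}(\mathbf{x}^{(i)}), y^{(i)}) =
\begin{cases}
-\log\left(f_{\mathbf{w},b}(\mathbf{x}^{(i)})\right) & \text{if } y^{(i)} = 1 \\
-\log\left(1 - f_{\mathbf{w},b}(\mathbf{x}^{(i)})\right) & \text{if } y^{(i)} = 0
\end{cases}$$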

  • $f_{\mathbf{w},b}(\mathbf{x}^{(i)})$ is the model’s prediction, while $y^{(i)}$ is the target value.

  • $f_{\mathbf{w},b}(\mathbf{x}^{(i)}) = g(\mathbf{w} \cdot\mathbf{x}^{(i)}+b)$ where function $g$ is the sigmoid function.

it can also be written as:
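
$$loss(f_{\mathbf{w},b}(\mathbf{x}^{(i)}), y^{(i)}) = -y^{(i)} \log\left(f_{\mathbf{w},b}(\mathbf{x}^{(i)})\right) - \left(1 - y^{(i)}\right) \log\left(1 - f_{\mathbf{w},b}(\mathbf{x}^{(i)})\right)$$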

Cost Function
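
The cost is the average loss over all $m$ training examples:

$$J(\mathbf{w},b) = \frac{1}{m} \sum_{i=1}^{m} loss\left(f_{\mathbf{w},b}(\mathbf{x}^{(i)}), y^{(i)}\right)$$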

Logistic Gradient Descent
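
Repeat until convergence:

$$w_j := w_j - \alpha \frac{1}{m} \sum_{i=1}^{m} \left( f_{\mathbf{w},b}(\mathbf{x}^{(i)}) - y^{(i)} \right) x^{(i)}_j$$

$$b := b - \alpha \frac{1}{m} \sum_{i=1}^{m} \left( f_{\mathbf{w},b}(\mathbf{x}^{(i)}) - y^{(i)} \right)$$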

Where each iteration performs simultaneous updates on $w_j$ for all $j$, where

  • m is the number of training examples in the data set
  • $f_{\mathbf{w},b}(x^{(i)})$ is the model’s prediction, while $y^{(i)}$ is the target
  • For a logistic regression model
    $z = \mathbf{w} \cdot \mathbf{x} + b$
    $f_{\mathbf{w},b}(x) = g(z)$
    where $g(z)$ is the sigmoid function:
    $g(z) = \frac{1}{1+e^{-z}}$

Overfitting And Regularization

To address the overfitting problem, we can add a regularization term to the cost function.

In linear regression, it will be:
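
$$J(\mathbf{w},b) = \frac{1}{2m} \sum_{i=1}^{m} \left( f_{\mathbf{w},b}(\mathbf{x}^{(i)}) - y^{(i)} \right)^2 + \frac{\lambda}{2m} \sum_{j=0}^{n-1} w_j^2$$

where $\lambda > 0$ is the regularization parameter.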

In logistic regression, it will be:
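
$$J(\mathbf{w},b) = \frac{1}{m} \sum_{i=1}^{m} \left[ -y^{(i)} \log\left(f_{\mathbf{w},b}(\mathbf{x}^{(i)})\right) - \left(1 - y^{(i)}\right) \log\left(1 - f_{\mathbf{w},b}(\mathbf{x}^{(i)})\right) \right] + \frac{\lambda}{2m} \sum_{j=0}^{n-1} w_j^2$$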

And we can calculate the derivative terms for gradient descent as follows:
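
$$\frac{\partial J(\mathbf{w},b)}{\partial w_j} = \frac{1}{m} \sum_{i=1}^{m} \left( f_{\mathbf{w},b}(\mathbf{x}^{(i)}) - y^{(i)} \right) x^{(i)}_j + \frac{\lambda}{m} w_j$$

$$\frac{\partial J(\mathbf{w},b)}{\partial b} = \frac{1}{m} \sum_{i=1}^{m} \left( f_{\mathbf{w},b}(\mathbf{x}^{(i)}) - y^{(i)} \right)$$

The regularization term does not change $\frac{\partial J}{\partial b}$, since $b$ is not regularized.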

Code in Python

import numpy as np

def sigmoid(z):
    """
    Compute the sigmoid of z
    Args:
      z (ndarray): A scalar or numpy array of any size.
    Returns:
      g (ndarray): sigmoid(z), with the same shape as z
    """
    g = 1 / (1 + np.exp(-z))
    return g
def compute_cost_logistic(X, y, w, b):
    """
    Computes cost
    Args:
      X (ndarray (m,n)): Data, m examples with n features
      y (ndarray (m,)) : target values
      w (ndarray (n,)) : model parameters
      b (scalar)       : model parameter
    Returns:
      cost (scalar): cost
    """
    m = X.shape[0]
    total_cost = 0
    f_w = sigmoid(np.dot(X, w) + b)
    for i in range(m):
        temp_cost = -y[i] * np.log(f_w[i]) - (1 - y[i]) * np.log(1 - f_w[i])
        total_cost += temp_cost
    total_cost = total_cost / m
    return total_cost
def compute_gradient_logistic(X, y, w, b):
    """
    Computes the gradient for logistic regression
    Args:
      X (ndarray (m,n)): Data, m examples with n features
      y (ndarray (m,)) : target values
      w (ndarray (n,)) : model parameters
      b (scalar)       : model parameter
    Returns
      dj_db (scalar)      : The gradient of the cost w.r.t. the parameter b.
      dj_dw (ndarray (n,)): The gradient of the cost w.r.t. the parameters w.
    """
    m, n = X.shape
    dj_dw = np.zeros((n,))
    dj_db = 0.

    err = sigmoid(np.dot(X, w) + b) - y
    for j in range(n):
        for i in range(m):
            dj_dw[j] += err[i] * X[i, j]
    dj_dw = dj_dw / m
    dj_db = np.sum(err) / m

    return dj_db, dj_dw

Neural Network

Matrix Multiply


  • $\mathbf{x}_i^T$: a single example in the training dataset
  • $\mathbf{W}$: the parameter matrix
  • $\mathbf{Z} = \mathbf{X}\mathbf{W} + \mathbf{B}$
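
A minimal NumPy sketch of this layer computation (the example matrices and their values here are illustrative, not from the original notes):

import numpy as np

X = np.array([[200, 17.0],        # each row x_i^T is one training example
              [120,  5.0]])
W = np.array([[ 1, -3,  5],       # each column holds the weights of one neuron
              [-2,  4, -6]])
B = np.array([[-1, 1, 2]])        # one bias per neuron, broadcast over the rows of X

Z = np.matmul(X, W) + B           # Z = XW + B, shape (2 examples, 3 neurons)

Applying the activation function element-wise to $\mathbf{Z}$ then gives the layer's output.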

Neurons And Layers

Neuron without activation - Regression/Linear Model

Each neuron implements the linear model:
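
$$f_{\mathbf{w},b}(\mathbf{x}) = \mathbf{w} \cdot \mathbf{x} + b$$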

  • suitable for $y \in (-\infty , + \infty)$

Neuron with Sigmoid activation

Each neuron applies the sigmoid function:
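
$$a = g(z) = \frac{1}{1+e^{-z}}$$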

where
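
$$z = \mathbf{w} \cdot \mathbf{x} + b$$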

  • suitable for binary classification, $y = 0$ or $y = 1$

Neurons with ReLU activation (Rectified Linear Unit)

Each neuron applies the ReLU function:
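
$$a = \max(0, z)$$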

where
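
$$z = \mathbf{w} \cdot \mathbf{x} + b$$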

  • suitable for $y \ge 0$
  • it is strongly recommended to use the ReLU activation for hidden neurons


Mathematical form of Neural Network
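
The $j^{th}$ neuron in layer $l$ computes:

$$a_j^{[l]} = g\left( \vec{w}_j^{[l]} \cdot \vec{a}^{[l-1]} + b_j^{[l]} \right)$$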

  • $\vec{x}$: can also be written as $\vec{a}^{[0]}$; it is the input of the neural network
  • $a_j^{[l]}$: the $j^{th}$ neuron in the $l^{th}$ layer
  • $\vec{a}^{[l]}$: the output of the $l^{th}$ layer
  • $\vec{w}_j^{[l]}$: the parameter vector of the $j^{th}$ neuron in the $l^{th}$ layer
  • $g(x)$: the activation function

Build Binary Classification Neural Network

import numpy as np
import tensorflow as tf
from tensorflow.keras.layers import Dense, Input
from tensorflow.keras import Sequential
from tensorflow.keras.activations import linear, relu, sigmoid
from tensorflow.keras.losses import BinaryCrossentropy, MeanSquaredError

# create a network: L1,L2,L3,...
# each layer has a certain number of neurons: 25,10,2,...
# Dense is a common layer type
# the activation function can be linear, sigmoid or relu
model = Sequential(
    [
        tf.keras.Input(shape=(2,)),
        tf.keras.layers.Dense(units=25, activation="sigmoid", name="L1"),
        tf.keras.layers.Dense(units=10, activation="sigmoid", name="L2"),
        tf.keras.layers.Dense(units=1, activation="sigmoid", name="L3"),
        ...
    ]
)

X = np.array([[4, 5],
              [1, 8],
              [6, 8],
              [6, 6],
              [7, 1],
              [6, 5]], dtype=np.float32)
Y = np.array([[0],
              [0],
              [0],
              [1],
              [1],
              [1]], dtype=np.float32)
# input X,Y must be 2-D matrices

# create a "Normalization" layer to normalize X
norm_l = tf.keras.layers.Normalization(axis=-1)
norm_l.adapt(X)   # learns the mean and variance of the dataset
Xn = norm_l(X)

L1 = model.get_layer("L1")    # fetch a layer by name
print(L1.get_weights())       # returns a list of numpy arrays [w, b]
# >>> e.g. for a layer with one input and one unit:
# [array([[0.54]], dtype=float32), array([0.], dtype=float32)]

w, b = L1.get_weights()
print(w, b)               # >>> [[0.54]] [0.]
print(w.shape, b.shape)   # >>> (1, 1) (1,)
# w is a 2-D matrix of shape (n_inputs, n_units); b is a vector of shape (n_units,)

set_w = np.array([[2]])   # w must be a 2-D matrix matching the layer's shape
set_b = np.array([-4.5])
L1.set_weights([set_w, set_b])   # set weights manually


model.summary()
'''
shows the layers and number of parameters in the model, e.g. for a single Dense(25) layer:
Model: "sequential_3"
┏━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━┓
┃ Layer (type)      ┃ Output Shape           ┃ Param #     ┃
┡━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━┩
│ L1 (Dense)        │ (None, 25)             │ 75          │
└───────────────────┴────────────────────────┴─────────────┘
Total params: 75 (300.00 B)
Trainable params: 75 (300.00 B)
Non-trainable params: 0 (0.00 B)
'''

# the loss function can be BinaryCrossentropy or MeanSquaredError
# (i.e. binary cross-entropy or mean squared error)
model.compile(
    loss=tf.keras.losses.BinaryCrossentropy(),
    optimizer=tf.keras.optimizers.Adam(learning_rate=0.01),
)  # configure the model for training

model.fit(
    Xn, Y,
    epochs=100,
)  # run gradient descent to fit the model

model.predict(new_X)   # new_X: new examples to predict on
a1 = L1(Xn)
print(a1)
'''
>>> tf.Tensor(
[[0.5  0.5  0.5  ... ]
 [0.45 0.35 0.55 ...]
 [0.41 0.23 0.59 ...]
 [0.36 0.14 0.63 ...]
 [0.32 0.08 0.68 ...]
 [0.28 0.05 0.71 ...]], shape=(6, 25), dtype=float32)

the output is a Tensor; each row is the activation vector $\vec{a}^{[1]}$ for one example
'''

Multiclass Classification

Softmax Function

The softmax function can be written:
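
$$a_j = \frac{e^{z_j}}{\sum_{k=1}^{N} e^{z_k}}, \qquad j = 1, \ldots, N$$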

The output $\mathbf{a}$ is a vector of length N, so for softmax regression, you could also write:
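
$$\mathbf{a}(\mathbf{x}) =
\begin{bmatrix}
P(y=1 \mid \mathbf{x}; \mathbf{w},b) \\
\vdots \\
P(y=N \mid \mathbf{x}; \mathbf{w},b)
\end{bmatrix}
= \frac{1}{\sum_{k=1}^{N} e^{z_k}}
\begin{bmatrix}
e^{z_1} \\
\vdots \\
e^{z_N}
\end{bmatrix}$$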

Loss Function

The loss function associated with softmax, called the sparse categorical cross-entropy loss, is:
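
$$loss(\mathbf{a}, y) = -\log(a_y)$$

that is, the negative log of the probability the model assigns to the true category $y$.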

Cost Function

To write the cost function, we need to define an indicator function first:
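
$$\mathbf{1}\{y == n\} =
\begin{cases}
1 & \text{if } y = n \\
0 & \text{otherwise}
\end{cases}$$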

Now the cost is:
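
$$J(\mathbf{w},b) = -\frac{1}{m} \sum_{i=1}^{m} \sum_{j=1}^{N} \mathbf{1}\{y^{(i)} == j\} \log\left( \frac{e^{z^{(i)}_j}}{\sum_{k=1}^{N} e^{z^{(i)}_k}} \right)$$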

Where $m$ is the number of examples, $N$ is the number of outputs.

Build Multiclass Classification Neural Network

import numpy as np
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Input
from tensorflow.keras.activations import linear, relu, sigmoid
from sklearn.datasets import make_blobs

'''
generate clustered samples with make_blobs()
parameters:
  n_samples   : number of samples [int], default=100
  n_features  : dimension of samples [int], default=2
  centers     : number of clusters [int] or the (x,y) center of each cluster [vector]
  cluster_std : standard deviation [float or vector]
  random_state: random seed
'''
centers = [[-5, 2], [-2, -2], [1, 2], [5, -2]]
X_train, y_train = make_blobs(n_samples=2000, centers=centers, cluster_std=1.0, random_state=30)

model = Sequential(
    [
        tf.keras.Input(shape=(2,)),      # two input features per example
        Dense(25, activation='relu'),
        Dense(15, activation='relu'),
        Dense(4, activation='linear')    # <-- Attention!
    ]
)
model.compile(
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),  # <-- Attention
    optimizer=tf.keras.optimizers.Adam(learning_rate=0.001),
)
# writing the code this way is highly recommended to reduce floating-point error:
# the output layer is linear, and the softmax is applied inside the loss function


# The .fit method (https://www.tensorflow.org/api_docs/python/tf/keras/Model) returns a variety
# of metrics including the loss. This is captured in the `history` variable below and can be
# used to examine the loss in a plot as in the commented code.
history = model.fit(
    X_train, y_train,
    epochs=10
)
'''
fig, ax = plt.subplots(1, 1, figsize=(4, 3))
ax.plot(history.history['loss'], label='loss')
ax.set_ylim([0, 2])
ax.set_xlabel('Epoch')
ax.set_ylabel('loss (cost)')
ax.legend()
ax.grid(True)
plt.show()
'''


p = model.predict(X_train)           # the raw outputs (logits) are not probabilities
results = tf.nn.softmax(p).numpy()   # the outputs must still be processed by a softmax
for i in range(len(results)):
    print(f"{results[i]}, category: {np.argmax(results[i])}")

Optimizer: Adam

Adaptive Moment Estimation

Adam is a combination of Momentum and RMSprop.

  1. Initialization

    • $m_t=0$: first moment, an exponentially weighted average of the gradients
    • $v_t=0$: second moment, an exponentially weighted average of the squared gradients
    • $\alpha=0.001$: learning rate
    • $\beta_1=0.9$: decay factor for the first moment
    • $\beta_2=0.999$: decay factor for the second moment
    • $\epsilon=10^{-8}$: smoothing term to avoid division by zero
  2. Compute the gradient of the loss with respect to the current parameters, where $g_t$ denotes the gradient at iteration $t$ and $\theta$ stands for a generic parameter (such as $w$ or $b$):

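$$g_t = \nabla_{\theta} J(\theta_{t-1})$$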
  3. Update the first and second moments

    Expanding the expressions below shows that older gradients $g_t$ receive smaller weights, while more recent gradients receive larger weights:

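$$m_t = \beta_1 m_{t-1} + (1 - \beta_1) g_t$$

$$v_t = \beta_2 v_{t-1} + (1 - \beta_2) g_t^2$$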
  4. Bias correction

    Early in training, $m_t$ and $v_t$ are initialized to zero while the gradients are not, so the first and second moment estimates are biased significantly toward zero and need to be corrected. Here $\beta^t$ denotes $\beta$ raised to the power $t$ at iteration $t$:

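$$\hat{m}_t = \frac{m_t}{1 - \beta_1^t}, \qquad \hat{v}_t = \frac{v_t}{1 - \beta_2^t}$$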
  5. Update the parameters:
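
$$\theta_t = \theta_{t-1} - \alpha \frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon}$$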

Tuning Models

Split Dataset

| data | % of total | Description |
| --- | --- | --- |
| training | 60 | Data used to tune model parameters $w$ and $b$ in training or fitting |
| cross-validation | 20 | Data used to tune other model parameters like degree (max power of $x$) of a polynomial, regularization, or the architecture of a neural network |
| test | 20 | Data used to test the model after tuning to gauge performance on new data |
from sklearn.model_selection import train_test_split
X_train, X_, y_train, y_ = train_test_split(X,y,test_size=0.40)
X_cv, X_test, y_cv, y_test = train_test_split(X_,y_,test_size=0.50)

Polynomial Regression

Finding the optimal degree

Choose the model with the lowest cross-validation error, even if it does not have the lowest training error (this avoids overfitting).


Tuning Regularization

Choose the regularization parameter that gives the lowest cross-validation error.


Increasing Training Set Size

When a model is overfitting (high variance), collecting additional data can improve performance.

Neural Network

  • a simple model (fewer layers, fewer neurons) usually has a slightly higher classification error on the training data, but does better on cross-validation data than a more complex model
  • apply regularization to moderate the impact of a more complex model!
model_r = Sequential(
    [
        Dense(120, activation='relu', kernel_regularizer=tf.keras.regularizers.l2(0.1)),
        Dense(40, activation='relu', kernel_regularizer=tf.keras.regularizers.l2(0.1)),
        Dense(6, activation='linear', name="L3")
    ]
)

Evaluating on Skewed Datasets

$y=1$ denotes the rare class we want to detect.

|  | Actual Class: 1 | Actual Class: 0 |
| --- | --- | --- |
| Predicted Class: 1 | True Positive | False Positive |
| Predicted Class: 0 | False Negative | True Negative |
  1. Precision
  2. Recall
  3. Trading off between precision and recall:
    we can define an F1 score that combines these two metrics (see the formulas below). A model with a larger F1 score does better.
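
In terms of the confusion-matrix counts above (TP, FP, FN):

$$\text{Precision} = \frac{TP}{TP + FP}, \qquad \text{Recall} = \frac{TP}{TP + FN}, \qquad F_1 = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}$$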

Decision Trees

One Hot Encoding

If a categorical feature can take on $k$ discrete values, create $k$ binary (0 or 1 valued) features.

Entropy

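$$H(p_1) = -p_1 \log_2(p_1) - (1 - p_1) \log_2(1 - p_1)$$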
  • Note
    • $p_1$ stands for the fraction of label $y=1$ in a node
    • The log is calculated with base $2$
    • For implementation purposes, $0 \text{log}_2(0) = 0$. That is, if $p_1 = 0$ or $p_1 = 1$, set the entropy to 0
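
A minimal Python sketch of this definition (the function name and the use of NumPy are illustrative choices, not from the original notes):

import numpy as np

def entropy(p1):
    # treat 0*log2(0) as 0, so p1 = 0 or p1 = 1 gives zero entropy
    if p1 == 0 or p1 == 1:
        return 0.0
    return -p1 * np.log2(p1) - (1 - p1) * np.log2(1 - p1)

For example, entropy(0.5) returns 1.0, the maximum impurity, while entropy(0) and entropy(1) return 0.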

Information Gain

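$$\text{Information Gain} = H(p_1^\text{node}) - \left( w^{\text{left}} H(p_1^\text{left}) + w^{\text{right}} H(p_1^\text{right}) \right)$$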
where

  • $H(p_1^\text{node})$ is entropy at the node
  • $H(p_1^\text{left})$ and $H(p_1^\text{right})$ are the entropies at the left and the right branches resulting from the split
  • $w^{\text{left}}$ and $w^{\text{right}}$ are the proportion of examples at the left and right branch respectively

Steps to Build a Decision Tree

  1. Start with all examples at the root node
  2. Calculate information gain for splitting on all possible features, and pick the one with the highest information gain
  3. Split dataset according to the selected feature, and create left and right branches of the tree
  4. Keep repeating the splitting process until a stopping criterion is met, such as reaching a maximum depth, the number of examples in a node falling below a threshold, or the information gain becoming very small…

Tree Ensemble

Bagged Decision Trees (Randomizing Samples)

  1. Given a training set of size m;
  2. Use sampling with replacement to create a new training set of size m;
  3. Train a tree on the new dataset;
  4. Repeat steps 2 and 3 n times, so we have a bag of n trees;
  5. Let all the trees vote on the result.

Random Forest (Randomizing Samples and Feature Choice)

  • At each node, when choosing a feature to use to split, if n features are available, pick a random subset of $k < n$ features and allow the algorithm to only choose from that subset of features
  • This makes the trees more diverse, improves computing speed, and decreases overfitting.
  • Usually take $k=\sqrt{n}$