import math, copy
import numpy as np

def compute_cost(x, y, w, b):
    """
    Computes the cost function for linear regression.
    Args:
      x (ndarray): Data, m examples
      y (ndarray): target values
      w,b (scalar): model parameters
    Returns:
      total_cost (float): The cost of using w,b as the parameters for linear regression
                          to fit the data points in x and y
    """
    # number of training examples
    m = x.shape[0]

    cost_sum = 0
    for i in range(m):
        f_wb = w * x[i] + b
        cost = (f_wb - y[i]) ** 2
        cost_sum = cost_sum + cost
    total_cost = (1 / (2 * m)) * cost_sum

    return total_cost
def compute_gradient(x, y, w, b):
    """
    Computes the gradient for linear regression.
    Args:
      x (ndarray): Data, m examples
      y (ndarray): target values
      w,b (scalar): model parameters
    Returns:
      dj_dw (scalar): The gradient of the cost w.r.t. the parameter w
      dj_db (scalar): The gradient of the cost w.r.t. the parameter b
    """
    # number of training examples
    m = x.shape[0]
    dj_dw = 0
    dj_db = 0

    for i in range(m):
        f_wb = w * x[i] + b
        dj_dw_i = (f_wb - y[i]) * x[i]
        dj_db_i = f_wb - y[i]
        dj_db += dj_db_i
        dj_dw += dj_dw_i
    dj_dw = dj_dw / m
    dj_db = dj_db / m

    return dj_dw, dj_db
def gradient_descent(x, y, w_0, b_0, alpha, num_iters):
    """
    Performs gradient descent to fit w,b. Updates w,b by taking
    num_iters gradient steps with learning rate alpha.
    Args:
      x (ndarray): Data, m examples
      y (ndarray): target values
      w_0,b_0 (scalar): initial values of model parameters
      alpha (float): learning rate
      num_iters (int): number of iterations to run gradient descent
    Returns:
      w (scalar): updated value of parameter after running gradient descent
      b (scalar): updated value of parameter after running gradient descent
      J_history (list): history of cost values
      p_history (list): history of parameters [w,b]
    """
    w = copy.deepcopy(w_0)  # avoid modifying global w_0
    b = b_0
    # Arrays to store cost J and parameters at each iteration, primarily for graphing later
    J_history = []
    p_history = []

    for i in range(num_iters):
        # Calculate the gradient
        dj_dw, dj_db = compute_gradient(x, y, w, b)

        # Update parameters
        b = b - alpha * dj_db
        w = w - alpha * dj_dw

        # Save cost J at each iteration
        if i < 100000:      # prevent resource exhaustion
            J_history.append(compute_cost(x, y, w, b))
            p_history.append([w, b])
        # Print cost at intervals of num_iters/10 (i.e. 10 times, or every iteration if num_iters < 10)
        if (i % math.ceil(num_iters / 10)) == 0:
            print(f"Iteration {i:4}: Cost {J_history[-1]:0.2e} ",
                  f"dj_dw: {dj_dw: 0.3e}, dj_db: {dj_db: 0.3e} ",
                  f"w: {w: 0.3e}, b: {b: 0.5e}")

    return w, b, J_history, p_history   # return w,b and J,p history for graphing
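A minimal usage sketch for the routines above; the toy data, initial parameters, and learning rate are illustrative, not from the original notes:

import numpy as np

# toy data roughly following y = 2x + 1
x_train = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y_train = np.array([3.1, 4.9, 7.2, 9.0, 10.8])

# run gradient descent from w=0, b=0
w_final, b_final, J_hist, p_hist = gradient_descent(
    x_train, y_train, w_0=0.0, b_0=0.0, alpha=1.0e-2, num_iters=10000)
print(f"(w, b) found by gradient descent: ({w_final:8.4f}, {b_final:8.4f})")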
Multiple Variables Linear Regression
Model
The model's prediction with multiple variables is given by the linear model:

$$f_{\mathbf{w},b}(\mathbf{x}) = w_0x_0 + w_1x_1 + \cdots + w_{n-1}x_{n-1} + b$$

or in vector notation:

$$f_{\mathbf{w},b}(\mathbf{x}) = \mathbf{w} \cdot \mathbf{x} + b$$

where $\cdot$ is the vector dot product.
x = np.array([1, 2, 3])
y = np.array([4, 5, 6])
f = np.dot(w, x) + b   # we can use numpy dot to calculate the prediction
Examples are stored in a NumPy matrix $\mathbf{X}$. Each row of the matrix represents one example. With $m$ training examples and $n$ features, $\mathbf{X}$ is a matrix with dimensions ($m$, $n$) ($m$ rows, $n$ columns); see the short NumPy sketch after the notation list below.
notation:
$\mathbf{x}^{(i)}$ is a vector containing example $i$: $\mathbf{x}^{(i)} = (x^{(i)}_0, x^{(i)}_1, \cdots, x^{(i)}_{n-1})$
$x^{(i)}_j$ is element $j$ in example $i$. The superscript in parentheses indicates the example number, while the subscript represents an element.
$\mathbf{w}$ is a vector with $n$ elements.
Each element contains the parameter associated with one feature.
notionally, we draw this as a column vector
$b$ is a scalar parameter.
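For concreteness, here is a small NumPy sketch of this notation; the values (m = 3 examples, n = 2 features) are illustrative, not from the original notes:

import numpy as np

X = np.array([[2104, 5],      # x^(0)
              [1416, 3],      # x^(1)
              [852,  2]])     # x^(2)
w = np.array([0.39, 18.75])   # one parameter per feature
b = 785.18                    # scalar parameter

print(X.shape)              # (3, 2) -> (m, n)
print(X[1])                 # x^(1), the second example (a length-n vector)
print(X[1, 0])              # x^(1)_0, feature 0 of example 1
print(np.dot(X[1], w) + b)  # prediction f_wb(x^(1))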
Cost Function
The equation for the cost function with multiple variables $J(\mathbf{w},b)$ is:

$$J(\mathbf{w},b) = \frac{1}{2m} \sum_{i=0}^{m-1} \left( f_{\mathbf{w},b}(\mathbf{x}^{(i)}) - y^{(i)} \right)^2$$

where:

$$f_{\mathbf{w},b}(\mathbf{x}^{(i)}) = \mathbf{w} \cdot \mathbf{x}^{(i)} + b$$
Gradient Descent
Gradient descent for multiple variables:

$$\text{repeat until convergence:} \; \lbrace$$
$$\quad w_j := w_j - \alpha \frac{\partial J(\mathbf{w},b)}{\partial w_j} \quad \text{for } j = 0..n-1$$
$$\quad b := b - \alpha \frac{\partial J(\mathbf{w},b)}{\partial b}$$
$$\rbrace$$

where $n$ is the number of features, the parameters $w_j$, $b$ are updated simultaneously, and

$$\frac{\partial J(\mathbf{w},b)}{\partial w_j} = \frac{1}{m} \sum_{i=0}^{m-1} \left( f_{\mathbf{w},b}(\mathbf{x}^{(i)}) - y^{(i)} \right) x_j^{(i)}$$
$$\frac{\partial J(\mathbf{w},b)}{\partial b} = \frac{1}{m} \sum_{i=0}^{m-1} \left( f_{\mathbf{w},b}(\mathbf{x}^{(i)}) - y^{(i)} \right)$$
m is the number of training examples in the data set
$f_{\mathbf{w},b}(\mathbf{x}^{(i)})$ is the model’s prediction, while $y^{(i)}$ is the target value
For a linear regression model, $f_{\mathbf{w},b}(\mathbf{x}^{(i)}) = \mathbf{w} \cdot \mathbf{x}^{(i)} + b$
Feature Scaling
Mean Normalization

$$x_j := \dfrac{x_j - \mu_j}{\max(x_j) - \min(x_j)}$$

Z-Score Normalization

$$x_j := \dfrac{x_j - \mu_j}{\sigma_j}$$

where $j$ selects a feature or a column in the $\mathbf{X}$ matrix, $\mu_j$ is the mean of all the values for feature ($j$), and $\sigma_j$ is the standard deviation of feature ($j$).
Code in Python
def compute_cost(X, y, w, b):
    """
    Args:
      X (ndarray (m,n)): Data, m examples with n features
      y (ndarray (m,)) : target values
      w (ndarray (n,)) : model parameters
      b (scalar)       : model parameter
    Returns:
      cost (scalar): cost
    """
    m = X.shape[0]
    cost = 0.0
    for i in range(m):
        f_wb_i = np.dot(X[i], w) + b
        cost = cost + (f_wb_i - y[i]) ** 2
    cost = cost / (2 * m)
    return cost
def compute_gradient(X, y, w, b):
    """
    Args:
      X (ndarray (m,n)): Data, m examples with n features
      y (ndarray (m,)) : target values
      w (ndarray (n,)) : model parameters
      b (scalar)       : model parameter
    Returns:
      dj_dw (ndarray (n,)): The gradient of the cost w.r.t. the parameters w.
      dj_db (scalar)      : The gradient of the cost w.r.t. the parameter b.
    """
    m, n = X.shape          # (number of examples, number of features)
    dj_dw = np.zeros((n,))
    dj_db = 0.0

    for i in range(m):
        err = (np.dot(X[i], w) + b) - y[i]
        for j in range(n):
            dj_dw[j] = dj_dw[j] + err * X[i, j]
        dj_db = dj_db + err
    dj_dw = dj_dw / m
    dj_db = dj_db / m

    return dj_db, dj_dw
def gradient_descent(X, y, w_in, b_in, alpha, num_iters):
    """
    Args:
      X (ndarray (m,n))   : Data, m examples with n features
      y (ndarray (m,))    : target values
      w_in (ndarray (n,)) : initial model parameters
      b_in (scalar)       : initial model parameter
      alpha (float)       : learning rate
      num_iters (int)     : number of iterations to run gradient descent
    Returns:
      w (ndarray (n,)) : updated values of parameters
      b (scalar)       : updated value of parameter
    """
    # An array to store cost J at each iteration, primarily for graphing later
    J_history = []
    w = copy.deepcopy(w_in)   # avoid modifying global w within function
    b = b_in

    for i in range(num_iters):
        # Calculate the gradient
        dj_db, dj_dw = compute_gradient(X, y, w, b)

        # Update parameters using w, b, alpha and the gradient
        w = w - alpha * dj_dw
        b = b - alpha * dj_db

        # Save cost J at each iteration
        if i < 100000:        # prevent resource exhaustion
            J_history.append(compute_cost(X, y, w, b))

        # Print cost at intervals of num_iters/10 (i.e. 10 times, or every iteration if num_iters < 10)
        if (i % math.ceil(num_iters / 10)) == 0:
            print(f"Iteration {i:4d}: Cost {J_history[-1]:8.2f}")

    return w, b, J_history    # return final w,b and J history
import numpy as np

def zscore_normalize_features(X):
    """
    Computes X, z-score normalized by column.
    Args:
      X (ndarray): Shape (m,n) input data, m examples, n features
    Returns:
      X_norm (ndarray): Shape (m,n) input normalized by column
      mu (ndarray)    : Shape (n,)  mean of each feature
      sigma (ndarray) : Shape (n,)  standard deviation of each feature
    """
    mu = np.mean(X, axis=0)
    sigma = np.std(X, axis=0)
    X_norm = (X - mu) / sigma
    return X_norm, mu, sigma
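A brief usage sketch; the `X_train` values below are illustrative:

X_train = np.array([[2104, 5, 1, 45],
                    [1416, 3, 2, 40],
                    [852,  2, 1, 35]], dtype=np.float64)

X_norm, mu, sigma = zscore_normalize_features(X_train)
print(f"mean of each feature:    {mu}")
print(f"std dev of each feature: {sigma}")
print(f"normalized first row:    {X_norm[0]}")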
Logistic Regression
Sigmoid Function

$$g(z) = \frac{1}{1+e^{-z}}$$

A logistic regression model applies the sigmoid to the linear regression model as shown below:

$$f_{\mathbf{w},b}(\mathbf{x}) = g(\mathbf{w} \cdot \mathbf{x} + b)$$

where $g(z)$ is the sigmoid shown above.
Therefore, to get a final prediction ($y=0$ or $y=1$) from the logistic regression model, we can use the following
if $f_{\mathbf{w},b}(x) \ge 0.5$, predict $y=1$
if $f_{\mathbf{w},b}(x) < 0.5$, predict $y=0$
$\mathbf{w} \cdot \mathbf{x} + b = 0$ is the decision boundary. It can be linear or non-linear, depending on what kinds of features you use. If you use only linear features, it will be a line; if you include features like $x^2, x^3, \ldots$, it may be a complex curve.
Loss Function
$loss(f_{\mathbf{w},b}(\mathbf{x}^{(i)}), y^{(i)})$ is the cost for a single data point, which is:

$$loss(f_{\mathbf{w},b}(\mathbf{x}^{(i)}), y^{(i)}) = \begin{cases} -\log\left(f_{\mathbf{w},b}(\mathbf{x}^{(i)})\right) & \text{if } y^{(i)} = 1 \\ -\log\left(1 - f_{\mathbf{w},b}(\mathbf{x}^{(i)})\right) & \text{if } y^{(i)} = 0 \end{cases}$$
$f_{\mathbf{w},b}(\mathbf{x}^{(i)})$ is the model’s prediction, while $y^{(i)}$ is the target value.
$f_{\mathbf{w},b}(\mathbf{x}^{(i)}) = g(\mathbf{w} \cdot\mathbf{x}^{(i)}+b)$ where function $g$ is the sigmoid function.
It can also be written as:

$$loss(f_{\mathbf{w},b}(\mathbf{x}^{(i)}), y^{(i)}) = -y^{(i)} \log\left(f_{\mathbf{w},b}(\mathbf{x}^{(i)})\right) - (1 - y^{(i)}) \log\left(1 - f_{\mathbf{w},b}(\mathbf{x}^{(i)})\right)$$
Cost Function

$$J(\mathbf{w},b) = \frac{1}{m} \sum_{i=0}^{m-1} loss(f_{\mathbf{w},b}(\mathbf{x}^{(i)}), y^{(i)})$$
Logistic Gradient Descent

$$\text{repeat until convergence:} \; \lbrace$$
$$\quad w_j := w_j - \alpha \frac{\partial J(\mathbf{w},b)}{\partial w_j} \quad \text{for } j = 0..n-1$$
$$\quad b := b - \alpha \frac{\partial J(\mathbf{w},b)}{\partial b}$$
$$\rbrace$$

where each iteration performs simultaneous updates on $w_j$ for all $j$, and

$$\frac{\partial J(\mathbf{w},b)}{\partial w_j} = \frac{1}{m} \sum_{i=0}^{m-1} \left( f_{\mathbf{w},b}(\mathbf{x}^{(i)}) - y^{(i)} \right) x_j^{(i)}$$
$$\frac{\partial J(\mathbf{w},b)}{\partial b} = \frac{1}{m} \sum_{i=0}^{m-1} \left( f_{\mathbf{w},b}(\mathbf{x}^{(i)}) - y^{(i)} \right)$$
m is the number of training examples in the data set
$f_{\mathbf{w},b}(x^{(i)})$ is the model’s prediction, while $y^{(i)}$ is the target
For a logistic regression model, $z = \mathbf{w} \cdot \mathbf{x} + b$ and $f_{\mathbf{w},b}(x) = g(z)$, where $g(z)$ is the sigmoid function: $g(z) = \frac{1}{1+e^{-z}}$
Overfitting And Regularization
To address the overfitting problem, we can add a regularization term to the cost function.

In linear regression, it will be:

$$J(\mathbf{w},b) = \frac{1}{2m} \sum_{i=0}^{m-1} \left( f_{\mathbf{w},b}(\mathbf{x}^{(i)}) - y^{(i)} \right)^2 + \frac{\lambda}{2m} \sum_{j=0}^{n-1} w_j^2$$

In logistic regression, it will be:

$$J(\mathbf{w},b) = \frac{1}{m} \sum_{i=0}^{m-1} \left[ -y^{(i)} \log\left(f_{\mathbf{w},b}(\mathbf{x}^{(i)})\right) - (1 - y^{(i)}) \log\left(1 - f_{\mathbf{w},b}(\mathbf{x}^{(i)})\right) \right] + \frac{\lambda}{2m} \sum_{j=0}^{n-1} w_j^2$$

And we can calculate the derivative terms when performing gradient descent as follows:

$$\frac{\partial J(\mathbf{w},b)}{\partial w_j} = \frac{1}{m} \sum_{i=0}^{m-1} \left( f_{\mathbf{w},b}(\mathbf{x}^{(i)}) - y^{(i)} \right) x_j^{(i)} + \frac{\lambda}{m} w_j$$
$$\frac{\partial J(\mathbf{w},b)}{\partial b} = \frac{1}{m} \sum_{i=0}^{m-1} \left( f_{\mathbf{w},b}(\mathbf{x}^{(i)}) - y^{(i)} \right)$$
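As a sketch (not from the original notes), the regularized cost and gradient for linear regression can be coded as below; `lambda_` is the regularization parameter and the vectorized form is an assumption for brevity:

import numpy as np

def compute_cost_linear_reg(X, y, w, b, lambda_=1.0):
    """Regularized linear regression cost; the (lambda_/(2m)) * sum(w_j^2) term penalizes large weights."""
    m = X.shape[0]
    err = X @ w + b - y                               # (m,) vector of prediction errors
    cost = np.sum(err ** 2) / (2 * m)                 # unregularized cost
    reg_cost = (lambda_ / (2 * m)) * np.sum(w ** 2)   # regularization term (does not include b)
    return cost + reg_cost

def compute_gradient_linear_reg(X, y, w, b, lambda_=1.0):
    """Regularized gradient: the extra (lambda_/m) * w_j term comes from the regularization term."""
    m = X.shape[0]
    err = X @ w + b - y
    dj_dw = (X.T @ err) / m + (lambda_ / m) * w   # (n,)
    dj_db = np.sum(err) / m                       # scalar; b is not regularized
    return dj_db, dj_dw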
Code in Python
import numpy as np

def sigmoid(z):
    """
    Compute the sigmoid of z.
    Args:
      z (ndarray): A scalar or numpy array of any size.
    Returns:
      g (ndarray): sigmoid(z), with the same shape as z
    """
    g = 1 / (1 + np.exp(-z))
    return g
def compute_cost_logistic(X, y, w, b):
    """
    Computes cost for logistic regression.
    Args:
      X (ndarray (m,n)): Data, m examples with n features
      y (ndarray (m,)) : target values
      w (ndarray (n,)) : model parameters
      b (scalar)       : model parameter
    Returns:
      total_cost (scalar): cost
    """
    m = X.shape[0]
    total_cost = 0
    f_w = sigmoid(np.dot(X, w) + b)
    for i in range(m):
        temp_cost = -y[i] * np.log(f_w[i]) - (1 - y[i]) * np.log(1 - f_w[i])
        total_cost += temp_cost
    total_cost = total_cost / m
    return total_cost
def compute_gradient_logistic(X, y, w, b):
    """
    Computes the gradient for logistic regression.
    Args:
      X (ndarray (m,n)): Data, m examples with n features
      y (ndarray (m,)) : target values
      w (ndarray (n,)) : model parameters
      b (scalar)       : model parameter
    Returns:
      dj_dw (ndarray (n,)): The gradient of the cost w.r.t. the parameters w.
      dj_db (scalar)      : The gradient of the cost w.r.t. the parameter b.
    """
    m, n = X.shape
    dj_dw = np.zeros((n,))
    dj_db = 0.

    err = sigmoid(np.dot(X, w) + b) - y   # (m,) vector of prediction errors
    for j in range(n):
        for i in range(m):
            dj_dw[j] += err[i] * X[i, j]
    dj_dw = dj_dw / m
    dj_db = np.sum(err) / m

    return dj_db, dj_dw
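A short usage sketch combining the two logistic routines in a plain gradient-descent loop; the toy data and hyperparameters are illustrative:

X_tmp = np.array([[0.5, 1.5], [1.0, 1.0], [1.5, 0.5],
                  [3.0, 0.5], [2.0, 2.0], [1.0, 2.5]])
y_tmp = np.array([0, 0, 0, 1, 1, 1])

w_tmp = np.zeros(X_tmp.shape[1])
b_tmp = 0.0
alpha = 0.1

for it in range(1000):
    dj_db, dj_dw = compute_gradient_logistic(X_tmp, y_tmp, w_tmp, b_tmp)
    w_tmp = w_tmp - alpha * dj_dw
    b_tmp = b_tmp - alpha * dj_db

print(f"final cost: {compute_cost_logistic(X_tmp, y_tmp, w_tmp, b_tmp):0.3f}")
print(f"w: {w_tmp}, b: {b_tmp:0.3f}")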
Neural Network
Matrix Multiply
$\mathbf{x}_i^T$: a single example in the training dataset (one row of $\mathbf{X}$)

$\mathbf{W}$: the parameter matrix (one column of parameters per unit)

$\mathbf{Z} = \mathbf{X}\mathbf{W} + \mathbf{B}$
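A tiny NumPy illustration of this matrix form; the shapes (m = 4 examples, n = 2 input features, 3 units) are illustrative:

import numpy as np

X = np.array([[1., 2.],
              [3., 4.],
              [5., 6.],
              [7., 8.]])          # (m, n) = (4, 2): each row is one example x_i^T
W = np.array([[0.1, 0.2, 0.3],
              [0.4, 0.5, 0.6]])   # (n, units) = (2, 3): one column of parameters per unit
b = np.array([1., 2., 3.])        # (units,), broadcast across the m rows

Z = X @ W + b                     # Z = XW + B, shape (4, 3)
print(Z.shape)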
Neurons And Layers
Neuron without activation - Regression/Linear Model
Each neuron performs one linear model, as follows:

$$f_{\mathbf{w},b}(\mathbf{x}) = \mathbf{w} \cdot \mathbf{x} + b$$

suitable for $y \in (-\infty, +\infty)$
Neuron with Sigmoid activation
Each neuron applies the sigmoid function:

$$a = g(\mathbf{w} \cdot \mathbf{x} + b)$$

where

$$g(z) = \frac{1}{1+e^{-z}}$$

suitable for binary classification, $y = 0 \text{ or } 1$
Neurons with ReLU activation (Rectified Linear Unit)
Each neuron applies the ReLU function:

$$a = g(\mathbf{w} \cdot \mathbf{x} + b)$$

where

$$g(z) = \max(0, z)$$

suitable for $y \ge 0$
It is strongly recommended to apply the ReLU activation in hidden neurons.
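Minimal NumPy sketches of the three activations mentioned above (illustrative helpers, not the TensorFlow implementations):

import numpy as np

def linear_activation(z):
    return z                      # regression: output can be any real number

def sigmoid_activation(z):
    return 1 / (1 + np.exp(-z))   # binary classification: output in (0, 1)

def relu_activation(z):
    return np.maximum(0, z)       # hidden layers: output >= 0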
Mathematical form of Neural Network
$\vec{x}$: can also be written as $\vec{a}^{[0]}$; it is the input to the neural network
$a_j^{[l]}$: the $j^{th}$ neuron in the $l^{th}$ layer
$\vec{a}^{[l]}$: the output of the $l^{th}$ layer
$\vec{w}_j^{[l]}$: the parameter vector of the $j^{th}$ neuron in the $l^{th}$ layer

Putting this notation together, the activation of the $j^{th}$ neuron in layer $l$ is:

$$a_j^{[l]} = g\left(\vec{w}_j^{[l]} \cdot \vec{a}^{[l-1]} + b_j^{[l]}\right)$$
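A minimal NumPy sketch of one dense layer's forward pass implementing the formula above; the sigmoid helper `g` and the function name `dense` are assumptions for illustration:

import numpy as np

def g(z):                                   # sigmoid activation, as an example
    return 1 / (1 + np.exp(-z))

def dense(a_prev, W, b):
    """
    Forward pass of one layer.
    a_prev : (n_prev,)       activations of the previous layer, a^[l-1]
    W      : (n_prev, units) column j holds w_j^[l]
    b      : (units,)
    returns a : (units,)     activations a^[l]
    """
    units = W.shape[1]
    a = np.zeros(units)
    for j in range(units):
        a[j] = g(np.dot(a_prev, W[:, j]) + b[j])   # a_j^[l] = g(w_j^[l] . a^[l-1] + b_j^[l])
    return a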
import numpy as np
import tensorflow as tf
from tensorflow.keras.layers import Dense, Input
from tensorflow.keras import Sequential
from tensorflow.keras.activations import linear, relu, sigmoid
from tensorflow.keras.losses import BinaryCrossentropy, MeanSquaredError
# create a network: L1, L2, L3, ...
# each layer has a certain number of neurons: 25, 10, 2, ...
# a dense layer is a common layer type
# the activation function can be linear, sigmoid or relu
model = Sequential(
    [
        tf.keras.Input(shape=(2,)),
        tf.keras.layers.Dense(units=25, activation="sigmoid", name="L1"),
        tf.keras.layers.Dense(units=10, activation="sigmoid", name="L2"),
        tf.keras.layers.Dense(units=1, activation="sigmoid", name="L3"),
        ...
    ]
)
X = np.array([[4, 5], [1, 8], [6, 8], [6, 6], [7, 1], [6, 5]], dtype=np.float32)
Y = np.array([[0], [0], [0], [1], [1], [1]], dtype=np.float32)
# input X, Y must be 2-D matrices
# create a "Normalization" layer to get X normalized
norm_l = tf.keras.layers.Normalization(axis=-1)
norm_l.adapt(X)    # learns the mean and variance of the dataset
Xn = norm_l(X)
print(L1.get_weights())    # returns a list of numpy arrays [w, b]
#>>> [array([[0.54]], dtype=float32), array([0.], dtype=float32)]

w, b = L1.get_weights()
print(w, b)               #>>> [[0.54]] [0.]
print(w.shape, b.shape)   #>>> (1, 1) (1,)
# w is a 2-D matrix, b is a vector

set_w = np.array([[2]])          # w must be a 2-D matrix
set_b = np.array([-4.5])
L1.set_weights([set_w, set_b])   # set weights manually
model.summary()
'''
Shows the layers and the number of parameters in the model.
Model: "sequential_3"
┌──────────────────┬──────────────────┬───────────┐
│ Layer (type)     │ Output Shape     │   Param # │
├──────────────────┼──────────────────┼───────────┤
│ L1 (Dense)       │ (None, 25)       │        75 │
└──────────────────┴──────────────────┴───────────┘
 Total params: 75 (300.00 B)
 Trainable params: 75 (300.00 B)
 Non-trainable params: 0 (0.00 B)
'''
# the loss function can be BinaryCrossentropy or MeanSquaredError
# (i.e. binary cross-entropy or mean squared error)
model.compile(
    loss=tf.keras.losses.BinaryCrossentropy(),
    optimizer=tf.keras.optimizers.Adam(learning_rate=0.01),
)
# configure the model for training
model.fit(
    Xn, Y,
    epochs=100,
)
# run gradient descent to fit the model
model.predict(new_X)    # make predictions on new data

a1 = L1(Xn)             # or call a single layer to inspect its activations
print(a1)
'''
>>> tf.Tensor(
[[0.5  0.5  0.5  ...]
 [0.45 0.35 0.55 ...]
 [0.41 0.23 0.59 ...]
 [0.36 0.14 0.63 ...]
 [0.32 0.08 0.68 ...]
 [0.28 0.05 0.71 ...]], shape=(6, 25), dtype=float32)
The output is a tensor; each row is $\vec{a}^{[1]}$ for one example.
'''
Multiclass Classification
Softmax Function
The softmax function can be written:

$$a_j = \frac{e^{z_j}}{\sum_{k=1}^{N} e^{z_k}}$$

The output $\mathbf{a}$ is a vector of length $N$, so for softmax regression, you could also write:

$$\mathbf{a}(\mathbf{x}) = \begin{bmatrix} P(y=1 \mid \mathbf{x};\mathbf{w},b) \\ \vdots \\ P(y=N \mid \mathbf{x};\mathbf{w},b) \end{bmatrix} = \frac{1}{\sum_{k=1}^{N} e^{z_k}} \begin{bmatrix} e^{z_1} \\ \vdots \\ e^{z_N} \end{bmatrix}$$
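A small NumPy sketch of the softmax function; subtracting `max(z)` first is a common numerical-stability trick and does not change the result:

import numpy as np

def my_softmax(z):
    """z : (N,) raw outputs (logits); returns a : (N,) probabilities summing to 1."""
    ez = np.exp(z - np.max(z))   # shift for numerical stability
    return ez / np.sum(ez)

print(my_softmax(np.array([1., 2., 3., 4.])))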
Loss Function
The loss function associated with softmax, called the sparse categorical cross-entropy loss, is:

$$L(\mathbf{a}, y) = \begin{cases} -\log a_1 & \text{if } y = 1 \\ \quad \vdots \\ -\log a_N & \text{if } y = N \end{cases}$$
Cost Function
To write the cost function, we first define an indicator function:

$$\mathbf{1}\{y == n\} = \begin{cases} 1 & \text{if } y == n \\ 0 & \text{otherwise} \end{cases}$$

Now the cost is:

$$J(\mathbf{w},b) = -\frac{1}{m} \sum_{i=1}^{m} \sum_{j=1}^{N} \mathbf{1}\{y^{(i)} == j\} \log \frac{e^{z^{(i)}_j}}{\sum_{k=1}^{N} e^{z^{(i)}_k}}$$
Where $m$ is the number of examples, $N$ is the number of outputs.
import numpy as np
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Input
from tensorflow.keras.activations import linear, relu, sigmoid
from sklearn.datasets import make_blobs
'''
make_blobs() generates clustered samples.
Parameters:
  n_samples    : number of samples [int], default=100
  n_features   : dimension of the samples [int], default=2
  centers      : number of clusters [int] or the center of each cluster [array of (x,y)]
  cluster_std  : standard deviation of the clusters [float or array]
  random_state : random seed
'''
centers = [[-5, 2], [-2, -2], [1, 2], [5, -2]]
X_train, y_train = make_blobs(n_samples=2000, centers=centers, cluster_std=1.0, random_state=30)
model = Sequential(
    [
        tf.keras.Input(shape=(2,)),     # 2 input features (matches X_train above)
        Dense(25, activation='relu'),
        Dense(15, activation='relu'),
        Dense(4, activation='linear')   # <-- Attention!
    ]
)
model.compile(
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),   # <-- Attention
    optimizer=tf.keras.optimizers.Adam(learning_rate=0.001),
)
# Writing the code this way is highly recommended, to reduce floating-point error:
# the output layer is linear, and the results are passed as logits to the loss function.
'''The .fit method (https://www.tensorflow.org/api_docs/python/tf/keras/Model) returns a variety of
metrics, including the loss. This is captured in the `history` variable and can be used to plot the
loss, as in the commented-out code below.'''
history = model.fit(
    X_train, y_train,
    epochs=10
)
'''
fig, ax = plt.subplots(1, 1, figsize=(4, 3))
ax.plot(history.history['loss'], label='loss')
ax.set_ylim([0, 2])
ax.set_xlabel('Epoch')
ax.set_ylabel('loss (cost)')
ax.legend()
ax.grid(True)
plt.show()
'''
p = model.predict(X_train)          # the output predictions are not probabilities (they are logits)
results = tf.nn.softmax(p).numpy()  # the output should be processed by a softmax
for i in range(len(results)):
    print(f"{results[i]}, category: {np.argmax(results[i])}")
Choose the model which has the lowest cv error, even if it has a higher test error (to avoid the overfitting problem).
Tuning Regularization
Choose the model which has the lowest cv error
Increasing Training Set Size
When a model is overfitting (high variance), collecting additional data can improve performance.
Neural Network
A simple model (fewer layers, fewer neurons) usually has a slightly higher classification error on the training data, but does better on the cross-validation data than a more complex model.

Apply regularization to moderate the impact of a more complex model!
Skewed datasets: $y=1$ in the presence of a rare class we want to detect
|                       | Actual Class 1 | Actual Class 0 |
| --------------------- | -------------- | -------------- |
| **Predicted Class 1** | True Positive  | False Positive |
| **Predicted Class 0** | False Negative | True Negative  |
Precision:

$$\text{Precision} = \frac{\text{True Positives}}{\text{True Positives} + \text{False Positives}}$$

Recall:

$$\text{Recall} = \frac{\text{True Positives}}{\text{True Positives} + \text{False Negatives}}$$

Trading off between precision and recall: we can define an F1 score to measure both factors; a model with a larger F1 score does better.

$$F_1 = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}$$
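A short sketch computing these metrics from confusion-matrix counts; the function name and example counts are illustrative:

def precision_recall_f1(tp, fp, fn):
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# e.g. 85 true positives, 15 false positives, 25 false negatives
print(precision_recall_f1(85, 15, 25))   # (0.85, 0.77..., 0.80...)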
Decision Trees
One Hot Encoding
If a categorical feature can take on $k$ discrete values, create $k$ binary features (0 or 1 valued).
Entropy

$$H(p_1) = -p_1 \log_2(p_1) - (1 - p_1) \log_2(1 - p_1)$$
Note
$p_1$ stands for the fraction of label $y=1$ in a node
The log is calculated with base $2$
For implementation purposes, $0 \text{log}_2(0) = 0$. That is, if $p_1 = 0$ or $p_1 = 1$, set the entropy to 0
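A small sketch of the entropy formula above, handling the $p_1 = 0$ or $p_1 = 1$ case explicitly:

import numpy as np

def entropy(p1):
    """Entropy of a node with fraction p1 of positive (y=1) examples."""
    if p1 == 0 or p1 == 1:
        return 0.0
    return -p1 * np.log2(p1) - (1 - p1) * np.log2(1 - p1)

print(entropy(0.5))   # 1.0, maximum impurity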
Information Gain

$$\text{Information Gain} = H(p_1^\text{node}) - \left( w^{\text{left}} H(p_1^\text{left}) + w^{\text{right}} H(p_1^\text{right}) \right)$$

where (see the short code sketch after these definitions):
$H(p_1^\text{node})$ is entropy at the node
$H(p_1^\text{left})$ and $H(p_1^\text{right})$ are the entropies at the left and the right branches resulting from the split
$w^{\text{left}}$ and $w^{\text{right}}$ are the proportion of examples at the left and right branch respectively
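A sketch of information gain using the `entropy` helper defined above; the parameter names follow the definitions in the list and the example numbers are illustrative:

def information_gain(p1_node, p1_left, p1_right, w_left, w_right):
    return entropy(p1_node) - (w_left * entropy(p1_left) + w_right * entropy(p1_right))

# e.g. a node with 5/10 positives split into a left branch (4/5 positive, weight 0.5)
# and a right branch (1/5 positive, weight 0.5)
print(information_gain(0.5, 0.8, 0.2, 0.5, 0.5))   # about 0.28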
Steps to Build a Decision Tree
Start with all examples at the root node
Calculate information gain for splitting on all possible features, and pick the one with the highest information gain
Split dataset according to the selected feature, and create left and right branches of the tree
Keep repeating the splitting process until a stopping criterion is met, such as reaching a maximum depth, the number of examples in a node falling below a threshold, or the information gain being very small…
Tree Ensemble
Bagged Decision Trees (Randomizing Samples)
Given a training set of size m;

Use sampling with replacement to create a new training set of size m;

Train a tree on the new dataset;

Repeat steps 2 and 3 n times, giving a bag (ensemble) of n trees;

Let all the trees vote for the final result.
Random Forest (Randomizing Samples and Feature Choice)
At each node, when choosing a feature to use to split, if n features are available, pick a random subset of $k < n$ features and allow the algorithm to only choose from that subset of features
This makes the trees more different from each other, speeds up training, and reduces overfitting.
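Not part of the original notes, but as a practical sketch, scikit-learn's `RandomForestClassifier` implements both ideas (bootstrap sampling and random feature subsets via `max_features`); the dataset here is illustrative:

from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import make_blobs

X, y = make_blobs(n_samples=500, centers=2, random_state=0)

model = RandomForestClassifier(
    n_estimators=100,      # number of trees in the ensemble
    max_features="sqrt",   # random subset of k ~ sqrt(n) features at each split
    bootstrap=True,        # sampling with replacement
    random_state=0,
)
model.fit(X, y)
print(model.score(X, y))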