Basics of Neural Network Programming


Notes on Ng's deep learning videos, updated on an ongoing basis

2.3 Logistic Regression: Cost Function

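This is a binary classification problem. Each feature is multiplied by a weight and a bias term is added, giving $z=w^Tx+b$ . We want the output to lie between 0 and 1 so that it can be read as the probability of class 1, so we pass $z$ through the sigmoid function $\sigma(z)=\frac{1}{1+e^{-z}}$ . For a sample $(x^{(i)},y^{(i)})$ , the lecture notes give the following formula: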

$\hat{y}^{(i)}=\sigma(w^Tx^{(i)}+b)=\frac{1}{1+e^{-(w^Tx^{(i)}+b)}}$

A common loss function for a single sample is the squared error $\frac{1}{2}(\hat{y}^{(i)}-y^{(i)})^2$ , but logistic regression usually uses the following loss instead:

$L(\hat{y}^{(i)},y^{(i)})=-\big(y^{(i)}\,log(\hat{y}^{(i)})+(1-y^{(i)})\,log(1-\hat{y}^{(i)})\big)$

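The reason is that, with the sigmoid inside it, the squared-error loss leads to a non-convex optimization problem with multiple local optima, so gradient descent may not find the global optimum. The log loss plays a role similar to the squared error but gives a convex optimization problem that is easy to optimize.
The cost function $J$ over multiple samples is: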

$J(w,b)=\frac{1}{m}\sum_{i=1}^mL(\hat{y}^{(i)},y^{(i)})\\ =-\frac{1}{m}\sum_{i=1}^m\big((y^{(i)}log(\hat{y}^{(i)})+(1-y^{(i)})log(1-\hat{y}^{(i)}))\big)$

The cost function is the average of the per-sample losses; training adjusts the parameters $w$ and $b$ to minimize it.
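As a rough illustration of how the per-sample loss and the cost relate, the following sketch computes both with NumPy. The helper names `sigmoid`, `loss` and `cost`, the row-per-sample layout of `X`, and the toy data are my own assumptions, not something taken from the lecture.

```python
import numpy as np

def sigmoid(z):
    """Logistic function 1 / (1 + e^{-z})."""
    return 1.0 / (1.0 + np.exp(-z))

def loss(y_hat, y):
    """Per-sample log loss L(y_hat, y)."""
    return -(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

def cost(w, b, X, y):
    """Cost J(w, b): the mean of the per-sample losses.

    X has shape (m, n) with one sample per row (an assumption of this sketch).
    """
    y_hat = sigmoid(X @ w + b)
    return np.mean(loss(y_hat, y))

# Toy data (made up for illustration): 4 samples, 2 features.
X = np.array([[0.5, 1.0], [1.5, -0.5], [-1.0, 2.0], [2.0, 0.0]])
y = np.array([1.0, 0.0, 1.0, 0.0])
w = np.zeros(2)
b = 0.0

# With w = 0 and b = 0 every prediction is 0.5, so J equals log(2).
J = cost(w, b, X, y)
print(J)
```

Starting from all-zero parameters is a convenient sanity check, because the expected cost $\log 2$ does not depend on the data.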
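[Supplement] A few points from this lecture deserve more detail: why the log loss is convex, why gradient descent on the squared error may not find the global optimum, and which loss functions exist and when to use them.
The standard form of the log loss is $L(Y,P(Y|X))=-logP(Y|X)$ , which corresponds to the loss above.
First, here is the logistic regression model for binary classification: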

$f(x)=w*x+b\\ P(Y=1|x)=\frac{e^{f(x)}}{1+e^{f(x)}}\\ P(Y=0|x)=\frac{1}{1+e^{f(x)}}$

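Substituting these into the loss function above gives the cost function [note: the loss function measures a single sample, the cost function measures multiple samples]: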

$J=-\big(y\,logP(Y=1|x)+(1-y)\,logP(Y=0|x)\big)$

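The log function is monotonically increasing, so $-logP(Y|X)$ decreases as $P(Y|X)$ grows: the more probability the model assigns to the correct label, the smaller the loss. Monotonicity alone does not give convexity, though. What matters is the curvature: for $y=1$ the loss as a function of $z=f(x)$ is $\log(1+e^{-z})$ , whose second derivative $\sigma(z)(1-\sigma(z))$ is never negative (the case $y=0$ is symmetric). Since $z$ is linear in $w$ and $b$ , the cost is convex in the parameters, and gradient descent can reach the global minimum.
The squared error is itself a convex quadratic in $\hat{y}$ , but once $\hat{y}=\sigma(z)$ is substituted, the resulting cost is no longer convex in $w$ and $b$ .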
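One way to probe the convexity claims numerically is to estimate the second derivative of each single-sample loss as a function of $z$ (taking $y=1$). The sketch below uses central finite differences; the helper names and the grid of $z$ values are my own choices for illustration, not part of the lecture.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def log_loss(z):
    """Log loss for a sample with y = 1, as a function of z."""
    return -np.log(sigmoid(z))

def squared_loss(z):
    """Squared error on the sigmoid output for a sample with y = 1."""
    return 0.5 * (sigmoid(z) - 1.0) ** 2

def second_derivative(f, z, h=1e-3):
    """Central finite-difference estimate of f''(z)."""
    return (f(z + h) - 2.0 * f(z) + f(z - h)) / h ** 2

z = np.linspace(-6.0, 6.0, 121)
log_curv = second_derivative(log_loss, z)
sq_curv = second_derivative(squared_loss, z)

print("log loss, minimum curvature:", log_curv.min())     # stays positive on this grid
print("squared loss, minimum curvature:", sq_curv.min())  # dips below zero for negative z
```

A curvature that is positive everywhere is what convexity looks like in one dimension; the squared error on the sigmoid output fails this test once $z$ is sufficiently negative.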
A brief note on loss functions.
Here are a few common ones:
Among all linear methods $y=f(\theta^Tx)$ , we first need to determine the form of $f$ , and then find $\theta$ by formulating the problem as maximizing a likelihood or minimizing a loss. This is straightforward.

For classification with labels $y\in\{-1,+1\}$ , it's easy to see that if we classify correctly we have $y\cdot f = y\cdot \theta^Tx\gt0$ , and $y\cdot f = y\cdot\theta^Tx\lt0$ if we classify incorrectly. We then formulate the following loss functions:

0/1 loss: $\min_\theta\sum_i L_{0/1}(\theta^Tx)$ . We define $L_{0/1}(\theta^Tx) =1$ if $y\cdot f \lt 0$ , and $=0$ otherwise. It is non-convex and very hard to optimize.
Hinge loss: approximate the 0/1 loss by $\min_\theta\sum_i H(\theta^Tx)$ . We define $H(\theta^Tx) = \max(0, 1 - y\cdot f)$ . Apparently $H$ is small if we classify correctly.
Logistic loss: $\min_\theta \sum_i \log(1+\exp(-y\cdot \theta^Tx))$ . Refer to my logistic regression notes for details.

For regression:

Square loss: $\min_\theta \sum_i||y^{(i)}-\theta^Tx^{(i)}||^2$

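Fortunately, hinge loss, logistic loss and square loss are all convex functions. Convexity guarantees that any local minimum is a global minimum, and it is computationally appealing.
Reference: http://www.cs.cmu.edu/~yandongl/loss.html
To see what these functions mean intuitively, take evenly spaced points between -2 and 2 as the values of $y\cdot f$ and evaluate each loss function on them: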

import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(-2, 2, 300)  # evenly spaced values of the margin y*f
# Hinge loss: max(0, 1 - y*f)
hinge_loss_function = np.maximum(0, 1 - x)
# Exponential loss (computed, but its plot line is commented out below)
exponential_loss_function = np.exp(-x)
# Logistic loss, divided by log(2) so that it equals 1 at x = 0
logistic_loss_function = np.log(1 + np.exp(-x)) / np.log(2)
# 0/1 loss: 1 for a negative margin, 0 otherwise
l0_1_loss_function = np.where(x < 0, 1, 0)
# Square loss
pingfang_loss_function = (x - 1) ** 2
plt.figure(figsize=(10, 8))
plt.plot(x, hinge_loss_function, 'r-')
#plt.plot(x, exponential_loss_function, 'b-')
plt.plot(x, logistic_loss_function, 'g-')
plt.plot(x, l0_1_loss_function, 'k-')
plt.plot(x, pingfang_loss_function, 'c-')
plt.legend(['hinge_loss_function', 'logistic_loss_function', 'l0_1_loss_function', 'pingfang_loss_function'])
plt.ylim(0, 4)
plt.show()

The result is shown in the figure below:

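The key point about this figure is that the x-axis is the quantity $y\cdot f$ mentioned above. We want this value to be as large (and positive) as possible, that is, as far from 0 as possible: the larger it is, the better the model $f$ separates the data. (One way to picture it: given two classes of points in the plane, we look for a line that separates them, and the best line is the one farthest from the points of both classes. A line that hugs one class still separates the data, but its predictive power is lower.)
The 0/1 loss is the ideal one: the loss is 1 whenever $y\cdot f$ is below 0 and 0 otherwise. The hinge loss reaches 0 only once $y\cdot f \ge 1$ , so it asks for a margin of at least 1. The log loss never reaches 0, so it keeps rewarding a larger $y\cdot f$ . The square loss is not suited to classification, because it also penalizes confidently correct predictions with $y\cdot f>1$ ; if it is used for classification, only the results with $y\cdot f <0$ should count toward the loss (this is what the perceptron does).
In practice, binary classification usually uses the log loss or the hinge loss, and regression usually uses the square loss.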

2.4 Gradient Descent

Here is a figure to illustrate gradient descent:

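As in the figure above, take a bowl-shaped (convex) cost $J(w)$ , for example $J(w)=w^2$ . We want the $w$ that minimizes $J$ . Starting from a given $w_{0}$ , each update moves against the gradient, scaled by a learning rate $\alpha$ : $w=w_{0}-\alpha\frac{dJ(w)}{dw}$
If $w_{0}$ lies to the left of the minimum, the derivative is negative and $w$ increases; if $w_{0}$ lies to the right, the derivative is positive and $w$ decreases.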
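A minimal sketch of this update rule in one dimension is given below. The cost $J(w)=(w-3)^2$ , the learning rate and the starting points are hypothetical values picked for illustration, and `grad_descent` is my own helper name.

```python
def grad_descent(dJ, w0, alpha=0.1, steps=100):
    """Repeatedly apply w := w - alpha * dJ(w), starting from w0."""
    w = w0
    for _ in range(steps):
        w = w - alpha * dJ(w)
    return w

# Hypothetical cost J(w) = (w - 3)^2 with derivative 2 * (w - 3); its minimum is at w = 3.
dJ = lambda w: 2.0 * (w - 3.0)

w_left = grad_descent(dJ, w0=-5.0)   # starts left of the minimum, so w increases
w_right = grad_descent(dJ, w0=10.0)  # starts right of the minimum, so w decreases
print(w_left, w_right)  # both end up close to 3
```

With this cost each step shrinks the distance to the minimum by a factor of $1-2\alpha$ , which is why a learning rate that is too large (here, above 1) would make the iteration diverge instead.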

2.10 Logistic regression on m examples

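This section combines the gradient descent for logistic regression from section 2.9 with gradient descent over m examples, and implements the most basic version in Python.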

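The figure above is really about the back-propagation algorithm, in its most basic form of computation.
Here we differentiate with respect to each variable (following the video, da denotes the derivative of the loss function with respect to a):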

$da=\frac{dL(a,y)}{da}=-\frac{y}{a}+\frac{1-y}{1-a}\\ dz=\frac{dL(a,y)}{dz}=\frac{dL(a,y)}{da}\cdot \frac{da}{dz}=(-\frac{y}{a}+\frac{1-y}{1-a})\cdot a(1-a)=a-y\\ dw_{1}=\frac{dL(a,y)}{dw_{1}}=\frac{dL(a,y)}{da}\cdot \frac{da}{dz}\cdot \frac{dz}{dw_{1}}=(a-y)x_{1}\\ dw_{2}=\frac{dL(a,y)}{dw_{2}}=\frac{dL(a,y)}{da}\cdot \frac{da}{dz}\cdot \frac{dz}{dw_{2}}=(a-y)x_{2}\\ db=\frac{dL(a,y)}{db}=\frac{dL(a,y)}{da}\cdot \frac{da}{dz}\cdot \frac{dz}{db}=a-y$

Here the derivative of the sigmoid is $f'(x)=f(x)(1-f(x))$
Update $w_{1},w_{2},b$ :

$w_{1}=w_{1}-\alpha \cdot dw_{1}\\ w_{2}=w_{2}-\alpha \cdot dw_{2}\\ b=b-\alpha \cdot db$
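To sanity-check the chain-rule results above, one can compare them with finite-difference estimates of the same derivatives. The sketch below does this for a single sample; the function names and the numeric values are arbitrary choices of mine for illustration.

```python
import math

def forward_loss(w1, w2, b, x1, x2, y):
    """Loss L(a, y) for one sample, with a = sigmoid(w1*x1 + w2*x2 + b)."""
    z = w1 * x1 + w2 * x2 + b
    a = 1.0 / (1.0 + math.exp(-z))
    return -(y * math.log(a) + (1 - y) * math.log(1 - a))

def analytic_grads(w1, w2, b, x1, x2, y):
    """Gradients from the chain-rule result dz = a - y."""
    z = w1 * x1 + w2 * x2 + b
    a = 1.0 / (1.0 + math.exp(-z))
    dz = a - y
    return dz * x1, dz * x2, dz

# Arbitrary values chosen for illustration.
w1, w2, b = 0.3, -0.2, 0.1
x1, x2, y = 1.5, -2.0, 1.0
eps = 1e-6

dw1, dw2, db = analytic_grads(w1, w2, b, x1, x2, y)

# Central differences: perturb one parameter at a time.
num_dw1 = (forward_loss(w1 + eps, w2, b, x1, x2, y)
           - forward_loss(w1 - eps, w2, b, x1, x2, y)) / (2 * eps)
num_dw2 = (forward_loss(w1, w2 + eps, b, x1, x2, y)
           - forward_loss(w1, w2 - eps, b, x1, x2, y)) / (2 * eps)
num_db = (forward_loss(w1, w2, b + eps, x1, x2, y)
          - forward_loss(w1, w2, b - eps, x1, x2, y)) / (2 * eps)

print(dw1, num_dw1)
print(dw2, num_dw2)
print(db, num_db)
```

Each printed pair should agree to several decimal places; a larger gap would point to a mistake in the derivation or in the code.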

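The above is the gradient update for a single sample. Below is a Python implementation of logistic regression over multiple samples. [I think understanding this code matters for understanding these two sections. I wrote it myself, so it may contain errors.]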

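import math

def logistic_regression_m(sample, m, k, alpha, min_value):
    """
    Logistic regression with two weights [w1, w2] and a bias b.
    sample is an m x 3 matrix: column 0 holds x1, column 1 holds x2, column 2 holds y.
    Parameters:
    sample    -- the samples
    m         -- number of samples
    k         -- number of iterations
    alpha     -- learning rate
    min_value -- stop once the cost function falls below this bound
    """
    w1 = w2 = b = 0.0  # initialize the parameters
    # iterate k times
    for j in range(k):
        fun_j = 0.0  # reset the cost function on every iteration
        dw1 = dw2 = db = 0.0  # reset the accumulated gradients on every iteration
        # loop over the m samples
        for i in range(m):
            x1, x2, y = sample[i][0], sample[i][1], sample[i][2]
            z = w1 * x1 + w2 * x2 + b
            a = 1 / (1 + math.exp(-z))  # prediction (the model)
            fun_j += -(y * math.log(a) + (1 - y) * math.log(1 - a))  # accumulate the loss
            # see the chain-rule derivation above
            dz = a - y  # gradient for a single sample, intermediate quantity
            dw1 += x1 * dz  # sum the gradients over the samples
            dw2 += x2 * dz
            db += dz
        fun_j /= m
        if abs(fun_j) < min_value:  # the cost is computed only for this check
            break
        dw1 /= m
        dw2 /= m
        db /= m
        # parameter update
        w1 -= alpha * dw1
        w2 -= alpha * dw2
        b -= alpha * db
    return w1, w2, b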

2.11 Vectorization

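This section is about avoiding explicit for loops in favor of vectorized operations. The example compares the running time of np.dot with that of a for loop: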

import numpy as np
import time
#a=np.array([1,2,3,4])
a=np.random.rand(1000000)
b=np.random.rand(1000000)
tic=time.time()
c=np.dot(a,b)
toc=time.time()
print("Vectorized version:"+str(1000*(toc-tic))+"ms")
c=0
tic=time.time()
for i in range(1000000):
    c+=a[i]*b[i]
toc=time.time()
print("for version:"+str(1000*(toc-tic))+"ms")

Vectorized version:1.27410888671875ms
for version:702.2850513458252ms

2.12 More vectorization examples

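This section points out that the earlier logistic regression code can be improved with vectors. A small change removes one level of looping, as follows: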

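import math
import numpy as np

def logistic_regression_m_2(sample, m, k, alpha, min_value):
    # sample is a NumPy array; the last column holds the label y
    n = sample.shape[1] - 1  # number of features
    # initialization
    b = 0.0
    w = np.zeros(n)  # weight vector with one entry per feature
    # iterate k times
    for j in range(k):
        fun_j = 0.0  # reset the cost function on every iteration
        dw = np.zeros(n)  # reset the accumulated gradients on every iteration
        db = 0.0
        # loop over the m samples
        for i in range(m):
            x, y = sample[i, :n], sample[i, n]
            z = np.dot(x, w) + b  # x and w are both vectors of length n
            a = 1 / (1 + math.exp(-z))  # prediction (the model)
            fun_j += -(y * math.log(a) + (1 - y) * math.log(1 - a))  # accumulate the loss
            # see the chain-rule derivation above
            dz = a - y  # gradient for a single sample, intermediate quantity
            # sum the gradients over the samples
            dw += x * dz
            db += dz
        fun_j /= m
        if abs(fun_j) < min_value:  # the cost is computed only for this check
            break
        dw /= m
        db /= m
        # parameter update
        w -= alpha * dw
        b -= alpha * db
    return w, b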

2.14 Vectorizing Logistic Regression’s Gradient Computation

These two sections simplify the code above further, as shown in the figure below:

The code is as follows:

import numpy as np

def logistic_regression_m_3(sample, y, k, alpha, min_value):
    """
    A further revision of the code above.
    The parameters change here: sample holds the inputs x, and y holds the true labels.
    With n features, sample is an n x m matrix, so each column is one example x.
    y is a row vector with 1 row and m columns.
    """
    m = sample.shape[1]
    # initialization
    b = 0.0
    w = np.zeros((sample.shape[0], 1))  # column vector with sample.shape[0] rows
    # iterate k times
    for j in range(k):
        z = np.dot(w.T, sample) + b  # shape (1, m); the scalar b is broadcast
        a = 1 / (1 + np.exp(-z))
        fun_j = -np.sum(y * np.log(a) + (1 - y) * np.log(1 - a)) / m
        if abs(fun_j) < min_value:  # the cost is computed only for this check
            break
        dz = a - y
        dw = np.dot(sample, dz.T) / m
        db = np.sum(dz) / m
        w = w - alpha * dw
        b = b - alpha * db
    return w, b

Broadcasting in Python

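As shown in the figure below, when a vector is combined with a scalar, the scalar is automatically expanded to fill the matching rows/columns; in the same way, when a matrix is combined with a vector, the vector is expanded to fill the matching rows/columns.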
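A small sketch of the three cases with NumPy follows; the array values are made up for illustration.

```python
import numpy as np

A = np.array([[1.0, 2.0, 3.0],
              [4.0, 5.0, 6.0]])       # shape (2, 3)
row = np.array([[10.0, 20.0, 30.0]])  # shape (1, 3)
col = np.array([[100.0], [200.0]])    # shape (2, 1)

plus_scalar = A + 1.0  # the scalar is applied to every element
plus_row = A + row     # the row is repeated for each of the 2 rows
plus_col = A + col     # the column is repeated for each of the 3 columns

print(plus_scalar)
print(plus_row)
print(plus_col)
```

This is the mechanism that lets a scalar bias be added to a whole row of pre-activations, as in `np.dot(w.T, X) + b`.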

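Programming Assignment

Most of this assignment has already been practiced above. The assignment: Logistic Regression with a Neural Network mindset v4

1 - Packages

The first step is loading the data. For the line from lr_utils import load_dataset, my Jupyter environment did not have lr_utils, so I simply copied lr_utils.py and the corresponding datasets over from GitHub.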

2 - Overview of the Problem set

# Loading the data (cat/non-cat)
train_set_x_orig, train_set_y, test_set_x_orig, test_set_y, classes = load_dataset()

Exercise: Reshape the training and test data sets so that images of size (num_px, num_px, 3) are flattened into single vectors of shape (num_px * num_px * 3, 1).
A trick when you want to flatten a matrix X of shape (a,b,c,d) to a matrix X_flatten of shape (b * c * d, a) is to use:
X_flatten = X.reshape(X.shape[0], -1).T # X.T is the transpose of X
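To see what this trick does, here is a small sketch on a made-up array that stands in for the image data (5 "images" of 4 x 4 pixels with 3 channels; these sizes are my own choice, not the assignment's).

```python
import numpy as np

# Hypothetical stand-in for the image array: shape (a, b, c, d) = (5, 4, 4, 3).
X = np.arange(5 * 4 * 4 * 3).reshape(5, 4, 4, 3)

X_flatten = X.reshape(X.shape[0], -1).T
print(X_flatten.shape)  # (48, 5)

# Column i of X_flatten is image i unrolled into a vector.
same = np.array_equal(X_flatten[:, 2], X[2].ravel())
print(same)
```

The `-1` lets NumPy infer the remaining dimension (here 4 * 4 * 3 = 48), and the transpose puts one example in each column.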

# Reshape the training and test examples
### START CODE HERE ### (≈ 2 lines of code)
train_set_x_flatten = train_set_x_orig.reshape(train_set_x_orig.shape[0],-1).T
test_set_x_flatten = test_set_x_orig.reshape(test_set_x_orig.shape[0],-1).T
### END CODE HERE ###
print ("train_set_x_flatten shape: " + str(train_set_x_flatten.shape))
print ("train_set_y shape: " + str(train_set_y.shape))
print ("test_set_x_flatten shape: " + str(test_set_x_flatten.shape))
print ("test_set_y shape: " + str(test_set_y.shape))
print ("sanity check after reshaping: " + str(train_set_x_flatten[0:5,0]))

train_set_x_flatten shape: (12288, 209)
train_set_y shape: (1, 209)
test_set_x_flatten shape: (12288, 50)
test_set_y shape: (1, 50)
sanity check after reshaping: [17 31 56 22 33]

One common preprocessing step in machine learning is to center and standardize your dataset, meaning that you subtract the mean of the whole numpy array from each example, and then divide each example by the standard deviation of the whole numpy array. But for picture datasets, it is simpler and more convenient and works almost as well to just divide every row of the dataset by 255 (the maximum value of a pixel channel).
Let’s standardize our dataset.

#Let's standardize our dataset.
train_set_x = train_set_x_flatten/255.
test_set_x = test_set_x_flatten/255.

4 - General Architecture of the learning algorithm

(Section 3 is omitted.)
Exercise: Using your code from “Python Basics”, implement sigmoid(). As you’ve seen in the figure above, you need to compute $sigmoid(w^Tx+b)=\frac{1}{1+e^{-(w^Tx+b)}}$ to make predictions. Use np.exp()

def sigmoid(z):
    """
    Compute the sigmoid of z
    Arguments:
    z -- A scalar or numpy array of any size.
    Return:
    s -- sigmoid(z)
    """
    ### START CODE HERE ### (≈ 1 line of code)
    s = 1/(1+np.exp(-z))
    ### END CODE HERE ###
    
    return s

Initializing parameters

def initialize_with_zeros(dim):
    """
    This function creates a vector of zeros of shape (dim, 1) for w and initializes b to 0.
    
    Argument:
    dim -- size of the w vector we want (or number of parameters in this case)
    
    Returns:
    w -- initialized vector of shape (dim, 1)
    b -- initialized scalar (corresponds to the bias)
    """
    ### START CODE HERE ### (≈ 1 line of code)
    w = np.zeros((dim,1))
    b = 0
    ### END CODE HERE ###
    assert(w.shape == (dim, 1))
    assert(isinstance(b, float) or isinstance(b, int))
    return w, b

Forward and Backward propagation
Exercise: Implement a function propagate() that computes the cost function and its gradient.
Hints:
Forward Propagation:

You get X
You compute
$A=\sigma(w^TX+b)=(a^{(0)},a^{(1)},...,a^{(m-1)},a^{(m)})$
You calculate the cost function:
$J=-\frac{1}{m}\sum_{i=1}^m\big(y^{(i)}\log(a^{(i)})+(1-y^{(i)})\log(1-a^{(i)})\big)$

Here are the two formulas you will be using:

$\frac{\partial J}{\partial w}=\frac{1}{m}X(A-Y)^T\\ \frac{\partial J}{\partial b}=\frac{1}{m}\sum_{i=1}^m(a^{(i)}-y^{(i)})$
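The two formulas above are the per-example gradients $(a^{(i)}-y^{(i)})x^{(i)}$ and $a^{(i)}-y^{(i)}$ averaged over the m examples. The sketch below checks this on random data; the sizes, the seed and the variable names are assumptions of mine, not part of the assignment.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 3, 5                        # made-up sizes: 3 features, 5 examples
X = rng.normal(size=(n, m))        # one example per column
Y = rng.integers(0, 2, size=(1, m)).astype(float)
w = rng.normal(size=(n, 1))
b = 0.5

A = 1.0 / (1.0 + np.exp(-(w.T @ X + b)))  # activations, shape (1, m)

# Vectorized gradients, following the two formulas above.
dw_vec = X @ (A - Y).T / m
db_vec = np.sum(A - Y) / m

# The same quantities accumulated one example at a time.
dw_loop = np.zeros((n, 1))
db_loop = 0.0
for i in range(m):
    dz = A[0, i] - Y[0, i]
    dw_loop += X[:, [i]] * dz
    db_loop += dz
dw_loop /= m
db_loop /= m

print(np.allclose(dw_vec, dw_loop), np.isclose(db_vec, db_loop))
```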

def propagate(w, b, X, Y):
    """
    Implement the cost function and its gradient for the propagation explained above
    Arguments:
    w -- weights, a numpy array of size (num_px * num_px * 3, 1)
    b -- bias, a scalar
    X -- data of size (num_px * num_px * 3, number of examples)
    Y -- true "label" vector (containing 0 if non-cat, 1 if cat) of size (1, number of examples)
    Return:
    cost -- negative log-likelihood cost for logistic regression
    dw -- gradient of the loss with respect to w, thus same shape as w
    db -- gradient of the loss with respect to b, thus same shape as b
    
    Tips:
    - Write your code step by step for the propagation. np.log(), np.dot()
    """
    
    m = X.shape[1]
    
    # FORWARD PROPAGATION (FROM X TO COST)
    ### START CODE HERE ### (≈ 2 lines of code)
    A = sigmoid(np.dot(w.T,X)+b)               
    cost = -1/m*np.sum(Y*np.log(A)+(1-Y)*np.log(1-A))                                # compute cost
    ### END CODE HERE ###
    
    # BACKWARD PROPAGATION (TO FIND GRAD)
    ### START CODE HERE ### (≈ 2 lines of code)
    dw = 1/m*np.dot(X,(A-Y).T)
    db = 1/m*np.sum(A-Y)
    ### END CODE HERE ###
    assert(dw.shape == w.shape)
    assert(db.dtype == float)
    cost = np.squeeze(cost)
    assert(cost.shape == ())
    
    grads = {"dw": dw,
             "db": db}
    
    return grads, cost

Exercise: Write down the optimization function. The goal is to learn w and b by minimizing the cost function J. For a parameter θ, the update rule is θ=θ−α dθ, where α is the learning rate.

def optimize(w, b, X, Y, num_iterations, learning_rate, print_cost = False):
    """
    This function optimizes w and b by running a gradient descent algorithm
    
    Arguments:
    w -- weights, a numpy array of size (num_px * num_px * 3, 1)
    b -- bias, a scalar
    X -- data of shape (num_px * num_px * 3, number of examples)
    Y -- true "label" vector (containing 0 if non-cat, 1 if cat), of shape (1, number of examples)
    num_iterations -- number of iterations of the optimization loop
    learning_rate -- learning rate of the gradient descent update rule
    print_cost -- True to print the loss every 100 steps
    
    Returns:
    params -- dictionary containing the weights w and bias b
    grads -- dictionary containing the gradients of the weights and bias with respect to the cost function
    costs -- list of all the costs computed during the optimization, this will be used to plot the learning curve.
    
    Tips:
    You basically need to write down two steps and iterate through them:
        1) Calculate the cost and the gradient for the current parameters. Use propagate().
        2) Update the parameters using gradient descent rule for w and b.
    """
    
    costs = []
    
    for i in range(num_iterations):
        
        
        # Cost and gradient calculation (≈ 1-4 lines of code)
        ### START CODE HERE ### 
        grads, cost = propagate(w, b, X, Y)
        ### END CODE HERE ###
        
        # Retrieve derivatives from grads
        dw = grads["dw"]
        db = grads["db"]
        
        # update rule (≈ 2 lines of code)
        ### START CODE HERE ###
        w = w-learning_rate*dw
        b = b-learning_rate*db
        ### END CODE HERE ###
        
        # Record the costs
        if i % 100 == 0:
            costs.append(cost)
        
        # Print the cost every 100 training examples
        if print_cost and i % 100 == 0:
            print ("Cost after iteration %i: %f" %(i, cost))
    
    params = {"w": w,
              "b": b}
    
    grads = {"dw": dw,
             "db": db}
    
    return params, grads, costs

Exercise: The previous function will output the learned w and b. We are able to use w and b to predict the labels for a dataset X. Implement the predict() function. There are two steps to computing predictions:

Calculate $\hat{Y}=A=\sigma(w^TX+b)$

Convert the entries of A into 0 (if activation <= 0.5) or 1 (if activation > 0.5), and store the predictions in a vector Y_prediction. If you wish, you can use an if/else statement in a for loop (though there is also a way to vectorize this).
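One possible way to vectorize the thresholding step is a single array comparison, sketched below next to the loop version. The activations in `A` are made-up values for illustration.

```python
import numpy as np

# Made-up activations for illustration, shape (1, m).
A = np.array([[0.1, 0.5, 0.51, 0.9]])

# Loop version: 1 if the activation is above 0.5, otherwise 0.
Y_loop = np.zeros(A.shape)
for i in range(A.shape[1]):
    if A[0, i] > 0.5:
        Y_loop[0, i] = 1

# Vectorized version: compare the whole array at once, then cast to float.
Y_vec = (A > 0.5).astype(float)

print(Y_vec)
```

Both versions map an activation of exactly 0.5 to 0, matching the rule stated in the exercise.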

def predict(w, b, X):
    '''
    Predict whether the label is 0 or 1 using learned logistic regression parameters (w, b)
    
    Arguments:
    w -- weights, a numpy array of size (num_px * num_px * 3, 1)
    b -- bias, a scalar
    X -- data of size (num_px * num_px * 3, number of examples)
    
    Returns:
    Y_prediction -- a numpy array (vector) containing all predictions (0/1) for the examples in X
    '''
    
    m = X.shape[1]
    Y_prediction = np.zeros((1,m))
    w = w.reshape(X.shape[0], 1)
    
    # Compute vector "A" predicting the probabilities of a cat being present in the picture
    ### START CODE HERE ### (≈ 1 line of code)
    A = sigmoid(np.dot(w.T,X)+b) 
    ### END CODE HERE ###
    for i in range(A.shape[1]):
        
        # Convert probabilities A[0,i] to actual predictions p[0,i]
        ### START CODE HERE ### (≈ 4 lines of code)
        if A[0][i]>0.5:Y_prediction[0][i]=1
        ### END CODE HERE ###
    
    assert(Y_prediction.shape == (1, m))
    
    
    return Y_prediction

5 - Merge all functions into a model

You will now see how the overall model is structured by putting all the building blocks (functions implemented in the previous parts) together, in the right order.
Exercise: Implement the model function. Use the following notation:

Y_prediction for your predictions on the test set
Y_prediction_train for your predictions on the train set
w, costs, grads for the outputs of optimize()

def model(X_train, Y_train, X_test, Y_test, num_iterations = 2000, learning_rate = 0.5, print_cost = False):
    """
    Builds the logistic regression model by calling the function you've implemented previously
    
    Arguments:
    X_train -- training set represented by a numpy array of shape (num_px * num_px * 3, m_train)
    Y_train -- training labels represented by a numpy array (vector) of shape (1, m_train)
    X_test -- test set represented by a numpy array of shape (num_px * num_px * 3, m_test)
    Y_test -- test labels represented by a numpy array (vector) of shape (1, m_test)
    num_iterations -- hyperparameter representing the number of iterations to optimize the parameters
    learning_rate -- hyperparameter representing the learning rate used in the update rule of optimize()
    print_cost -- Set to true to print the cost every 100 iterations
    
    Returns:
    d -- dictionary containing information about the model.
    """
    
    ### START CODE HERE ###
    
    # initialize parameters with zeros (≈ 1 line of code)
    w, b = initialize_with_zeros(X_train.shape[0])
    # Gradient descent (≈ 1 line of code)
    parameters, grads, costs = optimize(w, b, X_train, Y_train, num_iterations, learning_rate, print_cost)
    
    # Retrieve parameters w and b from dictionary "parameters"
    w = parameters["w"]
    b = parameters["b"]
    
    # Predict test/train set examples (≈ 2 lines of code)
    Y_prediction_test = predict(w, b, X_test)
    Y_prediction_train = predict(w, b, X_train)
    ### END CODE HERE ###
    # Print train/test Errors
    print("train accuracy: {} %".format(100 - np.mean(np.abs(Y_prediction_train - Y_train)) * 100))
    print("test accuracy: {} %".format(100 - np.mean(np.abs(Y_prediction_test - Y_test)) * 100))
    
    d = {"costs": costs,
         "Y_prediction_test": Y_prediction_test, 
         "Y_prediction_train" : Y_prediction_train, 
         "w" : w, 
         "b" : b,
         "learning_rate" : learning_rate,
         "num_iterations": num_iterations}
    
    return d

d = model(train_set_x, train_set_y, test_set_x, test_set_y, num_iterations = 2000, learning_rate = 0.005, print_cost = True)

train accuracy: 99.04306220095694 %
test accuracy: 70.0 %
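That completes the model. Of course, since we fit on the training set and then predict on that same training set, the training accuracy is bound to be close to 100%, while the test accuracy is only 70%, which indicates overfitting (the goal is for the training and test results to be as close as possible).
Following the tutorial, I changed the number of iterations to 5000 and compared several learning rates; the results are: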

learning_rates = [0.01, 0.001, 0.0001]
models = {}
for i in learning_rates:
    print ("learning rate is: " + str(i))
    models[str(i)] = model(train_set_x, train_set_y, test_set_x, test_set_y, num_iterations = 5000, learning_rate = i, print_cost = False)
    print ('\n' + "-------------------------------------------------------" + '\n')
for i in learning_rates:
    plt.plot(np.squeeze(models[str(i)]["costs"]), label= str(models[str(i)]["learning_rate"]))
plt.ylabel('cost')
plt.xlabel('iterations')
legend = plt.legend(loc='upper center', shadow=True)
frame = legend.get_frame()
frame.set_facecolor('0.90')
plt.show()

learning rate is: 0.01
train accuracy: 100.0
test accuracy: 68.0

learning rate is: 0.001
train accuracy: 96.65071770334929
test accuracy: 74.0

learning rate is: 0.0001
train accuracy: 77.51196172248804
test accuracy: 56.0

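Here, with a learning rate of 0.001 the test accuracy reaches 74%. Because the learning rate is small, gradient descent proceeds slowly (this is also clearly visible in the plot), so training has not fully converged and overfitting has not yet set in. In other words, with more iterations the 0.001 setting might still find a better result; likewise, for 0.01, reducing the number of iterations might give a better result. None of this tuning is really necessary, though: regularization or other techniques can address the overfitting problem.
The end!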