Recommending Effective Offers to Customers

    Providing offers to customers can help increase business, but it is important to send the right offers to the right customers. Some offers may not be attractive to certain users, and other users may already make regular purchases without receiving offers.

    As part of the Udacity Data Science Nanodegree, I created a method of recommending offer types for users of the Starbucks app. A sample dataset was provided with information regarding users, offer types, and interactions between users and the app during a test of various offers sent to app users. You can see the code and datasets for this project here. Libraries used include pandas, numpy, math, json, and matplotlib.

    Data Exploration

    The data for this project consists of three json files. The portfolio.json file contains data for the 10 offers included in the test. This was imported as a pandas dataframe with 10 rows and 6 columns (channels, difficulty, duration, id, offer_type, and reward).

    The profile.json file contains information about the 17,000 users. The pandas dataframe has 17,000 rows and 5 columns (age, became_member_on, gender, id, and income).

    The transcript.json file contains information about the 306,534 events. The dataframe has 4 columns (event, person, time, value).

    The portfolio and transcript dataframes have no null values. The profile dataframe has 2,175 null values for both gender and income. Missing ages in profile are encoded as 118, and there are also 2,175 of these values. Upon further examination, the same users are missing all three values.
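
    A minimal sketch of how these files can be loaded and the missing values inspected (the data/ paths are assumptions; the files are newline-delimited JSON records):

    import pandas as pd

    # Paths are assumed; point these at wherever the project data lives.
    portfolio = pd.read_json('data/portfolio.json', orient='records', lines=True)
    profile = pd.read_json('data/profile.json', orient='records', lines=True)
    transcript = pd.read_json('data/transcript.json', orient='records', lines=True)

    print(portfolio.shape, profile.shape, transcript.shape)  # (10, 6), (17000, 5), (306534, 4)
    print(profile.isnull().sum())         # gender and income each have 2175 nulls
    print((profile['age'] == 118).sum())  # ages encoded as 118 for those same users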

    Gender Distribution

    The users with demographic information are about 57% Male (M), 41% Female (F), and 1.4% Other (O).

    Figure: Gender distribution of Starbucks app users (M: 8484, F: 6129, O: 212).
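
    These counts (and the bar chart) can be reproduced with value_counts on the profile dataframe; a quick sketch, assuming profile was loaded as above:

    import matplotlib.pyplot as plt

    gender_counts = profile['gender'].value_counts()
    print(gender_counts)  # M 8484, F 6129, O 212

    gender_counts.plot(kind='bar', title='Gender distribution of Starbucks app users')
    plt.show()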

    Income Distribution

    Mean income is around $65405, and median income is $64000.

    Figure: Income distribution of Starbucks app users. Summary statistics: count 14,825; mean $65,404.99; std $21,598.30; min $30,000; 25% $49,000; median $64,000; 75% $80,000; max $120,000.

    Age Distribution

    Mean age is about 62 years, with a median age of 58 years.

    Figure: Age distribution of Starbucks app users. Summary statistics: count 17,000; mean 62.53; std 26.74; min 18; 25% 45; median 58; 75% 73; max 118.

    The age distribution is skewed right because users without an age value are encoded as 118. If we remove these, the ages are fairly normally distributed, with a mean age of about 54 years and a median age of about 55 years.

    Figure: Age distribution of Starbucks app users with outliers removed. Summary statistics: count 14,825; mean 54.39; std 17.38; min 18; 25% 42; median 55; 75% 66; max 101.
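
    The outlier removal above amounts to dropping the rows whose age carries the 118 placeholder; a sketch, assuming the profile dataframe from earlier:

    # Users with age 118 are the same users missing gender and income
    profile_clean = profile[profile['age'] != 118]

    print(profile_clean['age'].describe())     # mean ~54.4, median 55
    print(profile_clean['income'].describe())  # mean ~65405, median 64000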

    Event Distribution

    I wrote a function that takes in the transcript dataframe and outputs a dataframe which shows the counts of each event type for each user. All 17000 users are present in this dataframe. I also evaluated the event distribution of users without demographic information and users with demographic information separately, and the distributions were very similar. You can see the full analysis in the jupyter notebook here.

    def create_events_df(df=transcript):
        '''
        Description: Takes in the transcript dataframe and creates a new dataframe
        with a count of each event type for each person.
        Input:
            df - pandas dataframe, defaulted to transcript, the dataframe to convert
        Output:
            event_count_df - pandas dataframe, contains counts of events for each person
        '''
        df.sort_values(by='person', inplace=True)
        events = df.event.unique()
        event_count_df = pd.DataFrame(index=transcript['person'].unique())
        for event_name in events:
            event_df = df[df['event'] == event_name].set_index(['person'])
            event_df.rename(columns={'event': event_name}, inplace=True)
            event_df.replace(event_name, 1, inplace=True)
            event_df = pd.DataFrame(event_df.groupby('person')[event_name].count())
            event_count_df = event_count_df.join(event_df)
        event_count_df.fillna(0, inplace=True)
        return event_count_df

    Data Cleaning

    Event Details

    Another dataframe was created with details for each event occurring during the test period. From the transcript dataframe, I sorted values by person and time and reset the index. I then broke out the value column in transcript to retrieve the offer id, transaction amount, and reward amount. The new dataframe was concatenated to the event_detail_df and the original value column was dropped. Null values were filled with 0. The ‘offer id’ and ‘offer_id’ columns were merged: I filled the ‘offer id’ nulls with values from ‘offer_id’ and dropped the ‘offer_id’ column. The portfolio dataframe was joined to the event_detail_df on ‘offer id’ to create a full set of event details.

    # json_normalize is available as pandas.json_normalize (pandas >= 1.0);
    # in older versions it was imported from pandas.io.json.
    from pandas import json_normalize

    event_values_df = transcript.sort_values(by=['person', 'time']).reset_index(drop=True)
    event_values_df_split = json_normalize(event_values_df['value'])
    event_detail_df = pd.concat([event_values_df, event_values_df_split], axis=1).drop('value', axis=1)
    event_detail_df['reward'].fillna(0, inplace=True)
    event_detail_df['amount'].fillna(0, inplace=True)
    event_detail_df['offer id'].fillna(event_detail_df['offer_id'], inplace=True)
    event_detail_df.drop('offer_id', axis=1, inplace=True)
    event_detail_df = event_detail_df.join(portfolio.set_index(['id']), how='left',
                                           on='offer id', rsuffix='_offer')

    Event Counts

    The earlier dataframe containing counts of each event type for each user was updated with the total amount spent per person (profit) and the total reward received per person (loss). I summed the amounts spent and the rewards received per person, then created a profit column from the summed amounts and a loss column from the summed rewards.

    event_df_totals = event_detail_df.groupby(by=['person']).sum()
    event_count_df['profit'] = event_df_totals['amount']
    event_count_df['loss'] = event_df_totals['reward']

    User-Offer Matrix

    I created a matrix which shows the counts of each offer type by user and event type. I grouped the event_detail_df by person, event, and offer_type, calculated the count of offer_type, and performed an unstack with null values set to 0.

    offers_by_person = event_detail_df.groupby(['person', 'event', 'offer_type'])['offer_type'].count().unstack(fill_value=0)

    Make Recommendations

    Three functions were created to aid in the process of recommending offers to users. The code for these functions can be found in the jupyter notebook here.

    Find users with similar demographic information to the input user.

    The find_similar_users function takes in a user and finds other users with similar demographic information. The user’s age, gender, and income data is pulled from the profile dataset, and the dataframe is filtered by the user’s gender and specified age and income ranges. The function returns an array of the ids of the similar users.

    def find_similar_users(user, age_dif=10, income_dif=10000, user_df=profile):
        '''
        Description: Takes in a user and finds other users with similar demographic information.
        Input:
            user - string, the user to compare
            age_dif - numeric, the maximum age difference between the user and other members.
                      Defaults to 10 years.
            income_dif - numeric, the maximum income difference between the user and other members.
                         Defaults to $10,000.
            user_df - pandas dataframe, contains demographic information about all the users.
                      Defaults to profile.
        Output:
            similar_users - array, users similar to the provided user
        '''
        user_info = profile[profile['id'] == user]
        age = user_info['age'].values[0]
        gender = user_info['gender'].values[0]
        income = user_info['income'].values[0]
        similar_users = profile[(profile['age'] >= (age - age_dif)) &
                                (profile['age'] <= (age + age_dif)) &
                                (profile['gender'] == gender) &
                                (profile['income'] >= (income - income_dif)) &
                                (profile['income'] <= (income + income_dif))]['id'].values
        return similar_users

    Create a group from specified demographic information

    The create_user_group function takes in demographic information and outputs a list of users meeting those demographic specifications. The profile dataframe is filtered by the specified age range, gender(s), and income range. An array of ids for users meeting the specifications is returned.

    def create_user_group(age=[0, 118], gender=['M', 'F', 'O'], income=[0, 999999999], user_df=profile):
        '''
        Description: Takes in demographic information and outputs a list of users
        meeting those demographic specifications.
        Input:
            age - array of two numeric values, the minimum and maximum ages for the user group.
                  Defaults to range 0-118.
            gender - array of single-character strings, the genders for the user group.
                     'M' for male, 'F' for female, 'O' for other or non-binary. Defaults to all three.
            income - array of two numeric values, the minimum and maximum incomes for the user group.
                     Defaults to range $0 to $999,999,999.
            user_df - pandas dataframe, contains demographic information about all the users.
                      Defaults to profile.
        Output:
            users - array, users meeting the provided demographic information
        '''
        users = profile[(profile['age'] >= age[0]) &
                        (profile['age'] <= age[1]) &
                        (profile['gender'].isin(gender)) &
                        (profile['income'] >= income[0]) &
                        (profile['income'] <= income[1])]['id'].values
        return users

    Create recommendations for a group of users.

    I created the recommend_group_offers function to sort the offer types by how well a given group responds to each offer. Each offer type is given a weight based on user response, and an array is returned with the offer types in weighted order. Groups are determined by user demographics: age, gender, and income. The two functions above are used to create groups. I also used a test group consisting of the members who lack demographic information.

    The ideal situation is a user viewing an offer and then making the necessary purchases to complete the offer. In this case, the offer receives a weight increase of 1. If the number of completed offers of a given type is more than the number of viewed offers, it indicates that the offer did not have an effect on the purchases made. In this case, it would have been more cost-effective to not have sent the offer. If the number of viewed offers for an offer type is more than the number of completed offers, it represents an offer that may not be attractive enough to influence the user to make a purchase. A different offer may be more effective for that user.

    For each user in the group, I retrieved the information from the user_matrix (defaulted to offers_by_person) and the user_count dataframe (defaulted to event_count_df). I pulled the user’s count of completed offers and viewed offers for each offer type. The count was set to 0 if none exist.

    To calculate the weighted value of each offer type, I found the absolute value of the difference between the number of viewed offers and completed offers. The higher this difference, the less effective the offer. This value was divided by the number of offers the user received, and the result was subtracted from 1, the ideal case. If a user completed the same number of offers they viewed, this value is 1; the larger the gap between viewed and completed offers, the smaller the weight. Some weights will be negative, indicating that the offer type performed poorly with the user and should not be sent to that user. These users are likely either making purchases regardless of receiving offers or ignoring the offers they receive. The weights for each offer type were summed across all users, and an array was created with the offer types ordered from highest to lowest weight.
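
    As a worked example of this weighting (the numbers are hypothetical, not taken from the dataset): suppose a user received 5 offers in total, viewed 3 bogo offers, and completed 2 of them.

    viewed_bogo = 3
    completed_bogo = 2
    received = 5  # total offers received by this user

    # 1 minus the normalized gap between viewed and completed offers of this type
    user_bogo_weight = 1 - abs(viewed_bogo - completed_bogo) / received  # 1 - 1/5 = 0.8

    A user who completed exactly as many bogo offers as they viewed would contribute the full 1 to the group's bogo weight.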

    def recommend_group_offers(group, df_count=event_count_df, df_detail=event_detail_df, user_matrix=offers_by_person):
        '''
        Description: Sorts the offer types by how well a given group responds to the offer.
        Input:
            group - array, the ids of the users to generate recommendations for.
            df_count - pandas dataframe, contains event interaction counts for each customer.
            df_detail - pandas dataframe, contains details about the events that have occurred.
            user_matrix - pandas dataframe, shows the count of customer interactions for each offer type.
        Output:
            top_offers - array, the array of offer types sorted by best response.
                         Does not include ineffective offers. If no offers are effective,
                         contains a string stating such.
            offer_weights - dictionary, the weights for each offer type.
        '''
        bogo_weight = 0
        discount_weight = 0
        info_weight = 0
        top_offers = []
        for user in group:
            try:
                user_df = user_matrix.loc[user]
                user_count = df_count.loc[user]
                transactions, received, viewed, completed, profit, loss = user_count.values
                if completed > 0:
                    completed_offers = user_df.loc['offer completed'].values
                else:
                    completed_offers = [0, 0, 0]
                if viewed > 0:
                    viewed_offers = user_df.loc['offer viewed'].values
                else:
                    viewed_offers = [0, 0, 0]
                # Each offer type's weight is 1 minus the view/complete gap,
                # normalized by the total number of offers the user received.
                bogo_weight += (1 - abs(viewed_offers[0] - completed_offers[0]) / received)
                discount_weight += (1 - abs(viewed_offers[1] - completed_offers[1]) / received)
                info_weight += (1 - abs(viewed_offers[2] - completed_offers[2]) / received)
            except:
                # Skip users with no offer interactions in the matrices
                continue
        offer_weights = {'bogo': bogo_weight, 'discount': discount_weight, 'information': info_weight}
        top_offers = sorted(offer_weights, key=offer_weights.get, reverse=True)
        return top_offers, offer_weights

    A few different options were tested in order to find a good recommendation algorithm. Initially, I looked at the total number of each user's interactions with each offer type. The original offers_by_person matrix had each person as the index and the total count of interactions with each offer type as the columns. This did not distinguish between received, viewed, and completed offers. With this algorithm, if a user received more offers of a certain type, that offer would receive a higher weight even if the user did not view or complete them. This might work if every user received the same number of offers, but it does not provide a good representation of offer effectiveness in this case. Simply using the difference between viewed and completed offers also did not weight offer effectiveness adequately; again, the number of offers received disproportionately affected the weight. I also tried dividing the difference by the number of offers received for each offer type, but the error increased, indicating less accurate recommendations. Dividing by the total number of offers received provided more accurate recommendations.

    Evaluate Results

    Testing a recommendation algorithm is important because we want to ensure we are making the right recommendations. Since we want to see how many users we predicted recommendations for correctly, I evaluated the accuracy of the recommendations. To determine the accuracy, I calculated the root mean squared error (RMSE), a measure of the difference between predicted values and actual values. A low RMSE indicates a good fit of the data to the prediction model and a more accurate recommendation.

    To calculate the root mean squared error, I compared the recommendation list for the group (the prediction) to the weighted list for each individual user (the actual value). I created a dictionary to map each offer type to a number and a prediction array mapping each offer type in the recommendation array to that dictionary. I then calculated the squared error for each user and appended it to an errors array. After that, I calculated the mean of the errors array and returned the square root of the mean squared error. For a single user, a recommendation order that is off by one place would yield an RMSE of about 1.41 (swapping two adjacent offer types gives a squared error of 1² + 1² = 2, and √2 ≈ 1.41). We would like an RMSE below this value to ensure most users receive recommendations in the correct order. A higher RMSE indicates that the selected group may not have similar enough offer interactions to predict a good recommendation, and a different range of demographic values may need to be selected.

    I used the group of users with no demographic data as a test set to test each recommendation method. These users had a similar distribution of events to the full dataset. Once I found an algorithm that seemed to be performing well, I tested it on other groups of users. I created groups based both on a similar user and on specified demographic information. The groups created from specific demographics tended to perform better than choosing a group based on similarity to a specified user.

    def check_recommendation_accuracy(group, recommendation):
        '''
        Description: Checks the accuracy of the recommendations by checking the
        group prediction against each user.
        Input:
            group - array, the ids for the users in the recommendation group.
            recommendation - array, the sorted array with the offer types in order of recommendation.
        Output:
            rmse - float, the root mean squared error for the recommendations.
        '''
        offer_dict = {'bogo': 1, 'discount': 2, 'information': 3}
        pred = []
        errors = []
        for offer in recommendation:
            pred.append(offer_dict[offer])
        for user in group:
            actual = []
            for offer in recommend_group_offers([user])[0]:
                actual.append(offer_dict[offer])
            sq_error = np.sum((pd.Series(actual) - pd.Series(pred)) ** 2)
            errors.append(sq_error)
        mean_sq_error = pd.Series(errors).mean()
        rmse = math.sqrt(mean_sq_error)
        return rmse
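
    Putting the three functions together, a usage sketch (the age, gender, and income ranges here are illustrative choices, not values from the analysis):

    # Build a demographic group, recommend offers for it, then measure the fit
    group = create_user_group(age=[40, 60], gender=['F'], income=[50000, 75000])
    top_offers, offer_weights = recommend_group_offers(group)
    rmse = check_recommendation_accuracy(group, top_offers)

    print(top_offers)      # offer types ordered by weighted response
    print(offer_weights)   # summed weight per offer type
    print(rmse)            # ideally below ~1.41

    # Alternatively, build a group around a specific user who has demographic data
    seed_user = profile.dropna(subset=['income'])['id'].iloc[0]
    similar_group = find_similar_users(seed_user)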

    Conclusions

    The recommendation function creates a weighted list of offer types. Incorporating transaction amounts and reward amounts may further optimize the recommendation algorithm. For the demographic sets tested, bogo offers were most effective at getting users to view and complete them. Discounts were the second most effective for most groups, followed by informational offers, but some groups responded better to informational offers than discounts. Exploring different demographic groups can yield different results, so it is important to evaluate various groups to see which offers will yield the most success. With a good algorithm for predicting effective offers for customers, we can achieve higher profits by decreasing the unnecessary dispersal of rewards and increasing sales.

    Translated from: https://medium.com/@ronda_lunn/recommending-effective-offers-to-customers-b5e82df8b153
