DataSklr

Predictive Modeling in Support of Email Campaign: EDA

Project Description:

A not-for-profit organization wants to execute a direct mail campaign and maximize the profitability (dollar volume of donations vs. the cost of execution) of said campaign. A database of previous donors is available for analysis. The project was divided into three parts:

  • During the first phase, we will predict which individuals in our database are likely to send a donation (a classification problem).

  • The second phase will predict the expected amount of donations from each individual (a regression problem).

  • The third phase of the project will be a financial exercise examining the profitability of the campaign.

Table of Contents:

To arrive at the most accurate prediction, machine learning models were built, tuned, and compared against each other. The reader can click on the links below to go to the individual models or sections of the exercise. Each section has a short explanation of the theory and a description of applied machine learning with Python:

  1. Exploratory Data Analysis

  2. LDA/QDA/Naive Bayes Classifier

  3. Multi-Layer Perceptron

  4. K-Nearest Neighbors

  5. Support Vector Machines

  6. Ensemble Learning

  7. Model Comparisons

Exploratory Data Analysis:

A dataset of 6,002 observations was made available by a charitable organization. The data contained 23 variables that provided demographic and behavioral information about past donors to the organization.

Exploratory Data Analysis was conducted on the training data. All variables were examined in terms of their relationship with past donor behavior (i.e., whether the subject donated).

Region:

Each subject was assigned to one of five geographic regions. Region 1 contained 20.1% of all observations (n=1,209), Region 2 accounted for 34.7% (n=2,083), and Region 3 had 12.1% (n=728). Region 4 (n=795) and Region 5 (n=1,187) accounted for 13.2% and 19.8% of the total, respectively. Donor behavior was clearly differentiated across the five regions. People living in Region 1 and Region 2 were more likely to donate than those living in other regions. However, average amounts donated were higher when the donor lived in Region 3 or Region 4.
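The regional shares quoted above can be re-derived from the counts alone. The snippet below is only an arithmetic check on the figures in the text; the variable names are illustrative and are not taken from the original code.

```python
# Counts per region as reported in the text above.
region_counts = {1: 1209, 2: 2083, 3: 728, 4: 795, 5: 1187}

# The counts should add up to the full dataset.
total = sum(region_counts.values())

# Share of observations per region, in percent, rounded to one decimal.
region_share = {
    region: round(100 * n / total, 1) for region, n in region_counts.items()
}

for region, share in region_share.items():
    print(f"Region {region}: n={region_counts[region]:,} ({share}%)")
```

The total comes to 6,002, which matches the dataset size stated earlier.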

Home Ownership:

There were 5,309 homeowners in the dataset compared to only 693 people who did not own a home. The bar chart below clearly indicates that homeownership is associated with donating behavior, whereas those who do not own a home are not at all likely to donate. Further, the average amount donated appears to be higher among homeowners than among those who do not own a home. Note that several homeowners donated more than the norm.
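A comparison like this one (a binary feature against the binary donor flag) is commonly summarized with a row-normalized cross-tabulation. The sketch below assumes pandas and uses a handful of made-up rows; the column names `home` and `donr` are hypothetical stand-ins, not confirmed names from the dataset.

```python
import pandas as pd

# Made-up rows for illustration only; 1 = yes, 0 = no.
toy = pd.DataFrame({
    "home": [1, 1, 1, 1, 0, 0],  # hypothetical homeownership flag
    "donr": [1, 1, 0, 1, 0, 0],  # hypothetical donor flag
})

# normalize="index" makes each row sum to 1, so each cell is the share of
# donors or non-donors within a homeownership group.
rates = pd.crosstab(toy["home"], toy["donr"], normalize="index")
print(rates)
```

Reading across a row gives the donor and non-donor shares for that group, which is the same information a grouped bar chart conveys.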

Number of Children:

There were 2,088 people in the data with no children, while 598 people had one child. A further 1,771 individuals had two children, 973 had three, and 412 had four. Only 160 individuals had five children.

The number of children does appear to influence donating behavior. For example, individuals without children were far more likely to donate, and the likelihood of becoming a donor declined as the number of children increased in the household. The amount donated also declined with the increase in the number of children.
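One way to quantify a pattern like this is to group by the number of children and compute the donor share and the average gift among donors. The sketch below assumes pandas; the rows are invented for illustration and the column names `chld`, `donr`, and `damt` are hypothetical.

```python
import pandas as pd

# Invented rows for illustration only.
toy = pd.DataFrame({
    "chld": [0, 0, 0, 1, 1, 2, 2, 3],   # hypothetical: number of children
    "donr": [1, 1, 0, 1, 0, 0, 1, 0],   # hypothetical: 1 if the person donated
    "damt": [15.0, 17.0, 0.0, 14.0, 0.0, 0.0, 12.0, 0.0],  # amount donated
})

# The mean of a 0/1 flag is the share of donors within each group.
donor_rate = toy.groupby("chld")["donr"].mean()

# Average gift is computed among donors only, so that non-donors
# (amount 0) do not pull the mean down.
avg_gift = toy[toy["donr"] == 1].groupby("chld")["damt"].mean()

print(donor_rate)
print(avg_gift)
```

Restricting the amount to donors keeps the two questions separate: who donates, and how much donors give.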

Household Income:

Each subject was categorized into one of seven household income categories. Most households belonged to category four, while the rest were approximately normally distributed around the most frequent category. The variable appeared to be a good predictor of donor behavior, as a larger percentage of those in category four were likely donors than those in other income categories. The amount donated increased with the income category (i.e., the more money a person makes, the higher the donated amount).

Gender:

While slightly more males than females were included as subjects in the dataset, the distribution of donors vs. non-donors was about equal for both genders, suggesting that the variable may not be a good predictor of donor behavior. Also, the median amount donated appeared to be about the same for male and female subjects.

Wealth Rating:

Wealth rating was based on median family income and population statistics indexed to relative wealth within each state, and classified individuals into segments ranging between 0 and 9 (0 being the lowest and 9 being the highest). The plot clearly indicates that the wealthiest segments are most likely to become donors, while the poorest segments are very unlikely to donate. Interestingly, the wealthiest two segments were not those who donated the most in terms of total amounts. The median amount donated was highest for segments 4-7. However, there were unusually high amounts donated by certain individuals belonging to the wealthiest groups.

Distribution of Variables:

The graphs show the following (left to right): Regions 1-5, Home Ownership, No. of Children, Household Income, Gender, Wealth Rating, Average Home Value, Median Family Income in Neighborhood, Average Family Income in Neighborhood, Percent low income potential donor in neighborhood, Lifetime number of promotions received, Dollar amount of lifetime gifts to date, Dollar amount of largest gift to date, Dollar amount of most recent gift, Number of months since last donation, Number of months between first and second gift, Average dollar amount of gifts to date, Donor Classification, Amount Donated.

We have seen many of these distributions in some of the prior bar graphs, although there they were presented in the context of donor behavior. The variables in the lower half of the table here have not been explored yet. None of the variables appears to be normally distributed.
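A quick numeric companion to eyeballing histograms is sample skewness, which is near zero for a symmetric distribution and clearly positive for a long right tail. The sketch below assumes pandas and NumPy and runs on synthetic data only; it is not the check used in the original analysis, and the column names are invented.

```python
import numpy as np
import pandas as pd

# Synthetic data for illustration; not drawn from the donor dataset.
rng = np.random.default_rng(42)
sample = pd.DataFrame({
    "symmetric": rng.normal(loc=50.0, scale=5.0, size=1000),
    "right_skewed": rng.lognormal(mean=3.0, sigma=1.0, size=1000),
})

# DataFrame.skew() returns the sample skewness of each numeric column.
skewness = sample.skew()
print(skewness)
```

Dollar amounts and counts often behave like the right-skewed column here, which is one common reason such variables fail to look normal.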

Correlations:

A correlation matrix revealed strong correlations among some of the variables. They appear as dark or black on the matrix below. These correlations should give us pause when thinking about fitting some model types (such as LDA), while other methods may not be as affected by the issue. Either way, multicollinearity will be a topic to explore during the analytics phase.
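Besides reading a heatmap, strongly correlated pairs can be listed programmatically. The sketch below assumes pandas and NumPy; the helper `high_corr_pairs`, the 0.8 threshold, and the synthetic columns are all illustrative choices rather than part of the original analysis.

```python
import numpy as np
import pandas as pd


def high_corr_pairs(df, threshold=0.8):
    """Return (col_a, col_b, r) for each pair with |Pearson r| >= threshold."""
    corr = df.corr()  # Pearson correlation by default
    cols = list(corr.columns)
    pairs = []
    # Walk the upper triangle only, so each pair is reported once.
    for i in range(len(cols)):
        for j in range(i + 1, len(cols)):
            r = corr.iloc[i, j]
            if abs(r) >= threshold:
                pairs.append((cols[i], cols[j], float(r)))
    return pairs


# Synthetic demo data: "b" is almost a rescaled copy of "a", "c" is independent.
rng = np.random.default_rng(0)
x = rng.normal(size=200)
demo = pd.DataFrame({
    "a": x,
    "b": 2.0 * x + rng.normal(scale=0.1, size=200),
    "c": rng.normal(size=200),
})

flagged = high_corr_pairs(demo)
print(flagged)
```

A list like this makes it easier to decide which of two near-duplicate features to drop or combine before fitting a model that is sensitive to collinearity.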

Bivariate Plots:

Finally, bivariate plots were used to assess potential relationships between the features and the target variable. Some of them show a shape indicating a relationship, but none of these relationships appears to be linear.

Code Used:

The code that was used to create the graphs and plots in this post is available in the original post. I had to manipulate some of the variables so that they appear in the right format. These manipulations are also provided there.