ICT707 Big Data Assignment


Course Code: ICT707


The first part is to create the algorithms named in the tasks, namely Decision Tree, Gradient Boosted Tree, and Linear Regression, and then to apply them to the bike sharing dataset provided. Try to reproduce the output given in the task sections (also given in the Big-Data Assignment.docx provided on Blackboard).
The second part is to take the algorithms created in the first part and apply them to another dataset chosen from Kaggle (other than the bike sharing dataset provided).
What needs to be submitted for marking:
Each of the files submitted will be tested with the following datasets:
bike sharing (which is provided on Blackboard)
A dataset of the student's choice downloaded from Kaggle
1. Utilising Python 3, build the following regression models:
Decision Tree
Gradient Boosted Tree
Linear regression
2. Select a dataset (other than the example dataset given in section 3) and apply the Decision Tree and Linear regression models created above. Choose a dataset from Kaggle.
3. Build the following in relation to the gradient boosted tree and the dataset chosen in step 2:
a) Gradient boosted tree iterations (see Big-Data Assignment.docx section 6.1)
b) Gradient boosted tree Max Bins (see Big-Data Assignment.docx section 7.2)
4. Build the following in relation to the decision tree and the dataset chosen in step 2:
a) Decision Tree Categorical features
b) Decision Tree Log (see Big-Data Assignment.docx section 5.4)
c) Decision Tree Max Bins (see Big-Data Assignment.docx section 7.2)
d) Decision Tree Max Depth (see Big-Data Assignment.docx section 7.1)
5. Build the following in relation to the linear regression and the dataset chosen in step 2:
a) Linear regression Cross Validation
i. Intercept (see Big-Data Assignment.docx section 6.5)
ii. Iterations (see Big-Data Assignment.docx section 6.1)
iii. Step size (see Big-Data Assignment.docx section 6.2)
iv. L1 Regularization (see Big-Data Assignment.docx section 6.4)
v. L2 Regularization (see Big-Data Assignment.docx section 6.3)
b) Linear regression Log (see Big-Data Assignment.docx section 5.4)
6. Follow the provided example of the Bike sharing dataset and the guidelines in the sections that follow to develop the requirements given in steps 1, 3, 4, and 5.


Regression Models
Regression models are concerned with target variables that can take any real value. The underlying principle is to find a model that maps input features to predicted target variables. Regression is also a form of supervised learning.
Regression models can be used to predict just about any variable of interest. A few examples include the following:

Predicting stock returns and other economic variables
Predicting loss amounts for loan defaults (this can be combined with a classification model that predicts the probability of default, while the regression model predicts the amount in the case of a default)
Recommendations (the Alternating Least Squares factorization model from Chapter 5, Building a Recommendation Engine with Spark, uses linear regression in each iteration)
Predicting customer lifetime value (CLTV) in a retail, mobile, or other business, based on user behavior and spending patterns

In the different sections of this chapter, we will do the following:

Introduce the various types of regression models available in ML
Explore feature extraction and target variable transformation for regression models
Train a number of regression models using ML
See how to make predictions using the trained model
Investigate the impact on performance of various parameter settings for regression using cross-validation

Types of regression models
The core idea of linear models (or generalized linear models) is that we model the predicted outcome of interest (often called the target or dependent variable) as a function of a simple linear predictor applied to the input variables (also referred to as features or independent variables). In its general form, the model is written as y = f(w^T x).
Here, y is the target variable, w is the vector of parameters (known as the weight vector), and x is the vector of input features. w^T x is the linear predictor (or vector dot product) of the weight vector w and feature vector x. To this linear predictor, we apply a function f (called the link function). Linear models can, in fact, be used for both classification and regression simply by changing the link function. Standard linear regression uses an identity link (that is, y = w^T x directly), while binary classification uses alternative link functions.
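To make the notation concrete, here is a minimal numpy sketch of the linear predictor with an identity link; the weight and feature values are made up purely for illustration:

```python
import numpy as np

# Hypothetical weight vector w and feature vector x
w = np.array([0.5, -1.0, 2.0])
x = np.array([1.0, 2.0, 0.5])

def identity(z):
    # Identity link function, as used by standard linear regression
    return z

# Linear predictor w^T x, passed through the link function f
y = identity(np.dot(w, x))  # 0.5*1.0 + (-1.0)*2.0 + 2.0*0.5 = -0.5
```

Swapping identity for a different link function (for example, the logistic function) turns the same linear predictor into a classifier.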
Spark’s ML library offers different regression models, which are as follows:

Linear regression
Generalized linear regression
Logistic regression
Decision trees
Random forest regression
Gradient boosted trees
Survival regression
Isotonic regression
Ridge regression

Regression models define the relationship between a dependent variable and one or more independent variables, building the model that best fits the values of the independent variables or features.
Linear regression, unlike classification models such as support vector machines and logistic regression, is used for predicting the value of a dependent variable as a generalized real value rather than predicting an exact class label.
Linear regression models are essentially the same as their classification counterparts; the only difference is that linear regression models use a different loss function, related link function, and decision function. Spark ML provides a standard least squares regression model (although other types of generalized linear models for regression are planned).


Bike sharing systems are a new generation of traditional bike rentals, where the whole process from membership and rental to return has been automated. Through these systems, a user can easily rent a bike from a particular position and return it at another. There are currently over 500 bike-sharing programs around the world, comprising more than 500,000 bicycles. Today, there is great interest in these systems due to their important role in traffic, environmental, and health issues.
Apart from interesting real-world applications of bike sharing systems, the characteristics of the data generated by these systems make them attractive for research. As opposed to other transport services such as bus or subway, the duration of travel and the departure and arrival positions are explicitly recorded in these systems. This feature turns a bike sharing system into a virtual sensor network that can be used for sensing mobility in the city. Hence, it is expected that many important events in the city could be detected by monitoring these data.
Associated tasks
Regression: Prediction of the hourly or daily bike rental count based on the environmental and seasonal settings.
Event and Anomaly Detection: The count of rented bikes is also correlated with events in the town that are easily traceable via search engines. For instance, a query like “2012-10-30 washington d.c.” in Google returns results related to Hurricane Sandy. Some of the important events are identified in [1]. Therefore, the data can also be used for validation of anomaly or event detection algorithms.
Dataset characteristics
Both hour.csv and day.csv have the following fields, except hr, which is not available in day.csv:

instant: This is the record ID
dteday: This is the raw date
season: This is the season, such as spring, summer, winter, or fall
yr: This is the year (2011 or 2012)
mnth: This is the month of the year
hr: This is the hour of the day
holiday: This is whether the day was a holiday or not
weekday: This is the day of the week
workingday: This is whether the day was a working day or not
weathersit: This is a categorical variable that describes the weather at a particular time
temp: This is the normalized temperature
atemp: This is the normalized apparent temperature
hum: This is the normalized humidity
windspeed: This is the normalized wind speed
cnt: This is the target variable, that is, the count of bike rentals for that hour

Extracting features from the bike sharing dataset
We will be using the bike sharing dataset. This dataset contains hourly records of the number of bicycle rentals in the capital bike sharing system. It also contains variables related to date and time, weather, and seasonal and holiday information. The dataset is available at:
Once you have downloaded the file, unzip it. This will create a directory called Bike-Sharing-Dataset, which contains the day.csv, hour.csv, and Readme.txt files. The Readme.txt file contains information on the dataset, including the variable names and descriptions. We will work with the hourly data contained in hour.csv. If you look at the first line of the dataset, you will see that it contains the column names as a header.
Before we work with the data in Spark, we will again remove the header from the first line of the file using the same sed command that we used previously to create a new file called
Since we will be doing some plotting of our dataset later on, we will use the Python shell for this chapter. This also serves to illustrate how to use MLlib’s linear model and decision tree functionality from PySpark. Start up your PySpark shell from your Spark installation directory. If you want to use IPython, which we highly recommend, remember to include the IPYTHON=1 environment variable together with the pylab functionality:
You can type all the code that follows for the remainder of this chapter directly into your PySpark shell (or into IPython Notebook if you wish to use it). Recall that we used the IPython shell in Chapter 3, Obtaining, Processing, and Preparing Data with Spark. Take a look at that chapter and the code bundle for instructions to install IPython. We’ll start as usual by loading the dataset and inspecting it:
So, we have 17,379 hourly records in our dataset. We have inspected the column names already. We will ignore the record ID and raw date columns. We will also ignore the casual and registered count target variables and focus on the overall count variable, cnt (which is the sum of the other two counts). We are left with 12 variables. The first eight are categorical, while the last four are normalized real-valued variables. To deal with the eight categorical variables, we will use the binary encoding approach with which you should be quite familiar by now. The four real-valued variables will be left as is. We will first cache our dataset, since we will be reading from it many times:
In order to extract each categorical feature into a binary vector form, we will need to know the feature mapping of each feature value to the index of the nonzero value in our binary vector. Let’s define a function that will extract this mapping from our dataset for a given column:
Our function first maps the field to its unique values and then uses the zipWithIndex transformation to zip the value up with a unique index such that a key-value RDD is formed, where the key is the variable and the value is the index. This index will be the index of the nonzero entry in the binary vector representation of the feature. We will finally collect this RDD back to the driver as a Python dictionary. We can test our function on the third variable column (index 2):
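A plain-Python sketch of such a mapping function follows; in PySpark, the same result would come from chaining distinct(), zipWithIndex(), and collectAsMap() on the column's RDD. The toy records and the function body here are illustrative:

```python
def get_mapping(records, idx):
    # Map each distinct value in column idx to a unique index;
    # the index becomes the position of the nonzero entry in the
    # binary vector encoding of the feature
    mapping = {}
    for fields in records:
        value = fields[idx]
        if value not in mapping:
            mapping[value] = len(mapping)
    return mapping

# Toy records: (instant, dteday, season)
records = [
    ["1", "2011-01-01", "1"],
    ["2", "2011-01-01", "2"],
    ["3", "2011-01-02", "1"],
]
season_mapping = get_mapping(records, 2)  # {'1': 0, '2': 1}
```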
Building Regression Models
Least squares regression
There are a variety of loss functions that can be applied to generalized linear models. The loss function used for least squares is the squared loss, which is defined as follows:
Here, as for the classification setting, y is the target variable (this time, real valued), w is the weight vector, and x is the feature vector. The related link function is the identity link, and the decision function is also the identity function, as generally, no thresholding is applied in regression. So, the model's prediction is simply y = w^T x. The standard least squares regression in the ML library does not use regularization. Regularization is used to solve the problem of overfitting. Looking at the squared loss function, we can see that the loss applied to incorrectly predicted points will be magnified since the loss is squared. This means that least squares regression is susceptible to outliers in the dataset, and also to overfitting. Generally, as for classification, we should apply some level of regularization in practice. Linear regression with L2 regularization is commonly referred to as ridge regression, while applying L1 regularization is called the lasso. When the dataset is small, or the number of examples is few, the tendency of the model to overfit is very high; therefore, it is highly recommended to use regularizers like L1, L2, or elastic net.
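As a concrete illustration, here is a minimal sketch of the squared loss with an optional L2 (ridge) penalty; all values are made up:

```python
import numpy as np

def squared_loss(w, x, y, l2=0.0):
    # Squared loss (y - w^T x)^2 plus an optional L2 penalty
    # on the weight vector, l2 * ||w||^2
    residual = y - np.dot(w, x)
    return residual ** 2 + l2 * np.dot(w, w)

w = np.array([1.0, 2.0])
x = np.array([3.0, 1.0])
y = 6.0

base = squared_loss(w, x, y)           # (6 - 5)^2 = 1.0
ridge = squared_loss(w, x, y, l2=0.1)  # 1.0 + 0.1 * (1 + 4) = 1.5
```

Note how the penalty grows with the size of the weights, which is exactly what discourages an overfit model with large coefficients.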
Decision trees for regression
Just like using linear models for regression tasks involves changing the loss function used, using decision trees for regression involves changing the measure of the node impurity used. The impurity metric is called variance, and is defined in the same way as the squared loss for least squares linear regression.
Extracting the right features from your data
As the underlying models for regression are the same as those for the classification case, we can use the same approach to create input features. The only practical difference is that the target is now a real-valued variable as opposed to a categorical one. The LabeledPoint class in ML library already takes this into account, as the label field is of the Double type, so it can handle both cases.
Creating feature vectors for the linear model
The next step is to use our extracted mappings to convert the categorical features to binary-encoded features. Again, it will be helpful to create a function that we can apply to each record in our dataset for this purpose. We will also create a function to extract the target variable from each record. We will need to import numpy for linear algebra utilities and MLlib’s LabeledPoint class to wrap our feature vectors and target variables:
In the preceding extract_features function, we ran through each column in the row of data. We extracted the binary encoding for each variable in turn from the mappings we created previously. The step variable ensures that the nonzero feature index in the full feature vector is correct (and is somewhat more efficient than, say, creating many smaller binary vectors and concatenating them). The numeric vector is created directly by first converting the data to floating point numbers and wrapping these in a numpy array. The resulting two vectors are then concatenated. The extract_label function simply converts the last column variable (the count) into a float. With our utility functions defined, we can proceed with extracting feature vectors and labels from our data records:
Linear Model feature vector length: 61
As we can see, we converted the raw data into a feature vector made up of the binary categorical and real numeric features, and we indeed have a total vector length of 61.
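The two functions might be sketched as follows; the mappings, column layout, and record below are toy stand-ins for the real pipeline (which, as noted above, yields a 61-element vector):

```python
import numpy as np

# Toy mappings for two categorical columns (the real pipeline builds
# one mapping per categorical column)
mappings = [{"1": 0, "2": 1}, {"0": 0, "1": 1}]
cat_len = sum(len(m) for m in mappings)

def extract_features(record):
    # Binary-encode the categorical columns (starting at index 2),
    # then append the real-valued columns as-is
    cat_vec = np.zeros(cat_len)
    step = 0
    for i, m in enumerate(mappings):
        idx = m[record[2 + i]]
        cat_vec[idx + step] = 1
        step += len(m)
    num_vec = np.array([float(f) for f in record[2 + len(mappings):-1]])
    return np.concatenate((cat_vec, num_vec))

def extract_label(record):
    # The target (cnt) is the last column
    return float(record[-1])

record = ["1", "2011-01-01", "2", "1", "0.24", "16"]
features = extract_features(record)  # [0., 1., 0., 1., 0.24]
label = extract_label(record)        # 16.0
```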
Creating feature vectors for the decision tree
As we have seen, decision tree models typically work on raw features (that is, it is not required to convert categorical features into a binary vector encoding; they can, instead, be used directly). Therefore, we will create a separate function to extract the decision tree feature vector, which simply converts all the values to floats and wraps them in a numpy array:
The following output shows the extracted feature vector, and we can see that we have a vector length of 12, which matches the number of raw variables we are using:
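A sketch of the extraction function, with a made-up record in the bike sharing layout (two ID/date columns, twelve raw variables, one target):

```python
import numpy as np

def extract_features_dt(record):
    # Decision trees can use the raw values directly: columns 2 to 13
    # (season through windspeed), converted to floats
    return np.array([float(f) for f in record[2:14]])

# instant, dteday, then 12 raw variables, then cnt (values made up)
record = (["1", "2011-01-01"]
          + ["1", "0", "1", "0", "0", "6", "0", "1",
             "0.24", "0.2879", "0.81", "0.0"]
          + ["16"])
features_dt = extract_features_dt(record)
len(features_dt)  # 12
```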
Training a regression model on the bike sharing dataset
We’re ready to use the features we have extracted to train our models on the bike sharing data. First, we’ll train the linear regression model and take a look at the first few predictions that the model makes on the data:
Note that we have not used the default settings for iterations and step here. We’ve changed the number of iterations so that the model does not take too long to train. As for the step size, you will see why this has been changed from the default a little later. You will see the following output:
Next, we will train the decision tree model simply using the default arguments to the trainRegressor method (which equates to using a tree depth of 5). Note that we need to pass in the other form of the dataset, data_dt, that we created from the raw feature values (as opposed to the binary encoded features that we used for the preceding linear model). We also need to pass in an argument for categoricalFeaturesInfo. This is a dictionary that maps the categorical feature index to the number of categories for the feature. If a feature is not in this mapping, it will be treated as continuous. For our purposes, we will leave this as is, passing in an empty mapping:
This is not as bad as it sounds. While we do not cover it here, the Python code included with this chapter’s code bundle includes an example of using categoricalFeaturesInfo. It does not make a large difference to performance in this case. From a quick glance at these predictions, it appears that the decision tree might do better, as the linear model is quite a way off in its predictions. However, we will apply more stringent evaluation methods to find out.
Evaluating the performance of regression models
We saw in Chapter 5, Building a Classification Model with Spark, that evaluation methods for classification models typically focus on measurements related to predicted class memberships relative to the actual class memberships. These are binary outcomes (either the predicted class is correct or incorrect), and it is less important whether the model just barely predicted correctly or not; what we care most about is the number of correct and incorrect predictions.
When dealing with regression models, it is very unlikely that our model will precisely predict the target variable, because the target variable can take on any real value. However, we would naturally like to understand how far our predicted values are from the true values, so we will use a metric that takes the overall deviation into account. Some of the standard evaluation metrics used to measure the performance of regression models include the Mean Squared Error (MSE), Root Mean Squared Error (RMSE), Mean Absolute Error (MAE), and the R-squared coefficient, among others.
Mean Squared Error and Root Mean Squared Error
MSE is the average of the squared error that is used as the loss function for least squares regression:
It is the sum, over all the data points, of the square of the difference between the predicted and actual target variables, divided by the number of data points. RMSE is the square root of MSE. MSE is measured in units that are the square of the target variable, while RMSE is measured in the same units as the target variable. Due to its formulation, MSE, just like the squared loss function from which it derives, effectively penalizes larger errors more severely. In order to evaluate our predictions based on the mean of an error metric, we will first make predictions for each input feature vector in an RDD of LabeledPoint instances, computing the error for each record with a function that takes the prediction and true target value as inputs. This will return an RDD of Double values containing the error values. We can then find the average using the mean method available on RDDs of doubles. Let's define our squared error function as follows:
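A minimal sketch of that function, evaluated here over plain-Python (actual, predicted) pairs rather than an RDD; the pairs are made up:

```python
import numpy as np

def squared_error(actual, pred):
    return (pred - actual) ** 2

# Made-up (actual, predicted) pairs standing in for the RDD records
pairs = [(16.0, 20.0), (40.0, 35.0), (32.0, 30.0)]

mse = np.mean([squared_error(a, p) for a, p in pairs])  # (16 + 25 + 4) / 3 = 15.0
rmse = np.sqrt(mse)
```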
Root Mean Squared Log Error
This measurement is not as widely used as MSE and MAE, but it is used as the metric for the Kaggle competition that uses the bike-sharing dataset. It is, effectively, the RMSE of the log-transformed predicted and target values. This measurement is useful when there is a wide range in the target variable, and you do not necessarily want to penalize large errors when the predicted and target values are themselves high. It is also effective when you care about percentage errors rather than the absolute value of errors.
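A sketch of the corresponding error function, with 1 added inside the logs so that zero counts are handled; the (predicted, actual) pairs are made up:

```python
import numpy as np

def squared_log_error(pred, actual):
    # Add 1 before taking logs so that zero counts do not blow up
    return (np.log(pred + 1) - np.log(actual + 1)) ** 2

pairs = [(20.0, 16.0), (35.0, 40.0), (30.0, 32.0)]
rmsle = np.sqrt(np.mean([squared_log_error(p, a) for p, a in pairs]))
```

Note that RMSLE penalizes the ratio between prediction and truth rather than their absolute difference: predicting 20 for a true 16 costs about the same as predicting 200 for a true 160.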
R-squared coefficient
The R-squared coefficient, also known as the coefficient of determination, is a measure of how well a model fits a dataset. It is commonly used in statistics. It measures the degree of variation in the target variable that is explained by the variation in the input features. An R-squared coefficient generally takes a value between 0 and 1, where 1 equates to a perfect fit of the model.
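The coefficient can be computed directly from its definition, R^2 = 1 - SS_res / SS_tot; the actual and predicted values below are made up:

```python
import numpy as np

actual = np.array([16.0, 40.0, 32.0, 13.0])
predicted = np.array([18.0, 38.0, 30.0, 15.0])

ss_res = np.sum((actual - predicted) ** 2)      # residual sum of squares
ss_tot = np.sum((actual - actual.mean()) ** 2)  # total sum of squares
r_squared = 1 - ss_res / ss_tot                 # close to 1 for a good fit
```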
Looking at the results, we can see that our initial guess about the decision tree model being the better performer is indeed true. The Kaggle competition leaderboard lists the Mean Value Benchmark score on the test set at about 1.58, so we see that our linear model's performance is not much better than this benchmark. However, the decision tree with default settings achieves a performance of 0.63.
Improving model performance and tuning parameters
In Chapter 5, Building a Classification Model with Spark, we showed how feature transformation and selection can make a large difference to the performance of a model. In this chapter, we will focus on another type of transformation that can be applied to a dataset: transforming the target variable itself.
Transforming the target variable
Recall that many machine learning models, including linear models, make assumptions regarding the distribution of the input data as well as target variables. In particular, linear regression assumes a normal distribution. In many real-world cases, the distributional assumptions of linear regression do not hold. In this case, for example, we know that the number of bike rentals can never be negative. This alone should indicate that the assumption of normality might be problematic. To get a better idea of the target distribution, it is often a good idea to plot a histogram of the target values.
In this section, if you are using IPython Notebook, enter the magic function, %pylab inline, to import pylab (that is, the numpy and matplotlib plotting functions) into the workspace. This will also create any figures and plots inline within the Notebook cell. If you are using the standard IPython console, you can use %pylab to import the necessary functionality (your plots will appear in a separate window). We will now create a plot of the target variable distribution in the following piece of code:
Looking at the histogram plot, we can see that the distribution is highly skewed and certainly does not follow a normal distribution:
One way in which we might deal with this situation is by applying a transformation to the target variable, such that we take the logarithm of the target value instead of the raw value. This is often referred to as log-transforming the target variable (this transformation can also be applied to feature values). In the following code, we will apply a log transformation to the target variable and plot a histogram of the log-transformed values:
A second type of transformation that is useful in the case of target values that do not take on negative values and, in addition, might take on a very wide range of values, is to take the square root of the variable. We will apply the square root transform in the following code, once more plotting the resulting target variable distribution:
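A minimal numpy sketch of both transformations on a made-up, right-skewed set of counts:

```python
import numpy as np

# Made-up skewed target counts
targets = np.array([1.0, 5.0, 20.0, 100.0, 500.0])

log_targets = np.log(targets)    # log transform
sqrt_targets = np.sqrt(targets)  # square root transform

# Both transforms are invertible, which matters later when we map
# model predictions back onto the original scale
recovered = np.exp(log_targets)
```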
From the plots of the log and square root transformations, we can see that both result in a more even distribution relative to the raw values. While they are still not normally distributed, they are a lot closer to a normal distribution when compared to the original target variable.
Impact of training on log-transformed targets
So, does applying these transformations have any impact on model performance? Let’s evaluate the various metrics we used previously on log-transformed data as an example. We will do this first for the linear model by applying the numpy log function to the label field of each LabeledPoint RDD. Here, we will only transform the target variable, and we will not apply any transformations to the features:
Note that now that we have transformed the target variable, the predictions of the model will be on the log scale, as will the target values of the transformed dataset. Therefore, in order to use our model and evaluate its performance, we must first transform the log data back into the original scale by taking the exponent of both the predicted and true values using the numpy exp function. We will show you how to do this in the code here:
If we compare these results to the results on the raw target variable, we see that while we did not improve the MSE or MAE, we improved the RMSLE. We will perform the same analysis for the decision tree model:
It is probably not surprising that the log transformation results in a better RMSLE performance for the linear model. As we are minimizing the squared error, once we have transformed the target variable to log values, we are effectively minimizing a loss function that is very similar to the RMSLE. This is good for Kaggle competition purposes, since we can more directly optimize against the competition-scoring metric. It might or might not be as useful in a real-world situation. This depends on how important larger absolute errors are (recall that RMSLE essentially penalizes relative errors rather than absolute magnitude of errors).
Tuning model parameters
So far in this chapter, we have illustrated the concepts of model training and evaluation for MLlib’s regression models by training and testing on the same dataset. We will now use a similar cross-validation approach that we used previously to evaluate the effect on performance of different parameter settings for our models.
Creating training and testing sets to evaluate parameters 
The first step is to create a test and training set for cross-validation purposes. Spark’s Python API does not yet provide the randomSplit convenience method that is available in Scala. Hence, we will need to create a training and test dataset manually. One relatively easy way to do this is by first taking a random sample of, say, 20 percent of our data as our test set. We will then define our training set as the elements of the original RDD that are not in the test set RDD. We can achieve this using the sample method to take a random sample for our test set, followed by using the subtractByKey method, which takes care of returning the elements in one RDD where the keys do not overlap with the other RDD. Note that subtractByKey, as the name suggests, works on the keys of the RDD elements that consist of key-value pairs. Therefore, here we will use zipWithIndex on our RDD of extracted training examples. This creates an RDD of (LabeledPoint, index) pairs.
We will then reverse the keys and values so that we can operate on the index keys:
data_with_idx = data.zipWithIndex().map(lambda kv: (kv[1], kv[0]))
test = data_with_idx.sample(False, 0.2, 42)
train = data_with_idx.subtractByKey(test)
Once we have the two RDDs, we will recover just the LabeledPoint instances we need for training and test data, using map to extract the value from the key-value pairs:
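A plain-Python sketch of the overall split logic, standing in for the sample and subtractByKey calls; the records and the 20 percent threshold are made-up stand-ins for the real LabeledPoint RDD:

```python
import random

random.seed(42)  # fixed seed, as in the sample() call

# Stand-ins for the extracted LabeledPoint records
data = ["record_%d" % i for i in range(100)]
data_with_idx = list(enumerate(data))

# Sample roughly 20 percent of the indices as the test set
test_idx = {i for i, _ in data_with_idx if random.random() < 0.2}

# The training set is everything whose index is not in the test set,
# and dropping the indices recovers just the records themselves
test_data = [rec for i, rec in data_with_idx if i in test_idx]
train_data = [rec for i, rec in data_with_idx if i not in test_idx]
```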
The impact of parameter settings for linear models
Now that we have prepared our training and test sets, we are ready to investigate the impact of different parameter settings on model performance. We will first carry out this evaluation for the linear model. We will create a convenience function to evaluate the relevant performance metric by training the model on the training set and evaluating it on the test set for different parameter settings.
We will use the RMSLE evaluation metric, as it is the one used in the Kaggle competition with this dataset, and this allows us to compare our model results against the competition leaderboard to see how we perform. The evaluation function is defined here:
Note that in the following sections, you might get slightly different results due to some random initialization for SGD. However, your results will be comparable.
As we saw when evaluating our classification models, we generally expect that a model trained with SGD will achieve better performance as the number of iterations increases, although the increase in performance will slow down as the number of iterations goes above some minimum number. Note that here, we will set the step size to 0.01 to better illustrate the impact at higher iteration numbers:
The output shows that the error metric indeed decreases as the number of iterations increases. It also does so at a decreasing rate, again as expected. What is interesting is that eventually, the SGD optimization tends to overshoot the optimal solution, and the RMSLE eventually starts to increase slightly:
Here, we will use the matplotlib library to plot a graph of the RMSLE metric against the number of iterations. We will use a log scale for the x axis to make the output easier to visualize:
Now, we can see why we avoided using the default step size when training the linear model originally. The default is set to 1.0, which, in this case, results in a nan output for the RMSLE metric. This typically means that the SGD model has converged to a very poor local minimum in the error function that it is optimizing. This can happen when the step size is relatively large, as it is easier for the optimization algorithm to overshoot good solutions. We can also see that for low step sizes and a relatively low number of iterations (we used 10 here), the model performance is slightly poorer. However, in the preceding Iterations section, we saw that for the lower step-size setting, a higher number of iterations will generally converge to a better solution. Generally speaking, setting step size and number of iterations involves a trade-off. A lower step size means that convergence is slower but slightly more assured. However, it requires a higher number of iterations, which is more costly in terms of computation and time, in particular at a very large scale. Selecting the best parameter settings can be an intensive process that involves training a model on many combinations of parameter settings and selecting the best outcome. Each instance of model training involves a number of iterations, so this process can be very expensive and time consuming when performed on very large datasets. The output is plotted here, again using a log scale for the step-size axis:
L2 regularization
In Chapter 5, Building a Classification Model with Spark, we saw that regularization has the effect of penalizing model complexity in the form of an additional loss term that is a function of the model weight vector. L2 regularization penalizes the L2-norm of the weight vector, while L1 regularization penalizes the L1-norm. We expect training set performance to deteriorate with increasing regularization, as the model cannot fit the dataset well. However, we would also expect some amount of regularization that will result in optimal generalization performance as evidenced by the best performance on the test set. We will evaluate the impact of different levels of L2 regularization in this code:
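Since the original evaluation code is not shown here, the following sketch illustrates the idea: SGD with an added L2 penalty term (mirroring `regParam` with `regType='l2'` in `LinearRegressionWithSGD`), swept over several regularization levels on synthetic data. All names and values are illustrative:

```python
import math
import random

def rmsle(preds, actuals):
    return math.sqrt(sum((math.log(p + 1) - math.log(a + 1)) ** 2
                         for p, a in zip(preds, actuals)) / len(preds))

def train_sgd_l2(data, num_iterations, step, reg):
    # SGD with an L2 penalty: the gradient picks up an extra reg * w term
    w = [0.0] * len(data[0][0])
    for _ in range(num_iterations):
        for x, y in data:
            err = sum(wi * xi for wi, xi in zip(w, x)) - y
            w = [wi - step * (err * xi + reg * wi) for wi, xi in zip(w, x)]
    return w

random.seed(42)
data = [([x1, x2], max(0.0, 2 * x1 + 3 * x2 + random.gauss(0, 0.1)))
        for x1, x2 in ((random.random(), random.random()) for _ in range(200))]

reg_params = [0.0, 0.01, 0.1, 1.0, 10.0]
metrics = []
for reg in reg_params:
    w = train_sgd_l2(data, 50, 0.01, reg)
    preds = [max(0.0, sum(wi * xi for wi, xi in zip(w, x))) for x, _ in data]
    metrics.append(rmsle(preds, [y for _, y in data]))
    print("L2 reg %.2f, RMSLE = %.4f" % (reg, metrics[-1]))
```

As expected, heavy regularization shrinks the weights towards zero and the training-set RMSLE deteriorates accordingly.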
L1 regularization
Again, the results are more clearly seen when plotted in the following graph. We see a much more subtle decline in RMSLE, and it takes a very high value of the regularization parameter to cause a jump back up. Here, the level of L1 regularization required is much higher than that for the L2 form; however, the overall performance is poorer:
We can see from the results that as we might expect, the number of zero feature weights in the model weight vector increases as greater levels of L1 regularization are applied:
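The sparsity-inducing effect of L1 can be demonstrated with the same toy SGD setup, applying a soft-thresholding (proximal) step after each gradient update, which is one standard way to implement an L1 penalty (analogous to `regType='l1'` in MLlib; this is an illustrative sketch, not the original code). Only the first two of eight features drive the target, so increasing the penalty should zero out more and more weights:

```python
import math
import random

def train_sgd_l1(data, num_iterations, step, reg):
    # SGD with an L1 penalty via a soft-thresholding step after each update
    w = [0.0] * len(data[0][0])
    for _ in range(num_iterations):
        for x, y in data:
            err = sum(wi * xi for wi, xi in zip(w, x)) - y
            w = [wi - step * err * xi for wi, xi in zip(w, x)]
            thresh = step * reg
            # Shrink each weight towards zero; weights below the threshold
            # become exactly zero, which is what produces sparsity
            w = [math.copysign(max(abs(wi) - thresh, 0.0), wi) for wi in w]
    return w

random.seed(42)
num_features = 8  # only the first two actually drive the target
data = []
for _ in range(200):
    x = [random.random() for _ in range(num_features)]
    y = max(0.0, 2 * x[0] + 3 * x[1] + random.gauss(0, 0.1))
    data.append((x, y))

counts = {}
for reg in [0.0, 0.1, 1.0, 10.0, 50.0]:
    w = train_sgd_l1(data, 20, 0.01, reg)
    counts[reg] = sum(1 for wi in w if wi == 0.0)
    print("L1 reg %.1f -> %d zero weights out of %d"
          % (reg, counts[reg], num_features))
```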
7.5 Intercept
The final parameter option for the linear model is whether to use an intercept or not. An intercept is a constant term added to the model's prediction that effectively accounts for the mean value of the target variable. If the data is already centered or normalized, an intercept is not strictly necessary; however, it rarely hurts to use one in any case. We will evaluate the effect of adding an intercept term to the model here:
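A sketch of that comparison, again in plain Python with synthetic data (the intercept is folded into the weight vector by prepending a constant 1.0 feature, which is the usual trick). The target here has a large constant offset, so the intercept should matter:

```python
import math
import random

def rmsle(preds, actuals):
    return math.sqrt(sum((math.log(p + 1) - math.log(a + 1)) ** 2
                         for p, a in zip(preds, actuals)) / len(preds))

def train_sgd(data, num_iterations, step, intercept):
    # With intercept=True we prepend a constant 1.0 feature, so the bias
    # is learned as just another weight
    feats = [([1.0] + x if intercept else x) for x, _ in data]
    ys = [y for _, y in data]
    w = [0.0] * len(feats[0])
    for _ in range(num_iterations):
        for x, y in zip(feats, ys):
            err = sum(wi * xi for wi, xi in zip(w, x)) - y
            w = [wi - step * err * xi for wi, xi in zip(w, x)]
    def predict(x):
        z = [1.0] + x if intercept else x
        return max(0.0, sum(wi * xi for wi, xi in zip(w, z)))
    return predict

random.seed(42)
# The target has a large constant offset of 10
data = [([x1], max(0.0, 10.0 + 2 * x1 + random.gauss(0, 0.1)))
        for x1 in (random.random() for _ in range(200))]

results = {}
for use_intercept in (False, True):
    predict = train_sgd(data, 100, 0.01, use_intercept)
    preds = [predict(x) for x, _ in data]
    results[use_intercept] = rmsle(preds, [y for _, y in data])
    print("intercept=%s, RMSLE = %.4f" % (use_intercept, results[use_intercept]))
```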
The impact of parameter settings for the decision tree
Decision trees provide two main parameters: maximum tree depth and the maximum number of bins. We will now perform the same evaluation of the effect of parameter settings for the decision tree model. Our starting point is to create an evaluation function for the model, similar to the one used for the linear regression earlier. This function is provided here:
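Since the helper itself is not reproduced on this page, here is a sketch of what such an evaluation function can look like, using scikit-learn's `DecisionTreeRegressor` as a stand-in for Spark MLlib's `DecisionTree.trainRegressor` (the data below is synthetic and purely for illustration):

```python
# Sketch of a tree-evaluation helper; scikit-learn stands in for MLlib here.
import math
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def evaluate_dt(train_X, train_y, test_X, test_y, max_depth):
    """Train a regression tree of the given depth and return test-set RMSLE."""
    model = DecisionTreeRegressor(max_depth=max_depth, random_state=42)
    model.fit(train_X, train_y)
    preds = np.clip(model.predict(test_X), 0.0, None)  # counts are non-negative
    return float(np.sqrt(np.mean((np.log1p(preds) - np.log1p(test_y)) ** 2)))

rng = np.random.default_rng(42)
X = rng.uniform(0, 10, size=(500, 2))
y = np.maximum(0.0, 20 * np.sin(X[:, 0]) + X[:, 1] + rng.normal(0, 1, 500) + 25)
train_X, test_X = X[:400], X[400:]
train_y, test_y = y[:400], y[400:]

print("depth 5 RMSLE = %.4f" % evaluate_dt(train_X, train_y, test_X, test_y, 5))
```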
We would generally expect performance to increase with more complex trees (that is, trees of greater depth). Having a lower tree depth acts as a form of regularization, and it might be the case that, as with L2 or L1 regularization in linear models, there is a tree depth that is optimal with respect to test set performance. Here, we will try increasing the tree depth to see what impact it has on the test set RMSLE, keeping the number of bins at the default level of 32:
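A sketch of that depth sweep, again with scikit-learn's `DecisionTreeRegressor` as an illustrative stand-in for MLlib's `DecisionTree` on synthetic data (not the bike-sharing set). Printing both train and test RMSLE makes the over-fitting pattern visible — training error keeps falling with depth while test error eventually stops improving:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def rmsle_np(preds, actuals):
    preds = np.clip(preds, 0.0, None)
    return float(np.sqrt(np.mean((np.log1p(preds) - np.log1p(actuals)) ** 2)))

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(600, 2))
# Noisy non-linear target, so deep trees can memorize noise
y = np.maximum(0.0, 20 * np.sin(X[:, 0]) + X[:, 1] + rng.normal(0, 3, 600) + 25)
train_X, test_X, train_y, test_y = X[:500], X[500:], y[:500], y[500:]

train_ms, test_ms = [], []
for depth in [1, 2, 3, 5, 10, 20]:
    model = DecisionTreeRegressor(max_depth=depth,
                                  random_state=42).fit(train_X, train_y)
    train_ms.append(rmsle_np(model.predict(train_X), train_y))
    test_ms.append(rmsle_np(model.predict(test_X), test_y))
    print("depth %2d: train RMSLE %.4f, test RMSLE %.4f"
          % (depth, train_ms[-1], test_ms[-1]))
```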
In this case, it appears that the decision tree starts over-fitting at deeper tree levels. An optimal tree depth appears to be around 10 on this dataset. Notice that our best RMSLE of 0.42 is now quite close to the Kaggle winner of around 0.29! The output of the tree depth is as follows:
Finally, we will perform our evaluation on the impact of setting the number of bins for the decision tree. As with the tree depth, a larger number of bins should allow the model to become more complex and might help performance with larger feature dimensions. After a certain point, it is unlikely that it will help any more and might, in fact, hinder performance on the test set due to over-fitting:
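MLlib's `maxBins` parameter controls how finely continuous features are discretized when candidate splits are searched. scikit-learn trees consider exact thresholds instead, so to illustrate the effect here we discretize the features ourselves (hand-rolled quantile binning) before training, with the tree depth held fixed. This is an analogy to `maxBins`, not the same mechanism, and the data is synthetic:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def bin_features(X, ref_X, num_bins):
    # Replace each value with the index of its quantile bin, with bin
    # edges computed from the training data only
    binned = np.empty_like(X)
    for j in range(X.shape[1]):
        edges = np.quantile(ref_X[:, j], np.linspace(0, 1, num_bins + 1)[1:-1])
        binned[:, j] = np.digitize(X[:, j], edges)
    return binned

def rmsle_np(preds, actuals):
    preds = np.clip(preds, 0.0, None)
    return float(np.sqrt(np.mean((np.log1p(preds) - np.log1p(actuals)) ** 2)))

rng = np.random.default_rng(1)
X = rng.uniform(0, 10, size=(600, 2))
y = np.maximum(0.0, 20 * np.sin(X[:, 0]) + X[:, 1] + rng.normal(0, 1, 600) + 25)
train_X, test_X, train_y, test_y = X[:500], X[500:], y[:500], y[500:]

results = {}
for num_bins in [2, 4, 8, 16, 32, 64]:
    tb = bin_features(train_X, train_X, num_bins)
    eb = bin_features(test_X, train_X, num_bins)
    model = DecisionTreeRegressor(max_depth=5, random_state=42).fit(tb, train_y)
    results[num_bins] = rmsle_np(model.predict(eb), test_y)
    print("bins %2d: test RMSLE %.4f" % (num_bins, results[num_bins]))
```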
Here, we will show the output and plot for varying numbers of bins (while keeping the tree depth at the default level of 5). In this case, using a small number of bins hurts performance, while there is little impact once we use around 32 bins (the default setting) or more. There seems to be an optimal setting for test set performance at around 16-20 bins:
