ICT110 Introduction To Data Science For Statistical Software

Course Code: ICT110

University: University Of The Sunshine Coast

Country: Australia

Question:

Provide an introduction to the problem. Include background material as appropriate: who cares about this problem, what impact it has, and where the data comes from.

Answer:

Introduction

Authorization and Purpose

The purpose of this report is to explore data on health and population statistics. The data were retrieved from the World Bank, which is a secondary source of data. The dataset contains world health and population information for the years 2001 to 2015 for countries across East Asia and the Pacific. The analysis is performed in R using the RStudio environment.

Limitations

The dataset used in this report is secondary data retrieved from the World Bank. It contains many missing values for variables that are assumed to be important.

Scope

The dataset retrieved for this study supports one-variable and two-variable analyses, each with a graphical representation. For further analysis, k-means cluster analysis and regression analysis can also be performed.

Methodology

Exploratory data analysis is performed using descriptive statistics, and the advanced analysis is performed using k-means clustering and regression analysis.

Data Setup

The Health and Population dataset retrieved from the World Bank is in .csv format. It contains information on numerous attributes, but values are not present for every attribute in every year. The many unrecorded values have to be cleaned before the analysis. The cleaning and extraction of the data are performed in R (via RStudio) with the help of several packages. The packages required for the analysis are listed as follows:

data.table: Used to provide an enhanced data frame and faster data manipulation

reshape2: Used to reshape the data

psych: Used for multivariate analysis and descriptive statistics

ggplot2: Used for plotting data

lattice: Graphics package

dplyr: Used for data cleaning and manipulation

The R code used for data cleaning is given in the following table:

###===============Importing Data File in R=================###

health <- read.csv(file.choose(), header = TRUE, sep = ",", na.strings = "..", blank.lines.skip = TRUE)
###===================Libraries Used===================###
library(data.table)
library(reshape2)
library(psych)
library(ggplot2)
library(lattice)
library(dplyr)
###======================Data Cleaning======================###
health1 <- data.table(health)
# Keep only the series used in the analysis
health1 <- health1[Series.Code %in% c("SP.ADO.TFRT", "SP.DYN.CBRT.IN", "SP.DYN.LE00.IN", "SL.UEM.TOTL.ZS",
"SL.UEM.TOTL.FE.ZS", "NY.GNP.PCAP.CD", "SH.XPD.TOTL.ZS")]
health1 <- health1[,"X2015..YR2015." := NULL]
health1 <- na.omit(health1)
str(health1)
View(health1)
###=============Converting Years in Rows and Attributes in Columns=============###
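# Character/factor identifier columns are kept as id variables; data.table::melt is called
# explicitly so the result stays a data.table (reshape2, loaded later, masks melt)
id_cols <- names(health1)[sapply(health1, function(col) is.character(col) || is.factor(col))]
health1 <- data.table::melt(health1, id.vars = id_cols)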
View(health1)
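The wide-to-long step above can be illustrated on a tiny table. The sketch below uses base R's reshape() in place of melt so that it runs without extra packages; the two country codes, the year columns X2001 and X2002, and all values are made up for illustration and are not taken from the World Bank file.

```r
# Toy wide table: one row per country and series, one column per year (made-up values)
wide <- data.frame(
  Country.Code = c("AUS", "FJI"),
  Series.Code  = c("SP.DYN.CBRT.IN", "SP.DYN.CBRT.IN"),
  X2001 = c(12.8, 24.1),
  X2002 = c(12.7, 23.8)
)

# Stack the year columns into (variable, value) pairs, mirroring melt's output names
long <- reshape(
  wide,
  direction = "long",
  varying   = c("X2001", "X2002"),
  v.names   = "value",
  timevar   = "variable",
  times     = c("X2001", "X2002"),
  idvar     = c("Country.Code", "Series.Code")
)
rownames(long) <- NULL
print(long)
```

Each country now contributes one row per year, which is the shape the later subsetting by Series.Code and the plotting of value rely on.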
Exploratory Data Analysis
One Variable Analysis
Summary of Adolescent Fertility Rate
The R code in table 3.1 and the boxplot in figure 3.1 show the summary measures of the variable adolescent fertility rate.
The boxplot shows the distribution of the fertility rate in women aged 15 to 19. The black line in the plot shows the median of the distribution, which is found to be less than the mean. A median below the mean indicates a right-skewed distribution, so most observations lie below the average rate. There are no outliers in the data.
Table 3.1
###============Adolescent Fertility Rate================###
exp1 <- health1[Series.Code %in% "SP.ADO.TFRT"]
describe(exp1$value)
fill <- "green"
exp_plot1 <- ggplot(exp1, aes(x = factor(0), y = value)) + geom_boxplot(fill = fill)
exp_plot1 <- exp_plot1 + xlab("Adolescent fertility rate (births per 1,000 women ages 15-19)") + scale_x_discrete(breaks = NULL)
exp_plot1 <- exp_plot1 + ggtitle("Distribution of Adolescent fertility rate (births per 1,000 women ages 15-19)") + theme_bw()
plot(exp_plot1)
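The mean-versus-median reading used above can be checked numerically. The following is a minimal sketch on made-up rates (not the World Bank values); the variable names are hypothetical.

```r
# Illustrative values only; not taken from the dataset
rates <- c(5, 8, 10, 12, 15, 20, 45, 60)

centre <- c(mean = mean(rates), median = median(rates))

# A mean above the median suggests a right-skewed distribution
right_skewed <- unname(centre["mean"] > centre["median"])

# Share of observations that fall below the mean
share_below_mean <- mean(rates < centre["mean"])

print(centre)
print(right_skewed)
print(share_below_mean)
```

When the mean exceeds the median, more than half of the observations sit below the mean, which is the sense in which "most" rates are below average.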
Summary of Crude Birth Rate
The R code in table 3.3 and the histogram in figure 3.2 show the summary measures of the variable Crude Birth Rate. The histogram shows the distribution of the Crude Birth Rate. The bars show that the data are negatively skewed, which indicates that the crude birth rate is high in most cases.
###============Crude Birth Rate================###
exp2 <- health1[Series.Code %in% "SP.DYN.CBRT.IN"]
describe(exp2$value)
View(exp2)
n <- length(exp2$value)
r <- diff(range(exp2$value))
barfill <- "blue"
barlines <- "black"
exp_plot2 <- ggplot(exp2, aes(x = value)) + geom_histogram(binwidth=r/(log2(n)+1), colour = barlines, fill = barfill)
exp_plot2 <- exp_plot2 + scale_x_continuous("Birth rate, crude (per 1,000 people)", breaks = seq(0,24,3), limits = c(0,21)) + scale_y_continuous("Count")+theme_bw()
exp_plot2 <- exp_plot2 + ggtitle("Distribution of Birth rate, crude per 1,000 people")+ theme(plot.title = element_text(hjust = 0.5))
plot(exp_plot2)
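The binwidth expression in the histogram code, r/(log2(n)+1), corresponds to Sturges' rule for the number of bins. A minimal sketch on made-up values, with hypothetical variable names, shows the arithmetic:

```r
# Illustrative values only; not taken from the dataset
x <- c(10, 12, 13, 15, 16, 18, 19, 21)

n <- length(x)            # number of observations
r <- diff(range(x))       # spread of the data
k <- log2(n) + 1          # Sturges' number of bins (before rounding)
binwidth <- r / k         # width passed to geom_histogram in the code above

# Base R's own Sturges helper rounds k up to a whole number of bins
k_base <- nclass.Sturges(x)

print(c(n = n, range = r, bins = k, binwidth = binwidth, bins_base = k_base))
```

Because the number of bins grows only with log2(n), the rule can over-smooth large or strongly skewed samples, so the resulting histogram is worth checking by eye.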
Summary of Total Life Expectancy at Birth
The R code in table 3.5 and the boxplot in figure 3.3 show the summary measures of the variable Total Life Expectancy at Birth. The boxplot shows the distribution of the Total Life Expectancy at Birth. The black line in the plot shows the median of the distribution, which is almost equal to the mean. Thus, the Total Life Expectancy at Birth is approximately symmetrically distributed. There are no outliers in the data.
Table 3.5
###============Total Life Expectancy at Birth================###
exp3 <- health1[Series.Code %in% "SP.DYN.LE00.IN"]
describe(exp3$value)
fill <- "green"
exp_plot3 <- ggplot(exp3, aes(x = factor(0), y = value)) + geom_boxplot(fill = fill)
exp_plot3 <- exp_plot3 + xlab("Life expectancy at birth, total (years)") + scale_x_discrete(breaks = NULL)
exp_plot3 <- exp_plot3 + ggtitle("Life expectancy at birth, total (years)") + theme_bw()
plot(exp_plot3)
Two Variable Analysis
Country Wise Analysis of total Unemployment
The total unemployment rate is summarised for each country. The summary is presented as boxplots, which compare the distribution of the unemployment rate across countries.
A boxplot is used here because one variable is numerical and the other is categorical. In the 37 boxplots representing countries, there are outliers in the countries coded as CHN, LAO, MYS, SLB, THA and VNM. SLB has two outliers. PHL has the highest rate of unemployment. MMR, PRK and SLB have the lowest unemployment rates.
Country Wise Analysis of total health expenditure
The total health expenditure is summarised for each country. The summary is presented as boxplots, which compare the distribution of total health expenditure across countries.
A boxplot is used here because one variable is numerical and the other is categorical. In the 37 boxplots representing countries, there are outliers in the countries coded as KHM, KIR, PLW, SGP, TLS, TUV, VUT and WSM. KHM and KIR have two outliers each. NRU has the highest health expenditure. MMR has the lowest health expenditure.
###============Total Health Expenditure (Country Wise)================###
exp6 <- health1[Series.Code %in% "SH.XPD.TOTL.ZS"]
fill <- "pink"
exp_plot5 <- ggplot(exp6, aes(x = Country.Code, y = value)) + geom_boxplot(fill = fill)
exp_plot5 <- exp_plot5 + scale_x_discrete(name = "Country") + scale_y_continuous(name = "Total Health Expenditure")+ theme_bw()
exp_plot5 <- exp_plot5 +theme(axis.text.x = element_text(angle = 90, hjust = 1)) + ggtitle("Total Health Expenditure")
plot(exp_plot5)
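The outliers reported in the country-wise boxplots are points lying more than 1.5 times the interquartile range beyond the box, which is the default coefficient in geom_boxplot. A minimal sketch of that rule with base R's boxplot.stats follows; the country labels AAA and BBB and all values are hypothetical, and the hinges computed by boxplot.stats may differ slightly from the quartiles ggplot2 uses.

```r
# Made-up yearly values for two hypothetical countries
toy <- data.frame(
  Country.Code = rep(c("AAA", "BBB"), each = 6),
  value = c(3.0, 3.2, 3.1, 3.3, 3.2, 9.0,   # AAA: one unusually high year
            5.0, 5.1, 5.2, 5.3, 5.4, 5.5)   # BBB: no extreme years
)

# For each country, return the values flagged by the 1.5 * IQR rule
outliers <- lapply(
  split(toy$value, toy$Country.Code),
  function(v) boxplot.stats(v, coef = 1.5)$out
)
print(outliers)
```

Listing the flagged values per country in this way makes statements such as "SLB has two outliers" easy to verify against the plot.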
Advanced Analysis
Clustering
Brief Explanation of k-means and Clustering
In this method, the data are segmented into k groups on the basis of the group means. Each observation is assigned to the group whose mean (centroid) is closest to it, and the group means are then recomputed until the assignments stop changing (Chatfield, 2018).
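The assignment-to-nearest-mean idea can be seen on a small example. This is a sketch on made-up crude death rate (cdr) and crude birth rate (cbr) values with hypothetical column names, using the same kmeans function as the analysis.

```r
# Six made-up observations forming two visibly separate groups
toy <- data.frame(
  cdr = c(5.0, 5.5, 6.0, 9.0, 9.5, 10.0),
  cbr = c(30, 32, 31, 10, 11, 12)
)

set.seed(42)                                  # k-means starts from random centres
fit <- kmeans(toy, centers = 2, nstart = 10)  # nstart reruns and keeps the best fit

print(fit$cluster)   # group label assigned to each row
print(fit$centers)   # group means (centroids) in the original units
```

The cluster labels 1 and 2 are arbitrary and can swap between runs; only the grouping itself is meaningful. Because k-means relies on Euclidean distance, variables measured on very different scales may need standardising first.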
Clustering Analysis
In this study, k-means clustering analysis is performed on the Crude Birth Rate (CBR) and the Crude Death Rate (CDR), with countries as the units being grouped. As there was no data for the year 2015, the whole column was removed from the data. In the figure, each country code is coloured by the cluster it belongs to, one cluster in red and the other in green. The green countries show higher birth and death rates, while the red countries show a higher death rate and a lower birth rate.
The code is given in the following table (Husson, Lê & Pagès, 2017):
###============K-Means Clustering (Country Wise)================###
cluster <- filter(health, Series.Code %in% c("SP.DYN.CDRT.IN","SP.DYN.CBRT.IN","SH.IMM.IBCG"))
cluster <- subset(cluster, select = -(X2015..YR2015.))
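# Character/factor identifier columns (series and country fields) are used as id variables
cluster <- melt(cluster, id.vars = names(cluster)[sapply(cluster, function(col) is.character(col) || is.factor(col))])
# Average each series over the years to get one row per country
cluster <- dcast(cluster, formula = Country.Code ~ Series.Code, fun.aggregate = mean)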
cluster <- na.omit(cluster)
cluster
group <- kmeans(cluster[,c("SP.DYN.CDRT.IN","SP.DYN.CBRT.IN")],centers = 2, nstart = 10)
group
ord <- order(group$cluster)
data.frame(cluster$Country.Code[ord], group$cluster[ord])
# Base graphics draw by side effect, so plot() and text() are called separately
plot(cluster$SP.DYN.CDRT.IN, cluster$SP.DYN.CBRT.IN, type = "n", xlim = c(0, 10), xlab = "Crude Death Rate", ylab = "Crude Birth Rate")
text(x = cluster$SP.DYN.CDRT.IN, y = cluster$SP.DYN.CBRT.IN, labels = cluster$Country.Code, col = group$cluster + 1)
Linear Regression
Brief Definition of Linear Regression
The relationship between two numerical variables is established with the help of regression analysis (Fox, 2015). The general equation of linear regression is given by:
Y = a + bX
Here, X and Y are the independent and the dependent variables respectively, a is the intercept (the value of Y when X is zero) and b is the slope of the regression line (Draper & Smith, 2014).
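The roles of a and b can be seen by fitting a line to a few points. The sketch below uses made-up values and hypothetical variable names; lm estimates the intercept and slope by least squares.

```r
# Made-up data scattered around a straight line
x <- c(1, 2, 3, 4, 5)
y <- c(5.1, 7.9, 11.2, 13.8, 17.0)

fit <- lm(y ~ x)

a <- unname(coef(fit)["(Intercept)"])  # estimated value of Y when X is zero
b <- unname(coef(fit)["x"])            # estimated change in Y per unit increase in X

print(c(a = a, b = b))
```

A positive b means Y tends to rise with X, a negative b means it tends to fall, and a value of b near zero indicates little linear relationship.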
Relation between tertiary school enrolment of females and female unemployment rate
From the analysis, it can be seen that the regression line shows a negative relationship between the independent and the dependent variables. As female tertiary school enrolment increases, the female unemployment rate decreases.
The code is given in the following table (Berk, 2016):
Relation between immunization rate and CDR
From the analysis, it can be seen that the regression line shows a very weak relationship between the independent and the dependent variables. Thus, the data suggest little or no linear effect of immunization on the Crude Death Rate.
The code is given in the following table:
###============Regression-Immunization Rate and CDR================###
# `reg` is assumed to be the wide-format data frame with one column per series
reg_model2 <- lm(formula = SP.DYN.CDRT.IN ~ SH.IMM.IBCG, data = reg)
summary(reg_model2)
# Plot from the data frame itself and keep the model and the plot in separate objects
reg_plot2 <- ggplot(reg, aes(x = SH.IMM.IBCG, y = SP.DYN.CDRT.IN)) + geom_point(shape = 2) + scale_x_continuous(name = "Immunization, BCG (% of one-year-old children)") + scale_y_continuous(name = "Crude Death rate per 1,000 people") + geom_smooth(method = "lm") + theme_bw() + ggtitle("Relation of Crude Death Rate to Immunization BCG rate of one-year-old children")
plot(reg_plot2)
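How weak a "very weak relationship" is can be read from the R-squared value and the p-value of the slope in the model summary. The sketch below uses made-up immunization and death-rate values with hypothetical variable names, not the report's data.

```r
# Made-up values with no clear linear pattern
immunization <- c(80, 85, 90, 95, 99, 92, 88, 97)
death_rate   <- c(6.1, 7.0, 5.8, 6.9, 6.2, 7.1, 5.9, 6.6)

fit <- lm(death_rate ~ immunization)
s <- summary(fit)

r_squared <- s$r.squared                                 # share of variance explained
slope_p   <- s$coefficients["immunization", "Pr(>|t|)"]  # p-value for the slope

print(c(r_squared = r_squared, slope_p = slope_p))
```

An R-squared close to zero together with a large slope p-value indicates that the fitted line explains little of the variation. This shows the absence of a detectable linear association in the sample rather than proof that the predictor has no effect.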
Conclusion
From the analysis conducted, the variables have been examined one at a time and two at a time. The k-means clustering analysis grouped the countries according to their birth and death rates. A negative relationship was observed between female education and female unemployment, and little or no relationship was observed between immunization and the death rate.
Reflection
The main problem faced was the selection of variables, as most of the variables have many missing values. The results were obtained with a selected subset of variables and could have been better if there were fewer missing values.
Reference List
Chatfield, C. (2018). Introduction to multivariate analysis. Routledge.
Husson, F., Lê, S., & Pagès, J. (2017). Exploratory multivariate analysis by example using R. Chapman and Hall/CRC.
Fox, J. (2015). Applied regression analysis and generalized linear models. Sage Publications.
Draper, N. R., & Smith, H. (2014). Applied regression analysis (Vol. 326). John Wiley & Sons.
Berk, R. A. (2016). Statistical learning from a regression perspective. New York: Springer.