{"id":289,"date":"2014-09-15T08:25:00","date_gmt":"2014-09-15T08:25:00","guid":{"rendered":"https:\/\/staffblogs.le.ac.uk\/bayeswithstata\/?p=289"},"modified":"2025-02-26T13:21:38","modified_gmt":"2025-02-26T13:21:38","slug":"smoothing-a-scatter-plot","status":"publish","type":"post","link":"https:\/\/staffblogs.le.ac.uk\/bayeswithstata\/2014\/09\/15\/smoothing-a-scatter-plot\/","title":{"rendered":"Smoothing a scatter plot"},"content":{"rendered":"<p>In my last posting I mentioned that it is possible to use Stata to run R packages by preparing an R script, calling R from within Stata so that R runs in the background, and then reading R&#8217;s results into Stata for further processing. The big advantage of this method is that one does not need to rewrite the R package in order to make its analysis available in Stata. To illustrate this process I will analyse some data using a Bayesian scatter plot smoother that is available in an R package called DPpackage.<\/p>\n<p>Before I use this R package, I&#8217;ll introduce the statistical problem by describing a simpler Bayesian scatter plot smoother implemented entirely in Stata.<\/p>\n<p>Some time ago I showed how to fit Bayesian finite mixture models in Stata and I introduced a program called <strong>mixmnormal<\/strong> (<a href=\"https:\/\/staffblogs.le.ac.uk\/bayeswithstata\/2014\/05\/09\/mixtures-of-normal-distributions\/\">https:\/\/staffblogs.le.ac.uk\/bayeswithstata\/2014\/05\/09\/mixtures-of-normal-distributions\/<\/a>); now we will adapt that program to smooth a scatter plot. The idea behind this use of mixture models dates back at least to Muller et al (<em>Biometrika<\/em> 1996;83:67-79), who described how to use an infinite normal mixture as a scatter plot smoother. 
R&#8217;s DPpackage implements Muller&#8217;s method, but this week I will use a large but finite mixture in order to obtain an approximation to that method.<\/p>\n<p>To illustrate the method I will use some data on the concentration of Immunoglobulin G (IgG) measured in a sample of 298 children aged under 6 years. IgG is a component of the immune system and can be found in the blood and other body fluids. It binds to the surfaces of bacteria and viruses, immobilizing them and enabling phagocytic cells to recognise them. The normal range in adults is between 7 and 15 g\/L, but in children the normal range is lower and varies with age. The data were used by Noufaily and Jones (<em>Applied Statistics<\/em> 2013;62:723-740) to illustrate quantile regression based on a generalized gamma distribution.<\/p>\n<p><a href=\"https:\/\/staffblogs.le.ac.uk\/bayeswithstata\/files\/2014\/09\/qfig1.png\"><img loading=\"lazy\" decoding=\"async\" class=\"aligncenter  wp-image-296\" src=\"https:\/\/staffblogs.le.ac.uk\/bayeswithstata\/files\/2014\/09\/qfig1-300x218.png\" alt=\"qfig1\" width=\"363\" height=\"264\" srcset=\"https:\/\/staffblogs.le.ac.uk\/bayeswithstata\/files\/2014\/09\/qfig1-300x218.png 300w, https:\/\/staffblogs.le.ac.uk\/bayeswithstata\/files\/2014\/09\/qfig1.png 716w\" sizes=\"auto, (max-width: 363px) 100vw, 363px\" \/><\/a><\/p>\n<p>The idea behind the mixture method is to approximate the two-dimensional distribution of points by a finite mixture of normal distributions, as in the following illustration.<\/p>\n<p><a href=\"https:\/\/staffblogs.le.ac.uk\/bayeswithstata\/files\/2014\/09\/qfig2.png\"><img loading=\"lazy\" decoding=\"async\" class=\"aligncenter  wp-image-297\" src=\"https:\/\/staffblogs.le.ac.uk\/bayeswithstata\/files\/2014\/09\/qfig2-300x218.png\" alt=\"qfig2\" width=\"362\" height=\"263\" srcset=\"https:\/\/staffblogs.le.ac.uk\/bayeswithstata\/files\/2014\/09\/qfig2-300x218.png 300w, 
https:\/\/staffblogs.le.ac.uk\/bayeswithstata\/files\/2014\/09\/qfig2.png 716w\" sizes=\"auto, (max-width: 362px) 100vw, 362px\" \/><\/a><\/p>\n<p>We can draw a slice through the mixture at a chosen age and obtain a flexible approximation to the conditional distribution by combining the component distributions. Alternatively, we can find the conditional expectation, E(y|x), for any chosen age, x, by combining the conditional means of the components.<\/p>\n<p>The fitted trend line is just the average of these conditional means, each weighted by the probability of a point falling in that component and by the density of the chosen x under that component. Repeating this calculation for lots of x&#8217;s enables us to draw the smooth trend line, shown in blue in the illustration.<\/p>\n<p>In the code below the priors and initial values are set and then the Bayesian mixture fitting program fits the model. The priors are set using the local macro r to denote the number of components in the mixture.<\/p>\n<pre style=\"padding-left: 30px\"><span style=\"color: #0000ff\"> local r = 20<\/span>\r\n<span style=\"color: #0000ff\"> matrix R = J(2,2*`r',0)<\/span>\r\n<span style=\"color: #0000ff\"> forvalues j=1\/`r' {<\/span>\r\n<span style=\"color: #0000ff\">     matrix R[1,2*`j'-1] = 4<\/span>\r\n<span style=\"color: #0000ff\">     matrix R[2,2*`j'] = 4<\/span>\r\n<span style=\"color: #0000ff\"> }<\/span>\r\n<span style=\"color: #0000ff\"> matrix df = J(1,`r',16)<\/span>\r\n<span style=\"color: #0000ff\"> matrix M = J(2,`r',4)<\/span>\r\n<span style=\"color: #0000ff\"> matrix B = J(2,2*`r',0)<\/span>\r\n<span style=\"color: #0000ff\"> forvalues j=1\/`r' {<\/span>\r\n<span style=\"color: #0000ff\">     matrix B[1,2*`j'-1] = 0.1<\/span>\r\n<span style=\"color: #0000ff\">     matrix B[2,2*`j'] = 0.1<\/span>\r\n<span style=\"color: #0000ff\"> }<\/span>\r\n<span style=\"color: #0000ff\"> matrix ALPHA = J(1,`r',40\/`r')<\/span>\r\n<span style=\"color: #0000ff\"> matrix T = J(2,2*`r',0)<\/span>\r\n<span style=\"color: #0000ff\"> forvalues j=1\/`r' {<\/span>\r\n<span style=\"color: #0000ff\">     matrix T[1,2*`j'-1] = 1<\/span>\r\n<span style=\"color: #0000ff\">     matrix T[2,2*`j'] = 1<\/span>\r\n<span style=\"color: #0000ff\"> }<\/span>\r\n<span style=\"color: #0000ff\"> matrix C = J(298,1,1)<\/span>\r\n<span style=\"color: #0000ff\"> forvalues i=1\/298 {<\/span>\r\n<span style=\"color: #0000ff\">     matrix C[`i',1] = 1 + int(20*runiform())<\/span>\r\n<span style=\"color: #0000ff\"> }<\/span>\r\n<span style=\"color: #0000ff\"> matrix P = J(1,`r',1\/`r')<\/span>\r\n<span style=\"color: #0000ff\"> matrix MU = M<\/span>\r\n<span style=\"color: #0000ff\"> mixmnormal igg age, mu(MU) t(T) p(P) n(`r') m(M) b(B) r(R) \/\/\/<\/span>\r\n<span style=\"color: #0000ff\">     df(df) c(C) alpha(ALPHA) \/\/\/<\/span>\r\n<span style=\"color: #0000ff\">     burnin(1000) updates(5000) filename(\"igg.csv\")<\/span><\/pre>\n<p>Now we can calculate the expectations and average them across the iterations. 
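<\/p>\n<p>To make the calculation explicit: for a single bivariate normal component with means &mu;<sub>y<\/sub> and &mu;<sub>x<\/sub>, standard deviations &sigma;<sub>y<\/sub> and &sigma;<sub>x<\/sub>, and correlation &rho;, the conditional mean at age x is &mu;<sub>y<\/sub> + &rho;&sigma;<sub>y<\/sub>(x &minus; &mu;<sub>x<\/sub>)\/&sigma;<sub>x<\/sub>, and the weight given to that component is proportional to p&phi;((x &minus; &mu;<sub>x<\/sub>)\/&sigma;<sub>x<\/sub>), where p is the mixing probability and &phi;(&middot;) is the standard normal density. These are the quantities that the code recovers from each stored precision matrix by inverting it with syminv().<\/p>\n<p>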
The chain of 5000 updates is thinned by taking every 10th from iteration 5 onwards in order to save time in the calculation.<\/p>\n<pre style=\"padding-left: 30px\"> <span style=\"color: #0000ff\">local r = 20<\/span>\r\n<span style=\"color: #0000ff\"> insheet using igg.csv, clear<\/span>\r\n<span style=\"color: #0000ff\"> merge 1:1 _n using igg.dta<\/span>\r\n<span style=\"color: #0000ff\"> range x 0 6 1001<\/span>\r\n<span style=\"color: #0000ff\"> foreach iter of numlist 5(10)5000 {<\/span>\r\n<span style=\"color: #0000ff\">     qui gen e`iter' = 0 in 1\/1001<\/span>\r\n<span style=\"color: #0000ff\">     qui gen w = 0 in 1\/1001<\/span>\r\n<span style=\"color: #0000ff\">     forvalues s=1\/`r' {<\/span>\r\n<span style=\"color: #0000ff\">         matrix T = J(2,2,0)<\/span>\r\n<span style=\"color: #0000ff\">         matrix T[1,1] = t`s'_1_1[`iter']<\/span>\r\n<span style=\"color: #0000ff\">         matrix T[2,1] = t`s'_2_1[`iter']<\/span>\r\n<span style=\"color: #0000ff\">         matrix T[1,2] = t`s'_1_2[`iter']<\/span>\r\n<span style=\"color: #0000ff\">         matrix T[2,2] = t`s'_2_2[`iter']<\/span>\r\n<span style=\"color: #0000ff\">         matrix V = syminv(T)<\/span>\r\n<span style=\"color: #0000ff\">         local sy = sqrt(V[1,1])<\/span>\r\n<span style=\"color: #0000ff\">         local sx = sqrt(V[2,2])<\/span>\r\n<span style=\"color: #0000ff\">         local rho = V[1,2]\/(`sy'*`sx')<\/span>\r\n<span style=\"color: #0000ff\">         local my = mu`s'_1[`iter']<\/span>\r\n<span style=\"color: #0000ff\">         local mx = mu`s'_2[`iter']<\/span>\r\n<span style=\"color: #0000ff\">         local p = p`s'[`iter']<\/span>\r\n<span style=\"color: #0000ff\">         qui replace e`iter' = e`iter' + `p'*(`my'+`rho'*`sy'* \/\/\/<\/span>\r\n<span style=\"color: #0000ff\">            (x - `mx')\/`sx')*normalden((x - `mx')\/`sx')<\/span>\r\n<span style=\"color: #0000ff\">         qui replace w = w + `p'*normalden((x - `mx')\/`sx')<\/span>\r\n<span style=\"color: 
#0000ff\">     }<\/span>\r\n<span style=\"color: #0000ff\">     qui replace e`iter' = e`iter'\/w<\/span>\r\n<span style=\"color: #0000ff\">     drop w<\/span>\r\n<span style=\"color: #0000ff\"> }<\/span>\r\n<span style=\"color: #0000ff\"> egen fmean = rowmean(e*)<\/span>\r\n<span style=\"color: #0000ff\"> egen fp10 = rowpctile(e*) , p(10)<\/span>\r\n<span style=\"color: #0000ff\"> egen fp90 = rowpctile(e*) , p(90)<\/span>\r\n<span style=\"color: #0000ff\"> twoway (scatter igg age) (line fmean x, lpat(solid) lcol(red)) \/\/\/<\/span>\r\n<span style=\"color: #0000ff\">   (line fp10 x, lpat(dash) lcol(blue)) (line fp90 x, lpat(dash) \/\/\/<\/span>\r\n<span style=\"color: #0000ff\">   lcol(blue)), leg(off) xtitle(Age) ytitle(IgG)<\/span><\/pre>\n<p>Because we have a simulated Bayesian model fit, it is a simple matter to find the average trend line and the 10% and 90% quantiles, shown in red and blue in the plot below.<\/p>\n<p><a href=\"https:\/\/staffblogs.le.ac.uk\/bayeswithstata\/files\/2014\/09\/qfig3.png\"><img loading=\"lazy\" decoding=\"async\" class=\"aligncenter  wp-image-300\" src=\"https:\/\/staffblogs.le.ac.uk\/bayeswithstata\/files\/2014\/09\/qfig3-300x218.png\" alt=\"qfig3\" width=\"366\" height=\"266\" srcset=\"https:\/\/staffblogs.le.ac.uk\/bayeswithstata\/files\/2014\/09\/qfig3-300x218.png 300w, https:\/\/staffblogs.le.ac.uk\/bayeswithstata\/files\/2014\/09\/qfig3.png 716w\" sizes=\"auto, (max-width: 366px) 100vw, 366px\" \/><\/a><\/p>\n<p>The choice of the number of components and the setting of the priors together control the degree of smoothing, so care should be taken when choosing these values, perhaps by trying out different choices on simulated data to gain a feel for the degree of smoothing produced by a given set of priors.<\/p>\n<p>In the paper by Noufaily and Jones the authors set themselves the slightly different problem of estimating the quantiles of the sample rather than smoothing the scatter plot. 
The fitted mixture model can be used to address this problem as well.<\/p>\n<p>If we imagine a vertical slice through the fitted bivariate components corresponding to some value x, we obtain something similar to the picture below. The sum of these normal distributions represents the fitted conditional distribution for that particular age. The q% quantile is the value of IgG below which lies q% of the total area of the blue components. All we need to do is try a range of trial values of IgG until we find the one that cuts off the required proportion.<\/p>\n<p><a href=\"https:\/\/staffblogs.le.ac.uk\/bayeswithstata\/files\/2014\/09\/qfig4.png\"><img loading=\"lazy\" decoding=\"async\" class=\"aligncenter  wp-image-301\" src=\"https:\/\/staffblogs.le.ac.uk\/bayeswithstata\/files\/2014\/09\/qfig4-300x218.png\" alt=\"qfig4\" width=\"337\" height=\"245\" srcset=\"https:\/\/staffblogs.le.ac.uk\/bayeswithstata\/files\/2014\/09\/qfig4-300x218.png 300w, https:\/\/staffblogs.le.ac.uk\/bayeswithstata\/files\/2014\/09\/qfig4.png 716w\" sizes=\"auto, (max-width: 337px) 100vw, 337px\" \/><\/a><\/p>\n<p>Below is a plot of the 10th, 25th, 50th, 75th and 90th quantiles. The code requires looping over lots of values of x, so it takes a few minutes to run; it is given at the end of this posting. 
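<\/p>\n<p>In terms of the fitted components, the conditional distribution of IgG given age x for a single component is normal with mean &mu;<sub>y<\/sub> + &rho;&sigma;<sub>y<\/sub>(x &minus; &mu;<sub>x<\/sub>)\/&sigma;<sub>x<\/sub> and standard deviation &sigma;<sub>y<\/sub>&radic;(1 &minus; &rho;&sup2;), so the conditional distribution function at a trial value of IgG is the weighted average of the component normal probabilities. The code evaluates this function on a grid of trial values, appends the target probabilities, and interpolates IgG against the distribution function with <strong>ipolate<\/strong> to read off the quantiles.<\/p>\n<p>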
With relatively little effort this could be incorporated into an ado file in order to automate the entire analysis.<\/p>\n<p><a href=\"https:\/\/staffblogs.le.ac.uk\/bayeswithstata\/files\/2014\/09\/qfig5.png\"><img loading=\"lazy\" decoding=\"async\" class=\"aligncenter  wp-image-302\" src=\"https:\/\/staffblogs.le.ac.uk\/bayeswithstata\/files\/2014\/09\/qfig5-300x218.png\" alt=\"qfig5\" width=\"351\" height=\"255\" srcset=\"https:\/\/staffblogs.le.ac.uk\/bayeswithstata\/files\/2014\/09\/qfig5-300x218.png 300w, https:\/\/staffblogs.le.ac.uk\/bayeswithstata\/files\/2014\/09\/qfig5.png 716w\" sizes=\"auto, (max-width: 351px) 100vw, 351px\" \/><\/a><\/p>\n<p>Next time I will analyse the same data using a Bayesian infinite mixture model with Dirichlet process priors fitted by R&#8217;s DPpackage but controlled and run from within Stata.<\/p>\n<hr \/>\n<p>&nbsp;<\/p>\n<p>Code for performing quantile regression based on a fitted mixture distribution:<\/p>\n<pre style=\"padding-left: 30px\"><span style=\"color: #0000ff\"> local r = 20<\/span>\r\n<span style=\"color: #0000ff\"> insheet using igg.csv, clear<\/span>\r\n<span style=\"color: #0000ff\"> merge 1:1 _n using igg.dta<\/span>\r\n<span style=\"color: #0000ff\"> local n = 61<\/span>\r\n<span style=\"color: #0000ff\"> range x 0 6 61<\/span>\r\n<span style=\"color: #0000ff\"> range y 0 15 1001<\/span>\r\n<span style=\"color: #0000ff\"> gen Q10 = 0<\/span>\r\n<span style=\"color: #0000ff\"> gen Q25 = 0<\/span>\r\n<span style=\"color: #0000ff\"> gen Q50 = 0<\/span>\r\n<span style=\"color: #0000ff\"> gen Q75 = 0<\/span>\r\n<span style=\"color: #0000ff\"> gen Q90 = 0<\/span>\r\n<span style=\"color: #0000ff\"> forvalues k=1\/61 {<\/span>\r\n<span style=\"color: #0000ff\">     local x = x[`k']<\/span>\r\n<span style=\"color: #0000ff\">     local nIter = 0<\/span>\r\n<span style=\"color: #0000ff\">     foreach iter of numlist 5(50)5000 {<\/span>\r\n<span style=\"color: #0000ff\">         local ++nIter<\/span>\r\n<span style=\"color: #0000ff\">         qui gen q = 0 in 1\/1001<\/span>\r\n<span style=\"color: #0000ff\">         qui gen w = 0 in 1\/1001<\/span>\r\n<span style=\"color: #0000ff\">         forvalues s=1\/`r' {<\/span>\r\n<span style=\"color: #0000ff\">             matrix T = J(2,2,0)<\/span>\r\n<span style=\"color: #0000ff\">             matrix T[1,1] = t`s'_1_1[`iter']<\/span>\r\n<span style=\"color: #0000ff\">             matrix T[2,1] = t`s'_2_1[`iter']<\/span>\r\n<span style=\"color: #0000ff\">             matrix T[1,2] = t`s'_1_2[`iter']<\/span>\r\n<span style=\"color: #0000ff\">             matrix T[2,2] = t`s'_2_2[`iter']<\/span>\r\n<span style=\"color: #0000ff\">             matrix V = syminv(T)<\/span>\r\n<span style=\"color: #0000ff\">             local sy = sqrt(V[1,1])<\/span>\r\n<span style=\"color: #0000ff\">             local sx = sqrt(V[2,2])<\/span>\r\n<span style=\"color: #0000ff\">             local rho = V[1,2]\/(`sy'*`sx')<\/span>\r\n<span style=\"color: #0000ff\">             local my = mu`s'_1[`iter']<\/span>\r\n<span style=\"color: #0000ff\">             local mx = mu`s'_2[`iter']<\/span>\r\n<span style=\"color: #0000ff\">             local p = p`s'[`iter']<\/span>\r\n<span style=\"color: #0000ff\">             qui replace q = q + `p'*normal( (y-(`my'+`rho'*`sy'*(`x' - `mx')\/ \/\/\/<\/span>\r\n<span style=\"color: #0000ff\">                `sx'))\/(`sy'*sqrt(1-`rho'*`rho')) )*normalden((`x' - `mx')\/`sx')<\/span>\r\n<span style=\"color: #0000ff\">             qui replace w = w + `p'*normalden((`x' - `mx')\/`sx')<\/span>\r\n<span style=\"color: #0000ff\">         }<\/span>\r\n<span style=\"color: #0000ff\">         qui replace q = q\/w<\/span>\r\n<span style=\"color: #0000ff\">         drop w<\/span>\r\n<span style=\"color: #0000ff\">         qui replace q = 0.1 in 1002<\/span>\r\n<span style=\"color: #0000ff\">         qui replace q = 0.25 in 1003<\/span>\r\n<span style=\"color: #0000ff\">         qui replace q = 0.5 in 1004<\/span>\r\n<span style=\"color: #0000ff\">         qui replace q = 0.75 in 1005<\/span>\r\n<span style=\"color: #0000ff\">         qui replace q = 0.9 in 1006<\/span>\r\n<span style=\"color: #0000ff\">         qui ipolate y q , gen(iy)<\/span>\r\n<span style=\"color: #0000ff\">         qui replace Q10 = Q10 + iy[1002] in `k'<\/span>\r\n<span style=\"color: #0000ff\">         qui replace Q25 = Q25 + iy[1003] in `k'<\/span>\r\n<span style=\"color: #0000ff\">         qui replace Q50 = Q50 + iy[1004] in `k'<\/span>\r\n<span style=\"color: #0000ff\">         qui replace Q75 = Q75 + iy[1005] in `k'<\/span>\r\n<span style=\"color: #0000ff\">         qui replace Q90 = Q90 + iy[1006] in `k'<\/span>\r\n<span style=\"color: #0000ff\">         drop iy q<\/span>\r\n<span style=\"color: #0000ff\">     }<\/span>\r\n<span style=\"color: #0000ff\"> }<\/span>\r\n<span style=\"color: #0000ff\"> qui replace Q10 = Q10\/`nIter'<\/span>\r\n<span style=\"color: #0000ff\"> qui replace Q25 = Q25\/`nIter'<\/span>\r\n<span style=\"color: #0000ff\"> qui replace Q50 = Q50\/`nIter'<\/span>\r\n<span style=\"color: #0000ff\"> qui replace Q75 = Q75\/`nIter'<\/span>\r\n<span style=\"color: #0000ff\"> qui replace Q90 = Q90\/`nIter'<\/span>\r\n<span style=\"color: #0000ff\"> twoway (scatter igg age) (line Q10 x, lpat(dot) lcol(blue)) \/\/\/<\/span>\r\n<span style=\"color: #0000ff\">     (line Q25 x, lpat(dash) lcol(blue)) (line Q50 x, lpat(solid) lcol(red)) \/\/\/<\/span>\r\n<span style=\"color: #0000ff\">     (line Q75 x, lpat(dash) lcol(blue)) (line Q90 x, lpat(dot) lcol(blue)), \/\/\/<\/span>\r\n<span style=\"color: #0000ff\">     leg(off) xtitle(Age) ytitle(IgG)<\/span><\/pre>\n","protected":false},"excerpt":{"rendered":"<p>In my last posting I mentioned that it is possible to use Stata to run R packages by preparing an R script, calling R from within Stata so that R runs in the background and then reading R&#8217;s results into Stata for further processing. 
The big advantage of this method is that one does not need to\u00a0 [&hellip;]<\/p>\n","protected":false},"author":134,"featured_media":0,"comment_status":"closed","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[1],"tags":[17,35],"class_list":["post-289","post","type-post","status-publish","format-standard","hentry","category-uncategorized","tag-bayesian-mixture-model","tag-smoothing"],"_links":{"self":[{"href":"https:\/\/staffblogs.le.ac.uk\/bayeswithstata\/wp-json\/wp\/v2\/posts\/289","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/staffblogs.le.ac.uk\/bayeswithstata\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/staffblogs.le.ac.uk\/bayeswithstata\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/staffblogs.le.ac.uk\/bayeswithstata\/wp-json\/wp\/v2\/users\/134"}],"replies":[{"embeddable":true,"href":"https:\/\/staffblogs.le.ac.uk\/bayeswithstata\/wp-json\/wp\/v2\/comments?post=289"}],"version-history":[{"count":10,"href":"https:\/\/staffblogs.le.ac.uk\/bayeswithstata\/wp-json\/wp\/v2\/posts\/289\/revisions"}],"predecessor-version":[{"id":313,"href":"https:\/\/staffblogs.le.ac.uk\/bayeswithstata\/wp-json\/wp\/v2\/posts\/289\/revisions\/313"}],"wp:attachment":[{"href":"https:\/\/staffblogs.le.ac.uk\/bayeswithstata\/wp-json\/wp\/v2\/media?parent=289"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/staffblogs.le.ac.uk\/bayeswithstata\/wp-json\/wp\/v2\/categories?post=289"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/staffblogs.le.ac.uk\/bayeswithstata\/wp-json\/wp\/v2\/tags?post=289"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}