bridalsraka.blogg.se - Mosaic plot


In a Data Science project it's really important to get as many insights as possible out of your data. There is a specific phase, the first one in the project, that has data analysis as its goal: the Data Exploration phase. Among other kinds of analysis, one of the most interesting is the bivariate one, which looks for the relationship between two variables. If the two variables are categorical, the most common plot used to analyze their relationship is the mosaic plot.

At first sight it may appear a little bit confusing. People who aren't familiar with some statistical concepts can miss important information this plot can give us. So, in this post we'll go a little deeper into these concepts.

The data set used in this post can be downloaded here. Each row is a transaction done in a burger shop. The two variables under analysis are the following:

SELL_CATEGORY: a categorical variable identifying the category of the product sold. The category “0” identifies single products; the category “2” identifies combo ones.
COKE: a categorical variable identifying the presence of coke as an item in a combo product sold. The value “1” identifies the presence of coke; “0” identifies its absence.

A few lines of simple R code will generate the mosaic plot for the variables SELL_CATEGORY and COKE; just open RStudio (or R Tools for Visual Studio) and run them.

Fig. 2 - Contingency table for SELL_CATEGORY vs COKE

The mosaic plot recursively calculates marginal proportions, starting with the rows. In our example, we start with sell categories. The widths of the cells are proportional to the percentage of single products and combo products (the total % within coke, shown in dark blue in the figure). In fact, the totals of SELL_CATEGORY for “0” and “2” are 1355 and 4065 out of 5420 (total observed values), so 25% of products are single ones and 75% are combos.

Now, within each sell category (in green), it calculates the proportion belonging to each coke category (“0”, “1”), so the heights of the boxes are proportional to the percentage of coke within a sell category. In fact, if the sell category is “0”, the transactions not containing coke (COKE = “0”) are 1355 and the ones containing coke (COKE = “1”) are 0. So, for single products, 100% of transactions don't contain coke and 0% of them contain it. Instead, if the sell category is “2”, 1355 transactions don't contain coke and 2710 transactions contain it. So, for combo products, 33.3% of transactions don't contain coke and 66.7% of them do.

To understand how the colors are assigned, we need to introduce the standardized residuals, shown in the legend of the plot. Mosaic plots represent the data as is; they don't make any attempt to generalize to the full population. To make inferences about the population, we need to provide measures of statistical significance to certify that the results found aren't attributable to chance. The strength of a relation can be thought of as a measure of how much the observed values deviate from the values expected in case of independence (independence = total absence of any kind of relation among the variables).
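
To make the widths and heights of the tiles concrete, here is a minimal sketch in R. It assumes the contingency table is typed in by hand from the counts quoted in this post rather than read from the downloadable data set, and it uses the base `mosaicplot()` function, which may not be the function used for the original figures. The names `counts`, `tab`, `sell_share` and `coke_within_sell` are only illustrative.

```r
# Contingency table typed in from the counts quoted in the post
# (assumption: the original data file is not loaded here).
counts <- matrix(
  c(1355,    0,    # SELL_CATEGORY = "0" (single): COKE = "0", COKE = "1"
    1355, 2710),   # SELL_CATEGORY = "2" (combo):  COKE = "0", COKE = "1"
  nrow = 2, byrow = TRUE,
  dimnames = list(SELL_CATEGORY = c("0", "2"), COKE = c("0", "1"))
)
tab <- as.table(counts)

# Share of each sell category over all transactions:
# these proportions drive the widths of the tiles.
sell_share <- prop.table(margin.table(tab, 1))

# Share of each COKE value within a sell category (rows sum to 1):
# these proportions drive the heights of the tiles.
coke_within_sell <- prop.table(tab, 1)

print(round(100 * sell_share, 1))
print(round(100 * coke_within_sell, 1))

# Base-graphics mosaic plot; shade = TRUE colors the tiles by their residuals
# under the independence model and adds a legend.
mosaicplot(tab, shade = TRUE, main = "SELL_CATEGORY vs COKE")
```

With the real data set loaded in a data frame, the same table could be obtained with `table()` on the two columns instead of typing the counts by hand.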
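
The idea of comparing observed values with the values expected under independence can also be sketched in a few lines of R. This is only an illustration under stated assumptions: the table is again typed in from the counts quoted in this post, and the chi-squared test of independence from `chisq.test()` is one common way to attach a significance measure, not necessarily the one the post goes on to use. Note that `chisq.test()` returns both Pearson residuals (`residuals`) and standardized residuals (`stdres`); the base `mosaicplot()` legend is titled "Standardized Residuals" but shades by Pearson residuals by default.

```r
# Same hand-typed table as in the post's example (assumption: counts taken
# from the text, not from the original data file).
tab <- as.table(matrix(
  c(1355, 0, 1355, 2710),
  nrow = 2, byrow = TRUE,
  dimnames = list(SELL_CATEGORY = c("0", "2"), COKE = c("0", "1"))
))

# Chi-squared test of independence, without continuity correction so the
# statistic matches the plain sum of squared Pearson residuals.
test <- chisq.test(tab, correct = FALSE)

# Counts expected if SELL_CATEGORY and COKE were independent:
# row total * column total / grand total.
print(test$expected)

# Pearson residuals: (observed - expected) / sqrt(expected).
print(round(test$residuals, 2))

# Standardized residuals: Pearson residuals rescaled by the margins.
print(round(test$stdres, 2))

# A small p-value suggests the deviation from independence is unlikely
# to be due to chance alone.
print(test$p.value)
```

Positive residuals mark cells with more transactions than independence would predict, negative residuals mark cells with fewer.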