Thursday, 17 July 2014

The Data Analysts Toolkit Day 5: How do R and Power BI fit together?


Technically, it is about munging data between R, Excel and the Power BI components. 

You can use RODBC to connect R to SQL Server, or R to Excel. Alternatively, you can import the data directly.

Why else might you use R?

  • Pivot Tables are not always enough
  • Scaling Data (ScaleR)
  • R is very good at static data visualisation but Power BI and Excel are very good at dynamic data visualisation
  • You want to double check your results or do further analysis
They complement one another; they do not replace one another.

You may have heard my story about one organisation calculating the median incorrectly. 

The truth is, people don't often check their data. I help design data warehouses all the time, and I don't always hear people talk about reconciliation. I do hear about people standing up SSAS and diving straight in.

Just because something works technically, does not mean it is correct.

Upworthy and Facebook use R. A lot. So why not you? It is achievable.
Why R, and not some other package?

  • R is the most widely used data analysis software, used by over two million data scientists, statisticians and analysts
  • It is arguably the most powerful statistical programming language
  • Used with RStudio, it helps your productivity
  • You can create beautiful and unique data visualisations, as seen in the New York Times, Twitter and Flowing Data
  • It has a thriving open-source community at the leading edge of analytics research
  • It fills the talent gap - new graduates prefer R
  • It's fun!

Excel is used by an estimated 1.3 billion people on the planet. That sounds really impressive, until you consider how many of them are using it incorrectly!
R just helps you to do that double check, the sanity check, to see if your data is correct.


The Data Analysts Toolkit Day 4: Getting to know R a bit better

So what actually is R? Let's take a look at running some commands, and we can talk about it as we go along.

R is a powerful environment for statistical computing. It works like a calculator, but its real power is that it lets you save results in variables. Let's take a look:

> x <- 3
> y <- 5
> z = 4
> x + y + z
[1] 12

Ok, so we can run little commands and it gives us numbers back. What's all the fuss about?

We can save data in files. They are saved as .RData files. R lets us import and export data in lots of different ways. If it sounds fun, it is because it is! Let's take a look.

> someData <- 1:10
> save(someData, file = "D:/DataAnalystsToolkit/someData.Rdata") 
> rm(someData) 
> load("D:/DataAnalystsToolkit/someData.Rdata") 
> print(someData)

Well, that is great if our data is in RData format. But normally we need to import data from different sources.

Import from a CSV file using read.csv():

MyDataFrame <- read.csv("filepath.csv")
print(MyDataFrame)

Alternatively we can load data in directly from RStudio. 
Go to Tools in RStudio, and select Import Dataset. 
Select the file CountryCodes.csv and select the Import button.
In RStudio, you will now see the data in the data pane.

The console window will show the following:
> #import dataset
> CountryCodes <- read.csv("CountryCodes.csv", header = FALSE)
> View(CountryCodes)

Once the data is imported, we can check the data.

dim(CountryCodes)   # dimensions: the number of rows and columns
head(CountryCodes)  # the first six rows
tail(CountryCodes)  # the last six rows


So how do we connect R to SQL Server?

The Package RODBC provides R with a connection to ODBC databases:

> library(RODBC) 
> myodbcConnect <- odbcConnect(dsn = "servername", uid = "userid", pwd = "******")

myQuery <- "SELECT * FROM myTable WHERE ..."
# or read the query from a file 
# myQuery <- readChar("D:/DataAnalystsToolkit/myQuery.sql", nchars = 99999) 
myData <- sqlQuery(myodbcConnect, myQuery, errors = TRUE) 
odbcCloseAll()

Notice how the connection, query and result variables fit together to build and run the query.

RODBC also works for importing data from Excel files:

library(RODBC) 
filename <- "D:/DataAnalystsToolkit/mySpreadsheet.xls"
myxlsFile <- odbcConnectExcel(filename, readOnly = FALSE)
sqlSave(myxlsFile, a, rownames = FALSE) 
b <- sqlFetch(myxlsFile, "a")
odbcCloseAll()

Next, we will look in more detail at how R and Power BI complement one another.










The Data Analysts Toolkit Day 3: Introduction to running and understanding commands in R

I have put a number of R scripts over at my OneDrive account and you can head over there to download them. There are some other goodies over there too, so why not take a look?

First, how do you open files? To create a new file, use the File -> New menu.

To open an existing file, use the File -> Open menu.
Alternatively, you can use the Open Recent menu to select from recently opened files.

If you open several files within RStudio, you can see them as tabs to facilitate quick switching between open documents. 

If you have a large number of open documents, you can also navigate between them using the >> icon.
You could also use the tab bar to navigate between files, or the View -> Switch to Tab menu item.

Let's open the Women file and take a look at the commands.

Comments in R are preceded with a hash symbol, so that is what we are using here.

# this loads the dataset
data(women)

# You can see what is in the dataset
women

# This allows you to see the column names
names(women)

# You can see the height column, and other slices of the data, in different ways
women$height
women[,1]
women[seq(1,2),]
women[1:5,1]
women[,2]

# we can start to have a little fun!
# we are going to tell R that we are going to build a model of the data

attach(women)
# fit a linear regression model: weight as a function of height
model <- lm(weight ~ height)
model
print(model)
mode(model)
# prediction intervals for the original data
predict(model, women, interval="predict")
# predict the weight for a new height of 60 inches
newdata = data.frame(height=60)
newdata
predict(model, newdata, interval="predict")
women
# the correlation between height and weight
r <- cor(height, weight)
summary(model)

Now we can see some strange output at the bottom of the page:

Call:
lm(formula = weight ~ height)

Residuals:
    Min      1Q  Median      3Q     Max 
-1.7333 -1.1333 -0.3833  0.7417  3.1167 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) -87.51667    5.93694  -14.74 1.71e-09 ***
height        3.45000    0.09114   37.85 1.09e-14 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 1.525 on 13 degrees of freedom
Multiple R-squared:  0.991, Adjusted R-squared:  0.9903 
F-statistic:  1433 on 1 and 13 DF,  p-value: 1.091e-14

What does this actually mean? As part of my course, I normally discuss this in detail. For those of you reading online, here is the potted summary:











Basically, at the 30,000-foot level, you want a big F and a small p. We will start from there, and move down as we go.

The stars are shorthand for significance levels, with the number of asterisks displayed according to the p-value computed: *** for high significance and * for low significance. In this case, *** indicates that there is likely to be a relationship.

Pro Tip
Is the model significant or insignificant? This is the purpose of the F statistic.
Check the F statistic first because if it is not significant, then the model doesn’t matter.

In the next instalment, we will look at more commands in R.


The Data Analysts Toolkit Day 2: Installing R and R Studio

In this topic, we will look at the installation of R and RStudio as part of our R and Power BI Data Analysts Toolkit.

Why use RStudio as well as R?

Simply put, it is suitable for the enterprise and it is also open-source. It makes R easier to use and to troubleshoot.

RStudio is a tool which helps you to be more productive with R. It wraps the command line so that you can see your code more clearly, and it keeps your history tidy.

You can find more details of RStudio here.

Installation of R

The installation of R itself is very easy.  The online manual is here.

For those of you who want the potted version for installing R on Windows:

Go to the official R site
Go to the Mirrors page to look for your nearest location
Assuming you use Windows, you can start the download from here 
You can then download the R file
When it is downloaded, put it in a safe place and execute the exe.

Then you are done!

Installation of RStudio

We simply want the free desktop version of RStudio, which you can get here on this page; it takes you through to the RStudio download page. You can then work out which installer is most appropriate for your system.














The Data Analysts Toolkit Day 1: Cleaning and importing your data using Power Query


What does the Data Analyst need for a Toolkit?

  • Power BI - Excel, Power Query, Power View
  • R


Power BI - cleanse, model, visualise
R - check, check, check, explore, statistics.

In this series, we will look at:

  • Power BI - cleanse data
  • Some basic R scripts
  • Analysing and visualising data in R
  • Accessing data in the Hortonworks Sandbox for analysis with R
  • Power BI visualisation in Excel, such as the little People Graph, which I love because it is simply fun!

So let's get started!

Microsoft Power Query for Excel is a new add-in that provides a seamless experience for data discovery, data transformation and enrichment for information workers, BI professionals and other Excel users.

With Power Query you can:
  • Identify the data you care about from the sources you work with (e.g. relational databases, Excel, text and XML files, OData feeds, web pages, Hadoop HDFS, etc.).
  • Discover relevant data from inside and outside your organization using the search capabilities within Excel.
  • Combine data from multiple, disparate data sources and shape it in order to prepare the data for further analysis in tools like Excel and Power Pivot, or visualization in tools like Power View and Power Map.
  • Share the queries that you created with others within your organization, so they can be easily found via Search.

It is part of our toolkit because it helps us to easily clean and prepare the data for consumption by R. 
So let's get to it.

Download and install Power Query


Before we proceed with Power Query, let’s download and install the Power Query Add-in.
Before we proceed to download, let’s check the supported operating systems for Power Query. At the time of writing, these are listed below:

  • Windows 7, Windows 8, Windows 8.1
  • Windows Server 2008 R2, Windows Server 2012
  • Windows Vista (requires .NET 3.5 SP1)
  • Windows Server 2008 (requires .NET 3.5 SP1)

The following Office versions are supported:
  • Microsoft Office 2010 Professional Plus with Software Assurance
  • Microsoft Office 2013 Professional Plus, Office 365 ProPlus or Excel 2013 Standalone

Microsoft Power Query for Excel requires Internet Explorer 9 or greater.

Microsoft Power Query for Excel is available for 32-bit (x86) and 64-bit (x64) platforms; your selection must match the architecture of the installed version of Office.
Once you have checked these items, you can download the version of the Power Query add-in that matches the architecture (x86 or x64) of your Office installation.
  • Download the Power Query for Excel add-in.
  • Open the folder where you downloaded Microsoft Power Query for Excel.
  • Double-click the downloaded MSI installer file, and then follow the steps in the wizard.
  • After the installation is complete, click Finish.
  • Power Query will appear as a new tab in the Excel ribbon.



Enabling Power Query

  • Once Power Query is installed, you may need to enable it. To do this, go to File and then choose Options. The Excel Options dialogue box will open.
  • Go to the Add-Ins option on the left hand side.
  • At the foot of the Excel Options box, you will see a Manage drop-down list. In the drop-down list, look for COM Add-ins and click Go. 
  • Make sure that Power Query is checked.



Connecting to External Data Sources

We have options to connect to a wide range of data sources, both public and internal.

Connecting to Web Page data

  • To open Internet Explorer, press the Windows key.
  • Then, in Start, begin typing Internet Explorer.
  • When the Internet Explorer application appears, click the application tile. 
  • Go to the Wikipedia Human Development Index page in order to see it before we import it.
  • We will now import the data to this page.


Add a Wikipedia page data source to Power Query

  • In the POWER QUERY ribbon tab, click From Web.
  • In the Web Content dialog box, in the URL text box, paste the Wikipedia URL (http://en.wikipedia.org/wiki/List_of_countries_by_Human_Development_Index).
  • Click OK.
  • On the right hand side, you will see a list of Tables. Click Table 2, and you will get a preview of the data.
  • Double click on it, and you will see the data appear in the Excel window.
  • Click on it, and rename the query to HDI Very High.
  • Make sure to click on Add to the Data Model at the bottom right hand side.
  • Next, click on Apply and Close.
  • You will be returned to the Excel workbook.


We will re-run the query now for the other HDI levels: High, Medium, and Low. 

To do this, we will reuse the existing query by clicking on the Power Query tab again, and then selecting Recent Sources.
  • In the drop-down list, you will see a list of your recent sources. Select the Wikipedia page.
  • You will be returned to the Power Query Editor, and you will see the tables on the right hand side.
  • Select Table 3 by double-clicking it. Next, make sure you have clicked on Add to the Data Model at the bottom right hand side.
  • For the Medium and Low Human Development data, repeat these steps using Table 4 and Table 5 respectively.


Transforming the data

Go back to the first query, HDI Very High Human Development, and double-click on it to open the Power Query Editor.
We can transform the HDI Very High Human Development data by removing the column Rank Change in rank between 2013 report to 2011 report[1]. It is not required.

  • To do this, highlight the column Rank Change in rank between 2013 report to 2011 report[1], right-click, and select Remove; the column will be deleted.
  • Click on Row number 1. We will make this row the header row. To do this, select Use First Row as Headers and you will see that this row is now the header row.
  • Rename the HDI column to HDI 2012 by right clicking it, and selecting Rename.
  • We will still have a row that is text. To remove it, highlight the first row and select the button Remove top Rows from the Home tab in the Power Query Editor. 
  • Change the HDI column to a Decimal Number by highlighting the column, going to Data Type and selecting Decimal Number from the drop down list.
  • Let’s rename the HDI_1 column to Difference by right-clicking it and selecting Rename.
  • Let’s add in a custom column. To do this, right click on the Difference column and select Insert Custom Column
  • In the Insert Custom Column editor, double click the HDI 2012 column to put it into the text box.
  • Then, enter a minus sign after the HDI 2012 column
  • Finally, double click the Difference column to select it and put it into the text box.
  • Your query should look like this:
  • [HDI 2012] - [Difference]
  • Press return. The new custom column will appear in the preview.
  • Now click Apply and Close.

Now, you can see in Excel we have five columns:
  • Rank
  • Country
  • HDI 2012
  • Difference
  • HDI 2010


Two of the columns contain HDI data, which is split by date. It will be easier to visualise the data if all of the HDI values are in a single column, so we can unpivot the columns.

  • To do this, select the HDI 2010, HDI 2012 and the Difference columns.
  • Go to the Transform tab and go to the option Unpivot Columns.
  • You will now see that the HDI 2010, HDI 2012 and Difference values have gone into a single column, which is headed Value.
  • The attribute names have gone into a column called Attribute.
  • The other two columns, Rank and Country, have stayed the same.
  • We don’t actually need the Difference column now that we have obtained the data, so let’s filter it out.
  • To do this, go to the Attribute column heading and click the downward arrow next to it.
  • Deselect the Select All button, and select only the HDI 2010 and HDI 2012 options, and then click OK.

Now we can rename the attributes HDI 2010 and HDI 2012 so that they contain the year only. To do this, select the Attribute column and go to the Transform tab.

  • Go to the option Replace Values and the Replace Values box will appear.
  • In the Value to Find box, enter HDI 2012.
  • In the Replace With box, enter 2012 and then click OK.
  • Repeat these steps for HDI 2010, renaming it to 2010.
  • Let’s rename the Attribute column to year. To do this, select the column and then right click. In the pop up menu that appears, select Rename. Rename the column to Year.
  • Now, let’s rename the Value column so that it is now called HDI Value. To do this, select the column and right click to get the pop up menu again. Look for the rename option, and rename the column to HDI Value.
  • If we examine the Rank column, we can see that some countries are jointly ranked. So, for example, Hong Kong and Iceland are jointly ranked 13th in the list.
  • Let’s add an index to our Power Query. We could use this later as a key in the table.
  • To do this, right click on HDI Data and click Edit.
  • The Power Query Editor will appear again. Go to the Add Column tab in the Power Query Editor and select the option Insert Index Column.
  • The Index column will appear at the right hand side. To move it, select the column and go to the Move option.
  • Look for the option ‘To the Beginning’ and select it. The Index column will move to the left hand side of the table.
  • Once you have done this, go to the Home Tab and click on the Apply and Close button.


Connecting to Online data

 Let’s do an online search. This is a fun part of Power Query.

Before we dive in to use the Editor, let’s take a look at some of the features in the Power Query Editor.
You can see that there are four main tabs: Home, Transform, Add Column and View.
The Home tab contains useful functionality which is common when we are cleansing data.
Sometimes data files contain unnecessary data when they are exported from other sources. Cleansing involves specifying to Power Query which data you want to keep, rather than which data you want to throw away.
One example of this unnecessary data might include headers, titles, dates, owner information and so on. We can remove these rows by removing the top number of rows from the data set.

  • Click on the Power Query tab.
  • Click on the Online Search button.
  • The Online Search box will appear on the right hand side.
  • Type past and future GDP in the search box and press return.
  • The search box will show two tabs: one is organisation and the other is public. If you click on organisation, you will probably see no data. If you click on public, there will be many data sources to choose from.


Extracting data from the Azure Data Market

In the real world, however, we need to be able to connect to different data sources, which may contain large amounts of data.
We will look at connecting to multiple data sources at a time. This is a useful way of enriching our data with additional attributes. An attribute is another way of describing data; in this definition, it is a field that can be considered an independent variable, regardless of the data type. 

To connect the online data and local data, we will connect to the Windows Azure Datamarket using OData, which is a standardized protocol providing CRUD (create, read, update, delete) access to a data source via a website. It is the data API for Microsoft Azure, but other organizations use it as well, such as eBay, SAP and IBM.

Before you start, you need to create a folder where you can download data for the purposes of running through the examples. You should pick a folder name that is meaningful for you. Also, be sure to select a location that has plenty of space. In this example, we will use the following location to store data:

D:\Data\DataAnalystToolkit

To connect to Windows Azure Datamarket, please sign up for a free account using a Windows Live ID. To do this, please visit https://datamarket.azure.com/ and follow the instructions. This may involve activating your account via a link, so please follow the instructions carefully.

Sign in to the Windows Azure Datamarket and navigate to the following URL:


About half way down the page, look for the Sign Up button and select it.

  • This will take you to a Terms and Conditions page. After you’ve read the Terms and Conditions, and, if you agree with them, tick the box to specify that you agree, and click on Sign Up.
  • This will take you to a Thank You page. Look for the link to Explore this Dataset on this page, and click it.







Explore this Dataset does not initially appear on the screen. You need to click the Subscribe button (even after logging in) to activate that section.

When you click ‘Explore this Dataset’, you will be able to see the data appear in the browser, which you can slice and dice. 

 In this example, we will load the data into a CSV file, rather than in the Data Explorer URL. To do this, we need the Primary Account Key.  In Windows Azure Datamarket, this is easy to obtain. From the above example, we can see a feature called ‘Primary Account Key’. If you click on the ‘Show’ link next to ‘Primary Account Key’ then your Primary Account Key will appear.

You can download the data, and import it into Power Query.

Open the Power Query editor, and go to the Power Query tab.
Look for the button From Other Sources.
Look for the option From Windows Azure MarketPlace and click on it
If you are signed in to Power Query already with a Power BI subscription, you will see your subscribed data feeds on the right hand side. Clever, huh?
If not, selecting this option will give you the Data Explorer credential manager.

Once you are logged in, you can see your data sets and you can import them into Power Query.

I hope that you enjoyed this whizz tour of Power Query. On to the next step!









Tuesday, 15 July 2014

What do you need for my SQLBits R and PowerBI session? Nothing!


I am not expecting people to have installed anything in advance, hence I have made no requests for the event. I will cover it during the session. Usually, people don't prepare in advance anyway, because they are hassled at work, trying to clear their desks before the training.

When I've held this course before, only about a third of people joined in as I went along. The others were happy to take notes and watch demos. Sometimes people are not allowed to take laptops out of the office. I run the course with this in mind; people learn in different ways.

R has its own inbuilt data sets so you will not require any preparation in advance, in terms of the data or the software. We will be looking at a wide range of things. I will make the R scripts available to you on the day as well.

The SQLBits wifi should be good; and if it is not, then I have everything on USB sticks.

I hope that helps and that you enjoy your day, if you are attending.

Packt Tableau Dashboard Cookbook errata and completed files download


Some of you know that I wrote a book on Tableau, called the Tableau Dashboard Cookbook.

I am writing to apologise for some errors that occurred at the point of production. These were not in my control, but I do feel very sad about it. I put a year of effort into this book, and it is personally very disappointing for me that it turned out this way. At some point during the final editing and preparation for print, for some unknown reason, some text and some images got mixed up in Chapter 2. I can only apologise for this issue, and I have no idea why it occurred.

By way of reparation, I have posted the correct chapter here, with additional images and additional notes to make it easier for people. These things happen and I am sad about it, and I can only try to help people who have got caught up in this issue.

If you'd prefer to download the actual working examples, you can get the working examples for the book from here.

 

Grouping your data with calculations


In the first topic, we specified communication as one of the key features of a dashboard. We need to be able to share the right information with the right people, at the right time, in the right format.

Sometimes, the data needs to be translated so that it matches the business rules and the business understanding of the organization. Tableau offers a number of different ways that help you to translate the data into something that the business decision makers will understand.

Grouping is one way of making data meaningful to the business. In this topic, we will look at grouping some dimension members into a single member. Rolling up some of the members in one dimension is a good way of summarizing the data for dashboards.

In this recipe, we will look at grouping dimension members, and then we will look at more complex grouping with calculations. The business question is an investigation into the characteristics of customers: those who have children, and those who do not. We will group the NumberChildrenAtHome dimension members into the group of customers who have children, and those who do not.

Getting ready


We will need to add in a new data source for this topic.

How to do it...


1. We will connect to a data file called DimCustomer.csv and import it into Tableau's internal data store mechanism. To do this, navigate to Data | Connect to Data.

2. Then select the link Text File and a file browser will appear. Navigate to the folder where you stored the downloaded Excel and CSV files.

3. Navigate to the DimCustomer.csv file and select it.

4. In the connection dialog box, change the connection name so that it says Connection_DimCustomer.

5. You can now see the new data source connection on the Data pane.

6. Press Ctrl + M to get a new worksheet in Tableau. Alternatively, go to the Worksheet menu item and select New Worksheet.

7. Click on the Connection_DimCustomer source and drag Number of Records from the Measures pane to the Rows shelf.

8. Drag the NumberChildrenAtHome dimension attribute to the Tableau canvas, to the left of the NumberOfRecords column. You can see an example in the next screenshot.


 

 

Click on the table in the Show Me panel. When we look at the NumberChildrenAtHome dimension, we see the following members and the number of customer records associated with each member:



9. We will group the dimension members so it is easier to see which customers have children, and which do not.

10. Go to the dimension called NumberChildrenAtHome and right-click on it. You will see the following menu:



 
11. This produces the Create Group dialog box, which you can see in the following screenshot:



 
12. In the Field Name field, enter a name for the new grouped field.

13. Multi-select the numbers 1 through to 5 by holding the Shift key and clicking to select more than one number at a time.

14. Click on Group, rename the group to CustomersWithChildren, and click on OK.

15. You can then see the new group on the left-hand side, on the Dimensions pane. Drag your new grouping to the Rows shelf.

16. Remove the NumberChildrenAtHome pill that is NOT a group. The NumberChildrenAtHome group should be on the Rows shelf.

17. Then choose Number of Records from the Rows shelf and put it into the Columns shelf.

18. You can see that the table now only has two rows in it; one with a zero, and the other row has CustomersWithChildren.

19. In order to make it clear, click on the zero and select Edit Alias.

20. Rename the zero to Customers with No Children.

21. You can also rename the Group alias to Customers With Children so that it matches the format used elsewhere. To do this, in the table, select the numbers 1 to 5 and right-click on them.

22. In the pop-up menu, select Group.

23. Right-click on the new group and select Edit Alias.

24. In the textbox, enter Customers With Children and click on OK.

25. It is now clear from the table that over 11,000 customers have no children, whereas just over 7,000 customers do have children. We can visualize this information in a better way, and we will do this for the rest of the exercise.

26. To do this, change the visualization using the Show Me panel: click on the stacked bar chart component. Click on the Swap button so that the bar appears in a horizontal line.

27.  Drag the group to the Label button on the Marks shelf.

28. We can then change the colors by dragging Number of Records to the Color panel on the Marks shelf.

29. In order to show the difference between the customers who do and who do not have children, the custom diverging color palette has been selected here.

30. Two diverging colors of orange and blue are selected. Orange is selected for Customers With No Children because there is a greater number of customers who do not have children; brighter and more intense colors are often used in order to denote higher values, as shown:


 

 

This is a piece which people were missing. This paragraph was originally located here, and then it was moved to the beginning of the chapter, and I think that people are missing it. As a result, the chapter can look disjointed, because this recipe is actually two recipes joined together.

The first part focuses on grouping by dimensions; the second part focuses on grouping by measures. It is included here again, which is where I had it originally.

Then, we will look at a more advanced example of grouping the data by the measure, rather than the dimension. To do this, we can create a calculation that will distinguish the values that are below the average sales amount and above the average. Results that are classified as above average are labelled as Above or Equal to Average, and below average sales are labelled as Below Average. We can then use this calculation to convey a visual message to the business user; for example, we could color the above average sales in one color and the below average sales in another in order to make the distinction easily identifiable.

31. To do this, stay in the Chapter 2 workbook and continue to work in the existing worksheet.

32. Click on the Connection_FactInternetSales data connection at the top left hand side.

33. Click on the broken chain next to the CustomerKey in order to activate the relationship.

34. Locate and right-click on the SalesAmount measure.

35. Select Create Calculated Field, which you can see illustrated in the following screenshot:

 
 


36. When you select this option, you will get the following dialog box:

 
In the Name field, enter the name of the calculated field: SalesAboveOrBelowAverage.

37. In the Formula field, we will put in a formula that will calculate whether or not the sales amount is above or below the average amount. The formula is as follows:

IF SUM([SalesAmount]) - WINDOW_AVG(SUM([SalesAmount]), FIRST(), LAST()) < 0

THEN "Below Average"

ELSE "Above or Equal To Average"

END

38. When you've placed the formula into the Calculation editor and clicked on OK, you will be returned to the main Tableau interface. You will see your new calculation on the left-hand side in the Measures pane.

39. Remove the NumberChildrenAtHome label from the Marks shelf.

40. Drag SalesAmount from the Measures pane to the Rows shelf.

41. Drag the SUM(SalesAmount) measure to the Columns shelf.

42. Drag NumberCarsOwned from the Dimension pane to the Rows shelf.

43. Drag your new calculation SalesAboveOrBelowAverage to the Color button on the Marks shelf.

44. Your screen should now look like the following:



To summarize, we have created a calculation that is meaningful to the business user. It drives the color display of the measure, which helps the business user to understand the data more efficiently. Grouping dimension members into a binary grouping is simple and effective, and it is useful for dashboards because it provides an "at a glance" metric visualization: it shows that the organization has more customers who do not have children than customers who do.

Essentially, this formula uses the WINDOW_AVG function to work out the average of the values that are in the Tableau view of the data. This average is computed over the data that is visible in the Tableau canvas, and it does not include data that has been filtered out.

It uses FIRST() and LAST() to work out the average over all the data, from the first row right until the last row. The calculation then takes the current SalesAmount value and compares it to the average SalesAmount.

How it works...


Tableau allows you to group data together by simply arranging fields of your data source on a Tableau worksheet.

When you group fields in a worksheet, Tableau queries the data using standard drivers and query languages (such as SQL and MDX). It then groups data together wherever necessary. Finally, it presents a visual analysis of the data.