Sunday, 17 August 2014

Day 6: The Data Analysts Toolkit: Why are Excel and R useful together, and how do we connect them?

Why is analytics interesting? Well, companies are starting to view it as profitable. For example, McKinsey showed analytics was worth 100Bn today, and estimated to be over 320Bn by 2020.
When I speak to customers, this is the 'end goal' - they want to use their data in order to analyse and predict what their customers are saying to them. However, it seems that folks can be a bit vague on what predictive modelling actually is.

I think that this is why Power BI and Excel are a good mix together. It makes concepts like Predictive Modelling accessible, after a bit of a learning curve. Excel is accessible and user-friendly, and we can enhance our stats delivery using R as well as Excel.

One area of interest is Predictive Modelling. This is the process of using a statistical or model to predict the value of a target variable. What does this actually mean?  Predictive modelling is where we work to the predict values in new data, rather than trying to explain an existing data set. To do this, we work with variables. By their nature, these vary; if they didn’t, they would be called a constant.

One pioneer was Francis Galton, who was a bit of an Indiana Jones in his day.  Although he wrote in the 19th century, his work is considered good and clear enough to read today. Therefore, this research has a long lineage, although it seems to be a new thing. We will start with the simplest: linear regression.

Linear regression compares two variables x and y to answer the question, “How does y change with x?” For predictive modelling, we start out with what are known as ‘predictor variables’; in terms of this question, this would be x. The result is called the target variable. In this question, this would be y. Why would we do this?

  • Machine Learning
  • Statistics
  • Programming with Software
  • Programming with Data 
  • Fun!

Why would businesses work with it at all?

  • to discover new knowledge and patterns in the data
  • to improve business results 
  • to deliver better customised services
If we have only one predictor variable and the response and the predictor variable have a linear relationship, the data can be analyzed with a simple linear model. When there is more than one predictor variable, we would use multiple regression. In this case, our question would be: , “How does y change with multiple x?” 

In fitting statistical models in which some variables are used to predict others, we want to find is that the x and y variables do not vary independently of each other, but that they tend to vary together. We hope to find that y is varying as a straight-line function of x.

If we were to visualise the data, we would hope to find a pleasing line chart which shows y and x  relating to each other in a straight line, with a minimal amount of ‘noise’ in the chart. Visualising the data means that the relationship is very clear; analysing the data means that the data itself is robust and it has been checked.

I think that's why, in practice, Power BI, Excel and R work well together. R has got some great visualisations, but people are very comfortable with Excel for visualisations. All that loading packages stuff you have to do in R... it doesn't work for everyone. So we use R and Excel, at a high level, as follows:

  • We cleanse and prepare data with Excel or Power Query
  • We use RODBC to load data into R
  • We analyse and verify the data in R
  • We build models in R
  • We load the data back into Excel using RODBC
  • We visualise the data for results

Excel is, after all, one of the world's most successful software applications ever, with reputedly over one billion users. Using them both together means that you get the best of both words: R for analysis and model building: Excel is the 'default' for munging data around, and visualising it. I'm sure that one of the most popular buttons on software such as Tableau, QlikView et al is the 'Export to Excel' or 'Export to CSV' functionality. I'd be interested to know in what people think about that!

Building linear regression models in R is very simple; in our next session, we will look at how to do that, and then how to visualise it in Excel. Doing all this is easier than you think, and I will show you how.

Tuesday, 12 August 2014

Like bookclubs? Interested in learning Javascript online?

Like bookclubs? Interested in learning Javascript online? 

Ike Ellis is the lead for the PASS Book Readers VirtualChapter, and we're going to learn JavaScript in 8 weeks by following the recommended text and having fun discussions. 

August 20th is the first session, so you have plenty of time to get the book!

Sign up for the webinar at this link:

Friday, 8 August 2014

Jen's Diary, figuring it out this week

Don’t read my diary when I’m gone. Ok, I’m going to work now, when you wake up this morning, please read my diary. Look through my things, and figure me out. (Kurt Cobain)

I liked Andy Warren's suggestion about blogging more often about PASS.  At first, I was flattered that he'd actually read my blog at all! Then, I thought about it and, although I don't officially speak for PASS (I'm not on the Executive Committee), folks might like to know what I've been working on for the PASS Board. I should probably do this regularly because I do something "PASS" pretty every evening.

I've been up to a lot of other PASS work as well as organising SQLSaturday London Business Analytics edition, as part of my role for the Board. I will blog separately about this work. 

Guide, Encourage and Motivate Volunteers has been the theme this week. Well, I've been so busy that I had to look in my email to see what I've been doing! I'm a believer in individual contact where possible and I spend a lot of time helping folks over email or Skype. 

This week, I've had a lot of email traffic from community members who are volunteers, interested in volunteering, and some follow up from SQLBits where people are interested in PASS Chapters and want to get involved. A couple of folks are interested in setting up PASS Chapters at various locations around Europe, and a few others would like to volunteer but aren't really sure how to get started. There's the usual amount of techie questions that I get directly as well, which usually happens via Facebook or direct to my email account. I've been lucky to point one particular person to the High Availability and Disaster Recovery Virtual Chapter, since Ryan Adam's session was perfect for him. My portfolio is the Virtual Chapters, and there is an ongoing work there; for example, helping to find speakers or Virtual Chapters leader support. We are working on the Virtual Chapter sessions for September, so watch this space for more details. 

I was chuffed to be on the PASS website, talking about the benefits of running for PASS Board.  I made a little video and posted it on YouTube, and lo and behold, it was on the PASS site! I was pleased to see Adam Jorgensen's video as well. 

Why run for PASS? My video is below:

Here's a great infographic, which says it without my Scottish accent.

I hope that helps to see what I've been doing this week. Any questions, please let me know.

August 8th, 2014

Free Windows Azure Hybrid Cloud eBook

Hearing buzz around hybrid cloud solutions? Got some questions?
For a limited time, download an eBook from Microsoft solution architects that shares their perspective on what hybrid IT is, why it’s important, and how you should determine whether a hybrid cloud is right for your business.
Download a copy today!

Thursday, 17 July 2014

The Data Analysts Toolkit Day 5: How do R and Power BI fit together?

How do R and Power BI fit together?

Technically, it is about munging data around between R and Excel and the Power BI components. 

You can use RODBC to connect to data between R and SQL Server, or R and Excel. Alternatively you can import data in.

Why else might you use R?

  • Pivot Tables are not always enough
  • Scaling Data (ScaleR)
  • R is very good at static data visualisation but Power BI and Excel are very good at dynamic data visualisation
  • You want to double check your results or do further analysis
They complement one another; they do not replace one another.

You may have heard my story about one organisation calculating the median incorrectly. 

The truth is, people don't often check their data. I help design data warehouses all the time, and I don't always hear people talk about reconciliation. I do hear about people racking up SSAS and diving straight in.

Just because something works technically, does not mean it is correct.

Upworthy and Facebook use R. A lot. So why not you? It is achievable.
Why R, and not some other package?

  • R most widely used data analysis software - used by 2M + data scientist, statisticians and analysts
  • Most powerful statistical programming language
  • used with RStudio, it can help you for the purposes of productivity
  • Create beautiful and unique data visualisations - as seen in New York Times, Twitter and Flowing Data
  • Thriving open-source community - leading edge of analytics research
  • Fills the talent gap - new graduates prefer R.
  • It's fun!

Excel is used by an estimated 1.3 billion people on the planet. That sounds really impressive, until you think that many people are often using it wrong!
R just helps you to do that double check, the sanity check, to see if your data is correct.

The Data Analysts Toolkit Day 4: Getting to know R a bit better

So what actually is R? Let's take a look at running some commands, and we can talk about it as we go along.

R is a powerful environment for statistical computing
It is like a calculator and its power is: 
… it lets you save results in variables. Let's take a look:

> x <- 3="" font="">
> y <- 5="" font="">
> z = 4
> x + y + z

Ok, so we can run little commands and it gives us numbers back. What's all the fuss about?

We can save data in files. They are saved as .RData files. R lets us import and export data in lots of different ways. If it sounds fun, it is because it is! Let's take a look.

> someData <- 1:10="" font="" nbsp="">
> save(someData, file = "D:/DataAnalystsToolkit/someData.Rdata") 
> rm(someData) 
> load("D:/DataAnalystsToolkit/someData.Rdata") 
> print(someData)

Well, that is great if our data is in RData format. But normally we need to import data from different sources.

Import from CSV File
MyDataFrame <- div="" filepath.csv="" nbsp="" read.csv="">

Alternatively we can load data in directly from RStudio. 
Go to Tools in RStudio, and select Import Dataset. 
Select the file CountryCodes.csv and select the Import button.
In RStudio, you will now see the data in the data pane.

The console window will show the following:
> #import dataset
> CountryCodes <- directory="" files="" header="F)</font" orking="" ountrycodes.csv="" read.csv="" rogram="">
>   View(CountryCodes)

Once the data is imported, we can check the data.


So how do we connect R to SQL Server?

The Package RODBC provides R with a connection to ODBC databases:

> library(RODBC) 
> myodbcConnect <- dsn="servername" font="" nbsp="" odbcconnect="" pwd="******" uid="userid">

myQuery <- ...="" div="" from="" mytable="" nbsp="" where="">
# or read query from file 
# myQuery <- ataanalyststoolkit="" myquery.sql="" nchars="99999) </div" readchar="">
myData <- color="#ffa500" font="" sqlquery="">myodbcConnect
, myQuery, errors=TRUE) 

I hope that the rather bizarre colour coding will help you to see how the variables add up together to create a query.

RODBC also works for importing data from Excel files:

filename <- ataanalyststoolkit="" font="" myspreadsheet.xls="" nbsp="">
myxlsFile <- filename="" odbcconnectexcel="" readonly="FALSE) </font">
sqlSave(myxlsFile, a, rownames = FALSE) 
b <- a="" font="" myxlsfile="" nbsp="" sqlfetch="">

Next, we will look in more detail how R and PowerBI complement one another.

The Data Analysts Toolkit Day 3: Introduction to running and understanding commands in R

I have located a number of R Scripts over at my OneDrive account and you can head over there to download them. There are some other goodies over there too, so why not take a look?

how do you open files? To create a new file, you use the File -> New menu.

To open an existing file you use either the File -> Open menu.
Alternatively, you can use the Open Recent menu to select from recently opened files.

If you open several files within RStudio, you can see them as tabs to facilitate quick switching between open documents. 

If you have a large number of open documents, you can also navigate between them using the >> icon.
You could also use the tab bar to navigate between files, or the View -> Switch to Tab menu item.

Let's open the Women file and take a look at the commands.

Comments in R are preceded with a hash symbol, so that is what we are using here.

# this loads the dataset

# You can see what is in the dataset

# This allows you to see the column names

# You can see the output of the height column here, in different ways

# we can start to have a little fun!
# we are going to tell R that we are going to build a model of the data

model <- font="" height="" lm="" weight="">
predict(model, women, interval="predict")
newdata = data.frame(height=60)
predict(model, newdata, interval="predict")
r <- cor="" font="" height="" weight="">

Now we can see some strange output at the bottom of the page:

lm(formula = weight ~ height)

    Min      1Q  Median      3Q     Max 
-1.7333 -1.1333 -0.3833  0.7417  3.1167 

             Estimate Std. Error t value Pr(>|t|)    
(Intercept) -87.51667    5.93694  -14.74 1.71e-09 ***
height        3.45000    0.09114   37.85 1.09e-14 ***
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 1.525 on 13 degrees of freedom
Multiple R-squared:  0.991, Adjusted R-squared:  0.9903 
F-statistic:  1433 on 1 and 13 DF,  p-value: 1.091e-14

What does this actually mean? As part of my course, I normally discuss this in detail. For those of you reading online, here is the potted summary:

Basically, you want big F, small p, at the 30,000 feet helicopter level. We will start from there, and move down as we go.

The stars are shorthand for significance levels, with the number of asterisks 
displayed according to the p-value computed. 
*** for high significance and * for low significance. 
In this case, *** indicates that there is likely to be a relationship.

Pro Tip
Is the model significant or insignificant? This is the purpose of the F statistic.
Check the F statistic first because if it is not significant, then the model doesn’t matter.

In the next Day, we will look at more commands in R.