Monday, 1 March 2010

Adventures in SQL: Using ASCII to remove carriage returns in user input

A current piece of work involves migrating data from a source system, where there is a lot of user input, into a target system.  The data migration process can always be tricky when user-generated input is involved, since the process involves formalising and sanitising this fluid, informal user input into a formal structure. The informal nature of user input can make this process an interesting challenge!
One recent project involved taking data from a source system and inserting it into a well-known target database, which is a well-known timesheet system. The target system takes comma separated value (CSV) files and imports the data using a proprietary engine.  The data is then propagated to its own underlying SQL Server database.
The CSV file production strategy has involved using SSIS 2008 to extract data from the source system, and place the data into the formal structure of the CSV files. One problem area, as expected, are free-text fields, where users can type in whatever they like!

One issue for CSV file production is carriage returns in free-text fields. SSIS sees the carriage returns, and behaves as its been told to do: it starts a new line.  This can be seen in the Flat File Connection Editor below:

Carriage Returns in Flat File Editor

The {CR}{LF} are control characters which denote ‘carriage return’ and 'line feed', respectively. When SSIS sees these control characters in the incoming text field, it naturally believes that this means it should create a new line in the CSV file: which it then goes and undertakes this activity. However, this causes an issue for the CSV file format overall, because the next line will only have one column in it, and will not be correctly formatted for the data import to work properly.

In order to resolve this issue, it is possible to use lower-level constructs in order to strip out any erroneous carriage returns in the user input. The SQL CHAR command is a string function that converts an INT ASCII code to a specific character. Here, it is used to insert control characters that we see in the earlier image, into character strings. Here is an example, which looks for a carriage return (CHAR(13)) and line feed (CHAR(10)):

select REPLACE(cast(w.userentereddetail as nvarchar(4000)), CHAR(13) + CHAR(10), '  ') as UserComments from tbl w

This SQL Statement uses a combination of CHAR and REPLACE to find any carriage returns in the ‘userentereddetail’ column, and will replace them with a space. This means that any carriage returns in the user input field are captured and replaced so that they will not interfere with the CSV production by SSIS.

To summarise, perhaps I’m showing my age by talking about ASCII. However, these little tidbits can be useful to know!

Thursday, 18 February 2010

Adventures in SQL: Coalesce versus ISNULL

Some of the differences between ISNULL and COALESCE are quite obvious. For example, a column that's been created by ISNULL is not regarded as NULLABLE. On the other hand, a column created by COALESCE is regarded as nullable. Similarly, ISNULL only takes two arguments, whereas COALESCE can accept more than two arguments. It is possible to nest ISNULL statements, however, so this can be used to meet a requirement to use ISNULL rather than COALESCE.
There is one difference between ISNULL and COALESCE, however, that may not be so obvious. There is a restriction with ISNULL, where the length of the first parameter dictates the data type of the resulting column. Thus, if the second argument is longer than the datatype of the first argument, it will be truncated.
Here is an example using the AddressLine2 column of the Person.Address table in the AdventureWorks2008 database, which has a datatype of nvarchar(60):

SELECT
ISNULL(AddressLine2,
'1234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890'
)  as ISNULL_Result
FROM [AdventureWorks2008].[Person].[Address]
SELECT
COALESCE(AddressLine2,
'1234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890'
) As COALESCE_Result
FROM [AdventureWorks2008].[Person].[Address]

Using ISNULL, the resulting column is reduced to having a length of 60. Using COALESCE, however, no such restriction is in place. Thus, using the above query, the length of the second argument is retained, resulting in a length of 70. This can be seen in the image below:

ISNULL versus COALESCE result


It is worth noting that, if COALESCE isn't used properly, it can provide strange results. For example, if the datatypes are far apart, such as an integer and a datetime, COALESCE will return a datetime. So, the following SQL command:

SELECT COALESCE(100, GETDATE()) as Coalese_Result

This results in the following:

'1900-04-11 00:00:00.000'

Since datetime has been picked as the higher priority datatype for the resulting, the integer value of 100 has been converted to a datetime. COALESCE is behaving as it was designed to do, which is to pick the higher-priority data type. Just be careful of unexpected results if the datatypes are mixed: test, test and test again!

To summarise, COALESCE can be a useful SQL tool to have in your back pocket, if it is used and tested properly.

Thursday, 4 February 2010

Plastic Surgery Data Analysis Using PerformancePoint

One area of interest in healthcare data is plastic surgery, and the trends that appear in the operations that individuals are choosing to have done. In order to explore more about this phenomenon, the Guardian Datastore has provided an excellent set of data which has been used here as a basis for further analysis, and the display was created using Microsoft Office PerformancePoint and SQL Server.

As the examples show, sometimes it’s not enough just to have a set of numbers; illustrating the numbers with graphs can highlight patterns to the data consumer that weren’t previously visible.

In order to display some of the capabilities of Performance Point, the plastic surgery data was used here in order to create graphs and dashboards using Microsoft SQL Server as a data source. The data was obtained from the Guardian Datastore has a comma-separated values document, or CSV. This was easily imported into a table in SQL Server using a straightforward wizard. The ‘presentation layer’ was Performance Point, which is part of the Office SharePoint suite of products. Performance Point Services is integrated into Sharepoint, and can take in a variety of data sources, including Excel, standard ODBC connections, SQL Server database sources. It is particularly good with Analysis Services cubes since this allows drill-down and click-through to display data in dashboards, graphs, or even ‘traffic light’ indicators. Performance Point aids analysis by allowing the user a lot of control over the appearance of the graphs, and, as is usual for Microsoft, there are lots of different ways to do the same thing. Whatever is easiest for the user, really! If you would like to know more about how everything hangs together, please see the PerformancePoint Services blog site here for a downloadable Visio or PDF diagram for easy reading.

As a sample, there is a PerformancePoint dashboard below, or click here for the large image:


Overall, the dashboard shows that plastic surgery is a growing area of healthcare, with increases shown in almost every type of surgery except the bottom two: facelifts and abdominoplasty (tummy tucks). In addition to an overall increase in the plastic surgery requested, the type of plastic surgery requested has changed from 2008 to 2009. For example, there has been an over 40% increase in brow lifts, and a quarter rise in rhinoplasty or ‘nose jobs’. However, perhaps surprisingly, the ‘tummy tuck’ operation has shown a nearly 10% drop in operations performed. This is highlighted by the following excerpt of the above graph, or click here for an enlarged version:


There is obviously still a difference in the types of surgery requested by males and females, and this can be seen from the graph below. For example, in females, breast augmentation was a popular choice, whilst the data shows that only 28 males have undergone breast augmentation surgery in 2009. The only choice nearing a 50-50 split was otoplasty, which is an operation to correct ear shape or size. This can be seen from the bottom half of the dashboard, which is reproduced here for ease of use:


One interesting find is that male cosmetic surgery is on the increase; in fact, it increased by 21% in the last year. In particular, the Guardian reports that plastic surgery operations to reduce breast size in men (gynecomastia) have nearly doubled in the last year, and the British Association of Aesthetic Plastic Surgeons (BAAPs) found that operations to correct gynecomastia in men grew by 80% overall over the past 5 years.

To summarise, there are plenty of ways of visualising the nuggets of data held in data sources, and the Microsoft stack allows plenty of different SQL Server based ways for storing and visualising data in accordance with user requirements.

Tuesday, 2 February 2010

Conditional Split Transformation in SSIS 2008: Uses and Examples

One very useful Control Flow feature in SSIS 2008 is the ‘Conditional Split’. Sometimes new SSIS users can have difficulties in finding out detail about the expressions used in the Conditional Split. The purpose of this blog is to provide some advice about the Conditional Split: its use and how it works, when to use it, and some examples of the expression language. For people who are new (and not-so-new!) to SSIS it can sometimes be difficult to decide which component to use, and it isn’t always immediately clear why the Conditional Split component is useful.


How does the Conditional Split transform work? Briefly, as a Data Flow component, the Conditional Split is a decision maker. The component tests each row, and then decides what its destination should be. In other words, the ‘Conditional Split’ tests each individual inbound row of data against a series of different conditions, which have been defined by the package creator. Then, the Conditional Split component funnels each row of the data up into different outputs, dependent on the result of the test against each condition.


In order to understand its purpose a bit better, here are some examples where it might be useful:


- Directing ‘bad’ rows towards a separate place for further examination. A recent example is where timesheet data has been imported from one source to a different destination. The source system permitted timesheets to be submitted without customer name, project name, or submission date; in contrast, the target system regards these timesheets as incorrect. In order to ensure that the import from source to target went as smoothly as possible, the records were filtered for NULLS in the client, project or submission date fields and directed to a separate location for review. I could then take these output files, and ask the customer for default values for these fields, for inserting at a separate time.


- Differentiating records between ‘new’ and ‘update’ records. ‘New’ records can be inserted into the appropriate place; ‘update’ records can have fields updated in the existing data, rather than inserted as new information. To divide data rows into ‘new’ rows and ‘update’ rows for differentiating between ‘new’ information and information that might need to be updated. For example, if you have a ‘new’ customer, then you may want to treat that differently from situations where an existing customer simply needs an updated profile. This is called ‘Slowly Changing Dimensions’ (SCD).


When should the Conditional Split component be used, instead of the Slowly Changing Dimension component? This is an important point, and it depends on the quality of the data. Note: the SCD component does not handle NULLs. It is possible to use the SSIS 2008 SCD to handle data with NULLs in it, so, in this case, the Conditional Split component should be used.


Prior to creating the package, it is possible to test the data for NULLs by using the Data Profiling Feature in SSIS, and it is recommended here that this investigation is done prior to writing the package. If the data does contain NULLS, then it is recommended to use the Conditional Split component. Choosing the correct and most efficient component for a particular purpose is half the battle; in the words of Mary Poppins, ‘well begun is half done’!


In order to set up the conditions, it is necessary to use the SSIS Expressions language, which applies to the various transformations within SSIS. In the Conditional Split transformation, it is only possible to use expressions that produce a TRUE or FALSE result. So, for example, the following expressions would be legal:

Column 1 >= Column 2
Column 1 < Column 2
ISNULL(Column 1)

Implicitly, this means that the Conditional Split transform can have more than one output destination. Due to this, it is important to consider what the ‘default’ situation will be, and where the ‘default’ rows should be directed. Here is an example of the Conditional Split transformation, using the ISNULL expression as an example, please click here for a larger example:


Conditional Split Expression Example



To conclude, here are some valuable tips in using the Conditional Split transformation:

If the Conditional Split component isn’t working properly, then it is possible to ‘daisy chain’ them in order to clarify the output. The ‘daisy chain’ method can help since it is possible to use Data Viewers in order to actually see what data is going through each flow. This mechanism can illuminate the decision behind the decisions made by the Conditional Split.

If the expressions do not result in TRUE or FALSE outcomes, then the Conditional Split component will not work properly. Examples of expressions in this category include increment statements, e.g. @Counter = 1, and other valid expressions which do not return TRUE or FALSE, such as GETDATE().



Tuesday, 26 January 2010

Some Fun - a Film Critics Dashboard!

It’s the time of year again, where the film critics and pundits try to guess who will be awarded an Academy Award©, otherwise known as an Oscar, at this year’s event.  In order to explore the numbers behind the Oscar© guesswork, I’ve tried to create a ‘Film Critic dashboard’ which based on Guardian Datastore data on awards made by awarding committees such as the European Film Awards, Golden Globes awards and Screen Critics awards in  Jan 2010. These awards have been achieved by luminaries in the film industry for 2009; these include British actress Kate Winslet, Meryl Streep and George Clooney, amongst others.
It’s possible to provide an interesting dashboard, produced using Tableau, which explores the numbers, or counts, of the awards achieved by film or by individual. The dashboard created below has used color in order to distinguish films for each other. For example, ‘Inglorourious Basterds’ is denoted in orange in the ‘Best Film’ and ‘Best Actor/Actress’ graphs for consistency.
Further, the color palette has been chosen with colour-blind individuals in mind. Figures show that approximately 10% of males and 1% of females are color-blind, which means that they have problems in distinguishing between green and red (Few, 2008). Further, the colour ‘red’ has different culture connotations. In the US, red can be associated with jeopardy, as we can see in the phrase ‘caught red-handed’.  This perception is in contrast with China, where red is considered a lucky colour. With this in mind, Tableau automatically provide a ‘Color-blind’ palette, which means that it is possible to try and accommodate issues surrounding color-blindness for viewers affected with this condition. An example dashboard, using a color-blind palette, is given below, and please click here for a larger version:

Overall Film Dashboard


It is also interesting to note that the only actor to have won any ‘Best Supporting’ Actor awards this year has been Christoph Waltz. Similarly, the only winner of major ‘Best Supporting Actress’ awards has been Mo’Nique, for her role in Precious. If the past is a correct indicator of future wins, then they had better get their ‘Winner’ speeches ready for the Academy Awards©! Please click here for a larger version, if required. 


Best Actor Actress Award Winners

The bottom left hand side of the graph shows a count of ‘Best Film’ awards given to different films. It is possible to see that ‘The Hurt Locker’, directed by Kathryn Bigelow, has won more ‘Best Film’ awards than the other films. The example below allows the graph to be seen in more detail, and please click here for a larger version:

Best Film Award Winners

Some contextual information is accommodated by provisioning a list of the Academy Award© wins by the actors and actresses who have won awards in 2009. Here is an example below:

Academy Awards To Date


Again, the intensity of color is used to identify the holder of the most Academy Awards©. It’s clear to see that James Cameron tops the list with a total of three Oscars© to his name, with George Clooney and Meryl Streep holding second place with two awards each.  If past Academy Award© achievements are an indicator of future wins, then we could guess that these two actors are a safe bet. But who knows?
Finally, it was interesting to see the overall awards achieved by films. The ‘Best Film’ award winners were shown in the bottom half of the dashboard, and the bottom right-hand heat map shows the ranking of film overall award winners. This view has been expanded over the previous ‘Best Film’ award winners to include awards such as ‘Best Screenplay’ and ‘Best Score’.  The heat map is provided here for review:
 


Total Film Awards



When other award types are included, it is possible to see that ‘Crazy Heart’ is the winner, with ‘The Hurt Locker’ and ‘Inglourious Basterds’ following quickly behind.
With this data in mind, we can watch the Academy Award© achievements with interest, and see how this dashboard and corresponding data has helped –or hindered – the correct prediction of the outcome!

Add to Technorati Favorites

Friday, 22 January 2010

Data Visualisation: Beauty and Clarity?

This blog post focuses on an analysis of the donations made, by country, to assist the healing of Haiti after the devastating series of earthquakes in January 2010.  The process of tagging donations as specifically coming from organization, governments and countries is complex, and not as quick as one might expect. For example, the European Union has an emergency aid fund, but so do most of its members. So, for example, Germany might donate to the EU fund, and provide its own donation as well. This makes it difficult to track exactly where the generous donations are coming from, due to the actual route of the donation. Further, the type of donation can also make things more complex. For example, Canada citizens will have their donations matched by a donation by Canadian government – in addition to the funds that the kind-hearted Canadian government has already pledged as an emergency response to funds required immediately. This means that the actual full amount of donations is still to be determined since funds are still coming in from different sources within Canada in mind.


The data community is finding its own way of determining donation amounts. Jer Thorp uses the concept of an ‘Avatar minute’, is how much of the film that a viewer would be able to watch with a given amount of money. For example, 16 cents would purchase you about 3 minutes of the film, for example. So, Thorp’s analysis is broken down into two pieces: the first part aims to show each government’s donation is worth per head, and secondly, how many minutes of Avatar each citizen would be able to watch with the donation.  The ‘Avatar minute’ concept has been applied to visualizations of Haiti Earthquake donations at Jer Thorp's website and if you haven’t had a look yet, I would recommend it. The images are incredibly beautiful, like this one below, which aims to show the amount of donations pledged by each country in ‘Avatar minutes’. All credit (and copyright) is given to Jer Thorp for this beautiful image:

4284507258_0d7978a37a


Thorp’s image is undoubtedly beautiful, so the image met its objective as displaying data as a beautiful piece of art. Displaying information beautifully is emerging field, and Microsoft’s Pivot project is an example of this phenomenon.  Further, Stephen Few provides an analysis in a recent blog post.

As a data analyst, however, the visualisation raised more questions. For example, the imagery made it difficult to compare donations between countries. In the above example, the size of the letters seemed to indicate the donation amount pledged in ‘Avatar minutes’. However, the analysis didn’t seem to make it clear that Luxembourg, Denmark and Guyana made similar donations per head, partly because the word length and letter size made it difficult to compare between them easily. Comparison was made more difficult by the 3D imagery, which adds beauty but didn’t add clarity. As Stephen Few points out, 3D imagery can make data more difficult for viewers to decode. In the above image, it isn’t clear whether the length or width of the film reel has a relation with the donation amount, or whether it is underlying the headline point about the concept of ‘Avatar minutes’.
With these issues in mind, the data has been re-visualised here using Tableau in order to clarify underlying patterns in the data, and make comparison straightforward. With this in mind, I set out to use the same data, from two sources: ReliefWeb and the ever-excellent Guardian Data Source. Before we look at the visualization note that, with the aforementioned issues in mind, the underlying data set has been restricted to include indubitable funds from different countries at the time of writing. This has been done to focus the ‘business question’ of the graph, this excludes money from individuals, organizations, and money that hasn’t been tagged definitively as coming from a specific source yet. Further, this data from ReliefWeb was accurate as at 21st January, 2010, and it is clear that this situation regarding inbound donations is subject to rapid change. The sample dashboard can be found here, or a larger sample can be obtained by clicking here:


Overall Dashboard


One single unified dashboard, with a number of graphs showing different views of the data, can allow the viewer to compare and contrast patterns more easily. For example, the top graph undoubtedly shows that the United States have provided the most funds, and a larger image can be obtained by clicking here:

Top 10 Donations by Country
The sheer size of the US Government donation is further clarified by the right-hand image, which shows that the United States have provided 5 times the individual donations of France and the UK, which have provided second and third highest donations respectively.

Top 3 Donations


When population numbers are added to the mix, then some additional features of the data are highlighted. The following graph shows population as the thin blue line, and donation amount as a thicker blue background line. For example, it’s possible to see that, at the time of writing, Australia and Sweden have given proportionally more per head than the other countries, since their ‘donation’ line is longer than their ‘population’ line. A larger sample can be obtained by clicking here:





Haiti Total Donations by Country and Population

This is further quantified by the final section of the dashboard, which shows the amount of donation, per head, provided by each government. This was calculated simply as: [Donations in US Dollars]/[Population] The image is provided next:

Donor Ranking

The donation numbers have been included, and the proportion is also indicated by the strength of the blue colour; in other words, the stronger the blue colour, the higher the population.  This allows  the user the facility to look at the numbers, or just have an ‘at-a-glance’ view in order to obtain the ‘sense’ of the data. It's possible to see that Australia, Sweden and Norway have given more then one dollar per head, at the time of writing (21st Jan, 2010).

To summarise, the initial data visualization was incredibly beautiful. However, the features that made it beautiful, also served to make the underlying data less clear. To conclude, in information visualization, there can be a payoff between beauty and clarity, but both domains can live side-by-side to serve different needs, and to learn from each other.


Bullet Graphs for Performance Monitoring and More | Tableau Software

Tableau are sponsoring a live webinar with Stephen Few on 23rd February; register quick! This will be well worth a listen. I've included the link below for you:

Bullet Graphs for Performance Monitoring and More | Tableau Software

Posted using ShareThis
 

©2009 Jen's BlogCopyright by Jennifer Stirrup who is looking for a new role.