I’ve just returned from a week in Sharm el Sheikh. November is a great time to escape the cold creeping into the South West England by basking under the Egyptian sun. Of course, that is, until the weeklong bliss in paradise comes crashing down as one…
(Image credit: John Flinchbaugh)
So, I’ve got my hands on some big, open data by way of NHS prescription data for August 2012 and I’ve been trying to answer the question what, if any, price variations are there for drugs and appliances prescribed by GP practices across the UK.
Well, as it turns out, analysing (big) data on a (small) laptop is harder than it looks. I’ve been doing some analyses on a 32-bit Windows 7 laptop from work (I’m sure proper data wranglers are either cringing or think I am mad) with 4GB of RAM, but about 1GB is taken up by security software and essential background services. So by the time I have loaded that CSV file into a data.frame in R, there isn’t much wiggle room to do the analysis in the traditional way, which was why in my last post I hit the message:
Error: cannot allocate vector of size 76.7 Mb
There are a few ways of resolving this issue. First, work smarter. In my few weeks of experimentation with R, I have found that there are usually more than one way of achieving the same result, and some ways are less memory intensive than others.
Second, you could use libraries such as ff or bigmemory that utilises chunking to overcome the memory issue, but this isn’t necessarily compatible with the codes I have already written and so I would have to learn a whole new set of languages and rewrite some of the algorithms. Frankly, as a beginner, I can only take in so much learning in one go!
Another option is to go big - get more RAM. Although Windows 8 64-bit supports memory up to 512GB, the Intel chips for off-the-shelf computers for consumers (with either Core i’s or Xeon chips) only support up to 32GB memory. And as you go up the RAM size, the price becomes prohibitively expensive (as far as experiment is concerned).
Finally, I could turn to the cloud. Amazon offers a freemium service so I could get myself set up with a Linux instance for no fee, set up R there and run a few queries. But again, there is yet another learning hurdle - how to set up the instance, how to use Linux, etc.
From all this, I learned that there is no getting away from technology when it comes to analysing big data set, as machines and software built for home (or small/medium enterprise) use are generally underpowered for everyday people who might want to have a crack at big data sets. While citizen and government interest in open data is growing, there are many hurdles when it comes to converting open data into useful information and insight that “little guys” can’t afford to do.
As for my own little experiment, I have opted for option #3 - I now have access to a virtual server through work, essentially a sandbox environment with 12GB of RAM. Yippee! Now my next challenge is actually setting up a working database so that I can more effectively analyse data through time.
As it’s now turning into a work project, I don’t think I can blog much more about the fun and crazy adventure. But stay tuned, I am sure I will think of something equally amazing to fill up this space!
(Image credit: clogwog)
Following my previous post, here’s what I’ve done to try and tidy up the data and amend rows of data where “quantity” was recorded as 0.
To recap the issue, “Quantity” recorded as “0” is most likely an entry error, because another column, titled “Items”, also gives reference to the number of items prescribed. There should be a direct relationship between items and quantity, such that quantity cannot be zero.
Therefore, for rows where “Quantity” has been marked as “0”, I am looking to replace this with the corresponding number from “Items” as a best guess estimate.
I tried this out on my “test” variable, which is a subset of the raw data where quantity has been marked as “0”, with the command:
> test$QUANTITY[test$QUANTITY==0] <- test$ITEMS[test$QUANTITY==0]
Worked like a treat. Time to do this on the large dataset…
Ever the cautious one, I don’t want to actually replace anything in the “Quantity” column, in case my formula doesn’t work and I ruin the raw data. So I thought I will replicate a column, called QUANTITY_ADJ, which I will manipulate instead:
> august_raw$QUANTITY_ADJ <- august_raw$QUANTITY
Error: cannot allocate vector of size 76.7 Mb
(Image credit: EU Social)
To answer the question whether or not there is price variability for the same prescription across the NHS, I thought that a visual graph would be the best way to quickly assess the situation.
Histogram is usually the graph of choice for plotting variation, but with almost 20,000 different drug codes to analyse, seeing so many individual histograms will be eye-wateringly painful and difficult to digest.
So I am thinking of making some box plots. It still won’t be neat and tidy, but I think it’s a good starting point to judge just how variable the prices are, especially if I plot them against the same Y-axis. This will have the added advantage of seeing which boxes are “taller” (eg. more variability) or “higher” up the Y-axis (eg. more costly items) to help me decide which prescriptions I might want to delve into at a later time.
As my dataset contains information on the cost and associated quantity, I can work out the “per item” cost simply by dividing the former by the latter. This way, when I plot the cost of one particular drug, the variability isn’t due to the number of items prescribed that month, but rather gives a measure that can be compared across different primary care trusts.
In the first block of code, I created a new column “costperitem” in my raw data by doing the aforementioned division. Then I used the summary function to quickly test the min and mix boundary that I could use for my boxplot.
But then I spotted a problem: the mean and the max of costperunit is registering as “Inf,” which means that in the raw data, there are instances where at least one trust has denoted zero as the quantity for a particular prescription.
That, ladies and gents, has got to be an input error. If something wasn’t prescribed, it shouldn’t t be on the list at all. To do a quick test on how many lines of data have a zero-quantity, I used the subset function:
I created a variable “test” that replicates the structure of august_raw data.frame, but to include data where QUANTITY column is equals (==) to 0. Then I used the nrow() command to work out there are 19 such cases. Line 5 also does exactly the same thing without creating a new data.frame, but I find that it’s best to create a new table so you can look more into the subset of data.
Perhaps the easiest thing to do is substitute all 0’s with 1’s, on the assumption that quantity of at least one must have been prescribed. But then I noticed that the “items” column show different numbers other than 1’s. Now, the “quantity” column should reflect the changes in “items”, such that as the number of items goes up, quantity should go up proportionally. For example, think of a standard packet of ibuprofen, which might come as a box of 10 pills. “Items” refer to the box, so if 2 boxes were prescribed, then “quantity” should say 20.
After spotting these 19 rows of data showing nil quantity, it makes me question how many of the other 10,056,613 rows of data have inconsistencies… And how many future assumptions I will have to make when further analysing the data.
For the sake of this soldiering on, I will replace quantity by the number of items for these 19 instances, which I will cover in my next post.
(Image credit: zhouxuan12345678)
As my last post alluded, the data is the answer, much like 42 is the answer in the Hitchhiker’s Guide to the Galaxy. The first step to analysing big data is asking the right questions, and then breaking the task at hand with smaller questions to work towards the ultimate answer using the tools available.
One of the first questions I want to answer, with the NHS prescription data, is how much price variations there are for each drug or appliance prescribed by the different GP practices across England.
The data set contains the following columns:
Breaking my question into smaller chunk, I need to work out how many drugs and appliances I am working with. This can be done by looking at the unique values in either column for BNF code (4) or BNF name (5).
In R, the unique function makes this a simple task:
"august_raw" is the original prescription data for the month of august, containing 10m+ rows of data and the 7 columns detailed above. In the first line of codes, I created a new variable (factor, technically) called “unique_bnfname” to store the list of unique values found in column “BNF.Name” in data.frame “august_raw”.
I carried out a similar procedure to find unique BNF codes, then used the length function to find out how many items are in the two factors. To my surprise, the lengths are different! There are 19,741 unique BNF codes in the dataset, but only 17,626 unique items when looking at BNF names - why?
Time to turn to the internet and learn about BNF codes and BNF names. Thankfully, the NHS has created a glossary of terms that answers this question:
The BSA prescribing and dispensing information systems use the therapeutic classifications defined in the British National Formulary (BNF) as extended by the BSA, The same drug may appear in different areas of the BNF as it can be prescribed to treat more than one condition. For example: aspirin is licensed for use as an antiplatelet and as an analgesic.
Ah ha, so some items have multiple codes because they are prescribed for different reasons. This newfound knowledge poses more questions - do identical items prescribed for different ailments have significantly different costs?
Big data analysis is more than just crunching numbers. It’s about getting intimate with your data - learning about its nature and quirks - and asking more questions as you get to know it better.
In my next post, I’ll update you on my progress on answering this new question.
(Image credit: Oberazzi)
Fellow tumblr user modestgesture posed some great questions on my last post:
Are you looking for trends or differences? do you plan on investigating 2, 3, or more groups? multiple subsets within a bigger group, or just 2 columns against eachother? without looking at the data these are my initial questions…
I’ll start by tackling the easiest question - what is in the dataset? The NHS prescription data includes, on a monthly basis, prescriptions by medical practitioners in England and dispensed in the UK. Each row of data details a specific item that was prescribed by an NHS practice (such as paracetamol or wound dressings), the number of items prescribed, the net ingredient cost, and the actual cost of the prescription, calculated by:
Actual Cost = (Net Ingredient Cost less discount) + Payment for Consumables + Payment for Containers + Out of Pocket Expenses
Given the nature of the data, one of the most obvious way of dissecting the data is to spot the differences. For a specific item prescribed, is there variability in the actual costs paid by different NHS practices? If so, are there practices or regions that are consistently overpaying? In addition to answering whatare the cost differences, can the data shed any insight as to why there are cost differences and how the differences could be eliminated?
Sifting the data through time would also be an interesting exercise as well as spotting long-term trends. As this would require multiple CSV files, however I will have to earmark this particular exercise for when I become more proficient with R.
In my next post I will let you know I get on with answering some of the above questions. In the meantime, keep firing away any questions you may have for me and/or the data!
While I am great at regurgitating textbook material in exams, information doesn’t stick with me unless I am using the knowledge, particularly if it’s knowledge from seeking for a specific answer to fit the puzzle.
A bit cryptic, perhaps, but I am so bored of reading Chapters 1-3 of introductory textbooks on R (or Python, HTML5, or other books on programming), printing “Hello World” commands, doing mathematics while minding the orders of operations, and trying to remember the differences between lists, vectors, matrices, and data.frames. And of course, everything appears so obvious until I try to recall which commands I’ve read in the last three days would actually do the one simple task at hand, only to conclude I have wasted the last three days gaining nothing.
Sometimes, going straight into the deep end and learning from experience is the best way to go. That’s why for centuries apprenticeships work.
Without getting sidetracked into a rant about the issues of first world education (!), I decided it’s time to just grab a big dataset and play with it. As Jeremy Clarkson puts it, how hard can it be?
So here goes.
UK’s National Health Service has been releasing prescription data by GP practices as open data, and I have downloaded the August 2012 data at the presentation level, the most recent data available at the time of writing. The CSV file is 1.3GB and I will be using R Studio (which runs on R) to analyse the data.
This is probably “small data” compared to what other people or companies may be analysing, but it’s pretty big to me - the file does not open in Microsoft Excel as it has too many rows of data!
Just how many? Well, first thing to do is to load the file in R:
This command reads the CSV into a data.frame which I have labelled “august_raw”. And then all I have to do is ask R:
My CSV file contains over 10 million rows. No wonder Excel struggles! Other interesting information about my “table” is that it’s a data.frame, took over 230 seconds to load into the system, and has 11 columns of data:
There is also a really handy command on R that gives a preview of what the data looks like:
Okay, so not the most sophisticated analysis on big data. So here’s where I need your help - are you experienced in R (and/or stats) and willing to be my teacher? After all, every apprentice needs a master craftsman. Even if you’re not well versed in R, can you think of questions that this dataset (or other big and open data) could answer? Join me in my learning experiment!
(Photo credit: HikingArtist.com)
Hello Tumblr and Tumblr-ians!
After a longer than expected sabbatical from Tumblr, I am making a comeback. Not one weekend goes by in the past nine months when I didn’t think I need to come back to my blog. But with a new job and several personal duties occupying my attention throughout the summer months - which quickly encroached into the winter months - I didn’t have time to even prioritise what it is I want to focus on, write about, and work on in either personal and professional capacities.
As 2012 comes to a close, projects are winding down and I’ve had a chance to review and cross out goals and milestones I have achieved this year. With a bit more space on the “to-do” list, and more importantly, another year of life experience, I am ready to flesh out personal ambitions against professional demands, leisure activities, and other personal interests to come up with my 2013 priorities.
And so here I am, back on Tumblr with renewed energy and new ideas. Previously I’ve been blogging about how technology is changing people’ lives for the better, particularly when it comes to data and information management. I am absolutely fascinated by data so I’m taking one step further in this direction , I am getting my hands dirty, so to speak, with data.%0A
Perhaps I am lured by the sexiness of data scientists or the opulent lifestyle achievable by big data analysts, but I like to think that I’m just a bit of a late bloomer and all those geeky things I’m interested in are finally coming together into one, happy being. I’ve always loved logic grid puzzles, planned my holidays on Excel spreadsheets, and found satisfaction from writing HTML codes on a note pad. It’s a shame that I didn’t like my C++ teacher in high school nor gave statistics a proper chance in school, but I’m hoping it’s not too late to start learning. Especially this time around I’m learning something for fun, rather than aiming for an A just to get as close to a 4.0 GPA as possible.
And so, this will be the journal of my adventure into data (big and small, open and private) using tools like statistics, R, Python, scripts, and more, sprinkled with occasional musings.
(Image credit: GTez on Flickr)
It was just over a year ago that I started my Tumblelog after starting a new job and embarking on a new career path. For me, it was the perfect time to pick up blogging again, this time looking at technology that makes a difference in our everyday lives, from the gadgets that help us see, eat, exercise, and behave better, to the nuances of digital media that affect most of us without realising it.
One year on, I have updated my Tumblelog with 320 posts (including this one) and amassed 5,165 followers. I am taking a hiatus during the month of March while I recharge my batteries and narrow down the focus of my blogging (as well as professional and personal) life. I don’t know how many of my followers are ‘real’, how many of you will see this post, and how many of you (probably not too many!) will miss me for the next few days, but feel free to reply to this post and say hello to me on Twitter @cecilialiao.
See you in April!