Wednesday, October 8, 2014

Mondrian - Let's pretend that data accuracy matters

Mondrian

Mondrian is a popular open source relational online analytical processing (ROLAP) engine developed by Pentaho that is used in their community and enterprise business intelligence (BI) offerings.  Mondrian is also used by other BI solutions such as JaspersoftSpagoBI, EazyBI and a few Pentaho associated projects such as JPivot, Pivot4J and Saiku.

Long standing Mondrian bug - You are affected!

We've been using Pentaho for the last five years and have recently discovered a huge bug in the way that Mondrian determines its non empty result sets.  We are current running Pentaho 5.0 but have gone back and confirmed the bug in Pentaho 3.0!  A bit of research shows that Pentaho has known about the bug since 2007/2008.  We've also confirmed the bug in Pentaho 5.1 and against the brand spanking new Mondrian 4.  The point is that this bug has been there forever and will be there in the future so it is important to understand what impact it is having and will continue to have on your reporting.

The core issue

Mondrian applies the date filter before calculating time based calculated measures when considering non empty results.

The implications

All your Mondrian sourced reports that rely on time based calculated measures such as YTD, MTD, QTD, Same Time (Month, Qtr etc) Last year and any variances you calculate off these, are not accurate.  You are missing data!

There are also a couple of smaller nuances for example TopCount works as expected but as soon as you try and drill down on anything that YTD would have hidden then you simply can not drill down (using JPivot).

Tools tested

From the Pentaho community offering we tested JPivot and Saiku and from the Pentaho enterprise offering we tested Analyser.  We tested Community Data Access (CDA) when we tested against Mondrian 4.0.  The bug was confirmed with all of these tools since they rely on Mondrian that has the bug.

We tested with Microsoft SQL Analysis Services (MSSAS) 2008 to confirm that Mondrian is behaving counter to user expectations.  MSSAS 2008 worked as expected.

An example

The example is built using the standard Pentaho SteelWheels dataset.  You can download any Pentaho version you like to follow along.  I will recommend a community version since Analyser doesn't easily give you the option to show non empty results and you therefore cannot see what you are missing.

Lets for example say you want to show all year to date sales for all products.  You Google around and find that the following calculated measure should get you there: Aggregate(Ytd([Time].CurrentMember), [Measures].[Sales]).  You can also use functions such as ParallelPeriod, Lag etc.  They all give the same result.

Using the following MDX query:
with member [Measures].[YTD Sales] as 'Aggregate(Ytd([Time].CurrentMember), [Measures].[Sales])'
select NON EMPTY {[Measures].[YTD Sales]} ON COLUMNS,
  NON EMPTY Hierarchize(Union(Crossjoin({[Customers].[All Customers]}, {[Product].[All Products]}), Crossjoin({[Customers].[All Customers]}, [Product].[All Products].Children))) ON ROWS
from [SteelWheelsSales]
where {[Time].[2003].[QTR1].[Mar]}

you get the following result:


At first glance you are pretty happy with that!  But if you know your dataset you realise that you also sell ships, trains and planes!  You also notice that your detail doesn't match your total!

You remove the NON EMPTY from the MDX above and now you get:


For the Enterprise guys, you get:

In Analyser you cannot easily show 'empty' rows (which aren't empty), but use the community version or plain SQL to confirm that you are actually missing data.

Why did that happen?  Because you didn't sell a ship, train or plane in March.  The March filter is applied first, then the calculated measure is done.

Further discussion

The above contrived example may make you think that you will probably never run into this issue since you sell most of your products most of the time.  For sure!  That is probably why it took us a while before we realised what was wrong with our data!  It works OK a lot of the time at a very high level and at first glance it definitely looks good.

You will however notice this bug more under the following circumstances:
  • You probably don't sell every product to every customer using every marketing channel using all your sales reps.  In other words, the deeper you drill into your data, the more likely it is to be wrong since your data will be more sparse.
  • A finer grained date dimension shows it faster.  If your grain is per day, then you will notice the bug at the start of every day since you will have no report.  If you have live data updates, you will have a 'fuller' report by the end of the day.  If you report by month then you will have a fuller set at the end of the month.  For example set your date to next month, the YTD figures should be the same as putting in this month  right?  In Mondrian that is wrong.  You haven't sold anything next month so you start with nothing.  After the first trading day you will only see what you traded in that day, and so on.
  • High customer turnover.  The higher your customer turnover the higher the chance that a specific customer didn't buy from you this month and then the sales rep's customer list for comparing this month with the last month / last year etc will be wrong.
  • High product changes.  If you often stop selling specific products (fasion etc) then those products will drop off the list.  For example all your winter clothes will drop off in summer.  Note that the product line numbers will still be correct unless you group all your winter clothes in one product line, then the line will disappear.

Can you train your users around this bug?  Not really.  For example in a year to date example you will be better off selecting a year instead of a month.  Then the above report will work as expected.  But then you can't put a "same period last year" measure on the screen.  For that you need to select a month / period and that breaks your year to date calculation.  Also "same periods last year" like a "YTD Last Year" won't work if you select a full year or if you sold something last year that you didn't sell this year!  In short, it gets messy and users are unlikely to get it right.

What can you do

First, test this bug on your production data and confirm it.  Then vote for the bug I created on Jira here and create a support ticket with Pentaho if you have a contract.  There is also more examples on the bug report and a breakdown of the bug into two separate but related issues. 

You can get a consultant in and depending on the size of your dataset they can conjure up some complex MDX using subsets and filters that hopefully executes in a reasonable amount of time.

We are totally baffled as to why this bug has been allowed to linger for as long as it has.  In discussions with colleagues and consultants the following Dilbert comic may finally explain it! http://dilbert.com/strips/comic/2007-05-16/