Monday, May 16, 2011

Large data set testing

One testing scenario that I've been thinking about recently is testing large data sets.  Our software runs large furniture manufacturing plants, and the combination of all of the different options available in some screens can get quite large and cause a performance issue.  More importantly, each customer is capable of configuring their data in many different ways.  

Take a recent example.  A customer had configured two features with the same name, but one contained alphanumeric data and the other integer data.  From our perspective this was a poor data setup, so we disregarded the possibility when considering a schema change.  When the customer upgraded to a newer software version that put features in a common pool by name, the SQL upgrade script tried to merge the two data sets into one, and some data was lost because of the unexpected data type difference.  Luckily this was caught in a test upgrade by customer support and no live data was lost, but the point is the same:  How do you test for data compatibility issues when each customer is capable of creating data combinations in new and 'interesting' ways?  You may not agree with a data setup, but if it's allowed there's always the chance that someone somewhere will try it, and you can't always get away with calling it a 'data problem' (not if you want to keep your customers for long).
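To make the example concrete, here is a minimal Python sketch of the kind of pre-upgrade check that would flag this situation.  The (name, type) pair layout and the feature names are assumptions made up for illustration, not the actual schema.

```python
from collections import defaultdict


def find_type_conflicts(features):
    """Return {name: sorted data types} for names used with more than one type.

    `features` is an iterable of (feature_name, data_type) pairs. This shape
    is an assumption for illustration only.
    """
    types_by_name = defaultdict(set)
    for name, data_type in features:
        types_by_name[name].add(data_type)
    return {
        name: sorted(types)
        for name, types in types_by_name.items()
        if len(types) > 1
    }


# Hypothetical customer configuration: two features share the name "Finish".
customer_features = [
    ("Finish", "alphanumeric"),
    ("Finish", "integer"),
    ("Width", "integer"),
]

conflicts = find_type_conflicts(customer_features)
print(conflicts)
```

Running a check like this against each customer database before a merge-by-name upgrade would at least tell you which customers need a closer look.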

Large data sets like this are a good example of the testing maxim that it isn't possible to test everything.  In any kind of large scale data based solution, there are going to be too many possible ways for customers to create and combine their data.  So how do we get the best test coverage?  Here are a few ways I currently go about it.  I'm sure you can think of more.

1) Focus on the most important data scenarios.
This may fall under the heading of 'duh', but it is always worth restating.  It doesn't do your customer any good if you catch a problem with a board being cut 2 mm too long, but the procurement screen for ordering new material is broken.  There is a mantra in aviation that in an emergency situation, 'First, fly the airplane'.  Don't get distracted by details and forget the big stuff.  There is plenty of material already published on techniques for determining what's most important to you and your customers, so I won't restate it here.

2) Restrict possible data entries and combinations.
While this is a development function based on lots of spec discussion and user input, you as the tester also have important input as a representative of the end user.  Sure, we all know to test edge cases and weird data inputs.  But we also need to slow down occasionally and think about whether we can or should allow the end user to enter that data if it doesn't make sense.  Speak up and suggest a restriction if you think one is needed.  If it's shot down (hopefully with an explanation) you've learned something about the data and at least planted the idea.  If it's accepted, and the user later decides they want that functionality, now it's a discrete feature to test.
Once the restrictions are in place, don't forget to test them.  More importantly, test data migration from versions without the restrictions to versions with the restriction to catch any data that has already strayed.
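As a rough sketch of that migration check, the Python below scans existing rows for values that would violate a newly added restriction.  The row layout and the 'positive whole number of millimetres' rule are hypothetical, chosen only to show the idea.

```python
def find_violations(rows, rule):
    """Return the rows that fail `rule`, a function taking a row and returning bool."""
    return [row for row in rows if not rule(row)]


def length_is_valid(row):
    """Hypothetical new restriction: length must be a positive whole number of mm."""
    value = row["length_mm"]
    return isinstance(value, int) and not isinstance(value, bool) and value > 0


# Invented legacy data entered before the restriction existed.
legacy_rows = [
    {"id": 1, "length_mm": 1200},
    {"id": 2, "length_mm": -5},
    {"id": 3, "length_mm": "12OO"},  # letter O typed instead of zero
]

strayed = find_violations(legacy_rows, length_is_valid)
print([row["id"] for row in strayed])
```

Anything this turns up is data the upgrade script has to handle deliberately rather than by accident.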

3) Keep expanding your regression test suite as defects are reported.
This is an easy one: you don't even have to find your own bugs!  As data problems are found and reported, try to add them to your suite of regression tests.  While automated tests are preferable for this, even a list of data combinations to try in manual testing is better than nothing.
The most important thing is to add the necessary data to whatever database you are using for testing, and use it.  Take the time to do a little exploratory testing around the defect scenario to see if you can catch any other issues revealed by that data combination.
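If you do automate it, one simple shape is a table of data combinations taken from reported defects, replayed against the code on every run.  This is only a sketch: the defect IDs, the values, and the normalize_values function standing in for the code under test are all invented.

```python
# Each entry records the data combination from a reported defect.
# Defect IDs and values are invented for illustration.
REGRESSION_CASES = [
    {"defect": "D-101", "values": ["10", "20"], "expected": [10, 20]},
    {"defect": "D-102", "values": ["10", "OAK"], "expected": [10, "OAK"]},
    {"defect": "D-103", "values": [], "expected": []},
]


def normalize_values(values):
    """Stand-in for the code under test: numeric strings become ints, the rest are kept."""
    result = []
    for value in values:
        try:
            result.append(int(value))
        except ValueError:
            result.append(value)
    return result


def run_regression(cases, func):
    """Return the defect IDs whose data no longer produces the expected result."""
    return [case["defect"] for case in cases if func(case["values"]) != case["expected"]]


failures = run_regression(REGRESSION_CASES, normalize_values)
print(failures)
```

Adding a newly reported problem is then just one more entry in the table.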

4) Exploratory testing.
This one can be difficult, especially if you've been looking at a lot of the same program functionality over and over again for years like I have.  The idea is to break out of the 'normal' success path testing mindset, and test the software in different ways and with different data combinations.  As I mentioned before, if it's allowed there's a chance someone will try it, so try to get away from the 'nobody would ever do that!' mindset for at least a few minutes.  Talk to customer support and see if any end users are known for pushing the boundaries of the software, then take a look at their data if you can.  Read industry publications, competitor news, or occasionally even your own corporate website to see what marketing is trying to tell people your software will do.

5) Use customer data.
This one overlaps with the exploratory testing point.  If you can get your hands on some representative customer databases that use a lot of the functionality your software offers, spend some time getting to know their data.  Run the test cases you're familiar with on a different data set, and see if they are handled differently.  Stop and think: "What mindset does this data represent?  Can I apply it back to my familiar data?"
If you have customer representatives who work directly with the customer, they can be a big help here.
Another test to try here is to upgrade several sets of customer data between software versions.  As seen in the example I first mentioned, this often finds data scenarios that the person writing the upgrade scripts didn't think of.

6) Pairwise testing
This is a particularly useful tool to have in your back pocket when dealing with large sets of data that can be combined.
Any time I have more than 3 discrete sets of data being combined into a single scenario, I consider this approach.
However, if the individual data sets are very large, and/or have interdependencies and restrictions, I find that it loses its usefulness.
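For small sets, the idea can be sketched in a few lines of Python.  This is a simple greedy approach, not what any particular pairwise tool does, and the option names are made up.  It enumerates every full combination first, so it is only practical when the sets are small, and it knows nothing about restrictions between values.

```python
from itertools import combinations, product


def pairwise_cases(parameters):
    """Greedily pick cases until every pair of values from two different parameters is covered.

    `parameters` maps a parameter name to its list of possible values.
    """
    names = list(parameters)
    # Every value pair, across every two parameters, that must appear in some case.
    uncovered = {
        ((a, va), (b, vb))
        for a, b in combinations(names, 2)
        for va in parameters[a]
        for vb in parameters[b]
    }
    all_cases = [dict(zip(names, values)) for values in product(*parameters.values())]

    def pairs_in(case):
        return {((a, case[a]), (b, case[b])) for a, b in combinations(names, 2)}

    chosen = []
    while uncovered:
        # Take the full combination that covers the most still-uncovered pairs.
        best = max(all_cases, key=lambda case: len(pairs_in(case) & uncovered))
        chosen.append(best)
        uncovered -= pairs_in(best)
    return chosen


# Hypothetical furniture options: 3 * 2 * 2 * 2 = 24 full combinations.
options = {
    "wood": ["oak", "maple", "pine"],
    "finish": ["matte", "gloss"],
    "hardware": ["brass", "steel"],
    "edge": ["square", "round"],
}

cases = pairwise_cases(options)
print(len(cases), "cases instead of 24")
```

Every pair of option values still shows up in at least one case, which is where a lot of combination bugs hide.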

7) Fuzzing and semi-random data sets
Don't have enough data?  Make some up!  Free data generation tools can be used to create data sets within certain parameters that you can use to test your application (e.g. addresses, names, integers, strings).  The TestComplete testing tool that I use also comes with its own data generator, or if you're handy with scripting you can write your own.
Fuzzing is a similar principle, but automatically applies large numbers of semi-random inputs to the program.
I must admit, though, while both of these can be useful for stress tests, I haven't had much luck in applying them to create realistic scenarios.
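If you go the scripting route, a home-grown generator can be as small as the Python sketch below.  The row layout and the mix of value kinds are assumptions for illustration.  The one design choice worth copying is the fixed seed, so that a data set that triggers a failure can be regenerated exactly.

```python
import random
import string


def random_feature_rows(count, seed):
    """Generate `count` semi-random feature rows; the same seed always gives the same rows."""
    rng = random.Random(seed)
    rows = []
    for i in range(count):
        # Mix well-formed values with awkward ones a customer might enter.
        kind = rng.choice(["integer", "alphanumeric", "empty", "long"])
        if kind == "integer":
            value = str(rng.randint(-1000, 100000))
        elif kind == "alphanumeric":
            alphabet = string.ascii_uppercase + string.digits
            value = "".join(rng.choices(alphabet, k=rng.randint(1, 12)))
        elif kind == "empty":
            value = ""
        else:
            value = "X" * rng.randint(200, 500)
        # Only five names, so the same name turns up with different kinds of value.
        rows.append({"id": i, "name": f"feature_{rng.randint(1, 5)}", "value": value})
    return rows


rows = random_feature_rows(1000, seed=2011)
print(len(rows), rows[0]["name"])
```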

Hopefully this has given you a few ideas to start with.  One thing I can guarantee you is that customers will continue to find ways to break your software that you could never have imagined.  Don't feel too bad about it.  Add each problem to your testing repertoire, and make up your mind that that's one bug that won't slip by you again!
