Welcome


Welcome to my Blog on Building the Smart Business. This blog looks at all the areas that need to be addressed so that companies can transform themselves into agile, event-driven, optimised businesses. To do this we need several building blocks.

  • Big Data Platforms and Complex Analytics
  • Data Governance and Enterprise Information Management
  • Master Data Management
  • Process oriented Operational BI via On-demand Analytics
  • Event Processing and Automated Decisioning Management
  • Collaborative, Social and Mobile BI
  • CPM Strategy Management
  • Cloud Computing

I will discuss all of these areas in my blogs and ask you to comment on how you are using these technologies in your organisation.

For more information on Intelligent Business Strategies and how we can help you, please click here


Data Life Cycle in A Big Data Environment

The arrival of Big Data has seen a more complex analytical landscape emerge, with multiple data stores beyond the traditional data warehouse. At the centre of this is Hadoop. However, other platforms like data warehouse appliances, NoSQL graph databases and stream processing platforms have also pushed their way into the analytical landscape. Now that these new platforms together form the analytical ecosystem, it is clear that managing data in a big data environment has become more challenging. It is not just because of the arrival of new kinds of data store. The characteristics of Big Data, like volume, variety and velocity, also make big data more challenging to manage. The arrival of data at very high rates, combined with the sheer volume of big data, means that certain data management activities need to be automated rather than manual. This is especially the case in data lifecycle management.

In any data life cycle, data is created, shared, maintained, archived, retained and deleted. This is also true in a big data environment. But when you add characteristics like volume, variety and velocity into the mix, the reality of managing big data throughout its lifecycle really challenges the capability of existing technology.

In the context of big data lifecycle management, a key question is: what is the role of Hadoop? Of course Hadoop could play several roles. Popular ones include:

  • Hadoop as a landing zone
  • Hadoop as a data refinery
  • Hadoop as a data hub
  • Hadoop as a Data Warehouse archive

Hadoop as a landing zone and Hadoop as a data refinery fit well into the data lifecycle between CREATE and SHARE. With big data, what happens between CREATE and SHARE is a challenge. Before data is shared it needs to be trusted. That means that data needs to be captured, profiled, cleaned, integrated and refined, and sensitive data identified and masked so that shared sensitive data is not visible to unauthorised users.

In the context of Hadoop as a landing zone, big data may arrive rapidly. Therefore technologies like stream processing are needed to filter data in motion so that only the data of interest is captured. Once captured, volume and velocity could easily make it impossible to profile captured data manually, as would be typical in a data warehouse staging area. Therefore automated data profiling and relationship discovery are needed so that attention is drawn quickly to data that needs to be cleaned. Data cleansing and integration also need to exploit the power of Hadoop MapReduce for performance and scalability of ETL processing in a big data environment.
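
As a minimal illustration of filtering data in motion so that only data of interest is captured, here is a sketch that reads a stream of JSON events from standard input and keeps only the event types we care about. The `event_type` field and the set of interesting events are assumptions made for the example, not part of any particular product.

```python
import json
import sys

# Only these event types are landed; everything else is dropped at the edge.
EVENTS_OF_INTEREST = {"purchase", "add_to_basket", "search"}

def filter_stream(lines):
    """Yield only the incoming events whose type we want to capture."""
    for line in lines:
        try:
            event = json.loads(line)
        except ValueError:
            continue  # malformed records are discarded rather than landed
        if isinstance(event, dict) and event.get("event_type") in EVENTS_OF_INTEREST:
            yield event

if __name__ == "__main__":
    for event in filter_stream(sys.stdin):
        sys.stdout.write(json.dumps(event) + "\n")
```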

Once big data is clean we can enter the data refinery, which is of course where we see the use of Hadoop as an analytical sandbox. Several analytical sandboxes may be created and exploratory analysis performed to identify high value data. At this point, auditing and security of access to big data need to be managed to monitor exactly what activities are being performed on the data, so that unauthorised access to analytical sandboxes and the data within them is prevented. If master data is brought into Hadoop during the refining process, to analyse big data in context, then it must be protected, and sensitive master data attributes like personal information need to be masked before they are brought into a data refinery to add context to exploratory analysis tasks.
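
Masking sensitive master data attributes before they enter a refinery can, in principle, be as simple as the sketch below, which replaces assumed sensitive columns in a CSV extract with a salted one-way hash. The column names, the salt handling and the hash-based masking approach are illustrative choices only.

```python
import csv
import hashlib
import sys

# Columns assumed to hold sensitive master data attributes (illustrative names).
SENSITIVE_COLUMNS = {"email", "national_id", "phone"}
SALT = "replace-with-a-managed-secret"

def mask(value: str) -> str:
    """Replace a sensitive value with a salted one-way hash (irreversible masking)."""
    return hashlib.sha256((SALT + value).encode("utf-8")).hexdigest()[:16]

if __name__ == "__main__":
    reader = csv.DictReader(sys.stdin)
    writer = csv.DictWriter(sys.stdout, fieldnames=reader.fieldnames)
    writer.writeheader()
    for row in reader:
        for column in SENSITIVE_COLUMNS & set(reader.fieldnames):
            if row.get(column):
                row[column] = mask(row[column])
        writer.writerow(row)
```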

Once data refining has been done, new data can be published for authorised consumption. That is when Hadoop takes on the role of a data hub. Data can be moved from the data hub into Hadoop HDFS, Hadoop Hive, data warehouses, data warehouse analytical appliances and NoSQL databases (e.g. graph databases) to be combined with other data for further analysis and reporting. At this point we need built-in data governance whereby business users can enter the data hub and subscribe to receive data sets and data subsets in a format they require. In this way even self-service access to big data is managed and governed.

Going further into the lifecycle, when data is no longer needed or used in data warehouses, we can archive it. Rather than archiving to tape and taking the data offline, Hadoop gives us the option to keep it online by archiving it to Hadoop. However, if sensitive data in a data warehouse is archived, we must make sure masking is applied before it is archived to protect it from unauthorised use. Similarly, if we archive data to Hadoop, we may need to archive the data from several databases (e.g. data warehouses, data marts and graph databases) to preserve integrity and make a clean archive.

Finally, there will be a time when we delete data. Even with storage being cheap, the idea that all data will be kept forever is just not realistic. We need policies to govern that, of course. Furthermore, similar to archiving, these policies need to be applied across the whole analytical ecosystem, with the added complexity that big data means these tasks need to be executed at scale. These are just some of the things to think about in governing and managing the big data life cycle. Join me on IBM’s Big Data Management Tweet chat on Wednesday April 16th at 12:00 noon EST to discuss this in more detail.
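
Applied at scale, a retention policy ultimately becomes an automated job along the lines of the sketch below, which walks a set of data areas and flags files older than their retention period. The paths and periods are invented, and against HDFS you would drive the same logic from file metadata via the hdfs CLI or WebHDFS rather than `os.walk`.

```python
import os
import time

# Illustrative policy: retention period (in days) per data area.
RETENTION_DAYS = {
    "/data/landing": 30,
    "/data/refined": 365,
    "/data/archive": 7 * 365,
}

def expired_files(root, max_age_days):
    """Yield files under root that are older than the retention period for that area."""
    cutoff = time.time() - max_age_days * 86400
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if os.path.getmtime(path) < cutoff:
                yield path

if __name__ == "__main__":
    for root, days in RETENTION_DAYS.items():
        for path in expired_files(root, days):
            print("would delete:", path)  # dry run; a real job would delete or archive
```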

 


Struggling with Your Big Data Strategy?

Two weeks ago I attended the popular Strata conference in Santa Clara, California where, frankly, the momentum behind Big Data was nothing short of unstoppable. 3,100 delegates poured into the Santa Clara Convention Centre to see a myriad of big data technologies. Things that stood out for me included the massive interest in the Apache Spark in-memory framework that runs on top of Hadoop 2.0 YARN, and the SQL-on-Hadoop wars that broke out, with every vendor claiming they were faster than everyone else. The momentum behind Spark was impressive, with vendors including Cloudera and Hortonworks now running Spark on their Hadoop distributions.

The tables below show the SQL-on-Hadoop initiatives.

[Image: SQLonHadoop – SQL-on-Hadoop initiatives]

Some of the sessions on SQL on Hadoop were a little disappointing as they focused far too much on query benchmarks rather than the challenges of using SQL to access complex data such as JSON data, text data or log data. Log data is of course very much in demand at present to provide insight into on-line behaviour. In addition, what about multiple concurrent users accessing Hadoop data via SQL? It is clear that the in-memory Shark on Spark (Hive on Spark) initiative coming out of AMPLab at UC Berkeley is looking to address this.
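
To illustrate the sort of SQL-over-complex-data question I would rather have seen covered, here is a small sketch using the Spark SQL Python API (a successor to the Shark work mentioned above). It assumes the `pyspark` package is available, and the HDFS path and the `page` and `status` fields are assumptions about the log layout, not a real data set.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("json-log-exploration").getOrCreate()

# Spark infers a schema from the semi-structured JSON, so nested log records
# become queryable with ordinary SQL.
logs = spark.read.json("hdfs:///data/landing/weblogs/")  # illustrative path
logs.createOrReplaceTempView("weblogs")

top_pages = spark.sql("""
    SELECT page, COUNT(*) AS hits
    FROM weblogs
    WHERE status = 200
    GROUP BY page
    ORDER BY hits DESC
    LIMIT 10
""")
top_pages.show()
```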

Pure-play Hadoop vendors Cloudera, Hortonworks and MapR were all out in force. In addition, there were new self-service data management tools like Paxata and Trifacta, who are aiming their products at data scientists and business analysts. This is fuelling a trend where users of these tools and self-service BI tools are now getting the power to clean and prepare their own data rather than using enterprise data management platforms from vendors like IBM, Informatica, SAP, SAS, Global IDs and Oracle. I have already covered this in my last blog. Also, new visualization vendors like Zoomdata dazzled everyone with their ‘Minority Report’ demo and virtual reality demos. Then of course there are the giants: IBM, Oracle, SAP, Microsoft. Microsoft has integrated Excel with its HDInsight Hadoop distribution and with HDInsight on Windows Azure. Meanwhile IBM’s recent Watson announcement shows Big Blue’s commitment to not just run analytics and BI tools against big data but to move beyond that into cognitive technologies on top of its big data platform, with the emergence of Watson Foundations, Watson Explorer, Watson Analytics and Watson Discovery Server.

With all this technology (apologies to other vendors not mentioned), it is not surprising that people are feeling somewhat overwhelmed when it comes to putting together a big data strategy. Of course it is not just about technology. Questions about business case, roles, skills, architecture, new technology components, data governance, best practices, integration with existing technology, pitfalls to avoid and much more all need to be answered. Therefore please join me on Twitter on Wednesday March 5th for IBM’s Big Data Tweetchat at 12:00 ET to discuss “Creating a Big Data Strategy”.

 


Is Self-Service BI Going to Drive a Truck through Enterprise Data Governance?

There is no doubt that today self-service BI tools have well and truly taken root in many business areas, with business analysts now in control of building their own reports and dashboards rather than waiting on IT to develop everything for them. Using data discovery and visualisation tools like Tableau, QlikView, Tibco Spotfire, MicroStrategy and others, business analysts can produce insights and publish them for information consumers to access via any device and subsequently act on. As functionality deepens in these tools, most vendors have added the ability to access multiple data sources so that business analysts can ‘blend’ data from multiple sources to answer specific business questions. Data blending is effectively lightweight data integration. However, this is just the start of it in my opinion. More and more functionality is being added to self-service BI tools to strengthen this data blending capability, and I can understand why. Even though most business users I speak to would prefer not to integrate data (much in the same way in which they would prefer not to write SQL), the fact of the matter is that the increasing number of data stores in most organizations is driving the need for them to integrate data from multiple data sources. It could be that they need data from multiple data warehouses, from personal and corporate data stores, from big data stores and from a data mart, or some other combination. Given this is the case, what does it mean in terms of impact on enterprise data governance? Let’s take a look.

When you look at most enterprise data governance initiatives, they are run by centralized IT organizations with the help of part-time data stewards scattered around the business. Most enterprise data governance initiatives start with a determined effort to standardize data. This is typically done by establishing common data definitions for core master data, reference data (e.g. code sets), transaction data and metrics. Master data and reference data in particular are regularly the starting point for enterprise data governance, because this data is so widely shared across many applications. Once a set of common data definitions has been defined (e.g. for customer data), the next step is to discover where the disparate data for each entity (e.g. customer) resides across the data landscape. This requires data and data relationship discovery to find all instances of the same data across the data landscape. Once the disparate data has been found, it becomes possible to map it back to the common data definitions defined earlier, profile it to determine its quality, and then define any rules needed to clean, transform and integrate that data to get it into a state that is fit for business use. Central IT organisations typically use a suite of data management tools to do this. Not only that, but all the business metadata associated with common data definitions and the technical metadata that defines data cleansing, transformation and integration is typically recorded in the data management platform’s metadata repository. People who are unsure of where data came from can then view that metadata lineage to see where a data item originated and how it was transformed.
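
As a rough illustration of the automated profiling step described above, the sketch below computes a couple of basic quality measures per column for a CSV extract piped into it. The measures are deliberately simple; a data management tool automates this at far greater depth.

```python
import csv
import sys

def profile(rows, columns):
    """Basic per-column profile: null percentage and distinct-value count."""
    total = 0
    nulls = {c: 0 for c in columns}
    distinct = {c: set() for c in columns}
    for row in rows:
        total += 1
        for c in columns:
            value = (row.get(c) or "").strip()
            if not value:
                nulls[c] += 1
            else:
                distinct[c].add(value)
    return {
        c: {
            "null_pct": round(100.0 * nulls[c] / total, 2) if total else 0.0,
            "distinct": len(distinct[c]),
        }
        for c in columns
    }

if __name__ == "__main__":
    reader = csv.DictReader(sys.stdin)
    columns = reader.fieldnames or []
    for column, stats in profile(reader, columns).items():
        print(f"{column:20s} null%={stats['null_pct']:6.2f} distinct={stats['distinct']}")
```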

Given this is the case, what then is the impact of self-service BI, given that business users are now in a position to define their own data names and integrate data on their own using a completely different type of tool from those provided in a data management platform? Well, it is pretty clear that even if central IT do a great job of enterprise data governance, the impact of self-service BI is that it is moving the goal posts. If self-service BI is ungoverned it could easily lead to data chaos, with every user creating reports and dashboards with their own personal data names and every user doing their own personal data cleansing and integration. Inconsistency could reign and destroy everything an enterprise data governance initiative has worked for. So what can be done? Are we about to descend into data chaos? Well, first of all, self-service BI tools are now starting to record a user’s actions on data while doing data blending, to understand exactly how he or she has manipulated the data. That is of course a good thing. In other words, self-service BI tools are starting to record metadata lineage – but in their own repository and not that of a data management platform. Reports on lineage are also available in some self-service BI tools already. A good example here is QlikView, which does support metadata lineage and can report on what has happened to data. However, there appear to be no standards here for metadata import from existing data management platform repositories to re-use data definitions and transformations (other than XMI for basic interchange, and even then there is no guarantee that interchange occurs). Other self-service BI tool users may be able to re-use data transformations defined by a different user but, as far as I can see, this is only possible when the same tool is being used by all users. The problem here is that there appears to be no way to plug self-service BI into enterprise data governance initiatives, and certainly no way to resolve conflicts if the same data is transformed and integrated in different ways by central IT on the one hand, using a data management toolset, and by business users on the other, using a self-service BI tool. If the self-service BI tool and the data management platform are from the same vendor you would like to think they would share the same repository, but I would strongly recommend you check this as there is no guarantee.
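
Conceptually, the lineage recording that self-service BI tools are starting to do amounts to something like the sketch below: every blending action is logged against the data set it touched. The class, dataset name and logged operations are hypothetical; real tools keep this in their own repositories, which is exactly the interoperability problem discussed above.

```python
import datetime
import json

class LineageLog:
    """Record each transformation step applied to a data set during blending."""

    def __init__(self, dataset_name):
        self.dataset_name = dataset_name
        self.steps = []

    def record(self, operation, detail):
        self.steps.append({
            "dataset": self.dataset_name,
            "operation": operation,
            "detail": detail,
            "at": datetime.datetime.utcnow().isoformat() + "Z",
        })

    def export(self):
        # In practice this would be pushed to a shared metadata repository,
        # not just serialised locally.
        return json.dumps(self.steps, indent=2)

# Usage: every blending action is logged as it happens.
lineage = LineageLog("customer_blend")
lineage.record("join", "orders joined to customer master on customer_id")
lineage.record("filter", "removed rows where order_status = 'cancelled'")
print(lineage.export())
```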

The other issue is how to create common data names. I see no way to drive consistency across both self-service BI tools (where reports and dashboards are produced) and centralised data governance initiatives that use a business glossary, especially if the two technologies are not from the same vendor. Again, even if they are, I would strongly recommend you check that integration between a vendor’s self-service BI tool and the same vendor’s data management tool suite is in place.

The third point to note is that BI development is now happening ‘outside in’, i.e. in the business first and then escalated up into the enterprise for enterprise-wide deployment. I have no issue with this approach, but if this is the case then enterprise data governance initiatives starting at the centre and driving re-use out to the business units are diametrically opposed to what is happening in self-service BI. Ideally what we need is data governance from both ends and the ability to share common data definitions, to get re-use of data names and data definitions, as well as common data transformation and integration rules, to get re-use across both environments. However, in reality it is not yet happening, because stand-alone self-service BI tool vendors have not implemented metadata integration across BI and heterogeneous data management tool suites. Today I regrettably have to say that, in my honest opinion, it is not there. This lack of integration spells only one thing: re-invention rather than re-use. Self-service BI tool vendors are determined to give all power to the business users, so like it or not, self-service data integration is here to stay. And while these tool vendors are rightly recording what every user does to data to provide lineage, there are no metadata sharing standards between heterogeneous data management platforms (e.g. Actian (formerly Pervasive), Global IDs, IBM InfoSphere, Informatica, Oracle, SAP Business Objects, SAS (formerly Dataflux)) and heterogeneous self-service BI tools. If this is the case, it is pretty obvious that re-use of common data definitions and re-use of transformations is not going to happen across both environments. The only chance is if both the data management platform and the self-service BI tools come out of the same stable, but if you are looking for heterogeneous BI tool integration then it is not guaranteed as far as I can see. All I can recommend right now is, if you are tackling enterprise data governance, go and see your business users and educate them on the importance of data governance to prevent chaos until we get metadata sharing across heterogeneous self-service BI tools and data management platforms. If you want to learn more about this, please join me for my Enterprise Information Management masterclass in London on 26-28 February 2014.


Agile Governance in the form of Automatic Discovery and Protection Is Needed to Create Confidence in a Big Data Environment

The arrival of Big Data is having a dramatic impact on many organizations in terms of deepening insight. However, it also has an impact on the enterprise that drives a need for data governance. Big Data introduces:

  • New sources of information
  • Data in motion as well as additional data at rest
  • Multiple analytical data stores in a more complex analytical environment (with some of these data stores possibly being in the cloud)
  • Big Data Platform specific storage e.g. Hadoop Distributed File System (HDFS), HBase, Analytical RDBMS Columnar Data Store, or a NoSQL Graph database
  • New analytical workloads
  • Sandboxes for data scientists to conduct exploratory analytics
  • New tools and applications to access and analyse big data
  • More complex information management in a big data environment to
    • Supply data to multiple analytical data stores
    • Move data between big data analytical systems and Data Warehouses
    • Move core data into big data analytical environment from core systems to facilitate analysis in context e.g. move core transaction data into a Graph database to do graph analysis on fraudulent payments

The data landscape is therefore becoming more complex. There are more data stores, and each big data analytical platform may have a different way to store data, sometimes with no standards.

Despite this more complex environment there is still a need to protect data in a Big Data environment, but doing that is made more difficult by the new data characteristics of volume, variety and velocity. Rich sets of structured and multi-structured data brought into a big data store for analysis may easily attract cyber criminals if sensitive data is included. Data sources like customer master data, location sensor data from smart phones, customer interaction data, on-line transaction data, e-commerce logs and web logs may all be brought into Hadoop for batch analytical reasons. Security around this kind of big data may therefore be an issue. In such a vast sea of data we need technology to automatically discover and protect sensitive data. It is also not just the data that needs to be protected. Access to Big Data also needs to be managed, whether that be data in analytical sandboxes, in distributed file systems, in NoSQL DBMSs or in analytical databases. Knowing that information remains protected even in this environment is important. In MapReduce applications, low level programming APIs need to be monitored to control access to sensitive data. Also, access control is needed to govern people with new analytical tools so that only authorised users can access sensitive data in analytical data stores. Compliance may dictate that some data streams, files and file blocks holding sensitive data are also protected, e.g. by encrypting and masking sensitive data.
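
The ‘automatically discover sensitive data’ requirement boils down to scanning records against a rule set, as in the minimal sketch below. The regular expressions and field names are illustrative only and far simpler than a production classifier would use.

```python
import re

# Illustrative patterns; a production scanner would use a much richer rule set.
PATTERNS = {
    "email": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"),
    "uk_phone": re.compile(r"(?:\+44|\b0)\d{9,10}\b"),
    "card_number": re.compile(r"\b(?:\d[ -]?){13,16}\b"),
}

def scan_record(record):
    """Return the sensitive-data categories detected in each field of a record."""
    findings = {}
    for field, value in record.items():
        hits = [name for name, pattern in PATTERNS.items() if pattern.search(str(value))]
        if hits:
            findings[field] = hits
    return findings

if __name__ == "__main__":
    sample = {"comment": "call me on 07700900123 or mail jo@example.com", "amount": "42.50"}
    print(scan_record(sample))  # -> {'comment': ['email', 'uk_phone']}
```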

In addition a new type of user has emerged – the data scientist. Data scientists are highly skilled power users who need a secure environment where they can explore un-modelled multi-structured data and/or conduct complex analyses on large amounts of structured data.

However, sandbox creation and access need to be controlled, as does the data going into and coming out of these sandboxes.

If we can use software technology to automatically discover and protect big data, then confidence in Big Data will grow.

I shall be discussing Big Data governance in more detail at my upcoming Big Data Multi-Platform Analytics class in London on October 17-18. Please register here if you want to attend.

 

 


ELT Processing on Hadoop Will Boost Confidence in Big Data Quality

In my last blog I looked at big data governance and how it produces confidence in the structured and multi-structured data that data scientists want to analyse. I would like to continue that theme in this blog by looking at what is happening in the area of Big Data governance in a little more detail. Over the last two years we have seen several data management software vendors extend their products to support Big Data platforms like Hadoop. Initially this started out as supporting Hadoop both as a target to provision data for exploratory analysis and as a source from which to move derived insights into data warehouses. However, several vendors have since evolved their data cleansing and integration tools to exploit Hadoop by implementing ELT processing on that platform, much like they did on data warehouse systems. Scalability and cost are major reasons for this. This has prompted some organizations to consider loading all data into a Hadoop cluster for ELT processing via generated 3GL, Hive or Pig ELT jobs running natively on a low-cost Hadoop cluster (see below).

Several vendors have now added new tools to parse multi-structured data, as well as MapReduce transforms to run data cleansing and data integration ELT processing on large volumes of multi-structured data on Hadoop.
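
To make the ELT-on-Hadoop idea concrete, here is a minimal sketch of a cleansing and de-duplication step written as a Hadoop Streaming job in Python. The three-column customer file layout, the standardisation rules and the paths are illustrative assumptions, not a description of any vendor's generated code.

```python
#!/usr/bin/env python3
"""A single Hadoop Streaming script used for both phases:
run as `cleanse.py map` for the map phase and `cleanse.py reduce` for the reduce phase."""
import sys

def map_phase(lines):
    # Standardise each raw record and key it by customer_id.
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) != 3:
            continue  # drop structurally invalid records
        customer_id, email, country = fields
        yield f"{customer_id}\t{email.strip().lower()}\t{country.strip().upper()}"

def reduce_phase(lines):
    # Input arrives grouped by key, so emit only the first record per customer_id.
    last_key = None
    for line in lines:
        key = line.split("\t", 1)[0]
        if key != last_key:
            yield line.rstrip("\n")
            last_key = key

if __name__ == "__main__":
    phase = map_phase if sys.argv[1] == "map" else reduce_phase
    for record in phase(sys.stdin):
        print(record)
```

Submitted with the Hadoop Streaming jar shipped with your distribution, the invocation is roughly `hadoop jar $HADOOP_STREAMING_JAR -files cleanse.py -mapper "cleanse.py map" -reducer "cleanse.py reduce" -input /raw/customers -output /clean/customers`, with the input and output paths as placeholders.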

Extending data governance and data management platforms to exploit scalable Big Data platforms not only allows customers to get more out of existing investment but also improves confidence in Big Data among data scientists and business analysts who want to undertake analysis of that data to produce valuable new insight.  Of course developers could do this themselves using Hadoop HDFS APIs. However, given that many data cleansing and integration tools also provide support for metadata lineage, it means that data scientists and business analysts working in a Big Data environment have access to metadata to allow them to see how Big Data has been cleaned and transformed en route to making that data available for exploratory analysis. This kind of capability just breeds confidence in the use of data. In addition there is nothing to stop Data Scientists making use of these tools by exploiting pre-built components and templates. Having workflow based data capture, preparation and even analytical tools available in a Big Data environment also improves productivity when programming skills are lacking.

I shall be discussing Big Data governance in more detail at my upcoming Big Data Multi-Platform Analytics class in London on October 17-18. Please register here if you want to attend.


Exploratory Analytics Vs Big Data Governance – Freedom Vs Control or Freedom with Confidence?

Exploratory analytics is at the heart of most Big Data projects. It involves loading data from multi-structured sources into ‘sandboxes’ for exploration and investigative analysis, often by skilled data scientists, with the intent of producing new insights.

Data being loaded into these sandboxes may include structured, modeled data from existing OLTP systems, data warehouses and master data management systems as well as un-modeled data from internal and external data sources.  This could include customer interaction data, web log data, social network interactions, sensor data, documents, rich media content and more.

Because Big Data is often un-modeled, the schema of this data is not known – it is schema-less. The argument is that data scientists need the freedom to explore the data in any way they like: to acquire it, prepare it, analyse it and visualize it.

Yet Big Data needs data governance to protect sensitive information made available in these exploratory environments. Given that governance imposes control and accountability, how can the two seemingly opposing forces of freedom and control co-exist in a Big Data analytical environment? Is this not a tug of war? Is Big Data governance a straitjacket for data scientists? How can the freedom to conduct exploratory analysis proceed if the data being analysed is subject to governance policies and processes that need to be applied?

The answer is obvious. Confidence. Big Data Governance is not about restricting data scientists from doing exploratory analysis. It is about extending the reach of data management and data governance technologies from the traditional structured data world into a Big Data environment to:

  • Protect sensitive data brought into this environment
  • Control who has access to Big Data files, tables and sandboxes
  • Monitor data scientist and application activities in conformance with regulatory and legislative obligations
  • Provide audit trails
  • Provide data relationship discovery capability
  • Improve the quality of Big Data before analysis occurs
  • Integrate data from traditional and Big Data sources before analysis occurs
  • Provide business metadata in a Big Data environment to data scientists and business analysts
  • Provide the ability to assign new business data definitions and descriptions to newly discovered insights produced by data scientists before moving them into traditional data warehouses and data marts
  • Provide metadata lineage in a Big Data environment to data scientists and business analysts
  • Handle Big Data lifecycle management

All of this is about raising the bar in quality and confidence. Having high quality data before exploratory analysis takes place improves confidence in the data and also in the results. Therefore Big Data governance is not an opposing force to free-form exploratory analytics. On the contrary, it fuels confidence in Big Data analytical environments.

I shall be discussing Big Data governance in more detail at my upcoming Big Data Multi-Platform Analytics class in London on October 17-18. Please register here if you want to attend.


Big Data – Relational Opens Its Mouth – Is It Going To Consume Hadoop?

Back in September last year, I presented at BiG Data London (the largest Big Data interest group in Europe), looking at multi-platform Big Data analytics. In that session I looked at stand-alone platforms for analytical workloads and asked the question “Is it going to stay that way?”. Would Hadoop analytical workloads remain separate from graph analytics and from complex analysis of structured data, in addition to the traditional data warehouse? The answer in my eyes was of course a resounding no. What I observed was that integration was occurring across those platforms to create a single analytical ecosystem, with enterprise data management moving data into and between platforms, and in addition we had to hide the complexity from the users. One way of doing that is data virtualisation, through connectivity to different types of relational and NoSQL data sources. However, for me data virtualisation is not enough if it doesn’t also come with optimisation, and to be fair to vendors like Cirro, Composite, Denodo and others, they have been adding optimisation to their products for some time. The point about this is that if you want to connect to a mix of NoSQL DBMSs, Hadoop and analytical RDBMSs, as well as data warehouses, on-line transaction processing systems and other data, then you very quickly start to need the ability to know where the data is in the underlying systems. A global catalog is needed so that the software knows it needs to invoke underlying MapReduce jobs to get at data in Hadoop HDFS, or that it can access the data directly, bypassing MapReduce, via Impala for example. The point here though is that the user is still shielded from multiple underlying data sources and just issues SQL – a relational interface.

However, the next option I looked at was the relational DBMS itself. Step by step over the years, relational DBMSs have added functionality to fend off XML DBMSs – IBM DB2, for example, can store native XML in the database, with no need to shred it and stitch it back together. Oracle and all other RDBMSs added user defined functions to fend off object DBMSs and so put paid to them also. Graph databases are an emerging NoSQL DBMS, and now IBM DB2 and SAP HANA have added graph stores built into the relational DBMS. I asked the question then: “Is relational going to consume Hadoop?” It got one heck of a reaction, opening up discussion in the break. Well, let’s look further.

Teradata acquired Aster Data and now integrates with Hadoop via SQL-H to run analytics there, or to bring that data into the Teradata Aster Big Analytics Appliance and analyse it there using SQL MapReduce functions. Hadoop vendors are adding SQL functionality. Hive was the first initiative. Since then we have had Hadapt, then Cloudera announced Impala, and just recently Hortonworks announced Stinger to dramatically speed up Hive.

Then came a new surge from the RDBMS vendors with the rise and rise of external table functions. Microsoft announced Polybase last November at its PASS conference. A really good article by Andrew Brust covers how Microsoft SQL Server 2012 Parallel Data Warehouse (PDW) will use Polybase to get directly at HDFS data in Microsoft’s HDInsight Hadoop distribution (its port of Hortonworks), bypassing MapReduce. So SQL queries come into PDW and it accesses Hadoop data in HDInsight using Polybase (which will be released in stages).

And now today, EMC GreenPlum announced Pivotal HD, which goes the whole hog and pushes the GreenPlum relational DBMS engine right into Hadoop, directly on top of the HDFS file system, so its new Pivotal HD Hadoop distribution has a relational engine bolted right into it. MapReduce development can still occur, and external table functions in the GreenPlum database can invoke MapReduce jobs in Hadoop, all inside the Pivotal HD cluster, ultimately with a GreenPlum MPP node instance lining up on every Hadoop data node. In short, the GreenPlum DBMS will use Hadoop HDFS as a data store. That of course presents the challenge of catering for all the file formats that can be stored in HDFS, but it is clear that EMC GreenPlum is not the only vendor taking on that challenge.

In this case, as in the case of Microsoft (and the other major RDBMS vendors), the trend is obvious. We are seeing a new generation of optimizer – a cross-platform optimizer that can figure out the best place to run an analytical query workload, or part of an analytical query, so that it exploits the RDBMS engine and/or the Hadoop cluster, or both, to the max. That optimizer is going inside the RDBMS as far as I can see, and the question will be what part of the execution plan runs in the relational engine accessing tabular data and what part pushes analytics down right into HDFS. It is already evident that MapReduce is getting bypassed: Impala does it, Polybase is going to do it, and clearly EMC GreenPlum, IBM and Oracle will too, while leaving the option to still run MapReduce jobs. We are in transition as the Big Data world collides with the traditional one, and RDBMSs push right down onto every Hadoop node to get parallel data movement between every Hadoop node and every MPP RDBMS node as queries execute. RDBMSs are going to pull HDFS data in parallel off the data nodes in a Hadoop cluster and/or push down query operators into Hadoop to exploit the full power of the Hadoop cluster, and so push data back into the relational engine. It seems that relational is going as close to Hadoop data as possible. Meanwhile everybody is beating a trail to the doors of self-service BI vendors like Tableau to simplify access to the whole platform. In this new set-up, data movement is going to have to be lightning fast! So where does ETL go? Well, into the cluster with everything else, right? Move it in parallel and exploit the power to the max.

Whether it be data virtualization with a SQL or web service interface, or an MPP RDBMS, one thing is clear. Just running connectors between an RDBMS and Hadoop (or any other NoSQL DBMS for that matter) is not where this ends. These technologies are going nose to nose, with tight integration in a whole new massively parallel engine and a next-generation cross-platform optimizer to go with it. It’s not all about relational. Big Data has brought a whole new generation of technology and spawned a major phase of transition. It’s all going to be hidden from the user as relational opens its mouth and samples the best this new generation of technology can feed it. Who said relational was dead?


Getting Started With MDM

Below are answers to four key questions that often get asked when starting MDM projects. I hope you find them useful.

1. What are the factors that trigger a company’s Master data management initiative?

  • The need to improve processes
  • The need to shift focus from product to customer orientation
  • The need to get control of expenditure with suppliers in procurement (Supplier and Materials master data is particularly important)
  • The need to improve accuracy of reporting for financial position when they have multiple ERP instances and multiple charts of accounts

2. What kinds of industries are quick to recognise the value of data governance & information related initiatives?

  • Process oriented businesses like manufacturing and pharmaceuticals will see its value quickly.
  • Financial services moving from product oriented to customer oriented risk management will also be receptive.
  • Insurance is having it forced upon them with Solvency II EU legislation.
  • Investment banks need customer master data to reduce risk and both customer and securities master data to improve process execution from Trade to Settlement to Custody

3. Why are companies so slow to start Information management, data governance and MDM initiatives?

A lack of basic understanding of core master and transaction data and where it is used in the business. This, plus insufficient understanding of how core business processes work and how these processes cut across multiple departments and applications, means that people don’t understand the impact of bad or inconsistent data. IT in particular often has very limited understanding of business processes and therefore cannot see how a lack of information management impacts business performance. As a result, they find it difficult to create a business case. For these reasons they do not see how data problems can impact:

  1. Operational costs – data defects increase the cost of operating
  2. Speed of process execution – data defects slow down process execution
    • This can impact customers if they are waiting on a product
    • It can also make it difficult to scale the business without imposing high operational costs
  3. Decision making
    • Data defects impact the timeliness of decisions, or the ability to make a decision at all
    • Data defects impact the accuracy of decisions
    • Data defects may mean event patterns that require action are not seen
  4. Reporting
    • Data defects cause reconciliation problems
    • Inability to see across the value chain
    • Inability to report on financial performance
  5. Risk management
    • Data defects can increase risk if risks cannot be identified due to lack of availability or accuracy of information
  6. Compliance
    • Data security breaches cause brand damage and can lose customers
    • Regulatory reporting errors can result in penalties
    • Damage to the share price can impact executive pay

For example, customer master data is needed in sales, marketing, service, finance and distribution. It is not just a CRM problem. In addition, IT needs to learn more about the business to help build a business case. I say: “Follow your processes from end to end and see how they currently work.” This teaches you where data governance and MDM can make a difference and the business impact they can have.

4. What are the factors that cause failure or delays in an MDM initiative?

  1. Lack of basic understanding of:
    • How core business processes work
    • Core master data entities used in their business
    • Where the master data is located (i.e. what operational and BI systems)
    • Who currently maintains it
    • How it flows across applications
    • How it is synchronised
    • Impact on business performance from poor master data
  2. Inability to recognize that master data is not owned by an application and should not be associated with just one application
  3. No business ownership of master data to govern it
  4. No data governance control board and no Chief Data Officer/Architect

Big Data Analytics – A Rapidly Emerging Market

Last week in London I spoke at the IRM Data Warehousing and Business Intelligence conference on a variety of topics. One of these was Big Data, which I looked at in the context of analytical processing. There is no question that the hype around this topic is reaching fever pitch, so I thought I would try to put some order on it.

First, like many other authors in this space, I need to define Big Data in the context of analytical processing to make clear what we are talking about. Big Data is a marketing term, and not the best of terms at that. A new reader in this market may well assume that this is purely about data volumes. Actually it is about being able to solve business problems that we could not solve before. Big data can, and more often than not does, include a variety of ‘weird’ data types. In that sense big data can be structured or poly-structured (where poly in this context means many). The former would include high volume transaction data such as call data records in telcos, retail transaction data and pharmaceutical drug test data. Poly-structured data is more difficult to process and includes semi-structured data like XML and HTML, and unstructured data like text, images, rich media etc. Graph data is also a candidate.

From the experiences I have had working in this area to date, I would say that web data, social network data and sensor data are emerging as very popular types of data in big data analytical projects. Web data includes web logs and e-commerce logs, such as those generated by on-line gaming, and on-line advertising data. Social network data would include Twitter data, blogs etc. These are examples of interaction data, which has grown significantly over recent years. Sensor data is machine-generated data from an ‘Internet of Things’. In my opinion we have only seen the beginning of it, as much of it remains un-captured. RFIDs are probably the most written about of sensors. However, these days we have sensors to measure temperature, light, movement, vibration, location, airflow, liquid flow, pressure and much more. There is no doubt that sensor data is on the increase, and in my opinion it will dwarf pretty well everything else in terms of volume. Telcos, utilities, manufacturing, insurance, airlines, oil and gas, pharmaceuticals, cities, logistics, facilities management and retail are all jumping on the opportunity to use sensor data to ‘switch on the lights’ in parts of the business where they have had no visibility before. Sensor data is massive, but we don’t want it all – it is the variance we are interested in (see the small sketch after the list below). Many Big Data analytical applications are, or will be, emerging on the back of sensor data. These include analytical applications for use in:

  • Supply chain optimisation
  • Energy optimisation via sustainability analytics
  • Asset management
  • Location based advertising
  • Grid health monitoring
  • Fraud
  • Smart metering
  • Traffic optimisation
  • Etc., etc.
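
As promised above, here is a tiny sketch of the ‘keep only the variance’ idea for sensor data: readings are dropped unless they differ from the last kept value by more than a threshold. The threshold and the sample readings are made up for illustration.

```python
def significant_changes(readings, threshold=0.5):
    """Keep only readings that differ from the last kept value by more than threshold."""
    last_kept = None
    for timestamp, value in readings:
        if last_kept is None or abs(value - last_kept) > threshold:
            last_kept = value
            yield timestamp, value

# A flat-lining temperature sensor produces almost nothing worth storing.
sample = [(0, 20.1), (1, 20.1), (2, 20.2), (3, 23.4), (4, 23.5), (5, 20.0)]
print(list(significant_changes(sample)))  # -> [(0, 20.1), (3, 23.4), (5, 20.0)]
```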

Text, as I already mentioned, is also a prime candidate for big data analytical processing. Sentiment analysis, case management and competitor analysis are just a few examples of popular types of analysis on textual data. Data sources like Twitter are obvious candidates, but tweet stream data suffers from data quality problems that still have to be handled, even in a big data environment. How many times do you see spelling mistakes in tweets, for example?

There is a lot going on that is of interest to business in big data but while all of it offers potential return on investment, it is also increasing complexity. New types of data are being captured from internal and external data sources, there is an increasing requirement for faster data capture, more complex types of analysis are now in demand and new algorithms and tools are appearing to help us do this.

There are several reasons why big data is attractive to business. Perhaps for the first time, entire data sets can now be analysed and not just subsets. This is now a feasible option whereas it was not before. So it is making enterprises ask: can we go down a level of detail? Is it worth it? To many it most certainly is. Even a 1% improvement brought about by analysing much more detailed data is significant for many large enterprises and well worth doing. Also, schema-variant data can now be analysed for the first time, which could add a lot of valuable insight to that offered up by traditional BI systems. Think of an insurance company, for example. Any insurer whose business primarily comes from a broker network will receive much of its data in non-standard document formats. Only a small percentage of that data finds its way into underwriting transaction processing systems, while much of the valuable insight is left in the documents. Being able to analyse all of the data in these documents could offer up far more business value and could improve risk management and loss ratios.

At the same time there are inhibitors to big data analysis. These include finding skilled people and a real lack of understanding around when to use Hadoop versus an analytical RDBMS versus a NoSQL DBMS. On the skills front, there is no question that the developers involved in Big Data projects are absolutely NOT your traditional DW/BI developers. Big Data developers are primarily programmers – not a skill often seen in a BI team. Java programmers are often seen at big data meet-ups. In addition, the analysis is primarily batch-oriented, with map/reduce programs being run and chained together using scripting languages like Pig Latin and JAQL (if you use the Hadoop stack, that is).

Challenges with Big Data

There is no question that big data offers up challenges. These include challenges in the areas of:

  • Big data capture
  • Big data transformation and integration
  • Big data storage – where do you put it and what are the options?
  • Loading big data
  • Analysing big data

Over this and my next few blogs we will look at these challenges. Looking at the first one, big data capture, the issues are latency and scalability. Low latency calls for techniques like change data capture and micro-batches. However, I think it is fair to say that if Hadoop is chosen as the analytical platform, it is not geared up for very low latency. Very low latency would lean towards stream processing as a big data technology, which I will address in another blog. Scaling data integration to handle Big Data can be tackled in a number of ways. You can use DI software that implements ELT processing, i.e. exploits the parallel processing power of an underlying MPP-based analytical database. You can make use of data integration software that has been rewritten to exploit multi-core parallelism (e.g. Pervasive DataRush). Alternatively, you can use data integration accelerators like Syncsort DMExpress, or exploit Hadoop Map/Reduce from within data integration jobs (e.g. Pentaho Data Integrator). Or you could use specialist software like the Scribe log aggregation software (originally written by Facebook). Vendors like Informatica have also announced a new HParser to help with parsing data in a Hadoop environment.
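
As a small sketch of the micro-batch idea mentioned above, the generator below groups an incoming record stream into batches by size or elapsed time, so that downstream loading runs every few seconds rather than record by record. The batch limits are arbitrary, and a real pipeline would write each batch to HDFS or a staging area rather than just reporting its size.

```python
import sys
import time

def micro_batches(lines, max_records=1000, max_seconds=5.0):
    """Group an incoming record stream into small batches for periodic loading."""
    batch, started = [], time.monotonic()
    for line in lines:
        batch.append(line)
        if len(batch) >= max_records or time.monotonic() - started >= max_seconds:
            yield batch
            batch, started = [], time.monotonic()
    if batch:
        yield batch  # flush whatever is left when the stream ends

if __name__ == "__main__":
    for batch in micro_batches(sys.stdin):
        # In a real pipeline each batch would be written to HDFS or staged for load.
        sys.stderr.write(f"flushing batch of {len(batch)} records\n")
```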

With respect to storing data, there are a number of storage options for analysing Big Data, ranging from analytical relational DBMSs to Hadoop and NoSQL databases.

Let’s dispel a myth right away. The idea that relational database technology cannot be used as a DBMS option for big data analytical processing is plain nonsense. Any analyst opinion claiming that should be ignored. Teradata, ExaSol, ParAccel, HP Vertica and IBM Netezza are all classic examples of analytical RDBMSs that can scale to handle big data applications, with some of these vendors having customers in the Petabyte club. Improvements such as solid state disk, columnar data, in-database analytics and in-memory processing have all helped analytical RDBMSs scale to greater heights. So it is an option for a big data analytical project, perhaps more so with structured data.

Hadoop is an analytical big data storage option that has often been associated more with poly-structured data. Text is a common candidate. NoSQL databases like the Neo4j or InfiniteGraph graph databases are candidates, particularly in the area of social network influencer analysis. So it depends on what you are analysing.

Going back to Hadoop, the stack includes HDFS – a distributed file system that partitions large files across multiple machines for high-throughput access to application data. It allows us to exploit thousands of servers for massively parallel processing, which can be rented on a public cloud if need be. To exploit the power of Hadoop, developers code programs using a programming framework known as Map/Reduce. These programs run in batch to perform analysis and exploit the power of thousands of servers in a shared-nothing architecture. Execution is done in two stages: Map and Reduce. Mapping refers to the process of breaking a large file into manageable chunks that can be processed in parallel. Reduce then processes the data to produce results (there is a small illustrative sketch after the list below). Hadoop Map/Reduce is therefore NOT a good match where:

  • Low latency is critical for accessing data
  • Processing a small subset of the data within a large data set
  • Real-time processing of data that must be immediately processed
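
For readers new to the model, here is a toy, in-memory illustration of the two stages just described: map processes each chunk of input independently and emits key/value pairs, and reduce groups and aggregates them. Real Hadoop Map/Reduce distributes exactly this pattern across a cluster; the word-count example is the customary illustration, not anything specific to the products mentioned here.

```python
from collections import defaultdict
from itertools import chain

def map_phase(chunk):
    """Map: each chunk of the input is processed independently, emitting key/value pairs."""
    return [(word.lower(), 1) for line in chunk for word in line.split()]

def reduce_phase(pairs):
    """Reduce: pairs are grouped by key and aggregated to produce the result."""
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return dict(totals)

chunks = [["big data needs governance"], ["big data needs scale", "data needs trust"]]
mapped = chain.from_iterable(map_phase(c) for c in chunks)  # chunks could run in parallel
print(reduce_phase(mapped))
# -> {'big': 2, 'data': 3, 'needs': 3, 'governance': 1, 'scale': 1, 'trust': 1}
```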

Also, Hadoop is not normally an RDBMS competitor either. On the contrary, it expands the opportunity to work with a broader range of content, and so Big Data analytical processing conducted on Hadoop distributions is often upstream from traditional DW/BI systems. The insight derived from that processing then often finds its way into a DW/BI system. There are a number of Hadoop distributions out there, including Cloudera, EMC GreenPlum HD (a resell of MapR), Hortonworks, IBM InfoSphere BigInsights, MapR and Oracle Big Data Appliance. Hadoop is still an immature space, with vendors like ZettaSet bolstering the management of this kind of environment. To appeal to the SQL developer community, Hive was created with a SQL-like query language. In addition, Mahout supports a lot of analytics that can be used in Map/Reduce programs. It is an exciting space but by no means a panacea. Vendors such as IBM, Informatica, Radoop, Pervasive (TurboRush for Hive and DataRush for Map/Reduce), Hadapt, Syncsort (DMExpress for Hadoop Acceleration), Oracle and many others are all trying to gain competitive advantage by adding value to it. Some enhancements appeal more to Map/Reduce developers (e.g. the Teradata, IBM Netezza and HP Vertica connectors to Cloudera) and some to SQL developers (e.g. Teradata Aster Data SQL Map/Reduce, Hive). One thing is sure – both need to be accommodated.

Next time around I’ll discuss analysing big data in more detail. Look out for that, and if you need help with a Big Data strategy feel free to contact me.


The Two Sides of Collaborative BI

While there is a lot of hype around collaborative BI today, the concept is not new. First attempts at introducing collaborative functionality into BI environments happened as far back as eight years ago or more, when vendors of Corporate Performance Management (CPM) products in particular added collaborative functionality to their products to allow users to annotate scorecards and comment on performance measures. In addition, being able to email links to reports also appeared. While a lot was marketed about these kinds of features, they only achieved limited success. A key reason for this, in my opinion, was that collaborative functionality was ‘baked into’ BI and CPM tools. In other words, vendors brought collaboration to BI. However, the MySpace and Facebook generation taught us a different approach. What these collaborative and social networking environments showed was that it is much more natural to publish content to collaborative workspaces to elicit feedback and to share that content with others who are interested in it.

In the context of BI, this turned the first generation of collaborative BI tools on their head: rather than take collaboration to BI, it is far more effective to take BI to the collaborative platform, where the range of collaborative tools available offers a lot more power. Lyzasoft was a pioneer of this new generation of modern social and collaborative BI technologies. Also, new releases of more widely adopted BI platform products are now being integrated with mainstream collaborative platforms such as Microsoft SharePoint and IBM Lotus Connections. Even cloud-based collaboration technologies from vendors like Google are getting in on the act. Mobile BI technology is taking this further by allowing people to collaborate on BI from mobile devices.

However, I (and others) would argue that we are still seeing only one side of the coin here with respect to BI and collaboration. That side is the classic approach: the formal integration of data from multiple sources into a data warehouse, the production of intelligence, and the publishing of BI artefacts (dashboards, reports, etc.) into social and collaborative environments where they can be shared with others, rated and collaborated upon for joint decision making. But what about innovation? What about when innovative business users want to experiment, get some data and ‘play’ with it in a sandbox environment to figure out what business insight might be useful, or what new metrics would be useful to the business? Do we not need collaboration here also? Another probing question is whether this innovation should sit ‘upstream’ of a data warehouse. In other words, let them play with the data until there is consensus as to what is useful, and then feed this into the more classic approach of data integration, storage, analysis and sharing. I am comforted by the fact that it is not only me asking this question. Others, like my good friend Barry Devlin, are also talking about the use of collaboration and the sharing of business insight produced in an innovative environment. I know Barry will be speaking about this here.

The point is that, in my opinion (and it is admittedly only opinion), there is a place for collaborative and social BI in an innovative sandbox environment where BI is not yet ‘hardened’. We need this capability in many industries; I have come across it in both retail banking and manufacturing, for example. However, what must be controlled is the release of newly formed innovation into production. This is where governance comes in. Data governance would allow newly created metrics to be published in a business glossary to be used by multiple BI tools in a hardened production environment, for example. Also at this point, new data sources may be declared to a more formal production DW/BI environment for data acquisition. Therefore we have two sides to collaborative BI: the innovation cycle, which needs to share ‘experimental’ information and elicit feedback from others, and the more formal production BI/DW environment where well-polished business insight is shared across the enterprise for people to use and act on. One feeds the other, typically because innovators also need to collaborate with IT to take the innovation and move it into the mainstream environment.

Let me know what you are doing with social and collaborative BI. I would be grateful for your comments.
