BI & Warehousing

Oracle Analytics Summit 2022

Rittman Mead Consulting - Wed, 2022-06-22 03:41

Last week I attended the second Oracle Analytics Summit at Skywalker Ranch.

The last Summit was in 2019 and attended by Jon Mead and he summarised the experience here.

The purpose of the event is unchanged: the idea is to showcase the current analytics products and services and, more importantly, to get feedback from analytics customers and partners on future developments and strategy.

It was clear during the event that a lot had changed since 2019 and that the initiative has really started to pay off: Oracle was named a Visionary in the 2022 Gartner Magic Quadrant for Analytics and Business Intelligence Platforms for the third year in a row. Rittman Mead has certainly seen an uptake of OAC/OAS and positive customer feedback.

The event started well with a recorded opening keynote and two customer success stories, MTN being one of them. We have been working with MTN for the last five years.

The core workshops over the two days focused on three main areas:

1.     Analytics (OAC/OAS)

2.     Oracle Applications and Fusion Analytics Workspace (FAW)

3.     Data

One of the highlights for me was a couple of talks around storytelling by James Richardson.  James used to work for Gartner and is now at Oracle. He continues to look at data storytelling and how this is expected to dominate business intelligence in the future, specifically now with Oracle.

The biggest change to OAC is the introduction of a web-based semantic modelling tool. This will absolutely revolutionise the way we build semantic models in OAC. I’m sure most readers know or have come across the OBIEE Admin Tool and the RPD.

“The Oracle BI Repository (RPD file) stores BI Server metadata. The metadata defines logical schemas, physical schemas, physical-to-logical mappings, aggregate table navigation, and other constructs. You can edit Oracle BI repositories using the Oracle BI Administration Tool.”

The challenge with the RPD has always been how to do multiuser development (MUD). The RPD is a single binary object and, unless you edit in online mode (not advisable), it’s hard to version changes (Enabling OBIEE Concurrent Development). The new semantic layer has been completely re-engineered around the latest software engineering practices, thanks to the leadership of Bret Grinslade and team. The user interface (UI) looks modern, and you can use the UI or code directly in Semantic Model Markup Language (SMML). The lineage tool presents well. Keep an eye out for upcoming blogs from our team on the subject as we start testing its current capabilities.

“The current version supports relations objects modelling. Essbase and OLAP are not available, and the tool is only for OAC. The traditional Admin tool will be here for a while.”

In addition there are continued improvements to OAC, focusing on AI/ML and natural language processing.

There is a continued drive to enhance Fusion Analytics Warehouse (FAW) for Fusion application users. Additional external pipelines include Salesforce and EBS, as examples.

Summary

There is a continued commitment to supporting customers and building a strong analytics community. The event was certainly filled with thought leaders in the area and hearing about customer experiences gave useful insight and feedback to Oracle product teams and partners.

Look out for our blogs on the new semantic layer and our experiences with FAW.

Customers I spoke to during the event were struggling to increase their users’ adoption of new analytics capabilities: same reports, different platform. If you are struggling with user adoption, let us know. We have an excellent team of consultants and a user engagement service that helps to identify why users aren’t engaged with the reporting tools.

Categories: BI & Warehousing

Semantic Layer for Data Scientist

Dylan's BI Notes - Thu, 2022-06-16 18:36
I recently read a good buyer’s guide from AtScale: The Buyer’s Guide to the Best Semantic Layer Tools for Data and Analytics. I think the buyer’s guide is fair not because the company I worked for also has such a semantic layer, but because I really feel that the drawback of the vendor-specific semantic layer […]
Categories: BI & Warehousing

Tracing Data Lineage from Source Column to End User

Rittman Mead Consulting - Thu, 2022-06-16 05:46

A while ago I blogged about our Lineage Tool, which back then was limited to diving into the ODI Work Repository and emerging from it with source-to-target data flow diagrams. The task at hand was simple and the design straightforward: show Source and Target database entities and connect them with arrows representing the ELT logic in ODI Mappings. The hierarchy to use to drill down from summary to detail was quite natural: ODI Environment > ODI Model > Datastore (Table) > Attribute (Column).

ODI Environment ODI Lineage Demo

This 4-level hierarchy works very well with ODI. Here is a view at the Table level. (Similarly, we can have views at the Model or the Environment level.)

Demo ODI Environment

However, there was a temptation to add more to the lineage: OBIEE (OAS/OAC) RPD and Web Catalog lend themselves nicely to lineage analysis. In fact, in the case of these two, Oracle allows us to generate lineage reports from the Model Administration Tool and Catalog Manager respectively.

Starting with RPD's Physical Model, we have very similar entities there: Physical Tables and their Aliases have Columns. Tables are grouped into Physical Schemas, which in turn are part of Databases. It looks like we have one more hierarchy level here: OBIEE Environment > Database > Physical Schema > Physical Table / Alias > Column. We could acknowledge that different lineage analysis sources will have different hierarchies with their own number of hierarchy levels. However, we also want to see the entire lineage across ODI and OBIEE at a certain hierarchy level - showing ODI lineage at a Column level and OBIEE lineage at a Table level will not work very well. Therefore we had to force all hierarchies to conform to 4 hierarchy levels. Here is the OBIEE RPD Physical Layer:

OBIEE RPD Test Environment Database

We have 4 hierarchy levels instead of 5 - Environment and Database are squeezed into a single level.

Notice Physical Tables on the left and their Aliases on the right - table and view aliasing naturally becomes part of the lineage flow.
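
To make the conformance idea concrete, here is a minimal sketch in Python (illustrative only - the function and level names are assumptions, not the Lineage Tool's actual code):

def conform_odi(environment, model, datastore, attribute):
    # ODI already maps naturally onto the four conformed levels.
    return (environment, model, datastore, attribute)

def conform_obiee_physical(environment, database, schema, table, column):
    # The OBIEE Physical Layer has five native levels, so Environment and
    # Database are squeezed into the first conformed level.
    return (f"{environment} {database}", schema, table, column)

# Both sources now share the same (level 1..4) shape and can be joined into
# a single end-to-end lineage view.
print(conform_odi("Demo ODI Environment", "STG_MODEL", "STG_PRODUCT", "PRICE"))
print(conform_obiee_physical("OBIEE RPD Test Environment", "Demo DB",
                             "STAGING", "STG_PRODUCT", "PRICE"))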

Further down the lineage line, we have the OBIEE Presentation Layer - we can keep the lineage more compact by hiding the BMM Layer and going from Physical straight to Presentation. Controversial? Yes, but we do not always want as much detail in a diagram as possible - just like our ODI lineage diagram that shows only Source and Target tables, hiding the inner workings of the ODI Mapping.

OBIEE RPD Environment - Demo RPD

Here we have Presentation Tables and their Columns, a Subject Area (SA) instead of a Schema, and the OBIEE RPD Environment at the top. This time there is no need for an extra hierarchy level, but we add (Presentation) after the Environment name to better distinguish it from the Physical Layer.

Adding OBIEE Web Catalog to our Lineage is no harder than adding the RPD.

OBIEE Web Catalog Environment

This time it seems we only really need 3 hierarchy levels: OBIEE Environment > Analysis (Report) > Report Column. However, we can introduce Catalog Folders to the hierarchy to stand at the Schema level.

As we can see, conforming to the 4-level hierarchy can be artificial in some cases, but overall the lineage makes sense and, importantly, we can easily switch between the 4 hierarchy levels to get summary or detailed views, and drill down into Schemas, Tables and Columns to get a filtered view of those entities.

But does the Lineage end at the BI Report? It depends on what we want to call Lineage. We can argue that the end point of our lineage is not the Report but the BI User. Getting Usage Adoption data from OBIEE is easy, assuming that Usage Adoption data gathering is enabled. But can we fit UA data into our lineage, and make it conform to the 4-level hierarchy?

OBIEE UA Environment

It is a stretch but yes, we can. Let us note that Usage Adoption is the first lineage source that does not have the concept of a Column and a collection of Columns (e.g. table or report). I chose to assign the user name to the table hierarchy level - because the table and the column hierarchy levels are probably going to be the two most frequently used and I do not want the user to be available at the finest level only. In the example above, the finest detail is Report Access Date. In a Production system that has been used for a few years, it would be substituted with Month or Quarter.

The end result is one large diagram that goes from the source table and column all the way to the BI User accessing that data in a report.

Data lineage tool mappings

In the diagram, we can select a source column, say, SRC_PRODUCT.PRICE and see which users have accessed reports with Product Price in it. If the column data is deemed sensitive, we can do security audits by using this Lineage diagram. We can see if a certain source Table or Column is currently being used in BI reports and also if those reports are actually being used. We can start our analysis from the other end as well, by selecting a user and seeing what source table and column data that user is or is not accessing.

As of now, the Rittman Mead Lineage Tool is capable of tracing and visualising data lineage, starting with an ODI Mapping's Source Datastore Column and ending with the OBIEE User. The Lineage Tool is constantly evolving and there is still much we want to add to it. The journey so far has been interesting and I am sure we will revisit the compromises we have made to make the many lineage data sources fit together in a coherent hierarchy in a single end-to-end lineage.

Interested in finding out more about the Lineage Tool? Contact us.

Categories: BI & Warehousing

Cloud Database and Cloud DataLake

Dylan's BI Notes - Wed, 2022-06-15 13:54
The term DataLake was invented to describe the data storage and the fact that, after Hadoop and HDFS were introduced, you can have a cheaper way and place to store your data without using a traditional database - by traditional, I mean an RDBMS, a relational database management system. Cheaper is not just about cost, it is […]
Categories: BI & Warehousing

OAS / DV & HR Reporting (A Learning Experience Part 2)

Rittman Mead Consulting - Tue, 2022-06-07 03:04
OAS / DV & HR Reporting                  (A Learning Experience Part 2)

As a relative newbie to Oracle Analytics Server (OAS) and Oracle Data Visualisation (DV) myself until recently, I was keen to share this learning experience with you.

If you are an HR professional, or you work supporting HR with data and reports, this blog is aimed at you. My intention is that you can use these two blogs (Part 1 and Part 2) as a guide to get your own local working copy of OAS. You could also load your own HR data into the system to make the experience more relatable. You could start working with HR data and quickly create some HR data visualisation reports. You could begin to build a business case for moving any current OBIEE system to OAS based on your findings. Even if you don't use OBIEE, OAS may well be worth a look.

This is not a sales pitch for OAS; it's more to show how you can get OAS up and running with some HR data and make up your own mind about it. You can also use the system to engage with others who may be interested. In addition, the content should be freely available to you if you have My Oracle Support.

Firstly, as a recap: in Part 1 of this blog, I created a stand-alone Oracle database and OAS system using Docker. I connected my local OAS instance to an Oracle database which contained the Oracle HR sample schema. Using the HR schema, I created an OAS HR dataset. I then used the HR dataset (the actual data!) to create some basic HR visualisations (reports!). In Part 1 I focused on the newer business intelligence data visualisation tools and the data self-serve facility in OAS (akin to Microsoft Power BI Desktop).

Hopefully all that makes sense - so far, so good!

In this blog (Part 2) I am going old school. What if I want to leverage my existing HR OBIEE developments and recreate/migrate my existing BI subject areas to OAS, and write the BI reports in OAS with the same interface OBIEE uses? What if I want to move my current HR BI Publisher reports into OAS? And what if I want to mix and match all the visualisations and BI reports - DV, Classic BI, BI Publisher and so on - on a BI dashboard?

Can I do all of this?  

Answer: Yes

In this blog are the steps I took to prove it.

A mix of OAS Classic views (OBIEE) and OAS Data Visualisation views on the HR dataset, rendered in an OAS Classic dashboard.

Firstly, how can I re-use an existing OBIEE schema and data in OAS?

HR professionals may be less interested in the BI technical side and might glaze over in this part of the blog! However, they will be interested in knowing that, if they have existing BI reports, these are going to transition and work fine in OAS with the minimum of fuss. Assume I already use OBIEE and already have my HR subject areas set up. How do I get these into OAS so I can report on them?

Maybe I want to create new OAS visuals using this HR subject area to get some additional functionality thrown in for the effort involved.    

HR subject area in the OAS repository.

1) My existing OBIEE repository currently works against a local OBIEE 12c install and a local Oracle database instance I created a while ago. We already used this same Oracle HR sample schema in Part 1 of the blog to create visualisations in OAS. It will be interesting to see, once I get the HR subject area working in OAS and write the reports, whether they reconcile back to the OAS reports I created in Part 1 - after all, I don't want to remodel the data again; that's the point of doing this exercise, and the same may be the case for you if you are thinking of a move from OBIEE to OAS.

Firstly I needed to get a copy of the OAS repository from the OAS server onto my local machine. Then I could edit it with the BI Administration Tool and add in the HR schema from my existing OBIEE repository.

To get access to the oas642 container in Docker:

docker exec -it oas642 bash

Then cd /opt/oracle/config/domains/bi/bitools/bin

Run the datamodel download script to get hold of the OAS repository.

 ./datamodel.sh downloadrpd -O /home/oracle/oas642.rpd -W Admin123 -SI ssi -U weblogic -P Admin123

Now I can locate oas642.rpd in /home/oracle.

I renamed the original OAS repository before I made any changes to it just so I retained an original copy locally.

mv oas642.rpd oas642orig.rpd

2) Next I edited the oas642 repository in the BI Administration Tool to add in the physical, logical and presentation layers of the HR subject area. Here I had a slight issue: my local BI Admin Tool could not open the OAS repository because it complained the repository was a newer version than the local client. I installed a newer BI client version using setup_bi_client-12.2.1.3.0-win64.exe and installed it in Windows 8 compatibility mode. I am using Windows 10 and, without this, the BI Admin Tool would not work!

I cut and pasted the physical, logical and presentation layers between two sessions of the BI Admin Tool (one with the OBIEE repository open, the other with the OAS repository). There are other ways to get the HR subject area into the OAS repository (a repository merge, for example) but, for now, I wanted to keep things simple. Once the elements were copied over, I ran a consistency check on the OAS repository and made sure there were no errors with the copied HR subject area.

I changed the HR database connection pool to point to my Docker container Oracle database rather than my local database instance (the HR sample schema and data are identical in both databases). Note the data source host IP in the image below is the Docker database container IP (as identified in Part 1 of this blog), not my local machine IP.

OAS repository data source re-pointed to the Docker database container IP.

3) Next step was to upload the updated OAS repository with the HR subject area to the Docker OAS server container.

To upload the OAS repository from the directory /opt/oracle/config/domains/bi/bitools/bin

./datamodel.sh uploadrpd -i /home/oracle/oas642.rpd -si ssi -u weblogic -p Admin123

Message returned on command line was 'RPD upload completed successfully'.        
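
Since the download, copy and upload steps are easy to forget, here is a minimal Python sketch that wraps them - an assumption on my part rather than official tooling, simply shelling out to the same docker and datamodel.sh commands shown above, with the same credentials and paths:

import subprocess

CONTAINER = "oas642"
BITOOLS = "/opt/oracle/config/domains/bi/bitools/bin"
RPD_IN_CONTAINER = "/home/oracle/oas642.rpd"

def run(cmd):
    # Echo the command, run it and fail loudly on a non-zero exit code.
    print(" ".join(cmd))
    subprocess.run(cmd, check=True)

def download_rpd(local_copy="oas642.rpd"):
    # Export the repository inside the container, then copy it to the local machine.
    run(["docker", "exec", CONTAINER, f"{BITOOLS}/datamodel.sh", "downloadrpd",
         "-O", RPD_IN_CONTAINER, "-W", "Admin123", "-SI", "ssi",
         "-U", "weblogic", "-P", "Admin123"])
    run(["docker", "cp", f"{CONTAINER}:{RPD_IN_CONTAINER}", local_copy])

def upload_rpd(local_copy="oas642.rpd"):
    # Copy the edited repository back into the container and upload it to OAS.
    run(["docker", "cp", local_copy, f"{CONTAINER}:{RPD_IN_CONTAINER}"])
    run(["docker", "exec", CONTAINER, f"{BITOOLS}/datamodel.sh", "uploadrpd",
         "-i", RPD_IN_CONTAINER, "-si", "ssi", "-u", "weblogic", "-p", "Admin123"])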

4) In OAS I navigated to the Datasets menu and saw a new dataset, 'HR'. The purple box icon (below) denotes that this is an OAS subject area.

The HR dataset, which is the HR subject area.

I clicked on the icon and could begin to build OAS visualisations using the HR subject area. I could see the dimension folders and the HR fact folder, which all looked familiar.

HR subject area now in OAS - select columns to start building.

In addition to the complete HR subject area, I also created another dataset called 'HR Local Subject Area', based on a subset of the complete HR subject area. Why do this? I wanted to see if I could cherry-pick elements of the HR subject area to include just the columns I needed to replicate the OAS visualisations I created in Part 1 of the blog (which were written directly against the database). This process helps simplify the report writing. It's not really required for the small HR subject area used in these examples but, in the real world, very large subject areas often confuse users who don't know which dimensions to use and only need certain elements of the data. They may also not be permitted to see all the data elements. You can tailor the dataset to a specific set of requirements and add other calculated columns at this stage if required.

Creating a dataset subset of the HR subject area for a specific requirement.

5) OK so now to create the visualisations in OAS against the HR local Subject Area Dataset!

Note: two datasets (top left); the active tab is writing against the HR Local Subject Area dataset.

This was pretty straightforward to do. There were a couple of issues with data types, but nothing that took more than a couple of minutes to fix. The main thing was that the data reconciled exactly to the visualisations written against the database connection, which gave confidence that the move from the OBIEE repository to the OAS repository was successful.

6) What if I want to present the OAS data visuals on a BI dashboard, write some additional reports the old-fashioned OBIEE way, and then combine them all on a dashboard?

I click on Open Classic home


I go to My Dashboard and click Edit. I drag my DV workbook into the dashboard and, as I have two OAS canvases, I choose to display Canvas 1 (show view) - this was the Oracle database connection visuals. Note Canvas 2 was the HR subject area visuals I had just created.

Presenting the data visualisations on a dashboard.

I also added in a logo. I saved My Dashboard and had a look at it.

Quite happy with that view!

OAS data visualisations on a BI Classic dashboard.

For the new BI Classic content, I clicked on Create and chose Analysis.


With the Criteria tab I could now create my OBIEE reports with the familiar interface using the HR Subject Area.

This all looks very familiar!

I knocked together a quick report. Note I could have also used the HR Local Subject Area Dataset I created to write the analysis.


I added the new Analysis to my dashboard and combined with the OAS Data visualisations to get this view (below).

Again I was pretty happy with that view.


7) Finally, I wanted to create a couple of basic BI Publisher reports in OAS as a way of proving I could migrate HR BI Publisher content to OAS should I wish.

In OAS I created a BIP Data Source connection to the HR schema database so I could get to the data.

For this I needed to go into the Administration page of BI Publisher in OAS and create a new JDBC connection.

Oracle HR JDBC connection created (second in the list above).

Then I created my data model, in this case with a bit of SQL (below). I also used the HR subject area to create another BI Publisher HR salary report, as it's useful to be able to demonstrate both methods.

Basic BIP data model on the HR schema in the Oracle database.

Once I had written my reports using the BI Publisher report interface, I presented them alongside my other Classic views on a BI dashboard.

Job done!

Example of OBIEE and BI Publisher views in OAS Classic.

Conclusion

During this exercise the penny dropped regarding OAS.  

I have worked in organisations where certain departments felt trapped by the traditional approach taken to BI - basically, that BI is a technical product that should be administered directly by IT (obviously this may be less relevant to you if you have already moved away from this model).

Essentially, OAS moves you to a platform where you can start to involve users from the business and get them fully engaged in preparing and curating data through to writing visuals and reports. Once connections to data sources are set up, a certain level of ownership of data and reporting can be passed over to the business from IT. OAS also gives business users the chance to create their own datasets based on data they may have in third-party systems or spreadsheets. Working alongside other departments in the business, all of which have similar aspirations and requirements for data, OAS has an enterprise-ready feel to it.

As well as the OAS data visuals and interface, it was reassuring to see that you can continue with the older BI technology and re-use content - subject areas and reports from OBIEE, and also BI Publisher data models and reports should you have these.

In the real world with OAS, I can see HR departments sticking to having certain more complex HR reporting datasets written for them by more technical people in the OAS stack, but the flexibility is certainly there to have one or two people within the HR department creating OAS datasets and visuals with complete autonomy. These individuals could also have SQL skills and work with the OAS administrator to fast-track content into the system. The OAS administrator for HR content could also sit within the HR department. A blended approach is always best in my opinion.

As a real-world example, it's worth mentioning that I am currently working for a Rittman Mead client who are moving a number of on-premise OBIEE repositories to OAS in the cloud (migrating to one consolidated OAS repository). Phase 1 of the project involves replicating and enhancing the current OBIEE dashboards and analytics in OAS Classic. This phase of the project is being handled by a team specifically set up for this purpose. Phase 2 of the project is to push out the OAS Data Visualisation tool for adoption by business users, who will be trained to create and design datasets and workbooks. In addition, the business unit users will receive training to maintain and develop further OAS analytics and dashboards in OAS Classic. This is to help facilitate a move from the old central IT report-writing model to a self-serve user reporting model out in the business.

With Microsoft Power BI Desktop and similar products increasingly being requested by business users who have grown tired of waiting for an overstretched IT department to address reporting issues, it is certainly worth adding OAS to the mix as a candidate solution.

Overall, I personally found this a useful practical learning experience. I hope you gained something too and, if you stuck with it this far, thanks for reading.

Categories: BI & Warehousing

A quick way of generating Informatica PowerCenter Mappings from a template

Rittman Mead Consulting - Mon, 2022-05-16 04:52
Generating Informatica PowerCenter Content - the Options

In our blogs we have discussed the options for Oracle Data Integrator (ODI) content generation here and here. Our go-to method is to use the ODI Java SDK, which allows querying, manipulating and generating new ODI content.

Can we do the same with Informatica PowerCenter? In the older PC versions there was the Design API that enabled browsing the repository and creating new content. However, I have never used it. My impression is that Oracle APIs are more accessible than Informatica APIs in terms of documentation, help available online and availability for download and tryout.
If we want to browse the PowerCenter repository content, there is an easy way - query the repository database. But what about content generation? Who will be brave or foolish enough to insert records directly into a repository database!? Fortunately, there is a way, and a fairly easy one, if you don't mind doing a bit of Python scripting.

Generate PowerCenter Mappings - an Overview

Selective Informatica PC repository migrations are done via XML export and import - it is easy and mostly fool-proof. If we can generate XMLs for import, then we have found a way of auto-generating PowerCenter content. Informatica seems to support this approach by giving us nice, descriptive error messages if something is wrong with import XMLs. Only completely valid XMLs will import successfully. I have never managed to corrupt my Informatica repository with a dodgy XML import.

Let us look at an example - we need to extract a large number of OLTP tables to a Staging schema. The source and staging tables have very similar structures, except the staging tables have MD5 codes based on all non-key source fields to simplify change data capture (CDC) and also have the extract datetime.

  1. We start by creating a single mapping in Designer, test it, make sure we are 100% happy with it before proceeding further;
  2. We export the mapping in XML format and in the XML file we replace anything unique to the source and target table and their fields with placeholder tags: [[EXAMPLE_TAG]]. (See the XML template example further down.)
  3. Before we generate XMLs for all needed mappings, we need to import Source and Target table definitions from the databases. (We could, if we wanted, generate Source and Target XMLs ourselves but PC Designer allows us to import tables in bulk, which is quicker and easier than generating the XMLs.)
  4. We export all Sources into a single XML file, e.g. sources.xml. Same with all the Targets - they go into targets.xml. (You can select multiple objects and export in a single XML in Repository Manager.) The Source XML file will serve as a driver for our Mapping generation - all Source tables in the sources.xml file will have a Mapping generated for them.
  5. We run a script that iterates through all source tables in the source XML, looks up its target in the targets XML and generates a mapping XML. (See the Python script example further down.) Note that both the Source and Target XML become part of the Mapping XML.
  6. We import the mapping XMLs. If we import manually via the Designer, we still save time in comparison to implementing the mappings in Designer one by one. But we can script the imports, thus getting both the generation and import done in minutes, by creating an XML Control File as described here.
Scripting Informatica PowerCenter Mapping generation

A further improvement to the above would be reusable Session generation. We can generate Sessions in the very same manner as we generate Mappings.

The Implementation

An example XML template for a simple Source-to-Staging mapping that includes Source, Source Qualifier, Expression and Target:

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE POWERMART SYSTEM "powrmart.dtd">
<POWERMART CREATION_DATE="05/26/2021 11:55:12" REPOSITORY_VERSION="188.97">
<REPOSITORY NAME="DemoETL" VERSION="188" CODEPAGE="UTF-8" DATABASETYPE="Oracle">
<FOLDER NAME="Extract" GROUP="" OWNER="Developer" SHARED="NOTSHARED" DESCRIPTION="" PERMISSIONS="rwx---r--" UUID="55321111-2222-4929-9fdc-bd0dfw245cd3">

    [[SOURCE]]
	
    [[TARGET]]
	
    <MAPPING DESCRIPTION ="[[MAPPING_DESCRIPTION]]" ISVALID ="YES" NAME ="[[MAPPING_NAME]]" OBJECTVERSION ="1" VERSIONNUMBER ="2">
	
        <TRANSFORMATION DESCRIPTION ="" NAME ="SQ_EXTRACT" OBJECTVERSION ="1" REUSABLE ="NO" TYPE ="Source Qualifier" VERSIONNUMBER ="1">
            [[SQ_TRANSFORMFIELDS]]
            <TABLEATTRIBUTE NAME ="Sql Query" VALUE =""/>
            <TABLEATTRIBUTE NAME ="User Defined Join" VALUE =""/>
            <TABLEATTRIBUTE NAME ="Source Filter" VALUE =""/>
            <TABLEATTRIBUTE NAME ="Number Of Sorted Ports" VALUE ="0"/>
            <TABLEATTRIBUTE NAME ="Tracing Level" VALUE ="Normal"/>
            <TABLEATTRIBUTE NAME ="Select Distinct" VALUE ="NO"/>
            <TABLEATTRIBUTE NAME ="Is Partitionable" VALUE ="NO"/>
            <TABLEATTRIBUTE NAME ="Pre SQL" VALUE =""/>
            <TABLEATTRIBUTE NAME ="Post SQL" VALUE =""/>
            <TABLEATTRIBUTE NAME ="Output is deterministic" VALUE ="NO"/>
            <TABLEATTRIBUTE NAME ="Output is repeatable" VALUE ="Never"/>
        </TRANSFORMATION>
		
        <TRANSFORMATION DESCRIPTION ="" NAME ="EXPTRANS" OBJECTVERSION ="1" REUSABLE ="NO" TYPE ="Expression" VERSIONNUMBER ="2">
            [[EXP_TRANSFORMFIELDS]]
            <TRANSFORMFIELD DATATYPE ="nstring" DEFAULTVALUE ="ERROR(&apos;transformation error&apos;)" DESCRIPTION ="" EXPRESSION ="[[MD5_EXPRESSION]]" EXPRESSIONTYPE ="GENERAL" NAME ="CDC_MD5" PICTURETEXT ="" PORTTYPE ="OUTPUT" PRECISION ="32" SCALE ="0"/>
            <TRANSFORMFIELD DATATYPE ="date/time" DEFAULTVALUE ="ERROR(&apos;transformation error&apos;)" DESCRIPTION ="" EXPRESSION ="SYSTIMESTAMP()" EXPRESSIONTYPE ="GENERAL" NAME ="EXTRACT_DATE" PICTURETEXT ="" PORTTYPE ="OUTPUT" PRECISION ="29" SCALE ="9"/>
            <TABLEATTRIBUTE NAME ="Tracing Level" VALUE ="Normal"/>
        </TRANSFORMATION>

        [[SOURCE_INSTANCE]]
		
        <INSTANCE DESCRIPTION ="" NAME ="SQ_EXTRACT" REUSABLE ="NO" TRANSFORMATION_NAME ="SQ_EXTRACT" TRANSFORMATION_TYPE ="Source Qualifier" TYPE ="TRANSFORMATION">
            <ASSOCIATED_SOURCE_INSTANCE NAME ="[[SOURCE_INSTANCE_NAME]]"/>
        </INSTANCE>
		
        <INSTANCE DESCRIPTION ="" NAME ="EXPTRANS" REUSABLE ="NO" TRANSFORMATION_NAME ="EXPTRANS" TRANSFORMATION_TYPE ="Expression" TYPE ="TRANSFORMATION"/>
		
        [[TARGET_INSTANCE]]

        [[SRC_2_SQ_CONNECTORS]]

        [[SQ_2_EXP_CONNECTORS]]

        [[EXP_2_TGT_CONNECTORS]]

        <CONNECTOR FROMFIELD ="CDC_MD5" FROMINSTANCE ="EXPTRANS" FROMINSTANCETYPE ="Expression" TOFIELD ="CDC_MD5" TOINSTANCE ="[[TARGET_INSTANCE_NAME]]" TOINSTANCETYPE ="Target Definition"/>
        <CONNECTOR FROMFIELD ="EXTRACT_DATE" FROMINSTANCE ="EXPTRANS" FROMINSTANCETYPE ="Expression" TOFIELD ="EXTRACT_DATE" TOINSTANCE ="[[TARGET_INSTANCE_NAME]]" TOINSTANCETYPE ="Target Definition"/>

        <TARGETLOADORDER ORDER ="1" TARGETINSTANCE ="[[TARGET_INSTANCE_NAME]]"/>

        <ERPINFO/>
        <METADATAEXTENSION COMPONENTVERSION ="1000000" DATATYPE ="STRING" DESCRIPTION ="" DOMAINNAME ="User Defined Metadata Domain" ISCLIENTEDITABLE ="YES" ISCLIENTVISIBLE ="YES" ISREUSABLE ="YES" ISSHAREREAD ="NO" ISSHAREWRITE ="NO" MAXLENGTH ="256" NAME ="Extension" VALUE ="" VENDORNAME ="INFORMATICA"/>
    </MAPPING>
</FOLDER>
</REPOSITORY>
</POWERMART>

Python script snippets for generating Mapping XMLs based on the above template:

  1. To translate database types to Informatica data types:
mapDataTypeDict = {
	"nvarchar": "nstring",
	"date": "date/time",
	"timestamp": "date/time",
	"number": "decimal",
	"bit": "nstring"
}

2. Set up a dictionary of tags:

xmlReplacer = {
	"[[SOURCE]]": "",
	"[[TARGET]]": "",
	"[[MAPPING_DESCRIPTION]]": "",
	"[[MAPPING_NAME]]": "",
	"[[SQ_TRANSFORMFIELDS]]": "",
	"[[EXP_TRANSFORMFIELDS]]": "",
	"[[MD5_EXPRESSION]]": "",
	"[[SOURCE_INSTANCE]]": "",
	"[[SOURCE_INSTANCE_NAME]]": "",
	"[[TARGET_INSTANCE]]": "",
	"[[TARGET_INSTANCE_NAME]]": "",
	"[[SRC_2_SQ_CONNECTORS]]": "",
	"[[SQ_2_EXP_CONNECTORS]]": "",
	"[[EXP_2_TGT_CONNECTORS]]": ""
}

3. We use the Source tables we extracted in a single XML file as our driver for Mapping creation:

import xml.etree.ElementTree as ET

sourceXmlFilePath = '.\\sources.xml'

# go down the XML tree to individual Sources
sourceTree = ET.parse(sourceXmlFilePath)
sourcePowerMart = sourceTree.getroot()
sourceRepository = list(sourcePowerMart)[0]
sourceFolder = list(sourceRepository)[0]

for xmlSource in sourceFolder:
    # generate a Mapping for each Source

    # We also need to go down to the Field level:
    for sourceField in xmlSource:
        # field level operations go here
        pass

4. Generate tag values. This particular example is of a Column-level tag, a column connector between Source Qualifier and Expression:

sqToExpConnectorTag = f'<CONNECTOR FROMFIELD ="{columnName}" FROMINSTANCE ="SQ_EXTRACT" FROMINSTANCETYPE ="Source Qualifier" TOFIELD ="{columnName}" TOINSTANCE ="EXPTRANS" TOINSTANCETYPE ="Expression"/>'

5. We assign our tag values to the tag dictionary entries:

xmlReplacer["[[SQ_2_EXP_CONNECTORS]]"] = '\n'.join(sqToExpConnectors)

6. We replace the tags in the XML Template with the values from the dictionary:

for replaceTag in xmlReplacer.keys():
	mappingXml = mappingXml.replace(replaceTag, xmlReplacer[replaceTag])
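
Putting the snippets together, the outer generation loop might look like the hedged sketch below (file names such as mapping_template.xml are illustrative assumptions, not part of the original script):

# Sketch of the overall loop: read the XML template once, populate the tag
# dictionary per Source table and write one mapping XML per Source.
with open("mapping_template.xml", encoding="utf-8") as templateFile:
    templateXml = templateFile.read()

for xmlSource in sourceFolder:
    sourceName = xmlSource.get("NAME")

    xmlReplacer["[[MAPPING_NAME]]"] = f"m_Extract_{sourceName}"
    # ... populate the remaining tags as shown in the snippets above ...

    mappingXml = templateXml
    for replaceTag in xmlReplacer.keys():
        mappingXml = mappingXml.replace(replaceTag, xmlReplacer[replaceTag])

    with open(f"m_Extract_{sourceName}.xml", "w", encoding="utf-8") as mappingFile:
        mappingFile.write(mappingXml)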

Interested in finding out more about our approach to generating Informatica content? Contact us.

Categories: BI & Warehousing

Analyzing Brand Reputation with Oracle Analytics

Rittman Mead Consulting - Wed, 2022-05-11 04:00

When Italy used to qualify for the FIFA World Cup, I was a big fan of Oracle Endeca Information Discovery and implemented several innovative projects with it. Endeca allowed users to upload and combine diverse data for agile discovery on structured and unstructured information, transform and enrich data with a powerful visual data integration environment, mine text in multiple languages, and identify sentiment through natural language processing within the same framework. Unfortunately, it was too far ahead of its time and didn't achieve the success it deserved.

Two World Cup qualifying failures and one Euro Cup triumph later, the time is ripe and several features seen in Endeca have been included in Oracle Analytics. In this post, I'm going to illustrate how to use Oracle Analytics to easily perform brand reputation analysis without machine learning or marketing experience.

Let's begin!

Identifying the Brand to Analyze

According to Marketing91:

Brand Reputation is the perception of the brand and how the particular brand of the company is viewed by customers, stakeholders, and the market as a whole. It is the culmination of ideas and emotions that a customer associated with the brand with the customer service experienced during the purchase of goods and services, whilst using them, and after-sales services provided by the company.

Since Italy can have a chance to win the 2022 FIFA World Cup only in videogames, I'm going to pick eFootball 2022 - a free-to-play football game - to conduct my brand reputation analysis.

At launch, the game was panned by critics and players, who criticized the atrocious graphics, lack of content, laggy engine and finicky controls. With 92% negative reviews, it became the worst-rated game on Steam a day after launch, and the lowest-rated game of 2021 on the review aggregator Metacritic.

Konami - the publisher - apologised for the game's many issues and said they would work on improving it. A major update was released on the 14th of April 2022 on PC and consoles: would it be sufficient to raise the reputation of the game?

Gathering the Data

People love to share their thoughts and opinions online. Due to this fact and the rise of social media, consumers have the greatest say in how a brand is perceived nowadays. Online discussions, reviews, posts, and articles about a brand can make it or break it.

I'm going to focus on a single social network - Twitter - to start my analysis and understand what people think about eFootball 2022. Unfortunately, Oracle Analytics does not (yet) allow users to gather data directly from social networks, so I decided to use the Twitter API v2 and search-tweets - a Python library for the v2 search endpoints - because I'm familiar with them. I won't cover the details here, but you can trust me when I say that the most difficult thing is to get the developer account application approved by Twitter!

In order to have a set of data suitable for analysis, I'm going to consider only tweets in English, and ignore retweets and quotes. I'm also going to exclude tweets with links because I noticed that most of them were actually spam. I can then retrieve all tweets posted over the last week related to eFootball 2022, and its official Twitter account @play_eFootball, with the following query:

(to:play_eFootball OR @play_eFootball OR #eFootball OR #eFootball2022 OR eFootball OR eFootball2022) -from:play_eFootball -is:retweet -is:quote -has:links lang:en

The response is in JSON format, which is not directly supported by Oracle Analytics, and contains a mix of tweet and user information, so I'm going to convert it into a couple of Excel files - Tweets.xlsx and Users.xlsx - using a simple custom Python script (a sketch of this conversion follows the field lists below).

Tweets.xlsx contains the following fields:

  • id - The unique identifier of the requested tweet
  • text - The actual UTF-8 text of the tweet
  • author_id - The unique identifier of the user who posted the tweet
  • created_at - Creation time of the tweet
  • Public engagement metrics for the tweet at the time of the request, such as retweet_count, reply_count, like_count and quote_count

Users.xlsx contains the following fields:

  • id - The unique identifier of the user
  • name - The name of the user, as they've defined it on their profile (subject to change)
  • username - The Twitter screen name, handle, or alias that the user identifies themselves with
  • location - The location specified in the user's profile (freeform value)
  • Public metrics about activity for the user, such as followers_count, following_count, tweet_count and listed_count
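
As a rough illustration of that conversion step, here is a hedged sketch (not the actual script): it assumes the raw v2 response - requested with the author_id expansion and the public-metric tweet/user fields - has been saved to response.json, and that pandas and openpyxl are installed.

import json
import pandas as pd

with open("response.json", encoding="utf-8") as f:
    response = json.load(f)

# Tweets: json_normalize flattens the nested public_metrics object into columns
# such as public_metrics.retweet_count, which we then rename.
tweets = pd.json_normalize(response["data"])
tweets.columns = [c.replace("public_metrics.", "") for c in tweets.columns]
tweet_cols = ["id", "text", "author_id", "created_at",
              "retweet_count", "reply_count", "like_count", "quote_count"]
tweets[tweet_cols].to_excel("Tweets.xlsx", index=False)

# Users are returned in the includes section when the author_id expansion is requested.
users = pd.json_normalize(response["includes"]["users"])
users.columns = [c.replace("public_metrics.", "") for c in users.columns]
user_cols = ["id", "name", "username", "location",
             "followers_count", "following_count", "tweet_count", "listed_count"]
users[user_cols].to_excel("Users.xlsx", index=False)
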
Enriching the Data

With the gathered data in Excel format, I'm finally able to upload it to Oracle Analytics and use the data preparation features to enrich the datasets before visualizing the data.

Focusing solely on the numbers is a big mistake in brand reputation analysis. A high number of mentions does not automatically mean the brand is doing great, particularly when the vast majority of those mentions are terrible reviews. Understanding the general attitude towards the brand is a crucial activity, but it can be really tedious if you have to do it manually.

Luckily, the Analyze Sentiment step comes in handy: in Oracle Analytics, it can be added to a data flow to automatically detect the sentiment (Positive, Neutral or Negative) for a given text column. Figure 1 shows how easy it is to configure this step. All I have to do is select the column to analyze (text) and provide a name for the new column that will contain the calculated sentiment (the default is emotion).

Figure 1. The Analyze Sentiment step allows you to automatically detect the sentiment for a given text column in a data flow.

The next step is to save the data in an Oracle Database or Oracle Autonomous Data Warehouse, for instance by adding a Save Data step to the data flow and setting the Save data to property to an existing database connection (Figure 2).

Figure 2. The Save Data step can be used to save data in a database table and access additional database functions.

Having the data in the database makes the Database Analytics step available in the data flow editor for advanced analytics and data mining. In particular, I'm going to use it to break the tweet text down into distinct words, count their occurrences and save the results in a new database table. Once again it's very easy to configure this step. All I have to do is select the Text Tokenization analytics operation (Figure 3), then specify the column to be tokenized (text), the reference column(s) to be included in the output dataset (id), the number of top tokens to return and whether to use a special language lexer (Figure 4).

Figure 3. The Database Analytics step allows you to use several database analytics functions in a data flow.
Figure 4. The Text Tokenization analytics function can be used to break a textual column down into distinct words and count their occurrences.
⚠️
The user specified in the database connection must have execution privileges on the CTX_DDL package in order to successfully perform text tokenization.
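
To show the shape of the tokenization output, here is a plain-Python approximation for illustration only - the real work is done by Oracle Text inside the database, and the tweet texts below are made up:

import re
from collections import Counter

# One row per (tweet id, token) with an occurrence count - the same shape as
# the table produced by the Text Tokenization step.
sample_tweets = {
    "1001": "eFootball 2022 finally feels like a football game",
    "1002": "Still no new content in eFootball eFootball needs content",
}

token_rows = []
for tweet_id, text in sample_tweets.items():
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    for token, occurrences in Counter(tokens).items():
        token_rows.append((tweet_id, token, occurrences))

for row in token_rows:
    print(row)  # e.g. ('1002', 'efootball', 2)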

Tokens and occurrences will be fundamental when I analyze the data with Tag Cloud visualizations. A tag cloud is a great visualization of the things people usually associate with the brand, and it can be an indicator of the brand's reputation in general.

Since all tweets, users and tokens data is in the database, it's a good idea to create a dataset with multiple tables to analyze the data (Figure 5). This solution allows me to achieve better performance because all the joins will be executed at the database level and only the tables needed to satisfy a visualization are used in the query.

Figure 5. A dataset with multiple tables (Tweets, Tokens and Users) is the ideal solution for analyzing brand reputation in Oracle Analytics.

Analyzing the Data

Now that data is set up, it's time to conduct a brand reputation analysis to identify what works and what does not, and optimize the brand strategy accordingly.

The Line visualization is the ideal way to display brand mentions (i.e., the number of tweets) over time (Figure 6). The number of mentions fluctuates, with the highest peaks on the 21st of April, when eFootball 2022 Season 1 started, and on the 25th of April, when Konami made important announcements related to the game. The decreasing trend of mentions should not be cause for concern: most of the tweets are created in the afternoon/evening, while I gathered the data on the 28th of April during my lunch break (Figure 7).

Figure 6. The Line visualization can be used to display the number of mentions over time in Oracle Analytics.
Figure 7. The Grid Heat Map visualization can be used to understand when the majority of tweets were created in Oracle Analytics.

As stated previously, sentiment analysis is crucial for a successful brand reputation analysis. I like to analyze it both at an aggregated level with a Donut visualization (Figure 8) and over time with a Line visualization (Figure 9). According to the Analyze Sentiment step, 47.66% of mentions are positive and only 21.64% are negative - a big step forward compared to when the game was originally published in 2021!

Figure 8. The Donut visualization can be used to display the total number of mentions and the sentiment ratio in Oracle Analytics.
Figure 9. The Line visualization can be used to display the sentiment over time in Oracle Analytics.

Users have a major impact on brand reputation and a good brand reputation analysis should take into account what the most relevant influencers are posting. I usually display a Table visualization of the top 10 users by mentions, by followers count and by engagement (i.e., the sum of mentions' retweets, replies, likes and quotes), and create a data action to a detailed canvas for inspecting the user's content (Figure 10).

Figure 10. The Table visualization can be used to display the most relevant influencers by mentions, followers count and engagement in Oracle Analytics.
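
As a concrete, hypothetical illustration of that engagement definition, computed outside OAS on the Tweets.xlsx file from the earlier sketch:

import pandas as pd

# Engagement per tweet = retweets + replies + likes + quotes; then the top 10
# authors by total engagement, mirroring the Table visualization described above.
tweets = pd.read_excel("Tweets.xlsx")
metric_cols = ["retweet_count", "reply_count", "like_count", "quote_count"]
tweets["engagement"] = tweets[metric_cols].sum(axis=1)

top_authors = (tweets.groupby("author_id")["engagement"]
               .sum()
               .sort_values(ascending=False)
               .head(10))
print(top_authors)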

Finally, there is the Tag Cloud visualization which clearly shows what people are looking for when mentioning the brand (Figure 11).

Figure 11. The Tag Cloud visualization can be used to display what people are looking for when mentioning the brand in Oracle Analytics.

With the fundamental visualizations clear in mind, the brand reputation workbook can take shape. In the Overview canvas (Figure 12), I like to present aggregated data such as the total number of mentions, the sentiment ratio, the sentiment over time, the number of mentions over time, and the top 50 words by occurrence. I can restrict the analysis to a specific emotion, a date range, a peak or a word, and see how the selection affects the other visualizations.

Figure 12. The Overview canvas displays aggregated data about the brand.

The Influencers canvas (Figure 13) lists the most relevant influencers - by mentions, followers count and engagement - alongside the sentiment ratio and the most used words. Again, I can restrict my analysis to a specific user, an emotion or a word, and see how the selection affects the other visualizations.

Figure 13. The Influencers canvas displays aggregated data about the most relevant influencers.

The last canvas, Tweets (Figure 14), allows me to fully inspect the content of the gathered tweets and the related public metrics. Since the data in this canvas is at a very granular level, I prefer to reach it through data actions from the Overview or Influencers canvases. This preserves any selection previously made (i.e., the search context) and displays the details only for the tweets that I'm really interested in.

Figure 14. The Tweets canvas displays the content of the gathered tweets and related public metrics.

What's next?

Once the data is available in a format supported by Oracle Analytics, it takes less than an hour to build a workbook like the one described in the previous sections. It's worth saying that while it represents a solid starting point for a brand reputation analysis, there is also plenty of room for improvement:

  • Manual data gathering is not the way to go in a production scenario, and the process must be automated.
  • I gathered data from Twitter, but there is a wide variety of platforms people use to mention brands and all relevant sources should be monitored.
  • Competitors' data should be collected to evaluate the strengths and weaknesses of brands that are deemed to be a potential threat to your business.
Conclusion

Brand reputation is nowadays a must-have intangible asset for any organization, since even a small error in service can be made visible to the entire world by a single tweet from an unhappy customer and impact the business.

If you are looking into brand reputation analysis with Oracle Analytics and want to find out more, please do get in touch or DM us on Twitter @rittmanmead. Rittman Mead can help you with a product demo, training and assistance throughout the development process.

Categories: BI & Warehousing

OAS / DV & HR Reporting (A Learning Experience)

Rittman Mead Consulting - Wed, 2022-05-04 03:13
OAS / DV & HR Reporting                   (A Learning Experience)

I wanted to share a personal learning experience around HR reporting using Oracle Analytics Server (OAS) and Data Visualisation (DV). In my experience, more often than not, the old-school Oracle BI development tool skillset (BI Apps / OBIEE / BI Publisher) resides in the IT department. Often IT cannot respond quickly enough to develop new HR reports in line with the increasing demand from HR for data. In addition, HR professionals want to access the data themselves, or at least be served it by people who report directly to them. After all, they are the individuals who both need and fully understand the data, and in many cases are the only people who should be accessing it. Perhaps OAS has a part to play in improving this situation and giving HR better access to HR/pay data? My target audience for this blog is HR professionals and those that support them, both technical and non-technical.

Example of some DV reports built using the Oracle sample HR schema.

By explaining the steps I took during this learning experience (below), my intention is to illustrate the capability and flexibility of Oracle Analytics Server in both quickly accessing and presenting HR data, while at the same time showing some of the touch points between the functional and more technical aspects of the tool. Finally, I wanted to build the environment myself locally and use existing resources that are free to use - the only cost should be the time taken to learn!

Personally I am from an Oracle BI Apps / OBIEE background, so I wasn't really that familiar with OAS and the DV tool. I set myself the target of getting a working OAS system locally, getting hold of some HR data and then presenting some basic HR reports in DV, including some basic payroll metrics. How hard can it be!?

The purpose of this blog is to assist an individual who has admin rights to a machine and some technical skills to quickly get some data in front of HR for comment and feedback (you're going to need at least 16 GB of RAM to get the Docker images running). The intention is also to give visibility of the key components of the application if you are a newbie to OAS, and finally to show where the HR team can get involved in accessing and improving data quality more quickly than with the older BI toolset. Please note the blog is not intended to show complex data modelling of ERP/Fusion Oracle HR or Payroll for reporting purposes.

WHAT I DID

1) I used Docker Desktop to create a container for Oracle Database R19.3 and also a container for OAS 642 (the R19.3 database container was sourced from the Oracle GitHub Docker images and OAS from the gianniceresa Oracle Analytics 6.4 folder on GitHub; note you still need the binaries from Oracle to perform the installation).

2) I used the Oracle HR sample schema provided in an Oracle database install as the Oracle data source: $ORACLE_HOME/demo/schema/human_resources. The schema and data were imported into the Docker container for the R19.3 database.

Connection to the Docker database shown in SQL Developer, to illustrate the schema.

3) In OAS I made a database connection to the Docker database container containing the HR schema. I called the connection 'HR' as I made the connection using the HR database user.

Create Connection --> Oracle Database (I used basic connect)

The Oracle Database connection.

Note: to make the connection, in the Host field you need to put the IP of the Docker database container, so use the command below to find it (db19r1 was the name of my database container).

docker inspect -f '{{range.NetworkSettings.Networks}}{{.IPAddress}}{{end}}' db19r1

4) I created a new OAS dataset using the HR connection

The HR connection used to create the dataset, denoted by the red icon.

I used manual SQL queries to select all data for each of the tables -

e.g. Select * from HR.EMPLOYEES. Note: I could have just dragged and dropped the entire EMPLOYEES table; however, I created it as a SQL query in case I wanted to be more specific with the data going forward and be able to alter or change the queries.

The queries/tables are then joined together in the Join Diagram.

Join Diagram example.

Make sure you certify and grant access to the dataset in line with your security requirements. As this is just a sample set of data from Oracle we don't need to worry about this for now, but it is worth mentioning.


5) I wrote some basic reports in DV using the newly created HR Dataset.

Simple Table Vis. Salary by County and Dept with Totals
Simple Category Vis. UK Sales Dept Commission % by Person (ref line shown as Average)
Pie Vis. Headcount by County Value and %
Table Vis. Min/Max Salary Ranges by Dept and Job Title

I put a country filter on all the reports in the filter section of DV so I could limit to one or more countries.

DV Filter Vis. Filtered on United Kingdom

I found it fairly intuitive to use DV to write basic reports, however, there is a lot of functionality in DV that I can't even begin to cover here. So here is the sales pitch! Rittman Mead do a great course on DV https://www.rittmanmead.com/oracle-analytics-frontend/ :)

I also quickly had a look at the same HR dataset in OAS using the classic interface and felt very at home as a long time OBIEE user!

The OAS Classic report-writing interface.

In Conclusion

With a little bit of effort, I created a working OAS system locally on Docker Desktop. I was able to connect OAS to an Oracle database, create an OAS dataset against the HR database schema and write some basic DV reports using the dataset. At step 4 (above) I could see the HR team getting involved and assisting with or creating the dataset within OAS themselves, then taking it from that point into DV or OAS Classic and creating the reports. Often certain individuals in the HR team have, or want to learn, SQL skills, and the push is certainly in the direction of HR and payroll taking ownership of the data themselves. I used a very basic HR dataset; however, the principles for creating more datasets with more complex requirements and data would be the same.

What's next?

In my next blog I am going to show how I used the OAS repository (RPD) to model the same Oracle HR sample data and present it to the user as a Local Subject Area for writing HR reports against using DV. If you currently use OBIEE this could interest you, as you can replicate your current subject areas in OAS and rewrite the reports in DV, or keep them in the OAS Classic look and feel on a dashboard. So you get the best of both worlds with OAS: DV and the OAS Classic interface.


Categories: BI & Warehousing

Introducing Rittman Mead Lineage Tool

Rittman Mead Consulting - Tue, 2022-04-26 06:19
Introduction

Rittman Mead Lineage Tool analyses ODI Mappings and produces neat visualisations of source-to-target mappings. The Lineage Tool is quick to set up and easy to run.

Rittman Mead Lineage Tool: from ODI Mappings to Mapping Lineage Visualisation

The Lineage Tool can be used to reverse-engineer legacy ELT logic as the first step in a migration project. It can also be used to track the progress of ongoing ODI development by scheduling it to run on a daily basis, and it is useful for ODI content documentation - the output it produces is a set of static, cross-navigable SVG files that can be opened and navigated in a web browser.

Running the Lineage Tool

The Lineage Tool connects to an ODI Work Repository via the ODI Java SDK. This means the tool will connect to the ODI repository in the same way ODI Studio does. (The Tool has no dependency on the ODI Studio.)

The Tool scans through ODI Repository Projects and Folders, looking for Mappings. For each Mapping it traces all Attributes (columns) in Target Datastores (tables) to their Source Datastores and Attributes.

Extracting ODI metadata...

An intermediate result of the analysis is metadata describing columns and the mappings between them - similar to a data dictionary - which is written into the Lineage Tool's metadata schema. This output is much richer in data than the visualisations ultimately produced and can therefore be used for ad-hoc analysis of the lineage.

...and generating Lineage Visualisations.

Based on the produced lineage metadata, the Lineage Tool generates visualisations. The visualisations are generated as SVG documents - vector graphics files written in a format similar to HTML that support some of the functionality an HTML file does: hyperlinks, HTML tables and HTML-like content formatting. The SVG documents are static (no Javascript), cross-navigable and support drill-down.

Exploring the Lineage

Lineage exploration starts from a Landing page that gives a summary of the Lineage content.


Lineage content is offered at 4 levels of granularity: Column, Table, Schema and Project.

Exploring Lineage at Table level...
...and Column Level

Column and Table names, Schemas and Projects are navigable hyperlinks.

Clicking on a particular column gives us all source tables and columns as well as targets. We also get to see accompanying columns and their mappings - included for context but greyed out.

We can also explore a single Column.

A single column view can be useful for security audits - to trace sensitive data like Customer Address from OLTP source columns to Data Warehouse Dimension attributes.

Interested in finding out more about the Lineage Tool? Contact us.

Categories: BI & Warehousing

ML Data Engineering and Feature Store

Dylan's BI Notes - Tue, 2022-03-22 12:23
A typical ML process flow is: load the data; explore and clean the data; create features; create the ML model; deploy the ML model for inference/prediction. The problem with this flow is that it ignores the fact that the process has to be repeatable and the data needs to be reused. In the real world, […]
Categories: BI & Warehousing

Oracle Analytics Server 2022 (6.4): The Best 15 New Features, Ranked

Rittman Mead Consulting - Mon, 2022-03-14 07:38

OAS 6.4 is finally live and provides a big set of enhancements that had already been available to OAC users. New features mostly affect the Data Visualization tool and involve the full process flow, from data preparation to data presentation, including visualizations, machine learning and administration improvements.

The focus of this post is on the best 15 new features of OAS 6.4 in my personal opinion. I asked my colleagues at Rittman Mead to rank them after a quick demonstration, so you know who to blame if the feature to hide loading messages is not at the top of the list!

If you are interested in a comprehensive list of new features and enhancements in OAS 6.4 please refer to What's New for Oracle Analytics Server.

15. Redwood theme
The Analysis Editor with the Redwood theme

OAS 6.4 includes components of Redwood - Oracle's new user experience design language. With a consistent look and feel across Analytics, Publisher and Data Visualization tools, the new default theme improves the user experience through a better handling of white space, a softer color palette, and new fonts.

14. Hide loading messages
The new Hide loading messages setting

Funny random quotes have been displayed in Data Visualization during loading ever since it was released, and since then people have been asking for a way to disable or customize them. These messages can be fun for the first two seconds, but when they are displayed over and over because loading is taking longer... I feel like the application is fooling me!

My workaround to hide loading messages in OAS 5.9

I'm really happy to announce that I can finally crumple my loyal sticky note (see the picture above) because quotes can be replaced with a generic "Loading..." message in OAS 6.4. This can be done by switching on the Hide loading messages option in the System Settings section of Data Visualization Console, and restarting the Presentation Server.

13. Improved Home page search
The new advanced search commands

The search bar in the Home page can be used to search for content and generate on-the-fly visualizations based on the chosen keywords. This was already one of my favourite features and it has been further improved in OAS 6.4: advanced search commands can be used to tailor search results for exact matches, multi-term matches, and field-level matches. When the list of accessible datasets is huge, these commands are particularly useful for quickly locating datasets created by a specific user or with a particular name. Unfortunately, the advanced search commands can only be used in the Home page search bar, and not in other pages such as Data.

12. Support for Back button in browsers

The Back button in the browser can be pressed to navigate within the OAS 6.4 interface, such as between editors and the Home page. Unfortunately, it cannot be used in the Analytics (Classic) and Publisher editors to undo an action.

11. View and export membership data
Exporting membership data in OAS 6.4

Exporting membership data for auditing purposes and to find out exactly who has what access is a common task for any OAS administrator. To achieve this task I have always used WLST with custom Python scripts, and I will probably continue with this approach in future. However, for users not familiar with WLST and Python, there is a new option in OAS 6.4 to download membership data for users, groups, and application roles to a CSV file in a few clicks.

To view and export membership data for a user:

  1. Open the Users and Roles section in Data Visualization Console.
  2. Select the Users tab and select the name of the user whose membership details you want to see.
  3. Click on Application Roles (or Groups) under Direct Memberships.
  4. Click on the menu icon and select Show Indirect Memberships.
  5. Click on the Export button.

In a similar way membership data for groups and application roles can be viewed and exported.

10. Annotations on canvases
Adding notes in the Visualize tab

Notes are a great way to emphasize or explain important information in visualizations, but I always found it limiting to only be able to use them in Data Visualization stories. In OAS 6.4 notes can be added to canvases in workbooks and tied to specific data points in the Visualize tab as well. If you are wondering what a workbook is, it's simply what projects are now called in the new release!

9. Expression filters for entities in datasets
Filtering entities in datasets using expression filters

When a column is added to a dataset from a database connection or a subject area, all of the column values are included in most cases. Columns can be filtered so that the dataset contains only the rows needed, but it was not possible to filter on expressions in OAS 5.9. More complex filters are now supported by using expression filters. They must be Boolean (that is, they must evaluate to true or false) and can reference zero or more data elements.

8. Select multiple columns/visualizations for editing
Selecting multiple columns for editing

In the Metadata view of the Dataset Editor multiple columns can be selected to change settings (Data Type, Treat As, Aggregation and Hide) for more than one column at once. This is not yet possible in the Data view, where properties can only be configured for one column at a time. In a similar way, multiple visualizations can be selected on a workbook to easily change shared properties, copy and paste, and delete them. These two new features will surely allow users to save time when changing settings for multiple objects.

7. Automatic refresh of visualizations

Data in workbooks can now be automatically refreshed at a specified interval to ensure that the visualizations contain the most current data. It's sufficient to right-click on a canvas tab, select Canvas Properties and set Auto Refresh Data to Enabled to automatically refresh the data using a given Refresh Interval. Please note that this operation does NOT trigger a data cache reload. If a dataset table's access mode is set to Automatic Caching, then the table re-queries the cached data that could be stale.

6. Custom knowledge recommendations
Adding custom knowledge

After creating a dataset, the dataset undergoes column-level profiling to produce a set of semantic recommendations to enrich the data. These recommendations are based on the system automatically detecting a specific semantic type (such as geographic locations or recurring patterns) during the profile step. In OAS 6.4 custom knowledge recommendations can be used to augment the system knowledge by identifying more business-specific semantic types and making more relevant and governed enrichment recommendations. System administrators can upload custom knowledge files using the new Reference Knowledge section in Data Visualization Console. Custom knowledge files must be in CSV or Microsoft Excel (XLSX) format, contain keys to profile the data in the first column, and enrichment values in the other columns.

5. Data quality insights

OAS 6.4 automatically analyzes the quality of the data in the Dataset Editor, and provides a visual overview known as a quality insight in a tile above each column. Quality insights allow you to explore data in real time using instant filtering, evaluate the data to identify anomalies and outliers, replace or correct those anomalies and outliers, and rename columns.

4. Developer Options
The Developer Options pane

In Data Visualization, built-in developer options are now available to embed content in other applications and analyze statistics such as query time, server, and streaming time for visualizations in workbooks. Users with administrator or content author privileges can display developer options by appending &devtools=true to the URL of a workbook in the browser.

3. Conditional formatting
Conditional formatting is now available in Data Visualization

How can conditional formatting be applied to workbooks in Data Visualization similar to the way it can be applied to analyses in Analytics (Classic)? This is frequently asked in Data Visualization training and the answer used to be that the feature was not available - a workaround was required to achieve something similar (e.g. by dragging a conditional expression to the color drop target and changing the color assignments as required). In OAS 6.4 it's now possible to highlight important events in data in a proper manner by clicking on the Manage Conditional Formatting Rules icon in the visualization toolbar.

2. Reload datasets on a schedule
Datasets can be reloaded on a schedule

Reloading datasets ensures that they contain current data. When a dataset is reloaded and it contains one or more tables with the Data Access property set to Automatic Caching, the dataset's SQL statements are rerun and the current data is loaded into the cache to improve performance. This has always been a manual process in the past, but now it can be automated by creating a one-time or repeating schedule to reload a dataset's data into the cache. The New Schedule option has been added to the Actions Menu which appears when right-clicking on a dataset in the Data page. This option is NOT available for datasets that use only files or when data access for all tables is set to Live.

1. Datasets with multiple tables
Datasets with multiple tables can now be created

In OAS 5.9 it was possible to create a dataset only with a single entity from a data source connection or a local subject area. This could have been a set of columns from a single table, when the Select Columns option was selected, or a more complex query with joins and native functions, when the Enter SQL option was used instead.

As downsides:

  • Users without any SQL knowledge were not able to join tables in a dataset, or they had to run a data flow to join multiple datasets to achieve a similar result.
  • It was always necessary to retrieve all columns from all tables in a complex query, even when only one was included in a visualization, and this resulted in potential performance issues.

OAS 6.4 allows you to perform self-service data modeling with datasets by adding multiple tables to a dataset from one or more relational data source connections or local subject areas. Not all the supported data sources can be used to create datasets with multiple tables; please refer to Data Sources Available for Use in Datasets Containing Multiple Tables for a comprehensive list.

The Dataset Editor contains a new Join Diagram pane which displays all of the tables and joins in the dataset. When tables are dragged and dropped to the Join Diagram, joins are automatically added if they are already defined in the data source and column name matches are found between the tables. It's possible to prevent this default behaviour and define joins manually by switching off the Auto Join Tables toggle button in the Dataset Editor.

When creating datasets with multiple tables, Oracle recommends that you:

  • Add the most detailed table to the dataset first (a fact table when the data is dimensionally modeled) and then all remaining tables that provide context for the analysis.
  • Create a dataset for each star schema and use data blending to analyze data based on two star schemas.

OAS 6.4 treats datasets with multiple tables as data models in that only the tables needed to satisfy a visualization are used in the query. This is surely the best new feature of this release and represents a valid alternative to model data in the Metadata Repository (RPD) for users interested only in the Data Visualization tool.

Conclusion

OAS 6.4 includes an incredible amount of new features and the Data Visualization tool has been significantly improved. The support for datasets with multiple tables alone is worth the upgrade, as it allows you to define complex and structured datasets without impacting performance.

If you are looking into OAS 6.4 and want to find out more, please do get in touch or DM us on Twitter @rittmanmead. Rittman Mead can help you with a product demo and training, and can assist with the upgrade process.

Categories: BI & Warehousing

Unify 10.0.49

Rittman Mead Consulting - Thu, 2022-02-03 03:37

A new release of Unify is available, 10.0.49. This release includes support for Tableau 2021.4.

Please contact us at unify@rittmanmead.com if you have any questions.

Categories: BI & Warehousing

The role of Data Lakes or Staging Areas for Data Warehouse ETL

Rittman Mead Consulting - Mon, 2021-10-18 03:21


We were asked recently by a client about the role of a staging area for ETL processes when loading a Data Warehouse; specifically, they wanted to know if this was still required when using a Data Lake.


TLDR: Data Lakes and staging areas can be interchangeable in terms of ETL processes; the key considerations are who else and what else will make use of the data within a Data Lake, and whether you have the right policies and procedures in place to ensure good data governance.


As with so many things, people often see Data Lakes as a technology solution, but the reality is that a Data Lake is a service. Data Lakes provide a method of surfacing data in its raw/native form to a variety of users and downstream processes; they are intended to use relatively cheap storage and to help accelerate insights into business decisions. We see clients opting to implement Data Lakes on a variety of different technologies, each with its own benefits, drawbacks and considerations. However, the prevailing trend in terms of operating an effective Data Lake, and in terms of controlling cost, is the need for careful governance across various aspects of data quality and security, including items such as data retention and the data dictionary.

A staging area for a Data Warehouse serves a single focused purpose: holding raw data from source systems and providing a location for transient tables that are part of the transformation steps. Depending on the design methodology and ETL toolset the purpose of the staging area varies slightly, but the target audience is always simply the ETL process and the Data Engineers who are responsible for developing and maintaining the ETL. This doesn't negate the need for the data governance that is required in a Data Lake, but it does simplify it significantly when compared to the multitude of users and processes which may access a Data Lake.



Traditional direct ETL from source system to Data Warehouse
Data Warehousing with the inclusion of a Data Lake



Conclusions

  1. Depending on the toolset chosen for the Data Lake, ETL, and Data Warehouse, the location and method for performing transformations and storing transient intermediate tables could be either in the Data Lake or within a sub-schema of the Data Warehouse database.

  2. If your ETL and Data Warehouse are the only downstream users of a Data Lake, is it even a Data Lake?

  3. Get your processes and policies right in terms of data governance, retention, and security.

Categories: BI & Warehousing

Exploring Mapbox integration in OAC 6.2

Rittman Mead Consulting - Fri, 2021-10-15 04:36

Live location has proved increasingly important for tracking the position of objects and people in real time, especially during the pandemic, when following the spread of Covid-19 became crucial to decision-making around protection protocols. Mapbox has risen to prominence by aiding numerous businesses with its high-performance platform, delivering real-time mapping by collecting and processing live anonymous sensor data from users globally. OAC 6.2 was released with some Mapbox-integrated features, so it was time to test them.

One of Mapbox’s many use cases, Vaccines.gov, devised a web application using Mapbox GL JS to inform Americans of their nearest Covid-19 vaccination centres. Whilst experimenting with Mapbox in OAC, I also happened to use a Covid-19 dataset to plot some vaccination statistics across the world. I was intrigued to see whether OAC could enable me to execute animations similar to the web application. So, I followed a video Oracle uploaded to their YouTube channel to get the gist of the features to test.

The process
The sample dataset I found on Kaggle is updated daily from the ‘Our World in Data’ GitHub repository and merged with a locations data file to distinguish the sources of these vaccinations. I imported them into DV and carried out a quick clean on a few variables I intended to plot on a simple map visual; I replaced all null values with 0 for total_vaccinations and daily_vaccinations. I noticed the country variable was understandably being treated as an attribute, and the cleaned variables as numbers, which would help visualise the distributions easily with colour gradient scales.

First, I plotted a map using country under the Location category, and total_vaccinations under Colour.

Map visualisation created using the Mapbox library depicting the distribution of total vaccinations across the globe - the darker the area the higher the count.

The data is presented clearly on the map, and I could zoom in and out of specific locations seamlessly with a simple hover-and-zoom. Following the video, I added another visual depicting a table enlisting the countries and vaccines registered by each.

By selecting each field in this table, you can see the relevant country is highlighted. The first feature is the ‘Zoom to Selected’ option which directly zooms into the country of interest and can be found in the top-right menu of the map visual. Since country is only an attribute without geospatial coordinates, using the Layer Type ‘Point’ just centred the point within the country space, so leaving it as a ‘Polygon’ did not make a difference.

Using fields of a table as a filter to highlight regions on the map. Zoom to Selected can be used to zoom into selected regions.

Now for the more interesting feature: enabling ‘Use As Filter’ on the top-left icon of the table allows for automatic zooming into the map as different fields are selected. However, you need to ensure the ‘Auto Focus on Data’ feature under ‘Map’ properties is toggled on in order to see the automatic zoom functioning well; otherwise, the map will remain static where you left off whilst other regions will be highlighted but not shown. In addition, I experimented with some map coordinates from Kaggle that looked at statistics regarding rainfall-triggered landslides across America. I realised the latitude/longitude coordinates had to be treated as an attribute (like country above) in order to plot them accurately.

The whole table visualisation being used as a filter to highlight areas of interest on the map. With Auto-focus turned on, the map will automatically zoom into the selected area.

Limitations
You can also add other map backgrounds to the default options listed (OracleStreetMap, Oracle BI, Oracle Maps), such as Google Maps, Baidu Maps, Web Map Service, and Tiled Web Map. Note that, due to legalities, the first two of these (Google Maps and Baidu Maps) can only be used with the MapView library rather than Mapbox.

Also worth mentioning, working with Mapbox library in OAC on an older laptop and an older version of a browser may slow down loading times of visualisations in the canvas due to some fairly intensive JavaScript.

Conclusions
Overall, the Mapbox integration in OAC can execute impressive animations around accurate geospatial data, similar to many of its business use cases today. The zoom feature is seamless and helps locate areas of interest quickly to extract relevant information. There are also various map theme options to suit visualisation needs for the data you are working with.

Categories: BI & Warehousing

Text mining in R

Rittman Mead Consulting - Wed, 2021-09-22 05:04

As data becomes increasingly available in the world today, the need to organise and understand it also increases. Since an estimated 80% of the data out there is unstructured, text mining becomes an extremely valuable practice for organisations to generate helpful insights and improve decision-making. So, I decided to experiment with some data in the programming language R and its text mining package “tm”, one of the most popular choices for text analysis in R, to see how helpful the insights drawn from the social media platform Twitter were in understanding people’s sentiment towards the US elections in 2020.

What is Text Mining?

Unstructured data needs to be interpreted by machines in order to understand human languages and extract meaning from this data, also known as natural language processing (NLP) – a genre of machine learning. Text mining uses NLP techniques to transform unstructured data into a structured format for identifying meaningful patterns and new insights.

A fitting example would be social media data analysis; since social media is becoming an increasingly valuable source of market and customer intelligence, it provides us raw data to analyse and predict customer needs. Text mining can also help us extract sentiment behind tweets and understand people’s emotions towards what is being sold.

Setting the scene

Which brings us to my analysis here on a dataset of tweets made regarding the US elections that took place in 2020. There were over a million tweets made about Donald Trump and Joe Biden which I put through R’s text mining tools to draw some interesting analytics and see how they measure up against the actual outcome – Joe Biden’s victory. My main aim was to perform sentiment analysis on these tweets to gain a consensus on what US citizens were feeling in the run up to the elections, and whether there was any correlation between these sentiments and the election outcome.

I found the Twitter data on Kaggle, containing two datasets: one of tweets made on Donald Trump and the other, Joe Biden. These tweets were collected using the Twitter API where the tweets were split according to the hashtags ‘#Biden’ and ‘#Trump’ and updated right until four days after the election – when the winner was announced after delays in vote counting. There was a total of 1.72 million tweets, meaning plenty of words to extract emotions from.

The process

I will outline the process of transforming the unstructured tweets into a more intelligible collection of words, from which sentiments could be extracted. But before I begin, there are some things I had to think about for processing this type of data in R:

1. Memory space – Your laptop may not provide you the memory space you need for mining a large dataset in RStudio Desktop. I used RStudio Server on my Mac to access a larger CPU for the size of data at hand.

2. Parallel processing – I first used the ‘parallel’ package as a quick fix for memory problems encountered creating the corpus. But I continued to use it for improved efficiency even after moving to RStudio Server, as it still proved to be useful.

3. Every dataset is different – I followed a clear guide on sentiment analysis posted by Sanil Mhatre. But I soon realised that although I understood the fundamentals, I would need to follow a different set of steps tailored to the dataset I was dealing with.

First, all the necessary libraries were loaded to run the various transformation functions: tm, wordcloud and syuzhet for the text mining processes; stringr for stripping symbols from tweets; parallel for parallel processing of memory-consuming functions; and ggplot2 for plotting visualisations.

I worked on the Biden dataset first and planned to implement the same steps on the Trump dataset provided everything went well the first time round. The first dataset was loaded in and stripped of all columns except the tweets, as I aimed to use just the tweet content for sentiment analysis.
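As a minimal sketch of this setup step, something like the following would do; the file name (hashtag_joebiden.csv) and the tweet column name are assumptions based on the Kaggle dataset described above, so adjust them to match your download.

# Libraries used throughout the analysis
library(tm)          # corpus handling and the term-document matrix
library(wordcloud)   # wordcloud plotting
library(syuzhet)     # NRC emotion lexicon
library(stringr)     # string helpers
library(parallel)    # cluster-based parallel processing
library(ggplot2)     # plotting

# Assumed file and column names - adjust to the actual Kaggle download
biden_raw <- read.csv("hashtag_joebiden.csv", stringsAsFactors = FALSE)
tweets    <- biden_raw$tweet   # keep only the tweet text column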

The next steps require parallelising computations. First, clusters were set up based on (the number of processor cores – 1) available in the server; in my case, 8-1 = 7 clusters. Then, the appropriate libraries were loaded into each cluster with ‘clusterEvalQ’ before using a parallelised version of ‘lapply’ to apply the corresponding function to each tweet across the clusters. This is computationally efficient regardless of the memory space available.
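The cluster setup described above looks roughly like this; it is a sketch, and the number of workers will depend on the cores available on your server.

# One fewer cluster than the number of processor cores (8 - 1 = 7 in my case)
n_cores <- detectCores() - 1
cl <- makeCluster(n_cores)

# Load the required libraries on every worker before running parallel calls
clusterEvalQ(cl, {
  library(tm)
  library(stringr)
})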

So, the tweets were first cleaned by filtering out the retweet, mention, hashtag and URL symbols that cloud the underlying information. I created a larger function wrapping all the relevant substitution calls, each replacing a different symbol with a space character. This function was parallelised as some of the ‘gsub’ calls are inherently time-consuming.
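A simplified sketch of that cleaning step is shown below; the regular expressions are illustrative rather than the exact originals, and it assumes the cluster cl and the tweets vector created earlier.

# Replace retweet markers, mentions, hashtags and URLs with a space
clean_tweet <- function(x) {
  x <- gsub("RT\\s*@\\w+:?", " ", x)       # retweet markers
  x <- gsub("@\\w+", " ", x)               # mentions
  x <- gsub("#\\w+", " ", x)               # hashtags
  x <- gsub("http\\S+|www\\S+", " ", x)    # URLs
  x
}

# Parallelised version of lapply applies the function across the clusters
cleaned <- unlist(parLapply(cl, tweets, clean_tweet))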

A corpus of the tweets was then created, again with parallelisation. A corpus is a collection of text documents (in this case, tweets) that are organised in a structured format. ‘VectorSource’ interprets each element of the character vector of tweets as a document before ‘Corpus’ organises these documents, preparing them to be cleaned further using some functions provided by tm. Steps to further reduce complexity of the corpus text being analysed included: converting all text to lowercase, removing any residual punctuation, stripping the whitespace (especially that introduced in the customised cleaning step earlier), and removing English stopwords that do not add value to the text.
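A sketch of the corpus build and the tm clean-up steps listed above; for simplicity it runs serially here, whereas the original used the parallel cluster.

# Treat each cleaned tweet as a document and organise them into a corpus
corpus <- Corpus(VectorSource(cleaned))

# Reduce the complexity of the corpus text
corpus <- tm_map(corpus, content_transformer(tolower))       # lowercase
corpus <- tm_map(corpus, removePunctuation)                  # residual punctuation
corpus <- tm_map(corpus, stripWhitespace)                    # excess whitespace
corpus <- tm_map(corpus, removeWords, stopwords("english"))  # English stopwords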

The corpus then had to be converted into a matrix, known as a Term Document Matrix, describing the frequency of terms occurring in each document. The rows represent terms, and the columns documents. This matrix was still too large to process further without removing any sparse terms, so a sparsity level of 0.99 was set and the resulting matrix only contained terms appearing in at least 1% of the tweets. It then made sense to cumulate the sums of each term across the tweets and create a data frame of the terms against their calculated cumulative frequencies.

I went on to experiment with wordclouds initially to get a sense of the output words. Upon observation, I realised common election terminology and US state names were also clouding the tweets, so I filtered out a character vector of them (e.g. ‘trump’, ‘biden’, ‘vote’, ‘Pennsylvania’) alongside the more common Spanish stopwords, without adding an extra translation step. My criterion was to remove words that would not logically fit under any NRC sentiment category (see below). This removal method proved to work better than the one tm provides, which was essentially useless here and filtered out none of the specified words. It was useful to watch the wordcloud distribution change as I removed the corresponding words; I started to understand whether the outputted words made sense regarding the elections and the process they had been put through.
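A sketch of that step, assuming the corpus built above; the banned-word vector here is a shortened, illustrative version of the one described.

# Term Document Matrix, keeping terms that appear in at least 1% of tweets
tdm <- TermDocumentMatrix(corpus)
tdm <- removeSparseTerms(tdm, 0.99)

# Cumulative frequency of each remaining term across the tweets
term_freq <- sort(rowSums(as.matrix(tdm)), decreasing = TRUE)
freq_df   <- data.frame(term = names(term_freq), freq = term_freq)

# Illustrative subset of the election terminology and state names removed
banned  <- c("trump", "biden", "vote", "election", "pennsylvania")
freq_df <- freq_df[!freq_df$term %in% banned, ]

# Wordcloud of the 100 most common remaining terms
wordcloud(words = freq_df$term, freq = freq_df$freq,
          max.words = 100, random.order = FALSE)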

The entire process was executed several times, involving adjusting parameters (in this case: the sparsity value and the vector of forbidden words), and plotting graphical results to ensure its reliability before proceeding to do the same on the Trump dataset. The process worked smoothly and the results were ready for comparison.

The results

First on the visualisation list were wordclouds – compact displays of the 100 most common words across the tweets, as shown below.

Joe Biden's analysis wordcloud and barplot; Donald Trump's analysis wordcloud and barplot

The bigger the word, the greater its frequency in tweets. Briefly, it appears the word distribution for both parties are moderately similar, with the biggest words being common across both clouds. This can be seen on the bar charts on the right, with the only differing words being ‘time’ and ‘news’. There remain a few European stopwords tm left in both corpora, the English ones being more popular. However, some of the English ones can be useful sentiment indicators e.g., ‘can’ could indicate trust. Some smaller words are less valuable as they cause ambiguity in categorisation without a clear context e.g., ‘just’, ‘now’, and ‘new’ may be coming from ‘new york’ or pointing to anticipation for the ‘new president’. Nonetheless, there are some reasonable connections between the words and each candidate; some words in Biden’s cloud do not appear in Trump’s, such as ‘victory’, ‘love’, ‘hope’. ‘Win’ is bigger in Biden’s cloud, whilst ‘white’ is bigger in Trump’s cloud as well as occurrences of ‘fraud’. Although many of the terms lack context for us to base full judgement upon, we already get a consensus of the kind of words being used in connotation to each candidate.

Analysing further, emotion classification was performed to identify the distribution of emotions present in the run up to the elections. The syuzhet library adopts the NRC Emotion Lexicon – a large, crowd-sourced dictionary of words tallied against eight basic emotions and two sentiments: anger, anticipation, disgust, fear, joy, sadness, surprise, trust, negative, positive respectively. The terms from the matrix were tallied against the lexicon and the cumulative frequency was calculated for each sentiment. Using ggplot2, a comprehensive bar chart was plotted for both datasets, as shown below.
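One way to reproduce that tally (an assumption about the exact calculation) is to weight each term's NRC flags by the term's frequency before summing and plotting; a sketch using the freq_df data frame from earlier:

# Score each term against the NRC lexicon (8 emotions + 2 sentiments)
nrc <- get_nrc_sentiment(as.character(freq_df$term))

# Weight each term's sentiment flags by how often the term appears, then sum
weighted        <- sweep(as.matrix(nrc), 1, freq_df$freq, `*`)
sentiment_total <- colSums(weighted)

sent_df <- data.frame(sentiment = names(sentiment_total),
                      count     = as.numeric(sentiment_total))

# Bar chart of the sentiment distribution
ggplot(sent_df, aes(x = reorder(sentiment, -count), y = count)) +
  geom_col() +
  labs(x = "NRC sentiment", y = "Weighted term frequency",
       title = "Sentiment distribution of #Biden tweets")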

Side-by-side comparison of Biden and Trump's sentiment distribution.

Some revealing insights can be drawn here. Straight away, there is an absence of anger and disgust in Biden’s plot whilst anger is very much present in that of Trump’s. There is 1.6 times more positivity and 2.5 times more joy pertaining Biden, as well as twice the amount of trust and 1.5 times more anticipation about his potential. This is strong data supporting him. Feelings of fear and negativity, however, are equal in both; perhaps the audience were fearing the other party would win, or even what America’s future holds regarding either outcome. There was also twice the sadness and surprise pertaining Biden, which also makes me wonder if citizens are expressing potential emotions they would feel if Trump won (since the datasets were only split based on hashtags), alongside being genuinely sad or surprised that Biden is one of their options.

In the proportional bar charts, there is a wider gap between positivity and negativity regarding Biden than for Trump, meaning a lower proportion of people felt negatively about Biden. On the other hand, there is still around 13% trust in Trump, and a higher proportion of anticipation about him. Only around 4% of the words express sadness and surprise for him, which is around 2% lower than for Biden – intriguing. We also must remember to factor in the period after the polls opened when the results were being updated and broadcast, which may have also affected people’s feelings – surprise and sadness may have risen for both Biden and Trump supporters whenever Biden took the lead. Also, there was a higher proportion fearing Trump’s position, and the anger may have also crept in as Trump’s support coloured the bigger states.

Proportional distribution of Biden-related sentiments
Proportional distribution of Trump-related sentiments
Side-by-side comparison of proportional sentiment distribution

Conclusions
Being on the other side of the outcome, it is more captivating to observe the distribution of sentiments across Twitter data collected through the election period. Most patterns we observed from the data allude to predicting Joe Biden as the next POTUS, with a few exceptions where a couple of negative emotions were also felt regarding the current president; naturally, not everyone will be fully confident in every aspect of his pitch. Overall, however, we saw clear anger only towards Trump, along with less joy, trust and anticipation. These visualisations, plotted using R’s tm package in a few lines of code, helped us draw compelling insights that supported the actual election outcome. It is indeed impressive how text mining can be performed with ease in R (once you have the technical aspects figured out) to create inferential results almost instantly.

Nevertheless, there were some limitations. We must consider that since the tweets were split according to the hashtags ‘#Biden’ and ‘#Trump’, there is a possibility these tweets appear in both datasets. This may mean an overspill of emotions towards Trump in the Biden dataset and vice versa. Also, the analysis would’ve been clearer if we contextualised the terms’ usage; maybe considering phrases instead would build a better picture of what people were feeling. Whilst plotting the wordclouds, as I filtered out a few foreign stopwords more crept into the cloud each time, which calls for a more solid translation step before removing stopwords, meaning all terms would then be in English. I also noted that despite trying to remove the “ ’s” character, which was in the top 10, it still filtered through to the end, serving as an anomaly in this experiment as every other word in my custom vector was removed.
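Short of a full translation step, one simple mitigation would be to strip the other stopword lists that tm ships with in the same cleaning pass; a small sketch against the corpus built earlier:

# Remove Spanish (and other non-English) stopwords alongside the English ones
corpus <- tm_map(corpus, removeWords, stopwords("spanish"))
corpus <- tm_map(corpus, removeWords, stopwords("french"))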

This experiment can be considered a success for an initial dip into the world of text mining in R, seeing that there is relatively strong correlation between the prediction and the outcome. There are several ways to improve this data analysis which can be aided with further study into various areas of text mining, and then exploring if and how R’s capabilities can expand to help us achieve more in-depth analysis.

My code for this experiment can be found here.

Categories: BI & Warehousing

Getting Data into Cloud

Dylan's BI Notes - Fri, 2021-08-27 14:51
When I worked on data warehousing technologies, we extracted the data from the source. The “Extract” is the first step in ETL (or ELT). The extraction was typically done by using a SQL connection to the database that holds the transactional data. When we start introducing cloud based storage, or the Data Lake, many of […]
Categories: BI & Warehousing

Rittman Mead Sponsor Fastnet Entry 2021 Part 2

Rittman Mead Consulting - Fri, 2021-08-13 04:13

It started well… we left the pontoon in strong winds and heavy rain. Before the start of Fastnet everybody has to sail past an entry gate with storm sails or equivalent flying. The winds were strong enough to sail with them and a bit of engine assist so we enjoyed a two hour wet and windy sail to Cowes.

Having passed through the entry gate we headed back down wind and put the storm sails away. We put two reefs in the main and headed back towards the start line. Unfortunately shortly after we collided with another race boat.

Nobody was hurt, but the boats were damaged and we had to retire.

There is a huge amount of preparation and planning put into entering and starting these events: training, qualifying miles, meticulous boat preparation, routing, monitoring the weather, victualling. To end the race this way is a terrible blow.

Categories: BI & Warehousing

Rittman Mead Sponsor Fastnet Entry 2021

Rittman Mead Consulting - Wed, 2021-08-04 14:50

Part one.

In 2002 my wife Val and I decided to buy our first yacht. We lived in Brighton so it seemed to make sense. In 2005 we replaced our first boat with a Sigma 400. A Sigma is a well made boat, so it seemed logical she needed to be used properly; we decided to sell everything and travel. I knew Jon Mead from a large gnarly migration project in Liverpool and he too lived in Brighton, so he joined us racing for a season before we left. In 2006 Mr Mead joined us for a little sail across the Atlantic (20 days). We were inexperienced as a crew and skipper but we had "fun" and survived.

Here we are again ..

In 2016, over a few glasses of wine with friends, and heavily influenced by the wine, we decided to enter Fastnet 2017. Fastnet is an offshore sailing race from Cowes to the Fastnet Rock off Southern Ireland and back to Plymouth, about 600 miles. Fastnet happens every two years and has been running since 1925. The race is organised by the Royal Ocean Racing Club (RORC).

2017 was rather successful: we had a crew with a mixture of experience, completed the qualifying and then went on to complete the race.

In 2019 Val and I attempted the race double handed, quite a challenge on a Sigma 400 more suited to a crew of burly men. However, things did not go as planned. A number of niggles, concerns about the rudder bearings, some rigging issues, tiredness and the weather finally broke us.

So back to here we are again.. 2021 we were planning to sail with two friends currently living in Denmark. Covid and travel has meant they can no longer join us so back to double handing.

Why the blog? Well, Rittman Mead has kindly sponsored some of our kit for the race. Well, why not? After all, we are an Oracle partner and Larry loves a sail. There is a lovely new spinnaker, 96 square metres of sail, branded of course with Rittman Mead, and we have some HH crew hoodies to keep us warm on the trip. Thanks Mr Mead.

Fingers crossed this year is a good one. Oh, another thing to mention: this year the race is controversially changing. It is now about 100 miles longer and finishes in Cherbourg. We won't be stopping because, well, you know, Covid, so there's an additional 90 miles for our return after crossing the finish line.

Track us on Intemperance here https://www.rolexfastnetrace.com/en/follow

Oh this doesn't look great ..

Part two of Fastnet to follow…

Real-time Sailing Yacht Performance - Getting Started (Part 1)
In this series of articles, I intend to look at collecting and analysing our yacht’s data. I aim to show how a number of technologies can be used to achieve this and the thought processes around the build and exploration of the data. Ultimately, I want to improve our sailing performance with data, n…
Categories: BI & Warehousing

Sql2Odi now supports the WITH statement

Rittman Mead Consulting - Tue, 2021-08-03 07:14

Rittman Mead's Sql2Odi tool, which converts SQL SELECT statements to ODI Mappings, now supports the SQL WITH statement as well. (For an overview of our tool's capabilities, please refer to our blog posts here and here.)

The Case for WITH Statements

If a SELECT statement is complex, in particular if it queries data from multiple source tables and relies on subqueries to do so, there is a good chance that rewriting it as a WITH statement will make it easier to read and understand. Let me show you what I mean...

SELECT 
  LAST_NAME,
  FIRST_NAME,
  LAST_NAME || ' ' || FIRST_NAME AS FULL_NAME,
  AGE,
  COALESCE(LARGE_CITY.CITY, ALL_CITY.CITY) CITY,
  LARGE_CITY.POPULATION
FROM 
  ODI_DEMO.SRC_CUSTOMER CST
  INNER JOIN ODI_DEMO.SRC_CITY ALL_CITY ON ALL_CITY.CITY_ID = CST.CITY_ID  
  LEFT JOIN (
    SELECT 
      CITY_ID,
      UPPER(CITY) CITY,
      POPULATION
    FROM ODI_DEMO.SRC_CITY
    WHERE POPULATION > 750000
  ) LARGE_CITY ON LARGE_CITY.CITY_ID = CST.CITY_ID  
WHERE AGE BETWEEN 25 AND 45

This is an example from my original blog posts. Whilst one could argue that the query is not that complex, it does contain a subquery, which means that the query does not read nicely from top to bottom - you will likely need to look at the subquery first for the master query to make sense to you.

Same query, rewritten as a WITH statement, looks like this:

WITH 
BASE AS (
SELECT 
  LAST_NAME,
  FIRST_NAME,
  LAST_NAME || ' ' || FIRST_NAME AS FULL_NAME,
  AGE,
  CITY_ID
FROM 
  ODI_DEMO.SRC_CUSTOMER CST
),
LARGE_CITY AS (
  SELECT 
    CITY_ID,
    UPPER(CITY) CITY,
    POPULATION
  FROM ODI_DEMO.SRC_CITY
  WHERE POPULATION > 750000
),
ALL_DATA AS (
  SELECT
    LAST_NAME,
    FIRST_NAME,
    FULL_NAME,
    AGE,
    COALESCE(LARGE_CITY.CITY, ALL_CITY.CITY) CITY,
    LARGE_CITY.POPULATION
  FROM
    BASE CST
    INNER JOIN ODI_DEMO.SRC_CITY ALL_CITY ON ALL_CITY.CITY_ID = CST.CITY_ID
    LEFT JOIN LARGE_CITY ON LARGE_CITY.CITY_ID = CST.CITY_ID  
  WHERE AGE BETWEEN 25 AND 45
)
SELECT * FROM ALL_DATA

Whilst it is longer, it reads nicely from top to bottom. And the more complex the query, the more the comprehensibility will matter.

The first version of our Sql2Odi tool did not support WITH statements. But it does now.

Convert a WITH Statement to an ODI Mapping

The process is the same as before - first we add the two statements to our metadata table, along with some additional data, like the ODI Project and Folder names, the name of the Mapping, the Target table that we want to populate and how to map the query result to the Target table, names of Knowledge Modules and their config, etc.

After running the Sql2Odi Parser, which now happily accepts WITH statements, and the Sql2Odi ODI Content Generator, we end up with two mappings:

What do we see when we open the mappings?

The original SELECT statement based mapping is generated like this:

The new WITH statement mapping, though it queries the same data in pretty much the same way, is more verbose:

The additional EXPRESSION components are added to represent references to the WITH subqueries. While the mapping is now busier than the original SELECT, there should be no noticeable performance penalty. Both mappings generate the exact same output.

Categories: BI & Warehousing

Joining Data in OAC

Rittman Mead Consulting - Thu, 2021-07-29 08:41

One of the new features in OAC 6.0 was Multi Table Datasets, which provides another way to join tables to create a Data Set.

We can already define joins in the RPD, use joins in OAC’s Data Flows and join Data Sets using blending in DV Projects, so I went on a little quest to compare the pros and cons of each of the methods to see if I can conclude which one works best.

What is a data join?

Data in databases is generally spread across multiple tables and it is difficult to understand what the data means without putting it together. Using data joins we can stitch the data together, making it easier to find relationships and extract the information we need. To join two tables, the tables must share at least one column containing matching values. There are four available types of joins I’ll evaluate:

1.    Inner join - returns records that have matching values in both tables. All the other records are excluded.

2.    Left (outer) join - returns all records from the left table with the matched records from the right table.

3.    Right (outer) join - returns all records from the right table with the matched records from the left table.

4.    Full (outer) join - returns all records when there is a match in either left or right tables.

Each approach gives the developer different ways and places to define the relationship (join) between the tables. Underpinning all of the approaches is SQL. Ultimately, OAC will generate a SQL query that will retrieve data from the database, so to understand joins, let’s start by looking at SQL joins.

SQL Joins

In an SQL query, a JOIN clause is used to execute this function. Here is an example:

SELECT EMP.id, EMP.name, DEPT.stream
FROM EMP
INNER JOIN DEPT ON DEPT.id = EMP.id;  
Figure 1 - Example of an inner join on a sample dataset.

Now that we understand the basic concepts, let’s look at the options available in OAC.

Option 1: RPD Joins

The RPD is where the BI Server stores its metadata. Defining joins in the RPD is done in the Admin Client Tool and is the most rigorous of the join options. Joins are defined during the development of the RPD and, as such, are subject to the software development lifecycle and are typically thoroughly tested.

End users access the data through Subject Areas, either using classic Answers and Dashboards, or DV. This means the join is automatically applied as fields are selected, giving you more control over your data and, since the RPD is not visible to end-users, avoiding them making any incorrect joins.

The main downside of defining joins in the RPD is that it’s a slower process - if your boss expects you to draw up some analyses by the end of the day, you may not make the deadline using the RPD. It takes time to organise data, make changes, then test and release the RPD.

Join Details

The Admin Client Tool allows you to define logical and physical tables, aggregate table navigation, and physical-to-logical mappings. In the physical layer you define primary and foreign keys using either the properties editor or the Physical Diagram window. Once the columns have been mapped to the logical tables, logical keys and joins need to be defined. Logical keys are generally automatically created when mapping physical key columns. Logical joins do not specify join columns, these are derived from the physical mappings.

You can change the properties of the logical join; in the Business Model Diagram you can set a driving table (which optimises how the BI Server process joins when one table is smaller than the other), the cardinality (which expresses how rows in one table are related to rows in the table to which it is joined), and the type of join.

Driving tables only activate query optimisation within the BI Server when one of the tables is much smaller than the other. When you specify a driving table, the BI Server only uses it if the query plan determines that its use will optimise query processing. In general, driving tables can be used with inner joins, and for outer joins when the driving table is the left table for a left outer join, or the right table for a right outer join. Driving tables are not used for full outer joins.

The Physical Diagram join also gives you an expression editor to manually write SQL for the join you want to perform on desired columns, introducing complexity and flexibility to customise the nature of the join. You can define complex joins, i.e. those over non-foreign key and primary key columns, using the expression editor rather than key column relationships. Complex joins are not as efficient, however, because they don’t use key column relationships.

Figure 3 - Business Model Diagram depicting joins made between three tables in the RPD.

It’s worth addressing a separate type of table available for creation in the RPD – lookup tables. Lookup tables can be added to reference both physical and logical tables, and there are several use cases for them e.g., pushing currency conversions to separate calculations. The RPD also allows you to define a logical table as being a lookup table in the common use case of making ID to description mappings.

Lookup tables can be sparse and/or dense in nature. A dense lookup table contains translations in all languages for every record in the base table. A sparse lookup table contains translations for only some of the records in the base table. They can be accessed via a logical calculation using DENSE or SPARSE lookup function calls. Lookup tables are handy as they allow you to model the lookup data within the business model; they’re typically used for lookups held in different databases to the main data set.

Multi-database joins allow you to join tables from across different databases. Even though the BI Server can optimise the performance of multi-database joins, these joins are significantly slower than those within a single database.

Option 2: Data Flow Joins

Data Flows provide a way to transform data in DV. The data flow editor gives us a canvas where we can add steps to transform columns, merge, join or filter data, plot forecasts or apply various models on our datasets.

When it comes to joining datasets, you start by adding two or more datasets. If they have one or more matching columns DV automatically detects this and joins them; otherwise, you have to manually add a ‘Join’ step and provided the columns’ data types match, a join is created.

A join in a data flow is only possible between two datasets, so if you wanted to join a third dataset you have to create a join between the output of the first and second tables and the third, and so on. You can give your join a name and description which would help keep track if there are more than two datasets involved. You can then view and edit the join properties via these nodes created against each dataset. DV gives you the standard four types of joins (Fig. 1), but they are worded differently; you can set four possible combinations for each input node by toggling between ‘All rows’ and ‘Matching rows’. That means:

Join type    | Node 1        | Node 2
Inner join   | Matching rows | Matching rows
Left join    | All rows      | Matching rows
Right join   | Matching rows | All rows
Full join    | All rows      | All rows

The table above explains which type of join can be achieved by toggling between the two drop-down options for each dataset in a data flow join.

Figure 4 - Data flow editor where joins can be made and edited between two datasets.

It’s worth mentioning there is also an operator called ‘Union Rows’. You can concatenate two datasets, provided they have the same number of columns with compatible datatypes. There are a number of options to decide how you want to combine the rows of the datasets.

Figure 5 - Union Rows operator displaying how the choices made impact columns in the data.

One advantage of data flows is they allow you to materialise the data i.e. save it to disk or a database table. If your join query takes 30 minutes to run, you can schedule it to run overnight and then reports can query the resultant dataset.

However, there are limited options as to the complexity of joins you can create:

  1. the absence of an expression editor to define complex joins
  2. you cannot join more than two datasets at a time.

You can schedule data flows which would allow you to materialise the data overnight ready for when users want to query the data at the start of the day.

Data Flows can be developed and executed on the fly, unlike the longer development lifecycle of the RPD.

It should be noted that Data Flows cannot be shared. The only way around this is to export the Data Flow and have the other user import and execute it. The other user will need to be able to access the underlying Data Sets.

Option 3: Data Blending

Before looking at the new OAC feature, there is a method already present for cross-database joins which is blending data.

Given at least two data sources, for example, a database connection and an excel spreadsheet from your computer, you can create a Project with one dataset and add the other Data Set under the Visualise tab. The system tries to find matches for the data that’s added based on common column names and compatible data types. Upon navigating back to the Data tab, you can also manually blend the datasets by selecting matching columns from each dataset. However, there is no ability to edit any other join properties.

Figure 6 - Existing method of blending data by matching columns from each dataset.

Option 4: Multi Table Datasets

Lastly, let’s look at the newly added feature of OAC 6.0: Multi Table Datasets. Oracle have now made it possible to join several tables to create a new Data Set in DV.

Historically you could create Data Sets from a database connection or upload files from your computer. You can now create a new Data Set and add multiple tables from the same database connection. Oracle has published a list of compatible data sources.

Figure 7 - Data Set editor where joins can be made between numerous datasets and their properties edited.

Once you add your tables DV will automatically populate joins, if possible, on common column names and compatible data types.

The process works similarly to how joins are defined in Data Flows; a pop-up window displays editable properties of the join with the same complexity - the options to change type, columns to match, the operator type relating them, and add join conditions.

Figure 8 - Edit window for editing join properties.

The data preview refreshes upon changes made in the joins, making it easy to see the impact of joins as they are made.

Unlike in the RPD, you do not have to create aliases of tables in order to use them for multiple purposes; you can directly import tables from a database connection, create joins and save this Multi Table Dataset separately to then use it further in a project, for example. So, the original tables you imported will retain their original properties.

If you need to write complex queries you can use the Manual SQL Query editor to create a Data Set, but you can only use the ‘JOIN’ clause.

Figure 9 - Manual Query editor option under a set up database connection.

So, what’s the verdict?

Well, after experimenting with each type of joining method and talking to colleagues with experience, the verdict is: it depends on the use case.

There really is no right or wrong method of joining datasets and each approach has its pros and cons, but I think what matters is evaluating which one would be more advantageous for the particular use case at hand.

Using the RPD is a safer and more robust option: you have control over the data from start to end, so you can reduce the chance of incorrect joins. However, it is considerably slower and may not be feasible if users demand quick results. In this case, using one of the options in DV may seem more beneficial.

You could use Data Flows, either scheduled or run manually, or Multi Table Datasets. Both approaches have less scope for making complex joins than the RPD. You can only join two Data Sets at a time in the traditional data flow, and you need a workaround in DV to join data across database connections and computer-uploaded files; so if time and efficiency are of the essence, these can be a disadvantage.

I would say it’s about striking a balance between turnaround time and quality - of course both good data analysis and good turnaround time are desirable, but when it comes to joining datasets in these platforms it is worth evaluating how the use case will benefit from either end of the spectrum.

Categories: BI & Warehousing
