Depth First Search in Practice


Problem


As Nine has recently merged with Fairfax, our team has a project to migrate Fairfax's legacy Teradata data models to Google BigQuery on our new Google Cloud Platform. Fairfax's team has been using Teradata for a long time and they have a convoluted and complex Teradata Framework called GCFR to maintain their Dimensional data modeling and ETL processes. Their data model consists of 3 layers: source layer, model layer and presentation layer. Most of the transformation logics are sitting inside the model layer. To get rid of the complexity, we have decided to remove model layer and build the presentation layer directly from source layer. One of the challenges is to trace down all the logics within this GCFR framework from presentation to model layer and back to source layer. We can do it manually but it takes an engineer a long time and big effort to manually to trace all related tables and transformation views.



Solution


I think it's a perfect use case to apply Depth First Search algorithm in this scenario to automate this process. Depth first search is like the right hand side of the below picture. If you follow the blue arrow of the picture, it starts from the top node of the binary tree and traverses down to deeper levels' nodes within a branch and then switches to a new branch until it hits the bottom of current branch. Then the traversing process repeats branch by branch. The same traversal logic applies even if it's not a binary tree e.g. multiple branches tree in our case. It is in contrast to Breadth First Search which traverses down to a different level until it finishes traversing each level's nodes across different branches of the tree.

So I wrote a quick python script (code snippet shown below). It starts from the top presentation layer data model e.g. a fact table, and drills down until it "automatically" realises that it hits the bottom i.e. the source table. The programme recursively opens up each transformation table or model table within this GCFR framework and then uses regex to extract out all the consisted/derived tables used to form the final fact table and then prints them out along the way. This programme saves our team huge effort to manually identify all these intermediate transformation tables.



# ...
def main():
    no_of_level= 1
    traced=set()
    print("- "+trace_obj)
    with open(f"{trace_obj}.md", 'a', newline="") as f:
                f.write("- "+trace_obj + "\r\n")
    dive(1,traced, trace_obj)
        
def dive(no_of_level,traced, parent_obj ):
    child_objs = expand_to_child_tables(parent_obj)
    for obj in child_objs:
        if obj not in traced  :
            with open(f"{trace_obj}.md", 'a', newline="") as f:
                f.write(level*(no_of_level)+"- "+obj + (" [mds]" if "1011_" in obj else "")  + "\r\n")
            print(level*(no_of_level)+"- "+obj+ (" [mds]" if "1011_" in obj else "") )
            traced.add(obj)
            dive(no_of_level+1, traced, obj)
        else:
            print(level*(no_of_level)+"- "+obj+ (" [mds]" if "1011_" in obj else "") + " **traced above**")
            with open(f"{trace_obj}.md", 'a', newline="") as f:
                f.write(level*(no_of_level)+"- "+obj + (" [mds]" if "1011_" in obj else "")  + " **traced above**\r\n")

def expand_to_child_tables(obj):
    table_name_without_schema = re.search(r".*\.(.*)",obj ,re.I).group(1)
    df= None
    if re.match(r'^.*_view\..*$', obj ,re.I ) :
        tx_view = re.sub(r"\/\*.*?\*\/", '', "".join(pd.read_sql_query(f"""show view {obj}""", connection).iloc[:,0].to_list())) 
        allmatches= [ i[2].strip().strip(",)(;").replace(" ","") for i in re.findall( r'((left|inner|left outer) join|from|join)[ \r\n]+?(prod_[^\s]+?\. *?[^\s]+?)[ \r\n]', tx_view, re.M|re.I) ]
        
        if len(allmatches)> 0:
            final_list= list(set( filter( lambda a : '.' in a and re.match(r'^[\w\.]+$', a ) is not None, allmatches ) ) )   
            final_list.insert(0,  final_list.pop(final_list.index(allmatches[0]) )  )
        else :
            final_list=[]
    else:
        df = pd.read_sql_query(f"""
        select  ctl_id, in_db_name, in_object_name, target_tableName
        from prod_gcfr_view.GCFR_Process
        where out_object_name like '%{table_name_without_schema}%'
         and stream_key <> '9999'
         and target_tablename = '{table_name_without_schema}'
         order by ctl_id """, connection)
        final_list=  [a+"."+b for a,b in zip ( df["In_DB_Name"].to_list(), df["In_Object_Name"].to_list() ) ]
    
    return list(filter(lambda x: x!='prod_gcfr_view.GCFR_Stream_Id_Log' 
                       and x!="." 
                       and re.match('^.*(BKEY_|BMAP_).*$', obj ,re.I  ) is None 
                       and x!="prod_prstn_trnsfrm_view.ad_order_header_date_1004"
                                     ,final_list ) )


Sample result

The programme saves the file into markdown format so we can expand or collapse relevant tables/views to get a high level understanding of the logic.



A sample result snippet is shown below. This is just a small snippet of one of the many dimension tables e.g. dim_advertiser. As you can see, it's already getting quite messy. If we did it without automation, it would be hard to imagine how long it could take us to figure out all these tables/views relationship.


- prod_prstn_view.dim_advertiser
    - prod_prstn.dim_advertiser
        - prod_prstn_trnsfrm_view.TX_dim_advertiser_genera_780
            - prod_prstn_trnsfrm.dim_advertiser_genera_drv
                - prod_prstn_trnsfrm_view.TX_dim_advertiser_genera_drv_insert_795
                    - prod_mdl_src_view.accounts
                        - prod_mdl.accounts
                            - prod_mdl_trnsfrm_view.TX_accounts_1004_3000_074
                                - prod_srci_src_view.SRCI_1004_3000_account
                                    - prod_srci.SRCI_1004_3000_account
                                        - prod_srci_trnsfrm_view.TX_1004_3000_account
                                            - prod_stg_src_view.STG_1004_3000_account
                                                - prod_stg.STG_1004_3000_account
                                            - prod_ref_src_view.BKEY_5026_ACNT
                                            - prod_ref_src_view.BMAP_Standard_Map
                                            - prod_ref_src_view.BKEY_5915_ITEM
                                            - prod_ref_src_view.BKEY_5942_ITEM_PRICING
                                            - prod_ref_src_view.BKEY_6300_PRTY
                            - prod_mdl_trnsfrm_view.TX_accounts_1005_3048_075
                                - prod_srci_src_view.SRCI_1005_3048_accounts
                                    - prod_srci.SRCI_1005_3048_accounts
                                        - prod_srci_trnsfrm_view.TX_1005_3048_accnt
                                            - prod_stg_src_view.STG_1005_3048_accounts
                                                - prod_stg.STG_1005_3048_accounts
                                            - prod_ref_src_view.BKEY_5026_ACNT **traced above**
                                            - prod_ref_src_view.BMAP_Standard_Map **traced above**
                                            - prod_ref_src_view.BKEY_6300_PRTY **traced above**
                            - prod_mdl_trnsfrm_view.TX_accounts_1006_3509_08B
                                - prod_SRCI_src_view.SRCI_1006_3509_account
                                    - prod_srci.SRCI_1006_3509_account
                                        - prod_srci_trnsfrm_view.TX_1006_3509_account
                                            - prod_stg_src_view.STG_1006_3509_account
                                                - prod_stg.STG_1006_3509_account
                                            - prod_ref_src_view.BKEY_5026_ACNT **traced above**
                                            - prod_ref_src_view.BMAP_STANDARD_MAP
                                            - prod_ref_src_view.BKEY_5112_ADDRESS
                                            - prod_ref_src_view.BKEY_6300_PRTY **traced above**
                            - prod_mdl_trnsfrm_view.TX_account_1011_3170_720 [mds]
                                - prod_srci_src_view.SRCI_1011_3170_vw_Accounts_E_AccountGGP_LeafMembers [mds]
                                    - prod_srci.SRCI_1011_3170_vw_accounts_e_accountggp_leafmembers [mds]
                                        - prod_srci_trnsfrm_view.TX_1011_3170_vw_A_E_AGGP_LM [mds]
                                            - prod_stg_src_view.STG_1011_3170_vw_accounts_e_accountggp_leafmembers [mds]
                                                - prod_stg.STG_1011_3170_vw_accounts_e_accountggp_leafmembers [mds]
                                            - prod_ref_src_view.BMAP_Standard_Map **traced above**
                                            - prod_ref_src_view.BKEY_5026_ACNT **traced above**
                            - prod_mdl_trnsfrm_view.TX_account_1011_3171_719 [mds]
                                - prod_srci_src_view.SRCI_1011_3171_vw_accounts_e_accountgp_leafmembers [mds]
                                    - prod_srci.SRCI_1011_3171_vw_accounts_e_accountgp_leafmembers [mds]
                                        - prod_srci_trnsfrm_view.TX_1011_3171_vw_A_E_AGP_LM [mds]
                                            - prod_stg_src_view.STG_1011_3171_vw_accounts_e_accountgp_leafmembers [mds]
                                                - prod_stg.STG_1011_3171_vw_accounts_e_accountgp_leafmembers [mds]
                                            - prod_ref_src_view.BKEY_5026_ACNT **traced above**
                                            - prod_ref_src_view.BMAP_Standard_Map **traced above**
                            - prod_mdl_trnsfrm_view.TX_accounts_1011_4407_13Q [mds]
                                - prod_srci_src_view.SRCI_1011_4407_vw_accounts_e_corpsubsaccountparent_leafmembers [mds]
                                    - prod_srci.SRCI_1011_4407_vw_accounts_e_corpsubsaccountparent_leafmembers [mds]
                                        - prod_srci_trnsfrm_view.TX_1011_4407_vw_A_E_CSAP_LM [mds]
                                            - prod_stg_src_view.STG_1011_4407_vw_accounts_e_corpsubsaccountparent_leafmembers [mds]
                                                - prod_stg.STG_1011_4407_vw_accounts_e_corpsubsaccountparent_leafmembers [mds]
                                            - prod_ref_src_view.BKEY_5026_ACNT **traced above**
                            - prod_mdl_trnsfrm_view.TX_account_1011_3172_721 [mds]
                                - prod_srci_src_view.SRCI_1011_3172_vw_accounts_e_accountparent_leafmembers [mds]
                                    - prod_srci.SRCI_1011_3172_vw_accounts_e_accountparent_leafmembers [mds]
                                        - prod_srci_trnsfrm_view.TX_1011_3172_vw_A_E_AP_LM [mds]
                                            - prod_stg_src_view.STG_1011_3172_vw_accounts_e_accountparent_leafmembers [mds]
                                                - prod_stg.STG_1011_3172_vw_accounts_e_accountparent_leafmembers [mds]
                                            - prod_ref_src_view.BKEY_5026_ACNT **traced above**
                                            - prod_ref_src_view.BMAP_Standard_Map **traced above**
                                            - prod_ref_src_view.BKEY_7201_ACNT_GRP
                            - prod_mdl_trnsfrm_view.TX_accounts_1014_3247_13W
                                - prod_srci_src_view.SRCI_1014_3247_vw_group_subscriptions
                                    - prod_srci.SRCI_1014_3247_vw_group_subscriptions
                                        - prod_srci_trnsfrm_view.TX_1014_3247_vw_grp_sub
                                            - prod_stg_src_view.STG_1014_3247_vw_group_subscriptions
                                                - prod_stg.STG_1014_3247_vw_group_subscriptions
                                            - prod_ref_src_view.BKEY_5915_ITEM **traced above**
                                            - prod_ref_src_view.BKEY_5026_ACNT **traced above**
                            - prod_mdl_trnsfrm_view.TX_accounts_1015_3383_13X
                                - prod_srci_src_view.SRCI_1015_3383_cmsalessubsource
                                    - prod_srci.SRCI_1015_3383_cmsalessubsource
                                        - prod_srci_trnsfrm_view.TX_1015_3383_cmSalesSubSrc
                                            - prod_stg_src_view.STG_1015_3383_cmsalessubsource
                                                - prod_stg.STG_1015_3383_cmsalessubsource
                                            - prod_ref_src_view.BKEY_5026_ACNT **traced above**
                            - prod_mdl_trnsfrm_view.TX_accounts_1015_3393_0GN
                                - prod_srci_src_view.SRCI_1015_3393_cmsubscription
                                    - prod_srci.SRCI_1015_3393_cmsubscription
                                        - prod_srci_trnsfrm_view.TX_1015_3393_cmSubscription
                                            - prod_stg_src_view.STG_1015_3393_cmsubscription
                                                - prod_stg.STG_1015_3393_cmsubscription
                                            - prod_ref_src_view.BKEY_5026_ACNT **traced above**
                                            - prod_ref_src_view.BMAP_STANDARD_MAP **traced above**
                                            - prod_ref_src_view.BKEY_7315_SUBSCRIPTION_LINE_ITEM
                                            - prod_ref_src_view.BKEY_7256_ITEM_FREQUENCY
                                            - prod_ref_src_view.BKEY_5316_CAMPAIGN
                                            - prod_ref_src_view.BKEY_5915_ITEM **traced above**
                                            - prod_ref_src_view.BKEY_6835_SUBSCRIPTION
                                            - prod_ref_src_view.BKEY_6300_PRTY **traced above**
                            - prod_mdl_trnsfrm_view.TX_accounts_1045_4372_12I
                                - prod_srci_src_view.srci_1045_4372_accounts
                                    - prod_srci.SRCI_1045_4372_accounts
                                        - prod_srci_trnsfrm_view.TX_1045_4372_accounts
                                            - prod_stg_src_view.STG_1045_4372_accounts
                                                - prod_stg.STG_1045_4372_accounts
                                            - prod_ref_src_view.BKEY_5026_ACNT **traced above**
                                            - prod_ref_src_view.BKEY_5112_ADDRESS **traced above**
                                            - prod_ref_src_view.BKEY_5915_ITEM **traced above**
                                            - prod_ref_src_view.BKEY_6835_SUBSCRIPTION **traced above**
                                            - prod_ref_src_view.BKEY_6300_PRTY **traced above**
                                    - prod_ref_src_view.BMAP_STANDARD_MAP **traced above**
                    - prod_mdl_src_view.account_class_xref
                        - prod_mdl.account_class_xref
                            - prod_mdl_trnsfrm_view.TX_account_class_xref_1004_3000_529
                                - prod_srci_src_view.SRCI_1004_3000_account **traced above**
                            - prod_mdl_trnsfrm_view.TX_account_class_xref_1004_3000_522
                                - prod_srci_src_view.SRCI_1004_3000_account **traced above**
                            - prod_mdl_trnsfrm_view.TX_account_class_xref_1004_3000_787
                                - prod_srci_src_view.SRCI_1004_3000_account **traced above**
                            - prod_mdl_trnsfrm_view.TX_account_class_xref_1005_3048_788
                                - prod_srci_src_view.SRCI_1005_3048_accounts **traced above**
                            - prod_mdl_trnsfrm_view.TX_account_class_xref_1006_3509_08G
                                - prod_srci_src_view.SRCI_1006_3509_account
                                    - prod_srci.SRCI_1006_3509_account **traced above**
                            - prod_mdl_trnsfrm_view.TX_account_class_xref_1006_3509_08I
                                - prod_srci_src_view.SRCI_1006_3509_account **traced above**
                            - prod_mdl_trnsfrm_view.TX_account_class_xref_1006_3509_08F
                                - prod_srci_src_view.SRCI_1006_3509_account **traced above**
                            - prod_mdl_trnsfrm_view.TX_account_class_xref_1006_3509_08L
                                - prod_srci_src_view.SRCI_1006_3509_account **traced above**
                            - prod_mdl_trnsfrm_view.TX_account_class_xref_1006_3509_08M
                                - prod_srci_src_view.SRCI_1006_3509_account **traced above**
                            - prod_mdl_trnsfrm_view.TX_account_class_xref_1006_3509_08N
                                - prod_srci_src_view.SRCI_1006_3509_account **traced above**
                            - prod_mdl_trnsfrm_view.TX_account_class_xref_1006_3509_08K
                                - prod_srci_src_view.SRCI_1006_3509_account **traced above**
                            - prod_mdl_trnsfrm_view.TX_account_class_xref_1006_3509_08J
                                - prod_srci_src_view.SRCI_1006_3509_account **traced above**
                            - prod_mdl_trnsfrm_view.TX_account_class_xref_1006_3509_08H
                                - prod_srci_src_view.SRCI_1006_3509_account **traced above**
                            - prod_mdl_trnsfrm_view.TX_account_class_xref_1011_3169_729 [mds]
                                - prod_srci_src_view.SRCI_1011_3169_vw_accounts_e_account_leafmembers [mds]
                                    - prod_srci.SRCI_1011_3169_vw_accounts_e_account_leafmembers [mds]
                                        - prod_srci_trnsfrm_view.TX_1011_3169_vw_A_E_A_LM [mds]
                                            - prod_stg_src_view.STG_1011_3169_vw_accounts_e_account_leafmembers [mds]
                                                - prod_stg.STG_1011_3169_vw_accounts_e_account_leafmembers [mds]
                                            - prod_ref_src_view.BKEY_5026_ACNT **traced above**
                                            - prod_ref_src_view.BMAP_Standard_Map **traced above**
                                            - prod_stg_src_view.STG_1011_3688_vw_accounts_e_industry_leafmembers [mds]
                                                - prod_stg.STG_1011_3688_vw_accounts_e_industry_leafmembers [mds]
                                            - prod_ref_src_view.BKEY_6300_PRTY **traced above**

Depth First Search in Practice
arrow_back

Previous

Trick to use Django to serve React app

Next

Keep track of AWS Athena history using Binary Search Algorithm
arrow_forward