DGC - Sources November2023 ApacheAtlasSources en
November 2023
This software and documentation are provided only under a separate license agreement containing restrictions on use and disclosure. No part of this document may be
reproduced or transmitted in any form, by any means (electronic, photocopying, recording or otherwise) without prior consent of Informatica LLC.
U.S. GOVERNMENT RIGHTS Programs, software, databases, and related documentation and technical data delivered to U.S. Government customers are "commercial
computer software" or "commercial technical data" pursuant to the applicable Federal Acquisition Regulation and agency-specific supplemental regulations. As such,
the use, duplication, disclosure, modification, and adaptation is subject to the restrictions and license terms set forth in the applicable Government contract, and, to the
extent applicable by the terms of the Government contract, the additional rights set forth in FAR 52.227-19, Commercial Computer Software License.
Informatica, Informatica Cloud, Informatica Intelligent Cloud Services, PowerCenter, PowerExchange, and the Informatica logo are trademarks or registered trademarks
of Informatica LLC in the United States and many jurisdictions throughout the world. A current list of Informatica trademarks is available on the web at https://www.informatica.com/trademarks.html. Other company and product names may be trade names or trademarks of their respective owners.
Portions of this software and/or documentation are subject to copyright held by third parties. Required third party notices are included with the product.
The information in this documentation is subject to change without notice. If you find any problems in this documentation, report them to us at
[email protected].
Informatica products are warranted according to the terms and conditions of the agreements under which they are provided. INFORMATICA PROVIDES THE
INFORMATION IN THIS DOCUMENT "AS IS" WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING WITHOUT LIMITATION ANY WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND ANY WARRANTY OR CONDITION OF NON-INFRINGEMENT.
Preface
Read Apache Atlas Sources to learn how to register and configure Apache Atlas sources in Metadata
Command Center as catalog sources. After you configure a catalog source, you extract metadata and then
view the results in Data Governance and Catalog.
Chapter 1
A source system is any system that contains data or metadata. For example, Apache Atlas is a source
system from which you can extract metadata through an Apache Atlas catalog source with Metadata
Command Center. A catalog source is an object that represents and contains metadata from the source
system.
Before you extract metadata from a source system, you first create and register a catalog source that
represents the source system. You can configure capabilities that represent tasks that the catalog source
can perform.
When Metadata Command Center extracts metadata, Data Governance and Catalog displays the extracted
metadata and its attributes as technical assets. You can then perform tasks such as analyzing the assets,
viewing lineage, and creating links between those assets and their business context.
The following image shows the process to extract metadata from an Apache Atlas source system:
After you verify prerequisites, perform the following tasks to extract metadata from Apache Atlas:
1. Register a catalog source. Create a catalog source object, select the source system, and specify values
for connection properties.
2. Configure the catalog source. Specify the runtime environment, optionally configure parameters for the
metadata extraction capability, and add filters for metadata extraction.
3. Associate stakeholders. Optionally, associate users with technical assets, giving the users permission to
perform actions determined by their roles.
4. Run or schedule the catalog source job.
5. Optionally, assign a connection to referenced source system assets.
After you run the catalog source job, you view the results in Data Governance and Catalog.
Apache Atlas is the governance and metadata framework for Hadoop. Apache Atlas has a scalable and
extensible architecture that can be plugged into many Hadoop components to manage their metadata in a
central repository.
Extracted metadata
You can extract metadata from an Apache Atlas source system.
Objects extracted
Metadata Command Center extracts the following metadata from an Apache Atlas source system:
• Atlas Server
• Hive Process
• Sqoop Process
• Calculation
Note: Calculation objects are extracted when there is column-level lineage from one asset to another in
Hive and Sqoop processes.
• Spark Application
• Spark Process
The Apache Atlas catalog source extracts data lineage from the following data sources:
• Oracle
• MySQL
• PostgreSQL
• Apache Hive
• Hadoop Distributed File System (HDFS)
• Apache HBase
The Apache Atlas catalog source extracts lineage for Hive processes with the following operation types:
• CREATETABLE
• CREATEVIEW
• CREATE_MATERIALIZED_VIEW
Metadata Command Center extracts folders as reference objects from Hadoop Distributed File System.
Metadata Command Center extracts the following objects as reference objects from Apache Hive:
• Schema
• Table
• View
• External Table
• Column
Field and column objects are extracted when there is column-level lineage from one asset to another in
Apache Atlas.
Chapter 2
Before you extract metadata from an Apache Atlas source system, complete the following prerequisite tasks:
• Verify authentication.
• Verify permissions.
• Import the SSL certificate to the JRE folder of the Informatica Secure Agent.
• Get Apache Atlas source information.
Verify authentication
To extract Apache Atlas metadata, verify that you have the URL to access Apache Atlas and connect to the
Atlas REST API.
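For example, you can send a request to the Atlas REST API to confirm that the URL is reachable. The following command is a minimal sketch that assumes the default Atlas server port 21000, a placeholder host name, and a valid Kerberos ticket in the credential cache:
# Returns a JSON response with the Atlas version if the URL and authentication are correct
curl --negotiate -u : "https://<atlas_host>:21000/api/atlas/admin/version"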
You need to provide the Kerberos principal for authentication when you configure the Apache Atlas catalog
source in Metadata Command Center.
• Add the details of the Kerberos server to the hosts file on the Secure Agent machine in the following format: <ip_address> <hostname>
On a Windows machine, the hosts file is available in the following path: C:\Windows\System32\drivers\etc\hosts
On a Linux machine, the hosts file is available in the following path: /etc/hosts
• Copy the Atlas Keytab file from the Hadoop cluster to any location on the Secure Agent machine.
• Enable the Atlas hook in the Hadoop Distributed File System and Apache Hive configurations so that Apache Atlas can read the metadata.
• Copy the Kerberos configuration file from the Hadoop cluster to any location on the Secure Agent machine. You can modify the Kerberos configuration file as required.
The following code shows a sample Kerberos configuration file:
[libdefaults]
default_realm = *****
dns_lookup_kdc = false
dns_lookup_realm = false
ticket_lifetime = 86400
renew_lifetime = 604800
forwardable = true
default_tgs_enctypes = rc4-hmac
default_tkt_enctypes = rc4-hmac
permitted_enctypes = rc4-hmac
udp_preference_limit = 1
kdc_timeout = 3000
allow_weak_crypto = true

[realms]
<domain name> = {
    kdc = *****
    admin_server = *****
}

[domain_realm]
Note: If the Kerberos encryption algorithms are not compatible with Java Standard Edition version 11, you
can add the allow_weak_crypto=true property in the Kerberos configuration file.
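Optionally, you can confirm that the keytab file and the Kerberos configuration file work together before you configure the catalog source. The following commands are a minimal sketch; the file paths, principal, and realm are placeholders:
# Point the Kerberos tools at the copied configuration file
export KRB5_CONFIG=/opt/infa/krb5.conf
# Obtain a ticket with the copied keytab file
kinit -kt /opt/infa/atlas.service.keytab atlas/<hostname>@<REALM>
# List the acquired ticket to confirm that authentication succeeded
klist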
Verify permissions
Verify that you have the following account and permissions:
• A user account to access and extract metadata from the Apache Atlas source system.
• Read permission for the account to access the Apache Atlas source system.
5. Run the following command to import the SSL certificate:
keytool -import -alias <alias name> -keystore <path to cacert file> -file <absolute path to SSL certificate>
Note: The Java certificate file is named cacerts and is located in the following Java directory: \jre\lib\security\cacerts
For example, you can run the following command on Windows operating systems:
keytool -import -alias aliasname -keystore "C:\data\devprod\jdk\jre\lib\security\cacerts" -file "C:\data\devprod\filename.crt"
Note: You don't need to create a connection object for Apache Atlas. You provide this information when you
configure the catalog source.
Get the following property values:
• Base URL. URL to access Apache Atlas and connect to the Atlas REST API.
• Keytab File Path. The absolute path to the Kerberos keytab file located on the Secure Agent machine, used for authentication.
• Configuration File Path. The absolute path to the Kerberos configuration file located on the Secure Agent machine, used for authentication.
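For example, the property values might look like the following. The host name and file paths are placeholders, and 21000 is the default Atlas server port:
Base URL: https://atlas.example.com:21000
Keytab File Path: /opt/infa/atlas.service.keytab
Configuration File Path: /opt/infa/krb5.conf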
When you configure a catalog source, you define the source system from which you want to extract
metadata. Configure filters to include or exclude source system metadata before you run the job.
To provide stakeholders access to technical assets, you can assign access through roles. To view lineage for
any system that the source system references, create a catalog source and a connection associated with the
referenced source system after you run the job.
3. Click New from the menu.
4. Select Catalog Source from the list of asset types.
5. Select Apache Atlas from the list of source systems.
6. Click Create.
The following image shows the Apache Atlas registration information:
7. In the General Information area, enter a name and an optional description for the catalog source.
Note: After you create a catalog source, you can't change the name.
8. In the Connection Information area, enter the Apache Atlas connection information based on the
connection values that you got from the administrator.
Enter the following properties:
• Base URL. URL to access Apache Atlas and connect to the Atlas REST API.
• Keytab File Path. The absolute path to the Kerberos keytab file located on the Secure Agent machine, used for authentication.
• Configuration File Path. The absolute path to the Kerberos configuration file located on the Secure Agent machine, used for authentication.
9. Click Next.
The Configuration page appears.
The metadata extraction capability extracts source metadata from external source systems.
Before you configure metadata extraction, configure runtime environments in the Informatica Intelligent
Cloud Services Administrator.
1. In the Connection and Runtime area, choose the Secure Agent group where you want to run catalog
source jobs.
2. Use the Metadata Change Option to choose whether the catalog retains or deletes objects that are deleted from the source:
• Retain. The catalog retains objects that are deleted from the source. If you update or add a filter, the catalog retains objects extracted from the previous job and extracts additional objects that match the current filter. Objects deleted from the source are not deleted from the catalog. Enrichments added to deleted objects and relationships are retained.
• Delete. The catalog deletes metadata based on objects deleted from the source and on changes that you make to the filter. Enrichments added to deleted objects and relationships are permanently lost. Objects renamed in the source are removed and recreated in the catalog.
Note: You can also change the configured metadata change option when you run a catalog source.
3. In the Filters area, define one or more filter conditions to apply for metadata extraction:
a. From the Include or Exclude metadata list, choose to include or exclude metadata based on the filter
parameters.
b. From the Object type list, select Hive Database, HDFS Path, or HBase Namespace.
c. Enter the filter values.
Filters can contain the following wildcards:
• Asterisk (*). Represents one or more characters.
• Question mark (?). Represents a single character.
Exclude filter conditions are applied only if the assets in the include filter conditions are not related or linked through lineage to the excluded assets. For example, add a filter condition to include metadata for all tables named EMP across all databases (*.EMP), and then add another filter condition to exclude metadata for the EMP table in the HR database (HR.EMP). Here, the exclude filter condition is applied because the assets are not related or linked through lineage.
Exclude filter conditions are not applied if the assets in the include filter conditions are related or linked through lineage to the excluded assets. For example, add a filter condition to include metadata for the EMP table in the HR database (HR.EMP), and then add another filter condition to exclude metadata for the SAL table in the same database (HR.SAL). Here, the exclude filter condition is not applied because of the lineage links between the EMP and SAL tables.
If you add a filter condition to include metadata from a table deleted from the Apache Atlas source
system, Metadata Command Center ignores the filter condition.
If the value of the HDFS Path filter contains special characters, replace the special characters with an
asterisk wildcard character. For example, replace /Test$~^!()*<>_Folder with /Test*Folder.
4. In the Configuration Parameters area, enter configuration properties.
Note: Click Show Advanced to view all configuration parameters.
• Lineage Direction. The direction of data flow between assets that you extract from Apache Atlas, set with the direction parameter of the Lineage REST API. Select one of the following options:
- BOTH. Extracts both input and output data flow between assets.
- INPUT. Extracts only input data flow between assets.
- OUTPUT. Extracts only output data flow between assets.
• Lineage Depth. The number of lineage hops to extract from Apache Atlas for filtered assets, set with the depth parameter of the Lineage REST API. Default is 3.
• Page Result Limit. Advanced parameter. The maximum number of search result entries per page in a fetch that uses the limit parameter of the Discovery REST API. Default is 1000.
• Entity Bulk Fetch Count. Advanced parameter. The maximum number of entities to include in a bulk fetch when you use the Bulk Entity REST API. Default is 100.
• Connection Timeout. Advanced parameter. The maximum amount of time, in milliseconds, that the Secure Agent waits to set up an HTTP connection to communicate with the Apache Atlas server and get a response. Default is -1, which means that the timeout is disabled.
• Parallel Lineage Fetch Count. Advanced parameter. The maximum number of Lineage REST API calls that can run simultaneously to retrieve lineage data. Default is 5.
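These parameters correspond to query parameters of the Apache Atlas REST APIs. The following commands are a sketch of how the Lineage Direction, Lineage Depth, and Page Result Limit values map to REST calls; the host name and asset GUID are placeholders:
# Lineage call that mirrors Lineage Direction = BOTH and Lineage Depth = 3
curl --negotiate -u : "https://<atlas_host>:21000/api/atlas/v2/lineage/<guid>?direction=BOTH&depth=3"
# Basic search call that mirrors Page Result Limit = 1000
curl --negotiate -u : "https://<atlas_host>:21000/api/atlas/v2/search/basic?typeName=hive_table&limit=1000"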
5. Optionally, enter expert parameters:
• Expert Parameters. Additional configuration options to pass at runtime. Required if you need to troubleshoot the catalog source job.
Caution: Use expert parameters only when Informatica Global Customer Support recommends them.
6. Click Next.
The Associations page appears.
Verify that the organization administrator assigned users and user groups to the role that you want to
associate with technical assets.
4. Select one or more users or user groups to assign as stakeholders for the technical assets, and click OK.
Only the selected users and user groups belonging to the specified role are granted the role-defined
permissions to technical assets.
5. You can assign more than one role to technical assets and add users and user groups from each role. To
assign more roles, click the add button.
6. Choose to save and run the job or to schedule a recurring job.
• To save and run the job, click Save and then Run.
• To schedule a recurring job, click Next to open the Schedule page.
The first time that you run the job, Metadata Command Center extracts the metadata. Subsequently, each
time that you run the job, Metadata Command Center synchronizes the catalog with the source system.
1. On the Schedule page, click the Run on Schedule checkbox to schedule the job.
The Schedule configuration page opens.
2. Enter the start date, time zone, and the interval at which you want to run the job.
3. Click Save to save the schedule.
Before you assign a connection, create a catalog source for each reference source system and run the
catalog source job.
Note: You can view the lineage with reference objects without creating a connection assignment. After
connection assignment, you can view the actual objects.
Apache Atlas uses Sqoop queries to import data from a reference source system to a Hive database. If the
Sqoop query contains double quotes, replace the double quotes with backticks (`) in the Apache Atlas source
system to view the reference objects correctly in Data Governance and Catalog.
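For example, a hypothetical Sqoop import with a free-form query would change as follows. The connection string, database, table, and column names are placeholders:
# Before: double quotes around identifiers in the query
sqoop import --connect "jdbc:mysql://<host>/hr" --query 'SELECT "EMP_ID", "EMP_NAME" FROM EMP WHERE $CONDITIONS' --target-dir /user/hive/emp --split-by EMP_ID
# After: backticks in place of double quotes
sqoop import --connect "jdbc:mysql://<host>/hr" --query 'SELECT `EMP_ID`, `EMP_NAME` FROM EMP WHERE $CONDITIONS' --target-dir /user/hive/emp --split-by EMP_ID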
3. In the Assign Connection dialog box, select one or more endpoint objects to assign to the selected
connection and click Assign.
You can filter the list in the Assign Connection dialog box by name, type, or endpoint.
You can create a connection assignment to the following catalog source types:
• Apache Hive. The target endpoint object must belong to the Database class type.
• Hadoop Distributed File System. The target endpoint object must belong to the File System class
type.
• Oracle. The target endpoint object must belong to the Database class type.
• MySQL. The target endpoint object must belong to the Database class type.
• PostgreSQL. The target endpoint object must belong to the Database class type.
Note: You can assign connections to Oracle, MySQL, and PostgreSQL catalog sources only when
Metadata Command Center extracts Sqoop processes from an Apache Atlas source system.
When you click Assign, Metadata Command Center creates links between matching objects in the
connected catalog sources, and it calculates the percentage of matched and unmatched objects. The
higher the percentage of matched objects, the more accurate the lineage that you view in Data
Governance and Catalog.
When referenced source systems are connected to a catalog source, you can expand the hierarchy to see
details about the technical asset's component elements.
You can view the data lineage of an asset contained within a catalog source to see individual elements such
as data sources, calculations, and filters. When you view data lineage, you can see the individual upstream
elements that contribute data or expressions to each component of a data flow or catalog source.
3. On the Data Governance and Catalog home page, click the number in the Technical Assets panel.
The Technical Assets page opens.
4. Select Catalog Source in the Filter list.
The list of catalog sources opens.
5. Search for the catalog source from which you extracted metadata, and click the name.
The Overview tab of the asset opens.
The following image shows a sample asset page:
Data lineage is a visual representation of the flow of data across the systems in your organization. Lineage
depicts how the data flows from the system of its origin to the system of its destination.
To view data lineage at the source or target level, search for and open a technical asset, and then click the
Lineage tab.
The following image shows how the LOAD Hive process loads data from the ORDERS.avro reference source
file to the avro_load reference target table before connection assignment:
The following image shows how the LOAD Hive process loads data from the ORDERS.avro actual source file
to the avro_load actual target table after connection assignment:
Data sets are technical assets that contain sets of data. Examples include files, databases, or temp files that
hold the results of calculations. Data elements are objects upstream or downstream of a data set, and are
accessible when you expand a data set to the data element level. For example, a column in a source object.
The following image shows table-level lineage where the avro_table referenced table gets data from the
csv_format_parent referenced table after data transformation using the storageformats.avro Hive process
before connection assignment:
The following image shows table-level lineage where the avro_table actual table gets data from the
csv_format_parent actual table after data transformation using the storageformats.avro Hive process after
connection assignment:
The following image shows column-level lineage where the col_decimal referenced column of the avro_table
gets data from the col_decimal referenced column of the csv_format_parent table after data transformation
using the storageformats.avro Hive process before connection assignment:
The following image shows column-level lineage where the col_decimal actual column of the avro_table gets
data from the col_decimal actual column of the csv_format_parent table after data transformation using the
storageformats.avro Hive process after connection assignment: