relationships Package

Classes

Multiplicity

PowerBI relationship cardinality descriptor stored in FabricDataFrame.

Functions

find_relationships

Suggest possible relationships based on coverage threshold.

By default, include_many_to_many is False, which is the most common case. Generated relationships are m:1 (i.e. the "to" attribute is the primary key) and also include 1:1 relationships.

If include_many_to_many is set to True (the less common case), the function also searches for many-to-many relationships. The results are a superset of the default m:1 case.

Empty dataframes are not considered for relationships.
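The m:1, 1:1 and m:m labels above follow from whether each side of a candidate relationship holds unique values. A minimal sketch of that idea, assuming plain Python lists stand in for dataframe columns; classify_multiplicity is a hypothetical helper for illustration, not part of this package:

```python
def classify_multiplicity(from_values, to_values):
    """Label a column pair by the uniqueness of each side (illustrative only)."""
    # A side counts as "1" when none of its values repeat, i.e. it could be a key.
    from_unique = len(set(from_values)) == len(from_values)
    to_unique = len(set(to_values)) == len(to_values)
    left = "1" if from_unique else "m"
    right = "1" if to_unique else "m"
    return f"{left}:{right}"


orders_customer_id = [1, 1, 2, 3, 3]  # repeated values: the "many" side
customers_id = [1, 2, 3]              # unique values: a primary key

print(classify_multiplicity(orders_customer_id, customers_id))        # m:1
print(classify_multiplicity(customers_id, customers_id))              # 1:1
print(classify_multiplicity(orders_customer_id, orders_customer_id))  # m:m
```

With include_many_to_many left at False, only pairs of the first two kinds would be reported; how the library itself detects keys is not specified here.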

find_relationships(tables: Dict[str, DataFrame] | List[DataFrame], coverage_threshold: float = 1.0, name_similarity_threshold: float = 0.8, exclude: List[Tuple[str]] | DataFrame | None = None, include_many_to_many: bool = False, verbose: int = 0) -> DataFrame

Parameters

Name Description
tables
Required

A dictionary that maps table names to the dataframes with table content. If a list of dataframes is provided, the function will try to infer the names from the session variables and if it cannot, it will use the positional index to describe them in the results.

coverage_threshold

The minimum coverage required to report a potential relationship. Coverage is the ratio of unique values in the "from" column that are found in (covered by) the values of the "to" (key) column.

Default value: 1.0
name_similarity_threshold

Minimum similarity of column names required before a pair of columns is analyzed for a relationship. A value of 0 means that any two columns are considered. A value of 1 means that only columns whose names match exactly are considered.

Default value: 0.8
exclude

A dataframe with relationships to exclude. It should contain the columns "From Table", "From Column", "To Table", "To Column", matching the output of find_relationships.

Default value: None
include_many_to_many

Whether to also search for m:m relationships.

Default value: False
verbose
int

Verbosity. 0 means no verbosity.

Default value: 0
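The two thresholds above can be illustrated with a small dependency-free sketch. Plain lists stand in for dataframe columns, and both helpers are hypothetical. The coverage function follows the definition given for coverage_threshold; the name similarity metric used by the library is not documented here, so difflib.SequenceMatcher is used purely as a stand-in assumption:

```python
from difflib import SequenceMatcher


def coverage(from_values, to_values):
    """Share of unique "from" values that also occur in the "to" column."""
    from_unique = set(from_values)
    if not from_unique:
        return 0.0  # assumption: an empty column covers nothing
    return len(from_unique & set(to_values)) / len(from_unique)


def name_similarity(a, b):
    """Case-insensitive similarity in [0, 1]; a stand-in, not the library's metric."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


orders_customer_id = [1, 1, 2, 3, 4]
customers_id = [1, 2, 3]

# 3 of the 4 unique "from" values (1, 2, 3) appear in the key column.
print(coverage(orders_customer_id, customers_id))   # 0.75
print(name_similarity("CustomerID", "customerid"))  # 1.0
```

Under this reading, the pair above would be reported with coverage_threshold=0.75 but not with the default of 1.0.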

Returns

Type Description

A dataframe with candidate relationships identified by: from_table, from_column, to_table, to_column. Also provides auxiliary statistics to help with evaluation. If no suitable candidates are found, returns an empty DataFrame.
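Because exclude uses the same four columns as the result, previously reviewed candidates can be filtered out on later runs. A minimal sketch of that matching, assuming each row is represented as a plain dict keyed by the documented column names; drop_excluded is a hypothetical helper and not how the library necessarily implements exclusion:

```python
KEY_COLUMNS = ("From Table", "From Column", "To Table", "To Column")


def drop_excluded(candidates, exclude):
    """Keep only candidate rows whose four key fields are not listed in exclude."""
    excluded = {tuple(row[c] for c in KEY_COLUMNS) for row in exclude}
    return [
        row for row in candidates
        if tuple(row[c] for c in KEY_COLUMNS) not in excluded
    ]


candidates = [
    {"From Table": "orders", "From Column": "customer_id",
     "To Table": "customers", "To Column": "id"},
    {"From Table": "orders", "From Column": "product_id",
     "To Table": "products", "To Column": "id"},
]
exclude = [
    {"From Table": "orders", "From Column": "product_id",
     "To Table": "products", "To Column": "id"},
]

kept = drop_excluded(candidates, exclude)
print([row["From Column"] for row in kept])  # ['customer_id']
```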

list_relationship_violations

Validate whether the content of the tables matches the given relationships.

The function examines results of joins for provided relationships and searches for inconsistencies with the specified relationship multiplicity.

Relationships from empty tables (dataframes) are assumed to be valid.

list_relationship_violations(tables: Dict[str, DataFrame] | List[DataFrame], relationships: DataFrame, missing_key_errors='raise', coverage_threshold: float = 1.0, n_keys: int = 10) -> DataFrame

Parameters

Name Description
tables
Required

A dictionary that maps table names to the dataframes with table content. If a list of dataframes is provided, the function will try to infer the names from the session variables and if it cannot, it will use the positional index to describe them in the results.

relationships
Required

A dataframe with relationships to use for validation. It should contain the columns "Multiplicity", "From Table", "From Column", "To Table", "To Column", matching the output of find_relationships.

missing_key_errors
str

One of 'raise', 'warn', 'ignore'. Action to take when a table or column referenced by a relationship is not found in the tables argument.

Default value: 'raise'
coverage_threshold

Fraction of rows in the "from" table that must find a match in an inner join with the "to" table.

Default value: 1.0
n_keys
int

Number of missing keys to report. The reported keys may be a random selection of the missing ones.

Default value: 10

Returns

Type Description

Dataframe with relationships, error type and error message. If there are no violations, returns an empty DataFrame.
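The kinds of checks described above can be sketched without any dataframe library. The helpers below are hypothetical and only illustrate the ideas: the fraction of "from" rows that would find a match in an inner join, a capped sample of missing keys, and duplicate values on the "to" side that would contradict an m:1 multiplicity. Plain lists stand in for columns, and the sampling strategy is an assumption:

```python
import random
from collections import Counter


def joined_fraction(from_values, to_values):
    """Fraction of "from" rows whose key has a match in the "to" column."""
    if not from_values:
        return 1.0  # empty tables are treated as valid, as stated above
    to_set = set(to_values)
    return sum(v in to_set for v in from_values) / len(from_values)


def missing_keys(from_values, to_values, n_keys=10, seed=0):
    """Up to n_keys "from" values with no match; sampled when there are more."""
    missing = sorted(set(from_values) - set(to_values))
    if len(missing) <= n_keys:
        return missing
    return sorted(random.Random(seed).sample(missing, n_keys))


def duplicate_keys(to_values):
    """Values that repeat on the "to" side, which would break an m:1 relationship."""
    return sorted(v for v, n in Counter(to_values).items() if n > 1)


orders_customer_id = [1, 1, 2, 5, 6]
customers_id = [1, 2, 3, 3]

print(joined_fraction(orders_customer_id, customers_id))  # 0.6
print(missing_keys(orders_customer_id, customers_id))     # [5, 6]
print(duplicate_keys(customers_id))                       # [3]
```

In this example a coverage_threshold of 1.0 would flag the relationship, since only 3 of the 5 "from" rows match.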

plot_relationship_metadata

Plot a graph of relationships based on metadata contained in the provided dataframe.

The input "metadata" dataframe should contain one row per relationship. Each row names the "from" and "to" table/columns that participate in the relationship, and their multiplicity as defined by Multiplicity.

plot_relationship_metadata(metadata_df: DataFrame, tables: Dict[str, DataFrame] | List[DataFrame] | None = None, include_columns: str = 'keys', missing_key_errors='raise', *, graph_attributes: Dict | None = None) -> Digraph

Parameters

Name Description
metadata_df

A "metadata" dataframe with relationships to plot. It should contain the columns "Multiplicity", "From Table", "From Column", "To Table", "To Column", matching the output of find_relationships.

tables

Needs to be provided only when include_columns = 'all'. It is used to map table names from the relationships to the dataframe columns.

Default value: None
include_columns
str

One of 'keys', 'all', 'none'. Indicates which columns should be included in the graph.

Default value: 'keys'
missing_key_errors
str

One of 'raise', 'warn', 'ignore'. Action to take when a table or column referenced by a relationship is not found in the tables argument.

Default value: 'raise'
graph_attributes

Attributes passed to graphviz. Note that all values need to be strings. Useful attributes are:

  • rankdir: "TB" (top-bottom) or "LR" (left-right)

  • dpi: "100", "30", etc. (dots per inch)

  • splines: "ortho", "compound", "line", "curved", "spline" (line shape)

Default value: None

Returns

Type Description

Graph object containing all relationships. When columns are included (see include_columns), they are represented as ports in the graph.
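To show how relationship metadata and string-valued graph attributes map onto a Graphviz graph, here is a rough sketch that builds DOT source text by hand. It is not the library's implementation: to_dot is a hypothetical helper, rows are plain dicts keyed by the documented column names, and ports are left out for brevity:

```python
def to_dot(relationships, graph_attributes=None):
    """Build DOT source with one edge per relationship (illustrative only)."""
    lines = ["digraph relationships {"]
    # Graph-level attributes such as rankdir; all values are strings.
    for key, value in (graph_attributes or {}).items():
        lines.append(f'  {key}="{value}";')
    for rel in relationships:
        src, dst = rel["From Table"], rel["To Table"]
        label = f'{rel["From Column"]} -> {rel["To Column"]} ({rel["Multiplicity"]})'
        lines.append(f'  "{src}" -> "{dst}" [label="{label}"];')
    lines.append("}")
    return "\n".join(lines)


relationships = [
    {"Multiplicity": "m:1", "From Table": "orders", "From Column": "customer_id",
     "To Table": "customers", "To Column": "id"},
]

dot_source = to_dot(relationships, graph_attributes={"rankdir": "LR"})
print(dot_source)
```

The resulting text can be rendered by any Graphviz tool; plot_relationship_metadata returns a Digraph object instead of raw text.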