FabricDataFrame Class

A dataframe for storage and propagation of PowerBI metadata.

The elements of column_metadata can contain the following keys:

  • table: table name in originating dataset

  • column: column name

  • dataset: originating dataset name

  • workspace_id: string form of workspace GUID

  • workspace_name: friendly name of originating workspace

  • description: description of column (if one is present)

  • data_type: PowerBI data type for this column

  • data_category: PowerBI data category for this column

  • alignment: PowerBI visual alignment for this column
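As an illustration of the keys listed above, the following sketch shows what one metadata entry might look like. It assumes that column_metadata is keyed by dataframe column name, which the text above does not state explicitly, and every value shown (table, dataset, workspace, type names) is hypothetical.

```python
# Hypothetical metadata for a single column; all values are made up for illustration.
# Assumption: the outer dict is keyed by the dataframe column name.
column_metadata = {
    "Total Sales": {
        "table": "Sales",
        "column": "Total Sales",
        "dataset": "Retail Analysis",
        "workspace_id": "00000000-0000-0000-0000-000000000000",
        "workspace_name": "My workspace",
        "description": "Sum of sales amount",
        "data_type": "Double",
        "data_category": "Uncategorized",
        "alignment": "Right",
    }
}

# Any key may be absent (for example, description), so look keys up defensively.
for name, meta in column_metadata.items():
    print(name, "->", meta.get("data_type", "<unknown>"))
```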

Inheritance
sempy.functions._dataframe._sdataframe._SDataFrame
FabricDataFrame

Constructor

FabricDataFrame(data: ndarray | Iterable | dict | DataFrame | None = None, *args: Any, column_metadata: Dict[str, Any] | None = None, dataset: str | UUID | None = None, workspace: str | UUID | None = None, verbose: int = 0, **kwargs: Any)

Parameters

Name Description
data

Dict can contain Series, arrays, constants, dataclass or list-like objects. If data is a dict, column order follows insertion order. If a dict contains Series that have an index defined, the data is aligned by that index. This alignment also occurs if data is itself a Series or a DataFrame.

If data is a list of dicts, column order follows insertion order.

Default value: None
*args
Required

Remaining arguments to be passed to standard pandas constructor.

column_metadata

Information about dataframe columns to be stored and propagated.

Default value: None
dataset
str or UUID

Name or UUID of the dataset to list the measures for.

Default value: None
workspace
str or UUID

The Fabric workspace name, or a UUID object containing the workspace ID. Defaults to None, which resolves to the workspace of the attached lakehouse or, if no lakehouse is attached, to the workspace of the notebook.

Default value: None
verbose
int

Verbosity. 0 means no verbosity.

Default value: 0
**kwargs
Required

Remaining kwargs to be passed to standard pandas constructor.

Keyword-Only Parameters

Name Description
column_metadata
Default value: None
dataset
Default value: None
workspace
Default value: None
verbose
Default value: 0

Methods

add_measure

Join measures from the same dataset to the dataframe.

drop_dependency_violations

Drop rows that violate a given functional constraint.

find_dependencies

Detect functional dependencies between the columns of a dataframe.

list_dependency_violations

Show violating values assuming a functional dependency.

plot_dependency_violations

Show functional dependency violations in graphical format.

to_lakehouse_table

Write the data to OneLake as a Delta table with VOrdering enabled.

to_parquet

Write the DataFrame, including its metadata, to a Parquet file at the given path using Arrow.

add_measure

Join measures from the same dataset to the dataframe.

add_measure(*measures: List[str], dataset: str | UUID | None = None, workspace: str | UUID | None = None, use_xmla: bool = False, verbose: int = 0) -> FabricDataFrame

Parameters

Name Description
*measures
Required

List of measure names to join.

dataset
str or UUID

Name or UUID of the dataset to list the measures for. If not provided it will be auto-resolved from column metadata.

Default value: None
workspace
str or UUID

The Fabric workspace name, or a UUID object containing the workspace ID. Defaults to None, which resolves to the workspace of the attached lakehouse or, if no lakehouse is attached, to the workspace of the notebook.

Default value: None
use_xmla

Whether to use XMLA as the backend for the client. If there are any issues using the default client, set this argument to True.

Default value: False
verbose
int

Verbosity. 0 means no verbosity.

Default value: 0

Returns

Type Description

A new FabricDataFrame with the joined measures.

drop_dependency_violations

Drop rows that violate a given functional constraint.

Enforces a functional constraint between the determinant and dependent columns provided. For each value of the determinant, the most common value of the dependent is picked, and all rows with other values are dropped. For example, given:

ZIP    CITY
12345  Seattle
12345  Boston
12345  Boston
98765  Baltimore
00000  San Francisco

The row with CITY=Seattle would be dropped, and the functional dependency ZIP -> CITY holds in the output.
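The rule described above can be sketched in plain Python. This is only an illustration of the stated behavior, not the library's implementation; the helper name drop_violations is hypothetical, and ties are broken by first occurrence here because the text does not say how the library breaks them.

```python
from collections import Counter, defaultdict

def drop_violations(rows, determinant, dependent):
    """Keep only rows whose dependent value is the most common one for their determinant."""
    counts = defaultdict(Counter)
    for row in rows:
        counts[row[determinant]][row[dependent]] += 1
    # Assumption: ties go to the value seen first; the library's tie-breaking is not documented here.
    winner = {key: counter.most_common(1)[0][0] for key, counter in counts.items()}
    return [row for row in rows if row[dependent] == winner[row[determinant]]]

rows = [
    {"ZIP": "12345", "CITY": "Seattle"},
    {"ZIP": "12345", "CITY": "Boston"},
    {"ZIP": "12345", "CITY": "Boston"},
    {"ZIP": "98765", "CITY": "Baltimore"},
    {"ZIP": "00000", "CITY": "San Francisco"},
]

cleaned = drop_violations(rows, "ZIP", "CITY")
print(len(cleaned))  # 4: only the Seattle row is removed
```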

drop_dependency_violations(determinant_col: str, dependent_col: str, verbose: int = 0) -> FabricDataFrame

Parameters

Name Description
determinant_col
Required
str

Determining column name.

dependent_col
Required
str

Dependent column name.

verbose
int

Verbosity: 0 shows no messages, 1 shows the number of dropped rows, and values greater than 1 show the full content of the dropped rows.

Default value: 0

Returns

Type Description

New dataframe with constraint determinant -> dependent enforced.

find_dependencies

Detect functional dependencies between the columns of a dataframe.

Columns that map 1:1 will be represented as a list.

Uses a threshold on conditional entropy to discover approximate functional dependencies. Low conditional entropy means strong dependence (i.e. conditional entropy of 0 means complete dependence). Therefore a lower threshold is more selective.
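To make the threshold concrete, here is a minimal sketch of conditional entropy H(dependent | determinant) computed from value pairs. It is not the library's code: the function name is hypothetical, and the use of the natural logarithm is an assumption, since the text does not state the logarithm base.

```python
import math
from collections import Counter

def conditional_entropy(pairs):
    """H(dependent | determinant) in nats for a list of (determinant, dependent) pairs."""
    n = len(pairs)
    joint = Counter(pairs)
    marginal = Counter(a for a, _ in pairs)
    # H(B|A) = sum over (a, b) of p(a, b) * log(p(a) / p(a, b))
    return sum((c / n) * math.log(marginal[a] / c) for (a, _), c in joint.items())

# Complete dependence: every ZIP maps to exactly one city.
exact = [("12345", "Boston"), ("12345", "Boston"), ("98765", "Baltimore")]
# Approximate dependence: ZIP 12345 maps to two different cities.
noisy = [("12345", "Seattle"), ("12345", "Boston"), ("12345", "Boston"),
         ("98765", "Baltimore"), ("00000", "San Francisco")]

print(conditional_entropy(exact))            # 0.0
print(round(conditional_entropy(noisy), 3))  # 0.382
```

With the default threshold of 0.01, the second pair of columns would not count as a dependency under this sketch, while the first would.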

The function tries to prune the potential dependencies by removing transitive edges.

When dropna=True is specified, rows that have a NaN in either column are eliminated from evaluation. This may result in dependencies being non-transitive, as in the following example. Even though A maps 1:1 with B and B maps 1:1 with C, A does not map 1:1 with C, because the comparison of A and C includes rows that are excluded (B is NaN) when comparing A with B or B with C:

A    B    C
1    1    1
1    1    1
1    NaN  9
2    NaN  2
2    2    2
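The non-transitivity in this table can be checked with a small pure-Python sketch. The helper maps_one_to_one is hypothetical and tests for an exact 1:1 mapping after dropping rows with NaN in either column; the library itself uses a conditional entropy threshold rather than an exact check.

```python
import math

def is_nan(value):
    return isinstance(value, float) and math.isnan(value)

def maps_one_to_one(left, right):
    """True if left and right determine each other once rows with NaN in either are dropped."""
    pairs = {(l, r) for l, r in zip(left, right) if not (is_nan(l) or is_nan(r))}
    # A 1:1 mapping has as many distinct pairs as distinct values on each side.
    return len({l for l, _ in pairs}) == len(pairs) == len({r for _, r in pairs})

nan = float("nan")
A = [1, 1, 1, 2, 2]
B = [1, 1, nan, nan, 2]
C = [1, 1, 9, 2, 2]

print(maps_one_to_one(A, B))  # True
print(maps_one_to_one(B, C))  # True
print(maps_one_to_one(A, C))  # False: A=1 is paired with both C=1 and C=9
```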

In some dropna=True cases the dependency chain can form cycles. In the following example, NaN values mask the pairwise mappings in such a way that A->B, B->C, C->A:

A    B    C
1    1    NaN
2    1    NaN
NaN  1    1
NaN  2    1
1    NaN  1
1    NaN  2
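The cycle in this table can be verified with a short sketch. The helper determines is hypothetical: it checks an exact, directed functional dependency after dropping rows with NaN in either column, whereas the library works with a conditional entropy threshold.

```python
import math

def determines(left, right):
    """True if each left value maps to exactly one right value (rows with NaN are dropped)."""
    seen = {}
    for l, r in zip(left, right):
        if any(isinstance(v, float) and math.isnan(v) for v in (l, r)):
            continue
        # setdefault stores the first right value seen for l and returns the stored one.
        if seen.setdefault(l, r) != r:
            return False
    return True

nan = float("nan")
A = [1, 2, nan, nan, 1, 1]
B = [1, 1, 1, 2, nan, nan]
C = [nan, nan, 1, 1, 1, 2]

print(determines(A, B), determines(B, A))  # True False
print(determines(B, C), determines(C, B))  # True False
print(determines(C, A), determines(A, C))  # True False
```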

find_dependencies(dropna: bool = False, threshold: float = 0.01, verbose: int = 0) -> FabricDataFrame

Parameters

Name Description
dropna

Ignore rows where either column is NaN in dependency calculations.

Default value: False
threshold

Threshold on conditional entropy to consider a pair of columns a dependency. Lower thresholds result in fewer dependencies (higher selectivity).

Default value: 0.01
verbose
int

Verbosity. 0 means no verbosity.

Default value: 0

Returns

Type Description

A dataframe with dependencies between columns and groups of columns. To better visualize the 1:1 groupings, columns that belong to a single group are put into a single cell. If no suitable candidates are found, returns an empty DataFrame.

list_dependency_violations

Show violating values assuming a functional dependency.

Assuming that there's a functional dependency between column A (determinant) and column B (dependent), show values that violate the functional dependency (along with the count of their respective occurrences).

Allows inspecting approximate dependencies and finding data quality issues.

For example, given a dataset with zipcodes and cities, we would expect the zipcode to determine the city. However, if the dataset looks like this (where ZIP is the determinant and CITY is the dependent):

ZIP    CITY
12345  Seattle
12345  Boston
12345  Boston
98765  Baltimore
00000  San Francisco

Running this function would output the following violation:

ZIP    CITY     count
12345  Boston   2
12345  Seattle  1

The same zipcode is attached to multiple cities, which means there is some data quality issue within the dataset.
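The violation listing above can be reproduced with a pure-Python sketch of the counting logic. The helper list_violations is hypothetical, and it ignores the dropna, show_feeding_determinants, max_violations, and order_by options described below.

```python
from collections import Counter, defaultdict

def list_violations(rows, determinant, dependent):
    """Return (determinant, dependent, count) tuples for determinants with several dependents."""
    counts = defaultdict(Counter)
    for row in rows:
        counts[row[determinant]][row[dependent]] += 1
    violations = []
    for key, counter in counts.items():
        if len(counter) > 1:  # more than one dependent value violates the dependency
            violations.extend((key, value, n) for value, n in counter.most_common())
    return violations

rows = [
    {"ZIP": "12345", "CITY": "Seattle"},
    {"ZIP": "12345", "CITY": "Boston"},
    {"ZIP": "12345", "CITY": "Boston"},
    {"ZIP": "98765", "CITY": "Baltimore"},
    {"ZIP": "00000", "CITY": "San Francisco"},
]

violations = list_violations(rows, "ZIP", "CITY")
print(violations)  # [('12345', 'Boston', 2), ('12345', 'Seattle', 1)]
```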

list_dependency_violations(determinant_col: str, dependent_col: str, *, dropna: bool = False, show_feeding_determinants: bool = False, max_violations: int = 10000, order_by: str = 'count') -> FabricDataFrame

Parameters

Name Description
determinant_col
Required
str

Candidate determinant column.

dependent_col
Required
str

Candidate dependent column.

dropna

Whether to drop rows with NaN values in either column.

Default value: False
show_feeding_determinants

Show values in A (the determinant) that are mapped to violating values in B (the dependent), even if none of these values violate the functional constraint.

Default value: False
max_violations
int

The maximum number of violations to return.

Default value: 10,000
order_by
str

Primary column to sort results by ("count" or "determinant"). If "count", sorts determinants by their number of dependent occurrences, highest first (grouped by determinant). If "determinant", sorts in alphabetical order based on the determinant column.

Default value: "count"

Returns

Type Description

FabricDataFrame containing all violating instances of functional dependency. If there are no violations, returns an empty DataFrame.

plot_dependency_violations

Show functional dependency violations in graphical format.

plot_dependency_violations(determinant_col: str, dependent_col: str, *, dropna: bool = False, show_feeding_determinants: bool = False, max_violations: int = 10000, order_by: str = 'count') -> graphviz.Graph

Parameters

Name Description
determinant_col
Required
str

Candidate determinant column.

dependent_col
Required
str

Candidate dependent column.

dropna

Whether to drop rows with NaN values in either column.

Default value: False
show_feeding_determinants

Show values in A (the determinant) that are mapped to violating values in B (the dependent), even if none of these values violate the functional constraint.

Default value: False
max_violations
int

The maximum number of violations to return.

Default value: 10,000
order_by
str

Primary column to sort results by ("count" or "determinant"). If "count", sorts determinants by their number of dependent occurrences, highest first (grouped by determinant). If "determinant", sorts in alphabetical order based on the determinant column.

Default value: "count"

Returns

Type Description

Graph of violating values.

to_lakehouse_table

Write the data to OneLake as a Delta table with VOrdering enabled.

to_lakehouse_table(name: str, mode: Literal['error', 'append', 'overwrite', 'ignore'] = 'error', method: Literal['spark', 'deltalake'] | None = None, spark_schema: StructType | Schema | deltalake.Schema | None = None, delta_column_mapping_mode: str = 'name') -> None

Parameters

Name Description
name
Required
str

The name of the table to write to.

mode

Specifies the behavior when the table already exists. Details of the modes are available in the Spark docs.

Default value: "error"
method

Specifies the API used to write the table. If None, the function auto-selects the appropriate API based on the current runtime.

Default value: None
spark_schema

Specifies the schema.

Default value: None
delta_column_mapping_mode
str

Specifies the column mapping mode to be used for the delta table. By default, it is set to "name".

Default value: "name"

to_parquet

Write the DataFrame, including its metadata, to a Parquet file at the given path using Arrow.

to_parquet(path: str, *args, **kwargs) -> None

Parameters

Name Description
path
Required
str

File path where the Parquet file should be saved.

*args
Required

Other args to be passed to PyArrow write_table.

**kwargs
Required

Other kwargs to be passed to PyArrow write_table.

Attributes

column_metadata

Information for the columns in the table.