FabricDataFrame Class
A dataframe for storage and propagation of PowerBI metadata.
The elements of column_metadata can contain the following keys:
table
: table name in originating dataset
column
: column name
dataset
: originating dataset name
workspace_id
: string form of workspace GUID
workspace_name
: friendly name of originating workspace
description
: description of column (if one is present)
data_type
: PowerBI data type for this column
data_category
: PowerBI data category for this column
alignment
: PowerBI visual alignment for this column
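As an illustration of the keys listed above, the following sketch builds a column_metadata mapping for a single column. It assumes the mapping is keyed by dataframe column name, with one dictionary of keys per column. The column name, dataset, workspace, and every value shown are hypothetical placeholders, not taken from a real dataset.

```python
# Hypothetical metadata for one column; every value below is a placeholder.
# Assumption: column_metadata maps each dataframe column name to a dict of keys.
column_metadata = {
    "City": {
        "table": "Customer",
        "column": "City",
        "dataset": "Sales Dataset",
        "workspace_id": "00000000-0000-0000-0000-000000000000",
        "workspace_name": "My Workspace",
        "description": "City of the customer",
        "data_type": "String",
        "data_category": "City",
        "alignment": "Left",
    }
}

# The keys "can" be present, so read them defensively with .get().
for name, meta in column_metadata.items():
    print(name, "->", meta.get("table"), meta.get("data_type"))
```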
Inheritance
sempy.functions._dataframe._sdataframe._SDataFrame -> FabricDataFrame
Constructor
FabricDataFrame(data: ndarray | Iterable | dict | DataFrame | None = None, *args: Any, column_metadata: Dict[str, Any] | None = None, dataset: str | UUID | None = None, workspace: str | UUID | None = None, verbose: int = 0, **kwargs: Any)
Parameters
Name | Description |
---|---|
data | A dict can contain Series, arrays, constants, dataclass or list-like objects. If data is a dict, column order follows insertion order. If a dict contains Series which have an index defined, it is aligned by its index. This alignment also occurs if data is a Series or a DataFrame itself. If data is a list of dicts, column order follows insertion order. Default value: None |
*args | Remaining arguments to be passed to standard pandas constructor. |
column_metadata | Information about dataframe columns to be stored and propagated. Default value: None |
dataset | Name or UUID of the dataset to list the measures for. Default value: None |
workspace | The Fabric workspace name or UUID object containing the workspace ID. Defaults to None, which resolves to the workspace of the attached lakehouse or, if no lakehouse is attached, to the workspace of the notebook. Default value: None |
verbose | Verbosity. 0 means no verbosity. Default value: 0 |
**kwargs | Remaining kwargs to be passed to standard pandas constructor. |
Keyword-Only Parameters
column_metadata, dataset, workspace, and verbose are keyword-only. All four have defaults in the constructor signature, so none is required; their descriptions are listed under Parameters.
Methods
Name | Description |
---|---|
add_measure | Join measures from the same dataset to the dataframe. |
drop_dependency_violations | Drop rows that violate a given functional constraint. |
find_dependencies | Detect functional dependencies between the columns of a dataframe. |
list_dependency_violations | Show violating values assuming a functional dependency. |
plot_dependency_violations | Show functional dependency violations in graphical format. |
to_lakehouse_table | Write the data to OneLake as a Delta table with VOrdering enabled. |
to_parquet | Write DataFrame to a parquet file specified by path parameter using Arrow including metadata. |
add_measure
Join measures from the same dataset to the dataframe.
add_measure(*measures: List[str], dataset: str | UUID | None = None, workspace: str | UUID | None = None, use_xmla: bool = False, verbose: int = 0) -> FabricDataFrame
Parameters
Name | Description |
---|---|
*measures | Required. List of measure names to join. |
dataset | Name or UUID of the dataset to list the measures for. If not provided, it is auto-resolved from column metadata. Default value: None |
workspace | The Fabric workspace name or UUID object containing the workspace ID. Defaults to None, which resolves to the workspace of the attached lakehouse or, if no lakehouse is attached, to the workspace of the notebook. Default value: None |
use_xmla | Whether or not to use XMLA as the backend for the client. If there are any issues using the default client, set this argument to True. Default value: False |
verbose | Verbosity. 0 means no verbosity. Default value: 0 |
Returns
Type | Description |
---|---|
FabricDataFrame | A new FabricDataFrame with the joined measures. |
drop_dependency_violations
Drop rows that violate a given functional constraint.
Enforces a functional constraint between the determinant and dependent columns provided. For each value of the determinant, the most common value of the dependent is picked, and all rows with other values are dropped. For example, given:
ZIP | CITY |
---|---|
12345 | Seattle |
12345 | Boston |
12345 | Boston |
98765 | Baltimore |
00000 | San Francisco |
The row with CITY=Seattle would be dropped, and the functional dependency ZIP -> CITY holds in the output.
drop_dependency_violations(determinant_col: str, dependent_col: str, verbose: int = 0) -> FabricDataFrame
Parameters
Name | Description |
---|---|
determinant_col | Required. Determining column name. |
dependent_col | Required. Dependent column name. |
verbose | Verbosity. 0 means no messages, 1 shows the number of dropped rows, and values greater than 1 show the entire content of the dropped rows. Default value: 0 |
Returns
Type | Description |
---|---|
FabricDataFrame | New dataframe with constraint determinant -> dependent enforced. |
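The majority rule described for drop_dependency_violations can be made concrete with a small pure-Python sketch. This is not the library's implementation: the helper name drop_violations is hypothetical, rows are modeled as plain dicts rather than a dataframe, and the tie-breaking behavior (first value seen wins) is an assumption, since the description does not say how ties are resolved.

```python
from collections import Counter, defaultdict

def drop_violations(rows, determinant, dependent):
    """Keep rows whose dependent value is the most common one for their determinant."""
    counts = defaultdict(Counter)
    for row in rows:
        counts[row[determinant]][row[dependent]] += 1
    # Assumption of this sketch: on a tie, the value seen first wins.
    winner = {det: c.most_common(1)[0][0] for det, c in counts.items()}
    return [row for row in rows if row[dependent] == winner[row[determinant]]]

rows = [
    {"ZIP": "12345", "CITY": "Seattle"},
    {"ZIP": "12345", "CITY": "Boston"},
    {"ZIP": "12345", "CITY": "Boston"},
    {"ZIP": "98765", "CITY": "Baltimore"},
    {"ZIP": "00000", "CITY": "San Francisco"},
]
cleaned = drop_violations(rows, "ZIP", "CITY")
for row in cleaned:
    print(row["ZIP"], row["CITY"])
```

On the ZIP/CITY example, only the Seattle row is removed, so each ZIP maps to exactly one CITY in the result.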
find_dependencies
Detect functional dependencies between the columns of a dataframe.
Columns that map 1:1 will be represented as a list.
Uses a threshold on conditional entropy to discover approximate functional dependencies. Low conditional entropy means strong dependence (i.e. conditional entropy of 0 means complete dependence). Therefore a lower threshold is more selective.
The function tries to prune the potential dependencies by removing transitive edges.
When dropna=True is specified, rows that have a NaN in either column are eliminated from evaluation. This may result in dependencies being non-transitive, as in the following example. Even though A maps 1:1 with B and B maps 1:1 with C, A does not map 1:1 with C, because comparison of A and C includes additional NaN rows that are excluded when comparing A and C with B:
A | B | C |
---|---|---|
1 | 1 | 1 |
1 | 1 | 1 |
1 | NaN | 9 |
2 | NaN | 2 |
2 | 2 | 2 |
In some dropna=True cases the dependency chain can form cycles. In the following example, NaN values mask the pairwise mappings in such a way that A->B, B->C, C->A:
A | B | C |
---|---|---|
1 | 1 | NaN |
2 | 1 | NaN |
NaN | 1 | 1 |
NaN | 2 | 1 |
1 | NaN | 1 |
1 | NaN | 2 |
find_dependencies(dropna: bool = False, threshold: float = 0.01, verbose: int = 0) -> FabricDataFrame
Parameters
Name | Description |
---|---|
dropna | Ignore rows where either column is NaN in dependency calculations. Default value: False |
threshold | Threshold on conditional entropy to consider a pair of columns a dependency. Lower thresholds result in fewer dependencies (higher selectivity). Default value: 0.01 |
verbose | Verbosity. 0 means no verbosity. Default value: 0 |
Returns
Type | Description |
---|---|
FabricDataFrame | A dataframe with dependencies between columns and groups of columns. To better visualize the 1:1 groupings, columns that belong to a single group are put into a single cell. If no suitable candidates are found, returns an empty DataFrame. |
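To illustrate the conditional-entropy criterion and the pairwise effect of dropna=True, the sketch below computes H(Y | X) for the first A/B/C example in the find_dependencies description. It is a minimal illustration under stated assumptions, not the library's implementation: the helper conditional_entropy is hypothetical, None stands in for NaN, and the natural logarithm is used because the log base and any normalization used by find_dependencies are not stated here.

```python
import math
from collections import Counter

def conditional_entropy(xs, ys, dropna=False):
    """H(Y | X) in nats; 0 means X fully determines Y."""
    pairs = list(zip(xs, ys))
    if dropna:
        # Drop a row only if one of the two columns being compared is missing.
        pairs = [(x, y) for x, y in pairs if x is not None and y is not None]
    n = len(pairs)
    joint = Counter(pairs)
    marginal = Counter(x for x, _ in pairs)
    return -sum((c / n) * math.log(c / marginal[x]) for (x, _), c in joint.items())

# First example table, with None standing in for NaN.
A = [1, 1, 1, 2, 2]
B = [1, 1, None, None, 2]
C = [1, 1, 9, 2, 2]

threshold = 0.01  # default threshold from the signature
h_b_given_a = conditional_entropy(A, B, dropna=True)
h_c_given_b = conditional_entropy(B, C, dropna=True)
h_c_given_a = conditional_entropy(A, C, dropna=True)
print(h_b_given_a <= threshold, h_c_given_b <= threshold, h_c_given_a <= threshold)
```

Comparing A with B, or B with C, discards the two rows where B is missing, so both pairs look fully dependent. Comparing A with C keeps all five rows, including the one where A=1 maps to C=9, so the conditional entropy of C given A rises above the threshold.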
list_dependency_violations
Show violating values assuming a functional dependency.
Assuming that there's a functional dependency between column A (determinant) and column B (dependent), show values that violate the functional dependency (along with the count of their respective occurrences).
Allows inspecting approximate dependencies and finding data quality issues.
For example, given a dataset with zipcodes and cities, we would expect the zipcode to determine the city. However, if the dataset looks like this (where ZIP is the determinant and CITY is the dependent):
ZIP | CITY |
---|---|
12345 | Seattle |
12345 | Boston |
12345 | Boston |
98765 | Baltimore |
00000 | San Francisco |
Running this function would output the following violation:
ZIP | CITY | count |
---|---|---|
12345 | Boston | 2 |
12345 | Seattle | 1 |
The same zipcode is attached to multiple cities, which means there is some data quality issue within the dataset.
list_dependency_violations(determinant_col: str, dependent_col: str, *, dropna: bool = False, show_feeding_determinants: bool = False, max_violations: int = 10000, order_by: str = 'count') -> FabricDataFrame
Parameters
Name | Description |
---|---|
determinant_col | Required. Candidate determinant column. |
dependent_col | Required. Candidate dependent column. |
dropna | Whether to drop rows with NaN values in either column. Default value: False |
show_feeding_determinants | Show values in the determinant column (A) that are mapped to violating values in the dependent column (B), even if none of these values violate the functional constraint. Default value: False |
max_violations | The maximum number of violations to return. Default value: 10000 |
order_by | Primary column to sort results by ("count" or "determinant"). If "count", sorts in order of determinant with highest number of dependent occurrences (grouped by determinant). If "determinant", sorts in alphabetical order based on determinant column. Default value: "count" |
Returns
Type | Description |
---|---|
FabricDataFrame | FabricDataFrame containing all violating instances of the functional dependency. If there are no violations, returns an empty DataFrame. |
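The violation listing for the ZIP/CITY example can be reproduced with a short pure-Python sketch. It only illustrates the idea and is not the library's implementation: the helper list_violations is hypothetical, the input is a list of (determinant, dependent) pairs rather than a dataframe, and the simple sort by descending count is an assumption that approximates order_by="count" without the per-determinant grouping.

```python
from collections import Counter

def list_violations(pairs):
    """Return (determinant, dependent, count) for determinants with more than one dependent."""
    pair_counts = Counter(pairs)
    dependents = {}
    for det, dep in pair_counts:
        dependents.setdefault(det, set()).add(dep)
    found = [
        (det, dep, n)
        for (det, dep), n in pair_counts.items()
        if len(dependents[det]) > 1
    ]
    # Assumption: order by count, highest first.
    return sorted(found, key=lambda v: -v[2])

pairs = [
    ("12345", "Seattle"),
    ("12345", "Boston"),
    ("12345", "Boston"),
    ("98765", "Baltimore"),
    ("00000", "San Francisco"),
]
violations = list_violations(pairs)
for det, dep, n in violations:
    print(det, dep, n)
```

Only ZIP 12345 appears in the result, because it is the only determinant value attached to more than one city.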
plot_dependency_violations
Show functional dependency violations in graphical format.
plot_dependency_violations(determinant_col: str, dependent_col: str, *, dropna: bool = False, show_feeding_determinants: bool = False, max_violations: int = 10000, order_by: str = 'count') -> graphviz.Graph
Parameters
Name | Description |
---|---|
determinant_col | Required. Candidate determinant column. |
dependent_col | Required. Candidate dependent column. |
dropna | Whether to drop rows with NaN values in either column. Default value: False |
show_feeding_determinants | Show values in the determinant column (A) that are mapped to violating values in the dependent column (B), even if none of these values violate the functional constraint. Default value: False |
max_violations | The maximum number of violations to return. Default value: 10000 |
order_by | Primary column to sort results by ("count" or "determinant"). If "count", sorts in order of determinant with highest number of dependent occurrences (grouped by determinant). If "determinant", sorts in alphabetical order based on determinant column. Default value: "count" |
Returns
Type | Description |
---|---|
graphviz.Graph | Graph of violating values. |
to_lakehouse_table
Write the data to OneLake as a Delta table with VOrdering enabled.
to_lakehouse_table(name: str, mode: Literal['error', 'append', 'overwrite', 'ignore'] = 'error', method: Literal['spark', 'deltalake'] | None = None, spark_schema: StructType | Schema | deltalake.Schema | None = None, delta_column_mapping_mode: str = 'name') -> None
Parameters
Name | Description |
---|---|
name | Required. The name of the table to write to. |
mode | Specifies the behavior when the table already exists. Details of the modes are available in the Spark docs. Default value: "error" |
method | Specifies the API used to write the table. If None, the function auto-selects the proper API based on the current runtime. Default value: None |
spark_schema | Specifies the schema. Default value: None |
delta_column_mapping_mode | Specifies the column mapping mode to be used for the delta table. Default value: "name" |
to_parquet
Write DataFrame to a parquet file specified by path parameter using Arrow including metadata.
to_parquet(path: str, *args, **kwargs) -> None
Parameters
Name | Description |
---|---|
path | Required. String containing the file path where the parquet file should be saved. |
*args | Other args to be passed to PyArrow. |
**kwargs | Other kwargs to be passed to PyArrow. |
Attributes
column_metadata
Information for the columns in the table.