
[GRIFFIN-358] Rewrite the Rule/ Measure implementations #591

Merged · 34 commits · Jul 5, 2021

Conversation

chitralverma (Contributor)

What changes were proposed in this pull request?

Current RuleParams can be of the following 3 DSL types:

  • Data Ops (for source preprocessing)
  • Griffin DSL
  • SparkSQL

GriffinDSL allows the implementation of measures (DQ Types) like Completeness, Accuracy, etc.

To enable such measures, there is an extensive implementation of expression and task hierarchies and of parsing, most of which depends heavily on scala-parser-combinators.

At the end of the implementation, Griffin DSL tries to mimic a SparkSQL-like query with substitution of user-defined constraints.
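For illustration, a completeness rule on a column is ultimately rendered as something resembling the following SparkSQL. This is a hypothetical sketch: the actual generated query shape, and the `source` table and `user_id` column names, are placeholders, not the real output of the DSL.

```sql
-- Hypothetical sketch of the kind of SparkSQL-like query that
-- Griffin DSL ends up generating for a completeness check on a
-- column (table/column names are illustrative placeholders).
SELECT
  COUNT(*) AS total_count,
  SUM(CASE WHEN user_id IS NULL THEN 1 ELSE 0 END) AS incomplete_count
FROM source;
```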

This approach has some drawbacks:

  • Suboptimal processing. Although the transformation steps are submitted in parallel from the driver, the data set is still scanned multiple times, which causes inefficiencies on the SparkSession side, and the internal task scheduler is single-threaded. Even if the data set is cached, it still branches, and valuable memory is spent holding the data set rather than processing it.
  • Internal functions of Spark are not used. Data preprocessing currently has a very limited scope, even though hundreds of Spark SQL functions are available for use.
  • This blocks structured streaming. The manually constructed SQL queries perform multiple aggregations in the same query on a streaming data set, which is not supported by Spark's Structured Streaming. There are workarounds, but they all require rewriting the *Expr2DQSteps classes.
  • Griffin DSL is SparkSQL-like but not 100% compatible with it; the Profiling measure and the SparkSQL rule type are redundant functionalities.
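The Structured Streaming limitation above can be sketched as follows. This is an illustrative example, not a query taken from the Griffin codebase: two levels of aggregation over a streaming source is the pattern that Structured Streaming rejects.

```sql
-- Hypothetical sketch of the unsupported pattern: a streaming
-- aggregation nested inside another aggregation. Structured
-- Streaming does not allow chained streaming aggregations like this.
SELECT AVG(cnt)
FROM (
  SELECT key, COUNT(*) AS cnt
  FROM streaming_source
  GROUP BY key
);
```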

The proposed solution involves SparkSQL DSL based measures and some changes to RuleParams. This will enhance the data preprocessing flows and the measures themselves.
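As a hedged sketch of the direction described above (the expression and column names are illustrative assumptions, not the PR's actual configuration schema): with SparkSQL-based measures, a user-defined constraint can be written as a plain SparkSQL expression and evaluated in a single pass using Spark's built-in functions.

```sql
-- Hypothetical sketch: the completeness constraint is just a
-- SparkSQL boolean expression (here "user_id IS NOT NULL"),
-- evaluated in one scan with built-in aggregate functions.
SELECT
  COUNT(*) AS total,
  SUM(CASE WHEN user_id IS NOT NULL THEN 1 ELSE 0 END) AS complete
FROM source;
```

Because the constraint is ordinary SparkSQL, preprocessing can also draw on the full set of Spark SQL functions rather than a limited DSL.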

Does this PR introduce any user-facing change?
Yes. Users can use the new measures as a separate configuration and there is scope for more data pre-processing.

How was this patch tested?
Unit Tests

@chitralverma chitralverma self-assigned this Jun 11, 2021
@chitralverma chitralverma marked this pull request as ready for review June 11, 2021 06:37
@chitralverma chitralverma requested a review from guoyuepeng June 11, 2021 06:37
@chitralverma (Contributor, Author)

@wankunde @guoyuepeng Can you please review this? Thanks.

@guoyuepeng (Contributor)

big patch.
let me go through it today.

Thanks.

@chitralverma (Contributor, Author)

> big patch.
> let me go through it today.
>
> Thanks.

Thanks.

@guoyuepeng (Contributor)

LGTM.
Will merge it

@guoyuepeng (Contributor) left a comment

LGTM

@guoyuepeng guoyuepeng merged commit 7a50813 into apache:master Jul 5, 2021
@chitralverma (Contributor, Author)

Thanks for the merge! :)

@chitralverma chitralverma deleted the fix-measures branch July 6, 2021 11:35
4 participants