
[GRIFFIN-358] Rewrite the Rule/ Measure implementations #591

Merged · 34 commits · Jul 5, 2021

Conversation

chitralverma (Contributor)

What changes were proposed in this pull request?

Current RuleParams can be of the following 3 DSL types:

  • Data Ops (for source preprocessing)
  • Griffin DSL
  • SparkSQL

GriffinDSL allows the implementation of measures (DQ Types) like Completeness, Accuracy, etc.

To enable such measures, there is an extensive implementation of expression and task hierarchies and of parsing, most of which depends heavily on scala-parser-combinators.

At the end of the implementation, Griffin DSL tries to mimic a SparkSQL-like query with substitution of user-defined constraints.
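For illustration, a completeness rule on a column is ultimately rendered as something resembling the following SparkSQL. This is a hypothetical sketch: the actual generated query shape, and the `source` table and `user_id` column names, are placeholders, not the real output of the DSL.

```sql
-- Hypothetical sketch of the kind of SparkSQL-like query that
-- Griffin DSL ends up generating for a completeness check on a
-- column (table/column names are illustrative placeholders).
SELECT
  COUNT(*) AS total_count,
  SUM(CASE WHEN user_id IS NULL THEN 1 ELSE 0 END) AS incomplete_count
FROM source;
```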

This approach has some drawbacks:

  • Suboptimal processing. Although the transformation steps are submitted in parallel from the driver, the data set is still scanned multiple times, which causes inefficiencies on the SparkSession side, and the internal task scheduler is single-threaded. Even if the data set is cached, it still branches, and valuable memory is spent holding the data set rather than processing it.
  • Internal functions of Spark are not used. Data preprocessing currently has a very limited scope, even though hundreds of Spark SQL functions are available for use.
  • This blocks structured streaming. The manually constructed SQL queries perform multiple aggregations in the same query on a streaming data set, which is not supported by Spark's Structured Streaming. There are workarounds, but they all require rewriting the *Expr2DQSteps classes.
  • Griffin DSL is SparkSQL-like but not 100% compatible with it; the Profiling measure and the SparkSQL rule type are redundant functionalities.
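The Structured Streaming limitation above can be sketched as follows. This is an illustrative example, not a query taken from the Griffin codebase: two levels of aggregation over a streaming source is the pattern that Structured Streaming rejects.

```sql
-- Hypothetical sketch of the unsupported pattern: a streaming
-- aggregation nested inside another aggregation. Structured
-- Streaming does not allow chained streaming aggregations like this.
SELECT AVG(cnt)
FROM (
  SELECT key, COUNT(*) AS cnt
  FROM streaming_source
  GROUP BY key
);
```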

The proposed solution involves SparkSQL DSL based measures and some changes to RuleParams. This will enhance the data preprocessing flows and the measures themselves.
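As a hedged sketch of the direction described above (the expression and column names are illustrative assumptions, not the PR's actual configuration schema): with SparkSQL-based measures, a user-defined constraint can be written as a plain SparkSQL expression and evaluated in a single pass using Spark's built-in functions.

```sql
-- Hypothetical sketch: the completeness constraint is just a
-- SparkSQL boolean expression (here "user_id IS NOT NULL"),
-- evaluated in one scan with built-in aggregate functions.
SELECT
  COUNT(*) AS total,
  SUM(CASE WHEN user_id IS NOT NULL THEN 1 ELSE 0 END) AS complete
FROM source;
```

Because the constraint is ordinary SparkSQL, preprocessing can also draw on the full set of Spark SQL functions rather than a limited DSL.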

Does this PR introduce any user-facing change?
Yes. Users can use the new measures as a separate configuration and there is scope for more data pre-processing.

How was this patch tested?
Unit Tests

@chitralverma chitralverma self-assigned this Jun 11, 2021
@chitralverma chitralverma marked this pull request as ready for review June 11, 2021 06:37
@chitralverma chitralverma requested a review from guoyuepeng June 11, 2021 06:37
@chitralverma (Contributor, Author)

@wankunde @guoyuepeng Can you please review this? Thanks.

@guoyuepeng (Contributor)

big patch.
let me go through it today.

Thanks.

@chitralverma (Contributor, Author)

> big patch.
> let me go through it today.
>
> Thanks.

Thanks.

@guoyuepeng (Contributor)

LGTM.
Will merge it

@guoyuepeng (Contributor) left a comment

LGTM

@guoyuepeng guoyuepeng merged commit 7a50813 into apache:master Jul 5, 2021
@chitralverma (Contributor, Author)

Thanks for the merge! :)

@chitralverma chitralverma deleted the fix-measures branch July 6, 2021 11:35
4 participants