40 questions
0
votes
0
answers
60
views
ImportError: cannot import name 'pkg_resources' from 'ydata_profiling'
I am new to Streamlit and I am trying to use the ydata_profiling library for AutoML, but I keep getting the error ImportError: cannot import name 'pkg_resources' from 'ydata_profiling'.
I tried ...
0
votes
0
answers
46
views
Python checkpoint module for estimation of remaining script time
I run background jobs for different users in our service. Subtasks and runtime differ based on the size and settings of the user account.
Subtasks within the job are something like: downloading ...
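A minimal sketch of one way to estimate remaining time from checkpoints, assuming the job can report what fraction of its work is done. `RuntimeEstimator` is a hypothetical helper, not an existing module, and the linear extrapolation is only a rough guess when subtasks vary a lot in cost.

```python
import time


class RuntimeEstimator:
    """Estimate remaining time from checkpoints (hypothetical helper)."""

    def __init__(self, clock=time.monotonic):
        # The clock is injectable so the estimator can be tested without waiting.
        self._clock = clock
        self._start = clock()

    def remaining(self, fraction_done):
        """Return estimated seconds left, given the fraction of work done in (0, 1]."""
        if not 0 < fraction_done <= 1:
            raise ValueError("fraction_done must be in (0, 1]")
        elapsed = self._clock() - self._start
        # Linear extrapolation: estimated total = elapsed / fraction_done.
        return elapsed / fraction_done - elapsed
```

Calling `remaining(0.25)` after each subtask (with the real fraction) gives a running estimate. Weighting subtasks by their expected cost would be a natural refinement.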
0
votes
0
answers
158
views
Why to_notebook_iframe (ydata-profiling) does not render the report on SageMaker notebook?
I am having trouble showing the ydata-profiling report in a notebook in SageMaker Studio. The report is created without problems, but the rendered report does not show up at the end and the ...
0
votes
0
answers
44
views
How to identify all possible differences in duplicate data from two different datasets and calculate frequency?
I have two datasets each containing Name, First Name, Street, House Number, Postal Code and City. I have noticed these datasets contain multiple cases of duplicates. For instance, in one dataset the ...
0
votes
1
answer
532
views
AttributeError when attempting to generate a report with ydata-profiling in Python
I am attempting to generate a data profiling report using the ydata-profiling library in Python. Upon executing the following code:
import ydata_profiling
profile = ydata_profiling.ProfileReport(...
0
votes
2
answers
3k
views
Saving and Reloading a ydata-profiling / pandas-profiling ProfileReport object for later use
I am using the ydata-profiling library to generate profile reports of my pandas DataFrame. I would like to save the entire ProfileReport object, so I can load it later without having to regenerate the ...
4
votes
2
answers
4k
views
Data Profiling using Pyspark
I'm trying to create a PySpark function that takes a DataFrame as input and returns a data-profile report. I already used the describe and summary functions, which give results like min, max, count etc. ...
0
votes
1
answer
162
views
How to customize alerts + other metrics in pandas_profiling / ydata_profiling
pandas_profiling, or as it is now called, ydata_profiling, provides a detailed breakdown of data quality.
How can we customize alerts + other metrics included in their default report?
I see options to ...
1
vote
1
answer
886
views
Is it possible in Snowflake to write a query that lists the columns that have all null values?
In Snowsight within Snowflake, you can profile tables and see the % of null values in the UI, but is there an easy way to query for this data or export it from the UI? I just need to create a new ...
1
vote
1
answer
2k
views
Databricks: Export data profiling report
Databricks can create a data profiling report after calling display(dataframe_name).
I have created a data profiling report using Azure Databricks but I do not know how to export it.
Can you ...
2
votes
1
answer
382
views
Using PyDeequ on Jupyter Notebook and getting "An error occurred while calling o70.run."
I'm trying to use PyDeequ on Jupyter Notebook. When I try to use ConstraintSuggestionRunner, it shows this error:
Py4JJavaError: An error occurred while calling o70.run.
: java.lang.NoSuchMethodError: '...
0
votes
2
answers
327
views
Detecting similar columns across multiple files based on statistical profile
I'm attempting to clean up a set of old files that contain sensor data measurements. Many of the files don't have headers, and the format (column ordering, etc.) is inconsistent. I'm thinking the ...
0
votes
0
answers
162
views
How can I connect a local Delta Lake with Talend for data profiling purposes?
As I am new to Talend, I am trying to connect my local Delta Lake with Talend to do some data profiling on it.
1
vote
1
answer
55
views
Not able to perform operations on resulting dataframe after "join" operation in PySpark
df=spark.read.csv('data.csv',header=True,inferSchema=True)
rule_df=spark.read.csv('job_rules.csv',header=True)
query_df=spark.read.csv('rules.csv',header=True)
join_df=rule_df.join(query_df,rule_df....
0
votes
0
answers
631
views
How to create multiple pandas profiling reports for multiple csv files in a directory? The report name should match the file name
I tried this,
import glob
import os
import pandas as pd
import pandas_profiling
from pandas_profiling import ProfileReport
files = glob.glob(r"D:\home_health_services_current_data\*.csv")
df ...
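A minimal sketch of the naming part only, assuming the goal is one HTML report per CSV with the same base name. `report_path` and `plan_reports` are hypothetical helper names, and no reports are generated here.

```python
from pathlib import Path


def report_path(csv_path, out_dir="."):
    """Return an HTML report path that reuses the CSV file's base name."""
    return Path(out_dir) / (Path(csv_path).stem + ".html")


def plan_reports(csv_dir, out_dir="."):
    """Map every CSV in csv_dir to its report path (no reports are generated here)."""
    return {p: report_path(p, out_dir) for p in sorted(Path(csv_dir).glob("*.csv"))}
```

Each CSV could then be read with pd.read_csv and its ProfileReport written to the matching path with to_file.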
1
vote
0
answers
124
views
SSIS Data Profiling Task - Not showing all in Data Profile Outputs
I chose the following requests for a Data Profiling Task in SSDT 2017, but it only shows NullRationReq in the output and not the other requests. I tried a few times, and when I checked the profiler output XML - in ...
0
votes
1
answer
696
views
Data profiling of columns for big table (SQL Server)
I have a table with over 40 million records. I need to profile the data, including null counts, distinct values, zeros and blanks, %Numeric, %Date, Needs to be Trimmed, etc.
The examples that I was ...
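A rough sketch of computing several of these metrics in a single pass with conditional aggregates. It uses an in-memory SQLite database as a stand-in, not SQL Server, so the table, column, and `profile_column` helper are all assumptions for illustration.

```python
import sqlite3


def profile_column(conn, table, column):
    """Return row count, null count, distinct count and blank count for one column.

    table and column are interpolated into SQL, so pass trusted identifiers only.
    """
    sql = f"""
        SELECT COUNT(*),
               COUNT(*) - COUNT("{column}"),
               COUNT(DISTINCT "{column}"),
               SUM(CASE WHEN TRIM("{column}") = '' THEN 1 ELSE 0 END)
        FROM "{table}"
    """
    rows, nulls, distinct, blanks = conn.execute(sql).fetchone()
    # SUM returns NULL on an empty table, so fall back to 0.
    return {"rows": rows, "nulls": nulls, "distinct": distinct, "blanks": blanks or 0}


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (name TEXT)")
conn.executemany("INSERT INTO t VALUES (?)", [("a",), ("a",), ("",), (None,), ("  ",)])
result = profile_column(conn, "t", "name")
```

The same COUNT and CASE pattern is standard SQL, but performance on a 40-million-row table depends on the engine and indexing.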
0
votes
2
answers
212
views
Validation for columns works very slowly (SQL Server)
I want to perform data profiling on the columns of a table. In this particular case - what percentage of data is date/integer/numeric/bit. The query that I am using:
SELECT
CAST(SUM(CASE WHEN ...
0
votes
1
answer
441
views
When I run the pandas-profiling package it doesn't return min, max and mean values
When I profile the following data using pandas-profiling==2.8.0, it doesn't return min, max and mean values.
CSV data
a,b,c
12,2.5,0
12,4.7,5
33,5,4
44,44.21,67
python code
import json
import ...
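Independently of pandas-profiling, the values one would expect for the sample CSV can be cross-checked with the standard library. This is only a sanity check under the assumption that every column is numeric; it does not explain why the report omits them, and `column_stats` is a hypothetical helper.

```python
import csv
import io
import statistics

CSV_TEXT = """a,b,c
12,2.5,0
12,4.7,5
33,5,4
44,44.21,67
"""


def column_stats(text):
    """Return {column: (min, max, mean)} for an all-numeric CSV string."""
    rows = list(csv.DictReader(io.StringIO(text)))
    stats = {}
    for name in rows[0]:
        # Parse every cell of this column as a float before aggregating.
        values = [float(row[name]) for row in rows]
        stats[name] = (min(values), max(values), statistics.mean(values))
    return stats


stats = column_stats(CSV_TEXT)
```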
0
votes
1
answer
1k
views
Db2 tables - finding all blank columns in a table that has 100+ columns
I have a table with 78 columns and 100k rows. Is there a way to find all the blank columns in the table without querying on each column to find their counts?
Running a not null query is time consuming ...
1
vote
0
answers
409
views
pandas-profiling "Duplicate rows" section is not showing-up in the HTML Report
I am using pandas-profiling=2.8.0 and I have generated an HTML report in which 2 duplicates are shown in the Overview Section, as seen below
But the "Duplicate rows" option/section is ...
0
votes
3
answers
3k
views
Data profiling on a BigQuery table covering min, max, unique, null count statistics
I am looking for a solution to perform data profiling on a BigQuery table covering the statistics below for each column in the table. Some of the columns are ARRAY and STRUCT as given below.
I tried multiple ...
1
vote
1
answer
774
views
Why do I get an IndexError while trying to get a data profiling report?
I recently started using Python. I am trying to get a report using pandas_profiling, but I am running into an IndexError. Can someone please explain how I can debug this?
Data has like 30 variables ...
2
votes
2
answers
951
views
How to detect and convert units of column values without using python loop?
As far as I know, Python loops are slow, so pandas built-in functions are preferred.
In my problem, one column has different currencies, and I need to convert them to dollars. How can I ...
-2
votes
1
answer
65
views
DB2: Need to get the list of columns and distinct value counts for a given db2 table
For data profiling purposes, I just need to know whether the columns in a given table have values populated or not. For that, I need to get the list of columns and distinct value counts for a given ...
2
votes
2
answers
839
views
How to loop through all tables and fields in each table to get percentage of missing values
Using SSIS, I am trying to build a table with the percentage of missing values of every field in every table of a SQL Server database.
Ideally I would like to create a new table in another ...
1
vote
0
answers
412
views
Error when running Data Profiling Task with Azure SQL Server data
When running a Data Profiling Task in SSIS with data from an Azure SQL Server, I receive the following error message:
System.Data.SqlClient.SqlException (0x80131904): USE statement is not
supported ...
1
vote
1
answer
189
views
Profiling the empty string in SSIS Data Profiling
I've just started using the Data Profiling Task in SSIS to profile some data on our databases. I've found the option for profiling the column null ratios ("Column Null Ratio Profiles") but I'm ...
0
votes
2
answers
447
views
Find Multi-Column Primary key
I have about 30 tables from an old ERP which have multi-column primary keys. Unfortunately I don't know what those keys are. I've used the SSIS profiling task to determine primary key candidates for ...
10
votes
2
answers
1k
views
Data profiling Task - custom Profile Request
Is there any option to create a custom Profile Request for SSIS Data Profiling Task?
At the moment there are 5 standard profile requests under SSIS Data Profiling task:
Column Null Ratio Profile ...
0
votes
1
answer
38
views
XSLT: Copy two files into one common structure
I am trying to merge the results of the SSIS Data Profiler Task for several tables into one XML file, so the results can be inspected in a single file inside "Data Profiler Viewer". The whole problem shrinks to the ...
1
vote
3
answers
2k
views
Data profiling in Power BI
I want to profile every single data table I have in my Power BI report. By data profile I mean something like this:
Are there ways to make a data profile view in Power BI? DAX measure or calculated ...
0
votes
1
answer
2k
views
generate PostgreSQL stats / data profiling [closed]
I would like to automate data profiling on PostgreSQL with a free tool, one that inspects data content through a column profile or percentage distribution of values, such as max, min, avg.
3
votes
3
answers
978
views
Measuring peak disk use of a process
I am trying to benchmark a tool I'm developing in terms of time, memory, and disk use. I know /usr/bin/time gives me basically what I want for the first two, but for disk use I came to the conclusion ...
0
votes
1
answer
122
views
Extract pattern from dataset
I have a table with several columns filled with data from different parameters.
As some of the rows might share the same column values I'd like to extract the most repeating values for each column so ...
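A minimal sketch of extracting the most frequent value per column, assuming the table is available as a list of dicts. The sample rows and the `most_common_per_column` helper are made up for illustration.

```python
from collections import Counter


def most_common_per_column(rows):
    """Return {column: (value, count)} with the most frequent value of each column."""
    columns = rows[0].keys()
    return {
        col: Counter(row[col] for row in rows).most_common(1)[0]
        for col in columns
    }


rows = [
    {"color": "red", "size": "M"},
    {"color": "red", "size": "L"},
    {"color": "blue", "size": "L"},
    {"color": "red", "size": "L"},
]
result = most_common_per_column(rows)
```

Passing a larger number to most_common would return the top few values per column instead of only the first.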
0
votes
2
answers
2k
views
Data Profiling on a File through SSIS
I'm new to SSIS development. I need some guidance from experts on SSIS. Following is the list of questions:
We have txt or dat files with sizes from 1 GB to 25 GB, with tab ...
13
votes
5
answers
4k
views
Cannot start Concurrency Visualizer in Visual Studio 2012. Got error "Unable to start the ETW collection"
When I tried to profile a WPF application with Concurrency Visualizer (tried both launch and attach to process), I got the following error pop-up: "Unable to start the ETW collection"
ETW clearly ...
-1
votes
5
answers
684
views
Tool for table_schema and table_name relationship
Do you know any tools for profiling, to see the structure and relationships of each table inside the db? It looks like this one:
See the screenshot below.
For a higher resolution, please click here.
...
4
votes
1
answer
3k
views
MySQL capacity planning
In my production environment, I have a single instance of MySQL server running on 16 GB of memory that handles up to 20,000 queries an hour. The size of one of my tables is growing at the rate of 2 ...
0
votes
1
answer
542
views
Suggestion on Customer Profiling System: Books, Articles, etc
I'm going to work on a Customer Profiling project (similar but not identical to Google Analytics) for our own E-Commerce website using C#. I'm pretty new to this kind of project, and the Customer Profiling ...