Introduction
In this competition, we must forecast microbusiness development across numerous counties and states, based on the density of microbusinesses in each county. Microbusinesses are frequently too small or too new to appear in typical economic data sources, although their activity may be tied to other economic variables of broader relevance, and policymakers are trying to build more inclusive and recession-resistant economies. We are also aware that, because of technological advancements, entrepreneurship has never been easier to achieve than it is today. Whether to achieve a better balance between work and life, pursue a passion, or replace lost employment, statistics show that Americans are increasingly inclined to start a company of their own to suit their needs and financial objectives. The problem is that these "microenterprises" are frequently too small or too recently established to appear in regular economic data sources, making it difficult for policymakers to study them.
Significance of project
Dataset Overview
The main components of our dataset are the business activity index, the commerce and industry dataset, the census dataset, and the unique microbusiness density dataset. This is a forecasting competition in which historical economic data are freely available. Data acquired after the submission deadline expires will be used to score the forecasting-phase public leaderboard and the final private leaderboard. We produce static predictions that can only use information available before the submission period ends. This also implies that, although submissions will be rescored during the forecasting period, notebooks will not be re-run.
train.csv
test.csv
Data cleaning
EDA
Finding outliers:
Change point and outlier detection are key techniques for time series analysis, since they can help identify major shifts or irregularities in the data. Time series data are frequently non-stationary, which means that their statistical properties change over time. These changes can be caused by a variety of sources, including shifts in underlying patterns, modifications to the data-generating process, or the appearance of unusual events and aberrations. Change point detection can help determine when these shifts occur and provide information about their underlying causes.
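As a hedged illustration only (assuming a DataFrame with columns cfips, first_day_of_month, and microbusiness_density, and the open-source ruptures package; these names and the penalty value are illustrative, not the exact setup used), a minimal change point detection sketch for a single county's series could look like this:

# Minimal change point detection sketch (assumes the `ruptures` package and a
# DataFrame with columns cfips, first_day_of_month, microbusiness_density).
import pandas as pd
import ruptures as rpt

def detect_change_points(df: pd.DataFrame, cfips: int, penalty: float = 10.0):
    """Return indices where one county's density series shifts abruptly."""
    series = (df[df["cfips"] == cfips]
              .sort_values("first_day_of_month")["microbusiness_density"]
              .to_numpy())
    # PELT searches for an optimal segmentation under a cost penalty.
    algo = rpt.Pelt(model="rbf", min_size=3).fit(series)
    breakpoints = algo.predict(pen=penalty)  # last element is len(series)
    return breakpoints[:-1]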
We can observe that the data is distributed unevenly across the columns:
• The "microbusiness_density" and "active" columns appear skewed toward the left, due to certain outliers.
• The proportion of the population born outside the country is quite low, implying that few people migrate to these places, resulting in lower population turnover in these areas.
• The IT sector employs a sizable proportion of the workforce, creating opportunities for these industries to produce technology-related services and help people better understand technology.
Building of Models
KNN model
For the boosting methods, we constructed the following features: lag features, basic features, encoding features, and rank features. We then built features based on each county's neighbors. The k-NN model is based on several attributes, such as census data, microbusiness density data, and their corresponding changes. The k-NN features improved both our training loss and validation loss: the validation loss for LightGBM decreased by around 3% to approximately 2.3, showing that capturing neighbors as features is highly effective for prediction.
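As a rough sketch of how such neighbor features can be built (the frame, column names, and statistics here are illustrative assumptions, not the exact features we used), scikit-learn's NearestNeighbors can supply the neighbor indices:

# Sketch of neighbor-based features (assumes a per-county frame `county_df`
# with census columns and the latest microbusiness_density; names illustrative).
import pandas as pd
from sklearn.neighbors import NearestNeighbors
from sklearn.preprocessing import StandardScaler

def add_knn_features(county_df: pd.DataFrame, feature_cols: list, k: int = 5) -> pd.DataFrame:
    X = StandardScaler().fit_transform(county_df[feature_cols])
    # k + 1 because the nearest neighbor of a point is the point itself.
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X)
    _, idx = nn.kneighbors(X)
    neighbor_idx = idx[:, 1:]  # drop self
    density = county_df["microbusiness_density"].to_numpy()
    out = county_df.copy()
    out["knn_density_mean"] = density[neighbor_idx].mean(axis=1)
    out["knn_density_std"] = density[neighbor_idx].std(axis=1)
    return out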
SMAPE
We used SMAPE (Symmetric Mean Absolute Percentage Error) as our time series evaluation metric. It is commonly used for time series tasks and penalizes under-predictions more than over-predictions. Another commonly used percentage-based metric is the Mean Absolute Percentage Error (MAPE); however, MAPE has its limitations:
1. It cannot be used if there are zero or close-to-zero values, as division by zero or small values
will tend to infinity.
2. Forecasts that are too low will have a percentage error that cannot exceed 100%, but for
forecasts that are too high, there is no upper limit to the percentage error. This means that the
evaluation metric will systematically select a method whose forecasts are low.
In contrast to MAPE, SMAPE has both a lower and an upper bound. The log-transformed
accuracy ratio of MAPE actually has a similar shape compared to SMAPE. In our task, when
considering Type 1 and Type 2 errors for prediction, we would rather optimize for Type 2 errors
to penalize our predictions for microbusiness densities that are far below true values. From a
resource allocation perspective, underpredicting microbusiness densities can be harmful to those
doing business since they won't receive the necessary resources. SMAPE penalizes
underpredictions more than log-transformed MAPE, which is why we chose it.
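A small numerical sketch makes the asymmetry concrete (the function definitions below follow the standard textbook formulas, not the competition's official scoring code):

# SMAPE vs. MAPE for the same absolute error, illustrating the asymmetry above.
import numpy as np

def smape(y_true, y_pred):
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    denom = (np.abs(y_true) + np.abs(y_pred)) / 2.0
    return 100.0 * np.mean(np.where(denom == 0, 0.0, np.abs(y_pred - y_true) / denom))

def mape(y_true, y_pred):
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    return 100.0 * np.mean(np.abs(y_pred - y_true) / np.abs(y_true))

y_true = [10.0]
print(smape(y_true, [8.0]), smape(y_true, [12.0]))  # ~22.2 vs ~18.2: under-forecast scores worse
print(mape(y_true, [8.0]), mape(y_true, [12.0]))    # MAPE treats both as 20%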
Previously, we identified some key features contributing to our training. Some examples include
the percentage of households with broadband access and the percentage of the population aged
25+ with a college degree. These features suggest that some economic indicators may be
correlated with microbusiness densities, so we added more external datasets to our model. With
these features added, the loss improved by another 3%. Improving the model further through feature engineering alone is challenging, and we had already transformed our target to the log difference, as sketched below.
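A minimal sketch of this log-difference transform, assuming the same illustrative column names as before (and a small epsilon to guard against log(0)):

# Sketch of the log-difference target transform per county.
import numpy as np
import pandas as pd

def add_log_diff_target(df: pd.DataFrame, eps: float = 1e-6) -> pd.DataFrame:
    df = df.sort_values(["cfips", "first_day_of_month"]).copy()
    log_density = np.log(df["microbusiness_density"] + eps)
    # Difference of consecutive log densities within each county.
    df["target_log_diff"] = log_density.groupby(df["cfips"]).diff()
    return df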
Further improvement came from handling outliers. For time series tasks, models are very sensitive to outliers, and there are many outliers in our target across timestamps, especially for counties with smaller populations. Some of the smallest counties have fewer than 1,000 people, and if one person suddenly decides to start a microbusiness, the density can change drastically.
Anomaly detection significantly improved our model, with the SMAPE loss improving by around 30%. This demonstrates the importance of smoothing in some time series tasks. Smoothing will be applied to the data for all the models that follow, as will be shown below.
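One possible smoothing scheme, sketched under assumptions (the rolling window, the MAD threshold, and the column names are illustrative rather than the exact values we used), is to clip values that stray too far from a rolling median:

# Per-county smoothing sketch: clip values far from a rolling median.
import pandas as pd

def smooth_series(s: pd.Series, window: int = 5, n_mads: float = 4.0) -> pd.Series:
    med = s.rolling(window, center=True, min_periods=1).median()
    mad = (s - med).abs().rolling(window, center=True, min_periods=1).median()
    return s.clip(lower=med - n_mads * mad, upper=med + n_mads * mad)

# Applied county by county, e.g.:
# df["microbusiness_density"] = (df.groupby("cfips")["microbusiness_density"]
#                                  .transform(smooth_series))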
LightGBM & XGBoost
After applying some tricks, here are the feature importances from LightGBM and XGBoost. With only the top 30 features, the model performs as well as it does with the full feature set. First, we aim to find a lower-dimensional representation of the distances between targets. It is also worth recalling some properties of SMAPE:
• SMAPE is asymmetric with respect to over- and under-predictions.
• Over-forecasting by the same proportion results in a smaller loss, so over-forecasting is preferable when the error is substantial.
• SMAPE has a maximum value of 200%, whereas MAPE has no upper bound for over-forecasts.
Approximately 1.95% of the largest SMAPE losses account for fifty percent of the total loss. In fact, this is not an especially strong skew toward the largest losses; typically, the largest 20% of errors account for roughly 80% of the total loss.
The plot above illustrates the main features; such plots are typically attractive and give the appearance of sophisticated analytics and deep data knowledge, so it is hard not to include one.
We can observe that the data is highly noisy, and elaborate baselines do not significantly improve on the last-value baseline. As a result, fairly simple logic can produce good results. The key idea of the next baseline is to build several groups of cfips and simply predict by multiplying the last value by a certain factor; the correct multiplier for each group of cfips is determined by the loss on the specified training period, as sketched below.
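A minimal sketch of this "last value times a multiplier" baseline, assuming the candidate multiplier grid and the grouping of cfips are illustrative choices rather than the exact ones used:

# For each group of cfips, pick the multiplier that minimizes SMAPE on a
# held-out training month; the forecast is multiplier * last observed density.
import numpy as np

def smape(y_true, y_pred):
    denom = (np.abs(y_true) + np.abs(y_pred)) / 2.0
    return 100.0 * np.mean(np.where(denom == 0, 0.0, np.abs(y_pred - y_true) / denom))

def best_multiplier(last_values, next_values, candidates=np.arange(0.99, 1.02, 0.001)):
    """last_values/next_values: densities of the group's cfips in two consecutive months."""
    losses = [smape(next_values, m * last_values) for m in candidates]
    return candidates[int(np.argmin(losses))]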
• At this level of competition, it can be challenging to predict target values better than a simple baseline. Moreover, there is a considerable probability that a trivial solution will land in one of the top spots on the private leaderboard by accident.
• However, overfitting to the public leaderboard provides a benefit, since the public data will not be published and the public score is a way to learn something about the January data that is already available.
• Many public baselines simply forecast the following month and copy the same value to the remaining months; be cautious if you wish to combine public submissions with your own intricate trend-seasonality logic.
DBSCAN clustering
The DBSCAN algorithm groups points according to a distance measure (usually the Euclidean distance) and a minimum number of points. A key feature of the algorithm is that it identifies outliers as points lying in low-density regions; as a result, it is not as sensitive to outliers as K-Means clustering. After forming the initial cluster, we examine all of its points to determine their Eps-neighbors. When a point has at least MinPoints Eps-neighbors, the initial cluster is expanded by including those Eps-neighbors. The procedure is repeated until there are no additional points to add to the cluster.
The result above indicates that there are no missing values in our data. Let us take the Annual Income and Spending Score columns from the prepared data and apply the DBSCAN algorithm to them.
• If the dataset has two dimensions, set the minimum number of samples per cluster to four.
• If the data has more dimensions, the minimum number of samples per cluster should be: Min_sample (MinPoints) = 2 * data dimension.
Because our data is two-dimensional, we leave the MinPoints argument at its default value of 4.
To compute Eps, we use the k-nearest-neighbors function to find the distance from every data point to its k-th nearest neighbor. We then sort these distances and plot them. The point of maximum curvature on the plot gives our Eps value.
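A minimal sketch of this procedure with scikit-learn (X stands for the two-dimensional feature matrix described above; the example eps value is purely illustrative and should be read off the plot):

# Estimate Eps from the sorted k-nearest-neighbor distances, then run DBSCAN.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import DBSCAN
from sklearn.neighbors import NearestNeighbors

def k_distance_plot(X, k: int = 4):
    dists, _ = NearestNeighbors(n_neighbors=k).fit(X).kneighbors(X)
    k_dist = np.sort(dists[:, -1])   # k-th smallest distance (index 0 is the point itself)
    plt.plot(k_dist)
    plt.ylabel(f"distance to {k}-th nearest point")
    plt.show()
    return k_dist

# After reading the elbow (maximum curvature) off the plot, e.g. eps ~ 0.3:
# labels = DBSCAN(eps=0.3, min_samples=4).fit_predict(X)   # label -1 marks outliers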
Clusters plot
This is UMAP (Uniform Manifold Approximation and Projection), which relies on Riemannian
geometry of the manifold and algebraic topology. It uses fuzzy simplicial sets to approximate the
underlying structure of the data. This method captures both local and global structures by
considering the distances between data points in the high-dimensional space and building a
topological representation of the data. The loss function for UMAP is the cross-entropy between
the pairwise similarities in the high-dimensional space (P) and the low-dimensional space (Q).
To model the local and global structure of the data, high-dimensional pairwise similarities are
based on the distance metric, and low-dimensional pairwise similarities are based on the negative
exponential of the distance in the embedding space. We chose to use UMAP mainly because it is
better at preserving the global structure of the data, while t-SNE focuses more on local structure.
UMAP is more consistent due to its deterministic initialization compared to the random
initialization in t-SNE, and it scales better. So, our second approach involves finding neighbors
based on clusters using UMAP. The clusters seem to be reasonable, and in fact, there is a
significant overlap of neighbors, similar to k-NN, from the previous approach. After finding the
neighbors, we can use these neighbors to construct graphs for graph neural networks. For each
county, the counties' longitude and latitude are appended for distance calculation. Considering
each county as the source, the destinations are the neighbors, and the weights are the normalized
distances based on their geographical data. Since we are predicting MD on a monthly level, a
graph is generated for each month in the training data. Here is our model architecture: it consists
of 1, 2, 3... layers. On the right is our training process, where early stopping is employed, and the
validation loss is actually very low, resulting in a performance of SMAPE around 0.9. The
reason the model performs well with neighbors is due to graph convolution.
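As a rough illustration of the graph construction step (the neighbor mapping, coordinate arrays, and haversine distance are assumptions made for the sketch, not necessarily the exact implementation), the monthly edge lists and normalized geographic weights could be built as follows:

# Each county is a source node, its neighbors are destinations, and edge weights
# are normalized geographic distances; the arrays can feed a GNN library.
import numpy as np

def haversine(lat1, lon1, lat2, lon2):
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    a = np.sin((lat2 - lat1) / 2) ** 2 + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2
    return 2 * 6371.0 * np.arcsin(np.sqrt(a))  # distance in km

def build_edges(neighbors, lat, lon):
    """neighbors: {county index -> list of neighbor indices}; lat/lon: per-county arrays."""
    src, dst, w = [], [], []
    for i, nbrs in neighbors.items():
        d = haversine(lat[i], lon[i], lat[nbrs], lon[nbrs])
        w.extend(d / d.max() if d.max() > 0 else np.ones_like(d))  # normalized weights
        src.extend([i] * len(nbrs))
        dst.extend(nbrs)
    edge_index = np.array([src, dst])  # shape (2, num_edges)
    return edge_index, np.array(w)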
GCN
Model architecture:
The GRU layer is used after the 1D convolution; by doing so, we can model the long-range dependencies and temporal relationships in the data. The combination of 1D CNN and GRU
layers takes advantage of the strengths of both architectures. The 1D CNN captures local patterns
and features in the data, while the GRU layer models the long-range temporal relationships. This
combination helps in obtaining a richer representation of the time series data, leading to better
predictions. The 1D CNN + GRU model provides similar performance to the GNN, and it is
important to note that we are only using microbusiness densities for prediction. This model
architecture has potential for a wide range of time series tasks, not just for MD prediction.
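A minimal PyTorch sketch of such a 1D CNN + GRU architecture is given below; the layer sizes, window length, and single-channel input are illustrative assumptions rather than our exact configuration:

# 1D CNN + GRU sketch. Input: (batch, seq_len, 1) univariate density windows;
# output: a one-step forecast per window.
import torch
import torch.nn as nn

class CNNGRU(nn.Module):
    def __init__(self, hidden: int = 64, channels: int = 32):
        super().__init__()
        self.conv = nn.Conv1d(in_channels=1, out_channels=channels,
                              kernel_size=3, padding=1)   # local patterns
        self.gru = nn.GRU(input_size=channels, hidden_size=hidden,
                          batch_first=True)               # long-range dependencies
        self.head = nn.Linear(hidden, 1)

    def forward(self, x):                  # x: (batch, seq_len, 1)
        x = self.conv(x.transpose(1, 2))   # -> (batch, channels, seq_len)
        x = torch.relu(x).transpose(1, 2)  # -> (batch, seq_len, channels)
        _, h = self.gru(x)                 # h: (1, batch, hidden)
        return self.head(h[-1])            # -> (batch, 1)

# model = CNNGRU(); y_hat = model(torch.randn(8, 12, 1))  # e.g. 12-month windows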
A new class was also used to verify the general target classification performance of the RCS. As seen below, the 1D CNN-GRU achieved the best classification accuracy, with 99% validation accuracy and 97.5% training accuracy.
Visualizing cross-validation behavior
We'll write a helper that lets us see how each cross-validation split behaves. We'll divide the data into four splits, and for each split we'll show the indices selected for the training (blue) and test (red) sets, as sketched below.
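A hedged sketch of this visualization, in the spirit of scikit-learn's cross-validation-behavior example (the placeholder data and the choice of TimeSeriesSplit are assumptions for illustration):

# Show which indices each of 4 CV splits assigns to train (blue) and test (red).
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import TimeSeriesSplit

X = np.arange(100).reshape(-1, 1)   # placeholder data
fig, ax = plt.subplots()
for split, (train_idx, test_idx) in enumerate(TimeSeriesSplit(n_splits=4).split(X)):
    ax.scatter(train_idx, [split] * len(train_idx), c="blue", marker="_", lw=8)
    ax.scatter(test_idx, [split] * len(test_idx), c="red", marker="_", lw=8)
ax.set_xlabel("sample index")
ax.set_ylabel("CV split")
plt.show()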
Conclusions
References