
I'm currently facing a difficult problem regarding re-architecting a legacy webshop; I would really appreciate some insight.

The system depends on a third-party API for its product/variant data; the API returns a JSON "page" containing an array of products.

Currently, to keep everything in sync, the process queries this API periodically to fetch all the products for a supplier and then updates a local MySQL database.
I cannot simply diff the changes made to a product, since the API exposes no reliable information about product updates.

Edit: Let's say I update the data twice a week. To check whether the data changed, I need to query my database for each product and then compare field by field. I tried it and it takes forever. The API has a last_updated_at field on every record, but it doesn't reflect when a field changed, only that a new indexation occurred for the whole supplier. Unfortunately, more often than not, nothing has actually changed when this indexation occurs.

This process is written in procedural PHP and is quite heavy, since it relies on json_decode and MySQL inserts/updates inside nested foreach loops (one for each level of the JSON). Some of the documents can easily weigh more than 20 MB, a supplier can have 1,000+ products and 10,000+ variants, and we are syncing 50+ suppliers.

I would like to depart from this model and introduce some kind of continuous update model.

I don't think I strictly need MySQL for storing products, since I have to re-normalize the data to make it fit. At first I was thinking about fetching a single product as the front end requests it and caching the response, then using MySQL only for the order/quote part. Unfortunately I can't do something like GET /product/ID; the API only accepts POST requests with a request body, so caching seems more complicated.

Anyway, to cut it short: every time I think of something, I am stopped in my tracks by some hurdle that makes me question my approach. I seem to go round and round without making any headway.

What would you do in my position?

  • "I cannot just diff the changes made to a product since there is no reliable information in the API about product updates." This doesn't seem to follow logically. If you have items, you can diff them. Can you clarify what you mean here?
    – JimmyJames
    Commented Aug 26, 2019 at 14:02
  • Of course. Let's say I update the data twice a week. To check whether the data changed, I need to query my database for each product and then compare field by field. I tried it and it takes forever. The API has a last_updated_at field on every record, but it doesn't reflect when a field changed, only that a new indexation occurred for the whole supplier. Unfortunately, more often than not, nothing has actually changed when this indexation occurs. Commented Aug 26, 2019 at 14:05

1 Answer


There are a few different concerns here. First, with the current model, you might be able to work around your issues using a hash function. Take the data that you get from the supplier as a whole (i.e. the original JSON, or whatever) and run something like MD5 or SHA-1* on it. You then store this hash on the record along with the data.

Now, when you check the supplier's site again, you calculate the hash. If it matches, the data is the same as before and you don't need to touch the DB. If there are fields in the supplier response such as unique codes or timestamps, you will need to remove or mask them before doing the hash calculation.
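A minimal sketch of that idea in PHP, assuming the volatile field is the last_updated_at you mentioned (any other unstable fields would need the same treatment, and nested arrays would need a recursive sort):

```php
<?php
// Compute a stable hash for one product record. Field names here are
// illustrative; adapt them to the actual supplier payload.

function stable_product_hash(array $product): string
{
    // Drop fields that change on every indexation run.
    unset($product['last_updated_at']);

    // Sort keys so that field order does not affect the hash.
    ksort($product);

    // MD5 is fine here: we only detect changes, nothing security-related.
    return md5(json_encode($product));
}

$old = ['id' => 42, 'price' => '19.99', 'last_updated_at' => '2019-08-20'];
$new = ['id' => 42, 'price' => '19.99', 'last_updated_at' => '2019-08-26'];

var_dump(stable_product_hash($old) === stable_product_hash($new)); // bool(true)
```

Two records that differ only in the masked timestamp hash identically, so a re-indexation with no real changes costs you one hash comparison instead of a field-by-field diff.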

For the redesign

There are many caching solutions that don't require using GETs, but I think you want to use some built-in functionality of a web server, perhaps? If so, you could build a proxy web API that accepts GET requests and makes the POST call to the external server. This layer would also be a good place to implement your cache. You could even use this layer directly in your front-end interface; just don't rely on the client side to relay the price to your order-processing layer.
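The caching part of such a proxy could be sketched like this. The fetcher is passed in as a callable so the upstream POST call (curl, Guzzle, whatever you use) stays pluggable; the file-based cache and the TTL are just assumptions for illustration:

```php
<?php
// Cache wrapper for a GET-to-POST proxy. On a cache hit the stored
// response is returned; on a miss the callable performs the real
// POST to the supplier API and the response is cached.

function cached_fetch(string $id, callable $fetchViaPost, int $ttl = 3600): string
{
    $cacheFile = sys_get_temp_dir() . '/product_' . md5($id) . '.json';

    if (is_file($cacheFile) && time() - filemtime($cacheFile) < $ttl) {
        return file_get_contents($cacheFile); // cache hit
    }

    $body = $fetchViaPost($id);               // cache miss: POST upstream
    file_put_contents($cacheFile, $body);
    return $body;
}

// A front controller (e.g. proxy.php?id=123) would then do something like:
// echo cached_fetch($_GET['id'], fn($id) => post_to_supplier_api($id));
// where post_to_supplier_api() is your hypothetical curl-based POST helper.
```

In production you would likely swap the temp files for Redis/Memcached or your web server's cache, but the shape of the layer stays the same.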

*MD5 and SHA-1 should not be used for anything security-related, but for something like this they are perfectly adequate.

  • I see your point, thanks for your help. You had me thinking: let's say I store each product with the hash of the product data as its key. Do you think it would be faster to compare old/new hashes instead of comparing field by field as I have tried? Commented Aug 26, 2019 at 14:50
  • I wouldn't use the hash as the key. That would mean that when the product data changes, your key changes, and you'd need some way to find the old record so you could delete it anyway. I don't see any advantages to that, just challenges. But yes, the idea is that you can calculate the hash very quickly. It also means that you wouldn't need to pull all the data from the DB for comparison, just the stored hashes; you can probably pull all hashes for a supplier in one query. After that you only interact with the DB for the things that have changed.
    – JimmyJames
    Commented Aug 26, 2019 at 15:10
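The one-query comparison described in the comments above could be sketched as follows. The table and column names are assumptions, and the PDO lines are commented out since they depend on your schema:

```php
<?php
// Compare stored hashes against freshly computed ones so that only
// changed (or new) products touch the database afterwards.

// $pdo  = new PDO('mysql:host=...;dbname=shop', $user, $pass);
// $stmt = $pdo->prepare('SELECT product_id, hash FROM product_hashes WHERE supplier_id = ?');
// $stmt->execute([$supplierId]);
// $stored = $stmt->fetchAll(PDO::FETCH_KEY_PAIR); // [product_id => hash]

function products_to_update(array $stored, array $fresh): array
{
    $changed = [];
    foreach ($fresh as $productId => $hash) {
        // New product, or hash differs from the stored one.
        if (($stored[$productId] ?? null) !== $hash) {
            $changed[] = $productId;
        }
    }
    return $changed;
}
```

One SELECT per supplier pulls every stored hash, the in-memory loop is cheap, and only the IDs it returns need inserts/updates.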
