The GitHub PR is here.
The loaded data comes from the `USA/CA/butte_county/college/chico` scraper. I chose it as a starting point because it produces two types of data in two different formats, which let me verify that the library reads `schema.json` properly and can load and map the data regardless of output format. You can find that PR here. It also created 2 new datasets here.
We currently have auto-commit turned off so that someone reviews each run before it is committed. But the library does load data from files, uses the `schema.json` file, and (mostly) works!
Current Process:
1. Call the `/common/etl` library from a specific scraped agency directory, passing the `schema.json` file to it (a hypothetical sketch of the file's shape follows this list).
2. The library will locally clone the `pdap/datasets` and `pdap/data-intake` repos.
3. Then it will verify the `agency_id` exists and grab the appropriate record from the database.
4. Next it will enumerate over the `data` items in the `schema.json`. It will create a dataset wherever the id is null; otherwise it will find the existing dataset and sync the data from the table with the `schema.json`. Whichever source has the more recent `last_modified` time is the "source of truth", and that is the data used to sync (see the first sketch below).
5. It will do the same for the `data.mapping` object: it will search for a table in `pdap/data-intake` with the exact name of the `data_type` and then sync the columns so they match. If a new column is in the database but not in the `schema.json` file, it will add the missing column with a `__skip__` value (see the column-sync sketch below).
6. Once everything has been synced, it will finally enumerate over the `mapping` object and insert the data into `pdap/data-intake`. Erroneous records are skipped, with a message displayed to the console (see the insert sketch below).
7. The `schema.json` in the directory is overwritten with the synced changes from the database.
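Based only on the fields the steps above mention (`agency_id`, `data`, `id`, `last_modified`, `data.mapping`, `data_type`), a `schema.json` might look roughly like this; the nesting and example values are my assumptions, not the library's actual format:

```json
{
  "agency_id": 42,
  "data": [
    {
      "id": null,
      "data_type": "crime_log",
      "last_modified": "2022-05-01T00:00:00Z",
      "mapping": {
        "date_reported": "Date Reported",
        "case_number": "Case #"
      }
    }
  ]
}
```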
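Step 4's sync rule, where the side with the newer `last_modified` wins, could be sketched like this. The `db` handle and its `create_dataset`/`get_dataset`/`update_dataset` methods are hypothetical stand-ins; the newsletter doesn't show the library's real API.

```python
from datetime import datetime

def parse_ts(value: str) -> datetime:
    """Parse an ISO-8601 timestamp; handling of the 'Z' suffix is an assumption."""
    return datetime.fromisoformat(value.replace("Z", "+00:00"))

def choose_source_of_truth(schema_item: dict, db_record: dict) -> dict:
    """Return whichever side was modified most recently (step 4's rule)."""
    if parse_ts(schema_item["last_modified"]) >= parse_ts(db_record["last_modified"]):
        return schema_item
    return db_record

def sync_data_items(schema: dict, db) -> None:
    """Create missing datasets and sync existing ones against the database."""
    for item in schema["data"]:
        if item["id"] is None:
            # A null id means the dataset does not exist yet, so create it.
            item["id"] = db.create_dataset(item)
        else:
            record = db.get_dataset(item["id"])
            truth = choose_source_of_truth(item, record)
            db.update_dataset(item["id"], truth)  # newer side overwrites older
            item.update(truth)                    # schema.json gets the synced values
```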
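The column sync in step 5 is simple to picture: any column that exists in the data-intake table but not in the mapping gets the `__skip__` placeholder. This sketch assumes the mapping's keys are the table's column names, which the newsletter implies but doesn't state outright.

```python
def sync_mapping_columns(mapping: dict, table_columns: list[str]) -> dict:
    """Add any data-intake table column missing from the mapping (step 5)."""
    for column in table_columns:
        if column not in mapping:
            # Column found in the database but not in schema.json.
            mapping[column] = "__skip__"
    return mapping

# Example: the table gained a 'disposition' column since the last scrape.
mapping = {"date_reported": "Date Reported", "case_number": "Case #"}
sync_mapping_columns(mapping, ["date_reported", "case_number", "disposition"])
# mapping now also contains {"disposition": "__skip__"}
```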
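And step 6's insert pass, with erroneous records skipped and logged to the console, might look roughly like the following. The `table` handle and its `insert(row)` method are hypothetical; only the skip-and-log behavior comes from the newsletter.

```python
def insert_records(records: list[dict], mapping: dict, table) -> int:
    """Map scraped records onto table columns and insert them (step 6)."""
    inserted = 0
    for i, record in enumerate(records):
        try:
            row = {
                column: record[source_field]
                for column, source_field in mapping.items()
                if source_field != "__skip__"  # '__skip__' columns are omitted
            }
            table.insert(row)
            inserted += 1
        except (KeyError, ValueError) as err:
            # Erroneous record: skip it and display a message to the console.
            print(f"Skipping record {i}: {err}")
    return inserted
```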