Database detox

Database detox

Happy new year! May crypto give an extra dimension to your life and may sprout help you with that! I used the last code-low days to first clean, and then fry sprouts database.

Indicators, indicators, indicators

One of the concepts sprout is built upon is: "everything is an indicator". What I mean by this is the fact that all numerical data can be taken as an indicator, like "normal" indicators such as price (current & ohlc) and volume, and crystal ball indicators, better known to some as technical analysis.

A clear example is number of Google searches: many people look at these number to give a better indication of an assets' value. (Whether these indicators are leading or trailing is another topic)

As such, we store quite a few indicators per post: upvotes (reddit & twitter), nr of comments, downvotes,... (reddit), and a few indicators internal to sprout: score, total upvotes,.. Under the notion that all of these data points are indicators, they were stored in the same way in our database. Over time, it has become clear this is not optimal: not only is this data queried in different ways, it was also eating up unnecessarily much space. It was especially hungry on in-memory indexes.

So, over the last week, sprouts database spent some time detoxing. Indicator count went down from 62M to a measly 5M, and more importantly, in-memory index size went from 19G to 3G! A whopping 85% memory reduction! A healthy way for the database to start 2018, and much needed for Big Plans® to come.

Photo by amoon ra, inserted here to break the wall of text

Lost Ghosts

I was so happy with the executed queries and the performance boost, I forgot to make a backup and started to perform a few maintenance queries. You know where I am going: I fried around my data (around 16% of it) What happened? Well, sprout stores all information for sites it doesn't crawl (yet) in a placeholder post, internally dubbed a "ghost post". In my quest to improve news quality on sprout, I started to precompute some fields on these posts. Only after precomputations were done I realised all other information on them were deleted...

Because these ghost posts don't store primary data by themselves, no
"real" data was lost and the impact was limited to executing the above mentioned detoxing and precomputation queries again, since last backup
(mid december). A task that could have been greatly reduced in terms of time and necessary brain power if i would have made backups more often.

Morale of the story: make backups, kids.