Reviewing merge request #1248: DB optimisation So, I noticed that Laconica has to calculate the populari...

DB optimisation

So, I noticed that Laconica has to calculate the popularity (weight) of every notice to get an ordered list of notices for the popularity thing in the side bar (and on the popular notices tab). This is something like O(n^2).

So I fixed that. Now we maintain a rank (1 is most popular, 0 is not ranked) in notice.rank. notice.rank gets updated whenever a notice is favourited. If it gets favoured, it checks its score against the score of the notice above it in rank and may move up as far as rank 1. Disfavouring a notice does the same but in reverse (with an extra check for the case where a notice has no more entries in the fave table - and the rank can be reset to 0 i.e. unranked). This is O(n) in the worst case, and O(1) in the best case.

I've tested this with PostgreSQL and MySQL, and I added to the 08to09*.sql upgrade scripts. I also created scripts/fixup_notice_ranks.php which should be run when upgrading to 0.9.x, although I couldn't find where upgrade instructions should go (it couldn't go in the SQL because it relies on the popular.dropoff configuration variable).

Note: you can safely ignore the lighttpd 404 hack - that's already been merged into 0.8.x. I use lighttpd so I needed it.

Commits that would be merged:

Version 1
  • Version 1
  • b0bb1ff
  • 5eb97fa
  • Enable 404-based rewrites for lighttpd installations in /

  • 0438ff8
  • SQL comments start with -- not //!

  • a069b96
  • Add rank to notice table.

  • 374a34c
  • Modify notice.rank on insert/delete in fave.

  • e277490
  • Use notice.rank to get popular notices.

  • 788cf2b
  • Merge commit 'mainline/0.9.x' into 0.9.x-rank

  • 38a99c0
  • Add script to initialise (or fix) notice.rank.

Showing b0bb1ff-5eb97fa

Comments

I want to give this a good scaling test on Postgresql – can ppl hold off merging until I'm back in my home country (Wednesday).
:–)

I haven’t even looked at this code, but it sounds like it doesn’t preserve the time-decay that we've got in the current popularity code.

Also, saving rank seems like a bad idea to me. If any of the ranked notices are deleted, all below them will need to be re-ranked — what a waste!

IF some rollup information needs to be saved in the DB — which I don’t think is true, since we cache this stuff for ~1h anyways — I'd rather we saved number of faves/notice instead. Storing data in each row that depends on the state of every other row in the table seems like a bad idea.

@evan: It does preseve the time decay. When an entry appears in the fave table, the only thing that can happen is the notice can go up 0 or more ranks. When an entry is removed from the fave table, it can go down 0 or more ranks. And the weights are absolute values, so we can compare two notices directly, and keep the list sorted.

And when a notice gets deleted, all that happens is each lower-ranked notice gets their rank changed – we don’t bother recalculating their weights or anything like that, but yes it would require changing n rows.

I can’t really say anything against your problem with storing unique values, other than this way is theoretically faster, and as far as I've tested the ranks don’t get out-of-sync (I'd like to see a test suite for the model … I guess that can be my next activity).

Ah. Looking it over, it seems like you re-score every notice every time any other notice gets faved. That sounds pretty expensive. Also, you score it relative to save time, not display time. I don’t think that’s a big deal, but worth noting.

I was thinking as I was at the gym about how to deal with this problem correctly. We need to store a time-decay-related value, at save time, that’s independent of other rows in the table. We can then sort/sift those values at request time.

I thought initially we could just store the time-decayed weight relative to some fixed time or epoch. However, epochs close to “now” will cause problems as we pass them, and epochs far in the past (t = 0) or far in the future (t = +infinity) would show negligible (or incalculable) differences for events near “now”.

So, I think the number we should store is the date at which the relevance of the notice drops below some boundary value (say, 0.0001). So, at save time, we solve:

sum(exp(-(t - fave.modified)/period)) = BOUNDARY

…for t, and store that t value for the notice. That value won’t change as we go into the future, and it will accurately reflect relative relevance.

Finally: since only a small minority of notices ever get faved, I think it may make sense to have a separate table for roll-up information about a notice, like notice_popularity.

it seems like you re-score every notice every time any other notice gets faved.

Nope. In the case of adding a fave, it tests to see if the notice’s weight is greater than that of the notice above, if so the two notices are swapped and it checks the notice above again (until the above weight is not greater than the other notice’s weight). If its weight isn’t greater than the weight of the notice above, it does nothing more.

In the best case of adding a fave to a notice, it calculates two scores (where the favoured notice’s score is less than that of the notice above it). In the worst case, it would calculate every (ranked) notice’s score.

With caching, we can assume that we currently recalculate every favoured notice’s score every time a notice is (dis)favoured. With this new approach, we’re recalculating the score of less_than_or_equal_to_every notice for each (dis)favouring (but most of the time it’s much_less_than_every notice).

Also, you score it relative to save time, not display time. I don’t think that’s a big deal, but worth noting.

You’re right, it’s not a big deal – when we score notices doesn’t matter so long as the weight function doesn’t fluctuate with time (which it doesn’t, and shouldn’t). The rankings don’t change by time – they can only change when a notice is favoured or disfavoured. So rankings will always be correct – until we add a new favourite, when some will have to be recalculated.

So, I think the number we should store is the date at which the relevance of the notice drops below some boundary value (say, 0.0001).

That sounds like it could work. I’ll look into that and report back.

I think it may make sense to have a separate table for roll-up information about a notice, like notice_popularity.

Yes, I agree. Putting whatever we decide to save in a seperate table is probably a good idea.

Add a new comment:

Login or create an account to post a comment

How to apply this merge request to your repository