You’ll be known as the real estate Friendster
For as long as I have been working at Redfin, we have received e-mails about our poor performance. This weekend we got an above-average number of e-mail complaints about performance. I took several calls about it, two long-time users complained it had gotten worse, and one of our developers said it was particularly bad.
From a concerned user:
The site takes too long to load on the initial visit. When it’s more than 20 seconds it’s too long. The search also takes too long. You really need to figure this out or you’ll be known as the real estate Friendster.
In typical fashion I politely responded that we are working on it (which isn’t a lie; we are working on it, but performance isn’t a problem you normally solve overnight), but one friend’s report nagged at me. 20 seconds to load the details page? That sounded really long. However, no one at Redfin was seeing pages take that long to load. We looked at our KeyNote monitoring, but everything looked fine. We looked at our web servers: none were misbehaving, and the loads were a little higher than normal since 60 Minutes, but not too bad. We looked at the traffic through the load balancer, and it was up since 60 Minutes, but not too bad. What was wrong?
Puzzled, I e-mailed my friend back and asked him to install Firebug and look at all the HTTP requests our page was making, to see if he could identify which one was taking an abnormal amount of time. It turned out to be the initial request, which was puzzling since we weren’t seeing the problem. Being a tester at Microsoft, he decided to experiment and noticed that logging out drastically improved performance. Once we found this out, we all tried logging in and logging out. No one noticed a difference except on one account, belonging to our CTO, which turned out to have the same problem.
Still, we were puzzled. The performance problem was occurring on the details page, which shouldn’t make any requests to the user table. Or does it? We used A Poor Man’s Query Profiler and found that for some user accounts we were making 2100 SQL queries when they hit pages which should not have resulted in a single query. Digging deeper into our login validation, we found a bug in our Hibernate mapping that caused us to relate PersonId to LoginId and then chain together object loads. We figure this affected at least 700 registered users.
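To show how chained object loads can turn one page view into thousands of queries, here is a toy sketch in Python. It is not our actual code and does not use Hibernate; CountingStore, build_chain, and load_user are hypothetical names invented for the illustration, and the chain length of 2100 is borrowed from the query count above purely as an example.

```python
class CountingStore:
    """Toy stand-in for a database that counts every lookup it serves."""

    def __init__(self, rows):
        # rows maps a row id to the id of the next row it references, or None.
        self.rows = rows
        self.query_count = 0

    def fetch(self, row_id):
        # Each lookup stands in for one SQL query.
        self.query_count += 1
        return self.rows.get(row_id)


def build_chain(length):
    """Hypothetical bad mapping: every row references the next one in line."""
    rows = {i: i + 1 for i in range(1, length)}
    rows[length] = None  # the last row references nothing
    return rows


def load_user(store, start_id):
    """Follow references from start_id until none remain; return rows loaded."""
    seen = set()
    current = start_id
    while current is not None and current not in seen:
        seen.add(current)
        current = store.fetch(current)
    return len(seen)


# A healthy account: one row, no further references.
healthy = CountingStore({1: None})
load_user(healthy, 1)

# An affected account in this toy model: loading one row drags in 2100.
chained = CountingStore(build_chain(2100))
load_user(chained, 1)

print(healthy.query_count, chained.query_count)
```

Counting queries per request like this, rather than timing pages on a handful of accounts, is what makes a data-dependent problem visible: most accounts look fine and only the ones with long chains blow up.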
Not wanting to be known as the real estate Friendster, a few hardworking developers fixed this last night.
What’s interesting is that no one at Redfin happened to have an account they used frequently which repro’d the problem. Or that new accounts had the problem. What’s also interesting is how the issue came to a head only after 60 Minutes caused an increase in both server load and user registrations, which aggravated the problem to the point that some users found it completely unacceptable.
Of course, there is still much more work we want to do on performance, but this at least solves a major problem.
We owe a big thanks to Justin, the Microsoft tester and Redfin customer, who gave us the first big clue to solving this puzzle.