Blake O'Hare .com

How the TwoCans Q&A System Works

...since I'm sure you're all wondering!

A few notes about SQL

If you know SQL, you can skip this paragraph. If you're not familiar with SQL databases, this is all you need to know to understand: SQL databases are comprised of many tables. I (the programmer) define what columns are in each of the tables. However, the actions of the user will create, update, or delete rows in this table. The set of tables and their columns are called "the schema". The rows of the table are called "entries". For example, on this website there is a table called "users" and each row represents one registered user. When you registered, a row was added. There are columns for your username, email address, password, etc. Each time you log in, there's a column that indicates the last time you used the website which gets updated. This is how dynamic websites generally work.

How the Q&A used to be

For the Q&A on TwoCans, initially there were only two tables. "questions" and "answers". Each question has a unique ID#. The questions table would of course have the question ID#, the user that asked it, and the text of the question itself. The answers table would have the question ID# it was an answer to, who answered it, a 1 or 0 depending on if the user had read the answer, and of course the text of the answer.

Problems

This was all fine and dandy at first when there were only a few thousand questions and answers. However a few problems arose once this number got closer to a million.
  • When you answer questions, you don't want to see questions you've already answered. This required me to check not only the questions table, but also the answers table to see if you have any answers for a particular question. When there are a million answers, this is a very slow operation.
  • When you pick a random item in a table, SQL does this by selecting the contents of the WHOLE table, ordering it randomly, and then taking the first one. Shuffling a whole table to get 1 entry is a very slow operation.
  • Some people's ancient questions were still getting answers while others were not getting answers to their new questions.

At this point the page load time had slowed to about 2 seconds. Additionally, the database would crash and need to be rebooted every few minutes. I had to take the site down and rewrite it in a panic.

The New (Current) System

Working off a giant data set is something that should be avoided whenever possible. Therefore, there are now several tables that allow me to creatively work around doing things directly to the questions/answers table.

The Active Pile

There is a table called "active_questions". This table contains all the questions that have been asked or re-activated in the last 24 hours. This table usually contains around 600 entries at any given moment, last I checked. This table contains just the question ID#'s of active questions and a few other bits of interesting information such as the number of answers the question has, the user who asked it, the language it was asked in, and the time it was last activated or asked. This entry is deleted once it is more than 24 hours old. However the question itself is not deleted.

When you load the Answer page, the question you receive is coming from this table. The text of the question still comes from the questions table, however the full questions table is indexed against the question ID# so looking that up is not a big deal in terms of page load time. If you understand SQL, you know indexes are magical. If you do not understand SQL, then understand this: indexes are magical.

However, selecting a question randomly and checking to see if you already answered it is still a bit slow. So there are a few more optimizations...

The Question Plate

As I said earlier, the way SQL picks a random row out of a table is to shuffle the order of a whole table and then pick the first row. This is stupid slow for 1 row. It also seems silly if I need to do this many times one after another. So there is a table called "question_plate". What I do instead is just get the last 100 questions in the active_questions table and return them. No random ordering. Once the rows are received, they are randomly ordered by PHP and are inserted into the question_plate table along with your user ID# attached to them. This is your personal queue of randomly ordered questions. You may notice the page load is a bit slow once in a while, but just slightly.

After this point, each time you load the page, the next question in your queue is pulled out and fetched from the questions table. This is a nearly instantaneous operation (because of the index on the question ID#). Your queue is cleared and restocked after every few minutes or when you run out of questions in your queue. Those page loads are slightly slower than the others, but not really noticeable.

The Question Acknowledgement

The last bit of information is in a table called question_ack. This table contains a list of question ID#'s paired with user ID#'s. This list indicates the questions that you have "acknowledged". This means you have either answered the question or pressed the "Skip Forever" button.

When a question is pulled from the active_questions table to put in your queue, it is checked against the question_ack table. If an entry exists that corresponds to you and that question, it is ignored. Also, once an entry is pulled out of the queue, it is checked again. This prevents bugs such as if you have to browsers open and are answering questions in both. Imagine if you left your browser open on question #123456 and you opened a new browser and after a while your queue was restocked. Then when you got to #123456 in your second browser and answered it, then go back and answer it in the old browser, that would be a problem because you've answered the same question twice. This prevents that.

That's pretty much it. Now you have instant page loads instead of 5-second-long page load times. I hope I didn't put you to sleep too much.