Lab 9 - Data Mining


SQL or Data Mining?

  1. Given records of hospital treatments, we need to find out how many of them took more than 2 days.
  2. Given records of patient check-ups, we need to predict the number of patients in each month (Jan, Feb, etc.) for the next 12 months.
  3. Assuming that our predictions from question 2 are correct, we need to find the month in which the hospital will perform the most check-ups.
  4. We have micro-array expression data of various genes. We need to determine which genes lead to a certain genetic condition.
  5. We have micro-array expression data of various genes. We need to find the number of genes expressed above a specified threshold t over all our features.
  6. We want to discover relationships between products sold by an e-store.


Decision Tree

Here are our training data:



Here is the decision tree:

 

Here are some test records:

[22 , high , no , fair , yes]

[45 , high , no , excellent , yes]

[32 , low , yes , excellent , yes]

How would our decision tree classify these records?
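
The mechanics of classifying a record with a decision tree can be sketched in a few lines of Python. This is only an illustration: the tree below is a made-up toy, not the tree from this lab, and the attribute names ("student", "credit") and the example record are assumptions chosen for the sketch. The idea is to start at the root, follow the branch that matches the record's value for the tested attribute, and repeat until a leaf (a class label) is reached.

```python
# Hypothetical toy tree, NOT the tree from the lab figure.
# An internal node is a pair (attribute, {value: subtree}); a leaf is a class label.
TOY_TREE = ("student", {
    "yes": "yes",
    "no": ("credit", {"fair": "no", "excellent": "yes"}),
})


def classify(tree, record):
    """Walk from the root to a leaf, following the branch matching the record."""
    while not isinstance(tree, str):
        attribute, branches = tree
        tree = branches[record[attribute]]
    return tree


# Made-up record, used only to show the call.
example = {"student": "no", "credit": "excellent"}
print(classify(TOY_TREE, example))
```

If the last field of each test record is its true class (an assumption about the record layout), comparing it with the label returned by the tree shows whether the tree classified that record correctly.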


Data Visualization

Many-Eyes is a data visualization tool available at http://www.many-eyes.com/
With Many-Eyes, you can upload your own data and visualize it with the chart type of your choice. You can also publish your charts on the website.

How can we visualize and interpret the results we obtained in the previous part?

Now, let's go to the Many-Eyes website, load our data into the system, and visualize it.

To load your own data, you have to register with the system. Don't worry, it doesn't take long.

After registering, log in to the system and click on "Create Visualization" in the left menu.

Then, select "upload your own dataset".

Fill in the form. You can use this text file as a sample input.

When you click on upload after filling in the form, you will be directed to a page where you can choose a chart type from several options.

What kind of chart is suitable for your data? A bar chart, a pie chart?


Python Practice: Reading from files

Download this Excel file. It contains the average number of children per woman in many different countries for the years 1989 and 2009.

Before we use data in any data mining or visualization procedure, we usually want to correct it, clean it, or even transform it into something new. For example, the dataset you downloaded has some missing values. One way to cope with them is the following: if only one number is missing for a country (i.e. we have no statistic for either 1989 or 2009), give it the value of the other year. If both are missing, do not include that country in the final dataset.

  1. First, open this file in Excel and save it as .csv (comma separated values)
  2. Write a Python program that does the preprocessing described above (friendly reminder: write your code in a file, not in the console)
  3. Write the result to a new file
  4. Upload this file to Many-Eyes and see what visualizations you can create to depict this information
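
As a starting point for steps 2 and 3, here is a minimal sketch that uses Python's standard csv module. It makes several assumptions that you should check against your own file: the .csv has one header row, each data row has exactly three columns (country, 1989 value, 2009 value), and a missing value appears as an empty cell. The function names are made up for this example.

```python
import csv


def clean_rows(rows):
    """Apply the missing-value rule to rows of [country, value_1989, value_2009].

    Assumes a missing value is an empty (or whitespace-only) cell.
    """
    cleaned = []
    for country, v1989, v2009 in rows:
        v1989, v2009 = v1989.strip(), v2009.strip()
        if not v1989 and not v2009:
            continue  # both years missing: leave this country out
        # If one year is missing, copy the value of the other year.
        cleaned.append([country, v1989 or v2009, v2009 or v1989])
    return cleaned


def preprocess(in_path, out_path):
    """Read the raw .csv, clean it, and write the result to a new file."""
    with open(in_path, newline="") as src:
        reader = csv.reader(src)
        header = next(reader)  # assumed: the first row holds column names
        rows = clean_rows(reader)
    with open(out_path, "w", newline="") as dst:
        writer = csv.writer(dst)
        writer.writerow(header)
        writer.writerows(rows)
```

With hypothetical file names, the program would be run as preprocess("children.csv", "children_clean.csv"). Values are kept as text here because nothing in the rule requires arithmetic; convert them with float() if a later step needs numbers.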


CS105