Monday, May 24, 2010

MinneBar "Data Mining: what is it good for?" notes

Dan led a session at MinneBar called "Data Mining: what is it good for?". In keeping with the "un"-conference format, there was no planned agenda, and we talked about whatever the crowd was interested in.

It was surprisingly popular, so I typed up notes from what remained on the board at the end, then expanded them and added links. They should be useful as a reference if you were there.

We talked about terms, tools and techniques, and further resources (reading, interest groups, public datasets).

Our two main areas of discussion were finding patterns and visualizing those patterns. I have not separated those out below.


Terms

Data mining is the process of extracting patterns from data (especially large volumes of data). Some patterns you can look for:

- associations among variables (e.g., association rule mining)
- clusters, so each cluster contains similar data and different clusters contain dissimilar data (see also cluster analysis)
- numerical relationships among variables (e.g., regression analysis)

These patterns may be used to explain existing data, or in some cases to try to predict future data (usually assuming that the past predicts the future).
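To make the explain-versus-predict distinction concrete, here is a minimal sketch of the regression case in plain Python. It was not shown in the session; the `fit_line` helper and the month/sales numbers are made up for illustration.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Covariance of x and y, and variance of x (both unnormalized).
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical data: sales observed in months 1 through 4.
months = [1, 2, 3, 4]
sales = [3.0, 5.0, 7.0, 9.0]

# "Explain": the fitted line summarizes the existing data.
slope, intercept = fit_line(months, sales)

# "Predict": extend the line to month 5, assuming the trend continues.
forecast = slope * 5 + intercept
print(slope, intercept, forecast)
```

The forecast is only as good as the assumption that the past trend continues, which is the caveat noted above.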

Data warehousing is the process of taking data from disparate parts of a large organization and putting that data in one place so that it can be analyzed. Data warehousing enables data mining.

A data cube (also called an OLAP cube) organizes data by dimensions of interest so that it can be examined interactively along those dimensions, even when the volume of data is very large.
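As a rough sketch of the idea (not a real OLAP engine), the Python below precomputes totals for every combination of dimensions, so that later lookups along any dimension are a single dictionary access. The `build_cube` function, the `region`/`year` dimensions, and the sales figures are all assumptions made up for this example.

```python
from collections import defaultdict
from itertools import combinations


def build_cube(rows, dimensions, measure):
    """Precompute sums of `measure` for every subset of `dimensions`."""
    cube = defaultdict(float)
    for row in rows:
        # Each row contributes to the grand total, to each single-dimension
        # slice, and to each finer-grained cell.
        for size in range(len(dimensions) + 1):
            for dims in combinations(dimensions, size):
                key = (dims, tuple(row[d] for d in dims))
                cube[key] += row[measure]
    return cube


# Hypothetical fact table.
rows = [
    {"region": "north", "year": 2009, "sales": 10.0},
    {"region": "north", "year": 2010, "sales": 15.0},
    {"region": "south", "year": 2009, "sales": 7.0},
    {"region": "south", "year": 2010, "sales": 8.0},
]
cube = build_cube(rows, ("region", "year"), "sales")

total = cube[((), ())]                                # all sales
north = cube[(("region",), ("north",))]               # roll up over year
y2010 = cube[(("year",), (2010,))]                    # roll up over region
cell = cube[(("region", "year"), ("south", 2009))]    # one cell
print(total, north, y2010, cell)
```

Precomputing every combination trades memory for lookup speed; real systems are more selective about which aggregates they materialize.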

Tools and techniques

- Hadoop: an open source project for reliable, scalable distributed computing. One well-known subproject is "MapReduce", a method of distributed computation popularized by Google.
- Mathematica. (No one had examples.)
- Google Spreadsheet. Easy to get data in (via 50 URL fetches) or out (via API).
- App Engine. (No one had examples.)
- Mallet, an open-source, commercial-friendly Java-based package for statistical natural language processing, document classification, clustering, topic modeling, information extraction, and other machine learning applications to text.
- Weka, a collection of tools for data mining: data pre-processing, classification, regression, clustering, association rules, and visualization. It has a GUI and an API.
- Ab Initio: a commercial data mining suite
- Tableau: commercial business intelligence and data visualization software
- Processing: a Java-based visualization tool developed at MIT
- R: an open-source statistics language influenced by S and Scheme.
- SAS: A commercial statistics and data processing suite
- SPSS: A commercial statistics package (owned by IBM)
- Clementine: A commercial data mining package (owned by SPSS)
- JMP: A commercial statistics package
- OpenLayers: an open source, BSD-style licensed tool to display a dynamic map on a web page
- MapServer: an open source, MIT-style licensed tool to display spatial data and interactive mapping applications (from U of MN)
- perl + gnuplot: tools to quickly visualize data
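
For anyone unfamiliar with the MapReduce model mentioned under Hadoop above, here is a single-process sketch of its three steps (map, shuffle, reduce) using the usual word-count example. This is plain Python, not the Hadoop API; the function names and sample documents are made up for illustration, and a real cluster would run the map and reduce steps in parallel across machines.

```python
from collections import defaultdict


def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one document."""
    for word in document.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle: group all emitted values by their key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce: combine all values for one key into a single result."""
    return key, sum(values)


def word_count(documents):
    pairs = (pair for doc in documents for pair in map_phase(doc))
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())


# Hypothetical input documents.
counts = word_count(["data mining finds patterns", "mining data at scale"])
print(counts)
```

Because each map call sees only one document and each reduce call sees only one key, the work can be split across many machines without coordination beyond the shuffle.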

Further Resources

Websites

- Wikipedia page on data mining
- Visual Complexity Blog: artistic, elaborate visual presentation

Books

- Edward Tufte. A visualization guru. See his classic book "The Visual Display of Quantitative Information".
- "Introduction to Information Retrieval" by Christopher Manning, Prabhakar Raghavan, and Hinrich Schütze. Information Retrieval (IR) is the process of returning useful results from a corpus of data given a search query. Example: Google search.
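
To give a flavor of IR, here is a toy inverted index in Python: it maps each term to the documents containing it, then answers a query by intersecting those sets. This is only a sketch under simplifying assumptions (whitespace tokenization, no ranking); `build_index`, `search`, and the sample documents are hypothetical and not taken from the book.

```python
from collections import defaultdict


def build_index(docs):
    """Map each lowercase term to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index


def search(index, query):
    """Return ids of documents that contain every term in the query."""
    terms = query.lower().split()
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result


# Hypothetical corpus.
docs = {
    1: "data mining tools",
    2: "visualization tools",
    3: "mining weather data",
}
index = build_index(docs)
hits = search(index, "data mining")
print(sorted(hits))
```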

Other Interest Groups

- Minnesota Chapter of the Data Management Association (DAMA).

Public datasets

- weather data: NOAAPORT
