It was surprisingly popular, so I typed up notes from what remained on the board at the end, then expanded them and added links. They should be useful as a reference if you were there.
We talked about terms, tools and techniques, and further resources (reading, interest groups, public datasets).
Our two main areas of discussion were finding patterns and visualizing those patterns. I have not separated those out below.
Terms
Data mining is the process of extracting patterns from data (especially large volumes of data). Some patterns you can look for:
- associations among variables (e.g., association rule mining)
- clusters, where each cluster contains similar data and different clusters contain dissimilar data (see also cluster analysis)
- numerical relationships among variables (e.g., regression analysis)
These patterns may be used to explain existing data, or in some cases to try to predict future data (usually on the assumption that past data predicts the future).
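As a rough illustration of the regression case, here is a minimal sketch in plain Python that fits a straight line to past data and uses it to predict the next value. The data (month number vs. units sold) and the function names `fit_line` and `predict` are made up for this example; nothing like this was written on the board.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of squared deviations of x, and of x-y co-deviations.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def predict(model, x):
    """Apply a fitted (slope, intercept) pair to a new x value."""
    slope, intercept = model
    return slope * x + intercept


# Hypothetical past data: month number vs. units sold.
months = [1, 2, 3, 4, 5]
units = [12, 14, 16, 18, 20]

model = fit_line(months, units)
# Predicting month 6 assumes the past trend continues.
print(predict(model, 6))
```

The prediction is only as good as the assumption that the trend in the existing data carries forward.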
Data warehousing is the process of taking data from disparate parts of a large organization and putting that data in one place so that it can be analyzed. Data warehousing enables data mining.
A data cube (also OLAP cube) is a method of organizing data by dimensions of interest in order to be able to interactively examine it along those dimensions even if it is a very large amount of data.
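To make the idea of examining data along dimensions more concrete, here is a small sketch in plain Python. The fact rows, the dimension names, and the helpers `roll_up` and `slice_cube` are all invented for illustration. A real OLAP system would typically precompute and index these aggregates so that very large data stays interactive; this sketch simply recomputes them on each call.

```python
from collections import defaultdict

# Hypothetical fact rows: (region, product, quarter, sales).
FACTS = [
    ("north", "widgets", "Q1", 100),
    ("north", "widgets", "Q2", 120),
    ("north", "gadgets", "Q1", 80),
    ("south", "widgets", "Q1", 90),
    ("south", "gadgets", "Q2", 70),
]
DIMENSIONS = ("region", "product", "quarter")


def roll_up(facts, keep):
    """Sum the sales measure, grouped by the dimensions named in `keep`."""
    positions = [DIMENSIONS.index(d) for d in keep]
    totals = defaultdict(int)
    for row in facts:
        key = tuple(row[i] for i in positions)
        totals[key] += row[-1]
    return dict(totals)


def slice_cube(facts, dimension, value):
    """Keep only the rows where one dimension has a fixed value."""
    position = DIMENSIONS.index(dimension)
    return [row for row in facts if row[position] == value]


by_region = roll_up(FACTS, ["region"])
by_region_quarter = roll_up(FACTS, ["region", "quarter"])
q1_by_product = roll_up(slice_cube(FACTS, "quarter", "Q1"), ["product"])

print(by_region)
print(by_region_quarter)
print(q1_by_product)
```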
Tools and techniques
- Hadoop: an open source project for reliable, scalable distributed computing. One well-known subproject is MapReduce, an implementation of the distributed computation model popularized by Google.
- Mathematica. (No one had examples.)
- Google Spreadsheet. Easy to get data in (via 50 URL fetches) or out (via API).
- App Engine. (No one had examples.)
- Mallet, an open-source, commercial-friendly Java-based package for statistical natural language processing, document classification, clustering, topic modeling, information extraction, and other machine learning applications to text.
- Weka, a collection of tools for data mining: data pre-processing, classification, regression, clustering, association rules, and visualization. It has a GUI and an API.
- Ab Initio: a commercial data processing suite
- Tableau: commercial business intelligence and data visualization software
- Processing: a Java-based visualization tool developed at MIT
- R: an open-source statistics language influenced by S and Scheme.
- SAS: a commercial statistics and data processing suite
- SPSS: a commercial statistics package (owned by IBM)
- Clementine: a commercial data mining package (owned by SPSS)
- JMP: a commercial statistics package
- OpenLayers: an open source tool (BSD-style license) for displaying a dynamic map on a web page
- MapServer: an open source tool (MIT-style license) for displaying spatial data and building interactive mapping applications (from U of MN)
- perl + gnuplot: tools to quickly visualize data
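
For anyone unfamiliar with the MapReduce model mentioned in the Hadoop entry above, here is a single-machine sketch of the idea in plain Python, using the usual word-count example. It does not use Hadoop or any Hadoop API; the function names `map_phase`, `shuffle`, and `reduce_phase` are just labels chosen for this illustration. In a real cluster, the map and reduce steps would run in parallel on many machines.

```python
from collections import defaultdict


def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one document."""
    for word in document.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle: group all emitted values by their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine all values for one key into a single result."""
    return key, sum(values)


def word_count(documents):
    pairs = (pair for doc in documents for pair in map_phase(doc))
    return dict(reduce_phase(k, vs) for k, vs in shuffle(pairs).items())


counts = word_count(["data mining finds patterns", "mining data at scale"])
print(counts)
```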
Further Resources
Websites
- Wikipedia page on data mining
- Visual Complexity Blog: artistic, elaborate visual presentation
Books
- "Programming Collective Intelligence: Building Smart Web 2.0 Applications", an O'Reilly book.
- Edward Tufte. A visualization guru. See his classic book "The Visual Display of Quantitative Information".
- "Introduction to Information Retrieval" by Christopher Manning, Prabhakar Raghavan, and Hinrich Schütze. Information retrieval (IR) is the process of returning useful results from a corpus of data given a search query. Example: Google search.
- "Data Mining: Practical Machine Learning Tools and Techniques" by the folks who made Weka
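
As a toy illustration of the information retrieval idea described in the entry above, here is a sketch of an inverted index in plain Python: a mapping from each term to the documents that contain it, queried by intersecting the document sets. The three documents and the function names `build_index` and `search` are invented for this example, and real search engines also rank their results, which this sketch does not attempt.

```python
from collections import defaultdict


def build_index(docs):
    """Map each lowercase term to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index


def search(index, query):
    """Return the ids of documents that contain every term in the query."""
    terms = query.lower().split()
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result


# Hypothetical corpus of three tiny documents.
DOCS = {
    1: "data mining with weka",
    2: "visualizing data with processing",
    3: "statistics in r",
}

index = build_index(DOCS)
print(search(index, "data with"))
print(search(index, "weka data"))
```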
Other Interest Groups
- Minnesota Chapter of the Data Management Association (DAMA).
- Twin Cities Area SAS Users Group (TCASUG).
Public datasets
- data.gov
- weather data: NOAAPORT