Last modified on 2017-05-09 08:52:07

TIES445 Data Mining, spring 2017

Time: March 13th - May 24th

Lecturers: Tommi Kärkkäinen, Mirka Saarela, Joonas Hämäläinen (+ Pekka W and Dr. Watson ;-)

Exercises: Mirka Saarela, Joonas Hämäläinen

Information in Korppi

Materials in Koppa

On Data Mining project

  • Done in groups of 2-3 students (under special circumstances also individually, e.g. for a purely distance-learning student)
  • Amount of work roughly 1-2 ECTS (30-50 hours of work)
  • Data + its domain + some DM algorithm(s) + Interpretation + Presentation to others
  • Presentations in Seminars on Tue XX, May, xx:00-xx:00 (if there are more works to present, we continue later)
  • Orientation can be technical
    • Distributed DM on Hadoop for big data
    • Introduce a DM tool, e.g. Weka, RapidMiner, etc., to others, do something with it, and compare it to the M-stuff in the exercises
  • or applied
    • Utilize standard techniques for your own data set
    • Consult the possible application domains as listed in Lecture 1, e.g.
      • telecom traffic/protocol as original data source
      • software code or software project deliverables as original data source
      • images or videos as original data source
  • or scientific
    • Take one or a few recent scientific publications on an interesting/related topic, review them, and do some testing
  • or something else: think, make a suggestion and let's discuss further
  • Start thinking about groups and theme+orientation now!
  • Inform me (the lecturer) about your group and topic (after discussions, e.g. after lectures or during exercises)
    • One from the group can add project information below after logging in here
  • Projects can also be targeted to attend this!

Some topics of DM projects during spring 2016

  • Classification of Heart Disease (UCI dataset)
  • Human Activity Recognition (UCI dataset) (Friday)
  • Knowledge Discovery on Pop Music Makers
  • Network Traffic Analysis
  • Fisher's linear discriminant (own generated data)
  • Analysis of own Finnish language content mined from old books (via Project Gutenberg), a punk forum & Yle easy Finnish news
  • Death causes (presentation in GitHub)
  • Knowledge Discovery on Computer Game Ownership on Steam
  • Knowledge Discovery from Enterprise Data
  • Qlik Sense for Data Exploration
  • Self-Organizing Map
  • Knowledge Discovery from Temporal Network Data
  • scikit-learn

Active DM projects

  1. Nummelin and Heino: Handwritten number recognition (classification)
  2. Lipponen and Naukkarinen: Sentiment analysis of a discussion forum
  3. Malmberg and Parviainen: Grouping and analysis of IT job announcements
  4. Heilala: Analysis of Tanzanian water pumps
  5. Honka, Hämäläinen, and Tammentie: Analyzing what determines happiness around the world
  6. Jonninen and Tuusa: Effective dimension analysis of information security datasets
  7. Chanda: Optical character recognition

Open Data Repositories

Note: Many of these repositories are redundant, i.e. they contain data sets that are obtained from or linked to other similar repositories.

Note2: I tried to avoid those data set storages which need special software for reading the data (though Matlab's .mat format is excluded from this).

Some Matlab toolboxes

A few links to our own research activities, which can be referred to during the lectures to exemplify the concepts

Support for exercises