(Russian version is here)
General info
- SQL-like Query Language for Real-time Streaming Analytics - a SQL-like query language for real-time streaming analytics needs to be expressive, concise, and fast, to define core operations that cover 90% of problems, and to be easy to follow and learn.
- The emergence of Spark
- Announcing Spark 1.3!
- R 3.1.3 is released (+ easy upgrading for Windows users with the installr package)
Theory, machine learning algorithms and code examples
- A Full Hardware Guide to Deep Learning
- Deep Learning, The Curse of Dimensionality, and Autoencoders - autoencoders are an extremely exciting new approach to unsupervised learning and for many machine learning tasks they have already surpassed the decades of progress made by researchers handpicking features.
- Deep Learning for Text Understanding from Scratch - forget about the meaning of words, forget about grammar, forget about syntax, forget even the very concept of a word. Now let the machine learn everything by itself.
- Python: scikit-learn – Training a classifier with non numeric features
- Artificial Neurons and Single-Layer Neural Networks - How Machine Learning Algorithms Work Part 1 - this article offers a brief glimpse of the history and basic concepts of machine learning. We will take a look at the first algorithmically described neural network and the gradient descent algorithm in context of adaptive linear neurons, which will not only introduce the principles of machine learning but also serve as the basis for modern multilayer neural networks in future articles.
- Naive Bayes on Apache Flink - in this blog post we are going to implement a Naive Bayes classifier in Apache Flink. We are going to use it for text classification by applying it to the 20 Newsgroup dataset. To understand what is going on, you should be familiar with Java and know what MapReduce is.
- Beginner's Guide to Machine Learning: Part 1 of 2 - data science, big data, data mining and machine learning are some of the most prominent buzzwords around right now. The difference between success or failure is more and more about the data you collect from your customers, their actions and devices and how these data points impact your business. Companies want to collect data, lots of it, and then do something with it.
- The genetic algorithms - a genetic algorithm (GA) is a variant of stochastic beam search, which involves several search points/states concurrently (similar to the shotgun approach noted in the former post), and somehow combines their features according to their performance to generate better successor states. Thus, GA differs from former approaches like simulated annealing that only rely on the modification and evolution of a single state.
- Data-processing and machine learning with Python
- Clustering With K-Means in Python - a very common task in data analysis is that of grouping a set of objects into subsets such that all elements within a group are more similar among them than they are to the others. The practical applications of such a procedure are many: given a medical image of a group of cells, a clustering algorithm could aid in identifying the centers of the cells; looking at the GPS data of a user’s mobile device, their more frequently visited locations within a certain radius can be revealed; for any set of unlabeled observations, clustering helps establish the existence of some sort of structure that might indicate that the data is separable.
- Introduction to Machine Learning Studio
- How-to: Tune Your Apache Spark Jobs (Part 1)
- Gravitational Clustering - new supervised learning method that works through mimicking gravity.
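The genetic algorithm item above describes keeping several candidate states at once and combining their features according to performance. A minimal Python sketch of that idea follows. It is not taken from the linked article: the bit-counting objective, the function names, and all parameter values are assumptions chosen for illustration, and selection is simplified to keeping the better half of the population.

```python
import random


def fitness(bits):
    # Toy objective (an assumption for illustration): the number of 1-bits.
    return sum(bits)


def crossover(a, b, rng):
    # Single-point crossover: a prefix of one parent joined to a suffix of the other.
    point = rng.randrange(1, len(a))
    return a[:point] + b[point:]


def mutate(bits, rate, rng):
    # Flip each bit independently with probability `rate`.
    return [1 - b if rng.random() < rate else b for b in bits]


def genetic_search(length=20, pop_size=30, generations=40, rate=0.02, seed=0):
    rng = random.Random(seed)
    # Several search states are kept concurrently, unlike single-state methods.
    population = [[rng.randint(0, 1) for _ in range(length)] for _ in range(pop_size)]
    for _ in range(generations):
        # Keep the better-performing half as parents (simplified selection).
        population.sort(key=fitness, reverse=True)
        parents = population[: pop_size // 2]
        # Elitism: carry the best state over unchanged so quality never regresses.
        children = [parents[0]]
        while len(children) < pop_size:
            a, b = rng.sample(parents, 2)
            children.append(mutate(crossover(a, b, rng), rate, rng))
        population = children
    return max(population, key=fitness)


best = genetic_search()
print(fitness(best), best)
```

Simulated annealing, by contrast, would track a single state and modify it in place; here each successor is built from two parents drawn from the surviving population.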
Machine learning competitions
- 10 Steps to Success in Kaggle Data Science Competitions - the author, ranked in top 10 in five Kaggle competitions, shares his 10 steps for success. These also apply to any well-defined predictive analytics or modeling problem with a closed dataset.
Online courses, training materials and literature
- Online course: Process Mining: Data science in Action - process mining is the missing link between model-based process analysis and data-oriented analysis techniques. Through concrete data sets and easy to use software the course provides data science knowledge that can be applied directly to analyze and improve processes in a variety of domains.
- Online course: Text Retrieval and Search Engines - search engines are essential tools for managing and mining big text data. Learn how search engines work, the major search algorithms, and how to optimize search accuracy.
- Online course: Applied Regression Analysis - regression modeling is the standard method for analysis of continuous response data. This course provides theoretical and practical training in statistical modeling with particular emphasis on linear and multiple regression.
- Online course: Mathematical Biostatistics Boot Camp 1 - this class presents the fundamental probability and statistical concepts used in elementary data analysis. It will be taught at an introductory level for students with junior or senior college-level mathematical training including a working knowledge of calculus. A small amount of linear algebra and programming are useful for the class, but not required.
- Free eBook & Video Interview | “Data Driven” with Industry Leaders Hilary Mason, DJ Patil, and Josh Wills
Videos, podcasts
- Edu-Video | Introduction to Deep Learning (Class Videos)
- Top 10 Data Mining Mistakes
- Talking Machines: Episode 6: Geoffrey Hinton, Yoshua Bengio, and Yann LeCun (part 2): Future of Machine Learning from the Inside Out - we hear the second part of our conversation with Geoffrey Hinton (Google and University of Toronto), Yoshua Bengio (University of Montreal) and Yann LeCun (Facebook and NYU). They talk with us about the history (and future) of research on neural nets. We explore how to use Determinantal Point Processes. Alex Kulesza and Ben Taskar (who passed away recently) have done some really exciting work in this area; for more on DPPs, check out their paper on the topic. Also, we take a listener question about machine learning and function approximation (spoiler alert: it is, and then again, it isn’t).
Data engineering
- Creating a Single View Part 1: Overview & Data Analysis - this series of three blog posts will provide an introduction to building a Single View with MongoDB. Part 1 covers an overview of what a Single View might look like and what you should consider while building one, while Part 2 will look at the implementation of a sample data model, and Part 3 will dig into the mechanics of how to move your data into MongoDB.
- Big Data Processing in Spark
- Getting Started with Apache Spark and Neo4j Using Docker Compose
Reviews
- Top stories for Mar 1-7: All Machine Learning Models Have Flaws; Analytics, Data Mining, Data Science professionals salary (KDnuggets.com)
- Top stories for Mar 8-14: 7 common Machine Learning mistakes; Deep Learning for Text Understanding from Scratch (KDnuggets.com)
- Weekly Digest - March 16 (DataScienceCentral.com)
- Data Science News 15 March 2015 (MyDataMine.com)
- Big Data News 12 March 2015 (MyDataMine.com)
- Issue 26 - March 13th 2015 (DataElixir.com)
- This Week in Data - March 13, 2015 (r1soft.com)
- Stuff The Internet Says On Scalability For March 13th, 2015 (HighScalability.com)
Previous digest: Data science digest #40 (2 - 8 March 2015)
All data science digests: Data science digests