DataAspirant Sept-Oct 2015 newsletter



Hi dataaspirant lovers, we are sorry for not publishing the dataaspirant September newsletter. To make up for it, the October newsletter includes the September picks too. We rounded up the best blogs for anyone interested in learning more about data science. Whether you are experienced in data science or have just heard of the field, these blogs provide enough detail and context for you to understand what you’re reading. We also collected some videos. Hope you enjoy the October dataaspirant newsletter.


Blog Posts:

1. How to do a Logistic Regression in R:

Regression is a statistical technique that tries to explain the relationship between a dependent variable and one or more independent variables. There are various kinds of regression, such as simple linear, multiple linear, polynomial, logistic, and Poisson.

Read Complete post on: datavinci
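
To make the idea of relating a dependent variable to an independent one concrete, here is a minimal sketch of logistic regression. The linked post works in R; this sketch uses plain Python instead, and it is not taken from that post. The toy data (hours studied vs. exam passed), the function names, the learning rate, and the number of epochs are all assumptions chosen for illustration.

```python
import math


def sigmoid(z):
    """Map any real number to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic(xs, ys, learning_rate=0.1, epochs=5000):
    """Fit p(y=1 | x) = sigmoid(w * x + b) by batch gradient descent on the log-loss."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        grad_w = 0.0
        grad_b = 0.0
        for x, y in zip(xs, ys):
            error = sigmoid(w * x + b) - y  # gradient of the log-loss w.r.t. the linear term
            grad_w += error * x
            grad_b += error
        w -= learning_rate * grad_w / n
        b -= learning_rate * grad_b / n
    return w, b


def predict_proba(x, w, b):
    """Predicted probability that the dependent variable equals 1."""
    return sigmoid(w * x + b)


# Hypothetical toy data: hours studied (independent) and exam passed (dependent, 0 or 1).
hours = [0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0, 4.5, 5.0]
passed = [0, 0, 0, 0, 1, 0, 1, 1, 1, 1]

w, b = fit_logistic(hours, passed)
print("P(pass | 1.0 hours):", predict_proba(1.0, w, b))
print("P(pass | 4.5 hours):", predict_proba(4.5, w, b))
```

Unlike linear regression, the output here is a probability between 0 and 1, which is why logistic regression suits yes/no outcomes. In practice you would likely reach for a statistics package rather than hand-written gradient descent.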

2. Introduction of Markov State Modeling:

Modeling and prediction problems occur in different domains and data situations. One type of situation involves a sequence of events.

For instance, you may want to model the behaviour of customers on your website, looking at the pages they land on or enter by, the links they click, and so on. You may want to do this to understand common issues and needs, and then redesign your website to address them. On the other hand, you may want to promote certain sections or products on the website and need to understand the right page architecture and layout. As another example, you may be interested in predicting a patient’s next medical visit based on previous visits, or a customer’s next purchase based on previous purchases.

Read Complete post on: edupristine
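
The website example above can be sketched as a first-order Markov chain: count how often each page follows another, turn the counts into probabilities, and use them to guess the next page. This is a minimal sketch and not the method from the linked post; the session data, page names, and function names are hypothetical.

```python
from collections import Counter, defaultdict


def fit_transitions(sequences):
    """Estimate first-order transition probabilities P(next | current) from sequences."""
    counts = defaultdict(Counter)
    for seq in sequences:
        # Pair every state with the state that immediately follows it.
        for current, nxt in zip(seq, seq[1:]):
            counts[current][nxt] += 1
    return {
        state: {nxt: c / sum(followers.values()) for nxt, c in followers.items()}
        for state, followers in counts.items()
    }


def predict_next(transitions, current):
    """Return the most likely next state, or None if `current` was never followed by anything."""
    followers = transitions.get(current)
    if not followers:
        return None
    return max(followers, key=followers.get)


# Hypothetical browsing sessions: each list is the sequence of pages one visitor viewed.
sessions = [
    ["home", "products", "cart", "checkout"],
    ["home", "products", "products", "cart"],
    ["home", "search", "products", "cart", "checkout"],
    ["home", "search", "home", "products"],
]

transitions = fit_transitions(sessions)
print("From home:", transitions["home"])
print("Most likely page after cart:", predict_next(transitions, "cart"))
```

The same counting approach would apply to the other examples in the excerpt, such as sequences of medical visits or purchases, as long as the next event is assumed to depend only on the current one.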

3. Five ways to improve the way you use Hadoop:

Apache Hadoop is an open source framework designed to distribute the storage and processing of massive data sets across virtually limitless servers. Amazon EMR (Elastic MapReduce) is a particularly popular service from Amazon that is used by developers trying to avoid the burden of setup and administration, and concentrate on working with their data.

Read Complete post on: cloudacademy
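
For readers new to the MapReduce programming model that Hadoop and EMR are built around, here is a single-process sketch of its three steps (map, shuffle, reduce) applied to a word count. It runs on one machine and does not use Hadoop or EMR at all; the function names and sample documents are made up for illustration.

```python
from collections import defaultdict


def map_phase(document):
    """Map step: emit a (word, 1) pair for every word in one document."""
    for word in document.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle step: group all emitted values by their key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce step: combine all values for one key into a single result."""
    return key, sum(values)


def word_count(documents):
    """Run map, shuffle, and reduce in sequence within a single process."""
    mapped = (pair for doc in documents for pair in map_phase(doc))
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


# Hypothetical input: on a real cluster each document could live on a different server.
documents = ["big data on Hadoop", "Hadoop stores big data", "data data data"]
counts = word_count(documents)
print(counts)
```

On a cluster, the map and reduce steps would run in parallel on many servers, with the framework handling the shuffle between them. That distribution is what the sketch leaves out.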

4. What is deep learning and why is it getting so much attention:

Deep learning is probably one of the hottest topics in machine learning today, and it has shown significant improvement over some of its counterparts. It is a class of learning algorithms, applicable to both supervised and unsupervised problems, that uses multi-layered neural networks to achieve these remarkable outcomes.

Read Complete post on: analyticsvidhya

5. Facebook data collection and photo network visualization with Gephi and R:

The first thing to do is get the Facebook data. Before being allowed to pull it from R, you’ll need to make a quick detour to register as a developer and create a new app. Name and description are irrelevant; the only thing you need to do is go to Settings → Website → Site URL and fill in http://localhost:1410/ (that’s the port we’re going to be using). The whole process takes ~5 min and is quite painless.

Read Complete post on: kateto

6. Kudu: New Apache Hadoop Storage for Fast Analytics on Fast Data:

The set of data storage and processing technologies that define the Apache Hadoop ecosystem are expansive and ever-improving, covering a very diverse set of customer use cases used in mission-critical enterprise applications. At Cloudera, we’re constantly pushing the boundaries of what’s possible with Hadoop—making it faster, easier to work with, and more secure.

Read Complete post on: cloudera

7. Rapid Development & Performance in Spark For Data Scientists:

Spark is a cluster computing framework that can significantly increase the efficiency and capabilities of a data scientist’s workflow when dealing with distributed data. However, deciding which of its many modules, features and options are appropriate for a given problem can be cumbersome. Our experience at Stitch Fix has shown that these decisions can have a large impact on development time and performance. This post will discuss strategies at each stage of the data processing workflow which data scientists new to Spark should consider employing for high productivity development on big data.

Read Complete post on: multithreaded

8. NoSQL: A Dog with Different Fleas:

The NoSQL movement is about providing performance, scale, and flexibility, where cost is sometimes part of the reasoning (e.g. the Oracle Tax). Yet databases like MySQL, which provide all the Oracle features, are often considered before choosing NoSQL. As for NoSQL flexibility, it can also be a Pandora’s box. In other words, schema-less modeling has been shown to be a serious complication to data management. I was at the MongoDB Storage Engine Summit this year, and the number one ask to the storage engine providers was “how to discover schema in a schema-less architecture?” In other words, managing models over time is a serious matter to consider too.

Read Complete post on: deepis

9. Apache Spark: Sparkling star in big data firmament:

The underlying data needed to get the right outcomes for all of the above tasks is very large. It cannot be handled efficiently (in terms of both space and time) by traditional systems. These are all big data scenarios. To collect, store, and do computations on this kind of voluminous data, we need a specialized cluster computing system. Apache Hadoop has solved this problem for us.

Read Complete post on: edupristine

10. Sqoop vs. Flume – Battle of the Hadoop ETL tools:

Apache Hadoop is synonymous with big data for its cost-effectiveness and its scalability in processing petabytes of data. Data analysis using Hadoop is just half the battle won. Getting data into the Hadoop cluster plays a critical role in any big data deployment. Data ingestion is important in any big data project because the volume of data is generally in petabytes or exabytes. Sqoop and Flume are the two tools in Hadoop that are used to gather data from different sources and load it into HDFS. Sqoop is mostly used to extract structured data from databases like Teradata, Oracle, etc., while Flume is used to collect data stored in various sources and deals mostly with unstructured data.

Read Complete post on: dezyre



Videos:

1. Spark and Spark Streaming at Uber:

2. How To Stream Twitter Data Into Hadoop Using Apache Flume:


That’s all for the October 2015 newsletter. Please leave your suggestions about the newsletter in the comment box. To see all dataaspirant newsletters, you can visit the monthly newsletter page. Please subscribe to our blog so that you get our newsletter in your inbox every month.

