Why Analysts need Data Lakes?

With substantial analytical needs at Sopra Steria Apps, we are looking to expand our Data Science environment. I am leaning towards a Data Lake architecture, approached from a concrete angle: we have practical requirements and know quite precisely what we want. I’ve also done some research in the community. It struck me that the ‘official’ rhetoric on what data lakes are is often detached from ‘practical’ needs …

Simple hack to improve data clustering visualizations

Here is how to make your data clusters look pretty in no time (with Python and matplotlib), using a one-line code hack. I wanted to visualize, in Python and matplotlib, the data clusters returned by clustering algorithms such as K-means (sklearn.cluster.KMeans). I initially dispatched the set into groups manually and attached colors to labels, like this: fig, axes = plt.subplots(2, 2, figsize=(20, 10)) axes[0, 0].scatter(group0[:, 0], group0[:, 1], c='green', s=10) …
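The excerpt stops before showing the one-liner itself, so the following is only a sketch of one common way to avoid splitting the data into groups by hand: pass the cluster labels straight to the `c` argument of `scatter` and let a colormap assign the colors. The synthetic data, the choice of three clusters and the `viridis` colormap are assumptions made for illustration; they are not taken from the original post.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np
from sklearn.cluster import KMeans

# Synthetic 2-D data around three centers (illustrative only).
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [5.0, 5.0], [0.0, 5.0]])
X = np.vstack([c + rng.normal(scale=0.5, size=(50, 2)) for c in centers])

# Cluster the points; fit_predict returns one integer label per point.
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

# One scatter call: the label array drives the color of each point,
# so no manual grouping or per-group color assignment is needed.
fig, ax = plt.subplots(figsize=(8, 6))
points = ax.scatter(X[:, 0], X[:, 1], c=labels, cmap="viridis", s=10)
```

Because the colors come from the label array, the same call keeps working if the number of clusters changes.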

How to isolate data that constitutes a spike in a histogram?

We would all love to spot business problems early on, so we can react before they become painful. You can learn a lot by looking at past problems. Hence, understanding the nature of anomalies in data can bring substantial operational benefits and know-how. And that’s the background for this work. I have identified an anomaly in the data, which shows up as a spike in the histogram. In other words, …
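The excerpt does not describe the method used in the post, so the following is only a minimal sketch of one simple way to pull out the records behind a spike: compute the histogram with `numpy.histogram`, take the fullest bin as the spike, and build a boolean mask for the values that fall inside it. The helper name `isolate_spike`, the default bin count and the "fullest bin" definition of a spike are all assumptions made for illustration.

```python
import numpy as np

def isolate_spike(values, bins=50):
    """Return a mask of the values in the fullest histogram bin, plus its edges.

    Hypothetical helper: treats the single tallest bin as "the spike".
    """
    values = np.asarray(values, dtype=float)
    counts, edges = np.histogram(values, bins=bins)
    k = int(np.argmax(counts))          # index of the tallest bin
    lo, hi = edges[k], edges[k + 1]
    if k == len(counts) - 1:
        # numpy's last bin is closed on the right: [lo, hi]
        mask = (values >= lo) & (values <= hi)
    else:
        # every other bin is half-open: [lo, hi)
        mask = (values >= lo) & (values < hi)
    return mask, (lo, hi)

# Illustrative data: a normal background plus 300 identical values at 2.5.
rng = np.random.default_rng(1)
data = np.concatenate([rng.normal(size=1000), np.full(300, 2.5)])

mask, (lo, hi) = isolate_spike(data, bins=50)
spike_rows = data[mask]  # the records that make up the spike
```

With tabular data, the same mask can be used to select the full rows behind the spike and inspect their other columns.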