Two easy Big Data tools to keep control on the astronomical scale business
Wojciech Moszczyński, THE DATA SCIENCE LIBRARY, 5 September 2018
https://sigmaquality.pl/uncategorized/two-easy-big-data-tools-to-keep-control-on-the-astronomical-scale-business/

The article "Two easy Big Data tools to keep control on the astronomical scale business" comes from THE DATA SCIENCE LIBRARY.

How can we use Big Data tools to keep control of a business on an astronomical scale?

In the last publication I tried to convince you that it is sometimes better to abandon Excel, especially when we face a giant number of dimensions, thousands of variables and a huge scale. I found myself in such circumstances five years ago, when I started working for a big producer of chipboard. My duty was monitoring and detecting anomalies in very large logistics processes. Every day I monitored more than a thousand operations of receiving raw materials and dispatching goods across Poland. In such processes the most important thing is effective detection of anomalies. If an anomaly appeared, all the complex machinery was launched to explain the situation.

I was working as a Data Scientist on very interesting projects. At the same time, an autonomous system I had built was detecting and reporting anomalies.

Big Data tools to keep control on the astronomical scale business

I used two simple algorithms: an algorithm comparing real events to standard values, and an algorithm based on the probability density of the normal distribution.

These two lines of research, in a very basic form, were replicated across a large number of applications. The systems worked effectively, detecting anomalies and fraud. The system did not work in real time, but ran whenever I reloaded the data. The reason was frequent corrections to the documents created during an operation and shortly thereafter: my system would have reacted too quickly. A day later the process was stable and ready for testing.
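The second algorithm, based on the probability density of the normal distribution, can be sketched in a few lines. This is a minimal illustration with made-up prices, not the original system: a normal distribution is fitted to the observed values, and observations where the fitted density is very low are flagged as anomalies.

```python
import numpy as np

# Hypothetical daily prices for one assortment; the spike at 160 is the anomaly
prices = np.array([100.0, 102.0, 98.0, 101.0, 99.0, 100.5, 160.0, 97.5])

mu = prices.mean()
sigma = prices.std(ddof=1)

# Probability density of the fitted normal distribution at each observation
density = np.exp(-((prices - mu) ** 2) / (2 * sigma ** 2)) / (sigma * np.sqrt(2 * np.pi))

# Observations whose density falls below a chosen cut-off are flagged
threshold = 1e-3
anomalies = prices[density < threshold]
print(anomalies)
```

The cut-off value is a tuning parameter: the lower it is, the more extreme an observation must be before it is reported.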

Algorithm comparing real events to standard values

I will show how the first algorithm works on an extremely easy example. The same method can be applied to very complex processes, monitoring a huge number of variables in various configurations.

Please open this example database. Source here!

In the first column we see the date of reception of raw materials. The second column is the kind of assortment. In the third column we find the transaction price. In my practice I had many thousands of assortments and many more columns with other costs, dimensions, suppliers, categories and delivery conditions. I created special columns with ratios that combine these dimensions and costs. In this example I will show only the simple mechanism of how to do it in Pandas.
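Ratio columns of the kind mentioned above can be sketched like this. All column names and values here are hypothetical, since the real tables are not shown:

```python
import pandas as pd

# Toy transaction table; column names are illustrative
df = pd.DataFrame({
    'Assortment': ['Pine logs', 'Pine logs', 'Glue'],
    'Purchase price': [1200.0, 1500.0, 300.0],
    'Quantity': [40.0, 48.0, 10.0],
    'Transport cost': [200.0, 260.0, 30.0],
})

# Ratios combine raw dimensions and costs into comparable indicators
df['Price per unit'] = df['Purchase price'] / df['Quantity']
df['Transport share'] = df['Transport cost'] / df['Purchase price']
```

Once such ratios exist as columns, they can be monitored against standard values in exactly the same way as raw prices.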

We open the example database.

import pandas as pd 
import numpy as np 
import matplotlib.pyplot as plt
import seaborn as sns

df = pd.read_csv('c:/1/Base1.csv', index_col=0, parse_dates=['Date of transaction'])
df.head()

We generate the maximum price for each assortment from the list of transactions.

df.pivot_table(index='Assortment', values='Purchase price', aggfunc='max')

Thanks to this I know where the biggest purchase cost appears for each assortment. This is a very easy operation in Python, but not so easy in Excel, because there you need to use an array function.

An Excel array function applied to a giant database needs really powerful hardware and a long calculation time; Python calculates it faster. Now we add an additional column with the month.

df['Month'] = df['Date of transaction'].dt.month
df.head()

Now we can display the extreme values for each assortment in each month.

df.pivot_table(index=['Month','Assortment'], values='Purchase price', aggfunc=['min', 'max', 'mean'])

 

 

Usually the controlling or purchasing department has a special agreement with the suppliers of raw materials, so we get a table of maximum prices for each assortment and/or each supplier. It is difficult to check a couple of thousand transactions using Excel lookup functions; that takes a lot of time. It is faster to do it in Pandas.

We open the price table for assortments. Source here!

Price_Max_Min = pd.read_excel('c:/2/tablePR.xlsx', index_col=0)
Price_Max_Min.head()

 

Now we add to our main transaction table columns with the max and min price from the control table for each assortment.

# Build assortment -> limit dictionaries and map them onto the transactions
T2 = Price_Max_Min.set_index('Asso')['Max'].to_dict()
df['Max_price'] = df['Assortment'].map(T2)

T3 = Price_Max_Min.set_index('Asso')['Min'].to_dict()
df['Min_price'] = df['Assortment'].map(T3)

df.head()
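The same lookup can also be done with a single left merge instead of two `.map()` calls. A minimal sketch with toy data (the column names follow the tables above, the values are invented):

```python
import pandas as pd

# Toy versions of the transaction table and the contractual price table
df = pd.DataFrame({
    'Assortment': ['Pine logs', 'Glue', 'Pine logs'],
    'Purchase price': [31.0, 28.0, 36.0],
})
Price_Max_Min = pd.DataFrame({
    'Asso': ['Pine logs', 'Glue'],
    'Max': [33.0, 30.0],
    'Min': [25.0, 22.0],
})

# One left merge attaches both limits at once
limits = Price_Max_Min.rename(columns={'Asso': 'Assortment',
                                       'Max': 'Max_price',
                                       'Min': 'Min_price'})
df = df.merge(limits, on='Assortment', how='left')
```

With a left merge, transactions whose assortment is missing from the price table simply get NaN limits, which is itself a useful signal to investigate.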

It would be ineffective and ridiculous to make a scandal when the maximum price is exceeded by a few percent. First we add columns that calculate the percentage deviation from the agreed limits (the column names below are illustrative).

# Percentage by which the price breaches each contractual limit
df['% exceed max'] = (df['Purchase price'] - df['Max_price']) / df['Max_price'] * 100
df['% below min'] = (df['Min_price'] - df['Purchase price']) / df['Min_price'] * 100
df.head()

A good manager is a lazy manager. We have a computer to find all cases where the max price was exceeded by more than 15%.

df['Warning!'] = np.where(df['% exceed max'] > 15, 'Warning! High exceed! ', '')
df.head()

Now we have to catch all significant price exceedances in the entire monthly transaction report. In my previous job such a report could contain more than three hundred thousand operations a month, and each transaction had hundreds of records of information to compare. Excel was unable to work effectively in such an environment. Let's see 5 random exceedances among our transactions.

df[df['Warning!']=='Warning! High exceed! '].sample(5)

If we want to catch something, we have to start from the top. We display the top ten biggest exceedances in the year.

kot = df[df['Warning!']=='Warning! High exceed! '].sort_values('% exceed max', ascending=False)
kot.nlargest(10, '% exceed max')

Now we display all exceedances from September.

df[(df['Warning!']=='Warning! High exceed! ')&(df['Month']==9)].sort_values('% exceed max', ascending=False)

In the next publication I will show the second Big Data tool, which helped me survive as a Data Scientist in a difficult environment.


 

 

 
