IJCRT2405424
www.ijcrt.org © 2024 IJCRT | Volume 12, Issue 5 May 2024 | ISSN: 2320-2882
1 Assistant Professor, 2 Assistant Professor, 3 Assistant Professor, 4 Assistant Professor
1, 2, 3, 4 Dept. of Computer Science, Shobhit University Gangoh, Saharanpur, India.
Abstract: Python is an interpreted, object-oriented, high-level scripting language created by Guido van
Rossum. It is suited to both learning and practical applications. This paper introduces the primary Python
software tools for data wrangling, cloud computing, and machine learning. It begins with an introduction to
Python programming and data wrangling, then gives an overview of machine learning and cloud computing.
Finally, it discusses popular packages used in the data wrangling and machine learning fields, including
NumPy, SciPy, TensorFlow, Keras, and Matplotlib, and demonstrates the value of Python in developing
cloud computing and data wrangling applications.
Index Terms - Machine learning · Data wrangling · Tools · Languages · Python · Cloud computing
(a) Collection
Gathering raw data from diverse sources is the initial stage in the data wrangling process. Databases, files,
external APIs, web scraping, and numerous other data streams are examples of these sources. The information
gathered may be semi-structured (JSON, XML files), unstructured (text documents, photos), or structured
(SQL databases, for example) [3].
(b) Cleaning
After the data is gathered, the cleaning procedure starts. This step eliminates errors, inconsistencies, and
duplicates that could distort analysis results. Cleaning may entail:
i. Deleting information that is not useful or adds nothing to the analysis.
ii. Fixing data mistakes, such as misspellings or inaccurate values [4, 5].
iii. Addressing missing values by deleting them, imputing them from other data points, or making statistical
approximations of them.
iv. Recognizing and fixing discrepancies, such as differing date or currency formats [4].
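The cleaning operations listed above can be sketched in a few lines of Python. This is a minimal illustration, not a prescribed method: the records, field names, and the two date formats are hypothetical. It uses only the standard library so that it stays self-contained, although a library such as pandas would be the more usual choice in practice.

```python
from datetime import datetime
from statistics import mean

# Hypothetical raw records; field names and values are made up for illustration.
raw = [
    {"id": 1, "city": "delhi ", "joined": "2024-01-15", "salary": 52000.0},
    {"id": 1, "city": "delhi ", "joined": "2024-01-15", "salary": 52000.0},  # duplicate
    {"id": 2, "city": "Mumbai", "joined": "15/02/2024", "salary": None},     # missing value
    {"id": 3, "city": "PUNE", "joined": "2024-03-01", "salary": 48000.0},
]

def parse_date(text):
    """Try each assumed date format and return an ISO 8601 date string."""
    for fmt in ("%Y-%m-%d", "%d/%m/%Y"):
        try:
            return datetime.strptime(text, fmt).date().isoformat()
        except ValueError:
            continue
    raise ValueError(f"unrecognised date: {text!r}")

# 1. Remove exact duplicates while preserving the original order.
seen, unique = set(), []
for row in raw:
    key = tuple(sorted(row.items()))
    if key not in seen:
        seen.add(key)
        unique.append(row)

# 2. Impute missing salaries with the mean of the observed salaries.
fill = mean(r["salary"] for r in unique if r["salary"] is not None)

# 3. Fix inconsistent text and date formats.
clean = [
    {
        "id": r["id"],
        "city": r["city"].strip().title(),
        "joined": parse_date(r["joined"]),
        "salary": r["salary"] if r["salary"] is not None else fill,
    }
    for r in unique
]
```

Mean imputation is only one of the options named above; deleting the incomplete row would be equally valid for some analyses.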
(c) Structuring
After cleaning, data needs to be structured or restructured into a more analysis-friendly format. This often
means converting unstructured or semi-structured data into a structured form, like a table in a database or a
CSV file [5]. This step may involve:
i. Parsing data into structured fields.
ii. Normalizing data to ensure consistent formats and units.
iii. Transforming data, such as converting text to lowercase, to prepare for analysis.
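The three structuring operations above can be sketched as follows. The JSON payload, the field names, and the unit table are assumptions introduced purely for illustration.

```python
import json

# Hypothetical semi-structured input: nested JSON with mixed case and mixed units.
payload = """
[
  {"name": "Sensor A", "reading": {"value": 1.5, "unit": "km"}},
  {"name": "SENSOR b", "reading": {"value": 250, "unit": "m"}}
]
"""

TO_METRES = {"m": 1.0, "km": 1000.0}  # assumed conversion table

def to_row(record):
    """Flatten one nested record into a fixed set of columns."""
    reading = record["reading"]
    return {
        # Transform: lowercase the text field.
        "name": record["name"].lower(),
        # Normalize: express every distance in the same unit.
        "distance_m": reading["value"] * TO_METRES[reading["unit"]],
    }

# Parse the JSON text, then structure each record into a flat row.
rows = [to_row(r) for r in json.loads(payload)]
```

Each element of `rows` now has the same columns, so the result could be written directly to a CSV file or a database table.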
(d) Enriching
Data enrichment involves adding context or new information to the dataset to make it more valuable for
analysis. This can include:
i. Merging data from multiple sources to develop a more comprehensive dataset.
ii. Creating new variables or features that can provide additional insights when analyzed [1, 3].
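A minimal sketch of both enrichment operations is given below, assuming two hypothetical sources: an orders table and a customer lookup that share a `customer_id` key. All names and values are illustrative.

```python
# Hypothetical sources: an orders table and a customer lookup keyed by customer_id.
orders = [
    {"order_id": 101, "customer_id": 1, "quantity": 3, "unit_price": 20.0},
    {"order_id": 102, "customer_id": 2, "quantity": 1, "unit_price": 99.5},
]
customers = {1: {"region": "North"}, 2: {"region": "South"}}

enriched = []
for order in orders:
    row = dict(order)
    # Merge: attach attributes from the second source (a left join on customer_id).
    row.update(customers.get(order["customer_id"], {"region": None}))
    # New feature: total order value derived from existing columns.
    row["total"] = row["quantity"] * row["unit_price"]
    enriched.append(row)
```

Orders with no matching customer keep a `region` of `None` rather than being dropped, which mirrors a left join.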
(e) Validating
Validation ensures the data's accuracy and quality after it has been cleaned, structured, and enriched. This step
may involve:
i. Data integrity checks, such as ensuring foreign keys in a database match.
ii. Quality assurance testing to ensure the data meets predefined standards and rules [4].
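Both kinds of check can be expressed as a small validation function. The tables and the rule that quantities must be positive are hypothetical examples of a foreign-key check and a quality rule, not rules taken from any particular dataset.

```python
# Hypothetical tables: every order must reference an existing customer.
customers = [{"customer_id": 1}, {"customer_id": 2}]
orders = [
    {"order_id": 101, "customer_id": 1, "quantity": 3},
    {"order_id": 102, "customer_id": 9, "quantity": 1},   # dangling foreign key
    {"order_id": 103, "customer_id": 2, "quantity": -4},  # breaks a quality rule
]

def validate(orders, customers):
    """Return (order_id, problem) pairs; an empty list means the data passed."""
    known_ids = {c["customer_id"] for c in customers}
    problems = []
    for o in orders:
        # Integrity check: the foreign key must match a known customer.
        if o["customer_id"] not in known_ids:
            problems.append((o["order_id"], "unknown customer_id"))
        # Quality rule (assumed): quantities must be positive.
        if o["quantity"] <= 0:
            problems.append((o["order_id"], "quantity must be positive"))
    return problems

problems = validate(orders, customers)
```

Returning the list of problems, instead of raising on the first one, makes it easier to report every issue in a single pass.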
(f) Storing
The final wrangled data is then stored in a data repository, such as a database or a data warehouse, making it
accessible for analysis and reporting. This storage not only secures the data but also organizes it in a way that
is efficient for querying and analysis [6].
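As an illustration of the storing step, the sketch below loads wrangled rows into SQLite through Python's built-in `sqlite3` module. The table name, columns, and rows are hypothetical, and an in-memory database stands in for the database or data warehouse that would be used in practice.

```python
import sqlite3

# Hypothetical wrangled rows ready for storage.
rows = [(1, "Delhi", 52000.0), (2, "Mumbai", 50000.0), (3, "Pune", 48000.0)]

# An in-memory database keeps the sketch self-contained; a file path or a
# data warehouse connection would take its place in a real pipeline.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE employees (id INTEGER PRIMARY KEY, city TEXT NOT NULL, salary REAL)"
)
conn.executemany("INSERT INTO employees VALUES (?, ?, ?)", rows)
conn.execute("CREATE INDEX idx_city ON employees (city)")  # organize for querying
conn.commit()

# The stored data can now be queried for analysis and reporting.
high_paid = conn.execute(
    "SELECT city FROM employees WHERE salary >= ? ORDER BY salary DESC", (50000,)
).fetchall()
```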
(g) Documentation
Documentation is critical throughout the data wrangling process. It records what was done to the data, including
the transformations and decisions. This documentation is invaluable for reproducibility, auditing, and
understanding the data analysis process [4, 5, 6].
Python is a widely used programming language that offers several unique features and advantages compared
to languages like Java and C++. Its core topics range from installation, conditional statements, loops, and
built-in data structures [2] to Object-Oriented Programming, Generators, Exception Handling, and regular
expressions (Python RegEx), which makes the language suitable for both beginners and working
professionals.
Guido van Rossum began developing Python in the late 1980s. The first version, Python 0.9.0, was
released in 1991. Since then, Python has steadily gained popularity [7]. It is now reported to be among the
most popular programming languages with developers because of the high demand for it in the technology sector.
(a) Uses of the Python programming language
i. Web development (server-side)
ii. Software development
iii. Mathematics
iv. System scripting
Let's say we have a complex problem in which we need to make predictions. Instead of hand-writing the
decision logic, we feed the data to generic algorithms, which build the logic from the data and predict the
output. Machine learning has changed how we approach such problems [10].
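To make this idea concrete, the sketch below lets a generic algorithm learn a prediction rule from examples instead of coding the rule by hand. Ordinary least squares is used here only as one simple example of such an algorithm; the source does not name a specific one, and the training data is hypothetical.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

# Hypothetical training data: hours studied -> exam score.
hours = [1, 2, 3, 4, 5]
scores = [52, 54, 56, 58, 60]

# The "logic" (slope and intercept) is learned from the data, not written by hand.
slope, intercept = fit_line(hours, scores)

def predict(h):
    """Predict a score for an unseen number of study hours."""
    return slope * h + intercept
```

The same `fit_line` function works unchanged on any other pair of numeric columns, which is what makes the algorithm generic.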
Cloud computing is the delivery of computing services such as servers, storage, databases, networking,
software, analytics, and intelligence over the cloud (the Internet). It provides an alternative to the
on-premises datacenter. With an on-premises datacenter, we have to manage everything: purchasing and
installing hardware, virtualization, installing the operating system and any other required applications,
setting up the network, configuring the firewall, and setting up storage for data. After completing the
set-up, we remain responsible for maintaining it through its entire lifecycle [11][12]. If we choose cloud
computing instead, a cloud vendor is responsible for purchasing and maintaining the hardware. Vendors also
provide a wide variety of software and platforms as a service, and we can rent whichever services we require.
Cloud computing services are charged based on usage.
1.4.1 Advantages of cloud computing
Cost: It reduces the huge capital costs of buying hardware and software.
Speed: Resources can be accessed in minutes, typically within a few clicks.
Scalability: We can scale resources up or down according to the business requirements.
Productivity: Cloud computing requires less operational effort. We do not need to apply patches or
maintain hardware and software, so the IT team can be more productive and focus on achieving
business goals.
Reliability: Backup and recovery of data are less expensive and very fast for business continuity.
Security: Many cloud vendors offer a broad set of policies, technologies, and controls that strengthen
our data security.
Public Cloud: Cloud resources that are owned and operated by a third-party cloud service
provider are called a public cloud. It delivers computing resources such as servers, software, and
storage over the internet.
Private Cloud: Cloud computing resources that are used exclusively within a single business or
organization are called a private cloud. A private cloud may be physically located in the
company's on-site datacenter or hosted by a third-party service provider.
Hybrid Cloud: A hybrid cloud is a combination of public and private clouds, bound together by
technology that allows data and applications to be shared between them. A hybrid cloud gives
the business greater flexibility and more deployment options.
Infrastructure as a Service (IaaS): In IaaS, we can rent IT infrastructure such as servers and virtual machines
(VMs) [11], storage, networks, and operating systems from a cloud service vendor. We can create a VM running
Windows or Linux and install anything we want on it. Using IaaS, we don't need to care about the hardware
or virtualization software, but we do have to manage everything else [13]. IaaS gives us maximum
flexibility, but it also requires the most maintenance effort.
Platform as a Service (PaaS): This
service provides an on-demand environment for developing, testing, delivering, and managing software
applications. The developer is responsible for the application, and the PaaS vendor provides the ability to
deploy and run it. With PaaS, flexibility is reduced, but the cloud vendor takes care of managing the
environment.
Software as a Service (SaaS): This service provides centrally hosted and managed software to end users. It
delivers software over the internet, on demand, and typically on a subscription basis. Examples include
Microsoft OneDrive, Dropbox, WordPress, Office 365, and Amazon Kindle. SaaS is used to minimize
operational cost as far as possible [5] [6].
2 Objective of study
1. To identify the features of Python programming.
2. To investigate Python modules for data wrangling, such as NumPy, which is used for matrix and vector
operations.
3. To focus on Python modules for machine learning, such as TensorFlow, which performs numerical
computations for machine learning and neural networks.
3 Related work
Python provides many useful features that make it popular and distinguish it from other programming
languages. It supports object-oriented and procedural programming approaches and provides
dynamic memory allocation [11].
b. Expressive Language -- Python can perform complex tasks using a few lines of code. For
example, the hello world program is simply print("Hello World"). It takes only one
line, while Java or C takes multiple lines.
d. Cross-platform Language -- Python runs equally well on different platforms such as Windows,
Linux, UNIX, and Macintosh, so we can say that Python is a portable language. It enables
programmers to develop software for several competing platforms by writing a program
only once.
e. Free and Open Source -- Python is freely available to everyone from its official website,
www.python.org. It has a large worldwide community dedicated to creating new Python
modules and functions. Anyone can contribute to the Python
community. Open source means that anyone can download its source code without paying
anything.
g. Extensible -- Code written and compiled in other languages such as C/C++ can be used from
our Python code. Python itself converts a program into byte code, and any platform with a
Python interpreter can run that byte code.
h. Large Standard Library -- Python offers a vast range of libraries for fields such as
machine learning, web development, and scripting. Popular machine learning libraries
include TensorFlow, Pandas, NumPy, Keras, and PyTorch. Django, Flask, and Pyramid
are popular frameworks for Python web development.
i. GUI Programming Support -- A Graphical User Interface is used for developing desktop
applications. PyQt5, Tkinter, and Kivy are libraries used for developing such
applications.
j. Integrated -- Python can be easily integrated with languages like C, C++, and Java. Python
runs code line by line, which makes the code easy to debug.
k. Embeddable -- Code written in other programming languages can be used in Python source
code, and Python source code can likewise be used inside programs written in other
languages.
l. Dynamic Memory Allocation -- In Python, we don't need to specify the data type of a
variable. When we assign a value to a variable, memory is allocated to it automatically
at run time. For example, to assign the integer value 15 to x, we don't need to
write int x = 15. Just write x = 15.
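Two of the features above, expressiveness and the absence of type declarations, can be checked directly in a Python interpreter. The short sketch below is illustrative only. The behaviour shown for `x` is commonly called dynamic typing.

```python
print("Hello World")    # the complete one-line hello world program

x = 15                  # no declaration such as `int x = 15` is needed
first_type = type(x).__name__

x = "fifteen"           # the same name can later refer to a value of another type
second_type = type(x).__name__
```

After these statements run, `first_type` holds `"int"` and `second_type` holds `"str"`, which shows that the type belongs to the value and not to the variable name.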
i. Feature Creation: Feature creation is finding the most useful variables to be used in a predictive
model. The process is subjective, and it requires human creativity and intervention [6]. New features
are created by combining existing features using addition, subtraction, and ratios, and these new features
have great flexibility [9].
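The three ways of combining features mentioned above (addition, subtraction, and ratios) can be sketched as follows. The housing records and column names are hypothetical.

```python
# Hypothetical housing records; column names are illustrative only.
houses = [
    {"price": 300000.0, "area_sqft": 1500.0, "bedrooms": 3, "bathrooms": 2},
    {"price": 450000.0, "area_sqft": 1800.0, "bedrooms": 4, "bathrooms": 3},
]

for h in houses:
    h["rooms"] = h["bedrooms"] + h["bathrooms"]         # addition
    h["bed_bath_gap"] = h["bedrooms"] - h["bathrooms"]  # subtraction
    h["price_per_sqft"] = h["price"] / h["area_sqft"]   # ratio
```

Which combinations are worth creating remains a judgment call, which is why the process is described as subjective.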
ii. Transformations: The transformation step of feature engineering involves adjusting the predictor
variables to improve the accuracy and performance of the model. For example, it ensures that the model
can accept a variety of input data and that all the variables are on the same scale, which makes the
model easier to understand [5]. It also keeps all the features within an acceptable range to avoid
computational errors.
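One common way to put variables on the same scale is min-max scaling, sketched below. It is chosen here as an assumption; the text does not say which transformation is intended, and the feature values are hypothetical.

```python
def min_max_scale(values):
    """Rescale values linearly to the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:  # constant column: avoid division by zero
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

# Hypothetical features measured on very different scales.
age = [20, 30, 40, 60]
income = [20000, 50000, 80000, 120000]

age_scaled = min_max_scale(age)
income_scaled = min_max_scale(income)
```

After scaling, both features lie between 0 and 1, so neither dominates the other simply because of its units.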
iii. Feature Extraction: Feature extraction is an automated feature engineering process that generates new
variables by extracting them from the raw data. The main aim of this step is to reduce the volume of
data so that it can be easily used and managed for data modelling. Feature extraction methods
include cluster analysis, text analytics, edge detection algorithms, and principal components analysis
(PCA) [6] [7].
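Of the extraction methods listed, principal components analysis is straightforward to sketch with NumPy. The version below computes PCA through a singular value decomposition of the centred data. The function name `pca` and the toy matrix are assumptions made for illustration.

```python
import numpy as np

def pca(X, n_components):
    """Project the rows of X onto their top principal components via SVD."""
    X_centered = X - X.mean(axis=0)
    # Rows of Vt are the principal directions, ordered by decreasing variance.
    _, singular_values, Vt = np.linalg.svd(X_centered, full_matrices=False)
    explained = singular_values**2 / np.sum(singular_values**2)
    return X_centered @ Vt[:n_components].T, explained[:n_components]

# Hypothetical data: the second column is exactly twice the first,
# so a single component captures all of the variance.
X = np.array([[1.0, 2.0], [2.0, 4.0], [3.0, 6.0], [4.0, 8.0]])
reduced, explained = pca(X, n_components=1)
```

Here two columns are reduced to one without losing information, which is the volume reduction that feature extraction aims for.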
iv. Feature Selection: While developing a machine learning model, only a few variables in the dataset
are useful for building the model, and the remaining features are either redundant or irrelevant.
Training on a dataset that contains all these redundant and irrelevant features may reduce the overall
performance and accuracy of the model. Hence it is very important to identify and select the most
appropriate features from the data and remove the irrelevant or less important features, which is done
with the help of feature selection in machine learning. "Feature selection is a way of selecting the subset
of the most relevant features from the original features set by removing the redundant, irrelevant, or
noisy features" [12] [6] [8].
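A simple filter-style approach to feature selection keeps only the features whose correlation with the target is strong enough. The sketch below is one such approach under stated assumptions: the 0.5 threshold, the function names, and the dataset are all hypothetical, and many other selection methods exist.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient; returns 0.0 if either input is constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy) if sx and sy else 0.0

def select_features(features, target, threshold=0.5):
    """Keep the features whose |correlation| with the target meets the threshold."""
    return [
        name
        for name, values in features.items()
        if abs(pearson(values, target)) >= threshold
    ]

# Hypothetical dataset: one informative, one constant, and one irrelevant feature.
features = {
    "hours_studied": [1, 2, 3, 4, 5],
    "constant": [7, 7, 7, 7, 7],
    "shoe_size": [9, 6, 10, 6, 9],
}
target = [52, 54, 56, 58, 60]

selected = select_features(features, target)
```

Only `hours_studied` survives: the constant column carries no information, and `shoe_size` is uncorrelated with the target in this toy data.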
5.5. Economical
This cloud feature helps reduce the IT expenditure of organizations. In cloud computing, clients pay
the provider only for the space they use. There are no hidden or additional charges to be paid. The
service is economical, and more often than not, some space is allocated for free.
5.7. Security
Data security is one of the best features of cloud computing. Cloud services make a copy of the stored data to
prevent any kind of data loss. If one server loses data for any reason, the copied version is restored from
another server. This feature comes in handy when multiple users are working on a particular file in real time
and the file suddenly gets corrupted [8].
5.8. Automation
Automation is an essential feature of cloud computing. The ability of cloud computing to automatically install,
configure and maintain a cloud service is known as automation in cloud computing. In simple words, it is the
process of making the most of the technology and minimizing the manual effort. However, achieving
automation in a cloud ecosystem is not that easy. This requires the installation and deployment of virtual
machines, servers, and large storage. On successful deployment, these resources also require constant
maintenance [6].
5.9. Resilience
Resilience in cloud computing means the ability of a service to quickly recover from any disruption. The
resilience of a cloud is measured by how fast its servers, databases and network systems restart and recover
from any loss or damage. Availability is another key feature of cloud computing. Since cloud services can be
accessed remotely, there are no geographic restrictions or limits on the use of cloud resources [5].
6. Conclusion
In this paper we have presented the use of Python as a tool in research areas such as data wrangling,
machine learning, and cloud computing. Other languages, such as Java and C++, are also used for data
wrangling, for machine learning in the cloud, and for developing IoT devices. At present, however, most
developers prefer the Python scripting language to Java and C++ because of its easy syntax, secure coding,
and simplicity. Developers also choose Python when robustness and performance matter. As for future work,
there is still ample room for this language to serve other emerging research areas because of features such
as its simplicity, extensive library, built-in functionality, and extensibility. In future we will
propose Python as a powerful tool that is used by many research communities.
Acknowledgement
We would like to thank the Department of Computer Science & Engineering, Shobhit University Gangoh, for
its encouragement and support in writing this paper.
References
1. Cline Don, Yueh Simon, Chapman Bruce, Stankov Boba, Al Gasiewski, Masters Dallas, Elder
Kelly, Richard Kelly, Painter Thomas H., Miller Steve, Katzberg Steve, Mahrt Larry (2009), NASA
Cold Land Processes Experiment (CLPX 2002/03): Airborne Remote Sensing.
2. A. Fatima, N. Nazir, and M. G. Khan, “Data Cleaning In Data Warehouse: A Survey of Data
Pre-processing Techniques and Tools,” Int. J. Inf. Technol. Comput. Sci., vol. 9, no. 3, pp. 50–61, 2017.
3. Kandel Sean, Paepcke Andreas, Hellerstein Joseph and Heer Jeffrey (2011), Wrangler: Interactive
Visual Specification of Data Transformation Scripts, ACM Human Factors in Computing Systems
(CHI), ACM 978-1-4503-0267-8/11/05.
4. Chaudhuri S. and Dayal U. (1997), An overview of data warehousing and OLAP technology. In
SIGMOD Record.
5. (2001) “Potter's Wheel: An Interactive Data Cleaning System”, Proceedings of the 27th VLDB
Conference.
6. Ahuja S., Roth M., Gangadharaiah R., Schwarz P. and Bastidas R. (2016), “Using Machine Learning to
Accelerate Data Wrangling”, IEEE 16th International Conference on Data Mining Workshops
(ICDMW), 2016, Barcelona, Spain, pp. 343-349. doi:10.1109/ICDMW.2016.0055.
7. Data wrangling platform (2017) publication, www.trifacta.com. [Online] Available:
https://www.trifacta.com/products/architecture//. [Accessed on: 01 May 2017].
8. Norman D.A. (2013), “The Design of Everyday Things”, Basic Books. [Accessed on:
12 April 2017].
9. Jordan, M.I., Mitchell, T.M.: Machine learning: trends, perspectives, and prospects. Science
349(6245), 255–260 (2015)
10. LeCun, Y., Bengio, Y., Hinton, G.: Deep learning. Nature 521(7553), 436–444 (2015)
11. Holzinger, A., Dehmer, M., Jurisica, I.: Knowledge discovery and interactive data mining in
bioinformatics - state-of-the-art, future challenges and research directions. BMC Bioinform. 15(S6),
I1 (2014)
12. Wolfram, S.: Mathematica: A System for Doing Mathematics by Computer. Addison Wesley
Longman Publishing Co., Inc., Boston (1991)
13. Engblom, S., Lukarski, D.: Fast MATLAB compatible sparse assembly on multicore computers.
Parallel Comput. 56, 1–17 (2016)
https://www.researchgate.net/publication/330513589_Internet_of_Things_IOT