Kazadi Joel 9213934 DLMDSPWP01
WRITTEN ASSIGNMENT
Joël Kazadi
Matriculation Number: 9213934
This work aims to develop a Python program for regression analysis. The main objective is to use a training dataset
and a test dataset to select the best ideal functions from a set of fifty provided candidates. These ideal functions
are given as x-y pairs of values and represent candidate theoretical models for the data.
The process of selecting ideal functions is based on the least-squares approach, where the goal is to minimize the
sum of squared deviations between the actual y-values and the values predicted by the ideal functions. To achieve
this, the program will use the training data to perform a linear regression and calculate the deviations for each
function. The four ideal functions that minimize these deviations will be chosen as the most suitable to represent the
training data.
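The selection step described above can be sketched as follows. This is a minimal illustration and not the program from the Appendix: it fits a straight line to each candidate column with the closed-form least-squares formulas and keeps the four columns with the smallest sum of squared deviations. The helper names (fit_line, sum_squared_deviations, select_best) and the toy data are assumptions made for the example.

```python
def fit_line(x, y):
    """Closed-form least-squares fit of y = slope * x + intercept."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def sum_squared_deviations(x, y):
    """Sum of squared residuals between y and its fitted line."""
    slope, intercept = fit_line(x, y)
    return sum((yi - (slope * xi + intercept)) ** 2 for xi, yi in zip(x, y))


def select_best(x, candidates, k=4):
    """Return the names of the k candidates with the smallest residual sum."""
    scores = {name: sum_squared_deviations(x, y) for name, y in candidates.items()}
    return sorted(scores, key=scores.get)[:k]


# Toy data (hypothetical): five candidate columns over the same x-values.
x = [0.0, 1.0, 2.0, 3.0, 4.0]
candidates = {
    "y1": [1.0, 3.0, 5.0, 7.0, 9.0],    # exactly linear
    "y2": [0.0, 1.1, 1.9, 3.2, 3.9],    # nearly linear
    "y3": [0.0, 1.0, 4.0, 9.0, 16.0],   # quadratic
    "y4": [2.0, 2.0, 2.1, 2.0, 2.0],    # nearly constant
    "y5": [0.0, 1.0, 8.0, 27.0, 64.0],  # cubic
}
best = select_best(x, candidates, k=4)
```

In this toy case the cubic column has by far the largest residual sum and is the one left out; with the real data the same ranking would run over all fifty candidates.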
Once the ideal functions have been selected, the program will use the test data to determine whether each x-y-pair
of values can be assigned to one of the chosen ideal functions. The match will be established by comparing the
maximum deviation of the regression with the largest deviation between the training data and the corresponding
ideal function. A strict criterion will be applied, where the maximum deviation cannot exceed the largest deviation
multiplied by a factor of sqrt(2). For comparison purposes, we will define another, more relaxed criterion to examine
the sensitivity of the assignments.
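One possible reading of the strict criterion is sketched below, under the assumption that a test point may be assigned to a function only if its absolute deviation from that function does not exceed sqrt(2) times the largest deviation observed on the training data. The function name assign_strict, its arguments and the sample values are hypothetical and serve only to illustrate the rule.

```python
import math


def assign_strict(y_test, predictions, max_train_deviation):
    """Assign one test point to the closest admissible function.

    predictions: dict mapping function name -> predicted y at the test x.
    max_train_deviation: dict mapping function name -> largest absolute
        deviation between the training data and that function.
    Returns (name, deviation), or (None, inf) if no function is admissible.
    """
    best_name, best_deviation = None, math.inf
    for name, y_pred in predictions.items():
        deviation = abs(y_test - y_pred)
        threshold = math.sqrt(2) * max_train_deviation[name]
        # Keep the function only if it passes the sqrt(2) test and is closest.
        if deviation <= threshold and deviation < best_deviation:
            best_name, best_deviation = name, deviation
    return best_name, best_deviation


# Hypothetical values for a single x: two candidate functions.
predictions = {"y11": 2.0, "y14": 5.0}
max_train_deviation = {"y11": 0.5, "y14": 0.5}
close_point = assign_strict(2.3, predictions, max_train_deviation)
far_point = assign_strict(3.5, predictions, max_train_deviation)
```

Here the first point lies within the threshold of y11 and is assigned to it, while the second point is too far from both functions and stays unassigned.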
To facilitate data manipulation, the program will create an SQLite database using the SQLAlchemy library. The
training data will be loaded into a five-column table, where the first column represents the x-values and the remaining
columns represent the values of the four selected ideal functions. This tabular representation of the data will enable
a clear and comprehensible visualization of the selected deviation.
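The five-column layout can be illustrated with a small sketch. To stay dependency-free, it uses the standard-library sqlite3 module directly rather than SQLAlchemy, so it shows only the table structure and not the program's actual database code. The table name training, the column names and the sample rows are assumptions.

```python
import sqlite3

# Hypothetical rows: x followed by four function values.
rows = [
    (0.0, 1.0, 0.0, 2.0, 0.5),
    (1.0, 3.0, 1.1, 2.0, 0.7),
    (2.0, 5.0, 1.9, 2.1, 0.9),
]

conn = sqlite3.connect(":memory:")  # in-memory database, nothing written to disk
conn.execute("CREATE TABLE training (x REAL, y1 REAL, y2 REAL, y3 REAL, y4 REAL)")
conn.executemany("INSERT INTO training VALUES (?, ?, ?, ?, ?)", rows)
conn.commit()

# Read the table back and list its column names.
stored = conn.execute("SELECT x, y1, y2, y3, y4 FROM training ORDER BY x").fetchall()
column_names = [info[1] for info in conn.execute("PRAGMA table_info(training)")]
conn.close()
```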
Regarding program design, we will adopt an object-oriented approach, using inheritance to organize the various
classes. Exception handling will also be implemented, covering both standard and user-defined exceptions. To
ensure the program’s robustness, unit tests will be developed for each essential component. For data manipulation,
we will use the Pandas package, which offers many features for working with DataFrames, and Bokeh will be used
for interactive data visualization, enabling deeper exploration of discrepancies.
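The combination of inheritance and a user-defined exception can be sketched as follows. The class names (DataLoadError, CsvHandler, TrainCsvHandler) and the file name are hypothetical and deliberately differ from those in the Appendix; the sketch reads CSV files with the standard library instead of Pandas so that it runs without extra packages.

```python
import csv


class DataLoadError(Exception):
    """User-defined exception raised when a data file cannot be read."""


class CsvHandler:
    """Base class: stores the path and loads rows from a CSV file."""

    def __init__(self, file_path):
        self.file_path = file_path
        self.rows = None

    def load_data(self):
        try:
            with open(self.file_path, newline="") as handle:
                self.rows = list(csv.DictReader(handle))
        except OSError as exc:
            # Wrap the standard exception in the user-defined one.
            raise DataLoadError(f"Failed to load {self.file_path}: {exc}") from exc
        return self.rows


class TrainCsvHandler(CsvHandler):
    """Subclass: inherits the loading logic and adds a training-specific check."""

    def load_data(self):
        rows = super().load_data()
        if rows and "x" not in rows[0]:
            raise DataLoadError("Training data needs an 'x' column")
        return rows


# Loading a file that does not exist triggers the user-defined exception.
caught = None
try:
    TrainCsvHandler("no_such_training_file_9213934.csv").load_data()
except DataLoadError as exc:
    caught = exc
```

Chaining with "from exc" keeps the original standard exception attached to the user-defined one, which helps when diagnosing why a file could not be loaded.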
Finally, the code will be carefully documented using “docstrings” to provide comprehensive and understandable
documentation. In addition, we will provide detailed instructions on the Git commands needed to clone the project
into a Git version control system, ensuring efficient source code management.
2 Analysis
The training dataset contains 400 data points, and the test dataset 100 data points. We fitted a linear regression
model to the training dataset for the 50 ideal functions provided (y1, y2, ..., y50). We then selected the 4 functions
that minimize the sum of squares of the residuals. The results of the selection are shown in Figure 1. The optimal
functions are: y11, y14, y15 and y12.
Fig. 1: Top-5 ideal functions
The four ideal functions correspond to distinct regression lines. By representing these lines in the test data space, it
is possible to assign each pair of x-y values to one of the ideal functions, according to the proximity of the point to the
line. This assignment is based on two criteria. The first criterion is strict. It establishes the correspondence between
the data point and the ideal function by considering that the deviation of the fitted regression line cannot exceed
the deviation between the training dataset and the ideal function by more than a factor of √2. The second criterion
relaxes this constraint, focusing solely on the proximity of the data point to the regression line for assignment to one
of the ideal functions.
Fig. 2: Assignment based on the strict criterion
Figure 2 reveals that, based on strict assignment, almost 80% of the test data were labeled “unassigned”.
Furthermore, no data points were assigned to the ideal function y14. In view of these unexpected results, we reformulated the
assignment criterion, making it dependent only on the proximity aspect, i.e. the mapping is performed by minimum
deviation. The results of this approach are presented in Figure 3, and show that all data points are assigned to one of
the four ideal functions, i.e. the “unassigned” class no longer exists. Furthermore, each ideal function contains at
least one assignment, i.e. there are no empty classes. It’s important to note that the relaxation of the assignment
criterion is only performed for illustration and comparison purposes.
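The relaxed criterion amounts to picking, for each test point, the function with the smallest absolute deviation. The sketch below illustrates this and tallies the resulting classes; the function name assign_relaxed and the sample points are hypothetical.

```python
from collections import Counter


def assign_relaxed(y_test, predictions):
    """Assign a test point to the function with the smallest absolute deviation."""
    return min(predictions, key=lambda name: abs(y_test - predictions[name]))


# Hypothetical test points: (observed y, predicted y per function at the same x).
test_points = [
    (2.3, {"y11": 2.0, "y14": 5.0}),
    (3.5, {"y11": 2.0, "y14": 5.1}),
    (4.9, {"y11": 2.0, "y14": 5.0}),
]
labels = [assign_relaxed(y, preds) for y, preds in test_points]
counts = Counter(labels)
```

Because a minimum always exists, every point receives a label, which is why the “unassigned” class disappears under this rule.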
3 Conclusion
In this project, we implemented a Python program to perform linear regression and carry out a selection of ideal
functions. First, we created a SQLite database to store training data, ideal functions and test results. Then, using
linear regression, we modeled the relationships between the input variable and the 50 output variables one after the
other. Finally, on the basis of the regression results, we selected the four best functions using the least-squares criterion.
To evaluate the performance of our approach, we carried out tests on a separate dataset. Using the selected ideal
functions, we assigned the test data to the corresponding functions, calculating the deviations between predicted and
actual values. Under the strict criterion, almost 80% of the test data remained unassigned, whereas under the
relaxed minimum-deviation criterion every test point was assigned to one of the four ideal functions.
Finally, the project was cloned into the version control system Git, ensuring traceability and efficient management
of changes made to the source code. The commands needed to clone the project are available (see Appendix),
making it easy to access and use the source code.
A Appendix
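# 0. IMPORTING LIBRARIES
# Imports required by the excerpts below
import unittest
from collections import Counter

import numpy as np
import pandas as pd
from bokeh.models import ColumnDataSource
from sklearn.linear_model import LinearRegression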
# 2. CODE IMPLEMENTATION
class DataHandler:
    """
    Base class for data handling.
    """

    def __init__(self, file_path):
        self.file_path = file_path
        self.data = None
class TrainDataHandler(DataHandler):
    """
    Class for handling training data.
    """

    def load_data(self):
        """
        Load training data from a CSV file.
        """
        try:
            self.data = pd.read_csv(self.file_path)
        except Exception as e:
            raise Exception(f"Failed to load training data: {str(e)}")
def linear_regression(x, y):
    """
    Perform linear regression and
    return the model and the sum of squared deviations.
    """
    # Reshape x if it is a 1D array
    if len(x.shape) == 1:
        x = x.reshape(-1, 1)
    model = LinearRegression()
    model.fit(x, y)
    y_pred = model.predict(x)
    deviations_squared = (y - y_pred) ** 2
    sum_deviations_squared = np.sum(deviations_squared)
    return model, sum_deviations_squared
class IdealFunctionSelector:
    """
    Class for selecting ideal functions.
    """

    def __init__(self, train_df):
        self.train_df = train_df
class TestMapping:
    """
    Class for mapping test data to ideal functions.
    """

    def __init__(self, test_df, ideal_functions):
        self.test_df = test_df
        self.ideal_functions = ideal_functions
# Check if this deviation is the smallest found
if deviation < min_deviation:
    min_deviation = deviation
    best_function = function_name
# Plot each ideal function with colors consistent with the color mapping
x = train_df['x'].values
for model, _, function_name in ideal_functions[:4]:
    # Predict y-values using the model for each x-value
    y_pred = model.predict(x.reshape(-1, 1))
    # Get the color for the function from the color mapping dictionary
    color = color_mapping[function_name]
# Plot test data points with the corresponding colors and legend field
p.circle(test_df['x'], test_df['y'],
         size=10, color=test_df['color'], legend_field='color')
# 3. UNIT TEST
def setUp(self):
    # Set up the data for testing
    self.train_data = pd.DataFrame({
        'x': [1, 2, 3, 4, 5],
        'y1': [1, 2, 3, 4, 5],
        'y2': [2, 4, 6, 8, 10],
        'y3': [1, 4, 9, 16, 25],
        'y4': [1, 1, 1, 1, 1]
    })
    self.test_data = pd.DataFrame({
        'x': [2, 4, 6],
        'y': [2, 4, 6]
    })
def test_map_test_data(self):
    # Test mapping test data to ideal functions
    mappings, deviations = self.test_mapping.map_test_data()
    # Return mappings and deviations matching test data length
    self.assertEqual(len(mappings), len(self.test_data))
    self.assertEqual(len(deviations), len(self.test_data))

if __name__ == '__main__':
    unittest.main()
# 4. DISPLAYING OUTPUTS
# Visualize results
visualize_results(train_df, ideal_functions, test_df, mappings)
visualize_results(train_df, ideal_functions, test_df, rel_mappings)
# Sum of Squares of Residuals for Ideal Functions
# Extracting function names and sum-of-squares values from the data
function_names = [entry[2] for entry in ideal_functions]
sum_of_squares = [entry[1] for entry in ideal_functions]
# Creating a ColumnDataSource for Bokeh
source1 = ColumnDataSource(data=dict(function_names=function_names,
                                     sum_of_squares=sum_of_squares))
# Count the occurrences of each value in the list
count_dict = Counter(data_list)
# Save training data, ideal functions, and test results to the database
test_results = mappings, deviations
save_to_database(session, train_df, ideal_functions, test_results)
# 5. GIT COMMANDS (To run on Anaconda PowerShell)