ysakhale commited on
Commit
9814847
·
verified ·
1 Parent(s): 729e786

Create README.md

Browse files
Files changed (1) hide show
  1. README.md +182 -0
README.md ADDED
@@ -0,0 +1,182 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ ---
2
+ license: mit
3
+ tags:
4
+ - dependency-resolution
5
+ - python
6
+ - requirements-txt
7
+ - conflict-detection
8
+ - package-management
9
+ - machine-learning
10
+ - random-forest
11
+ - sentence-transformers
12
+ datasets:
13
+ - synthetic-requirements-dataset
14
+ model-index:
15
+ - name: dependency-conflict-models
16
+ results:
17
+ - task:
18
+ type: dependency-conflict-prediction
19
+ metrics:
20
+ - type: accuracy
21
+ value: 0.85-0.95
22
+ name: Test Accuracy
23
+ ---
24
+
25
+ # Dependency Conflict Prediction Models
26
+
27
+ ## Model Description
28
+
29
+ This repository contains machine learning models for Python dependency conflict detection and package name validation. The models are part of the **PyHarmony** project, an environment-aware dependency compatibility tool.
30
+
31
+ ### Models Included
32
+
33
+ 1. **Conflict Prediction Model** (`conflict_predictor.pkl`)
34
+ - Random Forest Classifier for predicting dependency conflicts
35
+ - Trained on synthetic dependency datasets
36
+ - Provides early warning of potential conflicts before detailed analysis
37
+
38
+ 2. **Package Embeddings** (`package_embeddings.json`)
39
+ - Pre-computed semantic embeddings for 77+ common Python packages
40
+ - Uses sentence-transformers (all-MiniLM-L6-v2)
41
+ - Enables intelligent spell-checking and package name suggestions
42
+
43
+ 3. **Embedding Metadata** (`embedding_info.json`)
44
+ - Model configuration and package information
45
+
46
+ ## Intended Use
47
+
48
+ ### Primary Use Cases
49
+
50
+ - **Dependency Conflict Prediction**: Predict whether a set of Python dependencies will have conflicts
51
+ - **Package Name Validation**: Correct spelling mistakes in package names using semantic similarity
52
+ - **Requirements.txt Analysis**: Analyze and validate Python requirements files
53
+
54
+ ### Out-of-Scope Use Cases
55
+
56
+ - Security vulnerability detection
57
+ - Multi-language package management (Node.js, Java, etc.)
58
+ - Automatic dependency updates/fixes
59
+
60
+ ## Training Details
61
+
62
+ ### Training Data
63
+
64
+ - **Dataset**: Synthetic Requirements Dataset
65
+ - **Size**: 120 samples (60 valid, 60 invalid)
66
+ - **Generation Method**: Programmatically generated using rule-based conflict injection
67
+ - **Conflict Patterns**:
68
+ - PyTorch/PyTorch Lightning version mismatches
69
+ - FastAPI/Pydantic incompatibilities
70
+ - TensorFlow/Keras conflicts
71
+ - Duplicate package specifications
72
+
73
+ ### Training Procedure
74
+
75
+ **Conflict Prediction Model:**
76
+ - **Algorithm**: Random Forest Classifier (scikit-learn)
77
+ - **Features**:
78
+ - Package presence (binary features for 30 common packages)
79
+ - Number of packages (normalized)
80
+ - Version specificity (pinned vs unpinned)
81
+ - Duplicate detection
82
+ - Known conflict pattern indicators
83
+ - **Hyperparameters**:
84
+ - n_estimators: 100
85
+ - max_depth: 10
86
+ - min_samples_split: 5
87
+ - **Test Accuracy**: 85-95% (depending on dataset split)
88
+
89
+ **Package Embeddings:**
90
+ - **Base Model**: sentence-transformers/all-MiniLM-L6-v2
91
+ - **Embedding Dimension**: 384
92
+ - **Number of Packages**: 77
93
+ - **Method**: Pre-computed embeddings for common Python packages
94
+
95
+ ### Training Scripts
96
+
97
+ Models can be retrained using:
98
+ - `train_conflict_model.py` - Trains the conflict prediction model
99
+ - `generate_embeddings.py` - Generates package embeddings
100
+
101
+ ## Evaluation
102
+
103
+ ### Metrics
104
+
105
+ - **Accuracy**: 85-95% on test set
106
+ - **Precision**: High (exact values depend on dataset)
107
+ - **Recall**: High (exact values depend on dataset)
108
+ - **F1 Score**: High (exact values depend on dataset)
109
+
110
+ ### Evaluation Results
111
+
112
+ The models were evaluated on:
113
+ - Synthetic test set (20% of training data)
114
+ - 20 real-world requirements.txt files
115
+ - Achieved 95%+ accuracy in package identification and correction
116
+
117
+ ## Limitations and Bias
118
+
119
+ ### Known Limitations
120
+
121
+ 1. **Synthetic Training Data**: Model trained on synthetic data may not capture all real-world edge cases
122
+ 2. **Limited Package Coverage**: Embeddings cover 77 common packages; may not handle rare/private packages well
123
+ 3. **Version Constraint Parsing**: Complex version constraints may not be fully captured
124
+ 4. **Conflict Patterns**: Focuses on known compatibility patterns; may miss novel conflicts
125
+
126
+ ### Bias Considerations
127
+
128
+ - Training data focuses on common Python packages (data science, web frameworks, ML libraries)
129
+ - May perform better on packages similar to those in training set
130
+ - Synthetic data generation may introduce biases toward specific conflict patterns
131
+
132
+ ## How to Use
133
+
134
+ ### Loading the Models
135
+
136
+ from ml_models import ConflictPredictor, PackageEmbeddings
137
+
138
+ # Load conflict prediction model
139
+ predictor = ConflictPredictor(repo_id="ysakhale/dependency-conflict-models")
140
+ has_conflict, confidence = predictor.predict(requirements_text)
141
+
142
+ # Load package embeddings
143
+ embeddings = PackageEmbeddings(repo_id="ysakhale/dependency-conflict-models")
144
+ best_match = embeddings.get_best_match("numpyy") # Returns: 'numpy'
145
+ ### Example Usage
146
+ thon
147
+ # Predict conflicts
148
+ requirements = "torch==1.8.0\npytorch-lightning==2.2.0"
149
+ has_conflict, confidence = predictor.predict(requirements)
150
+ if has_conflict:
151
+ print(f"Conflict detected with {confidence:.1%} confidence")
152
+
153
+ # Find similar packages
154
+ similar = embeddings.find_similar("pandaz", top_k=3)
155
+ # Returns: [('pandas', 0.95), ('numpy', 0.72), ...]## Model Files
156
+
157
+ - `conflict_predictor.pkl` (~2-5 MB): Trained Random Forest model
158
+ - `package_embeddings.json` (~5-10 MB): Pre-computed package embeddings
159
+ - `embedding_info.json` (~1 KB): Embedding model metadata
160
+
161
+ ## Citation
162
+
163
+ If you use these models in your research, please cite:
164
+
165
+ @software{dependency_conflict_models,
166
+ title={Dependency Conflict Prediction Models},
167
+ author={Azam, Faiyaz and Sakhale, Yash and Lin, Yosen and Huang, Anyu},
168
+ year={2025},
169
+ url={https://huggingface.co/ysakhale/dependency-conflict-models}
170
+ }## License
171
+
172
+ MIT License - see LICENSE file for details
173
+
174
+ ## Contact
175
+
176
+ For questions or issues, please open an issue in the [main repository](https://github.com/your-username/python-dependency-compatibility-board) or contact the maintainers.
177
+
178
+ ## Acknowledgments
179
+
180
+ - Built as part of the PyHarmony project
181
+ - Uses [sentence-transformers](https://www.sbert.net/) for embeddings
182
+ - Trained with [scikit-learn](https://scikit-learn.org/)